
DSME6650
Review Exercise Set 1 Suggested Outline

Please note that
• The following only provides an outline of possible answers. You need to add proper
explanations or examples whenever appropriate.
• There may be more than one possible answer for some of the questions.

1. Consider the following data set

Customer ID Transaction ID Items Purchased
1 0003 {a, d, e}
1 0012 {a, b, c, e}
2 0017 {b, c, e}
2 0026 {b, d, e}
3 0021 {c, d}
3 0029 {a, b, c}
4 0030 {a, b, d, e}
4 0041 {a, c, d, e}
5 0035 {a, d, e}
5 0039 {a, b, e}

(a) Compute the support for the itemsets {e}, {b, d}, and {b, d, e} with respect to the
transaction ids.
(b) By using the results of (a), compute the confidence for the association rules {b, d} →
{e} and {e} → {b, d}.
(c) Repeat (a) and (b) with respect to customer ids.
(d) Is there any relationship between the support and confidence with respect to the
transaction ids and those with respect to the customer ids?
Suggested Outline: In the following supp and conf denote support and confidence respectively.
(a) With respect to transaction ids, there are 10 transactions in total, so
supp({e}) = 8/10 = 0.8, supp({b, d}) = 2/10 = 0.2, and supp({b, d, e}) = 2/10 = 0.2.
(b) Note that conf(X → Y) = supp(X ∪ Y)/supp(X), so conf({b, d} → {e}) = 0.2/0.2 = 1 and
conf({e} → {b, d}) = 0.2/0.8 = 0.25.
(c) With respect to customer ids, there are 5 baskets in total (one per customer), and an item is
regarded as present in a customer's basket if it appears in at least one of that customer's
transactions; so supp({e}) = 4/5 = 0.8, supp({b, d}) = 5/5 = 1, and supp({b, d, e}) = 4/5 = 0.8.
Consequently, conf({b, d} → {e}) = 0.8/1 = 0.8 and conf({e} → {b, d}) = 0.8/0.8 = 1.
(d) Comparing the two sets of results, the confidence values can move in either direction
(conf({b, d} → {e}) drops from 1 to 0.8 while conf({e} → {b, d}) rises from 0.25 to 1), so there is
no general relationship between the values computed with respect to transaction ids and those
computed with respect to customer ids.
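
The following Python sketch (not part of the original outline) reproduces these numbers; the transaction data are copied from the table above, and the helper names support and confidence are illustrative only.

```python
# Transactions from the table above, keyed by (customer id, transaction id).
transactions = {
    (1, "0003"): {"a", "d", "e"}, (1, "0012"): {"a", "b", "c", "e"},
    (2, "0017"): {"b", "c", "e"}, (2, "0026"): {"b", "d", "e"},
    (3, "0021"): {"c", "d"},      (3, "0029"): {"a", "b", "c"},
    (4, "0030"): {"a", "b", "d", "e"}, (4, "0041"): {"a", "c", "d", "e"},
    (5, "0035"): {"a", "d", "e"}, (5, "0039"): {"a", "b", "e"},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return (support(set(antecedent) | set(consequent), baskets)
            / support(antecedent, baskets))

# (a), (b): each transaction id is one basket.
tx_baskets = list(transactions.values())
print(support({"e"}, tx_baskets),                   # 0.8
      support({"b", "d"}, tx_baskets),              # 0.2
      confidence({"b", "d"}, {"e"}, tx_baskets),    # 1.0
      confidence({"e"}, {"b", "d"}, tx_baskets))    # 0.25

# (c): each customer id is one basket (union of that customer's transactions).
by_customer = {}
for (customer, _), items in transactions.items():
    by_customer.setdefault(customer, set()).update(items)
cust_baskets = list(by_customer.values())
print(support({"e"}, cust_baskets),                 # 0.8
      support({"b", "d"}, cust_baskets),            # 1.0
      confidence({"b", "d"}, {"e"}, cust_baskets),  # 0.8
      confidence({"e"}, {"b", "d"}, cust_baskets))  # 1.0
```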
2. (a) Compute the confidence for the association rules ∅ → X and X → ∅.
(b) Let C1, C2, and C3 be the confidence values of the rules {a} → {b}, {a} → {b, c}, and
{a, c} → {b} respectively. What are the possible relationships among these confidence
values?

(c) Suppose the confidences of the rules X → Y and Y → Z are both larger than some
threshold, say, a. Is it possible for the confidence of X → Z to have a value less than a?
Suggested Outline: In the following supp and conf denote support and confidence
respectively. (a) For the rule ∅ → X, since ∅ is a subset of every itemset, supp(∅) = 1 and so
conf(∅ → X) = supp(∅ ∪ X)/supp(∅) = supp(X).
For the rule X → ∅,
conf(X → ∅) = (No. of transactions containing X ∪ ∅) / (No. of transactions containing X)
= 1, or 100%.
(b) Note that if X and Y are itemsets such that X ⊆ Y, then any transaction containing Y
must also contain X. In other words, the number of transactions containing X ≥ the number
of transactions containing Y. Now,
C1 = (No. of transactions containing {a, b}) / (No. of transactions containing {a})
C2 = (No. of transactions containing {a, b, c}) / (No. of transactions containing {a})
Since {a, b} ⊆ {a, b, c}, the above observation gives C1 ≥ C2. Similarly,
C3 = (No. of transactions containing {a, b, c}) / (No. of transactions containing {a, c})
and since {a} ⊆ {a, c}, the denominator of C3 is no larger than that of C2, so C2 ≤ C3.
There is no general relationship between C1 and C3: depending on the data, either one can
be the larger.
(c) Yes, it is possible; confidence is not transitive. For example, with the three transactions
{x, y}, {y, z}, and {x}, we have conf({x} → {y}) = conf({y} → {z}) = 1/2, yet
conf({x} → {z}) = 0, which is below the threshold a = 1/2.
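
A quick numerical check of these claims (a sketch, not part of the original outline); the helper conf below recomputes rule confidence directly from a list of baskets, using the transaction-level data of Question 1.

```python
def conf(antecedent, consequent, baskets):
    """conf(X -> Y) = (#baskets containing X u Y) / (#baskets containing X)."""
    x, xy = set(antecedent), set(antecedent) | set(consequent)
    return sum(xy <= b for b in baskets) / sum(x <= b for b in baskets)

# (b) On the transaction data from Question 1, C1 >= C2 and C2 <= C3.
baskets = [{"a", "d", "e"}, {"a", "b", "c", "e"}, {"b", "c", "e"}, {"b", "d", "e"},
           {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d", "e"}, {"a", "c", "d", "e"},
           {"a", "d", "e"}, {"a", "b", "e"}]
c1 = conf({"a"}, {"b"}, baskets)
c2 = conf({"a"}, {"b", "c"}, baskets)
c3 = conf({"a", "c"}, {"b"}, baskets)
print(c1, c2, c3)
assert c1 >= c2 and c2 <= c3   # holds here, as argued above

# (c) Counterexample: conf(X -> Y) and conf(Y -> Z) can both reach a threshold
# while conf(X -> Z) falls below it.
counter = [{"x", "y"}, {"y", "z"}, {"x"}]
print(conf({"x"}, {"y"}, counter),   # 0.5
      conf({"y"}, {"z"}, counter),   # 0.5
      conf({"x"}, {"z"}, counter))   # 0.0
```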
3. What is the total number of possible association rules that can be formulated from a
data set with n items (including rules with zero support)?
Suggested Outline: In the following we ignore rules in which the antecedent or the
consequent is empty. Also note that the same item cannot appear in both the antecedent and
the consequent. Therefore,
• Given any integer 1 ≤ k ≤ n − 1,
o There are C(n, k) ways to choose k items to form the antecedent (where C(n, k)
denotes the binomial coefficient, the number of ways to choose k items out of n).
o For each of these choices, we can form the consequent with t items, where t =
1, …, n − k. For each t, there are C(n − k, t) ways to choose the t items (since no
item can appear in both the antecedent and the consequent, the t items must be
chosen from the remaining n − k items).
o Thus, for each choice of k items to form the antecedent, there are
Σ_{t=1}^{n−k} C(n − k, t)
ways to form the consequent.
o So, by the multiplication rule, for each k there are
C(n, k) Σ_{t=1}^{n−k} C(n − k, t)
possible association rules.
• Hence, altogether, there are
Σ_{k=1}^{n−1} C(n, k) Σ_{t=1}^{n−k} C(n − k, t)
possible association rules.

Next note that, by the binomial theorem,
(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^{n−k}.
Putting x = y = 1 and dropping the k = 0 and k = n terms, we have
Σ_{k=1}^{n−1} C(n, k) = 2^n − 2,
and putting x = 1 and y = 2, we have
Σ_{k=1}^{n−1} C(n, k) 2^{n−k} = 3^n − 2^n − 1.

Using these and the fact that Σ_{t=1}^{n−k} C(n − k, t) = 2^{n−k} − 1, we have
Σ_{k=1}^{n−1} C(n, k) Σ_{t=1}^{n−k} C(n − k, t) = Σ_{k=1}^{n−1} C(n, k) (2^{n−k} − 1)
= Σ_{k=1}^{n−1} C(n, k) 2^{n−k} − Σ_{k=1}^{n−1} C(n, k)
= (3^n − 2^n − 1) − (2^n − 2)
= 3^n − 2^{n+1} + 1

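As a sanity check (not part of the original outline), a short brute-force enumeration in Python confirms the closed form 3^n − 2^(n+1) + 1 for small n.

```python
from itertools import combinations

def count_rules(n):
    """Brute-force count of rules X -> Y with X and Y non-empty and disjoint."""
    items = range(n)
    total = 0
    for k in range(1, n):                          # antecedent size
        for antecedent in combinations(items, k):
            rest = [i for i in items if i not in antecedent]
            for t in range(1, len(rest) + 1):      # consequent size
                total += sum(1 for _ in combinations(rest, t))
    return total

for n in range(1, 8):
    assert count_rules(n) == 3**n - 2**(n + 1) + 1
    print(n, count_rules(n))   # e.g. n = 6 gives 602 rules
```
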
4. Consider the following data set for a binary class problem
X Y Class Label
T F 1
T T 1
T T 1
T F 0
T T 1
F F 0
F F 0
F F 0
T T 0
T F 0

Calculate the information gain when splitting on X and Y. Which attribute would the
decision tree algorithm introduced in class choose?
Suggested Outline: In the following, log denotes logarithm to base 2. The overall information
before splitting is
Info_orig = Info[4, 6] = −0.4 log 0.4 − 0.6 log 0.6 = 0.9710
The information gain after splitting on X is computed as follows:
Info_X=T = Info[4, 3] = −(4/7) log (4/7) − (3/7) log (3/7) = 0.9852
Info_X=F = Info[3, 0] = 0
The expected information after splitting on X is
Info_X = Info([4, 3], [3, 0]) = (7/10) Info([4, 3]) + (3/10) Info([3, 0]) = 0.6897
Information gain after splitting on X = Info_orig − Info_X = 0.9710 − 0.6897 = 0.2813
By doing similar computations, the information gain after splitting on Y is 0.2564.
Since the information gain of splitting on X is higher than that of Y, X will be chosen to
split the node.
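
A compact Python sketch (not part of the original outline) that recomputes both gains from the table above; the row encoding and helper names are illustrative only.

```python
import math
from collections import Counter

# Rows of the table above as (X, Y, class label).
data = [("T", "F", 1), ("T", "T", 1), ("T", "T", 1), ("T", "F", 0), ("T", "T", 1),
        ("F", "F", 0), ("F", "F", 0), ("F", "F", 0), ("T", "T", 0), ("T", "F", 0)]

def entropy(labels):
    """Info[...] of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy of the node minus the weighted entropy of the children after splitting on attr."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        child = [r[-1] for r in rows if r[attr] == value]
        gain -= len(child) / len(rows) * entropy(child)
    return gain

print(round(info_gain(data, 0), 4))   # gain for X, approximately 0.2813
print(round(info_gain(data, 1), 4))   # gain for Y, approximately 0.2564
```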
5. Consider the following data set of training examples

A B C No. of Class C1 Examples No. of Class C2 Examples
0 0 0 5 40
0 0 1 0 15
0 1 0 10 5
0 1 1 45 0
1 0 0 10 5
1 0 1 25 0
1 1 0 5 20
1 1 1 0 15

(a) Construct a two-level decision tree by using the greedy approach with the
classification error rate as the criterion for splitting (split on the attribute with the
lowest error rate).
(b) Construct a two-level decision tree by first splitting at A then using the greedy
approach for the remaining split.
(c) Comment on the results.

Suggested Outline: (a) For level 1 splitting:
The error rates for attributes A, B, and C are computed as follows:
For A: The classification count is summarised in the following table
A C1 C2
0 60 60
1 40 40
The class counts are equal for both A = 0 and A = 1, so the predicted class for each branch
can be chosen arbitrarily; either way, the error rate is (60 + 40)/200 = 0.5.
For B: The classification counts are summarized in the following table
B C1 C2
0 40 60
1 60 40
In this case, C2 is the classification for 0 and C1 is the classification for 1. The error rate
is (40 + 40)/200 = 0.4.
For C: The classification counts are summarized in the following table
C C1 C2
0 30 70
1 70 30
In this case, C2 is the classification for 0 and C1 is the classification for 1. The error rate
is (30 + 30)/200 = 0.3.
Since C gives the lowest error rate, it is chosen as the splitting attribute at level 1.

For level 2 splitting: For C = 0, the classification counts for A and B are summarized in the
following tables

A C1 C2
0 15 45
1 15 25

B C1 C2
0 15 45
1 15 25

The error rates in both cases are (15 + 15)/100 = 0.3.
For C = 1, the classification counts for A and B are summarized in the following tables

A C1 C2
0 45 15
1 25 15

B C1 C2
0 25 15
1 45 15

The error rates in both cases are (15 + 15)/100 = 0.3.
In this case, we can choose either A or B as the splitting attribute for C = 0 and for C = 1;
either choice gives the same result. The resulting two-level tree splits on C at the root and
on A (or B) under each branch, but every leaf under C = 0 predicts C2 and every leaf under
C = 1 predicts C1.
The overall error rate is (15 + 15 + 15 + 15)/200 = 0.3.

(b) Split at A. When A = 0, the classification counts for B and C are summarized in the
following tables.

B C1 C2
0 5 55
1 55 5

C C1 C2
0 15 45
1 45 15

The error rates of using B and C are 1/12 and 1/4 respectively. Since attribute B leads to
a smaller error rate, B is chosen.
When A = 1, the classification counts for B and C are summarized in the following tables.

B C1 C2
0 35 5
1 5 35

C C1 C2
0 15 25
1 25 15

The error rates of using B and C are 1/8 and 3/8 respectively. Since attribute B leads to
a smaller error rate, B is chosen. So the resulting two-level tree splits on A at the root and
on B under both branches: for A = 0 it predicts C2 when B = 0 and C1 when B = 1, and for
A = 1 it predicts C1 when B = 0 and C2 when B = 1.
The overall error rate is (10 + 10)/200 = 0.1.
(c) Note that the overall error rate from (a) is much higher than that from (b), so the greedy
approach in this case does not produce an optimal tree. In other words, the greedy
approach does not always produce an optimal result.
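
A minimal Python sketch (not part of the original outline) that reproduces the error rates above from the aggregated counts; the dictionary keys are (A, B, C) value combinations and the names are illustrative only.

```python
# Aggregated training counts from the table above: (A, B, C) -> (#C1, #C2)
counts = {
    (0, 0, 0): (5, 40), (0, 0, 1): (0, 15), (0, 1, 0): (10, 5), (0, 1, 1): (45, 0),
    (1, 0, 0): (10, 5), (1, 0, 1): (25, 0), (1, 1, 0): (5, 20), (1, 1, 1): (0, 15),
}
TOTAL = sum(c1 + c2 for c1, c2 in counts.values())   # 200 training records

def split_error(attr, subset=None):
    """Misclassification count when splitting on attribute `attr` (0 = A, 1 = B, 2 = C)."""
    rows = counts if subset is None else subset
    errors = 0
    for value in (0, 1):
        c1 = sum(v[0] for k, v in rows.items() if k[attr] == value)
        c2 = sum(v[1] for k, v in rows.items() if k[attr] == value)
        errors += min(c1, c2)        # each branch predicts its majority class
    return errors

# (a) Level-1 error rates: 0.5 for A, 0.4 for B, 0.3 for C, so the greedy tree splits on C first.
print([split_error(attr) / TOTAL for attr in range(3)])

# (b) Force the first split on A, then compare B and C within each branch (B wins in both).
for a_value in (0, 1):
    branch = {k: v for k, v in counts.items() if k[0] == a_value}
    print(a_value, [split_error(attr, branch) for attr in (1, 2)])
```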
6. Consider the following data set of training examples

Instance X Y Z Class
1 0 0 1 B
2 1 0 1 A
3 0 1 0 B
4 1 0 0 B
5 1 0 1 A
6 0 0 1 A
7 1 1 0 B
8 0 0 0 B
9 0 1 0 A
10 1 1 1 A

(a) Estimate the conditional probabilities P(X=1|A), P(Y=1|A), P(Z=1|A), P(X=1|B),
P(Y=1|B), and P(Z=1|B).
(b) Predict the class label for a test sample (X=1, Y=1, Z=1) by using the naive Bayes
approach.
Suggested Outline: (a) P(X=1|A) = 0.6, P(Y=1|A) = 0.4, P(Z=1|A) = 0.8, P(X=1|B) = 0.4,
P(Y=1|B) = 0.4, P(Z=1|B) = 0.2.
(b) We need to compute P(A|(X=1, Y=1, Z=1)) and P(B|(X=1, Y=1, Z=1)). Note that by
Bayes, we have
P(A|(X=1, Y=1, Z=1)) = (P((X=1, Y=1, Z=1)|A)P(A))/P(X=1, Y=1, Z=1)
and
P(B|(X=1, Y=1, Z=1)) = (P((X=1, Y=1, Z=1)|B)P(B))/P(X=1, Y=1, Z=1)
Since P(A) = P(B) = 0.5, it suffices to compare P((X=1, Y=1, Z=1)|A) and
P((X=1, Y=1, Z=1)|B). From (a) and the independence assumption,
P((X=1, Y=1, Z=1)|A) = P(X=1|A)P(Y=1|A)P(Z=1|A) = (0.6)(0.4)(0.8) = 0.192
P((X=1, Y=1, Z=1)|B) = P(X=1|B)P(Y=1|B)P(Z=1|B) = (0.4)(0.4)(0.2) = 0.032
Since P((X=1, Y=1, Z=1)|A) is larger, the record is assigned to class A.
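
A short Python sketch (not part of the original outline) that estimates the conditional probabilities in (a) from the table and scores the test sample in (b); the function name cond_prob is illustrative.

```python
# Training records from the table above: (X, Y, Z, class).
records = [(0, 0, 1, "B"), (1, 0, 1, "A"), (0, 1, 0, "B"), (1, 0, 0, "B"), (1, 0, 1, "A"),
           (0, 0, 1, "A"), (1, 1, 0, "B"), (0, 0, 0, "B"), (0, 1, 0, "A"), (1, 1, 1, "A")]

def cond_prob(attr, label):
    """Maximum-likelihood estimate of P(attribute = 1 | class = label)."""
    in_class = [r for r in records if r[3] == label]
    return sum(r[attr] for r in in_class) / len(in_class)

# (a) P(X=1|class), P(Y=1|class), P(Z=1|class) for classes A and B.
for label in ("A", "B"):
    print(label, [cond_prob(i, label) for i in range(3)])

# (b) Score the test sample (X=1, Y=1, Z=1); the priors are equal, so they can be ignored.
for label in ("A", "B"):
    likelihood = 1.0
    for i in range(3):
        likelihood *= cond_prob(i, label)   # every test attribute equals 1
    print(label, likelihood)                # A: 0.192, B: 0.032 -> predict class A
```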
7. Consider the following data set of training examples


(a) How does Naïve Bayes perform on this data set with respect to class A and class B?

(b) If each class is divided into subclasses, so now we have classes A1, A2, B1, and B2,
how does Naïve Bayes perform with respect to the four subclasses?
(c) Repeat (a) and (b) for decision tree.
Suggested Outline: (a) Naïve Bayes will not perform well since the conditional
probabilities of the different attributes given a particular class are the same for both
classes.
(b) The performance will be improved since in the subclasses, the conditional
probabilities of the different attributes given a particular class are different.
(c) For the two classes A and B, decision tree will not perform well. The performance will
be improved for the four subclasses.
8. Explain why the formula for P(E|H) holds in Chapter 4 ppt slide 24.
Suggested Outline: Note that N = n_1 + n_2 + ⋯ + n_k, where k is the number of distinct words,
n_i is the number of times word i appears in the document, and it is given that P(word i | H) = P_i.
Also recall that in our model the document is represented by the bag of words that contains
all the words appearing in the document (with multiple occurrences of a word appearing
multiple times), regardless of the ordering of the words.
The number of word orderings (documents) in which word i appears n_i times for each i
= the number of ways to permute N objects of which n_i are of type i
= N! / (n_1! n_2! ⋯ n_k!)
Under the independence assumption, each such ordering has probability P_1^n_1 P_2^n_2 ⋯ P_k^n_k, so
P(E|H) = [N! / (n_1! n_2! ⋯ n_k!)] × ∏_{i=1}^{k} P_i^n_i
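A tiny numerical illustration of this formula (a sketch with made-up numbers, not taken from the slides):

```python
from math import factorial, prod

def doc_likelihood(word_counts, word_probs):
    """P(E|H) for the bag-of-words model: multinomial coefficient times the product of P_i^n_i."""
    n_total = sum(word_counts)
    coeff = factorial(n_total)
    for n_i in word_counts:
        coeff //= factorial(n_i)   # N! / (n_1! n_2! ... n_k!), computed exactly
    return coeff * prod(p ** n for p, n in zip(word_probs, word_counts))

# Hypothetical example: a 4-word document over a 3-word vocabulary.
print(doc_likelihood([2, 1, 1], [0.5, 0.3, 0.2]))   # 12 * 0.5^2 * 0.3 * 0.2 = 0.18
```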