ISOM 5535 High Dimensional Statistics with Business Applications Lilun Du
Practice Questions for Examination
1. (Lecture 1)
a. Suppose that ∑
=1 = 0,∑
=1 = 0,∑
2
=1 = 1 for = 1,… , and = ∑
=1 + ,
where 1, … , are independent and identically distributed from a (0,
2) distribution. Derive
the likelihood function.
b. Assume the prior 1, … , are independent and identically distributed from a double
exponential distribution with mean 0 and common scale parameter , i.e () =
1
2
exp(−

),
for = 1,… , . Derive the posterior for , where = (1, … , )
. (Hints: (, ) ∝
(, )(), where = (1, … , )
, = {}. You can just derive from (, )())
c. If the lasso estimates of is the solution of maximizing the posterior in (b), what is ?
d. Assume the prior 1, … , are independent and identically distributed from a normal
distribution with mean zero and variance , i.e. () =
1
√22
exp (− (
2
22
)), for = 1,… , .
Derive the posterior of , where = (1, … , )
.
e. If the ridge estimates of is the solution of maximizing the posterior in (d), what is ?
Solutions:
a. (, 2) = ∏
1
√22
exp(−
(−∑
=1 )
2
22
) =1
= (22)−
2 exp (−
1
22
∑ ( −∑
=1
)
2
=1
)
b. (, )() =
1
√22
exp (−
1
22
∑ ( − ∑
=1 )
2
=1 ) ∏
1
2
exp(−

)=1
= (22)−
2(2)− exp(−(
1
22
‖ − ‖2
2 +
‖‖1
))
c. ̂ = argmax
(, )()
= argmax
log (, )()
= argmax
−
2
log 22 − log 2 − (
1
22
‖ − ‖2
2 +
‖‖1
)
= argmin
1
22
‖ − ‖2
2 +
‖‖1
= argmin
1
2
(
1
2
‖ − ‖2
2 +
2
‖‖1)
= argmin
1
2
‖ − ‖2
2 +
2
‖‖1
Compare to lasso problem, =
2
d. (, )() =
1
√22
exp (−
1
22
∑ ( − ∑
=1 )
2
=1 ) ∏
1
√22
exp(−(
2
22
))=1
= (22)−
2(22)−
2 exp(− (
1
22
‖ − ‖2
2 +
1
22
‖‖2
2))
e. ̂ = argmax
(, )()
= argmax
log (, )()
= argmax
−
2
log 22 −
2
log 22 − (
1
22
‖ − ‖2
2 +
1
22
‖‖2
2)
= argmin
1
22
‖ − ‖2
2 +
1
22
‖‖2
2
= argmin
1
2
(
1
2
‖ − ‖2
2 +
2
22
‖‖2
2)
Compare to ridge problem, =
2
22
2. (Lecture 1)
Consider a simple special case with = , and = , where is a × identity matrix.
The ridge regression then has a form:
min
β1,…,
∑( − )
2
=1
+ ∑
2
=1
a. Without any constraint, show that the least squares estimate is:
̂ =
b. Show that the ridge regression estimate is:
̂
=
1 +
c. If = 1, draw a line graph of ̂
against from = −1 to = 1.
d. Given that the LASSO estimate is:
̂
= sign() ( −
2
)
+
=
{
−
2
, > 0  >
2
+
2
, < 0  >
2
0,  <
2
If = 1, draw a line graph of ̂
against from = −1 to = 1.
e. From the graphs in (c) and (d), what is the difference between ̂
and ̂
?
Solution:
a. (1, … , ) =
1
2
∑ ( − )
2
=1
= − + = 0
=
b. (1, … , ) =
1
2
∑ ( − )
2
=1 + ∑
2
=1
= −2 + 2 + 2 = 0
(1 + ) =
̂
=
1 +
c.
d.
e. From the two graphs, we can see that ridge regression coefficient estimates are shrunken
proportionally towards zero, while the lasso coefficient estimates are softthresholded towards
zero.
3. (Lecture 2)
In logistic regression,
() =
0+1
1 + 0+1
a. Show that the loglikelihood function is:
ℓ(0, 1) = log (0, 1) =∑((0 + 1) − log(1 +
0+1))
=1
b. Show that
ℓ
0
= ∑ ( − )
=1 , where = ().
c. Show that
ℓ
1
= ∑ ( − )
=1 .
d. Show that
ℓ
01
=
0
(
ℓ
1
) = −∑ (1 − )
=1
Solution:
a. (0, 1) = ∏ (())
(1 − ())
1−
=1
log((0, 1)) =∑( log(()) + (1 − ) log(1 − ()))
=1
log(()) = 0 + 1 − log(1 +
0+1)
log(1 − ()) = log(1 +
0+1)
ℓ(0, 1) = log((0, 1)) =∑((0 + 1) − log(1 +
0+1))
=1
b.
0
= ∑ ( −
1
1+0+1
× 0+1)=1
=∑( −
0+1
1 + 0+1
)
=1
=∑( − )
=1
c.
1
= ∑ ( −
1
1+0+1
× 0+1 × )
=1
=∑( −
1
1 + 0+1
× 0+1 × )
=1
=∑( − )
=1
d.
ℓ
01
=
0
(
ℓ
1
) = −∑
(0+1)(1+0+1)−(0+1)(0+1)
(1+0+1)
2
=1
= −∑
0+1
(1 + 0+1)2
=1
= −∑
0+1
1 + 0+1
×
1
1 + 0+1
=1
= −∑(1 − )
=1
4. (Lecture 2)
A Bayes classifier for classes is:
() = arg max
∈{1,2,…,}
( =  = )
In LDA, for = 2, = 1 and 1 = 2,
a. Show that the Bayes classifier assigns an observation to class 1 if 2(1 − 2) > 1
2 − 2
2, and
else to class 2.
b. Show that the Bayes decision boundary, such that it is indifferent to classify an observation to
either class 1 or class 2, will be:
=
1 + 2
2
Solution:
a. ( = 1 = ) =
1
11()+22()
1
√22
exp (−
(−1)
2
22
)
( = 2 = ) =
2
11() + 22()
1
√22
exp (−
( − 2)
2
22
)
If the Bayes classifier assigns an observation to class 1,
1
√22
exp(−
( − 1)
2
22
) >
1
√22
exp(−
( − 2)
2
22
)
exp(−
( − 1)
2 − ( − 2)
2
22
) > 1
−( − 1)
2 + ( − 2)
2 > 0
−2 + 21 − 1
2 + 2 − 22 + 2
2 > 0
2(1 − 2) − (1
2 − 2
2) > 0
2(1 − 2) > (1
2 − 2
2)
b. If it is indifferent to classify an observation to either class 1 or class 2,
2(1 − 2) = (1
2 − 2
2)
2(1 − 2) = (1 + 2)(1 − 2)
=
1 + 2
2
5. (Lecture 2)
Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”)
based on , last year’s percent profit. We examine a large number of companies and discover that the
mean value of for companies that issued a dividend was ̅ = 10, while the mean for those that didn’t
was ̅ = 0. In addition, the variance of X for these two sets of companies was ̂2 = 36. Finally, 80 %
of companies issued dividends. Assuming that X follows a normal distribution, predict the probability
that a company will issue a dividend this year given that its percentage profit was = 4 last year.
Solution:
1 = 0.8
1(4) =
1
√2(36)
exp(−
(4−10)2
2(36)
) = 0.040328
2(4) =
1
√2(36)
exp(−
(4−0)2
2(36)
) = 0.053241
( = 1 = 4) =
(0.8)(0.040328)
(0.8)(0.040328) + (0.2)(0.053241)
= 0.751852
6. (Lecture 3)
Suppose that a curve ̂ is computed to smoothly fit a set of points using the following formula:
̂ = argmin
(∑( − ())
2
=1
+ ∫[()()]
2
)
where () represents the th derivative of (and (0) = 0). Describe ̂ in each of the following
scenarios.
a. = ∞, = 0
b. = ∞, = 1
c. = ∞, = 2
d. = ∞, = 3
e. = 0, = 3
Solution:
a. If = ∞,
̂ = argmin
∫[()()]
2
If = 0, ()() = (0)() = 0
∴ ̂() = 0
b. ∫[()()]
2
will be minimized when ()() = 0.
If = 1, we need ′() = 0.
Notice that if () = , where is a constant, ′() = 0
∴ () = , where is a constant.
c. If = 2, we need ′′() = 0.
Notice that if () = + , where and are constants, ′() = and ′′() = 0
∴ ̂() = + , where and are constants.
d. If = 2, we need ′′′() = 0.
Notice that if () = 2 + + , where , and are constants, ′′′() = 0
∴ ̂() = 2 + + , where , and are constants.
e. If = 0,
̂ = argmin
∑( − ())
2
=1
∴ ̂() exactly interpolates all the observations.
7. Consider the truncated power series representation for cubic splines with K interior knots. Let
() = 0 + 1 + 2
2 + 3
3 +∑( − )+
3
=1
Prove that the natural boundary conditions for natural cubic splines imply the following linear
constraints on the coefficients:
2 = 0, ∑
=1
= 0
3 = 0, ∑
=1
= 0
(Hints: () is linear when < 1 or ≥ .)
Solution:
When < 1,
() = 0 + 1 + 2
2 + 3
3
If () is linear when < 1,
2 = 0, 3 = 0
When ≥ ,
() = 0 + 1 + 2
2 + 3
3 +∑( − )
3
=1
′() = 1 + 22 + 33
2 + 3∑( − )
2
=1
′′() = 22 + 63 + 6∑( − )
=1
′′′() = 63 + 6∑
=1
If () is linear when ≥ ,
′′() and ′′′() will be 0.
′′′() = 0 ⇒ ∑
=1 = 0
′′() = 0 ⇒ ∑ ( − )
=1 = 0
∑ ( − ) = 0
=1
∑
=1 − ∑
=1 = 0
∑
=1 = 0
8. (Lecture 4)
Given the correlation matrix
1 0.9
0.9 1
R
=
a. Compute the eigenvalues 1 and 2 of R and the corresponding eigenvectors 1 and 2 of R,
with unit lengths.
b. Show that 1 2 ( )tr R + = where the trace of a matrix equals the sum of its diagonal components.
c. Show that 1 2 det( )R = where det(R) is the determinant of the matrix R.
d. Find out the principal components 1C and 2C and hence find the loadings of the variables. Also
compute the weights of the principal components 1C and 2C , verify that they are uncorrelated.
e. What proportion of the total variance in the data does the first principle component account for?
Solution:
a.

1 − 0.9
0.9 1 −
 = 0
= 1.9 or 0.1
When 1 = 1.9,
(
1 0.9
0.9 1
) (
) = (
1.9
1.9
)
{
+ 0.9 = 1.9
0.9 + = 1.9
⇒ =
⇒ 1 = (
1/√2
1/√2
)
When 2 = 0.1,
(
1 0.9
0.9 1
) (
) = (
0.1
0.1
)
{
+ 0.9 = 0.1
0.9 + = 0.1
⇒ = −
⇒ 2 = (
1/√2
−1/√2
)
b. Let = (
)

−
−
 = 0
( − )( − ) − = 0
2 − ( + ) + ( − ) = 0
Since 1 and 2 are roots of the above quadratic equation, 1 + 2 = + = ().
c. From b, 12 = − = .
d.
1 = 1
(
1
2
) =
1
√2
1 +
1
√2
2 = 0.7071 + 0.7072
2 = 2
(
1
2
) =
1
√2
1 −
1
√2
2 = 0.7071 − 0.7072
The weight of the first principal component =
1.9
1.9+0.1
= 0.95
The weight of the second principal component =
0.1
1.9+0.1
= 0.05
(
1/√2
1/√2
) (1/√2 1/√2) =
1
2
−
1
2
= 0
Therefore 1 and 2are uncorrelated.
e. The first principal component accounts for 95% of the total variance in the data.
9. (Lecture 4)
Determine the population principal components 1 and 2 for the covariance matrix = (
5 2
2 2
).
Also, calculate the proportion of the total population variance explained by the first principal component.
Solution:

5 − 2
2 2 −
 = 0
2 − 7 + 6 = 0 ⇒ = 6 or 1
When 1 = 6,
(
5 2
2 2
) (
) = (
6
6
)
{
5 + 2 = 6
2 + 2 = 6
⇒ = 2
⇒ 1 = (
2
1
)
1 =
1
√22 + 12
(
2
1
) =
(
2
√5
1
√5)
= (
0.894
0.447
)
= 0.8941 + 0.4472
When 2 = 1,
(
5 2
2 2
) (
) = (
)
{
5 + 2 =
2 + 2 =
⇒ 2 = −
⇒ 1 = (
1
−2
)
2 =
1
√(1)2 + (−2)2
(
1
−2
) =
(
1
√5
−
2
√5)
= (
0.447
−0.894
)
= 0.4471 − 0.8942
The first principal component accounts for
6
6+1
= 85.7% of the total population variance.
10. (Lecture 4)
a. Show that the covariance matrix
= (
1.0 0.63 0.45
0.63 1.0 0.35
0.45 0.35 1.0
)
for the = 3 standardized random variables 1, 2 and 3 can be generated by the = 1
factor model
1 = 0.91 + 1
2 = 0.71 + 2
3 = 0.51 + 3
where (1) = 1, (, 1) = 0, and
= () = (
0.19 0 0
0 0.51 0
0 0 0.75
).
This is, write in form = ′+.
b. Use the information in (a),
i. Calculate communalities ℎ
2, = 1,2,3, and interpret these quantities.
ii. Calculate (, 1) for = 1,2,3. Which variable might carry the greatest weight in
“naming” the common factor? Why?
c. The eigenvalues and eigenvectors of the correlation matrix in (a) are
1 = 1.96, 1′ = [0.625,0.593,0.507]
2 = 0.68, 2′ = [−0.219, −0.491,0.843]
3 = 0.36, 3′ = [0.749,−0.638,−0.177]
i. Assuming an m = 1 factor model, calculate the loading matrix L and matrix of specific
variance using the principal component solution method. Compare the results with
those in (a).
ii. What proportion of the total population variance is explained by the first common factor?
Solution:
a. = ′ +
(
1.0 0.63 0.45
0.63 1.0 0.35
0.45 0.35 1.0
) = ′ + (
0.19 0 0
0 0.51 0
0 0 0.75
)
′ = (
1.0 0.63 0.45
0.63 1.0 0.35
0.45 0.35 1.0
) − (
0.19 0 0
0 0.51 0
0 0 0.75
) = (
0.81 0.63 0.45
0.63 0.49 0.35
0.45 0.35 0.25
)
= (
0.9
0.7
0.5
) (0.9 0.7 0.5)
= (
0.9
0.7
0.5
) (0.9 0.7 0.5) + (
0.19 0 0
0 0.51 0
0 0 0.75
)
b.
i. ℎ1
2 = 0. 92 = 0.81, ℎ2
2 = 0. 72 = 0.49, ℎ3
2 = 0. 52 = 0.25
ii. (1, 1) =
(0.91+1,1)
√(1)√(1)
= 0.9(1, 1) + (1, 1) = 0.9
Similarly, (2, 1) = 0.7 and (2, 1) = 0.5
1 carry the greatest weight in "naming" the common factor.
c.
i. 1 = 0.6251 + 0.5932 + 0.5073
2 = −0.2191 − 0.4912 + 0.8433
3 = 0.7491 − 0.6382 − 0.1773
Hence, we have,
1 = 0.6251 − 0.2192 + 0.7493
2 = 0.5931 − 0.4912 − 0.6383
3 = 0.5071 + 0.8432 − 0.1773
Since
(1) = 0.625
2 + 0.5932 + 0.5072 + 2(0.625)(0.593)(0.63)
+2(0.593)(0.507)(0.35) + 2(0.625)(0.507)(0.45)
= 1.963 = 1
we have = √1.96
The loading matrix = (0.875 0.8302 0.7098).
= (
1.0 0.63 0.45
0.63 1.0 0.35
0.45 0.35 1.0
) − (
0.875
0.8302
0.7098
) (0.875 0.8302 0.7098)
= (
1.0 0.63 0.45
0.63 1.0 0.35
0.45 0.35 1.0
) − (
0.7656 0.7264 0.6211
0.7264 0.6892 0.5893
0.6211 0.5893 0.5038
)
= (
0.2344 −0.0964 −0.1711
−0.0964 0.3108 −0.2393
−0.1711 −0.2393 0.4962
)
ii)
Factor Loadings
Variable 1 Communality ℎ
2
Specificity
2
1 0.875 0.7656 0.2344
2 0.8302 0.6892 0.3108
3 0.7098 0.5038 0.4962
Variance Explained 1.96 1.96 1.04
Percentage 0.653 0.6533 0.3467
11. (Lecture 4)
Let = [
1
2
] be a random vector from a bivariate normal distribution with mean vector = [
1
2
] and
covariance matrix Σ = [
1
2 12
12 2
2 ]. The family of contours of X is a family of ellipsoids in the original
coordinate system of which the value of the density for along the contour is constant. Let = [
1
2
],
the joint probability density of 1 and 2 is given by
f1,2() =
1
2Σ
1
2
exp [−
1
2
( − )Σ−1( − )] for ∈ ℝ2.
a. Show that
1
2Σ
1
2
exp [−
1
2
( − )Σ−1( − )] ≡
1
212√1−2
exp [−
1
2(1−2)
(1
2 − 212 +
2
2)]
where =
12
12
and =
−
for = 1,2.
b. Show that 1,2() ≤
1
212√1−2
for all ∈ ℝ2.
c. Consider a set of ∈ ℝ2 that has the equal value of 1,2(), i.e. { ∶ 1,2() =
for some ≤
1
212√1−2
}. Show that 1,2() = can be written as
( − )Σ−1( − ) = 2
for some constant .
d. Let = [
1
2
] = ( − ) be the a vector of principal components of associated with its
covariance matrix, where Σ = Λ by Eigendecomposition. Show that ( − )Σ−1( − ) =
2 can be written as
1
2
1
+
2
2
2
= 2
where is the constant in part (c); 1 and 2 are the eigenvalues.
e. Let 1 = 2 = 0, Σ = [
2 −1
−1 2
] and = 1. Plot the graph of
1
2
1
+
2
2
2
= 2
on 12plane. Hence, add a pair of axes 1 and 2.
Solution:
a. (1): Σ = 
1
2 12
12 2
2  = 1
22
2 − 12
2 = 1
22
2(1 − 2), where =
12
12
(2): Σ−1 =
1
Σ
(
2
2 −12
−12 1
2 ) =
1
1
22
2(1−2)
(
2
2 −12
−12 1
2 ) =
1
1−2
(
1
1
2 −
σ12
−
σ12
1
2
2
)
(3): ( − )Σ−1( − ) =
1
1−2
(1 − 1 2 − 2) (
1
1
2 −
σ12
−
σ12
1
2
2
)(
1 − 1
2 − 2
)
=
1
1 − 2
((1 − 1)
2
1
1
2 −
2
12
(1 − 1)(2 − 2) + (2 − 2)
2
1
2
2
Substitute (1) and (3) into L.H.S, then we have the result.
b. 1,2() =
1
212√1−2
exp [−
1
2(1−2)
(1
2 − 212 + 2
2)]
Note that
1
2 − 212 + 2
2 ≥ 1
2 − 212 + 2
2 as ≤ 1
= (1 − 2)
2 ≥ 0
1,2() ≤
1
212√1 − 2
0 =
1
212√1 − 2
c. 1,2() =
1
212√1 − 2
exp [−
1
2
( − )Σ−1( − )] =
( − )Σ−1( − ) = −2 log (212√1 − 2) =
2
Since 212√1 − 2 ≤ 212√1 − 2 ×
1
212√1−2
= 1
Hence, −2 log(212√1 − 2) > −2 log 1 = 0
We can write −2 log(212√1 − 2) as it is always positive.
d. = ( − )
( − )Σ−1( − )
= ( − )Λ−1( − )
= Λ−1
= (1 2) (
1
−1 0
0 2
−1)(
1
2
)
= (
1
1
2
2
) (
1
2
)
=
1
2
1
+
2
2
2
∴
1
2
1
+
2
2
2
= 2
e. Solve 
2 − −1
−1 2 −
 = 0
(2 − )2 = 1
= 3 or = 1
When = 3,
(
−1 −1
−1 −1
)~(
−1 −1
0 0
)
1 =
1
√(−1)2 + 12
(
1
−1
) =
1
√2
(
1
−1
)
When = 1
(
1 −1
−1 1
)~ (
1 −1
0 0
)
2 =
1
√12 + 12
(
1
1
) =
1
√2
(
1
1
)
∴ Λ = (
3 0
0 1
) , =
1
√2
(
1 1
−1 1
)
The equation becomes
1
2
3
+
2
2
1
= 2
To find the angle of rotation, = (
1
√2
−
1
√2
1
√2
1
√2
) = (
cos − sin
sin cos
)
cos =
1
√2
, sin =
1
√2
= 45° (anticlockwise)
12. (Lecture 5)
Consider the matrix of distances
1 2 3 4
1
2
3
4
[
0
1 0
11 2 0
5 3 4 0
]
Cluster the four items using each of the following procedures.
a. Single linkage hierarchical procedure.
b. Complete linkage hierarchical procedure.
c. Average linkage hierarchical procedure.
d. Draw the dendrograms and compare the results in (a), (b) and (c).
Solution:
a.
1 2 3 4
1
2
3
4
[
0
1 0
11 2 0
5 3 4 0
]
1 = {1,2}
→
1 3 4
1
3
4
[
0
2 0
3 4 0
]
2 = {1, 3}
→
2 4
2
4
[
0
3 0
]
b.
1 2 3 4
1
2
3
4
[
0
1 0
11 2 0
5 3 4 0
]
1 = {1,2}
→
1 3 4
1
3
4
[
0
11 0
5 4 0
]
2 = {3,4}
→
1 2
1
2
[
0
11 0
]
c.
1 2 3 4
1
2
3
4
[
0
1 0
11 2 0
5 3 4 0
]
1 = {1,2}
→
1 3 4
1
3
4
[
0
6.5 0
4 4 0
]
2 = {1, 4}
→
1 3
1
3
[
0
5.67 0
]
or
2 = {3,4}
→
1 2
1
2
[
0
5.25 0
]
13. (Lecture 5)
The distances between pairs of five items are as follows:
1 2 3 4 5
1
2
3
4
5 [
0
4 0
6 9 0
1 7 10 0
6 3 5 8 0 ]
Cluster the five items using the single linkage, complete linkages, and average linkage hierarchical
methods. Draw the dendrograms and compare the results.
Solution:
Single linkage
1 2 3 4 5
1
2
3
4
5 [
0
4 0
6 9 0
1 7 10 0
6 3 5 8 0 ]
1 {1, 4}C =
⎯⎯→
1 2 3 5
1
2
3
5
[
0
4 0
6 9 0
6 3 5 0
]
2 = {2,5}
→
1 2 3
1
2
3
[
0
4 0
6 5 0
]
3 = {1, 2}
→
3 3
3
3
[
0
5 0
]
Complete linkage
1 2 3 4 5
1
2
3
4
5 [
0
4 0
6 9 0
1 7 10 0
6 3 5 8 0 ]
1 = {1,4}
→
1 2 3 5
1
2
3
5
[
0
7 0
10 9 0
8 3 5 0
]
2 = {2,5}
→
1 2 3
1
2
3
[
0
8 0
10 9 0
]
3 = {1, 2}
→
3 3
3
3
[
0
10 0
]
Average linkage
1 2 3 4 5
1
2
3
4
5 [
0
4 0
6 9 0
1 7 10 0
6 3 5 8 0 ]
1 = {1,4}
→
1 2 3 5
1
2
3
5
[
0
5.5 0
8 9 0
7 3 5 0
]
2 = {2,5}
→
1 2 3
1
2
3
[
0
6.25 0
8 7 0
]
3 = {1, 2}
→
3 3
3
3
[
0
7.5 0
]
The clustering results are the same.
14. (Lecture 5)
Suppose we measure two variables 1X and 2X for four items A, B, C and D. The data are as follows:
Observations
Item
1x 2x
A 5 4
B 1 2
C 1 1
D 3 1
Use the Kmeans clustering technique to divide the items into K = 2 clusters. Start with the initial groups
( AB ) and ( CD ).
Solution:
Iteration Observation
x
Cluster
C1
Cluster
C2
Centroid
C1
Centroid
C2
Distance (x, C1) Distance (x, C2)
1 A AB CD (3, 1) (1, 1) 3.6055513 5
1 B AB CD (3, 1) (1, 1) 3.6055513 3
1 C AB CD (3, 1) (1, 1) 4 2
1 D AB CD (3, 1) (1, 1) 0 2
2 A AD BC (4, 2.5) (0, 0.5) 1.8027756 6.7268120
2 B AD BC (4, 2.5) (0, 0.5) 5.4083269 1.8027756
2 C AD BC (4, 2.5) (0, 0.5) 5.2201533 1.8027756
2 D AD BC (4, 2.5) (0, 0.5) 1.8027756 3.3541020
Stop
15. (Lecture 5)
We have collected samples 1 , 2, … , . Each sample is from either (, 1) or
(, 2), where is known but 1 and 2. Note that ~(, ) if its probability
mass function is given by
Pr( = ) ={(
) (1 − )−
0
if = 0,1,2, … ,
otherwise.
.
Let be an indictor function for the
ℎ observation such that = 1 if the
ℎ observation is from
(, 1); and = 0 if the
ℎ observation is from (, 2). Assume that ( =
1) = and ( = 0) = 1 − , where is unknown.
a. Derive the log completedata likelihood function ℓ(, 1, 21, … , , 1, … , ).
b. Derive an EM algorithm to estimate unknown parameters (, 1, 2).
Solution:
a. (, ) = ∏ {[ (
) 1
(1 − 1)
−]
+ [(1 − ) (
) 2
(1 − 2)
−]
1−
}=1
= ∑
=1 (1 − )∑ (1−)
=1 ∏ (
) 1
∑
=1 (1 − )
∑ (−)
=1 2
∑ (1−)
=1 (1
=1
− )
∑ (−)(1−)
=1
(, ) =∑
=1
log +∑ (1 − )
=1
log(1 − ) +∑ (
)
=1
+∑
=1
log 1
+∑ ( − )
=1
log(1 − 1) +∑ (1 − )
=1
log 2
+∑ (1 − )( − )
=1
log 2
b. Let (, ()) = ,() ((, ))
=∑ ̂
()
=1
log +∑ (1 − ̂
())
=1
log(1 − ) +∑ (
)
=1
+∑ ̂
()
=1
log 1
+∑ ( − )̂
()
=1
log(1 − 1) +∑ (1 − ̂
())
=1
log 2
+∑ (1 − ̂
())( − )
=1
log 2
Where ̂
() = ,()
() = Pr( = 1,
()) =
̂()(;1)
̂()(;1)+(1−̂
())(;2)
Where (; 1) = (
) (1 − )−
=
∑ ̂
()
=1
−
∑ (1 − ̂
())=1
1 −
Set
=̂+1 = 0
̂(+1) =
∑ ̂
()
=1
1
=
∑ ̂
()
=1
1
−
∑ ( − )̂
()
=1
1 − 1
Set
1

1=1
(+1) = 0
(1 − 1)∑ ̂
()
=1
− 1∑ ( − )̂
()
=1
= 0
∑ ̂
()
=1
− 1∑ ̂
()
=1
− 1∑ ̂
()
=1
+ 1∑ ̂
()
=1
= 0
1∑ ̂
()
=1
=∑ ̂
()
=1
̂1
(+1) =
∑ ̂
()
=1
∑ ̂
()
=1
Similarly, ̂2
(+1) =
∑ (1−̂
()
)
=1
∑ (1−̂
()
)=1
(1): Set = 1;
(2): When (, (+1)) − (, ()) > , ∃ > 0
Estep: Calculate (, ());
Mstep: Calculate ̂(), ̂1
(), ̂2
()
← + 1;
Else break;
(3): Repeat (2)
学霸联盟