University of Waterloo CS480/680 2020 Spring
CS480/680: Introduction to Machine Learning
Homework 5
Due: 11:59 pm, July 13, 2020, submit on LEARN.
Include your name and student number!
Submit your writeup in pdf and all source code in a zip file (with proper documentation). Write a script for each
programming exercise so that the TAs can easily run and verify your results. Make sure your code runs!
[Text in square brackets are hints that can be ignored.]
Exercise 1: Gaussian Mixture Model (GMM) (10 pts)
Notation: For a matrix A, |A| denotes its determinant. For a diagonal matrix diag(s), |diag(s)| = ∏i si.
Algorithm 1: EM for GMM.
Input: X ∈ Rn×d, K ∈ N, initialization for model
// model includes pi ∈ RK+ and for each 1 ≤ k ≤ K, µk ∈ Rd and Sk ∈ Sd+
// pik ≥ 0,
∑K
k=1 pik = 1, Sk symmetric and positive definite.
// random initialization suffices for full credit.
// alternatively, can initialize r by randomly assigning each data to one of the K
components
Output: model, `
1 for iter = 1 : maxiter do
// step 2, for each i = 1, . . . , n
2 for k = 1, . . . ,K do
3 rik ← pik|Sk|−1/2 exp[− 12 (xi − µk)>S−1k (xi − µk)] // compute responsibility
// for each i = 1, . . . , n
4 ri. ←
∑K
k=1 rik
// for each k = 1, . . . ,K and i = 1, . . . , n
5 rik ← rik/ri. // normalize
// compute negative log-likelihood
6 `(iter) = −∑ni=1 log(ri.)
7 if iter > 1 && |`(iter)− `(iter − 1)| ≤ tol ∗ |`(iter)| then
8 break
// step 1, for each k = 1, . . . ,K
9 r.k ←
∑n
i=1 rik
10 pik ← r.k/n
11 µk =
∑n
i=1 rikxi/r.k
12 Sk ←
(∑n
i=1 rikxix
>
i /r.k
)− µkµ>k
1. (2 pts) Derive and implement the EM algorithm for the diagonal Gaussian mixture model, where all
covariance matrices are constrained to be diagonal. Algorithm 1 recaps all the essential steps and serves
as a hint rather than a verbatim instruction. In particular, you must change the highlighted steps
accordingly (with each Sk being a diagonal matrix), along with formal explanations. Analyze the space
and time complexity of your implementation.
[You might want to review the steps we took in class (lecture 16) to get the updates in Algorithm 1 and
adapt them to the simpler case here. The solution should look like sj =
∑n
i=1 rik(xij−µj)2∑n
i=1 rik
=
∑n
i=1 rikx
2
ij∑n
i=1 rik
−µ2j
for the j-th diagonal. Multiplying an n× p matrix with a p×m matrix costs O(mnp). Do not maintain
a diagonal matrix explicitly; using a vector for its diagonal suffices.]
To stop the algorithm, set a maximum number of iterations (say maxiter = 500) and also monitor the
change of the negative log-likelihood `:
` = −
n∑
i=1
log
[
K∑
k=1
pik|2piSk|−1/2 exp[− 12 (xi − µk)>S−1k (xi − µk)]
]
, (1)
University of Waterloo CS480/680 2020 Spring
where xi is the i-th column of X
>. As a debug tool, note that ` should decrease from step to step, and
we can stop the algorithm if the decrease is smaller than a predefined threshold, say tol = 10−5.
Ans: The updates for pi, µk, and r remain the same. For the covariance matrix S = diag(s) which is
constrained to be diagonal, we need to solve (for the k-th component):
min
sk
n∑
i=1
rik
[
1
2
1> log sk +
1
2
(xi − µk)> diag(sk)−1(xi − µk)
]
. (2)
Taking derivative w.r.t. sk and setting it to 0:
n∑
i=1
rik[1./sk − [(xi − µk)./sk].2] = 0. (3)
Thus, we have the (simplified) update rule:
sk =
∑n
i=1 rik(xi − µk).2∑n
i=1 rik
. (4)
Time: O(ndK). Space: O(nK +Kd) (ignoring input O(nd)).
2. (2 pts) Redo Ex 1.1 with the spherical Gaussian mixture model, where each covariance matrix Sk = skI is
a multiple of the identity matrix I. Derive the update for sk and implement the resulting EM algorithm.
Analyze the space and time complexity of your implementation.
Ans: The updates for pi, µk, and r remain the same. For the covariance matrix Sk = skI, we need to
solve (for the k-th component):
min
sk
n∑
i=1
rik
[
1
2
d log sk +
1
2sk
‖xi − µk‖22
]
. (5)
Taking derivative w.r.t. sk and setting it to 0:
n∑
i=1
rik[d/sk − ‖xi − µk‖22/s2k] = 0. (6)
Thus, we have the (simplified) update rule:
sk =
∑n
i=1 rik‖xi − µk‖22
d
∑n
i=1 rik
. (7)
Time: O(ndK). Space: O(nK +Kd) (ignoring input O(nd)).
[sk turns out to be just the average of sk in Ex 1.1.]
3. (2 pts) Redo Ex 1.1 where we fit d GMMs (each with K components) to each feature X:,j , separately.
Implement the resulting EM algorithm. Analyze the space and time complexity of your implementation.
Ans: For each dimension j, we have from the previous exercise:
s
(j)
k =
∑n
i=1 r
(j)
ik (xij − µ(j)k )2∑n
i=1 r
(j)
ik
. (8)
Time: O(ndK). Space: O(ndK) (ignoring input O(nd)).
[Note that r
(j)
ik is different from rik in Ex 1.1.]
Marking scheme for 1.1-1.3:
University of Waterloo CS480/680 2020 Spring
 1pt for correct implementation; some marks may be deducted if code is not testable, some marks
may be awarded for wrong implementation if bugs are minor, code will not be tested on underflow
issues
 0.5pt for correct complexities, 0.5pt for correct update rules; some (or full) marks may be awarded
if mistakes are minor
4. (4 pts) Next, we apply (the adapted) Algorithm 1 in Ex 1.1 to the MNIST dataset (that you already
experimented on in A7). For each of the 10 classes (digits), we can use its (only its) training images to
estimate its (class-conditional) distribution by fitting a GMM (with say K = 5, roughly corresponding
to 5 styles of writing this digit). This gives us the density estimate p(x|y) where x is an image (of some
digit) and y is the class (digit). We can now classify the test set using the Bayes classifier:
yˆ(x) = arg max
c=0,...,9
Pr(Y = c) · p(X = x|Y = c)︸ ︷︷ ︸
∝ Pr(Y=c|X=x)
, (9)
where the probabilities Pr(Y = c) can be estimated using the training set, e.g., the proportion of the
c-th class in the training set, and the density p(X = x|Y = c) is estimated using GMM for each class c
separately. Report your error rate on the test set as a function of K (if time is a concern, using K = 5
will receive full credit). [Optional: Reduce dimension by PCA may boost accuracy quite a bit. Your
running time should be on the order of minutes (for one K), if you do not introduce extra for-loops in
Algorithm 1.]
[In case you are wondering, our classification procedure above belongs to the so-called plug-in estimators
(plug the estimated densities to the known optimal Bayes classifier). However, note that estimating the
density p(X = x|Y = c) is actually harder than classification. Solving a problem (e.g. classification)
through some intermediate harder problem (e.g. density estimation) is almost always a bad idea.]
Marking scheme for 1.4:
 +1pt if data is loaded correctly and code runs
 +1pt for correct implementation of Bayes classifier
 +2pt for correct GMM implementation that can deal with underflows; most students that miss these
points will have NaN at some point of their training procedure
 some points may be awarded if mistakes are minor 