
University of Waterloo CS480/680 2020 Spring

CS480/680: Introduction to Machine Learning

Homework 5

Due: 11:59 pm, July 13, 2020, submit on LEARN.

Include your name and student number!

Submit your writeup in PDF and all source code in a zip file (with proper documentation). Write a script for each

programming exercise so that the TAs can easily run and verify your results. Make sure your code runs!

[Text in square brackets consists of hints that can be ignored.]

Exercise 1: Gaussian Mixture Model (GMM) (10 pts)

Notation: For a matrix A, |A| denotes its determinant. For a diagonal matrix diag(s), |diag(s)| = ∏_i s_i.

Algorithm 1: EM for GMM.

Input: X ∈ R^{n×d}, K ∈ N, initialization for model
// model includes π ∈ R^K_+ and, for each 1 ≤ k ≤ K, µ_k ∈ R^d and S_k ∈ S^d_+
// π_k ≥ 0, ∑_{k=1}^K π_k = 1, S_k symmetric and positive definite.
// random initialization suffices for full credit.
// alternatively, can initialize r by randomly assigning each datum to one of the K components
Output: model, ℓ

1  for iter = 1 : maxiter do
       // step 2, for each i = 1, . . . , n
2      for k = 1, . . . , K do
3          r_ik ← π_k |S_k|^{−1/2} exp[−½ (x_i − µ_k)^⊤ S_k^{−1} (x_i − µ_k)]   // compute responsibility
       // for each i = 1, . . . , n
4      r_i· ← ∑_{k=1}^K r_ik
       // for each k = 1, . . . , K and i = 1, . . . , n
5      r_ik ← r_ik / r_i·   // normalize
       // compute negative log-likelihood
6      ℓ(iter) = −∑_{i=1}^n log(r_i·)
7      if iter > 1 and |ℓ(iter) − ℓ(iter − 1)| ≤ tol · |ℓ(iter)| then
8          break
       // step 1, for each k = 1, . . . , K
9      r_·k ← ∑_{i=1}^n r_ik
10     π_k ← r_·k / n
11     µ_k ← ∑_{i=1}^n r_ik x_i / r_·k
12     S_k ← (∑_{i=1}^n r_ik x_i x_i^⊤ / r_·k) − µ_k µ_k^⊤
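As a concrete illustration, the loop above can be sketched in vectorized NumPy for the diagonal-covariance case of Ex 1.1 (a hypothetical implementation, not the official solution; responsibilities are computed in log space, which also handles the underflow issue flagged in the marking schemes):

```python
import numpy as np

def em_gmm_diag(X, K, maxiter=500, tol=1e-5, seed=0):
    """Sketch of Algorithm 1 for diagonal covariances S_k = diag(s_k)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                        # mixing weights pi_k
    mu = X[rng.choice(n, K, replace=False)]         # K x d means
    s = np.tile(X.var(axis=0) + 1e-6, (K, 1))       # K x d diagonal variances
    ll = []
    for it in range(maxiter):
        # step 2: log of the unnormalized responsibilities in line 3
        # (the (2*pi)^{-d/2} constant is dropped, as in Algorithm 1)
        logr = (np.log(pi)
                - 0.5 * np.log(s).sum(axis=1)
                - 0.5 * (((X[:, None, :] - mu) ** 2) / s).sum(axis=2))  # n x K
        m = logr.max(axis=1, keepdims=True)         # subtract max: no underflow
        r = np.exp(logr - m)
        ri = r.sum(axis=1, keepdims=True)           # line 4
        ll.append(-(np.log(ri) + m).sum())          # line 6
        r /= ri                                     # line 5
        if it > 0 and abs(ll[-1] - ll[-2]) <= tol * abs(ll[-1]):
            break                                   # lines 7-8
        rk = r.sum(axis=0)                          # line 9
        pi = rk / n                                 # line 10
        mu = (r.T @ X) / rk[:, None]                # line 11
        s = np.maximum((r.T @ X**2) / rk[:, None] - mu**2, 1e-8)  # diagonal of line 12
    return pi, mu, s, ll
```

The matrix products r.T @ X and r.T @ X**2 cost O(ndK), which is what dominates one iteration.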

1. (2 pts) Derive and implement the EM algorithm for the diagonal Gaussian mixture model, where all

covariance matrices are constrained to be diagonal. Algorithm 1 recaps all the essential steps and serves

as a hint rather than a verbatim instruction. In particular, you must change the highlighted steps

accordingly (with each S_k being a diagonal matrix), together with formal derivations. Analyze the space

and time complexity of your implementation.

[You might want to review the steps we took in class (lecture 16) to get the updates in Algorithm 1 and adapt them to the simpler case here. The solution should look like

s_j = ∑_{i=1}^n r_ik (x_ij − µ_j)² / ∑_{i=1}^n r_ik = (∑_{i=1}^n r_ik x_ij² / ∑_{i=1}^n r_ik) − µ_j²

for the j-th diagonal. Multiplying an n × p matrix with a p × m matrix costs O(mnp). Do not maintain a diagonal matrix explicitly; using a vector for its diagonal suffices.]

To stop the algorithm, set a maximum number of iterations (say maxiter = 500) and also monitor the change of the negative log-likelihood ℓ:

ℓ = −∑_{i=1}^n log [ ∑_{k=1}^K π_k |2π S_k|^{−1/2} exp(−½ (x_i − µ_k)^⊤ S_k^{−1} (x_i − µ_k)) ],    (1)

Yao-Liang Yu (yaoliang.yu@uwaterloo.ca)©2020


where x_i is the i-th row of X (i.e., the i-th column of X^⊤). As a debugging tool, note that ℓ should decrease from iteration to iteration, and we can stop the algorithm once the decrease is smaller than a predefined threshold, say tol = 10^{−5}.
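On that note, the bracketed sum in (1) is best evaluated in the log domain: for MNIST-sized d, each exp term underflows to 0 in double precision and log(0) yields −inf. A standard log-sum-exp sketch (the helper name is ours):

```python
import numpy as np

def logsumexp(a):
    """Compute log(sum(exp(a))) without underflow by factoring out max(a)."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# example: both terms underflow individually, yet the log-sum is finite
a = np.array([-1000.0, -1001.0])
with np.errstate(divide='ignore'):
    naive = np.log(np.sum(np.exp(a)))   # exp underflows to 0 -> -inf
stable = logsumexp(a)                   # about -999.687
```

Applying this per data point to the log of the summands in (1) gives ℓ directly, with no densities ever materialized.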

Ans: The updates for π, µ_k, and r remain the same. For the covariance matrix S_k = diag(s_k), which is constrained to be diagonal, we need to solve (for the k-th component):

min_{s_k} ∑_{i=1}^n r_ik [ ½ 1^⊤ log s_k + ½ (x_i − µ_k)^⊤ diag(s_k)^{−1} (x_i − µ_k) ].    (2)

Taking the derivative w.r.t. s_k and setting it to 0 (with ./ and .^2 acting elementwise):

∑_{i=1}^n r_ik [ 1 ./ s_k − ((x_i − µ_k) ./ s_k).^2 ] = 0.    (3)

Thus, we have the (simplified) update rule:

s_k = ∑_{i=1}^n r_ik (x_i − µ_k).^2 / ∑_{i=1}^n r_ik.    (4)

Time: O(ndK). Space: O(nK + Kd) (ignoring the O(nd) input).
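A vectorized sketch of update (4) (hypothetical helper; it uses the equivalent moment form from the hint, E[x²] − µ², to avoid per-component loops):

```python
import numpy as np

def diag_update(X, r, mu):
    """Update (4): s_k = sum_i r_ik (x_i - mu_k).^2 / sum_i r_ik, for all k at once.
    X: (n, d) data, r: (n, K) normalized responsibilities, mu: (K, d) means."""
    rk = r.sum(axis=0)                              # (K,) column sums of r
    return (r.T @ X**2) / rk[:, None] - mu**2       # (K, d) via the moment form
```

The two forms agree exactly when µ_k is the responsibility-weighted mean from line 11 of Algorithm 1.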

2. (2 pts) Redo Ex 1.1 with the spherical Gaussian mixture model, where each covariance matrix S_k = s_k I is

a multiple of the identity matrix I. Derive the update for sk and implement the resulting EM algorithm.

Analyze the space and time complexity of your implementation.

Ans: The updates for π, µ_k, and r remain the same. For the covariance matrix S_k = s_k I, we need to

solve (for the k-th component):

min_{s_k} ∑_{i=1}^n r_ik [ (d/2) log s_k + (1/(2 s_k)) ‖x_i − µ_k‖₂² ].    (5)

Taking the derivative w.r.t. s_k and setting it to 0:

∑_{i=1}^n r_ik [ d/s_k − ‖x_i − µ_k‖₂² / s_k² ] = 0.    (6)

Thus, we have the (simplified) update rule:

s_k = ∑_{i=1}^n r_ik ‖x_i − µ_k‖₂² / (d ∑_{i=1}^n r_ik).    (7)

Time: O(ndK). Space: O(nK + Kd) (ignoring the O(nd) input).

[s_k here turns out to be just the average of the d entries of s_k in Ex 1.1.]
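A vectorized sketch of update (7) (hypothetical helper; it also makes the bracketed remark concrete, since averaging the d entries of the diagonal update gives exactly this value):

```python
import numpy as np

def spherical_update(X, r, mu):
    """Update (7): s_k = sum_i r_ik ||x_i - mu_k||_2^2 / (d * sum_i r_ik).
    X: (n, d), r: (n, K) normalized responsibilities, mu: (K, d). Returns (K,)."""
    n, d = X.shape
    rk = r.sum(axis=0)                                    # (K,)
    sq = ((X[:, None, :] - mu) ** 2).sum(axis=2)          # (n, K) squared distances
    return (r * sq).sum(axis=0) / (d * rk)
```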

3. (2 pts) Redo Ex 1.1 where we fit d GMMs (each with K components) to each feature X_{:,j} separately.

Implement the resulting EM algorithm. Analyze the space and time complexity of your implementation.

Ans: For each dimension j, we have from the previous exercise:

s_k^{(j)} = ∑_{i=1}^n r_ik^{(j)} (x_ij − µ_k^{(j)})² / ∑_{i=1}^n r_ik^{(j)}.    (8)

Time: O(ndK). Space: O(ndK) (ignoring the O(nd) input).

[Note that r_ik^{(j)} is different from r_ik in Ex 1.1.]
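A sketch of Ex 1.3 (hypothetical helper names): fit one univariate GMM per column, so each feature j gets its own π^{(j)}, µ^{(j)}, s^{(j)}, and r^{(j)}:

```python
import numpy as np

def em_gmm_1d(x, K, maxiter=500, tol=1e-5, seed=0):
    """EM for a univariate GMM on one feature column x of shape (n,)."""
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)            # K initial means
    s = np.full(K, x.var() + 1e-6)                  # K scalar variances
    prev = np.inf
    for _ in range(maxiter):
        # log responsibilities, max-subtracted against underflow
        logr = np.log(pi) - 0.5 * np.log(s) - 0.5 * (x[:, None] - mu)**2 / s
        m = logr.max(axis=1, keepdims=True)
        r = np.exp(logr - m)
        ri = r.sum(axis=1, keepdims=True)
        nll = -(np.log(ri) + m).sum()
        r /= ri
        if abs(prev - nll) <= tol * abs(nll):
            break
        prev = nll
        rk = r.sum(axis=0)
        pi, mu = rk / n, (r * x[:, None]).sum(axis=0) / rk
        s = np.maximum((r * (x[:, None] - mu)**2).sum(axis=0) / rk, 1e-8)
    return pi, mu, s

# Ex 1.3: one independent model per feature X[:, j]
# models = [em_gmm_1d(X[:, j], K) for j in range(X.shape[1])]
```

Each run is O(nK) per iteration, giving O(ndK) in total over the d features; keeping the d responsibility matrices r^{(j)} accounts for the O(ndK) space.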

Marking scheme for 1.1-1.3:


- 1pt for correct implementation; some marks may be deducted if the code is not testable, and some marks may be awarded for a wrong implementation if the bugs are minor; code will not be tested on underflow issues
- 0.5pt for correct complexities and 0.5pt for correct update rules; some (or full) marks may be awarded if mistakes are minor

4. (4 pts) Next, we apply (the adapted) Algorithm 1 in Ex 1.1 to the MNIST dataset (that you already

experimented on in A7). For each of the 10 classes (digits), we can use its (only its) training images to

estimate its (class-conditional) distribution by fitting a GMM (with, say, K = 5, roughly corresponding

to 5 styles of writing this digit). This gives us the density estimate p(x|y) where x is an image (of some

digit) and y is the class (digit). We can now classify the test set using the Bayes classifier:

ŷ(x) = argmax_{c=0,...,9} Pr(Y = c) · p(X = x | Y = c)   (note Pr(Y = c) · p(X = x | Y = c) ∝ Pr(Y = c | X = x)),    (9)

where the probabilities Pr(Y = c) can be estimated using the training set, e.g., the proportion of the

c-th class in the training set, and the density p(X = x|Y = c) is estimated using GMM for each class c

separately. Report your error rate on the test set as a function of K (if time is a concern, using K = 5

will receive full credit). [Optional: reducing the dimension by PCA may boost accuracy quite a bit. Your

running time should be on the order of minutes (for one K), if you do not introduce extra for-loops in

Algorithm 1.]
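The plug-in rule (9) can be sketched in the log domain, which avoids multiplying tiny densities; the interface below (per-class log-density callables) is our assumption, with the class-c callable coming from whatever GMM fit is used:

```python
import numpy as np

def bayes_classify(X, log_prior, class_logpdf):
    """Plug-in Bayes classifier (9): argmax_c [log Pr(Y=c) + log p(x | Y=c)].
    log_prior: (C,) array of log class proportions; class_logpdf: list of C
    callables, each mapping X of shape (n, d) to (n,) log-densities."""
    scores = np.stack([log_prior[c] + class_logpdf[c](X)
                       for c in range(len(log_prior))], axis=1)   # (n, C)
    return scores.argmax(axis=1)
```

For a GMM class-conditional, class_logpdf[c] would itself be a log-sum-exp over the K components of that class.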

[In case you are wondering, our classification procedure above belongs to the so-called plug-in estimators

(plug the estimated densities into the known optimal Bayes classifier). However, note that estimating the

density p(X = x|Y = c) is actually harder than classification. Solving a problem (e.g. classification)

through some intermediate harder problem (e.g. density estimation) is almost always a bad idea.]

Marking scheme for 1.4:

- +1pt if data is loaded correctly and the code runs
- +1pt for a correct implementation of the Bayes classifier
- +2pt for a correct GMM implementation that can deal with underflows; most students who miss these points will have NaN at some point in their training procedure
- some points may be awarded if mistakes are minor
