The Australian National University
School of Computing
COMP3670/6670 Introduction to Machine Learning
Semester 2, 2021
Final Exam
• Write your name and UID on the first page (you will be fine if you forget to write them).
• This is an open book exam. You may bring in any materials including electronic and paper-
based ones. Any calculators (programmable included) are allowed. No communication devices
are permitted during the exam.
• Reading time: 30 minutes
• Writing time: 180 minutes
• For all the questions, write your answer CLEARLY on papers prepared by yourself.
• There are 9 pages in total (including the cover page)
• Points possible: 100
• This is not a hurdle.
• When you are asked to provide a justification for your answer, you will receive 0 marks if the
justification is incorrect.
• Section 1. Linear Algebra and Matrix Decomposition (13 points)
1. (6 points) Show that it is impossible to have a set of 3 orthogonal non-zero vectors
$\{v_1, v_2, v_3\}$ in $\mathbb{R}^2$.
(Hint: Write $v_1 = [x_1, y_1]^T$ and first assume that $x_1 = 0$. Then, prove it for the case where
$x_1 \neq 0$, using the first case to help you.)
Solution. We write
\[ v_1 = \begin{bmatrix} x_1 \\ y_1 \end{bmatrix}, \quad v_2 = \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}, \quad v_3 = \begin{bmatrix} x_3 \\ y_3 \end{bmatrix}. \]
As the hint suggests, we first assume that x1 = 0. Then since v1 ·v2 = 0, we have y1y2 = 0.
Since y1 cannot be zero (otherwise v1 = 0) we must have y2 = 0. Since v1 · v3 = 0,
we have x1x3 + y1y3 = y1y3 = 0, and y1 ̸= 0, so y3 = 0. Since v2 · v3 = 0, we have
x2x3 + y2y3 = x2x3 = 0. Now, y2 = 0, so x2 ̸= 0 (else v2 = 0), so x3 = 0. We have that
x3 = 0 and y3 = 0, so v3 = 0, a contradiction.
Now, if any of $x_1, x_2, x_3, y_1, y_2, y_3$ were zero, this would also result in a contradiction
in the same way: by symmetry we can reorder the vectors so that the zero appears in the first
vector, and since $x_i x_j + y_i y_j = y_i y_j + x_i x_j$, swapping the $x$ and $y$ values in all
vectors preserves every dot product, so we may assume $x_1 = 0$, from which the contradiction
follows as above.
Now, assume that $x_1 \neq 0$. Setting $v_1 \cdot v_2 = 0$ gives
\[ x_1 x_2 + y_1 y_2 = 0 \quad\Rightarrow\quad x_2 = \frac{-y_1 y_2}{x_1}, \]
and similarly, from $v_1 \cdot v_3 = 0$,
\[ x_1 x_3 + y_1 y_3 = 0 \quad\Rightarrow\quad x_3 = \frac{-y_1 y_3}{x_1}. \]
Then $v_2 \cdot v_3 = 0$, so
\[ x_2 x_3 + y_2 y_3 = 0 \]
\[ \left(\frac{-y_1 y_2}{x_1}\right)\left(\frac{-y_1 y_3}{x_1}\right) + y_2 y_3 = 0 \]
\[ \frac{y_1^2 y_2 y_3}{x_1^2} + y_2 y_3 = 0 \]
\[ y_2 y_3\left(\frac{y_1^2}{x_1^2} + 1\right) = 0 \]
Then either $y_2 = 0$ (contradiction), or $y_3 = 0$ (contradiction), or $\frac{y_1^2}{x_1^2} + 1 = 0$, giving $y_1^2 = -x_1^2$.
Since the left hand side is non-negative, and the right hand side is non-positive, the only
way to satisfy this equation is $y_1 = x_1 = 0$, a contradiction.
2. (7 points) Consider an arbitrary matrix $A \in \mathbb{R}^{2\times 2}$,
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \]
We would like A to satisfy the following properties:
(a) A is non-invertible.
(b) A is symmetric.
(c) All the entries of A are positive.
(d) A has a positive eigenvalue λ = p > 0.
Find the set of all matrices A (as a function of p) that satisfy all the above constraints.
Solution. $A$ is symmetric, so $b = c$. We can eliminate $c$ and consider only matrices of the
form
\[ A = \begin{bmatrix} a & b \\ b & d \end{bmatrix}. \]
$A$ is non-invertible, so $ad - b^2 = 0$. Also, $a, b, d > 0$. We have that $p > 0$ is an eigenvalue,
so we find all eigenvalues using the characteristic polynomial:
\[ (a - \lambda)(d - \lambda) - b^2 = 0 \]
\[ ad - a\lambda - d\lambda + \lambda^2 - b^2 = 0 \]
\[ -a\lambda - d\lambda + \lambda^2 = 0 \]
\[ \lambda(\lambda - a - d) = 0 \]
So $\lambda = 0$ or $\lambda = a + d$. Since 0 cannot be positive, we have that $p = a + d$, or $d = p - a$. We
can eliminate $d$:
\[ A = \begin{bmatrix} a & b \\ b & p - a \end{bmatrix} \]
Finally, since $ad - b^2 = 0$, we have $a(p - a) = b^2$, and hence $b = \pm\sqrt{a(p - a)}$. Since all
entries need to be positive, we take the positive root, $b = \sqrt{a(p - a)}$. For the square root
to be defined (and not zero), we need $a(p - a) > 0$, which implies that $0 < a < p$. Hence,
the set of all possible $A$ is given by
\[ \left\{ \begin{bmatrix} a & \sqrt{a(p - a)} \\ \sqrt{a(p - a)} & p - a \end{bmatrix} : 0 < a < p \right\} \]
as required.
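As a quick sanity check (not part of the required working), the short NumPy sketch below builds one member of this family for the illustrative values p = 3 and a = 1 and verifies the four constraints numerically:

```python
import numpy as np

# Illustrative values only: any p > 0 and 0 < a < p should work.
p, a = 3.0, 1.0
b = np.sqrt(a * (p - a))
A = np.array([[a, b],
              [b, p - a]])

print(np.isclose(np.linalg.det(A), 0.0))              # (a) non-invertible: det = 0
print(np.allclose(A, A.T))                            # (b) symmetric
print((A > 0).all())                                  # (c) all entries positive
print(np.any(np.isclose(np.linalg.eigvalsh(A), p)))   # (d) p is an eigenvalue
```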
• Section 2. Analytic Geometry and Vector Calculus (12 points)
1. (6 points) Find all matrices $T \in \mathbb{R}^{2\times 2}$ such that for any $v \in \mathbb{R}^2$,
\[ v^T v = (Tv)^T v = (Tv)^T (Tv). \]
Solution. Consider $v \cdot v = Tv \cdot v$. Let $v = [x, y]^T$ and write $T = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$. Then
\[ v \cdot v = Tv \cdot v \]
\[ x^2 + y^2 = x(ax + by) + y(cx + dy) \]
\[ x^2 + y^2 = ax^2 + (b + c)xy + dy^2 \]
By choosing x = 0, y = 1, we obtain d = 1, and by choosing x = 1, y = 0 we obtain a = 1.
Then, we choose x = 1 and leave y arbitrary for the moment. Substituting, we obtain
\[ 1 + y^2 = 1 + (b + c)y + y^2 \]
from which we obtain (b+ c)y = 0 for all y, so c = −b. We now consider the other equation
\[ v \cdot v = Tv \cdot Tv \]
\[ x^2 + y^2 = (ax + by)^2 + (cx + dy)^2 \]
\[ x^2 + y^2 = (a^2 + c^2)x^2 + (b^2 + d^2)y^2 + 2(ab + cd)xy \]
\[ x^2 + y^2 = (1 + b^2)x^2 + (1 + b^2)y^2 + 2(b - b)xy \]
\[ x^2 + y^2 = (1 + b^2)(x^2 + y^2) \]
\[ 1 = 1 + b^2 \]
\[ b = 0 \]
So a = 1, b = 0, c = −b = 0, d = 1. So the only matrix that satisfies this property is the
identity.
\[ T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \]
2. (6 points) Let $x \in \mathbb{R}^{n\times 1}$ and $A \in \mathbb{R}^{n\times n}$, and define $f : \mathbb{R}^{n\times 1} \to \mathbb{R}$ as
\[ f(x) = 3\exp(2x^T A x), \]
where $\exp(x) := e^x$ is the exponential function. Compute $\nabla_x f(x)$.
Solution. We apply the chain rule. Note that $f(x) = g(h(x))$, where $g : \mathbb{R} \to \mathbb{R}$, $g(h) = 3\exp(2h)$
and $h : \mathbb{R}^n \to \mathbb{R}$, $h(x) = x^T A x$. We have by lectures/assignment that $\nabla_x h(x) = x^T(A + A^T)$,
and by calculus we have that $\nabla_h g(h) = 6\exp(2h)$. Therefore
\[ \frac{df}{dx} = \frac{dg}{dh}\frac{dh}{dx} = 6\exp(2h)\, x^T(A + A^T) = 6\exp(2x^T A x)\, x^T(A + A^T). \]
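To gain confidence in the formula, one can compare it with a central finite-difference approximation; the sketch below does so for a random A and x (the dimension n = 4, the scaling, and the step size are incidental choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = 0.3 * rng.normal(size=(n, n))   # arbitrary test matrix, scaled to keep the exponent small
x = 0.3 * rng.normal(size=n)

f = lambda v: 3.0 * np.exp(2.0 * v @ A @ v)

# Analytic gradient from the solution, written here as a column vector: 6 exp(2 x^T A x) (A + A^T) x
grad = 6.0 * np.exp(2.0 * x @ A @ x) * (A + A.T) @ x

# Central finite-difference approximation of the gradient
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad, numeric, rtol=1e-5))   # should print True
```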
• Section 3. Probability (15 points)
Consider the following scenario. I have two boxes, box A and box B. In their initial configuration,
Box A contains 3 red balls and 3 white balls; Box B contains 6 red balls.
We swap the balls in boxes via the following procedure.
Procedure 1: We draw a ball xA uniformly at random from box A, and draw a ball xB uniformly
at random from box B. We place ball xA into box B and place ball xB into box A.
Let RA and RB be random variables representing the number of red balls in box A and box B
respectively.
1. (3 points) Describe the probability mass functions $p(R_A = n)$ and $p(R_B = n)$ for all integers
$n \geq 0$ for the initial configuration of the boxes.
Solution. Clearly, box A contains 3 red balls, so
\[ p(R_A = n) = \begin{cases} 1 & n = 3 \\ 0 & n \neq 3 \end{cases} \]
and box B contains 6 red balls, so
\[ p(R_B = n) = \begin{cases} 1 & n = 6 \\ 0 & n \neq 6 \end{cases} \]
2. (3 points) After performing Procedure 1, what is the new joint probability mass function
$p(R_A = n_a, R_B = n_b)$?
Solution. We are guaranteed to move a red ball from box B into box A, and the ball $x_A$ chosen
from box A is equally likely to be red or white. If $x_A$ is red, nothing changes (probability 0.5);
if $x_A$ is white (probability 0.5), box A gains a red ball and box B loses one. Hence,
\[ p(R_A = n_a, R_B = n_b) = \begin{cases} 0.5 & (n_a, n_b) = (3, 6) \\ 0.5 & (n_a, n_b) = (4, 5) \\ 0 & \text{else} \end{cases} \]
We now describe a new method of swapping balls between boxes.
Procedure 2: We draw a ball xA uniformly at random from box A, and place it in box B. Then,
we draw a ball xB uniformly at random from box B, and place it in box A.
3. (2 points) Is Procedure 2 equivalent to Procedure 1? Why or why not? Use no more than
3 sentences to explain your answer.
Solution. No, as we place $x_A$ into box B before selecting a ball from box B. This means
it is possible (though unlikely) that the ball drawn from B is the very ball just moved there
from A, so it is possible that $x_B$ is white (which was impossible under Procedure 1).
4. (3 points) We reset the experiment to the initial configuration, and then perform Procedure
2.
(a) Compute the probability xA is white.
(b) Compute the probability that xB is white.
Solution. Box A has the same number of red and white balls, so p(xA = white) = 1/2.
Now, the outcome of xB depends on what xA was, so we use the sum and product rules.
\[ p(x_B = \text{white}) = p(x_B = \text{white} \mid x_A = \text{white})\,p(x_A = \text{white}) + p(x_B = \text{white} \mid x_A = \text{red})\,p(x_A = \text{red}) \]
If $x_A$ = white, then box B now has 6 red balls and 1 white ball, so $p(x_B = \text{white} \mid x_A = \text{white}) = 1/7$.
If $x_A$ = red, then box B still contains only red balls, so $p(x_B = \text{white} \mid x_A = \text{red}) = 0$. Hence,
\[ p(x_B = \text{white}) = 1/7 \times 1/2 + 0 \times 1/2 = 1/14. \]
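A quick Monte Carlo simulation of Procedure 2 (purely a sanity check, with an arbitrary seed and trial count) gives an estimate close to 1/14 ≈ 0.0714:

```python
import random

def procedure2_xB_is_white(rng: random.Random) -> bool:
    box_a = ['r'] * 3 + ['w'] * 3
    box_b = ['r'] * 6
    # Draw x_A from box A and place it into box B, ...
    x_a = box_a.pop(rng.randrange(len(box_a)))
    box_b.append(x_a)
    # ... then draw x_B from box B (which now has 7 balls).
    x_b = box_b.pop(rng.randrange(len(box_b)))
    return x_b == 'w'

rng = random.Random(0)
trials = 200_000
hits = sum(procedure2_xB_is_white(rng) for _ in range(trials))
print(hits / trials)   # expected to be close to 1/14 ≈ 0.0714
```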
5. (4 points) We reset the experiment to the initial configuration, and then perform Proce-
dure 1. After that, we select either box A or box B by some uniformly random procedure
(like flipping a fair coin), and draw a ball from the selected box. The ball drawn was white.
What was the probability that box B was selected?
Solution.
\[ p(\text{Box} = B \mid \text{ball} = w) = \frac{p(\text{ball} = w \mid \text{Box} = B)\,p(\text{Box} = B)}{p(\text{ball} = w \mid \text{Box} = A)\,p(\text{Box} = A) + p(\text{ball} = w \mid \text{Box} = B)\,p(\text{Box} = B)} \]
First, consider $p(\text{ball} = w \mid \text{Box} = B)$. If box B were selected, there is a 1/2 chance of it
containing 5 red balls and 1 white ball, and a 1/2 chance of it containing 6 red balls. So
\[ p(\text{ball} = w \mid \text{Box} = B) = 1/2 \times 1/6 + 1/2 \times 0 = 1/12. \]
If box A were selected, there is a 1/2 chance of it containing 3 red and 3 white balls, and a 1/2
chance of 4 red and 2 white balls, so
\[ p(\text{ball} = w \mid \text{Box} = A) = 1/2 \times 3/6 + 1/2 \times 2/6 = 5/12. \]
Hence,
\[ p(\text{Box} = B \mid \text{ball} = w) = \frac{1/12 \times 1/2}{5/12 \times 1/2 + 1/12 \times 1/2} = \frac{1}{6}. \]
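This conditional probability can likewise be checked by simulation (again only a sketch with an arbitrary seed and trial count): run Procedure 1, pick a box at random, draw a ball, and among the runs where the ball is white record how often box B was the selected box.

```python
import random

def one_trial(rng: random.Random):
    box_a = ['r'] * 3 + ['w'] * 3
    box_b = ['r'] * 6
    # Procedure 1: draw one ball from each box, then swap the two balls.
    i, j = rng.randrange(6), rng.randrange(6)
    box_a[i], box_b[j] = box_b[j], box_a[i]
    # Pick a box uniformly at random and draw one ball from it.
    box_name, box = rng.choice([('A', box_a), ('B', box_b)])
    ball = box[rng.randrange(len(box))]
    return box_name, ball

rng = random.Random(0)
white_draws = b_given_white = 0
for _ in range(500_000):
    box_name, ball = one_trial(rng)
    if ball == 'w':
        white_draws += 1
        b_given_white += (box_name == 'B')
print(b_given_white / white_draws)   # expected to be close to 1/6 ≈ 0.167
```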
• Section 4. Clustering and Gaussian Mixture Model (GMM) (16 points)
Suppose you have N data points in R: x1, x2, ..., xn, xn+1, ..., xN . Now you want to partition
them into two clusters C1 and C2. First, you assign x1, x2, ..., xn into C1 and assign xn+1, ..., xN
into C2. After that, you want to calculate the new cluster centers µ1 and µ2 of C1 and C2. For
each cluster, your objective is to minimise the averaged squared distances between each data
point and its assigned center.
1. (3 points) Calculate µ1 and µ2 that achieve your objective (M-step). (Only showing the result
without the optimisation process will lead to 0 marks.) Hint: after the assignment, you can
only use the first n points to calculate µ1, and the remaining points to calculate µ2.
Solution. We denote the two centers as $\mu_1$ and $\mu_2$, respectively. The loss function can be
written as:
\[ L = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_1)^2 + \frac{1}{N - n}\sum_{i=n+1}^{N}(x_i - \mu_2)^2. \]
We then calculate the partial derivatives of $L$ w.r.t. $\mu_1$ and $\mu_2$. We have
\[ \frac{\partial L}{\partial \mu_1} = \frac{2}{n}\sum_{i=1}^{n}(\mu_1 - x_i), \qquad \frac{\partial L}{\partial \mu_2} = \frac{2}{N - n}\sum_{i=n+1}^{N}(\mu_2 - x_i). \]
We set both partial derivatives to 0. We can easily get
\[ \mu_1 = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mu_2 = \frac{1}{N - n}\sum_{i=n+1}^{N} x_i. \]
2. (2 points) Are µ1 and µ2 a global minimum solution to this M-step (calculating the means
of clusters)? Use no more than 3 sentences to explain your answer.
Solution. Yes. Reason 1: the assignment strategy (first n points to C1 and the rest to C2)
is given. Reason 2: the optimisation of $\mu_1$ does not depend on $\mu_2$, and the optimisation of
$\mu_2$ does not depend on $\mu_1$. Moreover, the loss is a convex quadratic in each of $\mu_1$ and $\mu_2$,
so the stationary point found above is the global minimum.
Given µ1 and µ2 calculated in Question 1 above, you now want to update the assignment (E-
step). This time again you aim to minimise the sum of squared distances between each data
point and its center.
3. (3 points) What is your optimal assignment strategy (you can only assign a data point to
one cluster, i.e., hard assignment)? Why is it optimal for your aim? (You can use whatever
is helpful to illustrate, such as figures and maths. Only showing the result without a clear
demonstration/proof will lead to 0 marks.)
Solution. We should assign each point to its closest center.
To prove this, note that the total loss is a sum of per-point terms, so the assignment of each
individual data point can be optimised independently. For data point $x_i$, because of hard
assignment, the loss can be written as
\[ L_h = \begin{cases} (x_i - \mu_1)^2, & \text{if } x_i \text{ is assigned to center 1} \\ (x_i - \mu_2)^2, & \text{if } x_i \text{ is assigned to center 2.} \end{cases} \]
Clearly, $L_h$ is minimised when the assignment always takes the lesser of $(x_i - \mu_1)^2$ and
$(x_i - \mu_2)^2$. That is, $x_i$ should be assigned to the center it is closer to.
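For concreteness, the M-step and E-step above can be written in a few lines of NumPy; the sketch below runs one update on made-up 1-D data with an arbitrary initial split at index n (all values are illustrative only):

```python
import numpy as np

x = np.array([0.2, 0.5, 0.9, 4.1, 4.8, 5.3, 0.1, 5.0])   # made-up 1-D data
n = 4                                                     # arbitrary initial split

# M-step: each centre is the mean of the points currently assigned to it.
mu1, mu2 = x[:n].mean(), x[n:].mean()

# E-step (hard assignment): assign every point to its closest centre.
assign = np.where((x - mu1) ** 2 <= (x - mu2) ** 2, 1, 2)

print(mu1, mu2)
print(assign)
```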
4. (2 points) Is this assignment a global minimum solution to the E-step under hard assign-
ment? Use no more than 3 sentences to explain your answer.
Solution. Yes. This is because µ1 and µ2 are considered as given. Moreover, hard assign-
ment is used. Under these conditions, the optimization of the assignment strategy gives a
global minimum to the E-step.
Following the above questions, we do some further exploration. The GMM performs soft assignment,
i.e., assigning a data point to multiple clusters, and this assignment is accompanied by
responsibilities. Now, we want to explore a similar scheme in k-means. Specifically, we define
\[ r_{m2} = \frac{(x_m - \mu_1)^2}{(x_m - \mu_1)^2 + (x_m - \mu_2)^2}, \qquad r_{m1} = \frac{(x_m - \mu_2)^2}{(x_m - \mu_1)^2 + (x_m - \mu_2)^2}, \]
where rm2 denotes how likely xm belongs to C2, and rm1 denotes how likely xm belongs to C1.
1. (2 points) Suppose $x_m$ is assigned to C1 under hard assignment, then under soft assignment,
which cluster will bear a higher responsibility for $x_m$? Use two sentences to explain.
Solution. If $x_m$ is assigned to C1 under hard assignment, it means $(x_m - \mu_1)^2 < (x_m - \mu_2)^2$.
Therefore, $r_{m2} < r_{m1}$, so C1 bears the higher responsibility for $x_m$.
2. (4 points) Given µ1 and µ2, when looking at a single data point xm, does soft assignment
have a lower loss value than the hard assignment? Prove it. Hint: your loss function will be
the sum of the squared distance from xm to each center multiplied by the “responsibility”
rmk, k = 1, 2.
Solution. No. The soft-assignment loss is
\[ L_s = r_{m1}(x_m - \mu_1)^2 + r_{m2}(x_m - \mu_2)^2. \]
When $(x_m - \mu_1)^2 < (x_m - \mu_2)^2$,
\[ L_h = (x_m - \mu_1)^2. \]
Using the definitions of $r_{m1}$ and $r_{m2}$ above,
\begin{align*}
L_h - L_s &= (x_m - \mu_1)^2 - r_{m1}(x_m - \mu_1)^2 - r_{m2}(x_m - \mu_2)^2 \\
&= \frac{(x_m - \mu_1)^4 + (x_m - \mu_1)^2(x_m - \mu_2)^2 - 2(x_m - \mu_1)^2(x_m - \mu_2)^2}{(x_m - \mu_1)^2 + (x_m - \mu_2)^2} \\
&= \frac{\left((x_m - \mu_1)^2 - (x_m - \mu_2)^2\right)(x_m - \mu_1)^2}{(x_m - \mu_1)^2 + (x_m - \mu_2)^2} < 0.
\end{align*}
Therefore, the soft assignment actually gives a higher loss value than the hard assignment.
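A quick numeric check using the responsibility definitions above (with arbitrary example values for x_m, µ1 and µ2) illustrates that the soft loss is never below the hard loss:

```python
xm, mu1, mu2 = 1.0, 0.5, 3.0         # arbitrary example values
d1, d2 = (xm - mu1) ** 2, (xm - mu2) ** 2

r_m1 = d2 / (d1 + d2)                # responsibility for C1, as defined above
r_m2 = d1 / (d1 + d2)                # responsibility for C2

L_hard = min(d1, d2)
L_soft = r_m1 * d1 + r_m2 * d2
print(L_hard, L_soft, L_soft >= L_hard)   # soft loss is larger (equal only if d1 == d2)
```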
• Section 5. Linear Regression (13 points)
1. (4 points) In least squares linear regression, our empirical risk is
\[ R = \frac{1}{N}\sum_{n=1}^{N}(y_n - \theta^T x_n)^2, \]
where $x_n \in \mathbb{R}^d$ is a training sample, $y_n \in \mathbb{R}$ is its label. $\theta$ contains the model parameters.
Now we use a sigmoid function in this empirical risk, i.e.,
\[ R = \frac{1}{N}\sum_{n=1}^{N}\big(y_n - \mathrm{sigmoid}(\theta^T x_n)\big)^2. \]
In no more than 4 sentences, state the reason why using sigmoid function in this way is not
desirable.
Solution. The sigmoid function can only output values between 0 and 1, while the ground-truth
$y$ could take any other value. Moreover, the sigmoid saturates when $\theta^T x$ is very large (or very
small), giving almost identical outputs and hence almost identical loss values. This property
compromises model fitting and prediction when $\theta^T x$ is very large or very small.
2. Now we have four data points, shown in the figure below. We use the least squares error in
regression.
[Figure: two panels, (a) and (b), each showing four data points in the x–y plane.]
1) (3 points) In (a), draw the linear regression output and the first principal component.
Are they of the same direction/orientation? Use no more than 3 sentences to explain
why.
Solution. As shown in figure (a) below, the linear regression line should be y = 1, while the
first principal component is vertical, so they are not of the same direction/orientation. Students
get full marks if the PCA result is vertical.
2) (3 points) In (b), draw the regression output, using language to describe when necessary.
Note: you can use any drawing software (PowerPoint, a PDF editor, etc.) to draw directly
on the figure; otherwise, approximately draw the four points on your paper
and then draw the regression/PCA lines.
Solution. As shown in figure (b) below, the linear regression output should be any
straight line passing the point (2, 1) except the vertical one.
[Solution figure: the two panels above, redrawn with the lines described in solutions 1) and 2).]
3. (3 points) In the lecture slides, to derive a compact formula, we include a dummy feature
$x_0 = 1$ in the sample vector $x$, and correspondingly $\theta$ includes the coefficient $\theta_0$ of
this dummy feature. Now I want to remove this dummy feature and its coefficient from the
regression model. Will the resulting model give better test accuracy after training (sufficient
training samples, no over-fitting)? In 3 sentences justify your answer.
Solution. Removing this dummy dimension from the model will never give better performance.
This is because the new model is forced to pass through the origin, so its expressive
power is usually lower. The resulting model will have performance similar to the original
model only if the underlying function itself goes through the origin, i.e., only if $\theta_0$ would be
0 anyway.
• Section 6. Principal Component Analysis (PCA) and Linear Regression (14 points)
1. (4 points) Give the main steps for principal component analysis. For example, given $X \in \mathbb{R}^{D\times N}$,
list the steps to reduce dimensions to $D - 1$, and calculate a new $X \in \mathbb{R}^{(D-1)\times N}$.
Solution. 1. Standardise the data: subtract the mean and divide by the standard deviation.
2. Perform the eigendecomposition of the covariance matrix $\frac{1}{N}XX^T$.
3. Select the $D-1$ largest eigenvalues; the corresponding $D-1$ eigenvectors constitute the
projection matrix $B$. The new $(D-1)$-dimensional representation is $B^T X \in \mathbb{R}^{(D-1)\times N}$, and
$\tilde{X} = BB^T X$ projects it back into the original coordinates.
4. We undo the standardisation, yielding the dimension-reduced dataset in the original
space (see the sketch below).
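The listed steps map directly onto a few lines of NumPy. The sketch below is only an illustration: the data X is randomly generated, and the exact standardisation and sign conventions may differ from the lecture slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 200
X = rng.normal(size=(D, N)) * rng.uniform(0.5, 3.0, size=(D, 1))   # toy data

# 1. Standardise: subtract the mean and divide by the standard deviation (per dimension).
mean, std = X.mean(axis=1, keepdims=True), X.std(axis=1, keepdims=True)
Xs = (X - mean) / std

# 2. Eigendecomposition of the covariance matrix (1/N) X X^T.
cov = Xs @ Xs.T / N
eigvals, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order

# 3. Keep the D-1 eigenvectors with the largest eigenvalues.
B = eigvecs[:, np.argsort(eigvals)[::-1][:D - 1]]     # projection matrix, D x (D-1)
Z = B.T @ Xs                                          # (D-1) x N reduced representation

# 4. Project back and undo the standardisation to view the result in the original space.
X_reduced = (B @ Z) * std + mean
print(Z.shape, X_reduced.shape)
```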
2. (2 points) As shown in the figure below, we generate a 2D dataset with size 100 × 2. The
eigenvalues and corresponding eigenvectors are given under the figure. Compute θ, which
is defined as the angle (≤ 90◦) between the first principal component and x−axis.
Solution. The larger eigenvalue is $\lambda_2 = 1683$, so the first principal component is $v_2 = [-0.71, -0.71]^T$.
Then $\tan(\theta) = \frac{-0.71}{-0.71} = 1$, so $\theta = 45^\circ$.
[Figure: scatter plot of the 100 two-dimensional data points in the x–y plane.]
$\lambda_1 = 316, \quad \lambda_2 = 1683$
$v_1 = [-0.71, 0.71]^T, \quad v_2 = [-0.71, -0.71]^T$
3. (2 points) Following the previous question, if we reduce the original data to one-dimensional
data using PCA, calculate the percentage of information loss.
Solution. ratio = $\frac{316}{316 + 1683} = 15.81\%$.
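Both numbers (the 45° angle and the 15.81% loss) can be reproduced directly from the eigenpairs quoted under the figure:

```python
import numpy as np

lam1, lam2 = 316.0, 1683.0
v2 = np.array([-0.71, -0.71])        # eigenvector of the larger eigenvalue

theta = np.degrees(np.arctan2(abs(v2[1]), abs(v2[0])))   # angle (<= 90 deg) to the x-axis
loss = lam1 / (lam1 + lam2)                              # fraction of variance discarded
print(theta)               # ~45.0
print(round(loss, 4))      # ~0.1581, i.e. 15.81%
```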
4. (3 points) Suppose we have sufficient training data. If we use PCA to first project the
training data onto a few principal components, i.e., performing dimension reduction, will
this lower-dimensional training set give a better linear regression model than the original
higher-dimensional training data? In three sentences, justify your answer.
Solution. No, the new model trained with the lower-dimensional data samples is no better than
the old one. This is because PCA inevitably incurs some information loss (although
not much). Since we have sufficient training data, the higher-dimensional data samples will
not cause over-fitting and, having lost no information, allow better model training.
5. (3 points) Suppose you perform the following PCA operations on your d-dimensional data
points of which the covariance matrix has d distinct eigenvalues. Operation 1: project
the data onto a j-dimensional space, and then project the j-dimensional data onto a
g-dimensional space. Operation 2: project the d-dimensional data directly onto the g-
dimensional space. Here d > j > g. Are the PCA results of the two operations the same?
In no more than 5 sentences explain your answer.
Solution. The results are the same. Because the covariance matrix has distinct eigenvalues,
its eigenvectors are orthogonal to each other. After projecting the data onto the j-dimensional
space, the eigenvectors associated with the $j$ largest eigenvalues remain, while the other $d - j$ are
discarded. The discarded ones do not influence the remaining ones, as they are orthogonal.
So no matter how many times you perform PCA, as long as the final dimension is the same,
the finally selected eigenvectors are the same, and so the PCA results are the same.
• Section 7. Classification (17 points)
In your lecture slides, the zero-one loss, hinge loss and logistic loss (with base e) can be plotted
as in the following figure. The loss functions are written as functions of $z = y(\theta^T x)$. Here,
$y \in \{-1, 1\}$ is the label of data sample $x \in \mathbb{R}^d$, where $d$ is the feature dimension, and $\theta \in \mathbb{R}^d$
contains the model parameters.
[Figure: the zero-one, hinge, and logistic losses plotted against $z = y(\theta^T x)$.]
1. (1 point) What does a small z mean? What does a large z mean? Use one sentence to
answer each.
Solution. When z is large, the classifier makes a correct prediction; when z is small, the
classifier makes an incorrect prediction.
2. (3 points) If we use the least squares loss function to train a classifier, please write down
the loss function 1) using $\theta^T x$ as argument, and 2) using $z$ as argument.
Solution. When using $\theta^T x$ as argument,
\[ \mathrm{Loss}_1 = \frac{1}{N}\sum_{n=1}^{N}(y - \theta^T x)^2. \]
Now we use $z = y\theta^T x$ as argument. When $y = 1$,
\[ \mathrm{Loss}_2 = \frac{1}{N}\sum_{n=1}^{N}(1 - y\theta^T x)^2 = \frac{1}{N}\sum_{n=1}^{N}(1 - z)^2. \]
When $y = -1$,
\[ \mathrm{Loss}_2 = \frac{1}{N}\sum_{n=1}^{N}(-1 + y\theta^T x)^2 = \frac{1}{N}\sum_{n=1}^{N}(1 - y\theta^T x)^2 = \frac{1}{N}\sum_{n=1}^{N}(1 - z)^2. \]
Therefore, $\mathrm{Loss}_2 = \frac{1}{N}\sum_{n=1}^{N}(1 - z)^2$.
3. (1 point) In the figure above, draw the least squares loss function in the same figure with
the other three loss functions.
Note: you can use any drawing software (PowerPoint, a PDF editor, etc.) to draw directly
on the figure; otherwise, roughly replicate this figure on paper and then add the curve for
the least squares loss.
Solution. This curve should be $L = (1 - z)^2$.
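For reference, a small plotting sketch (assuming matplotlib is available; the z-range and styling are arbitrary) that draws the four losses on one set of axes:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-2, 3, 500)
plt.plot(z, (z <= 0).astype(float), label='zero-one')
plt.plot(z, np.maximum(0.0, 1.0 - z), label='hinge')
plt.plot(z, np.log(1.0 + np.exp(-z)), label='logistic')
plt.plot(z, (1.0 - z) ** 2, label='least squares')
plt.xlabel(r'$z = y(\theta^T x)$')
plt.ylabel('loss(z)')
plt.legend()
plt.show()
```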
4. (2 points) From what you have drawn, explain in 2 sentences why least squares is not a
good choice for classifier training.
Solution. The least squares loss gives positive loss values even for correctly classified samples
with $z > 1$, and as $z$ increases from 1 towards infinity the loss grows ever faster (quadratically),
meaning that the classifier will be updated a lot even on easy (correctly classified) samples.
5. (2 points) Following the slides, we now transform $z$ into a probability:
\[ p = p(y \mid x) = \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-y\theta^T x}}. \]
Now rewrite the logistic loss using $p$ as argument. For your convenience, the logistic loss is
$\mathrm{Loss}_L(z) = \log(1 + e^{-z})$ when using $z$ as argument.
Solution. It is easy to see that
\[ e^{-z} = p^{-1} - 1. \]
Therefore, $\mathrm{Loss}_L(p) = \log(1 + e^{-z}) = \log(1 + p^{-1} - 1) = \log(p^{-1}) = -\log(p)$.
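A one-line numerical check (a sketch over an arbitrary grid of z values) confirms the identity $\log(1 + e^{-z}) = -\log(\mathrm{sigmoid}(z))$:

```python
import numpy as np

z = np.linspace(-5, 5, 101)
p = 1.0 / (1.0 + np.exp(-z))                                 # p = sigmoid(z)
print(np.allclose(np.log(1.0 + np.exp(-z)), -np.log(p)))     # True
```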
6. (2 points) In the loss function $\mathrm{Loss}_L(p)$ you just derived, what does a small $p$ mean? What
does a large $p$ mean?
Solution. A small p (p < 0.5) means a data sample is incorrectly classified; A large p
(p > 0.5) means a data sample is correctly classified.
7. (2 points) Suppose you are working on a two-way classification task. During classifier training,
while some samples can be easily and successfully classified into the correct class, others
are rather difficult and oftentimes misclassified. You want to solve this issue by designing
a loss function that allows the classifier to focus more on such difficult and easily misclassified
samples. We have two loss functions for you to choose from, both expressed as a function of
$p$: $L_1(p)$ and $L_2(p)$. We draw these loss functions against $p$ in the figure below. Which loss
function would you choose? Use no more than 3 sentences to justify your answer.
Solution. I would like to use the red loss function L2. In L2, the loss for easy samples (those
correctly classified) is very small when p > 0.6; the loss for hard samples (those incorrectly
classified) remains large. It is clear from the figure that L2 pays much less attention to the
easy samples while focusing primarily on the hard samples.
[Figure: the candidate losses L1 and L2 plotted against p ∈ [0, 1], with Loss(p) ranging from 0 to 4.]
8. You want a K-class classifier, where the number of classes K > 2. According to the lecture
slides, this classifier contains K linear functions
\[ g_k(x) = \theta_k^T x + \theta_{k0}, \qquad k = 1, \ldots, K. \]
Here, $g_k(x) \in \mathbb{R}$ characterises how likely sample $x$ belongs to the kth class. Besides, the vector
$[\theta_k^T, \theta_{k0}]^T$ is also called the prototype of the kth class. If a test sample is closest to the
prototype of the ith class, it will be classified into the ith class. Suppose that, for some reason,
you find it very undesirable to train the classifier using methods like gradient descent.
In other words, you find that using what we've learned in the classification lecture results in a
poor classifier.
1) (2 points) In two sentences, give a potential reason why the classifier trained by gradient
descent gives very poor performance (suppose you make no mistakes in programming and
math). Trivial answers (e.g., my computer is down) will receive 0.
Solution. There are too few training samples, which would lead to severe over-fitting if
training is undertaken.
2) (2 points) Propose a way to calculate the prototype vectors that could give decent
classification performance without training the classifier.
Solution. For each class, calculate the average data vector and use this vector (plus the
dummy dimension 1) as the prototype.
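A minimal sketch of this idea, often called a nearest-class-mean classifier, on made-up data; the class count, the data generation, and the omission of the dummy dimension are simplifications for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 2
means = rng.normal(scale=4.0, size=(K, d))                       # made-up class centres
X = np.vstack([m + rng.normal(size=(50, d)) for m in means])     # 50 samples per class
y = np.repeat(np.arange(K), 50)

# "Training" without gradient descent: each prototype is the average vector of its class.
prototypes = np.stack([X[y == k].mean(axis=0) for k in range(K)])

# Classification: assign a test sample to the class with the closest prototype.
x_test = rng.normal(scale=4.0, size=d)
pred = np.argmin(np.linalg.norm(prototypes - x_test, axis=1))
print(pred)
```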
——— End of the paper ——–