ST 443 Final Exam Solutions (2020)
1. (a) i. False. yi is not linear in α1, α2, and no simple transformation will make it linear
(log(xα1α2i + εi) 6= α1 log xi + αi + εi)
ii. True. The derivative of L2 loss can grow without bound whereas the L1 loss
gradient is bounded, hence the influence of an outlier is limited.
iii. False. There are no guarantees that the support vectors remain the same. The
feature vectors corresponding to polynomial kernels are non-linear functions of
the original input vectors and thus the support points for maximum margin
separation in the feature space can be quite different.
iv. False. Recursive binary splitting ensures axis-alighed splits, so the topleft split
is impossible.
v. False. In order to decorrelate the trees, a random selection of input variables at
each split (not internal nodes) are discarded.
(b) In both cases, the decision boundary is piecewise linear. Decision trees do axis-aligned
splits while 1-nearest neighbour gives a voronoi diagram.
(c) Suppose there are K > 2 classes.
• One-versus-one classification: fit all K(K − 1)/2 pairwise classifiers and classify
the observation to the class that is most frequently assigned among all pairwise
classifications. The main disadvantages: high computational cost when K is large
and the sample size may not be sufficiently large.
• One-versus-all classification: fit K binary SVMs classifiers, f̂k(x), k = 1, . . . ,K.
Classify the observation to the class for which the corresponding f̂ is the largest.
observations for binary classes when K is large.
2. (a) i. See the dataset in Figure 1
ii. See the dataset in Figure 2
(b) i. The left give points are misclassified and the error is 5/10.
ii. The left ‘-’ point is misclassified and the error is 1/10.
(c) 1-nearest neighbour error at x is p0(x)p1(x
′) + p0(x′)p1(x).
(d) Augment Y with p additional zeros and get Y˜ = (Y,0)T and augment X with the
multiple of the p× p identity matrix √λαIp to get
[
X√
λαIp
]
and set λ˜ = λ(1− α).
Then we solve the lasso minimization problem min
β∈Rp
||Y˜ − X˜β||2 + λ˜∑pj=1 |βj |.
3. (a) We have P (Y = 1) = P (Y = 0) = 1/2. fBayes = arg max
y
P (Y = y|X = x) =
arg max
y
P (X = x|Y = y)P (Y = y) = arg max
y
P (X = x|Y = y). Therefore fBayes(1) =
c©LSE ST 2020/ST443 Page 1 of 3
Figure 1: Solution to Question 2. (b) i.
Figure 2: Solution to Question 2. (b) ii.
1 since p = P (X = 1|Y = 1) > P (X = 1|Y = 0) = q and fBayes(0) = 0 since
1− p = P (X = 0|Y = 1) < P (X = 0|Y = 0) = 1− q. Hence Bayes optimal classifier
is fBayes(x) = x. Its risk is P (X 6= Y ) = P (Y = 1)P (X = 0|Y = 1) + P (Y =
0)P (X = 1|Y = 0) = 12(1− p) + 12q.
(b) i. We can write arg min
f
E`α,β(f(X), Y )
= arg min
f
EX,Y {αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}
= arg min
f
EX [EY |X{αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}]
= arg min
f
EX [

y=0,1
{αI(f(X) = 1, y = 0) + βI(f(X) = 0, y = 1)]P (Y = y|X)
= arg min
f

x
[αI(f(x) = 1)P (Y = 0|x) + βI(f(x) = 0)P (Y = 1|x)]dP (x)
We may minimize the integrand at each x by taking fBayes(x) = 1 if βP (Y =
1|x) ≥ αP (Y = 0|x) and 0 if βP (Y = 1|x) < αP (Y = 0|x).
c©LSE ST 2020/ST443 Page 2 of 3
ii.
E`α,β(f(X), Y ) = αP (f(X) = 1, Y = 0) + βP (f(X = 0, Y = 1)
= αP (f(X) = 1|Y = 0)P (Y = 0) + βP (f(X) = 0|Y = 1)P (Y = 1),
which is same as R if α = 1P (Y=0) and β =
1
P (Y=1) .
4. (a) Sλ = B(B
TB + λΩ)−1BT =
(
B−T (BTB + λΩ)B−1
)−1
= (I + λB−TΩB−1)−1. So
K = B−TΩB−1 and does not depend on λ.
(b) Let U = (u1, . . . ,un) and D = diag(d1, . . . , dn), then K = UDU
T and I = UUT . By
(a), Sλ = (U(I + λD)U
T )−1 = U(I + λD)−1UT =
∑n
i=1
1
1+λdi
uiu
T
i .
(c) ĝλ =
∑n
i=1
uTi y
1+λdi
ui and the regression results in ŷ =
∑n
i=1 u
T
i yui. So smoothing spline
shrinks the coefficients in the regression, with more shrinkage assigned to eigenvectors
ui that correspond to large eigenvalues di.
(d) By (b), trace(AB) = trace(BA) and orthonormality of U, we have dfλ = trace((I +
λD)−1UTU) = trace
(
(I + λD)−1
)
=
∑n
i=1
1
1+λdi
.
(b) Average linkage and complete linkage would assign “clumps” 1 and 3 to the first
cluster and 2 and 4 to the second. Single linkage would assign 1 and 2 to one cluster,
3 and 4 to the other.
(c) Single linkage would successfully separate the two moons in Figure 3 (b) of the prob-
lem. See Figure 3 below. Using single linkage would give {1, 2} and {3, 4}, however,
complete linkage and average linkage would cluster 1 and 4 first. None of the methods
would work in Figure 3 (b) of the problem, since even for single linkage, there are
some points connecting the two moons.
Figure 3: Solution to Question 5. (c)
c©LSE ST 2020/ST443 Page 3 of 3 