xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

程序代写案例-ST 443

时间：2021-04-25

ST 443 Final Exam Solutions (2020)

1. (a) i. False. yi is not linear in α1, α2, and no simple transformation will make it linear

(log(xα1α2i + εi) 6= α1 log xi + αi + εi)

ii. True. The derivative of L2 loss can grow without bound whereas the L1 loss

gradient is bounded, hence the influence of an outlier is limited.

iii. False. There are no guarantees that the support vectors remain the same. The

feature vectors corresponding to polynomial kernels are non-linear functions of

the original input vectors and thus the support points for maximum margin

separation in the feature space can be quite different.

iv. False. Recursive binary splitting ensures axis-alighed splits, so the topleft split

is impossible.

v. False. In order to decorrelate the trees, a random selection of input variables at

each split (not internal nodes) are discarded.

(b) In both cases, the decision boundary is piecewise linear. Decision trees do axis-aligned

splits while 1-nearest neighbour gives a voronoi diagram.

(c) Suppose there are K > 2 classes.

• One-versus-one classification: fit all K(K − 1)/2 pairwise classifiers and classify

the observation to the class that is most frequently assigned among all pairwise

classifications. The main disadvantages: high computational cost when K is large

and the sample size may not be sufficiently large.

• One-versus-all classification: fit K binary SVMs classifiers, f̂k(x), k = 1, . . . ,K.

Classify the observation to the class for which the corresponding f̂ is the largest.

The main disadvantage: the unbalanced classification task with very different

observations for binary classes when K is large.

2. (a) i. See the dataset in Figure 1

ii. See the dataset in Figure 2

(b) i. The left give points are misclassified and the error is 5/10.

ii. The left ‘-’ point is misclassified and the error is 1/10.

(c) 1-nearest neighbour error at x is p0(x)p1(x

′) + p0(x′)p1(x).

(d) Augment Y with p additional zeros and get Y˜ = (Y,0)T and augment X with the

multiple of the p× p identity matrix √λαIp to get

[

X√

λαIp

]

and set λ˜ = λ(1− α).

Then we solve the lasso minimization problem min

β∈Rp

||Y˜ − X˜β||2 + λ˜∑pj=1 |βj |.

3. (a) We have P (Y = 1) = P (Y = 0) = 1/2. fBayes = arg max

y

P (Y = y|X = x) =

arg max

y

P (X = x|Y = y)P (Y = y) = arg max

y

P (X = x|Y = y). Therefore fBayes(1) =

c©LSE ST 2020/ST443 Page 1 of 3

Figure 1: Solution to Question 2. (b) i.

Figure 2: Solution to Question 2. (b) ii.

1 since p = P (X = 1|Y = 1) > P (X = 1|Y = 0) = q and fBayes(0) = 0 since

1− p = P (X = 0|Y = 1) < P (X = 0|Y = 0) = 1− q. Hence Bayes optimal classifier

is fBayes(x) = x. Its risk is P (X 6= Y ) = P (Y = 1)P (X = 0|Y = 1) + P (Y =

0)P (X = 1|Y = 0) = 12(1− p) + 12q.

(b) i. We can write arg min

f

E`α,β(f(X), Y )

= arg min

f

EX,Y {αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}

= arg min

f

EX [EY |X{αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}]

= arg min

f

EX [

∑

y=0,1

{αI(f(X) = 1, y = 0) + βI(f(X) = 0, y = 1)]P (Y = y|X)

= arg min

f

∫

x

[αI(f(x) = 1)P (Y = 0|x) + βI(f(x) = 0)P (Y = 1|x)]dP (x)

We may minimize the integrand at each x by taking fBayes(x) = 1 if βP (Y =

1|x) ≥ αP (Y = 0|x) and 0 if βP (Y = 1|x) < αP (Y = 0|x).

c©LSE ST 2020/ST443 Page 2 of 3

ii.

E`α,β(f(X), Y ) = αP (f(X) = 1, Y = 0) + βP (f(X = 0, Y = 1)

= αP (f(X) = 1|Y = 0)P (Y = 0) + βP (f(X) = 0|Y = 1)P (Y = 1),

which is same as R if α = 1P (Y=0) and β =

1

P (Y=1) .

4. (a) Sλ = B(B

TB + λΩ)−1BT =

(

B−T (BTB + λΩ)B−1

)−1

= (I + λB−TΩB−1)−1. So

K = B−TΩB−1 and does not depend on λ.

(b) Let U = (u1, . . . ,un) and D = diag(d1, . . . , dn), then K = UDU

T and I = UUT . By

(a), Sλ = (U(I + λD)U

T )−1 = U(I + λD)−1UT =

∑n

i=1

1

1+λdi

uiu

T

i .

(c) ĝλ =

∑n

i=1

uTi y

1+λdi

ui and the regression results in ŷ =

∑n

i=1 u

T

i yui. So smoothing spline

shrinks the coefficients in the regression, with more shrinkage assigned to eigenvectors

ui that correspond to large eigenvalues di.

(d) By (b), trace(AB) = trace(BA) and orthonormality of U, we have dfλ = trace((I +

λD)−1UTU) = trace

(

(I + λD)−1

)

=

∑n

i=1

1

1+λdi

.

5. (a) Average linkage

(b) Average linkage and complete linkage would assign “clumps” 1 and 3 to the first

cluster and 2 and 4 to the second. Single linkage would assign 1 and 2 to one cluster,

3 and 4 to the other.

(c) Single linkage would successfully separate the two moons in Figure 3 (b) of the prob-

lem. See Figure 3 below. Using single linkage would give {1, 2} and {3, 4}, however,

complete linkage and average linkage would cluster 1 and 4 first. None of the methods

would work in Figure 3 (b) of the problem, since even for single linkage, there are

some points connecting the two moons.

Figure 3: Solution to Question 5. (c)

c©LSE ST 2020/ST443 Page 3 of 3

学霸联盟

1. (a) i. False. yi is not linear in α1, α2, and no simple transformation will make it linear

(log(xα1α2i + εi) 6= α1 log xi + αi + εi)

ii. True. The derivative of L2 loss can grow without bound whereas the L1 loss

gradient is bounded, hence the influence of an outlier is limited.

iii. False. There are no guarantees that the support vectors remain the same. The

feature vectors corresponding to polynomial kernels are non-linear functions of

the original input vectors and thus the support points for maximum margin

separation in the feature space can be quite different.

iv. False. Recursive binary splitting ensures axis-alighed splits, so the topleft split

is impossible.

v. False. In order to decorrelate the trees, a random selection of input variables at

each split (not internal nodes) are discarded.

(b) In both cases, the decision boundary is piecewise linear. Decision trees do axis-aligned

splits while 1-nearest neighbour gives a voronoi diagram.

(c) Suppose there are K > 2 classes.

• One-versus-one classification: fit all K(K − 1)/2 pairwise classifiers and classify

the observation to the class that is most frequently assigned among all pairwise

classifications. The main disadvantages: high computational cost when K is large

and the sample size may not be sufficiently large.

• One-versus-all classification: fit K binary SVMs classifiers, f̂k(x), k = 1, . . . ,K.

Classify the observation to the class for which the corresponding f̂ is the largest.

The main disadvantage: the unbalanced classification task with very different

observations for binary classes when K is large.

2. (a) i. See the dataset in Figure 1

ii. See the dataset in Figure 2

(b) i. The left give points are misclassified and the error is 5/10.

ii. The left ‘-’ point is misclassified and the error is 1/10.

(c) 1-nearest neighbour error at x is p0(x)p1(x

′) + p0(x′)p1(x).

(d) Augment Y with p additional zeros and get Y˜ = (Y,0)T and augment X with the

multiple of the p× p identity matrix √λαIp to get

[

X√

λαIp

]

and set λ˜ = λ(1− α).

Then we solve the lasso minimization problem min

β∈Rp

||Y˜ − X˜β||2 + λ˜∑pj=1 |βj |.

3. (a) We have P (Y = 1) = P (Y = 0) = 1/2. fBayes = arg max

y

P (Y = y|X = x) =

arg max

y

P (X = x|Y = y)P (Y = y) = arg max

y

P (X = x|Y = y). Therefore fBayes(1) =

c©LSE ST 2020/ST443 Page 1 of 3

Figure 1: Solution to Question 2. (b) i.

Figure 2: Solution to Question 2. (b) ii.

1 since p = P (X = 1|Y = 1) > P (X = 1|Y = 0) = q and fBayes(0) = 0 since

1− p = P (X = 0|Y = 1) < P (X = 0|Y = 0) = 1− q. Hence Bayes optimal classifier

is fBayes(x) = x. Its risk is P (X 6= Y ) = P (Y = 1)P (X = 0|Y = 1) + P (Y =

0)P (X = 1|Y = 0) = 12(1− p) + 12q.

(b) i. We can write arg min

f

E`α,β(f(X), Y )

= arg min

f

EX,Y {αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}

= arg min

f

EX [EY |X{αI(f(X) = 1, Y = 0) + βI(f(X) = 0, Y = 1)}]

= arg min

f

EX [

∑

y=0,1

{αI(f(X) = 1, y = 0) + βI(f(X) = 0, y = 1)]P (Y = y|X)

= arg min

f

∫

x

[αI(f(x) = 1)P (Y = 0|x) + βI(f(x) = 0)P (Y = 1|x)]dP (x)

We may minimize the integrand at each x by taking fBayes(x) = 1 if βP (Y =

1|x) ≥ αP (Y = 0|x) and 0 if βP (Y = 1|x) < αP (Y = 0|x).

c©LSE ST 2020/ST443 Page 2 of 3

ii.

E`α,β(f(X), Y ) = αP (f(X) = 1, Y = 0) + βP (f(X = 0, Y = 1)

= αP (f(X) = 1|Y = 0)P (Y = 0) + βP (f(X) = 0|Y = 1)P (Y = 1),

which is same as R if α = 1P (Y=0) and β =

1

P (Y=1) .

4. (a) Sλ = B(B

TB + λΩ)−1BT =

(

B−T (BTB + λΩ)B−1

)−1

= (I + λB−TΩB−1)−1. So

K = B−TΩB−1 and does not depend on λ.

(b) Let U = (u1, . . . ,un) and D = diag(d1, . . . , dn), then K = UDU

T and I = UUT . By

(a), Sλ = (U(I + λD)U

T )−1 = U(I + λD)−1UT =

∑n

i=1

1

1+λdi

uiu

T

i .

(c) ĝλ =

∑n

i=1

uTi y

1+λdi

ui and the regression results in ŷ =

∑n

i=1 u

T

i yui. So smoothing spline

shrinks the coefficients in the regression, with more shrinkage assigned to eigenvectors

ui that correspond to large eigenvalues di.

(d) By (b), trace(AB) = trace(BA) and orthonormality of U, we have dfλ = trace((I +

λD)−1UTU) = trace

(

(I + λD)−1

)

=

∑n

i=1

1

1+λdi

.

5. (a) Average linkage

(b) Average linkage and complete linkage would assign “clumps” 1 and 3 to the first

cluster and 2 and 4 to the second. Single linkage would assign 1 and 2 to one cluster,

3 and 4 to the other.

(c) Single linkage would successfully separate the two moons in Figure 3 (b) of the prob-

lem. See Figure 3 below. Using single linkage would give {1, 2} and {3, 4}, however,

complete linkage and average linkage would cluster 1 and 4 first. None of the methods

would work in Figure 3 (b) of the problem, since even for single linkage, there are

some points connecting the two moons.

Figure 3: Solution to Question 5. (c)

c©LSE ST 2020/ST443 Page 3 of 3

学霸联盟