
STAT GR5241 Sample Final

Name:

UNI:

Every student must write the following phrase on their cover page. Please fill in your printed

name, signature and date.

“I will not engage in academically dishonest activities for the STAT GR5241 final exam.

Signature: Date: ”

Problem 1: AdaBoost [20 points]

In this problem we apply the AdaBoost classification algorithm described in the lecture notes and Homework 5. The dataset of interest consists of n = 100 training cases, p = 2 continuous features $x_1, x_2$, and a dichotomous response Y labeled Y = −1 and Y = 1. The training data are displayed in Figure 1, which shows $x_2$ versus $x_1$ split by Y. Figure 1 also displays the locations of a single test case $x_{\text{test}}$ and training case $x_{17}$.

The weak learner is a decision stump, i.e., for the b-th boosting iteration:

$$c_b(x \mid j, \theta) = \begin{cases} +1 & \text{if } x_j > \theta \\ -1 & \text{if } x_j \le \theta \end{cases}$$

The decision stump is trained by minimizing the weighted 0-1 loss:

$$(j^{(b)}, \theta^{(b)}) := \arg\min_{j,\theta} \frac{\sum_{i=1}^{n} w_i \, I\{y_i \neq c_b(x_i \mid j, \theta)\}}{\sum_{i=1}^{n} w_i}$$

Our AdaBoost classifier is constructed from B = 8 weak learners. For boosting iterations b = 1, 2, \ldots, 8, the trained decision stumps $c_1, c_2, \ldots, c_8$ are:

$(j^{(1)} = 2,\ \theta^{(1)} = 0.5016) \quad (j^{(2)} = 1,\ \theta^{(2)} = 0.6803) \quad (j^{(3)} = 2,\ \theta^{(3)} = 0.9128)$

$(j^{(4)} = 2,\ \theta^{(4)} = 0.5016) \quad (j^{(5)} = 1,\ \theta^{(5)} = 0.4986) \quad (j^{(6)} = 1,\ \theta^{(6)} = 0.9278)$

$(j^{(7)} = 2,\ \theta^{(7)} = 0.5016) \quad (j^{(8)} = 1,\ \theta^{(8)} = 0.4986)$


Figure 1: Problem 1 training data

The weighted errors $\epsilon_b$ are:

$\epsilon_1 = 0.1400 \quad \epsilon_2 = 0.1221 \quad \epsilon_3 = 0.2018$

$\epsilon_4 = 0.1784 \quad \epsilon_5 = 0.1475 \quad \epsilon_6 = 0.1670$

$\epsilon_7 = 0.1760 \quad \epsilon_8 = 0.1821$

Solve problems (1.i)-(1.ii):

1.i [10 points] Classify the test case $x_{\text{test}} = (0.600, 0.395)$ based on the trained AdaBoost model. Use all B = 8 weak learners to estimate $\hat{Y}_{\text{test}}$. Show the details of your calculation for full credit.

1.ii [10 points] Does the structure of the training data (Figure 1) provide an advantage or disadvantage when using AdaBoost with decision stumps? Describe your answer in a few sentences.
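For reference, a minimal Python sketch of how the boosted prediction in (1.i) could be assembled from the stumps and weighted errors above. It assumes the standard AdaBoost vote weights $\alpha_b = \tfrac{1}{2}\log((1-\epsilon_b)/\epsilon_b)$; some formulations drop the factor $\tfrac{1}{2}$, which does not change the sign of the final vote. The helper names (`stump_predict`, `adaboost_predict`) are illustrative, not part of the exam.

```python
import numpy as np

# Trained stumps (j^(b), theta^(b)) and weighted errors eps_b from the problem statement.
# Features are 1-indexed in the exam (x1, x2), so j is 1 or 2.
stumps = [(2, 0.5016), (1, 0.6803), (2, 0.9128), (2, 0.5016),
          (1, 0.4986), (1, 0.9278), (2, 0.5016), (1, 0.4986)]
eps = np.array([0.1400, 0.1221, 0.2018, 0.1784, 0.1475, 0.1670, 0.1760, 0.1821])

def stump_predict(x, j, theta):
    """Decision stump: +1 if x_j > theta, else -1 (x is a length-2 array, j is 1-indexed)."""
    return 1.0 if x[j - 1] > theta else -1.0

def adaboost_predict(x, stumps, eps):
    """Sign of the alpha-weighted vote over all B weak learners."""
    alphas = 0.5 * np.log((1.0 - eps) / eps)        # vote weight for each stump
    votes = np.array([stump_predict(x, j, theta) for (j, theta) in stumps])
    score = np.sum(alphas * votes)                  # f(x) = sum_b alpha_b * c_b(x)
    return np.sign(score), score

x_test = np.array([0.600, 0.395])
y_hat, score = adaboost_predict(x_test, stumps, eps)
print(y_hat, score)
```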


Problem 2: Neural Networks [30 points]

In this problem we fit a neural network to perform classification on the same n = 100 training cases from Problem 1. The dichotomous response Y is now labeled Y = 0 and Y = 1, instead of −1 and 1. Our neural network consists of a single hidden layer with d = 2 derived features, a sigmoid activation function, and a softmax output function. The schematic of this model is displayed below:

[Network schematic: inputs $x_1, x_2$ feed the hidden units $h_1, h_2$ through weights $w^{[1]}_{11}, w^{[1]}_{12}, w^{[1]}_{21}, w^{[1]}_{22}$ and biases $b^{[1]}_1, b^{[1]}_2$; the hidden units feed the outputs $y_a, y_b$ through weights $w^{[2]}_{11}, w^{[2]}_{12}, w^{[2]}_{21}, w^{[2]}_{22}$ and biases $b^{[2]}_1, b^{[2]}_2$.]

The objective or total cost Q(θ) is the cross-entropy:

$$Q(\theta) = -\sum_{i=1}^{n} \Big( y_i \log f_1(x_i) + (1 - y_i) \log f_2(x_i) \Big) = \sum_{i=1}^{n} Q_i(\theta),$$

where $f_1(x_i)$ denotes the entry in the first row and $i$-th column of the matrix P from the lecture notes (NN Backpropagation.pdf), and similarly $f_2(x_i)$ denotes the entry in the second row and $i$-th column of P.

The neural network is trained by minimizing Q(θ) with respect to θ, where the parameter θ is the collection of weights and biases $W^{[1]}, b^{[1]}, W^{[2]}$ and $b^{[2]}$. The quantity $Q_i(\theta)$ represents the $i$-th training case's contribution to the total cross-entropy Q(θ). The neural network is trained via gradient descent, yielding the estimated weights:

$$\hat{b}^{[1]} = \begin{pmatrix} \hat{b}^{[1]}_1 \\ \hat{b}^{[1]}_2 \end{pmatrix}_{(2\times 1)} = \begin{pmatrix} -5.7840 \\ 3.4478 \end{pmatrix}, \qquad \hat{W}^{[1]} = \begin{pmatrix} \hat{w}^{[1]}_{11} & \hat{w}^{[1]}_{12} \\ \hat{w}^{[1]}_{21} & \hat{w}^{[1]}_{22} \end{pmatrix}_{(2\times 2)} = \begin{pmatrix} -17.7838 & 43.4752 \\ -7.5506 & -0.5264 \end{pmatrix}$$

$$\hat{b}^{[2]} = \begin{pmatrix} \hat{b}^{[2]}_1 \\ \hat{b}^{[2]}_2 \end{pmatrix}_{(2\times 1)} = \begin{pmatrix} -2.0010 \\ 2.0030 \end{pmatrix}, \qquad \hat{W}^{[2]} = \begin{pmatrix} \hat{w}^{[2]}_{11} & \hat{w}^{[2]}_{12} \\ \hat{w}^{[2]}_{21} & \hat{w}^{[2]}_{22} \end{pmatrix}_{(2\times 2)} = \begin{pmatrix} 25.6367 & -66.6665 \\ -26.3790 & 68.6225 \end{pmatrix}$$
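As a reference for (2.i) and (2.ii), a minimal Python sketch of the forward pass through this network. It assumes the layer convention $h = \sigma(W^{[1]}x + b^{[1]})$ and $f = \mathrm{softmax}(W^{[2]}h + b^{[2]})$ suggested by the schematic, and that $f_1$ is the probability assigned to $Y = 1$ (matching the definition of $Q_i$); if the lecture notes orient the weight matrices differently, the matrix-vector products below would need to be transposed accordingly.

```python
import numpy as np

# Estimated parameters copied from the problem statement.
b1 = np.array([-5.7840, 3.4478])
W1 = np.array([[-17.7838, 43.4752],
               [ -7.5506, -0.5264]])
b2 = np.array([-2.0010, 2.0030])
W2 = np.array([[ 25.6367, -66.6665],
               [-26.3790,  68.6225]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x):
    """Forward pass: hidden activations h and output probabilities f = (f1, f2)."""
    h = sigmoid(W1 @ x + b1)       # hidden layer, d = 2 derived features
    f = softmax(W2 @ h + b2)       # softmax output, f[0] = f1(x), f[1] = f2(x)
    return h, f

def cross_entropy(y, f):
    """Per-case cost Q_i = -(y * log f1 + (1 - y) * log f2), matching the exam's definition."""
    return -(y * np.log(f[0]) + (1 - y) * np.log(f[1]))

# Example usage with the exam's test case and training case 17.
x_test = np.array([0.600, 0.395])
_, f_test = forward(x_test)
print("f(x_test) =", f_test)       # classify as Y = 1 if f1 > f2, else Y = 0

x17, y17 = np.array([0.4474, 0.8764]), 0
_, f17 = forward(x17)
print("Q_17 =", cross_entropy(y17, f17))
```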


Solve problems (2.i)-(2.iv):

2.i [10 points] Classify the test case $x_{\text{test}} = (0.600, 0.395)$ based on the trained neural network. Show the details of your calculation for full credit.

2.ii [5 points] The 17th case of the training data is $x_{17} = (0.4474, 0.8764)$ with label $y_{17} = 0$. Compute the cost of case 17, i.e., compute $Q_{17}(\hat{\theta})$.

2.iii [10 points] Note that the gradient of Q(θ) can be expressed as

$$\nabla Q(\theta) = \sum_{i=1}^{n} \nabla Q_i(\theta).$$

Use the backpropagation algorithm to compute $\nabla Q_{17}(\hat{\theta})$. This quantity represents the 17th case's contribution to the full gradient of Q(θ), evaluated at the point $\theta = \hat{\theta}$. Show the details of your calculation for full credit.
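A sketch of the per-case backpropagation for this architecture (sigmoid hidden layer, softmax output, cross-entropy cost), under the same conventions as the forward-pass sketch above. The softmax-plus-cross-entropy identity gives the output-layer error directly as $f - t$ with target $t = (y_i, 1 - y_i)$.

```python
import numpy as np

def backprop_single_case(x, y, W1, b1, W2, b2):
    """Gradient of Q_i with respect to (W1, b1, W2, b2) for one training case (x, y)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Forward pass.
    h = sigmoid(W1 @ x + b1)
    z2 = W2 @ h + b2
    f = np.exp(z2 - z2.max()); f /= f.sum()      # softmax probabilities (f1, f2)
    t = np.array([y, 1.0 - y])                   # target matching Q_i's definition
    # Backward pass.
    delta2 = f - t                               # dQ_i/dz2 for softmax + cross-entropy
    grad_W2 = np.outer(delta2, h)
    grad_b2 = delta2
    delta1 = (W2.T @ delta2) * h * (1.0 - h)     # chain rule through the sigmoid layer
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1
    return grad_W1, grad_b1, grad_W2, grad_b2
```

Calling `backprop_single_case(x17, 0, W1, b1, W2, b2)` with the estimated weights above would give the 17th case's contribution $\nabla Q_{17}(\hat{\theta})$.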

2.iv [5 points] What should $\nabla Q(\hat{\theta})$ approximately equal? Justify your answer in one or two sentences.

Problem 3: Smoothing Splines [10 points]

Consider two curves, $\hat{g}_1$ and $\hat{g}_2$, defined by

$$\hat{g}_1 = \arg\min_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \, dx \right),$$

$$\hat{g}_2 = \arg\min_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \, dx \right),$$

where $g^{(m)}$ represents the $m$-th derivative of $g$ and $\lambda > 0$ is the tuning parameter. Similar to linear regression, the training and test RSS are defined, respectively, as

$$\mathrm{RSS}_{\text{train}} = \sum_{i=1}^{n} (y_i - \hat{g}(x_i))^2, \qquad \mathrm{RSS}_{\text{test}} = \sum_{i=1}^{n^*} (y_i^* - \hat{g}(x_i^*))^2,$$

where $\hat{g}$ is fit on the training data.
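A standard smoothing-spline fact that may help with (3.i)-(3.ii): the roughness penalty vanishes exactly on polynomials whose degree is below the order of the penalized derivative, so as $\lambda \to \infty$ each fit is pushed toward a polynomial least-squares fit:

$$\int [g^{(m)}(x)]^2 \, dx = 0 \;\iff\; g^{(m)} \equiv 0 \;\iff\; g(x) = a_0 + a_1 x + \cdots + a_{m-1} x^{m-1}.$$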

Solve problems (3.i)-(3.ii):

3.i [5 points] As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training RSS, or is there no definite answer? Describe your answer in a few sentences.

3.ii [5 points] As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller test RSS, or is there no definite answer? Describe your answer in a few sentences.


Problem 4: Optimization [10 points]

Consider the L2-penalized logistic regression model having p features. For simplicity, we exclude the intercept and do not standardize the features. The objective or cost function is

$$Q(\beta) = Q(\beta_1, \beta_2, \cdots, \beta_p) = \frac{1}{n} L(\beta_1, \beta_2, \cdots, \beta_p) + \lambda \sum_{j=1}^{p} \beta_j^2,$$

where $L(\beta_1, \beta_2, \cdots, \beta_p)$ is the negative log-likelihood and $\lambda > 0$ is the tuning parameter. For fixed $\lambda > 0$, our goal is to estimate $\beta = (\beta_1, \beta_2, \cdots, \beta_p)^T$.

Solve the following problem:

Derive the update step for Newton's method:

$$\beta^{(t+1)} := \beta^{(t)} - \left[ H_Q(\beta^{(t)}) \right]^{-1} \nabla Q(\beta^{(t)})$$
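For reference, a minimal Python sketch of the resulting Newton iteration. It assumes the usual logistic-regression negative log-likelihood $L(\beta) = -\sum_i [\,y_i \log p_i + (1-y_i)\log(1-p_i)\,]$ with $p_i = \sigma(x_i^T \beta)$ and labels $y_i \in \{0, 1\}$; under that convention $\nabla Q(\beta) = \tfrac{1}{n} X^T(p - y) + 2\lambda\beta$ and $H_Q(\beta) = \tfrac{1}{n} X^T W X + 2\lambda I$ with $W = \mathrm{diag}(p_i(1-p_i))$. The function name `newton_l2_logistic` is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_l2_logistic(X, y, lam, n_iter=25):
    """Newton's method for Q(beta) = (1/n) * L(beta) + lam * sum_j beta_j^2 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        probs = sigmoid(X @ beta)                          # p_i = sigma(x_i^T beta)
        grad = X.T @ (probs - y) / n + 2.0 * lam * beta    # gradient of Q
        W = probs * (1.0 - probs)                          # diagonal of the weight matrix
        hess = (X.T * W) @ X / n + 2.0 * lam * np.eye(p)   # Hessian of Q
        beta = beta - np.linalg.solve(hess, grad)          # Newton update
    return beta
```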

Problem 5: EM Algorithm and Clustering [10 points]

For fixed $K > 0$, let $x_1, x_2, \ldots, x_n$ be iid cases from the mixture distribution

$$\pi(x) = \sum_{k=1}^{K} c_k \, p(x \mid \mu_k),$$

where $\sum_k c_k = 1$, $c_k \ge 0$, and $p(x \mid \mu_k)$ is the exponential density

$$p(x \mid \mu_k) = \frac{1}{\mu_k} \exp\left( -\frac{x}{\mu_k} \right), \quad x > 0.$$

Here we are clustering the cases based on an exponential mixture model.

Solve the following problem:

Write down the EM algorithm for estimating the parameters $c_1, c_2, \ldots, c_K$ and $\mu_1, \mu_2, \ldots, \mu_K$. Note: this problem becomes easy if you exploit properties of exponential families.
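For reference, a minimal Python sketch of one way this EM algorithm could be implemented, assuming the standard updates for an exponential mixture: the E-step computes responsibilities $\gamma_{ik} \propto c_k \, p(x_i \mid \mu_k)$, and the M-step sets $c_k$ to the average responsibility and $\mu_k$ to the responsibility-weighted mean of the $x_i$ (the weighted MLE of an exponential mean). The quantile-based initialization is an arbitrary choice.

```python
import numpy as np

def em_exponential_mixture(x, K, n_iter=100):
    """EM for a K-component exponential mixture p(x) = sum_k c_k * (1/mu_k) * exp(-x/mu_k)."""
    n = len(x)
    c = np.full(K, 1.0 / K)                              # mixture weights
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)        # crude initial means, one per quantile
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(component k | x_i).
        dens = (1.0 / mu) * np.exp(-np.outer(x, 1.0 / mu))   # n-by-K exponential densities
        gamma = c * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs (closed forms thanks to the exponential family).
        Nk = gamma.sum(axis=0)
        c = Nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
    return c, mu
```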

Problem 6: True or False [20 points]

For this section, please clearly circle either TRUE or FALSE for each question. If you are writing your answers out rather than circling, write TRUE or FALSE on your submitted exam. Ambiguous choices will be marked incorrect. No additional work is required to justify your answers.

6.i In logistic regression, the negative log-likelihood Q(β) is a convex function of β:

TRUE FALSE


6.ii Consider a neural network with one hidden layer, sigmoid activation and softmax output function.

The total cost Q(θ) (cross-entropy) is a convex function of θ:

TRUE FALSE

6.iii Consider a Gaussian mixture model with negative log-likelihood

$$Q(\theta) = -\sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} c_k \, p(x_i \mid \mu_k, \Sigma_k) \right),$$

where θ is the collection of means $\mu_1, \mu_2, \ldots, \mu_K$, covariances $\Sigma_1, \Sigma_2, \ldots, \Sigma_K$, and mixture parameters $c_1, c_2, \ldots, c_K$ (see the EM/clustering lecture notes for details). The negative log-likelihood is a convex function of θ:

TRUE FALSE

6.iv Let $x, x' \in \mathbb{R}^d$ and assume that $k_1(x, x')$ and $k_2(x, x')$ are valid kernels. The function

$$k(x, x') = k_1(x, x') - k_2(x, x')$$

is a valid kernel:

TRUE FALSE

6.v Let $x, x' \in \mathbb{R}^d$. The function $k(x, x') = 1 + x^T x'$ is a valid kernel:

TRUE FALSE

6.vi Consider L1-penalized linear or logistic regression (LASSO). As $\lambda > 0$ increases, the L1 penalty forces some coefficients $\beta_j$ to be exactly 0, and the remaining non-zero coefficients are identical to the estimates produced by unpenalized regression:

TRUE FALSE

6.vii Consider L2-penalized linear or logistic regression (Ridge). Larger values of λ > 0 will typically

decrease the bias:

TRUE FALSE

6.viii Consider linear regression (OLS) with p > 0 features. Larger values of p will typically decrease the

bias:

TRUE FALSE


6.ix Consider k-NN classification or regression. Larger values of k > 0 will typically decrease the bias:

TRUE FALSE

6.x Consider a model with tuning parameter $\alpha$. The K-fold cross-validation error

$$CV(\hat{f}, \alpha) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{(x,y) \in B_k} L\big(y, \hat{f}^{-k}(x, \alpha)\big)$$

is an unbiased estimator of the true generalization error of $\hat{f}$:

TRUE FALSE

