xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

用r(markdown)或者python(jupyter)代写-GR5241

时间：2021-01-23

Statistical Machine Learning GR5241

Spring 2021

Homework 1

Due: Friday, February 5th at 11:59pm

Homework submission: Please submit your homework as a pdf on Canvas.

Problem 1 (Training Error and Test Error, 25 points)

This exercise is a simulation based version and extension of James 3.7.4.

I collect a set of data (n = 100 observations) containing a single feature and a quantitative response. I then

fit a linear regression model to the data, as well as separate quadratic regression and cubic regressions, i.e.

Y = β0 + β1X + β2X

2 + and Y = β0 + β1X + β2X

2 + β3X

3 + .

1. Suppose that the true relationship between X and Y is linear, i.e. Y = β0 +β1X+ . Consider the training

mean square error for the linear regression, and also the training mean square errors for the quadratic and

cubic regressions. Would we expect one to be lower than the other, would we expect them to be the same,

or is there not enough information to tell?

2. Answer (1) using test rather than training.

3. Investigate this problem further through simulation:

(a) Consider simulating a single realized dataset (R = 1) and computing the training error and test error

for different values of model complexity. Construct a single graphic displaying the training error and

the test error as a function of model complexity (or flexibility). Would you expect the pattern to be

consistent with a different simulated dataset?

(b) Consider simulating several realized datasets (R = 1000) and computing the training error and test

error for different values of model complexity. In this case you will average over all R = 1000 realized

errors to estimate the true expected training and testing error. Construct a single graphic displaying

the training error and the test error as a function of model complexity (or flexibility).

Problem 2 (Maximum Likelihood Estimation, 25 points)

In this problem, we analytically derive maximum likelihood estimators for the parameters of an example model

distribution, the gamma distribution.

The gamma distribution is univariate (one-dimensional) and continuous. It is controlled by two parameters,

the location parameter µ and the shape parameter ν. For a gamma-distributed random variable X, we write

X ∼ G (µ, ν). G is defined by the following density function:

p (x|µ, ν) :=

(

ν

µ

)ν

xν−1

Γ (ν)

exp

(

−νx

µ

)

,

where x ≥ 0 and µ, ν > 0.1 Whenever ν > 1, the gamma density has a single peak, much like a Gaussian. Unlike

1 The symbol Γ denotes the distribution’s namesake, the gamma function, defined by

Γ (ν) :=

∫ ∞

0

e−ttν−1dt .

The gamma function is a generalization of the factorial to the real line: Γ (n) = (n− 1)! for all n ∈ N. Fortunately, we will not have

to make explicit use of the integral.

0 0.5 1 1.5 2 2.5 3

0

0.5

1

1.5

2

2.5

0 0.5 1 1.5 2 2.5

0

0.5

1

1.5

2

2.5

3

0 0.5 1 1.5 2 2.5 3

0

0.5

1

1.5

2

2.5

Figure 1: Left: The plot shows the density for different values of the location parameter (µ = 0.3, 0.5, 1.0, 3.0),

with the shape parameter fixed to ν = 2. Since ν > 1, the densities peak. As we increase µ, the peak moves to

the right, and the curve flattens. Middle: For µ = 0.5 fixed, we look at different values of the shape parameter

(ν = 2, 3, 4, 5, 19). Again, all the densities peak, with the peak shifting to the right as we increase ν. Right: If

ν < 1, the density turns into a monotonously decreasing function. The smaller the value of ν, the sharper the

curve dips towards the origin.

the Gaussian, it is not symmetric. The first two moment statistics of the gamma distribution are given by

E [X] = µ and Var [X] =

µ2

ν

(1)

for X ∼ G (µ, ν). The plots in Figure 1 should give you a rough idea of what the gamma density may look like

and how different parameter values influence its behavior.

Homework questions:

1. Write the general analytic procedure to obtain the maximum likelihood estimator (including logarithmic

transformation) in the form of a short algorithm or recipe. A few words are enough, but be precise:

Write all important mathematical operations as formulae. Assume that data is given as an i. i. d. sample

x1, . . . , xn. Denote the conditional density in question by p (x|θ), and the likelihood by l (θ). Make sure

both symbols show up somewhere in your list, as well as a logarithm turning a product into a sum.

2. Derive the ML estimator for the location parameter µ, given data values x1, . . . , xn. Conventionally, an

estimator for a parameter is denoted by adding a hat: µˆ. Considering the expressions in (1) for the mean

and variance of the gamma distribution, and what you know about MLE for Gaussians, the result should

not come as a surprise.

3. A quick look at the gamma density will tell you that things get more complicated for the shape parameter: ν

appears inside the gamma function, and both inside and outside the exponential. Thus, instead of deriving

a formula of the form νˆ := . . . , please show the following: Given an i. i. d. data sample x1, . . . , xn and the

value of µ, the ML estimator νˆ for the gamma distribution shape parameter solves the equation

n∑

i=1

(

ln

(

xiνˆ

µ

)

−

(

xi

µ

− 1

)

− φ (νˆ)

)

= 0 .

The symbol φ is a shorthand notation for

φ (ν) :=

∂Γ(ν)

∂ν

Γ (ν)

.

In mathematics, φ is known as the digamma function.

2

Problem 3 (Bayes-Optimal Classifier, 25 points)

Consider a classification problem with K classes and with observations in Rd. Now suppose we have access to the

true joint density p(x, y) of the data x and the labels y. From p(x, y) we can derive the conditional probability

P (y|x), that is, the posterior probability of class y given observation x.

In the lecture, we have introduced a classifier f0 based on p, defined as

f0(x) := arg max

y∈[K]

P (y|x) ,

the Bayes-optimal classifier.

Homework question: Show that the Bayes-optimal classifier is the classifier which minimizes the probability of

error, under all classifiers in the hypothesis class

H := {f : Rd → [K] | f integrable } .

(If you are not familiar with the notion of an integrable function, just think of this as the set of all functions from

Rd to the set [K] of class labels.)

Hints:

• The probability of error is precisely the risk under zero-one loss.

• Try solving this exercise using K = 3 classes, then generalize the result to K classes.

• You can greatly simplify the problem by decomposing the risk R(f) into conditional risks R(f |x):

R(f |x) :=

∑

y∈[K]

L0-1(y, f(x))P (y|x) and hence R(f) =

∫

Rd

R(f |x)p(x)dx .

If you can show that f0 minimizes R(f |x) for every x ∈ Rd, the result for R(f) follows by monotonicity of

the integral.

Problem 4 (MAP Dirichlet-Multinomial Model, 25 points)

Consider observing n dice rolls x1, . . . , xn, where xi ∈ {1, . . . ,K}, i.e., n rolls from a K-sided die. Each index

{1, . . . ,K} takes on probabilities θ1, . . . , θK , such that

∑K

j=1 θk = 1. Assuming the data is iid, the likelihood

has the form

L(θ1, . . . , θK |x1, . . . , xn) = f(x1, . . . , xn|θ1, . . . , θK) =

K∏

k=1

θnkk ,

where nk =

∑n

i=1 1(yi = k) is the number of times event k occurred. In this setting we choose a Dirichlet prior

on parameters θ1, . . . , θK . The Dirichlet prior has pdf

q(θ1, . . . , θK) =

1

B(α1, . . . , αK)

K∏

k=1

θαk−1k ,

with support θk ∈ (0, 1), k = 1, . . . ,K and

∑K

k=1 θk = 1. The normalizing constant of the prior is∏K

k=1 Γ(αk)

Γ(

∑K

k=1 αk)

.

3

Homework questions:

1. Derive the posterior distribution

Π(θ1, . . . , θK |x1, . . . , xn)

Note that the symbol “small capital pi” (Π) is the posterior pdf where the symbol “large capital pi” (

∏K

k=1)

is a product.

2. Derive the “Bayesian” maximum a posterior (MAP) estimator of parameters θ1, . . . , θK .

3. Derive the “frequintist” maximum likelihood estimator (MLE) of parameters θ1, . . . , θK .

Hints:

• Use the “kernel-trick” to derive the posterior. This is an easy problem.

• You should use the Lagrangian multiplier technique to solve for the MAP and MLE.

4

Spring 2021

Homework 1

Due: Friday, February 5th at 11:59pm

Homework submission: Please submit your homework as a pdf on Canvas.

Problem 1 (Training Error and Test Error, 25 points)

This exercise is a simulation based version and extension of James 3.7.4.

I collect a set of data (n = 100 observations) containing a single feature and a quantitative response. I then

fit a linear regression model to the data, as well as separate quadratic regression and cubic regressions, i.e.

Y = β0 + β1X + β2X

2 + and Y = β0 + β1X + β2X

2 + β3X

3 + .

1. Suppose that the true relationship between X and Y is linear, i.e. Y = β0 +β1X+ . Consider the training

mean square error for the linear regression, and also the training mean square errors for the quadratic and

cubic regressions. Would we expect one to be lower than the other, would we expect them to be the same,

or is there not enough information to tell?

2. Answer (1) using test rather than training.

3. Investigate this problem further through simulation:

(a) Consider simulating a single realized dataset (R = 1) and computing the training error and test error

for different values of model complexity. Construct a single graphic displaying the training error and

the test error as a function of model complexity (or flexibility). Would you expect the pattern to be

consistent with a different simulated dataset?

(b) Consider simulating several realized datasets (R = 1000) and computing the training error and test

error for different values of model complexity. In this case you will average over all R = 1000 realized

errors to estimate the true expected training and testing error. Construct a single graphic displaying

the training error and the test error as a function of model complexity (or flexibility).

Problem 2 (Maximum Likelihood Estimation, 25 points)

In this problem, we analytically derive maximum likelihood estimators for the parameters of an example model

distribution, the gamma distribution.

The gamma distribution is univariate (one-dimensional) and continuous. It is controlled by two parameters,

the location parameter µ and the shape parameter ν. For a gamma-distributed random variable X, we write

X ∼ G (µ, ν). G is defined by the following density function:

p (x|µ, ν) :=

(

ν

µ

)ν

xν−1

Γ (ν)

exp

(

−νx

µ

)

,

where x ≥ 0 and µ, ν > 0.1 Whenever ν > 1, the gamma density has a single peak, much like a Gaussian. Unlike

1 The symbol Γ denotes the distribution’s namesake, the gamma function, defined by

Γ (ν) :=

∫ ∞

0

e−ttν−1dt .

The gamma function is a generalization of the factorial to the real line: Γ (n) = (n− 1)! for all n ∈ N. Fortunately, we will not have

to make explicit use of the integral.

0 0.5 1 1.5 2 2.5 3

0

0.5

1

1.5

2

2.5

0 0.5 1 1.5 2 2.5

0

0.5

1

1.5

2

2.5

3

0 0.5 1 1.5 2 2.5 3

0

0.5

1

1.5

2

2.5

Figure 1: Left: The plot shows the density for different values of the location parameter (µ = 0.3, 0.5, 1.0, 3.0),

with the shape parameter fixed to ν = 2. Since ν > 1, the densities peak. As we increase µ, the peak moves to

the right, and the curve flattens. Middle: For µ = 0.5 fixed, we look at different values of the shape parameter

(ν = 2, 3, 4, 5, 19). Again, all the densities peak, with the peak shifting to the right as we increase ν. Right: If

ν < 1, the density turns into a monotonously decreasing function. The smaller the value of ν, the sharper the

curve dips towards the origin.

the Gaussian, it is not symmetric. The first two moment statistics of the gamma distribution are given by

E [X] = µ and Var [X] =

µ2

ν

(1)

for X ∼ G (µ, ν). The plots in Figure 1 should give you a rough idea of what the gamma density may look like

and how different parameter values influence its behavior.

Homework questions:

1. Write the general analytic procedure to obtain the maximum likelihood estimator (including logarithmic

transformation) in the form of a short algorithm or recipe. A few words are enough, but be precise:

Write all important mathematical operations as formulae. Assume that data is given as an i. i. d. sample

x1, . . . , xn. Denote the conditional density in question by p (x|θ), and the likelihood by l (θ). Make sure

both symbols show up somewhere in your list, as well as a logarithm turning a product into a sum.

2. Derive the ML estimator for the location parameter µ, given data values x1, . . . , xn. Conventionally, an

estimator for a parameter is denoted by adding a hat: µˆ. Considering the expressions in (1) for the mean

and variance of the gamma distribution, and what you know about MLE for Gaussians, the result should

not come as a surprise.

3. A quick look at the gamma density will tell you that things get more complicated for the shape parameter: ν

appears inside the gamma function, and both inside and outside the exponential. Thus, instead of deriving

a formula of the form νˆ := . . . , please show the following: Given an i. i. d. data sample x1, . . . , xn and the

value of µ, the ML estimator νˆ for the gamma distribution shape parameter solves the equation

n∑

i=1

(

ln

(

xiνˆ

µ

)

−

(

xi

µ

− 1

)

− φ (νˆ)

)

= 0 .

The symbol φ is a shorthand notation for

φ (ν) :=

∂Γ(ν)

∂ν

Γ (ν)

.

In mathematics, φ is known as the digamma function.

2

Problem 3 (Bayes-Optimal Classifier, 25 points)

Consider a classification problem with K classes and with observations in Rd. Now suppose we have access to the

true joint density p(x, y) of the data x and the labels y. From p(x, y) we can derive the conditional probability

P (y|x), that is, the posterior probability of class y given observation x.

In the lecture, we have introduced a classifier f0 based on p, defined as

f0(x) := arg max

y∈[K]

P (y|x) ,

the Bayes-optimal classifier.

Homework question: Show that the Bayes-optimal classifier is the classifier which minimizes the probability of

error, under all classifiers in the hypothesis class

H := {f : Rd → [K] | f integrable } .

(If you are not familiar with the notion of an integrable function, just think of this as the set of all functions from

Rd to the set [K] of class labels.)

Hints:

• The probability of error is precisely the risk under zero-one loss.

• Try solving this exercise using K = 3 classes, then generalize the result to K classes.

• You can greatly simplify the problem by decomposing the risk R(f) into conditional risks R(f |x):

R(f |x) :=

∑

y∈[K]

L0-1(y, f(x))P (y|x) and hence R(f) =

∫

Rd

R(f |x)p(x)dx .

If you can show that f0 minimizes R(f |x) for every x ∈ Rd, the result for R(f) follows by monotonicity of

the integral.

Problem 4 (MAP Dirichlet-Multinomial Model, 25 points)

Consider observing n dice rolls x1, . . . , xn, where xi ∈ {1, . . . ,K}, i.e., n rolls from a K-sided die. Each index

{1, . . . ,K} takes on probabilities θ1, . . . , θK , such that

∑K

j=1 θk = 1. Assuming the data is iid, the likelihood

has the form

L(θ1, . . . , θK |x1, . . . , xn) = f(x1, . . . , xn|θ1, . . . , θK) =

K∏

k=1

θnkk ,

where nk =

∑n

i=1 1(yi = k) is the number of times event k occurred. In this setting we choose a Dirichlet prior

on parameters θ1, . . . , θK . The Dirichlet prior has pdf

q(θ1, . . . , θK) =

1

B(α1, . . . , αK)

K∏

k=1

θαk−1k ,

with support θk ∈ (0, 1), k = 1, . . . ,K and

∑K

k=1 θk = 1. The normalizing constant of the prior is∏K

k=1 Γ(αk)

Γ(

∑K

k=1 αk)

.

3

Homework questions:

1. Derive the posterior distribution

Π(θ1, . . . , θK |x1, . . . , xn)

Note that the symbol “small capital pi” (Π) is the posterior pdf where the symbol “large capital pi” (

∏K

k=1)

is a product.

2. Derive the “Bayesian” maximum a posterior (MAP) estimator of parameters θ1, . . . , θK .

3. Derive the “frequintist” maximum likelihood estimator (MLE) of parameters θ1, . . . , θK .

Hints:

• Use the “kernel-trick” to derive the posterior. This is an easy problem.

• You should use the Lagrangian multiplier technique to solve for the MAP and MLE.

4