Statistical Machine Learning GR5241
Spring 2021
Homework 1
Due: Friday, February 5th at 11:59pm
Problem 1 (Training Error and Test Error, 25 points)
This exercise is a simulation based version and extension of James 3.7.4.
I collect a set of data (n = 100 observations) containing a single feature and a quantitative response. I then
fit a linear regression model to the data, as well as separate quadratic regression and cubic regressions, i.e.
Y = β0 + β1X + β2X
2 + and Y = β0 + β1X + β2X
2 + β3X
3 + .
1. Suppose that the true relationship between X and Y is linear, i.e. Y = β0 +β1X+ . Consider the training
mean square error for the linear regression, and also the training mean square errors for the quadratic and
cubic regressions. Would we expect one to be lower than the other, would we expect them to be the same,
or is there not enough information to tell?
2. Answer (1) using test rather than training.
3. Investigate this problem further through simulation:
(a) Consider simulating a single realized dataset (R = 1) and computing the training error and test error
for different values of model complexity. Construct a single graphic displaying the training error and
the test error as a function of model complexity (or flexibility). Would you expect the pattern to be
consistent with a different simulated dataset?
(b) Consider simulating several realized datasets (R = 1000) and computing the training error and test
error for different values of model complexity. In this case you will average over all R = 1000 realized
errors to estimate the true expected training and testing error. Construct a single graphic displaying
the training error and the test error as a function of model complexity (or flexibility).
Problem 2 (Maximum Likelihood Estimation, 25 points)
In this problem, we analytically derive maximum likelihood estimators for the parameters of an example model
distribution, the gamma distribution.
The gamma distribution is univariate (one-dimensional) and continuous. It is controlled by two parameters,
the location parameter µ and the shape parameter ν. For a gamma-distributed random variable X, we write
X ∼ G (µ, ν). G is defined by the following density function:
p (x|µ, ν) :=
(
ν
µ

xν−1
Γ (ν)
exp
(
−νx
µ
)
,
where x ≥ 0 and µ, ν > 0.1 Whenever ν > 1, the gamma density has a single peak, much like a Gaussian. Unlike
1 The symbol Γ denotes the distribution’s namesake, the gamma function, defined by
Γ (ν) :=
∫ ∞
0
e−ttν−1dt .
The gamma function is a generalization of the factorial to the real line: Γ (n) = (n− 1)! for all n ∈ N. Fortunately, we will not have
to make explicit use of the integral.
0 0.5 1 1.5 2 2.5 3
0
0.5
1
1.5
2
2.5
0 0.5 1 1.5 2 2.5
0
0.5
1
1.5
2
2.5
3
0 0.5 1 1.5 2 2.5 3
0
0.5
1
1.5
2
2.5
Figure 1: Left: The plot shows the density for different values of the location parameter (µ = 0.3, 0.5, 1.0, 3.0),
with the shape parameter fixed to ν = 2. Since ν > 1, the densities peak. As we increase µ, the peak moves to
the right, and the curve flattens. Middle: For µ = 0.5 fixed, we look at different values of the shape parameter
(ν = 2, 3, 4, 5, 19). Again, all the densities peak, with the peak shifting to the right as we increase ν. Right: If
ν < 1, the density turns into a monotonously decreasing function. The smaller the value of ν, the sharper the
curve dips towards the origin.
the Gaussian, it is not symmetric. The first two moment statistics of the gamma distribution are given by
E [X] = µ and Var [X] =
µ2
ν
(1)
for X ∼ G (µ, ν). The plots in Figure 1 should give you a rough idea of what the gamma density may look like
and how different parameter values influence its behavior.
Homework questions:
1. Write the general analytic procedure to obtain the maximum likelihood estimator (including logarithmic
transformation) in the form of a short algorithm or recipe. A few words are enough, but be precise:
Write all important mathematical operations as formulae. Assume that data is given as an i. i. d. sample
x1, . . . , xn. Denote the conditional density in question by p (x|θ), and the likelihood by l (θ). Make sure
both symbols show up somewhere in your list, as well as a logarithm turning a product into a sum.
2. Derive the ML estimator for the location parameter µ, given data values x1, . . . , xn. Conventionally, an
estimator for a parameter is denoted by adding a hat: µˆ. Considering the expressions in (1) for the mean
and variance of the gamma distribution, and what you know about MLE for Gaussians, the result should
not come as a surprise.
3. A quick look at the gamma density will tell you that things get more complicated for the shape parameter: ν
appears inside the gamma function, and both inside and outside the exponential. Thus, instead of deriving
a formula of the form νˆ := . . . , please show the following: Given an i. i. d. data sample x1, . . . , xn and the
value of µ, the ML estimator νˆ for the gamma distribution shape parameter solves the equation
n∑
i=1
(
ln
(
xiνˆ
µ
)

(
xi
µ
− 1
)
− φ (νˆ)
)
= 0 .
The symbol φ is a shorthand notation for
φ (ν) :=
∂Γ(ν)
∂ν
Γ (ν)
.
In mathematics, φ is known as the digamma function.
2
Problem 3 (Bayes-Optimal Classifier, 25 points)
Consider a classification problem with K classes and with observations in Rd. Now suppose we have access to the
true joint density p(x, y) of the data x and the labels y. From p(x, y) we can derive the conditional probability
P (y|x), that is, the posterior probability of class y given observation x.
In the lecture, we have introduced a classifier f0 based on p, defined as
f0(x) := arg max
y∈[K]
P (y|x) ,
the Bayes-optimal classifier.
Homework question: Show that the Bayes-optimal classifier is the classifier which minimizes the probability of
error, under all classifiers in the hypothesis class
H := {f : Rd → [K] | f integrable } .
(If you are not familiar with the notion of an integrable function, just think of this as the set of all functions from
Rd to the set [K] of class labels.)
Hints:
• The probability of error is precisely the risk under zero-one loss.
• Try solving this exercise using K = 3 classes, then generalize the result to K classes.
• You can greatly simplify the problem by decomposing the risk R(f) into conditional risks R(f |x):
R(f |x) :=

y∈[K]
L0-1(y, f(x))P (y|x) and hence R(f) =

Rd
R(f |x)p(x)dx .
If you can show that f0 minimizes R(f |x) for every x ∈ Rd, the result for R(f) follows by monotonicity of
the integral.
Problem 4 (MAP Dirichlet-Multinomial Model, 25 points)
Consider observing n dice rolls x1, . . . , xn, where xi ∈ {1, . . . ,K}, i.e., n rolls from a K-sided die. Each index
{1, . . . ,K} takes on probabilities θ1, . . . , θK , such that
∑K
j=1 θk = 1. Assuming the data is iid, the likelihood
has the form
L(θ1, . . . , θK |x1, . . . , xn) = f(x1, . . . , xn|θ1, . . . , θK) =
K∏
k=1
θnkk ,
where nk =
∑n
i=1 1(yi = k) is the number of times event k occurred. In this setting we choose a Dirichlet prior
on parameters θ1, . . . , θK . The Dirichlet prior has pdf
q(θ1, . . . , θK) =
1
B(α1, . . . , αK)
K∏
k=1
θαk−1k ,
with support θk ∈ (0, 1), k = 1, . . . ,K and
∑K
k=1 θk = 1. The normalizing constant of the prior is∏K
k=1 Γ(αk)
Γ(
∑K
k=1 αk)
.
3
Homework questions:
1. Derive the posterior distribution
Π(θ1, . . . , θK |x1, . . . , xn)
Note that the symbol “small capital pi” (Π) is the posterior pdf where the symbol “large capital pi” (
∏K
k=1)
is a product.
2. Derive the “Bayesian” maximum a posterior (MAP) estimator of parameters θ1, . . . , θK .
3. Derive the “frequintist” maximum likelihood estimator (MLE) of parameters θ1, . . . , θK .
Hints:
• Use the “kernel-trick” to derive the posterior. This is an easy problem.
• You should use the Lagrangian multiplier technique to solve for the MAP and MLE.
4  