ST3189
Machine Learning
Suitable for all candidates
Instructions to candidates
This paper contains four questions. Answer ALL FOUR. All questions will be given
equal weight (25%).
The marks in brackets reflect the marks for each part of a question.
Time allowed - Reading Time: None
Writing Time: 2 hours
You are supplied with: Graph paper
You may also use: No additional materials
Calculators: Calculators are allowed in this examination
1. (a) Suppose that $y_i \sim N(\mu, 1)$ for $i = 1, \dots, n$ and that the $y_i$'s are independent.
i. Show that the sample mean estimator $\hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$ is obtained from
minimising the least squares criterion
$$\hat{\mu}_1 = \arg\min_{\mu} \sum_{i=1}^{n} (y_i - \mu)^2,$$
and that $\hat{\mu}_1$ is an unbiased estimator of $\mu$. Also find the variance of $\hat{\mu}_1$. [7 marks]
Answer: The derivative of $\sum_{i=1}^{n}(y_i - \mu)^2$ with respect to $\mu$ is
$-2\sum_{i} y_i + 2n\mu$. Setting it equal to 0 and solving yields $\hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$;
since the second derivative, $2n$, is positive, this is indeed a minimum.
We then get
$$E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \mu,$$
which implies that the estimator is unbiased. For the variance, note that
$$\mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(y_i) = \frac{1}{n}.$$
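A quick Monte Carlo check of these two properties can be run in R; the sample size n and the true mean mu below are arbitrary illustrative choices, not values given in the question.

    # Monte Carlo check: the sample mean is unbiased with variance 1/n
    set.seed(1)
    n  <- 50      # illustrative sample size
    mu <- 2       # illustrative true mean
    mu_hat1 <- replicate(10000, mean(rnorm(n, mean = mu, sd = 1)))
    mean(mu_hat1) # should be close to mu = 2
    var(mu_hat1)  # should be close to 1/n = 0.02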
ii. Consider adding a penalty term to the least squares criterion, and therefore
using the estimator for the mean that minimises
$$\hat{\mu}_2 = \arg\min_{\mu} \left\{ \sum_{i=1}^{n} (y_i - \mu)^2 + \lambda \mu^2 \right\},$$
where $\lambda$ is a non-negative tuning parameter. Derive $\hat{\mu}_2$, find
its bias and show that its variance is lower than that of $\hat{\mu}_1$. [7 marks]
Answer: The derivative w.r.t. $\mu$ is $-2\sum_{i=1}^{n}(y_i - \mu) + 2\lambda\mu$. Setting it to 0 gives
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{n} y_i}{n + \lambda}.$$
Then
$$E(\hat{\mu}_2) = \frac{n}{n + \lambda}\,\mu, \qquad
\mathrm{Bias}(\hat{\mu}_2) = E(\hat{\mu}_2) - \mu = \frac{n}{n + \lambda}\,\mu - \mu = -\frac{\lambda}{n + \lambda}\,\mu,$$
$$\mathrm{var}(\hat{\mu}_2) = \mathrm{var}\left(\frac{\sum_{i=1}^{n} y_i}{n + \lambda}\right)
= \frac{1}{(n + \lambda)^2}\sum_{i=1}^{n} \mathrm{var}(y_i) = \frac{n}{(n + \lambda)^2}.$$
Note that $\mathrm{var}(\hat{\mu}_2) < \mathrm{var}(\hat{\mu}_1)$ since $\frac{n}{(n + \lambda)^2} < \frac{1}{n}$ when $\lambda > 0$.
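The bias-variance trade-off between the two estimators can be illustrated with a short R simulation; the values of n, mu and lambda below are illustrative choices only.

    # Compare the sample mean with the shrinkage estimator sum(y)/(n + lambda)
    set.seed(1)
    n <- 50; mu <- 2; lambda <- 10   # illustrative values
    sims <- replicate(10000, {
      y <- rnorm(n, mean = mu, sd = 1)
      c(mu_hat1 = mean(y), mu_hat2 = sum(y) / (n + lambda))
    })
    rowMeans(sims)      # approx. mu and n*mu/(n + lambda): mu_hat2 is biased towards 0
    apply(sims, 1, var) # approx. 1/n and n/(n + lambda)^2: mu_hat2 has the smaller variance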
(b) Consider the multiple linear regression model
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, n,$$
where $\beta = (\beta_1, \dots, \beta_p)^T$ and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T \sim N(0, \sigma^2 I_n)$.
i. When p is comparable to n, multicollinearity becomes an issue. Describe
the effects of multicollinearity on the estimated coefficients, the associated
standard errors and the significance of the coefficients under the ordinary
maximum likelihood method. [3 marks]
Answer: The estimated coefficients and the associated standard errors can
both become very large (unstable), which in turn makes the coefficients appear non-significant.
ii. The ridge regression estimate of $\beta$ can be obtained by minimising a particular
expression with respect to $\beta$. Write down this expression as well as
an alternative formulation of it. [4 marks]
Answer: The expression is
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2,$$
where $\lambda > 0$ is a tuning parameter.
It can be shown, however, that minimising the above expression is equivalent
to minimising
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s,$$
where $s > 0$ is a tuning parameter.
iii. Explain why ridge regression can potentially correct the problems of
multicollinearity. [2 marks]
Answer: It is because the magnitudes of the estimated coefficients are
restricted by the constraint $\sum_{j=1}^{p}\beta_j^2 \le s$.
iv. Provide an advantage and a disadvantage of ridge regression over the standard
linear regression. [2 marks]
Answer: Ridge regression estimates are biased (a disadvantage) but have lower
variance (an advantage), which can result in a lower test error.
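As an illustration of the ridge criterion above, one possible R sketch uses the glmnet package (alpha = 0 gives the ridge penalty); the simulated data and the grid of lambda values are purely illustrative.

    # Ridge regression sketch (assumes the glmnet package is installed)
    library(glmnet)
    set.seed(1)
    n <- 100; p <- 20
    x <- matrix(rnorm(n * p), n, p)                    # toy predictors
    y <- drop(x %*% c(rep(1, 5), rep(0, p - 5)) + rnorm(n))
    fit <- glmnet(x, y, alpha = 0, lambda = 10^seq(2, -3, length.out = 50))
    coef(fit, s = 0.1)                                 # shrunken coefficients at lambda = 0.1
    cv <- cv.glmnet(x, y, alpha = 0)                   # cross-validation to choose lambda
    cv$lambda.min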
2. Let $x = (x_1, \dots, x_{100})$, with $\sum_{i} x_i = 20$, be a random sample from the Exponential($\lambda$)
distribution with probability density function given by
$$f(x_i \mid \lambda) = \frac{1}{\lambda}\exp\left(-\frac{x_i}{\lambda}\right), \qquad x_i > 0, \ \lambda > 0.$$
Note that $E(x_i) = \lambda$.
(a) Assign the IGamma(0.1, 0.1) prior to $\lambda$ and find the corresponding posterior
distribution. [5 marks]
Answer: The likelihood can be written as
$$f(x \mid \lambda) = \prod_{i=1}^{100}\frac{1}{\lambda}\exp\left(-\frac{x_i}{\lambda}\right)
= \lambda^{-100}\exp\left(-\frac{\sum_{i=1}^{100} x_i}{\lambda}\right)
= \lambda^{-100}\exp\left(-\frac{20}{\lambda}\right),$$
and the prior is $\pi(\lambda) \propto \lambda^{-0.1-1}\exp\left(-\frac{0.1}{\lambda}\right)$. Hence the posterior is
$$\pi(\lambda \mid x) \propto \lambda^{-100}\exp\left(-\frac{20}{\lambda}\right)\lambda^{-0.1-1}\exp\left(-\frac{0.1}{\lambda}\right)
= \lambda^{-100.1-1}\exp\left(-\frac{20.1}{\lambda}\right),$$
which can be recognised as the IGamma(100.1, 20.1) distribution.
(b) Find the Jeffreys' prior for $\lambda$. What is the corresponding posterior distribution? [6 marks]
Answer: We can write $\ell(x \mid \lambda) = \log f(x \mid \lambda) = -100\log\lambda - \frac{\sum_i x_i}{\lambda}$, so that
$$\frac{\partial}{\partial\lambda}\ell(x \mid \lambda) = -\frac{100}{\lambda} + \frac{\sum_i x_i}{\lambda^2}, \qquad
\frac{\partial^2}{\partial\lambda^2}\ell(x \mid \lambda) = \frac{100}{\lambda^2} - \frac{2\sum_i x_i}{\lambda^3},$$
$$I(\lambda) = -E\left[\frac{\partial^2}{\partial\lambda^2}\ell(x \mid \lambda)\right]
= -\frac{100}{\lambda^2} + \frac{2\sum_i E(x_i)}{\lambda^3}
= -\frac{100}{\lambda^2} + \frac{200}{\lambda^2} = \frac{100}{\lambda^2}.$$
Hence Jeffreys' prior is $\pi(\lambda) \propto I(\lambda)^{1/2} \propto (\lambda^{-2})^{1/2} = \lambda^{-1}$. The posterior
becomes
$$\pi(\lambda \mid x) \propto \lambda^{-100}\exp\left(-\frac{20}{\lambda}\right)\lambda^{-1}
= \lambda^{-100-1}\exp\left(-\frac{20}{\lambda}\right),$$
which can be recognised as the IGamma(100, 20) distribution.
(c) Find a Bayes estimator for $\lambda$ based on the priors of parts (a) and (b). [3 marks]
Answer: A standard Bayes estimator is the posterior mean, which (see appendix) is equal to
$$\frac{20.1}{100.1 - 1} = 0.203 \qquad \text{or} \qquad \frac{20}{100 - 1} = 0.202,$$
depending on the chosen prior.
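The posterior means can also be checked numerically in R by exploiting the Gamma/IGamma relationship in the appendix (if $\nu \sim$ Gamma($\alpha$, $\beta$) then $1/\nu \sim$ IGamma($\alpha$, $\beta$)); the number of draws below is an arbitrary choice.

    # Monte Carlo check of the posterior means under the two priors
    set.seed(1)
    nu_a <- rgamma(1e6, shape = 100.1, rate = 20.1)  # posterior from part (a)
    nu_b <- rgamma(1e6, shape = 100,   rate = 20)    # posterior from part (b)
    mean(1 / nu_a)   # approx. 20.1/99.1 = 0.203
    mean(1 / nu_b)   # approx. 20/99 = 0.202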
(d) Let $y$ represent a future observation from the same model. Find the predictive
distribution of $y$ based on either the prior of part (a) or that of part (b). [6 marks]
Answer: Using the posterior of part (b),
$$f(y \mid x) = \int_{\Lambda} f(y \mid \lambda)\,\pi(\lambda \mid x)\,d\lambda
= \int_{0}^{\infty} \frac{1}{\lambda}\exp\left(-\frac{y}{\lambda}\right)\frac{20^{100}}{\Gamma(100)}\lambda^{-100-1}\exp\left(-\frac{20}{\lambda}\right)d\lambda$$
$$= \frac{20^{100}}{\Gamma(100)}\int_{0}^{\infty} \lambda^{-101-1}\exp\left(-\frac{20 + y}{\lambda}\right)d\lambda
= \frac{20^{100}}{\Gamma(100)}\cdot\frac{\Gamma(101)}{(20 + y)^{101}},$$
for $y > 0$.
(e) Describe how you can calculate the mean of the predictive distribution in
software such as R. [5 marks]
Answer: Note that we can write the mean of the predictive distribution as
$$E(y \mid x) = \int_{0}^{\infty}\int_{0}^{\infty} y\, f(y \mid \lambda)\,\pi(\lambda \mid x)\,d\lambda\,dy.$$
Hence a Monte Carlo scheme would draw samples $y^{(k)}$, $k = 1, \dots, N$, from
$f(y \mid \lambda)\,\pi(\lambda \mid x)$ for some large $N$ and then take
$$\hat{E}(y \mid x) = \frac{1}{N}\sum_{k=1}^{N} y^{(k)}.$$
To do that in R one can:
i. Draw, say, N = 10,000 samples from the Gamma(100, 20) distribution using
nu = rgamma(10000, shape = 100, rate = 20).
ii. Invert those samples to turn them into samples from the IGamma(100, 20) posterior
using lambda = 1/nu.
iii. Using each of the samples in lambda, draw y from the model y ∼ Exponential with
mean lambda, using y = rexp(10000, rate = 1/lambda).
iv. Calculate the sample mean of the values in y using mean(y).
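Putting these steps together, a minimal R sketch (with the Jeffreys-prior posterior IGamma(100, 20) and an arbitrary choice of N = 10,000 draws) is:

    # Monte Carlo estimate of the predictive mean E(y | x)
    set.seed(1)
    N      <- 10000
    nu     <- rgamma(N, shape = 100, rate = 20)  # nu ~ Gamma(100, 20)
    lambda <- 1 / nu                             # lambda ~ IGamma(100, 20), the posterior
    y      <- rexp(N, rate = 1 / lambda)         # y | lambda ~ Exponential with mean lambda
    mean(y)                                      # should be close to E(lambda | x) = 20/99 = 0.202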
3. (a) i. Suppose a non-linear model can be written as
$$Y = f(X) + \epsilon,$$
where $\epsilon$ has zero mean and variance $\sigma^2$, and is independent of $X$. Show
that the expected test error, conditional on $X = x$, can be decomposed into the
following three parts:
$$E\left[\left(Y - \hat{f}(X)\right)^2 \,\middle|\, X = x\right] = \sigma^2 + \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right],$$
where $\hat{f}(\cdot)$ is estimated from the training data. [7 marks]
Answer: Since $Y = f(x) + \epsilon$, we can write
$$E\left[\left(Y - \hat{f}(X)\right)^2 \,\middle|\, X = x\right]
= E\left[\left(f(x) - \hat{f}(X) + \epsilon\right)^2\right]$$
$$= E\left[\left(\hat{f}(X) - E\left(\hat{f}(X)\right) + E\left(\hat{f}(X)\right) - f(x) - \epsilon\right)^2\right]$$
$$= E\left[\left(\hat{f}(X) - E\left(\hat{f}(X)\right)\right)^2\right]
+ E\left[\left(E\left(\hat{f}(X)\right) - f(x)\right)^2\right]
+ 2\,E\left[\left(\hat{f}(X) - E\left(\hat{f}(X)\right)\right)\left(E\left(\hat{f}(X)\right) - f(x)\right)\right]
+ E\left(\epsilon^2\right)$$
$$= \mathrm{Var}\left[\hat{f}(x)\right] + \mathrm{Bias}\left[\hat{f}(x)\right]^2 + 0 + \sigma^2.$$
The third equality uses the fact that $\epsilon$ is independent of $\hat{f}(X)$ and has mean zero, so
all cross terms involving $\epsilon$ vanish. The fourth equality uses the definitions of variance
and bias, $\mathrm{var}(\epsilon) = E(\epsilon^2) = \sigma^2$, and the fact that the remaining cross-product term
equals 0 because $E\left[\hat{f}(X) - E\left(\hat{f}(X)\right)\right] = 0$.
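The decomposition can also be verified numerically. In the R sketch below an intentionally too-simple linear fit is repeatedly re-estimated on fresh training sets and evaluated at a single test point; the true function, noise level and sample size are all illustrative choices.

    # Numerical illustration of the bias-variance decomposition at a single point x0
    set.seed(1)
    f     <- function(x) sin(2 * pi * x)   # illustrative true regression function
    sigma <- 0.3
    x0    <- 0.6
    n     <- 30
    fhat  <- replicate(5000, {
      x <- runif(n)
      y <- f(x) + rnorm(n, sd = sigma)
      fit <- lm(y ~ x)                     # deliberately misspecified (linear) fit
      predict(fit, newdata = data.frame(x = x0))
    })
    y_new <- f(x0) + rnorm(5000, sd = sigma)          # independent test responses at x0
    mean((y_new - fhat)^2)                            # E[(Y - fhat(X))^2 | X = x0]
    sigma^2 + (mean(fhat) - f(x0))^2 + var(fhat)      # sigma^2 + Bias^2 + Var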
ii. To estimate the test error rate, one can use the 10-fold Cross Validation
(CV) approach or an information criterion approach, e.g. AIC or BIC. What
are the main advantage and disadvantage of using the 10-fold CV approach
in comparison with AIC or BIC? [3 marks]
Answer: The 10-fold CV approach is computationally expensive because one
needs to fit the model 10 times, whereas only one fit is needed for AIC or BIC.
On the other hand, CV approaches provide direct estimates of the test error and make fewer
assumptions about the true model. For AIC or BIC, it can also be hard to
specify the model degrees of freedom.
iii. State which one of AIC and BIC tends to select the smaller model and
explain the reason. [3 marks]
Answer: BIC places a heavier penalty on models with many variables, since its
penalty grows with the sample size (log n per parameter versus 2 for AIC), and
hence results in the selection of smaller models than AIC.
(b) i. The tree in Figure 1 provides a regression tree based on a dataset of patient
visits for upper respiratory infection. The aim is to identify factors
associated with a physician's rate of prescribing, which is a continuous variable.
The variables appearing in the regression tree are private: the percent
of privately insured patients a physician has, black: the percent of black
patients a physician has, and fam: whether or not the physician specialises
in family medicine. Provide an interpretation of this tree. [5 marks]
Figure 1: Regression tree for Question 3 (b) i.
Answer: Among those with privately insured patients, physicians with black patient
populations had an average prescribing rate of 48.72%, while the prescribing rate of
physicians with non-black patient populations was 54.60%. Among those without
privately insured patients, being a family medicine doctor raises the average
prescribing rate by approximately 10 percentage points, to 54.48% (vs 44.67%),
indicating that family medicine doctors systematically prescribe more antibiotics
than non-family medicine doctors.
ii. Consider the regression tree of Figure 2, where the response variable is the
log salary of a baseball player, based on the number of years that he has
played in the major leagues (Years) and the number of hits that he made
in the previous year (Hits). Create a diagram that represents the partition
of the predictor space according to this tree.
Answer: The requested diagram showing the partition of the predictor
space according to this tree is provided in Figure 3.
[Tree plot: root split Years < 4.5 with left leaf 5.11; for Years ≥ 4.5, split Hits < 117.5 with leaves 6.00 and 6.74.]
Figure 2: Regression tree for Question 3 (b) ii.
[Plot of the partition: horizontal axis Years, vertical axis Hits; the three regions are labelled with the predicted values 5.11, 6.00 and 6.74.]
Figure 3: Partition of the predictor space according to the tree in Figure 2.
4. (a) i. Consider the following data:
10 20 40 80 85 121 160 168 195
Use the k-means algorithm with k = 3 to cluster the data set. Use the
Euclidean distance to measure the distance between the data points. Sup-
pose that the points 160, 168, and 195 were selected as the initial cluster
means. Work from these initial values to determine the final clustering for
the data. Provide results from each iteration. [9 marks]
Answer: The k-means clustering can be performed via the following steps:
• If we work through this you will see that initially the observations (10, 20,
40, 80, 85, 121, 160) are closest to cluster centre 1, the observation 168
is closest to cluster centre 2, whereas the observation 195 is closest to
cluster centre 3.
• The new cluster centres will then be the averages of the observations
belonging to each cluster, which are 73.71, 168 and 195 respectively.
• Now observations (10, 20, 40, 80, 85) are closest to cluster centre 1,
the observations (121, 160, 168) are closest to cluster centre 2, whereas
cluster centre 3 has the observation 195.
• The new cluster centre for cluster 1 will then be the average of the
observations (10, 20, 40, 80, 85), which is 47, the new cluster centre
for cluster 2 will be the average of observations (121, 160, 168), which
is 149.67. Finally, the centre of cluster 3 will remain unchanged at 195.
• As with the previous step, observations (10, 20, 40, 80, 85) are closest
to cluster centre 1, the observations (121, 160, 168) are closest to
cluster centre 2, whereas cluster centre 3 has the observation 195.
• Since there has been no change in the clusters k-means stops at this
point with final cluster assignments of (10, 20, 40, 80, 85), (121, 160,
168), 195 and centres of 47, 149.67, 195.
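The hand computation can be reproduced in R with the built-in kmeans() function, using the classic (Lloyd) algorithm and the initial centres given in the question:

    # Reproduce the worked example with Lloyd's k-means and the stated initial centres
    x   <- c(10, 20, 40, 80, 85, 121, 160, 168, 195)
    fit <- kmeans(x, centers = c(160, 168, 195), algorithm = "Lloyd")
    fit$cluster   # final assignments: (10,...,85), (121, 160, 168), (195)
    fit$centers   # final centres: 47, 149.67 and 195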
ii. What are the main disadvantages of k-means clustering? Why might one
want to consider hierarchical clustering as an alternative? [4 marks]
Answer: One drawback of the k-means algorithm is that we can
only find a local optimum rather than a global optimum, so the results
obtained will depend on the initial cluster assignment of each observation.
Second, k-means clustering requires us to pre-specify the number of clusters k.
In hierarchical clustering there is no need to set k beforehand. Hierarchical
clustering may also be appealing over k-means clustering in that
it offers a tree-based depiction of the data, called a dendrogram.
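For comparison, a hierarchical clustering of the same nine points and its dendrogram can be produced with base R; the complete-linkage choice below is just one possible option.

    # Hierarchical clustering of the same data
    x  <- c(10, 20, 40, 80, 85, 121, 160, 168, 195)
    hc <- hclust(dist(x), method = "complete")
    plot(hc)            # dendrogram
    cutree(hc, k = 3)   # cluster labels when the tree is cut into 3 groups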
(b) i. Data are available for students taking a BSc degree in Data Science, and
in particular the variables X1: average mark on project coursework, X2:
average hours studied per course, and Y: whether the student gets a degree
with distinction. The estimated coefficients of a logistic regression model were
β0 = −5, β1 = 0.02, β2 = 0.1. Estimate the probability that a student who scores
on average 50% on project coursework and studies 30 hours on average for each
course gets a degree with distinction. How many hours would this student
need to study on average to have a 50% chance of getting a degree
with distinction? [6 marks]
Answer: We have
$$p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)},$$
where $X_1$ = average coursework mark, $X_2$ = average hours studied per
course, $\beta_0 = -5$, $\beta_1 = 0.02$ and $\beta_2 = 0.1$.
For $X_1 = 50$ and $X_2 = 30$ we get
$$p(X) = \frac{\exp(-5 + 0.02 \times 50 + 0.1 \times 30)}{1 + \exp(-5 + 0.02 \times 50 + 0.1 \times 30)} = 26.89\%.$$
For $X_1 = 50$ and $X_2 = x$ we get
$$p(X) = \frac{\exp(-5 + 0.02 \times 50 + 0.1\,x)}{1 + \exp(-5 + 0.02 \times 50 + 0.1\,x)}, \text{ or else}$$
$$0.50 = \frac{\exp(-4 + 0.1\,x)}{1 + \exp(-4 + 0.1\,x)}, \text{ which gives } x = 40 \text{ hours.}$$
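Both calculations can be checked in R using plogis(), the logistic distribution function:

    # Probability of a distinction for X1 = 50, X2 = 30, and hours needed for a 50% chance
    beta <- c(-5, 0.02, 0.1)
    plogis(sum(beta * c(1, 50, 30)))   # 0.2689, i.e. 26.89%
    (0 - (-5 + 0.02 * 50)) / 0.1       # x = 40 hours solves -4 + 0.1 * x = 0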
ii. Suppose that we wish to predict whether a high quality chip produced in
a factory will pass the quality control ('Pass' or 'Fail') based on x, the
measurement of its diameter. Diameter measurements are available for a
large number of chips. After examining them it turns out that the mean
value of x for chips that passed the quality control was 5mm, while the
mean for those that didn't was 7mm. Moreover, the variance of x for
these two groups of chips was σ2 = 1. Finally, 70% of the produced
chips passed the quality control. Assuming that x follows the normal
distribution within each group, predict the probability that a chip with x = 5.8
will pass the quality control. [6 marks]
Answer: For the probability of passing the quality control we get
$$p_{\mathrm{pass}}(x) = \frac{\pi_{\mathrm{pass}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{pass}})^2\right)}
{\pi_{\mathrm{pass}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{pass}})^2\right) + \pi_{\mathrm{fail}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{fail}})^2\right)}$$
$$= \frac{0.70\exp\left(-\frac{1}{2 \times 1}(x - 5)^2\right)}
{0.70\exp\left(-\frac{1}{2 \times 1}(x - 5)^2\right) + 0.30\exp\left(-\frac{1}{2 \times 1}(x - 7)^2\right)}.$$
Setting $x = 5.8$ we get
$$p_{\mathrm{pass}}(5.8) = \frac{0.70\exp\left(-\frac{1}{2}(5.8 - 5)^2\right)}
{0.70\exp\left(-\frac{1}{2}(5.8 - 5)^2\right) + 0.30\exp\left(-\frac{1}{2}(5.8 - 7)^2\right)} = 77.68\%.$$
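The same posterior probability can be computed in R with dnorm(), applying Bayes' theorem directly:

    # P(pass | x = 5.8) under normal class densities with common variance 1
    x   <- 5.8
    num <- 0.70 * dnorm(x, mean = 5, sd = 1)
    den <- num + 0.30 * dnorm(x, mean = 7, sd = 1)
    num / den   # 0.7768, i.e. 77.68%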
Appendix: Table of Common Distributions
Binomial(n, θ): number of successes in n independent Bernoulli trials with probability of success θ.
• $f(x \mid \theta) = P(x \mid \theta) = \frac{n!}{x!(n-x)!}\theta^x(1-\theta)^{n-x}$ for $x = 0, 1, \dots, n$.
• $E(X) = n\theta$, $\mathrm{Var}(X) = n\theta(1-\theta)$.
NegBin(r, θ): number of successes before the rth failure in repeated independent Bernoulli trials.
• $f(x \mid \theta) = P(x \mid \theta) = \binom{x+r-1}{x}\theta^x(1-\theta)^r$ for $x = 0, 1, \dots$.
• $E(X) = \frac{r\theta}{1-\theta}$, $\mathrm{Var}(X) = \frac{r\theta}{(1-\theta)^2}$.
Poisson(λ): often used for the number of events which occur in an interval of time.
• $f(x \mid \lambda) = P(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, \dots$.
• $E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$.
Normal N(µ, σ2): characterized by its first two moments.
• $f(x) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ for $-\infty < x < \infty$.
• $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$.
Beta(α, β): characterized by parameters α > 0 and β > 0.
• $f(x) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ for $0 \le x \le 1$, where $B(\alpha,\beta) = \int_0^1 y^{\alpha-1}(1-y)^{\beta-1}dy = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.
• $E(X) = \frac{\alpha}{\alpha+\beta}$, $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$.
Gamma(α, β): characterized by parameters α > 0 and β > 0.
• $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x)$ for $0 \le x < \infty$, where $\Gamma(t) = \int_0^\infty y^{t-1}e^{-y}dy$.
• $E(X) = \frac{\alpha}{\beta}$, $\mathrm{Var}(X) = \frac{\alpha}{\beta^2}$.
IGamma(α, β): characterized by parameters α > 0 and β > 0. If X ∼ Gamma(α, β), then 1/X ∼ IGamma(α, β).
• $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha-1}\exp\left(-\frac{\beta}{x}\right)$ for $0 \le x < \infty$.
• $E(X) = \frac{\beta}{\alpha-1}$ (for $\alpha > 1$), $\mathrm{Var}(X) = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ (for $\alpha > 2$).
Note that $\Gamma(n) = (n-1)!$ for a positive integer $n$.
END OF PAPER