The Hong Kong University of Science and Technology
Spring Examination, 2019-2020
Course Code: MFIT5010 (Section 01)
Course Title: Statistical Machine Learning
Total Number of Pages: 3
INSTRUCTIONS:
1. Answer ALL of the following questions.
2. The full mark for this examination is 100.
3. Answers without sufficient explanations/steps will receive no marks or only partial marks.
4. A calculator is allowed during the exam. Internet access (e.g., Google search) is NOT allowed.
5. The exam is open-book and open-notes, but each student should work on it independently.
1. (10 marks)
Consider a data set in which each data point $(y_i, \mathbf{x}_i)$, $i = 1, \dots, n$, is associated with a weighting factor $w_i > 0$, so that the sum-of-squares error function becomes
$$\min_{\beta} \; \frac{1}{2} \sum_{i=1}^{n} w_i \left( y_i - \beta^T \mathbf{x}_i \right)^2.$$
Find the optimal solution of the above problem.
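For self-checking a derived answer, a minimal numerical sketch follows. It assumes the standard weighted least-squares closed form $\hat{\beta} = (X^T W X)^{-1} X^T W \mathbf{y}$ with $W = \mathrm{diag}(w_1, \dots, w_n)$; all data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))              # rows are x_i^T
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)        # weighting factors w_i > 0

# Closed form: beta_hat = (X^T W X)^{-1} X^T W y, with W = diag(w_1, ..., w_n);
# X * w[:, None] applies the diagonal weighting without forming W explicitly.
beta_hat = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
print(beta_hat)                          # should be close to beta_true
```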
2. (20 marks)
Consider a data set $D = \{\mathbf{x}_i, y_i\}$, $i = 1, \dots, n$, for classification, where $\mathbf{x}_i = [x_{i1}, \dots, x_{ip}]^T \in \mathbb{R}^p$, $y_i \in \{0, 1\}$, and $n$ is the number of samples. Suppose we choose to minimize the exponential loss function
$$L(y, F) = \frac{1}{n} \sum_{i=1}^{n} \exp\big(-(2y_i - 1) F(\mathbf{x}_i)\big),$$
where $F(\mathbf{x}) = f_0 + \sum_{m=1}^{M} f_m(\mathbf{x})$ is an additive model, $f_0$ is the intercept term, and each $f_m(\mathbf{x})$ is to be fitted by a regression tree.
(a) Estimate $f_0$ and justify your result. (5 marks)
(b) Suppose we have fitted $F(\mathbf{x})$ as $\hat{F}_m(\mathbf{x})$ at the $m$-th step. Using the gradient boosting approach, we would like to find $f_{m+1}(\mathbf{x})$ by minimizing
$$\frac{1}{n} \sum_{i=1}^{n} \big(-g_{m,i} - f_{m+1}(\mathbf{x}_i)\big)^2,$$
where $g_{m,i}$ is the functional gradient evaluated at the current step. Derive the closed form of $-g_{m,i}$. (5 marks)
(c) Suppose we have fitted a tree with $J$ terminal nodes by solving the optimization problem in (b). Let $\hat{T}(\mathbf{x}) = \sum_{j=1}^{J} \hat{c}_j I(\mathbf{x} \in S_j)$ be the regression tree, where $S_j$ is the $j$-th partition region and $I(\cdot)$ is the indicator function. To further reduce the chosen exponential loss function, please re-adjust the constants $\hat{c}_j$ by solving the following optimization problem using the Newton-Raphson method, given the fitted $\hat{F}_m(\mathbf{x})$ and the partition $\{S_j\}_{j=1,\dots,J}$:
$$\min_{\{c_j\}} \; \frac{1}{n} \sum_{i=1}^{n} L\Big( y_i, \; \hat{F}_m(\mathbf{x}_i) + \sum_{j=1}^{J} c_j I(\mathbf{x}_i \in S_j) \Big).$$
Hint: you need to derive the closed forms of the gradient and the Hessian. (10 marks)
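As a companion to (b) and (c), here is a sketch of how one gradient-boosting step for this loss might be organized in code. It assumes the usual convention of dropping the constant $1/n$ factor from the pointwise gradient, and it uses scikit-learn's `DecisionTreeRegressor` merely as a stand-in for "a regression tree with $J$ terminal nodes"; the data are simulated.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels in {0, 1}
y_pm = 2 * y - 1                          # (2 y_i - 1), in {-1, +1}

F = np.zeros(n)                           # current fit F_hat_m(x_i)

# (b) Pseudo-residuals: with the 1/n factor dropped, the negative
# functional gradient is -g_{m,i} = (2 y_i - 1) exp(-(2 y_i - 1) F(x_i)).
r = y_pm * np.exp(-y_pm * F)

# Fit a regression tree with J = 4 terminal nodes to the pseudo-residuals.
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, r)
leaf = tree.apply(X)                      # terminal-node index per sample

# (c) One Newton-Raphson step per terminal node S_j, starting at c_j = 0:
#   gradient g_j = -sum_{i in S_j} (2 y_i - 1) exp(-(2 y_i - 1) F(x_i))
#   Hessian  h_j =  sum_{i in S_j} exp(-(2 y_i - 1) F(x_i))
for j in np.unique(leaf):
    idx = leaf == j
    w = np.exp(-y_pm[idx] * F[idx])
    c_j = (y_pm[idx] * w).sum() / w.sum()  # Newton update c_j = -g_j / h_j
    F[idx] += c_j

print(np.mean((F > 0).astype(int) == y))   # training accuracy after one step
```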
3. (10 marks)
Consider a data set $D = \{\mathbf{x}_i, y_i\}_{i=1,\dots,n}$ for a classification problem, where $n = 100$ samples are in two equal-sized classes, $y_i \in \{0, 1\}$ and $\mathbf{x}_i \in \mathbb{R}^{10 \times 1}$. It is known that the design matrix $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{100 \times 10}$ is a full-rank matrix. Suppose we used standard software to apply logistic regression (a linear model on the logit scale) to this data set, but the software issued the warning message "algorithm did not converge". What is the problem here? Can linear discriminant analysis (LDA) be applied to $D$ without such a numerical problem? Justify your answer.
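The numerical symptom can be reproduced directly. The following self-contained sketch (simulated data, plain Newton-Raphson/IRLS in NumPy rather than any particular package) shows that when the two classes are linearly separable, the logistic-regression likelihood has no finite maximizer, so the iterations drive $\|\hat{\beta}\|$ upward instead of converging.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] > 0).astype(float)   # classes perfectly separated by x_1

beta = np.zeros(p)
for it in range(15):              # plain Newton-Raphson (IRLS) iterations
    eta = np.clip(X @ beta, -30, 30)        # clip to avoid overflow in exp
    mu = 1.0 / (1.0 + np.exp(-eta))
    W = mu * (1.0 - mu)
    grad = X.T @ (y - mu)
    # A tiny ridge keeps the linear solve well-posed as W -> 0;
    # it does not change the qualitative behavior.
    H = X.T @ (X * W[:, None]) + 1e-12 * np.eye(p)
    beta += np.linalg.solve(H, grad)
    print(it, np.linalg.norm(beta))  # norm keeps growing: no finite MLE
```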
4. (10 marks)
Consider a set of $D$ binary variables $x_i$, $i = 1, \dots, D$, each of which is governed by a Bernoulli distribution with parameter $\mu_i$, so that
$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},$$
where $\mathbf{x} = (x_1, \dots, x_D)^T$ and $\boldsymbol{\mu} = (\mu_1, \dots, \mu_D)^T$. Now let us consider a finite mixture of these distributions given by
$$p(\mathbf{x} \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(\mathbf{x} \mid \boldsymbol{\mu}_k),$$
where $\theta = \{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \pi_1, \dots, \pi_K\}$ and $p(\mathbf{x} \mid \boldsymbol{\mu}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$. Assuming the set of parameters $\theta$ is known, what are the mean vector $E[\mathbf{x}]$ and the covariance matrix $\mathrm{Cov}[\mathbf{x}]$?
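A candidate answer can be sanity-checked by Monte Carlo. The closed forms asserted in the comments below, $E[\mathbf{x}] = \sum_k \pi_k \boldsymbol{\mu}_k$ and $\mathrm{Cov}[\mathbf{x}] = \sum_k \pi_k (\Sigma_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T) - E[\mathbf{x}] E[\mathbf{x}]^T$ with $\Sigma_k = \mathrm{diag}(\mu_{ki}(1 - \mu_{ki}))$, are the standard mixture-moment identities; the parameter values are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 4, 3
pi = np.array([0.2, 0.5, 0.3])
mu = rng.uniform(0.1, 0.9, size=(K, D))

# E[x] = sum_k pi_k mu_k
mean = pi @ mu
# Cov[x] = sum_k pi_k (Sigma_k + mu_k mu_k^T) - E[x] E[x]^T,
# where Sigma_k = diag(mu_{ki} (1 - mu_{ki})) is the within-component covariance.
second = sum(pi[k] * (np.diag(mu[k] * (1 - mu[k])) + np.outer(mu[k], mu[k]))
             for k in range(K))
cov = second - np.outer(mean, mean)

# Simulate from the mixture and compare empirical moments with the formulas.
M = 200_000
z = rng.choice(K, size=M, p=pi)
x = (rng.random((M, D)) < mu[z]).astype(float)
print(np.max(np.abs(x.mean(axis=0) - mean)))           # ~0
print(np.max(np.abs(np.cov(x, rowvar=False) - cov)))   # ~0
```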
5. (10 marks)
Consider a scenario with $n = 50$ samples in two equal-sized classes, and $p = 5000$ quantitative predictors (standard Gaussian) that are independent of the class labels.
(a) Based on the misclassification error rate, what is the training error rate of the 1-nearest-neighbor (1-NN) classifier? What is the true test error rate of the 1-nearest-neighbor classifier? (5 marks)
(b) Suppose we carried out the following strategy: (1) select the 100 predictors having the highest correlation with the class labels; (2) use a 1-NN classifier based on the selected 100 predictors for classification; (3) use cross-validation to estimate the misclassification error rate of the 1-NN classifier based on the selected subset $\{\mathbf{x}_i, y_i\}_{i=1,\dots,n}$, where $\mathbf{x}_i \in \mathbb{R}^{100 \times 1}$. Do you think this is a correct way of doing cross-validation? If yes, please justify your answer; if no, please give the correct way of doing cross-validation. (5 marks)
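The contrast in (b) can be demonstrated empirically. The sketch below, on pure-noise data matching the question's sizes ($n = 50$, $p = 5000$, 100 screened predictors, with 5-fold CV chosen arbitrarily), compares screening once on the full data before cross-validation against re-screening inside every fold; the helper functions and fold assignment are ad hoc choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, keep, folds = 50, 5000, 100, 5
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)
rng.shuffle(y)                       # labels independent of predictors

def top_corr(Xtr, ytr, k):
    """Indices of the k predictors most correlated with the labels."""
    yc = ytr - ytr.mean()
    c = np.abs((Xtr - Xtr.mean(0)).T @ yc)
    return np.argsort(c)[-k:]

def knn1_error(Xtr, ytr, Xte, yte):
    """Misclassification rate of 1-NN with Euclidean distance."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return np.mean(ytr[d.argmin(1)] != yte)

fold = np.arange(n) % folds

# WRONG: screen predictors once on ALL data, then cross-validate.
S = top_corr(X, y, keep)
wrong = np.mean([knn1_error(X[fold != f][:, S], y[fold != f],
                            X[fold == f][:, S], y[fold == f])
                 for f in range(folds)])

# RIGHT: redo the screening inside each training fold.
right = 0.0
for f in range(folds):
    tr, te = fold != f, fold == f
    Sf = top_corr(X[tr], y[tr], keep)
    right += knn1_error(X[tr][:, Sf], y[tr], X[te][:, Sf], y[te]) / folds

print(wrong, right)   # wrong << 0.5 (optimistic); right ~ 0.5
```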
6. (10 marks)
Consider the observed data $\{y_i\}_{i=1,\dots,n}$ from the following model: $y_i = \beta^* + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, $\beta^* \in \mathbb{R}$ is the underlying true parameter, and $\sigma^2$ is the true variance of the residual. Now we consider solving the following optimization problem to estimate $\beta^*$:
$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ \frac{1}{2} \sum_{i=1}^{n} (y_i - \beta)^2 + \frac{\lambda}{2} \beta^2 \right],$$
where $\lambda$ is known. Derive the closed forms of the bias $E[\hat{\beta}(\lambda) - \beta^*]$ and the variance $\mathrm{Var}(\hat{\beta}(\lambda))$.
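A derived answer can be checked by simulation. The closed forms quoted in the comments below, $\hat{\beta}(\lambda) = \sum_i y_i / (n + \lambda)$, bias $-\lambda \beta^* / (n + \lambda)$, and variance $n\sigma^2 / (n + \lambda)^2$, follow from setting the derivative of the objective to zero; the values of $n$, $\beta^*$, $\sigma$, and $\lambda$ are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta_star, sigma, lam = 20, 1.5, 2.0, 10.0

# Simulate many independent data sets and apply the closed-form estimator
# beta_hat(lambda) = (sum_i y_i) / (n + lambda).
reps = 200_000
y = beta_star + sigma * rng.normal(size=(reps, n))
beta_hat = y.sum(axis=1) / (n + lam)

# bias = -lambda * beta_star / (n + lambda)
print(beta_hat.mean() - beta_star, -lam * beta_star / (n + lam))
# variance = n * sigma^2 / (n + lambda)^2
print(beta_hat.var(), n * sigma**2 / (n + lam) ** 2)
```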
7. (5 marks)
Suppose that we have three colored boxes: r (red), b (blue), and g (green). Box r contains 3 apples, 4 oranges, and 3 lemons. Box b contains 1 apple, 1 orange, and 0 lemons. Box g contains 3 apples, 3 oranges, and 4 lemons. Now a box is chosen at random with probabilities $\Pr(r) = 0.1$, $\Pr(b) = 0.6$, $\Pr(g) = 0.3$, and a piece of fruit is removed from the box (with equal probability of selecting any of the items in the box). If we observe that the selected fruit is in fact an orange, what is the probability that it came from the green box?
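The computation itself is a direct application of Bayes' rule; the snippet below just carries out the arithmetic with the numbers from the question.

```python
# Priors Pr(box) and likelihoods Pr(orange | box) from the fruit counts.
prior = {"r": 0.1, "b": 0.6, "g": 0.3}
p_orange = {"r": 4 / 10, "b": 1 / 2, "g": 3 / 10}

evidence = sum(prior[b] * p_orange[b] for b in prior)   # Pr(orange)
posterior_g = prior["g"] * p_orange["g"] / evidence     # Pr(g | orange)
print(posterior_g)   # 0.09 / 0.43 ≈ 0.2093
```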
8. (25 marks)
Consider a data set $D = \{x_1, \dots, x_M\}$, where $x_j \in (0, 1)$ and $M = 1{,}000{,}000$. The observed $x$ values independently come from a mixture of the Uniform distribution on $(0, 1)$ (denoted as component 0) and a distribution with density function $\alpha x^{\alpha - 1}$ with unknown parameter $\alpha$ (denoted as component 1). Let $z_j \in \{0, 1\}$ be the latent variable indicating whether $x_j$ is from component 0 ($z_j = 0$) or component 1 ($z_j = 1$). Then the probabilistic model can be written as:
$$\pi_0 = \Pr(z_j = 0): \quad x_j \sim U(0, 1), \ \text{if } z_j = 0,$$
$$\pi_1 = \Pr(z_j = 1): \quad x_j \sim \alpha x^{\alpha - 1}, \ \text{if } z_j = 1.$$
(a) Let $\Theta = \{\pi_0, \pi_1, \alpha\}$ be the set of unknown parameters to be estimated. Write down the incomplete-data log-likelihood function $L(\Theta)$ for this problem. (5 marks)
(b) Derive an EM algorithm for parameter estimation, where $\Theta = \{\pi_0, \pi_1, \alpha\}$ is the parameter set to be estimated and $\{z_1, \dots, z_M\}$ are treated as missing data. (10 marks)
(c) Suppose we have some additional information collected in a vector $A = [A_1, \dots, A_M]$, where $A_j \in \{0, 1\}$. We model the relationship between $A_j$ and $z_j$ as $q_0 = \Pr(A_j = 1 \mid z_j = 0)$ and $q_1 = \Pr(A_j = 1 \mid z_j = 1)$, respectively. Derive an EM algorithm to estimate all the parameters $\{\pi_0, \pi_1, \alpha, q_0, q_1\}$. Again, $\{z_1, \dots, z_M\}$ are treated as missing data. (10 marks)
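To illustrate (b), here is a sketch of how the EM updates could be organized in code, on simulated data. It uses the standard E-step responsibilities $\gamma_j = \Pr(z_j = 1 \mid x_j)$ and the M-step maximizers of the expected complete-data log-likelihood, $\pi_1 = \frac{1}{M} \sum_j \gamma_j$ and $\alpha = -\sum_j \gamma_j / \sum_j \gamma_j \log x_j$; the true parameter values, sample size, and iteration count are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(6)
M, pi1_true, alpha_true = 100_000, 0.7, 3.0
z = rng.random(M) < pi1_true
# Inverse-CDF sampling: if U ~ Uniform(0,1), then U**(1/alpha) has CDF
# x**alpha on (0, 1), i.e., density alpha * x**(alpha - 1).
x = np.where(z, rng.random(M) ** (1.0 / alpha_true), rng.random(M))
x = np.clip(x, 1e-12, 1.0 - 1e-12)   # keep log(x) finite

pi1, alpha = 0.5, 1.5                # arbitrary initial guesses
for _ in range(200):
    # E-step: gamma_j = pi_1 f_1(x_j) / (pi_0 * 1 + pi_1 f_1(x_j)),
    # where f_1(x) = alpha x^(alpha-1) and the Uniform density is 1.
    f1 = alpha * x ** (alpha - 1.0)
    gamma = pi1 * f1 / ((1.0 - pi1) * 1.0 + pi1 * f1)
    # M-step: closed-form maximizers of the expected complete-data
    # log-likelihood.
    pi1 = gamma.mean()
    alpha = -gamma.sum() / (gamma * np.log(x)).sum()

print(1.0 - pi1, pi1, alpha)         # estimates of pi_0, pi_1, alpha
```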
— END —