FIT5149 Sample Exam
Question                                             Marks  Score
Multiple Choice Questions                             23
Regression & Classification                           18
Tree-based Method                                      8
Model selection, Regularisation and Dimensionality    10
Hierarchical Clustering and Splines                    6
Total:                                                65
1. Multiple Choice Questions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Total: 23 marks
For each multiple choice question, you should check all that apply: the number of correct choices can be one or more. Marks will be given only to an answer without any wrong choices. For example, if the true answer is A and B, an answer including C or D will receive zero marks. Please write your answers in the space provided.
(1)(2 marks) Which of the following statement(s) is/are correct regarding bias and variance?
A. Models which overfit generally have a high bias
B. Models which under-fit generally have a low variance
C. Models with low bias generally fit the training data better
D. Models with high variance generally fit both the training and the testing data better
(1) B, C
(2)(2 marks) The following histogram was generated from data drawn from a Beta distribution:
x ∼ Beta(0.5, 0.5)
[Figure: histogram of the Beta distribution, with x on the horizontal axis (0.0–1.0) and Frequency on the vertical axis (0–200).]
Which QQ plot below corresponds to the distribution shown by the histogram?
[Figure: four normal QQ plots, panels A–D, each plotting Sample Quantiles against Theoretical Quantiles (−3 to 3). Approximate sample-quantile ranges: A: 0.0–0.5; B: 700–1300; C: 0.5–1.0; D: 0.0–1.0.]
(2) D
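For reference, a minimal base-R sketch (the sample size and seed are my own choices) that reproduces this kind of histogram and the corresponding normal QQ plot:
set.seed(1)
x <- rbeta(1000, shape1 = 0.5, shape2 = 0.5)      # draw from Beta(0.5, 0.5)
hist(x, main = "Histogram of Beta distribution")  # U-shaped, support [0, 1]
qqnorm(x)  # sample quantiles flatten near 0 and 1, as in panel D
qqline(x)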
(3)(2 marks) Suppose you have m = 30 training examples with n = 5 features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is
β = (X^T X)^{−1} X^T y
For the given values of m and n, what are the dimensions of β, X, and y in this equation?
A. β is 6 × 6, X is 30 × 6, y is 30 × 6
B. β is 6 × 1, X is 30 × 6, y is 30 × 1
C. β is 5 × 5, X is 30 × 5, y is 30 × 5
D. β is 5 × 1, X is 30 × 5, y is 30 × 1
(3) B
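A quick dimension check in R (on simulated data, not data from the exam) confirms the answer; note the all-ones column added for the intercept:
set.seed(1)
m <- 30; n <- 5
X <- cbind(1, matrix(rnorm(m * n), m, n))  # 30 x 6 design matrix
y <- rnorm(m)                              # length-30 response
beta <- solve(t(X) %*% X, t(X) %*% y)      # (X^T X)^{-1} X^T y
dim(X)     ## [1] 30  6
dim(beta)  ## [1] 6 1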
(4)(2 marks) Given (x1, y1), (x2, y2), . . . , (xn, yn), which of the following is the total sum of squares (TSS)?
A. ∑_{i=1}^n (yi − f(xi))²
B. ∑_{i=1}^n (yi − ȳ)², where ȳ = (∑_{i=1}^n yi)/n
C. √( (1/(n − p − 1)) ∑_{i=1}^n (yi − f(xi))² )
D. ∑_{i=1}^n (f(xi) − ȳ)², where ȳ = (∑_{i=1}^n yi)/n
(4) B
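As a concrete illustration on simulated data: the TSS (option B) depends only on y, while option A is the residual sum of squares (RSS) of a fitted model f:
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
f <- fitted(lm(y ~ x))
TSS <- sum((y - mean(y))^2)  # option B: variability of y around its mean
RSS <- sum((y - f)^2)        # option A: variability left after the fit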
(5)(2 marks) Which of the following statements are true? Check all that apply.
A. Linear regression always works well for classification if you classify by using a threshold on the predictions made by linear regression.
B. The cost function L(β) for logistic regression trained with more than one observation is always greater than or equal to zero.
C. If the order of the labels must be retained in an analysis, you can use multinomial logistic regression.
D. The sigmoid function g(z) = 1/(1 + e^{−z}) is never greater than one.
(5) B,D
(6)(2 marks) Select all of the following methods that are designed for the multi-class classification task (e.g., predicting an outcome with 5 categories).
A. LDA
B. QDA
C. Logistic regression
D. Multinomial logistic regression
(6) A, B, D
(7)(2 marks) Suppose you ran logistic regression twice with L2 regularisation, once with the tuning (or regularisation) parameter λ = 0, and once with λ = 1. One of the times, you got parameters β^T = (81.47, 12.69), and the other time you got β^T = (13.01, 0.91). However, you forgot which value of λ corresponds to which value of β. Which one do you think corresponds to λ = 0?
A. β^T = (81.47, 12.69)
B. β^T = (13.01, 0.91)
(7) A
When λ is set to 1, the L2 penalty discourages large values of β, so the parameters obtained will in general be smaller in magnitude.
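This can be checked with, for example, the glmnet package (an illustration on simulated data, not the data behind this question; alpha = 0 selects the ridge/L2 penalty):
library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 2), 200, 2)
y <- rbinom(200, 1, plogis(3 * X[, 1] + X[, 2]))  # simulated binary outcome
fit <- glmnet(X, y, family = "binomial", alpha = 0, lambda = c(1, 0))
coef(fit, s = 0)  # lambda = 0: essentially unpenalised, larger coefficients
coef(fit, s = 1)  # lambda = 1: coefficients shrunk towards zero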
(8)(2 marks) Suppose we estimate the regression coefficients in a linear regression model by minimizing
∑_{i=1}^n (yi − β^T xi)² + λ‖β‖₂²
for a particular value of λ. As we increase λ from 0, which of the following will steadily increase?
Check all that apply.
A. The training RSS
B. The test RSS
C. Variance
D. (Squared) bias
E. The irreducible error
(8) A, D
(9)(2 marks) Assume you have a training dataset consisting of N observations and D features. You used the closed-form solution to fit a multiple linear regression model using ridge regression. To choose the tuning parameter λ, you use LOOCV, searching over L values of λ. Let Cost(N, D) be the computational cost of running ridge regression with N data points and D features. Which of the following represents the computational cost of your LOOCV procedure?
A. L×D × Cost(N − 1, D)
B. L×D × Cost(N,D)
C. L×N × Cost(N,D)
D. L×N × Cost(N − 1, D)
E. N × Cost(N,D)
F. L× Cost(N,D)
(9) D
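The counting behind the answer can be made explicit with a closed-form ridge LOOCV loop on simulated data (the grid of L = 4 values of λ and the problem sizes are my own choices):
set.seed(1)
N <- 20; D <- 3
X <- matrix(rnorm(N * D), N, D); y <- rnorm(N)
lambdas <- c(0.01, 0.1, 1, 10)                # L candidate values of lambda
cv_err <- sapply(lambdas, function(lambda) {  # L iterations ...
  mean(sapply(1:N, function(i) {              # ... times N size-1 folds
    b <- solve(t(X[-i, ]) %*% X[-i, ] + lambda * diag(D),
               t(X[-i, ]) %*% y[-i])          # one Cost(N - 1, D) ridge fit
    (y[i] - X[i, ] %*% b)^2                   # error on the held-out point
  }))
})
lambdas[which.min(cv_err)]                    # selected lambda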
(10)(2 marks) Which of the following methods grows the tree sequentially?
A. Random forests
B. Bagging (bootstrap aggregation)
C. Boosting
D. None
(10) C
(11)(2 marks) Which of the following is true: For a fixed model complexity, in the limit of an infinite amount of
training data,
A. The noise goes to 0.
B. Bias goes to 0.
C. Variance goes to 0.
D. Training error goes to 0.
(11) C
(12)(1 mark) Select any of the following residual plot(s) that satisfy the assumptions of linear regression.
[Figure: four Residuals vs Fitted plots, panels (A)–(D). Approximate ranges: (A) fitted 0–400, residuals −50–100; (B) fitted −20–100, residuals −60–60; (C) fitted −10–30, residuals −3–3; (D) fitted −300–100, residuals −15000–10000.]
(12) C
2. Regression & Classification: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Total: 18 marks
(1) In a multivariate regression analysis, the outcome Y was fitted with predictors X1 and X2 using the R code lm(y ~ X1*X2). X2 takes the value 0 or 1, and X1 is a continuous variable ranging between 50 and 200. The researchers obtained the following regression results:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   24.49627   2.73893     8.944   1.07e-09 ***
X1            -0.04143   0.01379    -3.011   0.00547  **
X2            14.50458   4.58160     3.166   0.00371  **
X1:X2         -0.11627   0.04130    -2.822   0.00868  **
(a).(2 marks) What does the intercept 24.49627 indicate?
Solution:
The expected value of Y when all predictors are zero, i.e., when X1 = 0 and X2 = 0.
(b).(2 marks) Calculate the upper bound of the 95% confidence interval for the change in Y associated with a 10-unit increase of X1, when X2 == 0. Round your result to the 3rd decimal place.
Solution:
# 10 * estimate + 1.96 * 10 * standard error
-0.04143*10 + 1.96*0.01379*10
## [1] -0.144016
Rounded to the 3rd decimal place: -0.144.
(c).(2 marks) Compared with when X2 = 0, is the average change in Y associated with a 10-unit change in X1 larger or smaller when X2 = 1? Justify your answer.
Solution: When X2 = 1, a 10-unit increase in X1 is estimated to be associated with a 1.58 decrease in Y, versus a 0.41 decrease when X2 = 0; hence the effect is larger in magnitude.
-0.04143*10 - 0.11627*10
## [1] -1.577
(2) This problem has to do with odds. In the binary classification scenario, the odds are
p(X) / (1 − p(X)) = e^{β^T X}
(a).(2 marks) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card
payment will in fact default?
Solution:
p(X) = 0.37 / (1 + 0.37) ≈ 0.27
(b).(2 marks) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
Solution:
p(X) / (1 − p(X)) = 0.16 / (1 − 0.16) ≈ 0.19
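Both parts use the same two conversions; as a small R sketch:
odds_to_prob <- function(odds) odds / (1 + odds)  # p = odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)           # odds = p / (1 - p)
odds_to_prob(0.37)   # part (a)
## [1] 0.270073
prob_to_odds(0.16)   # part (b)
## [1] 0.1904762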
(3) The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. The last column contains the Euclidean distance between each observation and the test point X1 = X2 = X3 = 0.
Obs.   X1   X2   X3   Y       Distance
1       0    3    0   red     3
2       2    0    0   red     2
3       0    1    3   red     3.2
4       0    1    2   green   2.2
5      −1    0    1   green   1.4
6       1    1    1   red     1.7
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(a).(2 marks) What is our prediction with K = 1? Why?
Solution: Green. Observation 5 is the closest neighbour for K = 1.
(b).(2 marks) What is our prediction with K = 3? Why?
Solution: Red. Observations 2, 5, 6 are the closest neighbours for K = 3. 2 is Red, 5 is
Green, and 6 is Red.
(c).(2 marks) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the
best value for K to be large or small? Why?
Solution: Small. A small K would be flexible for a non-linear decision boundary, whereas
a large K would try to fit a more linear boundary because it takes more points into
consideration.
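The answers to (a) and (b) can be double-checked with any KNN implementation; a sketch using class::knn (the package choice is mine) on the table above:
library(class)
train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                    X2 = c(3, 0, 1, 1, 0, 1),
                    X3 = c(0, 0, 3, 2, 1, 1))
Y <- factor(c("red", "red", "red", "green", "green", "red"))
test <- data.frame(X1 = 0, X2 = 0, X3 = 0)
knn(train, test, Y, k = 1)  # green: observation 5 is the single nearest
knn(train, test, Y, k = 3)  # red: observations 2, 5, 6 vote 2-1 for red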
(4)(2 marks) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or
irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods
towards more flexible approaches. The x-axis should represent the amount of flexibility in the
method, and the y-axis should represent the values for each curve. There should be five curves.
Make sure to label each one.
Solution: [Sketch omitted.] On a plot with flexibility on the x-axis and error on the y-axis: (squared) bias decreases as flexibility increases, variance increases, training error decreases monotonically, test error is U-shaped, and the Bayes (irreducible) error is a horizontal line below the test error curve.
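A stylised version of the plot can be drawn in R; the functional forms below are illustrative choices, and only the shapes of the curves matter:
flex <- seq(0, 1, length.out = 100)   # model flexibility (x-axis)
bias2 <- (1 - flex)^2                 # squared bias: decreases
variance <- flex^2                    # variance: increases
bayes <- rep(0.15, 100)               # Bayes (irreducible) error: flat
test_err <- bias2 + variance + bayes  # test error: U-shaped
train_err <- 0.8 * (1 - flex)^2       # training error: decreases monotonically
matplot(flex, cbind(bias2, variance, bayes, test_err, train_err),
        type = "l", lty = 1, xlab = "Flexibility", ylab = "Error")
legend("top", c("squared bias", "variance", "Bayes error", "test error",
                "training error"), lty = 1, col = 1:5)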
3. Tree-based Method: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Total: 8 marks
(1) Imagine we are training a decision tree, and arrive at a node. Each data point is (x1, x2, y), where
x1, x2 are features, and y is the class label. The data at this node is
## x1 x2 y
## 1 0 1 +1
## 2 1 0 +1
## 3 0 1 +1
## 4 1 1 -1
(a).(2 marks) What is the classification error at this node (assuming a majority-vote based classifier)?
Solution:
1/4 = 0.25. Three of the four points are labelled +1, so a majority vote predicts +1 and misclassifies one point.
(b).(2 marks) If we further split on x2, what is the classification error?
Solution:
Still 0.25. The branch with x2 = 1 contains {+1, +1, −1} (one error under a majority vote) and the branch with x2 = 0 contains {+1} (no errors), so 1 of the 4 points is misclassified.
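Both error rates can be verified directly; the majority-vote error is the fraction of points outside the majority class:
y <- c(1, 1, 1, -1)            # labels at the node
x2 <- c(1, 0, 1, 1)            # the x2 feature
1 - max(table(y)) / length(y)  # error at the node: 0.25
# errors within each branch of the x2 split, summed and divided by n: 0.25
sum(tapply(y, x2, function(g) length(g) - max(table(g)))) / length(y)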
(2) Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We
then apply a classification algorithm to each bootstrapped sample and, for a specific value of X,
produce 10 estimates of p(Class is Red | X) :
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75.
There are two common ways to combine these results together into a single class prediction. One is
the majority vote approach. The second approach is to classify based on the average probability.
Assume that if p(Class is Red | X) > 0.5, the sample is assigned to the Red class; otherwise it is assigned to the Green class.
(a).(2 marks) In this example, what is the final classification under the majority vote approach?
Solution: Red, as it is the most commonly occurring class among the 10 predictions (6
for Red vs 4 for Green).
(b).(2 marks) In this example, what is the final classification under the average probability approach?
Solution: Green, as the average probability is 0.45, which is below the 0.5 threshold.
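In R, the two combination rules applied to the ten estimates:
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
ifelse(mean(p > 0.5) > 0.5, "Red", "Green")  # majority vote: 6 of 10 exceed 0.5
## [1] "Red"
ifelse(mean(p) > 0.5, "Red", "Green")        # average probability is 0.45
## [1] "Green"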
4. Model selection, Regularisation and Dimensionality: . . . . . . . . . . . . .Total: 10 marks
(1) We now review k-fold cross-validation.
(a).(3 marks) Explain how k-fold cross-validation is implemented.
Solution:
k-fold cross-validation is implemented by taking the n observations and randomly splitting them into k non-overlapping groups, each of length (approximately) n/k. Each group in turn acts as a validation set, with the remainder (of length n − n/k) acting as the training set. The test error is then estimated by averaging the k resulting MSE estimates.
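A minimal manual implementation of this description on simulated data (k = 5; in practice cv.glm from the boot package does the same job):
set.seed(1)
n <- 100; k <- 5
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
fold <- sample(rep(1:k, length.out = n))  # random non-overlapping groups
mse <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, subset = fold != j)    # train on the other k - 1 folds
  mean((y[fold == j] - predict(fit, data.frame(x = x[fold == j])))^2)
})
mean(mse)                                 # CV estimate of the test error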
(b).(4 marks) What are the advantages and disadvantages of k-fold cross-validation relative to the validation
set approach and LOOCV respectively?
Solution:
The validation set approach: This approach has two main drawbacks compared to k-fold cross-validation. First, the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. Second, only a subset of the observations is used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
LOOCV: The LOOCV approach is a special case of k-fold cross-validation in which k = n. It has two drawbacks compared to k-fold cross-validation. First, it requires fitting the potentially computationally expensive model n times, whereas k-fold cross-validation requires the model to be fitted only k times. Second, although LOOCV gives approximately unbiased estimates of the test error (since each training set contains n − 1 observations), it has higher variance than k-fold cross-validation: we are averaging the outputs of n fitted models trained on almost identical sets of observations, so these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than that of less correlated ones. There is thus a bias-variance trade-off associated with the choice of k in k-fold cross-validation; typically k = 5 or k = 10 yields test error rate estimates that suffer neither from excessively high bias nor from very high variance.
(2)(3 marks) It is well-known that ridge regression tends to give similar coefficient values to correlated variables,
whereas the lasso may give quite different coefficient values to correlated variables. We will now
explore this property in a very simple setting.
Suppose that n = 2, p = 2, x11 = x12, and x21 = x22. Furthermore, suppose that y1 + y2 = 0, x11 + x21 = 0, and x12 + x22 = 0, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: β̂0 = 0.
Write out the ridge regression optimisation problem in this setting.
Solution:
Minimise (y1 − β̂1 x11 − β̂2 x12)² + (y2 − β̂1 x21 − β̂2 x22)² + λ(β̂1² + β̂2²).
5. Hierarchical Clustering and Splines: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Total: 6 marks
(1) In this problem, you will perform K-means clustering manually, with K = 2, on a small example
with n = 6 observations and p = 2 features. The observations are as follows.
Obs. X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0
Now, let us randomly initialise the cluster labels
Obs. label
1 red
2 red
3 green
4 green
5 red
6 green
(a).(2 marks) Compute the centroid for each cluster, and mark the two centroids in the plot as two crosses
(“x”). Note you should show how you compute the centroids.
Solution:
For the green cluster (observations 3, 4, and 6):
x̄11 = (0 + 5 + 4)/3 = 3 and x̄12 = (4 + 1 + 0)/3 = 5/3
and for the red cluster (observations 1, 2, and 5):
x̄21 = (1 + 1 + 6)/3 = 8/3 and x̄22 = (4 + 3 + 2)/3 = 3
[Figure: scatter plot of the six observations, X1 against X2, on which the two centroids are marked as crosses.]
(b).(2 marks) Repeat the K-means clustering algorithm until the cluster assignments no longer change. Report the centroids of the two final clusters.
Solution:
The algorithm converges with observations 4, 5, and 6 in the green cluster, whose centroid is
x̄11 = (5 + 6 + 4)/3 = 5 and x̄12 = (1 + 2 + 0)/3 = 1
and observations 1, 2, and 3 in the red cluster, whose centroid is
x̄21 = (1 + 1 + 0)/3 = 2/3 and x̄22 = (4 + 3 + 4)/3 = 11/3
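The manual iteration can be checked against base R's kmeans(), started from the initial centroids computed in part (a):
x <- cbind(c(1, 1, 0, 5, 6, 4), c(4, 3, 4, 1, 2, 0))  # the six observations
init <- rbind(c(8/3, 3), c(3, 5/3))  # initial red and green centroids
fit <- kmeans(x, centers = init)
fit$centers  # converges to (2/3, 11/3) and (5, 1)
fit$cluster  # observations 1-3 in one cluster, 4-6 in the other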
(2)(2 marks) Consider two curves, ĝ1 and ĝ2, defined by
ĝ1 = arg min_g [ ∑_{i=1}^n (yi − g(xi))² + λ ∫ [g^(3)(x)]² dx ]
ĝ2 = arg min_g [ ∑_{i=1}^n (yi − g(xi))² + λ ∫ [g^(4)(x)]² dx ]
where g^(d)(x) indicates the d-th derivative of g(x). When λ = 0, will ĝ1 or ĝ2 have the smaller training and testing RSS? Justify your answer.
Solution: If λ = 0, the penalty term vanishes from both problems, so ĝ1 = ĝ2 and they will have the same training and testing RSS.