Midterm Exam
GU4241/GR5241 Fall 2016
Name
UNI
Problem 0: UNI (2 points)
Write your name and UNI on the exam book and on the first page of the problem sheet. After the exam, please
put the problem sheet into the exam book and return both to us.
Problem 1: Short questions (2+2+3+3+4+4 points)
Short answers (about one sentence) are sufficient.
(a) (True/False) A classifier trained on more training data is more likely to overfit.
(b) (True/False) No classifier can do better than a naive Bayes classifier if the distribution of the data is
known.
(c) We want to identify the subgroups of a 2-dimensional data set. After some exploratory analysis, we find
that there are several clusters. The data points within the same cluster are normally distributed, but the
variances vary among the clusters. Which of the following methods would you use: k-means, k-medoids, or
the EM algorithm? Explain briefly.
(d) What is the main advantage of the Lasso with respect to Ridge regression?
(e) Suppose the data (X, Y) are well fit by a linear model. How would you diagnose whether a data point $(x_i, y_i)$
is an outlier or a high-leverage point?
(f) A cubic spline with K knots is a function that:
• is a cubic polynomial between each pair of knots,
• is continuous at the knots, and
• has continuous first and second derivatives at each knot.
Without using the definition of a cubic spline in terms of basis functions, explain why the spline has K + 4
free parameters or degrees of freedom.
Problem 2: Hierarchical clustering (10 points)
Perform single-linkage hierarchical clustering on the following 2-dimensional observations. Use the Manhattan
distance between a pair of points, $d(A,B) = |X_{1A} - X_{1B}| + |X_{2A} - X_{2B}|$. Show your results after each step.
Obs. X1 X2
A 1 4
B 1 3
C 3 4
D 5 2
E 3 2
F 3 0
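For reference, a minimal R sketch (base R only) that reproduces this clustering and can be used to check the hand computation:

obs <- data.frame(X1 = c(1, 1, 3, 5, 3, 3),
                  X2 = c(4, 3, 4, 2, 2, 0),
                  row.names = c("A", "B", "C", "D", "E", "F"))
d  <- dist(obs, method = "manhattan")   # pairwise Manhattan (L1) distances
hc <- hclust(d, method = "single")      # single-linkage agglomeration
plot(hc)                                # dendrogram: merge order and merge heights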
Problem 3: $\ell_q$-regression (5+5 points)
[Two figures: plots in the $(\beta_1, \beta_2)$ plane, each marking the least squares estimate $\hat\beta$ and the points $x_1, \dots, x_5$; left: q = 0.5, right: q = 4.]
$\ell_q$-regression is a generalization of ridge regression and the lasso. We want to find the solution of the following
optimization problem:
$$\tilde\beta = \arg\min_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$$
The red curves in the figures show the cost-function components of the $\ell_q$-regression problems with q = 0.5 (left)
and q = 4 (right); the black curves are the contours of the residual sum of squares.
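For illustration, a minimal R sketch (base R only) that draws contours of the penalty surface $|\beta_1|^q + |\beta_2|^q$ for the two values of q; the grid range is an arbitrary choice:

b1  <- seq(-2, 2, length.out = 200)                   # grid over beta_1
b2  <- seq(-2, 2, length.out = 200)                   # grid over beta_2
pen <- function(q) outer(abs(b1)^q, abs(b2)^q, "+")   # |beta_1|^q + |beta_2|^q on the grid
par(mfrow = c(1, 2))
contour(b1, b2, pen(0.5), asp = 1, main = "q = 0.5")  # spiky, non-convex contours
contour(b1, b2, pen(4),   asp = 1, main = "q = 4")    # rounded, nearly box-shaped convex contours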
(a) Does one/none/both of the cost functions encourage sparse estimates? If so, which one? Explain your
answer.
(b) Which of the points $x_1, \dots, x_5$ would achieve the smallest cost under the $\ell_q$-constrained least squares cost
function? For each of the two cases, name the respective point and give a brief explanation for your answer.
Problem 4: Logistic regression (10 points)
Consider the dataset
x y
-2 ’slow’
5 ’fast’
-1 ’slow’
10 ’fast’
4 ’fast’
Suppose we use logistic regression on these data: that is, treating y as a binary variable that is either 'slow' or 'fast',
we wish to fit the model
$$P(y_i = \text{fast}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}, \qquad P(y_i = \text{slow}) = \frac{e^{-(\beta_0 + \beta_1 x_i)}}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$$
for all $i = 1, \dots, 5$. What value(s) of β would maximize the likelihood (and thus be the estimates returned
by fitting this model)?
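For concreteness, the model can be fit in R along the following lines (a sketch; the factor levels are ordered so that glm models P(y = fast), matching the parameterization above):

x   <- c(-2, 5, -1, 10, 4)
y   <- factor(c("slow", "fast", "slow", "fast", "fast"),
              levels = c("slow", "fast"))    # second level ("fast") is treated as the "success"
fit <- glm(y ~ x, family = binomial)         # maximum-likelihood logistic regression
coef(fit)                                    # fitted beta_0 and beta_1

On these data glm may warn that fitted probabilities are numerically 0 or 1; examining why is a good starting point for the question.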
Problem 5: Cross validation (10 points)
A scientist performs a ridge regression in R and estimates the optimal parameter λ by 10-fold cross-validation.
She uses the following code:
> library(glmnet)
> X = as.matrix(data[,-1])
> Y = as.vector(data[,1])
> cv.out = cv.glmnet(X,Y,alpha = 0)
> cv.out$lambda.min
[1] 0.5
> min(cv.out$cvm)
[1] 485.1199
The scientist concludes that the test MSE of ridge regression with λ = 0.5 is approximately 485.12. Explain the
problem with this estimate and suggest a better way to estimate the test MSE.
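One concrete way to estimate the test MSE, sketched here assuming the same X, Y, and glmnet session as in the code above (the split proportion and seed are arbitrary choices):

set.seed(1)                                            # arbitrary seed for reproducibility
train <- sample(nrow(X), floor(nrow(X) / 2))           # hold out half of the rows as a test set
cv.tr <- cv.glmnet(X[train, ], Y[train], alpha = 0)    # choose lambda on the training half only
pred  <- predict(cv.tr, newx = X[-train, ], s = "lambda.min")
mean((Y[-train] - pred)^2)                             # MSE on data never used to select lambda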
Problem 6: Variable selection (10 points)
A new startup has developed a cheap way to measure the expression of antibodies in infants. They hope to use
these data to classify infants according to whether they will develop atopic syndrome. The company has collected
data from 500 infants and followed them for 5 years to evaluate the incidence of atopy. They ran an initial trial in
which they measured just 1,000 antibodies, and then a second trial which considered a more complete library of
100,000 antibodies. In each case, they performed variable selection using forward stepwise selection and built
a classifier by logistic regression.
They found that the classification quality was worse in the second trial. Puzzled by this outcome, they decided
to consult with you. How would you explain this observation?
Problem 7: Polynomial regression (2+2+2+2+2 points)
(a) Consider a regression problem on $\mathbb{R}$, that is, the regression function is of the form $f : \mathbb{R} \to \mathbb{R}$. Assume f is
linear with least squares solution (weight vector) $\beta = (\beta_0, \beta_1)^t = (-1, 4)^t$. What would be the predictions
$f(x)$ and $f(x')$ for the points $x = 0.8$ and $x' = 1.0$?
(b) Now consider a more general linear regression problem with n data points, each in d dimensions, corresponding
to the data matrix $X \in \mathbb{R}^{n \times (d+1)}$ and output vector $y \in \mathbb{R}^n$. What is the form of the ordinary least
squares solution for the parameters β? What is the dimensionality of β?
(c) Suppose f is actually a quadratic function with quadratic coefficient $\beta_2 = 1$ and the linear terms as
above. What are the new predictions for the above points x and x′?
(d) Now consider a more general quadratic regression problem with n data points, each in d dimensions, and
output vector $y \in \mathbb{R}^n$. Assume the function f has a simple quadratic relationship with each dimension
(no product terms between dimensions). What is the form of the ordinary least squares solution for the
parameter β? What is the dimensionality of β?
Hint: you may wish to define a feature matrix $\Phi = [\phi(x_1) \cdots \phi(x_n)]^t$ for some choice of φ. Specify the
form of the feature map φ.
(e) Given these extra quadratic covariates, we are concerned with overfitting and decide to add a ridge penalty,
with some given value λ. What is the form of the ridge regression solution for the parameter β? You should
again write your solution in terms of Φ. It is not necessary to derive the solution; you can just write it.
What is the dimensionality of β?
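To make the feature-map idea concrete, a small R sketch for the one-dimensional case; the numbers are hypothetical and chosen only to illustrate the shapes of the matrices, and for simplicity the ridge penalty is applied to every coefficient, including the intercept:

x <- c(-0.5, 0.2, 0.8, 1.0, 1.7)                       # hypothetical inputs
y <- c(-3.1, -0.2, 2.9, 4.1, 8.7)                      # hypothetical responses
Phi <- cbind(1, x, x^2)                                # feature matrix with phi(x) = (1, x, x^2)^t
beta_ols <- solve(t(Phi) %*% Phi, t(Phi) %*% y)        # ordinary least squares in the features
lambda <- 0.1                                          # some given penalty value
beta_ridge <- solve(t(Phi) %*% Phi + lambda * diag(ncol(Phi)),
                    t(Phi) %*% y)                      # ridge solution written in terms of Phi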
Problem 8: Model evaluation for clustering (10 points)
You have a set of microarray data with 100,000 genes and 100 samples, 50 in a healthy group and 50 in a diseased
group. Since you have a slow computer, you first discard the 99,000 genes with the smallest absolute t-statistics for
comparing the two groups. Then you apply agglomerative hierarchical clustering to the samples, using the remaining
1,000 genes. You find that the clustering cleanly separates the diseased group from the healthy group, and your
collaborator is delighted and ready to publish. Comment. How would you assess the ability of these 1,000 genes
to separate the classes?
Problem 9: PCA (4+6 points)
(a) Please sketch, approximately, in the following picture (you may write on the problem sheet):
• The first principal component (the 1-dimensional subspace onto which PCA with one component would
project). Mark this ξ1.
• Pick an arbitrary data point (reasonably far away from the principal component) and draw the direction
in which it would be projected.
(b) The next figure shows (on the left) 100 examples from a data set of face images. Each image is a 92 × 112
grayscale image and can be interpreted as a vector $x \in \mathbb{R}^{10304}$. The figure on the right shows the first 48
principal components $\xi_1, \dots, \xi_{48}$, visualized as images.
[Figures reproduced from the source textbook: Figure 15.5 shows 100 training images, each consisting of 92 × 112 = 10304 greyscale pixels, a subset of the 400 images in the full Olivetti Research Face Database; Figure 15.6(b) shows the eigen-images found via SVD of these images, i.e. the mean and the 48 eigenvectors with the largest eigenvalues.]
(i) How many principal components are there in total?
(ii) Can you describe a method that approximately reconstructs a specific face image $x \in \mathbb{R}^{10304}$ using
only the first 48 principal components? To describe your method, denote by $\hat{x}$ the approximate
representation of x. Represent $\hat{x}$ as an equation $\hat{x} = \dots$. Please define precisely what each variable
occurring on the right-hand side of the equation means.
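As a numerical sanity check of such a reconstruction, a sketch on stand-in random data of much smaller dimension (prcomp centers the data by default):

X  <- matrix(rnorm(100 * 60), nrow = 100)   # stand-in data: 100 observations in 60 dimensions
pc <- prcomp(X)                             # PCA; pc$center is the mean, pc$rotation the components
k  <- 48                                    # number of components to keep
V  <- pc$rotation[, 1:k]                    # first k principal directions as columns
x  <- X[1, ]                                # one observation to reconstruct
xhat <- pc$center + as.vector(V %*% (t(V) %*% (x - pc$center)))  # rank-k approximation of x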