ST4200
THE UNIVERSITY OF WARWICK
FOURTH YEAR EXAMINATION: SPRING 2021
STATISTICAL LEARNING AND BIG DATA
Time Allowed: 2 hours
Read carefully the instructions on the answer book and make sure that the particulars
required are entered on each answer book.
Calculators are not needed and are not permitted in this examination.
Full marks will be obtained by correctly answering three complete questions.
Candidates may attempt all questions. Marks will be awarded for the best three answers
only.
The numbers in the margin indicate approximately how many marks are available for
each part of a question.
1. The subject of this question is binary classification. Let $\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} = \{-1, +1\}$, let $X$ and $Y$ be random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$ respectively, and let $P$ be the joint distribution of $X$ and $Y$. We consider the classification problem of inferring a function $f : \mathcal{X} \to \mathcal{Y}$ from training data $T = \{(X_i, Y_i)\}_{i=1}^{n}$, an independent identically distributed sample from $P$. Unless stated otherwise, we reduce this problem to inferring a function $g : \mathcal{X} \to \mathbb{R}$ such that the empirical risk plus some regularising term,
$$\frac{1}{n} \sum_{i=1}^{n} L\left(Y_i g(X_i)\right) + \Omega(g),$$
is minimised, where $L : \mathbb{R} \to \mathbb{R}_{\geq 0}$ is the logistic loss, given by
$$L(z) = \log\left(1 + \exp(-z)\right).$$
The output, $g^\star$, of this inference is then used to give the probability of class $+1$ for an observed $x$, using $\sigma(g^\star(x))$, where $\sigma$ is the logistic sigmoid function.
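For concreteness, the following is a minimal sketch (not part of the examined material) of the minimisation described above for a linear $g(X) = X^T\theta$, taking $\Omega(g) = \lambda\|\theta\|_2^2$ as one possible choice of regulariser. The simulated data, the regularisation weight, and the use of scipy.optimize are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative data: n points in R^p with labels in {-1, +1}.
n, p = 200, 5
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
Y = np.where(X @ theta_true + 0.1 * rng.normal(size=n) > 0, 1, -1)

lam = 0.1  # regularisation weight (assumed)

def objective(theta):
    # Empirical logistic loss (1/n) * sum_i log(1 + exp(-Y_i x_i^T theta)).
    margins = Y * (X @ theta)
    emp_risk = np.mean(np.logaddexp(0.0, -margins))
    # Squared-norm regulariser, one possible choice of Omega(g).
    return emp_risk + lam * np.dot(theta, theta)

res = minimize(objective, x0=np.zeros(p), method="BFGS")
theta_hat = res.x

def predict_proba(x):
    # sigma(g*(x)) gives the probability of class +1 for an observed x.
    return 1.0 / (1.0 + np.exp(-(x @ theta_hat)))
```

The fitted $\theta$ defines $g^\star$, and predict_proba returns $\sigma(g^\star(x))$ for a new observation, as in the setup above.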
(a) Prove that the logistic loss function is strictly convex on $\mathbb{R}$. [3]
(b) Let $S \subseteq \mathbb{R}^p$ be a convex set. Recall that if $f : S \to \mathbb{R}$ is convex and $g : S \to \mathbb{R}$ is strictly convex, then $f + g$ is strictly convex. Now prove that if a function $f : S \to \mathbb{R}$ is strictly convex, then $\lambda f$ is also strictly convex for any $\lambda > 0$. [2]
(c) Let $g_\theta : \mathcal{X} \to \mathbb{R}$ be of the form $g_\theta(X) = X^T\theta$, with parameter $\theta \in \mathbb{R}^p$. Suppose that an algorithm is run to solve the following minimisation problem
$$\arg\min_{\theta \in \mathbb{R}^p} \Phi(\theta) + \lambda \Psi(\theta),$$
where $\Phi : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ is given by
$$\Phi(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\left(Y_i g_\theta(X_i)\right),$$
and where $\Psi : \mathbb{R}^p \to \mathbb{R}_{\geq 0}$ is a strictly convex function and $\lambda > 0$. The algorithm locates a local minimum. Is this local minimum guaranteed to be a global minimum? Prove your answer in full, using only properties that have already been stated or derived in this exam paper (i.e. all other steps must be fully justified, not making use of results proved during the module). You may use the fact that a twice-differentiable function on a convex domain is convex if and only if its second derivative is everywhere positive semi-definite. [9]
(d) Suppose we know that the algorithm used in part (c) finds a global minimum.
Is the global minimum unique? Prove your answer. [4]
(e) Now suppose that $g : \mathcal{X} \to \mathbb{R}$ no longer takes the form $g_\theta(X) = X^T\theta$, but is instead given by a neural network, whose parameters we denote by $\theta$. Suppose we run an algorithm for solving
$$\arg\min_{\theta} \Phi(\theta) + \lambda \Psi(\theta)$$
that finds a local minimum. Is this local minimum guaranteed to be a global minimum? Explain your answer (a formal proof is not needed here). [2]
2. This question concerns support vector machines and neural networks: two methods
that we motivated as being able to yield useful sets of features for regression and
classification.
(a) Define what it means for k to be a kernel (in the context of support vector
machines). [1]
(b) Define a reproducing kernel. [1]
(c) Prove that a reproducing kernel is a kernel. [2]
(d) A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if, for all $n \geq 1$, for all $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and all $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0.$$
Prove that a kernel is a positive definite function. (It is also true that a symmetric positive definite function is a kernel, but you do not need to prove this.) [2]
(e) Using the fact that a symmetric positive definite function is a kernel, show that if $k_1$ and $k_2$ are kernels, then for $\alpha_1, \alpha_2 > 0$, the function
$$k(x, z) = \alpha_1 k_1(x, z) + \alpha_2 k_2(x, z)$$
is a kernel. [2]
(f) Let $(k_m)_{m \geq 0}$ be a sequence of kernels such that, for all $x, z \in \mathcal{X}$, the sequence $(k_m(x, z))_{m \geq 0}$ converges to a limit $k(x, z)$. Using the fact that a symmetric positive definite function is a kernel, show that this limiting function $k$ is a kernel. [2]
(g) Using parts (e) and (f), and the fact that the product of two kernels is also a kernel, show that if $k_1$ is a kernel, the function
$$k(x, z) = \exp\left[k_1(x, z)\right]$$
is a kernel. [4]
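As an informal illustration of parts (e)–(g), not part of the examination, the sketch below builds the exponential of a base kernel and checks numerically that its Gram matrix is positive semi-definite. The choice of a linear base kernel and the simulated inputs are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))  # illustrative inputs

def k1(x, z):
    # Base kernel: the linear kernel <x, z> (an assumed choice for illustration).
    return float(np.dot(x, z))

def k_exp(x, z):
    # Exponential of a kernel, as in part (g): the limit of the partial sums
    # sum_{m=0}^{M} k1(x, z)^m / m!, each of which is a kernel by sums and products.
    return float(np.exp(k1(x, z)))

def gram(kernel, points):
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

K = gram(k_exp, X)
eigenvalues = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigenvalues.min())  # non-negative up to round-off
```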
(h) Consider the use of a neural network for binary classification. It is well known that neural networks are very flexible models, and it is therefore common for them to overfit. Bayesian methods provide one approach to avoiding overfitting. What changes and/or additions to the model might be required in order to use Bayesian methods with neural networks to avoid overfitting? Make sure to justify your choices. How might it be possible to fit the model to data? You should describe which algorithm you might use, and the challenges you might face. [6]
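One possible approach of the kind asked about in part (h), sketched purely for illustration, is to place a prior on the network weights (here an independent Gaussian prior, playing the role of a regulariser) and to sample the resulting posterior with Markov chain Monte Carlo. The tiny architecture, the prior scale, the random-walk Metropolis sampler, and the proposal step size are all assumptions; in practice, Hamiltonian Monte Carlo or variational methods are more commonly used for larger networks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data with labels in {-1, +1}.
n, p, h = 100, 2, 4
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

def unpack(w):
    # One hidden layer of h tanh units; w holds (W1, b1, w2, b2).
    W1 = w[: p * h].reshape(p, h)
    b1 = w[p * h : p * h + h]
    w2 = w[p * h + h : p * h + 2 * h]
    b2 = w[-1]
    return W1, b1, w2, b2

def g(w, X):
    W1, b1, w2, b2 = unpack(w)
    return np.tanh(X @ W1 + b1) @ w2 + b2

def log_posterior(w, tau=1.0):
    # Log-likelihood under the logistic model plus a Gaussian prior on all weights;
    # the prior discourages overfitting in the same way a regulariser does.
    log_lik = -np.sum(np.logaddexp(0.0, -Y * g(w, X)))
    log_prior = -0.5 * np.sum(w ** 2) / tau ** 2
    return log_lik + log_prior

# Random-walk Metropolis: a simple, if slow, MCMC choice for illustration.
dim = p * h + 2 * h + 1
w = np.zeros(dim)
current = log_posterior(w)
samples = []
for _ in range(5000):
    prop = w + 0.05 * rng.normal(size=dim)
    cand = log_posterior(prop)
    if np.log(rng.uniform()) < cand - current:
        w, current = prop, cand
    samples.append(w.copy())
```

Predictions would then average $\sigma(g_w(x))$ over the retained posterior samples, rather than using a single fitted weight vector.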
3. (a) Show that the moment generating function of a univariate normal distribution with mean $\mu$ and variance $\sigma^2$ is, for all $t \in \mathbb{R}$,
$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right).$$
[4]
(b) Prove that, for any random variable $X$ with finite mean $\mu$, for any $\varepsilon > 0$,
$$P\left(X - \mu \geq \varepsilon\right) \leq \inf_{t > 0} \left\{ \exp\left(-t(\varepsilon + \mu)\right) E\left[\exp(tX)\right] \right\}.$$
[3]
(c) Using parts (a) and (b), show that if $X$ has a univariate normal distribution with mean $\mu$ and variance $\sigma^2$, then
$$P\left(X - \mu \geq \varepsilon\right) \leq \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right).$$
[5]
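A quick numerical check of the bound in part (c), purely for illustration: the Monte Carlo estimate of $P(X - \mu \geq \varepsilon)$ should sit below $\exp(-\varepsilon^2 / (2\sigma^2))$ for every $\varepsilon$. The particular values of $\mu$, $\sigma$ and $\varepsilon$ below are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 1.0, 2.0
samples = rng.normal(mu, sigma, size=1_000_000)

for eps in (1.0, 2.0, 4.0):
    empirical = np.mean(samples - mu >= eps)           # Monte Carlo estimate of P(X - mu >= eps)
    bound = np.exp(-eps ** 2 / (2 * sigma ** 2))       # tail bound from part (c)
    print(f"eps={eps}: P(X - mu >= eps) ~ {empirical:.4f} <= bound {bound:.4f}")
```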
(d) Are the following distributions sub-Gaussian? Give your reasoning.
(i) a truncated Gaussian distribution;
(ii) a discrete uniform distribution, on values {1, ..., n}. [3]
(e) Concentration inequalities play a key role in this module. Outline how they give
us insight into the performance we might expect of classification algorithms, as
if you were explaining the ideas to a student in your year who has not taken
this module. You should cover the majority of the main ideas in the module,
but you do not need to go into great detail. [5]
4. Let $\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} = \mathbb{R}$, let $X$ and $Y$ be random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$ respectively, and let $P$ be the joint distribution of $X$ and $Y$. We consider the regression problem of inferring a function $f : \mathcal{X} \to \mathcal{Y}$ from training data $T = \{(X_i, Y_i)\}_{i=1}^{n}$, an independent identically distributed sample from $P$. Let $f : \mathcal{X} \to \mathcal{Y}$ be of the form $f(X) = X^T\beta + \beta_0$.
(a) The lasso is a regularisation scheme that can be used when estimating the parameter vector $\beta$ in the situation set out in this question. One description of the lasso is as an unconstrained minimisation problem. State this description, for the case of squared error loss. [2]
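For reference, a minimal sketch of solving one common form of the unconstrained lasso problem with scikit-learn. The simulated data and the penalty weight alpha are assumptions; scikit-learn's Lasso minimises $\frac{1}{2n}\|y - X\beta - \beta_0\|_2^2 + \alpha\|\beta\|_1$, which is one standard parameterisation of the unconstrained problem referred to in part (a).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)

# Illustrative data with a sparse true coefficient vector.
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.3 * rng.normal(size=n)

# Squared error loss plus an L1 penalty on beta; beta_0 is fitted as the intercept.
model = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)
print("beta_0:", model.intercept_)
print("number of exactly-zero coefficients:", np.sum(model.coef_ == 0.0))
```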
(b) Show that the minimisation problem mentioned in part (a) is equivalent to
finding the maximum a posteriori estimate using a particular choice of prior,
and model for the training data. [6]
(c) For simplicity, this part of the question takes $\beta_0 = 0$. The lasso estimate can also be described as the solution to a constrained optimisation problem:
$$\arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - X_i^T\beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s,$$
for some $s \geq 0$. Prove that a solution to the unconstrained problem from part (a) is also a solution to the constrained problem for some $s$. [4]
(d) Now consider using Bayesian inference to obtain the posterior distribution on $\beta$ and $\beta_0$, using the prior you introduced in part (b). Consider the case where $p$, the dimension of $X$, is large ($p > 100$).
(i) Suppose we are particularly interested in the posterior variance of the $\beta_1$ parameter, given the training data. Describe how one may obtain an estimate of this quantity (an illustrative sketch is given after this question). [6]
(ii) Lasso estimates are known to exhibit sparsity, in that some of the variables will be estimated to be exactly zero. Will this effect also be observed in the posterior distribution for $\beta$, when the prior is chosen as you described in part (b)? Explain your answer. [2]
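As an informal illustration of part (d)(i): one route to the posterior variance of $\beta_1$ is to sample the posterior under the Laplace (double-exponential) prior from part (b) with Markov chain Monte Carlo and take the sample variance of the $\beta_1$ draws. The sketch below uses random-walk Metropolis with a Gaussian likelihood; the data, noise scale, prior rate, proposal step size and the deliberately small dimension are all assumptions, and for $p > 100$ a Gibbs sampler for the Bayesian lasso, or Hamiltonian Monte Carlo, would usually be preferred.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data (small p for the sketch; part (d) envisages p > 100).
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + 0.5 * rng.normal(size=n)

sigma, lam = 0.5, 1.0  # assumed noise scale and Laplace prior rate

def log_posterior(beta):
    # Gaussian likelihood plus independent Laplace priors on each beta_j
    # (the prior from part (b)); beta_0 is omitted for brevity.
    resid = y - X @ beta
    return -0.5 * np.sum(resid ** 2) / sigma ** 2 - lam * np.sum(np.abs(beta))

# Random-walk Metropolis, as in the earlier neural-network sketch.
beta = np.zeros(p)
current = log_posterior(beta)
draws = []
for _ in range(20000):
    prop = beta + 0.05 * rng.normal(size=p)
    cand = log_posterior(prop)
    if np.log(rng.uniform()) < cand - current:
        beta, current = prop, cand
    draws.append(beta.copy())

draws = np.array(draws[5000:])  # discard burn-in
print("estimated posterior variance of beta_1:", draws[:, 0].var())
```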
END