© LSE ST2020/ST300 Page 1 of 5
Summer Assessment 2020
Assessment paper and instructions to candidates:
ST300 -- Regression and GLM
Suitable for all candidates
Instructions to candidates
This paper contains FOUR questions.
Answer ALL questions. All questions carry equal weight.
Time Allowed: 2 hours
You may also use: no additional materials
Calculators: calculators are allowed
1. Assess (a)-(d) on fitting the normal linear regression model by correcting any
methodological and factual error(s). Use bullet points in your answers.
(a) A model that is good for predicting the dependent variable given new data that is
within the range of the current data should have an R-squared close to one, and so I
will seek a model with the highest R-squared. [7 marks]
(b) It is necessary to remove outliers in the dependent variable and in the predictors before
and after fitting a regression model. This is because outliers distort the estimated
coefficients. [7 marks]
(c) To investigate the relationship between a set of continuous predictors and the
dependent variable, a data analyst regressed the dependent variable on all the predictors,
and the predictors with VIF less than 10 were removed. After this, stepwise regression
was run on the remaining predictors. The resulting model is adequate for the task if
it passes the diagnostic checks. [6 marks]
(d) To investigate the relationship between a dependent variable and a set of predictors,
we could run simple linear regression models of the dependent variable on each
predictor, which is better than using multiple linear regression as it is easier to interpret
the output of simple linear regression. [5 marks]
2. Consider the linear regression model that contains an intercept term
$$y = X\beta + \varepsilon,$$
where $y$ is the response vector of length $n$, $\beta$ is the vector of parameters of length $p$,
$X$ is a matrix of constants, and $n > p$.
(a) Denote the residual sum of squares computed at the OLS estimates by $\mathrm{RSS}_1$.
Suppose we add a further $k$ predictors to this model, and denote the RSS computed
at the OLS estimates of the enlarged model by $\mathrm{RSS}_2$. Show that
$\mathrm{RSS}_2 \le \mathrm{RSS}_1$ and hence show that the R-squared for the larger model
is greater than or equal to the R-squared for the smaller model. [8 marks]
(b) i. Define the leverage $h_{ii}$ of the $i$th data point in terms of the hat matrix. [1 mark]
ii. Writing the design matrix as
$$X = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix},$$
where $x_i'$ is a row vector of covariates for the $i$th data point, show that
$$h_{ii} = x_i'(X'X)^{-1}x_i.$$
[3 marks]
iii. Suppose we have a simple linear model with $x_i' = (1, x_i - \bar{x})$. Obtain an
expression for $h_{ii}$ in terms of $n$ and the data. [3 marks]
(c) In the presence of perfect multicollinearity, $X'X$ is not invertible in the usual
sense, and the solution to the normal equations that provide the least squares
estimates of $\beta$ is not unique. Under this scenario, show that the fitted values are
unique. [10 marks]
3. (a) State the binomial GLM with the logit and probit link functions. State which of
these is the canonical link function. [2 marks]
(b) Suppose $\hat\beta = (\hat\beta_1\ \hat\beta_2\ \hat\beta_3)^T$ is obtained from the IWLS procedure of fitting a certain
GLM.
i. What is the asymptotic distribution of $A\hat\beta$, where $A$ is a matrix of constants? [2 marks]
ii. Suppose the estimated information matrix $I(\hat\beta)$ is given by
$$I(\hat\beta) = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 2 \end{pmatrix},$$
and we want to use the Wald test to test
$$H_0 : (\beta_1 + \beta_3, \beta_2) = (1, 0) \longleftrightarrow H_1 : (\beta_1 + \beta_3, \beta_2) \neq (1, 0).$$
Write down the null hypothesis in the form $A\beta = v_0$, obtain an expression
for the Wald test statistic, and state its distribution under $H_0$. Your final answer should
not contain matrices or vectors. [14 marks]
iii. What would be the problem with the Wald test if we added the equation $\beta_1 + \beta_2 + \beta_3 = 1$
to the null hypothesis in part ii? [2 marks]
(c) In the following R code, find the numerical value of round(z,1), providing an
explanation of how you reach your answer:
y <- c(3,3,5,4,6)
x <- c(12,15,20,16,22)
model <- glm(y~x,family=poisson)
z <- sum(y)-sum(predict(model,type="response"))
round(z,1)
[5 marks]
4. (a) The exponential family of density functions can be written as
$$f(y) = \exp\left\{ \frac{y\theta - c(\theta)}{\phi} + d(y, \phi) \right\}.$$
i. Suppose the canonical link is used with $\eta = x^T\beta$ in fitting a GLM to the data.
Derive the score vector $U(\beta)$ and the information matrix $I(\beta)$ for a single data
point. [6 marks]
ii. Suppose $\phi$ is known, and we have $n$ independent data points drawn from the
density $f$ with canonical link. Derive the deviance in terms of $\hat\beta$, the $x_i$'s, and the $y_i$'s. [6 marks]
(b) Let $y_i \sim N(x_i'\beta, \sigma^2)$, with covariates $x_i$ of length $p$. Assume $\sigma^2$ is known. Using part
(a), find the deviance in terms of the data and $\sigma^2$ (you should express $\hat\beta$ in terms of
$y$ and the design matrix $X$, whose $i$th row is $x_i'$) and state its exact distribution. [13 marks]
END OF PAPER