ECA 5304 Homework 1
(Due by Friday February 18th 11pm to LumiNUS Homework 1 submission folder)
Instructions: If working in groups, submit one copy per group and indicate clearly the names of the
collaborators. You should use R to answer the computational questions. The submission format will be
as follows: 1) Merge all your handwritten work and typed-up answers/report into a single PDF file;
2) Append your R code at the end of the same PDF file; 3) Name your file as your NUS recorded
name. E.g., if I were a student registered under "Tkachenko, Denis", my filename would be "Tkachenko
Denis.pdf". Therefore, you will submit ONE pdf file per student or group that contains all the
answers and the code appended at the end.
Verify that your code runs seamlessly as a whole, containing commands to load all the necessary
libraries etc. Where randomness is involved, remember to set the seed(s) of the random number
generator for replicability. You should verify that your code produces the same answers when run
several times. Answers to computational questions should be formatted as a report (i.e., type/write up
your answers and supplement them with graphs/tables/numbers as necessary) – do not just comment answers
between the lines of the R script (don’t use R Markdown either – it looks ugly and makes the answers
hard to follow), and do not screenshot the whole output when you only need one or two numbers from it. Finally, read the
hints carefully and good luck!
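Hint: a minimal script header along these lines keeps the code self-contained and replicable (the library names and the seed value below are only examples – load whatever your own code needs):

    # Packages used in the computational questions, and a fixed RNG seed for replicability
    library(MASS)    # Boston data
    library(leaps)   # regsubsets()
    library(glmnet)  # ridge / LASSO with cross-validation
    library(hdm)     # rlasso() with plug-in lambda
    set.seed(5304)   # any fixed value works; re-running the script should then reproduce all results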
Question 1 (In-sample vs. out-of-sample MSE)
In this question you will establish an important and generally valid result in the simplest possible
setting. Consider data generated by the following simple constant-plus-noise process:
y_i = μ + ϵ_i,   ϵ_i ~ iid(0, σ²)
You plan to estimate the equation by OLS. Suppose you have a training dataset y = (y_1, y_2, ..., y_N) and a test dataset y′ = (y′_1, y′_2, ..., y′_N) of the same size, generated by the same process.
Hints: If you forgot, re-derive the OLS estimator for the model with only a constant. Also,
recall the key properties of OLS estimators – these will be useful in working out the answers.
1) Derive the in-sample and out-of-sample mean squared error (MSE) expressions.
2) Using the results in (1), argue that the in-sample MSE is always going to be less than or
equal to the out-of-sample MSE.
3) Explain what determines the difference between the two MSEs. Do you think this result
can be valid more generally? Discuss the significance and any potential usefulness of this
result.
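Hint: once you have worked out the derivations, a small simulation along the following lines can be used to sanity-check your answers to Question 1 (purely illustrative; the parameter values are arbitrary):

    set.seed(1)
    N <- 50; mu <- 2; sigma <- 1; R <- 5000
    mse_in <- mse_out <- numeric(R)
    for (r in 1:R) {
      y      <- mu + rnorm(N, 0, sigma)      # training sample
      y_test <- mu + rnorm(N, 0, sigma)      # test sample from the same process
      muhat  <- mean(y)                      # OLS estimator in the constant-only model
      mse_in[r]  <- mean((y - muhat)^2)
      mse_out[r] <- mean((y_test - muhat)^2)
    }
    c(in_sample = mean(mse_in), out_of_sample = mean(mse_out))
    # the averages should be close to (N-1)/N*sigma^2 and (N+1)/N*sigma^2 respectively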
Question 2 (Some intuition on ridge regression)
In this question you will establish an important property of ridge estimates in the simplest possible
setting. Suppose that N = 2, P = 2, and the design matrix X is such that x_11 = x_12 = x_1 and
x_21 = x_22 = x_2 (the indices refer to the row/column positions of the elements of the X matrix).
Furthermore, assume that the variables are demeaned, so there is no intercept included in the
model and hence there is no constant in the design matrix.
1) Can you estimate the parameters using OLS in this setting? Explain.
2) State the ridge regression optimization problem in this setting.
3) Solve the problem in (2) and argue that β_1 and β_2 obtained from ridge estimation for a
given lambda will be equal in this setting. (Hint: you do not need to solve the problem
fully – derive the F.O.C.s and see whether there is something you can note that gives
away the answer. You also don’t need to use matrix algebra here.)
4) Without using any derivation, what would you intuitively expect to happen to β_1 and β_2
in this setting if instead of ridge you used the LASSO penalty? (A numerical sketch illustrating parts (3)–(4) follows below.)
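Hint: after you have worked out parts (3)–(4), you can verify your intuition numerically along these lines. The sketch below uses more than two observations so that glmnet() runs comfortably, but the property illustrated – identical demeaned columns in the design – is the same:

    library(glmnet)
    set.seed(2)
    n <- 100
    x <- rnorm(n)
    X <- cbind(col1 = x, col2 = x)                  # two identical regressors
    y <- 1.5 * x + rnorm(n)
    X <- scale(X, center = TRUE, scale = FALSE)     # demean the columns (no intercept in the model)
    y <- y - mean(y)
    coef(glmnet(X, y, alpha = 0, lambda = 0.5, intercept = FALSE))   # ridge: the two slopes coincide
    coef(glmnet(X, y, alpha = 1, lambda = 0.5, intercept = FALSE))   # LASSO: may load on only one of the columns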
Question 3 (Revisit the Boston housing data with new tools)
1) Load the ‘Boston’ dataset from the MASS package. Use the ‘?Boston’ command to
retrieve the description of the dataset and the variables – discuss each variable and provide
economic intuition on why it may be a useful predictor for the median house value (medv).
Explain which variables you expect intuitively to be the most important predictors of medv
(do not run any quantitative analysis yet). Randomly split the dataset into the training set of
400 observations and the test set of 106 observations.
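Hint: one possible way to perform the split in part (1) – the seed value is arbitrary, what matters is that one is set:

    library(MASS)
    data(Boston)
    set.seed(5304)
    train_id <- sample(nrow(Boston), 400)   # 506 rows in total
    train <- Boston[train_id, ]             # 400 training observations
    test  <- Boston[-train_id, ]            # remaining 106 test observations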
2) Use the first block of code in the Boston_aug.R file to create a mildly expanded set of
variables (original data plus cubic polynomials in all continuous variables, dummy for zn >
0). Perform best subset selection with a maximum of 39 variables. Select the best models
using AIC and BIC (use 3 methods for each: variance estimate from the model (take Cp/BIC
computed by regsubsets()), variance from the largest model, and iterative variance). Discuss
your results.
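Hint: a sketch of the selection step in part (2), assuming the first block of Boston_aug.R has produced an augmented training data frame – called train_aug here purely as a placeholder – with medv as the response; the iterated-variance versions of AIC/BIC follow the same pattern with the variance estimate updated as discussed in class:

    library(leaps)
    best <- regsubsets(medv ~ ., data = train_aug, nvmax = 39, really.big = TRUE)
    bs   <- summary(best)
    n    <- nrow(train_aug)
    which.min(bs$cp)                           # Cp as reported by regsubsets()
    which.min(bs$bic)                          # BIC as reported by regsubsets()
    p    <- rowSums(bs$which)                  # number of estimated coefficients in each candidate model
    aic  <- n * log(bs$rss / n) + 2 * p        # AIC with each model's own variance estimate
    bic  <- n * log(bs$rss / n) + log(n) * p   # BIC with each model's own variance estimate
    which.min(aic); which.min(bic)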
3) (Hint: it may be useful to start a new script here or run ‘rm(list = ls())’ to clean up the previous
results to avoid clutter). Use the second code block in the Boston_aug.R file to create a greatly
expanded set of variables – this is what is sometimes called “feature engineering”. Read the
comments there to understand what variables are created. How many predictors are there in the
augmented dataset? Suppose we wanted to do subset selection – explain which strategies are
possible here and which are not.
4) Perform forward stepwise selection, restricting the maximum model size to 200 variables. Use
AIC and BIC with iterated variance estimation to select the best model. Contrast your results
with the best subset results from the previous part.
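Hint: a possible forward stepwise setup for part (4) – train_big is a placeholder name for the greatly expanded training data, and the iterated-variance AIC/BIC are computed as in part (2):

    fwd <- regsubsets(medv ~ ., data = train_big, nvmax = 200,
                      method = "forward", really.big = TRUE)
    fs  <- summary(fwd)
    n   <- nrow(train_big)
    p   <- rowSums(fs$which)
    aic <- n * log(fs$rss / n) + 2 * p
    bic <- n * log(fs$rss / n) + log(n) * p
    which.min(aic); which.min(bic)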
5) Fit the Ridge regression model using 10-fold cross-validation and evaluate its performance on
the test set. There is a “1-standard-error rule of thumb” in the machine learning literature, stating
that it may be desirable to use the most penalized model whose cross-validation MSE is within one
standard error of the minimum (the logic is that we pick an even simpler model that does not
seem to be statistically different in performance on cross-validation). The glmnet package
reports the corresponding lambda value as ‘lambda.1se’ in the results of the cv.glmnet()
function. Evaluate the model corresponding to ‘lambda.1se’ – does this rule of thumb look like
a good idea?
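Hint: a sketch of part (5), assuming x_train/x_test are numeric model matrices of the augmented predictors and y_train/y_test the corresponding medv vectors (all placeholder names):

    library(glmnet)
    set.seed(5304)                                   # the CV folds are random
    cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
    pred_min <- predict(cv_ridge, newx = x_test, s = "lambda.min")
    pred_1se <- predict(cv_ridge, newx = x_test, s = "lambda.1se")
    mean((y_test - pred_min)^2)                      # test MSE at the CV-minimizing lambda
    mean((y_test - pred_1se)^2)                      # test MSE under the 1-standard-error rule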
6) Fit the LASSO model using 10-fold cross-validation and evaluate its performance on the test
set for both ‘lambda.min’ and ‘lambda.1se’. Comment on your results briefly.
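Hint: part (6) follows the same pattern with alpha = 1, using the same placeholder objects as above:

    set.seed(5304)
    cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
    mean((y_test - predict(cv_lasso, newx = x_test, s = "lambda.min"))^2)
    mean((y_test - predict(cv_lasso, newx = x_test, s = "lambda.1se"))^2)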
7) Now fit the LASSO with plug-in lambda using the rlasso() function from the hdm package. Use
the default setting (adjusted for heteroskedasticity). Perform post-LASSO estimation as well
using the same package under default settings. Comment on your findings.
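Hint: a sketch for part (7); rlasso() uses the heteroskedasticity-adjusted plug-in penalty by default, and the post argument switches between LASSO and post-LASSO (the data objects remain placeholders):

    library(hdm)
    plugin_lasso <- rlasso(x_train, y_train, post = FALSE)   # LASSO with plug-in lambda
    post_lasso   <- rlasso(x_train, y_train, post = TRUE)    # post-LASSO: OLS on the selected variables
    mean((y_test - predict(plugin_lasso, newdata = x_test))^2)
    mean((y_test - predict(post_lasso,   newdata = x_test))^2)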
8) Summarize all the results you obtained in a big table for ease of reference. Now it’s time to check
how much value the fancy machinery added: 1) compute the test MSE for
the lstat-only model we chose in Lecture 1 (a 4th-order polynomial in lstat);
2) compute the test MSE for the model fit in the original 1978 paper: include all the initial
13 predictors in levels, but square the rm variable (i.e., do not include its level, only the square).
Comment on the results here and more broadly on the whole exercise, e.g., you may want to
address the following questions: What have you learned from this application? Were there results that
surprised you, and were there results you expected to obtain? Can you name some plausible reasons
why your methods did as well/badly as they did? Do you have any takeaways that occurred
to you in the application that were not apparent from the textbook/lectures? Do you feel you
are a more powerful data analyst than the researcher from 1978?
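Hint: the two benchmark models in part (8) can be fit on the original (un-augmented) training data, e.g. (train/test as in part (1)):

    poly_fit <- lm(medv ~ poly(lstat, 4), data = train)     # Lecture 1: 4th-order polynomial in lstat only
    mean((test$medv - predict(poly_fit, test))^2)
    fit_1978 <- lm(medv ~ . - rm + I(rm^2), data = train)   # 1978 specification: other predictors in levels, rm entering only as its square
    mean((test$medv - predict(fit_1978, test))^2)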