BIOSTAT 274 Spring 2021 Homework 1
Due 11:59 PM 04/21/2020 (Submit to CCLE)
Remark. For the Computational Part, please complete your answers in the RMarkdown file and submit the
generated PDF and RMD files. Related packages have been loaded in setup.
Computational Part
1. (Model Selection, [ISL] 6.8, 25 pt) In this exercise, we will generate simulated data, and will then use
this data to perform model selection.
(a) Use the rnorm function to generate a predictor $X$ of length $n = 100$, as well as a noise vector
$\epsilon$ of length $n = 100$.
(b) Generate a response vector $Y$ of length $n = 100$ according to the model
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon,$$
where $\beta_0 = 3$, $\beta_1 = 2$, $\beta_2 = -3$, $\beta_3 = 0.3$.
(c) Use the regsubsets function from the leaps package to perform best subset selection in order to
choose the best model from the set of predictors $(X, X^2, \ldots, X^{10})$. What are the best models
obtained according to $C_p$, BIC, and adjusted $R^2$, respectively? Show some plots to provide evidence
for your answer, and report the coefficients of the best model obtained.
(d) Repeat (c), using forward stepwise selection and also using backward stepwise selection. How does
your answer compare to the results in (c)?
(e) Now fit a LASSO model with the glmnet function from the glmnet package to the simulated data, again
using $(X, X^2, \ldots, X^{10})$ as predictors. Use cross-validation to select the optimal value of $\lambda$. Create
plots of the cross-validation error as a function of $\lambda$. Report the resulting coefficient estimates,
and discuss the results obtained.
(f) Now generate a response vector $Y$ according to the model
$$Y = \beta_0 + \beta_7 X^7 + \epsilon,$$
where $\beta_7 = 7$, and perform best subset selection and the LASSO. Discuss the results obtained.
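Hint. The simulation in (a) and (b) might be set up along the following lines. This is only a minimal sketch in base R: the seed value and the names `x`, `eps`, `beta`, `X`, and `sim` are illustrative assumptions, not part of the assignment. It stops at building the data frame of polynomial predictors, which can then be passed to the selection and LASSO routines named in (c) through (f).

```r
# Minimal sketch of the simulation in (a)-(b), base R only.
# The seed and all object names are assumptions chosen for illustration.
set.seed(1)

n   <- 100
x   <- rnorm(n)    # predictor X
eps <- rnorm(n)    # noise vector

# Coefficients beta_0, ..., beta_3 as given in (b)
beta <- c(3, 2, -3, 0.3)
y <- beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 + eps

# Raw polynomial predictors X, X^2, ..., X^10 as a 100 x 10 matrix
X <- outer(x, 1:10, "^")
colnames(X) <- paste0("X", 1:10)

# Data frame with the response y and the ten polynomial predictors
sim <- data.frame(y = y, X)
```

Keeping the predictors both as a matrix (`X`) and inside a data frame (`sim`) is a convenience: formula-based functions typically take a data frame, while matrix-based functions expect the predictor matrix and response vector separately.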
2. (Prediction, [ISL] 6.9, 20 pt) In this exercise, we will predict the number of applications received (Apps)
using the other variables in the College data set from the ISLR package.
(a) Randomly split the data set into a training set and a test set of equal size (1:1).
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by 5-fold cross-validation. Report
the test error obtained.
(d) Fit a LASSO model on the training set, with λ chosen by 5-fold cross-validation. Report the test
error obtained, along with the number of non-zero coefficient estimates.
(e) Comment on the results obtained. How accurately can we predict the number of college applications
received? Is there much difference among the test errors resulting from these three approaches?
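Hint. The split in (a) and the test error in (b) could be organized roughly as below. This is a hedged sketch, not a required approach: `split_half` and `mse` are hypothetical helper names, the seed is arbitrary, and the built-in `mtcars` data with a two-predictor model is used purely as a stand-in so that the code runs without the ISLR package. For the homework, the same pattern would be applied to the College data with Apps as the response.

```r
# Hypothetical helper: randomly split a data frame into two halves (1:1).
split_half <- function(df, seed = 1) {
  set.seed(seed)  # arbitrary seed, only for reproducibility
  train_idx <- sample(nrow(df), floor(nrow(df) / 2))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Hypothetical helper: mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Stand-in example on mtcars (NOT the College data).
parts <- split_half(mtcars)

# Least squares fit on the training half only
fit <- lm(mpg ~ wt + hp, data = parts$train)

# Test error: MSE of predictions on the held-out half
test_mse <- mse(parts$test$mpg, predict(fit, newdata = parts$test))
```

Fitting on the training half and evaluating only on the held-out half keeps the reported error an estimate of out-of-sample performance, which is what makes the comparison among least squares, ridge, and LASSO in (e) meaningful.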