QUIZ 4: BIG DATA

General instruction: This assignment is due at 4:00 pm on Friday, 12 September. Please generate a single PDF file using R Markdown. You may either knit directly to PDF or create an HTML document and convert it to PDF. Once completed, submit the PDF via Turnitin on the course webpage.

Caution: Do not set a seed. If you do, no credit will be given for this quiz. The same penalty applies if you do not use R Markdown to generate a single document. When a word limit is specified (e.g., 50 words), do not exceed it; otherwise, no credit will be given. You may count words at https://wordcounter.net/.

Preliminaries: Import the dataset Carseats from the ISLR2 R package. Randomly select 300 observations to serve as the training set. These observations must be used exclusively for training throughout this quiz. The remaining observations should be used as the test set for all relevant questions.

Total: 10 marks (1 mark each)

1. Using the model.matrix() function, define the matrices y and x, where y corresponds to the response variable Sales, and x contains all remaining predictor variables in the dataset. For all subsequent questions, use y and x in place of the original variables.

2. Implement ridge regression of Sales on all other variables in the Carseats dataset. Using 10-fold cross-validation, generate a plot of the cross-validated mean squared error (MSE) against a wide range of values for the regularization parameter λ, similar to the diagram on page 17 of the tutorial document Ch6-varselect-lab.pdf. Report the value of λ that minimizes the cross-validated MSE.

3. Using the results from the ridge regression above, compute the test MSE associated with the value of λ that minimizes the cross-validated MSE.

4. Repeat Questions 2 and 3 using lasso regression instead of ridge regression.

5. Unlike ridge regression, lasso regression performs variable selection by shrinking some coefficient estimates exactly to zero. Using the results from Question 4, print the coefficient estimates only for the variables selected by the lasso, that is, those with non-zero coefficients at the optimal value of λ.

6. Create a plot that illustrates how the estimated coefficients change as a function of λ in the lasso regression. This coefficient path plot should show the shrinkage behavior of each predictor as λ increases.

7. Repeat Questions 2 and 3 using Principal Component Regression (PCR) instead of ridge regression. That is, create a plot showing the cross-validated MSE as a function of the number of principal components M, and determine the value of M that minimizes the cross-validated MSE. Then, using this optimal number of components, compute the test MSE on the test data.

8. Repeat Questions 2 and 3 using Partial Least Squares (PLS) instead of ridge regression. That is, create a plot showing the cross-validated MSE as a function of the number of PLS components M, and determine the value of M that minimizes the cross-validated MSE. Then, using this optimal number of components, compute the test MSE on the test data.

9. Suppose you are limited to using at most 5 components. In this case, which method would you prefer between PCR and PLS? No computation is necessary; provide a conceptual explanation in no more than 50 words.

10. Considering all the analyses conducted so far, briefly explain which modeling approach you would prefer and why. Discuss whether this approach would be expected to perform better than ordinary least squares regression. Your answer should contain fewer than 50 words in total.
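
For reference, the quantities named in Questions 2 to 6 can be written out as follows. This is a sketch in the usual textbook notation, which is an assumption here: the handout defines no symbols beyond λ and M, so n, p, x_{ij}, β_j and n_test below are illustrative names, and software may scale the loss term differently (for example by a factor involving n) without changing the idea.

```latex
% Ridge regression: squared-error loss plus an L2 penalty on the slopes
\hat{\beta}^{\mathrm{ridge}}_{\lambda}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2}

% Lasso: same loss, but an L1 penalty, which can set some \beta_j exactly to zero
\hat{\beta}^{\mathrm{lasso}}_{\lambda}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert

% Test MSE: average squared prediction error over the held-out observations
\mathrm{MSE}_{\mathrm{test}}
  = \frac{1}{n_{\mathrm{test}}}
    \sum_{i \in \mathrm{test}} \bigl(y_i - \hat{y}_i\bigr)^{2}
```

In both penalized problems, λ = 0 recovers ordinary least squares, and larger values of λ shrink the slope coefficients further toward zero; the intercept β_0 is not penalized.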