QUIZ 5: BIG DATA

General instructions: This assignment is due at 4:00 pm on Friday, 19 September. Please generate a single PDF file using R Markdown. You may either knit directly to PDF or create an HTML document and convert it to PDF. Once completed, submit the PDF via Turnitin on the course webpage.

Caution: Do not set a seed. If you do, no credit will be given for this quiz. The same penalty applies if you do not use R Markdown to generate a single document. When a word limit is specified (e.g., 50 words), do not exceed it; otherwise, no credit will be given. You may count words at https://wordcounter.net/. For this quiz, you may use only the ISLR2, splines, gam, and tree packages in R. No marks will be awarded for this quiz if any other packages are used.

Preliminaries: Import the Carseats dataset from the ISLR2 R package.

Total: 10 marks (1 mark each)

1. Add a new variable, Sales1, to the Carseats dataset by adding random errors drawn from the distribution N(0, 1) to the Sales variable. Similarly, create another new variable, Price1, in the Carseats dataset by adding errors drawn from N(0, 1) to the Price variable. Finally, print the number of variables in the Carseats dataset after adding Sales1 and Price1.

2. Fit a regression model of Sales1 on Price1 using 4th-degree orthogonal polynomials. Report the estimation results, including estimates, standard errors, t values, and p values. In addition, make a diagram showing the fitted regression line along with a 95 percent confidence band, overlaid on a scatter plot of the observed (Price1, Sales1) pairs.

3. Produce a diagram similar to the one in Question 2, but use cubic splines instead of orthogonal polynomials. Implement the spline regression by placing three internal knots such that the sample is divided into four equally sized intervals (based on quantiles of Price1). Print the knots. Hint: Use the option df in bs().

4. Produce a diagram similar to the one in Question 2, but use natural splines with four degrees of freedom instead of orthogonal polynomials.

5. Produce a diagram similar to the one in Question 2, but use smoothing splines instead of orthogonal polynomials. Select the degrees of freedom using LOOCV. You do not need to plot the confidence band for this question.

6. Produce a diagram similar to the one in Question 2, but use local regression (LOESS) instead of orthogonal polynomials. Fit three separate LOESS models using spans of 0.1, 0.5, and 0.9. Plot all three fitted regression lines on the same diagram for comparison, along with a scatter plot of the original data. Include a legend to indicate which line corresponds to which span value. You do not need to include a confidence band for this question.

7. Fit a Generalized Additive Model (GAM) to predict Sales1 using Price1, Income, and ShelveLoc. Use smoothing splines with 4 degrees of freedom for quantitative predictors and include dummy variables for qualitative ones. Produce a 1-by-3 panel figure, with each panel displaying the effect of one predictor on Sales1, along with a 95% confidence interval.

8. Fit a decision tree to predict Sales1 using all other variables in the Carseats dataset, excluding Sales and Price. Use cross-validation to select the optimal size of the decision tree. Plot the tree’s predictive performance as a function of tree size to visualize how performance changes with the number of terminal nodes.

9. Continuing from Question 8, select a tree size that achieves the best predictive performance while still being easily visualizable, so that all terminal nodes are clearly readable. Print the chosen tree.

10. Continuing from Question 9, provide an interpretation of the rightmost terminal node, describing the characteristics of observations that fall into it and their predicted Sales1 values.