Date: 2025-08-21

QUIZ 1: BIG DATA
This assignment is due at 4:00 pm on Friday, 22 August. Please generate a single PDF file
using R Markdown. You may either knit directly to PDF or create an HTML document and convert it to
PDF. Once completed, submit the PDF via Turnitin on the course webpage.
Caution: Do not set a seed. If you do, no credit will be given for this quiz. The same penalty applies if
you do not use R Markdown to generate a single document. When a word limit is specified (e.g., 50 words),
do not exceed it; otherwise, no credit will be given. You may count words at https://wordcounter.net/.
Total: 10 marks (1 mark per question).
1. Import the dataset Carseats from the R package ISLR2. You can view information about the data by
typing
> ??ISLR2::Carseats
This will display a one-line description of the dataset, along with the sample size and number of
variables. Reprint the one-line description in your answer (1 mark).
2. In the command window, type
> data("Carseats")
Explain in 20 words what you see in the Environment tab of RStudio. Specifically, how many
observations and variables are in the Carseats dataset? Hint: You may need to click the dataset name
to view details.
3. Create a new binary variable, HighSales, indicating whether Sales is at or above its median value. Try:
> HighSales <- Carseats$Sales >= median(Carseats$Sales)
Explain in 20 words what changed in the Environment tab of RStudio.
4. Remove HighSales, which you created in Q3, using rm(HighSales), and run the following code.
> Carseats$HighSales <- Carseats$Sales >= median(Carseats$Sales)
Explain in 20 words what the code above produces, referring to the Environment tab of RStudio.
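The difference between the assignments in Q3 and Q4 can be illustrated on a tiny example. This is only a sketch: the data frame `df` and its five Sales values are hypothetical stand-ins, not the Carseats data.

```r
# Hypothetical stand-in data frame (made-up values, not the real Carseats data)
df <- data.frame(Sales = c(4.2, 9.5, 7.4, 11.1, 6.3))

# Q3 style: the result is a free-standing logical vector, a separate object
HighSales <- df$Sales >= median(df$Sales)
ncol_before <- ncol(df)   # the data frame itself is unchanged

# Q4 style: remove the vector, then store the result as a column instead
rm(HighSales)
df$HighSales <- df$Sales >= median(df$Sales)
ncol_after <- ncol(df)    # the data frame has gained one variable

print(c(before = ncol_before, after = ncol_after))
print(df)
```

In the first case a new object is listed next to the data frame; in the second case the data frame itself grows by one variable.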
5. Additionally, each observation is assigned to the training set with 75% probability and to the test set
with 25% probability. The following three lines of R code perform this sample split.
> train <- runif(N) <= 0.75 # N is number of observations; you should find N.
> Carseats.train <- Carseats[train, ] # training sample
> Carseats.test <- Carseats[!train, ] # test sample
Use the R function length() to print the numbers of observations in the training and test samples.
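A minimal sketch of the split and the counting step follows. It assumes a simulated two-column stand-in for Carseats (so it runs without ISLR2) and a stand-in value of N; with the real data you would load it via data("Carseats") and find N yourself. Note that length() applied to a data frame returns the number of columns, so the sketch applies it to a single column instead.

```r
# Assumption: simulated stand-in for Carseats; N = 400 is a stand-in value
N <- 400
Carseats <- data.frame(
  Sales = rnorm(N, mean = 7.5, sd = 2.8),
  Price = rnorm(N, mean = 115, sd = 24)
)

# Each row lands in the training set with probability 0.75 (no seed is set)
train <- runif(N) <= 0.75
Carseats.train <- Carseats[train, ]   # training sample
Carseats.test  <- Carseats[!train, ]  # test sample

# length() of a data frame is its number of columns,
# so count observations through one of its columns
n_train <- length(Carseats.train$Sales)
n_test  <- length(Carseats.test$Sales)
print(c(train = n_train, test = n_test))
```

Because the split is random and unseeded, the two counts change from run to run, but they always add up to N.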
6. Let X1 := Price. Select two additional predictors, denoted (X2, X3) (do not select Sales or HighSales).
Using the training sample from Q5, regress Sales on each Xj, j = 1, 2, 3, separately. Display the
results in a figure with three horizontally arranged plots, similar to the example below. Ensure that the
axes are labeled with appropriate variable names rather than R symbols or generic labels like Xj.
[Figure: example with three horizontally arranged plots of Sales (vertical axis) against Price, Competitor Price, and Income (horizontal axes).]
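One possible way to lay out three panels in a row is par(mfrow = c(1, 3)). The sketch below assumes a simulated stand-in for the training sample, picks CompPrice and Income as (X2, X3) purely for illustration, and overlays each fitted line with abline(); the axis label strings are also illustrative choices.

```r
# Assumption: simulated stand-in for the training sample from Q5
N <- 300
Carseats.train <- data.frame(
  Price     = rnorm(N, mean = 115, sd = 24),
  CompPrice = rnorm(N, mean = 125, sd = 15),
  Income    = runif(N, min = 20, max = 120)
)
Carseats.train$Sales <- 13 - 0.05 * Carseats.train$Price + rnorm(N, sd = 2)

# Map each column name to a readable axis label
predictors <- c(Price = "Price", CompPrice = "Competitor Price", Income = "Income")

old_par <- par(mfrow = c(1, 3))   # one row, three panels
fits <- lapply(names(predictors), function(v) {
  # Simple regression of Sales on one predictor at a time
  fit <- lm(reformulate(v, response = "Sales"), data = Carseats.train)
  plot(Carseats.train[[v]], Carseats.train$Sales,
       xlab = predictors[[v]], ylab = "Sales")
  abline(fit, col = "red")        # overlay the fitted regression line
  fit
})
par(old_par)                      # restore the previous layout
```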
7. Using the training sample from Q5, regress Sales on (X1, X2, X3), the three predictors selected in Q6.
Predict Sales at the median values of (X1, X2, X3) and compute a 95% confidence interval. Then,
repeat the prediction to compute a 95% prediction interval.
Hint: median(A) computes the median of variable A.
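The two intervals can be obtained from predict() by switching its interval argument. The sketch below is hedged in the same way as before: the data are a simulated stand-in for the training sample, CompPrice and Income are illustrative choices for (X2, X3), and the medians are taken over the training sample, which is an assumption about what the question intends.

```r
# Assumption: simulated stand-in for the training sample from Q5
N <- 300
Carseats.train <- data.frame(
  Price     = rnorm(N, mean = 115, sd = 24),
  CompPrice = rnorm(N, mean = 125, sd = 15),
  Income    = runif(N, min = 20, max = 120)
)
Carseats.train$Sales <- 13 - 0.05 * Carseats.train$Price + rnorm(N, sd = 2)

fit <- lm(Sales ~ Price + CompPrice + Income, data = Carseats.train)

# One-row data frame holding the median of each predictor
at_median <- data.frame(
  Price     = median(Carseats.train$Price),
  CompPrice = median(Carseats.train$CompPrice),
  Income    = median(Carseats.train$Income)
)

# Interval for the mean response at the medians
conf_int <- predict(fit, newdata = at_median, interval = "confidence", level = 0.95)
# Interval for a single new observation at the medians
pred_int <- predict(fit, newdata = at_median, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both calls return the same point prediction; the prediction interval is wider because it also accounts for the noise of an individual observation.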
8. Using the predictions from Q7, compute the mean squared error (MSE) for the test sample created in
Q5. The test MSE is given as
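\[ \mathrm{MSE}_{\text{test}} = \frac{1}{N_{\text{test}}} \sum_{i \in \text{test sample}} \left(\text{Sales}_i - \widehat{\text{Sales}}_i\right)^2 \]
where $N_{\text{test}}$ and $\widehat{\text{Sales}}_i$ are, respectively, the sample size of the test sample and the predicted value of
Sales for observation i in the test sample.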
Hint: The mean squared error (MSE) on the training sample is
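\[ \mathrm{MSE}_{\text{train}} = \frac{1}{N_{\text{train}}} \sum_{i \in \text{training sample}} \left(\text{Sales}_i - \widehat{\text{Sales}}_i\right)^2 \]
where $N_{\text{train}}$ is the sample size of the training sample and $\widehat{\text{Sales}}_i$ is the predicted value of Sales for
observation i in the training sample. There are several ways of computing $\mathrm{MSE}_{\text{train}}$ in R, including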
> mean(summary(fit)$residuals^2)
> mean((Carseats.train$Sales - predict(fit, Carseats.train))^2)
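The test MSE follows the same pattern as the training MSE in the hint, with the test sample passed to predict(). A minimal sketch, assuming a simulated stand-in for Carseats and CompPrice and Income as illustrative choices for (X2, X3):

```r
# Assumption: simulated stand-in for Carseats (not the real data)
N <- 400
Carseats <- data.frame(
  Price     = rnorm(N, mean = 115, sd = 24),
  CompPrice = rnorm(N, mean = 125, sd = 15),
  Income    = runif(N, min = 20, max = 120)
)
Carseats$Sales <- 13 - 0.05 * Carseats$Price + rnorm(N, sd = 2)

# Random 75/25 split as in Q5 (no seed is set)
train <- runif(N) <= 0.75
Carseats.train <- Carseats[train, ]
Carseats.test  <- Carseats[!train, ]

# Fit on the training sample only
fit <- lm(Sales ~ Price + CompPrice + Income, data = Carseats.train)

# Training MSE: average squared residual of the fitted model
mse_train <- mean(residuals(fit)^2)

# Test MSE: average squared prediction error on the held-out rows
mse_test <- mean((Carseats.test$Sales - predict(fit, newdata = Carseats.test))^2)

print(c(train = mse_train, test = mse_test))
```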
9. Compute the training error rate of a logistic regression for the qualitative variable HighSales. Using
the training set from Q5, fit a logistic model to predict HighSales using the predictors (X1, X2, X3)
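selected above. Let $\widehat{\text{HighSales}}_i = 1$ if
\[ \Pr(\text{HighSales} = 1 \mid (X_1, X_2, X_3) = (x_{i1}, x_{i2}, x_{i3})) \ge 1/2 \]
and $\widehat{\text{HighSales}}_i = 0$ otherwise, for all training observations.
Then, make a confusion table and calculate the error rate, i.e.,
\[ \frac{1}{N_{\text{train}}} \sum_{i \in \text{training sample}} I\left(\text{HighSales}_i \ne \widehat{\text{HighSales}}_i\right) \]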
Hint: see the first five pages of the tutorial material for classification analysis.
10. Compute the test error rate for the logistic regression in Q9 using the test sample in Q5.
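The classification steps in Q9 and Q10 could be sketched as follows. The data are a simulated stand-in for Carseats, CompPrice and Income are illustrative choices for (X2, X3), and `error_rate` is a hypothetical helper introduced here for illustration, not a function from any package.

```r
# Assumption: simulated stand-in for Carseats (not the real data)
N <- 400
Carseats <- data.frame(
  Price     = rnorm(N, mean = 115, sd = 24),
  CompPrice = rnorm(N, mean = 125, sd = 15),
  Income    = runif(N, min = 20, max = 120)
)
Carseats$Sales <- 13 - 0.05 * Carseats$Price + rnorm(N, sd = 2)
Carseats$HighSales <- Carseats$Sales >= median(Carseats$Sales)

# Random 75/25 split as in Q5 (no seed is set)
train <- runif(N) <= 0.75
Carseats.train <- Carseats[train, ]
Carseats.test  <- Carseats[!train, ]

# Logistic regression fitted on the training sample
logit_fit <- glm(HighSales ~ Price + CompPrice + Income,
                 data = Carseats.train, family = binomial)

# Hypothetical helper: confusion table and misclassification rate
error_rate <- function(model, data) {
  prob <- predict(model, newdata = data, type = "response")  # estimated Pr(HighSales = 1)
  pred <- prob >= 0.5                                        # classify with the 1/2 threshold
  print(table(Predicted = pred, Actual = data$HighSales))
  mean(pred != data$HighSales)                               # share of misclassified rows
}

train_error <- error_rate(logit_fit, Carseats.train)
test_error  <- error_rate(logit_fit, Carseats.test)
print(c(train = train_error, test = test_error))
```

The same helper serves both questions because the only difference between the training and test error rates is the data passed to predict().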