Week 9: Multiple Linear Regression and Assessing Prediction Models
STA130 INSTRUCTORS, UNIVERSITY OF TORONTO © BY-NC 1
Branches of statistical
inference
Taking a step back: branches of statistical inference
Testing: A hypothesis test evaluates evidence against a particular value of a parameter.
Statistical methods:
• Hypothesis test for one proportion
• Randomization test to compare the values of a parameter across two groups
Estimation: Confidence interval estimating a parameter (gives a range of plausible values for a parameter)
Statistical method:
• Bootstrap confidence interval
Prediction: Predict value of a variable for an observation using a statistical model based on other variables.
Statistical methods:
• Simple linear regression (one predictor) – last week
• Multiple linear regression (multiple predictors) – this week
• Classification trees – next week
Using data to make predictions
◦ Using apartment rental data for the past several years, can we predict the average
rental price for one-bedroom apartments in a given year?
◦ Using weather data, can we build a model to predict which days would be good days
to go to the beach?
Types of variables
The x variables are often called predictors, covariates, independent variables,
explanatory variables, inputs, or features.
The y variable is often called the response, output, outcome, or dependent variable.
Types of models for prediction
There are many types of models to choose from to make predictions. In this
course, we’ll look at two types of models:
◦ Linear regression: Useful when the response y is numerical
◦ Classification trees: Useful when the response y is categorical
Measuring and assessing
prediction accuracy
Strategy for assessing prediction accuracy
Strategy:
1. Randomly divide the sample data into “training” and “testing” datasets
Ex: 80% for training and 20% for testing
2. Fit the prediction model based on the training data
3. Run the “test” data through the fitted prediction model (i.e. make predictions for the test
data) and look at how accurate the prediction model is.
Why might we want to measure the prediction accuracy of a model?
A few answers:
• To quantify how accurate the predictions from our model might be (on average)
• To compare between several candidate prediction models
Two questions
Why don’t we just use all the data to build our prediction model,
and then check how accurate the predictions are?
When we fit a prediction model, we use the pairs $(x_i, y_i)$ to get our estimates of the regression coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots$. Because of this, if we use our fitted model to make predictions for the same observations that were used to fit the model, then since the model has already "seen" the data it will, in general, make more accurate predictions for these same observations.
Instead, it is more fair to use different observations for fitting (or building) and testing a prediction model.
How can we measure how accurate the predictions from our linear regression model are?
We need to be able to calculate some number that summarizes the “accuracy” of the predictions (on
average). There are several measures of this, but in this course we’ll focus on the Root Mean Squared
Error (RMSE).
How to assess how well a fitted regression line
performs as a predictive model?
While the coefficient of determination $R^2$ tells us the proportion of variability in $y$ that is explained by our fitted regression model, it doesn't directly tell us how accurate predictions from this model are...

The Root Mean Squared Error (RMSE) measures prediction error for predictions from a linear regression model:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

The RMSE can be compared across datasets of different sizes (as long as they are all in the same units) and across different models fit to the same dataset.
Taking the square root means that RMSE is in the same units (and on the same scale) as $y$.
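As a quick numerical check, the RMSE formula above can be computed directly in R. The numbers below are made up for illustration (they are not from any course dataset):

```r
# Hypothetical observed values and predictions (illustration only)
y    <- c(150, 160, 170, 180)   # observed responses
yhat <- c(152, 158, 171, 176)   # predictions from some fitted model

# RMSE: square the errors, average them, then take the square root
rmse <- sqrt(mean((y - yhat)^2))
rmse  # 2.5 here: on average, predictions are off by about 2.5 units of y
```

Note that `mean((y - yhat)^2)` computes the same quantity as `sum((y - yhat)^2) / n` used in the slides' code.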
R Code to create training and testing datasets
library(tidyverse)
heights <- read.csv("heights.csv") %>% select(-X)
set.seed(123)
n <- nrow(heights)
# Pick 80% of observations to go into the training dataset
training_indices <- sample(1:n, size = round(0.8 * n))
training_indices
# Add a column called "rowid" to our heights tibble
heights <- heights %>% rowid_to_column()
glimpse(heights)
R Code to create training and testing datasets
# Create training dataset
train <- heights %>% filter(rowid %in% training_indices)
# Testing dataset includes all observations NOT in the training data
test <- heights %>% filter(!(rowid %in% training_indices))
R Code to make predictions using a fitted regression
model
# Build model using the training data only
model <- lm(height ~ shoePrint, data = train)
# Make predictions for test data using the training model
yhat_test <- predict(model, newdata = test)
yhat_test
y_test <- test$height
# Calculate RMSE for predictions in the test dataset
sqrt( sum( (y_test - yhat_test)^2 ) / nrow(test) )
This computes $\text{RMSE} = \sqrt{\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\left(y_i - \hat{y}_i\right)^2}$ for the test data.
Interpretation and purpose of the RMSE
A small value of RMSE indicates that on average, the predictions for the response are close to
the true (observed) values.
But how small is small?
The RMSE is most useful as a tool to compare the prediction accuracy of several models with
different predictors, to help us choose which one to use. We’ll look at this further in
upcoming videos…
Recap: $R^2$ vs RMSE

|  | $R^2$ | RMSE |
|---|---|---|
| Formula | $1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ | $\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ |
| Range | 0 to 1 | 0 to infinity |
| Interpretation | Larger value indicates a larger proportion of variation of the response is explained by the model | Smaller value indicates better prediction accuracy |
| Units | No units | Same units as $y$ (the response variable) |
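To see the two measures side by side, here is a small sketch using R's built-in `cars` dataset (a stand-in, not a course dataset). For simplicity both numbers are computed on the same data used to fit the model; the slides' strategy would compute RMSE on held-out test data instead.

```r
# Fit a simple linear regression on built-in data (illustration only)
model <- lm(dist ~ speed, data = cars)

r_squared <- summary(model)$r.squared    # unitless, between 0 and 1
rmse <- sqrt(mean(residuals(model)^2))   # same units as dist (feet)

c(r_squared = r_squared, rmse = rmse)
```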
Multiple linear regression:
Building richer prediction
models
eBay auctions of Mario Kart games
Items can be sold on ebay.ca through an auction
◦ Sellers can post items for sale; they specify a duration for the auction and a minimum selling price
◦ Interested potential buyers can submit bids until the auction closes
◦ The person who bids the highest price purchases the item for that price
The mariokart dataset in the openintro package includes eBay sales of the Mario Kart game
for Nintendo Wii in October 2009.
Question: Are longer auctions associated with higher prices?
Association between auction duration and price
Do you notice anything unusual in the scatterplot on the right?
Yes! Two of the values of total price are much(!!) higher than the rest… Why? It turns out that in these two
auctions (and only these two) the game was sold with other items.
Association between auction duration and price
Description of this association
There appears to be a negative association
between auction duration and total price of
Mario Kart games sold on eBay. The
relationship is relatively linear, although it is
only weak to moderate in strength.
Does this make sense?
Maybe there isn’t actually a meaningful association between auction duration and total price, and
the observed negative association could be just due to random chance?
Are these data consistent with a slope of 0?
We will assume that the 4 assumptions discussed last week are satisfied, so we can use p-values from the output of lm().

$H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$

Conclusion: Since the p-value for testing $\beta_1 = 0$ vs $\beta_1 \neq 0$ is much smaller than 0.001, we have strong evidence against the hypothesis that the slope is equal to 0.
There must be something else affecting the relationship between duration and price...

Model: $\text{price}_i = \beta_0 + \beta_1 \times \text{duration}_i + \epsilon_i$
There are many other possible predictors!
Let’s consider the role of condition on sale price
New games (which are more
desirable), were mostly sold in
one-day auctions!
Let’s consider the role of condition on sale price
$\text{price}_i = \beta_0 + \beta_1 \times \text{cond}_i + \epsilon_i$

where $\text{cond}_i = \begin{cases} 1 & \text{if game } i \text{ is used} \\ 0 & \text{if game } i \text{ is new} \end{cases}$

We have very strong evidence against the null hypothesis that there is no difference in average sale price for new and used games.
Is the average sale price different for new and used games?
Association between sale price, duration of
auction, and condition
Multiple linear regression
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$

We can specify linear regression models with two (or more!) predictors.
These predictors can be numerical or categorical (or any combination).
where $y_i$ is the sale price of item $i$, $x_{1,i}$ is the duration of the auction, and $x_{2,i} = 1$ if the game is used (0 if new)

Fitted line for new games
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (0) = \hat{\beta}_0 + \hat{\beta}_1 x_1$
Fitted line for used games
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (1) = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 x_1$
Graphical representation
Fitted model
Equation for the fitted regression line
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 54.71 - 0.41 x_1 - 9.87 x_2$
Plotting parallel lines model
The augment function (from the broom library) creates a data frame with predicted values (in the .fitted column), residuals, etc.
◦ One row for each value in the training data
◦ Predicted values in the .fitted column
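A minimal sketch of how augment is used, with R's built-in `cars` data standing in for the training data (in the slides, `model` would be the lm fitted to the training dataset):

```r
library(broom)

# Stand-in model on built-in data (illustration only)
model <- lm(dist ~ speed, data = cars)

aug <- augment(model)  # one row for each observation used to fit the model
names(aug)             # includes .fitted (predictions) and .resid (residuals)
head(aug$.fitted)      # the predicted values
```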
Plotting parallel lines model
Join up the fitted values to plot the fitted regression model
Multiple linear regression with non-parallel lines
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{1,i} x_{2,i} + \epsilon_i$
In the previous multiple linear regression model, we assumed that the association between auction
duration and sale price was the same for new and used games (i.e. same slope), but that the intercepts
could be different.
Let’s see if condition modifies the relationship between duration and sale price.
To do this, we add a new independent variable to the linear model which is the product of duration
and cond: this is an interaction term.
where $y_i$ is the sale price of item $i$, $x_{1,i}$ is the duration of the auction, and $x_{2,i} = 1$ if the game is used (0 if new)
Model:
Fitted line for new games
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (0) + \hat{\beta}_3 x_1 (0) = \hat{\beta}_0 + \hat{\beta}_1 x_1$
Fitted line for used games
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (1) + \hat{\beta}_3 x_1 (1) = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) x_1$
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$
To fit a linear model with an interaction term in R, use * instead of + between the predictors
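For example, on a small made-up dataset shaped like the mariokart training data (the numbers below are invented for illustration), `duration * cond` expands to the two main effects plus their interaction:

```r
# Made-up stand-in for the mariokart training data (illustration only)
set.seed(1)
train <- data.frame(
  duration = rep(1:10, times = 2),
  cond     = rep(c("new", "used"), each = 10)
)
train$total_pr <- 55 - 0.5 * train$duration -
  10 * (train$cond == "used") + rnorm(20, sd = 2)

# `*` is shorthand for main effects + interaction:
mod_star <- lm(total_pr ~ duration * cond, data = train)
mod_long <- lm(total_pr ~ duration + cond + duration:cond, data = train)

coef(mod_star)  # includes a duration:condused interaction coefficient
```

Both calls produce identical fitted models; `*` is simply the more compact notation.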
Plotting fitted model with interaction term
This is an example of Simpson’s
paradox. Simpson’s paradox occurs
when adding an extra predictor to a
linear regression model changes the
direction of the association.
Comparing prediction
models
Which of these models is most useful for
predicting total price?
Criteria for choosing a prediction model
When choosing between various prediction models we want:
◦ A model with good prediction accuracy on average
◦ i.e. RMSE for test data is small (compared to other models)
◦ A model which doesn’t exhibit evidence of overfitting
◦ i.e. predictions for test observations are almost as accurate as those for the training
data
◦ RMSE for training and test data are similar
◦ Why is overfitting a problem? We want our prediction model to generalize the pattern
to make predictions for new observations!
◦ A model which is as simple as possible (i.e. not too many predictors), while still
capturing the association between the predictors and the response
Goal: Identify a model which balances these factors. It is possible that different models
appear “best” based on different criteria, but ultimately we need to choose one and
justify our choice (there may sometimes be more than one reasonable choice).
set.seed(1211)
n <- nrow(mariokart2)
training_indices <- sample(1:n, size=round(0.8*n))
mariokart2 <- mariokart2 %>% rowid_to_column() # adds a new ID column called rowid
# Create training dataset
train <- mariokart2 %>% filter(rowid %in% training_indices)
y_train <- train$total_pr
# Testing dataset includes all observations NOT in the training data
test <- mariokart2 %>% filter(!(rowid %in% training_indices))
y_test <- test$total_pr
# Fit models to training data
modA_train <- lm(total_pr ~ duration, data = train)
modB_train <- lm(total_pr ~ cond, data = train)
modC_train <- lm(total_pr ~ duration + cond, data=train)
modD_train <- lm(total_pr ~ duration * cond, data=train)
Creating training/testing datasets & fitting models
# Make predictions for testing data using training model
yhat_modA_test <- predict(modA_train, newdata = test)
yhat_modB_test <- predict(modB_train, newdata = test)
yhat_modC_test <- predict(modC_train, newdata = test)
yhat_modD_test <- predict(modD_train, newdata = test)
# Make predictions for training data using training model
yhat_modA_train <- predict(modA_train, newdata = train)
yhat_modB_train <- predict(modB_train, newdata = train)
yhat_modC_train <- predict(modC_train, newdata = train)
yhat_modD_train <- predict(modD_train, newdata = train)
Making predictions for observations using either
training or testing data (separately)
# Calculate RMSE for testing data
modA_test_RMSE <- sqrt(sum((y_test - yhat_modA_test)^2) / nrow(test))
modB_test_RMSE <- sqrt(sum((y_test - yhat_modB_test)^2) / nrow(test))
modC_test_RMSE <- sqrt(sum((y_test - yhat_modC_test)^2) / nrow(test))
modD_test_RMSE <- sqrt(sum((y_test - yhat_modD_test)^2) / nrow(test))
# Calculate RMSE for training data
modA_train_RMSE <- sqrt(sum((y_train - yhat_modA_train)^2) / nrow(train))
modB_train_RMSE <- sqrt(sum((y_train - yhat_modB_train)^2) / nrow(train))
modC_train_RMSE <- sqrt(sum((y_train - yhat_modC_train)^2) / nrow(train))
modD_train_RMSE <- sqrt(sum((y_train - yhat_modD_train)^2) / nrow(train))
Calculating RMSE
Comparing our four models
mytable <- tibble(Model = c("A","B","C","D"),
RMSE_testdata = c(modA_test_RMSE, modB_test_RMSE,
modC_test_RMSE, modD_test_RMSE),
RMSE_traindata = c(modA_train_RMSE, modB_train_RMSE,
modC_train_RMSE, modD_train_RMSE),
ratio_of_RMSEs = RMSE_testdata / RMSE_traindata)
library(knitr)
knitr::kable(mytable)
Comparing our four models
Prediction accuracy? (look at RMSE for test data)
Model D has a much lower RMSE for the test data
than the other models. Models B and C are similar
to each other in terms of RMSE, and model A has a
much higher RMSE.
Evidence of overfitting? (compare RMSE for training and testing data)
The RMSE for model D is about 15% higher for the test data than the training data, while for the other models,
this ranges from 17% to 21%. Model D shows the least evidence of overfitting among these models.
Simplicity of the model
Models A and B each have one predictor, while model C has two predictors and model D has two predictors and
one interaction term (total of 4 regression coefficients).
Conclusion
Although model D is the most complex of these models, with four regression coefficients, it exhibits the best
prediction accuracy (i.e. lowest RMSE for test data) and shows the least evidence of overfitting. Thus, it is
reasonable to choose model D to predict the sale price of Mario Kart games sold on eBay around October 2009.
What is (and isn’t) a
linear regression model
What makes a model linear?
The “linear” in linear regression means that the equation is linear in the parameters
Examples of linear regression models:
• $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$
• $y_i = \beta_0 + \beta_1 x_{1,i}^2 + \beta_2 x_{2,i}^2 + \epsilon_i$
• $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \cdots + \beta_{99} x_{99,i} + \epsilon_i$
• All the models we considered today
Examples of non-linear models:
• $y_i = \beta_0 + x_{1,i}^{\beta_1} + x_{2,i}^{\beta_2} + \epsilon_i$
• $y_i = e^{\beta_0 + \beta_1 x_{1,i}} + \epsilon_i$
• $y_i = \dfrac{\beta_0 + \beta_1 x_{1,i}}{1 + \beta_2 x_{2,i}} + \epsilon_i$
In each of these, at least one parameter $\beta_j$ does not enter the equation linearly.
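The distinction can be checked in R: a model with a squared predictor is still linear in the parameters, so lm() fits it directly. The data below are simulated purely for illustration:

```r
# Simulated data: y depends on x^2, but the model is linear in beta0 and beta1
set.seed(42)
x <- runif(50, min = 0, max = 5)
y <- 2 + 3 * x^2 + rnorm(50)

# I(x^2) tells lm() to use the squared predictor as-is in the formula
fit <- lm(y ~ I(x^2))
coef(fit)  # estimates should be close to the true values 2 and 3
```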