Overview

• Model selection
  – Subset selection: best subset, forward, backward regression
  – Model selection criteria: MSE, cross-validation, Mallows' Cp, AIC, BIC, R²
  – Regularisation (shrinkage) methods: ridge, lasso regression
• Suggested reading: An Introduction to Statistical Learning, James et al., Ch 5, Ch 6.1, Ch 8.1

Model Selection: A Motivating Example

• We observe the following data and want to fit them with a linear model. [Figure: scatter plot of y against x, with x from 0 to 7 and y from −3 to 2]
• Consider a simple OLS fit. [Figure: the scatter with a fitted straight line]
• Add a quadratic term. [Figure: the scatter with a fitted quadratic]
• Add more polynomial terms. [Figure: the scatter with a higher-order fit]
• Now the model has polynomial terms up to degree 9. [Figure: the scatter with a degree-9 fit]
• What is the true model? [Figures: the competing fits, plus a second panel with axis ranges 0–8 and 0–150]

Model Selection

• Training error: the prediction error observed on the training sample. It is easily calculated by applying the estimated model to the training sample.
• Test error: the error made when the estimated model is applied to new data that were not used to train/estimate the model.
• Select the model with the smallest test error!

How to measure prediction error?

• Mean squared error (MSE). For a prediction $\hat{f}(x_0)$ of a new observation $y_0 = f(x_0) + \varepsilon$, the expected MSE decomposes as
  $E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)$.
• Bias and variance cannot be reduced at the same time:
  – High bias, low variance (model too simple).
  – Low bias, high variance (model too flexible).

Model complexity and Prediction Errors

[Figure, borrowed from Hastie, Tibshirani and Friedman: training error decreases steadily as model complexity grows, while test error first falls and then rises again.]
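To put numbers on this trade-off, here is a minimal R sketch in the spirit of the motivating example. The data-generating curve, sample size, and noise level are assumptions standing in for the unknown truth behind the scatter plot; the code fits polynomials of degree 1 to 9 and compares training MSE with MSE on a held-out half of the data.

set.seed(1)
# Toy data standing in for the scatter plot above; the sine curve is an
# assumption -- in practice the true model is unknown
x <- runif(100, 0, 7)
y <- sin(x) + rnorm(100, sd = 0.4)
train <- sample(100, 50)                      # half for training, half held out

mse <- function(fit, idx)                     # MSE of 'fit' on observations 'idx'
  mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)

errs <- t(sapply(1:9, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)   # polynomial of degree d
  c(train = mse(fit, train), test = mse(fit, -train))
}))
round(errs, 3)   # training MSE can only fall as terms are added;
                 # test MSE typically falls and then rises again (overfitting)

Choosing the degree that minimises the held-out MSE, rather than the training MSE, is exactly the model-selection problem discussed below.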
Subset Selection

Predicting Test Error

• Use all the data to fit the model.
• Since no test data are left, how do we know the test error?
• Test error can be approximated by a few metrics, e.g. Mallows' Cp, AIC, BIC, adjusted R².
• Example: an OLS fit of baseball players' salaries on all available predictors (R output):

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  163.10359   90.77854   1.797 0.073622 .
AtBat         -1.97987    0.63398  -3.123 0.002008 **
Hits           7.50077    2.37753   3.155 0.001808 **
HmRun          4.33088    6.20145   0.698 0.485616
Runs          -2.37621    2.98076  -0.797 0.426122
RBI           -1.04496    2.60088  -0.402 0.688204
Walks          6.23129    1.82850   3.408 0.000766 ***
Years         -3.48905   12.41219  -0.281 0.778874
CAtBat        -0.17134    0.13524  -1.267 0.206380
CHits          0.13399    0.67455   0.199 0.842713
CHmRun        -0.17286    1.61724  -0.107 0.914967
CRuns          1.45430    0.75046   1.938 0.053795 .
CRBI           0.80771    0.69262   1.166 0.244691
CWalks        -0.81157    0.32808  -2.474 0.014057 *
LeagueN       62.59942   79.26140   0.790 0.430424
DivisionW   -116.84925   40.36695  -2.895 0.004141 **
PutOuts        0.28189    0.07744   3.640 0.000333 ***
Assists        0.37107    0.22120   1.678 0.094723 .
Errors        -3.36076    4.39163  -0.765 0.444857
NewLeagueN   -24.76233   79.00263  -0.313 0.754218
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 315.6 on 243 degrees of freedom
  (59 observations deleted due to missingness)
Multiple R-squared: 0.5461, Adjusted R-squared: 0.5106
F-statistic: 15.39 on 19 and 243 DF, p-value: < 2.2e-16

> AIC(m1)
[1] 3794.383
> BIC(m1)
[1] 3869.398

Model Selection Procedures

Backward Selection: Baseball player's salary
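The output above is consistent with the Hitters data from the ISLR package (322 players, 59 with missing salaries), so the following sketch assumes that data set. R's step() performs the backward search: starting from the full model, it repeatedly deletes the predictor whose removal gives the largest drop in AIC.

library(ISLR)                         # assumed source of the baseball salary data
Hitters <- na.omit(Hitters)           # drop the 59 players with missing Salary
m1 <- lm(Salary ~ ., data = Hitters)  # full model, as in the output above
AIC(m1)                               # 3794.383
BIC(m1)                               # 3869.398

# Backward selection driven by AIC
m_back <- step(m1, direction = "backward", trace = 0)
summary(m_back)

# Setting k = log(n) makes step() penalise by BIC instead of AIC
m_back_bic <- step(m1, direction = "backward", k = log(nrow(Hitters)), trace = 0)

Note that step() reports AIC via extractAIC(), which omits constant terms, so its printed values differ from AIC()'s by a fixed offset; the ranking of candidate models is unaffected.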
Model Selection: Estimate Test Error

Holdout estimation

• What to do if the amount of data is limited?
• The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
  – Usually: one third for testing, the rest for training.
• Problem: the samples might not be representative.
  – Example: a class might be missing in the test data.
• An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.

Cross-validation

• Cross-validation avoids overlapping test sets.
  – First step: split the data into k subsets of equal size.
  – Second step: use each subset in turn for testing and the remainder for training.
• This is called k-fold cross-validation.
• Often the subsets are stratified before the cross-validation is performed.
• The k error estimates are averaged to yield an overall error estimate.

More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation.
• Why ten? Extensive experiments have shown that this is the best choice for obtaining an accurate estimate, and there is also some theoretical evidence for it.
• Stratification reduces the estimate's variance.

10-fold Cross-validation

The Cross-Validation MSE: Summary

• The k held-out error estimates are averaged to give the cross-validation estimate of the test MSE:
  $\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i$
• Select the model with the smallest cross-validation MSE.
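As a concrete sketch of this procedure, the following R code applies 10-fold cross-validation to the degree-selection problem from the motivating example. The toy data (same assumed form as in the first sketch) and the candidate degrees 1 to 9 are illustrative assumptions.

set.seed(2)
# Toy data of the same assumed form as in the first sketch
x <- runif(100, 0, 7)
y <- sin(x) + rnorm(100, sd = 0.4)

k     <- 10
folds <- sample(rep(1:k, length.out = length(y)))      # random fold label per point

cv_mse <- sapply(1:9, function(d) {                    # candidate degrees 1..9
  mean(sapply(1:k, function(j) {
    fit  <- lm(y ~ poly(x, d), subset = (folds != j))  # train on the other 9 folds
    pred <- predict(fit, data.frame(x = x[folds == j]))
    mean((y[folds == j] - pred)^2)                     # MSE on the held-out fold
  }))                                                  # CV_(k): mean of the k MSEs
})
which.min(cv_mse)                                      # degree with the smallest CV MSE

The selected degree is then refitted on the full data set; because every observation is used for both training and testing (in different folds), the estimate is less wasteful than a single holdout split.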