Overview

• Model selection
  – Subset selection: best subset, forward, backward regression
  – Model selection criteria: MSE, cross-validation, Mallows' Cp, AIC, BIC, R²
  – Regularisation (shrinkage) methods: ridge, lasso regression
• Suggested reading: An Introduction to Statistical Learning, James et al., Ch 5, Ch 6.1, Ch 8.1

Model Selection: A Motivating Example

• We observe the following data and want to fit them with a linear model. [Figure: scatter plot of y against x, with x from 0 to 7 and y from −3 to 2]
• Consider a simple OLS fit. [Figure: the scatter with a fitted straight line]
• Add a quadratic term. [Figure: the scatter with a fitted quadratic]
• Add more polynomial terms. [Figure: the scatter with a higher-order fit]
• Now the model has polynomial terms up to degree 9. [Figure: the scatter with a degree-9 fit]
• What is the true model? [Figures: the competing fits, plus a second panel with axis ranges 0–8 and 0–150]

Model Selection

• Training error: the prediction error observed on the training sample. It is easily calculated by applying the estimated model to the training sample.
• Test error: the error made when the estimated model is applied to new data that were not used to train/estimate the model.
• Select the model with the smallest test error!

How to measure prediction error?

• Mean squared error (MSE). For a prediction $\hat{f}(x_0)$ of a new observation $y_0 = f(x_0) + \varepsilon$, the expected MSE decomposes as
  $E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)$.
• Bias and variance cannot be reduced at the same time:
  – High bias, low variance (model too simple).
  – Low bias, high variance (model too flexible).

Model complexity and Prediction Errors

[Figure, borrowed from Hastie, Tibshirani and Friedman: training error decreases steadily as model complexity grows, while test error first falls and then rises again.]
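To put numbers on this trade-off, here is a minimal R sketch in the spirit of the motivating example. The data-generating curve, sample size, and noise level are assumptions standing in for the unknown truth behind the scatter plot; the code fits polynomials of degree 1 to 9 and compares training MSE with MSE on a held-out half of the data.

set.seed(1)
# Toy data standing in for the scatter plot above; the sine curve is an
# assumption -- in practice the true model is unknown
x <- runif(100, 0, 7)
y <- sin(x) + rnorm(100, sd = 0.4)
train <- sample(100, 50)                      # half for training, half held out

mse <- function(fit, idx)                     # MSE of 'fit' on observations 'idx'
  mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)

errs <- t(sapply(1:9, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)   # polynomial of degree d
  c(train = mse(fit, train), test = mse(fit, -train))
}))
round(errs, 3)   # training MSE can only fall as terms are added;
                 # test MSE typically falls and then rises again (overfitting)

Choosing the degree that minimises the held-out MSE, rather than the training MSE, is exactly the model-selection problem discussed below.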
Subset Selection

Predicting Test Error

• Use all the data to fit the model.
• Since no test data are left, how do we know the test error?
• Test error can be approximated by a few metrics, e.g. Mallows' Cp, AIC, BIC, adjusted R².
• Example: an OLS fit of baseball players' salaries on all available predictors (R output):

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  163.10359   90.77854   1.797 0.073622 .
AtBat         -1.97987    0.63398  -3.123 0.002008 **
Hits           7.50077    2.37753   3.155 0.001808 **
HmRun          4.33088    6.20145   0.698 0.485616
Runs          -2.37621    2.98076  -0.797 0.426122
RBI           -1.04496    2.60088  -0.402 0.688204
Walks          6.23129    1.82850   3.408 0.000766 ***
Years         -3.48905   12.41219  -0.281 0.778874
CAtBat        -0.17134    0.13524  -1.267 0.206380
CHits          0.13399    0.67455   0.199 0.842713
CHmRun        -0.17286    1.61724  -0.107 0.914967
CRuns          1.45430    0.75046   1.938 0.053795 .
CRBI           0.80771    0.69262   1.166 0.244691
CWalks        -0.81157    0.32808  -2.474 0.014057 *
LeagueN       62.59942   79.26140   0.790 0.430424
DivisionW   -116.84925   40.36695  -2.895 0.004141 **
PutOuts        0.28189    0.07744   3.640 0.000333 ***
Assists        0.37107    0.22120   1.678 0.094723 .
Errors        -3.36076    4.39163  -0.765 0.444857
NewLeagueN   -24.76233   79.00263  -0.313 0.754218
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 315.6 on 243 degrees of freedom
  (59 observations deleted due to missingness)
Multiple R-squared: 0.5461, Adjusted R-squared: 0.5106
F-statistic: 15.39 on 19 and 243 DF, p-value: < 2.2e-16

> AIC(m1)
[1] 3794.383
> BIC(m1)
[1] 3869.398

Model Selection Procedures

Backward Selection: Baseball player's salary
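The output above is consistent with the Hitters data from the ISLR package (322 players, 59 with missing salaries), so the following sketch assumes that data set. R's step() performs the backward search: starting from the full model, it repeatedly deletes the predictor whose removal gives the largest drop in AIC.

library(ISLR)                         # assumed source of the baseball salary data
Hitters <- na.omit(Hitters)           # drop the 59 players with missing Salary
m1 <- lm(Salary ~ ., data = Hitters)  # full model, as in the output above
AIC(m1)                               # 3794.383
BIC(m1)                               # 3869.398

# Backward selection driven by AIC
m_back <- step(m1, direction = "backward", trace = 0)
summary(m_back)

# Setting k = log(n) makes step() penalise by BIC instead of AIC
m_back_bic <- step(m1, direction = "backward", k = log(nrow(Hitters)), trace = 0)

Note that step() reports AIC via extractAIC(), which omits constant terms, so its printed values differ from AIC()'s by a fixed offset; the ranking of candidate models is unaffected.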
Model Selection: Estimate Test Error

Holdout estimation

• What to do if the amount of data is limited?
• The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
  – Usually: one third for testing, the rest for training.
• Problem: the samples might not be representative.
  – Example: a class might be missing in the test data.
• An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.

Cross-validation

• Cross-validation avoids overlapping test sets.
  – First step: split the data into k subsets of equal size.
  – Second step: use each subset in turn for testing and the remainder for training.
• This is called k-fold cross-validation.
• Often the subsets are stratified before the cross-validation is performed.
• The k error estimates are averaged to yield an overall error estimate.

More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation.
• Why ten? Extensive experiments have shown that this is the best choice for obtaining an accurate estimate, and there is also some theoretical evidence for it.
• Stratification reduces the estimate's variance.

10-fold Cross-validation

The Cross-Validation MSE: Summary

• The k held-out error estimates are averaged to give the cross-validation estimate of the test MSE:
  $\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i$
• Select the model with the smallest cross-validation MSE.
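As a concrete sketch of this procedure, the following R code applies 10-fold cross-validation to the degree-selection problem from the motivating example. The toy data (same assumed form as in the first sketch) and the candidate degrees 1 to 9 are illustrative assumptions.

set.seed(2)
# Toy data of the same assumed form as in the first sketch
x <- runif(100, 0, 7)
y <- sin(x) + rnorm(100, sd = 0.4)

k     <- 10
folds <- sample(rep(1:k, length.out = length(y)))      # random fold label per point

cv_mse <- sapply(1:9, function(d) {                    # candidate degrees 1..9
  mean(sapply(1:k, function(j) {
    fit  <- lm(y ~ poly(x, d), subset = (folds != j))  # train on the other 9 folds
    pred <- predict(fit, data.frame(x = x[folds == j]))
    mean((y[folds == j] - pred)^2)                     # MSE on the held-out fold
  }))                                                  # CV_(k): mean of the k MSEs
})
which.min(cv_mse)                                      # degree with the smallest CV MSE

The selected degree is then refitted on the full data set; because every observation is used for both training and testing (in different folds), the estimate is less wasteful than a single holdout split.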