Multiple Linear Regression Model - Part 4
(self-study)
Model evaluation and variable selection
Dr. Linh Nghiem
STAT 3022
Applied linear models
Overview
We consider the MLR with outcome $y$ and $p-1$ predictors $X_1, \dots, X_{p-1}$.
• The number of possible models is $2^{p-1}$, not counting models with interaction or polynomial terms.
• When $p$ is large, multicollinearity becomes more serious, so there are likely predictors that contribute little to the model once the others are already included.
• You are expected to do model comparison and variable selection for the project (see the Rmarkdown file on Canvas for the accompanying R code); however, this section will not be tested in the quiz or final exam.
Overview
In this section, we review some criteria and strategies that can facilitate variable and model selection.
• There is no 'best' criterion or strategy. In the end, the choice of the 'best' model is subjective and depends highly on your objectives.
First, we will review different criteria for evaluating a model. They come from different perspectives:
1. Quality of fitted models (i.e. in-sample prediction): $R^2$, adjusted $R^2$, MSE.
2. Information criteria: AIC, BIC.
3. Prediction: cross-validation.
Model evaluation
Quality of fitted models
Given a model with $p-1$ predictors, then
$$R^2_p = 1 - \frac{SSE_p}{SST},$$
where we use the subscript $p$ to highlight the dependence of $R^2$ and $SSE$ on $p$. A higher $R^2$ suggests the model fits the data better.
• As more predictors are added to the model ($p$ increases), $R^2_p$ always increases.
• Therefore, $R^2$ is most useful for comparing models containing the same number of predictors.
• If you want to compare models with different numbers of predictors using $R^2$, look at the rate of increase.
Quality of fitted models
[Figure: maximum $R^2$ versus model size $p$ (from 3 to 9); $R^2$ always increases with $p$.]
The line represents the maximum $R^2$ at each model size, i.e. each number of predictors in the model.
Quality of fitted models
The disadvantage of $R^2$ is that it does not take the number of predictors $p$ into account. Hence, a better measure of the quality of a fitted model is the adjusted $R^2$, defined as
$$\text{Adj-}R^2 = 1 - \frac{SSE_p/(n-p)}{SST/(n-1)} = 1 - \frac{MSE_p}{SST/(n-1)}.$$
• A higher Adj-$R^2$ suggests a better fit.
• Since $SST$ does not change across fitted models, choosing the model with maximum Adj-$R^2$ is the same as choosing the model with minimum $MSE_p$.
- Note that $E(MSE_p) = \sigma^2$ when the model is correctly specified, so the model with maximum Adj-$R^2$ is roughly the model with the smallest estimated error variance.
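As an illustration, here is a minimal R sketch with simulated data (the data frame `dat` and its variables are hypothetical) showing where these quantities live in a fitted `lm` object:

```r
set.seed(1)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)   # x2 is a pure noise predictor

fit <- lm(y ~ x1 + x2, data = dat)
s   <- summary(fit)
s$r.squared       # R^2_p   = 1 - SSE_p / SST
s$adj.r.squared   # Adj-R^2 = 1 - MSE_p / (SST / (n - 1))
s$sigma^2         # MSE_p   = SSE_p / (n - p)
```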
Quality of fitted models
While $R^2$ always increases as $p$ increases, Adj-$R^2$ does not. Equivalently, $MSE_p$ does not always decrease as $p$ increases.
[Figure: three panels versus model size $p$ (from 6 to 10). Left: $R^2$ always increases with $p$. Middle: adjusted $R^2$ does not always increase with $p$. Right: $MSE_p$ does not always decrease with $p$.]
Model evaluation using Information Criteria
Akaike Information Criterion: $AIC = n\log(SSE_p) + 2p$
Bayesian Information Criterion: $BIC = n\log(SSE_p) + \log(n)\,p$
• The first term in each criterion measures model error and decreases as $p$ increases.
• The second term in each criterion is called the penalty, which increases as $p$ increases.
• Hence, we favor the model with the minimum information criterion, since it achieves the best balance between error and model size.
• The penalty in BIC is larger than the penalty in AIC (whenever $\log(n) > 2$, i.e. $n \geq 8$), so BIC tends to favor a smaller model.
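In R, `AIC()` and `BIC()` (or `extractAIC()`) compute these criteria for fitted `lm` objects. Their formulas differ from the ones above by additive constants that are the same for every candidate model on the same data, so model rankings are unaffected. A sketch, continuing with the simulated `dat` above:

```r
fit1 <- lm(y ~ x1,      data = dat)
fit2 <- lm(y ~ x1 + x2, data = dat)

AIC(fit1); AIC(fit2)   # -2 logLik + 2 * (number of parameters)
BIC(fit1); BIC(fit2)   # -2 logLik + log(n) * (number of parameters)

# extractAIC() returns n * log(SSE_p / n) + k * p, which matches the
# slide formulas up to the constant -n * log(n):
extractAIC(fit2)                       # k = 2      (AIC penalty)
extractAIC(fit2, k = log(nrow(dat)))   # k = log(n) (BIC penalty)
```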
Model evaluation from prediction perspective
The most common (modern) methods to evaluate a statistical
model is to evaluate model through its predictive performance.
Essentially, this involves three steps:
1. From the original data, create pairs of independent training
and test sets.
2. For each pair of training and test data sets:
• Fit the model on the training data.
• Use the model to predict the test data, and compute the prediction errors.
3. Select the model with the smallest average prediction error
across all test sets.
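A minimal sketch of one training/test pair (an 80/20 hold-out split), again with the simulated `dat` from earlier:

```r
set.seed(2)
train_id <- sample(nrow(dat), size = 0.8 * nrow(dat))
train <- dat[train_id, ]    # step 1: training set
test  <- dat[-train_id, ]   #         and independent test set

fit  <- lm(y ~ x1 + x2, data = train)   # step 2: fit on training data
pred <- predict(fit, newdata = test)    #         predict the test data
mean((test$y - pred)^2)                 # mean squared prediction error
```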
Leave-one-out cross-validation (LOOCV)
Suppose we have sample size $n$, design matrix $X$ with rows $x_i^\top$, and response vector $y = (y_1, \dots, y_n)^\top$.
For $i = 1, \dots, n$:
• Form the training data $X_{(-i)}, y_{(-i)}$ by leaving out $x_i^\top$ and $y_i$; hold out that observation as the test data.
• Fit the model on the training data and obtain the estimate $b_{(i)}$.
• Compute the prediction $\hat{y}_{i(i)} = x_i^\top b_{(i)}$ and the prediction error $y_i - \hat{y}_{i(i)}$.
Finally, compute the average squared prediction error:
$$CV = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{i(i)} \right)^2$$
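The procedure translates directly into an R loop (a sketch, reusing the simulated `dat` and the model formula from the earlier sketches):

```r
n  <- nrow(dat)
pe <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])             # fit without case i
  pe[i] <- dat$y[i] - predict(fit_i, newdata = dat[i, ]) # prediction error
}
cv_loocv <- mean(pe^2)   # the LOOCV statistic
cv_loocv
```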
Leave-one-out cross-validation
[Figure: schematic of leave-one-out cross-validation; at each step one observation is held out as test data and the remaining $n-1$ observations form the training data.]
Leave-one-out cross-validation (LOOCV)
In general, LOOCV is computationally expensive, especially when $n$ is large, since it involves fitting the model $n$ times.
However, for MLR, it turns out that the above CV quantity can be
computed by the following formula:
$$CV = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2,$$
where $e_i$ is the $i$th residual after fitting the MLR on the entire data, and $h_{ii}$ denotes the $(i,i)$ element of $H = X(X^\top X)^{-1}X^\top$.
Without the scaling 1/n in the formula for the LOOCV statistic, we
obtain the PRediction Error Sum of Squares, or PRESS, for MLR.
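Both quantities therefore require only a single fit of the model. A sketch using `resid()` and `hatvalues()`:

```r
fit <- lm(y ~ x1 + x2, data = dat)   # one fit on the entire data
e   <- resid(fit)                    # residuals e_i
h   <- hatvalues(fit)                # leverages h_ii

cv_short <- mean((e / (1 - h))^2)    # equals the LOOCV loop result
press    <- sum((e / (1 - h))^2)     # PRESS: no 1/n scaling
```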
K-fold cross-validation
Since LOOCV is generally computationally expensive when $n$ is large, we can reduce the computational burden by dividing the data into folds, i.e. subsets of the data.
[Figure: schematic of $K$-fold cross-validation; at each step one fold (orange) is held out as test data and the remaining folds (cyan) form the training data.]
K-fold cross-validation
• Divide the index set $\{1, \dots, n\}$ into $K$ non-overlapping subsets (folds) $S_1, \dots, S_K$.
• For $k = 1, \dots, K$:
- Leave out all observations in the $k$th fold and hold them out as the test data (orange part).
- Fit the model on the training data (cyan part) and obtain the estimated coefficients $b_{(k)}$.
- Make predictions on the test data, and compute the sum of squared prediction errors:
$$e_k = \sum_{i \in S_k} \left( y_i - x_i^\top b_{(k)} \right)^2$$
• Finally, average the prediction errors over all folds:
$$CV = \frac{1}{n} \sum_{k=1}^{K} e_k$$
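A sketch of $K$-fold CV in R with $K = 5$, again on the simulated `dat`:

```r
set.seed(3)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))   # random fold assignment
e_k   <- numeric(K)
for (k in 1:K) {
  train_k <- dat[folds != k, ]   # training data (cyan part)
  test_k  <- dat[folds == k, ]   # test data (orange part)
  fit_k   <- lm(y ~ x1 + x2, data = train_k)
  e_k[k]  <- sum((test_k$y - predict(fit_k, test_k))^2)   # SSE on fold k
}
cv_kfold <- sum(e_k) / nrow(dat)   # average over all n held-out cases
cv_kfold
```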
Automatic Search Procedure
Stepwise Regression
Given criteria for evaluating a regression model, we next consider some classical strategies for building MLR models with a large number of potential predictors; an R sketch follows the list.
• Forward selection:
- Start with no variables in the model.
- At each step, add the one predictor that improves the model fit the most (using one of the discussed criteria).
- Stop when no significant improvement can be made.
• Backward elimination:
- Start with all variables in the model.
- At each step, remove the one predictor whose removal improves the model fit the most (using one of the discussed criteria).
- Stop when no significant improvement can be made.
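In R, both searches are implemented by `step()`; it uses AIC by default, and setting `k = log(n)` switches the penalty to BIC. A sketch, reusing the simulated `dat` from earlier:

```r
n    <- nrow(dat)
null <- lm(y ~ 1, data = dat)   # intercept-only model
full <- lm(y ~ ., data = dat)   # model with all candidate predictors

# Forward selection with the BIC penalty:
step(null, scope = formula(full), direction = "forward", k = log(n))

# Backward elimination with the BIC penalty:
step(full, direction = "backward", k = log(n))
```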
Example: credit balance data
Consider the regression of Y = Balance versus all the other
p− 1 = 10 predictors, with n = 400. Let’s choose BIC to be our
model criterion. First, let’s consider forward selection.
• Step 0: No variable. Current model: Balance ∼ 1 (intercept
only). bic = 4910.
• Step 1: Fit 10 simple linear regressions of Y versus each
predictor separately, and compute the BIC of each model.
Added predictor   BIC
Income            4819
Rating            4368
Limit             4373
...               ...
Age               4920
Example: credit balance data
After step 1, adding Rating to the current model results in the
lowest BIC among all 10 models, so we add Rating to the model.
The current model is Balance ∼ Rating with BIC = 4368.
• Step 2: consider the 9 multiple regression models of $Y$ versus Rating and one other predictor, separately.
Added predictor   BIC
Income            4089
Limit             4374
...               ...
Age               4361
Adding Income to the current model results in the lowest BIC
among all 9 models with BIC = 4089 < 4368, so we add Income
to the model.
Example: credit balance data
The process continues. At each step, we consider adding the one variable that decreases the BIC the most.
Now let’s consider the process after step 5, when the current
model is Balance ∼ Rating + Income + Student + Limit + Cards
with BIC = 3706.
• Step 6: Consider the 5 multiple regression models of $Y$ versus all the above predictors plus one additional predictor.
Added predictor   BIC
Ethnicity         3717
Gender            3711
Married           3712
Education         3712
Age               3707
Example: credit balance data
Since there is no improvement in BIC, we stop there. The entire process is summarized in the following table, where each row shows the predictor added at each successive step.
Added variable   BIC
Rating           4368
Income           4089
Student          3730
Limit            3717
Cards            3706
The final model is Balance ∼ Rating + Income + Student + Limit
+ Cards.
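This example appears to match the `Credit` data set in the ISLR package ($n = 400$ with the same ten predictors); assuming so, the whole forward search can be reproduced in one call:

```r
library(ISLR)                            # assuming the Credit data set here
credit <- subset(Credit, select = -ID)   # drop the ID column
n <- nrow(credit)                        # 400

null <- lm(Balance ~ 1, data = credit)
full <- lm(Balance ~ ., data = credit)
step(null, scope = formula(full), direction = "forward", k = log(n))
# Note: the printout is labeled "AIC", but with k = log(n) the values
# are the BIC of each candidate model (up to an additive constant).
```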
Example: credit balance data
Now, we consider backward elimination with the same BIC.
• Step 0: Start with the model that includes all 10 predictors. BIC = 3734.
• Step 1: Remove each predictor separately and fit 10 multiple
regressions, each having 9 predictors.
- Among the 10 models, the model with Ethnicity removed has
the lowest BIC = 3723, so we remove Ethnicity.
Removed predictor   BIC
Education           3729
Ethnicity           3723
Age                 3733
...                 ...
Income              4268
Example: credit balance data
Step 2: Now there are 9 predictors left, so again we remove each
predictor separately and fit 9 multiple regressions, each having 8
predictors. Then pick the model with lowest BIC again.
- Among the 9 models, the model with Married removed has
the lowest BIC = 3718.14, so we remove Married.
Removed predictor   BIC
Married             3718.11
Education           3718.13
Gender              3719
Age                 3722
...                 ...
Income              4257
Example: credit balance data
The process continues. After removing 6 predictors (Ethnicity,
Married, Education, Gender, Age, Rating), the we have
BIC = 3705. Now, if we remove any of the remaining predictor,
the BIC will increase.
Removed predictor   BIC
Cards               3739
Limit               4792
Student             4096
Income              4242
so we stop there. The final model contains all the remaining predictors (Cards, Limit, Student, Income), a smaller model than the result from forward selection.
Bidirectional Search
• A disadvantage of both forward selection and backward elimination is that once a predictor is added to (dropped from) the model, it cannot be dropped (added back) in later steps.
• Therefore, the final model selected by forward selection or backward elimination is not guaranteed to have the best fit (under a given criterion).
• A solution is bidirectional search, where we consider both adding and removing predictors at each step; this is the default method in most software. A sketch using R's step() follows.
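With `step()`, bidirectional search is `direction = "both"`. A sketch, reusing the `credit`, `null`, `full`, and `n` objects from the forward-selection sketch above:

```r
# Bidirectional search with the BIC penalty: at each step, step()
# considers both adding and dropping predictors within the given scope.
step(null, scope = formula(full), direction = "both", k = log(n))
```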
Bidirectional search
For example, let's apply bidirectional search to the previous credit balance data example.
• Step 1: same as in forward selection. Rating is included in the model.
• Step 2: same as in forward selection. Income is included, because the model Balance ∼ Rating + Income has the lowest BIC once Rating has been added in step 1 (recall step 2 of forward selection above).
• Step 3: Now, instead of only considering what should be the
next predictor to add, we also consider whether Rating
can be dropped given Income is in the model as well.
Bidirectional search
Current model: Balance ∼ Rating + Income
Action   Predictor   BIC
Add      Student     3730
Add      Married     4092
...      ...         ...
Add      Ethnicity   4379
Remove   Rating      4910
The model with the lowest BIC is the one obtained by adding Student to the current model. OK, so let's add it, and the current model becomes Balance ∼ Rating + Income + Student.
Bidirectional search
Step 4: Similar to step 3, we not only consider whether a new predictor should be added, but also whether either Rating or Income can be removed.
Action   Predictor   BIC
Add      Limit       3717
Add      Age         3732
...      ...         ...
Add      Ethnicity   3740
Remove   Rating      4792
Remove   Income      4249
The model with the lowest BIC = 3717 is the one obtained by adding Limit to the current model. Again, we add it, and the current model becomes Balance ∼ Rating + Income + Student + Limit.
Bidirectional search
I skip step 5, where we not only consider whether a new predictor should be added, but also whether any of Rating, Income, or Student can be removed.
The result of this step is that we continue by adding Cards, and the current model becomes Balance ∼ Rating + Income + Student + Limit + Cards with BIC = 3706.
Now, we consider step 6.
Bidirectional search
Action   Predictor   BIC
Add      Age         3708
Add      Gender      3711
...      ...         ...
Add      Ethnicity   3717
Remove   Rating      3705
Remove   Limit       3735
Remove   Student     4095
Remove   Income      4248
At this step, removing Rating from the current model results in a new model with the lowest BIC = 3705, which is smaller than the BIC of the current model!
Bidirectional search
• The current model now is Balance ∼ Income + Student + Limit + Cards with BIC = 3705.
• Note that this model is not reachable by forward selection, since forward selection never considers removing Rating from the model.
Now we consider the next step to see whether a new predictor should be added or whether we should continue removing. Since Rating has just been removed, we do not consider it as a potential new predictor in the next step.
Bidirectional search
Step 7:
Action   Predictor   BIC
Add      Age         3707
Add      Gender      3710
...      ...         ...
Add      Ethnicity   3716
Remove   Limit       4792
Remove   Student     4095
Remove   Cards       3739
Remove   Income      4242
No improvement can be made, so we stop there with the final
model Balance ∼ Income + Student + Limit + Cards.
Notes about automated search procedure
• The final model from these procedures depends highly on the criterion used to evaluate a model.
- For example, if I use AIC instead of BIC for the bidirectional search in the previous example, the final model is Balance ∼ Rating + Income + Student + Limit + Cards + Age.
• In practice, these search procedures are often used for screening purposes; after running them, consider other criteria/aspects to evaluate and/or refine the model.
- Since these automated procedures do not account for
interaction or polynomial terms, one can also consider adding
them after these search procedures.
- All the hypothesis tests and diagnostics (residual plots, unusual observations) should be considered to select the final model.