Date: 2022-12-12

14. Evaluating Regression Models
Monash College Topic 14: Evaluating Regression Models MCD2080 182 / 250
Key Objectives for Topic 14
• Understand measures by which we judge the fit of a regression model.
• Construct dummy variables to include in a linear regression model.
• Interpret the estimated coefficients of dummy variables in a multiple
linear regression model.
14.1.1. R-Squared (R2)
R2 is closely related to the correlation coefficient.
• It is the square of the correlation between the actual Y values and
those predicted by the model, $\hat{Y}$:
$$R^2 = \left[\operatorname{Corr}(Y, \hat{Y})\right]^2$$
This is called ‘R Square’ in the Excel output. It will be between 0 and 1.
It measures the proportion of the total variation in Y that the model has
been able to explain.
• A value of R2 close to 1 indicates that the model has been able to
explain a large proportion of the variation in Y , and hence is a very
good model.
• A value of R2 close to zero indicates a poor model—not much of Y
has been explained.
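R2 can be thought of as the variation in Y that is explained by the model
over and above the mean $\bar{Y}$ (ESS = explained sum of squares), as a
proportion of the total variation in Y about $\bar{Y}$ (TSS = total sum of
squares). The variation left unexplained is the residual sum of squares
(RSS).
$$R^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n}(\hat{Y}_i - Y_i)^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$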
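Figure: R2 and Sums of Squares (Y plotted against X; for an observation
$(X_i, Y_i)$ the plot marks the fitted value $\hat{Y}_i$, the mean $\bar{Y}$,
and the contributions $RSS_i = (Y_i - \hat{Y}_i)^2$ and
$ESS_i = (\hat{Y}_i - \bar{Y})^2$)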
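The definitions of R2 above can be checked numerically. The following is a minimal sketch in plain Python, using a small made-up data set (not the course's income data) and a simple one-variable regression. The helper names `mean`, `fit_line` and `correlation` are introduced here for illustration only.

```python
# Illustrative data only (not the course's income data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]   # fitted values
y_bar = mean(y)

tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained sum of squares
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares

r2_from_corr = correlation(y, y_hat) ** 2
r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss
print(round(r2_from_corr, 4), round(r2_from_ess, 4), round(r2_from_rss, 4))
```

For this toy data all three calculations give 0.6, and TSS = ESS + RSS, which is why the two forms of the formula agree for a least-squares fit with an intercept.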
Consider our earlier results, where the R2 is 0.1871. Is this good or poor?
There is no absolute basis upon which to judge R2, but a value of 0.1871 is
probably not too bad in this case: although over 80% of the variation in
income is left unexplained, fully explaining income would be very difficult.
Figure: Regression Results—Excel Output Includes R2 and Titles it ‘R Square’
14.1.2. Standard Error
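The standard error is simply the estimated standard deviation of the error
term in the model. That is,
$$s = \sqrt{\frac{1}{n - k - 1}\sum_{i=1}^{n}\hat{e}_i^2}$$
where n is the number of observations, k is the number of explanatory
variables and $\hat{e}_i$ is the residual for observation i.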
In our example above the standard error was 32535.
• On average, the model’s predictions of the annual income (Y ) are in
error by $32,192, either above or below the actual annual income Y .
Is this a big or a small number?
To answer this question it is useful to compare it with the sample mean of
Y and/or with the sorts of values Y takes.
• In our data the average annual income was around $30,598.
• So to have a model that predicts income with an ‘average error’ of
$32,192 is not particularly accurate.
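As a rough illustration of the standard error formula and of the comparison with the sample mean of Y, here is a minimal sketch. The residuals and Y values are hypothetical, and `regression_standard_error` is a name made up for this example.

```python
import math

def regression_standard_error(residuals, k):
    """s = sqrt( sum(e_i^2) / (n - k - 1) ), where k is the number of explanatory variables."""
    n = len(residuals)
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - k - 1))

# Hypothetical Y values and residuals from a model with one explanatory variable (k = 1).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

s = regression_standard_error(residuals, k=1)
y_bar = sum(y) / len(y)

# Expressing s relative to the mean of Y helps judge whether it is "big" or "small".
print(round(s, 4), round(s / y_bar, 4))
```

Here s is about 0.89 against a mean of 4, so the typical error is roughly 22% of the average value of Y. The same ratio can be formed for any model once s and the sample mean are known.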
14.1.3. Error/Residual Plots
The aim of a regression model is to explain patterns in Y .
What we would like to see then is all the pattern gone from the errors.
• If some pattern can be detected in the unexplained part of the model
(the errors) then it probably indicates we don’t have a great model.
We can examine this by looking at residual plots (residual = estimated
error).
Excel can produce residual plots as part of the regression function tool. At
this stage, we look to the residual plots for two things:
• Evidence that the use of a linear model may not have been
appropriate.
• Evidence that there may be an important variable left out of our
model.
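To see what a leftover pattern looks like, the sketch below fits a straight line to data that are actually curved and lists the residuals in order of X. The data are synthetic (Y = X squared) and chosen only so that the pattern is easy to see.

```python
# Synthetic data with a curved (quadratic) relationship, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares straight line fitted to the curved data.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residual = actual Y minus the value predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

for xi, ri in zip(x, residuals):
    print(f"X = {xi:.0f}  residual = {ri:+.1f}")
```

The residuals are positive at both ends and negative in the middle. That U-shape is the kind of pattern that suggests a linear model was not appropriate, even though the residuals still sum to zero.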
What we are looking for is residuals like those in the left figure. In the
right figure there is still a significant pattern which is unexplained.
Figure: Random Residuals (residuals plotted against X)
Figure: Patterned Residuals (residuals plotted against X)
14.2.1. When the Relationship between X and Y is Not
Linear
So far we have considered linear models. But sometimes the relationship
between the Y and X variables is non-linear.
Consider the infant mortality rate against income per capita across
countries.
Figure: Infant Mortality and Income Per Capita Across Countries (child
deaths per 1000 plotted against income per capita, $)
It is obvious here that income per capita does not have a linear
relationship with infant mortality rates.
We want a way of capturing this non-linear relationship.
One possibility is to add a quadratic term for income per capita into the
regression model. That is, we could estimate the model,
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i1}^2 + e_i$$
Here $X_{i1}$ is income per capita for country i. This model allows for
income to influence Y in a non-linear (quadratic) way.
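Estimating a quadratic model only requires adding a squared column to the regression. The sketch below assumes NumPy is available and uses noise-free synthetic data generated from a known quadratic, so the estimated coefficients can be checked against the true ones. It is not the course's infant mortality data.

```python
import numpy as np

# Synthetic, noise-free data generated from a known quadratic (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 - 3.0 * x + 0.5 * x ** 2

# Design matrix: a column of ones (intercept), X, and X squared.
design = np.column_stack([np.ones_like(x), x, x ** 2])

# Ordinary least squares via NumPy's least-squares solver.
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coefficients
print(np.round(coefficients, 6))
```

In Excel the equivalent step is to create a new column containing the squared values of X and include it as an additional explanatory variable.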
Let's estimate the model and see if the quadratic term helps.
Examining the p-value for Income (per capita) Squared, we see that the
venture into quadratic models has been successful. This variable is clearly
significant.
Figure: Regression Results for Infant Mortality and Income Per Capita Across Countries

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.672944375
R Square            0.452854132
Adjusted R Square   0.446565099
Standard Error      30.59758107
Observations        177

ANOVA
             df    SS            MS            F            Significance F
Regression   2     134827.5513   67413.77563   72.0069578   1.64048E-23
Residual     174   162900.8823   936.2119674
Total        176   297728.4336

                 Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept        75.14918072    3.859149294      19.47299133   1.44257E-45   67.5324108     82.76595065
Income           -0.004288581   0.000433011      -9.90408191   1.27639E-18   -0.005143212   -0.00343395
Income Squared   5.44695E-08    7.93144E-09      6.867538082   1.11098E-10   3.88152E-08    7.01237E-08
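The estimated coefficients can be used to compute predictions and to see where the fitted curve stops falling. The sketch below plugs the three coefficients from the output into the quadratic model. The function name `predicted_mortality` and the income levels chosen are illustrative only.

```python
# Coefficients taken from the regression output (Intercept, Income, Income Squared).
b0, b1, b2 = 75.14918072, -0.004288581, 5.44695e-08

def predicted_mortality(income):
    """Predicted child deaths per 1000 at a given income per capita ($)."""
    return b0 + b1 * income + b2 * income ** 2

# A quadratic b0 + b1*X + b2*X^2 has its turning point at X = -b1 / (2*b2).
turning_point = -b1 / (2 * b2)

for income in (1000, 10000, 30000):
    print(income, round(predicted_mortality(income), 2))
print("turning point:", round(turning_point))
```

At an income per capita of $10,000 the model predicts about 37.7 deaths per 1000. The fitted curve reaches its minimum at roughly $39,400, and beyond that point the quadratic form itself forces predictions to rise, so predictions at very high incomes should be treated with caution.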
14.2.2. Categorical Variables with Two Categories
We have already created a dummy variable for Female and included it in
our regression model.
We can extend our regression model of Income to include Female as well
as Education and Age.
Figure: Regression Results—Income on Education, Age and Female Dummy
Suppose that, instead of a dummy variable for female, we included one for
male. What would the results look like in this case?
Figure: Regression Results—Income on Education, Age and Male Dummy
What has happened?
• The dummy variable for Male has the coefficient 6,205. Remember
the coefficient when we included Female was −6,205. This is not a
coincidence!
• The intercept has fallen by 6,205.
• Both models are telling us exactly the same thing: male incomes are
estimated to be higher by $6,205 on average, for the same education
and age.
You will also notice that both models have the same value for R2 and the
Standard Error, and the coefficients on all other variables are the same (as
are the t-statistics, p-values, etc.).
• These two models are functionally equivalent.
• Whether you include a dummy for female, and compare incomes
relative to male, or include a dummy for male, and compare relative
to female incomes, is not an important choice.
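The equivalence of the two models can be verified on any data set. The sketch below assumes NumPy is available and uses a small made-up sample with one other explanatory variable (education). The data values and the helper `ols` are hypothetical and only serve to illustrate the point.

```python
import numpy as np

# Made-up data for illustration (not the course data set).
education = np.array([10.0, 12.0, 12.0, 14.0, 16.0, 16.0, 18.0, 11.0])
female = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
income = np.array([28.0, 40.0, 33.0, 47.0, 45.0, 55.0, 50.0, 36.0])  # in $000s
male = 1.0 - female  # the male dummy is the mirror image of the female dummy

def ols(y, *columns):
    """Least-squares coefficients (intercept first) and residuals."""
    X = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

coef_f, resid_f = ols(income, education, female)  # base category: male
coef_m, resid_m = ols(income, education, male)    # base category: female

print("female model:", np.round(coef_f, 4))
print("male model:  ", np.round(coef_m, 4))
```

The dummy coefficient changes sign, the intercept shifts by the size of the dummy coefficient, and the education coefficient and residuals are unchanged. Identical residuals mean identical R2 and standard error.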
14.2.2. Categorical Variables with Multiple Categories
Now suppose we have a categorical variable with more than two
categories.
• We have three broad mutually exclusive and exhaustive occupation
types—managerial, clerical and labour. What would we do in this
case?
• We create dummy variables for two of the categories and include them
in the model. The omitted category serves as the base for comparison.
We estimate a model where our ‘base’ person is male and works as a
manager (i.e. we dropped the male dummy variable [only included female]
and the dummy variable for manager [included clerical and labourer]).
How do we interpret the coefficients?
• $\hat{\beta}_4 = -9280$: if there were two people of the same gender and
same level of education and age, but one worked as a clerk and the other
worked as a manager, the clerk is estimated to earn $9,280 less than
the manager, on average.
• $\hat{\beta}_5 = -5532$: if there were two people of the same gender and
same level of education and age, but one worked as a manager and the
other worked as a labourer, the labourer is estimated to earn $5,532
less than the manager, on average.
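The construction of the occupation dummies, and the comparisons they imply, can be sketched as follows. The two coefficients are the ones quoted above. The function names and the category labels used as dictionary keys are introduced here for illustration.

```python
BASE = "managerial"
CATEGORIES = ["managerial", "clerical", "labour"]

# Estimated coefficients on the two included dummies (beta_4 and beta_5 in the text).
OCCUPATION_COEFFICIENTS = {"clerical": -9280, "labour": -5532}

def occupation_dummies(occupation):
    """One 0/1 dummy for every category except the base category."""
    if occupation not in CATEGORIES:
        raise ValueError(f"unknown occupation: {occupation}")
    return {c: int(occupation == c) for c in CATEGORIES if c != BASE}

def gap_vs_manager(occupation):
    """Estimated income difference relative to a manager, other variables equal."""
    dummies = occupation_dummies(occupation)
    return sum(OCCUPATION_COEFFICIENTS[name] * value for name, value in dummies.items())

for occupation in CATEGORIES:
    print(occupation, occupation_dummies(occupation), gap_vs_manager(occupation))

# Two non-base categories are compared by differencing their coefficients.
clerk_vs_labourer = gap_vs_manager("clerical") - gap_vs_manager("labour")
print("clerk vs labourer:", clerk_vs_labourer)
```

A manager has both dummies equal to zero, so the gap is zero by construction. The difference between the two dummy coefficients implies that a clerk is estimated to earn $3,748 less than a labourer with the same gender, education and age.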
Figure: Regression Results—Income on Education, Age, Female Dummy and
Occupation Dummy Variables
14.3. Presenting Regression Results
Excel tends to produce far more results than are necessary for
presentational purposes.
In a written presentation, and particularly in a verbal presentation, you do
not want to over-burden the readers/listeners with details.
But you do not want to skim over important bits either.
Here are a couple of guidelines which are useful in presenting regression
results.
1. When presenting regression results, most interest is usually in the
coefficients. But the coefficients should never be presented in
isolation. You must also present some measure of their accuracy.
2. Be aware of how many decimal places you are reporting. Choose a
number of decimal places such that the smallest possible change in a
number will be of significance to the reader/listener.
3. It is usually important to include some information about the model
such as the number of observations, the R2 and the model’s standard
error.
4. The final point is that some flexibility is possible in the presentation
of results and you may want to adjust the rules above to emphasize
the point you are trying to make.
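One possible way to apply these guidelines is sketched below. It uses the numbers from the infant mortality regression output in Section 14.2.1 and reports each coefficient with its standard error in brackets, rounded to three significant figures, followed by a summary line. The layout and the choice of three significant figures are assumptions made for this example, not a prescribed format.

```python
# Values copied from the infant mortality regression output in Section 14.2.1.
rows = [
    ("Intercept", 75.14918072, 3.859149294),
    ("Income", -0.004288581, 0.000433011),
    ("Income Squared", 5.44695e-08, 7.93144e-09),
]
n_obs, r_squared, std_error = 177, 0.452854132, 30.59758107

def format_row(name, coefficient, standard_error, sig_figs=3):
    """Coefficient with its standard error in brackets, both to sig_figs significant figures."""
    return f"{name}: {coefficient:.{sig_figs}g} ({standard_error:.{sig_figs}g})"

lines = [format_row(*row) for row in rows]
lines.append(f"Observations: {n_obs}, R2: {r_squared:.3f}, Standard error: {std_error:.1f}")
print("\n".join(lines))
```

Rounding by significant figures rather than by a fixed number of decimal places keeps very small coefficients, such as the one on Income Squared, readable alongside large ones.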