ECONOMETRICS
LECTURE 5B
Trang Le
LECTURE SLIDES #7A TOPICS
Adjusted R-Squared
Qualitative Information
Dummy Variable and Multiple Groups
Key references: 6.3, 7.1, 7.2 and 7.3
MORE ON GOODNESS OF FIT
General remarks on R-squared
High R-squared does not imply there is a causal interpretation
Low R-squared does not preclude precise estimation of
marginal effects
R-squared never decreases (and usually increases) when we
add an extra variable
How can we construct a version of R-squared that takes this
fact into account?
MORE ON GOODNESS OF FIT...
Adjusted R-squared accounts for degrees of freedom
$\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}$
Adjusted R-squared imposes a penalty for adding new
regressors
Adjusted R-squared may increase or decrease when a variable
is added
Potentially useful in comparing models with alternative
numbers of regressors
Adjusted R-squared may be negative
$\bar{R}^2 = 1 - (1 - R^2)(n-1)/(n-k-1)$
ADJUSTED R-SQUARED IN STATA
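. reg lsalary lsales

      Source |       SS           df       MS      Number of obs   =       209
-------------+----------------------------------   F(1, 207)       =     55.30
       Model |  14.0661688         1  14.0661688   Prob > F        =    0.0000
    Residual |  52.6559944       207  .254376785   R-squared       =    0.2108
-------------+----------------------------------   Adj R-squared   =    0.2070
       Total |  66.7221632       208  .320779631   Root MSE        =    .50436

------------------------------------------------------------------------------
     lsalary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lsales |   .2566717   .0345167     7.44   0.000     .1886224    .3247209
       _cons |   4.821997   .2883396    16.72   0.000     4.253538    5.390455
------------------------------------------------------------------------------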
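The R-squared and Adj R-squared entries in the Stata output can be reproduced by hand from the sums of squares. The sketch below is only a check of the two adjusted R-squared formulas; the function names are illustrative and the inputs are the SS, n and k values read from the output above.

```python
def adjusted_r2_from_ss(ssr, sst, n, k):
    """Adjusted R-squared from sums of squares: 1 - [SSR/(n-k-1)] / [SST/(n-1)]."""
    return 1.0 - (ssr / (n - k - 1)) / (sst / (n - 1))


def adjusted_r2_from_r2(r2, n, k):
    """Equivalent form: 1 - (1 - R^2)(n-1)/(n-k-1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values read from the Stata output for: reg lsalary lsales
ssr, sst, n, k = 52.6559944, 66.7221632, 209, 1
r2 = 1.0 - ssr / sst

adj_a = adjusted_r2_from_ss(ssr, sst, n, k)
adj_b = adjusted_r2_from_r2(r2, n, k)

print(round(r2, 4))     # matches "R-squared" in the output
print(round(adj_a, 4))  # matches "Adj R-squared" in the output
print(round(adj_b, 4))  # same number from the second formula
```

Both forms give the same value because SSR/SST = 1 - R-squared.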
USING FIT TO CHOOSE BETWEEN MODELS
If models are nested, one is a special case of the other
$y = \beta_0 + \beta_1 x + u$
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$
First model is nested within second: it is the special case $\beta_2 = 0$
Could choose between models on basis of test of $H_0: \beta_2 = 0$
Could also decide on basis of fit using $\bar{R}^2$
Implicitly selecting a specific critical value
$\bar{R}^2$ increases in moving from first model to second iff t-statistic for
estimate of $\beta_2$ is greater than one in absolute value
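The "greater than one in absolute value" rule follows from a few lines of algebra. The derivation below is a sketch under the usual notation, which is assumed here rather than taken from the slide: subscript r marks the restricted (first) model with k-1 regressors, subscript ur marks the unrestricted (second) model with k regressors, and both share the same SST because the dependent variable is the same.

```latex
% Restricted model: k-1 regressors, residual df = n-k
% Unrestricted model: k regressors, residual df = n-k-1
\begin{align*}
\bar{R}^2_{ur} > \bar{R}^2_{r}
&\iff \frac{SSR_{ur}}{n-k-1} < \frac{SSR_{r}}{n-k} \\
&\iff (n-k)\,SSR_{ur} < (n-k-1)\,SSR_{r} \\
&\iff SSR_{ur} < (n-k-1)\,(SSR_{r} - SSR_{ur}) \\
&\iff F = \frac{SSR_{r} - SSR_{ur}}{SSR_{ur}/(n-k-1)} > 1 \\
&\iff t^2 > 1 \iff |t| > 1
\end{align*}
```

The last step uses the fact that the F statistic for a single exclusion restriction equals the square of the corresponding t statistic. Choosing by adjusted R-squared therefore amounts to a test with critical value 1 rather than roughly 2.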
USING FIT TO CHOOSE BETWEEN MODELS...
Models are nonnested if neither model is a special case of the other
$y = \beta_0 + \beta_1 \log(x) + u$
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$
Can't impose restrictions on $\beta_1$ & $\beta_2$ to move to log model
Testing option not available but can compare fit
Using RDCHEM data, log model has $R^2 = .061$ while quadratic model
yields $R^2 = .148$, but comparison is unfair to first model, which has one fewer parameter
$\bar{R}^2 = .030$ for log & $\bar{R}^2 = .090$ for quadratic model
Even after adjusting for difference in degrees of freedom, quadratic
model is preferred
USING FIT TO CHOOSE BETWEEN MODELS...
Models with different dependent variables will typically
be non-nested
Here neither R-squared nor adjusted R-squared should be
used for comparison
Continuing previous example, what if comparison is between
$\log(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + u$
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$
Now not possible to compare fit
Comparing how well variation in $\log(y)$ is explained
versus how well variation in $y$ is explained
Extent of variation in these two could be very different
QUALITATIVE INFORMATION
Thus far variables have been quantitative – number of
bedrooms, years of education, hourly wage, …
Many features likely to appear in analyses are
qualitative
Gender of individual, their occupation, whether they are
employed or not, …
Industry classification of firm, its credit rating, whether or not it
paid a dividend last quarter, …
One way to incorporate qualitative information is to use
dummy (binary, indicator) variables
Equals 1 or 0 representing presence or absence of feature
May appear as dependent or as independent variables
DUMMY EXPLANATORY VARIABLE
Single dummy independent variable
$wage = \beta_0 + \delta_0\, female + u$
$female = 1$ if person is a woman & $female = 0$ otherwise
Choice of which group is assigned the value 1 is arbitrary
Using zero/one also arbitrary but useful for
interpretation
In our example, being a woman is chosen as the category for which the
dummy variable equals 1. By using the binary variable female,
we have chosen males to be the base/benchmark group.
DUMMY VARIABLE
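$wage = \beta_0 + \delta_0\, female + u$
Specified model is regression representation of
conditional means:
$E(wage \mid female = 0) = \beta_0$
$E(wage \mid female = 1) = \beta_0 + \delta_0$
$E(wage \mid female = 1) - E(wage \mid female = 0) = \delta_0$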
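A small numerical check can make the conditional-means interpretation concrete. The sketch below uses six invented wage observations (not taken from any data set in these slides) and a hand-written simple regression; the names `simple_ols`, `wage` and `female` are illustrative.

```python
# Hypothetical hourly wages, invented purely for illustration
wage = [10.0, 12.0, 14.0, 8.0, 9.0, 10.0]
female = [0, 0, 0, 1, 1, 1]


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a constant and x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


beta0_hat, delta0_hat = simple_ols(female, wage)
mean_men = mean([w for w, f in zip(wage, female) if f == 0])
mean_women = mean([w for w, f in zip(wage, female) if f == 1])

print(beta0_hat, mean_men)                  # intercept vs. mean wage of base group
print(delta0_hat, mean_women - mean_men)    # dummy coefficient vs. difference in means
```

With a single dummy regressor the OLS intercept equals the sample mean of the base group and the dummy coefficient equals the difference in group means.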
DUMMY EXPLANATORY VARIABLE
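Have relied on the zero conditional mean (ZCM) assumption
To better estimate gender effect need to control for other
factors
$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + u$
$E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ) = \delta_0$
$\delta_0$ represents difference in mean wage between men & women
with the same education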
DUMMY EXPLANATORY VARIABLE
Graphically, model specifies an intercept shift according to gender
Implication of this particular model: difference does not depend on level of education
Data determine this difference
DUMMY VARIABLE TRAP
What happened to the male dummy?
Why can’t we estimate
$wage = \beta_0 + \delta_0\, female + \gamma_0\, male + \beta_1\, educ + u$ ?
Answer: There is a perfect multicollinearity problem (MLR.3 not
satisfied)
$female + male = 1$
Male & female dummy variables are perfectly collinear with
the intercept
An example of the dummy variable trap
More later when we talk about multiple groups
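The perfect collinearity can be seen directly in the cross-product matrix X'X, which becomes singular when the constant, the female dummy and the male dummy are all included. The sketch below uses five invented observations; the variable names and the helper `det3` are illustrative only.

```python
# Hypothetical sample of five people (illustrative values only)
female = [1, 0, 1, 0, 0]
male = [1 - f for f in female]
const = [1] * len(female)


def cross_products(columns):
    """X'X for a list of regressor columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# The constant equals female + male in every row
collinear = all(c == f + m for c, f, m in zip(const, female, male))

xtx_trap = cross_products([const, female, male])   # all three regressors
xtx_ok = cross_products([const, female])           # male dummy dropped
det_trap = det3(xtx_trap)
det_ok = xtx_ok[0][0] * xtx_ok[1][1] - xtx_ok[0][1] * xtx_ok[1][0]

print(collinear, det_trap, det_ok)
```

A zero determinant means X'X cannot be inverted, so the OLS estimates are not uniquely defined; dropping one dummy restores a nonzero determinant.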
DUMMY VARIABLE TRAP
Solution was to drop male dummy
$wage = \beta_0 + \delta_0\, female + u$
males chosen as base group
Alternatively could make females the base
$wage = \alpha_0 + \gamma_0\, male + u$
Choice of base is arbitrary as can always recover estimates for
one specification from the other: $\alpha_0 = \beta_0 + \delta_0$; $\alpha_0 + \gamma_0 = \beta_0$
These are just alternative reparameterizations of the same
model
Can also drop the intercept although not advisable
$wage = \beta_0\, male + \alpha_0\, female + u$
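The mapping between the two parameterizations is simple arithmetic. As a sketch, the function below (an illustrative name, not part of any library) converts male-base estimates into female-base estimates, applied to the comparison-of-means estimates 7.10 and -2.51 reported in Example 7.1.

```python
def switch_base(beta0, delta0):
    """Map (male-base intercept, female coefficient) to
    (female-base intercept, male coefficient)."""
    alpha0 = beta0 + delta0   # mean for women becomes the new intercept
    gamma0 = beta0 - alpha0   # male coefficient, so that alpha0 + gamma0 = beta0
    return alpha0, gamma0


# Male-base estimates from the comparison-of-means regression in Example 7.1
alpha0, gamma0 = switch_base(7.10, -2.51)
print(round(alpha0, 2), round(gamma0, 2))
```

The male coefficient in the female-base model is just the female coefficient with its sign reversed, and the fitted group means are unchanged.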
EXAMPLE 7.1
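Incorporating gender into our wage equation
$\widehat{wage} = \underset{(.72)}{-1.57} - \underset{(.26)}{1.81}\, female + \underset{(.049)}{.572}\, educ + \underset{(.012)}{.025}\, exper + \underset{(.021)}{.141}\, tenure$
$n = 526$, $R^2 = .364$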
Holding education, experience & tenure fixed, women earn
$1.81 less per hour compared to men
Does that imply wage discrimination against women?
Depends on how good the controls are
Being female may be correlated with other productivity
characteristics not controlled for
EXAMPLE 7.1
Comparing means
$\widehat{wage} = \underset{(.35)}{7.10} - \underset{(.30)}{2.51}\, female$
$n = 526$, $R^2 = .116$
$7.10 is estimated mean hourly wage for men (base group)
Women earn $2.51 less per hour (not controlling for anything)
Is this difference in mean wages significant?
Have 2 estimates of the gender effect (previously $1.81)
Some but not all of the difference in male & female wages is
explained by differences in education, experience & tenure
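The significance question can be answered with the usual t ratio, estimate divided by standard error. The sketch below applies it to the two reported female coefficients; the comparison value 1.96 is an assumption (the large-sample 5% two-sided critical value), not a number from the slides.

```python
def t_stat(estimate, std_err):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_err


CRITICAL_5PCT = 1.96  # assumed large-sample two-sided 5% critical value

t_no_controls = t_stat(-2.51, 0.30)    # comparison of means
t_with_controls = t_stat(-1.81, 0.26)  # controlling for educ, exper, tenure

for label, t in [("no controls", t_no_controls), ("with controls", t_with_controls)]:
    print(label, round(t, 2), abs(t) > CRITICAL_5PCT)
```

Both t statistics are far beyond the critical value in absolute terms, so the gender difference is statistically significant with and without the controls.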
EXAMPLE 7.1
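What if dependent variable is in logs?
$\widehat{\log(wage)} = \underset{(.10)}{.50} - \underset{(.04)}{.30}\, female + \underset{(.007)}{.087}\, educ + \underset{(.0016)}{.0046}\, exper + \underset{(.003)}{.017}\, tenure$
$n = 526$, $R^2 = .392$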
Effect of gender? (Recall previous lecture)
As dummy changes from 0 to 1 (males to females) change in
wage is approximately $100 \times (-.30) = -30$ percent
Large change so approximation may be poor
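For a coefficient this large, the exact percentage change implied by a log-level model, 100[exp(coefficient) - 1], can be compared with the 100 x coefficient approximation. The sketch below does this for the female coefficient of -.30; the function names are illustrative.

```python
import math


def approx_pct_change(coef):
    """Approximate percentage change: 100 * coefficient."""
    return 100.0 * coef


def exact_pct_change(coef):
    """Exact percentage change implied by a log dependent variable."""
    return 100.0 * (math.exp(coef) - 1.0)


coef_female = -0.30
approx = approx_pct_change(coef_female)
exact = exact_pct_change(coef_female)
print(round(approx, 1), round(exact, 1))
```

The exact figure is roughly a 26 percent lower wage, about four percentage points smaller in magnitude than the approximation.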
PROGRAM EVALUATION
Useful & important application of a dummy explanatory
variable occurs in policy analysis
Governments or firms are interested in costs & benefits of
alternative policies
Program evaluation involves measuring effect of a specific
program (or treatment or intervention)
E.g. a training program that potentially improves worker
employability
To evaluate a program consider comparison between
Control group that does not participate in the program
Treatment group that does participate
Natural model is outcome (hours employed) depending on a
treatment dummy (attended training) plus controls (education)
PROGRAM EVALUATION
Experimental evaluation
In randomized experiments assignment to treatment is random
Here causal effects can be inferred using a simple differences
in means regression
$outcome = \beta_0 + \delta_0\, treatment + u$
Self-selection into treatment as a source of endogeneity
When treatment status is not randomly assigned, it is likely
related to characteristics that also influence the outcome
Subjects self-select into treatment depending on
their individual characteristics & prospects
PROGRAM EVALUATION - EXAMPLE
In Week 1 we wanted to assess the effectiveness of a
training program on wages
Option 1: Random treatment
Option 2: People can decide whether to train or not
$\log(wage) = \beta_0 + \delta_0\, train + u$
Now we know:
1. Option 1: $Cov(train, u) = 0$ → $\delta_0$ is the causal effect
2. Option 2: $Cov(train, u) \neq 0$ because of selection into
treatment. Biased estimates
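The contrast between the two options can be illustrated with a small simulation. Everything below is an assumption made for illustration: the data are simulated, the true treatment effect is set to 1.0, and "ability" stands in for an unobserved characteristic that sits in the error term and drives self-selection.

```python
import random

random.seed(0)
TRUE_EFFECT = 1.0   # hypothetical treatment effect
N = 20000           # simulated sample size


def diff_in_means(y, d):
    """Difference in mean outcome, treated minus control."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(self_selected):
    y, d = [], []
    for _ in range(N):
        ability = random.gauss(0.0, 1.0)   # unobserved, part of the error term
        if self_selected:
            treat = 1 if ability > 0 else 0              # high-ability people opt in
        else:
            treat = 1 if random.random() < 0.5 else 0    # coin-flip assignment
        noise = random.gauss(0.0, 0.5)
        y.append(2.0 + TRUE_EFFECT * treat + ability + noise)
        d.append(treat)
    return diff_in_means(y, d)


est_random = simulate(self_selected=False)
est_selected = simulate(self_selected=True)
print(round(est_random, 2), round(est_selected, 2))
```

Under random assignment the difference in means lands close to the true effect, while under self-selection it also picks up the ability gap between the groups and overstates the effect.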
DUMMY VARIABLES: MULTIPLE GROUPS
Use of dummy variables easily extends to case of multiple
groups
Examples include occupation, industry, region, ..., where
groups are mutually exclusive & exhaustive
Define membership in each category by a dummy
variable: Group 1, Group 2, ..., Group S are all
dummies
In regression model avoid the dummy variable trap by
leaving out one category (becomes base category)
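Building the dummies and leaving out the base category can be sketched in a few lines. The function `make_dummies` and the five region observations below are hypothetical, chosen only to show the construction.

```python
def make_dummies(categories, base):
    """Return {group: list of 0/1} for every group except the base category."""
    groups = sorted(set(categories))
    if base not in groups:
        raise ValueError("base category must appear in the data")
    return {g: [1 if c == g else 0 for c in categories]
            for g in groups if g != base}


# Hypothetical observations of a four-category variable
region = ["east", "south", "west", "northcen", "south"]
dummies = make_dummies(region, base="east")

for name, column in dummies.items():
    print(name, column)
```

Observations in the base category have zeros in every dummy column, so their mean is absorbed by the intercept and the trap is avoided.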
DUMMY VARIABLES: MULTIPLE GROUPS
US divided into north central, south, west & east
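$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + \beta_2\, exper + \beta_3\, tenure + \beta_4\, northcen + \beta_5\, south + \beta_6\, west + u$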
East is what is called the reference group
Essential to drop one of the groups
Arbitrary which one is dropped
Important to know which group is dropped for interpretation
Once one group is dropped, it is easy to work out what the estimation
results would have been with a different reference group
DUMMY VARIABLES: MULTIPLE GROUPS
Dependent variable: wage

Explanatory variables    Estimate (se)
female                   -1.86 (.26)
educ                      .566 (.049)
exper                     .027 (.011)
tenure                    .139 (.021)
northcen                 -.621 (.372)
south                    -.572 (.348)
west                      .571 (.413)
constant                 -1.23 (.77)
n                         526
R-squared                 .378

Other things equal, hourly wages compared to the east are lower in northcen ($.62 less) & south ($.57 less) but higher in west ($.57 more)
None of these individual differences are significant at the 5% level
What about joint significance?
DUMMY VARIABLES: MULTIPLE GROUPS
Test joint null
$H_0: \beta_4 = \beta_5 = \beta_6 = 0$; $H_1$: $H_0$ is not true
Need F statistic for joint test of linear restrictions
See Slides #4
$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} = \frac{(4557.3 - 4453.03)/3}{4453.03/(526 - 7 - 1)} \approx 4.04$
5% critical value is 2.6, so there is sufficient evidence to reject $H_0$
Conclude that regional effects are jointly significant
Other estimates do not change much between models
Omission of regional effects does not seem to be source of
omitted variable bias
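The F statistic can be checked numerically from the two sums of squared residuals. The sketch below uses the values quoted on the slide; the function name and argument names are illustrative.

```python
def f_statistic(ssr_r, ssr_ur, q, df_ur):
    """F statistic for q exclusion restrictions.

    ssr_r  : SSR of the restricted model (regional dummies omitted)
    ssr_ur : SSR of the unrestricted model
    df_ur  : residual degrees of freedom of the unrestricted model, n - k - 1
    """
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


F = f_statistic(4557.3, 4453.03, q=3, df_ur=526 - 7 - 1)
critical_5pct = 2.6  # critical value quoted on the slide

print(round(F, 2), F > critical_5pct)
```

The computed value exceeds the quoted critical value, which is the basis for rejecting the joint null even though no single regional dummy is individually significant.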
DUMMY VARIABLES: MULTIPLE GROUPS
With East as the reference group we obtained the following estimates
Northcen → −.621
South → −.572
West → .571
What if the reference group was South?
Treat East in the previous regression as having a coefficient equal to
zero, then subtract the South estimate (−.572) from every region:
East → 0 + .572 = +.572
Northcen → −.621 + .572 = −.049
West → .571 + .572 = 1.143
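The same re-basing works for any choice of reference group: give the old base a coefficient of zero and subtract the new base's coefficient from every group. The sketch below applies this to the regional estimates; `change_base` is an illustrative helper, not a library function.

```python
def change_base(coefs, new_base):
    """Re-express group coefficients relative to a different reference group.

    coefs must include the current base group with coefficient 0.0.
    """
    shift = coefs[new_base]
    return {group: value - shift for group, value in coefs.items()}


# Estimates with East as the reference group (East itself set to zero)
east_base = {"east": 0.0, "northcen": -0.621, "south": -0.572, "west": 0.571}
south_base = change_base(east_base, "south")

for group, value in south_base.items():
    print(group, round(value, 3))
```

Only the point estimates are recovered this way; the standard errors of the re-based coefficients require the covariances between the original estimates or a re-run of the regression.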