Endogeneity Issues
ECO00005M
Applied Microeconometrics
Professor Cheti Nicoletti
cheti.nicoletti@york.ac.uk
1
What are we going to learn about endogeneity
❖Definition of the endogeneity issue.
❖Consequences of this issue for the OLS estimation.
❖Potential causes of endogeneity.
❖Methods to solve endogeneity issues.
❖ Instrumental variable estimation (two-stage least
squares, 2SLS) and the conditions that the instruments
should satisfy.
❖Formulas for the computation of the instrumental variable
and two stage least squares estimators.
❖How to choose between OLS and 2SLS estimation.
2
References for endogeneity
❖BASIC STARTING KNOWLEDGE
Wooldridge J.M. Introductory Econometrics: A Modern
Approach, Sixth Edition,
Chapter 15 Instrumental Variables Estimation and Two Stage Least Squares
❖MORE ADVANCED KNOWLEDGE
Wooldridge J.M. Econometric Analysis of Cross Section and
Panel Data, Second Edition,
Chapter 5 Instrumental Variables Estimation of Single-Equation Linear Models,
Chapter 6 Additional Single-Equation Topics *
3
Definition of endogeneity
❖Suppose we have a linear regression model:
y = β0 + β1x1 + ⋯ + βk xk + u
❖Definition: Exogeneity and Endogeneity of Independent
Variables.
• xj is exogenous if it is uncorrelated with u.
• xj is endogenous if it is correlated with u.
❖OLS (ordinary least squares) estimation of the linear
regression model requires exogeneity for consistency.
❖Homework: Assuming that u is identically and
independently distributed as N(0,1) but correlated with
one of the explanatory variables, show why endogeneity
implies that the OLS estimation is biased.
4
Causes of endogeneity
◼ Endogeneity can be caused by many things.
▪ An important variable that is not observed and omitted
▪ Functional form specification
▪ Reverse causality
▪ Simultaneity
▪ Measurement error in the regressors
▪ ...
◼ Endogeneity is present in most applications in applied
economic research.
5
Omitted Variables
❖Let us begin with the case where the true regression
model has only two explanatory variables (k = 2):
y = β0 + β1x1 + β2x2 + u, where E(u|x1, x2) = 0
❖But we omit the variable x2 and estimate the model
y = β0 + β1x1 + v, where v = β2x2 + u
❖Now, if cov(x1, x2) ≠ 0 and β2 ≠ 0, we do not have
E(v|x1) = 0 and we would have an omitted variable bias.
❖Solution: Instrumental Variable, Proxy Variable
6
Consequence of the omission of a relevant variable
What happens if we omit x2, i.e. a variable that actually
belongs to the true model?
True model: y = β0 + β1x1 + β2x2 + u
❖Consider the linear projection of x2 on x1:
x2 = δ0 + δ1x1 + r
❖Then by definition E(r) = 0 and cov(r, x1) = 0.
❖Let us plug the equation for x2 into the true model:
y = β0 + β1x1 + β2(δ0 + δ1x1 + r) + u
y = (β0 + β2δ0) + (β1 + β2δ1)x1 + β2r + u
y = α0 + α1x1 + e, so plim α̂1 = β1 + β2δ1
❖Asymptotic bias: β2δ1.
7
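The asymptotic bias above can be checked numerically. The following is an illustrative simulation (not from the slides; all parameter values are made up): the short regression of y on x1 alone yields a slope near β1 + β2δ1.

```python
# Simulated omitted-variable bias: true model y = b0 + b1*x1 + b2*x2 + u,
# with x2 = d0 + d1*x1 + r, so the short regression slope -> b1 + b2*d1.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
b0, b1, b2 = 1.0, 2.0, 3.0       # structural parameters (assumed values)
d0, d1 = 0.5, 0.8                # linear projection of x2 on x1

x1 = rng.normal(size=n)
x2 = d0 + d1 * x1 + rng.normal(size=n)   # x2 correlated with x1
u = rng.normal(size=n)
y = b0 + b1 * x1 + b2 * x2 + u

# Short regression of y on x1 only (x2 omitted)
slope_short = np.cov(x1, y)[0, 1] / np.var(x1)
print(slope_short)   # close to b1 + b2*d1 = 4.4, not to b1 = 2.0
```

With d1 > 0 and b2 > 0 the bias is upward, matching the sign rule β2δ1.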
Omission of a relevant variable when the true
regression model has k independent variables
y = β0 + β1x1 + ⋯ + βk xk + u
• If the true parameter for the omitted variable is zero, then the
estimates are still unbiased.
• If the omitted variable is uncorrelated with all other independent
variables, the estimators are still unbiased.
• If the omitted variable is correlated with at least one independent
variable, this can cause a bias for all estimates.
Homework: Compute the asymptotic bias caused by the omission
of xk.
Notice that
▪ If we include irrelevant variables in the model, the estimators are
still unbiased, but the variance of the estimates increases.
8
Using a Proxy Variable
for Unobserved Explanatory Variables
❖A more difficult problem arises when a model excludes a
key variable, usually because of data unavailability.
❖Example: Return to Education
• The population (true) model is:
log(wage) = β0 + β1educ + β2exper + β3abil + u
• Suppose we do not observe ability (abil). Ignoring abil would
generally give biased and inconsistent estimates of the return to
education.
▪ We expect an upward bias for the estimated return to education.
Why?
• How can we solve or at least mitigate this omitted variable
problem?
9
❖One possibility is to use a proxy variable for the omitted
variable.
• Something that is related to the unobserved variable.
❖ In the wage equation one could use the intelligence
quotient, or IQ, as a proxy for ability. IQ and ability do not
need to be the same, but they need to be correlated.
❖Suppose we have the model
y = β0 + β1x1 + β2x2 + β3x3* + u
with x3* being unobserved. We have a proxy variable x3.
❖What do we require of x3?
• x3 must be relevant in explaining x3*, i.e. in the regression
x3* = δ0 + δ3x3 + v3 we have δ3 ≠ 0.
• If δ3 = 0, the proxy is not good.
10
❖ Replace x3* with x3, i.e. just regress y on x1, x2 and x3.
This is called the plug-in solution to the omitted
variables problem.
❖ Since x3 and x3* are not the same: when does this
procedure give consistent estimators for β1 and β2?
❖ The assumptions are with respect to u and v3:
Assumption 1. In addition to assuming that u and x1, x2 and x3* are
uncorrelated, we need that u and x3 be uncorrelated. This means
that x3 is irrelevant in the population model once x1, x2 and x3* are
included.
Assumption 2. The error v3 is uncorrelated with x1, x2 and x3.
This means that x3 is a good proxy for x3*:
E(x3*|x1, x2, x3) = E(x3*|x3)
11
❖ From the latter assumption it follows that
E(x3*|x1, x2, x3) = δ0 + δ3x3
❖ In terms of our wage equation this means:
E(abil|educ, exper, IQ) = E(abil|IQ) = δ0 + δ3IQ
thus the mean value of ability only changes with IQ.
❖ More formally, what are the implications of the two
assumptions?
12
❖By combining
y = β0 + β1x1 + β2x2 + β3x3* + u
x3* = δ0 + δ3x3 + v3
we obtain:
y = β0 + β1x1 + β2x2 + β3(δ0 + δ3x3 + v3) + u
y = (β0 + β3δ0) + β1x1 + β2x2 + β3δ3x3 + β3v3 + u
• Now let us denote by e = β3v3 + u the composite error.
• Note that u and v3 both have zero mean and each is uncorrelated
with x1, x2 and x3 (see Assumptions 1 and 2 in slide 11). Then e
has zero mean and is uncorrelated with x1, x2 and x3.
❖For this reason, we can write
y = α0 + β1x1 + β2x2 + α3x3 + e
where α0 = β0 + β3δ0 and α3 = β3δ3.
• The OLS estimation of the above equation is consistent for
α0, β1, β2 and α3.
• We do not get unbiased estimators for β0 and β3.
• Empirically α3 may even be of more interest than β3.
Functional form misspecification
❖Special case: omission of a relevant variable x1².
❖Suppose y = β0 + β1x1 + β2x1² + u with E(u|x1, x1²) = 0.
❖But we estimate y = β0 + β1x1 + v, where v = β2x1² + u.
❖Now, since cov(x1, x1²) ≠ 0 and if β2 ≠ 0, we do not have
E(v|x1) = 0 and we would have a bias due to functional
form misspecification.
❖Solution: Test for functional form (RESET), non-
parametric and semiparametric methods, more flexible
parametric specifications of the model.
14
Simultaneity
❖ If an explanatory variable is determined simultaneously
with the dependent variable, then it is correlated with the
error term.
❖ In this case OLS is biased and inconsistent.
❖As an example we consider two equations (structural
equations) without an intercept:
y1 = α1y2 + β1z1 + u1
y2 = α2y1 + β2z2 + u2
where the variables z1 and z2 are exogenous.
❖We focus on estimation of the first equation.
15
y1 = α1y2 + β1z1 + u1
y2 = α2y1 + β2z2 + u2
❖To show that the dependent variables are generally
correlated with the error terms (e.g. y2 with u1), we
substitute the right-hand side of the first equation for y1
in the second equation:
y2 = α2(α1y2 + β1z1 + u1) + β2z2 + u2
(1 − α2α1)y2 = α2β1z1 + β2z2 + α2u1 + u2
❖ In order to solve for y2 we have to assume α2α1 ≠ 1.
❖ It depends on the application whether this is restrictive.
16
(1 − α2α1)y2 = α2β1z1 + β2z2 + α2u1 + u2
can be rewritten as:
y2 = π21z1 + π22z2 + v2
where
π21 = α2β1/(1 − α2α1)
π22 = β2/(1 − α2α1)
v2 = (α2u1 + u2)/(1 − α2α1)
This is a reduced form equation for y2.
• π21 and π22 are reduced form parameters.
• v2 is linear in u1 and u2. For this reason, it is uncorrelated with z1
and z2. We can apply OLS to estimate π21 and π22.
• There is an equivalent reduced form equation for y1.
17
y2 = π21z1 + π22z2 + v2
❖We can use this equation to show that OLS estimation of
the structural equation will generally result in biased
and inconsistent estimates for α1 and β1:
y1 = α1y2 + β1z1 + u1
❖From the reduced form equation, we see that y2 and u1
are correlated if v2 and u1 are correlated. Since v2
linearly depends on u1, it is generally correlated with u1.
❖When is it not correlated?
• If α2 = 0 and if u1 and u2 are uncorrelated.
• In this case y2 is not simultaneously determined with y1.
18
❖When y2 is correlated with u1 because of simultaneity,
the OLS estimator suffers from simultaneity bias and is
inconsistent.
❖Obtaining the direction of the bias is generally
complicated. Simple expressions of the bias can be
derived under additional assumptions but this is not
covered here.
❖Solution: Instrumental Variable estimation
19
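The simultaneity bias can be illustrated with a small simulation (a hypothetical design, not from the slides; all coefficient values are assumed). We generate the two-equation system above, solve for y2 via its reduced form, and show that OLS on the first structural equation overestimates α1 when α2 > 0.

```python
# Simulated simultaneity bias: y1 = a1*y2 + b1*z1 + u1, y2 = a2*y1 + b2*z2 + u2.
# Since cov(y2, u1) != 0, OLS of y1 on (y2, z1) is inconsistent for a1.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a1, b1, a2, b2 = 0.5, 1.0, 0.4, 1.0   # assumed structural parameters

z1, z2 = rng.normal(size=n), rng.normal(size=n)
u1, u2 = rng.normal(size=n), rng.normal(size=n)

# Reduced form for y2 (requires a2*a1 != 1), then y1 from the first equation
y2 = (a2 * b1 * z1 + b2 * z2 + a2 * u1 + u2) / (1 - a2 * a1)
y1 = a1 * y2 + b1 * z1 + u1

# OLS of y1 on an intercept, y2 and z1
A = np.column_stack([np.ones(n), y2, z1])
coef, *_ = np.linalg.lstsq(A, y1, rcond=None)
a1_ols = coef[1]
print(a1_ols)   # noticeably above the true a1 = 0.5
```

Here cov(y2, u1) = α2/(1 − α2α1) > 0, so the OLS estimate of α1 is biased upward.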
Measurement error in an explanatory variable
❖We consider the simple regression model:
y = β0 + β1x1* + u
and assume that it satisfies the Gauss-Markov
assumptions.
❖We do not observe x1* but x1 (e.g. actual and reported
income).
❖The measurement error in the population is: e1 = x1 − x1*.
❖We assume: E(e1) = 0.
❖Moreover, we assume that u is uncorrelated with x1* and
x1:
E(y|x1, x1*) = E(y|x1*)
20
❖The model can be written as: y = β0 + β1x1 + (u − β1e1).
❖The classical errors-in-variables (CEV) assumption is
that the measurement error is uncorrelated with the
unobserved explanatory variable: cov(e1, x1*) = 0.
• This has the meaning that the observed measure x1 consists of
two uncorrelated components: x1 = x1* + e1.
• (We still assume that u is uncorrelated with x1 and x1*.)
• The above assumption implies that x1 and e1 must be correlated:
cov(x1, e1) = E(x1e1) = E(x1*e1) + E(e1²) = 0 + σe1²
• This correlation causes problems for the OLS estimation.
21
❖ This implies for our model y = β0 + β1x1 + (u − β1e1) that,
since u and x1 are uncorrelated, the covariance between x1 and the
composite error (u − β1e1) is:
cov(x1, u − β1e1) = −β1cov(x1, e1) = −β1σe1²
• Note also that Var(x1) = Var(x1*) + Var(e1) = σx1*² + σe1²
• Then one can show:
plim β̂1 = β1 + cov(x1, u − β1e1)/Var(x1)
        = β1 − β1σe1²/(σx1*² + σe1²)
        = β1[1 − σe1²/(σx1*² + σe1²)]
        = β1 σx1*²/(σx1*² + σe1²)
• This equation is very interesting: plim β̂1 is always closer to
zero than β1: attenuation bias.
22
• OLS is biased in the classical errors-in-variables model:
▪ If β1 is positive, OLS will underestimate it, and vice versa.
• Things are more complicated in the multiple regression model,
but again OLS will be biased and inconsistent.
❖Solution: Instrumental Variable estimation, ...
23
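The attenuation factor σx1*²/(σx1*² + σe1²) can be verified numerically. This is a sketch simulation with assumed variances (not the slides' data): with equal signal and noise variances the factor is 0.5, so the OLS slope is halved.

```python
# Simulated classical measurement error: the OLS slope on the noisy
# regressor shrinks by the factor var(x*)/(var(x*) + var(e)).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b0, b1 = 1.0, 2.0
sd_xstar, sd_e = 1.0, 1.0        # equal variances -> attenuation factor 0.5

x_star = rng.normal(scale=sd_xstar, size=n)   # true regressor
e1 = rng.normal(scale=sd_e, size=n)           # measurement error
x1 = x_star + e1                              # observed regressor
y = b0 + b1 * x_star + rng.normal(size=n)

slope = np.cov(x1, y)[0, 1] / np.var(x1)
factor = sd_xstar**2 / (sd_xstar**2 + sd_e**2)
print(slope, b1 * factor)   # both near 1.0, i.e. half of b1 = 2.0
```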
Instrumental Variable Estimation
❖ Suppose we have an endogenous independent variable.
❖ How can we obtain a good estimate for the coefficient on the
endogenous variable?
❖ This can be achieved if there is an instrumental variable available.
24
❖ First, we look at the simple regression model:
y = β0 + β1x + u
❖ Now take another variable z with cov(z, x) ≠ 0. Then,
cov(z, y) = β1cov(z, x) + cov(z, u)
β1 = [cov(z, y) − cov(z, u)]/cov(z, x)
❖ Under the additional assumption cov(z, u) = 0, we
have
β1 = cov(z, y)/cov(z, x)
25
❖A natural estimator for β1 is therefore cov(z, y)/cov(z, x)
with the population covariances replaced by their sample
analogues:
β̂1 = Σi (zi − z̄)(yi − ȳ) / Σi (zi − z̄)(xi − x̄)
❖This estimator is consistent for β1, but it is inconsistent if
cov(z, u) ≠ 0.
❖The estimator can be biased in small samples even if
cov(z, u) = 0.
❖ If x is exogenous, it can be used as an instrument and
then the IV estimator is identical to OLS.
❖The natural estimator for β0 is simply:
β̂0 = ȳ − β̂1x̄
26
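The sample-analogue formulas above can be put to work on simulated data (a hypothetical data-generating process; the coefficients and the instrument design are assumptions, not the slides' example):

```python
# Simple IV estimator beta1_iv = cov(z, y)/cov(z, x) on simulated data
# with an endogenous x and a valid instrument z.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
b0, b1 = 1.0, 2.0

z = rng.normal(size=n)                        # instrument: cov(z, u) = 0
u = rng.normal(size=n)                        # structural error
x = 0.7 * z + 0.5 * u + rng.normal(size=n)    # endogenous: cov(x, u) != 0
y = b0 + b1 * x + u

beta1_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
beta0_iv = y.mean() - beta1_iv * x.mean()
beta1_ols = np.cov(x, y)[0, 1] / np.var(x)    # for comparison
print(beta1_iv, beta1_ols)   # IV near the true 2.0, OLS biased upward
```

OLS is biased upward here because cov(x, u) > 0, while IV recovers β1.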
❖ A variable z is a candidate instrument for a variable x if it
satisfies the following conditions:
cov(z, x) ≠ 0 and cov(z, u) = 0
❖ Some remarks on the choice of an instrument:
• It is often difficult to find a good instrument.
• A proxy variable does not make a good instrument, as it is supposed to
be correlated with the error term.
• Example: Consider the regression of log wage on education and ability
with error term u. Ability is not observed but IQ is observed and highly
correlated with ability. IQ is a potential candidate for a proxy for ability
but clearly violates the condition cov(IQ, u) = 0. (HOMEWORK:
Explain why IQ is a potential candidate for a proxy for ability but
clearly violates the condition cov(IQ, u) = 0. Explain why IQ cannot be
used as an instrument for education.)
• Instruments for education: one may use family background variables
such as the number of siblings: negatively correlated with education but
maybe uncorrelated with ability. The latter, however, is unclear as ability
is not observed.
27
Inference with the IV estimator
❖The IV estimator has an asymptotic normal distribution.
❖When we impose a homoscedasticity assumption
conditional on the instrument,
E(u²|z) = Var(u) = σ²,
one can derive the asymptotic variance of β̂1, which is
Avar(β̂1) = σ²/(n σx² ρx,z²)
where σx² = Var(x) and ρx,z is the population correlation
between x and z.
❖This provides us with a standard error for the IV estimator.
❖As with the OLS estimator, the asymptotic variance of
the IV estimator decreases to zero at the rate 1/n.
❖ If ρx,z² is small, the variance of the IV estimator is large.
❖The asymptotic variance of the IV estimator is always
larger, and sometimes much larger, than the asymptotic
variance of the OLS estimator. 28
Estimating Avar(β̂1) = σ²/(n σx² ρx,z²)
❖The population variance of the error term, σ², can be
estimated just as in the case of the OLS regression:
σ̂² = [1/(n − 2)] Σi ûi²
where ûi are now the residuals from the IV regression.
❖The population variance of x can be estimated by the
sample variance SSTx/n.
❖The square of the population correlation between x and z
can be estimated by the R-squared of the regression of
x on z: Rx,z².
❖Then a consistent estimator is: Avar(β̂1) = σ̂²/(SSTx Rx,z²)
29
❖Example: Return to Education for Married Women
• Data: MROZ.dta
• Simple log-level regression model:
log(wage) = β0 + β1educ + u
• We obtain OLS estimates:
log(wage) = −0.185 + 0.109 educ
            (0.205)  (0.014)
n = 428, R² = 0.118
• We use father's education as an instrument for education.
• We cannot empirically check whether ability and father's
education are uncorrelated. However, we can test whether
education and father's education are correlated.
30
❖Example (cont.)
• When we regress educ on fatheduc, we obtain
educ = 10.240 + 0.269 fatheduc
       (0.105)  (0.011)
n = 428, R² = 0.173
This suggests that there is a significant positive correlation, and
about 17% of the variation in educ is explained by father's education.
• When we use father's education as an instrument for educ, we obtain:
log(wage) = 0.441 + 0.059 educ
            (0.446) (0.035)
n = 428, R² = 0.093
• The IV estimate of the return to education is about one half of the
OLS estimate, suggesting that there is omitted ability bias.
• The IV standard error is much larger than the OLS standard error,
and the IV 95% confidence interval contains the OLS estimate.
• While this empirical example suggests that the differences between
the IV and OLS estimates are practically large, they are not
statistically significant.
31
❖There are similar IV applications with other data sets
which yield larger IV estimates than OLS estimates.
❖Larger IV estimates than OLS estimates may suggest
measurement error issues that cause an underestimation
by OLS, or may suggest that the IV is invalid because it
is correlated with the error term.
❖Since even a little correlation between z and u can
cause serious problems for the IV estimator, this is an
important issue.
❖ IV estimation can also be applied in the case of a binary
endogenous regressor or a binary instrumental variable.
32
IV Estimation with a Poor Instrumental Variable
❖ IV estimates can have large standard errors if x and z are only
weakly correlated. (Don't use IV in this case.)
❖ IV estimates can have a large asymptotic bias even if z and u are
only weakly correlated:
plim β̂1 = β1 + cov(z, u)/cov(z, x)
This implies that the bias can be large if the population correlation
between z and x is small, even when the population correlation
between z and u is small. (HOMEWORK: prove the equality above.
HINTS: plim β̂1 = cov(z, y)/cov(z, x) and y = β0 + β1x + u.)
❖ For this reason IV can be worse in terms of consistency than OLS
even if corr(z, u) is small (provided that corr(z, x) is also small).
❖ One can show that IV is only superior in terms of asymptotic bias if
corr(z, u)/corr(z, x) < corr(x, u)
33
R-Squared and IV Estimation
R² = 1 − SSR/SST
❖SSR (the sum of squared IV residuals) can be larger than
SST (the total sum of squares). For this reason the R-
squared can become negative, and it is smaller than
for OLS.
❖ It is not clear whether the R-squared should be
reported after IV estimation.
34
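That SSR can exceed SST after IV is easy to demonstrate with a contrived simulation (an assumed design, not from the slides): a small true slope plus a regressor strongly negatively related to the error makes Var(y) smaller than Var(u), so the IV residual sum of squares exceeds SST.

```python
# Simulated example where the IV R-squared is negative (SSR > SST).
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
b0, b1 = 0.0, 0.1

z = rng.normal(size=n)       # valid instrument
u = rng.normal(size=n)
x = z - 2.0 * u              # endogenous, strongly negatively related to u
y = b0 + b1 * x + u

beta1_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
beta0_iv = y.mean() - beta1_iv * x.mean()
resid = y - beta0_iv - beta1_iv * x
r2_iv = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(r2_iv)   # negative: the IV fit is worse than the sample mean
```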
IV Estimation of the Multiple Regression Model
• The idea of IV estimation can be easily extended to the multiple
regression case.
• For this purpose we change the notation a bit.
• The model is now: y = xβ + u with E(u) = 0,
• β is a K×1 vector
• x = (1, x2, x3, …, xK−1, xK) is a 1×K vector
and xK is endogenous, i.e. cov(xK, u) ≠ 0.
• We need an instrument for xK to obtain consistent estimates.
• We need another exogenous variable z1 with cov(z1, u) = 0.
• Let z = (1, x2, x3, …, xK−1, z1); then E(z′u) = 0 and z is
exogenous.
35
• The instrument z1 must be relevant to explain the endogenous
variable once we control for all remaining exogenous
explanatory variables, i.e.
xK = δ1 + δ2x2 + δ3x3 + ⋯ + δK−1xK−1 + θ1z1 + rK,
where θ1 ≠ 0 and by definition rK is uncorrelated with all
exogenous variables and E(rK) = 0.
• This implies that E(z′x) has full rank (rank condition).
• We can test θ1 ≠ 0 using a t-test.
y = xβ + u
z′y = z′xβ + z′u
E(z′y) = E(z′x)β + E(z′u)
E(z′y) = E(z′x)β
β = [E(z′x)]⁻¹E(z′y)
• E(z′y) and E(z′x) can be consistently estimated using the
corresponding sample moments.
36
❖Given a random sample i = 1, …, N, the instrumental
variables estimator of β is
β̂_IV = [(1/N) Σi zi′xi]⁻¹ [(1/N) Σi zi′yi] = (Z′X)⁻¹Z′Y
where
Z = [1  x2,1  x3,1  …  xK−1,1  z1,1
     ⋮   ⋮    ⋮    ⋱    ⋮     ⋮
     1  x2,N  x3,N  …  xK−1,N  z1,N]  is an N×K matrix,
X = [1  x2,1  x3,1  …  xK−1,1  xK,1
     ⋮   ⋮    ⋮    ⋱    ⋮     ⋮
     1  x2,N  x3,N  …  xK−1,N  xK,N]  is an N×K matrix,
Y = (y1, …, yN)′ is an N×1 vector, xi = (1, x2,i, x3,i, …, xK−1,i, xK,i)
and zi = (1, x2,i, x3,i, …, xK−1,i, z1,i).
37
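The matrix formula β̂_IV = (Z′X)⁻¹Z′Y can be checked directly on simulated data. This is a minimal sketch with one exogenous regressor, one endogenous regressor, and one instrument; the coefficient values and design are assumptions for illustration only.

```python
# Matrix IV estimator (Z'X)^{-1} Z'y in a just-identified simulated model.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta = np.array([1.0, 0.5, 2.0])   # intercept, exogenous x2, endogenous x3

x2 = rng.normal(size=n)            # exogenous regressor
z1 = rng.normal(size=n)            # instrument for x3
u = rng.normal(size=n)
x3 = 0.3 * x2 + 0.8 * z1 + 0.5 * u + rng.normal(size=n)
y = beta[0] + beta[1] * x2 + beta[2] * x3 + u

X = np.column_stack([np.ones(n), x2, x3])  # N x K regressor matrix
Z = np.column_stack([np.ones(n), x2, z1])  # N x K instrument matrix
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
print(beta_iv)   # close to (1.0, 0.5, 2.0)
```

Note that the exogenous regressors appear in both X and Z: they act as their own instruments.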
❖Example: College Proximity as IV for Education
• Data: Card.dta See do file in the VLE
• Log(wage) is dependent variable, several controls (exper expersq
black smsa south smsa66 reg662-reg669) plus the endogenous
education
• Instrument for education: dummy if someone grew up near a four
year college (nearc4).
• We assume that nearc4 is uncorrelated with the error. Moreover, to
be a valid instrument it has to be partially correlated with educ.
• We can test this by estimating in Stata the equation:
regress educ nearc4 exper expersq black smsa south smsa66 reg662-reg669
• The t-statistic is 3.64, and therefore, if nearc4 is uncorrelated with the
error term, we can use it as an IV for educ.
38
. regress educ nearc4 exper expersq black smsa south smsa66 reg662-reg669

      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(15, 2994)     =    182.13
       Model |  10287.6179        15  685.841194   Prob > F        =    0.0000
    Residual |  11274.4622     2,994  3.76568542   R-squared       =    0.4771
-------------+----------------------------------   Adj R-squared   =    0.4745
       Total |  21562.0801     3,009  7.16586243   Root MSE        =    1.9405

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nearc4 |   .3198989   .0878638     3.64   0.000     .1476194    .4921785
       exper |  -.4125334   .0336996   -12.24   0.000    -.4786101   -.3464566
     expersq |   .0008686   .0016504     0.53   0.599    -.0023674    .0041046
       black |  -.9355287   .0937348    -9.98   0.000     -1.11932   -.7517377
        smsa |   .4021825   .1048112     3.84   0.000     .1966732    .6076918
       south |  -.0516126   .1354284    -0.38   0.703    -.3171548    .2139296
      smsa66 |   .0254805   .1057692     0.24   0.810    -.1819071    .2328682
      reg662 |  -.0786363   .1871154    -0.42   0.674    -.4455241    .2882514
      reg663 |   -.027939   .1833745    -0.15   0.879    -.3874918    .3316139
      reg664 |    .117182   .2172531     0.54   0.590    -.3087984    .5431624
      reg665 |  -.2726165   .2184204    -1.25   0.212    -.7008858    .1556528
      reg666 |  -.3028147   .2370712    -1.28   0.202    -.7676536    .1620242
      reg667 |  -.2168177   .2343879    -0.93   0.355    -.6763953    .2427598
      reg668 |   .5238914   .2674749     1.96   0.050    -.0005618    1.048344
      reg669 |    .210271   .2024568     1.04   0.299    -.1866975    .6072395
       _cons |   16.63825   .2406297    69.14   0.000     16.16644    17.11007
39
◼ The following table reports OLS and IV estimates.
◼ The IV estimate is almost twice as large as the OLS estimate.
◼ The SE of the IV estimate is 18 times larger. This is the price we have to
pay if we use an instrument to obtain a consistent estimator.

Dependent variable: log(wage)

Independent variable   (1) OLS           (2) IV
educ                    0.075 (0.003)     0.132 (0.055)
exper                   0.085 (0.007)     0.108 (0.024)
exper^2                -0.0023 (0.0003)  -0.0023 (0.0003)
…other controls         …                 …
Observations            3,010             3,010
R-squared               0.300             0.238
40
Two Stage Least Squares (2SLS)
❖Sometimes there are multiple valid IVs for an
endogenous explanatory variable.
❖Suppose the variables z1, z2, …, zM satisfy
cov(zh, u) = 0 for h = 1, …, M
❖We could simply use each of them as an instrument and
obtain several IV estimators.
❖The idea is to use all the instruments together to obtain a
more efficient estimator:
• Let z = (1, x2, …, xK−1, z1, z2, …, zM) be 1×L with L = K − 1 + M.
• As each element of z is uncorrelated with u, any linear
combination is also uncorrelated with u.
41
❖ The linear combination of z which is most highly correlated with xK is
the linear projection of xK on z:
xK = δ1 + δ2x2 + δ3x3 + ⋯ + δK−1xK−1 + θ1z1 + ⋯ + θMzM + rK
where by definition rK is uncorrelated with all the exogenous variables
zh and xj, and E(rK) = 0.
❖ xK is correlated with u if it is endogenous, but
[δ1 + δ2x2 + δ3x3 + ⋯ + δK−1xK−1 + θ1z1 + ⋯ + θMzM]
is not correlated with u.
❖ This means that we can replace the endogenous variable with its
prediction (using OLS estimation of the linear model for xK):
x̂K = δ̂1 + δ̂2x2 + δ̂3x3 + ⋯ + δ̂K−1xK−1 + θ̂1z1 + ⋯ + θ̂MzM
which is exogenous.
❖ We require that at least one θh is non-zero. Use an F-test to test the
null hypothesis that all instruments have zero effects.
42
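The two stages can be sketched in a simulation (a hypothetical overidentified design with two instruments; all parameter values are assumptions): the first stage projects xK on all the exogenous variables, and the second stage uses the fitted values as the instrument for xK.

```python
# 2SLS with two instruments: first stage fits x3 on (1, x2, z1, z2),
# second stage uses the fitted values x3_hat as the instrument for x3.
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
b = np.array([1.0, 0.5, 2.0])    # intercept, x2, endogenous x3

x2 = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)   # two instruments
u = rng.normal(size=n)
x3 = 0.4 * x2 + 0.6 * z1 + 0.6 * z2 + 0.5 * u + rng.normal(size=n)
y = b[0] + b[1] * x2 + b[2] * x3 + u

# First stage: linear projection of x3 on all exogenous variables
Z = np.column_stack([np.ones(n), x2, z1, z2])
delta, *_ = np.linalg.lstsq(Z, x3, rcond=None)
x3_hat = Z @ delta

# Second stage: IV with instrument vector (1, x2, x3_hat)
X = np.column_stack([np.ones(n), x2, x3])
Xhat = np.column_stack([np.ones(n), x2, x3_hat])
b_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
print(b_2sls)   # close to (1.0, 0.5, 2.0)
```

As the slides warn, manually computed second-stage OLS standard errors would not be valid; packaged 2SLS routines correct them.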
❖Now, let x̂ = (1, x2, …, xK−1, x̂K) and use it as the vector of
instruments for x.
❖ It can be shown that the resulting estimator is
β̂_2SLS = (X̂′X)⁻¹X̂′Y and thus equals the two stage least
squares estimator.
❖This estimator is consistent under the conditions:
E(u) = 0, cov(zh, u) = 0 for all h, rank E(z′x) = K, L ≥ K.
❖The last condition suggests that we need at least as
many instruments as explanatory variables in the model
(order condition).
43
44
❖Under homoscedasticity, E(u²|z) = σ², it is possible to show
that the asymptotic variance of β̂_2SLS is σ²[E(x̂′x̂)]⁻¹/N.
❖The variance matrix can be estimated by σ̂²(X̂′X̂)⁻¹, with
σ̂² = [1/(N − K)] Σi ûi², where ûi = yi − xiβ̂_2SLS.
❖Econometric packages have 2SLS implemented. There
is no need to perform the two stages manually. If you
compute it manually, the OLS standard errors and statistics
for the second stage are not valid (as there is a
composite error in the second regression).
❖The IV estimator with multiple instruments is also called the
two stage least squares (2SLS) estimator:
• One can show that the IV estimates are identical to the OLS
estimates from the regression of y on x̂.
This is the second stage.
• The first stage is the regression of xK on z.
❖Multicollinearity is a bigger problem for 2SLS than for
OLS. This is for two reasons:
• x̂K has less variation than xK.
• x̂K has more correlation with the other regressors than xK.
45
Some Remarks:
❖ If the R-squared of the regression of xK on the exogenous
variables [1, x2, …, xK−1] is very large, the standard error of
the 2SLS estimate explodes. This can be verified with the
data at hand.
❖2SLS can also be used in models with more than one
endogenous variable.
• We need more candidate instruments to achieve
identification.
• The sufficient condition for identification is the rank condition.
❖Since the R-squared after 2SLS cannot be compared to the
OLS R-squared, we must be careful.
❖The standard errors and tests for a 2SLS estimation
produced manually will be wrong.
❖ It is possible to derive a statistic with an approximate F-
distribution in large samples. Use econometric packages
to run 2SLS with correct standard errors.
46
Testing for Endogeneity
❖ It is useful to have a test for endogeneity of an
explanatory variable to show whether 2SLS is even
necessary.
❖Suppose we have the structural equation
y1 = β0 + β1y2 + β2z1 + β3z2 + u1
where y2 is endogenous and z3 and z4 are two
exogenous variables (i.e. uncorrelated with u1) but
relevant to explain y2.
❖Hausman (1978) suggests a test which directly
compares the OLS and 2SLS estimates and determines
whether the differences are statistically significant (Hausman
test).
47
❖The idea behind the test is as follows:
• We have:
y1 = β0 + β1y2 + β2z1 + β3z2 + u1   (main regression)
and
y2 = π0 + π1z1 + π2z2 + π3z3 + π4z4 + v2   (first stage equation)
• Each zj is uncorrelated with u1.
• y2 is uncorrelated with u1 if and only if v2 is uncorrelated with
u1. This is what we want to test.
• Write u1 = δ0 + δ1v2 + e1; then by definition e1 is uncorrelated
with v2.
• If δ1 = 0, then u1 and v2 are uncorrelated.
• To test for such correlation (endogeneity of y2) we can test
H0: δ1 = 0 in the following model:
y1 = β0 + β1y2 + β2z1 + β3z2 + δ1v̂2 + error
• where v̂2 are the residuals from the first stage (reduced form)
equation.
48
y1 = β0 + β1y2 + β2z1 + β3z2 + δ1v̂2 + error
• We then test H0: δ1 = 0 using a t-test.
• If we reject it at a small significance level, we
conclude that y2 is endogenous because v2 and u1 are
correlated.
Practical guideline for the Hausman test:
1. Estimate the first stage equation for y2 and compute
the residuals v̂2.
2. Add v̂2 as an explanatory variable in the main regression
and estimate it by OLS. You may want to use a
heteroscedasticity-robust version of the t-test for
testing whether the coefficient on v̂2 is significant. If it
is statistically significantly different from zero, we
conclude that y2 is indeed endogenous.
A disadvantage of this test is that its reliability requires the
use of a valid instrument.
49
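The two-step guideline above can be sketched as a simulation (an assumed data-generating process, not the slides' example): we build an endogenous y2 by correlating its first-stage error v2 with u1, then run the regression-based test with the first-stage residuals added.

```python
# Regression-based Hausman test on simulated data: add the first-stage
# residual v2_hat to the main regression and t-test its coefficient.
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
z1, z2, z3, z4 = (rng.normal(size=n) for _ in range(4))
u1 = rng.normal(size=n)
v2 = 0.7 * u1 + rng.normal(size=n)   # correlated with u1 -> y2 endogenous
y2 = 0.5 * z1 + 0.5 * z2 + 0.8 * z3 + 0.8 * z4 + v2
y1 = 1.0 + 0.5 * y2 + 0.3 * z1 + 0.3 * z2 + u1

def ols(X, y):
    """OLS coefficients, classical standard errors, residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se, resid

# Step 1: first stage for y2, keep the residuals v2_hat
Z = np.column_stack([np.ones(n), z1, z2, z3, z4])
_, _, v2_hat = ols(Z, y2)

# Step 2: add v2_hat to the main regression and t-test its coefficient
X = np.column_stack([np.ones(n), y2, z1, z2, v2_hat])
coef, se, _ = ols(X, y1)
t_stat = coef[-1] / se[-1]
print(t_stat)   # large t -> reject H0, conclude y2 is endogenous
```

With v2 built to be correlated with u1, the t-statistic is far from zero and the test correctly flags y2 as endogenous (a robust t-test would be preferable in practice, as the slide notes).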
More remarks on IV estimation:
• If we have more instruments than endogenous explanatory
variables, we can also test whether at least some of them are
uncorrelated with u1 (validity of the instruments).
• We have to assume that at least one of the IVs is exogenous.
Then we can test the overidentifying restrictions that are used in
2SLS. No details are presented here.
• Heteroscedasticity in the context of 2SLS raises the same issues
as with OLS.
• There are standard errors and test statistics available which are
robust to heteroscedasticity.
• There are also tests for heteroscedasticity available.
• 2SLS can also be applied to pooled cross section and panel data
(e.g. with first differencing). This does not raise new difficulties.
50
Summary
❖We have seen the method of instrumental variables as a
way to consistently estimate the parameters in a linear
model when there are endogenous explanatory
variables.
❖When instruments are poor, IV estimates can be worse
than OLS estimates.
❖2SLS is routinely used in economics and social sciences
alike.
❖Hausman-Test for endogeneity.
❖ IV estimation can be used for cross section, pooled
cross section and panel data.
51