Date: 2021-11-10

ECMT2150 – Lecture 6
Topics Today
Week 6
Model Specification
Specification errors
=> including irrelevant variables
=> omitting relevant variables – Omitted Variable Bias
More on multicollinearity and sampling variance
Endogeneity
Proxy Variables
Reference: Chapters 3.3b; 3.4a; Chp 9.2
Recap
• Econometrics is about estimating economic relationships, e.g.
between wages and education, etc.
• Simple and multiple linear regression – many examples
– Incorporated non-linearities: logs & polynomials
– Incorporated qualitative and categorical information
– Discussed goodness of fit
Recap
• Derived the OLS estimator – understand the steps/the
intuition. Derivation is not examinable
• Statistical Properties of OLS that hold for any sample under
given assumptions
– Expected values/unbiasedness under MLR.1 – MLR.4
– Variance formulas under MLR.1 – MLR.5
– Gauss-Markov Theorem under MLR.1 – MLR.5
– Exact sampling distributions/tests under MLR.1 – MLR.6
• Asymptotic Properties of OLS (Consistency and Asymptotic
normality)
Recap
Why?
• Importance of the Zero Conditional Mean Assumption:
E(u|x) = E(u) = 0
• Reliability of estimator and inference rest on whether the
assumptions hold
• Causality - understanding what we have to assume to obtain
causal estimates of our parameters of interest
• + we need the variance formulas and sampling distributional
assumptions to conduct inference
Recap
• Inference
– Hypothesis tests
• T-tests:
– One-sided
– Two sided (tests of statistical significance)
• p-values
• Confidence intervals
– Testing more general alternatives
• An estimate is equal to a constant
• One estimate is equal to another => Linear combination of
parameters
• Testing multiple linear restrictions
Specification Errors
Specification errors
• Until now we have assumed our population model
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u$
has been correctly specified
• A bit unrealistic – we can never be sure of the true
population model
• Types of specification errors:
– Choice of independent variables
• Over-specification & sampling variance effects
• Omitted variables & Endogeneity
– Heteroskedasticity
– Functional Form
– Measurement Error
– Missing Data, Non-random Sampling & Outliers
Misspecification in the choice of
independent variables
Misspecification in the choice of independent variables
A. Including irrelevant variables in a regression
model (over-specifying)
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, with $\beta_3 = 0$ in the population
– This model satisfies MLR.1-MLR.4
– x3 may be correlated with x1 and x2
– Crucially, in the population, x3 has no effect on y after we
control for x1 and x2
⇒ Including x3 has no cost in terms of bias in the estimates of
any of the parameters, because OLS remains unbiased under MLR.1-MLR.4, so $E(\hat{\beta}_3) = \beta_3 = 0$
– However, including irrelevant variables may increase the
sampling variance (more on this shortly)
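A small Monte Carlo sketch can illustrate this point. Everything in it is an assumption made for illustration: the parameter values, the sample size, the number of replications, and the way x3 is generated (correlated with x1, zero effect on y). It compares the OLS slope on x1 with and without the irrelevant x3.

```python
import numpy as np

def simulate_irrelevant(n=200, reps=1000, seed=0):
    """Compare the OLS slope on x1 with and without an irrelevant regressor x3."""
    rng = np.random.default_rng(seed)
    b_small, b_big = [], []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)
        # x3 is correlated with x1 but has no effect on y (beta3 = 0)
        x3 = 0.9 * x1 + 0.3 * rng.normal(size=n)
        y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
        X_small = np.column_stack([np.ones(n), x1, x2])
        X_big = np.column_stack([X_small, x3])
        b_small.append(np.linalg.lstsq(X_small, y, rcond=None)[0][1])
        b_big.append(np.linalg.lstsq(X_big, y, rcond=None)[0][1])
    return np.array(b_small), np.array(b_big)

b_small, b_big = simulate_irrelevant()
print("without x3: mean", b_small.mean(), "variance", b_small.var())
print("with x3:    mean", b_big.mean(), "variance", b_big.var())
```

Under these assumptions both averages should sit close to the true slope of 2, while the variance is noticeably larger when x3 is included.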
Misspecification in the choice of independent variables
B. Omitting relevant variables (Wooldridge 3.3b)
⇒Omitted Variable Bias
⇒Violate MLR.4, E(u|x)=0
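The simple case:
True model (contains x1 and x2): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
but due to our ignorance or data unavailability, we estimate
Estimated model (x2 is omitted): $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$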
Misspecification in the choice of independent variables
B. Omitting relevant variables (Wooldridge 3.3b)
Example: Omitting ability in a wage equation
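True model: $\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
We estimate: $\widehat{\ln(wage)} = \tilde{\beta}_0 + \tilde{\beta}_1 educ$ (abil is omitted)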
Omitting a relevant variable causes bias when the omitted
variable is correlated with any of the other explanatory variables
in the model
Omitted Variable Bias – the simple case
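Let's look at this in more detail:
• If x1 and x2 are correlated, assume a linear regression
relationship between them:
$x_2 = \delta_0 + \delta_1 x_1 + v$
• The true model is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
Substituting for x2 gives:
$y = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1) x_1 + (\beta_2 v + u)$
– $\beta_0 + \beta_2\delta_0$: if y is only regressed on x1, this will be the estimated intercept
– $\beta_1 + \beta_2\delta_1$: if y is only regressed on x1, this will be the estimated slope on x1
– $\beta_2 v + u$: the error term
And the bias = $E(\tilde{\beta}_1) - \beta_1 = \beta_2\delta_1$, where $\tilde{\beta}_1$ is the slope from regressing y on x1 alone
Conclusion: All estimated coefficients will be biased.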
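The omitted variable bias result can be checked numerically. The sketch below uses simulated data with assumed parameter values (beta1 = 0.5, beta2 = 2, delta1 = 0.8, all hypothetical) and a small helper `ols` introduced only for this illustration. It shows that the slope from the short regression equals the long-regression slope plus the coefficient on x2 times the slope from regressing x2 on x1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
beta0, beta1, beta2 = 1.0, 0.5, 2.0   # assumed true parameters
delta0, delta1 = 0.0, 0.8             # assumed relationship x2 = delta0 + delta1*x1 + v

x1 = rng.normal(size=n)
x2 = delta0 + delta1 * x1 + rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

def ols(dep, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(dep)), *cols])
    return np.linalg.lstsq(X, dep, rcond=None)[0]

b_long = ols(y, x1, x2)    # regression including x2
b_short = ols(y, x1)       # regression omitting x2
d = ols(x2, x1)            # auxiliary regression of x2 on x1

implied = b_long[1] + b_long[2] * d[1]
print("short-regression slope:   ", b_short[1])
print("long slope + b2 * delta1: ", implied)
print("population value beta1 + beta2*delta1:", beta1 + beta2 * delta1)
```

The first two printed numbers agree up to floating-point error because the decomposition holds exactly in any sample; both are close to beta1 + beta2*delta1 rather than to beta1.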
Omitted Variable Bias – the simple case
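Our example again: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
$abil = \delta_0 + \delta_1 educ + v$
$\ln(wage) = (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1) educ + (\beta_2 v + u)$
$\beta_2\delta_1$ will be positive
The return to education $\beta_1$ will be over-estimated because $\beta_2\delta_1 > 0$.
It will look as if people with many years of education earn very high
wages, but this is partly due to the effect of ability - the fact that
people with more education are also more able on average.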
Omitted Variable Bias – the simple case
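Summarising the direction of the bias (the sign of $\beta_2\delta_1$):
                   Corr(x1, x2) > 0    Corr(x1, x2) < 0
$\beta_2 > 0$      positive bias       negative bias
$\beta_2 < 0$      negative bias       positive bias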
Omitted Variable Bias – the simple case
What about the size of the bias?
Our example again: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$
As above, the return to education $\beta_1$ will be over-estimated
because $\beta_2\delta_1 > 0$.
By how much?
For example, if the return to educ in the population is 8.6%
• a bias of $\beta_2\delta_1$ = 0.1 percentage points – not so worrying
• a bias of $\beta_2\delta_1$ = 3 percentage points – big concern
Omitted Variable Bias – the simple case
Q: When is there no omitted variable bias?
A: When the omitted factors are
i) unrelated to the included regressors => $Corr(x_1, x_2) = 0$, so $\delta_1 = 0$
ii) when they don't affect the outcome => $\beta_2 = 0$
In our wages example:
• $Corr(x_1, x_2) > 0$ if individuals with high innate ability
tend to have higher education.
• $\beta_2 > 0$ if individuals with high ability tend to have higher
productivity and wages.
⇒ Together, we would expect that OLS overestimates $\beta_1$
Omitted Variable Bias: more general cases
(Wooldridge 3.3c)
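– No general statements possible about direction of bias
– If $Corr(x_1, x_3) \neq 0$, we can analyse as per the simple
case if and only if $x_2$ is uncorrelated with $x_1$ and $x_3$
True model (contains x1, x2, and x3): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$
Estimated model (x3 is omitted): $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2$
• Example: Omitting ability in a wage equation
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$
If experience is approximately uncorrelated with educ and abil, then the
direction of the omitted variable bias can be analyzed as in the simple two
variable case.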
Omitting relevant variables =>
Inconsistency
• Not only is the OLS estimator biased when we omit
relevant variables, it is also inconsistent
see section 5.1a, Wooldridge
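• We can show that:
$\tilde{\beta}_1 = \beta_1 + \beta_2 \frac{\sum_i (x_{i1} - \bar{x}_1) x_{i2}}{\sum_i (x_{i1} - \bar{x}_1)^2} + \frac{\sum_i (x_{i1} - \bar{x}_1) u_i}{\sum_i (x_{i1} - \bar{x}_1)^2}$
• Then taking the plim on both sides, we have:
$\mathrm{plim}\, \tilde{\beta}_1 = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)}$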
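To see what inconsistency means in practice, the sketch below computes the slope from a regression of y on x1 alone for growing sample sizes. The data-generating process is an assumption chosen for illustration (beta1 = 1, beta2 = 0.5, Cov(x1, x2) = 0.6, Var(x1) = 1), so the probability limit of the short-regression slope is 1 + 0.5 * 0.6 = 1.3 rather than 1.

```python
import numpy as np

rng = np.random.default_rng(2)
beta1, beta2 = 1.0, 0.5          # assumed true parameters

def short_slope(n):
    """Slope from regressing y on x1 only, with x2 omitted."""
    x1 = rng.normal(size=n)                  # Var(x1) = 1
    x2 = 0.6 * x1 + rng.normal(size=n)       # Cov(x1, x2) = 0.6
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # Simple-regression slope = sample Cov(x1, y) / sample Var(x1)
    return np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

plim = beta1 + beta2 * 0.6 / 1.0
for n in (100, 10_000, 1_000_000):
    print(n, short_slope(n), "plim:", plim)
```

As n grows the estimate settles down, but around the plim and not around beta1: more data makes the estimator more precise without removing the bias.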
Sampling Variances
Specification Errors and Multi-collinearity
• So omitting a relevant variable can cause bias.
• Including irrelevant variables does not cause bias
• BUT, you might consider leaving them out so as to not
unnecessarily inflate the sampling variance
⇒there is a trade-off to be made with the effect on the
variance of our slope parameters
Sampling Variance: Mis-specified Models
(Wooldridge 3.4b)
• The choice of whether to include a particular variable in a
regression can be made by analyzing this trade-off
between bias and variance
• It might be the case that the likely omitted variable bias in
the misspecified model 2 is compensated by a smaller
variance
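True population model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
Estimated model 1: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
Estimated model 2: $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$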
Sampling Variance: Misspecified Models
(Wooldridge 3.4b)
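• Variance in the misspecified model (model 2, x2 omitted): $Var(\tilde{\beta}_1) = \sigma^2 / SST_1$
• Variance in model 1 (x1 and x2 included): $Var(\hat{\beta}_1) = \sigma^2 / [SST_1 (1 - R_1^2)]$, where $R_1^2$ is the R-squared from regressing x1 on x2
Conditional on x1 and x2, the variance in model 2
is smaller than that in model 1 whenever x1 and x2 are correlated in the sample
• Case 1: $\beta_2 = 0$ ⇒ $E(\hat{\beta}_1) = \beta_1$, $E(\tilde{\beta}_1) = \beta_1$, $Var(\tilde{\beta}_1) < Var(\hat{\beta}_1)$
Conclusion: Do not include irrelevant regressors
• Case 2: $\beta_2 \neq 0$ ⇒ $E(\hat{\beta}_1) = \beta_1$, $E(\tilde{\beta}_1) \neq \beta_1$, $Var(\tilde{\beta}_1) < Var(\hat{\beta}_1)$
Trade off bias and variance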
Sampling Variance: Misspecified Models
Caution!! Bias will not vanish even in large samples. But the variance
of the OLS estimators will decrease with a large sample
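The bias-variance trade-off can be made concrete by comparing mean squared errors in a simulation. The sketch below assumes a hypothetical setting in which x2 matters only a little (beta2 = 0.2) but is highly correlated with x1 (correlation 0.95); the function name `mse_compare` and all numbers are illustrative choices, not part of the lecture.

```python
import numpy as np

def mse_compare(n, beta2=0.2, rho=0.95, reps=500, seed=3):
    """MSE of the x1 slope: model 1 (x1 and x2) vs model 2 (x2 omitted)."""
    rng = np.random.default_rng(seed)
    beta1 = 1.0
    err_long, err_short = [], []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b_long = np.linalg.lstsq(X, y, rcond=None)[0][1]
        b_short = np.linalg.lstsq(X[:, :2], y, rcond=None)[0][1]
        err_long.append((b_long - beta1) ** 2)
        err_short.append((b_short - beta1) ** 2)
    return np.mean(err_long), np.mean(err_short)

for n in (30, 2000):
    mse_long, mse_short = mse_compare(n)
    print(f"n={n}: MSE model 1 = {mse_long:.4f}, MSE model 2 = {mse_short:.4f}")
```

In this assumed design the misspecified model 2 has the smaller mean squared error when n = 30, because its variance advantage outweighs its bias. With n = 2000 the ranking reverses, since the variance of both estimators shrinks while the bias of model 2 stays put.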
Multicollinearity & sampling variance
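– Recall: $Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}$, where $R_j^2$ is the R-squared from regressing xj on all other independent variables
– Linear relationships between explanatory variables can create
problems
– High multicollinearity can occur when $R_j^2$ is 'close' to 1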
– Ideally, we have little correlation between xj and other
independent variables – Yet this may not be the case.
– For example, examining the effect of various school expenditure
categories on school performance
• It is expected that wealthier schools will spend more on everything than less
wealthy schools
• It can be difficult to estimate the effect of any category of school expenditure on
student performance when there is little variation in one category
Multicollinearity & sampling variance
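$avgscore = \beta_0 + \beta_1 teacherexp + \beta_2 matexp + \beta_3 othexp + \dots$
where avgscore = average standardized test score of a school, teacherexp = expenditure on teachers, matexp = expenditure on instructional materials, othexp = other expenditures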
The different expenditure categories will be strongly correlated: if a school has a
lot of resources it will spend a lot on everything.
It will be hard to estimate the differential effects of different expenditure
categories because all expenditures are either high or low.
For precise estimates of the differential effects, one would need information
about situations where expenditure categories change differentially.
So,... often the sampling variance of the estimated effects will be large.
Multicollinearity & sampling variance
• Because effects cannot be disentangled, it may be better to lump
all expenditure categories together
• In other cases, dropping some of the x's may reduce
multicollinearity (but this may lead to omitted variable bias!)
• Only the sampling variance of the variables involved in
multicollinearity will be inflated; the estimates of other effects
may be very precise
• Note that multicollinearity is not a violation of MLR.3 (only perfect collinearity is)
• Multicollinearity may be detected through variance inflation
factors: $VIF_j = 1 / (1 - R_j^2)$
As an (arbitrary) rule of thumb: the variance
inflation factor should not be larger than 10
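As a minimal sketch of how variance inflation factors can be computed by hand, the function below regresses each regressor on the others and applies VIF = 1 / (1 - R-squared). The function name `vif` and the simulated "school expenditure" data are assumptions for illustration only: two expenditure categories share a common budget factor, and a third regressor is unrelated.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on an intercept and all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / sst
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(4)
budget = rng.normal(size=500)                      # hypothetical common wealth factor
teacher = budget + 0.15 * rng.normal(size=500)     # expenditure on teachers
materials = budget + 0.15 * rng.normal(size=500)   # expenditure on materials
unrelated = rng.normal(size=500)                   # regressor not tied to the budget
vifs = vif(np.column_stack([teacher, materials, unrelated]))
print(vifs)
```

Under these assumptions the two expenditure variables get large VIFs, while the unrelated regressor has a VIF near 1, matching the point that only the variables involved in the multicollinearity have inflated sampling variance.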
Omitted variable bias is one
source of endogeneity
But, what is endogeneity?
Recall: Crucial standard assumption for the MLR model
• Assumption MLR.4 (Zero conditional mean): $E(u \mid x_1, \dots, x_k) = 0$
– If MLR.4 is violated, there is an endogeneity problem.
– If there is a correlation between $u$ and any $x_j$, then MLR.4 is
violated and there is endogeneity
Multiple Linear Regression
Endogeneity is a major challenge in the social
sciences including in economics:
– Unlike the hard sciences, it is difficult to run truly
randomized experiments
– More economists are conducting
• experiments in the field (field experiments,
randomized control trials (RCTs))
• laboratory experiments
– Sometimes we have access to "natural" or
"quasi-experiments"
Endogeneity
But the problem we face in economics is that we are
usually working with observational data!
• Sources of endogeneity:
1. Omitted variables
• In many cases important characteristics cannot be observed AND
these are often correlated with observed explanatory information.
2. Measurement error: variables are measured with error
3. Simultaneity: two or more variables are simultaneously
determined
• X causes Y but Y also causes X, X is jointly determined with Y
– quantity and price by demand and supply
– investment and productivity
– sales and advertising
Endogeneity
Why is it so important?
– Contemporary empirical economics agenda looks to answer
specific, highly focused questions
• Targets the causal effects of a single factor
• EG: the effects of immigration on wages
• EG: the effects of democracy on GDP growth
– Wages and Education example:
• FOCUS: Causal effect of an additional year of education
on wages
• Not to "explain" wages
• Other regressors are included as controls – included in
service of this focused causal agenda
(Angrist and Pischke 2017 http://ftp.iza.org/dp10535.pdf)
Endogeneity
With endogeneity, the OLS estimator is biased and
inconsistent.
Solutions to endogeneity problems include:
– Proxy variables method for omitted regressors (W 9.2): a proxy is
something related to the unobserved factor that we control for
– Instrumental variables (IV): the most well-known method to address endogeneity
problems
– Fixed effects methods, if 1) panel data is available, 2) the endogeneity
is time-constant, and 3) the regressors are not time-constant
– Random effects methods: 1) again need panel data; 2) require
stronger assumptions
Endogeneity
Proxy Variables
Reference Wooldridge 9.2
Proxy Variables
Using proxy variables for unobserved explanatory variables
• In our wage – educ example,
$\ln(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u$
• In general, the estimates for the returns to education and
experience will be biased
– Omitted unobservable ability
• Idea: find a proxy variable for ability which is able to control for
ability differences between individuals
– possible proxy for ability: IQ score or similar test scores (abil is replaced by the proxy)
=> coefficients of the other variables, e.g. educ or exper, will not be
biased.
Using proxy variables for unobserved explanatory variables
General approach to using proxy variables
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3^* + u$, where $x_3^*$ is the omitted variable, e.g. ability
• Goal: Estimate the ceteris paribus effect of $x_1$ on $y$ holding
$x_2$ and $x_3^*$ fixed.
• We want to control for $x_3^*$
• But we do NOT plan to, or hope to, estimate the causal
effect of $x_3^*$
Proxy Variables
Using proxy variables for unobserved explanatory variables
• $x_3^*$ is unobserved
• Instead, we have $x_3$ that we can observe
• And $x_3$ is related to $x_3^*$.
• Recognising that the relationship between $x_3^*$ and $x_3$ is not
a perfect one, and assuming it is linear, we can write:
$x_3^* = \delta_0 + \delta_3 x_3 + v_3$
where $x_3^*$ is the omitted variable (e.g. ability), $x_3$ is its proxy, and the error $v_3$ reflects the imperfect linear relationship
between the omitted variable and its proxy
Proxy Variables
Assumptions necessary for the proxy variable method to
be valid:
1) The error u is uncorrelated with all the explanatory
variables ($x_1$, $x_2$, $x_3^*$) AND uncorrelated with the
proxy $x_3$
• ZCM assumption for all variables used in the model
• In our example this implies that IQ is irrelevant in the
population model once educ, exper, abil have been
included
• In other words, the proxy is "just a proxy" for the omitted
variable: it does not belong in the population regression
and it is uncorrelated with the population regression
error: $Cov(x_3, u) = 0$
Estimation with Proxy Variables
Assumptions necessary for the proxy variable method to be valid:
2) The proxy variable is a "good" proxy for the omitted variable
– Correlated with the omitted variable
– And using other variables in addition will not help to predict the omitted
variable: $E(x_3^* \mid x_1, x_2, x_3) = E(x_3^* \mid x_3) = \delta_0 + \delta_3 x_3$
• In our example: $E(abil \mid educ, exper, IQ) = E(abil \mid IQ) = \delta_0 + \delta_3 IQ$
– ability is not correlated with educ or exper once we control for IQ
⇒ the average level of ability only changes with IQ, not with educ or exper
=> x3 is such a good proxy for x3* that, once x3 is
known, neither x1 nor x2 would help to predict x3*
Estimation with Proxy Variables
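• Under these assumptions, the proxy variable method works. Substituting $x_3^* = \delta_0 + \delta_3 x_3 + v_3$ into the population model gives:
$y = (\beta_0 + \beta_3\delta_0) + \beta_1 x_1 + \beta_2 x_2 + \beta_3\delta_3 x_3 + (u + \beta_3 v_3)$
• In this regression model, the error term $u + \beta_3 v_3$ is uncorrelated with
all explanatory variables.
• As a consequence, all coefficients will be correctly estimated
using OLS.
• The coefficients for the explanatory variables x1 and x2 will be
correctly identified.
• The coefficient for the proxy variable may also be of interest
(it is $\beta_3\delta_3$, a multiple of the coefficient $\beta_3$ of the omitted variable).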
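The proxy variable logic can be illustrated with simulated data. The sketch below is a simplified version of the wage example with only educ as an observed regressor (exper is left out for brevity); the parameter values, the way IQ and ability are generated, and the helper `ols` are all assumptions made for illustration. Ability is constructed so that, given IQ, education does not help predict it, which is the "good proxy" assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
beta0, beta1, beta3 = 1.0, 0.08, 0.5   # assumed population parameters
delta0, delta3 = 0.0, 0.7              # assumed: abil = delta0 + delta3*IQ + v3

iq = rng.normal(size=n)                          # standardised IQ score (observed proxy)
educ = 12 + 1.5 * iq + rng.normal(size=n)        # education is correlated with IQ
abil = delta0 + delta3 * iq + rng.normal(scale=0.5, size=n)   # unobserved in practice
lwage = beta0 + beta1 * educ + beta3 * abil + rng.normal(scale=0.3, size=n)

def ols(dep, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(dep)), *cols])
    return np.linalg.lstsq(X, dep, rcond=None)[0]

b_omit = ols(lwage, educ)         # ability omitted, no proxy
b_proxy = ols(lwage, educ, iq)    # IQ used in place of ability

print("return to educ, ability omitted:", b_omit[1])
print("return to educ, IQ as proxy:    ", b_proxy[1])
print("coefficient on IQ:", b_proxy[2], "vs beta3*delta3 =", beta3 * delta3)
```

In this assumed design the regression without the proxy overstates the return to education, while the regression with IQ recovers a slope close to beta1, and the IQ coefficient is close to beta3*delta3 rather than beta3.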
Estimation with Proxy Variables
Discussion of the proxy assumptions in the wage example
– Assumption 1: Should be fulfilled as IQ score is not a direct
wage determinant; what matters is how able the person is at
work
– Assumption 2: Most of the variation in ability should be
explainable by variation in IQ score, leaving only a small
remaining correlation, if any, with educ and exper
Estimation with Proxy Variables
Q: What if the firm gives an IQ test before hiring?
Using lagged dependent variables as proxy variables
– In many cases, omitted unobserved factors may be proxied
by the value of the dependent variable from an earlier time
period
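• Example: City crime rates
$crime = \beta_0 + \beta_1 unem + \beta_2 expend + \beta_3 crime_{-1} + u$
where unem is the unemployment rate, expend is law enforcement expenditure and $crime_{-1}$ is last year's crime rate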
– Including the past crime rate will at least partly control for
the many omitted factors that also determine the crime
rate in a given year
– Another way to interpret this equation is that one compares
cities which had the same crime rate last year; this avoids
comparing cities that differ very much in unobserved crime
factors
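A rough simulation can show why a lagged dependent variable helps. The data-generating process below is entirely hypothetical: each city has an unobserved crime factor `c` that raises both law enforcement spending and crime, and last year's crime rate is a noisy measure of `c`. The true effect of spending on crime is set to -0.5.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
beta = -0.5                                  # assumed true effect of expenditure on crime

c = rng.normal(size=n)                       # unobserved city-specific crime factor
expend = c + rng.normal(size=n)              # high-crime cities spend more
crime_prev = c + rng.normal(scale=0.5, size=n)          # last year's crime rate
crime = beta * expend + c + rng.normal(scale=0.5, size=n)

def ols(dep, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(dep)), *cols])
    return np.linalg.lstsq(X, dep, rcond=None)[0]

b_no_lag = ols(crime, expend)[1]
b_lag = ols(crime, expend, crime_prev)[1]
print("true effect:", beta)
print("without lagged crime:", b_no_lag)
print("with lagged crime:   ", b_lag)
```

In this assumed setup the estimate with the lagged crime rate is closer to the true effect than the estimate without it, but some bias remains, which is consistent with the lag only partly controlling for the omitted factors.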
Estimation with Proxy Variables
Next week
Specification Issues II
• Functional form misspecification
• Measurement Error
• Missing Data, Non-random samples
• Outliers
References: Chp 9 (9.1, 9.4, 9.5, 9.6)
Midterm Exam Arrangements…
• Exam is in 2 weeks: Wed 6 October 2021, at 6pm = Week 8
• ONLINE EXAM
– QUIZ + UPLOAD of handwritten answers for most questions
• 50 minute exam, worth 20%
• Open book exam
– but you will be busy – so you should prepare as though it’s a closed book
exam - you should not expect to have time to find the answer in the
textbook, etc
– Formula sheets and statistical tables will be provided – you can view these
ahead of time on Canvas
• The exam covers lectures from Weeks 1 – 7 inclusive
– Wooldridge Chapters 1 – 5, 6.1 – 6.3, 7.1 – 7.5, 9.1, 9.2, 9.4-9.6
• Practice questions are also available on Canvas
• Keep an eye out on Canvas/Ed for announcements with further
details