ECON6300/7320
Advanced Microeconometrics
Review of Multiple Regression and
M-estimation
Fu Ouyang
University of Queensland
Lecture 2
Features of microeconometrics (1)
▶ Data pertain to firms, individuals, households, etc
▶ Focus on "outcomes", and relationships linking outcomes
to actions of individuals
▶ earnings = f(hours worked, years of education, gender,
experience, institutions)
▶ Heterogeneity of economic subjects’ preferences,
constraints, goals, etc. is explicitly acknowledged (no
"representative agent" assumption)
▶ Noisy data, large samples
▶ Economic factors supplemented by social, spatial,
temporal interdependence
Features of microeconometrics (2)
▶ Sources of data:
▶ Surveys (Govt/private); cross section or longitudinal (panel)
▶ Census
▶ Administrative data (by-products: tax-related, health-related,
program-related)
▶ Natural experiments
▶ Designed experiments
▶ Randomized trials with controls
▶ Type of data impacts method and model used in analysis
Features of microeconometrics (3)
▶ Measures of "outcomes"
▶ Continuous (e.g. earnings)
▶ Discrete (binary or multinomial choice as in discrete
choice models) or integer-valued (number of doctor visits)
▶ Partially observed/censored (hours of work)
▶ Proportions or intervals
▶ Type of measure may affect the choice of model used
▶ Many types of regression models
Objectives of econometric models
1. Data description and summary of associations between
variables
2. Conditional prediction
3. Estimation of causal ("structural") parameters
– Inference about structural parameters and
interdependence between endogenous variables
4. Policy analysis, prospective and retrospective
– Simulation of counter-factual scenarios to address "what
if" type questions
– Analysis of interventions, both actual and hypothetical
5. Empirical confirmation or refutation of hypotheses
regarding microeconomic behavior.
An example of Mincerian earnings regression
ln E = β0 + β1yreduc + β2age + β3occ + w′γ + ε
1. Does this regression equation (with perhaps a small
number of regressors) provide a good fit to the sample
data? Is the fit improved by adding age² to the regression?
[Data description]
2. Is the regression equation a good predictor of earnings at
different ages and occupations? [Conditional prediction]
3. What does the regression say about the rate of return to an
extra year of education? [Causal parameter]
4. Can the regression be used to explain the sources of
earnings differential between male and female workers?
[Counterfactual analysis]
▶ These seemingly different objectives are connected, but
may imply differences in emphasis on various aspects of
modeling
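As an illustration, a minimal R sketch of how such a regression might be fit; the data frame earnings_df and its variable names are hypothetical.

  # Hypothetical data: log earnings (log_earn), years of education (yreduc),
  # age, occupation (occ), and other controls w1, w2.
  fit <- lm(log_earn ~ yreduc + age + occ + w1 + w2, data = earnings_df)
  summary(fit)                           # fit and coefficients [objective 1]
  fit2 <- update(fit, . ~ . + I(age^2))  # does adding age^2 improve the fit?
  anova(fit, fit2)                       # F-test comparing the two models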
Regression decomposition - an example of
counterfactual analysis
▶ Consider the problem of explaining male-female earnings
differential
Y^g_i = β^g_0 + ∑_k x_ik β^g_k + ε^g_i,  g = M, F

∆̂ = Ȳ^F − Ȳ^M
   = (β̂^F_0 − β̂^M_0) + ∑^K_{k=1} (β̂^F_k − β̂^M_k) x̄^F_k + ∑^K_{k=1} (x̄^F_k − x̄^M_k) β̂^M_k + R
▶ This is counterfactual analysis as it answers the question:
What if certain differentials were equalized?
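A sketch of this decomposition in R, assuming two hypothetical samples df_F and df_M with outcome y and regressors x1, x2. With OLS and an included intercept the decomposition is exact, so the remainder R is zero in this version.

  fm <- y ~ x1 + x2
  fit_F <- lm(fm, data = df_F)
  fit_M <- lm(fm, data = df_M)
  xbar_F <- colMeans(model.matrix(fm, df_F))  # group means, incl. intercept
  xbar_M <- colMeans(model.matrix(fm, df_M))
  bF <- coef(fit_F); bM <- coef(fit_M)
  gap   <- mean(df_F$y) - mean(df_M$y)        # mean outcome differential
  coefs <- sum((bF - bM) * xbar_F)            # part due to coefficients
  endow <- sum((xbar_F - xbar_M) * bM)        # part due to characteristics
  c(gap = gap, coefficients = coefs, endowments = endow)  # gap = coefs + endow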
Structural vs Reduced form models
▶ Structural models are very highly structured, derived from
detailed specification of: underlying economic behavior; institutional
set-up, constraints and administrative information;
statistical and functional form assumptions; and
agents’ optimizing behavior.
▶ Structural models can be preferable for modelling
objectives 3-5 (causal parameters, policy analysis,
confirmation/refutation of microeconomic theory)
▶ Reduced form studies aim to uncover correlations
and associations among variables.
▶ Reduced form models can be preferable for modelling
objectives 1-3 (data description, prediction, causal
parameters)
General set-up and notation
▶ Data: y (N × 1), X (N × K)
▶ A joint unknown population distribution of data: f (y,X;θ0),
where both f and θ0 are unknown
▶ Three approaches:
1. Fully parametric: assume f is given, θ0 is finite dimensional
but unknown
2. Semi-parametric: assume that θ0 is finite dimensional but
unknown; we specify some moment functions for y, e.g.
E[y|X] or Var[y|X], but do not want to make
assumptions about the distribution f(·)
3. Nonparametric: assume that θ0 is infinite dimensional, and
we want to estimate the relation between y and X without
making a parametric assumption about f (.)
▶ θ0 : vector of mean and variance parameters in the
relationships to be estimated
▶ θ̂ : the estimator of θ0 based on sample of observations
from the population of interest.
▶ In general θ̂ ≠ θ0; the sampling error (θ̂ − θ0) has a statistical
distribution
▶ Ideally the distribution of θ̂ is centered on θ0 (unbiased
estimator) with high precision (efficiency property), and
a known distribution, to support statistical inference
(probability statements and hypothesis testing).
▶ Consistency means θ̂ p→ θ0.
General approach to estimation and inference
▶ Model specification and identification
▶ Which specification/restrictions are reasonable?
▶ Can the parameter θ0 be recovered given infinite data?
▶ Correct model specification or correct specification of key
components of the model given the data we have available
is necessary for consistency
▶ Qualification: All models are necessarily misspecified as
they are simplifications
▶ Under additional assumptions the estimators are
asymptotically normally distributed,
▶ i.e. the sampling distribution is well approximated by the
multivariate normal in large samples:
θ̂ a∼ N[θ, V[θ̂]]
where V[θ̂] denotes the (asymptotic) variance-covariance
matrix of the estimator (VCE).
▶ Efficient estimators have small variance
▶ In many (most) cases the large-sample (normal) distribution of
θ̂ is the best we can do. Hence inference on θ̂ is based on
distributions derived from the normal
▶ Test statistics based on (asymptotic) normal results include
z-test, t-test, Wald test, F-test,...
▶ Standard errors of the parameter estimates are obtained
from V̂ [θ̂].
▶ Different assumptions about the data generating process
(DGP), such as heteroskedasticity, can lead to different
VCE.
OLS
▶ Linear regression estimated by least squares can be
regarded as semi-parametric
▶ Goal: to estimate the linear conditional mean function

E[yi|xi] = x′iβ = β1xi1 + β2xi2 + · · · + βKxiK,  (1)

where usually an intercept is included so xi1 = 1.
▶ E[yi|xi] is of direct interest if the goal is prediction based on x′iβ
▶ Econometricians are often interested in marginal effects (e.g. of a price
change on quantity transacted): ∂E[yi|xi]/∂xij = βj.
▶ The linear regression has two components, the conditional
mean and the error:

yi = E[yi|xi] + ui  (2)
yi = x′iβ + ui,  i = 1, . . . , N.  (3)
OLS (1)
▶ Recall: y is the N × 1 column vector with ith entry yi, and X is the N × K
regressor matrix with ith row x′i.
▶ The convention is that all vectors are column vectors, with
transposes used when row vectors are desired.
▶ In matrix notation:
y = Xβ + u
▶ The objective function is the sum of squared errors,

QN(β) = (y − Xβ)′(y − Xβ) ≡ ∑Ni=1 (yi − x′iβ)²

which is minimized with respect to β
▶ Solving the FOC (first-order conditions) X′(y − Xβ) = 0 using
calculus methods yields the OLS solution
▶ Matrix notation provides a very compact way to represent
estimator and variance matrix formulas that involve sums
of products and cross-products.
OLS (2)
▶ The OLS estimator can be written in matrix or mixed
matrix-scalar notation:
β̂ = (X′X)−1X′y = (∑Ni=1 xix′i)−1 ∑Ni=1 xiyi,

i.e. the K × K matrix of regressor cross-products

[ ∑i x²i1     ∑i xi1xi2   · · ·   ∑i xi1xiK ]
[ ∑i xi2xi1   ∑i x²i2                       ]
[ ⋮                       ⋱                 ]
[ ∑i xiKxi1   · · ·               ∑i x²iK   ]

inverted and multiplied into the K × 1 vector (∑i xi1yi, ∑i xi2yi, . . . , ∑i xiKyi)′.
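A quick numerical check of the matrix formula in R (a sketch with simulated data; all names are illustrative):

  set.seed(1)
  N <- 100
  X <- cbind(1, runif(N), rnorm(N))                 # intercept + two regressors
  y <- drop(X %*% c(1, 2, -1) + rnorm(N))
  beta_hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y
  cbind(beta_hat, coef(lm(y ~ X - 1)))              # matches lm() exactly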
Properties of OLS estimator
▶ Properties of any estimator depend on assumptions about
the DGP.
▶ For the linear regression model this reduces to
assumptions about the regression error ui .
▶ As a starting point in regression analysis it is typical to
assume:
1. E[ui|xi] = 0 (exogeneity).
2. E[u²i|xi] = σ² (conditional homoskedasticity).
3. E[uiuj|xi, xj] = 0, i ≠ j (conditional uncorrelatedness).
4. ui ∼ i.i.d. N[0, σ²] (not essential for estimation but often
added for simplicity).
Properties of OLS estimator (1)
▶ Assumption 1 is essential for consistent estimation of β,
and implies that the conditional mean given in (1) is
correctly specified.
▶ It also implies linearity and no omitted variables. Linearity
in variables can be relaxed.
▶ Assumptions 2-3 determine the form of the VCE of β̂.
▶ Assumptions 1-3 (assuming also no perfect collinearity)
lead to β̂ being asymptotically normally distributed with
default estimator of the VCE
V̂default[β̂] = s²(X′X)−1,  (4)

where ûi = yi − x′iβ̂ and s² = (N − K)−1 ∑i û²i.
Properties of OLS estimator (2)
▶ β̂ converges in probability to β and s² to σ²
▶ Under assumptions 1-4, (β̂j − βj)/se(β̂j) is exactly t-distributed.
▶ Assumption 4 is not always made. If it is not, it is common to
continue to use the t-distribution for hypothesis testing and
confidence intervals (as opposed to the standard normal),
hoping that it provides a better finite-sample approximation.
▶ Under assumptions 2-3, OLS is efficient (Gauss-Markov). If assumptions
2-3 are relaxed, OLS is no longer efficient.
Heteroskedasticity-robust standard errors
▶ If assumption 1 holds, but 2 or 3 do not, we have
heteroskedastic or dependent errors.
▶ Then the variance estimated using the default formula is wrong
▶ A heteroskedasticity-robust estimator of the VCE of the
OLS estimator is

V̂robust[β̂] = (X′X)−1 (∑i û²i xix′i) (X′X)−1.  (5)
▶ For cross-section data the above "robust estimator" is
widely used as the default variance matrix estimate in most
applied work
▶ In R, the lm_robust() function from the estimatr package
computes the OLS estimator with heteroskedasticity-robust
SEs (see the sketch below).
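For example (a sketch with simulated heteroskedastic data; lm_robust() reports robust SEs by default, with the type adjustable via its se_type argument):

  library(estimatr)
  set.seed(2)
  df <- data.frame(x = rnorm(200))
  df$y <- 1 + 2 * df$x + rnorm(200, sd = abs(df$x))  # error variance depends on x
  summary(lm_robust(y ~ x, data = df))               # robust SEs by default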
When assumptions fail
▶ "All models are lies but they get us closer to the truth."
▶ A specified/assumed model is a "pseudo-true" model, our
approximation to the unknown DGP.
▶ Goal: Get the best estimates of the assumed model
(usually an approximation)
▶ Use diagnostic checks to see if the approximation can be
improved
Common failures
▶ Some omitted variable bias is unavoidable in practice.
▶ Suppose the correct regression is y = Xβ + Zγ + u but Z is
incorrectly omitted.
▶ Consequences: modeling objectives 3-5 (see the objectives
listed earlier) are affected, but 1-2 (data description and prediction) are not
▶ β̂ = (X′X)−1X′y is biased: E[β̂|X, Z] = β + (X′X)−1X′Zγ,
where the second term measures the bias
▶ β̂ suffers from confounding (i.e., its value depends on Zγ)
and β is not identified
▶ However, Xβ̂ is still useful to predict y
Common failures (1)
▶ Potentially a very long list. The most important are:
1. Omitted variables (unobserved factors that affect economic
behavior - e.g., business confidence)
2. Misspecified functional forms (departures from linearity)
3. Ignoring endogenous regressors
4. Ignoring measurement errors in regressors
5. Ignoring violations of "classical" assumptions
(heteroskedasticity, serial and cross section dependence)
Regression diagnostics and tests
It is usual to apply diagnostic checks of model specification.
▶ A standard modeling cycle has four steps:
specification → estimation → diagnostics → re-estimation
▶ Diagnostic checks involve testing a restricted model
against a less restricted model
▶ Ex. 1: fewer regressors vs. more regressors (e.g. F-tests)
▶ Ex. 2: homoskedastic errors vs. heteroskedastic errors
(e.g. tests of homoskedasticity)
▶ Ex. 3: nonlinear regression vs. linear regression (tests of
nonlinearity)
▶ Ex. 4: serially independent errors vs. dependent errors
(tests of serial correlation)
▶ Regression is almost always followed by post-regression
analysis involving diagnostics (a sketch follows below)
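As an illustration, two common diagnostics in R (a sketch; bptest() and resettest() are from the lmtest package, and the model and data are hypothetical):

  library(lmtest)
  fit <- lm(y ~ x1 + x2, data = df)
  bptest(fit)     # Breusch-Pagan: homoskedastic vs. heteroskedastic errors
  resettest(fit)  # Ramsey RESET: linear vs. nonlinear specification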
Properties of OLS (formal)
Model
y = Xβ + u, E [u|X] = 0
Unbiasedness
E[β̂|X] = E[(X′X)−1X′y|X]
        = E[(X′X)−1X′(Xβ + u)|X]
        = (X′X)−1(X′X)β + (X′X)−1X′E[u|X]
        = β + (X′X)−1X′ · 0 = β

Then, by the law of iterated expectations,

E[β̂] = E[E[β̂|X]] = β
Properties of OLS (formal)
Variance
V[β̂|X] = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])′ | X]
We showed that β̂ − E [β̂|X] = β̂ − β, so
V [β̂|X] = E [(β̂ − β)(β̂ − β)′|X]
Now
β̂ − β = (X′X)−1X′y − β
       = (X′X)−1X′(Xβ + u) − β
       = (X′X)−1(X′X)β + (X′X)−1X′u − β
       = β + (X′X)−1X′u − β
       = (X′X)−1X′u

and (β̂ − β)′ = u′X(X′X)−1, so

(β̂ − β)(β̂ − β)′ = (X′X)−1X′uu′X(X′X)−1
Properties of OLS (formal)
V[β̂|X] = E[(β̂ − β)(β̂ − β)′|X]
        = E[(X′X)−1X′uu′X(X′X)−1|X]
        = (X′X)−1X′E[uu′|X]X(X′X)−1

Now assume E[uu′|X] = Ω. For example, under assumptions
2-3 listed earlier (homoskedasticity, uncorrelatedness),
Ω = σ²IN. So

V[β̂|X] = (X′X)−1X′ΩX(X′X)−1

and V[β̂] = E[V[β̂|X]] = E[(X′X)−1X′ΩX(X′X)−1]. If Ω = σ²IN
this simplifies to

V[β̂] = σ²E[(X′X)−1(X′X)(X′X)−1] = σ²E[(X′X)−1]
Properties of OLS (formal)
Consistency
Recall β̂ = β + (X′X)−1X′u so
β̂ = β + (X′X/N)−1(X′u/N)
Now (X′X/N)kl = (∑Ni=1 xikxil)/N, so by the WLLN

(X′X/N)kl p→ E[xikxil]

entry by entry, and hence

X′X/N p→ [ E[x²i1]     E[xi1xi2]  · · · ]
         [ E[xi1xi2]   E[x²i2]    · · · ]
         [ ⋮                      ⋱    ]
Properties of OLS (formal)
so by Slutsky lemma

(X′X/N)−1 p→ [ E[x²i1]     E[xi1xi2]  · · · ]−1
             [ E[xi1xi2]   E[x²i2]    · · · ]     = Q
             [ ⋮                      ⋱    ]
for K × K symmetric matrix Q. Also, by WLLN
X′u/N = ( (∑Ni=1 xi1ui)/N, (∑Ni=1 xi2ui)/N, . . . )′ p→ ( E[xi1ui], E[xi2ui], . . . )′ = 0

so, by Slutsky lemma,

β̂ = β + (X′X/N)−1(X′u/N) p→ β + Q × 0 = β
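Consistency can be illustrated by simulation (a sketch with a hypothetical DGP): the OLS estimate concentrates around the true β as N grows.

  set.seed(42)
  beta <- 2
  ols_slope <- function(N) {            # OLS without intercept: sum(xy)/sum(x^2)
    x <- rnorm(N)
    y <- beta * x + rnorm(N)
    sum(x * y) / sum(x^2)
  }
  sapply(c(1e2, 1e4, 1e6), ols_slope)   # estimates approach beta = 2 as N grows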
Properties of OLS (formal)
Asymptotic Normality
Recall β̂ = β + (X′X)−1X′u so
√N(β̂ − β) = (X′X/N)−1(X′u/√N)

We already know that (X′X/N)−1 p→ Q. Also, by a central limit theorem,

X′u/√N = √N(X′u/N − 0) = √N(X′u/N − E[X′u/N]) d→ N(0, Σ)

where Σ is K × K with entries Σkl = Cov(xikui, xilui).
Properties of OLS (formal)
So by Slutsky lemma,

√N(β̂ − β) = (X′X/N)−1(X′u/√N) d→ Q × N(0, Σ) = N(0, QΣQ′) = N(0, QΣQ)

since Q is symmetric. Now if we take assumptions 2-3 (homoskedasticity,
uncorrelatedness) then

Σ = σ²Q−1

and

√N(β̂ − β) d→ N(0, σ²QQ−1Q) = N(0, σ²Q)
Properties of OLS (formal)
So in general
√N(β̂ − β) d→ N(0, QΣQ)

which means

β̂ a∼ N(β, QΣQ/N)

and with homoskedasticity and uncorrelatedness

β̂ a∼ N(β, σ²Q/N)

In practice we use

Q̂ = [ (∑Ni=1 x²i1)/N    (∑Ni=1 xi1xi2)/N  · · · ]−1
    [ (∑Ni=1 xi1xi2)/N  (∑Ni=1 x²i2)/N    · · · ]     = (X′X/N)−1
    [ ⋮                                   ⋱    ]

and σ̂² = s², so

β̂ a∼ N(β, s²(X′X)−1)
m-estimation
▶ We consider the very extensive topic of m-estimation.
Almost all estimation methods used in this class are
special cases of m-estimation.
▶ Examples: Least squares (OLS); non-linear least squares
(NLS); generalized least squares (GLS); generalized
method of moments (GMM); maximum likelihood (ML);
quantile regression (QR)
▶ Objective: Introduce key and useful asymptotic properties
of m-estimators
Basic set-up and notation
An m-estimator θ̂ of the q × 1 parameter vector θ is
an estimator that maximizes an objective function that is a sum
or average of N sub-functions,

QN(θ) = (1/N) ∑Ni=1 q(yi, xi, θ),  (6)

where q(·) is a scalar function, yi is the dependent variable, xi
is a regressor vector (of exogenous variables), and we assume
conditional independence over i.
▶ Common properties of q(·) - continuity and differentiability
with respect to θ
▶ m-estimation typically involves minimizing or maximizing a
specified objective function defined in terms of data and
unknown population parameters.
First order conditions
The estimator θ̂ that solves the first-order conditions
∂QN(θ)/∂θ|θ̂ = 0, or equivalently

(1/N) ∑Ni=1 ∂q(yi, xi, θ)/∂θ|θ̂ = 0,  (7)

is an m-estimator. This is a system of q estimating equations in
q unknowns that does not necessarily have a closed-form
solution for θ̂ in terms of the data (yi, xi), i = 1, . . . , N.
▶ The term m-estimator is interpreted as an abbreviation for
maximum-likelihood-like estimator.
▶ Many econometricians define an m-estimator as optimizing
over a sum of terms, as in (6).
▶ Other authors define an m-estimator as solutions of
equations such as (7).
Property                 Algebraic formula
Objective function       QN(θ) = N−1 ∑i q(yi, xi, θ) is maximized wrt θ
Examples                 MLE: qi = ln f(yi|xi, θ) is the log-density
                         NLS: qi = −(yi − g(xi, θ))² is minus the squared error
                         MM:  qi = (yi − g(xi, θ)) x′i xi (yi − g(xi, θ))
First-order conditions   ∂QN(θ)/∂θ = N−1 ∑Ni=1 ∂q(yi, xi, θ)/∂θ|θ̂ = 0
Example
Univariate distribution: yi (i = 1, . . . , N) is a 1/0 binary
variable generated by a Bernoulli trial with parameter π, the
target parameter of interest.

Method   Objective function                              First-order condition
OLS      QN = (1/N) ∑Ni=1 (yi − π)²  (minimized)         (1/N) ∑Ni=1 (yi − π) = 0
ML       QN = (1/N) ∑Ni=1 ln[π^yi (1 − π)^(1−yi)]        (1/N) ∑Ni=1 (yi − π) = 0
MM       QN = (1/N) ∑Ni=1 (yi − π)[π(1 − π)]−1(yi − π)   (1/N) ∑Ni=1 (yi − π) = 0

All three methods share the same first-order condition, so each delivers
π̂ = ȳ, the sample mean (a numerical check follows below).
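A quick numerical check in R (a sketch with simulated data): maximizing the average Bernoulli log-likelihood returns the sample mean, as the common first-order condition implies.

  set.seed(7)
  y <- rbinom(50, size = 1, prob = 0.3)   # Bernoulli(pi) sample
  loglik <- function(p) mean(y * log(p) + (1 - y) * log(1 - p))
  opt <- optimize(loglik, c(0.001, 0.999), maximum = TRUE)
  c(ml = opt$maximum, ybar = mean(y))     # both equal the sample mean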
Variance estimation for m-estimators
▶ For all m-estimators we can obtain an expression for the
stochastic error of the estimator, e.g. for OLS
β̂ − β = (X′X)−1X′u
▶ We can then derive the expression for the asymptotic
variance of the estimator.
▶ Two approaches are possible
1. Derive the variance expression assuming that the errors
are i.i.d. (restrictive)
2. Derive the variance expression assuming that the errors
are heteroskedastic or serially correlated (less restrictive).
▶ The second approach yields a robust variance estimator
relative to the i.i.d. case
▶ Example of least squares is given below.
Standard vs robust variance estimation
Standard version (assume ui are i.i.d.; V[u|X] = σ²IN):

β̂ = (X′X)−1X′y = (X′X)−1X′(Xβ + u) = β + (X′X)−1X′u
β̂ − β = (X′X)−1X′u
V[β̂|X] = E[(X′X)−1X′uu′X(X′X)−1|X] = σ²(X′X)−1
σ̂² = û′û/(N − K)
V̂[β̂] = σ̂²(X′X)−1

Robust version (two-step; assume ui are not i.i.d., V[u|X] = Ω ≠ σ²IN):

β̂ = (X′X)−1X′y and β̂ − β = (X′X)−1X′u as before
V[β̂|X] = E[(X′X)−1X′uu′X(X′X)−1|X] = (X′X)−1(X′ΩX)(X′X)−1
(X′ΩX) is estimated by ∑Ni=1 û²i xix′i
V̂[β̂] = (X′X)−1(∑Ni=1 û²i xix′i)(X′X)−1
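Both columns can be computed directly (a sketch with simulated heteroskedastic data; the robust column here is the HC0 form, without finite-sample correction):

  set.seed(3)
  N <- 500
  X <- cbind(1, rnorm(N))
  u <- rnorm(N, sd = exp(X[, 2]))              # heteroskedastic errors
  y <- X %*% c(1, 2) + u
  b    <- solve(crossprod(X), crossprod(X, y))
  uhat <- drop(y - X %*% b)
  XtXi <- solve(crossprod(X))
  V_std <- sum(uhat^2) / (N - 2) * XtXi        # s^2 (X'X)^{-1}
  meat  <- crossprod(X * uhat)                 # sum_i uhat_i^2 x_i x_i'
  V_rob <- XtXi %*% meat %*% XtXi              # sandwich formula
  sqrt(cbind(default = diag(V_std), robust = diag(V_rob)))  # the SEs differ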
Efficiency of OLS
▶ Given the i.i.d. assumption and exogeneity of regressors,
the OLS estimator is unbiased (and consistent) and efficient.
(Gauss-Markov theorem.)
▶ The i.i.d. assumption is violated if errors are
heteroskedastic or serially correlated, in which case
V[u] = Ω ≠ σ²IN
Two possible structures for N=5
ΩN×N = [ σ1²   0     0     0     0   ]
       [ 0     σ2²   0     0     0   ]
       [ 0     0     σ3²   0     0   ]      (heteroskedasticity only)
       [ 0     0     0     σ4²   0   ]
       [ 0     0     0     0     σ5² ]

ΩN×N = [ σ1²   σ12   0     0     0   ]
       [ σ12   σ2²   σ23   0     0   ]
       [ 0     σ23   σ3²   σ34   0   ]      (heteroskedasticity and
       [ 0     0     σ34   σ4²   σ45 ]       first-order serial correlation)
       [ 0     0     0     σ45   σ5² ]
Properties of OLS vs. GLS
▶ When V[u] = Ω ≠ σ²IN, OLS remains consistent but the GLS estimator is more efficient.
▶ Two alternatives are: (i) use feasible two-step GLS, or (ii)
use the robustified estimator of V̂ [β̂], which requires fewer
assumptions.
▶ The idea behind robust variance estimator can be
extended to other M-estimators.
Generalized Least Squares Estimator
GLS (Ω known):

β̂GLS = (X′Ω−1X)−1X′Ω−1y
      = (X′Ω−1X)−1X′Ω−1(Xβ + u) = β + (X′Ω−1X)−1X′Ω−1u
β̂GLS − β = (X′Ω−1X)−1X′Ω−1u
V[β̂GLS|X, Ω] = E[(β̂GLS − β)(β̂GLS − β)′|X, Ω] = (X′Ω−1X)−1
V̂[β̂GLS] = (X′Ω−1X)−1

FGLS (Ω unknown):

β̂ = (X′X)−1X′y is consistent. Assume Ω = Ω(θ), where θ can be
consistently estimated given β̂; set Ω̂ = Ω(θ̂).
β̂FGLS = (X′Ω̂−1X)−1X′Ω̂−1y
V[β̂FGLS|X, Ω̂] = (X′Ω̂−1X)−1
β̂FGLS is asymptotically equivalent to β̂GLS because Ω̂ p→ Ω
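A minimal two-step FGLS sketch in R for pure heteroskedasticity, assuming a skedastic function Var(ui|xi) = exp(θ0 + θ1xi); the data frame df and its variables are hypothetical.

  # Step 1: use OLS residuals to estimate the skedastic function Omega(theta).
  ols <- lm(y ~ x, data = df)
  aux <- lm(log(resid(ols)^2) ~ x, data = df)  # estimates theta
  w   <- exp(fitted(aux))                      # estimated Var(u_i | x_i)
  # Step 2: GLS with diagonal Omega-hat is weighted least squares.
  fgls <- lm(y ~ x, data = df, weights = 1 / w)
  summary(fgls)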
Why m-estimation?
▶ Large sample optimality of m-estimators
▶ Consistency and asymptotic normality
Property                  Algebraic formula
Consistency               Is plim QN(θ) maximized at θ = θ0?
Consistency (informal)    Does E[∂q(yi, xi, θ)/∂θ|θ0] = 0?
Limit distribution        √N(θ̂ − θ0) d→ N[0, A0⁻¹B0A0⁻¹], where
                          A0 = plim N−1 ∑Ni=1 ∂²qi(θ)/∂θ∂θ′|θ0
                          B0 = plim N−1 ∑Ni=1 (∂qi/∂θ)(∂qi/∂θ′)|θ0
Asymptotic distribution   θ̂ a∼ N[θ0, N−1Â⁻¹B̂Â⁻¹], where
                          Â = N−1 ∑Ni=1 ∂²qi(θ)/∂θ∂θ′|θ̂
                          B̂ = N−1 ∑Ni=1 (∂qi/∂θ)(∂qi/∂θ′)|θ̂
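These formulas can be checked on the simplest m-estimator, the sample mean: with qi = −(yi − θ)², we have ∂qi/∂θ = 2(yi − θ) and ∂²qi/∂θ² = −2, so θ̂ = ȳ and the sandwich SE essentially reproduces the textbook sd(y)/√N. A sketch:

  set.seed(11)
  y <- rnorm(200, mean = 5, sd = 2)
  N <- length(y)
  theta_hat <- mean(y)                    # maximizes QN for qi = -(yi - theta)^2
  A_hat <- -2                             # N^{-1} sum of second derivatives
  B_hat <- mean((2 * (y - theta_hat))^2)  # N^{-1} sum of squared first derivatives
  V_hat <- (1 / N) * B_hat / A_hat^2      # sandwich N^{-1} A^{-1} B A^{-1}
  c(sandwich_se = sqrt(V_hat), textbook_se = sd(y) / sqrt(N))  # nearly equal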