E300 Econometric Methods
Questions and Answers
Julius Vainora
University of Cambridge
Contents
1 How does E[εi | X] = 0 imply E[εixi] = 0? (Added Nov 1)
2 Somewhere I read about the homoskedasticity assumption in the form of Var[ε²i | xi] = σ². Why is it the case that the assumption sometimes includes the squared errors and sometimes only the error? (Added Nov 1)
3 What is the relationship between a homoskedasticity assumption Var[εi | X] = σ² and Var[εixi]? (Added Nov 1)
4 What are the elements characterizing a multivariate normal distribution? (Added Nov 1)
5 When manipulating equations, do we always premultiply on the left? (Added Nov 1)
6 Why is it that AR(1) + AR(1) gives ARMA(2,1)? (Added Nov 1)
7 What real-life processes can be modelled as an MA(q)? (Added Nov 1)
8 When does a single endogenous variable bias just its own coefficient? (Added Nov 1)
9 Does E[YtYt−1] = E[Yt−1Yt−2] require strong stationarity or is weak stationarity enough? (Added Nov 8)
10 On true and usual standard errors and their estimation. (Added Nov 8)
11 How are weak stationarity, stability, and weak dependence related? (Added Nov 8)
12 When testing for unit roots, does α represent the drift or the intercept (mean)? Is it not β that represents the drift/trend? Or should the dependent variable be ∆yt and not just yt? (Added Nov 12)
13 Why do we have standard testing for unit roots in the Case (3) of a Random walk with a drift α ≠ 0 / Nonzero mean stationary AR(1)? (Added Nov 12)
14 In what way is conditional unbiasedness stronger than unconditional unbiasedness? (Added Nov 14)
15 Is there ever a case in which an estimator is unconditionally unbiased but not so conditionally? (Added Nov 14)
16 How would Var[Yt] look like for a multivariate time series? (Added Nov 14)
17 How to understand all the different subscripts in our notation in time series? (Added Nov 14)
18 How do we get the formula for βˆ from Lecture 17, Slide 7? (Added Nov 14)
19 What is an I(0) process? (Added Nov 14)
20 If Yt has a unit root, when is it not the case that ∆Yt ∼ I(0)? (Added Nov 14)
21 What happens with the (T − 1)/T, (T − 2)/T, and so on factors in the limit when deriving the HAC standard errors? (Added Nov 14)
22 Out of the five settings for unit root testing, why did we consider just the second and the fifth ones? (Added Nov 14)
23 What does “non-standard” mean when talking about asymptotic distributions? (Added Nov 23)
24 Can we use leads and lags of xt instead of ∆xt in the Dynamic OLS to remove the correlation? (Added Nov 23)
25 If we have an m-dimensional VECM model, is it true that there cannot be m cointegrating relationships? (Added Nov 23)
26 I am confused about the expression for the Likelihood Ratio test statistic: shouldn’t we use the ratio λ = L(θ˜; Y)/L(θˆ; Y)? (Added Nov 23)
27 What is the difference between the residuals and the errors? (Added Nov 23)
28 Why do we sometimes have the i subscript and sometimes we do not? (Added Nov 23)
29 Where does the approximation −2 log λ ≈ mF come from? (Added Nov 23)
30 In the LR, W, and LM tests, how do we accept and reject the null? Is it right that the smaller the difference between θ0 and θˆML, the more likely we are to accept H0? (Added Nov 23)
31 Why is it that aN(0, b) ∼ N(0, a²b) if we know that Var[aX] = a²Var[X]? (Added Nov 24)
32 How does the standardisation of normal random variables, leading to a χ² distribution, work? (Added Nov 24)
33 When doing GLS, FGLS, or WLS, do I also have to transform the constant term? (Added Nov 27)
34 How to understand the FWL theorem and how to use it in practice? (Added Nov 27)
35 If we say that experience is exogenous, why do we include it as an instrument? Can’t we just drop it and obtain the same regression results? What is the rationale behind having an instrument for an exogenous variable? (Added Dec 8)
36 What are overidentifying restrictions and how to find the number of them? (Added Dec 8)
37 What did you mean by saying that GMM under heteroskedasticity with instruments is like 2SLS with GLS? Heteroskedasticity is about error variance and Optimal GMM uses moment conditions’ variance. Aren’t those very different things? (Added Dec 8)
38 When using MLE in a simple regression model, how can we relate individual normality of βˆ and of σˆ² with their joint normality? How is the latter better? (Added Dec 8)
39 How do we go from asymptotic normality to standard errors? (Added Dec 8)
40 We saw that doing 2SLS manually will lead to incorrect standard errors. Is there a way to also manually correct them? (Added Dec 8)
41 How do we have multiple expressions for βˆ2SLS? Are they really equivalent? (Added Dec 8)
42 How to show asymptotic normality of the 2SLS estimator under homoskedasticity? (Added Dec 8)
43 On (i) of A6 in the Final Exam of 2018. (Added Apr 21)
1 How does E[εi | X] = 0 imply E[εixi] = 0? (Added Nov 1)
It is a direct consequence of the Law of Iterated Expectations (LIE). As a special case of it, we have
that
E[Z] = E[E[Z |W]],
where Z and W are any random vectors. So, taking Z = εixi and W = X (the n × k regressor matrix) we have

E[εixi] = E[E[εixi | X]] = E[xi E[εi | X]] = 0

by the LIE in the first equality, the fact that E[g(X)Y | X] = g(X) E[Y | X] for any random variable Y and function g (here xi is a function of X) in the second, and the assumption E[εi | X] = 0 in the third. Hence, the LIE is very useful when we want to exploit information about some conditional expectation (in this case, E[εi | X] = 0).
More generally, we also have that
E[Z | U] = E[E[Z | U,W]],
although this still is not the most general expression for the LIE.
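As a quick numerical sanity check (my own sketch, not from the notes; the distributions and the heteroskedastic error below are arbitrary choices), the implication can be seen in simulation:

```python
import numpy as np

# Monte Carlo illustration: if E[eps_i | x_i] = 0, then by the LIE
# E[eps_i x_i] = 0 as well, even when eps_i is heteroskedastic in x_i.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(1.0, np.sqrt(2.0), n)   # regressor
u = rng.normal(0.0, 1.0, n)            # noise independent of x
eps = u * np.abs(x)                    # E[eps | x] = |x| * E[u] = 0

print(np.mean(eps * x))  # close to 0, consistent with E[eps_i x_i] = 0
```

The error here depends on x through its scale only, so exogeneity holds even though the error is far from independent of the regressor.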
2 Somewhere I read about the homoskedasticity assumption in the form of Var[ε²i | xi] = σ². Why is it the case that the assumption sometimes includes the squared errors and sometimes only the error? (Added Nov 1)

I am sure that instead of Var[ε²i | X] = σ² what you have seen is E[ε²i | X] = σ². Under exogeneity (E[εi | X] = 0) that is equivalent to what we are doing, since then

Var[εi | X] = E[ε²i | X] − E[εi | X]² = E[ε²i | X].

It is unusual to be interested in Var[ε²i | X], although we have seen that one may require Var[ε²i] < ∞ for a Law of Large Numbers on n⁻¹ ∑_{i=1}^{n} ε²i to work.
3 What is the relationship between a homoskedasticity assumption Var[εi | X] = σ² and Var[εixi]? (Added Nov 1)

As we know, under strict exogeneity (E[εi | X] = 0),

Var[εixi] = E[ε²i xix′i] − E[εixi] E[εix′i] = E[ε²i xix′i]

since strict exogeneity implies E[εixi] = 0. Hence,

Var[εixi] = E[ε²i xix′i].

Similarly,

Var[εi | X] = E[ε²i | X] − E[εi | X]² = E[ε²i | X] = σ².

Thus, if we do have homoskedasticity, using the Law of Iterated Expectations we have

Var[εixi] = E[ε²i xix′i] = E[E[ε²i xix′i | X]] = E[E[ε²i | X] xix′i] = E[σ² xix′i] = σ² E[xix′i],

just as in slide 18 of Lecture 6. If we do not have homoskedasticity, we stop at

Var[εixi] = E[ε²i xix′i]

as the expression cannot be simplified any further. And that makes sense: conditional heteroskedasticity means that the conditional variance (ε²i on average) depends on the values of X. So, if ε²i and xi are inside the expectation together, then we cannot just ignore this dependence and somehow separate them.
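The homoskedastic simplification is easy to see numerically; here is a sketch (my own code, with errors drawn independently of x, which is one way to guarantee homoskedasticity):

```python
import numpy as np

# Under exogeneity and homoskedasticity, Var[eps_i x_i] should approach
# sigma^2 * E[x_i x_i'] as the sample grows.
rng = np.random.default_rng(0)
n, sigma = 1_000_000, 2.0
x = np.column_stack([np.ones(n), rng.normal(size=n)])  # x_i = (1, x_i2)'
eps = rng.normal(0.0, sigma, n)                        # homoskedastic errors

lhs = np.cov(eps[:, None] * x, rowvar=False)           # sample Var[eps_i x_i]
rhs = sigma ** 2 * (x.T @ x) / n                       # sigma^2 * sample E[x_i x_i']
print(np.round(lhs, 2))
print(np.round(rhs, 2))  # the two matrices nearly coincide
```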
4 What are the elements characterizing a multivariate normal distribution? (Added Nov
1)
Specifically, this question relates to Problem 4 from Problem Set 2, where we had

(εi, xi)′ ∼ N( (0, 1)′, [1 0; 0 2] ).

Similarly to the univariate case, where X ∼ N(µ, σ²) means that E[X] = µ and Var[X] = σ², we have that

E[(εi, xi)′] = (0, 1)′

and

Var[(εi, xi)′] = [ Var[εi] Cov[εi, xi]; Cov[xi, εi] Var[xi] ] = [1 0; 0 2].
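A small simulation sketch (my own code; the sample size is arbitrary) confirms that the two parameters of the multivariate normal are exactly the mean vector and the variance-covariance matrix:

```python
import numpy as np

# Draw (eps_i, x_i)' from the bivariate normal of Problem 4 and check
# that the sample mean and covariance recover the stated parameters.
rng = np.random.default_rng(0)
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.0],
                [0.0, 2.0]])
draws = rng.multivariate_normal(mean, cov, size=500_000)  # rows are (eps_i, x_i)

print(draws.mean(axis=0))           # approx (0, 1)
print(np.cov(draws, rowvar=False))  # approx [[1, 0], [0, 2]]
```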
5 When manipulating equations, do we always premultiply on the left? (Added Nov 1)
The question refers to cases such as premultiplying Y = Xβ + ε by P⁻¹ on the left to get a GLS-transformed model Y˜ = X˜β + ε˜ with Y˜ = P⁻¹Y and so on. It is true that most of the time we premultiply on the left. A partial reason is that equations often involve a column vector, say n × 1, and such a vector can be multiplied by a matrix (anything other than a scalar or a row vector) only from the left.
6 Why is it that AR(1) + AR(1) gives ARMA(2,1)? (Added Nov 1)
Suppose that we have two uncorrelated AR(1) processes,

xt = φxt−1 + εx,t = εx,t / (1 − φL),
yt = ρyt−1 + εy,t = εy,t / (1 − ρL).

Let zt = xt + yt so that

zt = xt + yt = εx,t / (1 − φL) + εy,t / (1 − ρL).

Multiplying both sides by 1 − φL and 1 − ρL we have

(1 − φL)(1 − ρL)zt = (1 − ρL)εx,t + (1 − φL)εy,t,
zt − (φ + ρ)zt−1 + φρzt−2 = εx,t − ρεx,t−1 + εy,t − φεy,t−1.

The left-hand side nicely corresponds to an AR(2) in zt with a lag polynomial

Φ(L) = 1 − (φ + ρ)L + φρL².
We are not able to claim, however, that the right-hand side is an MA(1) in εt = εx,t + εy,t; that would be the case only if ρ = φ. Even so, it is possible to show that there exist some errors εt following an MA(1) process that share the same autocovariance function as our right-hand side; see Granger and Morris (1976) for more details.

Following the steps of this answer you may actually obtain a sketch of a proof of the general result for ARMA processes that we considered in the lecture.
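The lag-polynomial algebra can be verified numerically; this sketch (my own code, with arbitrary φ and ρ) checks the identity term by term on simulated paths:

```python
import numpy as np

# Check that for z_t = x_t + y_t with x_t, y_t independent AR(1) processes,
# z_t - (phi + rho) z_{t-1} + phi*rho z_{t-2}
#   = eps_x,t - rho eps_x,t-1 + eps_y,t - phi eps_y,t-1  holds exactly.
rng = np.random.default_rng(0)
T, phi, rho = 200, 0.5, 0.8
ex, ey = rng.normal(size=T), rng.normal(size=T)
x, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + ex[t]
    y[t] = rho * y[t - 1] + ey[t]
z = x + y

lhs = z[2:] - (phi + rho) * z[1:-1] + phi * rho * z[:-2]
rhs = ex[2:] - rho * ex[1:-1] + ey[2:] - phi * ey[1:-1]
print(np.allclose(lhs, rhs))  # True: the AR(2)-plus-lagged-errors identity holds
```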
7 What real-life processes can be modelled as an MA(q)? (Added Nov 1)
That is a good question. We have already seen a few partial answers. For instance, we know that
if we observe an AR(1) with a measurement error (white noise), it corresponds to an ARMA(1,1),
including an MA(1) part. We also considered news shocks (as εt) affecting price levels for q periods,
corresponding to an MA(q). It is harder, however, to find a process that is exactly MA(q), and
not just a part of it. One abstract example is given by Granger and Morris (1976): “a variable in
equilibrium but buffeted by a sequence of unpredictable events with a delayed or discounted effect
will give MA models”.
8 When does a single endogenous variable bias just its own coefficient? (Added Nov 1)
Consider a model
yi = β1xi,1 + · · · + βk−1xi,k−1 + βkxi,k + εi = x′i,−kβ−k + βkxi,k + εi = x′iβ + εi,

where E[xi,jεi] = 0 for j = 1, . . . , k − 1, but E[xi,kεi] ≠ 0, so that only xi,k is endogenous. Then for the OLS estimator, under the usual assumptions, we have

βˆ = ( n⁻¹ ∑_{i=1}^{n} xix′i )⁻¹ ( n⁻¹ ∑_{i=1}^{n} xiyi ) = β + ( n⁻¹ ∑_{i=1}^{n} xix′i )⁻¹ ( n⁻¹ ∑_{i=1}^{n} xiεi ) →p β + E[xix′i]⁻¹ E[xiεi].

Since E[xi,jεi] = 0 for j = 1, . . . , k − 1, but E[xi,kεi] ≠ 0, we have that

βˆ →p β + ( (E[xix′i]⁻¹)1,k · E[xi,kεi], (E[xix′i]⁻¹)2,k · E[xi,kεi], . . . , (E[xix′i]⁻¹)k,k · E[xi,kεi] )′,

where Ai,j denotes the element in row i and column j of a matrix A. Clearly (E[xix′i]⁻¹)k,k ≠ 0, so that βˆk does not converge in probability to βk. What about the other variables? We have that

βˆ−k →p β−k ⟺ (E[xix′i]⁻¹)j,k = 0 for j = 1, . . . , k − 1
⟺ E[xix′i]⁻¹ is block diagonal with (k − 1) × (k − 1) and 1 × 1 blocks
⟺ E[xix′i] is block diagonal with (k − 1) × (k − 1) and 1 × 1 blocks
⟺ E[xi,−kxi,k] = 0.

Hence, the endogeneity of xi,k will affect only its own coefficient if and only if E[xi,−kxi,k] = 0. Note that under E[xi] = 0 this is the same as Corr[xi,−k, xi,k] = 0.
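A simulation sketch of the k = 2 case (my own invented data-generating process, with the exogenous regressor uncorrelated with the endogenous one) shows the bias landing only on the endogenous coefficient:

```python
import numpy as np

# x1 is exogenous and uncorrelated with x2; x2 is endogenous because it
# shares the component u with the error. Only beta_2 should be inconsistent.
rng = np.random.default_rng(0)
n = 500_000
x1 = rng.normal(size=n)        # exogenous, E[x1 x2] = 0 by construction
u = rng.normal(size=n)
x2 = rng.normal(size=n) + u    # endogenous: E[x2 eps] = Var[u] = 1
eps = u + rng.normal(size=n)
beta = np.array([1.0, 2.0])
y = beta[0] * x1 + beta[1] * x2 + eps

X = np.column_stack([x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # first entry near 1 (consistent), second near 2.5 (biased up)
```

Here E[xix′i] = diag(1, 2) and E[xiεi] = (0, 1)′, so the probability limit is (1, 2.5)′, matching the formula above.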
9 Does E[YtYt−1] = E[Yt−1Yt−2] require strong stationarity or is weak stationarity enough?
(Added Nov 8)
Suppose that {Yt} is weakly stationary. Then we know that Cov[Yt, Ys] = γ(|t−s|). Also, by definition
of covariance,
Cov[Yt, Ys] = E[YtYs]− E[Yt]E[Ys].
Now considering E[YtYt−1] and E[Yt−1Yt−2], note that in both cases we have lag 1, and
Cov[Yt, Yt−1] = γ(1) = Cov[Yt−1, Yt−2].
Hence,
Cov[Yt, Yt−1] = E[YtYt−1]− E[Yt]E[Yt−1] = E[Yt−1Yt−2]− E[Yt−1]E[Yt−2] = Cov[Yt−1, Yt−2].
On the other hand, weak stationarity implies E[Yt] ≡ µ so that
Cov[Yt, Yt−1] = E[YtYt−1]− µ2 = E[Yt−1Yt−2]− µ2 = Cov[Yt−1, Yt−2]
and, hence, E[YtYt−1] = E[Yt−1Yt−2]. Thus, weak stationarity is enough for this result.
On the other hand, under weak stationarity of {Yt} we could not claim that, say,

E[Y²t Y²t−1] = E[Y²t−1 Y²t−2],

where, first of all, we would ask for those moments to be finite. Meanwhile, under strict stationarity
of {Yt} we have
E[f(Yt, Yt−1)] = E[f(Yt−1, Yt−2)]
for any f as long as the moments are finite. That is because strict stationarity gives, among other things, (Yt, Yt−1) =d (Yt−1, Yt−2), implying equality not only between such expectations, but also between any other such measures.
Lastly, notice that strict stationarity may even not imply E[YtYt−1] = E[Yt−1Yt−2]. For instance, if
{Yt} were i.i.d. following the Cauchy distribution, mixed moments like E[YtYt−1] would not even be
defined. As a result, such a process would not be weakly stationary either. This is a counterexample
showing that strict stationarity does not imply, in general, weak stationarity. If, however, Var[Yt] <
∞, then the implication does hold.
10 On true and usual standard errors and their estimation. (Added Nov 8)
This is to clarify some informal terminology. Suppose that we have three estimators consistent for
β, each perhaps with its own set of assumptions: β˜, β∗, and β¯. Assume that each is asymptotically
normal with respective asymptotic variance-covariance matrices V˜, V∗, and V¯. In particular, by
allowing for each of the estimators to have its own set of assumptions, we may, e.g., use the same OLS estimator three times, but each time under a different set of assumptions (e.g., under serial correlation and heteroskedasticity; only under heteroskedasticity; under the Gauss-Markov assumptions), which is
similar to what we often consider in lectures. Moreover, let V˜ > V∗ > V¯ so that, if all three
estimators were operating under the same assumptions, they could be ranked from the least efficient,
β˜, to the most efficient, β¯.
In terms of the true variance-covariance matrices, V˜, V∗, and V¯ are such for β˜, β∗, and β¯, respectively. If, for instance, we were to try to estimate and use V¯ to make inferences using β˜, we would be making a mistake and our inferences would be invalid.

By the usual variance-covariance matrix we essentially mean the case when the Gauss-Markov assumptions are satisfied. Notice that for a “clean” model Y = Xβ + ε that means σ² E[xix′i]⁻¹, but we may also have σ² E[xiΩ⁻¹x′i]⁻¹ if the Gauss-Markov assumptions hold for the GLS-transformed model.
11 How are weak stationarity, stability, and weak dependence related? (Added Nov 8)
This is a great question and it is important that you understand the differences. First let us recall
the definitions. Weak stationarity of {Yt} means that
E[Yt] ≡ µ, Var[Yt] ≡ σ2 <∞, and Cov[Yt, Ys] = γ(|t− s|).
We have defined stability (or, in some textbooks, causality) just for AR(p) processes (with a general-
ization for VAR(p)) and required the roots z∗ of the autoregressive lag polynomial to be greater than
one in absolute value, |z∗| > 1. Lastly, there is no formal definition of weak dependence. We said
that for ARMA processes a necessary condition is ρ(h)→ 0 as h→∞, which is helpful for intuition
but, again, it is only necessary. We also said that the purpose of weak dependence is for Laws of
Large Numbers and Central Limit Theorems to work, but there exist various versions of these results,
requiring different types and degrees of weak dependence. So, let us just say that the meaning of
weak dependence is case-dependent and, in particular, we will require the dependence to be weak
enough for some situation-specific asymptotic result to hold.
Although the previous paragraph already includes it, it is important to emphasize that weak sta-
tionarity and weak dependence are concepts for any stochastic process {Yt}, while we have defined
stability just for AR(p) (and, more generally, ARMA, VAR, VARMA) processes. (Although it would
be possible to consider causality or general stochastic processes as well.) So, that already creates
some distance between the concepts. Now let us consider the six possible implications between the
three concepts one by one.
AR(p) stability =⇒ Weak stationarity? Yes. We have shown this for AR(1), but it also holds
for general AR(p) processes when the errors are white noise. If the errors are i.i.d., then we get even
strict stationarity.
Weak stationarity =⇒ AR(p) stability? No. I have also mentioned this during a lecture.
Consider an AR(1) given by Yt = ρYt−1 +εt. If |ρ| = 1, then no stationary solution exists and instead
we have a unit root. If |ρ| < 1, we are in the usual, weakly stationary case. Now let |ρ| > 1. Then we can write Yt+1 = ρYt + εt+1 as

Yt = (1/ρ)Yt+1 − (1/ρ)εt+1 = · · · = (1/ρ^h)Yt+h − ∑_{i=1}^{h} (1/ρ^i)εt+i = −∑_{i=1}^{∞} (1/ρ^i)εt+i,

which is a stationary solution for Yt, although we do not have stability, and using the future to describe Yt is, indeed, exotic.
AR(p) stability =⇒ Weak dependence? Mostly yes. As I said, it depends on what kind of weak dependence we want. For instance, if we were interested in the convergence T⁻¹ ∑_{t=1}^{T} Yt →p E[Yt], then the answer would be positive, as ρ(h) → 0 as h → ∞ is enough for the convergence (we have not proved it), and ρ(h) → 0 is indeed implied by the AR(p) stability. The answer will remain positive for most other asymptotic results as well (except that we may need to additionally require something like E[ε⁴t] < ∞), although we could probably come up with something more complex where the stability of an AR(p) would not be enough.
Weakly dependent AR(p) =⇒ AR(p) stability? Probably. Here it again is impossible to be precise, as weak dependence is case-dependent. However, assuming that an AR(1) is weakly dependent in the sense that some asymptotic result holds for it, it would be hard or impossible to construct such a result that does not imply |ρ| < 1. In fact, we even said that ρ(h) → 0 (equivalent to |ρ| < 1 here) is a necessary condition for weak dependence.
Weak stationarity =⇒ Weak dependence? No. We have seen a trivial example of Yt ≡ Y, implying both strict stationarity and weak stationarity (if Var[Y] < ∞), but ρ(h) ≡ 1 ↛ 0 as h → ∞.

Weak dependence =⇒ Weak stationarity? No. For instance, our general process {Yt} may be weakly dependent in the sense that T⁻¹ ∑_{t=1}^{T} Yt →p E[Yt] is true, but each Yt may have a different variance.
12 When testing for unit roots, does α represent the drift or the intercept (mean)? Is it
not β that represents the drift/trend? Or should the dependent variable be ∆yt and
not just yt? (Added Nov 12)
First, if we have a stationary AR(1) given by yt = α+ ρyt−1 + εt, then α is not the mean; it relates
to the mean in the sense that α 6= 0 if and only if the mean of yt is non-zero. To be precise, we saw
that E[yt] = α/(1− ρ).
Second, I realize that going back and forth from yt to ∆yt may be confusing, but we actually have
the following result. Consider estimating
yt = α+ βt+ ρyt−1 + εt and ∆yt = λ+ γt+ θyt−1 + ut,
leading to αˆ, βˆ, and ρˆ for the first regression and λˆ, γˆ, and θˆ for the second one. Then we have that
αˆ = λˆ, βˆ = γˆ, and ρˆ− 1 = θˆ! So, subtracting yt−1 from both sides does not affect the interpretation
of the terms in the initial regression, without differenced variables. Note that we also have εˆt = uˆt.
Given the previous point, we can interpret all the coefficients in yt = α + βt + ρyt−1 + εt, and their interpretation will remain the same in ∆yt = λ + γt + θyt−1 + ut. We normally use the drift concept only for random walks and not for stationary AR(1) processes. So, under the null of ρ = 1, α is the drift. Under the alternative, α no longer has such a “big effect”; we just say that it makes the mean nonzero. In none of the cases do we keep βt under the null with β ≠ 0, as that would be quite extreme, so there is no need to look for a special name for it. Under the alternative it corresponds to a trend term, after removing which we would be left with a stationary AR(1).
Some confusion possibly arises from the fact that if yt = α+ yt−1 + εt, then we also have
yt = y0 + αt+ εt + · · ·+ ε1,
where now suddenly α appears to have something to do with the trend. But this expression actually
just illustrates this “power” of the drift term in random walks: it leads to a trending behaviour once
we rewrite the model in terms of the past shocks.
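The claimed coefficient identity between the levels and the differenced regressions is easy to verify on simulated data; here is a sketch (my own code, with arbitrary parameter values):

```python
import numpy as np

# Regress y_t and then Delta y_t on the same regressors (1, t, y_{t-1}).
# The coefficients should coincide except that the one on y_{t-1}
# shifts by exactly 1, i.e. rho_hat - 1 = theta_hat.
rng = np.random.default_rng(0)
T = 200
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 + 0.8 * y[t - 1] + rng.normal()

trend = np.arange(1, T)
X = np.column_stack([np.ones(T - 1), trend, y[:-1]])
b_levels = np.linalg.lstsq(X, y[1:], rcond=None)[0]     # (alpha, beta, rho)
b_diff = np.linalg.lstsq(X, np.diff(y), rcond=None)[0]  # (lambda, gamma, theta)

print(np.allclose(b_levels - b_diff, [0.0, 0.0, 1.0], atol=1e-6))  # True
```

The identity holds because y[1:] and np.diff(y) differ exactly by y[:-1], which is itself a column of X.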
13 Why do we have standard testing for unit roots in the Case (3) of a Random walk with a drift α ≠ 0 / Nonzero mean stationary AR(1)? (Added Nov 12)

As I said during the lecture, it is because the drift term (creating the trending behaviour) dominates in the asymptotic results, making the issues arising from the accumulated errors negligible. This, however, is not covered in Wooldridge’s book. For more technical details see Case 3 in Section 17.4 of Hamilton’s book.
14 In what way is conditional unbiasedness stronger than unconditional unbiasedness?
(Added Nov 14)
Let βˆ be some estimator of β. Then we have that E[βˆ | X] = β (conditional unbiasedness) always implies E[βˆ] = β (unconditional unbiasedness). That can be shown using the Law of Iterated Expectations (LIE):

E[βˆ] = E[E[βˆ | X]] = E[β] = β,

where we use the LIE in the first equality, the conditional unbiasedness in the second, and the fact that the expectation of a constant is the constant itself in the third.
15 Is there ever a case in which an estimator is unconditionally unbiased but not so
conditionally? (Added Nov 14)
It is indeed possible. This happens, roughly, when

E[βˆ | X] = β + something that depends on X,

where that X-dependent part has zero mean. For instance, consider a strange but valid estimator βˆ = β + x1, where E[xi] = 0. Then

E[βˆ | X] = β + x1 ≠ β,

meaning no conditional unbiasedness, but

E[βˆ] = β + E[x1] = β,

giving the unconditional unbiasedness.
16 How would Var[Yt] look like for a multivariate time series? (Added Nov 14)
Just like for any other general random vector, Var[Yt] would be the variance-covariance matrix of
Yt. That is, we do not look at any other time periods and just consider variances and covariances
across the components of Yt. Let Yt = (Yt,1, . . . , Yt,k)′. Then Var[Yt]ij = Cov[Yt,i, Yt,j ] for all
i, j = 1, . . . , k.
17 How to understand all the different subscripts in our notation in time series? (Added
Nov 14)
Initially there may, understandably, be some confusion, but ultimately you should be able to figure it out from the context, if no extra details are given. By convention, we will not use t to refer to a variable number, so a bold Yt is going to be a vector of time series at time t, and Yt will be a single time series at time t. When dealing with two time periods, using t and s is common, as in Cov[Yt, Ys] = γ(|t − s|).
Now if we are being more general, like in the definition of strict stationarity, we need an arbitrary
number of time periods. In this case using t1, t2, . . . , tn for the time periods is convenient (to then
talk about Yt1 , Yt2 , . . . , Ytn). Notice that 1, 2, . . . , n are subindices. On the other hand, you may also
encounter things like Yt1, Y1t, Yt,1, Y1,t, and so on. Things like t1 or 1t will almost always correspond
to their counterparts with commas. Afterwards it should be clear that we mean time period t and
variable 1.
18 How do we get the formula for βˆ from Lecture 17, Slide 7? (Added Nov 14)
It is the usual “convenient” formula for the OLS estimator when considering consistency. A quick way to derive it would be as follows:

yt = βxt + εt,
xtyt = βx²t + xtεt,
E[xtyt] = β E[x²t] + E[xtεt],
β = E[xtyt] / E[x²t],
βˆ = ∑_{t=1}^{T} xtyt / ∑_{t=1}^{T} x²t = β + ∑_{t=1}^{T} xtεt / ∑_{t=1}^{T} x²t,
βˆ − β = ∑_{t=1}^{T} xtεt / ∑_{t=1}^{T} x²t,

where we used E[xtεt] = 0 in the fourth line and our knowledge that the OLS estimator corresponds to the method of moments estimator. Minimizing the residual sum of squares should not take much more either.
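The moment-based formula and the least-squares solution coincide, as a quick sketch confirms (my own code; β = 1.5 and the sample size are arbitrary):

```python
import numpy as np

# beta_hat as the ratio of sample moments equals the no-intercept OLS fit.
rng = np.random.default_rng(0)
T = 10_000
x = rng.normal(size=T)
eps = rng.normal(size=T)
beta = 1.5
y = beta * x + eps

beta_hat = np.sum(x * y) / np.sum(x ** 2)                     # method of moments
beta_ols = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]   # least squares
print(np.isclose(beta_hat, beta_ols))  # True: the same estimator
```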
19 What is an I(0) process? (Added Nov 14)
Various definitions of I(0) and, more generally, I(d) processes can be found. The original definition by Engle and Granger (1987) was that a series with no deterministic component which has a stationary, invertible ARMA representation after differencing d times is said to be integrated of order d, denoted Yt ∼ I(d). A similar definition eliminating certain anomalies adds the detail that differencing the series d − 1 times is not enough to achieve stationarity (Banerjee, Dolado, Galbraith, and Hendry, 2003). In the context of this course, it will be enough to think of I(0) as a stationary time series, and of I(d) as a series that must be differenced d times to become stationary.
20 If Yt has a unit root, when is it not the case that ∆Yt ∼ I(0)? (Added Nov 14)
Relating to the previous answer, it can be the case if Yt ∼ I(d) for d > 1, so that the time series has more than one unit root or, in other words, the unit root's multiplicity is higher than one. For instance, suppose that Yt is such that (1 − L)²Yt = εt is stationary, so that Yt ∼ I(2). We have that (1 − L)²Yt = Yt − 2Yt−1 + Yt−2. Hence, Yt = 2Yt−1 − Yt−2 + εt. Let us double-check now whether differencing Yt twice would give us stationarity:

∆Yt = Yt − Yt−1 = (2Yt−1 − Yt−2 + εt) − Yt−1 = Yt−1 − Yt−2 + εt,
∆²Yt = ∆∆Yt = ∆Yt − ∆Yt−1 = (Yt−1 − Yt−2 + εt) − (Yt−1 − Yt−2) = εt,

as expected. Two time series examples that you may find to be I(2) are price levels and money supply.
Now the random walk process is somewhat intuitive in that it permanently accumulates all the past shocks. What about something like an I(2) process? It does the accumulation twice. Let Yt = ∑_{i=1}^{t} εi be the white noise shocks accumulated once. Clearly we also have Yt−1 = ∑_{i=1}^{t−1} εi. Hence,

Yt − Yt−1 = ∑_{i=1}^{t} εi − ∑_{i=1}^{t−1} εi = εt, so that Yt = Yt−1 + εt,

as expected! Now let us try to accumulate the shocks twice by defining

Xt = ∑_{i=1}^{t} Yi = ∑_{i=1}^{t} ∑_{j=1}^{i} εj = ∑_{i=1}^{t} (t − i + 1)εi.

Then we also have that

Xt−1 = ∑_{i=1}^{t−1} (t − i)εi and Xt−2 = ∑_{i=1}^{t−2} (t − i − 1)εi.

Referring back to the definition that Xt ∼ I(2) if (1 − L)²Xt = Xt − 2Xt−1 + Xt−2 is stationary,

∆²Xt = Xt − 2Xt−1 + Xt−2
= ∑_{i=1}^{t} (t − i + 1)εi − 2 ∑_{i=1}^{t−1} (t − i)εi + ∑_{i=1}^{t−2} (t − i − 1)εi
= ∑_{i=1}^{t−2} ( (t − i + 1) − 2(t − i) + (t − i − 1) )εi + ( (t − (t − 1) + 1) − 2(t − (t − 1)) )εt−1 + (t − t + 1)εt
= εt,

confirming that accumulating twice gave us an I(2) process!
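The double-accumulation argument can be replicated in a couple of lines (my own sketch):

```python
import numpy as np

# Cumulating white noise twice gives an I(2) series: its second
# difference recovers the original shocks exactly.
rng = np.random.default_rng(0)
eps = rng.normal(size=100)
Y = np.cumsum(eps)     # I(1): shocks accumulated once
X = np.cumsum(Y)       # I(2): shocks accumulated twice

d2X = np.diff(X, n=2)  # (1 - L)^2 X_t
print(np.allclose(d2X, eps[2:]))  # True
```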
21 What happens with the (T − 1)/T , (T − 2)/T , and so on factors in the limit when
deriving the HAC standard errors? (Added Nov 14)
More generally, the question is how we go from

VT = E[εtxtx′tεt]
 + ((T − 1)/T) E[εtxtx′t−1εt−1 + εt−1xt−1x′tεt]
 + ((T − 2)/T) E[εtxtx′t−2εt−2 + εt−2xt−2x′tεt]
 + · · ·
 + (1/T) E[εtxtx′t−T+1εt−T+1 + εt−T+1xt−T+1x′tεt]
= E[εtxtx′tεt] + ∑_{h=1}^{T−1} ((T − h)/T) E[εtxtx′t−hεt−h + εt−hxt−hx′tεt]

to

lim_{T→∞} VT = V = E[εtxtx′tεt] + ∑_{j=1}^{∞} E[εtxtx′t−jεt−j] + ∑_{j=1}^{∞} E[εt−jxt−jx′tεt].

It follows by the Dominated Convergence Theorem, since (T − h)/T → 1 as T → ∞ for any fixed h, assuming that for all variables i, j = 1, 2, . . . , k we have

∑_{h=−∞}^{∞} |E[εtxt,ixt−h,jεt−h]| < ∞.
22 Out of the five settings for unit root testing, why did we consider just the second
and the fifth ones? (Added Nov 14)
Let us discuss each of the five combinations of null and alternative models when testing for unit roots.
1. Random walk / Zero mean stationary AR(1):
H0 : yt = yt−1 + εt,
H1 : yt = ρyt−1 + εt with ρ < 1.
In Figure 1 you can find three realizations for each of several different values of ρ. The first three
graphs are under the alternative and the last one is under the null. In each case we start from
Y1 = 0 and simulate the errors independently as N (0, 1). Clearly, under small values of ρ the
realizations do not resemble a unit root process, while that with ρ = 0.9 may look so, especially
with small T . However, as mentioned in the lecture, fixing the mean to zero is restrictive and,
hence, makes it unappealing to consider this alternative.
Figure 1: Case 1 (panels ρ = 0.1, ρ = 0.5, ρ = 0.9, and ρ = 1; three realizations each, t = 1, . . . , 100).
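Simulations of this kind are easy to reproduce; here is a minimal sketch (my own code, one realization per ρ rather than three, plotting left to the reader):

```python
import numpy as np

# One realization of y_t = rho * y_{t-1} + eps_t per value of rho,
# starting from y_1 = 0, with eps_t drawn independently as N(0, 1).
rng = np.random.default_rng(0)
T = 100
paths = {}
for rho in (0.1, 0.5, 0.9, 1.0):
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + rng.normal()
    paths[rho] = y  # plot each path against t, e.g. with matplotlib

# The paths disperse more as rho grows; under rho = 1 they wander freely.
```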
2. Random walk / Nonzero mean stationary AR(1):
H0 : yt = yt−1 + εt,
H1 : yt = α+ ρyt−1 + εt with ρ < 1.
Figure 2 does exactly the same as Figure 1, except that now we also have α = 1 under the
alternative (and still α = 0 under the null). This makes the mean of the stationary AR(1)
processes nonzero, and we choose E[Yt] = Y1 = α/(1 − ρ) as the starting point (we also do so in the remaining cases below). Now that we allow for α ≠ 0, this is a realistic situation for testing
in practice: under ρ close to 1 and under ρ = 1 we may see somewhat similar trajectories,
especially for small T . As T grows, however, the unit root realizations look very different from
those under alternative, hence all the different theoretical properties and an interest in testing
for unit roots.
Figure 2: Case 2 (panels ρ = 0.1, ρ = 0.5, and ρ = 0.9 with α = 1, and ρ = 1 with α = 0; three realizations each, t = 1, . . . , 100).
3. Random walk with a drift α ≠ 0 / Nonzero mean stationary AR(1):

H0 : yt = α + yt−1 + εt with α ≠ 0,
H1 : yt = α + ρyt−1 + εt with ρ < 1.
In Figure 3, where now α = 1 under the alternative and α = 0.5 under the null, we have a sort of mismatch: the intercept for a stationary AR(1) simply changes its mean, while for the unit root process it creates an obvious trending behaviour. As a result, even the lower two panels look very different, and there is no good reason to consider this setting, as it should already be clear whether the series is just a nonzero mean stationary AR(1) or a random walk with a drift. One possible exception could, again, be a small T, when the unit root’s trending behaviour is not yet so apparent.
Figure 3: Case 3 (panels ρ = 0.1, ρ = 0.5, and ρ = 0.9 with α = 1, and ρ = 1 with α = 0.5; three realizations each, t = 1, . . . , 100).
4. Random walk / Nonzero mean stationary AR(1) with a trend:
H0 : yt = yt−1 + εt,
H1 : yt = α+ βt+ ρyt−1 + εt with ρ < 1.
Figure 4 shows a mismatch in the other direction. Now the random walks have no drift, but the nonzero mean stationary AR(1) processes also have a trend term with β = 0.1, leading to clear long-term growth. The absence of a drift under the null makes the processes very different (again, except under small T) and of little interest.
Figure 4: Case 4 (panels ρ = 0.1, ρ = 0.5, and ρ = 0.9 with α = 1 and β = 0.1, and ρ = 1 with α = 0 and β = 0; three realizations each, t = 1, . . . , 100).
5. Random walk with a drift α ≠ 0 / Nonzero mean stationary AR(1) with a trend:

H0 : yt = α + yt−1 + εt with α ≠ 0,
H1 : yt = α + βt + ρyt−1 + εt with ρ < 1.
Lastly, Figure 5 depicts the case when there is long-term growth both under the null (due to
the α = 0.5 drift) and alternative (due to β = 0.1). As T grows, the random walk with a drift
realizations reveal their “particularly stochastic” nature, but otherwise this is a good example
of a situation where we may want to use a unit root test.
Figure 5: Case 5 (panels ρ = 0.1, ρ = 0.5, and ρ = 0.9 with α = 1 and β = 0.1, and ρ = 1 with α = 0.5 and β = 0; three realizations each, t = 1, . . . , 100).
23 What does “non-standard” mean when talking about asymptotic distributions? (Added
Nov 23)
This admittedly vague phrasing at the very least means that the asymptotic distribution is not
normal. Almost always it will also mean that we do not have a name for this distribution (like F or
χ²). As an example, to get an idea, recall our Case (1) for unit root testing. In that case

T(ρˆT − 1) →d ( (1/2)(W(1)² − 1) ) / ( ∫₀¹ W(r)² dr ),
where W (·) denotes the standard Brownian motion satisfying the following:
(a) W (0) = 0;
(b) For any 0 ≤ t1 ≤ t2 ≤ · · · ≤ tk ≤ 1, the increments W(t2) − W(t1), W(t3) − W(t2), . . . , W(tk) − W(tk−1) are independent Gaussian with W(s) − W(t) ∼ N(0, s − t) for s ≥ t;
(c) For any given realization, W (t) is almost surely continuous in t.
Note that parts (a) and (b) imply that W(t) ∼ N(0, t), meaning that W(1)² ∼ χ²(1), and in the denominator we have an integral of squared normal random variables (which are not independent). That should illustrate the non-standardness pretty well. One of the main consequences then is that we do not immediately know the critical values of such limiting distributions, which makes testing more difficult in practice.
The same may (or may not) apply to t statistics. For instance, in the same Case (1) it could be shown that

t →d ( (1/2)(W(1)² − 1) ) / ( ∫₀¹ W(r)² dr )^{1/2}.
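A quick Monte Carlo makes the non-standardness visible (my own sketch, not from the lecture; T and the number of replications are arbitrary):

```python
import numpy as np

# Simulate the distribution of T*(rho_hat - 1) under a driftless random walk.
# It is skewed and centred below zero, unlike any normal distribution.
rng = np.random.default_rng(0)
T, reps = 200, 2000
stats = np.empty(reps)
for r in range(reps):
    y = np.cumsum(rng.normal(size=T))                        # random walk
    rho_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # no-constant OLS
    stats[r] = T * (rho_hat - 1)

print(np.mean(stats))      # clearly negative
print(np.mean(stats < 0))  # well over half of the mass is below zero
```

Tabulating such simulated draws is essentially how Dickey-Fuller critical values were originally obtained.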
24 Can we use leads and lags of xt instead of ∆xt in the Dynamic OLS to remove the
correlation? (Added Nov 23)
No, we should be using differences. Suppose that we are considering a cointegration relationship
yt = α+βxt+ut with xt = xt−1 + et. The issue then may be that ut is correlated with xt or, in other
words, with the errors of xt (because it can be expressed in terms of the errors), which are given by
∆xt = et! So, if there is only contemporaneous dependence, so that only et and ut are correlated, including ∆xt in the equation so as to get
yt = α + βxt + λ∆xt + vt
should be enough. But the dependence between the errors can be more general, and that is why we
may want to include leads and lags of the differences.
25 If we have an m-dimensional VECM model, is it true that there cannot be m cointe-
grating relationships? (Added Nov 23)
That is indeed true. Note that m cointegrating relationships would mean that the rank of an m×m
matrix R in the VECM given by
∆yt = α+ Ryt−1 + B1∆yt−1 + · · ·+ Bp∆yt−p+1 + εt
would have to be m as well. That is, R would be invertible! If that were the case, we could write
yt−1 = R−1∆yt −R−1α−R−1B1∆yt−1 − · · · −R−1Bp∆yt−p+1 −R−1εt,
where all the variables on the right hand side are I(0). Hence, yt would also have to be I(0), but
that contradicts the whole idea of cointegration relationships between I(1) variables.
26 I am confused about the expression for the Likelihood Ratio test statistic: shouldn’t
we use the ratio λ = L(θ̃; Y)/L(θ̂; Y)? (Added Nov 23)
True, the ratio of likelihoods is literally λ and it is the basis for all three tests. However, the actual Likelihood Ratio test statistic uses log λ and equals −2 log λ. What is the reason for that? As we have seen (slides 5 and 8), this leads to a very familiar chi-square distribution, and this makes sense: λ itself takes values in [0, 1], which would be quite unusual for a test statistic. Meanwhile, −2 log λ takes values in [0, ∞), as we are used to. Now multiplying by 2 is not necessary for this range of values but, again, using χ²(m) for −2 log λ is much more convenient than χ²(m)/2 for − log λ. The other two tests, W and LM, being likelihood-based, are also motivated by the ratio λ and involve approximations of log λ.
27 What is the difference between the residuals and the errors? (Added Nov 23)
There is a big difference and you must understand the two concepts very well. Consider a simple
regression model yi = βxi + εi, which you may interpret as a causal mechanism or the best linear
predictor. In both cases there exists an unknown, fixed, true parameter value β. Similarly, yi and xi
are some random variables that are out there and that we potentially observe. Hence, εi = yi − βxi
are unknown (because β is unknown), random (because yi and xi are random), true model errors.
In this setup all “we can do” is to estimate/guess the true value of β. It can be done in various ways (e.g., you may guess that β = π for all the models in the world), but let us focus on the OLS. Also, to fix ideas suppose that the true errors are independent and normally distributed, εi ∼ N(0, σ²). Our sample of n observations leads to the OLS estimator β̂ of β, which we can also use to “estimate” the true yet unknown model errors by ε̂i = yi − β̂xi, which we call residuals. It is crucial to realize that the errors εi and the residuals ε̂i have very different properties. For instance, in a regression with a constant we saw that Σᵢ₌₁ⁿ ε̂i = 0. However, if we sum the errors of the same individuals, we do not get a zero: Σᵢ₌₁ⁿ εi ≠ 0 with probability 1. The reason for Σᵢ₌₁ⁿ ε̂i = 0 is technical and comes from the way the OLS works (recall orthogonal projection) for a given sample. What is true for the errors is E[εi] = 0.
As another example, consider the unknown parameter Var[εi] = σ². Since, under usual conditions, β̂ →p β as n → ∞, the residuals ε̂i also become more and more similar to the true errors εi. Hence, it makes sense to use, say, n⁻¹ Σᵢ₌₁ⁿ ε̂i² to estimate σ², and similarly with many other quantities, like E[εi²xi²].
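The contrast between residuals and errors can be seen in a short simulation (my own, with arbitrary numbers): in a regression with a constant, the residuals sum to exactly zero, while the true errors do not.

```python
import numpy as np

# Simulated sketch (my own numbers): errors vs. residuals in a regression
# with a constant.
rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
eps = rng.standard_normal(n)          # true, unobserved errors
y = 1.0 + 2.0 * x + eps               # true alpha = 1, beta = 2

X = np.column_stack([np.ones(n), x])  # include the constant
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat              # residuals, not errors

print(resid.sum())  # numerically zero: orthogonality with the constant
print(eps.sum())    # some nonzero number: nothing forces this to vanish
```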
28 Why do we sometimes have the i subscript and sometimes we do not? (Added Nov
23)
In Lecture 19 we had multiple instances where Yi became just Y. And what is the purpose of the subscript i? It clearly is needed when we actually are using Yi's for multiple individuals, as in n⁻¹ Σᵢ₌₁ⁿ Yi. But what if all we care about is the expectation E[Yi]? Notice that since in most of the lectures we have i.i.d. random variables, we could just as well write E[Y1] (which some authors actually do) or E[Y2] all the time, as they all are the same. So, it is reasonable to drop i in such generic situations where we do not really care whether i = 1 or i = 2.
Another way to think about this would be that the true, say, regression model is y = βx+ ε, where y
and x are some representative random variables. Then all the y1, y2, . . . are just i.i.d. copies of y, and
similarly with x and ε. So, dropping i in those generic situations then can be understood as coming
back to the representative random variables.
29 Where does the approximation −2 log λ ≈ mF come from? (Added Nov 23)
First, we saw that
mF = (n − k − 1) (RSSr/RSSu − 1).
The approximation −2 log λ ≈ mF is for the case when n → ∞. Indeed, since n/(n − k − 1) → 1 as n → ∞, we write n − k − 1 ≈ n. On the other hand, under the null, RSSu ≈ RSSr, giving RSSr/RSSu ≈ 1 and
x = RSSr/RSSu − 1 ≈ 0.
But for x ≈ 0 we have x ≈ log(1 + x). Hence,
RSSr/RSSu − 1 ≈ log( 1 + RSSr/RSSu − 1 ) = log( RSSr/RSSu ).
Since, with normal errors and σ² concentrated out, −2 log λ = n log(RSSr/RSSu), putting the pieces together gives −2 log λ ≈ n (RSSr/RSSu − 1) ≈ (n − k − 1)(RSSr/RSSu − 1) = mF, which completes the approximation.
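A quick numerical sketch (hypothetical data of my own; it uses the fact that, in the normal linear model with σ² concentrated out, −2 log λ = n log(RSSr/RSSu)) confirms the approximation:

```python
import numpy as np

# Numerical sketch (hypothetical data) of -2 log(lambda) ~ m*F for large n.
rng = np.random.default_rng(2)
n, k = 5000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.standard_normal(n)

# Unrestricted: all k regressors; restricted: the last m = 2 coefficients zero
# (true under this DGP, so we are under the null).
m = 2
bu = np.linalg.lstsq(X, y, rcond=None)[0]
rss_u = np.sum((y - X @ bu) ** 2)
Xr = X[:, : k + 1 - m]
br = np.linalg.lstsq(Xr, y, rcond=None)[0]
rss_r = np.sum((y - Xr @ br) ** 2)

lr = n * np.log(rss_r / rss_u)           # -2 log(lambda)
mF = (n - k - 1) * (rss_r / rss_u - 1)   # m times the F statistic
print(lr, mF)                            # very close for large n under the null
```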
30 In the LR, W, and LM tests, how do we accept and reject the null? Is it right that
the smaller the difference between θ0 and θˆML, the more likely we are to accept H0?
(Added Nov 23)
The statement is correct. What the data “says” is θ̂ML or, equivalently, l(θ̂ML; Y). So, if θ0 or, equivalently, l(θ0; Y) is close to its sample counterpart, our null hypothesis is in agreement with what the data says, and the less likely we are to reject H0. Hence, using the asymptotic distribution χ²(1) (or, more generally, χ²(m)), we reject the null if the LR, W, or LM test statistic exceeds the 1 − α quantile of χ²(1).
31 Why is it that aN(0, b) ∼ N(0, a²b) if we know that Var[aX] = a²Var[X]? (Added Nov 24)
While in both cases we “bring the constant a inside”, this “inside” is different in the two cases. The second case, Var[aX] = a²Var[X], is more direct and means that the variance of the random variable aX is a² times the variance of X. Roughly, the reason for this is that the variance is the mean squared deviation from the mean, so the scale of the variance is quadratic relative to the scale of X; that is why we consider the standard deviation σ (of the same scale as X) to talk about confidence intervals.
Now in the first case we have something else. The notation N(0, b) can be seen as a label representing a random variable; it says that this random variable is normally distributed and has zero mean and variance b. That is, there is some normally distributed X with E[X] = 0 and Var[X] = b. Then aN(0, b) can be seen as another label, representing aX. Instead of the product aN(0, b) we may write a single label with the new mean and the new variance. But we know that E[aX] = 0 and Var[aX] = a²Var[X] = a²b. Hence, aN(0, b) is the same as N(0, a²b).
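A one-line simulation (arbitrary a and b of my own) illustrates the variance scaling:

```python
import numpy as np

# Sketch (my own arbitrary numbers): the variance of a*X for X ~ N(0, b)
# is a^2 * b.
rng = np.random.default_rng(3)
a, b = 3.0, 4.0
x = rng.normal(0.0, np.sqrt(b), size=1_000_000)  # X ~ N(0, b)
print(np.var(a * x))  # close to a^2 * b = 36
```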
32 How does the standardisation of normal random variables, leading to a χ2 distribu-
tion, work? (Added Nov 24)
The question refers to cases like
W = n (Rθ̂ML − r)′ [RVR′]⁻¹ (Rθ̂ML − r) →d χ²(m).
The question is also very closely related to the previous one. The steps are a little harder to understand when dealing with multivariate random variables, so let us first start with a scalar case. Let X ∼ N(0, σ²). Then what is the distribution of σ⁻¹X? Using my previous answer we see that σ⁻¹X ∼ N(0, 1). We also know that N(0, 1)² ∼ χ²(1). So, σ⁻²X² = (σ⁻¹X)² ∼ χ²(1). More generally, if X ∼ N(µ, σ²) you should be able to see that σ⁻²(X − µ)² ∼ χ²(1) because now X − µ ∼ N(0, σ²), just as before. The last step to bring the scalar case closer to the multivariate one is to note that σ⁻²X² = X′(σ²)⁻¹X ∼ χ²(1) since, for scalars, X′ = X.
Everything remains the same in the multivariate case except that now the order of multiplication matters and we have a variance-covariance matrix rather than just a variance. Let X be p-dimensional with X ∼ N(0, I). In that case X′X = Σᵢ₌₁ᵖ Xi² ∼ χ²(p), by definition. Now let X ∼ N(0, Σ) with Σ = Σ^{1/2}Σ^{1/2}′ so that
Σ^{−1/2}ΣΣ^{−1/2}′ = Σ^{−1/2}Σ^{1/2}Σ^{1/2}′Σ^{−1/2}′ = I.
Then notice that
X′Σ⁻¹X = X′Σ^{−1/2}′Σ^{−1/2}X = (Σ^{−1/2}X)′(Σ^{−1/2}X) = Z′Z ∼ χ²(p),
where Z = Σ^{−1/2}X ∼ N(0, I) because we know that Var[AX] = AVar[X]A′ and, hence,
Var[Σ^{−1/2}X] = Σ^{−1/2}Var[X]Σ^{−1/2}′ = Σ^{−1/2}ΣΣ^{−1/2}′ = I.
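The multivariate standardisation can be checked by simulation (a sketch of my own; Σ is an arbitrary positive definite matrix and the Cholesky factor plays the role of Σ^{1/2}):

```python
import numpy as np

# Simulation sketch (my own): X' Sigma^{-1} X ~ chi^2(p) for X ~ N(0, Sigma).
rng = np.random.default_rng(4)
p = 3
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # an arbitrary positive definite Sigma
L = np.linalg.cholesky(Sigma)          # plays the role of Sigma^{1/2}

reps = 200_000
Z = rng.standard_normal((reps, p))     # rows ~ N(0, I)
X = Z @ L.T                            # rows ~ N(0, Sigma)
q = np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)  # X' Sigma^{-1} X

print(q.mean())  # close to p, the mean of chi^2(p)
print(q.var())   # close to 2p, the variance of chi^2(p)
```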
33 When doing GLS, FGLS, or WLS, do I also have to transform the constant term?
(Added Nov 27)
Yes, absolutely. Remember that if we are dealing with Y = Xβ + ε and Var[ε | X] = Ω = PP′,
then the transformation consists of premultiplying everything by P−1. So, why would in the end
there be an exception for the constant term? OK, you may think that perhaps it does not make
a difference whether we do the multiplication or not. It actually does and we are going to see
why. Consider a simple case of yi = α + βxi + εi, where the need for FGLS or GLS arises simply
from heteroskedasticity. This allows us to use the Weighted Least Squares (WLS). In particular, let
Var[ε | X] = diag(σ²(x1), . . . , σ²(xn)) for some known function σ²(·), like σ²(x) = 1 + x². Then we need to estimate
yi/σ(xi) = α/σ(xi) + β xi/σ(xi) + εi/σ(xi)
rather than
yi/σ(xi) = α + β xi/σ(xi) + εi/σ(xi),
and there is a difference. You should already start suspecting why. On the one hand, α/σ(xi) is no longer constant, as it depends on i. On the other hand, you can think of the WLS as changing the “scale” of the regression, and this scale is observation-dependent. That is, before, the dependent variable was yi, in kilograms, say, with the intercept α in the same scale of kilograms. Now, when using yi/σ(xi) as the dependent variable, the scale is no longer just kilograms, and we actually cannot really interpret this new scale. But what is clear is that we want our new “constant” (although not so constant anymore) term to be in this scale too! Otherwise our estimates of α in the two regressions are going to give weird results.
In particular, suppose that our true data generating process is yi = 2 · 1 + 3xi + εi, where yi is measured in kilograms. Also, suppose that we do not really have heteroskedasticity (or do not know it) and just use the WLS with weights 1/1000. So, having
yi/1000 = 2 · (1/1000) + 3 · (xi/1000) + εi/1000
just changes the scale to grams! The coefficients remain the same: 2 and 3. Now suppose that we did not rescale the constant term from 1 to 1/1000:
yi/1000 = α · 1 + 3 · (xi/1000) + εi/1000.
What is α now? Now α = 2/1000 instead of 2. So, if we forget to transform the constant term, at the very least we will incorrectly estimate the intercept. However, in the more general case of weights 1/σ(xi) we also have dependence on xi. Hence, not performing the transformation in that case would also affect other coefficients.
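The grams example can be reproduced exactly (a noise-free sketch with my own numbers, so the algebra is visible without sampling error):

```python
import numpy as np

# Noise-free sketch (my own numbers) of the grams example: y_i = 2*1 + 3*x_i.
n = 50
x = np.linspace(1.0, 5.0, n)
y = 2.0 + 3.0 * x
w = 1.0 / 1000.0                       # common weight: kilograms -> grams

# Correct WLS: transform y, x, AND the constant column.
Xc = np.column_stack([np.full(n, w), w * x])
b_correct = np.linalg.lstsq(Xc, w * y, rcond=None)[0]
print(b_correct)                       # [2., 3.]: coefficients unchanged

# Forgetting to transform the constant: the intercept becomes 2/1000.
Xw = np.column_stack([np.ones(n), w * x])
b_wrong = np.linalg.lstsq(Xw, w * y, rcond=None)[0]
print(b_wrong)                         # [0.002, 3.]
```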
34 How to understand the FWL theorem and how to use it in practice? (Added Nov
27)
Let
Y = Xβ + ε = X1β1 + X2β2 + ε, (1)
where X is n×k, β is k×1, X1 is n×k1, β1 is k1×1, X2 is n×k2, and β2 is k2×1 with k1 +k2 = k.
There is nothing special so far: in any multiple regression model we may always split our k regressors
into two groups of k1 and k2 variables. The purpose of doing this is that we want to estimate just
β1, and not β2. Why would we want to do this? Well, it is true that this is very much a nice theoretical result, and there are not many very intuitive applications. There are some, although they are somewhat theoretical too; e.g., the Fixed Effects estimator using panel data. Also you may
be interested in the asymptotic properties of just a single coefficient; that of a single endogenous
variable, for instance.
So, how can we estimate just β1 in that regression? Perhaps by estimating
Y = X1β1 + u? (2)
Not really. Since we are interested in the relationship given by (1), X2 is omitted now in (2), and
there very easily may be a bias when estimating β1 in (2). That is, the estimator of β1 in (2) would
not be consistent for β1 in (1). Now why is that? That is because X2 is, in general, correlated with
X1 and Y. Hence, omitting it is going to change the estimation results. So, what could we do?
It would be nice to run something like (2), where the fact that we are not including X2 no longer
matters. And that is what the FWL theorem allows us to do! We are going to use a counterpart
of (2), where now the variables are going to be free of the effect of X2. How can we do that? First,
run
Y = X2γ + v
and save the residuals vˆ = Y −X2γˆ. That is what is left from Y after linearly using X2 to explain
it. Now let us do the same with the n× k1 matrix X1. But what would it mean to regress X1 on X2
given that X1 has k1 columns? It would mean regressing each of the columns of X1, one by one, on
the whole X2. That is, let X1 = (X1,1 X1,2 . . . X1,k1), where each of X1,j is an n× 1 vector. So, let
us run now
X1,j = X2λj + wj
for j = 1, . . . , k1, and save the residuals wˆj = X1,j − X2λˆj each time, where wˆj is what is left
from X1,j after linearly using X2 to explain it. Then collect the residuals to a single n× k1 matrix,
wˆ = (wˆ1 wˆ2 . . . wˆk1). Now wˆ is what is left from X1 after using X2 to explain it. Then what? Then
running
vˆ = wˆδ + z (3)
we actually have that δ̂ equals β̂1 from (1)! That is because we have got rid of the effect of X2
from all the rest of the variables and not including X2 is no longer a problem. How does this fit into
the FWL theorem? We have that vˆ = MX2Y and wˆ = MX2X1 so that (3) is the same as running
MX2Y = MX2X1δ + z.
And what is the OLS estimator of δ? It is
δ̂ = ( X′1MX2X1 )⁻¹ X′1MX2Y
  = ( X′1MX2MX2X1 )⁻¹ X′1MX2MX2Y
  = ( (MX2X1)′MX2X1 )⁻¹ (MX2X1)′MX2Y,
where we are able to obtain multiple expressions because of symmetry and idempotency of MX2 .
So, how do we use the FWL theorem in practice? We start with
Y = Xβ + ε = X1β1 + X2β2 + ε
aiming to estimate just β1 and know that it can be done by running
MX2Y = MX2X1δ + z,
where multiple expressions for δ̂ are given above. So, all one needs to do is to split X into two matrices, construct MX2 = I − X2(X′2X2)⁻¹X′2, and that is all. Depending on your problem it may be that MX2Y and/or MX2X1 can be further simplified or interpreted.
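The recipe above can be checked numerically (simulated data of my own): the coefficients on X1 from the full regression coincide with those from regressing the X2-residualized Y on the X2-residualized X1.

```python
import numpy as np

# Numerical check (simulated data, my own) of the FWL theorem.
rng = np.random.default_rng(5)
n, k1, k2 = 300, 2, 3
X1 = rng.standard_normal((n, k1))
X2 = rng.standard_normal((n, k2))
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([0.5, 0.5, 0.5]) \
    + rng.standard_normal(n)

X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0][:k1]   # beta_1, full model

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)  # annihilator M_{X2}
delta = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)[0] # residual-on-residual

print(beta_full, delta)  # identical up to rounding
```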
35 If we say that experience is exogenous, why do we include it as an instrument? Can’t
we just drop it and obtain the same regression results? What is the rationale behind
having an instrument for an exogenous variable? (Added Dec 8)
Let us think about the two stages behind the 2SLS. In the first stage our main goal is to extract the
exogenous part from each endogenous variable in the best way possible. So, if our dependent variable
is education, it makes sense to also add experience as a regressor, which we assume to be exogenous,
since that should only improve our approximation of this exogenous part of education. (Actually you
must do that.) Recall that this first stage regression we wrote as educi = z′iπ + ui, meaning that
zi contains all exogenous information that is available to us. The key part in the second stage is to
replace the endogenous variables by their estimated exogenous counterparts.
So far so good. What are we missing? We perform the first stage for all the regressors, both
endogenous and exogenous. That is, we also run experi = z′iλ + ei and save the fitted values. Is
that a problem? No, because if experience is in zi, then all the coefficients will be 0 except that for
experience itself. That is, we will perfectly predict experience and there will be no loss of information
by using those fitted values in the second stage.
Then you may ask why bother doing those redundant regressions? Well, manually you would not
actually run them. But it is very convenient to just put all the exogenous variables (both those from
the model and the instruments from outside the model) into the same zi as then the 2SLS formula is
very simple.
So, to summarise, it makes sense to use (and actually we must do so) experience in the first stage for
endogenous regressors, and all the rest is just mathematical convenience.
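The perfect-prediction claim is easy to verify numerically (a simulated sketch of my own; `exper` and `instr` are hypothetical stand-ins for experience and an outside instrument):

```python
import numpy as np

# Sketch (simulated, my own): regressing an exogenous regressor on z_i, which
# contains that regressor itself, predicts it perfectly.
rng = np.random.default_rng(6)
n = 100
exper = rng.uniform(0.0, 20.0, n)      # "experience", assumed exogenous
instr = rng.standard_normal(n)         # an outside instrument
Z = np.column_stack([np.ones(n), exper, instr])

lam = np.linalg.lstsq(Z, exper, rcond=None)[0]
fitted = Z @ lam
print(lam)                             # roughly [0, 1, 0]: weight 1 on itself
print(np.max(np.abs(fitted - exper)))  # essentially zero: perfect prediction
```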
36 What are overidentifying restrictions and how to find the number of them? (Added
Dec 8)
Before answering the actual question let us make a useful detour. Consider a linear model yi = x′iβ+εi,
where xi is k × 1 and E[xiεi] ≠ 0, so that at least one of the regressors is endogenous. Before we start, note that taking expectations on both sides gives
E[yi] = E[x′i]β (4)
(assuming zero mean errors). This gives us a single equation with k unknowns, meaning that there are infinitely many solutions to this equation (if k > 1). That is an underdetermined system.
Now let there be available zi such that E[ziεi] = 0, where zi includes all the exogenous regressors
from the model as well as all the available instruments outside of the model. Then premultiply the
linear model on the left by zi to get
ziyi = zix′iβ + ziεi.
Taking expectations of both sides gives
E[ziyi] = E[zix′i]β. (5)
What about now? We would like to solve it for β, but can we? We can premultiply both sides by E[zix′i]⁻¹ under two (or just one, actually) conditions: zi has to be k × 1 as well and, if that is the case (so that E[zix′i] is square), E[zix′i] has to be invertible. In this case we can see (5) just like a usual system of linear equations of the form b = Ax that we want to solve for x, and there exists a unique solution. Notice how we transformed the initial single equation in (4) into a system in (5) that we can solve uniquely. The derivation for the OLS estimator (under no endogeneity) would be the same after taking zi = xi.
We are not interested in the case when E[zix′i] is square but not invertible; in that case the rank
condition is violated and it could be shown that the instruments actually are not relevant enough for
explaining xi.
Now suppose that zi is r × 1, where r > k. Then E[zix′i] is not square and (5) provides an overde-
termined system that, normally, has no solutions (r equations and k unknowns). Before we went
from an underdetermined system in (4) to one with a unique solution. Now we can also go from the
overdetermined system in (5) to one with a unique solution. Importantly, there is not just a single
way to do that, meaning that we could construct various estimators of β. One of those options is to
premultiply (5) by E[xiz′i]E[ziz′i]⁻¹ so as to get
E[xiz′i]E[ziz′i]⁻¹E[ziyi] = E[xiz′i]E[ziz′i]⁻¹E[zix′i]β. (6)
Assuming that the rank of E[zix′i] is k, you could show that
E[xiz′i]E[ziz′i]⁻¹E[zix′i] (7)
is invertible so that (6) once again has a unique solution! Hence, premultiplying both sides of (6) by the inverse of (7) we have
β = ( E[xiz′i]E[ziz′i]⁻¹E[zix′i] )⁻¹ E[xiz′i]E[ziz′i]⁻¹E[ziyi].
Then the 2SLS estimator is just a Method of Moments estimator based on the latter equality.
Now let us come back to the actual question. We had an issue of overidentification or overdetermi-
nation when r > k and actually q = r − k is the number of those overidentifying restrictions.
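The sample analog of the last display can be sketched as follows (simulated data of my own, with k = 1 endogenous regressor and r = 2 instruments, so q = 1 overidentifying restriction):

```python
import numpy as np

# Sample-analog sketch (simulated, my own) of the overidentified formula
# beta = (E[xz']E[zz']^{-1}E[zx'])^{-1} E[xz']E[zz']^{-1} E[zy].
rng = np.random.default_rng(7)
n = 200_000
z = rng.standard_normal((n, 2))            # two instruments
u = rng.standard_normal(n)
x = z @ np.array([1.0, 1.0]) + u           # x correlated with z's and with u
eps = 0.8 * u + rng.standard_normal(n)     # endogeneity: eps correlated with x
y = 2.0 * x + eps                          # true beta = 2

Ezx = (z * x[:, None]).mean(axis=0)        # sample E[z_i x_i], a 2-vector
Ezz = z.T @ z / n                          # sample E[z_i z_i']
Ezy = (z * y[:, None]).mean(axis=0)        # sample E[z_i y_i]

A = Ezx @ np.linalg.solve(Ezz, Ezx)        # E[xz'] E[zz']^{-1} E[zx'], scalar here
beta_hat = (Ezx @ np.linalg.solve(Ezz, Ezy)) / A
print(beta_hat)  # close to the true value 2
```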
37 What did you mean by saying that GMM under heteroskedasticity with instruments
is like 2SLS with GLS? Heteroskedasticity is about error variance and Optimal GMM
uses moment conditions’ variance. Aren’t those very different things? (Added Dec
8)
It is true that optimality of GMM is moments conditions-dependent. That is, Optimal GMM is going
to be efficient only in a large class of estimators under the specified moment conditions. So, when
talking about 2SLS with GLS, I also was implicitly assuming that we would provide information about
heteroskedasticity in our moment conditions. In particular, if we used something like E[Z′Ω⁻¹ε] = 0, where Ω = diag( E[ε1² | z1], . . . , E[εn² | zn] ) is assumed to be known, then this would lead to a sort
of 2SLS plus GLS estimator. One reason I am saying “sort of” is because we have not seen GLS in
the context of instruments. My main goal was just to give intuition that GMM is not completely
some magic that we have not seen: if there is heteroskedasticity, the optimal thing to do would be
something like GLS in this context. You may have noticed that E[Z′Ω⁻¹ε] = 0 is in matrix notation
rather than vector notation, which we used in lecture. But notice also that
E[Z′Ω⁻¹ε] = Σᵢ₌₁ⁿ E[ ziεi / E[εi² | zi] ],
where E[εi² | zi] is assumed to be known (e.g., E[εi² | zi] = 1 + z′izi). Both expressions actually are fine. It is possible to show that then the Optimal GMM estimator is
β̂ = ( X′Ω⁻¹Z(Z′Ω⁻¹Z)⁻¹Z′Ω⁻¹X )⁻¹ X′Ω⁻¹Z(Z′Ω⁻¹Z)⁻¹Z′Ω⁻¹Y.
Notice that the only difference between this and the 2SLS estimator is Ω⁻¹ appearing in the middle of every triplet. Try to see how the latter formula would look in terms of X* = Ω^{−1/2}X and Z* = Ω^{−1/2}Z.
Regarding the second question, it is true that, in general, one thing is the error variance and another thing is the variance of the moment conditions. However, if our moment conditions are something like E[xiεi] = 0, then this includes the error term and also has to do with its variance.
38 When using MLE in a simple regression model, how can we relate individual nor-
mality of βˆ and of σˆ2 with their joint normality? How is the latter better? (Added
Dec 8)
This question concerns Lecture 20 and the confusion that arises due to the order in which the results are presented there. It would be best to start with the joint normality of β̂ML and σ̂²ML. In particular,
in slide 8 we finished computing the Information Matrix I(β0, σ0²) and found that
I(β0, σ0²) = diag( σ0⁻² Σᵢ₌₁ⁿ xi², n/(2σ0⁴) ).
Then we know that
√n ( (β̂MLE, σ̂²MLE)′ − (β0, σ0²)′ ) →d N( 0, I1(β0, σ0²)⁻¹ ),
where
I1(β0, σ0²)⁻¹ = diag( σ0² / limₙ→∞ n⁻¹ Σᵢ₌₁ⁿ xi², 2σ0⁴ ),
with the diagonal elements being the asymptotic variances and the off-diagonal zeros being the asymptotic covariance. This gives us three results. The first two results are
√n(β̂MLE − β0) →d N( 0, σ0² / limₙ→∞ n⁻¹ Σᵢ₌₁ⁿ xi² )
and
√n(σ̂²MLE − σ0²) →d N( 0, 2σ0⁴ ),
which we have already seen in the lecture individually. That is, multivariate normality implies
marginal normality. An easy way to prove this is using Slutsky's theorem. For instance,
(1 0) √n ( (β̂MLE, σ̂²MLE)′ − (β0, σ0²)′ ) = √n(β̂MLE − β0)
→d (1 0) · N( (0, 0)′, I1(β0, σ0²)⁻¹ )
= N( 0, (1 0) I1(β0, σ0²)⁻¹ (1 0)′ )
= N( 0, σ0² / limₙ→∞ n⁻¹ Σᵢ₌₁ⁿ xi² ).
Now the third result that is available only after proving joint normality is that βˆMLE and σˆ2MLE are
asymptotically independent. That is because their asymptotic covariance is zero and zero covariance
under normality is equivalent to independence.
39 How do we go from asymptotic normality to standard errors? (Added Dec 8)
Suppose first that yi = βxi + εi. Then under usual assumptions we would have
√n(β̂ − β) →d N( 0, σ² E[xi²]⁻¹ ), (8)
where β̂ is the OLS estimator. Then AVar( √n(β̂ − β) ) = σ² E[xi²]⁻¹ is the asymptotic variance of √n(β̂ − β). Now while (8) is about convergence in distribution, i.e. the limiting behaviour of β̂, in large samples normality will also approximately hold so that, for a large n,
√n(β̂ − β) ≈ N( 0, σ² E[xi²]⁻¹ )
and, hence,
β̂ − β ≈ N( 0, σ² E[xi²]⁻¹/n ),
β̂ ≈ N( β, σ² E[xi²]⁻¹/n ),
using properties of the normal distribution. So, somewhat informally, we can also say that AVar(β̂) = σ² E[xi²]⁻¹/n. The (asymptotic) standard deviation then is just the square root of the latter quantity, σ E[xi²]^{−1/2} n^{−1/2}, while a standard error of β̂ is an estimator of the standard deviation. For instance,
se(β̂) = σ̂ / √( Σᵢ₌₁ⁿ xi² ).
More generally, for a vector β̂ we would have
√n(β̂ − β) →d N( 0, σ² E[xix′i]⁻¹ ).
Then, continuing in a similar manner, we would have
se(β̂j) = σ̂ √( ((X′X)⁻¹)jj ).
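For the scalar model, the two ways of writing the standard error coincide, as a short sketch (simulated data of my own) shows:

```python
import numpy as np

# Sketch (simulated, my own): se(beta_hat) = sigma_hat / sqrt(sum x_i^2)
# equals sigma_hat * sqrt(((X'X)^{-1})_jj) in the no-intercept model.
rng = np.random.default_rng(8)
n = 500
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)

X = x[:, None]
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - 1))     # one estimated parameter

se_scalar = sigma_hat / np.sqrt(np.sum(x**2))
se_matrix = sigma_hat * np.sqrt(np.linalg.inv(X.T @ X)[0, 0])
print(se_scalar, se_matrix)  # the two expressions coincide
```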
40 We saw that doing 2SLS manually will lead to incorrect standard errors. Is there a
way to also manually correct them? (Added Dec 8)
Yes, there is a way. First let us understand the issue of this manual procedure. We are dealing with
yi = x′iβ + εi and recall that
√n(β̂2SLS − β) →d N( 0, σε² ( E[zix′i]′ E[ziz′i]⁻¹ E[zix′i] )⁻¹ ), (9)
under homoskedasticity, where nothing specifies whether the estimator was computed manually or not; that is irrelevant. The issue appears in the estimation step. In particular, the second stage runs yi = x̂′iβ + ui, and the OLS estimator of β in it is the 2SLS estimator. Then you may want to estimate σε² in (9) with (n − k − 1)⁻¹ Σᵢ₌₁ⁿ ûi², where ûi = yi − x̂′iβ̂2SLS. However, that is an estimate of Var[ui] = σu², not σε². That is the problem leading to incorrect standard errors using this manual procedure.
So, what should we do? We should use the correct residuals ε̂i = yi − x′iβ̂2SLS to estimate σε² by, say, σ̂ε² = (n − k − 1)⁻¹ Σᵢ₌₁ⁿ ε̂i². The correct standard errors then are given by the square roots of the diagonal elements of
(1/n) σ̂ε² ( Ê[zix′i]′ Ê[ziz′i]⁻¹ Ê[zix′i] )⁻¹
= (1/n) σ̂ε² ( (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i) )⁻¹
= σ̂ε² ( (Σᵢ₌₁ⁿ zix′i)′ (Σᵢ₌₁ⁿ ziz′i)⁻¹ (Σᵢ₌₁ⁿ zix′i) )⁻¹
= σ̂ε² ( X′Z(Z′Z)⁻¹Z′X )⁻¹.
Notice that the last equality comes from the fact that for
X = (x1 x2 · · · xn)′ (so that the rows of X are x′1, . . . , x′n)
we have
X′X = (x1 x2 · · · xn)(x1 x2 · · · xn)′ = Σᵢ₌₁ⁿ xix′i
and similarly for other matrix products.
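A numerical sketch (simulated data of my own; no constant, one endogenous regressor, two instruments) of the wrong and the corrected residual variances:

```python
import numpy as np

# Sketch (simulated, my own) of wrong vs. corrected residual variance in
# manual 2SLS: u_hat uses x_hat, the correct residuals use x.
rng = np.random.default_rng(9)
n = 10_000
z = rng.standard_normal((n, 2))
u = rng.standard_normal(n)
x = z @ np.array([1.0, 0.5]) + u
eps = 0.7 * u + rng.standard_normal(n)
y = 1.5 * x + eps                                  # true beta = 1.5

Z, X = z, x[:, None]
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)              # projection matrix P_Z
Xhat = P @ X                                       # first-stage fitted values
b2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0]    # second-stage OLS

u_hat = y - Xhat @ b2sls                           # WRONG residuals (use x_hat)
e_hat = y - X @ b2sls                              # correct residuals (use x)
s2_wrong = u_hat @ u_hat / (n - 1)
s2_right = e_hat @ e_hat / (n - 1)

V = np.linalg.inv(X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X))
print(np.sqrt(s2_wrong * V))  # naive, incorrect standard error
print(np.sqrt(s2_right * V))  # corrected standard error
```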
41 How do we have multiple expressions for βˆ2SLS? Are they really equivalent? (Added
Dec 8)
Yes, there are multiple equivalent expressions for βˆ2SLS . That is due to several factors: (1) it depends
whether we use matrix notation or vector notation, (2) it depends whether we use PZ (or fitted values)
or explicitly write what PZ is, (3) idempotency of PZ (i.e., PZPZ = PZ). Here I will focus mostly
on (2) and (3).
Let us start with the most intuitive estimator. The OLS estimator from regressing Y on X is (X′X)⁻¹X′Y. We know that in the first stage of 2SLS we obtain fitted values X̂ = PZX, and in the second one regress Y on X̂. Hence, one expression for the 2SLS estimator is
(X̂′X̂)⁻¹X̂′Y. (10)
Next we can replace X̂ by PZX to get
((PZX)′PZX)⁻¹(PZX)′Y = (X′P′ZPZX)⁻¹X′P′ZY,
but this is neither nicer nor gives something more than (10). However, recall that PZ is idempotent and symmetric. Hence,
X′P′ZPZX = X′P′ZX = X′PZX, i.e. X̂′X̂ = X̂′X = X′X̂.
Similarly,
X′P′ZY = X′P′ZPZY = X′PZY, i.e. X̂′Y = X̂′Ŷ = X′Ŷ.
This already gives us many alternative expressions. What else can we do? We can replace PZ by Z(Z′Z)⁻¹Z′. This gives
(X′Z(Z′Z)⁻¹Z′X)⁻¹X′Z(Z′Z)⁻¹Z′Y.
So far we have addressed factors (2) and (3) from above. Regarding (1), whenever we do not have PZ in the expression (it would be awkward with it), we can rewrite it in vector notation (see the explanation in the previous answer). For instance, (10) is the same as
( Σᵢ₌₁ⁿ x̂ix̂′i )⁻¹ Σᵢ₌₁ⁿ x̂iyi.
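The equivalences can be verified numerically (simulated data of my own, with one endogenous and one exogenous regressor and three outside instruments):

```python
import numpy as np

# Numerical check (simulated, my own) that the alternative expressions for
# the 2SLS estimator coincide.
rng = np.random.default_rng(10)
n = 1_000
w = rng.standard_normal((n, 3))                    # outside instruments
u = rng.standard_normal(n)
x_endog = w @ np.array([1.0, 0.5, 0.2]) + u        # endogenous regressor
x_exog = rng.standard_normal(n)                    # exogenous regressor
X = np.column_stack([x_endog, x_exog])
Z = np.column_stack([w, x_exog])                   # all exogenous info, r=4 > k=2
y = X @ np.array([1.0, -1.0]) + 0.5 * u + rng.standard_normal(n)

P = Z @ np.linalg.solve(Z.T @ Z, Z.T)              # projection matrix P_Z
Xhat = P @ X                                       # first-stage fitted values

b1 = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)                    # (10)
b2 = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)                     # via P_Z
b3 = np.linalg.solve(X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X),
                     X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ y))  # explicit
print(b1, b2, b3)  # all three coincide up to rounding
```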
42 How to show asymptotic normality of the 2SLS estimator under homoskedasticity?
(Added Dec 8)
For proving asymptotic normality it is best to use vector notation (something that I mentioned in the very first lecture) due to the fact that all the theorems apply to i.i.d. data, to averages, and so on. So, consider
β̂2SLS = ( (Σᵢ₌₁ⁿ zix′i)′ (Σᵢ₌₁ⁿ ziz′i)⁻¹ (Σᵢ₌₁ⁿ zix′i) )⁻¹ (Σᵢ₌₁ⁿ zix′i)′ (Σᵢ₌₁ⁿ ziz′i)⁻¹ (Σᵢ₌₁ⁿ ziyi).
Substituting yi = x′iβ + εi and adding n⁻¹ factors (without changing anything) as a preparation for the Law of Large Numbers we get
√n(β̂2SLS − β)
= ( (Σᵢ₌₁ⁿ zix′i)′ (Σᵢ₌₁ⁿ ziz′i)⁻¹ (Σᵢ₌₁ⁿ zix′i) )⁻¹ (Σᵢ₌₁ⁿ zix′i)′ (Σᵢ₌₁ⁿ ziz′i)⁻¹ √n ( Σᵢ₌₁ⁿ ziεi )
= ( (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i) )⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ √n ( n⁻¹ Σᵢ₌₁ⁿ ziεi ).
Now by the Law of Large Numbers,
n⁻¹ Σᵢ₌₁ⁿ zix′i →p E[zix′i]  and  n⁻¹ Σᵢ₌₁ⁿ ziz′i →p E[ziz′i].
(The convergence holds because the rank condition guarantees finiteness of the limiting matrices.)
One of our assumptions is that E[ziz′i] is invertible. Hence, by Slutsky's theorem,
( n⁻¹ Σᵢ₌₁ⁿ ziz′i )⁻¹ →p E[ziz′i]⁻¹.
Moreover, we assume that E[zix′i] has full column rank. It can be shown then that E[zix′i]′ E[ziz′i]⁻¹ E[zix′i] is invertible and, hence, by Slutsky's theorem,
( (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i) )⁻¹ →p ( E[zix′i]′ E[ziz′i]⁻¹ E[zix′i] )⁻¹.
By the same arguments,
( n⁻¹ Σᵢ₌₁ⁿ zix′i )′ ( n⁻¹ Σᵢ₌₁ⁿ ziz′i )⁻¹ →p E[zix′i]′ E[ziz′i]⁻¹.
Now we are left to deal with
√n ( n⁻¹ Σᵢ₌₁ⁿ ziεi ).
By exogeneity, E[ziεi] = 0. By conditional homoskedasticity (and the rank condition),
Var[ziεi] = E[εi²ziz′i] = E[ E[εi²ziz′i | zi] ] = E[ E[εi² | zi] ziz′i ] = σε² E[ziz′i] < ∞,
where the rank condition actually guarantees that the expectation is finite (if a matrix is invertible, it must be finite). So, we have shown that the i.i.d. vectors ziεi have zero mean and finite variance.
Using the Central Limit Theorem,
√n ( n⁻¹ Σᵢ₌₁ⁿ ziεi ) →d N( 0, σε² E[ziz′i] ).
We are almost done. By Slutsky's theorem once again we have that
√n(β̂2SLS − β)
= ( (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i) )⁻¹ (n⁻¹ Σᵢ₌₁ⁿ zix′i)′ (n⁻¹ Σᵢ₌₁ⁿ ziz′i)⁻¹ √n ( n⁻¹ Σᵢ₌₁ⁿ ziεi )
→d ( E[zix′i]′ E[ziz′i]⁻¹ E[zix′i] )⁻¹ E[zix′i]′ E[ziz′i]⁻¹ · N( 0, σε² E[ziz′i] ).
Now remember that A · N(0, B) =d N(0, ABA′). Hence, our final (asymptotic) variance is
( (E[zix′i]′E[ziz′i]⁻¹E[zix′i])⁻¹ E[zix′i]′E[ziz′i]⁻¹ ) × ( σε² E[ziz′i] ) × ( (E[zix′i]′E[ziz′i]⁻¹E[zix′i])⁻¹ E[zix′i]′E[ziz′i]⁻¹ )′
= ( (E[zix′i]′E[ziz′i]⁻¹E[zix′i])⁻¹ E[zix′i]′E[ziz′i]⁻¹ ) × ( σε² E[ziz′i] ) × ( E[ziz′i]⁻¹ E[zix′i] (E[zix′i]′E[ziz′i]⁻¹E[zix′i])⁻¹ )
= σε² ( E[zix′i]′E[ziz′i]⁻¹E[zix′i] )⁻¹ ( E[zix′i]′E[ziz′i]⁻¹E[zix′i] ) ( E[zix′i]′E[ziz′i]⁻¹E[zix′i] )⁻¹
= σε² ( E[zix′i]′E[ziz′i]⁻¹E[zix′i] )⁻¹.
Thus,
√n(β̂2SLS − β) →d N( 0, σε² ( E[zix′i]′E[ziz′i]⁻¹E[zix′i] )⁻¹ ).
43 On (i) of A6 in the Final Exam of 2018. (Added Apr 21)
The marginal distribution of X is not said to be related to β, our parameter of interest. Hence, we
consider the conditional distribution of Y given X and Conditional Maximum Likelihood Estimation.
Then what is this conditional distribution? Given xi (which, importantly, is independent of εi),
yi = βxiεi is just a fixed term (βxi) multiplied by a standard normal random variable (εi). Hence,
yi | xi ∼ βxi · N(0, 1) =d N(0, β²xi²). Now recall that the density function of N(µ, σ²) at a point y is
given by
f(y) = ( 1 / (σ√(2π)) ) exp( −(1/2)((y − µ)/σ)² ).
In our case, µ = 0 and σ² = σ²(x) = β²x², so that the model is heteroskedastic. Then
L(β; Y | X) = f(Y | X; β) = Πᵢ₌₁ⁿ f(yi | xi; β) = Πᵢ₌₁ⁿ ( 1 / (β|xi|√(2π)) ) exp( −yi²/(2β²xi²) ).
Thus,
l(β; Y | X) = −(n/2) ln(2π) − n ln β − Σᵢ₌₁ⁿ ln |xi| − Σᵢ₌₁ⁿ (yi/xi)²/(2β²).
The penultimate term is different from the one in the provided solution (involving this mistake), but
it does not affect the CMLE as it does not involve β.
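As a check (simulated data of my own), setting dl/dβ = 0 in the log-likelihood above gives the closed form β̂² = n⁻¹ Σᵢ₌₁ⁿ (yi/xi)², and numerically this indeed maximizes l(β):

```python
import numpy as np

# Sketch (simulated, my own) of the conditional log-likelihood above and the
# closed-form CMLE beta_hat^2 = n^{-1} * sum (y_i/x_i)^2 (from dl/dbeta = 0).
rng = np.random.default_rng(11)
n = 1_000
beta0 = 2.0
x = rng.uniform(0.5, 2.0, n)
y = beta0 * x * rng.standard_normal(n)     # y_i = beta * x_i * eps_i

def loglik(b):
    return (-n / 2 * np.log(2 * np.pi) - n * np.log(b)
            - np.sum(np.log(np.abs(x))) - np.sum((y / x) ** 2) / (2 * b ** 2))

beta_hat = np.sqrt(np.mean((y / x) ** 2))
print(beta_hat)  # close to beta0 = 2
# The closed form is the exact maximizer, so nearby values do no better:
print(loglik(beta_hat) >= max(loglik(beta_hat - 0.01), loglik(beta_hat + 0.01)))
```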