Statistical Modelling I
MATH 50011
Dr Riccardo Passeggeri
Imperial College London
Spring 2023
Last Update: 21/03/2023
Contents
1 Statistical Models
  1.1 Notation
  1.2 Parametric Statistical Models
  1.3 Using Models
2 Point Estimation
  2.1 Properties of Estimators
    2.1.1 Bias
    2.1.2 Standard Error
    2.1.3 Mean Square Error
3 The Cramér-Rao Lower Bound
4 Asymptotic Properties
5 Maximum Likelihood Estimation
  5.1 Properties of Maximum Likelihood estimators
    5.1.1 MLEs are functionally invariant
    5.1.2 Large sample properties
  5.2 Sketch of proofs
6 Confidence Regions
  6.1 Construction of Confidence Intervals
  6.2 Asymptotic confidence intervals
  6.3 Simultaneous Confidence Intervals/Confidence Regions
7 Hypothesis Tests
  7.1 Power of a Test
  7.2 p-value
  7.3 Connection between tests and confidence intervals
8 Likelihood Ratio Tests
9 Linear Models with Second Order Assumptions
  9.1 Simple Linear Regression
  9.2 Matrix Algebra
  9.3 Review of rules for E, cov for random vectors
  9.4 Linear Model
  9.5 Identifiability
  9.6 Least Squares Estimation
  9.7 Properties of Least Squares Estimation
  9.8 Projection Matrices
  9.9 Residuals, Estimation of the variance
10 Linear Models with Normal Theory Assumptions
  10.1 Distributional Results
    10.1.1 The Multivariate Normal Distribution
    10.1.2 Distributions derived from the Multivariate Normal
    10.1.3 Some Independence Results
  10.2 The Linear Model with Normal Theory Assumptions
  10.3 Confidence Intervals, Tests for one-dimensional quantities
  10.4 The F-Test
  10.5 Confidence Regions
11 Diagnostics, Model Selection, Extensions
  11.1 Outliers
  11.2 Leverage
  11.3 Cook's Distance
  11.4 Under/overfitting
  11.5 Weighted Least Squares
  11.6 Residual Plots
  11.7 Distributional Checks
  11.8 Model Selection
  11.9 Generalized Linear Models (GLMs)
Background and Scope
You have now had one or more modules introducing the key concepts of probability
and statistics. In this module, we delve more deeply into selected topics and apply
statistical methods to real data.
Broadly speaking, statistics is all about science, and science is about establishing the validity of hypotheses.
Hypotheses don’t just pop into existence from thin air, though. A common first step
in scientific investigations is thus hypothesis generation. This is typically driven by
observational studies in which an association between events or objects is noted in
existing populations or systems. We must be careful not to ascribe cause and effect
from an observational study unless we have systematically ruled out all confounding
factors (i.e. other objects/events that explain the observed relationship).
Once we have a scientific hypothesis, we are in a position to refine its statement or
confirm its truth. To do this, the gold standard is to design a controlled experiment
in which we directly intervene on a well-defined study population or system.
Statistics can aid the process of scientific inquiry when we can translate our scien-
tific hypothesis into statements or comparisons about one or more numbers, and we
know how we would answer the scientific question if we had knowledge of the entire
population.
The primary goal of this module is to introduce the theory and application of linear
models, which include and substantially generalise the simple linear regression models
you will have previously encountered. In service of this goal, we will also derive further
results that apply to other models. This preamble is not the last time we consider the
role of statistics in science.
1 Statistical Models
The methods we consider deal with the analysis of relationships between measurements
collected on groups of subjects or objects.
When a measurement is expected to vary in response to other variables, we call it a
response, outcome or dependent variable (these terms are interchangeable in the
literature).
The measurements that are believed to lead to a change in the response are called
covariates, explanatory variables, predictor variables, or independent variables.
These terms are not exhaustive. In economics, the terms endogenous (dependent
variable) and exogenous (independent variable) are common.
We must be careful when making causal statements about the relationships between
variables we have called the response and predictor.
In this module, we will consider settings with just one response variable, but allow
there to be possibly many predictor variables. At all times, we model measurements
as samples from some random process.
1.1 Notation
The typical convention of denoting random variables by uppercase italic letters such as
X , Y , Z and observed data values by corresponding lowercase letters such as x , y , z will
be followed. Similarly, Greek letters such as θ, µ, σ, β are used to denote parameters. An
estimator or estimate of a parameter will typically be denoted by placing a decoration
such as a “hat” on the corresponding letter, such as θ̂, µ̂, σ̂, β̂. At all times, exceptions
to these conventions will be made clear by context.
1.2 Parametric Statistical Models
In the following, we suppose that we observe some data y . We interpret this as a
realisation of some random object Y .
A statistical model is a collection of probability distributions {Pθ : θ ∈ Θ} on a given
sample space. The set Θ of all possible parameter values is called the parameter space.
In this module, Θ ⊆ Rp, so that we consider parametric models. A semiparametric
or nonparametric model involves parameters that belong to more general sets (e.g.
function spaces).
Since we interpret the data y as a realisation of a random variable Y, a statistical model
can be seen as a specification of the distribution of Y up to an unknown parameter θ.
A statistical model (or a parametrization) is generally required to be such that distinct
parameter values give rise to distinct distributions. In particular, a statistical model is
called identifiable if the mapping θ 7→ Pθ is one-to-one, that is Pθ1 = Pθ2 ⇒ θ1 = θ2
must hold for all θ1, θ2 ∈ Θ.
Often, the data y = (y1, ... , yn) ∈ Rn is a vector and Y = (Y1, ... , Yn) is a random
vector. In this case the statistical model is the specification of the joint distribution of
Y1, ... , Yn up to some unknown parameter θ.
If Y1, ..., Yn are iid (independent and identically distributed) then Y1, ..., Yn are
called a random sample (or just sample). However, this terminology is not universally
accepted, hence it is always good practice to mention explicitly that the observations
are iid.
Example 1
At a fair, a competition is held where participants guess the weight of an ox based
only on visual inspection. The organizer of the competition has weighed the ox and
knows that the animal weighs exactly 543.4 kg (about 1198 lbs). Suppose that
participants are equally likely to overestimate or underestimate the weight, and that
extremely poor guesses are unlikely. We can then reasonably model the guess y
(measured in kg) of a given participant as the realization of a normally distributed
random variable Y ∼ N(543.4, σ²) for some σ > 0. The collection of n guesses
Y1, ..., Yn is a random sample from the N(543.4, σ²) distribution.
In many situations, Y1, ... , Yn are independent but do not have the same distribu-
tion. For instance, the distribution of Y1, ... , Yn may depend on (nonrandom) values
x1, ... , xn. The xi ’s are an example of covariates.
Example 2
• In a clinical trial, the survival times with a new and an old treatment are
compared. The study enrolls n patients, and the treatments are randomly
allocated. The outcome for the ith patient is Yi = survival time. As covariate
for the ith patient, we may use
x_i = 1 if the patient receives the new treatment, and x_i = 0 if the patient receives the old treatment.
• Do taller people have a higher income?
Y_i = income, x_i = height, i = 1, ..., n.
Model: Y_i = β_0 + β_1 x_i + ε_i, i = 1, ..., n, ε_i ∼ N(0, σ²) iid, θ = (β_0, β_1, σ²),
Θ = R² × [0, ∞).
Other commonly used distributions in statistics also give rise to parametric families. For
instance, if we assume that there exists λ0 > 0 such that a random count Y (e.g. the
number of named storms in a given year) has a Poisson distribution with mean λ0, then
the corresponding parametric model is the set of distributions {Poisson(λ) : λ > 0}.
We will see later how the model that we assume can help us choose between estimators
for a quantity of interest.
1.3 Using Models
Having formulated a model, we can then draw inferences from the sample to answer
our scientific question. Often, the first step is to estimate any unknown parameters
required to answer our scientific question. This step is often called “fitting the model,”
though we stress that this step is really fitting the data to the model.
A point estimate is convenient, because it provides a single number (or vector) to
answer our scientific question. However, we will often be interested in using some
combination of hypothesis tests and confidence intervals to help make decisions about
our scientific question in a manner that accounts for random variation in our sample.
When we adopt a model, we must accept that it will not perfectly reflect reality.
However, we still want the model to be useful! A model should
• ...agree with observed data reasonably well.
• ...be reasonably simple (no more parameters than necessary).
• ...be easy to interpret, e.g. parameters should have a practical meaning.
In service of the above aims, we might conduct sensitivity analyses that try to answer
the question, “Is the model adequate for the data?” If the answer to this is “No,” then
we should refine the model. This may become an iterative procedure, as we will discuss
at the end of the module. It is important that we clearly document this process in any
analysis, to preserve the validity of our primary results.
2 Point Estimation
How can we estimate unknown parameters based on observed data y1, ..., yn? Using
our model, we view y_i as a realisation of the random variable Y_i for i = 1, ..., n.
A function of observable random variables is called a statistic. Any statistic could be
used to estimate θ.
Definition 1
Let t be a statistic. The value t(y1, ... , yn) is called the estimate of θ. The random
variable T = t(Y1, ... , Yn) is called an estimator of θ.
There can be many candidate estimates for a parameter.
Example 3
Model: Y1, ..., Yn ∼ N(µ, 1) iid, µ ∈ R unknown.
Even in this simple situation µ can be estimated in several ways:
• sample mean: y¯ = (1/n) Σ_i y_i
• sample median: y_((n+1)/2) if n is odd and (y_(n/2) + y_(n/2+1))/2 if n is even,
  where y_(1) < · · · < y_(n) is the ordered sample
• trimmed mean: discard the highest and lowest k observed y_i before computing the mean
• ...
For the estimate t(y1, ..., yn) = y¯ the corresponding estimator is T(Y1, ..., Yn) = Y¯ = (1/n) Σ_i Y_i.
Note: T is a random variable. Its distribution may depend on θ = µ. Here: T ∼ N(µ, 1/n).
To prove that T has the above distribution: show that the sum of independent
normal random variables is normally distributed and compute the mean and variance
of T.
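As a small illustration (not part of the original notes), the three estimates can be computed in R for a simulated sample; the trimming proportion passed to mean() plays the role of the count k above.

set.seed(1)
y <- rnorm(20, mean = 3, sd = 1)   # simulated sample with true mu = 3
mean(y)                            # sample mean
median(y)                          # sample median
mean(y, trim = 0.1)                # trimmed mean: drops the lowest and highest 10% of the ordered sample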
Remark We judge how good t is by looking at the properties of T .
2.1 Properties of Estimators
We assume that we have specified a parametric model for the data. That means that
we have a set of possible distributions, indexed by a parameter θ ∈ Θ, that could have
generated the observations. Whenever we are working out properties of estimators
(such as probabilities, expected values or variances), we have to indicate with respect
to which of these distributions we are doing our calculations. To do this we usually
write the parameter as subscript, e.g. we write
Pθ(T ∈ A), Eθ(T ), Varθ(T ).
2.1.1 Bias
The bias of an estimator is the difference between its expected value and the true
value. More precisely:
Definition 2
Let T be an estimator for θ ∈ Θ ⊂ R. The bias of T is defined by biasθ(T ) =
Eθ(T )− θ. If biasθ(T ) = 0 for all θ ∈ Θ, we say that T is unbiased for θ.
Example 4
Let X ∼ Binomial(n, p). Here, p ∈ [0, 1] is the unknown parameter and n is known.
Consider the two estimators S = X/n and T = (X + 1)/(n + 2) for p.
For all p: bias_p(S) = E_p(S − p) = (1/n) E_p(X) − p = 0. Thus S is unbiased for p.
bias_p(T) = E_p(T − p) = E_p[(X + 1)/(n + 2)] − p = (np + 1)/(n + 2) − p = (1 − 2p)/(n + 2).
Thus T is a biased estimator for p. This example shows that the bias may depend
on the model parameters.
More generally, if the parameter space is higher dimensional, e.g. Θ ⊂ R^k for some
k > 1, we may only be interested in g(θ), where g: Θ → R is some function. The
obvious extension of the definition of bias to this situation is bias_θ(T) = E_θ(T) − g(θ).
Example 5
Y1, ... , Yn ∼ N(µ,σ2) independently with unknown parameter θ = (µ,σ2) ∈ Θ =
R× (0,∞). If we are only interested in the mean µ, we may use g(θ) = µ.
Example 6
For a random sample Y1, ..., Yn with E(Y_i) = µ and Var(Y_i) = σ², with µ, σ² unknown:
• Y¯ = (1/n) Σ_{i=1}^n Y_i is unbiased for µ = E(Y). Indeed,
  E(Y¯) = E[(1/n) Σ_{i=1}^n Y_i] = (1/n) Σ_{i=1}^n E(Y_i) = E(Y) = µ.
• s² = (1/(n − 1)) Σ_{i=1}^n (Y_i − Y¯)² is unbiased for σ² = Var(Y). Indeed,
  Σ_i (Y_i − Y¯)² = Σ_i Y_i² − (1/n) Σ_{i,j} Y_i Y_j = (1 − 1/n) Σ_i Y_i² − (1/n) Σ_{i≠j} Y_i Y_j,
  and hence, using E(Y_i Y_j) = E(Y_i) E(Y_j) for i ≠ j,
  E(s²) = (1/(n − 1)) [ ((n − 1)/n) Σ_i E(Y_i²) − (1/n) Σ_{i≠j} E(Y_i) E(Y_j) ] = E(Y²) − (E Y)² = σ².
Thus, (Y¯, s²) is an unbiased estimator of (µ, σ²).
However, in general: Y¯² is not unbiased for µ² and s is not unbiased for σ (see
Problem Sheet).
Remark T unbiased for θ does not imply h(T ) unbiased for h(θ).
2.1.2 Standard Error
The bias measures how far the center of the sampling distribution of T is from the true
parameter θ. A natural measure of spread for an estimator is its standard deviation.
Definition 3
Let T be an estimator for θ ∈ Θ ⊂ R. The standard error of T is the standard
deviation of the sampling distribution of T. Therefore, SE_θ(T) = √Var_θ(T).
We will later see that many confidence intervals and test statistics involve the (approx-
imate) standard error of an estimator T .
Example 7
Let X1, ..., Xn be i.i.d. with mean E(X1) = µ, variance Var(X1) = σ², and define
X¯ = (1/n) Σ_{i=1}^n X_i. The standard error of the sample mean is SE(X¯) = σ/√n. We
estimate this by ŜE(X¯) = √(S²)/√n = S/√n, where S² is the unbiased sample variance.
2.1.3 Mean Square Error
Definition 4
Let T be an estimator for θ ∈ Θ ⊂ R. The mean square error of T is defined as
MSE_θ(T) = E_θ[(T − θ)²].
Lemma 1
MSE_θ(T) = Var_θ(T) + (bias_θ(T))².
Proof Problem sheet (though this should be revision!).
Remark The MSE is a good criterion for selecting an estimator as it takes both the
bias and the variance of an estimator into account.
The following example shows that a biased estimator can outperform an unbiased
estimator.
Example 8
X ∼ Binomial(n, p). n is known; the unknown parameter is p ∈ [0, 1].
Consider the two estimators S = X/n and T = (X + 1)/(n + 2).
We already know that S is unbiased for p and that bias_p(T) = (1 − 2p)/(n + 2).
Thus MSE_p(S) = Var_p(S) = (1/n²) Var_p(X) = p(1 − p)/n.
Furthermore, Var_p(T) = (1/(n + 2)²) Var_p(X) = np(1 − p)/(n + 2)², and thus
MSE_p(T) = Var_p(T) + (bias_p(T))² = [np(1 − p) + (1 − 2p)²]/(n + 2)².
For p = 0 and p = 1, MSE_p(T) = 1/(n + 2)² > 0 = MSE_p(S).
However, for p = 1/2, MSE_{1/2}(T) = n/(4(n + 2)²) < n/(4n²) = 1/(4n) = MSE_{1/2}(S).
Since MSE_p(T) and MSE_p(S) are quadratic in p, this implies that there exist 0 < p1 < p2 < 1 such that
MSE_p(T) < MSE_p(S) for all p ∈ (p1, p2), and
MSE_p(T) > MSE_p(S) for all p ∈ [0, p1) ∪ (p2, 1].
The figure below illustrates this behavior for n = 5, 10, 30:
[Figure: MSE_p(S) and MSE_p(T) plotted against p, for n = 5, 10, 30.]
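The omitted figure is easy to recreate; a plotting sketch in R using the formulas just derived (shown for n = 5):

mse_S <- function(p, n) p * (1 - p) / n
mse_T <- function(p, n) (n * p * (1 - p) + (1 - 2 * p)^2) / (n + 2)^2
p <- seq(0, 1, length.out = 200)
plot(p, mse_S(p, 5), type = "l", ylab = "MSE", main = "n = 5")    # solid: S = X/n
lines(p, mse_T(p, 5), lty = 2)                                    # dashed: T = (X+1)/(n+2)
legend("top", legend = c("S = X/n", "T = (X+1)/(n+2)"), lty = 1:2)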
3 The Cramér-Rao Lower Bound
When we choose between estimators, we would ideally know how the best possible
estimator would perform. Unfortunately, minimizing criteria such as MSE across all
parameter values is typically impossible. This means that we must restrict our attention
to a set of reasonable estimators before we can discuss optimality. The set of unbiased
estimators is one sensible such choice.
In fact, under weak conditions, we can derive a lower bound on the variance of an
unbiased estimator (and hence its MSE). Regularity conditions will not be looked at in
detail in this course - see separate course on Statistical Theory. We will only consider
the case of unbiased estimators in which the parameter space is one-dimensional. There
exist generalisations to biased estimators and to higher-dimensional situations.
Theorem 1 (Cramér-Rao lower bound)
Suppose T = T(X) is an unbiased estimator for θ ∈ Θ ⊂ R based on X = (X1, ..., Xn) with joint pdf f_θ(x). Under mild regularity conditions,
Var_θ(T) ≥ 1/I(θ),
where
I(θ) = E_θ[ (∂/∂θ log f_θ(X))² ]
is the Fisher information of the sample.
The Fisher information can also be written as
I(θ) = −E_θ[ ∂²/∂θ² log f_θ(X) ].
Remark Suppose X1, ..., Xn are a random sample. Then
f_θ(x) = Π_{i=1}^n f_θ^{(1)}(x_i),
where x = (x1, ..., xn) and f_θ^{(1)} is the pdf/pmf of a single observation. This implies
I_f(θ) = −E_θ[ ∂²/∂θ² log f_θ(X) ] = Σ_{i=1}^n −E_θ[ ∂²/∂θ² log f_θ^{(1)}(X_i) ] = n I_{f^{(1)}}(θ).
Thus for a random sample, the Fisher information is proportional to the sample size.
Example 9 (Sample Proportions)
X1, ..., Xn ∼ Bernoulli(θ) independently, θ ∈ Θ = (0, 1).
We want to compute I_f(θ). We are dealing with a random sample, thus I_f(θ) = n I_{f^{(1)}}(θ).
Now f_θ^{(1)}(0) = P_θ(X1 = 0) = 1 − θ and f_θ^{(1)}(1) = P_θ(X1 = 1) = θ. Putting this together gives
f_θ^{(1)}(x) = θ^x (1 − θ)^{1−x}.
Then
∂/∂θ log f_θ^{(1)}(x) = ∂/∂θ [x log θ + (1 − x) log(1 − θ)] = x/θ − (1 − x)/(1 − θ)
and
∂²/∂θ² log f_θ^{(1)}(x) = −x/θ² − (1 − x)/(1 − θ)².
Hence,
I_{f^{(1)}}(θ) = −E_θ[ ∂²/∂θ² log f_θ^{(1)}(X1) ] = E_θ(X1)/θ² + (1 − E_θ(X1))/(1 − θ)²
= θ/θ² + (1 − θ)/(1 − θ)² = 1/θ + 1/(1 − θ) = 1/(θ(1 − θ)).
Thus I_f(θ) = n I_{f^{(1)}}(θ) = n/(θ(1 − θ)).
Hence, for any unbiased estimator T for θ,
Var_θ(T) ≥ θ(1 − θ)/n.
Consider S = (1/n) Σ_{i=1}^n X_i. Then
Var(S) = (1/n²) · n Var(X1) = θ(1 − θ)/n.
Hence, S has minimal variance among all unbiased estimators for θ.
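A quick Monte Carlo check of the bound (an illustrative sketch, not from the notes): simulate many Bernoulli samples, estimate Var(S) empirically, and compare it with θ(1 − θ)/n.

set.seed(42)
n <- 30; theta <- 0.3; nrep <- 10000
S <- replicate(nrep, mean(rbinom(n, size = 1, prob = theta)))  # sample proportions
var(S)                   # Monte Carlo estimate of Var(S)
theta * (1 - theta) / n  # Cramér-Rao lower bound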
Proof (Sketch of the derivation of the Cramér-Rao bound.)
Using the Cauchy-Schwarz inequality [(E(YZ))² ≤ E(Y²) E(Z²) for square integrable Y, Z],
Var_θ(T) I_f(θ) = E_θ[(T − E_θ T)²] E_θ[(∂/∂θ log f_θ(X))²]
≥ ( E_θ[ (T − E_θ T) ∂/∂θ log f_θ(X) ] )².
Now,
E_θ[ (T − E_θ T) ∂/∂θ log f_θ(X) ] = E_θ[ (T − E_θ T) (∂/∂θ f_θ(X)) / f_θ(X) ]
= ∫ (T(x) − E_θ T) [ (∂/∂θ f_θ(x)) / f_θ(x) ] f_θ(x) dx
= ∫ T(x) ∂/∂θ f_θ(x) dx − E_θ(T) ∫ ∂/∂θ f_θ(x) dx
= ∂/∂θ ∫ T(x) f_θ(x) dx − E_θ(T) ∂/∂θ ∫ f_θ(x) dx
= ∂/∂θ E_θ(T) − 0
= ∂/∂θ θ = 1.
Thus,
Var_θ(T) ≥ 1/I_f(θ).
Proof (Sketch of proof that the two formulations of the Fisher information are the same.)
We need to show E_θ[(∂/∂θ log f_θ(X))²] = −E_θ[∂²/∂θ² log f_θ(X)]. Writing ℓ(θ) = log f_θ(X), we have
∂²ℓ/∂θ² = (∂²/∂θ² f_θ(X))/f_θ(X) − (∂ℓ/∂θ)².
Taking expectations and using that (under the regularity conditions) ∫ ∂²/∂θ² f_θ(x) dx = ∂²/∂θ² ∫ f_θ(x) dx = 0 gives
−E_θ[∂²ℓ/∂θ²] = E_θ[(∂ℓ/∂θ)²].
4 Asymptotic Properties
Many estimators do not admit straightforward comparison at a fixed sample size n.
This may be the case if the bias and variance involve complex calculations, or are
not even available in closed form. We can often simplify our comparisons by using
approximations based on large sample (asymptotic) theory. Thus, we now compare
the performance of sequences of estimators as the sample size n increases.
To motivate which large sample properties should be evaluated, consider the familiar
mean of a Gaussian random sample.
Example 10
Y1, Y2, ... ∼ N(θ, 1) independent, θ ∈ R.
Consider T_n = (1/n) Σ_{i=1}^n Y_i as estimator for θ. Then
E(T_n) = E[(1/n) Σ_{i=1}^n Y_i] = (1/n) Σ_{i=1}^n E(Y_i) = θ,
Var(T_n) = (1/n²) Var(Σ_{i=1}^n Y_i) = (1/n²) Σ_{i=1}^n Var(Y_i) = 1/n.
As the sum of independent normally distributed random variables is normally distributed, this
implies T_n ∼ N(θ, 1/n).
[Figure: pdf of T_n, centred at θ, for n = 5, 10, 20; the density becomes more concentrated around θ as n grows.]
As n increases, the estimator T_n becomes more precise: it fluctuates less around the
true value θ.
A formal concept for an estimator that gets “perfect” as n → ∞ is the following.
Definition 5
A sequence of estimators (T_n)_{n∈N} for g(θ) is called (weakly) consistent if for all θ ∈ Θ:
T_n →^{P_θ} g(θ) as n → ∞.
Recall that T_n →^{P_θ} g(θ) (n → ∞) denotes convergence in probability and is defined by:
for all ε > 0: lim_{n→∞} P_θ(|T_n − g(θ)| < ε) = 1.
Usually, T_n depends only on Y1, ..., Yn.
Lemma 2 (Portmanteau Lemma)
Let X and X_n, n ∈ N, be real-valued random variables. The following statements are equivalent:
• X_n →^d X, as n → ∞;
• E[f(X_n)] → E[f(X)] as n → ∞ for all bounded and continuous functions f: R → R.
A consistent estimator has a high probability of being close to the true parameter
value. Here, closeness is measured by the Euclidean distance. Showing consistency via
the definition can be tedious! Hence, we will introduce a simple sufficient condition
for consistency.
Definition 6
A sequence of estimators (Tn)n∈N for g(θ) is called asymptotically unbiased if for
all θ ∈ Θ:
Eθ(Tn)→ g(θ) (n→∞)
Lemma 3
Suppose (T_n) is asymptotically unbiased for g(θ) and for all θ ∈ Θ,
Var_θ(T_n) → 0 as n → ∞.
Then (T_n) is consistent for g(θ).
Proof Recall Markov's inequality: P(|X| ≥ a) ≤ E|X|/a for a > 0.
[Proof: a I(|X| ≥ a) ≤ |X|. Hence, P(|X| ≥ a) = E[I(|X| ≥ a)] ≤ (1/a) E|X|.]
Applying Markov's inequality, for all ε > 0,
P_θ(|T_n − g(θ)| ≥ ε) = P_θ((T_n − g(θ))² ≥ ε²) ≤ E_θ[(T_n − g(θ))²]/ε²
= MSE_θ(T_n)/ε² = (1/ε²) [ Var_θ(T_n) + (E_θ(T_n) − g(θ))² ] → 0,
since both terms in the bracket tend to 0.
Example 11 (Sample Proportions)
X_i ∼ Bernoulli(θ) iid, θ ∈ Θ = [0, 1].
T_n(x1, ..., xn) = (1/n) Σ_{i=1}^n x_i.
E_θ[T_n(X1, ..., Xn)] = θ for all θ ∈ Θ, and Var_θ(T_n) = θ(1 − θ)/n → 0 as n → ∞.
By Lemma 3, T_n is consistent.
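A small simulation (an illustrative sketch) makes the concentration visible: the probability that T_n lies within 0.05 of θ approaches 1 as n grows.

set.seed(1)
theta <- 0.3
for (n in c(10, 100, 10000)) {
  Tn <- replicate(1000, mean(rbinom(n, size = 1, prob = theta)))
  cat("n =", n, ": P(|Tn - theta| < 0.05) approx", mean(abs(Tn - theta) < 0.05), "\n")
}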
Consistency is only a minimal requirement for an estimator. To derive valid hypothesis
tests and confidence intervals we need the sampling distribution of our estimators.
Example 12
Returning to the example at the beginning of the subsection, where Y1, Y2, ... ∼ N(θ, 1): we have shown T_n ∼ N(θ, 1/n). Rephrasing this gives
√n (T_n − θ) ∼ N(0, 1).
The distribution of many estimators cannot be computed as nicely as in the above
example. However, the distribution of many estimators can be approximated by a
normal distribution. The following property holds true for many estimators.
Definition 7
A sequence T_n of estimators for θ ∈ R is called asymptotically normal if
√n (T_n − θ) →^d N(0, σ²(θ))
for some σ²(θ).
The following important result (that you have encountered before) demonstrates that
sample averages are asymptotically normal.
Theorem 2 (Central Limit Theorem)
Let Y1, ..., Yn be iid random variables with E(Y_i) = µ and Var(Y_i) = σ². Then the
sequence √n (Y¯ − µ) converges in distribution to a N(0, σ²) distribution.
Example 13 (Sample Proportions)
Y1, Y2, ... ∼ Bernoulli(θ), independent, θ ∈ Θ = [0, 1].
Consider the estimator T_n = (1/n) Σ_{i=1}^n Y_i for θ.
Using the Central Limit Theorem,
√n (T_n − θ) = (1/√n) Σ_{i=1}^n (Y_i − θ) →^d N(0, θ(1 − θ)).
Thus T_n is asymptotically normal. [Recall: Var(Y1) = θ(1 − θ).]
Remark Under some mild regularity conditions, the standard error of an asymptotically
normal estimator T_n can be approximated by SE_θ(T_n) ≈ σ(T_n)/√n.
While we are often interested in estimators other than the sample mean, many esti-
mators can be closely related to such averages. The following result, stated without
proof, is typically called Slutsky’s lemma.
Lemma 4 (Slutsky)
Let X_n, X and Y_n be random variables (or vectors). If X_n →^d X and Y_n →^p c for a constant c, then
(i) X_n + Y_n →^d X + c;
(ii) Y_n X_n →^d cX;
(iii) Y_n^{-1} X_n →^d c^{-1} X, provided c ≠ 0.
Example 14 (Sample Proportions)
Suppose that X ∼ Binomial(n, p) where p is unknown. The estimator sequence T_n = (X + 1)/(n + 2) is asymptotically normal:
√n (T_n − p) →^d N(0, p(1 − p)).
To see this, write T_n as A_n (X/n) + B_n for scalar sequences A_n and B_n
such that A_n → 1 and B_n → 0 as n → ∞. To finish the proof, apply Slutsky's
lemma twice.
What does this suggest about the use of T_n and X/n as estimators of p for large
sample sizes?
Another important tool for deriving the sampling distribution of more complex estima-
tors is the so-called delta method.
Theorem 3 (Delta Method)
Suppose that T_n is an asymptotically normal estimator of θ with
√n (T_n − θ) →^d N(0, σ²(θ)).
Let g: Θ → R be a differentiable function with g′(θ) ≠ 0. Then,
√n (g(T_n) − g(θ)) →^d N(0, g′(θ)² σ²(θ)).
Example 15 (Sample Odds)
Recall that the odds of an event A are defined as P(A)/(1 − P(A)). Let Y1, ..., Yn be
iid Bernoulli(p) random variables. We consider estimating the odds that Y_i = 1.
We have that X = Σ_i Y_i ∼ Binomial(n, p). Let S_n = X/n. We know that
√n (S_n − p) →^d N(0, p(1 − p)).
The sample odds are T_n = S_n/(1 − S_n) = g(S_n) for the function g(s) = s/(1 − s). The
first derivative of g is g′(s) = (1 − s)^{-2}. By the delta method,
√n (T_n − p/(1 − p)) = √n (g(S_n) − g(p)) →^d N(0, (1 − p)^{-3} p).
Theorem 4 (Continuous Mapping Theorem)
Let k, m ∈ N and let X and X_n, n ∈ N, be R^k-valued random variables (i.e. random
vectors of dimension k). Let g: R^k → R^m be a function continuous at every point
of a set C such that P(X ∈ C) = 1.
• If X_n →^d X, then g(X_n) →^d g(X), as n → ∞.
• If X_n →^p X, then g(X_n) →^p g(X), as n → ∞.
• If X_n →^{a.s.} X, then g(X_n) →^{a.s.} g(X), as n → ∞.
5 Maximum Likelihood Estimation
One common method of finding an estimator for θ is maximum likelihood. This method
is very widely applicable, and variations on maximum likelihood remain important for
current research.
Recall that the main idea behind maximum likelihood estimation is to find the param-
eter value for which the observed data is most “likely” and report this as our point
estimate. For discrete distributions, we can directly consider probabilities based on the
mass function.
Example 16
Suppose we observe the number of successes in 10 independent trials (outcomes:
1 = success, 0 = failure) and we know that the success probability θ is an element of
Θ = {0, 1/4, 1/2, 3/4, 1}.
Model: X ∼ Binomial(10, θ), θ ∈ Θ.
The probability of a given outcome is
P_θ(X = x) = (10 choose x) θ^x (1 − θ)^{10−x},  x = 0, ..., 10.
The following table shows the probability of every outcome for all possible parameter
values, i.e. P_θ(X = x). The model specifies every row of the table.
           x=0  x=1  x=2  x=3  x=4  x=5  x=6  x=7  x=8  x=9  x=10
θ = 0      1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
θ = 0.25   0.06 0.19 0.28 0.25 0.15 0.06 0.02 0.00 0.00 0.00 0.00
θ = 0.5    0.00 0.01 0.04 0.12 0.21 0.25 0.21 0.12 0.04 0.01 0.00
θ = 0.75   0.00 0.00 0.00 0.00 0.02 0.06 0.15 0.25 0.28 0.19 0.06
θ = 1      0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
In any given experiment we observe a specific outcome x. The maximum likelihood
estimate is the value of θ that maximises P_θ(X = x). For example, if we observe x = 2 or x = 3,
then the maximum likelihood estimate is θ̂ = 0.25; if we observe x = 4 the maximum
likelihood estimate is θ̂ = 0.5.
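The table can be reproduced in R (a sketch, rounding to two decimals as in the notes):

theta <- c(0, 0.25, 0.5, 0.75, 1)
probs <- t(sapply(theta, function(th) dbinom(0:10, size = 10, prob = th)))
dimnames(probs) <- list(paste("theta =", theta), paste0("x=", 0:10))
round(probs, 2)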
Example 17
We now make the previous example more realistic and use the parameter space Θ = [0, 1].
Suppose we observe x = 5 successes. The function
L: Θ → [0, ∞),  L(θ) = P_θ(X = 5) = (10 choose 5) θ^5 (1 − θ)^5
is called the likelihood function. The maximum likelihood estimate is the maximiser
of L, which in this case turns out to be θ̂ = 0.5.
The following figure contains the likelihood function for several different outcomes x.
[Figure: likelihood functions L(θ), θ ∈ [0, 1], for the outcomes x = 0, 3, 5 and 10.]
In general, suppose we observe the random object Y with realisation y. The likelihood function is
L(θ) = L(θ; y) = P(Y = y; θ) for discrete data, and L(θ) = L(θ; y) = f_Y(y; θ) for absolutely continuous data.
Thus the likelihood function is the joint pdf/pmf of the observed data, considered as a
function of the unknown parameter.
Consider the case of a random sample, i.e. Y = (Y1, ..., Yn) and Y1, ..., Yn are iid. If
the model specifies that Y_i has pdf f(·; θ) then
L(θ) = Π_{i=1}^n f(y_i; θ).
Definition 8
A maximum likelihood estimator (MLE) of θ is an estimator θ̂ s.t.
L(θ̂) = sup_{θ∈Θ} L(θ).
Usually, the MLE is well defined. However, one can construct situations in which it
does not exist or is not unique.
Example 18 (Poisson distribution)
X1, ..., Xn ∼ Poisson(θ) independent, θ > 0. Then
L(θ) = P_θ(X1 = x1, ..., Xn = xn) = Π_{i=1}^n P_θ(X_i = x_i) = Π_{i=1}^n (θ^{x_i}/x_i!) e^{−θ}
and log L(θ) = Σ_{i=1}^n [x_i log θ − log(x_i!) − θ]. Differentiating and equating to 0 gives
∂/∂θ log L(θ) = Σ_{i=1}^n [x_i/θ − 1] = (1/θ) Σ_{i=1}^n x_i − n = 0.
This gives θ̂ = (1/n) Σ_{i=1}^n x_i as a candidate for a maximum likelihood estimator. You
need to check that θ̂ is actually a maximiser - we omit this here. Once this is done
we know that θ̂ = (1/n) Σ_{i=1}^n x_i is the maximum likelihood estimate and θ̂ = (1/n) Σ_{i=1}^n X_i
is the maximum likelihood estimator.
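As a sanity check (an illustrative sketch), the analytical MLE can be compared with a numerical maximisation of the log-likelihood in R:

set.seed(3)
x <- rpois(50, lambda = 2.5)
loglik <- function(theta) sum(dpois(x, lambda = theta, log = TRUE))
optimize(loglik, interval = c(0.01, 20), maximum = TRUE)$maximum  # numerical MLE
mean(x)                                                           # analytical MLE, theta-hat = x-bar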
Example 19 (Survival of Leukemia Patients)
This data set contains the survival time y_i (in weeks) and x_i = log10(initial white blood cell count) for 17 leukemia patients.
xi 3.36 2.88 3.63 3.41 3.78 4.02 4 4.23 3.73 3.85 3.97 4.51 4.54 5 5 4.72 5
yi 65 156 100 134 16 108 121 4 39 143 56 26 22 1 1 5 65
[Figure: scatter plot of survival time y_i against x_i = log10(initial white blood cell count).]
We choose to treat x1, ..., xn as constants (we model Y_i conditional on the observed x_i). Suppose we use the model
Y_i = α exp(β(x_i − x¯)) ε_i,  i = 1, ..., n,
where ε_i ∼ Exp(1) iid and θ = (α, β)^T ∈ (0, ∞) × R.
To work out the MLE: first, note that Y_i ∼ Exp(λ_i) with λ_i = 1/(α exp[β(x_i − x¯)]).
Furthermore, the pdf of Y_i is f_{Y_i}(y_i) = λ_i e^{−λ_i y_i}. Hence,
L(θ) = Π_i λ_i e^{−λ_i y_i} = (1/α^n) exp[ −β Σ_i (x_i − x¯) − (1/α) Σ_i y_i exp(−β(x_i − x¯)) ],
log L(θ) = −n log α − β Σ_i (x_i − x¯) − (1/α) Σ_i y_i exp(−β(x_i − x¯)).
Differentiate wrt α, β and solve numerically (or optimise numerically). This can be
done in R by running the code:
> dat <- read.csv("leuk.dat")
> g <- function(par){
-sum(log(dexp(dat$y,1/(par[1]*exp(par[2]*(dat$x-mean(dat$x)))))))
}
> fit <- optim(g,par=c(1,1),hessian=TRUE)
> fit$par
[1] 51.086326 -1.110557
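Because optim is minimising the negative log-likelihood g and was called with hessian=TRUE, the inverse of fit$hessian approximates the covariance matrix of the MLE; approximate standard errors for (α, β) then follow as below. This is a possible follow-up step, not part of the original transcript, and it anticipates the large-sample theory of Section 5.1.2.
> sqrt(diag(solve(fit$hessian)))   # approximate standard errors of (alpha, beta)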
Remark In fact, this data set is a bit unrealistic. Survival time is observed for all
patients. In practice such a study may take a long time. At the time the study is
evaluated, for many patients it will only be known that they did not die before a given
time (their observation time is said to be censored). The third year module on Survival
Models covers many methods appropriate for censored data.
The following example demonstrates that MLEs can fail to be unbiased.
Example 20 (MLEs are not necessarily unbiased)
Y1, ..., Yn ∼ N(µ, σ²) iid with µ and σ² unknown. Then the MLEs are µ̂ = (1/n) Σ_{i=1}^n y_i
and σ̂² = (1/n) Σ_i (y_i − y¯)² (check!). µ̂ is unbiased but σ̂² is not:
E(σ̂²) = (1/n) E[ Σ_i (Y_i − Y¯)² ] = ((n − 1)/n) σ² ≠ σ².
It is straightforward to derive an MLE, but we need to understand why we would prefer
to use the MLE over other estimators (especially if they may be biased). We will see
in the next section that the MLE is asymptotically well behaved.
5.1 Properties of Maximum Likelihood estimators
5.1.1 MLEs are functionally invariant
In this section we show that MLEs are invariant under reparametrisations of the pa-
rameter space. More precisely:
If g is a bijective function and if θˆ is an MLE of θ, then φˆ = g(θˆ) is an MLE of
φ = g(θ).
This can be seen as follows. Suppose g: Θ → Φ. Since g is bijective, g has an
inverse, which we denote by g^{−1}.
We will denote the likelihood under the new parametrisation by L̃. It is (using the
notation for the discrete observation case, assuming that the observation is x)
L̃(φ) = P_φ(X = x) = P_{g^{−1}(φ)}(X = x) = L(g^{−1}(φ)).
To see that φ̂ = g(θ̂) maximises L̃: for all φ ∈ Φ,
L̃(φ̂) = L(g^{−1}(φ̂)) = L(g^{−1}(g(θ̂))) = L(θ̂) ≥ L(g^{−1}(φ)) = L̃(φ).
Thus φ̂ maximises L̃.
Example 21
Y1, ..., Yn ∼ N(θ, 1) iid, θ ∈ Θ = R.
The MLE of θ is θ̂ = (1/n) Σ_{i=1}^n Y_i.
What is the MLE of φ = θ + 2?
φ = θ + 2 =: g(θ). Clearly, g: Θ → R = Φ is bijective.
Hence, φ̂ := g(θ̂) = θ̂ + 2 = (1/n) Σ_{i=1}^n Y_i + 2 is the MLE of φ.
Remark What if g is not bijective?
Recall: let f: A → B be a function. It is injective iff for all a1, a2 ∈ A: f(a1) = f(a2) ⇒
a1 = a2. It is surjective iff for every b ∈ B there exists a ∈ A with f(a) = b. It is bijective if it is both
injective and surjective.
If g is not surjective then there are φs that are not in the range of g , implying that
for these parameter values no model is defined. It does not really make sense to speak
of the likelihood of these parameter values, so one should set the likelihood for these
to the lowest possible value (which is 0). Doing this, one can easily argue that the
invariance of the MLE under the transformation induced by g is retained.
If g is not injective then knowing φ does not uniquely identify the parameter θ or the
model. One way to define the likelihood on Φ is the “induced” likelihood function
L̃: Φ → R, L̃(φ) = sup{L(θ): g(θ) = φ}.
This gives every φ ∈ Φ the highest likelihood of all θ that g maps onto it.
With this definition the invariance is retained: If θˆ is the MLE then φˆ = g(θˆ) maximises
the induced likelihood function L˜.
Example 22
(continued from the previous example)
Consider g: R → [0, ∞) = Φ, g(θ) = θ². This g is not bijective.
Then
L̃(φ) = sup{L(θ): g(θ) = φ} = sup{L(θ): θ² = φ}
= sup{L(θ): θ ∈ {−√φ, √φ}}
= max(L(−√φ), L(√φ)).
5.1.2 Large sample properties
In most cases, MLEs are defined as the solution to a system of equations. A natural
question is then, “What happens as n→∞?”
Theorem 5
Let X1, X2, ... be iid observations with pdf (or pmf) fθ(x), where θ ∈ Θ and Θ is an
open interval. Let θ0 ∈ Θ denote the true parameter. Under regularity conditions
(e.g. {x : fθ(x) > 0} does not depend on θ), the following holds:
(i) There exists a consistent sequence (θˆn)n∈N of maximum likelihood estimators.
[θˆn is an MLE based on X1, ... , Xn].
(ii) Suppose (θ̂_n)_{n∈N} is a consistent sequence of MLEs. Then
√n (θ̂_n − θ0) →^d N(0, (I_f(θ0))^{−1}),
where I_f(θ) = E_θ[(∂/∂θ log f_θ(X))²] is the Fisher Information of a sample of size 1.
Remark (i) implies: if the MLE is unique (for every n) then the sequence of MLEs is
consistent.
Example 23
X1, X2, ... ∼ Bernoulli(θ) independently, θ ∈ (0, 1).
We have already shown that θ̂_n = (1/n) Σ_{i=1}^n X_i is the unique MLE based on X1, ..., Xn.
Furthermore, we know I_f(θ) = 1/(θ(1 − θ)). Thus,
√n (θ̂_n − θ0) →^d N(0, θ0(1 − θ0)).
[We have previously shown this directly using the central limit theorem.]
In other words, the distribution of θ̂_n can be approximated by an N(θ0, θ0(1 − θ0)/n) distribution.
Remark The above theorem has a limiting distribution that depends on I_f(θ0), which
would not be known in practical situations. To use this result, we need to estimate I_f(θ0).
In an iid sample, I_f(θ0) can be estimated by
• I_f(θ̂)
• (1/n) Σ_{i=1}^n ( ∂/∂θ log f(x_i; θ)|_{θ=θ̂} )²
• −(1/n) Σ_{i=1}^n ∂²/∂θ² log f(x_i; θ)|_{θ=θ̂}
These estimators are often consistent, i.e. they will converge to I_f(θ0) in probability.
Remark Under mild regularity conditions, the standard error of an asymptotically
normal MLE θ̂_n can be approximated by SE(θ̂_n) ≈ √(Î_n^{−1}/n). Here, Î_n can be any of
the estimators for I_f(θ0) listed above.
Remark Multivariate version: suppose Θ ⊂ R^k is an open set and θ̂_n is the MLE
based on n observations. Then the equivalent result to the above theorem is
√n (θ̂_n − θ0) →^d N(0, (I_f(θ0))^{−1}),
where θ0 denotes the true parameter and I_f(θ) is the Fisher Information Matrix given by
I_f(θ) := E_θ[(∇ log f(X; θ))^T (∇ log f(X; θ))] = −E_θ[∇^T ∇ log f(X; θ)].
[∇ denotes the gradient wrt θ, i.e. the row vector of partial derivatives wrt the
elements of θ. Convergence in distribution for random vectors is defined as follows:
let X, X1, X2, ... be random vectors of dimension k. Then X_n →^d X (n → ∞) if
P(X_n ≤ z) → P(X ≤ z) (n → ∞) for all z ∈ R^k at which z ↦ P(X ≤ z) is continuous.]
5.2 Sketch of proofs
Proof (Sketch of the proof of (i) of Theorem 5.) The likelihood is L(θ) = Π_{i=1}^n f_θ(X_i). Let
S_n(θ) := (1/n) log L(θ) = (1/n) Σ_{i=1}^n log f_θ(X_i) = (1/n) Σ_{i=1}^n Z_i,  where Z_i := log f_θ(X_i).
θ̂ maximises L(θ) ⇔ θ̂ maximises S_n(θ).
Z1, Z2, ... are iid. Thus, for each θ, by the law of large numbers,
S_n(θ) → E_{θ0}(Z1) = E_{θ0}[log f_θ(X1)] =: R(θ).
We next show that θ0 maximises R(θ). Indeed, for all θ,
R(θ) − R(θ0) = E_{θ0}[ log(f_θ(X1)/f_{θ0}(X1)) ] ≤ E_{θ0}[ f_θ(X1)/f_{θ0}(X1) − 1 ]   (using log z ≤ z − 1 for all z > 0)
= ∫ ( f_θ(x)/f_{θ0}(x) − 1 ) f_{θ0}(x) dx = ∫ f_θ(x) dx − ∫ f_{θ0}(x) dx = 1 − 1 = 0.
Thus, we know that S_n → R pointwise, θ̂_n maximises S_n(θ) and θ0 maximises R(θ).
This indicates that θ̂_n → θ0. [For a formal proof see books on mathematical statistics.]
Proof (Sketch of the proof of (ii) of Theorem 5.) The MLE θ̂ satisfies ∂/∂θ log L(θ)|_{θ=θ̂} = 0. Hence,
0 = (1/√n) ∂/∂θ log L(θ)|_{θ=θ̂} = (1/√n) Σ_{i=1}^n ∂/∂θ log f(x_i; θ)|_{θ=θ̂} = g(θ̂),
where g(θ) := (1/√n) Σ_{i=1}^n ∂/∂θ log f(x_i; θ). By a Taylor expansion,
g(θ̂) = g(θ0) + (1/√n) g′(θ̃) √n (θ̂ − θ0)
for some θ̃ between θ̂ and θ0. Hence,
√n (θ̂ − θ0) = g(θ0) / ( −(1/√n) g′(θ̃) ).
For all θ, by the weak law of large numbers,
−(1/√n) g′(θ) = −(1/n) Σ_{i=1}^n ∂²/∂θ² log f(X_i; θ) = −(1/n) Σ_{i=1}^n η_i →^{P_{θ0}} −E[ ∂²/∂θ² log f(X1; θ) ] = I_f(θ),
as η1, η2, ... are iid.
One can show that θ̂ →^P θ0 implies θ̃ →^P θ0 and that −(1/√n) g′(θ̃) →^P I_f(θ0).
We want to use the CLT to get the asymptotic distribution of g(θ0): we have g(θ0) = (1/√n) Σ_{i=1}^n Y_i, with Y_i = ∂/∂θ log f(X_i; θ)|_{θ=θ0},
E_{θ0}(Y_i) = E_{θ0}[∂/∂θ log f] = E_{θ0}[f′/f] = ∫ (f′/f) f = ∫ f′ = ∂/∂θ ∫ f = 0 (since ∫ f = 1),
and
Var_{θ0}(Y_i) = E_{θ0}[Y_i²] = I_f(θ0).
Hence, by the CLT, g(θ0) →^d N(0, I_f(θ0)).
By Slutsky's lemma, g(θ0) / ( −(1/√n) g′(θ̃) ) →^d N(0, 1/I_f(θ0)).
6 Confidence Regions
Point estimator: one number only.
Confidence interval: random interval that contains the true parameter with a certain
probability.
Example 24
Y1, ..., Yn iid N(µ, σ0²) with σ0² known and µ ∈ R unknown.
Want: a random interval that contains µ with probability 1 − α for some α > 0, e.g. α = 0.05.
Y¯ = (1/n) Σ_i Y_i ∼ N(µ, σ0²/n). Hence,
(Y¯ − µ)/(σ0/√n) ∼ N(0, 1) for all µ ∈ R.
[Figure: pdf of N(0, 1); the central region between −c_{α/2} and c_{α/2} has probability 1 − α.]
Thus,
1 − α = P( −c_{α/2} < (Y¯ − µ)/(σ0/√n) < c_{α/2} ),
where 0 < α < 1 and Φ(c_{α/2}) = 1 − α/2, Φ being the cdf of N(0, 1).
Rewrite this as
1 − α = P( Y¯ + c_{α/2} σ0/√n > µ > Y¯ − c_{α/2} σ0/√n ),
where the two endpoints are random and µ is non-random.
The interval (Y¯ − c_{α/2} σ0/√n, Y¯ + c_{α/2} σ0/√n) is a random interval which contains
the true µ with probability 1 − α.
The observed value of the random interval is (y¯ − c_{α/2} σ0/√n, y¯ + c_{α/2} σ0/√n).
This is called a 1 − α confidence interval for µ.
Remarks:
• α is usually small, often α = 0.05 (the usual convention).
• When speaking of a confidence interval we can mean either the realisation
  of the random interval or the random interval itself (this should hopefully be
  clear from the context).
• We could use asymmetrical values, but the symmetrical values (±c_{α/2}) give the shortest interval in this case.
• The value σ0/√n is exactly the standard error of Y¯.
Example 25
In an industrial process, past experience shows that it produces components whose strengths
are N(40, 1.21²). The process is modified, but the standard deviation (= 1.21) remains the same.
After modification, n = 12 components give an average strength of 41.125.
New strength ∼ N(µ, 1.21²).
n = 12, σ0 = 1.21, y¯ = 41.125, α = 0.05, c_{α/2} ≈ 1.96.
→ a 95% CI for µ is (40.44, 41.81).
This does not mean that we are 95% confident that the true µ lies in (40.44, 41.81).
It means that if we were to take an infinite number of (independent) samples then in 95%
of cases the calculated CI would contain the true value.
Note that our CI does not include 40 - an indication that the modification seems to
have increased strength (→ hypothesis testing).
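The interval can be reproduced in R with one line (a check of the arithmetic):

41.125 + c(-1, 1) * qnorm(0.975) * 1.21 / sqrt(12)   # approximately (40.44, 41.81)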
Definition 9
A 1 − α confidence interval for θ is a random interval I that contains the ’true’
parameter with probability ≥ 1− α, i.e.
Pθ(θ ∈ I ) ≥ 1− α ∀θ ∈ Θ
In the above, I can be any type of interval. For example, if L and U are random
variables with L ≤ U then I could be the open interval (L, U), the closed interval
[L, U], the unbounded interval [L,∞), . . .
Example 26
X ∼ Bernoulli(θ), θ ∈ [0, 1] unknown. Want: a 1 − α CI for θ (suppose 0 < α < 1/2). Let
[L, U] = [0, 1 − α] if X = 0, and [L, U] = [α, 1] if X = 1.
This is indeed a 1 − α CI, since
P_θ(θ ∈ [L, U]) = P_θ(X = 0) = 1 − θ ≥ 1 − α for θ < α,
P_θ(θ ∈ [L, U]) = 1 for α ≤ θ ≤ 1 − α,
P_θ(θ ∈ [L, U]) = P_θ(X = 1) = θ ≥ 1 − α for θ > 1 − α.
Remark L = −∞ and U =∞ is allowed.
Example 27 (One-sided confidence interval)
Suppose Y1, ... , Yn are independent measurements of a pollutant θ, where higher
values indicate worse pollution. To be “on the safe side” when reporting the amount
of pollutant, we want a 1− α CI for θ of the form (−∞, h(y)]. For this h needs to
be a function h : Rn → R such that
Pθ(θ ≤ h(Y)) ≥ 1− α ∀θ.
6.1 Construction of Confidence Intervals
Features of (Y¯ − µ)/(σ0/√n) in the first example (where σ0 is known):
1. it is a function of the unknown µ and the data only (σ0 is known);
2. its distribution is completely known.
More generally, consider a situation, where we are interested in a (scalar) unknown
parameter θ. There may be nuisance parameters (i.e. other unknown parameters we
are not interested in).
Definition 10
A pivotal quantity for θ is a function t(Y, θ) of the data and θ (and NOT any further
nuisance parameters) s.t. the distribution of t(Y, θ) is known, i.e. does NOT depend
on ANY unknown parameters.
Suppose t(Y, θ) is a pivotal quantity for θ. Then we can find constants a1, a2 s.t.
P(a1 ≤ t(Y, θ) ≤ a2) ≥ 1 − α,
because we know the distribution of t(Y, θ). (There may be many pairs (a1, a2); "≥" is needed for discrete distributions.)
In many cases (as above) we can rearrange terms to give
P(h1(Y) ≤ θ ≤ h2(Y)) ≥ 1 − α.
[h1(Y), h2(Y)] is a random interval. The observed interval [h1(y), h2(y)], with h1(y) the
lower confidence limit and h2(y) the upper confidence limit, is a 1 − α confidence interval for θ.
Example 28
Y1, ..., Yn i.i.d. N(µ, σ²), with µ, σ² both unknown.
1. Want: confidence interval for µ.
σ is unknown ⇒ we can't use (Y¯ − µ)/(σ/√n) as a pivotal quantity.
Replace σ by S, where
S² = (1/(n − 1)) Σ_i (Y_i − Y¯)²  (sample variance),
to give
T = (√n/S)(Y¯ − µ).
T follows a Student-t distribution with n − 1 degrees of freedom. (You may
have seen this previously, but we will prove more general results in the 2nd
part of the course that include this as a special case.)
[Figure: pdf of the t distribution with n − 1 df; the central region between −t_{n−1,α/2} and t_{n−1,α/2} has probability 1 − α.]
1 − α = P(−t_{n−1,α/2} ≤ T ≤ t_{n−1,α/2})
= P( Y¯ − (S/√n) t_{n−1,α/2} ≤ µ ≤ Y¯ + (S/√n) t_{n−1,α/2} ).
The 1 − α CI is (y¯ − (s/√n) t_{n−1,α/2}, y¯ + (s/√n) t_{n−1,α/2}).
2. Want: confidence interval for σ (or σ²).
We will see that
Σ_i (Y_i − Y¯)² / σ² ∼ χ²_{n−1}.
[Figure: pdf of the χ²_{n−1} distribution; c1 and c2 cut off probability α/2 in each tail, leaving probability 1 − α in between.]
Choose c1 and c2 such that
P( c1 ≤ Σ_i (Y_i − Y¯)²/σ² ≤ c2 ) = 1 − α.
⇒ a 1 − α CI for σ² is ( Σ_i (y_i − y¯)²/c2 , Σ_i (y_i − y¯)²/c1 ), and a 1 − α CI for σ is
( √(Σ_i (y_i − y¯)²/c2) , √(Σ_i (y_i − y¯)²/c1) ).
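Both intervals are straightforward to compute in R; a sketch for a simulated sample, taking c1 and c2 to be the α/2 and 1 − α/2 quantiles of χ²_{n−1}:

set.seed(2)
y <- rnorm(15, mean = 10, sd = 2); n <- length(y); alpha <- 0.05
mean(y) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(y) / sqrt(n)           # CI for mu
ss <- sum((y - mean(y))^2)
c(ss / qchisq(1 - alpha / 2, df = n - 1), ss / qchisq(alpha / 2, df = n - 1))  # CI for sigma^2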
6.2 Asymptotic confidence intervals
Often, we only know √n (T_n − θ) →^d N(0, σ²(θ)) (e.g. the asymptotic distribution of the
MLE). This implies √n (T_n − θ)/σ(θ) →^d N(0, 1) and thus, approximately,
√n (T_n − θ)/σ(θ) ∼ N(0, 1),
and we can use the LHS as a pivotal quantity. The resulting confidence interval is
called an asymptotic confidence interval.
Definition 11
A sequence of random intervals I_n is called an asymptotic 1 − α CI for θ if
lim_{n→∞} P_θ(θ ∈ I_n) ≥ 1 − α for all θ.
Trying to construct an asymptotic CI directly from √n (T_n − θ)/σ(θ) is usually not easy. This is
because σ depends on θ and thus it may be difficult to solve the resulting inequalities for θ.
Simplification:
Suppose we have a consistent estimator σ̂_n for σ(θ). Thus, by definition, σ̂_n →^{P_θ} σ(θ)
for all θ. Hence, using the Slutsky lemma and the fact that X ∼ N(0, σ²(θ)) implies
X/σ(θ) ∼ N(0, 1),
√n (T_n − θ)/σ̂_n →^d N(0, 1).
Using the LHS as the pivotal quantity leads to the approximate confidence limits
T_n ± c_{α/2} σ̂_n/√n,
where Φ(c_{α/2}) = 1 − α/2.
The quantity σ̂_n/√n in the simplification is an estimate of the standard error SE(T_n).
Hence, we can also write the approximate confidence limits as
T_n ± c_{α/2} ŜE(T_n).
Example 29
Y ∼ Binomial(n, θ), θ ∈ (0, 1) unknown. n is known.
Then √n (Y/n − θ) →^d N(0, θ(1 − θ)) as n → ∞. To see this use the CLT or the
large sample properties of the MLE.
As √n (Y/n − θ)/√(θ(1 − θ)) is approximately N(0, 1), we get
P( −c_{α/2} ≤ (Y − nθ)/√(nθ(1 − θ)) ≤ c_{α/2} ) ≈ 1 − α.
The (approximate) confidence limits are the roots of
(y − nθ)² = c²_{α/2} n θ(1 − θ).
Solving this quadratic in θ gives the confidence interval with limits
( 2yn + c²n ± √(4yn²c² + c⁴n² − 4y²c²n) ) / ( 2n(n + c²) ),  writing c = c_{α/2}.
Using the above simplification idea: for σ̂² = (Y/n)(1 − Y/n) one can show σ̂² →^P θ(1 − θ)
(LLN: Y/n →^P θ, and the rules for convergence in probability). Using the (asymptotic) pivotal quantity
√n (Y/n − θ) / √( (Y/n)(1 − Y/n) )
leads to the confidence limits
y/n ± (c_{α/2}/√n) √( (y/n)(1 − y/n) ).
Both of the above confidence intervals are asymptotic confidence intervals with
coverage probability 1 − α.
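For concrete numbers the two asymptotic intervals can be compared in R (an illustrative sketch; the values of y and n are made up):

y <- 12; n <- 40; z <- qnorm(0.975)
y / n + c(-1, 1) * z * sqrt((y / n) * (1 - y / n) / n)             # plug-in (Wald-type) limits
sort(Re(polyroot(c(y^2, -(2 * y * n + z^2 * n), n^2 + z^2 * n))))  # roots of (y - n*theta)^2 = z^2 n theta (1 - theta)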
6.3 Simultaneous Confidence Intervals/Confidence Regions
Extension of confidence intervals to more than one parameter / to a parameter vector:
Suppose θ = (θ1, ..., θk)^T ∈ Θ ⊂ R^k and suppose that we have random intervals (L_i(Y), U_i(Y)) such that
for all θ: P_θ( L_i(Y) < θ_i < U_i(Y) for i = 1, ..., k ) ≥ 1 − α.
Then we call (L_i(y), U_i(y)), i = 1, ..., k, 1 − α simultaneous confidence intervals for θ1, ..., θk.
Can we construct simultaneous confidence intervals from one-dimensional confidence intervals?
Remark (Bonferroni correction) Suppose [L_i, U_i] is a 1 − α/k confidence interval
for θ_i, i = 1, ..., k. Then [(L1, ..., Lk)^T, (U1, ..., Uk)^T] = (L1, U1) × · · · × (Lk, Uk) is a
1 − α simultaneous confidence interval for (θ1, ..., θk)^T. Indeed,
P(θ_i ∈ [L_i, U_i], i = 1, ..., k) = 1 − P( ∪_{i=1}^k {θ_i ∉ [L_i, U_i]} ) ≥ 1 − Σ_{i=1}^k P(θ_i ∉ [L_i, U_i]) ≥ 1 − k·(α/k) = 1 − α.
This Bonferroni correction works without assuming anything about the joint distribution
of the intervals. It also works if not all confidence intervals have the same coverage probabilities:
Example 30
Suppose [L1, U1] is a 99% confidence interval for θ1 and [L2, U2] is a 97% confidence
interval for θ2. Then [L1, U1] × [L2, U2] is a 96% simultaneous confidence interval
for the parameter vector (θ1, θ2).
The Bonferroni correction is conservative, i.e. the actual coverage probability is typically higher
than the bound given by the correction.
Example 31
Suppose X1, ..., Xn ∼ N(µ, 1) and Y1, ..., Yn ∼ N(θ, 1), all independent, with (µ, θ) being
the unknown parameter. Then we have seen previously that I = (X¯ − c_{α/2}/√n, X¯ + c_{α/2}/√n),
where Φ(c_{α/2}) = 1 − α/2, is a 1 − α confidence interval for µ, and
J = (Y¯ − c_{α/2}/√n, Y¯ + c_{α/2}/√n) is a 1 − α confidence interval for θ.
Then we know, using the Bonferroni correction, that I × J is a 1 − 2α confidence
region for (µ, θ).
In this example we know that I and J are independent, thus the actual coverage
probability of I × J is
P_{(µ,θ)}((µ, θ) ∈ I × J) = P_{(µ,θ)}(µ ∈ I) P_{(µ,θ)}(θ ∈ J) = (1 − α)².
For example, for α = 0.1, the Bonferroni correction only guarantees a coverage
probability of 80%, whereas the actual coverage probability is 0.9² = 0.81.
If one uses a more complicated form than rectangles, i.e. one uses a random set A(Y)
such that for all θ ∈ Θ Pθ(θ ∈ A(Y)) ≥ 1 − α one calls A(y) a 1 − α confidence
region for θ. [In the second part of the course there will be an example in which the
random set is an ellipse]
7 Hypothesis Tests
Beyond point estimation, statistical inference often makes use of hypothesis testing
methods. Here, a hypothesis is any statement about a population parameter. We will
consider pairs of complementary hypotheses with the aim of deciding between the two.
Definition 12
The two complementary hypotheses in a hypothesis testing problem are called the
null hypothesis and the alternative hypothesis, denoted by H0 and H1, respectively.
Let θ denote the population parameter of interest taking values in a set Θ. The set Θ
of all possible parameter values is usually called the parameter space. Formally, H0 is
a statement that θ ∈ Θ0 for some Θ0 ⊂ Θ. Then, H1 corresponds to Θ1 = Θ \Θ0.
Example 32
(Ox weights, continued) Recall that a reasonable probability model for the guess of
a participant is X ∼ N(µ,σ2), with µ = 543.4 kg. We might therefore be interested
in testing the hypothesis H0 : µ = 543.4 against the alternative H1 : µ 6= 543.4.
If we reject H0, then we are deciding that the evidence suggests the average guess of a
participant does not equal the true weight of the ox.
In practice, we will often use the result of a hypothesis test to assert that H0 or H1 is
true. For example, in clinical trials the decision to be made is whether or not there is
a difference in the primary outcome between the population receiving standard of care
and the group receiving a novel treatment. To make this process rigorous, we must
define a formal procedure for decision-making.
Definition 13
• A hypothesis test is a rule that specifies for which values of the sample
X1, ... , Xn the decision is made to accept H0 as true and for which values
to reject H0 and accept H1 as true.
• The subset of the sample space for which H0 will be rejected is called the
rejection region or critical region.
We typically regard H0 as the ’status quo’ which we do not reject unless there is
(considerable) evidence against it.
Example 33
Medical statistics:
H0: new treatment is not better
H1: new treatment is better.
When we use hypothesis tests, there are two types of errors we can make:
                    H0 true         H0 false
do not reject H0    correct         Type II error
reject H0           Type I error    correct
A test is of level α (0 < α < 1) if
Pθ(reject H0) ≤ α ∀θ ∈ Θ0.
Usually α is small, e.g. 0.01 or 0.05.
Loosely speaking: the probability of a type I error is less than α.
There is no such bound for the probability of a type II error.
7.1 Power of a Test
Setup: Θ parameter space, Θ0 ⊂ Θ, Θ1 = Θ \Θ0. Consider
H0 : θ ∈ Θ0 v.s. H1 : θ ∈ Θ1
Suppose we have some test for this hypothesis.
The power function is defined as the mapping
β : Θ→ [0, 1], β(θ) = Pθ(reject H0)
If θ ∈ Θ0 then we want β(θ) to be small.
If θ ∈ Θ1 then we want β(θ) to be large.
Example 34
X ∼ N(θ, 1), θ ∈ R unknown.
H0: θ ≤ 0 against H1: θ > 0.
Thus Θ = R, Θ0 = (−∞, 0], Θ1 = (0, ∞). Suppose we use the critical region R = [c, ∞)
(i.e. we reject if X ∈ R). We will choose the critical value c s.t. the test is of level α. For θ ≤ 0:
P_θ(reject H0) = P_θ(X ≥ c) = P_θ(X − θ > c − θ) = 1 − Φ(c − θ) ≤ 1 − Φ(c),
where Φ is the standard normal cdf and X − θ ∼ N(0, 1). Choose c such that Φ(c) = 1 − α. Then
P_θ(reject H0) ≤ α for all θ ∈ Θ0.
Note: in this case it was sufficient to construct a test of level α for the boundary
case θ = 0.
In this situation we can work out the power function β(θ) = Pθ(reject H0) [try it!].
Below is a sketch of β(θ):
[Figure: sketch of the power function β(θ); it lies below α for θ ≤ 0, equals α at θ = 0, and increases towards 1 as θ grows.]
The above is a typical β(θ) for a one-sided test. Examples for power functions of
two-sided tests are in the next section.
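The sketch can be recreated from the calculation in the example, which gives β(θ) = 1 − Φ(c − θ); a plotting sketch in R:

alpha <- 0.05
c0 <- qnorm(1 - alpha)        # critical value c with Phi(c) = 1 - alpha
theta <- seq(-2, 4, length.out = 200)
plot(theta, 1 - pnorm(c0 - theta), type = "l", xlab = "theta", ylab = "power")
abline(h = alpha, lty = 2)    # the level alpha is attained at the boundary theta = 0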
7.2 p-value
Often the so-called p-value is reported (instead of a test decision):
p = sup_{θ∈Θ0} P_θ(observing something “at least as extreme” as the observation).
Reject H0 iff p ≤ α → α-level test.
Advantage for computer packages: the user does not have to specify the level.
If the test is based on the statistic T, with rejection for large values of T, then
p = sup_{θ∈Θ0} P_θ(T ≥ t),
where t is the observed value.
In the above example (where X ∼ N(θ, 1) and H0: θ ≤ 0 against H1: θ > 0) the p-value is:
p = sup_{θ∈Θ0} P_θ(X ≥ x) = P_0(X ≥ x) = 1 − Φ(x).
Example 35
Two-sided test with known variance. X1, ..., Xn ∼ N(µ, 1) iid, µ unknown parameter.
H0: µ = µ0 against H1: µ ≠ µ0.
Under H0: T = √n (X¯ − µ0) ∼ N(0, 1). Rejection region (based on T):
(−∞, −c_{α/2}] ∪ [c_{α/2}, ∞),
where Φ(c_{α/2}) = 1 − α/2.
The test rejects for large values of |T|. Hence, for the observed value t the p-value is:
p = P_{µ0}(|T| ≥ |t|) = P(T ≤ −|t| or T ≥ |t|) = Φ(−|t|) + 1 − Φ(|t|) = 2 − 2Φ(|t|).
Power: note that T ∼ N(√n (µ − µ0), 1). Hence
β(µ) = P_µ(|T| ≥ c_{α/2}) = 1 − P_µ(−c_{α/2} ≤ T ≤ c_{α/2})
= 1 − P_µ( −√n (µ − µ0) − c_{α/2} ≤ T − √n (µ − µ0) ≤ −√n (µ − µ0) + c_{α/2} )
= 1 − Φ( −√n (µ − µ0) + c_{α/2} ) + Φ( −√n (µ − µ0) − c_{α/2} ).
[Figure: power curves for the two-sided test of the mean with known variance, H0: µ = 5. Left panel: α = 0.05 with n = 16 and n = 100. Right panel: n = 16 with α = 0.05 and α = 0.1.]
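The left panel can be recreated from the power formula in Example 35 (a plotting sketch):

power <- function(mu, mu0 = 5, n, alpha = 0.05) {
  crit <- qnorm(1 - alpha / 2)
  d <- sqrt(n) * (mu - mu0)
  1 - pnorm(crit - d) + pnorm(-crit - d)
}
mu <- seq(4, 6, length.out = 200)
plot(mu, power(mu, n = 16), type = "l", xlab = "true mean", ylab = "power")
lines(mu, power(mu, n = 100), lty = 2)   # larger n gives a steeper power curve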
Example 36 (Student's t-Test; One-Sample t-Test)
X1, ..., Xn ∼ N(µ, σ²) iid, µ and σ unknown parameters.
H0: µ = µ0 against H1: µ ≠ µ0.
Under H0: T = √n (X¯ − µ0)/S ∼ t_{n−1}.
Rejection region:
(−∞, −c] ∪ [c, ∞),
where c = t_{n−1,α/2}. (t_{n−1,α/2} is chosen such that if Y ∼ t_{n−1} then P(Y > t_{n−1,α/2}) = α/2.)
(One gets similar plots for the power function.)
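In R this test is carried out by t.test; a quick sketch on simulated data (the reported test statistic is T above):

set.seed(4)
x <- rnorm(16, mean = 5.4, sd = 1)
t.test(x, mu = 5)   # two-sided one-sample t-test of H0: mu = 5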
7.3 Connection between tests and confidence intervals
Constructing a test from a confidence region: let Y be the random observations.
Suppose A(Y) is a 1 − α confidence region for θ, i.e.
P_θ(θ ∈ A(Y)) ≥ 1 − α for all θ ∈ Θ.
Then one can define a test for
H0: θ ∈ Θ0 v.s. H1: θ ∉ Θ0
(where Θ0 is some fixed subset of Θ) with level α as follows:
Reject H0 if Θ0 ∩ A(y) = ∅.
In words: reject H0 if none of its elements are in the confidence region.
To see that the above test has the appropriate level: for all θ ∈ Θ0,
P_θ(type I error) = P_θ(reject) = P_θ(Θ0 ∩ A(Y) = ∅) ≤ P_θ(θ ∉ A(Y)) ≤ α.
Remark If the null hypothesis is simple, i.e. Θ0 = {θ0}, then the above rule is equivalent to:
reject H0 if θ0 ∉ A(y).
Constructing a confidence region from tests (optional):
Suppose that for every θ0 ∈ Θ we have a level α test φ_{θ0} for
H0^{θ0}: θ = θ0 v.s. H1^{θ0}: θ ≠ θ0,
i.e. a decision rule φ_{θ0} to reject or not reject H0^{θ0} that satisfies
P_{θ0}(φ_{θ0} rejects H0^{θ0}) ≤ α.
Consider the random set
A := {θ0 ∈ Θ: φ_{θ0} does not reject H0^{θ0}}.
This is a 1 − α confidence region for θ. Indeed, for all θ ∈ Θ,
P_θ(θ ∈ A) = P_θ(φ_θ does not reject) = 1 − P_θ(φ_θ rejects) ≥ 1 − α.
8 Likelihood Ratio Tests
Idea behind the maximum likelihood estimator: the parameter with the highest likelihood
is "best". Can this idea be used to create a test?
More precisely, consider the hypotheses
H0: θ ∈ Θ0 against H1: θ ∈ Θ1 := Θ \ Θ0.
Main idea: compare the maximised likelihood under H0, sup_{θ∈Θ0} L(θ), to the unrestricted
maximum likelihood, sup_{θ∈Θ} L(θ). If the latter is (much?) larger then
sup_{θ∈Θ1} L(θ) >> sup_{θ∈Θ0} L(θ), casting doubt on H0.
Definition 14
Suppose we observe the data y. The likelihood ratio test statistic is
t(y) = sup_{θ∈Θ} L(θ; y) / sup_{θ∈Θ0} L(θ; y) = (max. lik. under H0 + H1) / (max. lik. under H0).
(Other equivalent definitions are possible.) Note: t(y) ≥ 1.
If t(y) is "large" this will indicate support for H1, so reject H0 when
t(y) ≥ k,
where k is chosen to make
sup_{θ∈Θ0} P_θ(t(Y) ≥ k) = (or ≤) α
(e.g. α = 0.05).
Remark The choice of k ensures that we get a test of level α.
Example 37
X ∼ Binomial(n, θ), θ ∈ (0, 1) = Θ. n is known.
H0: θ = 0.5 v.s. H1: θ ≠ 0.5.
Here, Θ0 = {0.5}, Θ1 = (0, 0.5) ∪ (0.5, 1).
Suppose we observe the realisation x. The likelihood is
L: Θ → R, θ ↦ P_θ(X = x) = (n choose x) θ^x (1 − θ)^{n−x}.
We have shown previously that the MLE is θ̂ = x/n. Thus
sup_{θ∈Θ} L(θ) = L(θ̂) = (n choose x) (x/n)^x (1 − x/n)^{n−x}.
Under H0:
sup_{θ∈Θ0} L(θ) = L(0.5) = (n choose x) 0.5^x (1 − 0.5)^{n−x} = (n choose x) 0.5^n.
Thus, the likelihood ratio statistic is
t = sup_{θ∈Θ} L(θ) / sup_{θ∈Θ0} L(θ) = (x/n)^x (1 − x/n)^{n−x} / 0.5^n.
[Figure: dependence of the likelihood ratio t on the observation x, for n = 20; note the log-scale on the y-axis.]
To construct the test (find the threshold k or compute a p-value) we need the
distribution of t under H0.
It is possible to do this here, as the null hypothesis is simple (i.e. consists of only
one element). As X is a discrete random variable, t(X) is a discrete random variable,
allowing us to compute the cdf of t(X).
[Figure: cdf of t(X) under H0 for n = 20; the 0.95 level of the cdf is reached at approximately t = 5.18.]
In this case rejecting if t > 5.19 leads to a test with level 5%.
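A short R sketch reproducing these computations for n = 20 (the threshold 5.19 is taken from the text above):

# Example 37: LRT for X ~ Binomial(n, theta), H0: theta = 0.5, n = 20
n <- 20
x <- 0:n
loglik <- function(theta) dbinom(x, n, theta, log = TRUE)
t.stat <- exp(loglik(x / n) - loglik(0.5))   # t(x) = L(x/n) / L(0.5)
null.prob <- dbinom(x, n, 0.5)               # pmf of X under H0
sum(null.prob[t.stat > 5.19])                # level of "reject if t > 5.19" (roughly 0.04)
x[t.stat > 5.19]                             # outcomes leading to rejection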
Example 38
Xi ∼ Binomial(n, θi), i = 1, 2 indep., θ = (θ1, θ2) ∈ (0, 1)2 = {(x , y) : x , y ∈
(0, 1)} = Θ
H0 : θ1 = θ2 v.s. H1 : θ1 6= θ2
Suppose we observe the realisation (x1, x2).
L(θ) = Pθ(X1 = x1, X2 = x2) = (n choose x1) θ1^x1 (1 − θ1)^(n−x1) · (n choose x2) θ2^x2 (1 − θ2)^(n−x2).
We want to derive the likelihood ratio statistic.
First we find the MLE over all of Θ. The likelihood is the product of two pos-
itive functions, the first only depending on θ1, the second depending only on θ2.
Maximising them separately (which is just the standard maximisation of a Binomial
likelihood) leads to the MLE
θˆ1 = x1/n,  θˆ2 = x2/n.
Under H0, we have θ1 = θ2. Hence,
L(θ) = Pθ(X1 = x1, X2 = x2) = (n choose x1)(n choose x2) θ1^(x1+x2) (1 − θ1)^(2n−(x1+x2)).
Except for a different constant at the beginning, this is the likelihood of a single Binomial(2n, θ1) observation. Maximising it leads to the estimator
θˆ1 = θˆ2 = (x1 + x2)/(2n).
Thus the likelihood ratio test statistic is
t = L(x1/n, x2/n) / L((x1 + x2)/(2n), (x1 + x2)/(2n)).
To construct a test, I would need the distribution of t under H0. This is not easy to
obtain as for each value of (θ, θ) ∈ Θ0 the distribution of t may be different.
Example 39
m factories producing light bulbs. Are all factories producing bulbs of the same
quality? Observation: life-length of n light bulbs from each factory; Yij = life-length
of bulb j from factory i .
Model: Yij indep. Exp(λi), i=1,. . . ,m; j=1,. . . n, λi > 0 unknown, i = 1, ... , m.
H0 : λ1 = · · · = λm v.s. H1 : not H0
Interpretation of H0 : all factories produce bulbs of equal quality
Likelihood (using θT = (λ1, ... ,λm)):
L(θ) = ∏i=1..m ∏j=1..n λi exp(−λi yij) = ∏i=1..m λi^n exp(−λi ∑j yij).
As the elements of the product are all nonnegative, and only contain different com-
ponents of the parameter, we can maximise them separately. Each element is the
likelihood of iid Exponential(λi) observations, leading to the MLE
λˆi = 1/y¯i , where y¯i = (1/n) ∑j yij.
Hence,
supθ∈Θ L(θ; y) = e^(−mn) / (∏i y¯i)^n.
Under H0 (setting λ := λ1 = · · · = λm)
L(θ; y) = λ^(mn) e^(−λ ∑i,j yij).
Again this is the likelihood for iid Exponential(λ) observations. Thus, the MLE is λˆ = 1/y¯, where y¯ = (1/(mn)) ∑i,j yij. Hence,
supH0 L(θ; y) = e^(−mn) / y¯^(mn).
=⇒ t(y) = y¯^(mn) / (∏i y¯i)^n.
To construct a test we would need to know the distr. of t(Y)
under H0. Not easy!
Even if it were known - the distribution of t(Y) may depend on λ and hence, choosing
k according to supλ>0 Pλ(t(Y) ≥ k) = α may not be easy.
Definition 15
Let X1, X2, ... , Xn ∼ N(0, 1) independently. Then ∑i=1^n Xi² ∼ χ²n.
Theorem 6
Let Y1, ..., Yn be a random sample. Let Y = (Y1, ..., Yn). Under certain regularity
conditions (in particular H0 must be “nested” in H1, i.e. Θ0 is a lower-dimensional
subspace/subset of Θ),
2 log t(Y)
d→ χ2r (n→∞)
under H0, where r = #independent restrictions on θ needed to define H0.
Alternative way to derive the degrees of freedom r :
r = # of independent parameters under full model−# of independent parameters under H0
Example 40
In the above examples:
• X ∼Binomial(n, θ), θ ∈ (0, 1) = Θ with H0 : θ = 0.5 v.s. H1 : θ 6= 0.5: r=1
• Xi ∼ Binomial(n, θi), i = 1, 2 indep., θ ∈ (0, 1)2 with H0 : θ1 = θ2 v.s. H1 :
θ1 6= θ2: r=1
• “light bulbs”: r = m − 1
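A simulation sketch of Theorem 6 for the second example above (r = 1), under assumed values n = 50 and θ1 = θ2 = 0.3:

# Distribution of 2 log t(Y) under H0: theta1 = theta2 (two independent Binomials)
set.seed(2)
n <- 50; theta <- 0.3; nsim <- 10000
x1 <- rbinom(nsim, n, theta); x2 <- rbinom(nsim, n, theta)
logL <- function(th1, th2) dbinom(x1, n, th1, log = TRUE) + dbinom(x2, n, th2, log = TRUE)
lrt <- 2 * (logL(x1 / n, x2 / n) - logL((x1 + x2) / (2 * n), (x1 + x2) / (2 * n)))
mean(lrt > qchisq(0.95, df = 1))   # close to 0.05 if the chi^2_1 approximation is accurate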
Proof (Sketch of a proof for the case Θ0 = {θ0}) Suppose Θ ⊂ Rr. Then
2 log t(Y) = 2(log L(θˆ) − log L(θ0)),
where θˆ denotes the MLE of θ. Using a Taylor expansion,
log L(θ0) ≈ log L(θˆ) + (θ0 − θˆ)^T ∂ log L(θ)/∂θ |θ=θˆ + (1/2)(θ0 − θˆ)^T ∂² log L(θ)/∂θ∂θ^T |θ=θˆ (θ0 − θˆ),
where the first-order term vanishes because the score is zero at the MLE θˆ. Hence,
2 log t(Y) ≈ (θ0 − θˆ)^T ( −∂² log L(θ)/∂θ∂θ^T |θ=θˆ ) (θ0 − θˆ).
By a multivariate version of Theorem 5 (asymptotics of MLE),
√n (θˆn − θ0) →d N(0, If(θ0)^(−1)),
where If(θ) = Eθ[(∂ log L(θ)/∂θ)(∂ log L(θ)/∂θ)^T].
By the law of large numbers (and a few more arguments similar to the proof of the asymptotics of the MLE),
−(1/n) ∂² log L(θ)/∂θ∂θ^T |θ=θˆ →Pθ0 If(θ0).
Hence, 2 log t(Y) ≈ ZT If (θ0)Z, where Z ∼ Np(0, If (θ0)−1).
Results on quadratic forms of normal random vectors (which will be derived in the
second part of this course) imply ZT If (θ0)Z ∼ χ2r .
9 Linear Models with Second Order Assumptions
Many scientific problems focus on evaluating associations between an outcome of
interest and a set of predictor variables. In the remainder of the course, we consider
how linear regression can be used for this purpose.
9.1 Simple Linear Regression
Example 41
Boiling Point of Water
Can the boiling point of water be used to measure altitude? Altitude can be measured by atmospheric pressure. Early barometers were difficult to transport, so, in the 19th century, James David Forbes, a Scottish scientist, conducted a study of whether the boiling point of water can predict atmospheric pressure (and thus ultimately altitude).
He collected 17 observations on boiling point (Fahrenheit) and barometric pressure (inches of mercury) at various locations in the Alps.
(S. Weisberg (1980), Applied Linear Regression, Wiley.)
bp pressure bp pressure
194.5 20.79 194.3 20.79
197.9 22.4 198.4 22.67
199.4 23.15 199.9 23.35
200.9 23.89 201.1 23.99
201.4 24.02 201.3 24.01
203.6 25.14 204.6 26.57
209.5 28.49 208.6 27.76
210.7 29.04 211.9 29.88
212.2 30.06
[Scatterplot: Boiling Point of Water — log(pressure) [inches of mercury] against boiling point [Fahrenheit].]
Example 42
Brain and Body Weights for 62 Species of Land Mammals
body brain
Chimpanzee 52.160 440.0
Human 62.000 1320.0
Donkey 187.100 419.0
Horse 521.000 655.0
Cow 465.000 423.0
Grey wolf 36.330 119.5
Goat 27.660 115.0
African giant pouched rat 1.000 6.6
Asian elephant 2547.000 4603.0
African elephant 6654.000 5712.0
Baboon 10.550 179.5
Rat 0.280 1.9
Red fox 4.235 50.4
Brazilian tapir 160.000 169.0
Jaguar 100.000 157.0
Pig 192.000 180.0
Desert hedgehog 0.550 2.4
Slow loris 1.400 12.5
Golden hamster 0.120 1.0
. . .
[Scatterplot: log(brain weight) against log(body weight) for the 62 mammal species.]
The simple linear model is
Yi = β1 + aiβ2 + εi , i = 1, ... , n
• Yi “outcome”, “response”; observable random variable.
• ai “covariate”; observable constant.
• β1, β2 unknown parameters.
• Errors ε1, ... , εn iid, E εi = 0, Var(εi) = σ2 for i = 1, ... , n. σ2 > 0 is another unknown parameter. The errors ε1, ... , εn are not observable.
Forbes data: ai =boiling point and Yi = log(pressure)
Mammals data: ai = log(body weight) and Yi =log(brain weight).
The so-called least squares estimators βˆ1, βˆ2 of β1 and β2 are defined as the minimisers
of
S(β1, β2) = ∑i=1^n (yi − β1 − ai β2)².
Remark Note that:
• ei = yi − βˆ1 − ai βˆ2, the so-called residuals, are observable. They are not iid, as
dependence is introduced via βˆ1, βˆ2.
• The unknown parameters are β1, β2 and σ2.
• In linear regression models Y1, ... , Yn are generally not iid observations. Independence will still hold if the errors ε1, ... , εn are independent. However, the Yi do not have the same distribution; the distribution of Yi depends on the covariate ai .
The general formulation of the linear model makes heavy use of matrix notation and
properties of random vectors. In the next two subsections, some useful tools for this
are developed.
9.2 Matrix Algebra
This section contains some useful results about matrices. AT denotes the transpose
of a matrix. I will use the terms “invertible” and “non-singular” synonymously.
Lemma 5
Let A ∈ Rn×m, B ∈ Rm×n. Then (AB)T = BTAT
Let A ∈ Rn×n be non-singular. Then (A−1)T = (AT )−1.
You can try the proof of the above lemma yourself.
Let A = (Aij) ∈ Rn×n. Its trace, denoted by trace(A), is the sum of its diagonal elements, i.e. trace(A) = ∑i=1^n Aii .
Lemma 6
trace(AB) = trace(BA).
Proof Recall that AB = (∑j Aij Bjk)i,k . Thus,
trace(AB) = ∑i ∑j Aij Bji = ∑j ∑i Bji Aij = trace(BA).
Example 43
Let A = (1, 1), B = (1, 1)^T. Then AB = 2, while BA is the 2 × 2 matrix with every entry equal to 1, so AB ≠ BA; but trace(AB) = 2 = trace(BA).
Lemma 7
Let X be an n × p matrix. Then rank(XTX ) = rank(X ).
Proof Let kern(X ) = {x : X x = 0}. Then p = rank X + dim kern(X ) (this is
the “Rank-nullity theorem”, which you know from previous courses). Similarly, p =
rank XTX + dim kern(XTX )
It suffices to show: kern(X ) = kern(XTX ).
If x ∈ kern(X ) then 0 = X x and hence 0 = XTX x which shows x ∈ kern(XTX ) =
{y : XTX y = 0}.
If x ∈ kern(XTX ) then 0 = XTX x and thus 0 = xTXTX x = (X x)TX x = ‖X x‖2
which shows X x = 0, i.e. x ∈ kern(X ).
Recall that a symmetric matrix A ∈ Rn×n is called positive definite if ∀x ∈ Rn \ {0} : x^T A x > 0.
Lemma 8
If A ∈ Rn×n is symmetric then ∃ an orthogonal matrix P (i.e. PTP = I ) s.t. PTAP
is diagonal (with diagonal entries equal to the eigenvalues of A).
If A is an n × n positive definite symmetric matrix, then ∃ a non-singular matrix Q
s.t. QTAQ = In.
Proof The first result is a standard piece of linear algebra book-work.
The second result can be derived from it: Suppose A is positive definite, then its
eigenvalues are all positive. Hence, PTAP = D = diag(λ1, ... ,λn) where λi > 0 ∀i .
Let E = D^(1/2) = diag(√λ1, ... , √λn) and define Q = P E^(−1). Then
QTAQ = (PE^(−1))T A (PE^(−1)) = (E^(−1))T PTAP E^(−1) = (E^(−1))T E E E^(−1) = I.
9.3 Review of rules for E, cov for random vectors
Let X = (X1, ... , Xn)T be a random vector. Then
E(X) = (E X1, ... , E Xn)^T,
i.e. the expectation is defined componentwise. For random matrices the expectation is
also defined componentwise.
Lemma 9
Let X and Y be n-variate random vectors. Then the following hold:
• E(X + Y) = E X + E Y.
• Let a ∈ R then E(aX) = a E(X)
• Let A, B be deterministic matrices of “suitable dimensions” (deterministic
means that they are not random). Then E(AX) = A E(X) and E(XTB) =
E(X)TB .
Proof Use properties of one-dimensional random variables, for example
E(AX) = (E(∑j Aij Xj))i = (∑j Aij E(Xj))i = A E(X).
Definition 16
If X, Y are random vectors then
cov(X, Y) :=(cov(Xi , Yj))i ,j
= E[(X− E(X))(Y − E(Y))T ] = E[XYT ]− E(X) E(Y)T .
Furthermore cov(X) := cov(X, X).
The above definition really contains three equivalent definitions of cov(X, Y). It is
straightforward to show that they are equivalent.
Lemma 10
If X, Y and Z are random vectors, A, B are deterministic matrices and a, b ∈ R are
constants then (assuming appropriate dimensions)
• cov(X, Y) = cov(Y, X)^T
• cov(aX + bY, Z) = a cov(X, Z) + b cov(Y, Z)
• cov(AX, BY) = A cov(X, Y) B^T
• cov(AX) = A cov(X) A^T
• cov(X) is positive semidefinite and symmetric, i.e. c^T cov(X) c ≥ 0 for all vectors c, or, equivalently, all eigenvalues of cov(X) are nonnegative.
• If X and Y are independent then cov(X, Y) = 0.
Proof Play back to properties of the one-dimensional covariance or work with one of
the vector definitions of the covariance. For example, to see the last property,
cT cov(X)c = E(cT (X− E X)(X− E X)Tc) = E[(cT (X− E X))2] ≥ 0
Example 44
Let X ∼ Binomial(17, 0.4). Then
cov(X) = Var(X) = 17 · 0.4 · 0.6 = 4.08.
Example 45
If Y1, ... , Yn are independent then
cov((Y1, ... , Yn)^T) = diag(Var(Y1), ... , Var(Yn)),
i.e. the diagonal matrix of the individual variances.
Example 46
Let X, Y be independent r.v. with X ∼ N(5, 2) and Y ∼ Binomial(10, 0.5). Then
cov((X, −X)^T) = ( 2 −2 ; −2 2 ),
cov((X, X + Y)^T) = ( 2 2 ; 2 4.5 ),
cov(X, (2X, X − Y)^T) = (4, 2).
9.4 Linear Model
General formulation that allows for any number of covariates influencing an observation.
Definition 17
In a linear model
Y = Xβ + ε,
where
Y is an n-dimensional random vector (observable),
X ∈ Rn×p is a known matrix (often called “design matrix”),
β ∈ Rp is an unknown parameter and
ε is an n-variate random vector (not observable) with E(ε) = 0.
Remark Unless mentioned otherwise, we will assume n > p.
Remark Concerning notation: From now on, a vector will implicitly always be a
column vector. Vectors will be printed in bold. The transpose of a vector x is
denoted by xT . Expectations of random vectors are defined componentwise, i.e.
E(ε) = (E(ε1), ... , E(εn))^T.
Remark Shorter way of writing a linear model:
E Y = Xβ
Example 47 (Forbes, Mammals)
In the examples of the previous section, we had Yi = β1 + ai β2 + εi. This fits into the general framework with X = [1, a1; ... ; 1, an], i.e. a first column of ones and a second column containing a1, ... , an, where
Forbes data: ai =boiling point and Yi = log(pressure)
Mammals data: ai = log(body weight) and Yi =log(brain weight).
Example 48
20 patients, 2 drugs, A and B
10 given A, 10 given B
YAj = response of jth patient to receive A, j = 1, ... , 10
YBj = response of jth patient to receive B, j = 1, ... , 10
The simplest model is E(YAj) = µA, E(YBj) = µB. In matrix form:
E(YA1, ... , YA,10, YB1, ... , YB,10)^T = X (µA, µB)^T,
where X is the 20 × 2 matrix whose first column is (1, ... , 1, 0, ... , 0)^T (ten ones, then ten zeros) and whose second column is (0, ... , 0, 1, ... , 1)^T (ten zeros, then ten ones).
Example 49
10 pairs of twins, 2 drugs: A and B
one twin in each pair receives A, the other one receives B
twins are alike - we need to modify our previous model:
E(YAj) = µA + τj , E(YBj) = µB + τj , where τj=effect of twin pair j .
In matrix form,
E(YA1, ... , YA,10, YB1, ... , YB,10)^T = X β, with β = (µA, µB, τ1, ... , τ10)^T,
where X ∈ R^(20×12): the first column indicates drug A (ten ones, then ten zeros), the second column indicates drug B (ten zeros, then ten ones), and columns 3 to 12 indicate the twin pair, i.e. the rows for YAj and YBj each have a 1 in column 2 + j and zeros in the remaining pair columns.
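A sketch constructing this 20 × 12 design matrix in R and checking its rank (its rank deficiency is the subject of the identifiability example below):

# Design matrix for the twin example: drug indicators plus twin-pair indicators
drug <- rep(c("A", "B"), each = 10)
pair <- rep(1:10, times = 2)
X <- cbind(drug == "A", drug == "B",
           sapply(1:10, function(j) pair == j)) * 1   # *1 turns logicals into 0/1
dim(X)        # 20 x 12
qr(X)$rank    # 11 < 12, so the model is not of full rank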
Example 50
We might be interested in a model of the following form:
E(Yi) = β1 + β2 xi1 + β3 xi2 + β4 xi1 xi2 + β5 xi3²,
where xik is the value of the kth predictor for observation i . This is a linear model,
as the parameters act linearly.
Example 51
E(Yi) = β1 + β2 xi^(β3)
is not a linear model. The influence of the parameter β3 is not linear.
Assumptions:
We will frequently use some of the following further assumptions about the linear
model.
Second Order Assumption (SOA): cov(ε) = (cov(εi , εj))i,j=1,...,n = σ2In for some σ2 > 0.
(SOA) really consists of two parts: First, that the errors of two different observations, εi and εj for i ≠ j, are uncorrelated. Second, that the variance of all errors is identical (recall: Var(εi) = cov(εi , εi)).
For the remainder of this Chapter we will assume (SOA)
Normal theory assumptions (NTA): ε ∼ N(0,σ2In) for some σ2 > 0.
N denotes the n-dimensional multivariate normal distribution. One could equivalently
define the NTA as: ε1, ... , εn ∼ N(0,σ2) independently.
(NTA) implies (SOA). We will use (NTA) in Chapter 10 to construct tests and confidence intervals.
Full rank (FR) The matrix X has full rank.
We say that a matrix has “full rank” if it has the highest possible rank for its dimensions, i.e. if rank(X) = min(n, p). As we are mostly working with the situation n > p, (FR) reduces to rank(X) = p.
In the following, we will denote the rank of X always by r = rank(X ).
9.5 Identifiability
In statistical models, one of the main aims is to determine the unknown parameter. If
two parameter values lead to the same distribution for the observed data we cannot
distinguish between these parameter values.
Definition 18
Suppose we have a statistical model with unknown parameter θ. We call θ identifiable if no two different values of θ lead to the same distribution of the observed data.
For a linear model: the main parameter we are interested in is β and the observation is Y. It turns out that if r < p, then the parameter vector β is not identifiable. The following example shows this.
Example 52
Consider again the linear model for the study with twins in which r = 11 and p = 12.
Let β = (µA, µB, τ1, ... , τ10)^T ∈ Rp and for some δ > 0 let
β˜ = (µA − δ, µB − δ, τ1 + δ, ... , τ10 + δ)^T.
Then
Xβ = (µA + τ1, ... , µA + τ10, µB + τ1, ... , µB + τ10)^T = ((µA − δ) + (τ1 + δ), ... , (µB − δ) + (τ10 + δ))^T = X β˜.
Thus β and β˜ lead to the same expected value of Y. Thus β is not identifiable.
9.6 Least Squares Estimation
We will estimate β by least squares.
Least squares: choose β to minimise
S(β) = ∑i=1^n (Yi − ∑j=1^p Xij βj)² = (Y − Xβ)^T (Y − Xβ)
(Later we will see that under (NTA) the least squares estimator is the MLE).
Multiplying out we get
S(β) = YTY − βTXTY − YTXβ + βTXTXβ
= YTY − 2YTXβ + βTXTXβ
The derivative wrt β is
∂S(β)/∂β = (∂S(β)/∂βi)i=1,...,p = −2 XTY + 2 XTXβ
[You can check this by componentwise differentiation.] The least squares estimator βˆ
should satisfy ∂S(β)/∂β |β=βˆ = 0 and hence we get the
Least Squares Equations (LSE): XTX βˆ = XTY
These equations have a unique solution ⇐⇒ XTX is invertible (i.e. has rank p).
But rank XTX = rank X (see Lemma 7).
Hence the LSE have a unique solution ⇐⇒ the linear model is of full rank.
Suppose βˆ satisfies the LSE. Then it does in fact minimise S(β). To see this, for all
β ∈ Rp,
S(β) = (Y − Xβˆ + Xβˆ − Xβ)^T (Y − Xβˆ + Xβˆ − Xβ)
     = S(βˆ) + (Xβˆ − Xβ)^T (Xβˆ − Xβ) + 2 (Xβˆ − Xβ)^T (Y − Xβˆ)
     ≥ S(βˆ),
since (Xβˆ − Xβ)^T (Xβˆ − Xβ) ≥ 0 and (Xβˆ − Xβ)^T (Y − Xβˆ) = (βˆ − β)^T (XTY − XTXβˆ) = 0 by the LSE.
Geometrical Interpretation
To see that Y − X βˆ is orthogonal to X βˆ:
(Xβˆ)^T (Y − Xβˆ) = βˆ^T XT (Y − Xβˆ) = βˆ^T (XTY − XTXβˆ) = 0, since XTY − XTXβˆ = 0 by the LSE.
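A numerical sketch of the least squares equations on simulated data (dimensions and coefficients are assumed purely for illustration):

# Solve X'X beta.hat = X'Y and compare with lm()
set.seed(3)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))                  # full-rank design with intercept
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
beta.hat <- solve(crossprod(X), crossprod(X, Y))   # solves the LSE (X'X) b = X'Y
cbind(beta.hat, coef(lm(Y ~ X - 1)))               # identical up to rounding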
9.7 Properties of Least Squares Estimation
In this section: assume (FR) and (SOA)
Then
βˆ = (XTX )−1XTY
• βˆ is linear in Y.
More precisely: the function βˆ : Rn → Rp, y 7→ (XTX )−1XTy is a linear
mapping.
• βˆ is unbiased for β.
Indeed, for all β:
E[βˆ] = (XTX )−1XT E(Y) = (XTX )−1XTXβ = β
• cov(βˆ) = σ2(XTX )−1.
Indeed, letting A = (XTX )−1XT we have
cov(βˆ) = cov(AY) = A cov(Y)AT = σ2AAT
= σ2(XTX )−1XTX (XTX )−1 = σ2(XTX )−1
We now compute the least squares estimator explicitly in one simple situation.
Example 53 (Simple linear regression)
Let
Yi = β1 + β2 ai + εi , E εi = 0, i = 1, ... , n,
where a1, ... , an are known deterministic constants. Assume that n ≥ 2
Y^T = (Y1, ... , Yn), β^T = (β1, β2) and X = [1, a1; ... ; 1, an].
Assume SOA and that not all ais are equal (to ensure FR). Then
XTX = ( n, n a¯ ; n a¯, ∑ai² ),
(XTX)^(−1) = 1/(n ∑ai² − n²a¯²) ( ∑ai², −n a¯ ; −n a¯, n ),
XTY = ( n Y¯ ; ∑ai Yi ).
Now we can find βˆ = (XTX)^(−1) XTY, hence
( βˆ1 ; βˆ2 ) = 1/(∑ai² − n a¯²) ( Y¯ ∑ai² − a¯ ∑ai Yi ; ∑ai Yi − n a¯ Y¯ ).
βˆ2 = ∑(ai − a¯)(Yi − Y¯) / ∑(ai − a¯)²,  βˆ1 = Y¯ − βˆ2 a¯,
cov(βˆ) = σ²(XTX)^(−1) = σ²/(n ∑(ai − a¯)²) ( ∑ai², −n a¯ ; −n a¯, n ).
If a¯ = 0 everything becomes easier: the covariance matrix is diagonal and βˆ1 = Y¯. To get to this situation, we can re-parametrise the model by letting γ1 = β1 + a¯β2 and γ2 = β2. Then
E(Y) = Xβ = ( β1 + a1β2 ; ... ; β1 + anβ2 ) = ( γ1 + (a1 − a¯)γ2 ; ... ; γ1 + (an − a¯)γ2 ) = [1, (a1 − a¯); ... ; 1, (an − a¯)] ( γ1 ; γ2 ).
Then ∑i=1^n (ai − a¯) = 0, γˆ1 = Y¯, γˆ2 = βˆ2.
If the main interest lies in β2 then one can work with the simpler transformed model
and get the same estimates via γ2.
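A sketch checking the closed-form expressions of Example 53 against lm() on simulated data (true coefficients are assumed):

# Simple linear regression: closed-form least squares estimates
set.seed(4)
n <- 25
a <- runif(n, 0, 10)
Y <- 2 + 0.5 * a + rnorm(n)
beta2.hat <- sum((a - mean(a)) * (Y - mean(Y))) / sum((a - mean(a))^2)
beta1.hat <- mean(Y) - beta2.hat * mean(a)
c(beta1.hat, beta2.hat)
coef(lm(Y ~ a))          # same values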
The following theorem justifies the use of the least squares estimator - it can be used
to construct a best linear unbiased estimator (BLUE ).
An estimator γˆ is called linear if there exists L ∈ Rn such that γˆ = LTY.
Theorem 7 (The Gauss-Markov Theorem for full-rank linear models)
Assume (FR), (SOA). Let c ∈ Rp and let βˆ be a least squares estimator of β in a
linear model. Then the following holds: The estimator cT βˆ has the smallest variance
among all linear unbiased estimators for cTβ.
Remark For i ∈ {1, ... , n}, let ei = (0, ... , 0, 1, 0, ... , 0)T with the 1 being in the ith
component. Choosing c = ei we have cTβ = βi and cT βˆ = βˆi . Thus βˆi has the
smallest variance among all linear unbiased estimators of βi .
Proof cT βˆ is linear and unbiased
Let γˆ = LTY be any other linear unbiased estimator of cTβ.
Write DT := LT − cT(XTX)^(−1)XT, so that LTY = cTβˆ + DTY. Then
Var(γˆ) = Var(LTY) = cov(cTβˆ + DTY, cTβˆ + DTY) = Var(cTβˆ) + Var(DTY) + 2 cov(cTβˆ, DTY).
Furthermore,
cov(cTβˆ, DTY) = cT(XTX)^(−1)XT cov(Y) D = σ² cT(XTX)^(−1)(DTX)T = 0,
since cov(Y) = σ²In and DTX = 0T (∗).
To see (∗): both estimators are unbiased =⇒ 0 = E(γˆ) − E(cTβˆ) = E(DTY) = DTXβ for all β. Hence, DTX = 0T. Thus,
Var(γˆ) = Var(cT βˆ) + Var(DTY) ≥ Var(cT βˆ).
9.8 Projection Matrices
Let L be a linear subspace of Rn, dim L = r ≤ n.
Definition 19
P ∈ Rn×n is a projection matrix onto L,
if
1. Px = x ∀x ∈ L
2. Px = 0 ∀x ∈ L⊥ = {z ∈ Rn : zTy = 0 ∀y ∈ L}
By definition, rank P = dim L = r .
Remark Projection matrices will be very
useful when proving results for linear mod-
els. We will often use L = span(X ), the
space spanned by the columns of the de-
sign matrix X , or L = span(X )⊥.
Lemma 11
P is a projection matrix ⇐⇒ PT = P (P symmetric) and P² = P (P idempotent).
Proof ⇒:
Recall that any x ∈ Rn can be uniquely
written as x = xL + xL⊥ , where xL ∈ L
and xL⊥ ∈ L⊥.
Let x ∈ Rn. Then P2x = P(Px) = PxL =
Px. Hence, P2 = P .
For all x, y ∈ Rn (using the decompositions x = xL + xL⊥ and y = yL + yL⊥):
xT PT y = (Px)T y = xL^T yL = xL^T (Py) = xT Py.
Hence, PT = P.
⇐:
Let L be the space spanned by the columns of P .
• Let x ∈ L. Then ∃z ∈ Rn : x = Pz. Hence, Px = P²z = Pz = x (idempotence).
• Let x ∈ L⊥. Then for all y ∈ Rn: (Px)Ty = xTPTy = xT(Py) = 0, since P is symmetric and Py ∈ L. Hence Px = 0.
The projection matrix is unique. Indeed, for each i , the vector ei can be uniquely
written as ei = x + y, where x ∈ L and y ∈ L⊥. Then the ith column of P is Pei = x.
If x1, ... , xr are a basis of L then the projection onto L is given by
P = X (XTX )−1XT ,
where X = (x1, ... , xr). [Prove this directly via the definition of the projection matrix, or check P² = P, PT = P and span(P) = L, where span(P) denotes the space spanned by the columns of P.]
If x1, ... , xr are an orthonormal basis then P = XXT .
In − P is the projection matrix onto L⊥
(can be checked using original definition).
Example 54
n = 3.
If L = span((1, 0, 0)^T, (0, 1, 0)^T) then P = ( 1 0 0 ; 0 1 0 ; 0 0 0 ).
If L = span((1, 1, 0)^T, (0, 0, 1)^T) then P = ( 1/2 1/2 0 ; 1/2 1/2 0 ; 0 0 1 ).
If L = span(x) for some x ∈ Rn then P = x x^T / (x^T x).
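A small R sketch verifying that P = X(XTX)^(−1)XT is symmetric and idempotent and that its trace equals its rank (the design matrix below is an arbitrary full-rank example):

# Projection matrix onto span(X) and its properties
X <- cbind(1, 1:4)                          # n = 4, two linearly independent columns
P <- X %*% solve(crossprod(X)) %*% t(X)
all.equal(P, t(P))                          # symmetric
all.equal(P %*% P, P)                       # idempotent
sum(diag(P))                                # trace(P) = rank(P) = 2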
Lemma 12
If A is an n × n projection matrix (i.e. A = AT , A2 = A) of rank r then
1. r of the eigenvalues of A are 1 and n − r are 0,
2. rank A = trace A,
Proof Let x be an eigenvector of A, with eigenvalue λ. Then λx = Ax = A²x = λAx = λ²x. Since x ≠ 0, λ = λ², so λ ∈ {0, 1}.
1. A symmetric =⇒ ∃P (orthogonal) s.t. P−1AP = D, where D is diagonal with
0s and 1s on the diagonal. Since P is non-singular, rank A = rank D. Hence D
has r ones down the diagonal.
2. trace(A) = trace(APP−1) = trace(P−1AP) (Lemma 6) = trace(D) = rank(A).
9.9 Residuals, Estimation of the variance
Definition 20
Yˆ = X βˆ, where βˆ is a least squares estimator, is called the vector of fitted values.
In the full rank case, Yˆ = X (XTX )−1XTY.
Lemma 13
Yˆ is unique and
Yˆ = PY,
where P is the projection matrix onto the column space of X .
Because of this lemma, P is sometimes called the hat matrix (it puts the hat on Y,
i.e. Yˆ = PY).
Proof Suppose βˆ is a LSE of β. We already know P is unique, hence PY is unique.
Thus it suffices to show Yˆ = PY. Since PY ∈ span(X ) there exists γ s.t. Xγ = PY.
Then
S(βˆ) = ‖Y − PY + PY − Xβˆ‖²
      = ‖Y − PY‖² + ‖PY − Xβˆ‖² + 2 (Y − PY)^T (PY − Xβˆ)
      = S(γ) + ‖PY − Xβˆ‖²,
since ‖Y − PY‖² = S(γ) and (Y − PY)^T (PY − Xβˆ) = YT(I − P)(PY − Xβˆ) = 0 because PY − Xβˆ ∈ span(X). As βˆ minimises S, S(γ) ≥ S(βˆ), so
S(βˆ) ≥ S(βˆ) + ‖PY − Xβˆ‖².
Thus ‖PY − Xβˆ‖ = 0. Therefore, PY = Xβˆ.
Definition 21
e = Y − Yˆ is called the vector of residuals.
Remark Equivalent form: Using Lemma 13,
e = Y − PY = QY,
where Q = I − P is the projection matrix onto span(X )⊥.
Remark
E(e) = E[QY] = Q E Y = QXβ = 0, since QX = 0.
e can be used to see how well the model and the data agree and to see if certain
observations are larger or smaller than predicted by the model.
Diagnostic plots (Optional)
Suppose the data comes from the model
Y = Xβ + Zγ + ε, E ε = 0,
where Z ∈ Rn \ span(X) and γ ∈ R are deterministic, but the analyst erroneously works with
Y = Xβ + ε, E ε = 0.
Thus, if γ ≠ 0 then the analyst uses the wrong model. Then
E(e) = E(QY) = E(Q(Xβ + Zγ + ε)) = QZγ.
Thus plotting QZ against the residuals should give a line (with some noise) through
the origin. If the slope of this line is different from 0 then we should consider including
Z in the model.
Example 55 (Plots of Residuals - Brain and Body Weight of Land Mammals)
oi=body weight of the ith animal, ri=brain weight of the ith animal.
Consider the model
log(ri) = β1 + εi , E(εi) = 0. (1)
In matrix form this is
E[(log(r1), ... , log(rn))^T] = (1, ... , 1)^T β1.
Here, P = (1/n) 1 1^T, the n × n matrix with every entry equal to 1/n.
The following is a plot of the residuals in this model:
[Plot: residuals ei against the observation index i for model (1).]
Suppose we suspect Z = o might be important. To investigate this, we plot (Qo)i =
oi − o¯ vs ei . If the model (1) is true then the plot below should roughly look like
the previous plot.
[Plot: residuals ei against bodyi − mean(body), i.e. against (Qo)i .]
The fit of (1) does not seem to be good; however simply including o in the model
does not seem to be reasonable because the above plot does not look like a linear
relationship.
Let zj = log(oj). A plot of (Qz)i vs ei :
[Plot: residuals ei against log(bodyi) − mean(log(body)), i.e. against (Qz)i .]
This looks like a linear relationship with slope ≠ 0, so we could include zi β2 in model (1).
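A brief R sketch of this diagnostic on simulated data standing in for the mammal example (variable names and coefficients are assumed):

# Residuals of an intercept-only fit, plotted against a candidate covariate
set.seed(5)
n <- 62
z <- rnorm(n)                       # plays the role of log(body weight)
y <- 3 + 0.75 * z + rnorm(n)        # data that actually depend on z
e <- resid(lm(y ~ 1))               # residuals from the (wrong) intercept-only model (1)
plot(z - mean(z), e)                # (Qz)_i vs e_i: a clear slope suggests adding z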
Residual Sum of Squares
Definition 22
RSS = eTe is called the residual sum of squares.
RSS quantifies the departure of the data from the model. It is the minimum of S(β).
Remark Other forms:
• RSS = ∑i=1^n ei²
• RSS = S(βˆ) = ‖Y − Xβˆ‖²
• RSS = (QY)^T QY = YT QTQ Y = YT Q Y
• RSS = YTY − YˆTYˆ.
Indeed, RSS = (Y − Yˆ)^T (Y − Yˆ) = YTY − 2 YˆTY + YˆTYˆ = YTY − YˆTYˆ. The last equality holds because YˆTY = (PY)TY = (PPY)TY = YTPTPTY = (PY)TPY = YˆTYˆ.
Theorem 8
σˆ² := RSS/(n − r) is an unbiased estimator of σ².
Recall: r = rank(X )
Proof Let Q = I − P . Since P is a projection matrix, Q is a projection matrix as
well. Hence, RSS = Y TQY .
E(RSS) = E trace(RSS) = E trace(YTQY) = E trace(QYYT) (Lemma 6) = trace(Q E(YYT))
= trace(Q[cov Y + E(Y) E(Y)T]) = σ² trace(Q) + trace(QXβ(Xβ)T)
= σ² trace(I − P) + 0 = σ²(n − trace(P)) = σ²(n − rank(P)) (Lemma 12) = σ²(n − r).
Remark This is a generalisation of the result that the sample variance s2 is an unbiased
estimator for σ2 when Y1, ... , Yn are i.i.d. with unknown mean µ and unknown variance
σ2.
Indeed, we can write this iid setup as the linear model Y = (1, ... , 1)^T µ + ε with E ε = 0 and cov ε = σ²I. Then P = X(XTX)^(−1)XT = (1/n) XXT, the n × n matrix with every entry 1/n, and thus e = Y − Yˆ = Y − PY = Y − (Y¯, ... , Y¯)^T. Hence,
RSS/(n − r) = ∑(Yi − Y¯)² / (n − 1) = s² (the sample variance),
which we already know is unbiased for σ².
Coefficient of Determination (R2)
In the simplest model with only an intercept term, i.e. in
Y = (1, ... , 1)^T β1 + ε, E ε = 0,
we have RSS = ∑i=1^n (Yi − Y¯)². Larger models, i.e. models with more columns in X,
will only lead to smaller RSS.
For models containing an intercept term, (i.e. X contains a column consisting of 1s
(or any other constant)), a popular measure of the quality of a model is
R² = 1 − RSS / ∑i=1^n (Yi − Y¯)²,
called the coefficient of determination or simply R2. A smaller RSS is “better”, thus
we want a large R2. Note: 0 ≤ R2 ≤ 1 and R2 = 1 for a “perfect” model.
Remark (Intuitive interpretation) RSS/n is an estimator of σ². (1/n) ∑i=1^n (Yi − Y¯)² is an estimator of σ² in the model with only the intercept term (let us call this the “total variance”). Thus
(RSS/n) / ((1/n) ∑i=1^n (Yi − Y¯)²) ≈ variance in the model / total variance,
and hence
R² ≈ (total variance − variance in model) / total variance.
Hence, R² ≈ fraction of the total variance of the data that “is explained” by the model.
Example 56
Boiling Point: R² = 0.995.
Mammals: model log(braini) = β1 + log(bodyi) β2 + εi : R² = 0.92;
model braini = β1 + bodyi β2 + εi : R² = 0.87.
Note: These are unusually high values! Often, R2 can be much smaller.
Remark Adding columns to X will never decrease R2. Thus one should not use R2
directly for model comparisons; one should penalise models with a larger number of
parameters. More about this in Chapter 10.
10 Linear Models with Normal Theory Assumptions
In this Chapter we will again consider a linear model Y = Xβ + ε, E ε = 0. In order to construct confidence intervals or test hypotheses we need assumptions about the distribution of Y (or equivalently about the distribution of ε).
We will work with the (NTA), which are ε ∼ N(0,σ2In).
10.1 Distributional Results
We first define the multivariate normal distribution and some distributions constructed
from it. After that some useful properties are shown.
10.1.1 The Multivariate Normal Distribution
The multivariate normal distribution, denoted by N(µ, Σ), is a distribution of a random vector. It has two parameters: a vector µ ∈ Rn and a positive semidefinite matrix Σ ∈ Rn×n. It will turn out that µ is its expectation and Σ is its covariance.
It can be defined in several ways. In M2S1 you have defined it via the joint pdf as
follows (this definition only works if Σ is positive definite):
Z ∼ N(µ, Σ) if Z has a pdf of the form
f(z) = 1/((√(2π))^n |Σ|^(1/2)) exp(−(1/2)(z − µ)^T Σ^(−1) (z − µ)),
where |Σ| denotes the determinant of Σ.
Example 57
Z ∼ N(µ, σ²I) for some σ² > 0. Then
f(z) = 1/((√(2π))^n (σ^(2n))^(1/2)) exp(−(1/2)(z − µ)^T (σ^(−2)I)(z − µ))
     = ∏i=1^n 1/√(2πσ²) exp(−(zi − µi)²/(2σ²)).
Thus: Z1, ... , Zn are independent, with Zi ∼ N(µi, σ²), i = 1, ... , n.
The three definitions mentioned below (which are all equivalent) will also work if Σ is
only positive semidefinite.
Definition 23
• An n-variate random vector Z follows a multivariate normal distribution if for
all a ∈ Rn the random variable aTZ follows a univariate normal distribution
(the degenerate case N(·, 0) is allowed).
• Let X1, ... Xr ∼ N(0, 1) be iid, let µ ∈ Rn and A ∈ Rn×r . Then Z = AX+µ ∼
N(µ, AAT ).
• Z ∼ N(µ, Σ) if its characteristic function φ : Rn → C, φ(t) = E(exp(i ZT t)), satisfies
φ(t) = exp(i µT t − (1/2) tT Σ t) ∀t ∈ Rn,
where µ ∈ Rn and Σ ∈ Rn×n is a positive semidefinite matrix.
where µ ∈ Rn and Σ ∈ Rn×n is a positive semidefinite matrix.
Remark (Concerning the definition via the characteristic function) You have previously met the moment generating function. One of the key results about moment
generating functions is that the moment generating function uniquely identifies the
distribution of a random variable.
Similarly, there is a moment generating function defined for n-variate random vectors
X, namely M : Rn → R, t 7→ E(exp(tTX)). Again, M identifies the distribution of a
random vector.
The characteristic function (which was used in the above definition), is often used
instead of the moment generating function. It has similar properties (in particular it
uniquely defines a distribution). The i in it is the complex number i . Furthermore, if
Z = X + iY is a complex-valued random variable then E(Z ) := E(X ) + i E(Y ).
Remark (Useful properties) Let Z ∼ N(µ, Σ). Then
• E Z = µ,
• cov Z = Σ,
• if A is a deterministic matrix and b is a deterministic vector of appropriate
dimension then
AZ + b ∼ N(Aµ + b, AΣAT ).
In general: if X and Y are random variables then cov(X , Y ) = 0 does not imply that X
and Y are independent. To put it briefly: uncorrelated does not imply independence.
The following lemma shows (in a general form) that uncorrelated and jointly normal
does imply independence.
Lemma 14
For i = 1, ... , k, let Ai ∈ R^(ni×ni) be pos. semidef. and symmetric and let Zi be an ni-variate random vector. If Z = (Z1^T, ... , Zk^T)^T ∼ N(µ, Σ) for some µ ∈ R^(∑ ni) and Σ = diag(A1, ... , Ak) (block diagonal with blocks A1, ... , Ak), then Z1, ... , Zk are independent.
Proof In the special case that all Ai are positive definite, this can be shown by using Σ^(−1) = diag(A1^(−1), ... , Ak^(−1)) and |Σ| = ∏i=1^k |Ai| to factor the pdf.
The full proof works via the characteristic (or via the moment generating) functions; to show independence one needs to show that the characteristic function can be written as a product of the characteristic functions of the components, i.e. one needs to show E exp(i tT Z) = ∏i=1^k E exp(i ti^T Zi) for all t = (t1^T, ... , tk^T)^T ∈ Rn.
Example 58
Let k = 3, A1 = (2), A2 = ( 1 −0.5 ; −0.5 1 ), A3 = ( 2 1 ; 1 2 ). Let
Σ = diag(A1, A2, A3) =
( 2    0     0    0  0
  0    1   −0.5   0  0
  0  −0.5    1    0  0
  0    0     0    2  1
  0    0     0    1  2 ).
If Z ∼ N(µ, Σ) for some µ ∈ R5 then Z1, (Z2, Z3)^T, (Z4, Z5)^T are independent.
10.1.2 Distributions derived from the Multivariate Normal
In this section we define several distributions derived from the multivariate normal
distribution. They will appear as distributions of pivotal quantities and test statistics.
You should have met the central χ2- and t-distribution previously.
Definition 24
Let Z ∼ N(µ, In), where µ ∈ Rn.
U = ZTZ = ∑i=1^n Zi² is said to have a non-central χ²-distribution with n degrees of freedom (d.f.) and non-centrality parameter δ = √(µTµ).
Notation: U ∼ χ²n(δ), χ²n = χ²n(0).
Remark In order for this to be a proper definition we need to show that the distr. of
U depends on µ only through µTµ. (One of the questions on the problem sheet asks
you to prove this). One approach is to show that the moment generating function of
U equals
MU(t) = (1 − 2t)^(−n/2) exp(tδ²/(1 − 2t)).
The following lemma contains some properties of the χ2-distribution.
Lemma 15
Let U ∼ χ2n(δ). Then E(U) = n + δ2 and Var(U) = 2n + 4δ2.
If Ui ∼ χ²ni(δi), i = 1, ... , k, and the Ui are indep. then ∑i=1^k Ui ∼ χ²_(∑ni)(√(∑δi²)).
Proof: → Problem Sheet.
[Plots: pdf of the χ²n distribution for n = 1, 3, 6 (left) and of the χ²3(δ) distribution for δ = 0, 1, 2 (right).]
Definition 25
Let X and U be independent random variables with X ∼ N(δ, 1) and U ∼ χ2n. The
distribution of
Y = X / √(U/n)
is called non-central t-distribution with n degrees of freedom and non-centrality
parameter δ.
Notation: Y ∼ tn(δ), tn = tn(0).
[Plots: pdf of the tn distribution for n = 1, 3, 9, ∞ (left) and of the t3(δ) distribution for δ = −2, 0, 2, 4 (right).]
Remark Convergence to the normal distribution: Suppose Yn ∼ tn for all n ∈ N.
Then
Yn
d→ N(0, 1) (n→∞)
To see this: Let X ∼ N(0, 1) and Un = ∑i=1^n Zi² for Z1, Z2, ... ∼ N(0, 1) indep. Then, by the weak law of large numbers, Un/n →P E(Z1²) = 1. As x ↦ √x is continuous, this implies √(Un/n) →P √1 = 1. Thus, by Slutsky's Lemma:
Yn = X/√(Un/n) →d X/1 = X ∼ N(0, 1).
Definition 26
If W1 ∼ χ2n1(δ), W2 ∼ χ2n2 independently then
F = (W1/n1) / (W2/n2)
is said to have a non-central F distribution with (n1, n2) d.f. and n.c.p.=δ.
Notation: F ∼ Fn1,n2(δ), Fn1,n2 = Fn1,n2(0).
[Plots: pdf of the central Fn1,n2 distribution for (n1, n2) = (2, 2), (8, 2), (8, 8) (left) and of the F8,8(δ) distribution for δ = 0, 2, 4, 8 (right).]
Remark If the n.c.p. δ = 0 then the above are called central χ2, t, F distribution.
10.1.3 Some Independence Results
In order to show that a random variable follows a t-distribution via the definition we
need to show independence between a normal and a χ2-distributed random variable.
Similarly, to show that a random variable is F-distributed, we need to show independence of two χ²-distributed random variables. This section provides results that help in doing this.
Lemma 16
Let A ∈ Rn×n be a pos. semidefinite symmetric matrix with rank r . Then
there exists L ∈ Rn×r such that rank L = r , A = LLT and LTL =
diag(nonzero eigenvalues of A).
Proof Lemma 8 =⇒ ∃ an orthogonal matrix P s.t.
PTAP = D = diag(eigenvalues of A).
Precisely r elements of D are positive and the others are zero. Hence, A = PDPT = P D^(1/2) D^(1/2) PT = (P D^(1/2))(P D^(1/2))T. Let L consist of the nonzero columns of P D^(1/2). Then A = LLT.
Because P is orthogonal (i.e. PTP = I) we get LTL = diag(nonzero eigenvalues of A).
Lemma 17
Let X ∼ N(µ, I ), A ∈ Rn,n pos. semidefinite symmetric and let B be a matrix such
that BA = 0.
Then XTAX and BX are independent.
Proof Let r = rank(A). By Lemma 16, ∃ L ∈ Rn×r such that rank L = r and
A = LLT .
cov(BX, LTX) = B cov(X)L = BL = BLLTL(LTL)−1 = BAL(LTL)−1 = 0
Thus BX and LTX are independent (because they are jointly normally distributed).
Hence, BX and XTLLTX = XTAX are independent.
Lemma 18
If Z ∼ N(µ, In) and A is an n × n projection matrix of rank r , then
ZTAZ ∼ χ2r (δ) with δ2 = µTAµ
Proof All nonzero eigenvalues of A are equal to 1.
By Lemma 16, ∃ L ∈ Rn×r such that A = LLT and LTL = Ir. Let V = LTZ. Then V ∼ N(LTµ, I) (using LTL = I) and
ZTAZ = ZTLLTZ = VTV ∼ χ²r(δ),
where δ² = (LTµ)T LTµ = µT LLT µ = µTAµ.
Lemma 19
If Z ∼ N(µ, In) and A1, A2 ∈ Rn×n are projection matrices and A1A2 = 0 then
ZTA1Z and Z
TA2Z are independent.
Proof cov(A1Z, A2Z) = A1 cov(Z, Z) A2^T = A1 A2^T = 0, since cov(Z, Z) = I.
Because A1Z and A2Z are jointly normally distributed this shows that they are independent.
As ZTAiZ = (AiZ)T(AiZ) for i = 1, 2 (symmetry + idempotence), this implies that ZTA1Z and ZTA2Z are independent.
This result extends to ZTA1Z, ... , ZTAkZ, where AiAj = 0 (i ≠ j).
Lemma 20
If A1, ... , Ak are symmetric n × n matrices such that ∑Ai = In and if rank Ai = ri
then the following are equivalent:
1. ∑ ri = n
2. AiAj = 0 for all i 6= j
3. Ai is idempotent for all i = 1, ... , k .
Proof (2)→(3) ∀j: A1 + · · · + Ak = In =⇒ A1Aj + · · · + AkAj = Aj =⇒ Aj² = Aj.
(3)→(1) n = trace(In) = ∑ trace(Ai) = ∑ rank(Ai) (Lemma 12) = ∑ ri.
(1)→(2) Let Vi = {Aix : x ∈ Rn} = span(Ai). Then dim Vi = ri. Let Bi be a basis for Vi and let B = ∪i Bi. Since x = Ix = ∑ Aix ∀x ∈ Rn, B spans Rn, and since B has at most ∑ ri = n elements, B must form a basis of Rn. Hence, any x ∈ Rn can be written uniquely as ∑ ui where ui ∈ Vi. Let x be a column of Aj. Then x + ∑i≠j 0 = ∑ Aix, with x ∈ Vj. By uniqueness, Aix = 0 for all i ≠ j.
Theorem 9 (The Fisher-Cochran Theorem)
If A1, ... , Ak are n × n projection matrices such that ∑i=1^k Ai = In, and if Z ∼ N(µ, In), then ZTA1Z, ... , ZTAkZ are independent and
ZTAiZ ∼ χ²ri(δi), where ri = rank(Ai) and δi² = µTAiµ.
Proof By Lemma 20, AiAj = 0 for all i ≠ j. Hence, ZTA1Z, ... , ZTAkZ are independent.
The rest of the theorem is a consequence of Lemma 18.
10.2 The Linear Model with Normal Theory Assumptions
In this section we will consider the linear model Y = Xβ + ε, E(ε) = 0, with (NTA). Recall that the (NTA) are ε ∼ N(0,σ2In). In particular, this implies Y ∼ N(Xβ,σ2In).
The joint probability density function of Y is thus
f(y) = 1/((√(2πσ²))^n) exp(−(1/(2σ²)) (y − Xβ)^T (y − Xβ)).
Estimation using the maximum likelihood approach:
• The log-likelihood of the data is
L(β,σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) (Y − Xβ)^T (Y − Xβ), where (Y − Xβ)^T (Y − Xβ) = S(β).
• maximising L with respect to β (for fixed σ2) is equivalent to minimising S(β) =
(Y−Xβ)T (Y−Xβ), i.e. maximum likelihood is equivalent to least squares for
estimating β.
• The maximum likelihood estimator for σ² is RSS/n (look at the first/second derivative of L(βˆ,σ²) = −(n/2) log(2πσ²) − RSS/(2σ²) wrt σ²).
10.3 Confidence Intervals, Tests for one-dimensional quantities
The following gives a pivotal quantity for σ2 which can be used to construct CIs or
tests.
Lemma 21 (Distribution of RSS)
Assume (NTA). Then
RSS/σ² ∼ χ²_(n−r),
where r = rank X.
Proof Let P denote the projection matrix onto span(X). Then
RSS = eTe = ((I − P)Y)T (I − P)Y = YT QTQ Y = YTQY, where Q := I − P (note QTQ = QQ = Q),
and Q is the projection onto the space orthogonal to the columns of X. Hence,
RSS/σ² = (Y/σ)T Q (Y/σ) = ZTQZ,
where Z = Y/σ ∼ N(Xβ/σ, I) and Q is a projection matrix. Furthermore, Q + P = I and Q and P are projection matrices. Thus, by Lemma 20, rank Q + rank P = n with rank P = r, implying rank Q = n − r.
Thus, by Lemma 18,
RSS/σ² ∼ χ²_(n−r),
since the non-centrality parameter satisfies (Xβ/σ)T Q (Xβ/σ) = 0 because QX = 0.
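A simulation sketch of Lemma 21 under (NTA); n, the design and σ below are assumed values:

# Check that RSS/sigma^2 has a chi^2_{n-r} distribution under (NTA)
set.seed(6)
n <- 20; sigma <- 2
X <- cbind(1, rnorm(n), rnorm(n))                 # r = rank(X) = 3
rss <- replicate(5000, {
  Y <- drop(X %*% c(1, -1, 0.5)) + rnorm(n, sd = sigma)
  sum(resid(lm(Y ~ X - 1))^2)
})
ks.test(rss / sigma^2, "pchisq", df = n - 3)      # no evidence against the chi^2_{17} fit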
Often, we will be interested in parts of the parameter vector β, e.g. one of its compo-
nents βi . We want to construct tests, or construct confidence intervals.
The following lemma provides a flexible way of doing this; it gives a pivotal quantity for
cTβ for some (known) c ∈ Rp. We have worked with cTβ already in the Gauss-Markov
Theorem.
Lemma 22
Assume (FR), (NTA) in a linear model. Let c ∈ Rp. Then
(cTβˆ − cTβ) / √( cT(XTX)^(−1)c · RSS/(n − p) ) ∼ t_(n−p).
Proof Since cT βˆ = cT (XTX )−1XTY and Y ∼ N(Xβ,σ2I ),
E cT βˆ = cTβ
Var(cTβˆ) = Var(cT(XTX)^(−1)XTY) = cT(XTX)^(−1)XT cov(Y) X(XTX)^(−1)c = cT(XTX)^(−1)c σ²,
and thus cTβˆ ∼ N(cTβ, cT(XTX)^(−1)c σ²). Hence,
(cTβˆ − cTβ) / √(cT(XTX)^(−1)c σ²) ∼ N(0, 1).
We already know RSS /σ2 ∼ χ2n−p.
Thus the lemma is a consequence of the definition of the t-distribution once we
have shown that βˆ = (XTX)^(−1)XTY and RSS = YTQY are independent. The latter is a consequence of Lemma 17 (using Z = Y/σ), since (XTX)^(−1)XT Q = (XTX)^(−1)(QX)T = 0 because QX = 0.
Example 59
Data Set: Tooth Growth (see File in Additional Material)
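A hedged sketch of how such an analysis could look in R with the built-in ToothGrowth data; the model below (tooth length on supplement and dose) is just one plausible choice, and the exact model in the additional material may differ:

# Confidence intervals for single coefficients via the pivotal quantity of Lemma 22
fit <- lm(len ~ supp + dose, data = ToothGrowth)
confint(fit, level = 0.95)                       # CIs based on the t_{n-p} distribution
# the same interval "by hand" for the dose coefficient (c picks out that component):
est <- coef(fit)["dose"]
se  <- sqrt(vcov(fit)["dose", "dose"])           # sqrt(c'(X'X)^{-1} c * RSS/(n-p))
est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se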
Remark Suppose we construct a test with the above pivotal quantity for cTβ. It turns out that the test statistic has a non-central t-distribution under the alternative hypothesis.
10.4 The F-Test
In the previous section, we derived pivotal quantities for one-dimensional parameters
(σ2 or linear combinations cTβ of the components of β such as, for some i , eTi β = βi).
If we are interested in how more than one component of the parameter behaves, e.g. if the null hypothesis β2 = β3 = 0 is of interest, then we would have to do more than one test (and this would result in similar problems as the “joint confidence intervals” mentioned earlier, and a correction such as the Bonferroni correction would be necessary). This section presents a method to test more complicated hypotheses about β.
Example 60
Suppose we have a linear model with p = 3 and design matrix
X = [1, a1, b1; ... ; 1, an, bn].
Suppose we are interested in testing the hypotheses
H0 : β2 = β3 = 0 against H1 : β2 6= 0 or β3 6= 0
Under H0, we can write the linear model as
E Y = X0 β1, where X0 = (1, ... , 1)^T.
Thus we can rewrite the hypotheses as
H0 : E Y ∈ span(X0) against H1 : E Y 6∈ span(X0)
In general, suppose we want to test whether a sub-model of a linear model E Y = Xβ
is true, i.e. we want to test
H0 : E Y ∈ span(X0) against H1 : E Y 6∈ span(X0)
for some matrix X0 with span(X0) ⊂ span(X ). In other words, the null hypothesis says
that the sub-model
E(Y) = X0β0
is true.
Example 61
Continuing the previous example, one may also be interested in X0 = [1, a1; ... ; 1, an] [equivalent to β3 = 0] or X0 = [1, a1 − b1; ... ; 1, an − bn] [equivalent to β3 = −β2].
Let
• RSS= the residual sum of squares in the full model E Y = Xβ
• RSS0= the residual sum of squares in the sub-model E Y = X0β˜
Lemma 23
Under H0 : E Y ∈ span(X0),
F = (RSS0 − RSS)/RSS · (n − r)/(r − s) ∼ F_(r−s, n−r),
where r = rank X, s = rank X0.
Let P be the projection matrix onto span X, and let Q = I − P be the projection matrix onto (span X)⊥.
Likewise, let P0 be the projection matrix onto span X0, and let Q0 = I − P0 be the projection matrix onto (span X0)⊥. Then, as we derived in the previous chapter,
RSS = YTQY, RSS0 = YTQ0Y.
Using this gives
F = (YTQ0Y − YTQY)/(YTQY) · (n − r)/(r − s) = (YT(P − P0)Y/σ²)/(YT(I − P)Y/σ²) · (n − r)/(r − s).
To show: numerator ∼ χ²_(r−s), denominator ∼ χ²_(n−r), numerator and denominator are independent.
Proof We will use the Fisher-Cochran theorem. Let Z = Y/σ, A1 = I − P , A2 =
P − P0, A3 = P0.
Clearly, A1 + A2 + A3 = I . We already know that A1 and A3 are projection matrices.
To show: A2 = P − P0 is a projection matrix. P − P0 is symmetric as P0 and P are
both symmetric. Furthermore,
(P − P0)² = P² + P0² − PP0 − P0P.
Every column y of P0 is an element of span(X0) and thus an element of span(X ).
Thus, Py = y.
Hence,
PP0 = P0.
This also implies P0P = (PT P0T)T = (PP0)T = P0T = P0.
Thus,
(P − P0)² = P + P0 − P0 − P0 = P − P0.
The Fisher-Cochran theorem now implies
• ZT(P − P0)Z and ZT(I − P)Z are independent,
• ZT(P − P0)Z ∼ χ²_(rank(P−P0))(δ1) with δ1² = (E Z)T(P − P0)(E Z),
• ZT(I − P)Z ∼ χ²_(rank(I−P))(δ2) with δ2² = (E Z)T(I − P)(E Z).
Next, we show that the non-centrality parameters are 0.
Under H0, we know E Z = (1/σ) E Y ∈ span(X0) ⊂ span(X). Thus,
(P − P0) E Z = P E Z − P0 E Z = E Z − E Z = 0.
Hence, (E Z)T(P − P0)(E Z) = 0. Furthermore,
(E Z)T(I − P)(E Z) = (E Z)T(E Z − P E Z) = (E Z)T(E Z − E Z) = 0.
Concerning the degrees of freedom:
• By Lemma 20,
n = rank(P) + rank(I − P) = rank X + rank(I − P) = r + rank(I − P).
Thus, rank(I − P) = n − r.
• Using Lemma 20 again,
rank(P0) + rank(P − P0) + rank(I − P) = n, with rank(P0) = rank(X0) = s and rank(I − P) = n − r.
Thus, rank(P − P0) = r − s.
To summarise, we have shown
ZT (P − P0)Z ∼ χ2r−s , ZT (I − P)Z ∼ χ2n−r independently.
Thus, by definition, F ∼ Fr−s,n−r .
If H0 is not true then the proof is still valid, except for the non-centrality parameter of
ZT (P − P0)Z. Now,
(E Z)T(P − P0)(E Z) = (1/σ²) βTXT(P − P0)Xβ.
Thus, without assuming H0, we get
F ∼ F_(r−s, n−r)(δ), where δ² = (1/σ²)(Xβ)T(P − P0)Xβ.
This implies that F will take on larger values if H0 is not true.
Thus it is advisable to reject for large values of F . In particular, if we want a test to
the level α > 0, we reject if
F > c ,
where c is such that P(X ≥ c) = α for X ∼ Fr−s,n−r .
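A sketch of this F-test in R using anova() on nested lm fits (simulated data; the covariates a and b mirror Example 60):

# F-test of H0: beta2 = beta3 = 0 by comparing nested models
set.seed(7)
n <- 40
a <- rnorm(n); b <- rnorm(n)
Y <- 1 + 0.4 * a + rnorm(n)
fit0 <- lm(Y ~ 1)           # sub-model: E Y in span(X0)
fit1 <- lm(Y ~ a + b)       # full model
anova(fit0, fit1)           # reports F = ((RSS0 - RSS)/(r - s)) / (RSS/(n - r)) and its p-value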
10.5 Confidence Regions
Suppose E Y = Xβ is a linear model satisfying (FR), (NTA). We are interested in
finding a confidence region for β, i.e. we want a random set D s.t. P(β ∈ D) ≥ 1−α
for all β,σ2.
Let
A = (βˆ − β)TXTX(βˆ − β)/RSS · (n − p)/p.
If we can work out the distribution of A, we can use A as a pivotal quantity for β.
The numerator of the first fraction can be rewritten as
(Y − Xβ)TP(Y − Xβ)
where P is the projection onto the space spanned by the columns of X . Indeed,
(Y − Xβ)TP(Y − Xβ) =(Y − Xβ)TPP(Y − Xβ) = (P(Y − Xβ))TP(Y − Xβ)
Using P = X (XTX )−1XT this is equal to
(X (βˆ − β))TX (βˆ − β)
Furthermore, RSS can be written as
RSS = YTQY = (Y − Xβ)TQ(Y − Xβ)
where Q = I − P. Thus, letting Z = (1/σ)(Y − Xβ),
A = (ZTPZ)/(ZTQZ) · (n − p)/p,
with Z ∼ N(0, I ), P + Q = I , rank P = p, P and Q projection matrices.
Thus the Fisher-Cochran theorem shows that A ∼ Fp,n−p.
Hence, a 1− α confidence region R for β is defined by all γ ∈ Rp such that
(βˆ − γ)TXTX(βˆ − γ)/RSS · (n − p)/p ≤ Fp,n−p,α,
where P(Z ≥ Fp,n−p,α) = α for Z ∼ Fp,n−p. The region R is an ellipsoid centred at βˆ
(use diagonalisation).
Remark General definition of an ellipsoid: {z ∈ Rp : (z − z0)TA^(−1)(z − z0) ≤ 1}, where A is pos. def. and z0 ∈ Rp.
Example 62
p = 2, X has full rank.
Let a = βˆ − β, B = XTX and c = p RSS/(n − p) Fp,n−p,α. Hence, in order to obtain the confidence region for β we want to find for which a
aTBa ≤ c. (2)
B is pos. def. Hence, ∃ an orthogonal matrix R and a diagonal matrix D =
diag(d1, d2) s.t. B = R
TDR and d1, d2 are positive. [D consists of the eigenvalues
of B , while R consists of the corresponding normalised eigenvectors.]
Let a˜ = Ra (this rotates the coordinate axes). Then (2) is equivalent to
a˜1² d1/c + a˜2² d2/c ≤ 1.
This describes an ellipse with half-axes of length √(c/d1) and √(c/d2).
[Sketch: an axis-aligned ellipse with half-axes √(c/d1) and √(c/d2).]
Transforming everything back via β = βˆ − a = βˆ − RT a˜ gives a rotated and
translated ellipse, centered at βˆ.
[Sketch: the confidence ellipse in the (β1, β2)-plane, centred at (βˆ1, βˆ2).]
Remark We could construct individual CIs for β1 and β2 via Lemma 22 and combine
them via the Bonferroni correction. The advantage of the above construction is that
the resulting ellipsoid has a smaller area.
11 Diagnostics, Model Selection, Extensions
11.1 Outliers
An outlier is an observation that does not conform to the general pattern of the rest
of the data. Potential causes:
• Error in the data recording mechanism (example - iron content of spinach).
• Data set may be “contaminated” - i.e. it may be the mixture of two or more
populations.
• Indication that the model/underlying theory needs to be improved. Further
investigations needed. Outliers may be the “most interesting” observations.
Practical method for spotting outliers: Look for residuals that are “too large”. When
is a residual too large?
Recall: e = (I − P)Y, where P is the projection onto span(X ). If X is full rank then
P = X (XTX )−1XT . Note that
cov e = (I − P) cov Y(I − P)T = σ2(I − P)
and E e = 0. Thus, under NTA, ei ∼ N(0, σ²(1 − Pii)), where Pii is the ith diagonal element of P. Hence,
ei / √((1 − Pii)σ²) ∼ N(0, 1).
We do not know σ² → plug in the unbiased estimate σˆ² = RSS/(n − p). This gives the standardised residuals
ri = ei / √(σˆ²(1 − Pii)).
This of course means that ri are not (necessarily) N(0,1) distributed.
Nevertheless, the standardized residuals should be roughly independent, and their dis-
tribution should be close to a N(0,1)-distribution.
Remark ri is also not student t; the usual proof (that we used for t-tests) does not
work because σˆ2 and ei are not independent.
Remark Let X ∼ N(0,1). Then the probabilities of large values of X decrease very rapidly, as the following table shows. [The normal distribution has light tails.]
x 3 4 5 6 7 8
P(X > x) 1.3e-03 3.2e-05 2.9e-07 9.9e-10 1.3e-12 6.2e-16
Thus if (NTA) holds then the standardized residuals should be relatively small.
Remark An entire branch of statistics, called“robust statistics”, is concerned with the
development of methods/statistics that give meaningful results even in the presence of
outliers.
Suppose we are interested in the “centre” of a distribution and have observations x1, ... xn. The sample mean x¯ = (1/n) ∑i=1^n xi is heavily affected by outliers - in fact, just one outlier can make x¯ take any value. Other statistics are far more robust to
just one outlier can make x¯ take any value. Other statistics are far more robust to
outliers. For example the median cannot be changed arbitrarily by one outlier [you
would have to move half of the observations.]
11.2 Leverage
What is the potential impact of individual observations on the model fit?
cov(e) = σ²(In − P) and Var(ei) = σ²(1 − Pii).
Definition 27
The leverage of the ith observation in a linear model is Pii, the ith diagonal element of the hat matrix P.
(Recall: P is the projection matrix onto span(X ), where X is the design matrix).
If Pii ≈ 1 then the variance of the ith residual is very low. This is totally determined
by X , i.e. the design matrix is forcing the model fit to be good at the covariates of
the ith observation. In this case the ith observation is said to have high leverage.
∑i=1^n Pii = trace(P) = rank(X) =: r (see Lemma 12), so the “average” is r/n and a rule of thumb is to take notice when Pii > 2r/n.
Example 63 (Linear regression)
E Yi = β1 + β2xi
[Plots: the observations (xi, Yi) (left) and the leverages Pii plotted against xi (right).]
11.3 Cook’s Distance
To measure how much the ith observation changes the estimator βˆ one can consider
the following measure, called Cook's distance:
Di = (βˆ(i) − βˆ)TXTX(βˆ(i) − βˆ) / (p RSS/(n − p)),
where βˆ(i) is the least squares estimator with the ith observation removed. Alternatively,
Di = (Yˆ − Yˆ(i))T(Yˆ − Yˆ(i)) / (p RSS/(n − p)),
where Yˆ(i) = X βˆ(i). Rule of thumb: take notice if Di gets close to 1.
Algebraically equivalent expression:
Di = ri² Pii / ((1 − Pii) r),
where ri is the standardised residual and r = rank(X). Cook's distance combines leverage and residual.
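R provides these diagnostics directly; a short sketch on simulated data with one deliberately extreme covariate value:

# Leverage, standardised residuals and Cook's distance for an lm fit
set.seed(8)
x <- c(rnorm(20), 6)                    # last point has an extreme x-value
y <- 1 + 2 * x + rnorm(21)
fit <- lm(y ~ x)
h <- hatvalues(fit)                     # leverages P_ii
r <- rstandard(fit)                     # standardised residuals
d <- cooks.distance(fit)                # Cook's distances
which(h > 2 * sum(h) / length(h))       # rule of thumb: P_ii > 2r/n
which(d > 1)                            # rule of thumb: D_i close to 1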
11.4 Under/overfitting
Underfitting=necessary predictors left out
Let
Y = Xβ + Zγ + ε
be the model the observations have come from and let
Y = Xβ + ε
be the fitted model.
Suppose we are interested in estimating cTβ. βˆ is biased:
E βˆ = (XTX )−1XT (Xβ + Zγ) = β + (XTX )−1XTZγ
Hence,
MSE(cTβˆ) = Var(cTβˆ) + (E(cTβˆ) − cTβ)² = cT(XTX)^(−1)c σ² + (cT(XTX)^(−1)XTZ)² γ².
Let (βˆF, γˆ)T be the estimator in the full model. Then
cov((βˆF, γˆ)T) = σ² ( XTX, XTZ ; ZTX, ZTZ )^(−1).
Formulas for the inverse of 2 × 2 block matrices are known in the literature (see e.g.
the Matrix cookbook at http://matrixcookbook.com/). Using these we get
cov(βˆF) = σ²(XTX)^(−1) + σ²(XTX)^(−1)XTZ(ZTQZ)^(−1)ZTX(XTX)^(−1),
where Q is the projection matrix onto (span X)⊥. Hence,
Var(cTβˆF) = σ² cT(XTX)^(−1)c + (σ²/(ZTQZ)) (cT(XTX)^(−1)XTZ)².
Hence, if σ²/(ZTQZ) > γ² then the estimator from the reduced model has the smaller MSE.
Hence, the mean squared error can be improved by omitting covariates...
Sometimes it pays to use a simpler model!
Overfitting = unnecessary predictors included.
This means that some of the components of β are 0. Estimator βˆ is unbiased; however
the variance will be larger than in a model where these predictors are left out.
11.5 Weighted Least Squares
So far we have assumed cov(Y) = σ2In. Now suppose cov(Y) = σ2V , where V is
known, symmetric and positive definite.
Example 64
Var Yi ∝ bi², Yi's uncorrelated. Then V = diag(b1², ... , bn²).
How to estimate β? What is a BLUE in this situation? Main idea: transform the model to a situation in which (SOA) holds true, i.e. in which cov(ε) = σ²I.
V is symmetric and positive definite. Hence ∃ a nonsingular matrix T such that T^T V T = In and T T^T = V^(−1). Indeed, by Lemma 8, ∃ an orthogonal matrix P and a diagonal matrix D with the eigenvalues of V on the diagonal s.t.
PTVP = D.
Let T = P D^(−1/2) PT. Since P is orthogonal, V = PDPT and thus
T^T V T = P D^(−1/2) PT P D PT P D^(−1/2) PT = I.
Furthermore, T T^T = P D^(−1) PT = (PT)^(−1) D^(−1) P^(−1) = (PDPT)^(−1) = V^(−1).
Let Z = T^T Y. Then
E(Z) = T^T X β =: X˜ β and cov(Z) = T^T V T σ² = σ² In.
Thus the linear model E Z = X˜β satisfies (SOA). Assuming (FR), we get the following least squares estimator:
βˆ = [X˜^T X˜]^(−1) X˜^T Z = [X^T (T T^T) X]^(−1) X^T (T T^T) Y = (X^T V^(−1) X)^(−1) X^T V^(−1) Y.
Note: βˆ is an optimal estimator in the sense of the Gauss-Markov theorem.
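A sketch of weighted least squares in R; with V = diag(b1², ... , bn²) the appropriate lm() weights are 1/bi² (data simulated under assumed coefficients):

# Weighted least squares: Var(Y_i) proportional to b_i^2, Y_i uncorrelated
set.seed(9)
n <- 50
x <- runif(n); b <- 1 + x                                 # known multipliers b_i
Y <- 1 + 2 * x + rnorm(n, sd = 0.5 * b)
X <- cbind(1, x); Vinv <- diag(1 / b^2)
solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% Y)           # (X'V^-1 X)^-1 X'V^-1 Y
coef(lm(Y ~ x, weights = 1 / b^2))                        # same estimate via lm()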
11.6 Residual Plots
Goal: To detect problems with a model; in other words: to detect a lack of fit of a
model:
Approach: Plot standardised residuals against some other variable (e.g. a column of
X , potentially interesting additional covariates, Yˆ , ...)
If the model is correct then the resulting plots should just show “noise”, with no distinct patterns.
[Plots: standardised residuals r against a covariate x for two simulated data sets.]
The left plot suggests a non-constant variance (a heteroscedastic error).
The right hand plot indicates that the covariate may have a nonlinear influence.
11.7 Distributional Checks
Is the normal theory assumption justified? So-called Quantile-Quantile plots (qq-plots) can reveal if it is not. [Note: they can never prove that it is correct.]
This is because essentially we try to test the following hypotheses:
H0 : model correct against H1 : model not correct.
Not rejecting H0 is no evidence that H0 is correct.
Basic idea of the Quantile-Quantile plots:
Consider two cdfs F and G. Consider the curve
(0, 1) → R², t ↦ (F^(−1)(t), G^(−1)(t)).
If F = G this gives a line through the origin.
Check if data comes from a specific distribution If one wants to check if observed
iid data Y1, ... , Yn come from a fixed distribution G one replaces F by the empirical cdf
Fn : R → [0, 1], Fn(x) = (1/n) ∑i=1^n I(Yi ≤ x),
where I(Yi ≤ x) = 1 if Yi ≤ x and 0 otherwise. Then, if Y1, ... , Yn ∼ F, by the strong law of large numbers,
Fn(x) → F(x) (n → ∞) a.s.,
and so we can use Fn as a consistent estimator for F.
Check if data is normally distributed
Often, one does not want to check if data comes from a fixed distribution but whether
it comes from a given class, e.g. a normal distribution.
Suppose F(x) = Φ((x − µ)/σ) and G = Φ, where Φ is the cdf of N(0,1). Then F^(−1)(t) = σΦ^(−1)(t) + µ and G^(−1)(t) = Φ^(−1)(t). Hence, the qq-plot is
(0, 1) → R², t ↦ (F^(−1)(t), G^(−1)(t)) = (σΦ^(−1)(t) + µ, Φ^(−1)(t)),
which is a line.
To apply to real data, F is replaced by the empirical cdf Fn and then, if the observations
are coming from a normal distribution, the qq plot should show a straight line.
Example 65
[Two normal Q-Q plots: the ordered observations yi against the standard normal quantiles Φ^(−1).]
n=100; LHS: Yi ∼ N(0, 1); RHS: Yi ∼ Exp(1).
Use standardised residuals to check assumption of normality in the linear model.
(Similar plots can also be used for other (specific) distributions)
More formal approaches: Goodness-of-fit tests.
H0 : Y ∼ F , vs H1 : Y 6∼ F
Observations Y1, ... , Yn → Empirical cdf Fn
Test statistic (Kolmogorov-Smirnov test):
sup_(x∈R) |Fn(x) − F(x)|.
Reject H0 for large values.
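A sketch of both checks in R, applied to the standardised residuals of a simulated linear model fit:

# Normal Q-Q plot and Kolmogorov-Smirnov test for standardised residuals
set.seed(10)
x <- rnorm(100); y <- 1 + x + rnorm(100)
r <- rstandard(lm(y ~ x))
qqnorm(r); qqline(r)       # points should lie close to the line under (NTA)
ks.test(r, "pnorm")        # only approximate: the r_i are neither exactly N(0,1) nor independent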
11.8 Model Selection
What covariates/predictors to include in a model?
Example 66 (Quine data)
Aitkin (1978) discussed an observational study of S. Quine, carried out in a town in Australia, with 146 children in the study.
Response: Number of days children were absent from school in a year.
4 factors:
• Age (primary, first, second or third form)
• Ethnicity (aboriginal or non-aboriginal)
• slow or average learner
• male or female
4 · 2 · 2 · 2 = 32 combinations of factors
Want to use a model with fewer parameters (reduce MSE of parameter estimates
one is interested in, better understanding of the model).
How many models? Essentially 2^32 models possible.
Why not just minimize RSS? Including more predictors will always decrease the RSS
but not the MSE (see previous section). Bias-Variance tradeoff (more parameters -
higher variance and lower bias).
Criterion developed by Akaike (he called it “an information criterion”):
AIC = −2 log(L) + 2p,
where p is the number of parameters in the model, and L is the maximum of the
likelihood function. In the linear model,
AIC = n log(RSS/n) + 2p + constant,
Want: model with small AIC.
There are many other criteria: BIC, Mallow’s Cp, LASSO, ...
Search Strategies:
• Best subset (search all possible subsets): usually not possible, since the number of subsets is 2^(# of predictors).
• Stepwise (forward, backward, both): Start with an initial model and change it using small steps (adding/deleting a predictor), always aiming to decrease the criterion. A sketch of this in R is given below.
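A sketch of stepwise selection by AIC with step(); the Quine data are available as quine in the MASS package, and a plain linear model is used purely for illustration (the counts would more naturally be handled by a GLM, see the next section):

# Stepwise model selection by AIC on the Quine data
library(MASS)                                                # provides the quine data frame
full <- lm(Days ~ Eth + Sex + Age + Lrn, data = quine)       # all four factors as main effects
sel  <- step(full, direction = "backward", trace = FALSE)    # drop terms while AIC decreases
formula(sel); AIC(sel)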
Current field of research: What to do if there is an extremely large number of (potential)
covariates (often p >> n)?
Example: How does the DNA influence the occurrence of diseases/survival/obesity (microarray data)?
11.9 Generalized Linear Models (GLMs)
Classical book: McCullagh and Nelder (1989).
g(E Yi) = ∑j Xij βj , i = 1, ... , n,
where g is the so-called link function. One assumes that the distribution of Yi is contained in a so-called exponential family of distributions. Examples of exponential families: Normal, Exponential, Gamma, Poisson, Binomial distributions.
The models are usually fitted by the maximum likelihood method.
Example 67
Suppose Y1, ... , Yn are Bernoulli outcomes. We could formulate the linear model
E Yi = ∑j Xij βj .
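The GLM alternative to this linear formulation uses a link function; a minimal glm() sketch with the logit link on simulated Bernoulli data (coefficients assumed):

# Logistic regression for Bernoulli outcomes
set.seed(11)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))        # true success probabilities
y <- rbinom(200, 1, p)
fit <- glm(y ~ x, family = binomial)         # link g = logit: log(p/(1-p)) = X beta
summary(fit)$coefficients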