Statistical Fundamentals I
MATH401/501 Notes
Chris Sherlock
Contents
1 Preliminaries
  1.1 Overview of model-based inference
  1.2 Point estimates, confidence intervals and hypothesis tests
    1.2.1 Point estimates and estimators
    1.2.2 Confidence intervals
    1.2.3 Hypothesis tests
2 The likelihood function and its maximum
  2.1 The likelihood and log-likelihood
  2.2 Maximum likelihood estimate and estimator
  2.3 Tractable ML with a scalar parameter
    2.3.1 Discrete cases
    2.3.2 Tractable continuous cases
  2.4 Intractable ML with a scalar parameter
    2.4.1 One-dimensional numerical optimisation via optim()
  2.5 Tractable ML for a vector parameter
  2.6 Intractable ML for a vector parameter
    2.6.1 Example: F-distribution with both df unknown
    2.6.2 Multi-dimensional numerical optimisation via optim()
  2.7 Estimating functions of the parameter
  2.8 Uncertainty under repeated sampling
3 The Fisher information and parameter uncertainty
  3.1 Intuition
  3.2 Key definitions
  3.3 Asymptotic approximations and consequences I: Wald CIs
    3.3.1 The scalar case
    3.3.2 The two-dimensional case
  3.4 Asymptotic approximations and consequences II: the Wald and score tests
    3.4.1 Wald-based statistical tests for a single parameter
    3.4.2 Asymptotic distribution of the score
    3.4.3 The score test
    3.4.4 Using the information at θ0
    3.4.5 Vector null hypotheses
    3.4.6 One-sided hypothesis tests
  3.5 Numerical estimation of the observed information
  3.6 Orthogonality
  3.7 CIs for functions of the parameter
    3.7.1 CIs for a monotonic function of a single parameter
    3.7.2 CIs for a linear combination of parameters
4 The likelihood ratio
  4.1 Single parameter case
    4.1.1 Definition
    4.1.2 Asymptotic result: θ ≠ θ∗
    4.1.3 Asymptotic result: θ = θ∗
    4.1.4 The χ²_1 distribution
    4.1.5 Likelihood ratio test I
  4.2 Full multiparameter tests
    4.2.1 The χ²_p distribution
  4.3 LRT when there is a nuisance parameter
    4.3.1 Maximising the likelihood under H0
    4.3.2 The LRT for nested models
  4.4 More general nested models
  4.5 The deviance
5 The Linear Model
  5.1 Motivating example: Barley data
  5.2 From two simple ideas to the linear model
    5.2.1 Simple linear regression
    5.2.2 ANOVA
    5.2.3 The linear model by example
    5.2.4 Fitted values, residuals and model checking
    5.2.5 Predictive performance
    5.2.6 Interpretation I
  5.3 The linear model: formal description
    5.3.1 Design matrix
  5.4 Parameter estimates
    5.4.1 The likelihood and RSS
    5.4.2 MLE for β
    5.4.3 Standard errors for the β̂
    5.4.4 MLE for σ
  5.5 Predictions, residuals and leverage
    5.5.1 Predictions and the hat matrix
    5.5.2 Residuals and leverage
  5.6 Confounding and interactions
    5.6.1 Confounding
    5.6.2 Interactions
  5.7 Diagnostics and model improvements
  5.8 F-tests and t-tests
  5.9 Model choice
    5.9.1 Forward fitting
    5.9.2 Backward fitting
    5.9.3 Two-way Interactions
    5.9.4 Exhaustive search
  5.10 Model interpretation II
6 Generalised linear models and beyond
  6.1 Count data
    6.1.1 Defining an appropriate model
    6.1.2 Log likelihood
    6.1.3 Fitting the model in practice
    6.1.4 The offset
  6.2 Binary and Binomial data
    6.2.1 Defining an appropriate model
    6.2.2 Log likelihood
    6.2.3 Fitting the models in practice
  6.3 Single-parameter exponential family
    6.3.1 Expectation and variance
  6.4 The generalised linear model
    6.4.1 The canonical link
    6.4.2 Fisher information with canonical link
  6.5 Other models
    6.5.1 Gumbel data
    6.5.2 Covariates for σ in the linear model
  6.6 Model diagnostics
    6.6.1 Leverage
    6.6.2 Deviance
    6.6.3 Residuals
    6.6.4 Poisson regression
    6.6.5 Binomial regression
    6.6.6 Calibration plots and the Hosmer-Lemeshow test
  6.7 Steps in a model fit
  6.8 Further topics
7 Appendix
  7.1 Multivariate Normal distribution
  7.2 QQ plots
Chapter 1
Preliminaries
1.1 Overview of model-based inference
The basis of much statistical modelling is the assumption of a known stochastic
model for our data. This model will have unknown parameters which we wish to
make inference about.
Data y and model f(y|θ).

If the data are discrete then f is a probability mass function (pmf):

f(y|θ) = P(y|θ).

If the data are continuous then f is a probability density function (pdf), which can be defined through

∫_A f(y|θ) dy = P(y ∈ A|θ).

More generally, f might be a mixture of mass and density functions, for example.
Initially we will consider simple stochastic models, whereby each data point is
considered as a realisation of a random variable, and these random variables
are assumed to be mutually independent with a common probability distribution.
Such data are often referred to as independent and identically distributed, which
is often abbreviated to IID. The common distribution of our random variable will
come from an appropriate family of distributions, where the family depends on
one or more unknown parameters. Our aim will be to make inference about these
parameters.
7
8 CHAPTER 1. PRELIMINARIES
1.1.0.1 Example: IID Bernoulli data
We have data y = (0,1,1,1,0), where y_i = 1 indicates that person i suffers from back ache and y_i = 0 indicates that they do not. We believe that whether or not each Y_i took the value 0 or 1 did not depend on the values taken by any other Y_j (j ≠ i). Further, we believe that P(Y_i = 1) = θ, with the same θ for each i and 0 ≤ θ ≤ 1; thus, P(Y_i = 0) = 1 − θ. More succinctly this model can be written as

Y_i ∼ IID Bernoulli(θ), i = 1, . . . , n.

Then

f(y|θ) = P(Y = (0,1,1,1,0)) = (1 − θ) × θ × θ × θ × (1 − θ) = θ³(1 − θ)².
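As a quick sanity check, here is a minimal R sketch (not part of the original example; the candidate values 0.5 and 0.6 are chosen purely for illustration) that evaluates this likelihood directly from the data:

y <- c(0, 1, 1, 1, 0)
lik <- function(theta) prod(theta^y * (1 - theta)^(1 - y))  ## equals theta^3 * (1-theta)^2
lik(0.5)   ## 0.5^5 = 0.03125
lik(0.6)   ## 0.6^3 * 0.4^2 = 0.03456

Larger values of the likelihood indicate parameter values that are more compatible with the observed data, an idea developed in Chapter 2.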
Important: this module assumes an understanding of probability, including
independence and mass and density functions and cumulative distribution func-
tions (cdfs). It also assumes familiarity with a number of standard distributions:
their purposes, parameters, ranges, relationships, cdfs and mass/density func-
tions. These include: the Bernoulli, binomial, geometric, uniform, Poisson and
negative-binomial discrete distributions and the Gaussian (normal), uniform, beta,
exponential and gamma continuous distributions. If you are not comfortable with
this material then please look in the relevant parts of your Probability I and II
notes (MSci students) or the pre-reading MATH230 notes (MSc students). Except
for the Bernoulli (above) and the Gaussian (shortly), these distributions will not
be explained in any detail.
Later in this module we will expand the class of problems to include those where for each y_i additional information x_i is available. Elements of the vector x_i are often referred to as covariates or explanatory variables, and the idea is that some or all of them might affect the distribution of y_i. Let us revisit the IID Bernoulli example.
1.1.0.2 Example: A first logistic regression
Suppose that for each y_i from the IID Bernoulli example we also have information x_i, the height of person i. We might suspect that the probability that a person suffers from back ache increases (or decreases) with their height, and so we might suggest a model of the form:

Y_i ∼ Bernoulli( exp(θ0 + θ1 x_i) / (1 + exp(θ0 + θ1 x_i)) ), independently for i = 1, . . . , n,

for some θ0 ∈ (−∞,∞) and θ1 ∈ (−∞,∞). Clearly,

P(Y_i = 0 | θ, x_i) = 1 − P(Y_i = 1 | θ, x_i) = 1 / (1 + exp(θ0 + θ1 x_i)).

Then, given the parameters and the covariates, the probability of obtaining the data is:

P(Y = y | θ, x_1, . . . , x_n) = f(y | θ, x)
 = P(Y_1 = 0 | θ, x_1, . . . , x_5) × P(Y_2 = 1 | θ, x_1, . . . , x_5)
   × P(Y_3 = 1 | θ, x_1, . . . , x_5) × P(Y_4 = 1 | θ, x_1, . . . , x_5)
   × P(Y_5 = 0 | θ, x_1, . . . , x_5)
 = P(Y_1 = 0 | θ, x_1) × P(Y_2 = 1 | θ, x_2) × P(Y_3 = 1 | θ, x_3)
   × P(Y_4 = 1 | θ, x_4) × P(Y_5 = 0 | θ, x_5),

which is a messy function of θ and all the x_i that we will learn how to deal with.

Aside: why is the form for the probability in the above example sensible? (Consider P(Y_i = 0).)
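To make the calculation concrete, here is a minimal R sketch; the heights are invented for illustration (none are given in the notes) and θ = (θ0, θ1) is supplied by the user:

y <- c(0, 1, 1, 1, 0)
x <- c(1.65, 1.80, 1.85, 1.78, 1.70)     ## hypothetical heights in metres
lik <- function(theta) {
  p <- exp(theta[1] + theta[2] * x) / (1 + exp(theta[1] + theta[2] * x))
  prod(p^y * (1 - p)^(1 - y))            ## product of Bernoulli probabilities
}
lik(c(0, 0))     ## both parameters zero gives p = 0.5 for everyone: 0.5^5
lik(c(-10, 6))   ## a candidate pair in which the probability increases with height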
Given the data, y, the model, f, and unknown parameter vector, θ, there are
several tasks we may wish to perform:
1. Obtain a good estimate for θ. This is referred to as a point estimate, and we will denote it by θ̂.
2. Measure uncertainty in such an estimate. This will be performed through the
use of standard errors and confidence intervals.
3. Test hypotheses about the parameters. For example, for the IID Bernoulli
example, do the data provide evidence to contradict a hypothesis that
θ = 0.4?
4. Choose between competing models for the same data. For example, deciding
between the IID Bernoulli model for the back ache data and the model that
takes height into account: which explains the data better?
Our initial focus will be on 1-3, with model choice, 4, introduced through hypothesis
tests, 3, and of considerable importance in the second part of the module.
Whenever you come across an example specifying a particular model for a set
of data, it is good practice to ask yourself ‘Is the model appropriate?’, ‘What
assumptions are being made?’, ‘Could I think of a more realistic model?’. For
example, often data are assumed to be independent, when they are not (lack of
independence will be covered in Statistical Fundamentals II). Another common
assumption is that data are Gaussian, yet this is very rarely the case in reality.
10 CHAPTER 1. PRELIMINARIES
Even if you decide to make such simplifying assumptions, you should always think
about the knock-on effects for the inference that you make, and the conclusions
that you draw.
In this module you will learn techniques for tasks 1-4 based upon the likelihood.
This is not the only method for inference, but it is the most popular and there
are sound theoretical reasons for this which we will also discuss. In Statistical
Fundamentals II you will delve further into the theory of likelihood, tackle more
complex data models using likelihood inference and also be introduced to Bayesian
inference which, for very good reasons, is another popular paradigm.
Before we even introduce the likelihood and likelihood-based methods, we first
recap the ideas of point estimates and estimators, confidence intervals, hypothesis
tests and p-values through a simple running example. This example will also
provide the key intuitions behind much of the likelihood theory that we will see
and use, and we will point out the natural generalisations as we proceed.
1.2 Point estimates, confidence intervals and hypothesis tests
In this section we describe point estimates, confidence intervals, hypothesis tests
and p-values and exemplify them in the context of an IID Gaussian model for
data. This is one of the cases where all the quantities can be computed exactly
analytically. Likelihood theory allows approximations to all of these quantities
to be obtained for much more general data models and we will overview these
generalisations at the same time as describing the specifics for the running
example.
First we detail the Gaussian (also known as normal) model.

A random variable Y is said to have a Gaussian distribution, also known as a normal distribution, with an expectation of μ and a variance of σ² if it has a density function of

f(y|μ, σ) = (1/(σ√(2π))) exp( −(y − μ)²/(2σ²) ),  (−∞ < y < ∞).

This is denoted Y ∼ N(μ, σ²), and we have E[Y] = μ and Var[Y] = σ².
The density function when μ = 4 and σ = 2 appears below.
[Figure: the N(4, 2²) density, plotted as density against y.]
The Gaussian distribution has two key properties that we will use throughout this
section:
1. If Y1 and Y2 are independent, Y1 ∼ N(μ1, σ1²) and Y2 ∼ N(μ2, σ2²), then Y1 + Y2 ∼ N(μ1 + μ2, σ1² + σ2²).

2. If Y ∼ N(μ, σ²) and a ≠ 0 is a constant then aY ∼ N(aμ, a²σ²) and Y − a ∼ N(μ − a, σ²).
1.2.0.1 Running example: IID Gaussian data with known variance
Throughout this section we imagine that our data y = (y_1, . . . , y_n) arise from the model:

Y_i ∼ IID N(μ∗, σ²), i = 1, . . . , n.

We will suppose that μ∗ is unknown but σ² is known.
1.2.1 Point estimates and estimators
A natural estimate of the expectation, μ∗, is the average of the data,

μ̂ = ȳ_n = (1/n) ∑_{i=1}^n y_i.
Aside: for Gaussian data, this will also turn out to be the maximum likelihood
estimate; see later.
To examine the properties of the estimate we imagine how it might vary if we had collected a different set of data from the same model, and done this repeatedly. We would have a whole collection of ȳ_n values and could draw a histogram of these to see, for example, the amount of variability in ȳ_n. This is known as repeated sampling. It is a hypothetical process (i.e., we cannot actually do this; we simply imagine what would happen if we could) and is the source of all of the probability-based statistical statements that we will make in this module.
In the limit as the number of hypothetical replicate samples → ∞ the histogram would simply be the density of the random variable:

μ̂ ≡ μ̂(Y) = Ȳ_n = (1/n) ∑_{i=1}^n Y_i.
The quantity ȳ_n is an average of some numbers so is itself a number; it is an estimate for μ. The quantity Ȳ_n is an average of random variables so is itself a random variable; it is an estimator for μ. Notice that we use the same symbol, μ̂, for both the estimate and the estimator. This is, unfortunately, standard practice, and whether or not it refers to an estimate or an estimator must be inferred from the context. Sometimes, to be explicit, we write μ̂(y) and μ̂(Y), which distinguishes the two and makes it clear that the first is a function of the data and the second a function of the corresponding random variables.
To understand how much we can trust the specific estimate we obtained from our data we look at the properties of the estimator. From Gaussian property 1, ∑_{i=1}^n Y_i ∼ N(nμ∗, nσ²). Hence, from Gaussian property 2,

Ȳ_n ∼ N(μ∗, σ²/n).
We note two important properties:
Firstly, E[Ȳ_n] = μ∗; i.e., Ȳ_n is an unbiased estimator for μ∗.

Secondly, for any ε > 0, as n → ∞,

P(μ∗ − ε < Ȳ_n ≤ μ∗ + ε) → 1.

In words: pick any small band around the true value, μ∗. The probability that the estimator Ȳ_n is within that band (i.e., very close to the truth) increases to 1 as the amount of data increases to ∞. Such an estimator is termed consistent. To see why this happens here:

P(μ∗ − ε < Ȳ_n ≤ μ∗ + ε) = P(Ȳ_n ≤ μ∗ + ε) − P(Ȳ_n < μ∗ − ε)
 = Φ( (μ∗ + ε − μ∗)/(σ/√n) ) − Φ( (μ∗ − ε − μ∗)/(σ/√n) )
 = Φ(√n ε/σ) − Φ(−√n ε/σ),

where Φ(z) = ∫_{−∞}^z φ(t) dt, with φ the N(0,1) density, is the cumulative distribution function of the N(0,1) distribution; i.e., Φ(z) = P(Z ≤ z) when Z ∼ N(0,1). As n → ∞ the first term increases to 1 and the second decreases to 0.
[Figure: ȳ_n for 10 replicate data sets at each of n = 10, 100 and 1000, plotted against replicate number, with the true value μ marked by a vertical line.]
We visualise this (above) by fixing μ = 5 and σ = 1. We simulate 10 replicates of data y = (y_1, . . . , y_n) with n = 10 and for each we calculate ȳ_n (shown in blue); we then repeat the exercise with n = 100 (shown in red) and n = 1000 (green).
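A minimal R sketch of how such a picture could be produced follows; the colours and axis ranges are illustrative and this is not necessarily the exact code used for the figure above.

## Repeated sampling of the mean for n = 10, 100, 1000 with mu = 5, sigma = 1
mu <- 5; sigma <- 1
ns <- c(10, 100, 1000); cols <- c("blue", "red", "green")
plot(NULL, xlim=c(4.5, 5.5), ylim=c(0, 31), xlab="ybar", ylab="replicate")
abline(v=mu, lty=2)                        ## the true value
for (k in 1:3) {
  for (r in 1:10) {
    ybar <- mean(rnorm(ns[k], mu, sigma))  ## one replicate data set, then its mean
    points(ybar, (k - 1) * 10 + r, col=cols[k], pch=19)
  }
}

As n increases the points cluster ever more tightly around μ, which is exactly the consistency property derived above.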
More generally than the Gaussian setting, given independent data y_n = (y_1, . . . , y_n) and a parameter θ with a true value of θ∗, an estimate for θ∗ is any function of the data, θ̂ = θ̂_n(y_n). The corresponding estimator for θ∗ is the same function applied to the random variables: θ̂ = θ̂_n(Y_n).

The bias in the estimator is E[θ̂_n(Y_n)] − θ∗, and the estimator is consistent if for any ε > 0,

P(θ∗ − ε < θ̂_n(Y_n) ≤ θ∗ + ε) → 1 as n → ∞.
This module is concerned with a particular class of estimators called maximum
likelihood estimators. These are consistent and in the limit as n→∞ they are also
unbiased.
Aside: of the two properties discussed so far, consistency is generally considered to be more important than unbiasedness. For example, in the Gaussian setting μ̂ = μ̂(Y_n) = Y_1 is an unbiased estimator for μ∗, but it becomes no more accurate as the amount of data increases.
Consistency is the only formal limiting property (as n→∞) that we examine in
this section, so henceforth for simplicity of notation we drop the subscript n.
1.2.2 Confidence intervals
For the IID Gaussian data example, we have an estimate μ̂ = ȳ for a parameter μ with a true value of μ∗, and we know that μ̂ becomes increasingly accurate as more and more data are used (n → ∞), but how accurate is it for the particular n we actually have? It is natural to describe our uncertainty via an interval around our estimate, for example (ȳ − a, ȳ + a). It might seem natural to then describe the interval in terms of the probability that the true parameter is in it. However, just as when we investigated the accuracy of an estimate by considering the properties of the estimator under (hypothetical) repeated sampling, so we investigate the properties of intervals such as (ȳ − a, ȳ + a) under repeated sampling.

To be clear, the statement P(μ∗ ∈ (ȳ − a, ȳ + a)) = 0.9 (say) makes no sense as μ∗ is a fixed (if unknown) value and ȳ and a are simply numbers, so either μ∗ is in the interval or it is not.

Instead, we consider the random interval (Ȳ − a, Ȳ + a) and state the probability that it contains μ∗:

P( (Ȳ − a, Ȳ + a) ∋ μ∗ ) = p.
[Figure: twelve replicate intervals (ȳ − a, ȳ + a), one per row, with the true value μ∗ marked by a vertical line.]
The idea is illustrated above using 12 repeated samples and a = 0.5. We take μ∗ = 5, σ = 1 and n = 10 and simulate 12 values of ȳ_n, drawing a horizontal line to represent each interval of width 2a = 1.0 and a • for ȳ. Most of the intervals do contain μ∗, but one of them does not.
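The following R sketch mimics that experiment (it uses fresh simulated data, so the picture will not match the figure above exactly):

## 12 replicate intervals (ybar - a, ybar + a) with a = 0.5, mu* = 5, sigma = 1, n = 10
mustar <- 5; sigma <- 1; n <- 10; a <- 0.5
plot(NULL, xlim=c(4, 6.5), ylim=c(1, 12), xlab="interval", ylab="replicate")
abline(v=mustar, lty=2)                 ## the true value
for (r in 1:12) {
  ybar <- mean(rnorm(n, mustar, sigma))
  segments(ybar - a, r, ybar + a, r)    ## the interval for this replicate
  points(ybar, r, pch=19)               ## its centre, ybar
}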
More generally we have an estimate, θ̂ = θ̂(y), for a parameter θ with a true value of θ∗.

We consider the random interval (C_lo(Y), C_hi(Y)), which could be (θ̂(Y) − a, θ̂(Y) + a) for some a > 0 or could be more complex. This is a 100(1 − α)% confidence interval if the probability that it contains the truth is 1 − α:

P( (C_lo(Y), C_hi(Y)) ∋ θ∗ ) = 1 − α.
For the Gaussian example we wish to evaluate

P( (Ȳ − a, Ȳ + a) ∋ μ∗ ).

We define the standard error of the estimator μ̂(Y) = Ȳ to be

SE(μ̂) ≡ √Var[μ̂(Y)] = √Var[Ȳ] = σ/√n,

and, to simplify notation in calculations, we abbreviate this to s.

Now μ∗ ∈ (Ȳ − a, Ȳ + a) is equivalent to μ∗ > Ȳ − a and μ∗ < Ȳ + a, or

Ȳ < μ∗ + a and Ȳ > μ∗ − a.
This is illustrated in the following diagram.
[Figure: the density of Ȳ_n, with μ − a, μ and μ + a marked on the horizontal axis.]
Thus

P( (Ȳ − a, Ȳ + a) ∋ μ∗ ) = P( μ∗ − a < Ȳ < μ∗ + a )
 = Φ( (μ∗ + a − μ∗)/s ) − Φ( (μ∗ − a − μ∗)/s )
 = Φ(a/s) − Φ(−a/s) = 1 − 2Φ(−a/s),

by symmetry. Hence, to choose a such that P( (Ȳ_n − a, Ȳ_n + a) ∋ μ∗ ) = 1 − α, for example, we must set

1 − 2Φ(−a/s) = 1 − α
⇐⇒ 2Φ(−a/s) = α
⇐⇒ −a/s = Φ⁻¹(α/2)
⇐⇒ a = −s Φ⁻¹(α/2).

If, for example, α = 0.05, this gives a ≈ 1.96 s = 1.96 σ/√n. So,

P( (Ȳ − 1.96 σ/√n, Ȳ + 1.96 σ/√n) ∋ μ∗ ) ≈ 0.95.

When we know σ and we have data y_1, . . . , y_n, we therefore create a 95% confidence interval for μ as

( ȳ_n − 1.96 σ/√n, ȳ_n + 1.96 σ/√n ) ≡ ( μ̂(y_n) − 1.96 σ/√n, μ̂(y_n) + 1.96 σ/√n ).
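In R, a confidence interval of this form can be computed directly; the data below are simulated purely so that the sketch runs:

## 95% CI for mu with known sigma
set.seed(1)
sigma <- 1; n <- 10
ys <- rnorm(n, mean=5, sd=sigma)      ## illustrative data
ybar <- mean(ys)
a <- -sigma/sqrt(n) * qnorm(0.025)    ## a = -s * Phi^{-1}(alpha/2); qnorm(0.025) is Phi^{-1}(0.025)
c(ybar - a, ybar + a)                 ## the 95% confidence interval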
In the case of IID Gaussian data, the standard deviation of the parameter estimator is s = σ/√n. However, in more general cases, the standard deviation of the estimator is not the standard deviation of the distribution divided by √n.
More generally we will find that, just as μ̂(Y) ≡ Ȳ ∼ N(μ, σ²/n), for a certain class of estimators (maximum likelihood estimators) and a scalar parameter:

θ̂(Y) ∼ N(θ∗, s²) (approximately),

where s is the standard deviation of the random variable θ̂(Y), also known as the standard error of the estimator θ̂, and is the generalisation of σ/√n. See later.
1.2.3 Hypothesis tests
In the Gaussian running example, a scientist may conjecture a particular value for
μ, μ0, say. They may wish to know whether the data fit with this value. Denoting
the true value of μ by μ∗, they wish to know how the data fit with the conjecture
that μ∗ = μ0. This is set up through a pair of hypotheses.
• The null hypothesis, often denoted H0, is that μ∗ = μ0.
• The alternative hypothesis, often denoted H1, is that μ∗ ≠ μ0.
Sometimes the direction of a discrepancy is important and in such cases the
null hypothesis can be H0 : μ∗ ≤ μ0, with the alternative H1 : μ∗ > μ0 (or the
other way round: H0 : μ∗ ≥ μ0 and H1 : μ∗ < μ0). These are called one-sided
hypotheses; see later.
More generally, let the true value of the parameter be θ∗. A scientist may wish to decide whether or not θ∗ = θ0, where θ0 is some specific hypothesised value. Or, for a specific scalar component θ_j of θ, they may wish to decide whether θ_j∗ = θ_{j,0}. An analogous pair of hypotheses to the Gaussian running example is set up. For the scalar case these can be one-sided just as in the Gaussian example.
The hypotheses do not have equal standing. Typically the null hypothesis provides
a default value and we test to see whether or not the data contradict this default
value. There are two possible outcomes to a test: either we reject the null
hypothesis or we fail to reject the null hypothesis. We never accept the null
hypothesis, for reasons that will be described shortly.
A natural entry into the hypothesis test is via the confidence interval: since

P( (θ̂(Y) − 1.96 s, θ̂(Y) + 1.96 s) ∋ θ∗ ) ≈ 0.95,

given data y we check whether or not θ0 ∈ (θ̂(y) − 1.96 s, θ̂(y) + 1.96 s): if it is, then the data do not seem to contradict the hypothesis that θ∗ = θ0 and we would fail to reject H0; on the other hand, if θ0 ∉ (θ̂(y) − 1.96 s, θ̂(y) + 1.96 s) then the data do not seem to fit with H0 and we would reject the hypothesis.
Some mathematical proofs work by contradiction. One makes an assumption
that is believed to be false and shows that if the assumption is true it leads to
a contradiction, hence the assumption must be false. We would like something
similar to show that the null hypothesis is false. We start by assuming that it is
true; however, we cannot usually prove that it is false by obtaining a contradiction; we can only show that the null hypothesis does not fit well with the data, which
makes us inclined to reject it.
To perform a hypothesis test we create a scalar test statistic, T(y), formulated so that the larger |T(y)| is, the more the data disagree with the null hypothesis.
In the Gaussian running example we would choose

T(y) = (μ̂(y) − μ0)/(σ/√n) ≡ (ȳ − μ0)/(σ/√n).

We then consider the distribution of the statistic T(Y) which, by Gaussian property 2, is:

T(Y) ∼ N( (μ∗ − μ0)/(σ/√n), 1 ).
If H0 is true, then μ0 = μ∗ so
T(Y) ∼ N (0,1) .
More generally, since θ̂(Y) ∼ N(θ∗, s²) approximately,

T(Y) ≡ (θ̂(Y) − θ0)/s ∼ N( (θ∗ − θ0)/s, 1 ),

approximately, so if H0 is true, T(Y) ∼ N(0,1) (approximately).
Either in the Gaussian running example, or more generally, therefore, if we were to reject H0 if |T(y)| ≥ c then

α ≡ P(reject H0 | H0 is true) = P( |T(Y)| ≥ c | H0 ) = 2Φ(−c),

defining α, the significance level of the test. Typically we fix α at some small value such as 0.05 and find the corresponding c, here

c = −Φ⁻¹(α/2) ≈ 1.96,

showing how this test corresponds exactly to checking whether or not μ0 (or θ0) lies within the 100(1 − α)% confidence interval.
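The critical value c is easily obtained in R; this short sketch simply illustrates the relationship c = −Φ⁻¹(α/2):

alpha <- 0.05
c_crit <- -qnorm(alpha/2)   ## -Phi^{-1}(alpha/2)
c_crit                      ## approximately 1.96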
It is possible to prove that the probability of rejecting H0 if H0 is false is:

P( |T(Y)| > c | H1 ) = Φ( −( c + (μ0 − μ∗)/(σ/√n) ) ) + Φ( −( c − (μ0 − μ∗)/(σ/√n) ) ) > 2Φ(−c),
(student exercise), which is desirable for a sensible test. It is also easy to see this
intuitively. On the graph below we have reproduced the CIs from the previous
experiment, but added a line showing an incorrect hypothesised value of μ0 = 5.5.
[Figure: the twelve replicate intervals again, with vertical lines at the true value μ∗ and at the hypothesised value μ0 = 5.5.]
Four of the twelve CIs do not include this incorrect μ0, whereas only one does not include the true value, μ∗. If we had hypothesised μ0 = 6, only one CI would have included μ0: on nearly every occasion we would have rejected H0, which is desirable behaviour.
If, however, we had hypothesised μ0 = 4.8, then the outcome of each of the
hypothesis tests would have been the same as if we had hypothesised the true
value. This illustrates why we fail to reject the null hypothesis if the data do not
contradict H0; we never accept the null hypothesis.
The power of a test is the probability of rejecting H0 given that H0 is false. Clearly power depends on the true value μ∗ as well as on μ0; it also depends on the amount of data. In our Gaussian running example, the standard error, and hence the width of any 100(1 − α)% confidence interval, decreases as the number of data values, n, increases. Thus, for sufficiently large n we might expect to be able to reject H0 : μ0 = 4.8, for example.
Finally, we can do more than simply reject or fail to reject the null hypothesis. The p-value depends on the data and is the probability, under the null hypothesis, of seeing a value of the test statistic as extreme as, or more extreme than, the one actually observed:

p = P( |T(Y)| ≥ |T(y)| | H0 ) = P( |N(0,1)| ≥ |T(y)| ) = 2Φ(−|T(y)|).

If the p-value is at least as small as the required significance level of the test then we reject H0; however, the magnitude of p provides additional information.
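For the Gaussian running example, the test statistic and p-value each take one line in R; the data here are simulated purely to make the sketch runnable:

## Test of H0: mu = mu0 with known sigma
set.seed(2)
sigma <- 1; n <- 10; mu0 <- 5
ys <- rnorm(n, mean=5.3, sd=sigma)        ## illustrative data
Tobs <- (mean(ys) - mu0) / (sigma/sqrt(n))
pval <- 2 * pnorm(-abs(Tobs))             ## 2 * Phi(-|T(y)|)
c(Tobs, pval)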
Chapter 2
The likelihood function and its maximum
Our approach to performing inference – that is obtaining point estimates, obtaining
confidence intervals and performing hypothesis tests – will be to use something
called the likelihood function. Whilst there are many other ways of performing
inference, likelihood-based methods are arguably the most important and most
commonly used. We will consider reasons for this later. Our first focus will be on
defining the likelihood function, and how we can use this function to obtain point
estimates. We will define the necessary function for a vector parameter θ, then
examine the practice of estimation, first when there is a single unknown scalar
parameter θ, and then when there is a vector parameter with two components
θ = (θ1, θ2). Higher-dimensional parameter vectors will be covered in Statistical
Fundamentals II.
2.1 The likelihood and log-likelihood
The likelihood function, L(θ), or L(θ;y), is exactly f(y|θ). For discrete data this is
the probability of obtaining the data given parameters θ; for continuous data it is
the probability density function for the data (see later for an heuristic as to why
this is sensible). We use a different symbol, L, because we are viewing it primarily
as a function of θ rather than of y.
For IID data, as in the IID Bernoulli example and the IID Gaussian with known
variance example, we have the following definition:
Assume y = (y_1, . . . , y_n) are IID random variables with individual probability mass or density function f(y|θ). The likelihood of data y is

L(θ) ≡ L(θ;y) = ∏_{i=1}^n f(y_i|θ).

For independent but non-identically distributed data (as in the First logistic regression example), where Y_i has a density or mass function of f_i, we obtain:

L(θ) ≡ L(θ;y) = ∏_{i=1}^n f_i(y_i|θ).
More complicated data models where the data are not independent will be covered
in Statistical Fundamentals II.
The log-likelihood function, which will turn out to be more useful in practice, is:

ℓ(θ) ≡ ℓ(θ;y) ≡ log L(θ;y) ≡ log L(θ).

For IID data we obtain:

ℓ(θ) = ∑_{i=1}^n log f(y_i|θ),

whilst for independent but non-identically distributed data we have:

ℓ(θ) = ∑_{i=1}^n log f_i(y_i|θ).
Since taking the logarithm is a 1-1 increasing transformation, the values of θ
with relatively large likelihood values will also have relatively large log-likelihood
values. In practice algebraic manipulation of the log-likelihood function is simpler
than manipulating the likelihood function, which is one reason why we tend to
work with the ℓ(θ).
When using the likelihood function we will be interested in the relative likelihood
values for different parameter values. Thus, as we will see below, we only need
the likelihood function up to a constant of proportionality (the constant can depend
on the data but not on the parameter). Equivalently we need the log-likelihood
function only up to an additive constant.
The likelihood (and log-likelihood) is a function of the data, but often we drop the
explicit dependence on y and use L(θ) as the data do not change. However, as in
the previous chapter when we are interested in the properties of the likelihood
under repeated sampling we will explicity acknowledge the data, or other
possible realisations by using L(θ;y) and L(θ;Y), the latter highlighting that the
likelihood is now a function of a vector random variable and so is itself a random
function.
2.2 Maximum likelihood estimate and estimator
By definition, L(θ) or ℓ(θ) give measures of how appropriate θ would be as the
true parameter value for the population, since, up to proportionality, L(θ) is the
probability (or probability density) of obtaining the observed data if θ had been
the true parameter. Thus, the larger the likelihood (or log-likelihood) the greater
the probability of having obtained the observed data. Consequently, values of
θ with large (log)likelihood are more compatible with the data than values with
low (log)likelihood. This leads to the principle of maximum likelihood estimation,
whereby we estimate θ by the maximum likelihood estimate (MLE), defined to be the value θ̂ of θ that maximises L(θ). Since L(θ) ≡ L(θ;y), the maximum likelihood estimate is a function of the data: θ̂ ≡ θ̂(y). Since the likelihood and log-likelihood are maximised at the same point:

θ̂ ≡ θ̂(y) = argmax_θ L(θ;y) = argmax_θ ℓ(θ;y).

If we replace the data y with a random variable, Y, drawn from the same distribution from which the data came, we can define the maximum likelihood estimator:

θ̂ ≡ θ̂(Y) = argmax_θ L(θ;Y) = argmax_θ ℓ(θ;Y).

In practice we will write θ̂ rather than θ̂(y) or θ̂(Y). We will use MLE as shorthand for both the maximum likelihood estimate and the maximum likelihood estimator.
To calculate the MLE we must maximise the likelihood. In some simple examples
we can do this via straightforward calculations using pen and paper; in other
examples there is no tractable solution and we must maximise the likelihood
numerically. In the tractable examples it is normally easier to maximise ℓ(θ) than
L(θ), and for the non-tractable examples it is usually computationally safer to
maximise ℓ(θ) rather than L(θ). Hence we consider ℓ(θ) throughout.
2.3 Tractable ML with a scalar parameter
Perhaps the most natural mechanism for finding the maximum of ℓ(θ) is to solve
ℓ′(θ) = 0, where ℓ′(θ) is the first derivative of the log-likelihood with respect to
the parameter. Solving ℓ′(θ) = 0 just finds a turning-point of the log-likelihood
surface. If the log-likelihood function is everywhere differentiable and there is a
single turning-point then this will generally be the maximum of the function. Thus
the value of θ that solves ℓ′(θ) = 0 will be the MLE.
If we have multiple solutions to ℓ′(θ) = 0 then we will need to evaluate ℓ(θ) for
each solution to find the MLE (the MLE will be the solution which gives the largest
value of ℓ(θ)). Further, if the function is not everywhere differentiable, then solving
ℓ′(θ) = 0 will not necessarily give the MLE. These different possibilities are shown
in the figure below.
[Figure: three example log-likelihood curves plotted against θ; see the description below.]
The left-hand plot shows a log-likelihood function with a single local maximum.
Finding the MLE can be achieved through solving ℓ′(θ) = 0. The middle plot shows
a function with multiple local maxima. Finding the MLE involves finding solutions
to ℓ′(θ) = 0, and evaluating ℓ(θ) for each one to find which gives the largest value.
The right-hand plot shows a log-likelihood function which is discontinuous. Solving
ℓ′(θ) = 0 does not give the MLE in this case (see the uniform example later). The
MLE is shown by the red vertical line in each case.
2.3.1 Discrete cases
2.3.1.1 Example: Binomial likelihood
Consider y = (y1, . . . , yn) arising as IID draws from a Binomial(m,θ) distribution.
Assuming m is known, find the likelihood function and the MLE.
We first state the parameter space, θ ∈ Ω with Ω = [0,1].
Now, using the fact that we only need the likelihood function up to a constant of proportionality,

L(θ) ∝ ∏_{i=1}^n θ^{y_i} (1 − θ)^{m − y_i}.

Taking logs gives

ℓ(θ) = ∑_{i=1}^n y_i log(θ) + ∑_{i=1}^n (m − y_i) log(1 − θ) + const.

To find the MLE, we differentiate. To simplify notation write s = ∑_{i=1}^n y_i, and note that ∑_{i=1}^n (m − y_i) = mn − s. Then

ℓ′(θ) = s/θ − (mn − s)/(1 − θ).

Setting this derivative equal to 0 we obtain

0 = s/θ̂ − (mn − s)/(1 − θ̂)
⇒ 0 = s(1 − θ̂) − (mn − s)θ̂.

Thus θ̂ = s/(mn).

This is the overall proportion of “successes”.
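As a check, a brief R sketch with made-up Binomial(10, θ) counts (these data are purely illustrative) compares the closed-form MLE with a grid search over θ:

ys <- c(3, 5, 2, 4, 4); m <- 10; n <- length(ys)   ## hypothetical data
thetahat <- sum(ys)/(m*n)                          ## closed-form MLE: 18/50 = 0.36
thetas <- seq(0.01, 0.99, len=1000)
lls <- sapply(thetas, function(th) sum(dbinom(ys, m, th, log=TRUE)))
c(thetahat, thetas[which.max(lls)])                ## the grid maximiser agrees to grid accuracy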
2.3.1.2 Example: IID Poisson data
For y = (y_1, . . . , y_n) IID draws from a Poisson(θ) distribution, obtain the parameter space, find the likelihood function and find the MLE.

The parameter is θ ∈ Ω with parameter space Ω = (0,∞).
L(θ) = ∏_{i=1}^n e^{−θ} θ^{y_i} / y_i! ∝ ∏_{i=1}^n e^{−θ} θ^{y_i},

and

ℓ(θ) = ∑_{i=1}^n log( e^{−θ} θ^{y_i} ) + const
 = ∑_{i=1}^n { log(e^{−θ}) + log(θ^{y_i}) } + const
 = ∑_{i=1}^n { −θ + y_i log(θ) } + const
 = −nθ + nȳ log(θ) + const.

Thus

ℓ′(θ) = −n + nȳ/θ.

Setting ℓ′(θ̂) = −n + nȳ/θ̂ = 0 gives θ̂ = ȳ.
2.3.2 Tractable continuous cases
We start with an heuristic for why the density function of the data is the right
quantity to maximise when the data are continuous. Recall that when a random
variable is continuous, the probability of obtaining any given value is 0. Instead
we consider subsets of the region in which the random variable is defined; the
integral of the density over this set gives the probability that the random variable
is in this set (see earlier).
For a scalar random variable, Y, this becomes

P(a < Y < b) = ∫_a^b f(y|θ) dy,

for any a < b.
In practice we can only record data to some precision, so that observing data, y_1 say, is really observing the event that Y_1 ∈ (y_1 − ε/2, y_1 + ε/2). The probability of this is

P(y_1 − ε/2 < Y_1 < y_1 + ε/2) = ∫_{y_1−ε/2}^{y_1+ε/2} f(y|θ) dy ≈ ε f(y_1|θ).
Now, the definition of the likelihood is only meaningful up to multiplicative con-
stants, so we can ignore the factor of ε. Thus, as stated earlier the likelihood is
defined to be the density.
We use this likelihood function for inference on continuous data in an identical
way to that in which we used the likelihood function for discrete data.
2.3.2.1 Example: IID exponential data
An engineer is interested in understanding the failure time of a particular engine
component. They undertake an experiment where 10 independent copies of a
component are tested separately and their lifetime (in hours) measured, leading
to the following data:
90 255 40 143 30 239 484 28 39 15
What conclusions can be made about the expected failure time?
For lifetime data of this sort a common model is the exponential distribution. This distribution arises as the distribution of the time between events in a Poisson process, and thus would be appropriate if the rate of failure is constant over time.

This distribution has a single parameter θ, and pdf

f(y|θ) = θ e^{−θy}  (y > 0).

Note that exponential random variables are always positive, which is appropriate as we need to model failure times. The expectation of an Exponential(θ) random variable is 1/θ.

Thus, we will assume that our data y_1, . . . , y_n are realisations of independent random variables Y_1, . . . , Y_n, each having the Exponential(θ) distribution. So, for each of the Y_i,

f(y_i|θ) = θ e^{−θ y_i}  (y_i > 0),

with θ ∈ Ω = (0,∞). Hence:

L(θ) = ∏_{i=1}^n θ e^{−θ y_i} = θ^n e^{−θ ∑_{i=1}^n y_i}  (θ > 0),

and

ℓ(θ) = n log θ − θ ∑_{i=1}^n y_i = n log θ − θ n ȳ.

So

ℓ′(θ) = n/θ − n ȳ,

and for a stationary point ℓ′(θ̂) = 0. Hence,

θ̂ = 1/ȳ.

Thus, we obtain θ̂ = 1/ȳ as the maximum likelihood estimate of θ; for our data this leads to θ̂ = 0.00734.
It can often be helpful to plot the log-likelihood. In tractable cases such as this
one, the plot can act as a check that your algebraic manipulations are correct.
ys=c(90, 255, 40, 143, 30, 239, 484, 28, 39, 15)
n=length(ys)
thetahat=1/mean(ys) ##0.00734
thetas=seq(thetahat/3,2*thetahat,len=200)
lls=n*log(thetas)-thetas*sum(ys)
plot(thetas,lls,type="l",lwd=2,xlab=expression(theta),
ylab="log-likelihood")
abline(v=thetahat,col=2,lty=2,lwd=2)
[Figure: the exponential log-likelihood plotted against θ, with the MLE marked by a red dashed vertical line.]
One can even estimate the MLE from the grid (though we will see a better method
shortly).
maxlik=max(lls)
thetahat=thetas[lls==maxlik]
thetahat
## [1] 0.007361336
The third significant figure is incorrect, though with a fine enough grid the estimate
can be made as accurate as desired.
2.3.2.2 Gaussian data
The most common models for continuous data are based on the normal (or Gaus-
sian) distribution (see earlier). In part the importance of this distribution stems
from the central limit theorem that states that, under weak conditions, the sum of
independent contributions to a data value is approximately normally distributed.
In many applications, data are indeed approximately normally distributed.
One consequence of using the normal distribution is that outliers are rare (for example, you would need samples of size 10,000 or more to expect to see an observation that is 4 standard deviations away from the expectation).
2.3.2.3 Wheat yield
The wheat yield, Y_i, from field i is believed to be normally distributed with expectation θ x_i, where x_i is the known quantity of fertiliser spread on the field. Assuming that the yields in different fields are independent, and that the variance is some fixed, known value, σ², so that

Y_i ∼ N(θ x_i, σ²) for i = 1, . . . , n,

find θ̂.

L(θ) = ∏_{i=1}^n (1/(σ√(2π))) exp{ −(y_i − θ x_i)²/(2σ²) },

ℓ(θ) = const − (1/(2σ²)) ∑_{i=1}^n (y_i − θ x_i)²,

ℓ′(θ) = (1/σ²) ∑_{i=1}^n x_i (y_i − θ x_i).

Now setting ℓ′ to be zero, we obtain

θ̂ = ∑_{i=1}^n x_i y_i / ∑_{i=1}^n x_i².

Aside: in the particular case where each x_i = 1, the model is Y_i ∼ N(θ, σ²) and we obtain θ̂ = ȳ as stated earlier.
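A short R check on made-up data (both the fertiliser quantities and the yields below are hypothetical) confirms that the closed-form MLE matches a least-squares fit through the origin:

x <- c(1.0, 1.5, 2.0, 2.5, 3.0)        ## hypothetical fertiliser quantities
y <- c(2.1, 3.2, 3.9, 5.3, 6.0)        ## hypothetical yields
thetahat <- sum(x*y)/sum(x^2)          ## closed-form MLE
thetahat
coef(lm(y ~ x - 1))                    ## no-intercept least squares gives the same value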
2.3.2.4 Uniform data
In many examples that you might see elsewhere on this programme, a different
symbol might be used for the data.
Consider X = (X_1, . . . , X_n) IID from the Uniform distribution on [0, θ], with pdf

f(x|θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise.

Find the likelihood function and the MLE for these data.
The key to this question is that the range of the random variable depends on
θ. Thus we need to keep track of this constraint within the likelihood function.
We will assume that all the data are non-negative, else it would not have been sensible to fit a Uniform[0, θ] distribution in the first place.

The likelihood function is

L(θ) = 1/θ^n if x_1, . . . , x_n ≤ θ, and 0 otherwise.

Let x_(n) = max_i x_i; then x_1, . . . , x_n ≤ θ is equivalent to x_(n) ≤ θ. Thus we get that the log-likelihood function is

ℓ(θ) = −n log(θ) if x_(n) ≤ θ, and −∞ otherwise.

• To maximise this we need to find the θ that maximises −n log(θ) subject to θ ≥ x_(n).
• Since −n log(θ) is a decreasing function of θ, the maximum occurs at the smallest possible value for θ.
• Thus θ̂ = x_(n).
The shape of the log-likelihood curve is similar to that shown in the right-hand
plot of the earlier Figure, except the log-likelihood is −∞ for values of θ less than
the MLE.
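Because the maximum here is at a boundary rather than at a stationary point, a numerical check is particularly easy; the data below are hypothetical:

xs <- c(2.3, 0.7, 4.1, 3.3, 1.8)               ## hypothetical Uniform[0, theta] data
thetahat <- max(xs)                            ## the MLE, x_(n)
thetas <- seq(thetahat, 2*thetahat, len=200)   ## log-likelihood is -Inf below thetahat
lls <- -length(xs) * log(thetas)
plot(thetas, lls, type="l", xlab=expression(theta), ylab="log-likelihood")
abline(v=thetahat, col=2, lty=2)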
2.4 Intractable ML with a scalar parameter
When the maximum likelihood estimate is intractable we must use numerical
methods to find the MLE. The simplest idea, which only works well in the case of
a scalar parameter is to evaluate the log-likelihood over a find grid of points and
find the parameter value that gives the largest value.
2.4.0.1 Example: gamma data with unknown shape parameter
Data y = (0.733, 2.960, 1.460, 0.701, 1.912, 0.902) arise from IID sampling from a Gamma(α, 1) distribution. Find the MLE for α.
The density for a Gamma(α, 1) random variable y is

f(y) = (1/Γ(α)) y^{α−1} exp[−y]  (y > 0),

where Γ is the gamma function. So the likelihood for data y_1, . . . , y_n is

L(α) = (1/Γ(α)^n) ∏_{i=1}^n y_i^{α−1} exp[ −∑_{i=1}^n y_i ] ∝ ( ∏_{i=1}^n y_i )^α / Γ(α)^n.

The log-likelihood is, therefore,

ℓ(α) = const + α ∑_{i=1}^n log y_i − n log Γ(α).

We cannot maximise this analytically since the derivative of the gamma function is as intractable as the gamma function itself. Instead, we plot the log-likelihood and find the maximum of the plot. Since the expectation of a Gamma(α, β) distribution is α/β, a reasonable “ball-park” estimate of α is the mean of the data values.
ys=c(0.733,2.960,1.460,0.701,1.912,0.902)
n=length(ys)
alphaGuess=mean(ys)
alphas=seq(alphaGuess/4,3*alphaGuess,len=100)
lls=alphas*sum(log(ys)) - n*lgamma(alphas) ## lgamma=log of gamma fn
llmax=max(lls)
alphahat=alphas[lls==llmax]
plot(alphas,lls,type="l",lwd=2,xlab=expression(alpha),
ylab="log-likelihood")
abline(v=alphaGuess,col=4,lty=2,lwd=2)
abline(v=alphahat,col=2,lty=2,lwd=2)
[Figure: the Gamma log-likelihood plotted against α, with the data mean (blue dashed line) and the grid-based MLE (red dashed line) marked.]
The data mean (blue, dashed line) is approximately 1.445. From the values tried
in the plot, the MLE (red) is approximately 1.726.
If, instead of 100, 10000 values are used then we obtain α̂ ≈ 1.719. Whilst these
extra evaluations are not required to produce a smooth-looking plot, they are
necessary to find the MLE to the desired accuracy.
An alternative way to evaluate the log-likelihood would be to use the built-in R
density function, dgamma:
alphas=seq(alphaGuess/4,3*alphaGuess,len=100)
lls=rep(0,100) ## empty container to fill
for (i in 1:100) {
alpha=alphas[i]
lls[i]=sum(dgamma(ys,alpha,rate=1,log=TRUE))
}
llmax=max(lls)
alphahat=alphas[lls==llmax]
plot(alphas,lls,type="l",lwd=2,xlab=expression(alpha),
ylab="log-likelihood")
abline(v=alphaGuess,col=4,lty=2,lwd=2)
abline(v=alphahat,col=2,lty=2,lwd=2)
[Figure: the log-likelihood evaluated via dgamma(), plotted against α; note the different vertical scale.]
Notice the different scale on the y-axis (why is this?), but that it makes no difference to the MLE value, α̂.
For one-dimensional (and two-dimensional) problems one should always plot the
log-likelihood function. As we have just seen, the MLE can then be obtained
by zooming in on the maximum using a sufficiently fine grid. A more efficient
mechanism in intractable cases is numerical maximisation.
2.4.1 One-dimensional numerical optimisation via optim()
In R this is most easily accomplished via the function optim(). We will treat
optim() as a “black box”; its help page provides references for the numerical
methods that it employs. The most generic way to use optim() starts with the
creation of a function that evaluates the log-likelihood.
llUnivariateGamma<-function(alpha,ydata) {
ll=sum(dgamma(ydata,alpha,rate=1,log=TRUE))
return(ll)
}
The first argument to the log-likelihood function must always be the unknown
parameter (or, more generally, parameter vector) over which the log-likelihood is
to be maximised. Other arguments can include data and any fixed parameters.
Now we describe the use of optim().
• The first argument of optim() is always an initial value for the parameter (or
parameter vector).
• The second argument is always the function to be optimised.
• By default, optim() minimises a function. We wish to maximise the log-
likelihood, so throughout this module we will use the control argument to
(effectively) tell optim() to maximise.
• For one-dimensional problems only, optim() must use a special method, Brent, and requires a lower bound and an upper bound on where the maximum could occur. It is straightforward to obtain these visually from the log-likelihood plot.
• Other arguments that are needed for the log-likelihood function can be
passed through optim() by naming them.
theta0=2 ## initial value for the optimiser
lo=1; hi=3 ## bounds for the optimiser
ctl=list(fnscale=-1) ## tells optim() to maximise rather than minimise
out=optim(theta0,llUnivariateGamma,control=ctl,method="Brent",
lower=lo,upper=hi,ydata=ys)
print(out$par); print(out$val); print(out$convergence)
## [1] 1.719365
## [1] -7.152326
## [1] 0
Compare the numerical estimates for α̂ and ℓ(α̂) with those from the graph and grid.

That out$convergence is 0 indicates that the algorithm has converged: it has found a maximum. If out$convergence is non-zero then the algorithm has not converged and you should investigate why.
2.5 Tractable ML for a vector parameter
When there are two unknown parameters it is easiest to visualise the log-likelihood
as a function of these parameters via a contour plot:
[Figure: a contour plot of a log-likelihood over (θ1, θ2); the contours peak near (2.3, 7).]
In this case, simply from the plot, we can see that the maximum value is around
(2.3,7). In this example, also, both parameter components must be positive; this
appears in the plot as the steep decline in the log likelihood as either parameter
approaches 0. We will revisit this example when we cover intractable inference
for two parameter models.
Issues such as multimodality and discontinuities or lack of differentiability can
occur in problems with vector parameters, just as they can for scalar parameters;
however, you will not see any such examples in Statistical Fundamentals I.
In tractable cases we simply differentiate the log-likelihood with respect to each
of the parameters in turn, setting each partial derivative to zero:
∂ℓ/∂θ1 |_{θ=θ̂} = ∂ℓ/∂θ2 |_{θ=θ̂} = 0.
2.5.0.1 Example: IID Gaussian data with both parameters unknown
Let Y_i ∼ N(μ, σ²) be IID, so that θ = (μ, σ) with parameter space Ω = (−∞,∞) × (0,∞). Obtain the likelihood function for θ and the MLE θ̂.
The likelihood function is

L(θ) = ∏_{i=1}^n (1/((2π)^{1/2} σ)) exp( −(y_i − μ)²/(2σ²) ),
and so

ℓ(θ) = ∑_{i=1}^n ( −(1/2) log(2π) − log σ − (y_i − μ)²/(2σ²) )
 = −(n/2) log(2π) − n log σ − (1/(2σ²)) ∑_{i=1}^n (y_i − μ)².

To calculate the MLEs, first differentiate with respect to μ and σ:

∂ℓ(θ)/∂μ = (1/σ²) ∑_{i=1}^n (y_i − μ),
∂ℓ(θ)/∂σ = −n/σ + (1/σ³) ∑_{i=1}^n (y_i − μ)².

For a maximum we need to obtain θ = θ̂ such that

∂ℓ(θ̂)/∂μ = 0 and ∂ℓ(θ̂)/∂σ = 0.

Thus,

(1/σ̂²) ∑_{i=1}^n (y_i − μ̂) = 0, and so μ̂ = ȳ,

and

−n/σ̂ + (1/σ̂³) ∑_{i=1}^n (y_i − μ̂)² = 0, giving σ̂ = ( (1/n) ∑_{i=1}^n (y_i − ȳ)² )^{1/2}.

We can see that μ̂ is the sample mean and σ̂ is the sample standard deviation (with divisor n rather than n − 1). This seems reasonable given that μ and σ are the population mean and standard deviation respectively.
As with single-parameter likelihoods, it can be helpful to plot the log-likelihood for your data set. From this example onwards we will create an R function to evaluate the likelihood at any given parameter value. This would also allow us to find the MLE numerically, if we wished (see later).
Gauss.llfn<-function(theta,ys) {
mu=theta[1]; sigma=theta[2]
if (sigma<=0) {
z=-Inf
}
else {
n=length(ys)
z=-n*log(sigma)-sum((ys-mu)^2)/(2*sigma^2)
## or z=sum(dnorm(ys,mu,sigma,log=TRUE))
}
return(z)
}
set.seed(12345)
GaussData<-2+3*rnorm(50) ## true mu=2, sigma=3
par(mfrow=c(1,2))
hist(GaussData)
mus=seq(0.5,4,len=100)
sigmas=seq(2.2,6,len=100)
nmu=length(mus); nsig=length(sigmas)
lls=matrix(nrow=nmu,ncol=nsig)
for (i in 1:nmu) {
mu=mus[i]
for (j in 1:nsig) {
sig=sigmas[j]
lls[i,j]=Gauss.llfn(c(mu,sig),GaussData)
}
}
contour(mus,sigmas,lls,nlevels=20)
## Add on the MLE
muhat=mean(GaussData)
sighat=sqrt(mean((GaussData-muhat)^2))
points(muhat,sighat,pch=4,col=2,cex=1.5,lwd=2)
[Figure: left, a histogram of GaussData; right, a contour plot of the log-likelihood over (μ, σ) with the MLE marked by a red cross.]
Later, we will examine the behaviour of the above example under repeated
sampling.
2.5.0.2 Example: Linear regression with known variance
We now look at a simple regression model with a known variance,

Y_i ∼ N(α + β x_i, 1), independently for i = 1, . . . , n,

with (known) explanatory variables (x_1, . . . , x_n). Thus, θ = (α, β) and Ω = (−∞,∞) × (−∞,∞).
The likelihood function is

L(θ) = ∏_{i=1}^n (1/(2π)^{1/2}) exp{ −(1/2)(y_i − α − β x_i)² }

and

ℓ(θ) = ∑_{i=1}^n ( −(1/2) log(2π) − (1/2)(y_i − α − β x_i)² )
 = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (y_i − α − β x_i)².
To obtain θ̂:

∂ℓ(θ)/∂α = ∑_{i=1}^n (y_i − α − β x_i),
∂ℓ(θ)/∂β = ∑_{i=1}^n x_i (y_i − α − β x_i).

For a maximum we find θ = θ̂ such that

∂ℓ(θ̂)/∂α = 0 and ∂ℓ(θ̂)/∂β = 0.

Thus from the first equation we have

0 = ∑_{i=1}^n (y_i − α̂ − β̂ x_i) = nȳ − nα̂ − β̂ n x̄
⟹ α̂ = ȳ − β̂ x̄.

From the second equation we have

0 = ∑_{i=1}^n x_i (y_i − α̂ − β̂ x_i)
 = ∑_{i=1}^n x_i y_i − α̂ ∑_{i=1}^n x_i − β̂ ∑_{i=1}^n x_i²
 = ∑_{i=1}^n x_i y_i − (ȳ − β̂ x̄) n x̄ − β̂ ∑_{i=1}^n x_i²

⟹ β̂ = ( ∑_{i=1}^n x_i y_i − n ȳ x̄ ) / ( ∑_{i=1}^n x_i² − n x̄² ).

This expression for β̂ is identical to

β̂ = ∑_{i=1}^n (y_i − ȳ)(x_i − x̄) / ∑_{i=1}^n (x_i − x̄)².

These coincide with the least-squares estimates of α and β obtained by minimising the sum-of-squares function; see later.
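A brief R sketch (the data are simulated here purely for illustration) confirms that the closed-form expressions agree with the least-squares fit from lm():

## Check the closed-form MLEs against lm() on simulated data
set.seed(3)
n <- 30
x <- runif(n, 0, 10)
y <- 1 - 0.4 * x + rnorm(n)            ## illustrative data: alpha = 1, beta = -0.4, sigma = 1
betahat <- (sum(x*y) - n*mean(y)*mean(x)) / (sum(x^2) - n*mean(x)^2)
alphahat <- mean(y) - betahat * mean(x)
c(alphahat, betahat)
coef(lm(y ~ x))                        ## same values: lm() minimises the sum of squares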
[Figure: left, the data together with the fitted regression line; right, a contour plot of the log-likelihood over (α, β) with the MLE marked.]
The left-hand plot above shows the data together with the line y = α̂ + β̂x. The right-hand plot shows the log-likelihood with the MLE marked in red.
2.6 Intractable ML for a vector parameter
2.6.1 Example: F-distribution with both df unknown
Given parameters s > 0 and t > 0, the F-distribution has a density of

f(y|s, t) = ( Γ((s + t)/2) / (Γ(s/2) Γ(t/2)) ) (s/t)^{s/2} y^{s/2−1} ( 1 + (s/t) y )^{−(s+t)/2}  (y > 0).

The data you are given have actually arisen by setting the R seed to 520 and simulating n = 100 data values from an F_{2,8} distribution. Plot the log-likelihood and find the MLE.
We first write a function to evaluate the log-likelihood.
fdist.loglik<-function(theta,ys){
  s=theta[1]; tt=theta[2]
  if(s <= 0 || tt <= 0){
    ## if either parameter is non-positive the likelihood is 0,
    ## so the log-likelihood is log(0) = -Inf
    z=-Inf
  }
  else{
    z <- sum(df(ys,df1=s,df2=tt,log=TRUE))
  }
  return(z)
}
We then simulate the data (in “real life” you would read in a data vector):
##Create test data set
set.seed(520)
fdata <- rf(100,2,8)
Next, we fill up a matrix with the log-likelihood values, plot this, and then find the
MLE using the numerical optimiser optim(), and add this to the plot.
##Set up vectors for s and t...
## may need trial and error to get plot on a sensible range
ss <- seq(1.2,3.3,length=100)
ts <- seq(3,24,length=100)
##Now evaluate the log-likelihood at each combination of s and t
## 1. Create empty matrix in which to put log likelihood values
flls <- matrix(0,nrow=length(ss),ncol=length(ts))
## 2. Populate the matrix using a double for loop
for (i in 1:length(ss)){
for(j in 1:length(ts)){
flls[i,j] <- fdist.loglik(c(ss[i],ts[j]),fdata)
}
}
## 3. Plot
contour(ss,ts,flls,xlab=expression(theta[1]),ylab=expression(theta[2]),
nlevels=20)
[Figure: contour plot of the F-distribution log-likelihood over (θ1, θ2); this is the plot shown at the start of Section 2.5.]
2.6.2 Multi-dimensional numerical optimisation via optim()
It is more straightforward to use optim() on two-dimensional (and higher-dimensional) problems than it is for one-dimensional problems.
• Just as for the one-dimensional problem we must create a log-likelihood
function. The first argument must be the parameter vector to be optimised
over, and it can have any number of other arguments.
• Just as for the one-dimensional problem, the first argument to optim() is an
initial value, now for the parameter vector, rather than a scalar.
• Just as for the one-dimensional problem, the second argument to optim() is
the function to be maximised.
• Unlike the one-dimensional problem there is no need to specify the method,
nor provide upper and lower bounds. If no method is specified, optim()
defaults to the Nelder-Mead method, which (unlike, say, Newton-Raphson)
does not require gradient information.
• Just as for the one-dimensional problem, further arguments to the log-
likelihood function can be passed through optim() by naming them.
## 4.Find the MLE
ctl=list(fnscale=-1)
theta0=c(4.5,1)
out=optim(theta0,fdist.loglik,control=ctl,ys=fdata)
Fthetahat=out$par
print(Fthetahat); print(out$value)
## [1] 2.306666 5.904558
## [1] -126.7087
One could mark the MLE on the contour plot using the following:
##points(Fthetahat[1],Fthetahat[2],pch=4,col=2,cex=1.5,lwd=2)
You may also wish to try the command print(out). As with the one-dimensional
problem, convergence=0 indicates that the algorithm converged; you will also
see that it took 59 iterations to converge.
2.7 Estimating functions of the parameter
Often we are interested in estimating a property of the model, and not the
parameter. If this property can be written as a function of the parameter, then it
is easy to calculate its MLE once we have the MLE of the parameter.
Theorem (The Invariance Principle): If ϕ = g(θ) for a one-to-one function g(·), then the
MLE of ϕ is related to the MLE for θ by
ϕ̂ = g(θ̂).
This result says we simply evaluate the function at the MLE for θ, to get the MLE
for our new “parameter” of interest. For a proof, see Pawitan (2007), Section 2.8.
2.7.0.1 Example: Function of the Poisson mean
For the Poisson model, find the MLE of the probability of observing no events
Pr(X = 0).
From the probability mass function, we have Pr(X = 0) = exp{−θ}. Thus we are
interested in finding the MLE of ϕ = exp{−θ}.
By the invariance principle, ϕˆ = exp{−θˆ}, thus the MLE is
ϕ̂ = exp{−ȳ}.
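As a numerical illustration (a sketch only; the count vector ydata below is simulated and is not one of the notes' data sets), maximising the likelihood directly over ϕ gives the same answer as transforming the MLE of θ:
## Sketch: the invariance principle for phi = exp(-theta) with Poisson data.
set.seed(1)
ydata <- rpois(40,lambda=3) ## illustrative data
thetahat <- mean(ydata) ## MLE of theta
phihat <- exp(-thetahat) ## MLE of phi by invariance
## direct maximisation over phi, using theta = -log(phi)
negll.phi <- function(phi) -sum(dpois(ydata,lambda=-log(phi),log=TRUE))
phi.direct <- optimize(negll.phi,interval=c(0.001,0.999))$minimum
print(c(phihat,phi.direct)) ## essentially identical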
2.7.0.2 Example: IID F-distribution: expectation and variance
If a random variable Y has an Fs,t distribution, then provided t > 4, both the
expectation and variance are finite, and are:
\[
\mathrm{E}[Y] = \frac{t}{t-2}, \qquad
\mathrm{Var}[Y] = \frac{2t^2(s+t-2)}{s(t-2)^2(t-4)}.
\]
Use the MLEs from the earlier example to estimate the expectation and variance
of the distribution from which the data were drawn.
s=out$par[1]; tt=out$par[2]
Ehat=tt/(tt-2)
Vhatnum=2*tt^2*(s+tt-2)
Vhatden=s*(tt-2)^2*(tt-4)
Vhat=Vhatnum/Vhatden
print(round(c(Ehat,Vhat),3))
## [1] 1.512 6.466
This is not the same as simply finding the mean and variance of the sample:
print(round(c(mean(fdata),var(fdata)),3))
## [1] 1.478 9.237
2.8 Uncertainty under repeated sampling
We illustrate the variability associated with the likelihood and the MLE induced by
repeated sampling by repeating the IID Gaussian example a further three times,
each time with a different data set of size 50 simulated from the same N(2,32)
distribution.
[Figure: log-likelihood contour plots over (μ, σ) for Data sets 1 to 4.]
Each data set leads to a different likelihood surface and a different MLE vector.
The MLEs for μ are 2.539, 2.932, 1.961, 2.31 and the MLEs for σ are 3.257, 3.385,
3.341, 2.645. All of these are close to the true values of 2 and 3 respectively,
but there is variability. The methods in the next two chapters allow us to quantify
this variability and then use it to provide measures of uncertainty and perform
hypothesis tests.
Chapter 3
The Fisher information and
parameter uncertainty
As described in Chapter 1 we quantify uncertainty using the repeated-sampling
framework, where Y is assumed to be drawn from the modelled distribution using
the true parameter value θ∗ (scalar or vector). Thus, from our data and the
corresponding likelihood we have the estimate θ̂(y), which we examined in the previous
chapter; also, from the underlying data-generating mechanism we have
the estimator θ̂(Y).
As alluded to in the first chapter, the maximum likelihood estimator is consistent:
for any ε > 0,
\[
P\left(\theta^* - \epsilon < \hat{\theta}_n(Y_n) \le \theta^* + \epsilon\right) \to 1 \text{ as } n \to \infty;
\]
a similar definition for the vector case uses an ε-ball. It is also asymptotically unbiased:
as n → ∞, E[θ̂_n(Y_n)] → θ∗. Both of these properties, in fact, only hold subject to
regularity conditions, which we will overview later.
Also as described in Chapter 1, we have two key tools for quantifying uncertainty
in a parameter estimate. Firstly, the standard error of a parameter estimator is
\[
SE(\hat{\theta}) \equiv SE\left(\hat{\theta}(Y)\right) \equiv \sqrt{\mathrm{Var}\left[\hat{\theta}(Y)\right]}.
\]
For the running example in Chapter 1, this was s = σ/√n.
Secondly, a 100(1 − α)% confidence interval is a region (C_lo(y), C_hi(y)) such that
\[
P\left(\left(C_{lo}(Y), C_{hi}(Y)\right) \ni \theta^*\right) = 1 - \alpha.
\]
For the running example in Chapter 1, we found that the interval (μ̂(Y) − a, μ̂(Y) + a),
where a = −sΦ⁻¹(α/2) = −(σ/√n)Φ⁻¹(α/2), had precisely the desired property.
In this chapter we explore the Fisher Information, which relates directly to the
standard error and to confidence intervals known as Wald-based confidence
intervals, and leads to two different tests of a given hypothesis: the Wald test and
the score test. In the next chapter we examine the likelihood ratio, which provides
the likelihood ratio hypothesis test. It also provides an alternative route to confidence
intervals, which is covered in Statistical Fundamentals II.
3.1 Intuition
[Figure: two panels of log-likelihood curves against θ; the left panel shows ℓ1 (solid blue) and ℓ2 (dashed red), the right panel shows ℓ1 and ℓ3.]
The left panel of the above figure shows two log-likelihood functions: ℓ1 in solid
blue and ℓ2 in dashed red. Both ℓ1 and ℓ2 lead to the same MLE, bθ = 2, correspond-
ing to a log-likelihood of ℓ1(2) = ℓ2(2) = 6. As θ moves away from bθ, ℓ2 drops off
much more quickly from its maximum value than ℓ1 does. For example, ℓ1(1) = 5,
1 below the maximum, whereas ℓ2(1) = 2, 4 below the maximum. Thus we might
expect there to be more uncertainty surrounding the bθ value obtained from ℓ1
than there is from ℓ2.
In the right panel, the dashed red curve represents ℓ3 = ℓ2 + 2. The MLE remains
the same as it was for ℓ2 in the left panel, as does the drop in log-likelihood from
θ = bθ = 2 to θ = 1. Both ℓ2 and ℓ3 show the same tightness about the MLE value
and so the degree of uncertainty arising from these two curves is the same.
The intuition to take away from the above is that the amount of uncertainty in the
MLE is governed by the tightness of the log-likelihood curve about the MLE, that
is the “speed” with which it drops off from the maximum, which is affected by its
slope.
• By definition, at the MLE, ℓ′1(θ̂) = ℓ′2(θ̂) = 0.
• At θ = 1 < θ̂, 0 < ℓ′1(θ) < ℓ′2(θ).
• At θ = 3 > θ̂, ℓ′2(θ) < ℓ′1(θ) < 0.
For both log-likelihoods, the first derivative is decreasing as θ increases, but it
decreases more quickly for ℓ2 than it does for ℓ1. The rate of change of the first
derivative is the second derivative, so ℓ′′2 < ℓ′′1 < 0. Thus we have the intuition that:
The more negative ℓ′′ is, the less uncertainty there is associated with
the MLE obtained from ℓ.
Indeed, as we shall see, ℓ′′ is intimately related to the standard error of the
estimator bθ(Y).
3.2 Key definitions
We first provide the definition of each key quantity for one-dimensional problems,
and then immediately generalise it to two dimensions.
Definition: The score function is the first derivative of the log-likelihood. In one
dimension the score is a scalar:
\[
U(\theta) \equiv \ell'(\theta) \equiv \frac{d\ell(\theta)}{d\theta} \equiv U(\theta; y).
\]
As a function of θ we call U(θ) the score function, while as a random variable
for fixed θ we call U(θ) = U(θ;Y) the score statistic. As you will see in Statistical
Fundamentals II, understanding the basic properties of the score function is key
to deriving asymptotic results for the MLE.
Evaluation of the score function at a particular value θ0 is simply
\[
U(\theta_0) = \left.\frac{d}{d\theta}\ell(\theta)\right|_{\theta=\theta_0}.
\]
In two dimensions both the score function U(θ) = U(θ;y) and the score statistic
U(θ;Y) are vectors:
\[
U(\theta) \equiv \nabla\ell(\theta) \equiv
\begin{pmatrix} \frac{\partial}{\partial\theta_1}\ell(\theta) \\[4pt] \frac{\partial}{\partial\theta_2}\ell(\theta) \end{pmatrix}
\equiv
\begin{pmatrix} \frac{\partial\ell}{\partial\theta_1} \\[4pt] \frac{\partial\ell}{\partial\theta_2} \end{pmatrix}.
\]
Definition: The observed Fisher information is minus the second derivative of
the log-likelihood. In one dimension the observed Fisher information is a scalar:
\[
O(\theta) = O(\theta; y) = -\ell''(\theta) = -\frac{d^2}{d\theta^2}\ell(\theta) = -\frac{d}{d\theta}U(\theta).
\]
In two dimensions it is a matrix:
\[
O(\theta) = O(\theta; y) \equiv
\begin{pmatrix}
-\frac{\partial^2}{\partial\theta_1^2}\ell(\theta) & -\frac{\partial^2}{\partial\theta_1\partial\theta_2}\ell(\theta) \\[4pt]
-\frac{\partial^2}{\partial\theta_2\partial\theta_1}\ell(\theta) & -\frac{\partial^2}{\partial\theta_2^2}\ell(\theta)
\end{pmatrix}
\equiv -
\begin{pmatrix}
\frac{\partial^2}{\partial\theta_1^2}\ell(\theta) & \frac{\partial^2}{\partial\theta_1\partial\theta_2}\ell(\theta) \\[4pt]
\frac{\partial^2}{\partial\theta_2\partial\theta_1}\ell(\theta) & \frac{\partial^2}{\partial\theta_2^2}\ell(\theta)
\end{pmatrix}.
\]
Definition: The expected Fisher information evaluated at θ is the expectation
of the observed information over all random datasets drawn from the same
distribution as is being used to model the real data, using parameter θ:
\[
E(\theta) = \mathrm{E}\left[O(\theta; Y)\right] = -\mathrm{E}\left[\ell''(\theta; Y)\right]
= -\mathrm{E}\left[\frac{d^2}{d\theta^2}\ell(\theta; Y)\right]
= -\mathrm{E}\left[\frac{d}{d\theta}U(\theta)\right].
\]
In two dimensions it is a matrix:
\[
E(\theta) = \mathrm{E}\left[O(\theta; Y)\right]
= -\mathrm{E}\begin{pmatrix}
\frac{\partial^2}{\partial\theta_1^2}\ell(\theta) & \frac{\partial^2}{\partial\theta_1\partial\theta_2}\ell(\theta) \\[4pt]
\frac{\partial^2}{\partial\theta_2\partial\theta_1}\ell(\theta) & \frac{\partial^2}{\partial\theta_2^2}\ell(\theta)
\end{pmatrix}
= -\begin{pmatrix}
\mathrm{E}\left[\frac{\partial^2}{\partial\theta_1^2}\ell(\theta)\right] & \mathrm{E}\left[\frac{\partial^2}{\partial\theta_1\partial\theta_2}\ell(\theta)\right] \\[4pt]
\mathrm{E}\left[\frac{\partial^2}{\partial\theta_2\partial\theta_1}\ell(\theta)\right] & \mathrm{E}\left[\frac{\partial^2}{\partial\theta_2^2}\ell(\theta)\right]
\end{pmatrix}.
\]
NB: in Statistical Fundamentals I, the notation O(θ̂) (whether the parameter is a scalar
or a vector) will always refer to the information conditional on the observed data y.
Whenever we require the random variable O(θ̂;Y) we will always explicitly include the Y.
3.3 Asymptotic approximations and conse-
quences I: Wald CIs
The asymptotic (meaning “as the number of data values, n→∞”) approximations
that we state here are only valid if certain regularity conditions are satisfied.
1. θ cannot be a boundary parameter; i.e., it cannot be at the edge of
the parameter space.
2. The Fisher information is positive (or a positive-definite matrix) and bounded.
3. For the density/mass function f(y|θ), we can take up to third derivatives (with
respect to θ) under an integral (with respect to y) sign. In particular, the
support of f cannot depend on θ.
We will see that the Fisher information is the inverse variance (or variance matrix),
which explains point 2.
The Uniform data example does not satisfy regularity Condition 3.
All of the other examples in Statistical Fundamentals I satisfy all three regularity
conditions, although we will later discuss the consequences of Condition 1 for the
range of validity of certain statistical tests.
Subject to the regularity conditions it is possible to obtain several rigorous the-
oretical results. These are stated, and heuristic proofs are given, in Statistical
Fundamentals II. The consequences of the theory are the following approximations,
which become more and more accurate as the number of independent data points
increases.
3.3.1 The scalar case
We first deal with the scalar case, and denote the true (unknown) parameter value
by θ∗. When Y is drawn from f using the parameter θ∗, approximately
\[
\hat{\theta}(Y) \sim N\!\left(\theta^*,\, 1/E(\hat{\theta}(y))\right), \qquad
\hat{\theta}(Y) \sim N\!\left(\theta^*,\, 1/O(\hat{\theta}(y); y)\right).
\]
The approximations would be better if, instead of evaluating the information
at θ̂(y), we evaluated it at the true parameter value, θ∗; however, we do not know
θ∗. The approximations above are still valid because θ̂ is consistent.
The most immediate consequence of these approximations for θ̂ is that we can
obtain standard errors and confidence intervals for the unknown parameter.
As described in the first chapter and recapped earlier in this chapter, the standard
error (SE) of an estimator is the standard deviation of θ̂(Y), where Y is drawn from
f using parameter θ∗. Thus
\[
SE\left(\hat{\theta}(Y)\right) \approx 1/\sqrt{E(\hat{\theta})} \approx 1/\sqrt{O(\hat{\theta}; y)},
\]
and, for shorthand, we sometimes abbreviate this to s.
We also aim to create a 100(1 − α)% confidence interval for θ. Recall from the
first chapter and the start of this chapter that this is an interval (C_lo(Y), C_hi(Y))
such that
\[
P\left(\left(C_{lo}(Y), C_{hi}(Y)\right) \ni \theta^*\right) = 1 - \alpha.
\]
We will choose C_lo(Y) = θ̂(Y) − a and C_hi(Y) = θ̂(Y) + a. Now C_lo(Y) < θ∗ ⟺
θ̂(Y) − a < θ∗ ⟺ θ̂(Y) < θ∗ + a, and C_hi(Y) > θ∗ ⟺ θ̂(Y) + a > θ∗ ⟺ θ̂(Y) > θ∗ − a.
Thus
\[
\left(C_{lo}(Y), C_{hi}(Y)\right) \ni \theta^* \iff \hat{\theta}(Y) \in (\theta^* - a, \theta^* + a).
\]
Hence
\[
P\left(\left(C_{lo}(Y), C_{hi}(Y)\right) \ni \theta^*\right)
= P\left(\hat{\theta}(Y) \in (\theta^* - a, \theta^* + a)\right)
\approx \Phi\!\left(\frac{a}{s}\right) - \Phi\!\left(-\frac{a}{s}\right)
= 1 - 2\Phi(-a/s).
\]
Equating this with 1 − α gives:
\[
a \approx -s\,\Phi^{-1}\!\left(\frac{\alpha}{2}\right).
\]
The two different approximations for the distribution of θ̂(Y) then give:
\[
a \approx -\frac{1}{\sqrt{E(\hat{\theta})}}\,\Phi^{-1}\!\left(\frac{\alpha}{2}\right), \qquad
a \approx -\frac{1}{\sqrt{O(\hat{\theta}; y)}}\,\Phi^{-1}\!\left(\frac{\alpha}{2}\right).
\]
Either of these standard errors creates a so-called Wald-based confidence interval.
For example, setting α = 0.05, we find −Φ⁻¹(α/2) ≈ 1.96, so the two possible forms
for the confidence interval are:
\[
\left(\hat{\theta} - 1.96/\sqrt{E(\hat{\theta})},\; \hat{\theta} + 1.96/\sqrt{E(\hat{\theta})}\right)
\quad\text{and}\quad
\left(\hat{\theta} - 1.96/\sqrt{O(\hat{\theta})},\; \hat{\theta} + 1.96/\sqrt{O(\hat{\theta})}\right).
\]
Here bθ ≡ bθ(y) is the MLE from the data.
3.3.1.1 Example: IID Poisson data - continued
Consider the likelihood for IID data from a Poisson(θ) distribution derived earlier:
ℓ(θ) = ℓ(θ;y) = −nθ + ny¯ log(θ).
We also derived the score
\[
U(\theta) = U(\theta; y) = \ell'(\theta) = -n + \frac{n\bar{y}}{\theta}.
\]
Indeed, we solved the equation U(θ̂) = 0 to obtain the MLE: θ̂ = ȳ.
The observed Fisher information is:
\[
O(\theta) = O(\theta; y) = -\ell''(\theta) = -U'(\theta) = \frac{n\bar{y}}{\theta^2}.
\]
At the MLE it simplifies to O(θ̂) = n/θ̂.
The expected Fisher information is:
\[
E(\theta) = \mathrm{E}\left[O(\theta; Y)\right] = \mathrm{E}\left[\frac{n\bar{Y}}{\theta^2}\right]
= \frac{n\,\mathrm{E}[\bar{Y}]}{\theta^2} = \frac{n}{\theta},
\]
since if the Y_i ∼ Poisson(θ) are identically distributed then E[Ȳ] = E[Y1] = θ. Thus
E(θ̂) = O(θ̂) = n/θ̂.
Hence, we obtain
\[
SE(\hat{\theta}) \approx \sqrt{\frac{\hat{\theta}}{n}}
\]
and the following as an asymptotically valid 95% confidence interval:
\[
\left(\hat{\theta} - 1.96\sqrt{\frac{\hat{\theta}}{n}},\; \hat{\theta} + 1.96\sqrt{\frac{\hat{\theta}}{n}}\right),
\]
where θ̂ ≡ θ̂(y).
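For illustration, this interval is quick to evaluate in R; the values below (n = 50 and θ̂ = 8.8) are the ones that reappear in the Wald-test example later in this chapter.
## Sketch: 95% Wald CI for the Poisson rate.
thetahat <- 8.8; n <- 50
SE <- sqrt(thetahat/n)
print(thetahat+qnorm(c(0.025,0.975))*SE)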
3.3.1.2 Example: IID Exponential data - continued
Consider the likelihood for IID data from an Exponential(θ) distribution described earlier:
\[
\ell(\theta) = n\log\theta - n\theta\bar{y}.
\]
The score was
\[
\frac{d\ell}{d\theta} = \frac{n}{\theta} - n\bar{y}.
\]
Thus
\[
-\frac{d^2\ell}{d\theta^2} = \frac{n}{\theta^2}.
\]
Since n is fixed and known, O(θ) = E(θ) = n/θ². To find the standard error of θ̂(Y)
we use θ = θ̂(y). Hence SE(θ̂) ≈ θ̂/√n.
For these data, n = 10 and θ̂ = 1/ȳ ≈ 0.00734. Thus SE(θ̂) ≈ 0.00734/√10 ≈
0.00232. A 90% Wald-based CI is then
thetahat=0.00734; n=10
SE=thetahat/sqrt(n)
CIloExpo=thetahat+qnorm(0.05)*SE
CIhiExpo=thetahat+qnorm(0.95)*SE
print(c(CIloExpo,CIhiExpo))
## [1] 0.003522111 0.011157889
3.3.2 The two-dimensional case
The analogous approximations for the two-parameter case, where the true parameter
is θ∗, are:
\[
\hat{\theta}(Y) \sim MVN\!\left(\theta^*,\, O(\hat{\theta}(y))^{-1}\right), \qquad
\hat{\theta}(Y) \sim MVN\!\left(\theta^*,\, E(\hat{\theta}(y))^{-1}\right).
\]
Here MVN denotes the (two-dimensional) multivariate normal distribution; see
the Appendix.
Thus, the standard error of θ̂_j (j = 1, 2) is approximately the square root of the
(j, j) element of E(θ̂)⁻¹ or O(θ̂)⁻¹. Confidence intervals may then be obtained as
for the one-dimensional case.
NB: the standard error of θ̂_j is not, in general, equal to one over the square root of
the (j, j) element of E(θ̂) or O(θ̂).
3.3.2.1 Example: Normal distribution with unknown expectation and
variance - continued
Consider the likelihood for IID Gaussian data derived earlier:
\[
\ell(\theta) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.
\]
We also derived the two components of the score vector, which we now put
together:
\[
U(\theta) = U(\theta; y) =
\begin{pmatrix}
\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) \\[4pt]
-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu)^2
\end{pmatrix}.
\]
Let us calculate the four components of the information matrix:
\[
-\frac{\partial^2\ell}{\partial\mu^2} = -\frac{\partial U_1}{\partial\mu} = \frac{n}{\sigma^2},
\qquad
-\frac{\partial^2\ell}{\partial\mu\,\partial\sigma} = -\frac{\partial U_1}{\partial\sigma} = \frac{2}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu),
\qquad
-\frac{\partial^2\ell}{\partial\sigma^2} = -\frac{\partial U_2}{\partial\sigma} = -\frac{n}{\sigma^2} + \frac{3}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2.
\]
Aside: ∂U2/∂μ = ∂²ℓ/∂μ∂σ = ∂²ℓ/∂σ∂μ = ∂U1/∂σ.
Thus the observed Fisher information matrix is:
\[
\begin{pmatrix}
\frac{n}{\sigma^2} & \frac{2}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu) \\[4pt]
\frac{2}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu) & -\frac{n}{\sigma^2} + \frac{3}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2
\end{pmatrix}
= n\begin{pmatrix}
\frac{1}{\sigma^2} & \frac{2}{\sigma^3}(\bar{y} - \mu) \\[4pt]
\frac{2}{\sigma^3}(\bar{y} - \mu) & -\frac{1}{\sigma^2} + \frac{3}{\sigma^4}\,\overline{(y - \mu)^2}
\end{pmatrix}.
\]
Since μ̂ = ȳ and σ̂² is the mean of the (y_i − μ̂)²,
\[
O(\hat{\mu}, \hat{\sigma}) = n\begin{pmatrix} \frac{1}{\hat{\sigma}^2} & 0 \\ 0 & \frac{2}{\hat{\sigma}^2} \end{pmatrix}.
\]
The expected Fisher information matrix is:
\[
n\begin{pmatrix}
\mathrm{E}\left[\frac{1}{\sigma^2}\right] & \mathrm{E}\left[\frac{2}{\sigma^3}(\bar{Y} - \mu)\right] \\[4pt]
\mathrm{E}\left[\frac{2}{\sigma^3}(\bar{Y} - \mu)\right] & \mathrm{E}\left[-\frac{1}{\sigma^2} + \frac{3}{\sigma^4}\,\overline{(Y - \mu)^2}\right]
\end{pmatrix}
= n\begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix},
\]
since, as the data are identically distributed, E[Ȳ] = μ and the expectation of the
mean of the (Y_i − μ)² is E[(Y1 − μ)²] = Var[Y1] = σ².
Thus O(θ̂) = E(θ̂). Since this matrix is diagonal, its inverse is:
\[
E(\hat{\mu}, \hat{\sigma})^{-1} = O(\hat{\mu}, \hat{\sigma})^{-1} = \frac{1}{n}\begin{pmatrix} \hat{\sigma}^2 & 0 \\ 0 & \frac{1}{2}\hat{\sigma}^2 \end{pmatrix}.
\]
Thus SE(μ̂) ≈ σ̂/√n and SE(σ̂) ≈ σ̂/√(2n). The respective approximate 95% confidence
intervals are:
\[
\left(\hat{\mu} - 1.96\,\hat{\sigma}/\sqrt{n},\; \hat{\mu} + 1.96\,\hat{\sigma}/\sqrt{n}\right)
\quad\text{and}\quad
\left(\hat{\sigma} - 1.96\,\hat{\sigma}/\sqrt{2n},\; \hat{\sigma} + 1.96\,\hat{\sigma}/\sqrt{2n}\right).
\]
3.3.2.2 IID Gaussian data with both parameters unknown (continued)
Earlier we considered data Y_i ∼ N(μ, σ²), i = 1, . . . , n with n = 50. For the particular
data set, we found μ̂ = 2.539 and σ̂ = 3.257. Thus, our Wald-based 95% CI for μ
is (1.636, 3.442) and for σ it is (2.619, 3.895).
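A quick check of these intervals in R, using the summary values above:
## Check of the two Wald-based 95% CIs.
n <- 50; muhat <- 2.539; sigmahat <- 3.257
print(muhat+c(-1,1)*1.96*sigmahat/sqrt(n)) ## CI for mu
print(sigmahat+c(-1,1)*1.96*sigmahat/sqrt(2*n)) ## CI for sigma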
In both of the above examples O and E were identical when evaluated at the
MLE. This is not always the case. If not, which one should be used when creating
a confidence interval? Their accuracies are similar; the discussion in Section 9.6
of In All Likelihood suggests that, at least sometimes, the observed information
might be more robust.
In both of the above examples the observed and expected information were
(effectively) proportional to n, the number of independent observations, so that
the standard error decreased in proportion to 1/√n. This is typical of all of the
examples you will see in this module (though it is not the case for the Uniform
data example, which does not satisfy the regularity conditions). To illustrate the
typical behaviour we repeat the likelihood plot for the IID Gaussian example from
earlier, and place alongside it a contour plot of the likelihood for 200 data points:
the original 50 and then 150 more. The contours drop off much more quickly.
[Figure: log-likelihood contours over (μ, σ) for Data set 1 (left) and for Data set 1 plus 150 extra points (right).]
3.4 Asymptotic approximations and conse-
quences II: the Wald and score tests
Let the true (unknown) value of the parameter be θ∗. A scientist may wish to
decide whether or not θ∗ = θ0, where θ0 is some specific hypothesised value. Or,
for a specific scalar component θ of a parameter vector, they may wish to decide
whether θ∗ = θ0.
Recall from the first chapter that a statistical hypothesis test considers two
hypotheses: the null hypothesis, H0, and an alternative hypothesis, H1 which
covers every possibility except H0.
The test tries to provide evidence to show that H0 is false by constructing a test
statistic, T(y), from the data such that |T(y)| is likely to be large if H0 is false and
small if H0 is true.
Typically one chooses a significance level, α, such as 5%. If the probability under
the null hypothesis of seeing a test statistic as or more extreme than the observed
statistic is less than or equal to α then the null hypothesis is deemed to not fit with
the data and is rejected; otherwise H0 is not rejected. Equivalently, we discover a
critical value for the test at significance level α, cα, and we reject H0 if |T | ≥ cα.
The p-value of the test is the probability under the null hypothesis of obtaining
a value of the test statistic as extreme or more extreme than the one actually
observed. If under H0 the test statistic has a N(0,1) distribution then:
\[
p = P\left(|T(Y)| \ge |T(y)| \mid H_0\right) = P\left(|N(0,1)| \ge |T(y)|\right) = 2\Phi(-|T(y)|).
\]
If the p-value is at least as small as the required significance level of the test then
we reject H0; however, the magnitude of p provides additional information.
Recall also that the power of a test is the probability of rejecting H0 given that
H0 is false. Clearly power depends on the true value θ∗ as well as on θ0; it also
depends on the amount of data, since this affects the Fisher information and,
hence, the standard error.
3.4.1 Wald-based statistical tests for a single parameter
When there is a single unknown parameter, approximately, θ̂(Y) ∼ N(θ∗, s²),
where s² can be chosen to be either 1/E(θ̂(y)) or 1/O(θ̂(y)). When there is a
vector of unknown parameters and we wish to test the jth, we have θ̂_j(Y) ∼
N(θ∗_j, s²), where s² is the jth diagonal entry of either E(θ̂(y))⁻¹ or O(θ̂(y))⁻¹.
The Wald test uses this distribution directly to construct the test statistic:
\[
T_{Wald} = \frac{\hat{\theta}(y) - \theta_0}{s}
= \text{either } \sqrt{E(\hat{\theta})}\,\{\hat{\theta}(y) - \theta_0\}
\text{ or } \sqrt{O(\hat{\theta})}\,\{\hat{\theta}(y) - \theta_0\}.
\]
Crucially, all of the quantities in the above formulae are known.
From the asymptotic distribution, approximately,
\[
T(Y) \sim N\!\left(\frac{\theta^* - \theta_0}{s},\, 1\right).
\]
In particular, if H0 is true, θ0 = θ∗ and T(Y) ∼ N(0,1). For a significance level of
α, the critical value cα satisfies
\[
P\left(|T(Y)| \ge c_\alpha \mid H_0\right) = \alpha,
\]
i.e., 2Φ(−cα) = α, so cα = −Φ⁻¹(α/2). For example, if α = 0.05, cα ≈ 1.96.
The p-value for the test is 2Φ(−|T(y)|), as derived above.
If H0 is not true, then E[T] ≠ 0 and |T| is more likely to be large than if H0 were
true (for example, E[|T|²] = 1 + (θ∗ − θ0)²/s²).
The power of the Wald test with significance level α is:
\[
P\left( \left| N\!\left(\frac{\theta^* - \theta_0}{s},\, 1\right) \right| \ge c_\alpha \right).
\]
As discussed earlier this requires the true value, θ∗, which is unknown. Typically
one calculates the power for several conjectured values of θ∗. Also as discussed
earlier, the power depends on s which depends on the Fisher information, which
depends on the amount of data.
3.4.1.1 Example: IID Poisson data - continued
Consider the likelihood for IID data from a Poisson(θ) distribution described earlier.
Suppose that from n = 50 data points we obtain θ̂ = 8.8; use a Wald test at the
5% level of the null hypothesis H0 : θ∗ = θ0 against the alternative H1 : θ∗ ≠ θ0,
where θ0 = 10. Also write down the p-value for the test.
From the earlier Fisher information calculations, we have the following approximation
for large n:
\[
\hat{\theta}(Y) \sim N\!\left(\theta^*,\, \frac{\hat{\theta}(y)}{n}\right).
\]
From the MLE calculations we have θ̂ = ȳ. The standard error is s = √(θ̂/n), and so
the test statistic is
\[
T(y) = \frac{\hat{\theta} - \theta_0}{\sqrt{\hat{\theta}/n}} = \frac{8.8 - 10}{\sqrt{8.8/50}} \approx -2.860.
\]
The critical value for the test is −Φ⁻¹(0.05/2) ≈ 1.96. Since |T| > 1.96 we reject
H0.
The p-value is
\[
2\Phi(-|-2.860|) \approx 0.00423.
\]
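The hand calculation above can be checked in R:
## Check of the Wald test for the Poisson example.
n <- 50; thetahat <- 8.8; theta0 <- 10
TWald <- (thetahat-theta0)/sqrt(thetahat/n)
print(c(TWald,2*pnorm(-abs(TWald)))) ## test statistic and p-value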
Common misunderstanding: students have been observed to report that “not
only is the magnitude of the test statistic larger than the critical value, but also the
p-value is under α”; these two events are equivalent, so of course they will only
happen together. A third equivalence is that the MLE is outside of the 100(1−α)%
confidence interval.
3.4.1.2 IID Exponential data
Consider the IID data from an Exponential(θ) distribution described earlier, where
n = 10 and θ̂ = 1/ȳ ≈ 0.00734. Use a Wald test at the 10% level of the null
hypothesis H0 : θ∗ = θ0 against the alternative H1 : θ∗ ≠ θ0, where θ0 = 0.01. Also
write down the p-value for the test.
From earlier SE(bθ) ≈ 0.00232, we calculate the test statistic
thetahat=0.00734; theta0=0.01
SE=0.00232
TWald=(thetahat-theta0)/SE
print(c(TWald,qnorm(0.95)))
## [1] -1.146552 1.644854
The magnitude of the test statistic is less than the threshold c0.1, so we fail to
reject H0; there is not sufficient evidence to reject the hypothesis that θ∗ = 0.01.
3.4.1.3 IID Gaussian data with both parameters unknown (continued)
Earlier we had n = 50 independent data values from a N(μ, σ²) distribution with a
sample mean of ȳ = 2.539 and a mean of squares (1/n)∑y_i² = 17.05.
Test at the 10% level the null hypothesis that σ = 3 against H1 : σ ≠ 3; find the
p-value.
From earlier we know that
\[
\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right\}^{1/2}
= \left\{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2\right\}^{1/2} \approx 3.257.
\]
From earlier in this chapter we have that, whether the observed or expected
information at (μ̂, σ̂) is used,
\[
SE(\hat{\sigma}) \approx \frac{\hat{\sigma}}{\sqrt{2n}} \approx 0.326.
\]
The critical value for a test with a significance level of α = 10% is c0.1 = −Φ⁻¹(0.1/2) ≈ 1.645.
The test statistic here is
\[
\frac{\hat{\sigma} - \sigma_0}{s} \approx \frac{3.257 - 3}{0.326} \approx 0.788.
\]
Since this is less than 1.645 we fail to reject the null hypothesis; there is insufficient
evidence to suggest σ∗ ≠ 3.
The p-value for the test is
\[
p = 2\Phi\!\left(-\left|\frac{\hat{\sigma} - \sigma_0}{s}\right|\right)
\approx 2\Phi\!\left(-\frac{3.257 - 3.0}{0.326}\right) \approx 0.431.
\]
3.4.2 Asymptotic distribution of the score
When there is a single unknown parameter, θ, conditional on the second and third
of the regularity conditions, we have that approximately, when n is large:
\[
U(\theta^*; Y) \sim N\!\left(0,\, E(\hat{\theta}(y))\right), \qquad
U(\theta^*; Y) \sim N\!\left(0,\, O(\hat{\theta}(y); y)\right).
\]
3.4.3 The score test
This leads directly to an alternative statistical test of the null hypothesis H0 : θ∗ =
θ0 against the alternative H1 : θ∗ ≠ θ0 via the following test statistic:
\[
T_{score} = \frac{U(\theta_0)}{\sqrt{E(\hat{\theta})}}
\quad\text{or}\quad
T_{score} = \frac{U(\theta_0)}{\sqrt{O(\hat{\theta})}}.
\]
If H0 is true, so that θ0 = θ∗, then T_score ∼ N(0,1), just as T_Wald does.
3.4.3.1 Example: IID Poisson data - continued
Consider the likelihood for IID data from a Poisson(θ) distribution derived earlier.
Suppose that from n = 50 data points we obtain θ̂ = 8.8; use a score test at
the 5% level of the null hypothesis H0 : θ∗ = θ0 against the alternative H1 : θ∗ ≠ θ0,
where θ0 = 10. Also write down the p-value for the test.
We have the following approximation for large n:
\[
U(\theta^*; Y) \sim N\!\left(0,\, \frac{n}{\hat{\theta}}\right).
\]
From earlier the score is U(θ; y) = −n + nȳ/θ. Also, from the previous chapter, θ̂ = ȳ.
Thus
\[
T(y) = \frac{U(\theta_0; y)}{\sqrt{n/\hat{\theta}}} = \frac{-50 + 50 \times 8.8/10}{\sqrt{50/8.8}} \approx -2.517.
\]
The critical value for the test is −Φ⁻¹(0.05/2) ≈ 1.96. Since |T| > 1.96 we reject
H0; there is evidence to suggest θ∗ ≠ 10.
The p-value is
\[
2\Phi(-|-2.517|) \approx 0.0118.
\]
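Again, this can be checked in R:
## Check of the score test for the Poisson example.
n <- 50; ybar <- 8.8; thetahat <- ybar; theta0 <- 10
U0 <- -n+n*ybar/theta0
Tscore <- U0/sqrt(n/thetahat)
print(c(Tscore,2*pnorm(-abs(Tscore)))) ## test statistic and p-value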
Aside: sometimes a Wald test and a score test might suggest different actions.
First and foremost this suggests that the better action hangs in the balance and
more data should be collected. If, however, it is not possible to collect more data,
there is a third test, which we will cover in the next chapter; it is more accurate,
if slightly more complicated to carry out.
Advantages of the score test: In the single parameter case that we consider,
there are two advantages of the score test over the Wald test. The first is that
it does not require the first of the regularity conditions described earlier; i.e. θ0
can be on the boundary of the parameter space. The second is that it can be
conducted without evaluating the MLE as we now discuss.
3.4.4 Using the information at θ0
The approximate asymptotic distributions of the MLE and score that we have
given so far use the Fisher information evaluated at the MLE, E(θ̂) and O(θ̂),
whether the parameter is a scalar or a vector. These approximations are valid
because θ̂ → θ∗.
During a full hypothesis test, if the null hypothesis is true then we already know
the true value of the parameter, θ∗ = θ0, and so we may use this value in place of
the MLE; that is, we may use E(θ0). It can be dangerous to use O(θ0), since this
need not even be positive (or, in the vector case, positive definite). NB: you
will see the reason why E(θ) is positive no matter what the value of θ in Statistical
Fundamentals II.
This has a particular logistical advantage in the case of a score test when
the MLE is intractable, since T_score = U(θ0)/√E(θ0) and T_score = U(θ0)/√O(θ0) do
not require the evaluation of the MLE.
When the intractability of the MLE is not an issue, the exploration of a single
example in Section 9.7 of In All Likelihood suggests that using the information
evaluated at θ̂ may lead to a more robust statistical test.
3.4.5 Vector null hypotheses
In the case of an unknown parameter vector, we might have a full vector null hypothesis
of H0 : θ∗ = θ0 against H1 : θ∗ ≠ θ0. There are both Wald and score tests
for this too; see In All Likelihood, Section 9.9, for example.
Indeed, when the number of parameters is p > 2 we may wish to test whether or
not a sub-vector of the parameters is a specific (vector) value. There is a Wald
test for this type of hypothesis too.
The likelihood ratio test, which we cover in the next chapter, can be applied in
both of the above cases, and is generally more robust.
3.4.6 One-sided hypothesis tests
In the cases of one sided hypotheses, a one-tailed test is required: the sign of T
is important and the threshold and p-value are calculated according to this. For
example, if we were only concerned if the true value were larger than some θ0
then we would have:
• H0: θ∗ ≤ θ0.
• H1: θ∗ > θ0.
It is, therefore, only cases where bθ > θ0, i.e., large, positive T(y), that contradict
the null hypothesis. Similarly, for the p-value, we are only interested in T(Y) ≥
T(y).
Under the null hypothesis, θ∗ has a range of possible values. The p-value for the test is
then defined as the largest p-value over all the possible values in H0:
\[
p = \max_{H_0} P\left(T(Y) \ge T(y) \mid H_0\right)
= \max_{\theta \le \theta_0} P\left(N(0,1) \ge \frac{\hat{\theta}(y) - \theta}{s}\right)
= P\left(N(0,1) \ge \frac{\hat{\theta}(y) - \theta_0}{s}\right);
\]
i.e., the value of θ at the extreme end of the null hypothesis region should be
used.
We illustrate one sided tests in the next section.
3.5 Numerical estimation of the observed infor-
mation
The expected and observed Fisher information might be quite difficult to calculate,
involving messy differentiation and substitution of the MLE; indeed, the expected
information may not be tractable for any general parameter value. When this is the
case, the observed information at the MLE can be calculated numerically - either
as a check that the analytical calculations are correct or instead of performing
them.
We illustrate this on the two-dimensional numerical example from the previous
chapter, IID samples from an F distribution with both parameters unknown.
When optim() is called to find the MLE numerically, it can also be asked to find
the Hessian at the mode:
ctl=list(fnscale=-1)
theta0=c(4.5,1)
out=optim(theta0,fdist.loglik,control=ctl,ys=fdata,hessian=TRUE)
print(out$par); print(out$value);
## [1] 2.306666 5.904558
## [1] -126.7087
print(out$hess)
## [,1] [,2]
## [1,] -8.0104351 -0.3762961
## [2,] -0.3762961 -0.3315742
Remember: the observed Fisher Information is the negative Hessian at the mode.
An alternative that is sometimes more accurate is to use the hess function
provided in the hessian.R file on the Moodle web-page. Called on the log-likelihood,
it returns the Hessian at the supplied point, i.e. minus the observed information,
so the approximate variance matrix of the MLE is minus its inverse. The arguments
are (i) a function whose first argument is the parameter (or
parameter vector), (ii) the parameter value at which we wish to evaluate the
Hessian (usually the MLE), and (iii) any other arguments the function should take,
in the correct order.
source("../../hessian.R") ## whatever path you need
FMLEs=out$par
H=hess(fdist.loglik,FMLEs,ys=fdata)
H ## check this matches with optim
## [,1] [,2]
## [1,] -8.0104307 -0.3762962
## [2,] -0.3762962 -0.3315742
See Statistical Fundamentals II for an explanation of how the second derivatives
are approximated numerically.
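For intuition only, each entry of the Hessian can be approximated by central finite differences. The short function below is a sketch of the general idea and is not the implementation used in hessian.R; in particular, the step size eps is an arbitrary choice.
## Sketch: central-difference approximation to the (i,j) second derivative
## of a function f at the point theta; eps is a small step size.
num.d2 <- function(f,theta,i,j,eps=1e-4,...){
  ei <- ej <- rep(0,length(theta)); ei[i] <- eps; ej[j] <- eps
  (f(theta+ei+ej,...)-f(theta+ei-ej,...)-f(theta-ei+ej,...)+f(theta-ei-ej,...))/(4*eps^2)
}
num.d2(fdist.loglik,FMLEs,1,1,ys=fdata) ## compare with H[1,1]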
Since we are already working in R we can use it to find the variance matrix directly
so as to conduct a Wald test or find Wald-based confidence intervals.
WaldVar= -solve(H)
print(WaldVar)
## [,1] [,2]
## [1,] 0.1318673 -0.1496533
## [2,] -0.1496533 3.1857545
SEofs=sqrt(WaldVar[1,1])
MLEofs=Fthetahat[1]
## 90% CI
CIlos=MLEofs+qnorm(0.05)*SEofs
CIhis=MLEofs+qnorm(0.95)*SEofs
print(c(CIlos,CIhis))
## [1] 1.709361 2.903970
## Wald test of H0: s=4 against H1: s != 4
pval1=2*pnorm(-abs((MLEofs-4)/SEofs))
## One-sided Wald test of H0: s>=4 against H1: s<4
pval2=pnorm((MLEofs-4)/SEofs)
## One-sided Wald test of H0: s<=4 against H1: s>4
pval3=1-pnorm((MLEofs-4)/SEofs)
print(c(pval1,pval2,pval3))
## [1] 3.114916e-06 1.557458e-06 9.999984e-01
We reject the null hypothesis on the first two occasions, but not on the third.
3.6 Orthogonality
When there are two unknown parameters θ = (θ1, θ2)⊤, these parameters are
said to be orthogonal if
\[
\mathrm{E}\left[\frac{\partial^2 \ell}{\partial\theta_1\,\partial\theta_2}\right] = 0;
\]
that is, the off-diagonal terms of the expected Fisher information matrix are 0 for all θ.
If E(θ) is diagonal then so is E(θ)⁻¹. Since, asymptotically, θ̂(Y) ∼
MVN(θ∗, E(θ̂(y))⁻¹), we see that θ̂1 and θ̂2 are (asymptotically) uncorrelated.
Since the joint distribution is bivariate normal, this means that θ̂1 and θ̂2 are
asymptotically independent. This has a number of advantages:
1. Parameters are easier to interpret since “reasonable” values for θ1 do not
depend on the value for θ2, and vice versa
2. The standard error for θ1 is the same whether or not θ2 is known.
3. Confidence intervals are easier to interpret since they can be examined
independently.
The parameters μ and σ in the Normal distribution with unknown mean and
variance example were orthogonal. We now provide an example where the
parameters are not orthogonal.
3.6.0.1 Example: Linear regression with known variance
Previously we looked at a simple regression model with a known variance:
\[
Y_i \sim N(\alpha + \beta x_i,\, 1),
\]
with (known) explanatory variables (x_1, . . . , x_n). Thus θ = (α, β) and
Ω = (−∞,∞) × (−∞,∞). We found that
\[
\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2.
\]
Also
\[
\frac{\partial\ell}{\partial\alpha} = \sum_{i=1}^{n}(y_i - \alpha - \beta x_i),
\qquad
\frac{\partial\ell}{\partial\beta} = \sum_{i=1}^{n} x_i(y_i - \alpha - \beta x_i).
\]
Thus
\[
\frac{\partial^2\ell}{\partial\alpha^2} = \sum_{i=1}^{n}(-1) = -n,
\qquad
\frac{\partial^2\ell}{\partial\alpha\,\partial\beta} = \sum_{i=1}^{n}(-x_i) = -n\bar{x},
\qquad
\frac{\partial^2\ell}{\partial\beta^2} = \sum_{i=1}^{n}(-x_i^2) = -n\overline{x^2}.
\]
Whatever the values of α and β, the observed Fisher information and its inverse
are, therefore,
\[
n\begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{pmatrix}
\quad\text{and}\quad
\frac{1}{n\left(\overline{x^2} - \bar{x}^2\right)}
\begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}.
\]
Under the usual repeated-sampling paradigm, Y is random but x is fixed and
known, so the above are also E and E⁻¹.
Thus α and β are only orthogonal if x̄ = 0; otherwise the correlation between α̂ and β̂
has the opposite sign to x̄. To illustrate this we simulate n = 20 independent data points
with x_i ∼ Unif(0, 2) and Y_i = 1 + 2x_i + ε_i, where ε_i ∼ N(0,1).
The left hand figure plots the log-likelihood together with the MLE in blue. The
negative correlation between bα and bβ is clearly visible and we pick two other
parameter combinations (red and green) which respect this correlation and have
reasonably high likelihoods. We also pick another combination (in magenta)
that has an intercept and slope which are both individually reasonable but in
combination are not. The corresponding lines through the data appear in the
same colours in the right hand plot, with the data in the background.
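The illustration can be reproduced along the following lines; this is only a sketch, in which the random seed and the plotting ranges are arbitrary choices, so the picture will not match the figure below exactly.
## Sketch: log-likelihood contours for the known-variance regression,
## with x_i ~ Unif(0,2) and Y_i = 1 + 2*x_i + eps_i as described above.
set.seed(1) ## arbitrary seed
n <- 20
xs <- runif(n,0,2)
ys <- 1+2*xs+rnorm(n)
reg.loglik <- function(theta,xs,ys)
  -length(ys)/2*log(2*pi)-0.5*sum((ys-theta[1]-theta[2]*xs)^2)
alphas <- seq(-2,3,length=100); betas <- seq(0,5,length=100)
lls <- matrix(0,nrow=100,ncol=100)
for (i in 1:100){
  for(j in 1:100){
    lls[i,j] <- reg.loglik(c(alphas[i],betas[j]),xs,ys)
  }
}
contour(alphas,betas,lls,xlab=expression(alpha),ylab=expression(beta),nlevels=20)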
[Figure: left panel, log-likelihood contours over (α, β) with the MLE (blue) and three other parameter combinations (red, green, magenta) marked; right panel, the data with the corresponding four lines.]
Whilst the MLE provides the best fit, indeed it is called the line of best fit, the red
and green lines also fit the data reasonably well. However, the magenta line does
not.
3.7 CIs for functions of the parameter
Often we are interested in estimating a property of the model, and not the
parameter. For example, in the linear regression example to predict the value
of y at = 0.3 we would be interested in ψ = α + 0.3β. From the invariance
principle bψ = bα + 0.3 bβ, but how do we measure uncertainty in bψ?
For a general function of a parameter, this can be approached via the delta
method or profile-likelihood which are covered in Statistical Fundamentals II. Here
we consider two special cases.
3.7.1 CIs for a monotonic function of a single parameter
Suppose that we have a 100(1 − α)% confidence interval (C_lo(y), C_hi(y)) for a
parameter θ (e.g., a Wald CI). This is equivalent to
\[
P\left(\left(C_{lo}(Y), C_{hi}(Y)\right) \ni \theta^*\right) = 1 - \alpha,
\]
where Y arises from the distribution with true parameter θ∗.
Let h(θ) be an increasing function, such as h(θ) = θ³ or h(θ) = Φ⁻¹(0.5θ − 1). The
left graphic below shows why
\[
C_{lo} < \theta^* < C_{hi} \iff h(C_{lo}) < h(\theta^*) < h(C_{hi}).
\]
The right graphic shows that the opposite holds when h is decreasing:
\[
C_{lo} < \theta^* < C_{hi} \iff h(C_{hi}) < h(\theta^*) < h(C_{lo}).
\]
[Figure: the increasing function h(x) = x³ (left) and the decreasing function h(x) = −x³ (right), with C_lo, C_hi and their images h(C_lo), h(C_hi) marked.]
Thus, if h is increasing,
\[
P\left(h(C_{lo}(Y)) < h(\theta^*) < h(C_{hi}(Y))\right) = 1 - \alpha,
\]
and (h(C_lo), h(C_hi)) is a 100(1 − α)% CI for h(θ).
Similarly, if h is decreasing, the 100(1 − α)% CI is (h(C_hi(Y)), h(C_lo(Y))).
3.7.1.1 Example: IID Exponential
Earlier we found a 90% Wald-based CI for the rate parameter in the exponential
distribution for the failure data: (0.004,0.011).
The probability that there are no events in a future time interval of Δ = 50 is
ψ = exp(−θΔ), which is a decreasing function of θ. Thus the 90% CI for ψ is
\[
\left(\exp(-50 \times 0.011),\; \exp(-50 \times 0.004)\right) = (0.572,\, 0.839).
\]
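Using the unrounded endpoints CIloExpo and CIhiExpo computed earlier, this is easily evaluated in R; note that the endpoints swap over because h is decreasing.
## 90% CI for psi = exp(-50*theta), from the unrounded Wald CI for theta.
Delta <- 50
print(c(exp(-Delta*CIhiExpo),exp(-Delta*CIloExpo)))
## approximately (0.572, 0.839), as above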
3.7.2 CIs for a linear combination of parameters
We have that, approximately, θ̂(Y) ∼ N(θ∗, V), where V = E(θ̂)⁻¹ or V = O(θ̂)⁻¹.
Consider a 2-dimensional vector a = (a1, a2)⊤ and the linear combination ψ =
a⊤θ = a1θ1 + a2θ2. From the appendix,
\[
\hat{\psi} \equiv a_1\hat{\theta}_1(Y) + a_2\hat{\theta}_2(Y)
\sim N\!\left(a^{\top}\theta^*,\, a^{\top}Va\right)
\equiv N\!\left(a_1\theta^*_1 + a_2\theta^*_2,\; a_1^2 V_{1,1} + 2a_1 a_2 V_{1,2} + a_2^2 V_{2,2}\right)
\equiv N\!\left(\psi^*,\; a_1^2 V_{1,1} + 2a_1 a_2 V_{1,2} + a_2^2 V_{2,2}\right),
\]
by the invariance principle. We thus have a Gaussian distribution for ψ̂ and the form
for SE(ψ̂).
3.7.2.1 Example: linear regression prediction
In the linear regression example, to predict the value of y at x = 0.3 we would be
interested in ψ = α + 0.3β. In the terminology used above, a1 = 1 and a2 = 0.3.
So, using the inverse information derived earlier,
\[
\hat{\psi} \sim N\!\left(\psi^*,\; \frac{\overline{x^2} - 2 \times 0.3\,\bar{x} + 0.3^2}{n\left(\overline{x^2} - \bar{x}^2\right)}\right).
\]
Manipulations like this will be helpful when we study GLMs in the second half of
this module.
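In R the quadratic form a⊤Va is straightforward to evaluate. The following is a generic sketch: the vector a, the MLE thetahat and the matrix V below are placeholders (the numerical values are purely illustrative) to be replaced by the quantities from whichever model is being used.
## Sketch: SE and approximate 95% CI for psi = a1*theta1 + a2*theta2.
a <- c(1,0.3) ## coefficients of the linear combination
thetahat <- c(2.31,5.90) ## placeholder MLE
V <- matrix(c(0.13,-0.15,-0.15,3.19),2,2) ## placeholder variance matrix
psihat <- sum(a*thetahat)
SEpsi <- sqrt(drop(t(a) %*% V %*% a))
print(psihat+c(-1,1)*1.96*SEpsi)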
Chapter 4
The likelihood ratio
In the previous chapter we found confidence intervals for a parameter and con-
ducted statistical tests based upon the (observed or expected) Fisher Information
and the asymptotic normality of the MLE and the score.
Suppose that for the standard deviation of IID Gaussian data we had an MLE
of σ̂ = 1.5 and a Fisher information of O = E = 1. Then a two-tailed Wald test of
H0 : σ∗ = σ0 with σ0 = −0.2 would give a p-value of
pvalue=2*pnorm(-abs(-0.2-1.5))
pvalue
## [1] 0.08913093
If conducting the test at the 5% level we would fail to reject the null hypothesis;
yet we know that the standard deviation cannot be negative! Even if we conducted
the test at the 10% level and rejected H0 it would not be for the fundamental
incorrectness of a negative standard deviation. Similarly, the 95% Wald CI would
include negative values.
In this chapter we consider a statistical test based upon the likelihood ratio.
The likelihood ratio test respects the support of the parameter or parameter
vector and, even in cases which do not contradict the support, it is typically more
robust than the Wald or score tests. There is also an alternative construction
of a confidence interval based upon the likelihood ratio; this will be covered in
Statistical Fundamentals II.
4.1 Single parameter case
4.1.1 Definition
When there is a single unknown parameter, the likelihood ratio for a particular θ
value is
\[
\frac{L(\hat{\theta}; y)}{L(\theta; y)}.
\]
Clearly this is ≥ 1, with equality when θ = bθ.
4.1.2 Asymptotic result: θ ≠ θ∗
For our first pair of asymptotic results, just as in the preliminary chapter we write
Y1, . . . , Yn as Yn; i.e., we explicitly state the number of data points. The first result
is that, subject to the regularity conditions, as the amount of data n → ∞,
\[
\frac{L(\hat{\theta}(Y_n); Y_n)}{L(\theta; Y_n)} \to \infty \quad\text{if } \theta \ne \theta^*.
\]
The strict statement is that the probability the above does not happen is 0.
Heuristically, the above result arises from two considerations:
1. θ̂(Yn) → θ∗, the true value;
2. the likelihood gets tighter and tighter about the MLE, since the Fisher information
(the curvature) increases linearly in n.
This suggests that the likelihood ratio can be used to test compatibility of the data
with a hypothesised θ0, since it will be large if θ0 ≠ θ∗.
4.1.3 Asymptotic result: θ = θ∗
When θ = θ∗ considerations 1 and 2 pull in opposite directions. It turns out that
these two effects cancel so that L(bθ(Yn);Yn)/L(θ∗;Yn) neither approaches 0 nor
∞.
Subject to the regularity conditions, in the limit as the amount of data n → ∞ we
have
\[
2\log\frac{L(\hat{\theta}; Y)}{L(\theta^*; Y)} \sim \chi^2_1.
\]
4.1.4 The χ²₁ distribution
• The χ²₁ distribution is the Gamma(1/2, 1/2) distribution.
• If Z ∼ N(0,1) then Z² ∼ χ²₁.
• Thus, asymptotically, both E(θ̂(y)){θ̂(Y) − θ∗}² and U(θ∗;Y)²/E(θ̂(y)) (or the
equivalent expressions with the observed information) have an approximate χ²₁
distribution. Hence the (two-tailed) Wald and score tests could have been conducted
using the χ²₁ distribution on the statistic T², leading to identical outcomes. In some
texts you will see the score test and/or the Wald test described in this way.
4.1.5 Likelihood ratio test I
In the case of a single unknown parameter, Wilks' likelihood ratio statistic at a
particular hypothesised θ0 is:
\[
T_{Wilks} \equiv T_{Wilks}(y) \equiv 2\log\frac{L(\hat{\theta}(y); y)}{L(\theta_0; y)}
\equiv 2\left\{\log L(\hat{\theta}; y) - \log L(\theta_0; y)\right\}
\equiv 2\left\{\ell(\hat{\theta}; y) - \ell(\theta_0; y)\right\}.
\]
Clearly T_Wilks ≥ 0, and, as discussed above, the larger T_Wilks is, the less compatibility
there is between the data and θ0 being θ∗; i.e., a large value of T_Wilks indicates
low compatibility. Thus it seems natural to base a statistical test of θ∗ = θ0 on
the size of T_Wilks.
Suppose that for some θ0 we have a null hypothesis of H0 : θ∗ = θ0 and an
alternative of H1 : θ∗ ≠ θ0. If the null hypothesis is true, T_Wilks ∼ χ²₁ and
\[
P\left(T_{Wilks}(Y) \ge c\right) = P\left(\chi^2_1 \ge c\right) = 1 - F_{\chi^2_1}(c),
\]
where F_{χ²₁} is the cumulative distribution function of a χ²₁ distribution. Thus, for a
test of size α we set
\[
c = c_\alpha = F^{-1}_{\chi^2_1}(1 - \alpha),
\]
and reject the null hypothesis if T_Wilks ≥ cα. For example, for a test with a 5%
significance level:
ca=qchisq(0.95,df=1)
ca
## [1] 3.841459
4.1.5.1 Example: IID Poisson data
Recall that the log likelihood for IID Poisson data is
ℓ(θ) = −nθ + ny¯ log(θ) + const.
Hence, the likelihood ratio statistic is
\[
T_{Wilks} = 2\left\{\ell(\hat{\theta}) - \ell(\theta_0)\right\}
= 2\left\{-n\hat{\theta} + n\bar{y}\log(\hat{\theta}) + \text{const} + n\theta_0 - n\bar{y}\log(\theta_0) - \text{const}\right\}
= 2n\left\{\theta_0 - \hat{\theta} + \bar{y}\log(\hat{\theta}/\theta_0)\right\}.
\]
Just as in the Wald and score examples, we have a data set with n = 50 and
θ̂ = ȳ = 8.8, and we wish to conduct a test at the 5% level of H0 : θ∗ = θ0 against
the alternative H1 : θ∗ ≠ θ0, where θ0 = 10.
The statistic is
n=50; thetahat=8.8; ybar=8.8; theta0=10
TWilks=2*n*(theta0-thetahat+ybar*log(thetahat/theta0))
TWilks
## [1] 7.506633
T_Wilks is approximately 7.507, which is much larger than the 0.95 quantile of the
χ²₁ distribution (≈ 3.84), so we reject the null hypothesis; there is evidence to
suggest that θ∗ ≠ 10. Indeed, the p-value is
1-pchisq(TWilks,df=1)
## [1] 0.006147218
4.1.5.2 Example: IID Exponential (failure time) data
Recall that the log likelihood for IID Exponential data is
\[
\ell(\theta) = n\log\theta - \theta n\bar{y}.
\]
The likelihood ratio statistic is
\[
2\left\{\ell(\hat{\theta}) - \ell(\theta_0)\right\}
= 2\left\{n\log(\hat{\theta}/\theta_0) - (\hat{\theta} - \theta_0)n\bar{y}\right\}.
\]
For the failure-time data, θ̂ = 1/ȳ ≈ 0.00734, with n = 10. As for the Wald test of
these data, we wish to test at the 10% level the null hypothesis H0 : θ∗ = θ0
against the alternative H1 : θ∗ ≠ θ0, where θ0 = 0.01, and to write down the
p-value for the test.
thetahat=0.00734; theta0=0.01
n=10; ybar=1/thetahat
TWilks=2*n*log(thetahat/theta0)-2*(thetahat-theta0)*n*ybar
print(c(TWilks,qchisq(0.9,df=1)))
## [1] 1.063031 2.705543
The test statistic is approximately 1.063, which is below the rejection threshold,
so we fail to reject the null hypothesis. As with the Wald test, there is insufficient
evidence to reject the null hypothesis that θ = 0.01. The p-value is
1-pchisq(TWilks,df=1)
## [1] 0.3025248
4.2 Full multiparameter tests
As discussed in the previous chapter, when the parameter is a vector, θ, we may
wish to test whether the entire parameter vector has a particular value. That is,
we might have
• H0 : θ∗ = θ0,
• H1 : θ∗ ≠ θ0.
In this case, the following asymptotic results (as always, subject to regularity
conditions) are relevant:
1. If θ ≠ θ∗ then
\[
\frac{L(\hat{\theta}(Y_n); Y_n)}{L(\theta; Y_n)} \to \infty
\]
as n → ∞.
2. If θ = θ∗ then for large n, approximately:
\[
2\left\{\ell(\hat{\theta}(Y_n); Y_n) - \ell(\theta; Y_n)\right\} \sim \chi^2_p,
\]
where p is the number of components in the vector θ.
4.2.1 The χ²_p distribution
• The χ²_p distribution is the Gamma(p/2, 1/2) distribution.
• If Z1, Z2, . . . , Zp are IID with a N(0,1) distribution then Z1² + Z2² + · · · + Zp² ∼ χ²_p.
• The χ2p distribution can also be used in Wald and score tests of the entire
parameter vector (which we have not covered).
4.2.1.1 Example: IID Gaussian data with both parameters unknown
Consider again Y1, . . . , Yn IID with a N(μ, σ²) distribution. From earlier, the log
likelihood of the full model is
\[
\ell(\theta; y) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.
\]
The MLEs are μ̂ = ȳ and
\[
\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right\}^{1/2}.
\]
Thus
\[
\ell\left((\hat{\mu}, \hat{\sigma}); y\right) = -\frac{n}{2}\log(2\pi)
- \frac{n}{2}\log\left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right\} - \frac{n}{2}.
\]
By contrast,
\[
\ell(\mu_0, \sigma_0; y) = -\frac{n}{2}\log(2\pi) - n\log\sigma_0 - \frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(y_i - \mu_0)^2.
\]
Thus
\[
\ell(\hat{\mu}, \hat{\sigma}; y) - \ell(\mu_0, \sigma_0; y)
= n\log\sigma_0 - \frac{n}{2}\log\left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right\} - \frac{n}{2}
+ \frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(y_i - \mu_0)^2.
\]
From earlier we have n = 50, μ̂ = 2.539 and σ̂ = 3.257, but instead of simply
testing σ∗ = σ0 = 3 (for which we used a Wald test) we wish to conduct the test:
H0 : (μ, σ) = (2, 3) against H1 : (μ, σ) ≠ (2, 3).
Then
mu0=2;sigma0=3
TWilksOver2=n*log(sigma0)-n/2*log(sigmahat^2)-n/2
TWilksOver2=TWilksOver2+1/(2*sigma0^2)*sum((Gaussdata-mu0)^2) ## 'Gaussdata' assumed to hold the 50 observations
[Figure: leverage of each observation plotted against observation number.]
The combinations of covariates were specifically designed so that no observation
should have undue influence.
5.6 Confounding and interactions
5.6.1 Confounding
Consider two covariates, year of birth, which we shorten to year, and age. Now
year+age= the current year (2021 when these notes were written). Thus
β0 − 2021b + (β1 + b)year+ (β2 + b)age = β0 + b(year+ age− 2021) + β1year+ β2age
= β0 + β1year+ β2age.
Adding any arbitrary value on to both β1 and β2 and subtracting an appropriate
value from β0 does not change the contribution from these three terms. Equiv-
alently, if the MLEs are ( bβ0, bβ1, bβ2) then for any b, ( bβ0 − 2021b, bβ1 + b, bβ2 + b)
would give the same likelihood and so be an MLE. We say that the model is
not identifiable and that the two covariates and the intercept are completely
confounded.
More generally, the linear model is not identifiable if there are constants
a_0, . . . , a_{p−1}, not all zero, such that ∑_{j=0}^{p−1} a_j x_{i,j} = 0 for every i. In our example,
a_0 = −2021, a_1 = 1 and a_2 = 1.
Lack of identifiability is straightforward to diagnose: since some of the β̂_j could
take any value, their standard errors must be infinite. This can only happen if the
matrix X⊤X is singular, and hence not invertible.
In some cases, however, although the model is identifiable, there is a linear
combination of covariates that is not far from zero for each individual. Let us
consider the simplest case of one covariate being close to a multiple of another:
x_{i,2} ≈ b x_{i,1} for each i, so that b x_{i,1} − x_{i,2} ≈ 0 for each i. In particular, therefore,
\[
\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} \approx \beta_0 + (\beta_1 + ab)x_{i,1} + (\beta_2 - a)x_{i,2}
\]
for any small value a. So the predicted values obtained using the sub-optimal coefficients
(β̂0, β̂1 + ab, β̂2 − a) will be very close to the predicted values using the MLE, and hence
the residuals, and the RSS, would be similar to those obtained using the MLE.
Equivalently, if β1 and β2 were to move away from their MLEs in such a way that
β1 − β̂1 ≈ −b(β2 − β̂2), then the log-likelihood would drop away from its maximum
very slowly.
We create an artificial example below to demonstrate the consequences, by plotting
the log-likelihood as a function of β1 and β2 at the true values of β0 and σ. Similar
plots would be obtained at β̂0 and σ̂.
n=100 ## number of observations
betatrue=c(1,-0.2,0.4); sigmatrue=1 ## true parameter values
set.seed(4321)
x1s=rnorm(n)
x2s=x1s*24/25+7/25*rnorm(n) ## x2, so Cor(X1,X2)=0.96
X=matrix(nrow=n,ncol=3)
X[,1]=rep(1,n); X[,2]=x1s; X[,3]=x2s
ys=X %*% betatrue +sigmatrue*rnorm(n)
betahat=solve(t(X) %*% X) %*% t(X) %*% ys
m=100
beta1s=seq(-2.5,1.5,len=m); beta2s=seq(-1.5,2.5,len=m)
lls=matrix(nrow=m,ncol=m)
for (i in 1:m) {
for (j in 1:m) {
thisbeta=c(betatrue[1],beta1s[i],beta2s[j])
lls[i,j]=-1/(2*sigmatrue^2)*sum((ys-X %*% thisbeta)^2)
}
}
par(mfrow=c(1,2))
plot(x1s,x2s,pch=4,xlab=expression("x"[1]),ylab=expression("x"[2]))
contour(beta1s,beta2s,lls,xlab=expression(beta[1]),ylab=expression(beta[2]),
main="log-likelihood",levels=round(max(lls))-c(1:10,20,30,40,50,100))
abline(v=0,col="blue",lwd=2,lty=2)
[Figure: left panel, scatter plot of x2 against x1; right panel, log-likelihood contours over (β1, β2) with the line β1 = 0 marked by a dashed blue vertical line.]
This is a more extreme version of the confounding seen between the intercept and
the slope for the simple linear regression with known variance. The interpretation
is the same: many different combinations of the parameters provide a fit that
is nearly as good as the fit at the MLE. The consequence is a great deal of
uncertainty in the parameter values, visible, for example, in very large estimates
of the standard error (SE( bβ1) = 0.409 and SE( bβ2) = 0.398) and very wide CIs for
both parameters.
Henceforth it will help to name the four linear models that might be fitted to these
data:
\[
\begin{aligned}
M_0 &: Y_i = \beta_0 + \varepsilon_i,\\
M_1(1) &: Y_i = \beta_0 + \beta_1 x_{i,1} + \varepsilon_i,\\
M_1(2) &: Y_i = \beta_0 + \beta_2 x_{i,2} + \varepsilon_i,\\
M_2 &: Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i,
\end{aligned}
\]
where the ε_i ∼ N(0, σ²) are IID.
We have fitted M2; however, the covariates are so similar that it is tempting to
ignore x_{i,1} by forcing β1 = 0 and fitting M1(2). The log-likelihood for this model,
as a function of β2, is illustrated by the dashed blue line on the log-likelihood plot.
The maximum along this line is above −54, suggesting that a likelihood-ratio test
of H0 : β1 = 0 against H1 : β1 ≠ 0 might find that there is no evidence to reject H0.
Testing at the 5% level using twice the difference in the maximum log-likelihoods
between M2 and M1(2), we find that this is the case: the p-value is 0.294, so
there is insufficient evidence to reject H0.
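In R, this comparison can be carried out by fitting the two models and comparing them. The sketch below continues the simulated data from the code above; anova() reports the F-test version of the comparison used throughout this chapter and should reproduce the p-values quoted in this section.
## Sketch: compare M2 with M1(2), and M1(2) with M0, for the simulated data.
yv <- drop(ys) ## response as a plain vector
m2 <- lm(yv~x1s+x2s) ## model M2
m12 <- lm(yv~x2s) ## model M1(2), i.e. beta1 forced to be 0
m0 <- lm(yv~1) ## null model M0
anova(m12,m2) ## test of H0: beta1 = 0 within M2
anova(m0,m12) ## test of H0: beta2 = 0 once beta1 = 0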
If we fix β1 = 0 (by fitting M1(2)) then the standard error of β2 depends (roughly,
since this plot is just a cross-section of ℓ(β0, β1, β2, σ)) on the curvature of the
log-likelihood along the line β1 = 0 and is much reduced. Indeed SE( bβ2) = 0.111
compared to SE( bβ2) = 0.398 in M2.
Of course, M1(2) is wrong; we simulated the data and we know M2 is correct. The
point is that there are insufficient data points for us to be able to tell statistically
that M1(2) is wrong. There are many, many other models that are not supported
by the data, and in a real scenario we would not know which, if any, of these might
be correct. The principle of parsimony suggests choosing the simplest model that
is supported by the data.
The idea of performing likelihood ratio tests to decide which covariates are im-
portant is explored further in the section on model choice. The example also
highlights how fitting a single model with all covariates in and then deciding which
covariates to remove purely based on the Wald tests from that model can be
entirely misleading. We have bβ1 = −0.432 and SE( bβ1) = 0.409; the p-value of
a Wald test is 0.294. Similarly, bβ2 = 0.663 and SE( bβ2) = 0.398; the p-value of a
Wald test is 0.0985. Thus both parameters would be removed. By contrast, once
we have set β1 = 0, a likelihood ratio test of H0 : β2 = 0 against H1 : β2 ≠ 0 can
be performed by comparing the log-likelihoods of M1(2) with M0. This gives a
p-value of 0.0215; there is clear evidence to reject H0, so x_{i,2} should stay in the
model.
In the special case where the X⊤X matrix is diagonal, so is its inverse, showing
that the parameters are orthogonal and their estimates are asymptotically independent.
In this case, the standard error of one parameter estimate does not
depend on the value of the other parameter estimate, including if the latter is set
to 0. As well as narrowing the confidence intervals by removing confounding, this
also helps with interpretability. The subject of traditional experimental design
is, essentially, about choosing combinations of covariates so that X⊤X is as close
to diagonal as possible whilst providing maximum information about all covariate
effects.
5.6.2 Interactions
The exploratory plots below are of a response against each of two covariates; x1
is a continuous covariate, whereas x2 is binary.
[Figure: exploratory plots of y against x1 (left) and of y against x2 (right).]
The plots suggest a linear model of the form y_i = β0 + β1 x_{i,1} + β2 x_{i,2} + ε_i; this can
be thought of as two straight lines, with the line when x_{i,2} = 1 raised above the
line when x_{i,2} = 0 by β2. We fit this model and plot the residuals against each of
the covariates.
[Figure: residuals plotted against x1 (left) and against x2 (right).]
In terms of the average effect these look fine; however, the residuals are more
variable when x2 = 0 than they are when x2 = 1. Either of the following two plots
shows the reason. The first is a scatter plot of ε̂ against x1 and the second of y
against x1; both use '+' when x2 = 0 and '×' when x2 = 1. The second plot
includes the prediction lines from β̂.
[Figure: left panel, residuals against x1; right panel, y against x1 with the fitted prediction lines; points are marked '+' when x2 = 0 and '×' when x2 = 1.]
In the first plot we can see that when x2 = 0 the residuals decrease with x1,
whereas when x2 = 1 they increase with x1. In the second plot we can see that
when x2 = 1 the observations start below the fitted line and end above it, with
the opposite occurring when x2 = 0. Both plots suggest that the slope of the best-fit
line should be larger when x2 = 1 and smaller when x2 = 0. That is, the rate at
which y increases with x1 depends on the value of x2. This is called an interaction.
More generally:
A two-way interaction occurs when the effect of one covariate on the
response depends on another covariate.
There can be higher-order interactions: for example, a three-way interaction
occurs when the size of a two-way interaction depends on the value of a third
covariate. However, this module will restrict attention to two-way interactions.
To contrast with an interaction, the individual covariates are often termed main
effects.
A further example of a two-way interaction could be with an intervention such
as the introduction of free milk in schools. For simplicity, partition the children
into those who received free school milk (x1 = 1) and those who did not (x1 = 0).
Also partition the children into lower socioeconomic class (x2 = 0) and higher
socioeconomic class (x2 = 1), and let the response, y, be some measure of bone
health. Milk improved the health outcome for everyone (β1 > 0), and children
from the higher socioeconomic class were generally healthier (β2 > 0);
however, the health benefits of milk were greater for the children from the
lower socioeconomic class.
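In R, a two-way interaction is specified in the model formula with the : operator (or * for main effects plus interaction). The sketch below uses simulated data purely to show the syntax.
## Sketch: main effects plus a two-way interaction between x1 and x2.
set.seed(1) ## illustrative data only
x1 <- runif(60,0,2)
x2 <- rep(0:1,each=30)
y <- 0.5+x1+0.5*x2+x1*x2+rnorm(60,sd=0.3)
fit <- lm(y~x1+x2+x1:x2) ## equivalently lm(y ~ x1*x2)
summary(fit) ## the x1:x2 row assesses the interaction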
5.7 Diagnostics and model improvements
We have already seen how the
• QQ plots of the observed residuals
• plots of the residuals against each covariate
can be helpful in diagnosing deficiencies with a model. Other useful plots include
• a scatter plot of the residuals against the fitted value ŷ_i.
Aside: one should expect to see an increasing trend if the residuals are plotted
against the observed value, so this is not recommended.
Within the framework of the linear model, there is a range of possible strategies
to overcome issues raised by these diagnostic plots:
• Unused covariates. There may be one or more additional covariates that
should be included in the model.
• Non-linearity in the covariates. For a continuous covariate, x, it could be, for
example, that the response does not change linearly with x. It might be that
√x or x² should be included either in addition to or instead of x.
• Periodicity in a covariate. It may also be that the response is cyclic in x
with a period of T (e.g. T = 24 or T = 365 if x is a time), in which case two
covariates of the form sin(2πx/T) and cos(2πx/T) might be included.
• Interactions. There may be one or more interactions that should be included
in the model.
• Transformation of the response. If the residuals are non-Gaussian, or show a
pattern when plotted against the fitted value (such as variance increasing with
the fitted value) even after your best efforts, or if the science suggests it, it may
be better to transform the response: for example, √y = x⊤β + ε, or log y =
x⊤β + ε. The simplest method to decide whether or not a transformation
is appropriate is through the diagnostics. A likelihood-ratio test between
the transformed and untransformed models is not valid because they are
not nested. Use of BIC or AIC is possible, but the Jacobian of the data
transformation must be included in the likelihood term, and that is beyond
the scope of this module. More generally, the flexible Box-Cox transformation
is commonly used, though, because the choice of its parameter involves the
Jacobian of the transformation, it too is beyond the scope of this module.
• Non-constant variance or correlated residuals. Finally, it may be that the residuals do not all have the same variance; for
example, the responses might be averages from different-sized
groups. It may even be that the residuals are correlated with each other. For
example, suppose blood pressure measurements 1-5 are from individual 1,
measurements 6-10 are from individual 2, and so on, for each of 1000 individuals
for whom we have only two covariates: age and sex. Suppose individuals 1
and 2 are both 60-year-old males but individual 1 has high blood pressure
for their age and sex, whereas individual 2 has low blood pressure for their
age and sex. The residuals for individual 1 will all be high and those for
individual 2 will be low; residuals are correlated within an individual. The
technique of generalised least squares can account for a general covariance
matrix, $\Sigma$; this is also called the general linear model. We will not be covering
this in Statistical Fundamentals I, but see the optional module on Modelling
longitudinal and hierarchical data.
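As an illustration of the non-linearity, periodicity and response-transformation strategies above, here is a minimal sketch, assuming a hypothetical data frame dat with response y, a continuous covariate x and a time-of-day covariate t with period 24:
## Hypothetical data frame 'dat' with response y, covariate x and time t (period 24)
M.quad=lm(y~x+I(x^2),data=dat)                       ## add a quadratic term in x
M.sqrt=lm(y~sqrt(x),data=dat)                        ## replace x by its square root
M.cyc=lm(y~x+sin(2*pi*t/24)+cos(2*pi*t/24),data=dat) ## periodic effect of t
M.logy=lm(log(y)~x,data=dat)                         ## transform the response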
Of course, it could be that no transformation leads to Gaussian residuals, for
example, because the response can only be 0 or 1. In this case a generalised
linear model might be appropriate. This is the subject of the final part of this
module.
5.8 F-tests and t-tests
For the linear model and its special cases, such as ANOVA (a special case of which
is the two-sample t-test), if the assumptions (such as IID N(0, σ2) residuals) are
satisfied exactly then it is possible to calculate the distributions of the Wald and
likelihood ratio statistics under H0 exactly.
The Wald statistic for a linear model with $p$ “$\beta$” values (including $\beta_0$) has a $t_{n-p}$
distribution, and the likelihood ratio statistic for a test between nested models
with $q$ and $p$ (where $q > p$) “$\beta$” parameters has an $F_{q-p,\,n-q}$ distribution.
• As $m \to \infty$, the $t_m$ distribution $\to$ the $N(0,1)$ distribution.
• As $m \to \infty$, if $X \sim F_{a,m}$ then $aX \to \chi^2_a$ in distribution.
• If $Z \sim t_m$ then $Z^2 \sim F_{1,m}$, mimicking the relationship between the $N(0,1)$ and
$\chi^2_1$ distributions.
Thus, as the amount of data n → ∞, the t and F distributions given before the
bullet points are equivalent to the asymptotic Gaussian and χ2 distributions given
earlier in this module.
All of the p-values in this chapter - and this chapter only - are calculated using
the t and F distributions.
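These limits are easy to check numerically; for example, the following quantiles converge as m increases:
## t_m quantiles approach the N(0,1) quantile as m grows
qt(0.975,df=c(5,50,5000)); qnorm(0.975)
## a*F_{a,m} quantiles approach the chi^2_a quantile (here a=3)
3*qf(0.95,df1=3,df2=c(5,50,5000)); qchisq(0.95,df=3)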
5.9 Model choice
Given a set of covariates, we now describe in detail two algorithms that use
likelihood ratio tests to choose an appropriate set of covariates and interactions to
include. Both use the same methodology for deciding on appropriate interactions,
so we describe each algorithm first and then describe the fitting of interactions.
In each case, associated with each response, $y_i$, we have a set of covariates
$x_{i,1}, \dots, x_{i,p}$. We denote by $M_q(j_1, \dots, j_q)$ the model with $q$ main effects indexed
by $j_1, \dots, j_q$. For example, $M_2(1,3) \equiv M_2(3,1)$ is the model with $x_1$ and $x_3$
included. Each algorithm uses a significance level, $\alpha$, for the likelihood ratio tests.
5.9.1 Forward fitting
0. Forward fitting starts from the null model M0 as the best model so far and
proceeds as follows.
1. Fit each of the models $M_1(1), \dots, M_1(p)$ in turn. For each of the one-parameter models conduct a likelihood-ratio test against the null model. If none
of the p-values is ≤ α then terminate: no covariates are included. Other-
wise choose the covariate k1 that leads to the lowest p-value and M1(k1)
becomes the new best model.
2. Fit each of the models M2(k1,1), . . . ,M2(k1, p), except M2(k1, k1), which
makes no sense. For each of the two parameter models conduct a likelihood-
ratio test against M1(k1). If none of the p-values is ≤ α then terminate:
M1(k1) is the best model. Otherwise choose the covariate k2 that leads to
the lowest p-value and M2(k1, k2) becomes the new best model.
• The algorithm continues adding new covariates until either all are included
or all p-values are > α.
With categorical covariates that have more than two categories, such as strain in the
barley data, we usually include all of the categories together or none, rather than trying
them individually.
We illustrate the process now with the barley data using α = 0.05.
M0=lm(yield~1,data=Barley) ## null
M1.1=lm(yield~light,data=Barley)
M1.2=lm(yield~fert,data=Barley)
M1.3=lm(yield~strain,data=Barley) ## includes BOTH indicator functions
c(anova(M0,M1.1)$P[2], anova(M0,M1.2)$P[2], anova(M0,M1.3)$P[2])
## [1] 7.737122e-19 3.580707e-02 5.029512e-05
The smallest p-value is for light, and this is under α, so we include that in the
best model.
M2.12=lm(yield~light+fert,data=Barley)
M2.13=lm(yield~light+strain,data=Barley)
c(anova(M1.1,M2.12)$P[2], anova(M1.1,M2.13)$P[2])
## [1] 3.697492e-01 1.521983e-22
The p-value for strain is below α so we include that in the best model. Even
though the p-value for fert was above α it still could turn out to be important
once we have strain in the model, so we continue.
M3.123=lm(yield~light+fert+strain,data=Barley)
anova(M2.13,M3.123)$P[2]
## [1] 0.06163627
This is (just!) not significant, so we do not include fert. However, if α = 0.10
then we would have included this covariate too.
5.9.2 Backward fitting
0. Backward fitting starts with the best model as $M_p(1, \dots, p)$ and tries removing covariates from the model.
1. From the current best model, try removing each covariate in turn and perform
a likelihood ratio test.
a. If none of the p-values is > α (i.e., all null hypotheses are rejected) then
terminate; you have found the best model.
b. Otherwise, remove the covariate with the largest likelihood-ratio p-value
from the model to create the new best model and go to 1.
When p is large yet many of the covariates are irrelevant, backward fitting can be
time consuming or even (if p + 1 > n) impossible.
For the barley data, we have
M3.123=lm(yield~light+fert+strain,data=Barley)
M2.12=lm(yield~light+fert,data=Barley)
M2.13=lm(yield~light+strain,data=Barley)
M2.23=lm(yield~fert+strain,data=Barley)
p3=anova(M3.123,M2.12)$P[2]
p2=anova(M3.123,M2.13)$P[2]
p1=anova(M3.123,M2.23)$P[2]
c(p3,p2,p1)
## [1] 8.128571e-23 6.163627e-02 5.367768e-35
The highest p-value is 0.0616363 for removing fert, so we remove this to obtain
M2.13 as our new best model. Now we try removing each of light and strain:
c(anova(M2.13,M1.1)$P[2],anova(M2.13,M1.3)$P[2])
## [1] 1.521983e-22 5.544793e-36
Neither p-value was > α so the algorithm terminates. The model with light and
strain is the best main-effects model.
5.9.3 Two-way interactions
Usually one considers all of the interactions between main effects that have
been included in the model. For example, if q main effects have been included
then there are q(q − 1)/2 two-way interactions to consider. Even if q only has a
moderate size this can be a large number of terms, and so forward fitting is used,
starting with the chosen main-effects model and trying each of the interactions.
In the barley example, since there are only two main effects kept in, there is only
one interaction to check
M2.13.interact=lm(yield~light+strain+light*strain,data=Barley)
anova(M2.13,M2.13.interact)$P[2]
## [1] 0.6906134
The LRT has $H_0: \beta_{1,3} = 0$ and $H_1: \beta_{1,3} \neq 0$, where $\beta_{1,3}$ is the coefficient of the
interaction between $x_1$ (light) and $x_3$ (strain). We fail to reject $H_0$, so the best
model stays as the main-effects model with light and strain.
5.9.4 Exhaustive search
Both forwards fitting and backwards fitting are greedy search algorithms, seeking
out a locally optimal solution, so they may not find the “best”. When p is small
(typically $p \le 5$) it is feasible to fit every one of the $2^p$ possible main-effects
models. When $p \le 3$ it is in fact feasible to fit all $2^{p + p(p-1)/2}$ main-effects and
two-way interaction models. However, these cannot all be compared against each
other using likelihood-ratio tests; instead we seek an absolute measure of fit.
Firstly, we explain why maximised log-likelihood on its own is not a good measure.
If model A is nested within model B then the maximised log-likelihood for B cannot
be less than that for A, since A is a special case of B. Even if the extra parameter(s)
in B have no true predictive value they will always be able to fit to the noise in the
data, at least slightly, thus improving the fit. Indeed, this “fitting to noise” is the
source of the χ2 distribution for twice the difference between the log-likelihoods.
Hence, if maximised log-likelihood were used as a measure of fit then we would
always choose the largest model even if it did not improve the fit substantially
enough. The same would happen if we were to use R2 as a measure.
For the linear model there are many sensible measures of fit. We will examine
a measure that can be used more generally to compare models, the Bayesian
Information Criterion or BIC. For a general model with p∗ parameters, θ, and n
independent data points, the BIC is
$$\mathrm{BIC} = -2\,\ell(\hat\theta) + p^* \log n.$$
For the linear model, $p^* = p + 2$, as the intercept term and the unknown variance
must also be included. The BIC can be thought of as a measure of the goodness of
fit (the log-likelihood) penalised according to the number of parameters used.
The lower the BIC, the “better” the model.
Aside: for those taking Statistical Fundamentals II, the BIC is an asymptotic
approximation to minus twice the logarithm of the marginal likelihood of the
model, that is valid when n is much larger than p. Comparing the BICs of two
models is like comparing twice their (log) Bayes factors, so a difference of at least
2 is usually deemed worth mentioning; a difference of 20 is substantial.
Aside 2: Akaike's Information Criterion (AIC $= -2\,\ell(\hat\theta) + 2p^*$) is also commonly
used.
In the barley example we can find the BIC of all the main effects models:
BIC(M0)
## [1] 514.7327
c(BIC(M1.1), BIC(M1.2), BIC(M1.3))
## [1] 437.6738 514.4449 502.6301
c(BIC(M2.12), BIC(M2.13), BIC(lm(yield~fert+strain,data=Barley)))
## [1] 441.1049 339.8432 500.7593
BIC(M3.123)
## [1] 340.3385
The BIC tells the same story as the likelihood-ratio tests: the model with light
and strain is the best, but the model with all three main effects is only slightly
worse; however, we can also clearly see that these two models are much better
than any of the competitors.
Trying each of the three main effects interactions with the full main-effects model
gives
BIC(lm(yield~light+fert+strain+light*fert,data=Barley))
## [1] 344.5563
BIC(lm(yield~light+fert+strain+light*strain,data=Barley))
## [1] 348.0403
BIC(lm(yield~light+fert+strain+fert*strain,data=Barley))
## [1] 338.8033
It appears that the model with all main effects and the interaction between
fertiliser and strain is the best so far. Thus it may be that the improvement in
yield caused by increasing the amount of fertiliser is different for the different
strains. This would fit with the boxplot of residuals against strain at the start of
this chapter.
We could check all possible combinations of sets of two-way interactions; the
interested student is welcome to do this!
5.10 Model interpretation II
We explain by example how to interpret the coefficients for a model with one
or more interaction terms and how to calculate standard errors that R does not
provide directly.
Imagine that we have decided that the model which contains all main effects and
the interaction between fert and strain is the best model for the barley data.
The coefficients are:
M3.123.inter23=lm(yield~light+fert+strain+fert*strain,data=Barley)
S3.123.inter23=summary(M3.123.inter23)
S3.123.inter23$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.19740536 0.88089790 26.33381832 2.159479e-36
## light 0.96256702 0.03741777 25.72486210 8.666307e-36
## fert 0.24353835 0.10187332 2.39059990 1.972722e-02
## strainB -7.32279762 1.08060588 -6.77656654 4.297264e-09
## strainC -3.00953571 1.08060588 -2.78504474 7.004272e-03
## fert:strainB -0.38096571 0.14276495 -2.66848211 9.610194e-03
## fert:strainC 0.01100571 0.14276495 0.07708975 9.387890e-01
• Every extra hour of light leads to 0.963 (SE = 0.0374) extra Kg/plot of barley.
• Since there is an interaction term, the fert coefficient of 0.244 (SE = 0.102) is
the increase in barley yield (Kg/plot) per extra Kg/ha of fertiliser for Strain A only.
For Strain B the estimated increase is fert+fert:strainB = -0.137; i.e. a decrease.
For Strain C the estimated increase is fert+fert:strainC = 0.255.
• When there is no fertiliser, whatever the light level, Strain B yields 7.32
(SE = 1.08) fewer Kg/plot than Strain A, and Strain C yields 3.01 (SE = 1.08)
fewer than Strain A.
R does not directly report the standard errors for the terms we have summed, such
as fert+fert:strainB. We can do this by hand using the variance-covariance
matrix for the coefficients and the method for obtaining the variance of a linear
combination of parameter estimates.
a1=c(0,0,1,0,0,1,0) ## fert and fert:strainB
a2=c(0,0,1,0,0,0,1) ## fert and fert:strainC
V=vcov(M3.123.inter23)
v1=(t(a1) %*% V %*% a1)[1,1] ## get the element of the 1x1 matrix
v2=(t(a2) %*% V %*% a2)[1,1] ## get the element of the 1x1 matrix
c(sqrt(v1),sqrt(v2)) ## standard errors
## [1] 0.1018733 0.1018733
The final model can also be used to predict the yield for a new combination of
light, fertiliser and strain. For the point estimate, we use the invariance principle,
just as we did to obtain $\hat{y}$. To obtain the standard error of the prediction we use
exactly the same method as above.
mylight=6; myfert=10; mystrainB=1; mystrainC=0;
a=c(1,mylight,myfert,mystrainB,mystrainC,myfert*mystrainB,myfert*mystrainC)
predicted=sum(a*S3.123.inter23$coef[,1]) ## invariance principle
v=(t(a) %*% V %*% a)[1,1]
c(predicted, sqrt(v)) ## estimate and SE
## [1] 20.2757362 0.6723412
Chapter 6
Generalised linear models and
beyond
In the previous chapter we considered the linear model, where the response $y_i$
could be modelled linearly in terms of a covariate vector $x_i$ and the distribution
conditional on the covariates was Gaussian:
$$Y_i \overset{\text{indep}}{\sim} N\!\left(x_i^\top \beta,\, \sigma^2\right).$$
It will be helpful to reformulate this as follows:
$$Y_i \overset{\text{indep}}{\sim} N\!\left(\eta_i,\, \sigma^2\right),$$
where
$$\eta_i = x_i^\top \beta$$
is called the linear predictor.
In this chapter we maintain the linear combination of covariates, but extend to
the vast array of non-Gaussian distributions.
Many aspects of such models, such as the MLEs, are intractable, so the models
must be fitted numerically. However, in certain cases, known as the
exponential family, some aspects are tractable or the likelihood is guaranteed
to be well behaved, and we investigate the resulting generalised linear models in
more detail.
Since the vector of covariates still plays an important role in predicting the
response, many concepts covered in the previous chapter apply equally well here:
• The design matrix X.
• Residuals - these are often helpful, however the details change; in particular
we will use the Pearson residual rather than the ordinary residual.
• Leverage via the hat matrix is still a helpful concept, but the connection
between leverage and the variance of the observed residuals is lost.
• Confounding and interactions.
• Diagnostics and model improvements: non-linearity of covariates, period-
icity and interactions can be accounted for in an analogous manner. Some
diagnostics require more care.
• Model choice: all of the techniques used previously carry over.
Unlike the special case of the linear model, the exact distributions for Wald
and likelihood-ratio statistics are not tractable, and we revert to the standard
asymptotics: all Wald p-values are via the asymptotic Gaussian distribution and
all likelihood-ratio p-values are via the asymptotic χ2 distribution.
6.1 Count data
6.1.0.1 Ants data
The Ants data response, Srich, is ant species richness (number of ant species)
found in 64 one-square-metre sampling grids, in 22 bogs and 22 forests surrounding
the bogs, in Connecticut, Massachusetts and Vermont (USA). The sites span
3° of latitude in New England. There are n = 44 observations on four variables,
plus the site identifier.
• Site: An abbreviation for the name of the site;
• Srich: The species richness (number of species) for ants;
• Habitat: Either Forest or Bog;
• Latitude: The latitude (in decimal degrees) for the site;
• Elevation: The elevation of the site, in metres above sea level.
The graphs below shows the species richness as a function of latitude (left) and
elevation (right), with forest coloured green.
[Figure: species richness (Srich) plotted against Latitude (left) and against Elevation (right), with forest sites coloured green.]
Clearly there is greater abundance in forests, and the abundance also seems to
be higher at lower elevations and lower latitudes, but how do we model this?
6.1.1 Defining an appropriate model
The response, $Y_i$, is a count of the number of species. The linear model would
suggest
$$Y_i \overset{\text{indep}}{\sim} N(\eta_i, \sigma^2),$$
where
$$\eta_i = \beta_0 + \beta_1 \mathbb{1}_{\mathrm{Bog},i} + \beta_2\,\mathrm{Latitude}_i + \beta_3\,\mathrm{Elevation}_i.$$
However, we would usually expect counts that are likely to be large (have a
large expectation) to also have a large variance. Further, counts are integers and
cannot be negative.
One natural model for count data is the Poisson distribution:
$$Y_i \overset{\text{indep}}{\sim} \mathrm{Poisson}(\lambda_i);$$
but how should we relate $\lambda_i$ to $\eta_i = x_i^\top\beta$?
• $\lambda_i = \eta_i$ is tempting in its simplicity, but $\eta_i \in \mathbb{R}$, whereas $\lambda_i > 0$.
We need a function that converts a real number to a positive number. The simplest
such function is $\lambda_i = \exp(\eta_i)$, and this is by far the most commonly used. The
linear predictor is connected to the Poisson parameter as
$$\eta_i = \log \lambda_i,$$
and this is known as a log link.
6.1.2 Log likelihood
The likelihood is
$$L(\beta; y) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i}}{y_i!} \exp(-\lambda_i) \propto \prod_{i=1}^{n} \lambda_i^{y_i} \exp(-\lambda_i).$$
So, the log-likelihood is
$$\ell(\beta; y) = \sum_{i=1}^{n} \left(y_i \log \lambda_i - \lambda_i\right) + \text{const} = \sum_{i=1}^{n} y_i \eta_i - \sum_{i=1}^{n} \exp(\eta_i) + \text{const}.$$
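This log-likelihood is straightforward to code directly; a minimal sketch (ignoring the additive constant), assuming a design matrix X (including the intercept column) and a vector of counts y:
## Poisson log-likelihood (up to an additive constant) for design matrix X and counts y
llPoisson<-function(beta,X,y) {
  eta=X %*% beta        ## linear predictors eta_i = x_i^T beta
  sum(y*eta-exp(eta))   ## sum_i y_i*eta_i - sum_i exp(eta_i)
}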
6.1.3 Fitting the model in practice
6.1.3.1 Ants data
Poisson regression with a log link is a special case of the generalised linear model,
so we may use the glm() function to fit the model. We must tell the function
which distribution to use for the response (Poisson). The log link is the default for
a Poisson response so we do not need to specify this explicitly.
AntsM3.123=glm(Srich~Habitat+Latitude+Elevation,family="poisson",data=Ants)
AntsS3.123=summary(AntsM3.123)
AntsS3.123$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.936812111 2.6214970400 4.553433 5.277739e-06
## HabitatForest 0.635438863 0.1195663840 5.314528 1.069343e-07
## Latitude -0.235793020 0.0616637999 -3.823848 1.313847e-04
## Elevation -0.001141079 0.0003748977 -3.043708 2.336820e-03
All coefficients look significant, and backwards fitting confirms this:
library("lmtest") ## gives the lrtest() function
AntsM2.12=glm(Srich~Habitat+Latitude,family="poisson",data=Ants)
AntsM2.13=glm(Srich~Habitat+Elevation,family="poisson",data=Ants)
AntsM2.23=glm(Srich~Latitude+Elevation,family="poisson",data=Ants)
lrtest(AntsM2.12,AntsM3.123)
## Likelihood ratio test
##
## Model 1: Srich ~ Habitat + Latitude
## Model 2: Srich ~ Habitat + Latitude + Elevation
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -105.32
## 2 4 -100.52 1 9.5938 0.001952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(AntsM2.13,AntsM3.123)$P[2]
## [1] 6.47308e-05
lrtest(AntsM2.23,AntsM3.123)$P[2]
## [1] 5.081168e-08
Instead, we could have built up the model from scratch using forward fitting. Next,
one would look for interactions via forward fitting (student exercise for Lab5).
Care must be taken to remember the link function when interpreting the coefficients. For the model fit we have:
• The linear predictor for Forest is 0.635 higher than for Bog, so the expected
number of species is a factor of exp(0.635) ≈ 1.89 larger for Forest than for Bog.
• The linear predictor decreases by 0.236 for every degree of latitude further north,
so the expected number of species changes by a factor of exp(−0.236) ≈ 0.79 per degree.
Confidence intervals for these quantities can be obtained by finding the Wald
confidence intervals on the scale of the linear predictor and then exponentiating
the lower and upper limit as described earlier in the module.
## 95% CI for E[no. species in forest]/E[no. species in bog]
## First find CI on linear predictor scale
MLE=AntsS3.123$coeff[2,1]
SE=AntsS3.123$coeff[2,2]
CIeta=MLE+qnorm(0.975)*SE*c(-1,1)
CIeta
## [1] 0.4010931 0.8697847
## Now transform
CI=exp(CIeta)
CI
## [1] 1.493456 2.386397
6.1.4 The offset
Suppose that the counts in each observation were collected over a different length
of time; for example the number of asthma attacks for some patients might be
counted over a week, but for others the count might be from a month. If asthma
attacks for a given patient occur at a certain rate/day on average then we would
naturally expect different counts when taken over different time periods, even for
the same patient.
Or some counts of numbers of dandelions might use 1 m$^2$ quadrats and others
might use 2 m$^2$ quadrats. In such cases, if at a particular site the expected count
is $\lambda$ per m$^2$, then for the larger quadrats the expected count would be $2\lambda$.
In general, there is some known factor, $T_i$, by which we expect the expected count to be
multiplied:
$$Y_i \sim \mathrm{Poisson}(\lambda_i T_i),$$
where $\log \lambda_i = \eta_i$ as before, and $T_i$ is, for example, the total time over which the count was
measured or the total area of the quadrat. Notice that
$$\log \mathrm{E}[Y_i] = \log\left(T_i \exp \eta_i\right) = \eta_i + \log T_i.$$
We therefore fit this model using an additive offset.
6.1.4.1 Ship data
This data set has 34 entries, corresponding to the observed combinations of type
of ship, year of construction and period of operation. Each row has information on
five variables as follows:
• ship type, coded 1-5 for A, B, C, D and E;
• year of construction (1960-64, 1965-70, 1970-74, 1975-79);
• period of operation (1960-74, 1975-79);
• months of service, ranging from 63 to 20,370;
• damage incidents, ranging from 0 to 53.
Again, since the response, damage (the number of incidents), is a set of counts, the
natural first model is Poisson regression for the response in terms of the three
categorical variables, with the months of service entering through an offset: if accidents
for a particular ship type do follow a Poisson process then we expect the number
of accidents to be proportional to the total months of service. Thus, the model
with all three covariates is fitted as:
logmonths=log(Ship$months)
ShipM3.123=glm(damage~type+construction+operation+offset(logmonths),
family="poisson",data=Ship)
ShipS3.123=summary(ShipM3.123)
ShipS3.123$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.40590156 0.2174441 -29.4599974 9.375082e-191
## typeB -0.54334430 0.1775899 -3.0595451 2.216734e-03
## typeC -0.68740165 0.3290442 -2.0890863 3.669996e-02
## typeD -0.07596142 0.2905787 -0.2614143 7.937730e-01
## typeE 0.32557946 0.2358794 1.3802793 1.675007e-01
## construction1965-69 0.69714043 0.1496413 4.6587419 3.181478e-06
## construction1970-74 0.81842658 0.1697736 4.8206949 1.430590e-06
## construction1975-79 0.45342664 0.2331704 1.9446148 5.182136e-02
## operation1975-79 0.38446696 0.1182721 3.2506982 1.151220e-03
To prove the point, let us just check that this is a much better fit than if no offset
had been used:
ShipM3.123.NoOffset=glm(damage~type+construction+operation,
family="poisson",data=Ship)
BIC(ShipM3.123); BIC(ShipM3.123.NoOffset)
## [1] 168.2988
## [1] 268.689
6.2 Binary and Binomial data
6.2.0.1 Dose-response data
A Phase I clinical trial aims to determine the minimum effective dose of a new
drug. Patients were randomly assigned to receive different doses of the drug. The
resulting data are:
Dose ($x_i$)          1.69  1.72  1.76  1.78  1.81  1.84  1.86  1.88
# patients ($m_i$)      59    60    62    56    53    59    62    60
# successes ($y_i$)      6    13    18    28    52    53    61    60
6.2.1 Defining an appropriate model
It makes no sense to model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where $\varepsilon_i \overset{\text{IID}}{\sim} N(0, \sigma^2)$:
• This does not account for the $m_i$.
We could create a new response $Y_i^* = Y_i/m_i$ and use a linear model for this;
however,
• $Y_i^*$ lies between 0 and 1, and the linear model does not respect this.
• If $p_i$ is the probability that an individual patient with dose $x_i$ responds positively
then $Y_i \sim \mathrm{Bin}(m_i, p_i)$. Thus
$$\mathrm{Var}[Y_i^*] = \frac{\mathrm{Var}[Y_i]}{m_i^2} = \frac{m_i p_i (1-p_i)}{m_i^2} = \frac{p_i(1-p_i)}{m_i} = \frac{\mathrm{E}[Y_i^*](1-p_i)}{m_i},$$
i.e., the variance depends on the expectation and is not constant.
Since the natural model is
$$Y_i \sim \mathrm{Bin}(m_i, p_i),$$
we should use this! But how should we relate $p_i$ to $\beta_0 + \beta_1 x_i$?
More generally, for a binomial model with a covariate vector $x_i$, how should we
relate $p_i$ to $\eta_i = x_i^\top \beta$?
• $p_i = \eta_i$ will not work as $\eta_i \in \mathbb{R}$ is not guaranteed to be between 0 and 1.
• $p_i = \exp(\eta_i)$ will not work as $\exp(\eta_i) \in \mathbb{R}^+$ is not guaranteed to be $\le 1$.
In fact, exactly what we need is $p_i = F(\eta_i)$, where $F$ is the cumulative distribution
function (cdf) of any continuous random variable.
• Firstly, cdfs are probabilities, so always lie between 0 and 1.
• Secondly, cdfs increase monotonically with their argument, so the larger η,
the larger the probability.
• Finally, cdfs of continuous random variables increase continuously; there are
no sudden jumps.
The two most commonly used cdfs are:
1.
$$p_i = F(\eta_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)},$$
which is the cdf of a standardised logistic random variable. This specification
leads to logistic regression. Inverting the expression for $p_i$, we obtain
$$\eta_i = \log \frac{p_i}{1 - p_i}.$$
This is known as the logit link.
2.
$$p_i = F(\eta_i) = \Phi(\eta_i),$$
where $\Phi$ is the cdf of a standard Gaussian random variable. This specification
leads to probit regression. Inverting the expression for $p_i$, we obtain
$$\eta_i = \Phi^{-1}(p_i).$$
This is known as the probit link.
The odds of an event occurring, where the event has a probability of $p$, are $p/(1-p)$,
so logistic regression models the log-odds of the event via a linear predictor.
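In R the logistic cdf and its inverse (the logit) are plogis() and qlogis(), and the Gaussian cdf and its inverse are pnorm() and qnorm(); a quick sketch of the two links:
eta=seq(-4,4,by=0.5)
p.logit=plogis(eta)    ## exp(eta)/(1+exp(eta))
p.probit=pnorm(eta)    ## Phi(eta)
## applying the links recovers eta (differences are numerically zero)
range(qlogis(p.logit)-eta); range(qnorm(p.probit)-eta)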
6.2.2 Log likelihood
For any binomial regression, the likelihood is
$$L(\beta; y) = \prod_{i=1}^{n} \frac{m_i!}{y_i!(m_i - y_i)!}\, p_i^{y_i} (1 - p_i)^{m_i - y_i} \propto \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{m_i - y_i}.$$
So, ignoring the additive constant, the log-likelihood for logistic regression is
$$\begin{aligned}
\ell(\beta; y) &= \sum_{i=1}^{n} \left\{ y_i \log p_i + (m_i - y_i) \log(1 - p_i) \right\} \\
&= \sum_{i=1}^{n} y_i \log \frac{p_i}{1 - p_i} + \sum_{i=1}^{n} m_i \log(1 - p_i) \\
&= \sum_{i=1}^{n} y_i \eta_i - \sum_{i=1}^{n} m_i \log\left(1 + \exp(\eta_i)\right),
\end{aligned}$$
as $1 - p_i = 1/(1 + \exp(\eta_i))$.
Probit regression only differs in how $\eta_i$ relates to $p_i$, so the log-likelihood
(ignoring the additive constant) is
$$\begin{aligned}
\ell(\beta; y) &= \sum_{i=1}^{n} y_i \log \frac{p_i}{1 - p_i} + \sum_{i=1}^{n} m_i \log(1 - p_i) \\
&= \sum_{i=1}^{n} y_i \left\{ \log \Phi(\eta_i) - \log \Phi(-\eta_i) \right\} + \sum_{i=1}^{n} m_i \log \Phi(-\eta_i),
\end{aligned}$$
as $1 - p_i = \Phi(-\eta_i)$.
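As with the Poisson case, the logistic log-likelihood is easy to code directly; a minimal sketch (up to the binomial-coefficient constant) for a single covariate, using the column names of the dose-response data as they appear in the next subsection:
## Logistic-regression log-likelihood (up to the binomial-coefficient constant)
llLogistic<-function(beta,x,y,m) {
  eta=beta[1]+beta[2]*x
  sum(y*eta-m*log(1+exp(eta)))
}
## e.g. llLogistic(c(-64.4,36.4),DoseResponse$dose,DoseResponse$response,DoseResponse$m)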
6.2.3 Fitting the models in practice
6.2.3.1 Dose-response data
Both logistic regression and probit regression are special cases of the generalised
linear model so, again, we may use the glm() function to fit the model. The full
response consists of the number of successes and the number of failures for each
category, so we must create a new response variable that includes this.
DoseResponse$fullresponse=cbind(DoseResponse$response,
DoseResponse$m-DoseResponse$response) ## successes and failures
## Fit the logistic and probit regressions
DRLM1=glm(fullresponse~dose,family=binomial(link="logit"),
data=DoseResponse)
DRPM1=glm(fullresponse~dose,family=binomial(link="probit"),
data=DoseResponse)
DRLS1=summary(DRLM1)
DRPS1=summary(DRPM1)
DRLS1$coeff
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -64.4258 5.632030 -11.43918 2.663945e-30
## dose 36.4487 3.174349 11.48226 1.619873e-30
DRPS1$coeff
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.32038 2.781595 -13.05739 5.767051e-39
## dose 20.55919 1.566499 13.12429 2.390244e-39
The covariate effects in a logistic regression are most easily described on the
scale of the linear predictor, i.e., in terms of the log-odds. Here an increase
in dose of 0.01 corresponds to an increase in the log-odds of success of 0.364.
Sometimes, however, the change in the probability is helpful. The change in
probability depends on the value of the covariate(s), so you could look
at two or more separate covariate combinations and calculate the probabilities
associated with each. The same idea applies to probit regression.
Confidence intervals for each of the probabilities can be obtained as follows.
Consider a covariate combination $x_k$:
• Calculate a CI for the linear predictor $\eta_k = x_k^\top \hat\beta$ as you would for any linear
combination of the parameter estimates; see earlier.
• Transform the limits of the CI via the inverse of the link function, which is monotonic
by choice; see earlier and the sketch below.
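For example, a minimal sketch of a 95% CI for the success probability at dose 1.80 (an arbitrary illustrative value) under the logistic fit DRLM1:
a=c(1,1.80)                              ## intercept and dose=1.80
eta=sum(a*coef(DRLM1))                   ## point estimate of the linear predictor
se=sqrt((t(a) %*% vcov(DRLM1) %*% a)[1,1])
CIeta=eta+qnorm(c(0.025,0.975))*se       ## CI on the linear-predictor scale
plogis(CIeta)                            ## transform via the monotonic inverse link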
Whichever link is used, the model shows a clear increase in the probability of a
positive response as the dose increases. The difference in the β1 coefficient is
simply an artefact of the different link functions; notice that whilst the β1 for the
probit regression is around half the β1 for logistic regression, there is the same
relationship between the standard errors, so the z-values are similar.
Indeed, we can check to see if one model is preferred over the other:
## Compare fits via BIC
BIC(DRLM1); BIC(DRPM1)
## [1] 57.49727
## [1] 57.44258
The summary function also directly provides an alternative quantity, AIC, which we
can also access as follows:
## Compare fits via AIC
AIC(DRLM1); AIC(DRPM1)
## [1] 57.33839
## [1] 57.2837
AIC or BIC? Both BIC and AIC tell us that in this case there is no reason to choose
one of the link functions over the other. Between nested models, AIC is more
consistent with the results of a likelihood ratio test; however, when comparing
many different models (i.e., covariate combinations) for a large data set, rather
than just a few models, BIC penalises overly complex models more severely than
AIC does. If you are prepared to have some spurious terms in your final model as
long as the predictions are good, then AIC might be preferable; if you would like a
simpler, more interpretable model where spurious terms are unlikely then BIC is
better.
6.2.3.2 Titanic data
Data from a subset of passengers of the Titanic (which famously sank on its
maiden voyage in 1912) contains information on each passenger. The response
is whether or not they survived, the covariates are age, gender, passenger class
(i.e. 1st, 2nd or 3rd), ticket fare, port of embarkation, number of siblings/spouse
on board, There are 705 observations in the data set, but 657 distinct covariate
patterns.
Since the response is binary, the R code to fit a model is slightly more straightforward. We fit a logistic regression to illustrate this:
TiLM6=glm(Survived~Pclass+Sex+Age+SibSp+Fare+Embarked,
family=binomial(link="logit"),data=Titanic)
summary(TiLM6)$coeff
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.7611816246 0.640063570 9.0009522 2.237683e-19
## Pclass -1.2583494982 0.165735372 -7.5925223 3.137376e-14
## Sexmale -2.6210956828 0.218705610 -11.9845837 4.279892e-33
## Age -0.0435746998 0.008253663 -5.2794379 1.295808e-07
## SibSp -0.3787982946 0.124988636 -3.0306619 2.440183e-03
## Fare 0.0007086487 0.002403493 0.2948412 7.681152e-01
## EmbarkedQ -0.7899097852 0.599163217 -1.3183549 1.873849e-01
## EmbarkedS -0.3889160475 0.270992370 -1.4351550 1.512429e-01
For a full analysis, we would perform model choice for the main effects and then
look at interactions. However, even from this simple fit we can see that males
were much less likely to survive than females, that the lower the passenger class
ticket the less likely the passenger was to survive, and older people were less
likely to survive. The pattern with siblings seems odd at first glance, and would
warrant further investigation.
6.3 Single-parameter exponential family
Consider any density or mass function $f$ that satisfies
$$\log f(y \mid \theta) = \theta T(y) - A(\theta) + \text{const},$$
where the constant can depend on $y$ but not on $\theta$. The only term that contains
both the parameter $\theta$ and the data $y$ appears as the product $\theta T(y)$. The set of
probability models with log densities or log mass functions of this form is known
as the single-parameter exponential family.
The Poisson and binomial log mass functions are
$$\log f_{\mathrm{Poisson}}(y \mid \theta) = y \log\lambda - \lambda + \text{const}, \qquad
\log f_{\mathrm{Bin}}(y \mid \theta) = y \log\frac{p}{1-p} + m\log(1-p) + \text{const},$$
so both are in the single-parameter exponential family, with $\theta = \log\lambda$ and $\theta =
\log(p/(1-p))$ respectively. The parameter $\theta$ is called the natural parameter.
For a Gaussian model with known $\sigma$, we have
$$\begin{aligned}
\log f_{N}(y \mid \mu, \sigma = \text{known}) &= -\tfrac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2\sigma^2}(y - \mu)^2 \\
&= -\frac{y^2}{2\sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma + \text{const} \\
&= \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} + \text{const},
\end{aligned}$$
so this is also in the single-parameter exponential family with $T(y) = y$ and natural
parameter $\theta = \mu/\sigma^2$ (remember that $\sigma^2$ is a known number).
The density of a $\mathrm{Gamma}(\alpha, \beta)$ distribution is
$$f_{\mathrm{Gamma}}(y \mid \alpha, \beta) = \frac{1}{\Gamma(\alpha)} \beta^{\alpha} y^{\alpha-1} \exp(-\beta y).$$
The expectation is $\psi = \alpha/\beta$, so $\beta = \alpha/\psi$ and the density in terms of $\alpha$ and $\psi$ is
$$f_{\mathrm{Gamma}}(y \mid \alpha, \psi) = \frac{1}{\Gamma(\alpha)} \left(\frac{\alpha}{\psi}\right)^{\alpha} y^{\alpha-1} \exp(-\alpha y/\psi).$$
Thus the log density when $\alpha$ is known is
$$\log f_{\mathrm{Gamma}}(y \mid \alpha = \text{known}, \psi) = -\frac{\alpha y}{\psi} - \alpha \log\psi + \text{const}.$$
This is also in the single-parameter exponential family with $T(y) = y$ and natural
parameter $\theta = -\alpha/\psi$ (remember that $\alpha$ is a known number).
6.3.1 Expectation and variance
The expectation and variance of the random variable $T(Y)$ also have a simple form:
$$\mathrm{E}[T(Y)] = A'(\theta) \quad \text{and} \quad \mathrm{Var}[T(Y)] = A''(\theta).$$
To show this, write the constant term as $\log c(y)$ and note that, since $f$ is a density,
$$\int \exp\{\theta T(y) - A(\theta)\}\, c(y)\, dy = 1,$$
where the integral is over the range of $y$, or is a sum if the random variable
is discrete. Whichever the case, provided the range does not depend on $\theta$,
differentiating with respect to $\theta$ gives
$$\int \left\{T(y) - A'(\theta)\right\} \exp\{\theta T(y) - A(\theta)\}\, c(y)\, dy = 0,$$
or, equivalently,
$$\mathrm{E}\left[T(Y) - A'(\theta)\right] = 0,$$
proving the first part. Differentiating the original expression twice rather than just
once gives
$$\int \left[-A''(\theta) + \left\{T(y) - A'(\theta)\right\}^2\right] \exp\{\theta T(y) - A(\theta)\}\, c(y)\, dy = 0,$$
or
$$\mathrm{E}\left[-A''(\theta)\right] + \mathrm{E}\left[\left\{T(Y) - A'(\theta)\right\}^2\right] = 0.$$
Since $\mathrm{E}[T(Y) - A'(\theta)] = 0$, this is equivalent to
$$-A''(\theta) + \mathrm{Var}\left[T(Y) - A'(\theta)\right] = 0,$$
and $\mathrm{Var}[T(Y) - A'(\theta)] = \mathrm{Var}[T(Y)]$ because $A'(\theta)$ is a constant,
as required.
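For example, for the Poisson distribution the natural parameter is $\theta = \log\lambda$, so $A(\theta) = \exp(\theta) = \lambda$ and $A'(\theta) = A''(\theta) = \lambda$, recovering the familiar $\mathrm{E}[Y] = \mathrm{Var}[Y] = \lambda$. A quick numerical sketch of the identities:
lambda=3; theta=log(lambda)
A<-function(theta) exp(theta)   ## A(theta) for the Poisson, since A(theta)=lambda
h=1e-4
Aprime=(A(theta+h)-A(theta-h))/(2*h)                 ## numerical A'(theta)
Adoubleprime=(A(theta+h)-2*A(theta)+A(theta-h))/h^2  ## numerical A''(theta)
y=rpois(1e5,lambda)
c(Aprime,mean(y)); c(Adoubleprime,var(y))            ## each pair should be close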
The exponential family has many other interesting and useful properties; see the
Wikipedia page, for example. The exponential family extends to more than one
parameter: the $N(\mu, \sigma^2)$ and $\mathrm{Gamma}(\alpha, \beta)$ distributions, with both parameters
unknown, are members of the two-parameter exponential family.
6.4 The generalised linear model
A generalised linear model (GLM) is a regression model for data $y_1, \dots, y_n$ given
covariates $x_1, \dots, x_n$ such that
$$\eta_i = x_i^\top \beta, \qquad Y_i \mid \eta_i \overset{\text{indep}}{\sim} \text{Exponential family}\!\left(g^{-1}(\eta_i)\right),$$
where $g$ is a differentiable, monotone function called the link function.
Generalised linear models extend to the two-parameter exponential family, and
therefore include linear regression, but we will not pursue the details here.
6.4.1 The canonical link
In all cases of the single-parameter exponential family, the log-likelihood for $n$
data values $y_1, \dots, y_n$ is
$$\ell(\beta; y) = \sum_{i=1}^{n} \theta_i T(y_i) - \sum_{i=1}^{n} A(\theta_i) + \text{const},$$
where $g(\theta_i) = \eta_i = x_i^\top \beta$.
In the special case where $\theta_i = \eta_i = x_i^\top\beta$, i.e. the natural parameter is the linear
predictor, the link function $g$ is called the canonical link. In the case of logistic
regression we have
$$\eta_i = g(p_i) = \log\frac{p_i}{1-p_i},$$
and for Poisson regression we have
$$\eta_i = g(\lambda_i) = \log\lambda_i,$$
so these both use the canonical link. Linear regression with a known variance has
$$\eta_i = g(\mu_i) = \mu_i,$$
which is called the identity link. This is essentially the canonical link, which is
$\eta_i = \mu_i/\sigma^2$.
Probit regression does not use the canonical link. For a form of gamma regression
it would not necessarily be sensible to use the canonical link of $\eta_i = -\alpha/\psi_i$, since
$\eta_i$ can take any value in $\mathbb{R}$, whereas $\psi_i$ and $\alpha$ must be positive.
When it is sensible to use, the canonical link does lead to certain advantages, two
of which we now investigate.
6.4.2 Fisher information with canonical link
With the canonical link, the log-likelihood for the GLM is
$$\ell(\beta; y) = \sum_{i=1}^{n} \eta_i T(y_i) - \sum_{i=1}^{n} A(\eta_i) + \text{const},$$
where $\eta_i = x_i^\top\beta$. For a single observation $y_i$ with linear predictor $\eta_i$ we have
$$\ell_i(\beta; y_i) = \eta_i T(y_i) - A(\eta_i) + \text{const};$$
the log-likelihood is then the sum of these. The Fisher information matrix will,
therefore, be the sum of the Fisher information matrices for all the observations.
For this single observation, the $j$th component of the score is
$$U_j = \frac{\partial \ell_i}{\partial \beta_j} = \frac{\partial \eta_i}{\partial \beta_j}\frac{d\ell_i}{d\eta_i} = X_{i,j} T(y_i) - A'(\eta_i) X_{i,j}.$$
Hence, the $(j,k)$ component of the Hessian is
$$\frac{\partial^2 \ell_i}{\partial \beta_j \partial \beta_k} = \frac{\partial U_j}{\partial \beta_k} = \frac{\partial \eta_i}{\partial \beta_k}\frac{\partial U_j}{\partial \eta_i} = -A''(\eta_i) X_{i,j} X_{i,k}.$$
From this, the $(j,k)$ element of the observed Fisher information matrix for all $n$
observations is
$$\left[I_O(\beta; y)\right]_{j,k} = \sum_{i=1}^{n} A''(\eta_i)\, [X^\top]_{j,i}\, X_{i,k}.$$
This does not depend on the data, $y$, so it is also the expected Fisher information;
i.e.,
$$I_E(\beta) = I_O(\beta; y) = X^\top D X,$$
where
$$D = \mathrm{diag}\left(A''(\eta_1), \dots, A''(\eta_n)\right).$$
The second property is arguably more important, and arises directly from the form
of $I_O$. Since $A''(\eta_i)$ is $\mathrm{Var}[T(Y_i)] > 0$, the diagonal entries of $D$ are positive
whatever the value of $\beta$. This implies that $I_O$ is positive definite everywhere, so $\ell$
is concave and the likelihood has a single maximum.
This makes numerically finding the maximum via a standard optimisation algorithm much more reliable.
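For Poisson regression with the log (canonical) link, $A''(\eta_i) = \exp(\eta_i) = \lambda_i$, so $D$ is the diagonal matrix of fitted means. A sketch checking $X^\top D X$ against the ants fit; the inverse of the information should match vcov() up to numerical error:
X=model.matrix(AntsM3.123)
D=diag(fitted(AntsM3.123))             ## A''(eta_i)=lambda_i for the Poisson log link
Info=t(X) %*% D %*% X                  ## Fisher information X^T D X
max(abs(solve(Info)-vcov(AntsM3.123))) ## should be very close to zero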
6.5 Other models
Sometimes a probability distribution that is not in the exponential family might be
appropriate for the data. For example, (and for a sound theoretical reason) ex-
treme values such as annual maximum temperatures or daily maximum pollution
levels are often modelled with the Gumbel distribution.
Alternatively it might be that some or all of the covariates influence a parameter
other than the centrality parameter. For example, it might be that the observations
are Gaussian but it is the standard deviation that is affected by covariates, rather
than the expectation.
Given a black-box optimiser such as optim() and a numerical evaluation of
the Hessian, all that is required for inference in these cases is an appropriate
formulation of the log-likelihood function. We illustrate this now using the above
two examples.
6.5.1 Gumbel data
Given parameters $\mu$ and $\lambda > 0$, the Gumbel density function is
$$f(y \mid \mu, \lambda) = \frac{1}{\lambda} \exp\left\{ -\frac{y-\mu}{\lambda} - \exp\left(-\frac{y-\mu}{\lambda}\right) \right\},$$
for $-\infty < y < \infty$.
6.5.1.1 Hottest temperatures in European countries
Below are the first few lines of a data set of the hottest temperatures recorded in
each of 29 European countries, together with the latitude and longitude at which
the temperature was recorded.
head(Hottest)
## country locn temp lat lon
## 1 Austria Bad Deutsch-Altenburg 40.5 48.13333 16.900000
## 2 Belgium Begijnendijk 41.8 51.01667 4.783333
## 3 Bulgaria Sadovo 45.2 42.13000 24.930000
## 4 Cyprus Nicosia 46.2 35.16667 33.350000
## 5 Czech Republic Dobrichovice 40.4 49.91667 14.266667
## 6 Denmark Holstebro 36.4 56.35000 8.600000
plot(Hottest$lon,Hottest$lat,pch=4,col="red",cex=.3,xlab="lon",ylab="lat")
text(Hottest$lon,Hottest$lat,Hottest$temp)
[Figure: map of the recording locations (longitude against latitude), each labelled with that country's hottest recorded temperature.]
We would like to fit a Gumbel model to these data, where the centrality parameter
$\mu_i$ depends on the latitude and longitude linearly: $\mu_i = x_i^\top \beta$.
The likelihood is
$$L(\beta, \lambda) = \prod_{i=1}^{n} \frac{1}{\lambda} \exp\left\{ -\frac{y_i-\mu_i}{\lambda} - \exp\left(-\frac{y_i-\mu_i}{\lambda}\right) \right\}
= \frac{1}{\lambda^n} \exp\left\{ -\sum_{i=1}^{n} \frac{y_i-\mu_i}{\lambda} - \sum_{i=1}^{n} \exp\left(-\frac{y_i-\mu_i}{\lambda}\right) \right\}.$$
So the log-likelihood is
$$\ell(\beta, \lambda) = -n\log\lambda - \frac{1}{\lambda}\sum_{i=1}^{n}(y_i - \mu_i) - \sum_{i=1}^{n} \exp\left(-\frac{y_i-\mu_i}{\lambda}\right).$$
n=nrow(Hottest)
Xhot=matrix(nrow=n,ncol=3)
Xhot[,1]=rep(1,n)
Xhot[,2]=Hottest$lat
Xhot[,3]=Hottest$lon
y=Hottest$temp
llGumbel<-function(thetas,X,ys) {
betas=thetas[1:3]
lambda=exp(thetas[4]) ## easier than dealing with negative values
mus=X %*% betas
n=length(mus)
standardised = (ys-mus)/lambda
ll=-n*log(lambda)-sum(standardised)-sum(exp(-standardised))
return(ll)
}
ctl=list(fnscale=-1)
## Roughly the average value seems like a reasonable start for beta0
GumbelFit=optim(c(40,0,0,3), llGumbel, control=ctl, X=Xhot,ys=y,hessian=TRUE)
GumbelFit$converge; GumbelFit$par
## [1] 0
## [1] 60.36787010 -0.44086966 0.09076025 0.61722032
The coefficient of lat is -0.441 and that of lon is 0.0908. The estimate of $\log\lambda$ is
0.617.
The estimated variance matrix of the parameter estimates is the inverse of the observed information, i.e. the negative of the inverse Hessian:
-solve(GumbelFit$hess)
## [,1] [,2] [,3] [,4]
## [1,] 4.41191092 -0.0814273655 -0.0191018793 0.0386756358
## [2,] -0.08142737 0.0015908557 0.0002241764 -0.0003015795
## [3,] -0.01910188 0.0002241764 0.0005575289 -0.0005093073
## [4,] 0.03867564 -0.0003015795 -0.0005093073 0.0186304948
In particular, the standard errors are 2.1, 0.0399, 0.0236, 0.136. Simple Wald tests
suggest that the latitude effect is real - the maximum temperature decreases as
latitude increases - and so is the (smaller) longitude effect - temperature increases
the further east a country is. Likelihood ratio tests (through fitting models with
only latitude or only longitude) could confirm this (student exercise).
6.5.2 Covariates for σ in the linear model
[Figure: the response $y$ plotted against the covariate $x$.]
The plot above shows a response, $y$, plotted against a covariate, $x$. It suggests
that the response expectation does not depend on $x$ but that the variance does.
Hence we fit the following model:
$$Y_i \sim N(\mu, \sigma_i^2),$$
where $\log \sigma_i = x_i^\top \beta$. Why is the log link sensible?
The likelihood is
$$L(\mu, \beta) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma_i^2}(y_i - \mu)^2 \right\}
\propto \left(\prod_{i=1}^{n}\sigma_i\right)^{-1} \exp\left\{ -\sum_{i=1}^{n} \frac{1}{2\sigma_i^2}(y_i - \mu)^2 \right\}.$$
The log-likelihood is, therefore,
$$\ell(\mu, \beta) = -\sum_{i=1}^{n} \log\sigma_i - \frac{1}{2}\sum_{i=1}^{n} \frac{1}{\sigma_i^2}(y_i - \mu)^2 + \text{const}.$$
In this particular case, we can even just use the dnorm() function as we sometimes
did in the first part of this module.
ll.sigmacov<-function(theta,x,y) {
mu=theta[1]
betas=theta[2:3]
sigmas=exp(betas[1]+betas[2]*x)
ll=sum(dnorm(y,mu,sigmas,log=TRUE))
return(ll)
}
SigCovFit=optim(c(0,0,0),ll.sigmacov,control=ctl,x=SigCov$xs,y=SigCov$ys,
hessian=TRUE)
SigCovFit$conv
## [1] 0
print(round(SigCovFit$par,3))
## [1] 2.952 0.976 0.461
V=-solve(SigCovFit$hess)
## Obtain all the CIs at once
CIslo=SigCovFit$par+sqrt(diag(V))*qnorm(0.025)
CIshi=SigCovFit$par+sqrt(diag(V))*qnorm(0.975)
print(round(cbind(CIslo,CIshi),3))
## CIslo CIshi
## [1,] 2.648 3.256
## [2,] 0.798 1.155
## [3,] 0.376 0.546
The data were simulated using values $\mu = 3$ and $\beta^\top = (1, 0.5)$, so the parameter
estimates are reasonable. A likelihood-ratio test could confirm, formally, the
necessity of the covariate in the variance term.
6.6 Model diagnostics
6.6.1 Leverage
As for the linear model, leverage can be used to check for observations that are
too influential. For the titanic data with all covariates, we have:
Xti=model.matrix(TiLM6)
Hti= Xti %*% solve(t(Xti) %*% Xti) %*% t(Xti)
lev=diag(Hti)
par(mfrow=c(1,2))
plot(lev,ylab="leverage")
plot(lev,Titanic$Fare,pch=4)
[Figure: left, the leverage of each observation (index plot); right, Titanic$Fare plotted against the leverage.]
Titanic$Fare[lev>0.1]
## [1] 512.3292 512.3292 512.3292
The left-hand plot shows the leverage, and that three observations are particularly
influential. Further investigation shows this to be due to Fare; these three
passengers paid a very high fare. One could exclude the observations, or keep
them but transform fare so that the observations are less influential. We try the latter
and demonstrate that this has solved the leverage issue.
logFare=log(Titanic$Fare)
TiLM6log=glm(Survived~Pclass+Sex+Age+SibSp+logFare+Embarked,
family=binomial(link="logit"),data=Titanic)
Xti=model.matrix(TiLM6log)
Hti= Xti %*% solve(t(Xti) %*% Xti) %*% t(Xti)
lev=diag(Hti)
plot(lev,ylab="leverage")
[Figure: index plot of the leverage of each observation for the model with logFare.]
As in the Gumbel example, for models other than GLMs the X matrix must be
constructed directly from your covariates (including the intercept).
6.6.2 Deviance
For both Poisson regression and binomial regression, imagine a model that allows
a parameter for every observation. For Poisson regression, each $y_i$ would have an
associated $\mu_i$, and we know from Chapter 2 that the MLE is $\hat\mu_i = y_i$. For binomial
regression each $(y_i, m_i)$ pair would have an associated $p_i$, which, again from
Chapter 2, would have MLE $\hat{p}_i = y_i/m_i$. In each case the model with $n$ such
parameters is called the saturated model.
Consider a null hypothesis of $H_0$: the model of interest contains all the covariates
that have an impact on the response, against $H_1$: $H_0$ is not true. The model of
interest (with $p$ covariates and, hence, $p+1$ parameters) is nested within the
saturated model and so, under $H_0$, asymptotically,
$$D = 2\left\{\ell_{\text{sat}} - \ell(\hat\beta)\right\} \sim \chi^2_{n-(p+1)},$$
where $\ell(\hat\beta)$ is the log-likelihood of the model of interest at its MLE and $\ell_{\text{sat}}$ is the
log-likelihood of the saturated model at its MLE.
The quantity $D$ is called the model deviance and provides a direct measure of the
goodness of fit of a Poisson or binomial model (or of any other model with a single
unknown parameter that is expressed in terms of covariates).
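For Poisson regression, $\ell_{\text{sat}}$ uses $\hat\mu_i = y_i$, so the deviance reduces to $D = 2\sum_i\{y_i\log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)\}$ (with the convention $0\log 0 = 0$). A minimal sketch checking this against the ants fit:
y=Ants$Srich; mu=fitted(AntsM3.123)
D=2*sum(ifelse(y==0,0,y*log(y/mu))-(y-mu))
c(D,AntsM3.123$deviance)   ## the two values should agree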
We can obtain these directly from the GLM fit
## Dose response
print(c(DRLM1$deviance, DRLM1$df.resid, qchisq(.95,df=DRLM1$df.resid)))
## [1] 29.21892 6.00000 12.59159
The model is a poor fit! In fact, there is an alternative link function that produces
a much better fit. Student exercise: look up the complementary log-log link and
try that.
## Ship
print(c(ShipM3.123$deviance, ShipM3.123$df.resid,
qchisq(.95,df=ShipM3.123$df.resid)))
## [1] 38.69505 25.00000 37.65248
print(c(ShipM3.123.NoOffset$deviance, ShipM3.123.NoOffset$df.resid,
qchisq(.95,df=ShipM3.123.NoOffset$df.resid)))
## [1] 139.08526 25.00000 37.65248
With the offset, the test is only just significant at the 5% level; perhaps we would
be happy with this, perhaps we would look for interactions, or perhaps we would
consider another response model, such as the negative binomial. Without the
offset, the fit is clearly terrible.
For binomial data when all, or nearly all, of the $m_i$ are 1, the deviance is unhelpful
since each $\hat{p}_i$ in the saturated model is either 0 or 1. The saturated model fits the data perfectly and
with no uncertainty, and the parameter values (0 and 1) are at the extreme ends
of the parameter space, so the regularity conditions are not satisfied and the
asymptotic $\chi^2$ distribution does not hold. This is the case for the Titanic data.
6.6.3 Residuals
In the case of the linear model, the model residuals are Gaussian with a constant
variance, so we check that this distribution is reasonable using a QQ plot of
the observed ($\equiv$ ordinary) residuals, and we look for patterns in the residuals as
functions of the covariates and the predicted values, with any pattern indicating a
potential issue.
For some models, the model residuals, or a transformation of them, have a fixed,
known distribution and a QQ plot can show whether or not the fit is reasonable.
In the case of regression on a binary response, the residuals are either $-\hat{p}_i$ or $1 - \hat{p}_i$,
and are unhelpful.
6.6.3.1 Hottest temperatures in European countries
For a Gumbel random variable $Y$, the cumulative distribution function is
$$F(y) = \exp\left\{ -\exp\left(-\frac{y-\mu}{\lambda}\right) \right\}$$
(differentiate and compare with the density to see this).
Let $Z = (Y - \mu)/\lambda$; then for any $z$,
$$P(Z \le z) = P\!\left(\frac{Y-\mu}{\lambda} \le z\right) = P(Y \le \mu + \lambda z) = F(\mu + \lambda z) = \exp\{-\exp(-z)\}.$$
That is, $Z$ has a Gumbel($\mu = 0$, $\lambda = 1$) distribution.
Armed with this knowledge, we calculate observed residuals as
$$\hat{\varepsilon}_i = \frac{y_i - \hat{\mu}_i}{\hat{\lambda}} = \frac{y_i - x_i^\top\hat{\beta}}{\hat{\lambda}},$$
and produce a QQ plot against a Gumbel(0,1) distribution.
etas=Xhot %*% GumbelFit$par[1:3]
resids.hot=(Hottest$temp-etas)/exp(GumbelFit$par[4])
par(mfrow=c(1,2))
hist(resids.hot)
n=length(etas) ## how many data points?
us=(1:n)/(n+1)
qtheory=-log(-log(us)) ## inverting the cdf
plot(qtheory,sort(resids.hot),xlab="theoretical quantile",ylab="observed quantile",pch=4)
abline(a=0,b=1)
[Figure: left, histogram of resids.hot; right, QQ plot of the sorted observed residuals against the Gumbel(0,1) theoretical quantiles, with the line y = x.]
Not perfect, but not a terrible fit; other extreme-value distributions, such as the
Weibull might be more appropriate. These residuals can and should also be plotted
against the covariates and predicted values (student exercise).
6.6.3.2 Covariates for σ in the linear model
In this particular model, $Y_i \sim N(\mu, \sigma_i^2)$, thus
$$\varepsilon_i = \frac{Y_i - \mu}{\sigma_i} = \frac{Y_i - \mu}{\exp(x_i^\top\beta)} \sim N(0, 1).$$
Thus the standard QQ plot against a normal distribution would be appropriate
(student exercise).
6.6.4 Poisson regression
In the case of Poisson regression there is no transformation to a standard distribution. However, consider $\mu_i = \lambda_i$ when there is no offset, or $\mu_i = \lambda_i T_i$ in the
presence of an offset, where $\lambda_i = \exp(x_i^\top\beta)$. Then
$$\varepsilon_i = \frac{Y_i - \mu_i}{\sqrt{\mu_i}}$$
satisfies
$$\mathrm{E}[\varepsilon_i] = 0 \quad \text{and} \quad \mathrm{Var}[\varepsilon_i] = 1.$$
Furthermore, the central-limit theorem suggests that when $\mu_i$ is large the $i$th
residual's distribution should be approximately normal. Thus we create the
Pearson residuals
$$r_i = \frac{y_i - \mu_i}{\sqrt{\mu_i}}.$$
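The formula, with $\mu_i$ evaluated at the fitted values, maps directly onto the residuals(..., type="pearson") extractor used in the next code block; a quick check for the ants fit:
mu=fitted(AntsM3.123)
r.byhand=(Ants$Srich-mu)/sqrt(mu)
max(abs(r.byhand-residuals(AntsM3.123,type="pearson")))  ## numerically zero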
Residuals can be obtained automatically for models fitted via the glm() function,
as we now demonstrate.
6.6.4.1 Ants data
For the ants data we have the model fit AntsM3.123. We demonstrate the model
checking with two of the three plots you would use.
resids=residuals(AntsM3.123,type="pearson")
AntsFitted=fitted(AntsM3.123) ## Fitted values on the scale of the original data
par(mfrow=c(1,2))
cols=(as.numeric(Ants$Habitat)-1)*2+1 ## black and green for 1 and 2
plot(Ants$Latitude,resids,pch=4,xlab="Latitude",ylab="residuals",col=cols)
plot(AntsFitted,resids,pch=4,xlab="Fitted",ylab="residuals",col=cols)
[Figure: Pearson residuals plotted against Latitude (left) and against the fitted values (right), coloured by Habitat.]
The lack of a pattern suggests a good fit.
There are other types of residuals, such as “deviance”. The interested student is
encouraged to investigate these further.
6.6.5 Binomial regression
With an analogous motivation (zero mean and unit variance), for binomial regression we create the Pearson residual
$$r_i = \frac{y_i - m_i p_i}{\sqrt{m_i p_i (1 - p_i)}}, \quad \text{where} \quad p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)},$$
and $\eta_i = x_i^\top \beta$.
These residuals are only helpful when $m_i$ is substantially greater than 1, so we
demonstrate them on the dose-response data set. For a linear model, the fitted
value and the predicted value are the same, but in general for a GLM they are
not. The fitted value is on the scale of the observations, whereas the predicted
value is on the scale of the linear predictor. We demonstrate this in our plots of
the residuals (one of these plots would be enough to check for any pattern; both
are shown here to exemplify the similarities and differences).
6.6.5.1 Dose-response data
resids=residuals(DRLM1,type="pearson")
DRLM1Fitted=fitted(DRLM1) ## On the scale of the original data
DRLM1Predicted=predict(DRLM1) ## On scale of linear predictor
par(mfrow=c(1,2))
plot(DRLM1Fitted,resids,pch=4,xlab="Fitted",ylab="residuals")
plot(DRLM1Predicted,resids,pch=4,xlab="Predicted",ylab="residuals")
[Figure: Pearson residuals plotted against the fitted values (left) and against the predicted values, i.e. the linear predictor (right).]
6.6.6 Calibration plots and the Hosmer-Lemeshow test
As we have seen, all of the above diagnostics of fit fail when the response is binary.
The Hosmer-Lemeshow test and the associated calibration plot are useful in this
case or, more generally in the case of a binomial response.
The key idea is that, even though we only have one observation per covariate
combination, we can group observations that have similar predicted probabilities.
For example, if we have 10 observations with $\hat{p}_i \approx 0.3$, then we would expect
around three of them to be positive.
To produce a calibration plot,
• find the fitted value (i.e., $\hat{p}_i$, the estimated probability of a 1) for each observation,
• sort the probabilities into ascending order,
• partition the vector into groups,
• estimate the success probability within each group by averaging its $\hat{p}_i$ values,
• find the fraction of the observations in the group that were actually 1,
• plot the vectors in the last two bullets against each other.
Let us do this for the Titanic data and the last model fit within these lecture notes
(which may not be the best!).
TitanFit=fitted(TiLM6log)
hist(TitanFit)
[Figure: histogram of the fitted probabilities TitanFit.]
It looks like we need more categories for the smaller probabilities.
bs=c(0,0.07,0.11,0.15,seq(0.2,1,by=0.1)) ## boundaries of the groups
ngrp=length(bs)-1 ## number of groups
ningroup=rep(0,ngrp); n1ingroup=rep(0,ngrp); meanpingroup=rep(0,ngrp)
for (i in 1:ngrp) {
these=((TitanFit<=bs[i+1])&(TitanFit>bs[i]))
ningroup[i]=sum(these)
n1ingroup[i]=sum(Titanic$Survived[these])
meanpingroup[i]=mean(TitanFit[these])
}
plot(meanpingroup,n1ingroup/ningroup,xlab="average P(success)",ylab="fraction of successes",pch=4,main="Calibration Plot")
abline(a=0,b=1)
[Figure: calibration plot - the observed fraction of successes in each group plotted against the group's average fitted P(success), with the line y = x.]
This does not look too unreasonable, but can we make this more rigorous?
Let $G$ be the number of groups, $k_g$ be the number of observations in group $g$ and $p_g$ be the average
fitted probability of success in group $g$. If the probability did not vary at all within a
group, then under $H_0$: the number of successes in each group $\sim \mathrm{Bin}(k_g, p_g)$, the
statistic
$$\sum_{g=1}^{G} \frac{(O^S_g - E^S_g)^2}{E^S_g} + \sum_{g=1}^{G} \frac{(O^F_g - E^F_g)^2}{E^F_g} \sim \chi^2_{G-2},$$
where $E^S_g = k_g p_g$ is the expected number of successes in group $g$, $E^F_g = k_g(1 - p_g)$
is the expected number of failures, and $O^S_g$ and $O^F_g$ are the observed numbers of
successes and failures respectively.
The above distribution also only holds provided none of the $E_g$ is below 1 and
fewer than 20% of the $E_g$ are below 5.
We can construct this statistic and conduct the test, which is called the Hosmer-
Lemeshow test:
ESs=ningroup*meanpingroup
OSs=n1ingroup
EFs=ningroup-ESs
OFs=ningroup-OSs
T=sum((ESs-OSs)^2/ESs)+sum((EFs-OFs)^2/EFs)
print(c(T,qchisq(0.95,df=ngrp-2),1-pchisq(T,df=ngrp-2)))
## [1] 3.317814e+01 1.830704e+01 2.542303e-04
We reject the null hypothesis; the number of successes in a group is not determined by the covariates as we have them. But, are the assumptions of the
Hosmer-Lemeshow test satisfied?
ESs; EFs
## [1] 3.742941 7.936213 11.115211 6.054233 14.175766 21.548720 15.836334
## [8] 23.498476 32.885465 34.800801 48.482988 66.922851
## [1] 72.257059 80.063787 75.884789 28.945767 41.824234 39.451280 19.163666
## [8] 18.501524 18.114535 11.199199 8.517012 4.077149
Yes: only two of the twenty-four values are below 5, and none is below 1.
Perhaps a different link would be better, or one or more interactions are needed?
Or perhaps it is simply not possible to predict sufficiently accurately with the
covariates that we have?
Looking at the vector of values $(E_g - O_g)^2/E_g$, the biggest contributors to the statistic are the
second group (for successes) and the second-to-last group (for failures). Delving
down further we see that the former contains 15 successes when the expected
number is 7.93, and the latter contains 2 failures when the expected number is
8.52. In both groups the model underpredicts successes. The interested student
could take this analysis further by looking at which passengers are in these groups.
This issue will be revisited in the workshop.
6.7 Steps in a model fit
The exact details of finding an appropriate model will differ from one problem to
another; however, as we have seen, there are certain steps that should always be
taken.
1. Exploratory data analysis. Examine the response and covariates via plots
and summaries. How many data points and how many covariates? What
patterns are there? From the plots/science, what might you expect to be
important terms in any model fit?
2. Choose an appropriate model formulation. Given the nature of the
response, the patterns observed and any relevant science, decide on the
model type. For example, logistic regression, or a linear model, or an extreme-
value model etc.
3. Check for influential observations. Look at the leverage; are there any
massively influential observations? If so, decide what to do about these
(e.g. remove the observations or transform one or more covariates).
4. Find a good model. Use forward fitting, backward fitting, AIC or BIC as
appropriate. Check for interactions.
5. Model diagnostics. Examine various residual plots or other diagnostics
such as the deviance or a calibration plot. Are there any outstanding issues?
If so, deal with them and restart the whole procedure. If you have included
interactions then you should also re-examine the leverage, including the
interaction effects that you have found important.
6. Use the model. Interpret the coefficients; make predictions at covariate
combinations of interest.
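As promised above, here is a minimal R sketch of how steps 2-6 might look for a logistic regression. The data frame dat, response y and covariates x1 and x2 are placeholder names; the exact calls will depend on the problem at hand.
fitfull=glm(y~x1*x2,family=binomial,data=dat)                    # step 2: choose a GLM
plot(hatvalues(fitfull))                                         # step 3: any large leverages?
fit0=glm(y~1,family=binomial,data=dat)                           # null model
fitbest=step(fit0,scope=formula(fitfull),direction="both")       # step 4: AIC-based search
plot(residuals(fitbest,type="deviance"))                         # step 5: residual diagnostics
summary(fitbest)                                                 # step 6: interpret coefficients
predict(fitbest,newdata=data.frame(x1=1,x2=0),type="response")   # ... and predict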
6.8 Further topics
The last two chapters have covered the basics of linear models and generalised
linear models. Many more aspects of the linear model are tractable, and the
interested student is encouraged to explore these. The linear and generalised
linear models can also be extended in a number of directions; here are but a few:
• p > n, such as in genome-wide association studies. The $X^\top X$ matrix is automatically singular. Ridge regression is a tractable solution (see the sketch after this list); the LASSO is not tractable in closed form but is often preferred as it shrinks many of the unimportant covariate effects to exactly 0.
• Rather than a linear covariate effect, or a specific covariate transformation (such as $x^2$ or $\sqrt{x}$), one might wish not to specify the form in advance and instead allow a general shape. The additive model allows a mixture of prespecified and flexible covariate effects within the linear model framework; the Generalised Additive Model (GAM) allows this within the GLM framework.
• Multiple measurements may be made on each of a number of subjects. Two
subjects with the same covariates might have very different mean responses
or different covariate effects. Random effects models can take this into
account; see the optional module on Longitudinal and hierarchical modelling
next term.
• Measurements of temperatures or pollution levels might be more similar at locations that are closer together in space. Thus the residuals from a linear model might be correlated, with a correlation that depends on the distance between locations. The field of geostatistics extends the linear and generalised linear models to allow for this.
• When the emphasis is on prediction rather than an interpretable model,
statistical learning techniques such as decision trees and artificial neural
networks (ANNs) can be powerful alternatives. Simple versions of these (such
as a single-layer ANN) can be usefully viewed within the modelling framework we have covered.
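As mentioned in the first bullet, ridge regression remains tractable even when $p > n$ because adding $\lambda I$ to $X^\top X$ makes it invertible. The following is a minimal sketch on simulated toy data, with the penalty $\lambda$ chosen arbitrarily for illustration (in practice it would be chosen by, for example, cross-validation):
set.seed(1)
n=20; p=50                               # more covariates than observations
X=matrix(rnorm(n*p),n,p)
beta=c(rep(2,5),rep(0,p-5))              # only the first 5 covariates matter
y=X%*%beta+rnorm(n)
lambda=1                                 # penalty, chosen arbitrarily here
beta.ridge=solve(t(X)%*%X+lambda*diag(p),t(X)%*%y)   # (X'X + lambda I)^{-1} X'y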
Chapter 7
Appendix
7.1 Multivariate Normal distribution
If $Y$ is multivariate Normal with mean (vector) $\mu$ and variance (matrix) $\Sigma$, its probability density is
$$
f(y) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(y-\mu)^\top \Sigma^{-1}(y-\mu)\right\},
$$
where $|\Sigma| \equiv \det(\Sigma)$.
We write $Y \sim N(\mu, \Sigma)$ or $Y \sim MVN(\mu, \Sigma)$.
In two dimensions we have
$$
\Sigma \equiv \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix} \equiv \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
$$
with $\Sigma_{1,2} = \mathrm{Cov}[Y_1, Y_2]$, $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_2 \sim N(\mu_2, \sigma_2^2)$ and $\mathrm{Cor}[Y_1, Y_2] = \rho$.
Let $a$ be a $p$-vector; then if $Y \sim N(\mu, \Sigma)$ has $p$ components,
$$
\sum_{i=1}^{p} a_i Y_i \equiv a^\top Y \sim N\left(a^\top \mu,\; a^\top \Sigma a\right).
$$
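A quick simulation check of this property (assuming the mvtnorm package is installed; the numbers below are illustrative only):
library(mvtnorm)
mu=c(1,2)
Sigma=matrix(c(4,1.2,1.2,1),2,2)         # sigma1=2, sigma2=1, rho=0.6
a=c(3,-1)
Y=rmvnorm(100000,mean=mu,sigma=Sigma)
aY=Y%*%a
c(mean(aY),sum(a*mu))                    # sample mean vs a'mu
c(var(aY),t(a)%*%Sigma%*%a)              # sample variance vs a'Sigma a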
7.2 QQ plots
Suppose that you had three observations from a distribution with density $f$ and cumulative distribution function $F$. It is unlikely (probability $= 1/8 + 1/8 = 1/4$) that all three would be in the top half or all three would be in the bottom half of the distribution.
Typically you might expect one in the bottom half, one in the top half and one in the middle. The plot below shows a density curve $f$ (for a $N(1, 2^2)$ distribution) with vertical lines at the median and the two quartiles. This seems like a reasonable “typical” placement.
[Figure: density $f(y)$ of the $N(1, 2^2)$ distribution against $y$, with vertical lines at the lower quartile, median and upper quartile.]
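For reference, a minimal sketch that reproduces a plot of this form:
ys=seq(-4,6,length.out=200)
plot(ys,dnorm(ys,mean=1,sd=2),type="l",xlab="y",ylab="f(y)")   # N(1,2^2) density
abline(v=qnorm(c(0.25,0.5,0.75),mean=1,sd=2))                  # lower quartile, median, upper quartile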
In general, if you have $n$ IID data points from this distribution then, when sorted into ascending order, you might expect them to be around the
$$
\frac{1}{n+1},\ \frac{2}{n+1},\ \ldots,\ \frac{n}{n+1}
$$
quantiles of the distribution with cumulative distribution function $F$; that is,
$$
F^{-1}\!\left(\frac{1}{n+1}\right),\ F^{-1}\!\left(\frac{2}{n+1}\right),\ \ldots,\ F^{-1}\!\left(\frac{n}{n+1}\right).
$$
A QQ plot plots these values against the sorted values that are supposed to be from this distribution. If the points lie close to the $y = x$ line then it is reasonable that they might, at least approximately, arise from the distribution $F$.
The examples below show a QQ plot using the correct distribution and two using incorrect distributions.
n=100
set.seed(12349)
xs=rnorm(n)               # data genuinely from a N(0,1)
xs.sorted=sort(xs)
probs=(1:n)/(n+1)
q1s=qnorm(probs)          # quantiles of the correct distribution, N(0,1)
q2s=qt(probs,df=5)        # quantiles of an incorrect distribution, t with 5 df
q3s=qnorm(probs,1,2)      # quantiles of an incorrect distribution, N(1,2^2)
par(mfrow=c(1,3))
plot(xs.sorted,q1s); abline(a=0,b=1)
plot(xs.sorted,q2s); abline(a=0,b=1)
plot(xs.sorted,q3s); abline(a=0,b=1)
[Figure: three QQ plots of q1s, q2s and q3s against xs.sorted; each panel includes the y = x line.]
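For the special case of checking against a Normal distribution, base R's qqnorm() and qqline() produce a similar plot directly (qqline() adds a reference line through the quartiles rather than the exact y = x line); for example, using the xs simulated above:
qqnorm(xs)
qqline(xs)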