MTH6102
Bayesian Statistical Methods
Jamie Griffin
Queen Mary, Autumn 2022
Contents
1 Likelihood
1.1 Maximum likelihood estimation
1.2 Standard error
2 Bayesian inference
2.1 Bayes’ theorem
2.2 Bayes’ theorem and Bayesian inference
2.3 Conjugate prior distributions
2.4 Point estimates and credible intervals
3 Specifying prior distributions
3.1 Informative prior distributions
3.2 Less informative prior distributions
4 Markov chain Monte Carlo methods
4.1 The Metropolis algorithm
5 Predictive distributions
5.1 Simulating the posterior predictive distribution
6 Hierarchical models
6.1 Inference for hierarchical models
6.2 Advantages of hierarchical models
7 Tests and model comparisons
Introduction
So far at Queen Mary, the statistics modules have taught the classical or frequentist approach, which is based on the idea that probability represents a long-run limiting frequency. In the Bayesian approach, any uncertain quantity is described by a probability distribution, and so probability represents a degree of belief in an event, conditional on the knowledge of the person concerned.
This course will introduce you to Bayesian statistics. These notes are self-contained but you may
want to read other accounts of Bayesian statistics as well. A useful introductory textbook is:
• Bayesian Statistics: An Introduction (4th ed.) by P M Lee. (This is available as an e-book from the library.)
Parts of the following are also useful:
• Bayesian Inference in Statistical Analysis by G E P Box and G C Tiao.
• Bayesian Data Analysis by A Gelman, J B Carlin, H S Stern and D B Rubin.
• Introduction to Probability and Statistics from a Bayesian Viewpoint (Part 2) by D V Lindley.
Chapter 1
Likelihood
First we review the concept of likelihood, which is essential for Bayesian theory, but can also be used
in frequentist methods. Let y be the data that we observe, which is usually a vector. We assume
that y was generated by some probability model which we can specify. Suppose that this probability
model depends on one or more parameters θ, which we want to estimate.
Definition 1.1. If the components of y are continuous, then the likelihood is defined as the joint
probability density function of y; if y is discrete, then the likelihood is defined as the joint probability
mass function of y. In either case, we denote the likelihood as
p(y | θ).
Example 1.1. Let y = y1, . . . , yn be a random sample from a normal distribution with unknown
parameters µ and σ. Then θ is the vector (µ, σ), and the likelihood is the joint probability density
function
$$p(y \mid \mu, \sigma) = \prod_{i=1}^{n} \phi(y_i \mid \mu, \sigma).$$
Note that $\prod$ here is the symbol for the product:
$$\prod_{i=1}^{n} a_i = a_1 \times a_2 \times \cdots \times a_n.$$
Here $\phi$ is the normal probability density function with parameters µ and σ:
$$\phi(y_i \mid \mu, \sigma) = \frac{e^{-(y_i - \mu)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}}.$$
Example 1.2. Suppose we observe k successes in n independent trials, where each trial has proba-
bility of success q. Now θ = q, the observed data is k, and the likelihood is the binomial probability
mass function
$$p(k \mid q) = \binom{n}{k} q^k (1-q)^{n-k}.$$
It is also possible to construct likelihoods which combine probabilities and probability density func-
tions, for example if the observed data contains both discrete and continuous components. Alter-
natively, probabilities may appear in the likelihood if continuous data is only observed to lie within
some interval.
Example 1.3. Assume that the time until failure for a certain type of light bulb is exponentially
distributed with parameter λ, and we observe n bulbs, with failure times t = t1, . . . , tn.
The likelihood contribution for a single observation $t_i$ is the exponential probability density function $\lambda e^{-\lambda t_i}$.
So the overall likelihood is the joint probability density function
$$p(t \mid \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda t_i}.$$
Suppose instead that we observe the failure time for the first m light bulbs with m < n, but for the
remaining n −m bulbs we only observe that they have not failed by time ti. Then for i ≤ m, the
likelihood contributions are as before.
For i > m, the likelihood is the probability of what we have observed. Denoting the random variable
for the failure time by Ti, we have observed that Ti > ti, so the likelihood contribution is
$$p(T_i > t_i) = e^{-\lambda t_i}.$$
This is because the cumulative distribution function is $p(T_i \le t_i) = 1 - e^{-\lambda t_i}$.
Hence the overall likelihood is now
$$p(t \mid \lambda) = \prod_{i=1}^{m} \lambda e^{-\lambda t_i} \prod_{i=m+1}^{n} e^{-\lambda t_i}.$$
1.1 Maximum likelihood estimation
In example 1.1, the parameters µ and σ of the normal distribution which generated the data are known as population parameters. An estimator is defined as a function of the observed data which we use as an estimate of a population parameter. For example, the sample mean and variance may be used as estimators of the population mean and variance, respectively.
To use the likelihood to estimate parameters θ, we find the vector $\hat{\theta}$ which maximizes the likelihood p(y | θ), given the data y that we have observed. This is the method of maximum likelihood, and the estimator $\hat{\theta}$ is known as the maximum likelihood estimator, or MLE.
Usually the parameters θ are continuous, and those are the only examples we cover, but the idea of
likelihood also makes sense if the unknown quantity is discrete.
When finding the MLE it is usually more convenient to work with the log of the likelihood: as the
log is a monotonically increasing function, the same θ will maximize p(y | θ) and log(p(y | θ)). The
log-likelihood is denoted by
ℓ(θ; y) = log(p(y | θ)).
Since the likelihood is typically a product of terms for independent observations, the log-likelihood
is a sum of terms, so using the log greatly simplifies finding the derivatives in order to find the
maximum.
Returning to the binomial example 1.2, the log-likelihood is
$$\ell(q; k) = \log\left(\binom{n}{k} q^k (1-q)^{n-k}\right) = \log\binom{n}{k} + k\log(q) + (n-k)\log(1-q).$$
To find the MLE, we differentiate with respect to q and set to zero
$$\frac{d\ell}{dq} = \frac{k}{q} - \frac{n-k}{1-q} = 0.$$
Rearranging gives the MLE $\hat{q}$ as
$$\hat{q} = \frac{k}{n}.$$
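As a quick check, here is a minimal R sketch with illustrative data (n = 20 trials, k = 6 successes) that maximizes the log-likelihood numerically and compares the answer with the formula k/n:

n <- 20; k <- 6                      # illustrative data
loglik <- function(q) dbinom(k, size = n, prob = q, log = TRUE)
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum  # numerical MLE
k / n                                # exact MLE from the formula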
For the normal example 1.1, we have log-likelihood
$$\ell(\mu, \sigma; y) = \log\left(p(y \mid \mu, \sigma)\right) = \log\left(\prod_{i=1}^{n} \phi(y_i \mid \mu, \sigma)\right) = \sum_{i=1}^{n} \log\left(\phi(y_i \mid \mu, \sigma)\right)$$
$$= \sum_{i=1}^{n}\left(-\log(\sqrt{2\pi}) - \log(\sigma) - \frac{(y_i - \mu)^2}{2\sigma^2}\right)$$
$$= -n\log(\sqrt{2\pi}) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2$$
$$= -n\log(\sqrt{2\pi}) - \frac{n}{2}\log(\gamma) - \frac{1}{2\gamma}\sum_{i=1}^{n}(y_i - \mu)^2$$
where $\gamma = \sigma^2$.
Differentiating gives
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\gamma}\sum_{i=1}^{n}(y_i - \mu) = 0.$$
Hence the MLE for µ is
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.$$
$$\frac{\partial \ell}{\partial \gamma} = -\frac{n}{2\gamma} + \frac{1}{2\gamma^2}\sum_{i=1}^{n}(y_i - \mu)^2.$$
Setting to zero, substituting our value for $\hat{\mu}$ and rearranging gives
$$\hat{\sigma}^2 = \hat{\gamma} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
Note that this is different from the usual sample variance estimator, which has 1/(n− 1) instead of
1/n.
Finally, for the exponential example 1.3, the likelihood for the second case, where we had both
probability density functions and probabilities, was
$$p(t \mid \lambda) = \prod_{i=1}^{m} \lambda e^{-\lambda t_i} \prod_{i=m+1}^{n} e^{-\lambda t_i} = \lambda^m e^{-\lambda S}$$
where $S = \sum_{i=1}^{n} t_i$.
Hence the log-likelihood is
$$\ell(\lambda; t) = m\log(\lambda) - \lambda S.$$
Differentiating and setting to zero gives
$$\frac{d\ell}{d\lambda} = \frac{m}{\lambda} - S = 0.$$
Hence the MLE is
$$\hat{\lambda} = \frac{m}{S}.$$
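A minimal R sketch with simulated, illustrative data checks this censored-data MLE against a numerical maximization of the same log-likelihood:

set.seed(1)
n <- 50; m <- 30
lambda_true <- 0.5
t_obs <- rexp(m, rate = lambda_true)       # observed failure times (i <= m)
t_cens <- rexp(n - m, rate = lambda_true)  # censoring times for the unfailed bulbs
S <- sum(t_obs) + sum(t_cens)
loglik <- function(lambda) m * log(lambda) - lambda * S
optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)$maximum  # numerical MLE
m / S                                      # closed-form MLE, should agree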
In each example, we could calculate the second derivatives at the stationary points to verify that we
have found a maximum.
1.2 Standard error
In frequentist statistics, an estimator $\hat{\psi}$ is a function of the data y which we use to estimate a population parameter ψ. To quantify the precision of $\hat{\psi}$, the standard error is defined as the standard deviation of $\hat{\psi}$ if the data y were repeatedly generated (given certain underlying population parameters), and for each realisation of y we calculate $\hat{\psi}$. So it quantifies the uncertainty in $\hat{\psi}$ due to random variation in the data we might have observed.
In the binomial example 1.2, we have log-likelihood
$$\ell(q; k) = \log\binom{n}{k} + k\log(q) + (n-k)\log(1-q).$$
We saw that the MLE is
$$\hat{q} = \frac{k}{n}.$$
k is a binomial random variable with variance $\mathrm{Var}(k) = nq(1-q)$, and so
$$\mathrm{Var}(\hat{q}) = \frac{nq(1-q)}{n^2} = \frac{q(1-q)}{n}.$$
We would estimate this variance by substituting our estimate $\hat{q}$ for the unknown true value of q, so the estimated standard error is
$$\mathrm{se}(\hat{q}) = \sqrt{\frac{\hat{q}(1-\hat{q})}{n}} = \sqrt{\frac{(k/n)(1-k/n)}{n}} = \sqrt{\frac{k(n-k)}{n^3}}.$$
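The repeated-sampling definition of the standard error is easy to check by simulation. A minimal R sketch, with arbitrary illustrative values of n and q:

set.seed(1)
n <- 50; q <- 0.3
k <- rbinom(10000, size = n, prob = q)  # 10000 replicate datasets
qhat <- k / n
sd(qhat)                # empirical standard deviation of the MLE
sqrt(q * (1 - q) / n)   # theoretical standard error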
In the case of maximum likelihood estimation, there is a general approximate formula for the standard
error based on the second derivative of the log-likelihood, which we do not cover in this module.
We also do not cover confidence intervals, but we do cover the Bayesian versions, which are called credible intervals.
Chapter 2
Bayesian inference
2.1 Bayes’ theorem
Bayes’ theorem is a formula from probability theory that is central to Bayesian inference. It is
named after Rev. Thomas Bayes, a nonconformist minister who lived in England in the first half of
the eighteenth century. The theorem states that:
Theorem 2.1. Let Ω be a sample space and $A_1, A_2, \ldots, A_m$ be mutually exclusive and exhaustive events in Ω (i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^{m} A_i = \Omega$; the $A_i$ form a partition of Ω). Let B be any event with p(B) > 0. Then
$$p(A_i \mid B) = \frac{p(A_i)\, p(B \mid A_i)}{p(B)} = \frac{p(A_i)\, p(B \mid A_i)}{\sum_{j=1}^{m} p(A_j)\, p(B \mid A_j)}.$$
The proof follows from the definition of conditional probability and the law of total probability.
Example 2.1. Suppose a test for an infection has 90% sensitivity and 95% specificity, and the
proportion of the population with the infection is q = 1/2000. Sensitivity is the probability of
detecting a genuine infection, and specificity is the probability of being correct about a non-infection.
So p(+ve test | infected) = 0.9 and p(-ve test | not infected) = 0.95. What is the probability that
someone who tests positive is infected?
Let the events be as follows:
B: test positive
A1: is infected
A2: is not infected
We want to find p(A1 | B). The probabilities we have are p(A1) = 1/2000, p(B | A1) = 0.9 and $p(B \mid A_2) = 1 - p(B^c \mid A_2) = 1 - 0.95 = 0.05$.
Applying Bayes’ theorem,
$$p(A_1 \mid B) = \frac{p(A_1)\, p(B \mid A_1)}{\sum_{j=1}^{2} p(A_j)\, p(B \mid A_j)}$$
$$p(A_1)\, p(B \mid A_1) = 1/2000 \times 0.9 = 0.00045$$
$$p(A_2)\, p(B \mid A_2) = (1 - 1/2000) \times 0.05 \approx 0.05$$
Hence
$$p(A_1 \mid B) = \frac{0.00045}{0.00045 + 0.05} = 0.0089.$$
So there is a less than 1% chance that the person is infected if they test positive.
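The same calculation in R, as a quick check (the numbers are those of the example):

prior <- 1 / 2000                  # p(infected)
sens <- 0.9                        # p(+ve test | infected)
spec <- 0.95                       # p(-ve test | not infected)
num <- prior * sens
den <- num + (1 - prior) * (1 - spec)
num / den                          # p(infected | +ve test), about 0.0089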
Bayes’ Theorem is also applicable to probability densities.
Theorem 2.2. Let X, Y be two continuous random variables (possibly multivariate), let f(x, y) be the joint probability density function (pdf), f(x | y) the conditional pdf, and so on. Then
$$f(x \mid y) = \frac{f(y \mid x)\, f(x)}{f(y)} = \frac{f(y \mid x)\, f(x)}{\int f(y \mid x')\, f(x')\, dx'}.$$
Alternatively, Y may be discrete, in which case f(y | x) and f(y) are probability mass functions.
2.2 Bayes’ theorem and Bayesian inference
In the Bayesian framework, all uncertainty is specified by probability distributions. This includes
uncertainty about the unknown parameters. So we need to start with a probability distribution for
the parameters p(θ).
We then update the probability distribution for θ using the observed data y. This updating is done
using Bayes’ theorem:
$$p(\theta \mid y) = \frac{p(\theta)\, p(y \mid \theta)}{p(y)}.$$
p(y | θ) is the likelihood for parameters θ given the observed data y.
We don’t normally need to find the normalizing constant p(y), which is given by
$$p(y) = \int p(\theta)\, p(y \mid \theta)\, d\theta \quad \text{or} \quad \sum_{\theta} p(\theta)\, p(y \mid \theta)$$
for continuous or discrete θ respectively.
So the procedure is as follows:
• Start with a distribution p(θ); this is known as the prior distribution.
• Combine the prior distribution with the likelihood p(y | θ) using Bayes’ theorem.
• The resulting probability distribution p(θ | y) is known as the posterior distribution.
• We base our inferences about θ on this posterior distribution.
The use of Bayes’ theorem can be summarized as
Posterior distribution ∝ prior distribution × likelihood
Example 2.2. Suppose a biased coin has probability of heads q, and we observe k heads in n
independent coin tosses. We saw the binomial likelihood for this problem:
$$p(k \mid q) = \binom{n}{k} q^k (1-q)^{n-k}.$$
For Bayesian inference, we need to specify a prior distribution for q. As q is a continuous quantity
between 0 and 1, the family of beta distributions is a reasonable choice.
If X ∼ Beta(α, β), its probability density function is
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1,$$
where B is the beta function and α , β > 0 are parameters.
$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx; \quad \text{also} \quad B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}.$$
The uniform distribution on [0, 1] is a special case of the beta distribution with α = 1, β = 1.
Combining prior distribution and likelihood, we have the posterior distribution p(q | k) given by
$$p(q \mid k) \propto p(q)\, p(k \mid q) = \frac{q^{\alpha-1}(1-q)^{\beta-1}}{B(\alpha, \beta)} \binom{n}{k} q^k (1-q)^{n-k} \propto q^{k+\alpha-1}(1-q)^{n-k+\beta-1}.$$
We can recognize this posterior distribution as being proportional to the pdf of a Beta(k + α, n − k + β) distribution. Hence we do not need to work out the normalizing constant for p(q | k) explicitly; we can immediately see that q | k has a Beta(k + α, n − k + β) distribution.
For a Bayesian point estimate for q, we summarize the posterior distribution, for example by its mean. A Beta(α, β) random variable has expected value $\frac{\alpha}{\alpha+\beta}$. Hence our posterior mean for q is
$$E(q \mid k) = \frac{k+\alpha}{n+\alpha+\beta}.$$
Recall that the maximum likelihood estimator is
$$\hat{q} = \frac{k}{n}.$$
So for large values of k and n, the Bayesian estimate and MLE will be similar, whereas they differ
more for smaller sample sizes.
In the special case that the prior distribution is uniform on [0, 1], the posterior distribution is Beta(k + 1, n − k + 1), and the posterior mean value for q is
$$E(q \mid k) = \frac{k+1}{n+2}.$$
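A minimal R sketch of this update, with illustrative data (k = 7 heads in n = 10 tosses) and the uniform Beta(1, 1) prior:

n <- 10; k <- 7
alpha <- 1; beta <- 1              # uniform prior
post_a <- k + alpha                # posterior is Beta(k + alpha, n - k + beta)
post_b <- n - k + beta
(k + alpha) / (n + alpha + beta)   # posterior mean
qbeta(0.5, post_a, post_b)         # posterior median
k / n                              # MLE for comparison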
2.3 Conjugate prior distributions
In the binomial example 2.2 with a beta prior distribution, we saw that the posterior distribution is
also a beta distribution. When we have the same family of distributions for the prior and posterior
for one or more parameters, this is known as a conjugate family of distributions. We say that the
family of beta distributions is conjugate to the binomial likelihood.
The binomial likelihood is
$$p(k \mid q) = \binom{n}{k} q^k (1-q)^{n-k}$$
and the beta prior probability density is
$$p(q) = \frac{q^{\alpha-1}(1-q)^{\beta-1}}{B(\alpha, \beta)}.$$
Considered as functions of q, the prior density function and the likelihood have the same functional form as each other (namely, proportional to $q^r(1-q)^s$ for some r, s). Also, when we multiply them together, we still have the same form. This is what characterizes conjugate distributions.
Example 2.3. Consider the exponential example 1.3, in which we have observed the time to failure
of n light bulbs as t1, . . . , tn. The likelihood for the observed data is
$$p(t \mid \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda t_i} = \lambda^n e^{-\lambda S}$$
where $S = \sum_{i=1}^{n} t_i$.
The gamma distribution with parameters α, β > 0 has probability density function
$$f(x) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}, \quad x > 0.$$
This looks like a candidate for a conjugate prior distribution for the parameter λ in the exponential
model.
With a Gamma(α, β) prior for λ, the posterior distribution is
$$p(\lambda \mid t) \propto p(\lambda)\, p(t \mid \lambda) = \frac{\beta^\alpha \lambda^{\alpha-1} e^{-\beta\lambda}}{\Gamma(\alpha)}\, \lambda^n e^{-\lambda S} \propto \lambda^{\alpha-1} e^{-\beta\lambda}\, \lambda^n e^{-\lambda S} = \lambda^{n+\alpha-1} e^{-\lambda(S+\beta)}.$$
Hence the posterior density function is proportional to a Gamma(n + α, S + β) pdf, and so the posterior distribution for λ must be a gamma distribution with parameters n + α and S + β.
The mean of a Gamma(α, β) distribution is $\frac{\alpha}{\beta}$, and so the posterior mean for λ is
$$E(\lambda \mid t) = \frac{n+\alpha}{S+\beta}.$$
Recall that the maximum likelihood estimator (MLE) was
$$\hat{\lambda} = \frac{n}{S}.$$
As n and S increase, the posterior mean approaches the MLE.
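A minimal R sketch of this conjugate update, with illustrative prior parameters and failure times:

alpha <- 2; beta <- 4              # illustrative Gamma prior for lambda
t <- c(1.2, 0.7, 2.1, 0.4, 1.6)    # illustrative failure times
n <- length(t); S <- sum(t)
(n + alpha) / (S + beta)           # posterior mean of lambda
n / S                              # MLE for comparison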
Example 2.4. As another example, consider a random sample y = y1, . . . , yn from a normal distri-
bution with parameters µ and σ, Yi ∼ N(µ, σ2), i = 1, . . . , n. For now, we take σ to be known, and
so we are only looking for a prior and posterior distribution for µ.
The likelihood for the sample is
$$p(y \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)$$
where we have dropped terms that do not contain µ.
The sum in the exponent can be expanded as
$$\sum_{i=1}^{n}(y_i-\mu)^2 = \sum_{i=1}^{n} y_i^2 - 2\mu\sum_{i=1}^{n} y_i + n\mu^2 = S_2 - 2n\mu\bar{y} + n\mu^2$$
where $S_2 = \sum_{i=1}^{n} y_i^2$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
Inside the exponent we can add or subtract terms not involving µ, because we only need to define the likelihood up to a constant of proportionality:
$$p(y \mid \mu) \propto \exp\left(-\frac{1}{2\sigma^2}(S_2 - 2n\mu\bar{y} + n\mu^2)\right) \propto \exp\left(-\frac{1}{2\sigma^2}(-2n\mu\bar{y} + n\mu^2)\right)$$
$$\propto \exp\left(-\frac{1}{2\sigma^2}(n\bar{y}^2 - 2n\mu\bar{y} + n\mu^2)\right) = \exp\left(-\frac{n}{2\sigma^2}(\bar{y} - \mu)^2\right)$$
If we take the prior distribution for µ as normal, $\mu \sim N(\mu_0, \sigma_0^2)$, then the prior probability density is
$$p(\mu) \propto \exp\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right).$$
Hence the posterior density after seeing the data y is
$$p(\mu \mid y) \propto p(\mu)\, p(y \mid \mu) \propto \exp\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right)\exp\left(-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}\right)$$
$$\propto \exp\left(-\frac{\mu^2 - 2\mu_0\mu}{2\sigma_0^2} - \frac{n(\mu^2 - 2\bar{y}\mu)}{2\sigma^2}\right) = \exp\left(-\frac{1}{2}\left[\mu^2\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right) - 2\mu\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}\right)\right]\right)$$
where any factors not involving µ have been dropped at each stage. We can recognise p(µ | y) as being proportional to a normal density $N(\mu_1, \sigma_1^2)$ for some µ1, σ1. An $N(\mu_1, \sigma_1^2)$ pdf for µ is proportional to
$$\exp\left(-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}\right) = \exp\left(-\frac{1}{2}\left[\mu^2\left(\frac{1}{\sigma_1^2}\right) - 2\mu\left(\frac{\mu_1}{\sigma_1^2}\right) + \mu_1^2\left(\frac{1}{\sigma_1^2}\right)\right]\right)$$
Equating the terms in µ² and µ with those in the expression for p(µ | y), we see that
$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \quad \text{and} \quad \frac{\mu_1}{\sigma_1^2} = \frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}.$$
Rearranging these gives
$$\mu_1 = \sigma_1^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}\right) = \left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}\right)\bigg/\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right).$$
Hence the posterior distribution for µ is $N(\mu_1, \sigma_1^2)$ with these parameters.
The mean of the posterior distribution µ1 can be written as a weighted average of the prior mean µ0 and the sample mean ȳ:
$$\mu_1 = (1-w)\mu_0 + w\bar{y}, \quad \text{where} \quad w = \frac{1}{1 + \frac{\sigma^2}{n\sigma_0^2}}.$$
The weight w → 1 as n → ∞ or σ0 → ∞; hence in either of these limits, the posterior mean approaches the sample mean.
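A minimal R sketch of these formulas, with illustrative prior parameters and data:

sigma <- 2                           # known data standard deviation
mu0 <- 0; sigma0 <- 10               # prior: mu ~ N(mu0, sigma0^2)
y <- c(4.1, 5.3, 3.8, 4.7, 5.0)      # illustrative sample
n <- length(y); ybar <- mean(y)
prec1 <- 1 / sigma0^2 + n / sigma^2  # posterior precision, 1/sigma1^2
mu1 <- (mu0 / sigma0^2 + n * ybar / sigma^2) / prec1
sigma1 <- sqrt(1 / prec1)
c(mu1, sigma1)                       # posterior mean and standard deviation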
Example 2.5. Now consider example 2.4, but this time suppose that µ is known and the standard
deviation σ is unknown.
The likelihood for the sample y1, . . . , yn is
$$p(y \mid \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right) \propto \frac{1}{\sigma^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)$$
Let τ = 1/σ². τ is referred to as the precision. It turns out that we can find a more standard form for a conjugate distribution for τ than for σ. In terms of τ, the likelihood is
$$p(y \mid \tau) \propto \tau^{n/2} \exp\left(-\frac{\tau}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right)$$
As a function of τ , this likelihood has the same functional form as a gamma probability density
function. Hence we consider τ ∼ Gamma(α , β) as a prior distribution. The prior pdf is
$$p(\tau) = \frac{\beta^\alpha \tau^{\alpha-1} e^{-\beta\tau}}{\Gamma(\alpha)}, \quad \tau > 0.$$
The posterior distribution is
$$p(\tau \mid y) \propto p(\tau)\, p(y \mid \tau) \propto \tau^{\alpha-1} e^{-\beta\tau}\, \tau^{n/2}\, e^{-\frac{\tau}{2}\sum_{i=1}^{n}(y_i-\mu)^2} = \tau^{\alpha+\frac{n}{2}-1}\, e^{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\tau}$$
Hence the posterior distribution is
$$\tau \sim \mathrm{Gamma}\left(\alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right)$$
and we can say that the family of gamma distributions is conjugate to the likelihood for the normal precision parameter, if the mean µ is known.
If µ and τ are both unknown, then there is a bivariate distribution which is conjugate. This is constructed by taking a gamma marginal distribution for τ, and then a normal conditional distribution for µ | τ. The joint prior distribution is the product of these two pdfs. Combining the prior with the likelihood for the sample, the posterior is of the same form as the prior, but we do not go into the details here.
2.4 Point estimates and credible intervals
Once we have found a posterior distribution p(θ | y), in the Bayesian framework all our inferences are based on this distribution. This includes point estimates. For a one-dimensional parameter θ, we can summarize the posterior distribution just as we would normally summarize a distribution, using familiar summaries such as the mean, median or mode. The median or mean are the most commonly used.
The mode (the value of θ which maximizes the posterior density) may be used in cases where it is
difficult to find the posterior distribution. With a uniform distribution as a prior, i.e. p(θ) a constant,
the posterior distribution is proportional to the likelihood, and so the posterior mode in this case is
the same as the maximum likelihood estimate.
For the examples we have seen so far, there is a simple formula for the posterior mean. For the
beta or gamma posterior distributions, there is no exact formula for the median. However, computer
packages, including R, allow us to easily find the median for common probability distributions.
For a continuous random variable Θ with cumulative distribution function F (θ), we have by definition
P (Θ ≤ θ) = F (θ)
The quantile function is defined as the inverse function to the cdf: $Q = F^{-1}$.
If p = F (θ) for some p ∈ [0, 1], then Q(p) = θ. In particular, if p = 0.5 and m is the median of the
distribution, then F (m) = 0.5 and Q(0.5) = m.
For common distributions, there is a set of four functions in R to calculate the density (or probability mass function), the cdf and the quantile function, and to generate random variates. For example, for the gamma distribution, these functions are named dgamma, pgamma, qgamma and rgamma, respectively.
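For example, with illustrative parameters α = 3, β = 2:

dgamma(1, shape = 3, rate = 2)     # density at x = 1
pgamma(1, shape = 3, rate = 2)     # cdf: P(X <= 1)
qgamma(0.5, shape = 3, rate = 2)   # quantile function: the median
rgamma(5, shape = 3, rate = 2)     # five random variates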
2.4.1 Credible intervals
In frequentist inference, confidence intervals are used to express a range of uncertainty around a parameter estimate. Suppose random samples y are repeatedly generated. For each sample we can estimate the true parameter θ by $\hat{\theta}(y)$, and also construct an interval. A 95% confidence interval is an interval $(\theta_L(y), \theta_U(y))$ with
$$P(\theta_L(y) \le \theta \le \theta_U(y)) = 0.95.$$
It is tempting to say the interval (θL(y), θU(y)) calculated from the current dataset y contains the true
value θ with probability 0.95. But such statements are only possible in the Bayesian framework. With
frequentist confidence intervals, the probability statement refers to the probability over repeatedly
generating datasets y. Frequentists consider θ to be a fixed (but unknown) quantity, not a random
variable.
In the Bayesian framework, we do have a probability distribution for θ. After seeing the data, this
is the posterior distribution p(θ | y). Based on this distribution, we can make probability statements
about θ.
For some α ∈ [0, 1], a 100(1− α)% credible interval is some (θL, θU) such that
P (θL < θ < θU) = 1− α
For example, α = 0.05 for a 95% credible interval.
There are two main ways of defining credible intervals, equal tail intervals and highest posterior
density intervals.
Equal tail intervals have equal probability mass above and below the interval (θL, θU):
$$P(\theta < \theta_L) = \alpha/2, \qquad P(\theta > \theta_U) = \alpha/2.$$
Highest posterior density intervals are chosen so that the probability density function has equal
height at either end of the interval, p(θL | y) = p(θU | y).
The equal tail interval is usually easier to calculate in practice, and so is more widely reported. If
the posterior distribution is among the well-known named distributions, then just as for the median,
we can use the quantile functions in R such as qgamma and qnorm to calculate equal tail credible
intervals.
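For example, for the gamma posterior of example 2.3, an equal tail 95% credible interval comes straight from qgamma (illustrative prior parameters and data summaries):

alpha <- 2; beta <- 4     # prior parameters
n <- 5; S <- 6.0          # data summaries
qgamma(c(0.025, 0.975), shape = n + alpha, rate = S + beta)  # equal tail interval
qgamma(0.5, shape = n + alpha, rate = S + beta)              # posterior median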
2.4.2 Transformed parameters
Suppose we have found the posterior distribution p(θ | y) for a parameter θ. Let ψ = g(θ) be a monotonic, increasing transformation of θ, such as ψ = log(θ) or $\psi = \sqrt{\theta}$.
Some posterior summaries are preserved by the transformation. For example, suppose that θM is the
posterior median for θ
P (θ ≤ θM) = 0.5.
Then
P (ψ ≤ g(θM)) = P (g(θ) ≤ g(θM)) = P (θ ≤ θM) = 0.5
and so g(θM) is the posterior median for ψ.
Similarly, let (θL, θU) be a 95% equal tail credible interval for θ.
P (θ ≤ θL) = 0.025, P (θ ≥ θU) = 0.025.
Adapting the argument for the median leads to
$$P(\psi \le g(\theta_L)) = 0.025, \qquad P(\psi \ge g(\theta_U)) = 0.025,$$
and so $(g(\theta_L), g(\theta_U))$ is a 95% equal tail credible interval for ψ.
This shows that the posterior median and equal tail credible intervals are preserved by increasing,
one-to-one transformations.
The posterior mean is not in general preserved by non-linear transformations. As an example, suppose that ψ = θ². Then
$$E(\psi) = E(\theta^2) = \mathrm{Var}(\theta) + E(\theta)^2.$$
Since in general Var(θ) > 0, it follows that E(ψ) ≠ E(θ)².
The shape of a probability density changes under non-linear transformations of the random vari-
able. This change of shape means that in general the posterior mode is not preserved by such a
transformation, and nor are the endpoints of highest posterior density credible intervals.
2.4.3 Multiple parameters
If θ, the unknown parameter(s), is a vector, then we still base our estimates on p(θ | y). If we are interested in predictions of future data, then we use the entire joint distribution.
For point estimates of individual parameters, we typically use the marginal distribution. For example,
if θ = (θ1, θ2, θ3), then the marginal posterior distribution for θ1 is
$$p(\theta_1 \mid y) = \iint p(\theta_1, \theta_2, \theta_3 \mid y)\, d\theta_2\, d\theta_3.$$
In practical Bayesian inference, Markov chain Monte Carlo methods, covered later in these notes, are the most common means of approximating the posterior distribution. These methods produce an approximate sample from the joint posterior density, and once this is done the marginal distribution of each parameter is immediately available.
Example 2.6. Generating samples from the posterior distribution may also be helpful even if we
can calculate the exact posterior distribution. Suppose that the data are the outcome of a clinical
trial of two treatments for a serious illness, the number of deaths after each treatment. Let the data
be ki deaths out of ni patients, i = 1, 2 for the two treatments, and the two unknown parameters
are q1 and q2, the probability of death with each treatment. Assuming a binomial model for each
outcome, and independent Beta(αi, βi) priors for each parameter, the posterior distribution is
qi | ki ∼ Beta(ki + αi, ni − ki + βi), i = 1, 2.
We have independent prior distributions and likelihood, so the posterior distributions are also inde-
pendent.
p(q1, q2 | k1, k2) = p(q1 | k1) p(q2 | k2) ∝ p(k1 | q1)p(q1) p(k2 | q2)p(q2)
However, it is useful to think in terms of the joint posterior density for q1 and q2, as then we can
make probability statements involving both parameters. In this case, one quantity of interest is the
probability P (q2 < q1), i.e. does the second treatment have a lower death rate than the first. To
find this probability, we need to integrate the joint posterior density over the relevant region, which
is not possible to do exactly when it is a product of beta pdfs.
We can approximate the probability by generating a sample of (q1, q2) pairs from the joint density.
To generate the sample, we just need to generate each parameter from its beta posterior distribution,
which can be done in R using the rbeta command. Then once we have the sample, we just count
what proportion of pairs has q2 < q1 to estimate P (q2 < q1).
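In R, this whole procedure takes a few lines (with illustrative trial data and uniform Beta(1, 1) priors):

set.seed(1)
k1 <- 20; n1 <- 100                  # deaths out of patients, treatment 1
k2 <- 12; n2 <- 100                  # deaths out of patients, treatment 2
M <- 100000
q1 <- rbeta(M, k1 + 1, n1 - k1 + 1)  # sample from the posterior of q1
q2 <- rbeta(M, k2 + 1, n2 - k2 + 1)  # sample from the posterior of q2
mean(q2 < q1)                        # estimate of P(q2 < q1)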
Chapter 3
Specifying prior distributions
The posterior distribution depends both on the observed data, via the likelihood, and on the prior distribution. So far, we have taken the prior distribution as given, but now we look at how to specify a prior.
3.1 Informative prior distributions
An informative prior distribution is one in which the probability mass is concentrated in some subset
of the possible range for the parameter(s), and is usually based on some specific information. There
may be other data that is relevant, and we might want to use this information without including all
previous data in our current model. In that case, we can use summaries of the data to find a prior
distribution.
Example 3.1. In example 2.3, the data was the lifetimes of light bulbs, t = (t1, . . . , tn), assumed to
be exponentially distributed with parameter λ (the failure rate, reciprocal of the mean lifetime). The
gamma distribution provides a conjugate prior distribution for λ. Suppose that we had information
from several other similar types of light bulbs, which had observed failure rates r1, . . . , rk. Let the
sample mean and variance of these rates be m and v.
A Gamma(α, β) distribution has mean $\frac{\alpha}{\beta}$ and variance $\frac{\alpha}{\beta^2}$. We can match these to m and v:
$$m = \frac{\alpha}{\beta}, \qquad v = \frac{\alpha}{\beta^2} = \frac{m}{\beta}.$$
Rearranging gives
$$\beta = \frac{m}{v}, \qquad \alpha = \beta m = \frac{m^2}{v}.$$
Hence we can use these values of α and β as the parameters of our prior distribution.
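A minimal R sketch of this moment matching (the previous-study rates r are illustrative):

r <- c(0.8, 1.1, 0.9, 1.4, 1.0)  # failure-rate estimates from previous studies
m <- mean(r); v <- var(r)
beta <- m / v
alpha <- m^2 / v
c(alpha, beta)                   # parameters for the Gamma(alpha, beta) prior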
Other examples of summary data include published estimates and their standard errors or confidence
intervals. We can match these quantities to the mean (or median), standard deviation or percentiles
of the prior distribution, as appropriate. In this kind of example there is doubt over whether the
published estimate is really measuring the same quantity as the parameter in our current model
represents, and so it may be a good idea to increase the uncertainty, by increasing the standard error
or width of the confidence interval before matching to the prior distribution parameters.
3.2 Less informative prior distributions
If there is no specific numerical information upon which to base a prior, then one may aim to choose a prior distribution which conveys as little information as possible. However, there is no unique way of doing this.
We could choose a flat prior distribution, i.e. uniform over some range. If the full range of the
parameter is unbounded, then we could assign a uniform distribution on some range that includes
all plausible values for the parameter. For example suppose the parameter is a standard deviation
σ. We might choose
p(σ) = 1/c, 0 < σ < c
for some large c. But the prior will not be uniform for transformations of σ such as log(σ) or σ2.
There are more formal methods to attempt to specify uninformative prior distributions, but we do
not cover them in this module.
3.2.1 Weakly informative prior distributions
Instead of trying to make the prior distribution completely uninformative, an alternative is to use
the prior to convey some information about the plausible range of the parameters, but otherwise let
the data speak for themselves.
In regression models, either linear or generalized linear models, after rescaling the covariates to
have a standard deviation of 1, we could choose a prior distribution centred around zero, to convey
the information that extremely large effects on the outcome are unlikely, for example a normal
distribution N(0, s2). The scale parameter of the prior distribution (s here) would be based on what
magnitude of effects had been found in the past in analyses of similar types of outcomes.
An alternative to a normal distribution is the Cauchy distribution, which may be preferred as it
has heavier tails, so if the data do strongly suggest a large effect then this can be reflected in the
posterior distribution.
Chapter 4
Markov chain Monte Carlo methods
The examples covered so far have been conjugate, in which case we can derive an exact formula for
the posterior distribution, including the normalizing constant. However, this is only possible in a few
simple cases. Bayesian methods are in theory applicable to complicated models with many param-
eters. Computational methods that have come into widespread use in the past three decades have
made practical Bayesian inference possible for a much wider range of examples than was previously
the case. The main computational methods used are various types of Markov chain Monte Carlo
methods.
The term “Monte Carlo methods” refers to any method that generates random samples in order
to do some calculation that cannot easily be done by non-random methods. Simple Monte Carlo
methods are those where we can generate an independent sample from a probability distribution
that we want to make statements about. In Bayesian inference this would be the posterior density
p(θ | y). Example 2.6 was of this type, as it was possible to generate a random sample from the joint
posterior distribution of q1 and q2. This sample could then be used to estimate P (q2 < q1), which is
not possible to calculate exactly.
4.1 The Metropolis algorithm
In general, it is not possible to generate an independent random sample of θ values from the posterior
distribution p(θ | y), and so simple Monte Carlo cannot be used. Markov chain Monte Carlo methods
generate a sample from a probability distribution that is not independent, rather each element in
the sample is correlated with the previous element.
A Markov chain is a sequence θ1, θ2, . . . of random variables. The probability distribution of θi only
depends on the previous value θi−1
p(θi | θ1, θ2, . . . , θi−1) = p(θi | θi−1).
The Metropolis algorithm is a type of Markov chain Monte Carlo (MCMC). Let h(ψ; θ) be a probabil-
ity density function for ψ which is symmetric in ψ and θ, for example the normal pdf ϕ(ψ; θ , s). Let f
be another pdf. The algorithm constructs a Markov chain θ1, θ2, θ3, . . . , where the θi are continuous
rvs (in our applications).
The algorithm constructs the Markov chain as follows:
• Start with θ1, where f(θ1) > 0.
• For each i > 1, generate ψi from the distribution h(ψi; θi−1).
• Define $r = \min\left(1, \frac{f(\psi_i)}{f(\theta_{i-1})}\right)$.
• Set θi = ψi with probability r, and θi = θi−1 with probability 1 − r.
h is called the proposal distribution, and r is the acceptance probability. f is sometimes called the
target distribution: this is what we are aiming for, i.e. we want to generate a sample with pdf f .
The sequence θ1, θ2, . . . has the property that if θi−1 ∼ f , then θi ∼ f , in other words f is the
equilibrium distribution of the chain. However, we don’t start with θ1 ∼ f , because if we could, then
we would not need this algorithm. But for large enough i, if some technical conditions are met, then
each θi ∼ f approximately. (The notation θ ∼ f is used here to mean that θ is distributed with pdf
f .)
The algorithm eventually produces points distributed with pdf f , but it does not produce an inde-
pendent random sample. Nevertheless, we can still use the sample to make inferences about f . Often,
we discard some initial number of steps B in order to reduce the influence of the choice of starting
value θ1. B is known as the burn-in. Then, we would continue the chain for some large number M
of additional steps, so that inferences about the sample are based on θB+1, θB+2, . . . , θB+M .
The proposal distribution h is often taken as a normal distribution centred on the current point,
$$\psi_i \sim N(\theta_{i-1}, s^2).$$
The pdf h is symmetric in θ and ψ, as required by the algorithm:
$$h(\psi; \theta) = \frac{1}{\sqrt{2\pi}\, s}\, e^{-\frac{(\psi-\theta)^2}{2s^2}} = \frac{1}{\sqrt{2\pi}\, s}\, e^{-\frac{(\theta-\psi)^2}{2s^2}} = h(\theta; \psi).$$
The acceptance probability r is
$$r = \min\left(1, \frac{f(\psi_i)}{f(\theta_{i-1})}\right).$$
If the proposal scale s (the standard deviation of h) is very small, then ψi is close to θi−1, and so
f(ψi) is close to f(θi−1). Hence there is a high probability of accepting the proposal, but it is only a
small step. If s is large, then ψi may be far from θi−1, and f(ψi) may be much lower than f(θi−1).
Now there is a lower probability of accepting the proposal, but it is a larger step.
Ideally we would want large steps which are often accepted, to reduce the correlation between suc-
cessive θi values, but in practice some compromise between taking large steps and having a high
acceptance probability is needed. This means that some intermediate value for s tends to be best for
reducing the correlation. The best value for s depends on the model and the data. There is a general
rule of thumb, which is to aim for an acceptance probability r = 0.23. This number is calculated
based on f being a multivariate normal density, but using simulation it has been shown to work well
in a wider range of problems.
4.1.1 Relevance to Bayesian inference
The Metropolis algorithm produces an approximate sample from a general distribution with pdf f. In the algorithm, f appears only in the acceptance probability
$$r = \min\left(1, \frac{f(\psi_i)}{f(\theta_{i-1})}\right).$$
As f only appears in a ratio, we only need to know f up to a constant. If f(θ) = cg(θ), where c does
not involve θ, then we could use g instead of f .
In Bayesian inference, the posterior density is
$$p(\theta \mid y) = \frac{p(\theta)\, p(y \mid \theta)}{\int p(\theta)\, p(y \mid \theta)\, d\theta} \propto p(\theta)\, p(y \mid \theta).$$
It is difficult to find the normalizing constant $\int p(\theta)\, p(y \mid \theta)\, d\theta$. When using the Metropolis algorithm, there is no need to find this constant. We can put g(θ) = p(θ) p(y | θ), and use g in the algorithm for calculating the acceptance probability r, in place of f. This will generate a Markov chain θ1, θ2, . . . which is an approximate sample from p(θ | y).
4.1.2 Metropolis algorithm implementation
The following description of how the algorithm can be implemented for Bayesian inference is in terms
of a single scalar parameter θ, but the general method can be used when θ is a vector of parameters.
The most common choice of proposal distribution is a normal distribution, and that is all we will
consider in this module for the Metropolis algorithm.
Assume we have observed data y and have specified a probability model that depends on a parameter
θ, so that we can compute the likelihood p(y | θ). The prior distribution is p(θ).
Define g(θ) = p(θ)p(y | θ), the non-normalized posterior density. The Metropolis algorithm proceeds
as follows:
• Choose some s > 0.
• Start with θ1, where g(θ1) > 0.
• For each i > 1:
1. Generate $\psi_i \sim N(\theta_{i-1}, s^2)$.
2. Define $r = \min\left(1, \frac{g(\psi_i)}{g(\theta_{i-1})}\right)$.
3. Generate u ∼ Uniform(0, 1), the continuous uniform distribution on [0, 1].
4. Set θi = ψi if u ≤ r, and θi = θi−1 if u > r.
Steps 1 to 4 are repeatedly carried out to generate a sequence θ1, θ2, θ3, . . . with some repeated values,
i.e. θi = θi−1 if the proposed value ψi was rejected.
Step 1 is equivalent to setting ψi = θi−1 + sz, where z ∼ N(0, 1).
Steps 3 and 4 are how in a computer implementation we would set θi = ψi with probability r, and
θi = θi−1 otherwise.
Note that if $r^* = \frac{g(\psi_i)}{g(\theta_{i-1})} > 1$, then u < r* with probability 1; hence in steps 2 and 4 we could use r* instead of r and the algorithm would not change.
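A minimal R sketch of this algorithm, applied to the beta-binomial model of example 2.2 with illustrative data and a uniform prior, so that the output can be checked against the exact Beta(k + 1, n − k + 1) posterior:

set.seed(1)
n <- 20; k <- 6
g <- function(q) {                   # non-normalized posterior: prior times likelihood
  if (q <= 0 || q >= 1) return(0)    # zero outside the parameter range
  dbeta(q, 1, 1) * dbinom(k, n, q)
}
s <- 0.2                             # proposal standard deviation
M <- 10000
theta <- numeric(M)
theta[1] <- 0.5                      # starting value with g(theta[1]) > 0
for (i in 2:M) {
  psi <- rnorm(1, mean = theta[i - 1], sd = s)          # step 1: proposal
  r <- min(1, g(psi) / g(theta[i - 1]))                 # step 2: acceptance probability
  theta[i] <- if (runif(1) <= r) psi else theta[i - 1]  # steps 3 and 4
}
kept <- theta[-(1:1000)]             # discard a burn-in of B = 1000
mean(kept)                           # compare with exact mean (k + 1)/(n + 2) = 0.318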
Working on the log scale
In realistic applications, all computations are usually done using the log of the posterior density. The
likelihood is typically a product of many terms
$$p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta).$$
Due to the finite precision of computer arithmetic, if we multiply the likelihood contributions p(yi | θ) together for a large dataset, the result can underflow to zero or lose accuracy. So we calculate
$$\log(p(y \mid \theta)) = \sum_{i=1}^{n} \log(p(y_i \mid \theta)).$$
Define L(θ) = log(p(θ) p(y | θ)) = log(p(θ)) + log(p(y | θ)), the log of the posterior density (up to an additive constant).
To work on the log scale, the generation of the proposed value ψi is unchanged, but the acceptance step of the algorithm changes. For each i > 1:
1. Generate $\psi_i \sim N(\theta_{i-1}, s^2)$, just as before.
2. Define δ = L(ψi) − L(θi−1).
3. Generate u ∼ Uniform(0, 1), the continuous uniform distribution on [0, 1].
4. Set θi = ψi if log(u) ≤ δ, and θi = θi−1 if log(u) > δ.
Here, δ = log(r∗) from section 4.1.2.
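In R, only the acceptance step of the earlier sketch changes. A log-scale version for the same illustrative beta-binomial model (n, k as before):

logpost <- function(q) {             # log prior plus log likelihood
  if (q <= 0 || q >= 1) return(-Inf) # log of zero outside the range
  dbeta(q, 1, 1, log = TRUE) + dbinom(k, n, q, log = TRUE)
}
# Inside the loop, replace the ratio-based acceptance step with:
#   delta <- logpost(psi) - logpost(theta[i - 1])
#   theta[i] <- if (log(runif(1)) <= delta) psi else theta[i - 1]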
Summarizing the posterior distribution
The Metropolis algorithm generates an approximate sample from the posterior distribution p(θ | y),
θ1, θ2, . . . , θM . This is not an independent sample, since each θi is correlated with the previous
element in the sequence θi−1. But we can still summarize it like an independent sample. The sample
mean can be used to estimate the posterior mean of θ, and similarly for the median. The 2.5th and
97.5th sample percentiles can be used as a 95% equal tail credible interval for θ.
It is common to discard the initial part of the sample, the first B values for some B > 0, to reduce
the influence of the choice of starting value θ1. B is called the burn-in.
Multiple parameters
Usually θ is a vector of parameters of length k. The Metropolis algorithm still proceeds in the same
way in this case. We generate a Markov chain where each element of the sequence is a vector θi.
The only change is to step 1, the proposal of candidate parameter values. For each i > 1, we can generate
$$\psi_i \sim N_k(\theta_{i-1}, \Sigma),$$
a multivariate normal proposal distribution with mean vector θi−1 and covariance matrix Σ, where Σ is chosen at the start.
Chapter 5
Predictive distributions
In the Bayesian framework, all our inferences about θ are based on the posterior distribution p(θ | y).
As well as the parameter(s) θ, we may be interested in new data x. Inference about new data is also
based on p(θ | y).
Assume that x is independent of y (conditional on θ), but is generated by the same probability
model. The posterior predictive distribution for x given y is
$$p(x \mid y) = \int p(x, \theta \mid y)\, d\theta.$$
From standard rules for conditional distributions
p(x, θ | y) = p(x | θ , y) p(θ | y)
x is assumed to be independent of y (given θ), and so p(x | θ , y) = p(x | θ).
Therefore the posterior predictive distribution is given by
$$p(x \mid y) = \int p(x \mid \theta)\, p(\theta \mid y)\, d\theta.$$
This is the probability distribution for unobserved or future data x. It takes into account the
uncertainty remaining about θ after we have seen y as well as the random variation in x given any
particular value for θ.
In conjugate examples, it is usually possible to derive an expression for the posterior predictive
distribution
$$p(x \mid y) = \int p(x \mid \theta)\, p(\theta \mid y)\, d\theta.$$
However, this tends to be less simple than finding p(θ | y). Also, in most realistic problems there
will not be a conjugate posterior distribution.
It is generally easier to find the mean and variance of p(x | y) than to derive the full distribution. Suppose that X and W are general random variables. Then
$$E(X) = E(E(X \mid W))$$
and
$$\mathrm{Var}(X) = \mathrm{Var}(E(X \mid W)) + E(\mathrm{Var}(X \mid W)).$$
In these identities, we replace W by the parameters and X by the new data we would like to predict.
Example 5.1. Suppose that the observed data is k ∼ Bin(n, q). With a Beta(α , β) prior for q, the
posterior distribution is Beta(a, b), where a = k + α , b = n− k + β.
Suppose the data we want to predict is x ∼ Bin(m, q), where m is known. The posterior predictive
distribution is
$$p(x \mid k) = \int_0^1 p(x \mid q)\, p(q \mid k)\, dq.$$
This integral is possible to do, but involves working with beta or gamma functions. We can find the
mean and variance of x directly. Put
$$\mu = \frac{a}{a+b}, \qquad v = \frac{ab}{(a+b)^2(a+b+1)},$$
the posterior mean and variance of q. The general formula for conditional expectations gives
E(x) = E(E(x | q)).
We want expectations in the posterior distribution, so conditioning on k we have
E(x | k) = E(E(x | q , k))
E(x | q , k) = E(x | q) = mq
hence the posterior predictive mean is
E(x | k) = E(mq | k) = mE(q | k) = mµ.
For the variance,
$$\mathrm{Var}(x \mid q, k) = \mathrm{Var}(x \mid q) = mq(1-q).$$
Hence the posterior predictive variance is
$$\mathrm{Var}(x \mid k) = E(\mathrm{Var}(x \mid q, k)) + \mathrm{Var}(E(x \mid q, k)) = E(mq(1-q) \mid k) + \mathrm{Var}(mq \mid k)$$
$$= mE(q \mid k) - mE(q^2 \mid k) + m^2\,\mathrm{Var}(q \mid k) = mE(q \mid k) - m\left(\mathrm{Var}(q \mid k) + E(q \mid k)^2\right) + m^2\,\mathrm{Var}(q \mid k)$$
$$= m\mu - m(v + \mu^2) + m^2 v.$$
The general procedure is to find the mean and variance of the predicted data conditional on a specific
parameter or parameter vector, and then to find the mean and variance over the posterior distribution
of parameters.
5.1 Simulating the posterior predictive distribution
Suppose that we have generated a sample θ1, θ2, . . . , θM from the posterior distribution p(θ | y). Given this sample, we can simulate the posterior predictive distribution p(x | y). We just generate random values xj from the distribution p(x | θj), j = 1, 2, . . . , M. Then x1, x2, . . . , xM is a sample from the posterior predictive distribution.
Since the usual computational method for Bayesian inference is to use a form of MCMC such as the
Metropolis algorithm, we typically will have a sample of θ values.
Once we have a sample from the distribution p(x | y), we can summarize the sample to estimate
various quantities. For example, the sample mean of x1, x2, . . . , xM is an estimate of the posterior
predictive mean of x.
To estimate the posterior predictive probability that x is less than some value a, simply count what
proportion of the sample is less than a. If the data is discrete, to estimate the posterior predictive
probability that x = 0, count what proportion of the sample is equal to 0.
The sample quantiles of x1, . . . , xM can be used to form a prediction interval. For example, the 2.5th
and 97.5th sample quantiles form a 95% posterior predictive interval. A new data-point x will be
inside this interval with probability 0.95.
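A minimal R sketch of this simulation, for the beta-binomial setting of example 5.1 (illustrative data, uniform prior, m = 10 new trials):

set.seed(1)
n <- 20; k <- 6; m <- 10
M <- 10000
q <- rbeta(M, k + 1, n - k + 1)      # posterior sample for q
x <- rbinom(M, size = m, prob = q)   # one predictive draw per posterior draw
mean(x)                              # posterior predictive mean
quantile(x, c(0.025, 0.975))         # 95% posterior predictive interval
mean(x == 0)                         # posterior predictive P(x = 0)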
Chapter 6
Hierarchical models
So far we have considered a single parameter or vector of parameters θ, and the observed data
y1, . . . , yn are independent given θ. One way of generalizing statistical models is to account for the
fact that data are grouped in some way. In that case, we can add more levels to the model. In this
module, we only consider two-level models, but more complicated structures follow along the same
lines.
Suppose the data are in n groups, with $m_i$ observations in group i. Now the data is
$$y = \{y_{ij} : i = 1, \ldots, n,\ j = 1, \ldots, m_i\}.$$
The observations $y_{i1}, \ldots, y_{im_i}$ depend on a set of parameters $\psi_i$, i = 1, . . . , n. These parameters $\psi_1, \ldots, \psi_n$ are assumed to be generated by a probability distribution with parameters θ. $p(\psi_i \mid \theta)$ acts like a prior distribution for $\psi_i$, but using a hierarchical model we can also estimate θ from the data.
The joint probability density for ψ and y, given θ, is
$$p(\psi, y \mid \theta) = \prod_{i=1}^{n}\left[ p(\psi_i \mid \theta) \prod_{j=1}^{m_i} p(y_{ij} \mid \theta, \psi_i)\right] = p(\psi \mid \theta)\, p(y \mid \theta, \psi)$$
where $\psi = (\psi_1, \ldots, \psi_n)$.
In the Bayesian framework, we specify a prior distribution p(θ), but not one for ψ, since we already have the probability distribution p(ψ | θ).
For a one-level model, Bayes’ theorem is
p(θ | y) ∝ p(θ) p(y | θ)
With a two-level model, we are estimating both θ and ψ, and so we have
p(θ , ψ | y) ∝ p(θ , ψ) p(y | θ , ψ).
Now p(θ, ψ) = p(θ) p(ψ | θ), and so
$$p(\theta, \psi \mid y) \propto p(\theta)\, p(\psi \mid \theta)\, p(y \mid \theta, \psi) = p(\theta)\, p(\psi, y \mid \theta).$$
p(ψ , y | θ) is taken as the likelihood, although only y and not ψ is the observed data.
Example 6.1. Assume the data are the times until failure for a batch of light-bulbs, t1, . . . , tm.
These times follow an exponential distribution with parameter λ, which is given a Gamma(α , β)
prior distribution. In example 3.1, it was assumed that we have estimates of the failure rate from k
previous studies, r1, . . . , rk. The sample mean and variance of these estimates were used to find the
two gamma distribution parameters.
Suppose that we had the data from the previous studies, $t_{i1}, \ldots, t_{im_i}$, i = 1, . . . , k. We could fit a single model to the entire set of data; a hierarchical model would make sense here. The number of groups (batches of light-bulbs) is n = k + 1. Each group i has its own failure rate $\lambda_i$, which we can take as following a Gamma(α, β) distribution. Conditional on $\lambda_i$, the data $t_{i1}, \ldots, t_{im_i} \sim \mathrm{Exp}(\lambda_i)$. We would also formulate a prior distribution for α and β. In the general notation of the previous section, the top-level parameter θ is the vector (α, β) here, and the group-level parameter $\psi_i$ is $\lambda_i$ here.
The hierarchical model is statistically more coherent than the approach of example 3.1, as it treats each batch of light-bulbs in the same way. Also, it correctly accounts for the uncertainty in the values of α and β.
6.1 Inference for hierarchical models
We estimate the joint posterior distribution of θ and ψ, p(θ , ψ | y).
When using MCMC, this means generating a sample of all these quantities, θ , ψ1, . . . , ψn. The simple
Metropolis algorithm will be less efficient when n is large, as we need to sample many parameters.
However there are other MCMC methods which can be used, such as Gibbs sampling. We are not
covering those methods in this module.
Once we have a sample from the posterior density p(θ , ψ1, . . . , ψn | y), we can treat this just like any
other joint posterior density. To obtain a sample from the marginal posterior density of any subset of
parameters, just ignore the other elements of the sample. For example, to get a sample from p(θ | y),
ignore ψ1, . . . , ψn.
6.1.1 Hierarchical model posterior predictions
Hierarchical models are useful for predictions. Now that there is more than one level to the model,
we can estimate the posterior predictive distribution of:
• a new data-point in a group in the dataset;
• a new data-point in a group not in the dataset.
Suppose that we have a sample of size M from the posterior distribution p(θ , ψ1, . . . , ψn | y):
θ1, ψ11, . . . , ψn1
θ2, ψ12, . . . , ψn2
. . .
θM , ψ1M , . . . , ψnM
To predict a new data-point x in group i, generate $x_k \sim p(x \mid \psi_{ik}, \theta_k)$ for k = 1, . . . , M.
To predict a new data-point z in a new group, generate $\tilde{\psi}_k \sim p(\psi \mid \theta_k)$, then $z_k \sim p(z \mid \tilde{\psi}_k, \theta_k)$ for k = 1, . . . , M.
Example 6.2. Suppose the data are the observed numbers of cases of a certain disease
y = {yij : i = 1, . . . , n, j = 1, . . . , mi}.
yij is the number of cases in district j within county i. The population of the district is Nij. We
assume that
yij ∼ Poisson(λiNij).
λi is the disease rate in county i, with λi ∼ Gamma(α , β), i = 1, . . . , n.
α and β are given prior distributions, for example half-normal. The joint prior distribution is p(α , β).
The posterior density is
$$p(\lambda_1, \ldots, \lambda_n, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{i=1}^{n}\left[ p(\lambda_i \mid \alpha, \beta) \prod_{j=1}^{m_i} p(y_{ij} \mid \lambda_i)\right].$$
p(λi | α , β) is the Gamma probability density function with outcome λi. p(yij | λi) is the Poisson
probability mass function for outcome yij with Poisson distribution parameter (mean) λiNij.
MCMC methods can be used to generate a sample of size M from p(λ1, . . . , λn, α , β | y):
λ11, . . . , λn1, α1, β1
λ12, . . . , λn2, α2, β2
. . .
λ1M , . . . , λnM , αM , βM
Then to make inferences about λ1, we just take the first column, λ11, . . . , λ1M , and this is a sample
from the marginal posterior distribution p(λ1 | y).
To make inferences about µ = α/β, set
$$\mu_k = \frac{\alpha_k}{\beta_k}, \quad k = 1, \ldots, M.$$
Then µ1, . . . , µM is a sample from the marginal posterior distribution p(µ | y), and we can use the
sample median and quantiles to estimate the posterior median and credible interval limits just as we
have done previously for one-parameter models.
Two types of posterior predictions that we can make are:
1. The posterior predictive distribution for the number of cases x in a new district (i.e. a district
not in our dataset), which is in county i, where county i does appear in our dataset. Let P be
the population of this district.
The procedure here is to generate
xk ∼ Poisson(λikP ), k = 1, . . . , M .
Then x1, . . . , xM is a sample from the posterior predictive distribution.
2. The posterior predictive distribution for the number of cases z in a new district, which is in a
county that is not in our dataset.
Now we need to generate a posterior predictive sample for the disease rate in the new county as well. Let Q be the population of this district. The procedure is to generate
$$\tilde{\lambda}_k \sim \mathrm{Gamma}(\alpha_k, \beta_k), \quad k = 1, \ldots, M,$$
and then
$$z_k \sim \mathrm{Poisson}(\tilde{\lambda}_k Q), \quad k = 1, \ldots, M.$$
Now z1, . . . , zM is a sample from the posterior predictive distribution.
The sample mean of z1, . . . , zM is an estimate of the posterior predictive mean for z, and the
2.5th and 97.5th percentiles of the sample form a 95% posterior predictive interval for z.
To find the posterior predictive probability that z will be zero, we just count what proportion
of the sample is equal to zero.
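A minimal R sketch of both kinds of prediction. Since no real MCMC run is shown here, the posterior draws alpha_s, beta_s and lambda_s below are placeholders standing in for MCMC output, and all numerical values are illustrative:

set.seed(1)
M <- 5000; n <- 4
alpha_s <- rgamma(M, 20, 10)     # placeholder posterior draws of alpha
beta_s <- rgamma(M, 20, 0.01)    # placeholder posterior draws of beta
lambda_s <- matrix(rgamma(M * n, alpha_s, beta_s), nrow = M)  # lambda_i draws by column
P <- 10000                                # population of a new district in county 1
x <- rpois(M, lambda_s[, 1] * P)          # prediction 1: new district, county in the data
Q <- 8000                                 # population of a district in a new county
lambda_new <- rgamma(M, alpha_s, beta_s)  # draw a rate for the new county
z <- rpois(M, lambda_new * Q)             # prediction 2: new district, new county
quantile(z, c(0.025, 0.975))              # 95% posterior predictive interval for z
mean(z == 0)                              # estimated P(z = 0)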
6.2 Advantages of hierarchical models
Pooling information
We can pool information while allowing some variation. Pooling information means using information from multiple groups to estimate the distribution p(ψ | θ). Then in groups with little data (e.g. in example 6.2, few districts or only districts with small populations), the posterior mean of $\psi_i$ is close to the mean of p(ψ | y). If the data in group i provides more information (e.g. many districts or some districts with large populations), then the posterior mean of $\psi_i$ is mainly determined by $y_{i1}, \ldots, y_{im_i}$.
Posterior predictions
We can make posterior predictions for more quantities than with a non-hierarchical model.
We could analyse data from each group separately. In that case, we can predict a new observation
in the same group, as this situation is just a single-level model, estimating ψi from the sample
yi1, . . . , yimi . But we can’t predict a new observation in a group not in our dataset.
With a hierarchical model, the probability distribution p(ψ | θ) allows us to generalize to a new
group.
Correctly accounting for uncertainty
If there is variation between groups in the parameter(s) ψi, this implies that the observations in the
same group i are correlated with each other. Hence if we fitted a single-level model that assumed the
data-points are independent, then we would tend to underestimate the uncertainty in the parameter
estimates. So credible intervals (or confidence intervals in the frequentist framework) would be too
narrow.
Reducing the influence of arbitrary prior distributions
The hierarchical structure moves the prior distribution further away from the inferences. In example
6.2, the disease rates in each county follow a gamma distribution
λi ∼ Gamma(α , β), i = 1, . . . , n.
If instead we fitted a non-hierarchical model, we could fix α and β as prior parameters for each λi.
In the hierarchical model we define a prior distribution p(α , β), and estimate α and β. So with the
hierarchical model, the choice of prior distribution has less influence on our inferences about λi. The
level at which we choose fixed prior parameters is moved one level further away from the λi.
Chapter 7
Tests and model comparisons
The final topic is a brief look at Bayesian versions of hypothesis tests and model comparisons.
Example 7.1. Consider the normal example 2.4, in which the data y1, . . . , yn ∼ N(µ, σ2), and σ is
known. Suppose that we are interested in deciding whether or not µ is zero.
In the frequentist framework, we have a null and an alternative hypothesis. In this case, the null
hypothesis would be H0 : µ = 0. As a one-sided alternative hypothesis, take H1 : µ > 0. The null
hypothesis is tested using the p-value. This is the probability of observing a statistic at least as
extreme as the observed value, if H0 is true.
Since σ is known, we can use the sample mean $\bar{Y}$ as a test statistic. If H0 is true, then $\bar{Y} \sim N(0, \sigma^2/n)$. If we observe $\bar{Y} = \bar{y}$, then the one-sided p-value is
$$P(\bar{Y} \ge \bar{y} \mid \mu = 0) = P\left(\frac{\sqrt{n}\,\bar{Y}}{\sigma} \ge \frac{\sqrt{n}\,\bar{y}}{\sigma} \,\bigg|\, \mu = 0\right) = 1 - \Phi\left(\frac{\sqrt{n}\,\bar{y}}{\sigma}\right).$$
The Bayesian framework does not use p-values. Instead, probability statements are based on the
posterior distribution p(µ | y). We can use this distribution to calculate posterior probabilities such
as P(µ > 0 | y). With prior distribution $\mu \sim N(\mu_0, \sigma_0^2)$, the posterior distribution is
$$\mu \sim N(\mu_n, \sigma_n^2), \quad \text{where} \quad \mu_n = \frac{\mu_0/\sigma_0^2 + n\bar{y}/\sigma^2}{1/\sigma_0^2 + n/\sigma^2}, \qquad \sigma_n^2 = \frac{1}{1/\sigma_0^2 + n/\sigma^2}.$$
For sufficiently large $\sigma_0^2$, i.e. for an uninformative prior distribution,
$$\mu_n \approx \bar{y}, \qquad \sigma_n^2 \approx \sigma^2/n.$$
Hence the posterior probability is
$$P(\mu \le 0 \mid y) = P\left(\frac{\sqrt{n}(\mu - \bar{y})}{\sigma} \le -\frac{\sqrt{n}\,\bar{y}}{\sigma} \,\bigg|\, y\right) \approx \Phi\left(-\frac{\sqrt{n}\,\bar{y}}{\sigma}\right) = 1 - \Phi\left(\frac{\sqrt{n}\,\bar{y}}{\sigma}\right).$$
This is the same numerical value as the p-value. Hence, the posterior probability and the p-value may
be the same or similar, but they do not have to be, as they are probabilities of different quantities.
Example 7.2. For another example, suppose we want to know if a coin is fair, or if it is biased
towards heads. The coin is tossed n times and we observe k heads. Let the probability of heads be
q, and take k ∼ Bin(n, q). A frequentist approach would set
H0 : q = 0.5
H1 : q > 0.5
Label the random variable for the number of heads as X. The one-sided p-value for testing H0 is
P (X ≥ k | q = 0.5).
Suppose n = 5, and that we observed k = n. Then the p-value is $P(X = k \mid q = 0.5) = 0.5^5 = 0.0313$.
For a Bayesian version, we can take a uniform prior distribution for q. Then the posterior distribution
is Beta(k + 1, n− k + 1). If k = n = 5, the posterior is Beta(6, 1). The normalized posterior density
and cdf are
$$p(q \mid k) = 6q^5, \qquad F(q) = q^6.$$
The posterior probability that q ≤ 0.5 is
$$P(q \le 0.5 \mid k) = F(0.5) = 0.5^6 = 0.0156.$$
Here we have two different numbers from a frequentist and Bayesian approach, although in this case
they are not hugely different.
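Both numbers are easy to verify in R:

n <- 5; k <- 5
1 - pbinom(k - 1, n, 0.5)      # one-sided p-value P(X >= k) = 0.5^5 = 0.03125
pbeta(0.5, k + 1, n - k + 1)   # posterior P(q <= 0.5) = 0.5^6 = 0.015625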
In these two examples, we compared one-sided tests with posterior probabilities. With a continuous
prior and posterior distribution, the posterior probability P (µ ̸= 0 | y) is 1, since any single point
has zero probability, and so there is no posterior probability to compare to the two-sided test.
Multiple models
So far, we have been using Bayes’ theorem in the form
p(θ | y) ∝ p(θ) p(y | θ).
We have usually only needed p(θ | y) up to a constant of proportionality. With conjugate distri-
butions, we can recognise the full posterior density, and when using the Metropolis algorithm, the
normalizing constant cancels out.
Bayes’ theorem in full is
$$p(\theta \mid y) = \frac{p(\theta)\, p(y \mid \theta)}{p(y)}.$$
The denominator is
$$p(y) = \int p(\theta)\, p(y \mid \theta)\, d\theta.$$
This integral is usually harder to calculate compared to the methods for inference that we have
covered so far, especially if θ is multi-dimensional. But it is used in one method for Bayesian model
comparison.
Suppose that we have more than one candidate model that might fit the data. Label the models
M1, M2, . . . , Mr. We assume that one of these models generated the data. Each model has a vector
of parameters θk, k = 1, . . . , r.
For example, in example 7.1, we could take models
$$M_1 : y_i \sim N(0, \sigma^2)$$
$$M_2 : y_i \sim N(\mu, \sigma^2)$$
We expand the notation to make probabilities conditional upon a particular model. The prior
distribution within model Mk becomes p(θk |Mk) and the likelihood is p(y | θk, Mk). Bayes’ theorem
for model Mk becomes
$$p(\theta_k \mid y, M_k) = \frac{p(\theta_k \mid M_k)\, p(y \mid \theta_k, M_k)}{p(y \mid M_k)}.$$
The term p(y | Mk) can be used in Bayes’ theorem in another way, for looking at probabilities
of different models rather than of parameter values within one model. To do this, we need to
specify prior probabilities for each model, p(Mk), k = 1, . . . , r. We could choose a discrete uniform distribution
$$p(M_k) = \frac{1}{r}, \quad k = 1, \ldots, r.$$
However, we do not have to choose this distribution.
Then Bayes’ theorem is applied to give
$$p(M_k \mid y) = \frac{p(M_k)\, p(y \mid M_k)}{p(y)}$$
where
$$p(y \mid M_k) = \int p(\theta_k \mid M_k)\, p(y \mid \theta_k, M_k)\, d\theta_k.$$
The denominator is
$$p(y) = \sum_{j=1}^{r} p(M_j)\, p(y \mid M_j).$$
p(Mk | y) is the posterior probability that model Mk is the true model. This provides a Bayesian method for choosing between models.
Bayes factors
Suppose that we are just considering two models, M1 and M2. The ratio of posterior probabilities is
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(M_1)}{p(M_2)} \times \frac{p(y \mid M_1)}{p(y \mid M_2)}.$$
In general, for a probability p, the term odds means $\frac{p}{1-p}$.
The prior odds of model M1 vs M2 is
$$\frac{p(M_1)}{p(M_2)} = \frac{p(M_1)}{1 - p(M_1)}$$
and the posterior odds of model M1 vs M2 is
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(M_1 \mid y)}{1 - p(M_1 \mid y)}.$$
These are the first two terms in the ratio of posterior probabilities. The third term is called the
Bayes factor B12:
$$B_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)} = \frac{\int p(\theta_1 \mid M_1)\, p(y \mid \theta_1, M_1)\, d\theta_1}{\int p(\theta_2 \mid M_2)\, p(y \mid \theta_2, M_2)\, d\theta_2}.$$
We have:
Posterior odds = prior odds× Bayes factor
p(θk | Mk) and p(y | θk, Mk) are the prior and likelihood for model Mk. Both of these need to be properly normalized (unless the same constant appears in both M1 and M2).
A Bayes factor B12 greater than 1 supports model M1, whereas a value less than 1 supports model
M2. Rules of thumb for the size of the Bayes factor have been suggested, for example:
Range of $B_{12}$        Evidence for $M_1$
1 to $10^{1/2}$          slight
$10^{1/2}$ to 10         moderate
10 to $10^2$             strong
> $10^2$                 decisive
Example 7.3. Continuing the fair-coin example 7.2, now let there be two models
M1 : k ∼ Bin(n, 0.5)
M2 : k ∼ Bin(n, q)
Model M1 has no unknown parameters, and model M2 has one, namely q.
For model M1, there are no parameters to integrate over, and so p(k | M1) is just the probability of the observed data if q = 0.5. If as before we observe k = n, this is
$$p(k \mid M_1) = 0.5^n.$$
For model M2,
$$p(k \mid M_2) = \int_0^1 p(q \mid M_2)\, p(k \mid q, M_2)\, dq.$$
p(q | M2) is the prior distribution for q and p(k | q , M2) is the likelihood. For a general Beta(α , β)
prior distribution, we have
$$p(k \mid M_2) = \int_0^1 \frac{q^{\alpha-1}(1-q)^{\beta-1}}{B(\alpha, \beta)} \binom{n}{k} q^k (1-q)^{n-k}\, dq$$
where B(α, β) is the beta function. The constants $B(\alpha, \beta)$ and $\binom{n}{k}$ need to be included in the calculation of p(k | M2).
Suppose again that we take a uniform prior for q, and that n = 5, and we observe k = 5. Then p(q | M2) = 1 and $p(k \mid q, M_2) = q^n$.
$$p(k \mid M_2) = \int_0^1 q^n\, dq = \left[\frac{q^{n+1}}{n+1}\right]_0^1 = \frac{1}{n+1}.$$
The Bayes factor is
$$B_{12} = \frac{p(k \mid M_1)}{p(k \mid M_2)} = (n+1)\, 0.5^n = \frac{n+1}{2^n} = 0.1875.$$
We previously calculated a one-sided p-value as $0.5^n$. A two-sided p-value is more comparable to the Bayes factor. The two-sided p-value is
$$P(X = 0 \mid q = 0.5) + P(X = n \mid q = 0.5) = 2 \times 0.5^n = \frac{1}{2^{n-1}} = 0.0625.$$
Now take prior probabilities for each model as p(M1) = 1/2 and p(M2) = 1/2. The ratio of posterior
probabilities is
$$\frac{p(M_1 \mid k)}{p(M_2 \mid k)} = \frac{p(M_1)}{p(M_2)}\, B_{12} = B_{12}.$$
p(M2 | k) = 1− p(M1 | k). Rearranging gives the posterior probability of model M1 as
$$p(M_1 \mid k) = \frac{B_{12}}{1 + B_{12}} = 0.158.$$
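These numbers can be checked in R by doing the integral for p(k | M2) numerically:

n <- 5; k <- 5
m1 <- dbinom(k, n, 0.5)                      # p(k | M1) = 0.5^n
m2 <- integrate(function(q) dbinom(k, n, q), # p(k | M2) with a uniform prior
                lower = 0, upper = 1)$value
B12 <- m1 / m2
B12                # Bayes factor, 0.1875
B12 / (1 + B12)    # posterior probability of M1 with equal prior probabilities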
Sensitivity to prior distributions
Suppose that we are comparing two models M1 and M2, and that model M1 has a single parameter $\theta_1 \in \mathbb{R}$, with prior distribution $\theta_1 \sim N(0, \sigma_0^2)$.
$$p(y \mid M_1) = \int p(\theta_1 \mid M_1)\, p(y \mid \theta_1, M_1)\, d\theta_1$$
In typical problems, the likelihood p(y | θ1, M1) approaches zero for θ1 outside some range (−A, A).
If we take σ0 to be large enough (i.e. a flat, uninformative prior for θ1), then
$$p(\theta_1 \mid M_1) \approx \frac{1}{\sqrt{2\pi}\,\sigma_0} \quad \text{for } -A < \theta_1 < A.$$
Hence for large enough σ0, the Bayes factor is
$$B_{12} \approx \frac{1}{\sqrt{2\pi}\,\sigma_0} \cdot \frac{\int p(y \mid \theta_1, M_1)\, d\theta_1}{\int p(\theta_2 \mid M_2)\, p(y \mid \theta_2, M_2)\, d\theta_2}.$$
So if for example we replace a very large σ0 by 100 σ0, then B12 is divided by 100. However,
the posterior distribution within model M1 will hardly change, as the posterior is approximately
proportional to the likelihood for large σ0.
Because of this sensitivity of Bayes factors, and therefore of posterior model probabilities, to the prior
distribution, many Bayesian statisticians prefer not to use these methods for comparing models.
There are alternatives for checking or comparing models which combine Bayesian and frequentist
ideas, such as posterior predictive checks. We do not have time to cover these in this module.
Another option is to avoid choosing among different models, and instead construct a sufficiently
flexible model. Models with many parameters can be easier to deal with in the Bayesian framework:
• conceptually, we can go from the joint posterior to the marginal posterior distribution;
• having slightly informative prior distributions helps if there is not enough data to estimate all parameters using the current dataset alone.