STAT 431 — Applied Bayesian Analysis — Course Notes
Topics in Model Comparison and Assessment
Spring 2020
Bayes Factors
Recall: Bayesians don't use frequentist p-values. What can we use instead?

Earlier, we used posterior probabilities of one-sided hypotheses (some of which happened to equal p-values in certain noninformative cases).

But we would like a more general approach ...
Simple-vs-Simple Case

Consider data y which may follow one of two different models, M0 and M1.

Assume each of these models fully specifies a distribution for y, and that the distributions have densities

    p(y | M0)  and  p(y | M1)

Then

    H0: M0 is true        H1: M1 is true

are two simple hypotheses.
Let the models have prior probabilities

    P(M0) (= P(H0))        P(M1) (= P(H1))

We assume P(M0) > 0 and P(M1) > 0, but it is not necessary that they sum to 1.

The prior odds in favor of M1 are

    P(M1) / P(M0)
By Bayes' rule,

    P(M0 | y) ∝ P(M0) p(y | M0)
    P(M1 | y) ∝ P(M1) p(y | M1)

with the same normalizing constant (p(y)) in both cases.

The posterior odds in favor of M1 are

    P(M1 | y) / P(M0 | y) = [P(M1) p(y | M1)] / [P(M0) p(y | M0)]
                          = prior odds × p(y | M1) / p(y | M0)
The Bayes factor in favor of M1 versus M0 is

    BF1,0 = posterior odds / prior odds = p(y | M1) / p(y | M0)

Interpretation: BF1,0 is the factor by which the "odds" of M1 (relative to M0) change due to the data.

So, for example,

- BF1,0 ≈ 1 means that the data do not distinguish between the models very well
- BF1,0 ≫ 1 means that the data strongly support M1 over M0
Notice: In this simple-vs-simple case, the Bayes factor BF1,0

- equals the likelihood ratio L(M1; y) / L(M0; y)
- does not depend on the prior — it is the same for any valid values of P(M0) and P(M1)
Example: Does Waldo ride the bus?

    M1 = does ride        M0 = doesn't ride

    y = 1 if Waldo lives in an apartment, 0 if not

Based on the class survey (our best guess),

    p(y | M1) = 0.91 for y = 1,  0.09 for y = 0
    p(y | M0) = 0.81 for y = 1,  0.19 for y = 0
If Waldo lives in an apartment (y = 1),

    BF1,0 = 0.91 / 0.81 ≈ 1.12

and, if not (y = 0),

    BF1,0 = 0.09 / 0.19 ≈ 0.47

So living in an apartment increases the odds of riding the bus, while not living in an apartment decreases them.
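The same calculations in R (a minimal sketch; the densities are the class-survey values above, and the function name is ours):

p_y_M1 <- c("0" = 0.09, "1" = 0.91)   # p(y | M1): does ride
p_y_M0 <- c("0" = 0.19, "1" = 0.81)   # p(y | M0): doesn't ride

bf10 <- function(y) unname(p_y_M1[as.character(y)] / p_y_M0[as.character(y)])

bf10(1)   # ~ 1.12: apartment mildly favors riding
bf10(0)   # ~ 0.47: no apartment favors not riding

# Updating odds: posterior odds = prior odds * BF, for any valid prior odds
prior_odds <- 1          # e.g., P(M1) = P(M0)
prior_odds * bf10(1)     # posterior odds of riding, given y = 1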
An Interpretation Scale

    BF1,0          data evidence for M1 (H1) vs. M0 (H0)
    ----------     --------------------------------------
    1 to 3.2       Barely worth mentioning
    3.2 to 10      Substantial
    10 to 100      Strong
    > 100          Decisive
More General Case

Consider modeling data y. Suppose models M0 and M1 have (unknown) parameters:

    θ0 for M0        θ1 for M1

We will assume the models are "disjoint": they don't share any distributions for y.
Suppose the models have (conditional) priors

    p(θ0 | M0)        p(θ1 | M1)

Then

    p(y | M0) = ∫ p(y, θ0 | M0) dθ0
              = ∫ p(θ0 | M0) p(y | θ0, M0) dθ0
                    [prior]     [M0 model]

and similarly

    p(y | M1) = ∫ p(θ1 | M1) p(y | θ1, M1) dθ1

These are the marginal likelihoods of M0 and M1 (under their respective priors).
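To make the marginal likelihood concrete, here is a minimal R sketch (not from the notes) that approximates p(y | M) by averaging the likelihood over draws from the prior. The model and data are hypothetical, and this naive estimator is often far too noisy for real use, which hints at the computational difficulty noted below.

# Naive Monte Carlo estimate of a marginal likelihood:
#   p(y | M) = integral of p(theta | M) p(y | theta, M) dtheta
# Hypothetical model M: y_i ~ iid Poisson(theta), prior theta ~ Gamma(2, 1)
set.seed(431)
y     <- c(3, 5, 4)                                # made-up data
theta <- rgamma(1e5, shape = 2, rate = 1)          # draws from the prior
lik   <- vapply(theta, function(t) prod(dpois(y, t)), numeric(1))
mean(lik)                                          # estimate of p(y | M)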
The Bayes factor in favor of M1 versus M0 is

    BF1,0 = p(y | M1) / p(y | M0)

Notes:

- Unlike in the simple-vs-simple case, this Bayes factor does depend on the priors — it is not purely a measure of the evidence in the data.
- Both priors must be proper — otherwise, the Bayes factor would depend on an arbitrary scaling. [See p. 34, Sec. 2.3.3 of Marin, J.-M. & Robert, C. P. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York.]
Unfortunately, Bayes factors are generally difficult to compute, requiring specialized methods. But they can be easily computed for certain types of hypothesis tests ...
For Hypothesis Testing

Consider a model with data densities p(y | θ) and prior p(θ). Consider testing

    H0: θ ∈ Θ0        H1: θ ∈ Θ1

where Θ0 ∩ Θ1 = ∅, and both have positive prior probability.

Regard this as a test of two data model/prior combinations:

    M0: p(y | θ), θ ∈ Θ0, with prior p(θ) restricted to Θ0
    M1: p(y | θ), θ ∈ Θ1, with prior p(θ) restricted to Θ1
Proposition

In this case, the Bayes factor in favor of M1 versus M0 is

    BF1,0 = [P(H1 | y) / P(H0 | y)] / [P(H1) / P(H0)]    (= posterior odds / prior odds)

We call this the Bayes factor in favor of H1 (versus H0).
Proof.

    p(y | M1) = ∫_{Θ1} [p(θ) / P(H1)] p(y | θ) dθ
                        [prior restricted to Θ1] [M1 model on Θ1]

              = [1 / P(H1)] ∫_{Θ1} p(θ) p(y | θ) dθ

              = [p(y) / P(H1)] ∫_{Θ1} p(θ | y) dθ      (since p(θ) p(y | θ) = p(y) p(θ | y))

              = p(y) P(H1 | y) / P(H1)

and similarly

    p(y | M0) = p(y) P(H0 | y) / P(H0)

so the result follows by taking the ratio.
Example: Are shark attacks becoming more frequent?

Recall the model for the yearly number of attacks:

    Y | λ ∼ Poisson(λ)
    ln(λ) = β0 + β1 (year − x̄)        (x̄ = the average year)

so we consider

    H0: β1 ≤ 0 (no)        H1: β1 > 0 (yes)

Recall the prior on β1:

    β1 ∼ N(0, 100²)
So

    P(H0) = 0.5        P(H1) = 0.5

From JAGS (Example 10.1):

    P(H1 | y) ≈ 0.99999        P(H0 | y) ≈ 1 − 0.99999 = 0.00001

So

    BF1,0 ≈ (0.99999 / 0.00001) / (0.5 / 0.5) = 99999

representing decisive evidence that shark attacks are becoming more frequent.
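By the Proposition, this Bayes factor can be estimated directly from MCMC output. A sketch in R, assuming beta1 is a vector of posterior draws of β1 from the Example 10.1 fit (the variable name is an assumption):

# BF in favor of H1: beta1 > 0, from posterior draws `beta1`
post_H1 <- mean(beta1 > 0)         # approximates P(H1 | y)
post_H0 <- 1 - post_H1             # approximates P(H0 | y)
prior_odds <- 0.5 / 0.5            # the N(0, 100^2) prior is symmetric about 0
(post_H1 / post_H0) / prior_odds   # approximate BF1,0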
What about point-null hypotheses, like

    H0: β1 = 0

(perhaps corresponding to a two-sided test)?

Historically, Bayesians have generally considered this to be problematic: if β1 is a continuous parameter (with a continuous prior), then any single specified value has posterior probability 0.

See the discussion in Cowles, Sec. 11.2.2.
The Deviance Information Criterion

You may have heard of model selection criteria like AIC or BIC ...

Several models for the same data y are under consideration — possibly nested, possibly intersecting, possibly unrelated. They can differ in the type and number of parameters.

Model selection criteria aim to answer the question: which is the "best" model for the data?
Bayesians want a criterion that evaluates the prior, too ...

Suppose you have several candidate models for the same data y,

    M1, M2, ..., Mm

and suppose each model has its own prior.

Goal: Choose the "best" model/prior combination for predicting new data (of the same kind).
Let M be a particular model, parameterized by θ (continuous), under which the data have a density p(y | θ), and let p(θ) be a prior density. Let p(θ | y) be the resulting posterior density.

Define the deviance

    D(y, θ) = −2 ln p(y | θ)

(This is analytically computable, provided the density can be evaluated.)
Define

    D̂avg(y) = ∫ D(y, θ) p(θ | y) dθ    (D averaged over the posterior)

and

    Dθ̂(y) = D(y, θ̂)

where θ̂ is the posterior mean: θ̂ = E(θ | y).

Note: Both D̂avg(y) and Dθ̂(y) can be approximated from simulation output.
Define the effective number of parameters

    pD = D̂avg(y) − Dθ̂(y)

Note: This is not what a frequentist would call the "effective number of parameters." It is usually not an integer, and it can even be negative!
The deviance information criterion (DIC) value (for model M with prior p(θ)) is

    DIC = D̂avg(y) + pD
        = 2 D̂avg(y) − Dθ̂(y)
        = Dθ̂(y) + 2 pD

Usage: Choose the model/prior with minimal DIC.
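As a concrete illustration (not from the notes), here is how D̂avg(y), Dθ̂(y), pD, and DIC could be computed from posterior draws in R, for a hypothetical one-parameter Poisson model. Everything comes directly from simulation output, as noted above.

# DIC from posterior draws, for a hypothetical model:
#   y_i ~ iid Poisson(theta), prior theta ~ Gamma(2, 1)
# (conjugate, so the posterior is Gamma(2 + sum(y), 1 + n))
set.seed(431)
y     <- c(3, 5, 4)                                        # made-up data
theta <- rgamma(1e4, shape = 2 + sum(y), rate = 1 + length(y))

D <- function(t) -2 * sum(dpois(y, t, log = TRUE))         # deviance D(y, theta)

D_avg <- mean(vapply(theta, D, numeric(1)))   # Dhat_avg(y): posterior mean of D
D_hat <- D(mean(theta))                       # D_thetahat(y): D at the posterior mean
pD    <- D_avg - D_hat                        # effective number of parameters (~ 1 here)
DIC   <- D_avg + pD                           # = 2 * D_avg - D_hat
c(pD = pD, DIC = DIC)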
Remark: The motivation for DIC is similar to that of the frequentist-motivated AIC,

    AIC = D(y, θ̂MLE) + 2p

where p is the "true" (frequentist) number of free parameters (in θ), and θ̂MLE is the maximum likelihood estimate. In both criteria,

- the first term penalizes lack of fit
- the second term penalizes complexity
Warning: Models compared using DIC (or AIC or BIC) must be for exactly the same data y. For example, if the data are transformed, exactly the same transformation must be used for all models being compared.

Remark: Software usually assumes that θ contains all parameters and hyperparameters (at all levels) of a hierarchical model.
Example: Baby Rat Weights

Model 1: Bivariate Formulation (Separate Lines)

    Yij = weight (mass) of rat i at time xj
        ∼ indep. N(α0i + α1i (xj − x̄), σy²)

    αi = (α0i, α1i)ᵀ | β, Σα ∼ indep. N2(β, Σα)

(with appropriate priors on σy², β, Σα)
Model 2: Univariate Formulation, Separate Lines

    Yij ∼ indep. N(α0i + α1i (xj − x̄), σy²)

    α0i | β0, σα0² ∼ N(β0, σα0²)
    α1i | β1, σα1² ∼ N(β1, σα1²)

(all conditionally independent, with vague independent priors on σy², β0, β1, σα0², σα1²)
Model 3: Separate Intercepts, Common Slope

    Yij ∼ indep. N(α0i + β1 (xj − x̄), σy²)
    α0i | β0, σα0² ∼ indep. N(β0, σα0²)

Model 4: Same Line (SLR)

    Yij ∼ indep. N(β0 + β1 (xj − x̄), σy²)
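For orientation, Model 4 might be coded in JAGS along the following lines (a sketch, not the Example 11.1 code; the data names y and x, the index limits N and T, and the vague priors are assumptions):

model {
  for (i in 1:N) {           # N rats
    for (j in 1:T) {         # T measurement times
      y[i, j] ~ dnorm(mu[j], tau.y)
    }
  }
  for (j in 1:T) {
    mu[j] <- beta0 + beta1 * (x[j] - mean(x))   # common line for all rats
  }
  beta0 ~ dnorm(0, 0.0001)
  beta1 ~ dnorm(0, 0.0001)
  tau.y ~ dgamma(0.001, 0.001)   # JAGS dnorm uses the precision tau.y = 1/sigma2.y
}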
R/JAGS Example 11.1: DIC for Hierarchical Normal Regressions
Remark: JAGS uses a different version of pD, suggested by Plummer (in the discussion of Spiegelhalter et al., 2002).
Posterior Predictive Assessment

So far, we have considered only ways to compare different model/prior combinations to each other. How can we check whether a single model/prior combination is a good fit to the data?

Frequentist approach: a lack-of-fit test (such as a chi-square test — later), which usually produces a p-value — smaller indicates more evidence against the model.

A Bayesian wants to assess the prior and model together.
Notation:

    y = the data (vector)
    θ = the parameter (vector) in model M

We also need to choose a function called a discrepancy:

    T(y; θ)

It is intended to measure how far the observed data y depart from what would be expected under model M with parameter value θ. Larger values should indicate greater departures from the model.
An example of a discrepancy:

    T(y; θ) = ∑_{i=1}^{n} (yi − E(Yi | θ))² / Var(Yi | θ)

where the observed data y = (y1, ..., yn) is a numerical vector, and (Y1, ..., Yn) has its distribution under the model.

This is larger when the yi's are generally farther from their means than their variances would suggest, under the model with parameter value θ.

(Note: Similar in form to the classical chi-square statistic.)
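In R, such a discrepancy is one line. A sketch for any model whose means and variances are easy to evaluate (the numbers are purely illustrative):

# Discrepancy T(y; theta), given model means mu(theta) and variances v(theta)
T_disc <- function(y, mu, v) sum((y - mu)^2 / v)

# e.g., a Poisson model, where the mean and variance are both lambda:
lambda <- c(60, 65, 70)                          # illustrative values
T_disc(c(58, 80, 66), mu = lambda, v = lambda)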
A frequentist can't use T(y; θ) directly, since it depends on the unknown θ. Replacing θ with an estimate θ̂ might yield a reasonable test statistic

    T̂(y) = T(y; θ̂)

but its distribution under model M might still depend on the unknown θ.

If the distribution is (approximately) known, a frequentist can compute the (approximate) p-value

    p = P( T̂(Y^rep) ≥ T̂(y) )

where Y^rep is a replication of the data under the model.
In the earlier example,

    T̂(y) = ∑_{i=1}^{n} (yi − E(Yi | θ̂))² / Var(Yi | θ̂)

is the classical (Pearson) chi-square statistic (where θ̂ is usually the MLE).

When the model is correct, asymptotic theory often suggests that

    T̂(Y) | θ  is approximately  χ²_{n−k}

if θ effectively has k elements. When this approximation is valid, an approximate p-value is the χ²_{n−k} density tail area to the right of T̂(y).
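In R, that tail area is a single pchisq call (a sketch; the numbers are illustrative):

# Approximate frequentist p-value: chi-square tail area to the right of T-hat
That <- 15.2; n <- 13; k <- 2                    # illustrative values
pchisq(That, df = n - k, lower.tail = FALSE)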
In contrast, a Bayesian wants to

- assess the prior, not just the data model
- average over the distribution of θ (to avoid substituting an estimate θ̂)
- avoid any asymptotic approximations
Suppose

    Y^rep | θ ∼ M(θ)

and Y^rep is conditionally independent of the data.

Note: Averaging its distribution over the posterior gives the posterior predictive distribution of the data.

Note: We can generate Y^rep for a given θ by simulating from the data model, which is usually easy. A simulated value y^rep is called a replicate data set.
Then, instead of a frequentist p-value, a Bayesian could use a posterior predictive p-value

    pb = P( T(Y^rep; θ) ≥ T(y; θ) | y )

where the probability is over the joint posterior distribution of θ and Y^rep.

A sufficiently small pb indicates a problem with the model and/or prior.
Usually pb can't be computed directly. Instead, it can be approximated by posterior simulation (e.g., MCMC):

1. For each posterior-generated value θ, generate a replicate data set y^rep (conditionally) and compute T(y^rep; θ) and T(y; θ).

2. Approximate pb as the fraction of generated pairs (θ, y^rep) for which T(y^rep; θ) ≥ T(y; θ).
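These two steps take only a few lines of R. A sketch for a Poisson model, assuming y holds the observed counts and lambda_draws is an S × n matrix of posterior draws of (λ1, ..., λn) (both names are assumptions):

# Approximating p_b by simulation (steps 1-2 above)
T_disc <- function(y, lambda) sum((y - lambda)^2 / lambda)

exceed <- apply(lambda_draws, 1, function(lam) {
  yrep <- rpois(length(lam), lam)         # step 1: replicate data set for this draw
  T_disc(yrep, lam) >= T_disc(y, lam)     # compare replicate and observed discrepancies
})
mean(exceed)                              # step 2: approximate p_b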
Example: Shark Attacks

Recall:

    Yi = number of shark attacks (worldwide)
    xi = year (2005–2017)

Our chosen model and prior:

    Yi | λi ∼ indep. Poisson(λi)
    ln(λi) = β0 + β1 (xi − x̄)
    β0, β1 ∼ indep. N(0, 100²)
We will investigate fit using the chi-square discrepancy

    T(y; β0, β1) = ∑_{i=1}^{n} (yi − λi)² / λi

(Why is this the chi-square discrepancy? Under the Poisson model, E(Yi | λi) = Var(Yi | λi) = λi.)

If there are problems with the Poisson regression or the (vague) normal priors, we expect T(y; β0, β1) to be large relative to its replicate version, leading to a small pb.
The JAGS code:

data {
  xmean <- mean(x)
  for (i in 1:length(x)) {
    xcent[i] <- x[i] - xmean                     # center the years
  }
}
model {
  for (i in 1:length(y)) {
    y[i] ~ dpois(lambda[i])
    log(lambda[i]) <- beta0 + beta1 * xcent[i]
    yrep[i] ~ dpois(lambda[i])                   # replicate data, given lambda[i]
  }
  chisq    <- sum((y - lambda)^2 / lambda)       # T(y; beta0, beta1)
  chisqrep <- sum((yrep - lambda)^2 / lambda)    # T(yrep; beta0, beta1)
  pb.ind   <- chisqrep >= chisq                  # indicator; its posterior mean estimates p_b
  beta0 ~ dnorm(0, 0.0001)                       # N(0, 100^2): JAGS uses precision
  beta1 ~ dnorm(0, 0.0001)
}
Note: Arithmetic operations are automatically vectorized in JAGS.

Note: The data set used here has only the years 2005 to 2017 (and not the missing 2018 observation).
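For reference, the program above might be run from R with the rjags package along these lines (a sketch; the file name sharks.jags and the data object attacks are assumptions):

# Running the JAGS program above via rjags (names assumed)
library(rjags)
d <- list(x = 2005:2017, y = attacks)       # `attacks`: observed yearly counts
m <- jags.model("sharks.jags", data = d, n.chains = 4)
update(m, 2000)                             # burn-in
s <- coda.samples(m, c("beta0", "beta1", "pb.ind"), n.iter = 10000)
mean(as.matrix(s)[, "pb.ind"])              # approximate p_b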
We will also try an alternative version of the model that has a badly mis-specified prior ...
R/JAGS Example 11.2: Assessment for Poisson Regression
Remark: For some choices of discrepancy T, obtaining an especially large value of pb would also indicate a problem with the model and/or prior. See Cowles, Sec. 11.4, for an example.