ECON 5120 Bayesian Data Analysis, second half
• Instructor: Kenichi Shimizu
(kenichi.shimizu@glasgow.ac.uk)
• Lectures: 9am-11am on Wednesdays
• Labs: 1-3pm on Wednesdays (online)
• Office hours: I usually stay for questions after the lectures.
You can also email me to make an appointment
ECON 5120 Bayesian Data Analysis, second half
• Coursework 1 (20%) Due: Monday, 21st February 9am
• Coursework 2 (20%) Due: Friday, 25th March
• Degree exam (60%): TBA, sometime between 25 April and 20 May
Course outline for the second half
• Week 6 Shrinkage priors
• Week 7 Shrinkage priors / Nonparametric methods
• Week 8 Nonparametric methods
• Week 9 Time series
• Week 10 Time series
Lecture slides will be posted prior to each lecture
Lecture 6: Shrinkage priors
Kenichi Shimizu
University of Glasgow
ECON 5120 Bayesian Data Analysis
Main material
Korobilis and Shimizu (2021). “Bayesian Approaches to
Shrinkage and Sparse Estimation: A guide for applied
econometricians,” forthcoming in Foundations and Trends in
Econometrics
https://ideas.repec.org/p/gla/glaewp/2021_19.html
Outline
A review of posterior sampling
Introduction to shrinkage priors
Student-t shrinkage
Bayesian Lasso
Global-local shrinkage priors
Conjugate vs independent priors
Computation
Review of Normal Inverse Gamma priors for linear regression models
Suppose that we have the following regression model:

y | X, β, σ² ∼ N_n(Xβ, σ²I)   (1)
Assume an Inverse Gamma prior on σ²:

σ² ∼ iGamma(a, b)   (2)

• We use the definition: if x ∼ iGamma(a, b), then it has density p(x) = (b^a / Γ(a)) (1/x)^{a+1} exp(−b/x).
• In Matlab: x = 1/gamrnd(a,1/b) (a short check appears below).
• The improper uninformative prior p(σ²) ∝ σ^{−2} corresponds to the case a, b → 0.
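The density above is easy to check numerically. Below is a minimal Matlab sketch (hyperparameter values are illustrative; gamrnd requires the Statistics and Machine Learning Toolbox): it draws from iGamma(a, b) as the reciprocal of a gamma draw and compares the sample mean with the theoretical mean b/(a − 1), which holds for a > 1.

  % Draw from iGamma(a,b) as the reciprocal of a Gamma(shape a, scale 1/b) draw
  a = 3; b = 2;                          % illustrative hyperparameters
  draws = 1 ./ gamrnd(a, 1/b, 1e5, 1);   % gamrnd uses the shape/scale parameterization
  [mean(draws), b/(a-1)]                 % sample mean vs theoretical mean b/(a-1)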
Dependent Normal Inverse Gamma prior
First, suppose that the prior on β is defined conditional on σ². The hierarchical structure is summarized as follows:

β | σ² ∼ N_p(μ_β, σ²V_β)

The conditional posteriors are of the form

β | σ², y ∼ N_p(V[X′y + V_β^{-1}μ_β], σ²V),
σ² | β, y ∼ iGamma(a + (n + p)/2, b + (1/2)[(y − Xβ)′(y − Xβ) + (β − μ_β)′V_β^{-1}(β − μ_β)]),

where V = (X′X + V_β^{-1})^{-1}.
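For concreteness, a minimal Matlab sketch of one sweep of draws from these two conditional posteriors is given below. It assumes that y, X, the prior inputs mu_beta, V_beta, a, b, and a current value of sig2 already exist in the workspace; all variable names are illustrative.

  [n, p] = size(X);
  Vbinv  = V_beta \ eye(p);                               % V_beta^{-1}
  V      = (X'*X + Vbinv) \ eye(p);                       % V = (X'X + V_beta^{-1})^{-1}
  m      = V * (X'*y + Vbinv*mu_beta);                    % conditional posterior mean of beta
  beta   = m + chol(sig2*V, 'lower') * randn(p, 1);       % beta | sig2, y ~ N_p(m, sig2*V)
  resid  = y - X*beta;
  b_post = b + 0.5*(resid'*resid + (beta - mu_beta)'*Vbinv*(beta - mu_beta));
  sig2   = 1 / gamrnd(a + (n + p)/2, 1/b_post);           % sig2 | beta, y ~ iGamma(a+(n+p)/2, b_post)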
Independent Normal Inverse Gamma prior
In this case, β and σ² are a priori independent:

β ∼ N_p(μ_β, V_β)

The conditional posteriors are of the form

β | σ², y ∼ N_p(V[X′y/σ² + V_β^{-1}μ_β], V),
σ² | β, y ∼ iGamma(a + n/2, b + (1/2)(y − Xβ)′(y − Xβ)),

where V = (X′X/σ² + V_β^{-1})^{-1}.
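The corresponding sweep under the independent prior differs only in where σ² enters; a minimal Matlab sketch under the same illustrative workspace assumptions as before:

  [n, p] = size(X);
  Vbinv  = V_beta \ eye(p);
  V      = (X'*X/sig2 + Vbinv) \ eye(p);                  % V = (X'X/sig2 + V_beta^{-1})^{-1}
  beta   = V*(X'*y/sig2 + Vbinv*mu_beta) + chol(V, 'lower')*randn(p, 1);
  resid  = y - X*beta;
  sig2   = 1 / gamrnd(a + n/2, 1/(b + 0.5*(resid'*resid)));   % sig2 | beta, y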
Hierarchical Bayes for Linear Regression
Assuming that interest lies in learning about the regression coefficients β, a simple hierarchical specification takes the form

p(y_i | β, σ²) ∼ N(x_iβ, σ²),  i = 1, ..., n,   (3)
p(β_j | τ²) ∼ N(0, τ²),  j = 1, ..., p,   (4)
p(τ²) ∼ F(a, b),   (5)

where F(a, b) denotes some distribution function with hyperparameters a and b. Because the choice of τ² is so crucial for the posterior of β_j, the idea behind this hierarchical specification is to treat the hyperparameter τ² as a random variable and learn about it via Bayes' theorem. For that reason, a prior such as the one in equations (4)-(5) is often referred to as a full-Bayes prior, as it allows for full quantification of uncertainty around the parameters of interest.
Hierarchical Bayes for Linear Regression
An important feature of the hierarchical prior in equations (3)-(5) is that, while the conditional prior p(β_j | τ²) is Gaussian, the marginal (unconditional) prior for β_j is non-Gaussian. Indeed, the marginal prior for β_j becomes

p(β_j) = ∫ N(β_j; 0, τ²) p(τ²) dτ²,   (6)

that is, a scale mixture of normals representation that can approximate very complex prior shapes for β_j. Mixtures have the benefit of allowing for classification and grouping of parameters. In the case of identifying sparsity and shrinkage, we can think of the mixture prior as grouping parameters into "important" and "non-important" ones.
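The effect of the mixing distribution p(τ²) is easy to see by simulating from (6): draw τ² from its prior and then β_j | τ². The minimal Matlab sketch below mirrors the mixtures in the figure on the next slide (the scale 100 and the particular mixing choices come from that figure; chi2rnd and exprnd require the Statistics and Machine Learning Toolbox).

  N = 1e5;
  tau2_chi2 = chi2rnd(1, N, 1);            % tau^2 ~ chi-square(1)
  tau2_exp  = exprnd(0.5, N, 1);           % Exp prior on tau^2 (exprnd is parameterized by the
                                           % mean; use exprnd(2, N, 1) if 0.5 denotes a rate)
  tau2_bern = double(rand(N, 1) < 0.9);    % tau^2 ~ Bernoulli(0.9)
  beta_chi2 = sqrt(100*tau2_chi2) .* randn(N, 1);   % beta | tau^2 ~ N(0, 100*tau^2)
  beta_exp  = sqrt(100*tau2_exp)  .* randn(N, 1);
  beta_bern = sqrt(100*tau2_bern) .* randn(N, 1);
  histogram(beta_chi2, 200)                % compare with the fixed-variance N(0, 100) case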
Hierarchical Bayes for Linear Regression
Finally, the posterior mode of β under a hierarchical prior has a penalized likelihood representation. For the linear regression model, penalized likelihood problems admit a regularized least squares form

min_{β ∈ R^p} (1/n)||y − Xβ||² + g(β, λ),   (7)

where the first term is the usual least squares problem and the function g(β, λ) defines the penalty term as a function of the regression parameters β and a scalar (or vector) tuning parameter λ. Numerous penalized estimators, such as ridge (Tikhonov regularization), lasso and elastic net, fall under the general form in (7), and Bayesian modal estimators under suitable hierarchical priors can fully recover all of them.
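As a concrete special case, consider the ridge penalty g(β, λ) = λ||β||² (dropping the 1/n scaling for simplicity). Its closed-form solution coincides with the conditional posterior mean of β under the prior β ∼ N(0, (σ²/λ)I); a minimal Matlab sketch with an illustrative λ and data assumed to exist in the workspace:

  lambda     = 1.0;                                  % illustrative tuning parameter
  p          = size(X, 2);
  beta_ridge = (X'*X + lambda*eye(p)) \ (X'*y);      % ridge / Tikhonov solution
  % Under beta ~ N(0, (sig2/lambda)*I), the conditional posterior mean is
  % (X'X + lambda*I)^{-1} X'y, i.e. identical, illustrating the prior-penalty link.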
Scale mixture of normals as a shrinkage device
[Figure: four panels showing a Normal prior together with Normal-χ², Normal-Exponential, and Normal-Bernoulli mixture priors for a scalar parameter.]

Figure: Hierarchical priors for a scalar parameter. In all four panels the base distribution is β ∼ N(0, 100 × τ²). In the top-left panel τ² = 1 is a fixed hyperparameter, while in the remaining panels it follows χ²(1), Exp(0.5), and Bernoulli(0.9) priors, respectively.
Scale mixture of normals as a shrinkage device
• The simple normal prior places more probability at the origin (zero) relative to its tails; however, it is fairly flat (diffuse) in a small area around zero.
• What the other three mixture priors do is place a more pronounced peak at zero, so that when a parameter is in the region of zero it can be shrunk at a faster rate.
• At the same time, all three mixture distributions have fat tails, giving positive probability to parameter values that are far from zero.
• The extreme case of the Bernoulli prior on τ² (bottom-right panel) creates a distribution that looks normal but also has a point mass at zero with high probability.
• All three examples of hierarchical priors provide a clearer distinction between important and non-important parameters than the simple normal prior does.
Diffuse hierarchical prior
• A natural choice for the variance parameter τ² in the hierarchical model of equations (3)-(5) is a prior distribution that is diffuse.
• Similar to the Jeffreys prior for the regression variance σ², the choice p(τ²) ∝ τ^{−2} on (0, ∞) can be thought of as a default prior that reflects our lack of information about sparsity patterns in the data.
• However, as Lindley (1983) notes, "a prior for τ² that behaves like τ^{−2} will cause trouble", meaning that it will lead to an improper posterior.
• Gelman (2006) examines this issue in more detail and explains why such a diffuse prior on τ² does not work.
Student-t shrinkage
Informative inverse gamma priors on τ² can provide flexible parametric shrinkage. The Student-t prior of Armagan and Zaretzki (2010) is of the following form:

β | {τ_j²}_{j=1}^p ∼ N_p(0, D_τ),   (8)
τ_j² ∼ iGamma(ρ, ξ),  for j = 1, ..., p,   (9)
p(σ²) ∝ 1/σ²,   (10)

where D_τ = diag(τ_1², ..., τ_p²).
• Equations (8) and (9) define a prior on β_j that is a scale mixture of normals representation of a Student-t distribution with scale √(ξ/ρ) and 2ρ degrees of freedom.
• That is, the marginal prior on β_j is a Student-t distribution.
Student-t shrinkage
Figure: Marginal prior of β_j for the Student-t prior, shown for (ρ, ξ) = (1, 1), (4, 1), (1, 10), and (0.01, 0.01).
Student-t shrinkage
Thanks to conjugacy, the conditional posteriors are of the simple form

β | {τ_j²}_{j=1}^p, σ², y ∼ N_p(VX′y/σ², V),   (11)
σ² | β, y ∼ iGamma(0.5n, 0.5Ψ),   (12)
τ_j² | β_j, y ∼ iGamma(ρ + 1/2, ξ + β_j²/2),   (13)

where V = (X′X/σ² + D_τ^{-1})^{-1} and Ψ = (y − Xβ)′(y − Xβ). The Gibbs sampler can be used to sample sequentially from these conditional posteriors.
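A minimal Matlab sketch of one sweep of this Gibbs sampler is given below (it assumes y, X, the hyperparameters rho, xi, and current values of sig2 and the p × 1 vector tau2 exist; variable names are illustrative).

  [n, p] = size(X);
  V      = (X'*X/sig2 + diag(1./tau2)) \ eye(p);              % V = (X'X/sig2 + D_tau^{-1})^{-1}
  beta   = V*(X'*y)/sig2 + chol(V, 'lower')*randn(p, 1);      % eq. (11)
  resid  = y - X*beta;
  sig2   = 1 / gamrnd(0.5*n, 1/(0.5*(resid'*resid)));         % eq. (12)
  tau2   = 1 ./ gamrnd(rho + 0.5, 1./(xi + 0.5*beta.^2));     % eq. (13), drawn jointly over j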
Student-t shrinkage
For the conditional posterior (13) of the prior precisions 1/τ_j², we have

p(1/τ_1², ..., 1/τ_p² | β, y) ∝ ∏_{j=1}^p (2πτ_j²)^{-1/2} exp[−β_j²/(2τ_j²)] (1/τ_j²)^{ρ−1} exp[−ξ/τ_j²]
  ∝ ∏_{j=1}^p (1/τ_j²)^{(ρ+1/2)−1} exp[−(1/τ_j²)(ξ + β_j²/2)],

where the proportionality sign is with respect to the 1/τ_j²'s. It can be seen that the conditional posterior of the 1/τ_j²'s is independent across j and that it has the form in (13).
Frequentist Lasso
The least absolute shrinkage and selection operator (Lasso) of Tibshirani (1996) has been established as a key workhorse for scientists in all fields working with high-dimensional settings. The estimator takes the form

argmin_β Σ_{i=1}^n (y_i − X_iβ)²  subject to  Σ_{j=1}^p |β_j| < t,

where t is a prespecified free parameter that determines the degree of regularization. The Lagrangian form of this program is

argmin_{β ∈ R^p} ||y − Xβ||² + λ||β||_1,

where ||x||_1 = Σ_i |x_i| is the l1 norm and ||x||_2 = √(Σ_i x_i²) is the l2 norm, and λ is a tuning parameter, related to t, that controls how strongly we apply shrinkage. As λ → 0 the penalty term vanishes and the lasso becomes indistinguishable from the least squares problem. This optimization problem is related to basis pursuit denoising, which is the preferred term for the lasso among researchers in computer science and signal processing.
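For reference, the frequentist lasso is available directly in Matlab's Statistics and Machine Learning Toolbox; a minimal sketch with an illustrative penalty value follows (note that lasso() scales the squared-error term by 1/(2n), so its lambda is on a different scale than in the Lagrangian above).

  lambda = 0.1;                                   % illustrative tuning parameter
  [B, FitInfo] = lasso(X, y, 'Lambda', lambda);   % lasso() standardizes the columns of X by default
  beta_lasso = B(:, 1);                           % coefficient vector at this lambda
  % In practice lambda is cross-validated, e.g. [B, FitInfo] = lasso(X, y, 'CV', 10)
  % and then B(:, FitInfo.IndexMinMSE).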
Bayesian Lasso
Tibshirani (1996, Section 5) first noted that the Lasso estimate can be derived as a Bayes posterior mode under the following Laplace prior distribution

p(β) = ∏_{j=1}^p (λ/2) exp(−λ|β_j|) = (λ/2)^p exp(−λ||β||_1).

A formal Bayesian treatment of the lasso using MCMC can be found in Park and Casella (2008). These authors choose to specify the Bayesian lasso as a normal-exponential mixture conditional on the regression variance σ². This is because a hierarchical prior on β_j that is independent of σ² results in a multimodal posterior for β_j.
Bayesian Lasso: Park and Casella (2008)
As noted first by Tibshirani (1996), the lasso estimator is equivalent to the posterior mode of Bayesian inference under a "double-exponential" prior. The Normal-Exponential mixture is a mixture representation of the Laplace distribution; that is, the following Laplace prior

β_j | σ ∼ (λ / (2√σ²)) e^{−λ|β_j|/√σ²},  j = 1, ..., p,

can be written as the following Normal-Exponential mixture

β_j | σ ∼ ∫_0^∞ (1/√(2πσ²s_j)) e^{−β_j²/(2σ²s_j)} (λ²/2) e^{−λ²s_j/2} ds_j,  j = 1, ..., p.

This is the mixture prior analyzed by Park and Casella (2008), which is by far the most popular form of the Bayesian lasso.
Bayesian Lasso: Park and Casella (2008)
The Park and Casella (2008) Laplace prior takes the form
β | {τ_j²}_{j=1}^p, σ² ∼ N_p(0, σ²D_τ),
τ_j² | λ² ∼ Exponential(λ²/2),  for j = 1, ..., p,
λ² ∼ Gamma(r, δ),
p(σ²) ∝ 1/σ²,

where D_τ = diag(τ_1², ..., τ_p²).
Bayesian Lasso: Park and Casella (2008)
The conditional posteriors are of the form

β | {τ_j²}_{j=1}^p, σ², y ∼ N_p(VX′y, σ²V),
1/τ_j² | β_j, λ², σ², y ∼ Inverse-Gaussian(√(λ²σ²/β_j²), λ²),  for j = 1, ..., p,
λ² | {τ_j²}_{j=1}^p, y ∼ Gamma(r + p, δ + Σ_{j=1}^p τ_j²/2),
σ² | β, {τ_j²}_{j=1}^p, y ∼ iGamma((n − 1 + p)/2, Ψ/2 + β′D_τ^{-1}β/2),

where V = (X′X + D_τ^{-1})^{-1}, D_τ = diag(τ_1², ..., τ_p²), and Ψ = (y − Xβ)′(y − Xβ). The Inverse-Gaussian distribution is parameterized here by its mean and shape, and the Gamma distribution by its shape and rate.
Global-local shrinkage
Global-local shrinkage priors (Polson and Scott, 2010) are of the form

p(β_j | τ², λ_j²) ∼ N(0, τ²λ_j²),  j = 1, ..., p,
p(τ²) ∼ F_τ(a, b),
p(λ_j²) ∼ F_λ(c, d),

where τ² is a global shrinkage parameter (applying the same shrinkage to the whole parameter vector β) and λ_j is a local shrinkage parameter (applying shrinkage only to β_j). As we see next, such priors will typically have at least three hierarchical layers, but in practical situations they tend to have many more (e.g. by putting priors on some or all of the hyperparameters a, b, c, d).
Horseshoe prior
The horseshoe prior was first introduced by Carvalho et al. (2010) and, since its inception, it has been the most popular and influential hierarchical prior in Bayesian inference. The Horseshoe is a prime representative of the class of global-local shrinkage priors and can be represented as a scale mixture of normals with half-Cauchy mixing distributions. That is, the prior has the following formulation

β | {λ_j}_{j=1}^p, τ, σ² ∼ N_p(0, σ²τ²Λ),
λ_j ∼ C⁺(0, 1),  for j = 1, ..., p,
τ ∼ C⁺(0, 1),

where Λ = diag(λ_1², ..., λ_p²), and C⁺(0, α) is the half-Cauchy distribution on the positive reals with scale parameter α, i.e. with density

p(x) = 2 / (πα(1 + (x/α)²)),  x > 0.

Under this hierarchical specification, the marginal prior for each β_j is unbounded at the origin and has tails that decay polynomially.
Horseshoe prior
• The survey paper by Bhadra et al. (2020) provides a thorough review of the
applications of this prior in numerous inference problems in statistics and
machine learning, including nonlinear models and neural networks.
• There are numerous theoretical results established for this prior, most
notably Datta and Ghosh (2013) and van der Pas et al. (2014), and the
reader is referred to Bhadra et al. (2020) for a more detailed discussion.
Horseshoe prior
There are various computational approaches to the Horseshoe, but the most
straightforward is the one proposed by Makalic and Schmidt (2016). These
authors note that the half-Cauchy distribution can be written as a mixture of
inverse-gamma distributions. In particular, if
x² | z ∼ Inv-Gamma(1/2, 1/z),  z ∼ Inv-Gamma(1/2, 1/α²),

then x ∼ C⁺(0, α). Therefore, the Makalic and Schmidt (2016) prior takes the form

β | {λ_j}_{j=1}^p, τ, σ² ∼ N_p(0, σ²τ²Λ),
λ_j² | v_j ∼ Inv-Gamma(1/2, 1/v_j),  for j = 1, ..., p,
v_j ∼ Inv-Gamma(1/2, 1),  for j = 1, ..., p,
τ² | ξ ∼ Inv-Gamma(1/2, 1/ξ),
ξ ∼ Inv-Gamma(1/2, 1),
p(σ²) ∝ 1/σ²,

where Λ = diag(λ_1², ..., λ_p²).
Horseshoe prior
The conditional posteriors are of the form

β | {λ_j}_{j=1}^p, τ², σ², y ∼ N_p(VX′y, σ²V),
λ_j² | β_j, v_j, τ², σ², y ∼ Inv-Gamma(1, 1/v_j + β_j²/(2τ²σ²)),  for j = 1, ..., p,
v_j | λ_j, y ∼ Inv-Gamma(1, 1 + 1/λ_j²),  for j = 1, ..., p,
τ² | β, ξ, {λ_j}_{j=1}^p, σ², y ∼ Inv-Gamma((p + 1)/2, 1/ξ + (1/(2σ²)) Σ_{j=1}^p β_j²/λ_j²),
ξ | τ², y ∼ Inv-Gamma(1, 1 + 1/τ²),
σ² | β, {λ_j}_{j=1}^p, τ², y ∼ Inv-Gamma((n + p)/2, Ψ/2 + β′D^{-1}β/2),

where V = (X′X + D^{-1})^{-1}, D = diag(τ²λ_1², ..., τ²λ_p²) = τ²Λ, and Ψ = (y − Xβ)′(y − Xβ).
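A minimal Matlab sketch of one Gibbs sweep under the Makalic and Schmidt (2016) representation follows (it assumes y, X and current values sig2, tau2, xi, plus p × 1 vectors lam2 and nu exist; every inverse-gamma draw uses 1/gamrnd(shape, 1/rate)).

  [n, p] = size(X);
  D      = tau2 * lam2;                                       % p x 1 vector of tau^2 * lambda_j^2
  V      = (X'*X + diag(1./D)) \ eye(p);
  beta   = V*(X'*y) + chol(sig2*V, 'lower')*randn(p, 1);      % beta | lambda, tau2, sig2, y
  resid  = y - X*beta;
  sig2   = 1 / gamrnd((n + p)/2, ...
                      1/(0.5*(resid'*resid) + 0.5*sum(beta.^2 ./ D)));    % sig2 | beta, lambda, tau2, y
  lam2   = 1 ./ gamrnd(1, 1./(1./nu + beta.^2/(2*tau2*sig2)));            % lambda_j^2 | v_j, ...
  nu     = 1 ./ gamrnd(1, 1./(1 + 1./lam2));                              % v_j | lambda_j
  tau2   = 1 / gamrnd((p + 1)/2, 1/(1/xi + sum(beta.^2 ./ lam2)/(2*sig2)));   % tau^2 | ...
  xi     = 1 / gamrnd(1, 1/(1 + 1/tau2));                                 % xi | tau^2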
Conjugate vs independent priors
Conjugate prior:
  β | σ², τ² ∼ N_p(0, σ²D_τ),   τ² ∼ p(τ²),   p(σ²) ∝ 1/σ².

Independent prior:
  β | τ² ∼ N_p(0, D_τ),   τ² ∼ p(τ²),   p(σ²) ∝ 1/σ².

Here D_τ = diag(τ_1², ..., τ_p²), and the distribution p(τ²) depends on the particular shrinkage prior chosen.
Conjugate vs independent priors
Conditional Posteriors (Conjugate prior)

β | τ², σ², y ∼ N_p(VX′y, σ²V),
τ² | β, σ², y ∼ p(τ² | β, σ², y),
σ² | β, τ², y ∼ Inv-Gamma((n + p)/2, (1/2)(Ψ + β′D_τ^{-1}β)),   (14)

where V = (X′X + D_τ^{-1})^{-1} and Ψ = (y − Xβ)′(y − Xβ). Under the independent prior the conditional posteriors are of the form

Conditional Posteriors (Independent prior)

β | τ², σ², y ∼ N_p(VX′y/σ², V),
τ² | β, y ∼ p(τ² | β, y),
σ² | β, y ∼ Inv-Gamma(n/2, Ψ/2),   (15)

where V = (X′X/σ² + D_τ^{-1})^{-1}. Notice how σ² enters the conditional posterior of τ² in the conjugate case, whereas it does not appear in the conditional posterior of τ² in the independent case. Which specification should we use?
Conjugate vs independent priors
Should we be using a conditional or an unconditional hierarchical prior? To investigate this question, we run a simulation, generating data from the regression model y = Xβ + ϵ with ϵ ∼ N(0, σ²I); a data-generating sketch follows the list below.
• We let n = 100 and σ² = 3.
• We construct the true vector of slope parameters cβ by assigning the values {1.5, −1.5, 2, −2, 2.5, −2.5} to the first 6 elements of β and setting the others to zero.
• We choose c > 0 to ensure a moderately strong signal-to-noise ratio.
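A minimal Matlab sketch of this data-generating process is given below. The value of c and the distribution of X are illustrative assumptions, since they are not fully specified on the slide.

  rng(1);                                        % for reproducibility
  n = 100; p = 50; sig2 = 3; c = 1;              % c = 1 is an illustrative choice
  beta_true      = zeros(p, 1);
  beta_true(1:6) = c * [1.5; -1.5; 2; -2; 2.5; -2.5];
  X = randn(n, p);                               % assumed design: i.i.d. standard normal entries
  y = X*beta_true + sqrt(sig2)*randn(n, 1);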
Method   σ̂²   Bias   MSE   FN   FP   TP
p = 50
Student-t (conjugate) 1.77 0.19 0.05 0.10 0 5.9
Bayesian Lasso (conjugate) 1.96 0.18 0.05 0.06 0 5.9
Horseshoe (conjugate) 2.35 0.19 0.05 0.07 0 5.9
Student-t (independent) 2.99 0.19 0.05 0.09 0 5.9
Bayesian Lasso (independent) 2.86 0.18 0.05 0.05 0 6.0
Horseshoe (independent) 2.93 0.19 0.05 0.08 0 5.9
p = 100
Student-t (conjugate) 0.42 0.34 0.18 0.43 0.01 5.5
Bayesian Lasso (conjugate) 0.69 0.28 0.12 0.18 0 5.8
Horseshoe (conjugate) 1.05 0.26 0.11 0.19 0 5.8
Student-t (independent) 2.01 0.30 0.15 0.35 0.01 5.6
Bayesian Lasso (independent) 2.02 0.25 0.10 0.15 0 5.8
Horseshoe (independent) 2.28 0.26 0.11 0.19 0 5.8
p = 300
Student-t (conjugate) 0.35 0.35 0.24 0.41 2.79 5.6
Bayesian Lasso (conjugate) 0.66 0.78 0.71 0.74 2.86 5.2
Horseshoe (conjugate) 0.92 0.66 0.63 0.46 12.0 5.5
Student-t (independent) 3.88 1.52 2.43 0.03 68.6 5.9
Bayesian Lasso (independent) 2.59 1.42 2.12 0.01 68.9 5.9
Horseshoe (independent) 3.27 1.43 2.17 0.01 61.1 5.9
Table: Average metrics over 100 repetitions for each approach: the estimated error variance, the bias and MSE of the first 6 elements of the slope vector, and the numbers of false negatives (FN), false positives (FP), and true positives (TP). The posterior means were used as point estimates. The post-processing method of Li and Pati (2017) was used to distinguish signals from noise. n = 100.
Conjugate vs independent priors
• Both conjugate and independent priors do well in terms of TPs and FNs. However, there are some notable differences.
• σ² tends to be underestimated when conjugate priors are used. This was in fact pointed out by Moran et al. (2019). Intuitively, conjugate priors implicitly add p "pseudo-observations" to the posterior, which can result in underestimation of σ² when β is sparse.
• The independent priors tend to have larger bias and MSE for the signals in the high-dimensional case (p = 300). Park and Casella (2008) point out that independent priors can induce bimodality of the posterior on the slope coefficients. This can make the posterior distributions for β wider than in the conjugate case. We also see that the independent priors have larger FPs when p = 300, which could be a result of this.
• We need to be aware of these issues when choosing priors and should conduct sensitivity checks.
Fast sampling from normal posteriors
Under the natural conjugate prior the conditional posteriors are of the form

Conditional Posteriors (Natural conjugate prior)

β | τ², σ², y ∼ N_p(VX′y, σ²V),
σ² | β, τ², y ∼ Inv-Gamma((n + p)/2, (1/2)(Ψ + β′D_τ^{-1}β)),
τ² | β, σ², y ∼ p(τ² | β, σ², y),   (16)

where V = (X′X + D_τ^{-1})^{-1} and Ψ = (y − Xβ)′(y − Xβ). Under the independent prior the conditional posteriors are of the form

Conditional Posteriors (Independent prior)

β | τ², σ², y ∼ N_p(VX′y/σ², V),
σ² | β, y ∼ Inv-Gamma(n/2, Ψ/2),
τ² | β, y ∼ p(τ² | β, y),   (17)

where V = (X′X/σ² + D_τ^{-1})^{-1} and Ψ = (y − Xβ)′(y − Xβ). D_τ is a diagonal matrix with the prior variances τ_j² on its diagonal.
Fast sampling from normal posteriors
In high-dimensional settings with p large, the most cumbersome step in a Gibbs sampler for the linear regression model with hierarchical priors is sampling from the p-variate normal conditional posterior of β. This step involves inverting the precision matrix Q = V^{-1} = (X′X + D_τ^{-1}) (in the case of the natural conjugate prior) or Q = V^{-1} = (X′X/σ² + D_τ^{-1}) (in the case of the independent prior) in order to obtain the posterior covariance matrix V. Next, the Cholesky decomposition of V is needed in order to sample from the desired normal distribution. While the inversion step can be sped up (e.g. by using Woodbury's identity), all built-in algorithms in the various programming languages for obtaining the Cholesky decomposition have a worst-case asymptotic complexity (measured in flops) of O(p³). Therefore, straightforward sampling from the normal posterior becomes computationally cumbersome, if not infeasible, as p increases.
Fast sampling from normal posteriors
Rue (2001) provides a precision-based sampler for obtaining samples from a normal distribution efficiently when the precision matrix Q = (X′X + D_τ^{-1}) is known. Since the conditional posterior of β (ignoring σ², e.g. assume it is fixed at 1) is of the form β | • ∼ N(VX′y, V), the procedure proposed by Rue (2001) takes the following form:
• Compute the lower Cholesky factorization Q = LL′
• Generate Z ∼ N_p(0, I_p)
• Set v = L^{-1}(X′y)
• Set μ = L′^{-1}v
• Set u = L′^{-1}Z
• Set β = μ + u
It is straightforward to show that E(β) = μ = L′^{-1}L^{-1}X′y = (LL′)^{-1}X′y = VX′y and cov(β) = cov(μ + u) = L′^{-1}cov(Z)L^{-1} = L′^{-1}L^{-1} = V, which means that the above procedure provides valid samples from the desired normal distribution. The main feature of this algorithm is that it only requires solving triangular systems with the Cholesky factor of Q, instead of inverting Q itself to obtain V. While the worst-case complexity of this algorithm is also O(p³), it provides large efficiency gains in certain classes of models, e.g. when the Gram matrix X′X is block-diagonal (assuming the prior variance D_τ is diagonal or at most block-diagonal with a similar structure).
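A minimal Matlab sketch of this precision-based sampler follows (it assumes X, y and a p × p prior variance matrix Dtau exist, with σ² fixed at 1 as above; the triangular systems are solved with backslash rather than explicit inverses).

  p  = size(X, 2);
  Q  = X'*X + Dtau \ eye(p);     % precision matrix Q = X'X + Dtau^{-1}
  L  = chol(Q, 'lower');         % lower Cholesky factor, Q = L*L'
  v  = L \ (X'*y);               % v  = L^{-1} X'y
  mu = L' \ v;                   % mu = L'^{-1} v = Q^{-1} X'y
  u  = L' \ randn(p, 1);         % u ~ N(0, Q^{-1})
  beta = mu + u;                 % beta ~ N(V X'y, V) with V = Q^{-1}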
Fast sampling from normal posteriors
Recently, Bhattacharya et al. (2016) proposed an efficient algorithm that makes full use of the Woodbury matrix inversion lemma when generating normal variates from a distribution of the form β | • ∼ N(μ, V) (again, for simplicity, ignore σ²). Their algorithm takes the following form (a Matlab sketch follows below):
1. Sample η ∼ N_p(0, D_τ) and δ ∼ N_n(0, I_n)
2. Set v = Xη + δ
3. Set w = (XD_τX′ + I_n)^{-1}(y − v)
4. Set β = η + D_τX′w
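A minimal Matlab sketch of this algorithm follows (it assumes X, y and a p × 1 vector dtau holding the diagonal of D_τ exist; the elementwise products use implicit expansion, available from R2016b).

  [n, p] = size(X);
  eta   = sqrt(dtau) .* randn(p, 1);            % eta ~ N_p(0, D_tau), D_tau diagonal
  delta = randn(n, 1);                          % delta ~ N_n(0, I_n)
  v     = X*eta + delta;
  M     = (X .* dtau') * X' + eye(n);           % X*D_tau*X' + I_n (an n x n system)
  w     = M \ (y - v);
  beta  = eta + dtau .* (X'*w);                 % beta = eta + D_tau X' w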
Fast sampling from normal posteriors
In this case, the draw of β comes from the desired normal distribution with mean μ and covariance V. Note that this algorithm requires inversion of the n × n matrix (XD_τX′ + I_n), while sampling directly from the normal posterior requires inversion of the p × p matrix Q. Therefore, step 3 of this algorithm only becomes efficient for p ≫ n. The main efficiency gains stem from the fact that it only requires sampling from two normal distributions with diagonal covariance matrices (a p-variate distribution with covariance D_τ and an n-variate distribution with identity covariance). Generating uncorrelated normal draws is much more efficient than sampling directly from the p-variate normal posterior of β using the full covariance matrix V. In particular, the worst-case complexity (asymptotic upper bound) of this algorithm is O(n²p), that is, it is only linear in p. Therefore, for n > p this algorithm will perform worse than the algorithm of Rue (2001), since the n² term will dominate, but it shines in the p > n case, where it can offer dramatic improvements in computation time.
References
Armagan, A. and Zaretzki, R. L. (2010). Model selection via adaptive shrinkage
with t priors. Computational Statistics, 25(3):441–461.
Bhadra, A., Datta, J., Li, Y., and Polson, N. (2020). Horseshoe regularisation for machine learning in complex and deep models. International Statistical Review, 88(2):302–320.
Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). Fast sampling with
Gaussian scale mixture priors in high-dimensional regression. Biometrika,
103(4):985–991.
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator
for sparse signals. Biometrika, 97(2):465–480.
Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–132.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–534.
Li, H. and Pati, D. (2017). Variable selection using shrinkage priors.
Computational Statistics & Data Analysis, 107:107–119.
Lindley, D. V. (1983). Parametric empirical Bayes inference: Theory and applications: Comment. Journal of the American Statistical Association, 78(381):61–62.
Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe
estimator. IEEE Signal Processing Letters, 23(1):179–182.
Moran, G. E., Ročková, V., and George, E. I. (2019). Variance prior forms for high-dimensional Bayesian variable selection. Bayesian Analysis, 14(4):1091–1119.
Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686.
Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., and West, M., editors, Bayesian Statistics 9. Oxford University Press.
Rue, H. (2001). Fast sampling of Gaussian Markov random fields. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2):325–338.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), 58(1):267–288.
van der Pas, S. L., Kleijn, B. J. K., and van der Vaart, A. W. (2014). The
horseshoe estimator: Posterior concentration around nearly black vectors.
Electronic Journal of Statistics, 8(2):2585–2618.

