CS5340
Uncertainty Modeling in AI
Assoc. Prof. Lee Gim Hee
AY 2022/23
Semester 1
Lecture 9:
Monte Carlo Inference (Sampling)
Course Schedule

Week | Date   | Topic                                                  | Remarks
1    | 10 Aug | Introduction to probabilistic reasoning                | Assignment 0: Python NumPy tutorial (ungraded)
2    | 17 Aug | Bayesian networks (directed graphical models)          |
3    | 24 Aug | Markov random fields (undirected graphical models)     |
4    | 31 Aug | Variable elimination and belief propagation            | Assignment 1: Belief propagation and maximal probability (15%)
5    | 07 Sep | Factor graph and the junction tree algorithm           |
6    | 14 Sep | Parameter learning with complete data                  | Assignment 1 due; Assignment 2: Junction tree and parameter learning (15%)
-    | 21 Sep | Recess week                                            | No lecture
7    | 28 Sep | Mixture models and the EM algorithm                    | Assignment 2 due
8    | 05 Oct | Hidden Markov models (HMM)                             | Assignment 3: Hidden Markov model (15%)
9    | 12 Oct | Monte Carlo inference (sampling)                       |
*    | 15 Oct | Variational inference                                  | Makeup lecture (LT15), 9.30am-12.30pm (Saturday)
10   | 19 Oct | Variational auto-encoder and mixture density networks  | Assignment 3 due; Assignment 4: MCMC sampling (15%)
11   | 26 Oct | No lecture                                             | I will be traveling
12   | 02 Nov | Graph-cut and alpha expansion                          | Assignment 4 due
13   | 09 Nov | -                                                      |
Acknowledgements
• Many of the slides and much of the content of this lecture are adapted from:
1. "An Introduction to MCMC for Machine Learning", Christophe Andrieu et al. http://www.cs.bham.ac.uk/~axk/mcmc1.pdf
2. "Pattern Recognition and Machine Learning", Christopher Bishop, Chapter 11.
3. http://www.cs.cmu.edu/~epxing/Class/10708/lectures/lecture16-MC.pdf and http://www.cs.cmu.edu/~epxing/Class/10708/lectures/lecture17-MCMC.pdf, Eric Xing, CMU.
4. "Machine Learning – A Probabilistic Perspective", Kevin Murphy, Chapter 23.
5. "Probabilistic Graphical Models", Daphne Koller and Nir Friedman, Chapter 12.
Learning Outcomes
• Students should be able to:
1. Explain the Monte Carlo principle and its justification for sampling methods.
2. Apply rejection, importance, Metropolis-Hastings, Metropolis, and Gibbs sampling methods to compute maximum probability, approximate inference, and expectations.
3. Use Markov chain properties, i.e. homogeneity, stationary distribution, irreducibility, aperiodicity, ergodicity, and detailed balance, to show the validity of the MH algorithm.
History of Monte Carlo Sampling
The Metropolis algorithm was selected as one of the top 10 algorithms with the greatest influence on the development of science and engineering in the 20th century. [Beichl & Sullivan 2000]
History of Monte Carlo Sampling
Stanislaw Ulam (1909-1984)
• Monte Carlo sampling was invented by Stan Ulam in 1946 when he was playing solitaire while convalescing from an illness.
• It occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully.
• He attempted exhaustive combinatorial calculations, but then decided to lay out several solitaires at random and simply observe and count the number of successful plays.
Image source: https://en.wikipedia.org/wiki/Stanislaw_Ulam
Idea Behind Monte Carlo Sampling
Ulam’s idea of selecting a statistical sample to approximate a
hard combinatorial problem by a much simpler problem is at
the heart of modern Monte Carlo simulation.
Pioneers of Monte Carlo Sampling
Stanislaw Ulam (1909-1984)
John von Neumann (1903-1957)
Nicholas Metropolis (1915-1999)
Edward Teller (1908-2003)
Marshall Rosenbluth (1927-2003)
Augusta H. Teller (1909-2000)
Why Do We Need Sampling?
Bayesian inference and learning:
Given some unknown variables $x \in \mathcal{X}$ and data $y \in \mathcal{Y}$, the following typically intractable integration problems are central to Bayesian statistics.
1. Normalization: to obtain the posterior $p(x \mid y)$ given the prior $p(x)$ and likelihood $p(y \mid x)$, the normalizing factor in Bayes' theorem needs to be computed:
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int_{\mathcal{X}} p(y \mid x')\, p(x')\, dx'}$$
The integral in the denominator is intractable to compute in high-dimensional spaces.
Why Do We Need Sampling?
2. Marginalization: given the joint posterior of $(x, z) \in \mathcal{X} \times \mathcal{Z}$, we may often be interested in the marginal posterior
$$p(x \mid y) = \int_{\mathcal{Z}} p(x, z \mid y)\, dz.$$
Intractable to compute in high-dimensional spaces.
3. Expectation: the objective of the analysis is often to obtain summary statistics of the form
$$E_{p(x \mid y)}\left[f(x)\right] = \int_{\mathcal{X}} f(x)\, p(x \mid y)\, dx$$
for some function of interest $f$ integrable with respect to $p(x \mid y)$. Intractable to compute in high-dimensional spaces.
Why Do We Need Sampling?
Optimization:
• The goal of optimization is to extract the solution
that minimizes some objective function from a large
set of feasible solutions.
• This set can be continuous and unbounded.
• In general, it is too computationally expensive to
compare all the solutions to find out which one is
optimal.
The Monte Carlo Principle
• Draw an i.i.d. set of samples $\{x^{(i)}\}_{i=1}^{N}$ from a target density $p(x)$ defined on a high-dimensional space $\mathcal{X}$.
• E.g. the set of possible configurations of a system, the space on which the posterior is defined, or the combinatorial set of feasible solutions.
• These samples can be used to approximate the target density with the following empirical point-mass function:
$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x),$$
where $\delta_{x^{(i)}}(x)$ denotes the delta-Dirac mass located at $x^{(i)}$.
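To make the principle concrete, here is a minimal NumPy sketch; the standard-Gaussian target and the choice $f(x) = x^2$ are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N i.i.d. samples from the target density p(x).  Here p(x) is
# taken to be a standard Gaussian purely for illustration.
N = 100_000
x = rng.standard_normal(N)

# Under the empirical point-mass approximation p_N, integrals against
# p(x) become sample averages: E[f(x)] ~= (1/N) * sum_i f(x^(i)).
f = lambda t: t ** 2
print(f(x).mean())  # ~1.0, the exact E[x^2] under N(0, 1)
```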
Weak vs Strong Law of Large Numbers
• Weak law of large numbers: the sample average $\bar{X}_N$ converges in probability towards the expected value $\mu$, i.e.
$$\bar{X}_N \xrightarrow{\;P\;} \mu \quad \text{as } N \to \infty.$$
• That is, for any positive number $\epsilon$,
$$\lim_{N \to \infty} \Pr\left(\left|\bar{X}_N - \mu\right| > \epsilon\right) = 0,$$
where
$$\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i \quad \text{and} \quad \mu = \int x\, p(x)\, dx.$$
Weak vs Strong Law of Large Numbers
• Strong law of large numbers: the sample average $\bar{X}_N$ converges almost surely to the expected value $\mu$, i.e.
$$\bar{X}_N \xrightarrow{\;a.s.\;} \mu \quad \text{as } N \to \infty.$$
• That is,
$$\Pr\left(\lim_{N \to \infty} \bar{X}_N = \mu\right) = 1,$$
where $\bar{X}_N$ and $\mu$ are defined as above.
The Monte Carlo Principle
• Consequently, we can approximate the integrals (or very large sums) $I(f)$ with tractable sums $I_N(f)$ that converge as follows:
$$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f\left(x^{(i)}\right) \xrightarrow{\;a.s.\;} I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx \quad \text{as } N \to \infty.$$
• That is, the estimate $I_N(f)$ is unbiased and, by the strong law of large numbers, it will almost surely converge to $I(f)$. Here $I_N(f)$ plays the role of $\bar{X}_N$ and $I(f)$ the role of $\mu$.
The Monte Carlo Principle
• If the variance (in the univariate case for simplicity) of $f(x)$ satisfies
$$\sigma_f^2 \triangleq E_{p(x)}\left[f^2(x)\right] - I^2(f) < \infty,$$
• then the variance of the estimator $I_N(f)$ is equal to
$$\operatorname{var}\left(I_N(f)\right) = \frac{\sigma_f^2}{N}.$$
The Monte Carlo Principle
Proof:
$$\operatorname{var}(X) = E\left[(X - \mu)^2\right] = E\left[X^2\right] - \mu^2.$$
Letting $X := f(x)$ and $\mu := I(f)$, we get
$$\operatorname{var}\left(f(x)\right) = E_{p(x)}\left[f^2(x)\right] - I^2(f) = \sigma_f^2.$$
Since the samples $x^{(i)}$ are i.i.d.,
$$\operatorname{var}\left(I_N(f)\right) = \operatorname{var}\left(\frac{1}{N} \sum_{i=1}^{N} f\left(x^{(i)}\right)\right) = \frac{1}{N^2} \sum_{i=1}^{N} \operatorname{var}\left(f\left(x^{(i)}\right)\right) = \frac{\sigma_f^2}{N}.$$
The Monte Carlo Principle
• Furthermore, a central limit theorem yields convergence in distribution of the error:
$$\sqrt{N}\left(I_N(f) - I(f)\right) \xRightarrow{\;d\;} \mathcal{N}\left(0, \sigma_f^2\right) \quad \text{as } N \to \infty.$$
• The samples can also be used to obtain a maximum of the objective function $p(x)$ as follows:
$$\hat{x} = \arg\max_{x^{(i)};\; i = 1, \dots, N} p\left(x^{(i)}\right).$$
Non-Parametric Representation
Probability distributions can be represented:
1. Parametrically: e.g. using mean and covariance of
a Gaussian, or
2. Non-parametrically: using a set of hypotheses
(samples) drawn from the distribution.
Advantage of non-parametric representation:
• No restriction on the type of distribution (e.g. it can be multi-modal, non-Gaussian, etc.)
Non-Parametric Representation
The more samples fall within an interval, the higher the probability of that interval.
But:
How to draw samples from a function/distribution?
[Figure: two plots of probability/weight versus x, showing a distribution represented by a set of weighted samples.]
Rejection Sampling
• Suppose we wish to sample from a distribution $p(x)$ for which direct sampling is difficult.
• Furthermore, suppose that we are unable to easily evaluate $p(x)$ due to an unknown normalizing constant $Z_p$, so that
$$p(x) = \frac{1}{Z_p}\, \tilde{p}(x),$$
where $\tilde{p}(x)$ can readily be evaluated.
Rejection Sampling
[Figure: un-normalized target $\tilde{p}(x)$ under the envelope $M q(x)$; grey area: rejection region, white area: acceptance region.]
Algorithm: Rejection Sampling
Set $i = 1$.
Repeat until $i = N$:   // draw $N$ samples
1. Sample $x^{(i)} \sim q(x)$ and $u \sim U(0, 1)$.   // sample from proposal distribution $q(x)$ and from uniform distribution $U(0, 1)$
2. If $u < \dfrac{\tilde{p}(x^{(i)})}{M q(x^{(i)})}$, then accept $x^{(i)}$ and increment the counter $i$ by 1.
3. Otherwise, reject.
• The proposal distribution $q(x)$ is an easier-to-sample distribution, e.g. a Gaussian.
• The constant $M$ is chosen such that $\tilde{p}(x) \leq M q(x)$ for all values of $x$.
• A proposal $x^{(i)}$ is accepted when the point $\left(x^{(i)}, u\, M q(x^{(i)})\right)$ falls in the acceptance region, i.e. under the curve of $\tilde{p}$.
Image source: "Pattern Recognition and Machine Learning", Christopher Bishop
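A minimal NumPy sketch of this algorithm, assuming an illustrative un-normalized bimodal target $\tilde{p}$ and a Gaussian proposal $q$; the bound $M = 10$ was checked numerically for this particular pair (the maximum of $\tilde{p}/q$ is about 9.7):

```python
import numpy as np

rng = np.random.default_rng(0)

# Un-normalized target p~(x): a bimodal mixture (illustrative choice).
def p_tilde(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

# Proposal q(x) = N(0, 3^2): easy to sample from and easy to evaluate.
SIGMA = 3.0
def q_pdf(x):
    return np.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * np.sqrt(2 * np.pi))

M = 10.0  # chosen so that p~(x) <= M q(x) for all x for this pair

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = SIGMA * rng.standard_normal()    # 1. sample x ~ q(x)
        u = rng.uniform()                    #    and u ~ U(0, 1)
        if u < p_tilde(x) / (M * q_pdf(x)):  # 2. accept if under the ratio
            samples.append(x)
        # 3. otherwise reject (do nothing)
    return np.array(samples)

print(rejection_sample(5))
```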
Rejection Sampling: Limitations
• It is not always possible to bound $\tilde{p}(x)/q(x)$ with a reasonable constant $M$ over the whole space $\mathcal{X}$.
• If $M$ is too large, the acceptance probability
$$\Pr(x \text{ accepted}) = \Pr\left(u < \frac{p(x)}{M q(x)}\right) = \frac{1}{M}$$
will be too small.
• This makes the method impractical in high-dimensional scenarios.
Importance Sampling
• We are given a target distribution $p(x)$ from which it is difficult to draw samples directly.
• Importance sampling provides a framework for approximating expectations of a function $f(x)$ w.r.t. $p(x)$.
• Samples $\{x^{(i)}\}$ are drawn from a simpler distribution $q(x)$, i.e. the proposal distribution.
Importance Sampling
• Express the expectation in the form of a finite sum over samples $\{x^{(i)}\}$ drawn from $q(x)$, weighted by the ratios $p(x^{(i)})/q(x^{(i)})$:
$$E[f] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \simeq \frac{1}{N} \sum_{i=1}^{N} \frac{p\left(x^{(i)}\right)}{q\left(x^{(i)}\right)}\, f\left(x^{(i)}\right).$$
Image source: "Pattern Recognition and Machine Learning", Christopher Bishop
Importance Sampling
• The quantities $w^{(i)} = p(x^{(i)})/q(x^{(i)})$ are known as importance weights.
• They correct the bias introduced by sampling from the wrong distribution.
• Note that, unlike rejection sampling, all of the samples generated are retained.
Importance Sampling
• It is often the case that $\tilde{p}(x)$ can be evaluated easily, but not $p(x) = \tilde{p}(x)/Z_p$, where $Z_p$ is unknown.
• Let us define the proposal distribution in a similar form, i.e. $q(x) = \tilde{q}(x)/Z_q$.
• We then have:
$$E[f] = \int f(x)\, p(x)\, dx = \frac{Z_q}{Z_p} \int f(x)\, \frac{\tilde{p}(x)}{\tilde{q}(x)}\, q(x)\, dx \simeq \frac{Z_q}{Z_p}\, \frac{1}{N} \sum_{i=1}^{N} \tilde{w}^{(i)} f\left(x^{(i)}\right),$$
where $\tilde{w}^{(i)} = \dfrac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})}$.
Importance Sampling
• We can use the same sample set to evaluate the ratio $Z_p/Z_q$, with the result
$$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(x)\, dx = \int \frac{\tilde{p}(x)}{\tilde{q}(x)}\, q(x)\, dx \simeq \frac{1}{N} \sum_{i=1}^{N} \tilde{w}^{(i)},$$
• and hence
$$E[f] \simeq \sum_{i=1}^{N} w^{(i)} f\left(x^{(i)}\right), \quad \text{where } w^{(i)} = \frac{\tilde{w}^{(i)}}{\sum_j \tilde{w}^{(j)}} \text{ and } \tilde{w}^{(i)} = \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})}.$$
These normalized importance weights are easy to compute!
Importance Sampling
Proof: substituting $\dfrac{Z_p}{Z_q} = \dfrac{1}{N} \sum_j \tilde{w}^{(j)}$ into $E[f] \simeq \dfrac{Z_q}{Z_p}\, \dfrac{1}{N} \sum_i \tilde{w}^{(i)} f\left(x^{(i)}\right)$, we get:
$$E[f] \simeq \frac{N}{\sum_j \tilde{w}^{(j)}} \cdot \frac{1}{N} \sum_i \tilde{w}^{(i)} f\left(x^{(i)}\right) = \sum_i \frac{\tilde{w}^{(i)}}{\sum_j \tilde{w}^{(j)}}\, f\left(x^{(i)}\right) = \sum_i w^{(i)} f\left(x^{(i)}\right).$$
Importance Sampling: Limitations
• The success of the importance sampling approach depends crucially on how well $q(x)$ matches $p(x)$.
• When $p(x) f(x)$ is strongly varying and has a significant proportion of its mass concentrated over relatively small regions of $x$-space,
• the set of importance weights $\{w^{(i)}\}$ may be dominated by a few weights having large values, with the remaining weights being relatively insignificant.
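These caveats aside, the self-normalized estimator from the previous slides takes only a few lines of NumPy. A sketch, reusing the illustrative bimodal target and Gaussian proposal from the rejection-sampling sketch (the choice $f(x) = x^2$ is also illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):  # un-normalized target (illustrative bimodal mixture)
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

SIGMA = 3.0
def q_pdf(x):    # proposal density N(0, 3^2)
    return np.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * np.sqrt(2 * np.pi))

N = 100_000
x = SIGMA * rng.standard_normal(N)  # draw all samples from q(x)

w_tilde = p_tilde(x) / q_pdf(x)     # un-normalized weights w~ = p~/q
w = w_tilde / w_tilde.sum()         # normalized importance weights

f = lambda t: t ** 2
print(np.sum(w * f(x)))  # E[f] ~= sum_i w_i f(x_i); ~5.0 for this target
```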
Importance Sampling: Example on a Bayesian Network
$x_1$: Difficulty, $x_2$: Intelligence, $x_3$: Grade, $x_4$: Letter, $x_5$: SAT score
(Graph: $x_1 \to x_3$, $x_2 \to x_3$, $x_3 \to x_4$, $x_2 \to x_5$)

p(x1):  x1=0: 0.6 | x1=1: 0.4
p(x2):  x2=0: 0.7 | x2=1: 0.3

p(x3 | x1, x2):
              x3=0  | x3=1  | x3=2
x1=0, x2=0:   0.3   | 0.4   | 0.3
x1=0, x2=1:   0.05  | 0.25  | 0.7
x1=1, x2=0:   0.9   | 0.08  | 0.02
x1=1, x2=1:   0.5   | 0.3   | 0.2

p(x4 | x3):
         x4=0  | x4=1
x3=0:    0.1   | 0.9
x3=1:    0.4   | 0.6
x3=2:    0.99  | 0.01

p(x5 | x2):
         x5=0  | x5=1
x2=0:    0.95  | 0.05
x2=1:    0.2   | 0.8

How do we compute $p(x_1, x_4, x_5 \mid x_2 = 1, x_3 = 1)$?
Importance Sampling: Example on a Bayesian Network
• How do we compute $p(x_1, x_4, x_5 \mid x_2 = 1, x_3 = 1)$?
• What should we use as the proposal distribution $q(x)$?
$$p(x_1, x_4, x_5 \mid x_2 = 1, x_3 = 1) = \frac{p(x_1, x_4, x_5, x_2 = 1, x_3 = 1)}{p(x_2 = 1, x_3 = 1)} = \frac{p(x, y)}{p(y)} = \frac{1}{Z}\, \tilde{p}(x)$$
• We don't want to evaluate the normalizer $p(x_2 = 1, x_3 = 1)$: importance sampling!
• The un-normalized target distribution is $\tilde{p}(x) = p(x_1, x_4, x_5, x_2 = 1, x_3 = 1)$.
Importance Sampling: Example on a Bayesian Network
Proposal distribution (the network with the evidence clamped):
$$q(x_1, x_4, x_5) = p(x_1)\, p(x_4 \mid x_3 = 1)\, p(x_5 \mid x_2 = 1),$$
with the clamped tables $p(x_2 = 1) = 1$ and $p(x_3 = 1 \mid x_1, x_2) = 1$, and
p(x1):         x1=0: 0.6 | x1=1: 0.4
p(x4 | x3=1):  x4=0: 0.4 | x4=1: 0.6
p(x5 | x2=1):  x5=0: 0.2 | x5=1: 0.8
How do we sample from $q(x_1, x_4, x_5)$?
$$x_1 \sim p(x_1), \quad x_4 \sim p(x_4 \mid x_3 = 1), \quad x_5 \sim p(x_5 \mid x_2 = 1),$$
e.g. randomly generate a number $u$ within $[0, 1]$ (uniform distribution), i.e. $u = \text{rand}$; set $x_1 = 0$ if $u < 0.6$, and $x_1 = 1$ otherwise.
Importance Sampling: Example on a Bayesian Network
• For each sample $x^{(i)}$, we evaluate the weight as
$$w^{(i)} = \frac{\tilde{w}^{(i)}}{\sum_j \tilde{w}^{(j)}} = \frac{\tilde{p}(x^{(i)}) / q(x^{(i)})}{\sum_j \tilde{p}(x^{(j)}) / q(x^{(j)})}.$$
• Example: for a sample $x^{(i)}$ with $x_1 = 0$, $x_4 = 1$, $x_5 = 1$ obtained from sampling, we have
$$\tilde{p}(x^{(i)}) = p(x_1 = 0, x_4 = 1, x_5 = 1, x_2 = 1, x_3 = 1)$$
$$= p(x_1 = 0)\, p(x_2 = 1)\, p(x_3 = 1 \mid x_2 = 1, x_1 = 0)\, p(x_4 = 1 \mid x_3 = 1)\, p(x_5 = 1 \mid x_2 = 1)$$
$$= (0.6)(0.3)(0.25)(0.6)(0.8) = 0.0216.$$
Importance Sampling: Example on a Bayesian Network
• Continuing the example, the proposal density at the same sample is
$$q(x^{(i)}) = p(x_1 = 0)\, p(x_4 = 1 \mid x_3 = 1)\, p(x_5 = 1 \mid x_2 = 1) = (0.6)(0.6)(0.8) = 0.288,$$
so the un-normalized weight is
$$\tilde{w}^{(i)} = \frac{\tilde{p}(x^{(i)})}{q(x^{(i)})} = \frac{0.0216}{0.288} = 0.075.$$
• Finally, the denominator (and hence each weight $w^{(i)}$) can be computed from all the samples.
Importance Sampling: Example on a Bayesian Network
• We can compute $p(x_1, x_4, x_5 \mid x_2 = 1, x_3 = 1)$ from all the weights and samples, e.g.
$$p(x_1 = 0, x_4 = 0, x_5 = 0 \mid x_2 = 1, x_3 = 1) = \frac{\sum_i w^{(i)}\, \mathbb{1}\left(x^{(i)} = \{x_1 = 0, x_4 = 0, x_5 = 0\}\right)}{\sum_i w^{(i)}},$$
i.e. the sum of all weights from samples at $\{x_1 = 0, x_4 = 0, x_5 = 0\}$ divided by the sum of all weights, and similarly for the other configurations such as $\{x_1 = 1, x_4 = 1, x_5 = 1\}$.
• In summary, we get:
$$p(x \mid y) = \frac{\sum_i w^{(i)}\, \mathbb{1}\left(x^{(i)} = x\right)}{\sum_i w^{(i)}},$$
where the denominator is the normalizer that ensures the probabilities sum to 1.
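A sketch of this whole procedure on the student network above; the CPT values are taken from the slides, but the data structures are my implementation choices. Note that the sampled factors cancel in $\tilde{w} = \tilde{p}/q$, leaving exactly $p(x_2{=}1)\, p(x_3{=}1 \mid x_1, x_2{=}1)$, which reproduces the 0.075 computed on the slide for $x_1 = 0$:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# CPTs from the slides.
p_x1 = np.array([0.6, 0.4])
p_x3 = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
        (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}
p_x4 = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}
p_x5 = {0: [0.95, 0.05], 1: [0.2, 0.8]}
P_X2_1 = 0.3  # P(x2 = 1)

# Evidence: x2 = 1, x3 = 1.  Proposal q: sample the non-evidence
# nodes from their CPTs with the evidence clamped.
counts = defaultdict(float)
total = 0.0
for _ in range(50_000):
    x1 = rng.choice(2, p=p_x1)
    x4 = rng.choice(2, p=p_x4[1])  # x4 ~ p(x4 | x3 = 1)
    x5 = rng.choice(2, p=p_x5[1])  # x5 ~ p(x5 | x2 = 1)
    # w~ = p~(x)/q(x); sampled factors cancel, leaving the CPT entries
    # of the evidence nodes given their sampled parents.
    w = P_X2_1 * p_x3[(x1, 1)][1]
    counts[(x1, x4, x5)] += w
    total += w

for config, w in sorted(counts.items()):
    print(config, w / total)  # ~= p(x1, x4, x5 | x2 = 1, x3 = 1)
```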
Sampling and the EM Algorithm
• Sampling methods can be used to approximate the E-step of the EM algorithm for models in which the E-step cannot be performed analytically:
$$Q\left(\theta, \theta^{\text{old}}\right) = \int p\left(Z \mid X, \theta^{\text{old}}\right) \ln p(Z, X \mid \theta)\, dZ \quad \text{(cannot be computed analytically!)}$$
Sampling and the EM Algorithm
• Approximate the integral by a finite sum over samples $\{Z^{(l)}\}$, which are drawn from the current estimate for the posterior $p(Z \mid X, \theta^{\text{old}})$, so that
$$Q\left(\theta, \theta^{\text{old}}\right) \simeq \frac{1}{L} \sum_{l=1}^{L} \ln p\left(Z^{(l)}, X \mid \theta\right).$$
• The Q function is then optimized in the usual way in the M-step.
• This procedure is called the Monte Carlo EM algorithm.
Markov Chain Monte Carlo (MCMC)
• MCMC is a strategy for generating samples $x^{(i)}$ while exploring the state space $\mathcal{X}$ using a Markov chain.
• The chain is constructed to spend more time in the most important regions.
• In particular, it is constructed so that the samples $x^{(i)}$ mimic samples drawn from the target distribution $p(x)$.
Markov Chain Monte Carlo (MCMC)
• MCMC algorithms feature adaptive proposals:
  ➢ Instead of $q(x')$, we use $q(x' \mid x)$, where $x'$ is the new state being sampled and $x$ is the previous sample.
  ➢ As $x$ changes, $q(x' \mid x)$ can also change (as a function of $x'$).
[Figure: importance sampling with a bad proposal $q(x)$ versus MCMC with an adaptive proposal $q(x' \mid x)$; the proposals $q(x^{(2)} \mid x^{(1)})$, $q(x^{(3)} \mid x^{(2)})$, $q(x^{(4)} \mid x^{(3)})$ track the previous samples.]
MCMC: Metropolis-Hastings Algorithm
Algorithm: Metropolis-Hastings
1. Initialize $x^{(0)}$.
2. For $i = 0$ to $N - 1$:
3.   Sample $u \sim U(0, 1)$.   // draw acceptance threshold
4.   Sample $x' \sim q\left(x' \mid x^{(i)}\right)$.   // draw from proposal
5.   If $u < A\left(x', x^{(i)}\right) = \min\left(1, \dfrac{\tilde{p}(x')\, q\left(x^{(i)} \mid x'\right)}{\tilde{p}\left(x^{(i)}\right)\, q\left(x' \mid x^{(i)}\right)}\right)$:   // acceptance probability
6.     $x^{(i+1)} = x'$   // new sample is accepted
7.   else:
8.     $x^{(i+1)} = x^{(i)}$   // new sample is rejected; we keep a duplicate of the previous sample
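A minimal NumPy sketch of the algorithm; the bimodal target and the symmetric Gaussian proposal (under which the $q$-ratio in the acceptance probability cancels) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):  # un-normalized bimodal target (illustrative)
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0):
    x = x0                              # 1. initialize x^(0)
    samples = [x]
    for _ in range(n_samples - 1):
        u = rng.uniform()               # 3. acceptance threshold
        x_new = rng.normal(x, step)     # 4. x' ~ q(x' | x)
        # 5. acceptance probability; q is symmetric here, so the
        #    q(x|x')/q(x'|x) factor cancels (the Metropolis case).
        A = min(1.0, p_tilde(x_new) / p_tilde(x))
        if u < A:
            x = x_new                   # 6. accept
        samples.append(x)               # 8. else keep a duplicate
    return np.array(samples)

chain = metropolis_hastings(10_000)
print(chain[1000:].mean())  # discard burn-in; ~0 by symmetry of p(x)
```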
MCMC: Metropolis-Hastings Algorithm
Example:
• Our goal is to sample from a bimodal distribution $p(x)$.
• Let $q(x' \mid x)$ be a Gaussian centered on $x$, and recall
$$A(x', x) = \min\left(1, \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}\right).$$
• The chain then evolves as follows:
1. Initialize $x^{(0)}$.
2. Draw $x' \sim q\left(x' \mid x^{(0)}\right)$; accept it as $x^{(1)}$.
3. Draw $x' \sim q\left(x' \mid x^{(1)}\right)$; accept it as $x^{(2)}$.
4. Draw $x' \sim q\left(x' \mid x^{(2)}\right)$ but reject it; set $x^{(3)} = x^{(2)}$. The proposal is rejected because $\tilde{p}(x') / q\left(x' \mid x^{(2)}\right) < 1$ while $\tilde{p}\left(x^{(2)}\right) / q\left(x^{(2)} \mid x'\right) > 1$, hence $A\left(x', x^{(2)}\right)$ is close to zero.
5. Draw $x' \sim q\left(x' \mid x^{(3)}\right)$; accept it as $x^{(4)}$.
6. Draw $x' \sim q\left(x' \mid x^{(4)}\right)$; accept it as $x^{(5)}$.
• The adaptive proposal $q(x' \mid x)$ allows us to sample both modes of $p(x)$!
Burn-In Period
• The initial samples may follow a very different
distribution, especially if the starting point is in a region
of low density.
• As a result, a burn-in period is typically necessary, where
an initial number of samples (e.g. the first 1,000 or so)
are thrown away.
Burn-in period
Image source: http://wiki.ubc.ca/Course:CPSC522/MCMC
What is the connection between Markov chains and MCMC?
Why does the Metropolis-Hastings algorithm work?
What is a Markov Chain?
• It is intuitive to introduce Markov chains on finite state spaces, where $x^{(i)}$ can only take discrete values $x^{(i)} \in \mathcal{X} = \{x_1, x_2, \dots, x_s\}$.
• The stochastic process $x^{(i)}$ is called a Markov chain if
$$p\left(x^{(i)} \mid x^{(i-1)}, \dots, x^{(1)}\right) = T\left(x^{(i)} \mid x^{(i-1)}\right),$$
where the transition probabilities $T$ can be collected in an $s \times s$ matrix.
• The current state $x^{(i)}$ is conditionally independent of all previous states given the most recent state $x^{(i-1)}$.
Properties of a Markov Chain
1. Homogeneous chain:
• The chain is homogeneous if $T \triangleq T\left(x^{(i)} \mid x^{(i-1)}\right)$ remains invariant for all $i$, with $\sum_{x^{(i)}} T\left(x^{(i)} \mid x^{(i-1)}\right) = 1$ for any $i$ (the sum of each row of $T$ equals 1).
• That is, the evolution of the chain in a space $\mathcal{X}$ depends solely on the current state of the chain and a fixed transition matrix.
Properties of a Markov Chain
2. Stationary and limiting distributions:
• A probability (row) vector $\pi = \pi(x)$ defined on $\mathcal{X}$ is a stationary (invariant) distribution w.r.t. $T$ if
$$\pi T = \pi.$$
• A limiting distribution $\pi$ is a distribution over the states such that, whatever the starting distribution $\mu_1$, the Markov chain converges to $\pi$.
Properties of a Markov Chain
Example: transition graph for the Markov chain example with $\mathcal{X} = \{x_1, x_2, x_3\}$ and transition matrix
$$T = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0.1 & 0.9 \\ 0.6 & 0.4 & 0 \end{pmatrix}.$$
If the initial distribution is $\mu_1 = (0.5, 0.2, 0.3)$ (it can be any distribution), it follows that
$$\mu_2 = \mu_1 T = (0.18, 0.64, 0.18), \quad \dots, \quad \mu_t = \mu_{t-1} T = (0.2213, 0.4098, 0.3689),$$
$$\mu_{t+1} = \mu_t T = (0.2213, 0.4098, 0.3689),$$
i.e. the chain converges to the stationary distribution!
Image source: "An Introduction to MCMC for Machine Learning", Christophe Andrieu et al.
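This convergence is easy to check numerically by repeatedly applying $T$ to the initial distribution (a sketch using the matrix from the example):

```python
import numpy as np

# Transition matrix of the example chain (each row sums to 1).
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

mu = np.array([0.5, 0.2, 0.3])  # initial distribution mu_1
for t in range(50):
    mu = mu @ T                 # mu_{t+1} = mu_t T
print(mu)                       # -> [0.2213, 0.4098, 0.3689]
```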
Properties of a Markov Chain
3. Irreducibility:
• A Markov chain is irreducible if, from any state of the Markov chain, there is a positive probability of visiting all other states, i.e.
$$\forall\, x_i, x_j \in \mathcal{X}, \ \exists\, t \geq 0 \ \text{ s.t. } \ p\left(x^{(t)} = x_j \mid x^{(0)} = x_i\right) > 0.$$
Properties of a Markov Chain
4. Aperiodicity:
• The Markov chain should not get trapped in cycles, i.e.
$$\gcd\left\{t : p\left(x^{(t)} = x_i \mid x^{(0)} = x_i\right) > 0\right\} = 1, \quad \forall\, x_i \in \mathcal{X},$$
where gcd denotes the greatest common divisor.
Ergodic Theorem for Markov Chains
• A Markov chain is ergodic if it is irreducible and aperiodic.
• Ergodicity is important: it implies that we can reach the stationary/limiting distribution $\pi$, no matter what the initial distribution $\mu_1$ is.
• All good MCMC algorithms must satisfy ergodicity, so that we cannot initialize in a way that will never converge.
Detailed Balance (Reversibility)
• A probability vector $\pi = \pi(x)$ defined on $\mathcal{X}$ satisfies detailed balance w.r.t. $T$ if
$$\pi(x_i)\, T(x_j \mid x_i) = \pi(x_j)\, T(x_i \mid x_j), \quad \forall\, x_i, x_j \in \mathcal{X}.$$
Remark 1: Detailed balance ⟹ stationary distribution, i.e. $\pi T = \pi$.
Proof:
$$(\pi T)(x_j) = \sum_{x_i} \pi(x_i)\, T(x_j \mid x_i) = \sum_{x_i} \pi(x_j)\, T(x_i \mid x_j) \quad \text{(detailed balance)}$$
$$= \pi(x_j) \sum_{x_i} T(x_i \mid x_j) = \pi(x_j), \quad \forall\, x_j \in \mathcal{X} \quad \text{(each row of } T \text{ sums to 1)},$$
so $\pi$ is a stationary distribution.
Detailed Balance (Reversibility)
• A probability vector $\pi = \pi(x)$ defined on $\mathcal{X}$ satisfies detailed balance w.r.t. $T$ if
$$\pi(x_i)\, T(x_j \mid x_i) = \pi(x_j)\, T(x_i \mid x_j), \quad \forall\, x_i, x_j \in \mathcal{X}.$$
Remark 2: Detailed balance = "reversibility".
• This is just terminology: we say that a Markov chain is "reversible" if it has a stationary distribution $\pi$ that satisfies detailed balance w.r.t. $T$.
Why does Metropolis-Hastings work?
• Recall that we draw a sample $x'$ according to $q(x' \mid x)$, and then accept/reject according to $A(x', x)$.
• In other words, the transition kernel is:
$$T(x' \mid x) = q(x' \mid x)\, A(x', x).$$
• We shall prove that the Metropolis-Hastings algorithm satisfies detailed balance!
Why does Metropolis-Hastings work?
• Recall that:
$$A(x', x) = \min\left(1, \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}\right)$$
• Notice this implies the following: if $A(x', x) \leq 1$, then
$$\frac{\tilde{p}(x)\, q(x' \mid x)}{\tilde{p}(x')\, q(x \mid x')} \geq 1,$$
and thus $A(x, x') = 1$.
Why does Metropolis-Hastings work?
• Now suppose $A(x', x) < 1$ and $A(x, x') = 1$. We have:
$$A(x', x) = \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}$$
$$\Rightarrow\ A(x', x)\, \tilde{p}(x)\, q(x' \mid x) = \tilde{p}(x')\, q(x \mid x')$$
$$\Rightarrow\ A(x', x)\, \tilde{p}(x)\, q(x' \mid x) = A(x, x')\, \tilde{p}(x')\, q(x \mid x')$$
$$\Rightarrow\ \tilde{p}(x)\, T(x' \mid x) = \tilde{p}(x')\, T(x \mid x').$$
This is the detailed balance condition!
Why does Metropolis-Hastings work?
• In other words, the Metropolis-Hastings algorithm satisfies detailed balance with respect to $\tilde{p}(x)$:
$$\tilde{p}(x)\, T(x' \mid x) = \tilde{p}(x')\, T(x \mid x'),$$
so the chain has a stationary distribution proportional to $\tilde{p}(x)$.
• Recall we defined $\tilde{p}(x)$ to be the (un-normalized) true distribution of $x$!
• Thus, the Metropolis-Hastings chain eventually converges to the true distribution!
Metropolis Algorithm
• The Metropolis algorithm is a special case of the Metropolis-Hastings algorithm.
• The proposal distribution is a symmetric random walk, i.e. $q(x \mid x') = q(x' \mid x)$, e.g. an isotropic Gaussian distribution centered on the current state.
• The acceptance probability of the Metropolis algorithm is given by:
$$A(x', x) = \min\left(1, \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}\right) = \min\left(1, \frac{\tilde{p}(x')}{\tilde{p}(x)}\right).$$
Metropolis Algorithm
• Illustration of using Metropolis algorithm (proposal
distribution: isotropic Gaussian) to sample from a Gaussian
distribution:
Image Source: “Pattern Recognition and Machine Learning”, Christopher Bishop
Accepted sample
Rejected sample
150 candidate samples are
generated, 43 are rejected.
Gibbs Sampling
• Suppose we have an $n$-dimensional vector $x$ and the expressions for the full conditionals $p\left(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n\right)$.
• In this case, we use the following proposal distribution for $j = 1, \dots, n$:
$$q(x' \mid x) = \begin{cases} p\left(x_j' \mid x_{\setminus j}\right) & \text{if } x_{\setminus j}' = x_{\setminus j} \\ 0 & \text{otherwise,} \end{cases}$$
where $x_{\setminus j} = (x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n)$ denotes all components of $x$ except $x_j$.
Gibbs Sampling
• Gibbs sampling is a special case of the Metropolis-Hastings algorithm where the acceptance probability is always one.
Proof:
$$A(x', x) = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right) = \min\left(1, \frac{p\left(x_j' \mid x_{\setminus j}'\right) p\left(x_{\setminus j}'\right) p\left(x_j \mid x_{\setminus j}'\right)}{p\left(x_j \mid x_{\setminus j}\right) p\left(x_{\setminus j}\right) p\left(x_j' \mid x_{\setminus j}\right)}\right),$$
where we used $p(x) = p\left(x_j \mid x_{\setminus j}\right) p\left(x_{\setminus j}\right)$ and $p(x') = p\left(x_j' \mid x_{\setminus j}'\right) p\left(x_{\setminus j}'\right)$.
We use $x_{\setminus j}' = x_{\setminus j}$ because these components are kept fixed during the sampling step:
$$\Rightarrow\ A(x', x) = \min\left(1, \frac{p\left(x_j' \mid x_{\setminus j}\right) p\left(x_{\setminus j}\right) p\left(x_j \mid x_{\setminus j}\right)}{p\left(x_j \mid x_{\setminus j}\right) p\left(x_{\setminus j}\right) p\left(x_j' \mid x_{\setminus j}\right)}\right) = 1.$$
Gibbs Sampling
Algorithm: Gibbs Sampling
1. Initialize $\{x_j^{(0)} : j = 1, \dots, n\}$.
2. For $\tau = 0, \dots, T - 1$:
3.   Sample $x_1^{(\tau+1)} \sim p\left(x_1 \mid x_2^{(\tau)}, x_3^{(\tau)}, \dots, x_n^{(\tau)}\right)$.
4.   Sample $x_2^{(\tau+1)} \sim p\left(x_2 \mid x_1^{(\tau+1)}, x_3^{(\tau)}, \dots, x_n^{(\tau)}\right)$.
     ⋮
5.   Sample $x_j^{(\tau+1)} \sim p\left(x_j \mid x_1^{(\tau+1)}, \dots, x_{j-1}^{(\tau+1)}, x_{j+1}^{(\tau)}, \dots, x_n^{(\tau)}\right)$.
     ⋮
6.   Sample $x_n^{(\tau+1)} \sim p\left(x_n \mid x_1^{(\tau+1)}, x_2^{(\tau+1)}, \dots, x_{n-1}^{(\tau+1)}\right)$.
(Josiah Willard Gibbs, 1839–1903)
Image Source: “Pattern Recognition and Machine Learning”, Christopher Bishop
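A sketch of this sweep for a case where the full conditionals are known in closed form: a bivariate Gaussian with correlation $\rho$, whose conditionals are the standard results $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$ (the setup is my illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
RHO = 0.8  # correlation of the illustrative bivariate Gaussian target

def gibbs(n_samples):
    x1, x2 = 0.0, 0.0           # 1. initialize
    out = np.empty((n_samples, 2))
    for t in range(n_samples):  # 2. for each sweep
        # 3. x1 ~ p(x1 | x2) = N(rho * x2, 1 - rho^2)
        x1 = rng.normal(RHO * x2, np.sqrt(1 - RHO ** 2))
        # 4. x2 ~ p(x2 | x1), using the freshly updated x1
        x2 = rng.normal(RHO * x1, np.sqrt(1 - RHO ** 2))
        out[t] = (x1, x2)
    return out

chain = gibbs(20_000)
print(np.corrcoef(chain[5000:].T)[0, 1])  # ~= 0.8 after burn-in
```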
Gibbs Sampling: Markov Blankets
• The conditional $p\left(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n\right)$ looks intimidating, but recall Markov blankets:
$$p\left(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n\right) = p\left(x_j \mid \mathrm{MB}(x_j)\right),$$
where $\mathrm{MB}(x_j)$ denotes the Markov blanket of $x_j$.
• Bayesian network: the Markov blanket of $x_j$ is the set containing its parents, children, and co-parents.
• MRF: the Markov blanket of $x_j$ is its immediate neighbors.
Image source: "Pattern Recognition and Machine Learning", Christopher Bishop
Gibbs Sampling: Example on a Bayesian Network
B: Burglary, E: Earthquake, A: Alarm, J: John calls, M: Mary calls
(Graph: B → A ← E, A → J, A → M)

p(B = T) = 0.001
p(E = T) = 0.002

p(A = T | B, E):
B = T, E = T: 0.95
B = T, E = F: 0.94
B = F, E = T: 0.29
B = F, E = F: 0.01

p(J = T | A):  A = T: 0.90 | A = F: 0.05
p(M = T | A):  A = T: 0.70 | A = F: 0.01

• Assume we sample the variables in the order B, E, A, J, M.
• Initialize all variables at t = 0 to False:

t | B E A J M
0 | F F F F F
1 |
2 |
3 |
4 |
Gibbs Sampling: Example on a Bayesian Network
• Sampling B ~ p(B | E, A) at t = 1: using Bayes' rule, we have $p(B \mid E, A) \propto p(A \mid B, E)\, p(B)$. With the current values (E, A) = (F, F), we compute the following and sample B = F:
$$p(B = F \mid E = F, A = F) \propto p(A = F \mid B = F, E = F)\, p(B = F) = (0.99)(0.999) = 0.98901$$
$$p(B = T \mid E = F, A = F) \propto p(A = F \mid B = T, E = F)\, p(B = T) = (0.06)(0.001) = 0.00006$$

• Sampling E ~ p(E | B, A) at t = 1: using Bayes' rule, $p(E \mid B, A) \propto p(A \mid B, E)\, p(E)$. With (B, A) = (F, F), we compute the following and sample E = T:
$$p(E = F \mid B = F, A = F) \propto (0.99)(0.998) = 0.98802$$
$$p(E = T \mid B = F, A = F) \propto (0.71)(0.002) = 0.00142$$

• Sampling A ~ p(A | B, E, J, M) at t = 1: using Bayes' rule, $p(A \mid B, E, J, M) \propto p(J \mid A)\, p(M \mid A)\, p(A \mid B, E)$. With (B, E, J, M) = (F, T, F, F), we compute the following and sample A = F:
$$p(A = T \mid B = F, E = T, J = F, M = F) \propto (0.1)(0.3)(0.29) = 0.0087$$
$$p(A = F \mid B = F, E = T, J = F, M = F) \propto (0.95)(0.99)(0.71) = 0.6678$$

• Sampling J ~ p(J | A) at t = 1: no need to apply Bayes' rule, since A is J's only parent (and its entire Markov blanket). With A = F, we compute the following and sample J = T:
$$p(J = T \mid A = F) = 0.05, \quad p(J = F \mid A = F) = 0.95$$

• Sampling M ~ p(M | A) at t = 1: again, no Bayes' rule is needed. With A = F, we compute the following and sample M = F:
$$p(M = T \mid A = F) = 0.01, \quad p(M = F \mid A = F) = 0.99$$

• Now t = 2, and we repeat the procedure to sample new values of B, E, A, J, M, and similarly for t = 3, 4, etc.:

t | B E A J M
0 | F F F F F
1 | F T F T F
2 | F T T T T
3 | T F T F T
4 | T F T F F
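A sketch of one full Gibbs sweep over this network, with the CPTs taken from the slides; the `bern` helper for sampling from two un-normalized scores is my addition, and the printed rows depend on the random seed rather than reproducing the table above exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# CPTs from the slides; True/False encoded as 1/0.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.01}  # P(A=T|B,E)
P_J = {1: 0.90, 0: 0.05}  # P(J=T|A)
P_M = {1: 0.70, 0: 0.01}  # P(M=T|A)

def bern(score_true, score_false):
    """Sample a binary value from two un-normalized scores."""
    return int(rng.uniform() < score_true / (score_true + score_false))

def gibbs_sweep(state):
    b, e, a, j, m = state
    # B | MB(B) = {E, A}: proportional to p(B) * p(A | B, E)
    pa = lambda bb: P_A[(bb, e)] if a else 1 - P_A[(bb, e)]
    b = bern(P_B * pa(1), (1 - P_B) * pa(0))
    # E | MB(E) = {B, A}: proportional to p(E) * p(A | B, E)
    pa = lambda ee: P_A[(b, ee)] if a else 1 - P_A[(b, ee)]
    e = bern(P_E * pa(1), (1 - P_E) * pa(0))
    # A | MB(A) = {B, E, J, M}: p(A | B, E) * p(J | A) * p(M | A)
    lik = lambda aa: ((P_J[aa] if j else 1 - P_J[aa]) *
                      (P_M[aa] if m else 1 - P_M[aa]))
    a = bern(P_A[(b, e)] * lik(1), (1 - P_A[(b, e)]) * lik(0))
    # J and M depend only on A, so no Bayes rule is needed.
    j = bern(P_J[a], 1 - P_J[a])
    m = bern(P_M[a], 1 - P_M[a])
    return (b, e, a, j, m)

state = (0, 0, 0, 0, 0)  # t = 0: all False, as in the slides
for t in range(1, 5):
    state = gibbs_sweep(state)
    print(t, state)
```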
Gibbs Sampling: Illustration
Image source: http://slideplayer.com/slide/4261639/
Summary
• We have looked at how to:
1. Explain the Monte Carlo principle and its justification for sampling methods.
2. Apply rejection, importance, Metropolis-Hastings, Metropolis, and Gibbs sampling methods to compute maximum probability, approximate inference, and expectations.
3. Use Markov chain properties, i.e. homogeneity, stationary distribution, irreducibility, aperiodicity, ergodicity, and detailed balance, to show the validity of the MH algorithm.