Computing for Data Science
or Stochastic Computation and Algorithms
Course notes for STAT 413
Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1
March 12, 2023
This work is licensed under the Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/
by-nc-sa/4.0/.
Contents
Preface 1
1 Random Number Generation 2
1.1 Probability Integral Transform . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 CDFs with computable inverses . . . . . . . . . . . . . . . . . 4
1.1.2 CDFs without computable inverses . . . . . . . . . . . . . . . 5
1.1.3 CDFs with discrete distributions . . . . . . . . . . . . . . . . 7
1.2 Other Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Box-Muller Transform . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Acceptance-Rejection Sampling . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Acceptance-Rejection-Squeeze . . . . . . . . . . . . . . . . . . 13
1.3.2 Box-Muller without Trigonometry . . . . . . . . . . . . . . . 13
1.4 Ratio of Uniforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.1 Gamma Random Variables . . . . . . . . . . . . . . . . . . . 16
1.5 Points in Geometric Objects . . . . . . . . . . . . . . . . . . . . . . . 17
1.5.1 Simplexes and the Dirichlet Distribution . . . . . . . . . . . . 17
1.5.2 Spheres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Volume Computation and Integration 20
2.1 Volume Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Sampling from a regular grid . . . . . . . . . . . . . . . . . . 21
2.1.2 Uniform Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Deterministic Approaches . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . 26
2.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Stochastic Optimization 32
3.1 Basic Stochastic Search . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . 32
A Appendix 33
A.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Preface
Never really seized by the need to calculate, he was more
apt to be aware of pattern than of brute numeration.
Ratner’s Star
Don DeLillo (1976)
This course came into being only a few years ago. At its birth, it was called
Statistical Computing. Then, it was rebranded as Computing for Data Science. If I
were to give it a more descriptive name, I would call it Stochastic Computation and
Algorithms. The main focus of these notes is how to calculate a desired number using
randomization. In Chapter 1, we first discuss how to generate random
numbers using a deterministic computer. It has been said that “It may seem perverse
to use a computer, that most precise and deterministic of all machines conceived by
the human mind, to produce ‘random’ numbers” (Press et al., 2007). Nevertheless,
we require random numbers, for some definition of random, in order to proceed with
stochastic algorithms. In Chapter 2, we consider how to use stochastic algorithms
to compute volumes and integrate functions. In Chapter 3, we consider how to use
stochastic algorithms to optimize a function.
There are three textbooks I used as inspiration for these course notes: George
Fishman’s Monte Carlo: concepts, algorithms, and applications (Fishman, 2013);
the famous Numerical Recipes by Press, Teukolsky, Vetterling, and Flannery (Press
et al., 2007); and James Spall’s Introduction to stochastic search and optimization
(Spall, 2005).
Adam B Kashlak
Edmonton, Canada
Fall 2022
Chapter 1
Random Number Generation
Introduction
Type ?set.seed into the R command prompt and start reading under the “Details”
heading. It is easy to take random number generation for granted, but an incredible
amount of work has gone into making sure that your simulations in R or Python or
other languages are as random as possible given a deterministic computer. Many
of the implemented methods within R or Python or other computing languages are
based on algorithms such as Linear Feedback Shift Registers (LFSR) and Linear
Congruential Generators (LCG). More details on these methods can be found in
Chapter 7 of Fishman (2013) and Chapter 7 of Press et al. (2007). When discussing
“random” numbers generated on a computer, they are often called pseudorandom
numbers as they are deterministic but satisfy many properties of a set of truly
random numbers.
For this chapter, we assume that it is possible to generate independent uniformly
random binary variables. That is, for any n ∈ N, we can generate B1, . . . , Bn such
that P(Bi = 1) = P(Bi = 0) = 0.5 and such that Bi and Bj are independent for all
i ≠ j. At the time of writing, pure randomness has not been achieved, and one has
to be cautious of the fact that such a sequence of Bi’s is necessarily deterministic,
but sufficiently chaotic to be considered as random.
From uniformly random bits come uniformly random variates on the unit interval
[0, 1]. We will write U1, . . . , Un ~ iid Uniform[0, 1] for n independent uniform
random variates. This implies that the cumulative distribution function (CDF) for
Ui is
P(Ui ≤ u) = 0 for u ≤ 0,  u for u ∈ (0, 1),  and 1 for u ≥ 1,
and Ui ⊥⊥ Uj for i ≠ j.1 Generating such uniform variates can be done in R via the
1 The symbol ⊥⊥ is often used to indicate independence between two random variables.
function runif( ). This function directly calls compiled C code that implements
the chosen random generator from set.seed( ).
Note that the events U = 0 and U = 1 have probability zero of occurring. In
fact, in the C code underlying the R code for random number generation, if the
pseudorandom uniform variate is exactly 0 or 1, the code adjusts very slightly up
or down, respectively, so that runif( ) never actually returns 0 or 1.
Remark 1.0.1. In probability & measure theory, there is a theorem proving the
existence of countably infinite independent random variables on R. The technique of
the proof precisely uses the same approach as the computer does. That is, (1) gener-
ate independent Bernoulli random variables. (2) use those to construct independent
uniform random variables. (3) use the uniforms to construct a general collection of
independent random variables. See STAT 571 notes for more details on this proof.
1.1 Probability Integral Transform
Thus far, we have our collection of iid uniform random variates U1, . . . , Un. However,
more interesting and diverse distributions are often desirable. We will use U1, . . . , Un
to construct new random variables with arbitrary probability distributions. The first
tool we will make use of is the probability integral transform.
Theorem 1.1.1 (Probability Integral Transform). Let F : R → [0, 1] be a strictly
increasing continuous function such that limx→−∞ F (x) = 0 and limx→∞ F (x) = 1.
Then, F−1 : [0, 1]→ R exists and if U ∼ Uniform [0, 1], then
Z = F−1(U)
has F as its cumulative distribution function.
Proof. We want the CDF for Z = F−1(U). That is,
P (Z ≤ t) = P (F−1(U) ≤ t)
= P (U ≤ F (t)) = F (t).
Alternatively, this theorem implies that if Z has continuous increasing CDF F ,
then F (Z) is uniform on [0, 1]. Note that this theorem can be extended to CDFs
with discontinuities by defining the inverse CDF as follows:
F−1(u) = inf{t ∈ R : F (t) ≥ u}
for u ∈ [0, 1]. A jump discontinuity in a CDF implies a point mass exists in the
distribution. Only a countable number of these are possible.
1.1.1 CDFs with computable inverses
If the function F has an inverse that exists and furthermore if that inverse can be
expressed in a closed form analytically, then generating random variables from F
is as easy as evaluating the inverse F−1. However, just because generating random
variables from F is easy to code does not imply that this approach is computationally
efficient. More on this will be discussed in Section 1.2.
Example 1.1.1 (Exponential Distribution). The exponential distribution is criti-
cally important in queuing processes, Markov processes, and other models where a
“memoryless” waiting time is required. Thus, being able to simulate exponential
random variates is very desirable.
The exponential distribution with rate parameter λ > 0 respectively has PDF and
CDF
f(x) = λe−λx and F (x) = 1− e−λx.
Hence, F−1(u) = −λ−1 log(1 − u). However, if U ∼ Uniform [0, 1] then 1 − U ∼
Uniform [0, 1]. Thus, to generate exponential random variates with rate λ, we can
compute
X = −(1/λ) log(U).
The parameter λ is referred to as the rate parameter as it can be interpreted as
the reciprocal of the expected waiting time. That is, EX = 1/λ. Hence, if we are
modelling, say, the time until the next lightning strike occurs in a thunderstorm or
the next atom radioactively decays, a larger λ means a faster rate of occurrences.
Note that while this is an easy way to generate exponential random variates, it
is not the way that R does it with the rexp() function in the base stats package.
This will be discussed in a subsequent section.
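As a concrete illustration, the following R sketch draws exponential variates via the inverse CDF; the helper name my_rexp is hypothetical and, as noted above, this is not the algorithm behind rexp().

    # A minimal sketch: exponential variates via the probability integral transform.
    my_rexp <- function(n, rate = 1) {
      u <- runif(n)          # U ~ Uniform[0, 1]
      -log(u) / rate         # uses the fact that 1 - U ~ Uniform[0, 1]
    }
    set.seed(413)
    mean(my_rexp(1e5, rate = 2))   # should be close to E[X] = 1/2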
Example 1.1.2 (Cauchy Distribution). The Cauchy distribution is a Student’s t-
distribution with only 1 degree of freedom. While it has a bell-curve shape similar to
the normal distribution, none of the moments of the Cauchy distribution are defined;
e.g., it does not even have a finite mean.
A random variable X is Cauchy if it respectively has the PDF and CDF
f(x) = 1 / [π(1 + x²)] and F(x) = (1/π) arctan(x) + 1/2.
Thus, we can write X = F⁻¹(U) = tan(π(U − 1/2)) to generate Cauchy random
variates X given U ∼ Uniform[0, 1].
The R source code uses this approach for generating Cauchy random variates via
rcauchy(). For the more general t-distribution, one can note that if Z ∼ N(0, 1),
V ∼ χ²(ν), and Z ⊥⊥ V, then X = Z/√(V/ν) ∼ t(ν). Thus, the R code in rt.c
generates a standard normal random variate Z and a chi-squared variate V in order
to return a t random variate, which will be Cauchy if ν = 1.
1.1.2 CDFs without computable inverses
It is often the case that F−1 exists, but no closed form analytic expression is known.
The most obvious example of this is the normal or Gaussian distribution. The CDF
is written in terms of an integral
F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt,
which does not have a closed form.
However, such an issue does not completely ruin our ability to use the probability
integral transform. Indeed, if we wish to compute x = F−1(u) for some u ∈ [0, 1],
this is equivalent to finding an x such that 0 = F (x) − u. Hence, we have a root
finding problem, which can be solved by many algorithms. Most notably, we can
use the Newton-Raphson method, which is detailed in Algorithm 1.
The Newton-Raphson Method
The Newton-Raphson algorithm can be used to find the root of F (x)− u assuming
that the PDF f(x) = F ′(x) exists and that both f(x) and F (x) can be evaluated on
a computer. The idea behind the Newton-Raphson algorithm is to start at a point
x0 in the domain D of F . Then, a tangent line can be drawn passing through the
point (x0, F(x0) − u). Let x1 be the root of this tangent line. Since the tangent has slope f(x0),
f(x0) = (F(x0) − u) / (x0 − x1),
and solving for x1 gives
x1 = x0 − (F(x0) − u) / f(x0).
The algorithm can be iterated by repeating the above with x1 in place of x0 to get
a new root x2. This may continue until convergence is achieved.
It can be shown using Taylor’s Theorem that if F has a continuous second
derivative, then the Newton-Raphson algorithm converges quadratically. Indeed,
denoting the true root by x∞, we have that
|xi+1 − x∞| = Ci(xi − x∞)2.
But this requires some additional conditions such as f(x∞) = F ′(x∞) 6= 0. Other-
wise, convergence can be hindered.
Furthermore, the choice of starting point is very important for quick convergence.
Also, certain functions will cause this algorithm to fail. For example, try applying
Newton-Raphson to F(x) = √x for x ≥ 0 and F(x) = −√(−x) for x < 0 with an initial
value of x0 = 1 and see what happens.
Algorithm 1 The Newton-Raphson Algorithm
Input:
F : D → R, an invertible differentiable function
f : D → R+, the derivative f = F ′
x0 ∈ D, an initial value in the domain of F
τ ∈ R+, a threshold for convergence
Algorithm:
For i ≥ 1
xi = xi−1 − F (xi−1)/f(xi−1)
if |xi − xi−1| < τ
stop
return xi
Output:
xi such that F (xi) ≈ 0
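For instance, to draw a standard normal variate via the probability integral transform, one can run Newton-Raphson on F(x) − u using pnorm and dnorm. The R sketch below is one possible implementation under these assumptions; it is not necessarily how R's own qnorm works.

    # Sketch: invert the standard normal CDF by Newton-Raphson.
    newton_inverse_cdf <- function(u, x0 = 0, tol = 1e-10, maxit = 100) {
      x <- x0
      for (i in 1:maxit) {
        step <- (pnorm(x) - u) / dnorm(x)   # (F(x) - u) / f(x)
        x <- x - step
        if (abs(step) < tol) break
      }
      x
    }
    set.seed(1)
    z <- sapply(runif(1e4), newton_inverse_cdf)
    c(mean(z), var(z))   # approximately 0 and 1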
Remark 1.1.3 (Halley’s Method). Many other methods of root finding exist. One
method known as Halley’s method can be obtained by applying Newton-Raphson to
the function F(x)/√|F′(x)|, assuming that for any x such that F(x) = 0, we have
that F′(x) ≠ 0. In Halley’s method, the update step
x_{i+1} = x_i − F(x_i)/F′(x_i)
is replaced by
x_{i+1} = x_i − 2 F(x_i) F′(x_i) / [2 F′(x_i)² − F(x_i) F″(x_i)].
This approach comes with its own assumptions and concerns, which will be left to
other courses to investigate. See “Householder Methods” for even more.
The Bisection Method
In the context of random variate generation, the existence of a CDF usually implies
the existence of a PDF. However, given a function F to find a root, we may not
have access to the derivative f . In that case, we require a gradient-free root finding
method. Many of these exist. In this section, we will highlight a simple algorithm
applicable when it is known that the root of F lies in the finite interval [a, b] and
F (a) < 0 and F (b) > 0.2 This is the Bisection algorithm detailed in Section 2.3
of Fishman (2013) and written out in Algorithm 2. It is effectively a binary search
algorithm.
2 Or alternatively, F (a) > 0 > F (b). Just as long as the signs are different.
Much like a binary search algorithm, the Bisection algorithm is guaranteed to
stop after at most ⌈log₂((b − a)/τ₁)⌉ steps. Hence, the tighter the interval [a, b]
is, the quicker the algorithm will converge; so, as with the starting value for Newton-
Raphson, the closer you start to the true root, the faster the algorithm terminates.
Example 1.1.4 (Beta Distribution). The beta distribution commonly arises as a
Bayesian prior and posterior distribution when a parameter of interest lies in the
unit interval. If X ∼ Beta (α, β), then the PDF is
f(x) = x^{α−1} (1 − x)^{β−1} / B(α, β)
for x ∈ [0, 1] where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the beta function and Γ( ) is the
gamma function.3 The CDF is then known as the regularized incomplete beta function
F(x) = (1/B(α, β)) ∫₀ˣ t^{α−1} (1 − t)^{β−1} dt.
Of note, when α = β = 1/2, the distribution is known as the arcsine distribution as
its CDF is
F(x) = (2/π) arcsin(√x)
for x ∈ [0, 1]. Finding a root x of F (x) − u for some u ∈ [0, 1] could be done with
the Bisection algorithm.
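A bisection sketch for this arcsine example, using pbeta for F; the function name and tolerance are illustrative assumptions.

    # Sketch: solve F(x) = u for the Beta(1/2, 1/2) CDF by bisection.
    bisect_beta <- function(u, a = 0, b = 1, tol = 1e-8) {
      g <- function(x) pbeta(x, 1/2, 1/2) - u   # function whose root we seek
      while (b - a > tol) {
        z <- (a + b) / 2
        if (g(z) < 0) a <- z else b <- z
      }
      (a + b) / 2
    }
    bisect_beta(0.5)         # median of the arcsine distribution
    qbeta(0.5, 1/2, 1/2)     # check against R's own inverse CDF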
1.1.3 CDFs with discrete distributions
There are many discrete probability distributions that we may wish to simulate from.
These include the binomial, Poisson, geometric, and hypergeometric distributions.
In such a case, we have a discrete support D = {x0, x1, x2, . . .} and a CDF F such
that
0 ≤ F(x0) < F(x1) < . . . < F(xi) < F(xi+1) < . . . ≤ 1.
If the support is finite with, say, |D| = m, then F (xm) = 1. Otherwise, F (xi) → 1
as i→∞.4
One easy approach (coding-wise) to generating random variates from a
discrete distribution is to first compute the ordered values F(xi) ∈ [0, 1]. Then,
generate a random U ∼ Uniform[0, 1] and return xi for the smallest i such that
F(xi) ≥ U. In practice, the C code written to generate, say, Poisson or binomial random
variables for R is immensely complex and needs to consider various extreme cases
and numerical stability.
3For n ∈ N, Γ(n) = (n − 1)! and for general x ∈ R+, Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt.
4We can also extend the support of F to −∞ if desired. However, most discrete distributions
of interest have support contained in Z+.
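A minimal sketch of this table-lookup approach for a Poisson(λ) variate, truncating the support at an arbitrary upper bound m (an assumption made purely for illustration; this is far simpler than R's own rpois code):

    # Sketch: discrete inverse-CDF sampling for Poisson(lambda), support truncated at m.
    r_discrete_pois <- function(n, lambda = 3, m = 100) {
      x <- 0:m
      cdf <- cumsum(dpois(x, lambda))        # F(x_0), F(x_1), ..., F(x_m)
      u <- runif(n)
      x[findInterval(u, cdf) + 1]            # smallest x_i with F(x_i) >= u
    }
    set.seed(2)
    mean(r_discrete_pois(1e5, lambda = 3))   # close to 3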
Algorithm 2 The Bisection Algorithm
Input:
F : D → R, a function whose root is to be found
a, b ∈ D such that F (a) < 0 < F (b), interval to search
τ1, τ2 ∈ R+, thresholds for convergence
Algorithm:
Set x0 = a and x1 = b
For i ≥ 1
Compute z = (x0 + x1)/2
Compute F (z)
if F (z) < 0
Set x0 = z
else if F (z) ≥ 0
Set x1 = z
if |x1 − x0| < τ1 or |F (z)| < τ2
stop
return z
Output:
z such that F (z) ≈ 0
1.2 Other Transformations
The probability integral transform described in the previous section takes a uniform
random variate U and transforms it into a new random variate with some CDF F .
In this section, we will consider other ways to transform one random variate into
another.
Example 1.2.1 (Student’s t distribution). The t distribution is of critical impor-
tance in statistical hypothesis testing under the assumption of Gaussian errors. How-
ever, we may also wish to generate t random variates. The t distribution with ν
degrees of freedom has PDF
f(x) = C(1 + x2/ν)−(ν+1)/2
for some normalizing constant C and a CDF in terms of something called a Gaussian
Hypergeometric Function. Suffice to say, we don’t want to try to write down F let
alone invert it. Luckily, as noted above in Example 1.1.2 on the Cauchy distribution,
a t random variate X with ν degrees of freedom can be defined as X = Z/√(V/ν) where
Z ∼ N(0, 1), V ∼ χ²(ν), and Z ⊥⊥ V. This is precisely how R generates t random
variates.
Example 1.2.2 (Exponential distribution). As discussed in Example 1.1.1, expo-
nential random variates with rate parameter λ > 0 can be generated by using the
probability integral transform
X = −(1/λ) log(U)
for U ∼ Uniform[0, 1]. However, more efficient methods exist; one such method is
noted in Fishman (2013), Section 3.10. In particular, let X0 be a random variate from
a truncated exponential distribution. That is, for x ∈ [0, log 2], X0 has PDF and CDF
f_{X0}(x) = 2e^{−x} and F_{X0}(x) = 2 − 2e^{−x}.
Next, let K ∼ Geometric(1/2), which is supported on the non-negative integers with
p_K(k) = P(K = k) = (1/2)^{k+1},
and with X0 ⊥⊥ K. Then, we can write X = X0 + K log 2, which is Exponential(1).
Indeed, any x ∈ R+ can be uniquely written as x = x0 + k log 2 for k ∈ Z+ and
x0 ∈ [0, log 2). Denoting the convolution product by ⋆, the PDF is
f_X(x) = (f_{X0} ⋆ p_K)(x) = P(K = k) f_{X0}(x0) = 2^{−k−1} · 2e^{−x0} = exp(−k log 2 − x0) = e^{−x}.
Meanwhile, the current R source code uses an algorithm published by Ahrens & Di-
eter in 1972 in a paper called “Computer methods for sampling from the exponential
and normal distributions.”
1.2.1 Box-Muller Transform
The Gaussian or Normal distribution is perhaps the most widely used probabil-
ity distribution. While its shape is an elegant bell curve, the CDF has no closed
form making use of tools like the Probability Integral Transform very tedious to
implement. Luckily, the Box-Muller transform gives us an alternative approach.
We say that X ∼ N (µ, σ2) if
f_X(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
for µ ∈ R and σ2 > 0. Strictly speaking σ2 = 0 allows for a degenerate Gaussian
distribution that reduces to a point mass at the mean µ. It can be shown that
if Z ∼ N (0, 1), which is referred to as a standard normal distribution, that X =
µ+σZ ∼ N (µ, σ2). Hence, we only need to be concerned with generating standard
normal random variates Z.
Theorem 1.2.1 (Box-Muller Transform, 1958). If U ∼ Uniform [0, 1] and V ∼
Exponential (1) and U ⊥⊥ V , then
X = √(2V) cos(2πU) and Y = √(2V) sin(2πU)
are independent N (0, 1) random variables.
Proof. This theorem is proven via a change of variables. Note that
V = (X² + Y²)/2 and U = arctan(Y/X)/(2π).
Then, defining the functions v(x, y) = (x² + y²)/2 and u(x, y) = arctan(y/x)/(2π) gives
partial derivatives
∂v/∂x = x,  ∂v/∂y = y,
∂u/∂x = (1/(2π)) · (1/(1 + y²/x²)) · (−y/x²) = −y / (4π v(x, y)),
∂u/∂y = (1/(2π)) · (1/(1 + y²/x²)) · (1/x) = x / (4π v(x, y)).
Then, by the change of variables formula, we have that
f_{X,Y}(x, y) = f_{U,V}(u, v) |∂u/∂x · ∂v/∂y − ∂u/∂y · ∂v/∂x|
= 1[u(x, y) ∈ [0, 1]] e^{−v(x,y)} |−y²/(4π v(x, y)) − x²/(4π v(x, y))|
= (1/(2π)) exp{−(x² + y²)/2}.
As this is the Gaussian PDF in R2 with mean zero and covariance I2, we have our
conclusion.
The Box-Muller transform reduces the task of generating a normal random
variate to the tasks of generating a uniform and an exponential random variate.
However, as is typical in these notes, more sophisticated and optimized generation
methods exist in the R source code.
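A direct R sketch of the transform (the exponential variate could equally be taken as −log of a second uniform):

    # Sketch: Box-Muller transform producing pairs of independent N(0,1) variates.
    box_muller <- function(n) {
      u <- runif(n)
      v <- rexp(n)                   # V ~ Exponential(1)
      cbind(sqrt(2 * v) * cos(2 * pi * u),
            sqrt(2 * v) * sin(2 * pi * u))
    }
    set.seed(3)
    z <- box_muller(1e5)
    round(c(colMeans(z), apply(z, 2, var), cor(z[, 1], z[, 2])), 3)   # means ~0, variances ~1, correlation ~0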
1.3 Acceptance-Rejection Sampling
Acceptance-Rejection sampling, sometimes just called Rejection sampling, is a pow-
erful technique for sampling from complex distributions. In short, we wish to sample
from some probability density f(x), but it is hard to do so. The solution is to find a
proposal or candidate distribution h(x) that is close to f and easy to sample from.
Then, we sample from h instead and reject any samples that don’t look like they
came from f . This is made more precise in the following theorem attributed to John
Von Neumann.
Note that the acceptance-rejection method is a critical step in the Metropolis-
Hastings algorithm for performing Markov Chain Monte Carlo. However, MCMC
is beyond the scope of this course. In these notes, we restrict ourselves to independent sampling
rather than invoking Markov chains.
Remark 1.3.1. In the following theorem, we use | to denote conditioning. That
is, we define a random variable X to equal another random variable Z after forcing
a condition to hold. For example, you could consider a random 20-sided die roll
conditioned on the number displayed being less than 8.
In general, for two real-valued random variables X and Y , we can write the
conditional density function as
f_{Y|X}(y|x) = f(x, y) / f_X(x)
where f(x, y) is the joint density function.
Theorem 1.3.1 (Von Neumann, 1951). Let f(x) be a pdf with support on [a, b] that
can be written as f(x) = c g(x) h(x) where h(x) ≥ 0, ∫_a^b h(x) dx = 1, g(x) ∈ [0, 1],
and some constant c ≥ 1. If Z ∼ h(z) and U ∼ Uniform [0, 1] and Z ⊥⊥ U , then
X = Z|{U ≤ g(Z)} has pdf f(x).
Proof. Since U and Z are independent, the joint pdf is simply f(u, z) = 1[u ∈ [0, 1]] h(z).
Consequently, the conditional pdf for X satisfies
h(z | U ≤ g(Z)) P(U ≤ g(Z)) = ∫_0^{g(z)} f(u, z) du,   so   h(z | U ≤ g(Z)) = ∫_0^{g(z)} f(u, z) du / P(U ≤ g(Z)).
The numerator becomes ∫_0^{g(z)} f(u, z) du = g(z) h(z). The denominator becomes
P(U ≤ g(Z)) = ∫∫_{[0,1]×[a,b]} 1[u ≤ g(z)] du h(z) dz = ∫_a^b ∫_0^1 1[u ≤ g(z)] du h(z) dz = ∫_a^b g(z) h(z) dz.
As f(x) is a pdf,
1 = ∫_a^b f(x) dx = c ∫_a^b g(x) h(x) dx.
Thus, ∫_a^b g(x) h(x) dx = c^{−1}, and consequently, h(z | U ≤ g(Z)) = c g(z) h(z) = f(z).
Remark 1.3.2. While many choices of c and g(x) are possible, the acceptance-
rejection method is most efficient when c is close to 1. This is because we accept a
Algorithm 3 The Acceptance-Rejection Algorithm
Input:
g(x) ∈ [0, 1]
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ g(z)
stop
return x = z
Output:
x from pdf f(x) = cg(x)h(x).
randomly generated variate X when U ≤ g(Z) which occurs with probability 1/c as
per the above proof.
Hence, given a pdf of interest f and a candidate distribution h, then c is chosen
to be
c = sup_x {f(x)/h(x)}
and g is chosen accordingly.
Furthermore, the number of proposals until the event U ≤ g(Z) first occurs has a
geometric distribution with success probability 1/c. Hence, the expected number of
iterations until acceptance is achieved is c.
Example 1.3.3 (Half-Normal Distribution). This example is taken from Example
3.1 of Fishman (2013). The half-normal distribution is the standard normal but
only supported on the non-negative reals. That is,
f(x) = √(2/π) e^{−x²/2} = √(2e/π) e^{−(x−1)²/2} e^{−x}.
Thus, taking c = √(2e/π), g(x) = exp[−(x − 1)²/2], and h(x) = exp(−x), we can
sample from the exponential distribution e^{−x} and apply Algorithm 3 to generate
half-normal random variates.
Furthermore, this allows for an alternative way to sample from the normal dis-
tribution. That is, use Algorithm 3 to generate X as half-normal. Then, generate
U ∼ Uniform [0, 1] and set
X ← X if U ≥ 1/2, and X ← −X if U < 1/2.
Algorithm 4 The Acceptance-Rejection-Squeeze Algorithm
Input:
g, gL, gU ∈ [0, 1] such that gL ≤ g ≤ gU
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ gL(z) pretest with gL
stop
else if u ≤ gU (z) pretest with gU
if u ≤ g(z) only evaluate g if pretests fail
stop
return x = z
Output:
x from pdf f(x) = cg(x)h(x).
In this setting, we require one Uniform and one Exponential random variate until
a sample is “accepted”. Then, one final Uniform random variate is required to set
the sign, i.e. positive or negative.
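The half-normal acceptance-rejection scheme with the final sign flip can be sketched as follows; the function name is an assumption and the loop simply follows Algorithm 3.

    # Sketch: standard normal variates via acceptance-rejection from Exponential(1).
    r_normal_ar <- function(n) {
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          z <- rexp(1)                          # candidate from h(x) = exp(-x)
          u <- runif(1)
          if (u <= exp(-(z - 1)^2 / 2)) break   # accept with probability g(z)
        }
        out[i] <- if (runif(1) < 0.5) -z else z # random sign: half-normal to normal
      }
      out
    }
    set.seed(4)
    x <- r_normal_ar(2e4)
    c(mean(x), var(x))   # approximately 0 and 1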
1.3.1 Acceptance-Rejection-Squeeze
Sometimes the function g(x) may be slow to compute. However, if we can construct
an envelope gL(x) ≤ g(x) ≤ gU (x) where gL, gU are “simpler” functions, then we
can use gL and gU to perform pretests on whether we should accept or reject.
The most natural choices for gL and gU come from power series. For example,
1 − x ≤ e−x ≤ 1 − x + x2/2 for x ≥ 0. Hence, instead of merely evaluating
exp[−(x− 1)2/2] in the previous example for the half-normal distribution, we could
choose
g_L(x) = 1 − (x − 1)²/2 and g_U(x) = 1 − (x − 1)²/2 + (x − 1)⁴/8.
1.3.2 Box-Muller without Trigonometry
Instead of generating two independent uniform deviates to perform the Box-Muller
transform (i.e. sampling within the square [0, 1]× [0, 1]), we can instead sample from
within the unit circle using an acceptance-rejection step to generate normal random
variates without needing to evaluate sines and cosines.
Indeed, we can generate U, V ∼ Uniform[−1, 1] until U² + V² ≤ 1 and hence (U, V)
lies within the unit circle. This means that the ordered pair (U, V) has a uniformly
random angle between 0 and 2π and a squared distance to the origin U² + V² that is
Uniform[0, 1]. After that is achieved, we claim that
√(−2 log(U² + V²)) · V/√(U² + V²) ∼ N(0, 1).
This is because U² + V² is Uniform[0, 1], thus making −log(U² + V²) ∼ Exponential(1).
Furthermore, the cosine and sine of the angle of the ordered pair (U, V) can be computed
as U/√(U² + V²) and V/√(U² + V²). Thus, we can compare to the original Box-Muller
transform
x = √(2v) cos(2πu) and y = √(2v) sin(2πu)
for u uniform and v exponential in this case.
The same trick can be applied to generate Cauchy random variates. Recall from
Example 1.1.2 that for U ∼ Uniform [0, 1], then X = tan[pi(U − 1/2)] is Cauchy. If
we preferred to not evaluate the tangent, we can instead sample from the upper-half
of the unit circle. That is, generate U ∼ Uniform [−1, 1] and V ∼ Uniform [0, 1]
until U2 + V 2 ≤ 1. Then, the tangent of the angle of the ordered pair (U, V ) is
simply
U/V ∼ Cauchy.
See sections 7.3.4 and 7.3.7 of Press et al. (2007) for detailed code for these methods.
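For reference, here is a hedged R sketch of both trigonometry-free samplers; it is a simple translation of the description above, not the optimized code in Press et al. (2007).

    # Sketch: polar method for one N(0,1) variate and one Cauchy variate.
    r_normal_polar <- function() {
      repeat {
        u <- runif(1, -1, 1); v <- runif(1, -1, 1)
        s <- u^2 + v^2
        if (s <= 1 && s > 0) break              # keep points inside the unit circle
      }
      sqrt(-2 * log(s)) * v / sqrt(s)           # sqrt(2 * Exp(1)) times the sine of the angle
    }
    r_cauchy_polar <- function() {
      repeat {
        u <- runif(1, -1, 1); v <- runif(1, 0, 1)
        if (u^2 + v^2 <= 1) break               # upper half of the unit circle
      }
      u / v
    }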
1.4 Ratio of Uniforms
In this section, we generalize the previously discussed idea of generating a uniform
element of the unit circle to simulate from more complex univariate distributions.
This Ratio-of-Uniforms method is attributed to Kinderman & Monahan (Kinderman
and Monahan, 1977). The idea is to define a region in R2 (e.g. the unit circle), then
generate a uniformly random point (U, V ) from that region, and then return U/V ,
which will have the desired distribution. Once again, we have a theorem to make
this more precise.
Theorem 1.4.1 (Ratio of Uniforms). Let f(z) be a pdf on R and r(z) a non-negative
function such that
f(z) = r(z) / ∫_{−∞}^{∞} r(t) dt.
Let D0 ⊂ R² be defined as
D0 = {(x, y) : 0 ≤ x ≤ √(r(y/x))},
and let (X, Y) be a uniform random point in D1, some bounded region such that
D0 ⊂ D1, and let W = X². Then, conditional on W ≤ r(Y/X), Z = Y/X has pdf f(z).
Proof. Let |D0| denote the area5 of the region D0. Then, conditioning on (X, Y)
being in D0, the joint pdf of (X, Y) is simply f(x, y) = |D0|^{−1}. Since W = X²
and Z = Y/X, we have that X = √W and Y = Z√W. Applying the change of
variables (X, Y) → (W, Z) gives
f_{W,Z}(w, z) = J f_{X,Y}(√w, z√w)
where J is the Jacobian determinant
J = det [ ∂x/∂w  ∂x/∂z ; ∂y/∂w  ∂y/∂z ] = det [ 1/(2√w)  0 ; z/(2√w)  √w ] = 1/2.
Therefore, f_{W,Z}(w, z) = [2|D0|]^{−1} 1[0 ≤ w ≤ r(z)]. Therefore, the pdf for Z is
f_Z(z) = ∫_0^{r(z)} f_{W,Z}(w, z) dw = r(z) / (2|D0|).
Since f_Z(z) must integrate to 1, we have that 2|D0| = ∫_{−∞}^{∞} r(z) dz.
The region D0 may be hard to visualize and even harder to sample from. Hence,
we can bound D0 with a rectangle D1 by noting the following. On the x-axis,
the region D0 is constrained to the interval [0, x∨] where x∨ = sup_z √r(z). On
the y-axis, we have that y = zx with x ≤ √r(z). Hence, the region D0 is constrained
to the interval [y∧, y∨] = [inf_z z√r(z), sup_z z√r(z)]. Thus, we can use this rectangle for
implementing the ratio of uniforms method as detailed in Algorithm 5.
Note that since we are sampling from a rectangle D1 rather than the actual
set of interest D0, there is an acceptance-rejection step in Algorithm 5. Thus, the
probability of accepting a randomly drawn point (X, Y) is |D0|/|D1|. From the
above proof, this probability can be written down explicitly:
P((X, Y) ∈ D0) = |D0|/|D1| = (1 / [2 x∨ (y∨ − y∧)]) ∫_{−∞}^{∞} r(z) dz.
Remark 1.4.1 (Shift and Scale). Note that we can always take Y/X and shift and
scale it to get a0 +a1(Y/X) if desired. The above theorem, proof, and algorithm can
be modified accordingly.
As noted, we already saw specific examples of the ratio of uniforms method. In
the case of generating Cauchy random variables, we found that for X ∼ Uniform[0, 1]
and Y ∼ Uniform[−1, 1], Z = Y/X is Cauchy conditioned on X² + Y² ≤ 1.
If we set r(z) = (1 + z²)^{−1}, then
x∨ = sup_{z∈R} 1/√(1 + z²) = 1,
y∨ = sup_{z∈R} z/√(1 + z²) = sup_{z∈R} (z/|z|) · 1/√(1 + z^{−2}) = 1,
5 i.e. Lebesgue measure
Algorithm 5 The Ratio of Uniforms Algorithm
Input:
function r(z) = cf(z) for some c > 0
Algorithm:
Compute x∨ = supz

r(z).
Compute y∨ = supz z

rz.
Compute y∧ = infz z

rz.
Repeat
Generate x from pdf Uniform [0, x∨]
Generate y from pdf Uniform [y∧, y∨]
if x2 ≤ r(y/x)
stop
return z = y/x
Output:
z from pdf f(z).
and y∧ = −1 from the same formula. Furthermore, the half circle is recovered by
noting that
0 ≤ x ≤ 1/√(1 + y²/x²)  ⇒  0 ≤ √(x² + y²) ≤ 1.
And thus we can rederive the formula for a Cauchy random variable via the ratio
of uniforms method. In this case, the probability of acceptance is π/4.
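Algorithm 5 translates into a few lines of R. In this sketch the rectangle bounds are passed in directly rather than computed by optimization, and the Cauchy case is used as a check.

    # Sketch: ratio-of-uniforms sampler given r(z) and a bounding rectangle.
    r_ratio_of_uniforms <- function(n, r, x_max, y_min, y_max) {
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          x <- runif(1, 0, x_max)
          y <- runif(1, y_min, y_max)
          if (x^2 <= r(y / x)) break            # accept when (x, y) lands in D0
        }
        out[i] <- y / x
      }
      out
    }
    # Cauchy example: r(z) = 1 / (1 + z^2) with rectangle [0, 1] x [-1, 1].
    set.seed(5)
    z <- r_ratio_of_uniforms(1e4, function(z) 1 / (1 + z^2), 1, -1, 1)
    median(z)   # close to 0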
1.4.1 Gamma Random Variables
In this section, we examine the method of Cheng and Feast (1980) for generating
Gamma random variates making use of the Ratio of Uniforms algorithm. Note that
the current R source code uses methods attributed to Ahrens & Dieter in rgamma()
and other methods also exist.
We say that a random variable Z ∼ Gamma(α, β) for α, β > 0 if Z ≥ 0 and has
pdf
f(z) = (β^α / Γ(α)) z^{α−1} e^{−βz}.
But if Z ∼ Gamma(α, 1), then Z/β ∼ Gamma(α, β). Thus, we can restrict our
investigation to generating Gamma random variables with β = 1.
Gamma distributions “add” in the sense that if Z1 ∼ Gamma (α1, β) and Z2 ∼
Gamma (α2, β) and Z1 ⊥⊥ Z2, then Z1 + Z2 ∼ Gamma (α1 + α2, β). Furthermore,
Gamma (1, 1) is the standard exponential distribution. Hence, one could generate a
Gamma (n, 1) random variable by summing n independent Exponential (1) random
variables.6 However, this would be highly computationally inefficient.
Instead, we can apply the Ratio of Uniforms method by choosing r(z) = z^{α−1} e^{−z},
where we impose that α > 1.7 Thus, the region D0 is
D0 = {(x, y) ∈ R² : 0 ≤ x ≤ √((y/x)^{α−1} e^{−y/x})},
and the bounding rectangle is
x ∈ [0, ((α − 1)/e)^{(α−1)/2}],   y ∈ [0, ((α + 1)/e)^{(α+1)/2}],
and the probability of an acceptance is
|D0| / |D1| = Γ(α) e^α / [2 (α − 1)^{(α−1)/2} (α + 1)^{(α+1)/2}].
For large α, this is approximately O(α−1/2) as can be derived via Stirling’s approx-
imation to the Gamma function. This implies that the ratio of uniforms will have
vanishingly small acceptance probabilities as α grows. The proposed solution to
this problem is to not sample from the rectangle D1 but to instead construct the
parallelogram defined by
P = {(x, y) ∈ R² : α^{−1/2} ≤ y − x ≤ √(2/(αe)), 0 ≤ y ≤ 1}
and sample from P ∩ [0, 1] × [0, 1]. More details on this can be found in Fishman
(2013) and Cheng and Feast (1980).
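For α > 1, the rectangle version (without the parallelogram refinement) can be sketched directly; this is only a sanity-check implementation, not the algorithm behind rgamma.

    # Sketch: Gamma(alpha, 1) variates for alpha > 1 via ratio of uniforms on a rectangle.
    r_gamma_rou <- function(n, alpha) {
      stopifnot(alpha > 1)
      r <- function(z) ifelse(z > 0, z^(alpha - 1) * exp(-z), 0)
      x_max <- ((alpha - 1) / exp(1))^((alpha - 1) / 2)
      y_max <- ((alpha + 1) / exp(1))^((alpha + 1) / 2)
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          x <- runif(1, 0, x_max); y <- runif(1, 0, y_max)
          if (x^2 <= r(y / x)) break
        }
        out[i] <- y / x
      }
      out
    }
    set.seed(6)
    mean(r_gamma_rou(1e4, alpha = 3))   # close to E[Z] = 3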
1.5 Points in Geometric Objects
The following two subsections consider random generation of points within a geomet-
ric object. The first object of interest is the simplex, which has direct application
to topics like Bayesian priors on probability vectors. The second is spheres and
ellipsoids, which arise in areas like directional data among others.
1.5.1 Simplexes and the Dirichlet Distribution
A simplex in Rn is a convex polytope consisting of n + 1 vertices. Effectively, it is a
generalized “triangle” or “tetrahedron” in n-dimensions. Fixing one of the vertices
at the origin allows us to define a standard simplex as
S_n(r) = { x = (x1, . . . , xn)^T : Σ_{i=1}^n x_i ≤ r, x_i ≥ 0 for all i }
6When α = n is an integer, the Gamma distribution is sometimes called the Erlang distribution
named after Agner Krarup Erlang.
7The case 0 < α < 1 is handled by other algorithms.
for some constant r > 0. The most common setting is r = 1. Then, choosing
x0 = 1 − Σ_{i=1}^n x_i gives the (n + 1)-long vector (x0, x1, . . . , xn), which is a probability
vector.8
To generate a uniform random point inside the simplex Sn(1), we first generate
n Uniform [0, 1] random variables U1, . . . , Un and sort them so that U1 ≤ U2 ≤
. . . ≤ Un. Note that there are efficient algorithms for generating an ordered set of
independent random variables; see order statistics. Then, we simply set X1 = U1
and Xi = Ui − Ui−1 for i = 2, . . . , n. The remainder is X0 = 1− Un.
In Bayesian analysis and other topics in statistics and machine learning, we
often want to generate a probability vector p = (p1, . . . , pn) such that pi ≥ 0 for all
i and
Σ_{i=1}^n p_i = 1. Furthermore, we may want a distribution on p other than the
uniform. That is, we say that p has a Dirichlet distribution with parameter vector
α = (α1, . . . , αn) with αi > 0 for all i if it has pdf
f(p) = [Γ(α1 + . . . + αn) / (Γ(α1) · · · Γ(αn))] Π_{i=1}^n p_i^{αi−1}.
If all of the αi = 1, then we have a uniform distribution on the simplex. If the
αi > 1, then the mass of the distribution is pushed towards the centre of the simplex.
If the αi < 1, then the mass is pushed towards the vertices of the simplex.
An easy to code method for generating random variates from a Dirichlet distri-
bution is to generate n independent Gamma random variables
Xi ∼ Gamma (αi, 1) for i = 1, . . . , n
such that Xi ⊥⊥ Xj for all i ≠ j. Then, we set
p_i = X_i / Σ_{j=1}^n X_j.
This approach can be shown to yield a Dirichlet random variate, by first starting
with the joint pdf of X1, . . . , Xn, which is
f_X(x1, . . . , xn) = Π_{i=1}^n [x_i^{αi−1} e^{−x_i} / Γ(αi)] = exp(−Σ_{i=1}^n x_i) Π_{i=1}^n [x_i^{αi−1} / Γ(αi)].
Then, we denote the sum Σ_{i=1}^n x_i = T, and we change variables from x_i to p_i = x_i/T
for i = 1, . . . , n − 1 and take the nth new variable to be T itself. Hence, x_i = p_i T for
i = 1, . . . , n − 1, and x_n = T(1 − Σ_{i=1}^{n−1} p_i). The Jacobian matrix with (i, j)th entry ∂x_i/∂p_j is
J = [ T  0  · · ·  p1 ;  0  T  · · ·  p2 ;  ⋮  ⋮  ⋱  ⋮ ;  −T  −T  · · ·  1 − Σ_{i=1}^{n−1} p_i ]
8i.e. the entries are non-negative and sum to one.
where we can find that det(J) = T^{n−1}. Thus, we have that
f_p(p1, . . . , p_{n−1}, T) = e^{−T} [T(1 − Σ_{i=1}^{n−1} p_i)]^{α_n−1} Π_{i=1}^{n−1} (T p_i)^{α_i−1} T^{n−1} / Π_{i=1}^n Γ(α_i)
= [(1 − Σ_{i=1}^{n−1} p_i)^{α_n−1} Π_{i=1}^{n−1} p_i^{α_i−1} / Π_{i=1}^n Γ(α_i)] · T^{Σα_i − 1} e^{−T}.
And integrating out T from 0 to ∞ gives the desired formula as
Γ(Σα_i) = ∫_0^∞ T^{Σα_i − 1} e^{−T} dT.
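The gamma-normalization recipe is only a few lines in base R; the function name below is an assumption for illustration.

    # Sketch: Dirichlet(alpha) variates by normalizing independent Gamma(alpha_i, 1) draws.
    r_dirichlet <- function(n, alpha) {
      k <- length(alpha)
      x <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n)  # column j has shape alpha[j]
      x / rowSums(x)                                                      # normalize each row
    }
    set.seed(7)
    p <- r_dirichlet(1e4, alpha = c(2, 3, 5))
    colMeans(p)   # close to alpha / sum(alpha) = (0.2, 0.3, 0.5)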
1.5.2 Spheres
The first problem in this section is to generate a uniformly random point on the
surface of an n-dimensional hypersphere (or just the usual 3D sphere, if you prefer),
which is S^{n−1}(r) = {z ∈ R^n : Σ_i z_i² = r²}. To achieve this, we can simulate a
multivariate standard normal random vector and normalize it. That is, if
multivariate standard normal random vector and normalize it. That is, if
X = (X1, . . . , Xn) ∼ N (0, In) ,
then the vector Z with entries
Z_1 = X_1 / √(Σ_{i=1}^n X_i²), . . . , Z_n = X_n / √(Σ_{i=1}^n X_i²)
will be uniformly distributed on the surface of an n-dimensional sphere of radius 1.
To change the radius, multiply each Zi by some r > 0.
The second problem is to generate points uniformly within an n-dimensional ball, which
is B_n(r) = {z ∈ R^n : Σ_i z_i² ≤ r²}. This can be achieved by sampling from the
surface as above and the beta distribution. Indeed, let Z be uniformly random on
the surface of the sphere and let
W ∼ Beta (n, 1) ,
be independent of Z. Then we claim that WZ is uniformly distributed within the
sphere of radius 1 where Z is the position on the sphere and W acts as a random
radius. To see this, note that
f_Z(z | r) = Γ(n/2) / (2 π^{n/2} r^{n−1}) for a sphere with radius r,
f_W(w) = n w^{n−1} for w ∈ [0, 1], and so
f_Z(z | r = w) f_W(w) = n Γ(n/2) / (2 π^{n/2}) = Γ(n/2 + 1) / π^{n/2},
which is constant in both the direction and the radius, i.e. uniform on the unit ball.
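Both constructions are short in R; this sketch generates points on the surface of and inside the unit ball in R^n.

    # Sketch: uniform points on the surface of the unit sphere and inside the unit ball in R^n.
    r_sphere_surface <- function(m, n) {
      x <- matrix(rnorm(m * n), nrow = m)
      x / sqrt(rowSums(x^2))                  # normalize each row to unit length
    }
    r_ball <- function(m, n) {
      z <- r_sphere_surface(m, n)
      w <- rbeta(m, n, 1)                     # random radius W ~ Beta(n, 1)
      z * w
    }
    set.seed(8)
    summary(rowSums(r_ball(1e4, 3)^2))        # squared radii spread over (0, 1)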
Chapter 2
Volume Computation and
Integration
Introduction
In Chapter 2 of these lecture notes, we are concerned with using randomly generated
variates to actually compute things. This will take on different forms. One will
be to compute the volume of some region in space. Another will be to numerically
integrate some function. Much as before, we will have to be clever to sample random
variates in a “smart” way in order to produce efficient algorithms.
One of the new considerations in this chapter is the accuracy of computation.
That is, we will be concerned with topics like error estimation, confidence intervals
for estimates, and dependency on sample size; in particular, how large of a sample do we
need to get an estimate with a given accuracy.
2.1 Volume Computation
Consider the following problem: Let D be a bounded region in R^d. How can
we compute the d-dimensional volume of the region D? While the region D may be
complex, we assume there exists a function
φ(x) = 1 if x ∈ D, and φ(x) = 0 if x ∉ D,
that is easy to compute. Then, we want to compute the integral
vol(D) = ∫_D dx = ∫_{R^d} φ(x) dx.
Hence, we will try to sample points in Rd to estimate the integral of φ(x).
Remark 2.1.1. Without loss of generality, we will assume that D lives within the
d-dimensional unit cube
C = [0, 1]⊗d.
We can always scale and translate a region D to make this so.
2.1.1 Sampling from a regular grid
Eschewing randomness for the moment, we could merely sample from a regular grid
within the unit cube C. That is, for some k ∈ N, we can generate a grid of n = kd
points of the form x = (x1, . . . , xd) where each
x_i ∈ { (0 + 0.5)/k, (1 + 0.5)/k, . . . , ((k − 1) + 0.5)/k }.
Then, we can simply count how many of these points x are in D, i.e. φ(x) = 1. This
is performed in Algorithm 6. The obvious problem here is that kd will get very big
very fast as the dimension increases.
In Section 2.1 of Fishman (2013), it is claimed that the error of this estimate
can be upper bounded as follows:
| Vol(D) − S/k^d | ≤ Surface(D)/k
where Surface(D) is the “surface area” of the boundary of D. If we are in 2 dimensions,
this is the length of the boundary. If we are in 3 dimensions, this is the
usual idea of surface area. In d-dimensional space, this is a (d − 1)-dimensional
boundary measure. Since the number of points considered is n = k^d, this implies that
the error bound shrinks like O(n^{−1/d}). Another way to think about this is that if
we want an error less than some ε > 0, then we require on the order of n ≳ (1/ε)^d
points, where the notation ≳ means n ≥ C(1/ε)^d for some unspecified constant C > 0.
Example 2.1.2 (Fraction of a Sphere). As an example, consider the region
D = {x ∈ R^d : x_i ≥ 0 for all i and x_1² + . . . + x_d² ≤ 1}.
This region is 1/2^d of the d-dimensional sphere. By looking up formulae for volume
and surface area, we find that
Vol(D) = (π/4)^{d/2} / Γ(d/2 + 1) and Surf(D) = 2 (π/4)^{d/2} / Γ(d/2).
Therefore, we may want to compute this volume using Algorithm 6 to an accuracy
of ε. However, Vol(D) → 0 as d → ∞. Hence, we would have to adapt ε to the
dimension of the problem. Therefore, we will consider the normalized error bound:
| Vol(D) − k^{−d} S | / Vol(D) ≤ (1/k) · Surf(D)/Vol(D).
Algorithm 6 Volume via regular grid
Input:
function φ(x) that is 1 on D and 0 otherwise.
k ∈ N, the number of points per dimension to sample.
Algorithm:
Set S = 0.
Construct a grid X of kd points x.
for each x ∈ X
if φ(x) = 1
S ← S + 1
return S/kd
Output:
Volume estimate of region D.
Then, the required sample size becomes
ε = n^{−1/d} Surf(D)/Vol(D)
n = ε^{−d} (Surf(D)/Vol(D))^d = (2/ε)^d (Γ(d/2 + 1)/Γ(d/2))^d = (d/ε)^d.
Hence, if we want the relative error to be less than, say, 0.001, then the sample sizes are
approximately
10^{6.6}, 10^{10.4}, 10^{14.4}, 10^{18.5}
for d = 2, 3, 4, 5.
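For the d = 2 case (the quarter disc), Algorithm 6 can be written with a small grid; the helper below is an illustrative sketch.

    # Sketch: grid-based volume estimate of the positive quadrant of the unit disc (d = 2).
    grid_volume <- function(k) {
      g <- (0:(k - 1) + 0.5) / k                # grid midpoints in each coordinate
      pts <- expand.grid(x1 = g, x2 = g)        # k^2 grid points
      mean(pts$x1^2 + pts$x2^2 <= 1)            # proportion of grid points inside D
    }
    grid_volume(100)   # close to Vol(D) = pi/4 = 0.7853...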
2.1.2 Uniform Monte Carlo
Instead of using a regular grid, we can do what we did in Chapter 1 of these
notes. That is, we can sample uniformly at random from the unit cube C and
use acceptance-rejection to compute the volume of D.
Indeed, we can simulate X1, . . . , Xn ∈ C with iid Uniform[0, 1] entries, so that each
Xi is uniform on C. Then,
vol(D) = ∫_{R^d} φ(x) dx = E[φ(X_1)] ≈ (1/n) Σ_{i=1}^n φ(X_i) = |{i : X_i ∈ D}| / n.
This algorithm is detailed in Algorithm 7. Going one step farther, we can compute
the variance of this estimate as well. We can consider each b_i = φ(X_i) to be a
Bernoulli random variable with probability |D| of getting a 1 and 1 − |D| of getting
a zero, writing |D| = Vol(D). Hence, for each i,
Var(b_i) = |D|(1 − |D|),
and for our estimate, we have
σ² = Var((1/n) Σ_{i=1}^n b_i) = |D|(1 − |D|)/n.
Thus, the variance of the estimate decreases at the usual n^{−1} rate. We can also
compute the unbiased estimate of this variance, which is
σ̂² = (S/n)(1 − S/n) / (n − 1).
The “usual” normal-style confidence interval is
| |D| − S/n | ≤ t_{1−α/2, n−1} σ̂
where t_{1−α/2, n−1} is the value such that P(T > t_{1−α/2, n−1}) = α/2 for T ∼ t(n − 1).
This hints that our estimation error decreases at a rate of O(n^{−1/2}) for
sample size n.
The problem with this is that in the normal distribution case, the estimate for
the mean µˆ and the estimate for the variance σˆ2 are independent of each other.
However, for these Bernoulli random variables, this is not the case. Hence, S/n and
σˆ2 may be correlated, which can invalidate the confidence interval. We will consider
two more powerful methods for confidence interval construction and sample size
estimation.
Chebyshev’s Inequality
One approach to confidence interval construction is to use Chebyshev’s inequality.
This is a very general and very useful upper bound on the tail probability of a
random variable only assuming a finite variance. Hence, it can apply to a wide
range of distributions.
Theorem 2.1.1 (Chebyshev’s Inequality). Let Z ∈ R be a mean zero random
variable with EZ² < ∞. Then,
P(|Z| > ε) ≤ EZ²/ε².
Algorithm 7 Volume via uniform Monte Carlo
Input:
function φ(x) that is 1 on D and 0 otherwise.
n ∈ N, the number random points to simulate.
Algorithm:
Set S = 0.
for i = 1, . . . , n
Generate x with iid entries xj ∼ Uniform [0, 1] for j = 1, . . . , d.
if φ(x) = 1
S ← S + 1
return estimate S/n
return variance [(S/n)(1− S/n)]/(n− 1)
return confidence interval (see below)
Output:
Volume estimate of region D along with variance and confidence interval.
As computed above, we have Z = S/n − |D|. Then, EZ² is the variance of the
estimate, namely |D|(1 − |D|)/n. Applying Chebyshev’s inequality results in
P(|S/n − |D|| > ε) ≤ |D|(1 − |D|)/(nε²) ≤ 1/(4nε²).
This is because |D|(1 − |D|) ≤ 1/4. Therefore, if we want a confidence level of 1 − α,
then we set
α = (4nε²)^{−1} ⇒ ε = (4nα)^{−1/2}
and conclude that the confidence interval
|S/n − |D|| ≤ (4nα)^{−1/2}
has probability at least 1 − α.
Using this approach, we learn a few things. First, we see again that the error
tolerance is ε = O(n^{−1/2}) for a fixed confidence level 1 − α. We also see that the
required sample size is n = (4αε²)^{−1}. Hence, n is inversely proportional to α; e.g. cutting α in
sample size n = (2αε2)−1. Hence, n is inversely proportional to α; e.g. cutting α in
half requires a doubling of the sample size. Lastly, even though we have the same
O(n−1/2) conclusion as occurs for the normal distribution, usage of Chebyshev’s
inequality does not invoke the Central Limit Theorem and hence applies for any n
as opposed to as n→∞.
Hoeffding’s Inequality
Hoeffding’s Inequality is a slightly more sophisticated result than Chebyshev’s in-
equality. It applies to sums of bounded random variables1 such as sums of Bernoulli
1and more generally to sums of sub-Gaussian random variables
random variables, which is how we state the following version of Hoeffding’s inequal-
ity.
Theorem 2.1.2 (Hoeffding’s Inequality). Let B1, . . . , Bn ~ iid Bernoulli(p) for some
p ∈ (0, 1) and B̄ = n^{−1} Σ_{i=1}^n B_i. Then,
P(|B̄ − p| > t) ≤ 2 exp(−2nt²).
Applying Hoeffding’s inequality to the above problem results in
P (|S/n− |D|| > ε) ≤ 2 exp(−2nε2).
Therefore, a 1 − α confidence interval can be computed by setting
α = 2e^{−2nε²} ⇒ ε = √(−log(α/2) / (2n)),
which results in the confidence interval
|S/n − |D|| ≤ √(−log(α/2) / (2n))
having probability at least 1 − α.
Once again, we see that the width of the confidence interval is proportional to
n−1/2 much like with Chebyshev’s inequality. However, now we have that the sample
size depends on ε and α as n = −log(α/2)/(2ε²). Hence, n grows like −log(α) as
α→ 0.
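As a small illustration, the sketch below estimates the area of the quarter disc by uniform Monte Carlo and reports the 95% Hoeffding-style interval; the specific numbers are assumptions chosen only for this example.

    # Sketch: Monte Carlo volume of the quarter disc with a Hoeffding-type confidence interval.
    set.seed(9)
    n <- 1e5
    x <- matrix(runif(2 * n), ncol = 2)
    hits <- rowSums(x^2) <= 1
    est <- mean(hits)                              # S / n
    eps <- sqrt(-log(0.05 / 2) / (2 * n))          # half-width at alpha = 0.05
    c(estimate = est, lower = est - eps, upper = est + eps)   # should bracket pi/4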
2.2 Integration
Above, we noticed that computing the volume of some region D ⊂ C can be written
as an integral of some function φ(x) that is 1 for x ∈ D and 0 for x ∉ D. That is,
Vol(D) = ∫_C φ(x) dx ≈ (1/n) Σ_{i=1}^n φ(x_i)
where the x1, . . . , xn are uniformly random points in the unit cube C.
Now, we will do the same thing, but for an integrable function g : R → R. That
is, assume that the domain of g is [0, 1]. Then, we can pick x1, . . . , xn ∈ [0, 1] and
estimate
∫_0^1 g(x) dx ≈ (1/n) Σ_{i=1}^n g(x_i).
First, we have to determine how to pick the points xi.
2.2.1 Deterministic Approaches
There are many approaches to numerical integration in existence. We will briefly
discuss non-Monte Carlo methods before resuming the Monte Carlo discussion. Methods
referred to as “quadrature” typically involve picking points x_i and weights ω_i and
computing
∫_0^1 g(x) dx ≈ Σ_{i=1}^n ω_i g(x_i).
Let 0 = x_0 < x_1 < . . . < x_n = 1. The rectangular or midpoint rule yields
∫_0^1 g(x) dx ≈ Σ_{i=1}^n (x_i − x_{i−1}) g((x_i + x_{i−1})/2),
which is just the width times the height of n rectangles. The trapezoidal rule yields
∫_0^1 g(x) dx ≈ Σ_{i=1}^n (x_i − x_{i−1}) (g(x_i) + g(x_{i−1}))/2,
which linearly interpolates between (xi−1, g(xi−1)) and (xi, g(xi)). Of course, many
more sophisticated methods exist.
2.2.2 Monte Carlo Integration
In contrast to deterministic approaches, we will focus on Monte Carlo methods for
numerically computing the value of an integral. Considering the same problem as
above, we want to estimate
µ = ∫_0^1 g(x) dx ≈ (1/n) Σ_{i=1}^n g(x_i) = µ̂.
If we take each x_i to be a uniform random variate X_i ~ iid Uniform[0, 1], then we have
for each i = 1, . . . , n that
E g(X_i) = ∫_0^1 g(x) dx
and thus
E[(1/n) Σ_{i=1}^n g(X_i)] = ∫_0^1 g(x) dx,
which means we have an unbiased estimator of the integral.
Furthermore, we can compute the variance and get
Var[(1/n) Σ_{i=1}^n g(X_i)] = (1/n²) Σ_{i=1}^n Var(g(X_i)) = (1/n) Var(g(X)).
For X ∼ Uniform[0, 1],
σ² = Var(g(X)) = ∫_0^1 g(x)² dx − (∫_0^1 g(x) dx)² = ∫_0^1 g(x)² dx − µ².
Therefore, we have the usual conclusion that Var[(1/n) Σ_{i=1}^n g(X_i)] = σ²/n. This can
be estimated by the unbiased estimator
σ̂²_n = (1/(n − 1)) Σ_{i=1}^n (g(x_i) − µ̂_n)² = (1/(n − 1)) Σ_{i=1}^n [g(x_i)² − µ̂²_n].
Confidence Intervals
If we have that ∫_0^1 g(x)⁴ dx < ∞,2 then the central limit theorem tells us that
(µ̂_n − µ) / √(σ̂²_n / n) →d N(0, 1).
Therefore, we can use z-scores to construct an asymptotic (1 − α)-confidence interval
for µ, namely
|µ̂_n − µ| ≤ z_{1−α/2} √(σ̂²_n / n).
Note that in this formulation, µ̂_n and σ̂²_n are computed from the same data. If
one has the CPU cycles to spare, it is preferable to generate independent sets of
data to estimate µ̂_n and σ̂²_n independently.
Streaming Computation
For a highly accurate estimate of µ, we may require a very large n. In the above
formulation, computing the variance σˆ2n requires computation of µˆn first. Thus, we
would have to save n evaluations g(xi) in memory, which could become cumbersome.
Instead, we can reformulate the above computations in a streaming fashion that does
not require saving all of the evaluations to memory.
A first attempt would be to write
σ̂²_n = (1/(n − 1)) Σ_{i=1}^n g(x_i)² − (1/(n(n − 1))) (Σ_{i=1}^n g(x_i))².
Hence, we could save only Σ_{i=1}^n g(x_i) and Σ_{i=1}^n g(x_i)². However, the use of
subtraction could result in loss of numerical precision.
2 i.e. finite fourth moment.
Algorithm 8 Integral via uniform Monte Carlo
Input:
function g(x) on the domain [0, 1].
n ∈ N, the number random points to simulate.
Algorithm:
Set S = 0. Integral Estimate
Set V = 0. Variance Estimate
for i = 1, . . . , n
Generate x ∼ Uniform [0, 1].
Compute g(x)
V ← ((i − 2)/(i − 1)) V + (1/i) (g(x) − S)²   (for i ≥ 2)
S ← (1/i) g(x) + ((i − 1)/i) S
return estimate S
return variance V
return confidence interval S ± z_{1−α/2} √(V/n)
Output:
Integral estimate of g along with its variance estimate and confidence interval.
A more sophisticated approach is to write
(n − 1) σ̂²_n − (n − 2) σ̂²_{n−1} = Σ_{i=1}^n (g(x_i) − µ̂_n)² − Σ_{i=1}^{n−1} (g(x_i) − µ̂_{n−1})²
= g(x_n)² − n µ̂²_n + (n − 1) µ̂²_{n−1}
= g(x_n)² − (1/n) (Σ_{i=1}^n g(x_i))² + (n − 1) µ̂²_{n−1}
= g(x_n)² − (1/n) (g(x_n) + (n − 1) µ̂_{n−1})² + (n − 1) µ̂²_{n−1}
= ((n − 1)/n) g(x_n)² − 2 ((n − 1)/n) g(x_n) µ̂_{n−1} − [((n − 1)² − n(n − 1))/n] µ̂²_{n−1}
= ((n − 1)/n) (g(x_n) − µ̂_{n−1})²,
and hence
σ̂²_n = ((n − 2)/(n − 1)) σ̂²_{n−1} + (1/n) [g(x_n) − µ̂_{n−1}]².
Therefore, we can update
µ̂_n = g(x_n)/n + ((n − 1)/n) µ̂_{n−1}
and use the recursion above to compute σ̂²_n at each step of the algorithm.
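The streaming recursions translate directly into R; the sketch below performs one-pass Monte Carlo integration of a test function and could be compared against the batch estimates.

    # Sketch: streaming (one-pass) Monte Carlo integration with running mean and variance.
    mc_integrate_stream <- function(g, n) {
      mu <- 0; v <- 0
      for (i in 1:n) {
        gi <- g(runif(1))
        if (i >= 2) v <- ((i - 2) / (i - 1)) * v + (gi - mu)^2 / i   # uses mu_{i-1}
        mu <- gi / i + ((i - 1) / i) * mu                            # update the running mean
      }
      list(estimate = mu, variance_of_estimate = v / n)
    }
    set.seed(10)
    mc_integrate_stream(function(t) t^2 + t^3, 1e5)   # integral is 1/3 + 1/4 = 0.5833...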
2.3 Importance Sampling
In the previous sections, we only considered sampling uniformly from [0, 1]. However,
it is often worth considering non-uniform distributions in order to sample more
efficiently. That is, starting from the above integration problem, say we want to
compute
µ = ∫_0^1 g(t) dt = E[g(X)]
for X ∼ Uniform [0, 1]. The estimator we had before will be denoted
µ̂_n = (1/n) Σ_{i=1}^n g(x_i)
for x1, . . . , xn drawn from the Uniform [0, 1] distribution. Then, given some density
function f(t) which has support on [0, 1] and f(t) > 0 for all t ∈ [0, 1], we can write
µ = ∫_0^1 g(t) dt = ∫_0^1 [g(t)/f(t)] f(t) dt = E[g(Z)/f(Z)]
for Z ∼ f(z). Numerically speaking, we can simulate z1, . . . , zn from f(z) and then
estimate the integral to be
µ̃_n = (1/n) Σ_{i=1}^n g(z_i)/f(z_i).
Then, the question to consider is, “What should f(z) be in order to get the most
efficient estimate?”
Based on the above equations, we can derive the variance for our new estimator
to be
Var(g(Z)/f(Z)) = E[(g(Z)/f(Z) − µ)²] = E[g(Z)²/f(Z)²] − µ² = ∫_0^1 [g(t)²/f(t)] dt − µ².
In contrast, the variance of g(X) for X uniform is ∫_0^1 g(t)² dt − µ². Thus, we can
consider the difference
Var(g(X)) − Var(g(Z)/f(Z)) = ∫_0^1 g(t)² (1 − 1/f(t)) dt.
Thus, we want this integral to be as big as possible to indicate a large drop in the
variance. Solving for the perfect f is typically not doable. However, the above
equation gives us valuable intuition. In particular, where g(t)² is big we want
f(t) > 1 and where g(t)² is small we want f(t) < 1. Thus we sample each region more
or less frequently than under the uniform distribution, depending on f.
Example 2.3.1. If we want to numerically estimate the integral of g(t) = t² + t³,
we can try simulating from the uniform distribution—i.e. Beta(1, 1)—and from the
Beta(2, 1) distribution with pdf f(t) = 2t. Computing the variance difference from
above yields
∫_0^1 (t² + t³)² (1 − 1/(2t)) dt = ∫_0^1 (t⁴ + 2t⁵ + t⁶) dt − (1/2) ∫_0^1 (t³ + 2t⁴ + t⁵) dt
= (1/5 + 1/3 + 1/7) − (1/8 + 1/5 + 1/12) = 15/56 ≈ 0.268.
The variance under the uniform distribution is
∫_0^1 (t² + t³)² dt − [∫_0^1 (t² + t³) dt]² ≈ 0.3359.
Hence, dropping by 0.268 is a big drop in the variance of this estimator.
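This variance reduction is easy to verify empirically; the following sketch compares the plain and importance-sampling estimators for this example.

    # Sketch: plain Monte Carlo versus Beta(2,1) importance sampling for g(t) = t^2 + t^3.
    set.seed(11)
    g <- function(t) t^2 + t^3
    n <- 1e5
    x <- runif(n)                      # plain Monte Carlo draws
    z <- rbeta(n, 2, 1)                # importance sampling draws, f(t) = 2t
    w <- g(z) / (2 * z)                # weighted evaluations g(Z) / f(Z)
    c(mean(g(x)), mean(w))             # both close to 1/3 + 1/4
    c(var(g(x)), var(w))               # roughly 0.336 versus 0.336 - 0.268 = 0.068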
2.4 Stratified Sampling
Another useful tool for variance reduction is to take the region you are trying to
integrate over, break it into k pieces (or strata), and sample from each to estimate
the integral.
To integrate g(x) over the interval [0, 1], we partition the interval into disjoint
sets A1, . . . , Ak such that [0, 1] = ∪_{i=1}^k A_i. Then, we can write
µ = ∫_0^1 g(x) dx = Σ_{i=1}^k ∫_{A_i} g(x) dx = Σ_{i=1}^k ω_i ∫_{A_i} g(x) dx/ω_i = Σ_{i=1}^k ω_i E_{A_i}[g(X)]
where ω_i = |A_i| = ∫_{A_i} dx and E_{A_i} denotes expectation with respect to the uniform
distribution on A_i. The point of weighting with ω_i is so that each
integral becomes an expectation. Note that ω_1 + . . . + ω_k = 1. We can also change
the probability distribution for sampling on each A_i if we want to complicate this
further.
For each A_i, we sample n_i points uniformly from A_i and denote these points as x_{i,j}.
Then, our estimator becomes
µ̂^{ss}_n = Σ_{i=1}^k (ω_i/n_i) Σ_{j=1}^{n_i} g(x_{i,j})
and the variance of the estimator becomes
Var(µ̂^{ss}_n) = Σ_{i=1}^k (ω_i²/n_i) var_{A_i}(g(X))
where var_{A_i}(g(X)) is the variance over each stratum A_i. Therefore, our goal is to
choose the A_i and n_i to reduce the variance.
One way to get a variance reduction is to choose each n_i = nω_i, which simply
says to sample from each A_i in proportion to its size. In that case,
Var(µ̂^{ss}_n) = Σ_{i=1}^k (ω_i/n) var_{A_i}(g(X))
= (1/n) Σ_{i=1}^k ω_i E_{A_i}[(g(X) − E_{A_i}[g(X)])²]
= (1/n) Σ_{i=1}^k ω_i E_{A_i}[((g(X) − E[g(X)]) − (E_{A_i}[g(X)] − E[g(X)]))²]
= (1/n) Σ_{i=1}^k ω_i (E_{A_i}[(g(X) − E[g(X)])²] − (E_{A_i}[g(X)] − E[g(X)])²)
= Var(µ̂_n) − (1/n) Σ_{i=1}^k ω_i (E_{A_i}[g(X)] − E[g(X)])²,
which is less than or equal to Var (µˆn). The main intuition here is that we are
forcing the sampling to be more evenly spread out. However, this approach can still
be outperformed with more clever thought. In particular, the above partitioning
does not take the function g into account at all.
Example 2.4.1. If we simply want to integrate g(x) = x on [0, 1], we generate
x1, . . . , xn uniformly and get3
µ̂_n = (1/n) Σ_{i=1}^n x_i and Var(µ̂_n) = 1/(12n).
If we instead partition into the intervals [0, 0.5] and [0.5, 1] with n/2 points in each, we
get
µ̂^{ss}_n = Σ_{i=1}^2 [(1/2)/(n/2)] Σ_{j=1}^{n/2} x_{i,j} = (1/n) Σ_{i=1}^2 Σ_{j=1}^{n/2} x_{i,j},
which is the same as µˆn except that exactly half of the samples are in each of the
two subintervals. The variance of this estimator is
Var(µ̂^{ss}_n) = Σ_{i=1}^2 [(1/2)²/(n/2)] var_{A_i}(X) = (1/(2n)) Σ_{i=1}^2 (1/2)²/12 = 1/(48n).
Hence, our variance has reduced by a factor of 4.
3Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12.
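The factor-of-four reduction can be checked empirically with a short R sketch such as the following.

    # Sketch: stratified versus plain Monte Carlo for the integral of g(x) = x on [0, 1].
    set.seed(12)
    n <- 1e4
    plain <- mean(runif(n))                                  # mu_hat
    x1 <- runif(n / 2, 0, 0.5); x2 <- runif(n / 2, 0.5, 1)   # n/2 points per stratum
    strat <- 0.5 * mean(x1) + 0.5 * mean(x2)                 # stratified estimator with omega_i = 1/2
    c(plain = plain, stratified = strat)                     # both close to 1/2
    # Repeating this many times shows Var(strat) is about one quarter of Var(plain).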
Chapter 3
Stochastic Optimization
Introduction
3.1 Basic Stochastic Search
3.2 Stochastic Gradient Descent
Appendix A
Appendix
A.1 Change of Variables
A tool that is integral to many of the theorems and proofs in these notes is the
change-of-variables formula. Imagine we are in Rn with coordinate variables (x1, . . . , xn).
But now we wish to consider a new coordinate system (y1, . . . , yn). This allows us
to define the n× n Jacobian matrix
J = [ ∂y_1/∂x_1  · · ·  ∂y_1/∂x_n ;  ⋮  ⋱  ⋮ ;  ∂y_n/∂x_1  · · ·  ∂y_n/∂x_n ].
The determinant of J will be of particular interest and is often also referred to simply
as the Jacobian. Note that we want this determinant to be non-zero. Otherwise,
the mapping from x → y is not invertible.1 The change-of-variables formula is as
follows. Note that more general formulations of this result exist, but we stick with
a simpler one suitable for our purposes.
Proposition A.1.1 (Change of Variables). Let U ⊂ Rn be an open set, and let the
change of variables function ϕ : x→ y be one-to-one, differentiable, have continuous
partial derivatives, and non-zero Jacobian for all x ∈ U . Then, for any real-valued
function f that is continuous with support contained in ϕ(U),
∫_{ϕ(U)} f(y) dy = ∫_U f(ϕ(x)) |det(J)| dx.
Example A.1.2 (Spherical Coordinates). An archetypal application is switching
from Cartesian to polar coordinates. In R3, we can define two coordinate systems
(x, y, z) and (r, θ, φ) such that
x = r sinφ cos θ, y = r sinφ sin θ, z = r cosφ.
1see the Inverse Function Theorem.
In this case, the Jacobian matrix is
J = [ ∂x/∂r  ∂x/∂φ  ∂x/∂θ ;  ∂y/∂r  ∂y/∂φ  ∂y/∂θ ;  ∂z/∂r  ∂z/∂φ  ∂z/∂θ ]
  = [ sinφ cosθ   r cosφ cosθ   −r sinφ sinθ ;
      sinφ sinθ   r cosφ sinθ    r sinφ cosθ ;
      cosφ        −r sinφ        0 ],
and the determinant is det(J) = r² sinφ. Let ϕ : (r, φ, θ) → (x, y, z). If we have a
region U ⊂ R³ and a function f, then
∫∫∫_{ϕ(U)} f(x, y, z) dx dy dz = ∫∫∫_U f(r sinφ cosθ, r sinφ sinθ, r cosφ) r² sinφ dr dφ dθ.
The change-of-variables formula is critical to working with probabilities and
random variables. This is because probabilities are actually measures which are
actually integrals. More precisely, consider real valued random variables X,Y with
joint density function fX,Y (x, y). Then,
P((X, Y) ∈ U) = ∫∫_U f_{X,Y}(x, y) dx dy.
If we wish to transform into new random variables W, Z where (W, Z) = ϕ(X, Y),
then
P((W, Z) ∈ ϕ(U)) = ∫∫_{ϕ(U)} f_{W,Z}(w, z) dw dz = ∫∫_{ϕ(U)} f_{X,Y}(x(w, z), y(w, z)) |∂x/∂w · ∂y/∂z − ∂x/∂z · ∂y/∂w| dw dz.
This implies that the joint density of W and Z is
f_{W,Z}(w, z) = f_{X,Y}(x(w, z), y(w, z)) |∂x/∂w · ∂y/∂z − ∂x/∂z · ∂y/∂w|,
which is used (among other places) in the proof of the Box-Muller transform and
the Ratio of Uniforms method.
Pro Tip : det(AB) = det(A) det(B) and det(A−1) = det(A)−1.
Bibliography
Russell CH Cheng and GM Feast. Gamma variate generators with increased shape
parameter range. Communications of the ACM, 23(7):389–395, 1980.
George Fishman. Monte Carlo: concepts, algorithms, and applications. Springer
Science & Business Media, 2013.
Albert J Kinderman and John F Monahan. Computer generation of random vari-
ables using the ratio of uniform deviates. ACM Transactions on Mathematical
Software (TOMS), 3(3):257–260, 1977.
William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery.
Numerical recipes 3rd edition: The art of scientific computing. Cambridge uni-
versity press, 2007.
James C Spall. Introduction to stochastic search and optimization: estimation,
simulation, and control, volume 65. John Wiley & Sons, 2005.

