MA20224 Probability 2A
Lecturer: Daniel Kious 1
October 9, 2020
Contents

1 Events and Probabilities
1.1 Events and the Event Space
1.2 Probability Measures
2 Random Variables
2.1 Expectation
2.2 Integrable X, L1
3 Conditional probabilities and independence
3.1 Conditional Probabilities
3.2 Independence
4 Random Walks
4.1 Simple Random Walks: Hitting and return probabilities
4.2 Gambler’s Ruin
4.3 Gambler’s Ruin - Expected duration of the game
5 Generating Functions
6 More on random walks
6.1 Distribution of the return time for a simple random walk
6.2 The Reflection Principle
7 Weak and Strong Laws
8 Modes of Convergence
9 Conditional Expectation
9.1 Discrete Random Variables
9.2 Continuous Random Variables
10 Branching Processes
10.1 The Model
10.2 Average Size of the Population
10.3 Extinction probability of the population
11 Poisson Processes
11.1 Poisson Point Processes on [0,∞)
11.2 Decomposing a Poisson Process
11.3 Conditional Uniformity
11.4 Poisson point processes on R^n
12 Generating functions and the central limit theorem
References
A Important distributions

1 These lecture notes are based on Simon C. Harris’s notes for the course last given in 2017/18, transcribed by Charlie Collier and updated by Marcel Ortgiese.
1 Events and Probabilities
In this chapter, we review some of the basic constructions from your first-year probability course, but place more emphasis on some of the formal subtleties. The main goal is to give meaning to the notion of a probability space
(Ω,F ,P).
This is the basic construction that we use to model a random experiment. The ingredients
are:
• the sample space Ω. Each ω ∈ Ω corresponds to a realization of your experiment.
• The σ-algebra F is a collection of subsets of Ω, also known as events, that we would
like to assign probabilities to.
• The probability measure P is a mapping that associates to each event F ∈ F a
probability, i.e. a number between 0 and 1 that tells us how likely the particular
event is.
In the following sections, we will look more closely at each of these ingredients. We will
look at the rules they have to obey and derive some easy consequences.
1.1 Events and the Event Space
The first thing to write down, when constructing a model for a random experiment is the
sample space Ω. Each ω ∈ Ω corresponds to a particular realisation of our experiment.
Examples 1.1.
(i) Experiment C2; toss a coin twice, so ΩC2 = {H,T}^2 = {HH, HT, TH, TT}.
(ii) Experiment C∞; toss a coin infinitely many times, so ΩC∞ = {H,T}^N. Then
ω ∈ ΩC∞ is a sequence ω = (ω1, ω2, . . . ) with ωi ∈ {H,T} and ωi is the outcome
of the i-th toss for all i ∈ N.
(iii) Experiment S1; imagine we want to model the service time of the first customer in
a queue, then we take ΩS1 = R+ = (0,∞).
(iv) Experiment Q; all arrival times of customers at a queue. We can take
ΩQ = {ω = (ω1, ω2, . . . ) : ωi ∈ R+ ∀i, ω1 < ω2 < · · · } ⊆ (R+)^N,
where ωi represents the arrival time of the i-th customer.
We will be interested in events F , that is (certain kind of) subsets of Ω.
Examples 1.2.
(i) Experiment C2: If F is the event of getting at least one head, then F = {HH,HT, TH} ⊂
ΩC2 .
(ii) Experiment S1: Let A be the event the service time takes between α and β units
of time, so A = (α, β) ⊂ ΩS1 .
(iii) Experiment C∞; if Gp is the event that the proportion of heads converges to p, then
Gp = { ω = (ω1, ω2, . . . ) ∈ ΩC∞ : (1/n) ∑_{i=1}^n 1l{H}(ωi) → p as n → ∞ }.
To formulate the last event, we have used the following notation:
Definition 1.3 (Indicator function). For an event E ⊂ Ω, the indicator function 1lE is
defined as
1lE(ω) := 1 if ω ∈ E, and 1lE(ω) := 0 if ω ∉ E.
So in the above example,
∑_{i=1}^n 1l{H}(ωi)
counts how many of the ωi are equal to H in the first n tosses of the coin. To simplify
notation, we will also often write
1lH := 1l{H},
and similarly for any other singleton sets.
Examples 1.4 (Combination of events). We also frequently use that operations on
events translate directly into operations on indicator functions.
Event | Description | Indicator Function
F | F occurs | 1lF
∅ | ‘Impossible event’, nothing occurs | 1l∅ = 0
Ω | ‘Certain event’, something occurs | 1lΩ = 1
F^c | complement: F does not occur | 1l_{F^c} = 1 − 1lF
⋂_{n=1}^∞ Fn | all Fn occur simultaneously | 1l_{⋂Fn} = ∏_{n=1}^∞ 1l_{Fn}
⋃_{n=1}^∞ Fn = (⋂_{n=1}^∞ Fn^c)^c | at least one Fn occurs | 1l_{⋃Fn} = 1 − ∏_{n=1}^∞ (1 − 1l_{Fn})
To see, for example, the result for the intersection of events: note that 1l_{⋂_{n=1}^∞ Fn}(ω) = 1 if and only if ω ∈ ⋂_{n=1}^∞ Fn, iff ω ∈ Fn for all n. In particular, 1l_{Fn}(ω) = 1 for all n, so that this is equivalent to (∏_{n=1}^∞ 1l_{Fn})(ω) = 1.
In the following we will denote by F the collection of all events of interest for a particular
experiment. If Ω is finite (or countably infinite), we usually take
F := P(Ω) = {A : A ⊂ Ω},
the power set of Ω. Note that if |Ω| = N, then |P(Ω)| = 2^N. Exercise: Why?
Example 1.5. In the coin tossing experiment C2, we take
F = P({H,T}^2) = {∅, {HH}, {HT}, {HH,HT}, . . . , Ω}.
Note that here |F| = 2^4 = 16.
However, if Ω is uncountable, we cannot simply assign probabilities to all subsets of Ω. If we tried, this would lead to contradictions (look up ‘non-measurable sets’ or see the unit ‘Measure theory’; also check out the Banach-Tarski paradox for some surprising consequences). The solution is not to assign probabilities to all subsets of Ω, but only to a large enough collection that contains all the events of interest. For consistency, this collection needs to satisfy the following rules:
Definition 1.6 (σ-algebra). F ⊆ P(Ω) is a σ-algebra (or σ−field, or event space) on Ω
if
(i) ∅ ∈ F ,
(ii) if F ∈ F , then F c ∈ F and
(iii) if (Fn : n ∈ N) is a sequence of sets in F (i.e. Fn ∈ F ∀n), then ⋃_{n∈N} Fn ∈ F .
Note that these conditions imply that Ω = ∅^c ∈ F by (i) and (ii). Also, ⋂_n Fn = (⋃_n Fn^c)^c ∈ F by using (ii) and (iii).
Example 1.7. As an example of a very small σ-algebra, consider the following. For any subset E ⊂ Ω, F := {∅, E, E^c, Ω} is a σ-algebra on Ω. Exercise: Check!
Definition 1.8. If A ⊆ P(Ω) is a collection of subsets of Ω, then σ(A) is known as the
σ-algebra generated by A and is the smallest σ-algebra on Ω containing A.
Note that σ(A) is the intersection of all σ-algebras on Ω that contain A. This intersection is taken over a non-empty collection, since P(Ω) is always a σ-algebra that contains A.
Example 1.9. Coming back to the previous example, we can check that if E ⊂ Ω, then σ({E}) = {∅, E, E^c, Ω}.
Exercise: Can you find σ({E,F})? It is already a bit more complicated!
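On a finite Ω such questions can be settled by brute force: close the generating sets under complements and (finite) unions until nothing new appears. The sketch below is purely illustrative, and the sets E and F in it are arbitrary choices:

```python
# Brute-force generation of the sigma-algebra generated by a finite collection
# of subsets of a finite Ω (illustration only; E and F below are arbitrary).
def generate_sigma_algebra(omega, generators):
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for a in current:
            if omega - a not in sigma:      # close under complements
                sigma.add(omega - a)
                changed = True
            for b in current:
                if a | b not in sigma:      # close under (finite) unions
                    sigma.add(a | b)
                    changed = True
    return sigma

E, F = {1, 2}, {2, 3}
sigma = generate_sigma_algebra({1, 2, 3, 4}, [E, F])
print(len(sigma))  # 16: the atoms {1}, {2}, {3}, {4} appear, so we get every subset
```

Here E and F have four distinct non-empty ‘atoms’ (E ∩ F, E \ F, F \ E and (E ∪ F)^c), so σ({E, F}) is the whole power set of this 4-point Ω.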
Examples 1.10 (Important examples).
(i) Let Ω = R. We cannot take F := P(Ω), since that is badly behaved. So instead we let
G = {(−∞, x] : x ∈ R}
and take F = σ(G). This σ-algebra is known as the Borel σ-algebra on R and is denoted by B(R). It contains all subsets of R that are relevant for us and to which we need to assign probabilities. For example, note
(a) (−∞, x) = ⋃_{n=1}^∞ (−∞, x − 1/n], so (−∞, x) ∈ B(R) by property (iii) of a σ-algebra.
(b) We could instead take H := {(x,∞) : x ∈ R} or I := {(−∞, x) : x ∈ R} and get the same Borel σ-algebra, i.e. B(R) = σ(G) = σ(H) = σ(I).
(ii) Infinite coin tossing; for experiment C∞, note that ΩC∞ = {H,T}^N is uncountable. If Fn := {ω ∈ ΩC∞ : ωn = H} is the event of getting a head on the n-th toss, then F = σ(Fn : n ∈ N) contains all of the events we need (including the event Gp from Example 1.2).
1.2 Probability Measures
Definition 1.11 (Kolmogorov’s axioms). Let Ω be a sample space and F a σ-algebra
on Ω. Then a map P : F → [0, 1] is called a probability measure if
(i) P(Ω) = 1;
(ii) σ-additivity: If (Fn : n ∈ N) is a sequence of disjoint events in F , i.e. Fn ∈ F ∀n and Fi ∩ Fj = ∅ ∀i ≠ j, then
P(⋃_{n∈N} Fn) = ∑_{n∈N} P(Fn).
Note a consequence of the axioms is that P(∅) = 0. We interpret P(F ) as the probability
of the event F occurring. Also, we call the triple (Ω,F ,P) a probability space (or a
probability triple).
These axioms lead to the natural question, whether probability spaces even exist. This
question is answered in the following construction of the fundamental model. Let Ω =
[0, 1]. The Borel σ-algebra on [0, 1], denoted by B([0, 1]), can be generated by σ(G)
where G := {[0, x] : x ∈ [0, 1]}.
Theorem 1.12 (Lebesgue, Borel). There exists a unique probability measure P on ([0, 1], B([0, 1])) such that P([0, x]) = x ∀x ∈ [0, 1].
The fundamental model allows us to choose a single number in [0, 1] according to the uniform distribution. It is quite an amazing fact that every experiment, no matter
how complicated, can be constructed out of this single experiment (as in the above theorem). [For a hint at how this might work, see the problem sheets.]
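One standard device behind this fact is the inverse-CDF transformation (an illustration, not part of the theorem): a single uniform draw on [0, 1] is turned into a random variable with a prescribed distribution, here Exp(q), which appears again in Example 2.5.

```python
# Sketch (assumption: the standard inverse-CDF construction, not a quote from
# the notes): build an Exp(q) sample from a single uniform draw on [0, 1].
import math
import random

def exp_from_uniform(u, q):
    """If U is uniform on (0, 1), then -log(1 - U)/q has the Exp(q) law:
    P(-log(1 - U)/q <= t) = P(U <= 1 - exp(-q t)) = 1 - exp(-q t)."""
    return -math.log(1.0 - u) / q

random.seed(0)
q = 2.0
samples = [exp_from_uniform(random.random(), q) for _ in range(100_000)]
# Empirical check of the distribution function at t = 1 against 1 - e^{-q}.
empirical = sum(1 for t in samples if t <= 1.0) / len(samples)
print(abs(empirical - (1 - math.exp(-q))) < 0.01)
```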
In the following we collect some of the properties of a probability measure. Some of these
you have already seen in the first year probability course (and some will be proved on
the problem sheet).
Proposition 1.13 (Properties of P). Let (Ω,F ,P) be a probability triple, then
(i) If A ∈ F , then P(Ac) = 1− P(A);
(ii) If A,B ∈ F and A ⊆ B, then P(A) ≤ P(B);
(iii) If A1, . . . , An ∈ F are disjoint events, so Ai ∩ Aj = ∅ ∀i ≠ j, then
P(⋃_{i=1}^n Ai) = ∑_{i=1}^n P(Ai)
(iv) Boole’s inequality: For any A1, . . . , An ∈ F ,
P(⋃_{i=1}^n Ai) ≤ ∑_{i=1}^n P(Ai)
(v) For any A,B ∈ F , we have P(A ∪B) = P(A) + P(B)− P(A ∩B);
Note that the last statement also generalizes to three sets
P(A∪B ∪C) = P(A) +P(B) +P(C)−P(A∩B)−P(B ∩C)−P(A∩C) +P(A∩B ∩C).
Moreover, going one step further we obtain in full generality:
Proposition 1.14 (Inclusion-exclusion principle). For any F1, . . . , Fn ∈ F ,
P(⋃_{i=1}^n Fi) = ∑_i P(Fi) − ∑_{i<j} P(Fi ∩ Fj) + ∑_{i<j<k} P(Fi ∩ Fj ∩ Fk) − · · · + (−1)^{n−1} P(F1 ∩ · · · ∩ Fn)
= ∑_{r=1}^n (−1)^{r−1} ∑_{i1<···<ir} P(Fi1 ∩ · · · ∩ Fir).
Proof. We will give an elegant proof later on in Chapter 2.
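Until then, the identity can be sanity-checked by direct computation. The sketch below (a toy uniform model on a 12-point Ω; the three events are arbitrary choices) evaluates both sides exactly for n = 3:

```python
# Exact check of inclusion-exclusion for three arbitrary events in a uniform
# model on a finite Ω (illustration only).
from itertools import combinations

omega = set(range(12))
events = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {0, 4, 6, 7, 8}]   # F1, F2, F3

def prob(A):
    """Uniform probability measure on omega."""
    return len(A) / len(omega)

lhs = prob(set().union(*events))                  # P(F1 ∪ F2 ∪ F3)
rhs = 0.0
for r in range(1, len(events) + 1):
    for idx in combinations(range(len(events)), r):
        inter = set(omega)
        for i in idx:
            inter &= events[i]
        rhs += (-1) ** (r - 1) * prob(inter)      # (−1)^{r−1} Σ P(Fi1 ∩ ... ∩ Fir)
print(abs(lhs - rhs) < 1e-12)
```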
Proposition 1.15 (Continuity of P).
(a) Continuity from below: Consider events A1 ⊆ A2 ⊆ · · · and A := ⋃_{n∈N} An; then P(An) ↑ P(A). That is, the sequence (P(An) : n ∈ N) is an increasing sequence with limit P(A).
(b) Continuity from above: Suppose B1 ⊇ B2 ⊇ · · · and B := ⋂_{n∈N} Bn; then P(Bn) ↓ P(B).
Example 1.16. Consider the event An := {a population of animals is extinct by time n}; then An ⊆ An+1 and
A = ⋃_n An = {the population eventually becomes extinct}.
Hence, by the continuity of P, we have that
P(An) ↑ P(A).
2 Random Variables
When conducting a random experiment, we are often not only directly interested in the
outcome, but rather in various ‘observables’ that can be read off from the experiment.
For example, when rolling a die several times, we might only be interested in the sum of all the throws. That is, we want to understand mappings from Ω to R, since we are typically making a real-valued observation. This naturally leads to the concept of a random
variable. Indeed, working on the same probability space we might also be interested in
understanding different observables (e.g. the difference of the first two throws of the die). Later on in the course, we might often ‘forget’ about the underlying probability
space and write our assumptions in terms of random variables only.
Definition 2.1. A random variable X on R is a mapping X : Ω→ R such that
{X ≤ x} := {ω ∈ Ω : X(ω) ≤ x} ∈ F ∀x ∈ R (1)
Remark 2.2. (i) Note that since B(R) = σ({(−∞, x] : x ∈ R}) and F is a σ-algebra, one can show that
X−1(A) := {X ∈ A} := {ω ∈ Ω |X(ω) ∈ A} ∈ F for any A ∈ B(R) (2)
We say that X is an F-measurable function from Ω to R.
(ii) Often we want random variables that can take values on R ∪ {+∞,−∞} or R+ ∪
{+∞}. The definition above extends simply by replacing R with these.
(iii) Note: everything that you are likely to meet in probability will be suitably mea-
surable! So you should be aware of these problems, but you don’t have to worry
too much about these.
Example 2.3. Consider Experiment Q from Example 1.1, where we model the arrival
times of customers in a queue. So if ω = (ω1, ω2, . . . ) ∈ ΩQ, and we are interested in
T1, the time of the first arrival at the queue, then we can formally define T1 : Ω → R
as T1(ω) = ω1. Another example might be the gap between the arrival of the first and
second customers: G1 : Ω→ R, is given by G1(ω) := ω2 − ω1.
In order to describe a random variable, we introduce the concept of its distribution.
Definition 2.4. The distribution of a random variable X is the probability measure PX
defined on (R,B(R)) via
PX(A) = P(X ∈ A) for any A ∈ B(R).
The distribution function FX of a random variable X on (Ω,F ,P) is the map FX : R→ [0, 1] defined by
FX(x) := P(X ≤ x) = P({ω ∈ Ω : X(ω) ≤ x})
Notice that in order for these quantities to make sense, we are really using that X is a random variable, so that (2) and (1) hold. As in Remark 2.2, it turns out that the distribution function completely describes the distribution of a random variable.
On the problem sheet, we will see that any distribution function F is an increasing, right-
continuous function such that limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. Conversely,
one can show that any function F with these properties is the distribution function of a
real-valued random variable.
Examples 2.5.
(i) We say that T1 is exponentially distributed with rate q > 0, or T1 ∼ Exp(q), if
FT1(t) = P(T1 ≤ t) = 1 − e^{−qt} for t ≥ 0, and FT1(t) = 0 for t < 0.
(ii) Suppose X is a random variable taking values {1, 2, 3} with equal probability, then
FX(x) = P(X ≤ x) =
  0, x < 1
  1/3, 1 ≤ x < 2
  2/3, 2 ≤ x < 3
  1, x ≥ 3
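The staircase in part (ii) can be reproduced directly from the probabilities P(X = k) = 1/3 for k ∈ {1, 2, 3}; a small illustrative sketch:

```python
# Sketch: the distribution function of X uniform on {1, 2, 3} from Example
# 2.5(ii), built directly from its point probabilities.
def F_X(x):
    """F_X(x) = P(X <= x) for X taking values 1, 2, 3 with probability 1/3 each."""
    return sum(1 / 3 for k in (1, 2, 3) if k <= x)

# One evaluation in each of the four regimes of the displayed formula.
print(round(F_X(0.5), 6), round(F_X(1.5), 6), round(F_X(2.5), 6), round(F_X(3.5), 6))
```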
There are two cases when it is particularly easy to describe the distribution of a random variable.
Definition 2.6. Suppose the random variable X takes values in a countable set X(Ω) := {X(ω) : ω ∈ Ω}. Then we say that X is a discrete random variable and we call the function
pX(x) := P(X = x) = P
({ω ∈ Ω : X(ω) = x}),
where pX : X(Ω)→ [0, 1], the probability mass function (PMF) of X.
Example 2.7. In the Example 2.5(ii)
pX(x) =
{
1/3, x ∈ {1, 2, 3} = X(Ω)
0, otherwise
Definition 2.8. A random variable is called continuous if there exists a non-negative function fX on R such that, for all a < b,
P(a < X ≤ b) = FX(b) − FX(a) = ∫_a^b fX(x) dx,
where fX is called the probability density function (PDF) of X.
Example 2.9. (i) If T1 ∼ Exp(q) as in Example 2.5(i), then one can check that it is continuous with density fT1(t) = qe^{−qt} for t ≥ 0, and fT1(t) = 0 for t < 0.
(ii) The following is also a distribution function of a random variable X:
F(x) = 0 if x ≤ 0, (1/2)x if x ∈ (0, 1), 1 if x ≥ 1.
The corresponding random variable is neither discrete nor continuous (see also the problem sheet).
2.1 Expectation
Definition 2.10. Let X be a random variable with 0 ≤ X ≤ ∞. If X is discrete, then
E(X) := ∑_{x∈X(Ω)} x P(X = x) = ∑_{x∈X(Ω)} x pX(x) ≤ ∞.
Note that an indicator function 1lE of an event E is a discrete random variable (taking
values 0 and 1) and we have that
E(1lE) = 0 · P(1lE = 0) + 1 · P(1lE = 1) = P(E)
since 1lE(ω) = 1 if, and only if, ω ∈ E.
Definition 2.11. For any random variable X with 0 ≤ X ≤ ∞, we define
E(X) := ∫_Ω X(ω) dP(ω) := sup{E(Y ) : Y is discrete and Y ≤ X} ∈ [0,∞].
As you will see later in the unit ‘Measure theory’ the expectation is thus the Lebesgue
integral of X with respect to P.
As a consequence of the definition, one can show that
E(X) =


x∈X(Ω)
xP(X = x), if X discrete∫
R
x fX(x) dx, if X continuous
Lemma 2.12 (Properties of expectation).
(i) If X,Y ≥ 0 and λ, µ > 0 are constants, then E(λX + µY ) = λE(X) + µE(Y ).
(ii) If X ≥ 0 and P(X =∞) > 0, then E(X) =∞; so
(ii)’ If X ≥ 0 and E(X) <∞, then P(X <∞) = 1 (contrapositive).
Note that if an event holds with probability 1, then we also say that it holds almost
surely. I.e. another way of saying P(X <∞) = 1 is ‘X is finite almost surely’.
Theorem 2.13 (Monotone convergence theorem, MON). If (Xn)n∈N is a sequence of
non-negative random variables, then
0 ≤ Xn ↑ X =⇒ E(Xn) ↑ E(X) (≤ ∞)
That is, if for every ω ∈ Ω, Xn(ω) ≤ Xn+1(ω) ∀n ∈ N and X(ω) = limn→∞Xn(ω), then
limn→∞ E(Xn) = E (limn→∞Xn), i.e. E(Xn) ↑ E(X).
Remark 2.14. If F1 ⊆ F2 ⊆ · · · and F := ⋃_n Fn, then 1lFn ↑ 1lF, so E(1lFn) ↑ E(1lF) by the monotone convergence theorem, that is, P(Fn) ↑ P(F). Hence MON implies the continuity of probability measures.
Proposition 2.15. Suppose that Xk ≥ 0 for all k. Then
∑_{k=1}^∞ E(Xk) < ∞ =⇒ E(∑_{k=1}^∞ Xk) = ∑_{k=1}^∞ E(Xk) < ∞
=⇒ P(∑_{k=1}^∞ Xk < ∞) = 1 =⇒ P(Xk → 0) = 1.
Proof. For the first implication, note that
Xk ≥ 0 =⇒ ∑_{k=1}^n Xk ↑ ∑_{k=1}^∞ Xk,
so MON gives
E(∑_{k=1}^n Xk) ↑ E(∑_{k=1}^∞ Xk),
but on the other hand, by the linearity of expectation, Lemma 2.12,
E(∑_{k=1}^n Xk) = ∑_{k=1}^n E(Xk) ↑ ∑_{k=1}^∞ E(Xk)
by definition of the infinite sum, since E(Xk) ≥ 0. Hence, we have shown that
E(∑_{k=1}^∞ Xk) = ∑_{k=1}^∞ E(Xk).
The second implication follows from Lemma 2.12(ii)'.
For the third implication, note that for any ω, Xk(ω) ≥ 0 for all k and ∑_{k≥1} Xk(ω) < ∞ imply that Xk(ω) → 0 as k → ∞.
This proposition has a very important corollary.
Lemma 2.16 (First Borel-Cantelli lemma). Let J1, J2, . . . be a sequence of events. If ∑_{k=1}^∞ P(Jk) < ∞, then it is almost surely true that only finitely many of the events Jk occur.

Proof. Let Xk = 1lJk; then Y = ∑_{k=1}^∞ Xk counts the total number of the events Jk that occur. Since ∑_{k=1}^∞ E(Xk) = ∑_{k=1}^∞ P(Jk) < ∞, we have that E(Y ) < ∞ (first implication of Proposition 2.15), and hence Y = ∑_{k=1}^∞ 1lJk < ∞ almost surely by Lemma 2.12(ii)'. That is, with probability one only finitely many of the events occur.
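A simulation makes the lemma concrete. In the sketch below (an illustration only; the choice P(Jk) = 1/k², so that ∑_k P(Jk) = π²/6 < ∞, is arbitrary), the number of events that occur stays small in every run:

```python
# Monte Carlo sketch of the first Borel-Cantelli lemma with independent events
# J_k, P(J_k) = 1/k^2 (a convergent series; the choice is illustrative).
import random

random.seed(1)

def count_occurrences(n_events):
    """Simulate J_1, ..., J_n and count how many of them occur."""
    return sum(1 for k in range(1, n_events + 1) if random.random() <= 1 / k**2)

counts = [count_occurrences(10_000) for _ in range(200)]
# E(total count) = sum 1/k^2 ≈ pi^2/6 ≈ 1.64, so all 200 runs give small counts.
print(max(counts), sum(counts) / len(counts))
```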
2.2 Integrable X, L1
So far we have only defined the expected value for non-negative random variables. To
extend it to general random variables, we use the following construction. Suppose X :
Ω→ R is a random variable on R. Define the positive, resp. negative, part of X as
X+(ω) := max{0, X(ω)} and X−(ω) := max{0,−X(ω)}.
Then X+ and X− are both non-negative random variables and we have that
X = X+ −X− and |X| = X+ +X−
(where the equality holds for all ω ∈ Ω.)
Definition 2.17. If both E(X+) and E(X−) are finite, equivalently, if E|X| = E(X+)+
E(X−) <∞, we call X integrable, denoted as X ∈ L1. In this case, we define
E(X) = E(X+)− E(X−).
Proposition 2.18 (Properties of the expectation).
(i) If X ∈ L1, then |E(X)| ≤ E(|X|);
(ii) If X,Y ∈ L1 and λ, µ ∈ R, then λX+µY ∈ L1 and E(λX+µY ) = λE(X)+µE(Y );
(iii) If X is a discrete random variable and h : X(Ω) → R, then
E(h(X)) = ∑_{x∈X(Ω)} h(x) P(X = x),
where h(X) is considered integrable if, and only if, the series converges absolutely.
(iv) If X is continuous with PDF fX, then for any piecewise continuous function h : R → R,
E(h(X)) = ∫_R h(x) P(X ∈ dx) := ∫_R h(x) fX(x) dx,
where h(X) is integrable if, and only if,
∫_R |h(x)| fX(x) dx < ∞.
Proof. See the unit ‘Measure theory’ or any textbook on probability.
Example 2.19. Suppose T1 ∼ Exp(q), then we have that
E(T1) = ∫_{−∞}^∞ t fT1(t) dt = ∫_0^∞ t q e^{−qt} dt = 1/q,
where in order to calculate the last integral we use integration by parts (Details: Exercise!).
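The value of the integral can also be cross-checked numerically; the sketch below (an illustration, with q = 3 an arbitrary choice) approximates ∫₀^∞ t q e^{−qt} dt by a Riemann sum:

```python
# Numerical sanity check (illustration) of E(T1) = 1/q for T1 ~ Exp(q).
import math

q = 3.0
dt = 1e-4
upper = 20.0 / q                      # the integrand is negligible beyond this
integral = sum(t * q * math.exp(-q * t) * dt
               for t in (i * dt for i in range(int(upper / dt))))
print(abs(integral - 1 / q) < 1e-3)   # Riemann sum agrees with 1/q
```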
We will now give an easy proof of the inclusion-exclusion principle using expectations
and the fact that the expectation of an indicator function 1lE is just the probability P(E).
First, recall the statement of the inclusion-exclusion principle, Proposition 1.14: For any F1, . . . , Fn ∈ F ,
P(⋃_{i=1}^n Fi) = ∑_i P(Fi) − ∑_{i<j} P(Fi ∩ Fj) + ∑_{i<j<k} P(Fi ∩ Fj ∩ Fk) − · · · + (−1)^{n−1} P(F1 ∩ · · · ∩ Fn)
= ∑_{r=1}^n (−1)^{r−1} ∑_{i1<···<ir} P(Fi1 ∩ · · · ∩ Fir).
Proof. We have that
1l_{⋃_{i=1}^n Fi} = 1 − 1l_{⋂_{i=1}^n Fi^c} = 1 − ∏_{i=1}^n 1l_{Fi^c} = 1 − ∏_{i=1}^n (1 − 1l_{Fi})
and for xi ∈ [0, 1],
∏_{i=1}^n (1 − xi) = 1 − ∑_i xi + ∑_{i<j} xi xj − ∑_{i<j<k} xi xj xk + · · · + (−1)^n x1 x2 · · · xn,
hence
P(⋃_{i=1}^n Fi) = E(1l_{⋃_{i=1}^n Fi}) = E(∑_i 1l_{Fi} − ∑_{i<j} 1l_{Fi} 1l_{Fj} + · · · + (−1)^{n−1} 1l_{F1} · · · 1l_{Fn})
= ∑_i E(1l_{Fi}) − ∑_{i<j} E(1l_{Fi∩Fj}) + · · · + (−1)^{n−1} E(1l_{F1∩···∩Fn})
= ∑_i P(Fi) − ∑_{i<j} P(Fi ∩ Fj) + ∑_{i<j<k} P(Fi ∩ Fj ∩ Fk) − · · · + (−1)^{n−1} P(F1 ∩ · · · ∩ Fn),
as claimed.
Theorem 2.20 (Bounded-convergence theorem). Suppose (Xn)n∈N and X are random
variables such that Xn → X almost surely, that is, P(Xn → X as n→∞) = 1. Suppose
further that ∃K > 0 such that
|Xn(ω)| ≤ K, ∀n,∀ω ∈ Ω.
Then
E(|Xn −X|)→ 0
and hence
E(Xn)→ E(X).
Proof. Assumed; for the last part we note that −|Xn −X| ≤ Xn −X ≤ |Xn −X|.
3 Conditional probabilities and independence
In this chapter, we recall the basic concepts of conditional probabilities and indepen-
dence and derive the complement to the first Borel-Cantelli lemma. Throughout, we let
(Ω,F ,P) be a probability space.
3.1 Conditional Probabilities
Definition 3.1. Suppose that A ∈ F with P(A) > 0, then ∀B ∈ F , define
P(B|A) := P(B ∩ A)/P(A).
We can check that in this case, P(·|A) : F → [0, 1] is a probability measure on (Ω,F).
You have already seen the next two results in the first year probability course. First
recall that from the definition if P(A) > 0, then
P(A ∩B) = P(A)P(B |A). (3)
If P(A ∩B) > 0, this generalizes to three events:
P(A ∩B ∩ C) = P(A)P(B |A)P(C |A ∩B).
To see that this holds, write A ∩B ∩C = (A ∩B) ∩C and apply (3) twice. Even more
generally, we have:
Theorem 3.2 (Multiplication principle). Suppose Ai ∈ F for all i ∈ {1, . . . , n}. If
P(A1 ∩ · · · ∩An) > 0 , then P(A1 ∩ · · · ∩Aj) > 0 ∀j = 1, . . . , n− 1 and
P(A1 ∩ · · · ∩An) = P(A1)P(A2|A1)P(A3|A1 ∩A2) · · ·P(An|A1 ∩ · · · ∩An−1)
The multiplication principle leads to the well-known Bayes’ theorem as follows.
Suppose H1, . . . , Hn ∈ F are a partition of Ω, in that they are disjoint,
Hi ∩ Hj = ∅ for all i ≠ j,
and cover Ω, in that
⋃_{k=1}^n Hk = Ω.
Assume that P(Hk) > 0 ∀k. Let K ∈ F , then from the additivity of probability measures
and the multiplication principle (3), we have the law of total probability:
P(K) = ∑_{j=1}^n P(Hj ∩ K) = ∑_{j=1}^n P(Hj) P(K|Hj) (4)
Now, if additionally P(K) > 0, then again by (3),
P(Hi|K) = P(Hi ∩ K)/P(K) = P(Hi) P(K|Hi)/P(K).
Substituting in (4), we have proved:
Theorem 3.3 (Bayes’ theorem). Let H1, . . . ,Hn ∈ F be a partition of Ω with P(Hi) > 0.
Then, for any K ∈ F with P(K) > 0, we have
P(Hi|K) = P(Hi) P(K|Hi) / (∑_{j=1}^n P(Hj) P(K|Hj)).
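A worked numerical instance may help (all numbers below are hypothetical, chosen purely for illustration): given a partition H1, H2, H3 with the priors and likelihoods in the code, the law of total probability (4) gives P(K), and Bayes’ theorem gives the posteriors.

```python
# Hypothetical numbers, for illustration only: a partition H1, H2, H3 with
# priors P(Hi) and likelihoods P(K|Hi), fed into (4) and Bayes' theorem.
priors = [0.5, 0.3, 0.2]        # P(H1), P(H2), P(H3): a partition, so they sum to 1
likelihoods = [0.9, 0.5, 0.1]   # P(K | Hi)

p_K = sum(p * l for p, l in zip(priors, likelihoods))          # law of total probability
posteriors = [p * l / p_K for p, l in zip(priors, likelihoods)]  # Bayes' theorem

print(round(p_K, 3))                      # 0.45 + 0.15 + 0.02 = 0.62
print([round(x, 3) for x in posteriors])  # posteriors sum to 1, as P(.|K) is a measure
```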
3.2 Independence
A central concept in probability is independence. We recall the definition for events and
then see how it applies to random variables.
Recall, events A,B ∈ F are independent if
P(A ∩B) = P(A)P(B).
If A and B are independent (and 0 < P(B) < 1), then P(A|B) = P(A|B^c) = P(A), so whether B occurs or not does not affect the probability of A occurring. The concept generalizes to finitely many and also to infinitely many events.
Definition 3.4. (i) Finitely many events. Events E1, E2, . . . , En are independent if
for any subsequence i1 < i2 < . . . < ik of {1, . . . , n} we have
P(Ei1 ∩ . . . ∩ Eik) = P(Ei1) · · ·P(Eik).
(ii) Infinitely many events. Events E1, E2, . . . are independent if for every n ∈ N, the
events E1, . . . , En are independent.
With the extra assumption of independence we can formulate the complementary result
to the first Borel-Cantelli lemma.
Lemma 3.5 (Second Borel-Cantelli lemma). Suppose J1, J2, . . . is an infinite sequence of independent events such that ∑_{i=1}^∞ P(Ji) = +∞. Then the probability that infinitely many Jn occur is 1, that is,
P(∑_{n=1}^∞ 1lJn = +∞) = 1.
Proof. We have that
P(at least one of J1, . . . , Jn occurs) = P(⋃_{i=1}^n Ji) = 1 − P(⋂_{i=1}^n Ji^c) = 1 − P(none of J1, . . . , Jn occurs)
and
P(⋂_{i=1}^n Ji^c) =(∗) ∏_{i=1}^n P(Ji^c) = ∏_{i=1}^n (1 − P(Ji)) ≤ exp(−∑_{i=1}^n P(Ji)),
where (∗) follows by independence and 1 − x ≤ e^{−x} ∀x. Now we have that
⋂_{i=1}^n Ji^c ↓ ⋂_{i=1}^∞ Ji^c,
so by MON/continuity of probability, this implies that
P(⋂_{i=1}^n Ji^c) ↓ P(⋂_{i=1}^∞ Ji^c) = P(none of J1, J2, . . . occur),
but
0 ≤ P(⋂_{i=1}^n Ji^c) ≤ exp(−∑_{i=1}^n P(Ji)) → 0 as n → ∞,
since ∑_{i=1}^∞ P(Ji) = +∞ by assumption. So
P(none of J1, J2, . . . occur) = 0 =⇒ P(at least one of J1, J2, . . . occurs) = 1.
The proof will be continued on Problem Sheet 2.
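In contrast with the convergent case of the first lemma, a simulation sketch (illustration only; P(Jn) = 1/n is an arbitrary divergent choice) shows the number of occurred events growing without bound, at rate ∑_{n≤N} 1/n ≈ log N:

```python
# Monte Carlo sketch for the second Borel-Cantelli lemma: independent events
# J_n with P(J_n) = 1/n, a divergent series (the choice is illustrative).
import random

random.seed(2)

checkpoints, counts, count = (100, 10_000, 100_000), [], 0
for n in range(1, 100_001):
    if random.random() < 1 / n:      # does J_n occur?
        count += 1
    if n in checkpoints:
        counts.append(count)
# E(count up to N) = sum_{n<=N} 1/n ≈ log N + 0.577, so the count keeps growing.
print(counts)
```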
Definition 3.6. A finite or infinite sequence of random variables is independent if for
any x1, x2, x3, . . . ∈ R, the events
{X1 ≤ x1}, {X2 ≤ x2}, {X3 ≤ x3}, . . .
are independent.
Remark 3.7. By the definition of the independence of events, this means that whenever i1, i2, . . . , ir are distinct indices (for which Xik exists, in the case of a finite sequence of random variables) and x1, . . . , xr ∈ R, we have
P(Xi1 ≤ x1; Xi2 ≤ x2; . . . ; Xir ≤ xr) = P(Xi1 ≤ x1) P(Xi2 ≤ x2) · · · P(Xir ≤ xr).
We will often use, but not prove, the following fact.
Proposition 3.8. If X1, X2, . . . are independent random variables and f1, f2, . . . are
nice2 functions on R, then f1(X1), f2(X2), . . . are independent. Moreover, if g : Rn → R
is a nice function, then the random variables g(X1, . . . , Xn), Xn+1, Xn+2, . . . are inde-
pendent.
2The technical term is Borel measurable, see also Remark 3.10. Most functions you will come across
are Borel measurable. Continuous functions would work for example.
An important property of independent random variables is that they make working out the expectation of a product particularly easy. The following is a more advanced version of a result you have seen in first-year probability:
Lemma 3.9. Let X,Y be independent random variables, then
(i) If X ≥ 0, Y ≥ 0, then E(XY ) = E(X)E(Y );
(ii) If X,Y ∈ L1, then XY ∈ L1 and E(XY ) = E(X)E(Y ).
Remark 3.10. Technical clarification [Not examinable!]: Let S be a topological
space (e.g. R or Rn). Then the Borel σ-algebra B(S) is the smallest σ-algebra that
contains all the open subsets of S. A Borel subset of S is an element of B(S).
A function f : Rn → R is Borel measurable if for any x ∈ R,
{s ∈ Rn : f(s) ≤ x} ∈ B(Rn).
For the more general definition, we need some basics about topological spaces (which we
are not assuming in this unit): Recall that a continuous function f : S → R is one such
that the inverse image
f−1(B) = {s ∈ S : f(s) ∈ B},
of any open subset B ⊂ R is an open subset of S.
Analogously, f : S → R is Borel measurable if the inverse image f−1(B) of any Borel
subset B ⊂ R is Borel in S. Equivalently, if for any x ∈ R,
{s ∈ S : f(s) ≤ x} ∈ B(S).
4 Random Walks
In this chapter, we will introduce the first basic probabilistic model, a random walk,
which we will be studying throughout. In the easiest case, this is a process that at each
step either takes one step upwards or one step downwards. In this chapter, we will first
determine the probability that the random walker hits 0 when started from any other
state. Secondly, we will consider the gambler’s ruin problem, where we work out the
probability that the random walk hits a state N before hitting 0. We will also find the
expected value of the first time that either state 0 or N are hit.
4.1 Simple Random Walks: Hitting and return probabilities
The simplest case is a random walk on Z, where at each step a ‘particle’ independently decides to go up (with probability p) or down (with probability 1 − p).
More formally, let p ∈ (0, 1), then the position of a simple random walk SRW(p) at time
n is given by
Wn := a+X1 +X2 + · · ·+Xn
where W0 := a ∈ Z is the starting position and (Xn : n ∈ N) is an independent and
identically distributed (i.i.d.) sequence of random variables with
P(Xn = +1) = p, P(Xn = −1) = q := 1− p
This is illustrated in Figure 1.
Figure 1: A random walk trajectory.
Note that probabilities for this random walk W = (Wn)n≥0 will depend on the starting
position a: we write Pa for the probability law for the random walk when W0 = a and
Ea for the corresponding expectation.
In the following we will derive the hitting/return probabilities of 0 for a simple random
walk. Let H be the event that Wn = 0 for some strictly positive time,
H := {Wn = 0 for some n ≥ 1}
and define xi := Pi(H).
If i ≠ 0, then xi is the probability that the random walk started at position i hits 0; if i = 0, then x0 is the probability that the random walk started at position 0 returns to the origin.
Now we have by the law of total probability that if we split according to the first step
Pi(H) = P(H|W0 = i) = Pi(H;X1 = +1) + Pi(H;X1 = −1)
= Pi(X1 = +1)Pi(H|X1 = +1) + Pi(X1 = −1)Pi(H|X1 = −1)
We start by considering the case when W0 = +1:
P1(H) = pP1(H|X1 = +1) + q P1(H|X1 = −1)
but P1(H|X1 = +1) = P2(H) as conditioning on X1 = +1 takes the random walk to
position 2 and then everything ‘starts afresh’ at time 1 since X2, X3, . . . are independent
of X1, and P1(H|X1 = −1) = 1 since, conditional on X1 = −1, we have W1 = 0 and event H is guaranteed to occur. Thus
x1 = p x2 + q, and xi = p xi+1 + q xi−1 for all i ≥ 2. (5)
Note that x1 ≠ p x2 + q x0, since x0 need not be 1, whereas the random walk jumping down from 1 to 0 on the first step definitely hits 0 at a positive time.
Intuitively,
x2 = P2(hit 0)
= P2(hit 1)P2(hit 0 | hit 1)
= P2(hit 1)P1(hit 0),
since once the random walk gets to 1 from 2, everything ‘starts afresh’ and independently
it must get down from 1 to 0. Formally, we are using the strong Markov property of the
random walk at random time T1 = inf{n ≥ 0 : Wn = 1}, but we will not discuss this in
all formal details here, see also Figure 2.
Thus, since the probability of going from 2 to 1 must be the same as that of going from 1 to 0, we have shown
x2 = x1^2. (6)
Similarly, we find that
xi = (x1)^i for i ≥ 1. (7)
Combining (5) and (6), we have that
x1 = p x1^2 + q =⇒ p x1^2 − x1 + q = (p x1 − q)(x1 − 1) = 0.
Hence, x1 = q/p or x1 = 1.
Figure 2: The random walk starts afresh once reaching level 1.
If q ≥ p, only one of the solutions is in [0, 1], so we must have x1 = 1. Thus, we are
guaranteed to return if we are more likely to jump down.
If q < p, both solutions could be probabilities. Intuitively, since E(Xk) = 1·p + (−1)·q = p − q, we will have
Wn/n = (W0 + X1 + X2 + · · · + Xn)/n → E(X1) = p − q > 0
(this is the strong law of large numbers, SLLN, that we will discuss later) and hence Wn tends to +∞, so there must be a positive probability of drifting off to +∞ without hitting 0. Hence if q < p, then x1 = q/p.
Combining this discussion with (7), we have shown the following:
Theorem 4.1. Let (Wn) be a simple random walk with q ≤ p. Then,
Pi(hit 0) = Pi(H) = (q/p)^i for i ≥ 1, and Pi(hit 0) = 1 for i ≤ −1.
Note that for p < q an analogous result holds by symmetry.
Lastly, noting
x0 = p x1 + q x−1 = p(q/p) + q·1 = 2q if q ≤ p, and x0 = p·1 + q(p/q) = 2p if p ≤ q,
we find that x0 = 1 − |p − q| for any p ∈ (0, 1). Hence, in all cases,
P0(no return to 0) = 1 − x0 = |p − q|,
which is the bias in the random walk.
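Theorem 4.1 can be checked by simulation. The sketch below is illustrative only: p = 0.7 is an arbitrary choice, and truncating each walk after a fixed number of steps introduces a small bias (a surviving walk with upward drift has almost surely escaped).

```python
# Monte Carlo estimate of P1(hit 0) for a SRW(p) with p > q, compared with the
# value q/p from Theorem 4.1 (illustration; walks are truncated).
import random

random.seed(3)

def hits_zero(p, start, max_steps=1_000):
    """Simulate a SRW(p) from `start`; True if it hits 0 within max_steps."""
    w = start
    for _ in range(max_steps):
        w += 1 if random.random() < p else -1
        if w == 0:
            return True
    return False   # treat long survival as escape to +infinity

p, trials = 0.7, 5_000
estimate = sum(hits_zero(p, 1) for _ in range(trials)) / trials
print(abs(estimate - (1 - p) / p) < 0.03)   # q/p = 3/7 ≈ 0.4286
```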
4.2 Gambler’s Ruin
For the gambler’s ruin problem, we consider a gambler who repeatedly plays the following game: at each step she bets £1; if she wins then she receives £2 in return (i.e. she gains £1), and otherwise she loses her pound. If she starts with k pounds and wins each game with probability p, then the evolution of her capital is described by a simple random walk. In this section, we answer the question of how likely it is that she reaches £N before going bankrupt.
Let k and N be integers with 0 < k < N. Let
xk := Pk(TN < T0),
where
Tj := inf{n ≥ 0 : Wn = j}
is the first hitting time of state j for the SRW(p) W = (Wn)n≥0. That is, xk is the probability that the SRW(p) started at k hits N before 0.
Figure 3: Gambler’s ruin.
Considering the first move as before, we have that for 1 ≤ k ≤ N − 1,
xk = p xk+1 + q xk−1, where x0 = 0, xN = 1,
which is a recurrence/difference equation. Since p + q = 1, we can rewrite this equation as
p(xk+1 − xk) − q(xk − xk−1) = 0,
and therefore
xk+1 − xk = (q/p)(xk − xk−1) = (q/p)^2 (xk−1 − xk−2).
Continuing, we obtain
xk+1 − xk =
(
q
p
)k
(x1 − x0).
Now,
    xk − x0 = (x1 − x0) + (x2 − x1) + · · · + (xk − xk−1)
            = (x1 − x0) [1 + (q/p) + (q/p)² + · · · + (q/p)^(k−1)]      (8)

So for p ≠ q, we have

    xk − x0 = (x1 − x0) · (1 − (q/p)^k)/(1 − (q/p)),   q ≠ p.
Now using the boundary conditions x0 = 0 and xN = 1,
    xN − x0 = 1 = (x1 − x0) · (1 − (q/p)^N)/(1 − (q/p)),

and so by combining the two equations above and noting that x0 = 0, we have shown
the following result.

Theorem 4.2. For p ≠ q and all 0 ≤ k ≤ N,

    Pk(TN < T0) = xk = (1 − (q/p)^k)/(1 − (q/p)^N).

For p = q = 1/2, we have that

    Pk(TN < T0) = k/N.
The second result follows by going back to the summation in (8), or from the first
one by letting p → 1/2.
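Theorem 4.2 is easy to check against a direct simulation; the sketch below (plain Python, with illustrative parameters and hypothetical function names) compares the closed form with a Monte Carlo estimate:

```python
import random

def win_prob(k, N, p):
    """P_k(T_N < T_0) from Theorem 4.2, with the p = 1/2 case included."""
    q = 1 - p
    if p == q:
        return k / N
    r = q / p
    return (1 - r**k) / (1 - r**N)

def win_prob_mc(k, N, p, trials=20_000, seed=1):
    """Monte Carlo estimate: run the walk from k until it hits 0 or N."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        w = k
        while 0 < w < N:
            w += 1 if rng.random() < p else -1
        wins += (w == N)
    return wins / trials

print(win_prob(3, 10, 0.5))                           # 3/10
print(win_prob(3, 10, 0.6), win_prob_mc(3, 10, 0.6))  # exact value and a nearby estimate
```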
4.3 Gambler’s Ruin - Expected duration of the game
Given any event B ∈ F and a random variable X, we will use the following notation:
the expectation of X on the event B is defined by

    E(X; B) := E(X 1lB) =: ∫_B X(ω) dP(ω)
With this notation we can define the conditional expectation.
Definition 4.3. For any B ∈ F with P(B) > 0, the conditional expectation of X given
B is given by

    E(X|B) = E(X; B)/P(B) = E(X 1lB)/P(B)

Then, one can show that it is possible to split E(X) relative to B, i.e. we have that

    E(X) = E(X(1lB + 1lBc)) = P(B)E(X|B) + P(Bc)E(X|Bc).      (9)
In Gambler’s ruin for SRW(p), let T be the random time when W first hits 0 or N, so

    T = T0 ∧ TN := min{T0, TN}

Let yk := Ek(T) = E(T |W0 = k). Then

    Ek(T) = Pk(W1 = k + 1)Ek(T |W1 = k + 1) + Pk(W1 = k − 1)Ek(T |W1 = k − 1),

so

    yk = p(1 + yk+1) + q(1 + yk−1),

hence

    yk = pyk+1 + qyk−1 + 1  for 1 ≤ k ≤ N − 1,  where y0 = 0, yN = 0      (10)
To solve this non-homogeneous difference equation, we first find a ‘particular solution’,
that is, any solution of
zk = pzk+1 + qzk−1 + 1,
where we ignore the boundary conditions. If we guess zk = ck, we require

    ck = pc(k + 1) + qc(k − 1) + 1  =⇒  c = 1/(q − p).

Hence, zk = k/(q − p). Secondly, we note that wk := yk − zk satisfies

    wk = pwk+1 + qwk−1  where  w0 = 0,  wN = N/(p − q)      (11)
From difference equation theory, we look for solutions of the form wk proportional to
λ^k, i.e. wk = aλ^k for some a ∈ R. Substituting wk = aλ^k into (11) yields the
auxiliary equation

    λ = pλ² + q  ⇐⇒  pλ² − λ + q = 0  ⇐⇒  (λ − 1)(pλ − q) = 0,

hence λ = 1 or λ = q/p. If q ≠ p, then we have distinct roots for λ and the general
solution is

    wk = A · 1^k + B (q/p)^k,   k = 0, . . . , N
Now, if we take yk = wk + zk, then yk satisfies the recursion in (10) and we have to
choose A, B so as to satisfy the boundary conditions. Note that z0 = 0 and zN = N/(q − p).
Hence, in order to satisfy (10), we need y0 = 0 and yN = 0, which translates into
w0 = 0 and wN = −zN = N/(p − q). Thus, A, B have to satisfy

    A + B = 0   and   A + B (q/p)^N = N/(p − q).

Hence

    wk = N(1 − (q/p)^k) / ((p − q)(1 − (q/p)^N)) = N xk/(p − q)
Recalling that Ek(T ) = yk = wk + zk, this suggests the following.
Theorem 4.4. Let p ≠ 1/2. If T is the first time a SRW(p) hits 0 or N, then for
k ∈ {1, . . . , N − 1},

    Ek(T) = (N xk − k)/(p − q) = ((N − k) − N(q/p)^k + k(q/p)^N) / ((p − q)(1 − (q/p)^N)),   p ≠ q.
Proof. [Not examinable.] We have shown that Ek(T) satisfies the equation

    yk = 1 + pyk+1 + qyk−1,   y0 = 0,   yN = 0,      (12)

and we have found one solution (the right-hand side of the theorem). We still have to show
that Ek(T) is the minimal non-negative solution of (12), so that we can exclude the trivial
solution yk ≡ ∞.
Let (yk) be any non-negative solution of (12). Then,

    yk = 1 + p yk+1 + q yk−1 = 1 + ∑_{ℓ∉{0,N}} Pk(W1 = ℓ) yℓ.

Iterating this equation gives

    yk = 1 + ∑_{ℓ∉{0,N}} Pk(W1 = ℓ) + ∑_{ℓ1,ℓ2∉{0,N}} Pk(W1 = ℓ1) Pℓ1(W1 = ℓ2) yℓ2
       = 1 + ∑_{ℓ∉{0,N}} Pk(W1 = ℓ) + ∑_{ℓ1,ℓ2∉{0,N}} Pk(W1 = ℓ1, W2 = ℓ2) yℓ2,

where we used the ‘starting afresh’ property. Iterating again, we get

    yk = 1 + ∑_{ℓ∉{0,N}} Pk(W1 = ℓ) + ∑_{ℓ1,ℓ2∉{0,N}} Pk(W1 = ℓ1, W2 = ℓ2)
           + ∑_{ℓ1,ℓ2,ℓ3∉{0,N}} Pk(W1 = ℓ1, W2 = ℓ2, W3 = ℓ3) yℓ3
       ≥ 1 + ∑_{ℓ∉{0,N}} Pk(W1 = ℓ) + ∑_{ℓ1,ℓ2∉{0,N}} Pk(W1 = ℓ1, W2 = ℓ2) + . . .
           + ∑_{ℓ1,...,ℓn∉{0,N}} Pk(Wi = ℓi, i = 1, . . . , n)
       = Pk(T ≥ 1) + Pk(T ≥ 2) + . . . + Pk(T ≥ n + 1),

where we used that yℓ ≥ 0. Letting n → ∞, we see that

    yk ≥ ∑_{ℓ≥1} Pk(T ≥ ℓ) = Ek(T),

by the identity proved on the problem sheet.
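Theorem 4.4 can also be sanity-checked by simulation; a minimal Python sketch (illustrative parameters, p ≠ 1/2, hypothetical function names):

```python
import random

def expected_duration(k, N, p):
    """E_k(T) from Theorem 4.4 (valid for p != 1/2)."""
    q = 1 - p
    r = q / p
    x_k = (1 - r**k) / (1 - r**N)
    return (N * x_k - k) / (p - q)

def expected_duration_mc(k, N, p, trials=20_000, seed=2):
    """Average the time the walk takes to reach 0 or N from k."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        w, t = k, 0
        while 0 < w < N:
            w += 1 if rng.random() < p else -1
            t += 1
        total += t
    return total / trials

print(expected_duration(3, 10, 0.6), expected_duration_mc(3, 10, 0.6))
```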
5 Generating Functions
In this section, we introduce the tool of generating functions, which will be one of the
main techniques used in this unit. These will allow us to determine the distribution of
random variables and, in particular, we will use them to determine limit laws of sequences
of random variables.
Definition 5.1. For a random variable X taking values in Z+ ∪ {∞}, we define the
probability generating function (PGF) gX of X as gX : [0, 1) → [0, 1] with

    gX(θ) := E(θ^X) = ∑_{k=0}^∞ θ^k P(X = k)

Here, we use the convention that θ^∞ = 0 for 0 ≤ θ < 1.
PGFs have radius of convergence at least 1. In particular, for |θ| < 1, we may exchange
limits, differentiation and integration with the infinite sum.
Examples 5.2.
(i) For the Bernoulli distribution X ∼ Ber(p), we have that

    gX(θ) = θ^0 P(X = 0) + θ P(X = 1) = (1 − p) + pθ.

(ii) For the Poisson distribution Y ∼ Po(λ), we have that

    gY(θ) = ∑_{k=0}^∞ θ^k e^{−λ} λ^k/k! = e^{−λ} ∑_{k=0}^∞ (θλ)^k/k! = e^{λ(θ−1)}.
Remark 5.3. We have the following:

    gX(1−) := lim_{θ↑1} gX(θ) = P(X < ∞)

by MON, since θ^X ↑ 1l{X<∞} as θ ↑ 1 (since θ^X = 0 on the event {X = ∞}). Moreover,
note that for 0 ≤ θ < 1,

    g′X(θ) = E(X θ^{X−1}),

and hence

    g′X(1−) = E(X; X < ∞).
The arguments in the previous remark show the following result:

Lemma 5.4. If P(X < ∞) = 1, then

    gX(1) = 1,   g′X(1) = E(X),   g′′X(1) = E(X(X − 1)),

and more generally

    g^(k)_X(1) = E(X(X − 1) · · · (X − k + 1)).
One reason why generating functions are so useful is that they interact well with sums
of independent random variables.

Theorem 5.5. If X and Y are independent random variables with values in Z+ ∪ {∞},
then

    gX+Y(θ) = gX(θ) gY(θ).

Proof. If X and Y are independent, then so are θ^X and θ^Y, so

    E(θ^{X+Y}) = E(θ^X θ^Y) = E(θ^X)E(θ^Y)

by independence, as required.
Examples 5.6.
(i) Binomial sums: if Xn ∼ B(n, p), then Xn = Y1 + · · · + Yn where Y1, Y2, . . . are IID,
Yi ∼ B(1, p) with P(Yi = 1) = p, P(Yi = 0) = 1 − p and

    gYi(θ) = pθ + (1 − p) · 1 = pθ + q,

where q := 1 − p. Hence

    gXn(θ) = E(θ^{Y1+···+Yn}) = ∏_{i=1}^n E(θ^{Yi}) = (q + pθ)^n
(ii) If X ∼ Po(λ) and Y ∼ Po(µ) with X and Y independent, then

    gX+Y(θ) = gX(θ)gY(θ) = e^{λ(θ−1)} e^{µ(θ−1)} = e^{(λ+µ)(θ−1)}

This is the PGF of a Po(λ + µ) random variable, hence X + Y ∼ Po(λ + µ) by the
following theorem.
Theorem 5.7. The PGF gZ uniquely determines the distribution of the random variable
Z. In particular,

    P(Z = k) = g^(k)_Z(0)/k! = (1/k!) (d^k/dθ^k) gZ(θ) |_{θ=0}

and

    P(Z = +∞) = 1 − ∑_{k=0}^∞ P(Z = k) = 1 − P(Z < ∞) = 1 − gZ(1−)

In practice, if the PGF is recognisable, then we already know the distribution. Otherwise,
if we can expand a PGF gZ(θ) as a power series in θ, that is,

    gZ(θ) = ∑_{i=0}^∞ pi θ^i

for some {pi : i ∈ Z+}, then necessarily P(Z = i) = pi.
Sometimes, the following identity can be useful when multiplying power series:

    (∑_{k=0}^∞ ak θ^k)(∑_{ℓ=0}^∞ bℓ θ^ℓ) = ∑_{n=0}^∞ cn θ^n,      (13)

where

    cn = ∑_{m=0}^n am b_{n−m}.

[To check, multiply out the LHS and equate coefficients of θ^n.]
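Identity (13) is just discrete convolution. As a quick illustration (in Python, with pmfs as the coefficient sequences), convolving the Ber(p) pmf with itself produces the B(2, p) pmf, matching gX+Y = gX gY:

```python
def convolve(a, b):
    """c_n = sum_m a_m * b_{n-m}, as in (13)."""
    c = [0.0] * (len(a) + len(b) - 1)
    for m, am in enumerate(a):
        for l, bl in enumerate(b):
            c[m + l] += am * bl
    return c

p = 0.3
ber = [1 - p, p]            # pmf of Ber(0.3), i.e. coefficients of its PGF
print(convolve(ber, ber))   # [0.49, 0.42, 0.09], the pmf of B(2, 0.3)
```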
Theorem 5.8 (Convergence of PGFs). Let X, X1, X2, X3, . . . be random variables taking
values in Z+ ∪ {∞}. Then the following are equivalent:

(i) The PGFs gXn converge pointwise to the PGF gX as n → ∞, that is, for all
θ ∈ [0, 1),

    lim_{n→∞} gXn(θ) = gX(θ);

(ii) The distributions of Xn for n ≥ 1 converge to the distribution of X, that is, for all
k ≥ 0,

    lim_{n→∞} P(Xn = k) = P(X = k).

We say that (Xn)n≥1 converges in distribution to X, written Xn →D X, or that (Xn)n≥1
converges in law to X, written Xn →L X.
Proof. Suppose that gXn(θ) → gX(θ) for all θ ∈ [0, 1). Then

    ∑_{k=0}^∞ θ^k P(X = k) = gX(θ) = lim_{n→∞} gXn(θ)
                           = lim_{n→∞} ∑_{k=0}^∞ θ^k P(Xn = k)
                           = ∑_{k=0}^∞ lim_{n→∞} θ^k P(Xn = k)
                           = ∑_{k=0}^∞ θ^k lim_{n→∞} P(Xn = k),

since the PGF series is uniformly convergent for θ ∈ [0, 1). Then, as power series expansions
are unique, comparing θ^k coefficients gives

    P(X = k) = lim_{n→∞} P(Xn = k).

This proves that (i) =⇒ (ii). For the backwards implication, we just reverse the argument.
Example 5.9 (Poisson approximation of binomial random variables). Let Xn ∼ B(n, pn)
where pn = λ/n. Then

    gXn(θ) = (1 − pn + pnθ)^n = (1 − λ/n + (λ/n)θ)^n = (1 + (θ − 1)λ/n)^n → e^{(θ−1)λ},

as n → ∞. But if X ∼ Po(λ), then gX(θ) = e^{(θ−1)λ}, so gXn(θ) → gX(θ) for all θ ∈ [0, 1).
Hence Xn →D X, that is, for all k ∈ Z+,

    lim_{n→∞} P(Xn = k) = P(X = k) = (λ^k/k!) e^{−λ}.

In other words, for n large, Xn is approximately Po(λ) distributed.
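The approximation in Example 5.9 can be seen numerically; a short Python comparison (λ = 3 and n = 1000 are arbitrary illustrative choices):

```python
import math

def binom_pmf(n, p, k):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, n = 3.0, 1000
for k in range(5):
    print(k, binom_pmf(n, lam / n, k), poisson_pmf(lam, k))  # columns agree closely
```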
6 More on random walks
In this chapter, we will come back to analysing simple random walks. In the first part,
we will apply the technique of generating functions to describe the distribution of the
first return time to the origin. In the second part of the chapter, we will introduce the
reflection principle, which allows us to compute further quantities related to the simple
random walk, such as the distribution of its maximum.
6.1 Distribution of the return time for a simple random walk
Let W = (Wn)n≥0 be a SRW(p). Define Hr, the first time the walk visits r after time 0, as

    Hr := inf{k > 0 : Wk = r}.

Note that Hr > 0, and Hr = +∞ if Wk ≠ r for all k > 0. Let gr be the PGF of Hr given
W0 = 0:

    gr(θ) := E0(θ^{Hr}) = E(θ^{Hr} | W0 = 0).
Theorem 6.1. (i) For any r > 0,

    gr(θ) = g1(θ)^r,   where   g1(θ) = (1 − √(1 − 4pqθ²)) / (2qθ),
    g−r(θ) = g−1(θ)^r,   where   g−1(θ) = (1 − √(1 − 4pqθ²)) / (2pθ).

For r = 0, i.e. for the return time H0 to 0, we have

    g0(θ) = E0(θ^{H0}) = 1 − √(1 − 4pqθ²).

(ii) Moreover,

    P0(H0 = ∞) = P0(no return) = |p − q|,
    P0(H0 = 2n) = (1/(2n − 1)) C(2n, n) p^n q^n   (n ≥ 1),
    E0(H0 | H0 < ∞) = (1 + |p − q|)/|p − q|,   E0(H0) = +∞.

Note for p = q = 1/2, P0(H0 < ∞) = 1 but E0(H0) = +∞.
Proof. (i) Let θ ∈ [0, 1). By splitting over the time to hit 1, we see that

    gr(θ) = E0(θ^{Hr}) = E0(θ^{Hr}; H1 < ∞),

since Hr ≥ H1, and θ^{Hr} ≤ θ^{H1} = 0 if H1 = +∞ and θ ∈ [0, 1). So

    gr(θ) = ∑_{k=1}^∞ E0(θ^{Hr}; H1 = k) = ∑_{k=1}^∞ P0(H1 = k) E0(θ^{Hr} | H1 = k).

Now, the increments Xk+1, Xk+2, . . . are independent of W0, . . . , Wk. Hence, we can
continue:

    gr(θ) = ∑_{k=1}^∞ P0(H1 = k) θ^k E1(θ^{Hr}) = g_{r−1}(θ) ∑_{k=1}^∞ θ^k P0(H1 = k) = g_{r−1}(θ) g1(θ).
Iterating the argument yields

    gr(θ) = g1(θ)^r  for r ≥ 1,   and   g−r(θ) = g−1(θ)^r  for r ≥ 1.

Now, if we split over the first move, we see that

    g1(θ) = E0(θ^{H1}) = p E0(θ^{H1} | W1 = 1) + (1 − p) E0(θ^{H1} | W1 = −1)
          = pθ + (1 − p)θ E−1(θ^{H1}),

where

    E−1(θ^{H1}) = E0(θ^{H2}) = g2(θ) = g1(θ)².

This implies that

    g1(θ) = pθ + (1 − p)θ g1(θ)²  =⇒  qθ g1(θ)² − g1(θ) + pθ = 0,

so g1(θ) must be one of the roots

    (1 ± √(1 − 4pqθ²)) / (2qθ),

but g1(θ) = E0(θ^{H1}) ∈ [0, 1) for all θ ∈ [0, 1), whereas

    (1 + √(1 − 4pqθ²)) / (2qθ) → +∞  as θ → 0,

hence

    g1(θ) = (1 − √(1 − 4pqθ²)) / (2qθ)      (14)
and similarly (by symmetry),

    g−1(θ) = (1 − √(1 − 4pqθ²)) / (2pθ)      (15)

It remains to consider g0(θ). Again, by splitting according to the first move,

    g0(θ) = E0(θ^{H0}) = pθ E1(θ^{H0}) + qθ E−1(θ^{H0}) = pθ g−1(θ) + qθ g1(θ).

Hence, by (14) and (15),

    g0(θ) = 1 − √(1 − 4pqθ²).

Thus, we have proved part (i) of the theorem.
(ii) By Remark 5.3,

    P0(H0 < ∞) = g0(1−) := lim_{θ↑1} g0(θ) = 1 − √(1 − 4pq) = 1 − √((2p − 1)²) = 1 − √((p − q)²) = 1 − |p − q|.

Hence,

    P0(H0 = +∞) = |p − q|,

which is the probability of no return to the origin.
Also,

    g′0(θ) = E0(H0 θ^{H0−1}; H0 < ∞) = (∂/∂θ)(1 − √(1 − 4pqθ²)) = 4pqθ / √(1 − 4pqθ²),

and then

    E0(H0; H0 < ∞) = g′0(1−) := lim_{θ↑1} 4pqθ/√(1 − 4pqθ²) = 4pq/√(1 − 4pq)
                   = 4pq/|p − q| if p ≠ q,   and   +∞ if p = q = 1/2.

Hence, for p ≠ q,

    E0(H0 | H0 < ∞) = E0(H0; H0 < ∞)/P0(H0 < ∞) = (4pq/|p − q|) · 1/(1 − √(1 − 4pq))
                    = 4pq(1 + √(1 − 4pq)) / (|p − q|(1 − (1 − 4pq))) = (1 + |p − q|)/|p − q|.

Note that since P0(H0 = +∞) > 0 for p ≠ q, it follows that E0(H0) = +∞. If
p = q = 1/2, then P0(H0 = ∞) = 0, and E0(H0) = E0(H0; H0 < ∞) = +∞ also.
Finally, to recover the distribution, we note the series expansion

    1 − √(1 − x) = ∑_{n=1}^∞ C(2n, n) (x^n/2^{2n}) (1/(2n − 1))   for all x ∈ [0, 1),

hence

    g0(θ) = 1 − √(1 − 4pqθ²) = ∑_{n=1}^∞ C(2n, n) (p^n q^n/(2n − 1)) θ^{2n},

but

    g0(θ) := E0(θ^{H0}) = ∑_{k=0}^∞ θ^k P0(H0 = k)

by definition, so by equating coefficients of θ^k we obtain that

    P0(H0 = 2n) = C(2n, n) p^n q^n/(2n − 1)   for all n ∈ N,

and

    P0(H0 = 2n − 1) = 0   for all n ∈ N.
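The first-return probabilities can be confirmed by brute force: enumerate all ±1 step sequences of length 2n, keep those whose first return to 0 occurs exactly at time 2n, and add up their probabilities. A small Python check (n = 3 keeps the enumeration to 2^6 = 64 paths):

```python
from itertools import product
from math import comb

def first_return_prob(n, p):
    """P_0(H_0 = 2n) by enumerating all walks of length 2n."""
    q, total = 1 - p, 0.0
    for steps in product((1, -1), repeat=2 * n):
        w, first_return_at_end = 0, True
        for i, s in enumerate(steps):
            w += s
            if w == 0 and i < 2 * n - 1:   # returned too early
                first_return_at_end = False
                break
        if first_return_at_end and w == 0:
            ups = steps.count(1)
            total += p**ups * q**(2 * n - ups)
    return total

n, p = 3, 0.6
formula = comb(2 * n, n) * p**n * (1 - p)**n / (2 * n - 1)
print(first_return_prob(n, p), formula)  # the values agree
```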
6.2 The Reflection Principle
Let m,n, x, y ∈ Z with n > m. How many paths go from (m,x) to (n, y)?
Figure 4: The reflection principle.
Let u be the number of ‘ups’, i.e. the number of +1 jumps, and let d be the number of
‘downs’, i.e. the number of −1 jumps. Then we need

    u + d = n − m   (total time for the path),
    u − d = y − x   (change in position).

Hence

    u = ((n − m) + (y − x))/2   and   d = ((n − m) − (y − x))/2 = (n − m) − u      (16)

The ‘ups’ can occur in any u of the n − m time intervals, hence the number of paths
from (m, x) to (n, y) is

    C(n − m, u)   where   u = ((n − m) + (y − x))/2      (17)
with the convention that the Binomial coefficient is 0 unless u ∈ {0, 1, . . . , n−m}.
Remark 6.2. If there is a path from (m,x) to (n, y) and n−m is even (or respectively
odd), then y − x must also be even (odd). Clearly we need both u and d to be integers
in the range {0, 1, . . . , n−m} or no path can go from (m,x) to (n, y).
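Formula (17), together with the parity remark, translates directly into code; a small Python helper (the name `num_paths` is a hypothetical choice for illustration):

```python
from math import comb

def num_paths(m, x, n, y):
    """Number of +-1 paths from (m, x) to (n, y), as in (17)."""
    steps, rise = n - m, y - x
    if (steps + rise) % 2 != 0:     # parity obstruction (Remark 6.2)
        return 0
    u = (steps + rise) // 2         # number of 'up' steps
    if not 0 <= u <= steps:
        return 0
    return comb(steps, u)

print(num_paths(0, 0, 4, 0))  # 6 paths of length 4 from 0 back to 0
print(num_paths(0, 0, 4, 1))  # 0: wrong parity
```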
Corollary 6.3. For SRW(p),

    P0(Wn = 2u − n) = C(n, u) p^u q^{n−u}   for u ∈ {0, 1, . . . , n}.

Note that Wn = 2u − n corresponds to u ‘ups’ and n − u ‘downs’ when W0 = 0 (by (16)
with x = 0, m = 0). If W0 = 0, then Wn can only be at positions {−n, −n + 2, . . . , n − 2, n}.
Theorem 6.4 (Reflection principle for paths). Let x, y > 0 and n > m. Then

    #{paths from (m, x) to (n, y) which touch or cross the time axis {(k, 0) : k ∈ Z}}
        = #{paths from (m, x) to (n, −y)}
        = #{paths from (m, −x) to (n, y)}
Proof. Any path that reaches the time-axis can be ‘reflected’ from the first time it hits
to give a path ending at (n,−y). This gives a one-to-one correspondence between paths
touching the time-axis ending at (n, y) and those paths ending at (n,−y).
Theorem 6.5 (Reflection principle for the symmetric simple RW). Let Mn := max{Wk :
k ≤ n} (the running maximum of the random walk) for W = (Wn)n≥0 a SRW(1/2). Then
for all k, m ∈ Z+,

    P0(Mn ≥ m; Wn = m − k) = P0(Wn = m + k)

Proof. By the Reflection Principle, there is a one-to-one correspondence between paths
reaching m and ending at m − k and paths ending at m + k, so that

    P0(Mn ≥ m; Wn = m − k) = (1/2)^n #{paths such that Mn ≥ m, Wn = m − k}
                           = (1/2)^n #{paths with Wn = m + k}
                           = P0(Wn = m + k)
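The identity in Theorem 6.5 can be verified exactly for small n by enumerating all 2^n equally likely paths; a Python check using exact rational arithmetic (n = 5, m = 2, k = 1 chosen for illustration):

```python
from itertools import product
from fractions import Fraction

def reflection_identity(n, m, k):
    """Return (P_0(M_n >= m, W_n = m-k), P_0(W_n = m+k)) for SRW(1/2)."""
    path_prob = Fraction(1, 2) ** n
    lhs = rhs = Fraction(0)
    for steps in product((1, -1), repeat=n):
        w, running_max = 0, 0
        for s in steps:
            w += s
            running_max = max(running_max, w)
        if running_max >= m and w == m - k:
            lhs += path_prob
        if w == m + k:
            rhs += path_prob
    return lhs, rhs

lhs, rhs = reflection_identity(5, 2, 1)
print(lhs, rhs)  # both 5/32
```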
Theorem 6.6 (Ballot theorem). Consider an election between two candidates A and B,
where A gets a votes and B gets b votes, with a > b. If the slips are counted in a random
order, then the probability that A is strictly in the lead from the first slip counted until
the end of the count is

    (a − b)/(a + b).
Proof. Each count corresponds to a path from (0, 0) to (a + b, a − b), and each path is
equally likely, as the slips are in a random order (the number of ups is a, and the number
of downs is b). Then

    #{paths from (0, 0) to (a + b, a − b)} = C(a + b, a).

For A to always be strictly in the lead, the first slip must be for A (an ‘up’) and then
the path must not touch 0 again.
Note that

    #{paths from (1, 1) to (a + b, a − b)} = C(a + b − 1, b),

but by the reflection principle

    #{paths from (1, 1) to (a + b, a − b) that touch the time axis}
        = #{paths from (1, 1) to (a + b, −(a − b))} = C(a + b − 1, b − 1),

where for the second equality we use that the number of ‘up’ steps required in a
path from (1, 1) to (a + b, −(a − b)) is given, as in (17), by

    u := (1/2)(a + b − 1 + (−(a − b) − 1)) = b − 1.
Then,

    #{paths (1, 1) to (a + b, a − b) avoiding 0}
        = #{paths (1, 1) to (a + b, a − b)} − #{paths (1, 1) to (a + b, a − b) that hit the 0 line}
        = C(a + b − 1, b) − C(a + b − 1, b − 1).

Hence, as all paths are equally likely,

    P(A always strictly in the lead) = [C(a + b − 1, b) − C(a + b − 1, b − 1)] / C(a + b, b)
        = (a!b!/(a + b)!) [(a + b − 1)!/(b!(a − 1)!) − (a + b − 1)!/((b − 1)!a!)]
        = (1/(a + b))(a − b) = (a − b)/(a + b).
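The Ballot theorem can be confirmed by exhaustively checking every distinct counting order; a small Python check (a = 4, b = 2 for illustration, so the answer should be 1/3):

```python
from itertools import permutations

def ballot_prob(a, b):
    """Exact P(A strictly leads throughout the count), by enumeration."""
    # distinct counting orders = multiset permutations of a ups and b downs
    orders = set(permutations([1] * a + [-1] * b))
    good = 0
    for order in orders:
        lead, strict = 0, True
        for v in order:
            lead += v
            if lead <= 0:
                strict = False
                break
        good += strict
    return good / len(orders)

print(ballot_prob(4, 2), (4 - 2) / (4 + 2))  # both 1/3
```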
Corollary 6.7. For SRW(p),

    P0(Wk ≠ 0, 1 ≤ k ≤ 2n − 1 | W2n−1 = 1) = 1/(2n − 1).

Proof. All paths of SRW(p) that end at 1 at time 2n − 1 are equally likely, each with
probability p^n q^{n−1}, as each such path has n ups and n − 1 downs placed in the 2n − 1
time slots. The Ballot theorem then tells us the required probability is

    (n − (n − 1))/(n + (n − 1)) = 1/(2n − 1).
Using the reflection principle, we can give an alternative proof of Theorem 6.1 describing
the distribution of the first return time to 0 for the random walk.
Theorem 6.8 (Application to return times). For a SRW(p) started at 0, let H0 be the
first time the particle returns to 0. Then,

    P0(H0 = 2n) = (2/(2n − 1)) C(2n − 1, n) p^n q^n.
Proof. Splitting into paths that stay above, resp. below, the 0-line and then using Corollary 6.7:

    P0(H0 = 2n) = P0(W2n = 0; Wk ≠ 0 for k = 1, . . . , 2n − 1)
        = P0(W2n = 0; W2n−1 = +1; Wk ≠ 0 for k = 1, . . . , 2n − 1)
          + P0(W2n = 0; W2n−1 = −1; Wk ≠ 0 for k = 1, . . . , 2n − 1)
        = P0(W2n−1 = +1) P0(Wk ≠ 0 ∀k | W2n−1 = +1) P0(W2n = 0 | W2n−1 = +1, Wk ≠ 0 ∀k)
          + P0(W2n−1 = −1) P0(Wk ≠ 0 ∀k | W2n−1 = −1) P0(W2n = 0 | W2n−1 = −1, Wk ≠ 0 ∀k)
        = C(2n − 1, n) p^n q^{n−1} · (1/(2n − 1)) · q + C(2n − 1, n) p^{n−1} q^n · (1/(2n − 1)) · p
        = (2/(2n − 1)) C(2n − 1, n) p^n q^n.

This answer agrees with Theorem 6.1, since

    C(2n, n) = 2n(2n − 1)! / (n · (n − 1)! · n!) = 2 C(2n − 1, n).
7 Weak and Strong Laws
The goal of this chapter is to introduce the law of large numbers, which states that if
X1, X2, . . . , Xn are i.i.d. random variables with expectation µ = E(X1), then

    (1/n) ∑_{i=1}^n Xi → µ.

The statement is probably not surprising. Think, e.g., of tossing a coin repeatedly and
taking Xi = 1l{H}. Then you expect that (1/n) ∑_{i=1}^n Xi, i.e. the proportion of coin
tosses that land on heads, is close to P(H) = E(1l{H}). In the following, we will make
the meaning of ‘→’ precise. In particular, we will see in this and the next chapter that
in probability there are several different types of convergence.
Before we can state the first version of the law of large numbers, we need the following
inequality:
Lemma 7.1 (Chebyshev’s inequality). If X ∈ L2, that is E(X²) < ∞, and c > 0, then
for µ := E(X),

    P(|X − µ| ≥ c) ≤ Var(X)/c².

We recall that Var(X) = E((X − E(X))²) = E(X²) − E(X)².

Proof. Note that for all µ ∈ R and c > 0,

    c² 1l{|X−µ|≥c} = c² 1l{|X−µ|²≥c²} ≤ (X − µ)².

Now, letting µ := E(X) and taking expectations, we obtain

    E(c² 1l{|X−µ|≥c}) ≤ E((X − µ)²).

Therefore,

    c² P(|X − µ| ≥ c) ≤ Var(X).

Rearranging gives the statement of the lemma.
Remark 7.2. Note that E(X²) < ∞ implies that E|X| < ∞, so X ∈ L1. Hence E(X)
is finite, and Var(X) = E(X²) − E(X)² < ∞.
Theorem 7.3 (Weak law of large numbers, WLLN). Let X1, X2, . . . be IID random
variables with E(X1²) < ∞ and common expectation µ := E(X1). Then the weak law of
large numbers holds, in that for any ε > 0,

    P(|∑_{i=1}^n Xi/n − µ| > ε) → 0   as n → ∞.
Remark 7.4. The conditions we impose on X1, X2, . . . are not optimal. In fact, it
is possible to show that the WLLN holds if and only if one of the following conditions
holds:

(i) nP(|X1| > n) → 0 and E(X1; |X1| ≤ n) → µ as n → ∞;

(ii) the characteristic function of X1, ϕ(t) := E(e^{itX1}), is differentiable at t = 0 and
ϕ′(0) = iµ.

See also [GS01, Section 7.4].
Proof. Let Sn := ∑_{i=1}^n Xi. Then

    E(Sn) = E(∑_{i=1}^n Xi) = ∑_{i=1}^n E(Xi) = nµ

and, by independence,

    Var(Sn) = Var(∑_{i=1}^n Xi) = ∑_{i=1}^n Var(Xi) = n Var(X1).

Now σ² := Var(X1) < ∞ since X1 ∈ L2. Also,

    E(Sn/n) = (1/n)E(Sn) = µ

and

    Var(Sn/n) = (1/n²)Var(Sn) = σ²/n.

Therefore, for any ε > 0, by Chebyshev’s inequality,

    P(|Sn/n − µ| > ε) ≤ (1/ε²)Var(Sn/n) = σ²/(ε²n) → 0,

as n → ∞.
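The Chebyshev bound σ²/(ε²n) already suggests how the deviation probability decays; a quick simulation illustrates it (Uniform(0, 1) summands, so µ = 1/2, with arbitrary illustrative parameters):

```python
import random

def deviation_prob(n, eps, trials=2_000, seed=3):
    """Estimate P(|S_n/n - 1/2| > eps) for i.i.d. Uniform(0,1) summands."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        bad += abs(mean - 0.5) > eps
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_prob(n, 0.05))  # decreases towards 0 as n grows
```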
Theorem 7.5 (Strong law of large numbers, SLLN). Let X1, X2, . . . be i.i.d. random
variables. Then,

    P(∑_{i=1}^n Xi/n → µ) = 1,      (18)

for some constant µ, if and only if E|X1| < ∞. In this case µ = E(X1).

Note that

    {∑_{i=1}^n Xi/n → µ} = {ω ∈ Ω : ∑_{i=1}^n Xi(ω)/n → µ}
        = {ω ∈ Ω : ∀ε > 0, ∃N(ω) ∈ N s.t. |∑_{i=1}^n Xi(ω)/n − µ| ≤ ε ∀n ≥ N(ω)}.

In particular, N depends, in general, on both the choice of ω ∈ Ω and ε > 0.
Proof. We will only prove the following special case: we suppose additionally that
E(X1²) < ∞. For the general case see [GS01, Section 7.5].
We will first show that (1/n)Sn converges along the subsequence ni = i² and then ‘fill
the gaps’ to recover the full convergence.
By Chebyshev’s inequality, Lemma 7.1, we have (using similar calculations as in the
previous theorem)

    P(|(1/i²)S_{i²} − µ| > ε) ≤ Var(S_{i²})/(i⁴ε²) = Var(X1)/(i²ε²).

The right-hand side is summable in i, so that by the Borel–Cantelli Lemma I, only finitely
many of the events {|(1/i²)S_{i²} − µ| > ε} occur, almost surely. In particular, for any
ε > 0, there exists K = K(ω, ε) such that for all i ≥ K,

    |(1/i²)S_{i²} − µ| ≤ ε.

Thus, P(lim_{i→∞}(1/i²)S_{i²} = µ) = 1. To show the convergence of the full sequence,
assume for the moment that the Xi are non-negative. In this case we have

    S_{i²} ≤ Sn ≤ S_{(i+1)²},   for all i² ≤ n ≤ (i + 1)².

Dividing by n, we have

    (1/(i + 1)²) S_{i²} ≤ (1/n) Sn ≤ (1/i²) S_{(i+1)²},   for all i² ≤ n ≤ (i + 1)².

Now, note that i/(i + 1) → 1. Hence, we have shown that for non-negative random
variables

    P((1/n)Sn → µ) = 1.      (19)

Coming back to general (not necessarily non-negative) random variables, we decompose
Xi = Xi⁺ − Xi⁻ into positive and negative parts (recall the definitions from Section 2.2).
Xi⁺ and Xi⁻ are non-negative random variables, so we can apply (19) and deduce that,
on a set of probability 1,

    (1/n)Sn = (1/n)(∑_{i=1}^n Xi⁺ − ∑_{i=1}^n Xi⁻) → E(X1⁺) − E(X1⁻) = E(X1),

as n → ∞.
Remark 7.6 (Discussion). Let Sn = ∑_{i=1}^n Xi. The SLLN is stronger than the WLLN.
The WLLN says that, for a given large enough n, |(1/n)Sn − µ| is small with high
probability. However, it does not imply that |(1/n)Sn − µ| remains small for all
sufficiently large n with probability 1 – this is the SLLN. In fact,

    SLLN holds =⇒ WLLN holds,

but

    WLLN holds ≠⇒ SLLN holds

in general. For example, take X1 symmetric (so X1 and −X1 have the same distribution)
with

    P(X1 ≥ x) ∼ 1/(x log x)

for x large. Then E|X1| = +∞, but E(X1; |X1| ≤ n) = 0 for all n by symmetry, so µ = 0
and condition (i) of Remark 7.4 holds. Hence the WLLN holds, but the SLLN does not.
8 Modes of Convergence
There are four main ways to interpret Xn → X as n→∞.
Definition 8.1. Let X,X1, X2, . . . be random variables on a probability space (Ω,F ,P).
We say that:
(i) Xn → X almost surely, written Xn →a.s. X, if {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞}
is an event of probability 1, that is, P(Xn → X) = 1;

(ii) Xn → X in p-th mean, where p ≥ 1, written Xn → X in Lp, if

    E(|Xn − X|^p) → 0  as n → ∞;

(iii) Xn → X in probability, written Xn →prob. X, if

    ∀ε > 0,  P(|Xn − X| > ε) → 0  as n → ∞;

(iv) Xn → X in distribution, written Xn →D X or Xn →Law X, if

    P(Xn ≤ x) → P(X ≤ x)  as n → ∞

for all points x at which FX(x) := P(X ≤ x) is continuous.
Remark 8.2. (i) There is a fundamental difference between the first three types of
convergence and the last one: for the first three, all random variables and the limit need
to be defined on the same probability space, whereas this is not required for convergence
in distribution.

(ii) Note that with Sn := X1 + X2 + · · · + Xn and µ := E(X1), the SLLN says that
Sn/n → µ almost surely, and the WLLN says that Sn/n → µ in probability.
Theorem 8.3. The following implications hold:

    Xn → X a.s.   =⇒  Xn → X in prob.  =⇒  Xn →D X,
    Xn → X in Lp  =⇒  Xn → X in prob.,

for any p ≥ 1. Also, if p > q ≥ 1, then

    Xn → X in Lp =⇒ Xn → X in Lq.

Further, no other implications hold in general.
Proof. Suppose that Xn → X a.s. We will show that this implies that Xn → X in
probability. Note that, for any ε > 0, we have

    P(|Xn − X| ≥ ε for infinitely many n) = P(⋂_m ⋃_{n≥m} {|Xn − X| ≥ ε}) = 0.

Now,

    P(|Xm − X| ≥ ε) ≤ P(⋃_{n≥m} {|Xn − X| ≥ ε}).

Taking the limit m → ∞, we get

    lim_{m→∞} P(|Xm − X| ≥ ε) ≤ lim_{m→∞} P(⋃_{n≥m} {|Xn − X| ≥ ε})
        = P(⋂_m ⋃_{n≥m} {|Xn − X| ≥ ε}) = 0,

as required.
An example of a sequence that converges in probability, but not almost surely, will be
discussed on Problem Sheet 6. The fact that convergence in L1 implies convergence in
probability will be proved on Problem Sheet 7. All other implications and counterexamples
can be found in [GS01, Chapter 7].
In special cases, we can formulate the converse statements of the above as well.
Theorem 8.4.
(i) If Xn →D c, where c is a constant, then Xn →prob. c;

(ii) if Xn →prob. X and P(|Xn| ≤ K) = 1 for all n, for some constant K, then Xn → X
in Lp for all p ≥ 1;

(iii) if

    ∑_{n=1}^∞ P(|Xn − X| > ε) < ∞

for all ε > 0, then Xn → X almost surely.
The following proof is not examinable!
Proof. (i) Note that the distribution function of a random variable X constantly equal
to c is given by

    FX(x) = P(X ≤ x) = 0 if x < c,  and  1 if x ≥ c.

In particular, if Xn → c in distribution, then as n → ∞,

    P(Xn ≤ x) → 0 if x < c,  and  P(Xn ≤ x) → 1 if x ≥ c.

Therefore,

    P(|Xn − c| > ε) = P(Xn < c − ε) + P(Xn > c + ε)
                    ≤ P(Xn ≤ c − ε) + 1 − P(Xn ≤ c + ε) → 0 + 1 − 1 = 0,

as n → ∞, which shows that Xn → c in probability.

(ii) Suppose that Xn → X in probability. Then, since P(|Xn| ≤ K) = 1, it follows that
P(|X| ≤ K) = 1. Indeed, using that |X| ≤ |Xn − X| + |Xn| ≤ ε + K on the event
{|Xn − X| ≤ ε}, we see that the event {|X| ≥ K + 2ε; |Xn − X| ≤ ε} is empty, so

    P(|X| ≥ K + 2ε) ≤ P(|X| ≥ K + 2ε; |Xn − X| ≤ ε) + P(|Xn − X| ≥ ε) = P(|Xn − X| ≥ ε).

Hence, as n → ∞, we deduce that P(|X| ≥ K + 2ε) = 0. Then, we can let ε ↓ 0 to find
that P(|X| > K) = 0.
Now, on a set of probability 1,

    |Xn − X|^p ≤ ε^p 1l{|Xn−X|≤ε} + (2K)^p 1l{|Xn−X|>ε}.

Taking expectations, we obtain

    E(|Xn − X|^p) ≤ ε^p + (2K)^p P(|Xn − X| > ε).

Hence, letting n → ∞ gives

    lim sup_{n→∞} E|Xn − X|^p ≤ ε^p.

Since the left-hand side does not depend on ε, we conclude that lim_{n→∞} E|Xn − X|^p = 0.
(iii) Note that for any ε > 0,

    P(|Xn − X| > ε for infinitely many n) = P(⋂_{m∈N} ⋃_{n≥m} {|Xn − X| ≥ ε})
        = lim_{m→∞} P(⋃_{n≥m} {|Xn − X| ≥ ε})
        ≤ lim_{m→∞} ∑_{n≥m} P(|Xn − X| ≥ ε) = 0,

since by assumption ∑_{n∈N} P(|Xn − X| ≥ ε) < ∞. Now, finally,

    P(Xn ↛ X) ≤ P(⋃_{k∈N} {|Xn − X| ≥ 1/k infinitely often})
              ≤ ∑_{k∈N} P(|Xn − X| ≥ 1/k infinitely often) = 0,

by the above, showing that Xn → X a.s.
9 Conditional Expectation
9.1 Discrete Random Variables
Recall that if A is an event with P(A) > 0 and X is a non-negative random variable
taking values in the discrete set X(Ω) := {x1, x2, . . . }, then

    E(X|A) = E(X 1lA)/P(A).

Note that

    E(X|A) = E(X 1lA)/P(A) = ∑_{j=1}^∞ xj E(1l{X=xj} 1lA)/P(A) = ∑_{j=1}^∞ xj P(X = xj |A).

This is just the mean of the conditional distribution of X given that the event A occurs.
Definition 9.1 (Conditional expectation given a random variable). For a discrete random
variable Y taking values in Y(Ω) := {y1, y2, . . . }, where P(Y = yj) > 0 and
∑_{j=1}^∞ P(Y = yj) = 1, we define the conditional expectation of X given Y as

    E(X|Y) := ∑_{j=1}^∞ 1l{Y=yj} E(X|Y = yj).
Remark 9.2. (i) Note that E(X|Y = yj) is just a number, whereas E(X|Y) is a random
variable, since the indicators are random variables.

(ii) Furthermore, the random variable E(X|Y) is a function of the random variable Y.
Indeed, define f(y) := E(X|Y = y) where P(Y = y) > 0; then on the event {Y = yi} we
have that

    E(X|Y) = E(X|Y = yi) = f(yi) = f(Y).

Since ⋃_i {Y = yi} = Ω, we conclude that E(X|Y) = f(Y).
Example 9.3. Suppose at a nuclear power plant there are Y radiation alerts, Y ∼ Po(λ).
Each radiation alert results in a radiation leak with probability p independent of the
others and of Y .
(i) Given Y = n, what is the expected number of radiation leaks?
(ii) Conditional on the random variable Y , what is the expected number of radiation
leaks?
Solution. Let Z be the number of radiation leaks.
(i) Given Y = n, Z ∼ B(n, p). Hence E(Z|Y = n) = np, which is a deterministic
number.
(ii) Then, the conditional expectation of Z given the random variable Y is given by
E(Z|Y ) = pY , i.e. for ω ∈ Ω, E(Z|Y )(ω) := pY (ω). Note that E(Z|Y ) is a random
variable only dependent on Y .
Lemma 9.4 (The tower property). We have that E(E(X|Y)) = E(X).
Proof. Suppose Y takes values in Y(Ω) := {y1, y2, . . . }. Then

    E(E(X|Y)) = E(∑_{yj∈Y(Ω)} 1l{Y=yj} E(X|Y = yj))
              = ∑_{yj} E(1l{Y=yj} E(X|Y = yj))
              = ∑_{yj} E(X|Y = yj) E(1l{Y=yj})   [since E(X|Y = yj) is a constant]
              = ∑_{yj} E(X|Y = yj) P(Y = yj)
              = ∑_{yj} E(X 1l{Y=yj})   [by definition of the conditional expectation]
              = E(X ∑_{yj} 1l{Y=yj})
              = E(X).
Example 9.5. What is the distribution of the number of radiation leaks from Example 9.3?

Solution. Let X1, X2, . . . be i.i.d. Ber(p). Then

    Z = ∑_{j=1}^Y Xj.

Now, for q := 1 − p,

    E(s^Z |Y = n) = (ps + q)^n,

the PGF of a B(n, p). Then E(s^Z |Y) = (ps + q)^Y. This is a random variable depending
on Y. Now, using the tower property and the fact that E(t^Y) = e^{λ(t−1)} for the Poisson
random variable Y,

    E(s^Z) = E(E(s^Z |Y)) = E((ps + q)^Y) = exp{λ((ps + q) − 1)} = exp{λp(s − 1)},

but this is the PGF of a Po(λp) distribution, hence Z ∼ Po(λp).
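The 'thinning' conclusion Z ∼ Po(λp) can be checked by simulation; the sketch below samples Y with Knuth's product-of-uniforms method and thins it (λ = 4 and p = 1/4 are illustrative choices, so E(Z) should be 1):

```python
import math
import random

def sample_poisson(rng, lam):
    """Sample Po(lam) via Knuth's product-of-uniforms method."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def thinned_mean(lam, p, trials=20_000, seed=4):
    """Sample mean of Z = sum_{j<=Y} X_j with Y ~ Po(lam), X_j ~ Ber(p)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        y = sample_poisson(rng, lam)
        total += sum(rng.random() < p for _ in range(y))
    return total / trials

print(thinned_mean(4.0, 0.25))  # ≈ 1.0 = lambda * p, consistent with Z ~ Po(1)
```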
Lemma 9.6 (Other key properties). For discrete random variables X, Y, Z and a Borel
function h:

(i) ‘Taking out what is known’: E(h(Y)X|Y) = h(Y)E(X|Y).

(ii) If X and Y are independent, then E(X|Y) = E(X).

(iii) Linearity: for any constants λ, µ ∈ R, E(λX + µZ|Y) = λE(X|Y) + µE(Z|Y).
Proof. Recall that, for discrete random variables, we have

    E(X|Y) := ∑_{yj∈Y(Ω)} 1l{Y=yj} E(X|Y = yj).

(i) Using the definitions of the conditional expectations,

    E(h(Y)X|Y) = ∑_j 1l{Y=yj} E(h(Y)X|Y = yj)
               = ∑_j 1l{Y=yj} E(h(Y)X 1l{Y=yj})/P(Y = yj)
               = ∑_j 1l{Y=yj} E(h(yj)X 1l{Y=yj})/P(Y = yj)
               = ∑_j 1l{Y=yj} h(yj) E(X 1l{Y=yj})/P(Y = yj)
               = ∑_j 1l{Y=yj} h(Y) E(X|Y = yj)
               = h(Y) ∑_j 1l{Y=yj} E(X|Y = yj)
               = h(Y) E(X|Y).
(ii) Since X and Y are independent, we have that, for X(Ω) = {x1, x2, . . .},

    E(X|Y = yj) = ∑_i xi P(X = xi |Y = yj) = ∑_i xi P(X = xi) = E(X).

Hence

    E(X|Y) = ∑_j 1l{Y=yj} E(X|Y = yj) = ∑_j 1l{Y=yj} E(X) = E(X) ∑_j 1l{Y=yj} = E(X).

(iii) Follows directly from the linearity of E(· |Y = yj).
9.2 Continuous Random Variables
Suppose X and Y have a joint PDF fX,Y(x, y) on R², so that, for any A ∈ B(R²),

    P((X, Y) ∈ A) = ∫∫_{(x,y)∈A} fX,Y(x, y) dy dx.

Recall that we can recover the marginal distribution of one of the variables, e.g. X, by
integrating out the other variable, i.e. for any A ∈ B(R),

    P(X ∈ A) = ∫_{x∈A} ∫_{y∈R} fX,Y(x, y) dy dx = ∫_{x∈A} fX(x) dx,

if we define

    fX(x) := ∫_{y∈R} fX,Y(x, y) dy,

the marginal density of X. The marginal density of Y is defined analogously as

    fY(y) := ∫_{x∈R} fX,Y(x, y) dx.

Definition 9.7. We define the conditional PDF fX|Y(x|y) of X given that Y = y via

    fX|Y(x|y) := fX,Y(x, y)/fY(y)

for any y such that fY(y) > 0, and set it to be 0 otherwise. The conditional distribution
function of X given Y = y is given by

    FX|Y(x|y) = ∫_{−∞}^x fX|Y(u|y) du =: P(X ≤ x|Y = y).

Now let

    Ψ(y) := E(X|Y = y) := ∫_{x∈R} x fX|Y(x|y) dx;

then the conditional expectation E(X|Y) of X given Y is defined as Ψ(Y).
As in the discrete case, E(X|Y) is a random variable and a function of the random
variable Y only.
Note that the same properties as listed in Lemmas 9.4 and 9.6 hold in the continuous
case. For example, with Ψ as above:

    E(E(X|Y)) = E(Ψ(Y)) = ∫_R Ψ(y) fY(y) dy
        = ∫_R ∫_R x fX|Y(x|y) dx fY(y) dy
        = ∫_R ∫_R x (fX,Y(x, y)/fY(y)) dx 1l{fY(y)>0} fY(y) dy
        = ∫_R ∫_R x fX,Y(x, y) dx 1l{fY(y)>0} dy
        = ∫_R x ∫_R fX,Y(x, y) dy dx
        = ∫_R x fX(x) dx = E(X).

Here, we used that, for any y,

    fY(y) = 0  =⇒  ∫_R fX,Y(x, y) dx = 0  =⇒  ∫_R x fX,Y(x, y) dx = 0.
Example 9.8. Let X and Y have joint PDF given by

    fX,Y(x, y) = 1/y  for 0 ≤ x ≤ y ≤ 1,  and 0 otherwise.

The region where fX,Y(x, y) > 0 is the triangle with vertices (0, 0), (0, 1) and (1, 1).
So the marginal PDF of Y is

    fY(y) = ∫_{x∈R} fX,Y(x, y) dx = ∫_0^y (1/y) dx = 1

for y ∈ [0, 1]. That is, Y ∼ Unif[0, 1]. Then the conditional PDF fX|Y(x|y) of X given
Y = y ∈ (0, 1] is

    fX|Y(x|y) = fX,Y(x, y)/fY(y) = 1/y  for 0 ≤ x ≤ y,  and 0 otherwise.

That is, given Y = y ∈ (0, 1], we have X ∼ Unif[0, y]. In particular, E(X|Y = y) = y/2
for y ∈ (0, 1], and hence E(X|Y) = Y/2, a random variable whose randomness depends
only on the random variable Y.
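Example 9.8 has a direct simulation counterpart: sample Y uniformly, then X uniformly on [0, Y], and average X over samples with Y near a fixed value; the average should be close to y/2 (the band 0.75–0.85 is an arbitrary illustrative choice):

```python
import random

def cond_mean_x(y_low, y_high, trials=50_000, seed=5):
    """Average of X over samples with Y in [y_low, y_high]."""
    rng = random.Random(seed)
    xs = []
    for _ in range(trials):
        y = rng.random()          # Y ~ Unif[0, 1]
        x = rng.uniform(0, y)     # X | Y = y  ~  Unif[0, y]
        if y_low <= y <= y_high:
            xs.append(x)
    return sum(xs) / len(xs)

print(cond_mean_x(0.75, 0.85))  # ≈ 0.8 / 2 = 0.4, since E(X | Y = y) = y/2
```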
10 Branching Processes
10.1 The Model
Consider a model for the evolution of bacteria. The colony evolves in discrete time. Each
bacterium lives for 1 unit of time, at the end of which it is replaced by a random number
of descendants. For different bacteria, the number of replacements are independent and
have the same distribution as a random variable X, where
P(X = k) = pk, for k = 0, 1, 2, . . . .
The distribution of X is called the offspring distribution and pk is the probability that
one bacterium is being replaced by k bacteria.
More formally, let (X(n)i , i ∈ N, n ∈ N) be i.i.d. random variables with the same distribu-
tion as X. These random variables have the following interpretation: the ith bacterium
(if it exists) in generation n − 1 will be replaced by X(n)i bacteria in generation n. Let
Z_n represent the number of bacteria in generation n. Under the probability law P, with corresponding expectation E, we suppose that Z_0 = 1. Then,
Z_{n+1} = X_1^{(n+1)} + X_2^{(n+1)} + · · · + X_{Z_n}^{(n+1)} = ∑_{i=1}^{Z_n} X_i^{(n+1)}.
Moreover, we assume throughout that

µ := E(X) = ∑_{k=0}^∞ k p_k < ∞.
In the following, the two main questions that we will answer are:

1. What is the average size of the population?

2. What is the probability that the population dies out?
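The recursion Z_{n+1} = ∑_{i=1}^{Z_n} X_i^{(n+1)} is straightforward to simulate. In the sketch below the offspring law p_0 = 0.25, p_1 = 0.25, p_2 = 0.5 (so µ = 1.25) is an illustrative choice, not one taken from the notes; the Monte Carlo average of Z_5 should be close to µ^5, as established in the next subsection:

```python
import random

random.seed(1)

# Hypothetical offspring distribution: p0 = 0.25, p1 = 0.25, p2 = 0.5,
# so mu = E(X) = 0*0.25 + 1*0.25 + 2*0.5 = 1.25.
PK = [0.25, 0.25, 0.5]
MU = 1.25

def offspring():
    # One draw from the offspring distribution.
    return random.choices(range(len(PK)), weights=PK)[0]

def simulate_Z(n_gen):
    # One run of the branching process started from Z0 = 1.
    z = 1
    for _ in range(n_gen):
        z = sum(offspring() for _ in range(z))
    return z

runs = 20_000
n_gen = 5
avg = sum(simulate_Z(n_gen) for _ in range(runs)) / runs
print(avg, MU ** n_gen)  # Monte Carlo average close to mu^5 ≈ 3.05
```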
10.2 Average Size of the Population
Our first goal is to find the average size of the population. We start by finding the
conditional size of the next generation given we know the size of the current one. For
each fixed n, r ∈ Z+,
E(Z_{n+1} | Z_n = r) = E( ∑_{i=1}^{Z_n} X_i^{(n+1)} | Z_n = r )

                     = E( ∑_{i=1}^{r} X_i^{(n+1)} | Z_n = r )

                     = E( ∑_{i=1}^{r} X_i^{(n+1)} )        [by independence]

                     = ∑_{i=1}^{r} E( X_i^{(n+1)} )

                     = ∑_{i=1}^{r} E(X) = µr
since µ := E(X) and since X_1^{(n+1)}, X_2^{(n+1)}, . . . are i.i.d. with the same distribution as X. Thus, if we define

f(r) := E(Z_{n+1} | Z_n = r) = µr,

then we know that

E(Z_{n+1} | Z_n) = f(Z_n) = µZ_n.
Hence, using the tower property, Lemma 9.4, we can deduce that
E(Z_{n+1}) = E( E(Z_{n+1} | Z_n) ) = E(µZ_n) = µE(Z_n).

Hence,

E(Z_{n+1}) = µE(Z_n) = µ^2 E(Z_{n−1}) = · · · = µ^{n+1} E(Z_0) = µ^{n+1}
since E(Z0) = 1. In particular, we have shown the following:
Proposition 10.1. For any n ≥ 0,

E(Z_n) = µ^n.
Note that if µ > 1, then E(Z_n) → +∞ and if µ < 1, then E(Z_n) → 0. If µ = 1, then E(Z_n) = 1 for all n ∈ N.
10.3 Extinction probability of the population
We want to find
π := P(ultimate extinction) = P(Z_n = 0 for some n).

Assume that P(X = 0) =: p_0 > 0 throughout (otherwise trivially, π = 0). Since E(Z_n) = µ^n, it is plausible that π = 1 if µ < 1 and π < 1 if µ > 1. Define

π_n := P(extinction by time n) = P(Z_n = 0).
Lemma 10.2. We have that π_n ↑ π and

π_{n+1} = g(π_n),

where g(θ) = E(θ^X) is the PGF of the offspring distribution.
Proof. Note that {Z_{n+1} = 0} ⊇ {Z_n = 0}, since if the bacteria are extinct at time n, they are also extinct at time n + 1. Then,

{Z_n = 0} ↑ ⋃_{i=0}^∞ {Z_i = 0} = {Z_k = 0 for some k}

and continuity of P, or MON, tells us that

π_n = P(Z_n = 0) ↑ P(Z_k = 0 for some k) = π,
which proves the first part of the lemma. Now, by splitting over the number of bacteria alive in the first generation, we get

π_{n+1} = P(Z_{n+1} = 0) = ∑_{k=0}^∞ P(Z_1 = k; Z_{n+1} = 0) = ∑_{k=0}^∞ P(Z_1 = k) P(Z_{n+1} = 0 | Z_1 = k).

But P(Z_{n+1} = 0 | Z_1 = k) is the probability that k independent family trees, each starting from one bacterium, all die out within n further generations, so this is π_n^k. Hence

π_{n+1} = ∑_{k=0}^∞ P(X = k) π_n^k = g(π_n),
proving the second part of the lemma.
Taking the limit n → ∞ in the previous lemma suggests that π is a fixed point of g, i.e. g(π) = π. Indeed, this fact can be used to characterize the extinction probability:

Theorem 10.3. If π := P(extinction), then

(i) if µ := E(X) ≤ 1, we have π = 1;

(ii) if µ > 1, then π is the unique root of g(π) = π in (0, 1).
Proof. By Lemma 10.2 and since g(θ) is continuous,

lim_{n→∞} π_{n+1} = lim_{n→∞} g(π_n) = g( lim_{n→∞} π_n ).

Therefore, π = g(π).
Also note that g′(1) = E(X) =: µ and

g(1) = ∑_k P(X = k) = 1,    g(0) = p_0 > 0.

Furthermore,

g′(θ) = E(Xθ^{X−1}) > 0,  and  g′′(θ) = E( X(X − 1)θ^{X−2} ) ≥ 0.

In particular, g(θ) is convex; its slope g′(θ) is non-decreasing. There are two situations that can arise:
that can arise:
Case 1: g′(1) = µ ≤ 1
We first claim that g′(θ) < 1 for all θ ∈ [0, 1). If µ < 1, then this follows immediately: using that g′(θ) is non-decreasing, we have g′(θ) ≤ g′(1) = µ < 1 for all θ ∈ [0, 1).

If µ = 1, then

1 = g′(1) = p_1 + 2p_2 + 3p_3 + · · ·    (20)

implies that p_1 + p_0 < 1. Indeed, if p_1 + p_0 = 1, then p_2 = p_3 = . . . = 0. Therefore, (20) would imply that p_1 = 1. However, since p_0 > 0, we know that p_1 < 1, a contradiction.
Now, p_1 + p_0 < 1 implies that there exists ℓ ≥ 2 such that p_ℓ > 0, and therefore

g′′(θ) = ∑_{k≥2} k(k − 1) p_k θ^{k−2} > 0,

and thus g′ is strictly increasing, so that g′(θ) < g′(1) = 1 for any θ ∈ [0, 1).
Now, assume that there exists θ′ ∈ [0, 1) such that g(θ′) = θ′. Then, by the mean value theorem, there exists θ* ∈ (θ′, 1) such that

g′(θ*) = (g(1) − g(θ′)) / (1 − θ′) = (1 − θ′)/(1 − θ′) = 1.    (21)

This contradicts our initial claim, and so g(θ) > θ for all θ ∈ [0, 1). Hence, since π = g(π), we must have π = 1.
Case 2: g′(1) = µ > 1:

Note that in this case there has to be at least one θ ∈ [0, 1) such that g(θ) = θ. Otherwise, g(0) > 0 would imply g(θ) > θ for all θ ∈ [0, 1), which would force g′(1) ≤ 1. Also, there cannot be more than one such point. Indeed, µ > 1 means that there exists ℓ ≥ 2 such that p_ℓ > 0, so that g′′(θ) > 0 and g′ is strictly increasing as above. But as in (21), between any two fixed points of g (i.e. any θ such that g(θ) = θ) there is a point θ* with g′(θ*) = 1. Thus, g′ strictly increasing implies that there are at most two fixed points (one of which is 1).

Let π_∞ ∈ (0, 1) be this point with g(π_∞) = π_∞. We want to show that

π := P(extinction) = π_∞ ∈ (0, 1).

From the picture, it is clear that π_n ↑ π_∞, but we already know that

π_n ↑ π = P(extinction),

so we should have π = π_∞ ∈ (0, 1), the unique root in (0, 1) of π = g(π).
To argue more carefully,

0 < π_1 = P(Z_1 = 0) = p_0 = g(0) ≤ g(π_∞) = π_∞

as g is increasing on [0, 1] and π_∞ ∈ (0, 1), so π_∞ > 0. That is, π_1 ≤ π_∞. Again, since g is increasing,

π_n ≤ π_∞  =⇒  g(π_n) ≤ g(π_∞)  =⇒  π_{n+1} ≤ π_∞

since π_{n+1} = g(π_n) and π_∞ = g(π_∞). By induction, π_n ≤ π_∞ for all n ≥ 1, so

π = lim_{n→∞} π_n ≤ π_∞ < 1

by definition of π_∞. But π = g(π), and we now know that π < 1; thus it follows that π = π_∞.
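Lemma 10.2 also gives a practical recipe for computing π: iterate π_{n+1} = g(π_n) from π_0 = 0. The sketch below does this for the illustrative offspring law p_0 = 0.25, p_1 = 0.25, p_2 = 0.5 (µ = 1.25 > 1); for this law g(θ) = θ reduces to 2θ² − 3θ + 1 = 0 with roots 1/2 and 1, so the iteration should converge to π = 1/2:

```python
# Iterate pi_{n+1} = g(pi_n) starting from pi_0 = 0 to approximate the
# extinction probability. Illustrative offspring law (not from the notes):
PK = [0.25, 0.25, 0.5]  # p0, p1, p2

def g(theta):
    # PGF of the offspring distribution: g(theta) = sum_k p_k theta^k.
    return sum(p * theta ** k for k, p in enumerate(PK))

pi = 0.0
for _ in range(200):
    pi = g(pi)
print(pi)  # converges to the smallest fixed point of g, here 0.5
```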
11 Poisson Processes
A Poisson process on [0, ∞) can be used to model random occurrences of events, such as the arrival times at a queue, call requests at a telephone exchange, or the clicks of a Geiger counter.
In the most basic case, the time parameter set for arrivals is [0,∞). Let N(A) denote
the number of arrivals in a time interval A ⊂ [0,∞), e.g. we write N(t, t + s] for the
number of arrivals during time (t, t+ s].
11.1 Poisson Point Processes on [0,∞)
The first description of a Poisson process is as follows:

Definition 11.1. A Poisson process of rate λ, denoted PP(λ), is a collection of random variables N_I with I = (s, t], for 0 ≤ s ≤ t, where N_I counts the number of arrivals during the time interval I, and which satisfies:
(i) the number of arrivals N(A1), N(A2), . . . , N(An) during disjoint time intervals
A1, A2, . . . , An are independent random variables;
(ii) the number of arrivals during any interval of length t has a Po(λt) distribution,
that is, N(Ai) ∼ Po(λ|Ai|) where |Ai| is the length of the time interval Ai.
Note that if X ∼ Po(λ) and Y ∼ Po(µ) are independent random variables, we know
that X + Y ∼ Po(λ+µ). This is consistent with our definition as N(0, t] +N(t, t+ s] =
N(0, t+ s].
An equivalent description of a Poisson process is the following inter-arrival description:
Lemma 11.2. A Poisson process of rate λ, PP(λ), is fully described by the fact that the inter-arrival times ξ_1, ξ_2, . . . are i.i.d. exponential random variables with parameter λ. That is, arrivals occur at times T_1 := ξ_1, T_2 := ξ_1 + ξ_2, . . . , T_n := ∑_{i=1}^n ξ_i, and then the number of arrivals by time t is given by

N(0, t] := max{ n : ∑_{k=1}^n ξ_k ≤ t }.

Moreover, N(s, t] can then be recovered as N(s, t] = N(0, t] − N(0, s].
This construction is illustrated in Figure 5.
Proof. We will not prove this result. However, we will point out a few connections.
First of all, assume that (N(s, t]) is a Poisson process and consider the first arrival time

ξ_1 := inf{t ≥ 0 : N(0, t] = 1}.
Figure 5: A realisation of a Poisson point process.
Then, since N(0, t] ∼ Po(λt),

P(ξ_1 > t) = P(N(0, t] = 0) = e^{−λt}.

Hence, ξ_1 is Exp(λ)-distributed. The converse direction uses that the distribution of T_n is explicitly known. Furthermore, the memoryless property of the exponential distribution is essential for showing the independence of increments.
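The inter-arrival description makes PP(λ) easy to simulate using nothing but exponential samples. The sketch below (illustrative parameters λ = 2, t = 3) builds N(0, t] exactly as in Lemma 11.2 and checks that its empirical mean and variance agree with the Po(λt) distribution from Definition 11.1:

```python
import random

random.seed(2)
LAM, T = 2.0, 3.0

def count_arrivals(lam, t):
    # Build arrival times T_n = xi_1 + ... + xi_n from i.i.d. Exp(lam)
    # inter-arrival times and count how many fall in (0, t].
    total, n = random.expovariate(lam), 0
    while total <= t:
        n += 1
        total += random.expovariate(lam)
    return n

runs = 100_000
counts = [count_arrivals(LAM, T) for _ in range(runs)]
mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / runs
print(mean, var)  # both close to lam*t = 6, as for a Po(6) variable
```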
11.2 Decomposing a Poisson Process
In the following we will consider a Poisson Process that models arrivals of two different
types. In order to obtain an elegant description of this process, we will need the following
more basic lemma.
Lemma 11.3 (‘Boys and Girls’ principle). The following two models are equivalent:

(i) The total number of births is Po(γ) and each birth, independently of everything else, produces a boy with probability p, or a girl with probability q := 1 − p.

(ii) The number of boys is Po(γp), the number of girls is Po(γq), and the number of boys is independent of the number of girls.
Proof. [(ii) =⇒ (i)]: If B ∼ Po(γp) and G ∼ Po(γq) and B, G are independent, then we have that B + G ∼ Po(γp + γq) = Po(γ) by Example 5.6. Now,

P(B = b | B + G = n) = P(B = b; G = n − b) / P(B + G = n)

                     = P(B = b) P(G = n − b) / P(B + G = n)

                     = ( (γp)^b / b! ) e^{−γp} · ( (γq)^{n−b} / (n − b)! ) e^{−γq} · n! / (γ^n e^{−γ})

                     = (n choose b) p^b q^{n−b}.
Thus, conditional on B + G = n, B ∼ B(n, p); that is, it has the same distribution as if each birth is a boy with probability p and the trials are independent.
[(i) =⇒ (ii)]: Denote by B, resp. G, the number of boys, resp. girls, born, and set N = B + G. Using that N is Po(γ) and each birth is independently a boy with probability p,

P(B = b; G = g) = P(N = b + g; B = b)

               = ( γ^{b+g} / (b + g)! ) e^{−γ} (b + g choose b) p^b q^g

               = ( (γp)^b / b! ) e^{−pγ} · ( (γq)^g / g! ) e^{−qγ}

               = P(N_p = b) P(N_q = g),

where N_p and N_q are independent random variables with distributions Po(γp) and Po(γq) respectively.
Summing over all possible values of g, we find

P(B = b) = ∑_{g∈Z_+} P(B = b; G = g) = P(N_p = b),

showing that B ∼ Po(γp). Similarly, we see that G ∼ Po(γq). Moreover, the above shows that G and B are independent.
The previous lemma motivates the ‘Boys and Girls’ principle for Poisson processes.
Proposition 11.4. The following two models are equal in distribution:

(Model 1) Men arrive at a queue as a PP(λ) process, women arrive at the queue as a PP(µ) process, and these two Poisson processes are independent.

(Model 2) People arrive at the queue as a PP(λ + µ) process and each person, independently of everything else, is a man with probability λ/(λ + µ) and a woman with probability µ/(λ + µ).
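Lemma 11.3 can be checked by simulation. The sketch below thins a Po(γ) count with illustrative parameters γ = 5, p = 0.3; the helper `poisson` samples a Poisson count by counting Exp(1) arrivals up to time γ, since the stdlib `random` module has no Poisson sampler. It checks the Po(γp) and Po(γq) means and the near-zero covariance predicted by part (ii):

```python
import random

random.seed(3)
GAMMA, P = 5.0, 0.3

def poisson(mu):
    # Sample Po(mu) by counting Exp(1) inter-arrival times up to "time" mu.
    n, total = 0, random.expovariate(1.0)
    while total <= mu:
        n += 1
        total += random.expovariate(1.0)
    return n

runs = 100_000
boys, girls = [], []
for _ in range(runs):
    n = poisson(GAMMA)                              # total births: Po(gamma)
    b = sum(random.random() < P for _ in range(n))  # each a boy w.p. p
    boys.append(b)
    girls.append(n - b)

mb = sum(boys) / runs
mg = sum(girls) / runs
cov = sum(b * g for b, g in zip(boys, girls)) / runs - mb * mg
print(mb, mg, cov)  # close to gamma*p = 1.5, gamma*q = 3.5, and 0
```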
11.3 Conditional Uniformity
Theorem 11.5. Let N be a Poisson process of constant rate λ > 0 on [0, ∞). Conditional on N(0, t] = n, the times of the n arrivals over (0, t] are independent and uniformly distributed on (0, t].
Proof. Partition I := (0, t] into disjoint intervals J_1, . . . , J_m of lengths ℓ_1, . . . , ℓ_m, i.e. |J_i| = ℓ_i and ∑_{i=1}^m ℓ_i = t. Let p_i = ℓ_i/t. Suppose that Z_1, . . . , Z_n are i.i.d. Unif(0, t] random variables. Then, for k_1, . . . , k_m ∈ {0, . . . , n} with k_1 + · · · + k_m = n,
P( ⋂_{r=1}^m {k_r of the Z_1, . . . , Z_n fall in J_r} ) = ( n! / (k_1! k_2! · · · k_m!) ) p_1^{k_1} p_2^{k_2} · · · p_m^{k_m} = n! ∏_{r=1}^m p_r^{k_r} / k_r!    (22)
where n!/(k_1! k_2! · · · k_m!) is the number of ways n objects can be divided into groups of sizes k_1, k_2, . . . , k_m with k_1 + · · · + k_m = n, and p_1^{k_1} p_2^{k_2} · · · p_m^{k_m} is the probability that a given group of k_1 of the objects falls in J_1, a given group of k_2 falls in J_2, . . . , and a given group of k_m falls in J_m.
Let N(J_r) be the number of points of N in J_r; then by independence (since the J_r are disjoint) we have

P( ⋂_{r=1}^m {N(J_r) = k_r} ) = ∏_{r=1}^m P( N(J_r) = k_r )

                              = ∏_{r=1}^m ( (λℓ_r)^{k_r} / k_r! ) e^{−λℓ_r}

                              = ( ∏_{r=1}^m (λp_r t)^{k_r} / k_r! ) exp( −λ ∑_{r=1}^m ℓ_r )

                              = ( ∏_{r=1}^m p_r^{k_r} / k_r! ) e^{−λt} (λt)^{k_1+···+k_m}    since ∑_{r=1}^m ℓ_r = t

                              = n! ( ∏_{r=1}^m p_r^{k_r} / k_r! ) ( (λt)^n / n! ) e^{−λt}    since k_1 + · · · + k_m = n,
So for J_1, . . . , J_m a partition of (0, t] and ∑_{r=1}^m k_r = n, we have that
P( ⋂_{r=1}^m {N(J_r) = k_r} | N(0, t] = n ) = ( 1 / P(N(0, t] = n) ) P( ⋂_{r=1}^m {N(J_r) = k_r}; N(0, t] = n ) = n! ∏_{r=1}^m p_r^{k_r} / k_r!,

which agrees with (22), i.e. with the probability that k_r of the n uniforms fall into the rth interval, for r = 1, . . . , m.
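Theorem 11.5 can be probed numerically: simulate many paths of a PP(λ), keep those with N(0, t] = 3, and pool the arrival times; by the theorem their mean and variance should match those of Unif(0, t], namely t/2 and t²/12. The parameters below (λ = 1.5, t = 2, conditioning on 3 arrivals) are illustrative:

```python
import random

random.seed(4)
LAM, T, N_COND = 1.5, 2.0, 3

def arrivals(lam, t):
    # Arrival times in (0, t] built from Exp(lam) inter-arrival times.
    times, s = [], random.expovariate(lam)
    while s <= t:
        times.append(s)
        s += random.expovariate(lam)
    return times

pooled = []
for _ in range(200_000):
    ts = arrivals(LAM, T)
    if len(ts) == N_COND:      # condition on N(0, t] = 3
        pooled.extend(ts)

m = sum(pooled) / len(pooled)
v = sum((x - m) ** 2 for x in pooled) / len(pooled)
print(m, v)  # close to t/2 = 1 and t^2/12 ≈ 0.333, as for Unif(0, t]
```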
11.4 Poisson point processes on Rn
We can generalize the definition of a Poisson process on [0,∞) to arbitrary subsets of
Rn.
Definition 11.6. Let G ⊆ R^n and let λ(·) be a non-negative function on G, λ : G → R_+. A Poisson point process on G with intensity function λ(·), written PPP(λ(·)), is a random collection of points in G such that

(a) if G_1, G_2, . . . are disjoint subsets of G, then the numbers N(G_1), N(G_2), . . . of points occurring in G_1, G_2, . . . are independent random variables;

(b) if G_1 ⊆ G, then N(G_1) has the Poisson distribution with parameter ∫_{G_1} λ(x) dx, that is,

N(G_1) ∼ Po( ∫ · · · ∫_{G_1 ⊂ R^n} λ(x_1, . . . , x_n) dx_1 . . . dx_n ).
Remark 11.7. (a) If we take n = 1, G = [0, ∞) ⊂ R and further assume that λ(x) is constant, equal to λ > 0, then we recover the classical definition of a Poisson process with intensity λ.

(b) Note that if G_1 ⊆ G, then N(G_1) ∼ Po(∫_{G_1} λ(x) dx) and in particular, we have

E(N(G_1)) = ∫_{G_1} λ(x) dx.

Therefore, the function λ controls how dense the points are in any region of G.
Example 11.8. Consider a PPP on R with intensity function λ(x) = e^{−µ|x|}. Then,

∫_R λ(x) dx = ∫_R e^{−µ|x|} dx = 2 ∫_0^∞ e^{−µx} dx = 2/µ.

In particular, the total number of points in R satisfies N(R) ∼ Po(∫_R λ(x) dx) = Po(2/µ). Note that here the total number of points is finite (almost surely).
Proposition 11.9 (A conditional uniformity result.). Suppose we have a PPP(λ) on
G ⊆ Rn with constant intensity λ > 0. Let V ⊆ G, then conditional on N(V ) = n, the
distribution of these n points in V is the same as if they were each independently chosen
with the uniform distribution on V .
Note that this is really another version of the ‘boys and girls’ principle.
Example 11.10. We imagine that raindrops land at positions on R and in time on [0, ∞), as a Poisson point process N of constant rate λ on the time-space set [0, ∞) × R. Then the number of raindrops landing in [x, x + y] between times s and s + t is

N( (s, s + t] × [x, x + y] ) ∼ Po( ∫_s^{s+t} ∫_x^{x+y} λ dx dt ) = Po(λyt).
This is illustrated in the following diagram:
(a) What’s the distribution of the first time a raindrop falls in [0, x]? We have that

P(no raindrops land in [0, x] by time t) = P( N((0, t] × [0, x]) = 0 ) = e^{−λxt}.

So the time of the first raindrop falling in [0, x] is

T_{[0,x]} ∼ Exp(λx).
Note that the first drop to arrive anywhere in R does so immediately in time, since the total intensity over all positions is infinite. Also,

P(k raindrops fall in [0, x] by time t) = P( N((0, t] × [0, x]) = k ) = ( (λxt)^k / k! ) e^{−λxt}.
(b) Given that n raindrops have landed in [0, x] by time t, what’s the distribution of the first time, U, at which a raindrop arrives in [0, x]?

As arrivals happen at constant intensity λ in (0, t] × [0, x], the n raindrops are independent and uniformly distributed in this rectangle, i.e. (T_i, X_i) ∼ Unif( (0, t] × [0, x] ) for i = 1, . . . , n. Then

P(U > s | N((0, t] × [0, x]) = n) = P(no points occurred by time s | N((0, t] × [0, x]) = n)

                                  = P(T_i ∈ (s, t] for all i ∈ {1, . . . , n})

                                  = ( (t − s)/t )^n.

It follows that the conditional distribution function of U is given by

P(U ≤ s | N((0, t] × [0, x]) = n) = 1 − P(U > s | N((0, t] × [0, x]) = n) = 1 − ( (t − s)/t )^n.
Hence, by differentiating with respect to s, we obtain that U has density

(n/t) ( (t − s)/t )^{n−1},    s ∈ [0, t].
(c) What’s the distribution of the distance of the closest raindrop to the origin by time t?

P(no raindrops land in [−x, x] by time t) = P( N((0, t] × [−x, x]) = 0 ) = e^{−2λxt},

so if D_t is the distance to the nearest raindrop by time t, then P(D_t > x) = e^{−2λxt}, i.e. D_t ∼ Exp(2λt).
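Part (a) can be checked by simulating the planar process directly on a rectangle (0, T_MAX] × [0, x]: draw a Po(λ · x · T_MAX) number of drops, place them uniformly (conditional uniformity), and record the first drop time, whose survival function should be e^{−λxt}. All parameters below are illustrative, and the `poisson` helper is the same hypothetical Exp(1)-counting sampler used earlier:

```python
import random, math

random.seed(5)
LAM, X, T_MAX = 2.0, 1.5, 4.0

def poisson(mu):
    # Sample Po(mu) by counting Exp(1) inter-arrival times up to "time" mu.
    n, s = 0, random.expovariate(1.0)
    while s <= mu:
        n += 1
        s += random.expovariate(1.0)
    return n

def first_drop_time():
    # Realise the PPP on (0, T_MAX] x [0, X]: a Po(lam*X*T_MAX) number of
    # drops, each placed uniformly; return the earliest drop time.
    n = poisson(LAM * X * T_MAX)
    times = [random.uniform(0, T_MAX) for _ in range(n)]
    return min(times, default=T_MAX)

runs = 100_000
t0 = 0.5
frac = sum(first_drop_time() > t0 for _ in range(runs)) / runs
print(frac, math.exp(-LAM * X * t0))  # survival prob close to e^{-lam*x*t0}
```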
12 Generating functions and the central limit theorem
We have seen that, for non-negative random variables taking values in Z_+, probability generating functions are very useful. Recall that for a Z_+-valued random variable X, the probability generating function, PGF, is given by

g_X(θ) := E(θ^X),    0 ≤ θ ≤ 1.
There are further generating functions that work for more general random variables.
Definition 12.1. (i) The moment generating function, MGF, of a random variable X is given by

M_X(α) := E(e^{αX}),

for all α ∈ R such that the expectation is finite.

(ii) The characteristic function, CF, of a random variable X is given by

ϕ_X(α) := E(e^{iαX}),    for all α ∈ R,

where i ∈ C, i² = −1.
Remark 12.2. Unlike moment generating functions, characteristic functions are always well defined via

E(e^{iαX}) = E(cos(αX)) + iE(sin(αX)),

where the expectations are always finite since |cos| and |sin| are bounded by 1.
Example 12.3. Generating functions of the normal distribution. Recall that a normally distributed random variable Z ∼ N(µ, σ²), for parameters µ ∈ R, σ² > 0, has density

f(z) = ( 1/√(2πσ²) ) e^{−(z−µ)²/(2σ²)},    z ∈ R.
We would like to find the moment generating function of Z. But first we consider X ∼ N(0, 1); then

M_X(α) = E(e^{αX}) = ∫_{−∞}^∞ e^{αx} ( 1/√(2π) ) e^{−x²/2} dx

       = ( 1/√(2π) ) ∫_{−∞}^∞ e^{−(x² − 2αx + α² − α²)/2} dx

       = e^{α²/2} ( 1/√(2π) ) ∫_{−∞}^∞ e^{−(x−α)²/2} dx

       = e^{α²/2},

since we recognise the second factor as the integral of the density of an N(α, 1)-distributed random variable, which is therefore equal to 1.
To calculate the characteristic function of X, we would like to do the following calculation:

ϕ_X(α) = E(e^{iαX}) = M_X(iα) = e^{(iα)²/2} = e^{−α²/2}.

In general, this is not allowed. However, in this special case it gives the right answer (the keyword for a proof is: analytic continuation).
If Z ∼ N(µ, σ²) is a general normal random variable, then Z = µ + σX for a standard normal random variable X ∼ N(0, 1). In particular, we have

M_Z(α) = E(e^{α(µ+σX)}) = e^{αµ} E(e^{ασX}) = e^{αµ} M_X(ασ) = e^{αµ + α²σ²/2}.

Moreover, for the characteristic function, we get

ϕ_Z(α) = E(e^{iα(µ+σX)}) = e^{iαµ} E(e^{iασX}) = e^{iαµ} ϕ_X(ασ) = e^{iαµ − α²σ²/2}.
Similarly as for PGFs, we can show the following for MGFs:

Lemma 12.4 (Properties of MGFs). (i) If M_X(α) is defined for α in an open interval containing 0, then for k ∈ N,

E(X^k) = M_X^{(k)}(0),

where M_X^{(k)} denotes the kth derivative of M_X.

(ii) If X, Y are independent, then for all α such that the right-hand side is defined,

M_{X+Y}(α) = M_X(α) M_Y(α).
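Both parts of Lemma 12.4 can be illustrated exactly for small discrete laws, with no simulation: compute M_{X+Y} directly from the joint law and compare with the product M_X M_Y, and recover E(X) from a finite-difference approximation of the derivative of M_X at 0. The two distributions below are hypothetical examples:

```python
import math

# Two hypothetical independent discrete laws (value: probability).
X = {0: 0.2, 1: 0.5, 2: 0.3}
Y = {1: 0.6, 3: 0.4}

def mgf(law, a):
    # M(a) = E(e^{aX}) computed exactly from the probability mass function.
    return sum(p * math.exp(a * v) for v, p in law.items())

def mgf_sum(a):
    # MGF of X + Y computed directly from the joint law (independence).
    return sum(px * py * math.exp(a * (vx + vy))
               for vx, px in X.items() for vy, py in Y.items())

a = 0.7
print(mgf_sum(a), mgf(X, a) * mgf(Y, a))  # equal, as in Lemma 12.4(ii)

# Lemma 12.4(i): E(X) = M_X'(0), approximated by a central difference.
h = 1e-5
mean_x = (mgf(X, h) - mgf(X, -h)) / (2 * h)
print(mean_x)  # close to E(X) = 0*0.2 + 1*0.5 + 2*0.3 = 1.1
```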
Rather than continuing to explore the properties of M, we now consider characteristic functions, for which we do not have to worry about existence.
Proposition 12.5 (Properties of CFs). (a) A characteristic function ϕ_X uniquely determines the distribution function F_X(x) := P(X ≤ x).

(b) If X and Y are independent, then

ϕ_{X+Y}(α) = ϕ_X(α) ϕ_Y(α)    for all α ∈ R.

(c) If for some k ∈ N, we have that E(|X|^k) < ∞, then

ϕ_X(t) = ∑_{j=0}^k ( E(X^j)/j! ) (it)^j + o(t^k)

as t → 0, and

ϕ_X^{(k)}(0) := ( ∂^k/∂α^k ) ϕ_X(α) |_{α=0} = i^k E(X^k).
Recall that we write g(t) = o(t^k) as t → 0 if lim_{t→0, t≠0} g(t)/t^k = 0.
Proof. Part (c) is essentially Taylor’s theorem for a function of a complex variable.
Theorem 12.6 (Continuity theorem). Let F, F1, F2, . . . be distribution functions with
CFs ϕ,ϕ1, ϕ2, . . .
(i) If Fn(x)→ F (x) for all points x at which F is continuous, then ϕn(α)→ ϕ(α) ∀α ∈
R;
(ii) Conversely, if ϕ(α) = limn→∞ ϕn(α) exists for all α ∈ R and ϕ is continuous at
α = 0, then ϕ is the CF of some distribution function F and Fn(x)→ F (x) for all
x where F is continuous.
For a proof, see [GS01, Theorem 5.9.5].
Recall that if X_1, X_2, . . . are i.i.d. random variables with E(|X_i|) < ∞, then

(1/n) ∑_{i=1}^n X_i → E(X_1)    almost surely.

Rewriting, we have that ∑_{i=1}^n X_i ≈ nE(X_1). The next theorem, the central limit theorem, tells us that if E(X_i²) < ∞, then the next term in this expansion is of order √n. Moreover, its prefactor is random and normally distributed, so it does not depend on the distribution of the X_i we started with.
Theorem 12.7 (Central limit theorem). Let X_1, X_2, . . . be independent and identically distributed random variables with E(X_i) =: µ and Var(X_i) =: σ² < ∞. Then, with

S_n := X_1 + X_2 + · · · + X_n = ∑_{i=1}^n X_i,

we have that

(S_n − nµ)/√(nσ²) −→ Z in distribution,

where Z ∼ N(0, 1). That is, for each x ∈ R,

P( (S_n − nµ)/√(nσ²) ≤ x ) −→ Φ(x) := ∫_{−∞}^x ( 1/√(2π) ) e^{−y²/2} dy.
Note that the normalization of the sum is chosen so that the expectation of the left-hand side is

E( (1/√(nσ²)) ( ∑_{i=1}^n X_i − nE(X_1) ) ) = 0,

and further

Var( (1/√(nσ²)) ( ∑_{i=1}^n X_i − nE(X_1) ) ) = (1/(nσ²)) Var( ∑_{i=1}^n X_i ) = (1/(nσ²)) ∑_{i=1}^n Var(X_i) = 1,

where we used independence to interchange the sum and the variance.
Proof. W.l.o.g. we may assume that µ = 0 and σ² = 1 (otherwise consider Y_i = (X_i − µ)/σ). We want to use the continuity theorem, Theorem 12.6, and so it will suffice to show that the characteristic function

ϕ_n(α) := E( e^{iα S_n/√n} )

satisfies ϕ_n(α) → ϕ_Z(α) for all α, where

ϕ_Z(α) := E( e^{iαZ} ) = e^{−α²/2}

is the characteristic function of Z ∼ N(0, 1).
Note that ϕ_1(α) = E( e^{iαX_1} ). Then, by the assumption of independence, we have

ϕ_n(α) = E( e^{iαS_n/√n} ) = E( e^{iα(X_1+···+X_n)/√n} ) = ϕ_1( α/√n )^n.
Now, by Proposition 12.5 we have

ϕ_n(α) = { 1 + (iα/√n) E(X_1) + ( (iα)²/(2n) ) E(X_1²) + o(α²/n) }^n

       = { 1 − α²/(2n) + o(α²/n) }^n

       → e^{−α²/2} = ϕ_Z(α)    as n → ∞,

where we used that (1 + (x/n)(1 + o(1)))^n → e^x for all x.
The following material in this chapter is not examinable!
How accurate is the central limit theorem?
Theorem 12.8 (Berry-Esseen). If X_1, X_2, . . . are i.i.d. with E(X_i) = 0, Var(X_i) = σ² and E|X_1|³ < ∞, then for all x,

| P( S_n/√(nσ²) ≤ x ) − Φ(x) | ≤ 3E|X_1|³ / (σ³√n).
Note that the rate 1/√n cannot be improved in general, but the constant 3 can be slightly optimised. This result gives the maximum error in the normal approximation of the CLT; it holds for every finite n, so it is not a purely asymptotic statement.
Example 12.9. Toss a fair coin 100 times. What is the probability of 65 or more heads?

Solution. We have that n = 100, µ = 1/2, σ² = (1/2)(1/2) = 1/4. So X_i ∼ B(1, 1/2) are i.i.d. and S_n ∼ B(n, 1/2). By Chebyshev's inequality,

P(|S_100 − 50| ≥ 15) ≤ 15^{−2} Var(S_100) = (1/225) · 100 · (1/4) = 1/9.
Hence P(at least 65 heads) ≤ 1/18 by symmetry. On the other hand, using the CLT,

(S_100 − 50)/√25 is approximately N(0, 1)-distributed.

So

P(S_100 − 50 ≥ 15) = P( (S_100 − 50)/√25 ≥ 3 ) ≈ P(Z ≥ 3) = 0.0014,
where Z ∼ N(0, 1). We can see the accuracy of this by the Berry-Esseen theorem; this says that for all x ∈ R,

| P( ∑_{i=1}^{100} (X_i − 1/2) / √(100 · (1/2)²) ≤ x ) − Φ(x) | ≤ 3E|X_1 − 1/2|³ / ( (1/2)³ √100 ) = 3/10,

where E|X_1 − 1/2|³ = 1/8 follows from P(X_1 = 1) = 1/2 and P(X_1 = 0) = 1/2. So,

| P(S_100 ≥ 65) − 0.0014 | ≤ 0.3,

i.e. in the worst case the approximation only gives P(S_100 ≥ 65) ≤ 0.3014.
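For this example the binomial tail can also be computed exactly, so the quality of the approximation can be seen directly; the actual gap turns out to be far smaller than the worst-case Berry-Esseen bound of 0.3. A minimal sketch:

```python
import math

# Exact tail P(S_100 >= 65) for S_100 ~ B(100, 1/2), versus the CLT value.
exact = sum(math.comb(100, k) for k in range(65, 101)) / 2 ** 100
normal_approx = 0.5 * (1 - math.erf(3 / math.sqrt(2)))  # P(Z >= 3)
print(exact, normal_approx)  # both small; gap well inside the bound 3/10
```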
References
[GS01] G. Grimmett and D. Stirzaker. Probability and random processes. Oxford Uni-
versity Press, third edition, 2001.
A Important distributions

Discrete distributions

Name                 Probability mass function                        E(X)       Var(X)        PGF
Bernoulli Ber(p)     P(X = 1) = p = 1 − P(X = 0)                      p          p(1 − p)      pθ + (1 − p)
Binomial B(n, p)     (n choose k) p^k (1 − p)^{n−k}, k ∈ {0, …, n}    np         np(1 − p)     (pθ + (1 − p))^n
Poisson Po(λ)        e^{−λ} λ^k/k!, k = 0, 1, 2, . . .                λ          λ             exp{λ(θ − 1)}

Continuous distributions

Name                 Probability density function                     Expectation   Variance³
Exponential Exp(λ)   λe^{−λx}, x ≥ 0                                  1/λ           1/λ²
Uniform Unif[a, b]   (1/(b − a)) 1l_{[a,b]}(x), x ∈ R                 (a + b)/2     (b − a)²/12
Normal N(µ, σ²)      (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}, x ∈ R          µ             σ²

³ Not examinable