留学生代写-MAST30020
时间:2021-04-21
MAST30020
Probability for Inference
Lecture Slides∗
∗Written by K. Borovkov, last time modified on 1 January 2021 (or even later).
0-0
1. Probability Spaces
Random experiment (RE): You should be familiar with the concept, from
the 2nd year level probability course you have done. Anyway, that’s a
(somewhat vaguely defined) concept representing real-life phenomena that:
• have a “mass character” (i.e. could be repeated many-many times, at least
in theory),
• don’t display “deterministic regularity” (i.e. the outcome of any given trial
is uncertain, to the best of our prior knowledge), but
• possess what’s called “statistical regularity”: the relative frequencies
(nA/n) of events one can observe in the experiment stabilize around some
values ∈ [0, 1] as the # of (“independent”) repetitions of the experiment
grows.
It is that last dot-point that makes Probability Theory possible.
Examples (Ex’s): coin tossing; dice rolling; gender of newborn babies etc.
1
Need to specify the outcome of our RE: how?
In mathematics, we have sets = collections of objects.
• Sample space Ω is the set of all possible outcomes.
Ex. Coin tossing. H/T? What if the coin doesn’t fall on its side? What if
where it landed also matters for us? Several different Ω’s are possible (and
can be used under different circumstances).
• So outcomes are elements ω ∈ Ω (a.k.a. elementary events, sample
points, realizations).
• Events: these are subsets A ⊂ Ω for which probability is defined.
NB: in non-trivial cases, there are subsets A ⊂ Ω for which probability
CANNOT be defined. For example, one can partition Ω = [0, 1] into
countably many “identical” sets Ai (they can be obtained from each other
by translations): Ω =

i≥1Ai. Assuming uniform probability distribution
on [0, 1], what would be the probability of A1?
2
Sample spaces
• Ex. Finitely many possible outcomes ⇒ only need a finite Ω. Just list all
the outcomes: Ω = {H,T} ≡ {0, 1} for the experiment of tossing a coin
once; Ω = {1, 2, . . . , 6} for rolling a die once.
Now consider an RE where we toss a coin
roll a die
n = 3 times in a row.
Product spaces come handy. For two sets A and B, their (Cartesian)
product is defined by
A×B := {(a, b) : a ∈ A, b ∈ B}
(the set of all pairs (a, b) s.t. a ∈ A and b ∈ B), and we put
An := A× · · · ×A︸ ︷︷ ︸
n copies of A
= {(a1, a2, . . . , an) : aj ∈ A, j = 1, 2, . . . , n}.
RE with a sample space Ω0 is replicated n times⇒ the sample space of the
composite experiment is Ω = Ωn0 , with ω = (ω1, ω2, . . . , ωn), ωj = outcome
of the jth replication of the basic (sub-)RE with the sample space Ω0.
3
• Ex. Keep tossing a coin till H shows up. Will a finite Ω suffice?
There are countably many possible outcomes that can be represented by
points from the set N = {1, 2, . . . } (ω = j if we observed j − 1 tails
followed by heads).a
• Ex. You have a date with your girl/boyfriend. She/he shows up
ω ∈ R+ = [0,∞) minutes late. Here we need Ω = R+.
One often uses the whole real line R or a subinterval thereof. For a
composite RE consisting of n dates, one can use
Ω = Rn := {(x1, x2, . . . , xn) : xj ∈ R, j = 1, 2, . . . , n}.
aAny set A s.t. there exists a 1–1 mapping f : A→ N is said to be countable (denumerable).
Thus, Z = {. . . ,−1, 0, 1, . . . } is countable. So is N2, but not [0, 1].
4
• If there can be any number of dates (at least, in theory: imagine that if
your girl/boyfriend was less than 20′ late for a given date, you decide to
meet again), we’ll need
RN := {(x1, x2, . . . ) : xj ∈ R, j ∈ N},
the set of real sequences, i.e. all real-valued functions on N.
[Other functional spaces are used, too, e.g. C[0, 1] for the so-called
Brownian motion process.]
• This models a situation where the basic RE can be repeated infinitely
many times. Likewise, in the coin tossing RE in which you are tossing a
coin till you get H for the first time, one can use Ω = {T,H}N.
• BTW: Note that {T,H}N is uncountable (why?). In the last example, you
could actually use a countable space of outcomes — which one?
5
• Probabilities will be assigned to sets of outcomes, i.e. subsets of Ω.
• For a given RE, can one assign probabilities to all subsets of Ω?
• If Ω is countable, the answer is YES.
In the general case, the answer is NO.
• It is impossible in the basic RE of choosing a point at random from [0, 1].
• So we MUST restrain ourselves and consider only some of the subsets
of Ω, chosen in such a way that there will be no problems with assigning
probabilities to them — and these subsets will be called “events”.
• Now how to choose them? One needs to be able to manipulate events,
and, quite naturally, such (admissible!) manipulations should be
producing events. Let’s look at the ways to manipulate events.
6
Ex. Choose a student from a large class. Want the events that the student:
1) is NOT smoking;
2) is a female AND more than 55 y.o.;
3) was born in Australia OR New Zealand;
4) was born in Australia AND is NOT smoking.
These can be expressed in terms of simpler events using set operations:
1) Let Ω := population of all students in class, A := sub-population of
smokers. Take the complement Ac := {ω ∈ Ω : ω 6∈ A}.
2) If B := sub-population of female students, C := those who are > 55 y.o.,
then take the intersection B ∩ C = {ω ∈ Ω : ω ∈ B and ω ∈ C}.
3) If D := students born in Australia, E := students born in New Zealand,
then take the union D ∪ E = {ω ∈ Ω : ω ∈ D or ω ∈ E}.
4) This will be the set difference D \A := D ∩Ac.
7
Note that D∩E = ∅, i.e. the events are disjoint (no common outcomes). Note
also that, in the case of disjoint sets, one often uses D + E to denote D ∪ E.
In fact, it is often more convenient to work with functions rather than sets.
How to “replace” sets with functions? Use indicators (indicator functions):
1A(ω) :=
 1, ω ∈ A,0, ω 6∈ A.
To the (main) set operations on events, there correspond the following
operations on indicator functions (draw diagrams & check!!):
1Ac = 1− 1A, 1A∩B = 1A1B , 1A∪B = max{1A,1B}.
For the symmetric set difference A4B := A \B +B \A ≡ (A ∪B) \ (A ∩B),
1A4B = |1A − 1B |.
8
Ex. Express the following events:
Neither A nor B occurred [i.e. 1A + 1B = 0].
I Ac ∩Bc = (A ∪B)c, de Morgan law.
A and B occurred, but C didn’t.
I A ∩B ∩ Cc ≡ (A ∩B) \ C.
Only one of A1, A2, A3 occurred [i.e.
∑3
j=1 1Aj = 1].
I A1 \ (A2 ∪A3) +A2 \ (A1 ∪A3) +A3 \ (A1 ∪A2).
Exactly two out of A1, . . . , A5 occurred [i.e.
∑5
j=1 1Aj = 2].
I

i(Ai ∩Aj) \ ⋃
k 6=i,j
Ak
.
9
As we said earlier, want to manipulate events, need the resulting sets still be
events. Hence the requirement: the class of events on Ω must be closed under
the main set operations, i.e. the complement of an event should still be an
event, and the union of events must still be an event (the same for
intersections, but this is automatic from de Morgan laws, see below).
To make things work, need a bit more: namely, that we are allowed to take
countable unions (and intersections!). In mathematics, when countable
infinity is allowed/involved, one often uses σ to indicate that.
Def. A family F of subsets of Ω is said to be a σ-algebra on Ω if
(A.1) Ω ∈ F ,
(A.2) A ∈ F ⇒ Ac ∈ F ,
(A.3) A1, A2, . . . ∈ F ⇒
⋃∞
n=1An ∈ F .
In words: the family is closed under complementation and countable union and
intersection. Why the latter? De Morgan + (A.2) + (A.3) + (A.2):⋂∞
n=1An =
[
(
⋂∞
n=1An)
c]c
= [
⋃∞
n=1A
c
n]
c
= [
⋃∞
n=1Bn]
c
, Bn := A
c
n ∈ F .
10
Ex. (continued)
Infinitely many of the events A1, A2, . . . occurred [i.e.
∑∞
j=1 1Aj =∞].
Is this actually an event? [Denoted: An, i.o.]
Why? For instance: will a random walk S0, S1, S2, . . . on Z visit 0 infinitely
many times? Let An := {Sn = 0}.
I

n≥1

k≥n
Ak — and this is an event indeed, using (A.2) + (A.3).
Here: ∩ ←→ ∀, “for all”; ∪ ←→ ∃, “there exists”.
Finitely many of the events A1, A2, . . . occurred [i.e.
∑∞
j=1 1Aj <∞].
Why? For instance: in a random walk S0, S1, S2, . . . , let An := {|Sn/n| > ε}
(for a fixed ε > 0). Related to the “strong” LLN.
I

n≥1

k≥n
Ack — use de Morgan or apply the same logic as above.
11
One starts modelling an RE by specifying a suitable sample space Ω and then
choosing an appropriate σ-algebra F of subsets Ω. The elements of this
σ-algebra are called events.
NB: Always ∅ ∈ F : indeed, ∅ = Ωc, then use (A.1) + (A.2).
NB: So taking A3 = A4 = · · · = ∅ in (A.3) yields
A1, A2 ∈ F ⇒ A1 ∪A2 ∈ F .
Likewise for any finite union (intersection) of events: still an event. If only
that held instead of (A.3), then F would be called an algebra of sets.
Ex. The trivial σ-algebra: F = {∅,Ω}.
No fun: no uncertain events! All the events we are allowed to look at are: the
impossible event ∅ (it never occurs!) and the certain event Ω (occurs always!).
Ex. The power set P(Ω) := class of all subsets of Ω.
This is often the choice in simple situations with discrete sample spaces.
12
Prm. Suppose Fn, n = 1, 2, . . . , are σ-algebras on a common sample space Ω.
Is F1 ∪ F2 a σ-algebra as well? What about F1 ∩ F2? What about
⋂∞
n=1 Fn?
Of course, there are many different possible choices of F . May wish to
consider, say, a σ-algebra containing a given set A ⊂ Ω. The smallest such
σ-algebra is clearly the so-called σ-algebra generated by A:
σ(A) := {∅, A,Ac,Ω}.
Extending this, let G = {A1, . . . , An} be a finite partition of Ω, i.e. the sets
Ai ⊂ Ω are pairwise disjoint,
∑n
i=1Ai = Ω. Then the σ-algebra generated by G,
i.e. the smallest σ-algebra that contains all the sets Aj , is
σ(G) :=
{∑
i∈I
Ai : I ⊂ {1, 2, . . . , n}
}
.
(Clearly a σ-algebra. Why is it the smallest one containing G?)
Similarly for a countable partition of Ω.
13
In case of the σ-algebra generated by a partition G, it is easy to give
representation for all the elements of σ(G). One can also introduce the concept
of the σ-algebra generated by an arbitrary given family G of subsets of Ω —
but this is less elementary (what about all possible intersections etc?).
Thm [1.12] For any family G of subsets of Ω, there exists a unique
σ-algebra, denoted by σ(G) and called the σ-algebra generated by G, s.t.
1) G ⊂ σ(G), and
2) if H is a σ-algebra on Ω and G ⊂ H, then σ(G) ⊂ H.
That is, σ(G) is the smallest σ-algebra on Ω containing G.
I How to prove such an assertion?? It’s not too difficult. First note that
there are σ-algebras on Ω that contain G: just take P(Ω). So the class of all
σ-algebras on Ω that contain G is non-empty. Now consider the intersection of
all σ-algebras from the class. It will contain G (as each of the σ-algebras
contains it!) and it will be a σ-algebra (as an intersection of such). And it will
be the smallest one with these properties!! (Why?)
14
An important example of a generated σ-algebra is the class B(R) of Borel
subsets of R (a.k.a. the Borel σ-algebra on R):
B(R) := σ{(a, b] : a, b ∈ R, a < b}.
All “reasonable” subsets of R are Borel (e.g. finite and countable subsets, open
intervals, open and closed sets etc.), but B(R) 6= P(R)! [Although giving an
example of a set which is not Borel is a challenge!]
This extends to the multivariate case:
B(Rm) := σ
{
m∏
i=1
(ai, bi] : ai, bi ∈ R, ai < bi
}
.
Here
∏m
i=1(ai, bi] is the Cartesian product of intervals (a “brick”).
Equivalently, B(Rm) is generated by open balls. As in the univariate case, all
reasonable subsets of Rm are Borel.
When Ω = Rm, B(Rm) is the default choice of F . For Ω ⊂ Rm, one takes
F = {Ω ∩A : A ∈ B(Rm)}, the trace of B(Rm) on Ω.
15
Now it’s time to introduce
Probability, from Latin probabilis “provable,” from probare “to try, to test”
(cf. to prove, to probe), from probus “good”.
Probable cause as a legal term is attested from 1676.
Probably is attested from 1535.
Probability is attested from 1551. [Source: http://www.etymonline.com]
Let (Ω,F) be a sample space endowed with a σ-algebra of its subsets (the
couple is called a measurable space).
Def. A probability on (Ω,F) is a function P : F → R s.t.
(P.1) P(A) ≥ 0, A ∈ F ,
(P.2) P(Ω) = 1,
(P.3) for any pairwise disjoint A1, A2, · · · ∈ F ,
P
 ∞⋃
j=1
Aj
 = ∞∑
j=1
P(Aj), “countable additivity”.
16
Def. The triple (Ω,F ,P) is called a probability space.
NB1: P is referred to as a set function (as its argument assumes “values” that
are sets, its domain being F). NB: ω ∈ Ω and {ω} ⊂ Ω are distinct objects!
Note: P(ω) is NO GOOD, P({ω}) is OK.
NB2: On one and the same measurable space, we can have infinitely many
different probabilities. Ex: tossing a (biased) coin (once). In statistics, we do
consider different probabilities on the same measurable space all the time!
NB3: Properties (P.1) and (P.3) specify what’s called a measure. Adding
(P.2), we get a measure of “total mass one”; one often uses “probability
measure” in that case.
NB4: Why was this def’n adopted? It mimics the properties of relative
frequencies of events! Turns out that measure theory is the most natural
framework for formal treatment of probabilities. Very successful, starting with
being able to establish theoretically all the main “statistical laws” observed in
the real world, and first of all — the LLN and CLT.
17
Ex. The point mass (degenerate distribution) at (a fixed point) ω ∈ Ω :
εω(A) := 1A(ω).
NB the difference in interpretation: the LHS is a function of A (for a fixed
outcome ω), whereas the RHS is a function of ω (for a fixed event A).
It models a situation with a deterministic outcome: repeat your RE till you
turn blue, but each time you’ll be seeing one and the same outcome: ω.
Ex. Counting measure on N (or even R ⊃ N):
µ(B) :=

n≥1
εn(B), B ∈ F = P(N).
This is not a probability! The measure counts the number of points in B
(which can be infinite, of course).
18
Ex. Discrete uniform distribution: suppose Ω is finite, F = P(Ω). If all
outcomes ω ∈ Ω are “equally likely”, then they should have the same
probability. Using notation |B| for cardinality of B, just put
P(A) := |A|/|Ω|, A ∈ F
(this is the so-called “classical probability”).
NB: using a version of the counting measure, this can be re-written as
P(A) =
1
|Ω|

ω∈Ω
εω(A).
19
Elementary properties of probability (Thm 1.23):
a) P(∅) = 0.
I Taking A1 = A2 = · · · = ∅ in (P.3), we have P(∅) =
∑∞
n=1 P(∅) — bingo!
b) finite additivity : for any pairwise disjoint A1, . . . , An ∈ F ,
(PF.3) P
 n⋃
j=1
Aj
 = n∑
j=1
P(Aj).
I Take An+1 = An+2 = · · · = ∅ in (P.3) and use a) — bingo!
NB: in the special case A1 = A, A2 = A
c, we obtain
P(Ac) = 1−P(A).
20
Elementary properties of probability (Thm 1.23): continued.
c) If A ⊂ B (from now on, always assume that A,B, . . . ∈ F), then
P(B \A) = P(B)−P(A).
I Follows from b): take A1 = A, A2 = B \A, then B = A1 +A2.
NB: So probability is non-decreasing: P(A) ≤ P(B) for A ⊂ B.
d) For any events A and B,
P(B ∪A) = P(B) + P(A)−P(B ∩A)
(the simplest version of the inclusion-exclusion principle), and so always
P(B ∪A) ≤ P(B) + P(A), “subadditivity of probability”.
I As A ∪B = A+B \A1, where A1 := A ∩B ⊂ B, can use b) and then c):
P(A ∪B) = P(A) + P(B \A1) = P(A) + P(B)−P(A1),
bingo!
21
Subadditivity of prob’ty extends to Boole’s ineq’ty (Propn 1.24): for any
A1, A2, . . . ∈ F ,
P
 ∞⋃
j=1
Aj
 ≤ ∞∑
j=1
P(Aj).
I “Disjointification”: let B1 := A1, B2 := A2 \A1, B3 := A3 \ (A1 ∪A2) etc:
Bn := An \
n−1⋃
j=1
Aj .
Then 1) Bn ⊂ An, n ≥ 1, 2)

j≤nAj =

j≤nBj , n ≤ ∞ (prove by
induction), and 3) B1, B2, . . . are disjoint. Hence, using monotonicity of P,
P
 ∞⋃
j=1
Aj
 2)= P
 ∞∑
j=1
Bj
 (P.3)= ∞∑
j=1
P (Bj)
1)

∞∑
j=1
P (Aj) , bingo!
22
Natural Q: in our def’n of probability, why can’t we assume finite additivity
(sl. 20) instead of the countable one (our (P.3))?
Direct A to this: we would lose important continuity properties.
Notation: An ↑ A as n→∞ ⇔ A1 ⊂ A2 ⊂ A3 ⊂ · · · and
⋃∞
n=1An = A;
An ↓ A as n→∞ ⇔ A1 ⊃ A2 ⊃ A3 ⊃ · · · and
⋂∞
n=1An = A.
Thm [1.25] Suppose P : F → R satisfies conditions (P.1), (P.2) and (PF.3).
Then the following are equivalent :
a) P satisfies (P.3) (and hence is a probability).
b) An ↑ A ⇒ P(An) ↑ P(A).
c) An ↓ A ⇒ P(An) ↓ P(A).
d) An ↓ ∅ ⇒ P(An) ↓ 0. [Of course, n→∞ in b)–d).]
I We will show that a) ⇒ b) ⇒ c) ⇒ d) ⇒ a). Obvious: b) ⇔ c) (by
complementation) and c) ⇒ d). So only need to show a) ⇒ b) and d) ⇒ a).
23
a) ⇒ b): If An ↑ A, then
⋃n−1
j=1 Aj = An−1 and, putting A0 := ∅, the
disjointification procedure from sl. 22 yields: Bn = An \An−1,
P(A) = P
( ∞⋃
n=1
An
)
=
∞∑
n=1
P(Bn)
c), sl. 21
=
∞∑
n=1
[P(An)−P(An−1)]
= lim
m→∞
m∑
n=1
[P(An)−P(An−1)] = lim
m→∞P(Am),
using the “telescoping argument” ((a2 − a1) + (a1 − a0) = a2 − a0 etc).
24
d) ⇒ a): If B1, B2, . . . are disjoint, then
∞∑
j=n+1
Bj =: An ↓ ∅ as n→∞.
Indeed, if this is NOT so, then

n≥1An 6= ∅, i.e. there is a point ω that
belongs to ALL An and hence ω ∈ Bm for some m ≥ 1. Since the Bn’s are
DISJOINT, we must have ω 6∈ Am for that m, a CONTRADICTION!
Now
P
 ∞∑
j=1
Bj
 = P
 n∑
j=1
Bj +An
 (PF.3)= n∑
j=1
P (Bj)︸ ︷︷ ︸
→∑∞j=1 P(Bj)
+ P(An)︸ ︷︷ ︸
→0
as n→∞. We proved (P.3). Bingo!
25
Now we can prove a very simple, but quite important assertion:
(The 1st) Borel-Cantelli Lemma [Thm 1.27]
If
∑∞
n=1 P(An) <∞, then P(An, i.o.) = 0. [Re i.o., see sl. 11.]
I Using Thm 1.25, c) and Boole’s inequality (sl. 22),
P(An, i.o.) = P
⋂
n≥1

k≥n
Ak
 = lim
n→∞P
⋃
k≥n
Ak
 ≤ lim
n→∞

k≥n
P (Ak) = 0.
26
2. Probabilities on R
• Always on (R,B(R)).
• Although probabilities are GIVEN ON B(R), one usually SAYS they are
given on R (slang!).
• For probabilities on R, we will use P rather than P, and for a reason:
such probabilities will mostly be “induced” on R by random variables
given on a general “underlying” space (Ω,F ,P), so it is more convenient
to reserve P for probabilities on the general space (Ω,F).
• Anyway, to give a probability on R, one apparently needs to specify all the
values P (A), A ∈ B(R) — but B(R) is HUGE!
• In fact, one doesn’t: there is a much more economical way.
Def. The distribution function (DF, a.k.a. CDF) of a probability P on R is
the function FP : R→ R defined by
FP (t) := P ((−∞, t]), t ∈ R.
27
Prpn [1.32] FP ≡ FP ′ ⇔ P ≡ P ′
This is a consequence of the fact that σ
({(−∞, t] : t ∈ R}) = B(R) (hey,
(a, b] = (−∞, b] \ (−∞, a], cf. sl. 15!). In fact, a much stronger assertion holds:
Thm [1.36] below. So we’ll talk about that later. Meanwhile: to the
characteristic properties of DFs.
Thm [1.33] For any probability P on R, its DF F := FP satisfies:
a) F is non-decreasing: s < t ⇒ F (s) ≤ F (t). Hence, at any point
t ∈ R, it has one-sided limits:
F (t−) := lim
s↑t
F (s), F (t+) := lim
s↓t
F (s); F (t−) ≤ F (t) ≤ F (t+);
b) F is right-continuous: F (t) = F (t+);
c) lim
t→−∞F (t) = 0, limt→+∞F (t) = 1.
28
I
a) Obvious from the monotonicity of P (sl. 21): as (−∞, s] ⊂ (−∞, t] for
s < t, we have
F (s) ≡ P ((−∞, s]) ≤ P ((−∞, t]) ≡ F (t).
The existence of one-sided limits follows from monotonicity (recall: any
bounded increasing sequence has a finite limit).
b) Obvious from the continuity of P (sl. 23):
Let An := (−∞, tn], tn ↓ t as n→∞.
Then An ↓ A := (−∞, t] and so P (An) ↓ P (A).
c) Obvious from the continuity of P :
Since A′n := (−∞,−n] ↓ ∅ and A′′n := (−∞, n] ↑ R as n→∞,
lim
n→∞F (−n) ≡ limn→∞P (A

n) = P (∅) = 0, lim
n→∞F (n) ≡ limn→∞P (An) = P (R) = 1.
It remains to make use of the monotonicity of F . Bingo.
29
Ex. Point mass and beyond: For P = εs (s ∈ R is a fixed value),
FP (t) =
 0, t < s1, t ≥ s
 = 1(s ≤ t)
(here 1(C) = 1 if the condition C is met, 1(C) = 0 otherwise).
For P = (1− p)ε0 + pε1 (do you recognize the Bernoulli distribution B(p)? It’s
a mixture of two point masses),
FP (t) =

0, t < 0
1− p, 0 ≤ t < 1
1, t ≥ 1
 = (1− p)1(0 ≤ t) + p1(1 ≤ t).
In the general case, given a DF FP , what is P ({t})? Since (t− 1/n, t] ↓ {t} as
n→∞, we have P ({t}) = limn(FP (t)− FP (t− 1/n)) = FP (t)− FP (t−).
30
Thm [1.36] For any F : R→ R which satisfies a)–c) from Thm [1.33], there
exists a unique P on B(R) s.t. F ≡ FP .
I This is a rather non-trivial result. Its proof is beyond the scope of our
subject. Just a few words how it’s done:
Start with putting P ((a, b]) := F (b)− F (a) (≥ 0!) for arbitrary a < b.
Next, for A :=
⋃n
i=1(ai, bi], −∞ ≤ a1 < b1 < a2 < b2 < . . . < bn ≤ ∞, we put
P (A) :=
n∑
i=1
(F (bi)− F (ai)).
The collection A of all such A’s is an algebra (verify), B(R) = σ(A) (obvious).
From the construction it is obvious that P is finitely additive on A, and one
can prove, using b) from Thm [1.33], that it is also countably additive on A.
The last shot is to apply Carathe´odory’s extension theorem: a probability
given on an algebra can be uniquely extended to one on the generated
σ-algebra. Bingo.
31
Implications: Can completely specify a probability on R by its DF!
Ex. Consider the function
F (t) =

0, t < 0
t, 0 ≤ t < 1
1, t ≥ 1
It clearly satisfies a)–c). So by Thm [1.36], there exists a unique probability
on R with this DF. This probability is called the uniform distribution on [0, 1],
denoted by U [0, 1].
NB: the U [0, 1]-probability of a given set is invariant w.r.t. to translations
(provided the set remains within [0, 1]). Likewise, the uniform distribution on
a set B ∈ B(Rn) is a probability invariant w.r.t. translations.
BTW: What is the DF of U [a, b], a < b?
Now we will look at a few important large classes of distributions on R.
32
Discrete Probabilities on R: P (C) = 1 for some countable C ⊂ R.
Prpn [1.39] The following are equivalent:
a) P is discrete.
b) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with

i pi = 1, one has
P =

i
piεti .
c) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with

i pi = 1, one has
FP (t) =

i
pi1(ti ≤ t).
[Picture.]
[BTW: can one assume that the ti’s can be ordered, e.g. t1 < t2 < . . .?]
33
I a) ⇒ b): C is countable ⇔ C = {ti}i≥1 ⊂ R. Now for B ∈ B(R),
P (B) = P (B ∩ C) + P (B ∩ Cc)︸ ︷︷ ︸
≤P (Cc)=1−P (C)=0
= P
B ∩∑
i≥1
{ti}
 = P
∑
i≥1
(B ∩ {ti})
 = ∑
i≥1
P (B ∩ {ti})
=

i≥1
P ({ti}) 1(ti ∈ B) =

i≥1
pi1(ti ∈ B) sl. 18=

i≥1
piεti(B).
b) ⇒ c): By def’n, FP (t) ≡ P ((−∞, t]) =

i≥1
piεti((−∞, t]) =

i≥1
pi1(ti ≤ t).
c) ⇒ a): For C := {tj}j≥1 we have
P (C) =

j
P ({tj}) =

j
(FP (tj)− FP (tj−)) =

j
pj = 1
since 1(ti ≤ t) is continuous at tj for i 6= j, so that FP (tj)− FP (tj−) = pj .
34
Absolutely Continuous (AC) Prob’s on R: the ones with densities.
Def. A probability P on R is AC if there exists a f’n fP : R→ R+, called the
density (a.k.a. PDF) of P , s.t.
FP (t) =
∫ t
−∞
fP (s) ds, t ∈ R.
Clearly, this implies that P ((a, b]) =
∫ b
a
fP (s) ds and fP (t) = F

P (t) (a.e.!).
NB: In most cases, the integral here is our good old friend the Riemann
integral. However, in the general case, it must be understood as the so-called
Lebesgue integral, but we’ll talk about that later.
ANY integrable function f ≥ 0 on R with ∫ f(s) ds = 1 specifies a probability
on R. Indeed,
F (t) :=
∫ t
−∞
f(s) ds
defines a DF on R (i.e. it has properties a)–c) from sl. 28, and so Thm [1.36]
applies.
35
Mixed Distributions: neither discrete nor AC, but mixtures thereof, i.e.
P = pPd + (1− p)Pa,
where Pd is discrete, Pa is AC, p ∈ (0, 1) is a fixed number.
Ex. Waiting times at an ATM: when a customer arrives, either nobody is
using the ATM (w.p. p) or it is in use (plus there may be a queue!) — and
then the customer has to wait for a random time ∼ E(λ), the exponential
distribution with par’r λ (having the density λe−λt, t > 0). Then, using the
total probability formula (rings a bell?), the waiting time has the distribution
P = pε0 + (1− p)E(λ).
But wait: there is more!
36
Singular Distributions: with a continuous DF, but not AC!
Continuous DF means no point is assigned a positive probability
(otherwise there would be jumps in the DF, sl. 30). So the prob’ty is “spread”
over R, but — there is no density. Singular (deviating from the usual or
expected; odd) indeed.
Ex [1.51] Cantor’s ladder (explained in class).
And this is basically it: any distribution on R is a mixture of three “pure
type” distributions!
Thm [1.52] (Lebesgue’s decomposition) Any probability on R has a unique
representation of the form
P = αdPd + αaPa + αsPs,
where αi ≥ 0, i = d, a, s;

αi = 1; Pd is discrete, Pa is AC, Ps is singular.
37
3. Random Variables (RVs)
“Na¨ıve definition”: RV X is a “function of chance”, i.e. X = X(ω) : Ω→ R
(assuming we are given an underlying probability space modelling our RE).
Not good enough: one usually wants to know probabilities of the RV hitting
some given sets, e.g. X ∈ [a, b]. Therefore the respective set of favourable
outcomes
X−1([a, b]) := {ω ∈ Ω : X(ω) ∈ [a, b]}
called the inverse image (under X) of [a, b], MUST be an event. Hence:
Def [2.3] An RV is a function X = X(ω) : Ω→ R s.t., for any B ∈ B(R),
X−1(B) := {ω ∈ Ω : X(ω) ∈ B} ∈ F
(such functions are called measurable; shorthand: {X ∈ B}).
So, for RVs X, the probabilities P(X ∈ B) are defined for all B ∈ B(R)! Good.
38
In fact, to satisfy Def [2.3], it suffices to have
{X ∈ (−∞, t]} := X−1((−∞, t]) ≡ {ω ∈ Ω : X(ω) ∈ (−∞, t]} ∈ F , t ∈ R. (∗)
Which is kind of nicer: much fewer sets to play with. But why is that so?
Because X−1 preserves all set operation (and disjointness).
Prpn [2.2] Let {Bα : α ∈ I} be an arbitrary family of subsets of R.
a) Bα ⊂ Bβ ⇒ X−1(Bα) ⊂ X−1(Bβ).
b)

α∈I X
−1(Bα) = X−1
(⋃
α∈I Bα
)
and

α∈I X
−1(Bα) = X−1
(⋂
α∈I Bα
)
.
c) Bα ∩Bβ = ∅ ⇒ X−1(Bα) ∩X−1(Bβ) = ∅.
d) X−1(Bcα) =
[
X−1(Bα)
]c
.
I
It’s all next to obvious — a matter of simple logic. For instance, look at d):
what’s on its LHS? The set of all ω’s s.t. X(ω) ∈ Bcα. But this is the same as
the set of all ω’s that don’t have the property that X(ω) ∈ Bα, which is the
complement of the set of all ω’s that do, i.e.
[
X−1(Bα)
]c
. Bingo, right?
39
Now how does this help if we want to show that (∗) is equivalent to Def [2.3]?
Let C be the collection of all sets B ∈ B(R) s.t. X−1(B) ∈ F . By Prpn [2.2],
C will be a σ-algebra on R!! Indeed, R ∈ C since X−1(R) = Ω ∈ F , so (A.1)
holds. Next, B ∈ C ⇔ X−1(B) ∈ F , and as F itself is a σ-algebra, one has[
X−1(B)
]c ∈ F . By Prpn [2.2],d), this means that X−1(Bc) ∈ F , so that
Bc ∈ C, i.e. (A.2) holds for C. Similarly for (A.3). Good.
Now (∗) states that all (−∞, t] ∈ C, t ∈ R, and so the smallest σ-algebra
containing these sets will be part of C (as the latter is a σ-algebra itself):
σ ((−∞, t], t ∈ R) ⊂ C.
Hey, but what’s on the LHS? This is B(R) ⊃ C!
We conclude that B(R) = C, but this means that Def [2.3] is satisfied.
40
Ex Constants are RVs: for X ≡ c = const, one has {X ≤ t} =
{ ∅, t < c,
Ω, t ≥ c.
Ex Random indicators. For an event A ∈ F , {1A ≤ t} =

∅, t < 0,
Ac, 0 ≤ t < 1,
Ω, t ≥ 1.
Ex Simple RVs: X :=
∑n
i=1 ai1Ai , where ai ∈ R, Ai ∈ F , i ≤ n <∞.
One usually assumes here (for convenience) that {Ai}i≤n is a partition of Ω.
This is no big deal: if it’s not so, one can always re-write X in an alternative
form, using a partition, X =
∑n′
i=1 a

i1A′i , with the A

i’s of the form(⋂
i∈I Ai
) ∩ (⋂i∈Ic Aci), I ⊂ {1, 2, . . . , n}, which form a partition.
This is an RV: {X ≤ t} = ⋃i: ai≤tAi ∈ F .
41
Random Vectors: X = (X1, . . . , Xd) : Ω→ Rd s.t. all Xi, i ≤ d, are RVs.
Equivalently, s.t. X−1(B) ∈ F for all B ∈ B(Rd). This extends to “random
elements” in much more general (than Rd) spaces.
Similarly, Z : Ω→ C is a complex-valued RV if Z = X + iY, where X and Y
are RVs. [Here, of course, i =
√−1.]
42
Prpn [2.9] Given an RV X, σ(X) := {X−1(B) : B ∈ B(R)} is a σ-algebra
on Ω which is called the σ-algebra generated by the RV X.
NB: From the def’n of RV, σ(X) ⊂ F . Usually, it is smaller than F .
I Just verify (A.1)–(A.3) using Prpn [2.2].
(A.1) Ω = X−1(R) ∈ σ(X), good.
(A.2) For A ∈ σ(X) there exists a B ∈ B(R) s.t. A = X−1(B).
Now Ac =
[
X−1(B)
]c
= X−1(Bc) ∈ σ(X), good.
(A.3) Similarly. Bingo!
Ex [2.10] σ(1A) = {∅, A,Ac,Ω}. Follows from Ex on sl. 13.
Ex [2.11] For a simple RV X =
∑n
i=1 ai1Ai , where a1, . . . , an are distinct
and {Ai} is a partition of Ω,
σ(X) =
{∑
i∈I
Ai : I ⊂ {1, 2, . . . , n}
}
.
43
Combinations of RVs
The general fact: if X = (X1, . . . , Xd) is a random vector, g : Rd → R is a
continuous function, then g(X1, . . . , Xd) is an RV as well.
NB: This extends to more general functions g (to the so-called measurable f’s).
We’ll skip the proof: a bit beyond the scope. Just look at a couple of simple
special cases. Assume that X and Y are RVs, a, b ∈ R. Then:
a) aX is an RV: {aX ≤ t} =
{
{X ≤ t/a}, a > 0;
{X ≥ t/a} = {X ∈ [t/a,∞)}, a < 0.
b) aX + bY is an RV: it suffices to prove this for a = b = 1 (from a)). Look:
{X + Y < t} =

r∈Q
({X < r} ∩ {Y < t− r}),
Q being the (countable) set of all rationals (r = m/n, where m,n ∈ Z).
c) XY is an RV. (How to show that? E.g. use XY = ((X+Y )2−X2−Y 2)/2.)
44
Distributions and DFs of RVs and RVecs
Def For an RV X on (Ω,F ,P), the set function
PX(B) := P(X ∈ B), B ∈ B(R),
is called the distribution of X. Similarly for random vectors.
Prpn [2.23] PX is a probability on R (or Rd — in the case when X ∈ Rd).
I That PX(B) ≥ 0 and PX(R) = 1 is obvious from the definition. Use
Prpn [2.2] to show countable additivity. Bingo.
The DF of X is the DF of PX : FX(t) := PX((−∞, t]) ≡ P(X ≤ t).
We call X discrete (AC, singular) if FX is discrete (AC, singular, resp.).
If X is AC, fX is continuous at x, then P(X ∈ (x, x+ ∆)) = fX(x)∆ + o(∆) =
(fX(x) + o(1))∆ as ∆→ 0. [Recall the meaning of o(1), O(1) etc.]
The survival function (a.k.a. the tail) of X is the function SX(t) = 1− FX(t).
We say that X, Y are identically distributed (and write X
d
= Y ) iff PX ≡ PY .
45
In the case when X ∈ Rd, by the DF of X we understand the function
FX(t1, . . . , td) := P(X1 ≤ t1, . . . , Xd ≤ td), (t1, . . . , td) ∈ Rd.
As in the univariate case, the distr’n PX is uniquely specified by FX — this
follows from the fact that the orthants
{(x1, . . . , xd) : x1 ≤ t1, . . . , xd ≤ td}, (t1, . . . , td) ∈ Rd,
generate B(Rd) (why?) How to express P(Xi ∈ (ai, bi], i ≤ d) in terms of FX?
There are analogs of Thms [1.33], [1.36] in the multivariate case, although
they are a bit more sophisticated. BTW, what will replace the monotonicity
condition a)? What about limits an “infinities”?
The def’n of a discrete RVec is the same as for discrete RVs (this holds in
more general spaces as well). An AC distribution has a density fX , satisfying
FX(t1, . . . , td) :=
∫ t1
−∞
· · ·
∫ td
−∞
fX(s1, . . . , sd) ds1 . . . dsd.
46
Prpn [2.28] X = (X1, . . . , Xd) is discrete iff all Xi, i ≤ d, are discrete.
I ⇒ : If PX(C) = 1 for a countable set C ⊂ Rd, then, for each i,
P(Xi ∈ Ci) = 1, where Ci := {xi : x = (x1, . . . , xd) ∈ C} are countable, too.
⇐ : If, for each j ≤ d, there is a countable set Cj s.t. P(Xj ∈ Cj) = 1, then
P(X ∈ C) = 1 for C := ∏dj=1 Cj , which is countable. Bingo.
Prpn [2.29] If X = (X1, . . . , Xd) is AC, then so is Xj for any j ≤ d, and
fXj (x) =

· · ·

︸ ︷︷ ︸
d−1
fX(s1, . . . , sj−1, x, sj+1, . . . , sd) ds1 · · · dsj−1 dsj+1 · · · dsd.
I Use FXj (t) = lim
ti→∞, i 6=j
FX(t1, . . . , tj−1, t, tj+1, . . . , td) that leads to
FXj (t) =

· · ·

︸ ︷︷ ︸
j−1
t∫
−∞

· · ·

︸ ︷︷ ︸
d−j
fX(. . . , sj−1, x, sj+1, . . .) ds1 · · · dsj−1 dx dsj+1 · · · dsd
and then change the order of integration (Fubini). Bingo. [Is the converse true?]
47
Some Popular Distributions on R and Beyond
Discrete RVs . . .
AC RVs . . .
RVecs . . . (Singular distributions don’t need to be exotic in this case)
Transformations of RVs
Prpn [2.40] If X is an RV, g an increasing & continuous function on R, with
inverse h := g−1, then the RV Y := g(X) has the DF
FY (t) = FX(h(t)).
I Obvious: FY (t) = P(Y ≤ t) = P(g(X) ≤ t) = P(X ≤ g−1(t)), bingo.
What if g were decreasing?
Thm [2.41] If, in addition to the conditions in Prpn [2.40], X is AC and g
is continuously diff’ble on an open set U ⊂ R s.t. P(X ∈ U) = 1, then Y is
AC, with the density fY (y) = fX(h(y))|h′(y)|. [This allows for decreasing g!]
48
I Obvious: FY (t) = FX(h(t)) =
∫ h(t)
−∞ fX(s) ds, next either differentiate or
change variables (s = h(u)): bingo! [Picture.]
Ex Linear g(x) = ax+ b, a > 0 (+ what it does to normal distributions).
Thm [2.43] = extension of Thm [2.41] to RVecs: g : Rd → Rd with an
inverse h, smooth, Jh(y) := det
(
∂hi
∂yj
)
is the Jacobian of h [rings a bell?].
If X ∈ Rd is AC, so is Y = g(X), and fY (y) = fX(h(y))|Jh(y)|.
Works nicely for normal distributions!
Prpn [2.47] Let F be a DF on R, Q(x) := inf{t : F (t) ≥ x}, x ∈ (0, 1), its
quantile function, U ∼ U [0, 1]. Then X := Q(U) ∼ F .
I Note that Q is non-decreasing and Q(x) ≤ t ⇔ x ≤ F (t). So
P(X ≤ t) = P(Q(U) ≤ t) = P(U ≤ F (t)) = F (t), bingo!
NB: If F is continuous and X ∼ F , then also F (X) ∼ U [0, 1].
[DIY! What if F is not continuous?]
49
Independent RVs
Def [3.1] RVs X1, . . . , Xn are called independent if, ∀B1, . . . , Bn ∈ B(R),
P(X1 ∈ B1, . . . , Xn ∈ Bn) =
n∏
j=1
P(Xj ∈ Bj).
NB: Makes perfect sense when interpreting P’s as relative frequencies!
NB: i.i.d. means “independent identically distributed” (if refers to an infinite
sequence, it means that any finite subset of Xj ’s is i.i.d.).
How do we know if RVs are independent? Suppose we know their (joint) DF.
Thm [3.3] RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
FX1,...,Xn(t1, . . . , tn) =
n∏
j=1
FXj (tj).
I ⇒ Special case: Bj = (−∞, tj ].
⇐ Can easily verify for Bj = (aj , bj ], and then extend. Trust me, OK? Good.
50
Thm [3.4] Discrete RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
P(X1 = t1, . . . , Xn = tn) =
n∏
j=1
P(Xj = tj). (∗)
I ⇒ Again a special case: Bj = {tj}.
⇐ Will prove for n = 2; the same argument works in the general case.
Suppose X and Y are discrete RVs whose joint PMF factorises as in (∗). Then
P(X ∈ A, Y ∈ B) = P
 ⋃
ai∈A,bj∈B
{X = ai, Y = bj}

=

ai∈A

bj∈B
P(X = ai, Y = bj) =

ai∈A

bj∈B
P(X = ai) P(Y = bj)
=
(∑
ai∈A
P(X = ai)
)∑
bj∈B
P(Y = bj)
 , bingo!
51
A similar criterion holds for AC RVs. Its proof is basically the same: just
replace sums with integrals.
Thm [3.5] AC RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
fX1,...,Xn(t1, . . . , tn) =
n∏
j=1
fXj (tj).
Ex Uniform distribution on [0, 1]d.
Ex Standard multivariate normal distribution.
NB: If gj are “nice enough” functions, X1, . . . , Xn are independent RVs, then
so are the RVs Yj := gj(Xj), j = 1, . . . , n. [Kind of common sense, no?] Look:
P(Yj ∈ Bj , j ≤ n) = P(Xj ∈ g−1j (Bj) =: B′j , j ≤ n) =
n∏
j=1
P(Xj ∈ B′j) etc.
BTW: Do you remember how to compute the PMF of the sum of two
independent integer-valued RVs? The density in case of AC RVs?
52
When we say: “Let X1, . . . , Xn be independent RVs with DFs F1, . . . , Fn”,
how do we know that such a thing exists at all??
I Take Ω := Rn, F := B(Rn), define P as the probability on (Rn,B(Rn))
whose DF is given by

j≤n Fj(xj), and take Xj(ω) := ωj (coordinate
projections: ω = (ω1, . . . , ωn)) — done!
Alternatively, take Ω := [0, 1]n, F := B([0, 1]n), define P as the uniform
probability on [0, 1]n, and take Xj(ω) := F
−1
j (ωj) (quantile functions of the
coordinates) — done!
Constructing infinite sequences of independent RVs is a bit more interesting.
[BTW, what about dependent ones?]
53
Independent Events
Def [3.19] Events A1, . . . , An are called independent if their indicators are
independent RVs.
Equivalently, for any I ⊂ {1, . . . , n},
P
(⋂
i∈I
Ai
)
=

i∈I
P(Ai) [= standard def’n].
⇒) The indicators of Ai, i ∈ I, are independent. Use Def. [3.19], [3.1] for that
subset of events, with Bi := {1} (then {Xi ≡ 1Ai ∈ Bi} = Ai)
⇐) Want: P(⋂nj=1{1Aj ∈ Bj}) = ∏nj=1 P(1Aj ∈ Bj). Note: {1Aj ∈ Bj} = Ω
if 0, 1 ∈ Bj ; = ∅ if 0, 1 6∈ Bj ; = Aj if 1 ∈, 0 6∈ Bj ; = Acj if 0 ∈, 1 /∈ Bj . Only
care about the last two alternatives: for I ⊂ {1, . . . , n}, show (by induction):
P
(⋂
i∈I
Ai ∩

j∈Ic
Aj
)
=

i∈I
P(Ai)×

j∈Ic
P(Acj).
54
NB: Again, makes perfect sense when we interpret probabilities as the
(limiting values of) relative frequencies.
NB: Differs from pair-wise independence [Bernstein’s example].
Cor [3.21] Events A1, . . . , An are independent iff A
c
1, . . . , A
c
n are.
I Prac class exercise.
55
4. Expectations
From 2nd year probability/stats subjects etc.: if X is a discrete RV (countable
set C = {ti} of possible values), then
EX =

i
tiP(X = ti).
[
=

xfX(x) dx for AC X’s
]
.
Why is this called the “expected” (or “mean”) value of X?
Recall the relative frequency interpretation of probability: in n independent
replications of our RE, Xj being the value observed in the jth replication, set
ni := #{j ≤ n : Xj = ti}. Then P(X = ti) ≈ ni
n
for large n, so that
Xn :=
1
n
n∑
j=1
Xj =
1
n
n∑
j=1

i
ti1(Xj = ti)
=
1
n

i
ti
n∑
j=1
1(Xj = ti)︸ ︷︷ ︸
=ni
=

i
ti
ni
n


i
tiP(X = ti).
56
NB: In both cases, these are just computational rules rather than definitions
of expectation. Need a common one, applicable not only to discrete and AC
RVs, but also to mixtures thereof etc, and s.t. Xn → EX, n→∞ (LLN!).
Let’s start with indicators: for X = 1A, A ∈ F , set
EX := P(A).
This makes sense, as Xn = nA/n. How to proceed?
Expected properties of expectations of general RVs:
• constants are expectations of themselves: EX = c for X ≡ c;
• linearity: E(aX + bY ) = aEX + bEY, where a, b ∈ R are constants;
• monotonicity: X ≤ Y ⇒ EX ≤ EY .
Linearity leads to the following
Def [4.1] For a simple RV X =
∑n
i=1 ai1Ai , the expectation is defined as
EX :=
n∑
i=1
aiP(Ai).
57
NB: This is a consistent def’n: the same result for any representation of X!!
If one can also write X =
∑n′
i=1 a

i1A′i , then
∑n
i=1 aiP(Ai) =
∑n′
i=1 a

iP(A

i).
One can see that by looking at the atoms of σ(A1, . . . , An;A

1, . . . , A

n′).
NB: In particular, X ≡ c = c1Ω has expectation EX = cP(Ω) = c, OK!
Prpn [4.2] Expectation is a linear operation on simple RVs: for simple
X =
∑n
i=1 ai1Ai and Y =
∑m
j=1 bj1Bj , and constants a, b ∈ R,
E(aX + bY ) = aEX + bEY.
I Indeed, assuming that {Ai} and {Bj} are partitions of Ω, we have
E(aX + bY ) = E
n∑
i=1
m∑
j=1
(aai + bbj)1Ai∩Bj =
n∑
i=1
m∑
j=1
(aai + bbj)P(Ai ∩Bj)
= a
n∑
i=1
ai
m∑
j=1
P(Ai ∩Bj)︸ ︷︷ ︸
=P(Ai)
+b
m∑
j=1
bj
n∑
i=1
P(Ai ∩Bj)︸ ︷︷ ︸
=P(Bj)
= aEX + bEY, OK!
58
NB: For a simple RV X ≥ 0, clearly EX ≥ 0. Now monotonicity follows
from linearity: if X ≤ Y then Y −X ≥ 0 and so
0 ≤ E(Y −X) = EY −EX, good!
Now what?
Idea: perhaps can approximate an arbitrary RV by a sequence of simple RVs?
And perhaps the expectations of these simple RVs will tend somewhere?
First look at non-negative RVs. Any such RV can be approximated by an
increasing sequence of simple RVs {Xn}n≥1 in the following sense:
∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.
59
In fact, we can construct such a sequence explicitly: let N := n2n, and
consider disjoint (for a fixed n) events
An,k := {k2−n ≤ X(ω) < (k + 1)2−n}, k = 0, 1, 2, . . . , N − 1,
An,N := {X(ω) ≥ n}.
Next put
Xn(ω) :=
 k2−n, ω ∈ An,k,n, ω ∈ An,N . [Picture.]
Look: An,k = An+1,2k +An+1,2k+1, k < N, and so Xn+1 ≥ Xn.
Moreover, if we fix ω, then, for n > X(ω), we will have
0 ≤ X(ω)−Xn(ω) ≤ 2−n,
so that Xn(ω) ↑ X(ω) as n→∞. It works!! [Even for ω’s with X(ω) =∞. . . ]
60
Def [4.4] For an arbitrary RV X ≥ 0, we put
EX := lim
n→∞EXn,
where Xn ≥ 0 are simple RVs s.t. ∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.
[We already know that such sequences of RVs do exist!!]
NB: From monotonicity of E, 0 ≤ EX1 ≤ EX2 ≤ EX3 ≤ . . . , so the limit of
this numerical sequence always exists (but can be infinite, of course).
A sober question: How can one use such a def’n? What if for different
sequences {Xn} of simple RVs one can have different values of limn→∞EXn?
Prpn [4.5] One cannot.
Hence Def [4.4] is consistent: regardless of the choice of {Xn}, the value of
the limit will be one and the same and so can be used to define EX.
61
I A nice exercise: will do it for illustration purposes (typical argument).
• Let both {Xn} and {X˜n} be as in Def [4.4].
We will show that one must have limn→∞EXn = limn→∞EX˜n.
• Suppose Y ≤ X is a simple RV, fix an ε > 0 and set An := {Xn > Y − ε}.
Look: since Xn+1 ≥ Xn, one has An ⊂ An+1, and
since Xn ↑ X ≥ Y > Y − ε, one has An ↑ Ω.
Therefore P(An)→ 1, equivalently P(Acn)→ 0 as n→∞.
• Now Xn ≥ (Y − ε)1An
(on An this holds due to the def’n of An, whereas on A
c
n the RHS is 0), so
EXn ≥ E(Y − ε)1An = E Y 1An︸ ︷︷ ︸
=Y (1−1Acn )
− εP(An)︸ ︷︷ ︸
≤1
≥ EY (1− 1Acn)− ε
= EY −EY 1Acn − ε ≥ EY −maxω∈Ω Y (ω)︸ ︷︷ ︸
<∞
P(Acn)︸ ︷︷ ︸
→0
−ε→ EY − ε.
62
• Thus limn→∞EXn ≥ EY − ε, and as ε > 0 is arbitrary (small), we have
lim
n→∞EXn ≥ EY.
• So if we take Y = X˜k for any fixed k (can do! As X˜k ≤ X), this will give
lim
n→∞EXn ≥ EX˜k.
Therefore
lim
n→∞EXn ≥ limk→∞EX˜k.
By symmetry, will also have
lim
n→∞EX˜n ≥ limk→∞EXk.
Hence the two limits must coincide! Bingo.
63
So we have got a def’n of EX for arbitrary X ≥ 0. What if X ≷ 0?
Let’s use non-negative RVs X+ = max{X, 0}, X− = −min{X, 0}, noting that
X = X+ −X−, |X| = X+ +X−.
Def [4.12] An RV X is called integrable if E|X| <∞ (and this is often
written as “X ∈ L1”). If X is integrable, its expectation is defined by
EX := EX+ −EX−. (∗)
Expectation of X over an event A is E(X;A) := EX1A.
NB: If E|X| <∞ then both EX± <∞, so (∗) makes sense.
If one of EX+ and EX− is infinite, then one can still use (∗) to define EX
(which will be ±∞, depending on which of EX± is infinite).
If both EX± =∞ then EX is undefined (what is ∞−∞?).
64
Good news: thus defined expectation inherits all the good properties of
expectation of simple RVs.
• Monotonicity: if X ≤ Y and EY <∞, then EX ≤ EY .
I First prove for non-negative RVs, using simple RVs (sim. to Prpn [4.5]).
This implies, in particular, that if |X| ≤ Y and Y ∈ L1 then also X ∈ L1.
• Linearity: if X,Y ∈ L1 and a, b ∈ R, then E(aX + bY ) = aEX + bEY.
I First note that V := aX + bY ∈ L1 since |V | ≤ |aX|+ |bY | = |a||X|+ |b||Y |
and E|X| <∞, E|Y | <∞.
Secondly, establish linearity per se: again using simple functions to
approximate RVs, passing to the limit. We’ll skip it. Not examinable. Good.
Cor [4.14] For X ∈ L1, one has |EX| ≤ E|X|.
I By definition, EX = EX+ −EX−, so
|EX| ≤ |EX+|+ |EX−| = EX+ + EX− linearity= E(X+ +X−) = E|X|.
65
A natural question: Given that Xn(ω)→ X(ω) as n→∞ (say, for all
ω ∈ A, P(A) = 1), will this imply that EXn → EX as well?
A natural answer: It depends.
Ex. On Ω = (0, 1) with P = U [0, 1], let
Xn(ω) := n1(ω < 1/n).
Obviously, Xn(ω) = X(ω) ≡ 0 for n > 1/ω, so Xn(ω)→ X(ω) for all ω ∈ Ω.
However, EXn = nP((0, 1/n)) = n× 1/n = 1 6→ 0 = EX. No good.
Convergence of expectations always holds for:
• monotone sequences of RVs, e.g. Xn(ω) ↑ X(ω);
• “dominated” sequences of RVs, i.e. when |Xn| ≤ Y, where Y ∈ L1
(the dominated convergence theorem, cf. Ex. above).
These results will be discussed later, after we have introduced a.s. convergence.
66
For RVecs and complex-valued RVs, expectations are defined component-wise.
Thus, for Z = X + iY ∈ C, with X,Y ∈ R, we set
EZ := EX + iEY.
And finally: another notation for E using integrals:
EX ≡


X(ω)P(dω) ≡


X(ω) dP(ω) ≡


X dP.
The integral construction that we described is that for the Lebesgue
integral. In what is it different from the (conventional) Riemann integral?
Recall : for Riemann integral, we partition the domain of integration, usu. part
of R or Rd (how can we do that in more general cases?). However, for
Lebesgue integral, we partition the range of the integrand that can be defined
on an abstract set Ω! Hence it’s a much more general def’n. [Ex. Banknotes.]
When integrating “nice” functions on Rd w.r.t. volume measure (dx in the
case of R), both integrals give the same answer.
67
Integrals w.r.t. distributions and DFs
As we said earlier, when dealing with RVs one can often just work with the
distributions thereof, “forgetting” the original underlying (Ω,F ,P) and X(ω)
and switching to (Rd,B(Rd), PX). Accordingly, one abandons the general
integrals


X(ω) dP(ω) (that are important, of course, for theoretical
calculations) in favour of (more practical)∫
x dP (x) and

g(x) dP (x), P ≡ PX .
The latter integral is often denoted by

g(x) dF (x), F being the DF of P .
Prpn. Assume that X ∼ P is an RV on (Ω,F ,P) whose expectation is defined
(i.e., at least one of EX± is finite). Then EX ≡ ∫

X(ω) dP(ω) =

R x dP (x).
That is, expectations are numerical characteristics of the distributions of RVs.
The above claim is kind of a “change of variables” formula. Requires a proof.
68
I We only need to prove the Prpn for X ≥ 0. Recall our approximating
sequence of simple RVs Xn ↑ X (as n→∞) from sl. 60:
Xn =
∑n2n−1
k=0 k2
−n1
(
k2−n ≤ X < (k + 1)2−n)+ n1(X ≥ n).
For the RV X∗(x) := x on the prob’ty space (R,B(R), P ) one has X∗ d= X,
X∗n(x) :=
∑n2n−1
k=0 k2
−n1
(
k2−n ≤ x < (k + 1)2−n)+ n1(x ≥ n)
is an approx’g sequence of simple RVs on (R,B(R), P ): X∗n ↑ X∗ as n→∞,
EXn =
∑n2n−1
k=0 k2
−nP
(
k2−n ≤ X < (k + 1)2−n)+ nP(X ≥ n)
=
∑n2n−1
k=0 k2
−nP
(
[k2−n, (k + 1)2−n)
)
+ nP ([n,∞)) = EX∗n.
It remains to recall that EX = limn→∞EXn, EX∗ = limn→∞EX∗n. Bingo!
Basically the same argument shows that if Y = g(X) ∼ PY , X ∼ PX , then
EY = Eg(X) =

ydPY (y) =

g(x)dPX(x).
69
If F is AC with a “nice” density f = F ′ (a.e.), g is “nice” as well (e.g. both
are piece-wise continuous), then∫
g(x) dF (x) =
∫ ∞
−∞
g(x)f(x)dx,
the conventional Riemann integral. Again, this needs to be proved (which is
done by starting with g piece-wise constant and then passing to limits), which
is somewhat beyond the scope of our course.
The above two relations are the basic tools for computing expectations in
discrete & AC cases. When F is a mixture of the two, use a “mixed formula”.
NB: a mnemonic interpretation of the integral: similarly to∫
g(x)dx ≈

(g(x)-values)× (increments of x),
one can (naively) think that∫
g(x)dF (x) ≈

(g(x)-values)× (increments of F (x)).
70
Thm [4.23] For X ≥ 0, EX = ∫∞
0
FX(x)dx, FX := 1− FX(x). [Picture.a]
For X ≷ 0, EX = − ∫ 0−∞ FX(x)dx+ ∫∞0 FX(x))dx.
I For X ≥ 0 s.t. P(X ∈ hZ) = 1 for some h > 0, one has
EX =
∞∑
k=0
hkP(X = hk) = h
∞∑
k=1
k∑
j=1
P(X = hk) = h
∞∑
j=1
∞∑
k=j
P(X = hk)
= h
∞∑
j=0
P(X > hj) =
∞∑
j=0
FX(hj)h =
∫ ∞
0
FX(x)dx. (∗)
For general X ≥ 0, set X ′n := bnXcn , X ′′n := X ′n + 1n , n ≥ 1, s.t. X ′n ≤ X ≤ X ′′n
and X ′′n −X ′n ≤ 1n . Hence EX ′n ≤ EX ≤ EX ′′n ≤ EX ′n + 1n andb
EX ′n
(∗)
=
∫ ∞
0
FX′n(x)dx ≤
∫ ∞
0
FX(x)dx ≤
∫ ∞
0
FX′′n (x)dx
(∗)
= EX ′′n ≤ EX ′n +
1
n
.
Bingo.
aIn particular, for integer-valued X’s, EX =

n≥1 nP(X = n) =

n≥1P(X ≥ n).
bObvious: if X ≤ Y then FX(x) = P(X > x) ≤ P(Y > x) = FY (x), x ∈ R.
71
Functions of RVs
For the RV Y := g(X) (g is “nice” enough),
EY =

y dFY (y) =

g(x) dFX(x),
so in fact don’t need to compute the DF FY to find EY .
We mostly use the following consequences of this: for discrete/AC X’s,
Eg(X) =

ti∈CX
g(ti)P(X = ti), Eg(X) =

g(x)fX(x) dx.
The same applies to f’ns of RVecs. An important special case: (X1, X2) ∈ R2.
72
Cor [4.30] If X1 and X2 are independent RVs, gi(Xi) ∈ L1, i = 1, 2, then
E g1(X1)g2(X2) = E g1(X1)E g2(X2). (∗)
I First consider gi(x) = 1(x ∈ Bi) for some Bi ∈ B(R) :
E g1(X1)g2(X2) = E 1(X1 ∈ B1)1(X2 ∈ B2) = E 1(X1 ∈ B1, X2 ∈ B2)
= P (X1 ∈ B1, X2 ∈ B2) indep’ce= P (X1 ∈ B1) P (X2 ∈ B2)
= E 1(X1 ∈ B1) E 1(X2 ∈ B2) = E g1(X1) E g2(X2).
Next show that (∗) holds for gi(x) =
∑n
j=1 ai,j1(x ∈ Bi,j):
E g1(X1)g2(X2) =

j,k
a1,ja2,kE1(X1 ∈ B1,j)1(X2 ∈ B2,k) = [using the above]
=

j,k
a1,ja2,kE1(X1 ∈ B1,j)E1(X2 ∈ B2,k) = E g1(X1)E g2(X2).
Such simple functions g form a large enough class to approximate general
functions g. The usual limiting procedure works. Good.
73
Special case: Moments
This is when g(x) = xk.
Def The kth moment of X is EXk =

xkdF (x).
Expectation EX = the 1st moment. [Why called ‘moments’?]
NB: the expectation is a characteristic of location. Can be BAD.
Def The kth central moment of X is E(X −EX)k.
Variance Var (X) := E(X −EX)2 ≡ EX2 − (EX)2 = 2nd central moment.
NB: the variance is a characteristic of scale/spread. Can be BAD.
Absolute moments: for | · |, i.e. E|X −EX|k = kth absolute central moment.
For RVs X with E|X|p <∞, we write: X ∈ Lp (p > 0).
74
What are moments good for?
Sometimes they can be easily calculated.
If you know all the moments of X, you’ll know its distribution as well
(under broad conditions — but this is not always so).
There are several very useful inequalities involving moments.
Some of them relate different moments, some of them give bounds for
probabilities in terms of moments.
75
Thm [4.39] (Jensen’s inequality) Let X ∈ L1 and g : R→ R be a convex
function. Then
g(EX) ≤ E g(X). [NB: Cor [4.14] is a special case: g(x) = |x|.]
I For a convex g, for any x0 ∈ R there always exists an a ∈ R s.t.
g(x) ≥ g(x0) + a(x− x0) for all x ∈ R.
Now take x0 = EX and replace x with X:
g(X) ≥ g(EX) + a(X −EX), ω ∈ Ω.
Taking expectations of both sides, using linearity & monotonicity:
E g(X) ≥ g(EX) + aE(X −EX) = g(EX).
Bingo.
76
Cor [4.37] (Lyapunov’s inequality) For 0 < r ≤ s,
(E|X|r)1/r ≤ (E|X|s)1/s .
NB: this implies, in particular, that if the sth moment is finite, then so is the
rth one. Thus, if the second moment is finite, then the expectation must be
finite, too.
I Letting Y := |X|r, the inequality can be re-written as
(EY )
s/r ≤ EY s/r.
This is a special case of Jensen’s inequality, with g(x) = xs/r, x ≥ 0, which is
convex for s ≥ r. [Can put g(x) = 0 for x < 0, if you wish.] Bingo, right?
77
Thm [4.40] (Chebyshev’s [Markov’s] inequality) If g : R→ R is a
positive non-decreasing function, then, for any RV X and number a ∈ R,
P(X ≥ a) ≤ Eg(X)
g(a)
.
I Since 1(X ≥ a) ≤ g(X)/g(a), we have
P(X ≥ a) = E 1(X ≥ a) ≤ E g(X)
g(a)
=
Eg(X)
g(a)
.
Bingo. Special cases:
P(|X| ≥ a) ≤ E|X|
p
ap
for p, a > 0;
P(|X −EX| ≥ a) ≤ Var (X)
a2
for a > 0;a
P(X ≥ a) ≤ Ee
tX
eta
for t > 0.
aBTW: this gives a ‘3σ-rule bound’ of 1/9 only!
78
Def Mixed moments: EXkY m (+ central, absolute etc.).
Def For X,Y ∈ L2, the covariance of X and Y is
Cov (X,Y ) = E (X −EX)(Y −EY ) ≡ EXY −EXEY. [Do you see why? ]
The correlation between X and Y with Var (X),Var (Y ) > 0 is
Corr (X,Y ) =
Cov (X,Y )√
Var (X) Var (Y )
. [NB: Both are symmetric!]
BTW: Why did we require that X,Y ∈ L2? This ensures that EXY is finite:
0 ≤ (X ± Y )2 = X2 + Y 2 ± 2XY ⇒ |XY | ≤ 1
2
(X2 + Y 2), bingo.
[Of course, it also ensures that X,Y ∈ L1, see Cor [4.37].]
If we replace X with X/

EX2, and Y with Y/

EY 2, we’ll get∣∣∣∣ XY√EX2EY 2
∣∣∣∣ ≤ 12
(
X2
EX2
+
Y 2
EY 2
)
. (∗)
Taking expectations on both sides leads to the famous
79
Cor [4.36] (Cauchy–Bunyakovsky inequality)a If X,Y ∈ L2 then
XY ∈ L1 and
E |XY | ≤

EX2EY 2.
Since |EXY | ≤ E|XY |, for variances/covariances Cor [4.36] means that
|Cov (X,Y )| ≤

Var (X)Var (Y ),
and hence that
|Corr (X,Y )| ≤ 1.
Cool. BTW: when do we have “=” in this inequality? Looking at the
derivation of the inequality, one can see that in fact
Corr (X,Y ) = 1 (−1, resp.) ⇔ X = aY + b, a > 0 (a < 0, resp.),
see sl. 85 for a formal derivation.
aFor some unclear reasons, it’s often referred to as Cauchy–Schwarz inequality.
80
So correlation = ±1 when there is a perfect linear relationship b/w the RVs.
When the correlation is zero, one says that the RVs are uncorrelated, which
is NOT the same as independent.
NB: IF X and Y are independent, then by Cor [4.30] (sl. 74)
EXY = EX EY,
and hence Corr (X,Y ) = 0, but not the other way around! [Examples?]
But: for Gaussian (X,Y ), it’s the same [look at the density!].
Covariance & correlation have nice (and also quite insightful & useful)
geometric interpretations.
81
Consider X,Y ∈ L2, set X0 := X −EX, Y0 := Y −EY, and look at
Var (X + Y ) = E(X0 + Y0)
2 = EX20 + EY
2
0 + 2EX0Y0
= Var (X) + Var (Y ) + 2Cov (X,Y ).
Compare this with: for u, v ∈ Rd,
‖u+ v‖2 = (u+ v, u+ v) = (u, u) + (v, v) + 2(u, v)
= ‖u‖2 + ‖v‖2 + 2(u, v).
In fact, Cov (X,Y ) can be interpreted as a scalar product (of X0 and Y0, the
centred versions of X and Y , in the linear space L2 of square-integrable RVs,
so that the norm of X is

EX2 — quite similar to the std Euclidean norm
in Rd) and so has the same properties.
82
In particular, |(u, v)| ≤ ‖u‖‖v‖ becomes our |Cov (X,Y )| ≤√Var (X)Var (Y ).
Note also that correlation is an analog of
(u, v)
‖u‖‖v‖ = cos(∠(u, v)).
So uncorrelated RVs are like orthogonal vectors (in Rd: for them, (u, v) = 0
or, which is the same, cos(∠(u, v)) = 0).
In particular, Pythagoras’ theorem holds:
Var (X + Y ) = Var (X) + Var (Y ) ⇔ Cov (X,Y ) = 0.
Independent RVs is a special case of uncorrelated RVs (sl. 82), but not the
other way around: uncorrelated RVs don’t need to be independent! [Examples?]
83
Correlation is a measure of linear association rather than independence.
To the geometric fact that
| cos(∠(u, v))| = 1 ⇔ u and v are collinear (i.e. u = av for some a 6= 0)
there corresponds the following relation:
|Corr (X,Y )| = 1 ⇔ P(Y = aX + b) = 1 for some a 6= 0, b ∈ R.
[NB: Cov is the scalar product of the centred versions of X and Y .]
I Indeed, assume that Corr (X,Y ) = 1. Then, for the standardized RVs
X1 :=
X −EX√
Var (X)
, Y1 :=
Y −EY√
Var (Y )
, we have
E(X1 − Y1)2 = EX21 + EY 21 − 2Corr (X,Y ) = 1 + 1− 2 = 0.
Likewise, assuming that Corr (X,Y ) = −1, we’ll get E(X1 + Y1)2 = 0.
Either way, we have an RV Z = (X1 ∓ Y1)2 ≥ 0 with EZ = 0. But this is only
possible when P(Z = 0) = 1, and this is why:
84
Set An := {Z > 1/n}, n = 1, 2, . . . Then, by Markov’s inequality
(Thm [4.40], sl. 79),
P(An) = P(Z > 1/n) ≤ EZ
1/n
= 0.
Further, we clearly have An ⊂ An+1 and An ↑ A :=

k≥1Ak ≡ {Z > 0}, so
that by continuity of probability holds
P(A) = lim
n→∞P(An) = 0.
Thus we proved this:
Corr (X,Y ) = ±1 ⇒ P(X1 ∓ Y1 = 0) = 1,
i.e. w.p. 1 holds
X −EX√
Var (X)
∓ Y −EY√
Var (Y )
= 0.
This means that Y = aX + b for some a, b ∈ R with a of the same sign as the
correlation Corr (X,Y )). [Picture.] Bingo!
85
When considering d > 2 RVs, i.e. dealing with RVecs X = (X1, . . . , Xd), one
uses covariance matrices (CovMs):
C2X ≡ [C2X(i, j)]i,j=1,...,d := [Cov (Xi, Xj)] = E (X −EX)>(X −EX).
NB: We are using row vectors x = (x1, . . . , xd) here, so x
> is a column and the
matrix product x>x = [xixj ] is a d× d-matrix. If we used column vectors, we
would be writing xx> for the same object.
Note that C2X(i, i) = Var (Xi), i ≤ d.
These two are the key properties of CovMs:
(CovM.1) C2X is symmetric: C
2
X(i, j) = C
2
X(j, i).
(CovM.2) C2X is positive (or, rather, non-negative) definite: ∀x ∈ Rd,
xC2Xx
> ≥ 0.
86
(CovM.1) is obvious, as Cov (X,Y ) = Cov (Y,X).
(CovM.2) requires some effort. Setting Y := Xx> ∈ R, we have
0 ≤ Var (Y ) = E(Xx> −EXx>)2 = E [(X −EX)x>]2
= E
[
(X −EX)x>(X −EX)x>]
= E
[(
(X −EX)x>)>(X −EX)x>] [as a> = a for a ∈ R]
= E
[
x(X −EX)>(X −EX)x>]
= xE
[
(X −EX)>(X −EX)]x> = xC2Xx>, ufff, bingo!
It turns out that any d× d-matrix satisfying (CovM.1) & (CovM.2) is the
CovM of some distribution on Rd (e.g. a d-dimensional normal distribution,
which is our next topic).
87
Multivariate Normal Distributions
Recall: the standard d-dim normal RVec X = (X1, . . . , Xd) has i.i.d.
components Xj ∼ N(0, 1), and hence its density has the product form:
f(x) =
d∏
j=1
1√
2pi
e−x
2
j/2 ≡ 1
(2pi)d/2
exp
{
−1
2
xx>
}
, x ∈ Rd.
Clearly, EX = 0 ∈ Rd, C2X = I ≡ diag (1, . . . , 1), the identity matrix.
Now consider Y := µ+XA, where µ ∈ Rm, A ∈ Rd×m, so that Y ∈ Rm. Then
EY = E(µ+XA) = µ+ (EX)A = µ
and
C2Y = E (Y −EY )>(Y −EY ) = E (XA)>XA
= EA>X>XA = A>(EX>X)A = A>IA = A>A.
88
Now if m ≤ d and detC2Y 6= 0 (why m > d is no good? NB: detC2Y > 0 for
non-singular C2Y ) then we can use Thm [2.43] to claim that X has density
fY (y) =
1
(2pi)m/2

detC2Y
exp
{
−1
2
(y − µ)[C2Y ]−1(y − µ)>
}
, y ∈ Rm. (∗)
Why? The easiest case is when m = d and detA 6= 0; we have g(x) = µ+ xA
with the inverse h(y) = (y − µ)A−1, so that
(
∂hi
∂yj
)
= (A−1)> is constant and
detC2Y = detA
> detA = (detA)2 ⇒ Jh = detA−1 = (detA)−1 = 1√
detC2Y
.
Remember we said that any matrix satisfying (CovM.1) & (CovM.2) is the
CovM of some distribution (sl. 88)? If B is such a matrix, detB 6= 0, the
function obtained by replacing C2Y in (∗) with our B will be a (normal)
probability density. Good.
If detB = 0, it will still be the CovM of a normal distribution, but the latter
will be concentrated on a (shifted) linear subspace (and hence not AC).
89
5. Conditional Expectations (CEs)
The expectation of an RV X is a number, a numerical characteristic of X (or,
rather, of PX). We know (cf. Problem 5 from PS–5) that, for X ∈ L2, the
value EX is the best (in mean quadratic) predictor among constants for X :
E(X − a)2 −→ min
a
for a = EX.
So, if you don’t know the value of X, but do know PX , your best “educated
guess” for X will be its mean EX.
However, we often do know something about the outcome of our RE, in
which X was also produced, but remains hidden from us. For instance, we
don’t know if patient A has got disease D, but know the results of tests made
on him (here the desired RV is X = 1D); or, X is tomorrow’s price of stock S
which we don’t know yet, but whose history up to now we have observed
(plus, we saw the prices of other stocks in the market as well). In Bayesian
statistics, we assume that parameters’ values are random and observe data;
what do the data tell us about the parameters?
90
In all these situations, the key question is:
How to make the best use of the available information?
CEs provide a powerful tool for doing exactly that.
CEs are not numbers but RVs themselves, and in fact are functions of
the observed RVs (the information we have).
NB: If you condition on an event, which is rarely the case, then you end up
with a number, but even in this case you are actually dealing with a f’n.
First we will consider just this special case: conditioning on an event.
Suppose that all we know about the outcome of our RE is that event A
occurred. Given this information, what would be our best “educated guess”
about the realized value of X?
91
Let’s minimise the mean quadratic error on A: set
g(a) := E[(X − a)2;A] = E(X − a)21A = EX21A − 2aEX1A + a2P(A),
and solve (for a) the equation
0 = g′(a) = −2EX1A + 2aP(A).
This yields
a =
EX1A
P(A)
≡ E(X;A)
P(A)
=: E(X |A),
which we are well familiar with from our 2nd year probability course(s).
Interpretation: we just average the values of X over A, ignoring the rest of the
sample space. [Picture.]
NB: Of course, we don’t need X ∈ L2 to use this definition, it suffices to
have X ∈ L1 (i.e. E|X| <∞). But the above derivation provides motivation.
92
Of course, we can do the same for Ac instead of A.
So if we know the value of 1A (telling us if A occurred or not) and need to
provide the best forecast for X (whether A occurred or not), we use
E(X |1A) :=

E(X |A) ≡ E(X; 1A = 1)
P(1A = 1)
, ω ∈ A,
E(X |Ac) ≡ E(X; 1A = 0)
P(1A = 0)
, ω ∈ Ac,
which is an RV!
Next assume that we observed a simple RV Y =
∑n
i=1 yi1Ai , where
{A1, . . . , An} is a partition of Ω, all yi being distinct.
In other words, we know which of the events Ai occurred. Now if Y = yi then
ω ∈ Ai, and we know that in this case the best forecast for X will be given by
Xˆ =
E(X;Ai)
P(Ai)
=: xi, ω ∈ Ai.
This again specifies an RV!
93
Moreover, since Ai = {Y = yi}, introducing the function h(y) by putting
h(yi) := xi, we see that this new RV is actually a function of Y :
Xˆ = h(Y ) =: E(X |Y ),
called the CE of X given the simple RV Y .
This can be thought of as a “crude” version of X obtained by averaging the
values of X over events on which Y assumes one and the same value, the
“atoms” Ai = {Y = yi} of σ(Y ). [Picture.]
NB: Note that the values of Y don’t really matter when defining E(X |Y ).
What matters is the partition generated by Y or, equivalently, σ(Y ).
Ex. Expected weight of inhabitants given the postcode.
NB: Please note that all what we have said re CEs so far (and what we will be
saying below) holds for RVecs Y as well! (We never used the fact that Y ∈ R.)
94
NB: It is obvious that X and Xˆ have the same average values on the sets Ai:
by def’n of xi,
E(Xˆ;Ai) = E Xˆ1Ai
def’n of Xˆ
= Exi1Ai = xiP(Ai)
def’n of xi= E(X;Ai).
Q: Will this still hold if we replace Ai with AI :=

i∈I Ai, I ⊂ {1, 2, . . . , n}?
A: Sure. It’s like averaging the averages:
E(Xˆ;AI) = E Xˆ1AI = E Xˆ

i∈I
1Ai =

i∈I
E Xˆ1Ai=

i∈I
EX1Ai = · · · = E(X;AI).
Summarising, we see that the CE Xˆ = E(X |Y ) has these two properties:
(CE.1) Xˆ is “flat” on the “atoms” of σ(Y ), i.e. Xˆ is an RV
w.r.t. σ(Y ) (cf. Tutorial Problem 3 from PS–3; one says that
Xˆ is σ(Y )-measurable); this holds iff Xˆ = h(Y ) for some h, and
(CE.2) E(Xˆ;A) = E(X;A) for any A ∈ σ(Y ) [i.e. A = {Y ∈ B}].
95
Important: (CE.1) & (CE.2) uniquely specify the CE Xˆ = E(X |Y ) in case
of simple Y ’s! Indeed, using (CE.2) with A = Ai and applying (CE.1)
immediately implies that Xˆ = xi on Ai, bingo.
Now one can formally define CE in case of general Y using these properties.
Thm [CE] Let X ∈ L1 and Y be RVs on a common probability space. Then
there exists an RV Xˆ satisfying (CE.1) & (CE.2). This RV is unique, up to its
values on a set of zero probability, and is called the CE of X given Y , denoted
by E(X |Y ). [Cf. Def [8.12].]
I The proof is based on Radon-Nikodym’s theorem from Measure Theory,
and we won’t give it here. Good.
Recall: we already said that what really mattered in the def’n of E(X |Y ) in
the case of simple Y ’s was the partition generated by Y or, equivalently, σ(Y ).
Likewise in the general case.
96
First observe that if ϕ is a 1–1 function, then E(X |Y ) = E(X |ϕ(Y )) (why?).
In particular, one always has E(X |Y ) = E(X |Y 3) = E(X | eY ) etc., but in
the general case E(X |Y ) 6= E(X |Y 2) (when will “=” hold?).
Second observe that we don’t really need RVs Y themselves for
conditioning — we need information contained in their values. The same
information is also contained in σ(Y ): if we know which events from it
occurred and which didn’t, we know the value of Y ! If, for a given
sub-σ-algebra F1 ⊂ F , we replace (CE.1) with the requirement that Xˆ is
F1-measurable (i.e. Xˆ is an RV w.r.t. F1), then Thm [CE] will still hold true.
Thus introduced RV Xˆ =: E(X | F1) is called the CE of X given σ-algebra F1.
In probability & statistics, it’s critically important (i) to be aware of the
general definition of CE, (ii) to know and be able to use the key properties of
CE, and hence be able to effectively use this powerful tool, and (iii) to know
how to compute CEs in important special cases.
97
Ex. Poisson sums: We know (from our 2nd year probability subject or
otherwise) that, for independent X ∼ P (λ) and Y ∼ P (µ), the sum is also
Poisson: Z := X + Y ∼ P (λ+ µ). Show that E(X |Z) = λλ+µZ.
[Motivation: Suppose we know the total number NT of jumps in a Poisson
process during the time interval [0, T ]. What can we say about Nt for t < T?]
(CE.1) is obvious ( λλ+µZ is a function of Z, right?).
(CE.2) To verify this property, we write, for any k ≥ 0,
E(X ; Z = k) = EX1(Z = k) =

i≥0

j≥0
i 1(i+ j = k)︸ ︷︷ ︸
= 1 iff j = k − i
P(X = i, Y = j)
=
k∑
i=0
iP(X = i, Y = k − i)︸ ︷︷ ︸
λie−λ
i!
× µ
k−ie−µ
(k − i)!
= e−(λ+µ)
k∑
i=0
iλi
i!
× µ
k−i
(k − i)!
98
= e−(λ+µ)λ
k∑
i=1
1
(i− 1)!(k − i)!λ
i−1µk−i [letting l := i− 1]
= e−(λ+µ)
λ
(k − 1)!
k−1∑
l=0
(k − 1)!
l!(k − 1− l)!λ
lµk−1−l = e−(λ+µ)
λ(λ+ µ)k−1
(k − 1)! .
On the other hand,
E
(
λ
λ+ µ
Z; Z = k
)
=
λk
λ+ µ
P(Z = k) =
λk
λ+ µ
(λ+ µ)k
k!
e−(λ+µ),
which is the same! Bingo.
What if the question was not to show that , but rather to compute (without
knowing the answer in advance)? That is, what about (iii)?
The standard route is to first find the conditional distribution and then
compute the expectation for it (re how this is done in our Ex, see Tutorial
Problem 1 in PS–6).
99
OK, what is the conditional distribution?
Recall: for any event A, one has P(A) = E1A. Likewise, one defines
conditional probabilities (cf. Def [8.18]) by setting
P(A |Y ) := E(1A |Y ).
Distributions are collections of probabilities of the form PX(B) ≡ P(X ∈ B),
B ∈ B(R). And so conditional distributions are defined by letting
gB(Y ) := P(X ∈ B |Y ) := E[1(X ∈ B) |Y ].
and then setting PX|Y (B | y) := gB(y).
In fact, it’s a bit more tricky than just that. Our Thm [CE] on sl. 97 claims,
for a given RV Z, the existence of g(Y ) := E(Z |Y ) and its uniqueness up to
events of null probability. Now we have a family of RVs {gB(Y ), B ∈ B(R)}
and kind of want it to be a distribution (in B) for any outcome!
That is, for any y ∈ R, our gB(y) should be a probability (a set function in B).
100
Well, it turns out such a thing does exists. Always. And we call it the
conditional distribution of X given Y .
When Y is discrete, we just compute P(X ∈ B |Y = yi) (provided that
P(Y = yi) > 0, cf. Thm [8.37]), it’s all nice & simple.
When (X,Y ) is AC, one uses the conditional density (cf. Def [8.38])
fX|Y (x|y) :=
f(X,Y )(x, y)
fY (y)
, fY (y) =

f(X,Y )(x, y) dx.
Then (cf. Thm [8.39])
P(X ∈ B, Y ∈ B′) =

B×B′
f(X,Y )(x, y) dxdy =

B×B′
fX|Y (x|y)fY (y) dxdy
=

B′
[∫
B
fX|Y (x|y) dx
]
︸ ︷︷ ︸
=:gB(y)
fY (y) dy = E[gB(Y ); Y ∈ B′],
so that gB satisfies the def’n of conditional probability.
101
So, to compute E(X |Y ) when (X,Y ) is AC, one first finds the conditional
density fX|Y (x|y). Then one computes
g(y) :=

xfX|Y (x|y) dx
(sometimes denoted by E(X |Y = y)) and lets
E(X |Y ) := g(Y ).
Ex Uniform distribution on D := {(x, y) ∈ R2 : x ≥ 0, y ≥ 0, x+ y ≤ 1}.
Ex Normal distribution.
Quite often one can use the general properties of CEs to make the
computation feasible. Our exposition will be somewhat different from the one
presented in Section 8.3.2 from the text.
102
Properties of CEs
(CEP.1) Linearity: for constants a, b ∈ R,
E(aX + bZ |Y ) = aE(X |Y ) + bE(Z |Y ).
[NB: all relations of this kind, involving CEs, are understood a.s. (=almost
surely), which means they hold up to an event of zero probability.]
I Indeed, the RHS is a function of Y , so (CE.1) holds. To verify (CE.2), we
use the def’n of CE and linearity of expectation:
E[RHS ;Y ∈ B] = aE[E(X |Y ) ;Y ∈ B]︸ ︷︷ ︸
=E(X;Y ∈B)
+bE
[
E(Z |Y ) ;Y ∈ B]︸ ︷︷ ︸
=E(Z;Y ∈B)
= aE(X;Y ∈ B) + bE(Z;Y ∈ B) = E(aX + bZ;Y ∈ B), good.
103
(CEP.2) Monotonicity: if X ≤ Z a.s., then also E(X |Y ) ≤ E(Z |Y ) (a.s.).
I Indeed, if it were NOT so, we would have (from linearity)
E(Z −X |Y ) = E(Z |Y )−E(X |Y ) < 0
with positive probability. The LHS is a function of Y (by def’n), say, h(Y ),
and so, for B := h−1((−∞, 0)),
{ω ∈ Ω : E(Z −X |Y ) < 0} ≡ {ω ∈ Ω : h(Y ) < 0} = {ω ∈ Ω : Y ∈ B}.
Now by (CE.2),
E
(
h(Y );Y ∈ B) = E(Z −X;Y ∈ B).
Look: the integrals are over the set {Y ∈ B} of positive probability, and the
integrand on the LHS is < 0, whereas the one on the RHS is ≥ 0.
Contradiction.
104
(CEP.3) If Z = g(Y ) then
E(ZX |Y ) = ZE(X |Y ).
That is, functions of Y behave like constants when one conditions on Y .
I Indeed, the RHS is a function of Y , so (CE.1) is fine.
Re (CE.2): first consider the case Z = 1{Y ∈C} for some C ∈ B(R). Then
E(ZX;Y ∈ B) = EX1{Y ∈C}1{Y ∈B},
whereas
E
[
ZE(X |Y );Y ∈ B] = E[E(X |Y );Y ∈ C ∩B] (CE.2)= E(X;Y ∈ C ∩B),
which is the same. Next we verify the property for simple RVs, then look at
their limits. Good.
105
(CEP.4) If X is independent of Y , then
E(X |Y ) = EX. (∗)
I Indeed, (CE.1) is met as a constant (= EX) is a function of Y .
Secondly, X and 1(Y ∈ B) are also independent, and so by Cor [4.30] (sl. 74)
E(X; Y ∈ B) ≡ E[X1{Y ∈B}] = EX E1{Y ∈B}
= E
[
(EX)1{Y ∈B}
]
= E(EX; Y ∈ B),
so that (CE.2) is also met.
In particular, if Y = const, then (∗) always holds.
106
(CEP.5) The double expectation law (a.k.a. the “tower property”):
E
[
E(X |Y1, Y2) |Y1
]
= E(X |Y1).
I Oh well, (CE.1) is obvious. To show (CE.2):
E
[
E(X |Y1, Y2);Y1 ∈ B
]
= E
[
E(X |Y1, Y2); (Y1, Y2) ∈ B × R
]
(CE.2)
= E(X; (Y1, Y2) ∈ B × R) = E(X;Y1 ∈ B)
(CE.2)
= E
[
E(X |Y1);Y1 ∈ B
]
,
bingo.
In particular, taking Y1 = const, we see that
E
[
E(X |Y )] = EX
(but that was obvious from (CE.2): just take B = R there).
This relation is actually a form of the total probability law, which is a very
useful thing!
107
One of the great features of CE is that, for X ∈ L2, the CE Xˆ := E(X |Y ) is
the best (in m.q.) forecast for X from Y . Let’s prove that!
For an RV Z = h(Y ), consider
E(X − Z)2 = E[(X − Xˆ) + (Xˆ − Z)]2
= E
[
(X − Xˆ)2 + 2(X − Xˆ)(Xˆ − Z) + (Xˆ − Z)2]
= E(X − Xˆ)2 + 2E(X − Xˆ)(Xˆ − Z) + E(Xˆ − Z)2
Use (CEP.5) to evaluate the middle term: the mixed moment equals
EE
[
(X − Xˆ)(Xˆ − Z) |Y ] (CEP.3)= E[(Xˆ − Z)E[(X − Xˆ) |Y ]]
(CEP.1)
= E
[
(Xˆ − Z)[E(X |Y )︸ ︷︷ ︸
=Xˆ
− E(Xˆ |Y )︸ ︷︷ ︸
=Xˆ by (CEP.3)
]
]
= 0,
so E(X − Z)2 = E(X − Xˆ)2 + E(Xˆ − Z)2 −→ minZ=h(Y ) for Z ≡ Xˆ, as then
the second (non-negative) term = 0, while the first one doesn’t depend on Z.
[Again note: can have Y ∈ Rd here. Projection interpr’n. Linear vs general f’cast.]
108
The final comment in the section on CEs:
That Xˆ = E(X |Y ) minimises the mean quadratic distance to X among all
RVs that are functions of Y also has a simple geometric interpretation.
In mathematics, an operator R : L→ L (L is a linear space, e.g. Rn or L2) is
called a projection if R2 = R, i.e. R(Rx) = Rx for any x ∈ L. [Picture.]
Let LY = space of all RVs Z ∈ L2 that are f’ns of a given RV Y : Z = h(Y ) for
some h. Then the operator R(·) = E(· |Y ) is a projection onto LY :
Xˆ = R(X) ≡ E(X |Y ) (CE.1)= h(Y ), and so R(R(X)) ≡ E(Xˆ |Y ) (CEP.3)= Xˆ.
The geometry of L2 is Euclidean, as in Rn (it’s based on a scalar product), so
shares a lot of properties, e.g. Pythagoras thm. In particular, the projection Xˆ
and the “error” X − Xˆ will be orthogonal in L2 (≡ uncorrelated):
EXˆ(X − Xˆ) (CEP.5)= EE[Xˆ(X − Xˆ)|Y ] (CEP.3)= E[XˆE(X − Xˆ|Y )︸ ︷︷ ︸
E(X |Y )−E(Xˆ |Y )=Xˆ−Xˆ
]
= 0.
109
6. Some Applications to Statistics
First a few words on the relationship b/w Probability Theory (PT) and
Mathematical Statistics (MS).
In PT: knowing the nature of a random phenomenon, we derive the
distributions of the characteristics of the phenomenon (that one can usually
observe in the RE). [Our knowledge is built into the mathematical model of
the RE: (Ω,F ,P) etc.] One can call this a direct problem.
Ex. Knowing the composition of the general population, what can one say
about the composition of a random sample from that population?
In MS, we deal with an inverse problem , e.g. in our Ex, given the observed
composition of the random sample, what can we say about the general
population? In the general situation, we have a mathematical model of the
RE: (Ω,F ,P) etc., where P is (at least, partially) unknown, observe the value
of an RVec resulting from our RE, and then use PT to make inference
concerning P on the basis of this data.
110
Later on, we will discuss application of the key results of PT, its limit
theorems, to MS. Now we will briefly discuss an important application of the
concept of CE in the context of parameter estimation. [You must have seen
elements of that in 2nd year stats subjects — in case you did them, of course.]
Our model for observed data:
• There is an underlying RE, modelled by (Ω,F ,Pθ), where Pθ is a
probability depending on parameter θ ∈ Θ ⊂ Rd, whose value we don’t (but
want to) know. Thus we have a family of suspects P := {Pθ}θ∈Θ and need to
point at the one that would fit the observed data best (in some sense).
• We observe an RVec X = X(ω) ∈ Rn. Denote by Pθ the distribution of X
on (Rn,B(Rn)) induced by Pθ. One can often identify (Ω,F ,Pθ) with
(Rn,B(Rn), Pθ) (may be awkward if one analyses “large sample” situations,
when n→∞), and then Ω = Rn does become the sample space.
111
Sufficient Statistics
You may remember some elements of this stuff from 2nd year stats.
Def. Any (measurable) function S = S(X) of the observation X is called a
statistic in our sampling experiment.
Estimators of the unknown parameter θ are statistics θ∗ = θ∗(X) (i.e. just
functions of the sample X assuming values in the parameter space Θ),
statistical tests are statistics δ∗ = δ∗(X) (assuming values 1 and 0 when
testing a single hypothesis H0 vs alternative H1: δ
∗ = 1 means that we reject
H0; for randomized tests, δ
∗ is the probability of rejecting H0; similarly in the
case of multiple hypotheses).
Clearly, a statistic S is also an RV (or RVec), and so one can talk about the
conditional distribution Pθ(X ∈ B|S), B ∈ B(Rn), cf. sl. 82.
112
Def. [R. A. Fisher, 1922] A statistic S = S(X) is called sufficient (SS) for
parameter θ if the conditional distribution Pθ(X ∈ B|S) doesn’t depend on θ.a
NB: if ϕ is a 1–1 function, then S1 = ϕ(S) is also an SS for θ (cf. sl. 98).
Being a CE given S, Pθ(X ∈ B|S) is a function of S, so there exists a function
P (B|s), B ∈ B(Rn), s ∈ Rd, s.t.
Pθ(X ∈ B|S) = P (B|S)
Interpretation: P (B|s) is the cond’l distribution of the sample X given S = s.
Meaning: For a given SS S, if you know that the sample point X is on the
“surface” {x ∈ Rn : S(x) = s}, any further information re where on this
surface the point X is tells you nothing about the value of θ (as the location
of the point follows then one and the same distribution for all θ) — and so
this information is irrelevant to the estimation problem. [Picture: N(0, σ2).]
aMore formally: there exists a version of the conditional distribution which is independent
of θ, cf. slides 102, 103. But we don’t bother about such nuances too much here.
113
Ex. Let X = (X1, . . . , Xn) be an i.i.d. sample, Xi ∼ P (λ).
Recall: for independent X ∼ P (λ) and Y ∼ P (µ), we have X + Y ∼ P (λ+ µ),
and the conditional distribution of X given X + Y = m is binomial
B(m, λλ+µ ). Hence that of Xj given S := X1 + · · ·+Xn = m is B(m, 1/n).
Perhaps S is a sufficient statistic for λ? Verify: fix an integer s ≥ 0, then for
x = (x1, . . . , xn), one clearly has Pλ(X = x |S = s) = 0 if
∑n
j=1 xj 6= s, while
if the sum = s we have
Pλ(X = x |S = s) = Pλ(X = x)
Pλ(S = s)
=
∏n
j=1 e
−λλ
xj
xj !
e−nλ
(nλ)s
s!
=
s!λ

xj
(nλ)s

xj !
=
(x1 + · · ·+ xn)!
x1! · · ·xn!
(
1
n
)x1
· · ·
(
1
n
)xn
,
multinomial distr’n with s =
∑n
j=1 xj independent trials with n equally likely
outcomes. Thus S = nX is an SS for λ. Good.
114
NB: What a great reduction of data! It turns out that all the information
about θ contained in the whole sample is “stored” in a single value, that of S.
More on that later. Now: how to find SSs?
The most natural formulation of the main result here uses the concept of
density.
Recall: a distribution P on Rn has a density f if, for any Borel set B ⊂ Rn,
P (B) =

B
f(x) dx ≡

· · ·

f(x)1B(x) dx1 · · · dxn. (∗)
But we also introduced integrals of more general form (sl. 68, 69):∫
B
f(x) dQ(x), Q is a probability.
Basically the same def’n works for integrals w.r.t. measures (see sl. 17) as
well, rather than just probability measures, e.g. Lebesgue measure (= length
on R, area on R2, volume on Rn, n > 2), as in (∗), or counting measure (see
sl. 18).
115
Thus, if X ∼ P is a discrete RVec taking values in a countable set
C = {xi}i≥1, p(x) := P(X = x), and µ is the counting measure on C (i.e.
µ(B) = #{x ∈ B ∩ C}), then one has
P(X ∈ B) ≡ P (B) =

B
dP (x) =

B
p(x) dµ(x) =

x∈B∩C
p(x) ≡

i:xi∈B
p(xi).
In this case, one uses notation: p(x) =
dP

(x) and says that P is AC w.r.t. µ
and p is the density of P w.r.t. µ.
So the concept of AC is relative ; when used without any further
explanations, it always means that P is AC w.r.t. the Lebesgue measure, i.e.
(∗) from sl. 116 holds. But one often needs to use some other measures µ.
NB: So it turns out that discrete distributions are actually AC w.r.t. the
counting measures on their supports!
Now we are ready to formulate our main result here.
116
Thm [Neyman–Fisher (NF) factorisation.] Suppose all Pθ are AC w.r.t.
some measure µ, with densities fθ(x) =
dPθ
dµ (x). A necessary and sufficient
condition for statistic S to be an SS for θ is that, for some functions ψ(s, θ)
and h(x),
fθ(x) = ψ(S(x), θ)h(x). (∗)
If X = (X1, . . . , Xn) is an i.i.d. sample, Xj having a density fθ(x), then
fθ(x) ≡
n∏
j=1
fθ(xj) = ψ(S(x), θ)h(x).
NB: Factorisation (∗) is not unique, of course: say, the pair
ψ1(s, θ) := e
sψ(s, θ), h1(x) := e
−S(x)h(x)
would be OK, too!
117
Ex. (cont’d) In our Poisson example, the distribution is on the integers, with
the density fλ(x) = e
−λ λx
x! w.r.t. the counting measure on {0, 1, 2, . . .}. So the
likelihood function has the form
fλ(x) ≡
n∏
j=1
fλ(xj) =
n∏
j=1
e−λ
λxj
xj !
=
e−nλλ
∑n
j=1 xj∏n
j=1 xj !
= ψ(S(x), λ)h(x)
with
ψ(s, λ) = e−nλλs, S(x) =
n∑
j=1
xj , h(x) =
1∏n
j=1 xj !
.
Therefore, according to the NF Thm, S(X) :=
∑n
j=1Xj is an SS for λ (but
we have already proved that directly anyway).
118
I We will prove the NF Thm in the discrete case only; in the general case,
the same idea is used (but the argument becomes more technical).
Thus we assume that Pθ(X ∈ C) = 1, ∀θ ∈ Θ, for a countable set C, µ is the
counting measure on C, fθ(x) = Pθ(X = x) for x ∈ C.
⇐) Suppose that factorisation (∗) (sl. 118) takes place. Then, for x ∈ C,
s = S(x) (otherwise it’s trivial: zeros!),
Pθ(X = x |S(X) = s) = Pθ(X = x, S(X) = S(x))
Pθ(S(X) = s)
=
Pθ(X = x)
Pθ(S(X) = s)
=
fθ(x)∑
y∈C:S(y)=s fθ(y)
(∗)
=
ψ(S(x), θ)h(x)∑
y∈C:S(y)=s ψ(S(y), θ)h(y)
=
h(x)∑
y∈C:S(y)=s h(y)
,
which doesn’t depend on θ, hence S is an SS for θ!
119
⇒) Now assume that S is an SS for θ. Then, for x ∈ C, s = S(x) (otherwise
it’s trivial: zeros!),
Pθ(X = x |S(X) = s) =: h(x)
doesn’t depend on θ. Therefore
fθ(x) ≡ Pθ(X = x) = Pθ(X = x, S(X) = S(x))
= Pθ(X = x |S(X) = S(x))︸ ︷︷ ︸
=h(x)
Pθ(S(X) = S(x))︸ ︷︷ ︸
=:ψ(S(x),θ)
,
bingo!!
120
Ex. When X = (X1, . . . , Xn), where Xj ∼ N(µ, σ2), the parameter
θ := (µ, σ2) is 2-dim (NB: here µ ∈ R, it’s not a measure, just a number). The
likelihood function has the form:
fθ(x) =
n∏
j=1
fθ(xj) = (2piσ
2)−n/2 exp
{
− 1
2σ2
n∑
j=1
(xj − µ)2
}
= (2piσ2)−n/2 exp
{
− 1
2σ2
[∑
x2j︸ ︷︷ ︸
=:S2
−2µ

xj︸ ︷︷ ︸
=:S1
+nµ2
]}
= (2piσ2)−n/2 exp
{
− 1
2σ2
[
S2 − 2µS1 + nµ2
]}
= ψ(S, θ)h(x),
where S = (S1, S2), h(x) ≡ 1 (but could take h(x) ≡ (2pi)−n/2 as well).
Therefore (S1, S2) is an SS for θ = (µ, σ
2): of all the information contained in
the sample X, we only need two numbers, (S1, S2), for estimation of (µ, σ
2).
121
Ex. Suppose X = (X1, . . . , Xn), where Xj ∼ U [0, θ]. This is an AC distr’n,
fθ(x) =
{
θ−1 if x ∈ [0, θ],
0 otherwise.
Hence, using x(1) = minj≤n xj and x(n) = maxj≤n xj , the likelihood function is
fθ(x) =
n∏
j=1
fθ(xj) =
{
θ−n if xj ∈ [0, θ], j ≤ n,
0 otherwise.
= θ−n1
(
0 ≤ x(1), x(n) ≤ θ
)
= θ−n1{x(n)≤θ}︸ ︷︷ ︸
=:ψ(S,θ)
1{x(1)≥0}︸ ︷︷ ︸
=:h(x)
Therefore S(X) := X(n) is an SS for θ.
Cor [of NF Thm] If T is a statistic, ϕ a function, and S := ϕ(T ) is an SS
for θ, then T is also an SS for θ.
I Obvious from NF Thm. In fact, the “best” SSs are the minimal ones
(that are f’ns of all the others): they can be used to get the best estimators.
122
Recall the following important concept.
Def. θˆ∗ = θˆ∗(X) := arg maxθ∈Θ fθ(X) is called the maximum likelihood
estimator (MLE) of θ from X.
This statistic can be a (very) good estimator for θ and possesses some nice
properties you may be familiar with. At the moment, we observe this:
Cor [of NF Thm] If S is an SS for θ, then the MLE θˆ∗ is a function of S
only (no further information from the sample X is needed).
I Indeed, from the NF Thm,
max
θ
fθ(X) = max
θ
[
ψ(S(X), θ)h(X)
]
= h(X) max
θ
ψ(S(X), θ)
so the value θ at which the max is attained depends on S(X) only.
Moreover, if S is an SS for θ, then all the Bayesian estimators (rings a bell?)
are functions of S only etc.
123
How SSs can improve estimators’ efficiency
Q: How to measure performance of estimators θ∗ of θ? First consider the case
when Θ ⊂ R. The standard mean quadratic error approach:
Eθ(θ
∗ − θ)2 −→ min
θ∗
(∗)
Q: Does there exist an estimator θ∗0 minimising the LHS of (∗) for all θ?
A: In non-trivial cases, the answer is negative. Indeed, suppose that there
exists such an estimator θ∗0 . Fix a θ1 ∈ Θ and take θ∗1 ≡ θ1. Then
Eθ(θ

1 − θ)2 = (θ1 − θ)2,
which turns into zero for θ = θ1. But we know that θ

0 is the best estimator,
so must also have
Eθ1(θ

0 − θ1)2 = 0.
And this holds for any θ1 ∈ Θ!! This is only possible when the obesrvation X
uniquely determines the value of θ (e.g. when Xi ∼ εθ, Θ = R, or
Xi ∼ U [θ, θ + 1], Θ = Z)
124
So one compares performance of estimators within reasonable classes, e.g.
unbiased estimators.
Def. An estimator θ∗0 = θ

0(X) from a class K of estimators of θ is called
efficient in K if, for any θ∗ ∈ K,
Eθ(θ

0 − θ)2 ≤ Eθ(θ∗ − θ)2, ∀θ ∈ Θ.
Ex (of an important class of estimators). For a function b = b(θ), θ ∈ Θ, let
Kb = {θ∗ : Eθθ∗ = θ + b(θ), ∀θ ∈ Θ}
be the class of all estimators with the bias b(θ).
In particular, K0 is the class of all unbiased estimators. Estimators efficient
in K0 are called simply efficient.
BTW: unbiasedness is a desirable (but not necessary) property of a good
estimator. Unbiased estimators don’t need to exist (e.g. if X ∼ B(ϕ(θ)) for
some f’n ϕ, then θ∗ ∈ K0 ⇔ Eθθ∗ ≡ θ∗(0)(1− ϕ(θ)) + θ∗(1)ϕ(θ) ?!= θ, ∀θ ∈ Θ).
125
Thm. An efficient in Kb estimator (if it exists) is unique (modulo its values
on a subset A of the sample space s.t. Pθ(A) = 0, ∀θ ∈ Θ).
I Suppose that both θ∗1 and θ∗2 are efficient in Kb:
Eθ(θ

i − θ)2 = min
θ∗∈Kb
Eθ(θ
∗ − θ)2 =: Rθ, ∀θ ∈ Θ, i = 1, 2.
Clearly, θ∗0 :=
1
2 (θ

1 + θ

2) ∈ Kb and, using
(
a1+a2
2
)2
+
(
a1−a2
2
)2
=
a21+a
2
2
2 with
ai = θ

i − θ, i = 1, 2, and taking Eθ’s, we obtain
Eθ(θ

0 − θ)2︸ ︷︷ ︸
≥Rθ
+
1
4
Eθ(θ

1 − θ∗2)2 = Rθ ⇒ Eθ(θ∗1 − θ∗2)2 ≤ 0, hence = 0.
As we showed on slides 85 & 86 , this means that Pθ(θ

1 − θ∗2 = 0) = 1. Bingo.
[NB: This assertion has a transparent geometric interpretation. Picture.]
126
Thm [Rao-Blackwell (RB)] Let θ∗ ∈ Kb, S be an SS for θ. Then the CE
θ∗S := Eθ(θ
∗ |S) has the following properties:
(i) θ∗S is a function of S only (and hence is a statistic);
(ii) θ∗S ∈ Kb;
(iii) Eθ(θ

S − θ)2 ≤ Eθ(θ∗− θ)2, ∀θ ∈ Θ, where “=” holds iff Pθ(θ∗S = θ∗) = 1.
Thus applying Eθ(· |S) to an estimator θ∗ improves it!
I First observe that
θ∗S =

θ∗(x)Pθ(X ∈ dx |S) S is an SS=

θ∗(x)P (dx |S)
doesn’t depend on θ and is a function of S = S(X) only. Hence θ∗S is a statistic
and so can be used as an estimator. [It wouldn’t be so if S were NOT an SS!]
This proves (i). To prove (ii), note:
Eθθ

S = EθEθ(θ
∗ |S) (CEP.5)= Eθθ∗ = θ + b(θ), so that θ∗S ∈ Kb indeed.
127
To demonstrate (iii), we do this:
Eθ(θ
∗ − θ)2 = Eθ
(
(θ∗ − θ∗S) + (θ∗S − θ)
)2
= Eθ(θ
∗ − θ∗S)2︸ ︷︷ ︸
≥0
+Eθ(θ

S − θ)2 + 2 Eθ(θ∗ − θ∗S)(θ∗S − θ)︸ ︷︷ ︸
=0, see bottom of sl. 110
≥ Eθ(θ∗S − θ)2.
It remains to note that “=” holds here iff Eθ(θ
∗ − θ∗S)2 = 0, but, as we proved
on slides 85 & 86, the latter is equivalent to Pθ(θ
∗ − θ∗S = 0) = 1. Bingo.
NB: Recall Cor from sl. 123: S = ϕ(T ), S is an SS =⇒ T is an SS, too. Using
the same argument, one can show that then
Eθ(θ

S − θ)2 ≤ Eθ(θ∗T − θ)2
That is, the “smaller” the conditioning SS is, the more efficient the result!
128
Ex. Consider an i.i.d. sample X = (X1, . . . , Xn), Xj ∈ P (λ), and λ∗ = X1. As
we know, Eλλ
∗ = EλX1 = λ, so that λ∗ ∈ K0, and
Eλ(λ
∗ − λ)2 = Var λ(X1) = λ.
We also know (sl. 115) that S :=
∑n
j=1Xj is an SS for λ, so we can form the
estimator
λ∗S = Eλ(λ
∗ |S) = Eλ(X1 |S) = S
n
≡ X
since the conditional distr’n of X1 given S = m is binomial B(m, 1/n) (cf.
Tutorial Problem 1, PS-6).
So
Eλ(λ

S − λ)2 = Var λ(X) =
λ
n
,
which is much better than for the original λ∗!
One can show that λ∗S is actually efficient (i.e., it has the smallest mean
quadratic error among all unbiased estimators).
129
Q: What do we do when θ ∈ Rd?
A: A possible way to evaluate the performance of estimators is to compare
mean quadratic dispersions
Eθ(θ
∗ − θ, a)2, a ∈ Rd,
(·, ·) being the scalar product, and prefer θ∗1 to θ∗2 if the former has smaller
dispersion for all a.
That is, one looks at the projections of θ∗ − θ onto different directions a,
Thm [multivariate Rao-Blackwell] As in the univariate case, except for (iii)
which is replaced with:
(iii′) Eθ(θ∗S − θ, a)2 ≤ Eθ(θ∗ − θ, a)2, ∀θ ∈ Θ, ∀a ∈ Rd, where “=” holds for
all a ∈ Rd iff Pθ(θ∗S = θ∗) = 1.
The same proof: just do it for the univariate parameters (θ, a) and estimators
(θ∗, a) for them. Good.
130
Q: How far can one go along the path paved by RB Thm?
A: When there exists a complete SS for θ, one can go till the very end: in that
case, for a θ∗ ∈ Kb, the estimator θ∗S will be efficient in Kb. But this is from
another book. . .
131
7. Convergence of Random Variables
Modes of Convergence
Recall this: Let {xn}n≥1 be a sequence in R (or Rd). We say that xn converge
to x as n→∞, and write
xn → x as n→∞, or lim
n→∞xn = x,
if, ∀ε > 0, ∃nε <∞ s.t. one has
|xn − x| < ε for all n ≥ nε.
That is, for an arbitrarily small neighbourhood U of x, the xn’s should be in
U for all large enough n.
Now what might Xn → X mean when Xn = Xn(ω) are RVs? They are
functions, not numbers! [Picture.]
132
In Probability Theory (as in other areas of mathematics), one considers
several modes of convergence of RVs (functions). They are different, used in
different contexts, some more often than others. We will give general
definitions and briefly describe relationships between some of them.
Def. [5.1] Convergence almost surely (a.s.), or w.p. 1:
Xn
a.s.−→ X as n→∞
if there is an event A with P(A) = 1 s.t. ∀ω ∈ A, Xn(ω)→ X(ω) as n→∞.
In other words, it’s point-wise convergence on a set of probability 1.
Def. [5.2] Convergence in probability:
Xn
P−→ X as n→∞
if ∀ε > 0, P(|Xn −X| > ε)→ 0 as n→∞.
133
Def. [5.3] Quadratic mean convergence:
Xn
L2−→ X as n→∞
if Xn, X ∈ L2 and E(Xn −X)2 → 0 as n→∞.
In fact, this is convergence of elements of the space L2 of square-integrable
RVs (on a common (Ω,F ,P)) in its “native norm” (the one corresponding to
the scalar product (X,Y ) := EXY ).
Def. [5.4] Convergence in mean:
Xn
L1−→ X as n→∞
if Xn, X ∈ L1 and E|Xn −X| → 0 as n→∞.
The last two modes (5.3, 5.4) are particularly popular in engineering
applications (one of the reasons being that calculating moments is a relatively
simple task).
134
All four are modes of convergence of sequences of RVs, given on a common
probability space. However, the most important from the applications’
viewpoint is convergence in distribution (a.k.a. “weak convergence of
distributions”), which doesn’t care where and how the RVs are defined — only
the distributions of Xn matter.
Def. [5.5] Convergence in distribution:
Xn
d−→ X as n→∞
if limn→∞ FXn(t) = FX(t) at all points t ∈ R where FX(t) is continuous,
i.e. such t that FX(t−) = FX(t) ⇐⇒ P(X = t) = 0.
NB: Why this restriction to continuity points of FX? Because it makes sense
(and perfectly agrees with an alternative, more natural def’n, see Thm [5.8]
below). Consider Xn ≡ 1/n, X ≡ 0. [Picture: DFs.] Then FXn(t)→ FX(t) at
all t 6= 0, whereas FX(0) = 1 6= 0 = FXn(0).
Do we want to exclude such a situation? Don’t the distributions converge?
The def’n allows the jump points to “move” & also emerge in the limit.
135
The key argument for the above def’n of “
d−→” is actually the following
alternative def’n (which works not only for RVs and RVecs, but also in much
more general cases).
Thm [5.8] Xn
d−→ X as n→∞ iff for any continuous bounded function f
Ef(Xn)→ Ef(X) as n→∞.
It is this property that makes convergence in distribution so useful &
important.
I The proof is somewhat technical, let’s leave it for the future. Just note: for
“nice” f (e.g. differentiable and vanishing outside a bounded interval), one can
integrate by parts to obtain:

f(x) dFXn(x) = −

FXn(x) df(x), so if FXn(x)
converge to F (x) everywhere (perhaps except for at most countable set where
F has jumps), then. . . Makes sense.
136
Ex. If X,Xj ∈ N, j ≥ 1, then Xn d−→ X iff ∀k ∈ N,
P(Xn = k)→ P(X = k), n→∞.
Indeed,
P(Xn = k) = P(Xn ≤ k)−P(Xn ≤ k − 1)
= FXn(k)− FXn(k − 1)
= FXn(k + s)− FXn(k − 1 + s) for any s ∈ (0, 1),
[picture!] and also, for any s ∈ (0, 1),
FXn(k + s) =
k∑
j=1
P(Xn = j).
The assertion follows.
137
Ex. Let Yn be uniformly distr’d on {0, 1, . . . , n}: P(Yn = k) = 1n+1 , 0 ≤ k ≤ n.
Prove that Xn := Yn/n
d−→ X ∼ U [0, 1] as n→∞.
I The limiting DF F (x) = x1(x ∈ [0, 1]) + 1(x > 1) is continuous, so have to
prove: ∀x ∈ R, FXn(x)→ F (x) as n→∞. As P(Yn ≤ k) = k+1n+1 , 0 ≤ k ≤ n,
FXn(x) = P(Yn ≤ nx) =

0, x < 0,
P(Yn ≤ bnxc) = bnxc+1n+1 , x ∈ [0, 1],
1, x > 1.
But bnxc+1n+1 → x as n→∞:∣∣∣∣bnxc+ 1n+ 1 − x
∣∣∣∣ = |bnxc+ 1− (n+ 1)x|n+ 1 ≤ |bnxc − nx|+ 1− xn+ 1 ≤ 2n+ 1 .
BTW, what about the alternative def’n (Thm [5.8])? Look:
Ef(Xn) =
∑n
k=0 f(
k
n )
1
n+1 →
∫ 1
0
f(x) dx = Ef(X), so it’s OK!!
138
Some Relationships among the Modes
a.s.−→
always ⇓
P−→
under integrability conditions⇒
always⇐
L2−→
always ⇓
d−→
[There are some further “restricted” implications, which we didn’t mention.]
Why do these implications hold?
139
1)
a.s.−→ ⇒ P−→
I Assume that Xn a.s.−→ X, i.e. Xn(ω)→ X(ω) for all ω ∈ A, where P(A) = 1.
Fix an arbitrary ε > 0 and set An := {|Xn −X| > ε}. Clearly we cannot have
An i.o. for an ω ∈ A, so that
Ac ⊃ [An i.o.] sl.11≡

k

n≥k
An,
and hence
0 = P(Ac) = P
(⋂
k

n≥k
An
)
continuity of P
= lim
k→∞
P
(⋃
n≥k
An
)
,
so that, as k →∞,
0← P
(⋃
n≥k
An
)
≥ P(Ak) = P(|Xk −X| > ε),
which means that Xn
P−→ X. Bingo.
140
2)
P−→ under integrability conditions⇒ L
2
−→
I This is a bit technical, uses integration theory. Leave it for the future.
3)
P−→ always⇐ L
2
−→
I This immediately follows from Markov’s inequality (Thm [4.40], sl. 79):
P(|Xn −X| > ε) ≤ E|Xn −X|
2
ε2
.
4)
P−→ ⇒ d−→
This one we can prove! Better do that on the next slide.
141
I Let t be a continuity point of FX , ε > 0 be fixed. Then
FXn(t) = P(Xn ≤ t)
= P(Xn ≤ t, |Xn −X| ≤ ε︸ ︷︷ ︸
⊂{X≤t+ε}
) + P(Xn ≤ t, |Xn −X| > ε)
≤ P(X ≤ t+ ε) + P(|Xn −X| > ε)
= FX(t+ ε) + P(|Xn −X| > ε),
and similarly
FX(t− ε) ≤ FXn(t) + P(|Xn −X| > ε).
That is,
FX(t− ε)−P(|Xn −X| > ε) ≤ FXn(t) ≤ FX(t+ ε) + P(|Xn −X| > ε).
As ε > 0 is arbitrarily small, can choose it so that FX(t± ε) will be arbitrarily
close to FX(t) (can do that as FX is continuity at t), and P(|Xn−X| > ε)→ 0
as n→∞. So what happens to FXn(t) then? Bingo.
142
Examples and Counterexamples
Consider Ω = [0, 1] with P = U [0, 1]. [Pictures!]
1) Let Xn := n1(0,1/n), X ≡ 0. Then:
Xn
a.s.−→ X, and hence also P−→, d−→, but Xn
L2
6−→ X, Xn
L1
6−→ X, since
E|Xn −X| = EXn = 1 6→ 0.
2) Let Xn :=

n1(0,1/n), X ≡ 0. Then we’ll have the same relations, except
for having now Xn
L1−→ X since E|Xn −X| = EXn = 1/

n→ 0.
3) Let X ≡ 0,
X1 := 1[0,1/2), X2 := 1[1/2,1),
X3 := 1[0,1/3), X4 := 1[1/3,2/3), X5 := 1[2/3,1),
X6 := 1[0,1/4), X7 := 1[1/4,2/4), X8 := 1[2/4,3/4), X9 := 1[3/4,1),
etc. Then Xn
a.s.
6−→ X, but P−→, d−→, L
2
−→, L
1
−→.
143
4) Let X ≡ 0,
X1 := 21[0,1/2), X2 := 21[1/2,1),
X3 := 31[0,1/3), X4 := 31[1/3,2/3), X5 := 31[2/3,1),
X6 := 41[0,1/4), X7 := 41[1/4,2/4), X8 := 41[2/4,3/4), X9 := 41[3/4,1),
etc. Then Xn
a.s.
6−→ X, but P−→, d−→, but
L1
6−→ since E|Xn −X| = EXn = 1 6→ 0
(and hence also
L2
6−→).
5) If Xn := 1[0, 12+
1
n ]
, X := 1[ 12 ,1], then Xn
d−→ X, but all other →’s fail.
6) If Xn ∼ U [ 12 − 1n , 12 + 1n ], X ≡ 12 , then Xn
d−→ X, even though
FXn(
1
2 ) =
1
2 6→ 1 = FX( 12 ). Can we assert
P−→? What about L
2
−→ and a.s.−→?
144
Convergence under Transformations
1) As for the “usual convergence” of sequences in R (or Rd),
if, for two sequences of RVs {Xn} and {Yn} given on a common probability
space, Xn
a.s.−→ X and Yn a.s.−→ Y as n→∞, then also Xn + Yn a.s.−→ X + Y .
(Thm [5.19])
Indeed, we know that {Xn} converges on an event A with P(A) = 1 and {Yn}
converges on an event B with P(B) = 1. Clearly, P(A ∩B) = 1 and both
sequences converge on the event A ∩B.
[BTW: What about the products XnYn?]
The same applies to
P−→: as |(Xn + Yn)− (X + Y )| ≤ |Xn −X|+ |Yn − Y |,
P
(|(Xn + Yn)− (X + Y )| > ε) ≤ P(|Xn −X| > ε/2)+ P(|Yn − Y | > ε/2).
[BTW: What about the products XnYn?]
145
2) If Xn
L2−→ X and Yn L
2
−→ Y as n→∞, then also Xn + Yn L
2
−→ X + Y .
This follows from the triangle inequality for L2-norm, which follows from
Cauchy-Bunyakovsky inequality (Cor [4.36], sl. 81).
But for the products we can only claim that XnYn
L1−→ XY — and we don’t
even know if XnYn ∈ L2, but that it’s in L1 follows from Cauchy-
Bunyakovsky inequality! The
L1−→-claim follows from
XnYn −XY = XnYn −XnY +XnY −XY = Xn(Yn − Y ) + (Xn −X)Y
and the Cauchy-Bunyakovsky inequality (once again).
146
3) Of course, if we only know that Xn
d−→ X and Yn d−→ Y as n→∞, then
even asking if Xn + Yn
d−→ X + Y is meaningless: Xn and Yn may be defined
on different probability spaces, we cannot add them!
4) But we have this important general result:
Thm [5.23] Let g : R→ R be a continuous function. Then, as n→∞,
a) if Xn
a.s.−→ X then g(Xn) a.s.−→ g(X),
b) if Xn
P−→ X then g(Xn) P−→ g(X),
c) if Xn
d−→ X then g(Xn) d−→ g(X).
This theorem holds for RVecs in Rd as well, and beyond.
I a) is obvious.
b) This we could prove as in the text, but it’s more instructive to go another
way basing on this fundamental result from Real Analysis: any function g
continuous on a closed bounded interval [a, b] is uniformly continuous there.
147
Uniform continuity: ∀ε > 0, ∃δ > 0 s.t. [Pictures & examples.]
x, y,∈ [a, b], |x− y| ≤ δ =⇒ |g(x)− g(y)| < ε.
Fix an ε > 0, let An,ε :=
{|g(Xn)− g(X)| > ε}. Then, ∀N > 0, IN := [−N,N ],
P(An,ε) = P(An,ε; X ∈ IN ) + P(An,ε; X 6∈ IN )
≤ P(An,ε; X ∈ IN ) + P(X 6∈ IN )
≤ P(An,ε; X ∈ IN , |Xn −X| ≤ δ) + P(|Xn −X| > δ) + P(X 6∈ IN )
=: P1 + P2 + P3.
Now, for an arbitrary small η > 0, we can choose N so large that P3 < η/2.
Next, since g is uniformly continuous on [−N − 1, N + 1], we can choose δ < 1
so small that if x, y ∈ [−N − 1, N + 1], |x− y| ≤ δ then |g(x)− g(y)| < ε. But
this yields P1 = 0.
Finally, as Xn
P−→ X, for all large enough n holds P(|Xn −X| > δ) < η/2,
and that will imply P(An,ε) < η. Bingo for b).
148
c) is obvious from Thm [5.8]: we have to show that
Yn := g(Xn)
d−→ Y := g(X), but since for a bounded continuous f the
composition (f ◦ g)(x) := f(g(x)) is also bounded (as f is bounded) and
continuous (as g is continuous and hence f ◦ g is), we have
Ef(Yn) = E(f ◦ g)(Xn)→ E(f ◦ g)(X) = Ef(Y ).
Total bingo.
149
Now we turn to the convergence results mentioned on sl. 66.
Thm [4.9] (Monotone Convergence Theorem.)If Xn ≥ 0 are RVs on a
common probability space and Xn ↑ X a.s. as n→∞, then EXn ↑ EX.
I For any n ≥ 1, there exists a sequence of simple RVs X(k)n ↑ Xn as k →∞,
and for them EX
(k)
n ↑ EXn (cf. Def [4.4], Prpn [4.5]).
Now X(k) := maxn≤kX
(k)
n , k ≥ 1, are clearly also simple RVs, and
X(k−1) ≤ X(k) ≤ Xk, k ≥ 1.
By monotonicity, there exists the a.s. limit limk→∞X(k) =: Y.
Since, for any n ≥ 1, one has (as k →∞)
Xn Y X
↑ ↑ ↑
X
(k)
n ≤ X(k) ≤ Xk
a.s., one concludes that Xn ≤ Y ≤ X a.s. for any n ≥ 1.
150
Hence Y = X a.s. (recall that Xn ↑ X a.s. as n→∞).
Therefore, for the simple RVs X(k), one has X(k) ↑ X a.s. Hence by
Prpn [4.5] one has the last relation (as k →∞) in the next line:
EX ≥ EXk ≥ EX(k) ↑ EX.
We conclude that also EXk ↑ EX. Bingo!
151
Thm [4.8] (Fatou’s Lemma.) If Xn ≥ 0 are RVs on a common probability
space then
E lim inf
n→∞ Xn ≤ lim infn→∞ EXn.
Recall: lim infn→∞ xn = limn→∞ infk≥n xk is the least partial limit for the
sequence {xn}n≥1 ⊂ R.
I Set X := lim infn→∞Xn ≡ limn→∞ Yn, where Yn := infm≥nXm ↑ X as
n→∞, 0 ≤ Yn ≤ Xn. By the Monotone Convergence Theorem,
EX = lim
n→∞EYn = lim infn→∞ EYn ≤ lim infn→∞ EXn.
Bingo!
152
Thm [4.16] (Dominated Convergence Theorem.) If |Xn| ≤ c <∞ a.s. for
any n ≥ 1, Xn → X a.s. as n→∞, then there exists limn→∞EXn = EX.
Note that the first condition can be replaced with: |Xn| ≤ Y a.s. for any
n ≥ 1, where EY <∞. Check the proof below!
I Since Xn + c ≥ 0, c−Xn ≥ 0 for any n ≥ 1, by Fatou’s lemma one has
E lim inf
n→∞ Xn ≤ lim infn→∞ EXn,
E lim sup
n→∞
Xn ≥ lim sup
n→∞
EXn.
But lim infn→∞Xn = lim supn→∞Xn = limn→∞Xn = X by assumption, so
the LHS’s in the above formulae coincide with each other and with EX. Hence
lim sup
n→∞
EXn ≤ EX ≤ lim inf
n→∞ EXn,
which can only hold if there exists the limit limn→∞EXn = EX.
Bingo!
153
Our First Limit Theorems: Sums of Bernoullia RVs
Suppose {Xn} is an i.i.d. sequence of B(p)-RVs,
P(Xj = 1) = 1−P(Xj = 0) = p ∈ (0, 1), q := 1− p.
Recall: for Sn := X1 + · · ·+Xn,
P(Sn = k) =
(
n
k
)
pkqn−k, k = 0, 1, . . . , n.
Laws of Large Numbers (LLNs)
Thm [5.30] (Weak LLN.)
Sn
n
P−→ p as n→∞.
I As L
2
−→ implies P−→, we only need to prove the former, which is obvious:
E
(
Sn
n
− p
)2
=
E(Sn − np)2
n2
=
Var (Sn)
n2
=
nVar (X1)
n2
=
npq
n2
=
pq
n
→ 0.
aNamed after Jacob Bernoulli (1654–1705), whose Ars Conjectandi (1713) contained the
first proof of the WLLN.
154
Thm [5.31] (Strong LLN.)
Sn
n
a.s.−→ p as n→∞.
I For ε > 0, set An(ε) := {|Sn/n− p| > ε}. The main task: show that An(ε)
i.o. w.p. 0. [Indeed, then

k[An(1/k) i.o.] also has probability 0, and we are
done.]
Refer to Thm [1.27] (Borel-Cantelli) on sl. 26: for this it suffices to show that
∞∑
n=1
P(An(ε)) <∞.
Show that: by Markov’s inequality (Thm [4.40], sl. 79),
P(An(ε)) = P
(|Sn − np| > nε) ≤ E(Sn − np)4
n4ε4
.
If we show that E(Sn − np)4 ≤ cn2, we are done: then P(An(ε)) ≤ c1n−2, and∑
n n
−2 <∞!
155
Letting X˜j := Xj − p (so that EX˜j = 0), we have
E(Sn − np)4 = E
( n∑
j=1
X˜j
)4
= E
[ n∑
j=1
X˜4j + 6

jX˜2j X˜
2
k︸ ︷︷ ︸
E(··· )=(EX˜21 )2
+

terms containing X˜1j︸ ︷︷ ︸
E(··· )=0
]
= nEX˜41 + 3n(n− 1)(EX˜21 )2
= n [(1− p)4p+ (1− q)4q]︸ ︷︷ ︸
=pq(p3+q3)≤pq≤1/4
+3n(n− 1) (pq)2︸ ︷︷ ︸
≤(1/4)2=1/16<1/12
≤ n
4
+
n(n− 1)
4
=
n2
4
.
Total bingo. We completed the proof.
156
NB: This result validates Probability Theory (as we constructed it):
the Xj ’s are actually indicators of independent events occurring with the same
probability p. Can interpret this as a sequence of independent trials,
observing/not observing the same event in each of them.
Thus Sn is the total number of occurrences of our event in n trials, and
Sn/n = relative frequency of the event. We showed that, for “almost all”
sequences of trials, the relative frequency tends to p, which is exactly what we
aimed to reproduce with our mathematical model.
157
Ex. Consider Ω = [0, 1], P = U [0, 1], Yj(ω) = the jth digit in the decimal
expansion of the number ω = 0.Y1Y2Y3 . . . ∈ [0, 1].
Then {Yj} is an i.i.d. sequence, with P(Y1 = k) = 110 , k = 0, 1, . . . , 9. Both
claims are obvious [picture!].
What is the freq’cy of a given digit k in a “typical number’s” decimal exp’n?
Letting Xj := 1(Yj = k), we get an i.i.d. sequence of Bernoulli RVs with
p = 110 , with the frequency of k in the first n digits of ω given by Sn/n.
Now from the SLLN we know that, for Ck := {Sn/n→ p ≡ 110 as n→∞}, one
has P(Ck) = 1. And this holds for all k = 0, 1, . . . , 9, so that P
(⋂9
k=0 Ck
)
= 1.
This means that, for ALMOST ALL numbers ω ∈ [0, 1], each of the ten
decimal digits appears with frequency 110 in ω’s decimal expansion.
Most rationals are exceptions, of course (e.g. 1/3 = 0.3333 . . . etc.: they all
have periodic expansions starting from some place) — but there are only
countably many of them, a negligibly small proportion of all numbers in [0, 1]!
158
Q: Could we use the same tools to establish the WLLN & SLLN in the general
case (when Xj 6∼ B(p))?
A: To some extent. Look: all we needed in the proof of the WLLN was that
E(Sn − np)2 ≡ E(Sn − nµ)2 ≡ Var (Sn) = nVar (X1).
But this will hold in the general case, for uncorrelated Xj with common
values of µ := EXj and Var (Xj) (we even don’t need independence!). The
same argument shows that Sn/n
P−→ µ in this case.
For the SLLN, our argument will still work for i.i.d. Xj with EX
4
j <∞.
However, one doesn’t need that much: in fact, in the i.i.d.-case,
SLLN ⇔ E|X1| <∞. The proof of this is much more sophisticated, leave it
for the future.
159
We will not discuss here de Moivre-Laplace limit theorem (=“local CLT”) for
i.i.d. Xj ∼ B(p) that describes the behaviour of
P(Sn = k)
when we keep k “

n-close” to the mean ESn = np: these probabilities can
then be approximated by multiples of the normal density values. [Picture.]
The proof of this is based on:
(a) the binomial formula P(Sn = k) =
(
n
k
)
pkqn−k and
(b) Stirling’s forumlaa: k! =

2pikk+1/2e−k(1 + o(1)) as k →∞ (o(1)→ 0).
We won’t discuss here the Poisson limit theorem (“the law of small numbers”)
which concerns approximating P(Sn = k) in situations where p = pn → 0 s.t.
npn → λ ∈ (0,∞): we will do it as an exercise.
Instead, we will turn to powerful (analytic) tools that can be used to analyse
the behaviour of probability distributions in much more general situations.
aCheck how well it works! In fact, k! =

2pikk+1/2e−k+θ(k), 1
12k+1
< θ(k) < 1
12k
.
160
8. Characteristic Functions (ChFs)
Def [6.1]. For an RV X, its ChF ϕX(t) : R→ C is defined by
ϕX(t) := Ee
itX =

eitxdFX(x).
Recall Euler’s formula: eit = cos t+ i sin t, |eit| = 1 for t ∈ R. [Picture.]
So ϕX(t) = E cos(tX) + iE sin(tX), always exists and is finite. Moreover,
|ϕX(t)| =
∣∣EeitX ∣∣ ≤ E∣∣eitX ∣∣ = 1, ϕX(0) = Eei0X = E1 = 1.
The ChF of distribution (or DF) F is the ChF of X ∼ F .
NB: for X ∈ Z, ϕX(t) =

eitkP(X = k); for AC X’s, ϕX(t) =

eitxfX(x) dx.
What’s the point of introducing ChFs? To use Fourier analysis: represent a
given function as a “mixture” of harmonic oscillations eitx with different
frequencies t. Analogy: orthogonal basis expansion, coordinates etc. And most
importantly: DF
1−1←→ChF, and there is MORE! We will see.
161
Ex. X ≡ c = const ⇒ ϕX(t) = Eeitc = eitc.
Ex. X ∼ B(p) ⇒ ϕX(t) = EeitX = peit·1 + qeit·0 = 1 + p(eit − 1).
Ex. X ∼ U [0, 1] ⇒ ϕX(t) =
∫ 1
0
eitxdx =
1
it
[
eitx
]1
0
=
eit − 1
it
.
Oops: what if t = 0? Trouble? Nope. Look: |eit − 1| ≤ |t|. [Picture.]
Indeed, eit − 1 = ∫ t
0
(eis)′ds = i
∫ t
0
eisds, where |eis| = 1.
Ex. X ∼ N [0, 1] ⇒ ϕX(t) =

eitx
e−x
2/2

2pi
dx =
1√
2pi

eitx−x
2/2dx
=
1√
2pi

e−
1
2 (x
2−2itx±(it)2)dx = e−t
2/2 1√
2pi

e−
1
2 (x−it)2dx︸ ︷︷ ︸
=1
= e−t
2/2.
Why = 1? Imagine we had µ ∈ R instead of it in e− 12 (x−it)2 , OK? Formal
proof: Cauchy thm for integrals of analytic f’ns over closed contours. [Picture.]
162
Prpn. If Y = aX + b, where a, b ∈ R are constants, then ϕY (t) = eitbϕX(at).
I Obvious: ϕY (t) = Eeit(aX+b) = Eeit(aX)eitb = eitbEei(ta)X = eitbϕX(at).
Ex. If Y ∼ U [a, b], then Y d= a+ (b− a)X, X ∼ U [0, 1], and so
ϕY (t) = e
itaϕX((b− a)t) = eita × e
i(b−a)t − 1
i(b− a)t =
eibt − eiat
i(b− a)t .
Can verify by a direct calculation: ϕY (t) =
∫ b
a
eitx
1
b− a dx.
Ex. If X ∼ N(µ, σ2), then X d= µ+ σZ, Z ∼ N [0, 1], and so
ϕX(t) = e
itµϕZ(σt) = e
itµe−(σt)
2/2 = exp
{
itµ− σ
2t2
2
}
.
163
Recall: for z = x+ iy ∈ C (x, y ∈ R are the real and imaginary parts of z), its
complex conjugate z¯ := x− iy [Picture.], z1 + z2 = z¯1 + z¯2 (obvious).
NB: eiu = e−iu (e.g. from Euler’s formula) and so
ϕX(t) =

eitxdFX(x) =

eitxdFX(x) =

e−itxdFX(x) = ϕX(−t) = ϕ−X(t).
Thus we established the following
Prpn. ϕX(t) = ϕX(−t) = ϕ−X(t).
This will prove quite handy. For instance, note that if X is a symmetric RV,
i.e. X
d
= −X, then we obtain ϕX(t) = ϕ−X(t) = ϕX(t) by the Prpn.
Now what does z = z¯ mean? It means that z ∈ R, so that the ChF of a
symmetric RV X is always real-valued. We will see later that the converse
is true as well!
164
Prpn [6.2]. Any ChF is uniformly continuous.a
I Fix an arbitrary ε > 0, consider IN := [−N,N ]. Then, ∀t, h ∈ R,
|ϕX(t+ h)− ϕX(t)| =
∣∣Eei(t+h)X −EeitX ∣∣ = ∣∣E(ei(t+h)X − eitX)∣∣
=
∣∣EeitX(eihX − 1)∣∣ ≤ E∣∣eitX(eihX − 1)∣∣
= E |eitX |︸ ︷︷ ︸
=1
|eihX − 1| = E |eihX − 1|︸ ︷︷ ︸
≤|eihX |+|1|=2
= E( |eihX − 1|︸ ︷︷ ︸
≤|hX|≤|h|N
; X ∈ IN ) + E(|eihX − 1|; X 6∈ IN )︸ ︷︷ ︸
≤2P(X 6∈IN )
≤ |h|N + 2P(X 6∈ IN ).
Now first choose N so large that the 2nd term < ε/2, and then the whole
thing will be < ε for |h| < ε2N (regardless of the value of t!). This proves
uniform continuity. Bingo.
aRe uniform continuity, see sl. 149.
165
Thm [6.4]. If X and Y are independent RVs, then ϕX+Y (t) = ϕX(t)ϕY (t).
NB1: Not ϕX(t) + ϕY (t)!! This isn’t even a ChF! (Why?)
NB2: Products are MUCH easier to compute than convolutions!
I ϕX+Y (t) = Eeit(X+Y ) = E eitXeitY︸ ︷︷ ︸
independent RVs
= EeitXEeitY = ϕX(t)ϕY (t). Good.
Ex. If X ∼ N(µX , σ2X) and Y ∼ N(µY , σ2Y ) are independent, then
ϕX+Y (t) = exp
{
it(µX + µY )− t
2
2
(σ2X + σ
2
Y )
}
.
If we had the uniqueness result (i.e. DF
1−1←→ChF), that would mean that
X + Y ∼ N(µX + µY , σ2X + σ2Y ). A bit later.
NB: Try to use convolution to derive that result. No fun.
166
Thm [6.11]. Let k ≥ 1. If E|X|k <∞, then ϕX(t) is k times continuously
differentiable and EXk = i−k
dk
dtk
ϕX(t)
∣∣∣∣
t=0
.
In particular, EX = −iϕ′X(0), EX2 = −ϕ′′X(0) (when the moments exist).
I d
k
dtk
ϕX(t) =
dk
dtk
EeitX
?
= E
dk
dtk
eitX = E(iX)k eitX = ikEXk eitX , which
turns into ikEXk when we put t := 0. We just need to justify
?
=: this follows
from the Dominated Convergence Theorem (sl. 154) — and we need
E|X|k <∞ for that. Good.
The converse is true for even k and “almost true” for odd k. Thus, if ϕX(t) is
twice differentiable at zero, then EX2 <∞ (and hence ϕX(t) is everywhere
twice differentiable, cf. Thm [6.12]). Ex: N(µ, σ2).
Thus, the smoother ϕX(t), the lighter the “tails” of FX , and the other way
around!
167
Inversion Formulae and Uniqueness
Thm [6.7]. If
∫ |ϕX(t)|dt <∞ then X has a continuous density given by
fX(x) =
1
2pi

e−itxϕX(t) dt.
This is a general result from Fourier analysis.
Meaning (a bit loose): One can think of fX as a sum of harmonic oscillations
with different frequencies t. When computing ϕX(t), we find how strong the
contribution of oscillations at frequency t to fX is (it’s like an orthogonal basis
expansion of a vector). Then we can “assemble” fX back from these
oscillations, and this is what the inversion formula does.
NB: Thm [6.7] implies that there is a one-to-one correspondence between
distributions and their ChFs (at least in the AC case, when the ChF is
integrable, but it’s true always, we’ll discuss that a bit later).
Summary: Two different distributions cannot have the same ChF!
168
Ex. Consider ϕX(t) =
sin t
t
. How do we know that this is a ChF? See Ex. on
sl. 164: the ChF of U [−1, 1] is
eit·1 − eit·(−1)
it(1− (−1)) =
eit − e−it
2it
=
sin t
t
by Euler’s formula. But clearly
∫ ∣∣ sin t
t
∣∣ dt =∞ (why?), so cannot use the
inversion formula. No wonder though, as otherwise X would have continuous
density, which is wrong.
Ex. Consider ϕX(t) = 1 + p(e
it − 1). Can we use the inversion formula?
NB: If X ∈ Z then ϕX(2pik) = Eei2pikX = 1 for k ∈ Z. So, for such X,
ϕX(t) 6→ 0 as |t| → ∞, whereas for AC X always ϕX(t)→ 0 [Lebesgue thm].
Ex. Consider ϕX(t) = e
−t2/2 (this corresponds to X ∼ N(0, 1), see sl. 163).
The integrability condition is clearly met, so there exists continuous
fX(t) =
1
2pi

e−itxe−t
2/2dt
sl. 163
=
1√
2pi
e−x
2/2, good!
169
Ex. Compute the ChF of X ∼ E(1) (do you see how to extend this to E(λ)?):
ϕX(t) =
∫ ∞
0
eitxe−xdx =
∫ ∞
0
e−(1−it)xdx =
−1
1− it
[
e−(1−it)x
]∞
0
=
1
1− it .
Can one apply the inversion formula?
Nope, as
∫ |ϕX(t)|dt =∞, so cannot use the inversion formula. No wonder
though, as otherwise X would have continuous density, which is wrong.
Now consider the double exponential distribution (the first Laplace distr’n):
fX(x) =
1
2
e−|x|, x ∈ R.
NB: this is a mixture of E(1) and its “mirror reflection” (the distr’n of −Y ,
Y ∼ E(1)), with equal weights, so
ϕX(t) =

eitx
(
1
2fY (x) +
1
2f−Y (x)
)
dx =
1
2
(
ϕY (t) + ϕ−Y (t)
)
=
1
2
(
1
1− it +
1
1− i(−t)
)
=
1
1 + t2
.
170
NB: This ϕX(t) is already integrable on R, so the inversion formula is
applicable, and hence
1
2
e−|x| =
1
2pi

e−itx
1 + t2
dt.
If we replace here x↔ t, and then t with −t, the result can be re-written as
e−|t| =

eitx
pi(1 + x2)
dx.
That is, the ChF of the standard Cauchy distribution is e−|t|.
Try to compute it directly: no fun. Also, note that it is NOT differentiable at
zero, so there is no way for the first moment of the distribution to be finite.
Good.
171
Q: Is there any way to invert the ChF when
∫ |ϕX(t)|dt =∞?
A: Yes. Look: assuming for a moment that we can use Thm [6.7],
FX(y)− FX(x) =
∫ y
x
fX(u) du =
∫ y
x
[
1
2pi

e−ituϕX(t) dt
]
du
?
=
1
2pi
∫ [∫ y
x
e−ituϕX(t) du
]
dt =
1
2pi
∫ [
ϕX(t)
∫ y
x
e−itudu
]
dt
=
1
2pi

e−itx − e−ity
it
ϕX(t) dt, (1)
where one can justify
?
=. The RHS makes sense when
∫ ∣∣ϕX(t)
t
∣∣dt <∞, and
then the LHS still equals the RHS. Formal proof: if Z ∼ N(0, 1) is independent
of X, ε > 0, then ϕX+εZ(t) = ϕX(t)e
−ε2t2/2 is integrable on R, inversion f’la
applies, (1) holds with X + εZ instead of X. Then pass to the limit as ε→ 0.
In the general case, the resulting f’la just has limε→0 on the RHS. Or replace
the

on the RHS with limT→∞
∫ T
−T (the “principal value integral”).
172
The most important conclusion: uniqueness holds in the general case as well.
To different DFs there correspond different ChFs! (Cf. (1)!)
Now we can claim that if, say, the ChF of X is eiµt−σ
2t2/2, then we must
have X ∼ N(µ, σ2) (cf. Ex. on sl. 167).
Also, now we can claim that if X,Y ∼ E(1) are independent then since,
according to sl. 171,
ϕX−Y (t) = ϕX(t)ϕ−Y (t) = ϕX(t)ϕY (−t) = 1
1− it ×
1
1 + it
=
1
1 + t2
,
the difference X − Y must follow the double exponential distribution.a One
can discover that using convolution, but ChFs make things much easier.
And now we can assert that if ϕX(t) is real-valued, then ϕX(t) = ϕ−X(t)
(see sl. 165) implies that X
d
= −X, i.e. the distribution of X is symmetric (as
we promised to justify).
a Here X − Y has the mixture distribution 1
2
E(1) + 1
2
(−E(1)) (gross notation abuse!!),
due to the memoryless property of E(1) (use the TPF).
173
NB: Using the same argument as in Thm [6.11] (sl. 168), one can show that
if
∫ |tkϕX(t)| dt <∞, then X has a k times continuously differentiable density.
Thus, the smoother fX(x), the lighter the “tails” of ϕX(t), and the other way
around! Cf. remark on sl. 168.
NB: Sums of independent RVs have smoother distributions than summands,
as |ϕX(t)ϕY (t)| decays as t→ ±∞ faster than any of the factors.
Ex. Sums of i.i.d. Xj ∼ U [−1, 1] : for Sn := X1 + · · ·+Xn, using sl. 170,
ϕS1(t) =
sin t
t
, ϕS2(t) =
sin2 t
t2
(so that S2 has a continuous density),
ϕS3(t) =
sin3 t
t3
(so that S3 has a continuously differentiable density) etc.
[Picture.]
Ex. What can one say about the distribution of X + Y , where X and Y are
independent, X ∈ Z and Y is AC?
Ex. The sum of two independent singular RVs can be AC.
174
Continuity Theorems & Applications
One of the great things about ChFs is that there is a very simple & useful
relationship between convergence in distribution and that of ChFs.
Thm [6.15] As n→∞, Xn d−→ X ⇔ ∀t ∈ R, ϕXn(t)→ ϕX(t).
I ⇒) Obvious: note that ϕXn(t) = Ef(Xn), where f(x) = eitx is a bounded
continuous function of x (we keep t fixed here), and recall Thm [5.8], sl. 137.
⇐) Not so obvious. What happens is essentially this: the collection of
functions f(x) = eitx for different t ∈ R is rich enough to ensure that
convergence EeitXn → EeitX for all t implies that Ef(Xn)→ Ef(X) for all
bounded continuous f (we can take linear combinations of eitjx for collections
of tj ’s to approximate general f ’s etc).
Or one can use the inversion formula. Thus, assuming for simplicity that
|ϕXn(t)|, |ϕX(t)| ≤ g(t) for some bounded function g(t) s.t. g(t)/t is integrable
at ±∞, we can use the formula from sl. 173 to write:
175
FX(y)− FX(x) = 1
2pi

e−itx − e−ity
it
ϕX(t) dt
=
1
2pi

e−itx − e−ity
it
lim
n→∞ϕXn(t) dt

= lim
n→∞
1
2pi

e−itx − e−ity
it
ϕXn(t) dt
= lim
n→∞(FXn(y)− FXn(x)), (∗)
where

= is justified by the Dominated Convergence Theorem (cf. sl. 154).
Good.
Is it clear that when convergence (∗) holds then Xn d−→ X? If yes, it’s bingo.
176
Q: What if we just know that ϕXn(t)→ ϕ(t) as n→∞, where ϕ is some
function (we don’t know if it’s a ChF of some distribution)?
A: Then bad things can happen, and we can say when they do happen!
Thm [6.17] If ∀t ∈ R holds ϕXn(t)→ ϕ(t) as n→∞, and ϕ(t) is
continuous at t = 0, then ϕ(t) is a ChF of some RV X and Xn
d−→ X.
I The proof is somewhat beyond the scope of this course, so we’ll leave it for
the future. At the moment: just a few words re WHAT happens when the
limiting function ϕ is discontinuous at t = 0.
Ex. Let Xn ∼ U [−n, n]. Then (sl’s 164, 170) ϕXn(t) =
sinnt
nt
{ → 0, t 6= 0,
≡ 1, t = 0.
The limiting f’n ϕ(t) = 1(t = 0) is discont’s at 0, and so cannot be a ChF.
In this example, the probability “escapes to infinity”, and it is in such
situations that one obtains a limit for ϕXn that is discont’s at 0. Look e.g. at
a similar situation where Xn ∼ N(0, n). [Write down ϕXn and see what happens!]
177
NB: Assertions similar to Thms [6.15], [6.17] hold for Laplace transforms
lX(t) := Ee
−tX (popular tools when X ≥ 0) and GFs ζX(z) := EzX (used for
X ∈ Z), as they are basically the same as the ChF.
Now how do these theorems work in applications?
We know that ChFs love addition of independent RVs: they multiply then. So
the technique is well-suited for analysing situations where we add such RVs.
But why do we pay so much attention to sums of RVs in PT?
When a large number of relatively small factors act together, the total effect
can often be (at least, approximately) linear. Which is no wonder though, as,
for a differentiable f ,
f(x + εy) = f(x) + ε

j
∂f(x)
∂xj
yj + o(ε) as ε→ 0,
so we do have a sum after all. . .
178
Thm [6.19] (WLLN) Let X1, X2, . . . be i.i.d. RVs. If E|X1| <∞ then
Sn
n
d−→ µ := EX1 as n→∞ (and hence also P−→, see PS–8, tute problem 1).
I Using the properties of ChFs, ∀t ∈ R,
ϕSn/n(t) = ϕSn(t/n) =
(
ϕX(t/n)
)n
. (∗)
Here the argument t/n→ 0, so one would be inclined to expand ϕX about
zero to see what happens — and we can do that since ϕX is continuously
differentiable with ϕ′X(0) = iµ (due to the assumption that E|X1| <∞). So:
ϕX(s) = ϕX(0) + ϕ

X(0)s+ o(s) = 1 + iµs+ o(s) as s→ 0.
Now back to (∗):
ϕSn/n(t) =
[
1 +
iµt+ o(1)
n
]n
→ eitµ = ϕµ(t),
bingo by Thm [6.15].
179
Right. This means that X is a consistent estimator of µ = EX1 (by
definition). This is good, but to find, say, confidence intervals or evaluate
errors of tests based on X (or other consistent estimators), one needs more.
Namely, one needs to know the distribution (at least, approximately) of the
difference X − µ, which is vanishing as n→∞.
So we need a “magnifying glass” to see any patterns here, which is achieved by
scaling the difference, by considering
bn(X − µ) ≡ bn
(
Sn
n
− µ
)
=
bn
n
(Sn − nµ) for some bn →∞.
For what choice of {bn} will this have a limiting distribution? It turns out
that when the Xj ’s have a finite second moment, the right choice is bn = c

n.
Re what to do when EX2j =∞, we’ll talk a bit later.
First — to the Central Limit Theorem (CLT).
180
Thm [6.20] (CLT) If, in addition to the assumptions of Thm [6.19],
EX2j <∞ and σ2 := Var (X1) > 0, then
Yn :=
Sn − nµ
σ

n
d−→ Z ∼ N(0, 1) as n→∞.
I It suffices to show that ϕYn(t)→ e−t
2/2. First we standardise the Xj ’s by
setting X˜j := (Xj − µ)/σ. Then clearly X˜j are i.i.d. with EX˜j = 0, EX˜2j = 1,
and Yn = S˜n/

n, where S˜n := X˜1 + · · ·+ X˜n. Therefore
ϕYn(t) = ϕS˜n/

n (t) = ϕS˜n(t/

n ) =
(
ϕX˜1(t/

n )
)n
. (∗)
Here t/

n→ 0 as n→∞, and we will again expand ϕX˜1 about zero, but as
EX˜2j <∞, the ChF ϕX˜1 is twice differentiable and we have one more term
in Taylor’s series:
ϕX˜(s) = ϕX˜(0)︸ ︷︷ ︸
=1
+ϕ′

(0)︸ ︷︷ ︸
=iµ˜=0
s+
1
2
ϕ′′

(0)︸ ︷︷ ︸
=−EX˜2=−1
s2 + o(s2) = 1− s
2
2
(1 + o(1)).
181
Now back to (∗):
ϕYn(t) =
[
1− 1
2
( t√
n
)2
(1 + o(1))
]n
=
[
1− t
2(1 + o(1))
2n
]n
→ e−t2/2
as n→∞. Bingo.
NB: Using the same techniques, can extend this to non-identically distributed
independent RVs Xj . Just need to be careful, need an additional condition,
e.g. the Lyapunov condition : assuming all EXj = 0 (no big deal), one has
B−3n
n∑
j=1
E|Xj |3 → 0 as n→∞, where B2n := Var (Sn) =
∑n
j=1 EX
2
j
(the condition ensures that, in the limit, all the Xj ’s are “negligibly small”
compared to the sum Sn). Then for ϕXj one can use Taylor’s expansion with
three terms (the third moment is finite!) and use the condition to show that
one has ϕSn/Bn(t)→ e−t
2/2, so that Sn/Bn
d−→ Z ∼ N(0, 1) (you can DIY!).
182
Now what if EX2j =∞ (in the i.i.d. case)? If µ := EX1 is still finite and hence
X ≡ Snn
P−→ µ, there can still exist another scaling seq’ce bn →∞ such that
bn(X−µ) d−→ something. For this, need “regular variation” of the tails of FX :
F (−x) = x−αL−(x), 1− F (x) = x−αL+(x),
where α ∈ [1, 2], L± are “slowly varying” in the sense that L(vx)L(x) → 1 as
x→∞, v fixed (Ex: lnx, but not xa), and limx→∞ 1−F (x)F (−x) → c ∈ [0,∞].
Then with bn = n
1−1/αl(n), l being another slowly varying f’n, the limiting
distribution will be one of the so-called stable laws. When E|X1| =∞, under
the above conditions on the tails (with α ∈ (0, 1]), n−1/αl(n)Sn will converge
in distribution. Technically, it’s much harder to prove than the CLT.
Important difference: roughly speaking, the contributions of individual Xj ’s to
the sum Sn are all negligibly small in the case of the CLT, whereas in the case
of convergence to a non-normal stable distribution, the main contribution to
Sn comes from a small proportion of the Xj ’s (the largest ones!).
183
Ex. Cauchy distribution with density f(x) =
1
pi(1 + x2)
has ChF ϕ(t) = e−|t|.
Therefore, in this case
ϕSn/n(t) =
(
ϕ(t/n)
)n
=
(
e−|t/n|
)n
= e−|t| = ϕ(t),
i.e.
Sn
n
d
= X1. Wow!!
Thus, say, using the sample mean X to estimate the parameter θ in the
location family of densities fθ(x) =
1
pi(1 + (x− θ)2) would be meaningless, as
X
d
= X1 (no gain in precision compared to a single observation). But: the
sample median X(n/2) would work (later about that).
Cauchy distribution is an example of a non-normal stable distribution.
184
Thm [6.21] (Poisson limit theorem.) If Xn,1, . . . , Xn,n are independent RVs,
P(Xn,j = 1) = 1−P(Xn,j = 0) = pn, j = 1, . . . , n,
and npn → λ ∈ (0,∞) as n→∞, then Sn := Xn,1 + · · ·+Xn,n d−→ Y ∼ P (λ).
I Here
ϕSn(t) =
(
ϕXn,1(t)
)n sl. 163
= (1 + pn(e
it − 1))n =
[(
1 + pn(e
it − 1))1/pn]npn

(
ee
it−1

= eλ(e
it−1) as n→∞, bingo (why?).
Q: Do we really need all Xn,j to have the same distribution here? Seems to be
an overstretch, from the applications view-point. Also, do they really need to
be Bernoulli RVs?
A: Nope to both. Go to the next slide for more.
185
Here is more: suppose that
P(Xn,j = k) =
 1− pn,j − qn,j , k = 0,pn,j , k = 1, P(Xn,j 6∈ {0, 1}) = qn,j .
Then, omitting the subscripts n,j for brevity and assuming that p+ q is small,
ϕX(t) = (1− p− q)eit0 + peit1 + qη(t) = 1 + p(eit − 1) + q(η(t)− 1)
= exp
{
p(eit − 1) + q(η(t)− 1) + o(p+ q)},
where η(t) = E(eitX |X 6∈ {0, 1}) and hence |η(t)− 1| ≤ |η(t)|+ 1 ≤ 2.
Therefore, putting λn :=
∑n
j=1 pn,j , we obtain
ϕSn(t) =
n∏
j=1
ϕXn,j (t) = exp
{
λn(e
it − 1) +O(∑nj=1 qn,j) + o(λn)}→ eλ(eit−1)
provided that λn → λ and maxnj=1 pn,j +
∑n
j=1 qn,j = o(1), good!
186
What do we do in the case of RVecs?
More or less the same, but there is more fun. For X = (X1, . . . , Xd) ∈ Rd, the
ChF is a function of t = (t1, . . . , td) ∈ Rd defined by
ϕX(t) := Ee
i(t,X) = E exp
{
i
d∑
j=1
tjXj
}
.
All the results for univariate ChFs extend in a natural way to the multivariate
case, including 1–1 correspondence between distributions and their ChFs and
continuity theorems. The change under linear transformation has this form: if
Y = XA+ b, where A is a d×m-matrix and b ∈ Rm, then, since for the scalar
product we can write (s, XA) = s(XA)> = sA>X> = (sA>, X), one has
ϕY (s) = Ee
i(s,Y ) = Eei(s,XA)+i(s,b) = ei(s,b)Eei(sA
>,X) = ei(s,b)ϕX(sA
>).
187
The inversion formula will require multivariate integration, calculation of
moments can be done using partial differentiation, e.g.
∂k1+k2
∂tk11 ∂t
k2
2
ϕX(t) = i
k1+k2EXk11 X
k2
2 e
i(t,X).
Letting here t := 0 yields the mixed moment EXk11 X
k2
2 (of course, for this to
work we need E|Xk11 Xk22 | <∞). And so on.
Now observe this: for a fixed unit vector b ∈ Rd, one can write, for the
univariate RV Xb = (b, X) (which is the projection of X on the direction of b),
ϕXb(t) = ϕX(t), t = tb.
Meaning: knowing the ChF of X is equivalent to knowing the ChFs of all the
projections of X on various directions in Rd. Switching to distributions:
knowing the distribution of X is equivalent to knowing the distributions of all
the projections Xb, i.e. the probabilities of all half-spaces [picture!]. Wow.
188
Now consider an i.i.d. sequence of RVecs X(1), X(2), . . . [NB: our notation
here differs from that in the book: we use Xj(k) for the jth component of the
kth vector X(k)], S(n) := X(1) + · · ·+X(n). It is obvious that if the SLLN
(or WLLN) holds for each of the components:
Sj(n)
n
a.s.−→ µj
(
Sj(n)
n
P−→ µj , resp.
)
as n→∞,
where µj = EXj(k), then it also holds for the vectors. (Can you prove that?)
So for the LLNs to hold, we just need E‖X(1)‖ <∞. Easy.
The proof of the CLT based on the use of ChFs is also not very difficult; we’ll
just need to use the multivariate Taylor expansion and the formula for
moments from the previous slide. Before formulating the multivariate CLT, we
consider the following example.
189
Ex. For X ∼ N(µ,C2X) (the normal distribution in Rd with mean µ ∈ Rd and
CovM C2X , see sl. 87–89, the ChF can be computed from the representation
X
d
= ZA+ µ, where Z ∼ N(0, I) is the standard d-dim normal distribution (it
has i.i.d. N(0, 1) components) and A ∈ Rd×d is a “square root” of the
(nonnegative-definite) matrix C2X : A
>A = C2X .
a First we compute
ϕZ(t) = E exp
{
i
d∑
j=1
tjZj
}
=

j
EeitjZj = e−

j t
2
j/2 = e−tt
>/2.
Using the transformation formula from the bottom of sl. 188, we obtain now
ϕX(t) = e
i(t,µ)ϕZ(tA
>) = ei(t,µ)e−(tA
>)(tA>)>/2 = ei(t,µ)−tC
2
Xt
>/2.
BTW, this can be taken as the general definition of the multivariate normal
distribution (whether it has a density or not, i.e. whether it is concentrated on
the whole space Rd or on a subspace of lower dimensionality).
aOne way of finding such an A is to use Cholesky decomposition.
190
Now return to our multivariate random walk S(n) := X(1) + · · ·+X(n),
where X(1), X(2), . . . are i.i.d. RVecs with E‖X(1)‖2 <∞, so that (i) X(1)
has a finite mean (vector) µ := EX(1), and (ii) the CovM C2X of X(1) exists.
We already know that
S(n)
n
P−→ µ as n→∞; what about getting more detail
on this convergence (as given by the CLT in the univariate case)? NB:
Cov (S(n)− µn) = nCov (X(1)).
As we pointed out before (sl. 190), using the multivariate Taylor formula, we
obtain the following analog of the classical CLT. The proof is almost identical
to the one in the univariate case.
Thm (Multivariate CLT). Under the above assumptions, as n→∞,

n
(
S(n)
n
− µ
)
≡ S(n)− µn√
n
d−→ Y ∼ N(0, C2X).
In the case of non-identically distributed X(j)’s, there is more fun, of course.
191
NB: Here and in all our earlier Limit Theorems, one can estimate how fast
the resp. convergence is. For example, in the case of the univariate CLT for
i.i.d. summands, under additional assumptions, one can give an upper bound
for |Fn(x)−Φ(x)|, where Fn is the DF of the std’d Sn, Φ is the DF of N(0, 1).
Thus, if γ := E|X1 − µ|3 <∞, then the difference won’t exceed 0.4748γ
σ3

n
(for
any x; Berry–Esseen Thm). This bound is unimprovable (well, the constant
might be made smaller, but not smaller than 0.4098) [Picture!]. Of course,
1/

n→ 0 pretty slowly; if unhappy with that approximation rate, can go
further (from, say, Fn(x) ≈ Φ(x) to Fn(x) ≈ Φ(x) + E(X1−µ)
3
6σ3

n
(1− x2)Φ′(x),
which has an error O(1/n) under proper conditions, etc.).
Ex. CLT for multinomial distributions. We have d bins and conduct the
following multi-stage RE: at each step, a ball is placed in a randomly chosen
bin, P(jth bin selected) = pj > 0 independently of the past,
∑d
j=1 pj = 1.
Consider the RVec S(n) whose components are counts Sj(n) := # of balls in
bin j after n steps. Describe the behaviour of S(n) as n→∞.
192
It is clear that S(n) =
∑n
k=1X(k), where X(k) ∈ Rd are i.i.d. RVecs with
X(1) =

(1, 0, 0, . . . , 0) =: e1 w.p. p1,
(0, 1, 0, . . . , 0) =: e2 w.p. p2,
· · · · · · · · ·
(0, 0, 0, . . . , 1) =: ed w.p. pd.
We see that the components Xj(1) are dependent B(pj)-RVs. For any fixed j,
however, the RVs Xj(1), Xj(2), Xj(3), . . . are i.i.d., and so by the SLLN one
has Sj(n)/n
a.s.−→ pj . We conclude that
S(n)
n
a.s.−→ p := (p1, p2, . . . , pd) as n→∞.
Since ‖X(1)‖ ≡ 1, the Multivariate CLT clearly holds for S(n). We just need
to find the CovM of X(1), which will give us the covariance matrix of the
limiting normal distribution.
193
As X(1) is always one of the coordinate vectors e1, . . . , ed, one has
Xj(1)Xk(1) = δjkXj(1) ⇒ EXj(1)Xk(1) = δjkEXj(1) = δjkpj ,
where δjk := 1(j = k) is Kronecker’s delta, so that
Cov (Xj(1), Xk(1)) = EXj(1)Xk(1)−EXj(1)EXk(1) = δjkpj − pjpk.
Thus the CovM of X(1) is shown to be
D2p := diag (pj)− p>p.
From the Multivariate CLT we conclude that, as n→∞,
Wn :=

n
(
S(n)
n
− p
)
=
S(n)− np√
n
d−→W ∼ N(0, D2p). (∗)
This result has immediate important implications for statistics! Suppose
Y1, . . . , Yn is an i.i.d. sample of RVs with a common (say, univariate and
unknown to us) DF G. We want to test the hypothesis H0 = {G = F},
where F is some hypothesized DF (‘goodness-of-fit’ testing).
194
One way to do that is to use the χ2-test. What is that about?
Partition R into d intervals: (−∞, t1], (t1, t2], . . . , (td−2, td−1], (td−1,∞).
Denote by Sj(n) the # of sample points Yk that fell into the jth partition
interval. Clearly, the RVec S(n) ∼ multinomial distribution with parameters
(n,p), where p1 := F (t1), p2 := F (t2)− F (t1), . . . , pd−1 := F (td−1)− F (td−2),
pd := 1− F (td−1) under the hypothesis H0. [Picture!]
From the previous slide, (∗) will hold. How to use that for testing H0?
Well, by Thm [5.23] (sl. 148), for any continuous function g(x), the
distribution of g(Wn) will converge to that of g(W ). In particular, this will
hold if we take g(x) := ‖x‖2 = ∑dj=1 x2j :
‖Wn‖2 d−→ ‖W‖2.
As we know the distr’n of W under H0 (it’s N(0, D
2
p)), we can find that
of ‖W‖2 and use it to construct a test of a(n approximately, for large n) given
type I error (that of rejecting H0 when it’s true), e.g. ‖Wn‖2 > r.
195
It would be OK if it were not for the fact that, to use the test for different F ’s
and partitions used, each time we would have to compute the distr’n of ‖W‖2
(as it depends on p), which is a nightmare. Can we get around it?
Yes. There is a simple modification of the test procedure that makes the
(asymptotic, as n→∞) distribution of the test statistic independent of both
p and F ! Such tests are called “asymptotically distribution-free”.
This is what we will do: consider
X˜(1) :=
(
X1(1)− p1√
p1
, . . . ,
Xd(1)− pd√
pd
)
∈ Rd, S˜(n) := X˜(1) + · · ·+ X˜(n).
Then S˜(n)√
n
=
(
S1(n)−np1√
np1
, . . . , Sd(n)−npd√npd
)
= S(n)−np√
n
B with B := diag(pj
−1/2),
and so ∥∥∥∥ S˜(n)√n
∥∥∥∥2 = d∑
j=1
(Sj(n)− npj)2
npj
=: H(n)
will be the famous χ2-statistic.
196
It follows from relation (∗) (sl. 195) and Thm [5.23] that
S˜(n)√
n
d−→WB ∼ N(0, B>D2pB),
with B>D2pB = diag(pj
−1/2)
(
diag(pj)− p>p
)
diag(pj
−1/2) = Id − b>b, where
Id is the unit matrix in Rd, and b := (

p1, . . . ,

pd) is a unit vector:
‖b‖2 = ∑dj=1√pj2 = 1. So, again by Thm [5.23], as n→∞,
H(n) =
∥∥∥∥ S˜(n)√n
∥∥∥∥2 d−→ ‖WB‖2, WB ∼ N(0, Id − b>b). (∗∗)
Now consider Z ∼ N(0, Id) and form the difference
Z − (b, Z)b.
This is the projection of Z onto the (d− 1)-dim hyperplane orthogonal to the
vector b. About this RVec we will make the following two observations (i), (ii):
197
(i) WB
d
= Z − (b, Z)b.
Indeed, both are normal vectors in Rd with zero means, whereas the
covariance matrix of the latter vector equals
C2Z−(b,Z)b = E(Z − (b, Z)b)>(Z − (b, Z)b) = E
(
Z(Id − b>b)
)>
Z(Id − b>b)
= E(Id − b>b)>Z>Z(Id − b>b) = (Id − b>b)>(EZ>Z)(Id − b>b)
= (Id − b>b)>(Id − b>b) = (Id − (b>b)>)(Id − b>b)
= (Id − b>b)(Id − b>b) = I2d − Idb>b− b>bId + (b>b)2
= Id − 2b>b + b>( bb>︸︷︷︸
=
∑√
pj2=1
)b = Id − b>b = C2WB .
As multivariate normal distributions are uniquely specified by their mean
vectors and covariance matrices, the assertion follows.
198
(ii) As N(0, Id) is invariant w.r.t. rotations, the “nature” of the distr’n of the
projection Z − (b, Z)b will be the same as for the projection of Z onto any
other (d− 1)-dim hyperplane, e.g. Z∗ := (Z1, . . . , Zd−1) ∈ Rd−1. Indeed:
Prpn [∗] If Z ∼ N(0, Id) and b1, . . . , bd is an orthonormal system in Rd,
then Y := ((b1, Z), . . . , (bd, Z)) ∼ N(0, Id) as well.
I Compute the ChF of Y : for t = (t1, . . . , td) ∈ Rd, recalling from sl. 191 that
the ChF of Z is ϕZ(t) = e
−tt>/2 and using the fact that
∥∥∑d
j=1 tjbj
∥∥2
=
(∑d
j=1 tjbj ,
∑d
k=1 tkbk
)
=
∑d
j=1
∑d
k=1 tjtk bjb
>
k︸ ︷︷ ︸
=δjk
= tt>, we have
ϕY (t) = Ee
itY > = E exp
{
i
d∑
j=1
tj(bj , Z)
}
= E exp
{
i
( d∑
j=1
tjbj , Z
)}
= exp
{
−1
2
∥∥∥ d∑
j=1
tjbj
∥∥∥2} = e−tt>/2 = ϕZ(t), bingo!
199
The desired assertion (ii) follows, as, taking b1, . . . , bd s.t. bd = b and noting
that Z =
∑d
j=1(bj , Z)bj , we have Z − (b, Z)b =
∑d−1
j=1(bj , Z)bj , which is a
(d− 1)-dim RVec with i.i.d. comp’s (bj , Z) ∼ N(0, 1), like Z∗ = (Z1, . . . , Zd−1).
Therefore
‖Z − (b, Z)b‖2 d=
∥∥∥d−1∑
j=1
(bj , Z)bj
∥∥∥2 d= ‖Z∗‖2 = d−1∑
j=1
Z2j ∼ χ2d−1
by the definition of χ2-distribution with d− 1 degrees of freedom.
Now from convergence (∗∗) and properties (i), (ii) we conclude that
H(n)
d−→ ‖WB‖2 d= ‖Z − (b, Z)b‖2 ∼ χ2d−1. Thus the limiting distribution
doesn’t depend neither on F nor on the partition used! Very handy: can use
the same χ2-distribution for any such goodness-of-fit test with the same
number of “bins” (in a much more general context than testing our H0!).
200
BTW, basically the same argument shows that, for i.i.d. Xj ∼ N(0, 1), one has∑n
j=1(Xj −X)2 ∼ χ2n−1. Going just a bit further, one can also notice that
this sum of squares and X will be independent RVs.
This follows from Prpn [∗]: just take an orthonormal system b1, . . . , bn in Rn
s.t. bn = (n
−1/2, . . . , n−1/2) (clearly, ‖b‖ = 1). Then X = n−1/2(bn, X) and,
as X =
∑n
j=1(bj , X)bj , one has
n∑
j=1
(Xj −X)2 = ‖X − (bn, X)bn‖2 =
∥∥∥n−1∑
j=1
(bj , X)bj
∥∥∥2,
which is independent of (bn, X) ∼ N(0, 1) (and has the same distribution as
‖∑n−1j=1 X2j ‖2 ∼ χ2n−1) by the Proposition.
This is why, in particular, X√
1
n−1
∑n
j=1(Xj−X)2
d
= 1√
n
× Xn√
1
n−1
∑n−1
j=1 X
2
j
, where
the distr’n of the last quotient is called the t-distribution with n− 1 degrees of
freedom (used in tests/CIs for means when the variance is unknown).
201
9. Further Applications in Statistics
Empirical Distribution Functions and Empirical Processes
Suppose X1, . . . , Xn is an i.i.d. sample, the DF F of the Xj ’s being (at least,
partly) unknown. Recall (PS–6): the vector S := (X(1), . . . , X(n)) of order
statistics for the sample is an SS for F .
An alternative way of representing the information stored in S is the
empirical distribution function (EDF, Def [7.26]; note that we use a
slightly different notation)
F ∗n(t) :=
1
n
n∑
j=1
1(Xj ≤ t) ≡
n∑
j=1
1
n
1(X(j) ≤ t), t ∈ R.
NB: This is the DF of the random probability distribution P ∗n :=
n∑
j=1
1
n
εXj
that assigns probabilities 1/n to each of the sample points, and which is
well-defined when the Xj ’s are from a general space (e.g. Xj ∈ Rd etc.).
202
It is clear that we can “extract” the values of all the order statistics from the
EDF F ∗n (they are just the jump points of the function), so the EDF (the
whole function, not just its value at some t!) is an SS as well.
Observe that quite a few useful statistics/estimators can easily be expressed
from the EDF, e.g. the sample mean
X =

tdF ∗n(t)
and the sample variance
s2 :=
1
n
n∑
j=1
(
Xj −X
)2
=
1
n
n∑
j=1
X2j −X
2
=

t2dF ∗n(t)−
(∫
tdF ∗n(t)
)2
are the mean and variance of the EDF, and likewise the sample median
m̂ :=
 X(k) if n = 2k − 11
2 (X(k) +X(k+1)) if n = 2k
is the median of the EDF.
203
NB: Such estimators are often called “plug-in estimators” (or the method of
substitution estimators), as they estimate a parameter θ that can be expressed
as θ = G(F ), G being a (nice) functional (= function of a function), by
plugging the EDF into G in place of F :
θ∗ := G(F ∗n).
Now, if we had
• F ∗n → F (in some suitable sense) as n→∞, and
• G continuous (again, in some suitable sense),
then we would be able to establish consistency: θ∗ = G(F ∗n)→ G(F ) = θ.
Q: So does one have F ∗n → F? In what sense?
A: One does. In a very strong sense. Go to the next slide.
204
Thm [7.27] (Glivenko–Cantelli) Let X1, X2, . . . be i.i.d. RVs with a common
DF F . Then, as n→∞,
Dn := sup
t
|F ∗n(t)− F (t)| a.s.−→ 0.
I 1) For any fixed t ∈ R, by the SLLN (for the Bernoulli scheme),
F ∗n(t) ≡
1
n
n∑
j=1
1(Xj ≤ t)︸ ︷︷ ︸
i.i.d. B(F (t))
a.s.−→ F (t).
2) Hence, for any fixed collection −∞ = t0 < t1 < · · · < td−1 < td =∞,
Mn := max
k≤d
|F ∗n(tk)− F (tk)| a.s.−→ 0.
205
3) Since both F ∗n and F are non-decreasing, nothing really BAD can happen
between the nodes tk: for s ∈ (tk−1, tk] and ∆kF := F (tk)− F (tk−1),
F ∗n(s)− F (s) =
 ≤ F ∗n(tk)− F (tk−1) = F ∗n(tk)− F (tk) + ∆kF,≥ F ∗n(tk−1)− F (tk) = F ∗n(tk−1)− F (tk−1)−∆kF.
Thus
Dn ≡ sup
t
|F ∗n(t)− F (t)| ≤Mn + max
k≤d
∆kF.
4) Assume for simplicity that F is continuous (if not, no big deal: one just
deals with its “large” jumps separately). Then, ∀ε > 0, can choose d large
enough and {tk} s.t. maxk≤d ∆kF < ε/2. Now, in view of 2), we also have
Mn < ε/2 for all large enough n, and then we have Dn < ε. Bingo.
This result, as we said, can be used to establish consistency of estimators.
But wait: there is more!
206
It is tempting to use Dn for goodness-of-fit testing. To do that, need the
distr’n of Dn, which apparently depends on F . If this is so, it’s a mess. But:
Assume that F is continuous and let Q be its quantile function, so that
F (Q(t)) ≡ t, t ∈ (0, 1). Then, as we saw (sl. 49), X d= Q(U) for U ∼ U [0, 1],
and so we can think of the sequence {Xj} as obtained by Q-transforming an
i.i.d. sequence {Uj} with Uj ∼ U [0, 1], so that
1(Xj ≤ t) = 1(Q(Uj) ≤ t) = 1(Uj ≤ F (t)), F ∗n(t) = R∗n(F (t)),
where R∗n(u) :=
1
n
∑n
j=1 1(Uj ≤ u) is the uniform EDF.
This, in particular, implies that
Dn ≡ sup
t
|F ∗n(t)− F (t)| = sup
t
|R∗n(F (t))− F (t)| = sup
u∈[0,1]
|R∗n(u)− u|.
Wow! This means that the distribution of Dn is one and the same for all
continuous F , which is very handy for goodness-of-fit testing.
But wait: there is more!
207
From the binomial CLT, ∀u ∈ [0, 1],

n
(
R∗n(u)− u
) d−→ V (u) ∼ N(0, u(1− u)) as n→∞.
From the multinomial CLT (sl. 193), for any 0 ≤ u1 < · · · < un ≤ 1,

n
(
R∗n(u1)− u1, . . . , R∗n(ud)− ud
) d−→ (V (u1), . . . , V (ud)) ∼ N(0, C2(u)),
where u = (u1, . . . , ud) and elementary calculations show that the CovM has
the form
C2(u) =
[
min{uj , uk}(1−max{uj , uk})
]
j,k=1...,d
.
In fact, if we consider
{√
n
(
R∗n(u)− u
)}
u∈[0,1] as a random process on [0, 1]
(it is called the empirical process), then it will converge in distribution (on the
space C[0, 1] of continuous functions on [0, 1]) to the so-called Brownian
Bridge process {V (u)}u∈[0,1] (a close relative of the Brownian motion
process, a.k.a. the Wiener process).
208
So what? It follows (Kolmogorov Thm) that, as n→∞,

nDn
d
= sup
u∈[0,1]
∣∣√n(R∗n(u)− u)∣∣ d−→ max
u∈[0,1]
|V (u)|,
and the good news is that the distribution of the RHS is known:
P
(
max
u∈[0,1]
|V (u)| ≤ x
)
= K(x) := 1 + 2
∞∑
k=1
(−1)ke−2k2x2 , x > 0.
Therefore limn→∞P
(√
nDn ≤ x
)
= K(x), and one can use that for
goodness-of-fit testing (Kolmogorov test).
Another famous example is von Mises–Smirnov ω2-test based on the statistic
ω2n := n
∫ [
(F ∗n(t)− F (t))
]2
dF (t)
d
=
∫ 1
0
[√
n(R∗n(u)− u)
]2
du.
Again, because of the convergence of empirical processes (in distribution),
lim
n→∞P(ω
2
n ≤ x) = P
(∫ 1
0
V 2(u)du ≤ x
)
, x > 0, which is known, etc.
209
Asymptotic Normality & Efficiency of the
Maximum Likelihood Estimator
Recall: basing on an i.i.d. sample X = (X1, X2, . . . , Xn) with the Xj ’s having
density fθ(x) (w.r.t. to some measure µ), we construct the maximum
likelihood estimator (MLE) as
θˆ = θˆn(X) := arg max
θ
fθ(X) ≡ arg max
θ
n∏
j=1
fθ(Xj) ≡ arg max
θ
L(X, θ),
where L(X, θ) := ln fθ(X) =
∑n
j=1 l(Xj , θ) is the log-likelihood function
(here l(x, θ) = ln fθ(x), of course).
Ex. If Xj ∼ B(p) then fθ(x) = px(1− p)1−x, x = 0, 1 (w.r.t. the counting
measure on the integers), so L(X, p) =
∑n
j=1[Xj ln p+ (1−Xj) ln(1− p)] =
n
[
X ln p+ (1−X) ln(1− p)] and 0 = ∂L∂p = Xp − 1−X1−p ⇐⇒ p = pˆ := X.
210
Ex. If Xj ∼ N(µ, σ2), then fθ(x) = 1√2piσ e−(x−µ)
2/2σ2 (w.r.t. the Lebesgue
measure on R), θ = (µ, σ2) ∈ R× R+. Here clearly
L(X, θ) = −n
2
ln 2pi − n lnσ − 1
2σ2
n∑
j=1
(Xj − µ)2.
Equations for the critical point(s):
0 =
∂L
∂µ
=
1
σ2
n∑
j=1
(Xj − µ) ≡ n
σ2
(X − µ),
0 =
∂L
∂σ
= −n
σ
+
1
σ3
n∑
j=1
(Xj − µ)2 ≡ − n
σ3
[
σ2 − 1
n
n∑
j=1
(Xj − µ)2
]
.
Obvious solution (clearly a max): µˆ = X, σ̂2 = 1n
∑n
j=1(Xj −X)2.
Ex. What if Xj ∼ U(0, θ) or Xj ∼ U(θ, 1 + θ)? [Pictures.]
211
In all the examples, MLEs were nice consistent estimators. Why is this so?
Why does the method work? What properties can one expect from the MLEs?
Denote by ϑ the true value of the parameter θ. Then, by the SLLN,
1
n
L(X, θ) ≡ 1
n
n∑
j=1
l(Xj , θ)
a.s.−→ Eϑl(Xj , θ) as n→∞
(provided that Eϑl(Xj , θ) is finite; we assume everywhere that all relevant
conditions are met, without going into detail).
So one would expect that, for large n (and this is what the word “asymptotic”
refers to), the maximum of L(X, θ) will be located close to the point where
Eϑl(Xj , θ) attains its maximum (as a function of θ).
Now where is that point?
212
Thm (Gibbs’ inequality) For any two densities f and g (on a common space,
w.r.t. a common measure µ),∫
f(x) ln f(x)µ(dx) ≥

f(x) ln g(x)µ(dx) (∗)
if both integrals are finite. Here, “ ≥” becomes “ = ” iff f(x) = g(x) (µ-almost
everywhere, i.e. everywhere except for a set of zero µ-measure).
I Using lnu ≤ u− 1 (holds for all u > 0, and “=” holds iff u = 1), we have∫
f(x)
(
ln g(x)− ln f(x))µ(dx) = ∫ f(x) ln g(x)
f(x)
µ(dx)
∗≤

f(x)
(
g(x)
f(x)
− 1
)
µ(dx) =

g(x)µ(dx)−

f(x)µ(dx) = 1− 1 = 0.
Now “=” holds in (∗) iff we have “=” in ∗≤, which, in its turn, holds if
ln g(x)f(x) =
g(x)
f(x) − 1 wherever µ “can see it”, i.e. g(x)/f(x) = 1 µ-a.e. Bingo.
[Well, need to be a bit more careful as f(x) or g(x) can vanish, but it’s OK.]
213
Thus arg maxθ Eϑl(Xj , θ) = ϑ is the true value of the parameter!!
Ex. Let Xj ∼ P (θ) : fθ(x) = e−θ θxx! , x = 0, 1, 2, . . . (w.r.t. the counting
measure on Z), so
Eϑl(X1, θ) = Eϑ
(−θ +X1 ln θ − ln(X1!))
= −θ + EϑX1 ln θ −Eϑ ln(X1!) = −θ + ϑ ln θ −Eϑ ln(X1!).
To find the maximum, we solve (for θ) the usual equation:
0 =

∂θ
Eϑl(X1, θ) = −1 + ϑ
θ
⇐⇒ θ = ϑ.
Good.
So we expect that the MLE θˆ will be close to the maximum point ϑ of the
function y1(θ) := Eϑl(Xj , θ) :
214
6-
θϑ θˆn

y1(θ) =

fϑ(x) ln fθ(x)µ(dx)

y2(θ) =
1
nL(X, θ)
Look: the “sharper” the peak in the curve y1(θ), the closer should be the
maxima of the curves yj ! The curvature of y1 is given by its 2nd derivative:
∂2
∂θ2

fϑ(x) ln fθ(x)µ(dx) =

fϑ(x)
∂2
∂θ2
ln fθ(x)µ(dx)
=

fϑ(x)
[
f ′′θ (x)
fθ(x)

(f ′θ(x)
fθ(x)
)2]
µ(dx).
[For brevity, we use this convention: ′ = ∂∂θ .]
215
Now at the point θ = ϑ of the max of y1(θ), this becomes∫
f ′′ϑ (x)µ(dx)︸ ︷︷ ︸
=0 (do you see why?)


(f ′ϑ(x))
2
fϑ(x)
µ(dx)︸ ︷︷ ︸
=:I(ϑ)
= −I(ϑ),
where I(θ) the Fisher information that appears in the famous Rao–Crame´r
(lower) bounda for estimators’ errors: say, for unbiased estimators θn (where
n indicates the i.i.d. sample size), one has
Eϑ(θ

n − ϑ)2 ≥
1
nI(ϑ)
.
This leads to the following plausible conclusion: the higher the value of I(ϑ),
the closer should θˆ be to ϑ. And this is so indeed:
aDoes that ring a bell?
216
Thm [7.24+7.25] Under some regularity assumptions, the MLE θˆ = θˆn is
consistent: θˆn
P−→ ϑ as n→∞, and asymptotically normal:

n(θˆn − ϑ) d−→ Y ∼ N(0, 1/I(ϑ)).
In fact, convergence here holds for all moments as well:
Eϑθˆn = ϑ+ o
(
n−1/2
)
, Eϑ(θˆn − ϑ)2 = 1 + o(1)
nI(ϑ)
.
Thus one can say that the MLE θˆn is asymptotically efficient.
217
I We will only give a sketch of the proof.
Consistency. For any ε > 0, one has:
An : =
{|θˆn − ϑ| > ε} ⊂ { sup
|u|>ε
L(X,ϑ+ u) > L(X,ϑ)
}
=
{
sup
|u|>ε
[
1
n
L(X,ϑ+ u)︸ ︷︷ ︸
a.s.−→ Eϑl(X1,ϑ+u)
− 1
n
L(X,ϑ)︸ ︷︷ ︸
a.s.−→ Eϑl(X1,ϑ)
]
> 0
}
as, by the SLLN,
1
n
L(X, θ) =
1
n
n∑
j=1
l(Xj , θ)
a.s.−→ Eϑl(X1, θ).
Here by Gibbs’ inequality Eϑl(X1, ϑ+ u)−Eϑl(X1, ϑ) < 0. No wonder that
one can prove that P(An)→ 0 as n→∞, which means that θˆn P−→ ϑ. Too
technical for us at the moment, but we have got an idea of why the MLE is
consistent.
218
Asymptotic normality. Using Taylor’s formula (basically, it’s just the mean
value theorem for the function L′(X, θ)) at the point θˆn:
L′(X,ϑ) = L′(X, θˆn)︸ ︷︷ ︸
=0
+(ϑ− θˆn)L′′(X, θ∗),
where θ∗ is a point between ϑ and θˆn (why is L′(X, θˆn) = 0?). Thus
1√
n
n∑
j=1
l′(Xj , ϑ)︸ ︷︷ ︸
d−→ Z∼N(0,I(ϑ))
=

n
(
θˆn − ϑ
)× (−1)× 1
n
n∑
j=1
l′′(Xj , θ∗)︸ ︷︷ ︸
P−→ Eϑl′′(X1,ϑ)
,
where the first convergence (
d−→) is due to the CLT and the fact that
Eϑl
′(Xj , ϑ) = 0 =

f ′ϑ(x)µ(dx) (do you see why?) and Eϑ(l
′(Xj , ϑ))2 = I(ϑ),
and the second one (
P−→) follows from the LLN since Eϑl′′(X1, θ∗) is finite
and Eϑl
′′(X1, θ∗)
P−→ Eϑl′′(X1, ϑ) = −I(ϑ) (see sl. 216, 217) due to θˆn P−→ ϑ.
Thus

n
(
θˆn − ϑ
) d−→ ZI(ϑ) ∼ N(0, 1/I(ϑ)), bingo!
219
The End
220


essay、essay代写