
MAST30020

Probability for Inference

Lecture Slides∗

∗Written by K. Borovkov, last time modified on 1 January 2021 (or even later).

0-0

1. Probability Spaces

Random experiment (RE): You should be familiar with the concept, from

the 2nd year level probability course you have done. Anyway, that’s a

(somewhat vaguely defined) concept representing real-life phenomena that:

• have a “mass character” (i.e. could be repeated many, many times, at least

in theory),

• don’t display “deterministic regularity” (i.e. the outcome of any given trial

is uncertain, to the best of our prior knowledge), but

• possess what’s called “statistical regularity”: the relative frequencies

(n_A/n) of events one can observe in the experiment stabilize around some

values ∈ [0, 1] as the # of (“independent”) repetitions of the experiment

grows.

It is that last dot-point that makes Probability Theory possible.

Examples (Ex’s): coin tossing; dice rolling; gender of newborn babies etc.
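The “statistical regularity” dot-point can be illustrated with a toy simulation (not part of the original slides): the relative frequency n_A/n of heads in n fair coin tosses settles near 1/2 as n grows.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def head_frequency(n):
    """Relative frequency n_A / n of heads in n fair-coin tosses."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

# The frequency stabilizes around 1/2 as the number of repetitions grows.
freqs = [head_frequency(n) for n in (100, 10_000, 1_000_000)]
```

With more repetitions the fluctuations around 1/2 shrink, which is exactly the stabilization described above.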

1

Need to specify the outcome of our RE: how?

In mathematics, we have sets = collections of objects.

• Sample space Ω is the set of all possible outcomes.

Ex. Coin tossing. H/T? What if the coin can also land on its side? What if

where it landed also matters for us? Several different Ω’s are possible (and

can be used under different circumstances).

• So outcomes are elements ω ∈ Ω (a.k.a. elementary events, sample

points, realizations).

• Events: these are subsets A ⊂ Ω for which probability is defined.

NB: in non-trivial cases, there are subsets A ⊂ Ω for which probability

CANNOT be defined. For example, one can partition Ω = [0, 1] into

countably many “identical” sets Ai (they can be obtained from each other

by translations): Ω = ∑_{i≥1} Ai. Assuming a uniform probability distribution on [0, 1], what would be the probability of A1?

2

Sample spaces

• Ex. Finitely many possible outcomes ⇒ only need a finite Ω. Just list all

the outcomes: Ω = {H,T} ≡ {0, 1} for the experiment of tossing a coin

once; Ω = {1, 2, . . . , 6} for rolling a die once.

Now consider an RE where we toss a coin (or roll a die) n = 3 times in a row.

Product spaces come in handy. For two sets A and B, their (Cartesian)

product is defined by

A×B := {(a, b) : a ∈ A, b ∈ B}

(the set of all pairs (a, b) s.t. a ∈ A and b ∈ B), and we put

A^n := A × · · · × A (n copies of A) = {(a1, a2, . . . , an) : aj ∈ A, j = 1, 2, . . . , n}.

An RE with sample space Ω0 is replicated n times ⇒ the sample space of the composite experiment is Ω = Ω0^n, with ω = (ω1, ω2, . . . , ωn), ωj = outcome

of the jth replication of the basic (sub-)RE with the sample space Ω0.

3

• Ex. Keep tossing a coin till H shows up. Will a finite Ω suffice?

There are countably many possible outcomes that can be represented by

points from the set N = {1, 2, . . . } (ω = j if we observed j − 1 tails

followed by heads).*

• Ex. You have a date with your girl/boyfriend. She/he shows up

ω ∈ R+ = [0,∞) minutes late. Here we need Ω = R+.

One often uses the whole real line R or a subinterval thereof. For a

composite RE consisting of n dates, one can use

Ω = R^n := {(x1, x2, . . . , xn) : xj ∈ R, j = 1, 2, . . . , n}.

*Any set A s.t. there exists a 1–1 mapping f : A → N is said to be countable (denumerable). Thus, Z = {. . . ,−1, 0, 1, . . . } is countable. So is N², but not [0, 1].

4

• If there can be any number of dates (at least, in theory: imagine that if

your girl/boyfriend was less than 20′ late for a given date, you decide to

meet again), we’ll need

R^N := {(x1, x2, . . . ) : xj ∈ R, j ∈ N},

the set of real sequences, i.e. all real-valued functions on N.

[Other functional spaces are used, too, e.g. C[0, 1] for the so-called

Brownian motion process.]

• This models a situation where the basic RE can be repeated infinitely

many times. Likewise, in the coin tossing RE in which you are tossing a

coin till you get H for the first time, one can use Ω = {T,H}^N.

• BTW: Note that {T,H}^N is uncountable (why?). In the last example, you

could actually use a countable space of outcomes — which one?

5

• Probabilities will be assigned to sets of outcomes, i.e. subsets of Ω.

• For a given RE, can one assign probabilities to all subsets of Ω?

• If Ω is countable, the answer is YES.

In the general case, the answer is NO.

• It is impossible in the basic RE of choosing a point at random from [0, 1].

• So we MUST restrict ourselves and consider only some of the subsets

of Ω, chosen in such a way that there will be no problems with assigning

probabilities to them — and these subsets will be called “events”.

• Now how to choose them? One needs to be able to manipulate events,

and, quite naturally, such (admissible!) manipulations should produce events. Let's look at the ways to manipulate events.

6

Ex. Choose a student from a large class. Want the events that the student:

1) is NOT smoking;

2) is a female AND more than 55 y.o.;

3) was born in Australia OR New Zealand;

4) was born in Australia AND is NOT smoking.

These can be expressed in terms of simpler events using set operations:

1) Let Ω := population of all students in class, A := sub-population of

smokers. Take the complement Ac := {ω ∈ Ω : ω ∉ A}.

2) If B := sub-population of female students, C := those who are > 55 y.o.,

then take the intersection B ∩ C = {ω ∈ Ω : ω ∈ B and ω ∈ C}.

3) If D := students born in Australia, E := students born in New Zealand,

then take the union D ∪ E = {ω ∈ Ω : ω ∈ D or ω ∈ E}.

4) This will be the set difference D \A := D ∩Ac.

7

Note that D∩E = ∅, i.e. the events are disjoint (no common outcomes). Note

also that, in the case of disjoint sets, one often uses D + E to denote D ∪ E.

In fact, it is often more convenient to work with functions rather than sets.

How to “replace” sets with functions? Use indicators (indicator functions):

1A(ω) := { 1, ω ∈ A;  0, ω ∉ A }.

To the (main) set operations on events, there correspond the following

operations on indicator functions (draw diagrams & check!!):

1_{A^c} = 1 − 1_A,  1_{A∩B} = 1_A·1_B,  1_{A∪B} = max{1_A, 1_B}.

For the symmetric set difference A△B := A \ B + B \ A ≡ (A ∪ B) \ (A ∩ B),

1_{A△B} = |1_A − 1_B|.
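The indicator identities above are easy to verify by brute force on a small finite Ω; a minimal sketch (the sets A, B are arbitrary illustrative choices, not from the slides):

```python
# Verify 1_{A^c} = 1 - 1_A, 1_{A∩B} = 1_A * 1_B, 1_{A∪B} = max(1_A, 1_B)
# and 1_{A△B} = |1_A - 1_B| pointwise on a toy sample space.
Omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def ind(S):
    """Indicator function of S, as a dict ω -> 1_S(ω)."""
    return {w: 1 if w in S else 0 for w in Omega}

iA, iB = ind(A), ind(B)
ok_compl = all(ind(Omega - A)[w] == 1 - iA[w] for w in Omega)
ok_inter = all(ind(A & B)[w] == iA[w] * iB[w] for w in Omega)
ok_union = all(ind(A | B)[w] == max(iA[w], iB[w]) for w in Omega)
ok_sym = all(ind(A ^ B)[w] == abs(iA[w] - iB[w]) for w in Omega)
```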

8

Ex. Express the following events:

Neither A nor B occurred [i.e. 1A + 1B = 0].

I Ac ∩Bc = (A ∪B)c, de Morgan law.

A and B occurred, but C didn’t.

I A ∩B ∩ Cc ≡ (A ∩B) \ C.

Only one of A1, A2, A3 occurred [i.e. ∑_{j=1}^{3} 1_{Aj} = 1].

I A1 \ (A2 ∪ A3) + A2 \ (A1 ∪ A3) + A3 \ (A1 ∪ A2).

Exactly two out of A1, . . . , A5 occurred [i.e. ∑_{j=1}^{5} 1_{Aj} = 2].

I ∑_{i<j} (Ai ∩ Aj) \ ⋃_{k≠i,j} Ak.

9

As we said earlier, we want to manipulate events and need the resulting sets to still be events. Hence the requirement: the class of events on Ω must be closed under

the main set operations, i.e. the complement of an event should still be an

event, and the union of events must still be an event (the same for

intersections, but this is automatic from de Morgan laws, see below).

To make things work, need a bit more: namely, that we are allowed to take

countable unions (and intersections!). In mathematics, when countable

infinity is allowed/involved, one often uses σ to indicate that.

Def. A family F of subsets of Ω is said to be a σ-algebra on Ω if

(A.1) Ω ∈ F ,

(A.2) A ∈ F ⇒ Ac ∈ F ,

(A.3) A1, A2, . . . ∈ F ⇒ ⋃_{n=1}^{∞} An ∈ F.

In words: the family is closed under complementation and countable union and

intersection. Why the latter? De Morgan + (A.2) + (A.3) + (A.2):

⋂_{n=1}^{∞} An = [(⋂_{n=1}^{∞} An)^c]^c = [⋃_{n=1}^{∞} An^c]^c = [⋃_{n=1}^{∞} Bn]^c, where Bn := An^c ∈ F.

10

Ex. (continued)

Infinitely many of the events A1, A2, . . . occurred [i.e. ∑_{j=1}^{∞} 1_{Aj} = ∞].

Is this actually an event? [Denoted: An, i.o.]

Why? For instance: will a random walk S0, S1, S2, . . . on Z visit 0 infinitely

many times? Let An := {Sn = 0}.

I ⋂_{n≥1} ⋃_{k≥n} Ak — and this is an event indeed, using (A.2) + (A.3).

Here: ∩ ←→ ∀, “for all”; ∪ ←→ ∃, “there exists”.

Finitely many of the events A1, A2, . . . occurred [i.e. ∑_{j=1}^{∞} 1_{Aj} < ∞].

Why? For instance: in a random walk S0, S1, S2, . . . , let An := {|Sn/n| > ε}

(for a fixed ε > 0). Related to the “strong” LLN.

I ⋃_{n≥1} ⋂_{k≥n} Ak^c — use de Morgan or apply the same logic as above.

11

One starts modelling an RE by specifying a suitable sample space Ω and then

choosing an appropriate σ-algebra F of subsets of Ω. The elements of this

σ-algebra are called events.

NB: Always ∅ ∈ F : indeed, ∅ = Ωc, then use (A.1) + (A.2).

NB: So taking A3 = A4 = · · · = ∅ in (A.3) yields

A1, A2 ∈ F ⇒ A1 ∪A2 ∈ F .

Likewise for any finite union (intersection) of events: still an event. If only

that held instead of (A.3), then F would be called an algebra of sets.

Ex. The trivial σ-algebra: F = {∅,Ω}.

No fun: no uncertain events! All the events we are allowed to look at are: the

impossible event ∅ (it never occurs!) and the certain event Ω (occurs always!).

Ex. The power set P(Ω) := class of all subsets of Ω.

This is often the choice in simple situations with discrete sample spaces.

12

Prm. Suppose Fn, n = 1, 2, . . . , are σ-algebras on a common sample space Ω.

Is F1 ∪ F2 a σ-algebra as well? What about F1 ∩ F2? What about ⋂_{n=1}^{∞} Fn?

Of course, there are many different possible choices of F. One may wish to consider, say, a σ-algebra containing a given set A ⊂ Ω. The smallest such

σ-algebra is clearly the so-called σ-algebra generated by A:

σ(A) := {∅, A,Ac,Ω}.

Extending this, let G = {A1, . . . , An} be a finite partition of Ω, i.e. the sets

Ai ⊂ Ω are pairwise disjoint and ∑_{i=1}^{n} Ai = Ω. Then the σ-algebra generated by G,

i.e. the smallest σ-algebra that contains all the sets Aj , is

σ(G) := { ∑_{i∈I} Ai : I ⊂ {1, 2, . . . , n} }.

(Clearly a σ-algebra. Why is it the smallest one containing G?)

Similarly for a countable partition of Ω.
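For a finite partition, σ(G) can be enumerated directly as the unions of blocks indexed by subsets I. A small sketch for an illustrative 3-block partition of Ω = {0, . . . , 5} (a toy check, not part of the slides):

```python
from itertools import combinations

# Blocks of a partition of Ω = {0,...,5}; σ(G) = all unions of blocks.
blocks = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
Omega = frozenset(range(6))

sigma_G = set()
for r in range(len(blocks) + 1):
    for I in combinations(blocks, r):
        sigma_G.add(frozenset().union(*I))  # union over the chosen blocks

# σ(G) has 2^n elements and is closed under complementation.
closed_under_compl = all(Omega - S in sigma_G for S in sigma_G)
```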

13

In the case of the σ-algebra generated by a partition G, it is easy to give a

representation for all the elements of σ(G). One can also introduce the concept

of the σ-algebra generated by an arbitrary given family G of subsets of Ω —

but this is less elementary (what about all possible intersections etc?).

Thm [1.12] For any family G of subsets of Ω, there exists a unique

σ-algebra, denoted by σ(G) and called the σ-algebra generated by G, s.t.

1) G ⊂ σ(G), and

2) if H is a σ-algebra on Ω and G ⊂ H, then σ(G) ⊂ H.

That is, σ(G) is the smallest σ-algebra on Ω containing G.

I How to prove such an assertion?? It’s not too difficult. First note that

there are σ-algebras on Ω that contain G: just take P(Ω). So the class of all

σ-algebras on Ω that contain G is non-empty. Now consider the intersection of

all σ-algebras from the class. It will contain G (as each of the σ-algebras

contains it!) and it will be a σ-algebra (as an intersection of such). And it will

be the smallest one with these properties!! (Why?)

14

An important example of a generated σ-algebra is the class B(R) of Borel

subsets of R (a.k.a. the Borel σ-algebra on R):

B(R) := σ{(a, b] : a, b ∈ R, a < b}.

All “reasonable” subsets of R are Borel (e.g. finite and countable subsets, open

intervals, open and closed sets etc.), but B(R) ≠ P(R)! [Although giving an

example of a set which is not Borel is a challenge!]

This extends to the multivariate case:

B(R^m) := σ{ ∏_{i=1}^{m} (ai, bi] : ai, bi ∈ R, ai < bi }.

Here ∏_{i=1}^{m} (ai, bi] is the Cartesian product of intervals (a “brick”).

Equivalently, B(R^m) is generated by open balls. As in the univariate case, all reasonable subsets of R^m are Borel.

When Ω = R^m, B(R^m) is the default choice of F. For Ω ⊂ R^m, one takes F = {Ω ∩ A : A ∈ B(R^m)}, the trace of B(R^m) on Ω.

15

Now it’s time to introduce

Probability, from Latin probabilis “provable,” from probare “to try, to test”

(cf. to prove, to probe), from probus “good”.

Probable cause as a legal term is attested from 1676.

Probably is attested from 1535.

Probability is attested from 1551. [Source: http://www.etymonline.com]

Let (Ω,F) be a sample space endowed with a σ-algebra of its subsets (the pair is called a measurable space).

Def. A probability on (Ω,F) is a function P : F → R s.t.

(P.1) P(A) ≥ 0, A ∈ F ,

(P.2) P(Ω) = 1,

(P.3) for any pairwise disjoint A1, A2, · · · ∈ F,

P(⋃_{j=1}^{∞} Aj) = ∑_{j=1}^{∞} P(Aj),  “countable additivity”.

16

Def. The triple (Ω,F ,P) is called a probability space.

NB1: P is referred to as a set function (as its argument assumes “values” that

are sets, its domain being F). NB: ω ∈ Ω and {ω} ⊂ Ω are distinct objects!

Note: P(ω) is NO GOOD, P({ω}) is OK.

NB2: On one and the same measurable space, we can have infinitely many

different probabilities. Ex: tossing a (biased) coin (once). In statistics, we do

consider different probabilities on the same measurable space all the time!

NB3: Properties (P.1) and (P.3) specify what’s called a measure. Adding

(P.2), we get a measure of “total mass one”; one often uses “probability

measure” in that case.

NB4: Why was this def’n adopted? It mimics the properties of relative

frequencies of events! Turns out that measure theory is the most natural

framework for formal treatment of probabilities. Very successful, starting with

being able to establish theoretically all the main “statistical laws” observed in

the real world, and first of all — the LLN and CLT.

17

Ex. The point mass (degenerate distribution) at (a fixed point) ω ∈ Ω :

εω(A) := 1A(ω).

NB the difference in interpretation: the LHS is a function of A (for a fixed

outcome ω), whereas the RHS is a function of ω (for a fixed event A).

It models a situation with a deterministic outcome: repeat your RE till you

turn blue, but each time you’ll be seeing one and the same outcome: ω.

Ex. Counting measure on N (or even R ⊃ N):

µ(B) := ∑_{n≥1} εn(B), B ∈ F = P(N).

This is not a probability! The measure counts the number of points in B

(which can be infinite, of course).

18

Ex. Discrete uniform distribution: suppose Ω is finite, F = P(Ω). If all

outcomes ω ∈ Ω are “equally likely”, then they should have the same

probability. Using notation |B| for cardinality of B, just put

P(A) := |A|/|Ω|, A ∈ F

(this is the so-called “classical probability”).

NB: using a version of the counting measure, this can be re-written as

P(A) = (1/|Ω|) ∑_{ω∈Ω} εω(A).
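A direct transcription of the classical-probability formula for a single fair die, using exact rational arithmetic (a toy check, not part of the slides):

```python
from fractions import Fraction

Omega = set(range(1, 7))  # one roll of a fair die

def P(A):
    """Classical probability P(A) = |A| / |Ω| (A is intersected with Ω)."""
    return Fraction(len(A & Omega), len(Omega))

p_even = P({2, 4, 6})   # 1/2
p_low = P({1, 2})       # 1/3
p_all = P(Omega)        # 1, as required by axiom (P.2)
```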

19

Elementary properties of probability (Thm 1.23):

a) P(∅) = 0.

I Taking A1 = A2 = · · · = ∅ in (P.3), we have P(∅) = ∑_{n=1}^{∞} P(∅), which forces P(∅) = 0 — bingo!

b) finite additivity : for any pairwise disjoint A1, . . . , An ∈ F ,

(PF.3) P(⋃_{j=1}^{n} Aj) = ∑_{j=1}^{n} P(Aj).

I Take An+1 = An+2 = · · · = ∅ in (P.3) and use a) — bingo!

NB: in the special case A1 = A, A2 = A^c, we obtain

P(Ac) = 1−P(A).

20

Elementary properties of probability (Thm 1.23): continued.

c) If A ⊂ B (from now on, always assume that A,B, . . . ∈ F), then

P(B \A) = P(B)−P(A).

I Follows from b): take A1 = A, A2 = B \A, then B = A1 +A2.

NB: So probability is non-decreasing: P(A) ≤ P(B) for A ⊂ B.

d) For any events A and B,

P(B ∪A) = P(B) + P(A)−P(B ∩A)

(the simplest version of the inclusion-exclusion principle), and so always

P(B ∪A) ≤ P(B) + P(A), “subadditivity of probability”.

I As A ∪ B = A + B \ A1, where A1 := A ∩ B ⊂ B, we can use b) and then c):

P(A ∪B) = P(A) + P(B \A1) = P(A) + P(B)−P(A1),

bingo!

21

Subadditivity of prob’ty extends to Boole’s ineq’ty (Propn 1.24): for any

A1, A2, . . . ∈ F ,

P(⋃_{j=1}^{∞} Aj) ≤ ∑_{j=1}^{∞} P(Aj).

I “Disjointification”: let B1 := A1, B2 := A2 \A1, B3 := A3 \ (A1 ∪A2) etc:

Bn := An \ ⋃_{j=1}^{n−1} Aj.

Then 1) Bn ⊂ An, n ≥ 1; 2) ⋃_{j≤n} Aj = ⋃_{j≤n} Bj, n ≤ ∞ (prove by induction); and 3) B1, B2, . . . are disjoint. Hence, using monotonicity of P,

P(⋃_{j=1}^{∞} Aj) = P(∑_{j=1}^{∞} Bj)  [by 2)]  = ∑_{j=1}^{∞} P(Bj)  [by (P.3)]  ≤ ∑_{j=1}^{∞} P(Aj)  [by 1)],  bingo!

22

Natural Q: in our def’n of probability, why can’t we assume finite additivity

(sl. 20) instead of the countable one (our (P.3))?

The direct answer: we would lose important continuity properties.

Notation: An ↑ A as n → ∞ ⇔ A1 ⊂ A2 ⊂ A3 ⊂ · · · and ⋃_{n=1}^{∞} An = A;

An ↓ A as n → ∞ ⇔ A1 ⊃ A2 ⊃ A3 ⊃ · · · and ⋂_{n=1}^{∞} An = A.

Thm [1.25] Suppose P : F → R satisfies conditions (P.1), (P.2) and (PF.3).

Then the following are equivalent :

a) P satisfies (P.3) (and hence is a probability).

b) An ↑ A ⇒ P(An) ↑ P(A).

c) An ↓ A ⇒ P(An) ↓ P(A).

d) An ↓ ∅ ⇒ P(An) ↓ 0. [Of course, n→∞ in b)–d).]

I We will show that a) ⇒ b) ⇒ c) ⇒ d) ⇒ a). Obvious: b) ⇔ c) (by

complementation) and c) ⇒ d). So only need to show a) ⇒ b) and d) ⇒ a).

23

a) ⇒ b): If An ↑ A, then ⋃_{j=1}^{n−1} Aj = An−1 and, putting A0 := ∅, the disjointification procedure from sl. 22 yields Bn = An \ An−1, so

P(A) = P(⋃_{n=1}^{∞} An) = ∑_{n=1}^{∞} P(Bn)  [c), sl. 21]  = ∑_{n=1}^{∞} [P(An) − P(An−1)] = lim_{m→∞} ∑_{n=1}^{m} [P(An) − P(An−1)] = lim_{m→∞} P(Am),

using the “telescoping argument” ((a2 − a1) + (a1 − a0) = a2 − a0 etc).

24

d) ⇒ a): If B1, B2, . . . are disjoint, then ∑_{j=n+1}^{∞} Bj =: An ↓ ∅ as n → ∞.

Indeed, if this is NOT so, then ⋂_{n≥1} An ≠ ∅, i.e. there is a point ω that belongs to ALL An and hence ω ∈ Bm for some m ≥ 1. Since the Bn's are DISJOINT, we must have ω ∉ Am for that m, a CONTRADICTION!

Now

P(∑_{j=1}^{∞} Bj) = P(∑_{j=1}^{n} Bj + An)  [by (PF.3)]  = ∑_{j=1}^{n} P(Bj) + P(An),

where ∑_{j=1}^{n} P(Bj) → ∑_{j=1}^{∞} P(Bj) and P(An) → 0 as n → ∞. We proved (P.3). Bingo!

25

Now we can prove a very simple, but quite important assertion:

(The 1st) Borel-Cantelli Lemma [Thm 1.27]

If ∑_{n=1}^{∞} P(An) < ∞, then P(An, i.o.) = 0. [Re i.o., see sl. 11.]

I Using Thm 1.25, c) and Boole’s inequality (sl. 22),

P(An, i.o.) = P(⋂_{n≥1} ⋃_{k≥n} Ak) = lim_{n→∞} P(⋃_{k≥n} Ak) ≤ lim_{n→∞} ∑_{k≥n} P(Ak) = 0.
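A Monte Carlo illustration (not a proof, and not part of the slides): take independent events An with P(An) = 1/n², so ∑ P(An) < ∞. By the lemma, only finitely many An occur a.s., and indeed in one simulated run the occurrences thin out quickly.

```python
import random

random.seed(1)  # fixed seed for reproducibility

N = 10_000
# A_n occurs when a fresh uniform draw falls below 1/n^2 (so P(A_1) = 1).
occurred = [n for n in range(1, N + 1) if random.random() < 1 / n**2]
total = len(occurred)  # E[total] = sum of 1/n^2 ≈ 1.64, so this stays tiny
```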

26

2. Probabilities on R

• Always on (R,B(R)).

• Although probabilities are GIVEN ON B(R), one usually SAYS they are

given on R (slang!).

• For probabilities on R, we will use P rather than P, and for a reason:

such probabilities will mostly be “induced” on R by random variables

given on a general “underlying” space (Ω,F ,P), so it is more convenient

to reserve P for probabilities on the general space (Ω,F).

• Anyway, to give a probability on R, one apparently needs to specify all the

values P (A), A ∈ B(R) — but B(R) is HUGE!

• In fact, one doesn’t: there is a much more economical way.

Def. The distribution function (DF, a.k.a. CDF) of a probability P on R is

the function FP : R→ R defined by

FP (t) := P ((−∞, t]), t ∈ R.

27

Prpn [1.32] F_P ≡ F_{P′} ⇔ P ≡ P′.

This is a consequence of the fact that σ({(−∞, t] : t ∈ R}) = B(R) (hey, (a, b] = (−∞, b] \ (−∞, a], cf. sl. 15!). In fact, a much stronger assertion holds:

Thm [1.36] below. So we’ll talk about that later. Meanwhile: to the

characteristic properties of DFs.

Thm [1.33] For any probability P on R, its DF F := FP satisfies:

a) F is non-decreasing: s < t ⇒ F (s) ≤ F (t). Hence, at any point

t ∈ R, it has one-sided limits:

F(t−) := lim_{s↑t} F(s),  F(t+) := lim_{s↓t} F(s);  F(t−) ≤ F(t) ≤ F(t+);

b) F is right-continuous: F (t) = F (t+);

c) lim_{t→−∞} F(t) = 0,  lim_{t→+∞} F(t) = 1.

28

I

a) Obvious from the monotonicity of P (sl. 21): as (−∞, s] ⊂ (−∞, t] for

s < t, we have

F (s) ≡ P ((−∞, s]) ≤ P ((−∞, t]) ≡ F (t).

The existence of one-sided limits follows from monotonicity (recall: any

bounded increasing sequence has a finite limit).

b) Obvious from the continuity of P (sl. 23):

Let An := (−∞, tn], tn ↓ t as n→∞.

Then An ↓ A := (−∞, t] and so P (An) ↓ P (A).

c) Obvious from the continuity of P :

Since A′n := (−∞, −n] ↓ ∅ and A′′n := (−∞, n] ↑ R as n → ∞,

lim_{n→∞} F(−n) ≡ lim_{n→∞} P(A′n) = P(∅) = 0,  lim_{n→∞} F(n) ≡ lim_{n→∞} P(A′′n) = P(R) = 1.

It remains to make use of the monotonicity of F. Bingo.

29

Ex. Point mass and beyond: For P = εs (s ∈ R is a fixed value),

F_P(t) = { 0, t < s;  1, t ≥ s } = 1(s ≤ t)

(here 1(C) = 1 if the condition C is met, 1(C) = 0 otherwise).

For P = (1− p)ε0 + pε1 (do you recognize the Bernoulli distribution B(p)? It’s

a mixture of two point masses),

F_P(t) = { 0, t < 0;  1 − p, 0 ≤ t < 1;  1, t ≥ 1 } = (1 − p)·1(0 ≤ t) + p·1(1 ≤ t).

In the general case, given a DF F_P, what is P({t})? Since (t − 1/n, t] ↓ {t} as n → ∞, we have P({t}) = lim_n (F_P(t) − F_P(t − 1/n)) = F_P(t) − F_P(t−).
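The atom formula P({t}) = F_P(t) − F_P(t−) is easy to check numerically for the Bernoulli DF above, approximating the left limit by F(t − ε) (exact here since F is piecewise constant; p = 0.3 is an illustrative value):

```python
p = 0.3  # illustrative Bernoulli parameter

def F(t):
    """Bernoulli B(p) DF: (1 - p)·1(0 ≤ t) + p·1(1 ≤ t)."""
    return (1 - p) * (t >= 0) + p * (t >= 1)

def atom(t, eps=1e-9):
    """Jump of F at t, i.e. P({t}) = F(t) - F(t-)."""
    return F(t) - F(t - eps)

jump_at_0 = atom(0.0)  # the point mass (1 - p) at 0
jump_at_1 = atom(1.0)  # the point mass p at 1
no_jump = atom(0.5)    # no atom at a continuity point
```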

30

Thm [1.36] For any F : R→ R which satisfies a)–c) from Thm [1.33], there

exists a unique P on B(R) s.t. F ≡ FP .

I This is a rather non-trivial result. Its proof is beyond the scope of our

subject. Just a few words on how it's done:

Start with putting P ((a, b]) := F (b)− F (a) (≥ 0!) for arbitrary a < b.

Next, for A := ⋃_{i=1}^{n} (ai, bi], −∞ ≤ a1 < b1 < a2 < b2 < . . . < bn ≤ ∞, we put

P(A) := ∑_{i=1}^{n} (F(bi) − F(ai)).

The collection A of all such A’s is an algebra (verify), B(R) = σ(A) (obvious).

From the construction it is obvious that P is finitely additive on A, and one

can prove, using b) from Thm [1.33], that it is also countably additive on A.

The last shot is to apply Carathéodory’s extension theorem: a probability

given on an algebra can be uniquely extended to one on the generated

σ-algebra. Bingo.

31

Implications: Can completely specify a probability on R by its DF!

Ex. Consider the function

F(t) = { 0, t < 0;  t, 0 ≤ t < 1;  1, t ≥ 1 }.

It clearly satisfies a)–c). So by Thm [1.36], there exists a unique probability

on R with this DF. This probability is called the uniform distribution on [0, 1],

denoted by U [0, 1].

NB: the U [0, 1]-probability of a given set is invariant w.r.t. translations

(provided the set remains within [0, 1]). Likewise, the uniform distribution on

a set B ∈ B(Rn) is a probability invariant w.r.t. translations.

BTW: What is the DF of U [a, b], a < b?

Now we will look at a few important large classes of distributions on R.

32

Discrete Probabilities on R: P (C) = 1 for some countable C ⊂ R.

Prpn [1.39] The following are equivalent:

a) P is discrete.

b) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with ∑_i pi = 1, one has P = ∑_i pi ε_{ti}.

c) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with ∑_i pi = 1, one has F_P(t) = ∑_i pi·1(ti ≤ t).

[Picture.]

[BTW: can one assume that the ti’s can be ordered, e.g. t1 < t2 < . . .?]

33

I a) ⇒ b): C is countable ⇔ C = {ti}i≥1 ⊂ R. Now for B ∈ B(R),

P(B) = P(B ∩ C) + P(B ∩ C^c)  [the latter term ≤ P(C^c) = 1 − P(C) = 0]

= P(B ∩ ∑_{i≥1} {ti}) = P(∑_{i≥1} (B ∩ {ti})) = ∑_{i≥1} P(B ∩ {ti})

= ∑_{i≥1} P({ti})·1(ti ∈ B) = ∑_{i≥1} pi·1(ti ∈ B)  [sl. 18]  = ∑_{i≥1} pi ε_{ti}(B).

b) ⇒ c): By def’n, F_P(t) ≡ P((−∞, t]) = ∑_{i≥1} pi ε_{ti}((−∞, t]) = ∑_{i≥1} pi·1(ti ≤ t).

c) ⇒ a): For C := {tj}j≥1 we have

P(C) = ∑_j P({tj}) = ∑_j (F_P(tj) − F_P(tj−)) = ∑_j pj = 1,

since 1(ti ≤ t) is continuous at tj for i ≠ j, so that F_P(tj) − F_P(tj−) = pj.

34

Absolutely Continuous (AC) Prob’s on R: the ones with densities.

Def. A probability P on R is AC if there exists a f’n fP : R→ R+, called the

density (a.k.a. PDF) of P , s.t.

F_P(t) = ∫_{−∞}^{t} f_P(s) ds, t ∈ R.

Clearly, this implies that P((a, b]) = ∫_{a}^{b} f_P(s) ds and f_P(t) = F′_P(t) (a.e.!).

NB: In most cases, the integral here is our good old friend the Riemann

integral. However, in the general case, it must be understood as the so-called

Lebesgue integral, but we’ll talk about that later.

ANY integrable function f ≥ 0 on R with ∫ f(s) ds = 1 specifies a probability on R. Indeed,

F(t) := ∫_{−∞}^{t} f(s) ds

defines a DF on R (i.e. it has properties a)–c) from sl. 28), and so Thm [1.36] applies.

35

Mixed Distributions: neither discrete nor AC, but mixtures thereof, i.e.

P = pPd + (1− p)Pa,

where Pd is discrete, Pa is AC, p ∈ (0, 1) is a fixed number.

Ex. Waiting times at an ATM: when a customer arrives, either nobody is

using the ATM (w.p. p) or it is in use (plus there may be a queue!) — and

then the customer has to wait for a random time ∼ E(λ), the exponential distribution with par’r λ (having the density λe^{−λt}, t > 0). Then, using the

total probability formula (rings a bell?), the waiting time has the distribution

P = pε0 + (1− p)E(λ).
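Sampling from this mixed distribution is straightforward: with probability p return the atom at 0, otherwise draw from E(λ). A sketch with illustrative values p = 0.4, λ = 1 (these numbers are not fixed in the slides):

```python
import random

random.seed(4)  # fixed seed for reproducibility

p, lam = 0.4, 1.0  # illustrative parameters

def waiting_time():
    """One draw from the mixture P = p·ε_0 + (1 - p)·E(λ)."""
    if random.random() < p:
        return 0.0                  # ATM free: wait exactly 0
    return random.expovariate(lam)  # ATM busy: exponential wait

waits = [waiting_time() for _ in range(100_000)]
frac_zero = sum(1 for w in waits if w == 0.0) / len(waits)  # ≈ p
```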

But wait: there is more!

36

Singular Distributions: with a continuous DF, but not AC!

Continuous DF means no point is assigned a positive probability

(otherwise there would be jumps in the DF, sl. 30). So the prob’ty is “spread”

over R, but — there is no density. Singular (deviating from the usual or

expected; odd) indeed.

Ex [1.51] Cantor’s ladder (explained in class).

And this is basically it: any distribution on R is a mixture of three “pure

type” distributions!

Thm [1.52] (Lebesgue’s decomposition) Any probability on R has a unique

representation of the form

P = αdPd + αaPa + αsPs,

where αi ≥ 0, i = d, a, s; ∑ αi = 1; Pd is discrete, Pa is AC, Ps is singular.

37

3. Random Variables (RVs)

“Naïve definition”: RV X is a “function of chance”, i.e. X = X(ω) : Ω → R

(assuming we are given an underlying probability space modelling our RE).

Not good enough: one usually wants to know probabilities of the RV hitting

some given sets, e.g. X ∈ [a, b]. Therefore the respective set of favourable

outcomes

X−1([a, b]) := {ω ∈ Ω : X(ω) ∈ [a, b]}

called the inverse image (under X) of [a, b], MUST be an event. Hence:

Def [2.3] An RV is a function X = X(ω) : Ω→ R s.t., for any B ∈ B(R),

X−1(B) := {ω ∈ Ω : X(ω) ∈ B} ∈ F

(such functions are called measurable; shorthand: {X ∈ B}).

So, for RVs X, the probabilities P(X ∈ B) are defined for all B ∈ B(R)! Good.

38

In fact, to satisfy Def [2.3], it suffices to have

{X ∈ (−∞, t]} := X−1((−∞, t]) ≡ {ω ∈ Ω : X(ω) ∈ (−∞, t]} ∈ F , t ∈ R. (∗)

Which is kind of nicer: much fewer sets to play with. But why is that so?

Because X−1 preserves all set operations (and disjointness).

Prpn [2.2] Let {Bα : α ∈ I} be an arbitrary family of subsets of R.

a) Bα ⊂ Bβ ⇒ X−1(Bα) ⊂ X−1(Bβ).

b) ⋃_{α∈I} X^{−1}(Bα) = X^{−1}(⋃_{α∈I} Bα) and ⋂_{α∈I} X^{−1}(Bα) = X^{−1}(⋂_{α∈I} Bα).

c) Bα ∩Bβ = ∅ ⇒ X−1(Bα) ∩X−1(Bβ) = ∅.

d) X^{−1}(Bα^c) = [X^{−1}(Bα)]^c.

I It's all next to obvious — a matter of simple logic. For instance, look at d): what's on its LHS? The set of all ω's s.t. X(ω) ∈ Bα^c. But this is the same as the set of all ω's that don't have the property that X(ω) ∈ Bα, which is the complement of the set of all ω's that do, i.e. [X^{−1}(Bα)]^c. Bingo, right?

39

Now how does this help if we want to show that (∗) is equivalent to Def [2.3]?

Let C be the collection of all sets B ∈ B(R) s.t. X−1(B) ∈ F . By Prpn [2.2],

C will be a σ-algebra on R!! Indeed, R ∈ C since X−1(R) = Ω ∈ F , so (A.1)

holds. Next, B ∈ C ⇔ X^{−1}(B) ∈ F, and as F itself is a σ-algebra, one has [X^{−1}(B)]^c ∈ F. By Prpn [2.2] d), this means that X^{−1}(B^c) ∈ F, so that B^c ∈ C, i.e. (A.2) holds for C. Similarly for (A.3). Good.

Now (∗) states that all (−∞, t] ∈ C, t ∈ R, and so the smallest σ-algebra

containing these sets will be part of C (as the latter is a σ-algebra itself):

σ ((−∞, t], t ∈ R) ⊂ C.

Hey, but what's on the LHS? This is B(R)! So B(R) ⊂ C; since C ⊂ B(R) by its definition, we conclude that B(R) = C, and this means that Def [2.3] is satisfied.

40

Ex Constants are RVs: for X ≡ c = const, one has {X ≤ t} = { ∅, t < c;  Ω, t ≥ c }.

Ex Random indicators. For an event A ∈ F, {1A ≤ t} = { ∅, t < 0;  A^c, 0 ≤ t < 1;  Ω, t ≥ 1 }.

Ex Simple RVs: X := ∑_{i=1}^{n} ai 1_{Ai}, where ai ∈ R, Ai ∈ F, i ≤ n < ∞.

One usually assumes here (for convenience) that {Ai}i≤n is a partition of Ω. This is no big deal: if it's not so, one can always re-write X in an alternative form, using a partition, X = ∑_{i=1}^{n′} a′i 1_{A′i}, with the A′i's of the form (⋂_{i∈I} Ai) ∩ (⋂_{i∈I^c} Ai^c), I ⊂ {1, 2, . . . , n}, which form a partition.

This is an RV: {X ≤ t} = ⋃_{i: ai≤t} Ai ∈ F.

41

Random Vectors: X = (X1, . . . , Xd) : Ω→ Rd s.t. all Xi, i ≤ d, are RVs.

Equivalently, s.t. X−1(B) ∈ F for all B ∈ B(Rd). This extends to “random

elements” in much more general (than Rd) spaces.

Similarly, Z : Ω → C is a complex-valued RV if Z = X + iY, where X and Y are RVs. [Here, of course, i = √(−1).]

42

Prpn [2.9] Given an RV X, σ(X) := {X−1(B) : B ∈ B(R)} is a σ-algebra

on Ω which is called the σ-algebra generated by the RV X.

NB: From the def’n of RV, σ(X) ⊂ F . Usually, it is smaller than F .

I Just verify (A.1)–(A.3) using Prpn [2.2].

(A.1) Ω = X−1(R) ∈ σ(X), good.

(A.2) For A ∈ σ(X) there exists a B ∈ B(R) s.t. A = X−1(B).

Now A^c = [X^{−1}(B)]^c = X^{−1}(B^c) ∈ σ(X), good.

(A.3) Similarly. Bingo!

Ex [2.10] σ(1A) = {∅, A,Ac,Ω}. Follows from Ex on sl. 13.

Ex [2.11] For a simple RV X = ∑_{i=1}^{n} ai 1_{Ai}, where a1, . . . , an are distinct and {Ai} is a partition of Ω,

σ(X) = { ∑_{i∈I} Ai : I ⊂ {1, 2, . . . , n} }.

43

Combinations of RVs

The general fact: if X = (X1, . . . , Xd) is a random vector, g : Rd → R is a

continuous function, then g(X1, . . . , Xd) is an RV as well.

NB: This extends to more general functions g (to the so-called measurable f’s).

We’ll skip the proof: a bit beyond the scope. Just look at a couple of simple

special cases. Assume that X and Y are RVs, a, b ∈ R. Then:

a) aX is an RV: {aX ≤ t} = {X ≤ t/a} for a > 0, and {aX ≤ t} = {X ≥ t/a} = {X ∈ [t/a, ∞)} for a < 0 (for a = 0, aX ≡ 0 is a constant, hence an RV).

b) aX + bY is an RV: it suffices to prove this for a = b = 1 (from a)). Look:

{X + Y < t} = ⋃_{r∈Q} ({X < r} ∩ {Y < t − r}),

Q being the (countable) set of all rationals (r = m/n, where m,n ∈ Z).

c) XY is an RV. (How to show that? E.g. use XY = ((X+Y )2−X2−Y 2)/2.)

44

Distributions and DFs of RVs and RVecs

Def For an RV X on (Ω,F ,P), the set function

PX(B) := P(X ∈ B), B ∈ B(R),

is called the distribution of X. Similarly for random vectors.

Prpn [2.23] PX is a probability on R (or Rd — in the case when X ∈ Rd).

I That PX(B) ≥ 0 and PX(R) = 1 is obvious from the definition. Use

Prpn [2.2] to show countable additivity. Bingo.

The DF of X is the DF of PX : FX(t) := PX((−∞, t]) ≡ P(X ≤ t).

We call X discrete (AC, singular) if FX is discrete (AC, singular, resp.).

If X is AC, fX is continuous at x, then P(X ∈ (x, x+ ∆)) = fX(x)∆ + o(∆) =

(fX(x) + o(1))∆ as ∆→ 0. [Recall the meaning of o(1), O(1) etc.]

The survival function (a.k.a. the tail) of X is the function SX(t) = 1− FX(t).

We say that X, Y are identically distributed (and write X =d Y) iff PX ≡ PY.

45

In the case when X ∈ Rd, by the DF of X we understand the function

FX(t1, . . . , td) := P(X1 ≤ t1, . . . , Xd ≤ td), (t1, . . . , td) ∈ Rd.

As in the univariate case, the distr’n PX is uniquely specified by FX — this

follows from the fact that the orthants

{(x1, . . . , xd) : x1 ≤ t1, . . . , xd ≤ td}, (t1, . . . , td) ∈ Rd,

generate B(Rd) (why?). How to express P(Xi ∈ (ai, bi], i ≤ d) in terms of FX?

There are analogs of Thms [1.33], [1.36] in the multivariate case, although

they are a bit more sophisticated. BTW, what will replace the monotonicity condition a)? What about the limits at “infinities”?

The def’n of a discrete RVec is the same as for discrete RVs (this holds in

more general spaces as well). An AC distribution has a density fX , satisfying

FX(t1, . . . , td) := ∫_{−∞}^{t1} · · · ∫_{−∞}^{td} fX(s1, . . . , sd) ds1 . . . dsd.

46

Prpn [2.28] X = (X1, . . . , Xd) is discrete iff all Xi, i ≤ d, are discrete.

I ⇒ : If PX(C) = 1 for a countable set C ⊂ Rd, then, for each i,

P(Xi ∈ Ci) = 1, where Ci := {xi : x = (x1, . . . , xd) ∈ C} are countable, too.

⇐ : If, for each j ≤ d, there is a countable set Cj s.t. P(Xj ∈ Cj) = 1, then

P(X ∈ C) = 1 for C := ∏_{j=1}^{d} Cj, which is countable. Bingo.

Prpn [2.29] If X = (X1, . . . , Xd) is AC, then so is Xj for any j ≤ d, and

f_{Xj}(x) = ∫ · · · ∫ [(d − 1)-fold] fX(s1, . . . , sj−1, x, sj+1, . . . , sd) ds1 · · · dsj−1 dsj+1 · · · dsd.

I Use F_{Xj}(t) = lim_{ti→∞, i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td), which leads to

F_{Xj}(t) = ∫ · · · ∫ [(j − 1)-fold] ∫_{−∞}^{t} ∫ · · · ∫ [(d − j)-fold] fX(. . . , sj−1, x, sj+1, . . .) ds1 · · · dsj−1 dx dsj+1 · · · dsd,

and then change the order of integration (Fubini). Bingo. [Is the converse true?]

47

Some Popular Distributions on R and Beyond

Discrete RVs . . .

AC RVs . . .

RVecs . . . (Singular distributions don’t need to be exotic in this case)

Transformations of RVs

Prpn [2.40] If X is an RV and g is an increasing & continuous function on R, with

inverse h := g^{−1}, then the RV Y := g(X) has the DF

FY(t) = FX(h(t)).

I Obvious: FY(t) = P(Y ≤ t) = P(g(X) ≤ t) = P(X ≤ h(t)) = FX(h(t)), bingo.

What if g were decreasing?

Thm [2.41] If, in addition to the conditions in Prpn [2.40], X is AC and g

is continuously diff’ble on an open set U ⊂ R s.t. P(X ∈ U) = 1, then Y is

AC, with the density fY (y) = fX(h(y))|h′(y)|. [This allows for decreasing g!]

48

I Obvious: FY(t) = FX(h(t)) = ∫_{−∞}^{h(t)} fX(s) ds; next either differentiate or

change variables (s = h(u)): bingo! [Picture.]

Ex Linear g(x) = ax+ b, a > 0 (+ what it does to normal distributions).
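The linear Ex is easy to sanity-check numerically. Below is a minimal sketch (names and values are ours, not from the slides) for g(x) = ax + b applied to X ∼ N(0, 1): the formula fY(y) = fX(h(y))|h′(y)| with h(y) = (y − b)/a must reproduce the N(b, a²) density.

```python
import math

# Thm [2.41] for a linear map (a sketch): X ~ N(0,1), Y = aX + b with a > 0,
# so h(y) = (y - b)/a, |h'(y)| = 1/a, and f_Y(y) = f_X((y - b)/a) / a,
# which should be the N(b, a^2) density.

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a, b = 2.0, 1.0

def f_Y(y):
    h = (y - b) / a            # inverse of g(x) = a*x + b
    return phi(h) * (1.0 / a)  # f_X(h(y)) * |h'(y)|

def normal_density(y, mu, sigma):
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for y in (-1.0, 0.0, 1.0, 3.0):
    assert abs(f_Y(y) - normal_density(y, b, a)) < 1e-12
print("f_Y matches the N(1, 4) density")
```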

Thm [2.43] = extension of Thm [2.41] to RVecs: g : Rd → Rd with an

inverse h, smooth; Jh(y) := det(∂hi/∂yj) is the Jacobian of h [rings a bell?].

If X ∈ Rd is AC, so is Y = g(X), and fY(y) = fX(h(y))|Jh(y)|.

Works nicely for normal distributions!

Prpn [2.47] Let F be a DF on R, Q(x) := inf{t : F (t) ≥ x}, x ∈ (0, 1), its

quantile function, U ∼ U [0, 1]. Then X := Q(U) ∼ F .

I Note that Q is non-decreasing and Q(x) ≤ t ⇔ x ≤ F (t). So

P(X ≤ t) = P(Q(U) ≤ t) = P(U ≤ F (t)) = F (t), bingo!

NB: If F is continuous and X ∼ F , then also F (X) ∼ U [0, 1].

[DIY! What if F is not continuous?]
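Prpn [2.47] is the basis of inverse-transform sampling. A minimal sketch (distribution and names are our choice, not from the slides) for F = Exp(1), where the quantile function has the closed form Q(u) = −log(1 − u):

```python
import random
import math

# Inverse-transform sampling (Prpn [2.47]): if U ~ U[0,1] and Q is the
# quantile function of F, then Q(U) ~ F.  Sketch for the Exp(1) DF
# F(t) = 1 - exp(-t), whose quantile function is Q(u) = -log(1 - u).
random.seed(0)

def exp_quantile(u):
    # Q(u) = inf{t : F(t) >= u} = -log(1 - u)
    return -math.log(1.0 - u)

n = 200_000
sample = [exp_quantile(random.random()) for _ in range(n)]

# The empirical DF at t should be close to F(t) = 1 - exp(-t)
for t in (0.5, 1.0, 2.0):
    emp = sum(x <= t for x in sample) / n
    print(t, round(emp, 3), round(1 - math.exp(-t), 3))
```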

49

Independent RVs

Def [3.1] RVs X1, . . . , Xn are called independent if, ∀B1, . . . , Bn ∈ B(R),

P(X1 ∈ B1, . . . , Xn ∈ Bn) = ∏_{j=1}^n P(Xj ∈ Bj).

NB: Makes perfect sense when interpreting P’s as relative frequencies!

NB: i.i.d. means "independent identically distributed" (if it refers to an infinite

sequence, it means that any finite subset of the Xj's is i.i.d.).

How do we know if RVs are independent? Suppose we know their (joint) DF.

Thm [3.3] RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,

FX1,...,Xn(t1, . . . , tn) = ∏_{j=1}^n FXj(tj).

I ⇒ Special case: Bj = (−∞, tj ].

⇐ Can easily verify for Bj = (aj , bj ], and then extend. Trust me, OK? Good.

50

Thm [3.4] Discrete RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,

P(X1 = t1, . . . , Xn = tn) = ∏_{j=1}^n P(Xj = tj). (∗)

I ⇒ Again a special case: Bj = {tj}.

⇐ Will prove for n = 2; the same argument works in the general case.

Suppose X and Y are discrete RVs whose joint PMF factorises as in (∗). Then

P(X ∈ A, Y ∈ B) = P(⋃_{ai∈A, bj∈B} {X = ai, Y = bj})

= ∑_{ai∈A} ∑_{bj∈B} P(X = ai, Y = bj) = ∑_{ai∈A} ∑_{bj∈B} P(X = ai) P(Y = bj)

= (∑_{ai∈A} P(X = ai)) (∑_{bj∈B} P(Y = bj)), bingo!

51

A similar criterion holds for AC RVs. Its proof is basically the same: just

replace sums with integrals.

Thm [3.5] AC RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,

fX1,...,Xn(t1, . . . , tn) = ∏_{j=1}^n fXj(tj).

Ex Uniform distribution on [0, 1]d.

Ex Standard multivariate normal distribution.

NB: If gj are "nice enough" functions and X1, . . . , Xn are independent RVs, then

so are the RVs Yj := gj(Xj), j = 1, . . . , n. [Kind of common sense, no?] Look:

P(Yj ∈ Bj, j ≤ n) = P(Xj ∈ g_j^{−1}(Bj) =: B′j, j ≤ n) = ∏_{j=1}^n P(Xj ∈ B′j) etc.

BTW: Do you remember how to compute the PMF of the sum of two

independent integer-valued RVs? The density in case of AC RVs?
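Re the BTW: for independent integer-valued X and Y one has P(X + Y = k) = ∑_i P(X = i)P(Y = k − i), a discrete convolution. A small sketch (the dice example is our choice):

```python
# PMF of the sum of two independent integer-valued RVs (a sketch):
# P(X + Y = k) = sum_i P(X = i) P(Y = k - i)  -- discrete convolution.
# X, Y independent fair dice, so X + Y has the familiar triangular PMF.

die = {i: 1 / 6 for i in range(1, 7)}

def convolve(p, q):
    r = {}
    for i, pi in p.items():
        for j, qj in q.items():
            r[i + j] = r.get(i + j, 0.0) + pi * qj
    return r

pmf_sum = convolve(die, die)
print(pmf_sum[7])   # = 6/36 (up to float rounding)
```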

52

When we say: “Let X1, . . . , Xn be independent RVs with DFs F1, . . . , Fn”,

how do we know that such a thing exists at all??

I Take Ω := Rn, F := B(Rn), define P as the probability on (Rn, B(Rn))

whose DF is given by ∏_{j≤n} Fj(xj), and take Xj(ω) := ωj (coordinate

projections: ω = (ω1, . . . , ωn)) — done!

Alternatively, take Ω := [0, 1]n, F := B([0, 1]n), define P as the uniform

probability on [0, 1]n, and take Xj(ω) := F_j^{−1}(ωj) (quantile functions of the

coordinates) — done!

Constructing infinite sequences of independent RVs is a bit more interesting.

[BTW, what about dependent ones?]

53

Independent Events

Def [3.19] Events A1, . . . , An are called independent if their indicators are

independent RVs.

Equivalently, for any I ⊂ {1, . . . , n},

P(⋂_{i∈I} Ai) = ∏_{i∈I} P(Ai) [= standard def'n].

⇒) The indicators of Ai, i ∈ I, are independent. Use Def. [3.19], [3.1] for that

subset of events, with Bi := {1} (then {Xi ≡ 1_{Ai} ∈ Bi} = Ai).

⇐) Want: P(⋂_{j=1}^n {1_{Aj} ∈ Bj}) = ∏_{j=1}^n P(1_{Aj} ∈ Bj). Note: {1_{Aj} ∈ Bj} = Ω

if 0, 1 ∈ Bj; = ∅ if 0, 1 ∉ Bj; = Aj if 1 ∈ Bj, 0 ∉ Bj; = A^c_j if 0 ∈ Bj, 1 ∉ Bj. Only

care about the last two alternatives: for I ⊂ {1, . . . , n}, show (by induction):

P(⋂_{i∈I} Ai ∩ ⋂_{j∈Ic} A^c_j) = ∏_{i∈I} P(Ai) × ∏_{j∈Ic} P(A^c_j).

54

NB: Again, makes perfect sense when we interpret probabilities as the

(limiting values of) relative frequencies.

NB: Differs from pair-wise independence [Bernstein’s example].

Cor [3.21] Events A1, . . . , An are independent iff A^c_1, . . . , A^c_n are.

I Prac class exercise.

55

4. Expectations

From 2nd year probability/stats subjects etc.: if X is a discrete RV (countable

set C = {ti} of possible values), then

EX = ∑_i ti P(X = ti). [= ∫ x fX(x) dx for AC X's.]

Why is this called the “expected” (or “mean”) value of X?

Recall the relative frequency interpretation of probability: in n independent

replications of our RE, Xj being the value observed in the jth replication, set

ni := #{j ≤ n : Xj = ti}. Then P(X = ti) ≈ ni/n for large n, so that

X̄n := (1/n) ∑_{j=1}^n Xj = (1/n) ∑_{j=1}^n ∑_i ti 1(Xj = ti)

= (1/n) ∑_i ti ∑_{j=1}^n 1(Xj = ti) = ∑_i ti (ni/n) ≈ ∑_i ti P(X = ti),

since the inner sum ∑_{j=1}^n 1(Xj = ti) = ni.

56

NB: In both cases, these are just computational rules rather than definitions

of expectation. Need a common one, applicable not only to discrete and AC

RVs, but also to mixtures thereof etc., and s.t. X̄n → EX, n→∞ (LLN!).

Let's start with indicators: for X = 1_A, A ∈ F, set

EX := P(A).

This makes sense, as X̄n = nA/n. How to proceed?

Expected properties of expectations of general RVs:

• constants are expectations of themselves: EX = c for X ≡ c;

• linearity: E(aX + bY ) = aEX + bEY, where a, b ∈ R are constants;

• monotonicity: X ≤ Y ⇒ EX ≤ EY .

Linearity leads to the following

Def [4.1] For a simple RV X = ∑_{i=1}^n ai 1_{Ai}, the expectation is defined as

EX := ∑_{i=1}^n ai P(Ai).

57

NB: This is a consistent def’n: the same result for any representation of X!!

If one can also write X = ∑_{i=1}^{n′} a′i 1_{A′i}, then ∑_{i=1}^n ai P(Ai) = ∑_{i=1}^{n′} a′i P(A′i).

One can see that by looking at the atoms of σ(A1, . . . , An; A′1, . . . , A′_{n′}).

NB: In particular, X ≡ c = c1Ω has expectation EX = cP(Ω) = c, OK!

Prpn [4.2] Expectation is a linear operation on simple RVs: for simple

X = ∑_{i=1}^n ai 1_{Ai} and Y = ∑_{j=1}^m bj 1_{Bj}, and constants a, b ∈ R,

E(aX + bY) = aEX + bEY.

I Indeed, assuming that {Ai} and {Bj} are partitions of Ω, we have

E(aX + bY) = E ∑_{i=1}^n ∑_{j=1}^m (a ai + b bj) 1_{Ai∩Bj} = ∑_{i=1}^n ∑_{j=1}^m (a ai + b bj) P(Ai ∩ Bj)

= a ∑_{i=1}^n ai ∑_{j=1}^m P(Ai ∩ Bj) + b ∑_{j=1}^m bj ∑_{i=1}^n P(Ai ∩ Bj)

= a ∑_{i=1}^n ai P(Ai) + b ∑_{j=1}^m bj P(Bj) = aEX + bEY, OK!

58

NB: For a simple RV X ≥ 0, clearly EX ≥ 0. Now monotonicity follows

from linearity: if X ≤ Y then Y −X ≥ 0 and so

0 ≤ E(Y −X) = EY −EX, good!

Now what?

Idea: perhaps can approximate an arbitrary RV by a sequence of simple RVs?

And perhaps the expectations of these simple RVs will tend somewhere?

First look at non-negative RVs. Any such RV can be approximated by an

increasing sequence of simple RVs {Xn}n≥1 in the following sense:

∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.

59

In fact, we can construct such a sequence explicitly: let N := n2^n, and

consider disjoint (for a fixed n) events

An,k := {k2^{−n} ≤ X(ω) < (k + 1)2^{−n}}, k = 0, 1, 2, . . . , N − 1,

An,N := {X(ω) ≥ n}.

Next put

Xn(ω) := k2^{−n} for ω ∈ An,k, k < N, and Xn(ω) := n for ω ∈ An,N. [Picture.]

Look: An,k = An+1,2k + An+1,2k+1, k < N, and so Xn+1 ≥ Xn.

Moreover, if we fix ω, then, for n > X(ω), we will have

0 ≤ X(ω) − Xn(ω) ≤ 2^{−n},

so that Xn(ω) ↑ X(ω) as n→∞. It works!! [Even for ω's with X(ω) =∞. . . ]
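The dyadic construction above can be sketched in a few lines (the test value x is our choice):

```python
import math

# Dyadic approximation of X >= 0 by simple RVs (a sketch):
# X_n := k 2^{-n} on {k 2^{-n} <= X < (k+1) 2^{-n}}, k < n 2^n, and := n on {X >= n}.

def X_n(x, n):
    if x >= n:
        return float(n)
    return math.floor(x * 2 ** n) / 2 ** n   # = k 2^{-n}, k = floor(x 2^n)

x = 2.718281828
vals = [X_n(x, n) for n in range(1, 8)]
print(vals)
# monotone non-decreasing in n, and 0 <= x - X_n(x) <= 2^{-n} once n > x
```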

60

Def [4.4] For an arbitrary RV X ≥ 0, we put

EX := lim_{n→∞} EXn,

where Xn ≥ 0 are simple RVs s.t. ∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.

[We already know that such sequences of RVs do exist!!]

NB: From monotonicity of E, 0 ≤ EX1 ≤ EX2 ≤ EX3 ≤ . . . , so the limit of

this numerical sequence always exists (but can be infinite, of course).

A sober question: How can one use such a def’n? What if for different

sequences {Xn} of simple RVs one can have different values of limn→∞EXn?

Prpn [4.5] One cannot.

Hence Def [4.4] is consistent: regardless of the choice of {Xn}, the value of

the limit will be one and the same and so can be used to define EX.

61

I A nice exercise: will do it for illustration purposes (typical argument).

• Let both {Xn} and {X̃n} be as in Def [4.4]. We will show that one must

have lim_{n→∞} EXn = lim_{n→∞} EX̃n.

• Suppose Y ≤ X is a simple RV, fix an ε > 0 and set An := {Xn > Y − ε}.

Look: since Xn+1 ≥ Xn, one has An ⊂ An+1, and

since Xn ↑ X ≥ Y > Y − ε, one has An ↑ Ω.

Therefore P(An) → 1, equivalently P(A^c_n) → 0 as n→∞.

• Now Xn ≥ (Y − ε)1_{An}

(on An this holds due to the def'n of An, whereas on A^c_n the RHS is 0), so

EXn ≥ E(Y − ε)1_{An} = E Y 1_{An} − εP(An) ≥ E Y(1 − 1_{A^c_n}) − ε

= EY − E Y 1_{A^c_n} − ε ≥ EY − max_{ω∈Ω} Y(ω) · P(A^c_n) − ε → EY − ε,

using Y 1_{An} = Y(1 − 1_{A^c_n}), P(An) ≤ 1, max Y < ∞ (Y is simple) and P(A^c_n) → 0.

62

• Thus lim_{n→∞} EXn ≥ EY − ε, and as ε > 0 is arbitrary (small), we have

lim_{n→∞} EXn ≥ EY.

• So if we take Y = X̃k for any fixed k (can do! As X̃k ≤ X), this will give

lim_{n→∞} EXn ≥ EX̃k.

Therefore

lim_{n→∞} EXn ≥ lim_{k→∞} EX̃k.

By symmetry, will also have

lim_{n→∞} EX̃n ≥ lim_{k→∞} EXk.

Hence the two limits must coincide! Bingo.

63

So we have got a def’n of EX for arbitrary X ≥ 0. What if X ≷ 0?

Let’s use non-negative RVs X+ = max{X, 0}, X− = −min{X, 0}, noting that

X = X+ −X−, |X| = X+ +X−.

Def [4.12] An RV X is called integrable if E|X| <∞ (and this is often

written as “X ∈ L1”). If X is integrable, its expectation is defined by

EX := EX+ −EX−. (∗)

Expectation of X over an event A is E(X;A) := EX1A.

NB: If E|X| <∞ then both EX± <∞, so (∗) makes sense.

If one of EX+ and EX− is infinite, then one can still use (∗) to define EX

(which will be ±∞, depending on which of EX± is infinite).

If both EX± =∞ then EX is undefined (what is ∞−∞?).

64

Good news: thus defined expectation inherits all the good properties of

expectation of simple RVs.

• Monotonicity: if X ≤ Y and EY <∞, then EX ≤ EY .

I First prove for non-negative RVs, using simple RVs (sim. to Prpn [4.5]).

This implies, in particular, that if |X| ≤ Y and Y ∈ L1 then also X ∈ L1.

• Linearity: if X,Y ∈ L1 and a, b ∈ R, then E(aX + bY ) = aEX + bEY.

I First note that V := aX + bY ∈ L1 since |V | ≤ |aX|+ |bY | = |a||X|+ |b||Y |

and E|X| <∞, E|Y | <∞.

Secondly, establish linearity per se: again using simple functions to

approximate RVs, passing to the limit. We’ll skip it. Not examinable. Good.

Cor [4.14] For X ∈ L1, one has |EX| ≤ E|X|.

I By definition, EX = EX+ − EX−, so

|EX| ≤ |EX+| + |EX−| = EX+ + EX− = E(X+ + X−) = E|X|, by linearity.

65

A natural question: Given that Xn(ω)→ X(ω) as n→∞ (say, for all

ω ∈ A, P(A) = 1), will this imply that EXn → EX as well?

A natural answer: It depends.

Ex. On Ω = (0, 1) with P = U [0, 1], let

Xn(ω) := n1(ω < 1/n).

Obviously, Xn(ω) = X(ω) ≡ 0 for n > 1/ω, so Xn(ω)→ X(ω) for all ω ∈ Ω.

However, EXn = nP((0, 1/n)) = n× 1/n = 1 6→ 0 = EX. No good.

Convergence of expectations always holds for:

• monotone sequences of RVs, e.g. Xn(ω) ↑ X(ω);

• “dominated” sequences of RVs, i.e. when |Xn| ≤ Y, where Y ∈ L1

(the dominated convergence theorem, cf. Ex. above).

These results will be discussed later, after we have introduced a.s. convergence.

66

For RVecs and complex-valued RVs, expectations are defined component-wise.

Thus, for Z = X + iY ∈ C, with X,Y ∈ R, we set

EZ := EX + iEY.

And finally: another notation for E using integrals:

EX ≡ ∫_Ω X(ω)P(dω) ≡ ∫_Ω X(ω) dP(ω) ≡ ∫_Ω X dP.

The integral construction that we described is that of the Lebesgue

integral. How is it different from the (conventional) Riemann integral?

Recall : for Riemann integral, we partition the domain of integration, usu. part

of R or Rd (how can we do that in more general cases?). However, for

Lebesgue integral, we partition the range of the integrand that can be defined

on an abstract set Ω! Hence it’s a much more general def’n. [Ex. Banknotes.]

When integrating “nice” functions on Rd w.r.t. volume measure (dx in the

case of R), both integrals give the same answer.

67

Integrals w.r.t. distributions and DFs

As we said earlier, when dealing with RVs one can often just work with the

distributions thereof, "forgetting" the original underlying (Ω, F, P) and X(ω)

and switching to (Rd, B(Rd), PX). Accordingly, one abandons the general

integrals ∫_Ω X(ω) dP(ω) (that are important, of course, for theoretical

calculations) in favour of (more practical)

∫ x dP(x) and ∫ g(x) dP(x), P ≡ PX.

The latter integral is often denoted by ∫ g(x) dF(x), F being the DF of P.

Prpn. Assume that X ∼ P is an RV on (Ω, F, P) whose expectation is defined

(i.e., at least one of EX± is finite). Then EX ≡ ∫_Ω X(ω) dP(ω) = ∫_R x dP(x).

That is, expectations are numerical characteristics of the distributions of RVs.

The above claim is kind of a “change of variables” formula. Requires a proof.

68

I We only need to prove the Prpn for X ≥ 0. Recall our approximating

sequence of simple RVs Xn ↑ X (as n→∞) from sl. 60:

Xn = ∑_{k=0}^{n2^n−1} k2^{−n} 1(k2^{−n} ≤ X < (k + 1)2^{−n}) + n1(X ≥ n).

For the RV X∗(x) := x on the prob'ty space (R, B(R), P) one has X∗ d= X, and

X∗n(x) := ∑_{k=0}^{n2^n−1} k2^{−n} 1(k2^{−n} ≤ x < (k + 1)2^{−n}) + n1(x ≥ n)

is an approx'g sequence of simple RVs on (R, B(R), P): X∗n ↑ X∗ as n→∞, and

EXn = ∑_{k=0}^{n2^n−1} k2^{−n} P(k2^{−n} ≤ X < (k + 1)2^{−n}) + nP(X ≥ n)

= ∑_{k=0}^{n2^n−1} k2^{−n} P([k2^{−n}, (k + 1)2^{−n})) + nP([n,∞)) = EX∗n.

It remains to recall that EX = lim_{n→∞} EXn, EX∗ = lim_{n→∞} EX∗n. Bingo!

Basically the same argument shows that if Y = g(X) ∼ PY, X ∼ PX, then

EY = Eg(X) = ∫ y dPY(y) = ∫ g(x) dPX(x).

69

If F is AC with a "nice" density f = F′ (a.e.), and g is "nice" as well (e.g. both

are piece-wise continuous), then

∫ g(x) dF(x) = ∫_{−∞}^{∞} g(x)f(x) dx,

the conventional Riemann integral. Again, this needs to be proved (which is

done by starting with piece-wise constant g and then passing to limits), but

that is somewhat beyond the scope of our course.

The above two relations are the basic tools for computing expectations in

discrete & AC cases. When F is a mixture of the two, use a “mixed formula”.

NB: a mnemonic interpretation of the integral: similarly to

∫ g(x) dx ≈ ∑ (g(x)-values) × (increments of x),

one can (naively) think that

∫ g(x) dF(x) ≈ ∑ (g(x)-values) × (increments of F(x)).

70

Thm [4.23] For X ≥ 0, EX = ∫_0^∞ F̄X(x) dx, where F̄X(x) := 1 − FX(x). [Picture.^a]

For X ≷ 0, EX = − ∫_{−∞}^0 FX(x) dx + ∫_0^∞ F̄X(x) dx.

I For X ≥ 0 s.t. P(X ∈ hZ) = 1 for some h > 0, one has

EX = ∑_{k=0}^∞ hk P(X = hk) = h ∑_{k=1}^∞ ∑_{j=1}^k P(X = hk) = h ∑_{j=1}^∞ ∑_{k=j}^∞ P(X = hk)

= h ∑_{j=0}^∞ P(X > hj) = ∑_{j=0}^∞ F̄X(hj) h = ∫_0^∞ F̄X(x) dx. (∗)

For general X ≥ 0, set X′n := ⌊nX⌋/n, X′′n := X′n + 1/n, n ≥ 1, s.t. X′n ≤ X ≤ X′′n

and X′′n − X′n ≤ 1/n. Hence EX′n ≤ EX ≤ EX′′n ≤ EX′n + 1/n and^b

EX′n = ∫_0^∞ F̄X′n(x) dx ≤ ∫_0^∞ F̄X(x) dx ≤ ∫_0^∞ F̄X′′n(x) dx = EX′′n ≤ EX′n + 1/n,

using (∗) for the two outer equalities. Bingo.

^a In particular, for integer-valued X's, EX = ∑_{n≥1} nP(X = n) = ∑_{n≥1} P(X ≥ n).

^b Obvious: if X ≤ Y then F̄X(x) = P(X > x) ≤ P(Y > x) = F̄Y(x), x ∈ R.
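The integer-valued special case in footnote a, EX = ∑_{n≥1} P(X ≥ n), can be checked on a toy PMF (the numbers are our own choice):

```python
# Tail-sum formula for an integer-valued RV (a sketch):
# EX = sum_{n>=1} n P(X = n) = sum_{n>=1} P(X >= n).
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

mean_direct = sum(n * p for n, p in pmf.items())
mean_tails = sum(sum(p for m, p in pmf.items() if m >= n) for n in range(1, 4))
print(mean_direct, mean_tails)   # both equal 1.7 (up to float rounding)
```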

71

Functions of RVs

For the RV Y := g(X) (g is "nice" enough),

EY = ∫ y dFY(y) = ∫ g(x) dFX(x),

so in fact we don't need to compute the DF FY to find EY.

We mostly use the following consequences of this: for discrete/AC X's,

Eg(X) = ∑_{ti∈CX} g(ti)P(X = ti), Eg(X) = ∫ g(x)fX(x) dx.

The same applies to f’ns of RVecs. An important special case: (X1, X2) ∈ R2.

72

Cor [4.30] If X1 and X2 are independent RVs and gi(Xi) ∈ L1, i = 1, 2, then

E g1(X1)g2(X2) = E g1(X1) E g2(X2). (∗)

I First consider gi(x) = 1(x ∈ Bi) for some Bi ∈ B(R):

E g1(X1)g2(X2) = E 1(X1 ∈ B1)1(X2 ∈ B2) = E 1(X1 ∈ B1, X2 ∈ B2)

= P(X1 ∈ B1, X2 ∈ B2) = P(X1 ∈ B1) P(X2 ∈ B2) [by indep'ce]

= E 1(X1 ∈ B1) E 1(X2 ∈ B2) = E g1(X1) E g2(X2).

Next show that (∗) holds for gi(x) = ∑_{j=1}^n ai,j 1(x ∈ Bi,j):

E g1(X1)g2(X2) = ∑_{j,k} a1,j a2,k E 1(X1 ∈ B1,j)1(X2 ∈ B2,k) = [using the above]

= ∑_{j,k} a1,j a2,k E 1(X1 ∈ B1,j) E 1(X2 ∈ B2,k) = E g1(X1) E g2(X2).

Such simple functions g form a large enough class to approximate general

functions g. The usual limiting procedure works. Good.

73

Special case: Moments

This is when g(x) = xk.

Def The kth moment of X is EX^k = ∫ x^k dF(x).

Expectation EX = the 1st moment. [Why called 'moments'?]

NB: the expectation is a characteristic of location. Can be BAD.

Def The kth central moment of X is E(X − EX)^k.

Variance Var(X) := E(X − EX)^2 ≡ EX^2 − (EX)^2 = the 2nd central moment.

NB: the variance is a characteristic of scale/spread. Can be BAD.

Absolute moments: for | · |, i.e. E|X − EX|^k = the kth absolute central moment.

For RVs X with E|X|^p < ∞, we write: X ∈ Lp (p > 0).

74

What are moments good for?

Sometimes they can be easily calculated.

If you know all the moments of X, you’ll know its distribution as well

(under broad conditions — but this is not always so).

There are several very useful inequalities involving moments.

Some of them relate different moments, some of them give bounds for

probabilities in terms of moments.

75

Thm [4.39] (Jensen’s inequality) Let X ∈ L1 and g : R→ R be a convex

function. Then

g(EX) ≤ E g(X). [NB: Cor [4.14] is a special case: g(x) = |x|.]

I For a convex g, for any x0 ∈ R there always exists an a ∈ R s.t.

g(x) ≥ g(x0) + a(x− x0) for all x ∈ R.

Now take x0 = EX and replace x with X:

g(X) ≥ g(EX) + a(X −EX), ω ∈ Ω.

Taking expectations of both sides, using linearity & monotonicity:

E g(X) ≥ g(EX) + aE(X −EX) = g(EX).

Bingo.

76

Cor [4.37] (Lyapunov's inequality) For 0 < r ≤ s,

(E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s}.

NB: this implies, in particular, that if the sth moment is finite, then so is the

rth one. Thus, if the second moment is finite, then the expectation must be

finite, too.

I Letting Y := |X|^r, the inequality can be re-written as

(EY)^{s/r} ≤ E Y^{s/r}.

This is a special case of Jensen's inequality, with g(x) = x^{s/r}, x ≥ 0, which is

convex for s ≥ r. [Can put g(x) = 0 for x < 0, if you wish.] Bingo, right?

77

Thm [4.40] (Chebyshev's [Markov's] inequality) If g : R → R is a

positive non-decreasing function, then, for any RV X and number a ∈ R,

P(X ≥ a) ≤ Eg(X)/g(a).

I Since 1(X ≥ a) ≤ g(X)/g(a), we have

P(X ≥ a) = E 1(X ≥ a) ≤ E g(X)/g(a) = Eg(X)/g(a).

Bingo. Special cases:

P(|X| ≥ a) ≤ E|X|^p / a^p for p, a > 0;

P(|X − EX| ≥ a) ≤ Var(X)/a^2 for a > 0;^a

P(X ≥ a) ≤ Ee^{tX} / e^{ta} for t > 0.

^a BTW: this gives a '3σ-rule bound' of 1/9 only!
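The variance special case can be illustrated by simulation; a sketch for X ∼ U[0, 1] (our choice, EX = 1/2, Var X = 1/12):

```python
import random

# Chebyshev's inequality P(|X - EX| >= a) <= Var(X)/a^2, checked by
# simulation for X ~ U[0,1] (a sketch).
random.seed(1)
n = 100_000
xs = [random.random() for _ in range(n)]

a = 0.4
freq = sum(abs(x - 0.5) >= a for x in xs) / n   # true value: 0.2
bound = (1 / 12) / a ** 2                        # Chebyshev bound: ~0.52
print(freq, bound)   # the empirical frequency sits below the bound
```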

78

Def Mixed moments: EX^k Y^m (+ central, absolute etc.).

Def For X, Y ∈ L2, the covariance of X and Y is

Cov(X, Y) = E(X − EX)(Y − EY) ≡ EXY − EX EY. [Do you see why?]

The correlation between X and Y with Var(X), Var(Y) > 0 is

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)). [NB: Both are symmetric!]

BTW: Why did we require that X, Y ∈ L2? This ensures that EXY is finite:

0 ≤ (X ± Y)^2 = X^2 + Y^2 ± 2XY ⇒ |XY| ≤ (X^2 + Y^2)/2, bingo.

[Of course, it also ensures that X, Y ∈ L1, see Cor [4.37].]

If we replace X with X/√(EX^2), and Y with Y/√(EY^2), we'll get

|XY| / √(EX^2 EY^2) ≤ (1/2) (X^2/EX^2 + Y^2/EY^2). (∗)

Taking expectations on both sides leads to the famous

79

Cor [4.36] (Cauchy–Bunyakovsky inequality)^a If X, Y ∈ L2 then

XY ∈ L1 and

E|XY| ≤ √(EX^2 EY^2).

Since |EXY| ≤ E|XY|, for variances/covariances Cor [4.36] means that

|Cov(X, Y)| ≤ √(Var(X) Var(Y)),

and hence that

|Corr(X, Y)| ≤ 1.

Cool. BTW: when do we have "=" in this inequality? Looking at the

derivation of the inequality, one can see that in fact

Corr(X, Y) = 1 (−1, resp.) ⇔ X = aY + b, a > 0 (a < 0, resp.),

see sl. 85 for a formal derivation.

^a For some unclear reasons, it's often referred to as the Cauchy–Schwarz inequality.

80

So correlation = ±1 when there is a perfect linear relationship b/w the RVs.

When the correlation is zero, one says that the RVs are uncorrelated, which

is NOT the same as independent.

NB: IF X and Y are independent, then by Cor [4.30] (sl. 74)

EXY = EX EY,

and hence Corr (X,Y ) = 0, but not the other way around! [Examples?]

But: for Gaussian (X,Y ), it’s the same [look at the density!].
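A standard example for the "[Examples?]" prompt (a sketch; the RVs are our own choice): X uniform on {−1, 0, 1} and Y := X² are uncorrelated but clearly dependent.

```python
from fractions import Fraction

# Uncorrelated but NOT independent (a sketch): X uniform on {-1, 0, 1},
# Y := X^2.  Then Cov(X, Y) = EXY - EX EY = EX^3 = 0, yet Y is a
# function of X, so X and Y cannot be independent.
outcomes = [Fraction(-1), Fraction(0), Fraction(1)]
p = Fraction(1, 3)

EX  = sum(x * p for x in outcomes)
EY  = sum(x ** 2 * p for x in outcomes)
EXY = sum(x * x ** 2 * p for x in outcomes)
cov = EXY - EX * EY
print(cov)            # 0 -> uncorrelated

# but P(X = 1, Y = 1) = 1/3, while P(X = 1) P(Y = 1) = (1/3)(2/3) = 2/9
print(p, p * EY)      # 1/3 vs 2/9 -> dependent
```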

Covariance & correlation have nice (and also quite insightful & useful)

geometric interpretations.

81

Consider X, Y ∈ L2, set X0 := X − EX, Y0 := Y − EY, and look at

Var(X + Y) = E(X0 + Y0)^2 = EX0^2 + EY0^2 + 2EX0Y0 = Var(X) + Var(Y) + 2Cov(X, Y).

Compare this with: for u, v ∈ Rd,

‖u + v‖^2 = (u + v, u + v) = (u, u) + (v, v) + 2(u, v) = ‖u‖^2 + ‖v‖^2 + 2(u, v).

In fact, Cov(X, Y) can be interpreted as a scalar product (of X0 and Y0, the

centred versions of X and Y, in the linear space L2 of square-integrable RVs,

so that the norm of X is √(EX^2) — quite similar to the std Euclidean norm

in Rd) and so has the same properties.

82

In particular, |(u, v)| ≤ ‖u‖‖v‖ becomes our |Cov(X, Y)| ≤ √(Var(X) Var(Y)).

Note also that correlation is an analog of

(u, v) / (‖u‖‖v‖) = cos(∠(u, v)).

So uncorrelated RVs are like orthogonal vectors (in Rd: for them, (u, v) = 0

or, which is the same, cos(∠(u, v)) = 0).

In particular, Pythagoras’ theorem holds:

Var (X + Y ) = Var (X) + Var (Y ) ⇔ Cov (X,Y ) = 0.

Independent RVs are a special case of uncorrelated RVs (sl. 82), but not the

other way around: uncorrelated RVs don't need to be independent! [Examples?]

83

Correlation is a measure of linear association rather than independence.

To the geometric fact that

|cos(∠(u, v))| = 1 ⇔ u and v are collinear (i.e. u = av for some a ≠ 0)

there corresponds the following relation:

|Corr(X, Y)| = 1 ⇔ P(Y = aX + b) = 1 for some a ≠ 0, b ∈ R.

[NB: Cov is the scalar product of the centred versions of X and Y .]

I Indeed, assume that Corr(X, Y) = 1. Then, for the standardized RVs

X1 := (X − EX)/√Var(X), Y1 := (Y − EY)/√Var(Y),

we have

E(X1 − Y1)^2 = EX1^2 + EY1^2 − 2Corr(X, Y) = 1 + 1 − 2 = 0.

Likewise, assuming that Corr(X, Y) = −1, we'll get E(X1 + Y1)^2 = 0.

Either way, we have an RV Z = (X1 ∓ Y1)^2 ≥ 0 with EZ = 0. But this is only

possible when P(Z = 0) = 1, and this is why:

84

Set An := {Z > 1/n}, n = 1, 2, . . . Then, by Markov's inequality

(Thm [4.40], sl. 79),

P(An) = P(Z > 1/n) ≤ EZ/(1/n) = 0.

Further, we clearly have An ⊂ An+1 and An ↑ A := ⋃_{k≥1} Ak ≡ {Z > 0}, so

that, by continuity of probability,

P(A) = lim_{n→∞} P(An) = 0.

Thus we proved this:

Corr(X, Y) = ±1 ⇒ P(X1 ∓ Y1 = 0) = 1,

i.e. w.p. 1 holds

(X − EX)/√Var(X) ∓ (Y − EY)/√Var(Y) = 0.

This means that Y = aX + b for some a, b ∈ R with a of the same sign as the

correlation Corr(X, Y). [Picture.] Bingo!

85

When considering d > 2 RVs, i.e. dealing with RVecs X = (X1, . . . , Xd), one

uses covariance matrices (CovMs):

C²X ≡ [C²X(i, j)]_{i,j=1,...,d} := [Cov(Xi, Xj)] = E(X − EX)⊤(X − EX).

NB: We are using row vectors x = (x1, . . . , xd) here, so x⊤ is a column and the

matrix product x⊤x = [xi xj] is a d × d-matrix. If we used column vectors, we

would be writing xx⊤ for the same object.

Note that C²X(i, i) = Var(Xi), i ≤ d.

These two are the key properties of CovMs:

(CovM.1) C²X is symmetric: C²X(i, j) = C²X(j, i).

(CovM.2) C²X is positive (or, rather, non-negative) definite: ∀x ∈ Rd, x C²X x⊤ ≥ 0.

86

(CovM.1) is obvious, as Cov(X, Y) = Cov(Y, X).

(CovM.2) requires some effort. Setting Y := Xx⊤ ∈ R, we have

0 ≤ Var(Y) = E(Xx⊤ − EXx⊤)^2 = E[(X − EX)x⊤]^2

= E[(X − EX)x⊤ (X − EX)x⊤]

= E[((X − EX)x⊤)⊤ (X − EX)x⊤] [as a⊤ = a for a ∈ R]

= E[x(X − EX)⊤(X − EX)x⊤]

= x E[(X − EX)⊤(X − EX)] x⊤ = x C²X x⊤, ufff, bingo!

It turns out that any d× d-matrix satisfying (CovM.1) & (CovM.2) is the

CovM of some distribution on Rd (e.g. a d-dimensional normal distribution,

which is our next topic).

87

Multivariate Normal Distributions

Recall: the standard d-dim normal RVec X = (X1, . . . , Xd) has i.i.d.

components Xj ∼ N(0, 1), and hence its density has the product form:

f(x) = ∏_{j=1}^d (2π)^{−1/2} e^{−x_j^2/2} ≡ (2π)^{−d/2} exp{−xx⊤/2}, x ∈ Rd.

Clearly, EX = 0 ∈ Rd, C²X = I ≡ diag(1, . . . , 1), the identity matrix.

Now consider Y := µ + XA, where µ ∈ Rm, A ∈ R^{d×m}, so that Y ∈ Rm. Then

EY = E(µ + XA) = µ + (EX)A = µ

and

C²Y = E(Y − EY)⊤(Y − EY) = E(XA)⊤XA = EA⊤X⊤XA = A⊤(EX⊤X)A = A⊤IA = A⊤A.
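The identity C²Y = A⊤A is easy to check by simulation; a sketch with a 2×2 matrix A and shift µ of our own choosing, using only the standard library:

```python
import random

# Y = mu + X A for standard normal X (a sketch): the covariance matrix
# of Y should be A^T A.  Checked by simulation with a 2x2 matrix A.
random.seed(2)
A = [[1.0, 0.5],
     [0.0, 2.0]]
mu = [1.0, -1.0]

def sample_Y():
    x = [random.gauss(0, 1), random.gauss(0, 1)]   # X with iid N(0,1) parts
    # row vector times matrix: y_j = mu_j + sum_i x_i A[i][j]
    return [mu[j] + sum(x[i] * A[i][j] for i in range(2)) for j in range(2)]

n = 200_000
ys = [sample_Y() for _ in range(n)]
means = [sum(y[j] for y in ys) / n for j in range(2)]
cov01 = sum((y[0] - means[0]) * (y[1] - means[1]) for y in ys) / n

# (A^T A)(0,1) = sum_i A[i][0] * A[i][1] = 1*0.5 + 0*2 = 0.5
print(means, cov01)   # means close to mu, cov01 close to 0.5
```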

88

Now if m ≤ d and det C²Y ≠ 0 (why is m > d no good? NB: det C²Y > 0 for

non-singular C²Y), then we can use Thm [2.43] to claim that Y has the density

fY(y) = (2π)^{−m/2} (det C²Y)^{−1/2} exp{−(y − µ)[C²Y]^{−1}(y − µ)⊤/2}, y ∈ Rm. (∗)

Why? The easiest case is when m = d and det A ≠ 0; we have g(x) = µ + xA

with the inverse h(y) = (y − µ)A^{−1}, so that (∂hi/∂yj) = (A^{−1})⊤ is constant and

det C²Y = det A⊤ det A = (det A)^2 ⇒ Jh = det A^{−1} = (det A)^{−1} = (det C²Y)^{−1/2}.

Remember we said that any matrix satisfying (CovM.1) & (CovM.2) is the

CovM of some distribution (sl. 88)? If B is such a matrix, det B ≠ 0, the

function obtained by replacing C²Y in (∗) with our B will be a (normal)

probability density. Good.

If det B = 0, it will still be the CovM of a normal distribution, but the latter

will be concentrated on a (shifted) linear subspace (and hence not AC).

89

5. Conditional Expectations (CEs)

The expectation of an RV X is a number, a numerical characteristic of X (or,

rather, of PX). We know (cf. Problem 5 from PS–5) that, for X ∈ L2, the

value EX is the best (in mean quadratic) predictor among constants for X:

E(X − a)^2 → min over a, attained at a = EX.

So, if you don't know the value of X, but do know PX, your best "educated

guess" for X will be its mean EX.

However, we often do know something about the outcome of our RE, in

which X was also produced, but remains hidden from us. For instance, we

don’t know if patient A has got disease D, but know the results of tests made

on him (here the desired RV is X = 1D); or, X is tomorrow’s price of stock S

which we don’t know yet, but whose history up to now we have observed

(plus, we saw the prices of other stocks in the market as well). In Bayesian

statistics, we assume that parameters’ values are random and observe data;

what do the data tell us about the parameters?

90

In all these situations, the key question is:

How to make the best use of the available information?

CEs provide a powerful tool for doing exactly that.

CEs are not numbers but RVs themselves, and in fact are functions of

the observed RVs (the information we have).

NB: If you condition on an event, which is rarely the case, then you end up

with a number, but even in this case you are actually dealing with a f’n.

First we will consider just this special case: conditioning on an event.

Suppose that all we know about the outcome of our RE is that event A

occurred. Given this information, what would be our best “educated guess”

about the realized value of X?

91

Let's minimise the mean quadratic error on A: set

g(a) := E[(X − a)^2; A] = E(X − a)^2 1_A = EX^2 1_A − 2aEX1_A + a^2 P(A),

and solve (for a) the equation

0 = g′(a) = −2EX1_A + 2aP(A).

This yields

a = EX1_A / P(A) ≡ E(X; A)/P(A) =: E(X |A),

which we are well familiar with from our 2nd year probability course(s).

Interpretation: we just average the values of X over A, ignoring the rest of the

sample space. [Picture.]

NB: Of course, we don't need X ∈ L2 to use this definition; it suffices to

have X ∈ L1 (i.e. E|X| < ∞). But the above derivation provides motivation.

92

Of course, we can do the same for A^c instead of A.

So if we know the value of 1_A (telling us whether A occurred or not) and need

to provide the best forecast for X, we use

E(X |1_A) := E(X |A) ≡ E(X; 1_A = 1)/P(1_A = 1) for ω ∈ A,

and E(X |1_A) := E(X |A^c) ≡ E(X; 1_A = 0)/P(1_A = 0) for ω ∈ A^c,

which is an RV!

Next assume that we observed a simple RV Y = ∑_{i=1}^n yi 1_{Ai}, where

{A1, . . . , An} is a partition of Ω, all yi being distinct.

In other words, we know which of the events Ai occurred. Now if Y = yi then

ω ∈ Ai, and we know that in this case the best forecast for X will be given by

X̂ = E(X; Ai)/P(Ai) =: xi, ω ∈ Ai.

This again specifies an RV!

93

Moreover, since Ai = {Y = yi}, introducing the function h(y) by putting

h(yi) := xi, we see that this new RV is actually a function of Y:

X̂ = h(Y) =: E(X |Y),

called the CE of X given the simple RV Y .

This can be thought of as a “crude” version of X obtained by averaging the

values of X over events on which Y assumes one and the same value, the

“atoms” Ai = {Y = yi} of σ(Y ). [Picture.]

NB: Note that the values of Y don’t really matter when defining E(X |Y ).

What matters is the partition generated by Y or, equivalently, σ(Y ).

Ex. Expected weight of inhabitants given the postcode.
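On a finite sample space the CE given a simple Y is literally this averaging over atoms; a minimal sketch in the spirit of the postcode Ex (the toy RVs X and Y below are our own choices):

```python
from fractions import Fraction

# CE given a simple Y (a sketch): on a finite Omega with uniform P, the CE
# E(X | Y) averages X over each "atom" {Y = y_i} of sigma(Y).
omega = list(range(6))                # Omega = {0, ..., 5}, P uniform
X = {w: w * w for w in omega}         # some RV X
Y = {w: w % 2 for w in omega}         # simple Y: parity of w

def cond_exp(w):
    atom = [v for v in omega if Y[v] == Y[w]]            # {Y = Y(w)}
    return Fraction(sum(X[v] for v in atom), len(atom))  # average of X on the atom

Xhat = {w: cond_exp(w) for w in omega}
print(Xhat)   # constant on each atom: 20/3 on evens, 35/3 on odds

# (CE.2): E(Xhat; A) = E(X; A) for A = {Y = 0} (uniform P cancels)
A = [w for w in omega if Y[w] == 0]
assert sum(Xhat[w] for w in A) == sum(X[w] for w in A)
```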

NB: Please note that all what we have said re CEs so far (and what we will be

saying below) holds for RVecs Y as well! (We never used the fact that Y ∈ R.)

94

NB: It is obvious that X and X̂ have the same average values on the sets Ai:

by def'n of xi,

E(X̂; Ai) = E X̂ 1_{Ai} = E xi 1_{Ai} = xi P(Ai) = E(X; Ai)

(the first equality here is the def'n of X̂ on Ai, the last one the def'n of xi).

Q: Will this still hold if we replace Ai with A_I := ∑_{i∈I} Ai, I ⊂ {1, 2, . . . , n}?

A: Sure. It's like averaging the averages:

E(X̂; A_I) = E X̂ 1_{A_I} = E X̂ ∑_{i∈I} 1_{Ai} = ∑_{i∈I} E X̂ 1_{Ai} = ∑_{i∈I} E X 1_{Ai} = · · · = E(X; A_I).

Summarising, we see that the CE X̂ = E(X |Y) has these two properties:

(CE.1) X̂ is "flat" on the "atoms" of σ(Y), i.e. X̂ is an RV

w.r.t. σ(Y) (cf. Tutorial Problem 3 from PS–3; one says that

X̂ is σ(Y)-measurable); this holds iff X̂ = h(Y) for some h, and

(CE.2) E(X̂; A) = E(X; A) for any A ∈ σ(Y) [i.e. A = {Y ∈ B}].

95

Important: (CE.1) & (CE.2) uniquely specify the CE X̂ = E(X |Y) in case

of simple Y's! Indeed, using (CE.2) with A = Ai and applying (CE.1)

immediately implies that X̂ = xi on Ai, bingo.

Now one can formally define CE in case of general Y using these properties.

Thm [CE] Let X ∈ L1 and Y be RVs on a common probability space. Then

there exists an RV X̂ satisfying (CE.1) & (CE.2). This RV is unique, up to its

values on a set of zero probability, and is called the CE of X given Y, denoted

by E(X |Y). [Cf. Def [8.12].]

I The proof is based on Radon-Nikodym’s theorem from Measure Theory,

and we won’t give it here. Good.

Recall: we already said that what really mattered in the def’n of E(X |Y ) in

the case of simple Y ’s was the partition generated by Y or, equivalently, σ(Y ).

Likewise in the general case.

96

First observe that if ϕ is a 1–1 function, then E(X |Y) = E(X |ϕ(Y)) (why?).

In particular, one always has E(X |Y) = E(X |Y^3) = E(X | e^Y) etc., but in

the general case E(X |Y) ≠ E(X |Y^2) (when will "=" hold?).

Second observe that we don’t really need RVs Y themselves for

conditioning — we need information contained in their values. The same

information is also contained in σ(Y ): if we know which events from it

occurred and which didn’t, we know the value of Y ! If, for a given

sub-σ-algebra F1 ⊂ F, we replace (CE.1) with the requirement that X̂ is

F1-measurable (i.e. X̂ is an RV w.r.t. F1), then Thm [CE] will still hold true.

The RV X̂ =: E(X | F1) thus introduced is called the CE of X given the σ-algebra F1.

In probability & statistics, it’s critically important (i) to be aware of the

general definition of CE, (ii) to know and be able to use the key properties of

CE, and hence be able to effectively use this powerful tool, and (iii) to know

how to compute CEs in important special cases.

97

Ex. Poisson sums: We know (from our 2nd year probability subject or

otherwise) that, for independent X ∼ P (λ) and Y ∼ P (µ), the sum is also

Poisson: Z := X + Y ∼ P (λ+ µ). Show that E(X |Z) = (λ/(λ+ µ))Z.

[Motivation: Suppose we know the total number NT of jumps in a Poisson

process during the time interval [0, T ]. What can we say about Nt for t < T?]

(CE.1) is obvious ((λ/(λ+ µ))Z is a function of Z, right?).

(CE.2) To verify this property, we write, for any k ≥ 0,

E(X ; Z = k) = EX1(Z = k) = ∑_{i≥0} ∑_{j≥0} i · 1(i+ j = k) · P(X = i, Y = j)   [1(i+ j = k) = 1 iff j = k − i]

= ∑_{i=0}^{k} i P(X = i, Y = k − i) = ∑_{i=0}^{k} i · (λ^i e^{−λ}/i!) × (µ^{k−i} e^{−µ}/(k − i)!)

= e^{−(λ+µ)} ∑_{i=0}^{k} (i λ^i/i!) × µ^{k−i}/(k − i)!

98

= e^{−(λ+µ)} λ ∑_{i=1}^{k} λ^{i−1} µ^{k−i}/((i− 1)!(k − i)!)   [letting l := i− 1]

= e^{−(λ+µ)} (λ/(k − 1)!) ∑_{l=0}^{k−1} ((k − 1)!/(l!(k − 1− l)!)) λ^l µ^{k−1−l} = e^{−(λ+µ)} λ(λ+ µ)^{k−1}/(k − 1)!.

On the other hand,

E[(λ/(λ+ µ))Z ; Z = k] = (λk/(λ+ µ)) P(Z = k) = (λk/(λ+ µ)) · ((λ+ µ)^k/k!) e^{−(λ+µ)},

which is the same! Bingo.
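As a numerical sanity check of the identity just verified, one can compare the two sides of (CE.2) for A = {Z = k} directly (a sketch; the helper names `pois_pmf` and `both_sides` are mine, not from the slides):

```python
from math import exp, factorial

def pois_pmf(k, a):
    """Poisson pmf with mean a: P(k) = e^{-a} a^k / k!."""
    return exp(-a) * a ** k / factorial(k)

def both_sides(lam, mu, k):
    """(CE.2) on A = {Z = k}: E[X; Z = k] vs E[(lam/(lam+mu)) Z; Z = k]."""
    # LHS: sum over the joint pmf of independent X ~ P(lam), Y ~ P(mu)
    lhs = sum(i * pois_pmf(i, lam) * pois_pmf(k - i, mu) for i in range(k + 1))
    # RHS: the candidate CE is a function of Z, so integrate it over {Z = k}
    rhs = lam / (lam + mu) * k * pois_pmf(k, lam + mu)
    return lhs, rhs
```

Running `both_sides` for any k and any λ, µ gives matching values, in line with the derivation above.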

What if the question was not to show that, but rather to compute (without

knowing the answer in advance)? That is, what about (iii)?

The standard route is to first find the conditional distribution and then

compute the expectation for it (re how this is done in our Ex, see Tutorial

Problem 1 in PS–6).

99

OK, what is the conditional distribution?

Recall: for any event A, one has P(A) = E1A. Likewise, one defines

conditional probabilities (cf. Def [8.18]) by setting

P(A |Y ) := E(1A |Y ).

Distributions are collections of probabilities of the form PX(B) ≡ P(X ∈ B),

B ∈ B(R). And so conditional distributions are defined by letting

gB(Y ) := P(X ∈ B |Y ) := E[1(X ∈ B) |Y ],

and then setting PX|Y (B | y) := gB(y).

In fact, it’s a bit more tricky than just that. Our Thm [CE] on sl. 97 claims,

for a given RV Z, the existence of g(Y ) := E(Z |Y ) and its uniqueness up to

events of null probability. Now we have a family of RVs {gB(Y ), B ∈ B(R)}

and kind of want it to be a distribution (in B) for any outcome!

That is, for any y ∈ R, our gB(y) should be a probability (a set function in B).

100

Well, it turns out such a thing does exist. Always. And we call it the

conditional distribution of X given Y .

When Y is discrete, we just compute P(X ∈ B |Y = yi) (provided that

P(Y = yi) > 0, cf. Thm [8.37]), it’s all nice & simple.

When (X,Y ) is AC, one uses the conditional density (cf. Def [8.38])

fX|Y (x|y) := f(X,Y )(x, y)/fY (y),   where fY (y) = ∫ f(X,Y )(x, y) dx.

Then (cf. Thm [8.39])

P(X ∈ B, Y ∈ B′) = ∫_{B×B′} f(X,Y )(x, y) dxdy = ∫_{B×B′} fX|Y (x|y)fY (y) dxdy

= ∫_{B′} [∫_B fX|Y (x|y) dx] fY (y) dy = E[gB(Y ); Y ∈ B′]   (the inner integral in brackets =: gB(y)),

so that gB satisfies the def’n of conditional probability.

101

So, to compute E(X |Y ) when (X,Y ) is AC, one first finds the conditional

density fX|Y (x|y). Then one computes

g(y) := ∫ x fX|Y (x|y) dx

(sometimes denoted by E(X |Y = y)) and lets

E(X |Y ) := g(Y ).

Ex Uniform distribution on D := {(x, y) ∈ R2 : x ≥ 0, y ≥ 0, x+ y ≤ 1}.

Ex Normal distribution.
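For the first Ex above (uniform on the triangle D), the recipe gives fX|Y (x|y) = 1/(1 − y) on [0, 1 − y], hence E(X |Y = y) = (1 − y)/2. A small numerical sketch of the computation (the function name is mine, not from the slides):

```python
def cond_exp_uniform_triangle(y, n=100_000):
    """E(X | Y = y) for (X, Y) uniform on {x, y >= 0, x + y <= 1}.

    Here f_{X|Y}(x|y) = 1/(1 - y) on [0, 1 - y]; we integrate
    x * f_{X|Y}(x|y) dx by a midpoint Riemann sum.
    Exact answer: (1 - y)/2.
    """
    width = 1.0 - y
    dx = width / n
    return sum((i + 0.5) * dx * (1.0 / width) * dx for i in range(n))
```

For instance, `cond_exp_uniform_triangle(0.4)` returns a value very close to 0.3 = (1 − 0.4)/2.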

Quite often one can use the general properties of CEs to make the

computation feasible. Our exposition will be somewhat different from the one

presented in Section 8.3.2 from the text.

102

Properties of CEs

(CEP.1) Linearity: for constants a, b ∈ R,

E(aX + bZ |Y ) = aE(X |Y ) + bE(Z |Y ).

[NB: all relations of this kind, involving CEs, are understood a.s. (=almost

surely), which means they hold up to an event of zero probability.]

I Indeed, the RHS is a function of Y , so (CE.1) holds. To verify (CE.2), we

use the def’n of CE and linearity of expectation:

E[RHS ; Y ∈ B] = aE[E(X |Y ) ; Y ∈ B] + bE[E(Z |Y ) ; Y ∈ B]

= aE(X; Y ∈ B) + bE(Z; Y ∈ B) = E(aX + bZ; Y ∈ B), good

(the middle step applies (CE.2) to each of the two terms).

103

(CEP.2) Monotonicity: if X ≤ Z a.s., then also E(X |Y ) ≤ E(Z |Y ) (a.s.).

I Indeed, if it were NOT so, we would have (from linearity)

E(Z −X |Y ) = E(Z |Y )−E(X |Y ) < 0

with positive probability. The LHS is a function of Y (by def’n), say, h(Y ),

and so, for B := h−1((−∞, 0)),

{ω ∈ Ω : E(Z −X |Y ) < 0} ≡ {ω ∈ Ω : h(Y ) < 0} = {ω ∈ Ω : Y ∈ B}.

Now by (CE.2),

E(h(Y ); Y ∈ B) = E(Z −X; Y ∈ B).

Look: the integrals are over the set {Y ∈ B} of positive probability, and the

integrand on the LHS is < 0, whereas the one on the RHS is ≥ 0.

Contradiction.

104

(CEP.3) If Z = g(Y ) then

E(ZX |Y ) = ZE(X |Y ).

That is, functions of Y behave like constants when one conditions on Y .

I Indeed, the RHS is a function of Y , so (CE.1) is fine.

Re (CE.2): first consider the case Z = 1{Y ∈C} for some C ∈ B(R). Then

E(ZX;Y ∈ B) = EX1{Y ∈C}1{Y ∈B},

whereas

E[ZE(X |Y ); Y ∈ B] = E[E(X |Y ); Y ∈ C ∩B] (CE.2)= E(X; Y ∈ C ∩B),

which is the same. Next we verify the property for simple RVs, then look at

their limits. Good.

105

(CEP.4) If X is independent of Y , then

E(X |Y ) = EX. (∗)

I Indeed, (CE.1) is met as a constant (= EX) is a function of Y .

Secondly, X and 1(Y ∈ B) are also independent, and so by Cor [4.30] (sl. 74)

E(X; Y ∈ B) ≡ E[X1{Y ∈B}] = EX · E1{Y ∈B} = E[(EX)1{Y ∈B}] = E(EX; Y ∈ B),

so that (CE.2) is also met.

In particular, if Y = const, then (∗) always holds.

106

(CEP.5) The double expectation law (a.k.a. the “tower property”):

E[E(X |Y1, Y2) |Y1] = E(X |Y1).

I Oh well, (CE.1) is obvious. To show (CE.2):

E[E(X |Y1, Y2); Y1 ∈ B] = E[E(X |Y1, Y2); (Y1, Y2) ∈ B × R]

(CE.2)= E(X; (Y1, Y2) ∈ B × R) = E(X; Y1 ∈ B) (CE.2)= E[E(X |Y1); Y1 ∈ B],

bingo.

In particular, taking Y1 = const, we see that

E[E(X |Y )] = EX

(but that was obvious from (CE.2): just take B = R there).

This relation is actually a form of the total probability law, which is a very

useful thing!

107

One of the great features of CE is that, for X ∈ L2, the CE Xˆ := E(X |Y ) is

the best (in m.q.) forecast for X from Y . Let’s prove that!

For an RV Z = h(Y ), consider

E(X − Z)² = E[(X − Xˆ) + (Xˆ − Z)]²

= E[(X − Xˆ)² + 2(X − Xˆ)(Xˆ − Z) + (Xˆ − Z)²]

= E(X − Xˆ)² + 2E(X − Xˆ)(Xˆ − Z) + E(Xˆ − Z)².

Use (CEP.5) to evaluate the middle term: the mixed moment equals

EE[(X − Xˆ)(Xˆ − Z) |Y ] (CEP.3)= E[(Xˆ − Z)E[(X − Xˆ) |Y ]]

(CEP.1)= E[(Xˆ − Z)(E(X |Y )− E(Xˆ |Y ))] = 0,

since E(X |Y ) = Xˆ and, by (CEP.3), E(Xˆ |Y ) = Xˆ as well,

so E(X − Z)² = E(X − Xˆ)² + E(Xˆ − Z)², which is minimised over Z = h(Y ) at Z ≡ Xˆ: then the second (non-negative) term = 0, while the first one doesn’t depend on Z.

[Again note: can have Y ∈ Rd here. Projection interpr’n. Linear vs general f’cast.]
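On a finite probability space the CE can be computed directly by averaging X over the level sets of Y, and both the tower property and the orthogonality of the “error” can be checked numerically. A minimal sketch (the function name and the particular toy numbers are mine, not from the slides):

```python
from collections import defaultdict

def cond_exp(probs, x_vals, y_vals):
    """E(X | Y) on a finite sample space: average X over each level set of Y.

    probs, x_vals, y_vals are parallel lists over the outcomes; returns
    X-hat as a list over the same outcomes (a function of Y, as (CE.1) demands)."""
    num, den = defaultdict(float), defaultdict(float)
    for p, x, y in zip(probs, x_vals, y_vals):
        num[y] += p * x   # contribution to E[X; Y = y]
        den[y] += p       # P(Y = y)
    return [num[y] / den[y] for y in y_vals]
```

With any choice of weights one can then verify that E[Xˆ] = EX and that E[Xˆ(X − Xˆ)] = 0, i.e. the error is orthogonal to the projection.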

108

The final comment in the section on CEs:

That Xˆ = E(X |Y ) minimises the mean quadratic distance to X among all

RVs that are functions of Y also has a simple geometric interpretation.

In mathematics, an operator R : L→ L (L is a linear space, e.g. Rn or L2) is

called a projection if R2 = R, i.e. R(Rx) = Rx for any x ∈ L. [Picture.]

Let LY = space of all RVs Z ∈ L2 that are f’ns of a given RV Y : Z = h(Y ) for

some h. Then the operator R(·) = E(· |Y ) is a projection onto LY :

Xˆ = R(X) ≡ E(X |Y ) (CE.1)= h(Y ), and so R(R(X)) ≡ E(Xˆ |Y ) (CEP.3)= Xˆ.

The geometry of L2 is Euclidean, as in Rn (it’s based on a scalar product), so

shares a lot of properties, e.g. Pythagoras thm. In particular, the projection Xˆ

and the “error” X − Xˆ will be orthogonal in L2 (≡ uncorrelated):

EXˆ(X − Xˆ) (CEP.5)= EE[Xˆ(X − Xˆ)|Y ] (CEP.3)= E[Xˆ E(X − Xˆ|Y )] = 0,

since E(X − Xˆ |Y ) = E(X |Y )− E(Xˆ |Y ) = Xˆ − Xˆ = 0.

109

6. Some Applications to Statistics

First a few words on the relationship b/w Probability Theory (PT) and

Mathematical Statistics (MS).

In PT: knowing the nature of a random phenomenon, we derive the

distributions of the characteristics of the phenomenon (that one can usually

observe in the RE). [Our knowledge is built into the mathematical model of

the RE: (Ω,F ,P) etc.] One can call this a direct problem.

Ex. Knowing the composition of the general population, what can one say

about the composition of a random sample from that population?

In MS, we deal with an inverse problem, e.g. in our Ex, given the observed

composition of the random sample, what can we say about the general

population? In the general situation, we have a mathematical model of the

RE: (Ω,F ,P) etc., where P is (at least, partially) unknown, observe the value

of an RVec resulting from our RE, and then use PT to make inference

concerning P on the basis of this data.

110

Later on, we will discuss application of the key results of PT, its limit

theorems, to MS. Now we will briefly discuss an important application of the

concept of CE in the context of parameter estimation. [You must have seen

elements of that in 2nd year stats subjects — in case you did them, of course.]

Our model for observed data:

• There is an underlying RE, modelled by (Ω,F ,Pθ), where Pθ is a

probability depending on parameter θ ∈ Θ ⊂ Rd, whose value we don’t (but

want to) know. Thus we have a family of suspects P := {Pθ}θ∈Θ and need to

point at the one that would fit the observed data best (in some sense).

• We observe an RVec X = X(ω) ∈ Rn. Denote by Pθ the distribution of X

on (Rn,B(Rn)) induced by Pθ. One can often identify (Ω,F ,Pθ) with

(Rn,B(Rn), Pθ) (may be awkward if one analyses “large sample” situations,

when n→∞), and then Ω = Rn does become the sample space.

111

Sufficient Statistics

You may remember some elements of this stuff from 2nd year stats.

Def. Any (measurable) function S = S(X) of the observation X is called a

statistic in our sampling experiment.

Estimators of the unknown parameter θ are statistics θ∗ = θ∗(X) (i.e. just

functions of the sample X assuming values in the parameter space Θ),

statistical tests are statistics δ∗ = δ∗(X) (assuming values 1 and 0 when testing a single hypothesis H0 vs alternative H1: δ∗ = 1 means that we reject H0; for randomized tests, δ∗ is the probability of rejecting H0; similarly in the case of multiple hypotheses).

Clearly, a statistic S is also an RV (or RVec), and so one can talk about the

conditional distribution Pθ(X ∈ B|S), B ∈ B(Rn), cf. sl. 82.

112

Def. [R. A. Fisher, 1922] A statistic S = S(X) is called sufficient (SS) for parameter θ if the conditional distribution Pθ(X ∈ B|S) doesn’t depend on θ.ᵃ

NB: if ϕ is a 1–1 function, then S1 = ϕ(S) is also an SS for θ (cf. sl. 98).

Being a CE given S, Pθ(X ∈ B|S) is a function of S, so there exists a function P (B|s), B ∈ B(Rn), s ∈ Rd, s.t.

Pθ(X ∈ B|S) = P (B|S).

Interpretation: P (B|s) is the cond’l distribution of the sample X given S = s.

Meaning: For a given SS S, if you know that the sample point X is on the

“surface” {x ∈ Rn : S(x) = s}, any further information re where on this

surface the point X is tells you nothing about the value of θ (as the location

of the point follows then one and the same distribution for all θ) — and so

this information is irrelevant to the estimation problem. [Picture: N(0, σ2).]

ᵃMore formally: there exists a version of the conditional distribution which is independent of θ, cf. slides 102, 103. But we don’t bother about such nuances too much here.

113

Ex. Let X = (X1, . . . , Xn) be an i.i.d. sample, Xi ∼ P (λ).

Recall: for independent X ∼ P (λ) and Y ∼ P (µ), we have X + Y ∼ P (λ+ µ),

and the conditional distribution of X given X + Y = m is binomial B(m, λ/(λ+ µ)). Hence that of Xj given S := X1 + · · ·+Xn = m is B(m, 1/n).

Perhaps S is a sufficient statistic for λ? Verify: fix an integer s ≥ 0, then for

x = (x1, . . . , xn), one clearly has Pλ(X = x |S = s) = 0 if ∑_{j=1}^{n} xj ≠ s, while if the sum = s we have

Pλ(X = x |S = s) = Pλ(X = x)/Pλ(S = s) = [∏_{j=1}^{n} e^{−λ} λ^{xj}/xj !] / [e^{−nλ} (nλ)^s/s!]

= s! λ^{∑xj}/((nλ)^s ∏ xj !) = ((x1 + · · ·+ xn)!/(x1! · · ·xn!)) (1/n)^{x1} · · · (1/n)^{xn},

the multinomial distr’n with s = ∑_{j=1}^{n} xj independent trials and n equally likely outcomes. Thus S = nX̄ is an SS for λ. Good.
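The independence of λ can also be seen numerically: the conditional probability above, computed from the Poisson pmfs, gives the same value for any λ and matches the multinomial formula (a sketch; the function names are mine, not from the slides):

```python
from math import exp, factorial

def pois_pmf(k, a):
    """Poisson pmf with mean a."""
    return exp(-a) * a ** k / factorial(k)

def cond_prob_given_sum(x, lam):
    """P(X = x | S = sum(x)) for an i.i.d. Poisson(lam) sample x.

    Numerator: the joint pmf of the sample; denominator: the pmf of
    S = X1 + ... + Xn ~ P(n * lam) at s = sum(x)."""
    n, s = len(x), sum(x)
    joint = 1.0
    for xi in x:
        joint *= pois_pmf(xi, lam)
    return joint / pois_pmf(s, n * lam)
```

Evaluating this for two very different values of λ returns the same number, namely the multinomial probability s!/(∏ xj!) · (1/n)^s.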

114

NB: What a great reduction of data! It turns out that all the information

about θ contained in the whole sample is “stored” in a single value, that of S.

More on that later. Now: how to find SSs?

The most natural formulation of the main result here uses the concept of

density.

Recall: a distribution P on Rn has a density f if, for any Borel set B ⊂ Rn,

P (B) = ∫_B f(x) dx ≡ ∫ · · · ∫ f(x)1B(x) dx1 · · · dxn. (∗)

But we also introduced integrals of the more general form (sl. 68, 69): ∫_B f(x) dQ(x), where Q is a probability.

Basically the same def’n works for integrals w.r.t. measures (see sl. 17) as

well, rather than just probability measures, e.g. Lebesgue measure (= length

on R, area on R2, volume on Rn, n > 2), as in (∗), or counting measure (see

sl. 18).

115

Thus, if X ∼ P is a discrete RVec taking values in a countable set

C = {xi}i≥1, p(x) := P(X = x), and µ is the counting measure on C (i.e.

µ(B) = #{x ∈ B ∩ C}), then one has

P(X ∈ B) ≡ P (B) = ∫_B dP (x) = ∫_B p(x) dµ(x) = ∑_{x∈B∩C} p(x) ≡ ∑_{i:xi∈B} p(xi).

In this case, one uses the notation p(x) = dP/dµ (x) and says that P is AC w.r.t. µ and that p is the density of P w.r.t. µ.

So the concept of AC is relative; when used without any further explanations, it always means that P is AC w.r.t. the Lebesgue measure, i.e. (∗) from sl. 116 holds. But one often needs to use some other measures µ.

NB: So it turns out that discrete distributions are actually AC w.r.t. the

counting measures on their supports!

Now we are ready to formulate our main result here.

116

Thm [Neyman–Fisher (NF) factorisation.] Suppose all Pθ are AC w.r.t.

some measure µ, with densities fθ(x) = dPθ/dµ (x). A necessary and sufficient condition for the statistic S to be an SS for θ is that, for some functions ψ(s, θ) and h(x),

fθ(x) = ψ(S(x), θ)h(x). (∗)

If X = (X1, . . . , Xn) is an i.i.d. sample, Xj having a density fθ(x), then

fθ(x) ≡ ∏_{j=1}^{n} fθ(xj) = ψ(S(x), θ)h(x).

NB: Factorisation (∗) is not unique, of course: say, the pair

ψ1(s, θ) := e^s ψ(s, θ), h1(x) := e^{−S(x)} h(x)

would be OK, too!

117

Ex. (cont’d) In our Poisson example, the distribution is on the integers, with the density fλ(x) = e^{−λ} λ^x/x! w.r.t. the counting measure on {0, 1, 2, . . .}. So the likelihood function has the form

fλ(x) ≡ ∏_{j=1}^{n} fλ(xj) = ∏_{j=1}^{n} e^{−λ} λ^{xj}/xj ! = e^{−nλ} λ^{∑_{j=1}^{n} xj} / ∏_{j=1}^{n} xj ! = ψ(S(x), λ)h(x)

with

ψ(s, λ) = e^{−nλ} λ^s,   S(x) = ∑_{j=1}^{n} xj ,   h(x) = 1/∏_{j=1}^{n} xj !.

Therefore, according to the NF Thm, S(X) := ∑_{j=1}^{n} Xj is an SS for λ (but we have already proved that directly anyway).

118

I We will prove the NF Thm in the discrete case only; in the general case,

the same idea is used (but the argument becomes more technical).

Thus we assume that Pθ(X ∈ C) = 1, ∀θ ∈ Θ, for a countable set C, µ is the

counting measure on C, fθ(x) = Pθ(X = x) for x ∈ C.

⇐) Suppose that factorisation (∗) (sl. 118) takes place. Then, for x ∈ C, s = S(x) (otherwise it’s trivial: zeros!),

Pθ(X = x |S(X) = s) = Pθ(X = x, S(X) = S(x))/Pθ(S(X) = s) = Pθ(X = x)/Pθ(S(X) = s)

= fθ(x)/∑_{y∈C:S(y)=s} fθ(y) (∗)= ψ(S(x), θ)h(x)/∑_{y∈C:S(y)=s} ψ(S(y), θ)h(y) = h(x)/∑_{y∈C:S(y)=s} h(y),

which doesn’t depend on θ, hence S is an SS for θ!

119

⇒) Now assume that S is an SS for θ. Then, for x ∈ C, s = S(x) (otherwise

it’s trivial: zeros!),

Pθ(X = x |S(X) = s) =: h(x)

doesn’t depend on θ. Therefore

fθ(x) ≡ Pθ(X = x) = Pθ(X = x, S(X) = S(x))

= Pθ(X = x |S(X) = S(x)) · Pθ(S(X) = S(x)) = h(x)ψ(S(x), θ),

where ψ(S(x), θ) := Pθ(S(X) = S(x)). Bingo!!

120

Ex. When X = (X1, . . . , Xn), where Xj ∼ N(µ, σ²), the parameter θ := (µ, σ²) is 2-dim (NB: here µ ∈ R, it’s not a measure, just a number). The likelihood function has the form:

fθ(x) = ∏_{j=1}^{n} fθ(xj) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) ∑_{j=1}^{n} (xj − µ)²}

= (2πσ²)^{−n/2} exp{−(1/(2σ²))[∑ xj² − 2µ ∑ xj + nµ²]}   [with S2 := ∑ xj², S1 := ∑ xj ]

= (2πσ²)^{−n/2} exp{−(1/(2σ²))[S2 − 2µS1 + nµ²]} = ψ(S, θ)h(x),

where S = (S1, S2), h(x) ≡ 1 (but we could take h(x) ≡ (2π)^{−n/2} as well).

Therefore (S1, S2) is an SS for θ = (µ, σ²): of all the information contained in the sample X, we only need two numbers, (S1, S2), for estimation of (µ, σ²).

121

Ex. Suppose X = (X1, . . . , Xn), where Xj ∼ U [0, θ]. This is an AC distr’n,

fθ(x) = θ^{−1} if x ∈ [0, θ], and 0 otherwise.

Hence, using x(1) = min_{j≤n} xj and x(n) = max_{j≤n} xj , the likelihood function is

fθ(x) = ∏_{j=1}^{n} fθ(xj) = θ^{−n} if xj ∈ [0, θ] for all j ≤ n, and 0 otherwise

= θ^{−n} 1(0 ≤ x(1), x(n) ≤ θ) = θ^{−n} 1{x(n)≤θ} × 1{x(1)≥0} =: ψ(S(x), θ)h(x).

Therefore S(X) := X(n) is an SS for θ.
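The factorisation can be coded up directly, which also makes the sufficiency visible: samples of the same size with the same maximum receive identical likelihoods for every θ (a sketch; the function name is mine, not from the slides):

```python
def uniform_lik(x, theta):
    """Likelihood of an i.i.d. U[0, theta] sample via the factorisation
    psi(S, theta) * h(x) with S = max(x) = x_(n)."""
    n = len(x)
    s = max(x)
    psi = theta ** (-n) if s <= theta else 0.0   # psi(S, theta) = theta^{-n} 1{x_(n) <= theta}
    h = 1.0 if min(x) >= 0 else 0.0              # h(x) = 1{x_(1) >= 0}
    return psi * h
```

Note how the θ-dependent factor looks at the data only through S = x(n); this is exactly what the NF theorem exploits.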

Cor [of NF Thm] If T is a statistic, ϕ a function, and S := ϕ(T ) is an SS

for θ, then T is also an SS for θ.

I Obvious from NF Thm. In fact, the “best” SSs are the minimal ones

(that are f’ns of all the others): they can be used to get the best estimators.

122

Recall the following important concept.

Def. θˆ∗ = θˆ∗(X) := arg maxθ∈Θ fθ(X) is called the maximum likelihood

estimator (MLE) of θ from X.

This statistic can be a (very) good estimator for θ and possesses some nice

properties you may be familiar with. At the moment, we observe this:

Cor [of NF Thm] If S is an SS for θ, then the MLE θˆ∗ is a function of S

only (no further information from the sample X is needed).

I Indeed, from the NF Thm,

max_θ fθ(X) = max_θ [ψ(S(X), θ)h(X)] = h(X) max_θ ψ(S(X), θ),

so the value of θ at which the max is attained depends on S(X) only.

Moreover, if S is an SS for θ, then all the Bayesian estimators (rings a bell?)

are functions of S only etc.

123

How SSs can improve estimators’ efficiency

Q: How to measure performance of estimators θ∗ of θ? First consider the case

when Θ ⊂ R. The standard mean quadratic error approach:

Eθ(θ∗ − θ)² −→ min over θ∗. (∗)

Q: Does there exist an estimator θ∗0 minimising the LHS of (∗) for all θ?

A: In non-trivial cases, the answer is negative. Indeed, suppose that there exists such an estimator θ∗0 . Fix a θ1 ∈ Θ and take θ∗1 ≡ θ1. Then

Eθ(θ∗1 − θ)² = (θ1 − θ)²,

which turns into zero for θ = θ1. But we know that θ∗0 is the best estimator, so we must also have

Eθ1(θ∗0 − θ1)² = 0.

And this holds for any θ1 ∈ Θ!! This is only possible when the observation X uniquely determines the value of θ (e.g. when Xi ∼ εθ, Θ = R, or Xi ∼ U [θ, θ + 1], Θ = Z).

124

So one compares performance of estimators within reasonable classes, e.g.

unbiased estimators.

Def. An estimator θ∗0 = θ∗0(X) from a class K of estimators of θ is called efficient in K if, for any θ∗ ∈ K,

Eθ(θ∗0 − θ)² ≤ Eθ(θ∗ − θ)², ∀θ ∈ Θ.

Ex (of an important class of estimators). For a function b = b(θ), θ ∈ Θ, let

Kb = {θ∗ : Eθθ∗ = θ + b(θ), ∀θ ∈ Θ}

be the class of all estimators with the bias b(θ).

In particular, K0 is the class of all unbiased estimators. Estimators efficient in K0 are called simply efficient.

BTW: unbiasedness is a desirable (but not necessary) property of a good estimator. Unbiased estimators don’t need to exist (e.g. if X ∼ B(ϕ(θ)) for some f’n ϕ, then θ∗ ∈ K0 ⇔ Eθθ∗ ≡ θ∗(0)(1− ϕ(θ)) + θ∗(1)ϕ(θ) ?= θ, ∀θ ∈ Θ).

125

Thm. An estimator efficient in Kb (if it exists) is unique (modulo its values on a subset A of the sample space s.t. Pθ(A) = 0, ∀θ ∈ Θ).

I Suppose that both θ∗1 and θ∗2 are efficient in Kb:

Eθ(θ∗i − θ)² = min_{θ∗∈Kb} Eθ(θ∗ − θ)² =: Rθ, ∀θ ∈ Θ, i = 1, 2.

Clearly, θ∗0 := (θ∗1 + θ∗2)/2 ∈ Kb and, using ((a1 + a2)/2)² + ((a1 − a2)/2)² = (a1² + a2²)/2 with ai = θ∗i − θ, i = 1, 2, and taking Eθ’s, we obtain

Eθ(θ∗0 − θ)² + (1/4)Eθ(θ∗1 − θ∗2)² = Rθ, where Eθ(θ∗0 − θ)² ≥ Rθ ⇒ Eθ(θ∗1 − θ∗2)² ≤ 0, hence = 0.

As we showed on slides 85 & 86, this means that Pθ(θ∗1 − θ∗2 = 0) = 1. Bingo.

[NB: This assertion has a transparent geometric interpretation. Picture.]

126

Thm [Rao-Blackwell (RB)] Let θ∗ ∈ Kb, S be an SS for θ. Then the CE θ∗S := Eθ(θ∗ |S) has the following properties:

(i) θ∗S is a function of S only (and hence is a statistic);

(ii) θ∗S ∈ Kb;

(iii) Eθ(θ∗S − θ)² ≤ Eθ(θ∗ − θ)², ∀θ ∈ Θ, where “=” holds iff Pθ(θ∗S = θ∗) = 1.

Thus applying Eθ(· |S) to an estimator θ∗ improves it!

I First observe that

θ∗S = ∫ θ∗(x)Pθ(X ∈ dx |S) = ∫ θ∗(x)P (dx |S)   [the last step as S is an SS]

doesn’t depend on θ and is a function of S = S(X) only. Hence θ∗S is a statistic and so can be used as an estimator. [It wouldn’t be so if S were NOT an SS!]

This proves (i). To prove (ii), note:

Eθθ∗S = EθEθ(θ∗ |S) (CEP.5)= Eθθ∗ = θ + b(θ), so that θ∗S ∈ Kb indeed.

127

To demonstrate (iii), we do this:

Eθ(θ∗ − θ)² = Eθ((θ∗ − θ∗S) + (θ∗S − θ))²

= Eθ(θ∗ − θ∗S)² + Eθ(θ∗S − θ)² + 2Eθ(θ∗ − θ∗S)(θ∗S − θ) ≥ Eθ(θ∗S − θ)²,

as the first term is ≥ 0 and the cross term = 0 (see bottom of sl. 110).

It remains to note that “=” holds here iff Eθ(θ∗ − θ∗S)² = 0, but, as we proved on slides 85 & 86, the latter is equivalent to Pθ(θ∗ − θ∗S = 0) = 1. Bingo.

NB: Recall Cor from sl. 123: S = ϕ(T ), S is an SS =⇒ T is an SS, too. Using the same argument, one can show that then

Eθ(θ∗S − θ)² ≤ Eθ(θ∗T − θ)².

That is, the “smaller” the conditioning SS is, the more efficient the result!

128

Ex. Consider an i.i.d. sample X = (X1, . . . , Xn), Xj ∼ P (λ), and λ∗ = X1. As we know, Eλλ∗ = EλX1 = λ, so that λ∗ ∈ K0, and

Eλ(λ∗ − λ)² = Var λ(X1) = λ.

We also know (sl. 115) that S := ∑_{j=1}^{n} Xj is an SS for λ, so we can form the estimator

λ∗S = Eλ(λ∗ |S) = Eλ(X1 |S) = S/n ≡ X̄,

since the conditional distr’n of X1 given S = m is binomial B(m, 1/n) (cf. Tutorial Problem 1, PS-6).

So

Eλ(λ∗S − λ)² = Var λ(X̄) = λ/n,

which is much better than for the original λ∗!

One can show that λ∗S is actually efficient (i.e., it has the smallest mean quadratic error among all unbiased estimators).
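A Monte Carlo illustration of the Rao-Blackwell improvement in this example (a sketch; the Poisson sampler and all names are mine, not from the slides):

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's method: count uniforms until their product falls below e^{-lam}."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

def compare_estimators(lam=4.0, n=10, trials=2000, seed=1):
    """Monte Carlo MSEs of lam* = X1 and of lam*_S = X-bar = E(X1 | S)."""
    rng = random.Random(seed)
    se_x1 = se_xbar = 0.0
    for _ in range(trials):
        x = [poisson_sample(lam, rng) for _ in range(n)]
        se_x1 += (x[0] - lam) ** 2
        se_xbar += (sum(x) / n - lam) ** 2
    return se_x1 / trials, se_xbar / trials
```

With the defaults, the first MSE hovers around λ = 4 and the second around λ/n = 0.4, matching the calculation above.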

129

Q: What do we do when θ ∈ Rd?

A: A possible way to evaluate the performance of estimators is to compare mean quadratic dispersions

Eθ(θ∗ − θ, a)², a ∈ Rd,

(·, ·) being the scalar product, and prefer θ∗1 to θ∗2 if the former has smaller dispersion for all a. That is, one looks at the projections of θ∗ − θ onto different directions a.

Thm [multivariate Rao-Blackwell] As in the univariate case, except for (iii), which is replaced with:

(iii′) Eθ(θ∗S − θ, a)² ≤ Eθ(θ∗ − θ, a)², ∀θ ∈ Θ, ∀a ∈ Rd, where “=” holds for all a ∈ Rd iff Pθ(θ∗S = θ∗) = 1.

The same proof: just do it for the univariate parameters (θ, a) and estimators (θ∗, a) for them. Good.

130

Q: How far can one go along the path paved by RB Thm?

A: When there exists a complete SS for θ, one can go till the very end: in that

case, for a θ∗ ∈ Kb, the estimator θ∗S will be efficient in Kb. But this is from

another book. . .

131

7. Convergence of Random Variables

Modes of Convergence

Recall this: Let {xn}n≥1 be a sequence in R (or Rd). We say that xn converge

to x as n→∞, and write

xn → x as n→∞, or lim_{n→∞} xn = x,

if, ∀ε > 0, ∃nε <∞ s.t. one has

|xn − x| < ε for all n ≥ nε.

That is, for an arbitrarily small neighbourhood U of x, the xn’s should be in

U for all large enough n.

Now what might Xn → X mean when Xn = Xn(ω) are RVs? They are

functions, not numbers! [Picture.]

132

In Probability Theory (as in other areas of mathematics), one considers

several modes of convergence of RVs (functions). They are different, used in

different contexts, some more often than others. We will give general

definitions and briefly describe relationships between some of them.

Def. [5.1] Convergence almost surely (a.s.), or w.p. 1:

Xn a.s.−→ X as n→∞

if there is an event A with P(A) = 1 s.t. ∀ω ∈ A, Xn(ω)→ X(ω) as n→∞.

In other words, it’s point-wise convergence on a set of probability 1.

Def. [5.2] Convergence in probability:

Xn P−→ X as n→∞

if ∀ε > 0, P(|Xn −X| > ε)→ 0 as n→∞.

133

Def. [5.3] Quadratic mean convergence:

Xn L2−→ X as n→∞

if Xn, X ∈ L2 and E(Xn −X)² → 0 as n→∞.

In fact, this is convergence of elements of the space L2 of square-integrable

RVs (on a common (Ω,F ,P)) in its “native norm” (the one corresponding to

the scalar product (X,Y ) := EXY ).

Def. [5.4] Convergence in mean:

Xn L1−→ X as n→∞

if Xn, X ∈ L1 and E|Xn −X| → 0 as n→∞.

The last two modes (5.3, 5.4) are particularly popular in engineering

applications (one of the reasons being that calculating moments is a relatively

simple task).

134

All four are modes of convergence of sequences of RVs, given on a common

probability space. However, the most important from the applications’

viewpoint is convergence in distribution (a.k.a. “weak convergence of

distributions”), which doesn’t care where and how the RVs are defined — only

the distributions of Xn matter.

Def. [5.5] Convergence in distribution:

Xn d−→ X as n→∞

if lim_{n→∞} FXn(t) = FX(t) at all points t ∈ R where FX(t) is continuous, i.e. such t that FX(t−) = FX(t) ⇐⇒ P(X = t) = 0.

NB: Why this restriction to continuity points of FX? Because it makes sense

(and perfectly agrees with an alternative, more natural def’n, see Thm [5.8]

below). Consider Xn ≡ 1/n, X ≡ 0. [Picture: DFs.] Then FXn(t)→ FX(t) at

all t 6= 0, whereas FX(0) = 1 6= 0 = FXn(0).

Do we want to exclude such a situation? Don’t the distributions converge?

The def’n allows the jump points to “move” & also emerge in the limit.

135

The key argument for the above def’n of “d−→” is actually the following

alternative def’n (which works not only for RVs and RVecs, but also in much

more general cases).

Thm [5.8] Xn d−→ X as n→∞ iff, for any continuous bounded function f,

Ef(Xn)→ Ef(X) as n→∞.

It is this property that makes convergence in distribution so useful &

important.

I The proof is somewhat technical, let’s leave it for the future. Just note: for “nice” f (e.g. differentiable and vanishing outside a bounded interval), one can integrate by parts to obtain ∫ f(x) dFXn(x) = −∫ FXn(x) df(x), so if the FXn(x) converge to F (x) everywhere (perhaps except for an at most countable set where F has jumps), then. . . Makes sense.

136

Ex. If X,Xj ∈ N, j ≥ 1, then Xn d−→ X iff ∀k ∈ N,

P(Xn = k)→ P(X = k), n→∞.

Indeed,

P(Xn = k) = P(Xn ≤ k)−P(Xn ≤ k − 1) = FXn(k)− FXn(k − 1) = FXn(k + s)− FXn(k − 1 + s) for any s ∈ (0, 1),

[picture!] and also, for any s ∈ (0, 1),

FXn(k + s) = ∑_{j=1}^{k} P(Xn = j).

The assertion follows.

137

Ex. Let Yn be uniformly distr’d on {0, 1, . . . , n}: P(Yn = k) = 1/(n+ 1), 0 ≤ k ≤ n. Prove that Xn := Yn/n d−→ X ∼ U [0, 1] as n→∞.

I The limiting DF F (x) = x1(x ∈ [0, 1]) + 1(x > 1) is continuous, so we have to prove: ∀x ∈ R, FXn(x)→ F (x) as n→∞. As P(Yn ≤ k) = (k + 1)/(n+ 1), 0 ≤ k ≤ n,

FXn(x) = P(Yn ≤ nx) = 0 for x < 0; = P(Yn ≤ ⌊nx⌋) = (⌊nx⌋+ 1)/(n+ 1) for x ∈ [0, 1]; = 1 for x > 1.

But (⌊nx⌋+ 1)/(n+ 1)→ x as n→∞:

|(⌊nx⌋+ 1)/(n+ 1)− x| = |⌊nx⌋+ 1− (n+ 1)x|/(n+ 1) ≤ (|⌊nx⌋ − nx|+ 1− x)/(n+ 1) ≤ 2/(n+ 1).

BTW, what about the alternative def’n (Thm [5.8])? Look: Ef(Xn) = ∑_{k=0}^{n} f(k/n) · 1/(n+ 1)→ ∫_0^1 f(x) dx = Ef(X), so it’s OK!!
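The bound |FXn(x) − x| ≤ 2/(n + 1) from the display above is easy to check numerically (a sketch; the function name is mine, not from the slides):

```python
from math import floor

def df_xn(x, n):
    """DF of Xn = Yn/n with Yn uniform on {0, 1, ..., n}.

    For x in [0, 1] this equals (floor(n*x) + 1)/(n + 1); note that
    at x = 1 the formula already gives (n + 1)/(n + 1) = 1."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return (floor(n * x) + 1) / (n + 1)
```

For large n the values `df_xn(x, n)` sit within 2/(n + 1) of x on [0, 1], exactly as the estimate says.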

138

Some Relationships among the Modes

a.s.−→  ⇒ (always)  P−→  ⇒ (always)  d−→

L2−→  ⇒ (always)  P−→,   and   P−→  ⇒ (under integrability conditions)  L2−→

[There are some further “restricted” implications, which we didn’t mention.]

Why do these implications hold?

139

1) a.s.−→ ⇒ P−→

I Assume that Xn a.s.−→ X, i.e. Xn(ω)→ X(ω) for all ω ∈ A, where P(A) = 1. Fix an arbitrary ε > 0 and set An := {|Xn −X| > ε}. Clearly we cannot have An i.o. for an ω ∈ A, so that

Ac ⊃ [An i.o.] ≡ ⋂_k ⋃_{n≥k} An   (cf. sl. 11),

and hence

0 = P(Ac) = P(⋂_k ⋃_{n≥k} An) = lim_{k→∞} P(⋃_{n≥k} An)   [by continuity of P],

so that, as k →∞,

0 ← P(⋃_{n≥k} An) ≥ P(Ak) = P(|Xk −X| > ε),

which means that Xn P−→ X. Bingo.

140

2) P−→ ⇒ L2−→ under integrability conditions

I This is a bit technical, uses integration theory. Leave it for the future.

3) L2−→ ⇒ P−→ always

I This immediately follows from Markov’s inequality (Thm [4.40], sl. 79):

P(|Xn −X| > ε) ≤ E|Xn −X|²/ε².

4) P−→ ⇒ d−→

This one we can prove! Better do that on the next slide.

141

I Let t be a continuity point of FX , ε > 0 be fixed. Then

FXn(t) = P(Xn ≤ t) = P(Xn ≤ t, |Xn −X| ≤ ε) + P(Xn ≤ t, |Xn −X| > ε)

≤ P(X ≤ t+ ε) + P(|Xn −X| > ε) = FX(t+ ε) + P(|Xn −X| > ε),

using {Xn ≤ t, |Xn −X| ≤ ε} ⊂ {X ≤ t+ ε}, and similarly

FX(t− ε) ≤ FXn(t) + P(|Xn −X| > ε).

That is,

FX(t− ε)−P(|Xn −X| > ε) ≤ FXn(t) ≤ FX(t+ ε) + P(|Xn −X| > ε).

As ε > 0 is arbitrarily small, we can choose it so that FX(t± ε) will be arbitrarily close to FX(t) (we can do that as FX is continuous at t), and P(|Xn −X| > ε)→ 0 as n→∞. So what happens to FXn(t) then? Bingo.

142

Examples and Counterexamples

Consider Ω = [0, 1] with P = U [0, 1]. [Pictures!]

1) Let Xn := n1(0,1/n), X ≡ 0. Then:

Xn a.s.−→ X, and hence also P−→, d−→; but neither Xn L2−→ X nor Xn L1−→ X, since E|Xn −X| = EXn = 1 does not tend to 0.

2) Let Xn := √n 1(0,1/n), X ≡ 0. Then we’ll have the same relations, except for having now Xn L1−→ X, since E|Xn −X| = EXn = 1/√n → 0.

3) Let X ≡ 0,

X1 := 1[0,1/2), X2 := 1[1/2,1),

X3 := 1[0,1/3), X4 := 1[1/3,2/3), X5 := 1[2/3,1),

X6 := 1[0,1/4), X7 := 1[1/4,2/4), X8 := 1[2/4,3/4), X9 := 1[3/4,1),

etc. Then Xn does not converge to X a.s., but P−→, d−→, L2−→, L1−→ all hold.

143

4) Let X ≡ 0,

X1 := 2·1[0,1/2), X2 := 2·1[1/2,1),

X3 := 3·1[0,1/3), X4 := 3·1[1/3,2/3), X5 := 3·1[2/3,1),

X6 := 4·1[0,1/4), X7 := 4·1[1/4,2/4), X8 := 4·1[2/4,3/4), X9 := 4·1[3/4,1),

etc. Then Xn does not converge to X a.s., but P−→, d−→ hold, while L1−→ fails since E|Xn −X| = EXn = 1 does not tend to 0 (and hence L2−→ fails too).

5) If Xn := 1[0, 1/2 + 1/n], X := 1[1/2, 1], then Xn d−→ X, but all other →’s fail.

6) If Xn ∼ U [1/2 − 1/n, 1/2 + 1/n], X ≡ 1/2, then Xn d−→ X, even though FXn(1/2) = 1/2 does not tend to 1 = FX(1/2). Can we assert P−→? What about L2−→ and a.s.−→?

144

Convergence under Transformations

1) As for the “usual convergence” of sequences in R (or Rd),

if, for two sequences of RVs {Xn} and {Yn} given on a common probability

space, Xn

a.s.−→ X and Yn a.s.−→ Y as n→∞, then also Xn + Yn a.s.−→ X + Y .

(Thm [5.19])

Indeed, we know that {Xn} converges on an event A with P(A) = 1 and {Yn}

converges on an event B with P(B) = 1. Clearly, P(A ∩B) = 1 and both

sequences converge on the event A ∩B.

[BTW: What about the products XnYn?]

The same applies to P−→: as |(Xn + Yn)− (X + Y )| ≤ |Xn −X|+ |Yn − Y |,

P(|(Xn + Yn)− (X + Y )| > ε) ≤ P(|Xn −X| > ε/2) + P(|Yn − Y | > ε/2).

[BTW: What about the products XnYn?]

145

2) If Xn L2−→ X and Yn L2−→ Y as n→∞, then also Xn + Yn L2−→ X + Y .

This follows from the triangle inequality for the L2-norm, which follows from the Cauchy-Bunyakovsky inequality (Cor [4.36], sl. 81).

But for the products we can only claim that XnYn L1−→ XY ; we don’t even know if XnYn ∈ L2, but that it’s in L1 follows from the Cauchy-Bunyakovsky inequality! The L1−→ claim follows from

XnYn −XY = XnYn −XnY +XnY −XY = Xn(Yn − Y ) + (Xn −X)Y

and the Cauchy-Bunyakovsky inequality (once again).

146

3) Of course, if we only know that Xn d−→ X and Yn d−→ Y as n→∞, then even asking if Xn + Yn d−→ X + Y is meaningless: Xn and Yn may be defined on different probability spaces, so we cannot add them!

4) But we have this important general result:

Thm [5.23] Let g : R→ R be a continuous function. Then, as n→∞,

a) if Xn a.s.−→ X then g(Xn) a.s.−→ g(X),

b) if Xn P−→ X then g(Xn) P−→ g(X),

c) if Xn d−→ X then g(Xn) d−→ g(X).

This theorem holds for RVecs in Rd as well, and beyond.

I a) is obvious.

b) This we could prove as in the text, but it’s more instructive to go another way, based on this fundamental result from Real Analysis: any function g continuous on a closed bounded interval [a, b] is uniformly continuous there.

147

Uniform continuity: ∀ε > 0, ∃δ > 0 s.t. [Pictures & examples.]

x, y ∈ [a, b], |x− y| ≤ δ =⇒ |g(x)− g(y)| < ε.

Fix an ε > 0, let An,ε := {|g(Xn)− g(X)| > ε}. Then, ∀N > 0, with IN := [−N,N ],

P(An,ε) = P(An,ε; X ∈ IN ) + P(An,ε; X ∉ IN )

≤ P(An,ε; X ∈ IN ) + P(X ∉ IN )

≤ P(An,ε; X ∈ IN , |Xn −X| ≤ δ) + P(|Xn −X| > δ) + P(X ∉ IN ) =: P1 + P2 + P3.

Now, for an arbitrarily small η > 0, we can choose N so large that P3 < η/2. Next, since g is uniformly continuous on [−N − 1, N + 1], we can choose δ < 1 so small that if x, y ∈ [−N − 1, N + 1], |x− y| ≤ δ, then |g(x)− g(y)| < ε. But this yields P1 = 0.

Finally, as Xn P−→ X, for all large enough n we have P(|Xn −X| > δ) < η/2, and that will imply P(An,ε) < η. Bingo for b).

148

c) is obvious from Thm [5.8]: we have to show that Yn := g(Xn) d−→ Y := g(X). But since, for a bounded continuous f, the composition (f ◦ g)(x) := f(g(x)) is also bounded (as f is bounded) and continuous (as g is continuous and hence f ◦ g is), we have

Ef(Yn) = E(f ◦ g)(Xn) → E(f ◦ g)(X) = Ef(Y).

Total bingo.

149

Now we turn to the convergence results mentioned on sl. 66.

Thm [4.9] (Monotone Convergence Theorem.) If Xn ≥ 0 are RVs on a common probability space and Xn ↑ X a.s. as n→∞, then EXn ↑ EX.

I For any n ≥ 1, there exists a sequence of simple RVs Xn^(k) ↑ Xn as k→∞, and for them EXn^(k) ↑ EXn (cf. Def [4.4], Prpn [4.5]).

Now X^(k) := max_{n≤k} Xn^(k), k ≥ 1, are clearly also simple RVs, and

X^(k−1) ≤ X^(k) ≤ Xk, k ≥ 1.

By monotonicity, there exists the a.s. limit lim_{k→∞} X^(k) =: Y.

Since, for any n ≥ 1, one has (as k→∞)

Xn^(k) ≤ X^(k) ≤ Xk,  with Xn^(k) ↑ Xn, X^(k) ↑ Y, Xk ↑ X

a.s., one concludes that Xn ≤ Y ≤ X a.s. for any n ≥ 1.

150

Hence Y = X a.s. (recall that Xn ↑ X a.s. as n→∞).

Therefore, for the simple RVs X^(k), one has X^(k) ↑ X a.s. Hence by Prpn [4.5] one has the last relation (as k→∞) in the next line:

EX ≥ EXk ≥ EX^(k) ↑ EX.

We conclude that also EXk ↑ EX. Bingo!

151

Thm [4.8] (Fatou's Lemma.) If Xn ≥ 0 are RVs on a common probability space then

E lim inf_{n→∞} Xn ≤ lim inf_{n→∞} EXn.

Recall: lim inf_{n→∞} xn = lim_{n→∞} inf_{k≥n} xk is the least partial limit of the sequence {xn}_{n≥1} ⊂ R.

I Set X := lim inf_{n→∞} Xn ≡ lim_{n→∞} Yn, where Yn := inf_{m≥n} Xm ↑ X as n→∞, 0 ≤ Yn ≤ Xn. By the Monotone Convergence Theorem,

EX = lim_{n→∞} EYn = lim inf_{n→∞} EYn ≤ lim inf_{n→∞} EXn.

Bingo!

152
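An aside (not in the slides): the inequality in Fatou's Lemma can be strict. A minimal numerical sketch in Python, under P = U[0,1], with Xn(ω) = n·1(0 < ω < 1/n): each EXn = 1, yet Xn → 0 pointwise, so E lim inf Xn = 0 < 1 = lim inf EXn.

```python
# Illustration of strictness in Fatou's Lemma under P = U[0,1]:
# X_n(w) = n * 1(0 < w < 1/n), so X_n -> 0 pointwise on (0,1],
# while E X_n = n * (1/n) = 1 for every n.

def X(n, w):
    """X_n(w) = n on (0, 1/n), 0 elsewhere."""
    return n if 0 < w < 1.0 / n else 0.0

def EX(n, grid=10**5):
    """E X_n under U[0,1], via a midpoint Riemann sum (grid size is arbitrary)."""
    h = 1.0 / grid
    return sum(X(n, (i + 0.5) * h) for i in range(grid)) * h

expectations = [EX(n) for n in (2, 10, 100)]      # each should be ~1.0
pointwise_limit_at_half = X(10**9, 0.5)           # X_n(0.5) = 0 for all n >= 2
```

So lim inf EXn = 1 while E lim inf Xn = 0: Fatou only promises "≤", and here it is strict.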

Thm [4.16] (Dominated Convergence Theorem.) If |Xn| ≤ c < ∞ a.s. for any n ≥ 1, and Xn → X a.s. as n→∞, then there exists lim_{n→∞} EXn = EX.

Note that the first condition can be replaced with: |Xn| ≤ Y a.s. for any n ≥ 1, where EY < ∞. Check the proof below!

I Since Xn + c ≥ 0 and c − Xn ≥ 0 for any n ≥ 1, by Fatou's lemma one has

E lim inf_{n→∞} Xn ≤ lim inf_{n→∞} EXn,   E lim sup_{n→∞} Xn ≥ lim sup_{n→∞} EXn.

But lim inf_{n→∞} Xn = lim sup_{n→∞} Xn = lim_{n→∞} Xn = X by assumption, so the LHS's in the above formulae coincide with each other and with EX. Hence

lim sup_{n→∞} EXn ≤ EX ≤ lim inf_{n→∞} EXn,

which can only hold if the limit lim_{n→∞} EXn exists and equals EX.

Bingo!

153

Our First Limit Theorems: Sums of Bernoulli^a RVs

Suppose {Xn} is an i.i.d. sequence of B(p)-RVs,

P(Xj = 1) = 1 − P(Xj = 0) = p ∈ (0, 1),  q := 1 − p.

Recall: for Sn := X1 + · · · + Xn,

P(Sn = k) = C(n, k) p^k q^(n−k),  k = 0, 1, . . . , n.

Laws of Large Numbers (LLNs)

Thm [5.30] (Weak LLN.)

Sn

n

P−→ p as n→∞.

I As L

2

−→ implies P−→, we only need to prove the former, which is obvious:

E

(

Sn

n

− p

)2

=

E(Sn − np)2

n2

=

Var (Sn)

n2

=

nVar (X1)

n2

=

npq

n2

=

pq

n

→ 0.

aNamed after Jacob Bernoulli (1654–1705), whose Ars Conjectandi (1713) contained the

first proof of the WLLN.

154
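A quick Monte Carlo sanity check of the WLLN just proved (an illustration, not part of the slides): the relative frequency Sn/n concentrates around p as n grows. The seed and sample sizes are arbitrary choices for reproducibility.

```python
# WLLN for Bernoulli sums: S_n/n should be close to p for large n.
import random

random.seed(0)
p = 0.3

def rel_freq(n):
    """S_n / n for n i.i.d. B(p) trials."""
    return sum(random.random() < p for _ in range(n)) / n

devs = {n: abs(rel_freq(n) - p) for n in (100, 10_000, 1_000_000)}
```

By the proof above, the L²-error is pq/n, so the typical deviation at n = 10^6 is about √(0.21/10^6) ≈ 5·10^−4.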

Thm [5.31] (Strong LLN.) Sn/n a.s.−→ p as n→∞.

I For ε > 0, set An(ε) := {|Sn/n − p| > ε}. The main task: show that An(ε) occurs i.o. w.p. 0. [Indeed, then ⋃_k [An(1/k) i.o.] also has probability 0, and we are done.]

Refer to Thm [1.27] (Borel–Cantelli) on sl. 26: for this it suffices to show that

∑_{n=1}^∞ P(An(ε)) < ∞.

Show that: by Markov's inequality (Thm [4.40], sl. 79),

P(An(ε)) = P(|Sn − np| > nε) ≤ E(Sn − np)^4 / (n^4 ε^4).

If we show that E(Sn − np)^4 ≤ cn^2, we are done: then P(An(ε)) ≤ c1 n^{−2}, and ∑_n n^{−2} < ∞!

155

Letting X̃j := Xj − p (so that EX̃j = 0), we have

E(Sn − np)^4 = E(∑_{j=1}^n X̃j)^4
= E[ ∑_{j=1}^n X̃j^4 + 6 ∑_{j<k} X̃j^2 X̃k^2 + ∑ (terms containing some X̃j to the first power) ],

where E(X̃j^2 X̃k^2) = (EX̃1^2)^2 for j ≠ k, and each term containing a factor X̃j to the first power has expectation 0. Hence

E(Sn − np)^4 = nEX̃1^4 + 3n(n − 1)(EX̃1^2)^2
= n[(1 − p)^4 p + (1 − q)^4 q] + 3n(n − 1)(pq)^2
≤ n/4 + n(n − 1)/4 = n^2/4,

since (1 − p)^4 p + (1 − q)^4 q = pq(p^3 + q^3) ≤ pq ≤ 1/4 and (pq)^2 ≤ (1/4)^2 = 1/16 < 1/12 (so that 3(pq)^2 < 1/4).

Total bingo. We completed the proof.

156
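The moment identity and the bound above can be checked numerically (an illustration, not in the slides): the formula nEX̃⁴ + 3n(n−1)(EX̃²)² matches the fourth central moment computed straight from the binomial pmf, and it indeed stays below n²/4.

```python
# E(S_n - np)^4 for S_n ~ Binomial(n, p): closed form vs brute force,
# plus the bound E(S_n - np)^4 <= n^2/4 used in the SLLN proof.
import math

def fourth_central_moment(n, p):
    """n*E(X~^4) + 3n(n-1)*(E X~^2)^2, with E X~^2 = pq, E X~^4 = pq(p^3+q^3)."""
    q = 1 - p
    return n * p * q * (p**3 + q**3) + 3 * n * (n - 1) * (p * q) ** 2

def brute_force(n, p):
    """E(S_n - np)^4 computed directly from the binomial pmf."""
    q = 1 - p
    return sum(math.comb(n, k) * p**k * q**(n - k) * (k - n * p) ** 4
               for k in range(n + 1))

agree = abs(fourth_central_moment(12, 0.3) - brute_force(12, 0.3)) < 1e-8
bounded = all(fourth_central_moment(n, p) <= n * n / 4
              for n in (1, 2, 10, 1000) for p in (0.01, 0.3, 0.5, 0.9))
```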

NB: This result validates Probability Theory (as we constructed it):

the Xj ’s are actually indicators of independent events occurring with the same

probability p. Can interpret this as a sequence of independent trials,

observing/not observing the same event in each of them.

Thus Sn is the total number of occurrences of our event in n trials, and

Sn/n = relative frequency of the event. We showed that, for “almost all”

sequences of trials, the relative frequency tends to p, which is exactly what we

aimed to reproduce with our mathematical model.

157

Ex. Consider Ω = [0, 1], P = U[0, 1], Yj(ω) = the jth digit in the decimal expansion of the number ω = 0.Y1Y2Y3 . . . ∈ [0, 1].

Then {Yj} is an i.i.d. sequence, with P(Y1 = k) = 1/10, k = 0, 1, . . . , 9. Both claims are obvious [picture!].

What is the frequency of a given digit k in a "typical number's" decimal expansion?

Letting Xj := 1(Yj = k), we get an i.i.d. sequence of Bernoulli RVs with p = 1/10, with the frequency of k in the first n digits of ω given by Sn/n.

Now from the SLLN we know that, for Ck := {Sn/n → p ≡ 1/10 as n→∞}, one has P(Ck) = 1. And this holds for all k = 0, 1, . . . , 9, so that P(⋂_{k=0}^9 Ck) = 1.

This means that, for ALMOST ALL numbers ω ∈ [0, 1], each of the ten decimal digits appears with frequency 1/10 in ω's decimal expansion.

Most rationals are exceptions, of course (e.g. 1/3 = 0.3333 . . . etc.: they all have periodic expansions starting from some place) — but there are only countably many of them, a negligibly small proportion of all numbers in [0, 1]!

158
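A quick simulation of this "typical number" effect (illustrative, not in the slides): the digits Y1, Y2, . . . of a U[0,1]-random ω are exactly i.i.d. uniform digits, so we can emulate them directly and watch each digit's frequency settle near 1/10. The seed and n are arbitrary.

```python
# Digit frequencies of a "random number" omega in [0,1]: each of the ten
# decimal digits should appear with frequency ~ 1/10, by the SLLN.
import random

random.seed(1)
n = 200_000
digits = [random.randrange(10) for _ in range(n)]   # i.i.d. U{0,...,9} digits
freqs = [digits.count(k) / n for k in range(10)]
max_dev = max(abs(f - 0.1) for f in freqs)
```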

Q: Could we use the same tools to establish the WLLN & SLLN in the general case (when Xj is not ∼ B(p))?

A: To some extent. Look: all we needed in the proof of the WLLN was that

E(Sn − np)^2 ≡ E(Sn − nµ)^2 ≡ Var(Sn) = nVar(X1).

But this will hold in the general case for uncorrelated Xj with common values of µ := EXj and Var(Xj) (we don't even need independence!). The same argument shows that Sn/n P−→ µ in this case.

For the SLLN, our argument will still work for i.i.d. Xj with EXj^4 < ∞. However, one doesn't need that much: in fact, in the i.i.d. case, SLLN ⇔ E|X1| < ∞. The proof of this is much more sophisticated; we leave it for the future.

159

We will not discuss here the de Moivre–Laplace limit theorem (= "local CLT") for i.i.d. Xj ∼ B(p), which describes the behaviour of P(Sn = k) when we keep k "√n-close" to the mean ESn = np: these probabilities can then be approximated by multiples of the normal density values. [Picture.]

The proof of this is based on:

(a) the binomial formula P(Sn = k) = C(n, k) p^k q^(n−k), and

(b) Stirling's formula^a: k! = √(2π) k^(k+1/2) e^(−k) (1 + o(1)) as k→∞ (o(1) → 0).

We won't discuss here the Poisson limit theorem ("the law of small numbers"), which concerns approximating P(Sn = k) in situations where p = pn → 0 s.t. npn → λ ∈ (0, ∞): we will do it as an exercise.

Instead, we will turn to powerful (analytic) tools that can be used to analyse the behaviour of probability distributions in much more general situations.

^a Check how well it works! In fact, k! = √(2π) k^(k+1/2) e^(−k+θ(k)), where 1/(12k + 1) < θ(k) < 1/(12k).

160
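To "check how well it works", as the footnote suggests: a small numerical check (an aside, not in the slides) of the sharp two-sided Stirling bounds quoted there.

```python
# Sharp Stirling bounds:
#   sqrt(2 pi) k^(k+1/2) e^(-k + 1/(12k+1)) < k! < sqrt(2 pi) k^(k+1/2) e^(-k + 1/(12k)).
import math

def stirling_bounds(k):
    base = math.sqrt(2 * math.pi) * k ** (k + 0.5) * math.exp(-k)
    return base * math.exp(1 / (12 * k + 1)), base * math.exp(1 / (12 * k))

ok = all(lo < math.factorial(k) < hi
         for k in (1, 2, 5, 10, 20)
         for lo, hi in [stirling_bounds(k)])
```

Already at k = 1 the two bounds pin 1! = 1 between ≈ 0.9959 and ≈ 1.0023: the formula works remarkably well.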

8. Characteristic Functions (ChFs)

Def [6.1]. For an RV X, its ChF ϕX : R → C is defined by

ϕX(t) := Ee^{itX} = ∫ e^{itx} dFX(x).

Recall Euler's formula: e^{it} = cos t + i sin t, |e^{it}| = 1 for t ∈ R. [Picture.]

So ϕX(t) = E cos(tX) + iE sin(tX) always exists and is finite. Moreover,

|ϕX(t)| = |Ee^{itX}| ≤ E|e^{itX}| = 1,  ϕX(0) = Ee^{i0X} = E1 = 1.

The ChF of a distribution (or DF) F is the ChF of X ∼ F.

NB: for X ∈ Z, ϕX(t) = ∑_k e^{itk} P(X = k); for AC X's, ϕX(t) = ∫ e^{itx} fX(x) dx.

What's the point of introducing ChFs? To use Fourier analysis: represent a given function as a "mixture" of harmonic oscillations e^{itx} with different frequencies t. Analogy: orthogonal basis expansion, coordinates etc. And most importantly: the correspondence DF ↔ ChF is 1–1, and there is MORE! We will see.

161

Ex. X ≡ c = const ⇒ ϕX(t) = Ee^{itc} = e^{itc}.

Ex. X ∼ B(p) ⇒ ϕX(t) = Ee^{itX} = pe^{it·1} + qe^{it·0} = 1 + p(e^{it} − 1).

Ex. X ∼ U[0, 1] ⇒ ϕX(t) = ∫_0^1 e^{itx} dx = (1/(it)) [e^{itx}]_0^1 = (e^{it} − 1)/(it).

Oops: what if t = 0? Trouble? Nope. Look: |e^{it} − 1| ≤ |t|. [Picture.]

Indeed, e^{it} − 1 = ∫_0^t (e^{is})′ ds = i ∫_0^t e^{is} ds, where |e^{is}| = 1.

Ex. X ∼ N(0, 1) ⇒

ϕX(t) = ∫ e^{itx} e^{−x²/2}/√(2π) dx = (1/√(2π)) ∫ e^{itx − x²/2} dx
= (1/√(2π)) ∫ e^{−(x² − 2itx ± (it)²)/2} dx = e^{−t²/2} · (1/√(2π)) ∫ e^{−(x−it)²/2} dx = e^{−t²/2},

since the last integral equals 1.

Why = 1? Imagine we had µ ∈ R instead of it in e^{−(x−it)²/2}, OK? Formal proof: Cauchy's theorem for integrals of analytic functions over closed contours. [Picture.]

162
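A small numeric cross-check of two of the ChFs computed above (illustrative, not in the slides): for U[0,1] we integrate e^{itx} over [0,1] by the midpoint rule and compare with (e^{it} − 1)/(it); for B(p) we compare the two-point expectation with 1 + p(e^{it} − 1). The values t = 2.7, p = 0.4 are arbitrary.

```python
# Numeric check of the U[0,1] and B(p) characteristic functions.
import cmath

def chf_uniform01_numeric(t, grid=100_000):
    """E e^{itX} for X ~ U[0,1] via a midpoint Riemann sum."""
    h = 1.0 / grid
    return sum(cmath.exp(1j * t * (i + 0.5) * h) for i in range(grid)) * h

t = 2.7
closed_u = (cmath.exp(1j * t) - 1) / (1j * t)
err_u = abs(chf_uniform01_numeric(t) - closed_u)

p = 0.4
two_point = p * cmath.exp(1j * t * 1) + (1 - p) * cmath.exp(1j * t * 0)
closed_b = 1 + p * (cmath.exp(1j * t) - 1)
err_b = abs(two_point - closed_b)
```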

Prpn. If Y = aX + b, where a, b ∈ R are constants, then ϕY(t) = e^{itb} ϕX(at).

I Obvious: ϕY(t) = Ee^{it(aX+b)} = Ee^{it(aX)} e^{itb} = e^{itb} Ee^{i(ta)X} = e^{itb} ϕX(at).

Ex. If Y ∼ U[a, b], then Y d= a + (b − a)X, X ∼ U[0, 1], and so

ϕY(t) = e^{ita} ϕX((b − a)t) = e^{ita} · (e^{i(b−a)t} − 1)/(i(b − a)t) = (e^{ibt} − e^{iat})/(i(b − a)t).

Can verify by a direct calculation: ϕY(t) = ∫_a^b e^{itx} (1/(b − a)) dx.

Ex. If X ∼ N(µ, σ²), then X d= µ + σZ, Z ∼ N(0, 1), and so

ϕX(t) = e^{itµ} ϕZ(σt) = e^{itµ} e^{−(σt)²/2} = exp{ itµ − σ²t²/2 }.

163

Recall: for z = x + iy ∈ C (x, y ∈ R are the real and imaginary parts of z), its complex conjugate is z* := x − iy [Picture.], and (z1 + z2)* = z1* + z2* (obvious).

NB: (e^{iu})* = e^{−iu} (e.g. from Euler's formula) and so

ϕX(t)* = ( ∫ e^{itx} dFX(x) )* = ∫ (e^{itx})* dFX(x) = ∫ e^{−itx} dFX(x) = ϕX(−t) = ϕ−X(t).

Thus we established the following

Prpn. ϕX(t)* = ϕX(−t) = ϕ−X(t).

This will prove quite handy. For instance, note that if X is a symmetric RV, i.e. X d= −X, then we obtain ϕX(t) = ϕ−X(t) = ϕX(t)* by the Prpn.

Now what does z = z* mean? It means that z ∈ R, so that the ChF of a symmetric RV X is always real-valued. We will see later that the converse is true as well!

164

Prpn [6.2]. Any ChF is uniformly continuous.^a

I Fix an arbitrary ε > 0 and consider IN := [−N, N]. Then, ∀t, h ∈ R,

|ϕX(t + h) − ϕX(t)| = |Ee^{i(t+h)X} − Ee^{itX}| = |E(e^{i(t+h)X} − e^{itX})|
= |Ee^{itX}(e^{ihX} − 1)| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|   (as |e^{itX}| = 1)
= E(|e^{ihX} − 1|; X ∈ IN) + E(|e^{ihX} − 1|; X ∉ IN)
≤ |h|N + 2P(X ∉ IN),

using |e^{ihX} − 1| ≤ |hX| ≤ |h|N on {X ∈ IN} and |e^{ihX} − 1| ≤ |e^{ihX}| + 1 = 2 otherwise.

Now first choose N so large that the 2nd term is < ε/2, and then the whole thing will be < ε for |h| < ε/(2N) (regardless of the value of t!). This proves uniform continuity. Bingo.

^a Re uniform continuity, see sl. 149.

165

Thm [6.4]. If X and Y are independent RVs, then ϕX+Y(t) = ϕX(t)ϕY(t).

NB1: Not ϕX(t) + ϕY(t)!! This isn't even a ChF! (Why?)

NB2: Products are MUCH easier to compute than convolutions!

I ϕX+Y(t) = Ee^{it(X+Y)} = E[e^{itX} e^{itY}] = Ee^{itX} Ee^{itY} = ϕX(t)ϕY(t), using the independence of the RVs. Good.

Ex. If X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) are independent, then

ϕX+Y(t) = exp{ it(µX + µY) − (t²/2)(σX² + σY²) }.

If we had the uniqueness result (i.e. that DF ↔ ChF is 1–1), that would mean that X + Y ∼ N(µX + µY, σX² + σY²). A bit later.

NB: Try to use convolution to derive that result. No fun.

166
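A Monte Carlo illustration of Thm [6.4] (not from the slides): for independent X ∼ N(0,1) and Y ∼ N(1,4), the empirical ChF of X + Y should agree with ϕX(t)ϕY(t) = exp(it·1 − t²·5/2). The seed, t, and sample size are arbitrary.

```python
# Empirical ChF of a sum of independent normals vs the product of their ChFs.
import cmath
import random

random.seed(2)
t = 0.8
n = 100_000
emp = sum(cmath.exp(1j * t * (random.gauss(0, 1) + random.gauss(1, 2)))
          for _ in range(n)) / n                       # empirical E e^{it(X+Y)}
closed = cmath.exp(1j * t * 1 - t * t * (1 + 4) / 2)   # phi_X(t) * phi_Y(t)
err = abs(emp - closed)
```

The Monte Carlo error is of order 1/√n ≈ 0.003 here, far smaller than the tolerance tested.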

Thm [6.11]. Let k ≥ 1. If E|X|^k < ∞, then ϕX(t) is k times continuously differentiable and

EX^k = i^{−k} (d^k/dt^k) ϕX(t) |_{t=0}.

In particular, EX = −iϕ′X(0), EX² = −ϕ″X(0) (when the moments exist).

I (d^k/dt^k) ϕX(t) = (d^k/dt^k) Ee^{itX} ?= E[(d^k/dt^k) e^{itX}] = E[(iX)^k e^{itX}] = i^k E[X^k e^{itX}], which turns into i^k EX^k when we put t := 0. We just need to justify ?=: this follows from the Dominated Convergence Theorem (sl. 154) — and we need E|X|^k < ∞ for that. Good.

The converse is true for even k and "almost true" for odd k. Thus, if ϕX(t) is twice differentiable at zero, then EX² < ∞ (and hence ϕX(t) is everywhere twice differentiable, cf. Thm [6.12]). Ex: N(µ, σ²).

Thus, the smoother ϕX(t), the lighter the "tails" of FX, and the other way around!

167

Inversion Formulae and Uniqueness

Thm [6.7]. If ∫ |ϕX(t)| dt < ∞ then X has a continuous density given by

fX(x) = (1/2π) ∫ e^{−itx} ϕX(t) dt.

This is a general result from Fourier analysis.

Meaning (a bit loose): One can think of fX as a sum of harmonic oscillations

with different frequencies t. When computing ϕX(t), we find how strong the

contribution of oscillations at frequency t to fX is (it’s like an orthogonal basis

expansion of a vector). Then we can “assemble” fX back from these

oscillations, and this is what the inversion formula does.

NB: Thm [6.7] implies that there is a one-to-one correspondence between

distributions and their ChFs (at least in the AC case, when the ChF is

integrable, but it’s true always, we’ll discuss that a bit later).

Summary: Two different distributions cannot have the same ChF!

168

Ex. Consider ϕX(t) = sin t / t. How do we know that this is a ChF? See Ex. on sl. 164: the ChF of U[−1, 1] is

(e^{it·1} − e^{it·(−1)}) / (it(1 − (−1))) = (e^{it} − e^{−it}) / (2it) = sin t / t

by Euler's formula. But clearly ∫ |sin t / t| dt = ∞ (why?), so we cannot use the inversion formula. No wonder though, as otherwise X would have a continuous density, which is wrong.

Ex. Consider ϕX(t) = 1 + p(e^{it} − 1). Can we use the inversion formula?

NB: If X ∈ Z then ϕX(2πk) = Ee^{i2πkX} = 1 for k ∈ Z. So, for such X, ϕX(t) does NOT tend to 0 as |t| → ∞, whereas for AC X always ϕX(t) → 0 [Lebesgue thm].

Ex. Consider ϕX(t) = e^{−t²/2} (this corresponds to X ∼ N(0, 1), see sl. 163). The integrability condition is clearly met, so there exists a continuous density

fX(x) = (1/2π) ∫ e^{−itx} e^{−t²/2} dt = (1/√(2π)) e^{−x²/2}  (by the calculation on sl. 163), good!

169

Ex. Compute the ChF of X ∼ E(1) (do you see how to extend this to E(λ)?):

ϕX(t) = ∫_0^∞ e^{itx} e^{−x} dx = ∫_0^∞ e^{−(1−it)x} dx = [ −e^{−(1−it)x}/(1 − it) ]_0^∞ = 1/(1 − it).

Can one apply the inversion formula? Nope, as ∫ |ϕX(t)| dt = ∞, so we cannot use it. No wonder though, as otherwise X would have a continuous density, which is wrong.

Now consider the double exponential distribution (the first Laplace distribution):

fX(x) = (1/2) e^{−|x|},  x ∈ R.

NB: this is a mixture of E(1) and its "mirror reflection" (the distribution of −Y, Y ∼ E(1)), with equal weights, so

ϕX(t) = ∫ e^{itx} ( (1/2)fY(x) + (1/2)f−Y(x) ) dx = (1/2)(ϕY(t) + ϕ−Y(t))
= (1/2)( 1/(1 − it) + 1/(1 + it) ) = 1/(1 + t²).

170

NB: This ϕX(t) is already integrable on R, so the inversion formula is applicable, and hence

(1/2) e^{−|x|} = (1/2π) ∫ e^{−itx}/(1 + t²) dt.

If we replace here x ↔ t, and then t with −t, the result can be re-written as

e^{−|t|} = ∫ e^{itx}/(π(1 + x²)) dx.

That is, the ChF of the standard Cauchy distribution is e^{−|t|}.

Try to compute it directly: no fun. Also, note that it is NOT differentiable at zero, so there is no way for the first moment of the distribution to be finite. Good.

171
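A numerical illustration of the inversion formula, Thm [6.7] (not in the slides): with ϕ(t) = 1/(1 + t²), the double exponential ChF just derived, (1/2π) ∫ e^{−itx} ϕ(t) dt should return (1/2)e^{−|x|}. We truncate the integral to [−T, T]; the dropped tail is of order 1/(πT), and T here is an arbitrary choice.

```python
# Inverting phi(t) = 1/(1+t^2) numerically to recover f(x) = (1/2) e^{-|x|}.
import cmath
import math

def invert(x, T=200.0, steps=200_000):
    """(1/2pi) * integral over [-T, T] of e^{-itx}/(1+t^2) dt, midpoint rule."""
    h = 2 * T / steps
    s = 0j
    for i in range(steps):
        t = -T + (i + 0.5) * h
        s += cmath.exp(-1j * t * x) / (1 + t * t)
    return (s * h / (2 * math.pi)).real

err = max(abs(invert(x) - 0.5 * math.exp(-abs(x))) for x in (0.0, 1.0, -2.0))
```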

Q: Is there any way to invert the ChF when ∫ |ϕX(t)| dt = ∞?

A: Yes. Look: assuming for a moment that we can use Thm [6.7],

FX(y) − FX(x) = ∫_x^y fX(u) du = ∫_x^y [ (1/2π) ∫ e^{−itu} ϕX(t) dt ] du
?= (1/2π) ∫ [ ∫_x^y e^{−itu} ϕX(t) du ] dt = (1/2π) ∫ [ ϕX(t) ∫_x^y e^{−itu} du ] dt
= (1/2π) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕX(t) dt,   (1)

where one can justify ?=. The RHS makes sense when ∫ |ϕX(t)/t| dt < ∞, and then the LHS still equals the RHS. Formal proof: if Z ∼ N(0, 1) is independent of X, ε > 0, then ϕX+εZ(t) = ϕX(t) e^{−ε²t²/2} is integrable on R, the inversion formula applies, and (1) holds with X + εZ instead of X. Then pass to the limit as ε → 0.

In the general case, the resulting formula just has lim_{ε→0} on the RHS. Or replace the ∫ on the RHS with lim_{T→∞} ∫_{−T}^T (the "principal value integral").

172

The most important conclusion: uniqueness holds in the general case as well. To different DFs there correspond different ChFs! (Cf. (1)!)

Now we can claim that if, say, the ChF of X is e^{iµt − σ²t²/2}, then we must have X ∼ N(µ, σ²) (cf. Ex. on sl. 167).

Also, now we can claim that if X, Y ∼ E(1) are independent then, since according to sl. 171

ϕX−Y(t) = ϕX(t)ϕ−Y(t) = ϕX(t)ϕY(−t) = (1/(1 − it)) × (1/(1 + it)) = 1/(1 + t²),

the difference X − Y must follow the double exponential distribution.^a One can discover that using convolution, but ChFs make things much easier.

And now we can assert that if ϕX(t) is real-valued, then ϕX(t) = ϕ−X(t) (see sl. 165) implies that X d= −X, i.e. the distribution of X is symmetric (as we promised to justify).

^a Here X − Y has the mixture distribution (1/2)E(1) + (1/2)(−E(1)) (gross notation abuse!!), due to the memoryless property of E(1) (use the TPF).

173

NB: Using the same argument as in Thm [6.11] (sl. 168), one can show that if ∫ |t^k ϕX(t)| dt < ∞, then X has a k times continuously differentiable density. Thus, the smoother fX(x), the lighter the "tails" of ϕX(t), and the other way around! Cf. remark on sl. 168.

NB: Sums of independent RVs have smoother distributions than the summands, as |ϕX(t)ϕY(t)| decays as t → ±∞ faster than either of the factors.

Ex. Sums of i.i.d. Xj ∼ U[−1, 1]: for Sn := X1 + · · · + Xn, using sl. 170,

ϕS1(t) = sin t / t,  ϕS2(t) = sin²t / t² (so that S2 has a continuous density),
ϕS3(t) = sin³t / t³ (so that S3 has a continuously differentiable density), etc. [Picture.]

Ex. What can one say about the distribution of X + Y, where X and Y are independent, X ∈ Z and Y is AC?

Ex. The sum of two independent singular RVs can be AC.

174

Continuity Theorems & Applications

One of the great things about ChFs is that there is a very simple & useful relationship between convergence in distribution and that of ChFs.

Thm [6.15] As n→∞, Xn d−→ X ⇔ ∀t ∈ R, ϕXn(t) → ϕX(t).

I ⇒) Obvious: note that ϕXn(t) = Ef(Xn), where f(x) = e^{itx} is a bounded continuous function of x (we keep t fixed here), and recall Thm [5.8], sl. 137.

⇐) Not so obvious. What happens is essentially this: the collection of functions f(x) = e^{itx} for different t ∈ R is rich enough to ensure that convergence Ee^{itXn} → Ee^{itX} for all t implies that Ef(Xn) → Ef(X) for all bounded continuous f (we can take linear combinations of e^{itjx} for collections of tj's to approximate general f's etc).

Or one can use the inversion formula. Thus, assuming for simplicity that |ϕXn(t)|, |ϕX(t)| ≤ g(t) for some bounded function g(t) s.t. g(t)/t is integrable at ±∞, we can use the formula from sl. 173 to write:

175

FX(y) − FX(x) = (1/2π) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕX(t) dt
= (1/2π) ∫ ((e^{−itx} − e^{−ity})/(it)) lim_{n→∞} ϕXn(t) dt
∗= lim_{n→∞} (1/2π) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕXn(t) dt
= lim_{n→∞} (FXn(y) − FXn(x)),   (∗)

where ∗= is justified by the Dominated Convergence Theorem (cf. sl. 154). Good.

Is it clear that when convergence (∗) holds then Xn d−→ X? If yes, it's bingo.

176

Q: What if we just know that ϕXn(t) → ϕ(t) as n→∞, where ϕ is some function (we don't know if it's a ChF of some distribution)?

A: Then bad things can happen, and we can say when they do happen!

Thm [6.17] If ∀t ∈ R one has ϕXn(t) → ϕ(t) as n→∞, and ϕ(t) is continuous at t = 0, then ϕ(t) is the ChF of some RV X and Xn d−→ X.

I The proof is somewhat beyond the scope of this course, so we'll leave it for the future. At the moment: just a few words re WHAT happens when the limiting function ϕ is discontinuous at t = 0.

Ex. Let Xn ∼ U[−n, n]. Then (sl's 164, 170)

ϕXn(t) = sin(nt)/(nt) → 0 for t ≠ 0, while ϕXn(0) ≡ 1.

The limiting function ϕ(t) = 1(t = 0) is discontinuous at 0, and so cannot be a ChF.

In this example, the probability "escapes to infinity", and it is in such situations that one obtains a limit for ϕXn that is discontinuous at 0. Look e.g. at a similar situation where Xn ∼ N(0, n). [Write down ϕXn and see what happens!]

177

NB: Assertions similar to Thms [6.15], [6.17] hold for Laplace transforms lX(t) := Ee^{−tX} (popular tools when X ≥ 0) and GFs ζX(z) := Ez^X (used for X ∈ Z), as they are basically the same as the ChF.

Now how do these theorems work in applications?

We know that ChFs love addition of independent RVs: they multiply then. So the technique is well-suited for analysing situations where we add such RVs. But why do we pay so much attention to sums of RVs in PT?

When a large number of relatively small factors act together, the total effect can often be (at least, approximately) linear. Which is no wonder though, as, for a differentiable f,

f(x + εy) = f(x) + ε ∑_j (∂f(x)/∂xj) yj + o(ε) as ε → 0,

so we do have a sum after all. . .

178

Thm [6.19] (WLLN) Let X1, X2, . . . be i.i.d. RVs. If E|X1| < ∞ then

Sn/n d−→ µ := EX1 as n→∞ (and hence also P−→, see PS–8, tute problem 1).

I Using the properties of ChFs, ∀t ∈ R,

ϕSn/n(t) = ϕSn(t/n) = (ϕX(t/n))^n.   (∗)

Here the argument t/n → 0, so one would be inclined to expand ϕX about zero to see what happens — and we can do that since ϕX is continuously differentiable with ϕ′X(0) = iµ (due to the assumption that E|X1| < ∞). So:

ϕX(s) = ϕX(0) + ϕ′X(0)s + o(s) = 1 + iµs + o(s) as s → 0.

Now back to (∗):

ϕSn/n(t) = [ 1 + (iµt + o(1))/n ]^n → e^{itµ} = ϕµ(t),

bingo by Thm [6.15].

179

Right. This means that X̄ is a consistent estimator of µ = EX1 (by definition). This is good, but to find, say, confidence intervals or evaluate errors of tests based on X̄ (or other consistent estimators), one needs more. Namely, one needs to know the distribution (at least, approximately) of the difference X̄ − µ, which is vanishing as n→∞.

So we need a "magnifying glass" to see any patterns here, which is achieved by scaling the difference, by considering

bn(X̄ − µ) ≡ bn(Sn/n − µ) = (bn/n)(Sn − nµ) for some bn → ∞.

For what choice of {bn} will this have a limiting distribution? It turns out that when the Xj's have a finite second moment, the right choice is bn = c√n. Re what to do when EXj² = ∞, we'll talk a bit later.

First — to the Central Limit Theorem (CLT).

180

Thm [6.20] (CLT) If, in addition to the assumptions of Thm [6.19], EXj² < ∞ and σ² := Var(X1) > 0, then

Yn := (Sn − nµ)/(σ√n) d−→ Z ∼ N(0, 1) as n→∞.

I It suffices to show that ϕYn(t) → e^{−t²/2}. First we standardise the Xj's by setting X̃j := (Xj − µ)/σ. Then clearly the X̃j are i.i.d. with EX̃j = 0, EX̃j² = 1, and Yn = S̃n/√n, where S̃n := X̃1 + · · · + X̃n. Therefore

ϕYn(t) = ϕS̃n/√n(t) = ϕS̃n(t/√n) = (ϕX̃1(t/√n))^n.   (∗)

Here t/√n → 0 as n→∞, and we will again expand ϕX̃1 about zero, but as EX̃j² < ∞, the ChF ϕX̃1 is twice differentiable and we have one more term in Taylor's series:

ϕX̃(s) = ϕX̃(0) + ϕ′X̃(0)s + (1/2)ϕ″X̃(0)s² + o(s²) = 1 − (s²/2)(1 + o(1)),

since ϕX̃(0) = 1, ϕ′X̃(0) = iEX̃ = 0 and ϕ″X̃(0) = −EX̃² = −1.

181

Now back to (∗):

ϕYn(t) = [ 1 − (1/2)(t/√n)²(1 + o(1)) ]^n = [ 1 − t²(1 + o(1))/(2n) ]^n → e^{−t²/2}

as n→∞. Bingo.

NB: Using the same techniques, one can extend this to non-identically distributed independent RVs Xj. One just needs to be careful and needs an additional condition, e.g. the Lyapunov condition: assuming all EXj = 0 (no big deal), one requires

Bn^{−3} ∑_{j=1}^n E|Xj|³ → 0 as n→∞, where Bn² := Var(Sn) = ∑_{j=1}^n EXj²

(the condition ensures that, in the limit, all the Xj's are "negligibly small" compared to the sum Sn). Then for ϕXj one can use Taylor's expansion with three terms (the third moment is finite!) and use the condition to show that ϕSn/Bn(t) → e^{−t²/2}, so that Sn/Bn d−→ Z ∼ N(0, 1) (you can DIY!).

182
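The ChF computation in the CLT proof can be watched converge numerically (a deterministic illustration, not in the slides): for Xj ∼ U[−1, 1] (mean 0, variance 1/3), the standardized X̃ = √3·X has ChF sin(√3 t)/(√3 t), and (ϕX̃(t/√n))^n should approach e^{−t²/2}. The value t = 1.3 is an arbitrary choice.

```python
# CLT via ChFs for U[-1,1] summands: (phi_{X~}(t/sqrt(n)))^n -> e^{-t^2/2}.
import math

def phi_std_uniform(t):
    """ChF of sqrt(3)*U[-1,1], i.e. the standardized uniform summand."""
    return 1.0 if t == 0 else math.sin(math.sqrt(3) * t) / (math.sqrt(3) * t)

def phi_Yn(t, n):
    return phi_std_uniform(t / math.sqrt(n)) ** n

t = 1.3
errs = [abs(phi_Yn(t, n) - math.exp(-t * t / 2)) for n in (10, 100, 10_000)]
```

The error shrinks like 1/n (the next Taylor term), consistent with the "1 + o(1)" bookkeeping in the proof.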

Now what if EXj² = ∞ (in the i.i.d. case)? If µ := EX1 is still finite and hence X̄ ≡ Sn/n P−→ µ, there can still exist another scaling sequence bn → ∞ such that bn(X̄ − µ) d−→ something. For this, one needs "regular variation" of the tails of FX:

F(−x) = x^{−α} L−(x),  1 − F(x) = x^{−α} L+(x),

where α ∈ [1, 2], L± are "slowly varying" in the sense that L(vx)/L(x) → 1 as x→∞ for fixed v (Ex: ln x, but not x^a), and lim_{x→∞} (1 − F(x))/F(−x) = c ∈ [0, ∞].

Then with bn = n^{1−1/α} l(n), l being another slowly varying function, the limiting distribution will be one of the so-called stable laws. When E|X1| = ∞, under the above conditions on the tails (with α ∈ (0, 1]), n^{−1/α} l(n) Sn will converge in distribution. Technically, it's much harder to prove than the CLT.

Important difference: roughly speaking, the contributions of the individual Xj's to the sum Sn are all negligibly small in the case of the CLT, whereas in the case of convergence to a non-normal stable distribution, the main contribution to Sn comes from a small proportion of the Xj's (the largest ones!).

183

Ex. The Cauchy distribution with density f(x) = 1/(π(1 + x²)) has ChF ϕ(t) = e^{−|t|}. Therefore, in this case

ϕSn/n(t) = (ϕ(t/n))^n = (e^{−|t/n|})^n = e^{−|t|} = ϕ(t),

i.e. Sn/n d= X1. Wow!!

Thus, say, using the sample mean X̄ to estimate the parameter θ in the location family of densities fθ(x) = 1/(π(1 + (x − θ)²)) would be meaningless, as X̄ d= X1 (no gain in precision compared to a single observation). But: the sample median X(n/2) would work (later about that).

The Cauchy distribution is an example of a non-normal stable distribution.

184
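A simulation contrasting the two estimators just mentioned (illustrative, not in the slides): for Cauchy data with θ = 0, the sample mean has the same Cauchy law as one observation, while the sample median does concentrate around 0. The seed and n are arbitrary choices.

```python
# Cauchy samples: the sample median concentrates at the location parameter,
# the sample mean does not (it stays Cauchy-distributed, as shown via ChFs).
import math
import random

random.seed(3)

def cauchy():
    """Standard Cauchy via inverse-DF sampling: tan(pi*(U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))

n = 100_001
sample = sorted(cauchy() for _ in range(n))
median = sample[n // 2]          # should be within ~ pi/(2 sqrt(n)) of 0
mean = sum(sample) / n           # still Cauchy-distributed: can be anything
```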

Thm [6.21] (Poisson limit theorem.) If Xn,1, . . . , Xn,n are independent RVs,

P(Xn,j = 1) = 1 − P(Xn,j = 0) = pn,  j = 1, . . . , n,

and npn → λ ∈ (0, ∞) as n→∞, then Sn := Xn,1 + · · · + Xn,n d−→ Y ∼ P(λ).

I Here, using the Bernoulli ChF from sl. 163,

ϕSn(t) = (ϕXn,1(t))^n = (1 + pn(e^{it} − 1))^n = [ (1 + pn(e^{it} − 1))^{1/pn} ]^{npn} → (e^{e^{it}−1})^λ = e^{λ(e^{it}−1)}

as n→∞, bingo (why?).

Q: Do we really need all Xn,j to have the same distribution here? Seems to be an overstretch, from the applications view-point. Also, do they really need to be Bernoulli RVs?

A: Nope to both. Go to the next slide for more.

185
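How good is the approximation for finite n? A deterministic illustration (not in the slides): the total-variation distance between Binomial(n, λ/n) and Poisson(λ) shrinks as n grows.

```python
# Total-variation distance between Binomial(n, lam/n) and Poisson(lam).
import math

def tv_binom_poisson(n, lam):
    p = lam / n
    binom = [(1 - p) ** n]
    for k in range(1, n + 1):              # pmf recurrences avoid overflow
        binom.append(binom[-1] * (n - k + 1) / k * p / (1 - p))
    pois = [math.exp(-lam)]
    for k in range(1, n + 1):
        pois.append(pois[-1] * lam / k)
    tail = max(0.0, 1.0 - sum(pois))       # Poisson mass beyond n
    return 0.5 * (sum(abs(b - q) for b, q in zip(binom, pois)) + tail)

tvs = [tv_binom_poisson(n, 3.0) for n in (10, 100, 1000)]
```

The distance is known to be of order λ²/n here, which the decreasing values reflect.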

Here is more: suppose that

P(Xn,j = 0) = 1 − pn,j − qn,j,  P(Xn,j = 1) = pn,j,  P(Xn,j ∉ {0, 1}) = qn,j.

Then, omitting the subscripts n,j for brevity and assuming that p + q is small,

ϕX(t) = (1 − p − q)e^{it·0} + pe^{it·1} + qη(t) = 1 + p(e^{it} − 1) + q(η(t) − 1)
= exp{ p(e^{it} − 1) + q(η(t) − 1) + o(p + q) },

where η(t) := E(e^{itX} | X ∉ {0, 1}) and hence |η(t) − 1| ≤ |η(t)| + 1 ≤ 2.

Therefore, putting λn := ∑_{j=1}^n pn,j, we obtain

ϕSn(t) = ∏_{j=1}^n ϕXn,j(t) = exp{ λn(e^{it} − 1) + O(∑_{j=1}^n qn,j) + o(λn) } → e^{λ(e^{it}−1)}

provided that λn → λ and max_{j≤n} pn,j + ∑_{j=1}^n qn,j = o(1), good!

186

What do we do in the case of RVecs?

More or less the same, but there is more fun. For X = (X1, . . . , Xd) ∈ R^d, the ChF is a function of t = (t1, . . . , td) ∈ R^d defined by

ϕX(t) := Ee^{i(t,X)} = E exp{ i ∑_{j=1}^d tj Xj }.

All the results for univariate ChFs extend in a natural way to the multivariate case, including the 1–1 correspondence between distributions and their ChFs and the continuity theorems. The change under a linear transformation has this form: if Y = XA + b, where A is a d×m-matrix and b ∈ R^m, then, since for the scalar product we can write (s, XA) = s(XA)^T = sA^T X^T = (sA^T, X), one has

ϕY(s) = Ee^{i(s,Y)} = Ee^{i(s,XA)+i(s,b)} = e^{i(s,b)} Ee^{i(sA^T,X)} = e^{i(s,b)} ϕX(sA^T).

187

The inversion formula will require multivariate integration; calculation of moments can be done using partial differentiation, e.g.

(∂^{k1+k2} / ∂t1^{k1} ∂t2^{k2}) ϕX(t) = i^{k1+k2} E[X1^{k1} X2^{k2} e^{i(t,X)}].

Letting here t := 0 yields the mixed moment EX1^{k1} X2^{k2} (of course, for this to work we need E|X1^{k1} X2^{k2}| < ∞). And so on.

Now observe this: for a fixed unit vector b ∈ R^d, one can write, for the univariate RV Xb := (b, X) (which is the projection of X on the direction of b),

ϕXb(t) = ϕX(tb),  t ∈ R.

Meaning: knowing the ChF of X is equivalent to knowing the ChFs of all the projections of X on various directions in R^d. Switching to distributions: knowing the distribution of X is equivalent to knowing the distributions of all the projections Xb, i.e. the probabilities of all half-spaces [picture!]. Wow.

188

Now consider an i.i.d. sequence of RVecs X(1), X(2), . . . [NB: our notation here differs from that in the book: we use Xj(k) for the jth component of the kth vector X(k)], S(n) := X(1) + · · · + X(n). It is obvious that if the SLLN (or WLLN) holds for each of the components:

Sj(n)/n a.s.−→ µj  (resp. Sj(n)/n P−→ µj) as n→∞,

where µj = EXj(k), then it also holds for the vectors. (Can you prove that?)

So for the LLNs to hold, we just need E‖X(1)‖ < ∞. Easy.

The proof of the CLT based on the use of ChFs is also not very difficult; we'll just need to use the multivariate Taylor expansion and the formula for moments from the previous slide. Before formulating the multivariate CLT, we consider the following example.

189

Ex. For X ∼ N(µ, C²X) (the normal distribution in R^d with mean µ ∈ R^d and CovM C²X, see sl. 87–89), the ChF can be computed from the representation X d= ZA + µ, where Z ∼ N(0, I) has the standard d-dim normal distribution (it has i.i.d. N(0, 1) components) and A ∈ R^{d×d} is a "square root" of the (nonnegative-definite) matrix C²X: A^T A = C²X.^a First we compute

ϕZ(t) = E exp{ i ∑_{j=1}^d tj Zj } = ∏_j Ee^{itjZj} = e^{−∑_j tj²/2} = e^{−tt^T/2}.

Using the transformation formula from the bottom of sl. 188, we obtain now

ϕX(t) = e^{i(t,µ)} ϕZ(tA^T) = e^{i(t,µ)} e^{−(tA^T)(tA^T)^T/2} = e^{i(t,µ) − tC²X t^T/2}.

BTW, this can be taken as the general definition of the multivariate normal distribution (whether it has a density or not, i.e. whether it is concentrated on the whole space R^d or on a subspace of lower dimensionality).

^a One way of finding such an A is to use the Cholesky decomposition.

190

Now return to our multivariate random walk S(n) := X(1) + · · · + X(n), where X(1), X(2), . . . are i.i.d. RVecs with E‖X(1)‖² < ∞, so that (i) X(1) has a finite mean (vector) µ := EX(1), and (ii) the CovM C²X of X(1) exists.

We already know that S(n)/n P−→ µ as n→∞; what about getting more detail on this convergence (as given by the CLT in the univariate case)? NB: Cov(S(n) − µn) = nCov(X(1)).

As we pointed out before (sl. 190), using the multivariate Taylor formula, we obtain the following analog of the classical CLT. The proof is almost identical to the one in the univariate case.

Thm (Multivariate CLT). Under the above assumptions, as n→∞,

√n (S(n)/n − µ) ≡ (S(n) − µn)/√n d−→ Y ∼ N(0, C²X).

In the case of non-identically distributed X(j)'s, there is more fun, of course.

NB: Here and in all our earlier Limit Theorems, one can estimate how fast the resp. convergence is. For example, in the case of the univariate CLT for i.i.d. summands, under additional assumptions, one can give an upper bound for |Fn(x) − Φ(x)|, where Fn is the DF of the standardized Sn and Φ is the DF of N(0, 1). Thus, if γ := E|X1 − µ|³ < ∞, then the difference won't exceed 0.4748γ/(σ³√n) (for any x; Berry–Esseen Thm). This bound is unimprovable (well, the constant might be made smaller, but not smaller than 0.4098) [Picture!]. Of course, 1/√n → 0 pretty slowly; if unhappy with that approximation rate, can go further (from, say, Fn(x) ≈ Φ(x) to Fn(x) ≈ Φ(x) + (E(X1 − µ)³/(6σ³√n))(1 − x²)Φ′(x), which has an error O(1/n) under proper conditions, etc.).

Ex. CLT for multinomial distributions. We have d bins and conduct the following multi-stage RE: at each step, a ball is placed in a randomly chosen bin, P(jth bin selected) = pj > 0 independently of the past, ∑_{j=1}^d pj = 1. Consider the RVec S(n) whose components are the counts Sj(n) := # of balls in bin j after n steps. Describe the behaviour of S(n) as n→∞.

192

It is clear that S(n) = ∑_{k=1}^n X(k), where the X(k) ∈ R^d are i.i.d. RVecs with

X(1) = (1, 0, 0, . . . , 0) =: e1 w.p. p1,
       (0, 1, 0, . . . , 0) =: e2 w.p. p2,
       · · ·
       (0, 0, 0, . . . , 1) =: ed w.p. pd.

We see that the components Xj(1) are dependent B(pj)-RVs. For any fixed j, however, the RVs Xj(1), Xj(2), Xj(3), . . . are i.i.d., and so by the SLLN one has Sj(n)/n a.s.−→ pj. We conclude that

S(n)/n a.s.−→ p := (p1, p2, . . . , pd) as n→∞.

Since ‖X(1)‖ ≡ 1, the Multivariate CLT clearly holds for S(n). We just need to find the CovM of X(1), which will give us the covariance matrix of the limiting normal distribution.

193

As X(1) is always one of the coordinate vectors e1, . . . , ed, one has

Xj(1)Xk(1) = δjk Xj(1)  ⇒  EXj(1)Xk(1) = δjk EXj(1) = δjk pj,

where δjk := 1(j = k) is Kronecker's delta, so that

Cov(Xj(1), Xk(1)) = EXj(1)Xk(1) − EXj(1)EXk(1) = δjk pj − pj pk.

Thus the CovM of X(1) is shown to be

D²p := diag(pj) − p^T p.

From the Multivariate CLT we conclude that, as n→∞,

Wn := √n (S(n)/n − p) = (S(n) − np)/√n d−→ W ∼ N(0, D²p).   (∗)

This result has immediate important implications for statistics! Suppose Y1, . . . , Yn is an i.i.d. sample of RVs with a common (say, univariate and unknown to us) DF G. We want to test the hypothesis H0 = {G = F}, where F is some hypothesized DF ('goodness-of-fit' testing).

194
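The covariance computation above can be checked mechanically by summing over the d possible values of X(1) (a small sketch, not from the slides; the probabilities p are an arbitrary illustrative choice):

```python
# X(1) takes the value e_i with probability p_i; compute its covariance matrix
# directly and compare with diag(p_j) - p^T p from the slide.
p = [0.2, 0.3, 0.5]
d = len(p)
e = lambda i, j: 1.0 if i == j else 0.0          # j-th component of e_i

EX = [sum(p[i] * e(i, j) for i in range(d)) for j in range(d)]           # = p_j
EXX = [[sum(p[i] * e(i, j) * e(i, k) for i in range(d)) for k in range(d)]
       for j in range(d)]                                                # = delta_jk * p_j
cov = [[EXX[j][k] - EX[j] * EX[k] for k in range(d)] for j in range(d)]
target = [[(p[j] if j == k else 0.0) - p[j] * p[k] for k in range(d)]
          for j in range(d)]
print(cov)   # entrywise equal to diag(p_j) - p^T p
```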

One way to do that is to use the χ2-test. What is that about?

Partition R into d intervals: (−∞, t1], (t1, t2], . . . , (td−2, td−1], (td−1,∞).

Denote by Sj(n) the # of sample points Yk that fell into the jth partition

interval. Clearly, the RVec S(n) ∼ multinomial distribution with parameters

(n,p), where p1 := F (t1), p2 := F (t2)− F (t1), . . . , pd−1 := F (td−1)− F (td−2),

pd := 1− F (td−1) under the hypothesis H0. [Picture!]

From the previous slide, (∗) will hold. How to use that for testing H0?

Well, by Thm [5.23] (sl. 148), for any continuous function g(x), the distribution of g(Wn) will converge to that of g(W). In particular, this will hold if we take g(x) := ‖x‖² = ∑_{j=1}^d xj²:

‖Wn‖² −d→ ‖W‖².

As we know the distr'n of W under H0 (it's N(0, D_p²)), we can find that of ‖W‖² and use it to construct a test of a(n approximately, for large n) given type I error (that of rejecting H0 when it's true), e.g. ‖Wn‖² > r.

195

It would be OK if it were not for the fact that, to use the test for different F's and different partitions, we would each time have to compute the distr'n of ‖W‖² (as it depends on p), which is a nightmare. Can we get around it?

Yes. There is a simple modification of the test procedure that makes the

(asymptotic, as n→∞) distribution of the test statistic independent of both

p and F ! Such tests are called “asymptotically distribution-free”.

This is what we will do: consider

X̃(1) := ((X1(1) − p1)/√p1, . . . , (Xd(1) − pd)/√pd) ∈ R^d,  S̃(n) := X̃(1) + · · · + X̃(n).

Then

S̃(n)/√n = ((S1(n) − np1)/√(np1), . . . , (Sd(n) − npd)/√(npd)) = [(S(n) − np)/√n] B with B := diag(pj^{−1/2}),

and so

‖S̃(n)/√n‖² = ∑_{j=1}^d (Sj(n) − npj)²/(npj) =: H(n)

will be the famous χ²-statistic.

196
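A direct computation of H(n) from simulated multinomial counts (a sketch, standard library only; the bin probabilities and the seed are arbitrary illustrations):

```python
import random

random.seed(1)
p = [0.1, 0.2, 0.3, 0.4]
n = 1000

# one multinomial sample: n balls dropped independently into the d bins
counts = [0] * len(p)
for j in random.choices(range(len(p)), weights=p, k=n):
    counts[j] += 1

# the chi-squared statistic H(n) = sum_j (S_j(n) - n p_j)^2 / (n p_j)
H = sum((counts[j] - n * p[j]) ** 2 / (n * p[j]) for j in range(len(p)))
print(counts, H)
```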

It follows from relation (∗) (sl. 195) and Thm [5.23] that

S̃(n)/√n −d→ WB ∼ N(0, B⊤D_p²B),

with

B⊤D_p²B = diag(pj^{−1/2}) (diag(pj) − p⊤p) diag(pj^{−1/2}) = Id − b⊤b,

where Id is the unit matrix in R^d, and b := (√p1, . . . , √pd) is a unit vector: ‖b‖² = ∑_{j=1}^d (√pj)² = 1. So, again by Thm [5.23], as n→∞,

H(n) = ‖S̃(n)/√n‖² −d→ ‖WB‖², WB ∼ N(0, Id − b⊤b). (∗∗)

Now consider Z ∼ N(0, Id) and form the difference

Z − (b, Z)b.

This is the projection of Z onto the (d− 1)-dim hyperplane orthogonal to the

vector b. About this RVec we will make the following two observations (i), (ii):

197

(i) WB =d Z − (b, Z)b.

Indeed, both are normal vectors in R^d with zero means, whereas the covariance matrix of the latter vector (note that, with our row-vector convention, Z − (b, Z)b = Z(Id − b⊤b)) equals

C²_{Z−(b,Z)b} = E(Z − (b, Z)b)⊤(Z − (b, Z)b) = E(Z(Id − b⊤b))⊤Z(Id − b⊤b)
= (Id − b⊤b)⊤(EZ⊤Z)(Id − b⊤b) = (Id − b⊤b)⊤(Id − b⊤b)
= (Id − b⊤b)(Id − b⊤b) = Id − 2b⊤b + b⊤(bb⊤)b = Id − b⊤b = C²_{WB},

since (Id − b⊤b)⊤ = Id − b⊤b and bb⊤ = ∑_j (√pj)² = 1 (a scalar).

As multivariate normal distributions are uniquely specified by their mean vectors and covariance matrices, the assertion follows.

198

(ii) As N(0, Id) is invariant w.r.t. rotations, the “nature” of the distr’n of the

projection Z − (b, Z)b will be the same as for the projection of Z onto any

other (d− 1)-dim hyperplane, e.g. Z∗ := (Z1, . . . , Zd−1) ∈ Rd−1. Indeed:

Prpn [∗] If Z ∼ N(0, Id) and b1, . . . , bd is an orthonormal system in Rd,

then Y := ((b1, Z), . . . , (bd, Z)) ∼ N(0, Id) as well.

I Compute the ChF of Y: for t = (t1, . . . , td) ∈ R^d, recalling from sl. 191 that the ChF of Z is ϕZ(t) = e^{−tt⊤/2} and using the fact that

‖∑_{j=1}^d tj bj‖² = (∑_{j=1}^d tj bj, ∑_{k=1}^d tk bk) = ∑_{j=1}^d ∑_{k=1}^d tj tk bj bk⊤ = tt⊤

(as bj bk⊤ = δjk by orthonormality), we have

ϕY(t) = E e^{itY⊤} = E exp{i ∑_{j=1}^d tj (bj, Z)} = E exp{i (∑_{j=1}^d tj bj, Z)}
= exp{−(1/2)‖∑_{j=1}^d tj bj‖²} = e^{−tt⊤/2} = ϕZ(t), bingo!

199

The desired assertion (ii) follows, as, taking an orthonormal system b1, . . . , bd s.t. bd = b and noting that Z = ∑_{j=1}^d (bj, Z)bj, we have Z − (b, Z)b = ∑_{j=1}^{d−1} (bj, Z)bj, which is a (d − 1)-dim RVec with i.i.d. comp's (bj, Z) ∼ N(0, 1), like Z∗ = (Z1, . . . , Zd−1). Therefore

‖Z − (b, Z)b‖² = ‖∑_{j=1}^{d−1} (bj, Z)bj‖² =d ‖Z∗‖² = ∑_{j=1}^{d−1} Zj² ∼ χ²_{d−1}

by the definition of the χ²-distribution with d − 1 degrees of freedom.

Now from convergence (∗∗) and properties (i), (ii) we conclude that

H(n) −d→ ‖WB‖² =d ‖Z − (b, Z)b‖² ∼ χ²_{d−1}.

Thus the limiting distribution depends neither on F nor on the partition used! Very handy: one can use the same χ²-distribution for any such goodness-of-fit test with the same number of "bins" (in a much more general context than testing our H0!).

200
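One can probe the χ²_{d−1} limit by Monte Carlo: for d = 3 bins the mean of H(n) should approach d − 1 = 2, the mean of χ²₂ (a rough simulation sketch, standard library only; the parameters are arbitrary):

```python
import random

random.seed(0)
p = [0.25, 0.25, 0.5]
d, n, reps = len(p), 400, 2000

vals = []
for _ in range(reps):
    counts = [0] * d
    for j in random.choices(range(d), weights=p, k=n):
        counts[j] += 1
    vals.append(sum((counts[j] - n * p[j]) ** 2 / (n * p[j]) for j in range(d)))

mean_H = sum(vals) / reps
print(mean_H)   # should be close to d - 1 = 2, the mean of chi^2_{d-1}
```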

BTW, basically the same argument shows that, for i.i.d. Xj ∼ N(0, 1), one has ∑_{j=1}^n (Xj − X̄)² ∼ χ²_{n−1}. Going just a bit further, one can also notice that this sum of squares and X̄ will be independent RVs.

This follows from Prpn [∗]: just take an orthonormal system b1, . . . , bn in R^n s.t. bn = (n^{−1/2}, . . . , n^{−1/2}) (clearly, ‖bn‖ = 1). Then X̄ = n^{−1/2}(bn, X) and, as X = ∑_{j=1}^n (bj, X)bj, one has

∑_{j=1}^n (Xj − X̄)² = ‖X − (bn, X)bn‖² = ‖∑_{j=1}^{n−1} (bj, X)bj‖²,

which is independent of (bn, X) ∼ N(0, 1) (and has the same distribution as ∑_{j=1}^{n−1} Xj² ∼ χ²_{n−1}) by the Proposition.

This is why, in particular,

X̄ / √[(1/(n−1)) ∑_{j=1}^n (Xj − X̄)²] =d (1/√n) × Xn / √[(1/(n−1)) ∑_{j=1}^{n−1} Xj²],

where the distr'n of the last quotient is called the t-distribution with n − 1 degrees of freedom (used in tests/CIs for means when the variance is unknown).

201
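The claimed independence of X̄ and ∑(Xj − X̄)² for normal samples can be probed by simulation: their sample correlation should sit near 0, and the mean of the sum of squares near n − 1 (a sketch, standard library only; sample sizes are arbitrary):

```python
import math
import random

random.seed(2)
reps, n = 4000, 10
means, ssqs = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    m = sum(x) / n
    means.append(m)
    ssqs.append(sum((xi - m) ** 2 for xi in x))

# Pearson correlation between Xbar and sum_j (X_j - Xbar)^2
ma, sa = sum(means) / reps, sum(ssqs) / reps
num = sum((a - ma) * (b - sa) for a, b in zip(means, ssqs))
den = math.sqrt(sum((a - ma) ** 2 for a in means) * sum((b - sa) ** 2 for b in ssqs))
r = num / den
print(r, sa)   # r near 0 (consistent with independence), sa near n - 1 = 9
```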

9. Further Applications in Statistics

Empirical Distribution Functions and Empirical Processes

Suppose X1, . . . , Xn is an i.i.d. sample, the DF F of the Xj ’s being (at least,

partly) unknown. Recall (PS–6): the vector S := (X(1), . . . , X(n)) of order

statistics for the sample is an SS for F .

An alternative way of representing the information stored in S is the

empirical distribution function (EDF, Def [7.26]; note that we use a

slightly different notation)

F*n(t) := (1/n) ∑_{j=1}^n 1(Xj ≤ t) ≡ ∑_{j=1}^n (1/n) 1(X_(j) ≤ t), t ∈ R.

NB: This is the DF of the random probability distribution P*n := ∑_{j=1}^n (1/n) ε_{Xj} that assigns probabilities 1/n to each of the sample points, and which is well-defined when the Xj's are from a general space (e.g. Xj ∈ R^d etc.).

202
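The EDF is straightforward to implement (a sketch; the function name `edf` and the toy sample are my illustrations, not from the slides):

```python
import bisect

def edf(sample):
    """Return the empirical DF F*_n: t -> (# sample points <= t) / n."""
    xs = sorted(sample)            # the order statistics X_(1) <= ... <= X_(n)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n

F = edf([3, 1, 4, 1, 5])
print(F(0), F(1), F(3.5), F(10))   # 0.0 0.4 0.6 1.0
```

Note that F is right-continuous with jumps of size (multiplicity)/n at the order statistics, exactly as the definition requires.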

It is clear that we can “extract” the values of all the order statistics from the

EDF F ∗n (they are just the jump points of the function), so the EDF (the

whole function, not just its value at some t!) is an SS as well.

Observe that quite a few useful statistics/estimators can easily be expressed via the EDF, e.g. the sample mean

X̄ = ∫ t dF*n(t)

and the sample variance

s² := (1/n) ∑_{j=1}^n (Xj − X̄)² = (1/n) ∑_{j=1}^n Xj² − X̄² = ∫ t² dF*n(t) − (∫ t dF*n(t))²

are the mean and variance of the EDF, and likewise the sample median

m̂ := X_(k) if n = 2k − 1,  m̂ := (1/2)(X_(k) + X_(k+1)) if n = 2k,

is the median of the EDF.

203

NB: Such estimators are often called “plug-in estimators” (or the method of

substitution estimators), as they estimate a parameter θ that can be expressed

as θ = G(F ), G being a (nice) functional (= function of a function), by

plugging the EDF into G in place of F :

θ∗ := G(F ∗n).

Now, if we had

• F ∗n → F (in some suitable sense) as n→∞, and

• G continuous (again, in some suitable sense),

then we would be able to establish consistency: θ∗ = G(F ∗n)→ G(F ) = θ.
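For instance, the plug-in mean and variance are just the mean and variance of the discrete distribution putting mass 1/n at each sample point (a small sketch; the toy numbers are arbitrary):

```python
# Plug-in estimators theta* = G(F*_n): apply the functional to the EDF,
# i.e. to the distribution with mass 1/n at each sample point.
sample = [2.0, 4.0, 4.0, 6.0]
n = len(sample)

mean_plugin = sum(x / n for x in sample)                        # integral of t dF*_n(t)
var_plugin = sum(x * x / n for x in sample) - mean_plugin ** 2  # second moment minus mean^2
print(mean_plugin, var_plugin)   # 4.0 2.0
```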

Q: So does one have F ∗n → F? In what sense?

A: One does. In a very strong sense. Go to the next slide.

204

Thm [7.27] (Glivenko–Cantelli) Let X1, X2, . . . be i.i.d. RVs with a common DF F. Then, as n→∞,

Dn := sup_t |F*n(t) − F(t)| −a.s.→ 0.

I 1) For any fixed t ∈ R, the summands 1(Xj ≤ t) are i.i.d. B(F(t))-RVs, so by the SLLN (for the Bernoulli scheme)

F*n(t) ≡ (1/n) ∑_{j=1}^n 1(Xj ≤ t) −a.s.→ F(t).

2) Hence, for any fixed collection −∞ = t0 < t1 < · · · < td−1 < td = ∞,

Mn := max_{k≤d} |F*n(tk) − F(tk)| −a.s.→ 0.

205

3) Since both F*n and F are non-decreasing, nothing really BAD can happen between the nodes tk: for s ∈ (tk−1, tk] and ∆kF := F(tk) − F(tk−1),

F*n(s) − F(s) ≤ F*n(tk) − F(tk−1) = F*n(tk) − F(tk) + ∆kF,
F*n(s) − F(s) ≥ F*n(tk−1) − F(tk) = F*n(tk−1) − F(tk−1) − ∆kF.

Thus

Dn ≡ sup_t |F*n(t) − F(t)| ≤ Mn + max_{k≤d} ∆kF.

4) Assume for simplicity that F is continuous (if not, no big deal: one just deals with its "large" jumps separately). Then, ∀ε > 0, one can choose d large enough and {tk} s.t. max_{k≤d} ∆kF < ε/2. Now, in view of 2), we also have Mn < ε/2 for all large enough n, and then Dn < ε. Bingo.

This result, as we said, can be used to establish consistency of estimators.

But wait: there is more!

206
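Glivenko–Cantelli in action for F = U[0,1] (a simulation sketch, not from the slides; since F*_n is a step function, the sup is computed at its jump points):

```python
import random

random.seed(3)

def D_n(sample):
    """sup_t |F*_n(t) - t| for the U[0,1] DF, evaluated at the jumps of F*_n."""
    xs = sorted(sample)
    n = len(xs)
    # at the j-th order statistic, F*_n jumps from (j-1)/n to j/n
    return max(max(abs(j / n - x), abs((j - 1) / n - x))
               for j, x in enumerate(xs, start=1))

d_small = D_n([random.random() for _ in range(100)])
d_large = D_n([random.random() for _ in range(10000)])
print(d_small, d_large)   # D_n shrinks as n grows
```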

It is tempting to use Dn for goodness-of-fit testing. To do that, we need the distr'n of Dn, which apparently depends on F. If this is so, it's a mess. But:

Assume that F is continuous and let Q be its quantile function, so that F(Q(t)) ≡ t, t ∈ (0, 1). Then, as we saw (sl. 49), X =d Q(U) for U ∼ U[0, 1], and so we can think of the sequence {Xj} as obtained by Q-transforming an i.i.d. sequence {Uj} with Uj ∼ U[0, 1], so that

1(Xj ≤ t) = 1(Q(Uj) ≤ t) = 1(Uj ≤ F(t)),  F*n(t) = R*n(F(t)),

where R*n(u) := (1/n) ∑_{j=1}^n 1(Uj ≤ u) is the uniform EDF.

This, in particular, implies that

Dn ≡ sup_t |F*n(t) − F(t)| = sup_t |R*n(F(t)) − F(t)| = sup_{u∈[0,1]} |R*n(u) − u|.

Wow! This means that the distribution of Dn is one and the same for all continuous F, which is very handy for goodness-of-fit testing.

But wait: there is more!

207

From the binomial CLT, ∀u ∈ [0, 1],

√n (R*n(u) − u) −d→ V(u) ∼ N(0, u(1 − u)) as n→∞.

From the multinomial CLT (sl. 193), for any 0 ≤ u1 < · · · < ud ≤ 1,

√n (R*n(u1) − u1, . . . , R*n(ud) − ud) −d→ (V(u1), . . . , V(ud)) ∼ N(0, C²(u)),

where u = (u1, . . . , ud) and elementary calculations show that the CovM has the form

C²(u) = [min{uj, uk}(1 − max{uj, uk})]_{j,k=1,...,d}.

In fact, if we consider {√n (R*n(u) − u)}_{u∈[0,1]} as a random process on [0, 1] (it is called the empirical process), then it will converge in distribution (on the space C[0, 1] of continuous functions on [0, 1]) to the so-called Brownian Bridge process {V(u)}_{u∈[0,1]} (a close relative of the Brownian motion process, a.k.a. the Wiener process).

208

So what? It follows (Kolmogorov Thm) that, as n→∞,

√n Dn =d sup_{u∈[0,1]} |√n (R*n(u) − u)| −d→ max_{u∈[0,1]} |V(u)|,

and the good news is that the distribution of the RHS is known:

P(max_{u∈[0,1]} |V(u)| ≤ x) = K(x) := 1 + 2 ∑_{k=1}^∞ (−1)^k e^{−2k²x²}, x > 0.

Therefore lim_{n→∞} P(√n Dn ≤ x) = K(x), and one can use that for goodness-of-fit testing (Kolmogorov test).

Another famous example is the von Mises–Smirnov ω²-test based on the statistic

ω²n := n ∫ [F*n(t) − F(t)]² dF(t) =d ∫₀¹ [√n (R*n(u) − u)]² du.

Again, because of the convergence of empirical processes (in distribution),

lim_{n→∞} P(ω²n ≤ x) = P(∫₀¹ V²(u) du ≤ x), x > 0,

which is known, etc.

209
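The Kolmogorov DF K(x) above is easy to evaluate, as the series converges extremely fast (a sketch; the value ≈ 1.358 used below is the standard tabulated 5% critical point for √n·Dn):

```python
import math

def K(x, terms=100):
    """Kolmogorov DF: K(x) = 1 + 2 * sum_{k>=1} (-1)^k * exp(-2 k^2 x^2), x > 0."""
    return 1 + 2 * sum((-1) ** k * math.exp(-2 * k * k * x * x)
                       for k in range(1, terms + 1))

print(K(1.358))   # close to 0.95: hence the familiar 5% critical value for sqrt(n)*D_n
```

In practice a handful of terms already suffices: the k-th term is of order e^{−2k²x²}.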

Asymptotic Normality & Efficiency of the Maximum Likelihood Estimator

Recall: based on an i.i.d. sample X = (X1, X2, . . . , Xn) with the Xj's having density fθ(x) (w.r.t. some measure µ), we construct the maximum likelihood estimator (MLE) as

θ̂ = θ̂n(X) := arg max_θ fθ(X) ≡ arg max_θ ∏_{j=1}^n fθ(Xj) ≡ arg max_θ L(X, θ),

where L(X, θ) := ln fθ(X) = ∑_{j=1}^n l(Xj, θ) is the log-likelihood function (here l(x, θ) = ln fθ(x), of course).

Ex. If Xj ∼ B(p) then fp(x) = p^x (1 − p)^{1−x}, x = 0, 1 (w.r.t. the counting measure on the integers), so

L(X, p) = ∑_{j=1}^n [Xj ln p + (1 − Xj) ln(1 − p)] = n[X̄ ln p + (1 − X̄) ln(1 − p)]

and 0 = ∂L/∂p = n[X̄/p − (1 − X̄)/(1 − p)] ⇐⇒ p = p̂ := X̄.

210
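A numeric confirmation that the Bernoulli log-likelihood peaks at X̄ (a sketch; the data and the grid resolution are arbitrary illustrations):

```python
import math

data = [1, 0, 1, 1, 0, 1, 0, 1]            # an i.i.d. B(p) sample
n = len(data)
xbar = sum(data) / n

def loglik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

grid = [k / 1000 for k in range(1, 1000)]  # avoid the endpoints 0 and 1
p_hat = max(grid, key=loglik)              # grid maximizer of the log-likelihood
print(p_hat, xbar)   # 0.625 0.625
```

The log-likelihood is strictly concave in p, so the grid maximizer lands exactly on X̄ whenever X̄ is a grid point.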

Ex. If Xj ∼ N(µ, σ²), then fθ(x) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)} (w.r.t. the Lebesgue measure on R), θ = (µ, σ²) ∈ R × R+. Here clearly

L(X, θ) = −(n/2) ln 2π − n ln σ − (1/(2σ²)) ∑_{j=1}^n (Xj − µ)².

Equations for the critical point(s):

0 = ∂L/∂µ = (1/σ²) ∑_{j=1}^n (Xj − µ) ≡ (n/σ²)(X̄ − µ),
0 = ∂L/∂σ = −n/σ + (1/σ³) ∑_{j=1}^n (Xj − µ)² ≡ −(n/σ³)[σ² − (1/n) ∑_{j=1}^n (Xj − µ)²].

Obvious solution (clearly a max): µ̂ = X̄, σ̂² = (1/n) ∑_{j=1}^n (Xj − X̄)².

Ex. What if Xj ∼ U(0, θ) or Xj ∼ U(θ, 1 + θ)? [Pictures.]

211
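For the first of these exercises, a spoiler sketch (my illustration, not the slides' solution): for X_j ∼ U(0, θ) the likelihood θ^{−n} 1(max_j Xj ≤ θ) is decreasing in θ on [max_j Xj, ∞), so the maximum sits at the boundary and no calculus applies.

```python
# MLE for X_j ~ U(0, theta): the likelihood theta^(-n) * 1(max X_j <= theta)
# is maximized at the smallest admissible theta, i.e. at the sample maximum.
data = [0.7, 2.3, 1.1, 3.4, 0.2]
theta_hat = max(data)
print(theta_hat)   # 3.4
```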

In all the examples, the MLEs were nice consistent estimators. Why is this so? Why does the method work? What properties can one expect from the MLEs?

Denote by ϑ the true value of the parameter θ. Then, by the SLLN,

(1/n) L(X, θ) ≡ (1/n) ∑_{j=1}^n l(Xj, θ) −a.s.→ Eϑ l(Xj, θ) as n→∞

(provided that Eϑ l(Xj, θ) is finite; we assume everywhere that all relevant conditions are met, without going into detail).

So one would expect that, for large n (and this is what the word "asymptotic" refers to), the maximum of L(X, θ) will be located close to the point where Eϑ l(Xj, θ) attains its maximum (as a function of θ).

Now where is that point?

212

Thm (Gibbs' inequality) For any two densities f and g (on a common space, w.r.t. a common measure µ),

∫ f(x) ln f(x) µ(dx) ≥ ∫ f(x) ln g(x) µ(dx) (∗)

if both integrals are finite. Here "≥" becomes "=" iff f(x) = g(x) (µ-almost everywhere, i.e. everywhere except for a set of zero µ-measure).

I Using ln u ≤ u − 1 (which holds for all u > 0, with "=" iff u = 1), we have

∫ f(x)(ln g(x) − ln f(x)) µ(dx) = ∫ f(x) ln[g(x)/f(x)] µ(dx)
≤ ∫ f(x)(g(x)/f(x) − 1) µ(dx) = ∫ g(x) µ(dx) − ∫ f(x) µ(dx) = 1 − 1 = 0.

Now "=" holds in (∗) iff we have "=" in the "≤" step above, which, in its turn, holds iff ln[g(x)/f(x)] = g(x)/f(x) − 1 wherever µ "can see it", i.e. g(x)/f(x) = 1 µ-a.e. Bingo.

[Well, one needs to be a bit more careful as f(x) or g(x) can vanish, but it's OK.]

213
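A quick numeric instance of Gibbs' inequality on a three-point space with the counting measure (the two densities below are arbitrary illustrative choices):

```python
import math

f = [0.2, 0.5, 0.3]   # two probability densities w.r.t. the counting measure
g = [0.4, 0.4, 0.2]

lhs = sum(fi * math.log(fi) for fi in f)               # integral of f ln f
rhs = sum(fi * math.log(gi) for fi, gi in zip(f, g))   # integral of f ln g
print(lhs, rhs)   # lhs > rhs strictly, since f differs from g
```

The gap lhs − rhs is exactly the Kullback–Leibler divergence of g from f.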

Thus, taking f = fϑ and g = fθ in Gibbs' inequality, arg max_θ Eϑ l(Xj, θ) = ϑ is the true value of the parameter!!

Ex. Let Xj ∼ P(θ): fθ(x) = e^{−θ} θ^x/x!, x = 0, 1, 2, . . . (w.r.t. the counting measure on Z), so

Eϑ l(X1, θ) = Eϑ(−θ + X1 ln θ − ln(X1!)) = −θ + (EϑX1) ln θ − Eϑ ln(X1!) = −θ + ϑ ln θ − Eϑ ln(X1!).

To find the maximum, we solve (for θ) the usual equation:

0 = (∂/∂θ) Eϑ l(X1, θ) = −1 + ϑ/θ ⇐⇒ θ = ϑ.

Good.

So we expect that the MLE θ̂ will be close to the maximum point ϑ of the function y1(θ) := Eϑ l(Xj, θ):

214

[Picture: the curves y1(θ) = ∫ fϑ(x) ln fθ(x) µ(dx), peaking at ϑ, and y2(θ) = (1/n) L(X, θ), peaking at θ̂n.]

Look: the "sharper" the peak in the curve y1(θ), the closer should be the maxima of the curves yj! The curvature of y1 is given by its 2nd derivative:

(∂²/∂θ²) ∫ fϑ(x) ln fθ(x) µ(dx) = ∫ fϑ(x) (∂²/∂θ²) ln fθ(x) µ(dx)
= ∫ fϑ(x) [f″θ(x)/fθ(x) − (f′θ(x)/fθ(x))²] µ(dx).

[For brevity, we use this convention: ′ = ∂/∂θ.]

215

Now at the point θ = ϑ of the max of y1(θ), this becomes

∫ f″ϑ(x) µ(dx) − ∫ [(f′ϑ(x))²/fϑ(x)] µ(dx) = −I(ϑ),

the first integral being = 0 (do you see why?) and the second one =: I(ϑ). Here I(θ) is the Fisher information that appears in the famous Rao–Cramér (lower) bound^a for estimators' errors: say, for unbiased estimators θ*n (where n indicates the i.i.d. sample size), one has

Eϑ(θ*n − ϑ)² ≥ 1/(nI(ϑ)).

This leads to the following plausible conclusion: the higher the value of I(ϑ), the closer should θ̂ be to ϑ. And this is so indeed:

^a Does that ring a bell?

216

Thm [7.24+7.25] Under some regularity assumptions, the MLE θ̂ = θ̂n is consistent: θ̂n −P→ ϑ as n→∞, and asymptotically normal:

√n (θ̂n − ϑ) −d→ Y ∼ N(0, 1/I(ϑ)).

In fact, convergence here holds for all moments as well:

Eϑ θ̂n = ϑ + o(n^{−1/2}),  Eϑ(θ̂n − ϑ)² = (1 + o(1))/(nI(ϑ)).

Thus one can say that the MLE θ̂n is asymptotically efficient.

217

I We will only give a sketch of the proof.

Consistency. For any ε > 0, one has

An := {|θ̂n − ϑ| > ε} ⊂ {sup_{|u|>ε} L(X, ϑ + u) > L(X, ϑ)} = {sup_{|u|>ε} [(1/n) L(X, ϑ + u) − (1/n) L(X, ϑ)] > 0},

where, by the SLLN,

(1/n) L(X, θ) = (1/n) ∑_{j=1}^n l(Xj, θ) −a.s.→ Eϑ l(X1, θ),

so the two terms in the brackets converge a.s. to Eϑ l(X1, ϑ + u) and Eϑ l(X1, ϑ), resp. Here, by Gibbs' inequality, Eϑ l(X1, ϑ + u) − Eϑ l(X1, ϑ) < 0. No wonder that one can prove that P(An) → 0 as n→∞, which means that θ̂n −P→ ϑ. Too technical for us at the moment, but we have got an idea of why the MLE is consistent.

218

Asymptotic normality. Using Taylor's formula (basically, it's just the mean value theorem for the function L′(X, θ)) at the point θ̂n:

L′(X, ϑ) = L′(X, θ̂n) + (ϑ − θ̂n) L″(X, θ*) = (ϑ − θ̂n) L″(X, θ*),

where θ* is a point between ϑ and θ̂n, and L′(X, θ̂n) = 0 (why?). Thus

(1/√n) ∑_{j=1}^n l′(Xj, ϑ) = √n (θ̂n − ϑ) × (−1) × (1/n) ∑_{j=1}^n l″(Xj, θ*),

where the LHS −d→ Z ∼ N(0, I(ϑ)) due to the CLT and the facts that Eϑ l′(Xj, ϑ) = 0 = ∫ f′ϑ(x) µ(dx) (do you see why?) and Eϑ(l′(Xj, ϑ))² = I(ϑ), while the last average −P→ Eϑ l″(X1, ϑ) = −I(ϑ) by the LLN, since Eϑ l″(X1, ϑ) is finite and θ̂n −P→ ϑ (see sl. 216, 217). Thus

√n (θ̂n − ϑ) −d→ Z/I(ϑ) ∼ N(0, 1/I(ϑ)), bingo!

219
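To close the loop, a simulation probing the asymptotic variance claim for the Bernoulli MLE p̂ = X̄: here I(p) = 1/(p(1 − p)) (a standard computation), so the variance of √n(p̂ − p) should approach p(1 − p) (a rough sketch, standard library only; the parameters are arbitrary):

```python
import random

random.seed(4)
p, n, reps = 0.3, 200, 3000

vals = []
for _ in range(reps):
    phat = sum(random.random() < p for _ in range(n)) / n   # the MLE = sample mean
    vals.append(n ** 0.5 * (phat - p))

var_emp = sum(v * v for v in vals) / reps
print(var_emp)   # should be close to 1/I(p) = p * (1 - p) = 0.21
```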

The End

220

Probability for Inference

Lecture Slides∗

∗Written by K. Borovkov, last time modified on 1 January 2021 (or even later).

0-0

1. Probability Spaces

Random experiment (RE): You should be familiar with the concept, from

the 2nd year level probability course you have done. Anyway, that’s a

(somewhat vaguely defined) concept representing real-life phenomena that:

• have a “mass character” (i.e. could be repeated many-many times, at least

in theory),

• don’t display “deterministic regularity” (i.e. the outcome of any given trial

is uncertain, to the best of our prior knowledge), but

• possess what’s called “statistical regularity”: the relative frequencies

(nA/n) of events one can observe in the experiment stabilize around some

values ∈ [0, 1] as the # of (“independent”) repetitions of the experiment

grows.

It is that last dot-point that makes Probability Theory possible.

Examples (Ex’s): coin tossing; dice rolling; gender of newborn babies etc.

1

Need to specify the outcome of our RE: how?

In mathematics, we have sets = collections of objects.

• Sample space Ω is the set of all possible outcomes.

Ex. Coin tossing. H/T? What if the coin doesn’t fall on its side? What if

where it landed also matters for us? Several different Ω’s are possible (and

can be used under different circumstances).

• So outcomes are elements ω ∈ Ω (a.k.a. elementary events, sample

points, realizations).

• Events: these are subsets A ⊂ Ω for which probability is defined.

NB: in non-trivial cases, there are subsets A ⊂ Ω for which probability

CANNOT be defined. For example, one can partition Ω = [0, 1] into

countably many “identical” sets Ai (they can be obtained from each other

by translations): Ω =

∑

i≥1Ai. Assuming uniform probability distribution

on [0, 1], what would be the probability of A1?

2

Sample spaces

• Ex. Finitely many possible outcomes ⇒ only need a finite Ω. Just list all

the outcomes: Ω = {H,T} ≡ {0, 1} for the experiment of tossing a coin

once; Ω = {1, 2, . . . , 6} for rolling a die once.

Now consider an RE where we toss a coin

roll a die

n = 3 times in a row.

Product spaces come handy. For two sets A and B, their (Cartesian)

product is defined by

A×B := {(a, b) : a ∈ A, b ∈ B}

(the set of all pairs (a, b) s.t. a ∈ A and b ∈ B), and we put

An := A× · · · ×A︸ ︷︷ ︸

n copies of A

= {(a1, a2, . . . , an) : aj ∈ A, j = 1, 2, . . . , n}.

RE with a sample space Ω0 is replicated n times⇒ the sample space of the

composite experiment is Ω = Ωn0 , with ω = (ω1, ω2, . . . , ωn), ωj = outcome

of the jth replication of the basic (sub-)RE with the sample space Ω0.

3

• Ex. Keep tossing a coin till H shows up. Will a finite Ω suffice?

There are countably many possible outcomes that can be represented by

points from the set N = {1, 2, . . . } (ω = j if we observed j − 1 tails

followed by heads).a

• Ex. You have a date with your girl/boyfriend. She/he shows up

ω ∈ R+ = [0,∞) minutes late. Here we need Ω = R+.

One often uses the whole real line R or a subinterval thereof. For a

composite RE consisting of n dates, one can use

Ω = Rn := {(x1, x2, . . . , xn) : xj ∈ R, j = 1, 2, . . . , n}.

aAny set A s.t. there exists a 1–1 mapping f : A→ N is said to be countable (denumerable).

Thus, Z = {. . . ,−1, 0, 1, . . . } is countable. So is N2, but not [0, 1].

4

• If there can be any number of dates (at least, in theory: imagine that if

your girl/boyfriend was less than 20′ late for a given date, you decide to

meet again), we’ll need

RN := {(x1, x2, . . . ) : xj ∈ R, j ∈ N},

the set of real sequences, i.e. all real-valued functions on N.

[Other functional spaces are used, too, e.g. C[0, 1] for the so-called

Brownian motion process.]

• This models a situation where the basic RE can be repeated infinitely

many times. Likewise, in the coin tossing RE in which you are tossing a

coin till you get H for the first time, one can use Ω = {T,H}N.

• BTW: Note that {T,H}N is uncountable (why?). In the last example, you

could actually use a countable space of outcomes — which one?

5

• Probabilities will be assigned to sets of outcomes, i.e. subsets of Ω.

• For a given RE, can one assign probabilities to all subsets of Ω?

• If Ω is countable, the answer is YES.

In the general case, the answer is NO.

• It is impossible in the basic RE of choosing a point at random from [0, 1].

• So we MUST restrain ourselves and consider only some of the subsets

of Ω, chosen in such a way that there will be no problems with assigning

probabilities to them — and these subsets will be called “events”.

• Now how to choose them? One needs to be able to manipulate events,

and, quite naturally, such (admissible!) manipulations should be

producing events. Let’s look at the ways to manipulate events.

6

Ex. Choose a student from a large class. Want the events that the student:

1) is NOT smoking;

2) is a female AND more than 55 y.o.;

3) was born in Australia OR New Zealand;

4) was born in Australia AND is NOT smoking.

These can be expressed in terms of simpler events using set operations:

1) Let Ω := population of all students in class, A := sub-population of

smokers. Take the complement Ac := {ω ∈ Ω : ω 6∈ A}.

2) If B := sub-population of female students, C := those who are > 55 y.o.,

then take the intersection B ∩ C = {ω ∈ Ω : ω ∈ B and ω ∈ C}.

3) If D := students born in Australia, E := students born in New Zealand,

then take the union D ∪ E = {ω ∈ Ω : ω ∈ D or ω ∈ E}.

4) This will be the set difference D \A := D ∩Ac.

7

Note that D∩E = ∅, i.e. the events are disjoint (no common outcomes). Note

also that, in the case of disjoint sets, one often uses D + E to denote D ∪ E.

In fact, it is often more convenient to work with functions rather than sets.

How to “replace” sets with functions? Use indicators (indicator functions):

1A(ω) :=

1, ω ∈ A,0, ω 6∈ A.

To the (main) set operations on events, there correspond the following

operations on indicator functions (draw diagrams & check!!):

1Ac = 1− 1A, 1A∩B = 1A1B , 1A∪B = max{1A,1B}.

For the symmetric set difference A4B := A \B +B \A ≡ (A ∪B) \ (A ∩B),

1A4B = |1A − 1B |.

8

Ex. Express the following events:

Neither A nor B occurred [i.e. 1A + 1B = 0].

I Ac ∩Bc = (A ∪B)c, de Morgan law.

A and B occurred, but C didn’t.

I A ∩B ∩ Cc ≡ (A ∩B) \ C.

Only one of A1, A2, A3 occurred [i.e.

∑3

j=1 1Aj = 1].

I A1 \ (A2 ∪A3) +A2 \ (A1 ∪A3) +A3 \ (A1 ∪A2).

Exactly two out of A1, . . . , A5 occurred [i.e.

∑5

j=1 1Aj = 2].

I

∑

i

k 6=i,j

Ak

.

9

As we said earlier, want to manipulate events, need the resulting sets still be

events. Hence the requirement: the class of events on Ω must be closed under

the main set operations, i.e. the complement of an event should still be an

event, and the union of events must still be an event (the same for

intersections, but this is automatic from de Morgan laws, see below).

To make things work, need a bit more: namely, that we are allowed to take

countable unions (and intersections!). In mathematics, when countable

infinity is allowed/involved, one often uses σ to indicate that.

Def. A family F of subsets of Ω is said to be a σ-algebra on Ω if

(A.1) Ω ∈ F ,

(A.2) A ∈ F ⇒ Ac ∈ F ,

(A.3) A1, A2, . . . ∈ F ⇒

⋃∞

n=1An ∈ F .

In words: the family is closed under complementation and countable union and

intersection. Why the latter? De Morgan + (A.2) + (A.3) + (A.2):⋂∞

n=1An =

[

(

⋂∞

n=1An)

c]c

= [

⋃∞

n=1A

c

n]

c

= [

⋃∞

n=1Bn]

c

, Bn := A

c

n ∈ F .

10

Ex. (continued)

Infinitely many of the events A1, A2, . . . occurred [i.e.

∑∞

j=1 1Aj =∞].

Is this actually an event? [Denoted: An, i.o.]

Why? For instance: will a random walk S0, S1, S2, . . . on Z visit 0 infinitely

many times? Let An := {Sn = 0}.

I

⋂

n≥1

⋃

k≥n

Ak — and this is an event indeed, using (A.2) + (A.3).

Here: ∩ ←→ ∀, “for all”; ∪ ←→ ∃, “there exists”.

Finitely many of the events A1, A2, . . . occurred [i.e.

∑∞

j=1 1Aj <∞].

Why? For instance: in a random walk S0, S1, S2, . . . , let An := {|Sn/n| > ε}

(for a fixed ε > 0). Related to the “strong” LLN.

I

⋃

n≥1

⋂

k≥n

Ack — use de Morgan or apply the same logic as above.

11

One starts modelling an RE by specifying a suitable sample space Ω and then

choosing an appropriate σ-algebra F of subsets Ω. The elements of this

σ-algebra are called events.

NB: Always ∅ ∈ F : indeed, ∅ = Ωc, then use (A.1) + (A.2).

NB: So taking A3 = A4 = · · · = ∅ in (A.3) yields

A1, A2 ∈ F ⇒ A1 ∪A2 ∈ F .

Likewise for any finite union (intersection) of events: still an event. If only

that held instead of (A.3), then F would be called an algebra of sets.

Ex. The trivial σ-algebra: F = {∅,Ω}.

No fun: no uncertain events! All the events we are allowed to look at are: the

impossible event ∅ (it never occurs!) and the certain event Ω (occurs always!).

Ex. The power set P(Ω) := class of all subsets of Ω.

This is often the choice in simple situations with discrete sample spaces.

12

Prm. Suppose Fn, n = 1, 2, . . . , are σ-algebras on a common sample space Ω.

Is F1 ∪ F2 a σ-algebra as well? What about F1 ∩ F2? What about

⋂∞

n=1 Fn?

Of course, there are many different possible choices of F . May wish to

consider, say, a σ-algebra containing a given set A ⊂ Ω. The smallest such

σ-algebra is clearly the so-called σ-algebra generated by A:

σ(A) := {∅, A,Ac,Ω}.

Extending this, let G = {A1, . . . , An} be a finite partition of Ω, i.e. the sets

Ai ⊂ Ω are pairwise disjoint,

∑n

i=1Ai = Ω. Then the σ-algebra generated by G,

i.e. the smallest σ-algebra that contains all the sets Aj , is

σ(G) :=

{∑

i∈I

Ai : I ⊂ {1, 2, . . . , n}

}

.

(Clearly a σ-algebra. Why is it the smallest one containing G?)

Similarly for a countable partition of Ω.

13

In case of the σ-algebra generated by a partition G, it is easy to give

representation for all the elements of σ(G). One can also introduce the concept

of the σ-algebra generated by an arbitrary given family G of subsets of Ω —

but this is less elementary (what about all possible intersections etc?).

Thm [1.12] For any family G of subsets of Ω, there exists a unique

σ-algebra, denoted by σ(G) and called the σ-algebra generated by G, s.t.

1) G ⊂ σ(G), and

2) if H is a σ-algebra on Ω and G ⊂ H, then σ(G) ⊂ H.

That is, σ(G) is the smallest σ-algebra on Ω containing G.

I How to prove such an assertion?? It’s not too difficult. First note that

there are σ-algebras on Ω that contain G: just take P(Ω). So the class of all

σ-algebras on Ω that contain G is non-empty. Now consider the intersection of

all σ-algebras from the class. It will contain G (as each of the σ-algebras

contains it!) and it will be a σ-algebra (as an intersection of such). And it will

be the smallest one with these properties!! (Why?)

14

An important example of a generated σ-algebra is the class B(R) of Borel

subsets of R (a.k.a. the Borel σ-algebra on R):

B(R) := σ{(a, b] : a, b ∈ R, a < b}.

All “reasonable” subsets of R are Borel (e.g. finite and countable subsets, open

intervals, open and closed sets etc.), but B(R) 6= P(R)! [Although giving an

example of a set which is not Borel is a challenge!]

This extends to the multivariate case:

B(Rm) := σ

{

m∏

i=1

(ai, bi] : ai, bi ∈ R, ai < bi

}

.

Here

∏m

i=1(ai, bi] is the Cartesian product of intervals (a “brick”).

Equivalently, B(Rm) is generated by open balls. As in the univariate case, all

reasonable subsets of Rm are Borel.

When Ω = Rm, B(Rm) is the default choice of F . For Ω ⊂ Rm, one takes

F = {Ω ∩A : A ∈ B(Rm)}, the trace of B(Rm) on Ω.

15

Now it’s time to introduce

Probability, from Latin probabilis “provable,” from probare “to try, to test”

(cf. to prove, to probe), from probus “good”.

Probable cause as a legal term is attested from 1676.

Probably is attested from 1535.

Probability is attested from 1551. [Source: http://www.etymonline.com]

Let (Ω,F) be a sample space endowed with a σ-algebra of its subsets (the

couple is called a measurable space).

Def. A probability on (Ω,F) is a function P : F → R s.t.

(P.1) P(A) ≥ 0, A ∈ F ,

(P.2) P(Ω) = 1,

(P.3) for any pairwise disjoint A1, A2, · · · ∈ F ,

P

∞⋃

j=1

Aj

= ∞∑

j=1

P(Aj), “countable additivity”.

16

Def. The triple (Ω,F ,P) is called a probability space.

NB1: P is referred to as a set function (as its argument assumes “values” that

are sets, its domain being F). NB: ω ∈ Ω and {ω} ⊂ Ω are distinct objects!

Note: P(ω) is NO GOOD, P({ω}) is OK.

NB2: On one and the same measurable space, we can have infinitely many

different probabilities. Ex: tossing a (biased) coin (once). In statistics, we do

consider different probabilities on the same measurable space all the time!

NB3: Properties (P.1) and (P.3) specify what’s called a measure. Adding

(P.2), we get a measure of “total mass one”; one often uses “probability

measure” in that case.

NB4: Why was this def’n adopted? It mimics the properties of relative

frequencies of events! Turns out that measure theory is the most natural

framework for formal treatment of probabilities. Very successful, starting with

being able to establish theoretically all the main “statistical laws” observed in

the real world, and first of all — the LLN and CLT.

17

Ex. The point mass (degenerate distribution) at (a fixed point) ω ∈ Ω :

εω(A) := 1A(ω).

NB the difference in interpretation: the LHS is a function of A (for a fixed

outcome ω), whereas the RHS is a function of ω (for a fixed event A).

It models a situation with a deterministic outcome: repeat your RE till you

turn blue, but each time you’ll be seeing one and the same outcome: ω.

Ex. Counting measure on N (or even R ⊃ N):

µ(B) :=

∑

n≥1

εn(B), B ∈ F = P(N).

This is not a probability! The measure counts the number of points in B

(which can be infinite, of course).

18

Ex. Discrete uniform distribution: suppose Ω is finite, F = P(Ω). If all

outcomes ω ∈ Ω are “equally likely”, then they should have the same

probability. Using notation |B| for cardinality of B, just put

P(A) := |A|/|Ω|, A ∈ F

(this is the so-called “classical probability”).

NB: using a version of the counting measure, this can be re-written as

P(A) =

1

|Ω|

∑

ω∈Ω

εω(A).

19

Elementary properties of probability (Thm 1.23):

a) P(∅) = 0.

I Taking A1 = A2 = · · · = ∅ in (P.3), we have P(∅) =

∑∞

n=1 P(∅) — bingo!

b) finite additivity : for any pairwise disjoint A1, . . . , An ∈ F ,

(PF.3) P

n⋃

j=1

Aj

= n∑

j=1

P(Aj).

I Take An+1 = An+2 = · · · = ∅ in (P.3) and use a) — bingo!

NB: in the special case A1 = A, A2 = A

c, we obtain

P(Ac) = 1−P(A).

20

Elementary properties of probability (Thm 1.23): continued.

c) If A ⊂ B (from now on, always assume that A,B, . . . ∈ F), then

P(B \A) = P(B)−P(A).

I Follows from b): take A1 = A, A2 = B \A, then B = A1 +A2.

NB: So probability is non-decreasing: P(A) ≤ P(B) for A ⊂ B.

d) For any events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
(the simplest version of the inclusion-exclusion principle), and so always
P(A ∪ B) ≤ P(A) + P(B), the “subadditivity of probability”.
I As A ∪ B = A + (B \ A1), where A1 := A ∩ B ⊂ B, we can use b) and then c):
P(A ∪ B) = P(A) + P(B \ A1) = P(A) + P(B) − P(A1),
bingo!

21

Subadditivity of prob’ty extends to Boole’s ineq’ty (Propn 1.24): for any
A1, A2, . . . ∈ F,
P(⋃_{j=1}^∞ Aj) ≤ ∑_{j=1}^∞ P(Aj).
I “Disjointification”: let B1 := A1, B2 := A2 \ A1, B3 := A3 \ (A1 ∪ A2) etc:
Bn := An \ ⋃_{j=1}^{n−1} Aj.
Then 1) Bn ⊂ An, n ≥ 1; 2) ⋃_{j≤n} Aj = ⋃_{j≤n} Bj, n ≤ ∞ (prove by
induction); and 3) B1, B2, . . . are disjoint. Hence, using monotonicity of P,
P(⋃_{j=1}^∞ Aj) =[2)] P(∑_{j=1}^∞ Bj) =[(P.3)] ∑_{j=1}^∞ P(Bj) ≤[1)] ∑_{j=1}^∞ P(Aj), bingo!
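The disjointification step can be checked numerically; a minimal sketch on a finite Ω with the classical probability (the three sets A are my own choice):

```python
# "Disjointification" B1 := A1, Bn := An \ (A1 ∪ ... ∪ A_{n-1}):
# the Bn are pairwise disjoint with the same union as the An, which is
# the whole point of the proof of Boole's inequality.
omega = set(range(100))
P = lambda A: len(A) / len(omega)

A = [set(range(0, 40)), set(range(30, 70)), set(range(60, 100))]
B, seen = [], set()
for An in A:
    B.append(An - seen)  # Bn := An minus everything seen so far
    seen |= An

assert set().union(*A) == set().union(*B)  # same union
assert all(B[i].isdisjoint(B[j]) for i in range(3) for j in range(i + 1, 3))
print(P(set().union(*A)), sum(map(P, A)))  # P(union) <= sum of P(An)
```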

22

Natural Q: in our def’n of probability, why can’t we assume finite additivity

(sl. 20) instead of the countable one (our (P.3))?

The direct answer: we would lose important continuity properties.

Notation: An ↑ A as n→∞ ⇔ A1 ⊂ A2 ⊂ A3 ⊂ · · · and ⋃_{n=1}^∞ An = A;
An ↓ A as n→∞ ⇔ A1 ⊃ A2 ⊃ A3 ⊃ · · · and ⋂_{n=1}^∞ An = A.

Thm [1.25] Suppose P : F → R satisfies conditions (P.1), (P.2) and (PF.3).

Then the following are equivalent :

a) P satisfies (P.3) (and hence is a probability).

b) An ↑ A ⇒ P(An) ↑ P(A).

c) An ↓ A ⇒ P(An) ↓ P(A).

d) An ↓ ∅ ⇒ P(An) ↓ 0. [Of course, n→∞ in b)–d).]

I We will show that a) ⇒ b) ⇒ c) ⇒ d) ⇒ a). Obvious: b) ⇔ c) (by

complementation) and c) ⇒ d). So only need to show a) ⇒ b) and d) ⇒ a).

23

a) ⇒ b): If An ↑ A, then ⋃_{j=1}^{n−1} Aj = An−1 and, putting A0 := ∅, the
disjointification procedure from sl. 22 yields Bn = An \ An−1, so
P(A) = P(⋃_{n=1}^∞ An) = ∑_{n=1}^∞ P(Bn) =[c), sl. 21] ∑_{n=1}^∞ [P(An) − P(An−1)]
= lim_{m→∞} ∑_{n=1}^m [P(An) − P(An−1)] = lim_{m→∞} P(Am),
using the “telescoping argument” ((a2 − a1) + (a1 − a0) = a2 − a0 etc).

24

d) ⇒ a): If B1, B2, . . . are disjoint, then An := ∑_{j=n+1}^∞ Bj ↓ ∅ as n→∞.
Indeed, if this is NOT so, then ⋂_{n≥1} An ≠ ∅, i.e. there is a point ω that
belongs to ALL An and hence ω ∈ Bm for some m ≥ 1. Since the Bn’s are
DISJOINT, we must have ω ∉ Am for that m, a CONTRADICTION!
Now
P(∑_{j=1}^∞ Bj) = P(∑_{j=1}^n Bj + An) =[(PF.3)] ∑_{j=1}^n P(Bj) + P(An),
where ∑_{j=1}^n P(Bj) → ∑_{j=1}^∞ P(Bj) and P(An) → 0
as n→∞. We proved (P.3). Bingo!

25

Now we can prove a very simple, but quite important assertion:
(The 1st) Borel–Cantelli Lemma [Thm 1.27]
If ∑_{n=1}^∞ P(An) < ∞, then P(An, i.o.) = 0. [Re i.o., see sl. 11.]
I Using Thm 1.25, c) and Boole’s inequality (sl. 22),
P(An, i.o.) = P(⋂_{n≥1} ⋃_{k≥n} Ak) = lim_{n→∞} P(⋃_{k≥n} Ak) ≤ lim_{n→∞} ∑_{k≥n} P(Ak) = 0.
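The lemma can be seen in simulation; a sketch under my own choice of events, An := {Un < 1/n²} for independent uniform Un, so that ∑ P(An) < ∞:

```python
import random
random.seed(0)

# Events An := {Un < 1/n^2} for independent Un ~ U[0,1]; here
# sum P(An) = sum 1/n^2 < inf, so by Borel-Cantelli only finitely
# many An occur along (almost) every realization.
def last_occurrence(n_max=50_000):
    last = 0
    for n in range(1, n_max + 1):
        if random.random() < 1 / n**2:
            last = n
    return last

results = [last_occurrence() for _ in range(20)]
print(results)  # the index of the last An that occurs stays small
```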

26

2. Probabilities on R

• Always on (R,B(R)).

• Although probabilities are GIVEN ON B(R), one usually SAYS they are

given on R (slang!).

• For probabilities on R, we will use P rather than P, and for a reason:

such probabilities will mostly be “induced” on R by random variables

given on a general “underlying” space (Ω,F ,P), so it is more convenient

to reserve P for probabilities on the general space (Ω,F).

• Anyway, to give a probability on R, one apparently needs to specify all the

values P (A), A ∈ B(R) — but B(R) is HUGE!

• In fact, one doesn’t: there is a much more economical way.

Def. The distribution function (DF, a.k.a. CDF) of a probability P on R is

the function FP : R→ R defined by

FP (t) := P ((−∞, t]), t ∈ R.

27

Prpn [1.32] FP ≡ FP′ ⇔ P ≡ P′
This is a consequence of the fact that σ({(−∞, t] : t ∈ R}) = B(R) (hey,
(a, b] = (−∞, b] \ (−∞, a], cf. sl. 15!). In fact, a much stronger assertion holds:
Thm [1.36] below. So we’ll talk about that later. Meanwhile: to the
characteristic properties of DFs.

Thm [1.33] For any probability P on R, its DF F := FP satisfies:
a) F is non-decreasing: s < t ⇒ F(s) ≤ F(t). Hence, at any point t ∈ R, it has one-sided limits:
F(t−) := lim_{s↑t} F(s), F(t+) := lim_{s↓t} F(s); F(t−) ≤ F(t) ≤ F(t+);
b) F is right-continuous: F(t) = F(t+);
c) lim_{t→−∞} F(t) = 0, lim_{t→+∞} F(t) = 1.

28

I
a) Obvious from the monotonicity of P (sl. 21): as (−∞, s] ⊂ (−∞, t] for s < t, we have
F(s) ≡ P((−∞, s]) ≤ P((−∞, t]) ≡ F(t).
The existence of one-sided limits follows from monotonicity (recall: any
bounded increasing sequence has a finite limit).
b) Obvious from the continuity of P (sl. 23):
Let An := (−∞, tn], tn ↓ t as n→∞.
Then An ↓ A := (−∞, t] and so P(An) ↓ P(A).
c) Obvious from the continuity of P:
Since A′n := (−∞, −n] ↓ ∅ and A′′n := (−∞, n] ↑ R as n→∞,
lim_{n→∞} F(−n) ≡ lim_{n→∞} P(A′n) = P(∅) = 0, lim_{n→∞} F(n) ≡ lim_{n→∞} P(A′′n) = P(R) = 1.
It remains to make use of the monotonicity of F. Bingo.

29

Ex. Point mass and beyond: For P = εs (s ∈ R is a fixed value),
FP(t) = { 0, t < s; 1, t ≥ s } = 1(s ≤ t)
(here 1(C) = 1 if the condition C is met, 1(C) = 0 otherwise).
For P = (1 − p)ε0 + pε1 (do you recognize the Bernoulli distribution B(p)? It’s
a mixture of two point masses),
FP(t) = { 0, t < 0; 1 − p, 0 ≤ t < 1; 1, t ≥ 1 } = (1 − p)1(0 ≤ t) + p1(1 ≤ t).
In the general case, given a DF FP, what is P({t})? Since (t − 1/n, t] ↓ {t} as
n→∞, we have P({t}) = lim_n (FP(t) − FP(t − 1/n)) = FP(t) − FP(t−).
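A minimal numeric sketch of the Bernoulli-mixture DF and of the jump formula P({t}) = F(t) − F(t−), with p = 0.3 as my chosen parameter:

```python
# DF of the Bernoulli mixture P = (1-p)·eps_0 + p·eps_1, p = 0.3,
# and the jump P({t}) = F(t) - F(t-) at the atoms t = 0, 1.
p = 0.3

def F(t):
    return (1 - p) * (t >= 0) + p * (t >= 1)

def jump(t, h=1e-9):
    # numerical stand-in for F(t) - F(t-)
    return F(t) - F(t - h)

print(F(-1), F(0.5), F(2))          # 0, 1-p and 1
print(jump(0), jump(1), jump(0.5))  # 1-p, p and 0: jumps sit at the atoms
```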

30

Thm [1.36] For any F : R → R which satisfies a)–c) from Thm [1.33], there
exists a unique P on B(R) s.t. F ≡ FP.
I This is a rather non-trivial result. Its proof is beyond the scope of our
subject. Just a few words on how it’s done:
Start with putting P((a, b]) := F(b) − F(a) (≥ 0!) for arbitrary a < b.
Next, for A := ⋃_{i=1}^n (ai, bi], −∞ ≤ a1 < b1 < a2 < b2 < . . . < bn ≤ ∞, we put
P(A) := ∑_{i=1}^n (F(bi) − F(ai)).
The collection A of all such A’s is an algebra (verify), and B(R) = σ(A) (obvious).
From the construction it is obvious that P is finitely additive on A, and one
can prove, using b) from Thm [1.33], that it is also countably additive on A.
The last shot is to apply Carathéodory’s extension theorem: a probability
given on an algebra can be uniquely extended to one on the generated
σ-algebra. Bingo.

31

Implications: Can completely specify a probability on R by its DF!
Ex. Consider the function
F(t) = { 0, t < 0; t, 0 ≤ t < 1; 1, t ≥ 1 }.
It clearly satisfies a)–c). So by Thm [1.36], there exists a unique probability
on R with this DF. This probability is called the uniform distribution on [0, 1],
denoted by U[0, 1].
NB: the U[0, 1]-probability of a given set is invariant w.r.t. translations
(provided the set remains within [0, 1]). Likewise, the uniform distribution on
a set B ∈ B(Rn) is a probability invariant w.r.t. translations.
BTW: What is the DF of U[a, b], a < b?
Now we will look at a few important large classes of distributions on R.

32

Discrete Probabilities on R: P(C) = 1 for some countable C ⊂ R.
Prpn [1.39] The following are equivalent:
a) P is discrete.
b) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with ∑_i pi = 1, one has
P = ∑_i pi εti.
c) For some {ti}i≥1 ⊂ R and {pi > 0}i≥1 with ∑_i pi = 1, one has
FP(t) = ∑_i pi 1(ti ≤ t).
[Picture.]
[BTW: can one assume that the ti’s can be ordered, e.g. t1 < t2 < . . .?]

33

I a) ⇒ b): C is countable ⇔ C = {ti}i≥1 ⊂ R. Now for B ∈ B(R),
P(B) = P(B ∩ C) + P(B ∩ Cᶜ), where P(B ∩ Cᶜ) ≤ P(Cᶜ) = 1 − P(C) = 0, so
P(B) = P(B ∩ ∑_{i≥1} {ti}) = P(∑_{i≥1} (B ∩ {ti})) = ∑_{i≥1} P(B ∩ {ti})
= ∑_{i≥1} P({ti}) 1(ti ∈ B) = ∑_{i≥1} pi 1(ti ∈ B) (with pi := P({ti})) =[sl. 18] ∑_{i≥1} pi εti(B).
b) ⇒ c): By def’n, FP(t) ≡ P((−∞, t]) = ∑_{i≥1} pi εti((−∞, t]) = ∑_{i≥1} pi 1(ti ≤ t).
c) ⇒ a): For C := {tj}j≥1 we have
P(C) = ∑_j P({tj}) = ∑_j (FP(tj) − FP(tj−)) = ∑_j pj = 1
since 1(ti ≤ t) is continuous at tj for i ≠ j, so that FP(tj) − FP(tj−) = pj.

34

Absolutely Continuous (AC) Prob’s on R: the ones with densities.
Def. A probability P on R is AC if there exists a f’n fP : R → R+, called the
density (a.k.a. PDF) of P, s.t.
FP(t) = ∫_{−∞}^t fP(s) ds, t ∈ R.
Clearly, this implies that P((a, b]) = ∫_a^b fP(s) ds and fP(t) = F′P(t) (a.e.!).
NB: In most cases, the integral here is our good old friend the Riemann
integral. However, in the general case, it must be understood as the so-called
Lebesgue integral, but we’ll talk about that later.
ANY integrable function f ≥ 0 on R with ∫ f(s) ds = 1 specifies a probability
on R. Indeed,
F(t) := ∫_{−∞}^t f(s) ds
defines a DF on R (i.e. it has properties a)–c) from sl. 28), and so Thm [1.36]
applies.

35

Mixed Distributions: neither discrete nor AC, but mixtures thereof, i.e.

P = pPd + (1− p)Pa,

where Pd is discrete, Pa is AC, p ∈ (0, 1) is a fixed number.

Ex. Waiting times at an ATM: when a customer arrives, either nobody is

using the ATM (w.p. p) or it is in use (plus there may be a queue!) — and

then the customer has to wait for a random time ∼ E(λ), the exponential

distribution with par’r λ (having the density λe−λt, t > 0). Then, using the

total probability formula (rings a bell?), the waiting time has the distribution

P = pε0 + (1− p)E(λ).
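The ATM waiting-time mixture can be simulated directly; a sketch with p = 0.4 and λ = 2 as my chosen parameters:

```python
import random
random.seed(1)

# ATM waiting time W ~ p·eps_0 + (1-p)·E(lam): w.p. p the ATM is free
# (W = 0), otherwise W is exponential with rate lam; EW = (1-p)/lam.
p, lam, n = 0.4, 2.0, 200_000

def sample():
    return 0.0 if random.random() < p else random.expovariate(lam)

ws = [sample() for _ in range(n)]
frac_zero = sum(w == 0.0 for w in ws) / n  # the atom at 0: close to p
mean_w = sum(ws) / n                       # close to (1-p)/lam = 0.3
print(frac_zero, mean_w)
```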

But wait: there is more!

36

Singular Distributions: with a continuous DF, but not AC!

Continuous DF means no point is assigned a positive probability

(otherwise there would be jumps in the DF, sl. 30). So the prob’ty is “spread”

over R, but — there is no density. Singular (deviating from the usual or

expected; odd) indeed.

Ex [1.51] Cantor’s ladder (explained in class).

And this is basically it: any distribution on R is a mixture of three “pure

type” distributions!

Thm [1.52] (Lebesgue’s decomposition) Any probability on R has a unique
representation of the form
P = αdPd + αaPa + αsPs,
where αi ≥ 0, i = d, a, s; ∑ αi = 1; Pd is discrete, Pa is AC, Ps is singular.

37

3. Random Variables (RVs)

“Na¨ıve definition”: RV X is a “function of chance”, i.e. X = X(ω) : Ω→ R

(assuming we are given an underlying probability space modelling our RE).

Not good enough: one usually wants to know probabilities of the RV hitting

some given sets, e.g. X ∈ [a, b]. Therefore the respective set of favourable

outcomes

X−1([a, b]) := {ω ∈ Ω : X(ω) ∈ [a, b]}

called the inverse image (under X) of [a, b], MUST be an event. Hence:

Def [2.3] An RV is a function X = X(ω) : Ω→ R s.t., for any B ∈ B(R),

X−1(B) := {ω ∈ Ω : X(ω) ∈ B} ∈ F

(such functions are called measurable; shorthand: {X ∈ B}).

So, for RVs X, the probabilities P(X ∈ B) are defined for all B ∈ B(R)! Good.

38

In fact, to satisfy Def [2.3], it suffices to have
{X ∈ (−∞, t]} := X−1((−∞, t]) ≡ {ω ∈ Ω : X(ω) ∈ (−∞, t]} ∈ F, t ∈ R. (∗)
Which is kind of nicer: many fewer sets to play with. But why is that so?
Because X−1 preserves all set operations (and disjointness).
Prpn [2.2] Let {Bα : α ∈ I} be an arbitrary family of subsets of R.
a) Bα ⊂ Bβ ⇒ X−1(Bα) ⊂ X−1(Bβ).
b) ⋃_{α∈I} X−1(Bα) = X−1(⋃_{α∈I} Bα) and ⋂_{α∈I} X−1(Bα) = X−1(⋂_{α∈I} Bα).
c) Bα ∩ Bβ = ∅ ⇒ X−1(Bα) ∩ X−1(Bβ) = ∅.
d) X−1(Bαᶜ) = [X−1(Bα)]ᶜ.

I
It’s all next to obvious — a matter of simple logic. For instance, look at d):
what’s on its LHS? The set of all ω’s s.t. X(ω) ∈ Bαᶜ. But this is the same as
the set of all ω’s that don’t have the property that X(ω) ∈ Bα, which is the
complement of the set of all ω’s that do, i.e. [X−1(Bα)]ᶜ. Bingo, right?

39

Now how does this help if we want to show that (∗) is equivalent to Def [2.3]?
Let C be the collection of all sets B ∈ B(R) s.t. X−1(B) ∈ F. By Prpn [2.2],
C will be a σ-algebra on R!! Indeed, R ∈ C since X−1(R) = Ω ∈ F, so (A.1)
holds. Next, B ∈ C ⇔ X−1(B) ∈ F, and as F itself is a σ-algebra, one has
[X−1(B)]ᶜ ∈ F. By Prpn [2.2],d), this means that X−1(Bᶜ) ∈ F, so that
Bᶜ ∈ C, i.e. (A.2) holds for C. Similarly for (A.3). Good.
Now (∗) states that all (−∞, t] ∈ C, t ∈ R, and so the smallest σ-algebra
containing these sets will be part of C (as the latter is a σ-algebra itself):
σ((−∞, t], t ∈ R) ⊂ C.
Hey, but what’s on the LHS? This is B(R)! And by construction C ⊂ B(R).
We conclude that B(R) = C, but this means that Def [2.3] is satisfied.

40

Ex Constants are RVs: for X ≡ c = const, one has {X ≤ t} = { ∅, t < c; Ω, t ≥ c }.
Ex Random indicators. For an event A ∈ F, {1A ≤ t} = { ∅, t < 0; Aᶜ, 0 ≤ t < 1; Ω, t ≥ 1 }.
Ex Simple RVs: X := ∑_{i=1}^n ai 1Ai, where ai ∈ R, Ai ∈ F, i ≤ n < ∞.
One usually assumes here (for convenience) that {Ai}i≤n is a partition of Ω.
This is no big deal: if it’s not so, one can always re-write X in an alternative
form, using a partition, X = ∑_{i=1}^{n′} a′i 1A′i, with the A′i’s of the form
(⋂_{i∈I} Ai) ∩ (⋂_{i∈Iᶜ} Aiᶜ), I ⊂ {1, 2, . . . , n}, which form a partition.
This is an RV: {X ≤ t} = ⋃_{i: ai≤t} Ai ∈ F.

41

Random Vectors: X = (X1, . . . , Xd) : Ω→ Rd s.t. all Xi, i ≤ d, are RVs.

Equivalently, s.t. X−1(B) ∈ F for all B ∈ B(Rd). This extends to “random

elements” in much more general (than Rd) spaces.

Similarly, Z : Ω → C is a complex-valued RV if Z = X + iY, where X and Y
are RVs. [Here, of course, i = √−1.]

42

Prpn [2.9] Given an RV X, σ(X) := {X−1(B) : B ∈ B(R)} is a σ-algebra

on Ω which is called the σ-algebra generated by the RV X.

NB: From the def’n of RV, σ(X) ⊂ F . Usually, it is smaller than F .

I Just verify (A.1)–(A.3) using Prpn [2.2].

(A.1) Ω = X−1(R) ∈ σ(X), good.

(A.2) For A ∈ σ(X) there exists a B ∈ B(R) s.t. A = X−1(B).
Now Aᶜ = [X−1(B)]ᶜ = X−1(Bᶜ) ∈ σ(X), good.
(A.3) Similarly. Bingo!

Ex [2.10] σ(1A) = {∅, A,Ac,Ω}. Follows from Ex on sl. 13.

Ex [2.11] For a simple RV X = ∑_{i=1}^n ai 1Ai, where a1, . . . , an are distinct
and {Ai} is a partition of Ω,
σ(X) = { ∑_{i∈I} Ai : I ⊂ {1, 2, . . . , n} }.

43

Combinations of RVs

The general fact: if X = (X1, . . . , Xd) is a random vector, g : Rd → R is a

continuous function, then g(X1, . . . , Xd) is an RV as well.

NB: This extends to more general functions g (to the so-called measurable f’s).

We’ll skip the proof: a bit beyond the scope. Just look at a couple of simple

special cases. Assume that X and Y are RVs, a, b ∈ R. Then:
a) aX is an RV: {aX ≤ t} = { {X ≤ t/a}, a > 0; {X ≥ t/a} = {X ∈ [t/a, ∞)}, a < 0 }
(and for a = 0, aX ≡ 0 is a constant, hence an RV).
b) aX + bY is an RV: it suffices to prove this for a = b = 1 (from a)). Look:
{X + Y < t} = ⋃_{r∈Q} ({X < r} ∩ {Y < t − r}),
Q being the (countable) set of all rationals (r = m/n, where m, n ∈ Z).
c) XY is an RV. (How to show that? E.g. use XY = ((X + Y)² − X² − Y²)/2.)

44

Distributions and DFs of RVs and RVecs

Def For an RV X on (Ω,F ,P), the set function

PX(B) := P(X ∈ B), B ∈ B(R),

is called the distribution of X. Similarly for random vectors.

Prpn [2.23] PX is a probability on R (or Rd — in the case when X ∈ Rd).

I That PX(B) ≥ 0 and PX(R) = 1 is obvious from the definition. Use

Prpn [2.2] to show countable additivity. Bingo.

The DF of X is the DF of PX : FX(t) := PX((−∞, t]) ≡ P(X ≤ t).

We call X discrete (AC, singular) if FX is discrete (AC, singular, resp.).
If X is AC and fX is continuous at x, then P(X ∈ (x, x + ∆)) = fX(x)∆ + o(∆) =
(fX(x) + o(1))∆ as ∆→ 0. [Recall the meaning of o(1), O(1) etc.]
The survival function (a.k.a. the tail) of X is the function SX(t) = 1 − FX(t).
We say that X, Y are identically distributed (and write X =d Y) iff PX ≡ PY.

45

In the case when X ∈ Rd, by the DF of X we understand the function

FX(t1, . . . , td) := P(X1 ≤ t1, . . . , Xd ≤ td), (t1, . . . , td) ∈ Rd.

As in the univariate case, the distr’n PX is uniquely specified by FX — this

follows from the fact that the orthants

{(x1, . . . , xd) : x1 ≤ t1, . . . , xd ≤ td}, (t1, . . . , td) ∈ Rd,

generate B(Rd) (why?) How to express P(Xi ∈ (ai, bi], i ≤ d) in terms of FX?

There are analogs of Thms [1.33], [1.36] in the multivariate case, although
they are a bit more sophisticated. BTW, what will replace the monotonicity
condition a)? What about limits at the “infinities”?
The def’n of a discrete RVec is the same as for discrete RVs (this holds in
more general spaces as well). An AC distribution has a density fX, satisfying
FX(t1, . . . , td) = ∫_{−∞}^{t1} · · · ∫_{−∞}^{td} fX(s1, . . . , sd) ds1 . . . dsd.

46

Prpn [2.28] X = (X1, . . . , Xd) is discrete iff all Xi, i ≤ d, are discrete.
I ⇒ : If PX(C) = 1 for a countable set C ⊂ Rd, then, for each i,
P(Xi ∈ Ci) = 1, where Ci := {xi : x = (x1, . . . , xd) ∈ C} are countable, too.
⇐ : If, for each j ≤ d, there is a countable set Cj s.t. P(Xj ∈ Cj) = 1, then
P(X ∈ C) = 1 for C := ∏_{j=1}^d Cj, which is countable. Bingo.
Prpn [2.29] If X = (X1, . . . , Xd) is AC, then so is Xj for any j ≤ d, and
fXj(x) = ∫···∫ [(d − 1)-fold] fX(s1, . . . , sj−1, x, sj+1, . . . , sd) ds1 · · · dsj−1 dsj+1 · · · dsd.
I Use FXj(t) = lim_{ti→∞, i≠j} FX(t1, . . . , tj−1, t, tj+1, . . . , td), which leads to
FXj(t) = ∫···∫ [(j − 1)-fold] ∫_{−∞}^t ∫···∫ [(d − j)-fold] fX(. . . , sj−1, x, sj+1, . . .) ds1 · · · dsj−1 dx dsj+1 · · · dsd,
and then change the order of integration (Fubini). Bingo. [Is the converse true?]

47

Some Popular Distributions on R and Beyond

Discrete RVs . . .

AC RVs . . .

RVecs . . . (Singular distributions don’t need to be exotic in this case)

Transformations of RVs

Prpn [2.40] If X is an RV, g an increasing & continuous function on R, with

inverse h := g−1, then the RV Y := g(X) has the DF

FY (t) = FX(h(t)).

I Obvious: FY (t) = P(Y ≤ t) = P(g(X) ≤ t) = P(X ≤ g−1(t)), bingo.

What if g were decreasing?

Thm [2.41] If, in addition to the conditions in Prpn [2.40], X is AC and g

is continuously diff’ble on an open set U ⊂ R s.t. P(X ∈ U) = 1, then Y is

AC, with the density fY (y) = fX(h(y))|h′(y)|. [This allows for decreasing g!]

48

I Obvious: FY(t) = FX(h(t)) = ∫_{−∞}^{h(t)} fX(s) ds; next either differentiate or
change variables (s = h(u)): bingo! [Picture.]
Ex Linear g(x) = ax + b, a > 0 (+ what it does to normal distributions).
Thm [2.43] = extension of Thm [2.41] to RVecs: g : Rd → Rd with an
inverse h, smooth, Jh(y) := det(∂hi/∂yj) is the Jacobian of h [rings a bell?].
If X ∈ Rd is AC, so is Y = g(X), and fY(y) = fX(h(y))|Jh(y)|.
Works nicely for normal distributions!

Prpn [2.47] Let F be a DF on R, Q(x) := inf{t : F (t) ≥ x}, x ∈ (0, 1), its

quantile function, U ∼ U [0, 1]. Then X := Q(U) ∼ F .

I Note that Q is non-decreasing and Q(x) ≤ t ⇔ x ≤ F (t). So

P(X ≤ t) = P(Q(U) ≤ t) = P(U ≤ F (t)) = F (t), bingo!

NB: If F is continuous and X ∼ F , then also F (X) ∼ U [0, 1].

[DIY! What if F is not continuous?]
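Prpn [2.47] is the basis of inverse-transform sampling; a sketch for F the DF of E(λ), where Q(x) = −ln(1 − x)/λ (λ = 1.5 is my chosen parameter):

```python
import random, math
random.seed(2)

# Prpn [2.47] in action: for F the DF of E(lam), the quantile function
# is Q(x) = -ln(1 - x)/lam, so X := Q(U) with U ~ U[0,1] has DF F.
lam = 1.5
Q = lambda x: -math.log(1 - x) / lam

xs = [Q(random.random()) for _ in range(200_000)]

def emp_F(t):  # empirical DF of the simulated sample
    return sum(x <= t for x in xs) / len(xs)

for t in (0.2, 0.5, 1.0):
    print(t, emp_F(t), 1 - math.exp(-lam * t))  # the two DFs agree closely
```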

49

Independent RVs
Def [3.1] RVs X1, . . . , Xn are called independent if, ∀B1, . . . , Bn ∈ B(R),
P(X1 ∈ B1, . . . , Xn ∈ Bn) = ∏_{j=1}^n P(Xj ∈ Bj).
NB: Makes perfect sense when interpreting P’s as relative frequencies!
NB: i.i.d. means “independent identically distributed” (if it refers to an infinite
sequence, it means that any finite subset of the Xj’s is i.i.d.).
How do we know if RVs are independent? Suppose we know their (joint) DF.
Thm [3.3] RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
FX1,...,Xn(t1, . . . , tn) = ∏_{j=1}^n FXj(tj).
I ⇒ Special case: Bj = (−∞, tj].
⇐ Can easily verify for Bj = (aj, bj], and then extend. Trust me, OK? Good.

50

Thm [3.4] Discrete RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
P(X1 = t1, . . . , Xn = tn) = ∏_{j=1}^n P(Xj = tj). (∗)
I ⇒ Again a special case: Bj = {tj}.
⇐ Will prove for n = 2; the same argument works in the general case.
Suppose X and Y are discrete RVs whose joint PMF factorises as in (∗). Then
P(X ∈ A, Y ∈ B) = P(⋃_{ai∈A, bj∈B} {X = ai, Y = bj})
= ∑_{ai∈A} ∑_{bj∈B} P(X = ai, Y = bj) = ∑_{ai∈A} ∑_{bj∈B} P(X = ai) P(Y = bj)
= (∑_{ai∈A} P(X = ai)) (∑_{bj∈B} P(Y = bj)), bingo!

51

A similar criterion holds for AC RVs. Its proof is basically the same: just
replace sums with integrals.
Thm [3.5] AC RVs X1, . . . , Xn are independent iff, ∀t1, . . . , tn ∈ R,
fX1,...,Xn(t1, . . . , tn) = ∏_{j=1}^n fXj(tj).
Ex Uniform distribution on [0, 1]d.
Ex Standard multivariate normal distribution.
NB: If the gj are “nice enough” functions and X1, . . . , Xn are independent RVs, then
so are the RVs Yj := gj(Xj), j = 1, . . . , n. [Kind of common sense, no?] Look:
P(Yj ∈ Bj, j ≤ n) = P(Xj ∈ gj−1(Bj) =: B′j, j ≤ n) = ∏_{j=1}^n P(Xj ∈ B′j) etc.

BTW: Do you remember how to compute the PMF of the sum of two

independent integer-valued RVs? The density in case of AC RVs?
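The PMF of the sum of two independent integer-valued RVs is the discrete convolution P(X+Y = s) = ∑_k P(X = k)P(Y = s − k); a sketch with two fair dice as my chosen example:

```python
from collections import defaultdict

# PMF of the sum of two independent integer-valued RVs: the discrete
# convolution P(X+Y = s) = sum_k P(X = k)·P(Y = s-k).
def convolve(pmf_x, pmf_y):
    pmf = defaultdict(float)
    for k, pk in pmf_x.items():
        for m, qm in pmf_y.items():
            pmf[k + m] += pk * qm
    return dict(pmf)

die = {k: 1 / 6 for k in range(1, 7)}
two_dice = convolve(die, die)  # distribution of the sum of two fair dice
print(two_dice[7])  # 6/36 = 1/6, the most likely total
```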

52

When we say: “Let X1, . . . , Xn be independent RVs with DFs F1, . . . , Fn”,
how do we know that such a thing exists at all??
I Take Ω := Rn, F := B(Rn), define P as the probability on (Rn, B(Rn))
whose DF is given by ∏_{j≤n} Fj(xj), and take Xj(ω) := ωj (coordinate
projections: ω = (ω1, . . . , ωn)) — done!
Alternatively, take Ω := [0, 1]n, F := B([0, 1]n), define P as the uniform
probability on [0, 1]n, and take Xj(ω) := Fj−1(ωj) (quantile functions of the
coordinates) — done!

Constructing infinite sequences of independent RVs is a bit more interesting.

[BTW, what about dependent ones?]

53

Independent Events

Def [3.19] Events A1, . . . , An are called independent if their indicators are

independent RVs.

Equivalently, for any I ⊂ {1, . . . , n},
P(⋂_{i∈I} Ai) = ∏_{i∈I} P(Ai) [= standard def’n].
⇒) The indicators of the Ai, i ∈ I, are independent. Use Def. [3.19], [3.1] for that
subset of events, with Bi := {1} (then {Xi ≡ 1Ai ∈ Bi} = Ai).
⇐) Want: P(⋂_{j=1}^n {1Aj ∈ Bj}) = ∏_{j=1}^n P(1Aj ∈ Bj). Note: {1Aj ∈ Bj} = Ω
if 0, 1 ∈ Bj; = ∅ if 0, 1 ∉ Bj; = Aj if 1 ∈ Bj, 0 ∉ Bj; = Ajᶜ if 0 ∈ Bj, 1 ∉ Bj. Only
care about the last two alternatives: for I ⊂ {1, . . . , n}, show (by induction):
P(⋂_{i∈I} Ai ∩ ⋂_{j∈Iᶜ} Ajᶜ) = ∏_{i∈I} P(Ai) × ∏_{j∈Iᶜ} P(Ajᶜ).

54

NB: Again, makes perfect sense when we interpret probabilities as the

(limiting values of) relative frequencies.

NB: Differs from pair-wise independence [Bernstein’s example].

Cor [3.21] Events A1, . . . , An are independent iff A1ᶜ, . . . , Anᶜ are.

I Prac class exercise.

55

4. Expectations

From 2nd year probability/stats subjects etc.: if X is a discrete RV (countable
set C = {ti} of possible values), then
EX = ∑_i ti P(X = ti). [ = ∫ x fX(x) dx for AC X’s. ]
Why is this called the “expected” (or “mean”) value of X?
Recall the relative frequency interpretation of probability: in n independent
replications of our RE, Xj being the value observed in the jth replication, set
ni := #{j ≤ n : Xj = ti}. Then P(X = ti) ≈ ni/n for large n, so that
X̄n := (1/n) ∑_{j=1}^n Xj = (1/n) ∑_{j=1}^n ∑_i ti 1(Xj = ti)
= (1/n) ∑_i ti ∑_{j=1}^n 1(Xj = ti) [the inner sum = ni] = ∑_i ti (ni/n) ≈ ∑_i ti P(X = ti).
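The relative-frequency reading above can be checked by simulation; a sketch with a discrete RV of my own choice (values 1, 2, 5 with probabilities 0.5, 0.3, 0.2):

```python
import random
random.seed(3)

# The sample mean of n replications stabilizes around
# EX = sum_i t_i P(X = t_i). Here X takes values 1, 2, 5.
values, probs = [1, 2, 5], [0.5, 0.3, 0.2]
EX = sum(t * p for t, p in zip(values, probs))  # = 2.1

n = 100_000
sample = random.choices(values, weights=probs, k=n)
mean = sum(sample) / n
print(mean, EX)  # sample mean is close to EX
```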

56

NB: In both cases, these are just computational rules rather than definitions
of expectation. Need a common one, applicable not only to discrete and AC
RVs, but also to mixtures thereof etc, and s.t. X̄n → EX, n→∞ (LLN!).
Let’s start with indicators: for X = 1A, A ∈ F, set
EX := P(A).
This makes sense, as then X̄n = nA/n. How to proceed?
Expected properties of expectations of general RVs:
• constants are expectations of themselves: EX = c for X ≡ c;
• linearity: E(aX + bY) = aEX + bEY, where a, b ∈ R are constants;
• monotonicity: X ≤ Y ⇒ EX ≤ EY.
Linearity leads to the following
Def [4.1] For a simple RV X = ∑_{i=1}^n ai 1Ai, the expectation is defined as
EX := ∑_{i=1}^n ai P(Ai).

57

NB: This is a consistent def’n: the same result for any representation of X!!
If one can also write X = ∑_{i=1}^{n′} a′i 1A′i, then ∑_{i=1}^n ai P(Ai) = ∑_{i=1}^{n′} a′i P(A′i).
One can see that by looking at the atoms of σ(A1, . . . , An; A′1, . . . , A′n′).
NB: In particular, X ≡ c = c1Ω has expectation EX = cP(Ω) = c, OK!

NB: In particular, X ≡ c = c1Ω has expectation EX = cP(Ω) = c, OK!

Prpn [4.2] Expectation is a linear operation on simple RVs: for simple
X = ∑_{i=1}^n ai 1Ai and Y = ∑_{j=1}^m bj 1Bj, and constants a, b ∈ R,
E(aX + bY) = aEX + bEY.
I Indeed, assuming that {Ai} and {Bj} are partitions of Ω, we have
E(aX + bY) = E ∑_{i=1}^n ∑_{j=1}^m (a ai + b bj) 1_{Ai∩Bj} = ∑_{i=1}^n ∑_{j=1}^m (a ai + b bj) P(Ai ∩ Bj)
= a ∑_{i=1}^n ai ∑_{j=1}^m P(Ai ∩ Bj) + b ∑_{j=1}^m bj ∑_{i=1}^n P(Ai ∩ Bj)
[where ∑_{j=1}^m P(Ai ∩ Bj) = P(Ai) and ∑_{i=1}^n P(Ai ∩ Bj) = P(Bj)]
= aEX + bEY, OK!

58

NB: For a simple RV X ≥ 0, clearly EX ≥ 0. Now monotonicity follows

from linearity: if X ≤ Y then Y −X ≥ 0 and so

0 ≤ E(Y −X) = EY −EX, good!

Now what?

Idea: perhaps can approximate an arbitrary RV by a sequence of simple RVs?

And perhaps the expectations of these simple RVs will tend somewhere?

First look at non-negative RVs. Any such RV can be approximated by an

increasing sequence of simple RVs {Xn}n≥1 in the following sense:

∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.

59

In fact, we can construct such a sequence explicitly: let N := n2^n, and
consider disjoint (for a fixed n) events
An,k := {k2^{−n} ≤ X(ω) < (k + 1)2^{−n}}, k = 0, 1, 2, . . . , N − 1,
An,N := {X(ω) ≥ n}.
Next put
Xn(ω) := { k2^{−n}, ω ∈ An,k, k < N; n, ω ∈ An,N }. [Picture.]
Look: An,k = An+1,2k + An+1,2k+1, k < N, and so Xn+1 ≥ Xn.
Moreover, if we fix ω, then, for n > X(ω), we will have
0 ≤ X(ω) − Xn(ω) ≤ 2^{−n},
so that Xn(ω) ↑ X(ω) as n→∞. It works!! [Even for ω’s with X(ω) = ∞. . . ]
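The dyadic construction above is easy to implement; a sketch tracking one fixed outcome, with X(ω) = π as my chosen value:

```python
import math

# The explicit approximating sequence from this slide:
# Xn = k·2^{-n} on {k·2^{-n} <= X < (k+1)·2^{-n}}, k < n·2^n,
# and Xn = n on {X >= n}.
def X_n(x, n):
    return min(math.floor(x * 2**n) / 2**n, n)

x = math.pi  # the value X(omega) for one fixed omega
approx = [X_n(x, n) for n in range(1, 12)]
assert all(a <= b for a, b in zip(approx, approx[1:]))  # Xn non-decreasing
assert 0 <= x - approx[-1] <= 2**-11  # within 2^{-n} once n > X(omega)
print(approx)
```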

60

Def [4.4] For an arbitrary RV X ≥ 0, we put
EX := lim_{n→∞} EXn,
where Xn ≥ 0 are simple RVs s.t. ∀ω ∈ Ω, Xn(ω) ↑ X(ω) as n→∞.
[We already know that such sequences of RVs do exist!!]
NB: From monotonicity of E, 0 ≤ EX1 ≤ EX2 ≤ EX3 ≤ . . . , so the limit of
this numerical sequence always exists (but can be infinite, of course).
A sober question: How can one use such a def’n? What if for different
sequences {Xn} of simple RVs one can have different values of lim_{n→∞} EXn?
Prpn [4.5] One cannot.
Hence Def [4.4] is consistent: regardless of the choice of {Xn}, the value of
the limit will be one and the same and so can be used to define EX.

61

I A nice exercise: will do it for illustration purposes (typical argument).
• Let both {Xn} and {X̃n} be as in Def [4.4].
We will show that one must have lim_{n→∞} EXn = lim_{n→∞} EX̃n.
• Suppose Y ≤ X is a simple RV, fix an ε > 0 and set An := {Xn > Y − ε}.
Look: since Xn+1 ≥ Xn, one has An ⊂ An+1, and
since Xn ↑ X ≥ Y > Y − ε, one has An ↑ Ω.
Therefore P(An) → 1, equivalently P(Anᶜ) → 0 as n→∞.
• Now Xn ≥ (Y − ε)1An
(on An this holds due to the def’n of An, whereas on Anᶜ the RHS is 0), so
EXn ≥ E(Y − ε)1An = E Y1An − εP(An) ≥ E Y(1 − 1Anᶜ) − ε
= EY − E Y1Anᶜ − ε ≥ EY − max_{ω∈Ω} Y(ω) · P(Anᶜ) − ε → EY − ε,
using that Y1An = Y(1 − 1Anᶜ), P(An) ≤ 1, max Y < ∞ and P(Anᶜ) → 0.

62

• Thus lim_{n→∞} EXn ≥ EY − ε, and as ε > 0 is arbitrary (small), we have
lim_{n→∞} EXn ≥ EY.
• So if we take Y = X̃k for any fixed k (can do! As X̃k ≤ X), this will give
lim_{n→∞} EXn ≥ EX̃k. Therefore
lim_{n→∞} EXn ≥ lim_{k→∞} EX̃k.
By symmetry, we will also have
lim_{n→∞} EX̃n ≥ lim_{k→∞} EXk.
Hence the two limits must coincide! Bingo.

63

So we have got a def’n of EX for arbitrary X ≥ 0. What if X ≷ 0?

Let’s use non-negative RVs X+ = max{X, 0}, X− = −min{X, 0}, noting that

X = X+ −X−, |X| = X+ +X−.

Def [4.12] An RV X is called integrable if E|X| <∞ (and this is often

written as “X ∈ L1”). If X is integrable, its expectation is defined by

EX := EX+ −EX−. (∗)

Expectation of X over an event A is E(X;A) := EX1A.

NB: If E|X| <∞ then both EX± <∞, so (∗) makes sense.

If one of EX+ and EX− is infinite, then one can still use (∗) to define EX

(which will be ±∞, depending on which of EX± is infinite).

If both EX± =∞ then EX is undefined (what is ∞−∞?).

64

Good news: thus defined expectation inherits all the good properties of

expectation of simple RVs.

• Monotonicity: if X ≤ Y and EY <∞, then EX ≤ EY .

I First prove for non-negative RVs, using simple RVs (sim. to Prpn [4.5]).

This implies, in particular, that if |X| ≤ Y and Y ∈ L1 then also X ∈ L1.

• Linearity: if X,Y ∈ L1 and a, b ∈ R, then E(aX + bY ) = aEX + bEY.

I First note that V := aX + bY ∈ L1 since |V | ≤ |aX|+ |bY | = |a||X|+ |b||Y |

and E|X| <∞, E|Y | <∞.

Secondly, establish linearity per se: again using simple functions to

approximate RVs, passing to the limit. We’ll skip it. Not examinable. Good.

Cor [4.14] For X ∈ L1, one has |EX| ≤ E|X|.

I By definition, EX = EX+ − EX−, so
|EX| ≤ |EX+| + |EX−| = EX+ + EX− =[linearity] E(X+ + X−) = E|X|.

65

A natural question: Given that Xn(ω)→ X(ω) as n→∞ (say, for all

ω ∈ A, P(A) = 1), will this imply that EXn → EX as well?

A natural answer: It depends.

Ex. On Ω = (0, 1) with P = U[0, 1], let
Xn(ω) := n1(ω < 1/n).
Obviously, Xn(ω) = X(ω) ≡ 0 for n > 1/ω, so Xn(ω) → X(ω) for all ω ∈ Ω.
However, EXn = nP((0, 1/n)) = n × 1/n = 1 ↛ 0 = EX. No good.
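A simulation sketch of this counterexample (the sample sizes are my choice):

```python
import random
random.seed(4)

# The example above, by simulation: Xn(w) = n·1(w < 1/n) -> 0 for every
# w in (0, 1), yet E Xn = n·(1/n) = 1 for all n.
def EX_n(n, trials=100_000):
    return sum(n * (random.random() < 1 / n) for _ in range(trials)) / trials

print([round(EX_n(n), 2) for n in (1, 10, 100)])  # all near 1, none near 0
```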

Convergence of expectations always holds for:

• monotone sequences of RVs, e.g. Xn(ω) ↑ X(ω);

• “dominated” sequences of RVs, i.e. when |Xn| ≤ Y, where Y ∈ L1

(the dominated convergence theorem, cf. Ex. above).

These results will be discussed later, after we have introduced a.s. convergence.

66

For RVecs and complex-valued RVs, expectations are defined component-wise.
Thus, for Z = X + iY ∈ C, with X, Y ∈ R, we set
EZ := EX + iEY.
And finally: another notation for E using integrals:
EX ≡ ∫_Ω X(ω) P(dω) ≡ ∫_Ω X(ω) dP(ω) ≡ ∫_Ω X dP.
The integral construction that we described is that of the Lebesgue
integral. How is it different from the (conventional) Riemann integral?
Recall: for the Riemann integral, we partition the domain of integration, usu. a part
of R or Rd (how can we do that in more general cases?). However, for the
Lebesgue integral, we partition the range of the integrand, which can be defined
on an abstract set Ω! Hence it’s a much more general def’n. [Ex. Banknotes.]

When integrating “nice” functions on Rd w.r.t. volume measure (dx in the

case of R), both integrals give the same answer.

67

Integrals w.r.t. distributions and DFs
As we said earlier, when dealing with RVs one can often just work with the
distributions thereof, “forgetting” the original underlying (Ω, F, P) and X(ω)
and switching to (Rd, B(Rd), PX). Accordingly, one abandons the general
integrals ∫_Ω X(ω) dP(ω) (that are important, of course, for theoretical
calculations) in favour of the (more practical)
∫ x dP(x) and ∫ g(x) dP(x), P ≡ PX.
The latter integral is often denoted by ∫ g(x) dF(x), F being the DF of P.
Prpn. Assume that X ∼ P is an RV on (Ω, F, P) whose expectation is defined
(i.e., at least one of EX± is finite). Then EX ≡ ∫_Ω X(ω) dP(ω) = ∫_R x dP(x).
That is, expectations are numerical characteristics of the distributions of RVs.
The above claim is kind of a “change of variables” formula. Requires a proof.

68

I We only need to prove the Prpn for X ≥ 0. Recall our approximating
sequence of simple RVs Xn ↑ X (as n→∞) from sl. 60:
Xn = ∑_{k=0}^{n2^n−1} k2^{−n} 1(k2^{−n} ≤ X < (k + 1)2^{−n}) + n1(X ≥ n).
For the RV X∗(x) := x on the prob’ty space (R, B(R), P) one has X∗ =d X, and
X∗n(x) := ∑_{k=0}^{n2^n−1} k2^{−n} 1(k2^{−n} ≤ x < (k + 1)2^{−n}) + n1(x ≥ n)
is an approx’g sequence of simple RVs on (R, B(R), P): X∗n ↑ X∗ as n→∞, and
EXn = ∑_{k=0}^{n2^n−1} k2^{−n} P(k2^{−n} ≤ X < (k + 1)2^{−n}) + nP(X ≥ n)
= ∑_{k=0}^{n2^n−1} k2^{−n} P([k2^{−n}, (k + 1)2^{−n})) + nP([n, ∞)) = EX∗n.
It remains to recall that EX = lim_{n→∞} EXn, EX∗ = lim_{n→∞} EX∗n. Bingo!
Basically the same argument shows that if Y = g(X) ∼ PY, X ∼ PX, then
EY = Eg(X) = ∫ y dPY(y) = ∫ g(x) dPX(x).

69

If F is AC with a “nice” density f = F′ (a.e.) and g is “nice” as well (e.g. both
are piece-wise continuous), then
∫ g(x) dF(x) = ∫_{−∞}^∞ g(x)f(x) dx,
the conventional Riemann integral. Again, this needs to be proved (which is
done by starting with piece-wise constant g and then passing to limits), which
is somewhat beyond the scope of our course.
The above two relations are the basic tools for computing expectations in the
discrete & AC cases. When F is a mixture of the two, use a “mixed formula”.
NB: a mnemonic interpretation of the integral: similarly to
∫ g(x) dx ≈ ∑ (g(x)-values) × (increments of x),
one can (naively) think that
∫ g(x) dF(x) ≈ ∑ (g(x)-values) × (increments of F(x)).

70

Thm [4.23] For X ≥ 0, EX = ∫_0^∞ F̄X(x) dx, where F̄X(x) := 1 − FX(x). [Picture.ᵃ]
For X ≷ 0, EX = −∫_{−∞}^0 FX(x) dx + ∫_0^∞ F̄X(x) dx.
I For X ≥ 0 s.t. P(X ∈ hZ) = 1 for some h > 0, one has
EX = ∑_{k=0}^∞ hk P(X = hk) = h ∑_{k=1}^∞ ∑_{j=1}^k P(X = hk) = h ∑_{j=1}^∞ ∑_{k=j}^∞ P(X = hk)
= h ∑_{j=0}^∞ P(X > hj) = ∑_{j=0}^∞ F̄X(hj) h = ∫_0^∞ F̄X(x) dx. (∗)
For general X ≥ 0, set X′n := ⌊nX⌋/n, X′′n := X′n + 1/n, n ≥ 1, s.t. X′n ≤ X ≤ X′′n
and X′′n − X′n ≤ 1/n. Hence EX′n ≤ EX ≤ EX′′n ≤ EX′n + 1/n andᵇ
EX′n =[(∗)] ∫_0^∞ F̄X′n(x) dx ≤ ∫_0^∞ F̄X(x) dx ≤ ∫_0^∞ F̄X′′n(x) dx =[(∗)] EX′′n ≤ EX′n + 1/n.
Bingo.
ᵃ In particular, for integer-valued X’s, EX = ∑_{n≥1} nP(X = n) = ∑_{n≥1} P(X ≥ n).
ᵇ Obvious: if X ≤ Y then F̄X(x) = P(X > x) ≤ P(Y > x) = F̄Y(x), x ∈ R.

71
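The footnote identity EX = ∑_{n≥1} P(X ≥ n) for integer-valued X ≥ 0 is easy to check by hand. A sketch on a toy pmf of my own choosing:

```python
# For integer-valued X ≥ 0: EX = Σ_{n≥1} n P(X = n) = Σ_{n≥1} P(X ≥ n),
# checked exactly on a small pmf (toy example, not from the slides).
pmf = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.25, 4: 0.15}

mean_direct = sum(k * p for k, p in pmf.items())
mean_tails = sum(sum(p for k, p in pmf.items() if k >= n)
                 for n in range(1, max(pmf) + 1))

print(mean_direct, mean_tails)   # the two expressions agree
```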

Functions of RVs

For the RV Y := g(X) (g "nice" enough),

EY = ∫ y dFY(y) = ∫ g(x) dFX(x),

so in fact we don't need to compute the DF FY to find EY.

We mostly use the following consequences of this: for discrete/AC X's,

Eg(X) = ∑_{ti∈CX} g(ti) P(X = ti),    Eg(X) = ∫ g(x) fX(x) dx.

The same applies to f'ns of RVecs. An important special case: (X1, X2) ∈ R².

72

Cor [4.30] If X1 and X2 are independent RVs and gi(Xi) ∈ L1, i = 1, 2, then

E g1(X1)g2(X2) = E g1(X1) E g2(X2).   (∗)

I First consider gi(x) = 1(x ∈ Bi) for some Bi ∈ B(R):

E g1(X1)g2(X2) = E 1(X1 ∈ B1)1(X2 ∈ B2) = E 1(X1 ∈ B1, X2 ∈ B2)
  = P(X1 ∈ B1, X2 ∈ B2) =[indep'ce] P(X1 ∈ B1) P(X2 ∈ B2)
  = E 1(X1 ∈ B1) E 1(X2 ∈ B2) = E g1(X1) E g2(X2).

Next show that (∗) holds for simple gi(x) = ∑_{j=1}^n a_{i,j} 1(x ∈ B_{i,j}):

E g1(X1)g2(X2) = ∑_{j,k} a_{1,j} a_{2,k} E 1(X1 ∈ B_{1,j})1(X2 ∈ B_{2,k})   [using the above]
  = ∑_{j,k} a_{1,j} a_{2,k} E 1(X1 ∈ B_{1,j}) E 1(X2 ∈ B_{2,k}) = E g1(X1) E g2(X2).

Such simple functions g form a large enough class to approximate general
functions g. The usual limiting procedure works. Good.

73
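Relation (∗) can be verified exactly for small discrete RVs, where all the expectations are finite sums. A sketch with toy pmfs and toy functions g1, g2 of my choosing:

```python
# Exact check of E g1(X1)g2(X2) = E g1(X1) E g2(X2) for independent discrete RVs;
# under independence the joint pmf is the product pX(x)·pY(y).
pX = {0: 0.3, 1: 0.7}
pY = {1: 0.5, 2: 0.5}
g1 = lambda x: x * x + 1.0
g2 = lambda y: 2.0 ** y

lhs = sum(g1(x) * g2(y) * px * py
          for x, px in pX.items() for y, py in pY.items())
rhs = (sum(g1(x) * px for x, px in pX.items())
       * sum(g2(y) * py for y, py in pY.items()))
print(lhs, rhs)   # equal
```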

Special case: Moments

This is when g(x) = x^k.

Def The kth moment of X is EX^k = ∫ x^k dF(x).

Expectation EX = the 1st moment. [Why called 'moments'?]

NB: the expectation is a characteristic of location. Can be BAD.

Def The kth central moment of X is E(X − EX)^k.

Variance Var(X) := E(X − EX)² ≡ EX² − (EX)² = 2nd central moment.

NB: the variance is a characteristic of scale/spread. Can be BAD.

Absolute moments: use | · |, e.g. E|X − EX|^k = kth absolute central moment.

For RVs X with E|X|^p < ∞, we write: X ∈ Lp (p > 0).

74

What are moments good for?

Sometimes they can be easily calculated.

If you know all the moments of X, you’ll know its distribution as well

(under broad conditions — but this is not always so).

There are several very useful inequalities involving moments.

Some of them relate different moments, some of them give bounds for

probabilities in terms of moments.

75

Thm [4.39] (Jensen’s inequality) Let X ∈ L1 and g : R→ R be a convex

function. Then

g(EX) ≤ E g(X). [NB: Cor [4.14] is a special case: g(x) = |x|.]

I For a convex g, for any x0 ∈ R there always exists an a ∈ R s.t.

g(x) ≥ g(x0) + a(x− x0) for all x ∈ R.

Now take x0 = EX and replace x with X:

g(X) ≥ g(EX) + a(X −EX), ω ∈ Ω.

Taking expectations of both sides, using linearity & monotonicity:

E g(X) ≥ g(EX) + aE(X −EX) = g(EX).

Bingo.

76

Cor [4.37] (Lyapunov's inequality) For 0 < r ≤ s,

(E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s}.

NB: this implies, in particular, that if the sth moment is finite, then so is the
rth one. Thus, if the second moment is finite, then the expectation must be
finite, too.

I Letting Y := |X|^r, the inequality can be re-written as

(EY)^{s/r} ≤ E Y^{s/r}.

This is a special case of Jensen's inequality, with g(x) = x^{s/r}, x ≥ 0, which is
convex for s ≥ r. [Can put g(x) = 0 for x < 0, if you wish.] Bingo, right?

77
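Lyapunov's inequality says that p ↦ (E|X|^p)^{1/p} is non-decreasing in p; that is easy to see numerically. A sketch on a toy pmf of my choosing:

```python
# (E|X|^p)^{1/p} should be non-decreasing in p (Lyapunov), checked on a toy pmf.
pmf = {-2.0: 0.25, 0.5: 0.50, 3.0: 0.25}

def abs_norm(p):
    """(E|X|^p)^{1/p} for the toy pmf above."""
    return sum(abs(x) ** p * w for x, w in pmf.items()) ** (1.0 / p)

norms = [abs_norm(p) for p in (0.5, 1.0, 2.0, 3.0, 4.0)]
print(norms)   # non-decreasing
```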

Thm [4.40] (Chebyshev's [Markov's] inequality) If g : R → R is a
positive non-decreasing function, then, for any RV X and number a ∈ R,

P(X ≥ a) ≤ Eg(X)/g(a).

I Since 1(X ≥ a) ≤ g(X)/g(a), we have

P(X ≥ a) = E 1(X ≥ a) ≤ E[g(X)/g(a)] = Eg(X)/g(a).

Bingo. Special cases:

P(|X| ≥ a) ≤ E|X|^p / a^p   for p, a > 0;

P(|X − EX| ≥ a) ≤ Var(X)/a²   for a > 0;ᵃ

P(X ≥ a) ≤ E e^{tX} / e^{ta}   for t > 0.

ᵃ BTW: this gives a '3σ-rule bound' of 1/9 only!

78
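The second special case above can be checked on a small example. A sketch with a toy pmf of my choosing (which, as it happens, attains the Chebyshev bound with equality at a = 1):

```python
# P(|X − EX| ≥ a) ≤ Var(X)/a², checked exactly on a toy pmf.
pmf = {0: 0.2, 1: 0.6, 2: 0.2}
mean = sum(x * p for x, p in pmf.items())                # = 1
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # = 0.4

a = 1.0
true_prob = sum(p for x, p in pmf.items() if abs(x - mean) >= a)
bound = var / a ** 2
print(true_prob, bound)   # true probability never exceeds the bound
```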

Def Mixed moments: EX^k Y^m (+ central, absolute etc.).

Def For X, Y ∈ L2, the covariance of X and Y is

Cov(X,Y ) = E(X − EX)(Y − EY ) ≡ EXY − EX EY.   [Do you see why?]

The correlation between X and Y with Var(X), Var(Y ) > 0 is

Corr(X,Y ) = Cov(X,Y ) / √(Var(X) Var(Y )).   [NB: Both are symmetric!]

BTW: Why did we require that X, Y ∈ L2? This ensures that EXY is finite:

0 ≤ (X ± Y )² = X² + Y² ± 2XY  ⇒  |XY| ≤ ½(X² + Y²), bingo.

[Of course, it also ensures that X, Y ∈ L1, see Cor [4.37].]

If we replace X with X/√EX² and Y with Y/√EY², we'll get

|XY / √(EX² EY²)| ≤ ½ (X²/EX² + Y²/EY²).   (∗)

Taking expectations on both sides leads to the famous

79

Cor [4.36] (Cauchy–Bunyakovsky inequality)ᵃ If X, Y ∈ L2, then
XY ∈ L1 and

E|XY| ≤ √(EX² EY²).

Since |EXY| ≤ E|XY|, for variances/covariances Cor [4.36] means that

|Cov(X,Y )| ≤ √(Var(X) Var(Y )),

and hence that

|Corr(X,Y )| ≤ 1.

Cool. BTW: when do we have "=" in this inequality? Looking at the
derivation of the inequality, one can see that in fact

Corr(X,Y ) = 1 (−1, resp.)  ⇔  X = aY + b, a > 0 (a < 0, resp.),

see sl. 85 for a formal derivation.

ᵃ For some unclear reasons, it's often referred to as the Cauchy–Schwarz inequality.

80
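The "=" case just mentioned is easy to see numerically: for Y = aX + b with a < 0 the correlation should be exactly −1. A sketch; the toy pmf and the coefficients a, b are my own choices:

```python
import math

# For Y = aX + b with a < 0, Corr(X, Y) = −1 exactly (up to rounding).
pmf = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}
E = lambda f: sum(f(x) * p for x, p in pmf.items())

a, b = -3.0, 5.0
Y = lambda x: a * x + b
EX, EY = E(lambda x: x), E(Y)
cov = E(lambda x: (x - EX) * (Y(x) - EY))
corr = cov / math.sqrt(E(lambda x: (x - EX) ** 2) * E(lambda x: (Y(x) - EY) ** 2))
print(corr)   # −1 up to rounding
```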

So correlation = ±1 when there is a perfect linear relationship b/w the RVs.

When the correlation is zero, one says that the RVs are uncorrelated, which

is NOT the same as independent.

NB: IF X and Y are independent, then by Cor [4.30] (sl. 74)

EXY = EX EY,

and hence Corr (X,Y ) = 0, but not the other way around! [Examples?]

But: for Gaussian (X,Y ), it’s the same [look at the density!].

Covariance & correlation have nice (and also quite insightful & useful)

geometric interpretations.

81

Consider X, Y ∈ L2, set X0 := X − EX, Y0 := Y − EY, and look at

Var(X + Y ) = E(X0 + Y0)² = EX0² + EY0² + 2EX0Y0
            = Var(X) + Var(Y ) + 2Cov(X,Y ).

Compare this with: for u, v ∈ R^d,

‖u + v‖² = (u + v, u + v) = (u, u) + (v, v) + 2(u, v) = ‖u‖² + ‖v‖² + 2(u, v).

In fact, Cov(X,Y ) can be interpreted as a scalar product (of X0 and Y0, the
centred versions of X and Y , in the linear space L2 of square-integrable RVs,
so that the norm of X is √EX² — quite similar to the std Euclidean norm
in R^d) and so has the same properties.

82

In particular, |(u, v)| ≤ ‖u‖‖v‖ becomes our |Cov(X,Y )| ≤ √(Var(X) Var(Y )).

Note also that correlation is an analog of

(u, v) / (‖u‖‖v‖) = cos(∠(u, v)).

So uncorrelated RVs are like orthogonal vectors (in R^d: for them, (u, v) = 0
or, which is the same, cos(∠(u, v)) = 0).

In particular, Pythagoras' theorem holds:

Var(X + Y ) = Var(X) + Var(Y )  ⇔  Cov(X,Y ) = 0.

Independent RVs are a special case of uncorrelated RVs (sl. 82), but not the
other way around: uncorrelated RVs don't need to be independent! [Examples?]

83
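One classic answer to the "[Examples?]" prompt above: take X uniform on {−1, 0, 1} and Y := X². Then Cov(X, Y) = EX³ − EX·EX² = 0, yet Y is a function of X, so the two are certainly not independent. A sketch:

```python
# Uncorrelated but not independent: X uniform on {−1, 0, 1}, Y := X².
pmf = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
E = lambda f: sum(f(x) * p for x, p in pmf.items())

cov = E(lambda x: x * x ** 2) - E(lambda x: x) * E(lambda x: x ** 2)

# dependence: P(X = 1, Y = 0) = 0, while P(X = 1) P(Y = 0) = (1/3)(1/3) > 0
p_joint = sum(p for x, p in pmf.items() if x == 1 and x ** 2 == 0)
p_prod = (sum(p for x, p in pmf.items() if x == 1)
          * sum(p for x, p in pmf.items() if x ** 2 == 0))
print(cov, p_joint, p_prod)
```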

Correlation is a measure of linear association rather than independence.

To the geometric fact that

| cos(∠(u, v))| = 1  ⇔  u and v are collinear (i.e. u = av for some a ≠ 0)

there corresponds the following relation:

|Corr(X,Y )| = 1  ⇔  P(Y = aX + b) = 1 for some a ≠ 0, b ∈ R.

[NB: Cov is the scalar product of the centred versions of X and Y .]

I Indeed, assume that Corr(X,Y ) = 1. Then, for the standardized RVs

X1 := (X − EX)/√Var(X),   Y1 := (Y − EY )/√Var(Y ),

we have

E(X1 − Y1)² = EX1² + EY1² − 2Corr(X,Y ) = 1 + 1 − 2 = 0.

Likewise, assuming that Corr(X,Y ) = −1, we'll get E(X1 + Y1)² = 0.

Either way, we have an RV Z = (X1 ∓ Y1)² ≥ 0 with EZ = 0. But this is only
possible when P(Z = 0) = 1, and this is why:

84

Set An := {Z > 1/n}, n = 1, 2, . . . Then, by Markov's inequality
(Thm [4.40], sl. 79),

P(An) = P(Z > 1/n) ≤ EZ/(1/n) = 0.

Further, we clearly have An ⊂ An+1 and An ↑ A := ⋃_{k≥1} Ak ≡ {Z > 0}, so
that by continuity of probability

P(A) = lim_{n→∞} P(An) = 0.

Thus we have proved this:

Corr(X,Y ) = ±1  ⇒  P(X1 ∓ Y1 = 0) = 1,

i.e. w.p. 1 holds

(X − EX)/√Var(X) ∓ (Y − EY )/√Var(Y ) = 0.

This means that Y = aX + b for some a, b ∈ R with a of the same sign as the
correlation Corr(X,Y ). [Picture.] Bingo!

85

When considering d > 2 RVs, i.e. dealing with RVecs X = (X1, . . . , Xd), one
uses covariance matrices (CovMs):

C²X ≡ [C²X(i, j)]_{i,j=1,...,d} := [Cov(Xi, Xj)] = E(X − EX)ᵀ(X − EX).

NB: We are using row vectors x = (x1, . . . , xd) here, so xᵀ is a column and the
matrix product xᵀx = [xi xj] is a d × d-matrix. If we used column vectors, we
would be writing xxᵀ for the same object.

Note that C²X(i, i) = Var(Xi), i ≤ d.

These two are the key properties of CovMs:

(CovM.1) C²X is symmetric: C²X(i, j) = C²X(j, i).

(CovM.2) C²X is positive (or, rather, non-negative) definite: ∀x ∈ R^d,

x C²X xᵀ ≥ 0.

86

(CovM.1) is obvious, as Cov(X,Y ) = Cov(Y,X).

(CovM.2) requires some effort. Setting Y := Xxᵀ ∈ R, we have

0 ≤ Var(Y ) = E(Xxᵀ − EXxᵀ)² = E[(X − EX)xᵀ]²
  = E[(X − EX)xᵀ · (X − EX)xᵀ]
  = E[((X − EX)xᵀ)ᵀ (X − EX)xᵀ]   [as aᵀ = a for a ∈ R]
  = E[x(X − EX)ᵀ(X − EX)xᵀ]
  = x E[(X − EX)ᵀ(X − EX)] xᵀ = x C²X xᵀ, ufff, bingo!

It turns out that any d × d-matrix satisfying (CovM.1) & (CovM.2) is the
CovM of some distribution on R^d (e.g. a d-dimensional normal distribution,
which is our next topic).

87
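(CovM.1) & (CovM.2) can be illustrated numerically: compute C²X exactly for a 2-dim RVec with a toy joint pmf (my choice), then probe the quadratic form x C²X xᵀ with many random vectors x:

```python
import random

# CovM of a 2-dim RVec: symmetric, and x C xᵀ ≥ 0 for every probed x.
joint = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 2): 0.25}
E = lambda f: sum(f(v) * p for v, p in joint.items())

m = [E(lambda v, i=i: v[i]) for i in range(2)]            # mean vector
C = [[E(lambda v, i=i, j=j: (v[i] - m[i]) * (v[j] - m[j])) for j in range(2)]
     for i in range(2)]                                    # covariance matrix

random.seed(0)
quad = lambda x: sum(x[i] * C[i][j] * x[j] for i in range(2) for j in range(2))
worst = min(quad((random.uniform(-1, 1), random.uniform(-1, 1)))
            for _ in range(2000))
print(C, worst)   # C symmetric; worst never (noticeably) negative
```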

Multivariate Normal Distributions

Recall: the standard d-dim normal RVec X = (X1, . . . , Xd) has i.i.d.
components Xj ∼ N(0, 1), and hence its density has the product form:

f(x) = ∏_{j=1}^d (1/√(2π)) e^{−x²j/2} ≡ (2π)^{−d/2} exp{−½ xxᵀ},   x ∈ R^d.

Clearly, EX = 0 ∈ R^d, C²X = I ≡ diag(1, . . . , 1), the identity matrix.

Now consider Y := µ + XA, where µ ∈ R^m, A ∈ R^{d×m}, so that Y ∈ R^m. Then

EY = E(µ + XA) = µ + (EX)A = µ

and

C²Y = E(Y − EY )ᵀ(Y − EY ) = E(XA)ᵀXA = E AᵀXᵀXA = Aᵀ(EXᵀX)A = AᵀIA = AᵀA.

88

Now if m ≤ d and det C²Y ≠ 0 (why is m > d no good? NB: det C²Y > 0 for
non-singular C²Y), then we can use Thm [2.43] to claim that Y has density

fY(y) = (2π)^{−m/2} (det C²Y)^{−1/2} exp{−½ (y − µ)[C²Y]^{−1}(y − µ)ᵀ},   y ∈ R^m.   (∗)

Why? The easiest case is when m = d and det A ≠ 0; we have g(x) = µ + xA
with the inverse h(y) = (y − µ)A^{−1}, so that (∂hi/∂yj) = (A^{−1})ᵀ is constant and

det C²Y = det Aᵀ det A = (det A)²  ⇒  Jh = det A^{−1} = (det A)^{−1} = 1/√(det C²Y).

Remember we said that any matrix satisfying (CovM.1) & (CovM.2) is the
CovM of some distribution (sl. 88)? If B is such a matrix with det B ≠ 0, the
function obtained by replacing C²Y in (∗) with our B will be a (normal)
probability density. Good.

If det B = 0, it will still be the CovM of a normal distribution, but the latter
will be concentrated on a (shifted) linear subspace (and hence not AC).

89
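The construction Y := µ + XA above can be checked by Monte Carlo: the sample mean of Y should be ≈ µ and the sample covariance ≈ AᵀA. A sketch with d = m = 2 and toy choices of µ and A (mine, not from the slides):

```python
import random

# Sample Y = µ + XA for standard normal row vectors X and check EY ≈ µ,
# C²Y ≈ AᵀA empirically.
random.seed(12345)
mu = (1.0, -2.0)
A = [[1.0, 1.0],
     [0.0, 1.0]]
target = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]                     # AᵀA = [[1, 1], [1, 2]]

n = 100_000
ys = []
for _ in range(n):
    x = (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))   # standard normal
    ys.append(tuple(mu[j] + x[0] * A[0][j] + x[1] * A[1][j] for j in range(2)))

mean = [sum(y[j] for y in ys) / n for j in range(2)]
cov = [[sum((y[i] - mean[i]) * (y[j] - mean[j]) for y in ys) / n
        for j in range(2)] for i in range(2)]
print(mean, cov)   # mean ≈ µ, cov ≈ AᵀA
```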

5. Conditional Expectations (CEs)

The expectation of an RV X is a number, a numerical characteristic of X (or,

rather, of PX). We know (cf. Problem 5 from PS–5) that, for X ∈ L2, the

value EX is the best (in mean quadratic) predictor among constants for X :

E(X − a)² −→ min_a   for a = EX.

So, if you don’t know the value of X, but do know PX , your best “educated

guess” for X will be its mean EX.

However, we often do know something about the outcome of our RE, in

which X was also produced, but remains hidden from us. For instance, we

don’t know if patient A has got disease D, but know the results of tests made

on him (here the desired RV is X = 1D); or, X is tomorrow’s price of stock S

which we don’t know yet, but whose history up to now we have observed

(plus, we saw the prices of other stocks in the market as well). In Bayesian

statistics, we assume that parameters’ values are random and observe data;

what do the data tell us about the parameters?

90

In all these situations, the key question is:

How to make the best use of the available information?

CEs provide a powerful tool for doing exactly that.

CEs are not numbers but RVs themselves, and in fact are functions of

the observed RVs (the information we have).

NB: If you condition on an event, which is rarely the case, then you end up

with a number, but even in this case you are actually dealing with a f’n.

First we will consider just this special case: conditioning on an event.

Suppose that all we know about the outcome of our RE is that event A

occurred. Given this information, what would be our best “educated guess”

about the realized value of X?

91

Let's minimise the mean quadratic error on A: set

g(a) := E[(X − a)²; A] = E(X − a)²1A = EX²1A − 2aEX1A + a²P(A),

and solve (for a) the equation

0 = g′(a) = −2EX1A + 2aP(A).

This yields

a = EX1A/P(A) ≡ E(X; A)/P(A) =: E(X |A),

which we are well familiar with from our 2nd year probability course(s).

Interpretation: we just average the values of X over A, ignoring the rest of the

sample space. [Picture.]

NB: Of course, we don’t need X ∈ L2 to use this definition, it suffices to

have X ∈ L1 (i.e. E|X| <∞). But the above derivation provides motivation.

92

Of course, we can do the same for Aᶜ instead of A.

So if we know the value of 1A (telling us if A occurred or not) and need to
provide the best forecast for X (whether A occurred or not), we use

E(X |1A) := E(X |A) ≡ E(X; 1A = 1)/P(1A = 1)   for ω ∈ A,
E(X |1A) := E(X |Aᶜ) ≡ E(X; 1A = 0)/P(1A = 0)   for ω ∈ Aᶜ,

which is an RV!

Next assume that we observed a simple RV Y = ∑_{i=1}^n yi 1Ai, where
{A1, . . . , An} is a partition of Ω, all yi being distinct.

In other words, we know which of the events Ai occurred. Now if Y = yi then
ω ∈ Ai, and we know that in this case the best forecast for X will be given by

X̂ = E(X; Ai)/P(Ai) =: xi,   ω ∈ Ai.

This again specifies an RV!

93

Moreover, since Ai = {Y = yi}, introducing the function h(y) by putting
h(yi) := xi, we see that this new RV is actually a function of Y :

X̂ = h(Y ) =: E(X |Y ),

called the CE of X given the simple RV Y .

This can be thought of as a "crude" version of X obtained by averaging the
values of X over events on which Y assumes one and the same value, the
"atoms" Ai = {Y = yi} of σ(Y ). [Picture.]

NB: Note that the values of Y don’t really matter when defining E(X |Y ).

What matters is the partition generated by Y or, equivalently, σ(Y ).

Ex. Expected weight of inhabitants given the postcode.

NB: Please note that all that we have said re CEs so far (and what we will be
saying below) holds for RVecs Y as well! (We never used the fact that Y ∈ R.)

94

NB: It is obvious that X and X̂ have the same average values on the sets Ai:
by def'n of xi,

E(X̂; Ai) = E X̂1Ai = E xi1Ai = xi P(Ai) = E(X; Ai).

Q: Will this still hold if we replace Ai with AI := ∑_{i∈I} Ai, I ⊂ {1, 2, . . . , n}?

A: Sure. It's like averaging the averages:

E(X̂; AI) = E X̂1AI = E X̂ ∑_{i∈I} 1Ai = ∑_{i∈I} E X̂1Ai = ∑_{i∈I} EX1Ai = · · · = E(X; AI).

Summarising, we see that the CE X̂ = E(X |Y ) has these two properties:

(CE.1) X̂ is "flat" on the "atoms" of σ(Y ), i.e. X̂ is an RV
w.r.t. σ(Y ) (cf. Tutorial Problem 3 from PS–3; one says that
X̂ is σ(Y )-measurable); this holds iff X̂ = h(Y ) for some h, and

(CE.2) E(X̂; A) = E(X; A) for any A ∈ σ(Y ) [i.e. A = {Y ∈ B}].

95

Important: (CE.1) & (CE.2) uniquely specify the CE X̂ = E(X |Y ) in the case
of simple Y 's! Indeed, using (CE.2) with A = Ai and applying (CE.1)
immediately implies that X̂ = xi on Ai, bingo.

Now one can formally define the CE in the case of a general Y using these properties.

Thm [CE] Let X ∈ L1 and Y be RVs on a common probability space. Then
there exists an RV X̂ satisfying (CE.1) & (CE.2). This RV is unique, up to its
values on a set of zero probability, and is called the CE of X given Y , denoted
by E(X |Y ). [Cf. Def [8.12].]

I The proof is based on the Radon–Nikodym theorem from Measure Theory,
and we won't give it here. Good.

Recall: we already said that what really mattered in the def'n of E(X |Y ) in
the case of simple Y 's was the partition generated by Y or, equivalently, σ(Y ).
Likewise in the general case.

96

First, observe that if ϕ is a 1–1 function, then E(X |Y ) = E(X |ϕ(Y )) (why?).
In particular, one always has E(X |Y ) = E(X |Y ³) = E(X | e^Y ) etc., but in
the general case E(X |Y ) ≠ E(X |Y ²) (when will "=" hold?).

Second, observe that we don't really need the RVs Y themselves for
conditioning — we need the information contained in their values. The same
information is also contained in σ(Y ): if we know which events from it
occurred and which didn't, we know the value of Y ! If, for a given
sub-σ-algebra F1 ⊂ F, we replace (CE.1) with the requirement that X̂ is
F1-measurable (i.e. X̂ is an RV w.r.t. F1), then Thm [CE] will still hold true.
The RV X̂ =: E(X | F1) thus introduced is called the CE of X given the σ-algebra F1.

In probability & statistics, it's critically important (i) to be aware of the
general definition of CE, (ii) to know and be able to use the key properties of
CE, and hence be able to effectively use this powerful tool, and (iii) to know
how to compute CEs in important special cases.

97

Ex. Poisson sums: We know (from our 2nd year probability subject or
otherwise) that, for independent X ∼ P(λ) and Y ∼ P(µ), the sum is also
Poisson: Z := X + Y ∼ P(λ + µ). Show that E(X |Z) = (λ/(λ+µ))Z.

[Motivation: Suppose we know the total number NT of jumps in a Poisson
process during the time interval [0, T ]. What can we say about Nt for t < T?]

(CE.1) is obvious ((λ/(λ+µ))Z is a function of Z, right?).

(CE.2) To verify this property, we write, for any k ≥ 0,

E(X; Z = k) = EX1(Z = k) = ∑_{i≥0} ∑_{j≥0} i 1(i + j = k) P(X = i, Y = j)
  [the indicator = 1 iff j = k − i]
  = ∑_{i=0}^k i P(X = i, Y = k − i) = ∑_{i=0}^k i (λ^i e^{−λ}/i!)(µ^{k−i} e^{−µ}/(k − i)!)
  = e^{−(λ+µ)} ∑_{i=0}^k (i λ^i/i!)(µ^{k−i}/(k − i)!)

98

  = e^{−(λ+µ)} λ ∑_{i=1}^k λ^{i−1} µ^{k−i} / ((i − 1)!(k − i)!)   [letting l := i − 1]
  = e^{−(λ+µ)} (λ/(k − 1)!) ∑_{l=0}^{k−1} ((k − 1)!/(l!(k − 1 − l)!)) λ^l µ^{k−1−l}
  = e^{−(λ+µ)} λ(λ + µ)^{k−1}/(k − 1)!.

On the other hand,

E((λ/(λ+µ))Z; Z = k) = (λk/(λ+µ)) P(Z = k) = (λk/(λ+µ)) · ((λ + µ)^k/k!) e^{−(λ+µ)},

which is the same! Bingo.

What if the question was not to show that, but rather to compute (without
knowing the answer in advance)? That is, what about (iii)?

The standard route is to first find the conditional distribution and then
compute the expectation for it (re how this is done in our Ex, see Tutorial
Problem 1 in PS–6).

99
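The two sides computed above can be matched numerically for each k: E(X; Z = k) should equal E(λZ/(λ+µ); Z = k) exactly. A sketch; the λ, µ values are my toy choice:

```python
import math

# E(X; Z = k) vs E(λZ/(λ+µ); Z = k) for independent X ~ P(λ), Y ~ P(µ).
lam, mu = 1.3, 2.1
pois = lambda t, k: math.exp(-t) * t ** k / math.factorial(k)

def lhs(k):
    # E(X; Z = k) = Σ_{i=0}^{k} i P(X = i) P(Y = k − i)
    return sum(i * pois(lam, i) * pois(mu, k - i) for i in range(k + 1))

def rhs(k):
    # E(λZ/(λ+µ); Z = k) = (λk/(λ+µ)) P(Z = k), with Z ~ P(λ+µ)
    return lam * k / (lam + mu) * pois(lam + mu, k)

gap = max(abs(lhs(k) - rhs(k)) for k in range(15))
print(gap)   # rounding error only
```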

OK, what is the conditional distribution?

Recall: for any event A, one has P(A) = E1A. Likewise, one defines
conditional probabilities (cf. Def [8.18]) by setting

P(A |Y ) := E(1A |Y ).

Distributions are collections of probabilities of the form PX(B) ≡ P(X ∈ B),
B ∈ B(R). And so conditional distributions are defined by letting

gB(Y ) := P(X ∈ B |Y ) := E[1(X ∈ B) |Y ]

and then setting PX|Y(B | y) := gB(y).

In fact, it's a bit more tricky than just that. Our Thm [CE] on sl. 97 claims,
for a given RV Z, the existence of g(Y ) := E(Z |Y ) and its uniqueness up to
events of null probability. Now we have a family of RVs {gB(Y ), B ∈ B(R)},
and we kind of want it to be a distribution (in B) for any outcome!

That is, for any y ∈ R, our gB(y) should be a probability (a set function in B).

100

Well, it turns out such a thing does exist. Always. And we call it the
conditional distribution of X given Y .

When Y is discrete, we just compute P(X ∈ B |Y = yi) (provided that
P(Y = yi) > 0, cf. Thm [8.37]); it's all nice & simple.

When (X,Y ) is AC, one uses the conditional density (cf. Def [8.38])

fX|Y(x|y) := f(X,Y)(x, y) / fY(y),   fY(y) = ∫ f(X,Y)(x, y) dx.

Then (cf. Thm [8.39])

P(X ∈ B, Y ∈ B′) = ∫_{B×B′} f(X,Y)(x, y) dx dy = ∫_{B×B′} fX|Y(x|y) fY(y) dx dy
  = ∫_{B′} [∫_B fX|Y(x|y) dx] fY(y) dy = E[gB(Y ); Y ∈ B′],

where gB(y) := ∫_B fX|Y(x|y) dx, so that gB satisfies the def'n of conditional
probability.

101

So, to compute E(X |Y ) when (X,Y ) is AC, one first finds the conditional
density fX|Y(x|y). Then one computes

g(y) := ∫ x fX|Y(x|y) dx

(sometimes denoted by E(X |Y = y)) and lets

E(X |Y ) := g(Y ).

Ex Uniform distribution on D := {(x, y) ∈ R² : x ≥ 0, y ≥ 0, x + y ≤ 1}.

Ex Normal distribution.

Quite often one can use the general properties of CEs to make the
computation feasible. Our exposition will be somewhat different from the one
presented in Section 8.3.2 of the text.

102
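For the first Ex above, the computation can be sketched numerically. The following is my worked computation, not from the slides: for (X, Y) uniform on D one has f(x, y) = 2 on D, so fY(y) = 2(1 − y) for y ∈ [0, 1], fX|Y(x|y) = 1/(1 − y) on [0, 1 − y], and hence E(X | Y = y) = (1 − y)/2:

```python
# Numeric evaluation of g(y) = ∫ x f_{X|Y}(x|y) dx over [0, 1 − y] for the
# uniform distribution on the triangle D; should match (1 − y)/2.
def g(y, n=10_000):
    """Midpoint-rule evaluation of the conditional-mean integral."""
    h = (1.0 - y) / n
    return sum((i + 0.5) * h * (1.0 / (1.0 - y)) * h for i in range(n))

checks = [(y, g(y), (1.0 - y) / 2.0) for y in (0.0, 0.25, 0.5, 0.9)]
print(checks)   # g(y) matches (1 − y)/2
```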

Properties of CEs

(CEP.1) Linearity: for constants a, b ∈ R,

E(aX + bZ |Y ) = aE(X |Y ) + bE(Z |Y ).

[NB: all relations of this kind, involving CEs, are understood a.s. (= almost
surely), which means they hold up to an event of zero probability.]

I Indeed, the RHS is a function of Y , so (CE.1) holds. To verify (CE.2), we
use the def'n of CE and the linearity of expectation:

E[RHS; Y ∈ B] = aE[E(X |Y ); Y ∈ B] + bE[E(Z |Y ); Y ∈ B]
  = aE(X; Y ∈ B) + bE(Z; Y ∈ B) = E(aX + bZ; Y ∈ B), good.

103

(CEP.2) Monotonicity: if X ≤ Z a.s., then also E(X |Y ) ≤ E(Z |Y ) (a.s.).

I Indeed, if it were NOT so, we would have (from linearity)

E(Z − X |Y ) = E(Z |Y ) − E(X |Y ) < 0

with positive probability. The LHS is a function of Y (by def'n), say, h(Y ),
and so, for B := h^{−1}((−∞, 0)),

{ω ∈ Ω : E(Z − X |Y ) < 0} ≡ {ω ∈ Ω : h(Y ) < 0} = {ω ∈ Ω : Y ∈ B}.

Now by (CE.2),

E(h(Y ); Y ∈ B) = E(Z − X; Y ∈ B).

Look: the integrals are over the set {Y ∈ B} of positive probability, and the
integrand on the LHS is < 0, whereas the one on the RHS is ≥ 0.
Contradiction.

104

(CEP.3) If Z = g(Y ), then

E(ZX |Y ) = ZE(X |Y ).

That is, functions of Y behave like constants when one conditions on Y .

I Indeed, the RHS is a function of Y , so (CE.1) is fine.

Re (CE.2): first consider the case Z = 1{Y ∈C} for some C ∈ B(R). Then

E(ZX; Y ∈ B) = EX1{Y ∈C}1{Y ∈B} = E(X; Y ∈ C ∩ B),

whereas

E[ZE(X |Y ); Y ∈ B] = E[E(X |Y ); Y ∈ C ∩ B] =(CE.2) E(X; Y ∈ C ∩ B),

which is the same. Next we verify the property for simple RVs Z, then look at
their limits. Good.

105

(CEP.4) If X is independent of Y , then

E(X |Y ) = EX.   (∗)

I Indeed, (CE.1) is met, as a constant (= EX) is a function of Y .

Secondly, X and 1(Y ∈ B) are also independent, and so by Cor [4.30] (sl. 74)

E(X; Y ∈ B) ≡ E[X1{Y ∈B}] = EX E1{Y ∈B} = E[(EX)1{Y ∈B}] = E(EX; Y ∈ B),

so that (CE.2) is also met.

In particular, if Y = const, then (∗) always holds.

106

(CEP.5) The double expectation law (a.k.a. the "tower property"):

E[E(X |Y1, Y2) |Y1] = E(X |Y1).

I Oh well, (CE.1) is obvious. To show (CE.2):

E[E(X |Y1, Y2); Y1 ∈ B] = E[E(X |Y1, Y2); (Y1, Y2) ∈ B × R]
  =(CE.2) E(X; (Y1, Y2) ∈ B × R) = E(X; Y1 ∈ B) =(CE.2) E[E(X |Y1); Y1 ∈ B],

bingo.

In particular, taking Y1 = const, we see that

E[E(X |Y )] = EX

(but that was obvious from (CE.2): just take B = R there).

This relation is actually a form of the total probability law, which is a very
useful thing!

107
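The special case E[E(X |Y)] = EX can be verified exactly on a small example. A sketch with a toy joint pmf of my choosing:

```python
# Exact check of E[E(X |Y)] = EX on a toy joint pmf.
joint = {(0, 0): 0.2, (1, 0): 0.3, (2, 1): 0.1, (3, 1): 0.4}   # keys are (x, y)

pY = {}
for (x, y), p in joint.items():
    pY[y] = pY.get(y, 0.0) + p

# E(X | Y = y): average of X over the "atom" {Y = y}
cond_mean = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / pY[y]
             for y in pY}

lhs = sum(cond_mean[y] * p for y, p in pY.items())   # E[E(X |Y)]
rhs = sum(x * p for (x, y), p in joint.items())      # EX
print(lhs, rhs)   # equal
```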

One of the great features of CE is that, for X ∈ L2, the CE X̂ := E(X |Y ) is
the best (in m.q.) forecast for X from Y . Let's prove that!

For an RV Z = h(Y ), consider

E(X − Z)² = E[(X − X̂) + (X̂ − Z)]² = E(X − X̂)² + 2E(X − X̂)(X̂ − Z) + E(X̂ − Z)².

Use (CEP.5) to evaluate the middle term: the mixed moment equals

E E[(X − X̂)(X̂ − Z) |Y ] =(CEP.3) E[(X̂ − Z) E[(X − X̂) |Y ]]
  =(CEP.1) E[(X̂ − Z)(E(X |Y ) − E(X̂ |Y ))] = E[(X̂ − Z)(X̂ − X̂)] = 0,

since E(X |Y ) = X̂ and E(X̂ |Y ) = X̂ by (CEP.3). So

E(X − Z)² = E(X − X̂)² + E(X̂ − Z)² −→ min over Z = h(Y ) at Z ≡ X̂, as then
the second (non-negative) term = 0, while the first one doesn't depend on Z.

[Again note: can have Y ∈ R^d here. Projection interpr'n. Linear vs general f'cast.]

108
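This minimisation can be seen exactly on a small example: on a toy joint pmf (my choice), the conditional mean X̂ = E(X |Y) beats any other function h(Y) in mean square:

```python
# The conditional mean minimises E(X − h(Y))² over functions h of Y.
joint = {(0, 0): 0.2, (1, 0): 0.3, (2, 1): 0.1, (3, 1): 0.4}   # keys are (x, y)

pY = {}
for (x, y), p in joint.items():
    pY[y] = pY.get(y, 0.0) + p
xhat = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / pY[y]
        for y in pY}                              # X̂ as a function of y

mse = lambda h: sum((x - h[y]) ** 2 * p for (x, y), p in joint.items())
best = mse(xhat)
other = mse({0: 1.0, 1: 2.0})    # some other forecast h(Y), my arbitrary pick
print(best, other)   # best ≤ other
```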

The final comment in the section on CEs:

That X̂ = E(X |Y ) minimises the mean quadratic distance to X among all
RVs that are functions of Y also has a simple geometric interpretation.

In mathematics, an operator R : L → L (L is a linear space, e.g. R^n or L2) is
called a projection if R² = R, i.e. R(Rx) = Rx for any x ∈ L. [Picture.]

Let LY = the space of all RVs Z ∈ L2 that are f'ns of a given RV Y : Z = h(Y )
for some h. Then the operator R(·) = E(· |Y ) is a projection onto LY:

X̂ = R(X) ≡ E(X |Y ) =(CE.1) h(Y ), and so R(R(X)) ≡ E(X̂ |Y ) =(CEP.3) X̂.

The geometry of L2 is Euclidean, as in R^n (it's based on a scalar product), so
it shares a lot of properties, e.g. Pythagoras' thm. In particular, the projection X̂
and the "error" X − X̂ will be orthogonal in L2 (≡ uncorrelated):

E X̂(X − X̂) =(CEP.5) E E[X̂(X − X̂) |Y ] =(CEP.3) E[X̂ E(X − X̂ |Y )] = 0,

since E(X − X̂ |Y ) = E(X |Y ) − E(X̂ |Y ) = X̂ − X̂ = 0.

109

6. Some Applications to Statistics

First a few words on the relationship b/w Probability Theory (PT) and

Mathematical Statistics (MS).

In PT: knowing the nature of a random phenomenon, we derive the

distributions of the characteristics of the phenomenon (that one can usually

observe in the RE). [Our knowledge is built into the mathematical model of

the RE: (Ω,F ,P) etc.] One can call this a direct problem.

Ex. Knowing the composition of the general population, what can one say

about the composition of a random sample from that population?

In MS, we deal with an inverse problem , e.g. in our Ex, given the observed

composition of the random sample, what can we say about the general

population? In the general situation, we have a mathematical model of the

RE: (Ω,F ,P) etc., where P is (at least, partially) unknown, observe the value

of an RVec resulting from our RE, and then use PT to make inference

concerning P on the basis of this data.

110

Later on, we will discuss application of the key results of PT, its limit

theorems, to MS. Now we will briefly discuss an important application of the

concept of CE in the context of parameter estimation. [You must have seen

elements of that in 2nd year stats subjects — in case you did them, of course.]

Our model for observed data:

• There is an underlying RE, modelled by (Ω,F ,Pθ), where Pθ is a

probability depending on parameter θ ∈ Θ ⊂ Rd, whose value we don’t (but

want to) know. Thus we have a family of suspects P := {Pθ}θ∈Θ and need to

point at the one that would fit the observed data best (in some sense).

• We observe an RVec X = X(ω) ∈ Rn. Denote by Pθ the distribution of X

on (Rn,B(Rn)) induced by Pθ. One can often identify (Ω,F ,Pθ) with

(Rn,B(Rn), Pθ) (may be awkward if one analyses “large sample” situations,

when n→∞), and then Ω = Rn does become the sample space.

111

Sufficient Statistics

You may remember some elements of this stuff from 2nd year stats.

Def. Any (measurable) function S = S(X) of the observation X is called a

statistic in our sampling experiment.

Estimators of the unknown parameter θ are statistics θ∗ = θ∗(X) (i.e. just
functions of the sample X assuming values in the parameter space Θ);
statistical tests are statistics δ∗ = δ∗(X) (assuming values 1 and 0 when
testing a single hypothesis H0 vs the alternative H1: δ∗ = 1 means that we
reject H0; for randomized tests, δ∗ is the probability of rejecting H0; similarly
in the case of multiple hypotheses).

Clearly, a statistic S is also an RV (or RVec), and so one can talk about the
conditional distribution Pθ(X ∈ B|S), B ∈ B(R^n), cf. sl. 82.

112

Def. [R. A. Fisher, 1922] A statistic S = S(X) is called sufficient (an SS) for
parameter θ if the conditional distribution Pθ(X ∈ B|S) doesn't depend on θ.ᵃ

NB: if ϕ is a 1–1 function, then S1 = ϕ(S) is also an SS for θ (cf. sl. 98).

Being a CE given S, Pθ(X ∈ B|S) is a function of S, so there exists a function
P(B|s), B ∈ B(R^n), s ∈ R^d, s.t.

Pθ(X ∈ B|S) = P(B|S).

Interpretation: P(B|s) is the cond'l distribution of the sample X given S = s.

Meaning: For a given SS S, if you know that the sample point X is on the
"surface" {x ∈ R^n : S(x) = s}, any further information re where on this
surface the point X is tells you nothing about the value of θ (as the location
of the point then follows one and the same distribution for all θ) — and so
this information is irrelevant to the estimation problem. [Picture: N(0, σ²).]

ᵃ More formally: there exists a version of the conditional distribution which is independent
of θ, cf. slides 102, 103. But we don't bother about such nuances too much here.

113

Ex. Let X = (X1, . . . , Xn) be an i.i.d. sample, Xi ∼ P(λ).

Recall: for independent X ∼ P(λ) and Y ∼ P(µ), we have X + Y ∼ P(λ + µ),
and the conditional distribution of X given X + Y = m is binomial
B(m, λ/(λ+µ)). Hence that of Xj given S := X1 + · · · + Xn = m is B(m, 1/n).

Perhaps S is a sufficient statistic for λ? Verify: fix an integer s ≥ 0; then, for
x = (x1, . . . , xn), one clearly has Pλ(X = x |S = s) = 0 if ∑_{j=1}^n xj ≠ s, while
if the sum = s we have

Pλ(X = x |S = s) = Pλ(X = x)/Pλ(S = s) = [∏_{j=1}^n e^{−λ} λ^{xj}/xj!] / [e^{−nλ} (nλ)^s/s!]
  = s! λ^{∑xj} / ((nλ)^s ∏ xj!) = ((x1 + · · · + xn)!/(x1! · · · xn!)) (1/n)^{x1} · · · (1/n)^{xn},

the multinomial distr'n with s = ∑_{j=1}^n xj independent trials and n equally
likely outcomes. Thus S = nX̄ is an SS for λ. Good.

114
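The sufficiency computation above can be sketched numerically: Pλ(X = x |S = s) should not depend on λ and should equal the multinomial probability. The sample point x and the λ values below are my toy choices:

```python
import math

# Pλ(X = x | S = s) for an i.i.d. Poisson sample: λ-free and multinomial.
def cond_prob(x, lam):
    n, s = len(x), sum(x)
    px = math.prod(math.exp(-lam) * lam ** xi / math.factorial(xi) for xi in x)
    ps = math.exp(-n * lam) * (n * lam) ** s / math.factorial(s)
    return px / ps

x = (2, 0, 3, 1)
vals = [cond_prob(x, lam) for lam in (0.5, 1.0, 4.0)]

n, s = len(x), sum(x)
coeff = math.factorial(s)
for xi in x:
    coeff //= math.factorial(xi)
multinomial = coeff * (1.0 / n) ** s
print(vals, multinomial)   # all equal
```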

NB: What a great reduction of data! It turns out that all the information
about θ contained in the whole sample is "stored" in a single value, that of S.
More on that later. Now: how to find SSs?

The most natural formulation of the main result here uses the concept of
density.

Recall: a distribution P on R^n has a density f if, for any Borel set B ⊂ R^n,

P(B) = ∫_B f(x) dx ≡ ∫ · · · ∫ f(x)1B(x) dx1 · · · dxn.   (∗)

But we also introduced integrals of a more general form (sl. 68, 69):

∫_B f(x) dQ(x),   Q a probability.

Basically the same def'n works for integrals w.r.t. measures (see sl. 17) as
well, rather than just probability measures, e.g. Lebesgue measure (= length
on R, area on R², volume on R^n, n > 2), as in (∗), or the counting measure
(see sl. 18).

115

Thus, if X ∼ P is a discrete RVec taking values in a countable set
C = {xi}_{i≥1}, p(x) := P(X = x), and µ is the counting measure on C (i.e.
µ(B) = #{x ∈ B ∩ C}), then one has

P(X ∈ B) ≡ P(B) = ∫_B dP(x) = ∫_B p(x) dµ(x) = ∑_{x∈B∩C} p(x) ≡ ∑_{i: xi∈B} p(xi).

In this case, one uses the notation p(x) = dP/dµ(x) and says that P is AC
w.r.t. µ and p is the density of P w.r.t. µ.

So the concept of AC is relative; when used without any further
explanations, it always means that P is AC w.r.t. the Lebesgue measure, i.e.
(∗) from sl. 116 holds. But one often needs to use some other measures µ.

NB: So it turns out that discrete distributions are actually AC w.r.t. the
counting measures on their supports!

Now we are ready to formulate our main result here.

116

Thm [Neyman–Fisher (NF) factorisation.] Suppose all Pθ are AC w.r.t.
some measure µ, with densities fθ(x) = dPθ/dµ(x). A necessary and sufficient
condition for a statistic S to be an SS for θ is that, for some functions ψ(s, θ)
and h(x),

fθ(x) = ψ(S(x), θ) h(x).   (∗)

If X = (X1, . . . , Xn) is an i.i.d. sample, Xj having a density fθ(x), then

fθ(x) ≡ ∏_{j=1}^n fθ(xj) = ψ(S(x), θ) h(x).

NB: Factorisation (∗) is not unique, of course: say, the pair

ψ1(s, θ) := e^s ψ(s, θ),   h1(x) := e^{−S(x)} h(x)

would be OK, too!

117

Ex. (cont'd) In our Poisson example, the distribution is on the integers, with
the density fλ(x) = e^{−λ} λ^x/x! w.r.t. the counting measure on {0, 1, 2, . . .}.
So the likelihood function has the form

fλ(x) ≡ ∏_{j=1}^n fλ(xj) = ∏_{j=1}^n e^{−λ} λ^{xj}/xj! = e^{−nλ} λ^{∑_{j=1}^n xj} / ∏_{j=1}^n xj! = ψ(S(x), λ) h(x)

with

ψ(s, λ) = e^{−nλ} λ^s,   S(x) = ∑_{j=1}^n xj,   h(x) = 1/∏_{j=1}^n xj!.

Therefore, according to the NF Thm, S(X) := ∑_{j=1}^n Xj is an SS for λ (but
we have already proved that directly anyway).

118

I We will prove the NF Thm in the discrete case only; in the general case,
the same idea is used (but the argument becomes more technical).

Thus we assume that Pθ(X ∈ C) = 1, ∀θ ∈ Θ, for a countable set C, µ is the
counting measure on C, and fθ(x) = Pθ(X = x) for x ∈ C.

⇐) Suppose that factorisation (∗) (sl. 118) takes place. Then, for x ∈ C and
s = S(x) (otherwise it's trivial: zeros!),

Pθ(X = x |S(X) = s) = Pθ(X = x, S(X) = S(x)) / Pθ(S(X) = s)
  = Pθ(X = x) / Pθ(S(X) = s) = fθ(x) / ∑_{y∈C: S(y)=s} fθ(y)
  =(∗) ψ(S(x), θ)h(x) / ∑_{y∈C: S(y)=s} ψ(S(y), θ)h(y) = h(x) / ∑_{y∈C: S(y)=s} h(y),

which doesn't depend on θ, hence S is an SS for θ!

119

⇒) Now assume that S is an SS for θ. Then, for x ∈ C, s = S(x) (otherwise

it’s trivial: zeros!),

Pθ(X = x |S(X) = s) =: h(x)

doesn’t depend on θ. Therefore

fθ(x) ≡ Pθ(X = x) = Pθ(X = x, S(X) = S(x)) = Pθ(X = x | S(X) = S(x)) · Pθ(S(X) = S(x)) = h(x) ψ(S(x), θ),

where ψ(S(x), θ) := Pθ(S(X) = S(x)). Bingo!!

120

Ex. When X = (X1, . . . , Xn), where Xj ∼ N(µ, σ²), the parameter θ := (µ, σ²) is 2-dim (NB: here µ ∈ R, it’s not a measure, just a number). The likelihood function has the form:

fθ(x) = ∏_{j=1}^n fθ(xj) = (2πσ²)^{−n/2} exp{ −(1/2σ²) ∑_{j=1}^n (xj − µ)² }

= (2πσ²)^{−n/2} exp{ −(1/2σ²) [ ∑ xj² − 2µ ∑ xj + nµ² ] }  (set S₂ := ∑ xj², S₁ := ∑ xj)

= (2πσ²)^{−n/2} exp{ −(1/2σ²) [S₂ − 2µS₁ + nµ²] } = ψ(S, θ)h(x),

where S = (S₁, S₂), h(x) ≡ 1 (but could take h(x) ≡ (2π)^{−n/2} as well).

Therefore (S₁, S₂) is an SS for θ = (µ, σ²): of all the information contained in the sample X, we only need two numbers, (S₁, S₂), for estimation of (µ, σ²).
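In code the point is simply that two running sums suffice. A small sketch (the sample is made up; the closed-form expressions µ̂ = S₁/n and σ̂² = S₂/n − µ̂² are the standard normal-sample MLEs, not derived on this slide):

```python
# Estimating (mu, sigma^2) from the two sufficient statistics S1, S2 only:
# once S1 = sum(x) and S2 = sum(x^2) are known, the raw sample is not needed.
x = [1.2, -0.7, 0.3, 2.1, 0.9]             # a hypothetical sample
n = len(x)
S1 = sum(x)
S2 = sum(v * v for v in x)

mu_hat = S1 / n                            # standard MLE of mu
var_hat = S2 / n - mu_hat ** 2             # standard MLE of sigma^2

# the same answers computed from the raw sample:
assert abs(mu_hat - sum(x) / n) < 1e-12
assert abs(var_hat - sum((v - mu_hat) ** 2 for v in x) / n) < 1e-12
```

The second assertion uses the identity (1/n)∑(xj − µ̂)² = S₂/n − µ̂², which is why (S₁, S₂) is all one ever needs here.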

121

Ex. Suppose X = (X1, . . . , Xn), where Xj ∼ U[0, θ]. This is an AC distr’n,

fθ(x) = θ^{−1} if x ∈ [0, θ], and 0 otherwise.

Hence, using x(1) = min_{j≤n} xj and x(n) = max_{j≤n} xj, the likelihood function is

fθ(x) = ∏_{j=1}^n fθ(xj) = θ^{−n} if xj ∈ [0, θ] for all j ≤ n, and 0 otherwise

= θ^{−n} 1(0 ≤ x(1), x(n) ≤ θ) = θ^{−n} 1{x(n) ≤ θ} · 1{x(1) ≥ 0} =: ψ(S, θ) h(x).

Therefore S(X) := X(n) is an SS for θ.

Cor [of NF Thm] If T is a statistic, ϕ a function, and S := ϕ(T ) is an SS

for θ, then T is also an SS for θ.

I Obvious from NF Thm. In fact, the “best” SSs are the minimal ones

(that are f’ns of all the others): they can be used to get the best estimators.

122

Recall the following important concept.

Def. θ̂∗ = θ̂∗(X) := argmax_{θ∈Θ} fθ(X) is called the maximum likelihood estimator (MLE) of θ from X.

This statistic can be a (very) good estimator for θ and possesses some nice

properties you may be familiar with. At the moment, we observe this:

Cor [of NF Thm] If S is an SS for θ, then the MLE θ̂∗ is a function of S only (no further information from the sample X is needed).

I Indeed, from the NF Thm,

max_θ fθ(X) = max_θ [ψ(S(X), θ)h(X)] = h(X) max_θ ψ(S(X), θ),

so the value of θ at which the max is attained depends on S(X) only.

Moreover, if S is an SS for θ, then all the Bayesian estimators (rings a bell?)

are functions of S only etc.

123

How SSs can improve estimators’ efficiency

Q: How to measure performance of estimators θ∗ of θ? First consider the case

when Θ ⊂ R. The standard mean quadratic error approach:

Eθ(θ∗ − θ)² → min over θ∗.  (∗)

Q: Does there exist an estimator θ∗₀ minimising the LHS of (∗) for all θ?

A: In non-trivial cases, the answer is negative. Indeed, suppose that there exists such an estimator θ∗₀. Fix a θ₁ ∈ Θ and take θ∗₁ ≡ θ₁. Then

Eθ(θ∗₁ − θ)² = (θ₁ − θ)²,

which turns into zero for θ = θ₁. But we know that θ∗₀ is the best estimator, so we must also have

Eθ₁(θ∗₀ − θ₁)² = 0.

And this holds for any θ₁ ∈ Θ!! This is only possible when the observation X uniquely determines the value of θ (e.g. when Xi ∼ εθ, Θ = R, or Xi ∼ U[θ, θ + 1], Θ = Z).

124

So one compares performance of estimators within reasonable classes, e.g.

unbiased estimators.

Def. An estimator θ∗₀ = θ∗₀(X) from a class K of estimators of θ is called efficient in K if, for any θ∗ ∈ K,

Eθ(θ∗₀ − θ)² ≤ Eθ(θ∗ − θ)², ∀θ ∈ Θ.

Ex (of an important class of estimators). For a function b = b(θ), θ ∈ Θ, let

Kb = {θ∗ : Eθ θ∗ = θ + b(θ), ∀θ ∈ Θ}

be the class of all estimators with the bias b(θ).

In particular, K₀ is the class of all unbiased estimators. Estimators efficient in K₀ are called simply efficient.

BTW: unbiasedness is a desirable (but not necessary) property of a good estimator. Unbiased estimators don’t need to exist (e.g. if X ∼ B(ϕ(θ)) for some f’n ϕ, then θ∗ ∈ K₀ ⇔ Eθ θ∗ ≡ θ∗(0)(1 − ϕ(θ)) + θ∗(1)ϕ(θ) (?)= θ, ∀θ ∈ Θ).

125

Thm. An estimator efficient in Kb (if it exists) is unique (modulo its values on a subset A of the sample space s.t. Pθ(A) = 0, ∀θ ∈ Θ).

I Suppose that both θ∗₁ and θ∗₂ are efficient in Kb:

Eθ(θ∗ᵢ − θ)² = min_{θ∗∈Kb} Eθ(θ∗ − θ)² =: Rθ, ∀θ ∈ Θ, i = 1, 2.

Clearly, θ∗₀ := (θ∗₁ + θ∗₂)/2 ∈ Kb and, using ((a₁+a₂)/2)² + ((a₁−a₂)/2)² = (a₁² + a₂²)/2 with aᵢ = θ∗ᵢ − θ, i = 1, 2, and taking Eθ’s, we obtain

Eθ(θ∗₀ − θ)² [≥ Rθ] + (1/4) Eθ(θ∗₁ − θ∗₂)² = Rθ ⇒ Eθ(θ∗₁ − θ∗₂)² ≤ 0, hence = 0.

As we showed on slides 85 & 86, this means that Pθ(θ∗₁ − θ∗₂ = 0) = 1. Bingo.

[NB: This assertion has a transparent geometric interpretation. Picture.]

126

Thm [Rao-Blackwell (RB)] Let θ∗ ∈ Kb, S be an SS for θ. Then the CE θ∗_S := Eθ(θ∗ | S) has the following properties:

(i) θ∗_S is a function of S only (and hence is a statistic);

(ii) θ∗_S ∈ Kb;

(iii) Eθ(θ∗_S − θ)² ≤ Eθ(θ∗ − θ)², ∀θ ∈ Θ, where “=” holds iff Pθ(θ∗_S = θ∗) = 1.

Thus applying Eθ(· | S) to an estimator θ∗ improves it!

I First observe that

θ∗_S = ∫ θ∗(x) Pθ(X ∈ dx | S) = ∫ θ∗(x) P(dx | S)  (as S is an SS)

doesn’t depend on θ and is a function of S = S(X) only. Hence θ∗_S is a statistic and so can be used as an estimator. [It wouldn’t be so if S were NOT an SS!] This proves (i). To prove (ii), note:

Eθ θ∗_S = Eθ Eθ(θ∗ | S) = Eθ θ∗ = θ + b(θ)  (by CEP.5), so that θ∗_S ∈ Kb indeed.

127

To demonstrate (iii), we do this:

Eθ(θ∗ − θ)² = Eθ((θ∗ − θ∗_S) + (θ∗_S − θ))²

= Eθ(θ∗ − θ∗_S)² [≥ 0] + Eθ(θ∗_S − θ)² + 2 Eθ(θ∗ − θ∗_S)(θ∗_S − θ) [= 0, see bottom of sl. 110]

≥ Eθ(θ∗_S − θ)².

It remains to note that “=” holds here iff Eθ(θ∗ − θ∗_S)² = 0, but, as we proved on slides 85 & 86, the latter is equivalent to Pθ(θ∗ − θ∗_S = 0) = 1. Bingo.

NB: Recall Cor from sl. 123: S = ϕ(T), S is an SS =⇒ T is an SS, too. Using the same argument, one can show that then

Eθ(θ∗_S − θ)² ≤ Eθ(θ∗_T − θ)².

That is, the “smaller” the conditioning SS is, the more efficient the result!

128

Ex. Consider an i.i.d. sample X = (X1, . . . , Xn), Xj ∼ P(λ), and λ∗ = X1. As we know, Eλ λ∗ = Eλ X1 = λ, so that λ∗ ∈ K₀, and

Eλ(λ∗ − λ)² = Var_λ(X1) = λ.

We also know (sl. 115) that S := ∑_{j=1}^n Xj is an SS for λ, so we can form the estimator

λ∗_S = Eλ(λ∗ | S) = Eλ(X1 | S) = S/n ≡ X̄,

since the conditional distr’n of X1 given S = m is binomial B(m, 1/n) (cf. Tutorial Problem 1, PS-6). So

Eλ(λ∗_S − λ)² = Var_λ(X̄) = λ/n,

which is much better than for the original λ∗!

One can show that λ∗_S is actually efficient (i.e., it has the smallest mean quadratic error among all unbiased estimators).
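The improvement λ → λ/n is easy to watch in a seeded Monte Carlo experiment. A sketch only (the sampler, λ, n and the number of repetitions are arbitrary choices, not part of the slides):

```python
# Comparing the mean quadratic errors of lambda* = X1 and its
# Rao-Blackwellisation lambda*_S = S/n for i.i.d. Poisson(lam) samples.
import math
import random

rng = random.Random(0)

def poisson(lam):
    # Knuth's multiplication method; adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

lam, n, reps = 2.0, 10, 20000
err1 = err2 = 0.0
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    err1 += (x[0] - lam) ** 2            # lambda* = X1
    err2 += (sum(x) / n - lam) ** 2      # lambda*_S = S/n = Xbar
err1 /= reps
err2 /= reps
assert err2 < err1                        # conditioning on the SS helps
```

Empirically `err1` comes out near λ = 2 and `err2` near λ/n = 0.2, matching the two variances computed above.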

129

Q: What do we do when θ ∈ R^d?

A: A possible way to evaluate the performance of estimators is to compare mean quadratic dispersions

Eθ(θ∗ − θ, a)², a ∈ R^d,

(·, ·) being the scalar product, and prefer θ∗₁ to θ∗₂ if the former has smaller dispersion for all a. That is, one looks at the projections of θ∗ − θ onto different directions a.

Thm [multivariate Rao-Blackwell] As in the univariate case, except for (iii), which is replaced with:

(iii′) Eθ(θ∗_S − θ, a)² ≤ Eθ(θ∗ − θ, a)², ∀θ ∈ Θ, ∀a ∈ R^d, where “=” holds for all a ∈ R^d iff Pθ(θ∗_S = θ∗) = 1.

The same proof: just do it for the univariate parameters (θ, a) and estimators (θ∗, a) for them. Good.

130

Q: How far can one go along the path paved by RB Thm?

A: When there exists a complete SS for θ, one can go till the very end: in that

case, for a θ∗ ∈ Kb, the estimator θ∗S will be efficient in Kb. But this is from

another book. . .

131

7. Convergence of Random Variables

Modes of Convergence

Recall this: Let {xn}n≥1 be a sequence in R (or R^d). We say that xn converge to x as n → ∞, and write

xn → x as n → ∞, or lim_{n→∞} xn = x,

if, ∀ε > 0, ∃nε < ∞ s.t. one has

|xn − x| < ε for all n ≥ nε.

That is, for an arbitrarily small neighbourhood U of x, the xn’s should be in

U for all large enough n.

Now what might Xn → X mean when Xn = Xn(ω) are RVs? They are

functions, not numbers! [Picture.]

132

In Probability Theory (as in other areas of mathematics), one considers

several modes of convergence of RVs (functions). They are different, used in

different contexts, some more often than others. We will give general

definitions and briefly describe relationships between some of them.

Def. [5.1] Convergence almost surely (a.s.), or w.p. 1: Xn −a.s.→ X as n → ∞ if there is an event A with P(A) = 1 s.t. ∀ω ∈ A, Xn(ω) → X(ω) as n → ∞.

In other words, it’s point-wise convergence on a set of probability 1.

Def. [5.2] Convergence in probability: Xn −P→ X as n → ∞ if ∀ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.

133

Def. [5.3] Quadratic mean convergence: Xn −L²→ X as n → ∞ if Xn, X ∈ L² and E(Xn − X)² → 0 as n → ∞.

In fact, this is convergence of elements of the space L² of square-integrable RVs (on a common (Ω, F, P)) in its “native norm” (the one corresponding to the scalar product (X, Y) := EXY).

Def. [5.4] Convergence in mean: Xn −L¹→ X as n → ∞ if Xn, X ∈ L¹ and E|Xn − X| → 0 as n → ∞.

The last two modes (5.3, 5.4) are particularly popular in engineering applications (one of the reasons being that calculating moments is a relatively simple task).

134

All four are modes of convergence of sequences of RVs given on a common probability space. However, the most important from the applications’ viewpoint is convergence in distribution (a.k.a. “weak convergence of distributions”), which doesn’t care where and how the RVs are defined: only the distributions of Xn matter.

Def. [5.5] Convergence in distribution: Xn −d→ X as n → ∞ if lim_{n→∞} F_{Xn}(t) = F_X(t) at all points t ∈ R where F_X(t) is continuous, i.e. such t that F_X(t−) = F_X(t) ⇐⇒ P(X = t) = 0.

NB: Why this restriction to continuity points of F_X? Because it makes sense (and perfectly agrees with an alternative, more natural def’n, see Thm [5.8] below). Consider Xn ≡ 1/n, X ≡ 0. [Picture: DFs.] Then F_{Xn}(t) → F_X(t) at all t ≠ 0, whereas F_X(0) = 1 ≠ 0 = F_{Xn}(0).

Do we want to exclude such a situation? Don’t the distributions converge? The def’n allows the jump points to “move” & also emerge in the limit.

135

The key argument for the above def’n of “−d→” is actually the following alternative def’n (which works not only for RVs and RVecs, but also in much more general cases).

Thm [5.8] Xn −d→ X as n → ∞ iff for any continuous bounded function f

Ef(Xn) → Ef(X) as n → ∞.

It is this property that makes convergence in distribution so useful & important.

I The proof is somewhat technical, let’s leave it for the future. Just note: for “nice” f (e.g. differentiable and vanishing outside a bounded interval), one can integrate by parts to obtain: ∫ f(x) dF_{Xn}(x) = −∫ F_{Xn}(x) df(x), so if the F_{Xn}(x) converge to F(x) everywhere (perhaps except for an at most countable set where F has jumps), then. . . Makes sense.

136

Ex. If X, Xj ∈ N, j ≥ 1, then Xn −d→ X iff ∀k ∈ N,

P(Xn = k) → P(X = k), n → ∞.

Indeed,

P(Xn = k) = P(Xn ≤ k) − P(Xn ≤ k − 1) = F_{Xn}(k) − F_{Xn}(k − 1) = F_{Xn}(k + s) − F_{Xn}(k − 1 + s) for any s ∈ (0, 1),

[picture!] and also, for any s ∈ (0, 1),

F_{Xn}(k + s) = ∑_{j=1}^k P(Xn = j).

The assertion follows.

137

Ex. Let Yn be uniformly distr’d on {0, 1, . . . , n}: P(Yn = k) = 1/(n+1), 0 ≤ k ≤ n. Prove that Xn := Yn/n −d→ X ∼ U[0, 1] as n → ∞.

I The limiting DF F(x) = x·1(x ∈ [0, 1]) + 1(x > 1) is continuous, so we have to prove: ∀x ∈ R, F_{Xn}(x) → F(x) as n → ∞. As P(Yn ≤ k) = (k+1)/(n+1), 0 ≤ k ≤ n,

F_{Xn}(x) = P(Yn ≤ nx) = 0 for x < 0; = P(Yn ≤ ⌊nx⌋) = (⌊nx⌋+1)/(n+1) for x ∈ [0, 1]; = 1 for x > 1.

But (⌊nx⌋+1)/(n+1) → x as n → ∞:

|(⌊nx⌋+1)/(n+1) − x| = |⌊nx⌋ + 1 − (n+1)x|/(n+1) ≤ (|⌊nx⌋ − nx| + 1 − x)/(n+1) ≤ 2/(n+1).

BTW, what about the alternative def’n (Thm [5.8])? Look: Ef(Xn) = ∑_{k=0}^n f(k/n)·1/(n+1) → ∫₀¹ f(x) dx = Ef(X), so it’s OK!!
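Both routes in this example can be checked numerically. A sketch (the grid, the choice f = cos and the tolerances are mine, purely for illustration):

```python
# Checking (a) the DF bound |F_{Xn}(x) - x| <= 2/(n+1) on [0,1], and
# (b) Ef(Xn) -> integral of f over [0,1] for a bounded continuous f.
import math

def F_Xn(x, n):
    # DF of Xn = Yn/n, with Yn uniform on {0, 1, ..., n}
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return (math.floor(n * x) + 1) / (n + 1)

n = 1000
xs = [i / 500 for i in range(501)]                 # grid on [0,1]
assert max(abs(F_Xn(x, n) - x) for x in xs) <= 2 / (n + 1)

f = math.cos                                        # a bounded continuous f
Ef_Xn = sum(f(k / n) for k in range(n + 1)) / (n + 1)
assert abs(Ef_Xn - math.sin(1.0)) < 1e-2            # since the integral of cos over [0,1] is sin 1
```

The second check is exactly the Riemann-sum observation at the end of the example.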

138

Some Relationships among the Modes

• −a.s.→ always implies −P→;
• −L²→ always implies −P→; conversely, −P→ implies −L²→ only under additional integrability conditions;
• −P→ always implies −d→.

[There are some further “restricted” implications, which we didn’t mention.]

Why do these implications hold?

139

1) −a.s.→ ⇒ −P→

I Assume that Xn −a.s.→ X, i.e. Xn(ω) → X(ω) for all ω ∈ A, where P(A) = 1. Fix an arbitrary ε > 0 and set An := {|Xn − X| > ε}. Clearly we cannot have An i.o. for an ω ∈ A, so that

A^c ⊃ [An i.o.] ≡ ⋂_k ⋃_{n≥k} An (sl. 11),

and hence, by the continuity of P,

0 = P(A^c) ≥ P(⋂_k ⋃_{n≥k} An) = lim_{k→∞} P(⋃_{n≥k} An),

so that, as k → ∞,

0 ← P(⋃_{n≥k} An) ≥ P(Ak) = P(|Xk − X| > ε),

which means that Xn −P→ X. Bingo.

140

2) −P→ ⇒ −L²→ under integrability conditions

I This is a bit technical, uses integration theory. Leave it for the future.

3) −L²→ ⇒ −P→ (always)

I This immediately follows from Markov’s inequality (Thm [4.40], sl. 79):

P(|Xn − X| > ε) ≤ E|Xn − X|²/ε².

4) −P→ ⇒ −d→

This one we can prove! Better do that on the next slide.

141

I Let t be a continuity point of F_X, ε > 0 be fixed. Then

F_{Xn}(t) = P(Xn ≤ t) = P(Xn ≤ t, |Xn − X| ≤ ε) + P(Xn ≤ t, |Xn − X| > ε)
≤ P(X ≤ t + ε) + P(|Xn − X| > ε) = F_X(t + ε) + P(|Xn − X| > ε)

(as {Xn ≤ t, |Xn − X| ≤ ε} ⊂ {X ≤ t + ε}), and similarly

F_X(t − ε) ≤ F_{Xn}(t) + P(|Xn − X| > ε).

That is,

F_X(t − ε) − P(|Xn − X| > ε) ≤ F_{Xn}(t) ≤ F_X(t + ε) + P(|Xn − X| > ε).

As ε > 0 is arbitrarily small, we can choose it so that F_X(t ± ε) will be arbitrarily close to F_X(t) (we can do that as F_X is continuous at t), and P(|Xn − X| > ε) → 0 as n → ∞. So what happens to F_{Xn}(t) then? Bingo.

142

Examples and Counterexamples

Consider Ω = [0, 1] with P = U[0, 1]. [Pictures!]

1) Let Xn := n·1_{(0,1/n)}, X ≡ 0. Then: Xn −a.s.→ X, and hence also −P→ and −d→, but Xn converges to X neither in L² nor in L¹, since E|Xn − X| = EXn = 1 ↛ 0.

2) Let Xn := √n·1_{(0,1/n)}, X ≡ 0. Then we’ll have the same relations, except for having now Xn −L¹→ X since E|Xn − X| = EXn = 1/√n → 0.

3) Let X ≡ 0,
X1 := 1_{[0,1/2)}, X2 := 1_{[1/2,1)},
X3 := 1_{[0,1/3)}, X4 := 1_{[1/3,2/3)}, X5 := 1_{[2/3,1)},
X6 := 1_{[0,1/4)}, X7 := 1_{[1/4,2/4)}, X8 := 1_{[2/4,3/4)}, X9 := 1_{[3/4,1)},
etc. Then Xn does not converge a.s. to X, but −P→, −d→, −L²→ and −L¹→ all hold.
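Counterexample 1) can be checked mechanically. A sketch (the particular sample points are arbitrary): for every fixed ω > 0 the values Xn(ω) are eventually 0, yet the expectation stays at 1.

```python
# Counterexample 1: Xn(w) = n for w in (0, 1/n), else 0, on Omega = [0,1]
# with uniform P. Pointwise Xn(w) -> 0 for every fixed w > 0, but EXn = 1.
def X(n, w):
    return n if 0 < w < 1 / n else 0

w = 0.37                                  # a fixed sample point
assert X(2, w) == 2                       # early on, w may fall in (0, 1/n)
assert X(1000, w) == 0                    # eventually 0 for this fixed w
assert all(X(n, 0.0001) == 0 for n in range(10001, 10010))  # even for tiny w

# exact expectation under U[0,1]: mass n on an interval of length 1/n
for n in (1, 10, 1000):
    assert n * (1 / n) == 1.0             # E|Xn - X| = EXn = 1 for all n
```

So a.s. convergence holds while L¹ (and hence L²) convergence fails, exactly as claimed.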

143

4) Let X ≡ 0,
X1 := 2·1_{[0,1/2)}, X2 := 2·1_{[1/2,1)},
X3 := 3·1_{[0,1/3)}, X4 := 3·1_{[1/3,2/3)}, X5 := 3·1_{[2/3,1)},
X6 := 4·1_{[0,1/4)}, X7 := 4·1_{[1/4,2/4)}, X8 := 4·1_{[2/4,3/4)}, X9 := 4·1_{[3/4,1)},
etc. Then Xn does not converge a.s. to X, but −P→ and −d→ hold; however, −L¹→ fails since E|Xn − X| = EXn = 1 ↛ 0 (and hence −L²→ fails too).

5) If Xn := 1_{[0, 1/2 + 1/n]}, X := 1_{[1/2, 1]}, then Xn −d→ X, but all other →’s fail.

6) If Xn ∼ U[1/2 − 1/n, 1/2 + 1/n], X ≡ 1/2, then Xn −d→ X, even though F_{Xn}(1/2) = 1/2 ↛ 1 = F_X(1/2). Can we assert −P→? What about −L²→ and −a.s.→?

144

Convergence under Transformations

1) As for the “usual convergence” of sequences in R (or R^d), if, for two sequences of RVs {Xn} and {Yn} given on a common probability space, Xn −a.s.→ X and Yn −a.s.→ Y as n → ∞, then also Xn + Yn −a.s.→ X + Y. (Thm [5.19])

Indeed, we know that {Xn} converges on an event A with P(A) = 1 and {Yn} converges on an event B with P(B) = 1. Clearly, P(A ∩ B) = 1 and both sequences converge on the event A ∩ B.

[BTW: What about the products XnYn?]

The same applies to −P→: as |(Xn + Yn) − (X + Y)| ≤ |Xn − X| + |Yn − Y|,

P(|(Xn + Yn) − (X + Y)| > ε) ≤ P(|Xn − X| > ε/2) + P(|Yn − Y| > ε/2).

[BTW: What about the products XnYn?]

145

2) If Xn −L²→ X and Yn −L²→ Y as n → ∞, then also Xn + Yn −L²→ X + Y.

This follows from the triangle inequality for the L²-norm, which follows from the Cauchy-Bunyakovsky inequality (Cor [4.36], sl. 81).

But for the products we can only claim that XnYn −L¹→ XY; we don’t even know if XnYn ∈ L², but that it’s in L¹ follows from the Cauchy-Bunyakovsky inequality! The −L¹→-claim follows from

XnYn − XY = XnYn − XnY + XnY − XY = Xn(Yn − Y) + (Xn − X)Y

and the Cauchy-Bunyakovsky inequality (once again).

146

3) Of course, if we only know that Xn −d→ X and Yn −d→ Y as n → ∞, then even asking if Xn + Yn −d→ X + Y is meaningless: Xn and Yn may be defined on different probability spaces, so we cannot add them!

4) But we have this important general result:

Thm [5.23] Let g : R → R be a continuous function. Then, as n → ∞,
a) if Xn −a.s.→ X then g(Xn) −a.s.→ g(X),
b) if Xn −P→ X then g(Xn) −P→ g(X),
c) if Xn −d→ X then g(Xn) −d→ g(X).

This theorem holds for RVecs in R^d as well, and beyond.

I a) is obvious.

b) This we could prove as in the text, but it’s more instructive to go another way, basing it on this fundamental result from Real Analysis: any function g continuous on a closed bounded interval [a, b] is uniformly continuous there.

147

Uniform continuity: ∀ε > 0, ∃δ > 0 s.t. [Pictures & examples.]

x, y ∈ [a, b], |x − y| ≤ δ =⇒ |g(x) − g(y)| < ε.

Fix an ε > 0, let A_{n,ε} := {|g(Xn) − g(X)| > ε}. Then, ∀N > 0, with I_N := [−N, N],

P(A_{n,ε}) = P(A_{n,ε}; X ∈ I_N) + P(A_{n,ε}; X ∉ I_N)
≤ P(A_{n,ε}; X ∈ I_N) + P(X ∉ I_N)
≤ P(A_{n,ε}; X ∈ I_N, |Xn − X| ≤ δ) + P(|Xn − X| > δ) + P(X ∉ I_N)
=: P₁ + P₂ + P₃.

Now, for an arbitrarily small η > 0, we can choose N so large that P₃ < η/2. Next, since g is uniformly continuous on [−N − 1, N + 1], we can choose δ < 1 so small that if x, y ∈ [−N − 1, N + 1], |x − y| ≤ δ then |g(x) − g(y)| < ε. But this yields P₁ = 0.

Finally, as Xn −P→ X, for all large enough n one has P(|Xn − X| > δ) < η/2, and that implies P(A_{n,ε}) < η. Bingo for b).

148

c) is obvious from Thm [5.8]: we have to show that Yn := g(Xn) −d→ Y := g(X); but since for a bounded continuous f the composition (f ◦ g)(x) := f(g(x)) is also bounded (as f is bounded) and continuous (as g is continuous), we have

Ef(Yn) = E(f ◦ g)(Xn) → E(f ◦ g)(X) = Ef(Y).

Total bingo.

149

Now we turn to the convergence results mentioned on sl. 66.

Thm [4.9] (Monotone Convergence Theorem.) If Xn ≥ 0 are RVs on a common probability space and Xn ↑ X a.s. as n → ∞, then EXn ↑ EX.

I For any n ≥ 1, there exists a sequence of simple RVs X_n^(k) ↑ Xn as k → ∞, and for them EX_n^(k) ↑ EXn (cf. Def [4.4], Prpn [4.5]).

Now X^(k) := max_{n≤k} X_n^(k), k ≥ 1, are clearly also simple RVs, and

X^(k−1) ≤ X^(k) ≤ X_k, k ≥ 1.

By monotonicity, there exists the a.s. limit lim_{k→∞} X^(k) =: Y.

Since, for any n ≥ 1, one has X_n^(k) ≤ X^(k) ≤ X_k, where (as k → ∞) X_n^(k) ↑ Xn, X^(k) ↑ Y and X_k ↑ X a.s., one concludes that Xn ≤ Y ≤ X a.s. for any n ≥ 1.

150

Hence Y = X a.s. (recall that Xn ↑ X a.s. as n → ∞).

Therefore, for the simple RVs X^(k), one has X^(k) ↑ X a.s. Hence by Prpn [4.5] one has the last relation (as k → ∞) in the next line:

EX ≥ EX_k ≥ EX^(k) ↑ EX.

We conclude that also EX_k ↑ EX. Bingo!

151

Thm [4.8] (Fatou’s Lemma.) If Xn ≥ 0 are RVs on a common probability space then

E lim inf_{n→∞} Xn ≤ lim inf_{n→∞} EXn.

Recall: lim inf_{n→∞} xn = lim_{n→∞} inf_{k≥n} xk is the least partial limit of the sequence {xn}n≥1 ⊂ R.

I Set X := lim inf_{n→∞} Xn ≡ lim_{n→∞} Yn, where Yn := inf_{m≥n} Xm ↑ X as n → ∞, 0 ≤ Yn ≤ Xn. By the Monotone Convergence Theorem,

EX = lim_{n→∞} EYn = lim inf_{n→∞} EYn ≤ lim inf_{n→∞} EXn.

Bingo!

152

Thm [4.16] (Dominated Convergence Theorem.) If |Xn| ≤ c < ∞ a.s. for any n ≥ 1, and Xn → X a.s. as n → ∞, then there exists lim_{n→∞} EXn = EX.

Note that the first condition can be replaced with: |Xn| ≤ Y a.s. for any n ≥ 1, where EY < ∞. Check the proof below!

I Since Xn + c ≥ 0 and c − Xn ≥ 0 for any n ≥ 1, by Fatou’s lemma one has

E lim inf_{n→∞} Xn ≤ lim inf_{n→∞} EXn,  E lim sup_{n→∞} Xn ≥ lim sup_{n→∞} EXn.

But lim inf_{n→∞} Xn = lim sup_{n→∞} Xn = lim_{n→∞} Xn = X by assumption, so the LHS’s in the above formulae coincide with each other and with EX. Hence

lim sup_{n→∞} EXn ≤ EX ≤ lim inf_{n→∞} EXn,

which can only hold if there exists the limit lim_{n→∞} EXn = EX. Bingo!

153

Our First Limit Theorems: Sums of Bernoulli^a RVs

Suppose {Xn} is an i.i.d. sequence of B(p)-RVs,

P(Xj = 1) = 1 − P(Xj = 0) = p ∈ (0, 1), q := 1 − p.

Recall: for Sn := X1 + · · · + Xn,

P(Sn = k) = (n choose k) p^k q^{n−k}, k = 0, 1, . . . , n.

Laws of Large Numbers (LLNs)

Thm [5.30] (Weak LLN.) Sn/n −P→ p as n → ∞.

I As −L²→ implies −P→, we only need to prove the former, which is obvious:

E(Sn/n − p)² = E(Sn − np)²/n² = Var(Sn)/n² = nVar(X1)/n² = npq/n² = pq/n → 0.

^a Named after Jacob Bernoulli (1654–1705), whose Ars Conjectandi (1713) contained the first proof of the WLLN.
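The LLN can be watched in action with a seeded simulation (an illustration only, not a proof; p, n and the seed are arbitrary choices):

```python
# Relative frequency Sn/n of successes in n i.i.d. B(p) trials settles
# near p; the deviation is of order sqrt(pq/n).
import random

rng = random.Random(42)
p, n = 0.3, 100000
S = sum(1 for _ in range(n) if rng.random() < p)   # Sn = number of successes
assert abs(S / n - p) < 0.01                       # well within a few sqrt(pq/n)
```

Increasing n shrinks the typical deviation like 1/√n, which is exactly the pq/n bound in the WLLN proof above.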

154

Thm [5.31] (Strong LLN.) Sn/n −a.s.→ p as n → ∞.

I For ε > 0, set An(ε) := {|Sn/n − p| > ε}. The main task: show that An(ε) occurs i.o. w.p. 0. [Indeed, then ⋃_k [An(1/k) i.o.] also has probability 0, and we are done.]

Refer to Thm [1.27] (Borel-Cantelli) on sl. 26: for this it suffices to show that

∑_{n=1}^∞ P(An(ε)) < ∞.

Show that: by Markov’s inequality (Thm [4.40], sl. 79),

P(An(ε)) = P(|Sn − np| > nε) ≤ E(Sn − np)⁴/(n⁴ε⁴).

If we show that E(Sn − np)⁴ ≤ cn², we are done: then P(An(ε)) ≤ c₁n^{−2}, and ∑_n n^{−2} < ∞!

155

Letting X̃j := Xj − p (so that EX̃j = 0), we have

E(Sn − np)⁴ = E(∑_{j=1}^n X̃j)⁴ = E[∑_{j=1}^n X̃j⁴ + 6 ∑_{j<k} X̃j²X̃k² + (terms containing a first power X̃j¹)]

= nEX̃₁⁴ + 3n(n − 1)(EX̃₁²)²,

since E X̃j²X̃k² = (EX̃₁²)² for j < k, while each term containing a first power X̃j¹ has zero expectation. Now

nEX̃₁⁴ + 3n(n − 1)(EX̃₁²)² = n[(1 − p)⁴p + (1 − q)⁴q] + 3n(n − 1)(pq)² ≤ n/4 + n(n − 1)/4 = n²/4,

using (1 − p)⁴p + (1 − q)⁴q = pq(p³ + q³) ≤ pq ≤ 1/4 and (pq)² ≤ (1/4)² = 1/16 < 1/12.

Total bingo. We completed the proof.

156

NB: This result validates Probability Theory (as we constructed it):

the Xj ’s are actually indicators of independent events occurring with the same

probability p. Can interpret this as a sequence of independent trials,

observing/not observing the same event in each of them.

Thus Sn is the total number of occurrences of our event in n trials, and

Sn/n = relative frequency of the event. We showed that, for “almost all”

sequences of trials, the relative frequency tends to p, which is exactly what we

aimed to reproduce with our mathematical model.

157

Ex. Consider Ω = [0, 1], P = U[0, 1], Yj(ω) = the jth digit in the decimal expansion of the number ω = 0.Y1Y2Y3 . . . ∈ [0, 1].

Then {Yj} is an i.i.d. sequence, with P(Y1 = k) = 1/10, k = 0, 1, . . . , 9. Both claims are obvious [picture!].

What is the freq’cy of a given digit k in a “typical number’s” decimal exp’n? Letting Xj := 1(Yj = k), we get an i.i.d. sequence of Bernoulli RVs with p = 1/10, with the frequency of k in the first n digits of ω given by Sn/n.

Now from the SLLN we know that, for Ck := {Sn/n → p ≡ 1/10 as n → ∞}, one has P(Ck) = 1. And this holds for all k = 0, 1, . . . , 9, so that P(⋂_{k=0}^9 Ck) = 1.

This means that, for ALMOST ALL numbers ω ∈ [0, 1], each of the ten decimal digits appears with frequency 1/10 in ω’s decimal expansion.

Most rationals are exceptions, of course (e.g. 1/3 = 0.3333 . . . etc.: they all have periodic expansions starting from some place); but there are only countably many of them, a negligibly small proportion of all numbers in [0, 1]!
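A seeded sketch of this statement (drawing the digits Y1, Y2, . . . directly, each uniform on {0, . . . , 9}, rather than expanding an actual ω; n and the seed are arbitrary):

```python
# Digit frequencies of a "typical" number: among the first n decimal digits,
# each of 0..9 should appear with frequency close to 1/10.
import random

rng = random.Random(7)
n = 200000
counts = [0] * 10
for _ in range(n):
    counts[rng.randrange(10)] += 1      # the next decimal digit Y_j

freqs = [c / n for c in counts]
assert all(abs(f - 0.1) < 0.01 for f in freqs)
```

Each digit's indicator sequence is exactly the Bernoulli setup of the SLLN with p = 1/10, so all ten frequencies settle near 0.1 simultaneously.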

158

Q: Could we use the same tools to establish the WLLN & SLLN in the general case (when Xj ≁ B(p))?

A: To some extent. Look: all we needed in the proof of the WLLN was that

E(Sn − np)² ≡ E(Sn − nµ)² ≡ Var(Sn) = nVar(X1).

But this will hold in the general case for uncorrelated Xj with common values of µ := EXj and Var(Xj) (we don’t even need independence!). The same argument shows that Sn/n −P→ µ in this case.

For the SLLN, our argument will still work for i.i.d. Xj with EXj⁴ < ∞. However, one doesn’t need that much: in fact, in the i.i.d. case, the SLLN holds ⇔ E|X1| < ∞. The proof of this is much more sophisticated; leave it for the future.

159

We will not discuss here the de Moivre-Laplace limit theorem (= “local CLT”) for i.i.d. Xj ∼ B(p) that describes the behaviour of

P(Sn = k)

when we keep k “√n-close” to the mean ESn = np: these probabilities can then be approximated by multiples of the normal density values. [Picture.]

The proof of this is based on:

(a) the binomial formula P(Sn = k) = (n choose k) p^k q^{n−k} and

(b) Stirling’s formula^a: k! = √(2π) k^{k+1/2} e^{−k}(1 + o(1)) as k → ∞ (o(1) → 0).

We won’t discuss here the Poisson limit theorem (“the law of small numbers”) which concerns approximating P(Sn = k) in situations where p = pn → 0 s.t. npn → λ ∈ (0, ∞): we will do it as an exercise.

Instead, we will turn to powerful (analytic) tools that can be used to analyse the behaviour of probability distributions in much more general situations.

^a Check how well it works! In fact, k! = √(2π) k^{k+1/2} e^{−k+θ(k)}, with 1/(12k+1) < θ(k) < 1/(12k).

160

8. Characteristic Functions (ChFs)

Def [6.1]. For an RV X, its ChF ϕX(t) : R → C is defined by

ϕX(t) := Ee^{itX} = ∫ e^{itx} dF_X(x).

Recall Euler’s formula: e^{it} = cos t + i sin t, |e^{it}| = 1 for t ∈ R. [Picture.]

So ϕX(t) = E cos(tX) + iE sin(tX) always exists and is finite. Moreover,

|ϕX(t)| = |Ee^{itX}| ≤ E|e^{itX}| = 1,  ϕX(0) = Ee^{i0X} = E1 = 1.

The ChF of a distribution (or DF) F is the ChF of X ∼ F.

NB: for X ∈ Z, ϕX(t) = ∑_k e^{itk} P(X = k); for AC X’s, ϕX(t) = ∫ e^{itx} f_X(x) dx.

What’s the point of introducing ChFs? To use Fourier analysis: represent a given function as a “mixture” of harmonic oscillations e^{itx} with different frequencies t. Analogy: orthogonal basis expansion, coordinates etc. And most importantly: DF ←(1−1)→ ChF, and there is MORE! We will see.

161

Ex. X ≡ c = const ⇒ ϕX(t) = Ee^{itc} = e^{itc}.

Ex. X ∼ B(p) ⇒ ϕX(t) = Ee^{itX} = pe^{it·1} + qe^{it·0} = 1 + p(e^{it} − 1).

Ex. X ∼ U[0, 1] ⇒ ϕX(t) = ∫₀¹ e^{itx} dx = (1/it)[e^{itx}]₀¹ = (e^{it} − 1)/(it).

Oops: what if t = 0? Trouble? Nope. Look: |e^{it} − 1| ≤ |t|. [Picture.] Indeed, e^{it} − 1 = ∫₀ᵗ (e^{is})′ ds = i ∫₀ᵗ e^{is} ds, where |e^{is}| = 1.

Ex. X ∼ N(0, 1) ⇒ ϕX(t) = ∫ e^{itx} e^{−x²/2}/√(2π) dx = (1/√(2π)) ∫ e^{itx − x²/2} dx

= (1/√(2π)) ∫ e^{−(x² − 2itx ± (it)²)/2} dx = e^{−t²/2} · (1/√(2π)) ∫ e^{−(x − it)²/2} dx = e^{−t²/2},

the last integral (with the factor 1/√(2π)) being = 1.

Why = 1? Imagine we had µ ∈ R instead of it in e^{−(x − it)²/2}, OK? Formal proof: Cauchy’s thm for integrals of analytic f’ns over closed contours. [Picture.]
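These closed forms can be checked by direct numerical integration. A sketch (simple midpoint rule; the grid sizes, the value of t and the truncation of the normal integral to [−10, 10] are my illustrative choices):

```python
# Verifying two ChFs above numerically: E e^{itX} as an integral of
# e^{itx} times the density, approximated by a midpoint rule.
import cmath
import math

def chf_density(f, t, a, b, m=20000):
    # midpoint approximation of the integral of e^{itx} f(x) over [a, b]
    h = (b - a) / m
    return sum(cmath.exp(1j * t * (a + (k + 0.5) * h)) * f(a + (k + 0.5) * h)
               for k in range(m)) * h

t = 1.7
# U[0,1]: phi(t) = (e^{it} - 1)/(it)
num = chf_density(lambda x: 1.0, t, 0.0, 1.0)
assert abs(num - (cmath.exp(1j * t) - 1) / (1j * t)) < 1e-6

# N(0,1): phi(t) = e^{-t^2/2}; the tails beyond [-10, 10] are negligible
num = chf_density(lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi),
                  t, -10.0, 10.0, m=50000)
assert abs(num - math.exp(-t * t / 2)) < 1e-6
```

The same helper also confirms that ϕ is complex-valued for the (asymmetric) uniform on [0, 1] but real for N(0, 1), foreshadowing the symmetry remark below.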

162

Prpn. If Y = aX + b, where a, b ∈ R are constants, then ϕY(t) = e^{itb} ϕX(at).

I Obvious: ϕY(t) = Ee^{it(aX+b)} = Ee^{it(aX)} e^{itb} = e^{itb} Ee^{i(ta)X} = e^{itb} ϕX(at).

Ex. If Y ∼ U[a, b], then Y =d a + (b − a)X, X ∼ U[0, 1], and so

ϕY(t) = e^{ita} ϕX((b − a)t) = e^{ita} × (e^{i(b−a)t} − 1)/(i(b − a)t) = (e^{ibt} − e^{iat})/(i(b − a)t).

Can verify by a direct calculation: ϕY(t) = ∫ₐᵇ e^{itx} · 1/(b − a) dx.

Ex. If X ∼ N(µ, σ²), then X =d µ + σZ, Z ∼ N(0, 1), and so

ϕX(t) = e^{itµ} ϕZ(σt) = e^{itµ} e^{−(σt)²/2} = exp{itµ − σ²t²/2}.

163

Recall: for z = x + iy ∈ C (x, y ∈ R are the real and imaginary parts of z), its complex conjugate is z̄ := x − iy [Picture.]; the conjugate of z₁ + z₂ is z̄₁ + z̄₂ (obvious).

NB: the conjugate of e^{iu} is e^{−iu} (e.g. from Euler’s formula) and so, writing ϕ̄ for the conjugate,

ϕ̄X(t) = conjugate of ∫ e^{itx} dF_X(x) = ∫ e^{−itx} dF_X(x) = ϕX(−t) = ϕ_{−X}(t).

Thus we established the following

Prpn. ϕ̄X(t) = ϕX(−t) = ϕ_{−X}(t).

This will prove quite handy. For instance, note that if X is a symmetric RV, i.e. X =d −X, then we obtain ϕX(t) = ϕ_{−X}(t) = ϕ̄X(t) by the Prpn.

Now what does z = z̄ mean? It means that z ∈ R, so that the ChF of a symmetric RV X is always real-valued. We will see later that the converse is true as well!

164

Prpn [6.2]. Any ChF is uniformly continuous.^a

I Fix an arbitrary ε > 0, consider I_N := [−N, N]. Then, ∀t, h ∈ R,

|ϕX(t + h) − ϕX(t)| = |Ee^{i(t+h)X} − Ee^{itX}| = |E(e^{i(t+h)X} − e^{itX})| = |Ee^{itX}(e^{ihX} − 1)| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|  (as |e^{itX}| = 1)

= E(|e^{ihX} − 1|; X ∈ I_N) + E(|e^{ihX} − 1|; X ∉ I_N) ≤ |h|N + 2P(X ∉ I_N),

using |e^{ihX} − 1| ≤ |hX| ≤ |h|N on {X ∈ I_N} and |e^{ihX} − 1| ≤ |e^{ihX}| + |1| = 2 on the complement.

Now first choose N so large that the 2nd term is < ε/2, and then the whole thing will be < ε for |h| < ε/(2N) (regardless of the value of t!). This proves uniform continuity. Bingo.

^a Re uniform continuity, see sl. 149.

165

Thm [6.4]. If X and Y are independent RVs, then ϕ_{X+Y}(t) = ϕX(t)ϕY(t).

NB1: Not ϕX(t) + ϕY(t)!! This isn’t even a ChF! (Why?)

NB2: Products are MUCH easier to compute than convolutions!

I ϕ_{X+Y}(t) = Ee^{it(X+Y)} = E e^{itX} e^{itY} = Ee^{itX} Ee^{itY} = ϕX(t)ϕY(t), using independence. Good.

Ex. If X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y) are independent, then

ϕ_{X+Y}(t) = exp{it(µ_X + µ_Y) − (t²/2)(σ²_X + σ²_Y)}.

If we had the uniqueness result (i.e. DF ←(1−1)→ ChF), that would mean that X + Y ∼ N(µ_X + µ_Y, σ²_X + σ²_Y). A bit later.

NB: Try to use convolution to derive that result. No fun.

166

Thm [6.11]. Let k ≥ 1. If E|X|^k < ∞, then ϕX(t) is k times continuously differentiable and

EX^k = i^{−k} (d^k/dt^k) ϕX(t) |_{t=0}.

In particular, EX = −iϕ′X(0), EX² = −ϕ′′X(0) (when the moments exist).

I (d^k/dt^k) ϕX(t) = (d^k/dt^k) Ee^{itX} (?)= E (d^k/dt^k) e^{itX} = E(iX)^k e^{itX} = i^k EX^k e^{itX}, which turns into i^k EX^k when we put t := 0. We just need to justify (?)=: this follows from the Dominated Convergence Theorem (sl. 154), and we need E|X|^k < ∞ for that. Good.

The converse is true for even k and “almost true” for odd k. Thus, if ϕX(t) is twice differentiable at zero, then EX² < ∞ (and hence ϕX(t) is everywhere twice differentiable, cf. Thm [6.12]). Ex: N(µ, σ²).

Thus, the smoother ϕX(t), the lighter the “tails” of F_X, and the other way around!
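The moment formulas can be sanity-checked with finite differences. A sketch for X ∼ B(p), whose ChF ϕ(t) = 1 + p(e^{it} − 1) was computed above (the step sizes and tolerances are my choices):

```python
# EX = -i phi'(0) and EX^2 = -phi''(0), illustrated for X ~ B(p),
# differentiating phi numerically at t = 0.
import cmath

p = 0.3
phi = lambda t: 1 + p * (cmath.exp(1j * t) - 1)

h = 1e-6
dphi0 = (phi(h) - phi(-h)) / (2 * h)          # central difference, ~ phi'(0) = i p
EX = (-1j * dphi0).real
assert abs(EX - p) < 1e-8                     # EX = p for B(p)

h2 = 1e-4
d2phi0 = (phi(h2) - 2 * phi(0) + phi(-h2)) / h2 ** 2   # ~ phi''(0) = -p
EX2 = (-d2phi0).real
assert abs(EX2 - p) < 1e-6                    # EX^2 = p, as X^2 = X for B(p)
```

For B(p) both answers equal p since X² = X there; the point is only that the derivatives of ϕ at 0 encode the moments.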

167

Inversion Formulae and Uniqueness

Thm [6.7]. If ∫ |ϕX(t)| dt < ∞ then X has a continuous density given by

f_X(x) = (1/2π) ∫ e^{−itx} ϕX(t) dt.

This is a general result from Fourier analysis.

Meaning (a bit loose): One can think of f_X as a sum of harmonic oscillations with different frequencies t. When computing ϕX(t), we find how strong the contribution of the oscillations at frequency t to f_X is (it’s like an orthogonal basis expansion of a vector). Then we can “assemble” f_X back from these oscillations, and this is what the inversion formula does.

NB: Thm [6.7] implies that there is a one-to-one correspondence between distributions and their ChFs (at least in the AC case, when the ChF is integrable, but it’s true always; we’ll discuss that a bit later).

Summary: Two different distributions cannot have the same ChF!

168

Ex. Consider ϕX(t) = sin t/t. How do we know that this is a ChF? See Ex. on sl. 164: the ChF of U[−1, 1] is

(e^{it·1} − e^{it·(−1)})/(it(1 − (−1))) = (e^{it} − e^{−it})/(2it) = sin t/t

by Euler’s formula. But clearly ∫ |sin t/t| dt = ∞ (why?), so we cannot use the inversion formula. No wonder though, as otherwise X would have a continuous density, which is wrong.

Ex. Consider ϕX(t) = 1 + p(e^{it} − 1). Can we use the inversion formula?

NB: If X ∈ Z then ϕX(2πk) = Ee^{i2πkX} = 1 for k ∈ Z. So, for such X, ϕX(t) ↛ 0 as |t| → ∞, whereas for AC X always ϕX(t) → 0 [Lebesgue thm].

Ex. Consider ϕX(t) = e^{−t²/2} (this corresponds to X ∼ N(0, 1), see sl. 163). The integrability condition is clearly met, so there exists a continuous density

f_X(x) = (1/2π) ∫ e^{−itx} e^{−t²/2} dt = (1/√(2π)) e^{−x²/2}  (cf. sl. 163), good!

169

Ex. Compute the ChF of X ∼ E(1) (do you see how to extend this to E(λ)?):

ϕX(t) = ∫₀^∞ e^{itx} e^{−x} dx = ∫₀^∞ e^{−(1−it)x} dx = (−1/(1 − it)) [e^{−(1−it)x}]₀^∞ = 1/(1 − it).

Can one apply the inversion formula? Nope, as ∫ |ϕX(t)| dt = ∞, so we cannot use it. No wonder though, as otherwise X would have a continuous density, which is wrong.

Now consider the double exponential distribution (the first Laplace distr’n):

f_X(x) = (1/2) e^{−|x|}, x ∈ R.

NB: this is a mixture of E(1) and its “mirror reflection” (the distr’n of −Y, Y ∼ E(1)), with equal weights, so

ϕX(t) = ∫ e^{itx} ((1/2)f_Y(x) + (1/2)f_{−Y}(x)) dx = (1/2)(ϕY(t) + ϕ_{−Y}(t)) = (1/2)(1/(1 − it) + 1/(1 + it)) = 1/(1 + t²).

170

NB: This ϕX(t) is already integrable on R, so the inversion formula is applicable, and hence

(1/2) e^{−|x|} = (1/2π) ∫ e^{−itx}/(1 + t²) dt.

If we replace here x ↔ t, and then t with −t, the result can be re-written as

e^{−|t|} = ∫ e^{itx}/(π(1 + x²)) dx.

That is, the ChF of the standard Cauchy distribution is e^{−|t|}.

Try to compute it directly: no fun. Also, note that it is NOT differentiable at zero, so there is no way for the first moment of the distribution to be finite. Good.

171

Q: Is there any way to invert the ChF when ∫ |ϕX(t)| dt = ∞?

A: Yes. Look: assuming for a moment that we can use Thm [6.7],

F_X(y) − F_X(x) = ∫ₓ^y f_X(u) du = ∫ₓ^y [(1/2π) ∫ e^{−itu} ϕX(t) dt] du

(?)= (1/2π) ∫ [∫ₓ^y e^{−itu} ϕX(t) du] dt = (1/2π) ∫ [ϕX(t) ∫ₓ^y e^{−itu} du] dt

= (1/2π) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕX(t) dt, (1)

where one can justify (?)=. The RHS makes sense when ∫ |ϕX(t)/t| dt < ∞, and then the LHS still equals the RHS. Formal proof: if Z ∼ N(0, 1) is independent of X, ε > 0, then ϕ_{X+εZ}(t) = ϕX(t) e^{−ε²t²/2} is integrable on R, the inversion f’la applies, and (1) holds with X + εZ instead of X. Then pass to the limit as ε → 0.

In the general case, the resulting f’la just has lim_{ε→0} on the RHS. Or replace the ∫ on the RHS with lim_{T→∞} ∫_{−T}^{T} (the “principal value integral”).

172

The most important conclusion: uniqueness holds in the general case as well. To different DFs there correspond different ChFs! (Cf. (1)!)

Now we can claim that if, say, the ChF of X is e^{iµt − σ²t²/2}, then we must have X ∼ N(µ, σ²) (cf. Ex. on sl. 167).

Also, now we can claim that if X, Y ∼ E(1) are independent then since, according to sl. 171,

ϕ_{X−Y}(t) = ϕX(t) ϕ−Y(t) = ϕX(t) ϕY(−t) = (1/(1−it)) × (1/(1+it)) = 1/(1+t²),

the difference X − Y must follow the double exponential distribution.(a) One can discover that using convolution, but ChFs make things much easier.

And now we can assert that if ϕX(t) is real-valued, then ϕX(t) = ϕ−X(t) (see sl. 165) implies that X =d −X, i.e. the distribution of X is symmetric (as we promised to justify).

(a) Here X − Y has the mixture distribution ½ E(1) + ½ (−E(1)) (gross notation abuse!!), due to the memoryless property of E(1) (use the TPF).

173
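The claim that X − Y is double exponential is easy to probe by simulation (a sketch; the names and sample sizes are mine). For the Laplace density ½ e^{−|x|}, the tail is P(X − Y > a) = ½ e^{−a} for a ≥ 0:

```python
import math
import random

random.seed(1)
N = 200_000
# X, Y independent E(1); their difference should be double exponential
diffs = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(N)]

# compare empirical tails with the Laplace tail (1/2) e^{-a}, a >= 0
for a in (0.0, 0.5, 1.0, 2.0):
    emp = sum(d > a for d in diffs) / N
    print(a, emp, 0.5 * math.exp(-a))
```

With 2·10⁵ replications the empirical and theoretical tails agree to about three decimals.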

NB: Using the same argument as in Thm [6.11] (sl. 168), one can show that if ∫ |t^k ϕX(t)| dt < ∞, then X has a k times continuously differentiable density. Thus, the smoother fX(x), the lighter the “tails” of ϕX(t), and the other way around! Cf. remark on sl. 168.

NB: Sums of independent RVs have smoother distributions than the summands, as |ϕX(t)ϕY(t)| decays as t → ±∞ faster than either of the factors.

Ex. Sums of i.i.d. Xj ∼ U[−1, 1]: for Sn := X1 + · · · + Xn, using sl. 170,

ϕS1(t) = sin t / t,  ϕS2(t) = sin² t / t²  (so that S2 has a continuous density),
ϕS3(t) = sin³ t / t³  (so that S3 has a continuously differentiable density) etc.

[Picture.]

Ex. What can one say about the distribution of X + Y, where X and Y are independent, X ∈ Z and Y is AC?

Ex. The sum of two independent singular RVs can be AC.

174
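One can watch the ChF (sin t/t)^n appear empirically (a sketch; the helper name `emp_chf` and the sizes are mine): the empirical ChF of S2 = X1 + X2, Xj ∼ U[−1, 1], should track (sin t/t)².

```python
import math
import random

random.seed(2)
N = 200_000
# S2 = X1 + X2 with Xj ~ U[-1, 1]; its ChF should be (sin t / t)^2
s2 = [random.uniform(-1, 1) + random.uniform(-1, 1) for _ in range(N)]

def emp_chf(sample, t):
    # the law of S2 is symmetric, so Ee^{itS2} is real: average cos(t x)
    return sum(math.cos(t * x) for x in sample) / len(sample)

for t in (0.5, 1.0, 2.0):
    print(t, emp_chf(s2, t), (math.sin(t) / t) ** 2)
```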

Continuity Theorems & Applications

One of the great things about ChFs is that there is a very simple & useful relationship between convergence in distribution and that of ChFs.

Thm [6.15] As n → ∞, Xn →d X ⇔ ∀t ∈ R, ϕXn(t) → ϕX(t).

▶ ⇒) Obvious: note that ϕXn(t) = Ef(Xn), where f(x) = e^{itx} is a bounded continuous function of x (we keep t fixed here), and recall Thm [5.8], sl. 137.

⇐) Not so obvious. What happens is essentially this: the collection of functions f(x) = e^{itx} for different t ∈ R is rich enough to ensure that convergence Ee^{itXn} → Ee^{itX} for all t implies that Ef(Xn) → Ef(X) for all bounded continuous f (we can take linear combinations of e^{it_j x} for collections of tj's to approximate general f's etc).

Or one can use the inversion formula. Thus, assuming for simplicity that |ϕXn(t)|, |ϕX(t)| ≤ g(t) for some bounded function g(t) s.t. g(t)/t is integrable at ±∞, we can use the formula from sl. 173 to write:

175

FX(y) − FX(x) = (1/(2π)) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕX(t) dt
 = (1/(2π)) ∫ ((e^{−itx} − e^{−ity})/(it)) lim_{n→∞} ϕXn(t) dt
 ∗= lim_{n→∞} (1/(2π)) ∫ ((e^{−itx} − e^{−ity})/(it)) ϕXn(t) dt
 = lim_{n→∞} (FXn(y) − FXn(x)),   (∗)

where the step ∗= is justified by the Dominated Convergence Theorem (cf. sl. 154). Good.

Is it clear that when convergence (∗) holds then Xn →d X? If yes, it's bingo.

176

Q: What if we just know that ϕXn(t) → ϕ(t) as n → ∞, where ϕ is some function (we don't know if it's a ChF of some distribution)?

A: Then bad things can happen, and we can say when they do happen!

Thm [6.17] If ∀t ∈ R one has ϕXn(t) → ϕ(t) as n → ∞, and ϕ(t) is continuous at t = 0, then ϕ(t) is the ChF of some RV X and Xn →d X.

▶ The proof is somewhat beyond the scope of this course, so we'll leave it for the future. At the moment: just a few words re WHAT happens when the limiting function ϕ is discontinuous at t = 0.

Ex. Let Xn ∼ U[−n, n]. Then (sl's 164, 170)

ϕXn(t) = sin(nt)/(nt)  { → 0, t ≠ 0;  ≡ 1, t = 0. }

The limiting f'n ϕ(t) = 1(t = 0) is discont's at 0, and so cannot be a ChF.

In this example, the probability “escapes to infinity”, and it is in such situations that one obtains a limit for ϕXn that is discont's at 0. Look e.g. at a similar situation where Xn ∼ N(0, n). [Write down ϕXn and see what happens!]

177

NB: Assertions similar to Thms [6.15], [6.17] hold for Laplace transforms lX(t) := Ee^{−tX} (popular tools when X ≥ 0) and GFs ζX(z) := Ez^X (used for X ∈ Z), as they are basically the same as the ChF.

Now how do these theorems work in applications?

We know that ChFs love addition of independent RVs: they multiply then. So the technique is well-suited for analysing situations where we add such RVs. But why do we pay so much attention to sums of RVs in PT?

When a large number of relatively small factors act together, the total effect can often be (at least, approximately) linear. Which is no wonder though, as, for a differentiable f,

f(x + εy) = f(x) + ε Σⱼ (∂f(x)/∂xⱼ) yⱼ + o(ε) as ε → 0,

so we do have a sum after all. . .

178

Thm [6.19] (WLLN) Let X1, X2, . . . be i.i.d. RVs. If E|X1| < ∞ then

Sn/n →d µ := EX1 as n → ∞ (and hence also →P, see PS–8, tute problem 1).

▶ Using the properties of ChFs, ∀t ∈ R,

ϕSn/n(t) = ϕSn(t/n) = (ϕX(t/n))ⁿ.   (∗)

Here the argument t/n → 0, so one would be inclined to expand ϕX about zero to see what happens, and we can do that since ϕX is continuously differentiable with ϕ′X(0) = iµ (due to the assumption that E|X1| < ∞). So:

ϕX(s) = ϕX(0) + ϕ′X(0)s + o(s) = 1 + iµs + o(s) as s → 0.

Now back to (∗):

ϕSn/n(t) = [1 + (iµt + o(1))/n]ⁿ → e^{itµ} = ϕµ(t),

bingo by Thm [6.15].

179
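A minimal illustration of the WLLN (a sketch; the helper name and sizes are mine): sample means of U[0, 1] variables settle near µ = 1/2 as n grows.

```python
import random

random.seed(3)

def sample_mean(n):
    # Xj ~ U[0, 1], so mu = 1/2
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```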

Right. This means that X̄ is a consistent estimator of µ = EX1 (by definition). This is good, but to find, say, confidence intervals or evaluate errors of tests based on X̄ (or other consistent estimators), one needs more. Namely, one needs to know the distribution (at least, approximately) of the difference X̄ − µ, which vanishes as n → ∞.

So we need a “magnifying glass” to see any patterns here, which is achieved by scaling the difference, i.e. by considering

bn(X̄ − µ) ≡ bn(Sn/n − µ) = (bn/n)(Sn − nµ)  for some bn → ∞.

For what choice of {bn} will this have a limiting distribution? It turns out that when the Xj's have a finite second moment, the right choice is bn = c√n. Re what to do when EXj² = ∞, we'll talk a bit later.

First, to the Central Limit Theorem (CLT).

180

Thm [6.20] (CLT) If, in addition to the assumptions of Thm [6.19], EXj² < ∞ and σ² := Var(X1) > 0, then

Yn := (Sn − nµ)/(σ√n) →d Z ∼ N(0, 1) as n → ∞.

▶ It suffices to show that ϕYn(t) → e^{−t²/2}. First we standardise the Xj's by setting X̃j := (Xj − µ)/σ. Then clearly the X̃j are i.i.d. with EX̃j = 0, EX̃j² = 1, and Yn = S̃n/√n, where S̃n := X̃1 + · · · + X̃n. Therefore

ϕYn(t) = ϕ_{S̃n/√n}(t) = ϕS̃n(t/√n) = (ϕX̃1(t/√n))ⁿ.   (∗)

Here t/√n → 0 as n → ∞, and we will again expand ϕX̃1 about zero, but as EX̃j² < ∞, the ChF ϕX̃1 is twice differentiable and we have one more term in Taylor's series:

ϕX̃(s) = ϕX̃(0) + ϕ′X̃(0)s + ½ ϕ″X̃(0)s² + o(s²) = 1 − (s²/2)(1 + o(1)),

since ϕX̃(0) = 1, ϕ′X̃(0) = iµ̃ = 0 and ϕ″X̃(0) = −EX̃² = −1.

181

Now back to (∗):

ϕYn(t) = [1 − ½ (t/√n)² (1 + o(1))]ⁿ = [1 − t²(1 + o(1))/(2n)]ⁿ → e^{−t²/2}

as n → ∞. Bingo.

NB: Using the same techniques, one can extend this to non-identically distributed independent RVs Xj. Just need to be careful: one needs an additional condition, e.g. the Lyapunov condition: assuming all EXj = 0 (no big deal), one has

Bn⁻³ Σ_{j=1}^n E|Xj|³ → 0 as n → ∞, where Bn² := Var(Sn) = Σ_{j=1}^n EXj²

(the condition ensures that, in the limit, all the Xj's are “negligibly small” compared to the sum Sn). Then for ϕXj one can use Taylor's expansion with three terms (the third moment is finite!) and use the condition to show that ϕ_{Sn/Bn}(t) → e^{−t²/2}, so that Sn/Bn →d Z ∼ N(0, 1) (you can DIY!).

182
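The CLT can be probed by simulation (a sketch; the choice of E(1), for which µ = σ = 1, and all sizes are mine): the standardised sums Yn should put roughly 95% of their mass in [−1.96, 1.96].

```python
import math
import random

random.seed(4)
n, reps = 200, 5000
mu = sigma = 1.0  # for E(1), EX1 = 1 and Var X1 = 1

def y_n():
    # Yn = (Sn - n mu) / (sigma sqrt(n)) for a sum of n i.i.d. E(1) RVs
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

ys = [y_n() for _ in range(reps)]
cover = sum(abs(y) <= 1.96 for y in ys) / reps
print(cover)  # for Z ~ N(0,1), P(|Z| <= 1.96) ≈ 0.95
```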

Now what if EXj² = ∞ (in the i.i.d. case)? If µ := EX1 is still finite and hence X̄ ≡ Sn/n →P µ, there can still exist another scaling seq'ce bn → ∞ such that bn(X̄ − µ) →d something. For this, one needs “regular variation” of the tails of FX:

F(−x) = x^{−α} L−(x),  1 − F(x) = x^{−α} L+(x),

where α ∈ [1, 2], L± are “slowly varying” in the sense that L(vx)/L(x) → 1 as x → ∞ for fixed v (Ex: ln x, but not x^a), and lim_{x→∞} (1 − F(x))/F(−x) = c ∈ [0, ∞].

Then with bn = n^{1−1/α} l(n), l being another slowly varying f'n, the limiting distribution will be one of the so-called stable laws. When E|X1| = ∞, under the above conditions on the tails (with α ∈ (0, 1]), n^{−1/α} l(n) Sn will converge in distribution. Technically, it's much harder to prove than the CLT.

Important difference: roughly speaking, the contributions of individual Xj's to the sum Sn are all negligibly small in the case of the CLT, whereas in the case of convergence to a non-normal stable distribution, the main contribution to Sn comes from a small proportion of the Xj's (the largest ones!).

183

Ex. The Cauchy distribution with density f(x) = 1/(π(1 + x²)) has ChF ϕ(t) = e^{−|t|}. Therefore, in this case

ϕSn/n(t) = (ϕ(t/n))ⁿ = (e^{−|t/n|})ⁿ = e^{−|t|} = ϕ(t),

i.e. Sn/n =d X1. Wow!!

Thus, say, using the sample mean X̄ to estimate the parameter θ in the location family of densities fθ(x) = 1/(π(1 + (x − θ)²)) would be meaningless, as X̄ =d X1 (no gain in precision compared to a single observation). But: the sample median X(n/2) would work (later about that).

The Cauchy distribution is an example of a non-normal stable distribution.

184
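This total absence of averaging is striking in simulation (a sketch; the helper `cauchy` and the sizes are mine). Since Sn/n =d X1 and P(|X1| > 1) = 1/2 for the standard Cauchy law, about half of the sample means of size n = 500 still land outside [−1, 1]:

```python
import math
import random

random.seed(5)

def cauchy():
    # standard Cauchy via the quantile transform tan(pi (U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

reps, n = 4000, 500
means = [sum(cauchy() for _ in range(n)) / n for _ in range(reps)]

# Sn/n =d X1, so P(|sample mean| > 1) stays near P(|X1| > 1) = 1/2
frac = sum(abs(m) > 1 for m in means) / reps
print(frac)
```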

Thm [6.21] (Poisson limit theorem.) If Xn,1, . . . , Xn,n are independent RVs,

P(Xn,j = 1) = 1 − P(Xn,j = 0) = pn,  j = 1, . . . , n,

and npn → λ ∈ (0, ∞) as n → ∞, then Sn := Xn,1 + · · · + Xn,n →d Y ∼ P(λ).

▶ Here, using sl. 163,

ϕSn(t) = (ϕXn,1(t))ⁿ = (1 + pn(e^{it} − 1))ⁿ = [(1 + pn(e^{it} − 1))^{1/pn}]^{npn} → (e^{e^{it}−1})^λ = e^{λ(e^{it}−1)} as n → ∞, bingo (why?).

Q: Do we really need all Xn,j to have the same distribution here? Seems to be an overstretch, from the applications view-point. Also, do they really need to be Bernoulli RVs?

A: Nope to both. Go to the next slide for more.

185
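Rather than simulate, one can compare the two pmf's directly (a sketch; the parameter choices and helper names are mine): B(n, λ/n) probabilities against P(λ) ones.

```python
import math

n, lam = 1000, 3.0
p = lam / n

def binom_pmf(k):
    # P(Sn = k) for Sn ~ B(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def pois_pmf(k):
    # P(Y = k) for Y ~ P(lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

# pointwise gap between B(n, lam/n) and P(lam) probabilities
gap = max(abs(binom_pmf(k) - pois_pmf(k)) for k in range(21))
print(gap)
```

With n = 1000 the two pmf's already agree to roughly three decimal places at every point.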

Here is more: suppose that

P(Xn,j = k) = { 1 − pn,j − qn,j, k = 0;  pn,j, k = 1 },  P(Xn,j ∉ {0, 1}) = qn,j.

Then, omitting the subscripts n,j for brevity and assuming that p + q is small,

ϕX(t) = (1 − p − q)e^{it·0} + p e^{it·1} + q η(t) = 1 + p(e^{it} − 1) + q(η(t) − 1)
 = exp{p(e^{it} − 1) + q(η(t) − 1) + o(p + q)},

where η(t) := E(e^{itX} | X ∉ {0, 1}) and hence |η(t) − 1| ≤ |η(t)| + 1 ≤ 2.

Therefore, putting λn := Σ_{j=1}^n pn,j, we obtain

ϕSn(t) = Π_{j=1}^n ϕXn,j(t) = exp{λn(e^{it} − 1) + O(Σ_{j=1}^n qn,j) + o(λn)} → e^{λ(e^{it}−1)}

provided that λn → λ and max_{j≤n} pn,j + Σ_{j=1}^n qn,j = o(1), good!

186

What do we do in the case of RVecs?

More or less the same, but there is more fun. For X = (X1, . . . , Xd) ∈ R^d, the ChF is a function of t = (t1, . . . , td) ∈ R^d defined by

ϕX(t) := Ee^{i(t,X)} = E exp{i Σ_{j=1}^d tj Xj}.

All the results for univariate ChFs extend in a natural way to the multivariate case, including the 1–1 correspondence between distributions and their ChFs and the continuity theorems. The change under a linear transformation has this form: if Y = XA + b, where A is a d×m matrix and b ∈ R^m, then, since for the scalar product we can write (s, XA) = s(XA)⊤ = sA⊤X⊤ = (sA⊤, X), one has

ϕY(s) = Ee^{i(s,Y)} = Ee^{i(s,XA)+i(s,b)} = e^{i(s,b)} Ee^{i(sA⊤,X)} = e^{i(s,b)} ϕX(sA⊤).

187

The inversion formula will require multivariate integration; calculation of moments can be done using partial differentiation, e.g.

(∂^{k1+k2}/(∂t1^{k1} ∂t2^{k2})) ϕX(t) = i^{k1+k2} E X1^{k1} X2^{k2} e^{i(t,X)}.

Letting here t := 0 yields the mixed moment E X1^{k1} X2^{k2} (of course, for this to work we need E|X1^{k1} X2^{k2}| < ∞). And so on.

Now observe this: for a fixed unit vector b ∈ R^d, one can write, for the univariate RV Xb = (b, X) (which is the projection of X on the direction of b),

ϕXb(t) = ϕX(tb),  t ∈ R.

Meaning: knowing the ChF of X is equivalent to knowing the ChFs of all the projections of X on various directions in R^d. Switching to distributions: knowing the distribution of X is equivalent to knowing the distributions of all the projections Xb, i.e. the probabilities of all half-spaces [picture!]. Wow.

188

Now consider an i.i.d. sequence of RVecs X(1), X(2), . . . [NB: our notation here differs from that in the book: we use Xj(k) for the jth component of the kth vector X(k)], S(n) := X(1) + · · · + X(n). It is obvious that if the SLLN (or WLLN) holds for each of the components:

Sj(n)/n →a.s. µj  (Sj(n)/n →P µj, resp.)  as n → ∞,

where µj = EXj(k), then it also holds for the vectors. (Can you prove that?) So for the LLNs to hold, we just need E‖X(1)‖ < ∞. Easy.

The proof of the CLT based on the use of ChFs is also not very difficult; we'll just need to use the multivariate Taylor expansion and the formula for moments from the previous slide. Before formulating the multivariate CLT, we consider the following example.

189

Ex. For X ∼ N(µ, C²X) (the normal distribution in R^d with mean µ ∈ R^d and CovM C²X, see sl. 87–89), the ChF can be computed from the representation X =d ZA + µ, where Z ∼ N(0, I) has the standard d-dim normal distribution (it has i.i.d. N(0, 1) components) and A ∈ R^{d×d} is a “square root” of the (nonnegative-definite) matrix C²X: A⊤A = C²X.(a) First we compute

ϕZ(t) = E exp{i Σ_{j=1}^d tj Zj} = Πⱼ Ee^{itjZj} = e^{−Σⱼ tj²/2} = e^{−tt⊤/2}.

Using the transformation formula from the bottom of sl. 188, we now obtain

ϕX(t) = e^{i(t,µ)} ϕZ(tA⊤) = e^{i(t,µ)} e^{−(tA⊤)(tA⊤)⊤/2} = e^{i(t,µ) − tC²Xt⊤/2}.

BTW, this can be taken as the general definition of the multivariate normal distribution (whether it has a density or not, i.e. whether it is concentrated on the whole space R^d or on a subspace of lower dimensionality).

(a) One way of finding such an A is to use the Cholesky decomposition.

190

Now return to our multivariate random walk S(n) := X(1) + · · · + X(n), where X(1), X(2), . . . are i.i.d. RVecs with E‖X(1)‖² < ∞, so that (i) X(1) has a finite mean (vector) µ := EX(1), and (ii) the CovM C²X of X(1) exists.

We already know that S(n)/n →P µ as n → ∞; what about getting more detail on this convergence (as given by the CLT in the univariate case)? NB: Cov(S(n) − µn) = n Cov(X(1)).

As we pointed out before (sl. 190), using the multivariate Taylor formula, we obtain the following analog of the classical CLT. The proof is almost identical to the one in the univariate case.

Thm (Multivariate CLT). Under the above assumptions, as n → ∞,

√n (S(n)/n − µ) ≡ (S(n) − µn)/√n →d Y ∼ N(0, C²X).

In the case of non-identically distributed X(j)'s, there is more fun, of course.

191

NB: Here and in all our earlier Limit Theorems, one can estimate how fast the resp. convergence is. For example, in the case of the univariate CLT for i.i.d. summands, under additional assumptions, one can give an upper bound for |Fn(x) − Φ(x)|, where Fn is the DF of the std'd Sn and Φ is the DF of N(0, 1). Thus, if γ := E|X1 − µ|³ < ∞, then the difference won't exceed 0.4748γ/(σ³√n) (for any x; Berry–Esseen Thm). This bound is unimprovable (well, the constant might be made smaller, but not smaller than 0.4098) [Picture!]. Of course, 1/√n → 0 pretty slowly; if unhappy with that approximation rate, one can go further (from, say, Fn(x) ≈ Φ(x) to Fn(x) ≈ Φ(x) + (E(X1 − µ)³/(6σ³√n))(1 − x²)Φ′(x), which has an error O(1/n) under proper conditions, etc.).

Ex. CLT for multinomial distributions. We have d bins and conduct the following multi-stage RE: at each step, a ball is placed in a randomly chosen bin, P(jth bin selected) = pj > 0 independently of the past, Σ_{j=1}^d pj = 1.

Consider the RVec S(n) whose components are the counts Sj(n) := # of balls in bin j after n steps. Describe the behaviour of S(n) as n → ∞.

192

It is clear that S(n) = Σ_{k=1}^n X(k), where the X(k) ∈ R^d are i.i.d. RVecs with

X(1) = (1, 0, 0, . . . , 0) =: e1 w.p. p1,  (0, 1, 0, . . . , 0) =: e2 w.p. p2,  . . . ,  (0, 0, 0, . . . , 1) =: ed w.p. pd.

We see that the components Xj(1) are dependent B(pj)-RVs. For any fixed j, however, the RVs Xj(1), Xj(2), Xj(3), . . . are i.i.d., and so by the SLLN one has Sj(n)/n →a.s. pj. We conclude that

S(n)/n →a.s. p := (p1, p2, . . . , pd) as n → ∞.

Since ‖X(1)‖ ≡ 1, the Multivariate CLT clearly holds for S(n). We just need to find the CovM of X(1), which will give us the covariance matrix of the limiting normal distribution.

193

As X(1) is always one of the coordinate vectors e1, . . . , ed, one has

Xj(1)Xk(1) = δjk Xj(1) ⇒ EXj(1)Xk(1) = δjk EXj(1) = δjk pj,

where δjk := 1(j = k) is Kronecker's delta, so that

Cov(Xj(1), Xk(1)) = EXj(1)Xk(1) − EXj(1)EXk(1) = δjk pj − pj pk.

Thus the CovM of X(1) is shown to be

D²p := diag(pj) − p⊤p.

From the Multivariate CLT we conclude that, as n → ∞,

Wn := √n (S(n)/n − p) = (S(n) − np)/√n →d W ∼ N(0, D²p).   (∗)

This result has immediate important implications for statistics! Suppose Y1, . . . , Yn is an i.i.d. sample of RVs with a common (say, univariate and unknown to us) DF G. We want to test the hypothesis H0 = {G = F}, where F is some hypothesized DF (‘goodness-of-fit’ testing).

194

One way to do that is to use the χ²-test. What is that about?

Partition R into d intervals: (−∞, t1], (t1, t2], . . . , (td−2, td−1], (td−1, ∞). Denote by Sj(n) the # of sample points Yk that fell into the jth partition interval. Clearly, the RVec S(n) follows the multinomial distribution with parameters (n, p), where, under the hypothesis H0, p1 := F(t1), p2 := F(t2) − F(t1), . . . , pd−1 := F(td−1) − F(td−2), pd := 1 − F(td−1). [Picture!]

From the previous slide, (∗) will hold. How can one use that for testing H0?

Well, by Thm [5.23] (sl. 148), for any continuous function g(x), the distribution of g(Wn) will converge to that of g(W). In particular, this will hold if we take g(x) := ‖x‖² = Σ_{j=1}^d xj²:

‖Wn‖² →d ‖W‖².

As we know the distr'n of W under H0 (it's N(0, D²p)), we can find that of ‖W‖² and use it to construct a test of a(n approximately, for large n) given type I error (that of rejecting H0 when it's true), e.g. ‖Wn‖² > r.

195

It would be OK if it were not for the fact that, to use the test for different F's and partitions, each time we would have to compute the distr'n of ‖W‖² (as it depends on p), which is a nightmare. Can we get around it?

Yes. There is a simple modification of the test procedure that makes the (asymptotic, as n → ∞) distribution of the test statistic independent of both p and F! Such tests are called “asymptotically distribution-free”.

This is what we will do: consider

X̃(1) := ((X1(1) − p1)/√p1, . . . , (Xd(1) − pd)/√pd) ∈ R^d,  S̃(n) := X̃(1) + · · · + X̃(n).

Then

S̃(n)/√n = ((S1(n) − np1)/√(np1), . . . , (Sd(n) − npd)/√(npd)) = ((S(n) − np)/√n) B  with B := diag(pj^{−1/2}),

and so

‖S̃(n)/√n‖² = Σ_{j=1}^d (Sj(n) − npj)²/(npj) =: H(n)

will be the famous χ²-statistic.

196
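A small simulation of H(n) (a sketch; the bin probabilities and sizes are my arbitrary choices). A quick check uses the mean: since E(Sj(n) − npj)² = npj(1 − pj), one has EH(n) = d − 1 exactly for every n, matching Eχ²_{d−1} = d − 1:

```python
import random

random.seed(6)
p = [0.2, 0.3, 0.1, 0.4]  # d = 4 bins (arbitrary choice)
d, n, reps = len(p), 1000, 2000

def h_statistic():
    # drop n balls into the bins, then form the chi-square statistic
    counts = [0] * d
    for _ in range(n):
        u, j, acc = random.random(), 0, p[0]
        while u > acc and j < d - 1:
            j += 1
            acc += p[j]
        counts[j] += 1
    return sum((counts[j] - n * p[j]) ** 2 / (n * p[j]) for j in range(d))

hs = [h_statistic() for _ in range(reps)]
mean_h = sum(hs) / reps
print(mean_h)  # E H(n) = d - 1 = 3 exactly, for every n
```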

It follows from relation (∗) (sl. 195) and Thm [5.23] that

S̃(n)/√n →d WB ∼ N(0, B⊤D²pB),

with B⊤D²pB = diag(pj^{−1/2})(diag(pj) − p⊤p) diag(pj^{−1/2}) = Id − b⊤b, where Id is the unit matrix in R^d, and b := (√p1, . . . , √pd) is a unit vector: ‖b‖² = Σ_{j=1}^d (√pj)² = 1. So, again by Thm [5.23], as n → ∞,

H(n) = ‖S̃(n)/√n‖² →d ‖WB‖²,  WB ∼ N(0, Id − b⊤b).   (∗∗)

Now consider Z ∼ N(0, Id) and form the difference

Z − (b, Z)b.

This is the projection of Z onto the (d − 1)-dim hyperplane orthogonal to the vector b. About this RVec we will make the following two observations (i), (ii):

197

(i) WB =d Z − (b, Z)b.

Indeed, both are normal vectors in R^d with zero means, whereas the covariance matrix of the latter vector equals

C²_{Z−(b,Z)b} = E(Z − (b, Z)b)⊤(Z − (b, Z)b) = E(Z(Id − b⊤b))⊤ Z(Id − b⊤b)
 = E(Id − b⊤b)⊤Z⊤Z(Id − b⊤b) = (Id − b⊤b)⊤(EZ⊤Z)(Id − b⊤b)
 = (Id − b⊤b)⊤(Id − b⊤b) = (Id − (b⊤b)⊤)(Id − b⊤b)
 = (Id − b⊤b)(Id − b⊤b) = Id² − Id b⊤b − b⊤b Id + (b⊤b)²
 = Id − 2b⊤b + b⊤(bb⊤)b = Id − b⊤b = C²_{WB},

since bb⊤ = Σ(√pj)² = 1. As multivariate normal distributions are uniquely specified by their mean vectors and covariance matrices, the assertion follows.

198

(ii) As N(0, Id) is invariant w.r.t. rotations, the “nature” of the distr'n of the projection Z − (b, Z)b will be the same as for the projection of Z onto any other (d − 1)-dim hyperplane, e.g. Z∗ := (Z1, . . . , Zd−1) ∈ R^{d−1}. Indeed:

Prpn [∗] If Z ∼ N(0, Id) and b1, . . . , bd is an orthonormal system in R^d, then Y := ((b1, Z), . . . , (bd, Z)) ∼ N(0, Id) as well.

▶ Compute the ChF of Y: for t = (t1, . . . , td) ∈ R^d, recalling from sl. 191 that the ChF of Z is ϕZ(t) = e^{−tt⊤/2} and using the fact that

‖Σ_{j=1}^d tj bj‖² = (Σ_{j=1}^d tj bj, Σ_{k=1}^d tk bk) = Σ_{j=1}^d Σ_{k=1}^d tj tk bj bk⊤ = tt⊤  (as bj bk⊤ = δjk),

we have

ϕY(t) = Ee^{itY⊤} = E exp{i Σ_{j=1}^d tj(bj, Z)} = E exp{i(Σ_{j=1}^d tj bj, Z)}
 = exp{−½ ‖Σ_{j=1}^d tj bj‖²} = e^{−tt⊤/2} = ϕZ(t), bingo!

199

The desired assertion (ii) follows, as, taking b1, . . . , bd s.t. bd = b and noting that Z = Σ_{j=1}^d (bj, Z)bj, we have Z − (b, Z)b = Σ_{j=1}^{d−1} (bj, Z)bj, which is a (d − 1)-dim RVec with i.i.d. comp's (bj, Z) ∼ N(0, 1), like Z∗ = (Z1, . . . , Zd−1). Therefore

‖Z − (b, Z)b‖² =d ‖Σ_{j=1}^{d−1} (bj, Z)bj‖² =d ‖Z∗‖² = Σ_{j=1}^{d−1} Zj² ∼ χ²_{d−1}

by the definition of the χ²-distribution with d − 1 degrees of freedom.

Now from convergence (∗∗) and properties (i), (ii) we conclude that H(n) →d ‖WB‖² =d ‖Z − (b, Z)b‖² ∼ χ²_{d−1}. Thus the limiting distribution depends neither on F nor on the partition used! Very handy: one can use the same χ²-distribution for any such goodness-of-fit test with the same number of “bins” (in a much more general context than testing our H0!).

200

BTW, basically the same argument shows that, for i.i.d. Xj ∼ N(0, 1), one has Σ_{j=1}^n (Xj − X̄)² ∼ χ²_{n−1}. Going just a bit further, one can also notice that this sum of squares and X̄ are independent RVs.

This follows from Prpn [∗]: just take an orthonormal system b1, . . . , bn in R^n s.t. bn = (n^{−1/2}, . . . , n^{−1/2}) (clearly, ‖bn‖ = 1). Then X̄ = n^{−1/2}(bn, X) and, as X = Σ_{j=1}^n (bj, X)bj, one has

Σ_{j=1}^n (Xj − X̄)² = ‖X − (bn, X)bn‖² = ‖Σ_{j=1}^{n−1} (bj, X)bj‖²,

which is independent of (bn, X) ∼ N(0, 1) (and has the same distribution as Σ_{j=1}^{n−1} Xj² ∼ χ²_{n−1}) by the Proposition.

This is why, in particular,

X̄ / √((1/(n−1)) Σ_{j=1}^n (Xj − X̄)²) =d (1/√n) × Xn / √((1/(n−1)) Σ_{j=1}^{n−1} Xj²),

where the distr'n of the last quotient is called the t-distribution with n − 1 degrees of freedom (used in tests/CIs for means when the variance is unknown).

201

9. Further Applications in Statistics

Empirical Distribution Functions and Empirical Processes

Suppose X1, . . . , Xn is an i.i.d. sample, the DF F of the Xj's being (at least partly) unknown. Recall (PS–6): the vector S := (X(1), . . . , X(n)) of order statistics for the sample is an SS for F.

An alternative way of representing the information stored in S is the empirical distribution function (EDF, Def [7.26]; note that we use a slightly different notation)

F∗n(t) := (1/n) Σ_{j=1}^n 1(Xj ≤ t) ≡ Σ_{j=1}^n (1/n) 1(X(j) ≤ t),  t ∈ R.

NB: This is the DF of the random probability distribution P∗n := Σ_{j=1}^n (1/n) εXj that assigns probability 1/n to each of the sample points, and which is well-defined when the Xj's are from a general space (e.g. Xj ∈ R^d etc.).

202

It is clear that we can “extract” the values of all the order statistics from the EDF F∗n (they are just the jump points of the function), so the EDF (the whole function, not just its value at some t!) is an SS as well.

Observe that quite a few useful statistics/estimators can easily be expressed in terms of the EDF, e.g. the sample mean

X̄ = ∫ t dF∗n(t)

and the sample variance

s² := (1/n) Σ_{j=1}^n (Xj − X̄)² = (1/n) Σ_{j=1}^n Xj² − X̄² = ∫ t² dF∗n(t) − (∫ t dF∗n(t))²

are the mean and variance of the EDF, and likewise the sample median

m̂ := { X(k) if n = 2k − 1;  ½ (X(k) + X(k+1)) if n = 2k }

is the median of the EDF.

203
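These plug-in quantities are one-liners to compute (a sketch with made-up data; the names are mine):

```python
data = [2.1, 0.4, 3.3, 1.7, 0.4, 2.8, 1.0]  # made-up sample
n = len(data)

def edf(t):
    # F*_n(t) = (1/n) #{j : Xj <= t}
    return sum(x <= t for x in data) / n

# the EDF's mean, variance and median are exactly the sample versions
mean_edf = sum(data) / n
var_edf = sum(x * x for x in data) / n - mean_edf ** 2
order = sorted(data)
median_edf = order[(n - 1) // 2] if n % 2 else (order[n // 2 - 1] + order[n // 2]) / 2
print(edf(1.7), mean_edf, var_edf, median_edf)
```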

NB: Such estimators are often called “plug-in estimators” (or method-of-substitution estimators), as they estimate a parameter θ that can be expressed as θ = G(F), G being a (nice) functional (= function of a function), by plugging the EDF into G in place of F:

θ∗ := G(F∗n).

Now, if we had

• F∗n → F (in some suitable sense) as n → ∞, and

• G continuous (again, in some suitable sense),

then we would be able to establish consistency: θ∗ = G(F∗n) → G(F) = θ.

Q: So does one have F∗n → F? In what sense?

A: One does. In a very strong sense. Go to the next slide.

204

Thm [7.27] (Glivenko–Cantelli) Let X1, X2, . . . be i.i.d. RVs with a common DF F. Then, as n → ∞,

Dn := sup_t |F∗n(t) − F(t)| →a.s. 0.

▶ 1) For any fixed t ∈ R, by the SLLN (for the Bernoulli scheme, as the 1(Xj ≤ t) are i.i.d. B(F(t))-RVs),

F∗n(t) ≡ (1/n) Σ_{j=1}^n 1(Xj ≤ t) →a.s. F(t).

2) Hence, for any fixed collection −∞ = t0 < t1 < · · · < td−1 < td = ∞,

Mn := max_{k≤d} |F∗n(tk) − F(tk)| →a.s. 0.

205

3) Since both F∗n and F are non-decreasing, nothing really BAD can happen between the nodes tk: for s ∈ (tk−1, tk] and ∆kF := F(tk) − F(tk−1),

F∗n(s) − F(s)  { ≤ F∗n(tk) − F(tk−1) = F∗n(tk) − F(tk) + ∆kF;
               ≥ F∗n(tk−1) − F(tk) = F∗n(tk−1) − F(tk−1) − ∆kF. }

Thus

Dn ≡ sup_t |F∗n(t) − F(t)| ≤ Mn + max_{k≤d} ∆kF.

4) Assume for simplicity that F is continuous (if not, no big deal: one just deals with its “large” jumps separately). Then, ∀ε > 0, one can choose d large enough and {tk} s.t. max_{k≤d} ∆kF < ε/2. Now, in view of 2), we also have Mn < ε/2 for all large enough n, and then Dn < ε. Bingo.

This result, as we said, can be used to establish consistency of estimators. But wait: there is more!

206

It is tempting to use Dn for goodness-of-fit testing. To do that, one needs the distr'n of Dn, which apparently depends on F. If this were so, it would be a mess. But:

Assume that F is continuous and let Q be its quantile function, so that F(Q(t)) ≡ t, t ∈ (0, 1). Then, as we saw (sl. 49), X =d Q(U) for U ∼ U[0, 1], and so we can think of the sequence {Xj} as obtained by Q-transforming an i.i.d. sequence {Uj} with Uj ∼ U[0, 1], so that

1(Xj ≤ t) = 1(Q(Uj) ≤ t) = 1(Uj ≤ F(t)),  F∗n(t) = R∗n(F(t)),

where R∗n(u) := (1/n) Σ_{j=1}^n 1(Uj ≤ u) is the uniform EDF.

This, in particular, implies that

Dn ≡ sup_t |F∗n(t) − F(t)| = sup_t |R∗n(F(t)) − F(t)| = sup_{u∈[0,1]} |R∗n(u) − u|.

Wow! This means that the distribution of Dn is one and the same for all continuous F, which is very handy for goodness-of-fit testing.

But wait: there is more!

207

From the binomial CLT, ∀u ∈ [0, 1],

√n (R∗n(u) − u) →d V(u) ∼ N(0, u(1 − u)) as n → ∞.

From the multinomial CLT (sl. 193), for any 0 ≤ u1 < · · · < ud ≤ 1,

√n (R∗n(u1) − u1, . . . , R∗n(ud) − ud) →d (V(u1), . . . , V(ud)) ∼ N(0, C²(u)),

where u = (u1, . . . , ud) and elementary calculations show that the CovM has the form

C²(u) = [min{uj, uk}(1 − max{uj, uk})]_{j,k=1,...,d}.

In fact, if we consider {√n (R∗n(u) − u)}_{u∈[0,1]} as a random process on [0, 1] (it is called the empirical process), then it will converge in distribution (on the space C[0, 1] of continuous functions on [0, 1]) to the so-called Brownian Bridge process {V(u)}_{u∈[0,1]} (a close relative of the Brownian motion process, a.k.a. the Wiener process).

208

So what? It follows (Kolmogorov Thm) that, as n → ∞,

√n Dn =d sup_{u∈[0,1]} |√n (R∗n(u) − u)| →d max_{u∈[0,1]} |V(u)|,

and the good news is that the distribution of the RHS is known:

P(max_{u∈[0,1]} |V(u)| ≤ x) = K(x) := 1 + 2 Σ_{k=1}^∞ (−1)^k e^{−2k²x²},  x > 0.

Therefore lim_{n→∞} P(√n Dn ≤ x) = K(x), and one can use that for goodness-of-fit testing (the Kolmogorov test).

Another famous example is the von Mises–Smirnov ω²-test based on the statistic

ω²n := n ∫ [F∗n(t) − F(t)]² dF(t) =d ∫₀¹ [√n (R∗n(u) − u)]² du.

Again, because of the convergence of empirical processes (in distribution),

lim_{n→∞} P(ω²n ≤ x) = P(∫₀¹ V²(u) du ≤ x),  x > 0, which is known, etc.

209
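Both the computation of Dn (the sup is attained at the order statistics) and the distribution-free property from sl. 207 can be checked directly (a sketch; the function names are mine). Applying the quantile transform Q(u) = −ln(1 − u) of E(1) to the same uniforms leaves Dn unchanged up to rounding:

```python
import math
import random

random.seed(7)

def kolmogorov_d(sample, F):
    # Dn = sup_t |F*_n(t) - F(t)| is attained at the order statistics:
    # compare F there with the EDF values (i+1)/n (just above) and i/n (just below)
    xs, n = sorted(sample), len(sample)
    return max(max((i + 1) / n - F(x), F(x) - i / n) for i, x in enumerate(xs))

n = 400
us = [random.random() for _ in range(n)]
d_unif = kolmogorov_d(us, lambda u: u)  # Uj ~ U[0,1], F(u) = u

# Q-transform the SAME uniforms to E(1): Xj = -ln(1 - Uj), F(x) = 1 - e^{-x}
xs = [-math.log(1 - u) for u in us]
d_exp = kolmogorov_d(xs, lambda x: 1 - math.exp(-x))
print(d_unif, d_exp)  # equal up to rounding: Dn is distribution-free
```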

Asymptotic Normality & Efficiency of the Maximum Likelihood Estimator

Recall: basing on an i.i.d. sample X = (X1, X2, . . . , Xn) with the Xj's having density fθ(x) (w.r.t. some measure µ), we construct the maximum likelihood estimator (MLE) as

θ̂ = θ̂n(X) := arg max_θ fθ(X) ≡ arg max_θ Π_{j=1}^n fθ(Xj) ≡ arg max_θ L(X, θ),

where L(X, θ) := ln fθ(X) = Σ_{j=1}^n l(Xj, θ) is the log-likelihood function (here l(x, θ) = ln fθ(x), of course).

Ex. If Xj ∼ B(p) then fp(x) = p^x (1 − p)^{1−x}, x = 0, 1 (w.r.t. the counting measure on the integers), so L(X, p) = Σ_{j=1}^n [Xj ln p + (1 − Xj) ln(1 − p)] = n[X̄ ln p + (1 − X̄) ln(1 − p)] and 0 = ∂L/∂p = X̄/p − (1 − X̄)/(1 − p) ⇐⇒ p = p̂ := X̄.

210

Ex. If Xj ∼ N(µ, σ²), then fθ(x) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)} (w.r.t. the Lebesgue measure on R), θ = (µ, σ²) ∈ R × R₊. Here clearly

L(X, θ) = −(n/2) ln 2π − n ln σ − (1/(2σ²)) Σ_{j=1}^n (Xj − µ)².

Equations for the critical point(s):

0 = ∂L/∂µ = (1/σ²) Σ_{j=1}^n (Xj − µ) ≡ (n/σ²)(X̄ − µ),
0 = ∂L/∂σ = −n/σ + (1/σ³) Σ_{j=1}^n (Xj − µ)² ≡ −(n/σ³)[σ² − (1/n) Σ_{j=1}^n (Xj − µ)²].

Obvious solution (clearly a max): µ̂ = X̄, σ̂² = (1/n) Σ_{j=1}^n (Xj − X̄)².

Ex. What if Xj ∼ U(0, θ) or Xj ∼ U(θ, 1 + θ)? [Pictures.]

211
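For the normal model one can verify numerically that the closed-form (µ̂, σ̂²) indeed maximises the log-likelihood (a sketch; the simulated data and the grid of perturbations are my choices):

```python
import math
import random

random.seed(8)
data = [random.gauss(2.0, 1.5) for _ in range(300)]  # true mu = 2, sigma = 1.5
n = len(data)

def loglik(mu, sigma):
    # L(X, theta) = -(n/2) ln 2pi - n ln sigma - sum (x - mu)^2 / (2 sigma^2)
    return (-n / 2 * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n
best = loglik(mu_hat, math.sqrt(var_hat))

# the closed-form MLE beats every perturbed parameter pair
for dmu in (-0.3, 0.0, 0.3):
    for dsig in (-0.2, 0.0, 0.2):
        assert loglik(mu_hat + dmu, math.sqrt(var_hat) + dsig) <= best + 1e-9
print(mu_hat, var_hat)
```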

In all the examples, the MLEs were nice consistent estimators. Why is this so? Why does the method work? What properties can one expect from the MLEs?

Denote by ϑ the true value of the parameter θ. Then, by the SLLN,

(1/n) L(X, θ) ≡ (1/n) Σ_{j=1}^n l(Xj, θ) →a.s. Eϑ l(Xj, θ) as n → ∞

(provided that Eϑ l(Xj, θ) is finite; we assume everywhere that all relevant conditions are met, without going into detail).

So one would expect that, for large n (and this is what the word “asymptotic” refers to), the maximum of L(X, θ) will be located close to the point where Eϑ l(Xj, θ) attains its maximum (as a function of θ).

Now where is that point?

212

Thm (Gibbs' inequality) For any two densities f and g (on a common space, w.r.t. a common measure µ),

∫ f(x) ln f(x) µ(dx) ≥ ∫ f(x) ln g(x) µ(dx)   (∗)

if both integrals are finite. Here “≥” becomes “=” iff f(x) = g(x) (µ-almost everywhere, i.e. everywhere except for a set of zero µ-measure).

▶ Using ln u ≤ u − 1 (which holds for all u > 0, with “=” iff u = 1), we have

∫ f(x)(ln g(x) − ln f(x)) µ(dx) = ∫ f(x) ln(g(x)/f(x)) µ(dx)
 ∗≤ ∫ f(x)(g(x)/f(x) − 1) µ(dx) = ∫ g(x) µ(dx) − ∫ f(x) µ(dx) = 1 − 1 = 0.

Now “=” holds in (∗) iff we have “=” in the step ∗≤, which, in its turn, holds iff ln(g(x)/f(x)) = g(x)/f(x) − 1 wherever µ “can see it”, i.e. g(x)/f(x) = 1 µ-a.e. Bingo.

[Well, one needs to be a bit more careful as f(x) or g(x) can vanish, but it's OK.]

213

Thus arg max_θ Eϑ l(Xj, θ) = ϑ is the true value of the parameter!!

Ex. Let Xj ∼ P(θ): fθ(x) = e^{−θ} θ^x/x!, x = 0, 1, 2, . . . (w.r.t. the counting measure on Z), so

Eϑ l(X1, θ) = Eϑ(−θ + X1 ln θ − ln(X1!)) = −θ + EϑX1 ln θ − Eϑ ln(X1!) = −θ + ϑ ln θ − Eϑ ln(X1!).

To find the maximum, we solve (for θ) the usual equation:

0 = (∂/∂θ) Eϑ l(X1, θ) = −1 + ϑ/θ ⇐⇒ θ = ϑ.

Good.

So we expect that the MLE θ̂ will be close to the maximum point ϑ of the function y1(θ) := Eϑ l(Xj, θ):

214

[Picture: the two curves y1(θ) = ∫ fϑ(x) ln fθ(x) µ(dx) and y2(θ) = (1/n) L(X, θ), plotted against θ, with their maxima at ϑ and θ̂n respectively.]

Look: the “sharper” the peak in the curve y1(θ), the closer the maxima of the two curves should be! The curvature of y1 is given by its 2nd derivative:

(∂²/∂θ²) ∫ fϑ(x) ln fθ(x) µ(dx) = ∫ fϑ(x) (∂²/∂θ²) ln fθ(x) µ(dx)
 = ∫ fϑ(x) [f″θ(x)/fθ(x) − (f′θ(x)/fθ(x))²] µ(dx).

[For brevity, we use this convention: ′ = ∂/∂θ.]

215

Now at the point θ = ϑ of the max of y1(θ), this becomes

    ∫ f′′ϑ(x) µ(dx) − ∫ (f′ϑ(x))²/fϑ(x) µ(dx) = −I(ϑ),

where the first integral equals 0 (do you see why?) and the second one is
denoted by I(ϑ). Here I(θ) is the Fisher information that appears in the
famous Rao–Cramér (lower) bound^a for estimators’ errors: say, for unbiased
estimators θ∗n (where n indicates the i.i.d. sample size), one has

    Eϑ(θ∗n − ϑ)² ≥ 1/(n I(ϑ)).

This leads to the following plausible conclusion: the higher the value of I(ϑ),
the closer should θ̂ be to ϑ. And this is so indeed:

^a Does that ring a bell?
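For the Poisson example, the Fisher information can be verified directly: since l′(x, θ) = −1 + x/θ, one expects I(ϑ) = Eϑ(l′(X1, ϑ))² = 1/ϑ. A small numerical sketch (hypothetical ϑ = 2.5, with the negligible tail truncated):

```python
import math

theta = 2.5  # hypothetical ϑ

def pmf(x):
    # f_ϑ(x) = e^{−ϑ} ϑ^x / x! w.r.t. the counting measure
    return math.exp(-theta) * theta**x / math.factorial(x)

# l(x, θ) = −θ + x ln θ − ln(x!), hence l′(x, ϑ) = −1 + x/ϑ.
# Truncating the (astronomically small) tail at x = 60:
info = sum(pmf(x) * (-1.0 + x / theta) ** 2 for x in range(60))
print(abs(info - 1 / theta) < 1e-9)  # True: I(ϑ) = 1/ϑ for the Poisson family
```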

216

Thm [7.24+7.25] Under some regularity assumptions, the MLE θ̂ = θ̂n is
consistent: θ̂n →P ϑ as n → ∞, and asymptotically normal:

    √n (θ̂n − ϑ) →d Y ∼ N(0, 1/I(ϑ)).

In fact, convergence here holds for all moments as well:

    Eϑ θ̂n = ϑ + o(n^{−1/2}),    Eϑ(θ̂n − ϑ)² = (1 + o(1))/(n I(ϑ)).

Thus one can say that the MLE θ̂n is asymptotically efficient.
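The theorem can be illustrated by simulation in the Poisson example (with hypothetical ϑ, sample size and replication count): the MLE is the sample mean, and the scaled errors √n(θ̂n − ϑ) should be roughly centred at 0 with variance close to 1/I(ϑ) = ϑ.

```python
import random

random.seed(0)

def poisson(lam):
    # Count arrivals of a unit-rate Poisson process on [0, lam]:
    # the count is Poisson(lam) distributed.
    t, k = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > lam:
            return k
        k += 1

theta, n, reps = 2.5, 200, 1000  # hypothetical ϑ, sample size, replications
scaled_errs = []
for _ in range(reps):
    mle = sum(poisson(theta) for _ in range(n)) / n  # θ̂_n = sample mean
    scaled_errs.append(n ** 0.5 * (mle - theta))

mean = sum(scaled_errs) / reps
var = sum((v - mean) ** 2 for v in scaled_errs) / reps
# mean ≈ 0 and var ≈ ϑ = 1/I(ϑ), as the theorem predicts
print(round(mean, 2), round(var, 2))
```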

217

I We will only give a sketch of the proof.

Consistency. For any ε > 0, one has

    An := {|θ̂n − ϑ| > ε} ⊂ { sup_{|u|>ε} L(X, ϑ + u) > L(X, ϑ) }
        = { sup_{|u|>ε} [ (1/n) L(X, ϑ + u) − (1/n) L(X, ϑ) ] > 0 },

where, by the SLLN,

    (1/n) L(X, θ) = (1/n) Σ_{j=1}^n l(Xj, θ) →a.s. Eϑ l(X1, θ),

so that (1/n) L(X, ϑ + u) →a.s. Eϑ l(X1, ϑ + u) and
(1/n) L(X, ϑ) →a.s. Eϑ l(X1, ϑ). Here by Gibbs’ inequality
Eϑ l(X1, ϑ + u) − Eϑ l(X1, ϑ) < 0. No wonder that one can prove that
P(An) → 0 as n → ∞, which means that θ̂n →P ϑ. Too technical for us at the
moment, but we have got an idea of why the MLE is consistent.
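The claim P(An) → 0 can be illustrated by a Monte Carlo estimate of P(|θ̂n − ϑ| > ε) in the Poisson example (hypothetical parameters throughout): the probability visibly shrinks as n grows.

```python
import random

random.seed(1)

def poisson(lam):
    # Unit-rate Poisson process arrival count on [0, lam] ~ Poisson(lam).
    t, k = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > lam:
            return k
        k += 1

theta, eps, reps = 2.5, 0.3, 500  # hypothetical ϑ, ε and replication count

def p_far(n):
    # Monte Carlo estimate of P(A_n) = P(|θ̂_n − ϑ| > ε).
    hits = 0
    for _ in range(reps):
        mle = sum(poisson(theta) for _ in range(n)) / n
        hits += abs(mle - theta) > eps
    return hits / reps

print(p_far(10), p_far(200))  # the second estimate should be much smaller
```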

218

Asymptotic normality. Using Taylor’s formula (basically, it’s just the mean
value theorem for the function L′(X, θ)) at the point θ̂n:

    L′(X, ϑ) = L′(X, θ̂n) + (ϑ − θ̂n) L′′(X, θ∗) = (ϑ − θ̂n) L′′(X, θ∗),

where θ∗ is a point between ϑ and θ̂n (why is L′(X, θ̂n) = 0?). Thus

    (1/√n) Σ_{j=1}^n l′(Xj, ϑ) = √n (θ̂n − ϑ) × (−1) × (1/n) Σ_{j=1}^n l′′(Xj, θ∗).

Here the left-hand side →d Z ∼ N(0, I(ϑ)) due to the CLT and the facts that
Eϑ l′(Xj, ϑ) = 0 = ∫ f′ϑ(x) µ(dx) (do you see why?) and Eϑ(l′(Xj, ϑ))² = I(ϑ),
while (1/n) Σ_{j=1}^n l′′(Xj, θ∗) →P Eϑ l′′(X1, ϑ) = −I(ϑ) (see sl. 216, 217)
by the LLN and the convergence θ̂n →P ϑ. Thus

    √n (θ̂n − ϑ) →d Z/I(ϑ) ∼ N(0, 1/I(ϑ)), bingo!

219

The End

220