Lecture Notes of STA347H1S
Ziteng Cheng
July 31, 2022
Contents
1 Axioms and Basic Properties of Probabilities
2 Random Variables
3 Distribution as Induced Measure
4 Expectation as Lebesgue Integral
5 Lebesgue Measure and Density Function
6 Independence and Product Measures
7 Change of Variables
8 Selections of Inequalities
9 Convergence of Random Variables
10 Limit Theorems
11 Relations between Convergences
12 Laws of Large Numbers
13 Conditional Expectation
14 Weak Convergence of Probability
A Preliminaries
1 Axioms and Basic Properties of Probabilities
Let X and Ω be non-empty abstract spaces with no special structure. We will use X and Ω
interchangeably; in particular, we use Ω to emphasize its role as the sample space. A set A ⊆ Ω is
sometimes called an event. Elements of Ω are denoted by ω.
2X is called the power set of X; it is the set of all subsets of X. A ⊆ X and A ∈ 2X share the same
meaning.
def:sigmaAlg Definition 1.1. A ⊆ 2X is a σ-algebra if
(i) ∅ ∈ A , X ∈ A ;
(ii) [closed under complement] A ∈ A =⇒ Ac ∈ A ;
(iii) [closed under countable union] (An)n∈N ⊆ A =⇒ ⋃n∈N An ∈ A .
For C ⊆ 2Ω, we write σ(C ) for the smallest σ-algebra containing C .
rmk:SigmaAlg Remark 1.2. In view of Theorem A.2 (h), a σ-algebra is also closed under countable intersection,
that is, (An)n∈N ⊆ A =⇒ ⋂n∈N An ∈ A . An intersection of σ-algebras is still a σ-algebra, but a
union need not be.
Example 1.3. Here are some examples of σ-algebra:
1. {∅,Ω} is the trivial σ-algebra.
2. 2Ω is a σ-algebra.
3. Let A be such that ∅ ̸= A ⊊ Ω; then σ({A}) = {∅, A,Ac,Ω}.
4. Let Ω = {a, b, c, d}. A = {∅, {a, b, c, d}, {a}, {b}, {c, d}, {a, b}, {a, c, d}, {b, c, d}} is a σ-algebra.
Moreover, σ({{a}, {b}, {c, d}}) = A .
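To make Example 1.3 (4) concrete, here is a small brute-force sketch in Python (our own illustration,
not part of the notes): it generates σ({{a}, {b}, {c, d}}) on Ω = {a, b, c, d} by closing the generating
family under complements and unions until nothing new appears, recovering exactly the eight sets of
A above. On a finite space, iterated pairwise unions already yield all countable unions.

```python
# Brute-force generation of a σ-algebra on a finite sample space: close the
# generators under complement and (finite) union until the family stabilizes.
from itertools import combinations

omega = frozenset('abcd')
gens = {frozenset('a'), frozenset('b'), frozenset('cd')}

family = {frozenset(), omega} | gens
while True:
    new = {omega - A for A in family}                    # complements
    new |= {A | B for A, B in combinations(family, 2)}   # pairwise unions
    if new <= family:
        break
    family |= new

print(len(family))                                       # 8
print(sorted(''.join(sorted(A)) for A in family))
# ['', 'a', 'ab', 'abcd', 'acd', 'b', 'bcd', 'cd']
```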
rmk:MonotoneClass Remark 1.4. In general, a σ-algebra is not convenient to describe directly from the definition. An
alternative description, which turns out to be more convenient in many cases, is given by the
monotone class theorem (cf. [A&B, Section 4.4]).
def:BorelsigmaAlg Definition 1.5. • We define the Borel σ-algebra on Rn as B(Rn) := σ({B ⊆ Rn : B is open}).
For the rest of the course, we always pair Rn with its Borel σ-algebra B(Rn).
• The notion of Borel σ-algebra can be extended to, say, a metric space. If X is a metric space,
then B(X) := σ({B ⊆ X : B is open}).
lem:SingletonInBorel Lemma 1.6. Let X be a metric space with metric d. For any x ∈ X, we have {x} ∈ B(X).
Proof. For r > 0, we let Br(x) := {x′ ∈ X : d(x, x′) < r}, i.e., Br(x) is the open ball centered at
x with radius r. Each Br(x) is open, hence belongs to B(X). Since {x} = ⋂n∈N B1/n(x) and B(X)
is closed under countable intersection (cf. Remark 1.2), we conclude {x} ∈ B(X).
rmk:BorelSigmaAlg Remark 1.7. It can be shown by combining [A&B, Section 4.9, Theorem 4.44] and monotone class
theorem (cf. Remark 1.4) that
B(Rn) = σ({A1 × · · · ×An : Ak ∈ B(R), k = 1, ..., n})
= σ({[a1, b1]× · · · × [an, bn] : ak, bk ∈ R, ak ≤ bk, k = 1, . . . , n})
= σ({[a1, b1)× · · · × [an, bn) : ak, bk ∈ R, ak ≤ bk, k = 1, . . . , n})
= σ({(a1, b1]× · · · × (an, bn] : ak, bk ∈ R, ak ≤ bk, k = 1, . . . , n})
= σ({(a1, b1)× · · · × (an, bn) : ak, bk ∈ R, ak ≤ bk, k = 1, . . . , n}).
def:Measure Definition 1.8. µ : A → [0,∞] is a measure if
(i) µ(∅) = 0;
(ii) [countable additivity, also called σ-additivity] for any (An)n∈N ⊆ A with Ai ∩ Aj = ∅ for
i ̸= j, we have µ(⋃n∈N An) = ∑n∈N µ(An).
We say µ is a probability if µ(Ω) = 1. We usually use P to denote a probability. If µ(X) = ∞, we
call µ an infinite measure.
rmk:Measure Remark 1.9. 1. Regarding the construction of a measure, we refer to the procedure called Carathéodory
extension (cf. [D, Theorem 1.1.9], [B, Section 4.5] and [A&B, Section 10.23]).
2. Let µ and µ′ be measures on σ(C ) with µ(A) = µ′(A) for A ∈ C ; then µ(A) = µ′(A) for
A ∈ σ(C ). This can be proved by showing {A ∈ 2Ω : µ(A) = µ′(A)} ⊇ σ(C ) using the monotone
class theorem (cf. Remark 1.4).
exmp:MeasureSp Example 1.10. 1. If X is finite or countable, we can easily construct a measure µ on 2X by
assigning a non-negative number αx to each x ∈ X and defining µ(A) := ∑x∈A αx for A ∈ 2X.
If ∑x∈X αx = 1, then µ is a probability.
2. A Dirac measure at x, denoted by δx, is the measure that satisfies δx(A) = 1 if x ∈ A, and
δx(A) = 0 if x /∈ A. (X,X , δx) is a measure space.
3. Let (αn)n∈N ⊆ [0,∞] and (xn)n∈N ⊆ X. Then, µ(A) := ∑n∈N αnδxn(A) is a measure on
(X,X ). Such µ is called discrete. If ∑n∈N αn = 1, then µ is a probability on (X,X ).
4. Lebesgue measure on (R,B(R)). Lebesgue measure on ([0, 1],B([0, 1])) is a probability. See
Section 5 for more discussion.
Definition 1.11. Let X ⊆ 2X be a σ-algebra and µ be a measure on X .
• We call (X,X ) a measurable space and (X,X , µ) a measure space. If µ is a probability,
(X,X , µ) is called a probability space; we usually write (Ω,A ,P) for a probability space.
• On a measure space (X,X , µ), we say N is a null set if there is A ∈ X such that µ(A) = 0
and N ⊆ A. Note that N may not belong to X .
• We say (X,X , µ) is a complete measure space if X contains all null sets, i.e., for all A ∈ X
with µ(A) = 0, we have N ∈ X as long as N ⊆ A.
• We say A ∈ X is true µ-almost surely if Ac is a null set.
Definition 1.12. Let A ∈ 2X. The indicator function 1A : X → R is defined as 1A(x) := 1 if
x ∈ A, and 1A(x) := 0 if x /∈ A. When no confusion arises, we will omit x and simply write 1A.
Lemma 1.13. The indicator function has the following properties
(a) 1A∩B = 1A1B, and in particular, if A ⊆ B, 1A = 1A1B;
(b) if A ∩B = ∅, 1A∪B = 1A + 1B, and in particular, 1A + 1Ac = 1.
Definition 1.14. • Let A,A1, A2, · · · ∈ 2X. We say (An)n∈N increases to A if A1 ⊆ A2 ⊆ . . .
and ⋃n∈N An = A. We say (An)n∈N decreases to A if A1 ⊇ A2 ⊇ . . . and ⋂n∈N An = A.
• We say (An)n∈N converges to A if limn→∞ 1An(x) = 1A(x) for x ∈ X. For abbreviation, we
write An ↑ A, An ↓ A and limn→∞ An = A, respectively.
• We also define
lim supn→∞ An := ⋂n∈N ⋃k≥n Ak and lim infn→∞ An := ⋃n∈N ⋂k≥n Ak.
Note that for any n, ℓ ∈ N we have ⋃k≥n Ak ⊇ ⋂k≥ℓ Ak, and thus
lim supn→∞ An ⊇ lim infn→∞ An.
Remark 1.15. We have
lim supn→∞ 1An(x) = 1lim supn→∞ An(x) and lim infn→∞ 1An(x) = 1lim infn→∞ An(x), x ∈ X.
To see this, we first note that supk≥n 1Ak = 1⋃k≥n Ak. Note additionally that 1⋃k≥n Ak(x) is
decreasing in n and bounded from below by 0; therefore limn→∞ 1⋃k≥n Ak(x) is well-defined. Next,
suppose x ∈ X satisfies limn→∞ 1⋃k≥n Ak(x) = 1; then there must be Nx ∈ N such that 1⋃k≥n Ak(x) = 1
for n ≥ Nx, and thus x ∈ ⋃k≥n Ak for n ≥ Nx. Since ⋃k≥n Ak is decreasing in n, we have
x ∈ ⋃k≥n Ak for all n ∈ N, i.e., x ∈ lim supn→∞ An. If instead x ∈ X satisfies
limn→∞ 1⋃k≥n Ak(x) = 0, there must be Nx ∈ N such that 1⋃k≥n Ak(x) = 0 for n ≥ Nx, and thus
x /∈ lim supn→∞ An. The claim for lim inf is proved similarly.
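The identities in Remark 1.15 are easy to test numerically. The sketch below (our own toy example,
not from the notes) takes An alternating between two finite sets; points hit infinitely often form
lim sup An = E ∪ O, while points eventually always hit form lim inf An = E ∩ O.

```python
# Finite-horizon check of lim sup / lim inf of sets via indicators on
# X = {0,...,9}, with A_n = E for even n and A_n = O for odd n.
E, O = {0, 1, 2}, {1, 2, 3}

def A(n):
    return E if n % 2 == 0 else O

N = 100  # proxy horizon for "all n"
limsup = {x for x in range(10)
          if all(any(x in A(k) for k in range(n, N)) for n in range(N - 1))}
liminf = {x for x in range(10)
          if any(all(x in A(k) for k in range(n, N)) for n in range(N - 1))}
print(limsup, liminf)   # {0, 1, 2, 3} and {1, 2}
```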
The theorem below concerns basic properties of measures.
thm:MeasBasic Theorem 1.16. Let (X,X , µ) be a measure space. Then, for any A,B,A1, A2, · · · ∈ X ,
(a) A ⊆ B =⇒ µ(A) ≤ µ(B);
(b) A ⊆ ⋃n∈N An =⇒ µ(A) ≤ ∑n∈N µ(An);
(c) An ↑ A =⇒ limn→∞ µ(An) = µ(A);
(d) if µ = P is a probability, then P(A) + P(Ac) = 1;
(e) if µ = P is a probability, An ↓ A =⇒ limn→∞ P(An) = P(A).
Proof. (a) Note B = A ∪ (B ∩ Ac) and A ∩ (B ∩ Ac) = ∅. Then, by Definition 1.8 (ii), µ(B) =
µ(A) + µ(B ∩ Ac) ≥ µ(A).
(b) Define B1 := A1 and Bn := An ∩ (⋃n−1k=1 Ak)c. Note that (Bn)n∈N are mutually disjoint.
Additionally, ⋃nk=1 Ak = ⋃nk=1 Bk, and thus ⋃k∈N Ak = ⋃k∈N Bk. This together with statement
(a) and Definition 1.8 (ii) implies that µ(A) ≤ µ(⋃n∈N An) = µ(⋃n∈N Bn) = ∑n∈N µ(Bn) ≤
∑n∈N µ(An).
(c) Let (Bn)n∈N be defined as above, and note Bn = An ∩ Acn−1 for n ≥ 2. It follows from
Definition 1.8 (ii) that µ(A) = ∑n∈N µ(Bn) = limm→∞ ∑mn=1 µ(Bn) = limm→∞ µ(⋃mn=1 Bn) =
limm→∞ µ(Am).
(d)&(e) DIY.
Example 1.17. This is a non-example for Theorem 1.16 (e) when µ is an infinite measure. On the
measurable space (N, 2N), we let µ be the counting measure, that is, µ(A) is the number of elements
in A. It can be verified that µ indeed satisfies Definition 1.8. Let An := {n, n + 1, . . . }. Then,
An ⊃ An+1 and µ(An) = ∞. On the other hand, note that ⋂n∈N An = ∅ and thus µ(⋂n∈N An) = 0.
The next theorem regards the continuity of probability.
thm:ProbCont Theorem 1.18 (Continuity of Probability). Let A ⊆ 2Ω be a σ-algebra. Suppose (An)n∈N ⊆ A
and limn→∞An = A. Then, A ∈ A and limn→∞ P(An) = P(A).
Proof. In view of Definition 1.1, we have lim supn→∞ An ∈ A and lim infn→∞ An ∈ A .
Next, note that by hypothesis, ω ∈ A if and only if there is Nω ∈ N such that ω ∈ An for any
n ≥ Nω (why?). Therefore, A ⊆ lim supn→∞ An and A ⊆ lim infn→∞ An. On the other hand, if
ω ∈ lim supn→∞ An, then for any n ∈ N, there exists k ≥ n such that ω ∈ Ak. Note additionally
that lim infn→∞ An ⊆ lim supn→∞ An. It follows from the hypothesis that ω ∈ A (why?), and thus
A = lim supn→∞ An = lim infn→∞ An, (1.1) eq:Alimsupliminf
which proves A ∈ A .
In order to finish the proof, we let Bn := ⋂k≥n Ak and Cn := ⋃k≥n Ak. Note that Bn ↑ A and
Cn ↓ A due to (1.1). By Theorem 1.16 (c) (e), we yield P(A) = limn→∞ P(Cn) = limn→∞ P(Bn).
Moreover, we have Bn ⊆ An ⊆ Cn. It follows from Theorem 1.16 (a) that P(Bn) ≤ P(An) ≤ P(Cn).
Finally, we conclude limn→∞ P(An) = P(A).
2 Random Variables
def:rv Definition 2.1. (i) Let (X,X ) and (Y,Y ) be two measurable spaces. We say a function f :
X → Y is X -Y measurable if {x ∈ X : f(x) ∈ B} ∈ X for any B ∈ Y , and we write
f : (X,X ) → (Y,Y ) for abbreviation. Sometimes it is convenient to write f−1(B) := {x ∈
X : f(x) ∈ B}. We also define σ(f) := f−1(Y ), where we note f−1(Y ) is a σ-algebra (why?).
(ii) If we set X = Ω and consider Y : (Ω,A )→ (Y,Y ), to emphasize that Y maps from the event
space, we call Y an A -Y random variable.
(iii) For Y : (Ω,A )→ (Rn,B(Rn)), we may call Y an Rn-valued A -random variable. If n = 1, we
also call Y a real-valued A -random variable. When no confusion arises, we simply call Y a
real-valued random variable.
(iv) Let f and g be X -B(R) measurable. We say f = g, µ-almost surely, if µ({x ∈ X : f(x) =
g(x)}c) = 0. We write µ-a.s. for abbreviation. For the rest of this course, unless specified
otherwise, the '=' relationship between functions is understood in the almost sure sense.
The same is true for '<', '>', '≤' and '≥'. When no confusion arises, we will omit µ.
rmk:Measurable Remark 2.2. 1. All functions f : X → Y are 2X-Y measurable, regardless of Y . It is tempting
to always use 2X when possible. But it turns out that 2X has some pathology when X is
uncountable, say, X = R. We defer to Remark 5.2 for more discussion.
2. f : X→ Y is X -σ(C ) measurable if and only if {x ∈ X : f(x) ∈ C} ∈ X for any C ∈ C . The
'if' direction can be proved by showing that {B ⊆ Y : f−1(B) ∈ X } is a σ-algebra containing C ,
hence containing σ(C ), using the monotone class theorem (cf. Remark 1.4). The 'only if' direction
is clear from the definition.
3. Composition preserves measurability. More precisely, consider f : (X,X ) → (Y,Y ) and
g : (Y,Y )→ (Z,Z ), then the composition of g and f , defined as g ◦f(a) := g(f(a)), is X -Z
measurable.
4. Suppose X and Y are metric spaces. Then, any continuous f : X → Y is B(X)-B(Y) measur-
able. This is a consequence of the fact that, f : X→ Y is continuous if and only if f−1(U) is
open for any open U ⊆ Y.
5. Consider Xk : (Ω,A )→ (Xk,Xk) for k = 1, . . . , n. Then, (X1, . . . , Xn), as a mapping from Ω
to X1 × · · · × Xn, is A -X1 ⊗ · · · ⊗Xn measurable, where X ⊗ Y := σ({A×B : A ∈X , B ∈ Y }).
6. Suppose f and g are real-valued X -measurable functions. Then, so are cf (for c ∈ R), f + g,
fg, f/g (if g ̸= 0), max{f, g} and min{f, g}.
7. Let (fn)n∈N be a sequence of real-valued X -measurable functions. Using point 2, we can show
that lim infn→∞ fn(x) := limn→∞ infk≥n fk(x) and lim supn→∞ fn(x) := limn→∞ supk≥n fk(x)
are X -B(R) measurable. Moreover, if (fn(x))n∈N converges as n → ∞ for each x ∈ X, then
f(x) := limn→∞ fn(x) is X -B(R) measurable.
8. For any f : (X,X )→ (Y,Y ), σ(f) is the smallest σ-algebra on X such that f is measurable,
and we have f : (X, σ(f))→ (Y,Y ). Moreover, if Y = σ(C ), then f−1(σ(C )) = σ(f−1(C )).
9. Let I be an uncountable index set and consider fi : (X,X ) → (R,B(R)) for i ∈ I; then
supi∈I fi may fail to be X -B(R) measurable (cf. .......).
The next lemma can be proved by an element-chasing argument.
lem:PreimageComm Lemma 2.3. Let f : X → Y, B ∈ 2Y and (Bi)i∈I ⊆ 2Y, where I is an index set (possibly un-
countable). We have f−1(Bc) = (f−1(B))c, f−1(⋂i∈I Bi) = ⋂i∈I f−1(Bi) and f−1(⋃i∈I Bi) =
⋃i∈I f−1(Bi).
def:SimpleFunc Definition 2.4. A function f : X → R is called simple if it takes only finitely many values. In
particular, it can be written as
f(x) = ∑nk=1 rk1Ak(x), x ∈ X, (2.1) eq:SimpleFuncRepr
for some n ∈ N, distinct r1, . . . , rn ∈ R and Ak = f−1({rk}) for k = 1, . . . , n.
Lemma 2.5. Let f be the simple function in (2.1). Then, Ai ∩ Aj = ∅ for i ̸= j. Moreover,
A1, . . . , An ∈ X if and only if f is X -B(R) measurable.
Proof. The first statement follows from Lemma 2.3 and the convention that f−1(∅) = ∅. Regarding
the second statement, if f is measurable, in view of Lemma 1.6, Ak = f−1({rk}) ∈ X due to
Definition 2.1. If A1, . . . , An ∈ X , then for any B ∈ B(R), we have
f−1(B) = f−1(⋃k=1,...,n; rk∈B {rk}) = ⋃k=1,...,n; rk∈B f−1({rk}) ∈ X ,
where we have used Lemma 2.3 in the last equality.
thm:SimpleFuncApprox Theorem 2.6 (Simple Function Approximation). For any f : (X,X ) → (R,B(R)), there is a
sequence of simple functions (fn)n∈N such that fn is X -B(R) measurable, |fn(x)| ≤ |f(x)| and
limn→∞ fn(x) = f(x) for any x ∈ X. In particular, we can construct fn as
fn(x) = −n 1f−1((−∞,−n])(x) + ∑−1k=−n2^n ((k + 1)/2^n) 1f−1((k/2^n, (k+1)/2^n])(x)
+ ∑n2^n−1k=0 (k/2^n) 1f−1([k/2^n, (k+1)/2^n))(x) + n 1f−1([n,∞))(x), x ∈ X.
Moreover, if f is non-negative, then fn(x) ≤ fn+1(x) for x ∈ X.
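Below is a direct Python transcription of this construction (a sketch under our own naming: rounding
toward zero on dyadic intervals of length 2^{−n} and clipping at ±n). For f(x) = x² at x = 0.7, the
values fn(0.7) increase to f(0.7) = 0.49, as the theorem promises for non-negative f.

```python
# Simple function approximation f_n of Theorem 2.6: clip at ±n, then round
# toward 0 to the dyadic grid of mesh 2**-n, so that |f_n| <= |f| and f_n -> f.
import math

def simple_approx(f, n):
    def fn(x):
        v = f(x)
        if v <= -n:
            return -n                                  # value -n on f^{-1}((-inf,-n])
        if v >= n:
            return n                                   # value n on f^{-1}([n,inf))
        if v < 0:
            return math.ceil(v * 2**n) / 2**n          # (k+1)/2^n on (k/2^n,(k+1)/2^n]
        return math.floor(v * 2**n) / 2**n             # k/2^n on [k/2^n,(k+1)/2^n)
    return fn

f = lambda x: x * x
for n in (1, 2, 4, 8):
    print(n, simple_approx(f, n)(0.7))   # 0.0, 0.25, 0.4375, 0.48828125 -> 0.49
```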
thm:sigmagsigmaf Theorem 2.7. Consider f : (X,X )→ (Y,Y ) and g : (X,X )→ (R,B(R)). Then, σ(g) ⊆ σ(f) if
and only if there is h : (Y,Y )→ (R,B(R)) such that g = h ◦ f .
Proof. Regarding the 'if' direction, we have σ(g) = g−1(B(R)) = (h ◦ f)−1(B(R)) (why?) =
f−1(h−1(B(R))). Since h−1(B(R)) ⊆ Y due to the measurability of h, we conclude σ(g) ⊆ σ(f).
Now we prove the 'only if' direction. We first assume g is simple and suppose g(x) = ∑nk=1 rk1Ak(x)
for some r1, . . . , rn ∈ R and A1, . . . , An ∈ X . Without loss of generality, we assume ri ̸= rj and
Ai ∩ Aj = ∅ for i ̸= j, where we note Ak = g−1({rk}) ∈ σ(g). Since σ(g) ⊆ σ(f), we must have
Ak ∈ σ(f) for k = 1, . . . , n. It follows that there is Bk ∈ Y such that f−1(Bk) = Ak, and we may
take Bi ∩ Bj = ∅ for i ̸= j. Then, h(y) := ∑nk=1 rk1Bk(y) is the desired function.
Now we consider a generic g. There is a sequence of X -measurable simple functions (gn)n∈N that
approximates g in the sense of Theorem 2.6 with σ(gn) ⊆ σ(f). For each n ∈ N, there is a real-valued
Y -measurable hn such that gn = hn ◦ f . Let L := {y ∈ Y : lim infn→∞ hn(y) = lim supn→∞ hn(y)}.
Because limn→∞ hn(f(x)) = limn→∞ gn(x) = g(x) for x ∈ X, we have f(X) ⊆ L. Define h(y) :=
limn→∞ hn(y)1L(y); in view of Remark 2.2 (7), the proof is complete.
3 Distribution as Induced Measure
def:Distn Definition 3.1. Let (X,X , µ) be a measure space and consider f : (X,X )→ (Y,Y ). For the rest
of the course, we will use the following abbreviation/notation
µ({x ∈ X : f(x) ∈ B}) = µ(f ∈ B) = µ(f−1(B)) = µ ◦ f−1(B) =: µf (B), B ∈ Y ,
where we note that for B /∈ Y the left-hand side does not make sense. µf is also called the measure
induced by f . On a probability space (Ω,A ,P), for Y : (Ω,A ) → (Y,Y ), PY is called the
(probabilistic) distribution of Y .
Using Lemma 2.3, we yield the result below.
Theorem 3.2. µf is a measure on (Y,Y ).
Definition 3.3. Let P be a probability on (R,B(R)). The cumulative distribution function (CDF)
induced by P is defined as F (r) := P((−∞, r]). If P = PY for some real-valued random variable Y ,
we call F the CDF of Y .
Remark 3.4. Using Remark 1.7 and Remark 1.9 (2), we can show that if two probability measures
induce the same distribution function, then the two measures must coincide.
Remark 3.5. One important reason to adopt this framework is that it justifies the existence of
continuous-time random processes, via the result known as the Kolmogorov extension theorem
(cf. [A&B, Section 15.6]).
thm:DistFunc Theorem 3.6. Let F be a CDF on R. Then,
(a) F is non-decreasing;
(b) F is right-continuous on R, that is, limz→r+ F (z) = F (r) for r ∈ R;
(c) limr→−∞ F (r) = 0 and limr→∞ F (r) = 1;
(d) F has left limit on R, that is, for any r ∈ R and (rn)n∈N increasing to r we have (F (rn))n∈N
converges; additionally, F (r−) := limz→r− F (z) = P((−∞, r));
(e) F has at most countably many jumps.
Proof. DIY.
Remark 3.7. In view of Remark 1.7, using the Carathéodory extension theorem (cf. Remark 1.9 (1)),
we can show that a function F satisfying conditions (a) (b) (c) above characterizes a probability
measure P on (R,B(R)).
The result below is an immediate consequence of Theorem 3.6.
Corollary 3.8. Let P be a probability measure on (R,B(R)) and F be the corresponding CDF.
Then, for any real numbers x < y,
(a) P((x, y]) = F (y)− F (x);
(b) P([x, y]) = F (y)− F (x−);
(c) P([x, y)) = F (y−)− F (x−);
(d) P((x, y)) = F (y−)− F (x);
(e) P({x}) = F (x)− F (x−).
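As a quick numerical companion, the sketch below tabulates these identities for a hypothetical
mixture distribution (half a Uniform[0, 1], half a point mass at 0.5; the construction is ours, not
from the notes), approximating F(r−) by evaluating F slightly to the left.

```python
# CDF of the mixture P = 0.5*Uniform[0,1] + 0.5*delta_{0.5}, and interval
# probabilities read off from F and its left limits as in Corollary 3.8.
def F(r):
    base = min(max(r, 0.0), 1.0)      # Uniform[0,1] part
    jump = 1.0 if r >= 0.5 else 0.0   # point mass at 0.5
    return 0.5 * base + 0.5 * jump

def F_left(r, eps=1e-12):
    return F(r - eps)                 # numerical stand-in for F(r-)

x, y = 0.25, 0.5
print(F(y) - F(x))        # P((x, y]) = 0.625
print(F_left(y) - F(x))   # P((x, y)) = 0.125
print(F(y) - F_left(y))   # P({y})    = 0.5
```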
4 Expectation as Lebesgue Integral
In what follows, we consider the extended real line R := R∪{−∞,∞} with the following rules: 0×∞ = 0,
0× (−∞) = 0, a±∞ = ±∞ for a ∈ R, and a× (±∞) = ±sgn(a) · ∞ for a ∈ R \ {0}.
Let (X,X , µ) be a measure space. We want to define an integral of a real-valued X -measurable
function with respect to µ. We first define the integral for simple functions.
def:LebIntSimple Definition 4.1. Suppose f is a simple function of the form f(x) = ∑nk=1 rk1Ak(x) with rk ∈ R
and Ak ∈ X for k = 1, . . . , n. We define
∫X f(x)µ(dx) := ∑nk=1 rkµ(Ak).
Note, in particular, µ(A) = ∫X 1A(x)µ(dx) for A ∈ X .
The lemma below shows that ∫X f(x)µ(dx) is well-defined, i.e., independent of the representation
of f .
lem:IntSimpleVerInv Lemma 4.2. Suppose f(x) = ∑nk=1 rk1Ak(x) = ∑mk=1 sk1Bk(x) for some rk, sk ∈ R and Ak, Bk ∈
X . Then, ∑nk=1 rkµ(Ak) = ∑mk=1 skµ(Bk).
Proof. DIY.
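For a finite measure space, the invariance asserted by Lemma 4.2 can be checked mechanically. The
sketch below (our own toy measure on X = {1, 2, 3}) evaluates Definition 4.1 on two different
representations of the same simple function and gets the same number.

```python
# Integral of a simple function computed from a representation sum_k r_k 1_{A_k};
# two representations of the same f give the same value (Lemma 4.2).
mu = {1: 0.2, 2: 0.3, 3: 0.5}          # a measure on X = {1, 2, 3}

def integral(rep):
    # rep is a list of pairs (r_k, A_k)
    return sum(r * sum(mu[x] for x in A) for r, A in rep)

rep1 = [(2.0, {1, 2}), (5.0, {3})]               # f = 2 on {1,2}, 5 on {3}
rep2 = [(2.0, {1}), (2.0, {2}), (5.0, {3})]      # same f, different split
print(integral(rep1), integral(rep2))            # both 3.5
```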
def:LebIntPosX Definition 4.3. Suppose f ≥ 0 (here f may take values in [0,+∞]). We define
∫X f(x)µ(dx) := sup{∫X g(x)µ(dx) : g is a simple real-valued X -measurable function and 0 ≤ g ≤ f}.
def:LebInt Definition 4.4. • Let f be a real-valued X -measurable function. We write f+ := f1{f≥0}
and f− := −f1{f<0}. If ∫X f+(x)µ(dx) < ∞ or ∫X f−(x)µ(dx) < ∞, the Lebesgue integral
(of f with respect to µ) is defined as
∫X f(x)µ(dx) := ∫X f+(x)µ(dx) − ∫X f−(x)µ(dx).
We say f is integrable if both ∫X f+(x)µ(dx) < ∞ and ∫X f−(x)µ(dx) < ∞, or equivalently,
∫X |f(x)|µ(dx) < ∞.
• We use L1(X,X , µ) for the set of integrable functions. Furthermore, for p ∈ (0,∞), we let
Lp(X,X , µ) be the set of real-valued X -measurable functions f such that ∫X |f(x)|^p µ(dx) < ∞,
and L∞(X,X , µ) the set of real-valued X -measurable functions f such that µ({x : |f(x)| >
M}) = 0 for some M > 0.
Remark 4.5. If f = u + iv is a complex-valued function and ∫X(|u(x)| + |v(x)|)µ(dx) is finite, we
define ∫X f(x)µ(dx) := ∫X u(x)µ(dx) + i∫X v(x)µ(dx).
Definition 4.6. Let A ∈ X . We write
∫A f(x)µ(dx) := ∫X f(x)1A(x)µ(dx).
Definition 4.7. Following Definition 4.4, set (X,X , µ) = (Ω,A ,P) to be a probability space. For a
real-valued A -random variable Y , the expectation of Y is defined as the Lebesgue integral
EP(Y ) := ∫Ω Y (ω)P(dω).
When no confusion arises, we simply write E(Y ).
The proposition below follows immediately from the definitions above.
prop:ExpnBasic Proposition 4.8. Let (X,X , µ) be a measure space, and let f and g be real-valued X -measurable
functions. Then the following is true:
(a) if f and g are integrable and f ≤ g, then ∫X f(x)µ(dx) ≤ ∫X g(x)µ(dx);
(b) if f is integrable, then ∫X cf(x)µ(dx) = c∫X f(x)µ(dx) for c ∈ R;
(c) if A ∈ X satisfies µ(A) = 0 and f ≥ 0, then ∫X f(x)1A(x)µ(dx) = 0.
The theorem below is one of the most important results concerning the Lebesgue integral.
thm:PreMonoConv Theorem 4.9 (Monotone Convergence). Let (fn)n∈N be a sequence of non-negative real-valued X -
measurable functions such that fn′ ≥ fn for n′ ≥ n and limn→∞ fn(x) = f(x) for x ∈ X. Then,
limn→∞ ∫X fn(x)µ(dx) = ∫X f(x)µ(dx).
Proof. By Proposition 4.8 (a), (∫X fn(x)µ(dx))n∈N is an increasing sequence in [0,∞] and
∫X fn(x)µ(dx) ≤ ∫X f(x)µ(dx). Let L be the limit. We thus have ∫X f(x)µ(dx) ≥ L. What is left
to prove is ∫X f(x)µ(dx) ≤ L. To this end, let g(x) = ∑ℓk=1 rk1Ak(x) be a simple X -measurable
function such that 0 ≤ g ≤ f . Let c ∈ (0, 1) and Bn := {x ∈ X : fn(x) ≥ cg(x)}. Note that (Bn)n∈N
increases to X. It follows that
L ≥ ∫X fn(x)µ(dx) ≥ ∫Bn fn(x)µ(dx) ≥ c∫Bn g(x)µ(dx) = c∑ℓk=1 rkµ(Ak ∩ Bn).
Note that (Ak ∩ Bn)n∈N increases to Ak. Applying Theorem 1.16 (c) to the right-hand side above,
we have L ≥ c∫X g(x)µ(dx). Since c ∈ (0, 1) is arbitrary, we have L ≥ ∫X g(x)µ(dx). In view of
Definition 4.3, the proof is complete.
Thanks to the monotone convergence theorem, we are now in a position to establish the linearity of
the Lebesgue integral.
lem:IntSum Lemma 4.10. Let f and g be real-valued non-negative X -measurable functions. Then,
∫X(f(x) + g(x))µ(dx) = ∫X f(x)µ(dx) + ∫X g(x)µ(dx).
Proof. First suppose f and g are simple, say, f(x) = ∑mk=1 rk1Ak(x) and g(x) = ∑nk=1 sk1Bk(x).
Then, (f + g)(x) = ∑m+nk=1 tk1Ck(x), where tk = rk, Ck = Ak for k = 1, . . . ,m, and tk = sk−m,
Ck = Bk−m for k = m + 1, . . . ,m + n. It follows that
∫X(f(x) + g(x))µ(dx) = ∑m+nk=1 tkµ(Ck) = ∑mk=1 rkµ(Ak) + ∑nk=1 skµ(Bk) = ∫X f(x)µ(dx) + ∫X g(x)µ(dx).
Next, we suppose f and g are non-negative. In view of Theorem 2.6, we let (fn)n∈N and (gn)n∈N
be sequences of simple functions increasing to f and g, respectively. Note that (fn + gn)n∈N also
increases to f + g. Invoking monotone convergence (Theorem 4.9), the proof is complete.
The results above imply that the Lebesgue integral is a linear functional on L1(X,X , µ). We
formulate such linearity in the theorem below.
thm:ExpnLinear Theorem 4.11. Suppose f, g ∈ L1(X,X , µ). Then, for any a, b ∈ R we have
∫X(af(x) + bg(x))µ(dx) = a∫X f(x)µ(dx) + b∫X g(x)µ(dx).
Proof. DIY.
The next theorem is useful, as it is a vital tool for calculating Lebesgue integrals. It shows in
particular that the expectation of g(Y ) depends only on the distribution of Y .
thm:ExpnRule Theorem 4.12. Consider f : (X,X )→ (Y,Y ) and g : (Y,Y )→ (R,B(R)).
(a) g ◦ f ∈ L1(X,X , µ) if and only if g ∈ L1(Y,Y , µf );
(b) if either g ≥ 0, or the equivalent conditions in (a) are satisfied, then
∫X g(f(x))µ(dx) = ∫Y g(y)µf (dy).
Proof. Recall Definitions 3.1 and 4.1. Then, for B ∈ Y ,
∫X 1f−1(B)(x)µ(dx) = µ(f ∈ B) = µf (B) = ∫Y 1B(y)µf (dy).
This proves (b) for g being a simple function. Suppose g ≥ 0. In view of Theorem 2.6, we let (gn)n∈N
be a sequence of simple Y -measurable functions increasing to g. Note gn ◦ f also increases to g ◦ f .
By monotone convergence (Theorem 4.9),
∫X g ◦ f(x)µ(dx) = limn→∞ ∫X gn ◦ f(x)µ(dx) = limn→∞ ∫Y gn(y)µf (dy) = ∫Y g(y)µf (dy).
This proves (b) for g ≥ 0, and (a) follows immediately by substituting g above with |g|. Finally, for
g ∈ L1, invoking the decomposition that g = g+ − g− finishes the proof.
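On a finite space, Theorem 4.12 (b) amounts to regrouping a finite sum, which the sketch below
makes explicit (the measure, the map and g are our own toy data): integrating g ∘ f against µ agrees
with integrating g against the induced measure µf.

```python
# Theorem 4.12 on a finite space: compare the integral of g(f(x)) against mu
# with the integral of g against the push-forward mu^f (mu^f(B) = mu(f^{-1}(B))).
mu = {1: 0.1, 2: 0.4, 3: 0.5}              # measure on X = {1, 2, 3}
f = {1: 'a', 2: 'b', 3: 'a'}               # f : X -> Y = {'a', 'b'}
g = {'a': 10.0, 'b': -2.0}

mu_f = {}                                  # push-forward, tabulated on singletons
for x, w in mu.items():
    mu_f[f[x]] = mu_f.get(f[x], 0.0) + w

lhs = sum(g[f[x]] * w for x, w in mu.items())
rhs = sum(g[y] * w for y, w in mu_f.items())
print(lhs, rhs)                            # both 5.2
```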
Proposition 4.13. We have Lp(Ω,A ,P) ⊇ Lq(Ω,A ,P) for 1 ≤ p ≤ q ≤ ∞.
Proof. Let Y ∈ Lq. If q = ∞, there is M > 0 such that P(|Y | ≤ M) = 1 and thus P(|Y |^p ≤ M^p) = 1,
which implies that |Y |^p is integrable. Now suppose q < ∞. Since 1{|Y |≤1} + 1{|Y |>1} = 1, by
Lemma 4.10,
E(|Y |^p) = E(|Y |^p 1{|Y |≤1}) + E(|Y |^p 1{|Y |>1}) ≤ 1 + E(|Y |^q 1{|Y |>1}) ≤ 1 + E(|Y |^q) < ∞,
which implies Y ∈ Lp and thus completes the proof.
Remark 4.14. The same is in general not true for infinite measures (why?).
5 Lebesgue Measure and Density Function
sec:LebMeas
def:LebMeasure Definition 5.1. A Lebesgue measure on (Rn,B(Rn)), denoted by λn, is a measure satisfying
λn([a1, b1]× · · · × [an, bn]) = (b1 − a1)× · · · × (bn − an), ak ≤ bk, k = 1, . . . , n.
rmk:LebMeas Remark 5.2. 1. In view of Remarks 1.7 and 1.9 (2), we know Definition 5.1 defines a measure
uniquely (if it exists). The existence is a consequence of the Carathéodory extension theorem (cf.
Remark 1.9 (1)). In fact, we can define Lebesgue measure on a σ-algebra larger than B(R),
and such a σ-algebra is called the Lebesgue σ-algebra.
2. We wonder whether we can define Lebesgue measure on 2^{Rn}. This turns out to be not possible.
A counterexample on R is available at [B, Section 4.4].
Example 5.3. In this example, we show that on (Rn,B(Rn)), λn({a} × R × · · · × R) = 0. To this
end, let Ak,ℓ := [a − 1/k, a + 1/k] × [−ℓ, ℓ] × · · · × [−ℓ, ℓ], and note λn(Ak,ℓ) = 2(2ℓ)^{n−1}/k.
Invoking Theorem 1.16 (e) as k → ∞, we have λn({a} × [−ℓ, ℓ] × · · · × [−ℓ, ℓ]) = 0. It then follows
from Theorem 1.16 (c), letting ℓ → ∞, that λn({a} × R × · · · × R) = 0. A similar (but more tedious)
argument shows that any hyperplane has zero Lebesgue measure.
The proposition below argues that the Lebesgue integral extends the Riemann integral.
Proposition 5.4. Suppose f : (Rn,B(Rn)) → (R,B(R)) is Riemann integrable on [a1, b1] × · · · ×
[an, bn]. Then, f is also Lebesgue integrable on the rectangle, and the Riemann integral coincides
with the Lebesgue integral.
Proof. See Section 7 of ‘Lebesgue Integration on Euclidean Space’ by Frank Jones.
Remark 5.5. On the other hand, not every Lebesgue integral makes sense as a Riemann integral.
An example will be provided in HW.
From now on, we will omit λ from λ(dx) when writing the Lebesgue integral with respect to Lebesgue
measure. Note that for integrals on Rn with n > 1, the dummy variable x ∈ Rn is an n-dimensional
vector, i.e., x = (x1, . . . , xn). The following notations are equivalent:
∫Rn f(x) dx = ∫Rn f(x1, . . . , xn) dx1 . . . dxn.
The notation on the right-hand side deserves more discussion in later sections on product measures
and independence.
def:PDF Definition 5.6. Let µ be a measure on (Rn,B(Rn)). We say µ is absolutely continuous with respect
to Lebesgue measure if there is a non-negative f : (Rn,B(Rn))→ (R,B(R)) such that
µ(A) = ∫A f(x) dx, A ∈ B(Rn).
In this case, we call f the density function of µ. If µ = P is a probability, we call f the probability
density function (PDF) of P. If µ = PX , we call f the PDF of X.
Remark 5.7. 1. The notion of absolute continuity between measures is studied in a broader setting.
Consider a measurable space (X,X ) with measures µ and ν. We say µ is absolutely continuous
with respect to ν if for any A ∈ X with ν(A) = 0 we have µ(A) = 0. By the Radon–Nikodym
theorem, if µ is absolutely continuous with respect to ν, there is a non-negative real-valued
X -measurable f such that
µ(A) = ∫A f(x)ν(dx), A ∈ X ,
where f is unique up to a ν-null set and is called the Radon–Nikodym derivative. We refer to
[B, Section 13] for the detailed statement and proof.
2. A measure µ on (Rn,B(Rn)) need not be absolutely continuous w.r.t. λ, and a density function
may not exist. In general, we have the decomposition
µ = µD + µC + µS,
where µD is a measure with atoms only, µC is a measure that is absolutely continuous w.r.t.
λ, and µS is a measure with no atoms that is nevertheless not absolutely continuous w.r.t. λ. We
refer to ....... for further discussion.
The next result is immediate from Definitions 1.8 and 5.6.
Proposition 5.8. Let f be a PDF of some Rn-valued random variable Y . Then
f ≥ 0, λ-a.s., and ∫Rn f(x) dx = 1. (5.1) eq:PDF
Proof. We claim that for k ∈ N and Ak := {x ∈ Rn : f(x) < −1/k}, we must have λ(Ak) = 0.
Indeed, suppose otherwise; then PY (Ak) = ∫Ak f(x) dx ≤ −k−1λ(Ak) < 0, contradicting the
hypothesis that PY is a probability. Let A := {x ∈ Rn : f(x) < 0}. Note (Ak)k∈N increases to A. By
Theorem 1.16 (c), we have λ(A) = 0. Regarding ∫Rn f(x) dx = 1, it follows immediately from
Definition 5.6 and the fact that PY (Rn) = 1.
Conversely, we can use f satisfying (5.1) to define a probability measure on (Rn,B(Rn)).
Proposition 5.9. Suppose f : (Rn,B(Rn))→ (R,B(R)) satisfies (5.1). Then,
µ(A) := ∫A f(x) dx, A ∈ B(Rn),
is a probability on (Rn,B(Rn)).
Proof. DIY.
From now on, we call f a PDF as long as f satisfies (5.1).
Below is a continuation of Theorem 4.12.
Theorem 5.10. Let Y be a real-valued random variable admitting a PDF f , and let g : (R,B(R))→
(R,B(R)). If g ≥ 0, or gf ∈ L1(R,B(R), λ), then
E(g(Y )) = ∫R g(r)f(r) dr.
Proof. DIY.
The proposition below provides an alternative expression for the expectation of a non-negative
real-valued random variable.
prop:NonNegExpn Proposition 5.11. Let F be the CDF of a non-negative real-valued random variable Y . Then,
E(Y ) = ∫R+ (1− F (r)) dr,
where the right-hand side is understood as an integral with respect to Lebesgue measure (see Definition
5.1).
Proof. DIY.
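As a sanity check of Proposition 5.11 (a sketch with assumed data): for Y ~ Exponential(1) we have
E(Y) = 1 and 1 − F(r) = e^{−r}, so a Riemann-sum approximation of the tail integral should return
approximately 1.

```python
# Numeric check of E(Y) = integral of (1 - F) over R_+ for Y ~ Exponential(1).
import math

dr, R = 1e-4, 50.0                         # mesh and truncation of R_+
tail_integral = sum(math.exp(-k * dr) * dr for k in range(int(R / dr)))
print(tail_integral)                       # ~ 1.0 = E(Y)
```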
Remark 5.12. In fact, a similar formula for a random variable that is not necessarily non-negative
is also possible. This can be proved with the notion of the Lebesgue–Stieltjes integral. To
heuristically derive an expression, we can assume Y is bounded and has a PDF; then we yield
E(Y ) = ∫R+ (1− F (r)) dr − ∫R− F (r) dr.
6 Independence and Product Measures
def:Indep Definition 6.1. Consider the probability space (Ω,A ,P).
• Two events A,A′ ∈ A are independent if P(A ∩ A′) = P(A)P(A′). A sequence of events
(An)n∈N ⊆ A is pairwise independent if P(Ai ∩ Aj) = P(Ai)P(Aj) for i ̸= j, and is mutually
independent if for any finite subset I ⊆ N we have
P(⋂n∈I An) = ∏n∈I P(An).
• Two random variables Y : (Ω,A ) → (Y,Y ) and Z : (Ω,A ) → (Z,Z ) are independent if
{ω ∈ Ω : Y (ω) ∈ B} and {ω ∈ Ω : Z(ω) ∈ C} are independent for any B ∈ Y and C ∈ Z .
Equivalently, we write
P(Y ∈ B,Z ∈ C) = P(Y ∈ B)P(Z ∈ C), B ∈ Y , C ∈ Z .
A sequence of random variables (Yn)n∈N is pairwise independent if Yi and Yj are independent
for any i ̸= j, and is mutually independent if for any finite I ⊆ N,
P(⋂n∈I{Yn ∈ Bn}) = ∏n∈I P(Yn ∈ Bn), Bn ∈ Yn, n ∈ I.
thm:Indp Theorem 6.2. Let Y : (Ω,A )→ (Y,Y ) and Z : (Ω,A )→ (Z,Z ) be random variables. Then, Y
and Z are independent if and only if E(f(Y )g(Z)) = E(f(Y ))E(g(Z)) for any non-negative
f : (Y,Y )→ (R,B(R)) and g : (Z,Z )→ (R,B(R)).
Proof. The 'if' direction is immediate when we take f and g to be indicators. Regarding the 'only
if' direction, an application of simple function approximation and monotone convergence finishes the
proof.
Remark 6.3. If we replace the f and g above by bounded measurable functions, the theorem is
still true. But be careful when dealing with integrable functions in a similar setting, as the product of
integrable functions need not be integrable.
Remark 6.4. Suppose Y and Z are metric spaces endowed with the corresponding Borel σ-algebras.
For Y and Z to be independent, it is sufficient to have E(f(Y )g(Z)) = E(f(Y ))E(g(Z)) for any
bounded continuous f and g. The proof of this statement involves a more delicate treatment of the
related σ-algebras. We refer to ........
The following technical result will be useful later.
lem:BorelCantelli Lemma 6.5 (Borel–Cantelli). On (Ω,A ,P), let (An)n∈N ⊆ A . The following is true:
(a) if ∑n∈N P(An) < ∞, then P(⋂n∈N ⋃k≥n Ak) = 0;
(b) if P(⋂n∈N ⋃k≥n Ak) = 0 and (An)n∈N is mutually independent, then ∑n∈N P(An) < ∞.
Proof. (a) By Theorem 1.16 (a) (b),
P(⋂n∈N ⋃k≥n Ak) ≤ P(⋃k≥m Ak) ≤ ∑k≥m P(Ak), m ∈ N.
By the hypothesis that ∑n∈N P(An) < ∞, the right-hand side above tends to 0 as m → ∞. The
proof is complete.
(b) Note that, by Theorem 1.16 (c) (e) (d),
P(⋂n∈N ⋃k≥n Ak) = limn→∞ P(⋃k≥n Ak) = limn→∞ limm→∞ P(⋃mk=n Ak)
= limn→∞ limm→∞ (1 − P(⋂mk=n Ack)) = limn→∞ limm→∞ (1 − ∏mk=n P(Ack))
= 1 − limn→∞ limm→∞ ∏mk=n (1 − P(Ak)).
This together with the hypothesis that P(⋂n∈N ⋃k≥n Ak) = 0 implies
limn→∞ limm→∞ ∏mk=n (1 − P(Ak)) = 1.
By taking logarithms, we have
limn→∞ limm→∞ ∑mk=n log(1 − P(Ak)) = limn→∞ ∑k≥n log(1 − P(Ak)) = 0.
It follows that ∑k∈N log(1 − P(Ak)) is a converging sum with non-positive summands. Because
| log(1 − z)| ≥ z for z ∈ [0, 1), we conclude the proof.
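Part (a) is easy to observe in simulation. Below is a sketch (our own parameters): with independent
Un ~ Uniform[0, 1] and An = {Un < 1/n²}, we have ∑n P(An) < ∞, so past a fixed index m the
probability of seeing any further An at all is about ∑n≥m 1/n², which is already small for moderate m.

```python
# Borel-Cantelli sanity check: events A_n = {U_n < 1/n^2} have summable
# probabilities, so only finitely many occur almost surely.
import random

random.seed(1)
trials, horizon, m = 4000, 2000, 100
late = sum(
    any(random.random() < 1.0 / n**2 for n in range(m, horizon))
    for _ in range(trials)
)
print(late / trials)   # ~ sum_{n >= 100} 1/n^2 ~ 0.01, and -> 0 as m grows
```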
In view of Definitions 3.1 and 6.1, for independent Y, Z and B ∈ Y , C ∈ Z , we have
P(Y,Z)(B × C) = PY (B)PZ(C). This motivates the following notion of product measures.
def:ProdMeas Definition 6.6. Consider two measurable spaces (X,X ) and (S,S ). Let µ and ν be measures on
(X,X ) and (S,S ), respectively.
• The (Cartesian) product of sets A and B is defined as A × B := {(a, b) : a ∈ A, b ∈ B}. In
particular, X× S = {(x, s) : x ∈ X, s ∈ S}.
• The product σ-algebra of X and S is defined as X ⊗S := σ({A×B : A ∈X , B ∈ S }).
• The product measure of µ and ν, denoted by µ⊗ν, is defined as the measure on (X×S,X ⊗S )
that satisfies µ⊗ ν(A×B) = µ(A)ν(B) for any A ∈X and B ∈ S .
Remark 6.7. 1. One way to establish the existence of the product measure is to use the Carathéodory
extension theorem (cf. Remark 1.9). Alternatively, we can also define µ⊗ν(C) as ∫S µ(Cs)ν(ds)
or ∫X ν(Cx)µ(dx) for C ∈ X ⊗ S , where Cs := {x ∈ X : (x, s) ∈ C} and Cx := {s ∈ S :
(x, s) ∈ C}. Note that it is not trivial to justify the above-mentioned definition.
2. We can also separately show the uniqueness using the monotone class theorem (cf. Remark 1.4),
in case some version of the Carathéodory extension theorem does not cover the uniqueness.
prop:IndepProdMeas Proposition 6.8. On (Ω,A ,P), consider two independent random variables Y : (Ω,A )→ (Y,Y )
and Z : (Ω,A )→ (Z,Z ). Then, P(Y,Z) = PY ⊗ PZ .
Proof. The proof of this result is out of the scope of this course. It is mainly based on monotone
class theorem (cf. Remark 1.4).
prop:LebesgueRn Proposition 6.9. Let λ be the Lebesgue measure on (R,B(R)), and λn be the Lebesgue measure on
(Rn,B(Rn)). Then, B(Rn) = B(R)⊗n and λn = λ⊗n.
Proof. The proof of this result is out of the scope of this course. It is mainly based on Remark 1.7
and monotone class theorem (cf. Remark 1.4).
The following results allow us to interchange the order of integration. The proof, mostly based
on monotone class theorem (cf. Remark 1.4), is out of the scope of this course.
lem:SectionMeasurable Lemma 6.10. Consider f : (X × S,X ⊗ S ) → (R,B(R)). Let µ and ν be measures on (X,X )
and (S,S ), respectively. The following is true:
(a) for any x ∈ X, s 7→ f(x, s) is S -measurable; moreover, s 7→ ∫X f(x, s)µ(dx) is S -measurable;
(b) for any s ∈ S, x 7→ f(x, s) is X -measurable; moreover, x 7→ ∫S f(x, s)ν(ds) is X -measurable.
thm:Fubini Theorem 6.11 (Fubini–Tonelli). Suppose f : (X × S,X ⊗S ) → (R,B(R)) satisfies either f ≥ 0,
or ∫X×S |f(z)|µ⊗ ν(dz) < ∞ (i.e., f ∈ L1(X× S,X ⊗S , µ⊗ ν)). Then,
∫X×S f(z)µ⊗ ν(dz) = ∫S ∫X f(x, s)µ(dx)ν(ds) = ∫X ∫S f(x, s)ν(ds)µ(dx).
We note that the analogue for multi-variate f is also true.
Example 6.12. Here is a non-example of the Fubini–Tonelli theorem. Let X = S = [0, 1], both
endowed with Lebesgue measure. Let gk(z) := (1/k − 1/(k+1))−1 1(1/(k+1), 1/k)(z) and note
∫[0,1] gk(z) dz = 1. We also define
f(x, s) = ∑∞k=1 (gk(x)− gk+1(x)) gk(s), (x, s) ∈ [0, 1]².
Note that for each (x, s), at most two terms are non-zero, f is not single-signed and f /∈ L1. Note,
in addition, that
∫[0,1] ∫[0,1] f(x, s) dx ds = ∫[0,1] ∑∞k=1 0 · gk(s) ds = 0
and
∫[0,1] ∫[0,1] f(x, s) ds dx = ∫[0,1] ∑∞k=1 (gk(x)− gk+1(x)) dx = ∫[0,1] g1(x) dx = 1.
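A discrete analogue (our own substitution, with counting measure on N × N in place of Lebesgue
measure on [0, 1]²) makes the failure easy to compute exactly: take f(m, n) = 1 if m = n, −1 if
m = n + 1 and 0 otherwise; then the two iterated sums are 1 and 0.

```python
# Discrete failure of interchanging the order of summation when f is neither
# non-negative nor integrable: row sums and column sums over N x N differ.
def f(m, n):
    return 1 if m == n else (-1 if m == n + 1 else 0)

def row_sum(m):          # sum over n of f(m, n); exact, f(m, .) has finite support
    return sum(f(m, n) for n in range(m + 2))

def col_sum(n):          # sum over m of f(m, n); exact as well
    return sum(f(m, n) for m in range(n + 2))

N = 1000                 # rows/columns beyond N all sum to zero
print(sum(row_sum(m) for m in range(N)))   # 1: only the row m = 0 survives
print(sum(col_sum(n) for n in range(N)))   # 0: every column cancels
```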
As a consequence of the results above, we yield that the joint PDF of independent random
variables is the product of the individual PDFs (if they exist).
Corollary 6.13. Let Y be an Rn-valued random variable and Z be an Rd-valued random variable.
Suppose Y and Z are independent, and have PDFs fY and fZ , respectively. Then, (Y,Z), as an
Rn+d-valued random variable, has density f(Y,Z)(y, z) = fY (y)fZ(z) for (y, z) ∈ Rn ×Rd. Moreover,
for any B ∈ B(Rn+d), we have
P((Y,Z) ∈ B) = ∫Rn ∫Rd 1B(y, z)fY (y)fZ(z) dz dy = ∫Rd ∫Rn 1B(y, z)fY (y)fZ(z) dy dz.
Proof. In view of Proposition 6.8, we have
P(Y,Z)(B) = ∫Rn+d 1B(r)P(Y,Z)(dr) = ∫Rn+d 1B(r)PY ⊗ PZ(dr).
Then, by the Fubini–Tonelli theorem (Theorem 6.11),
P(Y,Z)(B) = ∫Rn ∫Rd 1B(y, z)PZ(dz)PY (dy) = ∫Rd ∫Rn 1B(y, z)PY (dy)PZ(dz).
Invoking the hypothesis that both Y and Z have PDFs and Lemma 6.10, we yield
P(Y,Z)(B) = ∫Rn ∫Rd 1B(y, z)fZ(z) dz PY (dy) = ∫Rn ∫Rd 1B(y, z)fZ(z)fY (y) dz dy
= ∫Rd ∫Rn 1B(y, z)fY (y) dy PZ(dz) = ∫Rd ∫Rn 1B(y, z)fY (y)fZ(z) dy dz.
Finally, by the Fubini–Tonelli theorem (Theorem 6.11) and Proposition 6.9, we conclude
P(Y,Z)(B) = ∫Rn+d 1B(r)f(Y,Z)(r) dr.
7 Change of Variables
To illustrate the well-known result called Jacobi's transformation formula, we consider the heuristic
argument below on (R²,B(R²), λ²). Let T ∈ R2×2 be a matrix, and treat v ∈ R² as a column vector.
For B ∈ B(R²), we define TB := {Tv : v ∈ B}. We wonder what λ²(TB) is. From linear algebra,
we know that every invertible T arises as a product of elementary matrices: (1) permutation
matrices, denoted by Te1; (2) adding one row onto another row, denoted by Te2; (3) row stretches,
denoted by Te3. Note that (1) and (2) do not affect the volume, and |det Tei| = 1 for i = 1, 2; only
a row stretch scales the volume, by |det Te3|. This together with the fact that det(TT′) = det T det T′
yields λ²(TB) = |det T |λ²(B). Next, let g : G ⊆ R² → R² be one-to-one and continuously
differentiable. In view of the Taylor expansion, note that
g(u + δu) = g(u) + J(u)δu + o(|δu|), where J(u) is the 2 × 2 matrix with entries [J(u)]i,j = ∂gi/∂rj(u).
The above heuristically leads to the formula
∫g(G) 1B(g−1(r)) dr = ∫g(G) 1g(B)(r) dr = ∫G 1B(r)|det J(r)| dr, B ⊆ G, B ∈ B(R²),
where g(B) := {v ∈ R² : v = g(r) for some r ∈ B}. Following the idea of simple function
approximation, we yield
∫g(G) f(g−1(r)) dr = ∫G f(r)|det J(r)| dr.
Since g−1 is one-to-one, it is sometimes more convenient to change f ◦ g−1 above to h:
∫g(G) h(r) dr = ∫G h(g(r))|det J(r)| dr.
Below we officially introduce Jacobi's transformation formula.
Definition 7.1. Let G ⊆ Rn be open and g : G→ Rn be continuously differentiable. The Jacobian
matrix of g is Jg : G→ Rn×n with (i, j)-entry [Jg(u)]i,j := ∂gi/∂rj(u).
thm:JacobiTrans Theorem 7.2 (Jacobi's transformation formula). Let G ⊆ Rn be open. Suppose g : G → Rn is
one-to-one and continuously differentiable on an open D ⊆ G with λn(G \ D) = 0. Then, for any
h : (Rn,B(Rn))→ (R,B(R)) that is non-negative or integrable, we have
∫g(G) h(r) dr = ∫G h(g(r))|det Jg(r)| dr.
Proof. See Theorem 7.26, W. Rudin (1987), Real and Complex Analysis.
The next theorem is an application of Jacobi's transformation formula (Theorem 7.2).
thm:CoVPDF Theorem 7.3. Let Y be an Rn-valued random variable with PDF fY and β : (Rn,B(Rn)) →
(Rn,B(Rn)). Suppose β is one-to-one, continuously differentiable and |det Jβ(r)| > 0 (equivalently,
Jβ(r) is invertible) for r ∈ Rn. Then, Z = β(Y ) also has a PDF fZ , and
fZ(z) = 1β(Rn)(z)fY (β−1(z))|det Jβ−1(z)|, z ∈ Rn.
Proof. Because β has an invertible Jacobian everywhere, G := β(Rn) is open (long story). Moreover,
because β is one-to-one, g := β−1 is well-defined on G. We also have that Jg(r) is well-defined and
equals the inverse of Jβ(g(r)) for r ∈ G, due to the inverse function theorem (cf. [J. Shurman,
Multivariable Calculus, Theorem 5.2.1]). It follows that |det Jg(r)| < ∞ for r ∈ G. Then, for
B ∈ B(Rn), we have
P(Z ∈ B) = P(β(Y ) ∈ B) = P(β(Y ) ∈ B ∩ G) = P(Y ∈ β−1(B ∩ G)) = P(Y ∈ g(B ∩ G))
= ∫g(G) 1g(B∩G)(y)fY (y) dy.
By Theorem 7.2,
P(Z ∈ B) = ∫G 1g(B∩G)(g(r))fY (g(r))|det Jg(r)| dr = ∫B 1G(r)fY (g(r))|det Jg(r)| dr,
which concludes the proof.
The following is helpful in case β is not one-to-one.
Corollary 7.4. Let Y be an Rn-valued random variable with PDF fY and β : (Rn,B(Rn)) →
(Rn,B(Rn)). Let (Sk)k∈N0 ⊆ B(Rn) be a partition of Rn, that is, Si ∩ Sj = ∅ for i ̸= j and
Rn = ⋃k∈N0 Sk. Suppose additionally that λn(S0) = 0; and for k ∈ N, Sk is open, βk : Sk → Rn
is one-to-one and continuously differentiable with |det Jβk(r)| > 0 for r ∈ Sk, and β(r) =
1S0(r)β(r) + ∑∞k=1 βk(r)1Sk(r) for r ∈ Rn. Then, Z = β(Y ) has a PDF fZ , and
fZ(z) = ∑k∈N 1βk(Sk)(z)fY (β−1k(z))|det Jβ−1k(z)|, z ∈ Rn.
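A classical instance of Corollary 7.4 (our own worked example, not from the notes) is β(r) = r² with
S1 = (0,∞), S2 = (−∞, 0) and S0 = {0}: for Y ~ N(0, 1), the two branches r = ±√z each contribute
φ(√z)/(2√z), giving the χ²(1) density fZ(z) = φ(√z)/√z. The sketch below compares this with a
Monte Carlo estimate.

```python
# Corollary 7.4 with beta(r) = r**2 (not one-to-one): for Y ~ N(0,1) the
# formula yields the chi-square(1) density, checked against simulation.
import math, random

random.seed(0)

def f_Z(z):                       # density from the corollary
    phi = math.exp(-z / 2) / math.sqrt(2 * math.pi)
    return phi / math.sqrt(z)     # two branches, phi/(2*sqrt(z)) each

zs = [random.gauss(0, 1) ** 2 for _ in range(400_000)]
a, b = 0.5, 0.6                   # a small test interval
emp = sum(1 for z in zs if a < z <= b) / len(zs)
thy = f_Z(0.55) * (b - a)         # midpoint rule on the same interval
print(emp, thy)                   # both ~ 0.041
```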
8 Selections of Inequalities
thm:ChebyshevIneq Theorem 8.1 (Chebyshev's Inequality). Let f be a real-valued X -measurable function on a measure
space (X,X , µ). For p ∈ [1,∞) and a > 0, we have
µ({x ∈ X : |f(x)| ≥ a}) ≤ a−p ∫X |f(x)|^p µ(dx).
Proof. This is an immediate consequence of the observation that, for z ∈ {x ∈ X : |f(x)| ≥ a}, we
have |f(z)|^p/a^p ≥ 1.
cor:ChernoffIneq Corollary 8.2 (Chernoff's Inequality). Let Y be a real-valued random variable and t > 0. Then,
P(Y ≥ E(Y ) + t) ≤ E(e^{λ(Y −E(Y ))}) e^{−λt}, λ > 0.
Proof. Note that for λ > 0,
P(Y ≥ E(Y ) + t) = P(e^{λ(Y −E(Y ))} ≥ e^{λt}) ≤ E(e^{λ(Y −E(Y ))}) e^{−λt},
due to Chebyshev's Inequality (Theorem 8.1) with p = 1.
thm:HoeffdingIneq Theorem 8.3 (Hoeffding's Inequality). Suppose Y1, . . . , Yn are mutually independent real-valued
random variables, and ak ≤ Yk ≤ bk for k = 1, . . . , n. Then, for ε ≥ 0,
P(|∑nk=1 Yk − ∑nk=1 E(Yk)| ≥ ε) ≤ 2 exp(−2ε² / ∑nk=1 (bk − ak)²).
Proof. The case ε = 0 is obvious. Let ε > 0. We first apply Chernoff's inequality (Corollary 8.2)
together with mutual independence (Theorem 6.2) to yield, for λ > 0,
P(∑nk=1 Yk − ∑nk=1 E(Yk) ≥ ε) ≤ e^{−λε} E(e^{λ∑nk=1(Yk−E(Yk))}) = e^{−λε} ∏nk=1 E(e^{λ(Yk−E(Yk))}).
We need to estimate E(e^{λ(Yk−E(Yk))}). The estimate is provided in Lemma 8.4, and thus
P(∑nk=1 Yk − ∑nk=1 E(Yk) ≥ ε) ≤ e^{−λε} ∏nk=1 e^{λ²(bk−ak)²/8} = exp((λ²/8)∑nk=1 (bk − ak)² − ελ), λ > 0.
Since λ > 0 is arbitrary, we pick λ = 4ε/∑nk=1 (bk − ak)² and yield
P(∑nk=1 Yk − ∑nk=1 E(Yk) ≥ ε) ≤ exp(−2ε²/∑nk=1 (bk − ak)²).
Applying the same reasoning to −Y1, . . . ,−Yn, we yield
P(∑nk=1 Yk − ∑nk=1 E(Yk) ≤ −ε) ≤ exp(−2ε²/∑nk=1 (bk − ak)²).
Finally, in view of Theorem 1.16 (b), we conclude the proof.
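A quick Monte Carlo check (our own parameters: Yk ~ Uniform[0, 1], so ak = 0 and bk = 1): the
empirical tail probability should sit comfortably below the Hoeffding bound 2 exp(−2ε²/n); the
bound is valid, though typically not tight.

```python
# Empirical tail of |S_n - E(S_n)| for sums of n Uniform[0,1] draws, compared
# with the Hoeffding bound 2*exp(-2*eps**2 / n).
import math, random

random.seed(2)
n, eps, trials = 50, 5.0, 20_000
hits = sum(
    abs(sum(random.random() for _ in range(n)) - n * 0.5) >= eps
    for _ in range(trials)
)
print(hits / trials)                      # ~ 0.014
print(2 * math.exp(-2 * eps**2 / n))      # bound ~ 0.736
```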
lem:EstMoment Lemma 8.4. For a real-valued random variable Y with E(Y ) = 0 and a ≤ Y ≤ b, we have
E(e^{λY}) ≤ e^{λ²(b−a)²/8}, λ > 0.
Proof. The statement is clearly true for a = b. Suppose a < b and define Z := (b − Y )/(b − a).
Then, Y = Za + (1 − Z)b and, by convexity of the exponential,
e^{λY} = e^{Zλa+(1−Z)λb} ≤ Z e^{λa} + (1 − Z) e^{λb} = ((b − Y )/(b − a)) e^{λa} + ((Y − a)/(b − a)) e^{λb}.
Taking expectations and using E(Y ) = 0, we yield
E(e^{λY}) ≤ (b/(b − a)) e^{λa} − (a/(b − a)) e^{λb} = exp(log((b e^{λa} − a e^{λb})/(b − a)))
= exp(λa + log((b − a e^{λ(b−a)})/(b − a))) = exp(u(p − 1) + log(p + (1 − p) e^{u})), (8.1) eq:EExplambdaY
where p := b/(b − a) and u := λ(b − a). Let φ(u) := u(p − 1) + log(p + (1 − p) e^{u}) for u ≥ 0 (note
a ≤ 0 ≤ b, and thus p ∈ [0, 1]). By Taylor's expansion, we have
φ(u) = φ(0) + φ′(0)u + (1/2)φ′′(ξ)u²
for some ξ ∈ [0, u]. Note φ(0) = 0 and φ′(u) = (p − 1) + (1 − p)e^{u}/(p + (1 − p)e^{u}), i.e.,
φ′(0) = 0. Regarding φ′′,
φ′′(u) = (1 − p)e^{u}/(p + (1 − p)e^{u}) − (1 − p)²e^{2u}/(p + (1 − p)e^{u})² ≤ 1/4, u ≥ 0.
It follows that φ(u) ≤ u²/8. This together with (8.1) concludes the proof.
Remark 8.5. Extension of Hoeffding's inequality.
thm:HolderIneq Theorem 8.6 (Hölder's Inequality). Let p, q ∈ (1,∞) satisfy 1/p + 1/q = 1. On (X,X , µ), for
any real-valued X -measurable f and g, we have
∫X |f(x)g(x)|µ(dx) ≤ (∫X |f(x)|^p µ(dx))^{1/p} (∫X |g(x)|^q µ(dx))^{1/q}.
If p = 1, then
∫X |f(x)g(x)|µ(dx) ≤ Cg ∫X |f(x)|µ(dx),
where Cg := inf{r ≥ 0 : µ({x ∈ X : |g(x)| > r}) = 0}.
The proof of Theorem 8.6 relies on the following lemma.
lem:YoungIneqProd Lemma 8.7. Let a, b ≥ 0 and p, q ∈ (1,∞) satisfy 1/p + 1/q = 1. Then,
ab ≤ a^p/p + b^q/q.
The equality holds only when a^p = b^q.
Proof. The inequality is clearly true if a = 0 or b = 0. For the rest of the proof, we assume a > 0
and b > 0. Because ln is concave,
ln(a^p/p + b^q/q) ≥ (1/p) ln(a^p) + (1/q) ln(b^q) = ln(ab).
Note that the equality is true only when a^p = b^q. Taking exponentials on both sides, we conclude
the proof.
Proof of Theorem 8.6. The case p = 1 is obvious. We suppose p, q ∈ (1,∞). If one of
(∫X |f(x)|^p µ(dx))^{1/p} or (∫X |g(x)|^q µ(dx))^{1/q} is infinite or zero, the inequality is automatically
true. Without loss of generality (normalizing), we may assume both quantities equal 1. By Lemma 8.7,
we have
|f(x)g(x)| = |f(x)||g(x)| ≤ |f(x)|^p/p + |g(x)|^q/q.
Integrating both sides, we yield the statement.
Below is a special case of Hölder's inequality (Theorem 8.6) with p = q = 2.
cor:CSIneq Corollary 8.8 (Cauchy–Schwarz). On (X,X , µ), for any real-valued f and g, we have
∫X |f(x)g(x)|µ(dx) ≤ √(∫X |f(x)|² µ(dx)) √(∫X |g(x)|² µ(dx)).
Theorem 8.9 (Minkowski's Inequality). Let p ∈ [1,∞). For any real-valued f and g in Lp(X,X , µ),
we have f + g ∈ Lp and
(∫X |f(x) + g(x)|^p µ(dx))^{1/p} ≤ (∫X |f(x)|^p µ(dx))^{1/p} + (∫X |g(x)|^p µ(dx))^{1/p}.
Proof. Note that x 7→ |x|^p is convex. Therefore,
|(1/2)|f(x)| + (1/2)|g(x)||^p ≤ (1/2)|f(x)|^p + (1/2)|g(x)|^p, x ∈ X.
This implies that f + g ∈ Lp. Next, note that the case p = 1 is immediate due to the triangle
inequality. We suppose p > 1. We also assume |f + g| is not constantly 0, as the statement is
trivially true otherwise. Note
∫X |f(x) + g(x)|^p µ(dx) ≤ ∫X |f(x)||f(x) + g(x)|^{p−1} µ(dx) + ∫X |g(x)||f(x) + g(x)|^{p−1} µ(dx).
Let q = p/(p − 1) so that 1/p + 1/q = 1. By Hölder's inequality (Theorem 8.6),
∫X |f(x)||f(x) + g(x)|^{p−1} µ(dx) ≤ (∫X |f(x)|^p µ(dx))^{1/p} (∫X |f(x) + g(x)|^p µ(dx))^{1−1/p}
and
∫X |g(x)||f(x) + g(x)|^{p−1} µ(dx) ≤ (∫X |g(x)|^p µ(dx))^{1/p} (∫X |f(x) + g(x)|^p µ(dx))^{1−1/p}.
Combining the above and simplifying the resulting inequality, we conclude the proof.
Remark 8.10. One of the major consequences of Minkowski's inequality is that we can use
d(f, g) := (∫X |f(x)− g(x)|^p µ(dx))^{1/p}
as a metric for Lp spaces with p ∈ [1,∞).
The following inequality estimates the variance of a system under the influence of multiple
independent factors.
Theorem 8.11 (Efron–Stein's Inequality). Let Y1, . . . , Yn, Y′1, . . . , Y′n be mutually independent
random variables from (Ω,A ) to (Y,Y ) such that Yk and Y′k have the same distribution. Define
Y := (Y1, . . . , Yn) and Y(k) := (Y1, . . . , Yk−1, Y′k, Yk+1, . . . , Yn). Then, for any bounded
f : (Yn,Y ⊗n)→ (R,B(R)), we have
Var(f(Y )) ≤ (1/2)∑nk=1 E((f(Y )− f(Y(k)))²).
Proof. In what follows, we use the following notations:
Y′ := (Y′1, . . . , Y′n), Y[k] := (Y′1, . . . , Y′k, Yk+1, . . . , Yn), k = 1, . . . , n.
Note Y[n] = Y′. We also set Y[0] := Y . Observe that
Var(f(Y )) = E(f(Y )²)− E(f(Y ))² = E(f(Y )²)− E(f(Y )f(Y′))
= E(f(Y )(f(Y )− f(Y′))) = ∑nk=1 E(f(Y )(f(Y[k−1])− f(Y[k]))).
Note that the distribution of (Y1, . . . , Yn, Y′1, . . . , Y′n) remains the same if we switch Yk and Y′k.
Therefore,
f(Y )(f(Y[k−1])− f(Y[k])) and f(Y(k))(f(Y[k])− f(Y[k−1]))
have the same distribution. It follows from Theorem 4.12 that
E(f(Y )(f(Y[k−1])− f(Y[k]))) = E(f(Y(k))(f(Y[k])− f(Y[k−1]))),
and thus each side equals the average of the two. Consequently, by the Cauchy–Schwarz inequality
(Corollary 8.8),
E(f(Y )(f(Y[k−1])− f(Y[k]))) = (1/2)E((f(Y )− f(Y(k)))(f(Y[k−1])− f(Y[k])))
≤ (1/2)√(E((f(Y )− f(Y(k)))²)) √(E((f(Y[k−1])− f(Y[k]))²)).
Finally, noticing that
E((f(Y )− f(Y(k)))²) = E((f(Y[k−1])− f(Y[k]))²)
and summing over k, we conclude the proof.
thm:JensenIneq Theorem 8.12 (Jensen's Inequality). On (Ω,A ,P), consider Y : (Ω,A ) → (R,B(R)) and a
convex g : (R,B(R))→ (R,B(R)). Suppose Y and g(Y ) are integrable. Then, g(E(Y )) ≤ E(g(Y )).
In order to prove Jensen's inequality, we first prove the following lemma.
lem:Convex Lemma 8.13. Let g : (R,B(R)) → (R,B(R)) be convex. Then, for any r0 ∈ R, there is c0 ∈ R
such that g(r) ≥ g(r0) + c0(r − r0) for r ∈ R.
Proof. We first consider r > r0. Note that for c ∈ (0, 1), we have
g(r0 + (1− c)(r − r0)) = g(cr0 + (1− c)r) ≤ cg(r0) + (1− c)g(r),
and thus
(g(r0 + (1− c)(r − r0))− g(r0)) / ((1− c)(r − r0)) ≤ (g(r)− g(r0)) / (r − r0).
It follows that r 7→ (g(r)− g(r0))/(r − r0) is non-decreasing in r > r0. Similar reasoning shows
that r 7→ (g(r0)− g(r))/(r0 − r) is non-decreasing in r < r0. Moreover, for r′ < r0 < r′′, we have
g(r0) = g(((r′′ − r0)/(r′′ − r′)) r′ + ((r0 − r′)/(r′′ − r′)) r′′) ≤ ((r′′ − r0)/(r′′ − r′)) g(r′) + ((r0 − r′)/(r′′ − r′)) g(r′′),
and thus
(g(r0)− g(r′))/(r0 − r′) ≤ (g(r′′)− g(r0))/(r′′ − r0).
Let c0 := infr′′>r0 (g(r′′)− g(r0))/(r′′ − r0); we must have c0 > −∞, and
(g(r0)− g(r′))/(r0 − r′) ≤ c0 ≤ (g(r′′)− g(r0))/(r′′ − r0), r′ < r0 < r′′,
which completes the proof.
Proof of Theorem 8.12. In view of Lemma 8.13, we let r0 = E(Y ) and take c0 ∈ R such that
g(r) ≥ g(r0) + c0(r − r0) for r ∈ R. Then, by Proposition 4.8, we yield
E(g(Y )) ≥ g(E(Y )) + c0E(Y − E(Y )) = g(E(Y )).
The proof is complete.
9 Convergence of Random Variables
In what follows, we fix a measure space (X,X , µ). Let f be a real-valued X -measurable function
and (fn)n∈N be a sequence of real-valued X -measurable functions. Upon declaration, we will replace
(X,X , µ) by a probability space (Ω,A ,P), and replace f, fn by real-valued random variables Y, Yn.
def:FuncConv Definition 9.1. • We say (fn)n∈N converges almost surely to f if
µ({x : limn→∞ fn(x) = f(x)}c) = 0,
and denote limn→∞ fn = f , µ-a.s. If µ is a probability, we may alternatively say (fn)n∈N
converges to f with probability 1.
• Let p ∈ [1,∞) and suppose f ∈ Lp, (fn)n∈N ⊆ Lp. We say (fn)n∈N converges to f in Lp if
limn→∞ ∫X |fn(x)− f(x)|^p µ(dx) = 0,
and denote fn → f in Lp.
• We say (fn)n∈N converges to f in measure if for any ε > 0,
limn→∞ µ({x ∈ X : |fn(x)− f(x)| > ε}) = 0,
and denote µ-lim fn = f .
Remark 9.2. Note all the convergences mentioned above depend on the underlying measure space.
Sometimes it is necessary to emphasize the dependence on µ.
rmk:ASConv Remark 9.3. If (X,X , µ) is complete, then, following Remark 2.2 (7), the almost sure limit (if
it exists) preserves measurability. This means we need not introduce a measurable f beforehand in
Definition 9.1. Instead, we can simply define f(x) := limn→∞ fn(x) for x ∈ X \N , when the limit
exists except on a null set N .
exmp:FuncConv Example 9.4. In this example, we consider the measure space ([0, 1],B([0, 1]), λ).
1. Let fn := n² 1[0,1/n]. Then, (fn)n∈N converges to 0 almost surely and in measure, but
(fn)n∈N does not converge to 0 in Lp for any p ∈ [1,∞).
2. Let s0 = 0 and sn := sn−1 + n−1 for n ∈ N. For n ∈ N define
fn(x) := 1[sn−1%1, sn%1)(x) if sn−1%1 ≤ sn%1, and fn(x) := 1[0, sn%1)∪[sn−1%1, 1](x) otherwise,
where % means modulo. Then, (fn)n∈N converges to 0 in Lp for p ∈ [1,∞) and in measure,
but not almost surely.
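The 'sliding hump' in item 2 is worth simulating (a sketch; the sample points are ours): the humps
have length 1/n, so the Lp norms vanish, yet since ∑ 1/n = ∞ the humps wrap around [0, 1]
forever and every point is covered infinitely often, destroying almost sure convergence.

```python
# Count how often each of 10 sample points lands inside the sliding hump
# [s_{n-1} % 1, s_n % 1) of Example 9.4 (2); every count keeps growing.
hits = [0] * 10
s_prev = 0.0
for n in range(1, 100_001):
    s = s_prev + 1.0 / n
    a, b = s_prev % 1.0, s % 1.0
    for i in range(10):
        x = (i + 0.5) / 10
        inside = (a <= x < b) if a <= b else (x < b or x >= a)
        hits[i] += inside
    s_prev = s
print(hits)   # each point is hit again and again (roughly log of the horizon)
```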
Theorem 9.5. Let g : (R,B(R))→ (R,B(R)) be continuous.
(a) if (fn)n∈N converges to f almost surely, then (g ◦ fn)n∈N converges to g ◦ f almost surely;
(b) suppose µ = P is a probability measure and (Yn)n∈N converges to Y in measure, then (g(Yn))n∈N
converges to g(Y ) in measure.
Proof. (a) Note {x ∈ X : limn→∞ fn(x) = f(x)} ⊆ {x ∈ X : limn→∞ g(fn(x)) = g(f(x))}. There-
fore,
µ({x ∈ X : limn→∞ g(fn(x)) = g(f(x))}c) ≤ µ({x ∈ X : limn→∞ fn(x) = f(x)}c) = 0.
(b) Let ε > 0 and ℓ ∈ N. Note
{|g(Yn)− g(Y )| > ε} ⊆ ({|g(Yn)− g(Y )| > ε} ∩ {Y ∈ [−ℓ, ℓ]}) ∪ {Y ∈ [−ℓ, ℓ]c}.
Therefore, by Theorem 1.16 (a) (b),
P(|g(Yn)− g(Y )| > ε) ≤ P(|g(Yn)− g(Y )| > ε, Y ∈ [−ℓ, ℓ]) + P(Y ∈ [−ℓ, ℓ]c). (9.1) eq:mugf
Note g is uniformly continuous on [−ℓ− 1, ℓ+ 1]; then there is δ ∈ (0, 1) such that |g(y + r)− g(y)| ≤ ε
for y ∈ [−ℓ, ℓ] and r ∈ [−δ, δ]. Thus, on {Y ∈ [−ℓ, ℓ]}, for |g(Yn)− g(Y )| > ε to be true we must
have |Yn − Y | > δ. This together with (9.1) implies that
P(|g(Yn)− g(Y )| > ε) ≤ P(|Yn − Y | > δ) + P(Y ∈ [−ℓ, ℓ]c).
Taking lim sup on both sides above, we yield
lim supn→∞ P(|g(Yn)− g(Y )| > ε) ≤ P(Y ∈ [−ℓ, ℓ]c).
Noting that ([−ℓ, ℓ]c)ℓ∈N decreases to ∅ and invoking Theorem 1.16 (e), we have
lim supn→∞ P(|g(Yn)− g(Y )| > ε) = 0,
which completes the proof.
Example 9.6. Here is a non-example for (b) above with µ(X) = ∞. Consider (R+,B(R+), λ). Let
f(x) := x and fn(x) := x + n−1. Clearly, (fn)n∈N converges to f in measure. Now let g(y) := y².
Then, for any n and ε > 0, we have λ({r ∈ R+ : |g(fn(r))− g(f(r))| > ε}) = λ([(nε− n−1)/2,∞)) = ∞.
10 Limit Theorems
In view of the convention that 0 · ∞ = 0 and the notion of almost sure convergence, we can easily
extend Theorem 4.9 into the following.
thm:MonoConv Theorem 10.1 (Monotone Convergence). Let (fn)n∈N be a sequence of non-negative real-valued
measurable functions. Suppose (fn)n∈N increases to f almost surely. Then,
limn→∞ ∫X fn(x)µ(dx) = ∫X f(x)µ(dx).
The next theorem is known as Fatou's lemma.
thm:Fatou Theorem 10.2 (Fatou). Let (fn)n∈N be a sequence of non-negative real-valued measurable functions.
Then,
∫X lim infn→∞ fn(x)µ(dx) ≤ lim infn→∞ ∫X fn(x)µ(dx).
Proof. Recall that lim infn→∞ fn(x) = limn→∞ infk≥n fk(x), and define gn(x) := infk≥n fk(x).
Note that (gn)n∈N increases to lim infn→∞ fn. In addition, because gn ≤ fk for k ≥ n, we have
∫X gn(x)µ(dx) ≤ infk≥n ∫X fk(x)µ(dx).
Invoking monotone convergence finishes the proof.
cor:Fatou Corollary 10.3. Suppose (fn)n∈N converges almost surely to f and ∫X |fn(x)|µ(dx) ≤ K for some
K > 0. Then, ∫X |f(x)|µ(dx) ≤ K.
Example 10.4. Fatou's lemma may fail without non-negativity. Consider ([0, 1],B([0, 1]), λ) and
define fn(x) := −n² 1[0,1/n](x) for x ∈ [0, 1]. Note lim infn→∞ fn(x) = 0 for a.e. x ∈ [0, 1], but
∫[0,1] fn(x)µ(dx) = −n and thus
lim infn→∞ ∫[0,1] fn(x)µ(dx) = −∞.
thm:DomConv Theorem 10.5 (Dominated Convergence). Suppose (fn)n∈N converges almost surely to f and there
is g ∈ L1 such that |fn| ≤ g. Then, f ∈ L1 and
limn→∞ ∫X fn(x)µ(dx) = ∫X f(x)µ(dx).
Proof. As an immediate consequence of Corollary 10.3, we have f ∈ L1. Note fn + g ≥ 0. By
Theorem 4.11 and Fatou's lemma (Theorem 10.2), we have
∫X f(x)µ(dx) + ∫X g(x)µ(dx) = ∫X lim infn→∞(fn(x) + g(x))µ(dx)
≤ lim infn→∞ ∫X(fn(x) + g(x))µ(dx) = lim infn→∞ ∫X fn(x)µ(dx) + ∫X g(x)µ(dx). (10.1) eq:fliminf
It follows that
∫X f(x)µ(dx) ≤ lim infn→∞ ∫X fn(x)µ(dx).
On the other hand, we have g − fn ≥ 0, and thus with similar reasoning as before,
−∫X f(x)µ(dx) ≤ lim infn→∞ ∫X(−fn)(x)µ(dx) = − lim supn→∞ ∫X fn(x)µ(dx). (10.2) eq:flimsupf
Combining (10.1) and (10.2) as well as the fact that lim sup ≥ lim inf, the proof is complete.
Remark 10.6. Note that monotone convergence and dominated convergence generalize Theorem
1.16 (c) (e) and Theorem 1.18.
11 Relations between Convergences
thm:ConvLPConvInMeas Theorem 11.1. If (fn)n∈N converges to f in Lp for some p ∈ [1,∞), then (fn)n∈N converges to f
in measure.
Proof. Let ε > 0. By Theorem 8.1, we have
µ({x ∈ X : |fn(x)− f(x)| ≥ ε}) ≤ ε^{−p} ∫X |fn(x)− f(x)|^p µ(dx).
In view of Definition 9.1, we conclude the proof.
thm:ConvInMeasConvAS Theorem 11.2. If (fn)n∈N converges to f in measure, then there is a subsequence (fnk)k∈N con-
verging to f almost surely.
Proof. Let n1 := 1 and choose nk > nk−1 by induction such that
µ({x ∈ X : |fnk(x)− f(x)| > 1/k}) ≤ 2^{−k}.
We define Ak := {x ∈ X : |fnk(x)− f(x)| > 1/k} and A := ⋂n∈N ⋃k≥n Ak. Note that, by Theorem
1.16 (b), µ(⋃k∈N Ak) ≤ ∑k∈N µ(Ak) < ∞ and (⋃k≥n Ak)n∈N decreases to A. Then, by dominated
convergence (Theorem 10.5),
µ(A) = limn→∞ µ(⋃k≥n Ak) ≤ limn→∞ ∑∞k=n µ(Ak) ≤ limn→∞ 2^{−n+1} = 0.
In addition, observe that for x /∈ A, there is n ∈ N such that |fnk(x)− f(x)| ≤ 1/k for any k ≥ n,
and thus limk→∞ fnk(x) = f(x). By Theorem 1.16 (a), we have
µ({x ∈ X : limk→∞ fnk(x) = f(x)}c) ≤ µ(A) = 0.
In view of Definition 9.1, we conclude the proof.
Corollary 11.3. If (fn)n∈N converges to f in Lp for some p ∈ [1,∞), then there is a subsequence
(fnk)k∈N converging to f almost surely.
Remark 11.4. Measurability in complete probability spaces.
Theorem 11.5. Let p ∈ [1,∞). Suppose (fn)n∈N converges to f in measure, and there is a non-
negative g ∈ Lp such that |fn| ≤ g for n ∈ N. Then, f ∈ Lp and (fn)n∈N converges to f in
Lp.
Proof. We first prove that f ∈ Lp. To this end, note that for ε > 0,
{x ∈ X : |f(x)| > g(x) + ε} ⊆ {x ∈ X : |f(x)| > |fn(x)|+ ε} ⊆ {x ∈ X : |f(x)− fn(x)| > ε}.
Therefore, by Theorem 1.16 (a),
µ({x ∈ X : |f(x)| > g(x) + ε}) ≤ µ({x ∈ X : |f(x)− fn(x)| > ε}) → 0 as n → ∞;
since the left-hand side does not depend on n, it must be 0. Note additionally that
({x ∈ X : |f(x)| > g(x) + 1/k})k∈N increases to {x ∈ X : |f(x)| > g(x)}. By Theorem 1.16 (c), we
conclude |f | ≤ g µ-a.s., and thus f ∈ Lp.
We proceed to show the Lp convergence by contradiction. Suppose there is ε0 > 0 and a subse-
quence (fnk)k∈N such that
∫X |fnk(x)− f(x)|^p µ(dx) ≥ ε0, k ∈ N. (11.1) eq:fnkf
Because (fnk)k∈N converges to f in measure, by Theorem 11.2, there is a further subsequence
(fnkℓ)ℓ∈N converging to f almost surely. Note that |fn − f |^p ≤ 2^p g^p. Then, by dominated
convergence (Theorem 10.5), we have
limℓ→∞ ∫X |fnkℓ(x)− f(x)|^p µ(dx) = 0,
which contradicts (11.1). This finishes the proof.
In what follows, we let (Ω,A ,P) be a probability space, Y be a real-valued random variable and
(Yn)n∈N a sequence of real-valued random variables.
thm:ASConvInMeas Theorem 11.6. If (Yn)n∈N converges to Y almost surely, then (Yn)n∈N converges to Y in measure.
Proof. For ε > 0, let An := {ω ∈ Ω : |Yn(ω)− Y (ω)| > ε}. Note 1An converges to 0 almost surely.
Then, by dominated convergence (Theorem 10.5), we have
limn→∞ P(An) = limn→∞ ∫Ω 1An(ω)P(dω) = ∫Ω limn→∞ 1An(ω)P(dω) = 0.
In view of Definition 9.1, we conclude the proof.
Example 11.7. Here we show an example on (X,X , µ) where µ(X) = ∞ and almost sure con-
vergence does not imply convergence in measure. Let (X,X ) = (N, 2N) and µ be the counting
measure, i.e., µ(A) equals the number of elements in A. Let fn(i) := (1− n−1)1[0,n](i). Note that
(fn)n∈N converges to 1 almost surely (indeed, pointwise), but µ(|fn − 1| > 1/2) = ∞ for all n ∈ N.
12 Laws of Large Numbers
One of the fundamental results of probability regards the laws of large numbers. Below we introduce
a few different versions. In what follows, we let Y, Z be real-valued random variables, and (Yn)n∈N
be a sequence of real-valued random variables. Recall that
Var(Y ) := E((Y − E(Y ))²).
By saying Var(Y ) < ∞, we mean Y ∈ L1 and Y − E(Y ) ∈ L2. This also implies Y ∈ L2 (why?). In
this case, we have Var(Y ) = E(Y ²)− E(Y )².
For Y, Z ∈ L2, in view of the Cauchy–Schwarz inequality (Corollary 8.8), we define
Cov(Y, Z) := E((Y − E(Y ))(Z − E(Z))) = E(Y Z)− E(Y )E(Z).
29
(a) Cov(X,Y ) = Cov(Y,X);
(b) Cov(aX + bY,W ) = aCov(X,W ) + bCov(Y,W ) for any a, b ∈ R;
(c) Cov(a,W ) = 0 for any a ∈ R;
(d) Cov(Y, Y ) = Var(Y ).
The formula below is useful and is a good exercise:

Var(∑_{k=1}^n Y_k) = ∑_{j=1}^n ∑_{i=1}^n Cov(Y_i, Y_j) = ∑_{k=1}^n Var(Y_k) + 2 ∑_{1≤i<j≤n} Cov(Y_i, Y_j).
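The identity is easy to check numerically; the following sketch (ours, with arbitrary correlated Gaussian data) compares the three expressions up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 200_000
A = rng.normal(size=(n, n))
Y = A @ rng.normal(size=(n, m))        # correlated rows Y_1, ..., Y_n

lhs = np.var(Y.sum(axis=0))            # Var(Y_1 + ... + Y_n), sample version
C = np.cov(Y)                          # sample covariance matrix (Cov(Y_i, Y_j))
rhs = C.sum()                          # double sum of covariances
rhs_split = np.trace(C) + 2.0 * np.triu(C, k=1).sum()  # variances + twice the pairs i < j
print(lhs, rhs, rhs_split)             # all approximately equal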
thm:L2LLN Theorem 12.2. [L2 LLN] Suppose (Y_n)_{n∈N} satisfies E(Y_n) = a, Var(Y_n) = b^2 and Cov(Y_i, Y_j) = 0 for i ≠ j. Then,

E((1/n ∑_{k=1}^n Y_k − a)^2) ≤ b^2/n.

Consequently, (1/n ∑_{k=1}^n Y_k)_{n∈N} converges to a in L^2.
Proof. Because Cov(Y_i, Y_j) = 0 for i ≠ j, we have

E((∑_{k=1}^n Y_k − na)^2) = E((∑_{k=1}^n (Y_k − a))^2) = E(∑_{i=1}^n ∑_{j=1}^n (Y_i − a)(Y_j − a))
= ∑_{i=1}^n ∑_{j=1}^n E((Y_i − a)(Y_j − a)) = ∑_{k=1}^n E((Y_k − a)^2) = n b^2.

Dividing both sides by n^2, we finish the proof.
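A quick Monte Carlo sketch of the 1/n rate (our illustration, with Uniform(0,1) samples, so a = 1/2 and b^2 = 1/12):

import numpy as np

rng = np.random.default_rng(1)
a, b2 = 0.5, 1.0 / 12.0                # mean and variance of Uniform(0, 1)
m = 100_000                            # independent replications of the sample mean
for n in [10, 100, 1000]:
    means = rng.uniform(0.0, 1.0, size=(m, n)).mean(axis=1)
    mse = np.mean((means - a) ** 2)    # estimate of E((sample mean - a)^2)
    print(n, mse, b2 / n)              # mse tracks b2 / n, as in Theorem 12.2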
thm:StrongLLN Theorem 12.3. [Strong LLN] Suppose (Y_n)_{n∈N} is a sequence of pairwise independent, identically distributed real-valued random variables such that E|Y_1| < ∞ and E(Y_1) = a. Then, (1/n ∑_{k=1}^n Y_k)_{n∈N} converges to a almost surely.
Proof. Without loss of generality, we assume Y_n ≥ 0 for n ∈ N. Define S_n := ∑_{k=1}^n Y_k, Z_n := Y_n 1_{{Y_n < n}} and T_n := ∑_{k=1}^n Z_k. For α > 1, we let ℓ_n := ⌈α^n⌉ be the smallest integer not less than α^n. For ε > 0, note that by Theorem 8.1,

∑_{n=1}^∞ P(|(1/ℓ_n)(T_{ℓ_n} − E(T_{ℓ_n}))| > ε) ≤ (1/ε^2) ∑_{n=1}^∞ Var(T_{ℓ_n})/ℓ_n^2 = (1/ε^2) ∑_{n=1}^∞ (1/ℓ_n^2) ∑_{k=1}^{ℓ_n} Var(Z_k).

Since the summands are non-negative, we may switch the order of summation to yield, for some constant C_ε > 0,

∑_{n=1}^∞ P(|(1/ℓ_n)(T_{ℓ_n} − E(T_{ℓ_n}))| > ε) ≤ C_ε ∑_{k=1}^∞ (1/k^2) Var(Z_k) ≤ C_ε ∑_{k=1}^∞ (1/k^2) E(Y_1^2 1_{{Y_1 < k}})
= C_ε ∑_{k=1}^∞ (1/k^2) ∑_{i=0}^{k−1} E(Y_1^2 1_{{Y_1 ∈ [i,i+1)}})
= C_ε ∑_{i=0}^∞ E(Y_1^2 1_{{Y_1 ∈ [i,i+1)}}) ∑_{k=i+1}^∞ 1/k^2
≤ 2 C_ε ∑_{i=0}^∞ (1/(i+1)) E(Y_1^2 1_{{Y_1 ∈ [i,i+1)}}) ≤ 2 C_ε E(Y_1) < ∞,

where we have used ∑_{k=i+1}^∞ 1/k^2 ≤ 2/(i+1) and Y_1^2 1_{{Y_1 ∈ [i,i+1)}} ≤ (i+1) Y_1 1_{{Y_1 ∈ [i,i+1)}}.
Applying this with ε = 1/k, it follows from the Borel–Cantelli lemma (Lemma 6.5) that, for each k ∈ N,

P(⋃_{m∈N} ⋂_{n≥m} {|(1/ℓ_n)(T_{ℓ_n} − E(T_{ℓ_n}))| ≤ 1/k}) = 1.
This together with Theorem 1.16 (e) implies that

P(⋂_{k∈N} ⋃_{m∈N} ⋂_{n≥m} {|(1/ℓ_n)(T_{ℓ_n} − E(T_{ℓ_n}))| ≤ 1/k}) = 1,
i.e., ((1/ℓ_n)(T_{ℓ_n} − E(T_{ℓ_n})))_{n∈N} converges to 0 almost surely. Note additionally that

E(Y_1) = lim_{n→∞} E(Y_1 1_{{Y_1 < n}}) = lim_{n→∞} E(Z_n) = lim_{n→∞} (1/ℓ_n) E(T_{ℓ_n}),

where we have used monotone convergence (cf. Theorem 10.1) in the first equality and the convergence of Cesàro averages in the last. It follows that (T_{ℓ_n}/ℓ_n)_{n∈N} converges to E(Y_1) almost surely. Moreover,
∑_{n=1}^∞ P(Y_n ≠ Z_n) = ∑_{n=1}^∞ P(Y_1 ≥ n) ≤ ∫_{R_+} P(Y_1 ≥ x) dx = ∫_{R_+} P(Y_1 > x) dx = E(Y_1) < ∞,

where we have used Theorem 3.6 and the fact that λ(Q) = 0 if Q ⊆ R is countable in the second-to-last equality, and Proposition 5.11 in the last equality. By the Borel–Cantelli lemma (Lemma 6.5) again, we have P(Y_n = Z_n for all n ≥ N, for some N ∈ N) = 1, and thus (T_n/n − S_n/n)_{n∈N} converges to 0 almost surely. It follows that (S_{ℓ_n}/ℓ_n)_{n∈N} converges to E(Y_1) almost surely. Finally, because Y_k ≥ 0,
for k ∈ [ℓ_n, ℓ_{n+1}] we have

(1/α)(S_{ℓ_n}/ℓ_n) ≤ (ℓ_n/k)(S_{ℓ_n}/ℓ_n) ≤ S_k/k ≤ (ℓ_{n+1}/k)(S_{ℓ_{n+1}}/ℓ_{n+1}) ≤ α (S_{ℓ_{n+1}}/ℓ_{n+1}).
Letting k → ∞, we obtain

(1/α) E(Y_1) ≤ lim inf_{k→∞} S_k/k ≤ lim sup_{k→∞} S_k/k ≤ α E(Y_1).

Since α > 1 is arbitrary, we conclude the proof.
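A simulation sketch of the strong LLN (our illustration with Exponential(1) samples, so E(Y_1) = 1): individual sample paths of the running mean settle near 1.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
for path in range(3):                  # three independent sample paths
    Y = rng.exponential(scale=1.0, size=n)
    running_mean = np.cumsum(Y) / np.arange(1, n + 1)
    print(running_mean[[99, 999, 9_999, n - 1]])   # drifting toward E(Y_1) = 1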
cor:WeakLLN Corollary 12.4. Under the conditions of Theorem 12.2 or Theorem 12.3, (1/n ∑_{k=1}^n Y_k)_{n∈N} converges to a in measure.

Proof. This is an immediate consequence of Theorems 12.2 and 11.1, or of Theorems 12.3 and 11.6.
Remark 12.5. An important consequence of the LLN is that the frequency with which an event A occurs in (pairwise, or mutually) independent trials converges to the probability of A as the number of trials tends to infinity. This justifies the frequentist interpretation of probability.
Proposition 12.6 (Glivenko–Cantelli). Let (Y_n)_{n∈N} be a sequence of pairwise independent and identically distributed real-valued random variables, and let F^Y denote the CDF of Y_1. Let F_n(r) := (1/n) ∑_{k=1}^n 1_{(−∞,r]}(Y_k) for r ∈ R, i.e., F_n is the empirical CDF of the first n samples. Then,

lim_{n→∞} sup_{r∈R} |F_n(r) − F^Y(r)| = 0, a.s..
Proof. To start with, for j, k ∈ N with j < k, let r_{j,k} := inf{r ∈ R : F^Y(r) ≥ j/k}; we also set r_{0,k} := −∞ and r_{k,k} := ∞. Thanks to Theorem 3.6 (a), for any j, k and r ∈ [r_{j−1,k}, r_{j,k}),

|F_n(r) − F^Y(r)| ≤ max{|F_n(r_{j−1,k}) − F^Y(r_{j,k})|, |F_n(r_{j,k}) − F^Y(r_{j−1,k})|}
≤ max{|F_n(r_{j−1,k}) − F^Y(r_{j−1,k})|, |F_n(r_{j,k}) − F^Y(r_{j,k})|} + 1/k
≤ max_{j∈{1,...,k−1}} |F_n(r_{j,k}) − F^Y(r_{j,k})| + 1/k. (12.1) eq:FnFY

Next, note that for any fixed r ∈ R, by the strong LLN (Theorem 12.3) we have

lim_{n→∞} F_n(r) = lim_{n→∞} (1/n) ∑_{k=1}^n 1_{(−∞,r]}(Y_k) = P_Y((−∞, r]) = F^Y(r), a.s.,

and thus, the maximum being over finitely many points,

lim_{n→∞} max_{j∈{1,...,k−1}} |F_n(r_{j,k}) − F^Y(r_{j,k})| = 0, a.s..

This together with (12.1) implies that

lim sup_{n→∞} sup_{r∈R} |F_n(r) − F^Y(r)| ≤ 1/k, a.s..

Since k ∈ N is arbitrary, we complete the proof.
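An empirical-CDF illustration of Glivenko–Cantelli (our sketch with assumed standard-normal samples; the supremum is evaluated on a fine grid for simplicity):

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
Phi = lambda r: 0.5 * (1.0 + erf(r / sqrt(2.0)))   # standard normal CDF
grid = np.linspace(-4.0, 4.0, 2_001)
true_cdf = np.array([Phi(r) for r in grid])

for n in [100, 1_000, 10_000, 100_000]:
    Y = rng.standard_normal(n)
    Fn = np.searchsorted(np.sort(Y), grid, side="right") / n  # F_n on the grid
    print(n, np.max(np.abs(Fn - true_cdf)))   # sup |F_n - F^Y|, shrinking with n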
13 Conditional Expectation
Definition 13.1. We consider the probability space (Ω,A ,P).
ˆ Let Z ∈ L1. Let A ∈ A satisfy P(A) > 0. The conditional expectation of Y given A is the
quantity
E(Z|A) = E(1AZ)
P(A)
.
ˆ Let Z : (Ω,A )→ (Z,Z ). The conditional distribution of Z given A is a probability measure
(why?) on (Z,Z ) satisfying
P(B|A) = E(1A1{Y ∈B})
P(A)
=
E(1A1B(Y ))
P (A)
.
ˆ Let Y be an R^d-valued random variable and Z be an R^n-valued random variable. Suppose (Y, Z) as an R^{d+n}-valued random variable has PDF f_{(Y,Z)}. The marginal PDF of Y in this case is defined as

f_Y(y) := ∫_{R^n} f_{(Y,Z)}(y, z) dz, y ∈ R^d.

The conditional PDF of Z given Y is, wherever f_Y(y) > 0,

f_{Z|Y}(z|y) := f_{(Y,Z)}(y, z)/f_Y(y) = f_{(Y,Z)}(y, z) / ∫_{R^n} f_{(Y,Z)}(y, z′) dz′, y ∈ R^d, z ∈ R^n.
Proposition 13.2. Let Z be a non-negative (or integrable) real-valued random variable and D : (Ω, A) → (N, 2^N) be a discrete random variable. Let H := {k ∈ N : P(D = k) > 0}. Then,

E(1_{{D∈B}} Z) = E(1_{{D∈B}} Z̃), B ⊆ N,

where

Z̃(ω) := ∑_{k∈H} 1_{{k}}(D(ω)) E(Z|{D = k}) = ∑_{k∈H} 1_{{k}}(D(ω)) E(1_{{D=k}} Z)/P(D = k).

If there is Z̃′(ω) = ∑_{k∈N} b_k 1_{{k}}(D(ω)) that also satisfies

E(1_{{D∈B}} Z) = E(1_{{D∈B}} Z̃′), B ⊆ N,

then Z̃ = Z̃′ (we recall that this means P(Z̃ = Z̃′) = 1).
Proof. Regarding the first statement, it is sufficient to prove it for B = {i} with i ∈ H. Once the statement is true for such B, we can use monotone convergence to extend it to any B ⊆ N. Note that

E(1_{{D=i}} Z̃) = P(D = i) · E(1_{{D=i}} Z)/P(D = i) = E(1_{{D=i}} Z).

This proves the first statement. Regarding the second statement, we again take B = {i} with i ∈ H. Then, by hypothesis, we obtain E(1_{{D=i}} Z) = E(1_{{D=i}}) b_i, i.e.,

b_k = E(1_{{D=k}} Z)/P(D = k), k ∈ H.

Finally, because P(D ∉ H) = 0 and P(Z̃ ≠ Z̃′) ≤ P(D ∉ H), the proof is complete.
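A Monte Carlo sketch of Proposition 13.2 (a hypothetical setup of ours: D is a fair die roll and Z = D + noise, so E(Z|{D = k}) ≈ k): the atom-wise averages define Z̃, and E(1_{{D∈B}} Z) ≈ E(1_{{D∈B}} Z̃) for any B.

import numpy as np

rng = np.random.default_rng(4)
m = 500_000
D = rng.integers(1, 7, size=m)          # uniform on {1, ..., 6}
Z = D + rng.normal(size=m)              # E(Z | D = k) = k

# Build Z~ by averaging Z over each atom {D = k}.
Ztilde = np.zeros(m)
for k in range(1, 7):
    atom = (D == k)
    Ztilde[atom] = Z[atom].mean()       # ~ E(Z | {D = k})

B = {2, 3, 5}
ind = np.isin(D, list(B)).astype(float)
print(np.mean(ind * Z), np.mean(ind * Ztilde))   # approximately equal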
Proposition 13.3. Let Y be an R^d-valued random variable and Z be an R^n-valued random variable. Suppose (Y, Z) as an R^{d+n}-valued random variable has PDF f_{(Y,Z)}. Fix a bounded h : (R^n, B(R^n)) → (R, B(R)). Then,

E(g(Y)h(Z)) = E(g(Y)S) for any bounded g : (R^d, B(R^d)) → (R, B(R)),

where

S(ω) := ∫_{R^n} h(z) f_{Z|Y}(z|Y(ω)) dz = ∫_{R^n} h(z) f_{(Y,Z)}(Y(ω), z) dz / ∫_{R^n} f_{(Y,Z)}(Y(ω), z) dz.

Suppose S′ : (Ω, A) → (R, B(R)) satisfies S′(ω) = h̃(Y(ω)) for some h̃ : (R^d, B(R^d)) → (R, B(R)), and

E(g(Y)h(Z)) = E(g(Y)S′) for any bounded g : (R^d, B(R^d)) → (R, B(R)),

then S = S′.
Proof. DIY.
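A numerical sketch of Proposition 13.3 (an assumed bivariate-normal example of ours; we take h(z) = z, which is unbounded but integrable in this Gaussian setting): integrating z against the conditional PDF gives S = ρY in closed form, and E(g(Y)h(Z)) ≈ E(g(Y)S) for a bounded test function g.

import numpy as np

rng = np.random.default_rng(5)
rho, m = 0.6, 1_000_000
Y = rng.standard_normal(m)
Z = rho * Y + np.sqrt(1 - rho**2) * rng.standard_normal(m)  # corr(Y, Z) = rho

g = lambda y: np.cos(y)                 # an arbitrary bounded test function
S = rho * Y                             # integral of z * f_{Z|Y}(z | Y) dz, known in closed form
print(np.mean(g(Y) * Z), np.mean(g(Y) * S))   # approximately equal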
The two propositions above motivate a more general definition of conditional expectation. For notational convenience, we define L(Ω, A, P) as the set of random variables Z such that E(Z⁺) or E(Z⁻) is finite. (The rest of this section depends heavily on σ-algebras. The related content is optional and will NOT appear in the exam.)
def:CondExpn Definition 13.4. Let G be a sub-σ-algebra of A and Z ∈ L(Ω, A, P). The conditional expectation of Z given G, denoted by E(Z|G), is a G-measurable random variable satisfying

E(1_B Z) = E(1_B E(Z|G)) for any B ∈ G.
The proof of existence for Definition 13.4 is beyond the scope of this course. Below is a sketch. We first assume Z ∈ L^2(Ω, A, P). Note L^2(Ω, A, P) is a Hilbert space and L^2(Ω, G, P) is a closed subspace. In this case, E(Z|G) is defined as the orthogonal projection of Z onto L^2(Ω, G, P). Then, we use the approximation Z ∧ n to extend the definition to Z satisfying E(Z⁺) < ∞ or E(Z⁻) < ∞. During this procedure, we also obtain the following technical lemma.
lem:CondExpnBasic Lemma 13.5. Let G be a sub-σ-algebra of A and Z, Z′ ∈ L(Ω, A, P). Then,

(a) if 0 ≤ Z ≤ Z′, then E(Z|G) ≤ E(Z′|G);

(b) Z ∈ L^2(Ω, A, P) implies E(Z|G) ∈ L^2(Ω, G, P), and Z ∈ L^1(Ω, A, P) implies E(Z|G) ∈ L^1(Ω, G, P).
In particular, Lemma 13.5 implies that E(Z|G ) ≥ 0 if Z ≥ 0.
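The projection viewpoint can be made concrete numerically. In the sketch below (our illustration; G is generated by a finite partition, encoded by labels D), the projection onto the G-measurable subspace is atom-wise averaging, and the residual Z − E(Z|G) is orthogonal to every G-measurable test variable.

import numpy as np

rng = np.random.default_rng(6)
m = 400_000
D = rng.integers(0, 4, size=m)          # partition labels: G = sigma(D)
Z = np.sin(D) + rng.normal(size=m)

proj = np.zeros(m)
for k in range(4):
    proj[D == k] = Z[D == k].mean()     # projection = average on each atom

W = np.exp(-D)                          # an arbitrary G-measurable variable
print(np.mean((Z - proj) * W))          # approximately 0 (orthogonality)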
The following lemma regards the uniqueness part of the definition.
lem:UniquenessCriteria Lemma 13.6. Let Z̃, Z̃′ ∈ L(Ω, G, P). Suppose

E(1_A Z̃) = E(1_A Z̃′), A ∈ G.

Then, Z̃ = Z̃′.

Proof. Let A_k := {ω ∈ Ω : Z̃(ω) − Z̃′(ω) > 1/k}. Note that A_k ∈ G. Then,

P(A_k) = k E((1/k) 1_{A_k}) ≤ k E((Z̃ − Z̃′) 1_{A_k}) = 0.

Similarly, {Z̃′ − Z̃ > 1/k} also has zero probability. Consequently, P(|Z̃′ − Z̃| > 1/k) = 0. Finally, in view of Theorem 1.16 (e), we have P(|Z̃ − Z̃′| > 0) = P(⋃_{k∈N} {|Z̃′ − Z̃| > 1/k}) = 0.
Caution: We emphasize that Z̃ = Z̃′ in Lemma 13.6 means P(Z̃ = Z̃′) = 1. Accordingly, Definition 13.4 specifies the random variable E(Z|G) only up to a null set, not for every ω ∈ Ω.
The following result is an immediate consequence of Theorem 2.7.
Proposition 13.7. Let Y : (Ω, A) → (Y, Y) and Z ∈ L^1(Ω, A, P). Then, there is a measurable h : (Y, Y) → (R, B(R)) such that E(Z|σ(Y)) = h(Y).
The next theorem regards the basic properties of conditional expectation.
thm:CondExpnBasic Theorem 13.8. Let Y, Z ∈ L(Ω, A, P) and let G, H be two σ-algebras such that H ⊆ G ⊆ A. The following is true:
(a) E(E(Z|G )|H ) = E(Z|H );
(b) if H = {∅,Ω}, then E(Z|H ) = E(Z);
(c) if Y, Z ∈ L1, then E(aY + bZ|G ) = aE(Y |G ) + bE(Z|G ) for a, b ∈ R.
Proof. (a) This is an immediate consequence of Lemma 13.6 and the observation that, for any A ∈ H,

E(1_A E(E(Z|G)|H)) = E(1_A E(Z|G)) = E(1_A Z) = E(1_A E(Z|H)).

(b) Note that a real-valued {∅, Ω}-measurable random variable is a constant function of ω ∈ Ω. The statement follows by taking A = Ω.

(c) The proof involves the procedure used for proving the existence in Definition 13.4. We refer to ...... for the detailed proof.
Conditional versions of the limit theorems also hold.

thm:CondLimitThm Theorem 13.9. Let Z ∈ L(Ω, A, P) and (Z_n)_{n∈N} ⊆ L(Ω, A, P). Let G be a sub-σ-algebra of A. The following is true:

(a) (Monotone Convergence) if Z_n ≥ 0 for n ∈ N and (Z_n)_{n∈N} increases to Z almost surely, then

lim_{n→∞} E(Z_n|G) = E(Z|G), a.s.;

(b) (Fatou’s Lemma) if Z_n ≥ 0 for n ∈ N, then

E(lim inf_{n→∞} Z_n|G) ≤ lim inf_{n→∞} E(Z_n|G), a.s.;

(c) (Dominated Convergence) if lim_{n→∞} Z_n = Z almost surely, and |Z_n| ≤ Y for n ∈ N for some Y ∈ L^1(Ω, A, P), then

lim_{n→∞} E(Z_n|G) = E(Z|G), a.s..
Proof. (a) In view of Lemma 13.5 (a), (E(Z_n|G))_{n∈N} is almost surely non-decreasing, so we may define Z̃ := lim_{n→∞} E(Z_n|G) almost surely (Z̃ may be ∞ with positive probability). Due to Remark 9.3, Z̃ is G-measurable. Then, by monotone convergence (Theorem 10.1),

E(Z̃ 1_A) = lim_{n→∞} E(E(Z_n|G) 1_A) = lim_{n→∞} E(Z_n 1_A) = E(Z 1_A), A ∈ G,

where we have used Definition 13.4 in the second equality. In view of Lemma 13.6, the proof is complete.

(b)&(c) The proofs are analogous to those of Fatou’s lemma (Theorem 10.2) and dominated convergence (Theorem 10.5).
Theorem 13.10. If Y ∈ L^∞(Ω, G, P) and Z ∈ L^1(Ω, A, P), then E(YZ|G) = Y E(Z|G).
Proof. We first consider simple Y = ∑_{k=1}^n a_k 1_{A_k} with A_k ∈ G for k = 1, . . . , n. Let A ∈ G. Then,

E(1_A E(YZ|G)) = E(1_A Y Z) = ∑_{k=1}^n a_k E(1_{A∩A_k} Z) = ∑_{k=1}^n a_k E(1_{A∩A_k} E(Z|G)) = E(1_A Y E(Z|G)).

This together with Lemma 13.6 proves the statement for simple Y. Now suppose Y ∈ L^∞(Ω, G, P) with |Y| ≤ M for some M > 0. In view of Theorem 2.6, we let (Y_n)_{n∈N} be a sequence of simple G-measurable random variables converging to Y almost surely with |Y_n| ≤ M for n ∈ N. Thus, (Y_n Z)_{n∈N} converges to YZ almost surely and |Y_n Z| ≤ M|Z| ∈ L^1. By conditional dominated convergence (Theorem 13.9 (c)), we have

lim_{n→∞} E(Y_n Z|G) = E(YZ|G), a.s..

Consequently, (Y_n E(Z|G))_{n∈N} also converges to Y E(Z|G) almost surely. Moreover, |Y_n E(Z|G)| ≤ M|E(Z|G)| ∈ L^1 due to Lemma 13.5 (b). The above together with dominated convergence (Theorem 10.5) and the proved statement for simple random variables implies

E(1_A E(YZ|G)) = lim_{n→∞} E(1_A E(Y_n Z|G)) = lim_{n→∞} E(1_A Y_n E(Z|G)) = E(1_A Y E(Z|G)).

Since A ∈ G is arbitrary, in view of Lemma 13.6, we conclude the proof.
Now let Z : (Ω, A) → (Z, Z). We proceed to investigate the function (ω, B) ↦ E(1_B(Z)|G)(ω) for (ω, B) ∈ Ω × Z. It is desirable to view such a function as a probability measure that depends on the randomness ω. However, because the random variable E(1_B(Z)|G) is specified only almost surely, and Z is uncountable in general, there is no guarantee that B ↦ E(1_B(Z)|G)(ω) satisfies countable additivity for almost every ω ∈ Ω. The following theorem resolves this issue when Z is a separable metric space.
thm:RegularCondDistn Theorem 13.11. Let Z be a separable metric space endowed with the Borel σ-algebra B(Z). Let Z : (Ω, A) → (Z, B(Z)) and let G be a sub-σ-algebra of A. Then, there is P : Ω × B(Z) → [0, 1] such that

(i) for each ω ∈ Ω, B ↦ P(ω, B) is a probability on (Z, B(Z));

(ii) for each B ∈ B(Z), ω ↦ P(ω, B) is G-measurable;

(iii) for each B ∈ B(Z), P(ω, B) = E(1_B(Z)|G)(ω) for almost every ω ∈ Ω.

Proof. The proof is out of scope. We refer to .........
The P introduced in Theorem 13.11 is called the regular conditional distribution of Z given G .
14 Weak Convergence of Probability
A Preliminaries
Let X be a non-empty space with no specific structure. We will review some algebra of sets. We let I be an index set and A, B, A_1, A_2, · · · ⊆ X.
Definition A.1. ˆ ⋃_{i∈I} A_i := {x ∈ X : x ∈ A_i for some i ∈ I};

ˆ ⋂_{i∈I} A_i := {x ∈ X : x ∈ A_i for all i ∈ I};

ˆ A^c := {x ∈ X : x ∉ A} and B \ A := B ∩ A^c.
thm:SetOp Theorem A.2. The following is true:

(a) if A ⊆ B, then A ∩ B = A and A ∪ B = B;

(b) ⋂_{i∈I} A_i ⊆ ⋃_{i∈I} A_i;

(c) if A_1, A_2, · · · ⊆ B, then ⋃_{i∈I} A_i ⊆ B;

(d) if A_1, A_2, · · · ⊇ B, then ⋂_{i∈I} A_i ⊇ B;

(e) A ∪ B = B ∪ A and A ∩ B = B ∩ A; this also extends to infinite unions/intersections;

(f) (⋃_{i∈I} A_i) ∩ B = ⋃_{i∈I}(A_i ∩ B);

(g) (⋂_{i∈I} A_i) ∪ B = ⋂_{i∈I}(A_i ∪ B);

(h) (⋃_{i∈I} A_i)^c = ⋂_{i∈I} A_i^c;

(i) (⋂_{i∈I} A_i)^c = ⋃_{i∈I} A_i^c.
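Parts (h)–(i) (De Morgan's laws) are easy to check on small finite sets; the following sketch (ours, with X = {0, ..., 9} standing in for the abstract space) verifies both.

X = set(range(10))
A = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]

union = set().union(*A)                 # A_1 ∪ A_2 ∪ A_3
inter = set.intersection(*A)            # A_1 ∩ A_2 ∩ A_3
comp = lambda S: X - S                  # complement within X

assert comp(union) == set.intersection(*[comp(Ai) for Ai in A])   # (h)
assert comp(inter) == set().union(*[comp(Ai) for Ai in A])        # (i)
print("De Morgan's laws verified on the example.")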
In what follows, we let r ∈ R and (r_n)_{n∈N} ⊆ R.

Definition A.3. ˆ We say (r_n)_{n∈N} converges to r if for any ε > 0, there is N ∈ N such that |r_n − r| < ε for any n ≥ N. We denote this by lim_{n→∞} r_n = r.

ˆ We also say (r_n)_{n∈N} converges to ∞ if for any M > 0, there is N ∈ N such that r_n ≥ M for n ≥ N. We denote this by lim_{n→∞} r_n = ∞.

ˆ We also say (r_n)_{n∈N} converges to −∞ if for any M > 0, there is N ∈ N such that r_n ≤ −M for n ≥ N. We denote this by lim_{n→∞} r_n = −∞.

thm:MonoSeq Theorem A.4. Suppose (r_n)_{n∈N} ⊆ R is non-decreasing (or non-increasing) and there is M > 0 such that |r_n| ≤ M for all n ∈ N. Then, there is r ∈ R such that lim_{n→∞} r_n = r.

Definition A.5. Let I be an index set and let (r_i)_{i∈I} ⊆ R. We define sup_{i∈I} r_i as a number r̄ ∈ R such that
such that
(i) r¯ ≥ ri for all i ∈ I;
(ii) for any r′ ∈ R such that r′ ≥ ri for all i ∈ I, we have r′ ≥ r¯.
Similarly, we define infi∈I ri as a number r ∈ R such that
(i) r ≤ ri for all i ∈ I;
(ii) for any r′ ∈ R such that r′ ≤ ri for all i ∈ I, we have r′ ≤ r.
37
We also set supi∈I ri := ∞ if (ri)i∈I ⊆ R is unbounded from above; infi∈I ri := −∞ if (ri)i∈I ⊆ R
is unbounded from below.
Theorem A.6. If there is M ∈ R such that M ≥ ri for all i ∈ I, then supi∈I ri exists uniquely. If
there is M ∈ R such that M ≤ ri for all i ∈ I, then infi∈I ri exists uniquely.
Sometimes, it is convenient to consider the extended real line R̄ := R ∪ {∞, −∞} with the following conventions: 0 × ∞ = 0, 0 × (−∞) = 0, a + ∞ = ∞ and a − ∞ = −∞ for a ∈ R, and a × (±∞) = ± sgn(a) · ∞ for a ∈ R \ {0}.
We define

lim sup_{n→∞} r_n := lim_{n→∞} sup_{k≥n} r_k and lim inf_{n→∞} r_n := lim_{n→∞} inf_{k≥n} r_k.

Note that (sup_{k≥n} r_k)_{n∈N} is non-increasing and (inf_{k≥n} r_k)_{n∈N} is non-decreasing. In view of Theorem A.4, lim sup_{n→∞} r_n and lim inf_{n→∞} r_n are well-defined and take finite values if there are N, M ∈ N such that |r_n| ≤ M for n ≥ N. It is clear that lim sup_{n→∞} r_n ≥ lim inf_{n→∞} r_n.
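For example (our illustration), if r_n = (−1)^n (1 + 1/n), then sup_{k≥n} r_k = 1 + 1/m with m the smallest even integer ≥ n, while inf_{k≥n} r_k = −(1 + 1/m′) with m′ the smallest odd integer ≥ n; hence lim sup_{n→∞} r_n = 1 and lim inf_{n→∞} r_n = −1, and the sequence does not converge, consistent with the next theorem.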
Theorem A.7. (rn)n∈N converges (to some r ∈ R) if and only if lim infn→∞ rn = lim supn→∞ rn
and they are finite.
Theorem A.8. (r_n)_{n∈N} converges (to some r ∈ R) if and only if every subsequence converges (to the same r).
Below we recall the definition of pointwise convergence of functions. We let X be an abstract space with no specific structure. We let f be a real-valued function on X and (f_n)_{n∈N} be a sequence of real-valued functions on X.
Definition A.9. We say (f_n)_{n∈N} converges (pointwise) to f if lim_{n→∞} f_n(x) = f(x) for every x ∈ X. We also define two functions lim sup_{n→∞} f_n and lim inf_{n→∞} f_n by

lim sup_{n→∞} f_n(x) := lim_{n→∞} sup_{k≥n} f_k(x) and lim inf_{n→∞} f_n(x) := lim_{n→∞} inf_{k≥n} f_k(x), x ∈ X,

respectively.
Metric space TBA