STAT3006: Statistical
Learning
First Half
University of Queensland
Semester 2
Hien Nguyen
h.nguyen7@uq.edu.au
August 10, 2023
1 / 113
Random variables – 24 July
We consider a sequence of random variables
$$X_1, X_2, \ldots, X_n, \ldots$$
where each $X_n \in \mathbb{X}$, for each $n \in \mathbb{N}$.
Here $\mathbb{X}$ is a space of objects, typically endowed with a topology.
We will often write this sequence in the notation
$$X_{\mathbb{N}} = (X_n)_{n\in\mathbb{N}}.$$
To denote the finite sequence
$$X_1, \ldots, X_n$$
we will write
$$\mathbf{X}_n = (X_i)_{i\in[n]},$$
where $[n] = \{1, 2, \ldots, n\}$, $n \in \mathbb{N}$.
We typically call $X_{\mathbb{N}}$ the data sequence and $\mathbf{X}_n$ a (finite) sample.
2 / 113
Random variables
To make the variables “random” we endow the sequence $X_{\mathbb{N}}$ with a
probability space
$$\left(\mathbb{X}^{\mathbb{N}}, \mathcal{X}^{\otimes\mathbb{N}}, \mathbb{P}\right),$$
where we use the notation $\mathbb{X}^{\mathbb{N}}$ to denote the countably infinite
product space
$$\mathbb{X} \times \mathbb{X} \times \mathbb{X} \times \cdots.$$
The set $\mathcal{X} \subset 2^{\mathbb{X}}$ is called a $\sigma$-algebra and consists of all the
subsets $A$ of $\mathbb{X}$ (events) for which probability statements of the form
$$X_n \in A$$
make sense, under the probability measure $\mathbb{P}$ that assigns
probability to each statement $X_{\mathbb{N}} \in B$, $B \in \mathcal{X}^{\otimes\mathbb{N}}$, which we write as
$$\mathbb{P}(X_{\mathbb{N}} \in B) = \mathbb{P}(B).$$
Here, $2^{\mathbb{X}}$ is the power set of $\mathbb{X}$: the set of all subsets of $\mathbb{X}$.
3 / 113
Parametric statistics
In parametric statistics, we consider $\mathbb{P}$ to be a function of
some parameter $\theta_0 \in \mathbb{T}$, which we typically write as $\mathbb{P}_{\theta_0}$.
We typically assume additionally that the data are independent
and identically distributed (IID) in the sense that
$$\mathbb{P}_{\theta_0} = P_{X,\theta_0}^{\otimes\mathbb{N}}.$$
That is, each $X_n$ is endowed with the same probability space as the
random variable $X$:
$$\left(\mathbb{X}, \mathcal{X}, P_{X,\theta_0}\right).$$
Thus, for each $B = A_1 \times A_2 \times \cdots \times A_n \times \cdots$, we write the probability
$\mathbb{P}_{\theta_0}(B) = \mathbb{P}_{\theta_0}(X_{\mathbb{N}} \in B)$ as
$$\prod_{n=1}^{\infty} P_{X,\theta_0}(A_n)
= P_{X,\theta_0}(X_1 \in A_1) \times P_{X,\theta_0}(X_2 \in A_2) \times \cdots \times P_{X,\theta_0}(X_n \in A_n) \times \cdots.$$
4 / 113
The normal model
When $X \in \mathbb{R}^d$ is normally distributed with parameter
$\theta = (\mu, \Sigma) \in \mathbb{T} = \mathbb{R}^d \times \mathrm{PD}_d$, we have the fact that
$$P_{X,\theta}(A) = \int_A \varphi_d(x; \mu, \Sigma)\,\mathrm{d}x.$$
Here $\varphi_d(\cdot\,; \mu, \Sigma)$ is the normal density function:
$$\varphi_d(x; \mu, \Sigma) = \frac{1}{\det(2\pi\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right\}.$$
$\mathrm{PD}_d$ is the set of $d \times d$ dimensional positive definite matrices: the
set of matrices $\mathbf{A} \in \mathbb{R}^{d\times d}$ such that for every $x \in \mathbb{R}^d \setminus \{0\}$:
$$x^\top \mathbf{A} x > 0.$$
5 / 113
Maximum likelihood estimation
Suppose that we know that the data XN is IID where each Xn is
normally distributed with unknown parameter θ0 = (µ0,Σ0).
We observe a sample Xn which we want to use to estimate θ0.
To do this, we compute the maximum likelihood estimator:
$$\hat{\theta}_n = \left(\hat{\mu}_n, \hat{\Sigma}_n\right) = \operatorname*{arg\,max}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} L_n(\mu, \Sigma),$$
where, in this case,
$$L_n(\mu, \Sigma) = \frac{1}{n}\sum_{i=1}^n \log \varphi_d(X_i; \mu, \Sigma).$$
Here, for a function $f: \mathbb{T} \to \mathbb{R}$, we write:
$$\operatorname*{arg\,max}_{\theta\in\mathbb{T}} f(\theta) = \left\{\hat{\theta} \in \mathbb{T} \,\middle|\, f(\hat{\theta}) = \max_{\theta\in\mathbb{T}} f(\theta)\right\}.$$
Note also that $\hat{\theta}_n$ can be written as a function of the sample $\mathbf{X}_n$, i.e.:
$$\hat{\theta}_n = \hat{\theta}(\mathbf{X}_n)$$
for some function $\hat{\theta}: \mathbb{X}^n \to \mathbb{T}$.
6 / 113
Maximum likelihood estimation
Here, the components of $\hat{\theta}_n$ are
$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i,$$
and
$$\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_n)(X_i - \hat{\mu}_n)^\top.$$
These are, respectively, the sample mean and the (biased) sample
covariance of $\mathbf{X}_n$.
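To make these formulas concrete, here is a minimal Python sketch (not part of the original slides) that simulates a normal sample and evaluates $\hat{\mu}_n$ and $\hat{\Sigma}_n$; the parameter values and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an IID normal sample with assumed "true" parameters.
mu0 = np.array([1.0, -2.0])
Sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu0, Sigma0, size=500)    # shape (n, d)

n = X.shape[0]
mu_hat = X.mean(axis=0)                               # sample mean
centred = X - mu_hat
Sigma_hat = centred.T @ centred / n                   # biased sample covariance (divide by n)

print(mu_hat, Sigma_hat)
```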
Exercise 1
Determine when these estimators may not exist, in the sense that either
µˆn /∈ Rd or Σˆn /∈ PDd .
7 / 113
Expectations
Let Z = z(XN) be a random variable taking input XN and
outputting some real number.
We say that Z has expectation
E(Z) =

XN
z(x1,x2, . . .)P(dx1,dx2, . . .)
if ∫
XN
|z(x1,x2, . . .)|P(dx1,dx2, . . .)< ∞.
If the probability measure P has a density function p : XN→ R≥0,
such that for any B ∈ X⊗N, we have
P(XN ∈ B) =

B
p(x1,x2, . . .)dx1dx2 . . . ,
then we can also write
E(Z) =

XN
z(x1,x2, . . .)p(x1,x2, . . .)dx1dx2 . . . .
8 / 113
Image measure space
Instead of referring to the probability measure on XN each time we
want to talk about Z, we can instead endow Z with its pushforward
or image measure space
(Z,Z,PZ) ,
where Z= R and we can identify Z with the Borel σ-algebra of R.
Here, the Borel σ -algebra of R contains all of the open sets and
closed sets of R.
We then define PZ as the image measure of Z in the sense that, for
any A ∈ Z, we can write
PZ (A) = P
(
Z−1 (A)
)
= P
({
XN∈XN : z(XN) ∈ A
})
.
Here, we say that Z−1 is the inverse image of Z, in the sense that,
for each A ∈ Z:
Z−1 (A) =
{
XN∈XN : z(XN) ∈ A
}
.
9 / 113
Kolmogorov’s strong law of large numbers
Let us suppose that we have a sequence $Z_{\mathbb{N}} = (Z_n)_{n\in\mathbb{N}}$ of IID
replicates of $Z$, in the sense that for every sequence of sets
$(A_n)_{n\in\mathbb{N}} \subset \mathcal{Z}$, we have
$$P_Z(Z_1 \in A_1, Z_2 \in A_2, \ldots) = P_Z(A_1)\, P_Z(A_2) \cdots.$$
Further, assume that $\mathbb{E}_Z|Z| < \infty$.
Then, Kolmogorov's strong law of large numbers states that
$$\frac{1}{n}\sum_{i=1}^n Z_i \xrightarrow[n\to\infty]{\text{a.s.}} \mathbb{E}(Z).$$
Here a.s. stands for almost surely, and is shorthand for the
statement that
$$\mathbb{P}\left(\left\{Z_{\mathbb{N}} \in \mathbb{Z}^{\mathbb{N}} : \frac{1}{n}\sum_{i=1}^n Z_i \xrightarrow[n\to\infty]{} \mathbb{E}(Z)\right\}\right) = 1,$$
or, alternatively,
$$\mathbb{P}\left(\left\{Z_{\mathbb{N}} \in \mathbb{Z}^{\mathbb{N}} : \frac{1}{n}\sum_{i=1}^n Z_i \nrightarrow \mathbb{E}(Z) \text{ as } n\to\infty\right\}\right) = 0.$$
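A quick numerical illustration of the SLLN (an assumed example, using an exponential distribution so that $\mathbb{E}(Z)$ is known) is sketched below; the running sample mean drifts towards $\mathbb{E}(Z) = 0.5$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# IID replicates of Z ~ Exponential with mean 0.5 (an example choice).
z = rng.exponential(scale=0.5, size=100_000)

# Running sample means along the sequence.
running_mean = np.cumsum(z) / np.arange(1, z.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])   # approaches 0.5 as n grows
```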
10 / 113
Consistency
Recall the estimator $\hat{\theta}_n$ consisting of
$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i \quad \text{and} \quad \hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu}_n)(X_i - \hat{\mu}_n)^\top.$$
If we write the $\mathbb{R}^d$ vector $X_i = (X_{i1}, \ldots, X_{id})$ and $\mu = (\mu_1, \ldots, \mu_d)$,
then we can also write $\hat{\mu}_n = (\hat{\mu}_{n1}, \ldots, \hat{\mu}_{nd})$, where, for each $j \in [d]$,
$$\hat{\mu}_{nj} = \frac{1}{n}\sum_{i=1}^n X_{ij}.$$
But since $X_{\mathbb{N}}$ is an IID sequence, $Z_{\mathbb{N}}$ is also IID, where $Z_n = X_{nj}$.
We also know that $X$ has mean $\mu_0 = (\mu_{01}, \ldots, \mu_{0d})$, where
$\mu_{0j} = \mathbb{E}_Z(X_{nj})$ for each $j \in [d]$, and thus by the strong law of large
numbers (SLLN), we have
$$\hat{\mu}_{nj} = \frac{1}{n}\sum_{i=1}^n X_{ij} \xrightarrow[n\to\infty]{\text{a.s.}} \mathbb{E}(X_{nj}) = \mu_{0j}.$$
Since this is true for each $j \in [d]$, we have $\hat{\mu}_n \xrightarrow[n\to\infty]{\text{a.s.}} \mu_0$.
11 / 113
Consistency
Exercise 2
Use the fact that, when $X_n$ is normally distributed with parameter
$\theta_0 = (\mu_0, \Sigma_0)$, we have
$$\Sigma_0 = \operatorname{cov}_X(X) = \begin{pmatrix}
\Sigma_{011} & \Sigma_{012} & \cdots & \Sigma_{01d} \\
\Sigma_{021} & \Sigma_{022} & \cdots & \Sigma_{02d} \\
\vdots & \vdots & \ddots & \vdots \\
\Sigma_{0d1} & \Sigma_{0d2} & \cdots & \Sigma_{0dd}
\end{pmatrix},$$
where $\Sigma_{0ij} = \mathbb{E}(X_{ni}X_{nj}) - \mathbb{E}(X_{ni})\,\mathbb{E}(X_{nj})$, for each $i, j \in [d]$, to show that
$$\hat{\Sigma}_n \xrightarrow[n\to\infty]{\text{a.s.}} \Sigma_0.$$
Here, you may use the fact that if you have a sequence $(A_n)_{n\in\mathbb{N}}$
such that $\mathbb{P}(A_n) = 1$ for each $n \in \mathbb{N}$, then $\mathbb{P}\left(\bigcap_{n\in\mathbb{N}} A_n\right) = 1$.
12 / 113
Consistency – 27 July
We have thus established that
$$\hat{\mu}_n \xrightarrow[n\to\infty]{\text{a.s.}} \mu_0 \quad \text{and} \quad \hat{\Sigma}_n \xrightarrow[n\to\infty]{\text{a.s.}} \Sigma_0,$$
and hence
$$\hat{\theta}_n \xrightarrow[n\to\infty]{\text{a.s.}} \theta_0.$$
When an estimator θˆn of a quantity θ0 converges almost surely to
that quantity, we say that it is a (strongly) consistent estimator.
13 / 113
Concentration inequalities
For an IID sequence $Z_{\mathbb{N}} \in \mathbb{Z}^{\mathbb{N}} \subset \mathbb{R}^{\mathbb{N}}$ with common expectation
$\mathbb{E}_Z(Z)$ and finite variance
$$\operatorname{var}_Z(Z) = \int_{\mathbb{Z}} z^2\, P_Z(\mathrm{d}z) - \left\{\int_{\mathbb{Z}} z\, P_Z(\mathrm{d}z)\right\}^2 < \infty,$$
Chebyshev's inequality states that, for each $\varepsilon > 0$,
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mathbb{E}_Z(Z)\right| \geq \varepsilon\right) \leq \frac{\operatorname{var}_Z(Z)}{n\varepsilon^2}.$$
We can also write this in its confidence form:
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mathbb{E}_Z(Z)\right| \leq \varepsilon\right) \geq 1 - \frac{\operatorname{var}_Z(Z)}{n\varepsilon^2}.$$
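As a hedged sanity check (the simulation choices below are assumptions, not from the slides), one can compare the empirical tail probability of the sample mean against the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(2)

n, eps, reps = 200, 0.2, 20_000
var_z = 1.0   # Z ~ N(0, 1), an assumed example distribution with E(Z) = 0

# Estimate P(|mean(Z_1..Z_n) - E(Z)| >= eps) over many replications.
sample_means = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(sample_means) >= eps)
chebyshev_bound = var_z / (n * eps**2)

print(empirical, chebyshev_bound)   # the empirical probability sits below the bound
```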
14 / 113
Concentration inequalities
Let Z ∈Z⊂ R≥0 be a random variable, with finite expectation
EZ (Z)< ∞.
Markov’s inequality states that for any ε > 0,
PZ (Z ≥ ε)≤ ε−1EZ (Z) .
Exercise 3
First prove that for an independent sample $\mathbf{Z}_n \in \mathbb{Z}^n \subset \mathbb{R}_{>0}^n$,
$$\operatorname{var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right)
= \int_{\mathbb{Z}^n}\left(\frac{1}{n}\sum_{i=1}^n z_i\right)^2 P_{Z_1}(\mathrm{d}z_1)\cdots P_{Z_n}(\mathrm{d}z_n)
- \left\{\int_{\mathbb{Z}^n}\frac{1}{n}\sum_{i=1}^n z_i\, P_{Z_1}(\mathrm{d}z_1)\cdots P_{Z_n}(\mathrm{d}z_n)\right\}^2
= \frac{1}{n^2}\sum_{i=1}^n \operatorname{var}_{Z_i}(Z_i).$$
Using this fact, show that Chebyshev's inequality is a consequence
of Markov's inequality.
15 / 113
Concentration inequalities
Using Chebyshev's inequality, for the normal distribution mean
estimator $\hat{\mu}_n = (\hat{\mu}_{n1}, \ldots, \hat{\mu}_{nd})$, we have, for each $j \in [d]$ and $\varepsilon > 0$:
$$\mathbb{P}\left(\left|\hat{\mu}_{nj} - \mu_{0j}\right| \leq \varepsilon\right) \geq 1 - \frac{\Sigma_{0jj}}{n\varepsilon^2}.$$
The famous Bonferroni inequality states that for a sequence of
events $(B_j)_{j\in[d]}$,
$$\mathbb{P}\left(\bigcap_{j=1}^d B_j\right) \geq 1 - \sum_{j=1}^d \left\{1 - \mathbb{P}(B_j)\right\}.$$
Exercise 4
Use Bonferroni's inequality to show that
$$\mathbb{P}\left(\mathbf{X}_n \in \mathbb{X}^n : \bigcap_{j=1}^d \left\{\left|\hat{\mu}_{nj} - \mu_{0j}\right| \leq \varepsilon\right\}\right) \geq 1 - \frac{\sum_{j=1}^d \Sigma_{0jj}}{n\varepsilon^2}.$$
16 / 113
Confidence statements
In statistical inference, a confidence interval (CI) for an estimator
$\hat{\vartheta}_n$ of a parameter $\vartheta_0 \in \mathbb{R}$ is typically presented with respect to a
prescribed confidence level $100(1-\alpha)\%$, so that
$$\mathbb{P}\left(\left|\hat{\vartheta}_n - \vartheta_0\right| \leq \operatorname{margin}_\alpha(\mathbf{X}_n)\right) \geq 1 - \alpha,$$
where $\operatorname{margin}_\alpha(\mathbf{X}_n)$ is a margin of error that depends on both the
sample $\mathbf{X}_n$ and the significance level $\alpha \in (0,1)$.
To write
$$\mathbb{P}\left(\left|\hat{\mu}_{nj} - \mu_{0j}\right| \leq \varepsilon\right) \geq 1 - \frac{\Sigma_{0jj}}{n\varepsilon^2}$$
in the standard CI form, we simply set
$$\alpha = \frac{\Sigma_{0jj}}{n\varepsilon^2}, \quad \text{and thus} \quad \varepsilon = \sqrt{\frac{\Sigma_{0jj}}{\alpha n}} = \operatorname{margin}_\alpha(\mathbf{X}_n).$$
Thus, we have
$$\mathbb{P}\left(\left|\hat{\mu}_{nj} - \mu_{0j}\right| \leq \sqrt{\frac{\Sigma_{0jj}}{\alpha n}}\right) \geq 1 - \alpha.$$
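For instance, under assumed example values of $\Sigma_{0jj}$, $n$, and $\alpha$ (not taken from the slides), the margin of error can be computed directly from the formula above:

```python
import math

# Assumed example values: coordinate variance, sample size, and confidence level.
sigma0_jj = 2.5      # Sigma_{0jj}
n = 400
alpha = 0.05         # 95% confidence

margin = math.sqrt(sigma0_jj / (alpha * n))
print(margin)        # P(|mu_hat_nj - mu_0j| <= margin) >= 0.95 by Chebyshev
```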
17 / 113
But what if we’re wrong?
Recall that thus far, we have assumed that the data are normal and
IID, in the sense that for $B = A_1 \times A_2 \times \cdots \times A_n \times \cdots$,
$$\mathbb{P}(X_{\mathbb{N}} \in B) = \prod_{n=1}^{\infty} P_{X,\theta_0}(A_n),$$
where $P_{X,\theta_0}(A) = \int_A \varphi_d(x; \mu_0, \Sigma_0)\,\mathrm{d}x$, for $\theta_0 = (\mu_0, \Sigma_0)$.
Suppose that $X_{\mathbb{N}}$ is in fact IID but not normal, in the sense that
$$\mathbb{P}(X_{\mathbb{N}} \in B) = \prod_{n=1}^{\infty} P_X(A_n),$$
where $P_X$ is unknown and has unknown density function
$p_X: \mathbb{X} \to \mathbb{R}_{\geq 0}$, such that $P_X(A) = \int_A p_X(x)\,\mathrm{d}x$.
What meaning then can we give to our MLE
$$\hat{\theta}_n = \left(\hat{\mu}_n, \hat{\Sigma}_n\right) = \operatorname*{arg\,max}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} L_n(\mu, \Sigma)?$$
And what can we say about the consistent limit $\theta_0 = (\mu_0, \Sigma_0)$ of $\hat{\theta}_n$,
if it exists?
18 / 113
The Kullback–Leibler divergence
Let $X \in \mathbb{X}$ have density function $p_X$, and suppose that we have an
alternative probability density function $q_X$ such that $(\mathbb{X}, \mathcal{X}, Q)$ is a
valid probability space, where $Q(X \in A) = \int_A q_X(x)\,\mathrm{d}x$, for each
$A \in \mathcal{X}$.
Then, we can define the Kullback–Leibler (KL) divergence
between $p_X$ and $q_X$ as
$$\operatorname{KL}(p_X \| q_X) = \int_{\mathbb{X}} \log\left(\frac{p_X(x)}{q_X(x)}\right) p_X(x)\,\mathrm{d}x
= \int_{\mathbb{X}} \log p_X(x)\, p_X(x)\,\mathrm{d}x - \int_{\mathbb{X}} \log q_X(x)\, p_X(x)\,\mathrm{d}x
= \mathbb{E}_X \log p_X(X) - \mathbb{E}_X \log q_X(X).$$
The KL divergence has the desirable properties that:
1. KL(pX∥qX ) = 0 if and only if pX = qX , almost surely, with respect to
PX .
2. KL(pX∥qX )≥ 0 for all valid densities pX ,qX : X→ R>0.
19 / 113
Minimum KL divergence
Let’s suppose that qX is now replaced by our potentially incorrect
hypothesis that PX is normal; i.e., set qX to the normal density
function φd (·;µ,Σ).
Then, the KL divergence between the true density pX and our
hypothesis is
KL(pX∥φd (·;µ,Σ)) = EX logpX (X)−EX logφd (X ;µ,Σ) .
Then, we might want to find parameter θ0 = (µ0,Σ0) ∈ Rd×PDd
that makes the KL divergence the smallest; i.e., deduce the normal
distribution that best approximates PX .
The first term of the KL divergence is a constant and thus we don’t
have to worry about it.
The second term we want to make as small as possible, since we
want to make the KL divergence as close to 0 as possible, so we
want to find
$$\theta_0 = \operatorname*{arg\,min}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} -\mathbb{E}_X \log \varphi_d(X; \mu, \Sigma).$$
20 / 113
Minimum KL divergence
Notice firstly that for any function $f: \mathbb{T} \to \mathbb{R}$,
$$\operatorname*{arg\,min}_{\theta\in\mathbb{T}} -f(\theta) = \operatorname*{arg\,max}_{\theta\in\mathbb{T}} f(\theta),$$
and so we can solve the equivalent problem:
$$\vartheta_0 = \operatorname*{arg\,max}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} \mathbb{E}_X \log \varphi_d(X; \mu, \Sigma).$$
Next, we observe that we don’t know PX and thus can’t compute
EX ! However, for fixed θ ∈ Rd×PDd , the fact that Xn is IID implies
that the sequence
(logφd (Xn;µ,Σ))n∈N is IID.
So, the SLLN suggests that for each sample $\mathbf{X}_n$, we can estimate
$\mathbb{E}_X \log \varphi_d(X; \mu, \Sigma)$ by
$$\frac{1}{n}\sum_{i=1}^n \log \varphi_d(X_i; \mu, \Sigma) = L_n(\mu, \Sigma).$$
21 / 113
Minimum KL divergence
Thus, the MLE
$$\hat{\theta}_n = \operatorname*{arg\,max}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} L_n(\mu, \Sigma)$$
is equivalent to minimizing the sample estimate of the KL divergence
between the normal model and the unknown probability measure $P_X$.
We have established that
$$\hat{\theta}_n \xrightarrow[n\to\infty]{\text{a.s.}} \theta_0 = (\mu_0, \Sigma_0) = (\mathbb{E}_X(X), \operatorname{cov}_X(X)).$$
But, is it true that
$$\theta_0 = \vartheta_0 = \operatorname*{arg\,max}_{\theta = (\mu,\Sigma) \in \mathbb{R}^d \times \mathrm{PD}_d} \mathbb{E}_X \log \varphi_d(X; \mu, \Sigma)?$$
If true, does
$$\frac{1}{n}\sum_{i=1}^n \log \varphi_d\left(X_i; \hat{\mu}_n, \hat{\Sigma}_n\right) \xrightarrow[n\to\infty]{\text{a.s.}} \mathbb{E}_X \log \varphi_d(X; \mu_0, \Sigma_0)?$$
22 / 113
Statistical learning
Statistical learning avoids making parametric assumptions
regarding the underlying probability measure $\mathbb{P}$ on the data $X_{\mathbb{N}}$.
As such, statistical learning is typically not interested in estimating
$\mathbb{P}$, but instead in drawing inference about some characteristic $C(\mathbb{P})$
of $\mathbb{P}$.
These characteristics are non-parametric in the sense that they do not
depend on some assumed form of $\mathbb{P}$; e.g., for an IID sequence $X_{\mathbb{N}}$, where
$\mathbb{X} = \mathbb{R}$, we may wish to draw inference about basic summaries of $\mathbb{P}$:
$$C(\mathbb{P}) = \mathbb{E}_X(X); \quad C(\mathbb{P}) = \operatorname{var}_X(X); \quad C(\mathbb{P}) = \operatorname{median}_X(X).$$
Alternatively, we can ask about approximations of $\mathbb{P}$:
$$C(\mathbb{P}) = \text{“closest normal distribution to } \mathbb{P} \text{ in KL divergence”}.$$
For $X = (W, Y) \in \mathbb{X} = \mathbb{R}^2$, we may ask about
$$C(\mathbb{P}) = \text{“best linear model of the form } Y = \beta_0 + \beta_1 W\text{”}.$$
When $X_{\mathbb{N}}$ is a sequence of correlated data with each $X_n \in \mathbb{R}$, we may
ask about
$$C(\mathbb{P}) = \text{“best linear model of the form } X_n = \beta_0 + \beta_1 X_{n-1}\text{”}.$$
23 / 113
Statistical learning
Typically, statistical learning problems are phrased as optimization
problems, where $C(\mathbb{P}) \in \mathbb{T}$, and
$$C(\mathbb{P}) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta),$$
where $r_{\mathbb{P}}: \mathbb{T} \to \mathbb{R}$ is typically referred to as the (population) risk,
which is estimated by a sequence of risk estimates, or empirical
risks:
$$(r_n(\cdot\,; \mathbf{X}_n))_{n\in\mathbb{N}}.$$
To draw inference regarding $C(\mathbb{P})$, we compute the empirical risk
minimizer (ERM) sequence $\left(\hat{\theta}_n\right)_{n\in\mathbb{N}}$, defined by:
$$\hat{\theta}_n = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n).$$
We then seek conditions under which the sequence $\left(\hat{\theta}_n\right)_{n\in\mathbb{N}}$
converges to $C(\mathbb{P})$, either almost surely, or in some other mode.
24 / 113
A simple example
Let XN be an IID sequence where each Xn has the same probability
measure as X ∈ R: PX .
Let
$$C(\mathbb{P}) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta),$$
where $\mathbb{T} = \mathbb{R}$ and
$$r_{\mathbb{P}}(\theta) = \mathbb{E}_X\left(|X - \theta|^2\right).$$
We wish to estimate $C(\mathbb{P})$ via the ERM:
$$\hat{\theta}_n = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
where
$$r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n |X_i - \theta|^2.$$
25 / 113
A simple example
Observe that we can write
$$\mathbb{E}_X\left(|X - \theta|^2\right) = \mathbb{E}_X\left\{X^2 - 2\theta X + \theta^2\right\} = \mathbb{E}_X\left(X^2\right) - 2\theta\,\mathbb{E}_X(X) + \theta^2.$$
The first-order condition from calculus states that we can compute
$$C(\mathbb{P}) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} \mathbb{E}_X\left(|X - \theta|^2\right)$$
by solving
$$\frac{\mathrm{d}}{\mathrm{d}\theta}\left\{\mathbb{E}_X\left(|X - \theta|^2\right)\right\} = 0 \implies -2\,\mathbb{E}_X(X) + 2\theta = 0 \implies \theta = \mathbb{E}_X(X),$$
so $C(\mathbb{P}) = \mathbb{E}_X(X)$ is the mean of $P_X$.
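Before attempting Exercise 5 analytically, a small numerical check (with an assumed simulated sample) suggests the same conclusion: the empirical squared-error risk is minimized at the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1_000)    # an assumed example sample

# Empirical risk r_n(theta) = mean of |X_i - theta|^2.
def empirical_risk(theta):
    return np.mean((x - theta) ** 2)

theta_hat = minimize_scalar(empirical_risk).x
print(theta_hat, x.mean())   # the two values agree (up to solver tolerance)
```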
Exercise 5
Show that $\hat{\theta}_n = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n)$ is the sample mean.
26 / 113
Another simple example
Let XN be an IID sequence where each Xn has the same probability
measure as X ∈ R: PX .
Exercise 6
What is the characteristic defined by
$$C(\mathbb{P}) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta),$$
where $r_{\mathbb{P}}(\theta) = \mathbb{E}_X(|X - \theta|)$? What is its ERM:
$$\hat{\theta}_n = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
where
$$r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n |X_i - \theta|\,?$$
27 / 113
Ordinary least squares
Let XN be an IID sequence where each Xn = (Wn,Yn) ∈ R2 has the
same probability measure as X = (W,Y ) ∈ R2: PX .
Here, we also give W and Y their individual pushforward (marginal)
measures PW and PY .
Let
$$C(\mathbb{P}) = (\alpha_0, \beta_0) = \operatorname*{arg\,min}_{\theta = (\alpha,\beta)\in\mathbb{T}} r_{\mathbb{P}}(\theta),$$
where $\mathbb{T} = \mathbb{R}^2$ and
$$r_{\mathbb{P}}(\theta) = \mathbb{E}_X\left(|Y - (\alpha + \beta W)|^2\right).$$
We wish to estimate $C(\mathbb{P})$ via the ERM:
$$\hat{\theta}_n = \left(\hat{\alpha}_n, \hat{\beta}_n\right) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
where
$$r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n |Y_i - (\alpha + \beta W_i)|^2.$$
28 / 113
Ordinary least squares
Observe that
$$\mathbb{E}_X\left(|Y - (\alpha + \beta W)|^2\right)
= \mathbb{E}_Y\left(Y^2\right) - 2\alpha\,\mathbb{E}_Y(Y) - 2\beta\,\mathbb{E}_X(WY) + \alpha^2 + 2\alpha\beta\,\mathbb{E}_W(W) + \beta^2\,\mathbb{E}_W\left(W^2\right).$$
We observe that the gradient of $r_{\mathbb{P}}$ is thus
$$\nabla r_{\mathbb{P}}(\theta) = \begin{pmatrix} \nabla_\alpha r_{\mathbb{P}}(\theta) \\ \nabla_\beta r_{\mathbb{P}}(\theta) \end{pmatrix}
= \begin{pmatrix} 2\beta\,\mathbb{E}_W(W) - 2\,\mathbb{E}_Y(Y) + 2\alpha \\ 2\beta\,\mathbb{E}_W\left(W^2\right) - 2\,\mathbb{E}_X(WY) + 2\alpha\,\mathbb{E}_W(W) \end{pmatrix}.$$
Solving the first-order condition (FOC) $\nabla r_{\mathbb{P}}(\theta) = 0$ yields
$$\beta_0 = \frac{\mathbb{E}_X(WY) - \mathbb{E}_Y(Y)\,\mathbb{E}_W(W)}{\mathbb{E}_W\left(W^2\right) - \left[\mathbb{E}_W(W)\right]^2}, \qquad
\alpha_0 = \mathbb{E}_Y(Y) - \beta_0\,\mathbb{E}_W(W).$$
29 / 113
Ordinary least squares
Similarly, we write
$$r_n(\theta; \mathbf{X}_n)
= \frac{1}{n}\sum_{i=1}^n Y_i^2 - \frac{2\alpha}{n}\sum_{i=1}^n Y_i - \frac{2\beta}{n}\sum_{i=1}^n W_i Y_i + \alpha^2 + \frac{2\alpha\beta}{n}\sum_{i=1}^n W_i + \frac{\beta^2}{n}\sum_{i=1}^n W_i^2.$$
Again, solving the FOC, $\nabla r_n(\theta; \mathbf{X}_n) = 0$, yields
$$\hat{\beta}_n = \left\{\frac{1}{n}\sum_{i=1}^n W_i Y_i - \frac{1}{n^2}\sum_{i=1}^n Y_i \sum_{i=1}^n W_i\right\} \Big/ \left\{\frac{1}{n}\sum_{i=1}^n W_i^2 - \frac{1}{n^2}\left[\sum_{i=1}^n W_i\right]^2\right\},$$
$$\hat{\alpha}_n = \frac{1}{n}\sum_{i=1}^n Y_i - \hat{\beta}_n \frac{1}{n}\sum_{i=1}^n W_i.$$
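As a rough illustration (the simulated data and parameter values below are assumptions), the closed-form estimators above can be evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated (W, Y) pairs; the "true" intercept and slope are example choices.
w = rng.normal(0.0, 1.0, size=1_000)
y = 1.5 + 2.0 * w + rng.normal(0.0, 0.5, size=1_000)

beta_hat = (np.mean(w * y) - np.mean(y) * np.mean(w)) / (np.mean(w**2) - np.mean(w) ** 2)
alpha_hat = np.mean(y) - beta_hat * np.mean(w)

print(alpha_hat, beta_hat)   # close to the example values 1.5 and 2.0
```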
Exercise 7
Given that all necessary expectations exist, show that
$$\hat{\theta}_n = \left(\hat{\alpha}_n, \hat{\beta}_n\right) \xrightarrow[n\to\infty]{\text{a.s.}} (\alpha_0, \beta_0).$$
30 / 113
Logistic regression
Let XN be an IID sequence where each Xn = (Wn,Yn) ∈ R×{0,1}
has the same probability measure as X = (W,Y ) ∈ R×{0,1}: PX ,
with marginal measures PW and PY .
Conditioned on having observed W = w, we model the probability of
Y : PY |W=w, by the approximation:
P˜W=w (Y = 1;θ) = exp(α+βw)/{1+ exp(α+βw)} ,
for some pair of parameters θ = (α,β ) ∈ T= R2.
The logistic regression problem is then to solve the ERM:
$$\hat{\theta}_n = \left(\hat{\alpha}_n, \hat{\beta}_n\right) = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
$$r_n(\theta; \mathbf{X}_n)
= -\frac{1}{n}\sum_{i=1}^n Y_i \log \tilde{P}_{W=W_i}(Y = 1; \theta) - \frac{1}{n}\sum_{i=1}^n (1 - Y_i) \log \tilde{P}_{W=W_i}(Y = 0; \theta)
= \frac{1}{n}\sum_{i=1}^n \log\left\{1 + \exp(\alpha + \beta W_i)\right\} - \frac{1}{n}\sum_{i=1}^n Y_i\left\{\alpha + \beta W_i\right\}.$$
31 / 113
Logistic regression
We observe that the gradient $\nabla r_n(\theta; \mathbf{X}_n)$ consists of
$$\nabla_\alpha r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n \tilde{P}_{W=W_i}(Y = 1; \theta) - \frac{1}{n}\sum_{i=1}^n Y_i,$$
$$\nabla_\beta r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n W_i\, \tilde{P}_{W=W_i}(Y = 1; \theta) - \frac{1}{n}\sum_{i=1}^n Y_i W_i.$$
The FOC: ∇rn (θ ;Xn) = 0 is not solvable in closed-form.
However, we observe that the Hessian can be written as
$$\mathbf{H} r_n(\theta; \mathbf{X}_n) = \begin{bmatrix} \partial^2 r_n/\partial\alpha^2 & \partial^2 r_n/\partial\alpha\partial\beta \\ \partial^2 r_n/\partial\beta\partial\alpha & \partial^2 r_n/\partial\beta^2 \end{bmatrix}
= \frac{1}{n}\sum_{i=1}^n \begin{bmatrix} 1 & W_i \\ W_i & W_i^2 \end{bmatrix} \tilde{P}_{W=W_i}(Y = 1; \theta)\, \tilde{P}_{W=W_i}(Y = 0; \theta).$$
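A minimal Python sketch of the empirical risk and its gradient, matching the displays above, might read as follows; the function and variable names are illustrative assumptions.

```python
import numpy as np

def logistic_risk_and_grad(theta, w, y):
    """Empirical risk r_n(theta) and its gradient for simple logistic regression.

    theta = (alpha, beta); w, y are 1-D arrays with y in {0, 1}.
    (Names and data layout are illustrative, not from the slides.)
    """
    alpha, beta = theta
    eta = alpha + beta * w
    p1 = 1.0 / (1.0 + np.exp(-eta))            # P~_{W=W_i}(Y = 1; theta)
    risk = np.mean(np.log1p(np.exp(eta)) - y * eta)
    grad = np.array([np.mean(p1 - y), np.mean(w * (p1 - y))])
    return risk, grad
```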
Exercise 8
Considering that $\tilde{P}_{W=W_i}(Y = y; \theta) > 0$, show that $\mathbf{H} r_n(\theta; \mathbf{X}_n)$ is positive
definite as long as there exist distinct indices $i, j \in [n]$ such that
$W_i \neq W_j$.
32 / 113
MM algorithms
Let $f: \mathbb{T} \to \mathbb{R}$ be an objective function of interest, which you
want to minimize but cannot minimize directly.
We call $m: \mathbb{T} \times \mathbb{T} \to \mathbb{R}$ a majorizer of $f$ on $\mathbb{T}$ if:
1. $m(\theta, \theta) = f(\theta)$, for every $\theta \in \mathbb{T}$.
2. $m(\theta, \tau) \geq f(\theta)$, for every $\theta, \tau \in \mathbb{T}$.
Typically we want to choose a majorizer for $f$ for which, at each
$\tau \in \mathbb{T}$, minimizing $m(\cdot, \tau)$ is easier than minimizing $f$.
Starting from some $\theta^{(0)} \in \mathbb{T}$, we define the
majorization–minimization (MM) sequence $\left(\theta^{(s)}\right)_{s\in\mathbb{N}}$ via the
MM algorithm:
$$\theta^{(s)} \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} m\left(\theta, \theta^{(s-1)}\right).$$
Exercise 9
Show that the sequence of objective values $\left(f\left(\theta^{(s)}\right)\right)_{s\in\mathbb{N}}$ is
decreasing, in the sense that for each $s \in \mathbb{N}$, $f\left(\theta^{(s)}\right) \leq f\left(\theta^{(s-1)}\right)$.
33 / 113
Convexity
We say that a subset T of some vector space is convex if for each
θ ,τ ∈ T, and for every λ ∈ (0,1):
λθ +(1−λ )τ ∈ T.
For a function on the convex set T, f : T→ R, we say that f is
convex if for each θ ,τ ∈ T and every λ ∈ (0,1):
f (λθ +(1−λ )τ)≤ λ f (θ)+(1−λ ) f (τ) .
We further say that f is strictly convex if for each distinct pairs
θ ,τ ∈ T and every λ ∈ (0,1):
f (λθ +(1−λ )τ)< λ f (θ)+(1−λ ) f (τ) .
We similarly say that $f$ is concave or strictly concave if the
inequalities above are reversed (i.e., if $-f$ is convex or strictly
convex, respectively).
34 / 113
Convexity
If f is a twice differentiable function on an open and convex set
T⊂Rd , then f is convex if and only if its Hessian H f (θ) is positive
semi-definite in the sense that
τ⊤H f (θ)τ ≥ 0,
for every τ ∈ Rd and every θ ∈ T.
Further, $f$ is strictly convex if $\mathbf{H} f(\theta)$ is
positive definite:
$$\tau^\top \mathbf{H} f(\theta)\, \tau > 0,$$
for every $\tau \in \mathbb{R}^d \setminus \{0\}$ and every $\theta \in \mathbb{T}$.
Exercise 10
Show that if $f$ is convex [strictly convex] and differentiable on an open
set $\mathbb{T} \subset \mathbb{R}^d$, then, respectively,
$$\theta_0 \in \{\theta \in \mathbb{T} : \nabla f(\theta) = 0\} \iff \theta_0 \underset{[=]}{\in} \operatorname*{arg\,min}_{\theta\in\mathbb{T}} f(\theta).$$
35 / 113
The quadratic upper bound
For a twice-differentiable convex function $f$ on $\mathbb{T} \subset \mathbb{R}^d$, the integral
form of the multivariate Taylor theorem states that for any $\tau \in \mathbb{T}$:
$$f(\theta) = f(\tau) + \{\nabla f(\tau)\}^\top (\theta - \tau)
+ \int_0^1 (\theta - \tau)^\top \mathbf{H} f(\tau + t(\theta - \tau))\,(\theta - \tau)\,(1 - t)\,\mathrm{d}t.$$
Suppose that we have some matrix $\mathbf{B} \in \mathbb{R}^{d\times d}$ such that
$\mathbf{B} - \mathbf{H} f(\theta) \in \mathrm{PD}_d$ for each $\theta \in \mathbb{T}$; we will write, for each
$\mathbf{A}, \mathbf{B} \in \mathbb{R}^{d\times d}$: $\mathbf{A} \geq \mathbf{B} \iff \mathbf{A} - \mathbf{B}$ is positive semi-definite.
Then of course, for each fixed $\tau \in \mathbb{T}$:
$$\int_0^1 (\theta - \tau)^\top \mathbf{H} f(\tau + t(\theta - \tau))\,(\theta - \tau)\,(1 - t)\,\mathrm{d}t
\leq \int_0^1 (\theta - \tau)^\top \mathbf{B}\,(\theta - \tau)\,(1 - t)\,\mathrm{d}t
= \int_0^1 (1 - t)\,\mathrm{d}t\;(\theta - \tau)^\top \mathbf{B}\,(\theta - \tau)
= \frac{1}{2}(\theta - \tau)^\top \mathbf{B}\,(\theta - \tau).$$
36 / 113
The quadratic upper bound
Let us write
$$m(\theta, \tau) = f(\tau) + \{\nabla f(\tau)\}^\top (\theta - \tau) + \frac{1}{2}(\theta - \tau)^\top \mathbf{B}\,(\theta - \tau).$$
Exercise 11
Verify that $m$ is a majorizer of $f$, given that $f$ is convex and
twice-differentiable on $\mathbb{T}$, and that $\mathbf{B} \geq \mathbf{H} f(\theta)$, for each $\theta \in \mathbb{T}$.
Next, we note that if we have some constant $L \geq E$, for every
$E \in \operatorname{eigenvalues}(\mathbf{B})$, then we further have the fact that $L\mathbf{I} \geq \mathbf{B}$, and so
$$m(\theta, \tau) = f(\tau) + \{\nabla f(\tau)\}^\top (\theta - \tau) + \frac{L}{2}\|\theta - \tau\|_2^2$$
is a majorizer of $f$.
37 / 113
Gradient descent
Let us now consider application of the MM algorithm based on the
majorizer:
$$m(\theta, \tau) = f(\tau) + \{\nabla f(\tau)\}^\top (\theta - \tau) + \frac{L}{2}\|\theta - \tau\|_2^2.$$
To minimize $m(\cdot, \tau)$, we solve the FOC $\nabla m(\theta, \tau) = 0$, for fixed
$\tau \in \mathbb{T}$, where:
$$\nabla m(\theta, \tau) = \nabla f(\tau) + L(\theta - \tau).$$
This implies:
$$\nabla f(\tau) + L(\theta - \tau) = 0 \implies L(\theta - \tau) = -\nabla f(\tau) \implies \theta = \tau - L^{-1}\nabla f(\tau).$$
The MM algorithm to minimize $f$ is then to start with $\theta^{(0)}$ and
generate the sequence $\left(\theta^{(s)}\right)_{s\in\mathbb{N}}$, where
$$\theta^{(s)} = \operatorname*{arg\,min}_{\theta\in\mathbb{T}} m\left(\theta, \theta^{(s-1)}\right) = \theta^{(s-1)} - L^{-1}\nabla f\left(\theta^{(s-1)}\right).$$
This is the standard gradient descent algorithm with fixed
“learning rate” $L^{-1} > 0$.
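A generic sketch of this fixed-step scheme (helper names and the toy quadratic objective in the usage example are assumptions) could be:

```python
import numpy as np

def gradient_descent(grad, theta0, L, num_steps=100):
    """Fixed-step gradient descent: theta_s = theta_{s-1} - (1/L) * grad(theta_{s-1}).

    grad: function returning the gradient at a point; L: a bound on the
    largest Hessian eigenvalue (both supplied by the user).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - grad(theta) / L
    return theta

# Example (assumed): minimize f(theta) = ||theta - c||^2 / 2, whose gradient is theta - c.
c = np.array([1.0, -3.0])
print(gradient_descent(lambda t: t - c, np.zeros(2), L=1.0))   # converges to c
```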
38 / 113
Back to logistic regression
Recall that in the logistic regression problem, the Hessian
$$\mathbf{H} r_n(\theta; \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix} 1 & W_i \\ W_i & W_i^2 \end{bmatrix} \tilde{P}_{W=W_i}(Y = 1; \theta)\, \tilde{P}_{W=W_i}(Y = 0; \theta)$$
can typically be assumed to be positive definite.
Exercise 12
Show that $\pi(1 - \pi) \leq 1/4$ for every $\pi \in (0,1)$, and use this fact to show
that $\mathbf{B}_n \geq \mathbf{H} r_n(\theta; \mathbf{X}_n)$, for every $\theta \in \mathbb{R}^2$, where
$$\mathbf{B}_n = \frac{1}{4}\mathbf{G}_n, \quad \text{where} \quad \mathbf{G}_n = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix} 1 & W_i \\ W_i & W_i^2 \end{bmatrix}.$$
39 / 113
Back to logistic regression
Thus, given that we have $L_n \geq E_n$ for every $E_n \in \operatorname{eigenvalues}(\mathbf{G}_n)$,
we can obtain a majorizer
$$m(\theta, \tau) = r_n(\tau; \mathbf{X}_n) + \{\nabla r_n(\tau; \mathbf{X}_n)\}^\top (\theta - \tau) + \frac{L_n}{8}\|\theta - \tau\|_2^2$$
and obtain the MM/gradient descent algorithm, starting from $\theta^{(0)}$:
$$\theta^{(s)} = \theta^{(s-1)} - 4 L_n^{-1}\nabla r_n\left(\theta^{(s-1)}; \mathbf{X}_n\right).$$
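Putting the pieces together, a hedged sketch of this update on simulated data (the data-generating parameters are assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated logistic data (example parameters alpha = -0.5, beta = 1.0).
w = rng.normal(size=500)
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * w)))).astype(float)

# L_n: largest eigenvalue of G_n = (1/n) * sum [[1, W_i], [W_i, W_i^2]].
G = np.array([[1.0, w.mean()], [w.mean(), np.mean(w**2)]])
L_n = np.linalg.eigvalsh(G).max()

theta = np.zeros(2)
for _ in range(2_000):
    alpha, beta = theta
    p1 = 1.0 / (1.0 + np.exp(-(alpha + beta * w)))
    grad = np.array([np.mean(p1 - y), np.mean(w * (p1 - y))])
    theta = theta - 4.0 / L_n * grad   # theta^(s) = theta^(s-1) - 4 L_n^{-1} grad

print(theta)   # approaches the empirical risk minimizer
```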
Can we verify that
$$r_n\left(\theta^{(s)}; \mathbf{X}_n\right) \xrightarrow[s\to\infty]{} \min_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
or the stronger condition, that
$$\theta^{(s)} \xrightarrow[s\to\infty]{} \hat{\theta}_n \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n)?$$
40 / 113
Back to logistic regression
Once we verify that we have a method for obtaining
$$\min_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n) = r_n\left(\hat{\theta}_n; \mathbf{X}_n\right), \quad \text{with} \quad \hat{\theta}_n \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n),$$
for each $n \in \mathbb{N}$, we can ask whether or not
$$\hat{\theta}_n \xrightarrow[n\to\infty]{\text{a.s.}} \theta_0 \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta)$$
or, the weaker condition, whether or not:
$$r_n\left(\hat{\theta}_n; \mathbf{X}_n\right) \xrightarrow[n\to\infty]{\text{a.s.}} \min_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta).$$
Here, we also have to check for conditions under which $r_{\mathbb{P}}(\theta)$ exists
as the limit, for each $\theta \in \mathbb{T}$:
$$r_n(\theta; \mathbf{X}_n) \xrightarrow[n\to\infty]{\text{a.s.}} r_{\mathbb{P}}(\theta).$$
41 / 113
The fundamental inequality of SL
Suppose that we have an ERM and a risk minimizer:
$$\hat{\theta}_n \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta; \mathbf{X}_n), \qquad \theta_0 \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta).$$
Then, we can write the so-called excess risk due to $\hat{\theta}_n$ as
$$r_{\mathbb{P}}\left(\hat{\theta}_n\right) - r_{\mathbb{P}}(\theta_0)
= \underbrace{r_{\mathbb{P}}\left(\hat{\theta}_n\right) - r_n\left(\hat{\theta}_n; \mathbf{X}_n\right)}_{\Delta_1}
+ \underbrace{r_n\left(\hat{\theta}_n; \mathbf{X}_n\right) - r_n(\theta_0; \mathbf{X}_n)}_{\Delta_2}
+ \underbrace{r_n(\theta_0; \mathbf{X}_n) - r_{\mathbb{P}}(\theta_0)}_{\Delta_3}.$$
Observe that by definition of $\hat{\theta}_n$, $\Delta_2 \leq 0$, so
$$r_{\mathbb{P}}\left(\hat{\theta}_n\right) - r_{\mathbb{P}}(\theta_0) \leq \Delta_1 + \Delta_3.$$
42 / 113
The fundamental theorem of SL
Now observe that
$$\Delta_1 = r_{\mathbb{P}}\left(\hat{\theta}_n\right) - r_n\left(\hat{\theta}_n; \mathbf{X}_n\right) \leq \left|r_n\left(\hat{\theta}_n; \mathbf{X}_n\right) - r_{\mathbb{P}}\left(\hat{\theta}_n\right)\right|$$
and
$$\Delta_3 = r_n(\theta_0; \mathbf{X}_n) - r_{\mathbb{P}}(\theta_0) \leq |r_n(\theta_0; \mathbf{X}_n) - r_{\mathbb{P}}(\theta_0)|.$$
Thus,
$$\Delta_1, \Delta_3 \leq \sup_{\theta\in\mathbb{T}} |r_n(\theta; \mathbf{X}_n) - r_{\mathbb{P}}(\theta)| \quad \text{and} \quad
r_{\mathbb{P}}\left(\hat{\theta}_n\right) - r_{\mathbb{P}}(\theta_0) \leq \Delta_1 + \Delta_3 \leq 2\sup_{\theta\in\mathbb{T}} |r_n(\theta; \mathbf{X}_n) - r_{\mathbb{P}}(\theta)|.$$
Exercise 13
Use this fact to show that if $\sup_{\theta\in\mathbb{T}} |r_n(\theta; \mathbf{X}_n) - r_{\mathbb{P}}(\theta)| \xrightarrow[n\to\infty]{\text{a.s.}} 0$, then
$$r_n\left(\hat{\theta}_n; \mathbf{X}_n\right) \xrightarrow[n\to\infty]{\text{a.s.}} \min_{\theta\in\mathbb{T}} r_{\mathbb{P}}(\theta).$$
43 / 113
Measure theory
44 / 113
Set notation – 31 July
We will denote a set, typically by blackboard bold characters
A,B,C, . . . or double stroke characters A,B,C, . . . .
We say that x ∈ A to mean that the object x is an element of A,
and x /∈ A to mean that x is not in A.
The set intersect A∩B contains all x, such that x ∈ A and x ∈ B.
The set union A∪B contains all x, such that x ∈ A or x ∈ B
(potentially, x ∈ A∩B).
The set exclusion A\B contains all x, such that x ∈ A but x /∈ B.
We say that A is a subset of B, and write A ⊂ B, to mean that for all x,
if x ∈ A then x ∈ B.
We say that A is a superset of B, and write A ⊃ B, to mean that for all x,
if x ∈ B then x ∈ A.
If both A⊂ B and A⊃ B are true, then A= B.
We lastly denote a set with no elements by the empty set ∅, and say
that A is nonempty if A ̸=∅.
45 / 113
Set builder notation
To construct a set with all of the elements of x ∈ X that satisfy the
property P, we write
{x ∈ X|P(x) is true}
For example, the set of all positive real numbers can be written as
R>0 = {x ∈ R|P(x) is true}
where P(x) = "x> 0".
For the set satisfying the property:
P(x) = "x1 or x2 or x3 or . . . or xn"
we simply write {x1, . . . ,xn}.
46 / 113
Important sets
The natural numbers: N= {1,2,3, . . .}.
The integers: Z= {. . . ,−2,−1,0,1,2, . . .}.
The non-negative integers: Z≥0 = N∪{0}.
The real numbers: R= (−∞,∞).
The non-negative real numbers: R≥0 = [0,∞).
The positive real numbers: R>0 = (0,∞).
The extended real numbers: R¯= [−∞,∞].
The non-negative extended real numbers: R¯≥0 = [0,∞].
The positive extended real numbers: R¯>0 = (0,∞].
Here, the extended real numbers allow us to use notation such as
“ f (x) = ∞” in a meaningful way.
We have also seen the set of natural numbers up to n ∈ N:
[n] = {1,2, . . . ,n} .
47 / 113
Sizes of sets
We denote the number of elements in a set A by its cardinality:
A 7→ |A| ∈ Z≥0.
We say that |A|= |B| if there is a bijection between A and B.
If |A|= 0, we say that A is empty (i.e., A=∅)
If |A|= |[n]|, for some n ∈ N, then we say that A is finite and has n
elements.
If |A|= |N|, then we say that A is countably infinite.
e.g. Z≥0 and Z are both countably infinite.
If |A| > |N|, we say that A is uncountably infinite,
e.g. R and R¯ are uncountably infinite.
Exercise 14
Show that |Z|= |N|, and that |R|= |(−1,1)|.
48 / 113
Products
We call (a1, . . . ,an) an n-tuple.
If the 2-tuples (or pairs) (a,b) are such that each a ∈ A and each
b ∈ B, then we define the set of all possible such pairs by the
Cartesian product
A×B= {(a,b) |a ∈ A and b ∈ B} .
This product notation can be extended to finite or (countably)
infinite tuples $(a_1, \ldots, a_n)$ or $(a_n)_{n\in\mathbb{N}}$ by writing
$$\Pi_{i=1}^n A_i = \times_{i=1}^n A_i = \{(a_1, \ldots, a_n) \,|\, a_1 \in A_1, \ldots, a_n \in A_n\}$$
or
$$\Pi_{n=1}^\infty A_n = \times_{n=1}^\infty A_n = \left\{(a_n)_{n\in\mathbb{N}} \,|\, a_n \in A_n, \text{ for each } n \in \mathbb{N}\right\}.$$
If $A_1 = A_2 = \cdots$, then we can alternatively write
$$\Pi_{i=1}^n A_i = A^n \quad \text{and} \quad \Pi_{n=1}^\infty A_n = A^{\mathbb{N}}.$$
49 / 113
Power sets and complements
We denote the power set of A by 2^A, where 2^A is the set of all
subsets B of A:
2A = {B : B⊂ A} .
Thus, when B⊂ A, we can alternatively write that B ∈ 2A.
For any B ∈ 2A, we can denote its complement (within A) by Bc,
where
Bc = A\B.
For example, N ∈ 2Z, since N⊂ Z. Further,
Nc = Z\N= {z ∈ Z : z≤ 0}, which we can also write as Z≤0: the
non-positive integers.
Exercise 15
Show that if a set A is finite with cardinality n (i.e. |A| = n), then |2^A| = 2^n.
50 / 113
Algebras
We say that F⊂ 2Ω is an algebra over the non-empty set Ω if it
fulfills the following conditions:
1. F is nonempty.
2. If A ∈ F, then Ac ∈ F.
3. If A,B ∈ F, then A∪B ∈ F.
The smallest algebra over any Ω is F= {∅,Ω}, which we call the
trivial algebra.
For a non-trivial example let Ω= {1,2,3}, and consider
F= {∅,Ω,{1} ,{2,3}}.
Exercise 16
Show that if F is an algebra and {A_i}_{i∈[n]} is a set of n ∈ N elements of F,
then
$$\bigcup_{i=1}^n A_i = A_1 \cup \cdots \cup A_n \in \mathbb{F}, \quad \text{and} \quad \bigcap_{i=1}^n A_i = A_1 \cap \cdots \cap A_n \in \mathbb{F}.$$
51 / 113
σ -Algebras
We say that F ⊂ 2^Ω is a σ-algebra over the non-empty set Ω if it
fulfills the following conditions:
1. F is nonempty.
2. If A ∈ F, then A^c ∈ F.
3. If {A_n}_{n∈N} ⊂ F, then $\bigcup_{n=1}^{\infty} A_n \in \mathbb{F}$.
Notice that every σ-algebra is an algebra.
For any set Ω, the largest σ-algebra that we can construct is the
discrete algebra F = 2^Ω.
We call a set Ω together with a compatible σ -algebra F a
measurable space: (Ω,F).
Exercise 17
Show that if F is a σ-algebra and {A_n}_{n∈N} is a countable subset of F,
then
$$\bigcap_{n=1}^{\infty} A_n \in \mathbb{F}.$$
52 / 113
Products of σ -Algebras
Recall that if we have two sets Ω1 and Ω2, then the Cartesian
product yields the set of pairs (ω1,ω2):
Ω1×Ω2 = {(ω1,ω2) |ω1 ∈Ω1 and ω2 ∈Ω2} .
If F1 and F2 are σ -algebras of the respective sets, then it is
tempting to define a σ -algebra on Ω1×Ω2 by taking F1×F2.
This doesn’t make any sense, since
F1×F2 = {(A1,A2) |A1 ∈ F1 and A2 ∈ F2} ⊂ 2Ω1 ×2Ω2 .
But a σ -algebra for Ω1×Ω2, by definition, should be a subset of
2Ω1×Ω2 .
That is, an element of a σ -algebra of Ω1×Ω2 should be a subset
consisting of pairs (ω1,ω2) ∈Ω1×Ω2, and not a pair of subsets
(A1,A2) ∈ 2Ω1 ×2Ω2 .
53 / 113
Products of σ -Algebras
A method for generating a σ -algebra for a product of sets is to
define the product σ-algebra:
F1⊗F2 = {A1×A2|A1 ∈ F1 and A2 ∈ F2} .
If we have a countably infinite sequence of sets $(\Omega_n)_{n\in\mathbb{N}}$ and σ-algebras
$(\mathbb{F}_n)_{n\in\mathbb{N}}$, then the product σ-algebra is defined as
$$\bigotimes_{n=1}^{\infty} \mathbb{F}_n = \left\{\prod_{n=1}^{\infty} A_n \,\middle|\, A_n \in \mathbb{F}_n, \text{ for each } n \in \mathbb{N}\right\}.$$
If $\Omega_n = \Omega$ and $\mathbb{F}_n = \mathbb{F}$, then we also use the power notation:
$$\mathbb{F}^{\otimes\mathbb{N}} = \left\{\prod_{n=1}^{\infty} A_n \,\middle|\, A_n \in \mathbb{F}, \text{ for each } n \in \mathbb{N}\right\}.$$
54 / 113
Topologies
If X is a set, then we call T ⊂ 2^X a topology if it satisfies the
following properties:
1. ∅ ∈ T.
2. X ∈ T.
3. If A, B ∈ T, then A ∩ B ∈ T.
4. For any indexed collection {A_a}_{a∈N} ⊂ T, where N is a potentially
uncountable index set, it holds that $\bigcup_{a\in N} A_a \in \mathcal{T}$.
We call a set X together with a compatible topology T a
topological space: (X,T ).
We call a subset A⊂X an open set in (X,T ) if and only if A ∈T .
55 / 113
Metric spaces
A metric or distance on a set X is a function D :X×X→ R¯≥0 that
satisfies the following criteria:
1. For any x,y ∈ X, D(x,y) = 0 if and only if x= y (Identifiability).
2. For any x,y ∈ X, D(x,y) =D(y,x) (Symmetry).
3. For any x,y,z ∈ X, D(x,z)≤D(x,y)+D(y,z) (Triangle inequality).
We call a set X together with a compatible metric D a metric
space: (X,D).
Exercise 18
Prove the reverse triangle inequality: for every x,y,z ∈ X,
|D(x,z)−D(y,z)| ≤D(x,y)
56 / 113
Metric spaces
We say that ∥·∥ : X→ R≥0 is a norm on the vector space X if:
1. ∥x∥= 0 if and only if x= 0 (Identifiability).
2. For each a ∈ R and x ∈ X: ∥ax∥= |a|∥x∥ (Homogeneity).
3. For each x,y ∈ X, ∥x+ y∥ ≤ ∥x∥+∥y∥ (Triangle inequality).
For $\mathbb{X} = \mathbb{R}^d$ with elements $x = (x_1, \ldots, x_d)$, the most commonly
encountered norms are the $p$-norms:
$$\|x\|_p = \left\{\sum_{i=1}^d |x_i|^p\right\}^{1/p},$$
for $p \in [1, \infty)$, which include the Euclidean and taxicab norms:
$$\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2} \quad \text{and} \quad \|x\|_1 = \sum_{i=1}^d |x_i|.$$
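For a concrete check of these formulas (the example vector is an assumption), one can compute a few $p$-norms directly:

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])   # an arbitrary example vector

for p in (1, 2, 3):
    print(p, np.sum(np.abs(x) ** p) ** (1 / p))   # the p-norm formula above

print("inf", np.max(np.abs(x)))                   # the limiting infinity-norm
```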
57 / 113
Metric spaces
For a vector space X, particularly, when X= Rd , using a norm ∥·∥
we can define a metric D via the mapping:
(x,y) 7→ ∥x− y∥=D(x,y) .
Another useful metric is the Hamming distance on the set of
binary strings $\mathbb{X} = \{0,1\}^d$: for every $x = (x_1, \ldots, x_d)$ and
$y = (y_1, \ldots, y_d)$ in $\mathbb{X}$,
$$D(x, y) = \sum_{i=1}^d [\![x_i \neq y_i]\!],$$
where $[\![\cdot]\!]$ denotes the so-called Iverson brackets:
$$[\![A]\!] = \begin{cases} 1 & \text{if statement } A \text{ is true}, \\ 0 & \text{otherwise}. \end{cases}$$
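A small sketch of the Hamming distance as defined above (the function name is an assumption):

```python
def hamming(x, y):
    """Hamming distance between two equal-length binary strings (tuples of 0/1)."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)   # sum of Iverson brackets

print(hamming((0, 1, 1, 0), (1, 1, 0, 0)))   # 2
```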
Exercise 19
Verify that the Hamming distance is indeed a metric on the binary strings.
58 / 113
Metric topological spaces
For each point x ∈ (X,D), we define a (open) ball of radius r > 0
around x as
Br (x) = {y ∈ X :D(x,y)< r} .
The set of balls
$$\mathcal{B}_D = \{B_r(x) \,|\, x \in \mathbb{X} \text{ and } r \in \mathbb{R}_{>0}\}$$
can then be used as a generative base for the so-called metric space
topology $\mathcal{T}_D$, by putting all balls $B \in \mathcal{B}_D$ into $\mathcal{T}_D$, putting all finite
intersections of balls into $\mathcal{T}_D$, and putting all (possibly uncountable)
unions of balls into $\mathcal{T}_D$.
Thus, we can treat every metric space $(\mathbb{X}, D)$ also as a topological
space $(\mathbb{X}, \mathcal{T}_D)$.
Even when we don't have a metric $D$ on $\mathbb{X}$ in mind, we can say that
$\mathbb{X}$ is metrizable if there exists a metric on $\mathbb{X}$ that we can use to
induce a metric topology.
For example, $\mathbb{R}^d$ is metrizable, without declaring whether we wish to
endow the space with the norm-based metric using $\|\cdot\|_p$ or $\|\cdot\|_q$, for
$p \neq q$.
59 / 113
Borel σ -algebras
For a metric space (Ω,D), we can define its so-called Borel
σ-algebra as
B(Ω) =BD (Ω) ,
where B(Ω) is defined as the smallest σ -algebra that contains all of
the balls BD.
That is,
$B_r(\omega) \in \mathcal{B}(\Omega)$, for each $\omega \in \Omega$ and $r > 0$;
$\Omega \setminus B_r(\omega) \in \mathcal{B}(\Omega)$, for each $\omega \in \Omega$ and $r > 0$; and
if $\{B_n\}_{n\in\mathbb{N}} \subset \mathcal{B}_D$, then
$$\bigcup_{n=1}^{\infty} B_n \in \mathcal{B}(\Omega) \quad \text{and} \quad \bigcap_{n=1}^{\infty} B_n \in \mathcal{B}(\Omega).$$
60 / 113
Borel σ -algebras
For an arbitrary subset A ⊂ 2Ω, we typically denote the smallest
σ-algebra generated by A as
σ (A ) ,
in the sense that σ (A ) is the smallest σ -algebra that contains every
set A ∈A .
Using this notation, we can write the Borel σ -algebra of (Ω,D) as
B(Ω) = σ (BD).
When a set Ω is metrizable, we can still define a Borel σ -algebra
without specifying the underlying metric.
Exercise 20
Demonstrate that the notion of a smallest σ -algebra that is generated by
A is well-defined, in some sense.
61 / 113
Measures – 3 August
Let (Ω,F) be a measurable space.
We define a measure on $(\Omega, \mathbb{F})$ as a function $M: \mathbb{F} \to \bar{\mathbb{R}}_{\geq 0}$, where
1. For each $A \in \mathbb{F}$, $M(A) \geq 0$ (Non-negativity).
2. For a sequence of pairwise disjoint sets $(A_n)_{n\in\mathbb{N}}$:
$$M\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} M(A_n) \quad (\sigma\text{-additivity}).$$
3. $M(\varnothing) = 0$ (Nullity).
Here, we say that A,B are pairwise disjoint if A∩B=∅.
Putting M together with (Ω,F) gives us a measure space:
(Ω,F,M).
62 / 113
Examples of measures
Suppose that we have a set of n discrete objects: Ω = [n], equipped
with the discrete σ-algebra F = 2^Ω.
Then, we call MD the discrete measure if MD (A) = |A| for each set
A ∈ F.
e.g. MD (∅) = 0,MD ({1}) = 1, MD (Ω) = n, and
2=MD ({1}∪{2}) =MD ({1})+MD ({2}) = 1+1= 2.
Suppose instead that Ω= N is now a countably infinite set,
equipped with the σ -algebra F= 2Ω.
Then, we call MC the counting measure if MC (A) = |A| for each set
A ∈ F.
Notice that MC behaves the same as MD, except that MC (Ω) = ∞.
Exercise 21
Verify, in the discrete case, that σ ({x} : x ∈Ω) = 2Ω.
63 / 113
Common types of measures
We say that a measure M on (Ω,F) is finite if M(Ω)< ∞, which
implies that for every A ∈ F, M(A)≤M(Ω)< ∞.
A finite measure with the property that M(Ω) = 1 is called a
probability measure and (Ω,F,P) is called a probability space.
If there exists a covering sequence $(A_n)_{n\in\mathbb{N}} \subset \mathbb{F}$, such that
$M(A_n) < \infty$ for each $n \in \mathbb{N}$, and we can construct the entire set $\Omega$
using the sequence:
$$\Omega = \bigcup_{n=1}^{\infty} A_n,$$
then we say that $M$ is $\sigma$-finite on $(\Omega, \mathbb{F})$.
For example, $M_D$ is finite on $\left([n], 2^{[n]}\right)$ and $M_C$ is $\sigma$-finite on
$\left(\mathbb{N}, 2^{\mathbb{N}}\right)$.
Exercise 22
Show that the measure defined by A 7→ n−1MD (A) is a probability
measure.
64 / 113
Some useful properties
If M is a finite measure on F, then the usual union/or rule holds for
A,B ∈ F:
M(A∪B) =M(A)+M(B)−M(A∩B) .
Since M(A∩B)≥ 0, this also means that
M(A∪B)≤M(A)+M(B) .
For monotonic sequences $(A_n)_{n\in\mathbb{N}}$ and $(B_n)_{n\in\mathbb{N}}$ with
$$A_1 \subset A_2 \subset \cdots \quad \text{and} \quad B_1 \supset B_2 \supset \cdots,$$
we have
$$\lim_{n\to\infty} M(A_n) = M\left(\lim_{n\to\infty} A_n\right) \quad \text{and} \quad \lim_{n\to\infty} M(B_n) = M\left(\lim_{n\to\infty} B_n\right).$$
Here, we can also write:
$$\lim_{n\to\infty} A_n = \bigcup_{n\in\mathbb{N}} A_n \quad \text{and} \quad \lim_{n\to\infty} B_n = \bigcap_{n\in\mathbb{N}} B_n.$$
65 / 113
Lebesgue measure
In most learning problems, we are concerned with sets in $\mathbb{R}^d$.
The most common measures on $\mathbb{R}^d$ that we intuitively talk about
are things such as length in $d = 1$, area in $d = 2$, volume in
$d = 3$, and (hyper)volumes in $d \geq 4$.
In measure theory, to recover these objects, we have the Lebesgue
measures $M_L$ on the measurable spaces $(\Omega, \mathbb{F}) = \left(\mathbb{R}^d, \mathcal{B}\left(\mathbb{R}^d\right)\right)$, for
each $d \in \mathbb{N}$, where the Lebesgue measures are defined by the facts
that
$$M_L((a,b)) = b - a \quad \text{when } \Omega = \mathbb{R},$$
$$M_L((a,b)\times(c,d)) = (b - a)\times(d - c) \quad \text{when } \Omega = \mathbb{R}^2,$$
$$M_L\left(\prod_{j=1}^d (a_j, b_j)\right) = \prod_{j=1}^d (b_j - a_j) \quad \text{when } \Omega = \mathbb{R}^d,$$
where $a < b$, $c < d$, and $a_j < b_j$ for each $j \in [d]$.
66 / 113
Lebesgue measure
Suppose that we instead equip $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with the “counting
measure”, where
$$M_{\text{"C"}}(A) = |A|,$$
for each $A \in \mathcal{B}(\mathbb{R})$.
Notice that the only way to generate a sequence $(A_n)_{n\in\mathbb{N}} \subset \mathcal{B}(\mathbb{R})$
such that
$$\bigcup_{n=1}^{\infty} A_n = \mathbb{R}$$
is for some $A_n$ to be an interval $(a,b)$ or a union of intervals $(a_j, b_j)$,
where $a < b$ and $a_j < b_j$.
But we notice that $M_{\text{"C"}}((a,b)) = |(a,b)| = \infty$, so there is no way to
cover $\mathbb{R}$ with sets of finite measure, and thus $M_{\text{"C"}}$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is
not $\sigma$-finite.
Exercise 23
Argue that ML is σ -finite on
(
Rd ,B
(
Rd
))
for each d ∈ N.
67 / 113
Distribution functions
Let $P$ be a probability measure on either $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ or $\left(\mathbb{Z}, 2^{\mathbb{Z}}\right)$, with
typical element $\omega$.
Then, we define the (cumulative) distribution function ((C)DF)
of $P$ as
$$F(t) = P(\{\omega : \omega \leq t\}) = P(\omega \leq t), \quad t \in \mathbb{R}.$$
If $P$ is a probability measure on $(\Omega, \mathbb{F})$, where $\Omega = \prod_{i=1}^n \Omega_i$ and
$\mathbb{F} = \bigotimes_{i=1}^n \mathbb{F}_i$, with $\Omega_i$ being either $\mathbb{R}$ (in which case $\mathbb{F}_i = \mathcal{B}(\mathbb{R})$) or $\mathbb{Z}$
(in which case $\mathbb{F}_i = 2^{\mathbb{Z}}$), then we say that $P$ has the DF:
$$F(t) = P(\omega_1 \leq t_1, \omega_2 \leq t_2, \ldots, \omega_n \leq t_n),$$
for each $t \in \mathbb{R}^n$, where $\omega = (\omega_1, \ldots, \omega_n)$.
Exercise 24
In the case when $P$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ or $\left(\mathbb{Z}, 2^{\mathbb{Z}}\right)$, show that
$$\lim_{t\to-\infty} F(t) = 0 \quad \text{and} \quad \lim_{t\to\infty} F(t) = 1.$$
68 / 113
Measurable functions
For a measurable space (Ω,F), we say that a set A⊂Ω is
measurable if A ∈ F.
Let X :Ω→ X, where (X,X) is a measurable space.
We write the inverse image of X as X−1 : 2X→ 2Ω, where, for each
A ∈ 2X
X−1 (A) = {ω ∈Ω : X (ω) ∈ A} ∈ 2Ω.
We then say that the function $X$ is $(\mathbb{F} \to \mathcal{X})$-measurable if for each
$A \in \mathcal{X}$, $X^{-1}(A)$ is measurable:
$$X^{-1}(A) \in \mathbb{F}.$$
When (Ω,F) is endowed with a probability measure P, we also say
that X is a random variable/function.
69 / 113
Pushforward measures
If $X$ is $(\mathbb{F} \to \mathcal{X})$-measurable and $(\Omega, \mathbb{F}, M)$ is a measure space, then
we can endow $(\mathbb{X}, \mathcal{X})$ with the pushforward (image) measure $M_X$,
which has the property that
$$M_X(A) = M\left(X^{-1}(A)\right) = M(\omega \in \Omega \,|\, X(\omega) \in A).$$
That is, the new measure space (X,X,MX ) endows each set A ∈ X
with a measure MX (A) that is equal to the measure of the set
containing all values ω ∈Ω such that X (ω) ∈ A, based on M.
We also observe that measurability is closed under composition in
the sense that if (Ω,F), (X,X) and (Y,Y) are measurable spaces,
and X :Ω→ X and Y : X→ Y are (F→ X)- and
(X→Y)-measurable, then Z = Y ◦X :Ω→ Y is
(F→Y)-measurable, where
ω 7→ Z (ω) = Y (X (ω)) .
70 / 113
Continuous functions
Let (X,DX ) and (Y,DY ) be metric spaces. Then, a function
Y : X→ Y is said to be continuous if and only if for each DY -open
set A⊂ Y, the inverse image
Y−1 (A) = {x ∈ X|Y (x) ∈ A}
is also a DX -open set.
If we endow X and Y with their respective Borel σ -algebra with
respect to their metric topologies, we know if A is DY -open, then
A ∈B(Y) and if B is a DX -open set then B ∈B(X).
But since every set $A \in \mathcal{B}(\mathbb{Y})$ can be constructed from a
countable number of unions/intersections of $D_Y$-open sets, we can
conclude that the inverse image $Y^{-1}(A)$ is also constructible from a
countable number of unions/intersections of $D_X$-open sets, which
implies that $Y^{-1}(A) \in \mathcal{B}(\mathbb{X})$.
Thus, a continuous function from (X,B(X)) to (Y,B(Y)) is
(B(X)→B(Y))-measurable!
71 / 113
The extended reals
Recall that the reals R has the same cardinality as (−1,1), which is
implied by the existence of a bijection f : R→ (−1,1).
Such a bijection is
x 7→ f (x) = x/(1+ |x|) .
Notice that we can then define $f: \bar{\mathbb{R}} \to [-1, 1]$ by setting
$f(-\infty) = -1$ and $f(\infty) = 1$.
Using this map, we can then define a metric $D$ on $\bar{\mathbb{R}}$ as
$$D(x, y) = |f(x) - f(y)|,$$
and so $\bar{\mathbb{R}}$ is a metrizable space.
Thus, we can safely define a Borel $\sigma$-algebra $\mathcal{B}\left(\bar{\mathbb{R}}\right)$ on $\bar{\mathbb{R}}$ using the
open sets of $\mathcal{T}_D$.
Exercise 25
Check that D is indeed a metric on R¯ and propose an alternative metric.
72 / 113
Characteristic functions
We now specialize our attention to mappings of the form
X : (Ω,F)→ (R¯,B(R¯)).
For each $A \subset \Omega$, the characteristic function (of $A$) is
$$\chi_A(\omega) = [\![\omega \in A]\!] = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A. \end{cases}$$
For any set $B \subset \mathbb{R}$, the inverse image is thus
$$\chi_A^{-1}(B) = \{\omega \in \Omega \,|\, \chi_A(\omega) \in B\} = \begin{cases} \Omega & \text{if } 0 \in B \text{ and } 1 \in B, \\ A & \text{if } 0 \notin B \text{ and } 1 \in B, \\ \Omega \setminus A & \text{if } 0 \in B \text{ and } 1 \notin B, \\ \varnothing & \text{if } 0 \notin B \text{ and } 1 \notin B. \end{cases}$$
Then, $\chi_A^{-1}(B) \in \mathbb{F}$, for every $B \subset \mathbb{R}$, as long as $A \in \mathbb{F}$. So $\chi_A$ is
measurable if $A \in \mathbb{F}$.
73 / 113
Measurable extended real functions
An extended real function X : (Ω,F)→ (R¯,B(R¯)) is measurable if
any of the following equivalent statements are true, for each x ∈ R:
1. X−1 ([x,∞)) = {ω ∈Ω : x≤ X (ω)< ∞} ∈ F,
2. X−1 ([x,∞]) = {ω ∈Ω : x≤X (ω)≤ ∞} ∈ F,
3. X−1 ([−∞,x]) = {ω ∈Ω :−∞≤ X (ω)≤ x} ∈ F,
4. X−1 ((−∞,x]) = {ω ∈Ω :−∞< X (ω)≤ x} ∈ F.
Let X : (Ω,F)→ (R¯,B(R¯)) and Y : (Ω,F)→ (R¯,B(R¯)) be
measurable functions. For each a ∈ R, the following functions are
also measurable:
1. a, aX , a+X , X+Y , X−Y , XY .
2. X/Y if Y (ω) ̸= 0, for every ω ∈Ω.
3. |X |, max{X ,Y}, min{X ,Y}.
Exercise 26
Show the corollary that if X : (Ω,F)→ (R¯,B(R¯)) is measurable, then for
every x ∈ R¯ and every open set A⊂ R,
X−1 ({x}) ∈ F and X−1 (A) ∈ F.
74 / 113
Supremum and Infimum
Recall that the maximum and minimum of a set $\mathbb{X} \subset \bar{\mathbb{R}}$ are
$$\max_{[\min]} \mathbb{X} = \left\{m \in \mathbb{X} : m \underset{[\leq]}{\geq} x, \text{ for each } x \in \mathbb{X}\right\},$$
which can be empty sets!
For $\mathbb{X} \subset \bar{\mathbb{R}}$, we define the upper and lower bounds of $\mathbb{X}$ as
$$\mathbb{U}(\mathbb{X}) = \left\{u \in \bar{\mathbb{R}} \,|\, u \geq x, \text{ for every } x \in \mathbb{X}\right\},$$
$$\mathbb{L}(\mathbb{X}) = \left\{l \in \bar{\mathbb{R}} \,|\, l \leq x, \text{ for every } x \in \mathbb{X}\right\}.$$
We then refer to its least upper bound and its greatest lower
bound as its supremum $\sup\mathbb{X}$ and infimum $\inf\mathbb{X}$, where
$\sup\mathbb{X} = a$ if $a \geq x$ for every $x \in \mathbb{X}$, and $a \leq u$, for every $u \in \mathbb{U}(\mathbb{X})$;
$\inf\mathbb{X} = a$ if $a \leq x$ for every $x \in \mathbb{X}$, and $a \geq l$, for every $l \in \mathbb{L}(\mathbb{X})$.
If we further let sup∅=−∞ and inf∅= ∞, then for any X⊂ R¯, the
supremum and infimum of X exist and are in R¯.
Exercise 27
For X⊂ R¯, show that if minX exists, then minX= infX, and similarly
when maxX exists, maxX= supX.
75 / 113
Supremum and Infimum
A typical adaptation of the supremum and infimum to sequences
$(x_n)_{n\in\mathbb{N}} \subset \bar{\mathbb{R}}$ uses the following notations:
$$\sup_{n\geq m} x_n = \sup\{x_m, x_{m+1}, \ldots\}, \quad \inf_{n\geq m} x_n = \inf\{x_m, x_{m+1}, \ldots\}.$$
This leads to the notion of the limits superior and inferior:
$$\limsup_{n\to\infty} x_n = \lim_{m\to\infty}\left(\sup_{n\geq m} x_n\right), \quad \liminf_{n\to\infty} x_n = \lim_{m\to\infty}\left(\inf_{n\geq m} x_n\right).$$
Exercise 28
Show that if limsupn→∞ xn = liminfn→∞ xn = l, then limn→∞ xn = l and
vice versa.
It is also natural to adapt the
supremum/infimum/maximum/minimum notation to functions by
defining, for $f: \mathbb{X} \to \bar{\mathbb{R}}$:
$$\sup_{x\in\mathbb{X}} / \inf_{x\in\mathbb{X}} / \max_{x\in\mathbb{X}} / \min_{x\in\mathbb{X}} f(x) = \sup/\inf/\max/\min\, \{f(x) \,|\, x \in \mathbb{X}\}.$$
76 / 113
Supremum and Infimum
Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable functions,
and define the supremum and infimum functions as
$$\omega \mapsto \sup_{n\in\mathbb{N}} X_n(\omega) = \sup\{X_n(\omega) \,|\, n \in \mathbb{N}\},$$
and
$$\omega \mapsto \inf_{n\in\mathbb{N}} X_n(\omega) = \inf\{X_n(\omega) \,|\, n \in \mathbb{N}\}.$$
Similarly, we can define the limit superior and inferior functions
as:
$$\omega \mapsto \limsup_{n\to\infty} X_n(\omega), \quad \omega \mapsto \liminf_{n\to\infty} X_n(\omega).$$
All four of the functions above are measurable.
Notice that since both $\limsup_{n\to\infty} X_n(\omega)$ and $\liminf_{n\to\infty} X_n(\omega)$ are
measurable, then so is
$$\omega \mapsto \lim_{n\to\infty} X_n(\omega),$$
whenever this limit exists.
77 / 113
Almost everywhere properties
Generically, we say that a sequence $(X_n)_{n\in\mathbb{N}}$ of functions
$X_n: \Omega \to (\mathbb{X}, D)$ converges pointwise to $X: \Omega \to (\mathbb{X}, D)$ if, for each
$\omega \in \Omega$:
$$D(X_n(\omega), X(\omega)) \xrightarrow[n\to\infty]{} 0.$$
We say that a set $A$ has full measure with respect to $(\Omega, \mathbb{F}, M)$ if
$A \in \mathbb{F}$ and $M(\Omega \setminus A) = 0$. For a probability measure $M = \mathbb{P}$, this
property is equivalent to $\mathbb{P}(A) = 1$.
We say that two functions $X, Y: \Omega \to \mathbb{X}$ are equal almost
everywhere (a.e.) if there exists a set of full measure $A \in \mathbb{F}$, such
that for each $\omega \in A$: $X(\omega) = Y(\omega)$.
Here, if $M$ is a probability measure, we also say almost surely
(a.s.) instead of a.e.
If the sequence of functions $(X_n)_{n\in\mathbb{N}}$ are now
$X_n: (\Omega, \mathbb{F}, M) \to (\mathbb{X}, \mathcal{B}(\mathbb{X}))$, then we say that $(X_n)_{n\in\mathbb{N}}$ converges to
$X: \Omega \to (\mathbb{X}, D)$, a.e., if there exists a set of full measure $A$ such that
for each $\omega \in A$:
$$D(X_n(\omega), X(\omega)) \xrightarrow[n\to\infty]{} 0.$$
78 / 113
Complete measure spaces
We say that a measure space (Ω,F,M) is complete if for each
A,B ∈ F, such that A⊂ B and M(B) = 0, it must also be true that
A ∈ F.
As an example, consider Ω= {1,2,3} with the algebra
F= {∅,{1} ,{2,3} ,Ω} and define the measure M by M({1}) = 1
and M({2,3}) = 0. We observe that (Ω,F,M) is not complete since
{2} ,{3} ⊂ {2,3} and M({2,3}) = 0 but {2} ,{3} /∈ F. We can
define the so-called completion of (Ω,F,M) by enlarging the algebra
F to F¯M with {2} ,{3} ∈ F¯ and defining a new measure M¯ such that
M¯({1}) = 1,M¯({2}) = 0 and M¯({3}) = 0.
We can always complete a generic measure space (Ω,F,M) in a
similar way. Namely, if we set
F¯M = {C∪A|C ∈ F,A⊂ B ∈ F,M(B) = 0}
and let M¯(C∪A) =M(C), for each C ∈ F and A satisfying the
properties A⊂ B ∈ F and M(B) = 0.
The measure space
(
Ω, F¯M,M¯
)
is now a complete measure space
and if A ∈ F, then M(A) = M¯(A).
79 / 113
Complete measure spaces
Let the two functions X ,Y : (Ω,F)→ R¯ be equal, a.e., with respect
to the measure M.
Exercise 29
Show that if $X$ is a $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable function, then $Y$ is
$\left(\bar{\mathbb{F}}_M \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable.
This implies that if the sequence of functions $(X_n)_{n\in\mathbb{N}}$,
$X_n: (\Omega, \mathbb{F}, M) \to \left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$, converges to $X$, a.e.; i.e.,
$$X_n \xrightarrow[n\to\infty]{\text{a.e.}} X,$$
then since $\lim_{n\to\infty} X_n$ is $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable, it is also true that
$X$ is $\left(\bar{\mathbb{F}}_M \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable.
80 / 113
Simple functions – 7 August
For a measurable space $(\Omega, \mathbb{F})$, we define its simple functions as
the class
$$\mathcal{S}(\Omega, \mathbb{F}) = \left\{S: \Omega \to \mathbb{R} \,\middle|\, S(\omega) = \sum_{k=1}^m a_k \chi_{A_k}(\omega),\ (a_k)_{k\in[m]} \subset \mathbb{R},\ (A_k)_{k\in[m]} \subset \mathbb{F}\right\}.$$
Let the non-negative simple functions then be defined as
$$\mathcal{S}_{\geq 0}(\Omega, \mathbb{F}) = \{S \in \mathcal{S}(\Omega, \mathbb{F}) \,|\, S(\omega) \geq 0, \text{ for each } \omega \in \Omega\}.$$
Exercise 30
Show that for every non-negative $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable function $X \geq 0$, there
exists a sequence of functions $(S_n)_{n\in\mathbb{N}} \subset \mathcal{S}_{\geq 0}(\Omega, \mathbb{F})$, such that
$$S_n(\omega) \leq S_{n+1}(\omega) \leq X(\omega), \text{ for each } n \in \mathbb{N},\ \omega \in \Omega,$$
$$\text{and} \quad \lim_{n\to\infty} S_n(\omega) = X(\omega), \text{ for every } \omega \in \Omega.$$
81 / 113
The Lebesgue integral
For any non-negative simple function $S \in \mathcal{S}_{\geq 0}(\Omega, \mathbb{F})$:
$$S(\omega) = \sum_{k=1}^m a_k \chi_{A_k}(\omega),$$
we define its integral, with respect to the measure $M$, as
$$\int_\Omega S(\omega)\, M(\mathrm{d}\omega) = \sum_{k=1}^m a_k M(A_k).$$
For every non-negative $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable function $X \geq 0$,
we define its Lebesgue integral, with respect to $M$, as
$$\int_\Omega X(\omega)\, M(\mathrm{d}\omega) = \lim_{n\to\infty} \int_\Omega S_n(\omega)\, M(\mathrm{d}\omega),$$
where $(S_n)_{n\in\mathbb{N}} \subset \mathcal{S}_{\geq 0}(\Omega, \mathbb{F})$ and $\lim_{n\to\infty} S_n = X$.
An alternative definition is that
$$\int_\Omega X(\omega)\, M(\mathrm{d}\omega) = \sup\left\{\int_\Omega S(\omega)\, M(\mathrm{d}\omega) : S \in \mathcal{S}_{\geq 0}(\Omega, \mathbb{F}),\ S \leq X\right\}.$$
You will also see the notations:
$$\int_\Omega X(\omega)\, M(\mathrm{d}\omega) = \int_\Omega X(\omega)\,\mathrm{d}M(\omega) = \int_\Omega X\,\mathrm{d}M.$$
82 / 113
Summation and integration
Consider a non-negative measurable function $X \geq 0$, mapping from
$\left(\mathbb{Z}, 2^{\mathbb{Z}}, M_C\right)$ to $\left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$.
The non-negative simple functions on $\left(\mathbb{Z}, 2^{\mathbb{Z}}\right)$ are:
$$\mathcal{S}_{\geq 0}(\Omega, \mathbb{F}) = \left\{S(\omega) = \sum_{k=1}^m a_k [\![\omega = b_k]\!],\ (a_k)_{k\in[m]} \subset \mathbb{R},\ (b_k)_{k\in[m]} \subset \mathbb{Z}\right\},$$
and so if $X$ is the limit of $(S_n)_{n\in\mathbb{N}} \subset \mathcal{S}_{\geq 0}(\Omega, \mathbb{F})$, it is just going to have
the form:
$$X(\omega) = \sum_{k\in\mathbb{Z}} X(k)\,[\![\omega = k]\!].$$
But the counting measure of any singleton is $M_C(\{\omega \in \mathbb{Z} : \omega = k\}) = 1$, so
the Lebesgue integral of $X$ is defined as
$$\int_{\mathbb{Z}} X(\omega)\, M_C(\mathrm{d}\omega) = \sum_{k\in\mathbb{Z}} X(k)\, M_C(\{\omega = k\}) = \sum_{k\in\mathbb{Z}} X(k).$$
Similarly, for $X \geq 0$, mapping from $\left([n], 2^{[n]}, M_D\right)$ to $\left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$, we
can write $\int_{[n]} X(\omega)\, M_D(\mathrm{d}\omega) = \sum_{k=1}^n X(k)$.
83 / 113
Properties of integrals
For any $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable function $X$, note that we can write
$X = X^+ - X^-$, where, for each $\omega \in \Omega$,
$$X^+(\omega) = \max\{X(\omega), 0\} \geq 0 \quad \text{and} \quad X^-(\omega) = -\min\{X(\omega), 0\} \geq 0.$$
Then, we define the Lebesgue integral of $X$ as
$$\int_\Omega X(\omega)\, M(\mathrm{d}\omega) = \int_\Omega X^+(\omega)\, M(\mathrm{d}\omega) - \int_\Omega X^-(\omega)\, M(\mathrm{d}\omega).$$
We say that a function X is Lebesgue integrable if |∫ΩXdM|< ∞.
Let $X$ and $Y$ be $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable functions. Then:
If X (ω)≤ Y (ω), for a.e. ω ∈Ω, then ∫ΩXdM≤ ∫ΩYdM.
If a,b ∈ R¯, then ∫Ω aX+bYdM= a∫ΩXdM+b∫ΩYdM.
Exercise 31
Consider that $|X| = X^+ + X^-$ and argue that
$$\left|\int_\Omega X\,\mathrm{d}M\right| < \infty \iff \int_\Omega |X|\,\mathrm{d}M < \infty,$$
and that $\left|\int_\Omega X\,\mathrm{d}M\right| \leq \int_\Omega |X|\,\mathrm{d}M$.
84 / 113
Expectations
Let $X$ be a real-valued random variable on the probability space
$(\Omega, \mathbb{F}, \mathbb{P})$; then we define its expectation and variance as:
$$\mathbb{E}(X) = \int_\Omega X(\omega)\, \mathbb{P}(\mathrm{d}\omega), \quad \operatorname{var}(X) = \int_\Omega |X(\omega) - \mathbb{E}(X)|^2\, \mathbb{P}(\mathrm{d}\omega),$$
and we call $\operatorname{sd}(X) = \sqrt{\operatorname{var}(X)}$ the standard deviation.
If X ,Y are random variables, then we define the covariance as:
cov(X ,Y ) = E [{X−E(X)}{Y −E(Y )}] = E(XY )−E(X)E(Y ) .
If f : R→ R is convex, then f (E [X ])≤ E [ f (X)] (Jensen’s
inequality).
If A ∈ F then P(A) = E(χA).
If E(|X |)< ∞ then |X |< ∞, a.e., and X = 0,a.e.⇐⇒ E|X |= 0.
Exercise 32
Prove Markov’s inequality: if X ≥ 0, then for every ε > 0,
P(ω ∈Ω|X (ω)≥ ε)≤ ε−1E(X) .
85 / 113
Independence
On $(\Omega, \mathbb{F}, \mathbb{P})$, we say the sample of events $(A_i)_{i\in[n]} \subset \mathbb{F}$ are
independent if
$$\mathbb{P}\left(\bigcap_{i=1}^n A_i\right) = \prod_{i=1}^n \mathbb{P}(A_i).$$
We extend this definition by saying that the $n$ $(\mathbb{F} \to \mathcal{X}_i)$-measurable
random variables $(X_i)_{i\in[n]}$ are independent if for each sample of
sets $(A_i)_{i\in[n]}$, with $A_i \in \mathcal{X}_i$:
$$\mathbb{P}\left(\bigcap_{i=1}^n \{X_i(\omega) \in A_i\}\right) = \prod_{i=1}^n \mathbb{P}(X_i(\omega) \in A_i).$$
Note that this also means that if each $X_i$ maps from $(\Omega, \mathbb{F})$ to either
$(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ or $\left(\mathbb{Z}, 2^{\mathbb{Z}}\right)$, then the CDF satisfies:
$$F(t) = \mathbb{P}(X_1 \leq t_1, \ldots, X_n \leq t_n) = \prod_{i=1}^n \mathbb{P}(X_i \leq t_i) = \prod_{i=1}^n F_i(t_i),$$
where $F_i(t_i) = \mathbb{P}(X_i \leq t_i)$, for each $i \in [n]$.
86 / 113
Product measure
Recall that if we have two measurable spaces (Ω1,F1) and (Ω2,F2),
then we can define the product measurable space:
(Ω1×Ω2,F1⊗F2), where
F1⊗F2 = {A1×A2|A1 ∈ F1 and A2 ∈ F2} .
For measure spaces $(\Omega_1, \mathbb{F}_1, M_1)$ and $(\Omega_2, \mathbb{F}_2, M_2)$, we define their
product as
$$(\Omega_1 \times \Omega_2, \mathbb{F}_1 \otimes \mathbb{F}_2, M_1 \otimes M_2)$$
such that for each $A_1 \in \mathbb{F}_1$ and $A_2 \in \mathbb{F}_2$,
$$M_1 \otimes M_2\,(A_1 \times A_2) = M_1(A_1) \times M_2(A_2).$$
We can extend this definition to the product of measure spaces
$((\Omega_i, \mathbb{F}_i, M_i))_{i\in[n]}$ to produce the product measure space
$\left(\times_{i=1}^n \Omega_i, \bigotimes_{i=1}^n \mathbb{F}_i, \bigotimes_{i=1}^n M_i\right)$, where for each sample of events $A_i \in \mathbb{F}_i$,
for each $i \in [n]$, we write
$$\left(\bigotimes_{i=1}^n M_i\right)\left(\times_{i=1}^n A_i\right) = \prod_{i=1}^n M_i(A_i).$$
87 / 113
The Fubini–Tonelli theorem
Let $(\Omega_1, \mathbb{F}_1, M_1)$ and $(\Omega_2, \mathbb{F}_2, M_2)$ be $\sigma$-finite measure spaces, and
let $X: \Omega_1 \times \Omega_2 \to \bar{\mathbb{R}}$ be $\left(\mathbb{F}_1 \otimes \mathbb{F}_2 \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable, where we
can write $(\omega_1, \omega_2) \mapsto X(\omega_1, \omega_2)$.
If one of the quantities
$$\int_{\Omega_1\times\Omega_2} |X(\omega_1, \omega_2)|\, M_1 \otimes M_2(\mathrm{d}\omega_1 \times \mathrm{d}\omega_2),$$
$$\int_{\Omega_1}\int_{\Omega_2} |X(\omega_1, \omega_2)|\, M_2(\mathrm{d}\omega_2)\, M_1(\mathrm{d}\omega_1),$$
$$\int_{\Omega_2}\int_{\Omega_1} |X(\omega_1, \omega_2)|\, M_1(\mathrm{d}\omega_1)\, M_2(\mathrm{d}\omega_2)$$
is finite, then they all are, and this finiteness implies that
$$\omega_1 \mapsto \int_{\Omega_2} X(\omega_1, \omega_2)\, M_2(\mathrm{d}\omega_2) \quad \text{and} \quad \omega_2 \mapsto \int_{\Omega_1} X(\omega_1, \omega_2)\, M_1(\mathrm{d}\omega_1)$$
are $\left(\mathbb{F}_1 \to \mathcal{B}(\bar{\mathbb{R}})\right)$- and $\left(\mathbb{F}_2 \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable functions,
respectively, and are both finite, almost everywhere.
88 / 113
The Fubini–Tonelli theorem
Assuming that $\int_{\Omega_1\times\Omega_2} |X(\omega_1, \omega_2)|\, M_1 \otimes M_2(\mathrm{d}\omega_1 \times \mathrm{d}\omega_2)$ is finite, we
also have the fact that the three quantities
$$\int_{\Omega_1\times\Omega_2} X(\omega_1, \omega_2)\, M_1 \otimes M_2(\mathrm{d}\omega_1 \times \mathrm{d}\omega_2),$$
$$\int_{\Omega_1}\int_{\Omega_2} X(\omega_1, \omega_2)\, M_2(\mathrm{d}\omega_2)\, M_1(\mathrm{d}\omega_1),$$
$$\int_{\Omega_2}\int_{\Omega_1} X(\omega_1, \omega_2)\, M_1(\mathrm{d}\omega_1)\, M_2(\mathrm{d}\omega_2)$$
are all equal to one another (Fubini).
Furthermore, if $X(\omega_1, \omega_2) \geq 0$ for every $(\omega_1, \omega_2) \in \Omega_1 \times \Omega_2$, then we
can drop the finiteness assumption on
$\int_{\Omega_1\times\Omega_2} |X(\omega_1, \omega_2)|\, M_1 \otimes M_2(\mathrm{d}\omega_1 \times \mathrm{d}\omega_2)$. That is,
$$\omega_1 \mapsto \int_{\Omega_2} X(\omega_1, \omega_2)\, M_2(\mathrm{d}\omega_2) \quad \text{and} \quad \omega_2 \mapsto \int_{\Omega_1} X(\omega_1, \omega_2)\, M_1(\mathrm{d}\omega_1)$$
are always measurable, and the three quantities above are always
equal for non-negative $X$ (Tonelli).
89 / 113
Returning to independence
Recall that we say the $(\mathbb{F} \to \mathcal{X}_i)$-measurable random variables
$(X_i)_{i\in[n]}$ are independent if for each sample of sets $(A_i)_{i\in[n]}$, with $A_i \in \mathcal{X}_i$:
$$\mathbb{P}\left(\bigcap_{i=1}^n \{X_i \in A_i\}\right) = \prod_{i=1}^n \mathbb{P}(X_i \in A_i).$$
Using pushforward measures, we have:
$$\prod_{i=1}^n \mathbb{P}(X_i(\omega) \in A_i) = \prod_{i=1}^n P_{X_i}(X_i \in A_i) = \bigotimes_{i=1}^n P_{X_i}\left(\times_{i=1}^n A_i\right).$$
Considering the function $(X_1, \ldots, X_n) \mapsto X_1 \times X_2 \times \cdots \times X_n$, repeated
application of the Fubini–Tonelli theorem implies that, if finite,
$$\mathbb{E}\left[\prod_{i=1}^n X_i\right] = \prod_{i=1}^n \mathbb{E}[X_i]. \tag{1}$$
Exercise 33
Verify that (1) holds.
90 / 113
Returning to independence
As a corollary to (1), if $(f_i)_{i\in[n]}$ is a sample of
$\left(\mathcal{X}_i \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable functions, then we also have the fact that
if $\mathbf{X}_n = (X_i)_{i\in[n]}$ are independent, then
$$\mathbb{E}\left[\prod_{i=1}^n f_i(X_i)\right] = \prod_{i=1}^n \mathbb{E}[f_i(X_i)].$$
As a partial converse, we also have the following two facts:
If for each sample of $\left(\mathcal{X}_i \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable functions $(g_i)_{i\in[n]}$,
where either (i) $g_i > 0$ for all $i \in [n]$ or (ii) $|g_i| < \infty$ for all $i \in [n]$,
$$\mathbb{E}\left[\prod_{i=1}^n g_i(X_i)\right] = \prod_{i=1}^n \mathbb{E}[g_i(X_i)],$$
then $\mathbf{X}_n$ is an independent sample.
If each $\mathcal{X}_i$ is a Borel $\sigma$-algebra, then the above implication is also
true when restricted to samples $(g_i)_{i\in[n]}$ such that (iii) $g_i$ is bounded (i.e.
$|g_i| < \infty$) and continuous for all $i \in [n]$.
91 / 113
Convergence of measurable functions
We have already defined that $(X_n)_{n\in\mathbb{N}}$, $X_n: (\Omega, \mathbb{F}, M) \to (\mathbb{X}, \mathcal{B}(\mathbb{X}))$,
converges to $X: \Omega \to (\mathbb{X}, D)$, almost everywhere, if there exists a
set of full measure $A$ such that for each $\omega \in A$:
$$D(X_n(\omega), X(\omega)) \xrightarrow[n\to\infty]{} 0,$$
which we write as $X_n \xrightarrow[n\to\infty]{\text{a.e.}} X$ (or $X_n \xrightarrow[n\to\infty]{\text{a.s.}} X$ if $M = \mathbb{P}$ is a probability
measure).
A weaker mode of convergence is to require that for each $\varepsilon > 0$,
$$\lim_{n\to\infty} M(\omega \in \Omega \,|\, D(X_n(\omega), X(\omega)) > \varepsilon) = 0,$$
which is referred to as convergence in measure $M$, and is typically
written as
$$X_n \xrightarrow[n\to\infty]{M} X.$$
When $M = \mathbb{P}$ is a probability measure, we instead say that
$(X_n)_{n\in\mathbb{N}}$ converges to $X$ in probability $\mathbb{P}$, and write $X_n \xrightarrow[n\to\infty]{\mathbb{P}} X$.
92 / 113
Kolmogorov’s strong laws of large numbers – 10 August
Let $X_{\mathbb{N}} = (X_n)_{n\in\mathbb{N}}$ be a sequence of IID random variables on $(\Omega, \mathbb{F}, \mathbb{P})$,
where each $X_n: (\Omega, \mathbb{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ has the same pushforward
measure $P_X$ as the random variable $X: (\Omega, \mathbb{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
Then, provided that $\mathbb{E}_X(X) = \mu \in \mathbb{R}$, the strong law of large
numbers (SLLN) states that:
$$\bar{X}_n(\omega) = \frac{1}{n}\sum_{i=1}^n X_i(\omega) \xrightarrow[n\to\infty]{\text{a.s.}} \mu. \tag{2}$$
In fact, Kolmogorov's SLLN isn't the best that we can do.
Suppose instead that $X_{\mathbb{N}} = (X_n)_{n\in\mathbb{N}}$ are pairwise IID random
variables on $(\Omega, \mathbb{F}, \mathbb{P})$, in the sense that $X_m$ is independent of $X_n$ and
$(X_m, X_n)$ has the same product pushforward measure as some pair
$(X, Y): (\Omega, \mathbb{F}) \to \left(\mathbb{R}^2, \mathcal{B}^{\otimes 2}(\mathbb{R})\right)$, for each pair of distinct $m, n \in \mathbb{N}$.
Then, if $\mathbb{E}_X(X) = \mathbb{E}_Y(Y) = \mu \in \mathbb{R}$, Etemadi's SLLN states that (2)
holds.
93 / 113
Frequentist probability
Let $X$ be a random variable on $(\Omega, \mathbb{F}, \mathbb{P})$, where $X: (\Omega, \mathbb{F}) \to (\mathbb{X}, \mathcal{X})$.
For some $A \in \mathcal{X}$, what do we actually mean when we say
$$\mathbb{P}(X \in A) = \mathbb{P}(\omega \in \Omega : X(\omega) \in A)?$$
Let $X_{\mathbb{N}} = (X_n)_{n\in\mathbb{N}}$ be a sequence of IID random variables on
$(\Omega, \mathbb{F}, \mathbb{P})$, where each $X_n: (\Omega, \mathbb{F}) \to (\mathbb{X}, \mathcal{X})$ has the same pushforward
measure as $X$.
Then, with the sequence $X_{\mathbb{N}}$, we can arbitrarily well estimate
$\mathbb{P}(X \in A)$ using the SLLN by the average
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n [\![X_i(\omega) \in A]\!] = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \chi_A(X_i(\omega)) \overset{\text{a.s.}}{=} \mathbb{E}[\chi_A(X)] = \mathbb{P}(X \in A).$$
This is the so-called Frequentist or classical interpretation of
probability.
94 / 113
Convergence preservation
We will just write $\xrightarrow[n\to\infty]{\bullet}$ to mean either $\xrightarrow[n\to\infty]{M}$ or $\xrightarrow[n\to\infty]{\text{a.e.}}$.
Suppose that $a, b \in \mathbb{R}$ and $(X_n)_{n\in\mathbb{N}}$ and $(Y_n)_{n\in\mathbb{N}}$ are real measurable
functions, such that
$$X_n \xrightarrow[n\to\infty]{\bullet} X \quad \text{and} \quad Y_n \xrightarrow[n\to\infty]{\bullet} Y;$$
then:
$$aX_n + bY_n \xrightarrow[n\to\infty]{\bullet} aX + bY, \qquad X_n Y_n \xrightarrow[n\to\infty]{\bullet} XY.$$
If $Y(\omega) \neq 0$, for a.e. $\omega \in \Omega$, then $X_n/Y_n \xrightarrow[n\to\infty]{\bullet} X/Y$.
If $f: \mathbb{R} \to \mathbb{R}$ is continuous, then $f(X_n) \xrightarrow[n\to\infty]{\bullet} f(X)$.
Exercise 34
Show that if $X_n \xrightarrow[n\to\infty]{\text{a.s.}} X$, then $X_n \xrightarrow[n\to\infty]{M} X$.
95 / 113
Dominated convergence theorem
Let $X_{\mathbb{N}} = (X_n)_{n\in\mathbb{N}}$ be a sequence of measurable functions on
$(\Omega, \mathbb{F}, M)$, where $X_n: (\Omega, \mathbb{F}) \to \left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$, and suppose that
$$X_n \xrightarrow[n\to\infty]{M} X,$$
for some $X: (\Omega, \mathbb{F}) \to \left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$.
Further, assume that there exists a dominating function
$Y: (\Omega, \mathbb{F}) \to \left(\bar{\mathbb{R}}, \mathcal{B}(\bar{\mathbb{R}})\right)$, such that for each $n \in \mathbb{N}$,
$$|X_n(\omega)| \leq Y(\omega), \quad \text{for every } \omega \in \Omega.$$
If $\int_\Omega Y\,\mathrm{d}M < \infty$, then
$$\lim_{n\to\infty} \int_\Omega X_n\,\mathrm{d}M = \int_\Omega X\,\mathrm{d}M.$$
Notice that if $|X_n(\omega)| < b$ for some $b \in \mathbb{R}_{\geq 0}$, then we can set $Y = b$
(Bounded convergence theorem).
96 / 113
The Lebesgue spaces
Let X be a measurable function on (Ω,F,M), mapping into
(R,B(R)).
We often require conditions regarding the integrability of $X$ or some
power $p \in [1, \infty)$ of $X$: e.g.,
$$\int_\Omega |X|^p\,\mathrm{d}M < \infty. \tag{3}$$
It is thus convenient to construct a class of functions for which
condition (3) holds for every element.
We call such classes of functions Lebesgue spaces or $\mathcal{L}^p$ spaces,
defined, for each $p \in [1, \infty)$, as
$$\mathcal{L}^p(\Omega, \mathbb{F}, M) = \left\{X \,\middle|\, \int_\Omega |X|^p\,\mathrm{d}M < \infty,\ X \text{ is } (\mathbb{F} \to \mathcal{B}(\mathbb{R}))\text{-measurable}\right\}.$$
When some of the triple $(\Omega, \mathbb{F}, M)$ are implied, we can also write:
$$\mathcal{L}^p(\Omega, M), \quad \mathcal{L}^p(\Omega), \quad \mathcal{L}^p(M), \quad \text{or} \quad \mathcal{L}^p.$$
97 / 113
The Lebesgue spaces
Let X ∈L p (Ω,F,M) and Y ∈L q (Ω,F,M), for some p,q ∈ [1,∞).
1. If p= q, then X+Y ∈L p (Ω,F,M) (Minkowski).
2. If 1/p+1/q= 1, then XY ∈L 1 (Ω,F,M) (Hölder)
3. If M(Ω) < ∞ and p < q, then L q (Ω,F,M) ⊂ L p (Ω,F,M).
We can further complement the L p spaces with the space of
essentially bounded functions:
L ∞ (Ω,F,M) = {X | |X (ω)| ≤ b, for a.e. ω ∈Ω, b< ∞} .
Now, instead of saying things like “the
expectation/variance of the random variable X exists”, we can say,
respectively,
X ∈L 1 (Ω,F,P) or X ∈L 2 (Ω,F,P) ,
and by Point 3, we know that if the variance of X exists, then its
expectation also exists.
98 / 113
Measures from integrals
Let X ∈L 1 (Ω,F,M) and X (ω)≥ 0, for each ω ∈Ω.
Then, the function $N: \mathbb{F} \to \mathbb{R}_{\geq 0}$ defined by
$$A \mapsto N(A) = \int_\Omega \chi_A(\omega)\, X(\omega)\, M(\mathrm{d}\omega) = \int_A X\,\mathrm{d}M$$
is a measure on $(\Omega, \mathbb{F})$.
Further, if $X(\omega) = Y(\omega)$, for a.e. $\omega \in \Omega$, then, for each $A \in \mathbb{F}$:
$$N(A) = \int_A X\,\mathrm{d}M = \int_A Y\,\mathrm{d}M.$$
So, if $Z$ is $\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable, then we can write
$$\int_\Omega Z\,\mathrm{d}N = \int_\Omega ZX\,\mathrm{d}M = \int_\Omega ZY\,\mathrm{d}M.$$
Exercise 35
Check that N satisfies the definition of a measure on (Ω,F).
99 / 113
The Radon–Nikodym theorem
On (Ω,F), we say that N is absolutely continuous with respect to
M, written as N≪M, if M(A) = 0 implies N(A) = 0, for every
A ∈ F.
We further say that $N$ has a density with respect to $M$ if there
exists an $n \in \mathcal{L}^1(\Omega, \mathbb{F}, M)$ such that $n \geq 0$ and, for each $A \in \mathbb{F}$:
$$N(A) = \int_A n(\omega)\, M(\mathrm{d}\omega).$$
Then, the Radon–Nikodym theorem states that:
$$N \text{ has a density with respect to } M \iff N \ll M.$$
The density $n = \mathrm{d}N/\mathrm{d}M$ is often referred to as the Radon–Nikodym
derivative, because the following heuristic holds for each
$\left(\mathbb{F} \to \mathcal{B}(\bar{\mathbb{R}})\right)$-measurable $X$:
$$\int_A X(\omega)\, n(\omega)\, M(\mathrm{d}\omega) = \int_A X \frac{\mathrm{d}N}{\mathrm{d}M}\,\mathrm{d}M = \int_A X\,\mathrm{d}N.$$
100 / 113
Optimization theory
101 / 113
Empirical risk minimization
Let X be a random variable on (Ω,F,P) with pushforward measure
space (X,X,PX ) and let θ be a parameter in the parameter space
T.
Let ℓ : T×X→ R¯≥0 be a loss function.
We then call the expectation of the loss function the (population)
risk:
$$r: \mathbb{T} \to \bar{\mathbb{R}}, \quad \theta \mapsto r(\theta) = \mathbb{E}\{\ell(\theta, X)\},$$
which we can estimate by the empirical/sample risk:
$$r_n: \mathbb{T} \times \mathbb{X}^n \to \bar{\mathbb{R}}, \quad (\theta, \mathbf{X}_n) \mapsto r_n(\theta, \mathbf{X}_n) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, X_i),$$
where $\mathbf{X}_n = (X_i)_{i\in[n]}$ is an IID sample in which each $X_i$ has measure $P_X$.
The task is to compute the empirical risk minimizers/minimum:
$$\hat{\theta}_n \in \operatorname*{arg\,min}_{\theta\in\mathbb{T}} r_n(\theta, \mathbf{X}_n) \quad \text{and} \quad \hat{r}_n = \min_{\theta\in\mathbb{T}} r_n(\theta, \mathbf{X}_n)$$
and investigate the relationships with the risk minimizers/minimum:
$$\operatorname*{arg\,min}_{\theta\in\mathbb{T}} r(\theta) \quad \text{and} \quad \hat{r} = \min_{\theta\in\mathbb{T}} r(\theta).$$
102 / 113
Examples of ERM problems
Let $\eta: \mathbb{R}^d \times \mathbb{T} \to \mathbb{R}_{\geq 0}$ and suppose that
$\eta(\cdot\,; \theta) \in \mathcal{L}^1\left(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), P_X\right)$, for every $\theta \in \mathbb{T} \subset \mathbb{R}^q$.
Here, we can think of $\eta(\cdot\,; \theta)$ as a density with respect to
$\left(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), M_L\right)$, calibrated by parameter $\theta \in \mathbb{T}$.
Further suppose that $P_X \ll M_L$ and thus $P_X$ has density
$p_X: \mathbb{R}^d \to \mathbb{R}_{\geq 0}$.
Then, given a measure of “error” between the target $p_X(X)$ and the
hypothesis $\eta(X; \theta)$, say,
$$(X, \theta) \mapsto \mathcal{E}(p_X(X), \eta(X; \theta)) \geq 0,$$
we can consider the ERM problem defined by the loss function
$$(\theta, X) \mapsto \ell(\theta, X) = \mathcal{E}(p_X(X), \eta(X; \theta)).$$
When $\mathcal{E}(p_X(X), \eta(X; \theta)) = \log\{p_X(X)\} - \log\{\eta(X; \theta)\}$, we get
maximum likelihood estimation (MLE).
When $\mathcal{E}(p_X(X), \eta(X; \theta)) = |p_X(X) - \eta(X; \theta)|^2$, we get
least-squares density estimation.
103 / 113
Examples of ERM problems
Let X = (W,Y ) ∈X=W×Y and suppose that η :W×T→ Y¯⊃Y.
Here, we can consider η to be a predictor of the response Y , based
on features/covariates W , calibrated by parameter θ ∈ T⊂ Rq.
We then define an error between the target Y and hypothesis
η (W ;θ) as
(X ,θ) 7→ E(Y,η (W ;θ))≥ 0.
and consider ERM problems, where
(θ ,X) 7→ ℓ(θ ,X) = E(Y,η (W ;θ))+pen(θ) ,
with penalty function pen : T→ R≥0.
For example:
If Y= Y¯= Rd and we take E(Y,η (W ;θ)) = ∥Y −η (W ;θ)∥22 and
pen(θ) = 0, we obtain (multivariate nonlinear) least-squares
regression.
If Y= {−1,1} and Y¯= R and we take
E(Y,η (W ;θ)) =max{0,1−Yη (W ;θ)} and pen(θ) = λ ∥θ∥22 (λ ≥ 0),
we get the (soft-margin) support vector machine (with hinge
loss).
104 / 113
Examples of ERM problems
To elaborate on the least squares setting, first suppose that W= Rp
and Y= R. If we set the predictor hypothesis to have the form
η (W ;θ) = α+W⊤β , θ = (α,β ) ∈ R1×Rp,
then we obtain ordinary least-squares regression.
If, instead, we take $\mathbb{Y} = \mathbb{R}^d$, then
$$\eta(W; \theta) = \alpha + \beta W, \quad \theta = (\alpha, \beta) \in \mathbb{R}^d \times \mathbb{R}^{d\times p},$$
yields multivariate linear regression.
Retaining $\mathbb{W} = \mathbb{R}^p$ and $\mathbb{Y} = \mathbb{R}^d$, we can define functions
$g_l: \mathbb{R}^{p_{l-1}} \times \mathbb{T}_l \to \mathbb{R}^{p_l}$, for each $l \in [L]$, where $p_0 = p$ and $p_L = d$, and
$\mathbb{T}_l \subset \mathbb{R}^{q_l}$. Then, if we take
$$\eta(W; \theta) = g_L(\cdot\,; \vartheta_L) \circ \cdots \circ g_3(\cdot\,; \vartheta_3) \circ g_2(\cdot\,; \vartheta_2) \circ g_1(W; \vartheta_1)$$
with $\theta = (\vartheta_1, \ldots, \vartheta_L) \in \times_{l=1}^L \mathbb{T}_l$, then we get an $L$-layer feedforward
neural network.
105 / 113
The real vector spaces
In most settings, we are concerned with hypotheses η that depend
on a real parameter vector θ ∈ T⊂ Rq, for some q ∈ N.
Thus, we can concentrate all of our effort in understanding how the
spaces Rq behave, and the behaviors of functions
r : Rq→ R¯.
Firstly, when we say that $\mathbb{R}^q$ is a vector space, we are saying the
following: if $a, b \in \mathbb{R}$ and $\theta, \tau \in \mathbb{R}^q$, then
$$1 \times \theta = \theta, \quad 0 \times \theta = 0 \in \mathbb{R}^q, \quad a\theta + b\tau \in \mathbb{R}^q,$$
along with the algebraic properties: $a(b\theta) = (ab)\theta$,
$a(\theta + \tau) = a\theta + a\tau$, $(a+b)\theta = a\theta + b\theta$, $\theta + \tau = \tau + \theta$, etc.
Of course, this means that if $(a_i)_{i\in[n]} \subset \mathbb{R}$ and $(\tau_i)_{i\in[n]} \subset \mathbb{R}^q$, then the
quantity
$$\theta = a_1\tau_1 + a_2\tau_2 + \cdots + a_n\tau_n$$
is well-defined.
106 / 113
Norms on the reals
Specializing our previous discussions, we recall that ∥·∥ : Rq→ R≥0
is a norm on Rq, if
1. ∥θ∥= 0 if and only if θ = 0 (Identifiability).
2. For each a ∈ R and θ ∈ Rq: ∥aθ∥= |a|∥θ∥ (Homogeneity).
3. For each θ ,τ ∈ X, ∥θ + τ∥ ≤ ∥θ∥+∥τ∥ (Triangle inequality).
In particular, we have the family of so-called $p$-norms, for each
$p \in [1, \infty)$:
$$\theta = (\vartheta_1, \ldots, \vartheta_q) \mapsto \|\theta\|_p = \left\{\sum_{i=1}^q |\vartheta_i|^p\right\}^{1/p}.$$
We can further complete the class of $p$-norms with the $\infty$-norm:
$$\|\theta\|_\infty = \max\left\{|\vartheta_1|, \ldots, |\vartheta_q|\right\}.$$
Together, we say that $\left(\mathbb{R}^q, \|\cdot\|\right)$ is a normed vector space.
Exercise 36
Verify that $\|\cdot\|_\infty$ is a norm.
107 / 113
Equivalence of norms
There are of course infinitely many p-norms, and even more norms,
generally: e.g.,
θ 7→ ∥θ∥= ∥θ∥1+2∥θ∥2+3∥θ∥3 .
We say that two norms $\|\cdot\|$ and $|||\cdot|||$ are equivalent if there exist
$a, b \in \mathbb{R}_{>0}$ such that, for every $\theta$:
$$a\|\theta\| \leq |||\theta||| \leq b\|\theta\|.$$
It turns out that the choice of norms on Rq doesn’t matter since all
norms on Rq are equivalent.
Exercise 37
For $\theta \in \mathbb{R}^q$, show that
$$\|\theta\|_\infty \leq \|\theta\|_2 \leq \|\theta\|_1 \leq \sqrt{q}\,\|\theta\|_2 \leq q\,\|\theta\|_\infty.$$
108 / 113
The inner product
We are particularly interested in the Euclidean space (Rq,∥·∥2),
because it has further structural properties.
Namely, we can define the inner product operation
$$(\theta, \tau) \mapsto \langle\theta, \tau\rangle = \sum_{i=1}^q \vartheta_i t_i,$$
for $\theta = (\vartheta_1, \ldots, \vartheta_q)$ and $\tau = (t_1, \ldots, t_q)$.
For $a, b \in \mathbb{R}$ and $\theta, \tau, \upsilon = (u_1, \ldots, u_q) \in \mathbb{R}^q$, we have:
1. $\langle\theta, \tau\rangle = \langle\tau, \theta\rangle$ (Symmetry).
2. $\langle a\theta + b\tau, \upsilon\rangle = a\langle\theta, \upsilon\rangle + b\langle\tau, \upsilon\rangle$ (Linearity).
3. $\langle\theta, \theta\rangle = \|\theta\|_2^2$ (Norm determining).
Together, we say that $\left(\mathbb{R}^q, \langle\cdot,\cdot\rangle\right)$ is an inner product space.
109 / 113
Metrics
We also endow $\left(\mathbb{R}^d, \langle\cdot,\cdot\rangle\right)$ with the Euclidean metric:
$$(\theta, \tau) \mapsto D(\theta, \tau) = \|\theta - \tau\|_2,$$
which we can use to define a topology on $\mathbb{R}^d$.
That is, we can define the open sets in $\mathbb{R}^d$ using the system of open
balls:
$$\mathcal{B}_D = \left\{B_r(\theta) \,|\, \theta \in \mathbb{R}^d \text{ and } r \in \mathbb{R}_{>0}\right\},$$
where, for each point $\theta \in \mathbb{R}^d$, we define the (open) ball of radius
$r > 0$ around $\theta$ as
$$B_r(\theta) = \left\{\tau \in \mathbb{R}^d : \|\theta - \tau\|_2 < r\right\}.$$
For any set $A \subset \mathbb{R}^d$, we can also define its diameter as
$$\operatorname{diam}(A) = \sup_{\theta,\tau\in A} D(\theta, \tau) = \sup_{\theta,\tau\in A} \|\theta - \tau\|_2.$$
We say that A is bounded if diam(A)< ∞.
110 / 113
Closed sets
We recall, from the definition of a topology, that the open sets in Rd
are the empty set ∅ and space Rd , the (infinite) unions of balls, and
the finite intersects of these unions.
We can further say that a set is closed if its complement is open.
That is, if A⊂ Rd is open, then Ac = Rd\A is closed.
An alternative definition of closedness is as follows: $A$ is closed if
every convergent sequence $(\theta_n)_{n\in\mathbb{N}} \subset A$ converges in $A$, in the sense that
$\lim_{n\to\infty} \theta_n \in A$.
For example, we can also define the system of closed balls:
$$\bar{\mathcal{B}}_D = \left\{\bar{B}_r(\theta) \,|\, \theta \in \mathbb{R}^d \text{ and } r \in \mathbb{R}_{>0}\right\},$$
where, for each point $\theta \in \mathbb{R}^d$, we define the closed ball of radius
$r \geq 0$ around $\theta$ as
$$\bar{B}_r(\theta) = \left\{\tau \in \mathbb{R}^d : \|\theta - \tau\|_2 \leq r\right\}.$$
.
We can construct the closed sets from closed balls: ∅ and Rd are
closed, along with the (infinite) intersections of closed balls, and the
finite unions of these intersections.
111 / 113
Compact sets
Let (θn)n∈N ⊂ Rd be a sequence.
Let m : N→ N be a strictly increasing function.
Then, we say that $\left(\theta_{m(n)}\right)_{n\in\mathbb{N}}$ is a subsequence of $(\theta_n)_{n\in\mathbb{N}}$.
We say that a set $A \subset \mathbb{R}^d$ is a compact set if for each sequence
$(\theta_n)_{n\in\mathbb{N}} \subset A$, there exists a subsequence $\left(\theta_{m(n)}\right)_{n\in\mathbb{N}}$ such that
$$\lim_{n\to\infty} \theta_{m(n)} \in A.$$
Equivalently, we say that $A$ is compact if, whenever $(A_n)_{n\in\mathbb{N}} \subset 2^A$ is a
decreasing sequence of closed and nonempty subsets, i.e.
$A_{n+1} \subset A_n$, $A_n$ is closed, and $A_n \neq \varnothing$, for each $n \in \mathbb{N}$, then
$$\lim_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} A_n \neq \varnothing.$$
Exercise 38
Show that (0,1) is not compact.
112 / 113
Extreme value theorem
The Heine–Borel theorem states that a set $A \subset \mathbb{R}^q$ is
compact if and only if it is closed and bounded.
Recall that a function $r: \mathbb{R}^q \to \bar{\mathbb{R}}$ is continuous if
$$r^{-1}(A) = \{\theta \in \mathbb{R}^q : r(\theta) \in A\}$$
is open, for every open set $A$.
On any metric space, including the Euclidean space, a more
convenient definition is that $r: \mathbb{T} \to \bar{\mathbb{R}}$ is continuous if for every
sequence $(\theta_n)_{n\in\mathbb{N}} \subset \mathbb{T}$,
$$\lim_{n\to\infty} \theta_n = \theta \implies \lim_{n\to\infty} r(\theta_n) = r(\theta).$$
Weierstrass's extreme value theorem states that if $\mathbb{T}$ is a
compact set, and $r: \mathbb{T} \to \bar{\mathbb{R}}$ is continuous, then
$$\operatorname*{arg\,min}_{\theta\in\mathbb{T}} r(\theta) \neq \varnothing \quad \text{and} \quad \operatorname*{arg\,max}_{\theta\in\mathbb{T}} r(\theta) \neq \varnothing.$$
113 / 113