High Dimensional Inference
STAT3006
October 27, 2021
1 / 110
The high dimensional setting
In a standard statistical setting, we generally have the following setup:
Observe data Xn = (X 1, . . . ,X n), consisting of observations X i, i ∈ [n], for
some n ∈ N.
Suppose that each X i ∈ X, and we wish to model X i using a model with
parameter θ ∈ T ⊆ Rp, for some p ∈ N.
Here we use the usual terminology that p is the dimensionality of the
inferential problem and n is the sample size.
Previously, we have typically assumed that the relationship is
p≪ n;
that is, we have many more observations than we have dimensions of our
data.
The symbol “≪” is to be read as “much less than”.
In this Chapter, we consider what happens when this relationship reverses:
p≫ n,
or when we have far more dimensions of data than we have observations
to draw inference about θ .
2 / 110
The high dimensional setting
Typically, we consider one of several scenarios regarding the nature of p:
p ∈ N may be a large but fixed number.
p : N → N, n ↦ p(n), where p(n) grows in size with n, at some rate.
θ ∈ T, where T is a space of functions that may be larger than Rp for any p
(in essence, p= ∞).
We will also consider relationships between data
X i ∈ X⊆ Rq and Y i ∈ Y⊆ Rr,
where it may be true that
max{q,r}≫ n,
with q,r potentially infinitely large.
3 / 110
High-dimensional weirdness
Suppose that X = (X1, . . . , Xq) is an independent and identically
distributed (IID) random sequence, where Xj ∼ N(0,1).
We can write the density function of X as
fq(x) = ∏_{j=1}^q (2π)^{−1/2} exp{−xj²/2} = (2π)^{−q/2} exp{−∥x∥₂²/2},
where x = (x1, . . . , xq) is a realization of X.
Notice that
sup_{x∈Rq} fq(x) = fq(0) = (2π)^{−q/2}.
Now, observe that
lim_{q→∞} fq(0) = lim_{q→∞} (2π)^{−q/2} = 0
at an exponentially fast rate.
So the q-dimensional Gaussian distribution rapidly gets flatter as q increases.
4 / 110
High-dimensional weirdness
Consider a ball of points around x = 0 with density value at least
ε fq(0) > 0:
Bq,ε(0) = {x ∈ Rq : fq(x) ≥ ε fq(0)} = {x ∈ Rq : ∥x∥₂² ≤ 2 log(1/ε)}.
Our usual intuition for small q is that, for fixed ε ∈ (0,1) sufficiently small,
the probability that X ∈ Bq,ε(0) will be close to 1.
That is, we expect any random point to be in the “bell” of the Gaussian
distribution.
Markov’s Inequality
For a random variable X ∈ R with finite expectation, i.e. E|X| < ∞,
P(|X| ≥ x) ≤ E|X|/x, for any x > 0.
5 / 110
High-dimensional weirdness
Write:
P(X ∈ Bq,ε(0)) = P(∥X∥₂² ≤ 2 log(1/ε))
= P(exp{−∥X∥₂²/2} ≥ ε)
≤ (1/ε) E[exp{−∥X∥₂²/2}]   (by Markov’s inequality)
= (1/ε) ∫_{Rq} exp{−∥x∥₂²} (2π)^{−q/2} dx
= (1/ε)(1/2^{q/2}).
For any value of ε ∈ (0,1) fixed, we notice that
lim_{q→∞} (1/ε)(1/2^{q/2}) = 0.
That is, the probability mass of the Gaussian distribution accumulates in
the tail!
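As a concrete illustration (not from the slides), the following Python sketch, assuming numpy is available, estimates P(X ∈ Bq,ε(0)) by Monte Carlo and compares it with the Markov bound (1/ε)2^{−q/2}; the sample size of 100,000 draws is an arbitrary choice.

```python
# Sketch (not from the slides): Monte Carlo check that P(X in B_{q,eps}(0))
# vanishes as q grows, alongside the Markov bound (1/eps) * 2^(-q/2).
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01
radius_sq = 2 * np.log(1 / eps)              # squared radius of B_{q,eps}(0)

for q in [1, 2, 5, 10, 20, 50]:
    X = rng.standard_normal((100_000, q))    # rows are draws of X in R^q
    p_hat = np.mean(np.sum(X**2, axis=1) <= radius_sq)
    bound = (1 / eps) * 2 ** (-q / 2)
    print(q, p_hat, min(bound, 1.0))
```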
6 / 110
High-dimensional weirdness
We note that
Bq,ε(0) = {x ∈ Rq : ∥x∥₂² ≤ 2 log(1/ε)}
is a Euclidean ball of radius ρ = √(2 log(1/ε)).
Let’s consider a more general Euclidean ball (around x) of radius ρ > 0:
Bq(x;ρ) = {ξ ∈ Rq : ∥ξ − x∥₂ ≤ ρ}.
The volume of Bq(x;ρ) (or its Lebesgue measure) is:
λ(Bq(x;ρ)) = [π^{q/2}/Γ(q/2+1)] ρ^q ∼_{q→∞} (2πeρ²/q)^{q/2} (qπ)^{−1/2}.
Here, we write g(x) ∼_{x→∞} h(x) to mean that lim_{x→∞} g(x)/h(x) = 1, and say
that g is asymptotically equivalent to h.
Γ(x) = ∫_0^∞ t^{x−1} exp{−t} dt is the Gamma function, for x > 0.
7 / 110
High-dimensional weirdness
Define the ϖ ∈ (0,1) crust of the ball Bq(x;ρ) as:
Cϖq(x;ρ) = Bq(x;ρ) − Bq(x;(1−ϖ)ρ).
The ϖ crust can be interpreted as the points of Bq(x;ρ) that are within
distance ϖρ of the boundary.
Consider the ratio between the volumes of the crust and ball:
Ratio = λ(Cϖq(x;ρ))/λ(Bq(x;ρ)) = [π^{q/2}/Γ(q/2+1)]{ρ^q − (1−ϖ)^q ρ^q} / {[π^{q/2}/Γ(q/2+1)] ρ^q} = 1 − (1−ϖ)^q.
Notice that limq→∞Ratio= 1; and thus for any ϖ , as q increases, almost all
of the volume of the ball accumulates on its crust!
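A minimal numerical sketch (not from the slides; numpy assumed) of the crust ratio 1 − (1−ϖ)^q, here with ϖ = 0.05 chosen arbitrarily:

```python
# Sketch (not from the slides): crust-to-ball volume ratio 1 - (1 - w)^q
# for a thin crust (w = 0.05), as a function of dimension q.
import numpy as np

w = 0.05
for q in [1, 2, 10, 50, 100, 500]:
    print(q, 1 - (1 - w) ** q)
```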
8 / 110
High-dimensional weirdness
Figure: The ϖ crust Cϖq (x;ρ) when q= 2.
9 / 110
High-dimensional weirdness
Let X 1, X 2 ∈ [0,1]^q be uniformly distributed on the q-dimensional
unit cube.
This implies that we can write the PDF of X 1, X 2 as
f(x) = ∏_{j=1}^q f(xj) = ∏_{j=1}^q 1 = 1.
Consider the squared distance between X 1 and X 2:
∥X 1 − X 2∥₂².
On average, we have
E∥X 1 − X 2∥₂² = ∑_{j=1}^q E[(X1j − X2j)²] = q E[(X11 − X21)²]
= q ∫_0^1 ∫_0^1 (x11 − x21)² dx11 dx21 = q/6.
Observe that lim_{q→∞} q/6 = ∞, so as dimensionality increases, uniformly
sampled points become infinitely far from one another, on average!
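The following hedged sketch (not from the slides; numpy assumed) checks E∥X 1 − X 2∥₂² = q/6 by Monte Carlo:

```python
# Sketch (not from the slides): Monte Carlo check of E||X1 - X2||^2 = q/6
# for independent uniform points on [0,1]^q.
import numpy as np

rng = np.random.default_rng(1)
for q in [2, 10, 100, 500]:
    X1 = rng.uniform(size=(10_000, q))
    X2 = rng.uniform(size=(10_000, q))
    mean_sq_dist = np.mean(np.sum((X1 - X2) ** 2, axis=1))
    print(q, mean_sq_dist, q / 6)
```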
10 / 110
High-dimensional weirdness
Figure: Two uniformly sampled points in [0,1]2.
11 / 110
High-dimensional weirdness
Suppose that we want to cover the q-dimensional unit cube [0,1]q with
balls of radius 1 (i.e., Bq (x;1)).
That is, we want to obtain a set of balls Bq(xi;1), for xi ∈ [0,1]^q, and i ∈ [n],
such that
[0,1]^q ⊂ ⋃_{i=1}^n Bq(xi;1).
We know that the volumes of [0,1]^q and Bq(xi;1) are:
λ([0,1]^q) = 1 and λ(Bq(xi;1)) = π^{q/2}/Γ(q/2+1).
12 / 110
High-dimensional weirdness
To cover a set of volume λ([0,1]^q) with balls of volume λ(Bq(xi;1)), we
would require at least
n ≥ λ([0,1]^q)/λ(Bq(xi;1)) = Γ(q/2+1)/π^{q/2}.
From the asymptotic equivalence, we have:
RHS ∼_{q→∞} (q/(2πe))^{q/2} √(qπ).
This implies that we need infinitely many balls of form Bq (xi;1) to cover
the unit cube, as q→ ∞. More specifically:
q 20 30 50 100 150 200
n≥ 39 45378 5.8E12 4.2E39 1.3E72 1.8E108
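The table entries can be reproduced, up to rounding, with the sketch below (not from the slides), which evaluates Γ(q/2+1)/π^{q/2} on the log scale; scipy's gammaln is assumed to be available.

```python
# Sketch (not from the slides): covering lower bound n >= Gamma(q/2+1)/pi^(q/2),
# computed on the log scale to avoid overflow for large q.
import numpy as np
from scipy.special import gammaln

for q in [20, 30, 50, 100, 150, 200]:
    log_n = gammaln(q / 2 + 1) - (q / 2) * np.log(np.pi)
    print(q, np.exp(log_n))
```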
13 / 110
High-dimensional weirdness
Figure: Covering the 2-dimensional unit cube with balls of form Bq (xi;1).
14 / 110
Visualization in high dimensions
We consider the Khan data set from the ISLR package accompanying “An
Introduction to Statistical Learning with Applications in R” by James et al. (2013).
Data consists of n= 63 observations X 1, . . . ,X 63.
Each observation X i ∈ Rq, where q= 2308.
X i is the gene expression profile of 2308 genes of tissue samples from
subject i ∈ [n].
For each subject i ∈ [n], we also observe the type of tumor the tissue was
sampled from as variable Yi ∈ [r], for r = 4.
We consider two basic kinds of plots:
The heat map: plots the q×n matrix Xn, where X i corresponds to the ith
column of Xn. The value of each Xi j is indicated via a color scale.
The coordinate plot: plots each vector X i as a function gi ( j) = Xi j, where
j ∈ [q]. Here, we may indicate the different values of i or of Yi via colors.
15 / 110
Dimensionality reduction
Notice that the two basic plots map the data into a 2- or 3-dimensional
space.
Suppose that we want to systematically map the high dimensional data
X 1, . . . ,X n ∈ X⊆ Rq into a lower dimensional space, say W⊆ Rs, where
s≪ q.
We will write a mapping from X ⊆ Rq to Rs as F : X → W, x ↦ F(x),
where F stands for forward.
This lower dimensional mapping W i =F (X i) ∈ Rs is called a dimensionality
reduction.
We also hope that the mapping is reversible in some way, in a sense that
we can sufficiently well re-estimate X i from W i =F (X i), for each i ∈ [n].
We will write a mapping from W to X as R : W → X, w ↦ R(w), where R
stands for reverse/recovery.
We formalize the notion that we well re-estimate X i as demanding that F
and R are such that
(1/n) ∑_{i=1}^n div(X i, R(F(X i)))
is small, where div(·,·) : Rq × Rq → R≥0 is a “divergence function”.
Here, we will often write the recovered observation R (F (X i)) = Xˆ i, and
think of it, in a sense, as a prediction.
16 / 110
Dimensionality reduction
Figure: Schematic of dimensionality reduction and recovery mappings.
17 / 110
Principal component analysis
Let X= Rq and W= Rs, where s≪ q.
We will consider linear mappings for F and R:
F (x) = Fx, where F ∈ Rs×q.
R (w) = Rw, where R ∈ Rq×s.
Thus we wish to map each observation X i ∈ X into a lower dimensional
space W and represent it as
W i =F (X i) = FX i.
We then recover each high dimensional observation from the smaller space
W via the reverse mapping
Xˆ i =R (W i) =R (F (X i)) = RFX i.
18 / 110
Principal component analysis
Using the Euclidean distance
div(x, x̂) = ∥x − x̂∥₂²,
we wish to choose F and R so that the average divergence
n^{−1} ∑_{i=1}^n ∥X i − X̂ i∥₂² = n^{−1} ∑_{i=1}^n ∥X i − RFX i∥₂²
is small.
Notice that the values of F and R that make the average divergence smallest
are the solution:
F̂, R̂ = argmin_{F∈Rs×q, R∈Rq×s} ∑_{i=1}^n ∥X i − RFX i∥₂².
The solution Fˆ, Rˆ is most commonly referred to as the principal component
analysis (PCA) solution.
19 / 110
Principal component analysis
Write
Oq,s = {O ∈ Rq×s : O⊤O = Is}
and call Oq,s the Stiefel manifold of order (q,s), or more descriptively the
set of matrices in Rq×s with orthonormal columns.
That is, the columns of O, say oj, for j ∈ [s], are such that:
For any j,k ∈ [s], where j ≠ k: oj⊤ok = 0.
For any j ∈ [s], ∥oj∥₂ = 1.
Stiefel manifold representation
Let Fˆ, Rˆ be the PCA solution. Then, Rˆ=O ∈Oq,s and Fˆ=O⊤.
20 / 110
Principal component analysis
For some fixed F ∈ Rs×q and R ∈ Rq×s, consider the mapping x ↦ RFx. Since
w = Fx ∈ Rs, the mapping w ↦ Rw ∈ Rq maps from a lower dimensional
space to a higher dimensional space, and thus the range
Range(RF×·) = {RFx : x ∈ Rq} ⊂ Rq.
In fact, since
Range(F×·) = {Fx : x ∈ Rq} ⊂ Rs,
it must be true that Range(RF×·) is an s-dimensional linear subspace of
Rq.
Since Range(RF×·) is an s-dimensional subspace of Rq, it has an
orthonormal basis, which we can write as a matrix O ∈Oq,s.
Thus, each xˆ ∈ Range(RF×·) can be written as xˆ =Ow, where w ∈ Rs.
21 / 110
Principal component analysis
For each w ∈ Rs and x ∈ Rq, we can write
∥x − Ow∥₂² = ∥x∥₂² − 2w⊤O⊤x + w⊤O⊤Ow = ∥x∥₂² + ∥w∥₂² − 2w⊤O⊤x,
since O⊤O = Is.
We minimize this quadratic (with respect to w) by solving the first order
system:
∂/∂w Obj. = 2w − 2O⊤x = 0, implying w = O⊤x.
This implies
x̂ = Ow = OO⊤x ∈ Range(RF×·) minimizes ∥x − x̂∥₂².
More specifically,
OO⊤x = argmin_{x̂∈Range(RF×·)} ∥x − x̂∥₂².
22 / 110
Principal component analysis
The previous optimality result is satisfied if we substitute x by X i, for each
i ∈ [n], so we have
OO⊤X i = argmin_{x̂∈Range(RF×·)} ∥X i − x̂∥₂², implying that
∥X i − OO⊤X i∥₂² ≤ ∥X i − RFX i∥₂², since RFX i ∈ Range(RF×·).
Summing over i ∈ [n], we get
∑_{i=1}^n ∥X i − OO⊤X i∥₂² ≤ ∑_{i=1}^n ∥X i − RFX i∥₂².
This is true for any F and R, and OO⊤ is itself of the form RF (with R = O and F = O⊤), so
min_{O∈Oq,s} ∑_{i=1}^n ∥X i − OO⊤X i∥₂² = min_{F∈Rs×q, R∈Rq×s} ∑_{i=1}^n ∥X i − RFX i∥₂².
23 / 110
Principal component analysis
We apply the Stiefel manifold representation and seek a solution
Ô = argmin_{O∈Oq,s} ∑_{i=1}^n ∥X i − OO⊤X i∥₂².
Expanding the summands, we get
∥X i − OO⊤X i∥₂² = ∥X i∥₂² − 2X i⊤OO⊤X i + X i⊤O(O⊤O)O⊤X i   (O⊤O = Is)
= ∥X i∥₂² − 2X i⊤OO⊤X i + X i⊤OO⊤X i
= ∥X i∥₂² − X i⊤OO⊤X i   (a scalar equals its trace)
= ∥X i∥₂² − trace(X i⊤OO⊤X i)
= ∥X i∥₂² − trace(O⊤X iX i⊤O).
24 / 110
Principal component analysis
Then, we can write
∑_{i=1}^n ∥X i − OO⊤X i∥₂² = ∑_{i=1}^n ∥X i∥₂² − ∑_{i=1}^n trace(O⊤X iX i⊤O)
= ∑_{i=1}^n ∥X i∥₂² − trace(O⊤{∑_{i=1}^n X iX i⊤}O),
since the trace is a linear operator.
We can rewrite the optimization problem again, as:
Ô = argmin_{O∈Oq,s} −trace(O⊤{∑_{i=1}^n X iX i⊤}O) = argmax_{O∈Oq,s} trace(O⊤{∑_{i=1}^n X iX i⊤}O).
25 / 110
Principal component analysis
Notice that the Grammian matrix
G = ∑_{i=1}^n X iX i⊤
is symmetric and positive semidefinite and admits the spectral
decomposition:
G= VDV⊤,
where, D ∈ Rq×q is a diagonal matrix and V ∈ Rq×q satisfies:
V⊤V= VV⊤ = Iq.
The diagonal elements of D are the eigenvalues of G and the columns of V
are the eigenvectors of G.
Without loss of generality, we can order the elements of D and columns of
V so that:
D11 ≥ D22 ≥ ·· · ≥ Dqq ≥ 0.
26 / 110
Principal component analysis
Spectral solution for PCA
Let X 1, . . . , X n ∈ Rq be arbitrary vectors with Grammian G = ∑_{i=1}^n X iX i⊤. The
columns oj (j ∈ [s]) of the solution
Ô = argmax_{O∈Oq,s} trace(O⊤{∑_{i=1}^n X iX i⊤}O)
are exactly the eigenvectors corresponding to the largest s eigenvalues of G.
Let VDV⊤ be a spectral decomposition of G and let O ∈ Oq,s be an
orthonormal matrix.
Write U = V⊤O, so that VU = VV⊤O = O, and thus
O⊤GO = (U⊤V⊤)(VDV⊤)(VU) = U⊤DU,
since V⊤V = Iq.
27 / 110
Principal component analysis
Write Ui j for the ith row and jth column element of U, i.e., [U]i j = Ui j.
For each i ∈ [q], we can write the ith diagonal element of UU⊤ as:
[UU⊤]ii = ∑_{j=1}^s Ui j².
By substitution, we have
trace(O⊤GO) = trace(U⊤DU) = ∑_{i=1}^q Dii ∑_{j=1}^s Ui j².
Notice that U⊤U = O⊤VV⊤O = O⊤O = Is.
So U ∈ Oq,s is orthonormal with jth column uj, and hence:
∑_{i=1}^q ∑_{j=1}^s Ui j² = ∑_{j=1}^s ∥uj∥₂² = ∑_{j=1}^s 1 = s.
28 / 110
Principal component analysis
Let Ũ ∈ Rq×q be a matrix whose first s columns are equal to the
columns of U, and where Ũ ∈ Oq,q is orthonormal.
Notice that Ũ⊤Ũ = Iq = ŨŨ⊤ and so Ũ⊤ ∈ Oq,q, and thus:
∑_{j=1}^q Ũi j² = 1, where [Ũ]i j = Ũi j.
This implies that ∑_{j=1}^s Ui j² ≤ 1, for each i ∈ [q].
Together with ∑_{i=1}^q ∑_{j=1}^s Ui j² = s, we have
trace(O⊤GO) = ∑_{i=1}^q Dii ∑_{j=1}^s Ui j² ≤ max_{u∈[0,1]^q : ∑_{i=1}^q ui ≤ s} ∑_{i=1}^q Dii ui.
It is easy to see that the RHS is maximized when ui = 1 for i ∈ [s], yielding the
upper bound
trace(O⊤GO) ≤ ∑_{i=1}^s Dii.
29 / 110
Principal component analysis
Let V have ith column vi for i ∈ [q], and suppose that we set O = Ô, where
Ô = [v1 v2 ⋯ vs].
We observe that Ô⊤V is the s×q matrix whose (i, j)th entry is vi⊤vj, so that,
by the orthonormality of the columns of V,
Ô⊤V = [Is 0s×(q−s)].
Then,
trace(Ô⊤GÔ) = trace(Ô⊤VDV⊤Ô) = trace([Is 0s×(q−s)] D [Is 0s×(q−s)]⊤) = ∑_{i=1}^s Dii.
Thus, trace(Ô⊤GÔ) = ∑_{i=1}^s Dii attains the upper bound, and therefore:
Ô = argmax_{O∈Oq,s} trace(O⊤GO).
30 / 110
Principal component analysis
Often, data are centered before conducting PCA. That is, first compute
X̃ i = X i − X̄ n, for each i ∈ [n], where X̄ n = (1/n) ∑_{i=1}^n X i.
We then notice that
(1/n) G̃ = (1/n) ∑_{i=1}^n X̃ iX̃ i⊤ = (1/n) ∑_{i=1}^n (X i − X̄ n)(X i − X̄ n)⊤
is the (biased) sample covariance matrix.
Solving the PCA problem:
Ô = argmax_{O∈Oq,s} trace(O⊤{(1/n) G̃}O) = argmax_{O∈Oq,s} trace(O⊤{(1/n) ∑_{i=1}^n X̃ iX̃ i⊤}O)
can be interpreted as maximizing the variance retained by the projection (the
trace of the covariance matrix of the projected data O⊤X̃ i).
Further, we can interpret the spectral decomposition of G̃:
G̃ = ṼD̃Ṽ⊤ = ∑_{i=1}^q D̃ii ṽiṽi⊤
as apportioning the covariance matrix among the eigenvector mappings.
That is, the proportion of total variance explained by the ith largest
eigenvector mapping is
D̃ii / trace(D̃).
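As an illustrative sketch (not from the slides), the following Python code, assuming numpy, computes the PCA solution from the spectral decomposition of the sample covariance matrix; the function name pca and the simulated data are mine.

```python
# Sketch (not from the slides): PCA on centred data via the spectral
# decomposition of the sample covariance; X is an n x q data matrix.
import numpy as np

def pca(X, s):
    Xc = X - X.mean(axis=0)                      # centre: X_i - Xbar_n
    G = Xc.T @ Xc / X.shape[0]                   # (1/n) * Grammian = sample covariance
    evals, evecs = np.linalg.eigh(G)             # ascending eigenvalues
    order = np.argsort(evals)[::-1]              # sort descending
    D, V = evals[order], evecs[:, order]
    O = V[:, :s]                                 # O-hat: top-s eigenvectors
    W = Xc @ O                                   # forward map W_i = O^T (X_i - Xbar_n)
    explained = D[:s] / D.sum()                  # proportion of variance explained
    return O, W, explained

rng = np.random.default_rng(2)
X = rng.standard_normal((63, 20)) @ rng.standard_normal((20, 200))  # low-rank-ish data
O_hat, W, explained = pca(X, s=2)
print(W.shape, explained)
```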
31 / 110
Probabilistic PCA
Suppose that W ∼ N(0, Is) is a latent random variable and E ∼ N(0, σ²Iq)
is an error term, and that we observe only
X = µ + RW + E ∈ Rq, W ∈ Rs.
Here σ² > 0, µ ∈ Rq, and R ∈ Rq×s are unknown.
Using IID data Xn = (X 1, . . . , X n), where each X i is distributed as X, we wish
to compute estimates of the parameter tuple
θ = (σ², µ, R).
For any x ∈ Rq, we also wish to produce an estimator of W of the form:
Ŵ = E[W | X = x].
This setup is referred to as probabilistic PCA and is due to Tipping and
Bishop (1999).
33 / 110
Probabilistic PCA
Figure: Schematic of data generating and estimation process of probabilistic PCA.
34 / 110
Probabilistic PCA
Since X = µ + RW + E and W ∼ N(0, Is),
X ∼ N(µ, RR⊤ + σ²Iq).
The PDF of X can be written as:
φ(x; µ, RR⊤+σ²Iq) = (2π)^{−q/2} |RR⊤+σ²Iq|^{−1/2} exp{−(1/2)(x−µ)⊤{RR⊤+σ²Iq}^{−1}(x−µ)}.
This implies that the log-likelihood with respect to Xn can be written as:
log Ln(θ) = ∑_{i=1}^n log φ(X i; µ, RR⊤+σ²Iq)
= −(nq/2) log(2π) − (n/2) log|RR⊤+σ²Iq| − (n/2) trace([RR⊤+σ²Iq]^{−1} S(µ)),
where S(µ) = n^{−1} ∑_{i=1}^n (X i − µ)(X i − µ)⊤.
35 / 110
Probabilistic PCA
First notice that since X is normal with mean µ, the maximum likelihood
estimator (MLE) of µ is the sample mean X̄ n.
We can profile the likelihood (substitute in the solution with respect to µ) to
obtain the log-likelihood with respect to R and σ²: log Ln(R, σ²), which
has the same form as log Ln(θ), with X̄ n in place of µ.
Let S(X̄ n) have spectral decomposition VDV⊤, where D is diagonal,
Dii ≥ 0 is the ith largest eigenvalue, and the ith column of V, vi, is the
normalized eigenvector corresponding to Dii.
The maximizers of log Ln(R, σ²) are:
σ̂² = (1/(q−s)) ∑_{i=s+1}^q Dii,  R̂ = Vs (Ds − σ̂²Is)^{1/2} Q,
where Vs contains the first s columns of V, Ds contains the first s rows
and columns of D, and Q ∈ Rs×s is an arbitrary rotation matrix in the
sense that:
Q⊤ = Q^{−1}, |Q| = 1.
Thus, there are infinitely many MLEs.
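A minimal sketch (not from the slides; numpy assumed) of these MLEs, taking the rotation Q = Is, together with the posterior-mean estimate of W given on a later slide; the function names and simulated data are mine.

```python
# Sketch (not from the slides): Tipping-Bishop MLEs for probabilistic PCA
# with Q = I_s; X is an n x q data matrix.
import numpy as np

def ppca_mle(X, s):
    n, q = X.shape
    mu = X.mean(axis=0)                              # MLE of mu is the sample mean
    S = (X - mu).T @ (X - mu) / n                    # S(Xbar_n)
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    D, V = evals[order], evecs[:, order]
    sigma2 = D[s:].mean()                            # (1/(q-s)) * sum of discarded eigenvalues
    R = V[:, :s] @ np.diag(np.sqrt(D[:s] - sigma2))  # R-hat = V_s (D_s - sigma2 I_s)^(1/2)
    return mu, sigma2, R

def posterior_mean(x, mu, sigma2, R):
    # E-hat[W | X = x] = (R^T R + sigma2 I_s)^(-1) R^T (x - mu), as on a later slide
    s = R.shape[1]
    return np.linalg.solve(R.T @ R + sigma2 * np.eye(s), R.T @ (x - mu))

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50)) + 0.5 * rng.standard_normal((200, 50))
mu, sigma2, R = ppca_mle(X, s=10)
print(sigma2, posterior_mean(X[0], mu, sigma2, R).shape)
```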
36 / 110
Probabilistic PCA
Closure of normal distributions under conditioning
Assuming compatibility of matrices, for
[X; Y] ∼ N([µX; µY], [ΣXX ΣXY; ΣYX ΣYY])
and fixed x, we have [Y | X = x] ∼ N(µY|X(x), ΣY|X), where
µY|X(x) = µY + ΣYX ΣXX^{−1} (x − µX),
ΣY|X = ΣYY − ΣYX ΣXX^{−1} ΣXY.
37 / 110
Probabilistic PCA
Notice that we can write:
[X; W] ∼ N([µ; 0], [RR⊤+σ²Iq, R; R⊤, Is]), since
cov(W, X) = E(W[X − µ]⊤) = E[WX⊤] − E[W]µ⊤ = E[WW⊤R⊤] = E[WW⊤] R⊤ = R⊤,
using E(W) = 0 and cov(W) = E[WW⊤] = Is.
Using the conditional closure, we get
E[W | X = x] = µW + ΣWX ΣXX^{−1} (x − µX)
= R⊤ (RR⊤ + σ²Iq)^{−1} (x − µ) = (R⊤R + σ²Is)^{−1} R⊤ (x − µ),
where the last equality is via the push-through identity.
We can estimate E[W | X = x] by
Ê[W | X = x] = (R̂⊤R̂ + σ̂²Is)^{−1} R̂⊤ (x − X̄ n).
38 / 110
Probabilistic PCA
An additional advantage of working with probabilistic PCA is that it admits
the use of model selection methods for probability models.
Let ŝ be the choice of the dimension for the space W.
We can consider, for each s, an MLE θ̂ s of the corresponding likelihood
Lsn(θ).
The Bayesian information criterion (BIC) suggests that we should choose
ŝ = argmin_{s∈N} −2 log Lsn(θ̂ s) + {q(s+1)+1} log n.
Here, q(s+1)+1 is the total dimensionality of (σ², µ s, Rs).
We are not limited to the BIC, but can use any measure of model fitness.
39 / 110
Nonlinear dimensionality reduction
Suppose that we can write the forward mapping F : Rq → Rs as
x ↦ (F1(x), F2(x), . . . , Fs(x)),
where Fk : Rq → R, for each k ∈ [s].
Similarly, suppose we can write the reverse mapping R : Rs → Rq as
w ↦ (R1(w), R2(w), . . . , Rq(w)),
where Rj : Rs → R, for each j ∈ [q].
For example, we can do this with our PCA mappings F (x) = Fx and
R (w) = Rw to get:
Fk (x) = fkx, where fk is the kth row of F.
R j (w) = r jw, where r j is the jth row of R.
When Fk(x) = a(fk x + ck) and Rj(w) = a(rj w + dj), where a : R → R,
c = (c1, . . . , cs) ∈ Rs and d = (d1, . . . , dq) ∈ Rq, we call the forward and
reverse maps a 3-layer autoencoder with activation function a.
Typically a is a nonlinear function, such as the ReLU function
a(·) = max{0, ·} = [·]+ or the logistic function a(·) = (1 + exp(−·))^{−1}.
We can identify that PCA is a 3-layer autoencoder with a linear activation
function.
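A small sketch (not from the slides; numpy assumed) of the forward and reverse maps of a 3-layer autoencoder with ReLU activation; the weights here are random placeholders, whereas in practice they would be fitted to minimize the reconstruction divergence.

```python
# Sketch (not from the slides): forward and reverse maps of a 3-layer
# autoencoder with ReLU activation; weights are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(4)
q, s = 50, 3
F, c = rng.standard_normal((s, q)), rng.standard_normal(s)   # forward weights and offsets
R, d = rng.standard_normal((q, s)), rng.standard_normal(q)   # reverse weights and offsets
relu = lambda z: np.maximum(0.0, z)                          # activation a(.) = [.]_+

x = rng.standard_normal(q)
w = relu(F @ x + c)          # F(x): encode into R^s
x_hat = relu(R @ w + d)      # R(w): decode back into R^q
print(w.shape, x_hat.shape)
```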
40 / 110
Nonlinear dimensionality reduction
Figure: Schematic of 3-layer autoencoders.
41 / 110
Two-sample problems
One of the most venerable types of problems in statistics is the two-sample
problem:
Suppose that we observe two data sets Xn = (X 1, . . . , X n) and
Ym = (Y 1, . . . , Y m), where both Xn and Ym are IID samples that are
independent from one another, and X 1, Y 1 ∈ X ⊆ Rq.
Let fX(x) and fY(y) be the PDFs of X 1 and Y 1, respectively. Here,
fX, fY ∈ F, where
F = {f : X → R≥0 : ∫_X f(x) dx = 1}
is the class of probability density functions on X.
Let P : F → P, f ↦ P(f), be some property of the PDF f, taking values in the space P.
We wish to test the null and alternative hypotheses:
H0 : P(fX) = P(fY), versus H1 : P(fX) ≠ P(fY).
Some examples of properties P are the (multivariate) median and mean
(k = 1, 2, respectively):
P(f) = argmin_{µ∈X} ∫_X ∥x − µ∥_k^k f(x) dx.
42 / 110
Two-sample goodness-of-fit
If we set P ( f ) = f ∈ F, we obtain the setup of the two-sample
goodness-of-fit (2-GoF) test:
H0 : fX = fY , versus H1 : fX ̸= fY .
In the one-dimensional setting (X = R), the most famous 2-GoF test is the
two-sample Kolmogorov–Smirnov test, defined by the test statistic:
T(Xn, Ym) = ∥F(·; Xn) − F(·; Ym)∥∞ = sup_{x∈R} |F(x; Xn) − F(x; Ym)|.
Here, F(·; Xn) and F(·; Ym) are the empirical cumulative distribution
functions (ECDFs):
F(x; Xn) = ∑_{i=1}^n ⟦Xi ≤ x⟧/n,  F(y; Ym) = ∑_{i=1}^m ⟦Yi ≤ y⟧/m.
The famous Glivenko–Cantelli Theorem states that
∥F(·; Xn) − FX∥∞ → 0, almost surely, as n → ∞,
where FX(x) = ∫_{−∞}^x fX(t) dt is the cumulative distribution function
(CDF) of the random variable X.
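A hedged sketch (not from the slides) of the two-sample Kolmogorov–Smirnov statistic computed directly from the ECDFs, assuming numpy and scipy are available; scipy.stats.ks_2samp is used only as a cross-check.

```python
# Sketch (not from the slides): two-sample KS statistic from the ECDFs,
# evaluated at the pooled sample points, where the supremum is attained.
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(x, y):
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)   # ECDF of x on the grid
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)   # ECDF of y on the grid
    return np.max(np.abs(Fx - Fy))

rng = np.random.default_rng(5)
x = rng.standard_normal(300)
y = rng.standard_normal(400) + 0.5
print(ks_statistic(x, y), ks_2samp(x, y).statistic)
```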
43 / 110
Two-sample goodness-of-fit
Using the reverse triangle inequality, and the triangle inequality, we have:
|∥F(·; Xn) − F(·; Ym)∥∞ − ∥FX − FY∥∞|
≤ ∥F(·; Xn) − F(·; Ym) − (FX − FY)∥∞   (reverse triangle inequality)
≤ ∥F(·; Xn) − FX∥∞ + ∥F(·; Ym) − FY∥∞   (triangle inequality).
The Glivenko–Cantelli Theorem then implies that
∥F(·; Xn) − FX∥∞ + ∥F(·; Ym) − FY∥∞ → 0, almost surely, as n, m → ∞,
and thus
|T(Xn, Ym) − ∥FX − FY∥∞| → 0, almost surely, as n, m → ∞.
∥FX − FY∥∞ ≥ 0 is the uniform (or Kolmogorov) distance between the CDFs
FX and FY, which is equal to zero if and only if fX = fY (almost everywhere).
This implies that the test statistic is consistent in the sense that
T(Xn, Ym) → 0, almost surely, as n, m → ∞, if and only if the null hypothesis
H0 : fX = fY is true.
44 / 110
Completely monotone functions
We say that a function g : R≥0 → R≥0 is completely monotone if:
1. g is continuous on R≥0.
2. g is infinitely differentiable on R>0.
3. For each k ∈ N ∪ {0},
(−1)^k g^(k)(t) ≥ 0, for t ∈ R>0,
where g^(k) : R>0 → R is the kth derivative of g.
An example of a completely monotone function is g(t) = exp{−αt}, α > 0:
g^(k)(t) = (−1)^k α^k exp{−αt}, so (−1)^k g^(k)(t) = α^k exp{−αt} > 0, for each k.
Another example is g(t) = (t+α)^β, β ≤ 0 < α:
g^(k)(t) = β(β−1)⋯(β−k+1)(t+α)^{β−k}, so for each k:
(−1)^k g^(k)(t) = |β(β−1)⋯(β−k+1)|(t+α)^{β−k} ≥ 0.
45 / 110
Strongly positive definite kernels
For X = Rq, define the kernel κg : X × X → R≥0, (x,y) ↦ κg(x,y), where
κg(x,y) = g(∥x − y∥₂²).
If g is a completely monotone function that is not constant, then, by Cheney
and Light (2000, Ch. 18, Lem. 3), κg is a strongly positive definite kernel
in the sense that
∫_X ∫_X κg(x,y) h(x) h(y) dx dy ≥ 0,
for every measurable h : Rq → R, and
∫_X ∫_X κg(x,y) h(x) h(y) dx dy = 0
implies that h = 0.
Notice that when g(t) = exp{−αt}, we obtain the Gaussian kernel
κg(x,y) = exp{−α∥x − y∥₂²}
that we considered for nonlinear SVM.
46 / 110
Strongly positive definite kernels
Assume that g is completely monotone and set h = fX − fY; then:
∫_X ∫_X κg(x,y) h(x) h(y) dx dy
= ∫_X ∫_X κg(x,y) {fX(x) − fY(x)}{fX(y) − fY(y)} dx dy
= ∫_X ∫_X κg(x,y) fX(x) fX(y) dx dy + ∫_X ∫_X κg(x,y) fY(x) fY(y) dx dy
− ∫_X ∫_X κg(x,y) fY(x) fX(y) dx dy − ∫_X ∫_X κg(x,y) fX(x) fY(y) dx dy
(the last two terms together equal 2 ∫_X ∫_X κg(x,y) fX(x) fY(y) dx dy)
= EX EX′ κg(X, X′) + EY EY′ κg(Y, Y′) − 2 EX EY κg(X, Y) ≥ 0,
where X, X′ are IID with density fX and Y, Y′ are IID with density fY.
We will write
M(fX, fY) = EX EX′ κg(X, X′) + EY EY′ κg(Y, Y′) − 2 EX EY κg(X, Y)
and note that it is a semimetric on F in the sense that:
M(fX, fY) = M(fY, fX) ≥ 0 for all fX, fY ∈ F, and M(fX, fY) = 0 if and
only if fX = fY.
47 / 110
Maximum mean discrepancy
Gretton et al. (2012) call M^{1/2} the maximum mean discrepancy
(MMD), given kernel κg, and prove that
M^{1/2}(fX, fY) = √(EX EX′ κg(X, X′) + EY EY′ κg(Y, Y′) − 2 EX EY κg(X, Y))
is a metric in the sense that, along with being a semimetric, it also
satisfies the triangle inequality:
M^{1/2}(fX, fY) + M^{1/2}(fY, fZ) ≥ M^{1/2}(fX, fZ), for fX, fY, fZ ∈ F.
By Gretton et al. (2012, Lem. 4), the name MMD comes from the fact
that
M^{1/2}(fX, fY) = sup_{η∈{h∈H : ∥h∥H ≤ 1}} {EX η(X) − EY η(Y)},
where H is the Hilbert space defined by the kernel κg and ∥·∥H is its norm.
48 / 110
U-statistics
Let z1, . . . , zr ∈ X and define an rth order “U-kernel” υ : X^r → R,
(z1, . . . , zr) ↦ υ(z1, . . . , zr), where we require that υ be symmetric in the
sense that for every permutation ϖ of [r], we have
υ(z1, . . . , zr) = υ(zϖ(1), . . . , zϖ(r)).
Let
Kn,r = {χ = (j1, . . . , jr) : 1 ≤ j1 < ⋯ < jr ≤ n}
be the set of combinations of r distinct elements from [n], where
|Kn,r| = (n choose r).
For an IID sample Zn = (Z1, . . . , Zn), where Z1 ∈ X has density fZ, we define
the U-statistic corresponding to U-kernel υ as
Uυ(Zn) = (n choose r)^{−1} ∑_{χ∈Kn,r} υ(Zχ(1), . . . , Zχ(r)), where
E Uυ(Zn) = (n choose r)^{−1} ∑_{χ∈Kn,r} E υ(Zχ(1), . . . , Zχ(r)) = (n choose r)^{−1} (n choose r) E υ(Z1, . . . , Zr)
= E υ(Z1, . . . , Zr) = ∫_X ⋯ ∫_X υ(z1, . . . , zr) ∏_{j=1}^r fZ(zj) dz1 ⋯ dzr.
49 / 110
U-statistics
For X = R, we can consider the trivial 1st order U-kernel υ(z1) = z1,
where
Uυ(Zn) = (n choose 1)^{−1} ∑_{χ∈Kn,1} υ(Zχ(1)) = (1/n) ∑_{i=1}^n Zi = Z̄n.
A good example of a 2nd order U-kernel is υ(z1, z2) = (z1 − z2)²/2, where
Uυ(Zn) = (n choose 2)^{−1} ∑_{χ∈Kn,2} υ(Zχ(1), Zχ(2))
= [2/(n(n−1))] (1/2) ∑_{χ∈Kn,2} (Zχ(1) − Zχ(2))²
= [1/(n(n−1))] ∑_{i=1}^n ∑_{j>i} (Zi − Zj)²
= [1/(n(n−1))] (1/2) ∑_{i=1}^n ∑_{j=1}^n (Zi − Zj)²
= [1/(2n(n−1))] ∑_{i=1}^n ∑_{j=1}^n ({Zi − Z̄n} − {Zj − Z̄n})²
= [1/(2n(n−1))] ∑_{i=1}^n ∑_{j=1}^n ({Zi − Z̄n}² + {Zj − Z̄n}²)   (the cross terms sum to zero)
= [1/(2n(n−1))] 2n ∑_{i=1}^n {Zi − Z̄n}² = [1/(n−1)] ∑_{i=1}^n {Zi − Z̄n}²,
which is the usual (unbiased) sample variance.
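A quick numerical check (not from the slides; numpy assumed) that this U-statistic coincides with the unbiased sample variance:

```python
# Sketch (not from the slides): the 2nd-order U-statistic with kernel
# (z1 - z2)^2 / 2 equals the unbiased sample variance.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
Z = rng.standard_normal(30)

U = np.mean([(Z[i] - Z[j]) ** 2 / 2 for i, j in combinations(range(len(Z)), 2)])
print(U, np.var(Z, ddof=1))   # the two agree
```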
50 / 110
U-statistics
Suppose that samples Xn and Ym have the same sample size; i.e., m= n.
Then, since Xn and Yn are independent IID samples, we can construct the
IID sample Zn = (Z1, . . . ,Zn), where Z i = (X i,Y i), for i ∈ [n].
Consider the U-kernel:
υ(z1, z2) = κg(x1, x2) + κg(y1, y2) − κg(x1, y2) − κg(x2, y1), where
E υ((X, Y), (X′, Y′)) = EX EX′ κg(X, X′) + EY EY′ κg(Y, Y′) − 2 EX EY κg(X, Y) = M(fX, fY).
The associated U-statistic is:
Uυ(Zn) = [2/(n(n−1))] ∑_{i=1}^n ∑_{j>i} κg(X i, X j) + [2/(n(n−1))] ∑_{i=1}^n ∑_{j>i} κg(Y i, Y j)
− [2/(n(n−1))] ∑_{i=1}^n ∑_{j>i} κg(X i, Y j) − [2/(n(n−1))] ∑_{i=1}^n ∑_{j>i} κg(X j, Y i)
= [1/(n(n−1))] ∑_{i=1}^n ∑_{j≠i} κg(X i, X j) + [1/(n(n−1))] ∑_{i=1}^n ∑_{j≠i} κg(Y i, Y j)
− [2/(n(n−1))] ∑_{i=1}^n ∑_{j≠i} κg(X i, Y j).
51 / 110
Hoeffding’s inequality for U-statistics
The following result comes from Hoeffding (1963) (in the same paper as
his bounded random variable inequality).
Hoeffding’s inequality for U-statistics
Let Zn = (Z1, . . . , Zn) be an IID sample and let Uυ(Zn) be a U-statistic with
rth order symmetric U-kernel υ, such that υ(z1, . . . , zr) ∈ [a,b]. Then, for n ≥ r:
P(Uυ(Zn) − E υ(Z1, . . . , Zr) ≥ ε) ≤ exp{−2⌊n/r⌋ε²/(b−a)²}.
Let κg(x, y) ∈ [0, b]. Then,
υ(z1, z2) = κg(x1, x2) + κg(y1, y2) − κg(x1, y2) − κg(x2, y1) ∈ [−2b, 2b]
and we have
P(Uυ(Zn) − M(fX, fY) ≥ ε) ≤ exp{−⌊n/2⌋ε²/(8b²)}.
52 / 110
A two-sample GoF test
Let Xn = (X 1, . . . ,X n) and Yn = (Y 1, . . . ,Y n) be two independent IID
samples, where X 1,Y 1 ∈ X and X 1,Y 1 have PDFs fX and fY , respectively.
We wish to test the hypotheses:
H0 : fX = fY , versus H1 : fX ̸= fY ,
using the U-statistic Uυ(Zn) = Uυ(Xn, Yn) with 2nd order U-kernel defined by κg,
where g(t) = exp{−αt}, and thus
κg(x, y) = exp{−α∥x − y∥₂²}.
Notice that κg(x, y) is maximized when x = y, and thus is bounded from
above by exp(0) = 1. Further, since κg is a Gaussian function, it is
bounded from below by 0.
Under the null hypothesis, fX = fY, which implies that M(fX, fY) = 0,
which implies, by Hoeffding’s inequality:
P(Uυ(Zn) ≥ ε) ≤ exp{−⌊n/2⌋ε²/8}.
53 / 110
A two-sample GoF test
In one approach to hypothesis testing, we set a level of significance (or
size) α ∈ (0,1) and construct a test statistic
T (Zn) =T (Xn,Yn) ,
and a system of critical values C (α,n) and say that we reject the null
hypothesis H0 if
T (Xn,Yn)≥ C (α,n) .
We wish to define the critical values C (α,n) so that under the null
hypothesis (characterized by the probability symbol P0), the probability of
rejection is controlled below α (i.e. the Type I error), in the sense that:
P0 (T (Xn,Yn)≥ C (α,n))≤ α.
54 / 110
A two-sample GoF test
For our hypothesis test, we can define the test statistic by
T(Xn, Yn) = Uυ(Zn), which implies that
P0(T(Xn, Yn) ≥ ε) ≤ exp{−⌊n/2⌋ε²/8}.
We wish to make the RHS of the inequality equal to α, thus we want to
solve:
α = exp{−⌊n/2⌋ε²/8} ⟹ log α = −⌊n/2⌋ε²/8 ⟹ ε = 2√((2/⌊n/2⌋) log(1/α)),
which implies
P0(T(Xn, Yn) ≥ 2√((2/⌊n/2⌋) log(1/α))) ≤ α.
This implies that we can use the system of critical values
C(α, n) = 2√((2/⌊n/2⌋) log(1/α))
to ensure that the Type I error of the test is controlled at size α.
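A minimal sketch (not from the slides) of the resulting test with the Gaussian kernel (so b = 1), assuming numpy; mmd_test is my own function name, alpha below denotes the test size, and bw plays the role of the kernel parameter α in the slides.

```python
# Sketch (not from the slides): kernel two-sample U-statistic with Gaussian
# kernel and the Hoeffding-based critical value 2*sqrt(2*log(1/alpha)/floor(n/2)).
import numpy as np

def gauss_kernel(A, B, bw=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-bw * d2)

def mmd_test(X, Y, alpha=0.05, bw=1.0):
    n = X.shape[0]                      # equal sample sizes assumed, as on the slide
    Kxx, Kyy, Kxy = gauss_kernel(X, X, bw), gauss_kernel(Y, Y, bw), gauss_kernel(X, Y, bw)
    off = 1.0 / (n * (n - 1))
    U = off * (Kxx.sum() - np.trace(Kxx)) + off * (Kyy.sum() - np.trace(Kyy)) \
        - 2 * off * (Kxy.sum() - np.trace(Kxy))
    crit = 2 * np.sqrt(2 * np.log(1 / alpha) / (n // 2))
    return U, crit, U >= crit           # reject H0 if U >= critical value

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal((200, 5)) + 1.0
print(mmd_test(X, Y))
```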
Different sample sizes
Let Xn = (X 1, . . . ,X n) and Ym = (Y 1, . . . ,Y m) be independent IID samples
with potentially different sample sizes n and m.
In the context of our problem, the κg-defined two-sample U-statistic
U(Xn, Ym) = [1/(n(n−1))] ∑_{i=1}^n ∑_{j≠i} κg(X i, X j) + [1/(m(m−1))] ∑_{i=1}^m ∑_{j≠i} κg(Y i, Y j)
− [2/(nm)] ∑_{i=1}^n ∑_{j=1}^m κg(X i, Y j)
is an unbiased estimator of M(fX, fY) (cf. Gretton et al. 2012, Lem. 6).
In Hoeffding (1963), a two-sample version of the U-statistic bound is also
provided, and yields the inequality:
P(U(Xn, Ym) − M(fX, fY) ≥ ε) ≤ exp{−min{⌊n/2⌋, ⌊m/2⌋}ε²/(8b²)},
if κg ∈ [0, b].
56 / 110
Different sample sizes
Using g(t) = exp{−αt} again, under the null hypothesis H0 : fX = fY, we have
P0(T(Xn, Ym) ≥ ε) ≤ exp{−min{⌊n/2⌋, ⌊m/2⌋}ε²/8},
where T(Xn, Ym) = U(Xn, Ym) is our test statistic.
Setting the RHS to α, we obtain
ε = 2√((2/min{⌊n/2⌋, ⌊m/2⌋}) log(1/α)).
Thus, rejecting H0 under the rule that
T(Xn, Ym) ≥ C(α, n, m) = 2√((2/min{⌊n/2⌋, ⌊m/2⌋}) log(1/α))
correctly controls the Type I error at size α ∈ (0,1).
57 / 110
Hypothesis testing
We now consider a setting where we observe random data Xn = (X 1, . . . , X n),
whose observations are not necessarily independent or identically
distributed.
We believe that these distributions arise from some probability model,
characterized by θ ∈ T⊆ Rp, where potentially
p≫ n.
We want to test m ∈ N hypotheses using data Xn, about θ , where
potentially
m≫ n.
E.g., if the PDF of Xn is characterized by some fθ , then we may wish to
test hypotheses about properties P j ( fθ ), where j ∈ [m].
We will label each null hypothesis we test as H0 j, for j ∈ [m], and write
h j = 0 if H0 j is true, and h j = 1 if H0 j is false.
58 / 110
p-values
For each hypothesis H0 j, we observe a p-value, defined as a random
variable (a function of Xn), Pj(Xn) = Pj, with the sub-uniformity property
that
P0 j(Pj ≤ α) ≤ α, for α ∈ [0,1],
where P0 j can be interpreted as the probability of an event under the
condition that H0 j is true.
More technically, if we define each hypothesis test as being:
H0 j : P j(fθ) = ⟦θ ∈ T j⟧ = 1,
for some subset T j ⊆ T, and write the probability under the assumption that
θ = θ′ as Pθ′, then we can write
P0 j(Pj ≤ α) = sup_{θ′∈T j} Pθ′(Pj ≤ α).
We can think about a p-value in terms of a test statistic T j(Xn) = −Pj,
tested using the rejection condition
T j(Xn) ≥ C j(α, n) = −α, so that
P0 j(T j(Xn) ≥ C j(α, n)) = P0 j(−Pj ≥ −α) = P0 j(Pj ≤ α) ≤ α.
59 / 110
Classical hypothesis testing
In the classical setting where m= 1 (i.e., we test a single hypothesis), the
outcome space can be partitioned as follows:
                       Test is not rejected (P1 > α)   Test is rejected (P1 ≤ α)
H01 is true (h1 = 0)   True Negative ✓                 False Positive (Type I)
H01 is false (h1 = 1)  False Negative (Type II)        True Positive ✓
In particular, by setting up our hypothesis to be of size α ∈ (0,1), we are
explicitly controlling the probability of a Type I error at α: i.e.,
P(Type I)≤ P01 (P1 ≤ α)≤ α,
by definition of the p-value.
That is, if we test a null hypothesis H01 that is true, then on average, we
will commit a Type I error and make a False Positive conclusion α×100%
of the time.
Put another way, the expected number of False Positives from
one test is
E01 ⟦P1 ≤ α⟧ = P01(P1 ≤ α) ≤ α,
where E0 j can be considered as the expectation under the null hypothesis
H0 j, for j ∈ [m].
60 / 110
Multiple hypothesis testing
Suppose now that we wish to test all m of our hypotheses using the p-value
rule that rejects H0 j when Pj ≤ α. If all of the hypotheses are indeed true,
then we can compute the expected total number of False Positives as
E[FP] = ∑_{j=1}^m E0 j ⟦Pj ≤ α⟧ ≤ mα,
which grows linearly with m.
In modern genetics and imaging analysis, m ≥ 10^6 is not uncommon (see,
e.g., Nguyen et al., 2014, regarding an MRI experiment).
If we test m = 10^6 true null hypotheses at the conventional
sizes α ∈ {0.005, 0.01, 0.05, 0.1}, then we may end up with as many expected
False Positives as:
α  0.005 0.01 0.05 0.1
FP 5000 10000 50000 100000
61 / 110
The Family-Wise Error Rate
Instead of using a criterion that performs Type I error control for a single
test, we require a criterion that controls the number of False Positives
across all m tests.
The traditional method for controlling the number of false positives is the
notion of Family-wise error rate (FWER) control, which is the
gold-standard in multiple testing in clinical settings.
The FWER for a family of m tests is defined as P(FP≥ 1) and controlling
the FWER at level αFW is defined as satisfying
FWER= P(FP≥ 1)≤ αFW.
That is, we wish to control the probability that we make one or more False
Positives to be no more than αFW. Or alternatively, the probability of not
making any False Positives must be at least 1−αFW.
We can see that FWER= P(Type I), when m= 1, and so can be viewed as
a natural extension.
62 / 110
The Bonferroni method
Let our family of null hypotheses H01, . . . , H0m be tested using p-values
P1, . . . , Pm and let the total number of true null hypotheses be
m0 = m − ∑_{j=1}^m h j.
Let R j be a random variable that indicates whether we reject H0 j:
R j = 0 if H0 j is not rejected, and R j = 1 if H0 j is rejected.
The Bonferroni method for controlling FWER states that we should
reject H0 j if Pj ≤ αFW/m; i.e.,
R j = ⟦Pj ≤ αFW/m⟧.
To show that FWER is correctly controlled, observe that:
FWER = P(⋃_{j: h j=0} {R j = 1}) = P(⋃_{j: h j=0} {Pj ≤ αFW/m})
≤ ∑_{j: h j=0} P0 j(Pj ≤ αFW/m)   (Boole’s inequality)
≤ m0 αFW/m ≤ αFW   (by the definition of a p-value).
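A minimal sketch (not from the slides; numpy assumed) of the Bonferroni rule R j = ⟦Pj ≤ αFW/m⟧; the simulated p-values are mine.

```python
# Sketch (not from the slides): Bonferroni rejection rule for FWER control.
import numpy as np

def bonferroni_reject(pvals, alpha_fw=0.05):
    m = len(pvals)
    return pvals <= alpha_fw / m          # boolean vector of rejections R_j

rng = np.random.default_rng(8)
pvals = np.concatenate([rng.uniform(size=990),           # true nulls
                        rng.uniform(size=10) * 1e-5])     # strong signals
print(bonferroni_reject(pvals).sum())
```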
63 / 110
False Discovery Rates
Consider the outcome matrix again, but now for m tests:
           m − Rej (#Non-rejections)   Rej (#Rejections)
m0         TN (#True Negatives)        FP (#False Positives)
m − m0     FN (#False Negatives)       TP (#True Positives)
Suppose that we wish to control the proportion of False Positives among the
rejections, instead of the absolute number: i.e., the false positive proportion:
FPP = FP/(FP + TP).
This is exactly what was considered by Benjamini and Hochberg (1995)
(one of the most cited mathematics papers ever!), where they propose the idea
of the false discovery rate:
FDR = E{FPP | Rej ≥ 1} = E{FP/(FP + TP) | Rej ≥ 1}.
Here, we say that FDR is controlled at level αFD, if
FDR≤ αFD.
64 / 110
Step-up procedure
Let H01, . . . , H0m be tested using p-values P1, . . . , Pm, again, and suppose that
we order the p-values as
P(1) ≤ P(2) ≤ ⋯ ≤ P(m).
We will use a so-called step-up rejection procedure defined by a
non-decreasing function δ : [m] → R>0, where we reject H0 j via the rule
R j = ⟦Pj ≤ αFD δ(k*)/m⟧, where
k* = argmax_{k∈[m]} {P(k) ≤ αFD δ(k)/m}.
We can show that this procedure controls the FDR in the sense that
FDR ≤ αFD (m0/m) ∑_{j=1}^∞ δ(min{j, m})/(j(j+1)).
65 / 110
Step-up procedure
Notice that k* = ∑_{j=1}^m R j = Rej, and write
FDR = E{FP/(FP + TP) | Rej ≥ 1}
= E[ (∑_{j∈{j: h j=0}} ⟦Pj ≤ αFD δ(k*)/m⟧ / k*) ⟦k* ≥ 1⟧ ]
= ∑_{j∈{j: h j=0}} E{ ⟦Pj ≤ αFD δ(k*)/m⟧ ⟦k* ≥ 1⟧ / k* }.
Using the fact that
∑_{i=1}^∞ 1/(i(i+1)) = 1, we deduce that
1/k* = ∑_{i=k*}^∞ 1/(i(i+1)) = ∑_{i=1}^∞ ⟦i ≥ k*⟧/(i(i+1)).
66 / 110
Step-up procedure
Next, we notice that δ(k*) ≤ δ(min{i, m}) for any i ≥ k*, since δ is
non-decreasing and k* ≤ m.
Then,
FDR = ∑_{j∈{j: h j=0}} ∑_{i=1}^∞ [1/(i(i+1))] E{ ⟦Pj ≤ αFD δ(k*)/m⟧ ⟦i ≥ k*⟧ ⟦k* ≥ 1⟧ }
= ∑_{j∈{j: h j=0}} ∑_{i=1}^∞ [1/(i(i+1))] P({Pj ≤ αFD δ(k*)/m} ∩ {i ≥ k*} ∩ {k* ≥ 1})
≤ ∑_{j∈{j: h j=0}} ∑_{i=1}^∞ [1/(i(i+1))] P(Pj ≤ αFD δ(min{i, m})/m)   (a joint probability is no larger than a marginal probability)
≤ ∑_{j∈{j: h j=0}} ∑_{i=1}^∞ [1/(i(i+1))] αFD δ(min{i, m})/m   (by the definition of a p-value)
= αFD (m0/m) ∑_{i=1}^∞ δ(min{i, m})/(i(i+1)).
67 / 110
The Benjamini–Yekutieli method
Notice that the RHS of
FDR ≤ αFD (m0/m) ∑_{j=1}^∞ δ(min{j, m})/(j(j+1))
can be controlled at size αFD if
∑_{j=1}^∞ δ(min{j, m})/(j(j+1)) ≤ 1.
In Benjamini and Yekutieli (2001), the authors propose to set
δ(k) = k (∑_{j=1}^m 1/j)^{−1}.
Another choice is to simply set δ(k) = (2m)^{−1} k(k+1).
68 / 110
The Benjamini–Hochberg method
In Benjamini and Yekutieli (2001), the authors note that
(∑_{j=1}^m 1/j)^{−1} ∼_{m→∞} 1/log m,
which indicates that for large m:
k* = argmax_{k∈[m]} {P(k) ≤ (αFD k/m)(∑_{j=1}^m 1/j)^{−1}} ≈ argmax_{k∈[m]} {P(k) ≤ αFD k/(m log m)}.
It is proved in Benjamini and Hochberg (1995) that if the p-values
P1, . . . , Pm are independent, then the rejection procedure
R j = ⟦Pj ≤ αFD k*/m⟧, where k* = argmax_{k∈[m]} {P(k) ≤ αFD k/m},
will control FDR at size αFD; i.e., FDR ≤ αFD.
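A hedged sketch (not from the slides; numpy assumed) of the Benjamini–Hochberg step-up rule; the function name and simulated p-values are mine.

```python
# Sketch (not from the slides): Benjamini-Hochberg step-up procedure,
# returning the rejection indicators R_j.
import numpy as np

def benjamini_hochberg(pvals, alpha_fd=0.05):
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha_fd * np.arange(1, m + 1) / m       # alpha_FD * k / m
    below = pvals[order] <= thresholds                    # P_(k) <= alpha_FD * k / m
    if not below.any():
        return np.zeros(m, dtype=bool)
    k_star = np.max(np.where(below)[0]) + 1               # largest such k
    return pvals <= alpha_fd * k_star / m                 # reject if P_j <= alpha_FD * k* / m

rng = np.random.default_rng(9)
pvals = np.concatenate([rng.uniform(size=950), rng.uniform(size=50) * 1e-4])
print(benjamini_hochberg(pvals).sum())
```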
69 / 110
Adjusted p-values
In the literature, multiple testing results are often reported in terms of
adjusted p-values P˜j, instead of p-values Pj, where we reject a null
hypothesis H0 j under the condition that
P˜j ≤ α
to control the FDR (or FWER, depending on context).
In such situations,
P̃j = inf_{α∈[0,1]} {α : R j = 1}.
For example, in the Benjamini–Hochberg method, we can write the
adjusted p-value as
P̃j = inf_{α∈[0,1]} {α : Pj ≤ αk*/m, where k* = argmax_{k∈[m]} {P(k) ≤ αk/m}}.
Another example is the Bonferroni method, for which we can write the
adjusted p-value as
P̃j = inf_{α∈[0,1]} {α : Pj ≤ α/m} = mPj.
70 / 110
Linear regression
Let X 1, . . . ,X n ∈ Rq for some q≫ n be a set of high dimensional
covariates, each associated with a response Yi, for i ∈ [n].
We suppose that the pairs (X i, Yi) satisfy the linear regression
relationship
E[Yi | X i = xi] = α + β⊤xi,
where α ∈ R is known as the intercept and β ∈ Rq is the regression
coefficient vector. Together, the linear regression relationship relies on
the parameter tuple θ = (α, β) ∈ Rp, where p = q + 1 ≫ n.
71 / 110
Linear regression
The estimation of θ is typically conducted by solving the least-squares
(LS) estimation problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²   [= LSn(α,β)]
= argmin_θ (1/(2n)) ∑_{i=1}^n (Yi − θ⊤X̃ i)²   [= LSn(θ)],
where X̃ i⊤ = (1, X i⊤), for each i ∈ [n].
Since LSn is convex and differentiable in θ, we can solve the LS problem
by solving the first-order condition:
∂LSn/∂θ = −(1/n) ∑_{i=1}^n X̃ i(Yi − X̃ i⊤θ) = 0
⟹ ∑_{i=1}^n X̃ iYi = {∑_{i=1}^n X̃ iX̃ i⊤} θ
⟹ (?) θ̂ = {∑_{i=1}^n X̃ iX̃ i⊤}^{−1} ∑_{i=1}^n X̃ iYi.
72 / 110
Matrix ranks
A square matrix A ∈ Rp×p is said to have full rank if the p columns
a1, . . . ,ap of A are linearly independent (and if the p columns of A⊤ are
linearly independent).
A square matrix is invertible if and only if it has full rank.
For a vector a ∈ Rp, both a⊤a ∈ R and aa⊤ ∈ Rp×p have rank 1.
The rank of a sum of matrices A + B is at most the sum of the ranks
of A and B, i.e.
rank(A + B) ≤ rank(A) + rank(B).
Notice that we require an inverse for the p × p matrix ∑_{i=1}^n X̃ iX̃ i⊤, which is
the sum of n rank-1 matrices; so
rank(∑_{i=1}^n X̃ iX̃ i⊤) ≤ ∑_{i=1}^n rank(X̃ iX̃ i⊤) ≤ n.
But p ≫ n and so ∑_{i=1}^n X̃ iX̃ i⊤ is not invertible!
73 / 110
Matrix ranks
Notice that without the matrix inverse, θ̂ can be defined as a solution to
the equation:
{∑_{i=1}^n X̃ iX̃ i⊤} θ = ∑_{i=1}^n X̃ iYi, i.e., Aθ = b,
where A ∈ Rp×p and b ∈ Rp.
If rank(A) = r and the system is consistent, then Aθ = b has at most p − r + 1
linearly independent solutions, and every solution of Aθ = b can be written as a
combination of those independent solutions.
Since rank(A) ≤ n, there are at least p − n + 1 linearly independent
estimators of θ, and infinitely many estimators overall!
74 / 110
Ridge regression
As with logistic regression and support vector machines, we can force a
unique solution by considering instead a penalized regression problem of
the form:
θˆ =
(
αˆ, βˆ
)
= argmin
α∈R,β∈Rq
LSn (α,β )+λpen(β ) ,
where pen : Rq → R≥0 is a penalization of the parameter β and λ > 0 is a
tuning parameter.
An obvious choice for a penalization, as usual, is for pen to be quadratic:
pen(β) = (1/2)∥β∥₂².
This results in the ridge regression optimization problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)² + (λ/2)∥β∥₂².
75 / 110
Ridge regression
To solve the ridge regression problem, we will write X̃ i⊤ = (1, X i⊤), which
then yields the representation:
LSn(α,β) + (λ/2)∥β∥₂² = (1/(2n)) ∑_{i=1}^n (Yi − θ⊤X̃ i)² + (λ/2) θ⊤Ĩqθ, where Ĩq = [0 0⊤; 0 Iq].
Solving the first-order condition then yields:
∂{LSn + λ pen}/∂θ = −(1/n) ∑_{i=1}^n X̃ i(Yi − X̃ i⊤θ) + λĨqθ = 0
⟹ ∑_{i=1}^n X̃ iYi = {∑_{i=1}^n X̃ iX̃ i⊤ + nλĨq} θ
⟹ (?) θ̂ = {∑_{i=1}^n X̃ iX̃ i⊤ + nλĨq}^{−1} ∑_{i=1}^n X̃ iYi.
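A minimal sketch (not from the slides; numpy assumed) of the resulting ridge closed form, with the intercept left unpenalised via Ĩq; the function name and simulated data are mine.

```python
# Sketch (not from the slides): ridge closed form
# theta-hat = (sum X~_i X~_i^T + n*lambda*I~_q)^(-1) sum X~_i Y_i.
import numpy as np

def ridge(X, y, lam):
    n, q = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])          # X~_i^T = (1, X_i^T)
    I_tilde = np.eye(q + 1)
    I_tilde[0, 0] = 0.0                           # do not penalise the intercept
    theta = np.linalg.solve(Xt.T @ Xt + n * lam * I_tilde, Xt.T @ y)
    return theta[0], theta[1:]                    # (alpha-hat, beta-hat)

rng = np.random.default_rng(10)
n, q = 50, 200                                    # q >> n
X = rng.standard_normal((n, q))
beta_true = np.zeros(q); beta_true[:5] = 2.0
y = 1.0 + X @ beta_true + 0.1 * rng.standard_normal(n)
alpha_hat, beta_hat = ridge(X, y, lam=0.1)
print(alpha_hat, beta_hat[:6])
```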
76 / 110
Ridge regression
It is easy to see that
Ĩq = [0 0⊤; 0 Iq]
has rank q, since Iq has full rank.
We can also see that
∑_{i=1}^n X̃ iX̃ i⊤ + nλĨq = [n, ∑_{i=1}^n X i⊤; ∑_{i=1}^n X i, ∑_{i=1}^n X iX i⊤ + nλIq].
For any x ∈ Rp, xx⊤ ∈ Rp×p is symmetric positive semi-definite, and for
any two matrices, if A, B and A − B are all symmetric positive
semi-definite, then rank(A) ≥ rank(B), so
rank(∑_{i=1}^n X iX i⊤ + nλIq) ≥ rank(nλIq) = q.
Now, assuming that the row (n, ∑_{i=1}^n X i⊤) is not in the row span of
(∑_{i=1}^n X i, ∑_{i=1}^n X iX i⊤ + nλIq), we have
rank(∑_{i=1}^n X̃ iX̃ i⊤ + nλĨq) = q + 1 = p,
so the ridge system has a unique solution.
77 / 110
Ridge regression
We recall that for each λ > 0, a problem of the form
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} LSn(α,β) + λ pen(β),
with convex penalty pen (here, LSn is already convex), permits an
equivalent representation:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈B(γ)} LSn(α,β),
where B(γ) = {β ∈ Rq : pen(β) ≤ γ} and pen(β̂) = γ, for some γ = γ(λ)
(by Kloft et al. 2011, Prop. 12).
As such, the ridge regression least squares estimator can be equivalently
phrased as:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈B(γ)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²,
where B(γ) = {β ∈ Rq : (1/2)∥β∥₂² ≤ γ}.
78 / 110
Sparsity
Recall our initial linearity assumption that
E[Yi | X i = xi] = α + β⊤xi,
and additionally assume that we can partition the coordinates j ∈ [q] of β
(or of xi) into two sets:
J0 = {j ∈ [q] : β j = 0} and J1 = {j ∈ [q] : β j ≠ 0} = supp(β).
That is, we assume that some coordinates of β in our generative
model are zero, and so we can write
E[Yi | X i = xi] = α + β⊤xi = α + ∑_{j∈J1} β j xi j.
If |J1| < q then we say that the regression model is sparse, in the sense
that some elements of β are exactly zero.
In general, we do not know the value of |J1|, but some models in the
literature may assume that it is fixed at some value q1 < q.
79 / 110
Sparsity
Via an abuse of notation, we can consider the l0-“norm” constrained
regression problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈B(γ)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²,
where B(γ) = {β ∈ Rq : ∥β∥0 ≤ γ}, with l0-norm:
∥β∥0 = ∑_{j=1}^q |β j|⁰ = |J1|,
using the convention that 0⁰ = 0 and a⁰ = 1 if a ≠ 0.
Here, we can view γ ∈ N as a budget for how many non-zero elements
of β there are.
Unfortunately, the l0-norm is neither a norm nor a well-behaved function;
it is not convex, not continuous, and not coercive.
Furthermore, this implies that the set B(γ) is not convex.
80 / 110
Convex relaxation
A convex relaxation of the l0-norm constrained regression problem is the
l1-norm constrained regression problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈B(γ)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²,
where B(γ) = {β ∈ Rq : ∥β∥1 ≤ γ}, with ∥β∥1 = ∑_{j=1}^q |β j|.
This problem also goes by the name of the “Least Absolute Shrinkage and
Selection Operator” (LASSO).
The LASSO problem is considered a convex relaxation because it relaxes
the set B(γ) to a convex set and thus permits the representation:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)² + λ∥β∥1,
for some λ = λ(γ) > 0, when pen(β̂) = γ, while preserving the possibility
that β̂ may be sparse (i.e., |supp(β̂)| < q), like the l0-norm problem.
81 / 110
Convex relaxation
Figure: Diagram of the l0-norm constrained regression problem for q= 2.
82 / 110
Convex relaxation
Figure: Diagram of the LASSO regression problem for q= 2.
83 / 110
Convex relaxation
Figure: Diagram of the ridge regression problem for q= 2.
84 / 110
The LASSO
The LASSO was first considered in the statistics literature in Tibshirani
(1996) and was a catalyst for a lot of interest in the crossing over of
convex optimization and statistics.
We can break the Least Absolute Shrinkage and Selection Operator name
into its components.
The least absolute part refers to the use of the l1 vector norm for
penalization.
The selection operator refers to the fact that the problem can result in
sparse solutions; a kind of variable selection.
The shrinkage part refers to how a penalization of β causes the optimal
solution to “shrink” towards 0, in the sense that
∥β̂ LS∥ ≥ ∥β̂ LASSO∥ ≥ ∥0∥.
Unfortunately, we cannot solve
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)² + λ∥β∥1
using typical first-order conditions for smooth functions.
This was resolved in Tibshirani (1996) by using the constrained form of the
problem and solving the associated quadratic programming problem.
85 / 110
Subdifferential calculus
Let g : Rp → R, θ ↦ g(θ), be a continuous and convex function.
Then, for every θ 0 ∈ Rp, there exists a δ ∈ Rp such that, for all θ ∈ Rp, the
supporting hyperplane inequality holds:
g(θ) ≥ g(θ 0) + δ⊤(θ − θ 0).
We define the subdifferential of g at θ 0 as the set:
∂g(θ 0) = {δ ∈ Rp : g(θ) ≥ g(θ 0) + δ⊤(θ − θ 0), for all θ ∈ Rp}.
Notice that at a point θ 0 where g is differentiable,
∂g(θ 0) = {∂g/∂θ |_{θ=θ 0}}.
For example, when g(θ) = |θ|:
∂g(θ) = {−1} if θ < 0, [−1,1] if θ = 0, {1} if θ > 0.
86 / 110
Subdifferential calculus
Figure: Supporting hyperplanes of g(θ) = |θ |, at θ = 0.
87 / 110
Subdifferential calculus
Since the subdifferential is essentially a gradient-like object, it admits
some gradient-like properties; e.g.:
For a > 0, ∂[a g(θ)] = a ∂g(θ).
For two continuous and convex functions g, h : Rp → R,
∂[g(θ) + h(θ)] = ∂g(θ) + ∂h(θ).
If h∘g(θ) is convex and continuous, then
∂[h∘g(θ)] = (∂g/∂θ) × ∂h(ψ)|_{ψ=g(θ)}.
Here, set addition and scalar multiplication of sets can be defined as
follows:
A + B = {a + b : a ∈ A, b ∈ B}, a × B = {a × b : b ∈ B}.
Most importantly, for the purpose of optimization, subdifferentials yield a
kind of first-order condition. That is, if g is continuous and convex, then
g has a global minimum at θ 0 if and only if
0 ∈ ∂g(θ 0).
88 / 110
Subdifferential calculus
Let X1, . . . , Xn ∈ R and θ ∈ R, and suppose that we want to solve the
problem:
θ̂ = argmin_{θ∈R} (1/n) ∑_{i=1}^n |Xi − θ|   [= g(θ)].
For each i ∈ [n], and θ ∈ R,
∂|Xi − θ| = {−1} if θ < Xi, [−1,1] if θ = Xi, {1} if θ > Xi,
so that
∂g(θ) = −∑_{i: Xi>θ} (1/n) + ∑_{i: Xi=θ} [−1/n, 1/n] + ∑_{i: Xi<θ} (1/n).
Solving the condition 0 ∈ ∂g(θ): for even n, we can pick θ̂ such that n/2
observations satisfy Xi ≤ θ̂ and n/2 observations satisfy Xi ≥ θ̂, so that
−∑_{i: Xi>θ̂} (1/n) + ∑_{i: Xi=θ̂} [−1/n, 1/n] + ∑_{i: Xi<θ̂} (1/n) ∋ −(1/n)(n/2) + (1/n)(n/2) = 0.
Thus, θ̂ is just the sample median.
89 / 110
Subdifferential calculus
We can now consider the LASSO problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} LASSOn(α,β), where
LASSOn(α,β) = (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)² + λ ∑_{j=1}^q |β j| = LSn(α,β) + λ pen(β).
The subdifferential of LASSOn(α,β) is
∂LASSOn(θ) = ∂LSn(θ)/∂θ + λ ∂∥β∥1,
where, coordinate-wise,
∂|β j| = {−1} if β j < 0, [−1,1] if β j = 0, {1} if β j > 0.
Solving the p = (q+1)-dimensional problem 0 ∈ ∂LASSOn(θ) directly is difficult.
90 / 110
Coordinate descent
For a generic multivariate function
g : Rp → R, θ ↦ g(θ),
with marginal univariate functions
g(·; θ−j) : R → R, θ j ↦ g(θ j; θ−j), where θ−j = (. . . , θ j−1, θ j+1, . . .),
we can define the coordinate descent algorithm as follows:
Let θ(0) ∈ Rp be some initialization (an initial guess of θ) and write
θ(s) ∈ Rp for the sth iterate of the algorithm.
At iteration s ∈ N, for each j ∈ [p], compute
θ(s)j = argmin_{θ j} g(θ j; θ(s−1)−j) if s ≡ j − 1 (mod p), and θ(s)j = θ(s−1)j otherwise.
Run the algorithm until s ≥ s̄, for some sufficiently large s̄.
91 / 110
Coordinate descent
For a continuous function g : Rp → R, θ ↦ g(θ), we define the
directional derivative (or semiderivative) of g in the direction δ ∈ Rp as
g′δ(θ) = lim_{λ↓0} [g(θ + λδ) − g(θ)]/λ.
Note that if g is differentiable, then
g′δ(θ) = δ⊤ ∂g/∂θ.
We say that θ* is a stationary point of g if g′δ(θ*) ≥ 0, for all δ ∈ Rp.
Notice that when g is differentiable, a stationary point θ* satisfies
∂g/∂θ |_{θ=θ*} = 0, and so g′δ(θ*) = δ⊤0 = 0, for all δ.
We say that a point θ* is regular if for all δ = (δ1, . . . , δp) with
g′_{(⋯,0,δ j,0,⋯)}(θ*) ≥ 0, for each j ∈ [p],
it follows that g′δ(θ*) ≥ 0.
92 / 110
Coordinate descent
Figure: Example of an irregular point: g(θ1,θ2) = |3θ1+4θ2|+ |2θ1+θ2|.
93 / 110
Coordinate descent
Let g : Rp → R be a continuous and convex function. By Razaviyayn et al.
(2013, Thm. 2), if the coordinate descent sequence converges in the sense
that θ (s)→ θ ∗, and θ ∗ is a regular point, then θ ∗ is a global minimizer of g.
By Aggarwal (2020, Lem. 4.10.1), if g : Rp → R has the form
g(θ) = g̃(θ) + ∑_{j=1}^p h j(θ j),
where g̃ is continuous and differentiable in θ, and h1, . . . , hp are continuous
univariate functions, then g is regular at every point θ ∈ Rp.
94 / 110
Coordinate descent
Returning to the LASSO problem:
θ̂ = (α̂, β̂) = argmin_{α∈R, β∈Rq} LASSOn(α,β), where
LASSOn(α,β) = (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)² + λ ∑_{j=1}^q |β j|,
we seek coordinate-wise solutions at iteration s.
For fixed β = β(s−1), there is no penalty in terms of α, so our
coordinate-wise solution is simply to solve the first-order condition
∂LASSOn(α, β(s−1))/∂α = −(1/n) ∑_{i=1}^n (Yi − α − β(s−1)⊤X i) = 0,
which implies
α(s) = (1/n) ∑_{i=1}^n (Yi − β(s−1)⊤X i).
95 / 110
Coordinate descent
For fixed α = α(s−1) and β−j = β(s−1)−j, we can write the subdifferential of
the univariate function LASSOn(β j; α(s−1), β(s−1)−j), for each j ∈ [q], as
∂LASSOn(β j; α(s−1), β(s−1)−j)
= −(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − β jXi j − ∑_{k≠j} β(s−1)k Xik) + λ ∂|β j|
= −(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) + β j (1/n) ∑_{i=1}^n Xi j²
+ {−λ} if β j < 0, [−λ, λ] if β j = 0, {λ} if β j > 0.
We solve 0 ∈ ∂LASSOn(β j; α(s−1), β(s−1)−j) for the three cases of β j
separately.
96 / 110
Coordinate descent
Firstly, when β j = 0, we have
0 ∈ −(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) + [−λ, λ]
⟹ (1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) ∈ [−λ, λ]
⟹ |(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik)| ≤ λ.
So if |(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik)| ≤ λ, we set β(s)j = 0.
97 / 110
Coordinate descent
If β j > 0, then
−(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) + β j (1/n) ∑_{i=1}^n Xi j² + λ = 0
⟹ β(s)j = [∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) − nλ] / ∑_{i=1}^n Xi j².
If β j < 0, then
−(1/n) ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) + β j (1/n) ∑_{i=1}^n Xi j² − λ = 0
⟹ β(s)j = [∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik) + nλ] / ∑_{i=1}^n Xi j².
Note that only one of the two solutions can be logically consistent.
98 / 110
Coordinate descent
Write
a(s−1)j = ∑_{i=1}^n Xi j (Yi − α(s−1) − ∑_{k≠j} β(s−1)k Xik).
Together we have the following coordinate-wise update:
β(s)j = 0 if |a(s−1)j| ≤ nλ,
β(s)j = (a(s−1)j + nλ) / ∑_{i=1}^n Xi j² if a(s−1)j < −nλ,
β(s)j = (a(s−1)j − nλ) / ∑_{i=1}^n Xi j² if a(s−1)j > nλ.
We can combine the three conditions into the expression:
β(s)j = [1/∑_{i=1}^n Xi j²] sign(a(s−1)j) [|a(s−1)j| − nλ]+ = [1/∑_{i=1}^n Xi j²] Soft_{nλ}(a(s−1)j),
where Soft_λ(x) = sign(x)[|x| − λ]+ is the soft-thresholding function and
[x]+ = max{0, x}.
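A hedged sketch (not from the slides; numpy assumed) of coordinate descent for the LASSO using this soft-thresholding update; the function names, the full-sweep update order, and the simulated data are my own choices.

```python
# Sketch (not from the slides): coordinate descent for the LASSO with the
# soft-thresholding update beta_j = Soft_{n*lam}(a_j) / sum_i X_ij^2.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)   # Soft_t(x)

def lasso_cd(X, y, lam, n_sweeps=200):
    n, q = X.shape
    alpha, beta = 0.0, np.zeros(q)
    col_ss = np.sum(X**2, axis=0)                         # sum_i X_ij^2, per column
    for _ in range(n_sweeps):
        alpha = np.mean(y - X @ beta)                     # alpha update (no penalty on alpha)
        for j in range(q):
            r_j = y - alpha - X @ beta + X[:, j] * beta[j]   # residuals excluding coordinate j
            a_j = X[:, j] @ r_j                              # a_j^(s-1) from the slide
            beta[j] = soft(a_j, n * lam) / col_ss[j]
    return alpha, beta

rng = np.random.default_rng(11)
n, q = 100, 300
X = rng.standard_normal((n, q))
beta_true = np.zeros(q); beta_true[:5] = 3.0
y = 0.5 + X @ beta_true + 0.1 * rng.standard_normal(n)
alpha_hat, beta_hat = lasso_cd(X, y, lam=0.1)
print(alpha_hat, np.flatnonzero(beta_hat)[:10])
```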
99 / 110
Sieve estimator
Suppose that we observe IID observations (X 1, Y1), . . . , (X n, Yn), and we
wish to estimate a parameter θ 0 ∈ T, defined as
θ 0 = argmin_{θ∈T} E[l(θ; X 1, Y1)],
where l : T × X × Y → R, (θ; x, y) ↦ l(θ; x, y), is a loss function.
For k ∈ N, let T1, T2, . . . be a sequence of sets, such that Tk ⊆ Tk+1 ⊆ T.
For each n ∈ N, let k = k(n) and estimate θ 0 via the minimization problem
θ̂ n = argmin_{θ∈Tk(n)} Ln(θ), Ln(θ) = (1/n) ∑_{i=1}^n l(θ; X i, Yi).
We wish to establish conditions under which the sequence θ̂ n → θ 0 in
probability, as n → ∞.
We call the sequence θ̂ 1, . . . , θ̂ n, . . . a sieve of estimators (or just a sieve
estimator).
100 / 110
Sieve estimator
Let n ≪ q < ∞, and consider the LASSO problem in the constrained form:
θ̂ n = (α̂n, β̂ n) = argmin_{α∈[−an,an], β∈B(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²   [= Ln(θ)],
where B(γn) = {β ∈ Rq : ∥β∥1 ≤ γn}.
Here, l(θ; x, y) = (1/2)(y − α − β⊤x)².
T = R × Rq and Tk = [−ak, ak] × B(γk), where 0 < γn = γk(n) and 0 < an = ak(n)
are increasing.
We wish to establish when the LASSO estimator is consistent for
θ 0 = argmin_{θ=(α,β)∈T} (1/2) E{(Y1 − α − β⊤X 1)²}   [= E[l(θ; X 1, Y1)]].
101 / 110
Sieve estimator
Figure: Diagram of the LASSO sieve estimator.
102 / 110
Sieve estimator
Let ∥·∥ : T → R≥0 be a norm on T.
The following set of assumptions arises from Chen (2007):
1. E[l(θ; X 1, Y1)] is continuous and E[l(θ 0; X 1, Y1)] < ∞.
2. For any ε > 0, E[l(θ 0; X 1, Y1)] < inf_{θ∈T : ∥θ−θ 0∥≥ε} E[l(θ; X 1, Y1)].
3. There exist mappings Mk : T → Tk, θ ↦ Mk(θ), such that
∥Mk(θ) − θ∥ → 0 as k → ∞.
4. Ln(θ) is continuous and measurable for all θ ∈ Tk, for each k ∈ N.
5. For each k ∈ N, Tk is compact.
6. For each k ∈ N, sup_{θ∈Tk} |Ln(θ) − E[l(θ; X 1, Y1)]| → 0, in probability, as n → ∞.
Under Assumptions 1–6, ∥θ̂ n − θ 0∥ → 0 in probability, as n → ∞, where
θ̂ n = argmin_{θ∈Tk(n)} Ln(θ), Ln(θ) = (1/n) ∑_{i=1}^n l(θ; X i, Yi),
and θ 0 = argmin_{θ∈T} E[l(θ; X 1, Y1)].
103 / 110
LASSO consistency
We now check the assumptions for the LASSO problem:
θ̂ n = (α̂n, β̂ n) = argmin_{α∈[−an,an], β∈B(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²   [= Ln(θ)],
where B(γn) = {β ∈ Rq : ∥β∥1 ≤ γn}.
Assumption 1 is valid if we let Y1, X 1 have second moments (i.e.,
E{Y1²} < ∞ and E{∥X 1∥₂²} < ∞), so that
(1/2) E{(Y1 − α − β⊤X 1)²} < ∞, for (α, β) ∈ T.
Assumption 2 is valid if θ 0 is the unique global minimum, which holds when
E[(Y1 − θ⊤X̃ 1)²] = E{Y1²} + θ⊤E{X̃ 1X̃ 1⊤}θ + linear terms
is strictly convex, for X̃ 1⊤ = (1, X 1⊤), which is true if E{X̃ 1X̃ 1⊤} is positive
definite.
104 / 110
LASSO consistency
For Assumption 3, we can define the projection operation:
Mk(θ) = argmin_{ψ∈Tk} ∥θ − ψ∥,
and let ak, γk → ∞, as k → ∞, so that for any θ ∈ T, there exists a k ∈ N such
that Mk(θ) = θ.
Assumptions 4 and 5 hold by construction.
Assumption 6 is verified by an application of the uniform law of large
numbers of Jennrich (1969), which applies since the data are IID and
sup_{θ∈Tk} (1/2) E{(Y1 − α − β⊤X 1)²} < ∞,
which is true since Tk is compact, under the conditions of Assumption 1.
105 / 110
LASSO consistency
Let (X 1, Y1), . . . , (X n, Yn) be IID, such that E{Y1²} < ∞ and E{∥X 1∥₂²} < ∞,
and assume that E{X̃ 1X̃ 1⊤} is positive definite, where X̃ 1⊤ = (1, X 1⊤). For
increasing sequences 0 < an, γn → ∞ as n → ∞, let
θ̂ n = (α̂n, β̂ n) = argmin_{α∈[−an,an], β∈B(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤X i)²,
where B(γn) = {β ∈ Rq : ∥β∥1 ≤ γn}. Then,
θ 0 = argmin_{θ∈T} E[l(θ; X 1, Y1)]
is unique and ∥θ̂ n − θ 0∥ → 0 in probability, as n → ∞.
106 / 110
Increasing q consistency
Consider instead a scenario whereby, as n increases, the dimension of X i
also increases. That is, X i = X qn,i ∈ Rqn, where qn is an increasing sequence.
We will suppose that there is some maximum value of q:
max_{n∈N} qn = qmax, which can be large, but must be finite.
Here, we also have the maximal dimensionality covariate Xmax,i ∈ Rqmax.
We will let θ = (α, βmax) ∈ T = R × Rqmax, and for each n ∈ N, we
consider the estimator
θ̂ n = (α̂n, β̂ n) = argmin_{α∈[−an,an], β∈Bqn(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤Xmax,i)²,
where Bq(γ) = {β⊤ = (βq⊤, 0qmax−q⊤) ∈ Rqmax : βq ∈ Rq, ∥β∥1 ≤ γ}.
This implies that for each k = k(n), we have Tk = [−an, an] × Bqn(γn) ⊆ T, where
qn = q(k(n)) and γn = γ(k(n)).
Notice also that Tk is compact.
107 / 110
LASSO consistency
Let (Xmax,1, Y1), . . . , (Xmax,n, Yn) be IID, such that E{Y1²} < ∞ and
E{∥Xmax,1∥₂²} < ∞, and assume that E{X̃ 1X̃ 1⊤} is positive definite, where
X̃ 1⊤ = (1, Xmax,1⊤). For increasing sequences an → ∞, γn → ∞ and qn → qmax,
as n → ∞, let
θ̂ n = (α̂n, β̂ n) = argmin_{α∈[−an,an], β∈Bqn(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − β⊤Xmax,i)²,
where Bqn(γn) = {β⊤ = (βqn⊤, 0qmax−qn⊤) ∈ Rqmax : βqn ∈ Rqn, ∥β∥1 ≤ γn}.
Then,
θ 0 = argmin_{θ∈T} E[l(θ; Xmax,1, Y1)]
is unique and ∥θ̂ n − θ 0∥ → 0 in probability, as n → ∞.
We can also write the LASSO estimator as θ̂ n = (α̂n, β̂ qn,n, 0qmax−qn), with
(α̂n, β̂ qn,n) = argmin_{α∈[−an,an], βqn∈B̃qn(γn)} (1/(2n)) ∑_{i=1}^n (Yi − α − βqn⊤X qn,i)²,
and B̃qn(γn) = {βqn ∈ Rqn : ∥βqn∥1 ≤ γn}.
108 / 110
References I
Aggarwal, C. C. (2020). Linear Algebra and Optimization for Machine Learning: A
Textbook. Springer.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing. Journal of the Royal Statistical
Society Series B, 57:289–300.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in
multiple testing under dependency. Annals of Statistics, 29:1165–1188.
Chen, X. (2007). Handbook of Econometrics, volume 6B, chapter Large sample sieve
estimation of semi-nonparametric models, pages 5549–5632. Elsevier.
Cheney, W. and Light, W. (2000). A Course in Approximation Theory. Brooks/Cole,
Pacific Grove.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A
kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58(301):13–30.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to
Statistical Learning with Applications in R, volume 112. Springer.
109 / 110
References II
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators.
Annals of Mathematical Statistics, 40:633–643.
Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. (2011). Lp-norm multiple kernel
learning. The Journal of Machine Learning Research, 12:953–997.
Nguyen, H. D., McLachlan, G. J., Cherbuin, N., and Janke, A. L. (2014). False
discovery rate control in magnetic resonance imaging studies via Markov random
fields. IEEE Transactions on Medical Imaging, 33:1735–1748.
Razaviyayn, M., Hong, M., and Luo, Z.-Q. (2013). A unified convergence analysis of
block successive minimization methods for nonsmooth optimization. SIAM Journal
of Optimization, 23:1126–1153.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis.
Journal of the Royal Statistical Society B, 61:611–622.
110 / 110