STA135 Lecture 6: Principal Component Analysis
Xiaodong Li UC Davis
1 Principal components: concepts and calculation
We would like to explain the variance-covariance structure of a set of variables by a few
linear combinations of these variables.
Let $\vec{X} = [X_1, \ldots, X_p]^\top$ be a random vector with population mean $\vec{\mu}$ and population covariance matrix $\Sigma$. Denote the spectral decomposition of $\Sigma$ as
$$\Sigma = \lambda_1 \vec{v}_1 \vec{v}_1^\top + \ldots + \lambda_p \vec{v}_p \vec{v}_p^\top,$$
where $\lambda_1 \geq \ldots \geq \lambda_p > 0$. Each eigenvector is represented as $\vec{v}_k = [v_{k1}, \ldots, v_{kp}]^\top$.
1.1 The first principal component
Consider a linear combination of the variates given by $\vec{a} = [a_1, \ldots, a_p]^\top$:
$$Y_1 = \vec{a}^\top \vec{X} = a_1 X_1 + a_2 X_2 + \ldots + a_p X_p.$$
In order to explain the variance-covariance of $\vec{X}$ as much as possible, we want to maximize the variance of $Y_1$. At the same time, in order to fix the scale, we impose the constraint $\|\vec{a}\| = 1$. Then the first principal component of $\vec{X}$ is defined by the following optimization:
$$\max_{\vec{a}} \ \operatorname{Var}(Y_1) = \operatorname{Var}(\vec{a}^\top \vec{X}) = \vec{a}^\top \Sigma \vec{a} \quad \text{s.t.} \quad \|\vec{a}\|^2 = 1.$$
The Lagrangian for the above optimization is
$$f(\vec{a}; \lambda) = \vec{a}^\top \Sigma \vec{a} - \lambda(\vec{a}^\top \vec{a} - 1).$$
By setting the gradient to be equal to zero, we get
$$\nabla_{\vec{a}} f(\vec{a}; \lambda) = 2\Sigma \vec{a} - 2\lambda \vec{a} = \vec{0},$$
that is,
$$\Sigma \vec{a} = \lambda \vec{a}.$$
This implies that $\vec{a}$ must be a unit eigenvector, and $\lambda$ is the corresponding eigenvalue.
Notice that what we aim to maximize is
$$\operatorname{Var}(Y_1) = \operatorname{Var}(\vec{a}^\top \vec{X}) = \vec{a}^\top \Sigma \vec{a} = \lambda \vec{a}^\top \vec{a} = \lambda.$$
Since $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p > 0$, in order to maximize $\operatorname{Var}(Y_1)$, we must have
$$\begin{cases} \vec{a} = \vec{v}_1, \\ \operatorname{Var}(Y_1) = \lambda_1. \end{cases}$$
To sum up, we have the following result:
Proposition 1. The first principal component is
$$Y_1 = \vec{v}_1^\top \vec{X} = v_{11} X_1 + v_{12} X_2 + \ldots + v_{1p} X_p,$$
where $\vec{v}_1$ is the eigenvector corresponding to the leading eigenvalue $\lambda_1$. Moreover, $\operatorname{Var}(Y_1) = \lambda_1$.
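As a quick numerical illustration of Proposition 1, here is a minimal sketch in Python/NumPy (the covariance matrix `Sigma` is a made-up example, not taken from the lecture): the leading eigenvector of $\Sigma$ gives the loadings of $Y_1$, and its eigenvalue equals $\operatorname{Var}(Y_1)$.

```python
# Minimal sketch of Proposition 1 (illustrative covariance matrix, not from the lecture).
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so reverse to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]   # columns of V are v_1, ..., v_p

v1, lambda1 = V[:, 0], lambdas[0]
print("loadings v_1:", v1.round(3))              # unit vector maximizing a^T Sigma a
print("Var(Y_1) = lambda_1 =", round(lambda1, 3))
print("check v_1^T Sigma v_1 =", round(v1 @ Sigma @ v1, 3))
```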
1.2 The second principal component
In order to define the second principal component, we also look for some linear combination
of the variates
$$Y_2 = \vec{a}^\top \vec{X} = a_1 X_1 + a_2 X_2 + \ldots + a_p X_p,$$
such that its variance is as large as possible. However, here we have two constraints: first, we need to impose $\|\vec{a}\|^2 = 1$ in order to fix the scale; second, we require that $Y_2$ explain the variance-covariance of $\vec{X}$ that has not been explained by $Y_1$, which amounts to $\operatorname{cov}(Y_2, Y_1) = 0$. Notice that
$$\operatorname{cov}(Y_2, Y_1) = \operatorname{cov}(\vec{a}^\top \vec{X}, \vec{v}_1^\top \vec{X}) = \vec{a}^\top \operatorname{Cov}(\vec{X}) \vec{v}_1 = \vec{a}^\top \Sigma \vec{v}_1 = \lambda_1 \vec{a}^\top \vec{v}_1,$$
so the constraint $\operatorname{cov}(Y_2, Y_1) = 0$ is equivalent to $\vec{a}^\top \vec{v}_1 = 0$. Consequently, the second principal component is defined through the following optimization:
$$\max_{\vec{a}} \ \operatorname{Var}(Y_2) = \operatorname{Var}(\vec{a}^\top \vec{X}) = \vec{a}^\top \Sigma \vec{a} \quad \text{s.t.} \quad \vec{a}^\top \vec{a} = 1, \ \ \vec{a}^\top \vec{v}_1 = 0.$$
The resulting Lagrangian is thus
$$f(\vec{a}; \lambda, \gamma) = \vec{a}^\top \Sigma \vec{a} - \lambda(\|\vec{a}\|^2 - 1) - \gamma \vec{a}^\top \vec{v}_1.$$
Again, by setting the gradient to be equal to zero, we get
$$\nabla_{\vec{a}} f(\vec{a}; \lambda, \gamma) = 2\Sigma \vec{a} - 2\lambda \vec{a} - \gamma \vec{v}_1 = \vec{0}.$$
Taking the inner product of both sides with $\vec{v}_1$, we have
$$\vec{v}_1^\top (2\Sigma \vec{a} - 2\lambda \vec{a} - \gamma \vec{v}_1) = 0.$$
The first term satisfies
$$\vec{v}_1^\top \Sigma \vec{a} = \vec{a}^\top \Sigma \vec{v}_1 = \vec{a}^\top (\lambda_1 \vec{v}_1) = \lambda_1 \vec{a}^\top \vec{v}_1 = 0.$$
By the constraint, the second term is $\vec{v}_1^\top \vec{a} = 0$. As a result, $\gamma \|\vec{v}_1\|^2 = \gamma = 0$. Then the stationary condition is still of the form
$$\Sigma \vec{a} = \lambda \vec{a}.$$
This implies that $\vec{a}$ must be a unit eigenvector while $\lambda$ is the corresponding eigenvalue. Note that we actually want to maximize
$$\operatorname{Var}(Y_2) = \vec{a}^\top \Sigma \vec{a} = \lambda \|\vec{a}\|^2 = \lambda.$$
Since $\vec{a}^\top \vec{v}_1 = 0$, the best we can get is
$$\begin{cases} \vec{a} = \vec{v}_2, \\ \operatorname{Var}(Y_2) = \lambda_2. \end{cases}$$
To sum up, we have the following result:
Proposition 2. The second principal component is
$$Y_2 = \vec{v}_2^\top \vec{X} = v_{21} X_1 + v_{22} X_2 + \ldots + v_{2p} X_p,$$
where $\vec{v}_2$ is the eigenvector corresponding to the second largest eigenvalue $\lambda_2$. Moreover, $\operatorname{Var}(Y_2) = \lambda_2$.
1.3 General concepts on principal components
In general, the $k$-th principal component can be defined iteratively through the following procedure. Given the existing principal components $Y_1, \ldots, Y_{k-1}$, we look for some linear combination of the variates
$$Y_k = \vec{a}^\top \vec{X} = a_1 X_1 + a_2 X_2 + \ldots + a_p X_p,$$
such that its variance is as large as possible. Still, we need to impose $\|\vec{a}\|^2 = 1$ in order to fix the scale, and require $Y_k$ to explain the variance-covariance of $\vec{X}$ that has not been explained by $Y_1, \ldots, Y_{k-1}$, which amounts to
$$\operatorname{cov}(Y_k, Y_1) = \operatorname{cov}(Y_k, Y_2) = \ldots = \operatorname{cov}(Y_k, Y_{k-1}) = 0.$$
Consequently, the $k$-th principal component is defined through the following optimization:
$$\max_{\vec{a}} \ \operatorname{Var}(Y_k) = \operatorname{Var}(\vec{a}^\top \vec{X}) = \vec{a}^\top \Sigma \vec{a} \quad \text{s.t.} \quad \vec{a}^\top \vec{a} = 1, \ \vec{a}^\top \vec{v}_1 = 0, \ \ldots, \ \vec{a}^\top \vec{v}_{k-1} = 0.$$
In general, we have the following result:
Proposition 3. The $k$-th principal component is
$$Y_k = \vec{v}_k^\top \vec{X} = v_{k1} X_1 + v_{k2} X_2 + \ldots + v_{kp} X_p,$$
where $\vec{v}_k$ is the eigenvector corresponding to the $k$-th largest eigenvalue $\lambda_k$. Moreover, $\operatorname{Var}(Y_k) = \lambda_k$.
The coefficients $v_{k1}, \ldots, v_{kp}$ are referred to as the loadings on the random variables $X_1, \ldots, X_p$ for the $k$-th principal component $Y_k$.
1.4 Verification of covariance structures of the PCs
Denote
$$V = \begin{bmatrix} \vec{v}_1 & \ldots & \vec{v}_p \end{bmatrix} = \begin{bmatrix} v_{11} & v_{21} & \ldots & v_{p1} \\ v_{12} & v_{22} & \ldots & v_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1p} & v_{2p} & \ldots & v_{pp} \end{bmatrix}, \qquad (1.1)$$
where the row and column indices require attention. Recall that $\Sigma = V \Lambda V^\top$, where
$$\Lambda = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix}.$$
The random vector of population principal components can thus be written as
$$\vec{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_p \end{bmatrix} = \begin{bmatrix} \vec{v}_1^\top \vec{X} \\ \vdots \\ \vec{v}_p^\top \vec{X} \end{bmatrix} = \begin{bmatrix} \vec{v}_1^\top \\ \vdots \\ \vec{v}_p^\top \end{bmatrix} \vec{X} = V^\top \vec{X}.$$
The linear relationship $\vec{Y} = V^\top \vec{X}$ gives
$$\operatorname{Cov}(\vec{Y}) = V^\top \operatorname{Cov}(\vec{X}) V = V^\top V \Lambda V^\top V = \Lambda,$$
which implies
$$\operatorname{Var}(Y_k) = \lambda_k \text{ for } k = 1, \ldots, p, \qquad \text{and} \qquad \operatorname{Cov}(Y_j, Y_k) = 0 \text{ for } j \neq k.$$
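The identity $\operatorname{Cov}(\vec{Y}) = \Lambda$ can also be checked by simulation. The following minimal sketch (assumptions: NumPy available, an arbitrary made-up covariance matrix, Gaussian draws) forms $\vec{Y} = V^\top \vec{X}$ from simulated data and verifies that the sample covariance of the PCs is approximately diagonal with entries $\lambda_k$.

```python
# Simulation sketch: the PCs Y = V^T X are uncorrelated with variances lambda_k.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])          # made-up population covariance

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
Lam, V = eigvals[order], eigvecs[:, order]

X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)   # rows are draws of X
Y = X @ V                                                       # rows are draws of Y = V^T X

print("Lambda:", Lam.round(3))
print("sample Cov(Y), approximately diag(Lambda):")
print(np.cov(Y, rowvar=False).round(3))
```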
1.5 Standardization
In certain applications, it is common to standardize the original variates $X_1, \ldots, X_p$ into
$$Z_1 = \frac{X_1 - \mu_1}{\sqrt{\sigma_{11}}}, \quad Z_2 = \frac{X_2 - \mu_2}{\sqrt{\sigma_{22}}}, \quad \ldots, \quad Z_p = \frac{X_p - \mu_p}{\sqrt{\sigma_{pp}}}.$$
Then it is straightforward to get
$$\operatorname{Cov}(\vec{Z}) = \begin{bmatrix} \rho_{11} & \rho_{12} & \ldots & \rho_{1p} \\ \rho_{21} & \rho_{22} & \ldots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \ldots & \rho_{pp} \end{bmatrix} = \operatorname{Corr}(\vec{X}),$$
where
$$\rho_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj} \sigma_{kk}}}.$$
If we still represent the spectral decomposition of the covariance of the standardized variables as
$$\operatorname{Cov}(\vec{Z}) = \lambda_1 \vec{v}_1 \vec{v}_1^\top + \ldots + \lambda_p \vec{v}_p \vec{v}_p^\top$$
with $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p > 0$, then the principal components of $Z_1, \ldots, Z_p$ are
$$Y_k = \vec{v}_k^\top \vec{Z} = v_{k1} Z_1 + \ldots + v_{kp} Z_p.$$
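A minimal sketch of PCA on standardized variables (simulated data and a made-up covariance matrix; NumPy assumed): standardizing each $X_j$ makes the covariance of $\vec{Z}$ equal to the correlation matrix of $\vec{X}$, whose eigenvalues sum to $p$.

```python
# Sketch: PCA of standardized variables = eigendecomposition of Corr(X).
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.5, 3.0],
                  [0.5, 1.0, 2.0],
                  [3.0, 2.0, 25.0]])        # X_3 has a much larger variance than X_1, X_2
X = rng.multivariate_normal(np.zeros(3), Sigma, size=1000)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables Z_1, ..., Z_p
R = np.corrcoef(X, rowvar=False)                   # sample Corr(X); equals the sample covariance of Z

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]

print("eigenvalues of Corr(X):", lambdas.round(3), " sum =", lambdas.sum().round(3))  # sums to p
print("loadings of the first PC of Z:", V[:, 0].round(3))
```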
2 Basic Principal Component Analysis
2.1 Contribution of variables to the determination of PCs
One standard method to compare the contributions of different variables to the determina-
tion of a particular PC is through the formula:
$$Y_k = \vec{v}_k^\top \vec{X} = v_{k1} X_1 + v_{k2} X_2 + \ldots + v_{kp} X_p.$$
In other words, we compare the contributions of $X_1, \ldots, X_p$ to the determination of $Y_k$ based on the loadings $v_{k1}, \ldots, v_{kp}$.
Here we introduce the second method: compare the contributions of $X_1, \ldots, X_p$ to the determination of $Y_k$ based on the correlations $\operatorname{Corr}(X_1, Y_k), \ldots, \operatorname{Corr}(X_p, Y_k)$. Recall that
we have $\vec{Y} = V^\top \vec{X}$, where $V$ is defined in (1.1). Then,
$$\operatorname{Cov}(\vec{Y}, \vec{X}) = \operatorname{Cov}(V^\top \vec{X}, \vec{X}) = V^\top \operatorname{Cov}(\vec{X}) = V^\top \Sigma = V^\top V \Lambda V^\top = \Lambda V^\top$$
$$= \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix} \begin{bmatrix} v_{11} & v_{12} & \ldots & v_{1p} \\ v_{21} & v_{22} & \ldots & v_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p1} & v_{p2} & \ldots & v_{pp} \end{bmatrix} = \begin{bmatrix} \lambda_1 v_{11} & \lambda_1 v_{12} & \ldots & \lambda_1 v_{1p} \\ \lambda_2 v_{21} & \lambda_2 v_{22} & \ldots & \lambda_2 v_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_p v_{p1} & \lambda_p v_{p2} & \ldots & \lambda_p v_{pp} \end{bmatrix}.$$
The covariance between the $k$-th principal component and the $j$-th variate is
$$\operatorname{Cov}(Y_k, X_j) = \lambda_k v_{kj}.$$
We further have
$$\operatorname{Corr}(Y_k, X_j) = \frac{\operatorname{Cov}(Y_k, X_j)}{\sqrt{\operatorname{Var}(Y_k)\operatorname{Var}(X_j)}} = \frac{\lambda_k v_{kj}}{\sqrt{\lambda_k \sigma_{jj}}} = v_{kj}\sqrt{\frac{\lambda_k}{\sigma_{jj}}}.$$
This gives the second method to compare the contributions of the $X_j$'s to the determination of $Y_k$: through the correlation coefficients $v_{kj}\sqrt{\lambda_k/\sigma_{jj}}$ for $j = 1, \ldots, p$.
When the variables are standardized from $X_j$ to $Z_j$, we have $\rho_{jj} = 1$ and
$$\operatorname{Corr}(Y_k, Z_j) = v_{kj}\sqrt{\frac{\lambda_k}{\rho_{jj}}} = v_{kj}\sqrt{\lambda_k}.$$
This implies that for a fixed $k$, the loadings and the correlation coefficients between the $k$-th PC $Y_k$ and $Z_1, \ldots, Z_p$ are proportional. Therefore, there is no difference in comparing the contributions of variables to the determination of $Y_k$ based on either loadings or correlations.
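The two comparison methods of this subsection can be placed side by side numerically. The sketch below (NumPy; the same made-up $\Sigma$ as in the earlier sketches) prints the loadings $v_{1j}$ of the first PC next to the correlations $\operatorname{Corr}(Y_1, X_j) = v_{1j}\sqrt{\lambda_1/\sigma_{jj}}$.

```python
# Sketch: loadings vs. correlations as measures of each variable's contribution to Y_1.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])          # illustrative covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]

k = 0                                        # first principal component
loadings = V[:, k]                           # v_{1j}, j = 1, ..., p
correlations = loadings * np.sqrt(lambdas[k] / np.diag(Sigma))   # Corr(Y_1, X_j)

print("loadings v_1j:  ", loadings.round(3))
print("Corr(Y_1, X_j): ", correlations.round(3))
```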
2.2 Selecting the number of PCs
Recall that the spectral decomposition of the population covariance is
$$\Sigma = \lambda_1 \vec{v}_1 \vec{v}_1^\top + \lambda_2 \vec{v}_2 \vec{v}_2^\top + \ldots + \lambda_p \vec{v}_p \vec{v}_p^\top.$$
Denote
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \ldots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \ldots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{pp} \end{bmatrix} = V \Lambda V^\top.$$
The trace formula gives
$$\operatorname{trace}(\Sigma) = \operatorname{trace}(V \Lambda V^\top) = \operatorname{trace}(\Lambda V^\top V) = \operatorname{trace}(\Lambda),$$
which is equivalent to
$$\sigma_{11} + \sigma_{22} + \ldots + \sigma_{pp} = \lambda_1 + \lambda_2 + \ldots + \lambda_p.$$
Since $\operatorname{Var}(Y_k) = \lambda_k$ for $k = 1, \ldots, p$ and $\operatorname{Var}(X_j) = \sigma_{jj}$, the trace formula gives
$$\operatorname{Var}(X_1) + \ldots + \operatorname{Var}(X_p) = \operatorname{Var}(Y_1) + \ldots + \operatorname{Var}(Y_p).$$
Definition 4. The proportion of the total variance due to the first $k$ principal components is defined as
$$\frac{\operatorname{Var}(Y_1) + \ldots + \operatorname{Var}(Y_k)}{\operatorname{Var}(Y_1) + \ldots + \operatorname{Var}(Y_p)} = \frac{\lambda_1 + \ldots + \lambda_k}{\lambda_1 + \ldots + \lambda_p} = \frac{\lambda_1 + \ldots + \lambda_k}{\sigma_{11} + \ldots + \sigma_{pp}}.$$
Example: If
$$\frac{\lambda_1 + \lambda_2 + \lambda_3}{\sigma_{11} + \sigma_{22} + \ldots + \sigma_{pp}} > 90\%,$$
then we can replace $X_1, \ldots, X_p$ with $Y_1, Y_2$ and $Y_3$ without much loss of information.
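A short sketch of Definition 4 (NumPy; the illustrative $\Sigma$ from the earlier sketches): compute the proportion and cumulative proportion of the total variance explained by each PC, and keep enough components to pass a chosen threshold such as 90%.

```python
# Sketch: proportion of total variance explained by the leading PCs.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])          # illustrative covariance matrix

lambdas = np.linalg.eigvalsh(Sigma)[::-1]    # lambda_1 >= ... >= lambda_p
prop = lambdas / lambdas.sum()               # lambda_k / (sigma_11 + ... + sigma_pp)
cumprop = np.cumsum(prop)

for k, (p_k, c_k) in enumerate(zip(prop, cumprop), start=1):
    print(f"PC{k}: proportion = {p_k:.3f}, cumulative = {c_k:.3f}")
# e.g. keep the smallest k whose cumulative proportion exceeds 0.90
```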
3 Sample PCA
3.1 Summary of results
Let $\vec{x}_1, \ldots, \vec{x}_n$ be a sample with sample mean $\bar{\vec{x}}$ and sample covariance $S$. By considering the spectral decomposition of the sample covariance
$$S = \hat{\lambda}_1 \vec{u}_1 \vec{u}_1^\top + \ldots + \hat{\lambda}_p \vec{u}_p \vec{u}_p^\top,$$
where $\vec{u}_k = [u_{k1}, \ldots, u_{kp}]^\top$, we have the following results about sample PCs, in analogy to the population PCs (a short numerical sketch is given after this list):
• The $k$-th sample PC is defined as
$$\hat{Y}_k = u_{k1} X_1 + u_{k2} X_2 + \ldots + u_{kp} X_p.$$
The coefficients $u_{k1}, \ldots, u_{kp}$ are referred to as the loadings for the $k$-th sample principal component $\hat{Y}_k$. In particular, the $i$-th observation of the $k$-th sample principal component is
$$\hat{y}_{ik} = \vec{u}_k^\top \vec{x}_i = u_{k1} x_{i1} + u_{k2} x_{i2} + \ldots + u_{kp} x_{ip}.$$
• The sample variance of $\hat{Y}_k$ is $\hat{\lambda}_k$, and for $k \neq j$, the sample covariance between $\hat{Y}_k$ and $\hat{Y}_j$ is 0.
• The sample correlation between $\hat{Y}_k$ and $X_j$ is $u_{kj}\sqrt{\hat{\lambda}_k / s_{jj}}$.
• The total sample variance satisfies
$$s_{11} + s_{22} + \ldots + s_{pp} = \hat{\lambda}_1 + \hat{\lambda}_2 + \ldots + \hat{\lambda}_p,$$
and the proportion of the total variance due to the first $k$ sample principal components is
$$\frac{\hat{\lambda}_1 + \ldots + \hat{\lambda}_k}{\hat{\lambda}_1 + \ldots + \hat{\lambda}_p} = \frac{\hat{\lambda}_1 + \ldots + \hat{\lambda}_k}{s_{11} + \ldots + s_{pp}}.$$
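The following minimal sketch (NumPy; the data matrix is simulated, not taken from the text) carries out sample PCA as summarized above: eigendecompose the sample covariance $S$, then read off the sample loadings $\vec{u}_k$, the sample variances $\hat{\lambda}_k$, and the proportions of total sample variance.

```python
# Sketch of sample PCA on a simulated n x p data matrix.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 2.0, 0.0],
                             [2.0, 3.0, 1.0],
                             [0.0, 1.0, 2.0]],
                            size=100)                 # n = 100 observations, p = 3 variables

S = np.cov(X, rowvar=False)                           # sample covariance (divides by n - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam_hat, U = eigvals[order], eigvecs[:, order]        # hat-lambda_k and u_k

print("sample loadings u_1:", U[:, 0].round(3))
print("sample PC variances hat-lambda:", lam_hat.round(3))
print("proportions of total sample variance:", (lam_hat / lam_hat.sum()).round(3))
```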
3.2 Reduction of number of columns in the dataset
Consider the spectral decomposition of the sample covariance in matrix form:
$$S = \hat{\lambda}_1 \vec{u}_1 \vec{u}_1^\top + \ldots + \hat{\lambda}_p \vec{u}_p \vec{u}_p^\top = U \hat{\Lambda} U^\top,$$
where
$$\hat{\lambda}_1 \geq \ldots \geq \hat{\lambda}_p \geq 0, \qquad U = [\vec{u}_1, \ldots, \vec{u}_p] = \begin{bmatrix} u_{11} & u_{21} & \ldots & u_{p1} \\ u_{12} & u_{22} & \ldots & u_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1p} & u_{2p} & \ldots & u_{pp} \end{bmatrix},$$
and
$$\hat{\Lambda} = \begin{bmatrix} \hat{\lambda}_1 & & \\ & \ddots & \\ & & \hat{\lambda}_p \end{bmatrix}.$$
Then the $i$-th observation of all sample principal components is
$$\hat{\vec{y}}_i = \begin{bmatrix} \hat{y}_{i1} \\ \vdots \\ \hat{y}_{ip} \end{bmatrix} = \begin{bmatrix} u_{11} x_{i1} + u_{12} x_{i2} + \ldots + u_{1p} x_{ip} \\ \vdots \\ u_{p1} x_{i1} + u_{p2} x_{i2} + \ldots + u_{pp} x_{ip} \end{bmatrix} = \begin{bmatrix} u_{11} & u_{12} & \ldots & u_{1p} \\ u_{21} & u_{22} & \ldots & u_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{p1} & u_{p2} & \ldots & u_{pp} \end{bmatrix} \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix} = U^\top \vec{x}_i.$$
Then, the data matrix of sample principal components is
$$\hat{Y} := \begin{bmatrix} \hat{\vec{y}}_1^\top \\ \hat{\vec{y}}_2^\top \\ \vdots \\ \hat{\vec{y}}_n^\top \end{bmatrix} = \begin{bmatrix} \vec{x}_1^\top U \\ \vec{x}_2^\top U \\ \vdots \\ \vec{x}_n^\top U \end{bmatrix} = \begin{bmatrix} \vec{x}_1^\top \\ \vec{x}_2^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix} U = X U = X[\vec{u}_1, \ldots, \vec{u}_p].$$
In particular, if we only keep the observations of the first two sample PCs, we get
$$\begin{bmatrix} \hat{y}_{11} & \hat{y}_{12} \\ \hat{y}_{21} & \hat{y}_{22} \\ \vdots & \vdots \\ \hat{y}_{n1} & \hat{y}_{n2} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{bmatrix} \begin{bmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \\ \vdots & \vdots \\ u_{1p} & u_{2p} \end{bmatrix}.$$
In practice, one is interested in plotting the PC scores of the observations for $\hat{Y}_1$ and $\hat{Y}_2$, i.e., the scatter plot of
$$\begin{bmatrix} \hat{y}_{11} \\ \hat{y}_{12} \end{bmatrix}, \ \begin{bmatrix} \hat{y}_{21} \\ \hat{y}_{22} \end{bmatrix}, \ \ldots, \ \begin{bmatrix} \hat{y}_{n1} \\ \hat{y}_{n2} \end{bmatrix}.$$
Meanwhile, each variable $X_j$ should also be presented in the plot as the vector of loadings $[u_{1j}, u_{2j}]^\top$, which will be helpful for the interpretation of $\hat{Y}_1$ and $\hat{Y}_2$.
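A minimal plotting sketch of this idea (simulated data; NumPy and matplotlib assumed available): scatter the first two sample PC scores and overlay each variable's loading vector $[u_{1j}, u_{2j}]^\top$ as an arrow, in the spirit of a biplot. The arrow scaling factor is an arbitrary choice for visibility.

```python
# Sketch: scatter plot of the first two sample PC scores with loading vectors overlaid.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 2.0, 0.0],
                             [2.0, 3.0, 1.0],
                             [0.0, 1.0, 2.0]],
                            size=100)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
U = eigvecs[:, np.argsort(eigvals)[::-1]]

scores = X @ U[:, :2]                       # columns: hat-y_{i1} and hat-y_{i2}

plt.scatter(scores[:, 0], scores[:, 1], s=10)
scale = 3.0                                 # arbitrary scaling so the arrows are visible
for j in range(X.shape[1]):
    plt.arrow(0, 0, scale * U[j, 0], scale * U[j, 1], color="red", head_width=0.1)
    plt.text(1.1 * scale * U[j, 0], 1.1 * scale * U[j, 1], f"X{j + 1}", color="red")
plt.xlabel("first sample PC score")
plt.ylabel("second sample PC score")
plt.show()
```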
4 Data analysis and interpretation
See, e.g., Example 8.5 on page 451.