xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

程序代写案例-STA135

时间：2021-04-24

STA135 Lecture 6: Principal Component Analysis

Xiaodong Li UC Davis

1 Principal components: concepts and calculation

We would like to explain the variance-covariance structure of a set of variables by a few

linear combinations of these variables.

Let ~X =

X1...

Xp

be a random vector with population mean ~µ and population covariance

matrix Σ. Denote the spectral decomposition of Σ as

Σ = λ1~v1~v

>

1 + . . .+ λp~vp~v

>

p ,

where λ1 ≥ . . . ≥ λp > 0. Each eigenvector is represented as

~vk = [vk1, . . . , vkp]

>.

1.1 The first principal component

Consider a linear combination of the variates by ~a =

[

a1 . . . ap

]>

:

Y1 = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp.

In order to explain the variance-covariance of ~X as much as possible, we want to maximize

the variance of Y1. At the same time, in order to fix the scale, we impose the constraint

‖~a‖ = 1. Then the first principal component for ~X is defined by the following optimization:

max Var(Y1) = Var(~a

> ~X) = ~a>Σ~a

s.t. ‖~a‖2 = 1

The Lagrangian for the above optimization is

f(~a;λ) = ~a>Σ~a− λ(~a>~a− 1).

By setting the gradient to be equal to zero, we get

∇~af(~a;λ) = 2Σ~a− 2λ~a = ~0,

that is

Σ~a = λ~a,

1

This implies that ~a must be a unit eigenvector, and λ is the corresponding eigenvalue.

Notice what we aim to maximize is

Var(Y1) = Var(~a

> ~X) = ~a>Σ~a = λ~a>~a = λ.

Since λ1 ≥ λ2 ≥ . . . ≥ λp > 0, in order to maximize Var(Y1), we must have{

~a = ~v1

Var(Y1) = λ1.

To sum up, we have the following result:

Proposition 1. The first principal component is

Y1 = ~v

>

1

~X = v11X1 + v12X2 + . . . v1pXp,

where ~v1 is the eigenvector corresponding to the leading eigenvalue λ1. Moreover, Var(Y1) =

λ1.

1.2 The second principal component

In order to define the second principal component, we also look for some linear combination

of the variates

Y2 = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp,

such that its variance is as large as possible. However, here we have two constraints:

First, we need to impose ‖~a‖2 = 1 in order to fix the scale; Second, we require that Y2

explains the variance-covariance of ~X that has not been explained by Y1, which amounts to

cov(Y2, Y1) = 0. Notice that

cov(Y2, Y1) = cov(~a

> ~X,~v>1 ~X) = ~a

>Cov( ~X)~v1 = ~a>Σ~v1 = λ1~a>~v1,

so the constraint cov(Y2, Y1) = 0 is equivalent to ~a

>~v1 = 0. Consequently, the second

principal component is defined through the following optimization

max Var(Y2) = Var(~a

> ~X) = ~a>Σ~a

s.t. ~a>~a = 1,

~a>~v1 = 0.

The resulting Lagrangian is thus

f(~a;λ, γ) = ~a>Σ~a− λ(‖~a‖2 − 1)− γ~a>~v1.

Again, by setting the gradient to be equal to zero, we get

∇~af(~a;λ, γ) = 2Σ~a− 2λ~a− γ~v1 = ~0.

Taking the inner product of both sides with ~v1, we have

~v>1 (2Σ~a− 2λ~a− γ~v1) = 0.

2

The first term

~v>1 Σ~a = ~a

>Σ~v1 = ~a>(λ1~v1) = λ1~a>~v1 = 0.

By the constraint, the second term is ~v>1 ~a = 0. As a result, γ‖~v1‖2 = γ = 0. Then the

stationary condition is still in the form of

Σ~a = λ~a.

This implies that ~a must be a unit eigenvector while λ is the corresponding eigenvalue. Note

that we actually want to maximize

Var(Y2) = ~a

>Σ~a = λ‖~a‖2 = λ.

Since ~a>~v1 = 0, the best we can get is{

~a = ~v2

Var(Y2) = λ2.

To sum up, we have the following result:

Proposition 2. The second principal component is

Y2 = ~v

>

2

~X = v21X1 + v22X2 + . . . v2pXp,

where ~v2 is the eigenvector corresponding to the second largest eigenvalue λ2. Moreover,

Var(Y2) = λ2.

1.3 General concepts on principal components

In general, the k-th principal component can be defined iteratively through the following

procedure. Given the existing principal components Y1, . . . , Yk−1, we look for some linear

combination of the variates

Yk = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp,

such that its variance is as large as possible. Still, we need to impose ‖~a‖2 = 1 in order to

fix the scale, and require that Yk to explain the variance-covariance of ~X that has not been

explained by Y1, . . . , Yk−1, which amounts to

cov(Yk, Y1) = cov(Yk, Y2) = . . . = cov(Yk, Yk−1) = 0.

Consequently, the k-th principal component is defined through the following optimization

max Var(Yk) = Var(~a

> ~X) = ~a>Σ~a

s.t. ~a>~a = 1,

~a>~v1 = 0,

...

~a>~vk−1 = 0.

In general, we have the following result

3

Proposition 3. The k-th principal component is

Yk = ~v

>

k

~X = vk1X1 + vk2X2 + . . . vkpXp,

where ~vk is the eigenvector corresponding to the k-th largest eigenvalue λk. Moreover,

Var(Yk) = λk.

The coefficients vk1, . . . , vkp are referred to as loadings on the random variables X1, ...,

Xp for the k-th principal component Yk.

1.4 Verification of covariance structures of the PCs

Denote

V =

[

~v1 . . . ~vp

]

=

v11 v21 . . . vp1

v12 v22 . . . vp2

...

...

. . .

...

v1p v2p . . . vpp

, (1.1)

where the row and column indices require attention. Recall that Σ = V ΛV >, where

Λ =

λ1 . . .

λp

.

The random vector of population principal components can thus be written as

~Y =

Y1...

Yp

=

~v

>

1

~X

...

~v>p ~X

=

~v

>

1

...

~v>p

~X = V > ~X.

The linear relationship ~Y = V > ~X gives

Cov(~Y ) = V >Cov( ~X)V = V >V ΛV >V = Λ,

which implies

Var(Yk) = λk, for k = 1, . . . , p.

and

Cov(Yj , Yk) = 0 for j 6= k.

1.5 Standardization

In certain applications, it is common to standardize the original variates X1, . . . , Xp into

Z1 =

X1 − µ1√

σ11

, Z2 =

X2 − µ2√

σ22

, . . . , Zp =

Xp − µp√

σpp

.

4

Then it is straightforward to get

Cov(~Z) =

ρ11 ρ12 . . . ρ1p

ρ21 ρ22 . . . ρ2p

...

...

. . .

...

ρp1 ρp2 . . . ρpp

= Corr( ~X)

where

ρjk =

σjk√

σjjσkk

.

If we still represent the spectral decomposition for the covariance of the standardized vari-

ables as

Cov(~Z) = λ1~v1~v

>

1 + . . .+ λp~vp~v

>

p

with λ1 ≥ λ2 ≥ . . . ≥ λp > 0. The principal components of Z1, . . . , Zp are

Yk = ~v

>

k

~Z = vk1Z1 + . . . vkpZp.

2 Basic Principal Component Analysis

2.1 Contribution of variables to the determination of PCs

One standard method to compare the contributions of different variables to the determina-

tion of a particular PC is through the formula:

Yk = ~v

>

k

~X = vk1X1 + vk2X2 + . . . vkpXp.

In other words, we compare the contributions of X1, . . . , Xp to the determination of Yk

based on the loadings vk1, . . . , vkp.

Here we introduce the second method: Compare the contributions of X1, . . . , Xp to the

determination of Yk based on the correlations Corr(X1, Yk), . . . ,Corr(Xp, Yk). Recall that

we have ~Y = V > ~X, where V is defined in (1.1). Then,

Cov(~Y , ~X) = Cov(V > ~X, ~X) = V >Cov( ~X) = V >Σ = V >V ΛV > = ΛV >

=

λ1 . . .

λp

v11 v12 . . . v1p

v21 v22 . . . v2p

...

...

. . .

...

vp1 vp2 . . . vpp

=

λ1v11 λ1v12 . . . λ1v1p

λ2v21 λ2v22 . . . λ2v2p

...

...

. . .

...

λpvp1 λpvp2 . . . λpvpp

,

The covariance between the k-th principal component and the j-th variate is

Cov(Yk, Xj) = λkvkj .

5

We further have

Corr(Yk, Xj) =

Cov(Yk, Xj)√

Var(Yk)Var(Xj)

=

λkvkj√

λkσjj

= vkj

√

λk

σjj

.

This gives the second method to compare the contributions of Xj ’s to the determination of

Yk through the correlation coefficients vkj

√

λk

σjj

for j = 1, . . . , p.

When the variables are standardized from Xj to Zj , we have and

Corr(Yk, Zj) = vkj

√

λk

ρjj

= vkj

√

λk.

This implies that for a fixed k, the loadings and correlation coefficients between the k-th PC

Yk and Z1, . . . , Zp are proportional. Therefore, there is no difference in comparing of the

contributions of variables to the determination of Yk based on either loadings or correlations.

2.2 Selecting the number of PCs

Recall that the spectral decomposition of the population covariance is

Σ = λ1~v1~v

>

1 + λ2~v2~v

>

2 + . . .+ λp~vp~v

>

p .

Denote

Σ =

σ11 σ12 . . . σ1p

σ21 σ22 . . . σ2p

...

...

. . .

...

σp1 σp2 . . . σpp

= V ΛV >.

The trace formula gives

trace(Σ) = trace(V ΛV >) = trace(ΛV >V ) = trace(Λ),

which is equivalent to

σ11 + σ22 + . . .+ σpp = λ1 + λ2 + . . .+ λp.

Since Var(Yk) = λk for k = 1, . . . , p and Var(Xj) = σjj , the trace formula gives

Var(X1) + . . .+ Var(Xp) = Var(Y1) + . . .+ Var(Yp)

Definition 4. The proportion of the total variance due to the first k principal components

is defined as

Var(Y1) + . . .+ Var(Yk)

Var(Y1) + . . .+ Var(Yp)

=

λ1 + . . .+ λk

λ1 + . . .+ λp

=

λ1 + . . .+ λk

σ11 + . . .+ σpp

.

Example: If

λ1 + λ2 + λ3

σ11 + σ22 + . . .+ σpp

> 90%,

then we can replace X1, . . . , Xp with Y1, Y2 and Y3 without much loss of information.

6

3 Sample PCA

3.1 Summary of results

Let

~x1, . . . , ~xn

be a sample with sample mean ~x and sample covariance S. By considering the spectral

decomposition of the sample covariance

S = λˆ1~u1~u

>

1 + . . .+ λˆp~up~u

>

p ,

where ~uk = [uk1, . . . , ukp]

>, we have the following results about sample PCs in the analogy

to population PCs:

• The k-the sample PC is defined as

Ŷk = uk1X1 + uk2X2 + . . .+ ukpXp.

The coefficients uk1, . . . , ukp are referred to as loadings for the k-th sample princi-

pal component Ŷk. In particular, the i-th observation of the k-th sample principal

component as

yˆik = ~u

>

k ~xi = uk1xi1 + uk2xi2 + . . .+ ukpxip.

• The sample variance of Ŷk is λˆk, and for k 6= j, the sample covariance between Ŷk and

Ŷj is 0.

• The sample correlation between Yk and Xj is ukj

√

λˆk

sjj

.

• The total sample covariances is

s11 + s22 + . . .+ spp = λˆ1 + λˆ2 + . . .+ λˆp,

and the proportion of the total variance due to the first k sample principal components:

λˆ1 + . . .+ λˆk

λˆ1 + . . .+ λˆp

=

λˆ1 + . . .+ λˆk

s11 + . . .+ spp

.

3.2 Reduction of number of columns in the dataset

Consider the spectral decomposition of the sample covariance in the matrix form:

S = λˆ1~u1~u

>

1 + . . .+ λˆp~up~u

>

p = UΛ̂U

>,

where

λˆ1 ≥ . . . ≥ λˆp ≥ 0,

U = [~u1, . . . , ~up] ==

u11 u21 . . . up1

u12 u22 . . . up2

...

...

. . .

...

u1p u2p . . . upp

,

7

and

Λ̂ =

λˆ1 . . .

λˆk

Then the i-th observation of all sample principal components is

~ˆyi =

yˆi1...

yˆip

=

u11xi1 + u12xi2 + . . .+ u1pxip...

up1xi1 + up2xi2 + . . .+ uppxip

=

u11 u12 . . . u1p

u21 u22 . . . u2p

...

...

. . .

...

up1 up2 . . . upp

xi1

xi2

...

xip

= U>~xi.

Then, the data matrix of sample principal components is

Ŷ :=

~ˆy>1

~ˆy>2

...

~ˆy>n

=

~x>1 U

~x>2 U

...

~x>nU

=

~x>1

~x>2

...

~x>n

U = XU = X[~u1, . . . , ~up].

In particular, if we only keep the observations of the first two sample PCs, we get

yˆ11 yˆ12

yˆ21 yˆ22

...

...

yˆn1 yˆn2

=

x11 x12 . . . x1p

x21 x22 . . . x2p

...

...

. . .

...

xn1 xn2 . . . xnp

u11 u21

u12 u22

...

...

u1p u2p

In practice, one is interested in plotting the PC scores of the observations for Ŷ1 and

Ŷ2, i.e., the scatter plot of [

yˆ11

yˆ12

]

,

[

yˆ21

yˆ22

]

, . . . ,

[

yˆn1

yˆn2

]

.

Meanwhile, each variable Xj should be also presented in the plot as the vector of loadings[

u1j

u2j

]

, which will be helpful for the interpretation of Ŷ1 and Ŷ2.

4 Data analysis and interpretation

See, e.g., Example 8.5 on page 451.

8

学霸联盟

Xiaodong Li UC Davis

1 Principal components: concepts and calculation

We would like to explain the variance-covariance structure of a set of variables by a few

linear combinations of these variables.

Let ~X =

X1...

Xp

be a random vector with population mean ~µ and population covariance

matrix Σ. Denote the spectral decomposition of Σ as

Σ = λ1~v1~v

>

1 + . . .+ λp~vp~v

>

p ,

where λ1 ≥ . . . ≥ λp > 0. Each eigenvector is represented as

~vk = [vk1, . . . , vkp]

>.

1.1 The first principal component

Consider a linear combination of the variates by ~a =

[

a1 . . . ap

]>

:

Y1 = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp.

In order to explain the variance-covariance of ~X as much as possible, we want to maximize

the variance of Y1. At the same time, in order to fix the scale, we impose the constraint

‖~a‖ = 1. Then the first principal component for ~X is defined by the following optimization:

max Var(Y1) = Var(~a

> ~X) = ~a>Σ~a

s.t. ‖~a‖2 = 1

The Lagrangian for the above optimization is

f(~a;λ) = ~a>Σ~a− λ(~a>~a− 1).

By setting the gradient to be equal to zero, we get

∇~af(~a;λ) = 2Σ~a− 2λ~a = ~0,

that is

Σ~a = λ~a,

1

This implies that ~a must be a unit eigenvector, and λ is the corresponding eigenvalue.

Notice what we aim to maximize is

Var(Y1) = Var(~a

> ~X) = ~a>Σ~a = λ~a>~a = λ.

Since λ1 ≥ λ2 ≥ . . . ≥ λp > 0, in order to maximize Var(Y1), we must have{

~a = ~v1

Var(Y1) = λ1.

To sum up, we have the following result:

Proposition 1. The first principal component is

Y1 = ~v

>

1

~X = v11X1 + v12X2 + . . . v1pXp,

where ~v1 is the eigenvector corresponding to the leading eigenvalue λ1. Moreover, Var(Y1) =

λ1.

1.2 The second principal component

In order to define the second principal component, we also look for some linear combination

of the variates

Y2 = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp,

such that its variance is as large as possible. However, here we have two constraints:

First, we need to impose ‖~a‖2 = 1 in order to fix the scale; Second, we require that Y2

explains the variance-covariance of ~X that has not been explained by Y1, which amounts to

cov(Y2, Y1) = 0. Notice that

cov(Y2, Y1) = cov(~a

> ~X,~v>1 ~X) = ~a

>Cov( ~X)~v1 = ~a>Σ~v1 = λ1~a>~v1,

so the constraint cov(Y2, Y1) = 0 is equivalent to ~a

>~v1 = 0. Consequently, the second

principal component is defined through the following optimization

max Var(Y2) = Var(~a

> ~X) = ~a>Σ~a

s.t. ~a>~a = 1,

~a>~v1 = 0.

The resulting Lagrangian is thus

f(~a;λ, γ) = ~a>Σ~a− λ(‖~a‖2 − 1)− γ~a>~v1.

Again, by setting the gradient to be equal to zero, we get

∇~af(~a;λ, γ) = 2Σ~a− 2λ~a− γ~v1 = ~0.

Taking the inner product of both sides with ~v1, we have

~v>1 (2Σ~a− 2λ~a− γ~v1) = 0.

2

The first term

~v>1 Σ~a = ~a

>Σ~v1 = ~a>(λ1~v1) = λ1~a>~v1 = 0.

By the constraint, the second term is ~v>1 ~a = 0. As a result, γ‖~v1‖2 = γ = 0. Then the

stationary condition is still in the form of

Σ~a = λ~a.

This implies that ~a must be a unit eigenvector while λ is the corresponding eigenvalue. Note

that we actually want to maximize

Var(Y2) = ~a

>Σ~a = λ‖~a‖2 = λ.

Since ~a>~v1 = 0, the best we can get is{

~a = ~v2

Var(Y2) = λ2.

To sum up, we have the following result:

Proposition 2. The second principal component is

Y2 = ~v

>

2

~X = v21X1 + v22X2 + . . . v2pXp,

where ~v2 is the eigenvector corresponding to the second largest eigenvalue λ2. Moreover,

Var(Y2) = λ2.

1.3 General concepts on principal components

In general, the k-th principal component can be defined iteratively through the following

procedure. Given the existing principal components Y1, . . . , Yk−1, we look for some linear

combination of the variates

Yk = ~a

> ~X = a1X1 + a2X2 + . . .+ apXp,

such that its variance is as large as possible. Still, we need to impose ‖~a‖2 = 1 in order to

fix the scale, and require that Yk to explain the variance-covariance of ~X that has not been

explained by Y1, . . . , Yk−1, which amounts to

cov(Yk, Y1) = cov(Yk, Y2) = . . . = cov(Yk, Yk−1) = 0.

Consequently, the k-th principal component is defined through the following optimization

max Var(Yk) = Var(~a

> ~X) = ~a>Σ~a

s.t. ~a>~a = 1,

~a>~v1 = 0,

...

~a>~vk−1 = 0.

In general, we have the following result

3

Proposition 3. The k-th principal component is

Yk = ~v

>

k

~X = vk1X1 + vk2X2 + . . . vkpXp,

where ~vk is the eigenvector corresponding to the k-th largest eigenvalue λk. Moreover,

Var(Yk) = λk.

The coefficients vk1, . . . , vkp are referred to as loadings on the random variables X1, ...,

Xp for the k-th principal component Yk.

1.4 Verification of covariance structures of the PCs

Denote

V =

[

~v1 . . . ~vp

]

=

v11 v21 . . . vp1

v12 v22 . . . vp2

...

...

. . .

...

v1p v2p . . . vpp

, (1.1)

where the row and column indices require attention. Recall that Σ = V ΛV >, where

Λ =

λ1 . . .

λp

.

The random vector of population principal components can thus be written as

~Y =

Y1...

Yp

=

~v

>

1

~X

...

~v>p ~X

=

~v

>

1

...

~v>p

~X = V > ~X.

The linear relationship ~Y = V > ~X gives

Cov(~Y ) = V >Cov( ~X)V = V >V ΛV >V = Λ,

which implies

Var(Yk) = λk, for k = 1, . . . , p.

and

Cov(Yj , Yk) = 0 for j 6= k.

1.5 Standardization

In certain applications, it is common to standardize the original variates X1, . . . , Xp into

Z1 =

X1 − µ1√

σ11

, Z2 =

X2 − µ2√

σ22

, . . . , Zp =

Xp − µp√

σpp

.

4

Then it is straightforward to get

Cov(~Z) =

ρ11 ρ12 . . . ρ1p

ρ21 ρ22 . . . ρ2p

...

...

. . .

...

ρp1 ρp2 . . . ρpp

= Corr( ~X)

where

ρjk =

σjk√

σjjσkk

.

If we still represent the spectral decomposition for the covariance of the standardized vari-

ables as

Cov(~Z) = λ1~v1~v

>

1 + . . .+ λp~vp~v

>

p

with λ1 ≥ λ2 ≥ . . . ≥ λp > 0. The principal components of Z1, . . . , Zp are

Yk = ~v

>

k

~Z = vk1Z1 + . . . vkpZp.

2 Basic Principal Component Analysis

2.1 Contribution of variables to the determination of PCs

One standard method to compare the contributions of different variables to the determina-

tion of a particular PC is through the formula:

Yk = ~v

>

k

~X = vk1X1 + vk2X2 + . . . vkpXp.

In other words, we compare the contributions of X1, . . . , Xp to the determination of Yk

based on the loadings vk1, . . . , vkp.

Here we introduce the second method: Compare the contributions of X1, . . . , Xp to the

determination of Yk based on the correlations Corr(X1, Yk), . . . ,Corr(Xp, Yk). Recall that

we have ~Y = V > ~X, where V is defined in (1.1). Then,

Cov(~Y , ~X) = Cov(V > ~X, ~X) = V >Cov( ~X) = V >Σ = V >V ΛV > = ΛV >

=

λ1 . . .

λp

v11 v12 . . . v1p

v21 v22 . . . v2p

...

...

. . .

...

vp1 vp2 . . . vpp

=

λ1v11 λ1v12 . . . λ1v1p

λ2v21 λ2v22 . . . λ2v2p

...

...

. . .

...

λpvp1 λpvp2 . . . λpvpp

,

The covariance between the k-th principal component and the j-th variate is

Cov(Yk, Xj) = λkvkj .

5

We further have

Corr(Yk, Xj) =

Cov(Yk, Xj)√

Var(Yk)Var(Xj)

=

λkvkj√

λkσjj

= vkj

√

λk

σjj

.

This gives the second method to compare the contributions of Xj ’s to the determination of

Yk through the correlation coefficients vkj

√

λk

σjj

for j = 1, . . . , p.

When the variables are standardized from Xj to Zj , we have and

Corr(Yk, Zj) = vkj

√

λk

ρjj

= vkj

√

λk.

This implies that for a fixed k, the loadings and correlation coefficients between the k-th PC

Yk and Z1, . . . , Zp are proportional. Therefore, there is no difference in comparing of the

contributions of variables to the determination of Yk based on either loadings or correlations.

2.2 Selecting the number of PCs

Recall that the spectral decomposition of the population covariance is

Σ = λ1~v1~v

>

1 + λ2~v2~v

>

2 + . . .+ λp~vp~v

>

p .

Denote

Σ =

σ11 σ12 . . . σ1p

σ21 σ22 . . . σ2p

...

...

. . .

...

σp1 σp2 . . . σpp

= V ΛV >.

The trace formula gives

trace(Σ) = trace(V ΛV >) = trace(ΛV >V ) = trace(Λ),

which is equivalent to

σ11 + σ22 + . . .+ σpp = λ1 + λ2 + . . .+ λp.

Since Var(Yk) = λk for k = 1, . . . , p and Var(Xj) = σjj , the trace formula gives

Var(X1) + . . .+ Var(Xp) = Var(Y1) + . . .+ Var(Yp)

Definition 4. The proportion of the total variance due to the first k principal components

is defined as

Var(Y1) + . . .+ Var(Yk)

Var(Y1) + . . .+ Var(Yp)

=

λ1 + . . .+ λk

λ1 + . . .+ λp

=

λ1 + . . .+ λk

σ11 + . . .+ σpp

.

Example: If

λ1 + λ2 + λ3

σ11 + σ22 + . . .+ σpp

> 90%,

then we can replace X1, . . . , Xp with Y1, Y2 and Y3 without much loss of information.

6

3 Sample PCA

3.1 Summary of results

Let

~x1, . . . , ~xn

be a sample with sample mean ~x and sample covariance S. By considering the spectral

decomposition of the sample covariance

S = λˆ1~u1~u

>

1 + . . .+ λˆp~up~u

>

p ,

where ~uk = [uk1, . . . , ukp]

>, we have the following results about sample PCs in the analogy

to population PCs:

• The k-the sample PC is defined as

Ŷk = uk1X1 + uk2X2 + . . .+ ukpXp.

The coefficients uk1, . . . , ukp are referred to as loadings for the k-th sample princi-

pal component Ŷk. In particular, the i-th observation of the k-th sample principal

component as

yˆik = ~u

>

k ~xi = uk1xi1 + uk2xi2 + . . .+ ukpxip.

• The sample variance of Ŷk is λˆk, and for k 6= j, the sample covariance between Ŷk and

Ŷj is 0.

• The sample correlation between Yk and Xj is ukj

√

λˆk

sjj

.

• The total sample covariances is

s11 + s22 + . . .+ spp = λˆ1 + λˆ2 + . . .+ λˆp,

and the proportion of the total variance due to the first k sample principal components:

λˆ1 + . . .+ λˆk

λˆ1 + . . .+ λˆp

=

λˆ1 + . . .+ λˆk

s11 + . . .+ spp

.

3.2 Reduction of number of columns in the dataset

Consider the spectral decomposition of the sample covariance in the matrix form:

S = λˆ1~u1~u

>

1 + . . .+ λˆp~up~u

>

p = UΛ̂U

>,

where

λˆ1 ≥ . . . ≥ λˆp ≥ 0,

U = [~u1, . . . , ~up] ==

u11 u21 . . . up1

u12 u22 . . . up2

...

...

. . .

...

u1p u2p . . . upp

,

7

and

Λ̂ =

λˆ1 . . .

λˆk

Then the i-th observation of all sample principal components is

~ˆyi =

yˆi1...

yˆip

=

u11xi1 + u12xi2 + . . .+ u1pxip...

up1xi1 + up2xi2 + . . .+ uppxip

=

u11 u12 . . . u1p

u21 u22 . . . u2p

...

...

. . .

...

up1 up2 . . . upp

xi1

xi2

...

xip

= U>~xi.

Then, the data matrix of sample principal components is

Ŷ :=

~ˆy>1

~ˆy>2

...

~ˆy>n

=

~x>1 U

~x>2 U

...

~x>nU

=

~x>1

~x>2

...

~x>n

U = XU = X[~u1, . . . , ~up].

In particular, if we only keep the observations of the first two sample PCs, we get

yˆ11 yˆ12

yˆ21 yˆ22

...

...

yˆn1 yˆn2

=

x11 x12 . . . x1p

x21 x22 . . . x2p

...

...

. . .

...

xn1 xn2 . . . xnp

u11 u21

u12 u22

...

...

u1p u2p

In practice, one is interested in plotting the PC scores of the observations for Ŷ1 and

Ŷ2, i.e., the scatter plot of [

yˆ11

yˆ12

]

,

[

yˆ21

yˆ22

]

, . . . ,

[

yˆn1

yˆn2

]

.

Meanwhile, each variable Xj should be also presented in the plot as the vector of loadings[

u1j

u2j

]

, which will be helpful for the interpretation of Ŷ1 and Ŷ2.

4 Data analysis and interpretation

See, e.g., Example 8.5 on page 451.

8

学霸联盟