TRIMESTER 2 EXAMINATION 2020/2021
STAT40150
Multivariate Analysis
Associate Professor R. Killick
Associate Professor E. Cox
Professor I.C. Gormley∗
Time Allowed: 2 hours
Instructions for Candidates
Answer all three questions.
Each question is worth 100 marks. For full marks, you must clearly show all steps,
define all notation and explain all reasoning.
Candidates should upload their examination script as a single (scanned) pdf file to
Brightspace within 30 minutes of the end of the exam.
Candidates may refer to their notes or online references when answering these questions,
or use software for numerical calculations, but they must not communicate with anyone
else during the examination. Candidates must show complete workings, and
associated reasoning, in their submitted examination script. Correct answers alone
will not achieve full marks.
Candidates are required to read, complete, and upload the Honour Code form that has
been distributed as the first page in their single pdf file submission.
1. A chemical analysis of 178 wines grown in the same region in Italy but derived from
three different cultivars was conducted. The analysis determined the quantities of
5 constituents found in each wine.
(a) Under the factor analysis model a p-dimensional observation xi (i = 1, . . . , N) is modeled as xi = µ + Λfi + εi, where it is assumed that fi ∼ MVNq(0, I), εi ∼ MVNp(0, Ψ) and Cov(fi, εi) = 0. Derive the marginal distribution p(xi), clearly motivating the steps in the derivation. [10 marks]
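As a pointer, one standard route uses the fact that an affine transformation of multivariate normal random vectors is itself multivariate normal, so it suffices to compute the first two moments. Under the assumptions above,
E(xi) = µ + Λ E(fi) + E(εi) = µ,
Cov(xi) = Λ Cov(fi) Λ' + Cov(εi) = ΛΛ' + Ψ,
where the cross terms vanish because Cov(fi, εi) = 0, so that xi ∼ MVNp(µ, ΛΛ' + Ψ).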
(b) The variances of the 5 constituents are detailed in Table 1. Would you advise
standardising these data prior to the application of factor analysis to them?
Prove that your answer is correct using the factor model definition detailed in
1(a), clearly defining any notation you use. [10 marks]
Alcohol Sugar free extract Fixed acidity Tartaric acid Chloride
0.66 5.01 350.44 0.40 2505.74
Table 1: Constituents’ variances
(c) A factor analysis model with 2 latent factors was applied to the correlation
matrix of these data with the resulting loadings matrix detailed below. Which
constituent has the smallest uniqueness? What is that uniqueness value?
[15 marks]
Factor1 Factor2
Alcohol -0.084 0.409
Sugar.free_extract -0.058 0.996
Fixed_acidity 0.952 0.127
Tartaric_acid 0.624 -0.205
Chloride -0.245 0.073
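For illustration, since the analysis was carried out on the correlation matrix, the uniquenesses can be obtained from the loadings as one minus the communalities (the row sums of squared loadings). A minimal R sketch, re-entering the loadings above by hand, is:

L <- matrix(c(-0.084,  0.409,
              -0.058,  0.996,
               0.952,  0.127,
               0.624, -0.205,
              -0.245,  0.073),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("Alcohol", "Sugar.free_extract", "Fixed_acidity",
                              "Tartaric_acid", "Chloride"),
                            c("Factor1", "Factor2")))
communality <- rowSums(L^2)      # variance of each (standardised) constituent explained by the two factors
uniqueness  <- 1 - communality   # diagonal elements of Psi on the correlation scale
round(uniqueness, 3)
# varimax(L)$loadings            # rotated loadings; relevant to part 1(d)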
(d) Describe why a factor rotation is often employed when fitting a factor analysis model, clearly defining any notation you use. How would you expect the application of the ‘varimax’ rotation to affect the factor loadings detailed in 1(c)? [20 marks]
(e) Principal components analysis was then applied to the correlation matrix of
these data. The standard deviations of principal components 1 through 5
are 1.3224, 1.1795, 0.9398, 0.8183 and 0.55432, respectively. Compute the
proportion of the variance explained by each principal component and illustrate
the resulting scree plot (by hand; software is not required).
[15 marks]
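For illustration, the proportions of variance and the scree plot can be computed directly from the standard deviations quoted above; a minimal R sketch is:

sdev <- c(1.3224, 1.1795, 0.9398, 0.8183, 0.55432)   # standard deviations of PCs 1 to 5
prop <- sdev^2 / sum(sdev^2)                         # proportion of variance explained by each PC
round(prop, 3)                                       # the variances sum to (approximately) p = 5,
                                                     # as expected when the correlation matrix is used
plot(prop, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")      # scree plot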
(f) The resulting principal components analysis loadings matrix is given below.
Given the loadings, and your answer to 1(e), how many principal components
do you think are required to represent these data? Explain your answer.
[10 marks]
PC1 PC2 PC3 PC4 PC5
Alcohol -0.19 0.66 -0.16 0.64 0.29
Sugar.free_extract -0.30 0.65 0.18 -0.51 -0.45
Fixed_acidity 0.57 0.32 0.34 -0.34 0.58
Tartaric_acid 0.64 0.12 0.24 0.40 -0.60
Chloride -0.37 -0.17 0.88 0.24 0.08
(g) The (2-dimensional) scores resulting from the application of factor analysis and from the application of principal components analysis to the wine data are illustrated in Figures 1a and 1b respectively. What method could be used to compare the similarity of these resulting scores plots? Explain how your suggested method works, clearly defining any notation used, and how the method quantifies the similarity of the scores plots. [10 marks]
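For illustration, one way to compare two such configurations is an orthogonal Procrustes analysis, which rotates one set of scores towards the other and reports the residual sum of squares. A simplified base R sketch (centring only, and omitting the optional scaling/dilation step) is given below; fa_scores and pca_scores denote the hypothetical n × 2 score matrices plotted in Figures 1a and 1b.

procrustes_ss <- function(X, Y) {
  # centre both configurations
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  # optimal orthogonal rotation of Y towards X (orthogonal Procrustes solution)
  s <- svd(t(Y) %*% X)
  Q <- s$u %*% t(s$v)
  sum((X - Y %*% Q)^2)   # residual sum of squares: smaller values indicate more similar configurations
}
# e.g. procrustes_ss(fa_scores, pca_scores)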
(h) Based on the outputs provided above, do you think factor analysis or principal
components analysis would be the preferred dimension reduction approach for
these data? [10 marks]
[Total: 100 marks]
(a) Two-dimensional factor analysis scores (Dimension 2 against Dimension 1).
(b) Two-dimensional principal components analysis scores (Dimension 2 against Dimension 1).
Figure 1: Two-dimensional scores for the wine data.
2. (a) In linear discriminant analysis the aim is to maximise the posterior probability P(g|x) that observation x belongs to class g, for g = 1, . . . , G. Show that, in the case of p = 1 with G = 2 classes of equal size and with means µ1 and µ2 respectively, the Bayes decision boundary between the 2 classes corresponds to the point

x = (µ1 + µ2)/2.

Clearly define all notation and motivate each step taken in your solution.
[20 marks]
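As a pointer, one standard route assumes a common within-class variance σ² and, since the classes are of equal size, equal prior probabilities π1 = π2 = 1/2. By Bayes’ theorem P(g|x) ∝ πg φ(x; µg, σ²), where φ denotes the normal density, so the boundary is the point at which φ(x; µ1, σ²) = φ(x; µ2, σ²). Taking logarithms gives (x − µ1)² = (x − µ2)², which simplifies to x = (µ1 + µ2)/2.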
(b) Archaeological researchers analyzed 180 glass vessels from the 15th-17th centuries using x-ray methods to determine the concentrations of 8 elements present in the glass vessels. Four major compositional types could be distinguished, with 145, 15, 10 and 10 vessels in each type respectively. The researchers’ goal was to predict the compositional type from the 8 elements alone. The data are illustrated in Figure 2.
i. Which of the variables seem most likely to be useful in predicting compositional type? Explain your reasoning. [5 marks]
ii. It was not possible to use quadratic discriminant analysis to classify these
data. Why is this the case? [10 marks]
iii. Three new glass vessels were analyzed to determine the concentrations
of the 8 elements present. The resulting linear discriminant functions are
detailed in the output below. To which compositional type would you assign each vessel? Explain your answer. [5 marks]
[,1] [,2] [,3] [,4]
[1,] 2327 2299 2279 2161
[2,] 2243 2268 2226 2096
[3,] 1806 1816 1946 2014
iv. Which of the 3 glass vessels in (iii) has the lowest uncertainty associated with its classification? [30 marks]
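For illustration, assuming the rows of the output in 2(b)(iii) correspond to the three new vessels and the columns to the values of the linear discriminant function for the four compositional types, the classifications and associated posterior probabilities could be obtained along the following lines (the matrix d simply re-enters the output by hand):

d <- matrix(c(2327, 2299, 2279, 2161,
              2243, 2268, 2226, 2096,
              1806, 1816, 1946, 2014),
            nrow = 3, byrow = TRUE)
apply(d, 1, which.max)    # predicted compositional type for each vessel
# posterior probabilities P(g|x) are proportional to exp(discriminant function);
# a numerically stabilised softmax avoids overflow with values of this size
post <- t(apply(d, 1, function(x) { e <- exp(x - max(x)); e / sum(e) }))
round(post, 3)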
v. An additional variable is available for the glass vessels data that details the geographical region in which the vessel was found. Would it be appropriate to include this variable when classifying a glass vessel using linear discriminant analysis? Explain your answer. [10 marks]
Figure 2: Pairs plot of the 8 elements (Na2O, MgO, Al2O3, SiO2, P2O5, SO3, Cl and K2O) in the glass vessels data. Vessels from the four different compositional types are illustrated using different colours and shapes.
vi. The k-nearest neighbours classification algorithm is an alternative, non-parametric classification technique. How might one report the uncertainty in the resulting classification when using the k-nearest neighbours classification algorithm? [10 marks]
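For illustration, one common summary is the proportion of the k nearest neighbours that vote for the winning class, which knn() in the class package returns as an attribute when prob = TRUE. A minimal sketch on toy data (the glass vessel variables would replace train and test in practice) is:

library(class)
set.seed(1)
train <- matrix(rnorm(40), ncol = 2)    # toy training data: 20 observations on 2 variables
cl    <- rep(c("A", "B"), each = 10)    # toy class labels
test  <- matrix(rnorm(6), ncol = 2)     # toy new observations
fit <- knn(train, test, cl, k = 5, prob = TRUE)
attr(fit, "prob")   # proportion of the k nearest neighbours voting for the winning class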
vii. The k-nearest neighbours algorithm often performs poorly when p is large.
Why is this the case? [10 marks]
[Total: 100 marks]
3. As part of a cancer study, a cancer cell line microarray data set was collected,
consisting of 6,830 gene expression measurements on 64 cancer cell lines. The cancer
type of each cell line is known. Interest lies in uncovering any clustering structure
in the set of cell lines.
(a) Figure 3 illustrates three dendrograms resulting from applying hierarchical clustering to the cancer cell lines data. Which, if any, of the linkage types used in Figure 3 would you recommend for use here? Explain your reasoning. [5 marks]
(b) Based on your answer to 3(a), how many clusters would you suggest are
present? Explain your answer. [4 marks]
(c) An alternative clustering method, the k-means algorithm, was applied to the
cancer cell lines data, resulting in Figure 4. How many clusters do you think
this approach suggests are present? Explain your answer. [5 marks]
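For illustration, a plot such as Figure 4 can be produced by running k-means for a range of values of k and recording the total within-cluster sum of squares; a minimal R sketch on stand-in data is:

set.seed(1)
X <- matrix(rnorm(64 * 50), nrow = 64)   # stand-in for the 64 x 6,830 expression matrix
wss <- sapply(1:20, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:20, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within cluster sum of squares")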
(d) How uncertain is your answer to 3(c)? Explain your answer. [8 marks]
(e) A clustering solution from the application of hierarchical clustering to these
data was compared to the known cancer type, resulting in an adjusted Rand
index of 0.17. A similar comparison from the output of the k-means algorithm
to the known cancer type resulted in an adjusted Rand index of 0.22. Explain
what the adjusted Rand index measures. Which clustering solution would be
viewed as preferable here? [7 marks]
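For illustration, the adjusted Rand index can be computed with adjustedRandIndex() in the mclust package; a minimal sketch on stand-in labels is:

library(mclust)
set.seed(1)
truth  <- sample(1:4, 64, replace = TRUE)   # stand-in for the known cancer types
labels <- sample(1:4, 64, replace = TRUE)   # stand-in for a clustering solution
adjustedRandIndex(labels, truth)   # 1 = identical partitions; values near 0 = chance-level agreement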
(f) Rather than performing clustering on the entire data matrix, we could perform
hierarchical or k-means clustering on the first few principal component scores.
Why might this be a useful approach here? [5 marks]
(g) The researchers involved in this study wish to know which genes differ the most
across the uncovered clusters. Suggest a way to answer this question.
[8 marks]
(h) A model-based approach to clustering these data was also used, with some
output shown in Figure 5. Based on this, which number of components and
model type was deemed to be optimal? Explain your answer. [8 marks]
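For illustration, output such as Figure 5 is typically produced by fitting a range of models with the mclust package; a minimal sketch on stand-in data is:

library(mclust)
set.seed(1)
X <- matrix(rnorm(64 * 5), nrow = 64)   # stand-in for, e.g., the first few PC scores of the expression data
fit <- Mclust(X, G = 1:10, modelNames = c("EII", "VII", "EEI", "VEI"))
plot(fit, what = "BIC")   # one BIC curve per model type, as in Figure 5
summary(fit)              # reports the model type and number of components with the highest BIC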
(i) Sketch a simple 2-dimensional plot to explain what the VEI model type means
in terms of the shape of the clusters produced. Write down the parameterisation
of the covariance matrix that corresponds to the VEI model, explaining all
notation used. [10 marks]
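As a pointer, the mclust model names refer to the eigen-decomposition of the cluster covariance matrices,
Σg = λg Dg Ag Dg',
where λg > 0 controls the volume, the diagonal matrix Ag (with |Ag| = 1) the shape, and the orthogonal matrix Dg the orientation of cluster g; the three letters indicate, in that order, whether the volume, shape and orientation are Equal across clusters, Vary across clusters, or are given by the Identity matrix. The VEI parameterisation can be read off from this convention.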
(j) Show that it is not possible to directly maximise the likelihood function in
model-based clustering using the standard analytical approach. Show your
workings, motivating each step taken, and clearly define any notation used.
[20 marks]
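As a pointer, the observed-data log-likelihood in model-based clustering takes the form
ℓ(θ) = Σ_{i=1}^{N} log { Σ_{g=1}^{G} τg φ(xi; µg, Σg) },
where τg are the mixing proportions and φ(·; µg, Σg) denotes the multivariate normal density; the sum over components sitting inside the logarithm is what prevents the score equations ∂ℓ/∂θ = 0 from yielding closed-form solutions.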
(k) The Expectation-Maximisation (EM) algorithm can be used to obtain maximum likelihood estimates of the parameters in model-based clustering, and of the parameters in the factor analysis model. Explain how the EM algorithm works in general, indicating how the E-step differs between the model-based clustering and factor analysis settings. [20 marks]
[Total: 100 marks]
Figure 3: Dendrograms of the cancer cell line data, obtained using complete, average and single linkage.
Figure 4: Within-cluster sum of squares against the number of clusters resulting from the application of k-means clustering to the cancer cell line data.
Figure 5: Bayesian information criteria (BIC) against the number of components (1 to 10) for the EII, VII, EEI and VEI models fitted using model-based clustering.
—o0o—
©UCD 2020/2021