STAT302/767 Midterm.
19 September 2019, 50 minutes 35 marks total.
FAMILY NAME: YOUR ID#:
GIVEN NAME: 767/ 302 (circle)
1 Q1
Vineyards maintain spray diaries, where sprays of fertilisers and pesticides
are recorded. Among the variables recorded are the number of sprays of each
product and the target (a particular insect, fungal disease, or weed species, as
well as “additives,” which are substances that make the spray disperse better).
Products have been aggregated into categories by target and chemistry,
with hard (“h”) chemistry being the most tightly regulated, soft (“s”) the
least regulated, and an intermediate category (“?”). 18 variables have been
created representing the different target-chemistry combinations. We scale
these variables to have standard deviation 1 and then perform a principal
components analysis.
A (2 marks) Discuss the decision to scale the data. In what situations
will this be advantageous? In what situations will it be misleading?
Answer: scaling the data is advantageous when the data have different
natural scales that do not reflect their importance. However, if the data
do have the same natural scale, scaling may inflate the importance of
variables that don’t vary much. Here, although the units are the same
(number of sprays) because different chemical types are being sprayed,
it’s not clear that the scales are the same. One extra “hard” spray may
be more important than many soft sprays. Therefore scaling the data
makes sense.
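The effect of scaling can be illustrated with a small simulation. This is only a sketch: the data below are made up (they are not the vineyard data), with two independent variables whose variances differ by a factor of 100.

```r
# Hypothetical data: two independent variables on very different scales.
set.seed(1)
x <- data.frame(big = rnorm(200, sd = 10), small = rnorm(200, sd = 1))

# Without scaling, PC1 is dominated by the high-variance variable.
unscaled <- prcomp(x, scale. = FALSE)
# With scaling, both variables enter on an equal footing.
scaled <- prcomp(x, scale. = TRUE)

# Share of the total variance captured by PC1 in each analysis.
prop_unscaled <- unscaled$sdev[1]^2 / sum(unscaled$sdev^2)
prop_scaled <- scaled$sdev[1]^2 / sum(scaled$sdev^2)
round(c(unscaled = prop_unscaled, scaled = prop_scaled), 2)
```

In the unscaled analysis PC1 is essentially the `big` variable; after scaling, PC1 captures only slightly more than half of the variance, as expected for two unrelated variables.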
B (2 marks) The variances of the resulting principal component scores are
given below. Sketch a scree plot.
>prcomp(agcount, scale=TRUE)->ag.pr
>round(ag.pr$sdev^2,2)
[1] 7.39 2.34 1.82 1.22 1.16 0.94 0.79 0.73 0.51 0.33 0.28 0.19
[13] 0.11 0.09 0.05 0.02 0.01 0.00
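The requested sketch can be checked against a plot drawn from the printed variances. The vector below is copied from the output above; the name `pc_var` is introduced here for illustration only.

```r
# Variances of the principal component scores, copied from the output above.
pc_var <- c(7.39, 2.34, 1.82, 1.22, 1.16, 0.94, 0.79, 0.73, 0.51,
            0.33, 0.28, 0.19, 0.11, 0.09, 0.05, 0.02, 0.01, 0.00)

# Scree plot: variance against component number, with a reference line
# at 1 for the "eigenvalue greater than 1" rule.
plot(seq_along(pc_var), pc_var, type = "b",
     xlab = "Component", ylab = "Variance (eigenvalue)",
     main = "Scree plot")
abline(h = 1, lty = 2)
```

With the fitted object available, `screeplot(ag.pr, type = "lines")` should give a similar picture.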
C (3 marks) How many principal components do you suggest using? Ex-
plain your reasoning. What proportion of total variability do they
account for? (Note that since the variables were initially standardised,
the sum of the variances is 18.)
Answer: The eigenvalue greater than 1 rule would suggest 5; the elbow
criterion would suggest 3 or 5. Four is not sensible because the
eigenvalues for the fourth and fifth components are very similar, suggesting
the four dimensional space would not be reproducible. The first 3
components account for 11.55/18 = 64% of the total variability, and the
first 5 for 13.93/18 = 77%.
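The proportion of total variability for any choice of components follows from the printed variances. A minimal sketch, using the stated total of 18 and a vector name (`pc_var`) introduced here for illustration:

```r
# Variances of the principal component scores, copied from part B.
pc_var <- c(7.39, 2.34, 1.82, 1.22, 1.16, 0.94, 0.79, 0.73, 0.51,
            0.33, 0.28, 0.19, 0.11, 0.09, 0.05, 0.02, 0.01, 0.00)

# Number of components retained by the "eigenvalue greater than 1" rule.
n_gt1 <- sum(pc_var > 1)

# Cumulative proportion of the total variance (18 standardised variables).
cum_prop <- cumsum(pc_var) / 18
round(cum_prop[c(3, 5)], 2)   # 0.64 0.77
```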
D (1 mark) There are two different management styles, contemporary
and future, with future vineyards stating that they strive to eliminate
the use of “hard” chemistry. Consider the plot of the first three prin-
cipal component scores on the following page. Black dots represent
vineyards with contemporary management, and open circles represent
future management. Identify any component (or combination of com-
ponents) that separates the two groups reasonably well.
Answer: All the discriminatory power appears to be in component 1.
(Note: some clever people were able to draw a diagonal line through
the PC1 and PC3 plot that correctly classified a couple more points, so it is
reasonable to say PC1 and PC3 if you have drawn this line or otherwise
explained.) It is not enough to pick the plots with good separation; you have to
look at which PCs are doing the separating.
E (2 marks) Below we give the correlations of the original variables and
the principal component scores for the first three components. Consider
the components identified in (D), and the relative scores of the future
and contemporary vineyards. Are the correlations consistent with the
stated definitions of future and contemporary management? Explain.
                     PC1    PC2    PC3
Additive ?          0.45   0.45  -0.47
Additive h          0.61  -0.18   0.63
Additive s         -0.27  -0.06  -0.07
Botrytis ?          0.77   0.21   0.16
Botrytis h          0.82  -0.04   0.17
Botrytis s         -0.63   0.03   0.02
Downy Mildew ?     -0.56   0.00   0.51
Downy Mildew h      0.76   0.44  -0.05
Downy Mildew s      0.47  -0.48  -0.44
Fertiliser ?       -0.53   0.63  -0.35
Fertiliser s       -0.81   0.13  -0.06
Grasses & Weeds h   0.71   0.39   0.31
Leafroller ?        0.75   0.10  -0.22
Mealy Bug ?         0.49   0.48  -0.22
Mealy Bug h         0.76  -0.32   0.27
Powdery Mildew ?   -0.13   0.59   0.45
Powdery Mildew h    0.92   0.19  -0.08
Powdery Mildew s   -0.56   0.58   0.29
Answer: Yes, what we see is consistent with the definition of future/contemporary.
PC1 showed future vineyards with low scores and contemporary vineyards
with high scores. We see PC1 is negatively correlated with most “soft” chemistry
variables and positively correlated with “hard” chemistry, implying that the future
vineyards are using more soft and less hard chemistry, and vice versa for
contemporary vineyards.
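The sign pattern can be checked directly from the PC1 column of the table. In this sketch the values are copied from the table above, and the vector names are shortened labels introduced for illustration.

```r
# PC1 correlations copied from the table above, grouped by chemistry.
hard <- c(Additive = 0.61, Botrytis = 0.82, DownyMildew = 0.76,
          GrassesWeeds = 0.71, MealyBug = 0.76, PowderyMildew = 0.92)
soft <- c(Additive = -0.27, Botrytis = -0.63, DownyMildew = 0.47,
          Fertiliser = -0.81, PowderyMildew = -0.56)

all(hard > 0)           # is every "hard" variable positively correlated with PC1?
names(soft)[soft > 0]   # "soft" variables that break the negative pattern
round(c(hard = mean(hard), soft = mean(soft)), 2)
```

The one exception among the soft variables is Downy Mildew s, which is positively correlated with PC1.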
2 Q2
The diet of 215 people is observed, and two sets of variables recorded: 5
macro nutrients (total energy in kJ, carbohydrate, protein, fat, and fibre in
grams) and 7 micronutrients (beta-carotene, vitamin C, vitamin A, retinol,
vitamin E, vitamin B6 and vitamin B12). A canonical correlation analysis
is performed.
A (1 mark) How many pairs of canonical variates will be produced?
Answer: 5 (minimum of 5 and 7)
B (4 marks) Two p-values produced by the CCorA function are given
below. Explain what null hypothesis they are testing, and the difference
between how they are generated. Under what circumstances is each
preferred? Do they give the same conclusion in this case? What is that
conclusion?
> CCorA(micros, macros, permutations=1000)->nutri.CCA
> nutri.CCA$p.perm
[1] 0.000999001
> nutri.CCA$p.Pillai
[1] 4.649273e-123
Answer: The null hypothesis is that there are no linear associations
between the two groups of variables. The Pillai p-value is the probability
of observing a Pillai's trace as extreme as or more extreme than the one
observed, computed from a parametric F distribution derived under
the multivariate normal assumption. The permutation test
randomly permutes the rows of one of the datasets, so that there is
no association between the datasets, and computes the test statistic
for each of these permutations to generate the distribution of the test
statistic under the null hypothesis. The F distribution is preferred if
the data are multivariate normal, and the permutation distribution in
other cases. Both p-values suggest we should reject the null hypothesis,
so we conclude there is a relationship between the macro and micro
nutrients.
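The permutation idea can be sketched in base R. This is not the `CCorA` implementation: it uses simulated data (the matrices `X` and `Y` are hypothetical), `cancor` from the stats package, and Pillai's trace computed as the sum of squared canonical correlations.

```r
set.seed(302)
n <- 100
# Hypothetical data: X and Y share a common signal z, so they are associated.
z <- rnorm(n)
X <- cbind(z + rnorm(n), rnorm(n))
Y <- cbind(z + rnorm(n), rnorm(n), rnorm(n))

# Pillai's trace: the sum of the squared canonical correlations.
pillai <- function(x, y) sum(cancor(x, y)$cor^2)
observed <- pillai(X, Y)

# Permuting the rows of Y breaks any association with X, giving draws of
# the statistic under the null hypothesis.
n_perm <- 1000
null_stats <- replicate(n_perm, pillai(X, Y[sample(n), , drop = FALSE]))

# Counting the observed statistic as one of the permutations, the smallest
# attainable p-value is 1 / (n_perm + 1).
p_perm <- (sum(null_stats >= observed) + 1) / (n_perm + 1)
p_perm
```

Under this convention the permutation p-value cannot fall below 1/1001 with 1000 permutations, which is consistent with the value 0.000999001 reported above: no permuted statistic was as large as the observed one.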
C (2 marks) Examine the RDA-Rsq and RDA-adj-Rsq given below. Comment
on the ability to predict someone's micro nutrient levels using
macronutrient information.
                 RDA-R.Sq  RDA-adj-Rsq
micros | macros      0.51         0.50
macros | micros      0.93         0.93
Answer: Both the RDA-R.Sq and RDA-adj-Rsq suggest about 50% of
the variability in micro nutrients is explained by the macro nutrients.
This is consistent with there being a relationship between the two sets
of variables, but suggests the relationship is not strong enough to do
accurate prediction of micro nutrient levels from macro nutrient levels.
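As a rough check on why the adjusted values are so close to the raw ones, the usual adjustment 1 - (1 - R²)(n - 1)/(n - m - 1) can be applied with n = 215 people and m explanatory variables. Whether `CCorA` uses exactly this formula is an assumption here, and the inputs are the rounded values from the table.

```r
# Adjusted R-squared for n observations and m explanatory variables
# (assumed formula; inputs are the rounded values from the table above).
adj_rsq <- function(rsq, n, m) 1 - (1 - rsq) * (n - 1) / (n - m - 1)

n <- 215
micros_given_macros <- adj_rsq(0.51, n, m = 5)  # 5 macronutrients as predictors
macros_given_micros <- adj_rsq(0.93, n, m = 7)  # 7 micronutrients as predictors
round(c(micros_given_macros, macros_given_micros), 2)
```

With 215 observations and only 5 or 7 predictors the penalty is small, so the adjustment changes little.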
D (2 marks) The biplot is included on the next page. Examine the second
row of plots labeled with the variable names. What do the coordinates
represent? What does the outer circle represent?
Answer: The coordinates represent the correlations of the original
variables with the first two canonical axes. When an arrow reaches the
outer circle, the corresponding variable is perfectly represented (it can
be perfectly reconstructed) by the first two canonical axes.
[Figure: CCorA biplot, four panels, each with CanAxis1 on the x-axis and CanAxis2 on the y-axis. Top row: CCorA object plots of the individuals (numbered points) for the first data table (Y) and the second data table (X). Bottom row: CCorA variable plots, with both axes running from -1.0 to 1.0, for the first data table (Y: Beta.carotene, Vitamin.C, Vitamin.A, Retinol, Vitamin.E, Vitamin.B6, Vitamin.B12) and the second data table (X: Energy, Carbohydrate, Protein, Fat, Fibre).]
E (4 marks) Retinol and vitamin B12 occur only in animal based food
sources (milk, meat, eggs). Based on the loadings of the first two
canonical variates, which of the macronutrients do you expect to be
associated with animal based food sources? Explain your reasoning.
In the plot of individuals, where would you expect to find vegan or
vegetarian individuals?
> round(nutri.CCA$corr.X.Cx,2)
             CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5
Energy           0.99     0.02     0.16     0.05     0.00
Carbohydrate     0.93     0.17     0.33     0.03     0.05
Protein          0.97    -0.20     0.07    -0.14     0.02
Fat              0.98    -0.08    -0.08     0.15     0.01
Fibre            0.90     0.37     0.04    -0.22     0.00
> round(nutri.CCA$corr.Y.Cy,2)
              CanAxis1 CanAxis2 CanAxis3 CanAxis4 CanAxis5
Beta.carotene     0.63     0.31    -0.17    -0.44    -0.27
Vitamin.C         0.70     0.38     0.16    -0.50    -0.16
Vitamin.A         0.79     0.16    -0.06    -0.34    -0.32
Retinol           0.88    -0.26     0.23     0.03    -0.32
Vitamin.E         0.97     0.21    -0.10    -0.10     0.03
Vitamin.B6        0.87     0.14     0.34    -0.26     0.16
Vitamin.B12       0.88    -0.34     0.09    -0.31     0.04
Answer: The first canonical axis is highly correlated with most
variables in both sets, including retinol and vitamin B12. It could be argued that
people eating a low amount of meat would have a low score on this
axis. The second axis has negative correlations with retinol and
vitamin B12. Protein and fat mirror this relationship with the second
canonical axis, so we expect that they are also associated with meat
eating. Therefore, we expect people who eat a lot of meat to have
low scores on the second axis, and vegans/vegetarians to have higher
scores.
3 Q3
Consider a set of metabolomics data, similar to our first assignment, where
the spectral intensity of 333 compounds has been measured on 118 fungal
samples. For each compound, a t-test has been performed to examine the
difference between two treatment groups (control fungi, and fungi treated
with short chain fatty acids). We are interested in discovering compounds
whose levels are affected by the treatment.
A (4 marks) Imagine making a histogram of the 333 t-test p-values. Make
two sketches: first, showing what you expect if all the compounds follow
the null hypothesis (i.e. are unaffected by the treatment), and second,
what you expect if 20% of the compounds are strongly affected by the
treatment (and the other 80% follow the null hypothesis). Put density
on the y-axis, put tick marks and labels on both the x- and y-axes, and
draw roughly to scale.
Answer: Because these are drawn as densities, the height of the first
(all following null) should be around 1, and the second (20% following
alternative) should have a spike near zero and then plateau at height
0.8 (pi0, the proportion of true null hypotheses).
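The two sketches can be compared with simulated p-values. This is only an illustration: the Beta(0.1, 20) distribution used for the "strongly affected" compounds is an assumption chosen to pile p-values up near zero, not something taken from the data.

```r
set.seed(767)
m <- 333                       # number of compounds (tests)

# Scenario 1: every compound follows the null, so p-values are Uniform(0, 1).
p_null <- runif(m)

# Scenario 2: 80% null, 20% strongly affected. The Beta(0.1, 20) draws are
# an assumed stand-in for "strongly affected": they pile up close to zero.
n_alt <- round(0.2 * m)
p_mixed <- c(rbeta(n_alt, 0.1, 20), runif(m - n_alt))

breaks <- seq(0, 1, by = 0.05)
op <- par(mfrow = c(1, 2))
h_null <- hist(p_null, breaks = breaks, freq = FALSE, ylim = c(0, 6),
               xlab = "p-value", main = "All null")
abline(h = 1, lty = 2)         # expected height under the null
h_mixed <- hist(p_mixed, breaks = breaks, freq = FALSE, ylim = c(0, 6),
                xlab = "p-value", main = "20% affected")
abline(h = 0.8, lty = 2)       # expected plateau height, pi0 = 0.8
par(op)
```

With bins of width 0.05, the first bar in the second panel is several times taller than the plateau, because roughly 20% of all p-values land in a bin that covers only 5% of the axis.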
B (1 mark) If we want to control the family wise type I error rate to be
less than 0.05, using the Bonferroni correction, what p-value threshold
would we use to declare “discoveries”?
Answer: 0.05/333 = 0.00015
C (1 mark) If we declare tests with a p-value of less than 0.05 to be
discoveries, how many discoveries do we expect to find if there are in
fact no true differences (i.e., the first scenario you sketched above)?
Answer: 333 × 0.05 = 16.65
D (1 mark) In fact, 123 p-values are found to be less than 0.05. What is
the expected false discovery rate, using the Benjamini and Hochberg
method?
Answer: 16.65/123 = 0.135
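The arithmetic in parts B to D can be reproduced in a few lines. The variable names are introduced here for illustration; the numbers are those given in the questions.

```r
m <- 333          # number of tests
alpha <- 0.05

# Part B: Bonferroni threshold for a family-wise error rate of alpha.
bonferroni_threshold <- alpha / m

# Part C: expected number of p-values below alpha when every null is true.
expected_false <- m * alpha            # 16.65

# Part D: expected false discoveries divided by the observed discoveries.
n_discoveries <- 123
fdr <- expected_false / n_discoveries

signif(c(bonferroni = bonferroni_threshold, fdr = fdr), 3)
```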
E (1 mark) In other situations, we have used linear discriminant
analysis, which creates linear combinations of the original variables that
maximize an ANOVA F-statistic, to describe the differences between
groups (e.g. treatment and control). What prevents us from using that
technique here?
Answer: We have p > n (333 compounds, 118 samples); LDA requires n > p
to compute W^{-1}, the inverse of the within-group covariance matrix. Note:
we don’t actually have information that our other LDA assumptions
(equal covariance, multivariate normality) are violated. If you wrote
about these, you got zero marks here, but some marks in part F if your
suggested technique did address the problem you mentioned.
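The p > n problem can be seen numerically. The sketch below uses simulated data with the same dimensions as the question (the data and the equal group sizes are assumptions) and shows that the within-group matrix W has rank well below p, so it has no inverse.

```r
set.seed(3)
n <- 118   # samples
p <- 333   # compounds
X <- matrix(rnorm(n * p), n, p)                       # hypothetical intensities
group <- rep(c("control", "treated"), length.out = n)

# Centre each column at its own group mean; W = t(Xc) %*% Xc is then the
# within-group sums of squares and cross-products matrix (p x p).
Xc <- X - apply(X, 2, function(col) ave(col, group))
W <- crossprod(Xc)

# rank(W) equals rank(Xc), counted from the non-negligible singular values.
d <- svd(Xc)$d
rank_W <- sum(d > 1e-8 * d[1])
c(rank = rank_W, dimension = ncol(W))
```

With two groups the rank is at most n - 2 = 116, far short of the 333 needed for W to be invertible.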
F (4 marks) Suggest an alternate technique that can cope with the
problem described in (E). Give the name of the technique, and describe the
criterion it optimises.
Answer: PLS-DA. This finds linear combinations that maximize the
covariance between a set of continuous variables (the metabolites) and a
set of dummy variables indicating categories. Note: 1 mark for getting
the name sort of right, 1 more for getting it exactly right, one mark for
mentioning covariance and one for mentioning dummy variables.
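The covariance criterion can be illustrated for the first component with a single dummy variable. This is a base R sketch on simulated data (the data, the effect size and the 20 affected columns are assumptions), not a full PLS-DA; in practice a package such as mixOmics would be used.

```r
set.seed(4)
n <- 118
p <- 333
group <- rep(c(0, 1), length.out = n)            # dummy variable: 1 = treated
X <- matrix(rnorm(n * p), n, p)
X[group == 1, 1:20] <- X[group == 1, 1:20] + 1   # hypothetical treatment effect

Xc <- scale(X, center = TRUE, scale = FALSE)     # column-centred metabolites
yc <- group - mean(group)                        # centred dummy variable

# First PLS weight vector: the unit vector w maximising cov(Xc %*% w, yc).
# It is proportional to t(Xc) %*% yc, so no matrix inverse is needed.
w <- crossprod(Xc, yc)
w <- w / sqrt(sum(w^2))
scores <- drop(Xc %*% w)
cov_pls <- cov(scores, yc)

# For comparison: covariances achieved by random unit-length directions.
cov_random <- replicate(200, {
  v <- rnorm(p)
  v <- v / sqrt(sum(v^2))
  cov(drop(Xc %*% v), yc)
})
c(pls = cov_pls, best_random = max(abs(cov_random)))
```

Because the weight vector comes from a cross-product rather than from W^{-1}, the calculation goes through even though p > n.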