DATA2002
ANOVA contrasts
Garth Tarr
The University of Sydney
In this lecture
Contrasts
Confidence intervals
Contrasts
Beyond ANOVA: contrasts
Recall our model: for $i = 1, 2, \ldots, g$ and $j = 1, 2, \ldots, n_i$, the $j$-th observation in the $i$-th sample is modelled as the value taken by
$$Y_{ij} \sim \mathcal{N}(\mu_i, \sigma^2),$$
and all random variables are assumed independent.

We can rewrite this as:
$$Y_{ij} = \mu_i + \varepsilon_{ij},$$
where $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ is the error term.

The ANOVA $F$-test is a test of the hypothesis
$$H_0:\ \mu_1 = \mu_2 = \ldots = \mu_g.$$

If this hypothesis is "rejected", then what?

Further analysis reduces to the study of contrasts.
Contrasts
A contrast is a linear combination where the coefficients add to zero. In an ANOVA context, a contrast is a linear combination of means.

We make the distinction between two kinds of contrast:

population contrasts: contrasts involving the population group means, i.e. the $\mu_i$'s;

sample contrasts: contrasts involving the sample group means, i.e. the $\bar{y}_{i\bullet}$'s and $\bar{Y}_{i\bullet}$'s.

For example, we might consider the population contrast
$$\mu_1 - \mu_2,$$
whose corresponding observed sample version is
$$\bar{y}_{1\bullet} - \bar{y}_{2\bullet},$$
which is the observed value of the random variable
$$\bar{Y}_{1\bullet} - \bar{Y}_{2\bullet}.$$
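To make the definition concrete, here is a small R sketch (not from the original slides) showing two valid sets of contrast coefficients for $g = 3$ groups; in each case the coefficients sum to zero:

```r
# Two example sets of contrast coefficients for g = 3 groups.
# Each is a valid contrast because its coefficients sum to zero.
c_pair = c(1, -1, 0)       # compares group 1 with group 2
c_avg = c(1, -1/2, -1/2)   # compares group 1 with the average of groups 2 and 3
sum(c_pair)
sum(c_avg)
```

The second set illustrates that contrasts are more general than simple pairwise differences between two group means.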
Plant growth
The PlantGrowth data has results from an experiment to compare yields (as
measured by dried weight of plants) obtained under a control and two different
treatment conditions (Dobson, 1983, Table 7.1).

# the data is already "in" R
data("PlantGrowth")
library(tidyverse)
PlantGrowth %>% ggplot() +
  aes(y = weight, x = group,
      colour = group) +
  geom_boxplot(coef = 10) +
  geom_jitter(width = 0.1, size = 5) +
  theme(legend.position = "none") +
  labs(y = "Weight (g)", x = "Group")

We want to compare the means of the three different groups, the control,
treatment 1 and treatment 2 groups. The null hypothesis is
$$H_0:\ \mu_1 = \mu_2 = \mu_3.$$
Plant growth
plant_anova = aov(weight ~ group, data = PlantGrowth)
summary(plant_anova)

            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value, $P(F_{2,27} \geq 4.846) = 0.0159$, is less than 0.05, so we reject the null
hypothesis at the 5% level of significance and conclude there is evidence to suggest
that at least one of the groups has a significantly different mean yield to the others.

Is it ctrl vs trt1?
Is it ctrl vs trt2?
Is it trt1 vs trt2?
Or is it some other linear combination of the means that is different?

Which means are different?
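As an aside (not part of the original slides), base R's pairwise.t.test() answers the pairwise versions of these questions directly; with pool.sd = TRUE it uses the pooled residual standard deviation across all groups, and p.adjust.method = "none" leaves the p-values unadjusted:

```r
data("PlantGrowth")
# all pairwise comparisons using the pooled standard deviation,
# with no multiple-testing adjustment applied
res = pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                      pool.sd = TRUE, p.adjust.method = "none")
res
```

Each entry of the resulting p-value matrix corresponds to an unadjusted contrast $t$-test of the kind developed in the following slides.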
Distribution of sample contrasts
Any $c_1, \ldots, c_g$ with $\sum_{i=1}^{g} c_i = 0$ defines a sample contrast, $\sum_{i=1}^{g} c_i \bar{Y}_{i\bullet}$.

Under our normal-with-equal-variances model, this random variable has
distribution given by
$$\sum_{i=1}^{g} c_i \bar{Y}_{i\bullet} \sim \mathcal{N}\left( \sum_{i=1}^{g} c_i \mu_i,\ \sigma^2 \sum_{i=1}^{g} \frac{c_i^2}{n_i} \right).$$

The corresponding population contrast $\sum_{i=1}^{g} c_i \mu_i$ is the expected value of the (random)
sample contrast.

Conversely, the (observed) sample contrast $\sum_{i=1}^{g} c_i \bar{y}_{i\bullet}$ is an estimate of the
corresponding population contrast;

the (random) sample contrast $\sum_{i=1}^{g} c_i \bar{Y}_{i\bullet}$ is the corresponding estimator.
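A quick simulation (illustrative only, not from the original slides; the group sizes and parameter values below are made up) can check this distributional result empirically:

```r
# Simulate the sampling distribution of a contrast for g = 3 groups
set.seed(2002)
mu = c(5, 6, 7)      # assumed population group means
sigma = 2            # assumed common standard deviation
n = c(10, 12, 8)     # assumed group sizes
cc = c(1, -1/2, -1/2)  # contrast coefficients (sum to zero)
sim = replicate(20000, {
  ybar = mapply(function(m, ni) mean(rnorm(ni, m, sigma)), mu, n)
  sum(cc * ybar)  # one simulated sample contrast
})
mean(sim)                    # should be close to sum(cc * mu) = -1.5
sd(sim)                      # should be close to the theoretical value below
sigma * sqrt(sum(cc^2 / n))  # theoretical standard deviation
```

The simulated mean and standard deviation line up with $\sum_i c_i \mu_i$ and $\sigma \sqrt{\sum_i c_i^2 / n_i}$ as the formula predicts.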
Behaviour of contrasts under the null hypothesis
Under the "ANOVA null hypothesis" $H_0: \mu_1 = \ldots = \mu_g\ (= \mu,\ \text{say})$,

all population contrasts are zero:
$$\sum_{i=1}^{g} c_i \mu_i = \mu \sum_{i=1}^{g} c_i = 0\,;$$

all (random) sample contrasts have expectation zero:
$$E\left( \sum_{i=1}^{g} c_i \bar{Y}_{i\bullet} \right) = \sum_{i=1}^{g} c_i \mu_i = 0\,.$$

Therefore the "ANOVA null hypothesis" can be rephrased as "all population
contrasts are zero".
Maybe not what we want
Thus in some examples, in a particular sense, the "ANOVA null hypothesis" may be
"too strong":

we may only wish to test that one (or more) "special" population contrasts are zero.

Also, the "ANOVA null hypothesis" may not be rejected for the reason we want:

some contrasts may be non-zero, but are they the ones we are interested in?

$t$-tests for individual contrasts

Suppose we really only want to test $H_0:\ \sum_{i=1}^{g} c_i \mu_i = 0$ for some "special
contrast" given by $c_1, \ldots, c_g$ (with $\sum_{i=1}^{g} c_i = 0$).

We can of course perform the ANOVA Mean-Square Ratio $F$-test, but can we
possibly do better?

We can perform a more "targeted" $t$-test using the corresponding sample contrast
and the residual mean square.
Thus a $t$-statistic for testing the hypothesis that $\sum_{i=1}^{g} c_i \mu_i = 0$ is
$$t = \frac{\sum_{i=1}^{g} c_i \bar{Y}_{i\bullet}}{\hat{\sigma} \sqrt{\sum_{i=1}^{g} \frac{c_i^2}{n_i}}}$$
which has a $t_{N-g}$ distribution if $\sum_{i=1}^{g} c_i \mu_i = 0$.

This generalises the two-sample $t$-statistic.
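To see the generalisation, here is a sketch (with made-up data, not from the original slides) checking that for $g = 2$ groups and coefficients $c = (1, -1)$, the contrast $t$-statistic coincides with the classical pooled two-sample $t$-statistic:

```r
set.seed(1)
y1 = rnorm(10, mean = 5)  # made-up sample for group 1
y2 = rnorm(12, mean = 6)  # made-up sample for group 2
n = c(length(y1), length(y2))
ybar = c(mean(y1), mean(y2))
# with g = 2 the residual mean square is the pooled variance estimate
sigma2_hat = ((n[1] - 1) * var(y1) + (n[2] - 1) * var(y2)) / (sum(n) - 2)
cc = c(1, -1)
t_contrast = sum(cc * ybar) / sqrt(sigma2_hat * sum(cc^2 / n))
t_classic = unname(t.test(y1, y2, var.equal = TRUE)$statistic)
all.equal(t_classic, t_contrast)  # TRUE
```

The two statistics agree because $\sum_i c_i \bar{Y}_{i\bullet} = \bar{Y}_{1\bullet} - \bar{Y}_{2\bullet}$ and $\hat{\sigma}$ is exactly the pooled standard deviation when $g = 2$.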
Yield
Let $\mu_1$, $\mu_2$ and $\mu_3$ represent the population means of treatment 1, treatment 2 and
the control group, respectively.

Let's consider whether there is a difference between treatment 1, trt1, and treatment 2,
trt2. This corresponds to contrast coefficients $c_1 = 1$, $c_2 = -1$ and $c_3 = 0$.

plant_summary = PlantGrowth %>% 
  mutate(group = factor(group, levels = c("trt1", "trt2", "ctrl"))) %>% 
  group_by(group) %>% 
  summarise(n = n(), mean_weight = mean(weight)) %>% 
  mutate(contrast_coefficients = c(1, -1, 0),
         c_ybar = mean_weight * contrast_coefficients)
plant_summary

# A tibble: 3 × 5
  group     n mean_weight contrast_coefficients c_ybar
  <fct> <int>       <dbl>                 <dbl>  <dbl>
1 trt1     10        4.66                     1   4.66
2 trt2     10        5.53                    -1  -5.53
3 ctrl     10        5.03                     0      0

sum(plant_summary$c_ybar) # sample contrast

[1] -0.865
Residual standard error
Recall the ANOVA analysis from earlier:

plant_anova = aov(weight ~ group, data = PlantGrowth)
summary(plant_anova)

            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The "Residual Mean Sq" is the estimate of $\sigma^2$, the (population) variance (within
each group).

The "Residual standard error" is $\sqrt{\text{Residual Mean Sq}}$, the estimate of $\sigma$, the
(population) standard deviation (within each group).
Residual standard error
We use the tidy() function from the broom package to help extract these terms
from the anova object.

library(broom)
tidy(plant_anova)

# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 group         2  3.77  1.88       4.85  0.0159
2 Residuals    27 10.5   0.389     NA    NA     

resid_ms = tidy(plant_anova)$meansq[2]
resid_ms

[1] 0.3885959

resid_se = sqrt(resid_ms)
resid_se

[1] 0.6233746
Calculating the test statistic
The observed test statistic for the contrast is
$$t_0 = \frac{\sum_{i=1}^{g} c_i \bar{y}_{i\bullet}}{\hat{\sigma} \sqrt{\sum_{i=1}^{g} \frac{c_i^2}{n_i}}} = \frac{\bar{y}_{1\bullet} - \bar{y}_{2\bullet}}{\sqrt{\text{Residual Mean Sq}} \times \sqrt{\frac{1^2}{10} + \frac{(-1)^2}{10} + \frac{0}{10}}}$$

From our earlier summary,

(n_i = plant_summary %>% pull(n))
[1] 10 10 10
(ybar_i = plant_summary %>% pull(mean_weight))
[1] 4.661 5.526 5.032
(c_i = plant_summary %>% pull(contrast_coefficients))
[1] 1 -1 0
(se = sqrt(resid_ms * sum((c_i^2) / n_i)))
[1] 0.2787816
(t_stat = sum(c_i * ybar_i)/se)
[1] -3.102787
Calculating the p-value
The test statistic has a $t_{N-g}$ distribution if $\sum_{i=1}^{g} c_i \mu_i = 0$.

This is the same degrees of freedom as the denominator (residual) degrees of
freedom from the ANOVA!

# observed test statistic
(t_stat = sum(c_i * ybar_i)/se)
[1] -3.102787
plant_anova$df.res
[1] 27

A (two-sided) p-value is obtained using

2*pt(abs(t_stat), df = plant_anova$df.res, lower.tail = FALSE)
[1] 0.004459236

Why is this better than an ordinary two-sample $t$-test?

(Potentially) a smaller standard error! Better estimate of $\sigma$.
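The gain can be seen directly in R (a sketch, not from the original slides): an ordinary two-sample $t$-test on trt1 vs trt2 discards the control group, so its variance estimate has only 18 degrees of freedom, whereas the contrast $t$-test pools the variance estimate across all three groups:

```r
data("PlantGrowth")
trt1 = PlantGrowth$weight[PlantGrowth$group == "trt1"]
trt2 = PlantGrowth$weight[PlantGrowth$group == "trt2"]
# ordinary two-sample t-test: variance estimated from 20 observations only
two_sample = t.test(trt1, trt2, var.equal = TRUE)
unname(two_sample$parameter)  # 18 degrees of freedom
# contrast t-test: variance pooled across all 3 groups
plant_anova = aov(weight ~ group, data = PlantGrowth)
df.residual(plant_anova)      # 27 degrees of freedom
```

More degrees of freedom means a more precise estimate of $\sigma$ and a less extreme critical value, hence (potentially) a smaller standard error for the comparison.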
Confidence intervals
Confidence interval

A confidence interval for a population contrast can be obtained in the usual way,
based on the $t$-statistic.

Suppose the "multiplier", or critical value, $t^\star$ satisfies
$$P(-t^\star \le t_{N-g} \le t^\star) = 0.95\,.$$

Then whatever be the "true" values of the $\mu_i$'s, since the quantity
$$\frac{\sum_{i=1}^{g} c_i \bar{Y}_{i\bullet} - \sum_{i=1}^{g} c_i \mu_i}{\hat{\sigma} \sqrt{\sum_{i=1}^{g} \frac{c_i^2}{n_i}}} \sim t_{N-g}\,,$$
we have, using the usual confidence interval-type manipulations,
$$P\left( \sum_i c_i \bar{Y}_{i\bullet} - t^\star \hat{\sigma} \sqrt{\sum_i \frac{c_i^2}{n_i}} \;\le\; \sum_i c_i \mu_i \;\le\; \sum_i c_i \bar{Y}_{i\bullet} + t^\star \hat{\sigma} \sqrt{\sum_i \frac{c_i^2}{n_i}} \right) = 0.95\,.$$
"Observed value" of confidence interval

For observed sample means $\bar{y}_{1\bullet}, \ldots, \bar{y}_{g\bullet}$, a 95% confidence interval for the "true"
population contrast $\sum_i c_i \mu_i$ is given by
$$\underbrace{\sum_i c_i \bar{y}_{i\bullet}}_{\text{estimate}} \;\pm\; t^\star \underbrace{\hat{\sigma} \sqrt{\sum_i \frac{c_i^2}{n_i}}}_{\text{std error}}$$
where, as above, $\hat{\sigma}$ denotes the square root of the residual mean square.
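The formula above can be wrapped in a small helper function (a hypothetical convenience, not from the original slides; the function name and arguments are illustrative):

```r
# Confidence interval for a contrast, given a fitted aov object.
contrast_ci = function(fit, cc, data, response, group, conf = 0.95) {
  ybar = tapply(data[[response]], data[[group]], mean)  # group means
  n = tapply(data[[response]], data[[group]], length)   # group sizes
  sigma_hat = sqrt(deviance(fit) / df.residual(fit))    # residual std error
  est = sum(cc * ybar)                                  # sample contrast
  se = sigma_hat * sqrt(sum(cc^2 / n))                  # std error
  t_star = qt(1 - (1 - conf) / 2, df = df.residual(fit))
  est + c(-1, 1) * t_star * se
}
data("PlantGrowth")
fit = aov(weight ~ group, data = PlantGrowth)
# groups are in factor-level order (ctrl, trt1, trt2): trt1 - trt2 contrast
contrast_ci(fit, cc = c(0, 1, -1), data = PlantGrowth,
            response = "weight", group = "group")
```

Note the coefficient vector must follow the factor-level order of the grouping variable, which here is ctrl, trt1, trt2 (not the reordered levels used in the earlier summary table).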
"Observed value" of confidence interval

A two-sided 95% confidence interval for the "special contrast" considered above is
given as follows:

the "multiplier" $t^\star$ is determined via:

t_star = qt(0.975, df = plant_anova$df.res)
t_star
[1] 2.051831

the interval is obtained using

sum(c_i * ybar_i) + c(-1,1) * t_star * se
[1] -1.4370126 -0.2929874
References
Dobson, A.J. (1983). An introduction to statistical modelling. London: Chapman & Hall.
