选择代写-STAT3022|学霸联盟

选择代写-STAT3022

时间：2021-05-08

STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 1
Tutorial Problems
There are no tutorial problems in Week 1.
Computer Problems
We will be using the statistical software R for all analysis. Basic use of R is assumed knowledge in
this course. If you are not familiar with R, you should quickly learn. Check the resource page in
canvas to learn or refresh.
For Q1-Q3, you will treat R as a calculator to get the quantity you want. For Q4, you will get
started with literate programming using R Markdown. For Q5, you should try to produce an R
Markdown report yourself. Note that assignments will be based on your R Markdown output so be
sure to learn how to make an output in your computer class.
Question 1
(Assumed knowledge) Use R to find the following probabilities
(a) P (Z > −0.785), Z ∼ N(0, 1), with pnorm(q, lower.tail = FALSE).
(b) P (t2 ≥ −1.26), with pt(q, df, lower.tail = FALSE).
(c) P (χ24 < 4.7), with pchisq(q, df).
(d) P (|t9| > 1.85), with pf(q**2, df1, df2, lower.tail = FALSE) after thinking about how
the t and the F distribution relate to each other.
Question 2
(Assumed knowledge) Use qnorm, qt, qchisq, and qf to find c in the following
(a) P (t4 ≥ c) = .995 with qt,
(b) P (|Z| ≤ c) = 1/11 with both, qnorm and qchisq,
(c) P (F3,12 ≤ c) = .90 with qf.
Question 3
(Assumed knowledge) A machine produces metal pieces that are cylindrical in shape. A sample of 8
pieces is taken and the diameters are
1.01, 0.97, 0.39 1.03, 1.04, 0.99, 0.98, 0.99.
(a) Construct a box plot representation of this data set. (Hint: with x <- c(1.01,..., 0.99)
and boxplot()).
(b) Estimate the average diameter, µ, produced by the machine. Estimate the standard error
(= s/
√
n) of your estimate? (mean(x), s = sd(), and n = length(x)).
(c) Assuming that the diameter can be modelled by a normal distribution, calculate a 98% con-
fidence interval for µ. (Hint: with t.test(x, mu = 1, conf.level = 0.98), gives you the
solution for (d) as well).
(d) Would you reject the hypothesis H0 : µ = 1.00 at significance level α = .02 on the basis of
these data?
Copyright c© The University of Sydney 1
Question 4
Go to RStudio > File > New File > R Markdown. Click on ‘OK’ to get a pre-filled R Markdown
file. Push the Knit button on top of the console just under the file names and examine the output.
Have a play around and knit to understand how it works. Where did the data cars and pressure
come from?
Question 5
In this question, you will attempt to write your own reproducible report for the analysis of the lengths
of time of passages of play data from ten international rugby matches involving the “All Blacks”.
This (as all other course data is available from Canvas Unit Schedule & Materials) is available as
rugby.txt. This exercise helps you to digest part of Lectures 1-2 and to revisit assumed knowledge
on R and graphical displays. To get started, you may like to modify the R Markdown file from Q4.
(a) Load the tidyverse R packages which will load a collection of R packages including ggplot2
and dplyr.
library(tidyverse)
(b) Download the data and read it into R, storing them as a data frame rugby. You can use the
command below but you will need to make sure that the file you have downloaded is in the
right path. Make sure you master about reading data into R.
rugby <- read.table("rugby.txt", header = TRUE)
(c) Look at the data frame by simply typing its name, rugby, into an R chunk and compiling the
pdf with Knit. You should see that the data frame has two columns. Scroll up to see that
these columns are headed Game and Time respectively. (These headings were read in from the
text file, rugby.txt; R was alerted to the presence of these headings by the header = TRUE
syntax in the read.table command.) The variable Game identifies the match (labelled A, B,
. . ., K) and the variable Time contains the times of passages of play, in seconds.
(d) In reports it is often preferable to only show the first couple of lines in a data frame. Try the
following:
rugby[1:3, ]
head(rugby)
head(rugby, 2)
(e) Type in rugby$Game into an R chunk and press Knit.
(f) The variable type for Game is categorical (or factor as synonym). You get frequencies for each
category by table(rugby$Game). Which game had the most separate passages of play? Which
had the least? You can use the help function to learn more – try help(table) or equivalently
?table in the R console.
(g) We can display the data using a bar plot. You can produce a bar plot with
Copyright c© The University of Sydney 2
barplot(table(rugby$Game))
without any additional R package. Try producing a simliar plot using ggplot2 R package.
(h) The passage of play time is a continuous numerical variable. Try displaying it using a histogram.
Is the distribution of Time normal? If not, have you seen any other data sets with similarly
shaped histograms?
(i) Finally, we can look at the times broken down by individual match. Type
rugby %>%
filter(Game=="A") %>%
pull(Time)
That gives you just the passage times for match A. Try producing separate histograms of the
passage times for game A and game H using ggplot2 R package or otherwise.
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 2
Tutorial Problems
Question 1
Suppose the linear regression model is given by Yi = β0 + β1xi + εi, i = 1 . . . , n ≥ 2. Assume that
εi ∼ NID(0, 1), that is assumptions (A1)-(A4) hold. Because of convenience the scale of the x values
is changed (e.g. from inches to centimeters) and the transformed explanatory values z = x/τ are
used instead. Write the new model as
Yi = γ0 + γ1zi + εi.
(a) Represent estimates of γ0 and γ1 in terms of βˆ0 and βˆ1. (Lecture 3)
(b) Show that r2 is invariant. (Lecture 3)
Question 2
Show that the F -test statistic for testing H0 : β1 = 0,
F =
βˆ1
2
SXX
σˆ2
,
can be written as
F =
r2(n− 2)
1− r2 ,
where r is the coefficient of correlation between x and Y . (Assumed knowledge)
Question 3
A vector of random variables X = (X1, X2, X3)
> has covariance matrix
Σ =
 9 −4 1−4 25 0
1 0 2
 .
(a) Find the correlation coefficient between X1 and X2. (Assumed knowledge)
(b) Find the variance of Y = −2X1 + 3X2. Write Y as a>X and show that the variance is a>Σa.
(Assumed knowledge + Lecture 5)
(c) What is the standard deviation of X3? (Assumed knowledge)
Question 4
In Lecture 3, you saw that a simple linear regression model
Yi = β0 + β1xi + i
assuming errors i ∼ NID(0, σ2) have the joint density evaluated at (the observed values) y> =
(y1, . . . , yn) as
Copyright c© The University of Sydney 1
f(y; β0, β1, σ) =
1√
2piσ
e−
(y1−β0−β1x1)2
2σ2 × · · · × 1√
2piσ
e−
(yn−β0−β1xn)2
2σ2
=
(
1√
2piσ
)n
e−
1
2σ2
∑n
i=1(yi−β0−β1xi)2 . (1)
(a) Write the log-likelihood, `(β0, β1;y, σ) = log f(y; β0, β1, σ).
(b) Find β0 and β1 that maximises ` assuming σ is a known fixed value.
Computer Problems
For the following questions, use the olympic.txt dataset that consists of the winning heights or
distances (in inches) for the High Jump, Discus and Long Jump events at the Olympics up to 1996.
(Lecture 3 & 4)
Question 1
(a) Store the olympic.txt dataset in R as the data frame olympic. Hint: the dataset is tab
delimited (sep="\t").
(b) Describe, and where possible explain, any unusual features about olympic.
(c) Create a new data frame olympicMetric that has measurements in metres, by using the con-
version 1 m = 39.3701 inches, and the full year (e.g. 1900 rather than 0). Show the first 6 rows
of the olympicMetric. Hint: The Olympics were held every 4 years except for 1916, 1940,
and 1944 due to war.
You should use olympicMetric for the next two questions.
Question 2
(a) Plot the first 20 values of LongJump (xi) against the first 20 values of HighJump (yi). Briefly
comment on the pattern.
(b) Fit the simple linear regression model
yi = β0 + β1xi + εi, εi ∼ NID(0, σ2)
for i = 1, . . . , 20 using
olympicLm <- lm(HighJump ~ LongJump, data = olympicMetric)
(c) Find the regression point estimates for the parameters (β0, β1, σ
2) using a summary output of
olympicLm.
(d) Check the model assumptions using the graphical diagnostic plots from the lectures, and a
scatter plot with associated fitted regression line in red.
Copyright c© The University of Sydney 2
Question 3
This question relates to material in Lecture 5.
(a) Construct 95% confidence intervals for the parameters of the model.
(b) Find point estimates for missing values of HighJump.
(c) Is it more appropriate to construct prediction intervals or confidence intervals for missing values
of HighJump? Explain your answer.
(d) Construct the appropriate intervals, based on your answer to the previous part, at the 99%
level for missing values of HighJump.
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 3
Tutorial Problems
Question 1
The effect of height (H) and weight (W) on catheter length (L) on n = 12 children with congenital
heart disease was analyzed using two models:
Model 1: Li = β0 + β1Wi + β2Hi + i, i ∼ NID(0, σ21)
and
Model 2: Li = γ0 + γ1Wi + εi, εi ∼ NID(0, σ22).
Assume that both models pass the diagnostic tests. Use the output after this question only
(i.e. do not use R) to answer the following questions.
(a) Write down the fitted models, including the estimates of the error variances. (Lecture 4 & 6)
(b) Calculate a 90% confidence interval for β2. What can you conclude from your interval? (Lecture
4)
(c) What is the multiple correlation coefficient between L and (H, W)? (Lecture 7)
(d) Give a 95% confidence interval for σ2. (Lecture 4)
(e) Calculate the sample coefficient of correlation between the L and W values. (Assumed knowl-
edge)
(f) Calculate a 90% confidence interval for the expected catheter length when the weight is 90 for
Model 2. (Lecture 4)
Question 1 Output
dat <- data.frame(
"H"=c(42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43, 22.5, 37, 23.5, 33, 58),
"W"=c(40, 93.5, 35.5, 30, 52, 17, 38.5, 8.5, 33, 9.5, 21, 79),
"L"=c(37, 49.5, 34.5, 36, 43, 28, 37, 20, 33.5, 30.5, 38.5, 47))
fit1 <- lm(L ~ W + H, data = dat)
summary(fit1)
##
## Call:
## lm(formula = L ~ W + H, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.048 -1.258 -0.259 1.899 7.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.0084 8.7512 2.401 0.0399 *
## W 0.1908 0.1652 1.155 0.2777
## H 0.1964 0.3606 0.545 0.5993
## ---
Copyright c© The University of Sydney 1
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.943 on 9 degrees of freedom
## Multiple R-squared: 0.8053, Adjusted R-squared: 0.7621
## F-statistic: 18.62 on 2 and 9 DF, p-value: 0.0006336
fit2 <- lm(L ~ W, data=dat)
summary(fit2)
##
## Call:
## lm(formula = L ~ W, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.994 -1.481 -0.135 2.091 7.040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.63746 2.00421 12.792 1.60e-07 ***
## W 0.27727 0.04399 6.303 8.87e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.802 on 10 degrees of freedom
## Multiple R-squared: 0.7989, Adjusted R-squared: 0.7788
## F-statistic: 39.73 on 1 and 10 DF, p-value: 8.871e-05
c(qt(0.95, 9), qt(0.95, 9), qt(0.95, 10), qt(0.95, 11))
## [1] 1.833113 1.833113 1.812461 1.795885
c(qchisq(0.025, 9), qchisq(0.975, 9), qchisq(0.025, 10), qchisq(0.975, 10))
## [1] 2.700389 19.022768 3.246973 20.483177
X <- model.matrix(~ W, data=dat)
solve(t(X) %*% X)
## (Intercept) W
## (Intercept) 0.277938162 -0.0051043889
## W -0.005104389 0.0001338856
Question 2
The aim of a study of weekly fuel consumption (Y in tons) was to develop a prediction equation
to predict Y on the basis of x1 (average hourly temperature,
◦F ) and x2 (the wind chill index).
Observations were taken over 8 weeks. Assume
Copyright c© The University of Sydney 2
Yi = β0 + β1xi1 + β2xi2 + i, i ∼ NID(0, σ2), i = 1, 2, . . . , 8.
The fitted least squares multiple regression equation is
Yi = 13.109− 0.09xi1 + 0.083xi2.
The residual sum of squares is 0.674, x¯1 = 43.975, x¯2 = 12.875 and
(X>X)−1 =
 5.434 0.0859 0.11890.0859 0.00147 0.0016
0.1189 0.0016 0.00359
 ,
where X is the 8× 3 design-matrix associated with the above multiple regression. Note that
qt(0.975, 5)
## [1] 2.570582
(a) Give an estimate for σ2. (Lecture 4)
(b) Obtain a 95% CI for β1. (Lecture 4)
(c) Test the hypothesis that β2 = 0 against the alternative that β2 is positive using a false positive
rate of 0.05. (Lecture 4)
(d) Can the above model be simplified? (Lecture 4)
(e) Give an estimate for the average fuel consumption if the average hourly temperature is 40(◦F )
and the chill index is 20. (Lecture 4)
Copyright c© The University of Sydney 3
Computer Problems
For the following questions, use the olympic.txt dataset that consists of the winning heights or
distances (in inches) for the High Jump, Discus and Long Jump events at the Olympics up to 1996.
Question 1
In an experiment to determine the source from which corn plants in various soils obtain their phos-
phorous, the concentration of inorganic phosphorous
(x1) and of two types of organic phosphorous (x2, x3) in the soil, and also the phosphorous content
(y) of the plants, were measured. The data are
x1 <- c(0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6, 12.6, 13.8, 10.9, 23.1, 29.9)
x2 <- c(53, 23, 19, 34, 24, 65, 44, 31, 29, 58, 55, 37, 46, 51)
x3 <- c(158, 163, 37, 157, 59, 123, 46, 117, 173, 112, 117, 111, 114, 124)
y <- c(64, 60, 71, 61, 54, 77, 81, 93, 93, 51, 60, 76, 96, 99)
(a) Create a>cor), and have a look at the pairwise scatterplots (pairs or otherwise). Comment on these in
terms of obvious outliers, shape of plots and homoscedasticity.
(b) Use lm() to fit the regression model
Yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + i , i ∼ NID(0, σ2),
show the summary of the lm output and give the fitted value and residual for the first obser-
vation.
(c) Calculate a 95% confidence interval for β3.
(d) What is the square of the multiple correlation coefficient for this model?
(e) What observation has fourth largest leverage and are there any high leverage points? (a) Using
the lm() output it seems that x3 can be dropped from the model. It also seems that x2 can be
dropped. Why?
(f) Determine a reasonable model with x1 alone for predicting the expected phosphorous content
of the plants.What is the R2 value for your final model? How does this compare with your
answer in (d)? Estimate the error standard deviation and compare it with the estimate in (b).
(g) For the regression of phosphorous content on inorganic phosphorous only calculate a 94%
prediction interval for Y |x1 = 40.
(h) Show that for the full model the largest Cook’s distance is 0.63 and for the simple linear
regression model 0.19 (2dp). Why are the values different?
Copyright c© The University of Sydney 4
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 4
Tutorial Problems
Question 1
Consider the simple linear regression model
Yi = β0 + β1xi + i,
where i ∼ NID(0, σ2).
(a) Prove that the least squares estimates for the parameters satisfy
βˆ0 = Y¯ − βˆ1x¯, and
βˆ1 =
∑n
i=1(Yi − Y¯ )xi∑n
i=1(xi − x¯)xi
.
(Lecture 3&4)
(b) Show that the residuals, Ri = Yi − (βˆ0 + βˆ1xi) sum to 0. (Assumed knowledge)
(c) Write the least squares estimator for the intercept in the form
∑n
i=1 aiYi. (Assumed knowledge)
(d) Show that the least squares estimator for β0 is unbiased by using c). (Lecture 3&4)
(e) (Lecture 5) Recall that you can write the above model in a matrix notation with X =
(
1n x
)
.
Show that
X>X =
(
n nx¯
nx¯
∑n
i=1 x
2
i
)
.
(f) (Lecture 5) Show that
(X>X)−1 =
1
nSxx
(∑n
i=1 x
2
i −nx¯
−nx¯ n
)
.
(g) (Lecture 3-5) Show that using matrix notation or otherwise that
i. V ar(βˆ1) =
σ2
Sxx
,
ii. V ar(βˆ0) = σ
2
∑n
i=1 x
2
i
nSxx
, and
iii. Cov(βˆ0, βˆ1) = −σ2 x¯
Sxx
.
Question 2
The leverage for observation i is a useful number between 0 and 1, which can be calculated by
hii = x
>
i (X
>X)−1xi. For the simple linear regression model
Yi = β0 + β1xi + i,
show that the formula for the leverage is
hii =
1
n
+
(xi − x¯)2∑n
j=1(xj − x¯)2
.
Note the points with highest leverage are the ones with x coordinate farthest from x¯. (Lecture 8)
Copyright c© The University of Sydney 1
Computer Problems
Question 1
In an experiment to investigate the amount of a drug retained in the liver of a rat, 19 rats were
randomly selected, weighed, placed under light ether anesthesia and given an oral dose of the drug.
The dose an animal received was determined as approximately 40mg of the drug per kilogram of
body weight, since liver weight is known to be strongly related to body weight and it was felt that
large livers would absorb more of a given dose than smaller livers. After a fixed length of time each
rat was sacrificed, the liver weighed, and the percent of the dose in the liver determined. The data
can be found in ratliver.txt
The experimental hypothesis was that, for the method of determining the dose, there is no
relationship between the percentage of the dose in the liver (Y ) and the body weight (x1), liver
weight (x2), and relative dose (x3). Consider the following model
Yi = β0 + β1xi1 + β2xi2 + β3xi3 + i, i ∼ NID(0, σ2). (1)
(a) Give the fitted least squares multiple regression equation, the residual sum of squares as well
as an estimate for the error variance (Lecture 6).
(b) Construct a 90% confidence interval for the error standard deviation (Lecture 6).
(c) Calculate a 95% confidence interval for the expected Y value when x1 = 165, x2 = 8.0 and
x3 = 0.85 (Lecture 7).
(d) Can any of the explanatory variables be dropped from the model based on t-tests? (Lecture
6&9)
(e) Determine if there are any high leverage points. (Lecture 8)
(f) Show that there is one outlier (Lecture 8).
(g) Refit the model ignoring the outlier. Determine the square of the multiple correlation coefficient.
Comment. (Lecture 6&8)
(h) Calculate 95% confidence intervals for the parameter estimates in the model. Do any of the
intervals contain 0? (Lecture 6)
(i) Calculate the sample standard deviation of the y sample ignoring the outlier (assumed knowl-
edge).
(j) By comparing the sample standard deviation of the y to the error standard deviation in Model
(1) (ignoring the outlier) does there appear to be any useful information for predicting Y
contained in the x variables (new material)?
(k) Finally, perform an F-test for H0 : β1 = β2 = β3 = 0 vs. H1 : β1 6= 0∨β2 6= 0∨β3 6= 0. (Lecture
9)
Copyright c© The University of Sydney 2
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 5
Tutorial Problems
Question 1
In a study of high-density lipoprotein (HDL, labelled y) in human blood a sample of size 42 was used.
Measurements were taken on total cholesterol (x1), total trigylceride (x2) as well as noting whether
a sticky component, sinking pre-beta (SPB, labelled x3) was present (coded 1) or absent (coded 0).
This data is stored in the data frame dat and a partial analysis is given in the R-output at the end
of the question. The basis for the analysis is the model
Yi = β0 + β1x1 + β2x2 + β3x3 + i,
where i ∼ NID(0, σ2).
(a) Write down the fitted multiple regression model. What proportion of the total variability in Y
is explained by the multiple regression? (Lecture 6)
(b) Provide a 95% confidence interval for β2 assuming that the model passed all diagnostic checks.
(Lectures 5-6)
(c) What is the purpose of the residual versus fitted values plot? Are the above model assumptions
reasonable? Justify your answer. (Lecture 6 or any lecture that shows such a plot)
(d) Are there any high leverage points in this data set? What characterises a high leverage point
in general? (Lecture 8)
(e) Are there any outliers in the data? (Lecture 8)
(f) Test the hypothesis H0 : β1 = β2 = 0 assuming the model passes the diagnostics checks.
(Lecture 9)
(g) For the model Yi = β0+β3x3+i, i ∼ NID(0, σ2), give a 95% confidence interval for the error
standard deviation, σ, for the simple regression on x3 only given that the model assumptions
are satified. (Lectures 5-6)
(h) For the model Yi = β0 + β3x3 + i, i ∼ NID(0, σ2),, what is the predicted difference in
average HDL levels between people with and without SPB? (Assumed knowledge)
R-Output for Tutorial Question 1
skimr::skim(dat)
## Skim summary statistics
## n obs: 42
## n variables: 4
##
## -- Variable type:numeric ----------------------------------------------------------------------
## variable missing n mean sd p0 p25 p50 p75 p100
## x1 0 42 267.81 60.4 121 248 263.5 293.25 397
## x2 0 42 155.05 73.75 46 103 140 179 408
## x3 0 42 0.48 0.51 0 0 0 1 1
## y 0 42 47.76 10.61 27 40 46.5 56 76
Copyright c© The University of Sydney 1
cor(dat)
## y x1 x2 x3
## y 1.00000000 -0.100107631 0.06801009 0.399193519
## x1 -0.10010763 1.000000000 0.51275584 0.004640736
## x2 0.06801009 0.512755842 1.00000000 0.111253230
## x3 0.39919352 0.004640736 0.11125323 1.000000000
lm1 <- lm(y ~ x1 + x2 + x3, dat)
summary(lm1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.140 -7.065 -1.226 6.284 22.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.86453 7.24985 6.740 5.54e-08 ***
## x1 -0.02731 0.03015 -0.906 0.3707
## x2 0.01504 0.02485 0.605 0.5486
## x3 8.14831 3.11224 2.618 0.0126 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 9.992 on 38 degrees of freedom
## Multiple R-squared: 0.1777, Adjusted R-squared: 0.1128
## F-statistic: 2.737 on 3 and 38 DF, p-value: 0.05682
deviance (lm1)
## [1] 3793.872
sort(lm.influence(lm1)$hat)
## 31 15 30 35 36 19
## 0.04629716 0.04683573 0.04685897 0.04821023 0.05042534 0.05049491
## 42 21 2 14 39 8
## 0.05110170 0.05160724 0.05244661 0.05385775 0.05466735 0.05681556
## 3 20 16 1 32 12
## 0.05714121 0.06028428 0.06248028 0.06252065 0.06713583 0.07047152
## 11 29 38 33 37 22
## 0.07051782 0.07348578 0.07420895 0.07520626 0.08053032 0.08666763
## 41 27 9 7 23 6
## 0.08750115 0.08805885 0.08872022 0.08988140 0.09344519 0.10936203
## 13 26 24 17 25 34
Copyright c© The University of Sydney 2
## 0.11858202 0.12658355 0.13172571 0.13894556 0.14103697 0.14864608
## 4 28 40 18 5 10
## 0.16291214 0.16364755 0.16426858 0.18070875 0.18959735 0.32610783
sort(cooks.distance(lm1))
## 32 30 31 11 3
## 0.0001733905 0.0002926988 0.0011660749 0.0014044186 0.0021165545
## 33 36 8 9 26
## 0.0022742158 0.0023978683 0.0028851000 0.0032802713 0.0032934973
## 1 5 20 37 19
## 0.0033010288 0.0036645618 0.0037343660 0.0044964353 0.0045986869
## 42 2 29 39 16
## 0.0051369616 0.0060833598 0.0061630671 0.0064126954 0.0084407571
## 15 12 21 35 40
## 0.0084973525 0.0116229464 0.0211984847 0.0212002616 0.0215417142
## 14 38 4 22 17
## 0.0218619864 0.0292889225 0.0295527833 0.0319387911 0.0402706419
## 25 7 23 41 13
## 0.0427706782 0.0446770335 0.0460629462 0.0489442660 0.0492754407
## 24 18 28 34 6
## 0.0813787248 0.0886426603 0.0995042673 0.1026733420 0.1150689996
## 27 10
## 0.1356115422 0.2426537541
qf(0.5, 4, 38)
## [1] 0.8543325
qchisq(0.975, 40)
## [1] 59.34171
qchisq(0.025, 40)
## [1] 24.43304
qt(0.975, 38)
## [1] 2.024394
lm2 <- lm(y ~ x1 + x3, dat)
summary(lm2)
Copyright c© The University of Sydney 3
##
## Call:
## lm(formula = y ~ x1 + x3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.616 -7.401 -1.782 6.044 23.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.56351 7.17378 6.770 4.45e-08 ***
## x1 -0.01791 0.02562 -0.699 0.48881
## x3 8.38720 3.06193 2.739 0.00924 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 9.91 on 39 degrees of freedom
## Multiple R-squared: 0.1698, Adjusted R-squared: 0.1272
## F-statistic: 3.987 on 2 and 39 DF, p-value: 0.02658
deviance(lm2)
## [1] 3830.45
lm3 <- lm(y ~ x3, dat)
summary(lm3)
##
## Call:
## lm(formula = y ~ x3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7727 -7.7727 -0.7727 5.6943 23.8500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.773 2.099 20.851 < 2e-16 ***
## x3 8.377 3.042 2.754 0.00882 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 9.847 on 40 degrees of freedom
## Multiple R-squared: 0.1594, Adjusted R-squared: 0.1383
## F-statistic: 7.583 on 1 and 40 DF, p-value: 0.008822
deviance(lm3)
## [1] 3878.414
Copyright c© The University of Sydney 4
autoplot(lm1)
27
6
34
−10
0
10
20
40 45 50
Fitted values
R
es
id
ua
ls
Residuals vs Fitted
27
6
34
−1
0
1
2
−2 −1 0 1 2
Theoretical Quantiles
St
an
da
rd
ize
d
re
sid
ua
ls
Normal Q−Q
27
6
34
0.4
0.8
1.2
1.6
40 45 50
Fitted values
St
a
n
da
rd
iz
e
d
re
si
du
a
ls
Scale−Location
10
27
6
−1
0
1
2
0.0 0.1 0.2 0.3
Leverage
St
an
da
rd
ize
d
Re
sid
ua
ls
Residuals vs Leverage
Computer Problems
Question 1
In 1954 Branch Rickey wrote an article for Life Magazine entitled Goodbye to some old baseball ideas.
He criticized some traditional baseball statistics and proposed some of his own that he thought more
useful. For individual hitting Rickey proposed the sum of on-base average (OBA) and extra-base
power (EBP). A main question is whether OBA and EBP are distinct components of hitting ability.
The data is given in the file baseball.txt.
(a) Determine an appropriate polynomial regression model of OBA (the response) on EBP. You
should consider models up to order 6 (i.e.˜with the highest power of EBP being at most 6). For
your answer you should explain how you selected the order of the polynomial model. (Lecture
12) (Hint: use R-code of the form lm(y ~ poly(x, 6),data=dat))
(b) What is the equation of your preferred fitted model? (Lecture 12)
(c) (Lecture 7) Use your preferred model to predict OBA at EBP score of 250. Supply an associated
90% prediction interval with
predict(lm.out, newdata=data.frame(x= ?, x2= ?, ..., xp= ?),
interval="prediction", level=?)
Copyright c© The University of Sydney 5
Question 2
The diabetes study contained 442 patients measured on 10 baseline variables (age, sex, body mass
index, average blood pressure and six blood serum measurements). The dataset can be found in
diabetes.csv.
(a) Read the data and assign it to a data frame named dat. What type of variables does the data
contain?
(b) Calculate the pairwise correlation coefficients between the variables and draw pairwise scatter
plot using pairs or ggpairs. Which explanatory variable is the most highly correlated with
the response?
(c) Fit the following models:
M0 <- lm(y ~ 1, data=dat)
Mf <- lm(y ~ ., data=dat)
(d) Select the “best” model using
1. Forward, starting with the intercept only model using the AIC criterion (in case you are
going to use the step function make sure you specify the option trace=0 to suppress most
of the output) (Lecture 10);
2. Stepwise, starting with the full model, using an F test with pout = 0.10 and pin = 0.05
(Lecture 10-11);
3. Backward, starting with the full model, using the AIC criterion (Lecture 10-11); and
4. Stepwise, starting with the full model, using the BIC criterion (Lecture 10-11).
Copyright c© The University of Sydney 6
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 6
Tutorial Problems
The two tutorial problems below is for helping you revise the one-way ANOVA from STAT2012 /
STAT2912 / DATA2002 material.
Question 1
(a) In order to compare four different brands of golf balls, five balls from each brand are placed
in a machine that exerts the force produced by a 200 metre drive. The number of simulated
drives needed to crack or chip each ball is recorded below.
A B C D
281 270 218 364
220 334 244 302
274 307 225 325
242 290 273 337
251 331 249 355
Use the R output below or otherwise to contruct an analysis of variance table on these data for
the model Yij = µ + αi + ij, ij ∼ N(0, σ2), where Yij denotes the jth observation on brand i.
(Assumed knowledge, Lecture 14)
dat <- data.frame(Trt=rep(LETTERS[1:4], times=5),
Y=c(281, 270, 218, 364,
220, 334, 244, 302,
274, 307, 225, 325,
242, 290, 273, 337,
251, 331, 249, 355))
dat %>% group_by(Trt) %>%
summarise(n=n(), TrtSum=sum(Y), TrtSum2=sum(Y^2))
## # A tibble: 4 x 4
## Trt n TrtSum TrtSum2
##
## 1 A 5 1268 324002
## 2 B 5 1532 472366
## 3 C 5 1209 294215
## 4 D 5 1683 568919
Question 2
In an effort to improve the quality of tapes, the effects of four kinds of coatings (A, B, C and D) on
the reproducing quality of sound are compared. The measurements of sound distortion were analysed
to give the following analysis of variance table:
Copyright c© The University of Sydney 1
y <- c(1.79, 0.54, 1.26, 3.58, 3.25, 5.93, 4.27, 4.26, 2.56, 4.09,
0.65, 3.75, 1.8, 3.99, 1.2, 2.33, 3.54, 4.85, 5.82, 4.61,
4.6, 3.72, 4.58, 3.98, 4.07, 5.29, 2.54)
coatings <- rep(c("A", "B", "C", "D"), times=c(5, 4, 8, 10))
M1 <- aov(y ~ coatings)
summary(M1)
## Df Sum Sq Mean Sq F value Pr(>F)
## coatings 3 25.94 8.647 6.109 0.00326 **
## Residuals 23 32.56 1.415
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
tapply(y, coatings, mean)
## A B C D
## 2.08400 4.25500 2.66875 4.40600
table(coatings)
## coatings
## A B C D
## 5 4 8 10
qt(0.95, 23)
## [1] 1.713872
qt(0.975, 23)
## [1] 2.068658
Find the standard errors of the differences of mean for A and B and hence perform a t-test for
the hypothesis that the difference is zero. Also give a 95% confidence interval for this difference.
(Assumed knowledge, Lecture 14)
Copyright c© The University of Sydney 2
Computer Problems
Question 1
Data were collected on 35 companies selected from Duns Review (1980). The data are stored in a
file called companies.txt with the company names in column 1 and stock prices (y) in column 2.
The other variables, in column order, are dividend (x1), yield (x2), earnings per share in $ (x3), sales
in millions of dollars (x4), income in millions of dollars (x5), return on sales (x6), return on equity
(x7), exchange traded (1: New York only; 2: other) (x8).
We are interested in constructing a multiple regression model to predict stock prices for similar
companies. Select the “best” model using least squares regression with the following algorithms:
(a) Stepwise, starting with the full model, using an F test with pout = 0.10 and pin = 0.05.
(b) Backward, starting with the full model, using the AIC criterion.
(c) Stepwise, starting with the full model, using the BIC criterion.
(d) Which solution in (a)-(c) is the best? Comment with at most 40 words.
(e) For the model of your choice in (d) run a robust multiple regression using the rlm in library(MASS)
with method="MM". Are there any differences to the LS solution (with lm)?
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial 7
Tutorial Problems
Question 1
The following data give the arc sine transformations (Yij) of the percentage survival of the beetle
Tribolium castaneum at 4 densities (X = number of eggs per gram of flour medium).
X (densities)
5/g 20/g 50/g 100/g
61.68 68.21 58.69 53.13
58.37 66.72 58.37 49.89
69.30 63.44 58.37 49.82
61.68 60.84
69.30
Totals 320.33 259.21 175.43 152.84
The total sum of squares for these data is 562.3883.
(a) Find the treatment sum of squares and hence test for differences in the percentage of beetles
surviving. (Lecture 14)
(b) Find an appropriate contrast to measure the linear effect of X. Hence find the linear component
of the treatment sum of squares and test the hypothesis that Y can be adequately modelled by
a linear regression on x. (Lecture 17)
qf(0.95, 3, 11)
## [1] 3.587434
qt(c(0.9, 0.95, 0.975), 15)
## [1] 1.340606 1.753050 2.131450
qt(c(0.9, 0.95, 0.975), 14)
## [1] 1.345030 1.761310 2.144787
qt(c(0.9, 0.95, 0.975), 13)
## [1] 1.350171 1.770933 2.160369
qt(c(0.9, 0.95, 0.975), 12)
## [1] 1.356217 1.782288 2.178813
Copyright c© The University of Sydney 1
qt(c(0.9, 0.95, 0.975), 11)
## [1] 1.363430 1.795885 2.200985
Question 2
In order to compare four different brands of golf balls, five balls from each brand are placed in a
machine that exerts the force produced by a 200 metre drive. The number of simulated drives needed
to crack or chip each ball is recorded below.
A B C D
281 270 218 364
220 334 244 302
274 307 225 325
242 290 273 337
251 331 249 355
An analysis of variance for the model Yij = µ + αi + ij, ij ∼ N(0, σ2), where Yij denotes the j-th
observation on brand i has ANOVA table:
Source of Variation df SS Mean Square F
Brands 3 29860.4 9953.467 16.42
Residual 16 9698.4 606.15
Total 19 39558.8
(a) Calculate 90% Tukey, Bonferroni, and Scheffe simultaneous confidence intervals for the maximal
brand effect, adjusting for all pairwise comparisons and compare to the 90% CI for the difference
in treatment effect of C and D by measuring how much percentage wider the adjusted CIs are.
(Lecture 16)
(b) Decompose the brands sum of squares into three orthogonal components including one which
measures the difference between brands A and C, and one which measures the difference be-
tween brands B and D. Specify the orthogonal contrasts that correspond to the three 1 d.f.
components. (Lecture 15)
qtukey(c(0.9, 0.95, 0.975), 4, 16)
## [1] 3.520076 4.046093 4.547630
qt(c(0.9, 0.95, 0.975), 16)
## [1] 1.336757 1.745884 2.119905
qf(c(0.9, 0.95, 0.975), 3, 16)
## [1] 2.461811 3.238872 4.076823
Copyright c© The University of Sydney 2
Computer Problems
Question 1
This question is based on the data collected from the 78 bluegills captured from Lake Mary, Minnesota
in 1981. The data is found in Weisberg (1986) A linear model approach to backcalculation of fish
length, Journal of the American Statistical Association, 81(196) 922-929 or in bluegills.txt. The
data frame contains two columns: the length (in mm) and the age (in years) of the fish at capture.
You will be investigating how the length of a bluegill fish is related to its age.
(a) Plot a scatter plot of age as predictor and length as response with a straight line corresponding
to the best line of fit using least squares estimates.
(b) Does the simple linear regression seem appropriate? Explain your conclusion.
(c) Take age as a factor and construct a one-way ANOVA table.
(d) Fit a polynomial regression model that is equivalent to the above ANOVA model.
(e) Test if the polynomial regression model can be simplified. What model have the most appro-
priate fit? Explain your choice.
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 8
Tutorial Problems
Question 1
To determine the effect of exhaust index (in seconds) and pump heater voltage (in volts) on the
pressure inside a vacuum tube (in microns of mercury), three exhaust indices and two voltages are
chosen at fixed levels. It was decided to run two trials for each combination of index and voltage so
in all 12 trials were run. A completely randomised design was deemed appropriate. Assume that we
use the following model for the response
Yijk = µ+ αi + βj + (αβ)ij + ijk, ijk ∼ N(0, σ2)
where i = 1, 2 indexing the voltage level, j = 1, 2, 3 indexing the exhaust index and k = 1, 2 indexing
the replicate.
The results of the 12 experiments are given below.
Pump Heater Exhaust index
Voltage 60 90 120 Totals
127 48, 58 28, 33 7, 15 189
220 62, 54 14, 10 9, 6 155
Totals 222 85 37 344
(a) Calculate the interaction sum of squares and hence complete the below analysis of variance table.
Test for an interaction effect. Note that the sum of squares of observatons (
∑2
i=1
∑3
j=1
∑2
k=1 Y
2
ijk)
is 14,988. (Lecture 18 and 19)
Source of Variation df Sum of Squares
Voltage(V)
Exhaust Index (E) 4608.17
Interaction(V × E)
Residual 6 139.00
Total
(b) Calculate a 95% confidence interval for the average difference in pressure at the two voltage levels
when the exhaust index is set at 90 using a two-way ANOVA interaction model. (New material)
qf(c(0.95, 0.975), 2, 6)
## [1] 5.143253 7.259856
qt(c(0.95, 0.975), 6)
## [1] 1.943180 2.446912
Copyright c© The University of Sydney 1
Computer Problems
Question 1
A very famous data set is (Fisher’s or Anderson’s) iris data set, which gives the measurements in
centimeters of the variables sepal length and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. The data
set is already available in R and can be called by typing iris
(a) Store the 50 observations for the setosa species in a new data frame called dat.
(b) For setosa, draw boxplots in a single frame for the four variables Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width.
(c) For setosa, draw Q-Q-plots in a single frame for the four variables and showing the variable
name as a heading for each of the four plots.
(d) Use the recommended shapiro.test to assess the normality assumption for those four variables
for the setosa species.
(e) Explore the other normality tests shown in Lecture 20. Do you get the same answers as in (d)?
Question 2
The aim of this question is to give you a real-life experience of what it means to analyse a data set
with the aim to produce a model that is optimal from a non-standard point of view. Consider the
data phys.txt.
(a) Read the data into R and store it in a data frame called dat. What are the variable names
(given in the first row of the txt file)?
(b) Fit a full model, i.e. lm(Mass ~ .,data=data) and note that the adjusted R-squared is 0.9565,
i.e. already large.
(c) Check the diagnostic plots for the full model and report any possible violations of the standard
assumptions for the errors in the linear regression model. If there are no violations, explain
why there are none.
(d) Find that model that gives you largest adjusted R-squared value by using exactly two variables.
To do so efficiently, write one or more loops and comment the code, so that it can be easily
understood by your peers.
(e) Create two new variables, z1 and z2, that are functions (not necessarily linear functions) of
any combination of the explanatory variables such that lm(y ~ z1 + z2) has larger adjusted
R-squared value than the model in (d).
Copyright c© The University of Sydney 2
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 9
Tutorial Problems
Question 1
In a 3 × 2 factorial experiment on the control of Agropyron repens (quack-grass), Zick (1956) used
3 rates of spray (0, 4 and 8 pounds per acre) of maleic hydrazide in combination with 2 periods
of delay (3 and 10 days) in cultivation after spraying. Fifty two days after spraying, the number
of quack-grass shoots per square foot was recorded for each plot. It was believed reasonable to set
up a standard linear model for the square roots of these numbers. The square roots are as follows:
Periods of delay Rates of Spray
Period of delay Rate of Spray Square roots of quack-
(days) (lbs/acre) grass shoots (/square ft)
0 15.7 14.6 16.5 14.7
3 4 9.8 14.6 11.9 12.4
8 7.9 10.3 9.7 9.6
0 18.0 17.4 15.1 14.4
10 4 13.6 10.6 11.8 13.3
8 8.8 8.2 11.3 11.2
Consider the model
Yijk = µ+ αi + βj + (αβ)ij + ijk,
i = 1, 2, 3, j = 1, 2, k = 1, 2, 3, 4 where
• ijk ∼ NID(0, σ2),
• αi corresponds to the effect of Rate of spray i,
• βj corresponds to the effect of Period j, and
• the (αβ)ij denote the Rate × Period interaction terms
(a) Organise the data given in the above table in a tidy form.
(b) Confirm that y1•• = 126.4, y2•• = 98, y3•• = 77, y•1• = 147.7, y•2• = 153.7 and y••• = 301.4.
(c) Given that the treatment sum of squares is given as 155.6533 and the sample variance of the
response, s2y is 8.505145, construct the analysis of variance table corresponding to the above
model. Determine if the above model can be simplified. (Lecture 18-19)
(d) Now consider the one-way ANOVA model with factor Rate as
Yijk = µ+ αi + ijk, where ijk ∼ NID(0, σ2),
to answer this question. Since the factor Rate is a numerical factor with 3 levels, Lecture 17
says that the TSS of this one-way ANOVA model is the same as the TSS of order (3-1) polynomial
regression. Use a contrast with coefficient ci = ni(xi− x¯) to calculate the linear regression component
of the Rate of spray sum of squares in the polynomial regression. Is a linear regression model
appropriate in this case? (Lecture 17)
Copyright c© The University of Sydney 1
(e) Given that Sxy = −197.6 and Sxx = 256, estimate the parameters in
Yijk = β0 + β1xi + ijk, whereijk ∼ NID(0, σ2),
where xi is the rates of spray. (Lecture 3 and assumed knowledge)
c(qf(0.05, 1, 18), qf(0.05, 2, 18), qf(0.05, 1, 20), qf(0.05, 2, 20))
[1] 0.004043292 0.051439739 0.004032045 0.051425070
c(qf(0.95, 1, 18), qf(0.95, 2, 18), qf(0.95, 1, 20), qf(0.95, 2, 20))
[1] 4.413873 3.554557 4.351244 3.492828
Question 2
Wilkinson (1954) reports the results of a study on the influence of time of bleeding and diethylstilbe-
strol (estrogen compound) on plasma phospholipid in lambs. Five lambs were assigned at random
to each of the four treatment groups.
Totals of 5 values in each treatment group.
Time (Factor A)
Factor B am pm Totals
No estrogen 66.39 182.67 249.06
Estrogen 96.80 139.06 235.86
Totals 163.19 321.73 484.92
The total sum of squares for the data set of 20 values is 1919.33.
(a) Construct the analysis of variance table for this factorial design, analyse the data and comment.
(Lecture 18-19)
(b) Using the two-way ANOVA model, Construct a 95% confidence interval for the difference in
mean plasma phospholipid levels for lambs receiving diethylstilbestrol and those not.
qf(c(0.05,0.95),1,16)
[1] 0.004057389 4.493998478
qt(0.975,16)
[1] 2.119905
Copyright c© The University of Sydney 2
Question 3
A completely randomized experiment was conducted to compare seven treatments for their effective-
ness in reducing scab disease in potatoes. The field plan is shown below. The upper figure in each
plot denotes the treatment, coded 1 to 7. The lower figure denotes an index of scabbiness of potatoes
in that plot: 100 potatoes were randomly sampled from the plot, for each one the percentage of
the surface area infected with scabs was assessed by eye and recorded, and the average of these 100
percentages was calculated to give the scabbiness index.
2 1 6 4 6 7 5 3
9 12 18 10 24 17 30 16
1 5 4 3 5 1 1 6
10 7 4 10 21 24 29 12
2 7 3 1 3 7 2 4
9 7 18 30 18 16 16 4
5 1 7 6 1 4 1 2
9 18 17 19 32 5 26 4
(a) What is the experimental unit, observational unit and treatment of the above experiment?
How many experimental units, observational units and treatments are there? (Lecture 23)
(b) Suppose that y is a vector of response and trt is a factor that denotes the treatment. Given
the below information only, complete the analysis of variance table below by filling out the ??s.
(Lecture 14-15, 24)
tapply(y, trt, sum)
## 1 2 3 4 5 6 7
## 181 38 62 23 67 73 57
table(trt)
## trt
## 1 2 3 4 5 6 7
## 8 4 4 4 4 4 4
sum(y)
## [1] 501
length(y)
## [1] 32
var(y)
## [1] 67.5877
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## trt ?? ?? ?? ?? ###
Copyright c© The University of Sydney 3
(c) Is there any evidence that the mean scabbiness is different according to different treatments?
Justify your answer. (Lecture 14-15, 24)
qf(0.95, 6, 25)
[1] 2.49041
qf(0.95, 1, 30)
[1] 4.170877
(d) Estimate the mean scabbiness produced by each treatment. (Lecture 14-15, 24)
(e) What is the standard error of the differences between treatment 1 mean and the average of
treatment 2 and 3 means? (Lecture 14-15, 24)
Computer Problems
Question 1
An anesthesiologist made a comparative study of the effects of acupuncture and codeine on postop-
erative dental pain in male subjects. The four treatments (Tmt) were
• A : placebo capsule (sugar) & placebo acupuncture (inactive points)
• B : codeine capsule & placebo acupuncture
• C : placebo capsule & acupuncture
• D : codeine capsule & acupuncture
Thirty-two subjects were grouped into eight blocks (Tol) of four according to an initial evaluation
of their level of pain tolerance. The subjects in each block were then randomly assigned to the four
treatments. Pain relief scores (Pain) were obtained for all subjects two hours after dental treatment.
Data were collected on a double blind basis. The data is given in the file pain.txt.
(a) Assume you have to plan this experiment yourself. Generate an allocation list for a complete
randomized block design with 8 blocks and four treatments. This part does not use the data
but only its dimensions (sample size: 32; variables: block and treatment).
(b) Assume that after the experiment was run by the anesthesiologist, with a design that was
similarly generated as what you just produced in (a), you are given the data in pain.txt. Try
to understand the structure of the data by using str, summary or skim, i.e. think about if you
have to change the variable type of any of the data columns?
(c) Obtain the residuals for the randomized block model and plot them against the fitted values.
Also prepare a normal probability plot of the residuals. What are your findings? Comment on
any violation of underlying assumptions, otherwise confirm that assumptions seem reasonable.
(d) Produce an interaction plot. Comment on how parallel the plots are, also taking into account
the variability of each calculated “treatment mean” (in the sense that a treatment mean is the
average, Y¯ij•, of all observations that received level i of factor A and level j of factor B.
(e) Obtain the analysis of variance table (without interactions).
(f) Calculate a 95% confidence interval for αB???αC and use this to test if this pairwise contrast
is zero.
Copyright c© The University of Sydney 4

## 1 A 5 1268 324002
## 2 B 5 1532 472366
## 3 C 5 1209 294215
## 4 D 5 1683 568919
Question 2
In an effort to improve the quality of tapes, the effects of four kinds of coatings (A, B, C and D) on
the reproducing quality of sound are compared. The measurements of sound distortion were analysed
to give the following analysis of variance table:
Copyright c© The University of Sydney 1
y <- c(1.79, 0.54, 1.26, 3.58, 3.25, 5.93, 4.27, 4.26, 2.56, 4.09,
0.65, 3.75, 1.8, 3.99, 1.2, 2.33, 3.54, 4.85, 5.82, 4.61,
4.6, 3.72, 4.58, 3.98, 4.07, 5.29, 2.54)
coatings <- rep(c("A", "B", "C", "D"), times=c(5, 4, 8, 10))
M1 <- aov(y ~ coatings)
summary(M1)
## Df Sum Sq Mean Sq F value Pr(>F)
## coatings 3 25.94 8.647 6.109 0.00326 **
## Residuals 23 32.56 1.415
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
tapply(y, coatings, mean)
## A B C D
## 2.08400 4.25500 2.66875 4.40600
table(coatings)
## coatings
## A B C D
## 5 4 8 10
qt(0.95, 23)
## [1] 1.713872
qt(0.975, 23)
## [1] 2.068658
Find the standard errors of the differences of mean for A and B and hence perform a t-test for
the hypothesis that the difference is zero. Also give a 95% confidence interval for this difference.
(Assumed knowledge, Lecture 14)
Copyright c© The University of Sydney 2
Computer Problems
Question 1
Data were collected on 35 companies selected from Duns Review (1980). The data are stored in a
file called companies.txt with the company names in column 1 and stock prices (y) in column 2.
The other variables, in column order, are dividend (x1), yield (x2), earnings per share in $ (x3), sales
in millions of dollars (x4), income in millions of dollars (x5), return on sales (x6), return on equity
(x7), exchange traded (1: New York only; 2: other) (x8).
We are interested in constructing a multiple regression model to predict stock prices for similar
companies. Select the “best” model using least squares regression with the following algorithms:
(a) Stepwise, starting with the full model, using an F test with pout = 0.10 and pin = 0.05.
(b) Backward, starting with the full model, using the AIC criterion.
(c) Stepwise, starting with the full model, using the BIC criterion.
(d) Which solution in (a)-(c) is the best? Comment with at most 40 words.
(e) For the model of your choice in (d) run a robust multiple regression using the rlm in library(MASS)
with method="MM". Are there any differences to the LS solution (with lm)?
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial 7
Tutorial Problems
Question 1
The following data give the arc sine transformations (Yij) of the percentage survival of the beetle
Tribolium castaneum at 4 densities (X = number of eggs per gram of flour medium).
X (densities)
5/g 20/g 50/g 100/g
61.68 68.21 58.69 53.13
58.37 66.72 58.37 49.89
69.30 63.44 58.37 49.82
61.68 60.84
69.30
Totals 320.33 259.21 175.43 152.84
The total sum of squares for these data is 562.3883.
(a) Find the treatment sum of squares and hence test for differences in the percentage of beetles
surviving. (Lecture 14)
(b) Find an appropriate contrast to measure the linear effect of X. Hence find the linear component
of the treatment sum of squares and test the hypothesis that Y can be adequately modelled by
a linear regression on x. (Lecture 17)
qf(0.95, 3, 11)
## [1] 3.587434
qt(c(0.9, 0.95, 0.975), 15)
## [1] 1.340606 1.753050 2.131450
qt(c(0.9, 0.95, 0.975), 14)
## [1] 1.345030 1.761310 2.144787
qt(c(0.9, 0.95, 0.975), 13)
## [1] 1.350171 1.770933 2.160369
qt(c(0.9, 0.95, 0.975), 12)
## [1] 1.356217 1.782288 2.178813
Copyright c© The University of Sydney 1
qt(c(0.9, 0.95, 0.975), 11)
## [1] 1.363430 1.795885 2.200985
Question 2
In order to compare four different brands of golf balls, five balls from each brand are placed in a
machine that exerts the force produced by a 200 metre drive. The number of simulated drives needed
to crack or chip each ball is recorded below.
A B C D
281 270 218 364
220 334 244 302
274 307 225 325
242 290 273 337
251 331 249 355
An analysis of variance for the model Yij = µ + αi + ij, ij ∼ N(0, σ2), where Yij denotes the j-th
observation on brand i has ANOVA table:
Source of Variation df SS Mean Square F
Brands 3 29860.4 9953.467 16.42
Residual 16 9698.4 606.15
Total 19 39558.8
(a) Calculate 90% Tukey, Bonferroni, and Scheffe simultaneous confidence intervals for the maximal
brand effect, adjusting for all pairwise comparisons and compare to the 90% CI for the difference
in treatment effect of C and D by measuring how much percentage wider the adjusted CIs are.
(Lecture 16)
(b) Decompose the brands sum of squares into three orthogonal components including one which
measures the difference between brands A and C, and one which measures the difference be-
tween brands B and D. Specify the orthogonal contrasts that correspond to the three 1 d.f.
components. (Lecture 15)
qtukey(c(0.9, 0.95, 0.975), 4, 16)
## [1] 3.520076 4.046093 4.547630
qt(c(0.9, 0.95, 0.975), 16)
## [1] 1.336757 1.745884 2.119905
qf(c(0.9, 0.95, 0.975), 3, 16)
## [1] 2.461811 3.238872 4.076823
Copyright c© The University of Sydney 2
Computer Problems
Question 1
This question is based on the data collected from the 78 bluegills captured from Lake Mary, Minnesota
in 1981. The data is found in Weisberg (1986) A linear model approach to backcalculation of fish
length, Journal of the American Statistical Association, 81(196) 922-929 or in bluegills.txt. The
data frame contains two columns: the length (in mm) and the age (in years) of the fish at capture.
You will be investigating how the length of a bluegill fish is related to its age.
(a) Plot a scatter plot of age as predictor and length as response with a straight line corresponding
to the best line of fit using least squares estimates.
(b) Does the simple linear regression seem appropriate? Explain your conclusion.
(c) Take age as a factor and construct a one-way ANOVA table.
(d) Fit a polynomial regression model that is equivalent to the above ANOVA model.
(e) Test if the polynomial regression model can be simplified. What model have the most appro-
priate fit? Explain your choice.
Copyright c© The University of Sydney 3
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 8
Tutorial Problems
Question 1
To determine the effect of exhaust index (in seconds) and pump heater voltage (in volts) on the
pressure inside a vacuum tube (in microns of mercury), three exhaust indices and two voltages are
chosen at fixed levels. It was decided to run two trials for each combination of index and voltage so
in all 12 trials were run. A completely randomised design was deemed appropriate. Assume that we
use the following model for the response
Yijk = µ+ αi + βj + (αβ)ij + ijk, ijk ∼ N(0, σ2)
where i = 1, 2 indexing the voltage level, j = 1, 2, 3 indexing the exhaust index and k = 1, 2 indexing
the replicate.
The results of the 12 experiments are given below.
Pump Heater Exhaust index
Voltage 60 90 120 Totals
127 48, 58 28, 33 7, 15 189
220 62, 54 14, 10 9, 6 155
Totals 222 85 37 344
(a) Calculate the interaction sum of squares and hence complete the below analysis of variance table.
Test for an interaction effect. Note that the sum of squares of observatons (
∑2
i=1
∑3
j=1
∑2
k=1 Y
2
ijk)
is 14,988. (Lecture 18 and 19)
Source of Variation df Sum of Squares
Voltage(V)
Exhaust Index (E) 4608.17
Interaction(V × E)
Residual 6 139.00
Total
(b) Calculate a 95% confidence interval for the average difference in pressure at the two voltage levels
when the exhaust index is set at 90 using a two-way ANOVA interaction model. (New material)
qf(c(0.95, 0.975), 2, 6)
## [1] 5.143253 7.259856
qt(c(0.95, 0.975), 6)
## [1] 1.943180 2.446912
Copyright c© The University of Sydney 1
Computer Problems
Question 1
A very famous data set is (Fisher’s or Anderson’s) iris data set, which gives the measurements in
centimeters of the variables sepal length and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. The data
set is already available in R and can be called by typing iris
(a) Store the 50 observations for the setosa species in a new data frame called dat.
(b) For setosa, draw boxplots in a single frame for the four variables Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width.
(c) For setosa, draw Q-Q-plots in a single frame for the four variables and showing the variable
name as a heading for each of the four plots.
(d) Use the recommended shapiro.test to assess the normality assumption for those four variables
for the setosa species.
(e) Explore the other normality tests shown in Lecture 20. Do you get the same answers as in (d)?
Question 2
The aim of this question is to give you a real-life experience of what it means to analyse a data set
with the aim to produce a model that is optimal from a non-standard point of view. Consider the
data phys.txt.
(a) Read the data into R and store it in a data frame called dat. What are the variable names
(given in the first row of the txt file)?
(b) Fit a full model, i.e. lm(Mass ~ .,data=data) and note that the adjusted R-squared is 0.9565,
i.e. already large.
(c) Check the diagnostic plots for the full model and report any possible violations of the standard
assumptions for the errors in the linear regression model. If there are no violations, explain
why there are none.
(d) Find that model that gives you largest adjusted R-squared value by using exactly two variables.
To do so efficiently, write one or more loops and comment the code, so that it can be easily
understood by your peers.
(e) Create two new variables, z1 and z2, that are functions (not necessarily linear functions) of
any combination of the explanatory variables such that lm(y ~ z1 + z2) has larger adjusted
R-squared value than the model in (d).
Copyright c© The University of Sydney 2
STAT3022 Applied Linear Models Semester 1
Tutorial & Lab Sheet 9
Tutorial Problems
Question 1
In a 3 × 2 factorial experiment on the control of Agropyron repens (quack-grass), Zick (1956) used
3 rates of spray (0, 4 and 8 pounds per acre) of maleic hydrazide in combination with 2 periods
of delay (3 and 10 days) in cultivation after spraying. Fifty two days after spraying, the number
of quack-grass shoots per square foot was recorded for each plot. It was believed reasonable to set
up a standard linear model for the square roots of these numbers. The square roots are as follows:
Periods of delay Rates of Spray
Period of delay Rate of Spray Square roots of quack-
(days) (lbs/acre) grass shoots (/square ft)
0 15.7 14.6 16.5 14.7
3 4 9.8 14.6 11.9 12.4
8 7.9 10.3 9.7 9.6
0 18.0 17.4 15.1 14.4
10 4 13.6 10.6 11.8 13.3
8 8.8 8.2 11.3 11.2
Consider the model
Yijk = µ+ αi + βj + (αβ)ij + ijk,
i = 1, 2, 3, j = 1, 2, k = 1, 2, 3, 4 where
• ijk ∼ NID(0, σ2),
• αi corresponds to the effect of Rate of spray i,
• βj corresponds to the effect of Period j, and
• the (αβ)ij denote the Rate × Period interaction terms
(a) Organise the data given in the above table in a tidy form.
(b) Confirm that y1•• = 126.4, y2•• = 98, y3•• = 77, y•1• = 147.7, y•2• = 153.7 and y••• = 301.4.
(c) Given that the treatment sum of squares is given as 155.6533 and the sample variance of the
response, s2y is 8.505145, construct the analysis of variance table corresponding to the above
model. Determine if the above model can be simplified. (Lecture 18-19)
(d) Now consider the one-way ANOVA model with factor Rate as
Yijk = µ+ αi + ijk, where ijk ∼ NID(0, σ2),
to answer this question. Since the factor Rate is a numerical factor with 3 levels, Lecture 17
says that the TSS of this one-way ANOVA model is the same as the TSS of order (3-1) polynomial
regression. Use a contrast with coefficient ci = ni(xi− x¯) to calculate the linear regression component
of the Rate of spray sum of squares in the polynomial regression. Is a linear regression model
appropriate in this case? (Lecture 17)
Copyright c© The University of Sydney 1
(e) Given that Sxy = −197.6 and Sxx = 256, estimate the parameters in
Yijk = β0 + β1xi + ijk, whereijk ∼ NID(0, σ2),
where xi is the rates of spray. (Lecture 3 and assumed knowledge)
c(qf(0.05, 1, 18), qf(0.05, 2, 18), qf(0.05, 1, 20), qf(0.05, 2, 20))
[1] 0.004043292 0.051439739 0.004032045 0.051425070
c(qf(0.95, 1, 18), qf(0.95, 2, 18), qf(0.95, 1, 20), qf(0.95, 2, 20))
[1] 4.413873 3.554557 4.351244 3.492828
Question 2
Wilkinson (1954) reports the results of a study on the influence of time of bleeding and diethylstilbe-
strol (estrogen compound) on plasma phospholipid in lambs. Five lambs were assigned at random
to each of the four treatment groups.
Totals of 5 values in each treatment group.
Time (Factor A)
Factor B am pm Totals
No estrogen 66.39 182.67 249.06
Estrogen 96.80 139.06 235.86
Totals 163.19 321.73 484.92
The total sum of squares for the data set of 20 values is 1919.33.
(a) Construct the analysis of variance table for this factorial design, analyse the data and comment.
(Lecture 18-19)
(b) Using the two-way ANOVA model, Construct a 95% confidence interval for the difference in
mean plasma phospholipid levels for lambs receiving diethylstilbestrol and those not.
qf(c(0.05,0.95),1,16)
[1] 0.004057389 4.493998478
qt(0.975,16)
[1] 2.119905
Copyright c© The University of Sydney 2
Question 3
A completely randomized experiment was conducted to compare seven treatments for their effective-
ness in reducing scab disease in potatoes. The field plan is shown below. The upper figure in each
plot denotes the treatment, coded 1 to 7. The lower figure denotes an index of scabbiness of potatoes
in that plot: 100 potatoes were randomly sampled from the plot, for each one the percentage of
the surface area infected with scabs was assessed by eye and recorded, and the average of these 100
percentages was calculated to give the scabbiness index.
2 1 6 4 6 7 5 3
9 12 18 10 24 17 30 16
1 5 4 3 5 1 1 6
10 7 4 10 21 24 29 12
2 7 3 1 3 7 2 4
9 7 18 30 18 16 16 4
5 1 7 6 1 4 1 2
9 18 17 19 32 5 26 4
(a) What is the experimental unit, observational unit and treatment of the above experiment?
How many experimental units, observational units and treatments are there? (Lecture 23)
(b) Suppose that y is a vector of response and trt is a factor that denotes the treatment. Given
the below information only, complete the analysis of variance table below by filling out the ??s.
(Lecture 14-15, 24)
tapply(y, trt, sum)
## 1 2 3 4 5 6 7
## 181 38 62 23 67 73 57
table(trt)
## trt
## 1 2 3 4 5 6 7
## 8 4 4 4 4 4 4
sum(y)
## [1] 501
length(y)
## [1] 32
var(y)
## [1] 67.5877
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## trt ?? ?? ?? ?? ###
Copyright c© The University of Sydney 3
(c) Is there any evidence that the mean scabbiness is different according to different treatments?
Justify your answer. (Lecture 14-15, 24)
qf(0.95, 6, 25)
[1] 2.49041
qf(0.95, 1, 30)
[1] 4.170877
(d) Estimate the mean scabbiness produced by each treatment. (Lecture 14-15, 24)
(e) What is the standard error of the differences between treatment 1 mean and the average of
treatment 2 and 3 means? (Lecture 14-15, 24)
Computer Problems
Question 1
An anesthesiologist made a comparative study of the effects of acupuncture and codeine on postop-
erative dental pain in male subjects. The four treatments (Tmt) were
• A : placebo capsule (sugar) & placebo acupuncture (inactive points)
• B : codeine capsule & placebo acupuncture
• C : placebo capsule & acupuncture
• D : codeine capsule & acupuncture
Thirty-two subjects were grouped into eight blocks (Tol) of four according to an initial evaluation
of their level of pain tolerance. The subjects in each block were then randomly assigned to the four
treatments. Pain relief scores (Pain) were obtained for all subjects two hours after dental treatment.
Data were collected on a double blind basis. The data is given in the file pain.txt.
(a) Assume you have to plan this experiment yourself. Generate an allocation list for a complete
randomized block design with 8 blocks and four treatments. This part does not use the data
but only its dimensions (sample size: 32; variables: block and treatment).
(b) Assume that after the experiment was run by the anesthesiologist, with a design that was
similarly generated as what you just produced in (a), you are given the data in pain.txt. Try
to understand the structure of the data by using str, summary or skim, i.e. think about if you
have to change the variable type of any of the data columns?
(c) Obtain the residuals for the randomized block model and plot them against the fitted values.
Also prepare a normal probability plot of the residuals. What are your findings? Comment on
any violation of underlying assumptions, otherwise confirm that assumptions seem reasonable.
(d) Produce an interaction plot. Comment on how parallel the plots are, also taking into account
the variability of each calculated “treatment mean” (in the sense that a treatment mean is the
average, Y¯ij•, of all observations that received level i of factor A and level j of factor B.
(e) Obtain the analysis of variance table (without interactions).
(f) Calculate a 95% confidence interval for αB???αC and use this to test if this pairwise contrast
is zero.
Copyright c© The University of Sydney 4

学霸联盟