Date: 2022-06-24

STAT 3022 Applied Linear Models – Semester 1, 2022
Final Examination
Reading and writing time allowed: 2 hours and 10 minutes
Document upload time: 15 minutes (after the examination with no further working)
Notes and Instructions:
• Examination Conditions: Open book. You may refer to any notes on the Canvas website.
• This examination must be taken on a computer or laptop with satisfactory internet connectivity. It
should NOT be taken on a mobile device.
• Compatible web browsers include updated versions of Mozilla Firefox or Google Chrome. Any other
browser may not display questions correctly.
• Please be mindful that we may access logs of your Canvas activity in the event of any discrepancy or concerns
regarding breaches of integrity.
• The content of this examination is not to be shared or distributed in any form.
• The work that you submit for this examination must be your sole effort (i.e. not copied from, or
discussed with, anyone else).
• This examination carries weight 50% towards your final mark.
• The paper contains 4 questions. Attempt all questions. Marks are shown in each question. Total marks
are 100.
• If you are asked to calculate a certain quantity, you must show your working, at a minimum showing the
values substituted into a suitable formula, in order to obtain full marks. You may then use a calculator
or any package to evaluate the values.
• If you are asked to state or report certain values from R output provided, no calculation is required.
• If you are asked to explain certain concepts, you need to write at most two sentences.
• Unless otherwise stated, you can assume that the normal assumptions hold for all inference questions (i.e.
confidence intervals, hypothesis testing).
• Unless otherwise stated, take the significance level α to be 5% and round your answers to 3 or 4 decimal
places in general.
Question 1 (15 marks)
Answer the following questions.
(i) (3 marks) Consider the dataset $(x_i, y_i)$ for $i = 1, \ldots, n$ with $n = 100$, $\bar{x} = 20$, $\bar{y} = 15$, and sample
correlation between X and Y of $r = -0.23$. If we fit a simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,
what is the value of the t-statistic for testing $\beta_1 = 0$? Show your calculation (you can use any formula
given in the lecture notes or in the tutorials).
(ii) (3 marks) Consider a multiple linear regression model with 150 observations, one continuous predictor
(X1), one categorical predictor (X2) with 4 categories, and one categorical predictor (X3) with 3 categories. If
we want to build a model that contains only the additive effect of X1, the interaction between X1 and
X2, and the additive effect of X3, what would be the dimension of the design matrix for this multiple
linear regression model?
(iii) (3 marks) Consider a multiple linear regression model fitted to 100 observations on 7 covariates (with no
interaction term). Let $\beta = (\beta_0, \ldots, \beta_7)^\top$ be the true coefficient vector. What are the degrees of freedom
for the t-statistic used in the test $H_0: 2\beta_1 - \beta_2 + 4\beta_3 = 0$ vs. $H_1: 2\beta_1 - \beta_2 + 4\beta_3 \neq 0$?
(iv) (3 marks) Consider the two-way ANOVA model $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2, 3$
and $k = 1, 2$. Let $\beta = (\mu, \alpha_1, \alpha_2, \beta_1, \beta_2, \beta_3, \gamma_{11}, \gamma_{12}, \gamma_{13}, \gamma_{21}, \gamma_{22}, \gamma_{23})^\top$. In the matrix form $y = X\beta + \varepsilon$,
what is the row of the design matrix X that corresponds to the observation $y_{132}$?
(v) (3 marks) In a two-way ANOVA model $y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$ with $i, j = 1, 2, 3$, what are the values
of c and d such that the linear combination $\alpha_1 - (1/2)\alpha_2 + c\alpha_3 + (1/5)\beta_1 + d\beta_2$ is estimable?
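The arithmetic behind parts (i) and (ii) can be sketched in R. This is a minimal check, not a model answer: it assumes the standard identity t = r·sqrt(n − 2)/sqrt(1 − r²) for the slope test in simple linear regression, and it assumes R's default treatment coding with the formula `~ X1 + X1:X2 + X3` as one reading of part (ii). The data frame `sim` is simulated placeholder data, not part of the examination.

```r
# Part (i): t-statistic for H0: beta1 = 0, computed from the sample correlation.
r <- -0.23
n <- 100
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value on n - 2 df

# Part (ii): count design-matrix columns on simulated placeholder data.
# Only the factor structure matters here, not the simulated values.
set.seed(1)
sim <- data.frame(
  X1 = rnorm(150),
  X2 = factor(sample(rep(c("a", "b", "c", "d"), length.out = 150))),
  X3 = factor(sample(rep(c("u", "v", "w"), length.out = 150)))
)
X <- model.matrix(~ X1 + X1:X2 + X3, data = sim)

t_stat
dim(X)
colnames(X)
```

Printing `colnames(X)` shows which columns the coding produces, which is a quick way to check a hand count.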
Question 2 (40 marks)
A researcher is interested in building a multiple linear regression model of the murder rate (Murder) for 50
states in the United States on the following potential covariates:
Variable     Meaning
Population   Total population
Income       Per capita income
Illiteracy   Illiteracy rate
Life.Exp     Life expectancy in years
Region       Region (Northeast, South, North Central, West) that each state belongs to
Several models are fitted, with selected R output and results given at the end of this document. Based
only on these outputs, answer the following questions. If a question can’t be answered from these outputs,
you can write “Not sufficient information to answer.”
(i) (10 marks) Conduct the omnibus F-test for the significance of model m1. Please report the test statistic,
its degrees of freedom, the p-value, and the conclusion of the test.
(ii) (5 marks) For the model m2, are the t-test associated with Income in the summary table and the F-test
associated with Income in the sequential ANOVA table testing the same hypotheses? Explicitly state
the models under the null and alternative hypotheses of each test.
(iii) (10 marks) After fitting the model m3, the researcher suspects that the following states may exert too
much influence on the fitted model: Nevada, California, New York, Hawaii, and Maine. For each state
listed above, identify whether it is a high-leverage point, an outlier, or an influential observation.
(iv) (5 marks) For the model m4, can we conclude that neither the quadratic term associated with life
expectancy nor illiteracy is significant? Justify your answer.
(v) (10 marks) For the model m5, are the intercepts associated with the South and North Central regions
significantly different from one another? Conduct an appropriate test to justify your answer.
Question 3 (30 marks)
The factors that influence the breaking strength of a synthetic fiber are being studied. Four production
machines and three operators are chosen and the outcome variables are recorded. Several models are fitted
and selected outputs are given at the end of this document.
First, both machines and operators are considered fixed factors.
(i) (2 marks) Write down the statistical model that allows an interaction effect between the two factors.
Explicitly define the notation used in the model and state all the assumptions.
(ii) (3 marks) Conduct, at the 5% significance level, a hypothesis test of whether the interaction between machine
and operator is significant. In conducting the test, state the null and alternative hypotheses in terms of
the notation you have defined in part (i). You may use the test statistic and p-value from the output
without re-computing them.
(iii) (5 marks) Construct a suitable ANOVA table for the model m2 that only contains the additive effects
of the two factors.
(iv) (5 marks) Construct 95% confidence intervals for all the pairwise differences in the mean outcome
among the operators, using the Tukey method to adjust for multiple comparisons. Use the model m1 to
estimate the variance of the random error.
For the last three parts, treat both machine and operator as having random effects on the outcome.
(v) (5 marks) Write down the statistical model that allows an interaction effect between the two factors.
Explicitly define the notation used in the model and state all the assumptions.
(vi) (5 marks) What are the estimated variance components from (1) method of moments, and (2) restricted
maximum likelihood estimation?
(vii) (5 marks) Test whether an interaction effect exists between the two random factors.
Question 4 (15 marks)
Let $H = X(X^\top X)^{-1}X^\top$ be the hat matrix of a multiple linear regression model for a full-rank design
matrix X. Let $h_{ij}$ denote the $(i, j)$ element of H. Prove that $h_{ii} \ge h_{ij}^2$ for all pairs $(i, j)$.
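A numerical illustration of the inequality, which is not a proof, can be run in R. The design matrix below is simulated placeholder data with an intercept column and three random covariates, and the small tolerance guards against floating-point rounding.

```r
# Simulated full-rank design matrix: intercept plus three random covariates.
set.seed(3022)
n <- 20
X <- cbind(1, matrix(rnorm(n * 3), nrow = n))

# Hat matrix H = X (X'X)^{-1} X'.
H <- X %*% solve(crossprod(X)) %*% t(X)

# gap[i, j] = h_ii - h_ij^2; diag(H) is recycled down each column,
# so row i of every column is compared against h_ii.
gap <- diag(H) - H^2
inequality_holds <- all(gap >= -1e-12)

# H should also be symmetric and idempotent up to rounding error.
max_idempotent_error <- max(abs(H %*% H - H))
max_symmetry_error <- max(abs(H - t(H)))

inequality_holds
```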
Output for Question 2
head(statedata, n=10)
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
Florida 8277 4815 1.3 70.66 10.7 52.6 11 54090
Georgia 4931 4091 2.0 68.54 13.9 40.6 60 58073
Region
Alabama South
Alaska West
Arizona West
Arkansas South
California West
Colorado West
Connecticut Northeast
Delaware South
Florida South
Georgia South
n <- nrow(statedata)
summary(statedata)
Population Income Illiteracy Life.Exp
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
Median : 2838 Median :4519 Median :0.950 Median :70.67
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
Murder HS.Grad Frost Area
Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
Median : 6.850 Median :53.25 Median :114.50 Median : 54277
Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81162
Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
Region
Northeast : 9
South :16
North Central:12
West :13
m1 <- lm(Murder ~ Population + Life.Exp, data=statedata)
### THIS IS THE MODEL FOR WHICH YOU ARE ASKED TO CONDUCT THE OMNIBUS F-TEST
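No summary is printed for m1, but its omnibus F-statistic can plausibly be recovered from the sequential ANOVA table of m2. The sketch below assumes that, because Population and Life.Exp enter first in `anova(m2)`, their sequential sums of squares are the same as in m1, and that the residual sum of squares of m1 equals the m2 residual plus the Income term. The numbers are copied from that output; the variable names are illustrative.

```r
# Sequential sums of squares copied from the anova(m2) output.
ss_pop    <- 78.85
ss_life   <- 384.90
ss_inc    <- 0.82
ss_res_m2 <- 203.17
n <- 50

# Regression and residual sums of squares for m1 (Population + Life.Exp only).
ss_reg_m1 <- ss_pop + ss_life
ss_res_m1 <- ss_res_m2 + ss_inc   # dropping Income returns its SS to the residual

df_reg <- 2
df_res <- n - 3                   # n minus (2 slopes + intercept)

f_stat <- (ss_reg_m1 / df_reg) / (ss_res_m1 / df_res)
p_val  <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)

c(F = f_stat, df1 = df_reg, df2 = df_res, p = p_val)
```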
m2 <- lm(Murder ~ Population + Life.Exp + Income, data = statedata)
summary(m2)
Call:
lm(formula = Murder ~ Population + Life.Exp + Income, data = statedata)
Residuals:
Min 1Q Median 3Q Max
-5.0604 -1.2165 -0.1001 1.4489 5.3766
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.530e+02 1.638e+01 9.340 3.41e-12 ***
Population 2.487e-04 6.955e-05 3.576 0.000834 ***
Life.Exp -2.055e+00 2.406e-01 -8.541 4.77e-11 ***
Income -2.309e-04 5.362e-04 -0.431 0.668697
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.102 on 46 degrees of freedom
Multiple R-squared: 0.6957, Adjusted R-squared: 0.6759
F-statistic: 35.06 on 3 and 46 DF, p-value: 6.03e-12
anova(m2)
Analysis of Variance Table
Response: Murder
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 78.85 78.85 17.8532 0.0001118 ***
Life.Exp 1 384.90 384.90 87.1441 3.464e-12 ***
Income 1 0.82 0.82 0.1855 0.6686974
Residuals 46 203.17 4.42
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m3 <- lm(Murder ~ Population + Life.Exp + Illiteracy, data = statedata)
anova(m3)
Analysis of Variance Table
Response: Murder
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 78.85 78.85 23.787 1.326e-05 ***
Life.Exp 1 384.90 384.90 116.105 3.591e-14 ***
Illiteracy 1 51.50 51.50 15.534 0.0002736 ***
Residuals 46 152.49 3.32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
infl <- data.frame(hat = hatvalues(m3), rstudent = rstudent(m3), cook =cooks.distance(m3))
infl %>% arrange(desc(abs(rstudent))) %>% head(., 10)
hat rstudent cook
Maine 0.06139142 -2.208454 0.07355211
Nevada 0.18311066 2.206144 0.25159499
Pennsylvania 0.09126453 -1.961411 0.09096211
Alabama 0.07621101 1.818887 0.06497285
Rhode Island 0.05735766 -1.697526 0.04211206
Michigan 0.05676647 1.624744 0.03835046
Massachusetts 0.03590075 -1.601374 0.02308784
West Virginia 0.05119165 -1.586043 0.03284830
New Jersey 0.03051520 -1.490858 0.01703709
Hawaii 0.29778250 1.485191 0.22787391
infl %>% arrange(desc(hat)) %>% head(., 10)
hat rstudent cook
California 0.33002523 0.3889005108 1.897553e-02
Hawaii 0.29778250 1.4851908513 2.278739e-01
New York 0.21587137 -0.3278453517 7.543885e-03
Nevada 0.18311066 2.2061436908 2.515950e-01
Louisiana 0.16902825 -0.4443576975 1.021933e-02
Texas 0.16401592 0.5550304426 1.534066e-02
Mississippi 0.13655717 -0.7739177830 2.388986e-02
South Carolina 0.13252219 -1.3764902216 7.098234e-02
New Mexico 0.10443795 0.0003646099 3.961913e-09
Illinois 0.09730404 0.4618623647 5.848520e-03
infl %>% arrange(desc(cook)) %>% head(., 10)
hat rstudent cook
Nevada 0.18311066 2.206144 0.25159499
Hawaii 0.29778250 1.485191 0.22787391
Pennsylvania 0.09126453 -1.961411 0.09096211
Maine 0.06139142 -2.208454 0.07355211
South Carolina 0.13252219 -1.376490 0.07098234
Alabama 0.07621101 1.818887 0.06497285
Rhode Island 0.05735766 -1.697526 0.04211206
Michigan 0.05676647 1.624744 0.03835046
West Virginia 0.05119165 -1.586043 0.03284830
Utah 0.07315692 1.203230 0.02829304
statedata[c('Nevada', 'California', 'New York', 'Hawaii', 'Maine'), ]
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Nevada 590 5149 0.5 69.03 11.5 65.2 188 109889
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
New York 18076 4903 1.4 70.55 10.9 52.7 82 47831
Hawaii 868 4963 1.9 73.60 6.2 61.9 0 6425
Maine 1058 3694 0.7 70.39 2.7 54.7 161 30920
Region
Nevada West
California West
New York Northeast
Hawaii West
Maine Northeast
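One way to organise the diagnostics for the five states is to compare each measure with a rule-of-thumb cutoff. The cutoffs below (leverage above 2p/n, absolute studentized residual above 2, Cook's distance above 4/n) are common conventions and are assumptions here; the course notes may use different thresholds, in which case the flags can change. The values are copied from the output above, and `infl5` is an illustrative name.

```r
n <- 50   # number of states
p <- 4    # coefficients in m3: intercept + 3 slopes

# Diagnostics for the five states, copied from the printed output.
infl5 <- data.frame(
  state    = c("Nevada", "California", "New York", "Hawaii", "Maine"),
  hat      = c(0.18311066, 0.33002523, 0.21587137, 0.29778250, 0.06139142),
  rstudent = c(2.206144, 0.3889005, -0.3278454, 1.485191, -2.208454),
  cook     = c(0.25159499, 0.01897553, 0.007543885, 0.22787391, 0.07355211)
)

# Rule-of-thumb flags (assumed cutoffs, see the note above).
infl5$high_leverage <- infl5$hat > 2 * p / n
infl5$outlier       <- abs(infl5$rstudent) > 2
infl5$influential   <- infl5$cook > 4 / n

infl5
```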
m4 <- lm(Murder ~ Population + Illiteracy + Life.Exp + I(Life.Exp^2) + I(Illiteracy^2), data = statedata)
anova(m4)
Analysis of Variance Table
Response: Murder
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 78.854 78.854 22.7823 2.032e-05 ***
Illiteracy 1 299.646 299.646 86.5726 5.882e-12 ***
Life.Exp 1 136.751 136.751 39.5097 1.282e-07 ***
I(Life.Exp^2) 1 0.086 0.086 0.0248 0.8756
I(Illiteracy^2) 1 0.115 0.115 0.0333 0.8560
Residuals 44 152.293 3.461
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
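The two sequential F-tests for the quadratic terms are not a joint test, so a partial F-test for both terms together may be the more suitable tool. The sketch below assumes that, since the two quadratic terms enter last, the sum of their sequential sums of squares equals the extra sum of squares for adding both to the model without them. The numbers are copied from the `anova(m4)` table.

```r
# Sequential SS of the two quadratic terms (entered last in m4).
ss_quad <- 0.086 + 0.115
df_quad <- 2

# Residual mean square and degrees of freedom of m4.
mse    <- 3.461
df_res <- 44

# Partial F-test of H0: both quadratic coefficients are zero.
f_stat <- (ss_quad / df_quad) / mse
p_val  <- pf(f_stat, df_quad, df_res, lower.tail = FALSE)

c(F = f_stat, df1 = df_quad, df2 = df_res, p = p_val)
```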
m5 <- lm(Murder ~ Population + Illiteracy + Life.Exp + Region, data=statedata)
summary(m5)
Call:
lm(formula = Murder ~ Population + Illiteracy + Life.Exp + Region,
data = statedata)
Residuals:
Min 1Q Median 3Q Max
-2.9482 -1.1034 -0.1339 1.0492 3.4071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.051e+02 1.571e+01 6.692 3.60e-08 ***
Population 2.707e-04 5.087e-05 5.322 3.50e-06 ***
Illiteracy 1.808e+00 5.145e-01 3.513 0.00106 **
Life.Exp -1.455e+00 2.179e-01 -6.676 3.80e-08 ***
RegionSouth 2.607e+00 7.800e-01 3.343 0.00173 **
RegionNorth Central 2.013e+00 6.925e-01 2.907 0.00575 **
RegionWest 3.106e+00 6.760e-01 4.595 3.76e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.528 on 43 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8287
F-statistic: 40.5 on 6 and 43 DF, p-value: 3.936e-16
vcov(m5) %>% round(., 4)
(Intercept) Population Illiteracy Life.Exp RegionSouth
(Intercept) 246.7340 -1e-04 -2.4464 -3.4197 -3.8602
Population -0.0001 0e+00 0.0000 0.0000 0.0000
Illiteracy -2.4464 0e+00 0.2648 0.0309 -0.1523
Life.Exp -3.4197 0e+00 0.0309 0.0475 0.0521
RegionSouth -3.8602 0e+00 -0.1523 0.0521 0.6083
RegionNorth Central 0.6830 0e+00 0.0611 -0.0141 0.1927
RegionWest -0.4582 0e+00 -0.0156 0.0025 0.2836
RegionNorth Central RegionWest
(Intercept) 0.6830 -0.4582
Population 0.0000 0.0000
Illiteracy 0.0611 -0.0156
Life.Exp -0.0141 0.0025
RegionSouth 0.1927 0.2836
RegionNorth Central 0.4795 0.2582
RegionWest 0.2582 0.4570
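The difference between the South and North Central intercepts is the contrast RegionSouth − RegionNorth Central, because both coefficients are measured against the same baseline region. A minimal sketch of the corresponding t-test follows, using the rounded estimates from `summary(m5)` and the rounded entries of `vcov(m5)`, so the result is approximate.

```r
# Coefficient estimates from summary(m5).
est_south <- 2.607
est_nc    <- 2.013

# Variances and covariance from vcov(m5), rounded to 4 decimals.
var_south <- 0.6083
var_nc    <- 0.4795
cov_s_nc  <- 0.1927

df_res <- 43   # residual degrees of freedom of m5

# Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b).
diff_est <- est_south - est_nc
se_diff  <- sqrt(var_south + var_nc - 2 * cov_s_nc)

t_stat <- diff_est / se_diff
p_val  <- 2 * pt(-abs(t_stat), df = df_res)

c(estimate = diff_est, se = se_diff, t = t_stat, p = p_val)
```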
Output for Question 3
head(dat, n=10)
y machine operator
1 109 1 A
2 110 1 A
3 110 1 B
4 112 1 B
5 116 1 C
6 114 1 C
7 110 2 A
8 115 2 A
9 110 2 B
10 111 2 B
OperatorMeans <- tapply(dat$y, dat$operator, mean)
OperatorMeans
A B C
109.875 111.125 115.875
MachineMeans <- tapply(dat$y, dat$machine, mean)
MachineMeans
1 2 3 4
111.8333 112.1667 111.6667 113.5000
m1 <- lm(y ~ machine*operator, data = dat)
anova(m1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
machine 3 12.458 4.153 1.0952 0.3887526
operator 2 160.333 80.167 21.1429 0.0001167 ***
machine:operator 6 44.667 7.444 1.9634 0.1506807
Residuals 12 45.500 3.792
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
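Tukey intervals for the operator means can be sketched from the printed operator means and the residual mean square of m1. The sketch assumes a balanced design, so each operator mean averages 8 observations (4 machines × 2 replicates, consistent with the 24 observations shown), and it uses `qtukey()` for the studentized range quantile.

```r
op_means <- c(A = 109.875, B = 111.125, C = 115.875)
mse      <- 3.792   # residual mean square from anova(m1)
df_res   <- 12
n_per_op <- 8       # assumed: 4 machines x 2 replicates per operator

# Tukey half-width: (q / sqrt(2)) * sqrt(MSE * (1/n + 1/n)) = q * sqrt(MSE / n).
q_crit     <- qtukey(0.95, nmeans = 3, df = df_res)
half_width <- q_crit * sqrt(mse / n_per_op)

# All pairwise differences, later level minus earlier level.
pairs <- combn(names(op_means), 2)
diffs <- op_means[pairs[2, ]] - op_means[pairs[1, ]]
names(diffs) <- paste(pairs[2, ], pairs[1, ], sep = "-")

ci <- cbind(diff = diffs, lwr = diffs - half_width, upr = diffs + half_width)
ci
```

An interval that excludes zero indicates a significant pairwise difference at the 5% family-wise level.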
m2 <- lm(y ~ machine + operator, data = dat)
#### THIS IS THE MODEL FOR WHICH YOU ARE ASKED TO CONSTRUCT THE ANOVA TABLE
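The ANOVA table for the additive model can be assembled from the `anova(m1)` output. The sketch assumes a balanced design, in which the main-effect sums of squares are the same with or without the interaction, so the interaction sum of squares and its degrees of freedom are pooled into the residual. The data frame name `tab` is illustrative.

```r
# Sums of squares and degrees of freedom copied from anova(m1).
ss_machine  <- 12.458;  df_machine  <- 3
ss_operator <- 160.333; df_operator <- 2
ss_inter    <- 44.667;  df_inter    <- 6
ss_res_m1   <- 45.500;  df_res_m1   <- 12

# Pool the interaction into the residual for the additive model.
ss_res <- ss_inter + ss_res_m1
df_res <- df_inter + df_res_m1
ms_res <- ss_res / df_res

tab <- data.frame(
  Df    = c(df_machine, df_operator, df_res),
  SumSq = c(ss_machine, ss_operator, ss_res),
  row.names = c("machine", "operator", "Residuals")
)
tab$MeanSq <- tab$SumSq / tab$Df
tab$F      <- c(tab$MeanSq[1:2] / ms_res, NA)
tab$p      <- c(pf(tab$F[1:2], tab$Df[1:2], df_res, lower.tail = FALSE), NA)

tab
```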
library(lme4)
fit.lme <- lmer(y ~ (1|operator) + (1|machine) + (1|(operator:machine)), data=dat)
boundary (singular) fit: see ?isSingular
summary(fit.lme)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ (1 | operator) + (1 | machine) + (1 | (operator:machine))
Data: dat
REML criterion at convergence: 109.8
Scaled residuals:
Min 1Q Median 3Q Max
-1.41189 -0.60892 0.03501 0.22287 2.03048
Random effects:
Groups Name Variance Std.Dev.
(operator:machine) (Intercept) 1.278 1.130
machine (Intercept) 0.000 0.000
operator (Intercept) 9.228 3.038
Residual 3.792 1.947
Number of obs: 24, groups: (operator:machine), 12; machine, 4; operator, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 112.292 1.828 61.44
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
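For comparison with the REML estimates above, method-of-moments estimates can be sketched from the mean squares in `anova(m1)`. The sketch assumes the usual expected mean squares for a balanced two-way random-effects model with a = 4 machines, b = 3 operators, and 2 replicates per cell. A negative moment estimate is conventionally truncated at zero, which matches the boundary (singular) fit reported by `lmer`.

```r
# Mean squares copied from anova(m1).
ms_machine  <- 4.153
ms_operator <- 80.167
ms_inter    <- 7.444
ms_res      <- 3.792

a <- 4       # machines
b <- 3       # operators
n_rep <- 2   # replicates per machine-operator cell

# Solve the expected-mean-square equations for the variance components.
vc_mom <- c(
  residual    = ms_res,
  interaction = (ms_inter - ms_res) / n_rep,
  machine     = (ms_machine - ms_inter) / (b * n_rep),
  operator    = (ms_operator - ms_inter) / (a * n_rep)
)
vc_mom_trunc <- pmax(vc_mom, 0)   # truncate negative estimates at zero

# F-test for the interaction variance component: MS_inter / MS_res.
f_inter <- ms_inter / ms_res
p_inter <- pf(f_inter, 6, 12, lower.tail = FALSE)

vc_mom
vc_mom_trunc
c(F = f_inter, p = p_inter)
```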
THIS IS THE LAST PAGE