XX May 2021
EXAMINATION FOR THE DEGREES OF M.A., M.SCI. AND B.SC.
(SCIENCE)
Statistics – 3H Linear Models
This paper consists of 9 pages and contains 3 questions.
Candidates should attempt all questions.
Question 1 20 marks
Question 2 20 marks
Question 3 20 marks
Total 60 marks
The following material is made available to you:
Statistical tables
Formula sheet
NOTE: Candidates must attempt all questions.
CONTINUED OVERLEAF/
1. (a) We have the following mean model for a linear regression:
$$E(y_i \mid x_i, z_i) = \beta_0 + \beta_1 \mathbb{1}[x_i > 3.5] + \beta_2 z_i,$$
where $\mathbb{1}[\cdot]$ is an indicator function equal to 1 if the statement in the brackets is
true and 0 otherwise. We have the following values for x = (1, 6.4, 3, 5.1, 2.9, 11.3)
and z = (15.2, 20.5, 12.4, 10.5, 21.0, 19.1). Write out the design matrix for this
regression along with the corresponding defined parameter vector. [3 MARKS]
(b) List the assumptions required to fit this model and do inference. [2 MARKS]
(c) This is a special case of linear regression. What is the name for this type of model?
[1 MARK]
(d) Suggest a reason for why the indicator function of x might be used rather than
the original x. What is a disadvantage of using the indicator? [2 MARKS]
(e) Is this a main effects or complete model? If it is a main effects model, explain how
to make it a complete model; if it is a complete model, explain how to make it a
main effects one. [2 MARKS]
(f) Figure 1 is a standard diagnostic plot for this model. What is this type of plot
called? Comment on whether it indicates a problem with a model assumption and,
if so, state which assumption and why. [3 MARKS]
(g) Suppose we observed an issue with the homoscedasticity assumption in our diagnostic
plots.
i. Which plot would have been used to check this assumption, and what should
we have seen if the assumption was satisfied? [2 MARKS]
ii. What two strategies could we use to try to deal with this problem? List a
disadvantage of either one of the strategies suggested. [3 MARKS]
(h) Comment on the DFFITS plot in Figure 2 for this model. [2 MARKS]
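The setup in part (a) can be sketched in R as follows. This is a minimal illustration using the x and z values given in the question; the object names `ind` and `X` are assumptions made for the example. The columns correspond, in order, to the intercept, the indicator coefficient and the z coefficient.

```r
# Data as given in Question 1(a)
x <- c(1, 6.4, 3, 5.1, 2.9, 11.3)
z <- c(15.2, 20.5, 12.4, 10.5, 21.0, 19.1)

# Indicator 1[x > 3.5]: 1 when the condition holds, 0 otherwise
ind <- as.numeric(x > 3.5)

# Design matrix: a column of ones, the indicator column, and z
# (the names `ind` and `X` are illustrative, not from the paper)
X <- cbind(intercept = 1, ind = ind, z = z)
print(X)
```

Building the matrix by hand with `cbind` keeps each column explicit, which matches how the design matrix would be written out on paper.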
Figure 1: Question 1(f)
Figure 2: Question 1(h)
2. A biostatistician is interested in looking at the relationship between a range of recorded
variables and the mean per capita cancer mortalities (TargetdeathRate) in some areas
of the United States. He is curious to see what an automated model building approach
suggests and uses an AIC backward search. Partial results from the first two steps of
this search are given here:
> step(g, direction = "backward")
Start: AIC=3583.84
TargetdeathRate ~ avgAnnCount + incidenceRate + medIncome +
popEst2015 + povertyPercent + studyPerCap + binnedInc + MedianAge +
MedianAgeMale + MedianAgeFemale + AvgHouseholdSize + PercentMarried +
PctNoHS18_24 + PctHS18_24 + PctSomeCol18_24 + PctBachDeg18_24 +
PctHS25_Over + PctBachDeg25_Over + PctEmployed16_Over + PctUnemployed16_Over +
PctPrivateCoverage + PctPrivateCoverageAlone + PctEmpPrivCoverage +
PctPublicCoverage + PctPublicCoverageAlone + PctWhite + PctBlack +
PctAsian + PctOtherRace + PctMarriedHouseholds + BirthRate
Df Sum of Sq RSS AIC
- studyPerCap 1 0 222012 3581.8
- PctAsian 1 2 222014 3581.8
- PctPrivateCoverage 1 11 222023 3581.9
- AvgHouseholdSize 1 96 222108 3582.1
- PctPrivateCoverageAlone 1 256 222267 3582.5
- avgAnnCount 1 264 222276 3582.5
- BirthRate 1 265 222276 3582.5
- popEst2015 1 309 222320 3582.7
- povertyPercent 1 330 222342 3582.7
- binnedInc 9 6536 228548 3583.0
- MedianAge 1 517 222529 3583.2
- MedianAgeFemale 1 524 222535 3583.2
- PctUnemployed16_Over 1 639 222650 3583.5
<none> 222012 3583.8
- PctPublicCoverage 1 768 222780 3583.9
- PctWhite 1 941 222952 3584.3
- PctPublicCoverageAlone 1 969 222980 3584.4
- PctHS18_24 1 974 222986 3584.4
- PctSomeCol18_24 1 980 222991 3584.4
- PctNoHS18_24 1 1005 223016 3584.5
- PctBachDeg18_24 1 1053 223065 3584.6
- medIncome 1 1300 223311 3585.3
- PctBlack 1 1429 223441 3585.6
- PctEmpPrivCoverage 1 1583 223595 3586.0
- PctHS25_Over 1 1592 223604 3586.1
- MedianAgeMale 1 2296 224308 3587.9
- PctEmployed16_Over 1 2769 224781 3589.2
- PctBachDeg25_Over 1 3075 225087 3590.0
- PctOtherRace 1 3410 225421 3590.9
- PercentMarried 1 8775 230786 3604.8
- PctMarriedHouseholds 1 13638 235649 3617.1
- incidenceRate 1 37992 260004 3675.2
Step: AIC=3581.84
TargetdeathRate ~ avgAnnCount + incidenceRate + medIncome +
popEst2015 + povertyPercent + binnedInc + MedianAge + MedianAgeMale +
MedianAgeFemale + AvgHouseholdSize + PercentMarried + PctNoHS18_24 +
PctHS18_24 + PctSomeCol18_24 + PctBachDeg18_24 + PctHS25_Over +
PctBachDeg25_Over + PctEmployed16_Over + PctUnemployed16_Over +
PctPrivateCoverage + PctPrivateCoverageAlone + PctEmpPrivCoverage +
PctPublicCoverage + PctPublicCoverageAlone + PctWhite + PctBlack +
PctAsian + PctOtherRace + PctMarriedHouseholds + BirthRate
Df Sum of Sq RSS AIC
- PctAsian 1 2 222014 3579.8
- PctPrivateCoverage 1 11 222023 3579.9
- AvgHouseholdSize 1 96 222108 3580.1
- PctPrivateCoverageAlone 1 255 222267 3580.5
- avgAnnCount 1 264 222276 3580.5
- BirthRate 1 265 222277 3580.5
- popEst2015 1 309 222321 3580.7
- povertyPercent 1 331 222343 3580.7
- binnedInc 9 6555 228567 3581.0
- MedianAge 1 517 222529 3581.2
- MedianAgeFemale 1 524 222536 3581.2
- PctUnemployed16_Over 1 639 222651 3581.5
<none> 222012 3581.8
- PctPublicCoverage 1 768 222780 3581.9
- PctWhite 1 942 222954 3582.3
- PctPublicCoverageAlone 1 969 222981 3582.4
- PctHS18_24 1 975 222987 3582.4
- PctSomeCol18_24 1 980 222992 3582.4
- PctNoHS18_24 1 1005 223017 3582.5
- PctBachDeg18_24 1 1054 223066 3582.6
- medIncome 1 1303 223315 3583.3
- PctBlack 1 1429 223441 3583.6
- PctEmpPrivCoverage 1 1586 223598 3584.1
- PctHS25_Over 1 1610 223622 3584.1
- MedianAgeMale 1 2302 224314 3585.9
- PctEmployed16_Over 1 2788 224800 3587.2
- PctBachDeg25_Over 1 3076 225088 3588.0
- PctOtherRace 1 3410 225421 3588.9
- PercentMarried 1 8774 230786 3602.8
- PctMarriedHouseholds 1 13656 235668 3615.1
- incidenceRate 1 38106 260118 3673.5
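The AIC column in the output above can be reproduced from the RSS column. The sketch below assumes the criterion that R's `step()` reports for a linear model, n log(RSS/n) + 2p, and assumes n = 591, a value inferred from the final model later in this question (571 residual degrees of freedom plus 20 estimated coefficients) on the further assumption that the same data were used throughout. The helper name `step_aic` is hypothetical. The RSS values are the rounded figures printed in the output, so agreement is only approximate.

```r
# Criterion assumed for step() on a linear model: n * log(RSS / n) + 2 * p
# (`step_aic` is a hypothetical helper name used only for this sketch)
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 591               # assumed: 571 residual df + 20 coefficients in the final model
p_full <- 1 + 30 + 9   # intercept + 30 single-df terms + 9 df for binnedInc

# Starting model, and the model with binnedInc (9 df) removed
aic_full <- step_aic(222012, n, p_full)
aic_drop_binned <- step_aic(228548, n, p_full - 9)

round(c(full = aic_full, drop_binnedInc = aic_drop_binned), 2)
```

This shows why dropping binnedInc, despite the large increase in RSS, still lowers the criterion: it removes nine parameters at once.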
(a) What is the next decision made by the algorithm? [1 MARK]
(b) Which candidate variable is likely to be removed in the next (third) step following
the last piece of R output? Will this definitely be the case? Why/why not?
[3 MARKS]
(c) Why might this approach be preferable to one using hypothesis testing to choose
a final model? How could one remedy the issue with using hypothesis tests?
[2 MARKS]
(d) Once the search has concluded with a final model, suggest one piece of further
exploration that could be worth doing. [1 MARK]
(e) The final model output is given here:
> summary(final)
Call:
lm(formula = TARGETdeathRate ~ incidenceRate + medIncome + MedianAgeMale +
PercentMarried + PctNoHS18_24 + PctHS18_24 + PctSomeCol18_24 +
PctBachDeg18_24 + PctHS25_Over + PctBachDeg25_Over + PctEmployed16_Over +
PctPrivateCoverageAlone + PctEmpPrivCoverage + PctPublicCoverage +
PctPublicCoverageAlone + PctWhite + PctBlack + PctOtherRace +
PctMarriedHouseholds, data = data)
Residuals:
Min 1Q Median 3Q Max
-79.88 -10.83 0.16 10.97 107.32
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.732e+03 1.484e+03 1.842 0.06604 .
incidenceRate 1.619e-01 1.670e-02 9.696 < 2e-16 ***
medIncome 3.088e-04 1.755e-04 1.760 0.07899 .
MedianAgeMale -6.698e-01 3.295e-01 -2.033 0.04255 *
PercentMarried 1.939e+00 3.926e-01 4.939 1.03e-06 ***
PctNoHS18_24 -2.578e+01 1.482e+01 -1.740 0.08247 .
PctHS18_24 -2.539e+01 1.483e+01 -1.712 0.08738 .
PctSomeCol18_24 -2.548e+01 1.483e+01 -1.718 0.08629 .
PctBachDeg18_24 -2.618e+01 1.483e+01 -1.765 0.07806 .
PctHS25_Over 5.834e-01 2.376e-01 2.456 0.01436 *
PctBachDeg25_Over -9.527e-01 3.788e-01 -2.515 0.01217 *
PctEmployed16_Over -8.915e-01 2.254e-01 -3.955 8.61e-05 ***
PctPrivateCoverageAlone -7.551e-01 3.759e-01 -2.009 0.04503 *
PctEmpPrivCoverage 6.314e-01 2.712e-01 2.328 0.02025 *
PctPublicCoverage -1.175e+00 4.431e-01 -2.652 0.00822 **
PctPublicCoverageAlone 1.441e+00 4.841e-01 2.976 0.00304 **
PctWhite 2.010e-01 1.329e-01 1.512 0.13098
PctBlack 2.963e-01 1.244e-01 2.382 0.01755 *
PctOtherRace -9.089e-01 3.057e-01 -2.973 0.00307 **
PctMarriedHouseholds -2.191e+00 3.660e-01 -5.985 3.82e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.11 on 571 degrees of freedom
Multiple R-squared: 0.4737, Adjusted R-squared: 0.4561
F-statistic: 27.04 on 19 and 571 DF, p-value: < 2.2e-16
i. What distribution (along with any details like degrees of freedom) is being
used to produce the p-values in the final column of this output?
[2 MARKS]
ii. What are the null and alternative hypotheses for the model F test and what
is the conclusion here? [2 MARKS]
iii. The condition number for this model is 39976.49. What can we conclude from
this? Suggest a strategy for remedying this. [3 MARKS]
iv. The parameters for MedianAgeMale and PctPrivateCoverageAlone were of
particular interest to the investigator, so a partial F-test for these two covariates
was run. The p-value was 0.06. Give the null and alternative hypotheses
for this test and comment on its result with respect to the previous model
output. [2 MARKS]
v. Interpret the coefficient of PercentMarried (which is the percentage of county
residents who are married). [2 MARKS]
(f) Give one advantage and one disadvantage of using maximum likelihood with normality
over ordinary least squares to estimate parameters in a linear model.
[2 MARKS]
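For the summary output in part (e), the reported p-values and F statistic can be checked from the printed figures. The following is a rough sketch, assuming the usual two-sided t test on the residual degrees of freedom and the standard relationship between R-squared and the model F statistic. Inputs are the rounded values from the output, so results agree only to rounding; the variable names are illustrative.

```r
df_resid <- 571   # residual degrees of freedom from the output
k <- 19           # number of covariates in the final model

# Two-sided p-value for the MedianAgeMale t value
t_val <- -2.033
p_t <- 2 * pt(-abs(t_val), df = df_resid)

# Model F statistic reconstructed from the multiple R-squared
r2 <- 0.4737
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_f <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(c(p_t = p_t, f_stat = f_stat, p_f = p_f))
```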
3. (a) A researcher into effective study habits looks at a set of volunteers and measures
3 continuous scores related to their lifestyle (x1, x2 and x5) and 2 categorical
scores (x3 - whether they have a positive, neutral or negative outlook on life and
x4 - assigned gender at birth). They fit a linear model with main effects for all
variables and an interaction between x1 and x2. A partial anova table for the
resulting model is given here:
Analysis of Variance Table
Response: y
          Df Sum Sq Mean Sq    F value    Pr(>F)
x1         1  362.7   362.7          u < 2.2e-16 ***
x2         1 5148.9  5148.9 17737.0072 < 2.2e-16 ***
x3         2    5.9       v    10.1954 0.0002459 ***
x4         1  183.3   183.3   631.3107 < 2.2e-16 ***
x5         1    0.0     0.0     0.0081         w
x1:x2      1  381.4   381.4  1313.8350 < 2.2e-16 ***
Residuals 42   12.2     0.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i. How many observations/subjects did the researcher have in their analysis?
[1 MARK]
ii. Calculate the missing values of u and v from the anova table. [2 MARKS]
iii. What are the null and alternative hypotheses for the p-value of the test in the
missing value w? Using the statistical tables, calculate a critical value (listing
the distribution and degrees of freedom) for the test of the x5 coefficient.
What conclusion do you reach about this term? What alternative test could
you have run to check the significance of the x5 term? Would it definitely
have given the same result or not? [6 MARKS]
iv. Having stupidly thrown away their original data, the researcher now wants to
run an F test comparing the model without the interaction and x5 terms to
the full model. Using the data in the table, produce the observed F statistic
and run the test, giving a conclusion on which model to retain. [4 MARKS]
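The arithmetic behind part (a) can be sketched in R using only the figures printed in the table. This assumes the table is a sequential (Type I) ANOVA, as produced by R's `anova()` on a single fitted model, so that the sums of squares for the last two terms (x5 and x1:x2) add up to the extra sum of squares for dropping both. The printed sums of squares are rounded to one decimal place, so the results are approximate; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom as printed in the table
ss <- c(x1 = 362.7, x2 = 5148.9, x3 = 5.9, x4 = 183.3, x5 = 0.0, "x1:x2" = 381.4)
df <- c(x1 = 1, x2 = 1, x3 = 2, x4 = 1, x5 = 1, "x1:x2" = 1)
ss_res <- 12.2
df_res <- 42

n <- sum(df) + df_res + 1   # total degrees of freedom equal n - 1
mse <- ss_res / df_res      # residual mean square from the rounded SS

v <- ss[["x3"]] / df[["x3"]]             # mean square for x3
u <- (ss[["x1"]] / df[["x1"]]) / mse     # F value for x1

# Observed F for dropping x5 and x1:x2 together (2 df), assuming sequential SS
f_obs <- ((ss[["x5"]] + ss[["x1:x2"]]) / 2) / mse
f_crit <- qf(0.95, 2, df_res)            # 5% critical value of F(2, 42)

print(c(n = n, u = u, v = v, f_obs = f_obs, f_crit = f_crit))
```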
(b) The same researcher is looking at the proportion of people, y, in each area of Glasgow
who smoke. They want to fit a linear model for the effect of various covariates
on this outcome. A statistician convinces the researcher to use the transformed
outcome $y^* = \arcsin(y) = \sin^{-1}(y)$ instead. They end up with a
simple linear regression of $y^*$ on a single covariate x.
The researcher wants to get a prediction and range of plausible values for the
proportion of people who smoke when the value of x is equal to 0. The statistician
presents them with two outputs from R:
> predict(mod, data.frame(x=0), interval="prediction")
        fit       lwr       upr
1 0.2996512 0.2801427 0.3191596
> predict(mod, data.frame(x=0), interval="confidence")
        fit       lwr       upr
1 0.2996512 0.2977087 0.3015937
i. The researcher decides to take the second interval as it’s shorter. Comment
on this decision and how they could have otherwise decided between the two.
[2 MARKS]
ii. Using the prediction interval output, explain what proportion of smokers are
likely in an area with x = 0. [3 MARKS]
iii. What checks should the statistician have done before producing this output?
[2 MARKS]
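For part (b)ii, the prediction interval is on the arcsine scale and has to be mapped back to the proportion scale. A minimal sketch, assuming the transformation stated in the question (so the inverse is the sine function, which is increasing over this range and therefore maps interval endpoints to interval endpoints); the object names `pred` and `prop` are illustrative:

```r
# Prediction interval on the transformed (arcsine) scale, from the R output
pred <- c(fit = 0.2996512, lwr = 0.2801427, upr = 0.3191596)

# Back-transform to the proportion scale: y = sin(y*)
prop <- sin(pred)

round(prop, 3)
```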
END OF QUESTION PAPER.