STAT3022 - R Assignment Help - Assignment 1 | 学霸联盟

Date: 2024-04-10

STAT3022 Applied Linear Models Semester 1, 2024
Assignment 1
Please read the following statements carefully.
• You should attempt to solve the problems by yourself. You are free to consult others about
the assignments and are encouraged to put your questions in the Ed discussion if you need any
clarification. However, collusion or collaboration beyond this point will be considered cheating.
• Any materials submitted for an assignment must be your own work. Simply copying, or copying
and modifying, other people’s work or material from the Internet is not permitted. It is important that
you get the benefits of thinking hard about the problems, their formulations, and the results.
• Unless otherwise stated in the question, you must show all of your work to receive full credit
for any problem.
• Submission: You must type your solution in R Markdown. Any mathematical expressions
should be typeset neatly. Presentation of the solution will be taken into account in the marking
of each question. Please submit only two files: (1) an HTML or PDF file that contains your answers
AND your relevant R code, and (2) the .Rmd file that produces it. Failing to submit the .Rmd
file, or not writing the solution in R Markdown, will lead to a deduction of 50% of your total mark
for this assignment. Instructions for using R Markdown were given in the Week 1 computer lab.
• The materials relevant to this assignment include the lectures, tutorials, and computer labs for
the first 6 weeks of the unit.
• For any question that requires R code (question 2), no credit will be given if the code is not
submitted.
Copyright © The University of Sydney 1
Question 1 (45 marks)
An analyst studies the relationship between the salaries (Y) of academics in a university in the US
and X1 = number of years since PhD, X2 = number of years of service, X3 = gender, which is a
categorical variable with 2 levels (Female and Male), and X4 = academic rank, which is a categorical
variable with 3 levels (A, B, and C). The analyst fitted several models in R, whose selected outputs
are given at the end of this document.
Based on the outputs, answer the following questions. Note that “not enough information to
answer” can be the correct response.
(i) First, the analyst fitted the model m1 that only contains the two quantitative (continuous)
variables X1 and X2. Obtain the ANOVA table for this model.
(ii) The model m2 in the output contains two quantitative variables X1, X2 and the categorical
variable X3 without any interaction. Complete the summary table of the model m2 below.
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    8.28759          ?       ?  < 2e-16 ***
yrs.since.phd  0.15528          ?       ? 3.15e-09 ***
yrs.service   -0.06498          ?       ?   0.0109 *
sexMale        0.84571          ?       ?   0.0701 .
---
Residual standard error: ? on ? degrees of freedom
Multiple R-squared: ?, Adjusted R-squared: ?
F-statistic: ? on ? and ? DF, p-value: < 2.2e-16
(iii) For the model m2, state the models under the null and alternative hypotheses corresponding to
yrs.service in (1) the t test given in the summary table (outlined in part (ii) of this question)
and (2) the F test given in the ANOVA table (given in the output).
(iv) From the model m2, obtain the point prediction and 95% prediction intervals for (1) a female
academic with 5 years since PhD and 2 years of service, and (2) a male academic with 3 years
since PhD and 2 years of service.
(v) Consider the model m3 with the additive effects of X1, X2, and X4, as well as the interaction
between X4 and X1 and the interaction between X4 and X2. Given that the normality assumption is
reasonable, conduct an appropriate F-test to determine whether all the interaction terms can be
dropped from the model. Please specify the models under the null and alternative hypotheses,
the value of the test statistic, and the p-value of the test.
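The mechanics behind part (iv) can be sketched as follows. This is only an illustration on R's built-in mtcars data, not on the salary data, and the model, the variable names, and the new observation x0 are assumptions chosen for the example. It computes a 95% prediction interval by hand from (X'X)^{-1} and the residual standard error, then compares it with predict().

```r
# Illustration on the built-in mtcars data (NOT the assignment's salary data).
fit <- lm(mpg ~ wt + hp + am, data = mtcars)

X      <- model.matrix(fit)
V      <- solve(t(X) %*% X)        # (X'X)^{-1}
s      <- summary(fit)$sigma       # residual standard error
df_res <- df.residual(fit)         # n - p

# Hypothetical new observation, in the same column order as X:
# (Intercept), wt, hp, am
x0 <- c(1, 3.0, 110, 0)

# Point prediction and its prediction standard error:
# se_pred = s * sqrt(1 + x0' (X'X)^{-1} x0)
y_hat   <- sum(x0 * coef(fit))
se_pred <- s * sqrt(1 + as.numeric(t(x0) %*% V %*% x0))
t_crit  <- qt(0.975, df_res)

manual <- c(fit = y_hat,
            lwr = y_hat - t_crit * se_pred,
            upr = y_hat + t_crit * se_pred)

# The same interval from predict(), for comparison
builtin <- predict(fit,
                   newdata  = data.frame(wt = 3.0, hp = 110, am = 0),
                   interval = "prediction", level = 0.95)

print(manual)
print(builtin)
```

The "1 +" inside the square root is what distinguishes a prediction interval for a new response from a confidence interval for the mean response.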
Question 2 (45 marks)
The Scholastic Aptitude Test, or SAT, is a standard test used throughout the United States to
determine college entrance. When it was first introduced in 1982, the large variation among average
SAT scores between the states became an area of great concern for some states and great pride
for others. But what causes such variation? A scientist set out to determine the extent to which
demographic variables influenced SAT scores, and carried out a large study to address this issue.
They measured, for each state (state): the average total SAT score (Y = total); the total state
expenditure on secondary schools, expressed in hundreds of dollars per student (X1 = expend);
the average students/teacher ratio (X2 = ratio); the average teacher salary (X3 = salary), and
percentage of SAT takers (X4 = perc). The dataset SATscore is available on Canvas.
(i) Obtain the pairwise scatterplots and a correlation matrix for the outcome and the 4 demographic
variables above (note that state should be treated only as the row name,
not as a variable in this dataset).
(ii) Fit the multiple linear regression of the outcome on all the 4 demographic variables. Obtain
the summary table, and based on it, write the fitted regression equation.
(iii) Conduct appropriate model diagnostics to check the normality and constant variance
assumptions of the model.
(iv) Can any state be considered an influential observation for the model? For any state that
you consider to be influential, is it influential mostly because it is an outlier, a high-leverage
point, or both? If it is a high-leverage point, what causes it?
Clearly provide the evidence (e.g., numbers, plots, etc.) that supports your conclusion.
(v) One common measure of multicollinearity in the model with all continuous covariates is the
variance inflation factor (VIF), which is defined as follows. To compute the VIF for Xk, we treat
Xk as the response, and then fit a multiple linear regression of Xk on all the other covariates
in the model. Denoting $R_k^2$ as the multiple R-squared of this model, then
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}.$$
Using this definition, obtain the VIF for all the covariates in the model. You are not allowed
to use any additional package in R to compute it.
(vi) A common rule of thumb is that if any covariate has a VIF greater than 5, then multicollinearity
is serious in the model. Based on that rule, (1) comment on whether the model has serious
multicollinearity, and (2) relate it to the pairwise correlation plot.
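The VIF definition in part (v) can be sketched in base R as below. This is a minimal illustration on the built-in mtcars data rather than the SATscore data, and the four covariates used are an arbitrary assumption for the example. Each covariate is regressed on the others, and the resulting R-squared is plugged into 1 / (1 - R^2).

```r
# Illustration on the built-in mtcars data (NOT the SATscore data).
# The choice of covariates is arbitrary and only serves the example.
covariates <- c("wt", "hp", "disp", "drat")

vif <- sapply(covariates, function(xk) {
  others <- setdiff(covariates, xk)
  # Auxiliary regression: X_k as the response, all other covariates as predictors
  aux <- lm(reformulate(others, response = xk), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif, 3))
```

Only lm(), summary(), and reformulate() from base R are used, so no additional package is needed. A VIF is always at least 1, and it equals 1 only when the covariate is uncorrelated with a linear combination of the others.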
Question 3 (10 marks)
Consider the linear model with outcome Y and three quantitative (continuous) covariates X1, X2
and X3. Let rjk be the sample correlation between Xj and Xk. Assume X1 and X2 are uncorrelated,
i.e., r12 = 0. From the lecture notes, we know that SSR(X1, X2) = SSR(X1) + SSR(X2) in this case.
Now, assume both X1 and X2 are correlated with X3, i.e., r13 ≠ 0 and r23 ≠ 0. In this case, is the
claim that SSR(X1, X2 | X3) = SSR(X1 | X3) + SSR(X2 | X3) always true? If so, prove it. If not,
provide at least one counterexample (numerically or theoretically) where the above equality does not
hold.
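For reference, the conditional (extra) sums of squares appearing in the claim can be written in terms of ordinary regression sums of squares. This assumes the standard extra-sum-of-squares notation, which is presumably the one used in the lecture notes. It restates definitions only and does not settle whether the equality holds.

```latex
\begin{align*}
\mathrm{SSR}(X_1 \mid X_3) &= \mathrm{SSR}(X_1, X_3) - \mathrm{SSR}(X_3), \\
\mathrm{SSR}(X_2 \mid X_3) &= \mathrm{SSR}(X_2, X_3) - \mathrm{SSR}(X_3), \\
\mathrm{SSR}(X_1, X_2 \mid X_3) &= \mathrm{SSR}(X_1, X_2, X_3) - \mathrm{SSR}(X_3).
\end{align*}
```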
R Outputs for Question 1
head(dat, n = 10)
    salary    sex yrs.since.phd yrs.service Rank
1  13.9750   Male            19          18    A
2  17.3200   Male            20          16    A
3   7.9750   Male             4           3    C
4  11.5000   Male            45          39    A
5  14.1500   Male            40          41    A
6   9.7000   Male             6           6    B
7  17.5000   Male            30          23    A
8  14.7765   Male            45          45    A
9  11.9250   Male            21          20    A
10 12.9000 Female            18          18    A
# THIS IS THE MODEL FOR WHICH YOU ARE ASKED TO PRODUCE THE ANOVA TABLE
m1 <- lm(salary ~ yrs.since.phd + yrs.service, data = dat)
m2 <- lm(salary ~ yrs.since.phd + yrs.service + sex, data = dat)
anova(m2)
Analysis of Variance Table
Response: salary
               Df  Sum Sq Mean Sq F value  Pr(>F)
yrs.since.phd   1  638.52  638.52 85.8141 < 2e-16 ***
yrs.service     1   45.74   45.74  6.1475 0.01358 *
sex             1   24.55   24.55  3.2990 0.07008 .
Residuals     393 2924.20    7.44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
X <- model.matrix(m2)
V <- solve(t(X) %*% X)
round(V, 6)
              (Intercept) yrs.since.phd yrs.service   sexMale
(Intercept)      0.030973     -0.000538    0.000308 -0.024242
yrs.since.phd   -0.000538      0.000088   -0.000079 -0.000035
yrs.service      0.000308     -0.000079    0.000087 -0.000071
sexMale         -0.024242     -0.000035   -0.000071  0.029136
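The matrix V = (X'X)^{-1} printed above is the ingredient that links an ANOVA table to a coefficient summary table. The sketch below shows that link on the built-in mtcars data, not on the salary data, and the model is an assumption chosen for the example. Standard errors come from the residual mean square times the diagonal of V, and t values and p-values follow from them.

```r
# Illustration on the built-in mtcars data (NOT the assignment's salary data).
fit <- lm(mpg ~ wt + hp + am, data = mtcars)

X <- model.matrix(fit)
V <- solve(t(X) %*% X)               # (X'X)^{-1}

sse    <- sum(resid(fit)^2)          # residual sum of squares
df_res <- df.residual(fit)           # residual degrees of freedom
s2     <- sse / df_res               # residual mean square (estimate of sigma^2)

se    <- sqrt(s2 * diag(V))          # standard errors of the coefficients
t_val <- coef(fit) / se              # t statistics for H0: beta_j = 0
p_val <- 2 * pt(abs(t_val), df_res, lower.tail = FALSE)

manual <- cbind(Estimate     = coef(fit),
                `Std. Error` = se,
                `t value`    = t_val,
                `Pr(>|t|)`   = p_val)

print(manual)
print(coef(summary(fit)))            # should agree with the manual table
```

The residual standard error reported by summary() is sqrt(s2), so the residual row of an ANOVA table together with V is enough to rebuild the standard error and t value columns.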
m3 <- lm(salary ~ (yrs.since.phd + yrs.service)*Rank, data=dat)
anova(m3)
Analysis of Variance Table
Response: salary
                    Df  Sum Sq Mean Sq  F value    Pr(>F)
yrs.since.phd        1  638.52  638.52 114.2057 < 2.2e-16 ***
yrs.service          1   45.74   45.74   8.1814  0.004461 **
Rank                 2  764.88  382.44  68.4038 < 2.2e-16 ***
yrs.since.phd:Rank   2   10.12    5.06   0.9052  0.405296
yrs.service:Rank     2    4.45    2.23   0.3983  0.671724
Residuals          388 2169.29    5.59
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
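Because anova() reports sequential (Type I) sums of squares, the rows for terms entered last can be pooled to test several terms at once. The sketch below illustrates this on the built-in mtcars data, not on the salary data. The model and the use of factor(cyl) as the categorical variable are assumptions for the example. It pools the interaction rows into a single F statistic and checks it against a direct comparison of the nested models.

```r
# Illustration on the built-in mtcars data (NOT the assignment's salary data).
full    <- lm(mpg ~ (wt + hp) * factor(cyl), data = mtcars)
reduced <- lm(mpg ~ wt + hp + factor(cyl),   data = mtcars)

tab <- anova(full)                        # sequential sums of squares

# The interaction terms enter last, so their sequential SS add up to
# SSE(reduced) - SSE(full)
int_rows <- grepl(":", rownames(tab))
ss_extra <- sum(tab[int_rows, "Sum Sq"])
df_extra <- sum(tab[int_rows, "Df"])
mse_full <- tab["Residuals", "Mean Sq"]

F_stat  <- (ss_extra / df_extra) / mse_full
p_value <- pf(F_stat, df_extra, df.residual(full), lower.tail = FALSE)

# Direct nested-model comparison, for checking
cmp <- anova(reduced, full)

print(c(F = F_stat, df1 = df_extra, df2 = df.residual(full), p = p_value))
print(cmp)
```

Pooling works here only because the terms being tested are the last ones in the formula. For terms entered earlier, the sequential sums of squares are conditioned on a different set of predictors.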