Semester 2, 2025
STAT7038 Regression Modelling
Assignment 2
Due date: 17:00 (AEDT) on 17 October 2025
Research School of Finance, Actuarial Studies and Statistics
INSTRUCTIONS:
• This assignment is worth 15% of your overall marks for this course.
• You must write up your solutions to this assignment by yourself. If you copy someone else’s
work or allow your work to be copied, you will receive a mark of zero for the assignment and
risk very severe academic consequences.
• Your report must be generated to PDF format using rmarkdown. Failure to do so will result in
a penalty.
• Please submit your assignment on Canvas. When uploading to Canvas you must submit the
following, combined into a SINGLE pdf document:
1. The assignment cover sheet (available to download from Canvas).
2. Your assignment/report (no more than 15 pages).
3. An appendix including the R code you used (no page limit). Failure to upload the R
code will result in a penalty.
4. Please name your submission “Course code Uid”, e.g., “STAT7038 U1234567”.
• Your assignment may include some carefully edited R output (e.g. graphs, tables)
showing the results of your data analysis and a discussion of these results, as well
as some carefully selected code. Please be selective about what you present and only
include as many pages and as much R output as necessary to justify your solution. Clearly
label each part of your report with the part of the question that it refers to.
• Unless otherwise advised, use a significance level of 5% and round numeric answers to 4
decimal places.
• Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly
be deducted if the total report is of an unreasonable length, i.e. more than 15 pages including
graphs and tables. You may include an appendix that is in addition to the above page limits;
however the appendix will not be assessed. It will only be used if there is some question
about what you have actually done.
• Late submissions will NOT be accepted. Extensions will usually be granted on medical
or compassionate grounds on production of appropriate evidence, but require the lecturer’s
permission at least 24 hours before the deadline.
• Standard ANU policies for academic integrity apply to this assignment, both for the report
and the R-code.
Question 1 (65 marks)
A group of researchers in the US examined the pollution-related factors affecting
mortality. Thirty US cities were sampled. Total age-adjusted mortality from all causes
(mortality), in deaths per 100,000 population, was measured, along with the following
covariates: mean annual precipitation in inches (precipitation); median number of school
years completed for persons aged 25 years or older (education); percentage of the population
that is non-white (nonwhite); relative pollution potential of oxides of nitrogen (nox); and
relative pollution potential of sulphur dioxide (so2). “Relative pollution potential” is the
product of tons emitted per day per square kilometre and a factor correcting for the city
dimension and exposure. The data are available in a .csv file, pollution.
(a) [5 marks] Conduct an Exploratory Data Analysis (EDA) on the numerical variables.
In your analysis, assess whether each of the numerical covariates is associated
with the response variable. In your answer, also raise any potential problem(s)
you may encounter in fitting the regression model.
(b) [9 marks] Fit a multiple linear regression (MLR) model with Mortality as the response
variable and all other covariates as predictors. Is the regression model significant?
Comment on the t-test results in the summary output. Do they contradict the
F-test result? Why or why not? Conduct a diagnostic check for this particular problem
with the fitted model both qualitatively and quantitatively. What should be done to
solve this problem? (Hint: In partially answering this question, you may refer back to
part (a).)
(c) [11 marks] What are the estimated coefficients of the (MLR) model in part (b) and
the confidence intervals for each of these slope coefficients at a joint confidence level
of 95%? Interpret the values of these estimated coefficients with regards to model
specification.
(d) [10 marks] Fit only ONE multiple linear regression (MLR) model with Mortality
as the response variable and all other covariates as predictors. Please make sure this
model allows you to conduct the following nested tests of hypotheses.
H0 : β_precipitation = β_so2 = β_education = β_nox = β_nonwhite = 0
H0 : β_precipitation = β_so2 = β_education = β_nox = 0
H0 : β_precipitation = β_so2 = β_education = 0
H0 : β_precipitation = β_so2 = 0
H0 : β_precipitation = 0
Fully write out the tests, including the four steps in testing each set of hypotheses.
(e) [8 marks] Construct an appropriate test of the hypothesis that so2 and nox have the
same impact on mortality. That is, test β_so2 = β_nox.
(f) [7 marks] A researcher from this group mentioned that they have been using a model
with coefficients: β_precipitation = 2, β_education = −10, β_nonwhite = 3, β_nox = 0, and
β_so2 = 1. Can you test whether this existing model is consistent with the new model
you have fitted? Write down appropriate full and reduced models for carrying out such a
test. Perform the test and comment on the results.
(g) [10 marks] Using the model in (d), produce a plot of externally studentized residuals
against fitted values, a normal QQ plot, a leverage plot, a Cook’s distance plot and a
number of DFBETAs plots for all the slope coefficients in your model. Comment on
the model assumptions and unusual points. What are the characteristics of the cities
identified as unusual data points? Please do not drop any variables even if some
of the βs are found not to differ significantly from 0.
(h) [5 marks] One of the researchers is from the city of San Antonio and has recorded a new
set of measurements on each of the predictors. The precipitation is 33, education is
11.5, nonwhite is 17.2, and nox and so2 are each 1. What do you predict the mortality
rate to be? Find a 99% prediction interval for this value.
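Several parts of Question 1 rest on comparing a reduced model that is nested within a full model. The arithmetic of that comparison can be sketched as below. This is a minimal illustration in plain Python, not R and not a solution: the function name and every number are hypothetical and are not taken from the pollution data. The resulting statistic would be compared against an F distribution with (df_reduced − df_full, df_full) degrees of freedom.

```python
def partial_f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """Partial F statistic for a reduced model nested within a full model.

    rss_*: residual sums of squares of the two fitted models.
    df_*:  residual degrees of freedom of the two fitted models.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual df than the full model")
    if rss_full <= 0:
        raise ValueError("the full-model residual sum of squares must be positive")
    # Extra sum of squares explained by the additional terms, per constraint tested
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    # Error mean square of the full model
    denominator = rss_full / df_full
    return numerator / denominator


# Hypothetical values for illustration only (not from the assignment data)
f_value = partial_f_statistic(rss_reduced=120.0, rss_full=100.0,
                              df_reduced=26, df_full=24)
print(round(f_value, 4))
```

The report itself must still be produced in R with rmarkdown, as stated in the instructions.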
Question 2 (35 marks)
We have data about salary and other characteristics of all faculty in a small Midwestern
college in the U.S. collected in the early 1980s for presentation in legal proceedings for which
discrimination against women in salary was at issue. All persons in the data hold tenured
or tenure-track positions; temporary faculty are not included. The data are available in the
file salary.csv on Wattle. The data were collected from personnel files and consist of the
quantities described as follows:
• degree: Factor with levels “PhD” or “Masters”;
• sex: Factor, “Female” or “Male”;
• rank: Factor, “Asst”, “Assoc” or “Prof”;
• year: Years in current rank;
• salary: Dollars per year.
(a) [5 marks] Use a formal test to choose between the following two models: one with year,
rank, degree and sex as predictors; and one with rank and year as predictors.
(b) [10 marks] With the chosen model in part (a), compute the pairwise differences in
mean salary between the different ranks, and construct a 99% confidence
interval for each of these pairwise differences.
(c) [10 marks] We are interested in whether there is a steady increase in salary in academia
when a person’s position rises. With the chosen model in part (a), construct a formal
test of whether the increase in salary when a person is promoted from “Assistant
professor” to “Associate professor” and the increase in salary when a person is promoted
from “Associate professor” to “Professor” are the same.
(d) [10 marks] With the chosen model in part (a), consider adding the interaction term
between rank and year. Generate a scatter plot of salary against year and use different
colors for different rank levels. Add a fitted line for each rank level in a matching color.
(You should hold any other variables in the model at a given value.) Comment on
whether the plot shows a visible interaction. Test whether the interaction is significant.
Interpret the result.
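Comparisons between rank levels, as in parts (b) and (c), can be written as a linear combination of regression coefficients. The following is a minimal sketch of how an estimate, standard error and t statistic for such a combination are computed. It is plain Python for illustration only, not R and not a solution: the function name, the coefficient values and the covariance matrix are all hypothetical and are not taken from the salary data.

```python
import math


def contrast_t_statistic(coefs, cov, weights):
    """Estimate, standard error and t statistic for a linear combination of coefficients.

    coefs:   estimated coefficients (list of floats)
    cov:     estimated covariance matrix of the coefficients (list of lists)
    weights: contrast weights c, so that the quantity tested is sum(c_i * beta_i)
    """
    n = len(coefs)
    if len(weights) != n or len(cov) != n or any(len(row) != n for row in cov):
        raise ValueError("coefs, cov and weights must have matching dimensions")
    # Point estimate c' beta_hat
    estimate = sum(w * b for w, b in zip(weights, coefs))
    # Variance c' Sigma c
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    std_error = math.sqrt(variance)
    return estimate, std_error, estimate / std_error


# Hypothetical two-coefficient example: test beta_1 - beta_2 = 0
est, se, t = contrast_t_statistic(
    coefs=[1.0, 3.0],
    cov=[[1.0, 0.5], [0.5, 2.0]],
    weights=[1.0, -1.0],
)
print(round(est, 4), round(se, 4), round(t, 4))
```

The t statistic is referred to a t distribution with the residual degrees of freedom of the fitted model. A confidence interval for the combination is the estimate plus or minus the appropriate t quantile times the standard error.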