Semester 2, 2025
STAT7038 Regression Modelling
Assignment 2
Due date: 17:00 (AEDT) on 17 October 2025
Research School of Finance, Actuarial Studies and Statistics

INSTRUCTIONS:
• This assignment is worth 15% of your overall marks for this course.
• You must write up your solutions to this assignment by yourself. If you copy someone else’s work or allow your work to be copied, you will receive a mark of zero for the assignment and risk very severe academic consequences.
• Your report must be generated as a PDF using rmarkdown. Failure to do so will result in a penalty.
• Please submit your assignment on Canvas. When uploading to Canvas you must submit the following, combined into a SINGLE pdf document:
  1. The assignment cover sheet (available to download from Canvas).
  2. Your assignment/report (no more than 15 pages).
  3. An appendix including the R code you used (no page limit). Failure to upload the R code will result in a penalty.
  4. Please name your submission “Course code Uid”, e.g., “STAT7038 U1234567”.
• Your assignment may include some carefully edited R output (e.g. graphs, tables) showing the results of your data analysis and a discussion of these results, as well as some carefully selected code. Please be selective about what you present and only include as many pages and as much R output as necessary to justify your solution. Clearly label each part of your report with the part of the question that it refers to.
• Unless otherwise advised, use a significance level of 5% and round numeric answers to 4 decimal places.
• Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 15 pages including graphs and tables. You may include an appendix that is in addition to the above page limits; however, the appendix will not be assessed. It will only be used if there is some question about what you have actually done.
• Late submissions will NOT be accepted. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must have the lecturer’s permission at least 24 hours before the deadline.
• Standard ANU policies for academic integrity apply to this assignment, both for the report and the R code.

Question 1 (65 marks)

A group of researchers in the US attempted to look at the pollution-related factors affecting mortality. Thirty US cities were sampled. Total age-adjusted mortality from all causes (mortality), in deaths per 100,000 population, was measured, along with the following covariates: mean annual precipitation in inches (precipitation); median number of school years completed for persons aged 25 years or older (education); percentage of population that is non-white (nonwhite); relative pollution potential of oxides of nitrogen (nox); and relative pollution potential of sulphur dioxide (so2). “Relative pollution potential” is the product of tons emitted per day per square kilometre and a factor correcting for the city dimension and exposure. The data are available in a .csv file, pollution.

(a) [5 marks] Conduct an Exploratory Data Analysis (EDA) on the numerical variables. In your analysis, assess whether each of the numerical covariates is associated with the response variable. In your answer, also raise any potential problem(s) you may have in fitting the regression model.

(b) [9 marks] Fit a multiple linear regression (MLR) model with mortality as the response variable and all other covariates as predictors. Is the regression model significant? Comment on the t-test results in the summary output. Do they contradict the F-test result? Why or why not? Conduct a diagnostic check for this particular problem with the fitted model, both qualitatively and quantitatively. What should be done to solve this problem?
(Hint: In partially answering this question, you may refer back to part (a).)

(c) [11 marks] What are the estimated coefficients of the MLR model in part (b), and what are the confidence intervals for each of the slope coefficients at a joint confidence level of 95%? Interpret the values of these estimated coefficients with regard to model specification.

(d) [10 marks] Fit only ONE multiple linear regression (MLR) model with mortality as the response variable and all other covariates as predictors. Please make sure this model allows you to conduct the following nested tests of hypotheses.

$H_0: \beta_{\text{precipitation}} = \beta_{\text{so2}} = \beta_{\text{education}} = \beta_{\text{nox}} = \beta_{\text{nonwhite}} = 0$
$H_0: \beta_{\text{precipitation}} = \beta_{\text{so2}} = \beta_{\text{education}} = \beta_{\text{nox}} = 0$
$H_0: \beta_{\text{precipitation}} = \beta_{\text{so2}} = \beta_{\text{education}} = 0$
$H_0: \beta_{\text{precipitation}} = \beta_{\text{so2}} = 0$
$H_0: \beta_{\text{precipitation}} = 0$

Fully write out the tests, including the four steps in testing each set of hypotheses.

(e) [8 marks] Construct an appropriate test of the hypothesis that so2 and nox have the same impact on mortality. That is, test $\beta_{\text{so2}} = \beta_{\text{nox}}$.

(f) [7 marks] A researcher from this group suggested that they have been using a model with coefficients $\beta_{\text{precipitation}} = 2$, $\beta_{\text{education}} = -10$, $\beta_{\text{nonwhite}} = 3$, $\beta_{\text{nox}} = 0$ and $\beta_{\text{so2}} = 1$. Can you test whether this existing model is consistent with the new model you have fit? Write down appropriate full and reduced models for carrying out such a test. Perform the test and comment on the results.

(g) [10 marks] Using the model in (d), produce a plot of externally studentized residuals against fitted values, a normal QQ plot, a leverage plot, a Cook’s distance plot and DFBETAS plots for all the slope coefficients in your model. Comment on the model assumptions and unusual points. What are the characteristics of the cities identified as unusual data points? Please do not drop any variables, even if some of the coefficients are not found to be significantly different from 0.
(h) [5 marks] One of the researchers is from the city of San Antonio and has recorded a new set of measurements on each of the predictors. The precipitation is 33, education is 11.5, nonwhite is 17.2, and nox and so2 are each 1. What do you predict the mortality rate to be? Find a 99% interval for this prediction.

Question 2 (35 marks)

We have data about salary and other characteristics of all faculty in a small Midwestern college in the U.S., collected in the early 1980s for presentation in legal proceedings in which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure-track positions; temporary faculty are not included. The data are available in the file salary.csv on Wattle. The data were collected from personnel files and consist of the following quantities:
• degree: Factor with levels “PhD” or “Masters”;
• sex: Factor, “Female” or “Male”;
• rank: Factor, “Asst”, “Assoc” or “Prof”;
• year: Years in current rank;
• salary: Dollars per year.

(a) [5 marks] Use a formal test to choose between two models: one with year, rank, degree and sex as predictors, and one with rank and year as predictors.

(b) [10 marks] With the chosen model in part (a), compute the pairwise differences in mean salary among the different ranks, and construct 99% confidence intervals for these pairwise differences.

(c) [10 marks] We are interested in whether there is a steady increase in salary in academia as a person’s position rises. With the chosen model in part (a), construct a formal test of whether the increase in salary when a person is promoted from “Assistant professor” to “Associate professor” and the increase in salary when a person is promoted from “Associate professor” to “Professor” are the same.

(d) [10 marks] With the chosen model in part (a), consider adding the interaction term between rank and year.
Generate a scatter plot of salary against year, using a different color for each rank level. Add a fitted line for each rank level in the matching color. (You should hold any other variables in the model at a fixed value.) Comment on whether the plot shows a visible interaction. Test whether the interaction is significant. Interpret the result.
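
The rank-by-year interaction check described above could be set up roughly as follows. This is only a sketch under stated assumptions: it assumes the model chosen in part (a) is salary ~ rank + year (which the question does not state), that salary.csv uses the column names listed in the data description, and the helper names test_interaction and plot_interaction are hypothetical. If the chosen model also contained degree and sex, those would need to be held at fixed values when drawing the fitted lines.

```r
# Sketch only. Assumptions (not confirmed by the assignment):
#   - the model chosen in part (a) is salary ~ rank + year
#   - the data frame has columns salary, year and rank
# The helper names below are hypothetical.

# Fit the additive and interaction models and compare them with a partial F-test.
test_interaction <- function(dat) {
  dat$rank <- factor(dat$rank)
  additive    <- lm(salary ~ rank + year, data = dat)
  interaction <- lm(salary ~ rank * year, data = dat)
  list(additive    = additive,
       interaction = interaction,
       ftest       = anova(additive, interaction))
}

# Scatter plot of salary against year, coloured by rank, with one fitted
# line per rank taken from the supplied (interaction) model.
plot_interaction <- function(dat, fit) {
  dat$rank <- factor(dat$rank)
  lev  <- levels(dat$rank)
  cols <- seq_along(lev) + 1   # one base-graphics colour per rank level
  plot(dat$year, dat$salary, col = cols[as.integer(dat$rank)], pch = 19,
       xlab = "Years in current rank", ylab = "Salary (dollars per year)")
  for (i in seq_along(lev)) {
    sub  <- dat[dat$rank == lev[i], ]
    # Two end points are enough to draw a straight fitted line.
    grid <- data.frame(year = range(sub$year),
                       rank = factor(lev[i], levels = lev))
    lines(grid$year, predict(fit, newdata = grid), col = cols[i], lwd = 2)
  }
  legend("topleft", legend = lev, col = cols, pch = 19, lty = 1)
  invisible(NULL)
}

# Run only if the data file is present in the working directory.
if (file.exists("salary.csv")) {
  salary_data <- read.csv("salary.csv")
  res <- test_interaction(salary_data)
  print(res$ftest)   # partial F-test for the rank:year terms
  plot_interaction(salary_data, res$interaction)
}
```

Clearly non-parallel fitted lines would suggest a visible interaction, and the second row of the anova table gives the F statistic and p-value for the interaction terms, to be compared with the 5% significance level stated in the instructions.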