Course Code:
Student Number:

RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS
REGRESSION MODELLING
STAT2008 / STAT4038 / STAT6038
Final Examination for Semester 1, 2020

Reading Time: 0 minutes
Writing Time: 240 minutes
Exam Conditions: Centrally Scheduled Examination
Permitted Materials: Any

Instructions:
• This examination paper comprises a total of 21 pages, and there is a separate file of R Output which has a total of 15 pages.
• You can write your answers on this question sheet or on blank sheets of paper. Please include the cover sheet as the first page when you submit.
• Please write your student number and course code in the space provided at the top of this page.
• There is one part (Q2c) which is only to be attempted by students enrolled in STAT4038 and STAT6038.
• There are 5 questions worth a total of 103 marks for STAT2008 students, excluding Q2c.
• There are 5 questions worth a total of 113 marks for students enrolled in STAT4038 and STAT6038.
• Statistical tables (generated using R) are provided on the Wattle site.
• Unless otherwise indicated, use a significance level (α) of 5% and 4 decimal places.

Question              1    2    3    4    5    Total
STAT2008/4038/6038    8   15   16   36   28     103
STAT4038/6038 only    0   10    0    0    0      10
Score:

Final Examination - Semester 1, 2020 Page 1 of 21

Question 1 [8 Marks]
You want to conduct some simulation studies. You have written out an algorithm as follows:
1. Generate Xi = i, i = 1, . . . , 10.
2. Generate Yi = 4 − 2Xi + εi, i = 1, . . . , 10, where the εi are i.i.d. and follow a normal distribution with mean 0 and variance 5.
3. Fit a simple linear regression model of Yi on Xi using least squares, Ŷ = b0 + b1X, and record the parameter estimates b0 and b1.
4. Construct a 99% confidence interval for the intercept parameter.
5. Repeat steps 2, 3 and 4 1000 times, so that you have 1000 b0 values and confidence intervals at hand.

(a) 6 marks What distribution do you expect best describes these 1000 b0 values?
You should state the shape, mean, and variance of this distribution.

(b) 2 marks How many of these 1000 confidence intervals do you expect to include the number 4?

Question 2 [15 Marks]
For STAT4038 and STAT6038 students: [25 Marks]
The Fnord Motor Company is about to embark on a major marketing campaign for their newly released vehicle, the Imposta II, under the slogan “Have more left in the tank after a ride in an Imposta II”. Before they can make such a statement, Fnord’s crack legal department suggested that the company might conduct a scientific study to support their claim. Fnord’s slick marketing team decided to put 30 Fnord Imposta IIs through their paces at Fnord’s top-secret test facilities in Oodnadatta. There, Fnord’s expert drivers were each given an Imposta II full of petrol (40 litres) and each car was driven for a different number of kilometres, denoted X, and on its return, the amount of petrol remaining in the tank, Y, was measured.

In this question, we will attempt to model the amount of petrol left in terms of the distance driven. The attached R printout gives details of the modelling exercise attempted. From the attached printout, answer the following questions:

(a) 3 marks State the true model corresponding to model1 and then write down the fitted model.

(b) 4 marks Find a 95% confidence interval for β0. Is the value 40 contained in your confidence interval? Explain why you might expect it to be.

(c) 10 marks [For 4038 and 6038 students only] The manager argued that all Imposta IIs are given a full tank of petrol so the intercept is known to be 40, thus the model should be Yi = 40 + β1Xi + εi. Using least squares, you find that the estimator for β1 in this model is b1 = (∑ XiYi − 40 ∑ Xi) / ∑ Xi². Find the variance of this estimator and then develop a formula for a 95% confidence interval for the mean response given X = x0 under this model.
You do not need to use the actual numbers, but make sure you specify the appropriate degrees of freedom used.

(d) 8 marks The manager claims that the new Imposta II is more energy efficient than the old model, the Imposta I. He asked the same drivers to also drive 30 old Imposta Is, each for a different distance W, and measured the amount of petrol remaining in the tank, denoted Z. He fitted a second model as model2. Assume the true error variances of model1 and model2 are the same. Conduct a test of the manager’s claim.

Question 3 [16 Marks]
You are given a response variable Y and 4 covariates X1, X2, X3 and X4. A number of different models are fitted in R. Use the printout to answer the following questions.

(a) 2 marks Write down the model fit as model1, making sure you write down the values estimated for the parameters in the model.

(b) 2 marks Write down the parameter estimates for the model fit as model2.

(c) 2 marks What proportion of the variability in Y is explained by the linear model in X1, X2, X3 and X4 fit as model2?

(d) 4 marks Test whether the linear model fit in model2 plausibly passes through the origin. Use a 5% test.

(e) 6 marks You want to see whether X1 and X4 are given equal weighting in the formula for Y and whether X2 and X3 are given equal weighting in the formula for Y. Test the hypothesis H0 : β1 = β4, β2 = β3. Use information contained in the printout to test this hypothesis at the 5% level.

Question 4 [36 Marks]
Drought, flood, super-cell storms... climate change is upon us! One of the indicators of the progress of climate change is the onset and nature of the so-called El Niño and La Niña effects used to describe ocean temperatures.
Scientists think that these effects are instrumental parts of our weather patterns, and that these effects dictate the prevalence of floods and drought in south-eastern Australia and other parts of the world.

The data for this question concerns the number of tropical storms and hurricanes in the Atlantic Basin between 1950 and 1997. Several variables were recorded: the year (Year); a record of whether the year was a cold, warm or neutral El Niño year (elnino); a record of whether West Africa was wet, dry or normal that year (wa); the number of tropical storms each year; the number of hurricanes each year; and a storm index (called NTC) which is a composite variable measuring the overall intensity of the hurricane season (an average of the number of tropical storms, the number of hurricanes, the number of days of tropical storms, the number of days of hurricanes, the total number of intense hurricanes, and the number of days they last).

In this question, we will focus on the relationship between the storm index NTC and its relationship with time and the variables elnino and wa.

(a) 3 marks The data set contains 48 years’ worth of data. How many of the 48 years were cold El Niño years? Warm El Niño years? Neutral El Niño years? How many of the 48 years were wet years in West Africa? Dry years in West Africa? Neutral years in West Africa?

(b) 3 marks It has been decided that the variable wa2, corresponding to dry years in West Africa, should be removed from the model. One reason for this choice is that the variable is non-significant in the ANOVA table for model1. Explain first why this phenomenon is not, in itself, reason enough to automatically exclude the variable from consideration in selecting an appropriate model for the data. Then explain why, in this case, removing the variable from subsequent modelling is, in fact, a good idea.

For parts (d) onwards, you should regard model2 as the “full” model in any calculations you do.

(c) 4 marks A noted climatologist approaches you and says “I don’t buy this cold El Niño, wet West Africa nonsense!! What do you think? Hey, you know statistics, why don’t you do a test?” Construct a 5% test of the hypothesis that the coefficients of both the cold El Niño variable and the wet West Africa variable are zero. That is, test that βcold = βwet = 0, in the obvious notation for these coefficients.

(d) 5 marks Another noted climatologist approaches, this time from behind you. “I heard you were a statistician,” he declares. Before you can deny it, he continues “Can you test whether there is any El Niño effect at all?” Construct such a test at the 5% level.

(e) 6 marks After a neutral year for both El Niño and West African rainfall, the following year was a cold El Niño year for which it was wet in West Africa. Using model2 as your model, what would you predict to be the change in the storm index from the neutral year to the following year? Quantify your answer, and indicate how you would construct a 95% interval for the change. Note that you do not have to compute the interval, just indicate how you would construct it (for example, what information would you need and how would you use it). Make sure you specify the degrees of freedom used.

(f) 3 marks Based on the externally studentised residuals, test whether there are any outliers in the sample.

(g) 3 marks Consider the influence diagnostics given in the printout. Identify any potentially influential years and give your reasoning, in each case indicating the sense in which they might be expected to exert influence.
(Use the 50th percentile of an appropriate F distribution as the cut-off value for Cook’s distance.)

(h) 4 marks Use the plots in the printout to comment on the error assumptions. Is there anything about the data that you think may cause a problem with the usual regression assumptions?

(i) 5 marks Does the effect of El Niño on NTC change over time (controlling for wa1)? Describe how a test can be conducted to answer this question. You don’t need to use any actual numbers but you should specify the degrees of freedom.

Question 5 [28 Marks]
We live in a dirty, polluted world, much of the problem of our own making. But how deadly is all the pollution? A study in the US attempted to answer this question, also taking into account other factors affecting mortality. Sixty US cities were sampled. Total age-adjusted mortality from all causes, in deaths per 100,000 population, was measured, along with the following covariates: mean annual precipitation (in inches); median number of school years completed for persons aged 25 years or older; percentage of population that is non-white; relative pollution potential of oxides of nitrogen (NOX); and relative pollution potential of sulphur dioxide (SO2). “Relative pollution potential” is the product of tons emitted per day per square kilometre and a factor correcting for the city dimension and exposure.

The data is analysed in the attached printout, labelled “PRINTOUT FOR QUESTION 5”. Answer the following questions using information contained in the printout.

(a) 3 marks Write down the model fitted as model1 in the printout, making sure you use the estimated values of the coefficients and use the variable names. What is the estimated error variance? Is the regression significant?

(b) 5 marks The environmental scientist who prepared the printout forgot to produce standard errors for the estimated slope coefficients from model1. Find the required standard errors, carefully labelling them with the appropriate variable name.

(c) 3 marks Construct a test of the hypothesis that Education and NOX are not significant contributors to the model. Use α = 0.05.

(d) 3 marks “I am from San Antonio, in the great state of Texas,” drawled another environmental scientist. “Where I’m from, the precipitation is 33, education is 11.5, non-white is 17.2 and NOX and SO2 are each 1. What do you predict my city’s mortality rate (a percentage) to be?” Answer his question, and, in case he asks, find a 99% interval for this prediction.

(e) 2 marks Which observations do you suspect of being influential, and in what sense?

(f) 2 marks Are there multicollinearity problems with the dataset? Comment from both visual and quantitative aspects.

(g) 4 marks The environmental scientist suspected that the percentage of non-white population is not linearly related to mortality. What diagnostic plot should he check, and how should this plot be generated? How should this plot be used to check the scientist’s suspicion?

(h) 6 marks Results from best subset, forward and backward procedures for selecting an appropriate model for mortality are given in the printout. For best subset selection, write out the best model selected by each of adjusted R², Cp and SBC. Write down the models selected by forward selection and backward selection (using an entering and removing p-value of 0.05). Do the methods produce the same “best” model?

End of Examination
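
The simulation algorithm in Question 1 (steps 1 to 5) could be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions: the paper itself works in R, and the function names fit_simple_ols and simulate, the seed, and the hard-coded critical value 3.3554 (the 0.995 quantile of a t distribution with 8 degrees of freedom, as read from a t-table) are all hypothetical choices made here for illustration.

```python
import random
import statistics

# Assumed critical value: 0.995 quantile of t with n - 2 = 8 degrees of
# freedom, read from a t-table (gives a two-sided 99% interval).
T_CRIT_995_DF8 = 3.3554


def fit_simple_ols(x, y):
    """Least squares fit of y on x; returns (b0, b1, standard error of b0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # estimate of the error variance
    se_b0 = (mse * (1 / n + x_bar ** 2 / sxx)) ** 0.5
    return b0, b1, se_b0


def simulate(n_reps=1000, seed=2020):
    """Run steps 1 to 5; returns the b0 values and the 99% intervals."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    x = list(range(1, 11))     # step 1: Xi = i, i = 1, ..., 10
    sigma = 5 ** 0.5           # error variance 5, so sd = sqrt(5)
    b0_values, intervals = [], []
    for _ in range(n_reps):    # step 5: repeat steps 2 to 4
        # Step 2: Yi = 4 - 2 Xi + error
        y = [4 - 2 * xi + rng.gauss(0, sigma) for xi in x]
        # Step 3: least squares fit
        b0, _b1, se_b0 = fit_simple_ols(x, y)
        # Step 4: 99% confidence interval for the intercept
        half_width = T_CRIT_995_DF8 * se_b0
        b0_values.append(b0)
        intervals.append((b0 - half_width, b0 + half_width))
    return b0_values, intervals


b0_values, intervals = simulate()
covered = sum(lo <= 4 <= hi for lo, hi in intervals)
print("mean of b0:", statistics.mean(b0_values))
print("variance of b0:", statistics.variance(b0_values))
print("intervals containing 4:", covered)
```

Running the sketch prints the sample mean and variance of the 1000 b0 values and the number of intervals that contain 4, which can be compared with the theoretical answers to parts (a) and (b).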