Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 1 of 20

Research School of Finance, Actuarial Studies and Statistics

EXAMINATION

Semester 1 – Final, 2017

STAT2008 Regression Modelling

Examination/Writing Time Duration: 180 minutes

Reading Time: 15 minutes

Exam Conditions:

Central Examination. This examination paper is not available to the ANU Library archives.

Students must return the examination paper at the end of the examination.

Materials permitted in the exam venue: (No electronic aids are permitted, e.g. laptops, phones)

Unannotated paper-based dictionary (no approval required),

One A4 page with notes on both sides, Calculator

Materials to be supplied to Students:

Scribble Paper

Instructions to Students:

1. This examination paper comprises a total of twenty (20) pages and there is a separate handout of

R output which also has a total of twenty (20) pages. During the reading time preceding the exam,

please check that both documents have the correct number of pages.

2. All answers are to be written on this exam paper, which is to be handed in at the end of the exam.

You may make notes on scribble paper (or on the R handout) during the reading time, but

do NOT write on this exam paper until after the start of the writing time. If you need additional

space, use the rear of the previous page and clearly indicate the part of the question that your

answer refers to. The R handout and any scribble paper will be collected at the end of the

examination and destroyed; they will not be marked.

3. There are a total of four questions, which are worth 15 marks each, for a total of 60 marks.

The parts of each question are of unequal value, with the marks indicated for each part.

You should attempt to answer each and every part of all four questions. This examination

counts towards 60% of your final assessment.

4. Please write your student number in the space provided at the top of this page.

5. Include a clear statement of the formulae you use to answer each question.

6. Statistical tables (generated using R) are provided on pages 19 and 20 at the end of the handout of

R output. Unless otherwise indicated, use a significance level of 5% and note that log x refers to the

natural logarithm of x.

        Q1       Q2        Q3         Q4         Total

Pages   2 to 6   7 to 11   12 to 15   16 to 20

Marks   15       15        15         15         60

Score

Venue _________________________________________

STUDENT NUMBER U

Question 1 (15 marks)

The faraway library includes a data frame called cheddar, which contains data from a study of

cheddar cheese from the La Trobe Valley in Victoria. The concentration of Lactic acid, along

with the concentrations (on a log scale) of both Acetic acid and H2S (hydrogen sulphide) were

measured from 30 samples of cheese, which were then subjected to taste tests. Overall taste

scores were obtained by combining the scores from several tasters.

(a) A multiple regression model (cheddar.lm) has been fitted to these data and the summary

output from this model is given at the top of page 2 of the R output, but the analysis of

variance (ANOVA) table is not shown. Fill in the details of the ANOVA table in the

spaces shown below:

            Df    Sum Sq    Mean Sq    F value    Pr(>F)

H2S

Lactic

Residuals

(3 marks – 1 for each row of the ANOVA table)

[Hint: rounding errors will accumulate as you derive entries in this table from other

values shown in the R output, so do NOT round the results of intermediate

calculations. DO round all your final answers in the above table to 2 decimal places.

You may also have to use the statistical tables to estimate one or more of the

p-values, or you can receive the marks for showing appropriate critical values.]

Working
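
One way to see how a sequential ANOVA table can be rebuilt from summary( ) output alone is sketched below. This is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in (not the cheddar data), and the object names fit, s and rebuilt are hypothetical. The idea is that the residual sum of squares follows from the residual standard error, the total regression sum of squares follows from the overall F statistic, and the term entered last has a sequential F equal to its squared t statistic.

```r
# Stand-in example: built-in mtcars data, NOT the cheddar data from the exam.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

sigma2 <- s$sigma^2                 # residual mean square = (residual standard error)^2
df_res <- s$df[2]                   # residual degrees of freedom, n - p
ss_res <- sigma2 * df_res           # residual sum of squares

# Overall F = (SS_regression / k) / sigma2, so SS_regression = F * k * sigma2.
k      <- unname(s$fstatistic["numdf"])
ss_reg <- unname(s$fstatistic["value"]) * k * sigma2

# For the term entered LAST, the sequential F equals the squared t statistic.
t_last  <- s$coefficients["hp", "t value"]
ss_last <- t_last^2 * sigma2        # 1 df, so Sum Sq = Mean Sq

# The first term's sequential SS is whatever remains of the regression SS.
ss_first <- ss_reg - ss_last

rebuilt <- data.frame(
  Df        = c(1, 1, df_res),
  Sum.Sq    = c(ss_first, ss_last, ss_res),
  row.names = c("wt", "hp", "Residuals")
)
rebuilt$Mean.Sq <- rebuilt$Sum.Sq / rebuilt$Df
rebuilt$F.value <- c(rebuilt$Mean.Sq[1:2] / sigma2, NA)
rebuilt$P.value <- c(pf(rebuilt$F.value[1:2], 1, df_res, lower.tail = FALSE), NA)

print(round(rebuilt, 2))
print(anova(fit))                   # should agree with the rebuilt table
```

In the exam setting the same steps would be carried out by hand from the printed summary, with H2S as the first term and Lactic as the last.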

(b) Residual plots for the model in part (a) are shown on pages 2 and 3 of the R output.

Do these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 2?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 3?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 3?

If so describe the problem(s):

(c) For each of the following five diagnostic measures shown on page 4 of the R output,

calculate the relevant cut-off value suggested in the lecture notes and discuss whether or

not this cut-off is appropriate in this instance. Which observations, if any, exceed each

of the cut-off values?

(see the next page for more answer spaces for part (c) of Question 1)

The leverage or hat values (hii)

DFFITS

The externally studentised residuals (ti)

COVRATIO

DFBETAS

Given your answers above and considering the residual plots in part (b), are there

any observations that are vertical outliers and/or highly influential observations?

Should some observations be removed and the model re-fit to the remaining data?

(7 marks – 1 for each of the first 5 sections and 2 for the last summary section)
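
The five diagnostic measures in part (c) are usually compared against rule-of-thumb cut-offs that depend on the sample size n and the number of estimated coefficients p. The sketch below uses commonly quoted textbook values; whether these match the cut-offs in the course's lecture notes is an assumption, and the helper name cutoffs is hypothetical. The second half flags observations on a stand-in model fitted to the built-in mtcars data, not the cheddar data.

```r
# Rule-of-thumb cut-offs (commonly quoted textbook values); the course's own
# lecture-note values are an ASSUMPTION here and may differ.
cutoffs <- function(n, p) {
  list(
    leverage = 2 * p / n,                  # hat values h_ii
    dffits   = 2 * sqrt(p / n),            # |DFFITS|
    rstudent = qt(0.975, df = n - p - 1),  # |t_i|, externally studentised
    covratio = c(lower = 1 - 3 * p / n, upper = 1 + 3 * p / n),
    dfbetas  = 2 / sqrt(n)                 # |DFBETAS|
  )
}

# n = 30 cheese samples; p = 3 coefficients (intercept, H2S, Lactic).
co <- cutoffs(n = 30, p = 3)
str(co)

# Flagging observations on a stand-in model (built-in mtcars, not cheddar).
fit    <- lm(mpg ~ wt + hp, data = mtcars)
co_fit <- cutoffs(n = nrow(mtcars), p = length(coef(fit)))
cr     <- covratio(fit)
db     <- abs(dfbetas(fit))
flagged <- list(
  leverage = names(which(hatvalues(fit) > co_fit$leverage)),
  dffits   = names(which(abs(dffits(fit)) > co_fit$dffits)),
  rstudent = names(which(abs(rstudent(fit)) > co_fit$rstudent)),
  covratio = names(which(cr < co_fit$covratio[["lower"]] |
                         cr > co_fit$covratio[["upper"]])),
  dfbetas  = rownames(db)[apply(db > co_fit$dfbetas, 1, any)]
)
print(flagged)
```

These cut-offs are screening devices rather than formal tests, so with only 30 observations a value slightly beyond a cut-off is a prompt to look at the observation, not proof that it should be removed.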

(d) Output for a second model (cheddar.lm2) is shown on page 5 of the R output, which

includes an additional term added to the initial model described in the earlier parts of

this question. Is the term involving Acetic a significant addition to a model which

already includes H2S and Lactic? Give full details of an appropriate hypothesis test.

(3 marks)

Question 2 (15 marks)

The US Centers for Disease Control and Prevention (CDC) use data from the National Health

and Nutrition Examination Survey (NHANES) to develop a series of clinical growth charts

for assessing healthy growth ranges in boys and girls. The data frame kid.weights in the UsingR

library contains a sample of 250 observations taken from the NHANES data. The data frame

contains the age (in months), weight (in pounds) and height (in inches) for 129 girls (gender =

F) and 121 boys (gender = M), with age ranging from 3 months to 144 months (12 years).

(a) Page 6 of the R output shows code used to fit a series of models to these data. Residual

plots are given on page 7 for growth.lm3, the last of this series of models. Do these plots

suggest any problems with the underlying assumptions for model growth.lm3?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 7?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 7?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 7?

If so describe the problem(s):

(b) On page 8 of the R output, there is also some summary output for the model growth.lm3,

including a few residual diagnostics. Use this summary output and your answers to part

(a) to comment on the following issues:

(2 marks)

Observations 228, 9 and 158 were highlighted in some of the residual plots. Which

of the diagnostics on page 8 could you use to test if these observations are vertical

outliers? Are these observations really outliers or do they suggest some other

problem with the underlying assumptions?

Is growth.lm3 an appropriate model for the kid.weights data? If not, how might we

modify this model?

(c) In the summary(growth.lm3) output on page 8 of the R handout, most of the summary

statistics and the partial regression coefficient for the interaction term boy:height have

been removed and replaced by question marks. Calculate all five missing statistics.

[Show all necessary formulae and working and round your final answers to no more

than 3 significant figures, as rounding errors will accumulate.]

(5 marks)

Estimated coefficient for the boy:height term

The residual standard error and the corresponding degrees of freedom

Multiple R-squared

Adjusted R-squared

The F-statistic and the corresponding degrees of freedom
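
The relationships between the five statistics listed above can be checked numerically. The sketch below is an illustration only: it fits a stand-in model with an interaction term to the built-in mtcars data (growth.lm3 and kid.weights are not reproduced here), and all object names are hypothetical. It recomputes each statistic from the residual and total sums of squares and recovers a coefficient from its standard error and t value.

```r
# Stand-in model on built-in data; growth.lm3 itself is not reproduced here.
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
s   <- summary(fit)

n   <- nrow(mtcars)
p   <- length(coef(fit))          # coefficients including the intercept
k   <- p - 1                      # slope terms tested by the overall F
rss <- sum(residuals(fit)^2)      # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

rse    <- sqrt(rss / (n - p))                  # residual standard error, n - p df
r2     <- 1 - rss / tss                        # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)     # adjusted R-squared
f_stat <- ((tss - rss) / k) / (rss / (n - p))  # F on (k, n - p) df

# A coefficient can be recovered as (standard error) x (t value).
est_int <- s$coefficients["wt:hp", "Std. Error"] *
           s$coefficients["wt:hp", "t value"]

print(c(rse = rse, r2 = r2, adj_r2 = adj_r2, f = f_stat, wt_hp = est_int))
```

Which of these identities is usable in the exam depends on which quantities the handout leaves visible, so the order of calculation above is a suggestion rather than the required route.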

(d) The indicator variable boy is equal to 1 for each male observation and is 0 otherwise

(when the observation is a girl). This indicator variable was created at the end of page 6

of the R output and was included in the model growth.lm3.

(2 marks)

The model growth.lm2 is also shown on page 6 of the R output, but has been turned

into a comment, so that the output for this model is not shown. What does the model

growth.lm2 suggest is the form of the relationship between weight and the

explanatory variables included in that model? What would have been the effects of

adding the indicator variable boy to the model growth.lm2 as just an additive term

(i.e. not including any interaction terms)?

Now examine the way in which the indicator variable boy has actually been added to

the model growth.lm2 to create the model growth.lm3. What are the effects of this

approach on the form of the relationship between the variables? Does the summary

output for the model growth.lm3 on page 8 of the R output suggest that the weight

growth curves for boys and girls differ by an additive constant; or a multiplicative

constant; or that completely separate curves are required?

(e) In the summary(growth.lm3) output on page 8, some of the partial regression coefficients

do not have any “stars” next to their p-values. Does this mean the relevant terms should

be removed from the model? Discuss each of the terms that have no “stars” and explain

why that term should or should not be removed.

(3 marks)

(f) In the vif(growth.lm3) output on page 8, some of the variance inflation factors are

relatively large. Is this an issue that suggests some changes need to be made to the

model? Why or why not?

(1 mark)

Question 3 (15 marks)

When you attach the UsingR library (from the recommended Verzani text) in R, a number of

other libraries are also attached, the first of which is the MASS library, where MASS is short for

the title of the 2002 book by Bill Venables and Brian Ripley “Modern Applied Statistics with

S” (yet another text which has been recommended in this course in previous years).

The data frame cement in the MASS library contains information on the setting of thirteen

samples of cement in Portland, Oregon in the US. For each sample, the percentages of the

four main chemical ingredients were accurately measured (x1 = tricalcium aluminate,

x2 = tricalcium silicate, x3 = tetracalcium alumino ferrite, and x4 = dicalcium silicate). While

the cement samples were setting, the amount of heat evolved was also measured (this is the

response variable, y, measured in calories/g).

(a) Pages 9 and 10 of the R output show a scatterplot matrix, a correlation matrix and other

output for the cement data. Comment on the relationships between the variables and

possible implications for fitting a multiple linear regression model with y as the

response variable and including all four of the possible explanatory variables, x1 to x4.

(2 marks)

(b) Pages 10 and 11 of the R output present output for a model, cement_all.lm, which

includes all four of the explanatory variables and for another model, cement_all.lm2,

which has the same four explanatory variables, but in a different order. The anova( )

tables are shown for both models, but the output from plot( ), summary( ) and vif( ) is

only shown for the first model. How would the plot( ), summary( ) and vif( ) output differ

for the second model (as opposed to the output shown for the first model)?

(1 mark)

(c) Residual plots for the model (cement_all.lm) are shown on page 11 of the R output. Do

these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 11?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 11?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 11?

If so describe the problem(s):

(d) Compare the significance of the terms involving the explanatory variables in the

ANOVA table and summary output for model (cement_all.lm) and in the ANOVA table

for model (cement_all.lm2) presented on page 10 of the R output. Discuss the problem

suggested by these comparisons. Is there some other output that confirms this problem?

(2 marks)
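
The comparison in part (d) rests on the fact that sequential ANOVA tables depend on the order in which terms enter the model, while the fitted coefficients do not. A minimal sketch is given below, using the cement data from the MASS library. The ordering used for the second model is an assumption made for illustration (the handout's ordering is not reproduced here), and the variance inflation factors are computed by hand from auxiliary regressions rather than with a vif( ) function from an add-on package.

```r
library(MASS)   # provides the cement data frame

m1 <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
m2 <- lm(y ~ x4 + x3 + x2 + x1, data = cement)  # ASSUMED ordering, for illustration

# Sequential (Type I) sums of squares depend on the order of entry ...
print(anova(m1))
print(anova(m2))

# ... but the fitted coefficients (and hence summary output) do not.
same_coefs <- all.equal(coef(m1)[names(coef(m2))], coef(m2), tolerance = 1e-6)
print(same_coefs)

# Variance inflation factor for each predictor: 1 / (1 - R^2_j), where R^2_j
# comes from regressing predictor j on the other predictors.
predictors <- c("x1", "x2", "x3", "x4")
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = cement))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 1))
```

Large variance inflation factors alongside terms whose significance changes with the order of entry are the usual signature of multicollinearity among the explanatory variables.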

(e) Present full details of a nested F test to test whether or not the variables x2 and x3 are a

significant addition to a model that already includes x4 and x1.

(3 marks)
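
The mechanics of a nested F test like the one in part (e) can be sketched as follows. This is an illustration under the assumption that the reduced model contains x4 and x1 and the full model adds x2 and x3, as the question describes; the object names are hypothetical. The null hypothesis is that the coefficients of x2 and x3 are both zero, and the statistic compares the drop in residual sum of squares with the residual mean square of the full model.

```r
library(MASS)   # provides the cement data frame

reduced <- lm(y ~ x4 + x1, data = cement)
full    <- lm(y ~ x4 + x1 + x2 + x3, data = cement)

rss_r <- deviance(reduced); df_r <- df.residual(reduced)  # RSS and df, reduced
rss_f <- deviance(full);    df_f <- df.residual(full)     # RSS and df, full

# F = [(RSS_reduced - RSS_full) / q] / [RSS_full / df_full], q = terms added
q      <- df_r - df_f
f_stat <- ((rss_r - rss_f) / q) / (rss_f / df_f)
p_val  <- pf(f_stat, q, df_f, lower.tail = FALSE)
f_crit <- qf(0.95, q, df_f)       # 5% critical value

print(c(F = f_stat, crit = f_crit, p = p_val))
print(anova(reduced, full))       # built-in equivalent
```

In the exam itself the two residual sums of squares would be read from the sequential ANOVA table in which x4 and x1 enter first, and the critical value taken from the statistical tables.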

(f) Output for a modified model (cement.lm) is presented on page 12 of the handout of R

output. Use this output to discuss whether or not the modifications appear to have

solved the problem with the earlier models identified in part (d). What other output

should you check to assess the fit of the model (cement.lm)?

(2 marks)

(g) Find 95% confidence intervals for each of the partial regression coefficients in the

model (cement.lm). Interpret the values of these partial regression coefficients.

(3 marks)
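
A confidence interval for a partial regression coefficient has the form estimate ± t × standard error, with the t quantile taken on the residual degrees of freedom. The sketch below shows the calculation on the cement data. The model formula y ~ x1 + x2 is an assumption chosen for illustration only, since the definition of cement.lm appears in the R handout and is not reproduced here.

```r
library(MASS)   # provides the cement data frame

# ASSUMPTION: y ~ x1 + x2 is an illustrative model only; the exam's
# cement.lm is defined in the R handout, which is not reproduced here.
fit <- lm(y ~ x1 + x2, data = cement)
est <- coef(summary(fit))

t_crit <- qt(0.975, df = df.residual(fit))   # two-sided 95% multiplier
manual <- cbind(
  lower = est[, "Estimate"] - t_crit * est[, "Std. Error"],
  upper = est[, "Estimate"] + t_crit * est[, "Std. Error"]
)
print(manual)
print(confint(fit, level = 0.95))            # built-in equivalent
```

Each interval describes the change in expected heat evolved for a one percentage point increase in that ingredient with the other variables in the model held fixed.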

Question 4 (15 marks)

As discussed in Question 1 of Tutorial 4, the brains example from lectures was actually a

subset of data from a larger study, which was conducted to study the need for sleep in various

species of mammals. Data from the larger study are available in the data frame mammalsleep

in the faraway library, which includes the following variables: brain weight (g); body weight

(kg); gestation (days); lifespan (years); danger (a score which can be summarised as 1 if the

mammal is at a high level of danger from other animals when sleeping and 0 if the danger is

relatively low); and sleep (the total time spent sleeping per day in hours). In mammalsleep,

there are some missing values for sleep, lifespan and gestation, which leaves 51 species of

mammals for which we have typical values for all 6 variables.

(a) The process of extracting the data for modelling is shown on page 13 of the R output

and the final data for modelling are shown on pages 14 and 15. I have applied a natural

log (to the base e) transformation to all of the continuous variables. What is the purpose

of this log transformation and does it appear to be a sensible approach in this instance?

(1 mark)

(b) Page 16 of the R output shows the results of applying the step( ) function to suggest a

suitable multiple linear regression model for these data. Briefly describe the process of

model refinement that has been applied here.

(1 mark)

(c) Residual plots for the model suggested by the step( ) function are shown on page 17 of

the R output. Do these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 17?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 17?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 17?

If so describe the problem(s):

(d) Which mammals have been identified in each of the residual plots in part (c)? Find the

species name for the relevant observations in the listing of the data on page 15 of the R

output. Discuss any potential problems with these observations.

(3 marks)

(e) Which is the only explanatory variable that has not been included in the suggested

model (msleep.lm)? Looking back to the scatterplot and correlation matrices on page 14

of the R output, can you suggest a reason why this variable was excluded?

(1 mark)

(f) Page 18 of the R output shows some summary output for the model (msleep.lm). What

do the signs of each of the partial regression coefficients suggest about the expected

amount of time spent sleeping?

(3 marks)

(g) The correlation between log_sleep and log_lifespan was negative, so why does

log_lifespan have a positive partial regression coefficient in the suggested model?

Is log_lifespan a significant addition to a model that already includes log_gestation,

danger and log_body?

(2 marks)

(h) Under the suggested model, what is the expected difference in the daily hours spent

sleeping, between mammals that are in danger and those that are relatively safe? Find a

95% confidence interval for this difference.

(2 marks)

END OF EXAMINATION

学霸联盟

Page 1 of 20

Research School of Finance, Actuarial Studies and Statistics

EXAMINATION

Semester 1 – Final, 2017

STAT2008 Regression Modelling

Examination/Writing Time Duration: 180 minutes

Reading Time: 15 minutes

Exam Conditions:

Central Examination. This examination paper is not available to the ANU Library archives.

Students must return the examination paper at the end of the examination.

Materials permitted in the exam venue: (No electronic aids are permitted e.g. laptops, phones)

Unannotated paper-based dictionary (no approval required),

One A4 page with notes on both side, Calculator

Materials to be supplied to Students:

Scribble Paper

Instructions to Students:

1. This examination paper comprises a total of twenty (20) pages and there is a separate handout of

R output which also has a total of twenty (20) pages. During the reading time preceding the exam,

please check that both documents have the correct number of pages.

2. All answers are to be written on this exam paper, which is to be handed in at the end of the exam.

You may make notes on scribble paper (or on the R handout) during the reading time, but

do NOT write on this exam paper until after the start of the writing time. If you need additional

space, use the rear of the previous page and clearly indicate the part of the question that your

answer refers to. The R handout and any scribble paper will be collected at the end of the

examination and destroyed, they will not be marked.

3. There are a total of four questions, which are worth 15 marks each, for a total of 60 marks.

The parts of each question are of unequal value, with the marks indicated for each part.

You should attempt to answer each and every part of all four questions. This examination

counts towards 60% of your final assessment.

4. Please write your student number in the space provided at the top of this page.

5. Include a clear statement of the formulae you use to answer each question.

6. Statistical tables (generated using R) are provided on pages 19 and 20 at the end of the handout of

R output. Unless otherwise indicated, use a significance level of 5% and note that log x refers to the

natural logarithm of x.

Q1 Q2 Q3 Q4 Total

Pages 2 to 6 7 to 11 12 to 15 16 to 20

Marks 15 15 15 15 60

Score

Venue _________________________________________

STUDENT

NUMBER U

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 2 of 20

Question 1 (15 marks)

The faraway library includes a data frame called cheddar, which contains data from a study of

cheddar cheese from the La Trobe Valley in Victoria. The concentration of Lactic acid, along

with the concentrations (on a log scale) of both Acetic acid and H2S (hydrogen sulphide) were

measured from 30 samples of cheese, which were then subjected to taste tests. Overall taste

scores were obtained by combining the scores from several tasters.

(a) A multiple regression model (cheddar.lm) has been fitted to these data and the summary

output from this model is given at the top of page 2 of the R output, but the analysis of

variance (ANOVA) table is not shown. Fill in the details of the ANOVA table in the

spaces shown below:

Df Sum Sq Mean Sq F value Pr(>F)

H2S

Lactic

Residuals

(3 marks – 1 for each row of the ANOVA table)

[Hint: rounding errors will accumulate as you derive entries in this table from other

values shown in the R output, so do NOT round the results of intermediate

calculations. DO round all your final answers in the above table to 2 decimal places.

You may also have to use the statistical tables to estimate one or more of the

p-values, or you can receive the marks for showing appropriate critical values.]

Working

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 3 of 20

Question 1 continued

(b) Residual plots for the model in part (a) are shown on pages 2 and 3 of the R output.

Do these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 2?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 3?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 3?

If so describe the problem(s):

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 4 of 20

Question 1 continued

(c) For each of the following five diagnostic measures shown on page 4 of the R output,

calculate the relevant cut-off value suggested in the lecture notes and discuss whether or

not this cut-off is appropriate in this instance. Which observations, if any, exceed each

of the cut-off values?

(see the next page for more answer spaces for part (c) of Question 1)

The leverage or hat values (hii)

DFFITS

The externally studentised residuals (ti)

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 5 of 20

Question 1, part (c) continued

(7 marks – 1 for each of the first 5 sections and 2 for the last summary section)

Given your answers above and considering the residual plots in part (b), are there

any observations that are vertical outliers and/or highly influential observations?

Should some observations be removed and the model re-fit to the remaining data?

COVRATIO

DFBETAS

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 6 of 20

Question 1 continued

(d) Output for a second model (cheddar.lm2) is shown on page 5 of the R output, which

includes an additional term added to the initial model described in the earlier parts of

this question. Is the term involving Acetic a significant addition to a model which

already includes H2S and Lactic? Give full details of an appropriate hypothesis test.

(3 marks)

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 7 of 20

Question 2 (15 marks)

The US Centers for Disease Control and Prevention (CDC) use data from the National Health

and Nutrition Examination Survey (NHANES) to develop a series of clinical growth charts

for assessing healthy growth ranges in boys and girls. The data frame kid.weights in the UsingR

library contains a sample of 250 observations taken from the NHANES data. The data frame

contains the age (in months), weight (in pounds) and height (in inches) for 129 girls (gender =

F) and 121 boys (gender = M), with age ranging from 3 months to 144 months (12 years).

(a) Page 6 of the R output shows code used to fit a series of models to these data. Residual

plots are given on page 7 for growth.lm3, the last of this series of models. Do these plots

suggest any problems with the underlying assumptions for model growth.lm3?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 7?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 7?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 7?

If so describe the problem(s):

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 8 of 20

Question 2 continued

(b) On page 8 of the R output, there is also some summary output for the model growth.lm3,

including a few residual diagnostics. Use this summary output and your answers to part

(a) to comment on the following issues:

(2 marks)

Observations 228, 9 and 158 were highlighted in some of the residual plots. Which

of the diagnostics on page 8 could you use to test if these observations are vertical

outliers? Are these observations really outliers or do they suggest some other

problem with the underlying assumptions?

Is growth.lm3 an appropriate model for the kid.weights data? If not, how might we

modify this model?

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 9 of 20

Question 2 continued

(c) In the summary(growth.lm3) output on page 8 of the R handout, most of the summary

statistics and the partial regression coefficient for the interaction term boy:height have

been removed and replaced by question marks. Calculate all five missing statistics.

[Show all necessary formulae and working and round your final answers to no more

than 3 significant figures, as rounding errors will accumulate.]

(5 marks)

Estimated coefficient for the boy:height term

The residual standard error and the corresponding degrees of freedom

Multiple R-squared

Adjusted R-squared

The F-statistic and the corresponding degrees of freedom

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 10 of 20

Question 2 continued

(d) The indicator variable boy is equal to 1 for each male observation and is 0 otherwise

(when the observation is a girl). This indicator variable was created at the end of page 6

of the R output and was included in the model growth.lm3.

(2 marks)

The model growth.lm2 is also shown on page 6 of the R output, but has been turned

into a comment, so that the output for this model is not shown. What does the model

growth.lm2 suggest is the form of the relationship between weight and the

explanatory variables included in that model? What would have been the effects of

adding the indicator variable boy to the model growth.lm2 as just an additive term

(i.e. not including any interaction terms)?

Now examine the way in which the indicator variable boy has actually been added to

the model growth.lm2 to create the model growth.lm3. What are the effects of this

approach on the form of the relationship between the variables? Does the summary

output for the model growth.lm3 on page 8 of the R output suggest that the weight

growth curves for boys and girls differ by an additive constant; or a multiplicative

constant; or that completely separate curves are required?

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 11 of 20

Question 2 continued

(e) In the summary(growth.lm3) output on page 8, some of the partial regression coefficients

do not have any “stars” next to their p-values. Does this mean the relevant terms should

be removed from the model? Discuss each of the terms that have no “stars” and explain

why that term should or should not be removed.

(3 marks)

(f) In the vif(growth.lm3) output on page 8, some of the variance inflation factors are

relatively large. Is this an issue that suggests some changes need to be made to the

model? Why or why not?

(1 mark)

Final Examination, Semester 1, 2017 STAT2008 Regression Modelling

Page 12 of 20

Question 3 (15 marks)

When you attach the UsingR library (from the recommended Verzani text) in R, a number of

other libraries are also attached; the first of which is the MASS library, where MASS is short for

the title of the 2002 book by Bill Venables and Brian Ripley “Modern Applied Statistics with

S-PLUS” (yet another text which has been recommended in this course in previous years).

The data frame cement in the MASS library contains information on the setting of thirteen

samples of cement in Portland, Oregon in the US. For each sample, the percentages of the

four main chemical ingredients were accurately measured (x1 = tricalcium aluminate,

x2 = tricalcium silicate, x3 = tetracalcium alumina ferrate, and x4 = dicalcium silicate). While

the cement samples were setting, the amount of heat evolved was also measured (this is the

response variable, y, measured in calories/g).

(a) Pages 9 and 10 of the R output show a scatterplot matrix, a correlation matrix and other

output for the cement data. Comment on the relationships between the variables and

possible implications for fitting a multiple linear regression model with y as the

response variable and including all four of the possible explanatory variables, x1 to x4.

(2 marks)

(b) Pages 10 and 11 of the R output present output for a model, cement_all.lm, which

includes all four of the explanatory variables and for another model, cement_all.lm2,

which has the same four explanatory variables, but in a different order. The anova( )

tables are shown for both models, but the output from plot( ), summary( ) and vif( ) are

only shown for the first model. How would the plot( ), summary( ) and vif( ) output differ

for the second model (as opposed to the output shown for the first model)?

(1 mark)

(c) Residual plots for the model (cement_all.lm) are shown on page 11 of the R output. Do

these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 11?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 11?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 11?

If so describe the problem(s):

(d) Compare the significance of the terms involving the explanatory variables in the

ANOVA table and summary output for model (cement_all.lm) and in the ANOVA table

for model (cement_all.lm2) presented on page 10 of the R output. Discuss the problem

suggested by these comparisons. Is there some other output that confirms this problem?

(2 marks)

(e) Present full details of a nested F test to test whether or not the variables x2 and x3 are a

significant addition to a model that already includes x4 and x1.

(3 marks)
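
As a reminder of the general form of the test in part (e), the following is a sketch only, with no values taken from the R output. The notation is an assumption introduced here: SSE_R is the residual sum of squares for the reduced model (x4 and x1 only), SSE_F is the residual sum of squares for the full model (all four variables), and beta_2, beta_3 are the coefficients of x2 and x3. With n = 13 samples and five estimated coefficients in the full model, the full model has 13 - 5 = 8 residual degrees of freedom.

```latex
\[
  H_0 : \beta_2 = \beta_3 = 0
  \qquad \text{versus} \qquad
  H_A : \text{at least one of } \beta_2, \beta_3 \neq 0
\]
\[
  F \;=\; \frac{(SSE_R - SSE_F)/2}{SSE_F/8}
  \;\sim\; F_{2,\,8} \quad \text{under } H_0 .
\]
```

If a sequential anova( ) table has x2 and x3 entered last, after x4 and x1, the numerator sum of squares SSE_R - SSE_F is the sum of the sequential sums of squares for x2 and x3.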

(f) Output for a modified model (cement.lm) is presented on page 12 of the handout of R

output. Use this output to discuss whether or not the modifications appear to have

solved the problem with the earlier models identified in part (d). What other output

should you check to assess the fit of the model (cement.lm)?

(2 marks)

(g) Find 95% confidence intervals for each of the partial regression coefficients in the

model (cement.lm). Interpret the values of these partial regression coefficients.

(3 marks)
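
The intervals in part (g) take the usual t-based form, sketched below. The symbols are assumptions introduced here: b_j and se(b_j) are the estimate and standard error from the summary( ) output, n = 13 is the number of samples, and p is the number of estimated coefficients in cement.lm including the intercept (not stated in this paper, so it must be read from the output).

```latex
\[
  b_j \;\pm\; t_{\,n-p}(0.975)\,\mathrm{se}(b_j),
  \qquad n = 13 .
\]
```

Here t_{n-p}(0.975) is the 97.5th percentile of the t distribution on the residual degrees of freedom of the model.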

Question 4 (15 marks)

As discussed in Question 1 of Tutorial 4, the brains example from lectures was actually a

subset of data from a larger study, which was conducted to study the need for sleep in various

species of mammals. Data from the larger study are available in the data frame mammalsleep

in the faraway library, which includes the following variables: brain weight (g); body weight

(kg); gestation (days); lifespan (years); danger (a score which can be summarised as 1 if the

mammal is at a high level of danger from other animals when sleeping and 0 if the danger is

relatively low); and sleep (the total time spent sleeping per day in hours). In mammalsleep,

there are some missing values for sleep, lifespan and gestation, which leaves 51 species of

mammals for which we have typical values for all 6 variables.

(a) The process of extracting the data for modelling is shown on page 13 of the R output

and the final data for modelling are shown on pages 14 and 15. I have applied a natural

log (to the base e) transformation to all of the continuous variables. What is the purpose

of this log transformation and does it appear to be a sensible approach in this instance?

(1 mark)

(b) Page 16 of the R output shows the results of applying the step( ) function to suggest a

suitable multiple linear regression model for these data. Briefly describe the process of

model refinement that has been applied here.

(1 mark)

(c) Residual plots for the model suggested by the step( ) function are shown on page 17 of

the R output. Do these plots suggest any problems with the underlying assumptions?

What is your overall assessment? (select just ONE of the following options)

□ Residuals are not independent (obvious pattern)

□ Residuals do not have constant variance (heteroscedasticity)

□ Residuals are not normally distributed

□ There are possible outliers and/or influential observations

□ More than one of the above problems

□ No obvious problems

(2 marks – 0.5 for each section)

Are there any problem(s) shown on the “Residuals vs Fitted” plot on page 17?

If so describe the problem(s):

Are there any problem(s) shown on the “Normal Q-Q” plot on page 17?

If so describe the problem(s):

Are there any problem(s) shown on the “Cook’s distance” plot on page 17?

If so describe the problem(s):

(d) Which mammals have been identified in each of the residual plots in part (c)? Find the

species name for the relevant observations in the listing of the data on page 15 of the R

output. Discuss any potential problems with these observations.

(3 marks)

(e) Which is the only explanatory variable which has not been included in the suggested

model (msleep.lm)? Looking back to the scatterplot and correlation matrices on page 14

of the R output, can you suggest a reason why this variable was excluded?

(1 mark)

(f) Page 18 of the R output shows some summary output for the model (msleep.lm). What

do the signs of each of the partial regression coefficients suggest about the expected

amount of time spent sleeping?

(3 marks)

(g) The correlation between log_sleep and log_lifespan was negative, so why does

log_lifespan have a positive partial regression coefficient in the suggested model?

Is log_lifespan a significant addition to a model that already includes log_gestation,

danger and log_body?

(2 marks)

(h) Under the suggested model, what is the expected difference in the daily hours spent

sleeping, between mammals that are in danger and those that are relatively safe? Find a

95% confidence interval for this difference.

(2 marks)
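
A sketch of how the interval in part (h) might be set up, under assumptions that should be checked against the R output: the response in msleep.lm is log_sleep, danger enters as a 0/1 indicator with estimated coefficient b_d (a symbol introduced here), and the model contains an intercept plus the four terms named in part (g) with no interactions, giving 51 - 5 = 46 residual degrees of freedom.

```latex
\[
  \text{log scale:} \qquad
  b_d \;\pm\; t_{46}(0.975)\,\mathrm{se}(b_d)
\]
\[
  \text{original scale (ratio of typical daily hours of sleep):} \qquad
  \exp\!\bigl( b_d \pm t_{46}(0.975)\,\mathrm{se}(b_d) \bigr)
\]
```

Because the response is on a log scale under these assumptions, the additive difference b_d corresponds to a multiplicative factor exp(b_d) in hours of sleep, with the other variables held fixed.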

END OF EXAMINATION
