xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

R代写-MTHM506

时间：2021-01-16

MTHM506 Statistical Data Modelling

Problem Sheet 1

(Covers Topics 1-2)

You should attempt all questions on this sheet. The questions constitute both summative (indicated by marks) and

formative assessment. Marks achieved in this assignment will contribute 25% of the final module mark. Solutions

are expected to be clearly explained, concise, well structured and well presented. Give R input commands

for each model fitted (e.g. ‘model <- glm(...)’). Do not display too much raw R output as part of your

solutions (e.g. don’t display the full output of ‘summary(model)’), but edit this down to the essentials. All

plots should have titles and appropriately labelled axes. Hand written solutions will be accepted, but a more

professional word processed submission is preferred (as is a mixture of the two).

Topic 1

1. In the lecture notes, there is an example relating to a study of the number of fish caught of differing sizes in a

net of a particular mesh size. We developed a binomial GLM for probability of escape from net with a single

predictor x, the fish length in cm. The model for the data y1, . . . , yn (fish escaping) is

Yi ∼ Bin(Ni, pi) Yi independent

log

(

pi

1− pi

)

= β0 + β1xi

where Ni (total fish number) has also been observed and xi is fish length. The notes contain details of the

relevant log-likelihood and code the R script topic1.R directly maximises the log-likelihood of this binomial

model using the data set fish using the R (optimiser) function nlm() and reports the maximum likelihood

estimates for β0 and β1 along with their standard errors and other model diagnostics.

(a) Extend the R code in order to fit the modified model:

yi ∼ Bin(ni, pi)

log

(

pi

1− pi

)

= β0 + β1xi + β2x

2

i

Report the maximum likelihood estimates for β0 , β1 and β2 along with their standard errors and the AIC

for the modified model.

(b) Calculate the log likelihood ratio statistic relevant to a formal comparison of difference in fit between

the original and the modified model. Perform a formal test of the hypothesis that there is no significant

difference in fit between the original and the modified model.

2. The module’s R workspace, contains a data frame nlmodel which involves data on a response variable y and

a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship. Suppose

for these data we wish to consider the model:

Yi ∼ N

(

θ1xi

θ2 + xi

, σ2

)

Yi independent.

(a) Why can’t this model be fitted as a linear (regression) model? (1)

(b) Write down the likelihood L(θ1, θ2, σ2;y,x) and the log-likelihood `(θ1, θ2, σ2;y,x). (2)

(c) Write an R function mylike() which evaluates −`(θ1, θ2, σ2;y,x) for any values of the three parameters.

(1)

(d) Use the R function nlm() in association with your function mylike() to numerically minimise the log-

likelihood. Provide some evidence of how you chose sensible starting values. (3)

(e) Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated

mean relationship on a scatter plot of y versus x. (2)

(f) Report the standard errors for θ1 and θ2, and use those to construct 95% confidence intervals. (5)

1

(g) Test the hypothesis that θ2 = 0.08 at the 5% significance level (not using the confidence interval) and

compute the associated p-value of the test. (3)

(h) Use plug-in prediction to construct and plot 95% prediction intervals. (4)

(21)

3. The dataframe sexr contains a variable ratio which is the ration of male births to female births across various

countries. Interest lies in seeing how this relates to a deprivation measure, namely infant mortality tinmort, so

it was suggested to use a Gamma distribution (since the ratio is strictly positive) with a mean that is a function

of tinmort. Specifically:

Yi ∼ Gamma(µi, λ) Yi indep.

log(µi) = β0 + β1xi

where xi is the year and where the pdf of the Gamma is

p (yi) =

yα−1i e

−yi/λi

Γ (α)λαi

, yi > 0, (1)

which has mean E [Yi] = µi = αλi and variance var [Yi] = σ2i = αλ

2

i . Parameter α is called the shape

parameter, while λi is the scale parameter.

(a) Write down the likelihood and the log-likelihood of this model.

(b) Use R to estimate the unknown parameters and report the standard errors.

(c) Plot the estimated relationship.

(d) Test the hypothesis that β1 = 0.

(e) Predict the ratio for the infant mortality value of 0.4 and report a 95% prediction interval.

Topic 2

4. In the lecture slides for Topic 2, we argued that the Normal, Binomial and Poisson distribution all belong to the

exponential family

p (y; θ, φ) = exp

{

(yθ − b (θ))

a(φ)

+ c (y, φ)

}

.

Show that this is the case by using the fact that a mathematical function f(x) can be written as exp{log(f(x))}.

5. Consider the Gamma distribution

p (y) =

yα−1e−y/λ

Γ (α)λα

, y > 0,

which has mean E [Y ] = µ = αλ and variance var [Y ] = σ2 = αλ2.

(a) Rewrite the probability distribution p(y) in terms of the mean µ and parameter α. (1)

(b) Thus show that it belongs to the exponential family, identifying parameters θ and functions a(·), b (·) and

c (·) in terms of α and µ. (3)

(c) Verify the expressions for the mean and the variance of the distribution using the associated general

results for the exponential family. (3)

(7)

6. The module workspace contains a data frame dicentric which presents data from an experiment conducted

to determine the effect of gamma radiation on the numbers of chromosomal abnormalities ca observed. The

number of cells exposed (in hundreds) in each run, differs and is contained in the variable cells. The dose

amount doseamt and the rate doserate at which the dose is applied are the predictors of interest.

(a) Directly model the rate, i.e. number of abnormalities per cell by using a Poisson GLM with response ca,

with a log link, with doseamt and doserate as predictors and with a suitable offset involving the numbers

of cells (cells). Comment on the model summary - parameter estimates, standard errors, model fit and

residual plots.

2

(b) If your final model from (a) is substantially over-dispersed, then fit a quasi-Poisson model instead and

comment on the differences between that and the results from (a).

(c) Alternatively, fit a Negative Binomial model and comment on the differences between that and the results

from (a).

(d) It was suggested that there might be an interaction between dose rate and dose amount, on the basis

that the effect of dose rate is not independent of dose amount (and vice versa). Fit a Poisson GLM with

the interaction, and also non-linear effects if deemed necessary.

(e) Use the AIC to choose between the Negative Binomial model and the Poisson GLM with the interactions,

and discuss what this model tells us about the relationship of the covariates to the response.

7. In lectures, we have seen data on number of quarterly aids cases in the UK, yi, from January 1983 to March

1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as

xi. In this question we consider two competing models to describe the trend in the number of cases. Model 1

is

Yi ∼ Pois(λi) Yi independent

log(λi) = β0 + β1xi

and Model 2 is

Yi ∼ N(µi, σ2) Yi independent

log(µi) = γ0 + γ1xi

(a) Plot yi against xi and comment on whether the two proposed models are sensible. (3)

(b) Fit the two models in R, add the estimated trends from each model (λˆi and µˆi) on the plot from (7a) along

with approximate 95% confidence intervals on the mean and comment on the validity of each model

(based on the plot). For the confidence intervals you can assume that the sampling distribution of λˆi and

µˆi is approximately Gaussian with standard errors obtained from the R function

predict(...,type=‘‘response’’,se.fit=T). Obtain the AIC for each model and thus comment on

which model is preferable according to this criterion. (5)

(c) Produce the deviance residuals vs fitted values (λˆi and µˆi) plot for each model, comment appropriately

and thus propose a way that the two models might be extended to improve the fit. (2)

(d) Implement the proposed extensions to each model, to arrive at a final version for each of them (justified

by appropriate hypothesis tests). (3)

(e) On the basis of the analogous plots as in (7b) and (7c) but also on arguments of model fit based on the

deviance and the AIC, comment on which (if any) of the two final models in (d) you would choose as the

best. Mention at least one reason why either model is not ideal. (9)

(f) Further extend the final Poisson model to a Negative Binomial model and comment on whether this model

is preferable to the other two, on the basis of all the criteria used for comparison so far. (3)

(25)

8. The dataframe titanic2 contains an aggregated version of the titanic survival data. The age covariate has

been aggregated to “child” and “adult”.

(a) Fit a Binomial model with just the “main effects” of age_group, pclass and gender, and interpret the

estimates.

(b) To see whether the effect of age is different for males and females, and also whether the effect of class is

different for males and females, fit the model with an interaction between age_group and gender as well

pclass and gender, and interpret the estimates.

9. The module workspace, contains a data frame titanic which relates to 1309 passengers on the last voyage

of the ocean liner ‘Titanic’. The response variable survived is a binary variable where the value 1 means the

passenger survived the sinking. The data frame also contains predictors relating to passenger class (1st, 2nd,

3rd), gender, age and the fare amount each passenger paid. Passenger names are also available (for interest,

rather than for modelling).

3

(a) Fit a Bernoulli GLM with logistic link of survived with age, pclass and gender are predictors, as well

as all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in

conjunction with the R function drop1(), and interpret the final model in terms of parameter estimates

and their significance. Perform relevant model checking analysis (model fit and residuals).

(13)

10. In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years

later, this data recorded in a follow-up study. Only women who are current smokers or who have never smoked

are included. Resulting data set comprises 28 obs on the following 4 variables: y is the observed count for

given combination, smoker is a factor with levels yes no, dead is a factor with levels yes/no, age is a factor with

age-group levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+. Interest here lies on the effects of age

and smoking on the probability of death.

(a) Investigate the association of age and smoking on the chance of dying, using Poisson log-linear models

for this contingency table. Clearly state any assumptions you are making about the sampling design.

11. In an archaeological study of animal husbandry, there is a particular interest in proportions of fragments of

animal bones of different types discovered in excavations in the south of England over four time eras. A data

set is collated from the excavation reports which consists of a three-way contingency table where cells contain

total numbers of bone fragments from all excavations cross-classfied by animal type (‘sheep’ or ‘cattle’ or

‘pig’), by the surface geology of the excavation sites (‘valley terrace’ or ’other’) and by the era (‘early saxon’

or ‘middle saxon’ or ‘late saxon’ or ‘high medieval’). These data are contained in the dataframe bones in the

module workspace. Interest lies in the association of era and geology, with the number of different bone types.

Explore significant associations in these data using Poisson log-linear models. In particular, identify an appro-

priate final preferred model which suitably fits the data and then report what this model suggests concerning

significant associations between the numbers of bones of different types and surface geology and ditto with

era. What assumptions are you making about the sampling design for the study in carrying out your analysis

and what are the modelling implications of those? (15)

12. The dataframe ToothGrowth, included in the base R installation, see ?ToothGrowth, contains information on

the growth of teeth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2

mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

(a) Fit an Exponential GLM to length of teeth with two covariates, with the default link function and check

whether the model fits.

(b) Alternatively, fit this as a Gamma distribution and interpret the effect of the covariates.

(c) Check the residual plots of the Gamma model, and suggest a way that these can be improved.

(d) Implement your suggestion and check whether this has remedied the problem.

(e) Interpret the effects of the covariates from your final model.

13. Consider the data frame gehan in the module R workspace. This involves a trial of 42 leukaemia patients

where some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed

as matched pairs, both withdrawn from the trial when either came out of remission (more info can be found by

typing ?gehan having first loaded the R package MASS).

Variable cens indicates that some observations are censored (cens=0). This means that the patient was

removed or dropped out from the trial before their remission time can be recorded. The only data available for

such patients is the time they were removed, i.e. that they didn’t come out of remission before this removal

time. When cens=0, variable time records the drop-out time rather than the remission time.

(a) Create a new dataframe by removing the censored observations and use the command

gehan$treat <- relevel(gehan$treat, "control")

to ensure that it is the “control” that gets buried in the intercept. Write down and fit an Exponential

GLM with an inverse link, with time as the response and treat as the covariate. Discuss the effects of

treatment, and check whether the model fits. (5)

4

(b) Discarding censored observations simplifies the situation, however this means throwing away potentially

useful information (even if partial). To fit a model with all the data we can instead alter the likelihood

contributions of the censored and non-censored data points:

L(λi; ti) = p(ti;λi) = λie

−λiti if no censoring

L(λi; ti) = Pr(t > ti;λi) = 1− Pr(t < ti;λi) = e−λiti if censoring

where p(ti;λi) is the Exponential probability density function for survival times ti, and Pr(t < ti;λi) is

the Exponential cumulative distribution function. The likelihood can then be written as:

L(λi; t1, . . . , tn) =

n∏

i=1

p(ti;λi)

ci [1− Pr(t < ti;λi)]1−ci

where ci = 1 for no censoring and ci = 0 for censoring. Show that for the exponential distribution

p(ti;λi) = λie

−λiti , the likelihood can be written as:

L(λi; t1, . . . , tn) =

n∏

i=1

(λiti)

cie−λiti

tcii

and note that this likelihood is equivalent to the Poisson likelihood up to a constant. One can therefore

fit this model using a Poisson model with mean λiti where the data are the ci. Write down this Poisson

model mathematically, such that λi = β0+β1xi where xi is the treatment variable. Fit this Poisson model

in R, check that it fits, and compare the treatment effect with the Exponential model on the ‘thinned’ data

from part (a). (Hint: to fit a model without an intercept, add a “−1” in the model formula of the glm() call.

Also, you will need to use the command gehan$treat <- relevel(gehan$treat, "6-MP") before

fitting the model, to revert the changes from part (a).) (14)

(19)

(100)

5

Problem Sheet 1

(Covers Topics 1-2)

You should attempt all questions on this sheet. The questions constitute both summative (indicated by marks) and

formative assessment. Marks achieved in this assignment will contribute 25% of the final module mark. Solutions

are expected to be clearly explained, concise, well structured and well presented. Give R input commands

for each model fitted (e.g. ‘model <- glm(...)’). Do not display too much raw R output as part of your

solutions (e.g. don’t display the full output of ‘summary(model)’), but edit this down to the essentials. All

plots should have titles and appropriately labelled axes. Hand written solutions will be accepted, but a more

professional word processed submission is preferred (as is a mixture of the two).

Topic 1

1. In the lecture notes, there is an example relating to a study of the number of fish caught of differing sizes in a

net of a particular mesh size. We developed a binomial GLM for probability of escape from net with a single

predictor x, the fish length in cm. The model for the data y1, . . . , yn (fish escaping) is

Yi ∼ Bin(Ni, pi) Yi independent

log

(

pi

1− pi

)

= β0 + β1xi

where Ni (total fish number) has also been observed and xi is fish length. The notes contain details of the

relevant log-likelihood and code the R script topic1.R directly maximises the log-likelihood of this binomial

model using the data set fish using the R (optimiser) function nlm() and reports the maximum likelihood

estimates for β0 and β1 along with their standard errors and other model diagnostics.

(a) Extend the R code in order to fit the modified model:

yi ∼ Bin(ni, pi)

log

(

pi

1− pi

)

= β0 + β1xi + β2x

2

i

Report the maximum likelihood estimates for β0 , β1 and β2 along with their standard errors and the AIC

for the modified model.

(b) Calculate the log likelihood ratio statistic relevant to a formal comparison of difference in fit between

the original and the modified model. Perform a formal test of the hypothesis that there is no significant

difference in fit between the original and the modified model.

2. The module’s R workspace, contains a data frame nlmodel which involves data on a response variable y and

a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship. Suppose

for these data we wish to consider the model:

Yi ∼ N

(

θ1xi

θ2 + xi

, σ2

)

Yi independent.

(a) Why can’t this model be fitted as a linear (regression) model? (1)

(b) Write down the likelihood L(θ1, θ2, σ2;y,x) and the log-likelihood `(θ1, θ2, σ2;y,x). (2)

(c) Write an R function mylike() which evaluates −`(θ1, θ2, σ2;y,x) for any values of the three parameters.

(1)

(d) Use the R function nlm() in association with your function mylike() to numerically minimise the log-

likelihood. Provide some evidence of how you chose sensible starting values. (3)

(e) Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated

mean relationship on a scatter plot of y versus x. (2)

(f) Report the standard errors for θ1 and θ2, and use those to construct 95% confidence intervals. (5)

1

(g) Test the hypothesis that θ2 = 0.08 at the 5% significance level (not using the confidence interval) and

compute the associated p-value of the test. (3)

(h) Use plug-in prediction to construct and plot 95% prediction intervals. (4)

(21)

3. The dataframe sexr contains a variable ratio which is the ration of male births to female births across various

countries. Interest lies in seeing how this relates to a deprivation measure, namely infant mortality tinmort, so

it was suggested to use a Gamma distribution (since the ratio is strictly positive) with a mean that is a function

of tinmort. Specifically:

Yi ∼ Gamma(µi, λ) Yi indep.

log(µi) = β0 + β1xi

where xi is the year and where the pdf of the Gamma is

p (yi) =

yα−1i e

−yi/λi

Γ (α)λαi

, yi > 0, (1)

which has mean E [Yi] = µi = αλi and variance var [Yi] = σ2i = αλ

2

i . Parameter α is called the shape

parameter, while λi is the scale parameter.

(a) Write down the likelihood and the log-likelihood of this model.

(b) Use R to estimate the unknown parameters and report the standard errors.

(c) Plot the estimated relationship.

(d) Test the hypothesis that β1 = 0.

(e) Predict the ratio for the infant mortality value of 0.4 and report a 95% prediction interval.

Topic 2

4. In the lecture slides for Topic 2, we argued that the Normal, Binomial and Poisson distribution all belong to the

exponential family

p (y; θ, φ) = exp

{

(yθ − b (θ))

a(φ)

+ c (y, φ)

}

.

Show that this is the case by using the fact that a mathematical function f(x) can be written as exp{log(f(x))}.

5. Consider the Gamma distribution

p (y) =

yα−1e−y/λ

Γ (α)λα

, y > 0,

which has mean E [Y ] = µ = αλ and variance var [Y ] = σ2 = αλ2.

(a) Rewrite the probability distribution p(y) in terms of the mean µ and parameter α. (1)

(b) Thus show that it belongs to the exponential family, identifying parameters θ and functions a(·), b (·) and

c (·) in terms of α and µ. (3)

(c) Verify the expressions for the mean and the variance of the distribution using the associated general

results for the exponential family. (3)

(7)

6. The module workspace contains a data frame dicentric which presents data from an experiment conducted

to determine the effect of gamma radiation on the numbers of chromosomal abnormalities ca observed. The

number of cells exposed (in hundreds) in each run, differs and is contained in the variable cells. The dose

amount doseamt and the rate doserate at which the dose is applied are the predictors of interest.

(a) Directly model the rate, i.e. number of abnormalities per cell by using a Poisson GLM with response ca,

with a log link, with doseamt and doserate as predictors and with a suitable offset involving the numbers

of cells (cells). Comment on the model summary - parameter estimates, standard errors, model fit and

residual plots.

2

(b) If your final model from (a) is substantially over-dispersed, then fit a quasi-Poisson model instead and

comment on the differences between that and the results from (a).

(c) Alternatively, fit a Negative Binomial model and comment on the differences between that and the results

from (a).

(d) It was suggested that there might be an interaction between dose rate and dose amount, on the basis

that the effect of dose rate is not independent of dose amount (and vice versa). Fit a Poisson GLM with

the interaction, and also non-linear effects if deemed necessary.

(e) Use the AIC to choose between the Negative Binomial model and the Poisson GLM with the interactions,

and discuss what this model tells us about the relationship of the covariates to the response.

7. In lectures, we have seen data on number of quarterly aids cases in the UK, yi, from January 1983 to March

1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as

xi. In this question we consider two competing models to describe the trend in the number of cases. Model 1

is

Yi ∼ Pois(λi) Yi independent

log(λi) = β0 + β1xi

and Model 2 is

Yi ∼ N(µi, σ2) Yi independent

log(µi) = γ0 + γ1xi

(a) Plot yi against xi and comment on whether the two proposed models are sensible. (3)

(b) Fit the two models in R, add the estimated trends from each model (λˆi and µˆi) on the plot from (7a) along

with approximate 95% confidence intervals on the mean and comment on the validity of each model

(based on the plot). For the confidence intervals you can assume that the sampling distribution of λˆi and

µˆi is approximately Gaussian with standard errors obtained from the R function

predict(...,type=‘‘response’’,se.fit=T). Obtain the AIC for each model and thus comment on

which model is preferable according to this criterion. (5)

(c) Produce the deviance residuals vs fitted values (λˆi and µˆi) plot for each model, comment appropriately

and thus propose a way that the two models might be extended to improve the fit. (2)

(d) Implement the proposed extensions to each model, to arrive at a final version for each of them (justified

by appropriate hypothesis tests). (3)

(e) On the basis of the analogous plots as in (7b) and (7c) but also on arguments of model fit based on the

deviance and the AIC, comment on which (if any) of the two final models in (d) you would choose as the

best. Mention at least one reason why either model is not ideal. (9)

(f) Further extend the final Poisson model to a Negative Binomial model and comment on whether this model

is preferable to the other two, on the basis of all the criteria used for comparison so far. (3)

(25)

8. The dataframe titanic2 contains an aggregated version of the titanic survival data. The age covariate has

been aggregated to “child” and “adult”.

(a) Fit a Binomial model with just the “main effects” of age_group, pclass and gender, and interpret the

estimates.

(b) To see whether the effect of age is different for males and females, and also whether the effect of class is

different for males and females, fit the model with an interaction between age_group and gender as well

pclass and gender, and interpret the estimates.

9. The module workspace, contains a data frame titanic which relates to 1309 passengers on the last voyage

of the ocean liner ‘Titanic’. The response variable survived is a binary variable where the value 1 means the

passenger survived the sinking. The data frame also contains predictors relating to passenger class (1st, 2nd,

3rd), gender, age and the fare amount each passenger paid. Passenger names are also available (for interest,

rather than for modelling).

3

(a) Fit a Bernoulli GLM with logistic link of survived with age, pclass and gender are predictors, as well

as all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in

conjunction with the R function drop1(), and interpret the final model in terms of parameter estimates

and their significance. Perform relevant model checking analysis (model fit and residuals).

(13)

10. In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years

later, this data recorded in a follow-up study. Only women who are current smokers or who have never smoked

are included. Resulting data set comprises 28 obs on the following 4 variables: y is the observed count for

given combination, smoker is a factor with levels yes no, dead is a factor with levels yes/no, age is a factor with

age-group levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+. Interest here lies on the effects of age

and smoking on the probability of death.

(a) Investigate the association of age and smoking on the chance of dying, using Poisson log-linear models

for this contingency table. Clearly state any assumptions you are making about the sampling design.

11. In an archaeological study of animal husbandry, there is a particular interest in proportions of fragments of

animal bones of different types discovered in excavations in the south of England over four time eras. A data

set is collated from the excavation reports which consists of a three-way contingency table where cells contain

total numbers of bone fragments from all excavations cross-classfied by animal type (‘sheep’ or ‘cattle’ or

‘pig’), by the surface geology of the excavation sites (‘valley terrace’ or ’other’) and by the era (‘early saxon’

or ‘middle saxon’ or ‘late saxon’ or ‘high medieval’). These data are contained in the dataframe bones in the

module workspace. Interest lies in the association of era and geology, with the number of different bone types.

Explore significant associations in these data using Poisson log-linear models. In particular, identify an appro-

priate final preferred model which suitably fits the data and then report what this model suggests concerning

significant associations between the numbers of bones of different types and surface geology and ditto with

era. What assumptions are you making about the sampling design for the study in carrying out your analysis

and what are the modelling implications of those? (15)

12. The dataframe ToothGrowth, included in the base R installation, see ?ToothGrowth, contains information on

the growth of teeth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2

mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

(a) Fit an Exponential GLM to length of teeth with two covariates, with the default link function and check

whether the model fits.

(b) Alternatively, fit this as a Gamma distribution and interpret the effect of the covariates.

(c) Check the residual plots of the Gamma model, and suggest a way that these can be improved.

(d) Implement your suggestion and check whether this has remedied the problem.

(e) Interpret the effects of the covariates from your final model.

13. Consider the data frame gehan in the module R workspace. This involves a trial of 42 leukaemia patients

where some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed

as matched pairs, both withdrawn from the trial when either came out of remission (more info can be found by

typing ?gehan having first loaded the R package MASS).

Variable cens indicates that some observations are censored (cens=0). This means that the patient was

removed or dropped out from the trial before their remission time can be recorded. The only data available for

such patients is the time they were removed, i.e. that they didn’t come out of remission before this removal

time. When cens=0, variable time records the drop-out time rather than the remission time.

(a) Create a new dataframe by removing the censored observations and use the command

gehan$treat <- relevel(gehan$treat, "control")

to ensure that it is the “control” that gets buried in the intercept. Write down and fit an Exponential

GLM with an inverse link, with time as the response and treat as the covariate. Discuss the effects of

treatment, and check whether the model fits. (5)

4

(b) Discarding censored observations simplifies the situation, however this means throwing away potentially

useful information (even if partial). To fit a model with all the data we can instead alter the likelihood

contributions of the censored and non-censored data points:

L(λi; ti) = p(ti;λi) = λie

−λiti if no censoring

L(λi; ti) = Pr(t > ti;λi) = 1− Pr(t < ti;λi) = e−λiti if censoring

where p(ti;λi) is the Exponential probability density function for survival times ti, and Pr(t < ti;λi) is

the Exponential cumulative distribution function. The likelihood can then be written as:

L(λi; t1, . . . , tn) =

n∏

i=1

p(ti;λi)

ci [1− Pr(t < ti;λi)]1−ci

where ci = 1 for no censoring and ci = 0 for censoring. Show that for the exponential distribution

p(ti;λi) = λie

−λiti , the likelihood can be written as:

L(λi; t1, . . . , tn) =

n∏

i=1

(λiti)

cie−λiti

tcii

and note that this likelihood is equivalent to the Poisson likelihood up to a constant. One can therefore

fit this model using a Poisson model with mean λiti where the data are the ci. Write down this Poisson

model mathematically, such that λi = β0+β1xi where xi is the treatment variable. Fit this Poisson model

in R, check that it fits, and compare the treatment effect with the Exponential model on the ‘thinned’ data

from part (a). (Hint: to fit a model without an intercept, add a “−1” in the model formula of the glm() call.

Also, you will need to use the command gehan$treat <- relevel(gehan$treat, "6-MP") before

fitting the model, to revert the changes from part (a).) (14)

(19)

(100)

5