MTHM506 Statistical Data Modelling
Problem Sheet 1
(Covers Topics 1-2)
You should attempt all questions on this sheet. The questions constitute both summative (indicated by marks) and
formative assessment. Marks achieved in this assignment will contribute 25% of the final module mark. Solutions
are expected to be clearly explained, concise, well structured and well presented. Give R input commands
for each model fitted (e.g. ‘model <- glm(...)’). Do not display too much raw R output as part of your
solutions (e.g. don’t display the full output of ‘summary(model)’), but edit this down to the essentials. All
plots should have titles and appropriately labelled axes. Hand written solutions will be accepted, but a more
professional word processed submission is preferred (as is a mixture of the two).
Topic 1
1. In the lecture notes, there is an example relating to a study of the number of fish caught of differing sizes in a
net of a particular mesh size. We developed a binomial GLM for probability of escape from net with a single
predictor x, the fish length in cm. The model for the data y1, . . . , yn (fish escaping) is
Yi ∼ Bin(Ni, pi) Yi independent
log
(
pi
1− pi
)
= β0 + β1xi
where Ni (total fish number) has also been observed and xi is fish length. The notes contain details of the
relevant log-likelihood and code the R script topic1.R directly maximises the log-likelihood of this binomial
model using the data set fish using the R (optimiser) function nlm() and reports the maximum likelihood
estimates for β0 and β1 along with their standard errors and other model diagnostics.
(a) Extend the R code in order to fit the modified model:
yi ∼ Bin(ni, pi)
log
(
pi
1− pi
)
= β0 + β1xi + β2x
2
i
Report the maximum likelihood estimates for β0 , β1 and β2 along with their standard errors and the AIC
for the modified model.
(b) Calculate the log likelihood ratio statistic relevant to a formal comparison of difference in fit between
the original and the modified model. Perform a formal test of the hypothesis that there is no significant
difference in fit between the original and the modified model.
2. The module’s R workspace, contains a data frame nlmodel which involves data on a response variable y and
a single explanatory variable x. A scatter plot of y versus x suggests a strong non-linear relationship. Suppose
for these data we wish to consider the model:
Yi ∼ N
(
θ1xi
θ2 + xi
, σ2
)
Yi independent.
(a) Why can’t this model be fitted as a linear (regression) model? (1)
(b) Write down the likelihood L(θ1, θ2, σ2;y,x) and the log-likelihood `(θ1, θ2, σ2;y,x). (2)
(c) Write an R function mylike() which evaluates −`(θ1, θ2, σ2;y,x) for any values of the three parameters.
(1)
(d) Use the R function nlm() in association with your function mylike() to numerically minimise the log-
likelihood. Provide some evidence of how you chose sensible starting values. (3)
(e) Report the maximum likelihood estimates of the parameters and superimpose a plot of the associated
mean relationship on a scatter plot of y versus x. (2)
(f) Report the standard errors for θ1 and θ2, and use those to construct 95% confidence intervals. (5)
1
(g) Test the hypothesis that θ2 = 0.08 at the 5% significance level (not using the confidence interval) and
compute the associated p-value of the test. (3)
(h) Use plug-in prediction to construct and plot 95% prediction intervals. (4)
(21)
3. The dataframe sexr contains a variable ratio which is the ration of male births to female births across various
countries. Interest lies in seeing how this relates to a deprivation measure, namely infant mortality tinmort, so
it was suggested to use a Gamma distribution (since the ratio is strictly positive) with a mean that is a function
of tinmort. Specifically:
Yi ∼ Gamma(µi, λ) Yi indep.
log(µi) = β0 + β1xi
where xi is the year and where the pdf of the Gamma is
p (yi) =
yα−1i e
−yi/λi
Γ (α)λαi
, yi > 0, (1)
which has mean E [Yi] = µi = αλi and variance var [Yi] = σ2i = αλ
2
i . Parameter α is called the shape
parameter, while λi is the scale parameter.
(a) Write down the likelihood and the log-likelihood of this model.
(b) Use R to estimate the unknown parameters and report the standard errors.
(c) Plot the estimated relationship.
(d) Test the hypothesis that β1 = 0.
(e) Predict the ratio for the infant mortality value of 0.4 and report a 95% prediction interval.
Topic 2
4. In the lecture slides for Topic 2, we argued that the Normal, Binomial and Poisson distribution all belong to the
exponential family
p (y; θ, φ) = exp
{
(yθ − b (θ))
a(φ)
+ c (y, φ)
}
.
Show that this is the case by using the fact that a mathematical function f(x) can be written as exp{log(f(x))}.
5. Consider the Gamma distribution
p (y) =
yα−1e−y/λ
Γ (α)λα
, y > 0,
which has mean E [Y ] = µ = αλ and variance var [Y ] = σ2 = αλ2.
(a) Rewrite the probability distribution p(y) in terms of the mean µ and parameter α. (1)
(b) Thus show that it belongs to the exponential family, identifying parameters θ and functions a(·), b (·) and
c (·) in terms of α and µ. (3)
(c) Verify the expressions for the mean and the variance of the distribution using the associated general
results for the exponential family. (3)
(7)
6. The module workspace contains a data frame dicentric which presents data from an experiment conducted
to determine the effect of gamma radiation on the numbers of chromosomal abnormalities ca observed. The
number of cells exposed (in hundreds) in each run, differs and is contained in the variable cells. The dose
amount doseamt and the rate doserate at which the dose is applied are the predictors of interest.
(a) Directly model the rate, i.e. number of abnormalities per cell by using a Poisson GLM with response ca,
with a log link, with doseamt and doserate as predictors and with a suitable offset involving the numbers
of cells (cells). Comment on the model summary - parameter estimates, standard errors, model fit and
residual plots.
2
(b) If your final model from (a) is substantially over-dispersed, then fit a quasi-Poisson model instead and
comment on the differences between that and the results from (a).
(c) Alternatively, fit a Negative Binomial model and comment on the differences between that and the results
from (a).
(d) It was suggested that there might be an interaction between dose rate and dose amount, on the basis
that the effect of dose rate is not independent of dose amount (and vice versa). Fit a Poisson GLM with
the interaction, and also non-linear effects if deemed necessary.
(e) Use the AIC to choose between the Negative Binomial model and the Poisson GLM with the interactions,
and discuss what this model tells us about the relationship of the covariates to the response.
7. In lectures, we have seen data on number of quarterly aids cases in the UK, yi, from January 1983 to March
1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as
xi. In this question we consider two competing models to describe the trend in the number of cases. Model 1
is
Yi ∼ Pois(λi) Yi independent
log(λi) = β0 + β1xi
and Model 2 is
Yi ∼ N(µi, σ2) Yi independent
log(µi) = γ0 + γ1xi
(a) Plot yi against xi and comment on whether the two proposed models are sensible. (3)
(b) Fit the two models in R, add the estimated trends from each model (λˆi and µˆi) on the plot from (7a) along
with approximate 95% confidence intervals on the mean and comment on the validity of each model
(based on the plot). For the confidence intervals you can assume that the sampling distribution of λˆi and
µˆi is approximately Gaussian with standard errors obtained from the R function
predict(...,type=‘‘response’’,se.fit=T). Obtain the AIC for each model and thus comment on
which model is preferable according to this criterion. (5)
(c) Produce the deviance residuals vs fitted values (λˆi and µˆi) plot for each model, comment appropriately
and thus propose a way that the two models might be extended to improve the fit. (2)
(d) Implement the proposed extensions to each model, to arrive at a final version for each of them (justified
by appropriate hypothesis tests). (3)
(e) On the basis of the analogous plots as in (7b) and (7c) but also on arguments of model fit based on the
deviance and the AIC, comment on which (if any) of the two final models in (d) you would choose as the
best. Mention at least one reason why either model is not ideal. (9)
(f) Further extend the final Poisson model to a Negative Binomial model and comment on whether this model
is preferable to the other two, on the basis of all the criteria used for comparison so far. (3)
(25)
8. The dataframe titanic2 contains an aggregated version of the titanic survival data. The age covariate has
been aggregated to “child” and “adult”.
(a) Fit a Binomial model with just the “main effects” of age_group, pclass and gender, and interpret the
estimates.
(b) To see whether the effect of age is different for males and females, and also whether the effect of class is
different for males and females, fit the model with an interaction between age_group and gender as well
pclass and gender, and interpret the estimates.
9. The module workspace, contains a data frame titanic which relates to 1309 passengers on the last voyage
of the ocean liner ‘Titanic’. The response variable survived is a binary variable where the value 1 means the
passenger survived the sinking. The data frame also contains predictors relating to passenger class (1st, 2nd,
3rd), gender, age and the fare amount each passenger paid. Passenger names are also available (for interest,
rather than for modelling).
3
(a) Fit a Bernoulli GLM with logistic link of survived with age, pclass and gender are predictors, as well
as all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in
conjunction with the R function drop1(), and interpret the final model in terms of parameter estimates
and their significance. Perform relevant model checking analysis (model fit and residuals).
(13)
10. In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years
later, this data recorded in a follow-up study. Only women who are current smokers or who have never smoked
are included. Resulting data set comprises 28 obs on the following 4 variables: y is the observed count for
given combination, smoker is a factor with levels yes no, dead is a factor with levels yes/no, age is a factor with
age-group levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+. Interest here lies on the effects of age
and smoking on the probability of death.
(a) Investigate the association of age and smoking on the chance of dying, using Poisson log-linear models
for this contingency table. Clearly state any assumptions you are making about the sampling design.
11. In an archaeological study of animal husbandry, there is a particular interest in proportions of fragments of
animal bones of different types discovered in excavations in the south of England over four time eras. A data
set is collated from the excavation reports which consists of a three-way contingency table where cells contain
total numbers of bone fragments from all excavations cross-classfied by animal type (‘sheep’ or ‘cattle’ or
‘pig’), by the surface geology of the excavation sites (‘valley terrace’ or ’other’) and by the era (‘early saxon’
or ‘middle saxon’ or ‘late saxon’ or ‘high medieval’). These data are contained in the dataframe bones in the
module workspace. Interest lies in the association of era and geology, with the number of different bone types.
Explore significant associations in these data using Poisson log-linear models. In particular, identify an appro-
priate final preferred model which suitably fits the data and then report what this model suggests concerning
significant associations between the numbers of bones of different types and surface geology and ditto with
era. What assumptions are you making about the sampling design for the study in carrying out your analysis
and what are the modelling implications of those? (15)
12. The dataframe ToothGrowth, included in the base R installation, see ?ToothGrowth, contains information on
the growth of teeth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2
mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
(a) Fit an Exponential GLM to length of teeth with two covariates, with the default link function and check
whether the model fits.
(b) Alternatively, fit this as a Gamma distribution and interpret the effect of the covariates.
(c) Check the residual plots of the Gamma model, and suggest a way that these can be improved.
(d) Implement your suggestion and check whether this has remedied the problem.
(e) Interpret the effects of the covariates from your final model.
13. Consider the data frame gehan in the module R workspace. This involves a trial of 42 leukaemia patients
where some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed
as matched pairs, both withdrawn from the trial when either came out of remission (more info can be found by
typing ?gehan having first loaded the R package MASS).
Variable cens indicates that some observations are censored (cens=0). This means that the patient was
removed or dropped out from the trial before their remission time can be recorded. The only data available for
such patients is the time they were removed, i.e. that they didn’t come out of remission before this removal
time. When cens=0, variable time records the drop-out time rather than the remission time.
(a) Create a new dataframe by removing the censored observations and use the command
gehan$treat <- relevel(gehan$treat, "control")
to ensure that it is the “control” that gets buried in the intercept. Write down and fit an Exponential
GLM with an inverse link, with time as the response and treat as the covariate. Discuss the effects of
treatment, and check whether the model fits. (5)
4
(b) Discarding censored observations simplifies the situation, however this means throwing away potentially
useful information (even if partial). To fit a model with all the data we can instead alter the likelihood
contributions of the censored and non-censored data points:
L(λi; ti) = p(ti;λi) = λie
−λiti if no censoring
L(λi; ti) = Pr(t > ti;λi) = 1− Pr(t < ti;λi) = e−λiti if censoring
where p(ti;λi) is the Exponential probability density function for survival times ti, and Pr(t < ti;λi) is
the Exponential cumulative distribution function. The likelihood can then be written as:
L(λi; t1, . . . , tn) =
n∏
i=1
p(ti;λi)
ci [1− Pr(t < ti;λi)]1−ci
where ci = 1 for no censoring and ci = 0 for censoring. Show that for the exponential distribution
p(ti;λi) = λie
−λiti , the likelihood can be written as:
L(λi; t1, . . . , tn) =
n∏
i=1
(λiti)
cie−λiti
tcii
and note that this likelihood is equivalent to the Poisson likelihood up to a constant. One can therefore
fit this model using a Poisson model with mean λiti where the data are the ci. Write down this Poisson
model mathematically, such that λi = β0+β1xi where xi is the treatment variable. Fit this Poisson model
in R, check that it fits, and compare the treatment effect with the Exponential model on the ‘thinned’ data
from part (a). (Hint: to fit a model without an intercept, add a “−1” in the model formula of the glm() call.
Also, you will need to use the command gehan$treat <- relevel(gehan$treat, "6-MP") before
fitting the model, to revert the changes from part (a).) (14)
(19)
(100)
5