MAST90084: Statistical Modelling Assignment 1
1. Let $X$ and $Y$ be two categorical random variables, $X$ with $I$ categories identified with the set
$\{1, \dots, I\}$ and $Y$ with $J$ categories identified with the set $\{1, \dots, J\}$. Suppose observations of
the variable pair $(X, Y)$ are tabulated in an $I \times J$ contingency table. Using standard notation, for a given
$(i, j) \in \{1, \dots, I\} \times \{1, \dots, J\}$, $n_{ij}$ is the entry in the $(i, j)$-th cell, i.e. the count of observations
with $X$ equal to its $i$-th category and $Y$ equal to its $j$-th category. A Poisson sampling model for the
contingency table assumes that the $n_{ij}$'s are independently distributed with
$$n_{ij} \sim \mathrm{Poi}(\mu_{ij}),$$
where $\mu_{ij}$ denotes the Poisson mean for the cell count $n_{ij}$.
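To make the sampling model concrete, the following R sketch simulates one table under independent Poisson cell counts. The table dimensions and the matrix `mu` of cell means are illustrative assumptions, not values from the assignment.

```r
# Illustrative cell means for a hypothetical 2 x 3 table (assumed values)
mu <- matrix(c(10, 20, 30,
               15, 25, 35), nrow = 2, byrow = TRUE)

set.seed(1)  # for reproducibility of the simulated counts

# Each cell count n_ij is drawn independently from Poi(mu_ij);
# rpois and matrix both work in column-major order, so cells line up with mu
n_tab <- matrix(rpois(length(mu), lambda = mu), nrow = nrow(mu))

# Under Poisson sampling the grand total is itself random
n_total <- sum(n_tab)
```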
(a) Derive the conditional joint distribution of $\{n_{ij}\}_{(i,j) \in \{1,\dots,I\} \times \{1,\dots,J\}}$ given the total count $n = \sum_{i,j} n_{ij}$. Identify the name of this
distribution, and explicitly state what its parameter values are in terms of $\{\mu_{ij}\}_{(i,j) \in \{1,\dots,I\} \times \{1,\dots,J\}}$
and $n$. [5]
(b) Let $I = J = 2$. The quantity $\frac{\mu_{11}/\mu_{12}}{\mu_{21}/\mu_{22}}$, also known as the odds ratio, measures the association between
$X$ and $Y$. What should be the value of the odds ratio if $X$ and $Y$ are independent and why? [3]
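For reference, the ratio of odds in part (b) can be rewritten as a cross-product ratio of the four cell means. This is only an algebraic rearrangement and makes no assumption about the relationship between $X$ and $Y$:

$$\frac{\mu_{11}/\mu_{12}}{\mu_{21}/\mu_{22}} = \frac{\mu_{11}\,\mu_{22}}{\mu_{12}\,\mu_{21}}.$$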
2. Data in the following 2 × 2 × 3 contingency table were used to study the effect of passive smoking on lung
cancer. The table summarizes the results of case-control studies from 3 countries for nonsmoking women
married to smokers. (Source: Blot and Fraumeni, J. Nat. Cancer Inst., 77:993-1000 (1986) and Agresti
(1996).)
Country   Spouse Smoked   Cases   Controls
Japan     No                 21         82
          Yes                73        188
UK        No                  5         16
          Yes                19         38
USA       No                 71        249
          Yes               137        363
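As a quick descriptive look at the table above, the sample odds ratio of each country's 2 × 2 partial table can be computed directly from the counts. This is a sketch only: the array name `tab` and the orientation of the odds ratio (No vs Yes, Case vs Control) are choices made here for illustration, and it is not part of the model fitting asked for below.

```r
# Counts from the table, arranged as Spouse Smoked x Status x Country
tab <- array(c(21, 73, 82, 188,     # Japan: cases (No, Yes), controls (No, Yes)
               5, 19, 16, 38,       # UK
               71, 137, 249, 363),  # USA
             dim = c(2, 2, 3),
             dimnames = list(Smo = c("No", "Yes"),
                             Can = c("Case", "Control"),
                             Cnt = c("Japan", "UK", "USA")))

# Sample odds ratio n11 * n22 / (n12 * n21) for each country's partial table
sample_or <- apply(tab, 3, function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1]))
sample_or
```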
(a) A log-linear model mod1 can be fitted to the data, with the results being given in the following R
output. Give the mathematical formula of form ln(µ) = · · · for the mean model of mod1, where µ is
the mean of the response. Any dummy variables in your formula should be explicitly defined. [5]
> pasSmoking.dat=data.frame(freq=c(21,73,5,19,71,137,82,188,16,38,249,363))
> pasSmoking.dat$Cnt=factor(rep(c("Japan","UK", "USA"), times=2, each=2))
> pasSmoking.dat$Smo=factor(rep(c("No","Yes"), times=6))
> pasSmoking.dat$Can=factor(rep(c("Case","Control"), each=6))
> pasSmoking.dat
   freq   Cnt Smo     Can
1    21 Japan  No    Case
2    73 Japan Yes    Case
3     5    UK  No    Case
4    19    UK Yes    Case
5    71   USA  No    Case
6   137   USA Yes    Case
7    82 Japan  No Control
8   188 Japan Yes Control
9    16    UK  No Control
10   38    UK Yes Control
11  249   USA  No Control
12  363   USA Yes Control
MAST90084 Statistical Modelling Assignment 1 Semester 1, 2021
> mod1=glm(freq~Cnt+Smo+Can+Cnt:Smo+Cnt:Can+Smo:Can, family=poisson, data=pasSmoking.dat)
> anova(mod1, test="Chisq")
Analysis of Deviance Table; Model: poisson; Link: log; Response: freq
Terms added sequentially (first to last)
        Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                       11    1168.85
Cnt      2   726.43         9     442.42  < 2.2e-16
Smo      1   112.52         8     329.90  < 2.2e-16
Can      1   307.56         7      22.34  < 2.2e-16
Cnt:Smo  2    15.50         5       6.84  0.0004316
Cnt:Can  2     1.05         3       5.80  0.5919109
Smo:Can  1     5.56         2       0.24  0.0184215
> 1-pchisq(0.24,2)
[1] 0.8869204
> 1-pchisq(5.80,3)
[1] 0.1217566
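The two upper-tail probabilities above are computed as `1 - pchisq(q, df)`. An equivalent call, sketched here, passes `lower.tail = FALSE` to `pchisq`, which returns the same upper-tail probability directly. The variable names `p_2df` and `p_3df` are introduced only for this illustration.

```r
# Upper-tail chi-squared probabilities via lower.tail = FALSE
p_2df <- pchisq(0.24, df = 2, lower.tail = FALSE)
p_3df <- pchisq(5.80, df = 3, lower.tail = FALSE)

# Both agree with the 1 - pchisq(...) form used in the output above
all.equal(p_2df, 1 - pchisq(0.24, 2))
all.equal(p_3df, 1 - pchisq(5.80, 3))
```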
(b) Extending the notation from Question 1, for the current contingency table we can also use $n_{ijk}$ to
denote the count in each cell, where $i \in \{1, 2\}$, $j \in \{1, 2\}$, $k \in \{1, 2, 3\}$ are indices corresponding
to Can (variable $X$), Smo (variable $Y$) and Cnt (variable $Z$) respectively. Moreover, if the $n_{ijk}$ are
independently distributed with
$$n_{ijk} \sim \mathrm{Poi}(\mu_{ijk}),$$
one can, for any $k \in \{1, 2, 3\}$, define the odds ratio $\theta_{XY(k)} = \frac{\mu_{11k}\,\mu_{22k}}{\mu_{12k}\,\mu_{21k}}$ for the partial table with $Z = k$.
The table is said to have homogeneous $XY$ association when $\theta_{XY(1)} = \theta_{XY(2)} = \theta_{XY(3)}$. Explain why
the model in part (a) has homogeneous $XY$ association. [5]
(c) Based on the displayed R output in (a), test the significance of the interaction effect Smo:Can at
significance level 0.05, eliminating the effects of all other terms in mod1. Provide your conclusion
with clear explanation. [4]
(d) Based on the displayed R output in (a), test the adequacy of model Cnt+Smo+Can+Cnt:Smo+Cnt:Can
at significance level 0.05. Provide your conclusion with clear explanation. [4]
(e) Are your conclusions in (c) and (d) contradictory? You must give an explanation to get any score.
[5]
3. A variable $Y$ taking values in $\{0, 1, 2, \dots\}$ has a Negative Binomial (NB) distribution if its probability
mass function has the form
$$p(Y = y; \mu, \kappa) = \frac{\Gamma(\kappa + y)}{\Gamma(\kappa)\, y!} \, \frac{\kappa^{\kappa} \mu^{y}}{(\mu + \kappa)^{\kappa + y}},$$
for $y = 0, 1, \dots$, where $\mu$ is the mean of $Y$.
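As a sanity check on the mass function above, the sketch below codes it directly and compares it with R's built-in `dnbinom`, using its mean parametrisation with `size` playing the role of κ. The function name `nb_pmf` and the values of `mu` and `kappa` are illustrative assumptions.

```r
# Direct implementation of the NB mass function, computed on the log scale
nb_pmf <- function(y, mu, kappa) {
  exp(lgamma(kappa + y) - lgamma(kappa) - lfactorial(y) +
        kappa * log(kappa) + y * log(mu) - (kappa + y) * log(mu + kappa))
}

mu <- 3; kappa <- 2.5   # illustrative values only
y <- 0:10

# Compare with dnbinom in its (size, mu) parametrisation
all.equal(nb_pmf(y, mu, kappa), dnbinom(y, size = kappa, mu = mu))
```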
(a) When κ is considered as fixed (or known), the NB distribution belongs to the exponential dispersion
model (EDM) discussed in class. Write out its form as an EDM explicitly. In particular, you have
to identify the natural parameter θ and the dispersion parameter φ in terms of µ and κ whenever
appropriate, and identify b(·) (as a function of θ). You can simply take the weight ω to be 1. [5]
(b) Let $\sigma^2$ be the variance of $Y$. From your answer above, derive the formula for $\sigma^2$ as a function of $\mu$.
Why do we say that the NB distribution can be used as a likelihood model to handle “overdispersion”
compared to the Poisson distribution? [4]
Total marks = 40