Posted: 2020-10-20

ST346: Assessed coursework 1

Generalized Linear Models

Deadline 12 noon (GMT) Tuesday 27 October 2020

Your solutions should be submitted electronically in the form of a PDF document using the submission portal on the ST346 Moodle page. Please remember to include only your ID number on your submission to allow anonymous marking.

If you have any queries about the coursework please post them on the ST346 forum, but do not post any part of your solutions. This assignment counts towards 10% of your final module mark.

The maximum score for this coursework is 20/20. Numbers in brackets indicate the points available for each question.

To access the data needed for this assignment, download the file courseworkData1.rda from the ST346 Moodle web page and read it into R using the function load(). This will create a copy of two data frames in your R workspace: insurance and doctors.
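As a rough sketch of how load() behaves, the snippet below saves two tiny made-up data frames to a temporary .rda file and reads them back. The file name courseworkData1_demo.rda and all values are hypothetical stand-ins so that the example runs without the real file.

```r
# Hypothetical stand-ins for the two coursework data frames (made-up values).
insurance <- data.frame(car = 1:2, age = 1:2, district = 0:1,
                        y = c(3, 5), n = c(40, 60))
doctors <- data.frame(smoking = 0:1, deaths = c(2, 9),
                      personyears = c(1000, 1500))

# Save them to a temporary .rda file, then remove them from the workspace.
rda_path <- file.path(tempdir(), "courseworkData1_demo.rda")
save(insurance, doctors, file = rda_path)
rm(insurance, doctors)

# load() restores the saved objects and returns their names invisibly.
loaded_names <- load(rda_path)
print(loaded_names)
str(insurance)
```

With the real file, the call would presumably be load("courseworkData1.rda"), assuming the file sits in the current working directory.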

1. The insurance data set concerns the number of car insurance claims made by clients of an insurance company in a single year. Variables in the data set are:

• car Engine size of car (1: < 1 litre, 2: 1–1.5 litres, 3: 1.5–2 litres, 4: > 2 litres)
• age Age group (1: < 25 years, 2: 25–29 years, 3: 30–35 years, 4: > 35 years)
• district Where policy holder lived (1: urban area, i.e. in a city; 0: rural area, i.e. outside a city)
• y Number of claims
• n Number of insurance policies

In this data set, individual policies have been aggregated into groups defined by the cross-classification of car, age, and district giving N = 4 × 4 × 2 = 32 rows.

(a) Fit a null Poisson regression model with number of claims as the outcome, but none of the variables car, age, district as predictor variables. Show that the estimate for the intercept term is numerically equal to

$$\log\left(\frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} n_i}\right)$$

i.e. the log of the rate of claims per policy across all policy-holders. [2]
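A minimal sketch of the check in part (a), on made-up data rather than the real insurance data frame. It assumes that log(n) enters the model as an offset; the question does not say so explicitly, but the stated identity relies on it. The names toy, m0, intercept and pooled are illustrative only.

```r
# Synthetic stand-in for the insurance data (values are made up).
set.seed(1)
toy <- expand.grid(car = 1:4, age = 1:4, district = 0:1)
toy$n <- sample(50:500, nrow(toy), replace = TRUE)  # policies per group
toy$y <- rpois(nrow(toy), lambda = 0.1 * toy$n)     # claims per group

# Null Poisson model: intercept only, with log(n) as an offset (assumption).
m0 <- glm(y ~ 1 + offset(log(n)), family = poisson, data = toy)

# Compare the fitted intercept with the log of the pooled claim rate.
intercept <- unname(coef(m0)["(Intercept)"])
pooled <- log(sum(toy$y) / sum(toy$n))
print(c(intercept = intercept, pooled = pooled))
print(all.equal(intercept, pooled, tolerance = 1e-6))
```

The agreement follows from the score equation for the intercept, which sets the total of the fitted counts equal to the total of the observed counts.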

(b) Fit another Poisson model with predictor variables car, age, and district where car and age are factors (i.e. considered as categorical variables).

If we denote the coefficient for the variable district by $\beta_d$ then $\exp(\beta_d)$ is the ratio between the rate of claims in urban vs. rural areas. Give an estimate of this rate ratio. Is the rate of insurance claims higher in urban or rural areas? [2]
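One possible way to read off the rate ratio in part (b), sketched on made-up data. The offset log(n), the simulated effect sizes and the names toy, m1, rate_ratio and ci are all assumptions for illustration; the Wald interval is an optional extra, not something the question asks for.

```r
# Synthetic stand-in for the insurance data (values are made up).
set.seed(2)
toy <- expand.grid(car = 1:4, age = 1:4, district = 0:1)
toy$n <- sample(50:500, nrow(toy), replace = TRUE)
# Made-up truth: rate rises with engine size, falls with age,
# and is 20% higher in urban areas.
toy$y <- rpois(nrow(toy),
               toy$n * 0.1 * exp(0.15 * toy$car - 0.1 * toy$age +
                                 log(1.2) * toy$district))

# car and age as factors, district as a 0/1 indicator, log(n) as offset.
m1 <- glm(y ~ factor(car) + factor(age) + district + offset(log(n)),
          family = poisson, data = toy)

# Exponentiating the district coefficient gives the urban/rural rate ratio.
rate_ratio <- exp(coef(m1)["district"])
ci <- exp(confint.default(m1)["district", ])  # Wald interval, ratio scale
print(rate_ratio)
print(ci)
```

A ratio above 1 would indicate a higher claim rate in urban areas, and a ratio below 1 a higher rate in rural areas.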

(c) Use stepwise regression to determine whether the model in question 1b can be improved by removing predictor variables or adding interactions. Your minimal model should be the null model fitted in question 1a and your maximal model should be one with all predictors and all 2-way interactions. [3]
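A sketch of how the search in part (c) might be set up with step(), again on made-up data. The helper columns car_f and age_f, the offset log(n) and the choice direction = "both" are assumptions; the question only fixes the minimal and maximal models.

```r
# Synthetic stand-in for the insurance data (values are made up).
set.seed(3)
toy <- expand.grid(car = 1:4, age = 1:4, district = 0:1)
toy$n <- sample(50:500, nrow(toy), replace = TRUE)
toy$y <- rpois(nrow(toy),
               toy$n * 0.1 * exp(0.15 * toy$car - 0.1 * toy$age +
                                 0.2 * toy$district))
toy$car_f <- factor(toy$car)
toy$age_f <- factor(toy$age)

# Starting point: the main-effects model.
m1 <- glm(y ~ car_f + age_f + district + offset(log(n)),
          family = poisson, data = toy)

# lower = null model, upper = all main effects plus all 2-way interactions.
chosen <- step(m1,
               scope = list(lower = ~ 1,
                            upper = ~ (car_f + age_f + district)^2),
               direction = "both", trace = 0)
print(formula(chosen))
print(chosen$anova)  # the sequence of steps taken and their AIC values
```

step() compares models by AIC by default, so the selected model depends on that criterion rather than on significance tests.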

(d) Using the model chosen by stepwise regression in question 1c, use the anova() function to test whether a linear dose-response with age is a better fit than a categorical model. (If your “optimal” model does not include age then you have gone wrong. Try question 1c again.)

You will need to use the 2-argument version of the anova function

anova(m1, m2, test="LRT")

where m1 and m2 are the two fitted models returned by the glm() function. [2]
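The comparison in part (d) can be sketched as follows on made-up data. For brevity the two models contain age only; in the coursework the other terms of the chosen model would presumably be kept in both fits. The names m_lin, m_cat and lrt are illustrative.

```r
# Synthetic stand-in for the insurance data (values are made up).
set.seed(4)
toy <- expand.grid(car = 1:4, age = 1:4, district = 0:1)
toy$n <- sample(50:500, nrow(toy), replace = TRUE)
toy$y <- rpois(nrow(toy), toy$n * 0.1 * exp(-0.1 * toy$age))

# m_lin: age as a numeric score 1 to 4 (a single slope).
# m_cat: age as a factor (three parameters), which nests m_lin.
m_lin <- glm(y ~ age + offset(log(n)), family = poisson, data = toy)
m_cat <- glm(y ~ factor(age) + offset(log(n)), family = poisson, data = toy)

# Likelihood ratio test of the simpler model against the richer one.
lrt <- anova(m_lin, m_cat, test = "LRT")
print(lrt)
```

The test has 2 degrees of freedom here because the categorical model uses three age parameters where the linear model uses one. A large p-value would suggest that the linear trend is adequate.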

(e) The insurance company wants to make the insurance premiums proportional to the risk of an insurance claim. A customer pays a $100 premium for a car in category 1. If they change their car to one in category 4 then what should be their new insurance premium? [2]
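One way to set up the calculation in part (e), under two assumptions that should be checked against the fitted model: car category 1 is the reference level, and the chosen model has no interaction involving car. The symbol $\hat\beta_{\text{car},4}$ is notation introduced here for the estimated coefficient of car category 4.

```latex
\begin{align*}
\text{new premium}
  &= \$100 \times \frac{\text{claim rate for car} = 4}{\text{claim rate for car} = 1} \\
  &= \$100 \times \exp\bigl(\hat\beta_{\text{car},4}\bigr)
\end{align*}
```

If the chosen model does include an interaction with car, the ratio would also depend on the customer's age group or district, and the relevant interaction coefficient would need to be added inside the exponential.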

2. The data frame doctors comes from the British Doctors Study. This study, which began in 1951, was the world’s first large prospective study of the effects of smoking to establish a convincing linkage between tobacco smoking and cause-specific mortality (death).

The doctors data set concerns deaths from coronary heart disease 10 years after the start of the study. The data on 34494 participants have been aggregated into 10 groups defined by age and smoking status. The variables in the data set are:

• age Age group. A factor with levels: 35–44, 45–54, 55–64, 65–74, 75–84.
• smoking A binary indicator of smoking habits (1 = smoker, 0 = non-smoker)
• deaths Total number of deaths that occurred in each group in 10 years of follow-up.
• personyears Total number of person-years of follow-up in each group (i.e. if 5 doctors are followed for 10 years then the group has 5 × 10 = 50 person-years of follow-up)

(a) Consider the following model

$$D_i \sim \text{Poisson}(\mu_i)$$

$$\log(\mu_i) = \alpha + \beta s_i + \log(Y_i)$$

where $D_i$ is the number of deaths in row $i$, $Y_i$ is the number of person-years of follow-up in group $i$ and $s_i$ is the smoking status.

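Fit this model in R and show numerically that:

$$\hat\beta = \log\left(\frac{\hat\lambda_1}{\hat\lambda_0}\right)$$

where $\hat\lambda_1$ is the estimated mortality rate in smokers and $\hat\lambda_0$ is the estimated mortality rate in non-smokers.

$$\hat\lambda_1 = \frac{\sum_{i \in S} D_i}{\sum_{i \in S} Y_i} \qquad \hat\lambda_0 = \frac{\sum_{i \in N} D_i}{\sum_{i \in N} Y_i}$$

where $S$ is the set of rows containing smokers and $N$ is the set of rows containing non-smokers. [3]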
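A minimal sketch of the numerical check in part (a), using a made-up table laid out like doctors. The values in toy are invented for illustration and are not the coursework data; the names m2a, rate1, rate0 and beta_hat are hypothetical.

```r
# Made-up data in the same layout as doctors (NOT the coursework values).
toy <- data.frame(
  age = factor(rep(c("35-44", "45-54", "55-64", "65-74", "75-84"), times = 2)),
  smoking = rep(c(1, 0), each = 5),
  deaths = c(25, 90, 180, 160, 90, 4, 15, 35, 40, 45),
  personyears = c(48000, 39000, 26000, 11000, 4500,
                  17000, 9500, 5200, 2300, 1300)
)

# Poisson model with smoking only and log person-years as the offset.
m2a <- glm(deaths ~ smoking + offset(log(personyears)),
           family = poisson, data = toy)

# Crude rates: total deaths over total person-years in each smoking group.
rate1 <- with(toy, sum(deaths[smoking == 1]) / sum(personyears[smoking == 1]))
rate0 <- with(toy, sum(deaths[smoking == 0]) / sum(personyears[smoking == 0]))

beta_hat <- unname(coef(m2a)["smoking"])
print(c(beta_hat = beta_hat, log_ratio = log(rate1 / rate0)))
```

The two printed numbers should agree up to the convergence tolerance of glm().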

(b) Now consider this model:

$$\log(\mu_i) = \alpha + \beta s_i + \sum_{g=2}^{G} I\{a_i = g\}\,\gamma_g + \log(Y_i)$$

where $a_i \in \{1, 2, \ldots, G\}$ is the age group in row $i$, and $G = 5$ is the number of age groups.

Fit this model in R. What happens to the estimate of $\beta$ compared with the model in question 2a? [3]
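The comparison asked for in part (b) could be laid out as below, again on a made-up table in the layout of doctors (the values are invented, and the names m2a, m2b and comparison are illustrative).

```r
# Made-up data in the same layout as doctors (NOT the coursework values).
toy <- data.frame(
  age = factor(rep(c("35-44", "45-54", "55-64", "65-74", "75-84"), times = 2)),
  smoking = rep(c(1, 0), each = 5),
  deaths = c(25, 90, 180, 160, 90, 4, 15, 35, 40, 45),
  personyears = c(48000, 39000, 26000, 11000, 4500,
                  17000, 9500, 5200, 2300, 1300)
)

# Model 2a: smoking only. Model 2b: smoking plus age group as a factor.
m2a <- glm(deaths ~ smoking + offset(log(personyears)),
           family = poisson, data = toy)
m2b <- glm(deaths ~ smoking + age + offset(log(personyears)),
           family = poisson, data = toy)

# Put the smoking rows of the two coefficient tables side by side.
comparison <- rbind(model_2a = coef(summary(m2a))["smoking", ],
                    model_2b = coef(summary(m2b))["smoking", ])
print(comparison)
```

A change in the smoking estimate between the two rows would suggest that age is associated with both smoking status and mortality in the data being analysed.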

(c) Under the model in question 2b, the ratio of the mortality rates for smokers versus non-smokers is assumed constant across age groups.

The figure below shows the estimated rates for smokers and non-smokers. The top panel shows the rates on an arithmetic scale and the bottom panel shows the rates on a logarithmic scale.

[Figure: estimated mortality rates (deaths per 100,000 years) by age group (35 to 44, 45 to 54, 55 to 64, 65 to 74, 75 to 84) for smokers and non-smokers. Top panel: "Mortality by age" (arithmetic scale). Bottom panel: "Mortality by age (log scale)".]

Is the model in question 2b appropriate? Propose an alternative model that allows the effect of smoking to depend on age. Give an estimate of the mortality rate ratio for smokers vs non-smokers among individuals aged 65–74. What is the p-value for the test that this rate ratio is equal to 1? (Hint: use the stratified parameterization and look at the output of the summary() function for the p-value.) [3]
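One reading of the "stratified parameterization" in the hint is a model with a separate smoking coefficient inside each age group, written age + age:smoking in R formula syntax. The sketch below applies that reading to a made-up table in the layout of doctors. The values are invented, and the coefficient name "age65-74:smoking" depends on the factor labels assumed here.

```r
# Made-up data in the same layout as doctors (NOT the coursework values).
toy <- data.frame(
  age = factor(rep(c("35-44", "45-54", "55-64", "65-74", "75-84"), times = 2)),
  smoking = rep(c(1, 0), each = 5),
  deaths = c(25, 90, 180, 160, 90, 4, 15, 35, 40, 45),
  personyears = c(48000, 39000, 26000, 11000, 4500,
                  17000, 9500, 5200, 2300, 1300)
)

# Stratified form: age main effects plus one smoking effect per age group.
m2c <- glm(deaths ~ age + age:smoking + offset(log(personyears)),
           family = poisson, data = toy)

# Row of the coefficient table for smoking within the 65-74 age group.
row_65_74 <- coef(summary(m2c))["age65-74:smoking", ]
rr_65_74 <- unname(exp(row_65_74["Estimate"]))  # rate ratio, smokers vs non
p_65_74 <- unname(row_65_74["Pr(>|z|)"])        # Wald test that ratio = 1
print(c(rate_ratio = rr_65_74, p_value = p_65_74))
```

With 10 rows and 10 parameters this model is saturated, so each age-specific rate ratio matches the crude ratio within that age group.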
