Date: 2023-04-19

Coursework MATH39512 Survival Analysis for
Actuarial Science 2023
Submit your answers as a single pdf-document on Blackboard before 3pm on Friday April
21st. Total number of marks is 20. Recall that any work you hand in must be yours and
yours alone.
1. Consider two homogeneous groups of individuals, group a and group b. The hazard
function of the survival time of an individual from group a, respectively b, is denoted by
µa(t), respectively µb(t). The possibly right-censored observed survival times of the two
groups are as follows (here + denotes a censored survival time):
Group a 3+ 6 9 11 14+ 14+
Group b 8 9 21 22+ 26 34+
(a) (2 marks) Assume that µa(t) = µb(t) for all t ≥ 0 and consider all the individuals
from both groups together as one aggregate group of individuals. Use the data of
all individuals combined to calculate, by hand, the Kaplan-Meier estimate of the
survival function of the survival time distribution of an arbitrary individual from
this aggregate group. (You should explain all your notation and show the details
of your calculations, so using only R output is not sufficient.)
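As a way to check a hand calculation, the Kaplan-Meier estimate $\hat{S}(t) = \prod_{t_j \le t} (1 - d_j / r_j)$ can be tabulated in base R, where $t_j$ are the distinct observed failure times, $d_j$ the number of failures at $t_j$ and $r_j$ the number at risk just before $t_j$. The sketch below is only illustrative: the helper name `km_table` and the variable names are assumptions introduced here, and the data are the pooled observations of groups a and b as listed above.

```r
# Pooled data: group a followed by group b (event = 0 marks a censored time).
time  <- c(3, 6, 9, 11, 14, 14,   8, 9, 21, 22, 26, 34)
event <- c(0, 1, 1,  1,  0,  0,   1, 1,  1,  0,  1,  0)

# Hypothetical helper: one row per distinct failure time.
km_table <- function(time, event) {
  t_j <- sort(unique(time[event == 1]))                        # distinct failure times
  r_j <- sapply(t_j, function(s) sum(time >= s))               # at risk just before t_j
  d_j <- sapply(t_j, function(s) sum(time == s & event == 1))  # failures at t_j
  data.frame(time = t_j, at_risk = r_j, deaths = d_j,
             surv = cumprod(1 - d_j / r_j))                    # Kaplan-Meier estimate
}

km <- km_table(time, event)
print(km)
```

The estimate is constant between failure times, so the table determines the whole step function.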
(b) (3 marks) We want to check whether the hazard rates of both groups are equal. To
this end, we want to test the null hypothesis,
$$H_0 : \mu_a(t) = \mu_b(t) \quad \text{for all } t \in [0, t_0 \wedge \tau_R],$$
versus $H_A : \mu_a(t) \neq \mu_b(t)$ for some $t \in [0, t_0 \wedge \tau_R]$, where $\tau_R$ is the first time when
there are no longer individuals at risk in one of the two groups. The following test
says to reject $H_0$ at significance level $\alpha$ if $|Z_{t_0} / \sqrt{V_{t_0}}| > z_{\alpha/2}$, where
• $Z_{t_0} = \int_0^{t_0 \wedge \tau_R} W_s \, d\big(\hat{A}_a(s) - \hat{A}_b(s)\big)$ with $W_s = \hat{S}_a(s-)\,\hat{S}_b(s-)$, where
  – $\hat{S}_j(t)$ is the Kaplan-Meier estimate at time $t$ of the survival data corresponding
    to group $j$ only,
  – $\hat{S}_j(t-) = \lim_{s \uparrow t} \hat{S}_j(s)$ is the left-limit of $\hat{S}_j(\cdot)$ at $t$ with the understanding
    that $\hat{S}_j(0-) = 1$,
  – $\hat{A}_j(t)$ denotes the Nelson-Aalen estimator at time $t$ of the cumulative hazard
    function of an individual from group $j$;
• $V_{t_0} = \int_0^{t_0 \wedge \tau_R} \frac{W_s^2}{R^a_s R^b_s} \, d\big(N^a_s + N^b_s\big)$, where
  – $N^j_t$ denotes the number of individuals in group $j$ that are observed to have
    failed before or at time $t$;
  – $R^j_t$ denotes the number of individuals at risk just before time $t$ in group $j$;
• $z_\alpha > 0$ is such that $\Phi(z_\alpha) = 1 - \alpha$, where $\Phi(\cdot)$ is the cumulative distribution
  function of the standard normal distribution.
Given the survival data at the beginning of the question, carry out this test with
$t_0 = 40$ at significance level $\alpha = 0.05$ and report your conclusions. (Hint: this test
is different from the log-rank test.)
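Because both integrators are step functions, the two integrals reduce to sums over the observed failure times up to $t_0 \wedge \tau_R$. The base-R sketch below evaluates them under two stated assumptions: the weight is read as the product of left-limits, $W_s = \hat{S}_a(s-)\hat{S}_b(s-)$, and $\tau_R$ is taken as the smaller of the two groups' largest observed times. All function names are hypothetical helpers introduced here.

```r
a_time <- c(3, 6, 9, 11, 14, 14); a_event <- c(0, 1, 1, 1, 0, 0)
b_time <- c(8, 9, 21, 22, 26, 34); b_event <- c(1, 1, 1, 0, 1, 0)

at_risk <- function(time, s) sum(time >= s)                       # R_s
deaths  <- function(time, event, s) sum(time == s & event == 1)   # jump of N at s

# Left-limit of the Kaplan-Meier estimate: product over failure times u < s.
km_left <- function(time, event, s) {
  u <- sort(unique(time[event == 1 & time < s]))
  if (length(u) == 0) return(1)
  prod(sapply(u, function(v) 1 - deaths(time, event, v) / at_risk(time, v)))
}

weighted_test <- function(a_time, a_event, b_time, b_event, t0, alpha = 0.05) {
  tau_R <- min(max(a_time), max(b_time))   # assumption: last time both groups are at risk
  upper <- min(t0, tau_R)
  s_all <- sort(unique(c(a_time[a_event == 1], b_time[b_event == 1])))
  s_all <- s_all[s_all <= upper]
  Z <- 0
  V <- 0
  for (s in s_all) {
    Ra <- at_risk(a_time, s); Rb <- at_risk(b_time, s)
    da <- deaths(a_time, a_event, s); db <- deaths(b_time, b_event, s)
    W  <- km_left(a_time, a_event, s) * km_left(b_time, b_event, s)
    Z  <- Z + W * (da / Ra - db / Rb)        # W_s d(A_a - A_b)
    V  <- V + W^2 / (Ra * Rb) * (da + db)    # W_s^2 / (R^a R^b) d(N^a + N^b)
  }
  stat <- Z / sqrt(V)
  crit <- qnorm(1 - alpha / 2)               # z_{alpha/2}
  list(Z = Z, V = V, stat = stat, crit = crit, reject = abs(stat) > crit)
}

res <- weighted_test(a_time, a_event, b_time, b_event, t0 = 40)
print(res)
```

Restricting the sum to times no later than $\tau_R$ guarantees that both risk-set sizes in the denominators are positive.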
2. Consider the following observed values of the possibly left-truncated and right-censored
survival times of 20 independent homogeneous individuals, where + denotes a censored
value:
(0.0, 0.8] (0.0, 1.6] (1.1, 2.7] (0.0, 4.3+] (0.0, 5.3+]
(3.2, 5.7] (0.5, 7.2] (0.0, 7.2+] (0.0, 8.5] (0.0, 8.7]
(0.0, 12.6] (0.0, 12.9] (9.3, 14.7+] (3.6, 16.3+] (11.1, 16.7]
(0.0, 17.6] (13.0, 21.7+] (0.0, 23.9] (15.5, 25.1] (0.0, 31.2+]
Assume the following parametric form of the common hazard rate of the individuals:
µ(t) = µ(t; ) = et,
with > 0.
(a) (3 marks) Find an explicit expression (in terms of ) for the log-likelihood function
of the parameter given the last 5 data points only (i.e. those on the last row). You
may assume here any reasonable conditions on the censoring and you can ignore
any constants that do not affect the maximum likelihood estimation.
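As a starting point, the generic log-likelihood for left-truncated and right-censored data can be written down before substituting the parametric hazard. The notation below is an assumption made for this sketch: $\theta$ is a placeholder for the model parameter, $(v_i, x_i]$ is the observation interval of individual $i$, and $\delta_i = 1$ if $x_i$ is an observed failure and $0$ if it is censored. It also assumes independent, non-informative censoring and truncation.

```latex
\ell(\theta)
  = \sum_{i} \left[ \delta_i \log \mu(x_i; \theta)
      - \int_{v_i}^{x_i} \mu(s; \theta) \, ds \right]
```

Each individual contributes the log-hazard at the observed time only when a failure is observed, minus the cumulative hazard accumulated over the interval during which the individual was actually under observation.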
(b) (3 marks) Check graphically (you can use R’s survival package to help with this)
if the parametric assumption on the form of the hazard rate is an appropriate
assumption given the whole survival data set (i.e. the data of all 20 individuals).
You should explain your steps, provide the relevant plot and report your conclusions.
(Hint: the first step is to find functions f and g that do not depend on the parameter
such that f(A(t)) = ag(t) + b, where a and b can depend on but not on t and
where A(t) is the cumulative hazard function.)
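The graphical check needs a non-parametric estimate of $A(t)$ that respects the left-truncation. A minimal base-R sketch of the Nelson-Aalen estimator with delayed entry follows; the helper names `nelson_aalen_lt` and `check_plot` are hypothetical, and the transformations `f` and `g` are left as arguments because they depend on the parametric form being checked.

```r
# The 20 observations, row by row: (entry, exit], event = 0 for a censored exit.
entry <- c(0, 0, 1.1, 0, 0,   3.2, 0.5, 0, 0, 0,
           0, 0, 9.3, 3.6, 11.1,   0, 13.0, 0, 15.5, 0)
exit  <- c(0.8, 1.6, 2.7, 4.3, 5.3,   5.7, 7.2, 7.2, 8.5, 8.7,
           12.6, 12.9, 14.7, 16.3, 16.7,   17.6, 21.7, 23.9, 25.1, 31.2)
event <- c(1, 1, 1, 0, 0,   1, 1, 0, 1, 1,
           1, 1, 0, 0, 1,   1, 0, 1, 1, 0)

# Nelson-Aalen estimate; an individual is at risk at s when entry < s <= exit.
nelson_aalen_lt <- function(entry, exit, event) {
  t_j <- sort(unique(exit[event == 1]))
  r_j <- sapply(t_j, function(s) sum(entry < s & exit >= s))
  d_j <- sapply(t_j, function(s) sum(exit == s & event == 1))
  data.frame(time = t_j, at_risk = r_j, deaths = d_j,
             cumhaz = cumsum(d_j / r_j))
}

na <- nelson_aalen_lt(entry, exit, event)
print(na)

# Plot f(A_hat(t)) against g(t); roughly linear points support the assumed form.
check_plot <- function(na, f, g) {
  plot(g(na$time), f(na$cumhaz), xlab = "g(t)", ylab = "f(A(t))")
}
```

Once suitable `f` and `g` have been identified, calling `check_plot(na, f, g)` produces the diagnostic plot. The same estimate should be obtainable from the survival package via a counting-process `Surv(entry, exit, event)` object.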
3. Consider the data set psych in the R-package KMsurv. This data set can be loaded into
a data frame called psych via the following commands in R:
    install.packages("KMsurv")  # only needed once
    library(KMsurv)
    data(psych)
Here the first command is not needed if the KMsurv package has already been installed.
The data set consists of 26 psychiatric inpatients admitted to the University of Iowa
hospitals during the years 1935–1948. The survival time recorded for each patient is the
time in years from admission to the hospital until death. Also recorded are the sex and
age at the time of (first) admission of each patient.
In order to analyse the data consider a Cox proportional hazards (PH) model with sex
and age acting as covariates. For answering some of the questions below you should
make use of the survival package in R.
(a) (1 mark) Give the form of the hazard rate of a patient in this Cox PH model. (You
should clearly define each piece of notation that you use.)
(b) (2 marks) Derive an explicit expression in terms of the regression coefficients for
the partial likelihood in this Cox PH model using only rows 15-19 of the data set,
i.e. using only the following observations:
       sex age time death
    15   2  33   35     0
    16   1  36   25     1
    17   1  30   31     0
    18   1  41   22     1
    19   2  43   26     1
(You should explain all your notation.)
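For orientation, the general shape of the Cox partial likelihood with two covariates and no tied failure times is sketched below. The notation is assumed for this sketch: $\beta_1, \beta_2$ are the regression coefficients of sex and age, $z_{i1}, z_{i2}$ the covariate values of patient $i$, $t_i$ the recorded time, $\delta_i$ the death indicator and $\mathcal{R}(t)$ the risk set at time $t$.

```latex
L(\beta_1, \beta_2)
  = \prod_{i \,:\, \delta_i = 1}
    \frac{\exp(\beta_1 z_{i1} + \beta_2 z_{i2})}
         {\sum_{k \in \mathcal{R}(t_i)} \exp(\beta_1 z_{k1} + \beta_2 z_{k2})},
\qquad
\mathcal{R}(t) = \{\, k : t_k \ge t \,\}
```

Only patients with an observed death contribute a factor, while censored patients enter through the risk sets of earlier deaths. The baseline hazard cancels in each ratio, so the expression depends on the regression coefficients alone.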
(c) (2 marks) While using the full data set, estimate the relative risk of a female patient
of age 20 at the time of admission relative to a male patient of age 50 at the
time of admission and give an interpretation of this estimate within the relevant
context. (You should briefly explain how you obtained your estimate, though it is
not necessary to provide the exact piece of code that you may have used.)
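Under a Cox PH model the relative risk of two covariate profiles does not involve the baseline hazard. A sketch of the resulting expression is given below, with assumed notation: $\hat\beta_1$ and $\hat\beta_2$ are the fitted coefficients of sex and age, and $z_1^{F}$, $z_1^{M}$ are the codes used for female and male in the data set (to be checked against the KMsurv documentation of `psych`).

```latex
\widehat{RR}
  = \frac{\hat\mu(t \mid \text{female}, 20)}{\hat\mu(t \mid \text{male}, 50)}
  = \exp\!\left( \hat\beta_1 \,(z_1^{F} - z_1^{M}) + \hat\beta_2 \,(20 - 50) \right)
```

The ratio is constant in $t$, which is precisely the proportional hazards assumption.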
(d) (i) (2 marks) Denote by
• $\hat\beta_1$ and $\hat\beta_2$ the maximum likelihood estimates of the two regression
coefficients in the Cox PH model containing all 26 observations,
• $\hat\beta_{1(i)}$ and $\hat\beta_{2(i)}$ the maximum likelihood estimates of the two regression
coefficients in the Cox PH model containing all 26 observations except for
the i-th observation.
Provide a plot of the points $\hat\beta_1 - \hat\beta_{1(i)}$, i = 1, . . . , 26, and another plot for
the points $\hat\beta_2 - \hat\beta_{2(i)}$, i = 1, . . . , 26. Here you should add the code you used
to provide the plots. (Hint: (i) use a for loop, (ii) the subset argument of the
coxph function might be useful here, and (iii) the i-th component of a given
vector x can be excluded via x[-i].)
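One possible shape for the leave-one-out loop is sketched below. It is an illustration under stated assumptions, not a prescribed solution: the helper name `loo_coef_diff` is hypothetical, rows are dropped by indexing the data frame with `d[-i, ]` rather than through the `subset` argument, and the column names `time`, `death`, `sex` and `age` are those shown in the data extract above.

```r
library(survival)

# Hypothetical helper: row i holds (full-data estimate) - (estimate without row i).
loo_coef_diff <- function(d) {
  full  <- coef(coxph(Surv(time, death) ~ sex + age, data = d))
  diffs <- matrix(NA_real_, nrow = nrow(d), ncol = length(full),
                  dimnames = list(NULL, names(full)))
  for (i in seq_len(nrow(d))) {
    fit_i <- coxph(Surv(time, death) ~ sex + age, data = d[-i, ])  # drop row i
    diffs[i, ] <- full - coef(fit_i)
  }
  diffs
}

# Only runs when the KMsurv package is installed.
if (requireNamespace("KMsurv", quietly = TRUE)) {
  data(psych, package = "KMsurv")
  diffs <- loo_coef_diff(psych)
  plot(diffs[, "sex"], xlab = "i", ylab = "change in sex coefficient")
  abline(h = 0, lty = 2)
  plot(diffs[, "age"], xlab = "i", ylab = "change in age coefficient")
  abline(h = 0, lty = 2)
}
```

Plotting the differences against the observation index makes it easy to read off which patients move each coefficient the most.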
(ii) (2 marks) The outliers in the two plots (i.e. the points furthest away from zero)
indicate the observations/patients that have the most influence on the maximum
likelihood estimate (mle) of a given regression coefficient. For each of the
two plots detect 2 or 3 outliers and explain how the corresponding observations
influence the mle of the associated regression coefficient and why they are so
influential.