Assignment
7PAVPRMD 2021-22: Prediction Modelling
King’s College London, February/March 2022
Module lead: Daniel Stahl

Date due: XXX
Programming software: R
Maximum score: 100
The assignment consists of two parts: a critical assessment of the methodology of a prediction
modelling study in health research (Part 1) and a data analysis project (Part 2). In each part, a total
of 50 points can be obtained.


Part 1: Critical assessment of the methodology of a prediction
modelling study in health research

A total of 50 points can be obtained in part 1 of this assignment.
Introduction:
Please read the paper “Developing an individualized risk calculator for psychopathology among
young people victimized during childhood: A population-representative cohort study” by Meehan et
al. (2020), published in Journal of Affective Disorders, 262, 90-98. It can be downloaded at
https://www.sciencedirect.com/science/article/pii/S0165032719314065
Please note that there is supplementary material available at the bottom of the webpage. You may
need it for some of the questions.
The assignment is to critically assess the paper and the described prediction model using the
guidelines of Steyerberg and Vergouwe (2014), “Towards better clinical prediction models: seven
steps for development and an ABCD for validation”.
https://academic.oup.com/eurheartj/article/35/29/1925/2293109
Steyerberg and Vergouwe’s abstract summarizes the seven steps:
“Clinical prediction models provide risk estimates for the presence of disease (diagnosis) or an event
in the future course of disease (prognosis) for individual patients. Although publications that present
and evaluate such models are becoming more frequent, the methodology is often suboptimal. We
propose that seven steps should be considered in developing prediction models: (i) consideration of
the research question and initial data inspection; (ii) coding of predictors; (iii) model specification;
PGCert Applied Statistical Learning and Health informatics: 7PAVPRMD 2019-20

Page 2 of 13

(iv) model estimation; (v) evaluation of model performance; (vi) internal validation; and (vii) model
presentation. The validity of a prediction model is ideally assessed in fully independent data, where
we propose four key measures to evaluate model performance: calibration-in-the-large, or the
model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical
usefulness, with decision-curve analysis (D).”
Questions
After reading both papers, please answer the following questions regarding the adequate
development and assessment of the prediction model. Please concentrate on the methodological
aspects of the paper; I do not expect you to fully understand all clinical aspects. Please write
concisely and do not add irrelevant text to your answers!

Step 1: Consideration of the research question and initial data inspection

1. Is the aim to develop a prediction model, to identify risk predictors, or both? (1 point)


2. What is the precise research question of the prediction modelling part of the study?
(1 point)


3. What is already known about the predictors? How did the authors select their predictors?
(3 points)


4. What kind of study design was selected? Did they recruit patients? (2 points)


5. How were patients selected? Do the authors provide enough information to replicate the
patient group? If not, do they point to where further information can be found?
(2 points)

6. Are there any potential problems with the selected patient group for developing a clinical risk
calculator? Did they acknowledge and discuss the potential problems? (2 points)


7. Were the predictors and outcomes measured without missing data? If not, how did they
handle missing data? (2 points)


8. What are the primary outcomes? Are the outcomes well defined? (4 points)


Step 2: Coding of predictors
9. Did the authors categorize any continuous predictors? If yes, please provide one example
(1 point)


10. Did the authors model interactions or non-linear relationships? If yes, what kind of
interactions or non-linear relationships? (1 point)


Step 3: Model specification and Step 4: Model estimation
11. Which statistical model did the authors use for each of the outcomes to build the prediction
models? (1 point)


12. Did they assess model assumptions? If yes, how? (2 points)


13. How did they avoid overfitting? (2 points)


14. How did they select predictors for the final model? (2 points)



Step 5: Model performance
15. How did the authors assess model performance? (6 points)
i. Did they assess discrimination? Yes/No. If yes, how? (Please describe briefly.)
ii. Did they assess calibration? Yes/No. If yes, how?
iii. Did they assess clinical usefulness? Yes/No. If yes, how?


16. Are the applied assessment methods adequate? (Did they choose the measures
recommended by Steyerberg and Vergouwe?) (1 point)


Step 6: Model validity
17. Did they assess internal and external validity? If yes, what kind? (1 point)


18. Did they perform any sensitivity analyses? If yes, what kind? Did the sensitivity analyses alter
their conclusions? Please provide one example.
(2 points)


Step 7: Model presentation
19. How is the final model presented? Would a reader be able to use the model if she had
measurements of the predictors? (2 points)



Discussion and critical review
20. Do they discuss limitations of the models? If yes, please state one. (2 points)


21. Do they interpret the model adequately or do they exaggerate in their conclusions? Why or
why not? (2 points)


22. Please list two things you liked about the methodology of the study. (4 points)


23. Please make two suggestions to improve the study (besides the limitations mentioned in the
abstract). You do not need to know how to implement your suggestions or whether they are
methodologically feasible. (4 points)

Part 2: Data analysis: Analysis of prognostic effects in
traumatic brain injury
In the second part of the assignment, you will analyse a data set to develop a prediction model.
Please note: The total score for this part of the assessment is 50 points.
Please answer all questions in this Word document and not within your R script!

Introduction: Predicting traumatic brain injury
A prognostic tool that predicts the outcome of a traumatic brain injury (TBI) at hospital
admission would support early clinical decision making. In this study, we aim to
develop a prognostic model to predict outcome after a traumatic brain injury using potential
predictors available at admission to the hospital. We analyse a dataset provided by the authors of
the article “Predicting outcome after traumatic brain injury: Development and internal validation of
prognostic scores based on admission characteristics” by Prof Ewout Steyerberg and colleagues,
published in PLoS Medicine 2008, Vol 5(8), pp 1251-1261.
Steyerberg provides part of the data on the webpage of his book “Clinical Prediction Models”:
http://www.clinicalpredictionmodels.org/doku.php?id=rcode_and_data:start
Predicting traumatic brain injury: Data set
The data are in SPSS format: “TBI.sav”. A description of the dataset is in the attachment.

The data set consists of 2159 patients from two trials (the International trial (#74 in the variable trial)
and the US Tirilazad trial (#75 in the variable trial); see http://www.tbi-impact.org/documents/design.pdf
for details) with 24 variables. The primary outcome was the Glasgow Outcome Scale (GOS) at 6 months
follow-up. The scale ranges from 1 to 5:
1 = dead
2 = vegetative
3 = severe disability
4 = moderate disability
5 = good recovery

The variable was also divided into two other clinical outcomes:
• Mortality at 6 months (yes or no) and
• Unfavourable outcome at 6 months (dead, vegetative or severe disability versus moderate disability
and good recovery)

In addition, 20 clinical, demographic and other characteristics of the patients were collected at
admission. The following table shows the variable names and labels of the 24 variables in the data
set. For more information about the variables, please see the main paper or Table 24.7 in Steyerberg’s
Clinical Prediction Models book. Three of the variables have a second, “truncated” version (hb and
hbt, sodium and sodiumt, glucose and glucoset). Both versions are almost identical and have the
same sample size, but the truncated version replaces implausible values with the lowest/highest
plausible value. Please use the truncated versions! Furthermore, the variable “pupil.i” is derived
from the variable d.pupil and should also not be used! There are, therefore, 16 predictor variables
left. Be aware that some are categorical!

trial Study identification (International=74, US=75)
d.gos GOS at 6 months <- outcome (do not use)
d.mort Mortality at 6 months <- outcome (do not use)
d.unfav Unfavourable outcome at 6 months <-Primary outcome
cause Cause of injury recoded
age Age in years
d.motor Admission motor score
d.pupil Pupillary reactivity
pupil.i Single imputed pupillary reactivity (do not use)
hypoxia Hypoxia before / at admission
hypotens Hypotension before / at admission
ctclass CT classification according to Marshall
tsah tSAH at CT
edh EDH at CT
cisterns Compressed cisterns at CT
shift Midline shift > 5 mm at CT
d.sysbpt Systolic blood pressure (truncated, mm Hg)
glucose Glucose at admission (mmol/l) (don’t use)
glucoset Truncated glucose values
ph pH
sodium Sodium (mmol/l) (don’t use)
sodiumt Truncated sodium
hb hb (g/dl) (don’t use)
hbt Truncated hb

Primary outcome of this exercise
Our primary outcome will be “Unfavourable outcome at 6 months” (d.unfav)!
Overall aim of the assignment
The aim of this assessment is to develop a validated prognostic model to predict unfavourable
outcome at 6 months. In the first part, you will be asked to compare the predictive performance of
several statistical learning algorithms introduced in this module (ridge, lasso, elastic net). Based on
the results, you will need to choose the one you would use for a final model. For this analysis, we
will first use a complete-case data set.
In the second part, you will develop a risk prediction model using the whole dataset including cases
with missing values. You will develop an internally validated model with the correct modelling of
missing data.
Important:
Please include answers to all questions in this Word document and not within your R script or R
Markdown! You need to submit annotated R scripts for all analyses, but I will only use the R script to
identify reasons for wrong answers and will not count any results/answers in the R script!
Please always use the seed 123 for regularized methods with cross-validation: set.seed(123)
Beginning of the assignment: Predicting traumatic brain injury.
Comparison of different prediction algorithms
We will compare three statistical/machine learning algorithms using complete cases only. The data
set consists of two trials; we will use one trial for model building (training data) and one for model
assessment (test data).
1.1. Importing data and data manipulation (Total 3 points)
Import the SPSS data set into R, pre-process the data, and remove any cases with missing data:
1. Columns 2, 3 and 4 are all outcome variables (see introduction). We will use the variable
d.unfav (unfavourable outcome: 1 = yes, 0 = no) as our main outcome. The other two
outcomes, d.gos (GOS at 6 months, column 2) and d.mort (Mortality at 6 months, column 3),
need to be removed. (0.5 points)
2. Remove the redundant predictors “hb”, “sodium”, “glucose” and “pupil.i” from the data set
(0.5 points).
3. “trial”, “d.unfav” (unfavourable outcome) and “cause” are categorical variables and need to
be formatted as factor variables (0.5 points). Treat d.pupil as continuous!
4. Remove cases with missing data (0.5 points).
5. Present below some default summary statistics for each remaining variable to demonstrate
that your code in steps 1-4 worked (1 point). A hedged code sketch of these steps is shown below.
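A minimal sketch of steps 1-5, assuming the haven package and that “TBI.sav” sits in the working
directory; the object names are illustrative, not prescribed:

```r
# Illustrative sketch of steps 1-5, not the marked solution.
# haven::read_sav() reads SPSS files (foreign::read.spss() would also work).
library(haven)

tbi <- as.data.frame(read_sav("TBI.sav"))

# 1. Keep d.unfav as the outcome; drop the other two outcome columns
tbi$d.gos  <- NULL
tbi$d.mort <- NULL

# 2. Drop the redundant and pre-imputed predictors
tbi[c("hb", "sodium", "glucose", "pupil.i")] <- NULL

# 3. Format the categorical variables as factors (d.pupil stays numeric)
tbi$trial   <- factor(tbi$trial)
tbi$d.unfav <- factor(tbi$d.unfav, levels = c(0, 1), labels = c("no", "yes"))
tbi$cause   <- factor(tbi$cause)

# 4. Keep complete cases only
tbi_cc <- na.omit(tbi)

# 5. Default summary statistics for every remaining variable
summary(tbi_cc)
```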
Table:



1.2. Model 1: Logistic regression (Total 17 points)
For our first model, we will use a “simple” logistic regression.
Task 1: Sample size calculations (7 points)
We want to fit a logistic regression with all 16 predictors (the redundant variable pupil.i and the
three untruncated variables have already been removed from the original data set): Is our sample
size large enough to avoid overfitting in our data analyses? Please read the paper by Riley et al. (2020)
on sample size for clinical prediction models and follow the authors’ guidance on how to calculate
the sample size required to develop a prediction model for a binary outcome. The R package
“pmsampsize” computes the minimum sample size required for the development of a new
multivariable prediction model using the criteria proposed by Riley et al. (2018, 2020).
This topic was not covered in the course and will “force” you to acquire new knowledge!
Please remember that a categorical variable with l levels has l - 1 parameters! E.g., the variable
“cause” has 5 levels, so you need to estimate 5 - 1 = 4 parameters. The software asks for the
“number of candidate predictor parameters for potential inclusion in the new prediction
model”. The intercept is already included in the algorithm and does not need to be
counted as a parameter!
Please perform a sample size calculation using the suggested three criteria for binary outcomes. Please
use a conservative estimate of 0.2 for Nagelkerke’s r-squared value and estimate the prevalence from the
relevant data set.
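A hedged sketch of the call, assuming a recent version of pmsampsize (older versions expect a
Cox-Snell “rsquared” argument rather than “nagrsquared”); the parameter count and prevalence
below are placeholders you must replace with values derived from your own data:

```r
# Illustrative sketch only: `parameters` and `prevalence` are placeholders.
library(pmsampsize)

pmsampsize(
  type        = "b",   # binary outcome
  nagrsquared = 0.2,   # conservative Nagelkerke R-squared (per the brief)
  parameters  = 20,    # placeholder: count your candidate predictor parameters
  prevalence  = 0.4    # placeholder: estimate from the relevant data set
)
```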
Questions:
Is the sample size sufficient for analysing a) the US data set and b) the total sample of both trials (use
the complete-case sample sizes)? Please report your power analyses in such a way that the reader
can replicate them. Please cite the reference and the R package (using R’s command
“citation()”) correctly!
What is the maximum number of parameters you can estimate with your sample size?
Summarize your sample size calculation:

a. List all three criteria (1.5 points) and the values used (1.5 points)
b. Report the sample sizes and conclusions for the two data sets (2 points)
c. Report the maximum number of parameters you can estimate with the sample sizes of the two
data sets (1 point)
d. The correct reference for the R package (0.5 points) and reference style (0.5 points).

Answers:

Riley, R. D., Ensor, J., Snell, K. I. E., Harrell, F. E., Martin, G. P., Reitsma, J. B., Moons, K. G. M., Collins,
G., & Van Smeden, M. (2020). Calculating the sample size required for developing a clinical
prediction model. BMJ, 368, m441. https://doi.org/10.1136/bmj.m441
(Optional: for more technical details please see:
• Riley, R. D., Snell, K. I. E., Ensor, J., Burke, D. L., Harrell, F. E., Jr., Moons, K. G., & Collins, G. S. (2018).
Minimum sample size for developing a multivariable prediction model: Part I – Continuous outcomes.
Statistics in Medicine. https://doi.org/10.1002/sim.7993
• Riley, R. D., Snell, K. I. E., Ensor, J., Burke, D. L., Harrell, F. E., Jr., Moons, K. G., & Collins, G. S. (2018).
Minimum sample size for developing a multivariable prediction model: Part II – Binary and time-to-event
outcomes. Statistics in Medicine. https://doi.org/10.1002/sim.7992)






Task 2: Prediction model using logistic regression (Total: 10 points)
Please use the US trial data set as a training data set to build our model, and assess external
validity using the international data set as an external test data set.
For the data analyses, use the glm function, caret for prediction accuracy measures and the library
pROC for AUC and ROC curves, and build a prediction model using a logistic regression to predict
unfavourable outcomes in patients with TBI. Do not forget to submit well-annotated R scripts of your
analyses, and always use the seed 123 for regularized methods with cross-validation: set.seed(123)

Please perform a logistic regression and present a table with the regression coefficients, standard
errors (SE), z values, p values and odds ratios with 95% confidence intervals (see lecture “Logistic
regression”). Please label the variables adequately. (4 points) A hedged code sketch follows the
table template below.
| Variable | Estimated regression coefficient | SE | z value | p value | Odds ratio (OR) | 2.5th %ile CI of OR | 97.5th %ile CI of OR |
| ….. | | | | | | | |
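A hedged sketch of the model fit, assuming the complete-case data from 1.1 has been split by trial
into illustrative objects train_us and test_int (these names are not prescribed):

```r
# Illustrative sketch, not the marked solution. `train_us` is assumed to
# hold the US-trial complete cases; drop the constant `trial` column
# before fitting.
fit <- glm(d.unfav ~ ., data = subset(train_us, select = -trial),
           family = binomial)

summary(fit)                               # coefficients, SEs, z and p values
exp(cbind(OR = coef(fit), confint(fit)))   # ORs with profile 95% CIs
```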

Next, use your model to predict the outcome probability for new cases and assess the external
validity using the international test data set.
Please present the area under the curve (AUC) value with 95% confidence intervals (95% C.I.) and the
other validity measures (see table) from the confusionMatrix function (from caret) and from pROC for
the external data set (5 points for analyses and presenting correct results).
Please classify persons with a probability of being a case greater than or equal to 0.5 as a case and
otherwise as a healthy control (the default in caret’s function). Please present the AUC value and the
95% confidence intervals of the AUC in the table, e.g. 0.71 (0.61 – 0.81). Remember to use
probabilities for the AUC estimation!
See the pROC documentation (e.g. ?ci.auc) for how to get 95% confidence intervals for the AUC in R!
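A hedged sketch of the external validation, reusing the illustrative fit and test_int objects from
the previous sketch:

```r
# Illustrative sketch of the external validation step.
library(caret)
library(pROC)

# Predicted probabilities for the external (international) test data
p_hat <- predict(fit, newdata = test_int, type = "response")

# Classify at the default 0.5 cut-off and tabulate the accuracy measures
pred_class <- factor(ifelse(p_hat >= 0.5, "yes", "no"),
                     levels = levels(test_int$d.unfav))
confusionMatrix(pred_class, test_int$d.unfav, positive = "yes")

# AUC with 95% CI: pass the probabilities, not the predicted classes
roc_ext <- roc(test_int$d.unfav, p_hat)
ci.auc(roc_ext)
```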
| Measure | Logistic regression |
| AUC value (+ 95% C.I.) | |
| Accuracy | |
| Sensitivity | |
| Specificity | |
| Positive predictive value | |
| Negative predictive value | |

A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point
system:
• .90-1 = excellent (A)
• .80-.90 = good (B)
• .70-.80 = fair (C)
• .60-.70 = poor (D)
• .50-.60 = fail (F)
From: http://gim.unmc.edu/dxtests/roc3.htm
Question: How good is the prediction accuracy based on the AUC for the external test dataset? (1
point)
Answer:


1.3. Models 2 and 3: Regularized logistic regression (Total: 15 points)

Task 3: Prediction model using regularized logistic regression
Can we improve our model by using a statistical learning algorithm, using caret and the code
provided in the practical instead of glm, with the same predictors? Please use repeated CV for
model tuning.
As a statistical learning algorithm, we choose lasso regularized regression with i) the minimum-lambda
or ii) the lambda + 1 SE tuning parameter.

Lasso logistic regression
1. Please develop a lasso logistic regression using the US data set for model building and the
international data set for external validation. Please report the same accuracy measures and
the AUC for both the minimum-lambda and the 1 SE-lambda model. (Total 8 points)
2. In addition, please report the two final models as formulas and report how many variables
are selected by the two lasso regression models. (2 points)
Do not forget to submit well-annotated R scripts of your analyses, and always use the seed 123 for
regularized methods with cross-validation: set.seed(123). A hedged code sketch follows.
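A hedged sketch of the tuning set-up, assuming the illustrative train_us object from Task 2; the
fold counts and the lambda grid are placeholders, and caret’s selectionFunction = "oneSE" is one
way to pick the 1 SE model:

```r
# Illustrative sketch only; grid values and CV settings are placeholders.
library(caret)
library(glmnet)

grid <- expand.grid(alpha  = 1,                        # alpha = 1 -> lasso
                    lambda = 10^seq(-4, 0, length = 50))

# Minimum-lambda model: the lambda with the best CV performance
set.seed(123)
fit_min <- train(d.unfav ~ ., data = subset(train_us, select = -trial),
                 method = "glmnet", tuneGrid = grid,
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 5))

# 1 SE model: the simplest model within one SE of the best CV performance
set.seed(123)
fit_1se <- train(d.unfav ~ ., data = subset(train_us, select = -trial),
                 method = "glmnet", tuneGrid = grid,
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 5,
                                          selectionFunction = "oneSE"))

# Non-zero coefficients show which variables the lasso kept
coef(fit_min$finalModel, s = fit_min$bestTune$lambda)
```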

Answers:

| Measure | Logistic regression (1.1) | Lasso (min lambda) (1.3) | Lasso (lambda + 1 SE) (1.3) |
| | External | External | External |
| AUC (95% C.I.) | | | |
| Accuracy | | | |
| Sensitivity | | | |
| Specificity | | | |
| Positive predictive value | | | |
| Negative predictive value | | | |
(P.S.: the column “Logistic regression (1.1)” reports the same results as the previous table in Task 2.)






Question: Compare the three models (the models from 1.1 and 1.3) with respect to prediction
accuracy and the number of variables selected for the final model. Which of the three models would
you recommend to a clinician? Please explain and give two reasons. If you choose a lasso model,
please state which of the two (minimum lambda or lambda + 1 SE).
Note: There is no “right” answer. (2 points)
Answer:



Suggestions for improvement
(Total 3 points)
Please make three suggestions to improve the study. You do not need to know how to implement
your suggestions.
1)
2)
3)

1.4. Conference abstract (Total 15 points)
You want to present your results at a conference and need to submit an abstract with a maximum of
500 words. Use the results from 1.1 and 1.3. Do not expect the conference participants to know
all technical details. Please write an abstract with the following sections:
• Title (max 12 words)
• Up to 5 Keywords (minimum 3)
• Background or Introduction (including “aim”)
• Material & Methods
• Results
• Conclusion
• Limitations
(Marking of abstract, see below)

Your abstract

Marking of abstract:

1. Does the abstract title describe the subject being written about? (1 point)
2. Does the abstract have between 2 and 5 keywords that closely reflect the content of the paper?
(0.5 points)
3. Does the abstract have all 7 sections? (1 point)
4. Is the abstract well written in terms of language, grammar, etc.? (1.5 points)
5. Overall: Does the abstract engage readers by telling them what the paper is about and
why it should be accepted for the conference? (1.5 points)
6. Background and aim: Does the abstract make a clear statement of the topic of the paper and the
research question/aim? (1.5 points)
7. Method: Does the abstract describe the main methodology well (1 point)? Are the
data sets well described? (1 point)
8. Results: Does the abstract give a concise summary of the findings? (1.5 points)
9. Are the conclusions convincing and answer the aim of the study? (1.5 points)
10. Does the abstract indicate the value of the findings and/or to whom will they be of use? (1 point)
11. Does the abstract describe at least one limitation? (1 point)
12. Does the abstract conform to the word limit of 500 words? (1 point)

