MATH5826: Statistical Methods in Epidemiology
Term 1, 2021
Assignment 1
Submission deadline: Wednesday 17 February, 10:05am
Deliverables: One R Markdown file for the entire assignment, with a file name of the form
“LastName FirstName - z1234567 - Ass#.Rmd”. Your Rmd file should produce a PDF file
(use the option output: pdf_document), make no references to the file structure on
your computer, and contain no commands that save output externally. A template
can be found on Moodle, and more detailed instructions can be found in Lecture 1.
Assignment length: The PDF produced by your Rmd file is limited to 5 pages at 12pt font
size. Pages beyond this limit, and submissions that use a smaller font size, will not be marked.
If you are over the page limit, be judicious about what R code/output is printed and consider
reducing figure sizes (figures do not need to be large but must be legible).
Submission: Upload your R Markdown file to Moodle and include the Plagiarism Statement
given below (copy-and-paste it).
Penalties: Failure to adhere to instructions will result in a minimum 5% mark reduction.
Name: Student Number:
I declare that this assessment item is my own work, except where acknowledged,
and has not been submitted for academic credit elsewhere, and acknowledge that
the assessor of this item may, for the purpose of assessing this item:
Reproduce this assessment item and provide a copy to another member of the
University; and/or,
Communicate a copy of this assessment item to a plagiarism checking service
(which may then retain a copy of the assessment item on its database for the
purpose of future plagiarism checking).
I certify that I have read and understood the University Rules in respect of Student
Academic Misconduct.
Signed: Date:
1. The dataset incdata.csv is a hypothetical dataset containing data for 100 individuals
in a study of a disease. The file has no header row; its two columns are age at entry
to the study and age at onset of the disease. Assume that complete follow-up is
available, so there are no withdrawals. Also assume that the disease is chronic, so
individuals who have the disease are no longer at risk, and that individuals are not
at risk until study entry.
(a) For the first ten observations, plot the data in the format shown on slide 16 of
Lecture 2 but with participant age on the x-axis instead of time in study.
(b) How many individuals were at risk of contracting the disease at age 50?
(c) What was the total number of new cases in the age range 50-60?
(d) How many of the new cases in the age range 50-60 occurred among individuals who
were at risk at age 50?
(e) Calculate the incidence proportion for the disease between ages 50 and 60.
(f) Calculate the total exposure time in the age range 50-60.
(g) Calculate the incidence rate for the disease for the age interval 50-60.
(h) Write a function to calculate the incidence proportion and incidence rate for any
age range, given data in this format. Try it out on the age range 60-70 for these
data.
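Hint: one possible shape for the function in part (h) is sketched below. It is an illustration under stated assumptions, not a model answer. The function name incidence_summary, the toy vectors, and the boundary conventions are all assumptions: an individual is treated as at risk at age lo if they entered at or before lo and had onset after lo, a case is counted in the range if onset falls in (lo, hi], and a missing onset age is read as "never developed the disease". The incidence proportion is taken as cases among those at risk at lo divided by the number at risk at lo, and the incidence rate as all new cases in the range divided by total exposure time in the range. Check these conventions against the lecture definitions before relying on them.

```r
# Sketch only: the function name, toy data and boundary conventions are assumptions.
incidence_summary <- function(entry, onset, lo, hi) {
  # Assumption: a missing onset age means the disease never occurred.
  onset[is.na(onset)] <- Inf

  # At risk at age lo: already in the study and still disease-free.
  at_risk_lo <- entry <= lo & onset > lo

  # All new cases in (lo, hi], including people who entered during the range.
  new_cases <- onset > lo & onset <= hi & onset > entry

  # New cases among those who were at risk at age lo.
  cases_from_risk_set <- at_risk_lo & onset <= hi

  # Exposure time within the range: from max(entry, lo) to min(onset, hi),
  # floored at zero for people who contribute no time in the range.
  exposure <- pmax(0, pmin(onset, hi) - pmax(entry, lo))

  list(
    at_risk     = sum(at_risk_lo),
    new_cases   = sum(new_cases),
    proportion  = sum(cases_from_risk_set) / sum(at_risk_lo),
    person_time = sum(exposure),
    rate        = sum(new_cases) / sum(exposure)
  )
}

# Invented toy data (NOT the contents of incdata.csv).
# The real file could be read with something like:
#   inc <- read.csv("incdata.csv", header = FALSE, col.names = c("entry", "onset"))
entry <- c(40, 45, 52, 55, 48, 62)
onset <- c(58, 70, 57, NA, 49, 65)

res <- incidence_summary(entry, onset, lo = 50, hi = 60)
print(res)
```

Returning a list keeps the intermediate counts from parts (b) to (f) visible alongside the two summary measures, which makes it easier to check the function against the hand calculations.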
2. The spreadsheet popnrateinfo.csv contains information on the US population by age
and the US cancer death rates for the years 1950, 1977 and 2004. The information was
obtained from different sources and so is in different formats.
(a) Use this information, and age ranges 0-4, 5-14, ..., 75-84, 85+, to calculate:
i. The directly standardized rates for 1950 and 1977, using the 2004 US population
as standard.
ii. The standardized mortality ratios (SMRs) for 1950 and 1977, using the 2004
rates as standard.
(b) Provide 95% confidence intervals for the directly standardized rates and for
the SMRs.
(c) Compare the three years 1950, 1977 and 2004 on the basis of the crude rates,
directly standardized rates, and SMRs.
(d) Compare the age-specific rates for the three years graphically.
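Hint: the calculations in parts (a) and (b) can be sketched on invented numbers as follows. This is a minimal illustration, not a model answer. The three age groups, all populations and rates, and the variable names are made up and are not taken from popnrateinfo.csv. Rates are assumed to be per person, deaths are assumed to be Poisson, and the confidence intervals use common large-sample approximations (a normal interval for the directly standardized rate and a log-scale interval for the SMR); the lectures may prescribe different intervals.

```r
# Toy inputs, invented for illustration (three age groups only).
pop_study  <- c(1000, 2000, 1000)     # study-year population by age group
rate_study <- c(0.002, 0.005, 0.030)  # study-year age-specific death rates (per person)
pop_std    <- c(3000, 4000, 3000)     # standard population by age group
rate_std   <- c(0.001, 0.004, 0.020)  # standard age-specific death rates (per person)

z <- qnorm(0.975)  # normal quantile for a 95% interval

# Crude rate: total deaths over total population in the study year.
crude <- sum(rate_study * pop_study) / sum(pop_study)

# Direct standardization: weight study rates by the standard age distribution.
w      <- pop_std / sum(pop_std)
dsr    <- sum(w * rate_study)
# Poisson-based standard error: Var(rate_i) is approximated by rate_i / pop_i.
dsr_se <- sqrt(sum(w^2 * rate_study / pop_study))
dsr_ci <- dsr + c(-1, 1) * z * dsr_se

# Indirect standardization: observed deaths over deaths expected
# if the standard rates applied to the study population.
observed <- sum(rate_study * pop_study)
expected <- sum(rate_std * pop_study)
smr      <- observed / expected
# Approximate interval on the log scale: SE(log SMR) is about 1 / sqrt(observed).
smr_ci   <- smr * exp(c(-1, 1) * z / sqrt(observed))

print(c(crude = crude, dsr = dsr, smr = smr))
print(rbind(dsr_ci, smr_ci))
```

With the real data, the population counts and rates first need to be brought onto the same age groups and the same rate scale before the vectors above are formed.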