MATH5826 Statistical Methods in Epidemiology
Lecture 2: Measures of Disease Occurrence
Jake Olivier
Term 1, 2021
Prevalence and Incidence
Delta Method
Estimation and inference for event rates
Standardized Rates
Epidemiology originated as the study of epidemics, yet the current focus is much broader
including the study of chronic and acute disease, mental health and injury
The primary focus in epidemiology is often on the relationship (or association) between
an exposure E and disease D
For example, an epidemiologist may observe that higher levels of exposure to lead are
associated with an increase in the incidence of deficient brain development, compared
with little exposure to lead
Prior to making comparisons for, say, different levels of exposure, we must first introduce
measures of disease occurrence. These will be used later when we introduce measures of
association
Disease Occurrence
There are many measures of disease occurrence and the choice of which one to use
depends on the study design, the population under study and the available data
Some issues to consider are:
• threats to study validity – this could be due to bias or limitations that arise due to
the study design or method of data collection
• extraneous factors – the analysis needs to account for them to untangle differing
effects of exposure on disease
Disease Occurrence
Broadly speaking, disease occurrence can be quantified as either a ratio, proportion, rate
or odds
Each of these is a measure obtained by dividing one quantity by another (i.e.,
M = a/b)
The numerator and denominator of a ratio are separate quantities: one does not contain
the other. For example, the ratio
(number of doctors working at a hospital) / (number of beds in the hospital)
is composed of separate quantities and can assist in determining resource allocation at
a hospital
Disease Occurrence: Proportions
For proportions, the denominator quantity (b) contains the numerator quantity (a).
That is, the denominator can be written as b = a + a′ where a′ = b − a.
A proportion is a number between 0 and 1, and proportions are often interpreted as
probabilities in epidemiology.
For example, the number of deaths in a specified time interval out of the number alive
at the start of the interval is a proportion, and can be regarded as an estimate of the
probability of dying in the interval.
Disease Occurrence: Rates
A rate is a measure of change in a quantity per unit of another quantity.
Mortality and disease rates are almost always expressed per unit of time which might be
calendar time, age, or follow-up time.
Lowres et al.¹ screened n = 1000 patients 65 years and older for atrial fibrillation (AF),
of which there were 15 new AF cases. The sum of the ages of those screened was
t = 76302 years.
The estimated proportion of new AF diagnoses was 15/1000 = 1.5%, and the rate was
15/76302 ≈ 0.0002 per person-year, or 1 new AF case per 5086.8 years lived.
¹ Feasibility and cost-effectiveness of stroke prevention through community screening for atrial fibrillation
using iPhone ECG in pharmacies. Thrombosis & Haemostasis 2014;111:1167-76.
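The distinction between the proportion and the rate in the AF screening example can be checked with a few lines of Python. This is only a sketch using the three figures quoted on the slide; the variable names are illustrative.

```python
# Figures quoted in the AF screening example
new_cases = 15          # new AF diagnoses
n_screened = 1000       # patients screened
person_years = 76302    # sum of ages (years lived) of those screened

proportion = new_cases / n_screened          # unitless, between 0 and 1
rate = new_cases / person_years              # new cases per person-year
years_per_case = person_years / new_cases    # person-years lived per new case

print(f"proportion = {proportion:.1%}")
print(f"rate = {rate * 100_000:.1f} per 100,000 person-years")
print(f"one new case per {years_per_case:.1f} person-years")
```

The proportion has no units, whereas the rate carries units of inverse time, which is why it can be rescaled to any convenient person-time base.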
Disease Occurrence: Rates
From differential calculus, the instantaneous rate of change in a quantity y is dy/dx. In
epidemiology, rates are commonly expressed relative to the current size of the quantity
So, the rate r for the change in population y at time t relative to y(t), the population
size at time t, is
$$ r = \frac{dy/dt}{y(t)} = \frac{dy}{y(t)\,dt} $$
For example, if the population is diminishing only due to deaths, then r is a mortality
rate, i.e., the number of deaths per person per unit of time.
Disease Occurrence: Rates
Populations change over time and rates can be estimated over a time interval
The average rate in the interval (t, t + ∆t) is
$$ \bar{r} = \frac{\Delta y}{\int_{t}^{t+\Delta t} y(u)\,du} $$
where ∆y is the change in y from t to t + ∆t.
For example, if y(x) = ℓ_x is the population at risk at age x, then ∆y = ℓ_x − ℓ_{x+∆x} = d_x, the
number of deaths between ages x and x + ∆x, and the average rate is
$$ \frac{d_x}{\int_{x}^{x+\Delta x} \ell_u\,du} $$
the number of deaths per total time at risk.
Disease Occurrence: Odds
The odds of occurrence of an event A is the probability of occurrence relative to the
probability of non-occurrence
$$ \mathrm{odds}(A) = \frac{P(A)}{1 - P(A)} $$
The odds can also be computed as the ratio of the frequency of occurrence to the frequency
of non-occurrence of an event
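As a small illustration of the two equivalent ways of computing odds, the sketch below uses hypothetical counts (20 occurrences among 100 people); the function names are illustrative and not part of the lecture.

```python
def odds_from_probability(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_from_counts(occurrences: int, non_occurrences: int) -> float:
    """Odds as the frequency of occurrence relative to non-occurrence."""
    if non_occurrences <= 0:
        raise ValueError("non_occurrences must be positive")
    return occurrences / non_occurrences


# Hypothetical data: 20 of 100 people experience the event
odds_p = odds_from_probability(20 / 100)
odds_c = odds_from_counts(20, 80)
print(odds_p, odds_c)
```

Both routes give the same value because the common denominator (100 here) cancels in the ratio.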
Prevalence and Incidence
The number of disease cases in a population can be measured in several ways. Methods
may differ based on the time frame and whether new or existing cases are counted.
Prevalence is the number of existing cases of disease in a population at a point in time.
Two types of prevalence are:
• Point prevalence proportion: the proportion of a population with the disease at a
specified point in time
• Period prevalence proportion: the proportion of a population with the disease over
a specified period of time
For example, a researcher may be interested in the prevalence of active COVID-19 cases.
On 25 Jan 2021 in Australia, the point prevalence was 135/25,744,519, while the period
prevalence for 2020 was 28,381/25,499,884.²
² Population estimates at 25/1/2021 & 30/6/2020
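Prevalence proportions this small are easier to read per 100,000 population. The following sketch simply rescales the two COVID-19 figures quoted above; the choice of 100,000 as the base is an assumption made for readability.

```python
# Figures quoted in the COVID-19 example
active_cases_25_jan_2021 = 135
population_25_jan_2021 = 25_744_519
cases_during_2020 = 28_381
population_30_jun_2020 = 25_499_884

point_prevalence = active_cases_25_jan_2021 / population_25_jan_2021
period_prevalence = cases_during_2020 / population_30_jun_2020

PER = 100_000  # reporting base (an illustrative choice)
print(f"point prevalence:  {point_prevalence * PER:.2f} per {PER:,}")
print(f"period prevalence: {period_prevalence * PER:.1f} per {PER:,}")
```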
Incidence is the number of new cases of disease occurring in a specified time period in
a population.
Two types of incidence are:
• Incidence proportion: the number of new cases of disease over a period of time,
divided by the number of people at risk for the disease at the start of the period.
• Incidence rate: the number of new cases of disease over a period of time, divided by
the total time at risk (for all individuals in the population) over the period.
Prevalence and Incidence
The numerator of the point prevalence proportion includes all those who have the
disease at that date, regardless of when the disease was contracted.
So, diseases of long duration tend to have higher prevalence than those of short
duration, even if the incidence is similar.
If the incidence and the average duration of a disease are constant over time, then the
prevalence P is approximately (when prevalence is low)
P = I × D
where I is the incidence rate and D is average duration.
Epidemiologist’s Bathtub³
The epidemiologist’s bathtub can be useful in understanding the difference between
prevalence and incidence
The existing water level represents prevalence while the amount of new water flowing
into the tub is the incidence
Note the water exiting the tub can be either mortality or recovery
³ Source: https://www.publichealth.hscni.net/node/5277
The following figure and table present some hypothetical data on a population of five
individuals observed from t = 0 to t = 5. The line segments represent time alive,
circles represent deaths, and crosses represent occurrences of disease. Note the
disease is chronic in the sense that no recovery is possible.
[Figure: follow-up timelines for the five individuals; horizontal axis is time t from 0 to 5]
Measure                                     Calculation              Value
Point prevalence proportion at t = 0        0/5                      0
Point prevalence proportion at t = 5        1/2                      0.5
Incidence proportion from t = 0 to t = 5    3/5                      0.6
Incidence rate from t = 0 to t = 5          3/(5 + 1 + 4 + 3 + 1)    0.21
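The calculations for the five-person example can be reproduced as follows. This is a sketch that takes the counts and the times at risk directly from the table above, since the individual timelines themselves are only shown in the figure.

```python
# Time at risk for the five individuals, taken from the denominator quoted above
time_at_risk = [5, 1, 4, 3, 1]
new_cases = 3               # disease onsets between t = 0 and t = 5
at_risk_at_start = 5        # all five are disease-free at t = 0
alive_at_end = 2            # individuals still alive at t = 5
diseased_alive_at_end = 1   # of those alive at t = 5, number with the disease

point_prevalence_end = diseased_alive_at_end / alive_at_end
incidence_proportion = new_cases / at_risk_at_start
incidence_rate = new_cases / sum(time_at_risk)   # cases per unit of person-time

print(point_prevalence_end, incidence_proportion, round(incidence_rate, 2))
```

The incidence proportion divides by people, while the incidence rate divides by accumulated person-time, so only the latter reflects how long each individual was actually at risk.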
Delta Method
Review of the Delta Method
It is often useful to derive the asymptotic variances of measures of disease occurrence
and association, and this section briefly reviews the delta method for obtaining the
covariance matrix of a transformation of a parameter vector.
Let θˆ be a p × 1 vector of parameter estimates with covariance matrix var(θˆ), and let
ϕ = g(θ) be a transformation of θ to a q × 1 parameter vector. The first-order Taylor
series expansion of g(θˆ) about θ is
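$$ g(\hat{\theta}) \approx g(\theta) + \frac{\partial g(\theta)}{\partial \theta}\,(\hat{\theta} - \theta) $$
where ∂g(θ)/∂θ is the q × p Jacobian matrix of g whose (i, j) element is ∂g_i(θ)/∂θ_j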
Review of the Delta Method
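Taking the variance of both sides of the equation
$$ \mathrm{var}(\hat{\phi}) = \mathrm{var}(g(\hat{\theta})) \approx \frac{\partial g(\theta)}{\partial \theta}\,\mathrm{var}(\hat{\theta})\left(\frac{\partial g(\theta)}{\partial \theta}\right)^{T} $$
Evaluating the derivatives at the estimated value θ = θˆ gives the estimated covariance
matrix
$$ \widehat{\mathrm{var}}(\hat{\phi}) = \widehat{\mathrm{var}}(g(\hat{\theta})) \approx \frac{\partial g(\hat{\theta})}{\partial \theta}\,\widehat{\mathrm{var}}(\hat{\theta})\left(\frac{\partial g(\hat{\theta})}{\partial \theta}\right)^{T} $$
For the univariate case, i.e., p = q = 1,
$$ \widehat{\mathrm{var}}(\hat{\phi}) \approx \left(\frac{dg(\hat{\theta})}{d\theta}\right)^{2}\widehat{\mathrm{var}}(\hat{\theta}) $$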
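The delta method is straightforward to apply numerically. The sketch below is one possible implementation, not the lecture's own: it approximates the Jacobian by central differences and forms the product Jacobian × covariance × Jacobian-transpose using plain Python lists. The function names and the numbers in the illustration are hypothetical.

```python
import math


def numerical_jacobian(g, theta, eps=1e-6):
    """Central-difference Jacobian of g at theta (rows: outputs, columns: inputs)."""
    q, p = len(g(theta)), len(theta)
    jac = [[0.0] * p for _ in range(q)]
    for j in range(p):
        step = eps * max(1.0, abs(theta[j]))
        up, down = list(theta), list(theta)
        up[j] += step
        down[j] -= step
        g_up, g_down = g(up), g(down)
        for i in range(q):
            jac[i][j] = (g_up[i] - g_down[i]) / (2 * step)
    return jac


def delta_method_cov(g, theta_hat, cov):
    """Approximate covariance of g(theta_hat) as J cov J^T."""
    jac = numerical_jacobian(g, theta_hat)
    q, p = len(jac), len(theta_hat)
    j_cov = [[sum(jac[i][k] * cov[k][j] for k in range(p)) for j in range(p)]
             for i in range(q)]
    return [[sum(j_cov[i][k] * jac[j][k] for k in range(p)) for j in range(q)]
            for i in range(q)]


# Hypothetical univariate illustration: g = log, estimate 2.0 with variance 0.04.
# The univariate formula gives (1/2.0)**2 * 0.04 = 0.01.
log_var = delta_method_cov(lambda th: [math.log(th[0])], [2.0], [[0.04]])[0][0]
print(log_var)
```

When the derivative of g is available in closed form it is preferable to use it directly; the numerical Jacobian is only a convenience for more complicated transformations.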

Estimation and inference for event rates
In cohort studies, we are often interested in the risk of (possibly recurrent) events during
some period of exposure.
Different individuals may be at risk of the event for different exposure periods, but the
overall event frequency is usually summarised as a rate. That is, the number of events
occurring in a specified time period divided by the sum of the individual exposure periods
Poisson Process Model
We assume that the sequence of events and the aggregate count for each individual is
generated by a Poisson process where λ(s) is the instantaneous rate of events at time s
Let N(s) denote the counting process at time s, which is the cumulative number of
events in the period (0, s]. If the random variable D represents the number of events in
(s, s + t], then the probability of d events in this interval is
$$ P(D = d) = P(N(s+t) - N(s) = d) = \frac{e^{-\Lambda_{(s,s+t)}}\,\Lambda_{(s,s+t)}^{d}}{d!} $$
where s, t > 0 and the rate parameter
$$ \Lambda_{(s,s+t)} = \int_{s}^{s+t} \lambda(u)\,du $$
is the cumulative intensity over the interval
Poisson Process Model
If we assume constant intensity over the time period of interest, i.e., λ(u) = λ for all u,
then for d ≥ 0, t > 0 we have a homogeneous Poisson process and the number of events
D in an interval of length t is distributed as a Poisson with parameter λt
$$ P(D = d;\, t, \lambda) = \frac{e^{-\lambda t}(\lambda t)^{d}}{d!} $$
If we further assume that all individuals in the population experience the same constant
intensity, then we have a doubly homogeneous Poisson process and the likelihood
function for a sample of n independent observations is
$$ L(\lambda) = \prod_{j=1}^{n} \frac{e^{-\lambda t_j}(\lambda t_j)^{d_j}}{d_j!} $$
where t_j and d_j are the exposure time and number of events, respectively, for individual j.
This likelihood conditions on the exposure times t1, . . . , tn (i.e., tj are fixed constants)
Poisson Process Model
The log-likelihood is
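$$ \log L(\lambda) = -\lambda\sum_{j=1}^{n} t_j + \sum_{j=1}^{n} d_j\log(\lambda) + \sum_{j=1}^{n} d_j\log(t_j) - \sum_{j=1}^{n}\log(d_j!) $$
and the score equation is
$$ \frac{\partial \log L(\lambda)}{\partial \lambda} = -\sum_{j=1}^{n} t_j + \frac{\sum_{j=1}^{n} d_j}{\lambda} $$
Equating the score to 0 and solving for λ gives the maximum likelihood estimator
$$ \hat{\lambda} = \frac{\sum_{j=1}^{n} d_j}{\sum_{j=1}^{n} t_j} = \frac{d}{T} $$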
where d and T are the total number of events and exposure, respectively
Poisson Process Model
The previous ML estimator is known as the (crude) event rate. It can also be
equivalently expressed as a weighted mean rate
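$$ \hat{\lambda} = \frac{\sum_{j=1}^{n} d_j}{\sum_{j=1}^{n} t_j} = \frac{\sum_{j=1}^{n} t_j\,(d_j/t_j)}{\sum_{j=1}^{n} t_j} = \frac{\sum_{j=1}^{n} t_j r_j}{\sum_{j=1}^{n} t_j} $$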
where rj = dj/tj is the event rate for person j
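A short numerical check can make the two expressions for the crude rate concrete. The data below are hypothetical (five individuals with made-up exposure times and event counts), and the log-likelihood function is a direct transcription of the doubly homogeneous Poisson model under that assumption.

```python
import math

# Hypothetical data (not from the lecture): exposure time t_j and event count d_j
exposure = [2.0, 5.0, 1.5, 4.0, 3.5]
events = [0, 2, 1, 0, 1]


def log_likelihood(lam, exposure, events):
    """Poisson log-likelihood for a common constant rate lam."""
    return sum(
        -lam * t + d * math.log(lam * t) - math.lgamma(d + 1)  # lgamma(d+1) = log(d!)
        for t, d in zip(exposure, events)
    )


d_total = sum(events)
T_total = sum(exposure)
crude_rate = d_total / T_total   # maximum likelihood estimate d/T

# Same quantity written as an exposure-weighted mean of person-level rates
person_rates = [d / t for d, t in zip(events, exposure)]
weighted_mean = sum(t * r for t, r in zip(exposure, person_rates)) / T_total

print(crude_rate, weighted_mean)
print(log_likelihood(crude_rate, exposure, events))
```

Evaluating `log_likelihood` slightly above and below `crude_rate` gives smaller values, which is a quick sanity check that d/T is the maximiser.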
Poisson Process Model
The variance of the rate can be obtained as the inverse of the observed or expected
Fisher information. The observed information is
$$ J(\lambda) = -\frac{\partial^{2} \log L(\lambda)}{\partial \lambda^{2}} = \frac{d}{\lambda^{2}} $$
and the expected information is I(λ) = E(D)/λ², where D now represents the total
number of events in the sample.
Therefore, the variance estimator based on the observed information is λ²/d, and based
on the expected information it is λ²/E(D) = λ/T.
This follows because D is a Poisson random variable with E (D) = λT (exercise). Thus
for a given λ, the variance is inversely proportional to the total exposure, and also to the
(expected) number of events.
Poisson Process Model
The variance of the rate can be estimated as
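$$ \widehat{\mathrm{var}}(\hat{\lambda}) = \frac{\hat{\lambda}}{T} = \frac{d}{T^{2}} = \frac{\hat{\lambda}^{2}}{d} $$
Large sample hypothesis tests and confidence intervals can be based on this result. For
example, a 95% confidence interval for λ would be
$$ \hat{\lambda} \pm 1.96\,\mathrm{se}(\hat{\lambda}) = \hat{\lambda} \pm 1.96\,\hat{\lambda}/\sqrt{d} $$
Alternatively, since λ is positive, a more accurate approximation may be obtained by
calculating confidence intervals for log(λ) using the delta method and transforming back.
This method gives var(log λˆ) ≈ 1/d, so a 95% confidence interval for λ is
$$ \left(\hat{\lambda}\,e^{-1.96/\sqrt{d}},\ \hat{\lambda}\,e^{+1.96/\sqrt{d}}\right) $$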
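For completeness, the step from the variance of the rate to the variance of its logarithm can be written out. This is a sketch of the univariate delta method with g(λ) = log(λ), using the variance estimate λ̂²/d from above.

```latex
\begin{align*}
g(\lambda) &= \log(\lambda), \qquad \frac{dg(\lambda)}{d\lambda} = \frac{1}{\lambda} \\
\widehat{\mathrm{var}}(\log\hat{\lambda})
  &\approx \left(\frac{1}{\hat{\lambda}}\right)^{2}\widehat{\mathrm{var}}(\hat{\lambda})
   = \frac{1}{\hat{\lambda}^{2}}\cdot\frac{\hat{\lambda}^{2}}{d}
   = \frac{1}{d} \\
\log\hat{\lambda} \pm \frac{1.96}{\sqrt{d}}
  &\;\Longrightarrow\;
  \left(\hat{\lambda}\,e^{-1.96/\sqrt{d}},\ \hat{\lambda}\,e^{+1.96/\sqrt{d}}\right)
\end{align*}
```

The interval on the log scale is symmetric, so after exponentiating it is asymmetric about the estimated rate and can never include negative values.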

According to the New York State Cancer Registry, there were 524 cancer deaths
amongst males aged 45-49 in New York State in 2000. The corresponding mid-year
population for this age group is 649,533, and we take this to be an approximation to
T , the total number of years of exposure for year 2000. The estimated mortality rate
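is then λˆ = d/T = 524/649,533 = 0.000807, or about 80.7 per 100,000 per year.
A 95% confidence interval for λ (per 100,000 per year), assuming a normal distribution for λˆ, is
$$ \hat{\lambda} \pm 1.96\,\hat{\lambda}/\sqrt{d} = 80.7 \pm 1.96 \times 80.7/\sqrt{524} = (73.8,\ 87.6) $$
Using a log transformation, a 95% confidence interval for λ is
$$ \left(\hat{\lambda}\,e^{-1.96/\sqrt{d}},\ \hat{\lambda}\,e^{+1.96/\sqrt{d}}\right) = \left(80.7\,e^{-1.96/\sqrt{524}},\ 80.7\,e^{+1.96/\sqrt{524}}\right) = (74.1,\ 87.9) $$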
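Both intervals in the New York cancer mortality example can be reproduced with the standard library. The sketch below uses only the two figures quoted above (d = 524 and T ≈ 649,533); the variable names are illustrative.

```python
import math

d = 524          # cancer deaths, males aged 45-49, New York State, 2000
T = 649_533      # mid-year population, used as an approximation to person-years
z = 1.96         # normal quantile for a 95% interval
PER = 100_000    # report rates per 100,000 per year

rate = d / T
se = rate / math.sqrt(d)          # from var(rate) = rate^2 / d

# Interval on the original scale
wald_ci = (rate - z * se, rate + z * se)

# Interval on the log scale, transformed back: se(log rate) = 1 / sqrt(d)
factor = math.exp(z / math.sqrt(d))
log_ci = (rate / factor, rate * factor)

print(f"rate    = {rate * PER:.1f}")
print(f"Wald CI = ({wald_ci[0] * PER:.1f}, {wald_ci[1] * PER:.1f})")
print(f"log CI  = ({log_ci[0] * PER:.1f}, {log_ci[1] * PER:.1f})")
```

With 524 events the two intervals are close; the difference between them grows as the number of events gets smaller.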
Standardized Rates
Age-specific rates and crude rates
Since mortality and disease rates usually show considerable variation by age, we are
often interested in looking at a series of rates, one for each age or age group. These are
termed age-specific rates.
• In the above example, the age-specific rate for the 45-49 year age group is 80.7 per
100,000 per year.
• The age-specific rates for this population vary considerably, from 2.2 per 100,000
per year in the 5-9 and 10-14 year age groups, to 2585.5 in the 85+ age group.
• For similar reasons, it is also useful to estimate rates separately for males and
females.
Age-specific rates and crude rates
The crude rate is the total number of deaths (in all age groups), divided by the total
exposure time.
Crude rates reflect not only the level of risk in the population, but also the age
distribution. So, comparison of crude rates amongst different populations can be
confounded by age. For example, a higher crude rate might simply reflect an older
population distribution, rather than a real difference in risk.
Valid comparisons can be made by comparing series of age-specific rates, but sometimes
a single summary measure is required that allows comparison between populations. Such
a summary measure is referred to as an age-adjusted or standardized rate.
Direct Standardization
Let λˆk = dk/tk be the age-specific rate for age group k = 1, . . . ,K , where dk is the
number of events and tk is the total exposure time for age group k.
Then, similar to the weighted mean rate, the crude rate can be written as a weighted
average of the age-specific death rates
$$ \hat{\lambda}_{c} = \frac{\sum_{k} t_k\hat{\lambda}_k}{\sum_{k} t_k} = \sum_{k}\frac{t_k}{T}\,\hat{\lambda}_k $$
where the weights are equal to the fraction of exposure in each age group.
The idea of direct standardization is to replace the exposure proportions obtained from
the study population of interest, with the corresponding proportions obtained from some
standard population.
Direct Standardization
For example, if we wanted to compare two different populations, we would use the same
standard population for each, thereby removing effects merely due to differing age
distributions
So, the directly standardized rate is obtained by applying the study population
age-specific rates to the standard population exposure proportions
$$ \hat{\lambda}_{dir} = \sum_{k}\frac{t_{sk}}{T_s}\,\hat{\lambda}_k $$
where tsk and Ts are the exposure in age group k and total exposure, respectively, for
the standard population.
Direct Standardization
The variance of the directly standardized rate can be estimated as
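$$ \widehat{\mathrm{var}}(\hat{\lambda}_{dir}) = \sum_{k}\left(\frac{t_{sk}}{T_s}\right)^{2}\widehat{\mathrm{var}}(\hat{\lambda}_k) = \sum_{k}\left(\frac{t_{sk}}{T_s}\right)^{2}\frac{\hat{\lambda}_k^{2}}{d_k} $$
Following direct standardization of rates for two populations A and B, comparisons can
be made using the standardized rate ratio
$$ SRR = \frac{\hat{\lambda}_A}{\hat{\lambda}_B} $$
The delta method on the log SRR can be used to derive a variance estimate
$$ \widehat{\mathrm{var}}(\log SRR) = \widehat{\mathrm{var}}(\log\hat{\lambda}_A) + \widehat{\mathrm{var}}(\log\hat{\lambda}_B) = \frac{\widehat{\mathrm{var}}(\hat{\lambda}_A)}{\hat{\lambda}_A^{2}} + \frac{\widehat{\mathrm{var}}(\hat{\lambda}_B)}{\hat{\lambda}_B^{2}} $$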
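The variance and SRR formulas can be sketched in code as follows. All of the counts below are hypothetical (three age groups, two study populations, one standard population), and the function name is illustrative. The variance term is written as d_k/t_k², which equals λ̂_k²/d_k whenever d_k > 0 and avoids dividing by zero in empty age groups.

```python
import math


def direct_standardized(events, exposure, std_exposure):
    """Directly standardized rate and its estimated variance."""
    total_std = sum(std_exposure)
    rate, var = 0.0, 0.0
    for d_k, t_k, ts_k in zip(events, exposure, std_exposure):
        w = ts_k / total_std             # standard population weight
        rate += w * d_k / t_k            # weight times age-specific rate
        var += w ** 2 * d_k / t_k ** 2   # equals w^2 * rate_k^2 / d_k for d_k > 0
    return rate, var


# Hypothetical data for three age groups
std_exposure = [5000, 3000, 2000]
events_a, exposure_a = [10, 30, 60], [4000, 3000, 1000]
events_b, exposure_b = [5, 20, 90], [1000, 2000, 3000]

rate_a, var_a = direct_standardized(events_a, exposure_a, std_exposure)
rate_b, var_b = direct_standardized(events_b, exposure_b, std_exposure)

srr = rate_a / rate_b
var_log_srr = var_a / rate_a ** 2 + var_b / rate_b ** 2
half_width = 1.96 * math.sqrt(var_log_srr)
srr_ci = (srr * math.exp(-half_width), srr * math.exp(half_width))

print(f"SRR = {srr:.3f}, 95% CI ({srr_ci[0]:.3f}, {srr_ci[1]:.3f})")
```

The interval is built on the log scale and exponentiated, mirroring the log-transformed interval used for a single rate.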
Indirect Standardization
In contrast to direct standardization, indirect standardization applies age-specific rates
for some standard population to the exposure proportions in the study population. This
gives an “expected” number of deaths in the study population
$$ E = \sum_{k} t_k\,\lambda_{sk} $$
where λ_sk are the age-specific rates for the standard population.
The standardized mortality ratio is then calculated as the ratio of observed to expected
deaths
$$ SMR = \frac{d}{E} = \frac{\sum_{k} t_k\hat{\lambda}_k}{\sum_{k} t_k\lambda_{sk}} $$
Indirect Standardization
An indirectly standardized rate can be obtained by multiplying the crude rate in the
standard population by the SMR, although in practice we are often primarily interested
in the SMR itself.
If we can regard the expected deaths as a constant, then the variance of the SMR can
be estimated as
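$$ \widehat{\mathrm{var}}(SMR) = \frac{\widehat{\mathrm{var}}(D)}{E^{2}} = \frac{d}{E^{2}} = \frac{SMR}{E} $$
Exact confidence intervals can also be constructed from the chi-square distribution
$$ \left[\frac{\chi^{2}_{2d,\,0.025}}{2E},\ \frac{\chi^{2}_{2(d+1),\,0.975}}{2E}\right] $$
where χ²_{ν, p} denotes the p-th quantile of the chi-square distribution with ν degrees of
freedom (shown here for a 95% interval)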
In the US state of Michigan from 1950 to 1964, 731,177 babies were first-born to their
mothers, and of these, 412 were affected by Down’s syndrome. In the same period,
442,811 babies were the fifth-born or more to their mothers, and of these, 740 were
affected by Down’s syndrome.
The crude “rates” (strictly speaking, prevalence proportions) of Down’s syndrome are:
• 412 × 100,000/731,177 = 56.3 per 100,000 births for first-borns, and
• 740 × 100,000/442,811 = 167.1 per 100,000 births for fifth and later-borns.
This is not a fair comparison, however, because incidence of Down’s syndrome is known
to increase with maternal age, and mothers of fifth and later-borns will tend to be older
than mothers of first-borns. That is, maternal age is a confounder in the association
between birth order and Down’s syndrome.
Example: Down’s Syndrome
The following table shows the maternal-age-specific proportions of births and prevalence
proportions, separately for the whole state of Michigan (the standard population), and
for first-borns and fifth and later-borns (the study populations) for the period 1950-1964.
                 Proportion of births in age range       Age-specific rate (per 100,000 births)
Maternal age     Michigan   First-born   Fifth or        Michigan   First-born   Fifth or
                                         later-born                              later-born
Under 20         0.113      0.315        0.001           42.5       46.5         0.0
20-24            0.330      0.451        0.069           42.5       42.8         26.1
25-29            0.278      0.157        0.279           52.3       52.2         51.0
30-34            0.173      0.054        0.339           87.7       101.3        74.7
35-39            0.084      0.019        0.235           264.0      274.5        251.7
40+              0.022      0.004        0.078           864.4      819.1        857.8
Crude rate                                               89.5       56.3         167.1
Example: Down’s Syndrome
The directly standardized rates are computed in the following table by applying the
study population prevalence proportions to the standard population birth proportions.
Contribution to the direct standardized rate (per 100,000 births):
Maternal age               First-born   Fifth or later-born
Under 20                   5.255        0.000
20-24                      14.124       8.613
25-29                      14.512       14.178
30-34                      17.525       12.923
35-39                      23.058       21.143
40+                        18.020       18.872
Direct standardized rate   92.5         75.7
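The directly standardized rates can be reproduced from the age-specific table. The sketch below copies the Michigan birth proportions and the study population rates from that table; the function name is illustrative.

```python
# Maternal age groups: Under 20, 20-24, 25-29, 30-34, 35-39, 40+
std_birth_props = [0.113, 0.330, 0.278, 0.173, 0.084, 0.022]   # Michigan (standard)
rates_first_born = [46.5, 42.8, 52.2, 101.3, 274.5, 819.1]     # per 100,000 births
rates_fifth_plus = [0.0, 26.1, 51.0, 74.7, 251.7, 857.8]       # per 100,000 births


def directly_standardized_rate(std_props, study_rates):
    """Weight study population rates by standard population proportions."""
    return sum(w * r for w, r in zip(std_props, study_rates))


dsr_first_born = directly_standardized_rate(std_birth_props, rates_first_born)
dsr_fifth_plus = directly_standardized_rate(std_birth_props, rates_fifth_plus)

print(f"first-born:           {dsr_first_born:.1f} per 100,000 births")
print(f"fifth or later-born:  {dsr_fifth_plus:.1f} per 100,000 births")
```

After standardizing to the same maternal age distribution, the large difference seen in the crude rates (56.3 versus 167.1) largely disappears and is even reversed in direction.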
Example: Down’s Syndrome
This table illustrates the calculation of the SMRs and the indirectly standardized rates.
Expected cases (per 100,000 births):
Maternal age                 First-born   Fifth or later-born
Under 20                     13.388       0.043
20-24                        19.168       2.933
25-29                        8.211        14.592
30-34                        4.736        29.730
35-39                        5.016        62.040
40+                          3.458        67.423
Expected cases               53.976       176.760
SMR                          1.044        0.945
Indirect standardized rate   93.4         84.6
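The indirect standardization figures can be checked in the same way. This sketch applies the Michigan age-specific rates to each study population's birth proportions, giving expected cases per 100,000 births; the observed crude rates are recomputed from the raw counts quoted earlier rather than from the rounded values. The function name is illustrative.

```python
# Maternal age groups: Under 20, 20-24, 25-29, 30-34, 35-39, 40+
std_rates = [42.5, 42.5, 52.3, 87.7, 264.0, 864.4]        # Michigan, per 100,000 births
props_first_born = [0.315, 0.451, 0.157, 0.054, 0.019, 0.004]
props_fifth_plus = [0.001, 0.069, 0.279, 0.339, 0.235, 0.078]

std_crude_rate = 89.5                                      # Michigan crude rate
crude_first_born = 412 * 100_000 / 731_177                 # observed, per 100,000 births
crude_fifth_plus = 740 * 100_000 / 442_811


def expected_rate(study_props, standard_rates):
    """Expected cases per 100,000 births under the standard population rates."""
    return sum(p * r for p, r in zip(study_props, standard_rates))


expected_first_born = expected_rate(props_first_born, std_rates)
expected_fifth_plus = expected_rate(props_fifth_plus, std_rates)

smr_first_born = crude_first_born / expected_first_born
smr_fifth_plus = crude_fifth_plus / expected_fifth_plus

indirect_first_born = std_crude_rate * smr_first_born
indirect_fifth_plus = std_crude_rate * smr_fifth_plus

print(f"first-born:          E = {expected_first_born:.3f}, SMR = {smr_first_born:.3f}, "
      f"indirect rate = {indirect_first_born:.1f}")
print(f"fifth or later-born: E = {expected_fifth_plus:.3f}, SMR = {smr_fifth_plus:.3f}, "
      f"indirect rate = {indirect_fifth_plus:.1f}")
```

Because observed and expected values are both expressed per 100,000 births, their ratio is the same as the ratio of observed to expected case counts.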