
MATH5826 Statistical Methods in Epidemiology

Lecture 2: Measures of Disease Occurrence

Jake Olivier

Term 1, 2021


Background

Prevalence and Incidence

Delta Method

Estimation and inference for event rates

Standardized Rates


Background

Epidemiology originated as the study of epidemics, yet the current focus is much broader, including the study of chronic and acute disease, mental health and injury.

The primary focus in epidemiology is often on the relationship (or association) between an exposure E and a disease D.

For example, an epidemiologist may observe that higher levels of exposure to lead are associated with an increased incidence of deficient brain development, compared with little exposure to lead.

Prior to making comparisons for, say, different levels of exposure, we must first introduce

measures of disease occurrence. These will be used later when we introduce measures of

association.


Disease Occurrence

There are many measures of disease occurrence and the choice of which one to use

depends on the study design, the population under study and the available data

Some issues to consider are:

• threats to study validity – this could be due to bias or limitations that arise due to

the study design or method of data collection

• extraneous factors – the analysis needs to account for them to untangle differing

effects of exposure on disease


Disease Occurrence

Broadly speaking, disease occurrence can be quantified as either a ratio, proportion, rate

or odds

Each of these is a measure obtained by dividing one quantity by another (i.e., M = a/b).

The numerator and denominator of a ratio are separate quantities: one does not contain the other. For example, the ratio

$$\frac{\text{number of doctors working at a hospital}}{\text{number of beds in the hospital}}$$

is composed of separate quantities and can assist in determining resource allocation at a hospital.


Disease Occurrence: Proportions

For proportions, the denominator quantity (b) contains the numerator quantity (a).

That is, the denominator can be written as b = a + a′ where a′ = b − a.

A proportion is a number between 0 and 1, and proportions are often interpreted as

probabilities in epidemiology.

For example, the number of deaths in a specified time interval out of the number alive

at the start of the interval is a proportion, and can be regarded as an estimate of the

probability of dying in the interval.


Disease Occurrence: Rates

A rate is a measure of change in a quantity per unit of another quantity.

Mortality and disease rates are almost always expressed per unit of time which might be

calendar time, age, or follow-up time.

Example

Lowres et al. [1] screened n = 1000 patients aged 65 years and older for atrial fibrillation (AF), among whom there were 15 new AF cases. The sum of the ages of those screened was t = 76302 years.

The estimated proportion of new AF diagnoses was 15/1000 = 1.5%, and the rate was 15/76302 per person-year, or 1 new AF case per 5086.8 years lived.

[1] Lowres et al. Feasibility and cost-effectiveness of stroke prevention through community screening for atrial fibrillation using iPhone ECG in pharmacies. Thrombosis & Haemostasis 2014;111:1167-76.
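The arithmetic in the AF example can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the variable names are illustrative and not part of the original slides.

```python
# Figures quoted in the AF screening example above.
n_screened = 1000        # patients screened
new_cases = 15           # new AF cases found
person_years = 76302     # sum of ages, used as the time at risk

proportion = new_cases / n_screened          # new cases per person screened
rate = new_cases / person_years              # new cases per person-year
years_per_case = person_years / new_cases    # reciprocal of the rate

print(f"proportion = {proportion:.1%}")
print(f"rate = {rate * 1000:.3f} per 1000 person-years")
print(f"one new case per {years_per_case:.1f} person-years")
```

The proportion has people in the denominator, whereas the rate has person-time, which is why the two numbers are not directly comparable.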


Disease Occurrence: Rates

From differential calculus, the instantaneous rate of change in a quantity y is dy/dx . In

epidemiology, rates are commonly expressed relative to the size of the quantity in the

numerator.

So, the rate r for the change in population y at time t relative to y(t), the population

size at time t, is

$$r = \frac{dy/dt}{y(t)} = \frac{dy}{y(t)\,dt}$$

For example, if the population is diminishing only due to deaths, then r is a mortality

rate, i.e., the number of deaths per person per unit of time.


Disease Occurrence: Rates

Populations change over time and rates can be estimated over a time interval

The average rate in the interval (t, t + ∆t) is

$$\bar{r} = \frac{\Delta y}{\int_t^{t+\Delta t} y(u)\,du}$$

where ∆y is the change in y from t to t + ∆t.

For example, if $y(x) = \ell_x$ is the population at risk at age x, then $\Delta y = \ell_x - \ell_{x+\Delta x} = d_x$, the number of deaths between ages x and x + ∆x, and the average rate is

$$\frac{d_x}{\int_x^{x+\Delta x} \ell_u\,du}$$

the number of deaths per total time at risk.


Disease Occurrence: Odds

The odds of occurrence of an event A is the probability of occurrence relative to the probability of non-occurrence

$$\text{odds}(A) = \frac{P(A)}{1 - P(A)}$$

The odds can also be computed as the ratio of the frequency of occurrence to the frequency of non-occurrence of an event.
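The two ways of computing odds can be compared in a short sketch. The function names are illustrative assumptions, and the counts reuse the 15 cases among 1000 people from the earlier AF example.

```python
def odds_from_probability(p):
    """Odds of an event with probability p, where 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_from_counts(occurrences, non_occurrences):
    """Odds as the ratio of occurrences to non-occurrences."""
    if non_occurrences <= 0:
        raise ValueError("non_occurrences must be positive")
    return occurrences / non_occurrences


# 15 events among 1000 people: probability 15/1000, or 15 events vs 985 non-events.
odds_p = odds_from_probability(15 / 1000)
odds_c = odds_from_counts(15, 985)
print(odds_p, odds_c)
```

Both routes give the same value up to floating-point error, and for a rare event the odds are close to the probability itself.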


Prevalence

The number of disease cases in a population can be measured in several ways. Methods

may differ based on the time frame and whether new or existing cases are counted.

Prevalence is the number of existing cases of disease in a population at a point in time.

Two types of prevalence are:

• Point prevalence proportion: the proportion of a population with the disease at a

specified point in time

• Period prevalence proportion: the proportion of a population with the disease over

a specified period of time

For example, a researcher may be interested in the prevalence of active COVID-19 cases. On 25 Jan 2021 in Australia, the point prevalence was 135/25,744,519, while the period prevalence for 2020 was 28,381/25,499,884. [2]

[2] Population estimates at 25/1/2021 and 30/6/2020.
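To make the two COVID-19 fractions easier to compare, they can be rescaled to cases per 100,000 people. The sketch below uses only the counts quoted above; the helper name `per_100k` is an assumption for illustration.

```python
def per_100k(cases, population):
    """Express a prevalence proportion as cases per 100,000 people."""
    return cases / population * 100_000


point_prevalence = per_100k(135, 25_744_519)       # active cases on 25 Jan 2021
period_prevalence = per_100k(28_381, 25_499_884)   # cases during 2020

print(f"point prevalence:  {point_prevalence:.2f} per 100,000")
print(f"period prevalence: {period_prevalence:.1f} per 100,000")
```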

Incidence

Incidence is the number of new cases of disease occurring in a specified time period in

a population.

Two types of incidence are:

• Incidence proportion: the number of new cases of disease over a period of time,

divided by the number of people at risk for the disease at the start of the period.

• Incidence rate: the number of new cases of disease over a period of time, divided by

the total time at risk (for all individuals in the population) over the period.


Prevalence and Incidence

The numerator of the point prevalence proportion includes all those who have the

disease at that date, regardless of when the disease was contracted.

So, diseases of long duration tend to have higher prevalence than those of short

duration, even if the incidence is similar.

If the incidence and the average duration of a disease are constant over time, then the prevalence P is

P = I × D

where I is incidence and D is average duration.


Epidemiologist’s Bathtub [3]

The epidemiologist’s bathtub can be useful in understanding the difference between prevalence and incidence.

The existing water level represents prevalence, while the amount of new water flowing into the tub is the incidence.

Note that the water exiting the tub can be either mortality or recovery.

[3] Source: https://www.publichealth.hscni.net/node/5277


Example

The following figure and table present some hypothetical data on a population of five individuals observed from t = 0 to t = 5. The line segments represent time alive, circles represent deaths, and crosses represent occurrences of disease. Note the disease is chronic in the sense that no recovery is possible.

[Figure: one timeline per patient (patients 1 to 5 on the vertical axis) against Time from 0 to 5 on the horizontal axis; legend: Death, Disease]

Measure                                     Calculation              Value
Point prevalence proportion at t = 0        0/5                      0
Point prevalence proportion at t = 5        1/2                      0.5
Incidence proportion from t = 0 to t = 5    3/5                      0.6
Incidence rate from t = 0 to t = 5          3/(5 + 1 + 4 + 3 + 1)    0.21
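The incidence calculations in the table can be reproduced as follows. This is a sketch under one assumption: the disease-free times at risk are taken from the denominator shown in the table (5, 1, 4, 3, 1), and their assignment to individual patients is not recoverable from the text, so the list order is arbitrary.

```python
# Disease-free time at risk for each of the five individuals,
# as listed in the incidence rate denominator above (order assumed).
time_at_risk = [5, 1, 4, 3, 1]
new_cases = 3            # occurrences of disease between t = 0 and t = 5
at_risk_at_start = 5     # everyone is disease-free at t = 0

incidence_proportion = new_cases / at_risk_at_start
incidence_rate = new_cases / sum(time_at_risk)

print(f"incidence proportion = {incidence_proportion:.2f}")
print(f"incidence rate = {incidence_rate:.2f} cases per unit of person-time")
```

The proportion divides by people, while the rate divides by the 14 units of person-time actually spent at risk.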


Review of the Delta Method

It is often useful to derive the asymptotic variances of measures of disease occurrence

and association, and this section briefly reviews the delta method for obtaining the

covariance matrix of a transformation of a parameter vector.

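Let $\hat{\theta}$ be a p × 1 vector of parameter estimates with covariance matrix $\operatorname{var}(\hat{\theta})$, and let $\phi = g(\theta)$ be a transformation of θ to a q × 1 parameter vector. The first-order Taylor series expansion of $g(\hat{\theta})$ about θ is

$$g(\hat{\theta}) \approx g(\theta) + \left[\frac{\partial g_i(\theta)}{\partial \theta_j}\right](\hat{\theta} - \theta)$$

where $\left[\frac{\partial g_i(\theta)}{\partial \theta_j}\right]$ is the Jacobian matrix of g whose (i, j) element is $\partial g_i(\theta)/\partial \theta_j$.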


Review of the Delta Method

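Taking the variance of both sides of the equation gives

$$\operatorname{var}(\hat{\phi}) = \operatorname{var}(g(\hat{\theta})) \approx \left[\frac{\partial g_i(\theta)}{\partial \theta_j}\right] \operatorname{var}(\hat{\theta}) \left[\frac{\partial g_i(\theta)}{\partial \theta_j}\right]^T$$

Evaluating the derivatives at the estimated value $\theta = \hat{\theta}$ gives the estimated covariance matrix

$$\widehat{\operatorname{var}}(\hat{\phi}) = \widehat{\operatorname{var}}(g(\hat{\theta})) \approx \left[\frac{\partial g_i(\hat{\theta})}{\partial \theta_j}\right] \widehat{\operatorname{var}}(\hat{\theta}) \left[\frac{\partial g_i(\hat{\theta})}{\partial \theta_j}\right]^T$$

For the univariate case, i.e., p = q = 1,

$$\widehat{\operatorname{var}}(\hat{\phi}) \approx \left(\frac{dg(\hat{\theta})}{d\theta}\right)^2 \widehat{\operatorname{var}}(\hat{\theta})$$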
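The univariate formula can be illustrated numerically. The sketch below approximates the derivative with a central difference rather than deriving it analytically, which is a convenience chosen here and not something the slides prescribe. The function name and the example values (an estimate of 0.25 with variance 0.001) are hypothetical.

```python
import math


def delta_method_variance(g, theta_hat, var_theta_hat, h=1e-6):
    """Approximate var(g(theta_hat)) by (dg/dtheta)^2 * var(theta_hat).

    The derivative is estimated with a central difference whose step
    is scaled to the size of theta_hat.
    """
    step = h * (abs(theta_hat) or 1.0)
    derivative = (g(theta_hat + step) - g(theta_hat - step)) / (2 * step)
    return derivative ** 2 * var_theta_hat


# Hypothetical estimate and variance, log transformation.
theta_hat = 0.25
var_theta_hat = 0.001
var_log = delta_method_variance(math.log, theta_hat, var_theta_hat)

# Analytically, d log(theta)/d theta = 1/theta, so the result should be
# var_theta_hat / theta_hat**2 = 0.016.
print(var_log)
```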


Estimation and inference for event rates

In cohort studies, we are often interested in the risk of (possibly recurrent) events during

some period of exposure.

Different individuals may be at risk of the event for different exposure periods, but the

overall event frequency is usually summarised as a rate. That is, the number of events

occurring in a specified time period divided by the total of the exposure periods for each

individual.


Poisson Process Model

We assume that the sequence of events and the aggregate count for each individual are generated by a Poisson process, where λ(s) is the instantaneous rate of events at time s.

Let N(s) denote the counting process at time s, which is the cumulative number of events in the period (0, s]. If the random variable D represents the number of events in (s, s + t], then the probability of d events in this interval is

$$P(D = d) = P(N(s + t) - N(s) = d) = \frac{e^{-\Lambda_{(s,s+t)}} \left(\Lambda_{(s,s+t)}\right)^d}{d!}$$

where s, t > 0 and the rate parameter

$$\Lambda_{(s,s+t)} = \int_s^{s+t} \lambda(u)\,du$$

is the cumulative intensity over the interval.


Poisson Process Model

If we assume constant intensity over the time period of interest, i.e., λ(u) = λ for all u,

then for d ≥ 0, t > 0 we have a homogeneous Poisson process and the number of events

D in an interval of length t is distributed as a Poisson with parameter λt

$$P(D = d;\, t, \lambda) = \frac{e^{-\lambda t} (\lambda t)^d}{d!}$$
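As a quick illustration of the homogeneous case, the probability above can be evaluated directly. The function name and the example intensity (0.5 events per year over 4 years) are hypothetical and chosen only to show the calculation.

```python
import math


def poisson_pmf(d, rate, t):
    """P(D = d) for a homogeneous Poisson process with intensity `rate`
    observed over an interval of length t."""
    mean = rate * t
    return math.exp(-mean) * mean ** d / math.factorial(d)


# Hypothetical example: 0.5 events per year, 4 years of follow-up (mean 2).
p_zero = poisson_pmf(0, 0.5, 4)
total = sum(poisson_pmf(d, 0.5, 4) for d in range(60))

print(f"P(D = 0) = {p_zero:.4f}")
print(f"sum over d = 0..59: {total:.6f}")
```

The probabilities depend on the rate and the interval length only through their product, which is why exposure time can be pooled in the likelihood.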

If we further assume that all individuals in the population experience the same constant

intensity, then we have a doubly homogeneous Poisson process and the likelihood

function for a sample of n independent observations is

$$L(\lambda) = \prod_{j=1}^{n} \frac{e^{-\lambda t_j} (\lambda t_j)^{d_j}}{d_j!}$$

where $t_j$ and $d_j$ are the exposure time and number of events, respectively, for individual j. This likelihood conditions on the exposure times $t_1, \ldots, t_n$ (i.e., the $t_j$ are fixed constants).


Poisson Process Model

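The log-likelihood is

$$\log L(\lambda) = -\lambda \sum_{j=1}^{n} t_j + \sum_{j=1}^{n} d_j \log(\lambda) + \sum_{j=1}^{n} d_j \log(t_j) - \sum_{j=1}^{n} \log(d_j!)$$

and the score equation is

$$\frac{\partial \log L(\lambda)}{\partial \lambda} = -\sum_{j=1}^{n} t_j + \frac{1}{\lambda} \sum_{j=1}^{n} d_j$$

Equating the score to 0 and solving for λ gives the maximum likelihood estimator

$$\hat{\lambda} = \frac{\sum_{j=1}^{n} d_j}{\sum_{j=1}^{n} t_j} = \frac{d}{T}$$

where d and T are the total number of events and exposure, respectively.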
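The derivation can be checked numerically: the score should vanish at d/T, and the log-likelihood should be lower on either side of it. The sketch below uses hypothetical event counts and exposure times for five individuals; the function names are illustrative.

```python
import math


def log_likelihood(lam, events, exposures):
    """Poisson log-likelihood for a common rate lam, given per-person
    event counts and exposure times."""
    return sum(
        -lam * t + d * math.log(lam * t) - math.lgamma(d + 1)
        for d, t in zip(events, exposures)
    )


def score(lam, events, exposures):
    """Derivative of the log-likelihood with respect to lam."""
    return -sum(exposures) + sum(events) / lam


def rate_mle(events, exposures):
    """Maximum likelihood estimate: total events over total exposure."""
    return sum(events) / sum(exposures)


# Hypothetical data: d = 6 events in T = 12 person-years.
events = [0, 2, 1, 0, 3]
exposures = [1.5, 4.0, 2.5, 1.0, 3.0]

lam_hat = rate_mle(events, exposures)
print(lam_hat, score(lam_hat, events, exposures))
```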


Poisson Process Model

The previous ML estimator is known as the (crude) event rate. It can also be equivalently expressed as a weighted mean rate

$$\hat{\lambda} = \frac{\sum_{j=1}^{n} d_j}{\sum_{j=1}^{n} t_j} = \frac{\sum_{j=1}^{n} t_j (d_j/t_j)}{\sum_{j=1}^{n} t_j} = \frac{\sum_{j=1}^{n} t_j r_j}{\sum_{j=1}^{n} t_j}$$

where $r_j = d_j/t_j$ is the event rate for person j.


Poisson Process Model

The variance of the rate can be obtained as the inverse of the observed or expected Fisher information. The observed information is

$$J(\lambda) = -\frac{\partial^2 \log L(\lambda)}{\partial \lambda^2} = \frac{d}{\lambda^2}$$

and the expected information is $I(\lambda) = E(D)/\lambda^2$, where D now represents the total number of events in the sample.

Therefore, the variance estimator based on the observed information is $\lambda^2/d$, and based on the expected information it is $\lambda^2/E(D) = \lambda/T$.

This follows because D is a Poisson random variable with $E(D) = \lambda T$ (exercise). Thus for a given λ, the variance is inversely proportional to the total exposure, and also to the (expected) number of events.


Poisson Process Model

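The variance of the rate can be estimated as

$$\widehat{\operatorname{var}}(\hat{\lambda}) = \frac{\hat{\lambda}}{T} = \frac{d}{T^2} = \frac{\hat{\lambda}^2}{d}$$

Large sample hypothesis tests and confidence intervals can be based on this result. For example, a 95% confidence interval for λ would be

$$\hat{\lambda} \pm 1.96\,\operatorname{se}(\hat{\lambda}) = \hat{\lambda} \pm 1.96\,\hat{\lambda}/\sqrt{d}$$

Alternatively, since λ is positive, a more accurate approximation may be obtained by calculating confidence intervals for log(λ) using the delta method and transforming back. With $g(\lambda) = \log \lambda$ and $dg/d\lambda = 1/\lambda$, this method gives $\operatorname{var}(\log \hat{\lambda}) \approx \hat{\lambda}^{-2} \cdot \hat{\lambda}^2/d = 1/d$, so a 95% confidence interval for λ is

$$\left(\hat{\lambda}\,e^{-1.96/\sqrt{d}},\ \hat{\lambda}\,e^{+1.96/\sqrt{d}}\right)$$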


Example

According to the New York State Cancer Registry, there were 524 cancer deaths amongst males aged 45-49 in New York State in 2000. The corresponding mid-year population for this age group is 649,533, and we take this to be an approximation to T, the total number of years of exposure for year 2000. The estimated mortality rate is then $\hat{\lambda} = d/T = 524/649{,}533 = 0.000807$, or about 80.7 per 100,000 per year.

A 95% confidence interval for λ, assuming a normal distribution for $\hat{\lambda}$, is

$$\hat{\lambda} \pm 1.96\,\hat{\lambda}/\sqrt{d} = 80.7 \pm 1.96 \times 80.7/\sqrt{524} = (73.8, 87.6)$$

Using a log transformation, a 95% confidence interval for λ is

$$\left(\hat{\lambda}\,e^{-1.96/\sqrt{d}},\ \hat{\lambda}\,e^{+1.96/\sqrt{d}}\right) = \left(80.7\,e^{-1.96/\sqrt{524}},\ 80.7\,e^{+1.96/\sqrt{524}}\right) = (74.1, 87.9)$$
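Both intervals in this example can be reproduced with a short script. This is a sketch using the counts quoted above (d = 524, T = 649,533); the function name is illustrative, and the rate is kept unrounded until the final step, so intermediate digits may differ slightly from the slide arithmetic.

```python
import math


def rate_confidence_intervals(d, T, z=1.96):
    """Return the rate d/T with its Wald interval and its log-based interval."""
    lam = d / T
    se = lam / math.sqrt(d)                  # sqrt(lam^2 / d)
    wald = (lam - z * se, lam + z * se)
    factor = math.exp(z / math.sqrt(d))      # se(log lam) is 1/sqrt(d)
    log_based = (lam / factor, lam * factor)
    return lam, wald, log_based


lam, wald, log_based = rate_confidence_intervals(524, 649_533)

scale = 100_000  # report per 100,000 per year
print(f"rate      = {lam * scale:.1f}")
print(f"Wald CI   = ({wald[0] * scale:.1f}, {wald[1] * scale:.1f})")
print(f"log-based = ({log_based[0] * scale:.1f}, {log_based[1] * scale:.1f})")
```

The log-based interval is not symmetric about the estimate and can never include negative rates, which is the motivation given above for preferring it.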


Age-specific rates and crude rates

Since mortality and disease rates usually show considerable variation by age, we are

often interested in looking at a series of rates, one for each age or age group. These are

termed age-specific rates.

• In the above example, the age-specific rate for the 45-49 year age group is 80.7 per

100,000 per year.

• The age-specific rates for this population vary considerably, from 2.2 per 100,000

per year in the 5-9 and 10-14 year age groups, to 2585.5 in the 85+ age group.

• For similar reasons, it is also useful to estimate rates separately for males and

females.


Age-specific rates and crude rates

The crude rate is the total number of deaths (in all age groups), divided by the total

exposure.

Crude rates reflect not only the level of risk in the population, but also the age

distribution. So, comparison of crude rates amongst different populations can be

confounded by age. For example, a higher crude rate might simply reflect an older

population distribution, rather than a real difference in risk.

Valid comparisons can be made by comparing series of age-specific rates, but sometimes

a single summary measure is required that allows comparison between populations. Such

a summary measure is referred to as an age-adjusted or standardized rate.


Direct Standardization

Let $\hat{\lambda}_k = d_k/t_k$ be the age-specific rate for age group k = 1, ..., K, where $d_k$ is the number of events and $t_k$ is the total exposure time for age group k.

Then, similar to the weighted mean rate, the crude rate can be written as a weighted average of the age-specific death rates

$$\hat{\lambda}_c = \frac{\sum_k t_k \hat{\lambda}_k}{\sum_k t_k} = \sum_k \frac{t_k}{T} \hat{\lambda}_k$$

where the weights are equal to the fraction of exposure in each age group.

The idea of direct standardization is to replace the exposure proportions obtained from

the study population of interest, with the corresponding proportions obtained from some

standard population.


Direct Standardization

For example, if we wanted to compare two different populations, we would use the same

standard population for each, thereby removing effects merely due to differing age

structures.

So, the directly standardized rate is obtained by applying the study population

age-specific rates to the standard population exposure proportions

$$\hat{\lambda}_{dir} = \sum_k \frac{t_{sk}}{T_s} \hat{\lambda}_k$$

where $t_{sk}$ and $T_s$ are the exposure in age group k and total exposure, respectively, for the standard population.


Direct Standardization

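The variance of the directly standardized rate can be estimated as

$$\widehat{\operatorname{var}}(\hat{\lambda}_{dir}) = \sum_k \left(\frac{t_{sk}}{T_s}\right)^2 \widehat{\operatorname{var}}(\hat{\lambda}_k) = \sum_k \left(\frac{t_{sk}}{T_s}\right)^2 \frac{\hat{\lambda}_k^2}{d_k}$$

Following direct standardization of rates for two populations A and B, comparisons can be made using the standardized rate ratio

$$\text{SRR} = \hat{\lambda}_A / \hat{\lambda}_B$$

The delta method applied to log SRR can be used to derive a variance estimate, assuming the two populations are independent:

$$\widehat{\operatorname{var}}(\log \text{SRR}) = \widehat{\operatorname{var}}(\log \hat{\lambda}_A) + \widehat{\operatorname{var}}(\log \hat{\lambda}_B) = \frac{\widehat{\operatorname{var}}(\hat{\lambda}_A)}{\hat{\lambda}_A^2} + \frac{\widehat{\operatorname{var}}(\hat{\lambda}_B)}{\hat{\lambda}_B^2}$$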
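The formulas above can be combined into a small sketch that computes the directly standardized rates, their variances, and a confidence interval for the SRR on the log scale. All data below are hypothetical (three age groups, invented counts and exposures), the function names are illustrative, and the variance formula requires at least one event in every age group.

```python
import math


def direct_rate(std_exposure, rates):
    """Directly standardized rate: standard-population weights times study rates."""
    total = sum(std_exposure)
    return sum(ts / total * r for ts, r in zip(std_exposure, rates))


def direct_rate_variance(std_exposure, rates, events):
    """Sum of (weight^2 * rate^2 / d_k); assumes every d_k > 0."""
    total = sum(std_exposure)
    return sum((ts / total) ** 2 * r ** 2 / d
               for ts, r, d in zip(std_exposure, rates, events))


def srr_with_ci(rate_a, var_a, rate_b, var_b, z=1.96):
    """SRR with a confidence interval built on the log scale."""
    srr = rate_a / rate_b
    se_log = math.sqrt(var_a / rate_a ** 2 + var_b / rate_b ** 2)
    return srr, srr * math.exp(-z * se_log), srr * math.exp(z * se_log)


# Hypothetical standard population and two hypothetical study populations.
std_exposure = [5000, 3000, 2000]
events_a, exposure_a = [10, 30, 60], [4000, 3000, 1000]
events_b, exposure_b = [5, 20, 80], [2000, 2000, 4000]

rates_a = [d / t for d, t in zip(events_a, exposure_a)]
rates_b = [d / t for d, t in zip(events_b, exposure_b)]

rate_a = direct_rate(std_exposure, rates_a)
rate_b = direct_rate(std_exposure, rates_b)
var_a = direct_rate_variance(std_exposure, rates_a, events_a)
var_b = direct_rate_variance(std_exposure, rates_b, events_b)

srr, lower, upper = srr_with_ci(rate_a, var_a, rate_b, var_b)
print(f"SRR = {srr:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```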

Indirect Standardization

In contrast to direct standardization, indirect standardization applies age-specific rates

for some standard population to the exposure proportions in the study population. This

gives an “expected” number of deaths in the study population

$$E = \sum_k t_k \lambda_{sk}$$

where $\lambda_{sk}$ are the age-specific rates for the standard population.

The standardized mortality ratio is then calculated as the ratio of observed to expected deaths

$$\text{SMR} = \frac{d}{E} = \frac{\sum_k t_k \hat{\lambda}_k}{\sum_k t_k \lambda_{sk}}$$


Indirect Standardization

An indirectly standardized rate can be obtained by multiplying the crude rate in the

standard population by the SMR, although in practice we are often primarily interested

in the SMR itself.

If we can regard the expected deaths as a constant, then the variance of the SMR can

be estimated as

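$$\widehat{\operatorname{var}}(\text{SMR}) = \frac{\widehat{\operatorname{var}}(D)}{E^2} = \frac{d}{E^2} = \frac{\text{SMR}^2}{d}$$

Exact confidence intervals can also be constructed from the chi-square distribution

$$\left[\frac{\chi^2_{2D,\,\alpha/2}}{2E},\ \frac{\chi^2_{2(D+1),\,1-\alpha/2}}{2E}\right]$$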


Example

In the US state of Michigan from 1950 to 1964, 731,177 babies were first-born to their

mothers, and of these, 412 were affected by Down’s syndrome. In the same period,

442,811 babies were the fifth-born or more to their mothers, and of these, 740 were

affected by Down’s syndrome.

The crude “rates” (strictly speaking, prevalence proportions) of Down’s syndrome are:

• 412 × 100,000/731,177 = 56.3 per 100,000 births for first-borns, and

• 740 × 100,000/442,811 = 167.1 per 100,000 births for fifth and later-borns.

This is not a fair comparison, however, because incidence of Down’s syndrome is known

to increase with maternal age, and mothers of fifth and later-borns will tend to be older

than mothers of first-borns. That is, maternal age is a confounder in the association

between birth order and Down’s syndrome.


Example: Down’s Syndrome

The following table shows the maternal-age-specific proportions of births and prevalence

proportions, separately for the whole state of Michigan (the standard population), and

for first-borns and fifth and later-borns (the study populations) for the period 1950-1964.

               Proportion of births in age range         Age-specific rate (per 100,000 births)
Maternal age   Michigan   First-born   Fifth or later    Michigan   First-born   Fifth or later
Under 20       0.113      0.315        0.001             42.5       46.5         0.0
20-24          0.330      0.451        0.069             42.5       42.8         26.1
25-29          0.278      0.157        0.279             52.3       52.2         51.0
30-34          0.173      0.054        0.339             87.7       101.3        74.7
35-39          0.084      0.019        0.235             264.0      274.5        251.7
40+            0.022      0.004        0.078             864.4      819.1        857.8
Crude rate                                               89.5       56.3         167.1


Example: Down’s Syndrome

The directly standardized rates are computed in the following table by applying the

study population prevalence proportions to the standard population birth proportions.

Direct standardized rate (contribution of each age group, per 100,000 births):

Maternal age               First-born   Fifth or later-born
Under 20                   5.255        0.000
20-24                      14.124       8.613
25-29                      14.512       14.178
30-34                      17.525       12.923
35-39                      23.058       21.143
40+                        18.020       18.872
Direct standardized rate   92.5         75.7
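The direct standardization table can be reproduced from the earlier table of birth proportions and age-specific rates. The sketch below copies those values; the variable and function names are illustrative.

```python
# Values copied from the maternal-age table above.
michigan_share = [0.113, 0.330, 0.278, 0.173, 0.084, 0.022]   # standard weights
first_born_rate = [46.5, 42.8, 52.2, 101.3, 274.5, 819.1]     # per 100,000 births
later_born_rate = [0.0, 26.1, 51.0, 74.7, 251.7, 857.8]       # per 100,000 births


def directly_standardized(weights, rates):
    """Weighted sum of study-population rates using standard-population weights."""
    return sum(w * r for w, r in zip(weights, rates))


first_born_dsr = directly_standardized(michigan_share, first_born_rate)
later_born_dsr = directly_standardized(michigan_share, later_born_rate)

print(f"first-born:           {first_born_dsr:.1f} per 100,000 births")
print(f"fifth or later-born:  {later_born_dsr:.1f} per 100,000 births")
```

After standardization the ordering of the two groups reverses relative to the crude rates (56.3 and 167.1), consistent with maternal age acting as a confounder.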


Example: Down’s Syndrome

This table illustrates the calculation of the SMRs and the indirectly standardized rates.

Expected cases (per 100,000 births):

Maternal age                 First-born   Fifth or later-born
Under 20                     13.388       0.043
20-24                        19.168       2.933
25-29                        8.211        14.592
30-34                        4.736        29.730
35-39                        5.016        62.040
40+                          3.458        67.423
Expected cases               53.976       176.760
SMR                          1.044        0.945
Indirect standardized rate   93.4         84.6
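The indirect standardization table can be reproduced in the same way. The sketch below copies the Michigan rates, the birth proportions, and the case and birth counts from the example. The names are illustrative, and the added Wald interval for the SMR uses the variance SMR²/d from the earlier slide, which treats the expected count as a constant.

```python
import math

# Values copied from the tables above (rates per 100,000 births).
michigan_rate = [42.5, 42.5, 52.3, 87.7, 264.0, 864.4]    # standard population
michigan_crude_rate = 89.5
first_born_share = [0.315, 0.451, 0.157, 0.054, 0.019, 0.004]
later_born_share = [0.001, 0.069, 0.279, 0.339, 0.235, 0.078]


def indirect_standardization(cases, births, shares):
    """Return (expected rate per 100,000, SMR, indirect rate, Wald 95% CI for SMR)."""
    expected_rate = sum(s * r for s, r in zip(shares, michigan_rate))
    expected_cases = births * expected_rate / 100_000   # E on the count scale
    smr = cases / expected_cases
    se = smr / math.sqrt(cases)                         # sqrt(SMR^2 / d)
    ci = (smr - 1.96 * se, smr + 1.96 * se)
    return expected_rate, smr, michigan_crude_rate * smr, ci


first_born = indirect_standardization(412, 731_177, first_born_share)
later_born = indirect_standardization(740, 442_811, later_born_share)

for label, (expected, smr, rate, ci) in [("first-born", first_born),
                                         ("fifth or later-born", later_born)]:
    print(f"{label}: expected {expected:.3f}, SMR {smr:.3f}, "
          f"indirect rate {rate:.1f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```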

40/40

学霸联盟

Lecture 2: Measures of Disease Occurrence

Jake Olivier

Term 1, 2021

1/40

Background

Prevalence and Incidence

Delta Method

Estimation and inference for event rates

Standardized Rates

2/40

Background

Epidemiology originated as the study of epidemics, yet the current focus is much broader

including the study of chronic and acute disease, mental health and injury

The primary focus in epidemiology is often on the relationship (or association) between

an exposure E and disease D

For example, an epidemiologist may observe that those with higher levels of exposure to

lead is associated with an increase in the incidence of deficient brain development,

compared to those with little exposure to lead

Prior to making comparisons for, say, different levels of exposure, we must first introduce

measures of disease occurrence. These will be used later when we introduce measures of

association.

3/40

Disease Occurrence

There are many measures of disease occurrence and the choice of which one to use

depends on the study design, the population under study and the available data

Some issues to consider are:

• threats to study validity – this could be due to bias or limitations that arise due to

the study design or method of data collection

• extraneous factors – the analysis needs to account for them to untangle differing

effects of exposure on disease

4/40

Disease Occurrence

Broadly speaking, disease occurrence can be quantified as either a ratio, proportion, rate

or odds

Each of these are measures obtained by dividing one quantity versus another (i.e.,

M = a/b)

The numerator and denominator of a ratio are separate quantities: one does not contain

the other. For example, the ratio

number of doctors working at a hospital

number of beds in the hospital

are comprised of separate quantities and can assist in determining resource allocation at

a hospital

5/40

Disease Occurrence: Proportions

For proportions, the denominator quantity (b) contains the numerator quantity (a).

That is, the denominator can be written as b = a + a′ where a′ = b − a.

A proportion is a number between 0 and 1, and proportions are often interpreted as

probabilities in epidemiology.

For example, the number of deaths in a specified time interval out of the number alive

at the start of the interval is a proportion, and can be regarded as an estimate of the

probability of dying in the interval.

6/40

Disease Occurrence: Rates

A rate is a measure of change in a quantity per unit of another quantity.

Mortality and disease rates are almost always expressed per unit of time which might be

calendar time, age, or follow-up time.

Example

Lowres et al1 screened n = 1000 patients 65 years and older for atrial fibrulation (AF),

of which there were 15 new AF cases. The sum of the ages of those screened was

t = 76302 years.

The estimated proportion of new AF diagnosis was 15/1000 = 1.5%, and the rate per

person-years was 15/76302 or 1 new AF case per 5086.8 years lived.

1Feasibility and cost-effectiveness of stroke prevention through community screening for atrial fibrillation

using iPhone ECG in pharmacies. Thrombosis & Haemostasis 2014;111:1167-76.

7/40

Disease Occurrence: Rates

From differential calculus, the instantaneous rate of change in a quantity y is dy/dx . In

epidemiology, rates are commonly expressed relative to the size of the quantity in the

numerator.

So, the rate r for the change in population y at time t relative to y(t), the population

size at time t, is

r = dydt

/

y(t)

= dyy(t)dt

For example, if the population is diminishing only due to deaths, then r is a mortality

rate, i.e., the number of deaths per person per unit of time.

8/40

Disease Occurrence: Rates

Populations change over time and rates can be estimated over a time interval

The average rate in the interval (t, t + ∆t) is

r¯ = ∆y∫ t+∆t

t y(u)du

where ∆y is the change in y from t to t + ∆t.

For example, if y(x) = `x is the population at risk at age x , ∆y = `x − `x+∆x = dx , the

number of deaths between ages x and x + ∆x , and the average rate is

dx∫ x+∆x

x `udu

the number of deaths per total time at risk.

9/40

Disease Occurrence: Odds

The odds of occurrence of an event A is the probability of occurrence relative to the

probability of non-occurrence

odds(A) = P(A)1− P(A)

The odds can also be computed by frequency of occurrence to non-occurrence of an

event

10/40

Background

Prevalence and Incidence

Delta Method

Estimation and inference for event rates

Standardized Rates

11/40

Prevalence

The number of disease cases in a population can be measured in several ways. Methods

may differ based on the time frame and whether new or existing cases are counted.

Prevalence is the number of existing cases of disease in a population at a point in time.

Two types of prevalance are:

• Point prevalence proportion: the proportion of a population with the disease at a

specified point in time

• Period prevalence proportion: the proportion of a population with the disease over

a specified period of time

For example, a researcher may be interested in the prevalance of active COVID-19 cases.

On 25 Jan 2021 in Australia, the point prevalance was 135/25,744,519, while the period

prevalance for 2020 was 28,381/25,499,884.2

2Population estimates at 25/1/2021 & 30/6/2020 12/40

Incidence

Incidence is the number of new cases of disease occurring in a specified time period in

a population.

Two types of incidence are:

• Incidence proportion: the number of new cases of disease over a period of time,

divided by the number of people at risk for the disease at the start of the period.

• Incidence rate: the number of new cases of disease over a period of time, divided by

the total time at risk (for all individuals in the population) over the period.

13/40

Prevalence and Incidence

The numerator of the point prevalence proportion includes all those who have the

disease at that date, regardless of when the disease was contracted.

So, diseases of long duration tend to have higher prevalence than those of short

duration, even if the incidence is similar.

If the incidence and the average duration of a disease are constant over time, then the

prevalance P is

P = I × D

where I is incidence and D is average duration.

14/40

Epidemiologist’s Bathtub3

The epidemiologist’s bathtub can

be useful in understanding the

difference between prevalence and

incidence

The existing water level represents

prevalence while the amount of

new water flowing into the tub is

the incidence

Note the water exiting the tub can

be either mortality or recovery

3source: https://www.publichealth.hscni.net/node/5277

15/40

Example

The following table presents some

hypothetical data on a population of five

individuals observed from t = 0 to t = 5.

The line segments represent time alive,

circles represent deaths, and crosses

represent occurrences of disease. Note the

disease is chronic in the sense that no

recovery is possible.

1

2

3

4

5

0 1 2 3 4 5

Time

Pa

tie

nt

Legend

Death

Disease

Point prevalence proportion at t = 0 0/5 = 0

Point prevalence proportion at t = 5 1/2 = 0.5

Incidence proportion from t = 0 to t = 5 3/5 = 0.6

Incidence rate from t = 0 to t = 5 3/(5 + 1 + 4 + 3 + 1) = 0.21

16/40

Background

Prevalence and Incidence

Delta Method

Estimation and inference for event rates

Standardized Rates

17/40

Review of the Delta Method

It is often useful to derive the asymptotic variances of measures of disease occurrence

and association, and this section briefly reviews the delta method for obtaining the

covariance matrix of a transformation of a parameter vector.

Let θˆ be a p × 1 vector of parameter estimates with covariance matrix var(θˆ), and let

ϕ = g(θ) be a transformation of θ to a q × 1 parameter vector. The first-order Taylor

series expansion of g(θˆ) about θ is

g(θˆ) ≈ g(θ) +

[

∂gi(θ)

∂θj

]

(θˆ − θ)

where

[

∂gi (θ)

∂θj

]

is the Jacobian matrix of g whose (i , j) element is ∂gi (θ)∂θj

18/40

Review of the Delta Method

Taking the variance of both sides of the equation

var(ϕˆ) = var(g(θˆ)) ≈

[

∂gi(θ)

∂θj

]

var(θˆ)

[

∂gi(θ)

∂θj

]T

Evaluating the derivatives at the estimated value θ = θˆ gives the estimated covariance

matrix

v̂ar(ϕˆ) = v̂ar(g(θˆ)) ≈

[

∂gi(θˆ)

∂θj

]

v̂ar(θˆ)

[

∂gi(θˆ)

∂θj

]T

For the univariate case, i.e., p = q = 1,

v̂ar(ϕˆ) ≈

(

dg(θˆ)

dθ

)2

v̂ar(θˆ)

19/40

Background

Prevalence and Incidence

Delta Method

Estimation and inference for event rates

Standardized Rates

20/40

Estimation and inference for event rates

In cohort studies, we are often interested in the risk of (possibly recurrent) events during

some period of exposure.

Different individuals may be at risk of the event for different exposure periods, but the

overall event frequency is usually summarised as a rate. That is, the number of events

occurring in a specified time period divided by the total of the exposure periods for each

individual.

21/40

Poisson Process Model

We assume that the sequence of events and the aggregate count for each individual is

generated by a Poisson process where λ(s) is the instantaneous rate of events at time s

Let N(s) denote the counting process at time s, which is the cumulative number of

events in the period (0, s]. If the random variable D represents the number of events in

(s, s + t], then the probability of d events in this interval is

P(D = d) = P (N(s + t)− N(s) = d) = e

−Λ(s,s+t)(Λ(s,s+t))d

d!

where s, t > 0 and the rate parameter

Λ(s,s+t) =

∫ s+t

s

λ(u) du

is the cumulative intensity over the interval

22/40

Poisson Process Model

If we assume constant intensity over the time period of interest, i.e., λ(u) = λ for all u,

then for d ≥ 0, t > 0 we have a homogeneous Poisson process and the number of events

D in an interval of length t is distributed as a Poisson with parameter λt

P(D = d ; t, λ) = e

−λt(λt)d

d!

If we further assume that all individuals in the population experience the same constant

intensity, then we have a doubly homogeneous Poisson process and the likelihood

function for a sample of n independent observations is

L(λ) =

n∏

j=1

e−λtj (λtj)dj

dj !

where tj and dj are the exposure time and number of events, respectively, for individual j .

This likelihood conditions on the exposure times t1, . . . , tn (i.e., tj are fixed constants)

23/40

Poisson Process Model

The log-likelihood is

logL(λ) = −λ

n∑

j=1

tj +

n∑

j=1

dj log(λ) +

n∑

j=1

dj log(tj)−

n∑

j=1

log(dj !)

and the score equation is

∂ logL(λ)

∂λ

= −

n∑

j=1

tj +

1

λ

n∑

j=1

dj

Equating the score to 0 and solving for λ gives the maximum likelihood estimator

λˆ =

∑n

j=1 dj∑n

j=1 tj

= dT

where d and T are the total number of events and exposure, respectively

24/40

Poisson Process Model

The previous ML estimator is known as the (crude) event rate. It can also be

equivalently expressed as a weighted mean rate

λˆ =

∑n

j=1 dj∑n

j=1 tj

=

∑n

j=1 tj(dj/tj)∑n

j=1 tj

=

∑n

j=1 tj rj∑n

j=1 tj

where rj = dj/tj is the event rate for person j

25/40

Poisson Process Model

The variance of the rate can be obtained as the inverse of the the observed or expected

Fisher information. The observed information is

J (λ) = −∂

2 logL(λ)

∂λ2

= d

λ2

and the expected information is I(λ) = E (D)/λ2, where D now represents the total

number of events in the sample.

Therefore, the variance estimator based on the observed information is λ2/d , and based

on the expected information it is λ2/E (D) = λ/T .

This follows because D is a Poisson random variable with E (D) = λT (exercise). Thus

for a given λ, the variance is inversely proportional to the total exposure, and also to the

(expected) number of events.

26/40

Poisson Process Model

The variance of the rate can be estimated as

$$\widehat{\mathrm{var}}(\hat\lambda) = \frac{\hat\lambda}{T} = \frac{d}{T^2} = \frac{\hat\lambda^2}{d}$$

Large sample hypothesis tests and confidence intervals can be based on this result. For example, a 95% confidence interval for $\lambda$ would be

$$\hat\lambda \pm 1.96\,\mathrm{se}(\hat\lambda) = \hat\lambda \pm 1.96\,\hat\lambda/\sqrt{d}$$

Alternatively, since $\lambda$ is positive, a more accurate approximation may be obtained by calculating a confidence interval for $\log(\lambda)$ using the delta method and transforming back.

This method gives $\mathrm{var}(\log \hat\lambda) \approx 1/d$, so a 95% confidence interval for $\lambda$ is

$$\left(\hat\lambda e^{-1.96/\sqrt{d}},\ \hat\lambda e^{+1.96/\sqrt{d}}\right)$$
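The log-scale variance follows from a single first-order delta-method step. As a sketch, take $g(\lambda) = \log \lambda$, so that $g'(\lambda) = 1/\lambda$, and plug in the variance estimate $\hat\lambda^2/d$ obtained above:

$$\widehat{\mathrm{var}}(\log \hat\lambda) \approx \left[g'(\hat\lambda)\right]^2 \widehat{\mathrm{var}}(\hat\lambda) = \frac{1}{\hat\lambda^2} \cdot \frac{\hat\lambda^2}{d} = \frac{1}{d}$$

The interval for $\log \lambda$ is then $\log \hat\lambda \pm 1.96/\sqrt{d}$, and exponentiating both endpoints gives the interval for $\lambda$.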


Example

According to the New York State Cancer Registry, there were 524 cancer deaths amongst males aged 45-49 in New York State in 2000. The corresponding mid-year population for this age group is 649,533, and we take this to be an approximation to $T$, the total number of years of exposure for the year 2000. The estimated mortality rate is then $\hat\lambda = d/T = 524/649{,}533 = 0.000807$, or about 80.7 per 100,000 per year.

A 95% confidence interval for $\lambda$ (per 100,000 per year), assuming a normal distribution for $\hat\lambda$, is

$$\hat\lambda \pm 1.96\,\hat\lambda/\sqrt{d} = 80.7 \pm 1.96 \times 80.7/\sqrt{524} = (73.8, 87.6)$$

Using a log transformation, a 95% confidence interval for $\lambda$ is

$$\left(\hat\lambda e^{-1.96/\sqrt{d}},\ \hat\lambda e^{+1.96/\sqrt{d}}\right) = \left(80.7\, e^{-1.96/\sqrt{524}},\ 80.7\, e^{+1.96/\sqrt{524}}\right) = (74.1, 87.9)$$
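The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the two figures quoted above; the function names `wald_ci` and `log_ci` are chosen here for illustration.

```python
import math

d = 524        # cancer deaths, males aged 45-49, New York State, 2000
T = 649_533    # mid-year population, taken as person-years of exposure

rate_per_100k = 1e5 * d / T   # estimated rate per 100,000 per year

def wald_ci(rate, events, z=1.96):
    """Normal-approximation interval: rate +/- z * rate / sqrt(events)."""
    se = rate / math.sqrt(events)
    return rate - z * se, rate + z * se

def log_ci(rate, events, z=1.96):
    """Interval built on the log scale, where var(log rate) is about 1/events."""
    factor = math.exp(z / math.sqrt(events))
    return rate / factor, rate * factor

wald = wald_ci(rate_per_100k, d)
logscale = log_ci(rate_per_100k, d)
print(round(rate_per_100k, 1))
print([round(x, 1) for x in wald])
print([round(x, 1) for x in logscale])
```

The log-scale interval is not symmetric about the estimate: it extends slightly further above than below, and it can never include negative values.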

Standardized Rates

Age-specific rates and crude rates

Since mortality and disease rates usually show considerable variation by age, we are

often interested in looking at a series of rates, one for each age or age group. These are

termed age-specific rates.

• In the above example, the age-specific rate for the 45-49 year age group is 80.7 per

100,000 per year.

• The age-specific rates for this population vary considerably, from 2.2 per 100,000

per year in the 5-9 and 10-14 year age groups, to 2585.5 in the 85+ age group.

• For similar reasons, it is also useful to estimate rates separately for males and

females.


The crude rate is the total number of deaths (in all age groups), divided by the total

exposure.

Crude rates reflect not only the level of risk in the population, but also the age

distribution. So, comparison of crude rates amongst different populations can be

confounded by age. For example, a higher crude rate might simply reflect an older

population distribution, rather than a real difference in risk.

Valid comparisons can be made by comparing series of age-specific rates, but sometimes

a single summary measure is required that allows comparison between populations. Such

a summary measure is referred to as an age-adjusted or standardized rate.


Direct Standardization

Let $\hat\lambda_k = d_k/t_k$ be the age-specific rate for age group $k = 1, \ldots, K$, where $d_k$ is the number of events and $t_k$ is the total exposure time for age group $k$.

Then, similar to the weighted mean rate, the crude rate can be written as a weighted average of the age-specific death rates

$$\hat\lambda_c = \frac{\sum_k t_k \hat\lambda_k}{\sum_k t_k} = \sum_k \frac{t_k}{T} \hat\lambda_k$$

where the weights are equal to the fraction of exposure in each age group.

The idea of direct standardization is to replace the exposure proportions obtained from

the study population of interest, with the corresponding proportions obtained from some

standard population.


For example, if we wanted to compare two different populations, we would use the same

standard population for each, thereby removing effects merely due to differing age

structures.

So, the directly standardized rate is obtained by applying the study population age-specific rates to the standard population exposure proportions

$$\hat\lambda_{dir} = \sum_k \frac{t_{sk}}{T_s} \hat\lambda_k$$

where $t_{sk}$ and $T_s$ are the exposure in age group $k$ and the total exposure, respectively, for the standard population.


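The variance of the directly standardized rate can be estimated as

$$\widehat{\mathrm{var}}(\hat\lambda_{dir}) = \sum_k \left(\frac{t_{sk}}{T_s}\right)^2 \mathrm{var}(\hat\lambda_k) = \sum_k \left(\frac{t_{sk}}{T_s}\right)^2 \frac{\hat\lambda_k^2}{d_k}$$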

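Following direct standardization of rates for two populations A and B, comparisons can be made using the standardized rate ratio

$$\mathrm{SRR} = \hat\lambda_A / \hat\lambda_B$$

where $\hat\lambda_A$ and $\hat\lambda_B$ are the directly standardized rates for the two populations.

The delta method applied to $\log \mathrm{SRR}$ can be used to derive a variance estimate

$$\widehat{\mathrm{var}}(\log \mathrm{SRR}) = \widehat{\mathrm{var}}(\log \hat\lambda_A) + \widehat{\mathrm{var}}(\log \hat\lambda_B) = \frac{\widehat{\mathrm{var}}(\hat\lambda_A)}{\hat\lambda_A^2} + \frac{\widehat{\mathrm{var}}(\hat\lambda_B)}{\hat\lambda_B^2}$$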
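The direct standardization formulas can be combined into a short calculation. The Python sketch below is illustrative only: the counts, exposures and standard population are made-up values, the function names are hypothetical, and the 95% interval for the SRR is built on the log scale by analogy with the log-scale interval for a single rate, which assumes the two populations are independent.

```python
import math

def direct_standardized(events, exposure, std_exposure):
    """Return (rate, variance) of the directly standardized rate.

    events[k], exposure[k]: study-population events d_k and exposure t_k
    std_exposure[k]:        standard-population exposure t_sk
    """
    total_std = sum(std_exposure)
    rate = 0.0
    var = 0.0
    for dk, tk, tsk in zip(events, exposure, std_exposure):
        w = tsk / total_std
        rate += w * dk / tk
        # d_k / t_k^2 equals lambda_k^2 / d_k and is also defined when d_k = 0
        var += w ** 2 * dk / tk ** 2
    return rate, var

def srr_with_ci(rate_a, var_a, rate_b, var_b, z=1.96):
    """Standardized rate ratio with a log-scale confidence interval."""
    srr = rate_a / rate_b
    se_log = math.sqrt(var_a / rate_a ** 2 + var_b / rate_b ** 2)
    return srr, srr * math.exp(-z * se_log), srr * math.exp(z * se_log)

# Hypothetical data for three age groups.
std_exposure = [5000, 3000, 2000]
events_a, exposure_a = [10, 30, 60], [10000, 8000, 5000]
events_b, exposure_b = [8, 20, 90], [9000, 6000, 9000]

rate_a, var_a = direct_standardized(events_a, exposure_a, std_exposure)
rate_b, var_b = direct_standardized(events_b, exposure_b, std_exposure)
srr, srr_lo, srr_hi = srr_with_ci(rate_a, var_a, rate_b, var_b)
print(rate_a, rate_b, srr, srr_lo, srr_hi)
```

Because both populations are weighted by the same `std_exposure`, any difference between `rate_a` and `rate_b` reflects the age-specific rates rather than the age structures.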

Indirect Standardization

In contrast to direct standardization, indirect standardization applies age-specific rates for some standard population to the exposure proportions in the study population. This gives an "expected" number of deaths in the study population

$$E = \sum_k t_k \lambda_{sk}$$

where $\lambda_{sk}$ are the age-specific rates for the standard population.

The standardized mortality ratio is then calculated as the ratio of observed to expected deaths

$$\mathrm{SMR} = \frac{d}{E} = \frac{\sum_k t_k \hat\lambda_k}{\sum_k t_k \lambda_{sk}}$$


An indirectly standardized rate can be obtained by multiplying the crude rate in the

standard population by the SMR, although in practice we are often primarily interested

in the SMR itself.

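If we can regard the expected deaths as a constant, then the variance of the SMR can be estimated as

$$\widehat{\mathrm{var}}(\mathrm{SMR}) = \frac{\widehat{\mathrm{var}}(D)}{E^2} = \frac{d}{E^2} = \frac{\mathrm{SMR}^2}{d}$$

Exact confidence intervals can also be constructed from the chi-square distribution

$$\left[\frac{\chi^2_{2D,\,\alpha/2}}{2E},\ \frac{\chi^2_{2(D+1),\,1-\alpha/2}}{2E}\right]$$

where $\chi^2_{\nu,\,p}$ denotes the $p$-quantile of the chi-square distribution with $\nu$ degrees of freedom.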
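The chi-square limits correspond, through the standard relationship between the Poisson and chi-square distributions, to the values of the Poisson mean at which the observed count falls in a tail of probability $\alpha/2$. The Python sketch below finds those values by bisection using only the standard library, then divides by $E$. The observed and expected counts are hypothetical, the function names are chosen here for illustration, and the search bound assumes a conventional $\alpha$ such as 0.05.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), with terms computed on the log scale."""
    if k < 0:
        return 0.0
    if mu == 0.0:
        return 1.0
    log_mu = math.log(mu)
    terms = (math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, math.fsum(terms))

def solve_by_bisection(f, lo, hi, iterations=200):
    """Root of a monotone function f that changes sign on [lo, hi]."""
    sign_lo = f(lo) > 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_poisson_limits(d, alpha=0.05):
    """Exact limits for a Poisson mean given d observed events."""
    hi = 2.0 * d + 20.0  # assumed search bound, above the upper limit for alpha = 0.05
    if d == 0:
        lower = 0.0
    else:
        # Lower limit: mean at which P(X >= d) = alpha / 2
        lower = solve_by_bisection(
            lambda mu: (1.0 - poisson_cdf(d - 1, mu)) - alpha / 2, 0.0, hi)
    # Upper limit: mean at which P(X <= d) = alpha / 2
    upper = solve_by_bisection(lambda mu: poisson_cdf(d, mu) - alpha / 2, 0.0, hi)
    return lower, upper

def smr_exact_ci(d, expected, alpha=0.05):
    """SMR = d / E with exact limits, treating E as a constant."""
    lower, upper = exact_poisson_limits(d, alpha)
    return d / expected, lower / expected, upper / expected

# Hypothetical example: 10 observed deaths against 8.0 expected.
smr, smr_lo, smr_hi = smr_exact_ci(10, 8.0)
print(smr, smr_lo, smr_hi)
```

Where a statistics library is available, calling its chi-square quantile function directly would be simpler; the bisection is used here only to keep the sketch self-contained.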


Example

In the US state of Michigan from 1950 to 1964, 731,177 babies were first-born to their

mothers, and of these, 412 were affected by Down’s syndrome. In the same period,

442,811 babies were the fifth-born or more to their mothers, and of these, 740 were

affected by Down’s syndrome.

The crude “rates” (strictly speaking, prevalence proportions) of Down’s syndrome are:

• 412 × 100,000/731,177 = 56.3 per 100,000 births for first-borns, and

• 740 × 100,000/442,811 = 167.1 per 100,000 births for fifth and later-borns.

This is not a fair comparison, however, because incidence of Down’s syndrome is known

to increase with maternal age, and mothers of fifth and later-borns will tend to be older

than mothers of first-borns. That is, maternal age is a confounder in the association

between birth order and Down’s syndrome.


Example: Down’s Syndrome

The following table shows the maternal-age-specific proportions of births and prevalence

proportions, separately for the whole state of Michigan (the standard population), and

for first-borns and fifth and later-borns (the study populations) for the period 1950-1964.

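Proportion of births in each maternal age range, and age-specific rate (per 100,000 births):

| Maternal age | Proportion: Michigan | Proportion: First-born | Proportion: Fifth or later-born | Rate: Michigan | Rate: First-born | Rate: Fifth or later-born |
|---|---|---|---|---|---|---|
| Under 20 | 0.113 | 0.315 | 0.001 | 42.5 | 46.5 | 0.0 |
| 20-24 | 0.330 | 0.451 | 0.069 | 42.5 | 42.8 | 26.1 |
| 25-29 | 0.278 | 0.157 | 0.279 | 52.3 | 52.2 | 51.0 |
| 30-34 | 0.173 | 0.054 | 0.339 | 87.7 | 101.3 | 74.7 |
| 35-39 | 0.084 | 0.019 | 0.235 | 264.0 | 274.5 | 251.7 |
| 40+ | 0.022 | 0.004 | 0.078 | 864.4 | 819.1 | 857.8 |
| Crude rate | | | | 89.5 | 56.3 | 167.1 |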


The directly standardized rates are computed in the following table by applying the

study population prevalence proportions to the standard population birth proportions.

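Direct standardized rate (contribution of each maternal age group, per 100,000 births):

| Maternal age | First-born | Fifth or later-born |
|---|---|---|
| Under 20 | 5.255 | 0.000 |
| 20-24 | 14.124 | 8.613 |
| 25-29 | 14.512 | 14.178 |
| 30-34 | 17.525 | 12.923 |
| 35-39 | 23.058 | 21.143 |
| 40+ | 18.020 | 18.872 |
| Direct standardized rate | 92.5 | 75.7 |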
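The direct standardized rates in this table can be reproduced from the maternal-age-specific figures tabulated earlier: each entry is a Michigan birth proportion multiplied by the corresponding study population rate. A minimal Python sketch, with variable and function names chosen here for illustration:

```python
# Standard population (Michigan): proportion of births in each maternal age group
michigan_props = [0.113, 0.330, 0.278, 0.173, 0.084, 0.022]

# Study populations: age-specific rates per 100,000 births
first_born_rates = [46.5, 42.8, 52.2, 101.3, 274.5, 819.1]
later_born_rates = [0.0, 26.1, 51.0, 74.7, 251.7, 857.8]

def direct_rate(std_props, study_rates):
    """Sum over age groups of (standard proportion) x (study rate)."""
    return sum(w * r for w, r in zip(std_props, study_rates))

direct_first = direct_rate(michigan_props, first_born_rates)
direct_later = direct_rate(michigan_props, later_born_rates)
print(round(direct_first, 1), round(direct_later, 1))
```

After adjusting for maternal age the ordering of the two groups reverses relative to the crude rates of 56.3 and 167.1.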


This table illustrates the calculation of the SMRs and the indirectly standardized rates.

Expected cases (per 100,000 births):

| Maternal age | First-born | Fifth or later-born |
|---|---|---|
| Under 20 | 13.388 | 0.043 |
| 20-24 | 19.168 | 2.933 |
| 25-29 | 8.211 | 14.592 |
| 30-34 | 4.736 | 29.730 |
| 35-39 | 5.016 | 62.040 |
| 40+ | 3.458 | 67.423 |
| Expected cases | 53.976 | 176.760 |
| SMR | 1.044 | 0.945 |
| Indirect standardized rate | 93.4 | 84.6 |
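The indirect calculation can be reproduced in the same way. In this sketch the expected cases per 100,000 births come from applying the Michigan age-specific rates to each study population's birth proportions, the SMR is the observed crude rate divided by that expected figure, and the indirectly standardized rate is the Michigan crude rate (89.5) multiplied by the SMR. Variable and function names are chosen here for illustration.

```python
# Study populations: proportion of births in each maternal age group
first_born_props = [0.315, 0.451, 0.157, 0.054, 0.019, 0.004]
later_born_props = [0.001, 0.069, 0.279, 0.339, 0.235, 0.078]

# Standard population (Michigan): age-specific rates and crude rate per 100,000 births
michigan_rates = [42.5, 42.5, 52.3, 87.7, 264.0, 864.4]
michigan_crude = 89.5

# Observed crude rates per 100,000 births, from the counts quoted in the example
observed_first = 412 * 100_000 / 731_177
observed_later = 740 * 100_000 / 442_811

def expected_rate(study_props, std_rates):
    """Expected cases per 100,000 births under the standard population's rates."""
    return sum(p * r for p, r in zip(study_props, std_rates))

expected_first = expected_rate(first_born_props, michigan_rates)
expected_later = expected_rate(later_born_props, michigan_rates)

smr_first = observed_first / expected_first
smr_later = observed_later / expected_later

indirect_first = michigan_crude * smr_first
indirect_later = michigan_crude * smr_later

print(round(expected_first, 3), round(smr_first, 3), round(indirect_first, 1))
print(round(expected_later, 3), round(smr_later, 3), round(indirect_later, 1))
```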
