DATA2002
Goodness of fit tests for discrete distributions
Garth Tarr
The University of Sydney
Radiation exposure
The goal in biological dosimetry is to estimate the dose of ionizing radiation absorbed by an exposed individual by using chromosome damage in peripheral lymphocytes.
When radiation exposure occurs, the damage in DNA is randomly distributed between
cells producing chromosome aberrations. The outcome of interest is the number of
aberrations observed. The number of aberrations typically follows a Poisson distribution,
the rate of which depends on the dose.
The table below shows the number of chromosome aberrations from a patient exposed to radiation after the nuclear accident of Stamboliyski (Bulgaria) in 2011 (Puig & Weiß, 2020).

| Number of aberrations | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Frequency | 117 | 94 | 51 | 15 | 0 | 0 | 0 | 1 |

How can we check if the random variable generating this data follows a Poisson distribution?
Poisson distribution
A Poisson random variable represents the probability of a given number of events occurring in a fixed interval (e.g. the number of events in a fixed period of time) if these events occur independently with some known average rate per unit time.

If $X$ is a Poisson random variable with rate parameter $\lambda$, the probability mass function is:

$$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$
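A minimal sketch (with illustrative values of $\lambda$ and $k$) checking that the built-in dpois() function agrees with this formula:

lambda = 2                               # illustrative rate parameter
k = 0:5                                  # first few possible outcomes
dpois(k, lambda = lambda)                # P(X = k) from the built-in pmf
exp(-lambda) * lambda^k / factorial(k)   # the same probabilities from the formula above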
λ = 2
plot(table(rpois(n = 10000, lambda = 2)),
     ylab = "Count")
λ = 6
rpois(n = 10000, lambda = 6) |>
  table() |>
  plot(ylab = "Count")
Chi-squared tests for discrete distributions
Suppose we have a sample of observations $x_1, x_2, \ldots, x_n$.

We want to test whether the sample is taken from a population with a given distribution function $F_0(x \mid \theta_1, \theta_2, \ldots, \theta_h)$ where $\theta_1, \theta_2, \ldots, \theta_h$ are parameters of the distribution.

We summarise the observed data by tabulating the observed frequencies $y_i$ for each possible outcome and compare them to the corresponding expected frequencies, $e_i$, calculated using the expected probabilities, $p_i$, from the hypothesised distribution, $F_0(x \mid \theta_1, \theta_2, \ldots, \theta_h)$.

This is a general chi-squared goodness-of-fit test with test statistic,

$$T = \sum_{i=1}^{k} \frac{(Y_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i},$$

where $i = 1, 2, \ldots, k$ iterates over the distinct outcomes.
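As a minimal sketch of this calculation (the counts y and probabilities p below are made-up placeholders, not the radiation data):

y = c(20, 30, 25, 25)             # made-up observed counts for k = 4 outcomes
p = c(0.25, 0.25, 0.25, 0.25)     # hypothesised probabilities (must sum to 1)
n = sum(y)                        # sample size
e = n * p                         # expected counts e_i = n * p_i
(t0 = sum((y - e)^2 / e))         # chi-squared goodness-of-fit test statistic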
Chi-squared tests for discrete distributions
However the model parameters $\theta_1, \theta_2, \ldots, \theta_h$ are usually unknown and have to be estimated from the sample.

In this case, the expected probabilities $p_i$ are replaced by their estimates $\hat{p}_i$.

Then the observed test statistic is

$$t_0 = \sum_{i=1}^{k} \frac{(y_i - n\hat{p}_i)^2}{n\hat{p}_i},$$

and the approximate p-value is

$$P(\chi^2_{k-1-q} \geq t_0).$$

Note the degrees of freedom are $k - 1 - q$ where $q$ is the number of parameters we need to estimate.
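A minimal sketch of this p-value calculation with pchisq() (the values of t0, k and q are made-up placeholders):

t0 = 3.1                                         # made-up observed test statistic
k = 5                                            # made-up number of categories
q = 1                                            # number of parameters estimated from the sample
pchisq(t0, df = k - 1 - q, lower.tail = FALSE)   # approximate p-value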
Radiation exposure
Let $X$ be a random variable such that $X \sim \text{Poisson}(\lambda)$.

Let $y_i$ be the observed frequency of outcome $i$.

The expected probabilities are given by the probability mass function,

$$P(X = i) = p_i = \frac{e^{-\lambda}\lambda^i}{i!} \quad \text{for } i = 0, 1, 2, \ldots,$$

where $p_i$ denotes the probability of the number of chromosome aberrations in the $i$-th category.

Note that for a Poisson distribution both $E(X)$ and $\text{Var}(X)$ are equal to the parameter $\lambda$.
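A minimal sketch (using simulated data with an arbitrary illustrative rate) showing that the sample mean and variance of Poisson data are both close to $\lambda$:

set.seed(1)                             # for reproducibility
sim = rpois(n = 10000, lambda = 1.5)    # simulate counts with an arbitrary illustrative rate
mean(sim)                               # close to lambda
var(sim)                                # also close to lambda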
We need to estimate $\lambda$ by the sample mean: $\hat{\lambda} = \bar{x} = 248/278 = 0.89$.

| $i$ | $y_i$ | $\hat{p}_i = e^{-\hat{\lambda}}\hat{\lambda}^i/i!$ | $\hat{e}_i = n\hat{p}_i$ | $(y_i - \hat{e}_i)^2/\hat{e}_i$ |
|---|---|---|---|---|
| 0 | 117 | $\frac{0.89^0 e^{-0.89}}{0!} = 0.4098$ | $278 \times 0.4098 = 113.92$ | $\frac{(117 - 113.92)^2}{113.92} = 0.08$ |
| 1 | 94 | $\frac{0.89^1 e^{-0.89}}{1!} = 0.3656$ | $278 \times 0.3656 = 101.63$ | $\frac{(94 - 101.63)^2}{101.63} = 0.57$ |
| 2 | 51 | $\frac{0.89^2 e^{-0.89}}{2!} = 0.1631$ | $278 \times 0.1631 = 45.33$ | $\frac{(51 - 45.33)^2}{45.33} = 0.71$ |
| 3 | 15 | $\frac{0.89^3 e^{-0.89}}{3!} = 0.0485$ | $278 \times 0.0485 = 13.48$ | $\frac{(15 - 13.48)^2}{13.48} = 0.17$ |
| 4 | 0 | $\frac{0.89^4 e^{-0.89}}{4!} = 0.0108$ | $278 \times 0.0108 = 3.01$ | $\frac{(0 - 3.01)^2}{3.01} = 3.01$ |
| 5 | 0 | $\frac{0.89^5 e^{-0.89}}{5!} = 0.0019$ | $278 \times 0.0019 = 0.54$ | $\frac{(0 - 0.54)^2}{0.54} = 0.54$ |
| 6 | 0 | $\frac{0.89^6 e^{-0.89}}{6!} = 0.0003$ | $278 \times 0.0003 = 0.08$ | $\frac{(0 - 0.08)^2}{0.08} = 0.08$ |
| $\geq 7$ | 1 | $0.00004$ | $278 \times 0.00004 = 0.01$ | $\frac{(1 - 0.01)^2}{0.01} = 96.40$ |
| Total | 278 | 1 | 278 | 101.56 |
But wait! There are a number of cells where the expected number of counts is less than 5, which violates an assumption.

We combine the last five classes so that the final group is $\geq 3$ with observed frequency $y_{\geq 3} = 15 + 1 = 16$; the expected count is $\hat{e}_{\geq 3} = 13.48 + 3.01 + 0.54 + 0.08 + 0.01 = 17.12$:

| $i$ | $y_i$ | $\hat{p}_i = e^{-\hat{\lambda}}\hat{\lambda}^i/i!$ | $\hat{e}_i = n\hat{p}_i$ | $(y_i - \hat{e}_i)^2/\hat{e}_i$ |
|---|---|---|---|---|
| 0 | 117 | $\frac{0.89^0 e^{-0.89}}{0!} = 0.4098$ | $278 \times 0.4098 = 113.92$ | $\frac{(117 - 113.92)^2}{113.92} = 0.08$ |
| 1 | 94 | $\frac{0.89^1 e^{-0.89}}{1!} = 0.3656$ | $278 \times 0.3656 = 101.63$ | $\frac{(94 - 101.63)^2}{101.63} = 0.57$ |
| 2 | 51 | $\frac{0.89^2 e^{-0.89}}{2!} = 0.1631$ | $278 \times 0.1631 = 45.33$ | $\frac{(51 - 45.33)^2}{45.33} = 0.71$ |
| $\geq 3$ | 16 | $1 - 0.4098 - 0.3656 - 0.1631 = 0.0615$ | $278 \times 0.0615 = 17.12$ | $\frac{(16 - 17.12)^2}{17.12} = 0.07$ |
| Total | 278 | 1 | 278 | 1.43 |

The final observed test statistic is $t_0 = 0.08 + 0.57 + 0.71 + 0.07 = 1.43$.
Hypothesis: $H_0$: the data follow a Poisson distribution vs $H_1$: the data do not follow a Poisson distribution.

Assumptions: The expected frequencies, $e_i = np_i \geq 5$. Observations are independent.

Test statistic: $T = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i}$. Under $H_0$, $T \sim \chi^2_2$ approximately.

Observed test statistic: $t_0 = 1.43$.

P-value: $P(T \geq t_0) = P(\chi^2_2 \geq 1.43) = 0.49$.

Decision: Since the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with a Poisson distribution.

Important

We don't conclude that $H_0$ is true, just that the data are consistent with $H_0$. This is a subtle distinction.
If we wanted to implement this “manually” in R:

y = c(117, 94, 51, 15, 0, 0, 0, 1) # input the observed counts
x = 0:7 # define the corresponding groups
n = sum(y) # total number of samples (sample size)
k = length(y) # number of groups
(lam = sum(y * x)/n) # estimate the lambda parameter
[1] 0.8920863
p = dpois(x, lambda = lam) # obtain the p_i from the Poisson pmf
p
[1] 4.097999e-01 3.655769e-01 1.630631e-01 4.848878e-02 1.081404e-02
[6] 1.929412e-03 2.868670e-04 3.655859e-05
p[8] = 1 - sum(p[1:7]) # redefine the 8th element P(>=7) NOT P(7)
round(p, 5)
[1] 0.40980 0.36558 0.16306 0.04849 0.01081 0.00193 0.00029 0.00004
(ey = n * p) # calculate the expected frequencies
[1] 113.92436722 101.63037076 45.33153228 13.47988010 3.00630420
[6] 0.53637658 0.07974904 0.01141984
ey >= 5 # check assumption e_i >= 5: not all satisfied
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Collapsing categories, so that the assumptions are met:

(yr = c(y[1:3], sum(y[4:8]))) # reduced category counts
[1] 117 94 51 16
(eyr = c(ey[1:3], sum(ey[4:8]))) # reduced category expected cell counts
[1] 113.92437 101.63037 45.33153 17.11373
all(eyr >= 5) # check that all expected cell counts are >= 5
[1] TRUE
(pr = c(p[1:3], sum(p[4:8]))) # reduced category hypothesised probabilities
[1] 0.40979988 0.36557687 0.16306307 0.06156018
kr = length(yr) # number of combined classes
(t0 = sum((yr - eyr)^2/eyr)) # test statistic
[1] 1.43721
(pval = 1 - pchisq(t0, df = kr - 1 - 1)) # p-value
[1] 0.4874317
What if we try to use chisq.test()?

chisq = chisq.test(yr, p = pr)
chisq
Chi-squared test for given probabilities
data: yr
X-squared = 1.4372, df = 3, p-value = 0.6968
Why are the degrees of freedom in chisq.test() wrong? (By default chisq.test() uses $k - 1 = 3$ degrees of freedom; it doesn't know that $\hat{\lambda}$ was estimated from the data, so we need $k - 1 - q = 2$.)
Is there a slightly easier way to get the right answer?
chisq$statistic
X-squared
1.43721
pchisq(unname(chisq$statistic), df = 2, lower.tail = FALSE)
[1] 0.4874317
Visualising the comparison between observed and expected counts.

xr = c("0", "1", "2", ">=3") # group labels
barplot(yr, names.arg = xr, main = "Observed frequency")
barplot(eyr, names.arg = xr, main = "Expected frequency")
Visualising the comparison between observed and expected counts.

dat = tibble::tibble(
aberrations = factor(
xr,
levels = c("0","1","2",">=3")
),
observed = yr,
expected = eyr
)
dat |> ggplot() +
aes(x = aberrations, y = observed) +
geom_col(alpha = 0.5) +
geom_point(aes(y = eyr),
colour = "blue",
size = 6) +
labs(y = "Count",
x = "Number of aberrations") +
theme_classic(base_size = 40)
R packages and functions
rpois() generate pseudo-random data from a Poisson distribution
dpois() probabilities from a Poisson distribution
table() tabulate discrete data
nrow(), mean(), sd() and var() count rows and compute the sample mean, standard deviation and variance
plot() a generic function that generates different plots depending on what you feed it.
E.g. when you feed it a table object, it plots a frequency distribution.
1:n returns a vector of integers between 1 and n inclusive
x[1:3] returns the first 3 values in the vector x
x[4] = 41 sets the 4th element in the object x to be 41 (see the short demo below)
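A minimal sketch (with a made-up vector x) illustrating the indexing and tabulation functions listed above:

x = c(10, 20, 30, 40, 50)    # a small made-up vector
1:3                          # the integers 1, 2, 3
x[1:3]                       # first three values of x
x[4] = 41                    # overwrite the 4th element of x
table(c(0, 1, 1, 2, 2, 2))   # tabulate discrete data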
References
For further details see Larsen & Marx (2012), sections 10.3 and 10.4.

Larsen, R. J., & Marx, M. L. (2012). An introduction to mathematical statistics and its applications (5th ed.). Boston, MA: Prentice Hall.

Puig, P., & Weiß, C. H. (2020). Some goodness-of-fit tests for the Poisson distribution with applications in biodosimetry. Computational Statistics & Data Analysis, 144, 106878. DOI: 10.1016/j.csda.2019.106878