Date: 2021-11-03

STATS 330
THE UNIVERSITY OF AUCKLAND
SEMESTER ONE 2020
Campus: City
STATISTICS
Advanced Statistical Modelling
(Time allowed: TWO hours)
Time Allowed: This Final Assessment has been designed so that a well-prepared
student could complete it within two hours. From the 1pm release time you will have
24 hours to complete and submit your assessment. No marks will be deducted for
taking longer than two hours within that 24-hour period, but you must submit before
the deadline.
INSTRUCTIONS
• This assessment is open book: you are permitted to access your course manuals
and other written material, including online resources.
• Calculators and computers are permitted.
• It is your responsibility to ensure your assessment is successfully submitted on
time.
• We recommend you aim to submit a couple of hours in advance of the deadline,
to allow time to deal with any technical issues that might arise.
• We STRONGLY recommend you download your submitted document from Canvas,
after submitting it, to verify you have uploaded the correct document.
• Attempt all questions in Part A and Part B.
• There are 100 marks in total for this examination.
Support:
• If you have any concerns regarding your Final Assessment, please call the Contact
Centre for advice, rather than your instructors.
• The Contact Centre can be reached on these numbers:
– Auckland: 09 373 7513
– Outside Auckland: 0800 61 62 63
– International: +64 9 373 7513
• For any Canvas issues, please use 24/7 help on Canvas by chat or phone.
• If any corrections are announced during the 24 hours of the final assessment, you
will be notified by a Canvas Announcement. Please ensure your notifications
are turned on during this period.
Question Interpretation:
Please note that during the final assessment period you cannot contact your instructors
for clarification on how to interpret the wording of any specific questions or to
verify that your answer is correct. Interpreting wording and making appropriate
assumptions is part of what is being assessed. You will need to interpret the question
yourself and check your own answers.
If you believe there is a typo, first re-read the question to check you have not
misunderstood the question, as it is very common for students to misread questions. If you
still believe there is a typo, please phone the Contact Centre.
Academic Honesty Declaration:
By completing this assessment, I agree to the following declaration:
I understand the University expects all students to complete coursework with integrity
and honesty. I promise to complete all online assessment with the same academic
integrity standards and values. Any identified form of poor academic practice or
academic misconduct will be followed up and may result in disciplinary action.
As a member of the University’s student body, I will complete this assessment in a
fair, honest, responsible and trustworthy manner. This means that:
• I declare that this assessment is my own work.
• I will not seek out any unauthorised help in completing this assessment.
• I am aware the University of Auckland may use plagiarism detection tools to
check my content.
• I will not discuss the content of the assessment with anyone else in any form,
including Canvas, Piazza, Facebook, Twitter or any other social media or online
platform, within the assessment period.
• I will not reproduce the content of this assessment anywhere in any form at any
time.
• I declare that I generated the calculations and data in this assessment independently,
using only the tools and resources defined for use in this assessment.
• I will not share or distribute any tools or resources I developed for completing
this assessment.
PART A
1. [32 marks] A conservationist was interested in exploring the impact of
changes in habitat structure on the density of understorey-foraging birds in a
private reserve. A stockproof fence had previously been constructed on the
reserve to limit the movement of feral animals. The northern side of the fence
was classified as heavily-grazed with regular sightings of feral animals, while the
southern side of the fence was classified as moderately-grazed with infrequent
sightings of feral animals.
A bird survey was carried out in a number of sites, each of the same area, on
either side of the stockproof fence. Over a 20-minute period, the number of
birds observed or heard in each site was recorded.
A cull of feral animals then took place.
A second bird survey was then carried out at the same sites over a 20-minute
period. The conservationist wanted to explore both the effect of the initial
grazing conditions and the impact of the cull on the number of bird sightings.
Variable  Description
Grazing   Grazing status of the site. Either "Heavy" or "Moderate"
Cull      Cull status. Either "Before" for the first bird survey or "After" for
          the second bird survey
Number    Number of birds observed or heard during 20-minute bird survey for
          a particular combination of the above variables
In total there are 62 rows in the data set. The first five rows are displayed below:
  Number  Grazing   Cull
1      0 Moderate Before
2      3 Moderate Before
3      1 Moderate Before
4     19 Moderate Before
5      8 Moderate Before
The following model was fitted to these data:
> conservation.fit1 <- glm(Number ~ Cull*Grazing, poisson, conservation.df)
> summary(conservation.fit1)
Call:
glm(formula = Number ~ Cull * Grazing, family = poisson, data = conservation.df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-4.805  -2.452  -1.020   0.234   8.936

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                  1.8563     0.0884   21.00  < 2e-16 ***
CullBefore                  -0.6776     0.1523   -4.45  8.6e-06 ***
GrazingModerate              0.4463     0.1300    3.43    6e-04 ***
CullBefore:GrazingModerate   0.8213     0.2004    4.10  4.2e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 528.07 on 61 degrees of freedom
Residual deviance: 437.23 on 58 degrees of freedom
AIC: 624.7

Number of Fisher Scoring iterations: 6
(a) Write equations to fully describe the fitted model conservation.fit1.
[5 marks]
(b) Interpret the effect of Cull in the model conservation.fit1. Although
confidence interval interpretations are preferred, this information has not
been provided. Therefore your interpretation will need to be in terms of
point estimates. [5 marks]
(c) Based on the deviance of conservation.fit1, is there evidence to suggest
that the model does not fit the data? Explain your answer. [2 marks]
Because of problems with overdispersion, a second model was fitted to these
data:
> library("MASS")
> conservation.fit2 <- glm.nb(Number ~ Cull*Grazing, data=conservation.df)
(d) How does the model conservation.fit2 deal with the problem of
overdispersion? [3 marks]
(e) Consider summary(conservation.fit2)—deliberately not shown.
(i) How would the estimates compare with those from
conservation.fit1? Justify your answer.
(ii) How would the standard errors compare with those from
conservation.fit1? Justify your answer.
[5 marks]
> logLik(conservation.fit2)
'log Lik.' -180.06 (df=5)
(f) Calculate the AIC and BIC for the model conservation.fit2. [4 marks]
(g) Using AIC as your criterion, which model do you prefer between
conservation.fit1 and conservation.fit2? Justify your answer.
[3 marks]
(h) What would the consequence have been if the second bird survey was carried
out at the same sites over a 30-minute period rather than a 20-minute
period? Suggest a solution. [5 marks]
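The information criteria referred to in parts (f) and (g) can be written out from their standard definitions, AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n). The sketch below uses only the numbers quoted above (the log-likelihood, its degrees of freedom, and the 62 rows of data). The variable names loglik, k and n are illustrative and do not appear in the exam.

```r
# Values quoted in the exam output for conservation.fit2
loglik <- -180.06   # from logLik(conservation.fit2)
k      <- 5         # df reported by logLik()
n      <- 62        # number of rows in the data set

# Standard definitions of the two criteria
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

round(c(AIC = aic, BIC = bic), 2)
```

With the fitted object available, AIC(conservation.fit2) and BIC(conservation.fit2) should return the same quantities, and the AIC can be compared directly with the value of 624.7 printed for conservation.fit1.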
PART B
2. [8 Marks] Approximately, the weights of adult USA women have a mean
of 70.3 kg with a SD of 16.8 kg. Similarly, the weights of adult USA men
have a mean of 78.0 kg with a SD of 13.2 kg. Suppose that exactly half of the
population is of each gender and that weights are normally distributed. Use R
to solve the following problems, and copy-and-paste your R commands and output.
(a) Early last year Donald Trump weighed 243 pounds at a physical examination.
Compared to adult males from the USA, what upper percentile is he in? That is,
compute the proportion of adult males from the USA heavier than him.
Use 1 kg = 2.2046 pounds. [2 marks]
(b) An adult person from the USA happens to weigh 80 kg. Approximately, how many
times more likely is this person to be male than female? [2 marks]
(c) What is the probability that the weight of a randomly chosen man exceeds the
weight of a randomly chosen woman, both adults from the USA, by 20 kg or
more? [2 marks]
(d) If a woman is in the lower quartile of weights of the adult female USA
population, what is the upper bound for her weight? [2 marks]
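The four parts above map onto R's normal-distribution functions pnorm(), dnorm() and qnorm(). The following is a sketch of one way to set up the calculations, not a model answer. The variable names are illustrative, and the calculation for part (c) additionally assumes that the two weights are independent, which the question does not state explicitly.

```r
# Parameters stated in the question (kg)
mu_f <- 70.3; sd_f <- 16.8   # adult women
mu_m <- 78.0; sd_m <- 13.2   # adult men

# (a) Upper-tail proportion P(X > x) for a weight given in pounds
x_kg <- 243 / 2.2046
p_upper <- pnorm(x_kg, mean = mu_m, sd = sd_m, lower.tail = FALSE)

# (b) Ratio of the two normal densities at 80 kg
#     (the equal 50/50 population shares cancel)
ratio_80 <- dnorm(80, mu_m, sd_m) / dnorm(80, mu_f, sd_f)

# (c) Assuming independence, M - F is normal with
#     mean mu_m - mu_f and variance sd_m^2 + sd_f^2
p_diff <- pnorm(20, mean = mu_m - mu_f, sd = sqrt(sd_m^2 + sd_f^2),
                lower.tail = FALSE)

# (d) Lower quartile of the female weight distribution
q_lower <- qnorm(0.25, mean = mu_f, sd = sd_f)

c(p_upper = p_upper, ratio_80 = ratio_80, p_diff = p_diff, q_lower = q_lower)
```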
3. [9 Marks] Behind each of the following R code chunks is a purpose, e.g.,
it illustrates a statistical concept or result or idea.
(a) Consider the following R code.
N <- 3
Nsim <- 1e2
lambda <- 4
means <- numeric(Nsim)
for (i in 1:Nsim) {
  xvec <- rpois(N, lambda)
  means[i] <- mean(xvec)
}
hist(means)
(i) What statistical concept is illustrated by the output of the last line,
hist(means)? [1 mark]
(ii) What happens when only N is allowed to increase? Give the name of
any theoretical result associated with it. [2 marks]
(iii) What happens when only Nsim is allowed to increase? [1 mark]
(iv) What happens when only lambda is allowed to increase? [2 marks]
(b) Consider the following R code.
Nsim <- 1e2
N <- 10
sigma <- 0.5
xy0.df <- data.frame(x2 = runif(N))
meanfun <- function(x) -1.5 + 4 * x
xy0.df <- transform(xy0.df, y = meanfun(x2) + rnorm(N, 0, sigma))
myvec <- numeric(Nsim)
for (i in 1:Nsim) {
  xy0.df <- transform(xy0.df, ysim = meanfun(x2) + rnorm(N, 0, sigma))
  mod_i <- lm(ysim ~ x2, data = xy0.df)
  myvec[i] <- sum(resid(mod_i)^2) / df.residual(mod_i)
}
mean(myvec)
Comment on the output of the last line, mean(myvec). What statistical
concept does it illustrate, and is it successful? [3 marks]
4. [17 Marks] To study junior school children in New Zealand, a random
sample of 500 boys and 500 girls was selected from all possible New Zealand
primary school children. The data are in a data frame called kids.df, having
the following columns.
Variable      Description
age           in years
sex           1 = female, 0 = male
height        in metres
weight        in kg
tv            number of televisions at home
dbp           diastolic blood pressure in mm Hg
siblings      number of other brothers and sisters living at home
bothparents   both parents living at home? 1 = yes, 0 = no
bovs          born overseas? 1 = yes, 0 = no
Write a one- or two-line R LM or GLM call to investigate each of the following research
questions, e.g.,
lm(weight ~ siblings, data = kids.df)
(a) Of interest is to examine whether the number of televisions at home is
related to how many people there are living at home. Briefly comment on
your model, e.g., the assumptions made on the variable or variables used.
[4 marks]
(b) Of interest is to find if any variables can explain (any) obesity in the child,
as measured by body mass index (BMI; kg/m^2). [3 marks]
(c) Does the effect of the family size on the probability of both parents living
at home depend on whether or not the children have an overseas influence
or connection? [3 marks]
Now suppose we are interested in fitting some GAMs using gam.
(d) Write your answer to (b) as a GAM instead of a GLM. Which model would
you fit first—the GLM or GAM—and why? [4 marks]
(e) Comment on the appropriateness of fit1 below. [3 marks]
fit1 <- gam(height ~ tv + s(age, df = 15) + sex + s(bovs), poisson,
data = kids.df)
5. [6 Marks] A statistical model has likelihood function $L(p) \propto \exp\left(\tfrac{3}{2} p^2\right)$,
where p is the number of variables in the model (up to a maximum of 10).
Given a data set, we want to choose the number of variables using a penalty
function approach. Suppose the balancing parameter to be used is λ = 2 and
the penalty is $B = (p - 1)^2$. Using the negative log-likelihood as A, find the
optimal number of variables based on the quantity A + λB.
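The penalised criterion can also be evaluated numerically over a grid of candidate values. The sketch below reads the likelihood as exp((3/2) p^2) and the penalty as (p - 1)^2, so that A = -(3/2) p^2 up to an additive constant. The candidate range 1 to 10 and the names crit and p_opt are assumptions made for illustration.

```r
lambda <- 2
p <- 1:10                      # assumed candidate numbers of variables

A <- -(3 / 2) * p^2            # negative log-likelihood, up to a constant
B <- (p - 1)^2                 # penalty term
crit <- A + lambda * B         # quantity to be minimised

data.frame(p, A, B, crit)      # inspect the trade-off for each p
p_opt <- p[which.min(crit)]
p_opt
```

Algebraically, A + λB simplifies to 0.5 p^2 - 4p + 2, a convex quadratic in p, so the grid search and setting the derivative to zero should agree.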
6. [7 Marks] A large data set is separated into a training set and a test set.
(a) Is it necessary to do this randomly? Why or why not? [3 marks]
(b) In R how might this separation be done in a reproducible way? [1 mark]
(c) The statistician chooses 20% of the data for training and 80% for testing.
Comment briefly on this—2 or 3 lines would be plenty. [3 marks]
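A reproducible random split is commonly obtained by fixing the random seed before sampling row indices. The sketch below uses a simulated stand-in data frame, big.df, which is hypothetical. The seed value and the 80/20 proportion are likewise illustrative choices rather than anything specified in the question.

```r
set.seed(330)                  # any fixed seed makes the split repeatable

# Hypothetical stand-in for the large data set
big.df <- data.frame(id = 1:1000, y = rnorm(1000))

n <- nrow(big.df)
train_idx <- sample(n, size = round(0.8 * n))   # rows chosen at random
train.df <- big.df[train_idx, ]
test.df  <- big.df[-train_idx, ]                # all remaining rows

c(train = nrow(train.df), test = nrow(test.df))
```

Re-running the block from the set.seed() line onwards should reproduce the same partition, and no row appears in both sets.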
7. [14 Marks] Consider the following code concerning a small study of children
who have had corrective spinal surgery. The response was whether kyphosis
(a type of deformation) was present or absent after the operation. The variable
Start measures the number of the first (topmost) vertebra operated on.
> data(kyphosis, package = "rpart")
> ooo <- with(kyphosis, order(Start))
> kyphosis <- kyphosis[ooo, ]
> kfit1 = glm(Kyphosis ~ Start, binomial, data = kyphosis)
> summary(kfit1)
Call:
glm(formula = Kyphosis ~ Start, family = binomial, data = kyphosis)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.473  -0.518  -0.421  -0.341   2.131

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.8901     0.6300    1.41  0.15769
Start        -0.2179     0.0604   -3.61  0.00031 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 68.072 on 79 degrees of freedom
AIC: 72.07

Number of Fisher Scoring iterations: 5
> vcov(kfit1)
            (Intercept)      Start
(Intercept)    0.396849 -0.0332694
Start         -0.033269  0.0036529
> plot(jitter(kfit1$y) ~ Start, data = kyphosis, col = "blue",
       ylab = "Response or probability",
       main = "Jittered Response of kyphosis Data", las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = "darkgreen")
[Figure: "Jittered Response of kyphosis Data". x-axis: Start (ticks at 5, 10, 15); y-axis: Response or probability (−0.2 to 1.2).]
(a) Why was jitter() used? [1 mark]
(b) Why was the data frame sorted by Start? [1 mark]
(c) Is there any evidence of overdispersion with respect to the binomial? Justify
your answer, e.g., by some numerical evidence. [3 marks]
Now consider the following code. We want to estimate θ = the value of Start
such that there is a 25% probability of presence of kyphosis, as well as obtain
an approximate 95% confidence interval for θ.
> cfs <- coef(kfit1)
> eta = cfs[1] + cfs[2] * with(kyphosis, Start)
> n.obs = nrow(kyphosis)
> Nsim = 1e3
> betas = matrix(0, Nsim, length(cfs))
> for (i in 1:Nsim) {
    ysim = rbinom(n.obs, size = 1, prob = 1 / (1 + exp(-eta)))
    mod_i = glm(cbind(ysim, 1-ysim) ~ Start, binomial, data = kyphosis)
    betas[i, ] = coef(mod_i)
  }
> est = (-1.0986 - cfs[1]) / cfs[2]
> myci <- c(2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.975),
            2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.025))
> plot(jitter(kfit1$y) ~ Start, data = kyphosis, col = "blue",
       ylab = "Response or probability",
       main = "Jittered Response of kyphosis Data", las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = "darkgreen")
> abline(v = c(est, myci), col = "red3", lty = "dashed")
> abline(h = 0.25, col = "gray", lty = "dashed")
[Figure: "Jittered Response of kyphosis Data", with the dashed reference lines drawn by the abline() calls. x-axis: Start (ticks at 5, 10, 15); y-axis: Response or probability (−0.2 to 1.2).]
(d) Why was size = 1 used? [1 mark]
(e) Is this parametric bootstrapping, nonparametric bootstrapping or neither?
[1 mark]
(f) Explain where the -1.0986 came from. [1 mark]
(g) Obtain an approximate 95% confidence interval for the probability of
kyphosis when Start = 10. [4 marks]
(h) Using mgcv, what commands would you use to fit a GAM to replace the
GLM kfit1? Don’t run it. [2 marks]
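One common route to an interval for a fitted probability is to build a Wald interval on the logit scale from the coefficient estimates and their covariance matrix, then back-transform with the inverse logit. The sketch below follows that approach using only the numbers printed by summary(kfit1) and vcov(kfit1) above. It is an illustration under that assumption, not necessarily the method the examiners intended, and the names beta, V and x0 are illustrative.

```r
# Estimates and covariance matrix as printed in the exam output
beta <- c(0.8901, -0.2179)
V <- matrix(c( 0.396849, -0.033269,
              -0.033269,  0.0036529), nrow = 2, byrow = TRUE)

x0 <- c(1, 10)                              # intercept and Start = 10
eta_hat <- sum(x0 * beta)                   # linear predictor (logit scale)
se_eta  <- sqrt(drop(t(x0) %*% V %*% x0))   # SE of the linear predictor

ci_eta <- eta_hat + c(-1, 1) * qnorm(0.975) * se_eta
ci_p   <- plogis(ci_eta)                    # back-transform to probabilities

round(c(estimate = plogis(eta_hat), lower = ci_p[1], upper = ci_p[2]), 3)

# The constant -1.0986 in the bootstrap code matches the logit of 0.25
qlogis(0.25)
```

With the fitted object available, predict(kfit1, newdata = data.frame(Start = 10), se.fit = TRUE) should give the same linear predictor and standard error, up to rounding of the printed estimates.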
8. [7 Marks] The following are the performance characteristics of B-type natriuretic
peptide as a diagnostic test for congestive heart failure (CHF). The
study was done around the year 2000, and the new diagnostic test was considered
positive if Serum BNP > 100 pg/mL. The diagnostic test was trialled on
1586 patients. Of 744 patients that were disease positive, 670 tested positive.
Of 842 patients that were disease negative, 640 tested negative.
(a) What is the overall prevalence of the disease? [1 mark]
(b) What is the sensitivity of the test? [1 mark]
(c) What is the specificity of the test? [1 mark]
(d) Given somebody with a negative test, estimate the probability of having
no disease. [2 marks]
(e) For this example, from a patient’s point of view, is a false positive better
or worse, or neither, than a false negative? Why? [2 marks]
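The counts quoted in the question determine the full 2 x 2 table of disease status against test result, from which the usual diagnostic summaries follow. The sketch below sets this up with illustrative variable names (tp, tn, fn, fp, npv) that are not part of the exam.

```r
n_total   <- 1586
n_disease <- 744;  tp <- 670          # disease positive, tested positive
n_healthy <- 842;  tn <- 640          # disease negative, tested negative

fn <- n_disease - tp                  # false negatives
fp <- n_healthy - tn                  # false positives

prevalence  <- n_disease / n_total
sensitivity <- tp / n_disease         # P(test positive | disease)
specificity <- tn / n_healthy         # P(test negative | no disease)
npv         <- tn / (tn + fn)         # P(no disease | test negative)

round(c(prevalence = prevalence, sensitivity = sensitivity,
        specificity = specificity, npv = npv), 3)
```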