DATA2002
Date: 2022-11-14
Exam style questions
Started: Nov 13 at 22:21
Quiz Instructions
These questions are in the style of the final exam. You should refer to the DATA2002 Final Exam Canvas
front page for specific details about the exam.
THIS IS NOT THE FINAL EXAM
Section 1: smoking
A study of patients with insulin-dependent diabetes was conducted to investigate the
effects of cigarette smoking on renal and retinal complications. Before examining the
results of the study, a researcher expects the proportions of four different
subgroups to be as follows: 50% nonsmokers, 20% current smokers, 10% tobacco chewers and 20% ex-smokers.
Of 100 randomly selected patients, there are 44 nonsmokers, 24 current smokers, 13
tobacco chewers and 19 ex-smokers. Should the researcher revise his estimates? Use
a 0.01 level of significance.
> y_i = c(44, 24, 13, 19)
> p_i = c(0.5, 0.2, 0.1, 0.2)
> (n = sum(y_i))
[1] 100
> (e_i = n * p_i)
[1] 50 20 10 20
> sum((y_i - e_i)^2/e_i)
[1] 2.47
> qchisq(0.005, 1:6, lower.tail = FALSE)
[1] 7.879439 10.596635 12.838156 14.860259 16.749602 18.547584
> qchisq(0.01, 1:6, lower.tail = FALSE)
[1] 6.634897 9.210340 11.344867 13.276704 15.086272 16.811894
> qchisq(0.025, 1:6, lower.tail = FALSE)
[1] 5.023886 7.377759 9.348404 11.143287 12.832502 14.449375
> qchisq(0.05, 1:6, lower.tail = FALSE)
[1] 3.841459 5.991465 7.814728 9.487729 11.070498 12.591587
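As a cross-check on the hand calculation above, base R's chisq.test() runs the same goodness-of-fit computation when given the observed counts and the hypothesised proportions. This is a minimal sketch rather than a model answer; the object name gof is introduced here for illustration only.

```r
# Observed counts and hypothesised proportions, as in the output above
y_i = c(44, 24, 13, 19)
p_i = c(0.5, 0.2, 0.1, 0.2)

# Chi-squared goodness-of-fit test (gof is a name chosen for this sketch)
gof = chisq.test(y_i, p = p_i)

gof$statistic   # observed test statistic
gof$parameter   # degrees of freedom: number of categories minus 1
gof$p.value     # upper-tail probability under the null hypothesis
gof$expected    # expected counts n * p_i, useful for checking the e_i >= 5 assumption
```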
Question 1 (1 pt)
Which of the tests we have covered this semester is most appropriate in this scenario?
Why?
Question 2 (2 pts)
Write down the appropriate null and alternative hypotheses.
Question 3 (2 pts)
What are the assumptions required for this test? Are they satisfied here?
Question 4 (1 pt)
What is the approximate distribution of the test statistic under the null hypothesis?
Question 5 (1 pt)
Write down an expression for the p-value.
Question 6 (2 pts)
What is your decision for the test and why?
Section 2: TV violence
A study of the amount of violence viewed on television as it relates to the age of the
viewer yields the results shown in the accompanying table for 81 people.
> x = matrix(c(8, 18, 12, 15, 21, 7), ncol = 3)
> colnames(x) = c("16-34", "35-54", "54+")
> rownames(x) = c("Low violence", "High violence")
> x
              16-34 35-54 54+
Low violence      8    12  21
High violence    18    15   7
> (n = sum(x))
[1] 81
> (xr = apply(x, 1, sum))
 Low violence High violence
           41            40
> (xc = apply(x, 2, sum))
16-34 35-54   54+
   26    27    28
> (ex = xr %*% t(xc) / n)
        16-34    35-54      54+
[1,] 13.16049 13.66667 14.17284
[2,] 12.83951 13.33333 13.82716
> sum((x - ex)^2 / ex)
[1] 11.16884
> qchisq(0.05, 1:6, lower.tail = FALSE)
[1] 3.841459 5.991465 7.814728 9.487729 11.070498 12.591587
> qt(0.05, 1:6)
[1] -6.313752 -2.919986 -2.353363 -2.131847 -2.015048 -1.943180
> qt(0.025, 1:6)
[1] -12.706205 -4.302653 -3.182446 -2.776445 -2.570582 -2.446912
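The expected counts and test statistic computed by hand above can also be obtained from chisq.test() applied to the two-way table. The sketch below is one way to check the arithmetic; tv_test is a name introduced here, not something from the original output.

```r
# Contingency table, built exactly as in the output above
x = matrix(c(8, 18, 12, 15, 21, 7), ncol = 3)
colnames(x) = c("16-34", "35-54", "54+")
rownames(x) = c("Low violence", "High violence")

# Pearson chi-squared test on the table; the continuity correction only
# applies to 2 x 2 tables, so correct = FALSE is set purely for explicitness
tv_test = chisq.test(x, correct = FALSE)

tv_test$expected    # expected counts: row total * column total / n
tv_test$statistic   # observed test statistic
tv_test$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
tv_test$p.value     # upper-tail probability under the null hypothesis
```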
Question 7 (2 pts)
Which test is most appropriate in this scenario? Why?
Question 8 (1 pt)
Write down the appropriate null and alternative hypotheses.
Question 9 (2 pts)
What are the assumptions required for this test? Are they satisfied here?
Question 10 (1 pt)
What is the approximate distribution of the test statistic under the null hypothesis?
Question 11 (1 pt)
Write down an expression for the p-value.
Question 12 (2 pts)
What is your decision for the test and why?
Section 3: Ozone
Data was recorded on WS (wind speeds), Temp (temperature), H (humidity), In
(insolation) and O (ozone) for 30 days. R output is given below to help you answer the
following questions.
> pollut = read_csv("https://raw.githubusercontent.com/DATA2002/data/master/pollut.txt")
> glimpse(pollut)
Observations: 30
Variables: 5
$ WS   50, 47, 57, 38, 52, 57, 53, 62, 52, 42, 47, 40, 42, 40, 48,...
$ Temp 77, 80, 75, 72, 71, 74, 78, 82, 82, 82, 82, 80, 81, 85, 82,...
$ H    67, 66, 77, 73, 75, 75, 64, 59, 60, 62, 59, 66, 68, 62, 70,...
$ In   78, 77, 73, 69, 78, 80, 75, 78, 75, 58, 76, 76, 71, 74, 73,...
$ O    15, 20, 13, 21, 12, 12, 12, 11, 12, 20, 11, 17, 20, 23, 17,...
> library(GGally)
> ggpairs(pollut) + theme_bw()
> pollut_lm = lm(O ~ ., pollut)
> summary(pollut_lm)
Call:
lm(formula = O ~ ., data = pollut)
Residuals:
    Min      1Q  Median      3Q     Max
-6.5861 -1.0961  0.3512  1.7570  4.0712
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.49370   13.50647  -1.147  0.26219
WS           -0.44291    0.08678  -5.104 2.85e-05
Temp          0.56933    0.13977   4.073  0.00041
H             0.09292    0.06535   1.422  0.16743
In            0.02275    0.05067   0.449  0.65728
Residual standard error: 2.92 on 25 degrees of freedom
Multiple R-squared: 0.798, Adjusted R-squared: 0.7657
F-statistic: 24.69 on 4 and 25 DF, p-value: 2.279e-08
> pollut_step = step(pollut_lm, trace = FALSE)
> summary(pollut_step)
Call:
lm(formula = O ~ WS + Temp + H, data = pollut)
Residuals:
    Min      1Q  Median      3Q     Max
-6.5887 -1.1686  0.1978  1.9004  4.1544
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.60697   13.07154  -1.270    0.215
WS           -0.44620    0.08513  -5.241 1.78e-05
Temp          0.60190    0.11764   5.117 2.47e-05
H             0.09850    0.06316   1.559    0.131
Residual standard error: 2.874 on 26 degrees of freedom
Multiple R-squared: 0.7964, Adjusted R-squared: 0.7729
F-statistic: 33.89 on 3 and 26 DF, p-value: 3.904e-09
> newdata = data.frame(WS = 40, Temp = 80, H = 50)
> predict(pollut_step, newdata, interval = "confidence")
      fit      lwr      upr
1 18.6218 16.70852 20.53509
> predict(pollut_step, newdata, interval = "prediction")
      fit      lwr      upr
1 18.6218 12.41146 24.83215
> library(ggfortify)
> autoplot(pollut_step, which = 1:2) + theme_bw()
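The printed coefficient tables above contain enough information to reproduce the t statistic for insolation and the fitted value at the new data point by hand. The sketch below does this using the rounded values as printed, so small discrepancies from the R output are expected; all variable names (b_in, se_in, b_step and so on) are introduced here for illustration.

```r
# Values copied from the printed summaries above (rounded as printed)
b_in    = 0.02275   # estimate for In in the full model
se_in   = 0.05067   # its standard error
df_full = 25        # residual degrees of freedom of the full model

# t statistic and two-sided p-value for H0: the coefficient of In is zero
t_in = b_in / se_in
p_in = 2 * pt(abs(t_in), df = df_full, lower.tail = FALSE)

# Fitted value from the stepwise model at WS = 40, Temp = 80, H = 50
b_step = c(-16.60697, -0.44620, 0.60190, 0.09850)   # intercept, WS, Temp, H
x_new  = c(1, 40, 80, 50)                           # leading 1 for the intercept
fit    = sum(b_step * x_new)

c(t_in = t_in, p_in = p_in, fit = fit)
```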
Question 13 (2 pts)
Does it look like any variables can be dropped from the full model? If you were doing
backwards selection using a testing-down strategy, which would you drop first?
Question 14 (2 pts)
Write down the workflow for a formal hypothesis test to see if the coefficient for
insolation is significantly different from zero. Make sure you state the null and alternative
hypotheses, test statistic (and its distribution), p-value and conclusion.
[You can do this using plain text, no need to use the Canvas equation editor]
Question 15 (1 pt)
Write down the fitted model for the model selected by the backward stepwise
procedure.
[You can do this using plain text, no need to use the Canvas equation editor]
Question 16 (2 pts)
State and check the linear regression assumptions for the model selected by the
backward stepwise procedure.
Question 17 (1 pt)
What proportion of the variability of ozone is explained by the explanatory variables in
the stepwise selected model?
Question 18 (2 pts)
Use the stepwise model to estimate the average ozone for days when WS = 40,
Temp = 80 and H = 50. Is a confidence interval or a prediction interval most appropriate
here? Write down the estimated interval you think is most appropriate.
Section 4: Flicker frequency
If a light is flickering but at a very high frequency, it appears to not be flickering at all.
Thus there exists a "critical flicker frequency" where the flickering changes from
"detectable" to "not detectable" and this varies from person to person.
The critical flicker frequency and iris colour for 19 randomly sampled people were
obtained as part of a study into the relationship between critical flicker frequency and
eye colour.
We want to use a one-way ANOVA to test if there is a significant difference in the mean
detectable flicker frequency between people with different eye colours.
> library(tidyverse)
> flicker = read_tsv("https://raw.githubusercontent.com/DATA2002/data/master/flicker.txt")
> glimpse(flicker)
Observations: 19
Variables: 2
$ Colour "Brown", "Brown", "Brown", "Brown", "Brown", "Brown", "B...
$ Flicker 26.8, 27.9, 23.7, 25.0, 26.3, 24.8, 25.7, 24.5, 26.4, 24...
> ggplot(flicker, aes(x = Colour, y = Flicker)) +
+   geom_boxplot() +
+   theme_classic() +
+   labs(y = "Critical flicker frequency", x = "Eye colour")
> flicker_anova = aov(Flicker ~ Colour, data = flicker)
> summary(flicker_anova)
            Df Sum Sq Mean Sq F value Pr(>F)
Colour       2  23.00  11.499   4.802 0.0232
Residuals   16  38.31   2.394
> library(emmeans)
> flicker_emmeans = emmeans(flicker_anova, ~ Colour)
> contrast(flicker_emmeans, method = "pairwise", adjust = "bonferroni")
 contrast      estimate    SE df t.ratio p.value
 Blue - Brown      2.58 0.836 16   3.086  0.0212
 Blue - Green      1.25 0.937 16   1.331  0.6060
 Brown - Green    -1.33 0.882 16  -1.511  0.4512
P value adjustment: bonferroni method for 3 tests
> library(ggfortify)
> autoplot(flicker_anova, which = c(1,2)) + theme_classic()
Question 19 (1 pt)
Write out the appropriate null and alternative hypotheses. [Be sure to define all
parameters used.]
Question 20 (2 pts)
What are the assumptions required for a one-way ANOVA? Are they satisfied in this
case?
Question 21 (2 pts)
Write down the test statistic (with distribution), observed test statistic, p-value and an
appropriate conclusion.
Question 22 (1 pt)
If appropriate, discuss the post hoc test results to identify which pairwise differences are
significant. If not appropriate, give a brief justification as to why not.
Question 23 (2 pts)
Describe how to perform the Bonferroni correction in the context of post-hoc pairwise
testing. Why is it needed?
Question 24 (2 pts)
Describe how you would perform a permutation test in this context.
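One possible shape for such a permutation test is sketched below: compute the usual ANOVA F statistic, then repeatedly shuffle the responses across the group labels and recompute it, taking the proportion of shuffled statistics at least as large as the observed one as the p-value. The function perm_anova and its arguments are hypothetical names introduced for this sketch, and the example call uses simulated data rather than the flicker data.

```r
# Permutation test for a one-way layout (sketch).
# y: numeric response, g: grouping variable, B: number of permutations.
perm_anova = function(y, g, B = 2000) {
  g = factor(g)
  # F statistic taken from the usual one-way ANOVA table
  f_stat = function(resp) summary(aov(resp ~ g))[[1]][["F value"]][1]
  f_obs = f_stat(y)
  # Shuffle the responses across groups and recompute F each time
  f_perm = replicate(B, f_stat(sample(y)))
  # Proportion of permuted statistics at least as large as the observed one
  list(f_obs = f_obs, p_value = mean(f_perm >= f_obs))
}

# Illustration on simulated data (NOT the flicker data)
set.seed(2002)
g_sim = rep(c("A", "B", "C"), times = c(8, 6, 5))
y_sim = rnorm(19, mean = 25, sd = 1.5)
perm_anova(y_sim, g_sim, B = 500)
```

With the data from this section loaded, the analogous call would be perm_anova(flicker$Flicker, flicker$Colour).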
Section 5: Weight gain
Ten pigs were independently sampled and each was fed a specific diet. The weights of the 5 pigs on
diet X and the 5 pigs on diet Y are
Diet X: 12, 16, 16, 12, 10 and Diet Y: 30, 12, 24, 32, 24.
We want to test if there is a difference in weight between the two diets using the
Wilcoxon rank-sum test.
> wdat = data.frame(
+ diet = rep(c("X","Y"), each = 5),
+ weight = c(12, 16, 16, 12, 10, 30, 12, 24, 32, 24)
+ ) %>%
+ mutate(ranks = rank(weight))
> wdat
   diet weight ranks
1     X     12   3.0
2     X     16   5.5
3     X     16   5.5
4     X     12   3.0
5     X     10   1.0
6     Y     30   9.0
7     Y     12   3.0
8     Y     24   7.5
9     Y     32  10.0
10    Y     24   7.5
> wdat %>% group_by(diet) %>% summarise(sum(ranks))
  diet sum(ranks)
1    X         18
2    Y         37
> nx = 5
> ny = 5
> N = nx + ny
> ew = nx*(N+1)/2
> varw = (sum(wdat$ranks^2) - N*(N+1)^2/4)*nx*ny/(N*(N-1))
> c(ew, varw)
[1] 27.50000 22.08333
> qnorm(c(0.9,0.95,0.975))
[1] 1.281552 1.644854 1.959964
> qt(c(0.9,0.95,0.975), 8)
[1] 1.396815 1.859548 2.306004
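The quantities printed above (rank sums, expected value and tie-adjusted variance) combine into a standardised statistic as sketched below. This is a self-contained recomputation from the raw weights using the normal approximation without a continuity correction, which is an assumption of this sketch; the names x_wt, y_wt, z and p_val are introduced here.

```r
# Weights as given in the question
x_wt = c(12, 16, 16, 12, 10)   # diet X
y_wt = c(30, 12, 24, 32, 24)   # diet Y

# Rank the pooled sample; tied values receive average ranks
r  = rank(c(x_wt, y_wt))
nx = length(x_wt)
ny = length(y_wt)
N  = nx + ny

w    = sum(r[1:nx])                                              # rank sum for diet X
ew   = nx * (N + 1) / 2                                          # mean of W under H0
varw = (sum(r^2) - N * (N + 1)^2 / 4) * nx * ny / (N * (N - 1))  # variance allowing for ties

z     = (w - ew) / sqrt(varw)   # standardised test statistic
p_val = 2 * pnorm(-abs(z))      # two-sided p-value from the normal approximation
c(w = w, z = z, p_val = p_val)
```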
Question 25 (1 pt)
Write out the null and alternative hypotheses.
[Be sure to define all parameters used.]
Question 26 (2 pts)
Calculate the Wilcoxon rank-sum test statistic and the standardised version of the test
statistic.
Question 27 (1 pt)
At the level of significance α = 0.05, what is your conclusion?
Question 28 (3 pts)
What is a parametric test that could be used instead of the Wilcoxon rank-sum test?
What is one advantage of using a Wilcoxon rank-sum test over a parametric test? What
is one advantage of using a parametric test over the Wilcoxon rank-sum test?
Question 29 (3 pts)
Describe how you would calculate a 90% bootstrap confidence interval for the mean
difference between the two diets.
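One common approach is the percentile bootstrap: resample each diet group with replacement, record the difference in sample means, repeat many times, and take the 5th and 95th percentiles of those differences. The sketch below assumes that approach; the number of resamples B, the seed, and the names x_wt, y_wt and boot_diff are choices made here for illustration.

```r
x_wt = c(12, 16, 16, 12, 10)   # diet X
y_wt = c(30, 12, 24, 32, 24)   # diet Y

set.seed(1)   # arbitrary seed so the resampling is reproducible
B = 5000      # number of bootstrap resamples (an assumed, adjustable choice)

# Resample each group with replacement and record the difference in means
boot_diff = replicate(B, {
  mean(sample(x_wt, replace = TRUE)) - mean(sample(y_wt, replace = TRUE))
})

# Percentile interval: the middle 90% of the bootstrap distribution
ci = quantile(boot_diff, probs = c(0.05, 0.95))
ci
```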
Section 6: short answer questions
Question 30 (2 pts)
Describe the process of k-means clustering.
[100 words or less.]
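For reference, a minimal k-means run in base R might look like the sketch below. It uses the built-in iris measurements purely as stand-in data, and the choices of three clusters, 20 random starts and the Lloyd algorithm (rather than the kmeans() default, Hartigan-Wong) are assumptions made for this illustration.

```r
set.seed(2002)             # k-means starts from randomly chosen centres
dat = scale(iris[, 1:4])   # built-in data, numeric columns standardised

# Lloyd iteration: assign each point to its nearest centre, recompute each
# centre as the mean of its assigned points, repeat until nothing changes.
# nstart = 20 repeats this from 20 random starts and keeps the best result.
km = kmeans(dat, centers = 3, nstart = 20, iter.max = 100, algorithm = "Lloyd")

table(km$cluster)   # cluster sizes
km$centers          # final cluster centres
km$tot.withinss     # total within-cluster sum of squares being minimised
```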
Question 31 (2 pts)
What is the purpose of principal component analysis? How can we select the number of
principal components we need to retain?
[100 words or less.]
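As an illustration of one retention criterion, the sketch below runs prcomp() on the built-in iris measurements (stand-in data, not part of this quiz) and computes the proportion of variance explained by each component. The names pca and var_explained are introduced here.

```r
# PCA on the standardised numeric columns of a built-in data set
pca = prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component
var_explained = pca$sdev^2 / sum(pca$sdev^2)
var_explained

# Cumulative proportion: a common guide is to keep enough components to
# reach a chosen share of the variance, or to look for an elbow in a scree plot
cumsum(var_explained)
```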
Question 32 (2 pts)
Why is an ANOVA post-hoc t-test generally considered preferable to a standard
two-sample t-test?
[100 words or less.]
Question 33 (4 pts)
You're the senior manager at a management consulting company. A junior data analyst
on your team has been tasked with building a prediction model for a binary outcome.
When you ask them how their model performs they respond with:
"It's awesome, bro! Best model ever. I used all available variables in a logistic
regression model. The resubstitution accuracy was pretty much the same as the leave-
one-out cross validation accuracy. So I'm done for the day. I'm gonna go play some
fussball and grab a kombucha from the fridge, can I get you one, bro?"
You remind the junior analyst for the 100th time that you're not their "bro". Internally you
curse the HR department for hiring Commerce grads from UNSW.
In the text box below, provide some guidance to the junior analyst about their model
selection and evaluation choices. Also suggest some alternative methods that they
could use and briefly outline their advantages and disadvantages.
[150 words or less.]
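To make the comparison between resubstitution and cross-validated accuracy concrete, the sketch below computes both for a logistic regression using 10-fold cross-validation. Everything in it is an assumption for illustration: the data are simulated (one informative predictor and five noise predictors), the 0.5 classification threshold is arbitrary, and acc is a helper defined here.

```r
set.seed(2002)
n = 200
# Simulated data for illustration only: X1 is informative, X2 to X6 are noise
dat = data.frame(matrix(rnorm(n * 6), ncol = 6))
dat$y = rbinom(n, 1, plogis(1.5 * dat$X1))

# Accuracy of a logistic regression fitted on `train` and evaluated on `test`
acc = function(train, test) {
  fit  = glm(y ~ ., data = train, family = binomial)
  pred = as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$y)
}

resub_acc = acc(dat, dat)   # resubstitution: same data for fitting and testing

# 10-fold cross-validation: each fold is held out exactly once
k = 10
fold = sample(rep(1:k, length.out = n))
cv_acc = mean(sapply(1:k, function(j) acc(dat[fold != j, ], dat[fold == j, ])))

c(resubstitution = resub_acc, cross_validated = cv_acc)
```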