Statistics R assignment writing service - STA 238 | 学霸联盟

Date: 2022-04-03

STA 238 - Assignment #2
Due: April 1 @ 8 PM ET, grace period up to April 4, 8 PM
Submit on Crowdmark: Accepted formats: jpeg, jpg, png, pdf
This is an individual assignment - all work and ideas presented should be entirely your own. The
purpose is to assess your acquisition of course concepts and ability to apply them to problems done
by hand, and to demonstrate these concepts through simulation methods.
Complete the written portions of the assignment on the PDF file in the space provided. If you
cannot access a printer, then you may complete the work on a separate paper and follow all upload
instructions on Crowdmark. Questions about the assignment should be addressed via course
email and not the discussion boards to avoid cluttering the space. The teaching team will be
looking for:
• Use of notation at course level, including variable definitions and complete distribution
specifications as modeled in class
• Justifications using definitions, laws, axioms for your calculations (e.g. E(XY) = E(X)·E(Y)
if X and Y are independent). We are looking for evidence that you are able to connect and
apply course concepts, with the final numerical result being worth at most 1 point.
• Crucial steps shown in your work (e.g. we don’t need to see all the integration steps, but it
should be clear why you are integrating, and the corresponding result). Any reader should be
able to easily follow your process through your work.
• Formatting and organized presentation of work. Round all final answers to four decimal places
(two for answers expressed in %)
• Written statements should be made in complete sentences using appropriate course level
terminology.
• Bonus points earned will be added to your score, up to a maximum of 100%.
For problems that use R (Labeled Problem * [R - * points]):
• All R work to be done in an R Markdown file, knit to pdf, with question headers. Each
question should begin on a new page (\newpage in your rmd text).
• Set seed before every random sampling you run.
• Your knit document should include: executable code chunks, required output values, and
written responses using correct LaTeX notation. Do not include print outs of large data sets.
• All graphs should include clear titles and axis labels that make clear what data is being
plotted.
Problem 1 [20 points]. Note: Part (d) and parts (b)-(c) can be completed independently
of each other.
As an aspiring gardener, you developed a new fertilizer that you hope will produce more consistent
growth in your crops as seedlings. You know from experience that the standard deviation in the
amount of growth over a two week period, measured in millimetres (mm), when growing seedlings
is 35 mm. That is, growth among your plants typically varies by about 3.5 cm. You hope that
your new fertilizer will deliver nutrients more steadily, leading to more consistent growth.
You test out your fertilizer on 30 seedlings and after a two week period, you find that the standard
deviation in the amount of growth is 11.05 mm (i.e. growth typically varied by about 1 cm from
plant to plant). Is there sufficient evidence to suggest that the fertilizer has contributed to
more consistent growth in seedlings?
a) (2 points) You recorded the amount of growth for each of your thirty seedlings below:
Seedling Growth (mm)
33.86 36.50 15.51 20.77 40.63
38.83 41.21 50.87 28.19 46.16
19.19 40.18 34.35 36.10 29.40
13.03 24.31 15.40 42.65 37.34
31.02 50.05 42.67 21.89 36.65
45.18 25.06 30.10 35.14 10.74
Compute the sample mean of your data, and also verify that the standard deviation in seedling
growth in these two weeks was 11.05 mm. What is one assumption you must verify before
conducting a hypothesis test?
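A minimal R sketch of the computation in part (a), assuming the thirty values are entered exactly as printed in the table. The variable names (growth, x_bar, s) are illustrative choices, not part of the assignment.

```r
# Seedling growth (mm), entered row by row from the table above
growth <- c(33.86, 36.50, 15.51, 20.77, 40.63,
            38.83, 41.21, 50.87, 28.19, 46.16,
            19.19, 40.18, 34.35, 36.10, 29.40,
            13.03, 24.31, 15.40, 42.65, 37.34,
            31.02, 50.05, 42.67, 21.89, 36.65,
            45.18, 25.06, 30.10, 35.14, 10.74)

n     <- length(growth)  # number of seedlings
x_bar <- mean(growth)    # sample mean
s     <- sd(growth)      # sample standard deviation (n - 1 denominator)

round(c(n = n, mean = x_bar, sd = s), 4)
```

sd() divides by n - 1, which is the sample standard deviation the problem statement refers to.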
b) (1 point) State the hypotheses you would test using the χ² goodness of fit test to verify the
assumption in (a).
c) (7 points) Continuing from (b), since you do not have the exact distribution parameters of
the data to conduct your goodness of fit test, do the following instead:
1. Standardize your data using your sample mean and sample standard deviation. You can
choose to enter your data into R to do this step, but you must include your script and
output neatly as part of your solutions.
2. Verify your assumption by performing a goodness of fit test that compares your
standardized data points to the standard normal distribution (N(0, 1)).
3. State the p-value of your test.
Note: Use six, equally likely intervals to perform your test. It doesn’t matter which end of
each interval you decide to have open or closed, as long as it is applied consistently across all
of your intervals. Use three decimal places for your interval endpoints.
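The three steps in part (c) can be sketched in R as follows. This is a hedged illustration, not a model solution. It assumes right-closed intervals and takes the cut points as the 1/6, ..., 5/6 quantiles of N(0, 1), rounded to three decimals. The degrees of freedom are an assumption: some courses use k - 1, others subtract one more for each estimated parameter, so both p-values are shown.

```r
growth <- c(33.86, 36.50, 15.51, 20.77, 40.63,
            38.83, 41.21, 50.87, 28.19, 46.16,
            19.19, 40.18, 34.35, 36.10, 29.40,
            13.03, 24.31, 15.40, 42.65, 37.34,
            31.02, 50.05, 42.67, 21.89, 36.65,
            45.18, 25.06, 30.10, 35.14, 10.74)

# Step 1: standardize with the sample mean and sample standard deviation
z <- (growth - mean(growth)) / sd(growth)

# Six equally likely intervals under N(0, 1)
k      <- 6
cuts   <- round(qnorm((1:(k - 1)) / k), 3)
breaks <- c(-Inf, cuts, Inf)

# Step 2: observed and expected counts, then the chi-square statistic
bins     <- cut(z, breaks = breaks, right = TRUE)  # intervals of the form (a, b]
observed <- as.vector(table(bins))
expected <- rep(length(z) / k, k)
chi_sq   <- sum((observed - expected)^2 / expected)

# Step 3: p-value (ASSUMPTION: pick the df convention used in your course)
df_simple   <- k - 1      # no adjustment for estimated parameters
df_adjusted <- k - 1 - 2  # adjusted for the estimated mean and sd
p_simple    <- pchisq(chi_sq, df = df_simple, lower.tail = FALSE)
p_adjusted  <- pchisq(chi_sq, df = df_adjusted, lower.tail = FALSE)

data.frame(interval = levels(bins), observed, expected)
round(c(chi_sq = chi_sq, p_simple = p_simple, p_adjusted = p_adjusted), 4)
```

Using -Inf and Inf as the outer break points guarantees that every standardized value falls in exactly one interval.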
d) (5 points) Interpret your p-value from part (c) before performing a hypothesis test with the
appropriate null and alternative hypotheses to determine whether there is sufficient evidence
to suggest that the fertilizer appears to reduce the variation in seedling growth. If you were
not able to complete (c), first verify your assumption in (a) with an appropriate QQ plot
before proceeding.
e) (5 points) Write the probability statement that describes the rejection region for your hypothesis
test in (d) assuming normality holds and a significance level of 5%. Compute the region and
express it in the form RR = {s² | s² ≤ ....}, and use it to calculate the power of your test to
detect a difference in standard deviation if the actual standard deviation is σ = 12 mm.
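One way to compute the rejection region and power in part (e) is sketched below. It assumes a lower-tailed test of H0: σ = 35 against Ha: σ < 35 with test statistic (n - 1)S²/σ0² following a χ² distribution with n - 1 degrees of freedom under H0. That formulation is a reading of part (d), not something the assignment states outright, and the variable names are illustrative.

```r
n         <- 30    # number of seedlings
sigma0    <- 35    # standard deviation under H0 (mm)
sigma_alt <- 12    # actual standard deviation used for the power calculation (mm)
alpha     <- 0.05  # significance level

# Reject H0 when (n - 1) * s^2 / sigma0^2 <= lower alpha-quantile of chi-square(n - 1),
# i.e. when s^2 <= crit_s2
crit_s2 <- sigma0^2 * qchisq(alpha, df = n - 1) / (n - 1)

# Power: P(S^2 <= crit_s2) when the true standard deviation is sigma_alt,
# using (n - 1) * S^2 / sigma_alt^2 ~ chi-square(n - 1)
power <- pchisq((n - 1) * crit_s2 / sigma_alt^2, df = n - 1)

round(c(crit_s2 = crit_s2, power = power), 4)
```

The same two lines work for any alternative σ: only the rescaling inside pchisq() changes.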
Problem 2 [R - 8 points]. This problem is a continuation of problem 1 where you are
investigating how your newly developed fertilizer impacts variation in seedling growth. The growth data
for your 30 seedlings are provided below:
Seedling Growth (mm)
33.86 36.50 15.51 20.77 40.63
38.83 41.21 50.87 28.19 46.16
19.19 40.18 34.35 36.10 29.40
13.03 24.31 15.40 42.65 37.34
31.02 50.05 42.67 21.89 36.65
45.18 25.06 30.10 35.14 10.74
a) (3 points) You want to determine the actual variation in seedling growth with the use of your
fertilizer. Construct a 98% confidence interval to estimate the true standard deviation in
seedling growth as a result of your fertilizer. Interpret your interval.
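A short R sketch of the interval in part (a), assuming normality holds so that (n - 1)S²/σ² follows a χ² distribution with n - 1 degrees of freedom. The names ci_var and ci_sd are illustrative.

```r
growth <- c(33.86, 36.50, 15.51, 20.77, 40.63,
            38.83, 41.21, 50.87, 28.19, 46.16,
            19.19, 40.18, 34.35, 36.10, 29.40,
            13.03, 24.31, 15.40, 42.65, 37.34,
            31.02, 50.05, 42.67, 21.89, 36.65,
            45.18, 25.06, 30.10, 35.14, 10.74)

n     <- length(growth)
s2    <- var(growth)  # sample variance
alpha <- 1 - 0.98     # 98% confidence level

# Interval for the variance: dividing by the upper quantile gives the lower limit
ci_var <- (n - 1) * s2 / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)

# Taking square roots turns it into an interval for the standard deviation
ci_sd <- sqrt(ci_var)
names(ci_sd) <- c("lower", "upper")
round(ci_sd, 4)
```

The interval is not symmetric around the sample standard deviation because the χ² distribution is right-skewed.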
b) (5 points) Suppose your assumptions in problem 1 do not hold. Describe in detail the steps
you would use to construct a 98% confidence interval using empirical bootstrapping and
centred estimates, as described in class. Note: No R syntax allowed. Your description
should consist of the set of steps/instructions that someone else could follow to
implement this method.
c) (BONUS R - 3 pts) Using your student number as the randomization seed and using
B = 2000 resamples, construct a 98% confidence interval using empirical bootstrapped centred
values.
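A hedged sketch of the bootstrap in part (c). It assumes "centred" means working with the differences between each resampled standard deviation and the original sample standard deviation, then subtracting their quantiles from the original estimate. Check this against the version given in class. The seed 1234567 is a hypothetical placeholder for a student number.

```r
growth <- c(33.86, 36.50, 15.51, 20.77, 40.63,
            38.83, 41.21, 50.87, 28.19, 46.16,
            19.19, 40.18, 34.35, 36.10, 29.40,
            13.03, 24.31, 15.40, 42.65, 37.34,
            31.02, 50.05, 42.67, 21.89, 36.65,
            45.18, 25.06, 30.10, 35.14, 10.74)

student_number <- 1234567  # HYPOTHETICAL placeholder: replace with your own
set.seed(student_number)

n <- length(growth)
s <- sd(growth)  # estimate from the original sample
B <- 2000        # number of bootstrap resamples

# Resample with replacement and recompute the standard deviation each time
boot_sd <- replicate(B, sd(sample(growth, size = n, replace = TRUE)))

# Centred values: deviation of each bootstrap estimate from the original estimate
delta <- boot_sd - s

# 98% interval: subtract the upper and lower 1% quantiles of the centred values
q  <- quantile(delta, probs = c(0.01, 0.99))
ci <- c(lower = s - q[[2]], upper = s - q[[1]])
round(ci, 4)
```

Because the resampling is random, the endpoints change with the seed, which is why the assignment asks for a fixed one.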
d) (BONUS R - 3 pts) How does your bootstrapped confidence interval compare with the
one computed under normality assumptions? Under what situation would you expect your
bootstrapped confidence interval to produce a drastically different confidence interval to the
one computed using normality assumptions? Why?
Problem 3 [R - 16 points]. A study was conducted to investigate differences in the quality of
writing between published and unpublished engineers. Specifically, the study compared the
readability of written works by engineers who have had their works published in journals against
those who have no published works but write reports regularly as part of their tasks. Readability
in this case refers to how easily a layman might be able to read and comprehend the writing,
both in the structure of the written work and the need for technical knowledge. Written samples
of engineers belonging to either group were selected at random and scored according to an index
called the “index of confusion”. Low scores indicated high readability (low on the “confusion”
index) while high scores indicated low readability (high on the “confusion” index). The scores
are tabulated below:
Journal Published (n = 20)
1.95 1.77 0.95 1.11 2.30
2.25 1.83 1.48 1.56 1.41
2.06 1.59 2.07 1.39 1.75
1.24 2.05 1.91 2.10 1.71
Unpublished (n = 15)
2.84 2.52 2.96 2.75 2.02
3.42 1.71 3.22 1.63 2.86
2.74 2.36 3.62 3.08 1.81
a) (2 points) State the null and alternative hypotheses that would be tested in this problem.
b) (3 points) Check the normality assumption of your data graphically (either using R or by
hand). Include your well-labeled graphs and comments about the normality assumption.
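A minimal sketch of the graphical check in part (b), using base R normal QQ plots. The split of the table into groups is an assumption read off its layout: the first five values of each row are taken as the published group (20 scores) and the remaining five as the unpublished group (15 scores).

```r
# ASSUMPTION: columns 1-5 of the table are "Journal Published", columns 6-10 "Unpublished"
published <- c(1.95, 1.77, 0.95, 1.11, 2.30,
               2.25, 1.83, 1.48, 1.56, 1.41,
               2.06, 1.59, 2.07, 1.39, 1.75,
               1.24, 2.05, 1.91, 2.10, 1.71)
unpublished <- c(2.84, 2.52, 2.96, 2.75, 2.02,
                 3.42, 1.71, 3.22, 1.63, 2.86,
                 2.74, 2.36, 3.62, 3.08, 1.81)

op <- par(mfrow = c(1, 2))  # two panels side by side

qqnorm(published, main = "Normal QQ plot: published",
       xlab = "Theoretical N(0, 1) quantiles",
       ylab = "Index of confusion score")
qqline(published)  # reference line through the first and third quartiles

qqnorm(unpublished, main = "Normal QQ plot: unpublished",
       xlab = "Theoretical N(0, 1) quantiles",
       ylab = "Index of confusion score")
qqline(unpublished)

par(op)  # restore the previous plotting layout
```

Points that stay close to the reference line without systematic curvature are consistent with normality. With samples this small, some scatter is expected.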
c) (4 points) Produce in R well-labeled side-by-side boxplots of the data. You should include
both your code and the output graph in your response. Visually, does it seem reasonable
to believe the two groups have equal variances? Briefly explain.
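A sketch of the side-by-side boxplots for part (c) in base R, again assuming the first five columns of the table are the published group and the last five the unpublished group. The data frame name scores is illustrative.

```r
# ASSUMPTION: columns 1-5 of the table are "Journal Published", columns 6-10 "Unpublished"
published <- c(1.95, 1.77, 0.95, 1.11, 2.30,
               2.25, 1.83, 1.48, 1.56, 1.41,
               2.06, 1.59, 2.07, 1.39, 1.75,
               1.24, 2.05, 1.91, 2.10, 1.71)
unpublished <- c(2.84, 2.52, 2.96, 2.75, 2.02,
                 3.42, 1.71, 3.22, 1.63, 2.86,
                 2.74, 2.36, 3.62, 3.08, 1.81)

# Long format: one row per writing sample, with a group label
scores <- data.frame(
  score = c(published, unpublished),
  group = factor(rep(c("Published", "Unpublished"),
                     times = c(length(published), length(unpublished))))
)

boxplot(score ~ group, data = scores,
        main = "Index of confusion scores by publication status",
        xlab = "Engineer group",
        ylab = "Index of confusion score")

# Numerical companion to the plot: spread of each group
tapply(scores$score, scores$group, sd)
tapply(scores$score, scores$group, IQR)
```

Comparing box heights (the IQRs) and whisker lengths gives the visual sense of spread that part (c) asks about.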
d) (5 points) Perform a hypothesis test on your data according to your conclusions in (c). Show
all your work, including the sampling distribution you are using, any sample statistics you
needed to calculate, and your test-statistic.
e) (2 points) What is the p-value of your hypothesis test? Make sure to compute the p-value in
R to get a precise measurement. Interpret this p-value in terms of what it suggests about the
hypotheses being tested.
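The test in parts (d) and (e) can be sketched as below, assuming a two-sided alternative and the same column split of the table as above. Whether the pooled (equal-variance) or Welch version applies depends on your conclusion in (c), so both are shown. The hand computation covers the pooled case only.

```r
# ASSUMPTION: columns 1-5 of the table are "Journal Published", columns 6-10 "Unpublished"
published <- c(1.95, 1.77, 0.95, 1.11, 2.30,
               2.25, 1.83, 1.48, 1.56, 1.41,
               2.06, 1.59, 2.07, 1.39, 1.75,
               1.24, 2.05, 1.91, 2.10, 1.71)
unpublished <- c(2.84, 2.52, 2.96, 2.75, 2.02,
                 3.42, 1.71, 3.22, 1.63, 2.86,
                 2.74, 2.36, 3.62, 3.08, 1.81)

n1 <- length(published)
n2 <- length(unpublished)

# Pooled two-sample t statistic computed by hand
sp2    <- ((n1 - 1) * var(published) + (n2 - 1) * var(unpublished)) / (n1 + n2 - 2)
t_stat <- (mean(published) - mean(unpublished)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df     <- n1 + n2 - 2
p_pooled <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided p-value

# Built-in equivalents: pooled and Welch (unequal variances)
pooled <- t.test(published, unpublished, var.equal = TRUE)
welch  <- t.test(published, unpublished, var.equal = FALSE)

c(t = t_stat, df = df, p_value = p_pooled)
c(welch_t = unname(welch$statistic), welch_df = unname(welch$parameter),
  welch_p = welch$p.value)
```

The hand-computed statistic should agree with pooled$statistic, which is a useful check on the written work.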
f) (BONUS - 6pts) Using side-by-side boxplots is not a particularly rigorous way to explore
the possibility of equal population variances. In this bonus problem, you’ll learn how to test
for equal population variances for two normally distributed samples. Don’t worry! The
process is no different than the other hypothesis tests we’ve covered.
In order to test for equal variances, we need a test statistic that would be a useful
measure, whose distribution is known. It turns out that:
Theorem 1. Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with
variance $\sigma_X^2$ and $Y_1, Y_2, \ldots, Y_m$ be a random sample from a normal distribution with variance
$\sigma_Y^2$. Let $S_X^2$ and $S_Y^2$ be the sample variances from these two random samples. Then:
$$F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}$$
has an F distribution with numerator degrees of freedom $df_1 = n-1$ and denominator degrees
of freedom $df_2 = m-1$.
That is, $F \sim F_{n-1,\,m-1}$. It is a positive, right-skewed distribution
much like the $\chi^2$ distribution.
If we test:
$$H_0: \sigma_X^2 = \sigma_Y^2$$
$$H_a: \sigma_X^2 \neq \sigma_Y^2$$
Under the null hypothesis $H_0: \sigma_X^2 = \sigma_Y^2$, we would expect $F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}$ to be close to 1.
We would expect the sample variances to be similar, and under $H_0$, $\sigma_X^2/\sigma_Y^2 = 1$, so the
statistic reduces to $F = S_X^2/S_Y^2$.
The p-value is the probability of observing a ratio at least as extreme as the one observed.
For example, if we observe $S_X^2 = 4$ with $n = 10$ and $S_Y^2 = 2$ with $m = 20$ from normal
populations, the sample variance of sample X is twice as large as that of sample Y.
1. Test Statistic: Under $H_0$ both populations share a common variance $\sigma^2$, so the F test
statistic is $F = \frac{4/\sigma^2}{2/\sigma^2} = 2$.
2. Sampling Distribution: $F \sim F_{df_1 = 9,\, df_2 = 19}$.
3. P-value: A two-tailed test finds the probability of a ratio of 2 or more plus the probability
of a ratio of 1/2 or less (opposite ratio!):
P(F ≥ 2) + P(F ≤ 0.5) = pf(2, 9, 19, lower.tail=FALSE) + pf(0.5, 9, 19) = 0.2410
4. See ?pf in R for the documentation page on using this command to find cumulative
probabilities of an F distribution.
(6 points) After having verified normality in part (b), use the data to test for equal population
variances. To do this:
– State your hypotheses
– Report the sample statistics you will be using
– Calculate the test statistic
– State the sampling distribution, and finally,
– Report your p-value and conclusions drawn from the evidence.
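The steps listed above can be sketched in R as follows, mirroring the worked example's two-tailed rule (the observed ratio plus its opposite ratio). The column split of the table is the same assumption as before. The built-in var.test() is shown only as a cross-check of the statistic. It doubles the smaller tail probability instead, so its p-value can differ slightly from the one computed here.

```r
# ASSUMPTION: columns 1-5 of the table are "Journal Published", columns 6-10 "Unpublished"
published <- c(1.95, 1.77, 0.95, 1.11, 2.30,
               2.25, 1.83, 1.48, 1.56, 1.41,
               2.06, 1.59, 2.07, 1.39, 1.75,
               1.24, 2.05, 1.91, 2.10, 1.71)
unpublished <- c(2.84, 2.52, 2.96, 2.75, 2.02,
                 3.42, 1.71, 3.22, 1.63, 2.86,
                 2.74, 2.36, 3.62, 3.08, 1.81)

# Sample statistics
s2_x <- var(published)
s2_y <- var(unpublished)
df1  <- length(published) - 1    # numerator degrees of freedom
df2  <- length(unpublished) - 1  # denominator degrees of freedom

# Test statistic under H0 (equal population variances)
f_stat <- s2_x / s2_y

# Two-tailed p-value as in the worked example: the ratio and its opposite ratio
upper   <- max(f_stat, 1 / f_stat)
lower   <- min(f_stat, 1 / f_stat)
p_value <- pf(upper, df1, df2, lower.tail = FALSE) + pf(lower, df1, df2)

round(c(s2_x = s2_x, s2_y = s2_y, F = f_stat, p_value = p_value), 4)

# Cross-check of the statistic and degrees of freedom with the built-in test
builtin <- var.test(published, unpublished)
builtin$statistic
builtin$parameter
```

Taking max and min of the ratio and its reciprocal makes the p-value calculation work whichever sample has the larger variance.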