Date: 2022-04-26

MAST90044 Thinking and Reasoning with Data
Semester 1 2022
Assignment 2
Due: 17:00, Wednesday 27 April

Student name:______________________________

Student number:____________________________

• Please label your assignment with the following information in the
appropriate spots at the top of this document:
  ◦ your name
  ◦ your student number
• This assignment is worth 15% of the marks in this subject, and covers
the work done up to Week 7 (with a focus on Lab Chapters 4–6).
• The total number of marks for this assignment is 65.
• Your assignment should show all working and reasoning, as marks will
be given for method as well as for correct answers. Please spellcheck your
document.
• Each question is followed by an empty box for the answer. Please
answer each question in the dedicated box. If you need more space for
a question you can add this at the end of this document BUT clearly
state in the box that the answer to the question continues on the
additional pages. Please DO NOT resize/move the boxes, or add
additional pages (except for right at the end). The document needs to
be the correct format (boxes on the right pages) for ease of marking.
• Paste any R code and output into the boxes along with your answers.
Graphics from R can be resized within your document; make them smaller
(but still legible) as necessary to ensure they are in the box.
• Tutors will not help you directly with assignment questions. However,
they may give you some help with R if you ask, for example, ‘What does
the hist() function do?’
• Please note that we may mark only a subset of questions.
• Any extensions need to be approved by Julia. Please email both Julia
and Tina if you need an extension. Late assignments are penalised with a
20% reduction per day. Any assignment submitted more than 3 days (72
hours) after the due date without an extension will receive a score of 0.
• Assignments are to be saved as a pdf once complete and submitted
(uploaded) via GradeScope.
• You can resubmit your assignment at any time up to the deadline. We
highly recommend that you upload a draft version of your assignment well
in advance of the due date/time, as ‘technical issues’ or ‘failure to upload
properly’ will not be accepted as a valid excuse for not submitting on time
and you will be penalised. Please note that only your final (most recent)
submission will be marked.
Question 1 [5+3+4+5+4+5+4+5]
The dataset unescoSample.csv (available on Canvas) contains economic and
demographic information from the 1990 UNESCO yearbook on a sample of
the world’s countries. Definitions of the variables in the dataset are as follows:
• Birth rate per 1,000 of population
• Death rate per 1,000 of population
• Infant deaths per 1,000 of population
• Life expectancy at birth for males (years)
• Life expectancy at birth for females (years)
• Gross National Product (GNP) per capita
• Geopolitical group:
1. Eastern Europe
2. South America and Mexico
3. Western Europe, North America, Japan
4. Middle East
5. Asia
6. Africa
During the following analysis, each country may be treated as a single,
equally weighted observation (despite some countries having larger
populations than others).

(a) Use an appropriate graphical tool to explore the relationship between
life expectancy at birth for females and geopolitical group. Using the
graph, compare female life expectancy across groups and comment on
anything else interesting that you see.

(b) Calculate point estimates of average female life expectancy for each
geopolitical group. Do this by using the tapply function, including the
argument na.rm=TRUE to exclude missing data from the calculation.
Find a 95% confidence interval for the mean female life expectancy in
the Middle East.
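
As a rough illustration of the tapply call pattern mentioned in (b), the sketch below uses a small made-up data frame. The names toy, group and value are hypothetical and the numbers are invented; nothing here is taken from unescoSample.csv.

```r
# Toy data, NOT the UNESCO sample: hypothetical column names, invented values
toy <- data.frame(
  group = c(1, 1, 2, 2, 2),
  value = c(70, NA, 60, 65, NA)
)

# Mean of `value` within each level of `group`.
# na.rm = TRUE is passed through to mean(), so missing values are skipped.
group_means <- tapply(toy$value, toy$group, mean, na.rm = TRUE)
group_means
```

Without na.rm = TRUE, any group containing a missing value would return NA for its mean.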

(c) Test whether the mean female life expectancy for Group 2 is
significantly higher than 65 years. Use α = 5%. Be sure to state your
hypotheses, the p-value (or critical value) and your conclusion in the
context of the problem. Would your conclusion change if α = 1%?

(d) Test whether there is a difference in mean GNP between geopolitical
groups 1 and 5 by using the t.test function with var.equal=TRUE. State
your hypotheses, the p-value (or critical value) and your conclusion, for
α = 5%. What assumptions have been made while doing this test and
do you think they are reasonable? Why/why not?


(e) Repeat the test in part (d), this time setting var.equal=FALSE.
Compare the 95% CIs generated under the two methods.
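
The two t.test variants named in (d) and (e) differ only in the var.equal argument. The sketch below shows the call pattern on two invented samples; the vectors x and y are hypothetical and are not the GNP data.

```r
# Toy samples, NOT the UNESCO data: invented values for two hypothetical groups
x <- c(5.1, 4.8, 6.0, 5.5)
y <- c(7.2, 6.9, 8.1, 7.7, 9.0)

# Pooled-variance two-sample t-test (assumes equal population variances)
pooled <- t.test(x, y, var.equal = TRUE)

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(x, y, var.equal = FALSE)

# Each result stores a 95% confidence interval (the default conf.level)
# for the difference in means, plus the degrees of freedom used
pooled$conf.int
welch$conf.int
c(pooled = unname(pooled$parameter), welch = unname(welch$parameter))
```

The pooled test uses n1 + n2 - 2 degrees of freedom, while the Welch test uses an approximation that is never larger, which is one reason the two intervals can differ in width.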


(f) Fit a regression model with GNP as a predictor of male life expectancy,
for Africa only. Plot male life expectancy against GNP and
superimpose the regression line for your model. Identify any unusual
data points. Discuss whether or not you believe these points should be
removed when fitting the model and why/why not.

(g) You discover that Tina made a data entry error when uploading the
dataset. Remove one datapoint that Tina accidentally added. Refit the
model from (f) without this datapoint. State the equation of the new
fitted model and give estimates for all relevant parameters. Give a
measure of how much variability in male life expectancy is explained by
variability in GNP.

(h) Present two relevant diagnostic plots for the model fitted in (g). List the
model assumptions and comment on whether they hold or not in this
case, with reference to your diagnostic plots.

Question 2 [4+4+2+3]
The ‘Black Summer’ Australian bushfires in 2019–2020 burned over 24 million
hectares of land, directly caused 33 deaths and blanketed parts of Australia in
smoke for weeks. The table below gives a sample of AQI (air quality index)
readings for Canberra city centre during December and January, 2018–2019
and 2019–2020, measured at the same date and time across the two years.
The higher the reading, the more particles are in the air, with an AQI over 300
rated as ‘hazardous’ to human health.
Date and time   2018–19   2019–20
2/12 1:00       20        89
11/12 6:00      24        170
26/12 18:00     42        286
29/12 9:00      58        320
6/1 4:00        62        2333
9/1 1:00        34        414
16/1 6:00       67        77
22/1 0:00       43        33
22/1 3:00       35        23
24/1 13:00      45        192
Source: data.act.gov.au/Environment/Air-Quality-Monitoring-Data/94a5-zqnn
Note: air quality is a random variable that can change at different times of the
day and year, depending on the weather and pollution levels.
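
One possible way to enter the table above into R is sketched here. The object and column names (aqi, when, aqi_1819, aqi_1920, diff) are assumptions chosen for illustration; the values are transcribed from the table. The sketch only sets up the paired readings and their differences and does not carry out any test.

```r
# AQI readings transcribed from the table above
# (each row is the same date and time in the two summers)
aqi <- data.frame(
  when     = c("2/12 1:00", "11/12 6:00", "26/12 18:00", "29/12 9:00",
               "6/1 4:00", "9/1 1:00", "16/1 6:00", "22/1 0:00",
               "22/1 3:00", "24/1 13:00"),
  aqi_1819 = c(20, 24, 42, 58, 62, 34, 67, 43, 35, 45),
  aqi_1920 = c(89, 170, 286, 320, 2333, 414, 77, 33, 23, 192)
)

# Paired difference: 2019-20 reading minus 2018-19 reading at the same date/time
aqi$diff <- aqi$aqi_1920 - aqi$aqi_1819

aqi
```

Keeping the two years as columns of one data frame preserves the pairing by date and time, which matters for any paired analysis.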

(a) Use a t-test to test whether the average air quality index during
December–January was significantly higher in 2019–20, compared with
2018–19, using α = 10%. Be sure to state your hypotheses, p-value
and conclusion in the context of the problem.
(b) Use the sign test to test whether the median air quality index during
December–January was significantly higher in 2019–20, compared with
2018–19, using α = 10%. Do this step-by-step (i.e. do not use the
binom.test or sign.test automated functions in R). Be sure to show all
working, state your p-value and conclusion.
(c) Out of the tests used in (a) and (b), which do you believe is more
appropriate in this scenario and why? For your preferred test, is it
possible that you have made a Type I error or a Type II error in this case?

(d) Pretend that it is summer 2022–2023 and that you are the Chief
Minister of Canberra. There is another (similarly sized) bushfire nearby
but this time, all your air quality measuring devices are malfunctioning.
You decide to make an announcement about whether or not the
average air quality is worse, based on the outcome of your earlier
hypothesis test. For this scenario, what are the definitions and
consequences of making a Type I error or Type II error? Would it be
worse to make a Type I error, or a Type II error, in this case?

Question 3 [4+3+5+5]
Let us return to the face recognition example from Assignment 1. On
completion of both parts of the test (the ‘face memory test’ and the ‘face
sorting test’), you will receive an overall percentage score of how well you did.
We collected the following 20 scores from 90044 students:
(66%, 62%, 61%, 62%, 57%, 53%, 68%, 57%, 73%, 58%, 57%, 62%, 57%,
69%, 65%, 53%, 59%, 58%, 64%, 69%)
According to the UNSW research team, participant scores are distributed as
follows:
Score                             ≤60%   [61%,64%]   [65%,68%]   [69%,71%]   ≥72%
Percentage of people with score   50%    25%         15%         5%          5%
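
A minimal sketch of how the 20 scores could be sorted into the bins of the table above, assuming the scores are whole percentages (so 60 falls in the first bin and 61 in the second). The object names scores, bins, observed, unsw_prop and expected are illustrative choices, not part of the assignment.

```r
# The 20 student scores listed above (percentages)
scores <- c(66, 62, 61, 62, 57, 53, 68, 57, 73, 58,
            57, 62, 57, 69, 65, 53, 59, 58, 64, 69)

# cut() uses right-closed intervals by default: (-Inf,60], (60,64], (64,68], ...
# For whole-number scores these match the bins in the table.
bins <- cut(scores,
            breaks = c(-Inf, 60, 64, 68, 71, Inf),
            labels = c("<=60", "61-64", "65-68", "69-71", ">=72"))

# Observed counts per bin
observed <- table(bins)

# Expected counts under the UNSW proportions from the table
unsw_prop <- c(0.50, 0.25, 0.15, 0.05, 0.05)
expected  <- length(scores) * unsw_prop

data.frame(bin      = names(observed),
           observed = as.vector(observed),
           expected = expected)
```

Laying observed and expected counts side by side makes it easy to inspect each bin before deciding how to proceed with the test.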

(a) Use a chi-squared test to determine whether the scores for the 90044
students are distributed significantly differently to the UNSW
distribution. Be sure to state your hypotheses, the value of the test
statistic and critical cut-off point (or alternatively, p-value) and
conclusion. Use the bins/categories given in the table above and α = 5%.

(b) What are the assumptions of this chi-squared test? Have we violated
any of these assumptions while applying the test in part (a)? If yes,
which one/s? Justify your answer.

(c) Repeat the test in part (a), but this time combining bins if required, to
improve the validity of the test. This time, complete the test without the
help of the chisq.test function in R.

(d) Jacob states that a student studying 90044 has a greater than 40%
chance of getting a score of 60% or above in the face recognition test.
State the hypotheses and calculate both an exact and an approximate
p-value for testing Jacob’s theory. What conclusion would you make at
the α = 10% level?

Extra working space
Question: