Date: 2024-04-05

MATH1041 Statistics for Life and Social Science
Term 1, 2024
MATH1041 Assignment
Data: Together with this document, you should have received your unique dataset in an e-mail sent to your
official university email address. The data (that is, your dataset) are available in a text file with the name
5380675.csv. If you have not received your dataset (double check your UNSW email inbox and the spam folder),
please contact your lecturer.
Submission due date: Tuesday 9th April (Week 9) before 11:59 PM (Sydney time, AEST). Note that
a late penalty of 5% of the maximal possible mark per day will apply. No assignment will be accepted more
than five days after the deadline.
Your submission must contain your full name and student zID at the top of your assignment. Submit your
assignment through Turnitin via Moodle. See the “Assessments Hub” section on Moodle for further information
regarding online submission.
Please submit a neatly typed assignment either as a Microsoft Word document (.doc or .docx) or as a PDF
document (.pdf) created, for instance, using Google Docs, LaTeX, RMarkdown or similar tools. See the
information and help about the assignment in the assessment section on Moodle. For your convenience, a
Microsoft Word template that is already in a format appropriate for this assignment can be downloaded from
Moodle.
Verify that your assignment has been submitted correctly by downloading the submission receipt and clicking on
the link to check that it displays correctly in the Turnitin viewer. If not, it is your responsibility to make the
necessary amendment.
Typesetting (*) /2
Q1 /5
Q2 /9
Q3 /13
Q4 /17
Q5 /15
Q6 /4
Total /65
(*) See the next pages and the “Assessments Hub” on Moodle for details, help and explanations about the assignment and
typesetting.
Note that your assignment and dataset are unique. You cannot show your dataset or your
assignment to anyone. It is your responsibility to keep your dataset and your assignment secret.
Also, your assignment must be your own work. You cannot get any outside help in
any form. If you have a question about the assignment, the only places where you can ask it are on
the MATH1041 Assignment forum, provided you do not reveal your data, or at a staff consultation.
Computing assignment format
Keep in mind that this assignment is not only about assessing your Statistical skills; it is also about giving you
feedback on your Mathematical writing skills. The assignment must be typeset correctly and provide complete
explanations in complete English sentences and paragraphs. Think of this as practice for a document you might
produce in your future studies or career that includes mathematical explanations.
Here are some more details that may assist you:
• Regarding the overall assignment structure, please answer all questions in the given order (that is, 1.a.,
1.b., ... etc). Do not re-write the assignment questions again, only their label (write “3.e.” for
instance when you start question 3.e.). Keep your answers brief, clear and concise.
There is NO need to reproduce the cover sheet, i.e., the first 5 pages of the pdf file sent to you, in your
assignment.
• Start your answer to each Question (1, 2, etc.) on a new page. Each Question should start on a
new page, but sub-parts of a Question (such as Question 3.d., 3.e.) should continue on the same page.
• You are required to type up your entire assignment (in Microsoft Word, Google Docs, LaTeX, Overleaf
or RMarkdown) including any equations. The only exception is the plots produced by RStudio, which
you can save as figures (use “export” in the bottom right window in RStudio) and then
paste into your assignment. Nothing can be handwritten and then scanned. As a UNSW student, you can
download Microsoft Word for free, see: https://www.myit.unsw.edu.au/software-students.
• As in any properly typeset document containing mathematical symbols, you must use an equation editor
for all maths symbols. For instance, you should write “$X$ is normal”, rather than “X is normal” (Notice
how the ‘X’ looks different?) and you should write “$t_{obs} = 1.23$”, rather than “tobs = 1.23”.
The marking scheme for this criterion is the following: Are mathematical symbols typeset using the
equation editor? 2 marks for ‘almost always’, 1 mark for ‘sometimes’, 0 marks for ‘rarely’.
Help about Microsoft equation editor can be found in a document called Microsoft Word Equation editor
help for MATH1041 located on Moodle in the Assignment (20%) section within the Assessments Hub
section of the MATH1041 Moodle page.
• You should add some working out for the questions involving calculations; do not just give the final answer.
Note that you may get partial marks for clear explanations and a correct method even if you get the wrong
answer. However, try to keep your solutions brief and concise. Depending on what the question is asking,
your working out could consist of RStudio commands, a formula, or perhaps the main steps explaining
how you arrived at your answer. You do not need to add all your R-code.
• Keeping your results to 3 or 4 significant figures should be fine. If there are multiple steps in a calculation,
do not round any numbers until you have reached the final step. To help you do calculations correctly in
RStudio without rounding, values should be stored as variables, rather than copying the output number
into a further calculation. For example, if you are constructing a confidence interval and need to calculate t∗,
you should write the code: tstar <- qt(0.975, df = 10) and then use the variable tstar in calculating
your confidence interval, rather than pasting in the number 2.228139.
• There is no requirement for font size and line spacing but please make sure your assignment is readable —
do not make the font size too small or the spacing too compact.
• If the question asks you to produce a graph/plot, you should always include that graph in your answer,
unless otherwise specified.
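As a minimal sketch of the advice above about storing intermediate values rather than pasting rounded numbers, the following R code builds a two-sided 95% confidence interval for a mean. The vector x and its 11 values are made up purely for illustration (so that df = 10, matching the example above); they are not part of any dataset.

```r
# Hypothetical sample of 11 measurements (illustrative values only).
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7, 12.4, 10.2, 11.0)

n     <- length(x)              # sample size, here 11
tstar <- qt(0.975, df = n - 1)  # critical value kept at full precision
se    <- sd(x) / sqrt(n)        # standard error of the sample mean

# Use the stored variables; round only when reporting the final result.
ci <- mean(x) + c(-1, 1) * tstar * se
print(signif(ci, 4))
```

Rounding happens only in the final print() call, so no precision is lost in the intermediate steps.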
Scenario
Do NOT copy-paste the data listed in this section.
Parkinson’s disease (PD), or simply Parkinson’s,
is a chronic degenerative disorder of the central
nervous system in the brain that affects both the
motor system and non-motor systems. The symptoms
usually emerge slowly, and as the disease progresses,
non-motor symptoms become more common. Early
symptoms are tremor, rigidity, slowness of movement,
and difficulty with walking, speaking or swallowing.
Problems may also arise with cognition, behaviour,
sleep, and sensory systems [a].
The original dataset [b] analysed by J. Hlavnička et
al. in 2017 [c] includes a random sample of 30 patients
with early untreated Parkinson’s disease (PD), a
second independent random sample of 50 patients
with Rapid Eye Movement (REM) sleep behaviour
disorder (RBD), who are at high risk of developing
Parkinson’s disease, and a third independent random
sample of 50 healthy controls (HC). All patients
were scored clinically by a well-trained professional
neurologist with experience in movement disorders.
All subjects were also examined during a single session
with a speech specialist. In the (first) column Code, an
entry such as RBD01 would indicate that this is Patient
01 out of 50 in the REM sleep Behaviour Disorder
group.
The data you received by email is a random sample
extracted from the original data described above. A
limited number of rows of your personal dataset is
shown below. The variables considered here
are: Age, Sex, Duration of pause intervals (ms) and
RateSpeech timing (-/min) (acoustic information
about the rhythmic organization of speech describing
its quality), and FingerTaps (giving an ordered score
in {0, 1, 2, 3, 4} to a finger tapping task, where 0
indicates “no problem” and 4 indicates “cannot or can
only barely perform the task”). It is usually assumed
that people with Parkinson’s disease tend to have, on
average, a higher Duration of pause intervals and a
lower Rate of speech timing.
[a] Source: Wikipedia.
[b] Source: UC Irvine Machine Learning Repository.
[c] Source: Sci Rep, (2017) Feb 2;7(1):12. The original paper
conducts an entirely different analysis from this assignment.
Reading this paper will not help you complete this assignment,
and you should not refer to it in any of your answers.
## Code Age Sex Duration RateSpeech FingerTaps
## PD20 70 M 140 312 1
## RBD40 60 M 245 296 1
## PD28 60 M 154 340 1
## RBD17 69 F 201 279 1
## PD08 59 F 145 338 1
## PD21 70 M 155 334 0
## PD14 70 M 146 338 1
## RBD37 68 M 158 301 0
## PD12 37 F 129 365 2
## PD06 58 M 186 317 3
## HC34 66 M 119 402
## PD16 64 F 137 386 1
## HC50 54 M 171 264
## RBD10 69 M 145 329 0
## RBD20 75 M 132 339 0
## PD10 66 M 213 281 1
## RBD29 56 M 226 293 0
## HC42 68 M 158 315
## PD17 73 F 146 339 1
## RBD38 65 M 190 309 2
## HC11 65 M 130 403
## HC31 72 M 244 279
## HC19 58 M 130 381
## RBD07 64 M 175 270 0
## RBD41 65 M 203 264 0
## RBD12 63 M 250 285 0
## HC06 65 M 148 347
## HC44 54 M 154 350
## RBD42 68 M 181 302 1
## HC40 67 M 156 321
## RBD35 62 M 162 337 0
## RBD16 61 M 181 278 0
## RBD08 74 M 133 352 0
## PD03 68 M 377 211 1
## HC07 45 M 138 312
## RBD22 59 M 126 354 0
## HC20 60 M 129 329
## HC21 40 F 105 399
## PD25 77 M 220 311 1
## PD04 75 M 360 140 1
## .........................................
Reading the data into RStudio
The data are in a text file with the name 5380675.csv. This file was sent to you by e-mail (see page 1). To
complete this assignment, you need to use the FULL dataset provided to you by email. Do NOT copy and paste
the data on the previous page as your dataset.
The first step is to read the data into RStudio. The data format is like what you have already worked with
in the Weekly Mobius lessons. Follow the instructions given in section R1.4 “How to import a text file into
RStudio” of the RStudio “How-To-Manual” available on Moodle. Alternatively, you can also review your lecture
slides to find the R function to use to import a CSV file into RStudio. Two arguments that are often used
when calling this function are header = TRUE (to indicate that names of variables are present on the first line in
the file) and row.names = 1 (to indicate that names of the cases are provided in the first column). Another
very important argument is colClasses which takes a vector of classes to be assumed for the columns (such as
"character" for strings, "factor" for a categorical variable, and "numeric" for a quantitative variable). Once
you have uploaded the data then you are ready to start your analysis!
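The arguments described above can be sketched on a toy file. This example assumes the import function in question is read.csv(); the file contents, the object name toy.data and the choice of columns are made up for illustration and do not reproduce your dataset.

```r
# Write a small stand-in CSV file to a temporary location (made-up values).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Code,Age,Sex,Duration",
  "PD01,70,M,140",
  "RBD01,60,F,245",
  "HC01,66,M,119"
), toy_path)

toy.data <- read.csv(
  toy_path,
  header     = TRUE,  # variable names are on the first line of the file
  row.names  = 1,     # case labels are in the first column
  # One class per column in the file, in order, including the label column.
  colClasses = c("character", "numeric", "factor", "numeric")
)

str(toy.data)  # inspect the classes that were actually assigned
```

Calling str() straight after the import is a quick way to confirm that each column has the intended class before any analysis.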
Checkpoint: To make sure everything is all right, we suggest that you first calculate the average of
the n = 105 values read from your file 5380675.csv for each quantitative variable, and check that
they match the values given below. If your data have been stored in an R object called student.data,
you can type
print(colMeans(student.data[, unlist(lapply(student.data, is.numeric))], na.rm = TRUE), digits = 5)
where na.rm = TRUE indicates that non-available (NA), i.e., missing, values should be removed.
## Age Duration RateSpeech
## 63.571 164.667 329.514
They do? It means you imported the data correctly in RStudio. You are ready to start!
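To see what the checkpoint command does, here is a small sketch on a made-up data frame. The object student.toy and its values are hypothetical; vapply() is used here as a stricter alternative to unlist(lapply()) and is a stylistic choice, not a requirement.

```r
# Hypothetical data frame with one missing Duration value.
student.toy <- data.frame(
  Age      = c(70, 60, 66),
  Sex      = factor(c("M", "F", "M")),
  Duration = c(140, NA, 119),
  row.names = c("PD01", "RBD01", "HC01")
)

# TRUE for each quantitative column, FALSE otherwise.
numeric_cols <- vapply(student.toy, is.numeric, logical(1))

# Column means over the numeric columns only; NA entries are skipped.
toy_means <- colMeans(student.toy[, numeric_cols, drop = FALSE], na.rm = TRUE)
print(toy_means, digits = 5)
```

Without na.rm = TRUE, the mean of any column containing a missing value would itself be reported as NA.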
IMPORTANT: Completing this checkpoint is essential. If you load in the dataset incorrectly, you will have
incorrect answers throughout the entire assignment, and you will have marks removed for every incorrect answer.
The Analysis Tasks
The questions below follow a logical order that can be used for analysing real data. Also, working through these
questions will help you better understand some concepts presented in the slides, which will be helpful for the
final exam.
PART I: Study Design
Q1. In this question, you will think about the research questions and aspects of study design. For all parts
in Q1, your answers should be no more than one sentence long.
1.a. Briefly explain the research question that the stakeholders are interested in, based on what is
described in the scenario. Keep this in mind when you analyse the data in Parts II and III.
1.b. What is the population that is of interest to researchers?
1.c. What are the cases here? (We do not expect a list of all cases here.)
1.d. Is it an observational study or is it an experiment? Provide a brief justification for your answer.
With the markers in mind, in your assignment, please start every question on a new page.
Q2. In this question, you will describe the organisation of the data. For each of parts 2.a–2.c, your answer
should be no more than two sentences.
2.a. Your data is provided to you in a specific file format. What is the extension of the data file and what
does the extension stand for?
2.b. What is the sample size? (We expect a value here.)
2.c. What are the IDs (labels)? Give only the ID of the first observation.
2.d. Complete the table below so that it lists all of the variables that are contained in the dataset and the
type of each variable. You should add rows to the table as required. When describing the type of each
variable, you should be more specific than just saying that the variable is categorical or quantitative,
i.e. you should specify what kind of categorical or quantitative variable it is.
Table 1: Table to be completed and submitted with your assignment.
Variable Name Variable Type
PART II: Exploratory Data Analysis
Q3. Your second task, as for any statistician, is to explore your data with univariate analyses to gain a good
understanding of each variable in the data set. This is always a good strategy to help you detect problems
in a data set, and also to know enough about your data to better answer the research questions.
3.a. Let us deal with missing values first, if any. How many missing values are there in your dataset? You
can determine this using the R function is.na(). (They are indicated by NA entries after importation
into R, a code meaning “Non Available”.) Just state the number of missing values.
3.b. When doing initial data exploration, it is always good to consider the potential reasons for missing
data and where they appear in the data set. One way to handle missing values is sometimes to replace
all of them with a suitably chosen value. Other times, it is more appropriate to leave them as they
are. Considering the scenario, and looking closely at your data, what is the appropriate strategy here?
Justify your answer. Your answer should be no more than two sentences.
3.c. We now move on to univariate graphical summaries. Create a boxplot of the variable Age. Include it
in your submitted assignment properly labelled.
3.d. Comment on the presence or absence of outliers in the boxplot you produced in part 3.c (in no more
than one sentence).
3.e. Create an appropriate graphical summary for the variable FingerTaps (only for the subjects that are
NOT healthy controls). Include it in your submitted assignment properly labelled.
3.f. Comment in no more than one sentence on the trend that you see in the graphical summary in part
3.e.
3.g. We now move on to univariate numerical summaries. Create an appropriate numerical summary for
the variable Sex.
3.h. In no more than one sentence, comment on the result of part 3.g.
3.i. Compute the five number summary of variable Duration for all subjects combined (healthy and
non-healthy). (Do NOT use the fivenum() function.)
3.j. In no more than one sentence, comment on the result of part 3.i.
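As a brief illustration of how the is.na() function mentioned in part 3.a behaves, the sketch below applies it to a made-up data frame. The object toy and its values are hypothetical and unrelated to your dataset.

```r
# Hypothetical data frame containing three missing entries.
toy <- data.frame(
  Age      = c(70, NA, 66),
  Duration = c(140, 245, NA),
  Sex      = factor(c("M", NA, "M"))
)

# is.na() returns a logical matrix of the same shape as the data frame;
# TRUE counts as 1 when summed.
n_missing      <- sum(is.na(toy))      # total number of missing entries
missing_by_col <- colSums(is.na(toy))  # number of missing entries per variable

print(n_missing)
print(missing_by_col)
```

The per-variable counts show where the missing values sit, which is useful when thinking about why they are missing.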
Q4. 4.a. We now want to study the relationship between the variables Duration and RateSpeech. What type
of graphical summary is appropriate for this? Just state the name of the summary (no justification
needed here).
4.b. It is sometimes appropriate to add a least-squares line to graphical summaries of the kind referred to
in part 4.a. Is it the case here? Just answer yes or no for this part.
4.c. Justify your answer to part 4.b. Write no more than 3 sentences.
4.d. Now, produce the graphical summary referred to in part 4.a. Ensure that your plot is properly labelled
and include it in your assignment.
4.e. Describe the nature of the relationship observed on the plot you produced in part 4.d, using the four
adjectives (or their antonyms) given in the lecture slides. Your answer should be no more than four
sentences, but writing only one sentence should suffice. What else do you notice on this plot?
4.f. What is an appropriate numerical summary to describe the relationship between the Duration and
Rate of Speech? Just state the name of the numerical summary.
4.g. Compute the value of the numerical summary referred to in part 4.f. Give your answer to at least
two decimal places.
4.h. Comment on the value of the numerical summary you computed in part 4.g in no more than one
sentence.
4.i. Given the results you obtained in the previous parts, it is only necessary to study either Duration or
Rate of Speech. (By the way, do you understand why?) We will now focus on Duration. Produce
two (side-by-side) boxplots to compare the Duration of healthy controls to the other subjects. Ensure
that your plot is properly labelled and include it in your assignment.
4.j. Comment on the trend you see in the plot you produced in part 4.i in no more than one sentence.
PART III: Modeling and Inference
Q5. Now, we are going to do some modeling and statistical inference.
5.a. Let µ1 be the mean of variable Duration for healthy people. Let µ2 be the mean of variable Duration
for non-healthy people. We want to compare µ1 to µ2 and we assume that µ1 is KNOWN (equal to
146) while µ2 is UNKNOWN. Recall the name of the hypothesis test strategy you can use here.
5.b. Perform an appropriate hypothesis test to compare the true means of Duration between healthy and
non-healthy people. You must summarise all steps in your solution:
• (1) state the null (give both $H_0$ and $\tilde{H}_0$) and alternative ($H_a$) hypotheses relevant to the research
objectives stated in this scenario,
• (2) an expression/formula for a suitable test statistic,
• (3) its observed value in the sample,
• (4) the null distribution for this statistic,
• (5) the expression of the P-value,
• (6) the numerical value of the P-value,
• (7) your interpretation of the P-value and
• (8) your conclusion in plain language.
5.c. Given what you know about the scenario, and by referring to the boxplot obtained in part II, and to
any other calculation you can do, briefly discuss the validity of all the assumptions needed to safely
apply this hypothesis test. What other graphs could you produce here to verify some of these assumptions?
(No need to include the graph in your assignment, just give the name of the graph and what it can be
used for.)
Q6. 6.a. Produce a one-sided 95% confidence interval for the difference in means of Duration between healthy
controls and non-healthy ones, still assuming that the mean for healthy people is known.
6.b. Does this confidence interval include the value µ1 given in part 5.a? Is your answer to this consistent
with your conclusions from the hypothesis test in part 5.b?
6.c. Referring back to the scenario, write a one-sentence plain-language interpretation of the confidence
interval obtained above.
END OF ASSIGNMENT