MATH1041 Statistics for Life and Social Science
Term 1, 2025

MATH1041 Assignment

Data: Together with this document, you should have received your unique dataset in an email sent to your official university email address. The data (i.e., your dataset) are available in a text file with the name 5428814.csv. If you have not received your dataset (double-check your UNSW email inbox and the spam folder), please contact your lecturer.

Submission due date: Tuesday 15th April (Week 9) before 11:59 PM (Sydney time, AEST). Note that a late penalty of 5% of the maximum possible mark per day will apply. No assignment will be accepted more than five days after the deadline.

Your submission must contain your full name and student zID at the top of your assignment. Submit your assignment through Turnitin via Moodle. See the “Assessments Hub” section on Moodle for further information regarding online submission.

Please submit a neatly typed assignment either as a Microsoft Word document (.doc or .docx) or as a PDF document (.pdf) created, for instance, using Google Docs, LaTeX, RMarkdown or similar tools; see the information and help about the assignment in the assessment section on Moodle. For your convenience, a Microsoft Word template that is already in a format appropriate for this assignment can be downloaded from Moodle.

Verify that your assignment has been submitted correctly by downloading the submission receipt and clicking on the link to check that it displays correctly in the Turnitin viewer. If it does not, it is your responsibility to make the necessary amendment.

Typesetting (*)   /2
Q1                /5
Q2                /9
Q3                /13
Q4                /17
Q5                /15
Q6                /4
Total             /65

(*) See the next pages and the “Assessments Hub” on Moodle for details, help and explanations about the assignment and typesetting.

Note that your assignment and dataset are unique. You cannot show your dataset or your assignment to anyone.
It is your responsibility to keep your dataset and your assignment secret. Also, your assignment must be your own work. You cannot get any outside help in any form. If you have a question about the assignment, the only places where you can ask it are the MATH1041 Assignment help FORUM, provided you do not reveal your data, or a staff consultation.

2 Computing assignment format

Keep in mind that this assignment is not only about assessing your statistical skills; it is also about giving you feedback on your mathematical writing skills. The assignment must be typeset correctly and provide complete but concise explanations in complete English sentences and paragraphs. Think of this as practice for a document you might produce in your future studies or career that includes mathematical explanations. Here are some more details that may assist you:

• Regarding the overall assignment structure, please answer all questions in the given order (that is, 1.a., 1.b., etc.). Do not rewrite the assignment questions, only their labels (write “3.e.” for instance when you start question 3.e.). Keep your answers brief, clear and concise. DO NOT reproduce the cover sheet, i.e., the first 5 pages of the PDF file sent to you, in your assignment.
• Start your answer to each Question (1, 2, etc.) on a new page. Sub-parts of a Question (such as 3.d. and 3.e.) should continue on the same page.
• You are required to type up your entire assignment (in Microsoft Word, Google Docs, LaTeX, Overleaf or RMarkdown), including any equations. The only exception is the plots produced by RStudio, which you can save as figures (use “export” in the bottom right window in RStudio) and then paste into your assignment. Nothing can be handwritten and then scanned. As a UNSW student, you can download Microsoft Word for free, see: https://www.myit.unsw.edu.au/software-students.
• As in any properly typeset document containing mathematical symbols, you must use an equation editor for all maths symbols. For instance, you should write “$X$ is normal” rather than “X is normal” (notice how the ‘X’ looks different?), and you should write “$t_{\text{obs}} = 1.23$” rather than “tobs = 1.23”. The marking scheme for this criterion is the following: Are mathematical symbols typeset using the equation editor (or LaTeX)? 2 marks for ‘almost always’, 1 mark for ‘sometimes’, 0 marks for ‘rarely’. Help with the Microsoft equation editor can be found in a document called Microsoft Word Equation editor help for MATH1041, located in the Assignment (20%) section within the Assessments Hub section on Moodle.
• You should add some working out for the questions involving calculations; do not just give the final answer. Note that you may get partial marks for clear explanations and a correct method even if you get the wrong answer. However, try to keep your solutions brief and concise. Depending on what the question is asking, your working out could consist of RStudio commands, a formula, or perhaps the main steps explaining how you arrived at your answer. Only include key R commands, using a different font or colour.
• Keeping your results to 3 or 4 significant figures should be fine. If there are multiple steps in a calculation, do not round any numbers until you have reached the final step. To help you do calculations correctly in RStudio without rounding, values should be stored as variables, rather than copying the output number into a further calculation. For example, if you are constructing a confidence interval and need to calculate $t^*$, you should write the code tstar <- qt(0.975, df = 10) and then use the variable tstar in calculating your confidence interval, rather than pasting in the number 2.228139.
• There is no requirement for font size and line spacing, but please make sure your assignment is readable; do not make the font size too small or the spacing too compact.
• If the question asks you to produce a graph/plot, you should always include that graph in your answer, unless otherwise specified.
• It is FORBIDDEN to use functions from the Tidyverse suite of packages (e.g., ggplot, etc.).

3 Scenario

A group of research ecologists were interested in studying the impacts of climate change on different species of plants that grow in New South Wales, Australia. Some of these plants are native to Australia while others are non-native (exotic). To obtain their data, the research team decided to collect a random sample of n plants from a national park. Some measurements were then taken in 2024 on each plant. The random sample of data consists of plant height measurements (measured in centimeters), dry weight measurements (measured in grams), whether the plant was native or non-native to Australia, and the pollination mode of the plant (this could be one of four types: wind, water, insect and self-pollination).

Your data set contains four variables:
• Height, which corresponds to the height of a plant,
• Weight, which corresponds to the dry weight of a plant,
• Type, which corresponds to the plant type (native = 0 and exotic = 1),
• Polin, which corresponds to the pollination mode of the plant (Wind, Water, Insect and Self).

Your job is to assist the research team by analysing the data set provided to you.

A limited number of rows of your unique personal dataset are shown below. Do NOT copy-paste these data.

## Weight Height Type Polin
##  18.63 117.27    0
##  26.38 167.34    0 Self
##  21.21 144.77    1 Water
##  21.20 150.80    0 Self
##  23.66 160.88    0 Self
##  23.22 171.89    0 Self
##  27.36 198.51    1 Water
##  21.62 134.98    0 Self
##  28.66 212.83    1 Water
##  26.61 171.97    1 Water
##  15.98 129.58    0 Self
##  21.45 199.92    0 Insect
##  25.11 153.30    0 Self
##  27.25 217.86    1 Insect
##  30.90 214.63    1 Water
##  20.25 111.85    0 Self
##  29.68 226.12    1 Insect
##  22.53 189.70    1 Wind
##  20.33 131.83    0 Insect
##  21.94 166.34    1 Insect
##  20.27 155.97    0 Self
##  22.27 175.60    1 Wind
##  30.41 229.32    1 Insect
##  19.31 128.13    0 Self
##  33.10 234.04    1 Wind
##  23.13 138.08    0 Insect
##  21.90 160.67    0 Self
##  14.96 118.57    0 Insect
##  35.81 246.04    1 Wind
##  21.07 138.20    0 Self
##  28.40 210.89    1 Insect
##  15.79 153.13    0 Self
##  22.87 185.12    1 Water
##  20.28 152.31    0 Self
##  23.97 162.33    0 Insect
##  22.67 177.58    1 Insect
##  21.60 158.10    1 Water
##  32.00 202.77    1 Insect
##  18.96 142.85    1 Wind
##  20.75 135.65    0 Self
## .........................................

4 Reading the data into RStudio

The data are in a text file with the name 5428814.csv. This file was sent to you by email (see page 1). To complete this assignment, you need to use the FULL dataset provided to you by email. Do NOT copy and paste the data on the previous page as your dataset.

The first step is to read the data into RStudio. The data format is similar to what you have already worked with in the weekly Mobius lessons. Follow the instructions given in section R1.4 “How to import a text file into RStudio” of the RStudio “How-To-Manual” available on Moodle. Alternatively, you can review your lecture slides to find the R function to use to import a CSV file into RStudio. Two arguments that are often used when calling this function are header = TRUE (to indicate that names of variables are present on the first line in the file) and row.names = 1 (to indicate that names of the cases are provided in the first column).
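As a rough illustration of the two arguments just described, the sketch below writes a tiny made-up CSV file to a temporary location and reads it back. The file contents, the column names and the object names toy_path and toy are hypothetical and unrelated to your dataset. The use of read.csv() is also an assumption, since this document leaves the choice of import function to the manual and lecture slides. Whether your own file has a first column of case names is for you to check before using row.names = 1.

```r
# Hypothetical toy file (made up for illustration only):
# the first line holds variable names, the first column holds case labels.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Length,Colour",
             "a1,1.5,red",
             "a2,2.5,blue",
             "a3,4.0,red"),
           toy_path)

# header = TRUE: the first line contains the variable names.
# row.names = 1: the first column contains the case names.
toy <- read.csv(toy_path, header = TRUE, row.names = 1)

str(toy)        # shows the class R assigned to each column
rownames(toy)   # the case labels taken from the first column
```

Calling str() straight after an import is a quick way to see whether each column was read with the class you intended.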
Another very important argument is colClasses, which takes a vector of classes to be assumed for the columns (such as "character" for strings, "factor" for a categorical variable, and "numeric" for a quantitative variable). Once you have imported the data, you are ready to start your analysis!

Checkpoint: To make sure everything is all right, we suggest that you first calculate the average of the n = 500 values read from your file 5428814.csv for each quantitative variable, and check that they match the values given below. If your data have been stored in an R object called student.data, you can type

    print(colMeans(student.data[, unlist(lapply(student.data, is.numeric))], na.rm = TRUE), digits = 5)

where na.rm = TRUE tells R to remove not available (NA), i.e., missing, values, if any.

## Weight  Height   Type
## 23.302 164.100  0.464

If they match, you have imported the data correctly into RStudio and are ready to start!

IMPORTANT: Completing this checkpoint is essential. If you load the dataset incorrectly, you will have incorrect answers throughout the entire assignment, and you will have marks removed for every incorrect answer.

5 The Analysis Tasks

The questions below are structured in a logical sequence, applicable to analysing a wide range of real-world data sets. Working through these questions will deepen your understanding of key concepts covered in the lecture slides and help you locate specific information within them, a skill that will prove invaluable for the final exam. While the slides will be available during the exam, the volume of content makes it challenging to quickly find the necessary information unless you have spent substantial time reviewing and familiarising yourself with them beforehand. Therefore, it is strongly recommended that you carefully review the lecture slides as you complete this assignment.

PART I: Study Design

Q1. In this question, you will think about the research questions and aspects of study design.
For all parts in Q1, your answers should be no more than one sentence long.
1.a. Briefly explain the research question that the stakeholders are interested in, based on what is described in the scenario. Keep this in mind when you analyse the data in Parts II and III.
1.b. What is the population that is of interest to the researchers?
1.c. What are the cases here? (We do not expect a list of all cases here.)
1.d. Is it an observational study or is it an experiment? Provide a brief justification for your answer.

With the markers in mind, in your assignment, please start every question on a new page.

Q2. In this question, you will describe the organisation of the data. For each of parts a to c, your answer should be no more than two sentences.
2.a. Your data are provided to you in a specific file format. What is the extension of the data file and what does the extension stand for?
2.b. What is the sample size? (We expect a value here.)
2.c. Are there any IDs (labels) provided in this data set? If not, how could you define them?
2.d. Complete the table below so that it lists all of the variables that are contained in the dataset and the type of each variable. You should add rows to the table as required. When describing the type of each variable, you should be more specific than just saying that the variable is categorical or quantitative, i.e., you should specify what kind of categorical or quantitative variable it is.

Table 1: Table to be completed and submitted with your assignment.
Variable Name        Variable Type

PART II: Exploratory Data Analysis

Q3. Your second task, as for any statistician, is to explore your data with univariate analyses to gain a good understanding of each variable in the data set. This is always a good strategy to help you detect problems in a data set, and also to know enough about your data to better answer the research questions.
3.a. Let us deal with missing values first, if any. How many missing values are there in your dataset? You can determine this using the R function is.na(). (They are indicated by NA entries after importation into R, a code meaning “Not Available”.) Just state the number of missing values.
3.b. When doing initial data exploration, it is always good to consider the potential reasons for missing data and where they appear in the data set. One way to handle missing values is sometimes to replace all of them with a suitably chosen value. Other times, it is more appropriate to leave them as they are. Which variable or variables contain missing values, if any? What is the appropriate strategy here? Justify your answer. Your answer should be no more than two sentences.
3.c. We now move on to univariate graphical summaries. Create a boxplot of the variable Weight. Include it, properly labelled, in your submitted assignment.
3.d. Comment on the presence or absence of outliers in the boxplot you produced in part 3.c (in no more than one sentence).
3.e. Create an appropriate graphical summary for the variable Polin. Include it, properly labelled, in your submitted assignment.
3.f. Comment in no more than one sentence on the graphical summary in part 3.e.
3.g. We now move on to univariate numerical summaries. Compute an appropriate numerical summary for the variable Type.
3.h. In no more than one sentence, comment on the result of part 3.g.
3.i. Compute the five-number summary of the variable Height for all plants combined (native and exotic). (Do NOT use the fivenum() function since it does not compute the quartiles.)
3.j. In no more than one sentence, comment on the result of part 3.i.

Q4.
4.a. We now move on to bivariate graphical summaries. We want to study the relationship between the variables Height and Weight. What type of graphical summary is appropriate for this?
Just state the name of the summary. (Do not give the graphical summary at this point. You will be asked to do so in part 4.d.)
4.b. It is sometimes appropriate to add a least-squares line to graphical summaries of the kind referred to in part 4.a. Is it the case here? Just answer yes or no for this part.
4.c. Justify your answer to part 4.b. Write no more than 3 sentences.
4.d. Now, produce the graphical summary referred to in part 4.a. Ensure that your plot is properly labelled and include it in your assignment.
4.e. Describe the nature of the relationship observed on the plot you produced in part 4.d, using the four adjectives (or their antonyms) given in the lecture slides. Do you notice any unusual observations on this plot? Your answer should be no more than four sentences, but writing only one sentence should suffice.
4.f. What is an appropriate numerical summary to describe the relationship between Weight and Height? Just state the name of the numerical summary.
4.g. Compute the value of the numerical summary referred to in part 4.f. Give your answer to at least two decimal places.
4.h. Comment on the value of the numerical summary you computed in part 4.g in no more than one sentence.
4.i. Produce an appropriate graph to study the association between the variables Polin and Type. Ensure that your plot is properly labelled and include it in your assignment.
4.j. Comment on the plot you produced in part 4.i in no more than two sentences.

PART III: Modeling and Inference

Q5. Now, we are going to do some modeling and statistical inference.
5.a. Let $\mu$ be the current population mean plant height (in centimeters) in the national park. The research team decided to compare the current mean plant height with the mean from 20 years ago using plant height data obtained from the same national park.
It is known that the mean plant height from 20 years ago is $\mu_0 = 190$ centimeters. Recall the name of the hypothesis testing strategy you can use here (just state the name of the test).
5.b. Perform an appropriate hypothesis test to compare the true means of Height between the two periods of time. You must summarise all steps in your solution:
• (1) state the null (give both $H_0$ and $\tilde{H}_0$) and alternative ($H_a$) hypotheses relevant to the research objectives stated in this scenario,
• (2) write down an expression/formula for a suitable test statistic,
• (3) give its observed value in the sample,
• (4) give the null distribution for this statistic,
• (5) give the expression of the P-value,
• (6) give the numerical value of the P-value,
• (7) give your interpretation of the P-value and
• (8) give your conclusion in plain language.
5.c. Some assumptions need to be made for the sampling distribution of the test statistic of part 5.b to be valid. State these assumptions and briefly discuss the validity of each assumption needed to safely apply this hypothesis test. Justify your answer.

Q6.
6.a. Produce a 95% confidence interval for $\mu$, the present mean plant height. For this question you may assume that it is appropriate to use a t-distribution.
6.b. Does this confidence interval include the value $\mu_0$ given in part 5.a?
6.c. Is your answer to 6.b consistent with your conclusions from the hypothesis test in part 5.b?
6.d. Referring back to the scenario, write a one-sentence plain-language interpretation of the confidence interval obtained above.

END OF ASSIGNMENT