R assignment help: MATH1905 Assignment 1 | 学霸联盟

Date: 2021-10-04

The University of Sydney
School of Mathematics and Statistics

Assignment 1
MATH1905: Statistical Thinking with Data (Advanced)
Semester 2, 2021
Lecturers: Uri Keich

Copyright © 2021 The University of Sydney

This individual assignment is due by 11:59pm Thursday 14 October 2021, via Canvas. Late assignments will receive a penalty of 5% per day until the closing date. A single PDF copy of your answers must be uploaded in Canvas. It should include your SID and your tutorial day and time. To ensure compliance with our anonymous marking obligations, please do not under any circumstances include your name in any area of your assignment; only your SID should be present. Please make sure you review your submission carefully. What you see is exactly how the marker will see your assignment. Submissions can be overwritten until the due date. If you have technical difficulties with your submission, see the University of Sydney Canvas Guide, available from the Help section of Canvas.

This assignment is worth 10% of your final assessment for this course. Your answers should be well written, neat, thoughtful, mathematically concise, and a pleasure to read. Please cite any resources used and show all working. Unless the question asks you to plot or otherwise compute something obvious, you should explain what any code that you provide is doing! You can use R markdown (creating an HTML file which you can then export to PDF) or R Sweave (which directly creates a PDF file) to generate a nicer looking report, but this is not mandatory. The important thing is that any R code you use to generate your report is shown in its entirety: no truncated or skipped lines of code.

The School of Mathematics and Statistics encourages some collaboration between students when working on problems, but students must write up and submit their own version of the solutions.

1. The bivariate data (x, y) in the file linear_analysis_data_n1e4.txt (available through the Resources page on Canvas) obeys the relation

       $y = A + Bx + e, \qquad (1)$

   where A = 1.5, B = 0.5, and e is independent of x (Cor(e, x) ≈ −0.003).

   (a) How many data points are in this set?
   (b) Is the scatterplot of y vs. x ellipse-shaped, and are x or y bell-shaped?

   In the lectures we promoted the regression line as a line that can well-interpolate the graph of averages when the scatterplot is ellipse-shaped. However, recalling that it is also the line that minimizes the RMS prediction error, it is tempting to use the regression line to try and estimate A and B in (1) above.

   (c) Add two lines to the scatterplot of y vs. x: the regression line in red and the "correct" A + Bx line in blue.
   (d) Use the above figure and examine the actual coefficients to gauge how well the regression line estimates A and B.
   (e) Use the normal approximation of y in the vertical strip x ∈ (1.9, 2.1) to estimate the proportion of y values in that strip that exceed 2.7.
   (f) Compare the last estimate with the actual proportion of y values in that strip that exceed 2.7.

2. Recall that we briefly showed in class, and further demonstrated in the week 7 tutorial, that the linear model can be used to analyze nonlinear data. The bivariate data (t, w) in the file nonlinear_analysis_data_n4e4.txt (available through the Resources page on Canvas) obeys the relation

       $w = \exp(A + B \sin(Ct) + e),$

   where A, B, C ∈ R, and e is independent of t (Cor(e, t) ≈ −0.006).

   (a) How many data points are in this set?
   (b) Are t or w bell-shaped?
   (c) Draw the scatterplot of w vs. t and use it to try and visually determine the value of C from the following 4 possible values: C ∈ {π/2, π, 3π/2, 2π}.
   (d) Find a more rigorous way to determine which of the four possible values of C is the most suitable for this data. Hint: let x = sin(Ct) and consider y = log w. Does it agree with your visually determined value?
   (e) Estimate A and B.
   (f) Redraw the scatterplot of w vs. t while adding the curve $w = \exp(a + b \sin(ct))$, where a, b, c are your estimates of A, B, C respectively.
   (g) Using the density scale, plot the histogram of log w for t ∈ (0.99, 1.01) and add the normal curve shifted and scaled using the proper estimates of the average and the SD of log w in this strip.
   (h) The last histogram should convince you it is reasonable to use the normal approximation to estimate the proportion of w values in that strip that exceed 1.8. In estimating that proportion, again, use the proper estimates of the average and the SD of log w in the strip.
   (i) Compare the last estimated proportion with the actual one.

3. Recall that the slope of the regression line is given by $b = r \cdot \mathrm{SD}_y / \mathrm{SD}_x$. Prove the following identity or give a counterexample:

       $b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}) \cdot y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$

4. Find the value of b that minimizes the RMS prediction error among all lines that go through the origin, in other words, among all lines of the form y = b · x where b ∈ R.
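
The workflow described in Question 1 (fit the regression line, compare it with the true line, then use the normal approximation inside a vertical strip) can be sketched in R on simulated data. This is only an illustration under stated assumptions: the data are generated here rather than read from linear_analysis_data_n1e4.txt, and the normal distributions chosen for x and e, the seed, and all variable names are assumptions made for the sketch, not facts about the assignment data.

```r
# Simulated stand-in for the assignment data. The distributions of x and e
# below are ASSUMPTIONS for illustration, not properties of the real file.
set.seed(1)
n <- 1e4
A <- 1.5
B <- 0.5
x <- rnorm(n, mean = 2, sd = 1)    # assumed distribution of x
e <- rnorm(n, mean = 0, sd = 0.5)  # assumed noise, generated independently of x
y <- A + B * x + e

# Regression of y on x; coef() returns the estimated intercept and slope.
fit <- lm(y ~ x)
est <- coef(fit)

# Scatterplot with the regression line (red) and the true line A + Bx (blue).
plot(x, y, pch = ".", main = "y vs. x (simulated)")
abline(fit, col = "red")
abline(a = A, b = B, col = "blue")

# Normal approximation inside the vertical strip 1.9 < x < 2.1:
# estimate the proportion of y values in the strip that exceed 2.7.
in_strip <- x > 1.9 & x < 2.1
y_strip <- y[in_strip]
approx_prop <- pnorm(2.7, mean = mean(y_strip), sd = sd(y_strip),
                     lower.tail = FALSE)

# Actual proportion in the same strip, for comparison.
actual_prop <- mean(y_strip > 2.7)

print(est)
print(c(normal_approx = approx_prop, actual = actual_prop))
```

Note that R's `sd()` divides by n − 1. If the SD is defined with divisor n, the two versions differ by a factor of sqrt((n − 1)/n), which is negligible when the strip holds several hundred points.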