Coding assignment sample: STA302/1001 | 学霸联盟

Date: 2021-10-15

STA302/1001 - Methods of Data Analysis 1
LEC0101 Midterm - February 25, 2020

UToronto Email:
Last Name:
First Name:
Student ID:

Instructions

1. Write your UToronto email, name and ID number at the top of this page. Make sure that these match the information in Quercus.
2. Questions are on both sides of the page. There should be a total of 9 pages.
3. Answer the questions in the spaces provided. You should not need any extra pages.
4. Your grade will be influenced by how clearly you express your ideas, and how well you organize your solutions. You must show all your work to get full credit.

Marking Scheme:

Question   Out of   Grade
1          18
2          8
3          17
4          16
MC         5
Total      64

1. (18 pts) A company that sells photocopiers to businesses also provides maintenance and repairs on them when needed. The company keeps records of the maintenance calls. For each call, they record the number of photocopiers serviced (X) and the total number of minutes spent working on the machines by a serviceperson (Y). The average number of photocopiers serviced per call is 5.11 and the average number of minutes spent working on a call is 76.27. The company models the relationship between number of photocopiers serviced and repair time with a simple linear regression, with the results seen below.

Call:
lm(formula = Y ~ X, data = copier)

Residuals:
     Min       1Q   Median       3Q      Max
-22.7723  -3.7371   0.3334   6.3334  15.4039

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5802     2.8039  -0.207    0.837
X            15.0352     0.4831  31.123   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.914 on 43 degrees of freedom
Multiple R-squared:  0.9575, Adjusted R-squared:  0.9565
F-statistic: 968.7 on 1 and 43 DF,  p-value: < 2.2e-16

Using this R output, answer the following questions:

(a) (1 pt) Interpret the coefficient of determination in the context of the data.
95.75% of the variation in the total number of minutes spent working on the photocopiers is explained by the number of photocopiers being serviced (or, alternatively, by the model). (1 point, but 0 points if they don't interpret in the context of the data.)

(b) (3 pts) Complete the ANOVA table below. Show all your work.

Source       DF   Sum Squares   Mean Squares   F value
Regression    1      76972.32       76972.32     968.7
Residual     43       3416.75          79.46         -
Total        44      80389.07              -         -

Degrees of freedom: 1 for Regression (it is always 1 in simple linear regression), 43 for Residual (from the DF of the F test in the output), 44 for Total (by decomposition). (0.25 point if all 3 are correct, 0 otherwise)

The F value is read directly from the R output under F-statistic: 968.7. (0.25 point)

Residual sum of squares, from the residual standard error in the R output:

$$s_e = 8.914 = \sqrt{\frac{RSS}{n-2}} \;\Rightarrow\; RSS = (n-2)\,s_e^2 = 43(8.914)^2 = 3416.75$$ (0.5 point)

Mean squares residual:

$$s_e^2 = \frac{RSS}{n-2} = \frac{3416.75}{43} = 79.46$$ (0.5 point)

Mean squares regression (the value shown uses the unrounded $s_e^2 = 8.914^2$):

$$F = \frac{MS_{reg}}{MS_{res}} \;\Rightarrow\; MS_{reg} = F \cdot MS_{res} = 968.7(79.46) = 76972.32$$ (0.5 point)

Sum of squares regression: because the regression has 1 degree of freedom, $SS_{reg} = MS_{reg} = 76972.32$. (0.5 point)

Total sum of squares:

$$SST = SS_{reg} + RSS = 76972.32 + 3416.75 = 80389.07$$ (0.5 point)

(c) (5 pts) How long would we expect the actual repair/maintenance time (in total number of minutes) spent by a serviceperson to be if they are working on 3 photocopiers? Compute an appropriate 95% interval using the R output above.
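For the question just posed in part (c), the interval can be assembled from the R output alone. The following is a minimal Python sketch, not part of the original exam: the function name `prediction_interval` and its parameter names are made up for illustration, SXX is recovered from the slope's standard error as $SXX = (s_e / SE(\hat\beta_1))^2$, and the critical value 2.021 is assumed to be the nearest entry in a printed t table rather than the exact quantile for 43 degrees of freedom.

```python
import math

def prediction_interval(b0, b1, s_e, se_b1, n, x_bar, x_star, t_crit):
    """95% prediction interval for a new response at x_star.

    Hypothetical helper: every input is a number read off the lm() summary,
    except t_crit, which is taken from a t table.
    """
    sxx = (s_e / se_b1) ** 2                     # SXX recovered from SE(slope)
    y_hat = b0 + b1 * x_star                     # point prediction
    # Standard error for predicting an actual value (note the leading 1)
    se_pred = s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    half_width = t_crit * se_pred
    return y_hat, se_pred, (y_hat - half_width, y_hat + half_width)

# Values copied from the R output and the question text
y_hat, se_pred, (lower, upper) = prediction_interval(
    b0=-0.5802, b1=15.0352, s_e=8.914, se_b1=0.4831,
    n=45, x_bar=5.11, x_star=3, t_crit=2.021,
)
print(f"prediction {y_hat:.2f}, SE {se_pred:.2f}, interval [{lower:.2f}, {upper:.2f}]")
```

Dropping the leading 1 inside the square root would give a confidence interval for the mean response instead; the question asks about an actual repair time, so the prediction form is used here.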
The predicted value is

$$\hat{y}^* = -0.5802 + 15.0352(3) = 44.53$$ (1 point)

In order to compute the standard error of the prediction, we need to find SXX, which we can do using the standard error of the estimate of the slope:

$$SE(\hat\beta_1) = \frac{s_e}{\sqrt{SXX}} \;\Rightarrow\; SXX = \frac{s_e^2}{SE(\hat\beta_1)^2} = \frac{8.914^2}{0.4831^2} = 340.464$$ (1 point)

So the standard error of the prediction of an actual value is

$$SE(\hat{y}^*) = s_e\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}} = 8.914\sqrt{1 + \frac{1}{45} + \frac{(3 - 5.11)^2}{340.464}} = 9.07$$ (1 point)

The closest t-value in the table is approximately $t_{0.975,43} \approx 2.021$ (1 point), so the 95% prediction interval is

$$\hat{y}^* \pm t_{0.975,43}\,SE(\hat{y}^*) = 44.53 \pm 2.021(9.07) = [26.20,\ 62.86]$$ (1 point)

(d) (2 pts) Is there a strong linear relationship between number of photocopiers per call and the total number of minutes spent by a serviceperson? Justify your conclusion by using two pieces of information from the R output (do not use the same number more than once).

Yes, there is a strong linear relationship. The coefficient of determination tells us that more than 95% of the variation in total number of minutes spent by a serviceperson is explained by the model (1 point). Further, the p-value in the overall test of significance is small, which indicates that a significant linear relationship exists between total number of minutes and number of copiers (could also use the conclusion of the t-test) (1 point).

(e) (7 pts) Consider a maintenance call for 10 photocopier machines that required the serviceperson to spend a total of 127 minutes on the call. Determine whether this call is a leverage point, an outlier, an influential point or some combination of these. Justify your answer with appropriate numerical summaries.
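The numerical summaries this question asks for (leverage, standardized residual, Cook's distance) all follow from the R output plus $SXX \approx 340.464$ found in part (c). Below is a minimal Python sketch of that bookkeeping; the function `point_diagnostics` and its dictionary keys are hypothetical names, and the cutoffs ($4/n$ for leverage, $|r_i| > 2$ for outliers, $4/(n-2)$ for Cook's distance) are the rules of thumb used in this solution key, not universal constants.

```python
import math

def point_diagnostics(x_i, y_i, b0, b1, s_e, sxx, n, x_bar):
    """Leverage, standardized residual and Cook's distance for one observation
    in a simple linear regression (hypothetical helper for illustration)."""
    y_hat = b0 + b1 * x_i
    resid = y_i - y_hat
    h = 1 / n + (x_i - x_bar) ** 2 / sxx          # leverage h_ii
    r = resid / (s_e * math.sqrt(1 - h))          # standardized residual
    cooks = (r ** 2 / 2) * (h / (1 - h))          # Cook's distance, 2 parameters
    return {
        "fitted": y_hat,
        "residual": resid,
        "leverage": h,
        "std_residual": r,
        "cooks_d": cooks,
        # Rule-of-thumb flags, using the cutoffs adopted in this solution key
        "is_leverage": h > 4 / n,
        "is_outlier": abs(r) > 2,
        "is_influential": cooks > 4 / (n - 2),
    }

# The call described in the question: 10 machines, 127 minutes
diag = point_diagnostics(x_i=10, y_i=127, b0=-0.5802, b1=15.0352,
                         s_e=8.914, sxx=340.464, n=45, x_bar=5.11)
for name, value in diag.items():
    print(name, value)
```

Because the sketch carries unrounded intermediate values, its Cook's distance can differ in the third decimal place from a hand calculation that rounds the leverage first.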
The predicted value for this service call is

$$\hat{y} = -0.5802 + 15.0352(10) = 149.77$$ (1 point)

The estimated residual for this observation is

$$\hat{e}_i = y_i - \hat{y}_i = 127 - 149.77 = -22.77$$ (1 point)

The leverage value for this observation is

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SXX} = \frac{1}{45} + \frac{(10 - 5.11)^2}{340.464} = 0.092$$ (1 point)

The cutoff for a leverage point is $4/n = 4/45 = 0.089$ (0.5 point for cutoff). Since 0.092 exceeds this cutoff, this observation is a leverage point.

The standardized residual is

$$r_i = \frac{\hat{e}_i}{s_e\sqrt{1 - h_{ii}}} = \frac{-22.77}{8.914\sqrt{1 - 0.092}} = -2.68$$ (1 point)

Since this value is smaller than -2, it falls outside the interval [-2, 2], so this observation is also an outlier.

The Cook's Distance is

$$D_i = \frac{r_i^2}{2}\cdot\frac{h_{ii}}{1 - h_{ii}} = \frac{(-2.68)^2}{2}\cdot\frac{0.092}{1 - 0.092} = 0.364$$ (1 point)

Since this value is quite large (i.e. larger than $4/(n-2) = 4/43 = 0.093$ (0.5 point for cutoff)), we see that this observation is a leverage point, an outlier, and an influential point. (1 point)

2. (8 pts) An experimenter wishes to study the rate of change in a response Y when values of a predictor X are changed. The experimenter believes the relationship between the response and predictor is linear and needs help deciding which values of the predictor he should use in his experiment. The region of interest is from X = 4 to X = 13. He has enough resources to obtain 10 observations.

(a) (2 pts) What values of X should the experimenter use to ensure that his estimate of the average change in the response for a unit increase in the predictor will have smallest variance? Justify your choice.

Because the standard error of the slope is given by

$$SE(\hat\beta_1) = \frac{s_e}{\sqrt{SXX}} = \frac{s_e}{\sqrt{\sum_{i=1}^{10}(x_i - \bar{x})^2}}$$

we need the sum of squared deviations of the predictor values from their mean (SXX) to be as large as possible (1 point). We can achieve this by placing the predictor values at the two ends of the range of possible values. Thus we would want to choose 5 observations at x = 4 and 5 observations at x = 13. (1 point)

(b) (2 pts) The experimenter takes your advice and runs his experiment using your chosen predictor values.
When he fits the linear model to his data, he is disappointed that the variance of the average rate of change is still quite large. Explain to him why this might be happening.

This can be happening for two different reasons. Since

$$SE(\hat\beta_1) = \frac{s_e}{\sqrt{SXX}} = \frac{\sqrt{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}}{\sqrt{n-2}\,\sqrt{SXX}}$$

we could be getting a large standard error because the sample size is too small (1 point), or because the deviations of the observed responses from the regression line are too large (so the response values are too variable). (1 point)

(c) (2 pts) The experimenter decides to build a 95% confidence interval for the rate of change in the response. The confidence interval he calculates does not contain zero, so he claims this means we know for certain that there is a true linear relationship between the predictor and response. Explain what is wrong with his conclusion.

The researcher has incorrectly interpreted the confidence interval. He can only say that his sample of data yielded a confidence interval that did not contain 0. He should have said that we are 95% confident that this sample, and thus this interval, is one of the 95% of all intervals that capture the true slope. (1 point) If this is true, then we know the true slope is not 0, but it is not possible to verify this for certain. (1 point)

(d) (2 pts) Upon inspection of the Normal QQ plot for his data, the experimenter notices that Normality seems to be violated. He wonders how this will impact any conclusions he makes with his confidence interval. Explain how non-Normality will affect his confidence interval.

Non-normality has the potential to reduce the coverage probability of the confidence interval, as seen in assignment 2. (1 point) So if we were to make conclusions based on this interval, such as to conclude a test, our confidence that we have captured the true value in our interval (i.e.
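that our interval is one of the 95% that does capture the true slope) should be lower than the 95% we want. This means we have a false sense of confidence in our sample. (1 point)

3. (17 pts) Suppose we have collected a random sample of n pairs $(x_i, y_i)$ from a population where the true relationship between the response and predictor is

$$Y = \beta_0 + \beta_1 X + \varepsilon, \quad \text{where } \varepsilon \mid X \sim N(1, \sigma^2)$$

(a) (2 pts) What is the true population average response when X = x?

$$E(Y \mid X = x) = E(\beta_0 + \beta_1 x + \varepsilon \mid X = x)$$ (1 point)

$$= \beta_0 + \beta_1 x + E(\varepsilon \mid X = x) = \beta_0 + \beta_1 x + 1$$ (1 point)

(b) (5 pts) Derive the least squares estimators of the intercept and slope for this population regression line.

The least squares criterion we minimize is

$$RSS = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$

Take the first derivative with respect to both parameters and set each to zero:

$$\frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$$ (1 point)

$$\frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i(y_i - \beta_0 - \beta_1 x_i) = 0$$ (1 point)

Solve for each of the parameters. The first equation gives $\beta_0 = \bar{y} - \beta_1\bar{x}$. Substituting this into the second equation:

$$0 = \sum_{i=1}^{n}x_i y_i - \beta_0\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2$$

$$\Rightarrow \sum_{i=1}^{n}x_i y_i = (\bar{y} - \beta_1\bar{x})\sum_{i=1}^{n}x_i + \beta_1\sum_{i=1}^{n}x_i^2 = n\bar{y}\bar{x} + \beta_1\left[\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right]$$ (1 point)

$$\Rightarrow \hat\beta_1 = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}$$ (1 point)

$$\Rightarrow \hat\beta_0 = \sum_{i=1}^{n}\frac{y_i}{n} - \hat\beta_1\sum_{i=1}^{n}\frac{x_i}{n} = \bar{y} - \hat\beta_1\bar{x}$$ (1 point)

(c) (5 pts) Determine whether the least squares estimator of the slope from part (b) is unbiased.

To show unbiasedness, we must take the expectation of the estimator, treating the $x_i$ as fixed. Write $SXX = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2$ for the denominator.

$$E[\hat\beta_1] = E\left[\frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{SXX}\right]$$ (1 point)

$$= \frac{1}{SXX}\left\{\sum_{i=1}^{n}x_i E[y_i] - n\bar{x}E[\bar{y}]\right\}$$ (1 point)

$$= \frac{1}{SXX}\left\{\sum_{i=1}^{n}x_i(\beta_0 + \beta_1 x_i + E[\varepsilon_i]) - n\bar{x}(\beta_0 + \beta_1\bar{x} + E[\bar\varepsilon])\right\}$$ (1 point)

The $\beta_0$ terms cancel because $\beta_0\sum_{i=1}^{n}x_i = n\bar{x}\beta_0$, leaving

$$= \frac{1}{SXX}\left\{\beta_1\left[\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right] + \sum_{i=1}^{n}x_i(1) - n\bar{x}(1)\right\}$$ (1 point)

$$= \frac{1}{SXX}\left\{\beta_1\left[\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right] + 0\right\} = \beta_1\,\frac{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2} = \beta_1$$ (1 point)

so the least squares estimator of the slope is unbiased.

(d) (5 pts) Determine whether the least squares estimator of the intercept from part (b) is unbiased. If not, is it possible to collect data in such a way as to get an unbiased estimate of the intercept? Justify your answer.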
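A simulation cannot replace the derivation, but it can illustrate what the algebra in parts (c) and (d) predicts when the errors have mean 1 rather than 0. The Python sketch below is an illustration under assumed settings, not part of the exam: the parameter values, the fixed design x = 1, ..., 20, the seed, and the function names `ols` and `simulate` are all arbitrary choices.

```python
import random

def ols(xs, ys):
    """Least squares intercept and slope, using the formulas from part (b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def simulate(beta0=2.0, beta1=0.5, sigma=2.0, reps=2000, seed=302):
    """Average the OLS estimates over many samples with errors ~ N(1, sigma^2).

    All defaults are assumed values chosen only for illustration.
    """
    rng = random.Random(seed)
    xs = list(range(1, 21))                 # fixed design, reused every replication
    b0_sum = b1_sum = 0.0
    for _ in range(reps):
        ys = [beta0 + beta1 * x + rng.gauss(1.0, sigma) for x in xs]
        b0, b1 = ols(xs, ys)
        b0_sum += b0
        b1_sum += b1
    return b0_sum / reps, b1_sum / reps

mean_b0, mean_b1 = simulate()
print(f"average intercept estimate: {mean_b0:.3f} (true beta0 = 2.0)")
print(f"average slope estimate:     {mean_b1:.3f} (true beta1 = 0.5)")
```

Under these assumptions the average slope estimate should sit close to the true slope, while the average intercept estimate should sit close to the true intercept plus 1, whatever design points are chosen.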
Again, to show unbiasedness, we take the expectation of the estimator:

$$E[\hat\beta_0] = E[\bar{y} - \hat\beta_1\bar{x}]$$ (1 point)

$$= E[\beta_0 + \beta_1\bar{x} + \bar\varepsilon] - \bar{x}E[\hat\beta_1]$$ (1 point)

$$= \beta_0 + \beta_1\bar{x} + E[\varepsilon_i] - \bar{x}\beta_1$$ (1 point)

$$= \beta_0 + 1$$ (1 point)

So the estimator of the intercept is biased. Since it is biased by a constant term, there is nothing we can do in our data collection that will remove this bias. (1 point)

4. (16 pts) Nutritional information from 77 different breakfast cereals has been collected by a nutritionist. The nutritionist is interested in the relationship between Calories per serving and Carbohydrates (in grams). The following summary statistics have already been found:

$$\sum_{i=1}^{77}x_i = 1124, \quad \sum_{i=1}^{77}y_i = 8230, \quad \sum_{i=1}^{77}x_i^2 = 17799, \quad \sum_{i=1}^{77}x_i y_i = 121725, \quad s_e = 18.99$$

(a) (3 pts) What is the fitted regression line for the relationship between Calories and Carbohydrates?

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2} = \frac{121725 - 77(1124/77)(8230/77)}{17799 - 77(1124/77)^2} = 1.14$$ (1 point)

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = \frac{8230}{77} - 1.14\left(\frac{1124}{77}\right) = 90.24$$ (1 point)

So the fitted regression line is $\hat{y} = 90.24 + 1.14x$. (1 point)

(b) (5 pts) Determine if the average calories per serving significantly increases when the carbohydrate content increases by 1 gram. Use a formal hypothesis test, including appropriate hypotheses, test statistic, p-value and conclusion in the context of the data.

The hypotheses are

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_a: \beta_1 > 0$$ (1 point, 0 if betas have hats or if alternative is wrong)

To build the test statistic, we need the standard error of the slope, where $SXX = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2 = 17799 - 77(1124/77)^2 = 1391.519$ is the denominator from part (a):

$$SE(\hat\beta_1) = \frac{s_e}{\sqrt{SXX}} = \frac{18.99}{\sqrt{1391.519}} = 0.509$$ (1 point)

so the test statistic is

$$T = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \frac{1.14}{0.509} = 2.24$$ (1 point)

The p-value requires comparing to a T distribution with 75 degrees of freedom (this is not in your table, so you will need to approximate its location):

$$\text{p-value} = P(T_{75} > 2.24) \in [0.01, 0.025]$$ (1 point)

because 2.24 is located in between 2.0 and 2.4, which for a one-tailed test means the p-value has to be between 0.01 and 0.025.
Thus we reject the null hypothesis and conclude that mean calories significantly increases as carbohydrates increases by 1 gram. (1 point only if they use the context of the data)

(c) (3 pts) Based on the plots below, discuss whether each regression modelling assumption is satisfied and if there are any problematic observations.

[Figure: four diagnostic plots. Scatterplot (Calories vs Carbohydrates); Residuals vs Predictor (Residuals vs Carbohydrates); Residuals vs Fitted (Residuals vs Fitted Values); Normal Q-Q Plot (Sample Quantiles vs Theoretical Quantiles).]

*** If justifications seem reasonable and aren't completely contradictory to the plots, they can receive full marks.

• Linearity: Based on the scatterplot, linearity seems to hold, although there does appear to be a slight increasing pattern in the plots that may suggest otherwise. (0.5 point)
• Independence: There does appear to be some evidence of grouping in both residual plots, with a large cluster in the top right area and a smaller cluster with large negative residuals, so independence may be violated. (0.5 point)
• Constant variance: Difficult to tell, since the lone observation on the left of the residual plots may make it appear as if we have a funnel pattern. Even without this point, there may be an increasing relationship in the plots, which may suggest that the regression line is more likely to over-predict as X increases. (0.5 point)
• Normality: The lifting in the tails suggests that normality may be violated. (0.5 point)
• Problematic observations: There appears to be an observation that is far from the other observations and appears to have a negative carbohydrate value. (1 point)

(d) (2 pts) Provide two possible methods that can be used to correct non-constant variance.

One can transform the response variable and refit the model.
(1 point) Alternatively, one may use weighted least squares regression. (1 point)

(e) (3 pts) Find a variance stabilizing transformation when the response follows a distribution with a mean of θ and a variance of θ². Show your work.

Use the Delta Method:

$$Var(f(Y)) \approx [f'(E[Y])]^2\,Var(Y)$$

Here the mean of Y is θ and the variance is θ², so we require the variance of the transformed response to equal a constant c:

$$Var(f(Y)) \approx [f'(\theta)]^2\theta^2 = c$$ (1 point)

$$\Rightarrow [f'(\theta)]^2 = \frac{c}{\theta^2} \;\Rightarrow\; f'(\theta) = \frac{\sqrt{c}}{\theta}$$ (1 point)

$$\Rightarrow f(\theta) = \int\frac{\sqrt{c}}{\theta}\,d\theta = \sqrt{c}\,\ln(\theta) + c^*$$ (1 point)

which tells us that a natural logarithm transformation will result in constant variance.

(5 pts) Multiple Choice Section: Please answer the multiple choice questions on the attached bubble sheet, corresponding to questions 1-5. A correct response is one point while an incorrect or no response is worth 0 points. If you do not answer the questions on the bubble sheet, your answers will not be graded.

1. Which term corresponds to the following definition? The variation in the response variable explained by the model.
(a) Sample Variance
(b) Coefficient of Determination
(c) Residual Mean Squares
(d) Regression Sum of Squares

2. TRUE or FALSE: an influential observation must necessarily also be a leverage point.
(a) True
(b) False

3. In what situation would we not trust the coefficient of determination to properly assess the goodness of the simple linear model?
(a) When a curved pattern is visible in a scatterplot of the data
(b) When Normality of errors appears to be violated
(c) When we have a good leverage point
(d) All of the above

4. Which of the following statements is correct?
(a) In large samples, residuals may appear to be Normal, even when the population errors are not.
(b) If a bad leverage point is present, it is always a good idea to discard it from the data.
(c) Violations of model assumptions are less likely if high leverage points are present
(d) All of the above
(e) None of the above

5.
If the correlation between the predictor and response is 0, the predicted response from the least squares regression will be equal to...
(a) sample mean of the predictor
(b) sample mean of the response
(c) zero
(d) the average difference between the response and predictor.
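The situation in question 5 is easy to try out numerically. The Python sketch below uses a tiny made-up data set (not from the exam) whose sample correlation is exactly 0, fits the least squares line with the slope and intercept formulas from question 3(b), and compares the predictions with the sample mean of the response. The function name `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data with zero sample covariance (and hence zero correlation)
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 1, 2]

intercept, slope = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
predictions = [intercept + slope * x for x in xs]

print("slope:", slope)
print("predictions:", predictions)
print("mean of response:", y_bar)
```

With zero correlation the numerator of the slope is zero, so the fitted line is flat and the intercept formula $\bar{y} - \hat\beta_1\bar{x}$ reduces to $\bar{y}$.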
