Assignment 1
MATH3821 Statistical Modelling and Computing, UNSW Term 2, 2025

The total marks available for this assignment is 30. The assignment is due on Tuesday June 24 at 5pm and should be submitted through the link titled “Submission link - Assignment 1” on the subject’s Moodle page. Please combine your work into a single pdf document that includes all relevant R code.

Question 1 [3+3+3+2+2+3 = 16 Marks]

Use the command

    data <- read.csv("longevity_data.csv", header = TRUE)

to import the data longevity_data.csv into R (first download the data from the subject’s Moodle page and set the working directory of R to the folder where the data is stored, which can be done through the “Session” tab). The dataset includes the species name Species, its class (“Aves”, i.e., birds, or “Mammalia”, i.e., mammals) Class, its body mass in grams Body_mass_g, and its maximum recorded age in years Maximum_age_yrs:

    head(data)
    ##                       Species Class Body_mass_g Maximum_age_yrs
    ## 1         Red-shouldered hawk  Aves       658.0            22.4
    ## 2 Black-chested buzzard-eagle  Aves      2860.0            42.0
    ## 3                Common swift  Aves        44.9            21.1
    ## 4        Calliope hummingbird  Aves         3.0             7.0
    ## 5                  Brown kiwi  Aves      2380.0            35.0
    ## 6                Herring gull  Aves      1000.0            49.0

For example, from the output

    data[which(data$Species == "Human"),]
    ##     Species    Class Body_mass_g Maximum_age_yrs
    ## 152   Human Mammalia       70000           122.5

we see that humans are mammals, they have an average body mass of 70kg, and the maximum recorded age of a human is 122.5 years. Our goal is to determine how the maximum recorded age behaves as a function of body mass and class.
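The row-subsetting idiom above can be tried without the CSV file; the sketch below builds a small made-up data frame (rows copied from the outputs shown above, same column names as the assignment data) and selects a row by species name.

```r
# Toy stand-in for the assignment data (two rows taken from the printed output).
toy <- data.frame(
  Species = c("Human", "Brown kiwi"),
  Class = c("Mammalia", "Aves"),
  Body_mass_g = c(70000, 2380),
  Maximum_age_yrs = c(122.5, 35.0)
)

# which() returns the row indices where the condition is TRUE,
# and toy[rows, ] selects those rows while keeping all columns.
toy[which(toy$Species == "Human"), ]
```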
The figure below suggests that to fit a linear model we should apply a log transformation to both maximum age and body mass (which implies the relationship E(maximum age) = a × (body mass)^b):

    mammals <- which(data$Class == "Mammalia")
    plot(log(data$Body_mass_g[mammals]), log(data$Maximum_age_yrs[mammals]),
         pch = 16, col = 'blue', xlab = "log(Body Mass (g))",
         ylab = "log(Maximum age (yrs))")
    aves <- which(data$Class == "Aves")
    points(log(data$Body_mass_g[aves]), log(data$Maximum_age_yrs[aves]),
           pch = 17, col = 'red')

[Figure: scatter plot of log(Maximum age (yrs)) against log(Body Mass (g)), with mammals plotted in blue and birds in red.]

(a) Write down the expression for a linear model (as on Slide 2 of Lecture 3) with log(maximum age) as the response and log(body mass) and class as predictors. Clearly define all variables and list the assumptions of the model.

(b) Write down an expression for the log likelihood function for the model and use this expression to show that the maximum likelihood estimator for β is the same as the least squares estimator for β:

    argmin_β (y − Xβ)^T (y − Xβ).

(c) Fit the model you described in (a) using R’s lm() function. Do you think any of the predictors can be removed? State any relevant tests used.

(d) Use the model obtained in (c) to write down a formula for the expected value of maximum age as a function of body mass for both Aves and Mammalia. Explain what we can conclude about the maximum lifetimes of the Aves and Mammalia classes in simple terms (i.e., as you would to a non-mathematician).

(e) Biologists believe that the coefficient of log(Body_mass_g) is 1/4. Does the data support this hypothesis? Explain your reasoning.

(f) Make diagnostic plots for the model obtained in (c) using the plot() and lm() functions. Based on the plots, which of the linear model assumptions listed in (a) do you think have been violated? Explain your reasoning.
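As a sketch of the lm() mechanics used in this question, the block below fits a log–log model with a factor predictor on simulated data (the column names mirror the real dataset, but all values and coefficients here are made up for illustration; this is one way to write the formula, not necessarily the exact form expected in the lecture notes).

```r
# Simulated stand-in for longevity_data.csv; values are arbitrary.
set.seed(1)
n <- 100
sim <- data.frame(
  Class = factor(rep(c("Aves", "Mammalia"), each = n / 2)),
  Body_mass_g = exp(runif(n, 1, 12))
)
# Generate ages from a log-linear model (true slope 0.25, chosen arbitrarily).
sim$Maximum_age_yrs <- exp(1 + 0.25 * log(sim$Body_mass_g) +
                             0.5 * (sim$Class == "Mammalia") + rnorm(n, sd = 0.2))

# lm() applies the log transformations inside the formula, and encodes
# Class as a 0/1 dummy variable automatically because it is a factor.
fit <- lm(log(Maximum_age_yrs) ~ log(Body_mass_g) + Class, data = sim)
summary(fit)$coefficients
```

The coefficient table from summary() also contains the standard errors and t-statistics needed for hypothesis tests such as the one in (e).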
Question 2 [2+3+9 = 14 Marks]

The file Malacca_strait_data.csv contains observations of the sea temperature in the Malacca Strait between Malaysia and Indonesia (latitude 3.175, longitude 100.625) in degrees Celsius. The observations were recorded every seven days from January 1, 1982 to December 31, 2019. In this question we model sea_temperature as a function of day_of_year, where day_of_year corresponds to the day of the year at which the observation was made (i.e., 1 = January 1st, 2 = January 2nd, and so on). We assume that

    y_i = β_0 + Σ_{j=1}^{73} β_j (x_i − κ_j)_+ + ε_i,

where κ_1 = 0, κ_2 = 5, ..., κ_73 = 360, and (z)_+ denotes the maximum between a number z and 0. The code below creates a new data frame data.2 with sea_temperature as the first column and (x − κ_j)_+ for j = 1, ..., 73 as columns 2, ..., 74 (labelled x1, ..., x73). It also plots the function (x − κ_21)_+.

    data <- read.csv("Malacca_strait_data.csv", header = TRUE)
    data <- data[order(data[[1]]),] # sort the data by day_of_year
    data.2 <- data.frame(sea_temperature = data$sea_temperature)
    for(i in 1:73){
      x <- data$day_of_year - (i-1)*5
      x[which(x < 0)] = 0
      data.2[, paste("x", i, sep = "")] = x
    }
    plot(data$day_of_year, data.2$x21, type = "l", col = "blue", lwd = 2)

[Figure: the function (x − κ_21)_+ plotted against day_of_year; it is zero up to day 100 and increases linearly thereafter.]

Note that by choosing the predictors in this way, we are assuming y_i = f(x_i) + ε_i, where f is a continuous piecewise linear function whose gradient changes at x = κ_1, ..., κ_73. We can see this if we estimate f using R’s lm() function with all the predictors (x − κ_1)_+, ..., (x − κ_73)_+ and plot the resulting estimated regression function f̂:

    lm.2 <- lm(sea_temperature ~ ., data = data.2)
    plot(data$day_of_year, data$sea_temperature)
    lines(data$day_of_year, lm.2$fit, col = "blue", lwd = 5)

[Figure: scatter plot of sea_temperature against day_of_year, with the estimated regression function f̂ overlaid in blue.]

The roughness of the estimated regression function f̂ in the above figure suggests that the model is overfitting the data.
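As an aside, the truncated predictors (x − κ_j)_+ built in the loop above can also be constructed with R’s pmax(), which zeroes the negative values in one call; a brief sketch on a made-up vector of days (not part of the assignment code):

```r
# (x - kappa)_+ for a single knot, using pmax() instead of an explicit loop.
x <- c(0, 50, 100, 150)        # made-up day_of_year values
kappa <- 100
pmax(x - kappa, 0)             # 0 0 0 50

# The full 73-column basis, one column per knot kappa_1 = 0, ..., kappa_73 = 360.
knots <- seq(0, 360, by = 5)
basis <- sapply(knots, function(k) pmax(x - k, 0))
dim(basis)                     # 4 73
```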
In this question we consider three general approaches to avoid over- (or under-) fitting the data:

- Restriction (part a): We impose a restrictive form for f by considering only a few predictors.
- Selection (part b): We consider all predictors but remove any that don’t contribute significantly to the fit of the model.
- Regularisation (part c): We retain all predictors but restrict the values that β can take through the introduction of a “smoothness parameter”, λ.

(a) [Restriction] Fit the model using only the predictors x6, x23, x55, and x70. Similar to the figure above, make a scatter plot of sea_temperature and day_of_year that also displays the estimated regression function f̂. Briefly comment on any key difference between this regression function and the regression function that was estimated with all the predictors (displayed in the figure above).

(b) [Selection] Use the AIC criterion to do backwards stepwise selection for the predictors as described in Lecture 3 (i.e., delete one predictor at a time using the AIC criterion). Due to the large number of predictors we will use the stepAIC function, which automates this process. Type the commands library(MASS) and help(stepAIC) into R to see how it works (I recommend setting trace = FALSE to avoid unnecessary output). Similar to (a), make a scatter plot of sea_temperature and day_of_year that also displays the resulting estimated regression function f̂.

(c) [Regularisation] For λ ≥ 0, we now consider the estimator b* that minimises the penalised least squares:

    Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{73} β_j (x_i − κ_j)_+ )^2 + λ Σ_{j=1}^{73} β_j^2.

Note that in matrix notation the penalised least squares can be expressed as

    (y − Xβ)^T (y − Xβ) + λ β^T D β,

where D is the diagonal matrix with diagonal entries (0, 1, 1, ..., 1), so that the intercept β_0 is not penalised:

    D = [ 0 0 0 ... 0
          0 1 0 ... 0
          0 0 1 ... 0
          ...
          0 0 0 ... 1 ].

1. Show that the penalised least squares regression estimator is given by

    b* = (X^T X + λD)^{-1} X^T y.

2. Use your answer to (c)1.
to compute b* for different values of λ, and then use these values of b* to make a scatter plot which displays the estimated regression function f̂ for different values of λ. Describe how the appearance of the estimated regression function f̂ changes with λ. (To compute b* you may choose to use the commands X<-as.matrix(cbind(rep(1,1983),data.2[,-1])), Y<-data$sea_temperature, and D=diag(c(0,rep(1,73))).)

3. We generally choose λ by minimising a predictive criterion like the predictive sum of squares (PRESS — Lecture 3, Slide 28), Σ_{i=1}^{n} (y_i − ŷ_{i,−i})^2. The code below computes the PRESS for a given value of λ (it takes a few seconds to run). Which value of λ (approximately) minimises the PRESS?

    X <- as.matrix(cbind(rep(1,1983), data.2[,-1]))
    Y <- data$sea_temperature
    D = diag(c(0, rep(1,73)))
    lambda = 5000
    PRESS <- 0
    for(i in 1:length(Y)){
      Y.i <- Y[-i]
      X.i <- X[-i,]
      beta.i <- solve(t(X.i)%*%X.i + lambda*D)%*%t(X.i)%*%Y.i
      PRESS <- PRESS + (Y[i] - X[i,]%*%beta.i)^2
    }
    PRESS

4. What value of λ minimises the residual sum of squares Σ_{i=1}^{n} (y_i − ŷ_i)^2? Use your answer to explain why we shouldn’t choose λ by minimising the residual sum of squares.
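Both the closed-form estimator in (c)1. and the PRESS loop above can be checked numerically. The sketch below uses simulated data (all dimensions and values are made up for illustration): it verifies that λ = 0 recovers ordinary least squares, and that the leave-one-out residuals satisfy (y_i − ŷ_{i,−i}) = (y_i − ŷ_i)/(1 − h_ii) with H = X(XᵀX + λD)⁻¹Xᵀ, an exact identity (via the Sherman–Morrison formula) that computes the PRESS without refitting the model n times. This shortcut is not required by the assignment, just a useful check.

```r
set.seed(2)
n <- 40; p <- 4                              # small, arbitrary dimensions
X <- cbind(1, matrix(rnorm(n * p), n, p))    # design matrix with intercept column
y <- drop(X %*% rnorm(p + 1) + rnorm(n))     # simulated response
D <- diag(c(0, rep(1, p)))                   # intercept is not penalised

# b*(lambda) = (X'X + lambda D)^{-1} X'y
b.star <- function(lambda) solve(t(X) %*% X + lambda * D, t(X) %*% y)

# At lambda = 0, b* is the ordinary least squares estimator.
max(abs(b.star(0) - coef(lm(y ~ X - 1))))    # numerically zero

# Brute-force leave-one-out PRESS, as in the assignment's loop.
lambda <- 10
PRESS.loop <- 0
for (i in 1:n) {
  b.i <- solve(t(X[-i, ]) %*% X[-i, ] + lambda * D, t(X[-i, ]) %*% y[-i])
  PRESS.loop <- PRESS.loop + (y[i] - X[i, ] %*% b.i)^2
}

# Shortcut: leave-one-out residual = (y_i - yhat_i) / (1 - h_ii),
# where h_ii are the diagonal entries of H = X (X'X + lambda D)^{-1} X'.
H <- X %*% solve(t(X) %*% X + lambda * D, t(X))
PRESS.short <- sum(((y - H %*% y) / (1 - diag(H)))^2)
all.equal(drop(PRESS.loop), PRESS.short)     # TRUE
```

The same shortcut applies to the assignment's 1983 × 74 design matrix, turning the leave-one-out loop into a single matrix computation per value of λ.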