Assignment 1
MATH3821 Statistical Modelling and Computing, UNSW Term 2, 2025

The total marks available for this assignment is 30. The assignment is due on Tuesday June 24 at 5pm and should be submitted through the link titled “Submission link - Assignment 1” on the subject’s Moodle page. Please combine your work into a single pdf document that includes all relevant R code.

Question 1 [3+3+3+2+2+3 = 16 Marks]

Use the command

    data <- read.csv("longevity_data.csv", header = TRUE)

to import the data longevity_data.csv into R (first download the data from the subject’s Moodle page and set the working directory of R to the folder where the data is stored, which can be done through the “Session” tab). The dataset includes the species name Species, its class (“Aves”, i.e., birds, or “Mammalia”, i.e., mammals) Class, its body mass in grams Body_mass_g, and its maximum recorded age in years Maximum_age_yrs:

    head(data)
    ##                       Species Class Body_mass_g Maximum_age_yrs
    ## 1         Red-shouldered hawk  Aves       658.0            22.4
    ## 2 Black-chested buzzard-eagle  Aves      2860.0            42.0
    ## 3                Common swift  Aves        44.9            21.1
    ## 4        Calliope hummingbird  Aves         3.0             7.0
    ## 5                  Brown kiwi  Aves      2380.0            35.0
    ## 6                Herring gull  Aves      1000.0            49.0

For example, from the output

    data[which(data$Species == "Human"),]
    ##     Species    Class Body_mass_g Maximum_age_yrs
    ## 152   Human Mammalia       70000           122.5

we see that humans are mammals, they have an average body mass of 70kg, and the maximum recorded age of a human is 122.5 years. Our goal is to determine how the maximum recorded age behaves as a function of body mass and class.
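The row-subsetting idiom above can be tried without the CSV file; the sketch below builds a small made-up data frame (rows copied from the outputs shown above, same column names as the assignment data) and selects a row by species name.

```r
# Toy stand-in for the assignment data (two rows taken from the printed output).
toy <- data.frame(
  Species = c("Human", "Brown kiwi"),
  Class = c("Mammalia", "Aves"),
  Body_mass_g = c(70000, 2380),
  Maximum_age_yrs = c(122.5, 35.0)
)

# which() returns the row indices where the condition is TRUE,
# and toy[rows, ] selects those rows while keeping all columns.
toy[which(toy$Species == "Human"), ]
```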
The figure below suggests that to fit a linear model we should apply a log transformation to both maximum age and body mass (which implies the relationship E(maximum age) = a × (body mass)^b):

    mammals <- which(data$Class == "Mammalia")
    plot(log(data$Body_mass_g[mammals]), log(data$Maximum_age_yrs[mammals]),
         pch = 16, col = 'blue', xlab = "log(Body Mass (g))",
         ylab = "log(Maximum age (yrs))")
    aves <- which(data$Class == "Aves")
    points(log(data$Body_mass_g[aves]), log(data$Maximum_age_yrs[aves]),
           pch = 17, col = 'red')

[Figure: scatter plot of log(Maximum age (yrs)) against log(Body Mass (g)), with mammals plotted in blue and birds in red.]

(a) Write down the expression for a linear model (as on Slide 2 of Lecture 3) with log(maximum age) as the response and log(body mass) and class as predictors. Clearly define all variables and list the assumptions of the model.

(b) Write down an expression for the log likelihood function for the model and use this expression to show that the maximum likelihood estimator for β is the same as the least squares estimator for β:

    argmin_β (y − Xβ)^T (y − Xβ).

(c) Fit the model you described in (a) using R’s lm() function. Do you think any of the predictors can be removed? State any relevant tests used.

(d) Use the model obtained in (c) to write down a formula for the expected value of maximum age as a function of body mass for both Aves and Mammalia. Explain what we can conclude about the maximum lifetimes of the Aves and Mammalia classes in simple terms (i.e., as you would to a non-mathematician).

(e) Biologists believe that the coefficient of log(Body_mass_g) is 1/4. Does the data support this hypothesis? Explain your reasoning.

(f) Make diagnostic plots for the model obtained in (c) using the plot() and lm() functions. Based on the plots, which of the linear model assumptions listed in (a) do you think have been violated? Explain your reasoning.
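As a sketch of the lm() mechanics used in this question, the block below fits a log–log model with a factor predictor on simulated data (the column names mirror the real dataset, but all values and coefficients here are made up for illustration; this is one way to write the formula, not necessarily the exact form expected in the lecture notes).

```r
# Simulated stand-in for longevity_data.csv; values are arbitrary.
set.seed(1)
n <- 100
sim <- data.frame(
  Class = factor(rep(c("Aves", "Mammalia"), each = n / 2)),
  Body_mass_g = exp(runif(n, 1, 12))
)
# Generate ages from a log-linear model (true slope 0.25, chosen arbitrarily).
sim$Maximum_age_yrs <- exp(1 + 0.25 * log(sim$Body_mass_g) +
                             0.5 * (sim$Class == "Mammalia") + rnorm(n, sd = 0.2))

# lm() applies the log transformations inside the formula, and encodes
# Class as a 0/1 dummy variable automatically because it is a factor.
fit <- lm(log(Maximum_age_yrs) ~ log(Body_mass_g) + Class, data = sim)
summary(fit)$coefficients
```

The coefficient table from summary() also contains the standard errors and t-statistics needed for hypothesis tests such as the one in (e).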
Question 2 [2+3+9 = 14 Marks]

The file Malacca_strait_data.csv contains observations of the sea temperature in the Malacca Strait between Malaysia and Indonesia (latitude 3.175, longitude 100.625) in degrees Celsius. The observations were recorded every seven days from January 1, 1982 to December 31, 2019. In this question we model sea_temperature as a function of day_of_year, where day_of_year corresponds to the day of the year at which the observation was made (i.e., 1 = January 1st, 2 = January 2nd, and so on). We assume that

    y_i = β_0 + Σ_{j=1}^{73} β_j (x_i − κ_j)_+ + ε_i,

where κ_1 = 0, κ_2 = 5, ..., κ_73 = 360, and (z)_+ denotes the maximum between a number z and 0. The code below creates a new data frame data.2 with sea_temperature as the first column and (x − κ_j)_+ for j = 1, ..., 73 as columns 2, ..., 74 (labelled x1, ..., x73). It also plots the function (x − κ_21)_+.

    data <- read.csv("Malacca_strait_data.csv", header = TRUE)
    data <- data[order(data[[1]]),] # sort the data by day_of_year
    data.2 <- data.frame(sea_temperature = data$sea_temperature)
    for(i in 1:73){
      x <- data$day_of_year - (i-1)*5
      x[which(x < 0)] = 0
      data.2[, paste("x", i, sep = "")] = x
    }
    plot(data$day_of_year, data.2$x21, type = "l", col = "blue", lwd = 2)

[Figure: the function (x − κ_21)_+ plotted against day_of_year; it is zero up to day 100 and increases linearly thereafter.]

Note that by choosing the predictors in this way, we are assuming y_i = f(x_i) + ε_i, where f is a continuous piecewise linear function whose gradient changes at x = κ_1, ..., κ_73. We can see this if we estimate f using R’s lm() function with all the predictors (x − κ_1)_+, ..., (x − κ_73)_+ and plot the resulting estimated regression function f̂:

    lm.2 <- lm(sea_temperature ~ ., data = data.2)
    plot(data$day_of_year, data$sea_temperature)
    lines(data$day_of_year, lm.2$fit, col = "blue", lwd = 5)

[Figure: scatter plot of sea_temperature against day_of_year, with the estimated regression function f̂ overlaid in blue.]

The roughness of the estimated regression function f̂ in the above figure suggests that the model is overfitting the data.
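As an aside, the truncated predictors (x − κ_j)_+ built in the loop above can also be constructed with R’s pmax(), which zeroes the negative values in one call; a brief sketch on a made-up vector of days (not part of the assignment code):

```r
# (x - kappa)_+ for a single knot, using pmax() instead of an explicit loop.
x <- c(0, 50, 100, 150)        # made-up day_of_year values
kappa <- 100
pmax(x - kappa, 0)             # 0 0 0 50

# The full 73-column basis, one column per knot kappa_1 = 0, ..., kappa_73 = 360.
knots <- seq(0, 360, by = 5)
basis <- sapply(knots, function(k) pmax(x - k, 0))
dim(basis)                     # 4 73
```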
In this question we consider three general approaches to avoid over- (or under-) fitting the data:

- Restriction (part a): We impose a restrictive form for f by considering only a few predictors.
- Selection (part b): We consider all predictors but remove any that don’t contribute significantly to the fit of the model.
- Regularisation (part c): We retain all predictors but restrict the values that β can take through the introduction of a “smoothness parameter”, λ.

(a) [Restriction] Fit the model using only the predictors x6, x23, x55, and x70. Similar to the figure above, make a scatter plot of sea_temperature and day_of_year that also displays the estimated regression function f̂. Briefly comment on any key difference between this regression function and the regression function that was estimated with all the predictors (displayed in the figure above).

(b) [Selection] Use the AIC criterion to do backwards stepwise selection for the predictors as described in Lecture 3 (i.e., delete one predictor at a time using the AIC criterion). Due to the large number of predictors we will use the stepAIC function, which automates this process. Type the commands library(MASS) and help(stepAIC) into R to see how it works (I recommend setting trace = FALSE to avoid unnecessary output). Similar to (a), make a scatter plot of sea_temperature and day_of_year that also displays the resulting estimated regression function f̂.

(c) [Regularisation] For λ ≥ 0, we now consider the estimator b* that minimises the penalised least squares:

    Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{73} β_j (x_i − κ_j)_+ )^2 + λ Σ_{j=1}^{73} β_j^2.

Note that in matrix notation the penalised least squares can be expressed as

    (y − Xβ)^T (y − Xβ) + λ β^T D β,

where D is the diagonal matrix with diagonal entries (0, 1, 1, ..., 1), so that the intercept β_0 is not penalised:

    D = [ 0 0 0 ... 0
          0 1 0 ... 0
          0 0 1 ... 0
          ...
          0 0 0 ... 1 ].

1. Show that the penalised least squares regression estimator is given by

    b* = (X^T X + λD)^{-1} X^T y.

2. Use your answer to (c)1.
to compute b* for different values of λ, and then use these values of b* to make a scatter plot which displays the estimated regression function f̂ for different values of λ. Describe how the appearance of the estimated regression function f̂ changes with λ. (To compute b* you may choose to use the commands X<-as.matrix(cbind(rep(1,1983),data.2[,-1])), Y<-data$sea_temperature, and D=diag(c(0,rep(1,73))).)

3. We generally choose λ by minimising a predictive criterion like the predictive sum of squares (PRESS — Lecture 3, Slide 28), Σ_{i=1}^{n} (y_i − ŷ_{i,−i})^2. The code below computes the PRESS for a given value of λ (it takes a few seconds to run). Which value of λ (approximately) minimises the PRESS?

    X <- as.matrix(cbind(rep(1,1983), data.2[,-1]))
    Y <- data$sea_temperature
    D = diag(c(0, rep(1,73)))
    lambda = 5000
    PRESS <- 0
    for(i in 1:length(Y)){
      Y.i <- Y[-i]
      X.i <- X[-i,]
      beta.i <- solve(t(X.i)%*%X.i + lambda*D)%*%t(X.i)%*%Y.i
      PRESS <- PRESS + (Y[i] - X[i,]%*%beta.i)^2
    }
    PRESS

4. What value of λ minimises the residual sum of squares Σ_{i=1}^{n} (y_i − ŷ_i)^2? Use your answer to explain why we shouldn’t choose λ by minimising the residual sum of squares.
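Both the closed-form estimator in (c)1. and the PRESS loop above can be checked numerically. The sketch below uses simulated data (all dimensions and values are made up for illustration): it verifies that λ = 0 recovers ordinary least squares, and that the leave-one-out residuals satisfy (y_i − ŷ_{i,−i}) = (y_i − ŷ_i)/(1 − h_ii) with H = X(XᵀX + λD)⁻¹Xᵀ, an exact identity (via the Sherman–Morrison formula) that computes the PRESS without refitting the model n times. This shortcut is not required by the assignment, just a useful check.

```r
set.seed(2)
n <- 40; p <- 4                              # small, arbitrary dimensions
X <- cbind(1, matrix(rnorm(n * p), n, p))    # design matrix with intercept column
y <- drop(X %*% rnorm(p + 1) + rnorm(n))     # simulated response
D <- diag(c(0, rep(1, p)))                   # intercept is not penalised

# b*(lambda) = (X'X + lambda D)^{-1} X'y
b.star <- function(lambda) solve(t(X) %*% X + lambda * D, t(X) %*% y)

# At lambda = 0, b* is the ordinary least squares estimator.
max(abs(b.star(0) - coef(lm(y ~ X - 1))))    # numerically zero

# Brute-force leave-one-out PRESS, as in the assignment's loop.
lambda <- 10
PRESS.loop <- 0
for (i in 1:n) {
  b.i <- solve(t(X[-i, ]) %*% X[-i, ] + lambda * D, t(X[-i, ]) %*% y[-i])
  PRESS.loop <- PRESS.loop + (y[i] - X[i, ] %*% b.i)^2
}

# Shortcut: leave-one-out residual = (y_i - yhat_i) / (1 - h_ii),
# where h_ii are the diagonal entries of H = X (X'X + lambda D)^{-1} X'.
H <- X %*% solve(t(X) %*% X + lambda * D, t(X))
PRESS.short <- sum(((y - H %*% y) / (1 - diag(H)))^2)
all.equal(drop(PRESS.loop), PRESS.short)     # TRUE
```

The same shortcut applies to the assignment's 1983 × 74 design matrix, turning the leave-one-out loop into a single matrix computation per value of λ.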