STATS5020 - Introduction to R programming
Assignment 3
Task 1 - version 1 [trees - v1; a3t1] [10 marks in total]
1.[1 mark] The dataset trees.txt provides measurements of the diameter (in inches), height (in ft) and
volume (in cubic ft) of timber in 31 felled black cherry trees. Use R to read in the file trees.txt correctly
and save it as a data frame called trees.
2. [3 marks] We can get a rough estimate of a tree’s age without cutting it down and counting its rings.
Such an estimate can be computed as the tree’s diameter (in centimetres) divided by its growth rate. Create a
function called tree.age that receives the arguments diameter (diameter of the tree, in inches or centimetres),
rate.of.growth (the rate of growth of the tree), and cm (logical, set by default to FALSE to indicate that
diameter is measured in inches). If cm is FALSE, then your functions should update diameter to its value in
centimetres. Your function should return the estimated age.
3. [2 marks] What is the average age of the cherry trees? To answer this question, use your function
tree.age to estimate the age of the 31 cherry trees with a rate of growth of 2.5 and report the average age.
4. [4 marks] We are interested in checking if a linear relationship between the diameter of cherry trees and
the timber produced can be assumed and whether the height affects this relationship. To this end, we can
categorise the variable Height and plot Diameter versus Volume using different colours for the categories of
Height. Create a new column in the trees data frame called HeightGroup, which takes the value short if
the height is less or equal than the median height, and tall if the height is greater than the median height.
You can use the built-in function median to calculate the median height. Plot Diameter versus Volume using
two different colours for the two categories of HeightGroup. Your plot should look like the one below (hint:
you can use the argument title in the legend function to put a title to the legend).
## (a)
trees = read.table('trees.txt', header = T)
## (b)
tree.age = function(diameter, rate.of.growth, cm = FALSE){
if(!cm)
diameter = diameter*2.54
diameter/rate.of.growth
}
## (c)
mean(tree.age(trees$Diameter, 2.5))
## [1] 13.46036
## (d)
meH = median(trees$Height)
trees = transform(trees, heightGroup = cut(Height,
breaks = c(-Inf, meH, Inf),
labels = c('short', 'tall')))
plot(trees$Diameter, trees$Volume, col = unclass(trees$heightGroup),
xlab = 'Diameter', ylab = 'Volume')
legend('topleft', col = 1:2, pch = rep(1,2), c('Short', 'Tall'), title = 'Height')
1
8 10 12 14 16 18 20
10
20
30
40
50
60
70
Diameter
Vo
lu
m
e
Height
Short
Tall
Task 1 - version 2 [trees - v2; a3t1] [10 marks in total]
1. [1 mark] The dataset trees.txt provides measurements of the diameter (in inches), height (in ft) and
volume (in cubic ft) of timber in 31 felled black cherry trees. Use R to read in the file trees.txt correctly
and save it as a data frame called trees.
2. [3 marks] We can get a rough estimate of a tree’s age without cutting it down and counting its rings.
Such an estimate can be computed as the tree’s diameter (in centimetres) divided by its growth rate. Create a
function called tree.age that receives the arguments diameter (diameter of the tree, in inches or centimetres),
rate.of.growth (the rate of growth of the tree), and cm (logical, set by default to FALSE to indicate that
diameter is measured in inches). If cm is FALSE, then your functions should update diameter to its value in
centimetres. Your function should return the estimated age.
3. [2 marks] What is the average age of the cherry trees? To answer this question, use your function
tree.age to estimate the age of the 31 cherry trees with a rate of growth of 1.88 and report the average age.
4. [4 marks] We are interested in checking if a linear relationship between the height of cherry trees and
the timber produced can be assumed and whether the diameter affects this relationship. To this end, we can
categorise the variable Diameter and plot Height versus Volume using different colours for the categories
of Diameter. Create a new column in the trees data frame called DiameterGroup, which takes the value
small if the diameter is less or equal than the median diameter, and big if the diameter is greater than the
median diameter. You can use the built-in function median to calculate the median diameter. Plot Height
versus Volume using two different colours for the two categories of DiameterGroup. Your plot should look like
the one below (hint: you can use the argument title in the legend function to put a title to the legend).
## (a)
trees = read.table('trees.txt', header = T)
## (b)
tree.age = function(diameter, rate.of.growth, cm = FALSE){
if(!cm)
diameter = diameter*2.54
diameter/rate.of.growth
2
}## (c)
mean(tree.age(trees$Diameter, 1.88))
## [1] 17.89942
## (d)
meD = median(trees$Diameter)
trees = transform(trees, DiameterGroup = cut(Diameter,
breaks = c(-Inf, meD, Inf),
labels = c('small', 'big')))
plot(trees$Height, trees$Volume, col = unclass(trees$DiameterGroup),
xlab = 'Height', ylab = 'Volume')
legend('topleft', col = 1:2, pch = rep(1,2), c('Small', 'Big'), title = 'Diameter')
65 70 75 80 85
10
20
30
40
50
60
70
Height
Vo
lu
m
e
Diameter
Small
Big
Task 2 - version 1 [maxtemp - v1; a3t2] [10 marks in total]
1. [2 marks] The dataset tmax in the tmax1993.Rdata file is a list of length 133 providing daily maximum
temperature recorded at 133 stations. Each element of the list (i.e., each station) contains a data frame with
the following information:
Variable Description
year year (1993)
month month (May to September)
day day of the month
z maximum temperature
lat,long latitude and longitude of the station
The list is named according to stations’ IDs. Use R to read in the file tmax1993.Rdata correctly.
2. [3 marks] Update the tmax list by removing the following columns for every station: julian, id, proc
3
and date. The updated tmax list should contain, for each station, a data frame with 6 columns (year, month,
day, z, lat, lon).
3. [3 marks] Define df to be the data frame containing the 6 columns (year, month, day, z, lat and lon)
for the 1st of May of 1993 and all the 133 stations (therefore, the data frame df should have 133 rows and 6
columns). Note that the columns year, month, and day will have exactly the same entries (as they all refer
to the same date, 1st of May of 1993).
4. [2 marks] Plot all the locations in df using the maximum temperature (z) as a colour scale. Your plots
should look like the one below.
## (a)
load('tmax1993.Rdata')
## (b)
# Solution 1: using for()
for(i in 1:length(tmax)){
tmp = tmax[[i]]
tmax[[i]] = tmp[, c("year", "month", "day", "z", "lat", "lon")]
}
# Solution 2: using apply()
rm(tmax) # to remove the object created using Solution 1
load('tmax1993.Rdata') # load data again
tmax = lapply(tmax, function(x) x[, c("year", "month", "day", "z", "lat", "lon")])
## (c)
# Solution 1: using for()
df = NULL
for(i in 1:length(tmax)){
tmp = tmax[[i]][tmax[[i]]$month == 5 & tmax[[i]]$day == 1, ]
df = rbind(df, tmp)
}
# Solution 2: using apply()
tmp = lapply(tmax, function(x) x[x$month == 5 & x$day == 1, ])
df = data.frame(matrix(unlist(tmp), nrow = length(tmp), byrow = TRUE))
names(df) = names(tmp[[1]])
# (d)
plot(df$lon, df$lat, col = df$z, pch = 16, xlab = 'Longitude',
ylab = 'Latitude', main = '1st of May, 1993')
4
−100 −95 −90 −85 −80
32
34
36
38
40
42
44
46
1st of May, 1993
Longitude
La
tit
ud
e
Task 2 - version 2 [maxtemp - v2; a3t2] [10 marks in total]
1. [2 marks] The dataset tmax in the tmax1990.Rdata file is a list of length 137 providing daily maximum
temperature recorded at 137 stations. Each element of the list (i.e., each station) contains a data frame with
the following information:
Variable Description
year year (1990)
month month (October to February)
day day of the month
z maximum temperature
lat,long latitude and longitude of the station
The list is named according to stations’ IDs. Use R to read in the file tmax1990.Rdata correctly.
2. [3 marks] Update the tmax list by removing the following columns for every station: julian, id, proc
and date. The updated tmax list should contain, for each station, a data frame with 6 columns (year, month,
day, z, lat, lon).
3. [3 marks] Define df to be the data frame containing the 6 columns (year, month, day, z, lat and lon)
for the 30th of November of 1990 and all the available stations. Note that the columns year, month, and day
will have exactly the same entries (as they all refer to the same date, 30th of November of 1990).
4. [2 marks] Plot all the locations in df using the maximum temperature (z) as a colour scale. Your plot
should look like the one below.
## (a)
load('tmax1990.Rdata')
## (b)
5
# Solution 1: using for()
for(i in 1:length(tmax)){
tmp = tmax[[i]]
tmax[[i]] = tmp[, c("year", "month", "day", "z", "lat", "lon")]
}
# Solution 2: using apply()
rm(tmax) # to remove the object created using Solution 1
load('tmax1990.Rdata') # load data again
tmax = lapply(tmax, function(x) x[, c("year", "month", "day", "z", "lat", "lon")])
## (c)
# Solution 1: using for()
df = NULL
for(i in 1:length(tmax)){
tmp = tmax[[i]][tmax[[i]]$month == 11 & tmax[[i]]$day == 30, ]
df = rbind(df, tmp)
}
# Solution 2: using apply()
tmp = lapply(tmax, function(x) x[x$month == 11 & x$day == 30, ])
df = data.frame(matrix(unlist(tmp), ncol = 6, byrow = TRUE))
names(df) = names(tmp[[1]])
# (d)
plot(df$lon, df$lat, col = df$z, pch = 16, xlab = 'Longitude',
ylab = 'Latitude', main = '30th of November, 1990')
−100 −95 −90 −85 −80
32
34
36
38
40
42
44
46
30th of November, 1990
Longitude
La
tit
ud
e
6
Task 3 - version 1 [simulation - v1; a3t3] [10 marks in total]
In this task, we will simulate data from a model with known parameters and then, assuming we do not know
the true value of the parameters, try to estimate them.
1. [3 marks] [Simulating data] Define x = (x1, . . . , x500) to be a sorted vector (in ascending order) with 500
randomly generated numbers from a Uniform distribution in the interval (1,6). Define y to be a vector with
500 randomly generated observations from a Normal distribution with mean 2 + 5x and variance 25, i.e., y
= (y1, . . . , y500) with
yi
ind∼ N (2 + 5xi, 25), i = 1, . . . , 500.
2. [3 marks] [Log-likelihood] Assume now that we do not know the relationship between x and y, and we
want to estimate it. To this end, we assume a Normal regression model with a known variance of the form
E(yi) = β0 + β1xi + i, i
ind∼ N (0, 25), i = 1, . . . , 500. (1)
(Clearly, we know that β0 = 2 and β1 = 5 because we generated the data in this form! But, for the sake of
the exercise, we will forget that we know. . . )
The log-likelihood for the model in (1) is
`(β) =
500∑
i=1
log{f(yi;β0 + β1xi, 25)},
where f(yi;β0 + β1xi, 25) is the Normal density evaluated at yi with mean β0 + β1xi and variance 25, i.e.,
f(yi;β0 + β1xi, 25) =
1√
2pi · 52 exp
{
− [yi − (β0 + β1xi)]
2
2 · 52
}
, i = 1, . . . , 500.
Write a function nlog.lik which takes the parameter vector β = (β0, β1) as first argument, the observations
y as second argument, and the covariate x as thrid argument. The function nlog.lik should returns the
negative log-likelihood −`(β).
3. [2 marks] [Optimisation] Use the function optim to minimise the negative log-likelihood function,nlog.lik,
over the parameter vector β = (β0, β1) using the covariate x and observations y you generated in parts 1 and
2. Save the estimated values of β0 and β1 in a vector called beta.hat.
4. [2 marks] Plot the data (x,y). Add the true regression line yi = 2 + 5xi, i = 1, . . . , 500 in continuous red
and the estimated regression line yi = βˆ0 + βˆ1xi, i = 1, . . . , 500 in dashed green, where βˆ0 = beta.hat[1]
and βˆ1 = beta.hat[2]. Include a legend for the true and estimated regression lines. Your plot should look
similar to the one below (remember that, since this exercise is based on simulated data, the point and lines
you produce will differ from the ones shown below).
## 1
set.seed(16032021)
n = 500
x = sort(runif(n, 1, 6))
sd = 5
y = rnorm(n, mean = 2 + 5*x, sd = sd)
## 2
nlog.lik = function(beta, x, y){
beta0 = beta[1]
beta1 = beta[2]
-sum(dnorm(y, mean = beta0 + beta1*x, sd = sd, log = T))
}
7
## 3
opt.beta = optim(par = c(0.2, 4), fn = nlog.lik, x = x, y = y)
beta.hat = opt.beta$par
beta.hat
## [1] 2.238212 4.946959
## 4
plot(x,y, pch = 16)
lines(x, 0.2 + 5*x, col = 2, lwd = 3)
lines(x, beta.hat[1] + beta.hat[2]*x, col = 3, lty = 2, lwd = 3)
legend('topleft', c('True line', 'Estimated line'), col = c(2,3),
lty = c(1,2), lwd = c(3,3))
1 2 3 4 5 6
0
10
20
30
40
x
y
True line
Estimated line
Task 3 - version 2 [simulation - v2; a3t3] [10 marks in total]
In this task, we will simulate data from a model with known parameters and then, assuming we do not know
the true value of the parameters, try to estimate them.
1. [3 marks] [Simulating data] Define x = (x1, . . . , x500) to be a sorted vector (in ascending order) with 500
randomly generated numbers from a Uniform distribution in the interval (-5,-1). Define y to be a vector with
500 randomly generated observations from a Normal distribution with mean 4.5− 2x and variance 36, i.e., y
= (y1, . . . , y500) with
yi
ind∼ N (4.5− 2xi, 36), i = 1, . . . , 500.
2. [3 marks] [Log-likelihood] Assume now that we do not know the relationship between x and y, and we
want to estimate it. To this end, we assume a Normal regression model with known variance of the form
E(yi) = β0 + β1xi + i, i
ind∼ N (0, 36), i = 1, . . . , 500. (2)
(Clearly, we know that β0 = 4.5 and β1 = −2 because we generated the data in this form! But, for the sake
of the exercise, we will forget that we know. . . )
8
The log-likelihood for the model in (2) is
`(β) =
500∑
i=1
log{f(yi;β0 + β1xi, 36)},
where f(yi;β0 + β1xi, 36) is the Normal density evaluated at yi with mean β0 + β1xi and variance 36, i.e.,
f(yi;β0 + β1xi, 36) =
1√
2pi · 62 exp
{
− [yi − (β0 + β1xi)]
2
2 · 62
}
, i = 1, . . . , 500.
Write a function nlog.lik which takes the parameter vector β = (β0, β1) as first argument, the observations
y as second argument, and the covariate x as thrid argument. The function nlog.lik should returns the
negative log-likelihood −`(β).
3. [2 marks] [Optimisation] Use the function optim to minimise the negative log-likelihood function,
nlog.lik, over the parameter vector β = (β0, β1) using the covariate x and observations y you generated in
parts 1 and 2. Save the estimated values of β0 and β1 in a vector called beta.hat.
4. [2 marks] Plot the data (x,y). Add the true regression line yi = 4.5− 2xi, i = 1, . . . , 500 in continuous
red and the estimated regression line yi = βˆ0+ βˆ1xi, i = 1, . . . , 500 in dashed green, where βˆ0 = beta.hat[1]
and βˆ1 = beta.hat[2]. Include a legend for the true and estimated regression lines. Your plot should look
similar to the one below (remember that, since this exercise is based on simulated data, the point and lines
you produce will differ from the ones shown below).
## 1
set.seed(16032021)
n = 500
x = sort(runif(n, -5, -1))
sd = 6
y = rnorm(n, mean = 4.5 - 2*x, sd = sd)
## 2
nlog.lik = function(beta, x, y){
beta0 = beta[1]
beta1 = beta[2]
-sum(dnorm(y, mean = beta0 + beta1*x, sd = sd, log = T))
}
## 3
opt.beta = optim(par = c(4, -1), fn = nlog.lik, x = x, y = y)
beta.hat = opt.beta$par
beta.hat
## [1] 4.326759 -2.079172
## 4
plot(x,y, pch = 16)
lines(x, 4.5 - 2*x, col = 2, lwd = 3)
lines(x, beta.hat[1] + beta.hat[2]*x, col = 3, lty = 2, lwd = 3)
legend('topright', c('True line', 'Estimated line'), col = c(2,3),
lty = c(1,2), lwd = c(3,3))
9
−5 −4 −3 −2 −1
0
10
20
30
x
y
True line
Estimated line
Task 4 [Monte Carlo; a3t4] [10 marks in total]
Let g be a function and U ∼ Unif(0, 1), i.e., U is a Uniform (0,1) variable. The Monte Carlo method says
that we can approximate
θ = E{g(U)} =
∫ 1
0
g(x)dx,
by generating n Uniform numbers in (0,1), U1, . . . , Un ∼ Unif(0, 1), and computing
θˆ = 1
n
n∑
i=1
g(Ui).
1. [3 marks] Let U ∼ Unif(0, 1) and let θ be defined as
θ = Cov(U, eU ) = E(UeU )− E(U)E(eU ). (3)
Use Monte Carlo to approximate (3). To this end, create a function in R called cov.mc that takes n as
argument and does the following:
• Step 1: Checks that n is at least 1. If not, return an error message.
• Step 2: Defines u to be a vector of length n with n Uniform (0,1) random numbers. Set e = exp(u).
• Step 3: Defines UE to be the average of u·e, U to be the average of u and E to be the average of e.
• Step 4: Returns UE-U·E.
2. [2 marks] Evaluate the function cov.mc for n = 5000. Save the result in an object called theta.hat.
The exact value of θ is 3/2− e/2. Report the absolute difference between theta.hat and the exact value of θ.
3. [3 marks] Let Ui ∼ Unif(0, 1), i ≥ 1, i.e., random variables with Uniform distribution in (0,1). Define
the random variable N as
N = max
{
n :
n∏
i=1
Ui ≥ e−3
}
.
The following algorithm generates one sample from N :
10
• Step 1: set n = 0 and v = 1. The object v is an auxiliary variable.
• Step 2: Generate u ∼ Unif(0, 1).
• Step 3: Compute f = v · u. I advise you to use the function prod in R to do this.
• Step 4: If f ≥ e−3, set v = f and n = n+ 1.
• Step 5: Repeat Steps 2-4 until f < e−3. The final value of n is one sample from the variable N .
Use the previous algorithm to create a function called sampleN that takes k as argument and generates k
samples from N .
4. [2 marks] Evaluate the function sampleN for k = 20, 000. Save the result in an object called Nsamp. Use
Nsamp to approximate Pr(N = 3) by computing the proportion of times Nsamp is equal to 3.
## 1
cov.mc = function(n){
if(n < 1) stop('n should be at least 1')
u = runif(n)
e = exp(u)
UE = mean(u*e)
U = mean(u)
E = mean(e)
UE - U*E
}
## 2
theta = 3/2 - exp(1)/2
theta.hat = cov.mc(5000)
abs(theta - theta.hat)
## [1] 0.0004697047
## 3
sampleN = function(k){
N = rep(NA, k)
for(j in 1:k){
n = 0 # starting value for N. Will be updated
v = 1 # auxiliary variable used to multiply
while(TRUE){
u = runif(1)
f = prod(v,u)
if(f >= exp(-3)){
v = f
n = n + 1
}else break
}
N[j] = n
}
N
}
## 4
Nsamp = sampleN(20000)
mean(Nsamp == 3)
## [1] 0.2261
11