R assignment writing service: STATS4044 Assignment 3 | 学霸联盟

Date: 2021-11-30

Assignment 3 (2020) alternative questions
STATS4044 Intro to R (H) 2021-2022
Introduction
This document contains alternative questions for Assignment 3 (2020).
Contents:
• 2 alternative versions of Question 1 (each worth 11 marks):
– ArcLake
– Lake Surface Water Temperature
• 1 alternative version of Question 2 (worth 6 marks):
– Computing the median
• 2 alternative versions of Question 3 (each worth 13 marks):
– Finding the maximum
– Finding the minimum
• 1 alternative version of Question 4 (worth 10 marks):
– Regression for Calibration
• 2 additional questions from the DD80 level Assignment 3. These are at a lower level of material than
those for the Honours level assignment:
– Harmonic Regression
– First digits
The original assignment contained 1 of each version of Question 1, 2, 3 and 4 (40 marks total).
Question 1 - ArcLake [11 marks total]
All data required is available in the R data object a3data2020.RData, which can be loaded into your R
workspace by running the code:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a3data2020.Rdata"))
The data frame arclake (contained in a3data2020.RData) contains data from temperature measurements
for lakes around the world. It contains the following columns.
Column     Description
id         Numeric ID of the lake
latitude   Latitude of the lake
longitude  Longitude of the lake
group      Group number of the lake (ranging from 1 to 8)
lswt       Lake surface water temperature measurements (yearly average, in degrees Celsius)
amplitude  Amplitude of seasonal pattern in lake surface water temperature (in degrees Celsius)
(a) [2 marks]
Define a variable warmest which contains only the id of the lake with the highest lake surface water temperature
recorded.
(b) [3 marks]
Define an 8x2 matrix called north.south where each row corresponds to a single group and the columns
correspond to the proportion of lakes in that group that lie in the Northern and Southern hemispheres. Note:
locations with latitudes greater than 0 lie in the Northern Hemisphere.
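One way to obtain row-wise proportions of this kind is to cross-tabulate the group against a hemisphere indicator and then normalise each row. The sketch below is only an illustration on a made-up data frame (toy, with invented latitudes), not on arclake, and the object names are hypothetical.

```r
# Made-up data: six hypothetical lakes in three groups (not the arclake data).
toy <- data.frame(group    = c(1, 1, 2, 2, 2, 3),
                  latitude = c(55.9, -12.3, 40.1, 61.0, -33.5, -1.2))

# Hemisphere indicator; fixing the factor levels keeps both columns
# present even if one hemisphere is empty for every group.
hemisphere <- factor(ifelse(toy$latitude > 0, "North", "South"),
                     levels = c("North", "South"))

# Counts per group and hemisphere, then divide each row by its total.
counts <- table(toy$group, hemisphere)
props  <- prop.table(counts, margin = 1)

# Convert the table object to a plain numeric matrix.
props.mat <- matrix(props, nrow = nrow(props), dimnames = dimnames(props))
props.mat
```

Each row of props.mat sums to one, so the two columns can be read directly as the Northern and Southern shares of that group.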
(c) [2 marks]
Create a new column called dist2equ which contains the distance to the equator in kilometres, calculated as
$$\text{dist2equ} = 6371 \cdot \frac{\pi}{180} \cdot |\lambda|$$
where $|\lambda|$ is the absolute value of the latitude of the lake's location.
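The formula converts the latitude from degrees to radians and multiplies by 6371 km, the mean radius of the Earth, which gives an arc length along a meridian. A minimal sketch on a few invented latitudes (the vector below is hypothetical, not taken from arclake):

```r
# Hypothetical latitudes in degrees (negative values are south of the equator).
latitude <- c(55.87, -33.92, 0, 90)

# Arc length on a sphere of radius 6371 km: radius * angle in radians.
dist2equ <- 6371 * pi / 180 * abs(latitude)
dist2equ
```

Because of the absolute value, lakes at the same distance north and south of the equator receive the same value.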
(d) [4 marks]
Create a scatterplot of the lake surface water temperature against the distance to the equator. Use different
colours and plotting symbols for the different groups of lakes and add a legend to your plot. The label of
the horizontal axis should be “Distance to Equator”. The label of the vertical axis should be “Lake Surface
Water Temperature”.
Your plot should look similar to the plot shown below.
Note: If you have not managed to compute the distance to the equator in part (c), you can use latitude in your
plot as a substitute. If you do this, your plot will look slightly different from the one pictured.
Question 1 - Lake Surface Water Temperature [11 marks total]
All data required is available in the R data object a3data2020.RData, which can be loaded into your R
workspace by running the code:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a3data2020.Rdata"))
The data frame arclake (contained in a3data2020.RData) contains data from temperature measurements
for lakes around the world. It contains the following columns.
Column     Description
id         Numeric ID of the lake
latitude   Latitude of the lake
longitude  Longitude of the lake
group      Group number of the lake (ranging from 1 to 8)
lswt       Lake surface water temperature measurements (yearly average, in degrees Celsius)
amplitude  Amplitude of seasonal pattern in lake surface water temperature (in degrees Celsius)
(a) [2 marks]
Define a variable coolest which contains only the id of the lake with the lowest lake surface water
temperature recorded.
(b) [3 marks]
Define an 8x2 matrix called prop.hem where each row corresponds to a single group and the columns correspond
to the proportion of lakes in that group that lie in the Northern and Southern hemispheres. Note: locations
with latitudes greater than 0 lie in the Northern Hemisphere.
(c) [2 marks]
To normalise a variable x such that its values fall within a given range [a, b], the following formula can be
used:
$$x_{\text{normalized}} = (b - a)\,\frac{x - \min(x)}{\max(x) - \min(x)} + a$$
Add a column called amp.norm to the arclake data which contains the values of amplitude normalised so
that they lie within the range [1,2].
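The rescaling first maps x onto [0, 1] and then stretches and shifts it onto [a, b]. A minimal sketch, assuming x has no missing values and is not constant (otherwise the denominator is zero); the helper name normalise and the vector amp are made up for illustration:

```r
# Min-max rescaling of x to the interval [a, b].
normalise <- function(x, a = 0, b = 1) {
  (b - a) * (x - min(x)) / (max(x) - min(x)) + a
}

# Hypothetical amplitudes; the smallest maps to a and the largest to b.
amp <- c(3.2, 7.5, 12.1, 5.0)
amp.norm <- normalise(amp, a = 1, b = 2)
amp.norm
```

Values rescaled to [1, 2] in this way are convenient as point-size multipliers, for example through the cex argument of plot.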
(d) [4 marks]
Create a scatterplot of the latitude against longitude of the lake locations. Use different colours for the
different groups of lakes and add a legend which indicates which colour corresponds to which group. The
label of the horizontal axis should be “Longitude”. The label of the vertical axis should be “Latitude”. Scale
the size of each point according to the amplitude by using the normalised amplitude values you have created
in part (c).
Your plot should look similar to the plot shown below.
Question 2 - Computing the median [6 marks total]
(a) [4 marks]
Write a function compute.median which computes the median of a vector x, supplied as its only argument.
Your function should not use the built-in functions median or quantile, but compute the median as set out
below.
1. Order the data in x and denote the resulting ordered data vector by $x_{(1)}, \ldots, x_{(n)}$.
2. If n is even, then the median is $\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$. If n is odd, then the median is $x_{((n+1)/2)}$.
Your function should check that x is a numeric vector and provide an error message if it is not.
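The two steps translate almost line by line into R. The sketch below is one possible reading of them, assuming x contains at least one non-missing value; note that sort() silently drops missing values, which may or may not be the behaviour wanted.

```r
# Median via sorting, without median() or quantile().
compute.median <- function(x) {
  if (!is.numeric(x)) stop("x must be a numeric vector")
  xs <- sort(x)          # x_(1) <= ... <= x_(n); sort() drops NAs by default
  n  <- length(xs)
  if (n %% 2 == 0) {
    (xs[n / 2] + xs[n / 2 + 1]) / 2   # even n: average the two middle values
  } else {
    xs[(n + 1) / 2]                   # odd n: the single middle value
  }
}

compute.median(c(5, 1, 3))       # odd n
compute.median(c(4, 1, 3, 2))    # even n
```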
(b) [2 marks]
Create a vector x containing the numbers −7, −5, −3, ..., 11, 13 and use your function from part (a) to
compute the median.
Question 3 - Finding the maximum [13 marks total]
Consider the following function f : R→ R and its derivatives
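$$f(\theta) = \sin(3\theta) - \theta^2, \qquad f'(\theta) = \frac{\partial}{\partial\theta} f(\theta) = 3\cos(3\theta) - 2\theta, \qquad f''(\theta) = \frac{\partial^2}{\partial\theta^2} f(\theta) = -9\sin(3\theta) - 2$$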
(a) [5 marks]
Write three functions f, f.d, and f.dd which take theta as argument and which return f(θ), f ′(θ), and f ′′(θ),
respectively.
(b) [4 marks]
Implement Newton’s method to find a local maximum of f(·) by starting with an initial value of theta=2.
Next use the function optimize to check the maximum of f(·).
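Newton's method for optimisation looks for a root of the first derivative by repeating the update $\theta_{k+1} = \theta_k - f'(\theta_k) / f''(\theta_k)$. The sketch below applies it to the function given in the question; the stopping tolerance of 1e-10, the cap of 100 iterations, and the bracket [0, 1] passed to optimize are choices made here for illustration and are not specified in the question.

```r
# f and its first two derivatives, as given in the question.
f    <- function(theta) sin(3 * theta) - theta^2
f.d  <- function(theta) 3 * cos(3 * theta) - 2 * theta
f.dd <- function(theta) -9 * sin(3 * theta) - 2

# Newton iteration on f': theta <- theta - f'(theta) / f''(theta).
theta <- 2
for (iter in 1:100) {
  step  <- f.d(theta) / f.dd(theta)
  theta <- theta - step
  if (abs(step) < 1e-10) break    # stop once the update is negligible
}
theta.newton <- theta

# A stationary point is a local maximum only if f'' is negative there.
f.dd(theta.newton) < 0

# Cross-check with optimize() on a bracket chosen here for illustration.
opt <- optimize(f, interval = c(0, 1), maximum = TRUE)
c(newton = theta.newton, optimize = opt$maximum)
```

Since f has more than one local maximum on [−3, 3], both Newton's method and optimize can settle on different stationary points depending on the starting value or interval, so the second-derivative check is worth keeping.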
(c) [4 marks]
Use R to create a sketch of the function f(θ) for θ ∈ [−3, 3] and add a red horizontal dashed line at the
maximum value of f(·) and a vertical blue dashed line at the value of theta which corresponds to that maximum
value (both values should be obtained in part (b)).
Your plot should look similar to the one below.
Note: If you have not managed to identify the optimal value using the methods in part (b), you can use the
value θ = 0.4 as a substitute in order to answer this question.
Question 3 - Finding the minimum [13 marks total]
Consider the following function f : R→ R and its derivatives
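$$f(\theta) = \sin(2\theta) + \frac{\theta^2}{2}, \qquad f'(\theta) = \frac{\partial}{\partial\theta} f(\theta) = 2\cos(2\theta) + \theta, \qquad f''(\theta) = \frac{\partial^2}{\partial\theta^2} f(\theta) = -4\sin(2\theta) + 1$$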
(a) [5 marks]
Write three functions f, f.d, and f.dd which take theta as argument and which return f(θ), f ′(θ), and f ′′(θ),
respectively.
(b) [4 marks]
Implement Newton’s method to find a local minimum of f(·) by starting with an initial value of theta=-1.
Next use the function optimize to check the minimum of f(·).
(c) [4 marks]
Use R to create a sketch of the function f(θ) for θ ∈ [−3, 3] and add a red horizontal dashed line at the
minimum value of f(·) and a vertical blue dashed line at the value of theta which corresponds to that minimum
value (both values should be obtained in part (b)).
Your plot should look similar to the one below.
Note: If you have not managed to identify the optimal value using the methods in part (b), you can use the
value θ = −0.6 as a substitute in order to answer this question.
Question 4 - Regression for Calibration [10 marks total]
Passing Bablok (PB) regression is a method for robustly fitting a line to a sample of n points $(x_i, y_i)$,
$i = 1, \ldots, n$, by estimating the slope of the line as the median of the slopes of all lines through pairs
of points. It is thought to be less sensitive to the impact of unusual observations than ordinary least
squares (OLS) regression.
The Passing Bablok regression line can be written as y = α+ βx where the intercept, α, and slope, β, can be
estimated using the following steps.
For a set of n points $(x_i, y_i)$, $i = 1, \ldots, n$:
1. For each pair of points $(x_i, y_i)$ and $(x_j, y_j)$ with $i \neq j$, compute the slope $\beta_{ij}$ using the formula $\beta_{ij} = \frac{y_j - y_i}{x_j - x_i}$.
2. Estimate the slope β as the median of the $\beta_{ij}$'s.
3. Estimate the residuals using $y - \beta x$ where $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$.
4. Estimate the intercept α as the median of the residuals.
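The four steps can be sketched compactly with combn, which enumerates every pair of indices once. Using only pairs with i < j is a choice made here: swapping i and j gives the same slope, so the median is unchanged. The function name pb.sketch and the data are made up, and the sketch assumes no missing values and no two points sharing the same x (it does not include the missing-value handling asked for in part (a)).

```r
# Slope = median of pairwise slopes; intercept = median of y - beta * x.
# Assumes no missing values and no two points sharing the same x.
pb.sketch <- function(x, y) {
  pairs  <- combn(length(x), 2)               # each column is a pair i < j
  i      <- pairs[1, ]
  j      <- pairs[2, ]
  slopes <- (y[j] - y[i]) / (x[j] - x[i])     # beta_ij for every pair
  beta   <- median(slopes)
  alpha  <- median(y - beta * x)
  c(alpha = alpha, beta = beta)
}

# Made-up points lying on y = 1 + 2x, with one gross outlier in y.
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(3, 5, 7, 9, 11, 13, 40)
est <- pb.sketch(x, y)
est
coef(lm(y ~ x))   # OLS for comparison
```

On this toy data the median-based estimates recover the line through the six clean points, whereas the OLS coefficients are pulled towards the outlier.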
(a) [7 marks]
Write a function pb.regression which estimates the PB regression line for a set of points $x = (x_1, x_2, \ldots, x_n)$
and $y = (y_1, y_2, \ldots, y_n)$. Your function should take as arguments two vectors x and y and return the
estimated parameters α and β as described above. Your function should remove any points where there
are missing values and, if this is the case, warn the user that missing values have been removed.
(b) [3 marks]
Use your pb.regression function from part (a) to fit a Passing Bablok regression model to the data stored
in the vectors pbx and pby (available in the data file a3data2020.Rdata). Next plot the data, adding the
PB regression line in red, and the OLS regression line in blue.
If you have not managed to complete part (a) you can use values α = 0.2152 and β = 1.35 as substitutes for
the values of Passing Bablok coefficients.
Your plot should look similar to the one below.
Note the OLS regression line can be computed using the lm function in R.
Additional question (DD80 level) - Harmonic Regression [18 marks total]
Note that this question was included in the DD80 version of Assignment 3, not the Honours version.
A data set which is built into R is USAccDeaths, which contains a time series of the monthly totals of accidental
deaths in the US between 1973 and 1978.
Before attempting this question, enter and run the following code to access and set up the USAccDeaths data:
library(MASS)
data(USAccDeaths)
USAccDeaths <- as.numeric(USAccDeaths)
month <- rep(1:12, times=6)
year <- rep(1973:1978, each=12)
(a) [2 marks]
Use the USAccDeaths, month and year vectors above to define a data frame called accidental which has
three columns, one corresponding to each of the three vectors named above.
(b) [2 marks]
Add a fourth column to the data frame accidental you have defined in part (a) which contains the natural
log ($\log_e$) transformed USAccDeaths data.
(c) [3 marks]
Decimal dates, where dates are expressed as a fraction of their year, are often used in time series. To convert
a date involving a month and year to a decimal date the following formula can be used:
$$\text{dec.date} = \text{year} + \frac{\text{month} - 1}{12}$$
Define a variable named dec.date which contains the decimal dates for the US accidental deaths data
contained in the data frame accidental created in part (a). Add this column to the accidental data frame.
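Because R arithmetic is vectorised, the formula applies to all 72 months at once. A minimal sketch using the month and year vectors from the setup code:

```r
# month and year as set up earlier in the question.
month <- rep(1:12, times = 6)
year  <- rep(1973:1978, each = 12)

# January maps to the start of the year, December to year + 11/12.
dec.date <- year + (month - 1) / 12
head(dec.date, 3)
```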
(d) [2 marks]
The decimal dates defined in dec.date from part (c) correspond to the USAccDeaths series. Produce a line
plot of the USAccDeaths against decimal date as defined in part (c). Your plot should look similar to the one
below.
Note: If you have not managed to define the decimal date in part (c), you can use the code below
to define a version of the decimal date that can be used as a substitute in order to answer this question:
dec.date <- seq(from=1973, to=1978.92, by=0.083)
A harmonic regression model is commonly used to model a time series of length n denoted yi (i = 1, ..., n).
For a time series of monthly observations a harmonic regression model takes the form
$$y_i = \alpha\,\text{dec.date}_i + \beta \sin\left(2\pi\,\frac{\text{month}_i - 1}{12}\right) + \gamma \cos\left(2\pi\,\frac{\text{month}_i - 1}{12}\right)$$
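The design matrix X used to fit this model can be written in the form
$$X = \begin{pmatrix}
\text{dec.date}_1 & \sin\left(2\pi\,\frac{\text{month}_1 - 1}{12}\right) & \cos\left(2\pi\,\frac{\text{month}_1 - 1}{12}\right) \\
\text{dec.date}_2 & \sin\left(2\pi\,\frac{\text{month}_2 - 1}{12}\right) & \cos\left(2\pi\,\frac{\text{month}_2 - 1}{12}\right) \\
\vdots & \vdots & \vdots \\
\text{dec.date}_n & \sin\left(2\pi\,\frac{\text{month}_n - 1}{12}\right) & \cos\left(2\pi\,\frac{\text{month}_n - 1}{12}\right)
\end{pmatrix}$$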
(e) [4 marks]
Define the design matrix X for fitting a harmonic regression model to the USAccDeaths data. In this case
n = 72.
(f) [3 marks]
The estimated harmonic regression model can then be found using the standard regression formula
$$\hat{y} = X (X^T X)^{-1} X^T y$$
Taking X defined in part (e), and y as USAccDeaths, compute $\hat{y}$ using the formula above and store in a vector
called fitted.
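The formula maps onto R's matrix operators: t() for the transpose, %*% for matrix multiplication, and solve() for the inverse. The sketch below builds the three-column design matrix and the fitted values in that way; the final comparison against lm.fit is an extra check added here, not part of the question. Forming the inverse explicitly follows the formula as written, although QR-based routines such as lm.fit are generally the more numerically stable route.

```r
# USAccDeaths ships with base R (package datasets).
y     <- as.numeric(USAccDeaths)
month <- rep(1:12, times = 6)
year  <- rep(1973:1978, each = 12)
dec.date <- year + (month - 1) / 12

# Design matrix: one row per month, columns as in the model (no intercept).
X <- cbind(dec.date,
           sin(2 * pi * (month - 1) / 12),
           cos(2 * pi * (month - 1) / 12))

# Fitted values via the hat matrix X (X'X)^{-1} X'.
fitted <- as.vector(X %*% solve(t(X) %*% X) %*% t(X) %*% y)

# Optional check against R's least-squares routine (QR based).
max(abs(fitted - lm.fit(X, y)$fitted.values))
```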
(g) [2 marks]
Add the fitted model to the plot produced in part (d), using a blue line. Your plot should look similar to the
one below.
Additional question (DD80 level) - First digits [6 marks total]
Note that this question was included in the DD80 version of Assignment 3, not the Honours version.
The first digit of numbers in many real-world data sets (like the population sizes of towns, lengths of rivers,
numbers of votes in an election) does not have a uniform distribution. The digit 1 for example, occurs much
more often than the digit 9.
In 1938, the physicist Frank Benford suggested that the first digit of most empirical data has a distribution
whose probability mass function (p.m.f.) is given by
$$p(x) = \begin{cases} \log_{10}\left(1 + \frac{1}{x}\right) & \text{for } x \in \{1, 2, \ldots, 9\} \\ 0 & \text{otherwise} \end{cases}$$
where log10 denotes the logarithm with base 10. Note: log10(x) can be found in R using log(x, base=10).
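The probabilities sum to one because the sum telescopes: $\sum_{x=1}^{9} \log_{10}\frac{x+1}{x} = \log_{10}\frac{10}{1} = 1$. A minimal sketch of the p.m.f. in R, where the comparison uses all.equal rather than == to allow for floating-point rounding:

```r
x <- 1:9
p <- log10(1 + 1 / x)     # log10(x) is equivalent to log(x, base = 10)
names(p) <- x
round(p, 3)

# Compare the sum with 1 up to floating-point tolerance.
isTRUE(all.equal(sum(p), 1))
```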
(a) [2 marks]
Create a vector p such that its ith entry contains the probability p(i). Check whether the entries of p sum to
one.
(b) [2 marks]
Use R to define a variable mu which contains the expected value $E(X) = \sum_{x=1}^{9} x\,p(x)$.
(c) [2 marks]
Use R to define a variable myvar which contains the variance $Var(X) = \sum_{x=1}^{9} (x - \mu)^2\,p(x)$, where $\mu = E(X)$.
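For a discrete distribution both sums are weighted sums over the support, so each reduces to a single sum() call. A minimal sketch that redefines the Benford probabilities so that it runs on its own; the comparison with the shortcut $E(X^2) - E(X)^2$ is an extra sanity check added here:

```r
x <- 1:9
p <- log10(1 + 1 / x)           # Benford first-digit probabilities

mu    <- sum(x * p)             # E(X)
myvar <- sum((x - mu)^2 * p)    # Var(X)

# The shortcut E(X^2) - E(X)^2 should give the same variance.
all.equal(myvar, sum(x^2 * p) - mu^2)
c(mean = mu, variance = myvar)
```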