QUEEN MARY, UNIVERSITY OF LONDON
MTH6991/MTH791U/MTH791P
Computational Statistics with R
Exercise Sheet 4 Spring 2022
Question 1 is due to be handed in for assessment along with two questions from the previous
exercise sheet. The link for submission will be in the week 6 section on QMPlus. The deadline
is 1pm on Thursday the 3rd March. Please include an R script file with the R code used, and
a separate file with all answers asked for (you can submit more than one file).
Late submissions will receive zero marks.
1. (Problem for handing in) 50 marks
This question uses a dataset on QMPlus. There is a different dataset for each student,
which can be found via the link “Exercise 4 datasets” in the week 5 section. For each
student, there should be a file called “exercise4 XYZ.txt”, where XYZ is your ID number
(you need to be logged in to QMPlus). If you cannot see a file, please send me an email.
The dataset contains one column, called x. Using R, generate three kernel density
estimates (KDE) using bandwidths of h = 0.2, 1 and 2, and the Gaussian kernel. Plot
each of these KDEs. What is the effect of changing h on the appearance of the graphs?
What is the main feature of the data that these plots show?
What is the bandwidth that R calculates (still with the Gaussian kernel) if you do not
specify a bandwidth?
Also generate and plot a KDE using a bandwidth of 1 and the rectangular kernel. Which
produces a smoother curve (for h = 1), the rectangular or Gaussian kernel?
Include the graphs with your answers, but apart from that don’t copy any R output.
Within R, right-clicking on a graph gives the options of saving it to disk or copying it
(e.g. to paste into a Word document).
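As a starting point, the workflow for this question might look like the sketch below. It is only an illustration: the vector x is simulated stand-in data (an assumption, not your dataset), and you would replace it with the column x read from your own file, for example with read.table(..., header = TRUE).

```r
# Stand-in data (assumption): replace with the column x from your own file.
set.seed(1)
x <- rnorm(200)

# Gaussian-kernel KDEs for the three bandwidths in the question
bandwidths <- c(0.2, 1, 2)
kdes <- lapply(bandwidths, function(h) density(x, bw = h, kernel = "gaussian"))

# Plot the three estimates side by side
op <- par(mfrow = c(1, 3))
for (i in seq_along(kdes)) {
  plot(kdes[[i]], main = paste("Gaussian kernel, h =", bandwidths[i]))
}
par(op)

# KDE with the bandwidth R chooses when none is specified
kde_default <- density(x, kernel = "gaussian")
kde_default$bw

# KDE with bandwidth 1 and the rectangular kernel
kde_rect <- density(x, bw = 1, kernel = "rectangular")
plot(kde_rect, main = "Rectangular kernel, h = 1")
```

Note that density() interprets bw as the standard deviation of the smoothing kernel, for every choice of kernel.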
2. Without using R, just with pen and paper, calculate the histogram estimator $\hat{f}_H(y)$ of
the probability density function (pdf) using the following data:
0.5, 4.9, 6.5, 4.4, 7.5, 6.9, 1.2, 6.7, 5.8, 4.7
Use bins (intervals) with boundaries at 0, 2, 4, 6, 8.
So you need to replace each ? in the following:
\[
\hat{f}_H(y) =
\begin{cases}
?, & y \le 0 \\
?, & 0 < y \le 2 \\
?, & 2 < y \le 4 \\
?, & 4 < y \le 6 \\
?, & 6 < y \le 8 \\
?, & y > 8
\end{cases}
\]
3. Using R, draw a histogram with the data from question (2), with the same intervals,
and check that the probability density function estimate is the same as you calculated
by hand.
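One possible way to set this up is sketched below. The variable names y, breaks and h are illustrative choices. The option freq = FALSE asks hist() to use the density scale rather than counts, and by default hist() uses right-closed intervals, which matches the bins 0 < y ≤ 2, 2 < y ≤ 4 and so on.

```r
# Data from question 2
y <- c(0.5, 4.9, 6.5, 4.4, 7.5, 6.9, 1.2, 6.7, 5.8, 4.7)

# Bin boundaries from question 2
breaks <- c(0, 2, 4, 6, 8)

# Histogram on the density scale (bar areas sum to 1)
h <- hist(y, breaks = breaks, freq = FALSE,
          main = "Histogram density estimate")

# Number of observations in each bin, and the density estimate on each bin
h$counts
h$density
```

The values in h$density are the ones to compare with your hand calculation.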
4. For a general kernel function $K$ (which is by definition a pdf), if $\sigma > 0$ is the standard
deviation of this pdf, then we can define the rescaled kernel $K^*$ by $K^*(x) = \sigma K(\sigma x)$.
Show that $K^*$ is a pdf, and that it has standard deviation 1.
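A numerical sanity check of the claim (not a proof) can be sketched in R for one particular kernel. The choice of kernel here is an assumption made purely for illustration: the Epanechnikov kernel 0.75(1 − x²) on [−1, 1], which has mean 0 and variance 1/5.

```r
# Illustrative kernel (assumption): Epanechnikov kernel on [-1, 1]
K <- function(x) ifelse(abs(x) <= 1, 0.75 * (1 - x^2), 0)
sigma <- sqrt(1 / 5)   # standard deviation of K

# Rescaled kernel as defined in the question
K_star <- function(x) sigma * K(sigma * x)

# K_star is zero outside [-1/sigma, 1/sigma], so integrate over that interval
lim <- 1 / sigma
total_mass <- integrate(K_star, -lim, lim)$value
mean_star  <- integrate(function(x) x * K_star(x), -lim, lim)$value
var_star   <- integrate(function(x) x^2 * K_star(x), -lim, lim)$value

# Total probability and standard deviation of the rescaled kernel
c(total_mass = total_mass, sd = sqrt(var_star - mean_star^2))
```

The question itself asks for a general argument, which a check on a single kernel does not replace.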
5. Using the same data as question 2, without using R, calculate the kernel density estimate
$\hat{f}_{n,h}(y)$ using the rectangular kernel, and with bandwidth h = 1, for the values of
y = 0, 1, 3 and 4.
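If you want to check a hand calculation afterwards, a direct evaluation of the estimate can be sketched as follows. This assumes the usual form of the estimator, (1/(nh)) Σ K((y − x_i)/h), and a rectangular kernel equal to 1/(2a) on [−a, a]. The helper names rect_kernel and kde_at and the half-width argument a are hypothetical: use a = 1 for the unscaled rectangular kernel, or a = sqrt(3) for the version rescaled to standard deviation 1, whichever matches the definition used in the lectures.

```r
# Data from question 2
y_data <- c(0.5, 4.9, 6.5, 4.4, 7.5, 6.9, 1.2, 6.7, 5.8, 4.7)

# Rectangular kernel with half-width a (hypothetical helper)
rect_kernel <- function(u, a = 1) ifelse(abs(u) <= a, 1 / (2 * a), 0)

# KDE at the points y: (1 / (n h)) * sum_i K((y - x_i) / h)
kde_at <- function(y, data, h, kernel = rect_kernel, ...) {
  sapply(y, function(y0) mean(kernel((y0 - data) / h, ...)) / h)
}

# Estimates at y = 0, 1, 3, 4 with h = 1 and half-width 1
kde_at(c(0, 1, 3, 4), y_data, h = 1)

# Same points, with the kernel rescaled to standard deviation 1
kde_at(c(0, 1, 3, 4), y_data, h = 1, a = sqrt(3))
```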
6. In R, simulate a sample of size 1000 from an exponential distribution with parameter 1,
which can be done with the command rexp. Estimate a kernel density with Gaussian
kernel and bandwidth 0.2. Plot the kernel density estimate and the true exponential
pdf on the same graph. The latter is given by
\[ f(y) = e^{-y}, \quad y \ge 0. \]
You could use the command
curve(dexp, add=TRUE)
to add this true pdf to an existing graph, such as the plot of the KDE. The vertical
range of the exponential pdf is [0, 1], so it may help to add the option ylim=c(0, 1) to
the plot of the KDE.
How do the true and estimated pdfs differ, and what feature of the true pdf do you
think might cause this?
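The steps of this question could be put together roughly as in the sketch below. The seed, the dashed line style and the legend are additions for reproducibility and readability, and are not required by the question.

```r
# Simulate 1000 observations from an exponential distribution with rate 1
set.seed(42)   # assumption: fixed seed so the example is reproducible
n <- 1000
y <- rexp(n, rate = 1)

# Gaussian-kernel density estimate with bandwidth 0.2
kde <- density(y, bw = 0.2, kernel = "gaussian")

# Plot the KDE, then overlay the true pdf on the same axes
plot(kde, ylim = c(0, 1), main = "KDE and true exponential pdf")
curve(dexp, add = TRUE, lty = 2)
legend("topright", legend = c("KDE", "true pdf"), lty = c(1, 2))
```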