Programming assignment sample: 5M | 学霸联盟

Date: 2022-04-23

December 2021
2 hours + 30 minutes to upload solutions
EXAMINATION FOR THE DEGREES OF XXXX
STATISTICS
Spatial Statistics 5M
This paper consists of 8 pages and contains 4 questions.
Candidates should attempt all questions.
Question 1 20 marks
Question 2 20 marks
Question 3 20 marks
Question 4 20 marks
Total 80 marks
The following material is made available to you:
Statistical tables∗
Statistical Tables
Probability formula sheet
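Discrete univariate distributions
Each entry lists: probability mass function (p.m.f.) p(x), range, parameters, E(X), Var(X), moment-generating function M(t), and comments.

Binomial, Bi(n, θ) (see notes 1 and 2)
  p.m.f.: $p(x) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}$
  Range: $x \in \{0, 1, \ldots, n\}$
  Parameters: $n \in \mathbb{N}$, $0 < \theta < 1$
  E(X): $n\theta$
  Var(X): $n\theta(1-\theta)$
  M(t): $(1-\theta+\theta\exp(t))^{n}$
  Comments: No. of successes in n trials; θ is the probability of success.

Geometric, Geo(θ)
  p.m.f.: $p(x) = \theta^{x-1}(1-\theta)$
  Range: $x \in \mathbb{N}$
  Parameters: $0 < \theta < 1$
  E(X): $\dfrac{1}{1-\theta}$
  Var(X): $\dfrac{\theta}{(1-\theta)^{2}}$
  M(t): $\dfrac{(1-\theta)\exp(t)}{1-\theta\exp(t)}$
  Comments: No. of trials until (and including) first failure; θ is the probability of success.

Hypergeometric, HyGe(n, N, M)
  p.m.f.: $p(x) = \dfrac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$
  Range: $x \in \{\max\{0, n-(N-M)\}, \ldots, \min\{n, M\}\}$
  Parameters: $N, n \in \mathbb{N}$, $M \in \{0, \ldots, N\}$
  E(X): $n\theta$ (with $\theta = M/N$)
  Var(X): $n\theta(1-\theta)\cdot\dfrac{N-n}{N-1}$ (with $\theta = M/N$)
  M(t): no entry (note 3)
  Comments: No. of type I objects in a sample of size n, drawn without replacement from a population of size N containing M type I objects.

Negative Binomial, NeBi(k, θ)
  p.m.f.: $p(x) = \binom{x-1}{k-1}\theta^{x-k}(1-\theta)^{k}$
  Range: $x \in \{k, k+1, \ldots\}$
  Parameters: $k \in \mathbb{N}$, $0 < \theta < 1$
  E(X): $\dfrac{k}{1-\theta}$
  Var(X): $\dfrac{k\theta}{(1-\theta)^{2}}$
  M(t): $\left(\dfrac{(1-\theta)\exp(t)}{1-\theta\exp(t)}\right)^{k}$
  Comments: No. of trials until (and including) kth failure; θ is the probability of success; NeBi(1, θ) ≡ Geo(θ).

Poisson, Poi(λ)
  p.m.f.: $p(x) = \dfrac{\exp(-\lambda)\lambda^{x}}{x!}$
  Range: $x \in \mathbb{N}_0$
  Parameters: $\lambda > 0$
  E(X): $\lambda$
  Var(X): $\lambda$
  M(t): $\exp(\lambda(\exp(t)-1))$

Notes:
1. Bi(n, θ) can be approximated by Poi(nθ) if n is large, θ is small and nθ is moderate.
2. Bi(n, θ) can be approximated by N(nθ, nθ(1 − θ)) if n is large and θ is not too close to 0 or 1.
3. No simple closed form expression exists.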
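Continuous univariate distributions
Each entry lists: probability density function (p.d.f.) f(x), range, parameters, E(X), Var(X), moment-generating function M(t), and comments.

Beta, Be(α1, α2)
  p.d.f.: $f(x) = \dfrac{x^{\alpha_1-1}(1-x)^{\alpha_2-1}}{B(\alpha_1, \alpha_2)}$
  Range: $0 \le x \le 1$
  Parameters: $\alpha_1 > 0$, $\alpha_2 > 0$
  E(X): $\dfrac{\alpha_1}{\alpha_1+\alpha_2}$
  Var(X): $\dfrac{\alpha_1\alpha_2}{(\alpha_1+\alpha_2)^{2}(\alpha_1+\alpha_2+1)}$
  M(t): no entry (note 3)
  Comments: $X_1 \sim \mathrm{Ga}(\alpha_1, \theta)$ and $X_2 \sim \mathrm{Ga}(\alpha_2, \theta)$ independent $\Rightarrow \dfrac{X_1}{X_1+X_2} \sim \mathrm{Be}(\alpha_1, \alpha_2)$.

Cauchy, Ca(η, γ)
  p.d.f.: $f(x) = \dfrac{1}{\pi\gamma\left(1+\frac{(x-\eta)^{2}}{\gamma^{2}}\right)}$
  Range: $x \in \mathbb{R}$
  Parameters: $\eta \in \mathbb{R}$, $\gamma > 0$
  E(X): no entry (note 4)
  Var(X): no entry (note 4)
  M(t): no entry (note 4)
  Comments: Ca(0, 1) ≡ t(1).

Chi-Squared, χ²(ν)
  p.d.f.: $f(x) = \dfrac{x^{\nu/2-1}\exp(-x/2)}{2^{\nu/2}\,\Gamma(\nu/2)}$
  Range: $x > 0$
  Parameters: $\nu \in \mathbb{N}$
  E(X): $\nu$
  Var(X): $2\nu$
  M(t): $\dfrac{1}{(1-2t)^{\nu/2}}$
  Comments: $X_i \sim N(0, 1)$ independent $\Rightarrow \sum_{i=1}^{\nu} X_i^{2} \sim \chi^{2}(\nu)$.

Exponential, Expo(θ)
  p.d.f.: $f(x) = \theta\exp(-\theta x)$
  Range: $x > 0$
  Parameters: $\theta > 0$
  E(X): $\dfrac{1}{\theta}$
  Var(X): $\dfrac{1}{\theta^{2}}$
  M(t): $\dfrac{1}{1-t/\theta}$

F, F(ν1, ν2)
  p.d.f.: $f(x) = \dfrac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \cdot \dfrac{x^{\nu_1/2-1}}{(\nu_1 x+\nu_2)^{(\nu_1+\nu_2)/2}}$
  Range: $x > 0$
  Parameters: $\nu_1, \nu_2 \in \mathbb{N}$
  E(X): $\dfrac{\nu_2}{\nu_2-2}$ (for $\nu_2 > 2$)
  Var(X): $\dfrac{2\nu_2^{2}(\nu_1+\nu_2-2)}{\nu_1(\nu_2-2)^{2}(\nu_2-4)}$ (for $\nu_2 > 4$)
  M(t): no entry (note 4)
  Comments: $X_1 \sim \chi^{2}(\nu_1)$ and $X_2 \sim \chi^{2}(\nu_2)$ independent $\Rightarrow \dfrac{X_1/\nu_1}{X_2/\nu_2} \sim F(\nu_1, \nu_2)$.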
“An electronic calculator may be used provided that it is allowed under the School of
Mathematics and Statistics Calculator Policy. A copy of this policy has been distributed
to the class prior to the exam and is also available via the invigilator.”
CONTINUED OVERLEAF/
NOTE: Candidates should attempt all questions.
1. (a) Suppose that {Z(s) : s ∈ D} is a stationary isotropic geostatistical process with
zero mean and a covariance function given by
$$
C(h) =
\begin{cases}
\sigma^2 \left(1 - \dfrac{h}{\phi + h}\right), & \text{if } h > 0, \\
\sigma^2 + \tau^2, & \text{if } h = 0.
\end{cases}
$$
i. Assuming that $\sigma^2 = 1$ and $\tau^2 = 2$, compute $C(0.5)$ for: (i) $\phi = 1$ and (ii)
$\phi = 2$. Which of the two values of $\phi$ gives the longest-range autocorrelation?
[3 MARKS]
ii. Compute the semi-variogram corresponding to this covariance model.
[4 MARKS]
iii. What is the range, sill, nugget, and partial sill for this covariance model? Give
reasons for your answers. [3 MARKS]
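A minimal Python sketch of how the covariance model in part (a) could be evaluated numerically. The function names `covariance` and `semivariogram` are illustrative and not part of the paper, and the semi-variogram uses the standard identity γ(h) = C(0) − C(h) for a stationary process.

```python
def covariance(h, sigma2, tau2, phi):
    """Covariance C(h) of the model in part (a) at distance h >= 0."""
    if h < 0:
        raise ValueError("distance h must be non-negative")
    if h == 0:
        return sigma2 + tau2  # value at the origin includes tau^2
    return sigma2 * (1.0 - h / (phi + h))


def semivariogram(h, sigma2, tau2, phi):
    """Semi-variogram gamma(h) = C(0) - C(h) for a stationary process."""
    return covariance(0.0, sigma2, tau2, phi) - covariance(h, sigma2, tau2, phi)


if __name__ == "__main__":
    # Parameter values taken from part (a)i: sigma^2 = 1, tau^2 = 2.
    for phi in (1.0, 2.0):
        c = covariance(0.5, 1.0, 2.0, phi)
        g = semivariogram(0.5, 1.0, 2.0, phi)
        print(f"phi={phi}: C(0.5)={c:.4f}, gamma(0.5)={g:.4f}")
```

Evaluating `semivariogram` at increasingly large h is one way to check a hand-derived sill against the formula.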
(b) Measurements of air pollution were collected at 127 locations in and around
Greater London, and are plotted as a geoR object below. Recall that in the top-left
panel of a geoR plot the colours represent the quartiles of the distribution of values.
For example, the red crosses denote locations where the values are in the highest
quartile, while the blue circles denote locations where the values are in the
lowest quartile. The X co-ordinate represents easting in metres and the
Y co-ordinate represents northing in metres. Describe the main features of these
data, and comment on whether the assumption that they are weakly stationary is
reasonable. [2 MARKS]
[Figure: geoR plot of the air pollution data, four panels: Y Coord against X Coord (locations coloured by quartile); Y Coord against data; data against X Coord; Density of data.]
(c) Constant, linear and quadratic trend models in easting (X Coord) and northing (Y
Coord) were fitted to these data using the linear model framework, and the results
of each are shown below. From the output shown and the plot above which is the
most appropriate trend model for these data? Justify your answer. [2 MARKS]
Model 1
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  130.975      1.081   121.2   <2e-16 ***

Model 2
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.002e+02  1.352e+01 -29.598   <2e-16 ***
easting      1.003e+00  2.462e-02  40.746   <2e-16 ***
northing    -8.616e-03  3.603e-02  -0.239    0.811
Multiple R-squared: 0.9324, Adjusted R-squared: 0.9313

Model 3
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.565e+02  4.853e+02  -0.941    0.349
easting      9.029e-01  1.757e+00   0.514    0.608
northing     9.218e-01  1.155e+00   0.798    0.427
easting.sq   9.443e-05  1.652e-03   0.057    0.954
northing.sq -2.607e-03  3.236e-03  -0.806    0.422
Multiple R-squared: 0.9328, Adjusted R-squared: 0.9306
(d) The residuals from the most appropriate trend model were computed and the
binned empirical semi-variogram was then plotted together with 95% Monte Carlo
envelopes. This plot is shown below.
[Figure: binned empirical semi-variogram of the residuals with 95% Monte Carlo envelopes; x-axis: distance (0 to 50000), y-axis: semivariance (0 to 300).]
i. From the empirical semi-variogram plot of the data approximately estimate
the nugget, sill and range of the process. [3 MARKS]
ii. Describe three disadvantages of estimating the binned empirical semi-variogram.
[3 MARKS]
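For reference, a minimal Python sketch of a binned empirical semi-variogram, assuming the usual method-of-moments estimator (half the average squared difference between pairs of values whose separation falls in each distance bin). The function name, the bin layout and the example coordinates are hypothetical and not taken from the paper.

```python
import math


def empirical_semivariogram(coords, values, bin_width, max_dist):
    """Return (bin midpoint, semivariance, number of pairs) for each non-empty bin."""
    n_bins = int(math.ceil(max_dist / bin_width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if d >= max_dist:
                continue  # pairs further apart than max_dist are ignored
            b = min(int(d / bin_width), n_bins - 1)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    result = []
    for b in range(n_bins):
        if counts[b] > 0:
            midpoint = (b + 0.5) * bin_width
            result.append((midpoint, sums[b] / (2 * counts[b]), counts[b]))
    return result


if __name__ == "__main__":
    # Hypothetical residuals at three locations on a line.
    coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    residuals = [1.0, 3.0, 6.0]
    for row in empirical_semivariogram(coords, residuals, 1.0, 3.0):
        print(row)
```

The pair count returned for each bin makes it easy to see how few pairs can support an estimate at some distances, and how the result depends on the chosen `bin_width`.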
2. (a) In areal unit data the spatial closeness between the set of n areal units is
summarised by a non-negative n × n neighbourhood or proximity matrix W, where
the ijth element of this matrix, $w_{ij}$, determines the proximity of areal units (i, j).
Describe how to construct W using the k nearest neighbours specification, and
give one disadvantage of using this approach. [2 MARKS]
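A minimal Python sketch of one common way to build a k nearest neighbours matrix, assuming each areal unit is represented by a centroid and that proximity is Euclidean distance between centroids. The function name and the centroid coordinates are hypothetical; ties are broken by index order in this sketch.

```python
import math


def knn_neighbourhood(centroids, k):
    """Binary matrix with W[i][j] = 1 if unit j is among the k nearest units to unit i."""
    n = len(centroids)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(centroids[i], centroids[j]),
        )
        for j in others[:k]:
            W[i][j] = 1
    return W


def is_symmetric(W):
    """Check whether W[i][j] == W[j][i] for all pairs."""
    n = len(W)
    return all(W[i][j] == W[j][i] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Hypothetical centroids of three areal units on a line.
    centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    W = knn_neighbourhood(centroids, k=1)
    print(W, is_symmetric(W))
```

Running `is_symmetric` on the result is a quick way to inspect one property of the matrix that this construction does not guarantee.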
(b) Consider a simple 1-dimensional spatial data set with 4 regions ordered as [A1|A2|A3|A4],
with a corresponding neighbourhood matrix
$$
\mathbf{W} =
\begin{pmatrix}
0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}.
$$
Then suppose that $Z_1 = 102$, $Z_2 = 101$, $Z_3 = 100$ and $Z_4 = 87$. Compute the local
indicators of spatial association based on Moran’s I statistic for areal units A1 and
A4, and interpret what these statistics tell you about the data. [4 MARKS]
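A minimal Python sketch of a local Moran's I calculation for the data in part (b). It assumes one common form of the statistic, $I_i = \frac{(Z_i - \bar{Z})}{m_2} \sum_j w_{ij}(Z_j - \bar{Z})$ with $m_2 = \frac{1}{n}\sum_j (Z_j - \bar{Z})^2$ and an unstandardised W; the course notes may use a different normalisation, so the scale of the values is an assumption here.

```python
def local_morans_i(z, W):
    """Local Moran's I for each unit, using the m2 = sum((z - mean)^2) / n scaling."""
    n = len(z)
    mean = sum(z) / n
    dev = [value - mean for value in z]
    m2 = sum(d * d for d in dev) / n
    return [
        dev[i] / m2 * sum(W[i][j] * dev[j] for j in range(n))
        for i in range(n)
    ]


# Data and neighbourhood matrix from part (b).
z = [102.0, 101.0, 100.0, 87.0]
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    for name, value in zip(["A1", "A2", "A3", "A4"], local_morans_i(z, W)):
        print(name, round(value, 4))
```

Under any of the usual scalings the sign of each statistic is the same, which is what the interpretation rests on.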
(c) Consider a vector of areal unit data $\mathbf{Z} = (Z_1, \ldots, Z_n)$ relating to n non-overlapping
areal units. Additionally, consider a binary n × n neighbourhood matrix W, where
$w_{kj} = 1$ if areas (k, j) share a common border and $w_{kj} = 0$ otherwise.
i. Consider the following model for Z.
$$
Z_k \mid \mathbf{Z}_{-k} \sim N\left( \frac{\sum_{j=1}^{n} w_{kj} Z_j}{\sum_{j=1}^{n} w_{kj}},\ \frac{\tau^2}{\sum_{j=1}^{n} w_{kj}} \right),
$$
where in the usual notation $\mathbf{Z}_{-k}$ denotes all the observations except the kth.
Give two limitations of this particular conditional autoregressive model.
[2 MARKS]
ii. Now suppose that areal unit k is an island, and hence does not share a common
border with any of the other areas. Given the definition of the neighbourhood
matrix W above, is the full conditional distribution described in the previous
part a valid normal distribution model for Zk? Justify your answer. If it is
not a valid model, how could W be altered to make it a valid model?
[4 MARKS]
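(d) Consider the vector of random variables $\mathbf{Z} = (Z_1, \ldots, Z_n)$, which are assigned
a Gaussian Markov Random Field (GMRF) model, with mean $\mathbf{m}$ and precision
matrix $\mathbf{Q}$. Suppose the vector $\mathbf{Z}$ is partitioned into two components $\mathbf{Z} = (\mathbf{Z}_1, \mathbf{Z}_2)$.
Then partitioning the mean and variance of $\mathbf{Z}$ similarly as
$$
\mathbf{Z} =
\begin{pmatrix} \mathbf{Z}_1 \\ \mathbf{Z}_2 \end{pmatrix}
\sim N\left(
\begin{pmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \end{pmatrix},\
\tau^2 \begin{pmatrix} \mathbf{Q}_{11} & \mathbf{Q}_{12} \\ \mathbf{Q}_{21} & \mathbf{Q}_{22} \end{pmatrix}^{-1}
\right),
$$
it can be shown that
$$
\mathbf{Z}_1 \mid \mathbf{Z}_2 \sim N\left( \mathbf{m}_1 - \mathbf{Q}_{11}^{-1}\mathbf{Q}_{12}(\mathbf{Z}_2 - \mathbf{m}_2),\ \tau^2 \mathbf{Q}_{11}^{-1} \right).
$$
Consider a GMRF model with $\mathbf{m} = (\mathbf{m}_1, \mathbf{m}_2) = \lambda\mathbf{1}$ and
$\mathbf{Q} = \theta[\mathrm{diag}(\mathbf{W}\mathbf{1}) - \mathbf{W}] + (1 - \theta)\mathbf{I}$, where $\mathbf{1}$ is an n × 1 vector of ones,
$\mathbf{I}$ is an n × n identity matrix and $\mathbf{W}$ is an n × n neighbourhood matrix.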
i. Using the above result derive the full conditional distribution $f(Z_i \mid \mathbf{Z}_{-i})$, where
$\mathbf{Z}_{-i}$ denotes all observations except the ith. [4 MARKS]
ii. Derive the partial correlation $\mathrm{Corr}(Z_i, Z_j \mid \mathbf{Z}_{-ij})$, where $\mathbf{Z}_{-ij}$ denotes all
observations except the ith and jth.
[4 MARKS]
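A minimal Python sketch for checking a hand derivation of part (d)i numerically. It builds Q from the definition in part (d) and applies the stated conditional result with the first block taken to be the single element $Z_i$, so that $\mathbf{Q}_{11}$ is the scalar $Q_{ii}$. The values θ = 0.5, λ = 100 and τ² = 1, the reuse of W and Z from Question 2(b), and all function names are assumptions made purely for illustration.

```python
def build_precision(W, theta):
    """Q = theta * [diag(W 1) - W] + (1 - theta) * I, as defined in part (d)."""
    n = len(W)
    Q = [[-theta * W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] += theta * sum(W[i]) + (1.0 - theta)
    return Q


def full_conditional(i, z, m, Q, tau2):
    """Mean and variance of Z_i given the rest, from the partitioned result with Z_1 = Z_i."""
    n = len(z)
    adjustment = sum(Q[i][j] * (z[j] - m[j]) for j in range(n) if j != i)
    mean = m[i] - adjustment / Q[i][i]
    variance = tau2 / Q[i][i]
    return mean, variance


# Hypothetical setting: W and z borrowed from Question 2(b).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
z = [102.0, 101.0, 100.0, 87.0]
theta, lam, tau2 = 0.5, 100.0, 1.0
Q = build_precision(W, theta)
m = [lam] * len(z)

if __name__ == "__main__":
    for i in range(len(z)):
        print(i, full_conditional(i, z, m, Q, tau2))
```

Comparing these numbers with the values given by a closed-form expression derived by hand is a simple consistency check on the algebra.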
3. (a) For an arbitrary spatial point process, define Ripley's K function in terms of the
pair correlation function ρ(t). For a spatial point process with ρ(t) = 1 for all
t, derive Ripley's K function. What process does this correspond
to? [4 MARKS]
(b) Describe how Ripley’s K function can be used to determine whether a given point
process is a completely spatially random, clustered or regular process.
[2 MARKS]
(c) The data in the left plot below are the locations of trees in a 1 km square grid on
Exmoor in Somerset, while the right plot shows the estimated K function.
[Figure: left panel, tree locations (y against x); right panel, "Estimated K function", K(r) against r, showing the curves K^iso(r) and Kpois(r).]
i. From looking at both plots is there any evidence that the trees exhibit spatial
dependence? Justify your answer. [3 MARKS]
ii. It is decided to use a non-parametric kernel smoothing approach to estimate
the first order intensity function λ(s) rather than a parametric model. Give
one advantage and one disadvantage of this approach. [2 MARKS]
(d) Consider the following Poisson point process model defined on the unit square.
$$
Z(A) \sim \mathrm{Poisson}(\mu(A)), \qquad \mu(A) = \int_A \lambda(s)\, ds,
$$
where λ(s) is a spatially varying first order intensity function and the location s =
$(s_1, s_2)$. Now suppose that this intensity function is represented by the following
models:
(i) $\lambda(s) \sim N(\mu, \sigma^2)$.
(ii) $\ln(\lambda(s)) \sim N(\mu, \sigma^2)$.
Which, if any, of the intensity function models (i) and (ii) above lead to a valid spatial
point process, and which, if any, induce spatial dependence into the process?
[4 MARKS]
(e) Assume that point process data are collected on a unit square domain D with
corners (0, 0), (0, 1), (1, 0) and (1, 1). An appropriate first order intensity function
at location $s = (s_1, s_2)$ for this point pattern is thought to be $\lambda(s) = 2 + s_1^2 + s_2$.
Compute the expected number of points in the domain D, i.e. compute E[Z(D)].
[5 MARKS]
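A minimal Python sketch of a numerical check for part (e), assuming the Poisson point process model of part (d), under which E[Z(D)] is the integral of λ(s) over D. The integral is approximated with a midpoint rule on a regular grid; the function names and the grid size are illustrative choices.

```python
def intensity(s1, s2):
    """First order intensity from part (e): lambda(s) = 2 + s1^2 + s2."""
    return 2.0 + s1 ** 2 + s2


def expected_count_unit_square(lam, n=200):
    """Midpoint-rule approximation of the integral of lam over [0, 1] x [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += lam((i + 0.5) * h, (j + 0.5) * h)
    return total * h * h


if __name__ == "__main__":
    print(expected_count_unit_square(intensity))
```

The printed value can be compared with the result of integrating the three terms of λ(s) by hand.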
4. (a) Consider a vector of random variables $\mathbf{Z} = (Z_1, \ldots, Z_n)$ that are represented by
the model
$$
\mathbf{Z} \sim N(\mathbf{X}\boldsymbol{\beta}, \tau^2 \mathbf{Q}),
$$
where X is an n × p matrix of known covariates and β is a p × 1 vector of associated
regression parameters. Here Q is a precision matrix (not a variance matrix) that
is given by
$$
\mathbf{Q} = \mathrm{diag}(\mathbf{W}\mathbf{1}) - \mathbf{W},
$$
where 1 is an n × 1 vector of ones, and W is an n × n neighbourhood matrix.
i. What is the problem with fitting this model to a set of spatial areal unit data
$\mathbf{z} = (z_1, \ldots, z_n)$ as a data likelihood model? [3 MARKS]
ii. Suggest an alternative yet similar model that can be used as a data likelihood
model for z. [3 MARKS]
(b) A researcher is interested in estimating what factors impact Covid-19 mortality
rates in England, and which areas have the highest mortality rates. They have
collected a population-level summary data set for a set of n small areal units that
make up England. Specifically, for each areal unit they collected the following
data:
• The number of deaths due to Covid-19 denoted Zk for area k.
• The expected number of deaths from Covid-19 based on the population size
and its age-sex demographics denoted ek for area k.
• Five different measures of poverty, including measures of average income,
unemployment rate, average house price, average education level, and crime
rates.
• Other potentially important covariates including those relating to population
density, ethnicity, air pollution concentrations, number of care homes, etc.
The covariates are collectively denoted by xk for area k.
i. What type of spatial data are being modelled here? Justify your answer.
[2 MARKS]
ii. What exploratory analysis would you undertake before fitting a spatial model
to these data? [4 MARKS]
iii. Write down a plausible data likelihood model for these data, making sure you
specify what elements should be included in the linear predictor.
[3 MARKS]
iv. How would you determine which spatial correlation model to fit to the data?
[2 MARKS]
v. What outputs would you produce from the model to answer the questions of
interest? [3 MARKS]
Total: 80 MARKS
END OF QUESTION PAPER.