Programming assignment sample: STATS 786 | 学霸联盟

Date: 2022-05-02

STATS 786
THE UNIVERSITY OF AUCKLAND
SEMESTER 1, 2021
Campus: Offshore Online
STATISTICS
Time Series Forecasting for Data Science
Midterm - Test
(Time allowed: ONE Hour)
NOTE: * Answer all parts of all questions
* Refer to the appendix (pages 8–10) for the necessary figures
* Open book examination
* Calculators are permitted
* Total marks 65
Instructions
The length of the test includes an additional 30 minutes (to allow for reading time,
the additional complexity of the online mode, and submission). You get one hour
for answering the questions and an extra 30 minutes for uploading your files.
You must submit your final answers before the due time, so do not leave submitting
until just before the due time; make sure you allow time for submission.
Test answers will not be accepted after the end of this extra 30 minute period.
If you encounter computer/internet/other issues during the test that affect your
ability to work on or submit your test answers please contact the lecturer via email
(s.wickramasuriya@auckland.ac.nz).
We STRONGLY recommend you download your submitted document from Canvas,
after submitting it, to verify you have uploaded the correct document. It is
your responsibility to check you have submitted the correct document.
It is your responsibility to ensure your test is successfully submitted on time. Please
don’t leave it until the last minute to submit your test.
Academic Honesty Declaration
By completing this assessment, I agree to the following declaration: I understand the
University expects all students to complete coursework with integrity and honesty. I
promise to complete all online assessment with the same academic integrity standards
and values. Any identified form of poor academic practice or academic misconduct will
be followed up and may result in disciplinary action. As a member of the University’s
student body, I will complete this assessment in a fair, honest, responsible and trustworthy
manner.
This means that:
* I declare that this assessment is my own work.
* I will not seek out any unauthorised help in completing this assessment.
* I am aware the University of Auckland may use plagiarism detection tools to check
  my content.
* I will not discuss the content of the assessment with anyone else in any form,
  including Canvas, Piazza, Facebook, Twitter or any other social media or online
  platform, within the assessment period.
* I will not reproduce the content of this assessment anywhere in any form at any time.
* I declare that I generated the calculations and data in this assessment independently,
  using only the tools and resources defined for use in this assessment.
* I will not share or distribute any tools or resources I developed for completing this
  assessment.
1 Run the following code in R.
# Use your student ID as the seed
set.seed(2021)
sample(letters[1:6], 3, replace = FALSE)
Use the output from the above R code to select the statements that you need to
answer from the list given below. State whether they are true or false. You MUST
provide reasoning for your answer.
a There is something wrong with my forecasts because they take the same value
for all forecast horizons.
False. The forecasts for the given series might have been obtained using a
naive method, so it is possible for the forecasts to take the same value at all
forecast horizons.
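A minimal base-R sketch of why flat forecasts are not necessarily a mistake. The series `y` is made up for illustration; the mean method is included alongside the naive method as a second example of a method whose forecasts are constant across horizons.

```r
# Made-up series purely for illustration (not data from the test).
y <- c(12, 15, 14, 18, 17)
h <- 4  # number of forecast horizons

# Naive method: every horizon is forecast by the last observed value.
naive_fc <- rep(tail(y, 1), h)

# Mean method: every horizon is forecast by the sample mean.
mean_fc <- rep(mean(y), h)

print(naive_fc)
print(mean_fc)
```

Both vectors contain a single repeated value, which is exactly the behaviour described in statement a.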
b I should always choose the regression model with the smallest sum of squared
errors for obtaining predictions.
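False. As we add more and more variables to the regression model, the
sum of squared errors never increases, so this rule always favours the largest
model. It is better to choose the model based on an information-theoretic
criterion such as AIC or BIC, which penalises the number of parameters.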
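The point can be checked with a small simulation. This is only a sketch on simulated data: the predictors `z1` to `z5` are pure noise that we add one at a time, and all variable names are hypothetical.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 100
x1 <- rnorm(n)
y <- 2 + 3 * x1 + rnorm(n)            # only x1 is truly related to y
noise <- matrix(rnorm(n * 5), n, 5)   # five irrelevant predictors
colnames(noise) <- paste0("z", 1:5)
dat <- data.frame(y = y, x1 = x1, noise)

# Nested models: x1 alone, then x1 + z1, x1 + z1 + z2, and so on.
preds <- c("x1", paste0("z", 1:5))
fits <- lapply(seq_along(preds), function(k) {
  lm(reformulate(preds[1:k], response = "y"), data = dat)
})

sse <- sapply(fits, deviance)  # residual sum of squares of each fit
aic <- sapply(fits, AIC)

print(data.frame(n_predictors = seq_along(preds), sse = sse, aic = aic))
```

The `sse` column cannot increase as predictors are added, even though the extra predictors are noise, whereas AIC adds a penalty for each parameter and so need not keep falling.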
c Prediction intervals are not very important because most people want the point
forecasts.
False. Point forecasts carry limited information unless the uncertainty
associated with them is quantified. Prediction intervals are one way to quantify
the uncertainty in forecasts. They are also useful for making decisions about
possible future outcomes.
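As an illustration, the sketch below computes approximate 95% prediction intervals for naive forecasts. It assumes uncorrelated, normally distributed residuals, under which the h-step standard deviation for the naive method is the residual standard deviation times the square root of h. The series is made up, and the residual standard deviation is a simple estimate.

```r
# Made-up series purely for illustration.
y <- c(12, 15, 14, 18, 17, 19, 16, 20)
h <- 1:4

# Naive point forecasts: the last observation at every horizon.
point_fc <- rep(tail(y, 1), length(h))

# Residuals of the naive method are the first differences of the series.
sigma_hat <- sqrt(mean(diff(y)^2))

# 95% interval under normality: point forecast +/- z * sigma * sqrt(h).
z <- qnorm(0.975)
lower <- point_fc - z * sigma_hat * sqrt(h)
upper <- point_fc + z * sigma_hat * sqrt(h)

print(data.frame(h, point_fc, lower, upper))
```

The point forecast is identical at every horizon, but the interval widens with h, which is information the point forecast alone does not convey.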
d A time series cross-validation based on a rolling forecast origin is better than
a simple test set for comparing forecast methods.
True. A time series cross-validation based on a rolling forecast origin provides
many out-of-sample comparisons, whereas a simple test set allows for only
one. However, when we have a limited number of observations we won’t be
able to perform cross-validation effectively; in such cases we can use a
simple test set.
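A rough base-R sketch of one-step-ahead cross-validation with a rolling forecast origin, comparing the naive and mean methods on a simulated random walk. The minimum training length `min_train` is an arbitrary choice for this example.

```r
set.seed(2)  # arbitrary seed
y <- cumsum(rnorm(60))   # simulated random walk, for illustration only
min_train <- 20          # smallest training set (assumed for this sketch)

# Each origin i uses y[1:i] for training and y[i + 1] for evaluation.
origins <- min_train:(length(y) - 1)

# One-step forecast errors at every origin.
err_naive <- sapply(origins, function(i) y[i + 1] - y[i])
err_mean  <- sapply(origins, function(i) y[i + 1] - mean(y[1:i]))

rmse <- c(naive = sqrt(mean(err_naive^2)),
          mean  = sqrt(mean(err_mean^2)))
print(rmse)
print(length(origins))  # number of out-of-sample comparisons per method
```

Here each method is evaluated at 40 forecast origins, whereas a single train/test split would give one comparison per forecast horizon.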
e A white noise series has zero mean and constant autocovariance.
False. A white noise series (say w_t) has zero mean and constant variance, and is
uncorrelated at times t and t+h, where h ≠ 0. This implies that the autocovariance
is non-zero at lag 0 and zero elsewhere. Therefore, the statement that a white
noise series has “constant autocovariance” is wrong.
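Written as a formula, with $\sigma_w^2$ denoting the variance of the white noise (notation introduced here, not in the test), the autocovariance function is:

```latex
\gamma_w(h) = \operatorname{Cov}(w_t, w_{t+h}) =
\begin{cases}
  \sigma_w^2, & h = 0, \\
  0,          & h \neq 0.
\end{cases}
```

The autocovariance depends on the lag h, so it is not constant unless $\sigma_w^2 = 0$.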
f Linear regression models are simplistic because the real world is nonlinear.
True. Linear models are simplistic; however, that doesn’t imply that they are
less useful in practice. Often simple approximations to reality work well. If the
nonlinearity is mild and we don’t have enough data to capture it, then linear
models can be useful.
[Total: 15 marks]
2 This question attempts to analyze the effect of temperature and pollution level on
weekly cardiovascular mortality in one of the states in the US.
Note: Please refer to the appendix on page 8 for the necessary figures.
Figure 1 shows the time plots for average weekly cardiovascular mortality, temper-
ature, and particulate pollution level over ten years. Figure 2 shows a scatter plot
matrix of mortality and the two predictor variables.
a Briefly describe the main features that you can observe from Figures 1 and 2.
[5 marks]
All of the series show strong seasonal components.
There is a downward trend in cardiovascular mortality.
There is a positive linear relationship between mortality and pollution level
and a somewhat nonlinear relationship with temperature.
There is no prominent relationship between temperature and pollution level.
b Let $M_t$ denote cardiovascular mortality, $T_t$ the temperature, and $P_t$
the particulate level at time $t$. One of the students in the class
suggested fitting the following four models:
$M_t = \beta_0 + \beta_1 t + \varepsilon_t$, (M1)
$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + \varepsilon_t$, (M2)
$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + \beta_3 (T_t - \bar{T})^2 + \varepsilon_t$, (M3)
$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + \beta_3 (T_t - \bar{T})^2 + \beta_4 P_t + \varepsilon_t$, (M4)
where $\bar{T}$ denotes the mean temperature. Explain briefly why the student has
suggested fitting these four models.
[4 marks]
i M1: There is a downward trend in the mortality series with time.
ii M2: We assume that temperature is linearly related to mortality.
iii M3: The relationship between temperature and mortality appears to be
somewhat quadratic.
iv M4: Pollution level has a linear relationship with mortality.
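The four specifications can be sketched with `lm()` as follows. The data below are simulated stand-ins, not the mortality data from the test, and the names `mort`, `temp`, `part` and `t_idx` are hypothetical. The sketch also shows one common reason for centring temperature, which the test does not state: it reduces the correlation between the linear and quadratic terms.

```r
set.seed(3)  # arbitrary seed
n <- 200
t_idx <- 1:n

# Simulated weekly series with annual seasonality (illustrative only).
temp <- 70 + 10 * sin(2 * pi * t_idx / 52) + rnorm(n, sd = 3)
part <- 50 + 15 * cos(2 * pi * t_idx / 52) + rnorm(n, sd = 5)
temp_c <- temp - mean(temp)  # centred temperature, T_t - T-bar
mort <- 100 - 0.03 * t_idx - 0.4 * temp_c + 0.02 * temp_c^2 +
  0.25 * part + rnorm(n, sd = 5)
dat <- data.frame(mort, t_idx, temp_c, part)

# M1-M4: trend, + temperature, + squared temperature, + particulates.
m1 <- lm(mort ~ t_idx, data = dat)
m2 <- lm(mort ~ t_idx + temp_c, data = dat)
m3 <- lm(mort ~ t_idx + temp_c + I(temp_c^2), data = dat)
m4 <- lm(mort ~ t_idx + temp_c + I(temp_c^2) + part, data = dat)

# Correlation between linear and quadratic terms, raw versus centred.
print(c(raw = cor(temp, temp^2), centred = cor(temp_c, temp_c^2)))

# Information criteria for the four fits.
models <- list(M1 = m1, M2 = m2, M3 = m3, M4 = m4)
print(sapply(models, AIC))
print(sapply(models, BIC))
```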
c Summary statistics for M1–M4 are given in Table 1. Among these models,
which one do you select as the best model? Briefly give reasons for your
selection. Interpret the value of $\bar{R}^2$.
[5 marks]
I would choose M4 as the best model because its AIC and BIC are the lowest,
and the drop in both AIC and BIC between M3 and M4 is quite large. The model
explains approximately 60% of the variation present in the mortality series
(adjusted $R^2$ = 0.592).
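The comparison can be reproduced from the values reported in Table 1. The data frame below simply re-enters those numbers; nothing is refitted.

```r
# Values copied from Table 1 of the test.
stats <- data.frame(
  model  = c("M1", "M2", "M3", "M4"),
  adj_r2 = c(0.209, 0.378, 0.445, 0.592),
  aic    = c(2224, 2103, 2047, 1891),
  bic    = c(2237, 2120, 2068, 1916)
)

# Model with the smallest AIC and the smallest BIC.
best_aic <- stats$model[which.min(stats$aic)]
best_bic <- stats$model[which.min(stats$bic)]

# Improvement from M3 to M4.
delta_aic <- stats$aic[3] - stats$aic[4]
delta_bic <- stats$bic[3] - stats$bic[4]

print(c(best_aic = best_aic, best_bic = best_bic))
print(c(delta_aic = delta_aic, delta_bic = delta_bic))
```

Both criteria select M4, with AIC falling by 156 and BIC by 152 relative to M3.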
Table 1: Summary statistics for models M1–M4.
Model  $\hat{\sigma}^2$  $\bar{R}^2$  AIC   BIC
M1     79.1              0.209        2224  2237
M2     62.2              0.378        2103  2120
M3     55.5              0.445        2047  2068
M4     40.8              0.592        1891  1916
d Figure 3 shows the residual diagnostics for the best model chosen from M1–M4.
What conclusions can you draw from these plots?
[3 marks]
The residuals vary around zero with approximately constant variance.
The first six residual autocorrelations are statistically significant. Therefore,
the fitted model doesn’t capture the inherent autocorrelation structure
well, and we should consider ways of improving the fitted model in light of
this.
The histogram of the residuals is slightly skewed.
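A visual check of the ACF can be complemented with a portmanteau test. The sketch below is an illustration on simulated data, not the model from the test: it fits a regression whose errors follow an AR(1) process and applies the Ljung-Box test from base R to the residuals. The choice of 10 lags is arbitrary.

```r
set.seed(4)  # arbitrary seed
n <- 200
x <- rnorm(n)

# Regression errors with AR(1) autocorrelation (simulated).
e <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
y <- 1 + 2 * x + e

fit <- lm(y ~ x)
res <- residuals(fit)

# Sample autocorrelations of the residuals (printed, not plotted).
print(acf(res, lag.max = 10, plot = FALSE))

# Ljung-Box test of the null hypothesis of no residual autocorrelation.
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
print(lb)
```

A small p-value indicates that the residuals are autocorrelated, which is the same conclusion drawn from the significant spikes in the ACF panel of Figure 3.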
[Total: 17 marks]
3 The revenue-domestic-flights.csv file contains information about monthly
revenue from domestic flights in the US from 1979–2000.
a Read the file into R and convert it to a tsibble object.
[3 marks]
b Plot the revenue series and comment briefly on the main features of the data.
[2 marks]
c Do you think a Box-Cox transformation is useful for this time series? Briefly
give reasons for your answer.
[3 marks]
d Mention at least four forecasting methods that are most appropriate for this
series.
[4 marks]
e Using the last 2 years of data as the test set, fit the methods that you suggested
in part d (you may transform the original series based on your answer to part
c).
[11 marks]
f Obtain the forecasts for 2 years.
[2 marks]
g Compare the accuracy of your forecasts from different methods against the
test set.
[2 marks]
h Which method does best? Justify your selection.
[3 marks]
i Plot the point forecasts from the best method along with the 95% prediction
interval.
[3 marks]
[Total: 33 marks]
library(tidyverse)
library(fpp3)

# a: read the file and convert it to a tsibble indexed by month
revenue <- read_csv("revenue-domestic-flights.csv")
revenue <- revenue %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)

# b: time plot of the revenue series
revenue %>%
  autoplot(Revenue)
# The series has trend and seasonality.

# c
# Yes, the variation changes in proportion to the level of the series.

# d
# SNAIVE / SNAIVE + drift / TSLM / STL

# e: keep data up to 1998 for training; the last 2 years form the test set
train_revenue <- revenue %>%
  filter(year(Month) <= 1998)

# Box-Cox parameter chosen by the Guerrero method on the training data
lambda <- train_revenue %>%
  features(Revenue, features = guerrero) %>%
  pull(lambda_guerrero)

# Check that the transformation stabilises the variance
train_revenue %>%
  autoplot(box_cox(Revenue, lambda = lambda)) +
  ylab("") + ggtitle("Transformed revenue")

# Fit the candidate methods to the training data
fit <- train_revenue %>%
  model(
    snaive_drift = SNAIVE(box_cox(Revenue, lambda = lambda) ~ drift()),
    snaive = SNAIVE(Revenue),
    tslm = TSLM(box_cox(Revenue, lambda = lambda) ~ trend() + season()),
    dcmp = decomposition_model(
      STL(box_cox(Revenue, lambda = lambda) ~ season(window = Inf)),
      RW(season_adjust)),
    dcmp_drift = decomposition_model(
      STL(box_cox(Revenue, lambda = lambda) ~ season(window = Inf)),
      RW(season_adjust ~ drift()))
  )
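
# f and g: forecast 2 years ahead and compare against the full series,
# so that accuracy is computed on the held-out test period
fit %>%
  forecast(h = "2 years") %>%
  accuracy(revenue)

# h
# The seasonal naive with drift method does best, as it has the lowest
# error measures on the test set.

# i: point forecasts from the best method with a 95% prediction interval
fit %>%
  select(snaive_drift) %>%
  forecast(h = "2 years") %>%
  autoplot(train_revenue, level = 95)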
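To make the reasoning in part c concrete, here is a small base-R sketch of the textbook Box-Cox transformation for positive data. The helper names `box_cox_simple()` and `inv_box_cox_simple()` are hypothetical and written for this illustration; the `box_cox()` function used in the solution above may treat edge cases differently. The series is simulated so that its seasonal swings grow in proportion to its level.

```r
# Textbook Box-Cox transform for strictly positive y (illustrative helper).
box_cox_simple <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, mapping back to the original scale.
inv_box_cox_simple <- function(w, lambda) {
  if (abs(lambda) < 1e-8) exp(w) else (lambda * w + 1)^(1 / lambda)
}

# Simulated monthly series: exponential growth with multiplicative seasonality.
t_idx <- 1:120
y <- exp(0.02 * t_idx) * (1 + 0.2 * sin(2 * pi * t_idx / 12))

spread <- function(v) diff(range(v))  # within-year spread

# Raw scale: the spread in the last year is much larger than in the first.
print(c(first_year = spread(y[1:12]), last_year = spread(y[109:120])))

# Log scale (lambda = 0): the spread is the same in both years.
w <- box_cox_simple(y, lambda = 0)
print(c(first_year = spread(w[1:12]), last_year = spread(w[109:120])))
```

When the variation grows with the level, as in this simulated series, a transformation with lambda near zero evens it out, which is the motivation for the answer to part c.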
Appendix
[Figure 1 image not reproduced. Panels: Mortality, Particulate, Temperature; x-axis: Week.]
Figure 1: Time plots for average weekly cardiovascular mortality, temperature, and
particulate pollution level.
[Figure 2 image not reproduced. Variables: Mortality, Temperature, Particulate.
Reported correlations: Mortality and Temperature -0.439***; Mortality and
Particulate 0.444***; Temperature and Particulate -0.017.]
Figure 2: A scatter plot matrix of mortality and the two predictor variables.
[Figure 3 image not reproduced. Panels: residuals (.resid) against index; ACF
against lag [1W]; histogram of .resid (count).]
Figure 3: Residual diagnostic plots.