Date: 2023-10-21
MUDT 5010 Practical Assignment #2
Due by Week 7 (October 26, 2023)
Assignment comments:
• You may do the analysis for this homework in any programming language/software of your
choosing, but it will be easiest to do in Python, since you can reuse the material from
the class demo.
• Make sure to label the x- and y-axes of your plots, so we know what you are plotting.
• Please send in your code, figures, and written responses through Moodle as a single
combined PDF file.
• For this assignment you will need to download the NYCsubway.csv file from Moodle
and load it into Python for analysis. This can be done using Google Colab, as detailed
underneath the file on Moodle.
Question #1: Exploratory Data Analysis
1. Plot the stations using (x, y) = (longitude, latitude), and colored by the total number of
entries in the first half of 2020. This will result in the same plot as in the demo code, but
without a log scale. What differences do you see in the plot from the plot shown in the
demo? What insight can you gain from this example that suggests when we should use a
logarithmic scale for visualizing data?
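As a minimal sketch of the plotting mechanics for item 1, the snippet below draws the same scatter plot twice with Matplotlib, once with the default linear color scale and once with `LogNorm`. The data are randomly generated stand-ins, and the column names `longitude`, `latitude`, and `entries` are assumptions; substitute the names actually used in NYCsubway.csv.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import LogNorm

# Synthetic stand-in for the station table (NOT the real data).
rng = np.random.default_rng(0)
stations = pd.DataFrame({
    "longitude": rng.uniform(-74.02, -73.90, size=50),
    "latitude": rng.uniform(40.70, 40.80, size=50),
    "entries": rng.lognormal(mean=12, sigma=1.5, size=50),
})

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
panels = [(axes[0], None, "Linear color scale"),
          (axes[1], LogNorm(), "Logarithmic color scale")]
for ax, norm, title in panels:
    # c= sets the value mapped to color; norm= chooses linear vs. log mapping.
    points = ax.scatter(stations["longitude"], stations["latitude"],
                        c=stations["entries"], norm=norm, cmap="viridis")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title(title)
    fig.colorbar(points, ax=ax, label="Total entries")
fig.tight_layout()
```

In a notebook, the figure can be displayed with `plt.show()` or written to disk with `fig.savefig(...)`.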
2. There are 11 neighborhoods in this dataset. For each neighborhood, please:
(a) Report the mean and standard deviation of the number of entries over the whole
time period.
(b) Report the mean and standard deviation of the number of exits over the whole time
period.
The standard deviations you’ve reported should be either similar in magnitude to the
corresponding means, or even larger. Given these large levels of uncertainty, what could be
a reasonable next step you could take in your analysis to understand what the distributions
of entries and exits look like for each neighborhood? (There are many possible answers.)
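One possible way to compute the per-neighborhood statistics in item 2 is a pandas `groupby`. The sketch below uses a tiny made-up table, and the column names `Neighborhood`, `Entries`, and `Exits` are assumptions rather than the real schema of the dataset.

```python
import pandas as pd

# Tiny made-up table; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Entries": [100, 200, 300, 10, 20, 60],
    "Exits": [90, 210, 330, 5, 25, 30],
})

# One row per neighborhood, with the mean and sample standard deviation
# of both entries and exits.
summary = df.groupby("Neighborhood")[["Entries", "Exits"]].agg(["mean", "std"])
print(summary)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default; a population standard deviation would need `ddof=0`.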
3. Choose any one neighborhood out of the 11, and for this neighborhood:
(a) Report the name of the neighborhood.
(b) Create a scatter plot of the number of entries (x-axis) and the number of exits (y-axis)
over the whole time period. Along with these points, plot the least-squares
regression line corresponding to these x and y values, labeled with its slope and
intercept.
(c) Suppose we have an incomplete dataset, and only know that there were 5000 entries
for a particular day within this neighborhood. Using your regression line, provide a
prediction of how many exits we would expect.
Now, take all the entries and exits across all 11 neighborhoods, and create the same plot,
with labeled regression line, and predict how many exits we would expect given 5000
entries and that we did not know which neighborhood or stop we were at. What is your
new prediction? In general, why is it important that we isolated only the
neighborhood-level data for our regression prediction, instead of combining all neighborhoods together?
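The regression and prediction steps in item 3 can be sketched with `numpy.polyfit`, where a degree-1 fit is the ordinary least-squares line. The numbers below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up (entries, exits) pairs for one neighborhood; NOT real data.
entries = np.array([1000.0, 2000.0, 3000.0, 4000.0, 6000.0])
exits = np.array([900.0, 1700.0, 2500.0, 3300.0, 4900.0])

# A degree-1 polynomial fit is the least-squares line:
#   exits ≈ slope * entries + intercept
slope, intercept = np.polyfit(entries, exits, deg=1)

# Evaluate the fitted line at 5000 entries.
predicted_exits = slope * 5000 + intercept

# This string can double as the legend label for the plotted line.
line_label = f"exits = {slope:.3f} * entries + {intercept:.1f}"
print(line_label)
print(f"predicted exits for 5000 entries: {predicted_exits:.0f}")
```

The same two lines of fitting code apply unchanged to the pooled data from all neighborhoods; only the input arrays differ.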
Question #2: Modelling Time Series of Subway Exits
1. Construct a dataset where each row corresponds to a date (in chronological order), each
column is a stop name, and the values correspond to the number of exits at that stop
name on each date. This is equivalent to the ‘prediction_df’ object in the demo code. For
this dataset:
(a) Plot a correlation matrix, where rows and columns represent stop names and the
values are the Pearson correlation coefficient between the time series of ‘Exit’ at each
pair of stops. This can be done, for example, using the ‘.corr()’ function in Pandas.
The plot should be in the form of a heatmap, which allows for easier visualization
of the level of correlation for each stop pair’s exit time series. In Python, heatmaps
can be created using the Matplotlib function ‘imshow()’, but there are other, nicer
looking alternatives you may want to try as well. Make sure to have a colorbar
indicating the Pearson correlation values, but no need to label the ticks along each
axis. (You will get bonus points if you can find a nice way to display the stop name
corresponding to each axis tick!)
(b) Report the pair of non-identical stops with the highest Pearson correlation value,
and report the value of the correlation.
(c) Report the pair of stops with the lowest Pearson correlation value, and report the
value of the correlation.
Given your heatmap, do you think it makes sense for us to predict the time series for
each stop separately (as we did in the demo), or should we use a multivariate time series
analysis technique instead? Multivariate time series methods will predict the future values
for all stops at once, while accounting for correlations between the series. An example of
such a technique is VARIMA, which is an extension of ARIMA.
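A minimal sketch of the data reshaping and pair search in item 1 follows. It assumes the raw data are in long format with columns named `Date`, `Stop Name`, and `Exits` (hypothetical names), and it uses three made-up stops rather than the real ones.

```python
import numpy as np
import pandas as pd

# Made-up long-format records; column names are assumptions.
records = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02",
                            "2020-01-03", "2020-01-04"] * 3),
    "Stop Name": ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4,
    "Exits": [10, 20, 30, 40, 12, 19, 33, 41, 40, 30, 20, 10],
})

# Rows = dates in chronological order, columns = stops, values = exits.
wide = records.pivot_table(index="Date", columns="Stop Name",
                           values="Exits", aggfunc="sum").sort_index()
corr = wide.corr()  # Pearson correlation is the pandas default

# Look only at the strict upper triangle, so each pair appears once
# and the diagonal (a stop correlated with itself) is excluded.
values = corr.to_numpy()
rows, cols = np.triu_indices_from(values, k=1)
pair_values = values[rows, cols]
hi, lo = pair_values.argmax(), pair_values.argmin()

stops = corr.columns
highest_pair = (stops[rows[hi]], stops[cols[hi]])
lowest_pair = (stops[rows[lo]], stops[cols[lo]])
print("highest:", highest_pair, pair_values[hi])
print("lowest: ", lowest_pair, pair_values[lo])
```

The `corr` DataFrame can be passed directly to a heatmap function, and `corr.columns` supplies the stop names for the tick labels.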
2. Using the lecture demo code for the ARIMA model, produce a prediction plot for the ‘1
Av’ stop’s future ‘Exit’ time series for 14 days into the future, with p = 5 and q = d = 1
as the ARIMA model parameters. Why do the results look so much worse for p = 5 than
for p = 7, which we explored in class?
There are two primary ways one can systematically pick the best model for a set of data
among a candidate set of models, a task called model selection. The first method
is to see how well each model can predict future or unknown data (“predictive” criterion
for model selection). The second method is to see how well each model fits the known
data, while penalizing models that are overly complex (“compressive” criterion for model
selection). We will explore both techniques using the ARIMA model and the ‘Exit’ values
for the ‘1 Av’ stop (as in the previous question).
3. Set d = 1, and vary the parameters p and q within the ranges p = 2, 5, 7, 14 and q =
2, 5, 7, 14. This will result in 16 different ARIMA models, with parameter combinations
(p, d, q) = (2, 1, 2), (2, 1, 5), (2, 1, 7), (2, 1, 14), (5, 1, 2), (5, 1, 5), ..., representing all the
different combinations of the parameters p and q within these ranges. We will search over
these 16 models to determine which provides the best model of our data using the two
methods described above. For each of these 16 models, compute the RMSE prediction
error for the 14-day-ahead forecast, and list your results in the form (p, d, q) : RMSE value.
An example for (p, d, q) = (7, 1, 1) is already given in the lecture demo code. Which model
is the best according to this predictive criterion?
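The search loop in item 3 can be sketched as follows. This assumes the forecast is scored against the last 14 observed days, held out from fitting, which may differ from the setup in the lecture demo. To keep the sketch runnable without extra packages, the forecaster is a labeled placeholder and not an ARIMA model; the function names and the synthetic series are all hypothetical.

```python
from itertools import product

import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def placeholder_forecast(train, order, horizon):
    """Stand-in forecaster (NOT ARIMA): repeats the mean of the last p values.

    With statsmodels, one could instead do roughly:
        from statsmodels.tsa.arima.model import ARIMA
        return ARIMA(train, order=order).fit().forecast(steps=horizon)
    """
    p, _d, _q = order
    return np.full(horizon, np.mean(train[-p:]))


def holdout_rmse_by_order(series, orders, horizon, forecast_fn):
    """Hold out the last `horizon` points and score each order on them."""
    series = np.asarray(series, dtype=float)
    train, test = series[:-horizon], series[-horizon:]
    return {order: rmse(test, forecast_fn(train, order, horizon))
            for order in orders}


# All 16 (p, d, q) combinations with d fixed at 1.
orders = [(p, 1, q) for p, q in product([2, 5, 7, 14], repeat=2)]

# Synthetic series with a weekly pattern; NOT the real '1 Av' data.
rng = np.random.default_rng(0)
days = np.arange(120)
series = 1000 + 200 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 20, days.size)

scores = holdout_rmse_by_order(series, orders, horizon=14,
                               forecast_fn=placeholder_forecast)
for order, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{order}: {value:.2f}")
best_order = min(scores, key=scores.get)
```

Because the placeholder ignores q, models that share the same p receive identical scores here; swapping in a real ARIMA fit removes that artifact.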
4. This question will explore the second model selection technique by computing the Bayesian
Information Criterion (BIC) of the same set of ARIMA models. Compute the BIC of each
of the 16 models analyzed in the previous question. An example for (p, d, q) = (7, 1, 1)
is already given in the lecture demo code. Which model is the best according to this
compressive criterion?
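For reference, the standard definition of the BIC is given below for a model with k estimated parameters, fitted to n observations, with maximized likelihood value \hat{L}. Lower values are preferred. How k is counted for an ARIMA model (for example, whether the noise variance is included) depends on the software, so values from different packages may not be directly comparable.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

The first term penalizes model complexity and the second rewards goodness of fit, which is why increasing p or q does not automatically lower the BIC.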
Extra Credit Challenge Problems (Optional)
1. Use a Recurrent Neural Network (RNN) to produce a 14-day-ahead forecast of the exit
time series values for the ‘1 Av’ stop in Question #2. This can be done using any number
of Python packages, and there are multiple great tutorials online for this, for example at
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
or https://www.tensorflow.org/tutorials/structured_data/time_series.
Produce a forecast plot similar to those shown in the
lecture demo and in the homework. What is your RMSE? Were you able to beat the
ARIMA models in Question #2 according to this predictive model selection criterion?
2. Use a VARIMA model to produce a 14-day-ahead forecast of all the exit time series shown
in the lecture demo (94 series in total, corresponding to the 94 subway stops). What is
the average RMSE you obtain across the individual time series? Did this multivariate
prediction method improve your prediction error?