Study of Time Series Model for the Employee
Demand
Presented By:
Tina Yu
Instructor: Professor Raya Feldman
PSTAT 274: Time Series
Fall 2021
Dec 4, 2021
TABLE OF CONTENTS
Executive Summary 2
1.0 Introduction 2
2.0 Data Model 3
2.1 Description 3
2.2 Data Analysis 4
2.3 Transformation and Decomposition 6
2.4 Differencing 9
2.5 Model selection 10
2.6 Check of stationarity and invertibility 12
3.0 Diagnostic Checking 12
3.1 Model 1 Checking 12
3.2 Model 2 Checking 15
4.0 Spectral Analysis 17
5.0 Final Model 18
6.0 Forecast 18
7.0 Conclusion 20
Reference 21
Appendix A: R Code 22
Executive Summary
In this project, I picked the monthly record of employee demand in wholesale and retail
stores in Wisconsin from 1961 to 1975, taken from the Time Series Data Library and
collected by O-Donovan (1983). The purpose of this project is to find the best-fitting time
series model and then use that model to predict future monthly employee demand.
This project consists of two parts: model selection and forecasting. First, I separated the
data into two parts, a training dataset used to build the model and a test dataset used to
compare against my predictions. To obtain a stationary time series, I applied a Box-Cox
transformation to stabilize the variance and differencing to remove trend and seasonality.
Then, using the ACF, the PACF, and the AICC values, I chose the P, Q, p, and q values that
minimize the AICC. As a result, I picked two candidate models for diagnostic
checking.
As the result of the diagnostic checking, both model 1 and model 2 pass the tests. However, I
decided to use model 1, SARIMA(0, 1, 0) × (0, 1, 1)₁₂, as it has the smallest AICC value.
Last, I used the model to forecast a one-year period of employee demand, then compared my
results with the test data, which hold the true values. Since the true values are inside the
confidence intervals of the forecasts, I concluded that my model is accurate and reasonable
for forecasting the future employee demand trend.
1.0 Introduction
For this project, I chose to use a dataset from the Time Series Data Library (TSDL). The
tsdl library consists of 648 datasets belonging to 22 different subjects. Having taken some
economics courses in previous quarters, I was drawn to the subject named Sales, where I
learned that forecasting sales trends is very important in the market. The data I chose are a
monthly record of employee demand in wholesale and retail stores in Wisconsin from 1961 to
1975, collected by O-Donovan (1983). I was interested in forecasting the future monthly
employee demand in these stores with a time series model, which is significantly important
for both workers and companies.
A company in the market may be interested in how employee demand evolves. Knowing
when the market needs more workers, it can plan to hire the necessary employees while
keeping time and financial costs low. Proper planning and employee placement also maintain
effective use of a company's human resources; the most important reason to forecast future
employee demand is to prevent shortages of people where and when they are needed most.
An employee, in turn, may be interested in when the market offers more positions and how
competitive the environment is, and can adjust their job search to fit the needs of the market.
Hence, predicting future employee demand is important for both employees and employers.
I divided the data into two parts: the training dataset, consisting of the earlier years' data
and used to build the model that forecasts the following one-year trend, and the test dataset,
used to compare against my predictions in order to check whether the model is accurate.
The main tool I used for the analysis is R Markdown in RStudio. First, I transformed the
training data with a Box-Cox transformation to stabilize the variance. Then, I differenced the
transformed data at lags 12 and 1, which removed the seasonality and trend. Next, by looking
at the ACF and PACF, I conjectured that P = Q = 1, so I computed AICC values for P, Q, p,
and q each equal to 0 or 1. From this step, I chose two models for further analysis.
By diagnostic checking and the AICC value, the final model I chose is
SARIMA(0, 1, 0) × (0, 1, 1)₁₂.
Then, by comparing the predictions with the test dataset, I can see that the test values fall
within the CIs of my predictions, which shows that my final model is accurate and reasonable.
2.0 Data Model
2.1 Description
The data I decided to use come from the Time Series Data Library (TSDL), which was
created by Rob Hyndman, Professor of Statistics at Monash University, Australia, and were
collected by O-Donovan (1983). The dataset is a monthly record of employee demand in
wholesale and retail stores in Wisconsin. Monthly data over the period 1961-1975 were
available for analysis, giving a sample size of 178 months.
2.2 Data Analysis
The first step of my project is plotting the time series data to see if it is stationary.
According to Figure 2.2.1, the data has a linear increasing trend with the seasonal components,
which indicates that the data is not a stationary time series.
Figure 2.2.1 Time series plot of monthly employee demand
Then, for the convenience of model selection and forecasting, I separated the dataset into
two parts: a training dataset and a test dataset. The training dataset is used to build the
model; the test dataset is used to compare against the future forecasts. The test dataset
consists of the last 12 months of data, a one-year interval taken from the end of the original
dataset, and the training dataset consists of the first 166 months. I then plotted the training
data again with the mean line in blue and the fitted trend line in red, giving the new time
series plot in Figure 2.2.2.
Figure 2.2.2 Time series plot of past monthly employee demand (excluding the last 12 months)
As we can see from Figure 2.2.2, the variance is not constant from 1961 to 1974, although
there are no sharp changes in behavior or outlying observations. There is also a clear linear
increasing trend over time. In addition, according to the histogram in Figure 2.2.3, the data
are not normally distributed, and the ACF and PACF in Figure 2.2.4 show clear seasonality.
Given all these features, the data form a nonstationary time series, which is why I considered
transformation and differencing to generate a stationary series.
Figure 2.2.3 Histogram of Monthly Employee Demand
Figure 2.2.4 ACF and PACF of Employee Demand
2.3 Transformation and Decomposition
Because the time series is nonstationary, I chose to transform the data to stabilize the
variance, considering both the Box-Cox transformation and the log transformation. From
Figure 2.3.1, the lambda I obtained is 0.1818182, which is not exactly 0; however, since 0 is
included in the 95% confidence interval for lambda, I compared the Box-Cox transformation
against the log transformation. The Box-Cox transformation is calculated by the following
formula:

Y_t = (X_t^λ - 1) / λ.

Similarly, I used Y_t = ln(X_t) for the log transformation.
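The two transformations above can be sketched in a few lines. This is an illustrative Python version (the report's own analysis uses R); the input value 150.0 is an arbitrary example, not a value from the dataset.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, which tends to ln(x) as lam -> 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# With the lambda estimated in the report:
lam = 0.1818182
y = box_cox(150.0, lam)  # 150.0 is an arbitrary demand value for illustration

# As lambda approaches 0 the transform approaches the log transform,
# which is why comparing the two transformations is natural here.
assert abs(box_cox(150.0, 1e-9) - math.log(150.0)) < 1e-4
```

The limit behavior is the reason 0 being inside the confidence interval for lambda suggests also trying the log transformation.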
Figure 2.3.1 95% Confidence interval of lambda
Comparing the histograms after the transformations (Figure 2.3.2), the histogram of the
Box-Cox transformed data is closer to a normal distribution than that of the log transformed
data, so I decided to select the Box-Cox transformation. The variance of the original dataset
is about 1830.85, while the variance after applying the Box-Cox transformation is about
0.1610, which is much more stable. The stability of the variance after transformation is also
visible in the new time series plot (Figure 2.3.2).
Figure 2.3.2 Histogram and time series plot of Log Transformation and Box-Cox Transformation
As we can see in Figure 2.3.3, which is the decomposition of the transformed data, the
data still has a positive linear trend with seasonality.
Figure 2.3.3 Decomposition of the monthly demand
In order to identify the seasonality, I used the periodogram and estimated the period as 1
divided by the peak frequency. As the data are recorded monthly, I expected the period to
be 12. The periodogram in Figure 2.3.4 shows a peak near frequency 0.08; applying the
formula gives 1/0.08 = 12.5, which is close to 12 and matches the structure of the data.
I then used this period to remove the seasonality from the data.
Figure 2.3.4 Periodogram of the monthly employee
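The period-estimation step can be illustrated with a small discrete Fourier transform. This pure-Python sketch uses a synthetic monthly series with an assumed 12-month cycle (not the actual Wisconsin data); the report itself reads the peak off R's periodogram.

```python
import math

def periodogram(x):
    """Naive DFT periodogram: intensity I(k) for k = 1..n//2."""
    n = len(x)
    out = []
    for k in range(1, n // 2 + 1):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        out.append((re * re + im * im) / n)
    return out

# Synthetic monthly series with a 12-month cycle (illustration only).
n = 120
x = [math.sin(2 * math.pi * t / 12) for t in range(n)]

I = periodogram(x)
k_peak = I.index(max(I)) + 1  # frequency index of the largest ordinate
period = n / k_peak           # period = 1 / frequency, with frequency = k_peak / n
print(period)                 # 12.0 for this synthetic series
```

The real data place the peak near frequency 0.08 rather than exactly 1/12, hence the 1/0.08 = 12.5 estimate rounded to period 12.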
2.4 Differencing
In order to remove the strong seasonality of the data, I first differenced at lag 12, the
period obtained from the periodogram. After that, I differenced the data at lag 1 once and
twice, and compared the histograms and the stability of the variance for each choice. The
result I was looking for is an approximately normal histogram together with the most stable
variance.
Figure 2.4.1 Histogram of transformed data after differencing at lag 12 and lag 12 and 1
According to Figure 2.4.1, the data differenced at lag 12 and lag 1 are approximately
normally distributed, while the data differenced at lag 12 only are left-skewed. Thus, I
believe that differencing at lags 12 and 1 is the best choice. In addition, looking at the
variance after differencing:
● The variance of Box-Cox Transformed data is 0.1610108,
● The variance of the data after differencing at lag 12 is 0.002009143,
● The variance of the data after differencing at lag 12, and lag 1 once is 0.0001822453,
● The variance of the data after differencing at lag 12, and lag 1 twice is 0.0003232859.
We can see that the smallest variance is the data after differencing at lag 12, and lag 1 once,
which is 0.0001822453, so it indicates again that the differencing at lag 12 and 1 is the best
choice.
I considered the ACF and PACF after differencing as well; the best result is the ACF and
PACF with the fewest lags exceeding the 95% confidence interval. Looking at the ACF and
PACF after differencing at lags 12 and 1 (Figure 2.4.2), only lag 12 exceeds the confidence
bounds. Altogether, we can decide that differencing at lags 12 and 1 is the best choice,
which means that D = 1 and d = 1 in the SARIMA model.
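The variance comparison above can be sketched as follows. The toy series here, a linear trend plus a 12-month cycle plus a slow drift, is an assumption for illustration only, not the Wisconsin data, but it reproduces the same ordering of variances.

```python
import math

def diff(x, lag):
    """Difference a series at the given lag: y[t] = x[t] - x[t - lag]."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

def variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

# Toy series: linear trend + 12-month seasonality + slow drift (illustration only).
x = [0.05 * t + 2 * math.sin(2 * math.pi * t / 12)
     + 0.5 * math.sin(2 * math.pi * t / 400) for t in range(166)]

v0 = variance(x)                        # trend + seasonality dominate
v12 = variance(diff(x, 12))             # lag-12 differencing removes the seasonality
v12_1 = variance(diff(diff(x, 12), 1))  # lag-1 differencing removes what remains

# Mirroring the report's comparison: differencing at lag 12 and then lag 1
# yields the smallest variance.
print(v12_1 < v12 < v0)  # True
```

Note that over-differencing can also inflate the variance, which is exactly why the report checks that differencing at lag 1 twice (variance 0.0003232859) is worse than differencing once (0.0001822453).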
Figure 2.4.2 ACF and PACF of transformed data after differencing at lag 12 and 1
2.5 Model selection
In this section, I need to choose the P, p, Q, and q values of the SARIMA model.
According to the ACF and PACF in Figure 2.4.2, both the ACF and PACF exceed the
confidence interval at lag 12, so I assume that P = Q = 1. I then ran a for loop letting P, Q,
p, and q each be 0 or 1 with D = d = 1, and calculated the AICC value of each candidate.
The purpose of this step is to select the model that minimizes the AICC.
Figure 2.5.1 AICC values for different P, Q, p, q
Figure 2.5.1 shows the resulting AICC values. From the ACF and PACF, we get P = Q = 1.
First, looking at the models with P = Q = 1 and D = d = 1:
● The first model has AICC = -886.0912, P = Q = p = q = 1,
● The second model has AICC = -888.1447, P = Q = q = 1, p = 0,
● The third model has AICC = -888.1706, P = Q = p = 1, q = 0,
● The last model has AICC = -889.2481, P = Q = 1, p = q =0.
Then, I checked the significance of the coefficients. For the first model, the coefficient
estimates are ar1 = 0.2179, ma1 = -0.1370, sar1 = 0.2505, and sma1 = -0.5389. The
formula I used for the 95% confidence interval is coefficient ± 1.96 × s.e. As a result,
the 95% confidence intervals of ar1, ma1, and sar1 include 0. After fixing these coefficients
to 0, I get a new AICC of -890.0702, which is smaller than the original AICC of -886.0912.
Thus, fixing these coefficients is helpful, and it reduces the model to Q = 1 with D = d = 1.
Similarly, after calculating the 95% confidence intervals, the other three models also reduce
to Q = 1 with D = d = 1.
Thus, since the goal is to minimize the AICC and every P = Q = 1 model improves when
reduced to Q = 1 only, I selected the following candidate models:
● The first model has AICC = -890.2943, Q = D = d = 1, P = p = q = 0, S = 12,
● The second model has AICC = -887.7621, P = D = d = 1, p = Q = q = 0, S = 12,
● The third model has AICC = -881.7315, p = D = d = 1, P = Q = q = 0, S = 12,
● The last model has AICC = -881.7612, q = D = d = 1, P = p = Q = 0, S = 12.
After calculating the 95% confidence intervals, we find that the confidence intervals of
the parameters of the third and last models include 0, which reveals that those parameters
are not significant. Hence, we end with two model selections, the first and the second
model:
● The first model has AICC = -890.2943, Q = D = d = 1, P = p = q = 0, S = 12, with sma1
= -0.3230,
● The second model has AICC = -887.7621, P = D = d = 1, p = Q = q = 0, S = 12, with sar1
= -0.2449.
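The significance rule used above, keeping a coefficient only when its 95% interval (coefficient ± 1.96 × s.e.) excludes 0, can be sketched as below. The standard errors here are hypothetical placeholders, since the report lists only the point estimates, so this illustrates the rule rather than reproducing the actual R output.

```python
def is_significant(coef, se, z=1.96):
    """A coefficient is significant at the 5% level when its 95% confidence
    interval (coef - z*se, coef + z*se) excludes 0."""
    lower, upper = coef - z * se, coef + z * se
    return not (lower <= 0.0 <= upper)

# Point estimates from the report's full model; the standard errors are
# made-up values for illustration only.
estimates = {
    "ar1":  (0.2179, 0.20),   # hypothetical s.e.
    "ma1":  (-0.1370, 0.20),  # hypothetical s.e.
    "sar1": (0.2505, 0.20),   # hypothetical s.e.
    "sma1": (-0.5389, 0.15),  # hypothetical s.e.
}

for name, (coef, se) in estimates.items():
    print(name, "significant" if is_significant(coef, se) else "not significant")
```

With these placeholder standard errors, only sma1 comes out significant, matching the report's conclusion that the seasonal MA term is the one worth keeping.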
2.6 Check of stationarity and invertibility
Next, I want to check whether the models determined above are invertible and stationary.
The first model contains only an MA part, and all MA models are stationary; since
|Θ₁| = 0.3230 < 1, the model is also invertible. Similarly, model 2 contains only an AR
part, so it is always invertible, and it is also stationary since |Φ₁| = 0.2449 < 1. Therefore,
both model 1 and model 2 are invertible and stationary.
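For these degree-one seasonal polynomials, the root condition (roots outside the unit circle) reduces to a simple magnitude check on the coefficient, which a short sketch makes explicit; the coefficient values are the ones reported above.

```python
def seasonal_ma1_invertible(theta):
    """A seasonal MA(1) polynomial (1 + theta * B^12) has all roots outside
    the unit circle, hence is invertible, exactly when |theta| < 1."""
    return abs(theta) < 1

def seasonal_ar1_stationary(phi):
    """A seasonal AR(1) polynomial (1 - phi * B^12) gives a stationary
    (causal) model exactly when |phi| < 1."""
    return abs(phi) < 1

# Model 1: sma1 = -0.3230; model 2: sar1 = -0.2449 (values from the report).
print(seasonal_ma1_invertible(-0.3230))  # True
print(seasonal_ar1_stationary(-0.2449))  # True
```

The reduction holds because the roots of 1 + θz¹² all share modulus |θ|^(-1/12), which exceeds 1 exactly when |θ| < 1.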
3.0 Diagnostic Checking
Furthermore, I need to check whether the models pass the diagnostic checks, which include
the ACF and PACF of the residuals, the Box-Pierce test, the Shapiro-Wilk test, the
Ljung-Box test, the McLeod-Li test, the histogram, and the Normal Q-Q plot.
3.1 Model 1 Checking
For the first model, I looked at the histogram (Figure 3.1.1), Normal Q-Q Plot (Figure
3.1.2), and Shapiro-Wilk normality test (Figure 3.1.3) first.
Figure 3.1.1 Histogram of the residuals of the model 1
Figure 3.1.2 Normal Q-Q Plot of the residuals of model 1
The histogram above is close to a normal distribution, as expected for white noise
residuals. Also, most of the points lie on the line in the Normal Q-Q plot above. Both
plots suggest that the residuals are normally distributed.
Figure 3.1.3 Shapiro-Wilk normality test of the residuals of the model 1
As we can see from the Shapiro-Wilk normality test above, the p-value is 0.1699, which
is greater than 0.05, so we fail to reject the hypothesis that the residuals are normally
distributed; the model passes the Shapiro-Wilk test.
Figure 3.1.4 ACF of the residuals and residuals^2 of model 1, PACF of the residuals of model 1
In addition, in the ACF and PACF of the residuals (Figure 3.1.4) above, no value exceeds
the confidence interval. The residual plot in Figure 3.1.5 shows no seasonality, no trend,
and no visible change of variance, and the sample mean is close to zero.
Figure 3.1.5 Time series plot of the residuals of model 1
Next, I used the Box-Pierce test, the Ljung-Box test, and the McLeod-Li test. According to
Figure 3.1.6, all p-values are greater than 0.05, so we fail to reject the null hypothesis that
the residuals are white noise, which means the model passes all three tests.
Figure 3.1.6 Box-Pierce test, Ljung-Box test, McLeod-Li test, and Yule-Walker estimation
The last check is the Yule-Walker estimation, which is used to determine whether the
residuals fit a white noise distribution. As we can see in Figure 3.1.6, the order selected
by the Yule-Walker AR fit is 0, implying an AR(0) model for the residuals, which is white
noise. Therefore, model 1 passes all tests, and it can be used to forecast future values.
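The portmanteau idea behind the Box-Pierce test can be sketched as follows: compute the residual sample autocorrelations and the statistic Q = n · Σ r_k², which is then compared against a chi-square quantile. This pure-Python sketch uses a deterministic stand-in for the residuals; the actual analysis used the test as implemented in R.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a series."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    return [sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / n / c0
            for k in range(1, max_lag + 1)]

def box_pierce(x, max_lag):
    """Box-Pierce statistic Q = n * sum(r_k^2); a large Q (relative to a
    chi-square quantile) indicates the series is not white noise."""
    return len(x) * sum(r * r for r in sample_acf(x, max_lag))

# Deterministic pseudo-noise standing in for model residuals (illustration only).
resid = [math.sin(1000.0 * t * t) for t in range(166)]
Q = box_pierce(resid, 12)
print(Q >= 0)  # True; in practice Q is compared to a chi-square quantile
```

A strongly autocorrelated series (e.g. a trend) produces a much larger Q than noise-like residuals, which is exactly what the p-values in Figure 3.1.6 quantify.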
3.2 Model 2 Checking
Similarly, I applied the same tests to model 2. According to Figure 3.2.1, no point in the
ACF or PACF of the residuals exceeds the confidence interval. From Figure 3.2.2, the
histogram displays a normal distribution, and most of the points are close to the line in the
Normal Q-Q plot. The residual plot shows no trend, seasonality, or sharp change of variance.
Also, all p-values are greater than 0.05, so model 2 also passes all tests, as shown in
Appendix A.
Figure 3.2.1 ACF and PACF of the residuals of model 2
Figure 3.2.2 Histogram, Normal Q-Q Plot, time series plot, and ACF of squared residuals of model 2
Since both model 1 and model 2 pass all tests, both models could be used to forecast the
future trend. In this case, I decided to pick model 1 as my final model since it has the
smaller AICC value.
4.0 Spectral Analysis
Last, I used spectral analysis to check whether the residuals stay within the confidence
interval. First, I used the Kolmogorov-Smirnov test on the cumulative periodogram to draw
the confidence bounds. According to Figure 4.0.1, no point exceeds the CI, which indicates
that the model is statistically adequate.
Figure 4.0.1 Kolmogorov-Smirnov test for model 1
Also, by checking the periodogram of the residuals, we can tell whether any periodic
component, such as a combination of sine and cosine functions, remains. As we can see from
Figure 4.0.2, the new periodogram for model 1, there is no significant peak in the residuals,
which also supports the accuracy of the model.
Figure 4.0.2 Periodogram of model 1
The last tool I used is Fisher's test, which checks for any unspecified periodic trend in the
model. Since the resulting p-value of 0.8131371 is greater than 0.05, we fail to reject the
white noise hypothesis.
5.0 Final Model
Therefore, the final model for the Box-Cox transform of the original data is
SARIMA(0, 1, 0) × (0, 1, 1)₁₂, with
P = 0, D = 1, Q = 1, p = 0, d = 1, q = 0, S = 12,
which can be written as

(1 - B)(1 - B¹²) Y_t = (1 - 0.3230 B¹²) Z_t,

where Y_t is the transformed series and Z_t is white noise.
6.0 Forecast
After deciding on my final model, the last step of the project is to forecast the future
trend with the chosen model to see whether it is useful and accurate. First, I simulated the
one-year future values of the series using the forecast library in R. I then added the upper
and lower bounds of the 95% prediction interval along with the training dataset. From
Figure 6.0.1 below, it is clear that the test dataset falls within the interval.
Figure 6.0.1 Forecasting the Box-Cox Transformed data
However, the figure above shows the forecasts of the Box-Cox transformed data. In order to
change the predicted values back to the original scale, we transform them using the inverse
formula X_t = (λ × Y_t + 1)^(1/λ). As we can see in Figure 6.0.2, the forecasts of the
original data are also within the intervals, which implies that the model is accurate.
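The back-transformation step amounts to a round trip through the Box-Cox formula and its inverse; this sketch uses the lambda estimated in the report, while the demand value 250.0 is an arbitrary example.

```python
def box_cox(x, lam):
    """Forward Box-Cox transform used before modeling: (x**lam - 1) / lam."""
    return (x ** lam - 1.0) / lam

def inv_box_cox(y, lam):
    """Inverse transform mapping forecasts back to the original scale:
    x = (lam * y + 1) ** (1 / lam)."""
    return (lam * y + 1.0) ** (1.0 / lam)

lam = 0.1818182
x = 250.0                     # an arbitrary employee-demand value for illustration
y = box_cox(x, lam)           # transformed scale, as modeled
x_back = inv_box_cox(y, lam)  # back on the original scale

assert abs(x_back - x) < 1e-8  # the round trip recovers the original value
```

The same inverse is applied to the forecast bounds, which is how the intervals in Figure 6.0.2 are obtained on the original scale.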
Figure 6.0.2 Forecasting the original data
To see the prediction more clearly, I zoomed in on the forecast portion of the data to check
that every true value lies between the upper and lower bounds. Figure 6.0.3, which
corresponds to the last 12 months of data, shows that every true value is within the CI.
Therefore, my final model is accurate and useful.
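The zoom-in check reduces to a simple interval-coverage test. The bounds and observed values below are hypothetical placeholders (the actual numbers live in the R output of Appendix A); the sketch only shows the shape of the check.

```python
def all_within(truth, lower, upper):
    """True when every observed value lies inside its prediction interval."""
    return all(lo <= t <= up for t, lo, up in zip(truth, lower, upper))

# Hypothetical 12-month forecast bounds and observed test values (illustration).
lower = [300, 302, 305, 310, 320, 330, 335, 333, 328, 322, 331, 360]
upper = [330, 333, 337, 343, 355, 366, 372, 370, 364, 357, 367, 398]
truth = [312, 318, 320, 327, 338, 349, 353, 351, 346, 340, 349, 380]

print(all_within(truth, lower, upper))  # True: the intervals cover every value
```

A single month falling outside its interval would make the check fail, which is the informal criterion the report uses to judge the forecast.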
Figure 6.0.3 Zoom-in forecasting of original data
7.0 Conclusion
The purpose of my project is to predict future employee demand by modeling the previous
monthly employee demand record. Through the process of transformation, differencing,
model selection, diagnostic checking, and spectral analysis, the final model I picked is
SARIMA(0, 1, 0) × (0, 1, 1)₁₂, which can be written as

(1 - B)(1 - B¹²) Y_t = (1 - 0.3230 B¹²) Z_t.

Since the test dataset from the original data contains the true values, I used it to compare
against the predictions simulated by the time series model. As the true values are included
in the prediction intervals and are also close to my predicted values, I conclude that my
model is good and reasonable.
Reference
Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0.
https://pkg.yangzhuoranyang./tsdl/.
O-Donovan (1983). "Monthly Employee Demand in the Wholesale and Retail Stores in
Wisconsin from 1961 to 1975."
Lecture 11.
https://gauchospace.ucsb.edu/courses/pluginfile.php/18429360/mod_resource/content/1/week6-Lecture%2011%20%20slides-diagnostics.pdf
Lecture 15.
https://gauchospace.ucsb.edu/courses/pluginfile.php/18494621/mod_resource/content/1/Lecture%2015-AirPass%20slides.pdf
Brockwell, Peter J., and Richard A. Davis. Introduction to Time Series and Forecasting.
Springer, 2016.
Appendix A: R Code