DATA3888: Data Science Capstone
Week 3 - Case study 4: Time series data (Optiver)
Qiuzhuang Sun
03 March, 2024
Overview
Volatility is one of the most prominent terms you'll hear on any trading floor
Volatility captures the amount of fluctuation in prices
High volatility: periods of market turbulence and large price swings
Low volatility: calmer and quieter markets
Predicting volatility is essential
Example: Option price is directly related to the volatility of the underlying product
Purpose of this project
To build a model that predicts future volatility based on the provided stock data
Use communication strategies (e.g., an educational Shiny app, short videos, or infographics) to
provide traders with a firm understanding of the prediction methodology
Preliminary trading theory
Order book
There are buyers and sellers for any given financial instrument
Intended buy orders: bid & intended sell orders: ask
An order book is an electronic list of buy and sell orders for a specific security or financial instrument,
organized by price level
Trades
When someone wants to do a trade in a stock, they check the order book of the stock and find
someone with a counter-interest to trade with
Example: I want to buy 20 shares of a stock
Check the ask (aka offer) side of the book and lift 20 shares for a price of 148
"Lift the offer" means to buy at the lowest price offered by the market
Market making
An actively traded (or liquid) financial instrument has a dense order book
Consider a less liquid order book and suppose I, again, want to buy 20 shares of the stock
I still want to buy at price 148, but there's nobody willing to sell at 148
My order will be sitting in the book
How about I insert an order to buy at 155?
Market making
The previous example shows that it's difficult to trade in an inefficient market
This is why investors love liquidity
A market maker is a firm or individual who actively quotes two-sided markets in a security, providing
bids and offers along with the market size of each
Optiver is a market maker
As a market maker shows both bid and offer orders, an order book with a market maker present
will be more liquid, providing a more efficient market in which end investors can trade
freely without concern about execution
Making a market involves providing both a buy and a sell price for a financial instrument
Order book statistics
Lots of statistics can be derived from raw order book data to reflect market liquidity and stock
valuation
We'll introduce the bid/ask spread and the weighted average price (WAP)
You might come up with some of your own that are helpful in predicting the volatility
Bid/ask spread: computed from the ratio of the best offer price to the best bid price:

$$BidAskSpread = \frac{BestOffer}{BestBid} - 1$$

What does a BidAskSpread of 0.5 mean?
What about 0.02?
Can BidAskSpread be negative?
Order book statistics
Weighted average price (WAP)
A fair book-based valuation must take two factors into account: the level and the size of orders
$BidPrice_1$ and $BidSize_1$ are the highest bid price and its size in the order book, respectively
$AskPrice_1$ and $AskSize_1$ are the lowest ask price and its size in the order book, respectively

$$WAP = \frac{BidPrice_1 \cdot AskSize_1 + AskPrice_1 \cdot BidSize_1}{BidSize_1 + AskSize_1}$$

If two books have both bid and ask offers on the same price level, and remembering
$BidPrice_1 < AskPrice_1$, then (try to prove it):
WAP decreases in $AskSize_1$ with fixed $BidSize_1$
WAP increases in $BidSize_1$ with fixed $AskSize_1$
Intuition: more offers mean that there are more intended sellers in the book, and more sellers
imply more supply on the market, resulting in a lower stock valuation
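A minimal R sketch of both statistics, using made-up top-of-book numbers (not taken from any provided data file):

# Illustrative top-of-book quotes (made-up numbers)
best_bid <- 147; bid_size <- 10
best_ask <- 148; ask_size <- 6
spread <- best_ask / best_bid - 1                     # BidAskSpread
wap <- (best_bid * ask_size + best_ask * bid_size) /  # weighted average price
  (bid_size + ask_size)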
Order book statistics
Some exercises
Compute the BidAskSpread and WAP of the following order book
Consider a stock that currently has a best bid of 100 and a best offer of 105. There are 20 lots
bid at 100 and 5 lots offered at 105. What is the BidAskSpread and the WAP of this order book?
What is a big problem with the current definition of WAP? (Hint: Think about how you can
manipulate the WAP of the previous example by inserting only a single 1-lot order.)
Log returns
We want to compare the price (we use WAP in this project) of a stock between yesterday and today
Example: Assume we have invested $1000 in both stock A and stock B
Stock A moves from $100 to $102 and stock B moves from $10 to $11
Profit of A: (1000/100) ⋅ (102 − 100) = $20; price difference: $2
Profit of B: (1000/10) ⋅ (11 − 10) = $100; price difference: $1
Taking the difference of prices is NOT a good way to compare two stocks
How about computing the percentage change in price?
Stock A: (102 − 100)/100 = 2%
Stock B: (11 − 10)/10 = 10%
The stock return coincides with the percentage change in our invested capital
Log returns
However, log returns are preferred whenever some mathematical modelling is required
Let $S_t$ be the price of the stock at time $t$. We define the log return between $t_1$ and $t_2$ as

$$r_{t_1, t_2} = \log\left(\frac{S_{t_2}}{S_{t_1}}\right)$$

Usually, we look at log returns over fixed time intervals, so with 10-minute log returns we could set
the notation $r_t = r_{t-10, t}$, where the time unit of $t$ is in minutes
Properties of log returns
Additive across time: $r_{t_1, t_2} + r_{t_2, t_3} = r_{t_1, t_3}$
Regular returns cannot go below $-100\%$, while log returns are not bounded
For small moves of a stock, the log return essentially pulls out the percentage change in the stock
Example: A stock was trading at $2 yesterday and at $2.1 today, so the increase is 5%
We also have $\log(2.1/2) = \log(1.05) = \log(1 + 0.05) \approx 0.05$, where we use the Taylor
approximation $\log(1 + x) \approx x$ for small $x$
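A quick R check of these properties (the prices are illustrative):

# Log returns are additive across time (up to floating-point rounding)
p1 <- 100; p2 <- 102; p3 <- 99
all.equal(log(p2 / p1) + log(p3 / p2), log(p3 / p1))  # TRUE

# For small moves, the log return is close to the percentage change
log(2.1 / 2)  # ~ 0.0488, close to the 5% move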
Volatility
When we trade options, a valuable input to our models is the standard deviation $\sigma$ of the stock's log
returns
The standard deviation will be different for log returns computed over longer or shorter intervals; for
this reason it is usually normalized to a 1-year period, and the annualized standard deviation is called
volatility
Consider a 10-min fixed time interval and assume that we sequentially observe $T$ log returns over
time, denoted $r_1, \ldots, r_T$. The realised volatility for this 10-min interval is defined as

$$\sigma = \sqrt{\sum_{t=1}^{T} r_t^2}$$

We implicitly assume that the log returns during the 10-min time interval have 0 mean for
simplicity. This assumption can be safely imposed for our project
Case study 4: Predicting volatility of stocks from order books
Prediction on future volatility
You will be given 10 minutes of book data and we want you to predict what the volatility will be in a
specific future period
Write $\sigma_{\text{time}}$ to mean the historical volatility over a specified time period
Write $\hat{\sigma}_{\text{time}}$ to mean the estimate for future volatility over a specified time period
Some naive methods for predicting volatility in the next 10 minutes (sketched in code below):
Prediction is precisely what you've just observed in the last 10 minutes: $\hat{\sigma}_{10\text{ min}} = \sigma_{10\text{ min}}$
The volatility in the previous 5 minutes may be a better estimate: $\hat{\sigma}_{10\text{ min}} = \sqrt{2}\,\sigma_{5\text{ min}}$ (we multiply
by $\sqrt{2}$ because $\sqrt{10\text{ min}/5\text{ min}} = \sqrt{2}$)
We may compute a weighted average: $\hat{\sigma}_{10\text{ min}} = 0.8 \cdot \sigma_{10\text{ min}} + 0.2 \cdot \sqrt{2}\,\sigma_{5\text{ min}}$
Try to build a sensible model by yourself in this project!
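A minimal R sketch of the three naive predictors; sigma_10 and sigma_5 are assumed to hold the realised volatilities observed over the last 10 and 5 minutes (the values below are illustrative):

sigma_10 <- 0.004; sigma_5 <- 0.003     # illustrative observed volatilities
pred_last10   <- sigma_10                                  # reuse the last 10 min
pred_scaled5  <- sqrt(2) * sigma_5                         # scale 5-min vol by sqrt(10/5)
pred_weighted <- 0.8 * sigma_10 + 0.2 * sqrt(2) * sigma_5  # weighted average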
Overview on data
We provide order book snapshots of 127 stocks over multiple 10-min intervals (>10GB)
Each csv file corresponds to a specific stock (also shown in stock_id)
time_id : ID code for the 10-min time bucket. Time IDs are not necessarily sequential but are
consistent across all stocks
seconds_in_bucket : # of seconds from the start of the bucket, always starting from 0 and
never exceeding 600
Remaining columns: Highest (lowest) two bid (ask) prices and their sizes
Time series on log returns (Stock 1)
# Packages assumed loaded for all code chunks
library(dplyr)
library(ggplot2)

stock1 <- read.csv("data/individual_book_train/stock_1.csv")
stock1 <- stock1 %>% mutate(
  WAP = (bid_price1 * ask_size1 + ask_price1 * bid_size1) / (bid_size1 + ask_size1))
# Keep the first 10-min bucket (stock1[, 1] is the time_id column)
sec <- stock1 %>% filter(time_id == stock1[, 1][1]) %>% pull(seconds_in_bucket)
price <- stock1 %>% filter(time_id == stock1[, 1][1]) %>% pull(WAP)
# Log returns between consecutive snapshots
log_r <- log(price[-1] / price[1:(length(price) - 1)])
df_return <- data.frame(time = sec[-1], log_return = log_r)
ggplot(data = df_return, aes(x = time, y = log_return)) + geom_line()
ACF and PACF plots (Stock 1)
The autocorrelation function (ACF) and partial ACF (PACF) plots show how a time series correlates
with its own lagged values.
par(mfrow = c(1, 2))
acf(log_r, main = "ACF plot for log returns")
pacf(log_r, main = "PACF plot for log returns")
Prediction on realised volatility
This project lets you do a one-step prediction at a 30-second resolution
At any time, predict the realised volatility in the next 30 seconds based on the data available so far
We can divide the 10-min data into non-overlapping 30-sec intervals and compute the realised
volatility in each interval
This leads to 600/30 = 20 data points for each 10-min time bucket
We may also divide the 10 minutes into overlapping intervals (see the sketch below)
Advantage: it increases the size of the training data set
Disadvantage: the resulting volatility time series will be highly autocorrelated, which can
deteriorate the performance of many models
You may explore yourself whether your model can predict accurately under this division scheme
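A sketch of the overlapping division, assuming the df_return data frame from the earlier chunk (columns time and log_return); the 5-second step is an arbitrary illustrative choice:

window <- 30   # window length in seconds
step <- 5      # windows start every 5 seconds, so consecutive windows overlap
starts <- seq(0, 600 - window, by = step)
vol_overlap <- sapply(starts, function(s) {
  r <- df_return$log_return[df_return$time > s & df_return$time <= s + window]
  sqrt(sum(r^2))   # realised volatility of the window
})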
Time series on realised volatility (Stock 1)
# Realised volatility: square root of the sum of squared log returns
comp_vol <- function(x) {return(sqrt(sum(x ^ 2)))}
sec <- stock1 %>% filter(time_id == 5) %>% pull(seconds_in_bucket)
log_r1 <- data.frame(time = sec[-1], log_return = log_r)
# Assign each second to a non-overlapping 30-sec bucket (1 to 20)
log_r1 <- log_r1 %>% mutate(time_bucket = ceiling(time / 30))
vol <- aggregate(log_return ~ time_bucket, data = log_r1, FUN = comp_vol)
colnames(vol) <- c('time_bucket', 'volatility')
ggplot(data = vol, aes(x = time_bucket, y = volatility)) + geom_line() + geom_point()
Benchmarks
We will talk about three possible strategies for future volatility prediction
Regression: not treating the data as a time series, but weighting the errors by recency
Heterogeneous autoregressive (HAV): directly working on the realised volatility time series
ARMA-GARCH: working on the log returns time series and predicting volatility indirectly
These models can be treated as benchmarks
You can use the benchmarks as building blocks to develop your own model
You can also try other methods (like machine learning models)
For simplicity, the benchmarks overlook possible dependence of data across different stock_id
and time_id
We fit a model separately for each combination of stock_id and time_id
You can improve the model performance by jointly analysing data from different stocks and/or
under different time_id
Regression: overview
We regress the realised volatility $RV_t$ on some observed covariates in period $t-1$
Why covariates in period $t-1$, not $t$? Because we want to forecast future volatility. If we predict
the volatility in period $t$, then the available data are covariates before period $t$
Possible covariates include
Mean, variance, and/or quantiles of the WAP during the last period
Measures of liquidity: less liquidity usually correlates with more volatility
Possible measures of liquidity
Mean, variance, and/or quantiles of the number of orders during the past period
BidAskSpread at the end of the last period (the most recent liquidity measure)
You may come up with more measures of liquidity in your own project
You may also include covariates in periods $t-2$, $t-3$, ... for regression
In this case, remember to avoid the overfitting problem
Regression: estimation
Potential drawback of regression analysis: overlooking the time information
Solution: weight the data points for regression by recency
Example: regressing the realised volatility $RV_t$ on the mean price $\overline{WAP}_{t-1}$ and mean number of
orders $\bar{O}_{t-1}$ in period $t-1$
Data: $(RV_t, \overline{WAP}_{t-1}, \bar{O}_{t-1})_{t=2,\ldots,T}$
The weighted least squares (WLS) estimator is the solution to

$$\min_{b_0, b_1, b_2} \sum_{t=2}^{T} w_t \left( RV_t - b_0 - b_1 \overline{WAP}_{t-1} - b_2 \bar{O}_{t-1} \right)^2$$

We may use exponential weights $w_t = \alpha^{T-t}$, where $0 < \alpha < 1$ is a tuning parameter
(Explore other possible weights in the project)
Regression: R code
Use lm for linear regression and $\alpha = 0.8$ for the weights

stock1 <- stock1 %>% mutate(time_bucket = ceiling(seconds_in_bucket / 30),
                            num_order = bid_size1 + ask_size1 + bid_size2 + ask_size2,
                            BidAskSpread = ask_price1 / bid_price1 - 1)
# Per-bucket covariate means (bucket 0 contains only the snapshot at second 0)
stats.bucket <- stock1 %>%
  filter(time_id == 5 & time_bucket != 0) %>%
  select(c(BidAskSpread, WAP, num_order, time_bucket))
mean.price <- aggregate(WAP ~ time_bucket, data = stats.bucket, FUN = mean)
mean.order <- aggregate(num_order ~ time_bucket, data = stats.bucket, FUN = mean)
mean.BAS <- aggregate(BidAskSpread ~ time_bucket, data = stats.bucket, FUN = mean)
# Regress volatility in buckets 2-20 on covariate means from buckets 1-19
df.reg <- data.frame(volatility = vol$volatility[-1],
                     price = mean.price$WAP[1:19],
                     order = mean.order$num_order[1:19],
                     BidAskSpread = mean.BAS$BidAskSpread[1:19])
# Exponentially decaying weights: the most recent bucket gets weight 1
lm.model <- lm(volatility ~ price + order + BidAskSpread, df.reg, weights = 0.8 ^ ((18:0) / 2))
summary(lm.model)
HAV model: overview
HAV is one of the most popular models for predicting daily realised volatility
Easy to understand and interpret
Easy for statistical inference and prediction
Let $(r_{ti})_{i=1,\ldots,M}$ be the observed (log) returns during period $t$ (e.g., a 10-min interval, day, or week). Let
$RV_t = \sqrt{\sum_{i=1}^{M} r_{ti}^2}$ be the realised volatility in period $t$
A common form of the HAV model is

$$RV_t = \beta_0 + \beta_1 RV_{t-1} + \frac{\beta_2}{5} \sum_{i=1}^{5} RV_{t-i} + \frac{\beta_3}{22} \sum_{i=1}^{22} RV_{t-i} + u_t$$

$u_t$ is the error term in period $t$
Consider $t$ as one day. Then $\sum_{i=1}^{5} RV_{t-i}/5$ and $\sum_{i=1}^{22} RV_{t-i}/22$ are the mean realised volatility in
the past week and in the past month, respectively
We need to estimate $(\beta_0, \beta_1, \beta_2, \beta_3)$ from data
HAV model: OLS estimation
Recall we assume $RV_t = \beta_0 + \beta_1 RV_{t-1} + \frac{\beta_2}{5} \sum_{i=1}^{5} RV_{t-i} + \frac{\beta_3}{22} \sum_{i=1}^{22} RV_{t-i} + u_t$
Assume we observe realised volatility data $(RV_t)_{t=1,\ldots,n}$
A natural way for estimation is to use ordinary least squares (OLS)
The OLS estimator of $(\beta_0, \beta_1, \beta_2, \beta_3)$ is the solution to

$$\min_{b_0, b_1, b_2, b_3} \sum_{t=23}^{n} \left( RV_t - b_0 - b_1 RV_{t-1} - \frac{b_2}{5} \sum_{i=1}^{5} RV_{t-i} - \frac{b_3}{22} \sum_{i=1}^{22} RV_{t-i} \right)^2$$

The OLS estimator is optimal (in terms of asymptotic efficiency) when the error terms $(u_t)_{t=1,\ldots,n}$ are
independent, normally distributed, and homoskedastic
HAV model: WLS estimation
An alternative to OLS is WLS
The WLS estimator of $(\beta_0, \beta_1, \beta_2, \beta_3)$ is the solution to

$$\min_{b_0, b_1, b_2, b_3} \sum_{t=23}^{n} w_t \left( RV_t - b_0 - b_1 RV_{t-1} - \frac{b_2}{5} \sum_{i=1}^{5} RV_{t-i} - \frac{b_3}{22} \sum_{i=1}^{22} RV_{t-i} \right)^2$$

$w_t > 0$ is the weight for the $t$th observation
There are many ways to determine $w_t$
$w_t = RV_{t-1} / \sqrt{RQ_{t-1}}$, where $RQ_t = \frac{M}{3} \sum_{i=1}^{M} r_{ti}^4$ is the realised quarticity in period $t$
$w_t = 1 / \widehat{RV}_t$, where $\widehat{RV}_t$ is the estimated realised volatility in period $t$ using OLS
Use OLS for estimation and then compute residuals. Fit a GARCH(1, 1) on the OLS residuals.
Let $w_t = 1 / \hat{h}_t$, where $\hat{h}_t$ is the fitted value of the conditional variance of the GARCH(1, 1)
(More details on GARCH later...)
HAV model: R code
HAV is popular for daily realised volatility, but our data only cover 10 minutes
Because our time series are not long, we may only use $RV_{t-1}$ and $\sum_{i=1}^{5} RV_{t-i}/5$ for estimation

# Mean realised volatility over the previous 5 buckets, for buckets 6-20
mean.vol <- rep(0, 15)
for (j in 1:5) {
  mean.vol <- mean.vol + vol$volatility[j:(j + 14)] / 5
}
df.HAV <- data.frame(vol = vol$volatility[-(1:5)],  # response: buckets 6-20
                     vol_1 = vol$volatility[5:19],  # lag-1 realised volatility
                     mean_vol_5 = mean.vol)
# Realised quarticity of each 30-sec bucket (M = 30 returns per bucket)
comp_quar <- function(x) {return(30 / 3 * sum(x ^ 4))}
quar <- aggregate(log_return ~ time_bucket, data = log_r1, FUN = comp_quar)
colnames(quar) <- c('time_bucket', 'quarticity')
HAV.ols <- lm(vol ~ vol_1 + mean_vol_5, df.HAV)
# WLS with quarticity-based weights w_t = RV_{t-1} / sqrt(RQ_{t-1})
HAV.wls <- lm(vol ~ vol_1 + mean_vol_5, df.HAV,
              weights = df.HAV$vol_1 / sqrt(quar$quarticity[5:19]))
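The other two weighting schemes from the previous slide can be sketched similarly. This assumes the df.HAV and HAV.ols objects from above and the rugarch package; with only 15 observations the GARCH fit is purely illustrative:

# Scheme 2: w_t = 1 / fitted RV from the OLS fit (guarding against non-positive fits)
HAV.wls2 <- lm(vol ~ vol_1 + mean_vol_5, df.HAV,
               weights = 1 / pmax(fitted(HAV.ols), 1e-8))
# Scheme 3: w_t = 1 / conditional variance of a GARCH(1, 1) on the OLS residuals
library(rugarch)
gspec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                    mean.model = list(armaOrder = c(0, 0), include.mean = FALSE))
gfit <- ugarchfit(spec = gspec, data = residuals(HAV.ols), solver = 'hybrid')
HAV.wls3 <- lm(vol ~ vol_1 + mean_vol_5, df.HAV,
               weights = 1 / as.numeric(sigma(gfit))^2)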
ARMA-GARCH: Overview
ARMA-GARCH forecasting
ARMA: Autoregressive Moving Average
GARCH: Generalized Autoregressive Conditional Heteroskedasticity
An ARMA model estimates the conditional mean $E[r_t \mid \text{Historical data}]$
A GARCH model estimates the conditional variance $Var[r_t \mid \text{Historical data}]$
Math behind an ARMA-GARCH model:

$$r_t = \sum_{i=1}^{p} a_i r_{t-i} + \sum_{i=1}^{q} b_i \epsilon_{t-i} + \epsilon_t$$

$$\epsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \sum_{i=1}^{p'} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{q'} \beta_i \sigma_{t-i}^2$$

$z_t$: white noise process with $E[z_t] = 0$ and $Var(z_t) = 1$
$(p, q)$ and $(p', q')$: orders for ARMA and GARCH, respectively
ARMA-GARCH: Estimation and prediction
$\sigma_t^2$: known as the conditional variance at time $t$
White noise process $z_t$: iid standard normal, iid $t$-distribution, etc.
Orders $(p, q)$ and $(p', q')$: can be selected by AIC and BIC
Estimation can be done by the R package rugarch
A simple simulation-based prediction:
Fit the model with predetermined distributions for $z_t$ and orders for ARMA and GARCH
At time $t$, use available time series data on log returns to simulate a path in the next time bucket
(e.g., next 30 seconds)
Use the simulated path to compute the realised volatility at the next period
Repeat the above simulation a large number of times and average the realised volatility as an
estimate of $RV_{t+1}$
ARMA-GARCH: R code
Use ARMA(1, 1) and GARCH(1, 1) with a normal white noise process to fit the data
Use ugarchfit for parameter estimation
Use ugarchpath to simulate future time series for volatility prediction

library(rugarch)
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(1, 1)), distribution.model = "norm")
# Fit the model
ARMA_GARCH <- ugarchfit(spec = spec, data = log_r1$log_return, solver = 'hybrid')
# Make the prediction: simulate 1000 paths of the next 30 one-second log returns
fspec <- getspec(ARMA_GARCH)
setfixed(fspec) <- as.list(coef(ARMA_GARCH))
future.path <- fitted(ugarchpath(fspec, n.sim = 30, m.sim = 1000))
future.path[is.na(future.path)] <- 0
# Average the volatility of each simulated path; note sd() differs from the earlier
# definition sqrt(sum(r^2)) by roughly a factor of sqrt(n.sim - 1) when the mean is ~0
RV.pred <- mean(apply(future.path, 2, sd))
Performance measures
Let $RV_t$ and $F_t$ be the realised and forecast volatility at time $t$, respectively
Performance measures
Mean squared error (MSE)

$$MSE = \frac{1}{T} \sum_{t=1}^{T} (RV_t - F_t)^2$$

Empirical quasi-likelihood (QLIKE)

$$QLIKE = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{RV_t}{F_t} - \log\frac{RV_t}{F_t} - 1 \right)$$

QLIKE is a robust measure in the sense that it's less sensitive to outliers or very extreme values (see
Patton 2011, JOE)
Other measures like MAE, RMSE, RMSPE, $R^2$, VaR...
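Both measures are one-liners in R; a minimal sketch, where rv and f are assumed vectors of realised and forecast volatility:

mse   <- mean((rv - f)^2)                 # mean squared error
qlike <- mean(rv / f - log(rv / f) - 1)   # empirical quasi-likelihood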
Model assessment: Two schemes
Based on a performance measure, we have many schemes for (out-of-sample) forecasting
Scheme 1: fitting the data in a one-shot way
Split the first 80% of the data as training data (e.g., data in periods $t = 1, \ldots, T_1$) and the rest as testing
data (e.g., data in periods $t = T_1 + 1, \ldots, T_2$)
Use training data for $t = 1, \ldots, T_1$ to fit the model
At time $\tau = T_1 + 1, \ldots, T_2$, use the fitted model and data from $t = 1, \ldots, \tau - 1$ for prediction
Summarise results for $t = T_1 + 1, \ldots, T_2$ to compute performance measures like QLIKE and MSE
Scheme 2: rolling window forecasting (sketched below)
Fix a positive integer $n$ as the length of a window
At time $t = n, n + 1, \ldots, T$, use data from periods $t - n + 1$ to $t$ as training data to fit a model
Use the model just fitted for prediction in period $t + 1$
Summarise results for $t = n + 1, \ldots, T$ to compute performance measures like QLIKE and MSE
Rolling window is more useful if we believe the system dynamics change over time
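A sketch of Scheme 2, where rv is an assumed realised-volatility series and a simple lag-1 linear model stands in for whatever model you fit; the window length is an arbitrary choice:

n <- 10                  # window length (illustrative)
T_len <- length(rv)
pred <- rep(NA, T_len)
for (t in n:(T_len - 1)) {
  # Training data: periods t - n + 1 to t, paired as (lagged RV, current RV)
  win <- data.frame(y = rv[(t - n + 2):t], x = rv[(t - n + 1):(t - 1)])
  fit <- lm(y ~ x, data = win)
  # Predict period t + 1 from the most recent observation
  pred[t + 1] <- predict(fit, newdata = data.frame(x = rv[t]))
}
mse_roll <- mean((rv - pred)^2, na.rm = TRUE)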
Other potential solutions
Machine learning methods
We illustrated regression (using WLS), HAV, and ARMA-GARCH for prediction
Machine learning and deep learning are also potential approaches, but it's better to keep the model
interpretable
Explore more information in the order book snapshots (e.g., the second bid and ask prices) for
prediction
Making use of features from neighbours
Given order book snapshots of a stock within a given time, we can find its $k$ nearest neighbours
in the training data set (across all stock_id and time_id) with respect to some appropriate
similarity measure
We can directly take a $k$-nearest-neighbour average of the realised volatility (sketched below) or extract features
from the neighbours for prediction
Other potential solutions
Prediction for multiple stocks simultaneously
Behaviours of stocks can be very similar
We can jointly do estimation and prediction for multiple stocks (linked by time_id)
Potential solution procedure: (i) use clustering methods to obtain several stock clusters; (ii) use
multivariate time series models (e.g., vector autoregression) to fit the time series of stocks in
the same cluster; (iii) jointly predict future volatility based on the fitted model
Joint prediction for a stock with different time_id
Previously, we separately fit a model for each combination of stock_id and time_id
With the same stock_id , we may develop a model that jointly analyses data under different
time_id 's
See Wang et al. (2016), JBF
Summary
Order book is used for trading
We can compute statistics like BidAskSpread, WAP, and log returns from order book snapshots
over time
We can measure the market liquidity and compute realised volatility using the above statistics
We use three different strategies for the time series data
Regression: use historical covariates for prediction at the next period
HAV: work directly on the time series of the realised volatility
ARMA-GARCH: work on the time series of the log returns and predict realised volatility indirectly
Performance measures: QLIKE and MSE
Schemes for forecasting
One-shot estimation
Rolling window forecasting
Further reading
Clements, A., & Preve, D. P. (2021). A practical guide to harnessing the HAR volatility model. Journal of
Banking & Finance, 133, 106285.
Wang, Y., Ma, F., Wei, Y., & Wu, C. (2016). Forecasting realized volatility in a changing world: A dynamic
model averaging approach. Journal of Banking & Finance, 64, 136-149.
Ghysels, E., Santa-Clara, P., & Valkanov, R. (2006). Predicting volatility: getting the most out of return data
sampled at different frequencies. Journal of Econometrics, 131(1-2), 59-95.
Liu, L. Y., Patton, A. J., & Sheppard, K. (2015). Does anything beat 5-minute RV? A comparison of realized
measures across multiple asset classes. Journal of Econometrics, 187(1), 293-311.
Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of
Econometrics, 160(1), 246-256.