SMM069 Advanced Predictive Analytics – Group Coursework
Prof Vali Asimit, Ms Ziwei Chen and Mr Hoi Chen
Released on: Thursday, 20 June 2024
Due: Thursday, 11 July 2024 at 4pm
1 Introduction
In this coursework, you will analyse commodity futures price data using regression models and statistical
techniques that we have covered in this module. The dataset spans monthly observations from 29 July 2011
to 31 May 2024. This introduction is designed to familiarise you with the data structure, the concept of
factor models, the specifics of the dataset you will be using, and the objectives of this coursework.
1.1 Commodity Futures Price Data and S&P GSCI Index
Commodity futures prices are volatile and are influenced by a multitude of factors. In this coursework, you
will work with the commodities that influence the S&P Goldman Sachs Commodity Index (S&P GSCI), a
widely recognised benchmark for investment in commodity markets. The S&P GSCI Index represents the
performance of a diversified group of commodities, providing a broad measure of the commodity market’s
overall performance. It includes various commodities such as energy, metals, agriculture, and livestock,
making it a comprehensive indicator of commodity futures price movements.
1.2 Factor Models
A factor model is a financial model that explains the returns of an asset through various underlying factors.
These factors are variables that are considered to influence commodity futures prices. In this coursework,
you will use factor models to analyse commodity futures price data. Factor models are based on i) observable
factors that are computed from observational data and are based on expert opinion or ii) engineered factors
that could be extracted via Machine Learning methods such as Principal Component Analysis (PCA) or Factor
Analysis (FA). This coursework relies on a factor model with eight observable factors, which are described
as follows:
• Momentum: Measures the tendency of a commodity futures price to continue moving in its current
direction. It is calculated based on the cumulative excess returns of the commodity over the previous
12 months.
• Basis: The basis for each commodity is the difference between the future prices of the nearby and
next-to-nearby futures contracts. This difference provides insights into the market’s expectations of
future price movements.
• Basis-Momentum: This factor combines the basis and momentum effects. It measures the difference
in momentum between the nearby and next-to-nearby futures contracts.
• Skewness: Measures the asymmetry of the commodity futures price distributions. It helps in
understanding the likelihood of extreme future price movements by analysing the daily returns of the
commodity over the past 12 months.
• Inflation Beta: Represents the sensitivity of the commodity futures price to changes in inflation. It
shows how the commodity’s future price responds to unexpected changes in monthly inflation rates
over the past 60 months.
• Volatility: Defined as the variance-per-absolute mean of the first-nearby futures returns over the prior
36 months. It indicates the stability or instability of a commodity’s future price.
• Open Interest: Indicates the level of market activity and liquidity. It is defined as the monthly
change in the total number of outstanding contracts for a commodity.
• Value: Represents the intrinsic worth of a commodity based on fundamental factors such as supply
and demand. It is measured as the average future price of the nearby futures contract from 4.5 to 5.5
years ago compared to the current future price.
The data for these risk factors will be provided to you, so you do not need to calculate them yourself.
Your task is to use these factors to analyse the commodity futures price data, explain movements in
commodity futures prices, and determine which factors have the greatest impact.
1.3 Data Structure and Assignment
Each group will be provided with THREE CSV files:
1. S&P GSCI Index and Commodities Future Price Daily Data: One CSV file containing the
daily index values of the S&P GSCI Index and commodity futures prices from 03 Oct. 2005 to 09
May 2024.
2. Commodity Futures Price Data: Each group will be assigned two different commodities. Each
commodity has its own CSV file containing the monthly futures price data of that commodity along
with the eight corresponding factors described above. For information on the commodities assigned
to your group, please refer to the Commodities selection for each group.csv. You may download the
appropriate commodity files from the zip folder.
The datasets for each group can be found in the folder assigned to you. Each group will use the same
S&P GSCI Index daily data file but will work with two assigned commodity futures monthly data files. The
meanings of the commodity tickers are detailed in Table 1 at the end of this document.
1.4 Objectives
The primary objectives of this coursework are to:
• Comprehend and preprocess commodity futures price data accurately.
• Use regression analysis with factor models to elucidate the factor coefficients that determine the
volatility of commodity futures prices.
• Evaluate the importance of various factors in affecting commodity futures prices and explain your
• Validate your models and provide clear, concise, and insightful interpretations of your findings.
2 Tasks for Multiple Linear Regression
You will use only the S&P GSCI Index daily CSV data file for tasks A1) and A2), and the
two commodity futures monthly CSV data files for tasks A3) to A6).
A1) The first task is to redo the Ch1 R-lab. The goal is to apply the Ordinary Least Squares (OLS),
Ridge Regression (RR), Slab Regression (SR), Stein (St), Diagonal Shrinkage (DSh), and Shrinkage
(Sh) estimators to the data from Period 7 and use their β coefficients to predict the results of Period
8, which is an out-of-sample test. That is, the dependent (or target) variable is the log return of the
S&P GSCI Index and the independent variables (or covariates, or features) are the log returns of the 35
commodities; further, change point detection is also needed. You are free to use any indicator to
select the commodities you believe best represent and interpret the S&P GSCI Index. However, you
must explain the reasons for selecting your chosen commodities from Period 7.
Here are the commodity selection limits for each group:
• Group 1: Select up to 25 commodities.
• Group 2: Select up to 24 commodities.
• Group 3: Select up to 23 commodities.
• ... (continue this pattern for each group)
• Group 14: Select up to 12 commodities.
[Total marks for A1): 10 marks]
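The mechanics of A1) can be sketched as follows: compute log returns, estimate coefficients on one period, and predict the next period. This is only a minimal illustration for OLS and Ridge Regression, written with the Python standard library; the function names (`log_returns`, `solve`, `ridge_fit`, `predict`) are hypothetical, the ridge penalty is assumed not to apply to the intercept, and the SR, St, DSh and Sh estimators should be taken from the module materials rather than from this sketch.

```python
import math

def log_returns(prices):
    """Log returns r_t = ln(P_t / P_{t-1}) for a list of positive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is square and non-singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam=0.0):
    """Return [intercept, b1, ..., bp] from the penalised normal equations.
    lam = 0 gives OLS. Assumption: the intercept is not penalised."""
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    for i in range(1, k):
        A[i][i] += lam
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(A, b)

def predict(beta, X):
    """Fitted or predicted values for rows of X given [intercept, slopes...]."""
    return [beta[0] + sum(bj * xj for bj, xj in zip(beta[1:], row)) for row in X]
```

In use, `X` and `y` would hold the Period 7 log returns of the selected commodities and of the index, and `predict` would be applied to the Period 8 covariates.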
A2) Analyse the model results obtained in A1) and compare the performance of the OLS, RR, SR, St,
DSh and Sh estimators. Evaluate the goodness-of-fit performance of each model for the Period 8
results, considering the actual events and the selected commodities. Compare the predicted results
with the actual results, explaining any differences. Discuss why the selected commodities may or
may not accurately represent and interpret the S&P GSCI Index and how various variables may
affect the results. Provide a clear comparison of the performance of the models, highlighting which
model performed best and explaining why it is effective. Interpret your results and use appropriate
visualisations to make your points clear.
[Total marks for A2): 15 marks]
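One possible goodness-of-fit measure for the Period 8 predictions is an out-of-sample R-squared. The sketch below is an assumption about how it might be computed (the brief does not prescribe a specific measure): it benchmarks the model against the mean of the actual test-period values, and the function name `out_of_sample_r2` is hypothetical.

```python
def out_of_sample_r2(actual, predicted):
    """1 - SSE/SST on a test period.
    Assumption: SST is computed around the mean of the actual test values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst
```

A value close to 1 indicates predictions close to the realised returns; a negative value indicates the model predicts worse than the test-period mean.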
A3) Download the THREE CSV files and clean the data set to produce a factor model, including the
OLS, RR, SR, St, DSh, and Sh estimators. That is, let $(X_{1,j}, X_{2,j}, \ldots, X_{8,j})$ be the factors of
commodity $j$ (here $p = 8$ and $j = 1, 2$ for this task), which are explained in the introduction, and let
$F_{j,t}$ be the futures price of commodity $j$ at time $t$. The Multiple Linear Regression (MLR) model
is estimated as follows
$$R_{j,t} = \theta_0 + \theta_1 x_{1,j,t} + \theta_2 x_{2,j,t} + \ldots + \theta_8 x_{8,j,t} + \text{error term}, \tag{2.1}$$
where $R_{j,t}$ is the percentage change in the futures price at time $t$ for commodity $j$, which is given by
$$R_{j,t} = \frac{F_{j,t} - F_{j,t-1}}{F_{j,t-1}}. \tag{2.2}$$
Apply all models individually to the entire set. That is, apply the six regression models (OLS, RR,
SR, St, DSh and Sh) to each of the two commodities assigned to your group.
[Total marks for A3): 10 marks]
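As an illustration of (2.2), the response variable can be built from the monthly price series and paired with the factor values observed at the same time t, as in (2.1). The sketch below is a minimal example under that reading; the function names `pct_returns` and `build_dataset` are hypothetical, and the first month is dropped because it has no previous price.

```python
def pct_returns(prices):
    """Percentage change R_t = (F_t - F_{t-1}) / F_{t-1}, as in (2.2)."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def build_dataset(prices, factors):
    """Pair each return R_t with the factor row observed at time t.
    `factors` is a list of rows (one row of eight factor values per month),
    aligned with `prices`. The first row is dropped: no return exists for it."""
    y = pct_returns(prices)
    X = [list(row) for row in factors[1:]]
    return X, y
```

The resulting `X` and `y` have one observation fewer than the original price series.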
A4) Analyse the results of the estimated coefficients for each model and explain the impact of risk factors
on commodity returns. Discuss how each factor affects returns and compare the results of the different
estimators. Highlight any significant differences in the performance of the estimators and carefully
analyse the reasons why some models may perform better than others. Interpret your results and use
appropriate visualisations to make your points clear.
[Total marks for A4): 15 marks]
A5) Perform a robustness test on the factor model in A4). Refer to Chapter 2 for the definition of robustness
tests in Real Data Analysis Objective 2. Perform these tests using 15%, 30%, 50% and 75% of the full
dataset. Interpret your results and use appropriate visualisations to make your points clear.
Note 1: Here we explain what a robustness test (using 15% of the full dataset) means for one
commodity and one model, e.g., OLS. Note that we have 155 observations from 29 July 2011 to 31 May
2024. First, we estimate the OLS parameters, denoted as
$$\hat{\theta}^{\mathrm{Full}}_0, \hat{\theta}^{\mathrm{Full}}_1, \ldots, \hat{\theta}^{\mathrm{Full}}_8,$$
based on the sample of size 155. Second, we estimate the OLS parameters, denoted as
$$\hat{\theta}^{\mathrm{Red}}_0, \hat{\theta}^{\mathrm{Red}}_1, \ldots, \hat{\theta}^{\mathrm{Red}}_8,$$
based on the sample of size 23, since we use the first 23 observations from your sample; you might
have noticed that $155 \times 15\% = 23.25 \approx 23$, which clarifies why the first 23 observations are considered. Third,
compute the L2 error between the two OLS vector estimates, which is given by
$$\sqrt{\left(\hat{\theta}^{\mathrm{Red}}_0 - \hat{\theta}^{\mathrm{Full}}_0\right)^2 + \left(\hat{\theta}^{\mathrm{Red}}_1 - \hat{\theta}^{\mathrm{Full}}_1\right)^2 + \ldots + \left(\hat{\theta}^{\mathrm{Red}}_8 - \hat{\theta}^{\mathrm{Full}}_8\right)^2}.$$
The L2 error tells you how robust your OLS model (based on only 15% of the full dataset) is by comparing
it to the full OLS model (based on the full dataset).
Note 2: Redo the calculations from Note 1 for each regression model (OLS, RR, SR, St, DSh, and Sh)
by considering robustness tests using 15%, 30%, 50% and 75% of the full dataset.
[Total marks for A5): 10 marks]
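The procedure in Notes 1 and 2 can be sketched as a short loop over the four sample fractions. This is a minimal illustration, not a prescribed implementation: `fit` stands for any hypothetical function that takes `(X, y)` and returns a coefficient list for one of the six estimators, and rounding the reduced sample size down is an assumption that matches the 155 × 15% example (23 observations).

```python
import math

def reduced_sample_size(n_obs, fraction):
    """Number of leading observations to keep.
    Assumption: round down; the small epsilon guards against float error."""
    return math.floor(n_obs * fraction + 1e-9)

def l2_error(theta_red, theta_full):
    """Euclidean distance between reduced-sample and full-sample coefficients."""
    return math.sqrt(sum((r - f) ** 2 for r, f in zip(theta_red, theta_full)))

def robustness_errors(X, y, fit, fractions=(0.15, 0.30, 0.50, 0.75)):
    """L2 error of the coefficients for each fraction of the full dataset.
    The reduced sample is the first m observations, as in Note 1."""
    theta_full = fit(X, y)
    errors = {}
    for frac in fractions:
        m = reduced_sample_size(len(y), frac)
        errors[frac] = l2_error(fit(X[:m], y[:m]), theta_full)
    return errors
```

Calling `robustness_errors` once per estimator and per commodity gives the full set of L2 errors asked for in Note 2.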
A6) Analyse and interpret the results obtained from the robustness tests in A5). Discuss how the model’s
performance changes with different dataset sizes and how these changes affect the reliability and stabil-
ity of the models. Interpret your results and use appropriate visualisations to make your points clear.
[Total marks for A6): 10 marks]
[Total marks for tasks A1) to A6): 70 marks]
3 Tasks for Generalised Linear Models
You consider here only the two commodity futures monthly CSV data files and the risk factor
model defined in A3).
B1) Our goal is to use the Generalised Linear Model (GLM) with the OLS, SR, St, and DSh estimators
discussed in Chapter 3, and to consider a modified commodity return following a Gamma distribution
with the log and quadratic link functions, respectively. The first task is to redo the Ch3 R-lab, but
using the same THREE CSV files provided to you and the modified risk model as follows:
For the log link function: $$R_{j,t} = \exp\left(\theta_0 + \theta_1 x_{1,j,t} + \theta_2 x_{2,j,t} + \ldots + \theta_8 x_{8,j,t}\right) \tag{3.1}$$
For the quadratic link function: Rj,t = (θ0 + θ1x1,j,t + θ2x2,j,t + . . .+ θ8x8,j,t)
where the modified commodity return is given by
$$R_{j,t} = \frac{F_{j,t}}{F_{j,t-1}}. \tag{3.3}$$
Divide the full dataset into a 50% training set and a 50% testing set. Fit the models using the training
set and evaluate their performance on the testing set. Calculate the Mean Square Error (MSE), Root
Mean Square Error (RMSE), and Mean Absolute Error (MAE) for each GLM on the i) training set and
ii) testing set, and iii) compare the results of each GLM estimator. Discuss which estimator performed
“best” and explain why. Interpret your results and use appropriate visualisations to make your points clear.
[Total marks for B1): 20 marks]
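The split and the three error measures in B1) can be sketched as below. This is a minimal illustration under stated assumptions: the split is taken chronologically (the first part of the sample for training), which the brief does not specify, and the function names `chronological_split` and `error_metrics` are hypothetical. The predicted values would come from whichever GLM estimator is being evaluated.

```python
import math

def chronological_split(X, y, train_fraction=0.5):
    """Split into (X_train, y_train, X_test, y_test).
    Assumption: the first observations form the training set."""
    n_train = int(len(y) * train_fraction + 1e-9)  # epsilon guards float error
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def error_metrics(actual, predicted):
    """Mean Square Error, Root Mean Square Error and Mean Absolute Error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}
```

Setting `train_fraction=0.7` gives the 70%/30% split used in B2) without further changes.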
B2) Repeat the process in B1), but this time split the full dataset into a 70% training set and a 30% testing
set. Interpret your results and use appropriate visualisations to make your points clear.
[Total marks for B2): 10 marks]
[Total marks for tasks B1) to B2): 30 marks]
4 Instructions
Please read the following instructions before starting your coursework.
1. The data are given per group. Therefore, each group should run its own analysis on its allocated
dataset; otherwise, a 20% penalty (on the group mark) will be applied.
2. Each student will receive their group's mark unless the collaboration amongst the students
falls apart, in which case the students should report this incident via e-mail, copying in all
members of the group, by 01/07/2024.
3. Split the work amongst all group members and provide a careful discussion of your results.
4. You may use any piece of software that you are comfortable with, but annotate the code so that the
marker is able to follow your work.
5. The group feedback will be given on Moodle within three weeks after this coursework is due.
6. Submit your entire report and code via Moodle. The quality of your presentation (coding,
results and methods, etc.) represents 25% of the final mark.
7. Late submissions incur penalties, which are applied by your Course Officer in
accordance with the general rules that have been applied to your coursework over the academic year.
Table 1: Commodity Futures and Their Exchanges
Category Commodity futures Exchange Ticker
Brent Crude Oil ICE CO1
Gasoil Petroleum ICE QS1
Gasoline NYMEX XB1
Heating Oil NYMEX HO1
Natural Gas NYMEX NG1
Propane NYMEX PN1
Grains & Oilseeds
Canola WCE RS1
Corn CBOT C 1
Oats CBOT O 1
Rough Rice CBOT RR1
Soybean Meal CBOT SM1
Soybean Oil CBOT BO1
Soybeans CBOT S 1
Wheat CBOT W 1
Feeder Cattle CME FC1
Lean Hogs CME LH1
Live Cattle CME LC1
Pork Belly CME PB1
Aluminium LME LA1
Copper CMX HG1
Gold CMX GC1
Lead LME LL1
Nickel LME LN1
Palladium NYMEX PA1
Platinum NYMEX PL1
Silver CMX SI1
Zinc LME LX1
Cocoa NYB CC1
Coffee NYB KC1
Cotton NYB CT1
Ethanol CME DL1
Lumber CME LB1
Milk CME DA1
Orange Juice ICE JO1
Rubber OSE JN1
Sugar NYB SB1
Note: CBOT: Chicago Board of Trade, CME: Chicago Mercantile Exchange, CMX (COMEX): Commodity
Exchange, ICE: Intercontinental Exchange, LME: London Metal Exchange, NYB: New York Board of Trade,
NYMEX: New York Mercantile Exchange, OSE: Osaka Exchange, WCE: Winnipeg Commodity Exchange