SMM069 Advanced Predictive Analytics – Group Coursework

Prof Vali Asimit, Ms Ziwei Chen and Mr Hoi Chen

Released on: Thursday, 20 June 2024
Due: Thursday, 11 July 2024 at 4pm

1 Introduction

In this coursework, you will analyse commodity futures price data using regression models and statistical techniques that we have covered in this module. The dataset spans monthly observations from 29 July 2011 to 31 May 2024. This introduction is designed to familiarise you with the data structure, the concept of factor models, the specifics of the dataset you will be using, and the objectives of this coursework.

1.1 Commodity Futures Price Data and S&P GSCI Index

Commodity futures prices are volatile and are influenced by a multitude of factors. In this coursework, you will work with the commodities that influence the S&P Goldman Sachs Commodity Index (S&P GSCI), a widely recognised benchmark for investment in commodity markets. The S&P GSCI Index represents the performance of a diversified group of commodities, providing a broad measure of the commodity market's overall performance. It includes various commodities such as energy, metals, agriculture, and livestock, making it a comprehensive indicator of commodity futures price movements.

1.2 Factor Models

A factor model is a financial model that explains the returns of an asset through various underlying factors. These factors are variables that are considered to influence commodity futures prices. In this coursework, you will use factor models to analyse commodity futures price data. Factor models are based on i) observable factors that are computed from observational data and are based on expert opinion or ii) engineered factors that could be extracted via Machine Learning models like Principal Component Analysis (PCA) or Factor Analysis (FA).
This coursework relies on a factor model with eight observable factors, which are described as follows:

• Momentum: Measures the tendency of a commodity futures price to continue moving in its current direction. It is calculated based on the cumulative excess returns of the commodity over the previous 12 months.
• Basis: The basis for each commodity is the difference between the prices of the nearby and next-to-nearby futures contracts. This difference provides insights into the market's expectations of future price movements.
• Basis-Momentum: This factor combines the basis and momentum effects. It measures the difference in momentum between the nearby and next-to-nearby futures contracts.
• Skewness: Measures the asymmetry of the commodity futures price distributions. It helps in understanding the likelihood of extreme price movements by analysing the daily returns of the commodity over the past 12 months.
• Inflation Beta: Represents the sensitivity of the commodity futures price to changes in inflation. It shows how the commodity's futures price responds to unexpected changes in monthly inflation rates over the past 60 months.
• Volatility: Defined as the variance divided by the absolute mean of the first-nearby futures returns over the prior 36 months. It indicates the stability or instability of a commodity's futures price.
• Open Interest: Indicates the level of market activity and liquidity. It is defined as the monthly change in the total number of outstanding contracts for a commodity.
• Value: Represents the intrinsic worth of a commodity based on fundamental factors such as supply and demand. It is measured as the average price of the nearby futures contract from 4.5 to 5.5 years ago compared with the current futures price.

The data for these risk factors will be provided to you, so you do not need to calculate them yourself.
Your task is to use these factors to analyse the commodity futures price data, explain movements in commodity futures prices, and determine which factors have the greatest impact.

1.3 Data Structure and Assignment

Each group will be provided with THREE CSV files:

1. S&P GSCI Index and Commodities Future Price Daily Data: One CSV file containing the daily values of the S&P GSCI Index and the commodity futures prices from 03 Oct. 2005 to 09 May 2024.
2. Commodity Futures Price Data: Two CSV files, one for each of the two different commodities assigned to your group. Each file contains the monthly futures price data of a specific commodity along with the eight corresponding factors described above. For information on the commodities assigned to your group, please refer to the Commodities selection for each group.csv. You may download the appropriate commodity files from the zip folder.

The datasets for each group can be found in the folder assigned to you. Each group will use the same S&P GSCI Index daily data file but will work with two assigned commodity futures monthly data files. The meanings of the commodity tickers are detailed in Table 1 at the end of this document.

1.4 Objectives

The primary objectives of this coursework are to:

• Comprehend and preprocess commodity futures price data accurately.
• Use regression analysis with factor models to elucidate the factor coefficients that determine the volatility of commodity futures prices.
• Evaluate the importance of various factors in affecting commodity futures prices and explain your choices.
• Validate your models and provide clear, concise, and insightful interpretations of your findings.

2 Tasks for Multiple Linear Regression

You will use only the S&P GSCI Index daily CSV data file for tasks A1) and A2), and the two commodity futures monthly CSV data files for tasks A3) to A6).

A1) The first task is to redo the Ch1 R-lab.
The goal is to apply the Ordinary Least Squares (OLS), Ridge Regression (RR), Slab Regression (SR), Stein (St), Diagonal Shrinkage (DSh), and Shrinkage (Sh) estimators to the data from Period 7 and use their β coefficients to predict the results of Period 8, which is an out-of-sample test. That is, the dependent (or target) variable is the log return of the S&P GSCI and the independent variables (or covariates, or features) are the log returns of the 35 commodities; further, change point detection is also needed. You are free to use any indicator to select the commodities you believe best represent and interpret the S&P GSCI Index. However, you must explain the reasons for selecting your chosen commodities from Period 7. Here are the commodity selection limits for each group:

• Group 1: Select up to 25 commodities.
• Group 2: Select up to 24 commodities.
• Group 3: Select up to 23 commodities.
• ... (continue this pattern for each group)
• Group 14: Select up to 12 commodities.

[Total marks for A1): 10 marks]

A2) Analyse the model results obtained in A1) and compare the performance of the OLS, RR, SR, St, DSh and Sh estimators. Evaluate the goodness-of-fit performance of each model for the Period 8 results, considering the actual events and the selected commodities. Compare the predicted results with the actual results, explaining any differences. Discuss why the selected commodities may or may not accurately represent and interpret the S&P GSCI Index and how various variables may affect the results. Provide a clear comparison of the performance of the models, highlighting which model performed best and explaining why it is effective. Interpret your results and use appropriate visualisations to make your points clear.

[Total marks for A2): 15 marks]

A3) Download the THREE CSV files and clean the data set to produce a factor model, including the OLS, RR, SR, St, DSh, and Sh estimators. That is, let $(X_{1,j}, X_{2,j}, \ldots, X_{8,j})$ be the factors of commodity $j$ (so $p = 8$ and $j = 1, 2$ for this task), which are explained in the introduction, and let $F_{j,t}$ be the futures price of commodity $j$ at time $t$. The Multiple Linear Regression (MLR) model is estimated as follows:

\[ R_{j,t} = \theta_0 + \theta_1 x_{1,j,t} + \theta_2 x_{2,j,t} + \ldots + \theta_8 x_{8,j,t} + \text{error term}, \tag{2.1} \]

where $R_{j,t}$ is the percentage change in the futures price at time $t$ for commodity $j$, which is given by

\[ R_{j,t} = \frac{F_{j,t} - F_{j,t-1}}{F_{j,t-1}}. \tag{2.2} \]

Apply all models individually to the entire set. That is, apply the six regression models (OLS, RR, SR, St, DSh and Sh) to each of the two commodities assigned to your group.

[Total marks for A3): 10 marks]

A4) Analyse the results of the estimated coefficients for each model and explain the impact of risk factors on commodity returns. Discuss how each factor affects returns and compare the results of the different estimators. Highlight any significant differences in the performance of the estimators and carefully analyse the reasons why some models may perform better than others. Interpret your results and use appropriate visualisations to make your points clear.

[Total marks for A4): 15 marks]

A5) Perform a robustness test on the factor model in A4). Refer to Chapter 2 for the definition of robustness tests in Real Data Analysis Objective 2. Perform these tests using 15%, 30%, 50% and 75% of the full dataset. Interpret your results and use appropriate visualisations to make your points clear.

Note 1: This note explains what a robustness test (using 15% of the full dataset) means for one commodity and one model, e.g., OLS. Note that we have 155 observations from 29 July 2011 to 31 May 2024. First, we estimate the OLS parameters, denoted as

\[ \hat{\theta}^{\text{Full}} = \left( \hat{\theta}^{\text{Full}}_0, \hat{\theta}^{\text{Full}}_1, \ldots, \hat{\theta}^{\text{Full}}_8 \right), \]

based on the sample of size 155. Second, we estimate the OLS parameters, denoted as

\[ \hat{\theta}^{\text{Red}} = \left( \hat{\theta}^{\text{Red}}_0, \hat{\theta}^{\text{Red}}_1, \ldots, \hat{\theta}^{\text{Red}}_8 \right), \]

based on the sample of size 23, since we use the first 23 observations from your sample; you might have noticed that $155 \times 15\% = 23.25$, which rounds down to 23 and clarifies why the first 23 observations are considered. Third, compute the L2 error between the two OLS vector estimates, which is given by

\[ \sqrt{ \left( \hat{\theta}^{\text{Red}}_0 - \hat{\theta}^{\text{Full}}_0 \right)^2 + \left( \hat{\theta}^{\text{Red}}_1 - \hat{\theta}^{\text{Full}}_1 \right)^2 + \ldots + \left( \hat{\theta}^{\text{Red}}_8 - \hat{\theta}^{\text{Full}}_8 \right)^2 }. \]

The L2 error tells you how robust your OLS model (based on only 15% of the full dataset) is by comparing it to the full OLS model (based on the full dataset).

Note 2: Redo the calculations from Note 1 for each regression model (OLS, RR, SR, St, DSh, and Sh) by considering robustness tests using 15%, 30%, 50% and 75% of the full dataset.

[Total marks for A5): 10 marks]

A6) Analyse and interpret the results obtained from the robustness tests in A5). Discuss how the model's performance changes with different dataset sizes and how these changes affect the reliability and stability of the models. Interpret your results and use appropriate visualisations to make your points clear.

[Total marks for A6): 10 marks]

[Total marks for tasks A1) to A6): 70 marks]

3 Tasks for Generalised Linear Models

You consider here only the two commodity futures monthly CSV data files and the risk factor model defined in A3).

B1) Our goal is to use the Generalised Linear Model (GLM) with the OLS, SR, St, and DSh estimators discussed in Chapter 3, and to consider a modified commodity return that follows a Gamma distribution, with the log and quadratic link functions, respectively. The first task is to redo the Ch3 R-lab, but using the same THREE CSV files provided to you and the modified risk model as follows.

For the log link function:

\[ R_{j,t} = \exp\left( \theta_0 + \theta_1 x_{1,j,t} + \theta_2 x_{2,j,t} + \ldots + \theta_8 x_{8,j,t} \right) \tag{3.1} \]

For the quadratic link function:

\[ R_{j,t} = \left( \theta_0 + \theta_1 x_{1,j,t} + \theta_2 x_{2,j,t} + \ldots + \theta_8 x_{8,j,t} \right)^2 \tag{3.2} \]

where

\[ R_{j,t} = \frac{F_{j,t}}{F_{j,t-1}}. \tag{3.3} \]

Divide the full dataset into a 50% training set and a 50% testing set.
Fit the models using the training set and evaluate their performance on the testing set. Calculate the Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) for each GLM on the i) training set and ii) testing set, and iii) compare the results of each GLM estimator. Discuss which estimator performed "best" and explain why. Interpret your results and use appropriate visualisations to make your points clear.

[Total marks for B1): 20 marks]

B2) Repeat the process in B1), but this time split the full dataset into a 70% training set and a 30% testing set. Interpret your results and use appropriate visualisations to make your points clear.

[Total marks for B2): 10 marks]

[Total marks for tasks B1) to B2): 30 marks]

4 Instructions

Please read the following instructions before starting your coursework.

1. The data are given per group. Therefore, each group should run its own analysis for the allocated dataset; otherwise, a 20% penalty (on the group mark) will be applied.
2. Each student will receive their group's mark unless the collaboration amongst the students falls apart, in which case the students should report this incident via e-mail, copying in all members of the group, by 01/07/2024.
3. Split the work amongst all group members and provide a careful discussion of your results.
4. You may use any piece of software that you are comfortable with, but annotate the code so that the marker is able to follow your work.
5. The group feedback will be given on Moodle within three weeks after this coursework is due.
6. Submit your entire report and code via Moodle. The quality of your presentation (coding, results and methods, etc.) represents 25% of the final mark.
7. Late submissions incur penalties that are applied by your Course Officer, in accordance with the general rules that have been applied to your coursework over the academic year.
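The split and error measures requested in B1) and B2) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the module's R-lab code: the function names `chronological_split` and `error_metrics` are hypothetical, and taking the first observations as the training set (rather than a random split) and rounding the training size down are assumptions that the brief does not specify.

```python
import numpy as np


def chronological_split(X, y, train_fraction):
    """Split (X, y) into a leading training block and a trailing testing block.

    Assumption: the split is by time order with no shuffling, and the
    training size is rounded down; the brief does not state either choice.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_train = int(np.floor(train_fraction * len(y)))
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def error_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE between observed and predicted values."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
    }


if __name__ == "__main__":
    # Synthetic stand-in for one commodity: 155 months, 8 factors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(155, 8))
    y = np.exp(0.01 * X[:, 0] + rng.normal(scale=0.02, size=155))

    X_tr, y_tr, X_te, y_te = chronological_split(X, y, 0.5)
    # Placeholder "model": predict the training mean everywhere.
    baseline = np.full_like(y_te, y_tr.mean())
    print(error_metrics(y_te, baseline))
```

Each fitted GLM would replace the placeholder prediction, and the same `error_metrics` call can be applied to both the training and the testing predictions so that the two sets of errors are directly comparable.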
Table 1: Commodity Futures and Their Exchanges

Category            Commodity futures    Exchange    Ticker
Energy              Brent Crude Oil      ICE         CO1
                    Gasoil Petroleum     ICE         QS1
                    Gasoline             NYMEX       XB1
                    Heating Oil          NYMEX       HO1
                    Natural Gas          NYMEX       NG1
                    Propane              NYMEX       PN1
                    WTI Crude Oil        NYMEX       CL1
Grains & Oilseeds   Canola               WCE         RS1
                    Corn                 CBOT        C 1
                    Oats                 CBOT        O 1
                    Rough Rice           CBOT        RR1
                    Soybean Meal         CBOT        SM1
                    Soybean Oil          CBOT        BO1
                    Soybeans             CBOT        S 1
                    Wheat                CBOT        W 1
Livestock           Feeder Cattle        CME         FC1
                    Lean Hogs            CME         LH1
                    Live Cattle          CME         LC1
                    Pork Belly           CME         PB1
Metals              Aluminium            LME         LA1
                    Copper               CMX         HG1
                    Gold                 CMX         GC1
                    Lead                 LME         LL1
                    Nickel               LME         LN1
                    Palladium            NYMEX       PA1
                    Platinum             NYMEX       PL1
                    Silver               CMX         SI1
                    Tin                  LME         LT1
                    Zinc                 LME         LX1
Softs               Cocoa                NYB         CC1
                    Coffee               NYB         KC1
                    Cotton               NYB         CT1
                    Ethanol              CME         DL1
                    Lumber               CME         LB1
                    Milk                 CME         DA1
                    Orange Juice         ICE         JO1
                    Rubber               OSE         JN1
                    Sugar                NYB         SB1

Note: CBOT: Chicago Board of Trade, CME: Chicago Mercantile Exchange, COMEX (CMX): Commodity Exchange, ICE: Intercontinental Exchange, LME: London Metal Exchange, NYB: New York Board of Trade, NYMEX: New York Mercantile Exchange, OSE: Osaka Exchange, WCE: Winnipeg Commodity Exchange
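As a supplementary illustration of the robustness test described in Note 1 of A5), the calculation for the OLS estimator can be sketched as below. This is a hedged sketch in Python with NumPy rather than the module's R-lab code: the helper names `ols_coefficients` and `robustness_l2_error` are hypothetical, the data are synthetic, and rounding the reduced sample size down for every percentage is an assumption (the brief confirms it only for 15%, where 155 observations give 23).

```python
import numpy as np


def ols_coefficients(X, y):
    """Return the OLS estimate (intercept first, then one slope per factor)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta


def robustness_l2_error(X, y, fraction):
    """L2 distance between the reduced-sample and full-sample OLS estimates.

    The reduced sample is the first floor(fraction * n) observations,
    e.g. floor(0.15 * 155) = 23 (rounding down is an assumption for the
    other percentages).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_red = int(np.floor(fraction * len(y)))
    theta_full = ols_coefficients(X, y)
    theta_red = ols_coefficients(X[:n_red], y[:n_red])
    return float(np.linalg.norm(theta_red - theta_full))


if __name__ == "__main__":
    # Synthetic stand-in for one commodity: 155 monthly returns, 8 factors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(155, 8))
    true_theta = rng.normal(scale=0.05, size=8)
    y = 0.01 + X @ true_theta + rng.normal(scale=0.02, size=155)

    for fraction in (0.15, 0.30, 0.50, 0.75):
        print(fraction, robustness_l2_error(X, y, fraction))
```

For Note 2, the same comparison would be repeated with `ols_coefficients` swapped for each of the other estimators (RR, SR, St, DSh, Sh), which are not implemented in this sketch.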