BFF5555 - Seminar 6 - Tree Based Methods
Reading: Chapter 8
Binh Do
2025
1 / 34

Learning objectives
Understand the fundamental concepts and structure of decision trees in the context of financial machine learning.
Learn how to grow and interpret both regression and classification trees to make predictions.
Analyse the bias-variance trade-off and how it can be managed through techniques such as pruning.
Explore ensemble methods, including bagging and random forests, and their role in improving model accuracy.
Gain insight into the concept of feature importance and its application in tree-based models.
2 / 34

Tree-based methods - the idea
Tree-based methods involve splitting the training data into subsets, called regions or categories.
This is done recursively, based on the most significant feature at each split.
A prediction for a new observation is made by identifying which region that observation belongs to, then assigning the value of that region as the predicted value for the observation.
Since the set of splitting rules can be summarised in a tree, these approaches are also known as decision tree methods.
3 / 34

Pros and Cons
Tree-based methods are simple and useful for interpretation.
However, they are typically not competitive with the best supervised learning approaches in terms of prediction accuracy.
Hence, we also discuss bagging and random forests, two popular ensemble techniques. These methods grow multiple trees, which are then combined to yield a single consensus prediction.
Combining a large number of trees can result in material improvements in prediction accuracy, at the expense of some loss of interpretability.
4 / 34

The Basics of Decision Trees
Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.
To make things concrete, we will use a simple example of building a model that describes how monthly stock returns are related to Size and Value (B/M), based on 1000 observations.
5 / 34

Linear regression
If we choose to describe the relationship via a linear regression, the model is:
R_i = \alpha + \beta_1 Size_i + \beta_2 Value_i + \epsilon_i
If we wish to allow for interaction between these two predictors, we add an interaction term:
R_i = \alpha + \beta_1 Size_i + \beta_2 Value_i + \beta_3 Size_i \times Value_i + \epsilon_i
To fit the model to the data, we perform OLS estimation. To predict the return for stock j, we plug stock j's values into the estimated model:
\hat{R}_j = \hat{\alpha} + \hat{\beta}_1 Size_j + \hat{\beta}_2 Value_j + \hat{\beta}_3 Size_j \times Value_j
6 / 34

Regression tree
A regression tree seeks to find the relationship between returns and Size and Value by sorting the observations into J regions, or categories, such that the residual sum of squares (RSS) is smallest, where
RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
Here the predicted value for each observation is the average response value in the region that observation belongs to, i.e. \hat{y}_{R_j}.
We will see how this sorting is done later. For now, suppose that when fit to the 1000 observations, the regression tree looks like the one on the next slide.
7 / 34

Regression tree
8 / 34

Describing the tree
The 1000 observations are first split into two subsets, depending on whether their Size value is less than 0.5 or greater than or equal to 0.5.
For stocks with Size < 0.5, a second split is applied: stocks with Value less than 0.3 are placed in Category 1, and stocks with Value greater than or equal to 0.3 are placed in Category 2.
For stocks with Size ≥ 0.5, no further split is applied, and they are placed in Category 3.
The categories are known as the terminal nodes or leaf nodes of the tree.
The points along the tree where the predictor space is split are referred to as internal nodes.
The segments that connect the nodes are called branches of the tree.
Finally, regression trees are typically drawn upside down, with the leaves at the bottom.
9 / 34

Interpreting the tree
According to this simple tree, Size and Value are the only factors determining returns.
For large stocks (Size ≥ 0.5), Value (i.e. Book-to-Market) plays little role in determining returns.
For smaller stocks (Size < 0.5), Value is a relevant factor.
If we want to use this tree to predict the return of a stock that has Size = 0.4 and Value = 0.35, for example, the predicted return will be the average return of all the stocks in Category 2.
10 / 34
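To make the prediction step concrete, the sketch below encodes the three-region tree described above as explicit splitting rules in Python. It is an illustrative sketch only: the per-category average returns are hypothetical placeholders, since the slides do not report them.

```python
# Hypothetical leaf averages; in practice each is the mean return of the
# training stocks that fall into that category (terminal node).
CATEGORY_MEANS = {1: 0.012, 2: 0.008, 3: 0.005}

def predict_return(size, value):
    """Follow the splits of the example tree to a leaf and return its mean."""
    if size < 0.5:
        if value < 0.3:
            return CATEGORY_MEANS[1]   # Category 1: Size < 0.5, Value < 0.3
        return CATEGORY_MEANS[2]       # Category 2: Size < 0.5, Value >= 0.3
    return CATEGORY_MEANS[3]           # Category 3: Size >= 0.5

# The slide's example: Size = 0.4 and Value = 0.35 falls into Category 2.
print(predict_return(0.4, 0.35))
```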
How to fit a tree to the data
Recall that fitting a regression tree means finding a set of J regions R_1, R_2, ..., R_J that minimises the RSS:
RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
where \hat{y}_{R_j} is the mean response for the training observations within the jth box.
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes.
A common solution is to take a top-down, greedy approach known as recursive binary splitting.
Top-down: it begins at the top of the tree (at which point all observations belong to one region).
Greedy: at each step in the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
11 / 34

How to fit a regression tree to the data: in detail
Starting with all observations in one single region, the RSS takes this value:
RSS = \sum_{i=1}^{N} (y_i - \hat{y})^2
where \hat{y} is the mean response value across all observations.
We then select a predictor X_j and cutoff point s that split the observations into two regions:
R_1(j, s) = \{X \mid X_j < s\} and R_2(j, s) = \{X \mid X_j \geq s\}
and we seek the values of j and s that minimise this quantity:
\sum_{i: x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2
where \hat{y}_{R_1} and \hat{y}_{R_2} are the mean response values in Regions 1 and 2 respectively.
12 / 34

How to fit a regression tree to the data: in detail
Finding the values of j and s that minimise this quantity can be done quite quickly, especially when the number of predictors is not too large.
Next, we repeat the process, looking for the best predictor and best cutpoint to split the data further, so as to minimise the RSS within each of the resulting regions.
The process continues until a stopping criterion is reached; for instance:
We may continue until no region contains more than five observations.
Or continue until the maximum number of splits (max depth) has been reached.
13 / 34

How to fit a regression tree to the data: some notes
Note 1: A feature can be chosen for more than one split.
Note 2: After a split results in two regions, one of these regions may not participate in any subsequent split.
A five-region example on the next slide illustrates these points.
14 / 34

A five-region example
15 / 34

How large should we grow the tree?
A very large tree with many terminal nodes might overfit the data, leading to poor test-set performance: low bias but high variance.
A very small tree with few terminal nodes might not capture the important structure: low variance but high bias.
One way to address this issue is pruning: grow a full tree, then prune it back.
Another approach is to use an ensemble technique: grow a large number of full trees, make predictions using each of these trees, then average the results.
16 / 34
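The greedy step described above, scanning every predictor and every cutoff for the split that most reduces the RSS, can be written in a few lines. The sketch below is illustrative only (the function and toy data are made up for the example), not code from the seminar.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss): the predictor index and cutoff that minimise the
    sum of squared errors over the two resulting regions."""
    best_j, best_s, best_rss = None, None, np.inf
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = X[:, j] < s, X[:, j] >= s
            if left.sum() == 0 or right.sum() == 0:
                continue                      # skip splits that leave a region empty
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy usage with two predictors (think Size and Value) and a return column.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 0.010, 0.005) + rng.normal(0, 0.002, 200)
print(best_split(X, y))    # should pick predictor 0 with a cutoff near 0.5
```

A full tree is then grown by applying this search recursively within each resulting region until a stopping rule, such as a minimum node size or maximum depth, is reached.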
Pruning a tree using cost-complexity tuning
In this approach, we first grow a very large tree T_0 with |T_0| terminal nodes.
We then find a subtree that minimises the following quantity, called the cost-complexity criterion:
\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|
in which α is a chosen non-negative number governing the trade-off between the tree's size and its goodness of fit to the data.
Here |T| denotes the number of terminal nodes of the tree T, R_m is the rectangle (i.e. the subset of the predictor space) corresponding to the mth terminal node, and \hat{y}_{R_m} is the mean of the training observations in R_m.
17 / 34

Choosing the best subtree
The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data.
We select an optimal value \hat{\alpha} using cross-validation.
We then return to the full data set and obtain the subtree corresponding to \hat{\alpha}.
18 / 34

Example: Monthly stock returns
Using the Stocks short dataset, we fit the 2019-03 cross-sectional data to a regression tree.
The response variable is excess rets, and the predictor variables are stock characteristics and macroeconomic variables (102 in total).
Without any restriction on the depth of the tree, we would end up with a very bushy tree with 5271 leaf nodes (for 5272 observations!).
To illustrate pruning visually, the next slide shows a tree grown to a depth of 3. The subsequent slide shows a tree that has been pruned from that depth-3 tree using cost-complexity pruning.
19 / 34

The unpruned tree
20 / 34

The final pruned tree
21 / 34

Classification Trees
Before turning to ensemble methods as a way to address the bias-variance trade-off in decision trees, let's discuss applying decision trees to classification problems.
A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
There are two differences. First, the splitting is done by minimising classification errors instead of the RSS.
Second, when making a classification prediction for an observation, we find the leaf node of the tree that the observation belongs to, then assign to it the most commonly occurring class among the training observations in that region.
22 / 34

Details of classification trees
Just as in the regression setting, we use recursive binary splitting to grow a classification tree.
As indicated, one criterion for the splitting decision is the classification error rate. This is simply the fraction of the training observations in a region that do not belong to the most common class.
However, in practice the Gini index and cross-entropy are the preferred criteria of node impurity.
The next slides show how they are calculated and why they are preferred over the classification error.
23 / 34

Gini index and Cross-entropy
The Gini index for node m is defined by
Gini_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})
where \hat{p}_{mk} is the proportion of observations in node m that belong to class k.
For binary classification, Gini_m = 2p(1 - p).
We can see that the Gini index takes on a small value if all of the \hat{p}_{mk}'s are close to zero or one, indicating that the node is pure, i.e. the node contains predominantly observations from a single class.
An alternative to the Gini index is cross-entropy, which for binary classification is:
Cross-entropy = -(p \log p + (1 - p) \log(1 - p))
24 / 34

Different measures of impurity at each node
Unlike the misclassification error, the Gini index and cross-entropy functions are differentiable, and hence amenable to numerical optimisation.
25 / 34
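To compare the three impurity measures numerically, here is a small illustrative sketch (not from the seminar) that evaluates each of them for a binary node as a function of p, the proportion of observations in the node belonging to class 1.

```python
import numpy as np

def misclassification_error(p):
    """Fraction of observations not in the majority class."""
    return 1 - np.maximum(p, 1 - p)

def gini(p):
    """Binary Gini index: 2p(1 - p)."""
    return 2 * p * (1 - p)

def cross_entropy(p):
    """Binary cross-entropy: -(p log p + (1 - p) log(1 - p))."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at the boundaries
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# All three measures are smallest for a (nearly) pure node and peak at p = 0.5.
for p in (0.01, 0.25, 0.50, 0.90):
    print(p, misclassification_error(p), gini(p), cross_entropy(p))
```

Plotting the three functions over p in [0, 1] gives the usual picture: all peak at p = 0.5, but only the Gini index and cross-entropy are smooth, which is why they are preferred when growing the tree.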
How to fit a binary classification tree: in detail
Starting with all observations in one single region, the Gini index (or another impurity measure) takes this value:
Gini = 2p(1 - p)
where p is the proportion of observations in the region belonging to one of the two classes.
We then recursively select a predictor X_j and cutoff point s that split the observations into two regions:
R_1(j, s) = \{X \mid X_j < s\} and R_2(j, s) = \{X \mid X_j \geq s\}
and we seek the values of j and s that minimise the following weighted impurity measure:
\frac{n_{R_1}}{N} \cdot Gini(R_1) + \frac{n_{R_2}}{N} \cdot Gini(R_2)
where n_{R_1} and n_{R_2} are the numbers of observations in each region.
26 / 34

Example: a classification tree
27 / 34

Advantages and Disadvantages of Trees
Advantages:
Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
Trees can easily handle qualitative predictors without the need to create dummy variables.
Disadvantages:
Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this unit.
However, by aggregating many decision trees, the predictive performance of trees can be materially improved. We introduce these ensemble concepts next.
28 / 34

Bagging
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
Recall that given a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean \bar{Z} of the observations is given by σ²/n. In other words, averaging a set of observations reduces variance.
A natural idea, then, is to fit the method on many separate training sets and average the resulting predictions. Of course, this is not practical, because we generally do not have access to multiple training sets.
29 / 34

Bagging - continued
Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get \hat{f}^{*b}(x), the prediction at a point x.
We then average all the predictions to obtain
\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
This is called bagging.
30 / 34

Bagging classification trees
The above description applies to regression trees. For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions!
31 / 34

Random Forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
As in bagging, we build a number of decision trees on bootstrapped training samples.
But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (for example, 4 out of a set of 13 predictors).
32 / 34
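To illustrate the bagging recipe above, and the random forest tweak of restricting each split to roughly √p candidate predictors, here is a brief sketch using scikit-learn. The synthetic data and parameter choices are placeholders for illustration, not taken from the seminar.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                  # 13 hypothetical predictors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Bagging by hand: fit B full trees on bootstrap resamples, then average.
B, preds = 100, []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(X))
f_bag = np.mean(preds, axis=0)                  # the bagged prediction

# Random forest: same idea, but each split only considers m ~ sqrt(p) predictors.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
```

Setting max_features equal to the total number of predictors would recover plain bagging; the "sqrt" choice is the decorrelating tweak described above.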
Feature importance measure
Feature importance in a decision tree measures how valuable each feature is in making predictions.
In a single tree, it is computed from the reduction in impurity that a feature provides at each split in the tree (measured by the Gini index in classification or the RSS in regression).
For each node, the decrease in impurity caused by splitting on a feature is calculated. The importance of a feature is the sum of these decreases across all nodes where the feature is used.
The importance values are then normalised so that they sum to 1. A large value indicates an important predictor.
For bagged or random forest trees, we record the total amount by which the impurity is decreased due to splits over a given predictor, averaged over all B trees. (A short sketch of extracting these values appears after the summary slide.)
33 / 34

Summary
Decision trees are simple and interpretable models for regression and classification.
However, they are often not competitive with other methods in terms of prediction accuracy.
Bagging and random forests are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
The latter two methods are among the state-of-the-art methods for supervised learning. However, they do not have great interpretability.
34 / 34
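As noted on the feature importance slide, the sketch below shows one way to read impurity-based importances off a fitted random forest with scikit-learn. The data and feature names are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
feature_names = ["Size", "Value", "Momentum", "Volatility"]   # hypothetical predictors
X = rng.normal(size=(500, len(feature_names)))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=500)

rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=1)
rf.fit(X, y)

# feature_importances_ holds the impurity decrease attributed to each predictor,
# accumulated over its splits, averaged over trees, and normalised to sum to 1.
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
```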