BFF5555 - Seminar 6 - Tree Based Methods
Reading: Chapter 8
Binh Do
2025
1 / 34

Learning objectives
Understand the fundamental concepts and structure of decision trees in the context of financial machine learning.
Learn how to grow and interpret both regression and classification trees to make predictions.
Analyse the bias-variance trade-off and how it can be managed through techniques such as pruning.
Explore ensemble methods, including bagging and random forests, and their role in improving model accuracy.
Gain insight into the concept of feature importance and its application in tree-based models.
2 / 34

Tree-based methods - the idea
Tree-based methods involve splitting the training data into subsets, called regions or categories.
This is done recursively, based on the most significant feature at each split.
A prediction for a new observation is made by identifying which region that observation belongs to, then assigning the value of that region as the predicted value for the observation.
Since the set of splitting rules can be summarised in a tree, these approaches are also known as decision tree methods.
3 / 34

Pros and Cons
Tree-based methods are simple and useful for interpretation.
However, they are typically not competitive with the best supervised learning approaches in terms of prediction accuracy.
Hence, we also discuss bagging and random forests, two popular ensemble techniques. These methods grow multiple trees, which are then combined to yield a single consensus prediction.
Combining a large number of trees can result in material improvements in prediction accuracy, at the expense of some loss of interpretability.
4 / 34

The Basics of Decision Trees
Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.
To make things concrete, we will use a simple example of building a model that describes how monthly stock returns are related to Size and Value (B/M), based on 1000 observations.
5 / 34

Linear regression
If we choose to describe the relationship via a linear regression, the model is:
R_i = \alpha + \beta_1 Size_i + \beta_2 Value_i + \epsilon_i
If we wish to allow for interaction between these two predictors, we add an interaction term:
R_i = \alpha + \beta_1 Size_i + \beta_2 Value_i + \beta_3 Size_i \times Value_i + \epsilon_i
To fit the model to the data, we perform OLS estimation. To predict the return for stock j, we plug stock j's values into the estimated model:
\hat{R}_j = \hat{\alpha} + \hat{\beta}_1 Size_j + \hat{\beta}_2 Value_j + \hat{\beta}_3 Size_j \times Value_j
6 / 34

Regression tree
A regression tree seeks to find the relationship between returns and Size and Value by sorting the observations into J regions, or categories, such that the residual sum of squares (RSS) is smallest, where
RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
Here the predicted value for each observation is the average response value in the region that observation belongs to, i.e. \hat{y}_{R_j}.
We will see how this sorting is done later. For now, suppose that when fit to the 1000 observations, the regression tree looks like the one on the next slide.
7 / 34

Regression tree
8 / 34

Describing the tree
The 1000 observations are first split into two subsets, depending on whether their Size value is less than 0.5 or greater than or equal to 0.5.
For stocks with Size < 0.5, a second split is applied: stocks with Value less than 0.3 are placed in Category 1, and stocks with Value greater than or equal to 0.3 are placed in Category 2.
For stocks with Size ≥ 0.5, no further split is applied, and they are placed in Category 3.
The categories are known as the terminal nodes or leaf nodes of the tree.
The points along the tree where the predictor space is split are referred to as internal nodes.
The segments that connect the nodes are called branches of the tree.
Finally, regression trees are typically drawn upside down, with the leaves at the bottom.
9 / 34

Interpreting the tree
According to this simple tree, Size and Value are the only factors determining returns.
For large stocks (Size ≥ 0.5), Value (i.e. Book-to-Market) plays little role in determining returns.
For smaller stocks (Size < 0.5), Value is a relevant factor.
If we want to use this tree to predict the return of a stock that has Size = 0.4 and Value = 0.35, for example, the predicted return will be the average return of all the stocks in Category 2.
10 / 34
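To make the prediction step concrete, the sketch below encodes the three-region tree described above as explicit splitting rules in Python. It is an illustrative sketch only: the per-category average returns are hypothetical placeholders, since the slides do not report them.

```python
# Hypothetical leaf averages; in practice each is the mean return of the
# training stocks that fall into that category (terminal node).
CATEGORY_MEANS = {1: 0.012, 2: 0.008, 3: 0.005}

def predict_return(size, value):
    """Follow the splits of the example tree to a leaf and return its mean."""
    if size < 0.5:
        if value < 0.3:
            return CATEGORY_MEANS[1]   # Category 1: Size < 0.5, Value < 0.3
        return CATEGORY_MEANS[2]       # Category 2: Size < 0.5, Value >= 0.3
    return CATEGORY_MEANS[3]           # Category 3: Size >= 0.5

# The slide's example: Size = 0.4 and Value = 0.35 falls into Category 2.
print(predict_return(0.4, 0.35))
```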
How to fit a tree to the data
Recall that fitting a regression tree means finding a set of J regions R_1, R_2, ..., R_J that minimises the RSS:
RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
where \hat{y}_{R_j} is the mean response for the training observations within the jth box.
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes.
A common solution is to take a top-down, greedy approach known as recursive binary splitting.
Top-down: it begins at the top of the tree (at which point all observations belong to one region).
Greedy: at each step in the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
11 / 34

How to fit a regression tree to the data: in detail
Starting with all observations in one single region, the RSS takes this value:
RSS = \sum_{i=1}^{N} (y_i - \hat{y})^2
where \hat{y} is the mean response value across all observations.
We then select a predictor X_j and cutoff point s that split the observations into two regions:
R_1(j, s) = \{X \mid X_j < s\} and R_2(j, s) = \{X \mid X_j \geq s\}
and we seek the values of j and s that minimise this quantity:
\sum_{i: x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2
where \hat{y}_{R_1} and \hat{y}_{R_2} are the mean response values in Regions 1 and 2 respectively.
12 / 34

How to fit a regression tree to the data: in detail
Finding the values of j and s that minimise this quantity can be done quite quickly, especially when the number of predictors is not too large.
Next, we repeat the process, looking for the best predictor and best cutpoint to split the data further, so as to minimise the RSS within each of the resulting regions.
The process continues until a stopping criterion is reached; for instance:
We may continue until no region contains more than five observations.
Or continue until the maximum number of splits (max depth) has been reached.
13 / 34

How to fit a regression tree to the data: some notes
Note 1: A feature can be chosen for more than one split.
Note 2: After a split results in two regions, one of these regions may not participate in any subsequent split.
A five-region example on the next slide illustrates these points.
14 / 34

A five-region example
15 / 34

How large should we grow the tree?
A very large tree with many terminal nodes might overfit the data, leading to poor test-set performance: low bias but high variance.
A very small tree with few terminal nodes might not capture the important structure: low variance but high bias.
One way to address this issue is pruning: grow a full tree, then prune it back.
Another approach is to use an ensemble technique: grow a large number of full trees, make predictions using each of these trees, then average the results.
16 / 34
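The greedy step described above, scanning every predictor and every cutoff for the split that most reduces the RSS, can be written in a few lines. The sketch below is illustrative only (the function and toy data are made up for the example), not code from the seminar.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss): the predictor index and cutoff that minimise the
    sum of squared errors over the two resulting regions."""
    best_j, best_s, best_rss = None, None, np.inf
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = X[:, j] < s, X[:, j] >= s
            if left.sum() == 0 or right.sum() == 0:
                continue                      # skip splits that leave a region empty
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy usage with two predictors (think Size and Value) and a return column.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 0.010, 0.005) + rng.normal(0, 0.002, 200)
print(best_split(X, y))    # should pick predictor 0 with a cutoff near 0.5
```

A full tree is then grown by applying this search recursively within each resulting region until a stopping rule, such as a minimum node size or maximum depth, is reached.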
Pruning a tree using cost-complexity tuning
In this approach, we first grow a very large tree T_0 with |T_0| terminal nodes.
We then find a subtree that minimises the following quantity, called the cost-complexity criterion:
\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|
in which α is a chosen non-negative number governing the trade-off between the tree's size and its goodness of fit to the data.
Here |T| denotes the number of terminal nodes of the tree T, R_m is the rectangle (i.e. the subset of the predictor space) corresponding to the mth terminal node, and \hat{y}_{R_m} is the mean of the training observations in R_m.
17 / 34

Choosing the best subtree
The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data.
We select an optimal value \hat{\alpha} using cross-validation.
We then return to the full data set and obtain the subtree corresponding to \hat{\alpha}.
18 / 34

Example: Monthly stock returns
Using the Stocks short dataset, we fit the 2019-03 cross-sectional data to a regression tree.
The response variable is excess rets, and the predictor variables are stock characteristics and macroeconomic variables (102 in total).
Without any restriction on the depth of the tree, we would end up with a very bushy tree with 5271 leaf nodes (for 5272 observations!).
To illustrate pruning visually, the next slide shows a tree grown to a depth of 3. The subsequent slide shows a tree that has been pruned from that depth-3 tree using cost-complexity pruning.
19 / 34

The unpruned tree
20 / 34

The final pruned tree
21 / 34

Classification Trees
Before turning to ensemble methods as a way to address the bias-variance trade-off in decision trees, let's discuss applying decision trees to classification problems.
A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
There are two differences. First, the splitting is done by minimising classification errors instead of the RSS.
Second, when making a classification prediction for an observation, we find the leaf node of the tree that the observation belongs to, then assign to it the most commonly occurring class among the training observations in that region.
22 / 34

Details of classification trees
Just as in the regression setting, we use recursive binary splitting to grow a classification tree.
As indicated, one criterion for the splitting decision is the classification error rate. This is simply the fraction of the training observations in a region that do not belong to the most common class.
However, in practice the Gini index and cross-entropy are the preferred criteria of node impurity.
The next slides show how they are calculated and why they are preferred over the classification error.
23 / 34

Gini index and Cross-entropy
The Gini index for node m is defined by
Gini_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})
where \hat{p}_{mk} is the proportion of observations in node m that belong to class k.
For binary classification, Gini_m = 2p(1 - p).
We can see that the Gini index takes on a small value if all of the \hat{p}_{mk}'s are close to zero or one, indicating that the node is pure, i.e. the node contains predominantly observations from a single class.
An alternative to the Gini index is cross-entropy, which for binary classification is:
Cross-entropy = -(p \log p + (1 - p) \log(1 - p))
24 / 34

Different measures of impurity at each node
Unlike the misclassification error, the Gini index and cross-entropy functions are differentiable, and hence amenable to numerical optimisation.
25 / 34
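To compare the three impurity measures numerically, here is a small illustrative sketch (not from the seminar) that evaluates each of them for a binary node as a function of p, the proportion of observations in the node belonging to class 1.

```python
import numpy as np

def misclassification_error(p):
    """Fraction of observations not in the majority class."""
    return 1 - np.maximum(p, 1 - p)

def gini(p):
    """Binary Gini index: 2p(1 - p)."""
    return 2 * p * (1 - p)

def cross_entropy(p):
    """Binary cross-entropy: -(p log p + (1 - p) log(1 - p))."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at the boundaries
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# All three measures are smallest for a (nearly) pure node and peak at p = 0.5.
for p in (0.01, 0.25, 0.50, 0.90):
    print(p, misclassification_error(p), gini(p), cross_entropy(p))
```

Plotting the three functions over p in [0, 1] gives the usual picture: all peak at p = 0.5, but only the Gini index and cross-entropy are smooth, which is why they are preferred when growing the tree.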
How to fit a binary classification tree: in detail
Starting with all observations in one single region, the Gini index (or another impurity measure) takes this value:
Gini = 2p(1 - p)
where p is the proportion of observations in the region belonging to one of the two classes.
We then recursively select a predictor X_j and cutoff point s that split the observations into two regions:
R_1(j, s) = \{X \mid X_j < s\} and R_2(j, s) = \{X \mid X_j \geq s\}
and we seek the values of j and s that minimise the following weighted impurity measure:
\frac{n_{R_1}}{N} \cdot Gini(R_1) + \frac{n_{R_2}}{N} \cdot Gini(R_2)
where n_{R_1} and n_{R_2} are the numbers of observations in each region.
26 / 34

Example: a classification tree
27 / 34

Advantages and Disadvantages of Trees
Advantages:
Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
Trees can easily handle qualitative predictors without the need to create dummy variables.
Disadvantages:
Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this unit.
However, by aggregating many decision trees, the predictive performance of trees can be materially improved. We introduce these ensemble concepts next.
28 / 34

Bagging
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
Recall that given a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean \bar{Z} of the observations is given by σ²/n. In other words, averaging a set of observations reduces variance.
A natural idea, then, is to fit the method on many separate training sets and average the resulting predictions. Of course, this is not practical, because we generally do not have access to multiple training sets.
29 / 34

Bagging - continued
Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get \hat{f}^{*b}(x), the prediction at a point x.
We then average all the predictions to obtain
\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
This is called bagging.
30 / 34

Bagging classification trees
The above description applies to regression trees. For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions!
31 / 34

Random Forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
As in bagging, we build a number of decision trees on bootstrapped training samples.
But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (for example, 4 out of a set of 13 predictors).
32 / 34
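To illustrate the bagging recipe above, and the random forest tweak of restricting each split to roughly √p candidate predictors, here is a brief sketch using scikit-learn. The synthetic data and parameter choices are placeholders for illustration, not taken from the seminar.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                  # 13 hypothetical predictors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Bagging by hand: fit B full trees on bootstrap resamples, then average.
B, preds = 100, []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(X))
f_bag = np.mean(preds, axis=0)                  # the bagged prediction

# Random forest: same idea, but each split only considers m ~ sqrt(p) predictors.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
```

Setting max_features equal to the total number of predictors would recover plain bagging; the "sqrt" choice is the decorrelating tweak described above.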
Feature importance measure
Feature importance in a decision tree measures how valuable each feature is in making predictions.
In a single tree, it is computed from the reduction in impurity that a feature provides at each split in the tree (measured by the Gini index in classification or the RSS in regression).
For each node, the decrease in impurity caused by splitting on a feature is calculated. The importance of a feature is the sum of these decreases across all nodes where the feature is used.
The importance values are then normalised so that they sum to 1. A large value indicates an important predictor.
For bagged or random forest trees, we record the total amount by which the impurity is decreased due to splits over a given predictor, averaged over all B trees. (A short sketch of extracting these values appears after the summary slide.)
33 / 34

Summary
Decision trees are simple and interpretable models for regression and classification.
However, they are often not competitive with other methods in terms of prediction accuracy.
Bagging and random forests are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
The latter two methods are among the state-of-the-art methods for supervised learning. However, they do not have great interpretability.
34 / 34
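As noted on the feature importance slide, the sketch below shows one way to read impurity-based importances off a fitted random forest with scikit-learn. The data and feature names are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
feature_names = ["Size", "Value", "Momentum", "Volatility"]   # hypothetical predictors
X = rng.normal(size=(500, len(feature_names)))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=500)

rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=1)
rf.fit(X, y)

# feature_importances_ holds the impurity decrease attributed to each predictor,
# accumulated over its splits, averaged over trees, and normalised to sum to 1.
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
```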