ST443: Machine Learning and Data Mining
Tengyao Wang
London School of Economics and Political Science
Lecture 7: Trees and forests
8 Nov 2022
Key learning points
I Decision trees
I How to prune a tree
I Gradient boosting
I Bootstrap aggregation
I Random forest
Tree-based methods
I We focus on the tree-based methods for regression and classification.
I These involve stratifying or segmenting the predictor space into a number
of simple regions.
I Since the set of splitting rules used to segment the predictor space can be
summarized in a tree, these types of approaches are known as decision tree
methods.
Baseball player salary data
I Salary is colored from low (blue, green) to high (yellow, red).
I How would you stratify it?
Decision tree for the data
R1 = {Years < 4.5}, R2 = {Years ≥ 4.5 and Hits < 117.5} and
R3 = {Years ≥ 4.5 and Hits ≥ 117.5}.
Details for the decision tree
I In keeping with the tree analogy, the regions R1, R2 and R3 are known as
terminal nodes or leaves of the tree.
I Decision trees are typically drawn upside down, in the sense that the leaves are
at the bottom of the tree.
I The points along the tree where the predictor space is split are referred to as
internal nodes.
I In the Hitters tree, the two internal nodes are indicated by the text
Years < 4.5 and Hits < 117.5.
I We refer to the segments of the trees that connect the nodes as branches.
Interpretation of the decision tree results
I Years is the most important factor in determining Salary and players with
less experience earn lower salaries than more experienced players.
I Given that a player is less experienced, the number of Hits that he made in
the previous year seems to play little role in his Salary.
I But among players who have been in the major leagues for five or more
years, the number of Hits made in the previous year does affect Salary,
and players who made more Hits last year tend to have higher salaries.
I The figure is likely an over-simplification, but compared to a regression
model, it is easy to display and interpret.
Details in tree building process
I We divide the predictor/covariate space, i.e. the set of possible values for
X1, . . . , Xp into J distinct and non-overlapping regions R1, R2, . . . , RJ .
I For every observation that falls into the region Rj , we make the same
prediction, which is simply the mean of the response values for the training
observations in Rj .
I In theory, the regions could have any shape. Decision trees divide the
predictor space into axis-aligned high dimensional rectangles for simplicity
and interpretability.
I Our target is to find rectangles R1, . . . , RJ that minimize the RSS
  \sum_{j=1}^{J} \sum_{i : x_i \in R_j} (y_i - \hat{y}_{R_j})^2,
  where \hat{y}_{R_j} = N_j^{-1} \sum_{i : x_i \in R_j} y_i and N_j = \#\{i : x_i \in R_j\}.
More details in tree building process
I It is computationally infeasible to consider every possible partition of the
feature space into J rectangles.
I We take a top-down, greedy approach that is known as recursive
binary splitting.
I Top-down: it begins at the top of the tree and then successively splits the
predictor space; each split is indicated via two new branches further down
the tree.
I Greedy: at each step of the tree-building process, the best split is made at
that particular step, rather than looking ahead and picking a split that will
lead to a better tree in some future step.
More details in tree building process
I We first select the predictor Xj and cutpoint s such that splitting the
predictor space into regions {X : Xj < s} and {X : Xj ≥ s} leads to the
greatest possible reduction in RSS.
I Then, we repeat the process, searching for the best predictor and cutpoint
to further split the data so as to minimize the RSS within each of the
resulting regions. Note that rather than splitting the entire space, we split one
of the two previously identified regions. Now we have three regions.
I Again, we look to split one of these three regions further, so as to minimize
the RSS. The process continues until a stopping criterion is reached, e.g. we
may continue until no region contains more than five observations. An R
sketch of this procedure is given below.
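To make recursive binary splitting concrete, here is a minimal R sketch using the tree package on the Hitters baseball data from the ISLR package. The package choice, the na.omit step and the restriction to the predictors Years and Hits are illustrative assumptions, not part of the slides.

  library(ISLR)   # provides the Hitters data set (assumed installed)
  library(tree)   # grows regression trees by recursive binary splitting (assumed installed)

  hitters <- na.omit(Hitters)   # drop players with missing Salary

  # Grow a tree; the default stopping rule keeps splitting while leaves remain
  # large enough and each split still reduces the deviance appreciably.
  fit <- tree(Salary ~ Years + Hits, data = hitters)
  summary(fit)                        # number of leaves and residual mean deviance
  plot(fit); text(fit, pretty = 0)    # draw the tree with its split labels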
Splitting the predictor space
Generally we create the partitions by iteratively splitting one of the X variables
into two regions. For instance:
1. First split on X1 = t1.
2. If X1 < t1, split on X2 = t2.
3. If X1 > t1, split on X1 = t3.
4. If X1 > t3, split on X2 = t4.
Estimator for regression trees
I Consider the statistical learning model
  Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \ldots, n,
  where X_i = (x_{i1}, \ldots, x_{ip})^\top.
I A regression tree T with leaf regions R1, . . . , RJ corresponds to the
estimator \hat{f}_T given, for x \in \mathbb{R}^p, by
  \hat{f}_T(x) := \sum_{j=1}^{J} \bar{Y}_{R_j} 1\{x \in R_j\},
  where \bar{Y}_{R_j} := N_j^{-1} \sum_{i : X_i \in R_j} Y_i and N_j = \sum_{i=1}^{n} 1\{X_i \in R_j\}.
I The training RSS of the regression tree T is
  RSS(T) := \sum_{j=1}^{J} \sum_{i : X_i \in R_j} (Y_i - \bar{Y}_{R_j})^2.
Pruning a tree
I The tree building process may produce good predictions on the training set,
but is likely to overfit the data, leading to poor test set performance.
I We can first grow a very big tree T0 and then prune it back to obtain a
subtree.
I In cost complexity pruning, we consider a sequence of trees indexed by a
tuning parameter α > 0. For each α, we seek a subtree Tα ⊆ T0 such that
  T_\alpha = \arg\min_{T \subseteq T_0} \{ RSS(T) + \alpha |T| \},
where |T| denotes the number of leaves of T.
Choose the optimal subtree
I α is the tuning parameter that controls a trade-off between the subtree’s
complexity (variance) and its fit to the training data (bias).
I Optimal α can be obtained using cross-validation.
I We then return to the full dataset and obtain the subtree corresponding to
the optimal α.
Algorithm for building a regression tree
1. Use recursive binary splitting to grow a large tree on the training data,
stopping only when each terminal node has fewer than some minimum
number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a
sequence of best subtrees, as a function of α.
3. Use K-fold cross validation to choose α. That is, divide the training
observations into K folds. For each k = 1, . . . ,K :
a Repeat Steps 1 and 2 on all but the k-th fold of the training data.
b Evaluate the mean squared prediction error on the data in the left-out
k-th fold, as a function of α.
Average the results for each value of α and pick α to minimize the average
error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
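A minimal R sketch of this algorithm, again assuming the ISLR and tree packages and the Hitters data; cv.tree carries out the K-fold cross-validation over α internally, and prune.tree returns the subtree of a chosen size.

  library(ISLR); library(tree)
  set.seed(1)
  hitters <- na.omit(Hitters)

  # Step 1: grow a large tree (mindev = 0.001 makes it larger than the default).
  big.tree <- tree(Salary ~ ., data = hitters,
                   control = tree.control(nobs = nrow(hitters), mindev = 0.001))

  # Steps 2-3: cost complexity pruning path and K-fold CV over alpha.
  cv.out <- cv.tree(big.tree, K = 6)
  best.size <- cv.out$size[which.min(cv.out$dev)]   # tree size minimising CV error

  # Step 4: the subtree corresponding to the chosen alpha.
  pruned <- prune.tree(big.tree, best = best.size)
  plot(pruned); text(pruned, pretty = 0)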
Baseball data continued
I Randomly split the data: 132 training observations and 131 observations in
the test dataset.
I We then built a large regression tree on the training data and varied α to
create subtrees with different numbers of terminal nodes.
I Finally, six-fold cross-validation was used to estimate the
cross-validated MSE of the trees as a function of α.
Baseball data regression tree
Cross validation
Classification trees
I For a classification tree, we predict that each observation belongs to the
most commonly occurring class of training observations in the region to
which it belongs.
I We use recursive binary splitting to grow a classification tree.
I Recall that in a regression tree, we can write the cost complexity criterion as
  C_\alpha(T) = RSS(T) + \alpha |T| = \sum_{j=1}^{|T|} N_j Q_j(T) + \alpha |T|,
where Q_j^{MSE}(T) = N_j^{-1} \sum_{i : X_i \in R_j} (Y_i - \bar{Y}_{R_j})^2 is the MSE of the regression
in the j-th terminal node (region).
Classification trees
I Under classification setups, we need to replace the loss Q_j^{MSE} with some
alternative measures.
I Misclassification error rate: the fraction of training observations in the
region that do not belong to the most common class:
  Q_j^{MER}(T) = N_j^{-1} \sum_{i : X_i \in R_j} 1\{ Y_i \neq \hat{\ell}(j) \} = 1 - \hat{p}_{j, \hat{\ell}(j)},
where \hat{\ell}(j) = \arg\max_{\ell} \hat{p}_{j,\ell} and \hat{p}_{j,\ell} = N_j^{-1} \sum_{i : X_i \in R_j} 1\{Y_i = \ell\} represents
the proportion of training observations in the j-th leaf region that are from
the \ell-th class.
Two other measures of loss
I However, the classification error rate is not sufficiently sensitive for tree growing
and in practice two other measures are preferable.
I Gini index:
  Q_j^{Gini}(T) = \sum_{\ell=1}^{L} \hat{p}_{j,\ell} (1 - \hat{p}_{j,\ell}),
a measure of total variance across the L classes.
I The Gini index is referred to as a measure of node purity: a small value
indicates that a node contains predominantly observations from a single class.
I Cross-entropy (deviance):
  Q_j^{xent}(T) = - \sum_{\ell=1}^{L} \hat{p}_{j,\ell} \log \hat{p}_{j,\ell}.
I These two measures are used to evaluate the quality of a particular split,
since they are more sensitive to node purity than the classification error
rate.
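To make the three measures concrete, the small helper below computes them from the vector of training labels falling in one leaf; the function name and the toy input are illustrative, not from the slides.

  # Misclassification error, Gini index and cross-entropy for a single leaf,
  # computed from the labels y of the training observations in that leaf.
  leaf_impurity <- function(y) {
    p <- table(y) / length(y)       # class proportions p-hat_{j,l}
    p <- p[p > 0]                   # drop empty classes so 0 * log(0) never appears
    c(misclass = 1 - max(p),        # 1 - p-hat_{j, l-hat(j)}
      gini     = sum(p * (1 - p)),  # sum_l p (1 - p)
      xent     = -sum(p * log(p)))  # - sum_l p log p
  }

  leaf_impurity(c("Yes", "Yes", "Yes", "No"))   # leaf with proportions (0.75, 0.25)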
Comparison among the three measures
I In the two-class case, with p denoting the proportion in the first class, the
three measures are 1 − max(p, 1 − p), 2p(1 − p) and
−p log(p) − (1 − p) log(1 − p), respectively.
I The cross-entropy and the Gini index are differentiable and hence
more amenable to numerical optimization.
Example: Heart data
I Binary outcome HD for 303 patients who presented with chest pain.
I An outcome value of Yes indicates the presence of heart disease based on
an angiographic test, while No means no heart disease.
I There are 13 predictors including Age, Sex, Chol (a cholesterol
measurement) and other heart and lung function measurements.
I Cross-entropy-based cross-validation yields a pruned tree with six terminal
nodes.
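A minimal R sketch of fitting and pruning such a classification tree, assuming Heart.csv has been downloaded from the ISLR book website into the working directory; the file name, the outcome column name AHD (called HD on the slide) and the clean-up steps are assumptions about that file.

  library(tree)
  heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)   # assumed local copy of the data
  heart$X <- NULL                                           # drop a row-index column, if present
  heart <- na.omit(heart)

  big.tree <- tree(AHD ~ ., data = heart)     # grow a large classification tree
  cv.out   <- cv.tree(big.tree)               # default: pruning guided by deviance (cross-entropy)
  pruned   <- prune.tree(big.tree, best = 6)  # six terminal nodes, as quoted above
  plot(pruned); text(pruned, pretty = 0)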
Tree vs linear models
Top row: true linear boundary; bottom row: true nonlinear boundary.
Advantages and disadvantages of trees
I Trees are easy to explain to people. They are even easier to explain than
linear regression.
I Some people believe that decision trees more closely mirror human decision
making than do the regression and classification approaches seen in previous lectures.
I Trees can be displayed graphically and easily interpreted even by a
non-expert.
I Trees can easily handle qualitative predictors without the need to create
dummy variables.
I Trees do not have the same level of predictive accuracy as some of the
other regression and classification approaches seen in previous lectures.
Bootstrap aggregation
I The bootstrap is an extremely powerful idea but what does it have to do
with statistical learning?
I Suppose that we have a procedure (such as trees, neural nets, etc.) that
has high variance.
I An ideal way to reduce variance would be to take many samples from the
population, build a separate prediction model using each sample and
average the resulting predictions, i.e.
  \hat{f}_{\mathrm{average}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x).
I Of course, as discussed previously, we cannot take multiple different
samples from the population.
Bootstrap aggregation
I However, we can use the bootstrap approach which does the next best
thing by taking repeated samples from the training data.
I We therefore end up with B different training data sets.
I We can train our method on each data set and then average all the
predictions, i.e.
  \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
I This approach is called Bagging or Bootstrap Aggregation.
Bagging regression and classification trees
I Regression trees: We use average prediction to aggregate bootstrapped
trees.
I Classification trees: for each test observation, we record the class
predicted by each of the B trees and take a majority vote: the overall
prediction is the most commonly occurring class.
I Alternatively, we can take the average of the B predicted probabilities and assign
the observation to the class with the highest averaged probability.
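The bagging estimator can be written as a direct loop over bootstrap samples. The sketch below does this for a regression tree on the Hitters data; B = 100, the tree package and predicting back on the training set are illustrative choices rather than anything specified on the slides.

  library(ISLR); library(tree)
  set.seed(1)
  hitters <- na.omit(Hitters)

  B <- 100
  preds <- matrix(NA, nrow = nrow(hitters), ncol = B)

  for (b in 1:B) {
    idx <- sample(nrow(hitters), replace = TRUE)    # bootstrap sample of the training rows
    fit <- tree(Salary ~ ., data = hitters[idx, ])  # the b-th bootstrap tree, f-hat^{*b}
    preds[, b] <- predict(fit, newdata = hitters)   # its predictions on all observations
  }

  f.bag <- rowMeans(preds)   # f-hat_bag(x): average of the B tree predictions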
Simulation to compare error rates
I Here the green line represents a simple majority vote approach.
I The purple line corresponds to averaging the probabilities.
I Both methods do better than a single tree and get close to the Bayes risk.
Random forest
I Random forests provide an improvement over bagged trees by way of a
small tweak that decorrelates the trees. This reduces the variance when we
average the trees.
I As in bagging, we build a number of decision trees on bootstrapped
training samples.
I But in random forest, we also randomly sample predictors: each time a
split in a tree is considered, a random selection of m predictors is chosen as
split candidates from the full set of p predictors. The split is allowed to use
only one of those m predictors.
I A typical choice is m = √p.
I Bagging is a special case of random forest when m = p.
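In practice the randomForest package does the bootstrapping and the random predictor selection for us. A minimal sketch (the package name, ntree = 500 and the Hitters data are illustrative assumptions):

  library(ISLR); library(randomForest)
  set.seed(1)
  hitters <- na.omit(Hitters)
  p <- ncol(hitters) - 1                   # number of predictors

  bag.fit <- randomForest(Salary ~ ., data = hitters,
                          mtry = p, ntree = 500)              # m = p: this is bagging
  rf.fit  <- randomForest(Salary ~ ., data = hitters,
                          mtry = floor(sqrt(p)), ntree = 500,
                          importance = TRUE)                  # m = sqrt(p): a random forest
  rf.fit   # printing the fit reports the out-of-bag error estimate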
Variable importance measure
I Bagged trees and random forests typically result in improved accuracy over
prediction using a single tree. However, it can be difficult to interpret the model.
I We record the total amount by which the RSS is decreased due to splits over a given
predictor, averaged over all B trees. A large value indicates an important predictor.
I Variable importance plot.
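Continuing the random forest sketch above, the importance measures and the variable importance plot can be obtained as follows (this assumes the rf.fit object fitted with importance = TRUE in the earlier sketch).

  importance(rf.fit)   # per-predictor %IncMSE and total decrease in node impurity
  varImpPlot(rf.fit)   # the variable importance plot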
Boosting
I Like bagging, boosting can be applied to many statistical learning
approaches. We concentrate on decision trees.
I Recall that bagging involves creating copies of the original training dataset
using the bootstrap, fitting a separate tree to each copy, and then combining all
of the trees to create a single predictive model.
I Each tree is built on a bootstrap dataset, independently of the other trees.
I Boosting works in a similar way, but the trees are grown sequentially:
each tree is grown using information from previously grown trees.
Boosting Algorithm for Regression Trees
1. Set fˆ(x) = 0 and ri = yi for all i in the training set.
2. For b = 1, . . . , B, repeat:
a Fit a tree fˆ b with d splits (d+ 1 terminal nodes) to the training data
(X, r).
b Update the residuals,
ri ← ri − λfˆ b(xi).
3. Output the boosted model,
  \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x).
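The algorithm can be written as a short loop. The sketch below uses small rpart trees fitted to the residuals; the rpart package and the values B = 1000, λ = 0.01 and d = 1 are illustrative choices, not values from the slides.

  library(ISLR); library(rpart)
  set.seed(1)
  hitters <- na.omit(Hitters)
  X <- hitters[, setdiff(names(hitters), "Salary")]   # predictor columns only

  B <- 1000; lambda <- 0.01; d <- 1
  r <- hitters$Salary                 # Step 1: residuals start at the responses
  trees <- vector("list", B)

  for (b in 1:B) {
    dat <- cbind(X, r = r)
    # Step 2a: fit a small tree (depth at most d) to the data (X, r).
    trees[[b]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = d, cp = 0, minsplit = 10))
    # Step 2b: shrink the fit by lambda and update the residuals.
    r <- r - lambda * predict(trees[[b]], newdata = dat)
  }

  # Step 3: the boosted prediction sums the shrunken trees.
  boost_predict <- function(newX) {
    Reduce(`+`, lapply(trees, function(t) lambda * predict(t, newdata = newX)))
  }
  head(boost_predict(X))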
Ideas behind boosting
I Unlike fitting a single large decision tree to the data, which amounts to
fitting the data hard, the boosting approach instead learns slowly.
I Given the current model, we fit a decision tree to the residuals from the
model. We then add this new decision tree to the fitted function in order
to update the residuals.
I Each of these trees can be rather small, with just a few terminal nodes,
determined by the parameter d in the algorithm.
I By fitting small trees to the residuals, we slowly improve \hat{f} in areas where it
does not perform well. The shrinkage parameter λ slows the process down
even further, allowing more and differently shaped trees to attack the
residuals.
Regression Example
I The number of trees B (chosen by cross-validation), the shrinkage parameter λ (typical
values are 0.01 or 0.001) and the number of splits d (interaction depth) are
the tuning parameters for boosting.
I See further details in Chapter 10 of ESL.
I R package: gbm.
I Gradient boosted models.
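A minimal gbm sketch with the tuning parameters mentioned above; the particular values of n.trees, shrinkage and interaction.depth, and the use of the Hitters data, are illustrative assumptions.

  library(ISLR); library(gbm)
  set.seed(1)
  hitters <- na.omit(Hitters)

  boost.fit <- gbm(Salary ~ ., data = hitters, distribution = "gaussian",
                   n.trees = 5000, shrinkage = 0.01, interaction.depth = 4,
                   cv.folds = 5)                   # built-in CV for choosing B
  best.B <- gbm.perf(boost.fit, method = "cv")     # CV-selected number of trees
  summary(boost.fit)                               # relative influence of each predictor
  preds <- predict(boost.fit, newdata = hitters, n.trees = best.B)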
Summary
I Decision trees are simple and interpretable, but on their own they typically cannot
achieve high prediction accuracy.
I Bagging, random forests and boosting are developed to improve the
prediction accuracy of trees.
I Random forests and boosting are among the state-of-the-art methods for
regression and classification (supervised learning). However, their results can
be difficult to interpret.
I Further details of random forests and boosting can be learnt from ESL.