Unit 4
Supervised learning
COMP S491
Machine Learning and Applications
HKMU Course Team
Course Development Coordinator:
Dr Vanessa Ng Sin-chun, HKMU
Developer:
Kendrew Lau Chu-man, Consultant
Instructional Designer:
Ross Vermeer, HKMU
Members:
Kevin Tse Ka-wing, HKMU
Dr Henry Leung Man Fai, HKMU
External Course Assessor
Prof. Chu Xiaowen, Hong Kong Baptist University
Production
Office for Advancement of Learning and Teaching (ALTO)
Copyright © Hong Kong Metropolitan University, 2020.
Reprinted 2021.
All rights reserved.
No part of this material may be reproduced in any form
by any means without permission in writing from the
President, Hong Kong Metropolitan University. Sale of this
material is prohibited.
Hong Kong Metropolitan University
Ho Man Tin, Kowloon
Hong Kong
This course material is printed on environmentally friendly paper.
Contents
Introduction
Linear regression
  Details and an illustrative implementation
  Using scikit-learn’s LinearRegression
  Stochastic gradient descent and SGDRegressor
  Multiple linear regression
  Polynomial regression
Logistic regression
  Basic concepts
  Using LogisticRegression and SGDClassifier
  Multiclass classification
Generalization and regularization
  Underfitting and overfitting
  An example of overfitting
  Bias and variance
  Regularization
Model evaluation
  Model training, selection and assessment
  Cross-validation
  Evaluation metrics
  Tuning hyperparameters
K-nearest neighbours
  How kNN works
  Characteristics of kNN
  Basic code examples
  The value of k
  Distance metrics
Naive Bayes
  How Naive Bayes works
  Bernoulli Naive Bayes
  Multinomial Naive Bayes
  Gaussian Naive Bayes
Decision trees
  Using decision trees in machine learning
  Decision tree learning algorithms
  Implementing decision trees
Rule-based classifiers (optional)
  Obtaining rules
  ZeroR
  OneR
Ensemble learning methods
  Bagging
  Random forests
  Boosting
Support vector machines
  Linear SVMs
  Non-linear SVMs
  Implementing SVMs
Neural networks
  Perceptrons
  Multilayer perceptrons
  Feed-forward neural networks and back-propagation
  Implementing neural networks
Comparison of supervised learning algorithms
  Characteristics and usage
  Basic performance comparison
Summary
References
Suggested answers to self-tests
Feedback on selected activities
Appendix 4.1: Multivariate calculus and gradient descent
  Multivariate calculus
  Gradient descent
Introduction
Supervised machine learning is a type of machine learning that acquires
knowledge from some known data, and makes predictions about future
unseen data. It often works by building a model from the known
example data. The data examined for building or training a model are
labelled examples, each consisting of the features and the label of an
observed instance. When the trained model is employed to make
predictions, unlabelled examples containing only features are input, and
the predicted labels are output.
There are two subtypes of supervised machine learning, according to
the types of labels to predict:
• Regression predicts numerical labels. Examples include predicting
the price of real estate from the size and other attributes, and
predicting income from personal particulars.
• Classification predicts categorical labels. Some examples are a spam
detector (which predicts whether a message is spam or not), an
image classifier (which predicts whether an image shows a dog or a
cat, for example), and cancer diagnosis (which predicts whether an
image indicates cancer or not).
While these concepts and terminology about supervised machine
learning were described in Unit 3, this unit discusses the algorithms
and techniques for implementing supervised learning solutions.
Specifically, you’ll learn about various models for solving regression
and classification problems, the characteristics of the models, and the
Python code for building and using the models.
With multiple models applicable to a machine learning problem, plus
plenty of model parameters and training parameters, it is important
for us to evaluate the models and understand their performance and
behaviours. In fact, it is possible for a model to perform well during
training but work very poorly in actual use (on unseen data) — this
is a noticeable generalization issue that we will address and avoid by
applying model evaluation.
In short, this unit:
• describes models of supervised machine learning;
• discusses and contrasts algorithms for solving classification and
regression problems;
• implements solutions to machine learning problems using different
supervised learning techniques; and
• evaluates different models of supervised learning.
The unit contains a number of activities and self-tests to help you better
understand the topics. Be sure to complete each of them.
Activity 4.1
Remember that the unit was developed as Jupyter notebooks, which
contain both the descriptions and program source. Try out all the
programs, running, modifying, and otherwise exploring them during
your study. If necessary, you can find instructions to access the
notebooks on the OLE.
Some programs in this unit are long and are broken into multiple Code
cells to facilitate discussion. If you have doubts or see unexpected
results, examine earlier Code cells and try to run them in order.
Many machine learning algorithms involve randomness in their
operations, so the results vary somewhat across runs. In other words, the outputs
of some programs are not deterministic. If you have doubts about the
outputs, run the code multiple times, or, if applicable, wrap the relevant
code in a loop to get some average result values.
Linear regression
Linear regression is a method of supervised machine learning for
solving regression problems, i.e. predicting numerical values. A
regression method, such as linear regression, is also known as a
regressor. In Unit 3, we outlined the mechanics of linear regression
using an example of predicting the prices of real estate properties; this
section goes on to explain the details with illustrative implementations.
Before going into the details, let’s recap the key concepts of linear
regression. The linear regression model associates the prediction input
features with prediction output labels using a linear relation. For a basic
case with only one feature and one label, the association is the model of
a straight line. To learn or train the model, we determine the parameters
for the best straight line, including the line’s position and steepness.
The performance of a line, or a model, is measured using a loss (cost)
function; during training, the loss is computed on the training data, and
when evaluating the model, it is computed on held-out test data. To
improve or train a model (line), we use an iterative optimizer algorithm
to adjust the model parameters with reference to the current parameters,
the loss function and the training data.
Details and an illustrative implementation
More details of model training in linear regression are discussed next,
illustrated by an implementation. The machine learning problem
involved is predicting the prices of real estate properties from their
sizes.
Getting data
Let’s begin by getting some data. The code segment below first
generates 1,000 sizes in the range of 200 to 5,000, and keeps them in
an array of type float. Next, the corresponding prices are computed by
multiplying the sizes by 10,000, and adding 5,000,000 (i.e. 5 × 10⁶ or
5e6) and some noise, which has a normal distribution with mean 0 and
standard deviation 3,000,000 (i.e. 3e6). The sizes and prices are plotted
on a graph for examination.
import numpy as np
import matplotlib.pyplot as plt
sizes = np.random.randint(200, 5000, 1000).astype(float)
prices = sizes * 10000 + 5e6 + np.random.normal(0, 3e6, sizes.shape)
plt.plot(sizes, prices, 'x')
Output:
[A scatter plot of prices against sizes is displayed.]
Now, let’s pretend that we have these data in hand, but we do not know
their generation or relation. We will use machine learning to model the
data so that we can (later) predict the price of a real estate property from
its size.
The graph of the data indicates that the sizes and prices seem to be
related according to a straight line. With this observation, we decide to
model the data using a straight line. The equation of a straight line is:
y = mx + b
In the equation, m denotes the slope of the line, and b denotes the
y-intercept, i.e. the value of y when x is 0. A straight line and the values of m and b are
illustrated in the following figure.
Figure 4.1 A straight line, with the equation y = m x + b
Back to our real estate problem, our machine learning model is:
price = m × size + b
We will use the data and an algorithm to train the model, i.e. determine
the best values for the parameters m and b.
Normalization, training set and test set
We normalize the sizes and prices because their values are very different
in scale. As you learned from the section ‘Data transformation’ in Unit
3, normalization scales data values to smaller and similar ranges, and
many machine learning algorithms (including gradient descent, as in
this discussion) work better when the features contain similar ranges of
values.
After normalization, we divide the data (labelled examples) into a
training set and a test set. The constants x_scale (1e3 or 1,000) and
y_scale (1e7 or 10,000,000) are used for normalization. 70% (0.7) of the
data will be used for training, and the remaining 30% for testing. The
variable train_test_cutoff keeps the number of training examples. The
sizes are sliced into x_train and x_test, and scaled by x_scale, while
the prices are sliced into y_train and y_test, and scaled by y_scale.
x_scale = 1e3 # sizes
y_scale = 1e7 # prices
train_test_cutoff = int(0.7 * len(sizes))
x_train = sizes[:train_test_cutoff] / x_scale
x_test = sizes[train_test_cutoff:] / x_scale
y_train = prices[:train_test_cutoff] / y_scale
y_test = prices[train_test_cutoff:] / y_scale
print(len(x_train), len(x_test), len(y_train), len(y_test))
Output:
700 300 700 300
Loss function and mean squared error
In general, predicted values are not perfect — they differ from actual
values of examples, hopefully to a tiny extent. Prediction errors, usually
in some form of differences or distances between the predicted and
actual values, provide a performance measure of the machine learning
model. Such a measure is called the loss function or cost function of the
model.
Figure 4.2 Prediction errors between predicted values (empty circles) and
example/actual values (filled circles)
A commonly-used loss function is the mean squared error, or MSE,
which is calculated using the following formula. Here, yi is an
actual example value, yi' is a predicted value, and n is the number of
predictions/examples involved.

MSE = (1/n) × Σi (yi − yi')²
The figure below depicts the squared errors as red squares. Each red
square has a corner on the straight line (the predicted value) and another
corner on the data point (the actual example value).
Figure 4.3 Squared errors indicated as red squares
The function below, mean_squared_error(), computes the MSE between
two arrays of actual and predicted values.
def mean_squared_error(y, y2):
    return np.sum((y - y2) ** 2) / len(y)

print(mean_squared_error(np.array([0, 2, 5]), np.array([1, 1, 1])))
Output:
6.0
Gradient descent, learning rate and hyperparameters
Gradient descent is an optimizer that determines the best, or optimal,
parameter values using an iterative approach. In each iteration, gradient
descent updates each parameter value in a way that (very likely) reduces
the loss function. The main idea of gradient descent is depicted in the
figure below. In each of the three graphs, a curve shows the relation
between the loss (vertical axis) and the parameter (horizontal axis). The
objective of gradient descent is to update (i.e. increase or decrease) the
parameter value in order to reduce the loss function, as depicted in the
graph on the left.
Figure 4.4 Gradient descent: updating a parameter (left), increasing it for
negative slope (middle), decreasing it for positive slope (right)
In order to determine whether it should increase or decrease the
parameter value, gradient descent uses the slope, or gradient, of the
curve at the current parameter value. When the gradient is negative,
as shown in the middle graph, the curve goes down towards the right;
gradient descent increases the parameter value to decrease the loss.
When the gradient is positive, as shown in the graph on the right,
the curve goes down towards the left; gradient descent decreases
the parameter value to decrease the loss. For a typical loss function
with multiple parameters, each parameter is processed and updated
individually.
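To make this sign logic concrete, here is a tiny sketch of gradient descent on a one-parameter toy loss (our own example, separate from the pricing problem):

# Toy example: minimize L(w) = (w - 3)^2, whose minimum is at w = 3.
# The gradient is dL/dw = 2 * (w - 3), which is negative while w < 3.
w = 0.0
learning_rate = 0.1
for i in range(3):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # negative gradient -> w increases
    print(w)
# Prints approximately 0.6, 1.08, 1.464 -- w moves towards the minimum at 3.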
The gradients at the current parameter values can be determined
mathematically. With the predicted values yi' replaced by the
straight-line model m × xi + b, the loss function MSE becomes the following:

MSE = (1/n) × Σi (m × xi + b − yi)²
Using calculus, the gradients of m and b can be calculated. The
mathematical details can be found in Appendix 4.1 ‘Multivariate calculus
and gradient descent’ at the end of this unit. These gradients are used
in the code below. The gradient_descent_step() function computes
new values of m and b from their current values, x and y values of the
training data, and a learning_rate that specifies how much to update
the parameter values. The function first computes the predicted y values
from the current m and b values and the x values. Next, the gradients for
m and b are obtained and used for calculating the new m and b values.
Note that in the calculations, a negative gradient value will produce
a new value that is larger than the current value, and vice versa. The
learning_rate variable is multiplied by the gradient values to adjust the
amount of changes.
def gradient_descent_step(m, b, x, y, learning_rate):
    predicted = m * x + b
    m_gradient = 2 * np.dot(x, predicted - y) / len(x)
    b_gradient = 2 * (predicted - y).sum() / len(x)
    new_m = m - learning_rate * m_gradient
    new_b = b - learning_rate * b_gradient
    return new_m, new_b
The gradient_descent_step() function performs a step, or an iteration,
of gradient descent. It is used in the following gradient_descent()
function, which sets the initial values of m and b (both defaulted to 0),
updates m and b for a number of iterations, and returns the final m and b
values.
def gradient_descent(x, y, learning_rate, iterations,
                     initial_m=0, initial_b=0):
    m = initial_m
    b = initial_b
    for i in range(iterations):
        m, b = gradient_descent_step(m, b, x, y, learning_rate)
    return m, b
The learning rate and the number of iterations of gradient descent are
called hyperparameters. Hyperparameters are variables, or settings,
of the training or optimization algorithm that creates or learns a
machine learning model. Hyperparameters control how the algorithm
executes, and are usually set manually before it runs
(as in our case), but some may be adjusted automatically by the
algorithm itself. For example, the learning rate is often a constant set for
the algorithm, but some advanced algorithms may tune the learning rate
according to intermediate training results.
Putting it all together
We’re almost done. The predict_error() function below predicts y
values from parameters of m, b, and x, and computes the error between
the predicted values and the y parameters. The x and y parameters of the
function are normally labelled examples of the test set.
def predict_error(m, b, x, y):
    predicted = m * x + b
    error = mean_squared_error(y, predicted)
    return predicted, error
The following code sets the learning_rate, invokes the
gradient_descent() function to obtain the values of m and b (i.e. train the model)
for different numbers of iterations, and displays the m and b values
and the errors. The m and b values returned by gradient_descent()
have been scaled, due to the normalization of data; they are adjusted
to the original scale, displayed, and used to plot a line for comparison
to the example data. As you can see, in 1,000 iterations, the values of
m and b are fairly close to the values that were used to create the data
(i.e. 10,000 and 5e6 respectively). The MSE decreases when more
iterations are carried out to train the model (i.e. from about 7.56 in 1
iteration to about 0.09 in 1,000 or 5,000 iterations).
learning_rate = 0.01
for iterations in (1, 10, 100, 1000, 5000):
    m, b = gradient_descent(x_train, y_train, learning_rate, iterations)
    predicted, error = predict_error(m, b, x_test, y_test)
    print(f"{iterations}: m={m:g}, b={b:g}, error={error:g}")

mm = m * y_scale / x_scale
bb = b * y_scale
print(mm, bb)
plt.plot(sizes, prices, 'x')
plt.plot(sizes, mm * sizes + bb, linewidth=5)
Output:
1: m=0.199789, b=0.0623739, error=7.56358
10: m=0.925426, b=0.293246, error=0.262215
100: m=1.03836, b=0.38881, error=0.0943503
1000: m=0.9996, b=0.514655, error=0.0904458
5000: m=0.998474, b=0.518312, error=0.0904313
9984.735474642719 5183115.164165969
[A plot of the data points with the fitted line overlaid is displayed.]
With the trained model — the relation y = mx + b and the values of m
and b — we can predict the price of a real estate property of size 1,000
square feet to be about 15,000,000.
print(1000, mm * 1000 + bb)
Output:
1000 15167850.638808686
The importance of learning rates
Before leaving our implementation, let’s look into the learning rate in
more detail. As mentioned previously, the learning rate is multiplied by
the gradient values to adjust how much to change the parameter values.
The following figure illustrates the effects of different learning rates.
Figure 4.5 Learning rates: too small (left), proper (middle), too large (right)
The graph on the left shows a learning rate that is too small. In this case,
the update of the model parameter — the learning process — is slow, so
the algorithm cannot return a useful parameter value within a practical
period of time. As shown in the middle graph, a proper learning rate
updates the model parameter gradually to arrive at an optimal low
value. In the graph on the right, the learning rate is too large, and the
parameter value is updated with a large increment or decrement. The
outcome is like bouncing between the two sides of the curve, and gives
the impression that the machine learning algorithm and/or model are
totally invalid.
To verify whether the learning rate is proper, we can display the
parameter values and the loss (error) for each run of model training
or, at a finer level, for each iteration of a run. In our implementation
above, we display this information after each run of gradient descent.
If the parameters gradually change and the loss keeps decreasing,
then the learning rate is proper. If both the parameters and the loss do
not change at all, then the learning rate is too small (or the algorithm/
model is inappropriate). If the parameters and the loss change by
large amounts, perhaps alternating between positive and negative
values, then the learning rate is too large (or the algorithm/model is
inappropriate).
The learning rate is determined by trial and error. It is often set
manually before executing the machine learning algorithm, as we did
in the above implementation. We review the resulting or intermediate
values of parameters and loss, and adjust the learning rate accordingly.
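For example, the following sketch is an illustrative variant of our gradient_descent() function that reports the parameters and loss at regular intervals; the report_every parameter and the function name are our own additions, and the functions defined earlier are reused:

def gradient_descent_verbose(x, y, learning_rate, iterations,
                             initial_m=0, initial_b=0, report_every=100):
    # Like gradient_descent(), but prints m, b and the loss periodically
    # so that we can judge whether the learning rate is proper.
    m, b = initial_m, initial_b
    for i in range(iterations):
        m, b = gradient_descent_step(m, b, x, y, learning_rate)
        if (i + 1) % report_every == 0:
            loss = mean_squared_error(y, m * x + b)
            print(f"iteration {i + 1}: m={m:g}, b={b:g}, loss={loss:g}")
    return m, b

# A steadily decreasing loss suggests a proper learning rate.
m, b = gradient_descent_verbose(x_train, y_train, 0.01, 500)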
Now work through the following activity to explore the learning rate
and other aspects of the implementation.
Activity 4.2
Perform these steps to examine some details of the above
implementation.
1 If you haven’t done so already, run the code of the implementation.
Since the code is written in multiple Code cells in a notebook, make
sure to run all of them in order, for example, by using the JupyterLab
menu Kernel > Restart Kernel and Run All Cells…. You may run
all Code cells for each of the changes in this activity.
2 Modify the generated prices near the beginning of the
implementation. The relevant line of code is given below. For
example, change the values of 10,000, 5e6 and 3e6, and check if the
implementation can learn the corresponding new parameter values.
prices = sizes * 10000 + 5e6 + np.random.normal(0, 3e6, sizes.shape)
3 Modify the learning rate near the end of the implementation. The
relevant line of code is given below. Change it to various larger and
smaller values, and see the effects on the progress of learning.
learning_rate = 0.01
4 Modify the numbers of iterations near the end of the implementation.
The relevant line of code is given below. For example, add some
values, and see the effects on the learned parameters and errors.
for iterations in (1, 10, 100, 1000, 5000):
Using scikit-learn’s LinearRegression
The implementation you’ve just been introduced to is good for learning
machine learning, but should not be used in real applications. Instead,
we should adopt a machine learning library. The following code
achieves the same function as our implementation, and it is more robust
and much shorter! It adopts the LinearRegression class of the scikit-
learn library.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
sizes_2d = sizes.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    sizes_2d, prices, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.coef_, reg.intercept_)
print(reg.score(X_test, y_test))
print(reg.predict([[1000]]))
Output:
[9898.02699602] 5260937.702313833
0.9577091125644049
[15158964.69833569]
In the code, we first reshape the sizes 1D array into a 2D array called
sizes_2d. This is required because features are stored in 2D arrays when
they are used in the scikit-learn library. Then, we call the
train_test_split() function to shuffle the labelled examples and split them into a
training set and a test set. In the variable names X_train and X_test, X
is in uppercase because these variables denote 2D arrays (matrices) of
features; the variables y_train and y_test denote 1D arrays of labels.
Next, we create a LinearRegression object called reg, and learn the
model by calling the fit() method. The fit() method takes the training
data in X_train and y_train. The learned parameter values can be
obtained as coef_ and intercept_ attributes of the reg object, and the
accuracy, or score, of the model from the score() method (discussed
below). As you can see, the learned values of m and b are very close to
10,000 and 5,000,000 (i.e. 5e6) respectively.
Finally, to use the learned model to make a prediction, we call the
predict() method, which takes a 2D array of features from multiple
unlabelled examples. In our case, we predict the price of a property of
size 1,000 feet to be about 15,000,000.
The R² score
The score() method of LinearRegression returns a measure of the
model’s accuracy called the R² coefficient of determination, which is
defined as follows:

R² = 1 − RSS / TSS = 1 − Σi (yi − yi')² / Σi (yi − ȳ)²

Here, yi is an actual label, yi' is the corresponding predicted label, and ȳ
is the mean of the actual labels. The term in the numerator, Σi (yi − yi')²,
is called the residual sum of squares (RSS), and is a measure of the
differences between the actual and predicted labels. The term in the
denominator, Σi (yi − ȳ)², is called the total sum of squares (TSS), and is
a measure of the variance of the actual labels. The R² score is a value
between negative infinity (the worst possible model) and 1.0 (the best
possible model).
Let’s consider two cases that should help to give you a better
understanding of the score. For the best possible model, which makes
perfect predictions all the time, the error yi − yi' is 0 and thus the RSS is
also 0; the resulting R² score is 1.0. For a naive model that ignores the
input features and predicts every label as the mean ȳ of the actual labels,
each error yi − yi' equals yi − ȳ and thus RSS = TSS; the resulting R² score
is 0.0.
According to scikit-learn’s documentation about the R² score1:
It represents the proportion of variance (of y) that has been
explained by the independent variables in the model. It provides an
indication of goodness of fit and therefore a measure of how well
unseen samples are likely to be predicted by the model, through the
proportion of explained variance.
Regressors of scikit-learn measure their scores using the R² coefficient
by default.
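As a quick check of the definition, the following sketch computes R² directly from the RSS and TSS for some made-up labels, and compares the result with scikit-learn’s r2_score() function:

import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 7.0, 9.0])      # made-up actual labels
y_predicted = np.array([2.8, 5.3, 6.9, 9.2])   # made-up predictions

rss = np.sum((y_actual - y_predicted) ** 2)      # residual sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
print(1 - rss / tss)                    # R2 computed from the formula
print(r2_score(y_actual, y_predicted))  # the same value from scikit-learn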
Stochastic gradient descent and
SGDRegressor
The LinearRegression class of scikit-learn does not use gradient descent
internally.2 The class is generally good to use, except when the dataset
is very large; in such cases gradient descent and its
variants are more performant. The scikit-learn library supplies another
class called SGDRegressor that implements linear regression using
stochastic gradient descent (to be discussed shortly). The use of the
SGDRegressor class is similar to the LinearRegression class, except that
the input features should be scaled. This is because stochastic gradient
descent is sensitive to feature scaling. As shown in the code below, we
create a pipeline of a StandardScaler and a SGDRegressor using the
make_pipeline() function. The resulting pipeline, denoted by the reg variable,
applies standardization to the features and implements a SGDRegressor
model. The same fit(), score(), and predict() methods are used for
training the model, finding the model’s R2 score and making predictions
respectively.
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
sizes_2d = sizes.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    sizes_2d, prices, test_size=0.3)
reg = make_pipeline(StandardScaler(), SGDRegressor())
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))
print(reg.predict([[1000]]))
Output:
0.9469313135225962
[15100140.41215088]
1 https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score
2 LinearRegression uses matrix calculations to find the ordinary least
squares solution to the linear equation.
In the name SGDRegressor, SGD stands for stochastic gradient descent,
which is a type of gradient descent. There are three common types of
gradient descent:
• Batch gradient descent: This is the gradient descent algorithm used
in our custom implementation of linear regression. In each iteration,
all examples of the training set are examined to determine the
gradients for updating the model parameters.
• Stochastic gradient descent: In each iteration, one example of the
training set is examined to determine the gradients for updating
the model parameters. Because the examples of the training set
are shuffled, a random example is used in each iteration — this is
why the algorithm is described as ‘stochastic’. Stochastic gradient
descent requires more iterations to train a model; however, with
small one-example iterations, less overall time and memory are
required for training a model than batch gradient descent, especially
for huge datasets.
• Mini-batch gradient descent: This is a compromise between batch
gradient descent and stochastic gradient descent. In each iteration, a
random subset (mini-batch) of examples is selected from the training
set and then examined to determine the gradients for updating the
model parameters. A minimal sketch of this approach is shown below.
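Here is that minimal sketch of mini-batch gradient descent for our straight-line model. It reuses the gradient_descent_step() function from our earlier implementation; the function name, the default batch size of 32 and the number of iterations are our own illustrative choices:

def minibatch_gradient_descent(x, y, learning_rate, iterations,
                               batch_size=32, initial_m=0, initial_b=0):
    m, b = initial_m, initial_b
    for i in range(iterations):
        # Pick a random mini-batch of examples from the training set.
        batch = np.random.choice(len(x), batch_size, replace=False)
        m, b = gradient_descent_step(m, b, x[batch], y[batch], learning_rate)
    return m, b

# batch_size=1 would give stochastic gradient descent, and
# batch_size=len(x) would give batch gradient descent.
m, b = minibatch_gradient_descent(x_train, y_train, 0.01, 2000)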
Multiple linear regression
Our examples above use a feature, i.e. the size of a real estate property,
to predict a numeric label, i.e. the price of the property. In general,
linear regression works with multiple features. All features are related
linearly to the label, i.e. there are no square or other non-linear terms.
This type of regression is called a multiple linear regression. The
equation of linear regression with multiple features is shown below:
y = m1x1 + m2x2 + m3x3 + … + b
Here, y is the label (output), and x1, x2, … are the features (input).
The model parameters m1, m2, …, b are determined when the model is
trained. An example of multiple linear regression is described next.
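Before that example, here is a quick sketch with synthetic data (the coefficients 2 and 3 and the intercept 1 are our own made-up values), showing that LinearRegression learns one coefficient per feature:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 3*x2 + 1, plus a little noise.
X = np.random.rand(200, 2)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + np.random.normal(0, 0.01, 200)

reg = LinearRegression()
reg.fit(X, y)
print(reg.coef_, reg.intercept_)  # approximately [2. 3.] and 1.0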
The Boston house prices dataset
The scikit-learn library supplies a number of ‘toy’ datasets for education
and demonstration purposes.3 One of the datasets, the Boston house
prices dataset, is used in the following discussion about regression.
Let’s begin by loading the dataset and exploring its contents. In the code
segment below, we load the Boston dataset using the load_boston()
function of scikit-learn. The loaded dataset, in the boston object, has
attributes DESCR, data, feature_names, filename and target; in particular,
DESCR contains a textual description of the dataset, data contains the
features, and target contains the labels. Using the shape attributes of
data and target, we note that data is a 2D array of 506 rows and 13
features, i.e. shape (506, 13), while target is a 1D array of 506 rows,
i.e. shape (506,). The feature names are contained in
boston.feature_names, and they are described in the text of boston.DESCR,
among other details. For example, at index 0, the first feature/attribute
CRIM is the per capita crime rate by town; the final attribute, MEDV, is the
label and is the median value of owner-occupied homes in $1000’s.

3 https://scikit-learn.org/stable/datasets
from sklearn.datasets import load_boston
boston = load_boston()
print(dir(boston))
print(boston.data.shape, boston.target.shape)
print(boston.feature_names)
print(boston.DESCR)
Output:
['DESCR', 'data', 'feature_names', 'filename', 'target']
(506, 13) (506,)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value
(attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie
Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that
address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data
and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In
Proceedings on the Tenth International Conference of Machine Learning, 236-243,
University of Massachusetts, Amherst. Morgan Kaufmann.
To gain some basic understanding of the data, we put the features of
boston.data into a pandas DataFrame, and use the feature names as the
column names. Then, we add a column for the label to the DataFrame
using the column name TARGET. Printing the DataFrame displays the first
five and final five rows, while calling the describe() method obtains
basic statistics from the columns, such as the counts, means and ranges
of values.
import pandas as pd
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['TARGET'] = boston.target
print(boston_df)
print(boston_df.describe())
Output:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
.. ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0
PTRATIO B LSTAT TARGET
0 15.3 396.90 4.98 24.0
1 17.8 396.90 9.14 21.6
2 17.8 392.83 4.03 34.7
3 18.7 394.63 2.94 33.4
4 18.7 396.90 5.33 36.2
.. ... ... ... ...
501 21.0 391.99 9.67 22.4
502 21.0 396.90 9.08 20.6
503 21.0 396.90 5.64 23.9
504 21.0 393.45 6.48 22.0
505 21.0 396.90 7.88 11.9
[506 rows x 14 columns]
CRIM ZN INDUS CHAS NOX RM \
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000
AGE DIS RAD TAX PTRATIO B \
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032
std 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864
min 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000
25% 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500
50% 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000
75% 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000
max 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000
LSTAT TARGET
count 506.000000 506.000000
mean 12.653063 22.532806
std 7.141062 9.197104
min 1.730000 5.000000
25% 6.950000 17.025000
50% 11.360000 21.200000
75% 16.955000 25.000000
max 37.970000 50.000000
Training with all features
Now we apply linear regression to the dataset in the code below. The
code is essentially the same as our price prediction example discussed
earlier. In this case, however, there are 13 features in boston.data.
When I ran the code, the score returned by the score() method was
about 0.6 to 0.7, a good score for such a simple model and small
dataset. The score obtained varies in different executions because the
train_test_split() function shuffles the data and affects the training.4
As previously mentioned, X in the variable names X_train and X_test is
in uppercase because both variables denote 2D arrays (matrices).
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.7623333936662576
Activity 4.3
Rewrite the code segment using SGDRegressor instead of
LinearRegression. Remember to use a pipeline with a StandardScaler.
# Modify code below
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Feedback is provided for this activity.
4 For reproducible output across executions, specify the random_state
keyword argument with an int to the train_test_split() function,
e.g. random_state=2.
Training with selected features
In the code above, we use all features of boston.data to train the linear
regression model. Using all features is simple, and often produces good
results. However, some datasets have a large number of features, and
using all of them slows down the training process. In such cases, we
need to select some features for training the model. Unit 5 discusses
some feature selection and extraction techniques; in the following, we
will look at a simple technique for finding a subset of useful features.
An intuition about features useful for training a model is that those
useful features are closely related, or highly correlated, to the target
labels that we want to predict. For example, the size of a real estate
property is useful for predicting the price of the property. However,
the Boston dataset does not contain a size feature. So, which feature
or features are (the most) useful for model training? Let’s examine the
correlation values between the features and the labels using the corr()
method of the DataFrame, as shown below.
The output of the method call is a table of correlation values between
−1 and 1. A value close to 1 (resp. −1) indicates strong positive (resp.
negative) correlation, while a value close to 0 indicates no correlation
at all. The correlation of an attribute with itself (in the diagonal of the
table) is 1.0. The final row of the table shows the correlation values of
the features to the target; in that row, the features RM and LSTAT have
large values (0.695360 and −0.737663 respectively), which indicate
their strong correlation to the target (price).
print(boston_df.corr())
Output:
CRIM ZN INDUS CHAS NOX RM AGE \
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470
RM -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265
AGE 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000
DIS -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881
RAD 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022
TAX 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456
PTRATIO 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515
B -0.385064 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534
LSTAT 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339
TARGET -0.388305 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955
DIS RAD TAX PTRATIO B LSTAT TARGET
CRIM -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305
ZN 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445
INDUS -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725
CHAS -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260
NOX -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
RM 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360
AGE -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
DIS 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
RAD -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626
TAX -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536
PTRATIO -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787
B 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461
LSTAT -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663
TARGET 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000
Let’s visually observe the correlation between each of the two features
and the target. The following code uses the plot.scatter() method
to plot a scatter graph for each of the features. The shapes of the data
points confirm a positive correlation between the RM feature and the
target, and a negative correlation between the LSTAT feature and the
target.
boston_df.plot.scatter('RM', 'TARGET')
boston_df.plot.scatter('LSTAT', 'TARGET')
Output:
[Scatter plots of RM against TARGET and of LSTAT against TARGET are displayed.]
The code below uses the RM and LSTAT features to train the linear
regression model. In particular, the columns of the two features are
retrieved using NumPy slicing and stored to the X2 variable. The slicing
syntax, [:, [5, 12]], obtains all rows (:) and the RM and LSTAT columns
at indices 5 and 12 ([5, 12]) respectively. When I ran the code, the
score was about 0.6 to 0.7, comparable to the earlier model that was
trained with all features.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = load_boston()
X2 = boston.data[:, [5, 12]]
X_train, X_test, y_train, y_test = train_test_split(
    X2, boston.target, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.6981355726211429
What will be the outcome if we use features other than RM and LSTAT?
Complete the following activity to find out.
Activity 4.4
Modify the code below to change the features that are used to train the
model. Specifically, change the slicing syntax to select the features,
i.e. change [:, [5, 12]] to [:, [0]], [:, [1]], [:, [2]], [:, [0, 1]],
and so on. Remember to execute the code multiple times after each
change, because the outcomes may vary across executions.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = load_boston()
# Modify code below
X2 = boston.data[:, [5, 12]]
X_train, X_test, y_train, y_test = train_test_split(
    X2, boston.target, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
You should find that feature correlation directly affects the model’s
performance. When the RM (index 5) or the LSTAT (index 12) feature is
included for training the model, the resulting model has a relatively
good score (usually above 0.5). When these two features are not
included, the resulting model has a relatively poor score (usually below
0.3).
The activity below lets you explore another scikit-learn dataset and
apply linear regression to it. Work through it now.
Activity 4.5
The diabetes dataset of scikit-learn contains particulars and blood
measurements (the features) of diabetes patients, and a measure of
disease progression (the label) a year later. Perform the following steps.
1 Read about the diabetes dataset at
https://scikit-learn.org/stable/datasets/toy_dataset.html and the
load_diabetes() function at
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html.
2 Write code to load and explore the dataset. In particular, show the
attribute names, shapes, description and correlation. If necessary,
refer to the work on the Boston house prices dataset earlier in this
section.
3 Write code to perform linear regression on the dataset, using each of
LinearRegression and SGDRegressor.
Feedback is provided for this activity.
Polynomial regression
In a linear regression model, every feature x is related linearly to the
label y. The linear equation has all variable terms of order 1 (e.g. xi),
and no higher-order terms (e.g. x²) or other non-linear terms
(e.g. log x). Not all machine learning problems can be modelled
properly by a linear relation, of course. A more complex and capable
model relates the features to the label using a polynomial equation,
with higher-order terms of x², x³, etc. Such a model is called polynomial
regression.
Polynomial regression can be implemented by creating polynomial
features and using a linear regression model. Let’s see how this is done.
Consider the basic case with one feature called x1. We create a new
feature, called x2, whose values are the squares of x1, i.e. x2 = x1².
Now we apply linear regression with both features x1 and x2, and the
original label y:
y = m1x1 + m2x2 + b
Since x2 = x1², we have:
y = m1x1 + m2x1² + b
That is, the linear regression with x1 and x2 effectively performs
polynomial regression with the original feature x1.
The polynomial features (and other non-linear features) can be added
manually using Python and NumPy functions. However, the scikit-
learn library supplies a convenient class called PolynomialFeatures that
creates polynomial terms of the features. All you need to do is create a
PolynomialFeatures object with the desired degree of the polynomial
(defaulted to 2), and then pass it to a pipeline prior to a linear regression
model such as LinearRegression. This is demonstrated in the code
below:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3)
reg = make_pipeline(PolynomialFeatures(2), LinearRegression())
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.6925472941501909
When I ran this code, the resulting score of the polynomial regression
was often about 0.7 to 0.8, but was occasionally very poor, and even
negative, such as −4! The unstable nature of the polynomial regression
model is caused by a problem called overfitting, which you will learn
about later in the unit.
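To see what PolynomialFeatures actually generates, here is a small sketch that transforms a made-up array with one feature using degree 2; each output row contains the bias term 1, the original feature value, and its square:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])  # one feature, two examples
poly = PolynomialFeatures(2)
print(poly.fit_transform(X))
# [[1. 2. 4.]
#  [1. 3. 9.]]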
To check your understanding of the above discussion, complete the
following self-test. Suggested answers can be found at the end of the
unit.
Self-test 4.1
1 Is linear regression used for predicting numerical or categorical
labels?
2 In gradient descent, should the parameter value be increased
or decreased for a positive gradient? What about for a negative
gradient?
3 What are hyperparameters in model training?
4 What is the problem when the learning rate is too small or too large?
5 What are two classes of scikit-learn for performing linear
regression? What is their difference?
6 What is stochastic gradient descent? Why is it called ‘stochastic’?
7 What are three types of gradient descent?
8 What is multiple linear regression?
9 What scikit-learn class can be used for implementing polynomial
regression?
If you want to learn more about linear regression and gradient descent,
watch these videos:
• StatQuest: Fitting a line to data, aka least squares, aka linear
regression: https://www.youtube.com/watch?v=PaFPbb66DxQ
• StatQuest: Linear models pt. 1 — Linear regression:
https://www.youtube.com/watch?v=nk2CQITm_eo
• StatQuest: Linear models pt. 1.5 — Multiple regression:
https://www.youtube.com/watch?v=zITIFTsivN8
• Gradient descent, step-by-step:
https://www.youtube.com/watch?v=sDv4f4s2SB8
• Stochastic gradient descent, clearly explained:
https://www.youtube.com/watch?v=vMh0zPT0tLI
Linear regression is simple and maintains an easy-to-understand relation
between the input and output of predictions. Despite its simplicity
and age compared to more modern approaches, linear regression is
functional and is still widely used, especially when there are few
training examples and little noise. In fact, many other machine learning
methods can be considered as generalizations or extensions to linear
regression. One such method is logistic regression, which is the topic of
the next section.
Logistic regression
Logistic regression, despite its name, is a classification method. A
classification method, or a classifier, employs labelled examples to train
or build a model, and uses the model to predict categorical labels of
unlabelled examples. Examples of classifiers include a spam detector
that predicts whether a message is spam or not; cancer diagnosis
that predicts whether an image indicates cancer or not; and an image
classifier that predicts whether an image has a dog, a cat or a bird, for
example.
Basic concepts
Linear regression, which you learned about in the previous section,
predicts numerical labels, but not categorical labels. However, we
can employ linear regression to predict numbers, and then interpret
or convert the numbers to some desired categorical labels. This is
essentially how logistic regression works.
Consider the example of a spam detector, which classifies a message
into a label of two categories or classes: spam and non-spam. Using
linear regression techniques and labelled examples of messages, a model
can be trained for predicting a numerical measure of how likely it is that
a message is spam. If the predicted measure is above a preset threshold,
then the message is classified as spam; otherwise, it is classified as non-
spam. Normally, the numerical measure is the probability that a message
is spam.
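As a minimal sketch of this thresholding step, with made-up probability values, the conversion from predicted probabilities to class labels might look like this:

import numpy as np

# Made-up predicted probabilities that each message is spam.
probabilities = np.array([0.93, 0.10, 0.51, 0.48])
threshold = 0.5
labels = np.where(probabilities >= threshold, 'spam', 'non-spam')
print(labels)  # ['spam' 'non-spam' 'spam' 'non-spam']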
The logistic function
In general, the predicted label of linear regression can have any value,
but a probability must be a value between 0 and 1. A conversion of
the value is therefore needed. There are many known mappings, or
functions, that can convert arbitrary values to the range of 0 to 1.
Logistic regression uses the logistic function (also called the sigmoid
function):

f(x) = 1 / (1 + e^(−x))

The following figure shows the logistic function. Note that the input x
takes arbitrary values, and the output f(x) lies between 0 and 1.
Figure 4.6 The logistic function
As mentioned above, the logistic function is applied to the result of
linear regression. When there is one feature, a linear regressor outputs
y = mx + b. Applying the logistic function produces the result of logistic
regression:

y = 1 / (1 + e^(−(mx + b)))

Similarly, the result of logistic regression with multiple features is given
by:

y = 1 / (1 + e^(−(m1x1 + m2x2 + m3x3 + … + b)))
These equations look complicated, but don’t worry. The scikit-learn
library implements these functionalities and more, as you’ll see shortly.
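For example, here is a minimal sketch of the logistic function in NumPy; very negative inputs map to values near 0, an input of 0 maps to 0.5, and very positive inputs map to values near 1:

import numpy as np

def logistic(x):
    # f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1.
    return 1 / (1 + np.exp(-x))

print(logistic(np.array([-10.0, 0.0, 10.0])))
# approximately [4.54e-05, 0.5, 0.99995]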
Binary and multiclass classification
In binary (binomial) classification, the predicted label takes on one of
two possible classes (categories). Logistic regression can handle binary
classification by building a model to compute the probability for one of
the two classes.
In multiclass (multinomial) classification, the predicted label takes
on one of three or more possible classes. Logistic regression handles
multiclass classification by building multiple models to compute various
probabilities for the classes. There are two approaches:
• One-versus-rest (OVR) or one-versus-all (OVA): For every class, we
build a model for comparing the class with all other classes. In other
words, to classify a label of n classes, n models are built.
• One-versus-one (OVO): For every class, we build a model for
comparing the class with each of the other classes. In other words,
to classify a label of n classes, every class is compared with n−1
other classes in n−1 models, and altogether there are n × (n−1) / 2
models.
For a classification problem with 10 classes, the one-versus-rest
approach requires 10 models, while the one-versus-one approach
requires 45 (= 10 × 9 / 2) models. In general, the one-versus-rest
approach is less complex, and is faster to train and use than the one-
versus-one approach. However, the one-versus-one approach may
perform better (have higher accuracy) than the one-versus-rest approach,
especially for a dataset with imbalanced classes (i.e. the numbers of
examples/instances of some classes are much higher than those of other
classes).
In scikit-learn, the OneVsRestClassifier class implements the one-
versus-rest approach, while the OneVsOneClassifier class implements
the one-versus-one approach. Both classes, however, are not frequently
used; instead, we use the LogisticRegression and SGDClassifier
classes, which are slightly higher-level and easier to use. These two
classes are described next.
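That said, here is a minimal sketch of the two wrapper classes, applied to the iris dataset (introduced later in this section) simply because it has three classes; the point of interest is the number of underlying models that each approach builds:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3)

# One-versus-rest: one model per class, i.e. 3 models for 3 classes.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print(len(ovr.estimators_), ovr.score(X_test, y_test))

# One-versus-one: one model per pair of classes, i.e. 3 x 2 / 2 = 3 models.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo.fit(X_train, y_train)
print(len(ovo.estimators_), ovo.score(X_test, y_test))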
Using LogisticRegression and SGDClassifier
The scikit-learn library supplies several classes for implementing
logistic regression. Two commonly-used ones are LogisticRegression
and SGDClassifier. The main difference between the two is that the
SGDClassifier class uses a stochastic gradient descent optimizer, while
the LogisticRegression class uses other optimizers or solvers. Both
classes support binary and multiclass classification; for multiclass
classification, they adopt the one-versus-rest approach. Let’s look at
how to use these classes with some simple data.
In the code below, we generate 1,000 random marks, each between 0
and 100, and the corresponding passing statuses. Marks below 50 are
not passing (denoted by 0), while marks of 50 or above are passing
(denoted by 1). In this classification problem, the marks are a numerical
feature, while the passing status is a categorical label. The generated data
are plotted in a graph for visualization.
import numpy as np
import matplotlib.pyplot as plt
marks = np.random.randint(0, 100, 1000)
passes = np.where(marks < 50, 0, 1)
plt.plot(marks, passes, 'x')
Output:
[A plot of passing status against marks is displayed.]
The usage of the LogisticRegression class is very similar to that of
the LinearRegression class you learned about in the previous section.
As shown below, LogisticRegression has a fit() method for training
the model, a score() method for obtaining the score of the model,
and a predict() method for making predictions using the model. In
particular, the score is equal or close to 1.0, and the predicted labels for
the marks 40 and 51 are 0 (not passing) and 1 (passing) respectively.
The LogisticRegression class also supplies a predict_proba() method
for estimating the probabilities of the labels. In the output below,
the predict_proba() method returns four values: the first two values
indicate a high probability that the mark 40 is not passing and a low
probability that it is passing, and the final two values indicate a low
probability that the mark 51 is not passing and a high probability that it
is passing.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
marks_2d = marks.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    marks_2d, passes, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.predict([[40], [51]]))
print(clf.predict_proba([[40], [51]]))
Output:
1.0
[0 1]
[[9.99999999e-01 8.96279580e-10]
[3.97922891e-02 9.60207711e-01]]
The score() method of LogisticRegression returns the accuracy of
prediction, that is:

Accuracy = (number of correct predictions) / (total number of predictions)
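As a tiny sketch of this formula, using made-up labels, the accuracy is simply the fraction of matching predictions:

import numpy as np

y_actual = np.array([0, 1, 1, 0, 1])     # made-up actual labels
y_predicted = np.array([0, 1, 0, 0, 1])  # made-up predicted labels
print(np.mean(y_actual == y_predicted))  # 4 of 5 correct -> 0.8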
In the code above, we use the values 0 and 1 to indicate, or encode,
the categorical passing status. These values are arbitrary, and have
no meaning or significance in their orders. Now please complete the
following activity to try changing these values.
Activity 4.6
The data generation and logistic regression code is repeated below.
Modify the values of 0 and 1 in the np.where() call to 10 and 1
respectively; run the code and see the output.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
marks = np.random.randint(0, 100, 1000)
passes = np.where(marks < 50, 0, 1) # Modify this line
marks_2d = marks.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    marks_2d, passes, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.predict([[40], [51]]))
print(clf.predict_proba([[40], [51]]))
Feedback is provided for this activity.
The SGDClassifier class implements classification models using
stochastic gradient descent, including logistic regression and other
models. To use the class for logistic regression, specify the loss
keyword argument with value "log" as shown in the code below. Like
SGDRegressor, SGDClassifier is sensitive to feature scaling, so we use
a pipeline with a StandardScaler. The result is similar to that of using
LogisticRegression above.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X_train, X_test, y_train, y_test = train_test_split(
marks_2d, passes, test_size=0.3)
clf = make_pipeline(StandardScaler(), SGDClassifier(loss="log"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.predict([[40], [51]]))
print(clf.predict_proba([[40], [51]]))
Output:
1.0
[0 1]
[[0.99752843 0.00247157]
[0.28106307 0.71893693]]
Multiclass classification
We next look into the classification problem of the iris dataset. This
dataset is very popular in machine learning for studying and testing
different classifiers. The scikit-learn library supplies a convenient
function for loading the dataset, which we’ll use shortly.
The iris plants dataset
The iris plants dataset contains examples of three species of irises: Iris
Setosa, Iris Versicolour and Iris Virginica. These species are the three
classes of the categorical label. There are four features, or attributes,
which are measurements of irises: the sepal length and width, and the
petal length and width. The following figure shows the three species of
irises and their attributes.
Figure 4.7 Three species of iris and their attributes
Source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/
Since this is the first time we’re using the iris dataset, we should explore
it to become more familiar with it as we did with the Boston dataset
earlier. The code below loads the iris dataset using the load_iris()
function, and displays its information.
from sklearn.datasets import load_iris
iris = load_iris()
print(dir(iris))
print(iris.data.shape, iris.target.shape)
print(iris.feature_names)
print(iris.DESCR)
Output:
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_
names']
(150, 4) (150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width
(cm)']
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene
Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Using pandas, we create a DataFrame to hold the features and label
of the iris dataset, and display the first five and final five examples
together with the statistics of the dataset. Note that the iris species, or
classes, are encoded as the values 0, 1 and 2.
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
print(iris_df)
print(iris_df.describe())
Output:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
\
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
species
0 0
1 0
2 0
3 0
4 0
.. ...
145 2
146 2
147 2
148 2
149 2
[150 rows x 5 columns]
sepal length (cm) sepal width (cm) petal length (cm) \
count 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000
std 0.828066 0.435866 1.765298
min 4.300000 2.000000 1.000000
25% 5.100000 2.800000 1.600000
50% 5.800000 3.000000 4.350000
75% 6.400000 3.300000 5.100000
max 7.900000 4.400000 6.900000
petal width (cm) species
count 150.000000 150.000000
mean 1.199333 1.000000
std 0.762238 0.819232
min 0.100000 0.000000
25% 0.300000 0.000000
50% 1.300000 1.000000
75% 1.800000 2.000000
max 2.500000 2.000000
To examine the correlation between the features and label, we invoke
the corr() method of DataFrame to display a table of correlation values,
as in the following code. In the last row of the table, the large values
(close to 1) for petal length and petal width indicate their strong
correlation with the iris species.
print(iris_df.corr())
Output:
sepal length (cm) sepal width (cm) petal length (cm) \
sepal length (cm) 1.000000 -0.117570 0.871754
sepal width (cm) -0.117570 1.000000 -0.428440
petal length (cm) 0.871754 -0.428440 1.000000
petal width (cm) 0.817941 -0.366126 0.962865
species 0.782561 -0.426658 0.949035
petal width (cm) species
sepal length (cm) 0.817941 0.782561
sepal width (cm) -0.366126 -0.426658
petal length (cm) 0.962865 0.949035
petal width (cm) 1.000000 0.956547
species 0.956547 1.000000
To visualize the relation between the petal length, the petal width
and the iris species, we create a scatter plot using the plot.scatter()
method of DataFrame. The petal length and width are denoted by the
x-axis and y-axis (the x and y arguments) respectively; the iris species
are denoted by the colours (the c argument) of the data points. The
colours are specified using the matplotlib colourmap called plasma (the
colormap argument). In the resulting scatter plot, there are three groups
of data points in three different colours at the lower left corner, in the
middle and at the upper right corner of the plot. The three colours of the
three groups denote the values 0, 1 and 2, as indicated in the colour bar
on the right. The three colours, and thus data points in the three groups,
correspond to the three iris species.
iris_df.plot.scatter(x="petal length (cm)", y="petal width (cm)",
c="species", colormap="plasma")
Output:
(A scatter plot of petal width against petal length, coloured by species with a colour bar, is displayed.)
Classifying an iris
Now that we understand the dataset, let’s classify an iris using logistic
regression. The code shown below is very similar to the earlier example
of mark classification. The code loads the iris dataset, splits the training
set and test set, creates a LogisticRegression classifier, fits or trains the
classifier using the training set, and computes the classifier’s score using
the test set. The resulting score is equal to or close to 1.0.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9777777777777777
One interesting note about the code is the max_iter argument in the
LogisticRegression creation. This max_iter argument, with a default
value of 100, specifies the maximum number of iterations for training
the model. In some cases (e.g. for some shuffled versions of the dataset), the
default of 100 iterations is not sufficient for the training to converge.
This problem can be solved by allowing more
iterations for the training, as designated by the max_iter=200 argument.
To demonstrate how to use the trained classifier, we make a prediction
for the first example of the test set, as shown below. Specifically, the
features are obtained from X_test[[0]] as a 2D array, and the (actual)
label is obtained from y_test[0]. To predict the label from the features,
we call the predict() method of the classifier. The actual and predicted labels
are then displayed (they are most probably the same!5), and so are the
probabilities that the example belongs to the three classes, respectively.
features = X_test[[0]]
label = y_test[0]
predicted = clf.predict(features)
print(label, predicted[0])
print(clf.predict_proba(features))
Output:
0 0
[[9.62984156e-01 3.70157090e-02 1.35432478e-07]]
As mentioned above, the default of 100 iterations may not be sufficient
for training a proper model for the iris dataset. An alternative to allowing
more iterations is to scale the features, e.g. using a StandardScaler.
The code below depicts the use of a pipeline of a StandardScaler and a
LogisticRegression, without specifying the max_iter argument in the
LogisticRegression creation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9333333333333333
The example code above uses the LogisticRegression class. The
SGDClassifier is used similarly — try it out by completing the
following activity.
5 The machine learning model captures the statistical characteristics of the
dataset. As a result, it is possible, although very unlikely, that the actual and
predicted labels differ. In such a case, rerun the code segments that train
and use the models a few times, as necessary.
Activity 4.7
Modify the code below to use SGDClassifier instead of
LogisticRegression. Remember to specify the loss argument with value
"log" in the SGDClassifier creation.
# Modify code below
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Feedback is provided for this activity.
The activity below lets you explore another scikit-learn dataset and
apply logistic regression to it. Work through it now.
Activity 4.8
The handwritten digits dataset of scikit-learn contains 8×8 images (the
features) of handwritten digits, and numeric values of the digits (the
label). Each 8×8 image contains 64 grayscale pixels, and each pixel
takes on an integer value from 0 to 16. The 64 pixel values of an image
are 64 features of the dataset. The code below displays the images for
the first 10 digits of the dataset.6
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
fig, axs = plt.subplots(2, 5, figsize=(10, 5))
for digit, ax in enumerate(axs.flat):
    im = ax.imshow(digits.images[digit], cmap=plt.cm.gray_r)
    ax.axis("off")
plt.subplots_adjust(hspace=-0.5)
fig.colorbar(im, ax=axs.ravel().tolist(), shrink=0.6)
Output:
(The images of the first ten digits are displayed in a 2×5 grid with a colour bar.)
6 For placing a colour bar in multiple subplots, see https://matplotlib.org/
gallery/subplots_axes_and_figures/colorbar_placement.html#sphx-glr-
gallery-subplots-axes-and-figures-colorbar-placement-py
Perform the following steps to apply logistic regression to the dataset.
1 Read about the digits dataset at https://scikit-learn.org/stable/
datasets/toy_dataset.html and the load_digits() function at https://
scikit-learn.org/stable/modules/generated/sklearn.datasets.load_
digits.html.
2 Write code to load and explore the dataset. In particular, show the
attribute names, shapes and description. There is no need to view the
correlation. If necessary, refer to the work on the iris dataset earlier
in this section.
3 Write code to perform logistic regression on the dataset, using each
of LogisticRegression and SGDClassifier.
Feedback is provided for this activity.
To check your understanding of logistic regression, complete the
following self-test. Suggested answers can be found at the end of the
unit.
Self-test 4.2
1 Is logistic regression used for solving regression problems or
classification problems?
2 What are the ranges of input values and output values of the logistic
function?
3 How many classes of the label are there in binary classification and
in multiclass classification?
4 For a classification problem with three classes, how many models
are there in each of the one-versus-rest and one-versus-one
approaches?
5 For a classification problem with the three classes red, green, and
blue, write down the models in each of the one-versus-rest and one-
versus-one approaches. For example, a model for the one-versus-
rest approach is ‘red versus green and blue’.
6 What are the two scikit-learn classes for performing logistic
regression?
7 An image classification problem employs a dataset that contains 100
images, each having a width of 20 pixels and a height of 30 pixels.
How many features are there in the dataset?
Like linear regression, logistic regression has a long history, and is
still used in many applications nowadays. But unlike linear regression,
which aims to predict numbers, logistic regression predicts classes or
categories of instances.
If you want to learn more about logistic regression, watch this video:
StatQuest: Logistic Regression, https://www.youtube.com/
watch?v=yIYKR4sgzI8
In order to tell whether linear regression, logistic regression, or another
machine learning method is good for a particular problem at hand, the
main criterion is how well the trained model generalizes to work for
future unseen data. In the next section, we will discuss the topic of
generalization.
Generalization and regularization
Now that you have learned about a regressor and a classifier — namely
linear regression and logistic regression — let’s address some issues
related to model performance. This section discusses the concept of
how good (or bad) a machine learning model is, and a method to train a
better linear regressor and classifier.
Underfitting and overfitting
In machine learning, we’re interested in a model’s performance when it
is used rather than when it is trained. Generalization refers to how well
a trained model predicts for unseen data; this is a proper indication of
the performance when the model is deployed for real or production use.
Some people may think a model that performs well at training should
perform well in actual use, but this is not true in general. Consider a
simple naive ‘model’ that memorizes (e.g. stores in a memory table)
all feature–label (input–output) pairs of the training examples. This
model works perfectly for the training data, and data whose features
match exactly an example of the training data; but it simply fails for
data whose features do not match the training data, i.e. unseen data. In
other words, this memorizing model does not generalize at all. In fact,
memorizing too much detail from the training data is a common cause
of generalization problems, as you’ll see shortly.
Two types of generalization problems are underfitting and overfitting.
The following figure shows an overview of the two problems.
Underfitting occurs when a model does not capture sufficient
information from the data, as shown by the straight line in the diagram
on the left. Overfitting occurs when a model captures too much
information, including noise, from the data, as shown by the curve in
the diagram on the right. A good model, on the other hand, captures
an appropriate amount of information from the data, as shown by the
curve in the diagram in the middle. The problems of underfitting and
overfitting are further explained below.
Figure 4.8 Underfitting (left), good fit (middle), and overfitting (right)
Underfitting
Underfitting means that a model captures too little statistical information
from the training data. Such a model works poorly (has large errors)
for both the training data (seen data) and the test data (unseen data). It
is analogous to a student who hadn’t prepared well for a test; his/her
performance would be poor in both the review exercises and the real
test.
The two main reasons for underfitting are:
• The model is too simple, or not capable enough, to deal with the
data from the problem.
• The features are not informative enough.
Overfitting
Overfitting means that a model captures too much statistical information
from the training data. Such a model works well for the training data,
but poorly for the test data. It is analogous to a student who focused
on all the details, including many irrelevant ones, when preparing for
a test; his/her performance would be good in the review exercises, but
poor in the real test.
The two main reasons for overfitting are:
• The model is too complex for the data of the problem.
• There are too many features, but too few training examples.
Regarding machine learning model complexity, linear regression and
logistic regression belong to a family of models called linear models,
which are generally considered simple models. In a linear model, every
feature is related linearly (i.e. multiplied by a constant/parameter) to
the label. You may refer to the linear equation in the earlier section
‘Multiple linear regression’.
Relations to training error and test error
The following figure depicts the concepts of underfitting and overfitting,
and their relations to the training error and test error. The upper and
lower curves show the amounts of error during training and test
respectively for different models or different stages of training a
particular model.
Figure 4.9 Underfitting and overfitting
On the left side of the graph, the machine learning model captures little
statistical information from the training data. Both the training error
and test error are high, and underfitting occurs. This happens when the
model is too simple and incapable of dealing with the problem, or when
too little training (e.g. iterations) is carried out to train the model.
When a more capable model is used, or when more training is carried
out, more statistical information from the training data is captured,
and both the training error and test error are lower. As indicated in the
middle of the graph, the resulting model represents a good fit for the
problem — a result that is usually what we aim for.
The right side of the graph shows the situation when the model is too
complex for the problem, or when too much training is carried out
to train the model. When even more statistical information from the
training data is captured, the training error keeps diminishing, though
at a diminishing rate. The test error, however, bounces up from a
minimum and increases. This happens because the captured information
now contains irrelevant details or noise from the training data, which
effectively pollute or dilute the useful information captured by the
model for making predictions. This then becomes a case of overfitting.
An example of overfitting
It is relatively easy to understand underfitting — insufficient knowledge
leads to poor results in both training and test. Overfitting, however,
is not so intuitive. How can too much knowledge result in poor test
results? Let me illustrate the problem of overfitting using a complex
model.
The following graph shows 11 data points (training set) and two lines,
or models, that have been trained using the data points. The straight line
does not fit the points exactly, but captures the overall trend and high-
level statistics of the data. It is a simple linear model and corresponds
to the linear equation y = mx + b. On the other hand, the curve fits
the points exactly by varying up and down significantly. It is a more
complex non-linear model and corresponds to an order-10 polynomial
equation y = m10x^10 + … + m1x + b.7
Figure 4.10 Overfitted data
The problem with the complex model is that it captures the fine details
of the data, including noise. Remember, noise is inevitable in machine
learning datasets, so a good model should ignore the noise. A simple
model, by nature, captures fewer details and consequently less noise.
In the example in the above graph, the data points probably stand for
a linear underlying relation, and the deviations come from noise. As
a result, the straight line (linear) is a better model of the underlying
relation, and performs better in predictions for unseen data, than the
curve (order-10).
We should therefore choose a model that is as simple as possible —
but not one that is too simple, which, as mentioned before, would lead
to the problem of underfitting. A basic approach to selecting models
is to go from simple to complex models, training each model and
measuring the training error and test error. Both types of error should
decrease until, at some point, the test error starts to increase; the
model just before that turning point is a good fit. The same approach
also works for controlling an algorithm's iterations, to avoid capturing
too much detail in unnecessary iterations. This approach to controlling
complexity and learning by detecting the turning point of the test error
is called early stopping.
7 An order-n equation can fit n+1 or fewer points exactly.
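To make this model selection approach concrete, here is a minimal sketch (an illustration using made-up data, not one of the unit's own examples). It fits polynomial models of increasing order to noisy data with an underlying linear relation, and prints the training and test errors; the order at which the test error turns upward signals overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# Made-up data: an underlying linear relation plus noise.
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 60).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(0, 2, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Go from simple to complex models, measuring both types of error.
for order in (1, 2, 4, 6, 8, 10):
    reg = make_pipeline(PolynomialFeatures(order), StandardScaler(),
                        LinearRegression())
    reg.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, reg.predict(X_train))
    test_mse = mean_squared_error(y_test, reg.predict(X_test))
    print(order, round(train_mse, 2), round(test_mse, 2))
The training MSE keeps decreasing as the order grows, while the test MSE typically stops improving, and eventually worsens, beyond a low order; the exact numbers depend on the random data.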
Bias and variance
The prediction errors of a model can be divided into three parts:
Error = Bias^2 + Variance + Noise
This is called bias-variance decomposition. The three parts are
described as follows:
• The bias of a model is the difference between the feature–label
relation acquired by the model and the actual relation. Because
machine learning models are most often statistical and approximate,
they may miss part of the real relation, leading to systematic errors
in the predicted results. Generally, simple models have high bias and
are more vulnerable to the problem of underfitting.
• The variance of a model is the variability of the trained model, and
hence of its predictions, with respect to the particular training data
used. Using different training data results in different models, but
some models are more sensitive to, and more affected by, the specifics
of the training data being used. Generally, complex models have high
variance and are more vulnerable to the problem of overfitting.
• The noise of the true label/target comes from the random nature of
data. It is independent of the model and is irreducible.
An ideal model for a machine learning problem has low bias and low
variance, but finding this is a real challenge in practice. In fact, it is
easy to find models with low bias and high variance, and models with
low variance and high bias. Finding a good model usually means making
a trade-off to achieve reasonably low bias and reasonably low variance —
this is called the bias-variance trade-off in machine learning.
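To give variance a more concrete reading, the following sketch (again with made-up data) trains a simple model and a complex model on many bootstrap resamples of the same dataset, and compares how much their predictions at a single input vary. The complex model's predictions typically spread far more, i.e. it has higher variance.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.utils import resample
# Made-up data: a linear relation plus noise.
rng = np.random.RandomState(1)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(0, 2, 50)
x_query = [[5.0]]  # the input at which predictions are compared
for order, name in [(1, "linear"), (10, "order-10")]:
    predictions = []
    for i in range(200):
        # Each bootstrap resample acts as a different training set.
        X_s, y_s = resample(X, y, random_state=i)
        model = make_pipeline(PolynomialFeatures(order), StandardScaler(),
                              LinearRegression())
        model.fit(X_s, y_s)
        predictions.append(model.predict(x_query)[0])
    print(name, "variance of predictions:", np.var(predictions))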
Regularization
Regularization is a technique for avoiding overfitting by penalizing
complex models. For linear regression, for example, we can add to
the cost function (loss function) a penalty term that is proportional or
otherwise related to the complexity of the model. Recall the equation of
the multiple linear regression model:
y = m1x1 + m2x2 + m3x3 + … + b
The model is simpler if some of the parameters, m1, m2, etc. are zeros or
small in value, whereas it is more complex if the parameters are large
(positive or negative) in value. A possible term that denotes the model's
complexity is therefore the sum of the absolute values of the parameters
(coefficients), i.e. |m1| + |m2| + |m3| + …. This is in fact one of the
commonly-used kinds of regularization, as described next.
Lasso (L1)
The lasso (least absolute shrinkage and selection operator), or
L1 regularization, adds to the cost function a penalty term that is
proportional to the sum of absolute values of the parameters:

Penalty = λ(|m1| + |m2| + |m3| + …)
The hyperparameter λ is used for tuning the strength of regularization.
When λ is sufficiently large, some parameters are shrunk to zero, and
the model becomes simpler. In effect, L1 regularization estimates sparse
coefficients and performs feature selection (variable selection).
In scikit-learn, the Lasso class implements a linear model for regression
with L1 regularization. When a Lasso object is created, the alpha
parameter (defaulted to 1.0) designates the λ value of L1 regularization.
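A quick way to see the feature selection effect is to inspect the coef_ attribute of a fitted Lasso: with a sufficiently large alpha, some coefficients are exactly zero. Here is a minimal sketch on the Boston dataset; how many coefficients vanish depends on the alpha chosen.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
boston = load_boston()
# Scale the features so that the coefficients are comparable.
X = StandardScaler().fit_transform(boston.data)
lasso = Lasso(alpha=1.0)
lasso.fit(X, boston.target)
print(lasso.coef_)
print("zero coefficients:", np.sum(lasso.coef_ == 0), "of", lasso.coef_.size)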
To demonstrate the effect of regularization on overfitting, we use a
complex model with polynomial features, which was discussed in the
earlier section ‘Polynomial regression’. The following code shows
the use of Lasso with polynomial regression and compares it to using
LinearRegression (which is not regularized). The scores of Lasso
have a higher mean and a lower standard deviation than those of
LinearRegression. This improvement of Lasso over LinearRegression
demonstrates that L1 regularization alleviates the overfitting problem
of the (relatively) complex polynomial regression model for the Boston
dataset.
from statistics import mean, stdev
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
boston = load_boston()
lasso_scores = []
linear_scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        boston.data, boston.target, test_size=0.3)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        Lasso(alpha=0.1, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    lasso_scores.append(score)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        LinearRegression())
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    linear_scores.append(score)
print("Lasso:", mean(lasso_scores), stdev(lasso_scores))
print("LinearRegression:", mean(linear_scores), stdev(linear_scores))
Output:
Lasso: 0.7911415578913858 0.05171280918964042
LinearRegression: 0.6746281538957123 0.3200461223822122
Ridge (L2)
The ridge (also called ridge regression), or L2 regularization, adds
to the cost function a penalty term that is proportional to the sum of
squares of the parameters:

Penalty = λ(m1^2 + m2^2 + m3^2 + …)
A proper λ value reduces certain parameters of the model, but seldom
to zero. The main effect of L2 regularization is to relieve the impact of
correlated features.
In scikit-learn, the Ridge class implements a linear model for regression
with L2 regularization. In Ridge creation, the alpha parameter (defaulted
to 1.0) designates the λ value of L2 regularization. The code below
shows the use of Ridge with the Boston house prices dataset and
compares it to using Lasso and LinearRegression. The scores for Ridge
have a comparable mean and standard deviation as those for Lasso,
and both are better than those for LinearRegression. Both L1 and L2
regularizations reduce the overfitting problem of LinearRegression
when they are applied to the Boston dataset.
from statistics import mean, stdev
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
boston = load_boston()
ridge_scores = []
lasso_scores = []
linear_scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        boston.data, boston.target, test_size=0.3)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        Ridge(alpha=0.1, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    ridge_scores.append(score)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        Lasso(alpha=0.1, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    lasso_scores.append(score)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        LinearRegression())
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    linear_scores.append(score)
print("Ridge:", mean(ridge_scores), stdev(ridge_scores))
print("Lasso:", mean(lasso_scores), stdev(lasso_scores))
print("LinearRegression:", mean(linear_scores), stdev(linear_scores))
Output:
Ridge: 0.8292859617749727 0.04694405888082697
Lasso: 0.7925930014588958 0.05519773683891316
LinearRegression: 0.6740408549432398 0.7460016780029061
The RidgeClassifier class also implements a linear model with L2
regularization, but it performs classification rather than regression. Try
it out by completing the next activity.
Activity 4.9
The usage of RidgeClassifier is similar to that of Ridge. Modify the
code below to use RidgeClassifier to perform classification on the iris
dataset and compare it to using LogisticRegression.
# Modify code below
from statistics import mean, stdev
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
boston = load_boston()
ridge_scores = []
lasso_scores = []
linear_scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        boston.data, boston.target, test_size=0.3)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        Ridge(alpha=0.1, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    ridge_scores.append(score)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        Lasso(alpha=0.1, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    lasso_scores.append(score)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        LinearRegression())
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    linear_scores.append(score)
print("Ridge:", mean(ridge_scores), stdev(ridge_scores))
print("Lasso:", mean(lasso_scores), stdev(lasso_scores))
print("LinearRegression:", mean(linear_scores), stdev(linear_scores))
Feedback is provided for this activity.
Elastic-net
The elastic-net combines both L1 and L2 regularization:

Penalty = λ(r(|m1| + |m2| + …) + (1 - r)(m1^2 + m2^2 + …))
The r parameter designates the ratio of contributions from the L1 and L2
regularization. When r is 1, the elastic-net becomes L1 regularization;
when r is 0, it becomes L2 regularization.
The ElasticNet class implements the elastic-net for regression. Its
alpha parameter (defaulted to 1.0) designates the λ value of combined
L1 and L2 regularization, and its l1_ratio parameter (defaulted to 0.5)
designates the ratio r of L1 penalty and L2 penalty. The example code
below shows the use of ElasticNet with the Boston dataset.
from statistics import mean, stdev
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import ElasticNet
boston = load_boston()
scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        boston.data, boston.target, test_size=0.3)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        ElasticNet(alpha=0.1, l1_ratio=0.3, max_iter=2000))
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    scores.append(score)
print("ElasticNet:", mean(scores), stdev(scores))
Output:
ElasticNet: 0.789448449619043 0.04472089851183958
The classes discussed above — Lasso, Ridge, RidgeClassifier
and ElasticNet — do not use stochastic gradient descent to train
models. If you're interested, you may refer to the scikit-learn user
guide and API documentation for relevant details.8 The SGDRegressor and
SGDClassifier classes, discussed earlier in the unit, use stochastic
gradient descent and have regularization built in. Both classes have
these parameters for controlling regularization:
• The penalty parameter, defaulted to "l2", designates the type
of regularization to use. Possible values are "l2", "l1" and
"elasticnet".
8 https://scikit-learn.org/stable/user_guide.html and https://scikit-learn.org/
stable/modules/classes.html
• The alpha parameter, defaulted to 0.0001, designates the λ value, or
strength, of regularization.
• The l1_ratio parameter, defaulted to 0.15, designates the ratio of L1
penalty and L2 penalty. This parameter is only used if the penalty
parameter is "elasticnet".
These parameters are demonstrated in the following code segment,
which uses a SGDClassifier to perform classification on the iris dataset.
from statistics import mean, stdev
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
iris = load_iris()
scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3)
    sgd = SGDClassifier(penalty="elasticnet", alpha=0.01, l1_ratio=0.5)
    clf = make_pipeline(PolynomialFeatures(2), StandardScaler(), sgd)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.append(score)
print("SGDClassifier:", mean(scores), stdev(scores))
Output:
SGDClassifier: 0.9566666666666667 0.02668443104521826
Now complete the activity below to use the parameters for controlling
regularization in a SGDRegressor model.
Activity 4.10
Modify the code below to use a SGDRegressor to perform regression
on the Boston dataset (instead of using SGDClassifier to perform
classification on the iris dataset).
# Modify code below
from statistics import mean, stdev
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
iris = load_iris()
scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3)
    sgd = SGDClassifier(penalty="elasticnet", alpha=0.01, l1_ratio=0.5)
    clf = make_pipeline(PolynomialFeatures(2), StandardScaler(), sgd)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.append(score)
print("SGDClassifier:", mean(scores), stdev(scores))
Feedback is provided for this activity.
In the preceding examples, the linear regression models are improved
by applying regularization (L1/L2/elastic-net). In general, regularization
is an effective and convenient way of improving a linear regression
solution. Of course, regularization does not help to address all problems.
If you find that L1/L2/elastic-net regularization does not improve the
scores, you can try some other techniques and models (e.g. those
discussed in the rest of this course).
Complete the self-test below to firm up your understanding of
generalization. Suggested answers can be found at the end of the unit.
Self-test 4.3
1 What is underfitting? When underfitting occurs, is each of the
training error rate and test error rate high or low?
2 What is overfitting? When overfitting occurs, is each of the training
error rate and test error rate high or low?
3 Which one is more vulnerable to underfitting: a simple model or a
complex model? Which is more vulnerable to overfitting?
4 What are the bias and variance of a model?
5 Does regularization aim to avoid underfitting or overfitting? Briefly
describe how to apply regularization to a linear model.
6 What are the three kinds of regularization covered?
If you want to learn more about regularization, watch these videos:
• Regularization part 1: Ridge (L2) regression: https://www.youtube.
com/watch?v=Q81RR3yKn30
• Regularization part 2: Lasso (L1) regression: https://www.youtube.
com/watch?v=NGf0voTMlcs
• Ridge vs. Lasso regression, visualized: https://www.youtube.com/
watch?v=Xm2C_gTAl8c
• Regularization part 3: Elastic net regression: https://www.youtube.
com/watch?v=1dKRdX9bfIo
In some examples in this section, we used a loop to train and use a
model multiple times to gather its scores and compute the average. This
is a simple way of checking the performance of a model. In the next
section, some standard facilities and convenient techniques for model
evaluation are presented.
Model evaluation
So far, you have learned about three main aspects of machine learning
models (among others):
• the types and variations of supervised machine learning models,
such as linear regression, logistic regression, polynomial regression,
among the many others that you’ll learn about later;
• the hyperparameters that control the training or optimization
algorithm, such as the learning rates, learning iterations, whether to
use regularization, and the types and attributes of regularization; and
• the parameters of a model, conceptually like the content of a model,
such as the coefficients of linear models.
With plenty of model types and options, how do we choose the best
solution for a machine learning problem? A commonly-used approach
is to determine several candidate models, and then evaluate them to
choose the best one. Model evaluation, an important task in machine
learning, is the topic of this section. In particular, we’ll discuss the
different learning phases that involve model evaluation, the sets of data
used in these phases, the important technique of cross-validation, some
useful evaluation metrics and hyperparameter tuning.
Model training, selection and assessment
In machine learning, model evaluation is involved in three main phases:
model training, model selection and model assessment. In these phases,
models are evaluated using three separate sets of data, called the
training set, validation set and test set. These sets of data are parts of the
labelled example dataset collected for the machine learning problem.
More details about these phases and sets of data are described in the
following.
Model training and the training set
In the model training phase, multiple candidate models for a machine
learning problem are trained (and evaluated) individually using the
training set. These models are differentiated by model types and
combinations of hyperparameters. Using scikit-learn, for example,
the regressors LinearRegression(), Ridge(alpha=1), Ridge(alpha=10),
SGDRegressor(penalty="l1") and SGDRegressor(penalty="l2") can be
some candidate models for a regression problem.
While the type and hyperparameters of every model are fixed and
known to begin with, the parameters are optimized or determined in
the process of model training. Recall the operation of gradient descent
discussed previously: we evaluate the model using a cost function and
the training set, and tune the parameters according to the evaluation
result. For example, the parameters of one-feature linear regression,
y = mx + b, are the values of m and b.
Model selection and the validation set
In the model selection phase, the multiple trained models are
individually evaluated using the validation set (also called the validation
test set). The model with the best evaluation result is selected for use in
production.
Separating the validation set from the training set allows us to determine
how well the models generalize to work for unseen data, rather than
how well they memorize seen data. That is, it detects the problem of
overfitting, as discussed in the earlier section on generalization.
Model assessment and the test set
In the model assessment phase, the selected best model is evaluated
using the test set (also called the holdout test set). This evaluation result
is an estimate of the performance when the model is used in production.
The test set is separate from both the training set and the validation set,
for the same reason as separating the validation set from the training set
in model selection. Doing so checks how well the model generalizes to
work for future data and avoids overfitting.
In the above discussion, the best model selected in the model selection
phase is used in production. An alternative is to retrain the model
(i.e. using its type and hyperparameters) using all data in the training,
validation and test sets; this increases the amount of training data and
may produce an improved model.
More about the training, validation and test sets
As explained above, the training set, validation set and test set should
be separate. A straightforward way of obtaining them is to divide all
available labelled examples in the dataset into three parts: a larger part
as the training set, and two smaller parts as the validation set and the
test set.
Figure 4.11 Dividing labelled examples into the training set, validation set and
test set
There is no optimal ratio for the examples of the three sets (training
set : validation set : test set), but two general recommendations are
50:25:25 and 70:15:15. For huge datasets with millions of examples,
the validation set and test set may have smaller proportions, such as
95:2.5:2.5.
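As an illustration, the sketch below divides the iris dataset into roughly 70:15:15 training, validation and test sets with two calls to train_test_split, trains a few candidate models (arbitrary choices, purely for illustration) on the training set, selects the best one using the validation set, and assesses it using the test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
# First reserve 70% for training; then split the remaining 30% in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    iris.data, iris.target, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5)
# Candidate models, differentiated by type and hyperparameters.
candidates = [LogisticRegression(max_iter=200),
              RidgeClassifier(alpha=1.0),
              RidgeClassifier(alpha=10.0)]
for clf in candidates:
    clf.fit(X_train, y_train)                                 # model training
best = max(candidates, key=lambda c: c.score(X_val, y_val))   # model selection
print("selected:", best)
print("test score:", best.score(X_test, y_test))              # model assessment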
Simple division of all examples into the three sets raises some issues
associated with the use of the training set and validation set:
• The evaluation results in the model selection phase are highly
dependent on, and vary according to, the contents of the training set
and validation set. In other words, these results may be inconsistent
due to the random selection of examples into the two sets.
• Because some examples are included in the validation set, there
are fewer examples in the training set. With less training data, the
models may learn less statistical information from the data and
perform more poorly.
These problems are particularly significant for small datasets. To
relieve them, we can use a technique called cross-validation, which is
presented next.
Cross-validation
Cross-validation (CV) is a technique that improves the training and
evaluation (validation) of a model by using different parts of a dataset
as the validation set in turns. Cross-validation can be applied in the
combination of model training and model selection, the exploration of
models for educational purposes, and so on.
k-fold cross-validation
The algorithm of k-fold cross-validation is depicted in the following
figure and described below.
Figure 4.12 Algorithm of k-fold cross-validation
The algorithm of k-fold cross-validation is carried out as follows:
1 Divide the dataset into k groups, or folds, of about the same size.
2 For each of the k folds (call it the current fold):
a Assign the current fold as the validation set, and all remaining
folds as the training set.
b Train a model using the training set.
c Evaluate the trained model using the validation set.
3 Collect and summarize the evaluation results of the k models, e.g.
using the mean.
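These steps can be carried out manually with scikit-learn's KFold splitter, as in the minimal sketch below (using the Boston dataset for concreteness); the convenience function introduced shortly does all of this for us.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
kf = KFold(n_splits=5)                       # step 1: divide into five folds
scores = []
for train_idx, val_idx in kf.split(X):       # step 2: take each fold in turn
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])      # 2b: train on the remaining folds
    scores.append(reg.score(X[val_idx], y[val_idx]))  # 2c: evaluate on the fold
print(scores, sum(scores) / len(scores))     # step 3: summarize with the mean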
When k-fold cross-validation is used in the combination of model
training and model selection, each candidate model is trained and
evaluated k times (as k models with different parameters), and the k
evaluation results are summarized as the overall evaluation result of the
candidate model. Note that the best candidate model selected actually
consists of k models, and that normally these k models are not directly
used in production.9 Instead, the best candidate model (model type
and hyperparameters) is retrained using both the training set and the
validation set for production use.
Code examples of cross-validation
Let’s look into some implementation examples of cross-validation. The
main workhorse for cross-validation in scikit-learn is the cross_val_
score() function. This function takes the arguments of a model object,
a 2D array of features, a 1D array of labels and some options (discussed
later), and it returns an array of k scores.
In the code below, the cross_val_score() function is called with a
LinearRegression object, the shuffled features and labels of the Boston
dataset, and an option of cv with the value 5 (#2). The cv option
designates the cross-validation splitting strategy, and can take an int
value or a splitter object; an int value such as 5 specifies the number of
folds in cross-validation. The scores of the five models and their mean
are displayed at the end of the code.
One thing to note is that we don’t need to call the train_test_split()
function because splitting the dataset into two sets is done by cross-
validation in the cross_val_score() function. However, we call the
shuffle() function (#1) to shuffle the feature array and label array, as
shuffling has been done by train_test_split() in previous examples.
9 If it takes a long time to train the models — say weeks or months — some
people use the best of the k models for production use; however, this is not
recommended in general.
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target) #1
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv=5) #2
print(scores, scores.mean())
Output:
[0.74447057 0.74349513 0.69721099 0.74800652 0.67837488]
0.7223116180738653
How many folds should we use in cross-validation? The following
code segment is an experiment using different numbers of folds, from
two to 99, in cross-validation. To speed up the operation, the n_jobs=-1
keyword argument is specified in the cross_val_score() function to
process the models in parallel using as many jobs (CPUs) as possible.
The results are plotted on a graph. The score is good (about 0.7 when I
ran the code) for about ten or fewer folds, and then gradually decreases
with more folds. In practice, five folds or ten folds are usually used.
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
boston = load_boston()
cvs = range(2, 100, 1)
cv_scores = []
X, y = shuffle(boston.data, boston.target)
reg = LinearRegression()
for cv in cvs:
    scores = cross_val_score(reg, X, y, cv=cv, n_jobs=-1)
    cv_scores.append(scores.mean())
plt.plot(cvs, cv_scores)
Output:
(A line plot of the mean cross-validation score against the number of folds is displayed.)
The examples above apply cross-validation to LinearRegression.
Complete the activity below to use another regressor.
Activity 4.11
Modify the following code to use the SGDRegressor class instead
of the LinearRegression class. Remember to use a pipeline and a
StandardScaler.
# Modify code below
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Feedback is provided for this activity.
The cross_val_score() function works for both regressors and
classifiers. In the code below, we pass a LogisticRegression classifier as
the first parameter to cross_val_score().
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
reg = LogisticRegression(max_iter=200)
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Output:
[0.96666667 0.96666667 0.93333333 1. 0.96666667]
0.9666666666666666
Now complete the following activity to apply cross-validation to a
SGDClassifier.
Activity 4.12
Modify the following code to use the SGDClassifier class instead
of the LogisticRegression class. Remember to use a pipeline, a
StandardScaler and the loss keyword argument with value "log" in the
SGDClassifier creation.
# Modify code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
reg = LogisticRegression(max_iter=200)
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Feedback is provided for this activity.
Stratification
For classification, the cross_val_score() function carefully divides the
dataset into folds so that the proportions of classes (categories) in the
dataset are approximately the same as those in the folds. In other words,
the labelled examples of each class are distributed nearly evenly to the
folds. This resampling technique is called stratification, or stratified
sampling.
Figure 4.13 Stratification: original dataset (top), and k stratified folds (bottom)
What problem does stratification solve? Consider a spam detector that
classifies email messages of which 5% are spam and 95% are not spam,
for example. When the example email messages are divided randomly
into folds, some folds may contain very few or even none of the spam
messages. When such folds are used in model training, the model may
not learn sufficient information about the spam messages; when they are
used in validation, insufficient spam messages may lead to inaccurate or
invalid evaluation results. These issues of imbalanced classification are
prevented by stratification.
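The sketch below shows what stratification buys, using a made-up label array with a 5%/95% split: StratifiedKFold distributes the rare class evenly across the folds, whereas a plain shuffled KFold may leave some folds with very few, or none, of the rare class.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
# Made-up imbalanced labels: 5% spam (1), 95% non-spam (0).
y = np.array([1] * 5 + [0] * 95)
X = np.zeros((100, 1))   # dummy features; only the labels matter here
for name, splitter in [("KFold", KFold(n_splits=5, shuffle=True)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True))]:
    counts = [int(y[val].sum()) for _, val in splitter.split(X, y)]
    print(name, "spam examples per fold:", counts)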
Leave-one-out cross-validation (optional)
In k-fold cross-validation, when k equals the number of examples of the
dataset, each of the k folds contains exactly one example. Each of the
k models is trained using k−1 examples (i.e. k−1 folds) and evaluated
using one example (i.e. one fold). This special case of k-fold cross-
validation is called leave-one-out (LOO) cross-validation.
There are two benefits of leave-one-out cross-validation. First, the k
models are trained using almost all of the examples in the dataset, and
therefore tend to have better performance than using fewer examples
in training. Second, all examples are used for evaluating the models
individually, and the overall evaluation result is comprehensive with
respect to the whole dataset. The drawback of leave-one-out cross-
validation is the heavy computation work required for training and
evaluating the k models.
To implement leave-one-out cross-validation, pass the splitter object
LeaveOneOut() as the cv keyword argument of the cross_val_score()
function. This is shown in the code below. Because the default scoring
for regressors, R2, does not work for a validation set of one example,
we use "neg_mean_squared_error", the negated mean of squares of
errors (negated MSE). The negation is required for converting an error
value (the higher the worse) to a score value (the higher the better).
All negated MSE values are negative or zero, and, for example, −20 is
better than −25 as negated MSE values for model performance.
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
reg = LinearRegression()
scores = cross_val_score(reg, X, y, cv=LeaveOneOut(),
scoring="neg_mean_squared_error")
print(scores.mean())
Output:
-23.72574551947615
Complete the activity below to apply leave-one-out cross-validation to a
classifier.
Activity 4.13
The code below is an earlier example of cross-validation. Modify the
code to use leave-one-out cross-validation. You don’t need to change the
scoring of the cross_val_score() function because the default accuracy
scoring of classification works for leave-one-out cross-validation.
# Modify code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
reg = LogisticRegression(max_iter=200)
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Feedback is provided for this activity.
As a remark, leave-one-out cross-validation is deterministic because
there is only one way to divide k examples into k folds. In fact, there is
no need to shuffle the features and labels of the dataset before passing
them to the cross_val_score() function. You can try removing the call
to the shuffle() function in the LeaveOneOut example; the evaluation
results should remain the same.
Evaluation metrics
You have learned about some measures, or metrics, for evaluating
machine learning models, including the mean squared error (MSE) and
R2 score for regression, and the accuracy for classification. This section
reviews these topics and discusses a few other useful evaluation metrics.
Regression metrics
In general, the R2 score and mean squared error are proper evaluation
metrics for most regression problems. The R2 score is the default
evaluation metric for regressors in scikit-learn.
Recall that the R2 score is defined as:

R2 = 1 - Σ(yi - y'i)^2 / Σ(yi - ȳ)^2

Here, y is the actual labels, y' is the predicted labels, and ȳ is the mean
of the actual labels. The value of R2 ranges from negative infinity for the
worst model to 1.0 for the best model. The R2 score does not work when
there is one test (or validation) example: that example's label equals the
mean, and the denominator in the R2 formula becomes zero.
Mean squared error, as described earlier in the unit, can be calculated
as:

MSE = (1/n) Σ(yi - y'i)^2

where n is the number of examples.
Two other main evaluation metrics for regression are mean absolute
error and median absolute error.
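To connect these formulas with the library, the following sketch computes the R2 score and the MSE directly with NumPy on some made-up values, and checks the results against scikit-learn's utility functions (introduced next).
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_actual = np.array([3.0, 5.0, 7.0, 9.0])   # made-up actual labels
y_pred = np.array([2.5, 5.5, 7.0, 8.0])     # made-up predicted labels
mse = np.mean((y_actual - y_pred) ** 2)
r2 = 1 - (np.sum((y_actual - y_pred) ** 2)
          / np.sum((y_actual - np.mean(y_actual)) ** 2))
print(mse, mean_squared_error(y_actual, y_pred))  # both print 0.375
print(r2, r2_score(y_actual, y_pred))             # both print 0.925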
The scikit-learn library supplies a number of utility functions for
computing evaluation metrics from actual values (labels) and predicted
values. To try some of these functions, we’ll generate some datasets for
regression using scikit-learn's make_regression() function.
The following code shows the basic use of make_regression(). We call
it to generate a dataset for regression with 1,000 samples (the n_samples
argument), one feature (the n_features argument), and Gaussian noise
with a standard deviation of 30 (the noise argument). The resulting
dataset is visualized in a plot.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=1, noise=30)
plt.plot(X[:, 0], y, '.')
Output:
(A scatter plot of the generated one-feature dataset is displayed.)
The code below generates a dataset, applies a LinearRegression and
computes a few evaluation metrics. In particular, it invokes the r2_
score(), mean_squared_error(), mean_absolute_error() and median_
absolute_error() functions to compute the R2 score, mean squared
error, mean absolute error and median absolute error, respectively, of
the predictions made by LinearRegression. All scikit-learn regressors,
including LinearRegression, use the R2 score by default; in other words,
the score() method returns the same value as the r2_score() function.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, \
mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000, n_features=4, noise=30)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_predicted = reg.predict(X_test)
print("r2 score:", r2_score(y_test, y_predicted))
print("mean squared error:", mean_squared_error(y_test, y_predicted))
print("mean absolute error:", mean_absolute_error(y_test, y_predicted))
print("median absolute error:", median_absolute_error(y_test, y_predicted))
print("score():", reg.score(X_test, y_test))
Output:
r2 score: 0.9625739980264532
mean squared error: 868.6503912699762
mean absolute error: 24.19176362171814
median absolute error: 20.992576794119877
score(): 0.9625739980264532
Activity 4.14
Modify the code below to use the Boston house prices dataset
instead of generating a dataset, and use the Lasso class instead of
LinearRegression.
# Modify code below
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, \
mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000, n_features=4, noise=30)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_predicted = reg.predict(X_test)
print("r2 score:", r2_score(y_test, y_predicted))
print("mean squared error:", mean_squared_error(y_test, y_predicted))
print("mean absolute error:", mean_absolute_error(y_test, y_predicted))
print("median absolute error:", median_absolute_error(y_test, y_predicted))
print("score():", reg.score(X_test, y_test))
Feedback is provided for this activity.
In addition to utility functions for computing evaluation metrics, a
few scikit-learn functions accept a scoring option for specifying the
metric for evaluating different models. One such function is cross_val_
score(), which has a scoring argument for the evaluation metric of the
cross-validation models. A previous example specified scoring="neg_
mean_squared_error" to use the negated MSE as the evaluation metric.
Some other values for the scoring argument are "neg_mean_absolute_
error", "neg_median_absolute_error", and "r2". The error metrics are
negated when they are used as scores, due to the opposite interpretation
of values: larger error values are worse, but larger score values are
better. For a list of scoring values, see https://scikit-learn.org/stable/
modules/model_evaluation.html#scoring-parameter.
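For instance, the following sketch (using the diabetes dataset purely
for illustration) evaluates a LinearRegression with the negated mean
absolute error as the cross-validation score:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
X, y = load_diabetes(return_X_y=True)
# Each fold's score is the negated mean absolute error
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_absolute_error")
print(scores.mean())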
Classification metrics
For many classification problems, the accuracy of predictions is
generally a proper evaluation metric. In fact, accuracy is the default
evaluation metric for classifiers in scikit-learn.
Accuracy = Number of correct predictions / Total number of predictions
On the other hand, the accuracy may be inappropriate for some
classification problems, especially for imbalanced classification
problems where the populations of the classes are very different or
skewed. Consider a spam detector that classifies email messages, of
which 5% (say) are spam and 95% are not spam. A dummy model
may ignore message contents, and predict all messages as non-spam.
Such a model ‘predicts’ correctly for all non-spam messages (95%),
and incorrectly for all spam messages (5%); the prediction accuracy
is therefore 95%. In machine learning, a score of 95% looks very
impressive, but is the model really good? Obviously not! The model
is useless, as it does nothing to solve the problem of detecting spam;
the accuracy metric is simply improper for this problem. The short
check below reproduces the 95% figure. Some alternative metrics for
classification are described in the following.
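A minimal check (using the 0/1 encoding adopted later in this section):
from sklearn.metrics import accuracy_score
NOT_SPAM, SPAM = 0, 1
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 100 * [NOT_SPAM]   # the dummy model predicts everything as non-spam
print(accuracy_score(actual, predicted))   # 0.95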
Classification metrics: Confusion matrix, precision,
recall and F1
Confusion matrix
The confusion matrix tabulates the correctness of predictions for each of
the classes in a classification problem. The confusion matrix of a binary
classification is shown below. In binary classification, we call the two
classes positive and negative; very often, the positive class is the class
we are interested in, or is the minority class. An example of a positive
class is spam messages in a spam detector, and a corresponding negative
class is non-spam messages.
Figure 4.14 Confusion matrix
The four entries in the above confusion matrix are explained as follows
(in each name, the first word, true or false, states whether the
prediction is correct, and the second word, positive or negative, is
the predicted class):
• True positive (TP), or hit — positive examples that are correctly
predicted as positive.
• False negative (FN), or miss — positive examples that are incorrectly
predicted as negative.
• False positive (FP), or false alarm — negative examples that are
incorrectly predicted as positive.
• True negative (TN), or correct rejection — negative examples that are
correctly predicted as negative.
As an illustration, consider a spam detector that classifies 100 messages,
of which five messages are actually spam, and 95 messages are
actually not spam. Let’s say the prediction results are: three of five
spam messages are correctly identified as spam (the remaining two are
identified incorrectly), and 90 of 95 messages are correctly identified as
non-spam (the remaining five messages are identified incorrectly). Then,
we have:
• True positive (TP) = 3
• False negative (FN) = 2
• False positive (FP) = 5
• True negative (TN) = 90
We’ll use these numbers for the spam detector in part of the coming
discussion.
The scikit-learn library supplies the confusion_matrix() function to
compute the confusion matrix. In the code below, we denote a non-spam
message as 0 and a spam message as 1, create two arrays of the actual
and predicted message classes (using the numbers of the above spam
detector), and display the confusion matrix.
from sklearn.metrics import confusion_matrix
NOT_SPAM, SPAM = 0, 1
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 90 * [NOT_SPAM] + 5 * [SPAM] + 3 * [SPAM] + 2 * [NOT_SPAM]
print(confusion_matrix(actual, predicted))
Output:
[[90 5]
[ 2 3]]
The description above focuses on binary classification. The confusion
matrix for multiclass classification, i.e. n > 2 classes, has shape n × n,
with one row and one column for each of the n classes.
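For example, with three classes and some made-up labels, the
confusion_matrix() function returns a 3 × 3 matrix in which row i,
column j counts the examples of actual class i predicted as class j:
from sklearn.metrics import confusion_matrix
actual    = [0, 0, 1, 1, 2, 2, 2]   # made-up actual classes
predicted = [0, 1, 1, 1, 2, 0, 2]   # made-up predicted classes
print(confusion_matrix(actual, predicted))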
Precision and recall
Precision and recall are two commonly-used metrics of classification.
Precision is the ratio of correct positive predictions to the total number
of positive predictions:
Precision = TP / (TP + FP)
A high precision value means that most positive predictions are correct.
In a spam detector, a high precision means that most messages in the
spam folder are actually spam, i.e. non-spam messages rarely appear in
the spam folder. Using the numbers of the above spam detector:
Precision = TP / (TP + FP) = 3 / (3 + 5) = 0.375
Recall is the ratio of correct positive predictions to the number of actual
positive examples in the dataset:
Recall = TP / (TP + FN)
A high recall value means that most positive examples are identified
(predicted), i.e. there are few incorrect negative predictions. In a spam
detector, a high recall means that most messages in the inbox are
actually non-spam, i.e. spam messages rarely appear in the inbox. Using
the numbers of the above spam detector:
Recall = TP / (TP + FN) = 3 / (3 + 2) = 0.6
The F1 score
The precision and recall can be combined into a metric called the F1
score:
F1 score = (2 × Precision × Recall) / (Precision + Recall)
         = 2 × TP / (2 × TP + FP + FN)
The value of the F1 score tends to be the smaller of the values of
precision and recall. Therefore, a high F1 value indicates that both
precision and recall are high, and the prediction results are good.
Using the numbers of the above spam detector:
F1 score = (2 × Precision × Recall) / (Precision + Recall)
         = (2 × 0.375 × 0.6) / (0.375 + 0.6) ≈ 0.4615
Activity 4.15
Compute the F1 score of a dummy spam detector that ‘predicts’ all
messages as non-spam, using a dataset of five spam messages and 95
non-spam messages.
Feedback is provided for this activity.
Classification metrics: Using scikit-learn functions
The scikit-learn library supplies the functions precision_score(),
recall_score() and f1_score() to compute the precision, recall and F1
score, respectively, of class predictions. In the code below, we denote
a non-spam message as 0 and a spam message as 1; create two lists of
the actual and predicted message classes (using the numbers of the
earlier spam detector); and display the precision, recall and F1 score.
The resulting metrics are identical to the ones that we calculated
above.
from sklearn.metrics import precision_score, recall_score, f1_score
NOT_SPAM, SPAM = 0, 1
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 90 * [NOT_SPAM] + 5 * [SPAM] + 3 * [SPAM] + 2 * [NOT_SPAM]
print("precision:", precision_score(actual, predicted))
print("recall:", recall_score(actual, predicted))
print("f1:", f1_score(actual, predicted))
Output:
precision: 0.375
recall: 0.6
f1: 0.4615384615384615
Activity 4.16
Modify the code below to find the precision, recall, and f1 score for a
dummy spam detector that ‘predicts’ all messages as non-spam, using a
dataset of five spam messages and 95 non-spam messages.
from sklearn.metrics import precision_score, recall_score, f1_score
NOT_SPAM, SPAM = 0, 1
# Modify code below
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 90 * [NOT_SPAM] + 5 * [SPAM] + 3 * [SPAM] + 2 * [NOT_SPAM]
print("precision:", precision_score(actual, predicted))
print("recall:", recall_score(actual, predicted))
print("f1:", f1_score(actual, predicted))
Feedback is provided for this activity.
We’re going to use some datasets for classification and computing
the above metrics. To generate such a dataset, we use the make_
classification() function. As illustrated in the following code, we
create 100 samples (the n_samples argument), in two classes (the n_
classes argument), flipping 3% of the class values randomly (the flip_
y argument), using two features (the n_features argument), zero of
two features being redundant (the n_redundant argument), and two of
two features being informative (the n_informative argument). The data
points for the two classes are then visualized in a plot. For more details
of using the make_classification() function, read its documentation at
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_
classification.html.
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=100, n_classes=2, flip_y=0.03,
n_features=2, n_redundant=0, n_informative=2)
X0 = X[y==0]
plt.plot(X0[:, [0]], X0[:, [1]], 'or')
X1 = X[y==1]
plt.plot(X1[:, [0]], X1[:, [1]], '^g')
Output: (a scatter plot of the data points of the two classes)
The code below generates a dataset, applies a RidgeClassifier,
and computes a few evaluation metrics. In particular, it invokes the
precision_score(), recall_score(), f1_score(), and accuracy_score()
functions to compute the precision, recall, F1 score and accuracy,
respectively, of the predictions made by RidgeClassifier. All scikit-
learn classifiers, including RidgeClassifier, use the accuracy score by
default; in other words, the score() method returns the same value as
the accuracy_score() function.
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import precision_score, recall_score, \
f1_score, accuracy_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_classes=2, flip_y=0.03,
n_features=2, n_redundant=0, n_informative=2)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3)
clf = RidgeClassifier()
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("precision:", precision_score(y_test, y_predicted))
print("recall:", recall_score(y_test, y_predicted))
print("f1 score:", f1_score(y_test, y_predicted))
print("accuracy:", accuracy_score(y_test, y_predicted))
print("score():", clf.score(X_test, y_test))
Output:
precision: 0.84
recall: 0.863013698630137
f1 score: 0.8513513513513513
accuracy: 0.8533333333333334
score(): 0.8533333333333334
The precision_score(), recall_score(), and f1_score() metric
functions work for multiclass classification by performing some type
of averaging over the per-class results. The type of averaging is
specified using the average keyword argument of the functions. Work
through the activity below to try it out.
Activity 4.17
Modify the code below to do the following.
1 Use the iris dataset instead of generating a dataset.
2 For the calls to the precision_score(), recall_score(), and f1_
score() functions, specify the argument average="weighted". This
argument is required for multiclass classification, as the iris dataset
has three classes; refer to the documentation for these functions for
details.
3 Invoke the classification_report() function of the sklearn.
metrics module, as classification_report(y_test, y_predicted).
This function creates a text report of the main classification metrics;
refer to its documentation for details.
# Modify code below
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import precision_score, recall_score, \
f1_score, accuracy_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_classes=2, flip_y=0.03,
n_features=2, n_redundant=0, n_informative=2)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3)
clf = RidgeClassifier()
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("precision:", precision_score(y_test, y_predicted))
print("recall:", recall_score(y_test, y_predicted))
print("f1 score:", f1_score(y_test, y_predicted))
print("accuracy:", accuracy_score(y_test, y_predicted))
print("score():", clf.score(X_test, y_test))
Feedback is provided for this activity.
Classification metrics: Area under the ROC curve
Another classification metric is area under the ROC curve (AUC).
To discuss AUC, let’s begin with the ROC curve. The ROC (receiver
operating characteristic) curve is a graph of true positive rate versus
false positive rate, which can be calculated from the values of the
confusion matrix as:
True positive rate (TPR) = TP / (TP + FN)

False positive rate (FPR) = FP / (FP + TN)
The true positive rate (TPR) is the same as recall, and is also called
sensitivity, or probability of detection. It indicates how completely
the actual positive examples are identified. The values of TPR range
from 0 to 1, and the ideal TPR is 1 (i.e. FN = 0).
The false positive rate (FPR) is also called probability of false alarm.
It indicates the rate of false alarm among all negative examples. The
values of FPR range from 0 to 1, and the ideal FPR is 0 (i.e. FP = 0).
Let’s consider how to obtain the TPR and FPR to plot the ROC curve.
For classification models that compute a probability or other numerical
value for predicting the class of an example, a threshold is set as the
decision point or break point for the predicted class of an example. A
typical value is 0.5 for a probability: if the model returns a value under
0.5, then the example is predicted as negative; otherwise, it is predicted
as positive.
Changing the threshold changes the model’s predictions, and therefore
the values of TP, FP, TN, FN, TPR and FPR. A plot of the TPR versus
FPR values when the threshold varies is the ROC curve. There are some
cases for the threshold, TPR, and FPR that we’re particularly interested
in (cases of boundary values are ignored for the sake of simplicity in
this discussion):
• When the threshold is 0, all values returned by the model are above
the threshold, and thus all predictions are positive. Then, FN = TN =
0, and TPR = FPR = 1. This means that the point (1, 1) exists on all
ROC curves.
• When the threshold is 1, all values returned by the model are below
the threshold, and thus all predictions are negative. Then, TP = FP =
0, and TPR = FPR = 0. This means that the point (0, 0) exists on all
ROC curves.
• The ideal TPR is 1, and the ideal FPR is 0. This means that the ROC
curve of an ideal model passes the point (0, 1).
Figure 4.15 The ROC curve (left) and area under the ROC curve (right)
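An ROC curve like the one in the figure can be plotted with
scikit-learn's roc_curve() function, which returns the FPR and TPR
values at a series of candidate thresholds. Below is a minimal sketch
using a generated dataset; the make_classification() arguments follow
the earlier examples in this section.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=1000, n_classes=2, flip_y=0.03,
                           n_features=2, n_redundant=0, n_informative=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)                 # the ROC curve
plt.plot([0, 1], [0, 1], ":")      # baseline of a random classifier
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")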
A random classifier makes random predictions; at every threshold
value, its TPR and FPR are equal (in expectation). The ROC curve for
this classifier is therefore the straight line joining (0, 0) and
(1, 1), shown as the dotted line on the left of the above figure. This
dotted line can be used as a baseline for reference: a useful
classifier has TPR greater than FPR, and its curve lies above the
dotted line. In fact, the closer the ROC curve comes to the top-left
corner (i.e. the point (0, 1)), the better the model.
The area under the ROC curve (AUC) is a classification metric: the
higher the ROC curve, the better the model, and the larger the area
under the ROC curve! Because the ROC curve goes from (0, 0) to (1,
1), the AUC value ranges from 0 (for the worst possible model) to 1 (for
the ideal model). Note that the ROC curve and therefore the AUC metric
work for classifiers that compute a probability or similar numerical
values for predicting classes of examples, such as logistic regression,
neural networks, and decision trees (the latter two are discussed later in
the unit).
The scikit-learn library supplies the roc_auc_score() function to
compute the AUC score of class predictions. The first two parameters
of the function are the actual labels and the target scores or
probabilities of the predictions. The following code segment uses the
function to find the AUC score for applying a LogisticRegression to
the iris dataset. Because the iris dataset is multiclass, we need to
specify the argument multi_class="ovr" for computing the score of each
class using the one-versus-rest approach (refer to the earlier section
on multiclass classification), and the argument average="weighted" for
applying weighted averaging to the scores of the classes.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
print("roc_auc:", roc_auc_score(y_test, y_proba, multi_,
average="weighted"))
Output:
roc_auc: 1.0
Work through the following activity to find the AUC score of a dummy
classifier.
Activity 4.18
The scikit-learn library supplies a dummy classifier, i.e. the
DummyClassifier class of the sklearn.dummy module. Modify the code
below to use DummyClassifier instead of LogisticRegression, and
specify the argument strategy="uniform" in the DummyClassifier
creation.
# Modify code below
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
print("roc_auc:", roc_auc_score(y_test, y_proba, multi_,
average="weighted"))
Feedback is provided for this activity.
In addition to the classification metric functions discussed above, a
few scikit-learn functions accept a scoring option for specifying the
evaluation metric. For example, the cross_val_score() function has a
scoring argument of the metric for evaluating cross-validation models.
Some possible values of classifier metrics are "accuracy", "f1", "f1_
weighted", "precision", "precision_weighted", "recall", "recall_
weighted", "roc_auc", "roc_auc_ovr" and "roc_auc_ovo". For a list
of scoring values, see https://scikit-learn.org/stable/modules/model_
evaluation.html#scoring-parameter.
Tuning hyperparameters
Hyperparameters are settings that configure how a model is trained.
Most of the time, hyperparameters are not learned by the model, but are
set manually before the model is trained. This section discusses some
facilities for tuning or optimizing hyperparameters to achieve better
evaluation results.
Models with built-in cross-validation
The scikit-learn library supplies some models that perform built-in
cross-validation to determine optimal values for certain hyperparameters
of the models. The names of these models have the suffix CV, such as
RidgeCV, which we’ll look into next.
Recall that the Ridge model performs linear regression with L2
regularization. When a Ridge object is created, the alpha argument
designates the λ value of L2 regularization. Technically, this alpha
argument is a hyperparameter that controls how the Ridge model is
trained. The RidgeCV model has built-in cross-validation for determining
the optimal alpha value of the model. When a RidgeCV object is created,
the alphas argument (which defaults to (0.1, 1.0, 10.0)) designates a
list of candidate alpha values. When the RidgeCV model is trained (i.e.
its fit() method is called), cross-validation is carried out to select
the optimal alpha from the candidate values.
The code below demonstrates the use of the RidgeCV model. The alphas
argument designates (0.0001, 0.001, 0.01, 0.1, 1, 10, 100) as the
candidate alpha values. When I ran the code, the optimal alpha value
was either 0.1 or 0.01 most of the time.
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
reg = RidgeCV(alphas=(0.0001, 0.001, 0.01, 0.1, 1, 10, 100))
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score, reg.alpha_)
Output:
0.677489673474316 0.01
Now attempt the activity below to explore and use another scikit-learn
class with built-in cross-validation.
Activity 4.19
The LogisticRegressionCV model is a classifier with built-in cross-
validation. Read its documentation at https://scikit-learn.org/stable/
modules/generated/sklearn.linear_model.LogisticRegressionCV.
html. Then, write code to use this classifier to work on the iris dataset.
You can use default candidate values for the hyperparameters to be
optimized by cross-validation.
Feedback is provided for this activity.
For a list of linear models with built-in cross-validation, go to https://
scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_
model and see the class names with the suffix CV.
Grid search
Trying out different hyperparameters and comparing the evaluation
results are the key components of a traditional optimization technique
called grid search. The models with built-in cross-validation discussed
above implement grid search internally. The scikit-learn library supplies
additional facilities for performing grid search, such as the GridSearchCV
class.
To use the GridSearchCV class, we need to specify two mandatory
arguments — the model object and the grid parameters. (The scikit-learn
documentation occasionally refers to hyperparameters as parameters,
especially when they are used as function/method parameters.) The grid
parameters are a dictionary in which every entry contains a
hyperparameter name as the key, and a list of candidate values for that
hyperparameter as the value.
The tunable hyperparameters of a model can be obtained using the get_
params() method. In the code below, we call the get_params() method
to display the hyperparameters of the Ridge model.
from sklearn.linear_model import Ridge
display(Ridge().get_params())
Output:
{'alpha': 1.0,
'copy_X': True,
'fit_intercept': True,
'max_iter': None,
'normalize': False,
'random_state': None,
'solver': 'auto',
'tol': 0.001}
Let’s try to optimize the alpha and normalize hyperparameters of Ridge
for the Boston dataset. In the following code, we create a dictionary
called param_grid for the two hyperparameters: alpha with candidate
values of (0.0001, 0.001, 0.01, 0.1, 1, 10, 100), and normalize with
candidate values of (True, False). Next, we create a GridSearchCV with
a Ridge model and the param_grid dictionary. The search operation is
actually performed when the fit() method of GridSearchCV is called. To
obtain the results of the search, i.e. the best model, we use the attributes
best_estimator_ to get the model, best_score_ to get the score, and
best_params_ to get the hyperparameters of the model.
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
reg = Ridge()
param_grid = {"alpha": (0.0001, 0.001, 0.01, 0.1, 1, 10, 100),
              "normalize": (True, False)}
grid = GridSearchCV(reg, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Output:
Ridge(alpha=0.1, normalize=True)
0.6806248851208401
{'alpha': 0.1, 'normalize': True}
If you’re interested in the performance results of individual models with
different combinations of hyperparameters, use the cv_results_ attribute
of GridSearchCV. The code below displays this attribute, showing the
execution times, scores, etc. of the models.
display(grid.cv_results_)
Output:
{'mean_fit_time': array([0.00086937, 0.00070424, 0.00083237, 0.00066833, 0.00081053,
0.00069809, 0.00080175, 0.00065131, 0.00077791, 0.00065265,
0.00077271, 0.00064263, 0.0007647 , 0.00062561]),
'std_fit_time': array([4.15471584e-05, 2.64373417e-05, 1.26279948e-05, 4.41531640e-06,
1.55236679e-05, 6.43809632e-05, 1.23199441e-05, 5.86606615e-06,
8.76576636e-06, 7.16621367e-06, 1.35458686e-05, 8.29833666e-06,
7.33554729e-06, 7.22058057e-06]),
'mean_score_time': array([0.00047708, 0.00047836, 0.00045648, 0.00046272, 0.00045471,
0.00045433, 0.00044227, 0.00044227, 0.0004353 , 0.00043674,
0.00043092, 0.00042868, 0.00042744, 0.00042429]),
'std_score_time': array([1.07293659e-05, 1.14813833e-05, 2.10349512e-06, 1.20625665e-05,
7.54307554e-06, 8.68393650e-06, 1.58866227e-06, 8.25217643e-06,
7.15478227e-06, 6.69170917e-06, 6.90607719e-06, 2.49144563e-06,
7.10535368e-06, 6.46567274e-06]),
'param_alpha': masked_array(data=[0.0001, 0.0001, 0.001, 0.001, 0.01, 0.01, 0.1, 0.1, 1,
1, 10, 10, 100, 100],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_normalize': masked_array(data=[True, False, True, False, True, False, True, False,
True, False, True, False, True, False],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'alpha': 0.0001, 'normalize': True},
{'alpha': 0.0001, 'normalize': False},
{'alpha': 0.001, 'normalize': True},
{'alpha': 0.001, 'normalize': False},
{'alpha': 0.01, 'normalize': True},
{'alpha': 0.01, 'normalize': False},
{'alpha': 0.1, 'normalize': True},
{'alpha': 0.1, 'normalize': False},
{'alpha': 1, 'normalize': True},
{'alpha': 1, 'normalize': False},
{'alpha': 10, 'normalize': True},
{'alpha': 10, 'normalize': False},
{'alpha': 100, 'normalize': True},
{'alpha': 100, 'normalize': False}],
'split0_test_score': array([0.73018406, 0.73021026, 0.7299388 , 0.73020021, 0.72754189,
0.73010005, 0.70790659, 0.72912984, 0.59192885, 0.72223088,
0.25751313, 0.70680873, 0.0228881 , 0.67322678]),
'split1_test_score': array([0.7437057 , 0.74367001, 0.74403526, 0.74368302, 0.74693719,
0.74381088, 0.7591945 , 0.7448891 , 0.69148046, 0.74749106,
0.32334775, 0.74090311, 0.05181684, 0.70714885]),
'split2_test_score': array([0.70769697, 0.70768717, 0.70777283, 0.70768002, 0.70810365,
0.70760869, 0.69688699, 0.70691031, 0.55918025, 0.70165792,
0.21763714, 0.68944757, 0.00299459, 0.66438867]),
'split3_test_score': array([ 0.67236184, 0.67228607, 0.67305055, 0.67229944, 0.6793781 ,
0.67243031, 0.71527491, 0.67349198, 0.68566897, 0.67449412,
0.23318246, 0.66376158, -0.17924573, 0.64441292]),
'split4_test_score': array([0.5293168 , 0.52932893, 0.52919951, 0.52932107, 0.52804527,
0.52924279, 0.52386143, 0.52849776, 0.5145376 , 0.52461996,
0.27061465, 0.54890597, 0.04635174, 0.64828221]),
'mean_test_score': array([ 0.67665307, 0.67663649, 0.67679939, 0.67663675, 0.67800122,
0.67663855, 0.68062489, 0.6765838 , 0.60855923, 0.67409879,
0.26045902, 0.66996539, -0.01103889, 0.66749189]),
'std_test_score': array([0.07752643, 0.07751932, 0.07759329, 0.07752245, 0.07823207,
0.07755337, 0.08118785, 0.07782729, 0.06982372, 0.07849421,
0.03645417, 0.06551796, 0.08589068, 0.02243588]),
'rank_test_score': array([ 4, 7, 3, 6, 2, 5, 1, 8, 12, 9, 13, 10, 14, 11],
dtype=int32)}
Now complete the activity below to perform a grid search on a
SGDClassifier model.
Activity 4.20
Explore the hyperparameters of SGDClassifier by executing the code
segment below.
from sklearn.linear_model import SGDClassifier
display(SGDClassifier().get_params())
Output:
{'alpha': 0.0001,
'average': False,
'class_weight': None,
'early_stopping': False,
'epsilon': 0.1,
'eta0': 0.0,
'fit_intercept': True,
'l1_ratio': 0.15,
'learning_rate': 'optimal',
'loss': 'hinge',
'max_iter': 1000,
'n_iter_no_change': 5,
'n_jobs': None,
'penalty': 'l2',
'power_t': 0.5,
'random_state': None,
'shuffle': True,
'tol': 0.001,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
Modify the code below to perform a grid search on a SGDClassifier
model using the iris dataset. Use penalty="elasticnet" in the
SGDClassifier model, and the following hyperparameter candidate
values:
• "alpha": (0.0001, 0.001, 0.01, 0.1, 1, 10, 100)
• "l1_ratio": np.linspace(0, 1, 11)
# Modify code below
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
reg = Ridge()
param_grid = {"alpha": (0.0001, 0.001, 0.01, 0.1, 1, 10, 100),
"normalize": (True, False)}
grid = GridSearchCV(reg, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Feedback is provided for this activity.
A grid search may take a long time to execute, especially for a large
number of hyperparameter combinations. By default, GridSearchCV
does not run jobs in parallel; you may specify the keyword argument n_
jobs=-1 in GridSearchCV creation to run jobs in parallel using all CPUs.
A grid search tries out all combinations of the candidate hyperparameter
values. A random search, on the other hand, randomly selects some of
the combinations to try in order to save time and computation. A random
search may perform better than a grid search when only a small number
of hyperparameters affect the performance of the machine learning
algorithm. The scikit-learn RandomizedSearchCV class implements
random search; its usage is very similar to the GridSearchCV class.
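As a minimal sketch, the grid search on Ridge shown above can be turned
into a random search by swapping in RandomizedSearchCV; the n_iter
argument sets how many of the candidate combinations are sampled.
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
param_distributions = {"alpha": (0.0001, 0.001, 0.01, 0.1, 1, 10, 100),
                       "normalize": (True, False)}
# Sample only 5 of the 14 possible combinations
search = RandomizedSearchCV(Ridge(), param_distributions, n_iter=5)
search.fit(X, y)
print(search.best_score_, search.best_params_)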
Now complete the self-test below to check your understanding of model
evaluation discussed in this section. Suggested answers can be found at
the end of the unit.
Self-test 4.4
1 Model evaluation is involved in three main phases. What are the
three phases? What set of data is involved in each phase? What is
the purpose of carrying out model evaluation in each phase?
2 Five-fold cross-validation is carried out on a dataset of 1,000
examples. How many examples does a fold contain? How many
models are there in the cross-validation operation? For each of
these models, how many examples does each of the training set and
validation set contain?
3 Repeat question 2 above for leave-one-out cross-validation on a
dataset of 1,000 examples.
4 What are the pros and cons of leave-one-out cross-validation?
5 What are four commonly-used regression metrics?
6 In what situation is the accuracy of predictions not a good metric for
classification?
7 Consider the confusion matrix values of TP = 10, FN = 20, FP = 30,
and TN = 40. Compute the precision, recall, and F1 score.
8 What are the values in the two axes of an ROC curve? Does a higher
ROC curve or a lower one represent a better model?
9 What does the AUC score stand for? What is the range of the AUC
score values of classifiers?
10 What are two types of facilities scikit-learn supplies for tuning
hyperparameters?
11 A grid search is carried out on a hyperparameter with three
candidate values and using five-fold cross-validation. How many
models are evaluated in cross-validation in total?
In this section, you learned about cross-validation for measuring model
performance effectively, and grid search for optimizing hyperparameters
of models. These techniques, which are applicable to all supervised
machine learning methods, help you understand the performance of a
model, and how the performance varies with model hyperparameters.
They are also useful in our study of machine learning methods and
understanding the related hyperparameters.
If you want to learn more about cross-validation, watch these videos:
• Machine learning fundamentals: Cross-validation: https://www.
youtube.com/watch?v=fSytzGwwBVw
• K-fold cross-validation — Intro to machine learning: https://www.
youtube.com/watch?v=TIgfjmp-4BA
In the next section, we turn to a machine learning method that works for
both regression and classification — k-nearest neighbours.
K-nearest neighbours
The k-nearest neighbours (kNN) algorithm makes predictions based
on similar examples. To predict the label of an unlabelled example, k
labelled examples, or neighbours, nearest to it are identified, and their
labels are averaged in some way to produce the predicted result for the
query example (query point). For classification, the most frequent class
of the k labels can be selected; for regression, the mean of the k labels
can be computed and returned.
How kNN works
The figure below illustrates how kNN works for a classification
problem. It is a binary classification problem in which the two classes
are denoted by circles and squares. The cross denotes an unlabelled
example whose class we want to predict. When k is 1, as shown on the
left of the figure, the neighbour nearest to the cross is a square; thus,
the cross is predicted to be the square class. In the middle of the figure,
when k is 3, the three neighbours nearest to the cross are two circles
and a square; then, the cross is predicted to be the circle class. On the
right of the figure, when k is 5, the five neighbours nearest to the cross
are three squares and two circles; then, the cross is predicted to be the
square class. As you have seen, the value of k has a significant effect on
the predictions made by kNN; we’ll further discuss the value of k later.
Figure 4.16 kNN predicting the cross as either the circle or square class: k=1
(left), k=3 (middle), and k=5 (right)
kNN is easy to understand, and it works very well for some simple
machine learning problems.
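Before the worked example below, here is the prediction mechanism in a
few lines of NumPy: a from-scratch sketch for classification (not the
scikit-learn implementation used later in this section), with made-up
2D points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every stored example
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Labels of the k nearest neighbours
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote (for regression, return nearest_labels.mean() instead)
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.5, 5.0])))   # predicts class 1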
A worked example
Let’s consider a worked example that illustrates the mechanism of kNN.
Two important factors involved in making a good soft drink are the
amount of sugar and the amount of carbon dioxide. A company made a
few trial products and conducted a survey to obtain customer responses.
The results are shown in the following table. The amounts of sugar and
carbon dioxide are specified in the integer range from 0 to 10 for the
sake of simplicity.
Table 4.1 Survey of soft drink trial products

Trial product   Amount of sugar (x1)   Amount of carbon dioxide (x2)   Response (y)
#1              1                      4                               Dislike
#2              5                      1                               Dislike
#3              4                      7                               Like
#4              8                      5                               Like
Later, the company wants to know the response of another trial product
(query point) with the amounts of sugar and carbon dioxide being 5 and
5 respectively. Without conducting another survey, we can apply kNN
(say k is 3) to predict the response.
We first calculate the Euclidean distance (discussed later) from each
of the surveyed trial products to the query trial product, as shown in
the fifth column of the table below. For example, the distance from
trial product #1 (1, 4) to the query point (5, 5) is
√((5 − 1)² + (5 − 4)²) = √17 ≈ 4.12.
Then, we rank the trial products by increasing distance from the query
point (the sixth column of the table). The three nearest neighbours
(i.e. k is 3, or 3-NN) are identified as trial products #3, #4, and #2,
whose respective responses are Like, Like, and Dislike. By majority
vote, we predict the response to the new trial product (5, 5) to be
Like.
Table 4.2 Applying kNN to soft drink trial products

Trial     Amount of    Amount of carbon   Response   Distance    Rank by minimum
product   sugar (x1)   dioxide (x2)       (y)        to (5, 5)   distance
#1        1            4                  Dislike    4.12        4
#2        5            1                  Dislike    4.00        3
#3        4            7                  Like       2.24        1
#4        8            5                  Like       3.00        2
This example applies kNN to perform classification — like or dislike
a product. If the survey responses are numeric liking scores (instead of
Like or Dislike), then it becomes a regression problem.
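This worked example can be reproduced with scikit-learn's
KNeighborsClassifier, which is introduced properly later in this
section; the sketch below simply encodes the four trial products from
Table 4.2.
from sklearn.neighbors import KNeighborsClassifier
X = [[1, 4], [5, 1], [4, 7], [8, 5]]         # sugar, carbon dioxide
y = ["Dislike", "Dislike", "Like", "Like"]   # survey responses
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[5, 5]]))                 # ['Like']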
Now work on the following activity to apply kNN to perform regression.
Activity 4.21
Assume the soft drink survey was carried out to obtain the responses
of numeric liking scores, as shown in the following table. Apply kNN,
with k equal to 2, to predict the score for the query trial product (5, 5).
Table 4.3 Survey of soft drink trial products (with numeric responses)

Trial product   Amount of sugar (x1)   Amount of carbon dioxide (x2)   Response (y)
#1              2                      4                               3.5
#2              5                      2                               4.6
#3              4                      7                               5.1
#4              8                      5                               6.5
Feedback is provided for this activity.
Characteristics of kNN
The kNN algorithm differs from linear regression and logistic regression
in some fundamental ways. Key characteristics of kNN are described
and contrasted to the other algorithms next, with some machine learning
terminology introduced in the following discussion:
• In training, kNN does not learn at all from the examples, but rather
stores them for later use. As it doesn’t learn, kNN is considered
‘lazy’ and is sometimes described as a lazy learner or an instance-
based learner. In contrast, linear regression and logistic regression
learn a lot from the examples during training and are described as
eager learners.
• Because kNN does not learn from the training examples but only
stores them, the training process is very fast. However, making
predictions is slow due to the measurement of distances from all
the stored examples and identification of the k nearest neighbours;
this is a problem when the number of stored examples is huge. On
the other hand, linear regression and logistic regression are slow to
train, but fast in making predictions.
• Linear regression and logistic regression assume a function or
relation between the features and labels of the examples. Their
models work by learning the parameters of the relation during
training, and using the parameters and the relation to make
predictions. They are called parametric methods or models. In
contrast, kNN does not assume a relation between the features and
the labels, and it is called a non-parametric method. In general,
a non-parametric method does not have the risk of using a bad
(inaccurate and/or imprecise) relation, but it requires a large number
of labelled examples to work.
Now complete the following two activities that let you explore how the
kNN algorithm performs classification on different datasets.
Activity 4.22
Machine Learning Playground is a website for exploring machine
learning algorithms. Use the website and perform the following steps to
learn about kNN.
Figure 4.17 Machine Learning Playground, https://ml-playground.com/
1 Go to https://ml-playground.com/. The page shows an initially
empty canvas on the left and some buttons and input fields on the
right.
2 Set up two classes of data points on the canvas as follows. Click on
the orange square button on the right, and then click on the canvas
multiple times at different locations to create data points for the
orange class. Repeat for the purple square button to create data
points for the purple class. If necessary, click on the red cross button
on the right, and then click on a data point on the canvas to remove
it.
3 Select the kNN algorithm by clicking on the K Nearest Neighbors
button.
4 Set the K parameter to 3 if necessary. Then, train the kNN model
by clicking on the Train button. The canvas shows two regions of
light orange and light purple respectively. The two regions indicate
the predicted classes when a query (unseen) point falls in them. For
example, a query point in the light orange (resp. light purple) region
is predicted as the orange (resp. purple) class by kNN.
5 Modify the data points and the K parameter, train the model, and
observe the two regions of predictions. Repeat several times for
different data and values of the K parameter.
6 Scroll down the page and read the brief overview of the kNN
algorithm.
Figure 4.18 An example of a kNN result on Machine Learning Playground
The regions of different colours provide a visual and comprehensive
way for observing the overall prediction outcomes of a machine learning
algorithm. The boundary between the regions is called a decision
boundary, which ‘decides’ the predicted class of a query data point.
Activity 4.23
K-Nearest Neighbors Demo is another page for exploring kNN. It
supports two to five classes of data points, and the data points can
be modified interactively by dragging them. Perform the following
steps.
Figure 4.19 The K-Nearest Neighbors Demo, http://vision.stanford.edu/teaching/
cs231n-demos/knn/
1 Go to http://vision.stanford.edu/teaching/cs231n-demos/knn/.
The page shows four classes of data points and four regions of
predictions.
2 Drag a data point and observe the changes to the regions and the
decision boundaries. Repeat for a few other data points.
3 Scroll down the page and modify Num Neighbors (K), Num classes
and Num points. Observe the regions and decision boundaries, and
then repeat step 2.
Basic code examples
Let’s look at some code that applies kNN. The scikit-learn library
supplies the KNeighborsRegressor and KNeighborsClassifier classes in
the sklearn.neighbors module for kNN, among others.
Using KNeighborsRegressor
The KNeighborsRegressor class implements kNN for regression. Its
basic usage, as shown below, is similar to other scikit-learn regressors.
In the code, we load the Boston house prices dataset, create a
KNeighborsRegressor object, call the fit() method and the score()
method, and finally display the score. By default, KNeighborsRegressor
uses the five nearest neighbours, which may be overridden using the n_
neighbors argument in KNeighborsRegressor creation.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print(score)
Output:
0.37997975526680194
When I ran the code, the score was only about 0.3–0.4, which is quite
bad. What was wrong? It turned out that the features of the Boston dataset had very
different scales (ranges of values), and those features in large scales
dominated the ‘closeness’ or distance calculation. For example, the AGE
(proportion of owner-occupied units built prior to 1940) and TAX (full-
value property-tax rate per $10,000) features had large scales, but the
RM (average number of rooms per dwelling) feature had a small scale.
The problem is that the features in large scales may not be the most
informative or relevant features for our predictions; it may be that some
features in small scales are more informative and important!
To address this issue, we can standardize the features using the
StandardScaler class, as in the following. When I ran this code, the
resulting score improved to roughly 0.7–0.8, which is comparable to the
score achieved by applying LinearRegression to the same dataset.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
knn = KNeighborsRegressor()
pipe = make_pipeline(StandardScaler(), knn)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(score)
Output:
0.8092473344084745
In the above code segments, we used the fit() and score() methods to
obtain the scores. A better approach to evaluating models and comparing
scores is to use cross-validation, as you will do in the next activity.
Activity 4.24
Write code to use cross-validation and compute the scores of applying
kNN regression to the Boston dataset without and with standardizing the
features. (Hint: Use the cross_val_score() function and, if necessary,
refer to the examples in the earlier Cross-validation section.)
Feedback is provided for this activity.
Using KNeighborsClassifier
The KNeighborsClassifier class implements kNN for classification. In
the code below, we load the iris dataset, create a KNeighborsClassifier
object, use a pipeline with a StandardScaler, call the fit() method
and the score() method, and finally display the score. By default,
KNeighborsClassifier uses the five nearest neighbours, which may be
overridden using the n_neighbors argument in KNeighborsClassifier
creation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
knn = KNeighborsClassifier()
pipe = make_pipeline(StandardScaler(), knn)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(score)
Output:
0.9333333333333333
Now complete the following activity to compare the classification
scores without and with feature standardization.
Activity 4.25
Write code to use cross-validation and compute the scores of
applying kNN classification to the iris dataset both without and with
standardizing the features. (Hint: use the cross_val_score() function
and, if necessary, refer to the examples in the earlier ‘Cross-validation’
section.)
Feedback is provided for this activity.
The value of k
We mentioned near the beginning of this section that the k
hyperparameter has a significant impact on the predictions made by
kNN. More details are discussed in the following.
Let’s consider how the value of k relates to the variance of a kNN
implementation. When k is 1, only the nearest example, or neighbour,
is used to make a prediction. If this example has much noise, then the
predicted result is very bad. In other words, the prediction highly (or
solely) depends on the variation of the nearest example — the variance
is high. As k increases, the combined variation of the k nearest examples
decreases, and so does the variance.
To see how the bias of a kNN implementation varies with the value of k,
consider the case when k equals the total number of examples. Then,
all predicted results are simply the average of all of the examples. A
solution with such a value of k is dumb, and captures the least statistical
information (only the average) of the relation between the features and
labels of the examples. This kNN implementation has a high bias. As
k decreases, more statistical information from the neighbourhood is
considered, and the bias decreases.
In short, when k is small, kNN has low bias and high variance, and is
vulnerable to overfitting; when k is large, kNN has high bias and low
variance, and is vulnerable to underfitting.
Commonly-used values of k include 1, 3, 10, and 20. Some people
suggest evaluating different values of k based on the dataset size: for
a dataset of n examples, consider values ranging from 1 up to the
square root of n. This suggestion is implemented in the following code
using a grid search.
import math
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
knn = KNeighborsRegressor() #1
pipe = Pipeline([("scaler", StandardScaler()), ("knn", knn)]) #2
max_k = math.ceil(math.sqrt(len(X))) #3
param_grid = {"knn__n_neighbors": range(1, max_k+1)} #4
grid = GridSearchCV(pipe, param_grid) #5
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Output:
Pipeline(steps=[('scaler', StandardScaler()),
('knn', KNeighborsRegressor(n_neighbors=3))])
0.7652257998858663
{'knn__n_neighbors': 3}
Part of the code is similar to earlier examples of grid search, but there
is a significant difference — the use of a pipeline in grid search. In
this case, we need a pipeline whose components we can refer to by
explicit names. The make_pipeline() function we used before generates
component names automatically, whereas the Pipeline class lets us
choose them. In the code above, after creating a
KNeighborsRegressor (#1), we build a Pipeline of a StandardScaler and
a KNeighborsRegressor (#2). Specifically, the argument of Pipeline()
is a list of two name-object tuples, each containing the name and
transform/model object of a pipeline component. The names "scaler"
and "knn" can be used later to refer to these pipeline components.
After building the pipeline, we compute the maximum k value, which
is the ceiling of the square root of the number of examples (#3). Next,
we establish the grid parameter for evaluating different values for "n_
neighbors" of KNeighborsRegressor (#4). Since the KNeighborsRegressor
is inside a pipeline, the grid parameter name is specified as the pipeline
component name ("knn"), followed by two underscores ("__"), and
then the name of the component’s hyperparameter ("n_neighbors"); so,
the grid parameter name is "knn__n_neighbors". The grid parameter
candidate values are the range from 1 to max_k (#4). The grid parameter,
called param_grid, is passed to GridSearchCV for carrying out the grid
search (#5).
When I ran the code, the best knn__n_neighbors value was 3. The code
below plots the mean scores versus the numbers of neighbours. From
the cross-validation results in the grid.cv_results_ dictionary, the
numbers of neighbours can be obtained using the key param_knn__n_
neighbors, and the mean scores using the key mean_test_score.
import matplotlib.pyplot as plt
ns_neighbors = grid.cv_results_["param_knn__n_neighbors"]
mean_scores = grid.cv_results_["mean_test_score"]
plt.plot(ns_neighbors, mean_scores, "o-")
Output: (a plot of the mean scores against the number of neighbours)
Now complete the activity below to perform a grid search on a kNN
classifier.
Activity 4.26
Modify the code below to find the best n_neighbors for using a
KNeighborsClassifier to classify the iris dataset (instead of using a
KNeighborsRegressor and the Boston dataset).
# Modify code below
import math
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
knn = KNeighborsRegressor()
pipe = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
max_k = math.ceil(math.sqrt(len(X)))
param_grid = {"knn__n_neighbors": range(1, max_k+1)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
import matplotlib.pyplot as plt
ns_neighbors = grid.cv_results_["param_knn__n_neighbors"]
mean_scores = grid.cv_results_["mean_test_score"]
plt.plot(ns_neighbors, mean_scores, "o-")
Feedback is provided for this activity.
Distance metrics
In the above discussion, we talked a lot about the nearest neighbours
(examples), but didn’t mention how to measure the closeness, or
distance, between two examples. In fact, there are many ways of
measuring distances. You can try different distance metrics in your
kNN solution, and apply a grid search to evaluate them as we did to the
numbers of neighbours above.
The KNeighborsClassifier and KNeighborsRegressor classes, by default,
use the Euclidean distance, which is appropriate for most applications.
The Euclidean distance and some other commonly-used distance
measures are described below. In this context, the distance between
two examples (instances) is the distance between their feature vectors.
The x vector is [x1, x2, …, xn] and the y vector is [y1, y2, …, yn].
Euclidean distance
In 2D, the Euclidean distance between two vectors is the straight-line
distance between them.
The general form of the Euclidean distance between vectors x and y is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

The Euclidean distance is illustrated at the top-left of the following
figure.
Figure 4.20 Distance metrics: Euclidean distance (top-left), Manhattan distance
(top-right), Chebyshev distance (bottom-left), and cosine distance
(bottom-right)
Manhattan distance
In a grid layout of city buildings, such as the island of Manhattan,
the shortest distance between two places is the sum of the east–west
distance and the north–south distance. This is called the Manhattan
distance, and is illustrated at the top-right of the above figure.
The general form of the Manhattan distance between vectors x and y is:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Minkowski distance
The Minkowski distance is a generalization of both the Euclidean
distance and the Manhattan distance, with an order parameter called p.
The Minkowski distance between vectors x and y is:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
The Minkowski distance becomes the Euclidean distance when p is
2, the Manhattan distance when p is 1, and the Chebyshev distance
(described below) when p is infinity.
Chebyshev distance
In 2D, the Chebyshev distance between two vectors is the maximum of
their horizontal difference and vertical difference. It is illustrated
at the bottom-left of Figure 4.20. The Chebyshev distance is a special
case of the Minkowski distance when p is infinity.

The general form of the Chebyshev distance between vectors x and y is:

$$d(x, y) = \max_{i} |x_i - y_i|$$
Cosine distance
Cosine distance is derived from the cosine similarity, where the cosine
similarity is the cosine of the angle between two vectors; a common
definition of the cosine distance is one minus the cosine similarity.
Cosine distance is illustrated at the bottom-right of the above figure.

The cosine similarity of vectors x and y is:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
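The following sketch computes these distance measures with NumPy for
trial product #1 (1, 4) and the query point (5, 5) from the earlier
worked example:
import numpy as np
x = np.array([1.0, 4.0])   # trial product #1
y = np.array([5.0, 5.0])   # the query point
diff = np.abs(x - y)
print("Euclidean:", np.sqrt((diff ** 2).sum()))          # about 4.12
print("Manhattan:", diff.sum())                          # 5.0
print("Chebyshev:", diff.max())                          # 4.0
p = 3
print("Minkowski (p=3):", (diff ** p).sum() ** (1 / p))  # about 4.02
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print("Cosine similarity:", cos_sim)                     # about 0.86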
By default, the scikit-learn classes KNeighborsClassifier and
KNeighborsRegressor use the metric argument of "minkowski" and the p
argument of 2, i.e. the Euclidean distance. You may use another distance
by specifying the metric argument. For a list of supported distance
metrics, see https://scikit-learn.org/stable/modules/generated/sklearn.
neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric.
Let’s conclude this section by addressing the key strengths and
weaknesses of kNN. On the plus side, kNN is generally effective and
accurate if the training data size is large. In addition, it is unsusceptible
to noise in the training data because the nearest neighbours are used,
and distant outliers are naturally ignored. Finally, kNN is ready for
online learning when example data are added during production. On the
minus side, it may be difficult to determine the optimal distance metric
and features to use, which require domain-specific knowledge of the
problem. Another potential problem is the high computation cost for
calculating distances from all example data to determine the nearest
neighbours.
To check your understanding of kNN discussed in this section, work
through the following self-test. Suggested answers can be found at the
end of the unit.
Self-test 4.5
1 How does the k-nearest neighbours (kNN) algorithm predict a
numerical label and categorical label from the k nearest neighbours?
2 Why is kNN described as a lazy learner?
3 What is the difference between a parametric method and a non-
parametric method?
4 Which argument of the KNeighborsRegressor and
KNeighborsClassifier classes designates the number of nearest
neighbours used in kNN? What is the default value of the argument?
5 Consider two kNN implementations, one with a small value of k and
the other with a large value of k. Which of the two implementations
is more vulnerable to underfitting? Which of them is more
vulnerable to overfitting?
6 Given two vectors (1, 1) and (4, 5), compute their (a) Euclidean
distance, (b) Manhattan distance, and (c) Chebyshev distance.
7 What is the default distance metric used by the KNeighborsRegressor
and KNeighborsClassifier classes?
The k-nearest neighbours method is another simple machine learning
algorithm that is applicable to some prediction problems. The method
works well in situations in which there are sufficient known labelled
examples. It doesn’t learn in order to build a model with parameters, but
rather uses known examples (instances) to make predictions; as such, it
is lazy, non-parametric, and instance-based.
If you want to learn more about k-nearest neighbours, watch this video:
• StatQuest: K-nearest neighbours, clearly explained: https://www.
youtube.com/watch?v=HVXime0nQeI
In the next section, you’ll learn another simple classification technique
— Naive Bayes.
Naive Bayes
Naive Bayes is a family of classifiers that use probabilities from
example training data to make predictions. It is called ‘naive’ because
it assumes that all features are independent of each other when they are
used for predicting targets (labels). Although this assumption does not
hold strictly in most scenarios, Naive Bayes classifiers perform well in
many applications, such as spam detectors and document classifiers.
This section discusses how Naive Bayes works, the key types of Naive
Bayes classifiers, and their implementations.
How Naive Bayes works
Naive Bayes makes use of Bayes’ theorem, which you learned about in
Unit 3. The theorem is reviewed below, with a discussion of how it is
used for classification. For features X and target Y, Bayes' theorem states:

P(Y | X) = \frac{P(X | Y) \, P(Y)}{P(X)}
Here, P(Y | X) is called the posterior probability of the target (class)
given the features. P(Y) is the prior probability of the target (class).
P(X | Y) is the likelihood or probability of the features given the target.
P(X) is the evidence, or prior probability of the features.
If the features X, denoted as X1, X2, …, Xn, are independent, Bayes'
theorem becomes:

P(Y | X_1, X_2, \ldots, X_n) = \frac{P(Y) \prod_{i=1}^{n} P(X_i | Y)}{P(X_1, X_2, \ldots, X_n)}
How is this theorem useful for classification? The values on the right
side of the equation can be calculated from the statistics of the example
data. Using these values, the posterior probability of each class (on the
left side of the equation) can be obtained. The class with the highest
posterior probability is the outcome of the prediction.
A worked example
Let’s go through a worked example that involves predicting whether or
not players will play tennis in different weather conditions. The training
example data contain 14 days of data (14 examples): on five sunny days,
they played on three of the five days; on four cloudy days, they played
on all four days; on five rainy days, they played on two of the five days.
The following table summarizes these example data.
Table 4.4 Example data of playing tennis and weather conditions (frequencies)
Weather Playing tennis Not playing tennis
Sunny 3 2
Cloudy 4 0
Rainy 2 3
The weather condition is the feature X, and whether or not to play
tennis is the target Y. Using the numbers from the above table, we can
calculate the various values of the probabilities P(X|Y), P(Y), and P(X)
(which appear on the right side of the Bayes’ theorem equation). The
calculated values are shown in the tables below.
Table 4.5 Example data of playing tennis and weather conditions (probabilities)
Weather Playing tennis Not playing tennis
Sunny P(Sunny|Play) = 3/9 P(Sunny|Not play) = 2/5 P(Sunny) = 5/14
Cloudy P(Cloudy|Play) = 4/9 P(Cloudy|Not play) = 0/5 P(Cloudy) = 4/14
Rainy P(Rainy|Play) = 2/9 P(Rainy|Not play) = 3/5 P(Rainy) = 5/14
P(Play) = 9/14 P(Not play) = 5/14
With these probability values, we can apply Bayes’ theorem to make
predictions. Let’s determine whether players will play if the weather is
sunny:
• P(Play|Sunny) = P(Sunny|Play) × P(Play) / P(Sunny)
= 3/9 × 9/14 / 5/14 = 0.60
• P(Not play|Sunny) = P(Sunny|Not play) × P(Not play) / P(Sunny)
= 2/5 × 5/14 / 5/14 = 0.40
Since the posterior probability P(Play|Sunny) is higher, the class ‘Play’
is predicted, i.e. we predict that players will play tennis if the weather is
sunny.
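As a quick check of this arithmetic, the following minimal sketch (plain Python; the variable names are ours) reproduces the two posterior probabilities for sunny weather:

# Probabilities read off Table 4.5
p_sunny_given_play = 3 / 9
p_sunny_given_not = 2 / 5
p_play, p_not = 9 / 14, 5 / 14
p_sunny = 5 / 14

print(p_sunny_given_play * p_play / p_sunny)  # P(Play|Sunny) = 0.60
print(p_sunny_given_not * p_not / p_sunny)    # P(Not play|Sunny) = 0.40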
Now complete the following activity to compute the other posterior
probabilities and make other predictions.
Activity 4.27
Using the above data, predict whether players will play tennis if the
weather is cloudy and if the weather is rainy by computing the relevant
posterior probabilities.
Feedback is provided for this activity.
Next we’ll discuss three Naive Bayes algorithms and their
implementations using scikit-learn. In the process, you’ll also learn
some machine learning skills related to preprocessing and feature
encoding.
Bernoulli Naive Bayes
The Bernoulli Naive Bayes classifier works with binary features, i.e.
every feature takes on the value 0 or 1 (or true or false, etc.). The name
originates from the Bernoulli distribution, the probability distribution in
probability theory and statistics of a binary random variable (i.e. one
taking on values of either 0 or 1). In scikit-learn, Bernoulli
Naive Bayes is implemented by the BernoulliNB class of the sklearn.
naive_bayes module.
One-hot encoding
A frequently-used technique for converting, or encoding, a categorical
feature into binary features is one-hot encoding, which was discussed in
the Unit 3 section ‘Binarization and one-hot encoding’.
The scikit-learn library supplies a convenience class, called
OneHotEncoder, to perform one-hot encoding. The basic use of this class
is demonstrated in the code below. The fit() method of OneHotEncoder
accepts a parameter that is a 2D array of categorical and other features,
and devises a one-hot encoding scheme for the features. In the code,
we pass the weather feature categories — ‘Sunny’, ‘Cloudy’, and
‘Rainy’ — to the fit() method. The fourth value, "Cloudy", is here to
show that the fit() method automatically handles or ignores duplicate
values — in real use we often pass all values of features to the method
without eliminating the duplication ourselves. To try out the encoding,
we invoke the transform() method; the results show that the feature is
encoded as three binary features: ‘Sunny’ as (0, 0, 1), ‘Cloudy’ as (1, 0,
0), and ‘Rainy’ as (0, 1, 0). The categories_ attribute of OneHotEncoder
contains an array of the categorical values, whose order matches the
binary features; e.g. the 'Cloudy' category appears first in categories_,
so its '1' is in the first position and it is encoded as (1, 0, 0).
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
onehot.fit([["Sunny"], ["Cloudy"], ["Rainy"], ["Cloudy"]])
print(onehot.transform([["Sunny"], ["Cloudy"], ["Rainy"]]).toarray())
print(onehot.categories_)
Output:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
[array(['Cloudy', 'Rainy', 'Sunny'], dtype=object)]
Instead of calling fit() and transform() manually, we can simply pass
a OneHotEncoder object to a scikit-learn pipeline to process the features.
You will see how this is done shortly.
Using the BernoulliNB class
The code below uses the BernoulliNB class to implement the preceding
worked example of playing tennis according to the weather conditions.
The training examples are first specified in a 2D array, in which each
entry contains a string specifying the weather conditions (‘Sunny’,
‘Cloudy’, or ‘Rainy’) and an integer of 0 (for not playing tennis) and
1 (for playing tennis). Next, the features are retrieved in a 2D array
called X, while the labels are placed in a 1D array called y. A pipeline is then built
from a OneHotEncoder and a BernoulliNB, and trained, or fitted, using
the X and y training examples. To use the BernoulliNB classifier, we
call the predict() method with ‘Sunny’, ‘Cloudy’, and ‘Rainy’. The
predicted results are three integers — (1, 1, 0) — indicating playing on
a sunny day, playing on a cloudy day and not playing on a rainy day
respectively. To retrieve the posterior probabilities of the classes — not
playing (0) and playing (1) — we use the predict_proba() method.11
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
data = np.array([["Sunny", 1],
["Sunny", 1],
["Sunny", 1],
["Sunny", 0],
["Sunny", 0],
["Cloudy", 1],
["Cloudy", 1],
["Cloudy", 1],
["Cloudy", 1],
["Rainy", 1],
["Rainy", 1],
["Rainy", 0],
["Rainy", 0],
["Rainy", 0]])
X = data[:, [0]]
y = data[:, 1]
nb = make_pipeline(OneHotEncoder(), BernoulliNB())
nb.fit(X, y)
predicted = nb.predict([["Sunny"], ["Cloudy"], ["Rainy"]])
proba = nb.predict_proba([["Sunny"], ["Cloudy"], ["Rainy"]])
print(predicted)
print(proba)
11 The posterior probabilities returned by predict_proba() differ slightly
from those we calculated earlier because BernoulliNB applies additive
(Laplace) smoothing by default. See
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.
BernoulliNB.html#sklearn.naive_bayes.BernoulliNB.
Output:
['1' '1' '0']
[[0.37746016 0.62253984]
[0.08457775 0.91542225]
[0.62157413 0.37842587]]
Multinomial Naive Bayes
The multinomial Naive Bayes classifier works with numerical
features that are discrete or count-like, such as word counts for text
classification. Such features are modelled by multinomial distributions, and
typically contain integer counts, or fractional counts such as tf-idf values
(described shortly).
Multinomial Naive Bayes is a classic technique used in text
classification, of which an example will be presented in this section.
In scikit-learn, multinomial Naive Bayes is implemented by the
MultinomialNB class. Before looking into the example that uses a
MultinomialNB classifier, we’ll discuss two things — getting a text
dataset, and converting text data to multinomial features.
The 20 newsgroups text dataset
The scikit-learn library provides the 20 newsgroups text dataset, which
contains about 18,000 messages in 20 newsgroups. To load the dataset,
invoke the fetch_20newsgroups() function, as shown in the following
code. This function returns a bunch object with these attributes (among
others): a data list of text messages, a target array of newsgroup
index, and a target_names list of newsgroup names. By default,
fetch_20newsgroups() returns the training set of 11,314 messages.
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups()
print(type(newsgroups.data), len(newsgroups.data))
print(type(newsgroups.target), len(newsgroups.target))
display(newsgroups.target_names)
Output:
<class 'list'> 11314
<class 'numpy.ndarray'> 11314
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
The code below displays the first message with index 0 (newsgroups.
data[0]).
print(newsgroups.data[0])
Output:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In
addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
This code displays the index (newsgroups.target[0]) and name
(newsgroups.target_names[...]) of the first message’s newsgroup.
target = newsgroups.target[0]
print(target, newsgroups.target_names[target])
Output:
7 rec.autos
Term frequency-inverse document frequency (tf-idf)
To use text data or documents in machine learning, we need to
transform them to features that can be processed by the algorithms.
There are different methods of transformation; a commonly-used one is
term frequency-inverse document frequency (tf-idf).
Tf-idf is used for determining how important a word, or term, is to a
document in a corpus (i.e. a collection of documents). The importance
is calculated as the term frequency multiplied by the inverse document
frequency. The two components are explained as follows:
• Term frequency refers to how many times a term appears in a
document. The more times it appears, the more important it is.
• Document frequency refers to how many documents the term
appears in. The more documents it appears in, the less important it
is. For example, the words ‘I’, ‘the’, ‘a’ and ‘in’ appear in nearly all
documents, but these common words are not important. The inverse
of the document frequency is therefore taken as a measure of the term's importance.
A text document is transformed to the tf-idf values of the terms, or
words, it contains. These tf-idf numerical values are the features that
machine learning algorithms deal with.
The scikit-learn library supplies the TfidfVectorizer class for the
complex calculations of the tf-idf values (or counts) of documents. The
class is very easy to use — just put it in a pipeline! This is shown in the
next example.
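Before moving to the full example, here is a minimal sketch of TfidfVectorizer on a made-up three-document corpus, showing how each document becomes a row of tf-idf feature values:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix: 3 documents x n terms

print(sorted(vectorizer.vocabulary_))  # the terms found in the corpus
print(X.shape)
print(X.toarray().round(2))            # tf-idf value of each term per document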
Using the MultinomialNB class
The example below uses the MultinomialNB and TfidfVectorizer classes
to process the 20 newsgroups text dataset. We first define an array
of four newsgroups for fetching messages, and pass the array as the
categories keyword argument to the fetch_20newsgroups() function.
The subset='all' argument to fetch_20newsgroups() designates loading
both the training and test sets of the four newsgroups. Next, we split
the loaded messages into the training and test sets. After that, a pipeline
is built from a TfidfVectorizer and a MultinomialNB. The MultinomialNB
classifier is then fitted, and its score on the test set is obtained.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
categories = ['comp.graphics', 'rec.autos', 'sci.electronics', 'sci.space']
newsgroups = fetch_20newsgroups(categories=categories, subset='all')
X_train, X_test, y_train, y_test = train_test_split(
newsgroups.data, newsgroups.target, test_size=0.3)
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
Output:
0.9373412362404742
It is interesting to compose new messages and use the trained
multinomial Naive Bayes classifier to predict their newsgroups. As
shown in the code segments below, the classifier performs quite well: it
appears to ‘understand’ the messages!
label = nb.predict(["I brought a new Intel display card."])
print(categories[label[0]])
Output:
comp.graphics
label = nb.predict(["Love speed driving!"])
print(categories[label[0]])
Output:
rec.autos
label = nb.predict(["Motorola versus Intel!"])
print(categories[label[0]])
Output:
sci.electronics
Gaussian Naive Bayes
The Gaussian Naive Bayes classifier works with continuous features.
The features are assumed to be in Gaussian distributions, from which
the likelihoods of the features (i.e. probabilities of the features given
the target, P(Xi | Y)) are calculated. Even though the assumption may
not hold, the Gaussian Naive Bayes approach works very well in many
machine learning problems where the features are not Gaussian at all.
In scikit-learn, Gaussian Naive Bayes is implemented by the GaussianNB
class. The use of this class is very similar to that of other classifiers,
such as the LogisticRegression and KNeighborsClassifier.
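For instance, the following minimal sketch (with made-up continuous data) fits a GaussianNB model and obtains predicted classes and posterior probabilities, just like the other classifiers you have used:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: one continuous feature, two classes
X = np.array([[1.0], [1.2], [0.9], [3.0], [3.2], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[1.1], [2.9]]))   # expected: [0 1]
print(clf.predict_proba([[2.0]]))    # posterior probabilities of both classes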
Now work through the following activity to try out the GaussianNB class.
Activity 4.28
Write code below to use cross-validation and compute the score of
applying the GaussianNB classifier to the iris dataset. (Hint: use the
cross_val_score() function and, if necessary, refer to the examples in
the earlier ‘Cross-validation’ section.)
# Write code below
from sklearn.naive_bayes import GaussianNB
Feedback is provided for this activity.
Here is a summary of the strengths and weaknesses of Naive Bayes.
Naive Bayes classifiers are fast and easy to use. When the assumption
of feature independence holds, Naive Bayes performs well in multiclass
predictions compared to other models such as logistic regression. On the
other hand, Naive Bayes does not work with a feature value that does
not occur in the training set: the calculated probability is zero, and no
prediction can be made. This problem, often known as 'zero frequency',
can be solved by using smoothing techniques (e.g. Laplace smoothing).
Another issue related to Naive Bayes is the assumption of independent
features; ignoring the dependency of features for making predictions
may lead to suboptimal results in certain machine learning problems.
Complete the following self-test to firm up your understanding of Naive
Bayes. Suggested answers can be found at the end of the unit.
Self-test 4.6
1 What is the posterior probability in the Bayes’ theorem when it is
used in the context of classification?
2 What are three main types of Naive Bayes classifiers? What are the
distributions of features for each of the types?
3 What is a commonly-used technique for converting a categorical
feature to binary features? Which scikit-learn class implements the
technique?
4 What is a commonly-used technique for converting a text/document
feature to multinomial features (i.e. count-like values)? Which
scikit-learn class implements the technique?
5 What are the two components of tf-idf? Briefly describe them.
If you want to learn more about Naive Bayes, watch these videos:
• Naive Bayes, clearly explained: https://www.youtube.com/
watch?v=O2L2Uv9pdDA
• Gaussian Naive Bayes, clearly explained: https://www.youtube.com/
watch?v=H3EjCKtlVog
The next section discusses another machine learning algorithm —
decision trees.
Decision trees
Decision trees are tree data structures that make decisions through
conditional branching at the tree nodes. In machine learning, features
are examined in the internal nodes of a decision tree for branching to
arrive at a predicted outcome (decision) at a leaf node of the tree.
This section begins by reviewing decision trees, and then discusses
machine learning algorithms for building decision trees. After that, the
implementation of decision trees is addressed.
Using decision trees in machine learning
Let’s begin by looking at the two simple decision trees in the figure
below. The tree on the left depicts the decision of whether a student
passes a course according to the student’s assignment and examination
scores. The tree on the right depicts the prediction of an assignment
score according to a student’s understanding of the course content,
whether Google search is used, and the time spent to work on the
assignment.
Figure 4.21 Two decision trees: course result (left) and assignment score (right)
When a decision tree is used in machine learning, the internal nodes
(represented by the rectangles in the figure) of the tree act on the
features of an example, e.g. the examination score (on the tree on the
left side of the figure) and whether Google search is used (on the tree
on the right side of the figure). The predicted outcome, or label, is a
leaf node (the circles in the figure) that is reached according to the
branching of the internal nodes, e.g. the course result (on the left tree)
and the assignment score (on the right tree). In general, decision trees
can handle both categorical and numerical features, and carry out both
classification and regression tasks.
The depth of a decision tree affects the bias and variance of the tree.
A deep tree has low bias and high variance. A shallow tree, on the
other hand, has high bias and low variance. When a leaf node contains
multiple examples, the node’s predicted outcome is the most frequent
class (for classification) or the average value (for regression). When
the leaf nodes are allowed to contain many examples, the tree has high
bias and low variance. When the leaf nodes are allowed to contain few
examples, the tree has low bias and high variance.
Complete the following activity to learn more about how a decision tree
uses features to do classification.
Activity 4.29
Perform the following steps on the Machine Learning Playground
website.
1 Go to https://ml-playground.com/.
2 Set up two classes of data points on the canvas by using the orange
and purple square buttons and clicking on the canvas. If necessary,
use the red cross button for removing a data point on the canvas.
3 Click on the Decision Tree button.
4 Set the Max Tree Depth parameter to 5 if necessary. Then, train the
decision tree by clicking on the Train button. The canvas shows two
regions of predictions.
5 Modify the data points and the Max Tree Depth parameter, train
the model, and observe the two regions of predictions. Repeat
several times for different data and values of the Max Tree Depth
parameter.
6 Scroll down the page and read the brief overview of the decision
tree algorithm. Part of the overview content will be discussed later
in this unit.
Figure 4.22 An example of decision tree result on Machine Learning
Playground
As you’ve seen in the previous activity, the decision boundary of a
decision tree consists of multiple straight line segments. Each line
segment corresponds to a comparison between a feature value and a
breaking value, e.g. is the examination score at least 40? For a feature
that is plotted as the horizontal axis, a condition on the feature is
denoted by a vertical line segment that separates, or breaks, different
horizontal values of the feature. Similarly, a feature of the vertical axis
is broken by a horizontal line segment. Note that a decision tree may test
the same numerical feature in multiple conditions, producing multiple line
segments in the decision boundary.
While it is simple to use decision trees, in the study of machine learning
we are more interested in how to build them, which is discussed next.
Decision tree learning algorithms
There are different ways of building, or learning, a decision tree from
a training set of example data. Three widely-known algorithms for
learning decision trees are ID3, C4.5, and CART.
ID3
ID3 (Iterative Dichotomiser 3) is a classic algorithm for building
decision trees in machine learning. Though superseded by its successors
in practice, ID3 is relatively easy to understand, and demonstrates
the key mechanisms of building decision trees shared by most other
algorithms.
Entropy, information gain, and the algorithm
Two important components of ID3 are entropy and information gain.
Recall in Unit 3 that you learned about entropy, which is a measure of
the information, or uncertainty, contained in a variable. The entropy of a
variable X with n possible outcomes xi of probabilities P(xi) is:

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)
Information gain is the difference between the total amount of entropy
before and after splitting a set of examples on an attribute (feature).
In a decision tree, the examples are denoted by a parent node before
splitting, and by multiple child nodes after splitting. The information
gain of splitting on an attribute A with m possible attribute values ai,
where P(ai) is the proportion of examples taking the value ai, is:

Gain(A) = Entropy(S) - \sum_{i=1}^{m} P(a_i) \, Entropy(S_{a_i})

where S denotes the examples at the parent node and S_{a_i} the subset of
examples with attribute value ai.
The calculation of entropy and information gain will be clearly
demonstrated when we go through a worked example shortly. Before
that, the ID3 algorithm is presented as follows:
1 Calculate the entropy of the set of examples.
2 Find the attribute that best splits the examples. To do so, calculate
the information gain for splitting on each of the attributes, and select
the attribute that gives the largest information gain.
3 Add a node to the decision tree for splitting on the selected attribute
of step 2.
4 Recurse on each of the split subsets of examples (i.e. go to step 1)
using the remaining attributes, until a subset has a single class or
until no attributes are left.
A worked example
The ID3 algorithm is illustrated in a worked example. The dataset
below contains 11 examples of whether players play tennis (the label)
according to particular weather conditions (the features, or attributes).
Table 4.6 ID3 worked example dataset
Example # Weather Humidity Wind Play tennis
#1 Sunny High Strong Not play
#2 Sunny High Weak Play
#3 Sunny Low Strong Not play
#4 Sunny Low Weak Play
#5 Cloudy High Strong Not play
#6 Cloudy High Weak Play
#7 Cloudy Low Strong Not play
#8 Rainy High Strong Not play
#9 Rainy High Weak Not play
#10 Rainy Low Strong Not play
#11 Rainy Low Weak Not play
Iteration #1
The dataset contains 11 examples, of which eight have the label ‘Not
play’ and three have ‘Play’. The entropy is:
Entropy(All) = −8/11 log(8/11) − 3/11 log(3/11) = 0.845
In this iteration, the examples may be split on three candidate attributes
— weather, humidity, and wind.
Iteration #1: Splitting on weather
The following table shows the 11 examples grouped by weather —
Sunny, Cloudy, and Rainy. (It is also possible to group the three subsets
into three separate tables.)
Table 4.7 ID3 worked example, iteration #1 splitting on weather
Example # Weather Humidity Wind Play tennis
#1 Sunny High Strong Not play
#2 Sunny High Weak Play
#3 Sunny Low Strong Not play
#4 Sunny Low Weak Play
#5 Cloudy High Strong Not play
#6 Cloudy High Weak Play
#7 Cloudy Low Strong Not play
#8 Rainy High Strong Not play
#9 Rainy High Weak Not play
#10 Rainy Low Strong Not play
#11 Rainy Low Weak Not play
The Sunny subset contains four examples, of which two have the label
‘Not play’ and two have ‘Play’. The entropy is:
Entropy(Sunny) = −2/4 log(2/4) − 2/4 log(2/4) = 1
The Cloudy subset contains three examples, of which two have the label
‘Not play’ and one has ‘Play’. The entropy is:
Entropy(Cloudy) = −2/3 log(2/3) − 1/3 log(1/3) = 0.918
The Rainy subset contains four examples, all of which have the label
‘Not play’. The entropy is:
Entropy(Rainy) = −4/4 log(4/4) − 0/4 log(0/4) = 0
Note that in the last calculation, the first term 4/4 × log(4/4) is 0; the
second term 0/4 × log(0/4) equals 0 × (−infinity), which is regarded as 0 in
entropy calculations.
The three entropy values are used to calculate the information gain:
Gain(Weather) = 0.845 − (4/11 × 1 + 3/11 × 0.918 + 4/11 × 0) = 0.231
Iteration #1: Splitting on humidity
The following table shows the 11 examples grouped by humidity —
High and Low.
Table 4.8 ID3 worked example, iteration #1 splitting on humidity
Example # Weather Humidity Wind Play tennis
#1 Sunny High Strong Not play
#2 Sunny High Weak Play
#5 Cloudy High Strong Not play
#6 Cloudy High Weak Play
#8 Rainy High Strong Not play
#9 Rainy High Weak Not play
#3 Sunny Low Strong Not play
#4 Sunny Low Weak Play
#7 Cloudy Low Strong Not play
#10 Rainy Low Strong Not play
#11 Rainy Low Weak Not play
The High-humidity subset contains six examples, of which four have
the label ‘Not play’ and two have ‘Play’. The entropy is:
Entropy(High) = −4/6 log(4/6) − 2/6 log(2/6) = 0.918
The Low-humidity subset contains five examples, of which four have
the label ‘Not play’ and one has ‘Play’. The entropy is:
Entropy(Low) = −4/5 log(4/5) − 1/5 log(1/5) = 0.722
Using these entropy values, the information gain is calculated as:
Gain(Humidity) = 0.845 − (6/11 × 0.918 + 5/11 × 0.722) = 0.016
Iteration #1: Splitting on wind
The following table shows the 11 examples grouped by wind — Strong
and Weak.
Table 4.9 ID3 worked example, iteration #1 splitting on wind
Example # Weather Humidity Wind Play tennis
#1 Sunny High Strong Not play
#3 Sunny Low Strong Not play
#5 Cloudy High Strong Not play
#8 Rainy High Strong Not play
#7 Cloudy Low Strong Not play
#10 Rainy Low Strong Not play
#2 Sunny High Weak Play
#4 Sunny Low Weak Play
#6 Cloudy High Weak Play
#9 Rainy High Weak Not play
#11 Rainy Low Weak Not play
The Strong-wind subset contains six examples, all of which have the
label ‘Not play’. The entropy is:
Entropy(Strong) = −6/6 log(6/6) − 0/6 log(0/6) = 0
The Weak-wind subset contains five examples, of which two have the
label ‘Not play’ and three have ‘Play’. The entropy is:
Entropy(Weak) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971
Using these entropy values, the information gain is calculated as:
Gain(Wind) = 0.845 − (6/11 × 0 + 5/11 × 0.971) = 0.404
Iteration #1: Conclusion
Among Gain(Weather) = 0.231, Gain(Humidity) = 0.016, and
Gain(Wind) = 0.404, Gain(Wind) is the largest. Therefore, the wind
attribute is selected to be processed by a new node in the decision tree,
as shown in the following figure.
Figure 4.23 ID3 worked example, the decision tree after adding the wind node
The new node has two branches, or children, corresponding to Weak-
wind and Strong-wind. Since the Weak-wind examples have both labels
‘Not play’ and ‘Play’, further work is required to deal with them in the
Weak-wind branch (in iteration #2 below). On the other hand, all the
Strong-wind examples have the label ‘Not play’, so the Strong-wind
branch leads to a leaf node of the predicted outcome ‘Not play’.
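If you would like to verify these figures programmatically, here is a minimal sketch (NumPy-based; the helper names entropy and information_gain are ours) that reproduces the iteration #1 numbers for the wind attribute:

import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, subsets):
    """Parent entropy minus the weighted entropy of the split subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["Not play"] * 8 + ["Play"] * 3        # all 11 examples
strong = ["Not play"] * 6                       # Strong-wind subset
weak = ["Not play"] * 2 + ["Play"] * 3          # Weak-wind subset

print(round(entropy(labels), 3))                           # 0.845
print(round(information_gain(labels, [strong, weak]), 3))  # 0.404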
Iteration #2
In this iteration, we work on the Weak-wind examples, which are shown
in the following table.
Table 4.10 ID3 worked example, iteration #2 the Weak-wind examples
Example # Weather Humidity Wind Play tennis
#2 Sunny High Weak Play
#4 Sunny Low Weak Play
#6 Cloudy High Weak Play
#9 Rainy High Weak Not play
#11 Rainy Low Weak Not play
There are five examples, of which two have the label ‘Not play’ and
three have ‘Play’. The entropy is:
Entropy(Weak) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971
In this iteration, two candidate attributes for splitting the examples are
weather and humidity.
Iteration #2: Splitting on weather
The following table shows the five Weak-wind examples grouped by
weather — Sunny, Cloudy, and Rainy.
Table 4.11 ID3 worked example, iteration #2 splitting on weather
Example # Weather Humidity Wind Play tennis
#2 Sunny High Weak Play
#4 Sunny Low Weak Play
#6 Cloudy High Weak Play
#9 Rainy High Weak Not play
#11 Rainy Low Weak Not play
The Sunny subset contains two examples, of which both have the label
‘Play’. The entropy is:
Entropy(Sunny|Weak) = −2/2 log(2/2) − 0/2 log(0/2) = 0
The Cloudy subset contains one example, which has the label ‘Play’.
The entropy is:
Entropy(Cloudy|Weak) = −1/1 log(1/1) − 0/1 log(0/1) = 0
The Rainy subset contains two examples, both of which have the label
‘Not play’. The entropy is:
Entropy(Rainy|Weak) = −2/2 log(2/2) − 0/2 log(0/2) = 0
The information gain is:
Gain(Weather|Weak) = 0.971 − (2/5 × 0 + 1/5 × 0 + 2/5 × 0) = 0.971
Iteration #2: Splitting on humidity
The following table shows the five Weak-wind examples grouped by
humidity — High and Low.
Table 4.12 ID3 worked example, iteration #2 splitting on humidity
Example # Weather Humidity Wind Play tennis
#2 Sunny High Weak Play
#6 Cloudy High Weak Play
#9 Rainy High Weak Not play
#4 Sunny Low Weak Play
#11 Rainy Low Weak Not play
The High-humidity subset contains three examples, of which one has
the label ‘Not play’ and two have ‘Play’. The entropy is:
Entropy(High|Weak) = −1/3 log(1/3) − 2/3 log(2/3) = 0.918
The Low-humidity subset contains two examples, of which one has the
label ‘Not play’ and one has ‘Play’. The entropy is:
Entropy(Low|Weak) = −1/2 log(1/2) − 1/2 log(1/2) = 1
The information gain is:
Gain(Humidity|Weak) = 0.971 − (3/5 × 0.918 + 2/5 × 1) = 0.020
Iteration #2: Conclusion
Among Gain(Weather|Weak) = 0.971 and Gain(Humidity|Weak) =
0.020, Gain(Weather|Weak) is larger. Therefore, the weather attribute is
selected to be processed by a new node in the decision tree, as shown in
the figure below.
Figure 4.24 ID3 worked example — the complete decision tree
Under the new node, all Sunny and Cloudy examples have the label
‘Play’, and all Rainy examples have the label ‘Not play’. All three
branches lead to leaf nodes of the predicted outcomes, and no further
work is required. The decision tree is complete.
Overfitting of ID3
The ID3 algorithm has high variance and is vulnerable to the overfitting
problem. Consider the following dataset, which is identical to the
dataset in the preceding worked example except for an additional
attribute of the example recording date.
Table 4.13 ID3 worked example dataset with the date attribute
Example # Date Weather Humidity Wind Play tennis
#1 1 May Sunny High Strong Not play
#2 2 May Sunny High Weak Play
#3 4 May Sunny Low Strong Not play
#4 5 May Sunny Low Weak Play
#5 6 May Cloudy High Strong Not play
#6 9 May Cloudy High Weak Play
#7 11 May Cloudy Low Strong Not play
#8 12 May Rainy High Strong Not play
#9 13 May Rainy High Weak Not play
#10 15 May Rainy Low Strong Not play
#11 16 May Rainy Low Weak Not play
For the attempt to split the dataset on an attribute (i.e. iteration #1),
the amounts of information gain by splitting on weather, humidity, and
wind are the same as those in the preceding worked example, namely
0.231, 0.016, and 0.404 respectively. How about the new date attribute?
Each example has a different date value and forms its own subset, and the
entropy of each subset (e.g. the 1 May subset containing only example #1) is:
Entropy(1 May) = −1/1 log(1/1) − 0/1 log(0/1) = 0
The information gain is therefore:
Gain(Date) = 0.845 − (1/11 × 0 × 11) = 0.845
This information gain is greater than all others, and according to ID3
we add a new node to split the examples on the date. Since all examples
have different dates, the new node has 11 branches, or 11 children, each
leading to a leaf node (a predicted outcome). At this point, the decision
tree is completely built.
However, a basic understanding of the problem (domain knowledge)
reveals that the recorded dates are not useful for predicting whether
players play tennis and are actually noise in the dataset. The resulting
decision tree is seriously overfitted. The successor of ID3, C4.5,
addresses this issue and is discussed next.
C4.5
The successor of ID3 is C4.5, which has the following main
improvements:
• C4.5 solves the overfitting problem of ID3.
• C4.5 handles both categorical and numerical features, while ID3
only handles categorical ones.
• C4.5 handles missing data, i.e. incomplete examples.
The solution to the overfitting problem is explained in this section.
The C4.5 algorithm is very similar to the ID3 algorithm. One major
difference is that C4.5 uses the information gain ratio instead of
information gain.
The information gain ratio, or normalized information gain, is defined
as the information gain divided by the split information (also called
intrinsic value). The split information on an attribute A with m possible
attribute values ai, where P(ai) is the proportion of examples taking the
value ai, is:

SplitInfo(A) = -\sum_{i=1}^{m} P(a_i) \log_2 P(a_i)
A worked example
Let’s calculate the information gain ratio for the dataset discussed at the
end of the ID3 section. Consider all 11 examples in iteration #1.
For splitting on the weather attribute, Gain(Weather) = 0.231. There are
four Sunny examples, three Cloudy examples, and four Rainy examples.
The information gain ratio is:
Gain ratio(Weather) = 0.231 / (−4/11 log(4/11) − 3/11 log(3/11) − 4/11
log(4/11)) = 0.147
For splitting on the humidity attribute, Gain(Humidity) = 0.016. There
are six High-humidity examples and five Low-humidity examples. The
information gain ratio is:
Gain ratio(Humidity) = 0.016 / (−6/11 log(6/11) − 5/11 log(5/11)) =
0.016
For splitting on the wind attribute, Gain(Wind) = 0.404. There are six
Strong-wind examples and five Weak-wind examples. The information
gain ratio is:
Gain ratio(Wind) = 0.404 / (− 6/11 log(6/11) − 5/11 log(5/11)) = 0.406
For splitting on the date attribute, Gain(Date) = 0.845. All 11 examples
have distinct date values. The information gain ratio is:
Gain ratio(Date) = 0.845 / (−1/11 log(1/11) × 11) = 0.244
The information gain ratio on the wind attribute, 0.406, is the largest.
Therefore, the wind attribute is selected for the new node (compared to
the date attribute when ID3 is used). The use of the information gain ratio
favours splits that produce a few large subsets rather than many small ones,
thereby alleviating the issue of overfitting.
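Continuing the sketch from the ID3 section (again with a helper name of our own), the gain ratio for the wind attribute can be checked as follows:

import numpy as np

def split_information(subset_sizes):
    """Entropy of the split itself: -sum of P(a_i) log2 P(a_i)."""
    p = np.array(subset_sizes) / sum(subset_sizes)
    return float(-(p * np.log2(p)).sum())

# Iteration #1, wind attribute: 6 Strong-wind and 5 Weak-wind examples
gain_wind = 0.404
print(round(gain_wind / split_information([6, 5]), 3))   # about 0.406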
Now compute the information gain ratio in iteration #2 in the activity
below.
Activity 4.30
Assuming splitting on the wind attribute in iteration #1 and using
the five Weak-wind examples, calculate the information gain ratio in
iteration #2 for splitting on each of the weather, humidity, and date
attributes.
Feedback is provided for this activity.
CART
CART, or classification and regression trees, is a modern algorithm
for learning decision trees. The advantages of using CART include the
following:
• CART can handle outliers, while ID3 and C4.5 do not work well
with outliers.
• CART handles both categorical and numerical features, and missing
data, like C4.5.
• CART can perform both classification and regression tasks, as its
name suggests!
Internally, CART builds binary decision trees and uses the measure of
Gini impurity instead of entropy. The Gini impurity of a variable X with
n possible outcomes xi of probabilities P(xi) is:

Gini(X) = 1 - \sum_{i=1}^{n} P(x_i)^2
Gini impurity and entropy work equally well in most practical
applications. However, Gini impurity is slightly faster to compute than
entropy because the former does not use the log function.
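As a small illustration (with a helper of our own), the Gini impurity of the 11-example tennis dataset (8 'Not play', 3 'Play') can be computed like this:

import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1 - (p ** 2).sum())

print(round(gini(["Not play"] * 8 + ["Play"] * 3), 3))   # about 0.397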
CART is used in the scikit-learn library for building decision trees,
which are discussed next.
Implementing decision trees
The scikit-learn library supplies the classes DecisionTreeClassifier
and DecisionTreeRegressor for implementing decision trees to perform
classification and regression tasks respectively. The usage of the two
classes is essentially identical to that of other classifiers and regressors,
such as LinearRegression and LogisticRegression.
The following code shows the use of the DecisionTreeClassifier class
for classifying the iris dataset. Except for the DecisionTreeClassifier
class name, all the code has been explained and used earlier in the
unit. Note that the decision tree does not require standardization of the
numerical features.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9555555555555556
We can obtain the depth and number of leaf nodes for the decision tree
using the get_depth() and get_n_leaves() methods. These give us a
sense of the tree's complexity and of how it makes predictions.
print("Depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())
Output:
Depth: 4
Number of leaves: 6
To visualize the decision tree, we use the plot_tree() function of the
sklearn.tree module. The code segment below displays the decision
tree built from the iris dataset. The matplotlib functions figure() and
show() are optional; they are used for customizing the figure size and
avoiding textual output of plot_tree() respectively.
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(10, 10))
tree.plot_tree(clf, filled=True, fontsize=10)
plt.show()
Output: (a visualization of the fitted decision tree; each node box shows the splitting condition, Gini impurity, number of samples, and class counts)
Let’s look into the root node of the tree.12 The first line, like X[2] <=
2.45, specifies the feature being examined (X[2]) and the branching/
splitting condition (X[2] <= 2.45); this condition is used for splitting
the examples into the two child nodes. The second line, like gini =
0.665, specifies the Gini impurity of the examples involved in this node.
The third line, like samples = 105, specifies the number of examples, or
samples, involved in this node. The fourth line, like value = [33, 38,
34], specifies the numbers of examples in the three iris classes.
Now complete the activity below to try out the DecisionTreeRegressor
class for doing regression.
Activity 4.31
Write code to use the DecisionTreeRegressor class to perform regression
on the Boston dataset. Also print the depth and number of leaves of the
tree, and visualize it. (Hint: use the max_depth=4 argument of the plot_
tree() function because the tree is very large.)
Feedback is provided for this activity.
Let’s sum up this section of decision trees by discussing their strengths
and weaknesses. In general, decision trees handle both categorical and
numerical features, and work for both classification and regression
tasks. Very little or no preprocessing, e.g. no feature standardization, is
required. Decision trees are easy to interpret, easy to use and have no
complex hyperparameters to tune. They deal with (ignore) redundant
features automatically and no feature extraction is required in many
cases.
On the other hand, decision trees generally have high variance and are
vulnerable to overfitting (even though some methods can alleviate the
overfitting problem). The decision trees generated from some datasets
may be unstable, meaning that small variations in the training data may
result in very different trees. Finally, learning optimal decision trees is
technically intractable (extremely difficult or impossible), and practical
learning algorithms are suboptimal.
Now work through the self-test below to check your understanding of
decision trees. Suggested answers can be found at the end of the unit.
12 This discussion may differ from the output figure because of the randomness
in obtaining the training dataset.
Self-test 4.7
1 Three widely-known algorithms for learning decision trees are ID3,
C4.5, and CART.
a Which algorithms handle only categorical features? Which
handle both categorical and numerical features?
b Which algorithms solve only classification problems? Which
solve both classification and regression problems?
c Which of the three algorithms is the most vulnerable to
overfitting?
2 Calculate the entropy and Gini impurity for each of the following
datasets:
a ten dogs and ten cats
b ten apples, five bananas, and five oranges
3 Consider the dataset of ten dogs and ten cats. A feature splits the
dataset into two subsets: four dogs and two cats, and one dog and
three cats. Calculate the information gain and information gain ratio.
4 Which two scikit-learn classes implement decision trees for
classification and regression? Which learning algorithm do these
classes use?
If you want learn more about decision trees, watch these videos:
• StatQuest: Decision trees: https://www.youtube.com/
watch?v=7VeUPuFGJHk
• StatQuest: Decision trees, part 2 — Feature selection and missing
data: https://www.youtube.com/watch?v=wpNl-JwwplA
The next section discusses rule-based classifiers, whose learning
mechanism is related to decision trees.
Rule-based classifiers (optional)
In a decision tree, the (internal) nodes keep the conditions, or rules,
for splitting the examples and making decisions. Like a decision tree,
a rule-based classifier applies rules to classify examples based on the
example features. Unlike a decision tree, a rule-based classifier does not
store the rules in a tree data structure, but in a list. Rule-based classifiers
are not as popular as decision trees. This section introduces the basics of
rule-based classifiers for educational purposes.
In a rule-based classifier, the rules are specified in 'If
… then …' form, e.g. 'If Wind is Strong, then Not play.' To make a
prediction, the classifier goes through the list of rules one by one; when
the condition of a rule holds, the rule’s result is returned as the predicted
label (class).
Obtaining rules
One way to obtain the rules of a rule-based classifier is to use a decision
tree. Each leaf node of the decision tree is used to construct a rule. In
the path from the root to the leaf node, the conditions in the internal
nodes of the path are combined using and to form the if part of the rule,
and the content of the leaf node becomes the then part.
Let’s extract the rules of the preceding ID3 worked example of playing
tennis. The dataset and decision tree are repeated below.
Table 4.14 ID3 worked example dataset
Example # Weather Humidity Wind Play tennis
#1 Sunny High Strong Not play
#2 Sunny High Weak Play
#3 Sunny Low Strong Not play
#4 Sunny Low Weak Play
#5 Cloudy High Strong Not play
#6 Cloudy High Weak Play
#7 Cloudy Low Strong Not play
#8 Rainy High Strong Not play
#9 Rainy High Weak Not play
#10 Rainy Low Strong Not play
#11 Rainy Low Weak Not play
Figure 4.25 ID3 worked example, the complete decision tree
Considering the leaf nodes from left to right, the rules are obtained as
follows:
• If Wind is Weak and Weather is Sunny, then Play.
• If Wind is Weak and Weather is Cloudy, then Play.
• If Wind is Weak and Weather is Rainy, then Not play.
• If Wind is Strong, then Not play.
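As an illustration, these four rules translate directly into an ordered list of checks; the first matching rule wins. A minimal sketch (plain Python, function name of our own):

def predict(weather, wind):
    """Apply the extracted rules in order; the first matching rule wins."""
    if wind == "Weak" and weather == "Sunny":
        return "Play"
    if wind == "Weak" and weather == "Cloudy":
        return "Play"
    if wind == "Weak" and weather == "Rainy":
        return "Not play"
    if wind == "Strong":
        return "Not play"

print(predict("Sunny", "Weak"))    # Play
print(predict("Rainy", "Strong"))  # Not play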
We look at two special rule-based classifiers next.
ZeroR
ZeroR, or Zero Rule, is a classifier that predicts only one outcome —
the most frequent label (class) in the training data. It ignores all features
and is the simplest classification algorithm.
In the above worked example of playing tennis, the dataset contains 11
examples, of which eight examples have the label ‘Not play’ and three
have ‘Play’. Since ‘Not play’ is the most frequent label, ZeroR will
predict 'Not play' for any example.
In scikit-learn, the DummyClassifier class implements ZeroR if its
strategy argument is set to "most_frequent" or "prior". The following
code demonstrates the use of such a DummyClassifier. The make_
classification() function is called to create a dataset that contains two
classes in the proportion (weights) of 0.65 to 0.35. The DummyClassifier
predicts all outcomes as the majority class, which constitutes 0.65
(65%) of the training examples. Therefore, 0.65 of the predictions are
correct, which is confirmed by the cross-validation score in the output
of the code.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
X, y = make_classification(1000, n_classes=2, weights=(0.65, 0.35))
reg = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Output:
[0.655 0.65 0.65 0.65 0.65 ] 0.651
ZeroR is not used for making predictions in practice, but is useful as a
baseline for comparing the performance of other algorithms.
OneR
OneR, or One Rule, is a simple classifier that exploits the best feature
for making predictions. The OneR algorithm works as follows:
1 For each of the features, calculate the error when the feature is used
for making predictions.
a For each possible value of the feature:
i Find the examples with that feature value.
ii Find the most frequent label/class of those examples.
iii Create a rule with that value and class: If the feature is the
value, then return the class.
b Using the set of rules for the feature, make predictions for all the
examples and calculate the prediction error.
2 Select the feature that gives the minimum error, and use the feature’s
set of rules for making predictions.
As the OneR algorithm uses the single best feature, it is equivalent to
a decision tree with one level (i.e. with only the root node). Despite its
simplicity, OneR performs reasonably well, and is only slightly worse
than more complex algorithms in some machine learning applications.
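To see the algorithm in action, here is a minimal sketch of OneR (plain Python, with our own variable names) on the tennis dataset from Table 4.14; wind turns out to give the fewest prediction errors:

from collections import Counter

# (weather, humidity, wind, label) rows from Table 4.14
examples = [
    ("Sunny", "High", "Strong", "Not play"), ("Sunny", "High", "Weak", "Play"),
    ("Sunny", "Low", "Strong", "Not play"), ("Sunny", "Low", "Weak", "Play"),
    ("Cloudy", "High", "Strong", "Not play"), ("Cloudy", "High", "Weak", "Play"),
    ("Cloudy", "Low", "Strong", "Not play"), ("Rainy", "High", "Strong", "Not play"),
    ("Rainy", "High", "Weak", "Not play"), ("Rainy", "Low", "Strong", "Not play"),
    ("Rainy", "Low", "Weak", "Not play"),
]

for name, i in [("Weather", 0), ("Humidity", 1), ("Wind", 2)]:
    # One rule per feature value: predict the most frequent label for that value
    rules = {}
    for value in sorted(set(e[i] for e in examples)):
        labels = [e[3] for e in examples if e[i] == value]
        rules[value] = Counter(labels).most_common(1)[0][0]  # ties broken arbitrarily
    errors = sum(rules[e[i]] != e[3] for e in examples)
    print(name, "errors:", errors)   # Wind gives the minimum error (2)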
Now complete the following activity to compare the performance of
ZeroR, OneR and a decision tree.
Activity 4.32
Write code to compare the cross-validation scores of classifying the
iris dataset using ZeroR, OneR, and a decision tree. For ZeroR, use the
DummyClassifier class with strategy="most_frequent". For OneR, use
the DecisionTreeClassifier class with max_depth=1.
Feedback is provided for this activity.
Now complete the following self-test before moving on to the next
topic. Suggested answers can be found at the end of the unit.
Self-test 4.8
1 What is the main difference between rule-based classifiers and
decision trees?
2 What are ZeroR and OneR? Describe their algorithms briefly.
3 Consider the dataset of 50 apples, 30 bananas, and 20 oranges.
a What is the score (accuracy) when ZeroR is applied to classify
the dataset?
b Assume that the best feature splits, or separates, the apples from
the bananas and oranges. What is the score (accuracy) when
OneR is applied to classify the dataset?
At this point, we have discussed a number of learning algorithms.
These algorithms, such as k-nearest-neighbours, Naive Bayes and
decision trees, are relatively simple, and may not perform well in some
applications (i.e. they are weak). Multiple models from such simple
algorithms can be combined to form a composite model, or an ensemble
model, that performs better than the individual models. The next section
discusses ensemble learning methods for building such ensemble
models.
Ensemble learning methods
Ensemble learning methods combine multiple base models into an
ensemble model for making better predictions. The base models are
simple models such as k-nearest-neighbours and decision trees. Such
models are relatively easy and fast to build and use, but may be weak
and have high bias and variance for some machine learning problems.
These simple models are called weak learners. When weak learners are
combined properly, the issues of high bias and variance are alleviated,
and the resulting ensemble model is stronger and achieves higher
prediction accuracy for solving complex problems.
Three ensemble methods — bagging, random forests, and boosting —
are explained in this section.
Bagging
Bagging, short for bootstrap aggregating, is an ensemble method that
uses bootstrap samples to train multiple base models and combines, or
aggregates, the models’ outcomes as the final predicted outcome. Both
bootstrap sampling and prediction aggregation are described in the
following.
Figure 4.26 Bagging (Modified from source: https://commons.wikimedia.org/wiki/
File:Ensemble_Bagging.svg)
Bootstrapping
Bootstrapping is a sampling technique that takes samples with
replacement. There are two ways of taking samples, without and with
replacement, as shown in the figure below.
Figure 4.27 Sampling without replacement (top) and sampling with replacement
(bottom)
In sampling without replacement, a randomly-selected item is taken
from, and not replaced in, the dataset; since the item no longer exists
in the dataset, it will not be taken a second time. In sampling with
replacement, a copy of the randomly selected item is taken, or the
taken item is replaced by an equivalent item in the dataset; since the
(equivalent) item still exists in the dataset, it may be taken later, and be
taken multiple times.
Bootstrapping is sampling with replacement, and the resulting sample is
called a bootstrap sample. Different bootstrap samples of a dataset are
nearly independent, and are properly representative of the dataset. It is
generally easier to work with multiple smaller samples of a dataset than
working with the dataset as a whole.
In machine learning, a bootstrap sample can be used to train a model.
The size of the bootstrap sample is usually the same as the original
dataset, but may be smaller if the original dataset is large. The examples
not in the sample can be used as a test set for evaluating the model.
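A bootstrap sample is straightforward to draw with NumPy. This small sketch (with a fixed random seed for reproducibility) also shows the leftover 'out-of-bag' examples that can serve as a test set:

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)                                     # a tiny made-up dataset

sample = rng.choice(data, size=len(data), replace=True)  # bootstrap sample
out_of_bag = np.setdiff1d(data, sample)                  # items never drawn

print(sample)        # some items appear more than once
print(out_of_bag)    # usable as a test set for the model trained on the sample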
Prediction aggregation
In bagging, multiple bootstrap samples are taken and used for training
individual base models. When the trained base models make predictions
on a query example, their results are aggregated to a final predicted
outcome for the ensemble model.
For regression, the numerical results of the base models are averaged
to produce the outcome of the ensemble model. For example, in an
ensemble regressor, three base models predict the label of a query
example as 1.34, 1.62, and 1.65 respectively; then the final predicted
label of the ensemble regressor is (1.34 + 1.62 + 1.65) / 3 = 1.55.
For classification, two ways of prediction aggregation are hard-voting
and soft-voting. In hard-voting, the most frequent class of the base
models’ results becomes the outcome of the ensemble model. For
instance, in an ensemble classifier, three base models predict the label
of a query example as follows:
• Base model #1: class A with probability 0.55 and class B with
probability 0.45
• Base model #2: class A with probability 0.29 and class B with
probability 0.71
• Base model #3: class A with probability 0.62 and class B with
probability 0.38
Using hard-voting, the three base models predict the label of the query
example as class A, class B, and class A respectively. The outcome of
the ensemble classifier is the majority class A.
In soft-voting, the probabilities of classes are averaged over the base
models’ results and the class with the highest average probability is
the outcome of the ensemble model. Using the data from the above
ensemble classifier, the average probabilities of the predicted classes
are:
• Class A average probability: (0.55 + 0.29 + 0.62) / 3 = 0.49
• Class B average probability: (0.45 + 0.71 + 0.38) / 3 = 0.51
According to soft-voting, class B, which has a higher average
probability, is the outcome of the ensemble classifier.
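The two aggregation schemes are easy to verify with the numbers above; a minimal sketch (NumPy, with our own variable names):

import numpy as np

# Each row: one base model's predicted probabilities for (class A, class B)
probs = np.array([[0.55, 0.45],
                  [0.29, 0.71],
                  [0.62, 0.38]])

votes = probs.argmax(axis=1)        # hard votes: [0, 1, 0]
hard = np.bincount(votes).argmax()  # majority vote: class A

soft = probs.mean(axis=0).argmax()  # averages [0.49, 0.51]: class B

print("hard-voting:", "AB"[hard])
print("soft-voting:", "AB"[soft])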
Implementing bagging
The scikit-learn library supplies the BaggingClassifier and
BaggingRegressor classes in the sklearn.ensemble module for
implementing bagging for classification and regression respectively.
Some important arguments in creating objects of these classes are:
• base_estimator, defaulted to None meaning a decision tree,
designates the base learner (called a base estimator in scikit-learn).
• n_estimators, defaulted to 10, designates the number of base
learners.
• max_samples, defaulted to 1.0, designates the number of samples to draw
for each base learner, either as an int (an absolute number) or as a
float (a fraction of the dataset size).
Using BaggingClassifier
The following code shows the use of a BaggingClassifier with decision
trees as base models on the digits dataset and compares its performance
with a single decision tree.
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
bagging = BaggingClassifier()
scores = cross_val_score(bagging, X, y)
print("Bagging of trees:", scores.mean())
tree = DecisionTreeClassifier()
scores = cross_val_score(tree, X, y)
print("Tree:", scores.mean())
Output:
Bagging of trees: 0.9337790157845868
Tree: 0.8697864438254411
The score for bagging (about 0.93) is higher than that of a single
decision tree (about 0.87). By default, a BaggingClassifier uses ten
decision trees. How does the number of trees affect the performance
(score)? To find out, let’s obtain and plot the scores of bagging with
different numbers of base models. Bagging ensembles with 5, 10, 15, …
and 100 decision trees are implemented in the code segment below.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    bagging = BaggingClassifier(n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: (a plot of the mean cross-validation score against the number of decision trees)
The graph shows the trend of improvement when more decision trees
are used in a bagging ensemble. Now work through the following
activity to try out bagging of kNN base models.
Activity 4.33
Modify the code below to find the scores of using kNN base models
instead of decision trees in bagging. (Hint: use the StandardScaler
and KNeighborsClassifier classes, and the base_estimator argument of
BaggingClassifier.)
# Modify code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    bagging = BaggingClassifier(n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Feedback is provided for this activity.
Using BaggingRegressor
Let’s now apply bagging to a regression problem. The code below
shows the use of a BaggingRegressor with decision trees as base models
on the Boston dataset, and compares its performance with a single
decision tree.
from sklearn.datasets import load_boston
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
bagging = BaggingRegressor()
scores = cross_val_score(bagging, X, y)
print("Bagging of trees:", scores.mean())
tree = DecisionTreeRegressor()
scores = cross_val_score(tree, X, y)
print("Tree:", scores.mean())
Output:
Bagging of trees: 0.8516361539771291
Tree: 0.7662328577234939
The score for bagging (about 0.85) is higher than that for a single
decision tree (about 0.77). The code below finds and plots the scores
for bagging ensembles with 5, 10, 15, … and 100 decision trees.
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    bagging = BaggingRegressor(n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: a line plot of the mean cross-validation score against the number of base models.
The graph shows a trend of improvement when more decision trees
are used in bagging. Complete the following activity to find out how
bagging of kNN base models works for the Boston dataset.
Activity 4.34
Modify the code below to find the scores when using kNN models
instead of decision trees in bagging. (Hint: use the StandardScaler
and KNeighborsRegressor classes, and the base_estimator argument of
BaggingRegressor.)
# Modify code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    bagging = BaggingRegressor(n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Feedback is provided for this activity.
Random forests
Random forests are similar to bagging but differ in two ways.
First, a random forest uses only decision trees as base models, as its
name suggests. Second, each base decision tree is trained using a
random subset of the features in the dataset.
There are several benefits of using subsets of features to train the base
tree models. One benefit is that using fewer features simplifies and
speeds up the creation of the base models. Moreover, base models with
different feature subsets have low correlation, so they are less likely to
produce similar prediction errors. Upon averaging, the different errors
of individual base models tend to cancel each other out, so the base
models’ variance is reduced. This addresses, for example, the issue
of overfitting when base models are deep trees. Another benefit is the
handling of missing data: predictions can still be made by the base
models whose feature subsets do not include the features with missing
values.
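In scikit-learn's implementation, the size of the random feature subset is controlled by the max_features argument of the random forest classes; note that scikit-learn draws the subset at each split of a tree rather than once per tree. A minimal sketch:
from sklearn.ensemble import RandomForestClassifier
# Consider a random subset of sqrt(n_features) features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")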
The RandomForestClassifier and RandomForestRegressor classes of
scikit-learn supply random forest implementations for classification
and regression tasks respectively. The code below shows the use of
RandomForestClassifier and plots its performance on the digits dataset
for different numbers of base tree models.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    forest = RandomForestClassifier(n_estimators=n)
    scores = cross_val_score(forest, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: a line plot of the mean cross-validation score against the number of base models.
Work through the following activity to try using the
RandomForestRegressor class.
Activity 4.35
Modify the code below to use the RandomForestRegressor class to work
on the Boston dataset, instead of using the RandomForestClassifier
class and the digits dataset.
# Modify code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    forest = RandomForestClassifier(n_estimators=n)
    scores = cross_val_score(forest, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Feedback is provided for this activity.
Boosting
In bagging and random forests, base models are independent and may be
trained in parallel. In contrast, boosting trains base models sequentially,
improving later training based on the results of the base models that
have been trained earlier.
Figure 4.28 Boosting (Modified from source: https://commons.wikimedia.org/
wiki/File:Ensemble_Boosting.svg)
Focusing on the prediction errors of base models, boosting reduces the
bias of those models. In general, boosting works effectively with base
models that have low variance and high bias, such as shallow decision
trees. Two popular boosting methods are AdaBoost and gradient
boosting, which are explained as follows.
AdaBoost
AdaBoost, or adaptive boosting, is a very popular ensemble learning
meta-algorithm that puts more focus on difficult examples in training a
sequence of base models. Initially, all training examples have the same
weight, and they are used for training the first base model. Using the
results of this base model, we increase the weights of the examples that
are predicted incorrectly, and decrease the weights of the examples that
are predicted correctly. The examples with the adjusted weights are
then used to train the next base model. With higher weights, the difficult
examples obtain more attention, and are more likely to be predicted
correctly in subsequent base models of the ensemble. To obtain the
predicted outcome of the ensemble, the results of the base models are
combined and weighted according to the models’ scores, by averaging
or voting for regression and classification respectively.
There are different ways of implementing the weights of the examples.
Some algorithms used for the base models work with weights of training
examples out of the box. For some other algorithms, we may duplicate
the examples to increase their significance in the training process.
The AdaBoostClassifier and AdaBoostRegressor classes of scikit-learn
implement the AdaBoost algorithm. By default, these two classes use 50
shallow decision trees as base models, but we may specify them using
the base_estimator and n_estimators arguments. Another argument,
learning_rate, with values from 0.0 to 1.0, controls how much the
example weights are changed; in general, there is a trade-off between
learning_rate and n_estimators.
The code below shows how an AdaBoostClassifier works with the
digits dataset for different numbers of base models.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    boost = AdaBoostClassifier(n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: a line plot of the mean cross-validation score against the number of base models.
The scores are about 0.26, a very poor result. What's wrong? By
default, the AdaBoostClassifier class uses base models of decision trees
with one level; that is, predictions are made using only one feature,
or one pixel of an image. We need to use more pixels in making each
prediction, i.e. more levels in the decision trees. The following code
uses decision trees with at most ten levels (max_depth=10). The scores
go up to about 0.98.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    tree = DecisionTreeClassifier(max_depth=10)
    boost = AdaBoostClassifier(base_estimator=tree, n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: a line plot of the mean cross-validation score against the number of base models.
Now complete the following activity to try out the AdaBoostRegressor
class.
Activity 4.36
Modify the code below to use the AdaBoostRegressor class to apply
regression to the Boston dataset instead of using the AdaBoostClassifier
class and the digits dataset.
# Modify code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    boost = AdaBoostClassifier(n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Feedback is provided for this activity.
Gradient boosting
Gradient boosting, also known as gradient boosting machines or GBM,
is an ensemble meta-algorithm in which base models are trained using
the errors from preceding base models. Initially, the first base model
is trained using the examples in the training set. The prediction errors,
i.e. the differences between the actual labels and the predicted labels,
are calculated and used to train the second base model. The prediction
errors of the second base model are used to train the third base
model, and so on. These prediction errors are called pseudo-residuals
in gradient boosting. To make a prediction, the ensemble adds the
predicted results of the base models to produce the final predicted label.
Here is a simple example that illustrates how gradient boosting works. A
gradient boosting ensemble contains three weak base models, and each
base model happens to predict 90% of a value. Consider an example
label of 1000:
• The first base model is trained with the label 1000, and, after
training, it predicts the label as 900 (= 1000 × 90%), and the pseudo-
residual is 100 (= 1000 − 900).
• The second base model is trained with the pseudo-residual 100. It
predicts the pseudo-residual value as 90 (= 100 × 90%), and the
pseudo-residual is 10 (= 100 − 90).
• The third base model is trained with the pseudo-residual 10. It
predicts the pseudo-residual value as 9 (= 10 × 90%), and the
pseudo-residual is 1 (= 10 − 9).
The prediction of the ensemble is the sum of the models’ predicted
values, i.e. 900 + 90 + 9, or 999, which is very close to the actual label
1000 — so the ensemble makes a much better prediction than the
individual base models.
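The following minimal sketch replicates this arithmetic in plain Python; the assumption that each base model predicts 90% of its training target is, of course, artificial and for illustration only:
target = 1000
predictions = []
residual = target
for _ in range(3):            # three weak base models
    pred = 0.9 * residual     # each model happens to predict 90% of its target
    predictions.append(pred)
    residual -= pred          # pseudo-residual for training the next model
print(predictions)            # [900.0, 90.0, 9.0]
print(sum(predictions))       # 999.0, close to the actual label 1000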
Gradient boosting uses pseudo-residuals, or errors, to direct the training
of the next base model, or the next step in the algorithm. Conceptually,
it is very similar to the gradient descent algorithm, which you learned
about near the beginning of this unit. Gradient boosting also has a
learning rate parameter (a value from 0.0 to 1.0), which scales each
base model's contribution to the ensemble and hence controls how much
of the error remains for training the next base model. Typically,
decision trees are used in a gradient boosting ensemble.
The GradientBoostingClassifier and GradientBoostingRegressor
classes of scikit-learn implement gradient boosting of decision trees.
Their uses are very similar to other classifier and regressor classes. The
following code uses the GradientBoostingClassifier class to classify
the digits dataset using different numbers of decision trees.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    boost = GradientBoostingClassifier(n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output: a line plot of the mean cross-validation score against the number of base models.
Now complete the following activity to try out the
GradientBoostingRegressor class.
Activity 4.37
Modify the code below to use the GradientBoostingRegressor
class to apply regression to the Boston dataset instead of using the
GradientBoostingClassifier class and the digits dataset.
# Modify code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    boost = GradientBoostingClassifier(n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Feedback is provided for this activity.
In bagging, random forests and boosting, an ensemble contains base
models of the same kind. It is also possible to use base models of
different kinds in an ensemble — this is an ensemble method called
stacking. If you’re interested, read about the StackingClassifier and
StackingRegressor classes at https://scikit-learn.org/stable/modules/
ensemble.html#stacked-generalization.
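As a taste of stacking, here is a minimal sketch, assuming the StackingClassifier API described on that page: a decision tree and a kNN model are combined by a logistic regression final estimator on the digits dataset.
from sklearn.datasets import load_digits
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
# Base models of different kinds, combined by a logistic regression
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
scores = cross_val_score(stack, X, y, n_jobs=-1)
print("Stacking:", scores.mean())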
To conclude, bagging and random forests improve high-variance weak
learners, while boosting enhances high-bias weak learners. All of
these ensemble methods demand more execution time and resources; this
is especially true of boosting, whose base models cannot be trained in
parallel.
Now complete the self-test below to check your understanding of
ensemble methods discussed in this section. Suggested answers can be
found at the end of the unit.
Self-test 4.9
1 A box contains a red ball and a green ball. What are the possible
results of drawing two balls from the box using sampling without
replacement? What are the possible results using sampling with
replacement?
2 Is it possible to take a sample of size 200 from a dataset of size 100?
Briefly explain your answer.
3 How does bagging combine, or aggregate, the predicted results of
the base models to form the ensemble’s predicted result?
4 What are two differences between random forests and bagging?
5 In an AdaBoost ensemble, three examples with labels 5, 12, and
8 are used to train the first base model. After training, that model
predicts the labels as 5, 10, and 10. How are the three examples
adjusted for training the second base model?
6 In a gradient boosting ensemble, three examples with labels 5,
12, and 8 are used to train the first base model. After training, that
model predicts the labels as 5, 10, and 10. What data are used to
train the second base model?
7 Which of bagging, random forests, and boosting are used for
improving high-variance weak learners? Which are used for
improving high-bias weak learners?
If you want to learn more about ensemble methods, watch these videos:
• StatQuest: Random forests part 1 — Building, using and evaluating:
https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
• StatQuest: Random forests part 2 — Missing data and clustering:
https://www.youtube.com/watch?v=nyxTdL_4Q-Q
• AdaBoost, clearly explained:
https://www.youtube.com/watch?v=LsK-xG1cLYA
• Gradient boost part 1 — Regression main ideas: https://www.
youtube.com/watch?v=3CC4N4z3GJc
• Gradient boost part 3 — Classification:
https://www.youtube.com/watch?v=jxuNLH5dXCs
The next section discusses support vector machines, a relatively
complex and advanced learning algorithm.
Support vector machines
The support vector machine, or SVM, is a relatively sophisticated
learning method that performs well for many machine learning
problems. As you’ll learn in this section, the key idea behind the SVM
is seeking an optimal boundary that separates two classes of data points
or examples. Despite its roots in classification, the SVM has been
extended to solve regression problems as well.
The inner workings of SVMs involve some complicated mathematics,
which are not emphasized in this section. Instead, we’ll focus on the
core concepts and basic implementation of code. Specifically, this
section begins by discussing the concepts underlying the linear SVM.
It then describes how non-linear techniques are applied to the SVM
for solving more advanced problems, and finally demonstrates some
program implementations.
Linear SVMs
To study SVMs, let’s consider two classes of data points that are linearly
separable, which means the existence of a linear boundary (which will
be discussed shortly) that can cleanly separate the two classes of data
points. For 1D data points, the linear boundary is a point (break point)
that separates the two classes. For 2D data points, the linear boundary
is a straight line. For 3D data points, the linear boundary is a flat 2D
plane. In general, and for any dimension, such a boundary is called a
hyperplane.
Separable data and maximal margin classifiers
Although SVMs work for data of any dimension, we use 2D data in
the discussion below for simplicity and visualization purposes. The
following figure shows two classes of 2D data points that are linearly
separable.
Figure 4.29 Separating two classes of 2D data points (Modified from source:
https://commons.wikimedia.org/wiki/File:Svm_separating_
hyperplanes_(SVG).svg)
In the figure, data points for the two classes are denoted as black and
white circles respectively. The straight line H1 fails to separate the two
classes of data points and is not a valid separating boundary. Each of
the straight lines H2 and H3 separates the two classes; both of them are
valid separating boundaries — but which is better? The SVM chooses
the boundary that is most distant from the two classes of data points.
In other words, the SVM maximizes the distances from the boundary
to the closest data points of the classes. Usually, there are two or three
such data points that determine, or support, the separating boundary;
these data points are called support vectors.
In the figure above, there are three support vectors. The distance
between the support vectors in the direction across, or perpendicular to,
the separating boundary is called the margin. The margin is maximized
as the SVM determines the best separating boundary for the dataset.
To help you digest the concept of the SVM, here is an alternative
explanation: the SVM determines the widest rectangle that separates
the two classes of data points. As shown in the figure below, each of
the rectangles R1 and R2 separates the black and white circles. There
are many such rectangles, and the rectangle R2 is the widest. The SVM
selects this rectangle R2: the width of the rectangle is the maximum
margin; the data points touching the sides of the rectangle are the
support vectors (there are three in the figure); and the straight line that
bisects the rectangle across its width is the separating boundary (line).
Figure 4.30 Separating two classes by rectangles (Modified from source: https://
commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_
(SVG).svg)
For separable data, as in the above discussion, the margin is called a
hard margin, and the classifier is called a maximal margin classifier.
Inseparable data and support vector classifiers
The maximal margin classifier does not work (well) for some datasets.
Two such datasets are shown in the following figure.
Figure 4.31 A data point of one class is very close to those of another class
(top); a data point of one class is located within another class of data
points (bottom) (Modified from source: https://commons.wikimedia.
org/wiki/File:Svm_separating_hyperplanes_(SVG).svg)
Consider the dataset at the top of the figure. One black circle, an outlier,
is very close to the group of white circles. The two classes of data
points are separable, but the outlier effectively ‘pushes’ the separating
boundary close to the class of white circles. Using this boundary,
the maximal margin classifier may tend to mis-classify future query
examples that are actually white circles. For instance, consider how
the maximal margin classifier classifies, or mis-classifies, the query
point in a dotted line with a question mark. According to the separating
boundary, the classifier (mis-)classifies the query point as a black circle.
However, the query point should arguably be a white circle, as it is
closer to the majority of the white circles than to the majority of
the black circles.
Maximal margin classifiers are sensitive to extreme data points in the
training set; they have high variance and do not generalize well.
For the dataset at the bottom of the figure, a black outlier circle lies
within the group of white circles. The two classes are inseparable using
a linear boundary: the maximal margin classifier simply does not work
for this dataset.
While the hard margin of a maximal margin classifier does not work
(well) for these datasets, a soft margin does. A soft margin is a margin
that allows a (small) number of data points to lie on the wrong side
of, or violate, the separating boundary. For example, a soft margin
that tolerates one data point works for each of the above two datasets,
because the outlier black circle is ignored, and a straight line separates
the two groups of most black circles and white circles. A classifier using
such a soft margin is known as a soft margin classifier or support vector
classifier.
The number of data points allowed to violate the separating boundary is
a tuning parameter of a soft margin classifier. This parameter controls
the trade-off between the variance and bias of the classifier. In general,
when a large number of data points may violate the boundary, the
margin is large, and the classifier has high bias and low variance. When
a small number of data points may violate the boundary, the margin is
small, and the classifier has low bias and high variance.
In the above, we discussed binary classification using maximal margin
classifiers and soft margin classifiers. These classifiers can be extended
to work for multiclass classification, using either one-versus-rest (OVR)
or one-versus-one (OVO). Both OVR and OVO were explained in the
earlier section ‘Binary and multiclass classification’ under ‘Logistic
regression’. In addition, as you’ve learned, SVM methods can be
extended to work for regression tasks too.
Non-linear SVMs
Soft margin classifiers tolerate some data points that violate the linear
separating boundary. To put this another way, they work for classes
that can be separated by a nearly-linear boundary. However, not all
datasets are like that. For instance, on the left side of the figure below,
the dataset contains two classes of data points — those inside the
large center circle belong to one class, and those outside belong to the
other class. The two classes are separated by the large circle, but not
by anything like a straight line. We need something more than a soft
margin classifier to deal with this dataset — that extra ‘thing’ is a kernel
in SVM.
Figure 4.32 SVM with a kernel (Source: https://commons.wikimedia.org/wiki/
File:Kernel_trick_idea.svg)
Recall the discussion of polynomial regression earlier in this unit.
Essentially, non-linear operations are applied to the features to produce
higher-order features, which effectively turn a linear regression
technique into a non-linear solution that can deal with a non-linear
problem.
Kernels in SVM work in a similar way. Consider the above dataset of
data points inside and outside the large circle. Let’s denote the features
in the two axes as x1 and x2. We add a third feature, defined as
x3 = x1² + x2². When all three features x1, x2 and x3 are used, it turns out
that a soft margin classifier can separate the two classes using a linear
boundary (2D flat plane). To see how this works, the three features are
plotted in 3D on the right of the above figure, where the vertical axis
denotes x3, and the other two axes denote x1 and x2. The data points
inside the large circle have small values of x3 and are lower in the 3D
plot, while those outside the large circle have large values of x3 and are
higher in the 3D plot. Overall, a flat horizontal 2D plane can separate
the two classes of data points in the 3D plot.
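A minimal code sketch of this feature-lifting idea follows; the generated circular dataset and its radii are illustrative assumptions:
import numpy as np
from sklearn.svm import LinearSVC
rng = np.random.default_rng(0)
# Class 0: points inside a circle; class 1: points in a ring outside it
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 0.8, 100),    # inside
                        rng.uniform(1.2, 2.0, 100)])   # outside
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])
# With only x1 and x2, no straight line separates the classes well
print(LinearSVC().fit(X, y).score(X, y))    # noticeably below 1.0
# Adding x3 = x1^2 + x2^2 makes the classes linearly separable
X3 = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
print(LinearSVC().fit(X3, y).score(X3, y))  # 1.0, or very close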
A kernel in SVM is a general form of the operation, or function, that
derives x3 from x1 and x2 (i.e. x3 = x1² + x2²). Instead of working on an
individual data point, a kernel is applied to two data points to compute
their similarity. In addition, the kernel function should fulfil certain
mathematical requirements that allow the SVM to operate efficiently.
The details of SVM kernels are beyond the scope of this unit.
Three important and commonly-used SVM kernels are outlined:
• A linear kernel is equivalent to a support vector classifier without a
kernel, although there are differences in the implementations.
• A polynomial kernel enables a support vector classifier to work in
a higher-dimensional space using polynomial terms of the original
features. The resulting boundary is more flexible, like the circular
separating boundary in the preceding example.
• A radial kernel (also known as radial basis function kernel or RBF)
achieves similar outcomes as a polynomial kernel, but is more
powerful and flexible. Infinite-dimensional terms of the original
features are used to determine the separating boundary. In addition,
a radial kernel has the desirable characteristic that only nearby
training data points have a significant effect on predicting a label
(like kNN).
Because their main operations involve distances and margins, SVMs are
sensitive to the scales of features. Therefore, feature scaling should
be applied when implementing SVMs, as you’ll see shortly.
Now complete the following activity to explore the use of SVMs to do
classification with and without a kernel.
Activity 4.38
Perform the following steps on the Machine Learning Playground
website.
1 Go to https://ml-playground.com/.
2 Set up two classes of data points on the canvas, by using the orange
and purple square buttons and clicking on the canvas. If necessary,
use the red cross button for removing a data point on the canvas.
3 Click on the Support Vector Machine button.
4 Set the C parameter to 10, and unselect the RBF Kernel? option,
if necessary. Then, train the SVM by clicking on the Train button.
The canvas shows two regions of predictions, separated by a linear
decision boundary. The SVM without a kernel is a linear model.
5 Select the RBF Kernel? option, and click on the Train button.
The two regions of predictions are now separated by a non-linear
decision boundary. The SVM with an RBF kernel is a non-linear
model.
6 Modify the data points, the C parameter, and the RBF Kernel?
option; train the model; and observe the two regions of predictions.
Repeat several times for different data and values of the C parameter
and the RBF Kernel? option.
7 Scroll down the page and read the brief overview of the SVM
algorithm. You may skip the mathematics in the overview content.
Figure 4.33 The SVM with RBF result on Machine Learning Playground
Implementing SVMs
The scikit-learn library supplies the sklearn.svm module of classes
for implementing SVMs, including SVC, LinearSVC, SVR, and LinearSVR
(among others). The SVC and LinearSVC classes perform SVM
classification, while the SVR and LinearSVR classes perform SVM
regression.
The SVC and SVR classes have an argument kernel for specifying the
kernel of the SVMs. Three commonly-used options for kernel are
linear, poly, and rbf (default). For kernel="poly", we can specify the
polynomial degree using the degree argument (defaulted to 3).
The LinearSVC and LinearSVR classes are similar to the SVC and SVR
classes with the argument kernel="linear" but have more options for
penalties and loss functions and generally scale better for large datasets.
All four classes have a C argument for controlling the strength of
regularization. The argument is a strictly positive float value and is
defaulted to 1.0. A smaller C value designates stronger regularization to
be applied to the SVM.
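For example (the C values here are arbitrary illustrative choices):
from sklearn.svm import SVC
strongly_regularized = SVC(C=0.01)   # small C: stronger regularization,
                                     # a wider, more tolerant margin
weakly_regularized = SVC(C=100.0)    # large C: weaker regularization,
                                     # margin violations penalized heavily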
Using the LinearSVC and SVC classes
The basic usage of the LinearSVC and SVC classes is similar to that of
other classifiers. The following code shows the use of the LinearSVC
class to classify the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9333333333333333
The code below compares the cross-validation scores of LinearSVC and
SVC with the linear, poly, and rbf kernels for classifying the iris dataset.
The linear and rbf kernels perform better than the poly kernel and the
LinearSVC class.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
def svc_cv(name, svc):
    pipe = make_pipeline(StandardScaler(), svc)
    scores = cross_val_score(pipe, X, y, n_jobs=-1)
    print(f"{name}:", scores.mean())
svc_cv("LinearSVC", LinearSVC())
svc_cv("SVC, linear", SVC(kernel="linear"))
svc_cv("SVC, poly", SVC(kernel="poly"))
svc_cv("SVC, rbf", SVC(kernel="rbf"))
Output:
LinearSVC: 0.9333333333333333
SVC, linear: 0.9533333333333335
SVC, poly: 0.9200000000000002
SVC, rbf: 0.9666666666666668
Let’s look into an example that examines more details of using the
poly kernel. The code below creates a dataset containing two features
and a label that depends on the cubes (third powers) of the features.
A grid search is performed on the SVC’s C and degree parameters. The
results show the best degree parameter of 3, matching the nature of the
generated dataset.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import shuffle
X = np.random.randint(0, 100, size=(1000, 2))
random_values = np.random.normal(scale=50000, size=1000)
y = X[:, 0] ** 3 + 2 * X[:, 1] ** 3 + random_values > 100000
X, y = shuffle(X, y)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="poly"))])
param_grid = {"svc__C": np.arange(0.1, 2, 0.1),
"svc__degree": range(1, 5)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Output:
Pipeline(steps=[('scaler', StandardScaler()),
('svc', SVC(C=0.4, kernel='poly'))])
0.944
{'svc__C': 0.4, 'svc__degree': 3}
Now complete the following activity to see how the rbf kernel performs
with this dataset.
Activity 4.39
Modify the code below to use the rbf kernel instead of the poly kernel,
and perform a grid search on only the SVC’s C parameter in the same
range np.arange(0.1, 2, 0.1). How does the rbf kernel perform
compared to the poly kernel?
# Modify code below
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import shuffle
X = np.random.randint(0, 100, size=(1000, 2))
random_values = np.random.normal(scale=50000, size=1000)
y = X[:, 0] ** 3 + 2 * X[:, 1] ** 3 + random_values > 100000
X, y = shuffle(X, y)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="poly"))])
param_grid = {"svc__C": np.arange(0.1, 2, 0.1),
"svc__degree": range(1, 5)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Feedback is provided for this activity.
Using the LinearSVR and SVR classes
The basic usage of the LinearSVR and SVR classes is similar to that of
other regressors. The following code shows the use of the LinearSVR
class for carrying out regression on the Boston dataset.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
reg = make_pipeline(StandardScaler(), LinearSVR())
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.672000647060823
The code below compares the cross-validation scores of LinearSVR and
SVR with the linear, poly, and rbf kernels for carrying out regression
on the Boston dataset. The results show that the LinearSVR class and the
linear kernel performed best, the rbf kernel was intermediate, and the
poly kernel performed worst.
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR, SVR
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
def svr_cv(name, svr):
    pipe = make_pipeline(StandardScaler(), svr)
    scores = cross_val_score(pipe, X, y, n_jobs=-1)
    print(f"{name}:", scores.mean())
svr_cv("LinearSVR", LinearSVR())
svr_cv("SVR, linear", SVR(kernel="linear"))
svr_cv("SVR, poly", SVR(kernel="poly"))
svr_cv("SVR, rbf", SVR(kernel="rbf"))
Output:
LinearSVR: 0.6902298863169763
SVR, linear: 0.697851563373602
SVR, poly: 0.601543132701694
SVR, rbf: 0.6470316063824647
In the preceding two code segments, the scores for the SVM regressors
are not very high. Can we tune the parameters of the SVM regressors
to obtain better results? To find out, complete the following activity to
perform grid search using the poly and rbf kernels.
Activity 4.40
Two kernels of an SVR are the poly kernel and the rbf kernel. Write code
to load the Boston dataset, and carry out grid search and regression
using each of the two kernels. For the poly kernel, use the search
parameters of C in range(5, 101, 5) and degree in range(1, 5). For the
rbf kernel, use the search parameter of C in range(5, 101, 5).
Feedback is provided for this activity.
In this activity, you should find that the SVM regressors with tuned
(proper) parameters perform moderately better than those with default
parameters. Generally, it is important to determine (e.g. by grid search)
and use the optimal regularization and other parameters of SVM models
in many machine learning applications.
Let’s conclude our study of SVMs by summarizing their benefits and
limitations. SVMs are effective when there are many features (i.e. high
dimensional space of features). Moreover, they are memory-efficient
because the separating boundary (decision function) is determined by
a small number of support vectors. Finally, SVMs are versatile and
flexible, since they are customizable by using various types of kernels.
On the other hand, parameter tuning and kernel selection are required
and important for applying SVMs to many applications. In addition,
SVMs do not provide probabilities in classification; extra computations
are required to obtain these probabilities if needed.
Now work through the self-test below to assess your understanding of
SVMs. Suggested answers can be found at the end of the unit.
Self-test 4.10
1 What is a hard margin? Is a hard margin applicable to linearly
separable or inseparable datasets, or both?
2 What is the difference between a hard margin and a soft margin? Is a
soft margin applicable to linearly separable or inseparable datasets,
or both?
3 What are three commonly-used kinds of kernels in SVMs? Which of
them is the most flexible?
4 In scikit-learn, what are two classes that are used for implementing
SVM classifiers? What are two classes for implementing SVM
regressors?
5 What are some important parameters (arguments to the classes)
to tune when applying the SVC and SVR classes to solve machine
learning problems?
If you want to learn more about the support vector machine, watch these
videos:
• Support vector machines, clearly explained: https://www.youtube.
com/watch?v=efR1C6CvhmE
• Support vector machines part 2: The polynomial kernel: https://
www.youtube.com/watch?v=Toet3EiSFcM
• Support vector machines part 3: The radial (RBF) kernel: https://
www.youtube.com/watch?v=Qc5IyLW_hns
The next section discusses neural networks, another advanced machine
learning technique.
Neural networks
Artificial neural networks (ANNs), usually simply called neural
networks (NNs), are among the most popular and powerful methods
for building machine learning applications today. Inspired by how
animal brains work, neural networks have a long history that is full
of ups and downs, beliefs and doubts, advancement and stagnation.
Nowadays, neural networks, and in particular deep neural networks,
form the most prominent subfield of machine learning, and indeed of
artificial intelligence.
This section introduces the basics of neural networks. You’ll learn about
two types of simple neural networks — perceptrons and multilayer
perceptron networks — and how to implement them using scikit-
learn. Unit 6 Deep learning will cover more technical details of neural
networks and will discuss the more complex kind of deep neural
networks.
Perceptrons
A perceptron is the oldest and simplest neural network; multiple
perceptrons can also be connected to build more complex and powerful
neural networks. The original design of a perceptron was inspired by a
neuron of animal brains. As shown in the figure below, a neuron obtains
input signals from dendrites, transmits and processes them in the axon,
and delivers output signals to the axon terminals. Neurons are connected
and work together to form a neural network in the brain.
Figure 4.34 Neuron (Modified from source: https://commons.wikimedia.org/wiki/
File:Neuron3.svg)
Like a neuron, a perceptron obtains input values, processes them,
and delivers output values; and perceptrons are connected to form an
artificial neural network. Nevertheless, the similarity between a neuron
and a perceptron ends here. We still don’t know much about how
the neural network of a brain works, but we know, and in fact have
designed, ways to compose artificial neural networks to solve real-world
problems.
The following diagram depicts the operations of a perceptron. It takes
n input values, x1, x2, …, xn, and returns an output value y.
Figure 4.35 Perceptron
The perceptron performs two operations: a weighted summation and an
activation function. In the first operation, the n input values
xi (i = 1, 2, …, n) are summed with different weights wi, and a bias
b is added. This operation may also be expressed using the input vector
x (of n values) and the weight vector w (of n values):
z = w · x + b = w1x1 + w2x2 + … + wnxn + b
In the second operation, an activation function is applied to the output
z of the summation operation. A perceptron was originally designed
for making binary decisions (such as binary classification), and the
activation function maps the summation output to a binary value (true
or false) indicating whether or not the output is activated.
Many mathematical functions can be used as the activation function. An
example is the step function, which returns 1 if its input is positive, or 0
otherwise:
f (x) = 1 if x > 0, and f (x) = 0 otherwise
Figure 4.36 The step function
A commonly-used activation function is the logistic function (logit
function, or sigmoid function), which we’ve discussed earlier in the
section on logistic regression. The logistic function has several desirable
features for use in perceptrons and neural networks. It is smooth,
differentiable, and easy to manipulate in calculus calculations. The logistic
function is defined as follows:
f (x) = 1 / (1 + e^(−x))
Figure 4.37 The logistic function
Two other commonly-used activation functions are the hyperbolic
tangent function f (x) = tanh(x), and the rectified linear unit function
(ReLU) f (x) = max(0, x). Their plots are shown in the figure below.
Figure 4.38 The tanh function (left) and the ReLU function (right)
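To make the two operations concrete, here is a minimal NumPy sketch of a perceptron's forward computation; the input values, weights, and bias are arbitrary illustrative numbers:
import numpy as np
def step(z):
    return np.where(z > 0, 1, 0)      # the step function
def logistic(z):
    return 1 / (1 + np.exp(-z))       # the logistic (sigmoid) function
def perceptron(x, w, b, activation=step):
    z = np.dot(w, x) + b              # operation 1: weighted summation
    return activation(z)              # operation 2: activation function
x = np.array([0.5, -1.2, 3.0])        # example inputs
w = np.array([0.4, 0.3, 0.8])         # example weights
b = -1.0                              # example bias
print(perceptron(x, w, b))            # step output: 1
print(perceptron(x, w, b, logistic))  # logistic output: about 0.78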
The combination of the weighted summation and the logistic function
may remind you of logistic regression. What is the difference, then,
between logistic regression and a perceptron that uses the logistic
function as its activation function? There is indeed no difference in
principle! In fact, training a perceptron can be done the same way as a
logistic regression model, using the gradient descent algorithm we’ve
discussed near the beginning of the unit. The real power of perceptrons
is their cascading in layers to build more complex neural networks, as
you’ll learn shortly.
A perceptron can be represented as a two-layer network of nodes, as
shown in the figure below. The first layer, or input layer, contains a
number of input nodes, which do nothing except forward the values to
the second layer. The second layer, or output layer, contains one
perceptron. The perceptron is also known as the single-layer perceptron
(SLP), since its network representation has only one layer that
performs computation.
Figure 4.39 Single-layer perceptron
Now complete the following activity to explore the use of a perceptron
to do classification.
Activity 4.41
Perform the following steps on the Machine Learning Playground
website.
1 Go to https://ml-playground.com/.
2 Set up two classes of data points on the canvas by using the orange
and purple square buttons and clicking on the canvas. If necessary,
use the red cross button for removing a data point on the canvas.
3 Click on the Perceptron button.
4 Set the Max Iters parameter to 20 if necessary. Then, train the
perceptron by clicking on the Train button. The canvas shows two
regions of predictions, separated by a linear decision boundary. The
perceptron is a linear model.
5 Modify the data points and the Max Iters parameter, train the
model, and observe the two regions of predictions. Repeat several
times for different data and values of the Max Iters parameter.
6 Scroll down the page and read the brief overview of the perceptron.
Figure 4.40 The perceptron result on Machine Learning Playground
Multilayer perceptrons
A multilayer perceptron (MLP) is a neural network containing an input
layer and two or more layers of perceptrons. The following figure
depicts a multilayer perceptron network with an input layer and two
layers of perceptrons. The two layers of perceptrons are called the
hidden layer and the output layer respectively. The hidden layer is so
named because it is invisible to, or hidden from, the outside of the
neural network.
Figure 4.41 Multilayer perceptron
A multilayer perceptron has one or more hidden layers. A neural
network with two or more hidden layers is called a deep neural network;
the study of such networks is called deep learning, which you will learn
about later in the course.
A perceptron or SLP works for datasets that are linearly separable. For
datasets that are not linearly separable, we need to use an MLP. You’ll
see some examples when we discuss the implementations of neural
networks later.
Now complete the following activity to explore the use of an MLP
neural network to do classification.
Activity 4.42
Perform the following steps on the Machine Learning Playground
website.
1 Go to https://ml-playground.com/.
2 Set up two classes of data points on the canvas by using the orange
and purple square buttons and clicking on the canvas. If necessary,
use the red cross button for removing a data point on the canvas.
3 Click on the Artificial Neural Network button.
4 Use the default parameters and network structure. Train the neural
network by clicking on the Train button. The canvas shows two
regions of predictions, separated by a non-linear decision boundary.
The neural network is a non-linear model.
5 Modify the data points, the parameters (Learning rate, Max
Epochs, and Max error %), and the network structure (click on
the black button to delete a node or the white button with a + sign
to add a node; click on the + or − button under Add Layers to add
or remove a layer). Train the model and observe the two regions
of predictions. Repeat several times for different data, parameter
values, and network structures.
6 Scroll down the page and read the brief overview of the neural
network.
Figure 4.42 The neural network result on Machine Learning Playground
Feed-forward neural networks and back-
propagation
A feed-forward neural network is a network whose connections do not
form a cycle among the nodes. In other words, data are passed from the
input layer to the output layer along a path without a loop. Both SLPs
and MLPs
are feed-forward neural networks. Some neural networks are not feed-
forward, e.g. recurrent neural networks for dealing with temporal or
sequential input data; you’ll learn about such neural networks in Unit 6.
In a typical feed-forward neural network, data and partial results are
passed across adjacent layers one by one, and from input to output.
This happens in the calculations or predictions of output values during
training and deployment.
To train a neural network, or to tune the network’s weights and biases,
an algorithm called back-propagation (backprop or BP) is widely used.
The back-propagation algorithm propagates the errors (losses) from the
output layer backwards to the input layer, across and for every layer.
With the propagated errors for a layer, we can optimize the weights
and biases leading to the layer using the gradient descent algorithm (or
stochastic gradient descent, etc.). More details of back-propagation will
be covered in Unit 6.
A feed-forward neural network and the concept of back-propagation are
depicted in the following figure.
Figure 4.43 A feed-forward neural network and back-propagation
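To make the forward pass concrete, the following minimal sketch pushes one input vector through a tiny network with one hidden layer; all weights, biases, and inputs are arbitrary illustrative values:
import numpy as np
def logistic(z):
    return 1 / (1 + np.exp(-z))
# A tiny feed-forward pass: 2 inputs -> 3 hidden nodes -> 1 output
x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4],
               [0.5, 0.6]])           # hidden-layer weights (3 x 2)
b1 = np.array([0.1, 0.1, 0.1])        # hidden-layer biases
W2 = np.array([[0.7, 0.8, 0.9]])      # output-layer weights (1 x 3)
b2 = np.array([0.1])                  # output-layer bias
h = logistic(W1 @ x + b1)             # forward: input layer -> hidden layer
y = logistic(W2 @ h + b2)             # forward: hidden layer -> output layer
print(y)                              # about [0.87]
# Back-propagation would now compare y with the true label and propagate
# the error backwards through the layers to compute gradients for
# W2, b2, W1 and b1, which gradient descent then uses to update them.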
Implementing neural networks
The Perceptron, MLPClassifier, and MLPRegressor classes implement
neural networks in the scikit-learn library.
Using the Perceptron class
The Perceptron class in the sklearn.linear_model module implements
a perceptron (SLP) for classification tasks. The class is built on, and is
a particular version of, the SGDClassifier class; the two classes have
similar usage. The following code shows the use of the Perceptron class
to classify the digits dataset.
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), Perceptron())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9092592592592592
Using the MLPClassifier and MLPRegressor classes
The MLPClassifier and MLPRegressor classes in the sklearn.neural_
network module implement MLP neural networks for classification and
regression tasks respectively. Important arguments to the classes include
the following:
• hidden_layer_sizes designates the sizes of the hidden layers in a
tuple. This argument is defaulted to (100,), meaning one hidden
layer of 100 nodes. An example is (2, 5), which means two hidden
layers of 2 nodes and 5 nodes respectively.
• activation designates the activation function. The default option
is "relu"; two other commonly-used options are "logistic" and
"tanh".
• solver designates the optimization algorithm. The default option is
"adam", a modified stochastic gradient descent algorithm that is good
for relatively large datasets. Another option is "sgd", the stochastic
gradient descent algorithm. A third option is "lbfgs", an algorithm
based on the quasi-Newton method and suitable for small datasets.
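For example (the particular values are illustrative only), an MLP classifier with two hidden layers, tanh activation, and the sgd solver can be created like this:
from sklearn.neural_network import MLPClassifier
# Two hidden layers of 20 and 10 nodes, tanh activation, plain SGD solver
mlp = MLPClassifier(hidden_layer_sizes=(20, 10), activation="tanh",
                    solver="sgd")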
The following code shows how to use the MLPClassifier class to
classify the digits dataset. The resulting score was about 0.98 when I
ran the code.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), MLPClassifier())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9833333333333333
The code segment below examines how the hidden layers affect the
score of the neural network. Different structures of the hidden layers are
set using the hidden_layer_sizes argument of MLPClassifier, and the
cross-validation scores are obtained and displayed.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
sizes = ((), (1,), (3,), (10,), (20,), (50,), (100,), (1000,),
(100, 100), (100, 100, 100))
for size in sizes:
    clf = MLPClassifier(hidden_layer_sizes=size)
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, n_jobs=-1)
    print(f"Size: {size}, Score: {scores.mean()}")
Output:
Size: (), Score: 0.956032188177035
Size: (1,), Score: 0.2581987000928505
Size: (3,), Score: 0.7551346332404828
Size: (10,), Score: 0.9543655215103684
Size: (20,), Score: 0.9638177034973692
Size: (50,), Score: 0.973279170535438
Size: (100,), Score: 0.9782884555865058
Size: (1000,), Score: 0.9816279789538843
Size: (100, 100), Score: 0.9777329000309501
Size: (100, 100, 100), Score: 0.9738393686165274
The results show that for one hidden layer, the scores increase with the
number of nodes in that layer, but the increments beyond 50 nodes ((50,))
become more and more marginal. Using two or three hidden layers
((100, 100) and (100, 100, 100)) shows no improvement, but instead
a tiny degradation, compared with a single hidden layer of the same
number of nodes ((100,)).
Simply adding more layers to a neural network therefore may not
improve its performance. The structure of a neural network should be
carefully designed and tuned for the specific machine learning problem
and dataset. This is especially important for complex problems such as
image classification and speech processing, and is a key topic in deep
learning.
The usage of the MLPRegressor class is similar to other regressor classes.
Now complete the following activity to try using the MLPRegressor class.
Activity 4.43
Modify the two code segments below to use the MLPRegressor class to
work on the Boston dataset instead of using the MLPClassifier class and
the digits dataset. (Hint: to alleviate the convergence problem, use the
arguments max_iter=1000 and learning_rate_init=0.01. Refer to the
API documentation of the MLPRegressor class for details.)
# Modify code below
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), MLPClassifier())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
# Modify code below
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
sizes = ((), (1,), (3,), (10,), (20,), (50,), (100,), (1000,),
(100, 100), (100, 100, 100))
for size in sizes:
    clf = MLPClassifier(hidden_layer_sizes=size)
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, n_jobs=-1)
    print(f"Size: {size}, Score: {scores.mean()}")
Feedback is provided for this activity.
Let’s conclude this section on neural networks by discussing their
advantages and disadvantages. Neural networks are flexible and capable
of handling both classification and regression tasks. They work for
non-linear and complex datasets by designing the network structure of
nodes and other configurations. Using a trained neural network to make
predictions is quite fast too.
On the other hand, the structure of layers and nodes needs to be
carefully designed and tuned for the specific machine learning problems
and datasets, especially for complex ones. In addition, the training of
complex neural networks is time-consuming and computationally-
expensive. Finally, the internal workings of a neural network are not
interpretable, and the resulting predictions may not be easily understood
or explained.
Now complete the self-test below to check your understanding of neural
networks discussed in this section. Suggested answers can be found at
the end of the unit.
Self-test 4.11
1 What are the two operations of a perceptron?
2 What are four commonly-used activation functions? Which of them
are smooth and differentiable?
3 How many layers are there in a single-layer perceptron (SLP)? What
are the layers?
4 How many layers are there in a multilayer perceptron (MLP)? What
are the layers?
5 What is a feed-forward neural network? What is back-propagation?
6 Which classes of scikit-learn implement MLP neural networks?
How do you specify the hidden layers of neural networks in these
classes?
If you want to see an alternative presentation of neural networks, watch
this video:
How do neural networks work?: https://www.youtube.com/
watch?v=fkqZyYo_ebsX
Neural networks are one of the most useful techniques in machine
learning. Some people even claim that if a problem can be solved using
any machine learning method, it can very probably be solved using a
neural network! This section has only described the basics of neural
networks, and you’ll learn much more about them in Unit 6 Deep
learning.
Comparison of supervised
learning algorithms
In this unit, we have discussed a dozen supervised learning algorithms.
These algorithms have very different mechanisms and characteristics.
This section summarizes their characteristics and usage, and carries
out some basic performance comparisons.
Characteristics and usage
The following table summarizes the mechanisms and characteristics of
the learning algorithms discussed in this unit. Note that the comparative
attributes, especially in the ‘Prediction capability’ and ‘Robustness
to outliers’ columns, aim to provide a general sense of the algorithms
rather than an absolute evaluation. As you learned in the unit, the exact
behaviours of an algorithm depend highly on the problem and the
dataset. For example, while having weaker prediction capability than
neural networks in general, linear regression can perform well for a
dataset whose features and labels are related linearly.
Table 4.15 Key mechanisms and characteristics of supervised learning
algorithms
Algorithm | Type | Key mechanism | Prediction capability | Learning speed | Prediction speed | Robustness to outliers | Interpretability | Requires scaling?
Linear regression | Regression | Linear functional approximation | Poor | Fair | Fast | Poor | Good | Yes
Logistic regression | Classification | Linear functional approximation | Poor | Fair | Fast | Poor | Good | Yes
K-nearest neighbours | Classification and regression | Instance-based | Fair | Fast | Slow | Poor | Good | Yes
Naive Bayes | Classification | Probabilistic | Poor | Fast | Fast | Good | Good | No
Decision trees | Classification and regression | Tree | Fair | Fair | Fast | Good | Good | No
Rule-based classifiers | Classification | Rules | Fair | Fair | Fast | Good | Good | No
Bagging | Classification and regression | Meta-heuristic | Good | Slow | Fast | Good | Fair | No
Random forests | Classification and regression | Meta-heuristic | Good | Slow | Fast | Good | Fair | No
Boosting | Classification and regression | Meta-heuristic | Good | Slow | Fast | Poor | Fair | No
Support vector machines | Classification and regression | Decision boundary | Good | Slow | Fast | Good | Poor | Yes
Neural networks | Classification and regression | Non-linear functional approximation | Good | Slow | Fast | Good | Poor | Yes
The following figure, from the scikit-learn documentation, is an aid
for choosing a machine learning algorithm (or scikit-learn class) for
the problem at hand. You start from the START node, make decisions at
the branching nodes, and arrive at a candidate algorithm; the chart is
itself a decision tree! The algorithms in the classification and
regression groups have been covered in this unit. Those in the
clustering and dimensionality reduction groups belong to unsupervised
learning, which we will discuss in Unit 5.
Figure 4.44 Choosing a machine learning algorithm (Source: https://scikit-learn.
org/stable/tutorial/machine_learning_map/index.html)
Basic performance comparison
In the following, we carry out simple evaluations to compare the
performance of different learning algorithms. The code below computes
the cross-validation scores of classifiers on the digits dataset. All
these classifiers have been described and demonstrated, so you should
have no problem understanding the code. As a reminder, the cv()
function collects its keyword arguments into its kwargs parameter and
passes them on in the call model_class(**kwargs). If necessary, review
‘Packing and unpacking arguments’ under ‘Functions’ of the ‘Python
basics’ section in Unit 2. You will see an example use of keyword
arguments shortly.
from sklearn.datasets import load_digits
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import (LogisticRegression, Perceptron,
                                  SGDClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

digits = load_digits()
X, y = shuffle(digits.data, digits.target)

def cv(model_class, scale=False, **kwargs):
    if scale:
        model = make_pipeline(StandardScaler(), model_class(**kwargs))
    else:
        model = model_class(**kwargs)
    scores = cross_val_score(model, X, y, n_jobs=-1)
    print(f"{model_class.__name__}: {scores.mean():.4f}")

cv(LogisticRegression, scale=True)
cv(SGDClassifier, scale=True)
cv(KNeighborsClassifier, scale=True)
cv(GaussianNB)
cv(DecisionTreeClassifier)
cv(BaggingClassifier)
cv(RandomForestClassifier)
cv(AdaBoostClassifier)
cv(GradientBoostingClassifier)
cv(LinearSVC, scale=True)
cv(SVC, scale=True)
cv(Perceptron, scale=True)
cv(MLPClassifier, scale=True)
Output:
LogisticRegression: 0.9705
SGDClassifier: 0.9482
KNeighborsClassifier: 0.9783
GaussianNB: 0.8336
DecisionTreeClassifier: 0.8492
BaggingClassifier: 0.9254
RandomForestClassifier: 0.9766
AdaBoostClassifier: 0.2749
GradientBoostingClassifier: 0.9633
LinearSVC: 0.9560
SVC: 0.9850
Perceptron: 0.9393
MLPClassifier: 0.9800
For the sake of simplicity, all classifiers use their default settings. When
I ran the code, the SVC and MLPClassifier performed best with scores
of about 0.98. The AdaBoostClassifier performed worst with a score of
about 0.27. As discussed in the section ‘Boosting’, AdaBoostClassifier
uses 1-level decision trees by default and it works poorly for classifying
the digits. The solution is to use 10-level trees instead:
cv(AdaBoostClassifier, base_estimator=DecisionTreeClassifier(max_depth=10))
The cv() function collects the base_estimator keyword argument into
its kwargs parameter, and uses kwargs to create the classifier. In
effect, the AdaBoostClassifier object is created like this:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=10)).
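If the packing step seems abstract, this tiny illustrative sketch
shows what **kwargs collects:

def demo(**kwargs):
    # Keyword arguments not matched by named parameters are packed
    # into the kwargs dictionary.
    print(kwargs)

demo(base_estimator="some estimator")  # {'base_estimator': 'some estimator'}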
The code below computes the cross-validation scores of regressors for
handling the Boston dataset. All these regressors have been described
and demonstrated, so you should have no problem understanding the
code. When I ran the code, the GradientBoostingRegressor performed
best with a score of about 0.86.
from sklearn.datasets import load_boston
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import shuffle

boston = load_boston()
X, y = shuffle(boston.data, boston.target)

def cv(model_class, scale=False, **kwargs):
    if scale:
        model = make_pipeline(StandardScaler(), model_class(**kwargs))
    else:
        model = model_class(**kwargs)
    scores = cross_val_score(model, X, y, n_jobs=-1)
    print(f"{model_class.__name__}: {scores.mean():.4f}")

cv(LinearRegression, scale=True)
cv(SGDRegressor, scale=True)
cv(KNeighborsRegressor, scale=True)
cv(DecisionTreeRegressor)
cv(BaggingRegressor)
cv(RandomForestRegressor)
cv(AdaBoostRegressor)
cv(GradientBoostingRegressor)
cv(LinearSVR, scale=True)
cv(SVR, scale=True)
cv(MLPRegressor, scale=True)
Output:
LinearRegression: 0.7220
SGDRegressor: 0.7219
KNeighborsRegressor: 0.7385
DecisionTreeRegressor: 0.7181
BaggingRegressor: 0.8121
RandomForestRegressor: 0.8437
AdaBoostRegressor: 0.8017
GradientBoostingRegressor: 0.8680
LinearSVR: 0.6934
SVR: 0.6491
MLPRegressor: 0.6852
When approaching a machine learning problem, you may carry out
similar initial evaluations of different algorithms, perhaps using a
subset of a huge dataset. Remember that the default settings of the
scikit-learn classes may not work best for your dataset, so make sure
to tune the parameters and hyperparameters. You may use models with
built-in cross-validation and the grid search technique, both
discussed in the section ‘Tuning hyperparameters’ under ‘Model
evaluation’.

Tuning and experimenting are two things you will do a lot of during
the modelling stage of a machine learning project.
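For example, a minimal grid-search sketch for tuning an SVC classifier
on the digits dataset might look like this; the candidate values in
param_grid are illustrative, not tuned recommendations:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
# Grid search fits one cross-validated model per combination of the
# candidate values and reports the best combination found.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, n_jobs=-1)
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)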
Summary
Supervised machine learning aims to make predictions using knowledge
acquired from some known examples. Depending on the types of values
to predict, there are two types of supervised learning: regression for
predicting numerical values, and classification for predicting categorical
values or classes.
The topics in this unit addressed a number of supervised machine
learning algorithms for solving regression and classification problems,
and techniques for evaluating the algorithms.
The unit began by discussing two simple learning methods that make
predictions using linear models, which relate features and labels
linearly. The first method discussed was linear regression, which
aims to predict numbers using a model that has been trained from
labelled examples. Training the model means determining the best
parameters for the model. A common training or optimization algorithm
is stochastic gradient descent, which optimizes the parameters based
on the error (loss) returned by the cost function. Hyperparameters are
settings for the training process, such as the learning rate and the
number of iterations, and are usually fixed before training. Multiple
linear regression involves multiple (two or more) features in an
example. Linear regression can be adapted to build a polynomial
regression model, which is non-linear, by transforming the original
features into new higher-order polynomial features.
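As a quick illustration of the last point, the sketch below fits a
quadratic relation using PolynomialFeatures with LinearRegression; the
data are synthetic and purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data: y = x^2 + 1.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (X ** 2 + 1).ravel()
# Degree-2 polynomial features let the linear model fit the
# non-linear relation exactly.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R2 score close to 1.0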
The second method we discussed was logistic regression, which aims to
predict categories or classes of unlabelled (unseen) examples. Logistic
regression works by using linear regression to predict numerical scores
or probabilities that a query example belongs to different classes, and
concluding on a predicted class from these scores. Logistic regression
works for binary classification, which involves two classes, and for
multiclass classification, which involves three or more classes. There
are two approaches to performing multiclass classification:
one-versus-rest, which compares every class against all the remaining
classes; and one-versus-one, which compares every class against every
other class.
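The two approaches can be made explicit with scikit-learn’s
OneVsRestClassifier and OneVsOneClassifier wrappers. In this minimal
sketch on the three-class iris dataset, both wrappers happen to build
three binary models (3 for one-versus-rest, and 3 × 2 / 2 for
one-versus-one):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

iris = load_iris()
for wrapper in (OneVsRestClassifier, OneVsOneClassifier):
    clf = wrapper(LogisticRegression(max_iter=200))
    clf.fit(iris.data, iris.target)
    # estimators_ holds the underlying binary models.
    print(wrapper.__name__, len(clf.estimators_))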
You also learned about generalization, which is how well a machine
learning algorithm makes predictions for unseen data. There are two
types of generalization problems. Underfitting occurs when a model
captures too little information from the training data; the model
performs poorly for both training (seen) data and test (unseen) data.
Overfitting occurs when a model captures too much information from
the training data, including noise; the model performs well for training
(seen) data, but poorly for test (unseen) data.
You learned that two characteristics of a model are bias and variance.
Bias is the difference between the feature–label relation acquired by the
model and the actual relation, while variance is the variation in building
the model from the training data. Simple models often have high bias
and are more vulnerable to underfitting; complex models often have
high variance and are more vulnerable to overfitting. Regularization
is a technique for avoiding overfitting by adding to the cost function
a penalty term that represents the model’s complexity. Three types of
regularization are the lasso (L1), the ridge (L2), and the elastic-net
(L1+L2).
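In scikit-learn, the three correspond to the Lasso, Ridge, and
ElasticNet classes (and to the penalty argument of the SGD-based
models). A minimal sketch, with illustrative alpha values:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
# alpha sets the weight of the penalty term; the values here are
# illustrative, not tuned choices.
for model in (Lasso(alpha=0.1), Ridge(alpha=1.0),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    scores = cross_val_score(model, X, y)
    print(type(model).__name__, scores.mean())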
Model evaluation is important in machine learning, and is carried out
for training, selecting, and assessing models. In model training, models
are evaluated using the training set for optimizing the parameters of the
models. In model selection, models are evaluated using the validation
set for determining the type and hyperparameters of the best model.
In model assessment, the selected best model is evaluated using the
test set for estimating the performance in production. The training set,
validation set, and test set should be separate. Training and validation
can be facilitated by cross-validation, which divides a dataset into k
folds, and which uses each fold as the validation set in turns in the
k resulting models. For classification, stratification means
distributing examples of the different classes evenly across the
folds, so that each fold has a class distribution similar to that of
the whole dataset. In leave-one-out
cross-validation, the number of examples equals the number of folds,
and one example is used for validation in each of the cross-validation
models.
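The following minimal sketch contrasts stratified k-fold
cross-validation with leave-one-out cross-validation on the iris
dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (LeaveOneOut, StratifiedKFold,
                                     cross_val_score)

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200)
# Five stratified folds: each fold keeps similar class proportions.
print(cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5)).mean())
# Leave-one-out: one fold per example, i.e. 150 models for iris.
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())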
You then learned to use various evaluation metrics for regression and
classification. Regression metrics include the R2 score, mean squared
error, mean absolute error, and median absolute error. Classification
metrics include the accuracy score, precision, recall, F1 score, and area
under the ROC curve. A main use of evaluation techniques and metrics
is to tune hyperparameters, as in grid search, which compares different
combinations of hyperparameters to obtain the best combination
(model).
After explaining generalization and model evaluation, the unit discussed
other learning methods such as the k-nearest neighbours (kNN) method.
kNN does not use a model, but makes predictions based on similar
known examples that are nearest to the query example of interest.
Without learning a model, kNN is described as lazy, non-parametric, and
instance-based. Different distance metrics can be used to determine the
nearest neighbours. The unit described a few of these metrics: Euclidean
distance, Manhattan distance, Chebyshev distance, Minkowski distance,
and cosine distance.
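These metrics are all available in SciPy, which offers a quick way to
check distance calculations by hand; a minimal sketch using the
vectors from Self-test 4.5:

from scipy.spatial import distance

u, v = (1, 1), (4, 5)
print(distance.euclidean(u, v))       # 5.0
print(distance.cityblock(u, v))       # Manhattan distance: 7
print(distance.chebyshev(u, v))       # 4
print(distance.minkowski(u, v, p=3))  # Minkowski distance with p = 3
print(distance.cosine(u, v))          # 1 minus the cosine similarity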
Another learning method discussed in the unit was Naive Bayes, which
makes predictions according to probabilities derived from the example
training data. These probabilities are used to calculate the likelihood
(probability) of the predicted targets based on Bayes’ theorem.
Three key types of Naive Bayes classifiers are Bernoulli Naive Bayes
classifiers (for binary features), multinomial Naive Bayes classifiers (for
numerical count-like features), and Gaussian Naive Bayes classifiers
(for continuous features).
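The three variants map directly to scikit-learn classes. The sketch
below, using randomly generated toy data, only shows which kind of
feature each class expects:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)  # two classes
# Each variant assumes a different feature type.
models = (
    (BernoulliNB(), rng.integers(0, 2, (100, 4))),     # binary 0/1
    (MultinomialNB(), rng.integers(0, 10, (100, 4))),  # count-like
    (GaussianNB(), rng.normal(size=(100, 4))),         # continuous
)
for model, X in models:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))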
You saw how decision trees are used in supervised learning to make
decisions according to conditions in internal tree nodes to reach
predicted targets in leaf nodes. Decision tree learning algorithms include
ID3, C4.5, and CART; ID3 and C4.5 consider the entropy of the split
example data, while CART considers the data’s Gini impurity. All three
algorithms build a decision tree by adding nodes that split the
training example data to maximize the information gain (or, for CART,
the reduction in Gini impurity).
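Both impurity measures are easy to compute directly; this minimal
sketch reproduces the figures given in the answer to Self-test 4.7:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([10, 10]), gini([10, 10]))      # 1.0 0.5
print(entropy([10, 5, 5]), gini([10, 5, 5]))  # 1.5 0.625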
A rule-based classifier determines the class of a query example using
a sequence of ‘If … Then …’ rules. The rules can be obtained from a
decision tree. Two special rule-based classifiers are ZeroR and OneR:
ZeroR predicts the most frequent class, while OneR uses one best
feature to make predictions.
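scikit-learn has no dedicated ZeroR class, but its DummyClassifier
with strategy="most_frequent" behaves in the same way, as this sketch
shows:

from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

X, y = load_iris(return_X_y=True)
zeror = DummyClassifier(strategy="most_frequent")
zeror.fit(X, y)
# The iris classes are balanced (50 examples each), so ZeroR always
# predicts one class and scores 1/3.
print(zeror.score(X, y))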
The unit described ensemble learning methods that combine multiple
base models into an ensemble model for making better predictions.
Bagging (bootstrap aggregating) employs bootstrap samples
(i.e. sampling with replacement) to train base models, and combines
their predicted outcomes. To combine predicted outcomes, averaging is
applied for regression, and voting is applied for classification. Random
forests are bagging ensembles that use decision trees as base models
and use feature subsets to train the decision trees. Boosting trains
base models sequentially, improving the training based on the results
of preceding base models. Two boosting methods are AdaBoost and
gradient boosting.
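The combining step can be tried in isolation with scikit-learn’s
VotingClassifier, where voting="hard" takes the most frequent
predicted class and voting="soft" averages the predicted
probabilities; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Three different base models, combined by soft voting.
ensemble = VotingClassifier(
    [("lr", LogisticRegression(max_iter=200)),
     ("nb", GaussianNB()),
     ("dt", DecisionTreeClassifier())],
    voting="soft")
ensemble.fit(X, y)
print(ensemble.score(X, y))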
Two advanced learning methods — support vector machines and neural
networks — were discussed in the unit. A support vector machine
(SVM) trains its model by maximizing the margin between two different
classes of data points. A soft margin allows a few data points to violate
the separating boundary. Kernels enable SVMs to deal with non-linear
datasets; three important SVM kernels are the linear kernel, polynomial
kernel, and radial basis function kernel.
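The three kernels can be compared by passing the kernel argument to
SVC, as in this sketch on the digits dataset (default settings,
illustrative only):

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(clf, X, y, n_jobs=-1).mean())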
Artificial neural networks are inspired by biology. A perceptron,
the simplest neural network, computes the weighted summation of
the inputs plus a bias, and then applies an activation function. Four
commonly-used activation functions are the step function, logistic
function, tanh function, and ReLU function. A perceptron is also called
a single-layer perceptron (SLP), which is regarded as containing an
input layer and an output layer. A multilayer perceptron (MLP) contains
an input layer, one or more hidden layers, and an output layer. A feed-
forward neural network passes data from input to output in a path
without a loop. To train a neural network, we optimize the weights
and biases using the back-propagation algorithm, which propagates
the errors (losses) from output backwards to the input for applying the
gradient descent optimization algorithm.
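A perceptron’s two operations fit in a few lines of NumPy; the weights
and bias below are illustrative values, whereas a trained perceptron
would learn them:

import numpy as np

def perceptron(x, weights, bias):
    z = np.dot(weights, x) + bias  # weighted summation plus bias
    return 1 if z >= 0 else 0      # step activation function

x = np.array([1.0, 2.0])
print(perceptron(x, np.array([0.5, -0.25]), 0.1))  # outputs 1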
Various supervised learning methods have been discussed in this unit.
In the next unit, you will learn about unsupervised learning techniques,
which identify patterns from example data that have features but no
labels.
References
Brownlee, J (2017) ‘How to train a final machine learning model’,
Machine Learning Mastery, 17 March, https://machinelearningmastery.
com/train-final-machine-learning-model/.
Burkov, A (2019) The Hundred-Page Machine Learning Book, Andriy
Burkov.
Fenner, M E (2020) Machine Learning with Python for Everyone,
Addison-Wesley.
Google Machine Learning Crash Course, ‘A self-study guide for
aspiring machine learning practitioners’, https://developers.google.com/
machine-learning/crash-course.
Hastie, T, Tibshirani, R and Friedman, J (2009) The Elements of
Statistical Learning: Data Mining, Inference, and Prediction (2nd
edn), New York: Springer.
James, G, Witten, D, Hastie, T and Tibshirani, R (2013) An Introduction
to Statistical Learning with Applications in R, New York: Springer.
Maini, V (2017) ‘Machine learning for humans’, Machine Learning for
Humans, 19 August, https://medium.com/machine-learning-for-humans/
why-machine-learning-matters-6164faf1df12.
McCaffrey, J D (2018) ‘A comparison of ten machine learning
classification algorithms’, James D. McCaffrey, 7 November, https://
jamesmccaffrey.wordpress.com/2018/11/07/a-comparison-of-ten-
machine-learning-classification-algorithms/.
Mueller, J P and Massaron, L (2020) Data Science Programming All-in-
One For Dummies, New York: John Wiley & Sons.
Osisanwo, F Y, Akinsola, J E T, Awodele, O, Hinmikaiye, J O,
Olakanmi, O and Akinjobi, J (2017) ‘Supervised machine learning
algorithms: Classification and comparison’, International Journal
of Computer Trends and Technology, https://www.semanticscholar.
org/paper/Supervised-Machine-Learning-Algorithms%3A-and-F.
Y-AkinsolaJ.E./bed9dc37c6597136eb5ae761a14b2d7f8e0204a1.
Rocca, J (2019) ‘Ensemble methods: Bagging, boosting and stacking’,
towards data science, 23 April, https://towardsdatascience.com/
ensemble-methods-bagging-boosting-and-stacking-c9214a10a205.
Sayad, S ‘An Introduction to Data Science’, https://www.saedsayad.
com/data_mining_map.htm.
Schönleber, D (2018) ‘A “short” introduction to model selection’,
towards data science, 11 December, https://towardsdatascience.com/
a-short-introduction-to-model-selection-bb1bb9c73376.
Scikit-learn, ‘API reference’, https://scikit-learn.org/stable/modules/
classes.html.
Scikit-learn, ‘User guide’, https://scikit-learn.org/stable/user_guide.
html.
Serengil, S I (2018) ‘A step by step C4.5 decision tree example’, Sefik
Ilkin Serengil, 13 May, https://sefiks.com/2018/05/13/a-step-by-step-
c4-5-decision-tree-example/.
Serengil, S I (2017) ‘A step by step ID3 decision tree example’, Sefik
Ilkin Serengil, 20 November, https://sefiks.com/2017/11/20/a-step-by-
step-id3-decision-tree-example/.
Tan, P N, Steinbach, M, Karpatne, A and Kumar, V (2019) Introduction
to Data Mining (2nd edn), Pearson.
The Open University of Hong Kong (2020) COMPS492F Artificial
Intelligence, Hong Kong: OUHK.
Wikipedia, https://en.wikipedia.org/.
Suggested answers to self-tests
Self-test 4.1
1 Linear regression is used for predicting numerical labels.
2 In gradient descent, the parameter value should be decreased for a
positive gradient, and increased for a negative gradient.
3 Hyperparameters are variables, or settings, of the training or
optimization algorithm that creates or learns a machine learning
model, e.g. the learning rate and the number of iterations of gradient
descent.
4 When the learning rate is too small, the update of the model
parameter (the learning process) is slow, and the algorithm cannot
return a useful parameter value within a practical period of time.
When the learning rate is too large, the parameter value is updated
with large increments or decrements; the value may bounce between the
two sides of the cost curve or even diverge, giving the impression
that the machine learning algorithm and/or model are totally invalid.
5 The LinearRegression and SGDRegressor classes (among others)
perform linear regression. The LinearRegression class is
implemented using matrix operations, while the SGDRegressor class
is implemented using stochastic gradient descent.
6 Stochastic gradient descent is a kind of gradient descent in which
an example of the training set is examined in each iteration to
determine the gradients for updating the model parameters. Because
the examples of the training set are shuffled, a random example is
used in each iteration — this is why the algorithm is described as
‘stochastic’.
7 Three types of gradient descent are batch gradient descent,
stochastic gradient descent and mini-batch gradient descent.
8 Multiple linear regression is linear regression that involves two or
more features for making predictions.
9 The PolynomialFeatures class can be used for implementing
polynomial regression.
Self-test 4.2
1 Logistic regression is used for solving classification problems.
2 The input values of the logistic function range from negative infinity
to positive infinity, while the output values range from 0.0 to 1.0.
3 There are two classes in binary classification, and three or more
classes in multiclass classification.
4 For a classification problem with three classes, there are three
models in the one-versus-rest approach, and three (= 3 × 2 / 2) in the
one-versus-one approach.
5 In the one-versus-rest approach, the models are red versus green and
blue, green versus red and blue, and blue versus red and green. In
the one-versus-one approach, the models are red versus green, green
versus blue, and blue versus red.
6 The LogisticRegression and SGDClassifier classes (among others)
perform logistic regression.
7 There are 600 (= 20×30) features.
Self-test 4.3
1 Underfitting means that a model captures too little statistical
information from the training data. When this happens, both the
training error rate and test error rate are high.
2 Overfitting means that a model captures too much statistical
information from the training data. When this happens, the training
error rate is low, but the test error rate is high.
3 A simple model is more vulnerable to underfitting, while a complex
model is more vulnerable to overfitting.
4 The bias of a model is the difference between the feature–label
relation acquired by the model and the actual relation. The variance
of a model is the variation of building the model from the training
data.
5 Regularization aims to avoid overfitting by penalizing complex
models. To apply regularization to a linear model, we add to the
cost function (loss function) a penalty term that is proportional or
otherwise related to the complexity of the model.
6 The three kinds of regularization are the lasso (L1), the ridge (L2)
and the elastic-net.
Self-test 4.4
1 The three phases are model training, model selection, and model
assessment. In model training, model evaluation is performed
using the training set for the purpose of learning or optimizing the
parameters of the models. In model selection, model evaluation is
performed using the validation set for the purpose of comparing the
models for selecting the best. In model assessment, model evaluation
is performed using the test set for the purpose of estimating the
performance when the selected model is used in production.
2 A fold contains 200 (= 1000/5) examples. There are five models. For
each model, the training set contains 800 (= 4×200) examples, and
the validation set contains 200 examples.
3 A fold contains one example. There are 1,000 models. For each
model, the training set contains 999 examples, and the validation set
contains one example.
4 The benefits of leave-one-out cross-validation are better performance
because of more training examples, and more comprehensive
evaluation results with respect to the whole dataset. The drawback is
that more computation is required for dealing with the large number
of models.
5 Four commonly-used regression metrics are the R2 score, mean
squared error, mean absolute error, and median absolute error.
6 The accuracy of predictions is not a good metric for imbalanced
classification in situations in which the populations of the classes
are very different or skewed.
7 Precision = TP / (TP + FP) = 10 / (10 + 30) = 0.25.
Recall = TP / (TP + FN) = 10 / (10 + 20) = 0.3333.
F1 score = 2 × TP / (2 × TP + FP + FN) = 2 × 10 / (2 × 10 + 30 + 20)
= 0.2857.
8 The horizontal axis of an ROC curve is the false positive rate (FPR),
while the vertical axis is the true positive rate (TPR). A higher ROC
curve represents a better model than a lower one.
9 The AUC score stands for the area under the ROC curve score. The
score values range from 0 to 1.
10 For tuning hyperparameters, scikit-learn supplies models with
built-in cross-validation (e.g. RidgeCV) and a grid search tool
(GridSearchCV).
11 15 (= 3 × 5) models are evaluated in cross-validation.
Self-test 4.5
1 The kNN algorithm predicts a label based on the labels of the
k nearest neighbours. For regression, the predicted numerical label
can be the mean of the k labels. For classification, the predicted
categorical label can be the most frequent class among the k labels.
2 kNN is described as a lazy learner because it does not learn at all
from the examples during model training/fitting, but rather stores
them for later use.
3 A parametric method assumes a relation between the features and
labels of the examples, and learns the relation parameters in model
training. A non-parametric method does not assume or formulate
a relation or parameters between the features and labels of the
examples.
4 The n_neighbors argument designates the number of nearest
neighbours used in kNN, and its default value is 5.
5 The implementation with a large value of k is more vulnerable to
underfitting, and the one with a small value of k is more vulnerable
to overfitting.
6 The distances between the vectors (1, 1) and (4, 5) are computed as
follows:
a Euclidean distance = √((1 – 4)² + (1 – 5)²) = √(9 + 16) = √25 = 5
b Manhattan distance = |1 – 4| + |1 – 5| = 7
c Chebyshev distance = max(|1 – 4|, |1 – 5|) = 4
7 The KNeighborsRegressor and KNeighborsClassifier classes use the
Euclidean distance (i.e. the metric argument of "minkowski" and the
p argument of 2) by default.
Self-test 4.6
1 The posterior probability is the probability of the target (class) given
the features, i.e. P(Y | X).
2 Three main types are Bernoulli, multinomial, and Gaussian Naive
Bayes classifiers. The Bernoulli Naive Bayes classifier works with
binary features in Bernoulli distribution, i.e. every feature takes on
the value 0 or 1 (or, true or false, etc). The multinomial Naive Bayes
classifier works with numerical features that are discrete or count-
like, i.e. in multinomial distribution. The Gaussian Naive Bayes
classifier works with continuous features in Gaussian distribution.
3 One-hot encoding is commonly used for converting a categorical
feature to binary features, and is implemented by the scikit-learn
OneHotEncoder class.
4 Term frequency-inverse document frequency, tf-idf, is commonly
used for converting a text/document feature to multinomial features,
and is implemented by the scikit-learn TfidfVectorizer class.
5 The two components of tf-idf are term frequency and inverse
document frequency. The term frequency refers to how many times
a term appears in a document. The document frequency refers to
how many documents the term appears in; the inverse document
frequency downweights terms that appear in many documents.
Self-test 4.7
1 Three algorithms are ID3, C4.5, and CART.
a ID3 handles only categorical features. C4.5 and CART handle
both categorical and numerical features.
b ID3 and C4.5 solve only classification problems. CART solves
both classification and regression problems.
c ID3 is the most vulnerable to overfitting.
2 The entropy and Gini impurity are calculated as follows.
a For ten dogs and ten cats:
• Entropy: −10/20 log(10/20) − 10/20 log(10/20) = 1.0
• Gini impurity: 1 − (10/20)² − (10/20)² = 0.5
b For ten apples, five bananas, and five oranges:
• Entropy: −10/20 log(10/20) − 5/20 log(5/20) − 5/20 log(5/20) = 1.5
• Gini impurity: 1 − (10/20)² − (5/20)² − (5/20)² = 0.625
3 Entropy of the original dataset:
−10/20 log(10/20) − 10/20 log(10/20) = 1.0
For the subset of four dogs and two cats, the entropy is
−4/6 log(4/6) − 2/6 log(2/6) = 0.918.
For the subset of one dog and three cats, the entropy is
−1/4 log(1/4) − 3/4 log(3/4) = 0.811.
Information gain: 1.0 − (6/10 × 0.918 + 4/10 × 0.811) = 0.125
Information gain ratio: 0.125 / (−6/10 log(6/10) − 4/10 log(4/10)) =
0.129
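These figures can be verified with a few lines of Python:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

gain = entropy([10, 10]) - (6/10 * entropy([4, 2]) + 4/10 * entropy([1, 3]))
split_info = entropy([6, 4])
print(round(gain, 3), round(gain / split_info, 3))
# 0.125 0.128 (0.129 if the gain is rounded to 0.125 before dividing)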
4 The classes DecisionTreeClassifier and DecisionTreeRegressor
implement decision trees, and they use the CART learning
algorithm.
Self-test 4.8
1 Both rule-based classifiers and decision trees split examples based
on features. A rule-based classifier maintains the rules in a list
and goes through the rules one by one to determine the class of
an example. On the other hand, a decision tree maintains the rules
in a tree data structure and works by branching through the tree’s
internal nodes.
2 ZeroR, or Zero Rule, is a classifier that predicts only one outcome
— the most frequent label (class) in the training data. OneR, or
One Rule, is a simple classifier that exploits the single best feature
for making predictions; for each feature value, it predicts the most
frequent class.
3 The dataset contains 50 apples, 30 bananas, and 20 oranges.
a ZeroR uses the most frequent class, i.e. 50 apples, as the
predicted outcome. Therefore, 50 among the 100 fruits are
predicted correctly and the score is 50/100, or 0.5.
b For OneR, the best feature splits the 50 apples from the others,
so these 50 apples are predicted correctly. In the mix of 30
bananas and 20 oranges, the bananas are the major class; so
all bananas and oranges are predicted as bananas and the 30
bananas are predicted correctly. Altogether, 50 apples and 30
bananas are predicted correctly, and the score is (50+30)/100, or
0.8.
Self-test 4.9
1 Using sampling without replacement, the only possible result is a
red ball plus a green ball. Using sampling with replacement, there
are three possible results: two red balls, two green balls, and a red
ball plus a green ball.
2 It is possible to take a sample of size 200 from a dataset of size 100
if sampling with replacement is used. Sampling with replacement
allows duplicate items in a sample, so the sample size may be larger
than the dataset size.
3 For the ensemble’s predicted result, bagging uses the mean of the
base models’ results for regression, and uses the most frequent class
(hard-voting) or the class with the highest probability (soft-voting)
for classification.
4 First, the base models for random forests are decision trees, while
those for bagging may be any weak learners. Second, random
forests use feature subsets in training the base models, but in general
bagging uses all the features in training the base models.
5 For training the second base model, the weight of the first example
is decreased, and the weights of the second and third examples are
increased.
6 The pseudo-residuals, 0, −2, and 2 (= 5 − 5, 10 − 12, and 10 − 8),
are used to train the second base model.
7 Bagging and random forests are used for improving high-variance
weak learners, while boosting is used for improving high-bias weak
learners.
Self-test 4.10
1 Applicable to only linearly separable datasets, a hard margin is
the distance between the support vectors in the direction across, or
perpendicular to, the separating boundary (hyperplane).
2 A soft margin is similar to a hard margin, but a number of data
points are allowed to violate the separating boundary (i.e. to be on
the ‘wrong’ side of the boundary). A soft margin is applicable to
both linearly separable datasets and linearly inseparable datasets
(typically close-to-linearly separable datasets).
3 Three kinds of commonly-used kernels in SVMs are linear kernels,
polynomial kernels, and radial (radial basis function, RBF) kernels.
The RBF kernels are the most flexible among the three kinds.
4 The LinearSVC and SVC classes are used for implementing SVM
classifiers, while the LinearSVR and SVR classes are used for
implementing SVM regressors.
5 Some important parameters to tune include C (the regularization
parameter), kernel (the kind of kernel), and degree (the degree of
the polynomial kernel for kernel="poly").
Self-test 4.11
1 The first operation of a perceptron computes the weighted
summation of the input values plus a bias. The second operation
applies an activation function (e.g. step function or logistic function)
to the first operation’s result.
2 Four commonly-used activation functions are the step function,
logistic function, hyperbolic tangent function (tanh), and rectified
linear unit function (ReLU). The logistic function and tanh function
are smooth and differentiable.
3 In a single-layer perceptron, there are two layers, which are an input
layer and an output layer.
4 In a multilayer perceptron, there are three or more layers, which are
an input layer, one or more hidden layers, and an output layer.
5 A feed-forward neural network is a network whose connections do
not form a circle of nodes; the path of data flow through the network
does not contain loops. The back-propagation algorithm propagates
the errors (losses) from the output layer backwards to the input
layer, across and for every layer, for the tuning of the weights and
biases of the layers.
6 The MLPClassifier and MLPRegressor classes in the sklearn.neural_
network module implement MLP neural networks. To specify the
hidden layers, use the hidden_layer_sizes argument to specify a
tuple for the numbers of nodes in the hidden layers, e.g. (2, 5) for
two hidden layers of 2 and 5 nodes respectively.
Feedback on selected activities
Activity 4.3
# Modified code below
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.3)
reg = make_pipeline(StandardScaler(), SGDRegressor())
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.7266726292962548
Activity 4.5
The diabetes dataset is explored and linear regression is applied as
follows. This code loads and displays the information of the dataset:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(dir(diabetes))
print(diabetes.data.shape, diabetes.target.shape)
print(diabetes.feature_names)
print(diabetes.DESCR)
Output:
['DESCR', 'data', 'data_filename', 'feature_names', 'frame', 'target',
'target_filename']
(442, 10) (442,)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one
year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times 'n_samples' (i.e. the sum of squares of each
column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
This code creates a DataFrame from the dataset, and displays some rows
and the data’s statistics:
import pandas as pd
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target
print(diabetes_df)
print(diabetes_df.describe())
Output:
age sex bmi bp s1 s2 s3 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
.. ... ... ... ... ... ... ...
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816
s4 s5 s6 target
0 -0.002592 0.019908 -0.017646 151.0
1 -0.039493 -0.068330 -0.092204 75.0
2 -0.002592 0.002864 -0.025930 141.0
3 0.034309 0.022692 -0.009362 206.0
4 -0.002592 -0.031991 -0.046641 135.0
.. ... ... ... ...
437 -0.002592 0.031193 0.007207 178.0
438 0.034309 -0.018118 0.044485 104.0
439 -0.011080 -0.046879 0.015491 132.0
440 0.026560 0.044528 -0.025930 220.0
441 -0.039493 -0.004220 0.003064 57.0
[442 rows x 11 columns]
age sex bmi bp s1
\
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean -3.639623e-16 1.309912e-16 -8.013951e-16 1.289818e-16 -9.042540e-17
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02
min -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123996e-01 -1.267807e-01
25% -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665645e-02 -3.424784e-02
50% 5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670611e-03 -4.320866e-03
75% 3.807591e-02 5.068012e-02 3.124802e-02 3.564384e-02 2.835801e-02
max 1.107267e-01 5.068012e-02 1.705552e-01 1.320442e-01 1.539137e-01
s2 s3 s4 s5 s6
\
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean 1.301121e-16 -4.563971e-16 3.863174e-16 -3.848103e-16 -3.398488e-16
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02
min -1.156131e-01 -1.023071e-01 -7.639450e-02 -1.260974e-01 -1.377672e-01
25% -3.035840e-02 -3.511716e-02 -3.949338e-02 -3.324879e-02 -3.317903e-02
50% -3.819065e-03 -6.584468e-03 -2.592262e-03 -1.947634e-03 -1.077698e-03
75% 2.984439e-02 2.931150e-02 3.430886e-02 3.243323e-02 2.791705e-02
max 1.987880e-01 1.811791e-01 1.852344e-01 1.335990e-01 1.356118e-01
target
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
This code shows the correlation between the features and label:
print(diabetes_df.corr())
Output:
age sex bmi bp s1 s2 s3
\
age 1.000000 0.173737 0.185085 0.335427 0.260061 0.219243 -0.075181
sex 0.173737 1.000000 0.088161 0.241013 0.035277 0.142637 -0.379090
bmi 0.185085 0.088161 1.000000 0.395415 0.249777 0.261170 -0.366811
bp 0.335427 0.241013 0.395415 1.000000 0.242470 0.185558 -0.178761
s1 0.260061 0.035277 0.249777 0.242470 1.000000 0.896663 0.051519
s2 0.219243 0.142637 0.261170 0.185558 0.896663 1.000000 -0.196455
s3 -0.075181 -0.379090 -0.366811 -0.178761 0.051519 -0.196455 1.000000
s4 0.203841 0.332115 0.413807 0.257653 0.542207 0.659817 -0.738493
s5 0.270777 0.149918 0.446159 0.393478 0.515501 0.318353 -0.398577
s6 0.301731 0.208133 0.388680 0.390429 0.325717 0.290600 -0.273697
target 0.187889 0.043062 0.586450 0.441484 0.212022 0.174054 -0.394789
s4 s5 s6 target
age 0.203841 0.270777 0.301731 0.187889
sex 0.332115 0.149918 0.208133 0.043062
bmi 0.413807 0.446159 0.388680 0.586450
bp 0.257653 0.393478 0.390429 0.441484
s1 0.542207 0.515501 0.325717 0.212022
s2 0.659817 0.318353 0.290600 0.174054
s3 -0.738493 -0.398577 -0.273697 -0.394789
s4 1.000000 0.617857 0.417212 0.430453
s5 0.617857 1.000000 0.464670 0.565883
s6 0.417212 0.464670 1.000000 0.382483
target 0.430453 0.565883 0.382483 1.000000
This code applies linear regression to the dataset using the
LinearRegression class:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.4803567114831401
This code applies linear regression to the dataset using the SGDRegressor
class:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.3)
reg = make_pipeline(StandardScaler(), SGDRegressor())
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.45500477546442186
Activity 4.6
The modified code is shown below. It works as before: the score is
equal to or close to 1, and the predicted labels for marks 40 and 51
are 10 (not passing) and 1 (passing) respectively. However, the order
of the values in the predict_proba() method’s output has changed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
marks = np.random.randint(0, 100, 1000)
passes = np.where(marks < 50, 10, 1) # Modified this line
marks_2d = marks.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    marks_2d, passes, test_size=0.3)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.predict([[40], [51]]))
print(clf.predict_proba([[40], [51]]))
Output:
1.0
[10 1]
[[1.37026412e-09 9.99999999e-01]
[9.58676300e-01 4.13237003e-02]]
The order of the probabilities can be obtained by the classes_ attribute
of LogisticRegression. Here, the result [1 10] indicates that for each
row containing two probabilities, the first is the probability that a mark
is passing (denoted by 1), and the second is the probability that the mark
is not passing (denoted by 10).
print(clf.classes_)
Output:
[ 1 10]
Activity 4.7
# Modified code below
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), SGDClassifier(loss="log"))
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.8444444444444444
Activity 4.8
The digits dataset is explored and logistic regression is applied as
follows. This code loads and displays the information of the dataset:
from sklearn.datasets import load_digits
digits = load_digits()
print(dir(digits))
print(digits.data.shape, digits.target.shape)
print(digits.feature_names)
print(digits.DESCR)
Output:
['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_
names']
(1797, 64) (1797,)
['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4',
'pixel_0_5', 'pixel_0_6', 'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2',
'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7', 'pixel_2_0',
'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6',
'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4',
'pixel_3_5', 'pixel_3_6', 'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2',
'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7', 'pixel_5_0',
'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6',
'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2', 'pixel_6_3', 'pixel_6_4',
'pixel_6_5', 'pixel_6_6', 'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2',
'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6', 'pixel_7_7']
.. _digits_dataset:
Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Di
gits
The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.
.. topic:: References
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
This code creates a DataFrame from the dataset, and displays some rows
and the data’s statistics:
import pandas as pd
digits_df = pd.DataFrame(digits.data, columns=digits.feature_names)
digits_df['target'] = digits.target
print(digits_df)
print(digits_df.describe())
Output:
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 pixel_0_5 \
0 0.0 0.0 5.0 13.0 9.0 1.0
1 0.0 0.0 0.0 12.0 13.0 5.0
2 0.0 0.0 0.0 4.0 15.0 12.0
3 0.0 0.0 7.0 15.0 13.0 1.0
4 0.0 0.0 0.0 1.0 11.0 0.0
... ... ... ... ... ... ...
1792 0.0 0.0 4.0 10.0 13.0 6.0
1793 0.0 0.0 6.0 16.0 13.0 11.0
1794 0.0 0.0 1.0 11.0 15.0 1.0
1795 0.0 0.0 2.0 10.0 7.0 0.0
1796 0.0 0.0 10.0 14.0 8.0 1.0
pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ... pixel_6_7 pixel_7_0
\
0 0.0 0.0 0.0 0.0 ... 0.0 0.0
1 0.0 0.0 0.0 0.0 ... 0.0 0.0
2 0.0 0.0 0.0 0.0 ... 0.0 0.0
3 0.0 0.0 0.0 8.0 ... 0.0 0.0
4 0.0 0.0 0.0 0.0 ... 0.0 0.0
... ... ... ... ... ... ... ...
1792 0.0 0.0 0.0 1.0 ... 0.0 0.0
1793 1.0 0.0 0.0 0.0 ... 0.0 0.0
1794 0.0 0.0 0.0 0.0 ... 0.0 0.0
1795 0.0 0.0 0.0 0.0 ... 0.0 0.0
1796 0.0 0.0 0.0 2.0 ... 0.0 0.0
pixel_7_1 pixel_7_2 pixel_7_3 pixel_7_4 pixel_7_5 pixel_7_6 \
0 0.0 6.0 13.0 10.0 0.0 0.0
1 0.0 0.0 11.0 16.0 10.0 0.0
2 0.0 0.0 3.0 11.0 16.0 9.0
3 0.0 7.0 13.0 13.0 9.0 0.0
4 0.0 0.0 2.0 16.0 4.0 0.0
... ... ... ... ... ... ...
1792 0.0 2.0 14.0 15.0 9.0 0.0
1793 0.0 6.0 16.0 14.0 6.0 0.0
1794 0.0 2.0 9.0 13.0 6.0 0.0
1795 0.0 5.0 12.0 16.0 12.0 0.0
1796 1.0 8.0 12.0 14.0 12.0 1.0
pixel_7_7 target
0 0.0 0
1 0.0 1
2 0.0 2
3 0.0 3
4 0.0 4
... ... ...
1792 0.0 9
1793 0.0 0
1794 0.0 8
1795 0.0 9
1796 0.0 8
[1797 rows x 65 columns]
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 \
count 1797.0 1797.000000 1797.000000 1797.000000 1797.000000
mean 0.0 0.303840 5.204786 11.835838 11.848080
std 0.0 0.907192 4.754826 4.248842 4.287388
min 0.0 0.000000 0.000000 0.000000 0.000000
25% 0.0 0.000000 1.000000 10.000000 10.000000
50% 0.0 0.000000 4.000000 13.000000 13.000000
75% 0.0 0.000000 9.000000 15.000000 15.000000
max 0.0 8.000000 16.000000 16.000000 16.000000
pixel_0_5 pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ...
\
count 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000 ...
mean 5.781859 1.362270 0.129661 0.005565 1.993879 ...
std 5.666418 3.325775 1.037383 0.094222 3.196160 ...
min 0.000000 0.000000 0.000000 0.000000 0.000000 ...
25% 0.000000 0.000000 0.000000 0.000000 0.000000 ...
50% 4.000000 0.000000 0.000000 0.000000 0.000000 ...
75% 11.000000 0.000000 0.000000 0.000000 3.000000 ...
max 16.000000 16.000000 15.000000 2.000000 16.000000 ...
pixel_6_7 pixel_7_0 pixel_7_1 pixel_7_2 pixel_7_3 \
count 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000
mean 0.206455 0.000556 0.279354 5.557596 12.089037
std 0.984401 0.023590 0.934302 5.103019 4.374694
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 1.000000 11.000000
50% 0.000000 0.000000 0.000000 4.000000 13.000000
75% 0.000000 0.000000 0.000000 10.000000 16.000000
max 13.000000 1.000000 9.000000 16.000000 16.000000
pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7 target
count 1797.000000 1797.000000 1797.000000 1797.000000 1797.000000
mean 11.809126 6.764051 2.067891 0.364496 4.490818
std 4.933947 5.900623 4.090548 1.860122 2.865304
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 10.000000 0.000000 0.000000 0.000000 2.000000
50% 14.000000 6.000000 0.000000 0.000000 4.000000
75% 16.000000 12.000000 2.000000 0.000000 7.000000
max 16.000000 16.000000 16.000000 16.000000 9.000000
[8 rows x 65 columns]
This code applies logistic regression to the dataset using the
LogisticRegression class:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9703703703703703
This code applies logistic regression to the dataset using the
SGDClassifier class:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)
clf = make_pipeline(StandardScaler(), SGDClassifier(loss="log"))
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
Output:
0.9388888888888889
Activity 4.9
# Modified code below
from statistics import mean, stdev
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

iris = load_iris()
ridge_scores = []
logistic_scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3)
    clf = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        RidgeClassifier())
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    ridge_scores.append(score)
    clf = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                        LogisticRegression())
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    logistic_scores.append(score)
print("RidgeClassifier:", mean(ridge_scores), stdev(ridge_scores))
print("LogisticRegression:", mean(logistic_scores), stdev(logistic_scores))
Output:
RidgeClassifier: 0.9606666666666667 0.028923401641042987
LogisticRegression: 0.9642222222222222 0.027516390834540265
Activity 4.10
# Modified code below
from statistics import mean, stdev
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline

boston = load_boston()
scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(
        boston.data, boston.target, test_size=0.3)
    sgd = SGDRegressor(penalty="elasticnet", alpha=0.01, l1_ratio=0.5)
    reg = make_pipeline(PolynomialFeatures(2), StandardScaler(), sgd)
    reg.fit(X_train, y_train)
    score = reg.score(X_test, y_test)
    scores.append(score)
print("SGDRegressor:", mean(scores), stdev(scores))
Output:
SGDRegressor: 0.8196165068268226 0.039648981566582024
Activity 4.11
# Modified code below
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
reg = make_pipeline(StandardScaler(), SGDRegressor())
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Output:
[0.7729814 0.58844009 0.7019026 0.71430874 0.75569577]
0.706665721070719
Activity 4.12
# Modified code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
reg = make_pipeline(StandardScaler(), SGDClassifier(loss="log"))
scores = cross_val_score(reg, X, y, cv=5)
print(scores, scores.mean())
Output:
[0.9 0.93333333 0.93333333 0.96666667 0.93333333]
0.9333333333333333
Activity 4.13
# Modified code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
reg = LogisticRegression(max_iter=200)
scores = cross_val_score(reg, X, y, cv=LeaveOneOut())
print(scores, scores.mean())
Output:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.] 0.9666666666666667
Activity 4.14
# Modified code below
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score, mean_squared_error, \
mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
reg = Lasso()
reg.fit(X_train, y_train)
y_predicted = reg.predict(X_test)
print("r2 score:", r2_score(y_test, y_predicted))
print("mean squared error:", mean_squared_error(y_test, y_predicted))
print("mean absolute error:", mean_absolute_error(y_test, y_predicted))
print("median absolute error:", median_absolute_error(y_test, y_predicted))
print("score():", reg.score(X_test, y_test))
Output:
r2 score: 0.5983847662889151
mean squared error: 32.89853697839766
mean absolute error: 4.184524192870988
median absolute error: 3.222446278463348
score(): 0.5983847662889151
Activity 4.15
We have TP = FP = 0, TN = 95, FN = 5.
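As a quick check, the sketch below recomputes these counts with scikit-learn's confusion_matrix function, assuming the spam scenario used in Activity 4.16 (95 not-spam and 5 spam messages, all predicted as not-spam):
from sklearn.metrics import confusion_matrix
NOT_SPAM, SPAM = 0, 1
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 100 * [NOT_SPAM]
# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)  # TP: 0 FP: 0 TN: 95 FN: 5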
Activity 4.16
You may get a warning in the call to the precision_score() function
when there is a division by zero. Setting the argument zero_division=0
suppresses the warning.
from sklearn.metrics import precision_score, recall_score, f1_score
NOT_SPAM, SPAM = 0, 1
# Modified code below
actual = 95 * [NOT_SPAM] + 5 * [SPAM]
predicted = 100 * [NOT_SPAM]
print("precision:", precision_score(actual, predicted, zero_division=0))
print("recall:", recall_score(actual, predicted))
print("f1:", f1_score(actual, predicted))
Output:
precision: 0.0
recall: 0.0
f1: 0.0
Activity 4.17
# Modified code below
from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import precision_score, recall_score, \
f1_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = RidgeClassifier()
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("precision:", precision_score(y_test, y_predicted, average="weighted"))
print("recall:", recall_score(y_test, y_predicted, average="weighted"))
print("f1 score:", f1_score(y_test, y_predicted, average="weighted"))
print("accuracy:", accuracy_score(y_test, y_predicted))
print("score():", clf.score(X_test, y_test))
print("classification report:\n", classification_report(y_test, y_predicted))
Output:
precision: 0.8407407407407408
recall: 0.8222222222222222
f1 score: 0.8202233980011758
accuracy: 0.8222222222222222
score(): 0.8222222222222222
classification report:
precision recall f1-score support
0 1.00 0.93 0.96 14
1 0.83 0.62 0.71 16
2 0.70 0.93 0.80 15
accuracy 0.82 45
macro avg 0.84 0.83 0.83 45
weighted avg 0.84 0.82 0.82 45
Activity 4.18
# Modified code below
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = DummyClassifier(strategy="uniform")
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
print("roc_auc:", roc_auc_score(y_test, y_proba, multi_,
average="weighted"))
Output:
roc_auc: 0.5
Activity 4.19
# Written code below
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3)
clf = LogisticRegressionCV(max_iter=500)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score, clf.Cs_, clf.l1_ratio_)
Output:
0.9333333333333333 [1.00000000e-04 7.74263683e-04 5.99484250e-03 4.64158883e-02
 3.59381366e-01 2.78255940e+00 2.15443469e+01 1.66810054e+02
 1.29154967e+03 1.00000000e+04] [None None None]
Activity 4.20
# Modified code below
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
param_grid = {"alpha": (0.0001, 0.001, 0.01, 0.1, 1, 10, 100),
"l1_ratio": np.linspace(0, 1, 11)}
clf = SGDClassifier(penalty="elasticnet")
grid = GridSearchCV(clf, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Output:
SGDClassifier(alpha=0.001, l1_ratio=0.9, penalty='elasticnet')
0.9733333333333334
{'alpha': 0.001, 'l1_ratio': 0.9}
Activity 4.21
The distances and ranks are obtained in the same way as discussed for
using kNN in classification. For 2-NN, the two nearest neighbours are
#3 and #4, and the mean of their responses is the predicted score for
the query point (5, 5), i.e. (5.1 + 6.5)/2 = 5.8, as the sketch after
the table below confirms.
Trial      Amount of    Amount of carbon    Response    Distance     Rank by minimum
product    sugar (x1)   dioxide (x2)        (y)         to (5, 5)    distance
#1         1            4                   3.5         4.12         4
#2         5            1                   4.6         4.00         3
#3         4            7                   5.1         2.24         1
#4         8            5                   6.5         3.00         2
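The sketch below is a minimal check of this prediction, fitting scikit-learn's KNeighborsRegressor with k = 2 on the four trial products; the variable names are illustrative:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
X = np.array([[1, 4], [5, 1], [4, 7], [8, 5]])  # (sugar, carbon dioxide)
y = np.array([3.5, 4.6, 5.1, 6.5])              # responses
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
print(knn.predict([[5, 5]]))  # [5.8], the mean of the responses of #3 and #4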
Activity 4.24
# Written code below
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
knn = KNeighborsRegressor()
scores = cross_val_score(knn, X, y)
print("kNN:", scores.mean())
knn = KNeighborsRegressor()
pipe = make_pipeline(StandardScaler(), knn)
scores = cross_val_score(pipe, X, y)
print("kNN scaled:", scores.mean())
Output:
kNN: 0.5049946964577553
kNN scaled: 0.7518412547289474
Activity 4.25
The scores for kNN classification without and with feature
standardization are comparable. This is because the features of the iris
dataset have similar scales.
# Written code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
knn = KNeighborsClassifier()
scores = cross_val_score(knn, X, y)
print("kNN:", scores.mean())
knn = KNeighborsClassifier()
pipe = make_pipeline(StandardScaler(), knn)
scores = cross_val_score(pipe, X, y)
print("kNN scaled:", scores.mean())
Output:
kNN: 0.9733333333333334
kNN scaled: 0.9666666666666666
Activity 4.26
# Modified code below
import math
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
knn = KNeighborsClassifier()
pipe = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
max_k = math.ceil(math.sqrt(len(X)))
param_grid = {"knn__n_neighbors": range(1, max_k+1)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
import matplotlib.pyplot as plt
ns_neighbors = grid.cv_results_["param_knn__n_neighbors"]
mean_scores = grid.cv_results_["mean_test_score"]
plt.plot(ns_neighbors, mean_scores, "o-")
Output:
Pipeline(steps=[('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=12))])
0.9733333333333334
{'knn__n_neighbors': 12}
[Plot of mean cross-validation score against the number of neighbours]
Activity 4.27
To determine whether players will play if the weather is cloudy:
• P(Play|Cloudy) = P(Cloudy|Play) × P(Play) / P(Cloudy)
= 4/9 × 9/14 / 4/14 = 1.00
• P(Not play|Cloudy) = P(Cloudy|Not play) × P(Not play) / P(Cloudy)
= 0/5 × 5/14 / 4/14 = 0.00
Since the posterior probability P(Play|Cloudy) is higher, the class ‘Play’
is predicted, i.e. we predict that players will play tennis if the weather is
cloudy.
To determine whether players will play if the weather is rainy:
• P(Play|Rainy) = P(Rainy|Play) × P(Play) / P(Rainy)
= 2/9 × 9/14 / 5/14 = 0.40
• P(Not play|Rainy) = P(Rainy|Not play) × P(Not play) / P(Rainy) =
3/5 × 5/14 / 5/14 = 0.60
Since the posterior probability P(Not play|Rainy) is higher, the class
‘Not play’ is predicted, i.e. we predict that players will not play tennis if
the weather is rainy.
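These posterior calculations can be reproduced with a few lines of Python. The helper function below is a minimal sketch of Bayes' theorem, using the counts from the play-tennis table in the activity:
def posterior(p_evidence_given_class, p_class, p_evidence):
    # Bayes' theorem: P(class|evidence) = P(evidence|class) * P(class) / P(evidence)
    return p_evidence_given_class * p_class / p_evidence
print(posterior(4/9, 9/14, 4/14))  # P(Play|Cloudy) = 1.00
print(posterior(0/5, 5/14, 4/14))  # P(Not play|Cloudy) = 0.00
print(posterior(2/9, 9/14, 5/14))  # P(Play|Rainy) = 0.40
print(posterior(3/5, 5/14, 5/14))  # P(Not play|Rainy) = 0.60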
Activity 4.28
# Written code below
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
nb = GaussianNB()
scores = cross_val_score(nb, X, y, cv=5)
print(scores, scores.mean())
Output:
[0.96666667 0.96666667 0.93333333 0.96666667 0.96666667] 0.96
Activity 4.30
Consider the five Weak-wind examples in iteration #2.
For splitting on the weather attribute, Gain(Weather|Weak) = 0.971.
There are two Sunny examples, one Cloudy example, and two Rainy
examples. The information gain ratio is:
Gain ratio(Weather|Weak) = 0.971 / (−2/5 log(2/5) − 1/5 log(1/5) −
2/5 log(2/5)) = 0.638
For splitting on the humidity attribute, Gain(Humidity|Weak) = 0.020.
There are three High-humidity examples and two Low-humidity
examples. The information gain ratio is:
Gain ratio(Humidity|Weak) = 0.020 / (−3/5 log(3/5) − 2/5 log(2/5)) =
0.021
For splitting on the date attribute, each date value matches exactly
one example, so every subset is pure. The information gain and
information gain ratio are:
Gain(Date|Weak) = 0.971 − (−1/1 log(1/1)) × 1/5 × 5 = 0.971
Gain ratio(Date|Weak) = 0.971 / (−1/5 log(1/5) × 5) = 0.418
The information gain ratio on the weather attribute, 0.638, is the largest.
Therefore, the weather attribute is selected for the new node.
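The arithmetic above can be reproduced with a short script. The sketch below is an illustrative check, assuming base-2 logarithms as in the unit's entropy formula:
from math import log2
def split_info(proportions):
    # Entropy of the split proportions, e.g. [2/5, 1/5, 2/5] for Weather
    return -sum(p * log2(p) for p in proportions if p > 0)
print(0.971 / split_info([2/5, 1/5, 2/5]))  # Gain ratio(Weather|Weak), ~0.638
print(0.020 / split_info([3/5, 2/5]))       # Gain ratio(Humidity|Weak), ~0.021
print(0.971 / split_info([1/5] * 5))        # Gain ratio(Date|Weak), ~0.418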
Activity 4.31
# Written code below
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print("Score:", score)
print("Depth:", reg.get_depth())
print("Number of leaves:", reg.get_n_leaves())
plt.figure(figsize=(20, 10))
tree.plot_tree(reg, filled=True, fontsize=10, max_depth=4)
plt.show()
Output:
Score: 0.8587068549538901
Depth: 16
Number of leaves: 336
Activity 4.32
The code is shown below. Here are some comments on the ZeroR
and OneR results. The iris dataset has three classes, corresponding to
three kinds of iris flowers, in about the same proportions (i.e. about
33%). Using the most frequent class (or just any class), ZeroR predicts
correctly for 33% of all examples. OneR uses the best feature to make
predictions, and in the iris dataset the best feature happens to separate
out one class from the other two classes. So, examples of the former one
class are correctly predicted (about 33%); examples of the latter two
classes are predicted as one of the two, and half of them are predicted
correctly (about 66%/2). Altogether, OneR predicts correctly for 66% (=
33% + 66%/2) of all examples.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
iris = load_iris()
X, y = shuffle(iris.data, iris.target)
def classify_cv(name, clf):
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}:", scores.mean())
classify_cv("ZeroR", DummyClassifier(strategy="most_frequent"))
classify_cv("OneR", DecisionTreeClassifier(max_depth=1))
classify_cv("Decision tree", DecisionTreeClassifier())
Output:
ZeroR: 0.3333333333333333
OneR: 0.6666666666666666
Decision tree: 0.9533333333333335
Activity 4.33
The code for using bagging with kNN base models is shown below.
There is no significant trend of improvement; it appears that a single
kNN model is good enough for classifying the digits dataset.
# Modified code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
digits = load_digits()
X, y = shuffle(digits.data, digits.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
    bagging = BaggingClassifier(base_estimator=knn, n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output:
[Plot of mean cross-validation score against the number of estimators]
Activity 4.34
The code for using bagging with kNN models is shown below. There is no
observable trend of improvement.
# Modified code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    knn = make_pipeline(StandardScaler(), KNeighborsRegressor())
    bagging = BaggingRegressor(base_estimator=knn, n_estimators=n)
    scores = cross_val_score(bagging, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output:
[Plot of mean cross-validation score against the number of estimators]
Activity 4.35
# Modified code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    forest = RandomForestRegressor(n_estimators=n)
    scores = cross_val_score(forest, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output:
[Plot of mean cross-validation score against the number of estimators]
Activity 4.36
# Modified code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    tree = DecisionTreeRegressor(max_depth=5)
    boost = AdaBoostRegressor(base_estimator=tree, n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output:
[Plot of mean cross-validation score against the number of estimators]
Activity 4.37
# Modified code below
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
ns_estimators = range(5, 101, 5)
mean_scores = []
for n in ns_estimators:
    boost = GradientBoostingRegressor(n_estimators=n)
    scores = cross_val_score(boost, X, y, n_jobs=-1)
    mean_scores.append(scores.mean())
plt.plot(ns_estimators, mean_scores)
Output:
[Plot of mean cross-validation score against the number of estimators]
Activity 4.39
The rbf kernel performs better (i.e. has a higher score) than the poly
kernel on the generated dataset.
# Modified code below
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import shuffle
X = np.random.randint(0, 100, size=(1000, 2))
random_values = np.random.normal(scale=50000, size=1000)
y = X[:, 0] ** 3 + 2 * X[:, 1] ** 3 + random_values > 100000
X, y = shuffle(X, y)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": np.arange(0.1, 2, 0.1)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display(grid.best_estimator_, grid.best_score_, grid.best_params_)
Output:
Pipeline(steps=[('scaler', StandardScaler()),
('svc', SVC(C=0.7000000000000001))])
0.954
{'svc__C': 0.7000000000000001}
Activity 4.40
With the tuned parameters, the poly kernel achieves a score of about
0.76, and the rbf kernel achieves a score of about 0.87, both better
than the SVM regressors with default parameters.
# Written code below
from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="poly"))])
param_grid = {"svr__C": range(5, 101, 5),
"svr__degree": range(1, 5)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display("Poly kernel:", grid.best_estimator_, grid.best_score_, grid.best_
params_)
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])
param_grid = {"svr__C": range(5, 101, 5)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X, y)
display("RBF kernel:", grid.best_estimator_, grid.best_score_, grid.best_
params_)
Output:
'Poly kernel:'
Pipeline(steps=[('scaler', StandardScaler()),
('svr', SVR(C=25, kernel='poly'))])
0.7576128658436414
{'svr__C': 25, 'svr__degree': 3}
'RBF kernel:'
Pipeline(steps=[('scaler', StandardScaler()), ('svr', SVR(C=60))])
0.8675949434698988
{'svr__C': 60}
Activity 4.43
# Modified code below
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
boston.data, boston.target, test_size=0.3)
reg = make_pipeline(StandardScaler(),
MLPRegressor(max_iter=1000, learning_rate_init=0.01))
reg.fit(X_train, y_train)
score = reg.score(X_test, y_test)
print(score)
Output:
0.8081868048806456
# Modified code below
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.utils import shuffle
boston = load_boston()
X, y = shuffle(boston.data, boston.target)
sizes = ((), (1,), (3,), (10,), (20,), (50,), (100,), (1000,),
         (100, 100), (100, 100, 100))
for size in sizes:
    reg = MLPRegressor(hidden_layer_sizes=size, max_iter=1000,
                       learning_rate_init=0.01)
    pipe = make_pipeline(StandardScaler(), reg)
    scores = cross_val_score(pipe, X, y, n_jobs=-1)
    print(f"Size: {size}, Score: {scores.mean()}")
Output:
Size: (), Score: 0.6264082344026662
Size: (1,), Score: 0.6778840090886737
Size: (3,), Score: 0.7504638377852713
Size: (10,), Score: 0.8185047691804381
Size: (20,), Score: 0.8258485881307186
Size: (50,), Score: 0.8224408101413443
Size: (100,), Score: 0.8336989786506885
Size: (1000,), Score: 0.831011108429965
Size: (100, 100), Score: 0.8111656661296692
Size: (100, 100, 100), Score: 0.7527113733138626
Appendix 4.1: Multivariate
calculus and gradient descent
This appendix provides an overview of multivariate calculus and finding
gradients in the gradient descent algorithm.
Multivariate calculus
Multivariate calculus refers to applying calculus to functions that
involve multiple variables. In machine learning, it is common to have
data of multiple dimensions, or functions of multiple variables.
For a function with multiple variables, its partial derivative is the
derivative with respect to the variable of interest with all other variables
held ‘fixed’ during the differentiation. In other words, the other variables
are treated as constants (numbers). For example, consider
f(x, y) = x^2 y^3.

The partial derivative of f with respect to x is:

\frac{\partial f}{\partial x} = 2xy^3

The partial derivative of f with respect to y is:

\frac{\partial f}{\partial y} = 3x^2 y^2
Some rules of (partial) derivatives are listed below. Here, c and n are
constants, i.e. independent of x; f and ɡ are functions of x.
• Constant multiplication rule: \frac{\partial}{\partial x}(cf) = c \frac{\partial f}{\partial x}

• Power rule: \frac{\partial}{\partial x} x^n = n x^{n-1}

• Sum rule: \frac{\partial}{\partial x}(f + g) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}

• Chain rule: \frac{\partial f}{\partial x} = \frac{df}{dy} \cdot \frac{\partial y}{\partial x} (where f is a function of y and y is a
function of x)
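These derivatives can be checked mechanically with a computer algebra system. The short sketch below uses the SymPy library (an assumption; it is not used elsewhere in this unit) to verify the two partial derivatives of f(x, y) = x^2 y^3:
import sympy as sp
x, y = sp.symbols("x y")
f = x ** 2 * y ** 3
print(sp.diff(f, x))  # 2*x*y**3, the partial derivative with respect to x
print(sp.diff(f, y))  # 3*x**2*y**2, the partial derivative with respect to y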
Gradient descent
The predicted result, y', is given by:

y' = mx + b

The loss function of linear regression, MSE, is:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y'_i - y_i)^2 = \frac{1}{N} \sum_{i=1}^{N} (mx_i + b - y_i)^2
The gradient descent algorithm minimizes the MSE using the rates of
change of the parameters m and b. The rates of change, or gradients,
can be found as the partial derivatives of MSE = f(m, b).

The gradient of m is:

\frac{\partial \text{MSE}}{\partial m}
= \frac{\partial}{\partial m} \frac{1}{N} \sum_{i=1}^{N} (mx_i + b - y_i)^2
= \frac{1}{N} \frac{\partial}{\partial m} \sum_{i=1}^{N} (mx_i + b - y_i)^2 \quad \text{(constant multiplication rule)}
= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial m} (mx_i + b - y_i)^2 \quad \text{(sum rule)}
= \frac{1}{N} \sum_{i=1}^{N} 2(mx_i + b - y_i) \frac{\partial}{\partial m} (mx_i + b - y_i) \quad \text{(chain rule)}
= \frac{2}{N} \sum_{i=1}^{N} (mx_i + b - y_i) x_i \quad \text{(power rule)}
The last line is implemented in Python code as follows:
m_gradient = 2 * np.dot(x, predicted - y) / len(x)
new_m = m - learning_rate * m_gradient
The gradient of b is:

\frac{\partial \text{MSE}}{\partial b}
= \frac{\partial}{\partial b} \frac{1}{N} \sum_{i=1}^{N} (mx_i + b - y_i)^2
= \frac{1}{N} \frac{\partial}{\partial b} \sum_{i=1}^{N} (mx_i + b - y_i)^2 \quad \text{(constant multiplication rule)}
= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial b} (mx_i + b - y_i)^2 \quad \text{(sum rule)}
= \frac{1}{N} \sum_{i=1}^{N} 2(mx_i + b - y_i) \frac{\partial}{\partial b} (mx_i + b - y_i) \quad \text{(chain rule)}
= \frac{2}{N} \sum_{i=1}^{N} (mx_i + b - y_i) \quad \text{(power rule)}
The last line is implemented in Python code as follows:
b_gradient = 2 * (predicted - y).sum() / len(x)
new_b = b - learning_rate * b_gradient
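Putting the two update rules together gives the complete gradient descent loop. The following is a minimal sketch of the full procedure, not the unit's earlier implementation; the sample data and learning rate are illustrative assumptions:
import numpy as np
# Sample data generated by y = 2x + 1, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m, b = 0.0, 0.0
learning_rate = 0.01
for _ in range(5000):
    predicted = m * x + b
    m_gradient = 2 * np.dot(x, predicted - y) / len(x)
    b_gradient = 2 * (predicted - y).sum() / len(x)
    m = m - learning_rate * m_gradient
    b = b - learning_rate * b_gradient
print(m, b)  # approaches m = 2 and b = 1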
