Regression (2)
COMP9417 Machine Learning and Data Mining
Term 2, 2023
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)
Optimization by gradient descent
Optimization
Basic problem¹:
    minimize_x  f(x)   subject to  x ∈ X
• The problem is to find a design point x, a p-dimensional vector of values for the p design variables.
• These values are manipulated by an optimization algorithm to find a
minimizer, a solution x∗ that minimizes the objective function f .
• A solution must be an element of the feasible set X; when additional constraints are imposed on x, the problem is called constrained optimization.
• Optimization is widely studied and applied in many fields such as
engineering, science, economics, . . .
¹ See: Kochenderfer and Wheeler (2019).
[Figure: a one-dimensional objective function f(x) plotted over the design points x, with the minimizer x∗ marked.]
Optimization
The formulation is general, for constrained or unconstrained optimization,
since we can always replace a problem where we want to maximize the
objective function f :
    maximize_x  f(x)   subject to  x ∈ X
by an equivalent minimization expression:
    minimize_x  −f(x)   subject to  x ∈ X
Optimization
Usually, we would like an optimization algorithm to quickly reach an
answer that is close to being the right one.
There are many possible approaches to optimization. In machine learning, methods based on finding the derivative, or gradient, are widely used.
• typically, need to minimize a function
• e.g., error or loss
• optimization is known as gradient descent or steepest descent
• sometimes, need to maximize a function
• e.g., probability or likelihood
• optimization is known as gradient ascent or steepest ascent
Requires function to be differentiable.
[Figure: a function f(x) with local and global minima and an inflection point. At a minimum the first-order necessary condition (FONC) f′(x) = 0 holds, but it is not sufficient on its own; the second-order necessary condition (SONC) also requires f″(x) ≥ 0.]
What is an optimization algorithm?
• many kinds of optimization algorithm
• depends on objective function, constraints, data
• approaches based on derivatives or gradients are widely used
• derivatives provide the direction of the search for a minimizer
• may be obtained analytically or estimated numerically
• can also use automatic differentiation
• not all approaches require derivatives . . .
Optimization
A general iterative algorithm² to optimise some function f:
1 start with an initial point x = x0
2 select a search direction g, usually one that decreases f(x)
3 select a step length η
4 set s = ηg
5 set x = x + s
6 go to step 2, unless convergence criteria are met
For example, we could minimize a real-valued function f : Rⁿ → R.
Note: convergence criteria will be problem-specific. A small code sketch of this scheme is given below.
² See: Ripley (1996).
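To make the scheme concrete, here is a minimal Python sketch of the iterative algorithm above (not from the lecture; the objective, step length and convergence test are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Minimise a differentiable f: R^n -> R using the iterative scheme above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = -grad_f(x)                 # search direction chosen to decrease f(x)
        s = eta * g                    # step = step length times direction
        x = x + s
        if np.linalg.norm(s) < tol:    # one possible convergence criterion
            break
    return x

# Illustrative use: f(x) = (x1 - 3)^2 + (x2 + 1)^2 has its minimizer at (3, -1)
grad_f = lambda x: np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # approximately [3., -1.]
```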
Least-Squares as Loss Minimization
• consider MSE as a loss function
• this is what we want to minimize in OLS regression
• finding the least-squares solution is in effect finding the values of a and b that minimize (1/n) ∑_{i=1}^{n} (yi − ŷi)², where ŷi = a + b·xi
• we saw before that the minimum value can be obtained analytically
by the usual process of differentiating and equating to 0
• an alternative is to apply an iterative gradient descent approach
• with MSE as a loss function, given data (x1, y1), . . . , (xn, yn):
      Loss(θ) = (1/n) ∑_{i=1}^{n} (yi − fθ(xi))²
• note that Loss is a function of θ, and ŷi = fθ(xi); here θ stands for the parameters a and b
Least-Squares as Loss Minimization
• the gradient descent alternative to the analytical approach is to take (small) steps that decrease the value of the function to be minimised, stopping when we reach a minimum
• recall that at a point the gradient vector points in the direction of
greatest increase of a function. So, the opposite direction to the
gradient vector gives the direction of greatest decrease
• for univariate linear regression, suppose g_a and g_b are the components of the gradient with respect to a and b; the iterative steps are:
• b_{i+1} = b_i − η × g_b
• a_{i+1} = a_i − η × g_a
• Stop when b_{i+1} ≈ b_i and a_{i+1} ≈ a_i (a code sketch of these updates follows below)
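A minimal Python sketch of these updates for univariate linear regression (the data, learning rate η and iteration count are illustrative, not taken from the lecture):

```python
import numpy as np

def fit_line_gd(x, y, eta=0.1, n_iter=5000):
    """Fit y ≈ a + b*x by gradient descent on the MSE loss."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        y_hat = a + b * x
        g_a = (-2.0 / n) * np.sum(y - y_hat)        # d/da of (1/n) * sum((y - y_hat)^2)
        g_b = (-2.0 / n) * np.sum((y - y_hat) * x)  # d/db of the same loss
        a, b = a - eta * g_a, b - eta * g_b
    return a, b

# illustrative data generated from y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 0.1, 100)
print(fit_line_gd(x, y))   # approximately (2, 3)
```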
[Figure: gradient descent iterations on an objective function, illustrating the effect of a “large” step size on the successive updates xi.]
Multivariate linear regression
Many variables
• Often, we are interested in modelling the relationship of Y to several other variables
• In observational studies, the value of Y may be affected by the values of several variables. For example, carcinogenicity may be gender-specific. A regression model that ignores gender may find carcinogenicity to be related to some surrogate variable (height, for example)³
• Including more variables can give a narrower confidence interval on
the prediction being made
• However, more complex models are not always better
³ A surrogate variable is a variable for which we have data that replaces one that cannot be observed, or is otherwise not in the dataset.
Multivariate linear model
• We now have multiple variables to estimate a model of the form
Y = β0 + β1X1 + β2X2 + · · ·+ βpXp
• As before, this linear model is estimated from a sample by the equation ŷ = b0 + b1x1 + b2x2 + · · · + bpxp
• With many variables, we often need to ensure the regression does not
overfit the training data, so we control the bi by regularisation
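As a purely illustrative aside, the sample coefficients b0, . . . , bp can be estimated by ordinary least squares; a small numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))                       # p explanatory variables
beta_true = np.array([1.0, 2.0, -1.0, 0.5])       # [b0, b1, b2, b3] used only to simulate data
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

X_design = np.column_stack([np.ones(n), X])       # prepend a column of 1s for the intercept b0
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(b_hat)                                      # close to beta_true
```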
Regularisation
Parameter Estimation by Optimization
Regularisation is a general method to avoid overfitting by applying
additional constraints to the weight vector. A common approach is to
make sure the weights are, on average, small in magnitude: this is referred
to as shrinkage.
Recall the setting for regression in terms of loss minimization.
• Can add penalty terms to a loss function, forcing coefficients to shrink towards zero
      Y = f_{θ0,θ1,...,θp}(X1, X2, . . . , Xp) = fθ(X)
Parameter Estimation by Optimization
• MSE as a loss function, given data (x1, y1), . . . , (xn, yn):
      L(θ) = (1/n) ∑_{i=1}^{n} (yi − fθ(xi))²
and with a penalty function:
      L(θ) = (1/n) ∑_{i=1}^{n} (yi − fθ(xi))² + λ ∑_j θj²
or:
      L(θ) = (1/n) ∑_{i=1}^{n} (yi − fθ(xi))² + λ ∑_j |θj|
• Parameter estimation by optimisation will attempt to find values for
θ0, θ1, . . . , θp s.t. loss L(θ) is minimized
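The two penalised losses can be written directly in code. This small sketch (the data, λ and the linear model fθ(x) = θ0 + θ1·x are all illustrative) evaluates both versions:

```python
import numpy as np

def penalised_loss(theta, x, y, lam=0.1, penalty="ridge"):
    """MSE of a simple linear model plus an L2 (ridge) or L1 (lasso) penalty.

    Note: the intercept theta[0] is penalised here for simplicity;
    in practice the intercept is usually left unpenalised.
    """
    y_hat = theta[0] + theta[1] * x
    mse = np.mean((y - y_hat) ** 2)
    if penalty == "ridge":
        return mse + lam * np.sum(theta ** 2)
    return mse + lam * np.sum(np.abs(theta))

theta = np.array([0.5, 2.0])
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x
print(penalised_loss(theta, x, y, penalty="ridge"))
print(penalised_loss(theta, x, y, penalty="lasso"))
```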
Regularised regression
The multivariate least-squares regression problem is often written as an
optimisation problem using vector notation:
w∗ = argmin_w (y − Xw)^T (y − Xw)
where w represents the parameters as “weights”.
The regularised version of this optimisation is then as follows:
w∗ = argmin_w (y − Xw)^T (y − Xw) + λ‖w‖²
where ‖w‖² = ∑_i w_i² is the squared norm of the parameter or weight vector w, or, equivalently, the dot product w^T w; λ is a scalar determining the amount of regularisation.
Regularised regression
This regularised problem still has a closed-form solution:
ŵ = (X^T X + λI)^{−1} X^T y
where I denotes the identity matrix. Regularisation amounts to adding λ
to the diagonal of X^T X, a well-known trick to improve the numerical
stability of matrix inversion. This form of least-squares regression is known
as ridge regression.
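A direct numpy sketch of this closed form (made-up data; note that in practice the intercept column is usually excluded from the penalty):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)        # add lam to the diagonal of X^T X
    return np.linalg.solve(A, X.T @ y)   # solve A w = X^T y rather than inverting A

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, lam=1.0))          # close to w_true, shrunk slightly towards zero
```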
Regularised regression
The alternative “absolute value” form of regularised regression is provided by the LASSO, which stands for ‘Least Absolute Shrinkage and Selection Operator’⁴.
It replaces the ridge regularisation term ∑_i w_i² with the sum of absolute weights ∑_i |w_i|. The result is that some weights are shrunk, but others are set to 0, and so the lasso form of regularised regression favours sparse solutions.
Unlike ridge regression, the lasso form of regularised regression has to be
solved by an iterative optimization algorithm.
⁴ (Tibshirani, 1996).
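For illustration, scikit-learn's Lasso (which uses an iterative coordinate-descent solver) can be applied to made-up data; its regularisation strength alpha plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [2.0, -1.5]              # only two of the ten variables actually matter
y = X @ w_true + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1)                  # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)                        # most irrelevant coefficients are exactly 0 (sparse)
```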
Bias-Variance Decomposition
The Bias-Variance Tradeoff
• we can view the issue of error using a probabilistic model
Y = f(X) + ϵ, ϵ ∼ N(0, σ²)
• briefly, this model assumes the data we observe were generated according to the target function f, with Gaussian (normally-distributed) noise with zero mean and variance σ² added to each data point
• this assumption leads to several useful properties
• for regression, we assume that f is the same as before
The Bias-Variance Tradeoff
• When comparing unbiased estimators, we would like to select the one
with minimum variance
• In general, we would be comparing estimators that have some bias
and some variance
• We can combine the bias and variance of an estimator by obtaining
the mean square error of the estimator, or MSE. This is the average
value of squared deviations of an estimated value V from the true
value of the parameter θ. That is:
MSE = Avg. value of (V − θ)²
• Now, it can be shown that:
MSE = (variance) + (bias)²
• If, as sample size increases, the bias and the variance of an estimator
approaches 0, then the estimator is said to be consistent.
The Bias-Variance Tradeoff
• Since
MSE = (variance) + (bias)²
the lowest possible value of MSE is 0
• In general, we may not be able to get to the ideal MSE of 0.
Sampling theory tells us the minimum value of the variance of an
estimator. This value is known as the Cramer-Rao bound. So, given
an estimator with bias b, we can calculate the minimum value of the
variance of the estimator using the CR bound (say, vmin). Then:
MSE ≥ vmin + b²
The value of vmin depends on whether the estimator is biased or unbiased (that is, b = 0 or b ≠ 0)
• It is not always the case that vmin for an unbiased (b = 0) estimator is less than vmin for a biased estimator. So, the MSE of a biased estimator can end up being lower than the MSE of an unbiased estimator.
Decomposition of MSE
Suppose f(x) is the value of the (unknown) target function for input x,
and yˆ = g(x) is the prediction of the learned regression model.
Imagine evaluating predictions yˆ of the model g(x) trained on dataset D
of size n sampled at random from the target distribution, where error is
based on the squared difference between predicted and actual values.
Averaged over all such datasets D, the MSE can be decomposed like this:
MSE = E_D[(ŷ − f(x))²]
    = E_D[(ŷ − E_D[ŷ])²] + (E_D[ŷ] − f(x))²
Note that the first term in the error decomposition (variance) does not
refer to the actual value at all, although the second term (bias) does.
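A small simulation sketch of this decomposition (the target function, noise level, sample size and estimator are all illustrative): it repeatedly samples training sets D, fits a straight line to each, and compares the MSE at a fixed test point with the sum of the variance and squared-bias terms.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)           # assumed "true" target function
sigma, n, n_datasets = 0.3, 30, 2000
x_test = 0.25                                  # fixed input at which predictions are evaluated

preds = []
for _ in range(n_datasets):
    x = rng.uniform(0, 1, n)                   # sample a training set D of size n
    y = f(x) + rng.normal(0, sigma, n)
    b, a = np.polyfit(x, y, deg=1)             # fit a straight line y_hat = a + b*x
    preds.append(a + b * x_test)

preds = np.array(preds)
variance = np.mean((preds - preds.mean()) ** 2)
bias_sq = (preds.mean() - f(x_test)) ** 2
mse = np.mean((preds - f(x_test)) ** 2)
print(mse, variance + bias_sq)                 # the two agree (up to floating point)
```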
Some further issues in learning linear regression models
What do the Coefficients bi Mean?
• Consider the two equations:
Ŷ = a + bX
Ŷ = b0 + b1X1 + b2X2
• b: change in Y that accompanies a unit change in X
• b1: change in Y that accompanies a unit change in X1 provided X2
remains constant
• More generally, bj (j > 0) is the change in Y that accompanies a unit
change in Xj provided all other X’s are constant
• So: if all relevant variables are included, then we can assess the effect
of each one in a controlled manner
Categoric Variables: X’s
• “Indicator” variables are those that take on the values 0 or 1
• They are used to include the effects of categoric variables
• For example, let D be a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y, keeping age (X) constant:
Ŷ = 70 + 5D + 0.44X
• So, taking the drug (a unit change in D) makes a difference of 5
units, provided age is held constant
Categoric Variables: X’s
• How do we capture any interaction effect between age and drug
intake?
• Introduce a new interaction variable DX = D × X
Ŷ = 70 + 5D + 0.44X + 0.21DX
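As a purely illustrative calculation with the (made-up) coefficients above, the interaction term means the estimated drug effect now depends on age:

```python
def predict_bp(drug, age):
    """Blood pressure under the illustrative fitted model with an interaction term."""
    return 70 + 5 * drug + 0.44 * age + 0.21 * drug * age

age = 50
effect_of_drug = predict_bp(1, age) - predict_bp(0, age)
print(effect_of_drug)   # 5 + 0.21*50 = 15.5, so the drug effect grows with age
```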
Categoric Variables: Y values
• Sometimes Y values may just be one of two values (say, 0 and 1)
• We can’t use the regression model as we described earlier, in which
the Y ’s can take any real value
• But, we can define a new linear regression model which predicts not the value of Y, but what are called the log odds of Y:
log odds of Y = Odds = b0 + b1X1 + · · · + bpXp
• Once Odds are estimated, they can be used to calculate the
probability of Y :
Pr(Y = 1) = e^Odds / (1 + e^Odds)
We can then use the value of Pr(Y = 1) to decide if Y = 1
• This procedure is called logistic regression
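A tiny sketch of the final step, converting estimated log odds into a probability and a decision (the coefficients are arbitrary):

```python
import numpy as np

def prob_y1(x, b):
    """Logistic regression prediction: log odds -> probability."""
    log_odds = b[0] + np.dot(b[1:], x)
    return np.exp(log_odds) / (1 + np.exp(log_odds))   # equivalently 1 / (1 + exp(-log_odds))

b = np.array([-1.0, 0.8, 0.3])     # illustrative estimated coefficients b0, b1, b2
x = np.array([2.0, 1.0])
p = prob_y1(x, b)
print(p, p >= 0.5)                 # e.g. predict Y = 1 if the probability exceeds 0.5
```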
Is the Model Appropriate?
• If there is no systematic pattern to the residuals, that is, approximately half of them are positive and half are negative, then the line is a good fit
• There should also be no pattern to the residual scatter along the line. If the average size of the residuals varies along the line (a condition called heteroscedasticity), then the relationship is probably more complex than a straight line
• Residuals from a well-fitting line should show an approximate
symmetric, bell-shaped frequency distribution with a mean of 0
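These checks can be sketched numerically as well as graphically; the following illustrative helper (the fitted values y_hat are a placeholder for whatever model was fit, and the heteroscedasticity check is a crude heuristic, not a formal test) computes the fraction of positive residuals, the mean residual, and a rough spread ratio:

```python
import numpy as np

def residual_checks(y, y_hat):
    """Simple numeric residual diagnostics for a fitted regression line."""
    r = y - y_hat
    frac_positive = np.mean(r > 0)        # should be roughly 0.5
    mean_resid = r.mean()                 # should be close to 0
    # crude heteroscedasticity check: compare residual spread in two halves of y_hat
    order = np.argsort(y_hat)
    lo, hi = r[order[: len(r) // 2]], r[order[len(r) // 2:]]
    spread_ratio = hi.std() / lo.std()    # far from 1 suggests changing variance
    return frac_positive, mean_resid, spread_ratio

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 200)
y = 1 + 2 * x + rng.normal(0, 0.2, 200)
y_hat = 1 + 2 * x                          # pretend these are the fitted values
print(residual_checks(y, y_hat))
```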
Non-linear Relationships
A question: is it possible to do better than the line of best fit?
Maybe. Linear regression assumes that the (xi, yi) examples in the data
are “generated” by the true (but unknown) function Y = f(X).
So any training set is a sample from the true distribution E(Y ) = f(X).
But what if f is non-linear?
We may be able to reduce the mean squared error (MSE) value ∑_i (yi − ŷi)² by trying a different function.
Non-linear Relationships
• Some non-linear relationships can be captured in a linear model by a transformation (“trick”). For example, the curved model Ŷ = b0 + b1X1 + b2X1² can be transformed by setting X2 = X1² into a linear model. This works for polynomial relationships.
• Some other non-linear relationships may require more complicated transformations. For example, the relationship Y = b0 X1^{b1} X2^{b2} can be transformed into the linear relationship
log(Y) = log b0 + b1 log X1 + b2 log X2
• Other relationships cannot be transformed quite so easily, and will
require full non-linear estimation (in subsequent topics in the ML
course we will find out more about these)
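A sketch of the polynomial “trick” on made-up data: add X1² as an extra column and fit an ordinary linear model by least squares.

```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.uniform(-2, 2, 100)
y = 1.0 + 0.5 * x1 + 2.0 * x1 ** 2 + rng.normal(0, 0.1, 100)   # curved relationship

X_design = np.column_stack([np.ones_like(x1), x1, x1 ** 2])    # treat x1^2 as a new variable X2
b0, b1, b2 = np.linalg.lstsq(X_design, y, rcond=None)[0]
print(b0, b1, b2)                                              # close to 1.0, 0.5, 2.0
```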
Non-Linear Relationships
• Main difficulty with non-linear relationships is choice of function
• How to learn ?
• Can use a form of gradient descent to estimate the parameters
• After a point, almost any sufficiently complex mathematical function
will do the job in a sufficiently small range
• Some kind of prior knowledge or theory is the only way to help here.
• Otherwise, it becomes a process of trial-and-error, in which case,
beware of conclusions that can be drawn
Model Selection
• Suppose there are a lot of variables Xi, some of which may represent products, powers, etc.
• Taking all the Xi will lead to an overly complex model. There are 3
ways to reduce complexity:
1 Subset-selection, by search over subset lattice. Each subset results in a
new model, and the problem is one of model-selection
2 Shrinkage, or regularization of coefficients to zero, by optimization.
There is a single model, and unimportant variables have near-zero
coefficients.
3 Dimensionality-reduction, by projecting points into a lower dimensional
space (this is different to subset-selection, and we will look at it later)
Model Selection as Search I
• The subsets of the set of possible variables form a lattice with S1 ∩ S2
as the g.l.b. or meet and S1 ∪ S2 as the l.u.b. or join
• Each subset refers to a model, and a pair of subsets are connected if
they differ by just 1 element
• A lattice is a graph, and we know how to search a graph
• A*, greedy, randomised, etc.
• “Cost” of a node in the graph: the MSE of the corresponding model. The parameters (coefficients) of the model can be found by least-squares estimation, as before
• Historically, model-selection for regression has been done using
“forward-selection”, “backward-elimination”, or “stepwise” methods
• These are greedy search techniques that either: (a) start at the top of
the subset lattice, and add variables; (b) start at the bottom of the
subset lattice and remove variables; or (c) start at some interior point
and proceed by adding or removing single variables (examining nodes
connected to the node above or below)
Model Selection as Search II
• Greedy selection is done on the basis of the coefficient of determination (often denoted R²), which is the proportion of the total variation in the dependent variable Y that is explained by the model
• Given a model formed with a subset of variables X, it is possible to compute the observed change in R² due to the addition or deletion of some variable x
• This is used to select greedily the next best move in the graph-search
To set other hyper-parameters, such as the shrinkage parameter λ, we can use grid search
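A sketch of greedy forward selection using R² (illustrative only; the stopping rule and data are simplified, and in practice candidate models would usually be compared on held-out data):

```python
import numpy as np

def r_squared(cols, y):
    """R^2 of an ordinary least-squares fit of y on the given columns (plus an intercept)."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid.var() / y.var()

def forward_select(X, y, min_gain=0.01):
    """Greedily add the variable that most improves R^2, until the improvement is small."""
    selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        scores = []
        for j in remaining:
            cols = [X[:, k] for k in selected + [j]]
            scores.append((r_squared(cols, y), j))
        r2, j = max(scores)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# illustrative data in which only columns 0 and 4 matter
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(0, 0.1, 200)
print(forward_select(X, y))   # expected to select columns 0 and 4
```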
Prediction I
• It is possible to quantify what happens if the regression line is used
for prediction:
• The intuition is this:
• Recall the regression line goes through the point of means (X̄, Ȳ)
Prediction II
• If the Xi are slightly different, then the mean is not going to change much. So, the regression line stays somewhat “fixed” at (X̄, Ȳ) but with a different slope
• With each different sample of the Xi we will get a slightly different regression line
• The variation in predicted Y values is greater the further we move from (X̄, Ȳ)
• MORAL: Be careful, when predicting far away from the centre value
• ANOTHER MORAL: The model only works under approximately the
same conditions that held when collecting the data
Summary
• Linear regression gives us a glimpse into many aspects of Machine Learning
• Linear models are only one way to predict numerical quantities
• Local regression: a nearest-neighbour approach
• Trees and ensembles: non-parametric models, including regression
• Neural networks: non-linear models, including regression
References
Kochenderfer, M. and Wheeler, T. (2019). Algorithms for Optimization. MIT Press.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society. Series B, 58(1):267–288.