BUSS6002 Week 10
Introduction to Maximum Likelihood and
Optimisation
Discipline of Business Analytics
The University of Sydney Business School
Previously
It was mentioned that:
the unknown coefficients β can be found by least squares
RSS = ‖y − ŷ‖₂² = ‖y − Xβ‖₂² = (y − Xβ)ᵀ(y − Xβ)
The least squares optimisation problem is then given by

β̂ = argmin_β ‖y − Xβ‖₂²
The solution β̂ can be found by equating the gradient of the RSS to zero and solving for β. (This involves computing matrix derivatives.)

β̂ = (XᵀX)⁻¹Xᵀy
Optimisation in Machine Learning
Estimating model parameters often requires solving an
optimisation problem.
Therefore optimisation is a crucial component of supervised
learning.
Examples of models/algorithms:
• Linear Regression
• Logistic Regression
• Support Vector Machine
• Boosted Ensembles
• Neural Networks (including deep learning)
Implementation
Most packages will do the heavy lifting of optimisation for us.
However, understanding optimisation will lead to a better
understanding of machine learning.
Maximum Likelihood Estimation
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a general method for
estimating parameters of a model.
The MLE philosophy:
Select parameter values that are most likely to have
generated the observed sample.
Joint Probability
Suppose we have a discrete probability distribution which depends on the parameter vector θ = [θ₁, θ₂, …, θ_k]ᵀ.
If we know the value of θ we can compute the joint probability of
observing a particular sample (i.e., set of realisations).
Joint Probability - Example
Suppose we have a fair coin, i.e., the probability of heads (H) or
tails (T) is φ = 0.5.
We flip the coin four times.
The probability of observing HHHT is given by the product of
probabilities of each outcome i.e.
P(HHHT | φ = 0.5) = φ × φ × φ × (1 − φ)
                  = φ³(1 − φ)
                  = 0.5³(1 − 0.5)
                  = 0.0625
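As a quick sanity check, we can evaluate this in Python (a minimal sketch; the variable names are our own):

    # Joint probability of observing HHHT for a fair coin (phi = 0.5).
    phi = 0.5
    p_hhht = phi ** 3 * (1 - phi)
    print(p_hhht)  # 0.0625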
Likelihood
Now consider the reverse problem:
how likely is a given value of θ after observing a sample?
In the coin example we
• do not know the value of φ, which is our parameter θ;
• observe a sample – a sequence of H or T.
Likelihood
Suppose we flip the coin four times and observe HHHT.
The likelihood of each possible value of φ given our observed
realisations is
L(φ) := P(HHHT | φ) = φ³(1 − φ)

[Figure: the likelihood L(φ) = φ³(1 − φ) plotted over φ ∈ [0, 1].]
Likelihood
Notice that we are treating the observed sample (HHHT) as fixed,
while varying the parameter φ.
This is the difference between the probability P and likelihood L.
Maximum Likelihood
From the graph we can identify that the maximum value of the
likelihood function L occurs at φ = 0.75.
Therefore our Maximum Likelihood estimate is φ̂ = 0.75.
[Figure: the likelihood L(φ) = φ³(1 − φ) again, with its maximum marked at φ = 0.75.]
Maximum Likelihood
This maximum point can be found by:
1. obtaining the derivative of L with respect to θ,
2. setting this to 0,
3. solving for θ.
This is the typical approach to maximum likelihood estimation,
since we usually have multiple parameters for a model and cannot
simply read the maximum off a graph.
Maximum Likelihood
Continuing our coin example. If

L(φ) = φ³(1 − φ)

then

dL(φ)/dφ = 3φ² − 4φ³

Setting dL(φ)/dφ = 0 yields

0 = 3φ² − 4φ³
4φ³ = 3φ²
4φ = 3
φ = 3/4 = 0.75
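We can verify this numerically with a coarse grid search over φ (a sketch; numpy is assumed available):

    import numpy as np

    # Evaluate the likelihood L(phi) = phi^3 (1 - phi) on a grid of candidates.
    phi = np.linspace(0, 1, 1001)
    likelihood = phi ** 3 * (1 - phi)

    # The grid point with the highest likelihood is our (approximate) MLE.
    phi_hat = phi[np.argmax(likelihood)]
    print(phi_hat)  # 0.75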
Maximum Likelihood
[Figure: the derivative dL/dφ = 3φ² − 4φ³, which crosses zero at φ = 0.75.]
Maximum Likelihood
We can summarise the ML estimation process with the following
equation:
θ̂ = argmax_θ L(θ),

which we read as:

the estimated value θ̂ is given by the value of θ that
maximises the likelihood function L for a given sample x.
Remark
Obviously, the maximum likelihood estimate in our example
deviates from the true value of φ = 0.5.
Fortunately, for many distribution families, the maximum likelihood
estimator is consistent, meaning that as the sample size increases
so will the accuracy of the estimate.
θ̂ₙ → θ as n → ∞,

where θ̂ₙ is the estimate obtained using a sample of n observations.
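A small simulation can illustrate consistency (a sketch, under the assumption that for a sequence of coin flips the MLE of φ is the sample proportion of heads):

    import numpy as np

    rng = np.random.default_rng(0)
    true_phi = 0.5

    # The MLE of phi for n flips is the sample proportion of heads.
    for n in [4, 100, 10_000, 1_000_000]:
        flips = rng.binomial(1, true_phi, size=n)  # 1 = heads, 0 = tails
        print(n, flips.mean())  # estimates approach 0.5 as n grows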
Optimisation
Finding the maximum of a function is an optimisation problem.
To estimate parameters using MLE we need to solve optimisation
problems.
MLE for Linear Regression
MLE for Linear Regression
We can directly apply the MLE principle to Linear Regression.
Recall that in Linear Regression the parameters θ are commonly referred to as β = [β₀, β₁, …, β_d]ᵀ.

Therefore the MLE solution is given by

β̂ = argmax_β L(β)
for an observed feature matrix X and response vector y.
Linear Regression Model
Recall linear regression with Gaussian errors:

y = β₀ + β₁x₁ + ⋯ + β_d x_d + ε
  = xβ + ε

where ε ∼ N(0, σ²) and xβ is the regression line.

This means that

y | x, β, σ² ∼ N(xβ, σ²).

Again, we assume that x ∈ ℝ^(d+1) is a row vector, and β ∈ ℝ^(d+1) is a column vector.
Gaussian Linear Regression Model
If we know the values of x, β, and σ², we can compute the probability density of observing any value of y using the Gaussian (normal) probability density function (pdf):

p(y | x, β, σ²) = (1/√(2πσ²)) exp(−(y − xβ)² / (2σ²))
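To make the formula concrete, here is a direct evaluation of the density (a sketch; the numerical values are our own, and scipy.stats.norm is used only as a cross-check):

    import numpy as np
    from scipy.stats import norm

    y, xb, sigma2 = 1.2, 1.0, 0.25  # illustrative values

    # Gaussian pdf evaluated manually from the formula above ...
    pdf_manual = np.exp(-(y - xb) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    # ... and via scipy (note scipy parameterises by standard deviation).
    pdf_scipy = norm.pdf(y, loc=xb, scale=np.sqrt(sigma2))
    print(pdf_manual, pdf_scipy)  # the two values agree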
Likelihood
Assume that we only have a set of observed responses y ∈ ℝⁿ and corresponding feature vectors X ∈ ℝ^(n×(d+1)) and that the parameters are unknown.

The likelihood function for this situation is the joint density of observing all the data

L(β) = ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ, β, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yᵢ − xᵢβ)² / (2σ²))

where yᵢ is the i-th element of y, and xᵢ is the i-th row of X.

Remark: to simplify exposition, we assume that σ² is known and focus on the estimation of β.
Likelihood
This means that the MLE for linear regression is given by

β̂ = argmax_β ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yᵢ − xᵢβ)² / (2σ²))

But this can be difficult to solve!
Log-Likelihood
Instead we can use the log-likelihood:

ℓ(β) = log L(β)
     = Σᵢ₌₁ⁿ log[(1/√(2πσ²)) exp(−(yᵢ − xᵢβ)² / (2σ²))]
It is often easier to work with the log-likelihood instead of the
likelihood, and the solution of the optimisation does not change
(since log is a strictly increasing function).
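As a small illustration, the log-likelihood can be evaluated as a sum of log-densities (a sketch; the synthetic data and parameter values are our own):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, d = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept column
    beta = np.array([1.0, 2.0, -0.5])
    sigma = 1.0
    y = X @ beta + rng.normal(scale=sigma, size=n)

    # Log-likelihood: sum of the log of each observation's Gaussian density.
    loglik = norm.logpdf(y, loc=X @ beta, scale=sigma).sum()
    print(loglik)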
Log Likelihood
We can simplify the log-likelihood

ℓ(β) = Σᵢ₌₁ⁿ log[(1/√(2πσ²)) exp(−(yᵢ − xᵢβ)² / (2σ²))]
     = −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² − (n/2) log(2πσ²)

Ignoring constants (w.r.t. β) we are left with

ℓ(β) ∝ −Σᵢ₌₁ⁿ (yᵢ − xᵢβ)²
RSS
Note that our log-likelihood is proportional to the negative RSS

ℓ(β) ∝ −Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² = −RSS

Therefore, under Gaussian errors, the least squares and maximum likelihood estimates are numerically equivalent:

β̂ = argmax_β −Σᵢ₌₁ⁿ (yᵢ − xᵢβ)²
MLE for Linear Regression - Analytical Solution
Optimisation
To ultimately find β̂ we need to take the derivative of ℓ with respect to β.
It is easiest to do this using matrix notation.
Optimisation
We can rewrite as

Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² = ‖y − Xβ‖₂²

because

‖a‖₂ = √(Σᵢ aᵢ²) = √(aᵀa)

Therefore

Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² = (y − Xβ)ᵀ(y − Xβ)
Optimisation
Then we seek to find

∂/∂β (y − Xβ)ᵀ(y − Xβ)

which is the vector of partial derivatives.

If we have a scalar-valued function of several variables, f, the vector of partial derivatives is given by:

∂f/∂β = [∂f/∂β₀, ∂f/∂β₁, ⋯, ∂f/∂β_d]ᵀ
Optimisation
If we set this vector of partial derivatives equal to the zero vector

[∂f/∂β₀, ∂f/∂β₁, …, ∂f/∂β_d]ᵀ = [0, 0, …, 0]ᵀ

the result can be interpreted as a system of equations:

∂f/∂β₀ = 0
∂f/∂β₁ = 0
⋮
∂f/∂β_d = 0

Then we can solve for the values of βᵢ that satisfy the system.
Optimisation
Alternatively we can keep things in matrix notation and use known differentiation rules¹

∂/∂β (y − Xβ)ᵀ(y − Xβ) = −2Xᵀ(y − Xβ)

Setting to 0 and solving for β̂ gives

2Xᵀ(Xβ̂ − y) = 0
XᵀXβ̂ − Xᵀy = 0
XᵀXβ̂ = Xᵀy
β̂ = (XᵀX)⁻¹Xᵀy

¹ Matrix Cookbook, Eq. 84.
Computing the Solution
Since the solution involves standard matrix operations, we can use
packages like numpy to compute an answer.
You will do this in the tutorial!
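As a preview of the tutorial, a minimal numpy sketch (the synthetic data is our own; np.linalg.solve is preferred over an explicit inverse for numerical stability):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept + d features
    beta_true = np.array([2.0, 1.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(size=n)

    # Solve the normal equations (X^T X) beta = X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)  # close to beta_true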
Numerical Considerations
There are two requirements that must be met for us to solve this
optimisation problem:
1. more observations than variables,
2. invertible X>X matrix.
More observations than variables
This condition means that we must have n ≥ d + 1.

If the solution to β is given by

β̂ = (XᵀX)⁻¹Xᵀy

then β̂ is the least squares solution to the system

X_{n×(d+1)} β_{(d+1)×1} = y_{n×1}

and we can see that β̂ solves a system of n equations with d + 1 unknowns.

When n < d + 1 there is no unique solution!
Invertibility
The matrix X>X must be invertible.
This condition is satisfied when X is full rank.
The rank refers to the number of linearly independent columns. In
other words each column cannot be written as a linear combination
of the others.
Invertibility
If X is not full rank, we say that the feature matrix X suffers from
multicollinearity.
This is the same as saying that two or more features are perfectly
correlated, or more generally that some feature is an exact linear
combination of the others.
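One way to diagnose this is a rank check (a sketch; the linearly dependent column is deliberately constructed):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = 2 * x1  # perfectly correlated with x1

    X = np.column_stack([np.ones(100), x1, x2])
    print(np.linalg.matrix_rank(X))  # 2 < 3, so X is not full rank
    # X.T @ X is singular here, so (X^T X)^{-1} does not exist.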
Gradient Descent
Motivation
In many cases, it is impossible to obtain an analytical solution to
an optimisation problem.
Common reasons:
• the choice of loss function does not admit a closed-form solution;
• evaluating the loss function and its derivative is too costly.
For example, computing matrix inverses can take a huge amount of
memory, which might not be feasible on your computer.
Iterative methods
To get around these issues we often resort to iterative procedures.
Such procedures usually have lower computational requirements
and permit the use of approximations of our original function.
Gradient Descent
The most popular method is called gradient descent.
We use gradient descent to find the minimum of a function -
usually a loss function.
Gradient descent has underpinned almost every recent advance in
machine learning such as neural networks and deep learning.
Algorithm
1. Initialise θ and step size α
2. Repeat until convergence
   2.1 Compute the gradient ∇f(θₖ)
   2.2 Update θₖ₊₁ = θₖ − α∇f(θₖ)
Intuition
The algorithm can be summarised as:
repeatedly step towards the minimum of the function
Example
Suppose we want to find the minimum of the function f(x) = x2.
[Figure: f(x) = x² plotted on x ∈ [−1, 1].]
Example
In this example our parameter θ is x and the derivative of the
function is f ′(x) = 2x.
We will initialise by setting x = 1 and α = 0.1.
Substituting the values into the algorithm yields
1. Initialise x = 1 and step size α = 0.1
2. Repeat until convergence
   2.1 Update xₖ₊₁ = xₖ − 0.1(2xₖ)
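The first few iterations are easy to reproduce in Python (a minimal sketch of the update rule above):

    # Gradient descent on f(x) = x^2 with f'(x) = 2x.
    x = 1.0
    alpha = 0.1
    for k in range(5):
        x = x - alpha * 2 * x  # update rule: x_{k+1} = x_k - alpha f'(x_k)
        print(k + 1, x)        # 0.8, 0.64, 0.512, ...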
Example - Iteration 1
[Figure: iteration 1: the tangent f′(x) at x₀ = 1 and the step from x₀ = 1 to x₁ = 0.8 on f(x) = x².]
Example - Iteration 2
[Figure: iteration 2: the tangent f′(x) at x₁ = 0.8 and the step from x₁ = 0.8 to x₂ = 0.64.]
Example - All iterations
[Figure: all iterations: the sequence of points descending f(x) = x² towards x = 0.]
Convexity
Gradient descent is guaranteed to find the global minimum if the
function is convex (given a suitably small step size).
If the function is not convex then the best guarantee is that a
local minimum will be reached.
If we are lucky the local and global minimum will coincide.
Convexity
The function (defined on [−1, 3]) shown below has three local
minima (red), two local maxima (blue) and one saddle point
(orange). In this domain there is one global minimum, however we
may not be able to reach it depending on our initialisation of x.
[Figure: a non-convex function on [−1, 3] with the local minima, local maxima, and saddle point marked.]
Convexity
When the loss function is not convex it is typical to repeat the
gradient descent process at different initialisation points in order to
test if the global minimum has been reached.
Convergence
When a minimum is reached we say that the algorithm has
converged.
Convergence can be detected by
1. monitoring change in function value,
2. monitoring the gradient value.
For 1., check whether the change in the function value between
iterations is less than a threshold.
For 2., check whether the norm of the gradient vector is less than
a threshold.
Convergence
To prevent an infinite loop most implementations will also use a
maximum number of iterations.
The algorithm will be stopped even if the convergence conditions
haven’t been met.
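As a sketch, both convergence checks and an iteration cap can be combined in one loop (the function name, tolerance, and cap below are our own illustrative choices):

    import numpy as np

    def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=10_000):
        """Minimise a function by gradient descent, stopping when the
        gradient norm falls below tol or max_iter is reached."""
        theta = theta0
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:  # convergence check on the gradient
                break
            theta = theta - alpha * g
        return theta

    print(gradient_descent(lambda x: 2 * x, np.array([1.0])))  # close to 0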
Step size
The step size (learning rate) α controls how quickly the algorithm
moves.
Small values of α lead to slow convergence or early termination
due to the maximum number of iterations being reached.
Large values of α lead to overshooting and failure to converge.
Step size
Consider the case of f(x) = x² with α = 1000 and x₀ = 1.

The update rule will set x₁ to

x₁ = 1 − 1000 × 2(1) = −1999

In the next iteration

x₂ = −1999 − 1000 × 2(−1999) = 3,996,001

Notice that we are jumping from side to side and moving further away from the global minimum x = 0.
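The blow-up is easy to reproduce (a sketch; we cap at three iterations to avoid overflow):

    # Gradient descent on f(x) = x^2 with a far-too-large step size.
    x = 1.0
    alpha = 1000
    for k in range(3):
        x = x - alpha * 2 * x
        print(k + 1, x)  # -1999, 3996001, -7988005999: diverging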
Step size
We can set the step size empirically or analytically.
Analytical methods such as line-search will dynamically update the
step size at each iteration.
Dynamic step sizes can greatly improve the convergence rate.
In BUSS6002, we assume that it is fixed and set empirically.
MLE for Linear Regression - Gradient Descent
Negative Log Likelihood
Previously we identified that the log-likelihood for Linear Regression is proportional to the negative RSS

ℓ(β) ∝ −Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² = −RSS

This function is concave. To make it convex we can negate it, which in our case gives the Negative Log-Likelihood function:

−ℓ(β) ∝ Σᵢ₌₁ⁿ (yᵢ − xᵢβ)²
Gradient Ascent
Alternatively we could change the sign in our update rule and perform gradient ascent using the standard log-likelihood function:

θₖ₊₁ = θₖ + α∇f(θₖ)
The solutions from gradient ascent and descent are identical.
However most software packages will only implement the descent
version.
Gradient
From earlier we know that

∂/∂β Σᵢ₌₁ⁿ (yᵢ − xᵢβ)² = ∂/∂β (y − Xβ)ᵀ(y − Xβ) = −2Xᵀ(y − Xβ)
Algorithm
Substituting the values into the algorithm yields

1. Initialise β randomly and step size α to some value
2. Repeat until convergence
   2.1 Update βₖ₊₁ = βₖ − α(−2Xᵀ(y − Xβₖ))

Since the RSS is convex we can randomly initialise β and we are guaranteed to find the global minimum. However this depends on careful selection of α.
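Putting it together, a minimal numpy sketch of gradient descent for linear regression (the synthetic data, step size, and iteration count are our own choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
    beta_true = np.array([2.0, 1.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(size=n)

    # Gradient descent on the RSS using the gradient -2 X^T (y - X beta).
    beta = np.zeros(d + 1)   # zero initialisation; random also works here
    alpha = 0.001            # step size chosen small enough for convergence
    for _ in range(2000):
        grad = -2 * X.T @ (y - X @ beta)
        beta = beta - alpha * grad

    print(beta)                               # gradient descent estimate
    print(np.linalg.solve(X.T @ X, X.T @ y))  # closed-form solution for comparison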
Wrapping Up
Discussion
Compare the analytical and gradient descent approaches:
• The analytical solution is mathematically simpler and can be
easily implemented.
• The gradient descent approach offers lower peak memory and
computation requirements.
• Both methods will achieve the same solution up to some
tolerance.
• As the problem size scales up it becomes more favourable to
take the gradient descent approach - we will talk about this
later.
Resources
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. https://research.google/pubs/pub38136.pdf

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. http://mlss11.bordeaux.inria.fr/docs/mlss11Bordeaux_Vandenberghe.pdf