STAT 428 WEEK 13 - NONLINEAR OPTIMIZATION
Susu Zhang
University of Illinois at Urbana-Champaign
April 18, 2022
13.1 RECAP & INTRODUCTION
13.2 THE ROOT-FINDING PROBLEM
13.3 NONLINEAR OPTIMIZATION
13.1 RECAP & INTRODUCTION
THE PROBLEM OF PARAMETER ESTIMATION
A parametric statistical model describes the chances of different outcomes of
a random variable, given the model parameters.
The parameter (θ) is unknown, and we want to infer something about it based on the observed data (x).
We have already seen a few examples:
- Linear regression - estimating the regression coefficients (βs) based on observed Xs and Ys.
- Investment model - estimating θ, which governs the probability that a stock is a winner on a given day.
- Change point model - estimating k (the change point) and the before- and after-change Poisson rates (µ, λ).
13.1 RECAP & INTRODUCTION
RECAP: BAYESIAN ESTIMATION
The latter two were done under the Bayesian framework, where we
- used MCMC to draw samples of the parameter from the posterior distribution,
  f(θ | x) ∝ L(θ; x) f_θ(θ),
- obtained a point estimate of the parameter based on the posterior samples. For example, the EAP estimator is given by the posterior mean:
  θ̂_EAP = E_{θ|x}(θ).
This is estimated using Monte Carlo integration on the post burn-in samples:
  θ̂_EAP = (1 / (m − b)) Σ_{j=b+1}^{m} θ^(j).
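As a minimal sketch (with hypothetical names theta_post, m, and b standing in for the MCMC draws, chain length, and burn-in), this is just the mean of the post burn-in draws:

m <- 5000; b <- 1000                      # hypothetical chain length and burn-in
theta_post <- rnorm(m)                    # stand-in for actual MCMC output of theta
theta_eap <- mean(theta_post[(b + 1):m])  # Monte Carlo estimate of E(theta | x)
theta_eap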
13.1 RECAP & INTRODUCTION
PARAMETER ESTIMATION AS AN OPTIMIZATION PROBLEM
In other occasions, parameter estimation might be formulated as an
optimization problem.
For example:
- MAP (Bayesian posterior mode):
  θ̂_MAP = argmax_θ f(θ | x).
- MLE:
  θ̂_MLE = argmax_θ L(θ; x).
Question: How do we find the value of θ achieving the maximum likelihood (or
posterior probability)?
13.1 RECAP & INTRODUCTION
NONLINEAR OPTIMIZATION
Given the observed data (x), the likelihood function,
L(θ; x) = f_{X|θ}(x | θ),
is a nonlinear function of θ.
Finding its maximum thus involves nonlinear optimization.
For an arbitrary smooth function f (x), its extrema (minima/maxima) occur at
values of x such that
f ′(x) = 0.
- Finding an analytic root of f′(x) = 0 isn't always straightforward (one may not even exist).
- Numerical methods for solving nonlinear equations may be used to find approximate solutions.
13.1 RECAP & INTRODUCTION
THIS WEEK’S LECTURES
We will start this week’s lectures by introducing a few numerical methods for
solving nonlinear equations:
- Bisection
- Newton-Raphson
- Brent's method
Then we’ll discuss nonlinear optimization.
- Closely related to root-finding: finding the optima of a smooth function is equivalent to finding the root(s) of its derivative.
- The methods are for generic functions. We'll see applications in MLE.
Optimization itself is worth another class - the problem becomes much more
challenging when the dimension of variables is high or when there are
linear/non-linear constraints. We will only cover the basics.
13.1 RECAP & INTRODUCTION
EXISTENCE AND UNIQUENESS OF ROOT
Keep in mind that there doesn’t always exist a unique solution to a nonlinear
equation:
- e^(−x) − x = 0 has a unique solution.
- e^x + 1 = 0 has no solution.
- sin(x) = 0 has infinitely many solutions.
By the same logic, a unique optimum may not exist:
- e.g., a multimodal distribution has multiple local maxima.
- Most optimization methods are designed to find a local maximum, which may not be the global maximum.
- If the global maximum is desired, it is advisable to try different starting points and see whether they all produce the same result (a small sketch follows below).
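For instance, here is a small sketch with a made-up multimodal function g (optim is introduced more formally later in these notes): run the optimizer from several starting values and compare the answers.

g <- function(x) sin(x) + 0.1 * x         # toy multimodal function
starts <- c(-10, -5, 0, 5, 10)
# maximize g from each start (optim minimizes, hence the sign flip)
fits <- sapply(starts, function(s) optim(s, function(x) -g(x), method = "BFGS")$par)
round(fits, 3)                            # different starts can land on different local maxima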
13.2 THE ROOT-FINDING PROBLEM
13.2.1 Bisection Method
13.2 THE ROOT-FINDING PROBLEM
BISECTION METHOD
Without loss of generality, we’ll consider the problem of finding the solution to
f (x) = 0.
(Anything non-zero on the RHS can be moved to the LHS; e.g., x² = 2 is the same as x² − 2 = 0.)
Suppose:
- f(x) is continuous on [a, b], and
- f(a) and f(b) have opposite signs.
By the intermediate value theorem, there must exist c ∈ (a, b) satisfying f(c) = 0.
13.2 THE ROOT-FINDING PROBLEM
BISECTION: PROCEDURES
The bisection method begins with an initial bracket [a, b] and iteratively halves the bracket until the solution has been isolated as accurately as desired, i.e., until reaching a pre-specified tolerance level (tol).
Specifically:
- Step 1: Compute the midpoint x = (a + b)/2 and check the sign of f(x).
- Step 2: If f(a) and f(x) have opposite signs, replace [a, b] with [a, x]. Otherwise, replace [a, b] with [x, b].
Iterate Steps 1 and 2 until (b − a) ≤ tol. A generic version of this procedure is sketched below.
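Here is a generic sketch of the procedure (argument names are illustrative; the worked example on the following slides uses the book's bookkeeping instead):

bisect <- function(f, a, b, tol = 1e-4, ...) {
  if (f(a, ...) * f(b, ...) > 0) stop("f(a) and f(b) must have opposite signs")
  while ((b - a) > tol) {
    x <- (a + b) / 2
    if (f(a, ...) * f(x, ...) < 0) b <- x else a <- x  # keep the half that brackets the root
  }
  (a + b) / 2                                          # return the final midpoint
}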
13.2.1 Bisection
Example
Suppose we want to find the solutions to
a² + y² + 2ay/(n − 1) = n − 2,
where a is a constant and n > 2 is an integer.
Note that this is equivalent to the quadratic equation
f(y) = y² + 2ay/(n − 1) + a² − n + 2 = 0.
Here’s a function for f (y):
f <- function(y, a, n){
y^2+(2*a*y)/(n-1)+a^2-n+2
}
13.2.1 Bisection
Finding the Root w/ Bisection
We'll numerically solve f(y) = 0 when a = .5 and n = 20, using bisection with the starting interval [b0, b1] = [0, 5n] and tolerance .0001. (Note: the book uses .Machine$double.eps^0.25, which is around .00012.)
a <- .5
n <- 20
b0 <- 0
b1 <- 5*n
eps <- .0001 # tolerance
Note that the book expresses the error in terms of the difference between f((b0 + b1)/2) and 0, rather than the length of the interval (b1 − b0). Let's be consistent with the book and use the former.
13.2.1 Bisection
Initialize ys and f (y)s
# initialize y and fy
y <- seq(b0,b1, length = 3) # sequence of b0, midpt, b1
fy <- sapply(y, f, a=a, n = n) # evaluate f() at b0,midpt, b1
it <- 0 # iteration number
13.2.1 Bisection
# iterate until |f(midpoint)| <= eps or 1000 iterations
while(abs(fy[2])>eps & it<1000){
it <- it+1
if(fy[1]* fy[2]<0){ # unequal sign?
y[3] <- y[2] # set upper bound to be current midpt
fy[3] <- fy[2] # set f(upper) to f(midpt)
}else{
y[1] <- y[2] # set lower bound to be current midpt
fy[1] <- fy[2] # set f(lower) to f(midpt)
}
# new midpt and f(midpt)
y[2] <- (y[1]+y[3])/2
fy[2] <- f(y[2], a = a, n = n)
# store new [b0, b1]
b0 <- c(b0, y[1])
b1 <- c(b1, y[3])
}
13.2.1 Bisection
Results
Total number of iterations:
it
## [1] 21
Solution - midpoint at last iteration:
y[2]
## [1] 4.186845
Corresponding f (y):
fy[2]
## [1] 2.984885e-05
13.2.1 Bisection
Interval at each iteration
data.frame(b0, b1)
## b0 b1
## 1 0.000000 100.000000
## 2 0.000000 50.000000
## 3 0.000000 25.000000
## 4 0.000000 12.500000
## 5 0.000000 6.250000
## 6 3.125000 6.250000
## 7 3.125000 4.687500
## 8 3.906250 4.687500
## 9 3.906250 4.296875
## 10 4.101562 4.296875
## 11 4.101562 4.199219
## 12 4.150391 4.199219
## 13 4.174805 4.199219
## 14 4.174805 4.187012
## 15 4.180908 4.187012
## 16 4.183960 4.187012
## 17 4.185486 4.187012
## 18 4.186249 4.187012
## 19 4.186630 4.187012
## 20 4.186821 4.187012
## 21 4.186821 4.186916
## 22 4.186821 4.186869
13.2.1 Bisection
Compare w/ Analytic Solution
Based on the quadratic formula, the analytic solutions are
( −2a/(n − 1) ± √( (2a/(n − 1))² − 4(a² − n + 2) ) ) / 2,
which simplifies to
−a/(n − 1) ± √( (a/(n − 1))² − a² + n − 2 ).
-a/(n-1)+sqrt((a/(n-1))^2-a^2+n-2)
## [1] 4.186841
Note that there is another root ≈ −4.24, which is outside [0, 100].
13.2 THE ROOT-FINDING PROBLEM
RATE OF CONVERGENCE
Let x* be the true solution and x_k the solution in the kth iteration. Then
e_k = x_k − x*
is the error in the kth iteration.
- For methods that maintain an interval containing x* (e.g., bisection), define the error to be the interval length, i.e., e_k = b − a.
The algorithm is said to have rate of convergence r if
lim_{k→∞} ‖e_{k+1}‖ / ‖e_k‖^r = C,
where C is a nonzero constant.
13.2 THE ROOT-FINDING PROBLEM
CONVERGENCE RATE, CTD.
Intuitively, convergence rate describes the rate at which the error decreases:
- If r = 1, the algorithm converges linearly.
- If r < 1, the algorithm converges sublinearly.
- If r > 1, the algorithm converges superlinearly.
- In particular, if r = 2, the algorithm converges quadratically.
13.2 THE ROOT-FINDING PROBLEM
CONVERGENCE RATE OF BISECTION
For the bisection method, e_k (the interval length) is halved in every iteration. In other words,
‖e_{k+1}‖ / ‖e_k‖¹ = 0.5.
With r = 1, the bisection method converges linearly. Let's see an example
from the text.
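As a quick check (a sketch assuming the b0 and b1 vectors stored in the bisection example above are still in the workspace), the successive interval lengths shrink by one half each iteration:

len <- b1 - b0                  # interval length at each stored iteration
len[-1] / len[-length(len)]     # consecutive ratios; each is 0.5 (up to rounding)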
13.2 THE ROOT-FINDING PROBLEM
13.2.2 Newton-Raphson Method
13.2 THE ROOT-FINDING PROBLEM
NEWTON-RAPHSON METHOD
The Newton-Raphson method, developed by Isaac Newton and Joseph
Raphson, is another iterative method for solving nonlinear equations:
- Recall the first-order Taylor approximation:
  f(x + h) ≈ f(x) + f′(x)h.
- Given a current guess (x_k) that is reasonably close to the true solution (x*), we have
  f(x*) = f(x_k + h) ≈ f(x_k) + f′(x_k)h,
  a linear function of h that approximates f near x_k.
- Instead of solving f(x*) = 0, Newton-Raphson solves for h such that
  f(x_k) + f′(x_k)h = 0.
The solution is h = −f(x_k)/f′(x_k). Thus, the next approximation of x* is given by
x_{k+1} = x_k − f(x_k)/f′(x_k).
13.2 THE ROOT-FINDING PROBLEM
PROCEDURES
The Newton-Raphson method begins with an initial guess x_0 that is reasonably close to the true value x*.
At each iteration k = 0, 1, 2, . . .:
- Step 1: Compute f(x_k) and f′(x_k).
- Step 2: Set x_{k+1} = x_k − f(x_k)/f′(x_k).
This is repeated until reaching the maximum number of iterations or until f(x_k) is sufficiently close to 0.
13.2.2 Newton-Raphson Method
Example
Let's solve the earlier equation,
f(y) = y² + 2ay/(n − 1) + a² − n + 2 = 0,
using Newton-Raphson.
The derivative is given by
f′(y) = 2y + 2a/(n − 1).
Here’s a function for computing f ′():
fprime <- function(y, a,n){
2*y + 2*a/(n-1)
}
We’ll start with x0 = 50. Again, let a = .5 and n = 20.
13.2.2 Newton-Raphson Method
x <- xs <- 50 # initial val
it <- 0 # iteration #
eps <- .0001 # tolerance
fx <- f(x, a = a,n = n)
# iterate newton-raphson til 1000 iterations or |f(xk)|<=eps
while(abs(fx)>eps & it<1000){
# compute f'() at x_k
fp <- fprime(x, a = a, n = n)
# update x to x_k+1
x <- x - fx/fp
it <- it + 1
# update fx to f(x_k+1)
fx <- f(x, a = a, n = n)
# store x_k+1
xs <- c(xs, x)
}
13.2.2 Newton-Raphson Method
Results
Total number of iterations:
it
## [1] 7
Solution - x in last iteration:
x
## [1] 4.186841
Corresponding f (x):
fx
## [1] 2.921774e-08
13.2.2 Newton-Raphson Method
xk at each iteration
data.frame(iteration = 0:it, xs)
## iteration xs
## 1 0 50.000000
## 2 1 25.164256
## 3 2 12.921298
## 4 3 7.132972
## 5 4 4.793025
## 6 5 4.224965
## 7 6 4.187012
## 8 7 4.186841
As you can see, the Newton-Raphson method converged faster than bisection - in fact, the rate of convergence is r = 2 (i.e., quadratic) when the root is simple, i.e., f′(x*) ≠ 0.
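We can check this empirically (a quick sketch using the stored iterates xs and the analytic root from the bisection slides):

root <- -a/(n-1) + sqrt((a/(n-1))^2 - a^2 + n - 2)  # analytic root from the earlier slide
e <- abs(xs - root)                                  # errors |e_k| at each iteration
e[-1] / e[-length(e)]^2    # |e_{k+1}| / |e_k|^2 settles near a constant in the last few iterations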
13.2 THE ROOT-FINDING PROBLEM
13.2.3 Brent’s Method
13.2 THE ROOT-FINDING PROBLEM
THE COST OF FASTER CONVERGENCE
The convergence of the Newton-Raphson method is faster than the bisection,
meaning that we can get sufficiently close to the root with fewer iterations.
However, the faster convergence comes at a cost. Note that, in each iteration:
- The bisection method only requires testing the sign of f((a + b)/2).
- For Newton-Raphson, both f(x_k) and f′(x_k) need to be evaluated. (Not to mention that you also need to derive f′.)
To circumvent evaluating f′(x_k) directly, interpolation methods (instead of a Taylor series) may be used to approximate the function f.
13.2 THE ROOT-FINDING PROBLEM
INTERPOLATION METHODS
The idea is to fit a line or polynomial curve to the last few iterations' (x, f(x)) pairs.
For example, the secant method uses the straight line through (x_{k−1}, f(x_{k−1})) and (x_k, f(x_k)) to approximate f; the root of that line becomes the next iterate x_{k+1}.
The convergence rate is approximately 1.618.
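A minimal secant-method sketch (argument names are illustrative), applied to the quadratic f from the earlier example:

secant <- function(f, x0, x1, tol = 1e-4, max_it = 1000, ...) {
  f0 <- f(x0, ...); f1 <- f(x1, ...)
  it <- 0
  while (abs(f1) > tol && it < max_it) {
    x2 <- x1 - f1 * (x1 - x0) / (f1 - f0)  # root of the line through the last two iterates
    x0 <- x1; f0 <- f1                     # slide the window forward
    x1 <- x2; f1 <- f(x1, ...)
    it <- it + 1
  }
  list(solution = x1, value = f1, iterations = it)
}
secant(f, x0 = 2, x1 = 6, a = a, n = n)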
13.2 THE ROOT-FINDING PROBLEM
HIGHER-DEGREE INTERPOLATIONS
It should be no surprise that higher-degree polynomial interpolation may further increase the accuracy in approximating f (and the rate of convergence).
However, computing the root of a polynomial isn't trivial - there may be no unique root, and an analytic root may not exist.
The inverse quadratic interpolation method offers an alternative.
- Idea: for a one-to-one function h, h(x) = 0 implies h⁻¹(0) = x.
- Instead of fitting a quadratic polynomial (h) of the f(x_k)s given the x_ks and finding its root, fit a quadratic polynomial of the x_ks given the f(x_k)s and evaluate it at 0 (see the sketch below).
13.2 THE ROOT-FINDING PROBLEM
INVERSE QUADRATIC INTERPOLATION
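As a rough sketch of a single inverse quadratic interpolation step (Lagrange form of the inverse, evaluated at y = 0), applied to the quadratic f from the earlier example with three trial points:

# fit x as a quadratic function of y = f(x) through three points, then evaluate at y = 0
iqi_step <- function(xa, xb, xc, fa, fb, fc) {
  xa * fb * fc / ((fa - fb) * (fa - fc)) +
    xb * fa * fc / ((fb - fa) * (fb - fc)) +
    xc * fa * fb / ((fc - fa) * (fc - fb))
}
iqi_step(3, 4, 5, f(3, a, n), f(4, a, n), f(5, a, n))  # one step already lands near the root (about 4.19)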
13.2 THE ROOT-FINDING PROBLEM
BRENT’S METHOD
Brent’s method combines bisection with inverse quadratic interpolation to find
the root for f (x).
Specifically:
- Given:
  - a current interval [a, b] containing the solution (i.e., f(a) and f(b) have opposite signs), where b is the current best guess (i.e., |f(b)| ≤ |f(a)|);
  - c and f(c) from a previous iteration.
- Fit an inverse quadratic polynomial through (f(a), a), (f(b), b), (f(c), c) (e.g., using the Lagrange interpolation polynomial).
- If the resulting zero point falls outside [a, b], take a bisection step instead. Otherwise, update the interval with the zero point.
The R function uniroot applies this method.
13.2.3 Brent’s Method
Uniroot
Let’s solve the previous equation, this time with the uniroot function that
employs Brent’s method. This function requires specifying:
- the function for which to find the root (i.e., f)
- a starting bracket - we'll use [0, 5n], as for bisection.
uniroot(f, a = a, n = n, # the fn and other pars
lower = 0, upper= 5*n)
## $root
## [1] 4.18687
##
## $f.root
## [1] 0.0002381408
##
## $iter
## [1] 14
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 6.103516e-05
13.2.3 Brent’s Method
Grid Search
Before we move on, there is also a crude method for root-finding - grid
search.
The idea is simple. At equally spaced values x_1, x_2, . . . , x_N in the interval [a, b]:
- Evaluate f(x_i) for each i ∈ {1, 2, . . . , N}.
- Pick the x_i such that f(x_i) is closest to 0 as the approximate solution.
x <- seq(0,100, length = 10000)
fx <- f(x, a = a, n = n)
root <- x[which.min(abs(fx))]
root
## [1] 4.190419
13.3 NONLINEAR OPTIMIZATION
13.3.1 Unidimensional Optimization
13.3.1 Unidimensional Optimization
Optimization
Let's now consider the closely related problem of optimization.
Suppose we wish to maximize the function
f(x) = log(x + log(x)) / log(1 + x).
A graph shows that the maximum occurs somewhere between 2 and 10.
f <- function(x){
log(x+log(x))/log(1+x)
}
xx <- seq(2,10, length=10000)
fx <- f(xx)
13.3.1 Unidimensional Optimization
plot(xx,fx, type = 'l')
[Figure: line plot of fx against xx, i.e., f(x) on [2, 10]; the curve peaks near x ≈ 5.8.]
13.3.1 Unidimensional Optimization
Grid search of maximum
We can always use a grid search to find the point where f (x) is maximized.
print(xmax <- xx[which.max(fx)])
## [1] 5.792379
However, it can be slow and doesn't extend well to multivariable functions (imagine evaluating 10000 × 10000 × 10000 points in a 3-dimensional space).
13.3.1 Unidimensional Optimization
Optimization as a root-finding problem
An alternative is to turn optimization into the root-finding problem of
solving
f ′(x) = 0.
For this function,
f′(x) = [ log(1 + x)(1 + 1/x)/(x + log(x)) − log(x + log(x))/(x + 1) ] / log²(1 + x).
13.3.1 Unidimensional Optimization
Finding the root of f ′ with uniroot
fprime <- function(x){
num <- (log(1+x)*(1+1/x))/(x+log(x))-(log(x+log(x)))/(x+1)
denom <- (log(1+x))^2
return(num/denom)
}
We can use uniroot to find its root:
unlist(uniroot(fprime, lower = 2, upper = 10))
## root f.root iter init.it estim.prec
## 5.792294e+00 5.731928e-08 7.000000e+00 NA 6.103516e-05
13.3.1 Unidimensional Optimization
Finding the root of f ′ with Newton-Raphson
Strictly speaking, to find the root of f ′ with Newton-Raphson, we need to
find its derivative, i.e., f ′′.
Here we'll numerically approximate the second derivative using numerical differentiation, that is, by evaluating the slope of the line through (x, f′(x)) and (x + h, f′(x + h)) for a very small h, e.g., 10⁻⁷.
We’ll create a generic function for (approximate) univariate
Newton-Raphson root-finding. This function takes a smooth function f as
input and returns a root of the function.
13.3.1 Unidimensional Optimization
NewtonRaph <- function(f,tol=1e-5,x0,max_it=1000){
h <- 1e-7 # used for approximating f'(x)
it <- 0
fx <- f(x0)
x <- x0 # x will be updated
while(it < max_it & abs(fx)>tol){
# approx fprime at x
fpx <- (f(x+h) - fx)/h
# newton-raphson update
x <- x - fx/fpx
# update fx
fx <- f(x)
# increment iteration count
it <- it + 1
}
return(list(solution = x, value = fx, iterations = it))
}
13.3.1 Unidimensional Optimization
Applying Newton-Raphson to find maximum of f
We’ll look for the root of f ′(x) with a starting value of x0 = 2.
NewtonRaph(fprime, x0 = 2)
## $solution
## [1] 5.790378
##
## $value
## [1] 5.271111e-06
##
## $iterations
## [1] 7
13.3.1 Unidimensional Optimization
R optimize Function
Now let's turn to an R function for one-dimensional optimization.
The function optimize searches for a local minimum or maximum in a supplied interval. It uses a combination of golden section search and successive parabolic interpolation.
Note that the input is now f instead of fprime. By default optimize searches for the minimum, so maximum = TRUE is needed.
optimize(f, lower = 2, upper = 10, maximum = TRUE)
## $maximum
## [1] 5.792311
##
## $objective
## [1] 1.055122
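For intuition, here is a minimal golden section search sketch for a maximum (optimize's actual algorithm also mixes in parabolic interpolation, so this is only illustrative):

golden_max <- function(f, lower, upper, tol = 1e-6) {
  phi <- (sqrt(5) - 1) / 2                 # inverse golden ratio, ~0.618
  x1 <- upper - phi * (upper - lower)
  x2 <- lower + phi * (upper - lower)
  while ((upper - lower) > tol) {
    if (f(x1) < f(x2)) {                   # maximum lies in [x1, upper]
      lower <- x1
      x1 <- x2
      x2 <- lower + phi * (upper - lower)
    } else {                               # maximum lies in [lower, x2]
      upper <- x2
      x2 <- x1
      x1 <- upper - phi * (upper - lower)
    }
  }
  (lower + upper) / 2
}
golden_max(f, 2, 10)   # should land near the optimize result above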
13.3 NONLINEAR OPTIMIZATION
13.3.2 Multidimensional Optimization
13.3 NONLINEAR OPTIMIZATION
Optimization becomes more and more challenging when we consider functions of several variables.
An extremely common optimization problem in statistics is finding the parameter of a model that maximizes the likelihood function, i.e., finding the MLE,
θ̂_MLE = argmax_θ L(θ; x),
where θ is either a scalar (1-dimensional) or a vector (multidimensional).
13.3 NONLINEAR OPTIMIZATION
NEWTON-RAPHSON FOR N-DIMENSIONS
The Newton-Raphson method for one-dimensional optimization,
x_{k+1} = x_k − f′(x_k) / f″(x_k),
can be generalized to n dimensions:
- f′(x_k) is generalized to the gradient at x_k:
  ∇f(x_k) = ( ∂f/∂x_1(x_k), . . . , ∂f/∂x_n(x_k) )ᵀ.
13.3 NONLINEAR OPTIMIZATION
NEWTON-RAPHSON FOR N-DIMENSIONS
- f″(x_k) is generalized to the Hessian matrix at x_k, the n × n matrix of second partial derivatives whose (i, j) entry is
  [H_f(x_k)]_{ij} = ∂²f(x_k) / (∂x_i ∂x_j).
- The iterative updating formula becomes
  x_{k+1} = x_k − H_f(x_k)⁻¹ ∇f(x_k).
13.3.2 Multidimensional Optimization
Toy example
Suppose we want to find the maximum of
f(x) = −(1/2)(x₁² − x₂)² − (1/2)(1 − x₁)².
The gradient is
∇f(x) = ( −2x₁(x₁² − x₂) + (1 − x₁),  x₁² − x₂ )ᵀ.
The Hessian is
H_f(x) = [ −6x₁² + 2x₂ − 1   2x₁ ]
         [ 2x₁                −1 ].
13.3.2 Multidimensional Optimization
Functions for f, ∇f, and H_f:
f <- function(x){
-1/2*(x[1]^2-x[2])^2-1/2*(1-x[1])^2
}
# gradient
grad_f <- function(x){
c(-(x[1]^2-x[2])*2*x[1]+(1-x[1]),
x[1]^2-x[2])
}
# hessian
H_f <- function(x){
cbind(c(-6*x[1]^2+2*x[2]-1,2*x[1]),c(2*x[1], -1))
}
We’ll start at the initial value x0 = (5, 5). This time, let’s run 100 iterations.
x <- xs <- c(5,5)
13.3.2 Multidimensional Optimization
fx <- f(x) # f(x_k)
for(it in 1:100){
# compute gradient and Hessian at x_k
grad <- grad_f(x)
H <- H_f(x)
# compute s = H^-1 %*% Grad, i.e., solution to Hs = Grad
s <- solve(H, grad)
# update x
x <- x - s
xs <- rbind(xs,x)
# update f(x_k)
fx <- f(x)
}
13.3.2 Multidimensional Optimization
Results
Final solution and corresponding objective:
x
## [1] 1 1
fx
## [1] 0
An alternative is the built-in optim function which finds the minimum:
optim(par = c(5,5), # initial value
fn = function(x) -f(x))
## $par
## [1] 0.9989039 0.9979473
##
## $value
## [1] 6.102997e-07
##
## $counts
## function gradient
## 85 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
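Since the analytic gradient is already available, we can also hand it to optim and use a gradient-based method such as BFGS (a sketch; note the sign flips, because optim minimizes):

optim(par = c(5, 5),
      fn = function(x) -f(x),        # negated objective
      gr = function(x) -grad_f(x),   # its gradient
      method = "BFGS")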
13.3.2 Multidimensional Optimization
MLE Example
To wrap up, we will see an example of maximum likelihood estimation of
the Gamma distribution shape and rate parameters.
Let x_1, x_2, . . . , x_n be a random sample from a Gamma(r, λ) distribution.
The parameter vector we need to estimate is θ = (r, λ), whose likelihood function is
L(r, λ) = ∏_{i=1}^{n} (λ^r / Γ(r)) x_i^{r−1} e^{−λ x_i}.
13.3.2 Multidimensional Optimization
Log-likelihood
We will work with and maximize the log-likelihood function instead of the
likelihood.
l(r, λ) = log L(r, λ) = Σ_{i=1}^{n} log( (λ^r / Γ(r)) x_i^{r−1} e^{−λ x_i} ).
With some simplification, this becomes
l(r, λ) = n r log(λ) − n log(Γ(r)) + (r − 1) Σ_{i=1}^{n} log(x_i) − λ Σ_{i=1}^{n} x_i.
13.3.2 Multidimensional Optimization
Here’s a function for the negative log-likelihood (recall that optim by
default does minimization.) This is the objective we want to minimize.
LL <- function(theta,x){
n <- length(x)
sx <- sum(x)
slogx <- sum(log(x))
r <- theta[1]
lambda <- theta[2]
loglik<-n*r*log(lambda)-n*log(gamma(r))+
(r-1)*slogx-lambda*sx
return(-loglik)
}
13.3.2 Multidimensional Optimization
Simulation Study
Now, let’s generate some data (n = 100) from Gamma(r = 5, λ = 2) and
see whether the maximum likelihood estimate can recover the true
parameter.
r <- 5
lambda <- 2
x <- rgamma(100, r, lambda)
# MLE - minimize the - log-likelihood
optim(c(1,1), LL, x = x) # c(1,1):init. val of (r,lamb)
## $par
## [1] 4.636733 1.963287
##
## $value
## [1] 143.5552
##
## $counts
## function gradient
## 63 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
13.3.2 Multidimensional Optimization
Monte Carlo Study: Replicate this 10,000 times
Let us conduct a Monte Carlo study with m = 10000 replications.
- In each replication, we'll generate data from Gamma(5, 2) and compute the maximum likelihood estimator.
- We'll look at the approximated sampling distribution of the MLE.
m <- 10000
thetahats_mle <- replicate(m, {
x <- rgamma(100, r, lambda)
optim(c(1,1), LL, x = x)$par # mle
})
dim(thetahats_mle) # each col is the mle of a rep.
## [1] 2 10000
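Before plotting, a quick numeric summary of the approximated sampling distributions (a sketch; the Monte Carlo means can be compared with the true values r = 5 and λ = 2):

rowMeans(thetahats_mle)        # Monte Carlo means of the MLEs of (r, lambda)
apply(thetahats_mle, 1, sd)    # Monte Carlo standard deviations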
13.3.2 Multidimensional Optimization
Plot the densities
par(mfrow = c(1,2))
hist(thetahats_mle[1,],freq = F, main = 'MLE of r',xlab = '')
hist(thetahats_mle[2,],freq = F, main = 'MLE of lambda',
xlab = '',ylab = '')
13.3.2 Multidimensional Optimization
[Figure: side-by-side histograms of the MLE sampling distributions - "MLE of r" (density, values roughly 3 to 9) and "MLE of lambda" (values roughly 1.5 to 3.5).]