STAT - S611
Applied Statistical Computing
Spring 2022
Instructor: Daniel Manrique-Vallier
Section 12: Basic Optimization
Maximum Likelihood Estimation
Consider a data vector $X = (X_1, \dots, X_n)$ with
$$X_i \overset{iid}{\sim} p(\cdot \mid \theta), \quad i = 1, \dots, n,$$
where $\theta$ is a $p \times 1$ column vector. The joint density of $X$ is
$$p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = L_X(\theta).$$
When seen as a function of $\theta$, the function $L_X(\theta)$ is a random function (we
get a different one for each realization of $X$). For a particular
sample, we call this function the "likelihood function".
Maximum Likelihood Estimation
A popular way of estimating $\theta$ from a sample $X$ is to choose the
value (vector) $\hat{\theta}$ that maximizes the likelihood:
$$\hat{\theta} = \arg\max_{\theta} L_X(\theta).$$
This is known as a "maximum likelihood estimator" (MLE).
In general, it's easier to work with the log-likelihood:
$$l_X(\theta) = \log L_X(\theta).$$
Since $\log$ is a monotonic function, any $\theta$ that maximizes $l_X(\theta)$ also
maximizes $L_X(\theta)$.
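As a small illustration (not part of the original slides), the following R sketch finds an MLE numerically; the exponential model, the simulated data, and the use of `optimize` are assumptions made for the example.

```r
## Minimal sketch: numerical MLE for an Exponential(rate) sample.
## The simulated data and the choice of optimizer are illustrative assumptions.
set.seed(1)
x <- rexp(100, rate = 2)            # pretend this is the observed sample

# log-likelihood l_X(rate) for the exponential model
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# maximize over a bracketing interval; the closed-form MLE is 1 / mean(x)
fit <- optimize(loglik, interval = c(1e-6, 50), maximum = TRUE)
c(mle = fit$maximum, closed_form = 1 / mean(x))
```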
Score and Information matrix
▶ The function $\dot{l}_X(\theta)$ is known as the score function.
▶ The MLE often (when $\ddot{l}_X(\theta)\big|_{\theta=\hat{\theta}} < 0$) satisfies the likelihood equation:
$$\dot{l}_X(\theta)\Big|_{\theta=\hat{\theta}} = 0.$$
▶ The Fisher information matrix in $X$ about $\theta$ is defined as
$$I(\theta) = -E_\theta\!\left[\ddot{l}_X(\theta)\right],$$
where the expectation is taken over $X$ assuming $X \sim p(\cdot \mid \theta)$.
Under some mild regularity conditions we have that
$$I(\theta) = E_\theta\!\left[\dot{l}_X(\theta)^2\right].$$
Score and Information matrix
▶ Under regularity conditions, when the true value of $\theta$ is $\theta_0$, the
asymptotic variance of $\hat{\theta}$ is
$$V_{\theta_0}(\hat{\theta}) = \left[-E_\theta\, \ddot{l}_X(\theta)\right]^{-1}\Big|_{\theta_0} = I(\theta_0)^{-1}.$$
▶ This requires $\theta_0$ to be known. When it is unknown, we can estimate the
corresponding standard error via
$$\text{Estimated F.I.:}\quad \left[I(\hat{\theta})\right]^{-1/2}; \qquad \text{Observed F.I.:}\quad \left[-\ddot{l}_x(\hat{\theta})\right]^{-1/2}.$$
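To make the observed-information estimate concrete, here is a minimal R sketch (not from the slides): the normal model, the simulated data, and the use of `optim(..., hessian = TRUE)` for a numerical Hessian are illustrative assumptions.

```r
## Minimal sketch: observed Fisher information for a Normal(mu, sigma^2) sample.
## The data and the (mu, gamma = log sigma^2) parametrization are illustrative assumptions.
set.seed(2)
y <- rnorm(200, mean = 1, sd = 2)

# negative log-likelihood in (mu, gamma), so the parameter space is unconstrained
negloglik <- function(par) {
  mu <- par[1]; sigma <- sqrt(exp(par[2]))
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0), negloglik, method = "BFGS", hessian = TRUE)
fit$par                          # MLE of (mu, gamma)
solve(fit$hessian)               # inverse observed information: estimated asymptotic covariance
sqrt(diag(solve(fit$hessian)))   # observed-information standard errors
```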
Score and Information Matrix
▶ For multivariate $\theta = (\theta_1, \theta_2, \dots, \theta_p)^T$ we have
$$\dot{l}(\theta) = \left(\frac{\partial l(\theta)}{\partial \theta_1}, \dots, \frac{\partial l(\theta)}{\partial \theta_p}\right), \qquad \left[\ddot{l}(\theta)\right]_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i\, \partial \theta_j}.$$
▶ At any point where $l$ achieves its maximum we have that
$\dot{l}(\theta) = 0$ and $\ddot{l}(\theta)$ is negative definite.
Solving the MLE problem: Newton’s method and Scoring
Newton’s Method:
Let $f: \mathbb{R} \to \mathbb{R}$ be twice differentiable and bounded from below. We
want to find $\theta^*$ where $f$ attains its minimum.¹
Newton's method is based on a local quadratic approximation.
Consider a small neighborhood of a current guess, $\theta_k$. We can get
a second-order Taylor approximation:
$$f(\theta) \approx f(\theta_k) + (\theta - \theta_k)f'(\theta_k) + \tfrac{1}{2}(\theta - \theta_k)^2 f''(\theta_k) = f_{\theta_k}(\theta).$$
¹ Although we really want the location of the maximum, the usual
language in optimization is in terms of finding minima. For this we only need to
take the negative of our objective function.
Newton’s method
Instead of trying to minimize $f(\theta)$ directly, we minimize $f_{\theta_k}(\theta)$.
Since this is a quadratic function, its minimum always exists and
can be located by taking the first derivative with respect to $\theta$,
$$\frac{d f_{\theta_k}(\theta)}{d\theta} = f'(\theta_k) + (\theta - \theta_k) f''(\theta_k),$$
and finding the $\theta = \theta_{k+1}$ that makes it equal to zero:
$$f'(\theta_k) + (\theta_{k+1} - \theta_k) f''(\theta_k) = 0
\;\Longrightarrow\; \theta_{k+1} - \theta_k = -\frac{f'(\theta_k)}{f''(\theta_k)}
\;\Longrightarrow\; \theta_{k+1} = \theta_k - \frac{f'(\theta_k)}{f''(\theta_k)}.$$
We now take $\theta_{k+1}$ as our current guess, and find $\theta_{k+2}$ following
the same scheme, continuing until convergence. A minimal R sketch of this one-dimensional update appears below.
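The sketch below is illustrative and not from the slides; the function name, the test objective, and the stopping tolerance are assumptions.

```r
## Minimal sketch of one-dimensional Newton's method.
## Function names, test objective, and tolerance are illustrative assumptions.
newton1d <- function(fprime, fsecond, theta0, tol = 1e-8, maxit = 100) {
  theta <- theta0
  for (k in seq_len(maxit)) {
    step <- fprime(theta) / fsecond(theta)   # Newton step f'(theta) / f''(theta)
    theta <- theta - step
    if (abs(step) < tol) return(list(theta = theta, iter = k))
  }
  warning("maximum number of iterations reached")
  list(theta = theta, iter = maxit)
}

# Example: find a local minimum of f(theta) = theta^4 - 3 theta^2 + theta
fp <- function(t) 4 * t^3 - 6 * t + 1   # f'
fs <- function(t) 12 * t^2 - 6          # f''
newton1d(fp, fs, theta0 = 2)
```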
Multidimensional Target Functions
When $f: \mathbb{R}^p \to \mathbb{R}$, this scheme generalizes to
$$\theta_{k+1} = \theta_k - \left[\nabla^2 f(\theta_k)\right]^{-1} \nabla f(\theta_k),$$
where $\nabla f: \mathbb{R}^p \to \mathbb{R}^p$ and $\nabla^2 f: \mathbb{R}^p \to \mathbb{R}^{p \times p}$ are the gradient and
Hessian matrix of $f$, respectively.
Example
Consider the normal model $y_i \overset{iid}{\sim} N(\mu, \sigma^2)$. The likelihood is
$$p(y_1, \dots, y_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} N\!\left(y_i \mid \mu, \sigma^2\right),$$
so the negative log-likelihood, up to an additive constant, is
$$f(\mu, \sigma^2) = -\sum_{i=1}^{n} \log N\!\left(y_i \mid \mu, \sigma^2\right) = -\sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2}\right)$$
$$\equiv \frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 = \frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right],$$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
Example
Since $\sigma^2 > 0$, we reparametrize to $\gamma = \log\sigma^2$. Then
$$f(\mu, \gamma) = \frac{n}{2}\gamma + \frac{1}{2}e^{-\gamma}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right],$$
$$\nabla f(\mu, \gamma) = \begin{bmatrix} n(\mu - \bar{y})\,e^{-\gamma} \\[4pt] \dfrac{n}{2} - \dfrac{1}{2}e^{-\gamma}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right] \end{bmatrix},$$
$$\nabla^2 f(\mu, \gamma) = \begin{bmatrix} n\,e^{-\gamma} & -n(\mu - \bar{y})\,e^{-\gamma} \\[4pt] -n(\mu - \bar{y})\,e^{-\gamma} & \dfrac{1}{2}e^{-\gamma}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right] \end{bmatrix}.$$
(see R demo)
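The referenced R demo is not reproduced here; the following is a minimal sketch of what applying Newton's method to $f(\mu, \gamma)$ above might look like. The simulated data, starting values, and stopping tolerance are illustrative assumptions.

```r
## Minimal sketch: Newton's method for the normal negative log-likelihood in (mu, gamma).
## Simulated data, starting values, and tolerance are illustrative assumptions.
set.seed(3)
y <- rnorm(100, mean = 5, sd = 3)
n <- length(y); ybar <- mean(y); s2 <- var(y)

grad <- function(th) {
  mu <- th[1]; g <- th[2]
  c(n * (mu - ybar) * exp(-g),
    n / 2 - 0.5 * exp(-g) * ((n - 1) * s2 + n * (ybar - mu)^2))
}
hess <- function(th) {
  mu <- th[1]; g <- th[2]
  matrix(c(n * exp(-g),                -n * (mu - ybar) * exp(-g),
           -n * (mu - ybar) * exp(-g), 0.5 * exp(-g) * ((n - 1) * s2 + n * (ybar - mu)^2)),
         nrow = 2, byrow = TRUE)
}

theta <- c(ybar, 0)                           # starting values: sample mean and gamma = 0
for (k in 1:50) {
  step <- solve(hess(theta), grad(theta))     # [Hessian]^{-1} gradient
  theta <- theta - step
  if (sqrt(sum(step^2)) < 1e-10) break
}
c(mu = theta[1], sigma2 = exp(theta[2]))      # compare with mean(y) and (n - 1) / n * var(y)
```

Plain Newton can fail from a poor starting value (see the Issues slide that follows); the start above is deliberately chosen close enough for convergence.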
Issues
▶ Convergence is not guaranteed!
▶ Not a descent method: $f(\theta_{k+1}) < f(\theta_k)$ is not guaranteed.
▶ May converge to a local minimum (unavoidable; all methods
suffer from this problem).
▶ Unstable if $\nabla^2 f(\theta)$ is close to singular.
Positive definite approximations
▶ We can enforce a descent algorithm by replacing $\nabla^2 f(\theta_k)$ with
a positive-definite matrix $A_k$:
$$\Delta\theta_k = -A_k^{-1} \nabla f(\theta_k).$$
Reminder: $A$ is positive-definite if for any column vector
$y_{p \times 1} \neq 0$ we have $y^T A y > 0$.
▶ Then a sufficiently small movement in the direction of $\Delta\theta_k$
guarantees a decrease in $f(\theta)$:
$$f(\theta_k + \alpha\Delta\theta_k) - f(\theta_k) = \nabla f(\theta_k)^T(\alpha\Delta\theta_k) + o(\alpha) = -\alpha \underbrace{\nabla f(\theta_k)^T A_k^{-1} \nabla f(\theta_k)}_{y^T A_k^{-1} y \,>\, 0} + o(\alpha),$$
which is negative for sufficiently small $\alpha > 0$. This requires identifying some
appropriate $\alpha$.
Step-halving
A simple way of obtaining the constant $\alpha$ is known as
step-halving. At iteration $k$:
1. Let $\Delta\theta_k \leftarrow -A_k^{-1} \nabla f(\theta_k)$.
2. $\alpha \leftarrow 1$.
3. $\theta^* \leftarrow \theta_k + \alpha\,\Delta\theta_k$.
4. If $f(\theta^*) \geq f(\theta_k)$, set $\alpha \leftarrow \alpha/2$ and go back to step 3.
5. Take $\theta_{k+1} = \theta^*$.
A minimal R sketch of this procedure is given below.
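The sketch below is illustrative; the function name `step_halving_update`, its arguments, and the cap on the number of halvings are assumptions, not part of the slides.

```r
## Minimal sketch of one damped (step-halving) update, given the objective f,
## the current gradient, and a positive-definite matrix A_k.
step_halving_update <- function(f, grad_k, A_k, theta_k, max_halvings = 30) {
  delta <- -solve(A_k, grad_k)            # descent direction Delta theta_k = -A_k^{-1} grad
  alpha <- 1
  theta_star <- theta_k + alpha * delta
  h <- 0
  while (f(theta_star) >= f(theta_k) && h < max_halvings) {
    alpha <- alpha / 2                    # halve the step until f decreases
    theta_star <- theta_k + alpha * delta
    h <- h + 1                            # give up shrinking after max_halvings halvings
  }
  theta_star
}
```

In practice $A_k$ might be the expected information (as in scoring, next) or some other positive-definite modification of the Hessian.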
Scoring:
▶ Remember the form of the Newton update:
$$\theta_{k+1} = \theta_k - \left[\nabla^2 f(\theta_k)\right]^{-1} \nabla f(\theta_k).$$
▶ In MLE problems our objective function is $f(\theta) = -l_x(\theta)$.
Therefore
$$\nabla^2 f(\theta) = -\nabla^2 l_x(\theta) = -\nabla^2 \log \prod_{i=1}^{n} p(x_i \mid \theta) = -\sum_{i=1}^{n} \nabla^2 \log p(x_i \mid \theta)$$
$$\approx n\, E\!\left[-\nabla^2 \log p(x_i \mid \theta)\right] \quad \text{(law of large numbers)}$$
$$= I_n(\theta) \quad \text{(expected information matrix!)}.$$
▶ Since the expected information matrix is always positive
definite, we can take it as our matrix $A_k$.
Example
Let's come back to our normal model $y_i \overset{iid}{\sim} N(\mu, \sigma^2)$, with the
reparametrization $\gamma = \log\sigma^2$. The single-observation contribution to the
Hessian of the objective (the negative log-likelihood) is
$$-\nabla^2 l_1(\mu, \gamma) = \begin{pmatrix} e^{-\gamma} & e^{-\gamma}(y - \mu) \\ e^{-\gamma}(y - \mu) & \frac{1}{2}e^{-\gamma}(y - \mu)^2 \end{pmatrix},$$
therefore
$$E\!\left[-\nabla^2 l_1\right] = \begin{pmatrix} e^{-\gamma} & 0 \\ 0 & \frac{1}{2} \end{pmatrix} \;\Longrightarrow\; I_n(\theta)^{-1} = n^{-1}\begin{pmatrix} e^{\gamma} & 0 \\ 0 & 2 \end{pmatrix}.$$
(Note that since we have taken the expectation, $y$ no longer appears.)
Example
(R Demo here)
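The original demo is not included; a minimal sketch of a Fisher scoring iteration for this model might look as follows. The simulated data, starting values, and tolerance are assumptions.

```r
## Minimal sketch: Fisher scoring for the normal model in (mu, gamma = log sigma^2).
## Simulated data, starting values, and tolerance are illustrative assumptions.
set.seed(4)
y <- rnorm(100, mean = 5, sd = 3)
n <- length(y); ybar <- mean(y); s2 <- var(y)

grad <- function(th) {                        # gradient of the negative log-likelihood
  mu <- th[1]; g <- th[2]
  c(n * (mu - ybar) * exp(-g),
    n / 2 - 0.5 * exp(-g) * ((n - 1) * s2 + n * (ybar - mu)^2))
}
info <- function(th) {                        # expected information: n * diag(e^{-gamma}, 1/2)
  g <- th[2]
  n * diag(c(exp(-g), 0.5))
}

theta <- c(0, 0)                              # rough starting guess
for (k in 1:100) {
  step <- solve(info(theta), grad(theta))     # scoring replaces the Hessian by I_n(theta)
  theta <- theta - step
  if (sqrt(sum(step^2)) < 1e-10) break
}
c(mu = theta[1], sigma2 = exp(theta[2]))
```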
Parenthesis: stopping rules in iterative procedures
▶ Successive-approximation iterative procedures, like Newton's
method, consist of starting from an initial guess, $x_0$, and calculating
the sequence $x_0, x_1, \dots, x_n$ using a recursive rule $x_{k+1} = s(x_k)$,
until either:
1. we succeed in finding a good enough approximation, or
2. we determine that we're going nowhere, so we need to give up.
▶ In either case, we need a rule so that our software
implementation knows when to stop.
▶ The way of determining "success" is usually to check whether the
change from $x_k$ to $x_{k+1}$ is too small to matter. If there's
no significant change we take the latest term as our answer.
▶ In order not to get caught in an infinite loop if things fail, we
set up a maximum number of iterations, after which we
declare failure.
Stopping rules in iterative procedures
There are several ways of checking “convergence”.
▶ When the distance between $x_{k+1}$ and $x_k$ is small enough: for
this we need to use some distance function and determine the
meaning of "small enough". We can do this by evaluating the
  ▶ Relative distance: $\dfrac{\|x_{k+1} - x_k\|}{\|x_k\|} < \epsilon$, or equivalently $\|x_{k+1} - x_k\| < \epsilon\,\|x_k\|$.
  ▶ Absolute distance: $\|x_{k+1} - x_k\| < \epsilon$.
    This is sometimes preferred when the $x_k$ are expected to
    be in the vicinity of zero.
▶ A possible reference for picking the value of $\epsilon$ is the machine
epsilon.
Stopping rules in iterative procedures
▶ Common choices for multivariate distance functions are:
$$\|y\|_1 = \sum_{i=1}^{p} |y_i|, \qquad \|y\|_2 = \left[\sum_{i=1}^{p} y_i^2\right]^{1/2}, \qquad \|y\|_\infty = \max_i |y_i|.$$
▶ Another alternative is to check for changes in the objective
function itself:
$$|f(x_{k+1}) - f(x_k)| < \epsilon \ \text{(absolute)}, \qquad |f(x_{k+1}) - f(x_k)| < \epsilon\,|f(x_k)| \ \text{(relative)}.$$
A minimal R helper implementing such checks is sketched below.
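The helper below is illustrative; the function name, the default tolerance, and the choice of the $L_2$ norm are assumptions.

```r
## Minimal sketch of a stopping-rule helper.
## Function name, default tolerance, and the L2 norm are illustrative assumptions.
converged <- function(x_new, x_old, eps = 1e-8, relative = TRUE) {
  dist <- sqrt(sum((x_new - x_old)^2))          # L2 distance between iterates
  if (relative) dist < eps * sqrt(sum(x_old^2)) else dist < eps
}

# Typical usage inside an iterative procedure (with a maximum-iteration guard):
# for (k in 1:maxit) {
#   x_new <- s(x_old)                 # one step of the recursion x_{k+1} = s(x_k)
#   if (converged(x_new, x_old)) break
#   x_old <- x_new
# }
```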
Generalized linear models
▶ $p(y \mid \theta)$ is an exponential family if
$$p(y \mid \theta) = \exp\left[\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\right],$$
where $a(\cdot)$, $b(\cdot)$, and $c(\cdot, \cdot)$ are known functions, $\psi$ is a known
parameter, and $\theta$ is the unknown parameter.
▶ Examples of exponential families are the binomial, Poisson,
normal, and gamma distributions.
Examples:
$$\text{Poisson}(y \mid \lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = \exp\Big[\, y\underbrace{\log\lambda}_{\theta} - \underbrace{\lambda}_{b(\theta)} \underbrace{-\log y!}_{c(y,\psi)} \,\Big]$$
$$\text{Bernoulli}(y \mid p) = p^y(1-p)^{1-y} = \exp\Big[\, y\underbrace{\log\tfrac{p}{1-p}}_{\theta} + \underbrace{\log(1-p)}_{-b(\theta)} \,\Big].$$
Generalized linear models
▶ Remember the linear model:
$$y_i = x_i^T \beta + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$
for $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $\beta = (\beta_0, \beta_1, \dots, \beta_{p-1})$.
▶ Another way of writing the same model is
$$y_i \overset{ind}{\sim} N(\theta_i, \sigma^2), \quad \text{where } \theta_i = x_i^T \beta.$$
▶ We can generate other models by replacing the normal
distribution with other distributions. This results in. . .
Generalized Linear Models
▶ Consider the model
$$y_i \overset{ind}{\sim} EF(\theta_i),$$
where $EF$ is an exponential family of distributions indexed by
its natural parameter $\theta_i$, with mean $\mu(\theta_i)$.
▶ We link the predictors $x_i$ to $\theta_i$ using a link function:
$$G\big(\mu(\theta_i)\big) = x_i^T \beta,$$
where $\beta$ are unknown coefficients.
▶ This model is known as a Generalized Linear Model with link
function $G$.
Generalized Linear Models
▶ When the link function is $G = \mu^{-1}$ we call it the canonical
link. When that's the case,
$$\theta_i = x_i^T \beta$$
and
$$\mu(\theta_i) = E[y_i \mid \theta_i] = G^{-1}(x_i^T \beta).$$
Example: Poisson Regression
We already showed that the Poisson distribution is an exponential
family:
$$\text{Poisson}(y \mid \lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = \exp\Big[\, y\underbrace{\log\lambda}_{\theta} - \underbrace{\lambda}_{b(\theta)} \underbrace{-\log y!}_{c(y,\psi)} \,\Big].$$
Since $E[y \mid \lambda] = \lambda$, we have $\mu(\theta) = e^{\theta}$. Therefore the canonical link is
$G(\cdot) = \log(\cdot)$.
The Poisson regression model is based on setting
$$\theta_i = x_i^T \beta \;\Longrightarrow\; \lambda_i = e^{x_i^T \beta},$$
so that
$$y_i \overset{ind}{\sim} \text{Poisson}\!\left(e^{x_i^T \beta}\right).$$
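As a quick illustration (not from the slides), the sketch below fits a Poisson regression with R's built-in `glm`; the simulated data and the true coefficient values are assumptions.

```r
## Minimal sketch: Poisson regression on simulated data.
## The data-generating values and the use of glm() are illustrative assumptions.
set.seed(5)
n <- 500
x <- rnorm(n)
beta_true <- c(0.5, 0.8)                      # (intercept, slope), assumed for the simulation
lambda <- exp(beta_true[1] + beta_true[2] * x)
y <- rpois(n, lambda)

# Fit with R's built-in IRLS implementation, using the canonical log link
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)
```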
Computing the MLE of GLMs
▶ Let's assume a GLM with canonical link. Then the likelihood
is
$$L(\beta) = \prod_{i=1}^{n} p(y_i \mid \theta_i) = \prod_{i=1}^{n} \exp\left[\frac{y_i\theta_i - b(\theta_i)}{a(\psi)} + c(y_i, \psi)\right],$$
where $\theta_i = x_i^T \beta$.
▶ Our objective function is $f(\beta) = -\log L(\beta)$. To apply
Newton's method we need this function and its derivatives. Dropping
constants that do not involve $\beta$ (and taking $a(\psi) = 1$),
$$f(\beta) = -\sum_{i=1}^{n} \big(y_i\theta_i - b(\theta_i)\big),$$
$$\nabla f(\beta) = -\sum_{i=1}^{n} \big[y_i - b'(\theta_i)\big]\, x_i,$$
$$\nabla^2 f(\beta) = \sum_{i=1}^{n} b''(\theta_i)\, x_i x_i^T.$$
Computing the MLE of GLMs
▶ For exponential families we have
$$b'(\theta) = \mu(\theta) = E_\theta[y], \qquad b''(\theta) = \frac{\operatorname{Var}_\theta[y]}{a(\psi)}.$$
▶ Let's write
$$\mu_i = b'(\theta_i) \quad \text{and} \quad V_i = b''(\theta_i).$$
▶ Then we have. . .
Computing the MLE of GLMs
. . . the expressions
$$\nabla f(\beta) = -X^T W \tilde{y}, \qquad \nabla^2 f(\beta) = X^T W X,$$
where
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad \tilde{y} = \begin{pmatrix} \frac{y_1 - \mu_1}{V_1} \\ \frac{y_2 - \mu_2}{V_2} \\ \vdots \\ \frac{y_n - \mu_n}{V_n} \end{pmatrix}, \qquad W = \begin{pmatrix} V_1 & & 0 \\ & \ddots & \\ 0 & & V_n \end{pmatrix}.$$
Computing the MLE of GLMs
▶ Finally, the Newton update is:
$$\beta_{k+1} = \beta_k - \left[\nabla^2 f(\beta_k)\right]^{-1} \nabla f(\beta_k) = \beta_k + \left[X^T W_k X\right]^{-1} X^T W_k \tilde{y}_k.$$
▶ This amounts to solving a weighted least squares problem with
response $\tilde{y}_k$, design $X$, and weights $W_k$ (next slide).
▶ Note that $\nabla^2 f(\beta)$ doesn't depend on the $y_i$, so this coincides with
Fisher's scoring algorithm. (A sketch of one such IRLS iteration appears below.)
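The sketch below is an illustrative IRLS loop for a canonical-link GLM, written out for the Poisson case (so $b'(\theta) = b''(\theta) = e^{\theta}$); the simulated data, starting values, and tolerance are assumptions.

```r
## Minimal sketch of the IRLS / Newton iteration for a canonical-link GLM (Poisson case).
## Simulated data, starting values, and tolerance are illustrative assumptions.
set.seed(6)
n <- 500
X <- cbind(1, rnorm(n))                       # design matrix with intercept
beta_true <- c(0.5, 0.8)
y <- rpois(n, exp(X %*% beta_true))

beta <- c(log(mean(y)), rep(0, ncol(X) - 1))  # common starting values
for (k in 1:50) {
  eta <- drop(X %*% beta)                     # theta_i = x_i^T beta (canonical link)
  mu  <- exp(eta)                             # b'(theta_i)
  V   <- mu                                   # b''(theta_i)
  ytilde <- (y - mu) / V
  W <- diag(V)                                # W_k, as on the slide
  step <- solve(t(X) %*% W %*% X, t(X) %*% W %*% ytilde)
  beta <- beta + drop(step)
  if (sqrt(sum(step^2)) < 1e-10) break
}
beta                                          # compare with coef(glm(y ~ X[, 2], family = poisson))
```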
Bonus: Solving the weighted least squares problem:
▶ In general, the expression
$$\left[X^T W X\right]^{-1} X^T W y \tag{1}$$
is the solution to a least squares problem with response $y$,
design $X$, and weights $W$.
▶ In order to solve this problem, we note that since $W$ is
diagonal, we can decompose it as
$$W = W^{1/2} W^{1/2} = \begin{pmatrix} w_1^{1/2} & & 0 \\ & \ddots & \\ 0 & & w_n^{1/2} \end{pmatrix} \begin{pmatrix} w_1^{1/2} & & 0 \\ & \ddots & \\ 0 & & w_n^{1/2} \end{pmatrix}.$$
▶ Thus we can express (1) as the solution to the LS problem
$X^* \beta = y^*$, where $X^* = W^{1/2} X$ and $y^* = W^{1/2} y$, and solve it
using an appropriate method such as the QR decomposition (see the sketch below).
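The sketch below illustrates this device on random data; the dimensions, the weights, and the comparison against the normal-equation form are assumptions made for the example.

```r
## Minimal sketch: solving the weighted least squares problem (1) via QR.
## The random data and weights are illustrative assumptions.
set.seed(7)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
y <- rnorm(n)
w <- runif(n, 0.5, 2)                         # positive weights (diagonal of W)

Xs <- sqrt(w) * X                             # X* = W^{1/2} X (row-wise scaling)
ys <- sqrt(w) * y                             # y* = W^{1/2} y
beta_qr <- qr.coef(qr(Xs), ys)                # least squares solution via the QR decomposition

# Check against the normal-equation form [X^T W X]^{-1} X^T W y
beta_ne <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
cbind(qr = beta_qr, normal_eq = drop(beta_ne))
```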
Example: logistic regression
The logistic regression model is based on the following conditional
distribution:
$$y_i \overset{ind}{\sim} \text{Bernoulli}(\theta_i),$$
where $\theta_i$ is the natural parameter of the exponential family form
of the distribution, $\theta_i = x_i^T \beta$. Remember that
$$\text{Bernoulli}(y \mid p) = p^y(1-p)^{1-y} = \exp\Big[\, y\underbrace{\log\tfrac{p}{1-p}}_{\theta} + \underbrace{\log(1-p)}_{-b(\theta)} \,\Big].$$
Therefore
$$\theta_i = \log\frac{p_i}{1 - p_i} = x_i^T \beta, \qquad \text{and} \qquad p_i = \frac{e^{\theta_i}}{1 + e^{\theta_i}}.$$
Logistic regression
From the canonical expression we get that
$$b(\theta_i) = -\log(1 - p_i) = -\log\left(1 - \frac{e^{\theta_i}}{1 + e^{\theta_i}}\right) = \log\left(1 + e^{\theta_i}\right),$$
and
$$b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = p_i, \qquad b''(\theta_i) = \frac{e^{\theta_i}}{\left(e^{\theta_i} + 1\right)^2} = p_i(1 - p_i).$$
Thus,
$$W^{(k)} = \operatorname{diag}\big(p_i(1 - p_i)\big), \qquad \tilde{y}_i^{(k)} = \frac{y_i - p_i}{p_i(1 - p_i)},$$
with the $p_i$ evaluated at the current iterate $\beta_k$. A minimal R sketch of the
resulting IRLS iteration is given below.
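The sketch below is illustrative; the simulated data, starting values, and tolerance are assumptions. It avoids forming the $n \times n$ matrix $W^{(k)}$ explicitly by scaling rows with the weight vector.

```r
## Minimal sketch: IRLS for logistic regression with the canonical (logit) link.
## Simulated data, starting values, and tolerance are illustrative assumptions.
set.seed(8)
n <- 500
X <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1.2)
p_true <- 1 / (1 + exp(-drop(X %*% beta_true)))
y <- rbinom(n, 1, p_true)

beta <- rep(0, ncol(X))
for (k in 1:50) {
  eta <- drop(X %*% beta)
  p <- 1 / (1 + exp(-eta))                    # p_i = e^{theta_i} / (1 + e^{theta_i})
  V <- p * (1 - p)                            # b''(theta_i): diagonal of W^{(k)}
  ytilde <- (y - p) / V
  step <- solve(t(X) %*% (V * X), t(X) %*% (V * ytilde))   # [X^T W X]^{-1} X^T W ytilde
  beta <- beta + drop(step)
  if (sqrt(sum(step^2)) < 1e-8) break
}
beta                                          # compare with coef(glm(y ~ X[, 2], family = binomial))
```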
