STA303H1: Methods of Data Analysis II
(Lecture 4)
Mohammad Kaviul Anam Khan
PhD (candidate) in Biostatistics
sta303@utoronto.ca
To summarize
• In categorical data analysis our outcome (response) is categorical or discrete
• So far we have assumed that the covariate (independent variable) is also categorical
• We can also deal with a third variable and investigate whether the third variable is a
confounder or an interaction variable
• Interaction means that the third variable modifies the effect of our exposure
• What about a fourth, fifth or sixth variable?
• In real-life data there is almost always a large number of variables
• We learned how to measure the association. But what about prediction?
• What about continuous independent variables??
Generalized Linear Models (GLM)
Regression of Binary Variable
• What is regression??
• Let Y be a continuous response and X a continuous covariate
• Then we can assume the relationship E(Y|X) = β0 + β1X
• The residuals Y − E(Y|X) have mean 0 and equal variance (homoscedasticity)
• Now assume Y is a binary variable. Can we still have E(Y|X) = β0 + β1X?
• Since Y = 0, 1, we need 0 < E(Y|X) < 1 ⇒ 0 < β0 + β1X < 1
• This constraint may not hold
• Recall that E(Y|X) = π is a probability for a binary variable
• What if we take the log of this? That is, log(E(Y|X)) = β0 + β1X
• Since 0 < E(Y|X) < 1, then −∞ < β0 + β1X < 0
• Then what should the approach be?
Regression of Binary Variable
• Recall E(Y|X) can be defined as a risk, i.e., E(Y|X) = P(Y = 1|X)
• Due to the restricted range (−∞, 0), it is difficult to model the risk on the log scale
• But what about the odds, Ω = E(Y|X) / (1 − E(Y|X))?
• The log odds then have range (−∞, ∞)
• Thus, we can model
  log( E(Y|X) / (1 − E(Y|X)) ) = β0 + β1X
• This is the form of the famous logistic regression
• The log-odds link is also called the ‘logit’ link (What is a link function?); a small numerical sketch of the logit transform follows below
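• A minimal numpy sketch (illustrative only; the probabilities below are arbitrary) showing that the logit maps (0, 1) onto the whole real line and that the inverse logit (sigmoid) maps it back:

import numpy as np

def logit(p):
    # log-odds: maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))

def inv_logit(eta):
    # inverse logit (sigmoid): maps any real number back into (0, 1)
    return 1 / (1 + np.exp(-eta))

p = np.array([0.01, 0.25, 0.50, 0.75, 0.99])
eta = logit(p)                # logit(0.5) = 0, logit(0.99) ≈ 4.6
print(eta)
print(inv_logit(eta))         # recovers the original probabilities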
Logistic Regression
Logistic Regression
• Continuing our previous discussion, let Y = 0, 1 be a binary outcome, X = 0, 1
our exposure of interest, and Z = 0, 1 a third variable
• Consider the model
  log( E(Y|X,Z) / (1 − E(Y|X,Z)) ) = β0 + β1X + β2Z
• When Z = 0, the odds ratio is θZ=0 = exp(β1)
• When Z = 1, the odds ratio is θZ=1 = exp(β1)
• Thus the interpretation of exp(β1) is: when Z is fixed at a constant value,
the odds ratio between X = 1 and X = 0 is exp(β1)
Logistic Regression
• However, when interaction exists, the model is,
  log( E(Y|X,Z) / (1 − E(Y|X,Z)) ) = β0 + β1X + β2Z + β12XZ
• When Z = 0, the odds ratio is θZ=0 = exp(β1)
• When Z = 1, the odds ratio is θZ=1 = exp(β1 + β12)
• The odds ratios have to be interpreted separately for the levels of Z
• The interpretation of exp(β12) is how much the odds ratio changes across the levels
of Z
• Often referred to as the ratio of odds ratios (ratio-in-ratio parameter)
• How do we estimate the βs?
• First, start with linear models
Least Squares Estimates
Linear Models
• Assume the following linear regression model,
Y = Xβ + ϵ
• Here, Y is an n × 1 vector, X is an n × p matrix of covariates, and β is a p × 1
vector of regression coefficients
• To estimate β, the target is to minimize ϵᵀϵ w.r.t. β, i.e.,
  β̂ = argmin_β (Y − Xβ)ᵀ(Y − Xβ)
  This is called the ordinary least squares (OLS) equation
• The least squares estimates are β̂ = (XᵀX)⁻¹XᵀY (a small numerical sketch follows below)
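• A minimal numpy sketch of the OLS closed form (simulated data; the seed and coefficient values are made up for illustration):

import numpy as np

rng = np.random.default_rng(303)                             # arbitrary seed
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # n x p design with an intercept
beta_true = np.array([1.0, 2.0, -0.5])                       # made-up coefficients
y = X @ beta_true + rng.normal(size=n)                       # Y = X beta + eps

# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                              # should be close to beta_true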
Linear Models
• Recall that the OLS method produces the same estimates as the MLE under
Y ∼ N(Xβ, σ²I), which is equivalent to assuming ϵ ∼ N(0, σ²I), where I is the n × n
identity matrix. The variances of the estimates, however, are different (Gauss-Markov
assumptions)
• However, OLS cannot be used for Generalized Linear Models (GLMs),
since in most cases Y is not continuous
• Thus estimation for GLMs is carried out with MLE
• But before understanding the estimation procedure we need to understand a few
related topics, such as the link function and exponential families
• That is the goal for this lecture
Exponential Families
Exponential Families
• Let Y ∼ fY(y; θ, ϕ). If fY belongs to the exponential family, then it can be written as,
  fY(y; θ, ϕ) = exp[ (yθ − b(θ)) / a(ϕ) + c(y, ϕ) ]
• Here,
• θ is a canonical (natural) parameter
• a(.), b(.) and c(.), are known functions
• ϕ is a dispersion parameter
Normal Distribution
For normal distribution we know,
  fY(y; µ, σ²) = (1/√(2πσ²)) exp[ −(y − µ)² / (2σ²) ]
             = exp[ (yµ − µ²/2) / σ² − (1/2)( y²/σ² + log(2πσ²) ) ]
Exponential Families (Example)
Normal Distribution
• Here, θ = µ, and,
• b(θ) = θ²/2, a(ϕ) = σ², c(y, ϕ) = −(1/2)( y²/σ² + log(2πσ²) )
• Now, for the general case, the log-likelihood is
  L = log(fY(y; θ, ϕ)) = (yθ − b(θ)) / a(ϕ) + c(y, ϕ)
• According to Bartlett’s first identity for the exponential family, E(∂L/∂θ) = 0
• According to Bartlett’s second identity,
  E(∂²L/∂θ²) + E[(∂L/∂θ)²] = 0
Exponential Families
• Here,
  E(∂L/∂θ) = (µ − b′(θ)) / a(ϕ) = 0
  ⇒ µ = b′(θ)
• Using the second identity,
  E(∂²L/∂θ²) + E[(∂L/∂θ)²] = 0
  ⇒ −b″(θ)/a(ϕ) + Var(Y)/a²(ϕ) = 0
  ⇒ Var(Y) = b″(θ)a(ϕ)
• Often Var(Y) is written as V(µ) = b″(θ)a(ϕ). V(µ) is a function of µ, which
can be constant w.r.t. µ, e.g., for the normal distribution.
Exponential Families
• For Y = 0, 1, 2, ... having a Poisson(µ) distribution,
• fY(y, µ) = exp(−µ)µ^y / y! = exp( −µ + y log(µ) − log(y!) )
• Here θ = log(µ)
• b(θ) = µ = exp(log(µ)) = exp(θ)
• a(ϕ) = 1
• E(Y) = b′(θ) = exp(θ) = µ
• V(µ) = b″(θ)a(ϕ) = µ
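• A quick numerical check (illustrative only; µ = 3.7 and the seed are arbitrary) that for the Poisson both the mean and the variance equal b′(θ) = b″(θ) = exp(θ) = µ:

import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
mu = 3.7
y = rng.poisson(mu, size=200_000)
print(y.mean(), y.var())                  # both close to mu

# numerical derivatives of b(theta) = exp(theta) at theta = log(mu)
theta, h = np.log(mu), 1e-4
b = np.exp
print((b(theta + h) - b(theta - h)) / (2 * h))              # b'(theta)  ~ mu
print((b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2)  # b''(theta) ~ mu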
Exponential Families
• For Y = 0, 1 having a Bernoulli(π) distribution,
• fY(y, π) = π^y (1 − π)^(1−y) = exp( y log(π/(1 − π)) + log(1 − π) )
• Here θ = log( π/(1 − π) )
• b(θ) = log(1 + exp(θ)) (HOW??)
• a(ϕ) = 1
• E(Y) = b′(θ) = exp(θ) / (1 + exp(θ)) = π
• V(µ) = b″(θ) = exp(θ) / (1 + exp(θ))² = π(1 − π)
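• One way to answer the “HOW??” above: since θ = log(π/(1 − π)), we have π = exp(θ)/(1 + exp(θ)) and 1 − π = 1/(1 + exp(θ)), so log(1 − π) = −log(1 + exp(θ)). Substituting into the density,
  fY(y, π) = exp( yθ + log(1 − π) ) = exp( yθ − log(1 + exp(θ)) ),
  which matches the exponential-family form with b(θ) = log(1 + exp(θ)) and a(ϕ) = 1.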
Link Function
Link Function
• Going back to regression models, we need to model Y on the covariates X
• As discussed before, logistic regression has a logit link and linear
regression has an identity link
• What is a link??
• Let’s assume the linear predictor is η = Xβ
Link Function
The link function is the function that relates the linear predictor η to the expected value µ
• Let’s define a function g(.), for which, g(µ) = η
• For canonical parameters we can have θ = g(µ) = η = Xβ. Thus, this specific
function is called a Canonical Link
• For the Binomial model θ = log( π/(1 − π) ) and for the Poisson model θ = log(µ) are the
canonical links
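• A small numpy sketch (the dictionary below is my own illustration, not lecture code) pairing each canonical link g with its inverse g⁻¹ for the two families above:

import numpy as np

# canonical link g(mu) = theta and its inverse, for the two families used in this lecture
links = {
    "binomial": (lambda mu: np.log(mu / (1 - mu)),      # logit link
                 lambda eta: 1 / (1 + np.exp(-eta))),   # inverse logit
    "poisson":  (lambda mu: np.log(mu),                 # log link
                 lambda eta: np.exp(eta)),              # exp
}

for family, mu in [("binomial", 0.3), ("poisson", 4.2)]:
    g, g_inv = links[family]
    print(family, g(mu), g_inv(g(mu)))                  # g_inv(g(mu)) recovers mu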
Likelihood for GLM
Likelihood for GLM
• Let’s assume we have n independent observations of (Y ,X)
• For the i-th observation the linear predictor is ηi = Xiβ = ∑j βjxij
• The log-likelihood L(β) = ∑i Li = ∑i log(fY(yi; θi, ϕ)) would be,
  ∑i (yiθi − b(θi)) / a(ϕ) + ∑i c(yi, ϕ)
• Note that θi can vary with the covariates but ϕ is fixed (recall the assumptions of the linear
model)
• Then the score function (estimating function) w.r.t. βj is
  ∂L(β)/∂βj = ∑i ∂Li/∂βj = 0; ∀j
Likelihood for GLM
• To differentiate the log-likelihood a chain rule is used,
  ∂Li/∂βj = (∂Li/∂θi)(∂θi/∂µi)(∂µi/∂ηi)(∂ηi/∂βj)
• Recall, b′(θi) = µi and Var(Y) = b″(θ)a(ϕ)
• Thus,
  ∂Li/∂θi = (yi − µi) / a(ϕ)
  ∂θi/∂µi = a(ϕ) / Var(Yi)
  ∂ηi/∂βj = xij
  ∂Li/∂βj = [(yi − µi)/a(ϕ)] · [a(ϕ)/Var(Yi)] · (∂µi/∂ηi) · xij = [(yi − µi)xij / Var(Yi)] · ∂µi/∂ηi
MLE for GLM
• Recall to obtain the MLE for GLM we need to solve,
  ∑_{i=1}^{n} [(yi − µi)xij / Var(Yi)] · ∂µi/∂ηi = 0
• Although the equation does not appear to be a function of βj, we can use
  µi = g⁻¹(∑j βjxij)
• For Poisson, µi = exp(∑j βjxij)
• For Binomial (logistic), µi = πi = exp(∑j βjxij) / (1 + exp(∑j βjxij))
  (a small numerical sketch of the resulting score follows below)
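• A numpy sketch (hypothetical β values and simulated data, my own choices) evaluating the score for the logistic model, and checking that the general GLM score collapses to Xᵀ(y − π), as derived on a later slide:

import numpy as np

rng = np.random.default_rng(42)                     # arbitrary seed
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.2, -0.8])                        # hypothetical coefficient values

eta = X @ beta
pi = 1 / (1 + np.exp(-eta))                         # mu_i for the logistic model
y = rng.binomial(1, pi)

# general GLM score for beta_j: sum_i (y_i - mu_i) x_ij / Var(Y_i) * d mu_i / d eta_i
var_y = pi * (1 - pi)
dmu_deta = pi * (1 - pi)                            # logit link: d pi / d eta = pi (1 - pi)
u_general = X.T @ ((y - pi) / var_y * dmu_deta)

u_simplified = X.T @ (y - pi)                       # the simplified form shown later
print(np.allclose(u_general, u_simplified))         # True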
Asymptotic Covariances
• Using Bartlett’s second identity for the multivariate case,
  −E[ ∂²Li / (∂βh∂βj) ] = E[ (∂Li/∂βh)(∂Li/∂βj) ]
  = E[ ((Yi − µi)xih / Var(Yi)) · (∂µi/∂ηi) · ((Yi − µi)xij / Var(Yi)) · (∂µi/∂ηi) ]
  = [xihxij / Var(Yi)] · (∂µi/∂ηi)² · E[ (Yi − µi)² / Var(Yi) ]
  = [xihxij / Var(Yi)] · (∂µi/∂ηi)²
• Thus,
  −E[ ∂²L(β) / (∂βh∂βj) ] = ∑_{i=1}^{n} [xihxij / Var(Yi)] · (∂µi/∂ηi)²
Asymptotic Covariances
• Thus, the Fisher information matrix can be defined as
  J = −E[ ∂²L(β) / (∂β∂βᵀ) ] = XᵀWX
• where W = diag(w1, w2, ..., wn) is a diagonal matrix with diagonal elements
  wi = (∂µi/∂ηi)² · 1/Var(Yi)
• The covariance is Cov(β̂) = J⁻¹ = (XᵀŴX)⁻¹, where Ŵ is evaluated at β̂
  (a small numerical sketch follows below)
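• A numpy sketch (hypothetical design matrix and ‘fitted’ values, my own choices) of forming J = XᵀWX and Cov(β̂) = (XᵀŴX)⁻¹ for a logistic model:

import numpy as np

rng = np.random.default_rng(7)                      # arbitrary seed
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_hat = np.array([0.5, 1.0])                     # pretend these are the fitted values

eta = X @ beta_hat
pi = 1 / (1 + np.exp(-eta))
w = (pi * (1 - pi))**2 / (pi * (1 - pi))            # w_i = (d mu_i/d eta_i)^2 / Var(Y_i) = pi_i(1 - pi_i)

J = X.T @ (w[:, None] * X)                          # Fisher information X^T W X, W kept implicit
cov_beta = np.linalg.inv(J)                         # Cov(beta_hat) = (X^T W X)^{-1}
print(np.sqrt(np.diag(cov_beta)))                   # asymptotic standard errors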
Estimating with Newton-Raphson method
• For GLMs, calculating the MLE does not have an analytical solution as it does for linear models
• Thus, we need to use a numerical method such as the Newton-Raphson
method
• It starts with an initial guess of the parameter estimates and updates its guess at
each iteration
• Let uᵀ = ( ∂L(β)/∂β0, ∂L(β)/∂β1, ..., ∂L(β)/∂βP ), which is called the score vector
• Let H be the matrix with entries hjk = ∂²L(β) / (∂βj∂βk). H is called the Hessian matrix
• Let u^(t) and H^(t) be u and H evaluated at β^(t)
• Then, the updated guess at the (t + 1)-th iteration is
  β^(t+1) = β^(t) − (H^(t))⁻¹ u^(t)
  (a generic sketch of this update follows below)
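• A generic Newton-Raphson skeleton (a sketch; the score and Hessian are user-supplied callables, and the names are mine):

import numpy as np

def newton_raphson(score, hessian, beta0, tol=1e-8, max_iter=50):
    # beta^(t+1) = beta^(t) - (H^(t))^{-1} u^(t); stop once the score is (numerically) zero
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        u = score(beta)
        if np.max(np.abs(u)) < tol:
            break
        beta = beta - np.linalg.solve(hessian(beta), u)
    return beta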
Iteratively Reweighted Least Squares (IRLS)
IRLS
• So far we have discussed that the MLE technique is used to estimate the GLM
parameters (i.e., the βs)
• Analytical solutions cannot be found, so Newton-Raphson or other iterative
methods are used
• Interestingly, when Newton-Raphson is applied, the update closely resembles
weighted least squares
• In linear regression, when the homoscedasticity assumption is not met we cannot
assume Var(ϵ) = σ²I; rather, we assume Var(ϵ) = Σ, an n × n matrix.
Thus Y | X ∼ N(Xβ, Σ)
• Then the OLS estimator is weighted by this Σ:
  β̂ = (XᵀWX)⁻¹XᵀWy
  where W = Σ⁻¹ (a small sketch follows below)
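• A numpy sketch of the weighted least squares formula (the heteroscedastic variances and coefficients are simulated purely for illustration):

import numpy as np

rng = np.random.default_rng(11)                     # arbitrary seed
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])                   # made-up coefficients
sigma2 = rng.uniform(0.5, 3.0, size=n)              # unequal error variances
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2))

W = np.diag(1 / sigma2)                             # W = Sigma^{-1} (Sigma diagonal here)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)                                     # should be close to beta_true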
IRLS
• For GLM the link function g(µi) = ηi = Xiβ. Let Zi = g(Yi)
• Using a first-order Taylor series expansion,
  g(Yi) ≈ g(µi) + (Yi − µi)g′(µi)
  Zi ≈ Xiβ + (Yi − µi)g′(µi)
• E(Zi) = Xiβ and Var(Zi) = (g′(µi))² Var(Yi) = vi
• Thus the Zi form a linear model in β given the µi, and
  β̂ = (XᵀWX)⁻¹XᵀWz
  where W = diag[ v1⁻¹, v2⁻¹, ..., vn⁻¹ ]
IRLS for logistic regression
• Now we work out the estimates for specific models. Let’s start with logistic regression
• Recall,
  µi = πi = exp(ηi) / (1 + exp(ηi))
  ⇒ ∂πi/∂ηi = [ (1 + exp(ηi))exp(ηi) − exp(ηi)exp(ηi) ] / (1 + exp(ηi))²
  ⇒ ∂πi/∂ηi = exp(ηi) / (1 + exp(ηi))²
  ⇒ ∂πi/∂ηi = [ exp(ηi) / (1 + exp(ηi)) ] · [ 1 / (1 + exp(ηi)) ] = πi(1 − πi)
• Thus, the estimating equation becomes
  ∑_{i=1}^{n} [(yi − πi)xij / Var(Yi)] · ∂πi/∂ηi = ∑_{i=1}^{n} xij(yi − πi) = 0
  ⇒ u = ∂L(β)/∂β = Xᵀ(y − π) = 0 [here bold letters represent vectors]
IRLS for logistic regression
• Let β̂^(t) be the estimates at the t-th iteration of Newton-Raphson. We know
that H = −J = −XᵀWX. Then,
  β^(t+1) = β^(t) − (H^(t))⁻¹ u^(t)
         = β^(t) + (XᵀWX)⁻¹Xᵀ(y − π)
         = (XᵀWX)⁻¹XᵀWXβ^(t) + (XᵀWX)⁻¹XᵀWW⁻¹(y − π)
         = (XᵀWX)⁻¹XᵀW (Xβ^(t) + W⁻¹(y − π))
         = (XᵀWX)⁻¹XᵀWz
  here W = W^(t+1), evaluated at the (t + 1)-th iteration
• The whole purpose of this algorithm is to eventually arrive at a very simple(!)
equation.
The W matrix for Logistic model
• Recall
  wi = vi⁻¹ = ( (g′(µi))² Var(Yi) )⁻¹
     = ( (∂ηi/∂πi)² Var(Yi) )⁻¹
     = (∂πi/∂ηi)² · 1/Var(Yi)
     = (πi(1 − πi))² · 1/(πi(1 − πi))
     = πi(1 − πi)
• Hence, W = diag( π1(1 − π1), π2(1 − π2), ..., πn(1 − πn) )
IRLS algorithm
• Here we specify the algorithm for logistic regression. The only observations
we have are the yi’s and xij’s
1. Initialize: Set all β̂j^(0) = 0, for j = 0, 1, ..., P. Calculate ηi^(0) = ∑j xijβ̂j^(0) and
   µ̂i^(0) = π̂i^(0) = exp(ηi^(0)) / (1 + exp(ηi^(0)))
2. Set W^(1) = diag( π1(1 − π1), π2(1 − π2), ..., πn(1 − πn) ). Here πi = π̂i^(0)
3. For i = 1, 2, ..., n, set zi^(1) = η̂i^(0) + (W^(1))⁻¹ (yi − π̂i^(0))
4. Estimate β̂^(1) = (XᵀW^(1)X)⁻¹XᵀW^(1)z^(1)
5. Set η̂i^(1) = ∑j xijβ̂j^(1), π̂i^(1) = exp(η̂i^(1)) / (1 + exp(η̂i^(1)))
6. Perform steps 2-4 again to obtain W^(2), zi^(2) and β̂^(2)
7. Repeat steps 2-6 until either the η̂i^(t) or the π̂i^(t) satisfy |η̂i^(t) − η̂i^(t−1)| < εη or
   |π̂i^(t) − π̂i^(t−1)| < επ, or just make sure ∂L(β)/∂β evaluated at β̂ is < δ
• Now let’s code it
IRLS coding
• First we simulate some covariates. Let’s simulate X1 ∼ Unif[−10, 10] as
a continuous variable and X2 ∼ Bernoulli(0.5). We simulate n = 100 observations
• The linear predictor is ηi = β0 + β1X1i + β2X2i
• Thus, πi = exp(ηi) / (1 + exp(ηi))
• Simulate yi ∼ Bernoulli(πi)
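• A sketch of the full simulation and IRLS fit described above (the lecture presumably does this in R; this is a numpy translation, and the true β values and seed below are my own choices for illustration):

import numpy as np

rng = np.random.default_rng(303)                    # arbitrary seed
n = 100

# simulate covariates as described above: X1 ~ Unif[-10, 10], X2 ~ Bernoulli(0.5)
x1 = rng.uniform(-10, 10, size=n)
x2 = rng.binomial(1, 0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])

beta_true = np.array([0.5, 0.3, -1.0])              # hypothetical true coefficients
pi = 1 / (1 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, pi)                             # y_i ~ Bernoulli(pi_i)

# IRLS, following steps 1-7 of the algorithm slide
beta = np.zeros(X.shape[1])                         # step 1: all beta_j^(0) = 0
for t in range(100):
    eta = X @ beta
    pi_hat = 1 / (1 + np.exp(-eta))
    w = pi_hat * (1 - pi_hat)                       # step 2: diagonal of W
    z = eta + (y - pi_hat) / w                      # step 3: working response z_i
    beta = np.linalg.solve(X.T @ (w[:, None] * X),  # step 4: weighted least squares
                           X.T @ (w * z))
    score = X.T @ (y - 1 / (1 + np.exp(-(X @ beta))))
    if np.max(np.abs(score)) < 1e-8:                # step 7: stop when dL/dbeta is ~ 0
        break

print(beta)                                         # should be in the neighbourhood of beta_true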