ALY-6020
Predictive Analytics:
Generalized Linear Models
Ajit Appari, Ph.D., M.Tech., B.Tech.
College of Professional Studies, Northeastern University
Email: a.appari@northeastern.edu
November 10, 2021
Linear Regression: The Classic Bivariate Model
[Figure: example scatter plots, each with a red fitted line or curve. If the red line/curve is the fitted line, which one is a linear regression model?]
A quick review of the linear regression model before discussing GLM
Multiple Linear Regression Model
 For a multivariable problem (three or more variables), i.e., the response as a function of two or more predictors:
y_i = β_1·x_{i,1} + β_2·x_{i,2} + … + β_p·x_{i,p} + e_i
where i = 1, …, n (sample size); e_i is the random error for the i-th observation
Often x_{i,1} = 1, implying β_1 is the constant (intercept) term
 All n equations can be stacked together in matrix form, y = Xβ + e, where y and e are n×1 vectors, X is the n×p model matrix, and β is the p×1 parameter vector (see the R sketch below)
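A minimal R sketch (not from the slides) of how such a model matrix with a leading column of 1s is built and how the parameters are estimated; the data frame and variable names are hypothetical:

dat <- data.frame(y  = c(3.1, 4.0, 5.2, 6.1, 6.8),   # hypothetical response
                  x1 = c(1.0, 2.0, 3.0, 4.0, 5.0),   # hypothetical predictors
                  x2 = c(0.5, 0.7, 0.2, 0.9, 0.4))
X <- model.matrix(y ~ x1 + x2, data = dat)   # first column is all 1s (intercept)
fit <- lm(y ~ x1 + x2, data = dat)           # least-squares estimates of the betas
coef(fit)                                    # intercept and slopes for x1, x2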
Linear Regression: The Classic Model
Linear Regression Model: y_i = β_1·x_{i,1} + β_2·x_{i,2} + … + β_p·x_{i,p} + e_i, where the β's are the parameters of the linear regression model and e_i is the error term.
NOTE-1: The model is linear in the parameters, irrespective of the nature of the predictors.
NOTE-2: The normal-distribution assumption is for the error, NOT for Y.
Predictors can be (see the R formula sketch after this list):
 Quantitative inputs
 Transformations, e.g., ln(x), sin(x)
 Indicator [0/1] or categorical variables
 Polynomial terms, x², x³
 Interactions, x₃ = x₁·x₂
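As an illustration (not from the slides), these predictor types map directly onto R model formulas; the data and variable names below are hypothetical:

set.seed(1)
dat <- data.frame(y  = rnorm(20),            # hypothetical continuous response
                  x1 = runif(20, 1, 5),      # positive continuous (for log transform)
                  x2 = rep(c("A", "B"), 10), # categorical
                  x4 = rnorm(20),
                  x5 = rnorm(20))
fit <- lm(y ~ log(x1) + factor(x2) + I(x4^2) + x4:x5, data = dat)
# log(x1): transformation;  factor(x2): indicator/categorical
# I(x4^2): polynomial term; x4:x5: interaction
summary(fit)   # the model is still linear in its coefficients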
Linear Regression: The Classic Model
 Model is linear in parameters
 Predictors can have a non-linear relationship with the response variable, and such a relationship can still be estimated using linear regression
 Error follows a normal distribution with zero mean
 This does not imply the response variable has to follow a normal distribution (a common mistake among analytics professionals)
 Predictors are not related to the error
 In practice this is difficult to guarantee because of omitted (missing) variables
 Predictors are not correlated with each other {no multicollinearity}
 In practice, most systems have correlated predictors
 Errors are independent and identically distributed
 Errors across observations are uncorrelated {cross-sectional/longitudinal}
 Errors are homoscedastic {error variance does not vary with predictor values}
Homoscedastic vs. Heteroscedastic Errors
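A minimal R sketch (not from the slides) that simulates both cases and plots residuals against fitted values; roughly constant vertical spread indicates homoscedastic errors, while a funnel shape indicates heteroscedastic errors:

set.seed(1)
x <- runif(200, 0, 10)
y_hom <- 2 + 0.5 * x + rnorm(200, sd = 1)         # constant error variance
y_het <- 2 + 0.5 * x + rnorm(200, sd = 0.3 * x)   # error variance grows with x
par(mfrow = c(1, 2))
plot(fitted(lm(y_hom ~ x)), resid(lm(y_hom ~ x)),
     main = "Homoscedastic",   xlab = "Fitted", ylab = "Residual")
plot(fitted(lm(y_het ~ x)), resid(lm(y_het ~ x)),
     main = "Heteroscedastic", xlab = "Fitted", ylab = "Residual")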
Linear Regression Estimation: The OLS Approach
Bivariate Linear Regression Model (simple version)
Fitted line: ŷ_i = b_0 + b_1·x_i
Residual (prediction error): e_i = y_i − ŷ_i = y_i − b_0 − b_1·x_i
The residual is the deviation of an observed value from its predicted value on the fitted line.
Estimation approach: Ordinary Least Squares (OLS), i.e., choose b_0 and b_1 to minimize the residual sum of squares
RSS = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)²
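A minimal R sketch (not from the slides) fitting a bivariate model with lm() and confirming that the closed-form OLS slope and intercept match; the simulated data are hypothetical:

set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)                 # OLS: minimizes the residual sum of squares
b1 <- cov(x, y) / var(x)         # closed-form OLS slope
b0 <- mean(y) - b1 * mean(x)     # closed-form OLS intercept
c(coef(fit), b0 = b0, b1 = b1)   # both sets of estimates agree
sum(resid(fit)^2)                # residual sum of squares (RSS)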
Linear Regression Estimation: The OLS Approach
Linear Regression Model with Two Covariates
y_i = β_0 + β_1·x_{i,1} + β_2·x_{i,2} + e_i
Systematic component: β_0 + β_1·x_{i,1} + β_2·x_{i,2} {the part explained by the predictors}
Random component: e_i {the error term}
Generalized Linear Models
Generalized Linear Model: Why Needed
 OLS estimation fails if the random component follows a non-normal distribution
 The maximum likelihood estimation approach is used instead to estimate such generalized linear models
Potential scenarios of non-normal regression modeling (see the R sketch after this list)
 Binary variable as response {0 or 1}
 Modeled as logistic
 Proportion of total cases as response {ranges from 0 to 1}
 Modeled with the binomial distribution
 If #cases = 1, this reduces to the binary case
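A minimal R sketch (not from the slides) of these two cases; the data are simulated and the variable names are hypothetical:

set.seed(3)
x <- rnorm(100)
# binary response {0, 1}: logistic regression
y_bin <- rbinom(100, size = 1, prob = plogis(-0.5 + x))
fit_bin <- glm(y_bin ~ x, family = binomial(link = "logit"))
# proportion response: successes out of n trials, supplied as a two-column matrix
n <- rep(20, 100)
succ <- rbinom(100, size = n, prob = plogis(-0.5 + x))
fit_prop <- glm(cbind(succ, n - succ) ~ x, family = binomial)
summary(fit_bin)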
Generalized Linear Model: Why Needed
Potential scenarios of non-normal regression modeling, continued (see the R sketch after this list)
 Count variable as response {non-negative integer}
 Modeled as Poisson {variance = mean}
 Modeled as negative binomial {variance > mean}
 Poisson for rates {if the denominator/exposure variable is very large}
 Positive continuous variable, e.g., rates, service time
 Modeled as Gamma
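A minimal R sketch (not from the slides) for count and positive continuous responses; glm.nb() from the MASS package handles the over-dispersed (negative binomial) case, and the data are simulated:

library(MASS)                                    # provides glm.nb()
set.seed(4)
x <- rnorm(200)
counts   <- rpois(200, lambda = exp(0.2 + 0.5 * x))
fit_pois <- glm(counts ~ x, family = poisson(link = "log"))   # variance = mean
fit_nb   <- glm.nb(counts ~ x)                                # allows variance > mean
service  <- rgamma(200, shape = 2, rate = 2 / exp(0.3 + 0.4 * x))
fit_gamma <- glm(service ~ x, family = Gamma(link = "log"))   # positive continuous
summary(fit_pois)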
Generalized Linear Models
[Photo caption: one of the oldest agricultural research centers; established 1843; birthplace of modern statistical theory and practice]
GLM Framework
 Generalized Linear Models (GLM) form a framework that:
 Extends the ordinary linear regression model for a continuous response variable to categorical or discrete response variables
 Uses the maximum likelihood estimation approach (default)
• Iteratively Reweighted Least Squares method (Nelder & Wedderburn 1972)
 A GLM has three components:
 Random component: the error component follows the exponential dispersion model family
 Systematic component: a linear predictor of the response
 Link function: links the linear predictor to the expected mean of the response {unique to the GLM}
GLM Framework
 Random Component (Error Distribution): Exponential
Dispersion Model Family.
 The probability function f(y; θ, φ) is defined as
f(y; θ, φ) = a(y, φ) · exp[ (y·θ − κ(θ)) / φ ]
 θ is called the canonical parameter.
 φ > 0 is the dispersion parameter {similar to σ²}. It should be very close to 1:
• under-dispersion if φ << 1; and
• over-dispersion if φ >> 1
 κ(θ) is a known function called the cumulant function.
 a(y, φ) is a normalizing function that ensures f(y; θ, φ) is a probability function, i.e.,
• ∫ f(y; θ, φ) dy = 1 for continuous y, or
• Σ f(y; θ, φ) = 1 for discrete y
A worked example follows this list.
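As a worked illustration (not on the slide), the Poisson distribution can be written in this form:
f(y; μ) = μ^y · e^(−μ) / y! = (1/y!) · exp(y·ln μ − μ)
so the canonical parameter is θ = ln μ, the cumulant function is κ(θ) = e^θ = μ, the dispersion is φ = 1, and the normalizing function is a(y, φ) = 1/y!, which matches f(y; θ, φ) = a(y, φ) · exp[(y·θ − κ(θ))/φ].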
GLM Framework
 Members of Exponential Dispersion Model Family
 Normal {e.g., Sample averages when sample sizes are sufficiently
large}
 Bernoulli {e.g., Yes/No decisions}
 Binomial {e.g., number of successes in n trials; the sum of n Bernoulli trials}
 Categorical or Multinomial Logit {e.g., a customer’s race/ethnicity}
 Multinomial {e.g., Customer counts in each race/ethnicity out of n
customers}
 Exponential {e.g., waiting time in a queue or inter-arrival time}
 Gamma {e.g., amount of rainfall in a reservoir, waiting time until the k-th customer is served, customer lifetime value, annual health expenditure}
 Poisson {e.g., Number of customers in the queue}
 Negative Binomial – for over-dispersed count variable {e.g., number of
hospital visits in a year}; generalization of Poisson distribution
GLM Framework
 Systematic Component (Linear Predictor)
η = Xβ + O
 Conditional expectation: μ = E[Y | x_1, x_2, …, x_p]
 Parameter vector: β = (β_0, β_1, β_2, …, β_p)
 O is an offset, a parameter known a priori; it commonly occurs in Poisson GLMs but may appear in any GLM (see the R sketch below).
 The offset is a measure of the exposure (a.k.a. denominator) variable.
• Annual birth counts across cities can be modeled as Poisson, but the expected annual count depends on a city's adult population, which acts as the offset or exposure.
• The number of workers with lung disease in various coal mines depends on the number of workers and how long they have worked; the offset or exposure would be the number of person-years.
 X is the model matrix [1, x_1, x_2, …, x_p], with
• the first column fixed at "1" for the intercept parameter β_0,
• the x_j's being explanatory variables that may include interactions (x_3 = x_1·x_2), quadratic terms (x_5 = x_4·x_4), or polynomial terms (x_5 = x_4·x_4·…·x_4)
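A minimal R sketch (not from the slides) of a Poisson GLM with a person-years offset; the data frame and column names are hypothetical:

# hypothetical mine-level data: lung-disease cases and person-years of exposure
mines <- data.frame(cases        = c(5, 12, 3, 20, 8),
                    person_years = c(1200, 3500, 800, 5200, 2100),
                    dust_level   = c(1.2, 2.8, 0.9, 3.5, 1.7))
fit <- glm(cases ~ dust_level + offset(log(person_years)),
           family = poisson(link = "log"), data = mines)
summary(fit)   # coefficients model the rate of cases per person-year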
GLM Framework
 Link Function g[µ] : connects conditional mean of
response to the linear predictor
g(μ) = η = Xβ + O
 Regression parameters are estimated using maximum likelihood
 We skip the mathematics and focus on examples and applications
GLM Framework: Canonical Link Functions
Natural (canonical) link function for selected probability distributions of Y:
 Normal: range of Y (−∞, ∞); identity link, g(μ) = μ
 Binomial: proportion range (0,1) or (0,1,…,n)/n, count range (0, N); logit link, g(μ) = ln[μ/(1−μ)]
 Poisson: range of Y 0, 1, 2, …, ∞; log link, g(μ) = ln(μ)
 Negative Binomial: range of Y 0, 1, 2, …, ∞; log link is commonly used
 Gamma: range of Y (0, ∞); inverse link, g(μ) = 1/μ
The R family objects corresponding to these links are sketched after this list.
Negative Binomial Distribution: k successes [or failures] before r failures [or successes] have occurred; alternately, the number of trials required for r successes or failures.
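In R, these distribution/link pairs correspond to family objects passed to glm(); a brief illustration (the final line's formula and data are hypothetical):

gaussian(link = "identity")   # normal response, identity link
binomial(link = "logit")      # binary/proportion response, logit link
poisson(link = "log")         # count response, log link
Gamma(link = "inverse")       # positive continuous response, inverse link
# e.g., fit <- glm(y ~ x1 + x2, family = poisson(link = "log"), data = dat)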
GLM: Multiple Linear Regression in R
 Function: glm() from the stats package {part of base R}
glm(formula, data, subset, na.action, weights, offset, family =
gaussian(link = "identity"), start = NULL, control =
glm.control(...), model = TRUE, y = TRUE, x = FALSE, ...)
formula: specify the model, e.g. y ~ x1 + x2 + x3 + offset(x4)
data: data frame
subset: if a subset of data needs to be regressed, e.g. regress data when gender=F
na.action = what to do for missing observations, remove or impute
weights= specify to perform weighted GLM; useful when response is observed over
some type of denominator, e.g., varying time length, varying sample size, selection bias
(probability of the observation being in sample) {Heckman’s correction}
offset = the exposure variable, as described above
glm.control(...): used to set parameters that control the fitting process: epsilon = convergence tolerance, maxit = maximum number of IWLS iterations, trace = FALSE by default {set TRUE to produce output at each iteration}
model, y, x: logical values indicating whether these components are to be returned as output (a complete worked example follows)
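Putting the pieces together, a minimal end-to-end sketch (not from the slides; the data set and variable names are hypothetical):

# hypothetical data: insurance claims with an exposure (policy-years) column
claims <- data.frame(n_claims     = c(2, 0, 5, 1, 3, 7, 0, 4),
                     age          = c(25, 34, 52, 41, 29, 60, 38, 47),
                     policy_years = c(1.0, 0.5, 2.0, 1.5, 1.0, 2.5, 0.8, 1.7))
fit <- glm(n_claims ~ age + offset(log(policy_years)),
           family  = poisson(link = "log"),
           data    = claims,
           control = glm.control(epsilon = 1e-8, maxit = 50, trace = FALSE))
summary(fit)     # coefficients, standard errors, deviance
exp(coef(fit))   # rate ratios per unit change in the predictor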
Other GLM Packages
 Package glmx: Generalized Linear Models Extended
 Package glmnet: for Lasso and Elastic-net Regularized
Generalized Linear Models
 Package glmertree: for Generalized Linear Mixed-Model Trees
 Package biglm: for big-data analysis with bounded-memory linear and generalized linear models
 Package fishMod: for specialized GLMs on count data: Poisson-Sum-of-Gamma GLMs, Tweedie GLMs, and Delta Log-Normal GLMs
 Package hglm: for Hierarchical Generalized Linear Models
 Package oglmx: for Ordered Generalized Linear Models
 Package pglm: for Panel Generalized Linear Models
 Package plsRglm: Partial Least Squares Regression for Generalized Linear Models