Multiple Linear Regression Model - Part 1
Model description, least squares, model inference, prediction
Dr. Linh Nghiem
STAT 3022/3922
Applied Linear Models
Model Description
Model description
• Suppose we have an outcome Y that we want to predict
from p − 1 predictors (or covariates) X1, . . . , Xp−1. The
population multiple linear regression model is
E(Y | X1 = x1, . . . , Xp−1 = xp−1) = β0 + β1x1 + . . .+ βp−1xp−1
Var(Y | X1 = x1, . . . , Xp−1 = xp−1) = σ2
• Predictors can be any one of the following types:
1. Separate variables, e.g. height (X1), weight (X2), ...
2. Polynomial of a previous predictor, e.g. X2 = X1^2
3. Transformed variable, e.g. X2 = log(X1)
4. Interaction effect, e.g. X3 = X1X2
5. Qualitative (categorical) variable, e.g. X1 = 1 if female, 0 if male
Model description
• When we say linear model, we mean linear in the parameters, not
in the predictors.
- There are no terms like β0^β1 or e^β1 in the model.
• The linear model above may cover many complicated
relationships between Y and the predictors X1, . . . , Xp−1.
• The population multiple linear regression (MLR) model can be
alternatively written as
Y = β0 + β1X1 + . . .+ βp−1Xp−1 + ε
with E(ε) = 0 and Var(ε) = σ2. If we assume ε ∼ N(0, σ2),
then we have the normal MLR model.
Model description
Assume we observe data generated from the MLR model, meaning
yi = β0 + β1xi1 + . . .+ βp−1xi,p−1 + εi, i = 1, . . . , n
with the following assumptions:
(A1) E(εi) = 0, i = 1, . . . , n
(A2) Var(εi) = σ2, i = 1, . . . , n
(A3) For any i ≠ j, the random noises εi and εj are independent, or
at least uncorrelated.
If we assume (A4) εi ∼ N(0, σ2), then we say the data follow a
normal MLR model.
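As a small illustration of assumptions (A1)-(A4), here is a minimal R sketch that simulates data from a normal MLR model with two predictors; the sample size, coefficient values, and error standard deviation are arbitrary choices for the example.

# Simulate data from a normal MLR model with p - 1 = 2 predictors
set.seed(1)                            # for reproducibility
n     <- 100                           # sample size (arbitrary)
beta  <- c(2, 0.5, -1)                 # (beta0, beta1, beta2), arbitrary values
sigma <- 1.5                           # error standard deviation (arbitrary)
x1  <- runif(n, 0, 10)
x2  <- rnorm(n, mean = 5, sd = 2)
eps <- rnorm(n, mean = 0, sd = sigma)  # (A1)-(A4): iid N(0, sigma^2) errors
y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps
sim_data <- data.frame(y, x1, x2)
head(sim_data)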
Matrix representation
We have p coefficients, so instead of working with each of them
separately, we can work with all of them together through the
following matrix representation of the MLR model.
First, we put all the y data to form a response vector:
yn×1 = (y1, . . . , yn)ᵀ,
and similarly, we can form a coefficient vector and an error vector:
βp×1 = (β0, β1, . . . , βp−1)ᵀ, εn×1 = (ε1, . . . , εn)ᵀ.
Matrix representation
Finally, we put all the covariate data into the design matrix
Xn×p =
[ 1  x11  x12  . . .  x1,p−1 ]
[ 1  x21  x22  . . .  x2,p−1 ]
[ ...                 ...    ]
[ 1  xn1  xn2  . . .  xn,p−1 ].
Hence, we can write the MLR model as the following concise
matrix representation:
y = Xβ + ε. (1)
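As a sketch of how this design matrix looks in practice, R's model.matrix builds X (with the leading column of ones) directly from a model formula; the mtcars variables used here are the ones analysed later in these notes.

# Design matrix for a model of mpg on disp, hp and drat (built-in mtcars data)
X <- model.matrix(mpg ~ disp + hp + drat, data = mtcars)
head(X)    # first column is the intercept column of ones
dim(X)     # n x p = 32 x 4
y <- mtcars$mpg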
Matrix representation
From the above equation, we can also write
yi = xiᵀβ + εi, i = 1, . . . , n.
The assumptions in the matrix form are
(B1) E(ε) = 0.
(B2) Var(ε) = σ2In, where In is the n× n identity matrix (this
assumption includes assumptions (A2) and (A3) above).
If we have normal MLR, then
(B3) ε follows a multivariate normal distribution ε ∼ Nn(0, σ2In).
Matrix representation
Similar to what we do in the simple linear model, we assume X to
be deterministic (i.e. not random). As a result,
E(y) = Xβ,
Var(y) = σ2In.
If we assume (B3), then
y ∼ Nn(Xβ, σ2In).
Fuel consumption
Model Y = fuel consumption (mpg) of automobiles as a linear
model of p− 3 covariates, including X1 = displacement (disp), X2
= gross horsepower (hp) and X3 = rear axle ratio (drat) from
n = 32 observations.
[Scatterplot matrix of mpg, disp, hp, and drat with pairwise correlations:
corr(mpg, disp) = −0.848, corr(mpg, hp) = −0.776, corr(mpg, drat) = 0.681,
corr(disp, hp) = 0.791, corr(disp, drat) = −0.710, corr(hp, drat) = −0.449.]
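A sketch of how such a scatterplot matrix and the pairwise correlations can be reproduced in base R (the slide's figure appears to be a ggpairs-style plot, but pairs() and cor() convey the same information):

vars <- mtcars[, c("mpg", "disp", "hp", "drat")]
round(cor(vars), 3)   # pairwise correlations, e.g. cor(mpg, disp) = -0.848
pairs(vars)           # scatterplot matrix of the four variables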
Least Square Estimator
Least-square estimators
Similar to the simple linear model, we estimate the coefficients by
minimizing the sum of squared residuals:
D(b0, b1, . . . , bp−1) = ∑_{i=1}^n (yi − b0 − b1xi,1 − . . . − bp−1xi,p−1)2
= (y − Xb)ᵀ(y − Xb)
= yᵀy − bᵀXᵀy − yᵀXb + bᵀXᵀXb
= yᵀy − 2bᵀXᵀy + bᵀXᵀXb,
with b = (b0, . . . , bp−1)ᵀ. The least squares estimator is therefore
βˆ = argmin_b D(b).
Normal equations
To solve for βˆ, we take partial derivatives with respect to each
component of b to obtain the normal equations of MLR:
∂D/∂b = 2XᵀXb − 2Xᵀy = 0,
and hence
Xᵀ(y − Xb) = 0
XᵀXb = Xᵀy.
Note that XᵀX is a p × p matrix, so the equation above has a
unique solution if and only if rank(XᵀX) = p. In this case,
βˆ = (XᵀX)−1Xᵀy.
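As a minimal sketch, βˆ = (XᵀX)−1Xᵀy can be computed directly and checked against R's lm (which uses a QR decomposition internally rather than an explicit inverse); the fuel consumption variables from mtcars are used for concreteness.

X <- model.matrix(mpg ~ disp + hp + drat, data = mtcars)
y <- mtcars$mpg
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'y
fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
cbind(manual = beta_hat, lm = coef(fit))           # the two columns agree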
Solving normal equations
When does the matrix XᵀX have full rank?
• Because rank(XᵀX) = rank(X), this condition is satisfied if
and only if rank(X) = p.
However, since rank(X) ≤ min(n, p), the OLS estimate is
defined if and only if the two following conditions are satisfied:
1. n > p (this is usually called a low-dimensional setting).
2. No predictor is a linear combination of the other predictors
(no perfect multicollinearity).
In R, if you are able to get the OLS estimates from the lm
command, then both conditions are satisfied.
Fuel consumption data
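The R output originally shown on this slide is not reproduced here; a sketch of the corresponding commands, including a check of the rank condition rank(X) = p:

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
coef(fit)                 # OLS estimates exist, so both conditions hold
X <- model.matrix(fit)
c(n = nrow(X), p = ncol(X), rank = qr(X)$rank)   # n = 32 > p = 4, rank(X) = 4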
Properties of least square estimators
From now on, we will assume the OLS estimate for MLR is
defined. Then,
• βˆ is an unbiased estimator for β
E(βˆ) = E{(XᵀX)−1Xᵀy} = (XᵀX)−1Xᵀ E(y)
= (XᵀX)−1XᵀXβ = Ipβ = β.
• Next, we find the variance-covariance matrix of βˆ:
Var(βˆ) = Var{(XᵀX)−1Xᵀy} = (XᵀX)−1Xᵀ Var(y) X(XᵀX)−1
= (XᵀX)−1Xᵀ(σ2In)X(XᵀX)−1
= σ2(XᵀX)−1.
Properties of the least square estimator
• Finally, the OLS estimator βˆ is a linear function of the data y:
βˆ = (XᵀX)−1Xᵀy = Ry,
with R = (XᵀX)−1Xᵀ.
• Among all linear unbiased estimators, βˆ has the minimum
variance (Gauss-Markov Theorem).
• We say that the OLS estimator βˆ is the Best Linear Unbiased
Estimator (BLUE).
- Note that this is only the "best" within the class of linear
unbiased estimators.
- If we go outside this class, we can possibly find a better
estimator.
Fitted values and residuals
The fitted value for the ith observation is
yˆi = βˆ0 + βˆ1xi1 + βˆ2xi2 + . . . + βˆp−1xi,p−1 = xiᵀβˆ = xiᵀ(XᵀX)−1Xᵀy.
Putting all the fitted values together, we have the vector of fitted values
yˆ = X(XᵀX)−1Xᵀy = Hy,
where the n × n matrix H = X(XᵀX)−1Xᵀ is usually called the
hat matrix. This matrix plays a very important role in residual
analysis for MLR.
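A short sketch that forms the hat matrix for the fuel consumption fit and checks numerically that yˆ = Hy (and, anticipating the next slide, that Xᵀe = 0):

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix H = X (X'X)^{-1} X'
y <- mtcars$mpg
max(abs(H %*% y - fitted(fit)))           # essentially 0: fitted values equal Hy
e <- residuals(fit)
crossprod(X, e)                           # essentially 0: X'e = 0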
Fitted values and residuals
• The residual for the ith observation is
ei = yi − yˆi = yi − hiᵀy,
where hiᵀ denotes the ith row of H. The vector of residuals is
therefore
e = y − Xβˆ = y − yˆ = y − Hy = (In − H)y.
• From the normal equations, we have Xᵀe = 0, which implies
∑_{i=1}^n ei = 0 and ∑_{i=1}^n ei xij = 0, j = 1, . . . , p − 1.
Partitioning of variability
Similar to what we have in SLR, we have the following
decomposition of the total variability for MLR:
∑_{i=1}^n (yi − ȳ)2 = ∑_{i=1}^n (yi − yˆi + yˆi − ȳ)2
= ∑_{i=1}^n (yi − yˆi)2 + ∑_{i=1}^n (yˆi − ȳ)2 + 2 ∑_{i=1}^n (yi − yˆi)(yˆi − ȳ).
Note that yi − yˆi = ei, so
∑_{i=1}^n (yi − yˆi)(yˆi − ȳ) = ∑_{i=1}^n ei(yˆi − ȳ) = ∑_{i=1}^n ei yˆi − ȳ ∑_{i=1}^n ei
= eᵀyˆ − ȳ eᵀ1 = eᵀXβˆ − ȳ eᵀ1 = 0.
Partitioning of variability
SST = SSE + SSR
SST = ∑_{i=1}^n (yi − ȳ)2      Total Sum of Squares (SST)
SSE = ∑_{i=1}^n (yi − yˆi)2    Sum of Squared Errors (SSE)
SSR = ∑_{i=1}^n (yˆi − ȳ)2     Regression Sum of Squares (SSR)
As a result, we have the following partition of variability for MLR:
SST = SSE + SSR
(n− 1) df (n− p) df (p− 1) df
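These sums of squares can be verified numerically for the fuel consumption fit (a sketch; fit is the lm object for mpg on disp, hp, and drat):

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
y   <- mtcars$mpg
SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)              # equals deviance(fit)
SSR <- sum((fitted(fit) - mean(y))^2)
c(SST = SST, SSE_plus_SSR = SSE + SSR)    # the two numbers coincide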
Extra (sequential) sum of squares
Because we have more than one covariate in MLR, we usually
further decompose SSR into smaller parts to measure the
contribution of each covariate.
Extra (sequential) sum of squares (ESS or Seq. SS) measures
the increase of SSR when one or several new covariates are added
to the model.
Assume we consider p− 1 potential covariates X1, . . . , Xp−1
• Regardless of how many covariates are in the model, SST is
not changed (why?)
• Hence, an increase in SSR always corresponds to the same
decrease in SSE.
Extra (sequential) sum of squares
Hence, by the above partitioning of total variability, we have
SST = SSR(X1) + SSE(X1)
= SSR(X1, X2) + SSE(X1, X2),
where inside the parentheses for each SSR and SSE is the list of
covariates in the model. Therefore, the extra sum of squares of
X2, given that X1 is already included in the model, denoted
SSR(X2|X1), is
SSR(X2|X1) = SSR(X1, X2) − SSR(X1)
= SSE(X1) − SSE(X1, X2).
Extra (sequential) sum of squares
By a similar argument, we can define quantities such as
SSR(X3|X1, X2) = SSR(X1, X2, X3)− SSR(X1, X2),
SSR(X3, X2|X1) = SSR(X1, X2, X3)− SSR(X1),
and so forth. As a result, for a model with p − 1 covariates
X1, . . . , Xp−1, we can write
SSR(X1, X2, . . . , Xp−1) = SSR(X1)
+ SSR(X2|X1)
+ SSR(X3|X2, X1)
+ . . .
+ SSR(Xp−1|Xp−2, . . . , X1).
Extra (sequential) sum of squares
• Each extra sum of squares has degrees of freedom equal to the
number of covariates being added to the model. For example,
SSR(X1), SSR(X2|X1), . . ., SSR(Xp−1|X1, . . . , Xp−2) each have 1
df, and SSR(X3, X2|X1) has 2 df.
• Finally, the mean squares are obtained by dividing each sum of
squares by its corresponding df:
MSR(X1|X2) = SSR(X1|X2) / 1,
MSR(X2, X3|X1) = SSR(X2, X3|X1) / 2,
MSE(X1, X2, . . . , Xp−1) = SSE(X1, X2, . . . , Xp−1) / (n − p), . . .
Extra (sequential) sum of squares
We have the following partitioning of the total variability for an MLR
with p − 1 covariates:
Source       df      SS                            MS
X1           1       SSR(X1)                       MSR(X1)
X2           1       SSR(X2|X1)                    MSR(X2|X1)
X3           1       SSR(X3|X2, X1)                MSR(X3|X2, X1)
...          ...     ...                           ...
Xp−1         1       SSR(Xp−1|Xp−2, . . . , X1)    MSR(Xp−1|Xp−2, . . . , X1)
Regression   p − 1   SSR(X1, . . . , Xp−1)         MSR(X1, . . . , Xp−1)
Residuals    n − p   SSE(X1, . . . , Xp−1)         MSE(X1, . . . , Xp−1)
Total        n − 1   SST
All these SS and MS can be obtained from anova() applied to an
lm object in R, as sketched below.
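For example, for the fuel consumption fit (a sketch; the sequential sums of squares depend on the order in which the covariates are entered, which is the point of the next slide):

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
anova(fit)    # rows: SSR(disp), SSR(hp | disp), SSR(drat | disp, hp), Residuals
anova(lm(mpg ~ hp + drat + disp, data = mtcars))   # different order, different sequential SS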
Extra (sequential) sum of squares
Note that the order of the covariates matters for the ANOVA table,
but not for the summary table.
Estimating σ2
Finally, we estimate the error variance σ2 by
σˆ2 = eᵀe / (n − p) = MSE(X1, . . . , Xp−1),
where again e denotes the vector of residuals. Note that the
denominator is now n − p, since we have to estimate p coefficients
before estimating σ2. In other words, the degrees of freedom of the
sum of squared residuals is n − p.
Fuel consumption
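The R output originally shown on this slide is not reproduced; a sketch of the corresponding computation of σˆ for the fuel consumption fit:

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
e <- residuals(fit)
n <- nrow(mtcars); p <- length(coef(fit))
sigma2_hat <- sum(e^2) / (n - p)   # MSE, with n - p degrees of freedom
sqrt(sigma2_hat)                   # equals sigma(fit), the residual standard error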
Model Inference
Model Inference
Similar to the case of simple linear regression, we will need to
assume normal MLR to do any inference on the coefficients. In
this case, the estimator βˆ is jointly normal, with mean vector and
variance-covariance matrix as derived earlier:
βˆ ∼ Np(β, σ2V), where V = (XᵀX)−1.
As a result, each component of βˆ is normally distributed:
βˆj ∼ N(βj , σ2 vj+1,j+1),
where vj+1,j+1 denotes the (j + 1)th diagonal element of V.
Inference for a single coefficient
Now, we replace σ2 by its estimate σˆ2, so we obtain the standard
error of each βˆj:
se(βˆj) = σˆ √vj+1,j+1,
and hence the t-statistic
Tj = (βˆj − βj) / se(βˆj) ∼ tn−p,
with tn−p being the t-distribution with n− p degrees of freedom.
Therefore, a 100(1− α)% confidence interval for βj is
βˆj ± t1−α/2,n−pse(βˆj).
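A sketch of this interval in R for the fuel consumption model, both via confint and built by hand from vcov (which returns σˆ2(XᵀX)−1):

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
confint(fit, level = 0.95)              # 95% CIs for all coefficients
# by hand for one coefficient, say the one for disp:
bhat <- coef(fit)["disp"]
se   <- sqrt(diag(vcov(fit)))["disp"]   # se(beta_hat_j)
bhat + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se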
Inference for a single coefficient
Next, hypothesis testing for a single coefficient is also done in the
usual way. Let a be a given constant.
H0 : βj = a vs H1 : βj ≠ a or H1 : βj > a or H1 : βj < a
Test statistic:
T = (βˆj − a) / √(σˆ2 vj+1,j+1) ∼ tn−p under H0
Compute the p-value:
          H1 : βj ≠ a       H1 : βj > a     H1 : βj < a
p-value   2P(tn−p > |T|)    P(tn−p > T)     P(tn−p < T)
Reject H0 if p-value < α (significance level).
Inference for a single coefficient
An important detail about all of the previous inference for a single
coefficient is that it is conditional on the model already containing
all the other predictors.
• The variance-covariance matrix of βˆ is computed given that we
have p − 1 predictors in the model.
• As a result, if the number of covariates changes, the inference
for the same coefficient can change drastically.
For example, the confidence interval for β1 in the model
yi = β0 + β1xi1 + β2xi2 + εi is not the same as the confidence interval
for β1 in the model yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi.
Inference for a single coefficient
This detail can be seen more clearly if we view the testing from a
model selection perspective. For example, given the MLR model
yi = β0 + β1xi1 + . . . + βp−1xi,p−1 + εi, i = 1, . . . , n,
the test of H0 : β1 = a vs H1 : β1 ≠ a actually compares the two
following models:
H0 : yi = β0 + axi1 + . . .+ βp−1xi,p−1 + εi vs
H1 : yi = β0 + β1xi1 + . . .+ βp−1xi,p−1 + εi.
Note that both models, under H0 and under H1, contain all the other
predictors x2, . . . , xp−1!
Inference for a single coefficient
If a = 0, then the test can be read directly from the summary of
the lm output in R.
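A sketch of reading these t-tests (with a = 0) off the summary output for the fuel consumption fit:

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
summary(fit)$coefficients
# columns: Estimate, Std. Error, t value, Pr(>|t|); each row tests H0: beta_j = 0
# given that all the other predictors are already in the model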
Further hypothesis testing for MLR
In MLR, since we have more than one covariate, we are usually
interested in further tests that involve more than one
coefficient. The next tests that we will consider include:
• Whether all slopes are zero. This is usually called the
overall (omnibus) test for the significance of the MLR model.
H0 : β1 = β2 = . . . = βp−1 = 0 vs.
H1 : at least one of β1, . . . , βp−1 is not 0
• Whether some slopes are zero.
H0 : βi = βj = . . . = βk = 0 vs.
H1 : at least one of βi, . . . , βk is not 0
These tests are done from a model selection perspective.
Test whether all slopes are zero
H0 : β1 = β2 = . . . = βp−1 = 0 vs.
H1 : at least one of β1, . . . , βp−1 is not 0
First, we translate it into a comparison of the two models:
H0 : yi = β0 + εi, i = 1, . . . , n vs.
H1 : yi = β0 + β1xi,1 + . . .+ βp−1xi,p−1 + εi i = 1, . . . , n
The model under H0 is called the reduced model, while the
model under H1 is called the full model.
Test whether all slopes are zero
Second, we fit the model under both H0 and H1 and compute the
sum of squared errors and its degrees of freedom.
• Under H0, the OLS estimate for β0 is ȳ, so
SSE(H0) = ∑_{i=1}^n (yi − ȳ)2 = SST, df(H0) = n − 1.
• Under H1, the OLS estimates for all the coefficients
β0, . . . , βp−1 are found in matrix form as above, and the
sum of squared errors is
SSE(H1) = ∑_{i=1}^n (yi − yˆi)2, df(H1) = n − p.
Test whether all slopes are zero
Again, SSE(H0) ≥ SSE(H1), and we can reject H0 if the
reduction in the sum of squared residuals is significantly large. Recall
that the general form of the F-statistic is given by
F = [{SSE(H0) − SSE(H1)} / {df(H0) − df(H1)}] / [SSE(H1) / df(H1)],
so in this case it equals
F = [SSR(H1) / (p − 1)] / [SSE(H1) / (n − p)] = MSR(H1) / MSE(H1) ∼ Fp−1,n−p.
If this F-statistic is large, the reduction in the sum of squared residuals
is significant, so we have evidence against H0.
ANOVA for MLR
Finally, we compute p-value = P (Fp−1,n−p > F ) and reject H0 at
significance level α if this p-value < α.
The above analyses are summarized in the following (overall)
ANOVA table:
Source       df      SS     MS     F value     Pr(>F)
Regression   p − 1   SSR    MSR    MSR/MSE     p-value
Residuals    n − p   SSE    MSE
Total        n − 1   SST
If we reject H0, we say that the MLR model is significant.
Test whether all slopes are zero
This overall F-test can be read directly from the summary of the
lm output in R.
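A sketch of extracting the overall F-statistic and its p-value from the lm summary for the fuel consumption fit:

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
fstat <- summary(fit)$fstatistic          # (F value, numerator df, denominator df)
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # p-value of the overall F-test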
Testing a subset of slopes is zero
H0 : βi = βj = . . . = βk = 0 vs.
H1 : at least one of βi, . . . , βk is not 0.
For example, for the fuel consumption data, we want to test whether at
least one of the coefficients corresponding to disp and drat is nonzero, i.e.
H0 : β1 = β3 = 0 vs.
H1 : at least one of β1, β3 is not 0.
We will follow the same testing procedure from model selection
perspective as before.
Testing a subset of slopes is zero
First, under H0, the reduced model is
yi = β0 + β2xi2 + εi, i = 1, . . . , n,
which is a SLR model with only one predictor being hp. We can
estimate the coefficients β0 and β2 by OLS, then compute the sum of
squared residuals of that model, SSE(H0).
• The degrees of freedom df(H0) = n− 2, since we have 2
coefficients to estimate.
Testing a subset of slopes is zero
Next, under H1, we have the full model
yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi, i = 1, . . . , n
so we can fit the model, obtain the sum of squared residuals
SSE(H1), and df(H1) = n − 4.
Testing a subset of slopes is zero
Hence, we compute the F -statistic as
F = [{SSE(H0) − SSE(H1)} / {(n − 2) − (n − 4)}] / [SSE(H1) / (n − 4)]
= [{SSE(H0) − SSE(H1)} / 2] / [SSE(H1) / (n − 4)] = 10.739.
Under H0, this F-statistic follows an F2,n−4 distribution, so
p-value = P(F2,n−4 > F).
Testing a subset of slopes is zero
This process can be verified by using anova(modelH0, modelH1), as
sketched below.
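For the fuel consumption data, a sketch of this full-versus-reduced comparison:

model_H0 <- lm(mpg ~ hp, data = mtcars)                  # reduced model
model_H1 <- lm(mpg ~ disp + hp + drat, data = mtcars)    # full model
anova(model_H0, model_H1)    # F = 10.739 on 2 and 28 df, as computed above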
Using extra sum of squares
Note that SSE(H0) = SSE(X2) and SSE(H1) = SSE(X1, X2, X3).
Therefore, the numerator of the F-statistic can be written as
{SSE(H0) − SSE(H1)} / 2 = {SSE(X2) − SSE(X1, X2, X3)} / 2
= {SSR(X1, X2, X3) − SSR(X2)} / 2
= SSR(X1, X3|X2) / 2
= MSR(X1, X3|X2).
Similarly, the denominator of the F-statistic is MSE(X1, X2, X3).
Hence, another way to write the F-statistic is
F = MSR(X1, X3|X2) / MSE(X1, X2, X3).
Using extra sum of squares
With this form of the F-statistic, we can conduct the test by fitting
the full model only, entering the subset of variables being tested last.
Then we just read off the sequential ANOVA table.
From that ANOVA output, we can compute
SSR(X1, X3|X2) = SSR(X1|X2) + SSR(X3|X1, X2)
= 38.11 + 156.22 = 194.33.
Using extra sum of squares
Hence, the F-statistic is
F = MSR(X1, X3|X2) / MSE(X1, X2, X3) = (194.33/2) / 9.05 = 10.739,
which is the same as before.
What are the F-statistics in the anova table?
The F value column is obtained by dividing the corresponding MSR
by the MSE; for example,
F1 = 17.2657 = 156.22 / 9.05 = MSR(X1|X2) / MSE(X1, X2, X3).
Does it test anything?
What are the F-statistics in the anova table?
This F-statistic tests whether X1 should be added after X2 is
already included in the model; in other words, H0 : y ∼ X2 vs
H1 : y ∼ X2 + X1. If we followed the general full-versus-reduced model
approach from before, we would use the test statistic
F = MSR(X1|X2) / MSE(X1, X2),
so F and F1 differ only in the denominator. However, both
MSE(X1, X2, X3) and MSE(X1, X2) are unbiased estimators of
σ2, and R favors the use of the larger model to estimate it.
What are the F-statistics in the anova table?
More generally, the F -statistic has the form
F = MSR(X1|X2) / σˆ2.
Under H0, it has an F-distribution with 1 and n − k df, where
n − k is the degrees of freedom used to estimate σˆ2.
Nonetheless, the F-statistics and p-values in the anova table test
the covariates as they are added sequentially:
• Given the model has no covariates, testing whether hp should be added
to the model leads to p-value ≈ 0.
• Given the model already has hp, testing whether drat should be added to
the model leads to p-value = 0.0002.
• Given the model already has hp and drat, testing whether disp should be
added to the model leads to p-value = 0.05.
Prediction
Prediction problem
Given a new predictor observation, say x0 = (x01, . . . , x0,p−1)ᵀ, we
want to predict the value of the response and give an associated
interval estimate around it.
Similar to SLR, we have two kinds of predictions for MLR:
1. On-the-regression-line: we want to predict the mean response
y0 = E[y | x0] = β0 + β1x01 + . . . + βp−1x0,p−1 = x˜0ᵀβ,
with x˜0 = (1, x01, . . . , x0,p−1)ᵀ.
2. Off-the-regression-line: we want to predict an actual
observation of the response:
y∗0 = x˜0ᵀβ + ε0, ε0 ∼ N(0, σ2).
On-the-regression line prediction
• Point estimate: Since βˆ is an unbiased estimator of β, an
unbiased predictor of y0 is
yˆ0 = x˜0ᵀβˆ,
and the associated variance is
Var(yˆ0) = Var(x˜0ᵀβˆ) = x˜0ᵀ Var(βˆ) x˜0 = σ2 x˜0ᵀV x˜0,
so yˆ0 ∼ N(y0, σ2 x˜0ᵀV x˜0), where V = (XᵀX)−1.
On-the-regression line prediction
• Now, if we replace σ2 by σˆ2, then we again have a
t-statistic:
t = (yˆ0 − y0) / (σˆ √(x˜0ᵀV x˜0)) ∼ tn−p.
• Therefore, a 100(1 − α)% confidence interval for y0 is
yˆ0 ± t1−α/2,n−p σˆ √(x˜0ᵀV x˜0).
Off-the-regression line prediction
Similar to the case of simple linear regression, the only difference
between on-the-regression-line and off-the-regression-line prediction is
that the latter has one more source of variability, coming from ε0.
Therefore, we make exactly the same adjustment as in the case of
simple linear regression. Specifically,
• Point estimate for y∗0:
yˆ0 = x˜0ᵀβˆ.
• Prediction interval for y∗0:
yˆ0 ± t1−α/2,n−p σˆ √(1 + x˜0ᵀV x˜0).
Fuel consumption
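The R output originally shown on this slide is not reproduced; a sketch of both intervals via predict for the fuel consumption fit (the new covariate values below are arbitrary, purely illustrative choices):

fit <- lm(mpg ~ disp + hp + drat, data = mtcars)
new_obs <- data.frame(disp = 200, hp = 120, drat = 3.9)   # hypothetical new car
predict(fit, newdata = new_obs, interval = "confidence")  # on-the-line: CI for E[y | x0]
predict(fit, newdata = new_obs, interval = "prediction")  # off-the-line: PI for a new y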