STA302H1: Methods of Data Analysis I
(Lecture 5)
Mohammad Kaviul Anam Khan
Assistant Professor
Department of Statistical Sciences
University of Toronto
Distribution of the regression parameters
Distribution of β̂
• Since β̂ is a linear combination of Y, β̂ also follows a normal distribution, with
$$E(\hat{\beta} \mid X) = E\left((X'X)^{-1}X'Y \mid X\right) = (X'X)^{-1}X'X\beta = \beta$$
• Thus, β̂ | X is an unbiased estimator of β
• The variance is
$$\mathrm{Var}(\hat{\beta} \mid X) = \mathrm{Var}\left((X'X)^{-1}X'Y \mid X\right) = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$
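A minimal R sketch (simulated data; every name and value below is illustrative, not from the course) that computes β̂ = (X′X)⁻¹X′y directly and checks it, together with its estimated variance matrix, against lm():

```r
# Minimal sketch with simulated data -- names and values are illustrative.
set.seed(302)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))       # design matrix: intercept + 2 predictors
beta <- c(2, 1, -0.5)
sigma <- 1.5
y <- drop(X %*% beta + rnorm(n, sd = sigma))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
fit <- lm(y ~ X[, 2] + X[, 3])
cbind(matrix_algebra = beta_hat, lm = coef(fit))   # identical estimates

# vcov(fit) is S^2 (X'X)^{-1}, the plug-in version of sigma^2 (X'X)^{-1}
S2 <- sum(resid(fit)^2) / (n - 2 - 1)   # n - p - 1 with p = 2
all.equal(unname(S2 * solve(t(X) %*% X)), unname(vcov(fit)))
```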
Distribution of βˆ
• Thus, let $C = (X'X)^{-1}$. Then the variance of $\hat{\beta}_j$ is $\sigma^2 C_{jj}$ and
$\mathrm{Cov}(\hat{\beta}_k, \hat{\beta}_j) = \sigma^2 C_{kj}$
• The least squares estimates are the Best Linear Unbiased Estimator (BLUE)
according to the Gauss-Markov theorem
• The proof of Gauss-Markov will be skipped for now
• However, we need to know that the Gauss-Markov theorem assumes:
• $E(\epsilon) = 0$, that is, the error mean is zero
• $\mathrm{Var}(\epsilon) = \sigma^2 I$, that is, homoscedastic and uncorrelated errors
• It does not assume normality
Distribution of β̂
• As in the simple linear regression case, the $\hat{\beta}_j$'s also follow a normal distribution
• That is, $\hat{\beta}_j \sim N(\beta_j, \sigma^2 C_{jj})$
• We can test the hypothesis
$$H_0 : \beta_j = \beta_j^{(0)} \qquad \text{vs.} \qquad H_1 : \beta_j \neq \beta_j^{(0)}$$
with a z-test, when $\sigma^2$ is known, using the test statistic
$$Z = \frac{\hat{\beta}_j - \beta_j^{(0)}}{\sigma\sqrt{C_{jj}}}$$
• Or we can replace $\sigma$ with $S$, where $S = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^{n} e_i^2}$, and perform a
t-test with
$$T = \frac{\hat{\beta}_j - \beta_j^{(0)}}{S\sqrt{C_{jj}}}$$
where $T \sim t_{n-p-1}$
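To make the test statistic concrete, here is a hedged R sketch (simulated data, illustrative names) that builds T from S and Cjj by hand and compares it with the t value that summary(lm()) reports:

```r
# Sketch with simulated data: t-test for a single coefficient, assembled by hand.
set.seed(1)
n <- 50; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)            # true coefficient of x2 is 0
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
C <- solve(t(X) %*% X)                # C = (X'X)^{-1}
S <- sqrt(sum(resid(fit)^2) / (n - p - 1))
j <- 2                                # tests H0: beta_1 = 0, i.e. beta_j^(0) = 0
T_stat <- (coef(fit)[j] - 0) / (S * sqrt(C[j, j]))
2 * pt(abs(T_stat), df = n - p - 1, lower.tail = FALSE)   # two-sided p-value
summary(fit)$coefficients             # same t value and Pr(>|t|) in the x1 row
```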
Interpretation
• These subset examples show us that the relationship between predictors and
response depends on the values that the other predictors take in our multiple
regression model
• So when we interpret a coefficient from a multiple linear model, we must reflect
this conditioning
• For a one-inch increase in height, we see on average a decrease of 0.55% body fat
when abdomen size is fixed at a constant value
• We always need to make such conditional interpretations
ANOVA
The RSS for Multiple Linear Regression
• The RSS for multiple regression can be given as
$$\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = e'e$$
• Again recall that $e = (I - H)y$, where $H = X(X'X)^{-1}X'$
• Thus, $\text{RSS} = y'\left[I - X(X'X)^{-1}X'\right]y$
• What is E (RSS)?
The RSS for Multiple Linear Regression
Theorem
If $y$ is an $n \times 1$ random vector with mean vector $\mu$ and variance-covariance matrix $V$
(non-singular), and $A$ is an $n \times n$ matrix of constants, then
$$E(y'Ay) = \mathrm{tr}(AV) + \mu'A\mu$$
• In the case of RSS we have $A = I - X(X'X)^{-1}X'$ and $V = \sigma^2 I$
• Thus, $\mathrm{tr}(AV) = \mathrm{tr}\left((I - X(X'X)^{-1}X')\sigma^2 I\right) = \sigma^2\,\mathrm{tr}(I - X(X'X)^{-1}X')$
• Recall that $\mathrm{tr}(A - B) = \mathrm{tr}(A) - \mathrm{tr}(B)$
• Thus $\mathrm{tr}(I - X(X'X)^{-1}X') = \mathrm{tr}(I) - \mathrm{tr}(X(X'X)^{-1}X')$
• Obviously $\mathrm{tr}(I) = n$
• Recall that $\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$. Thus,
$\mathrm{tr}(X(X'X)^{-1}X') = \mathrm{tr}(X'X(X'X)^{-1}) = \mathrm{tr}(I_{p+1}) = p + 1$
• Thus, $\sigma^2\,\mathrm{tr}(I - X(X'X)^{-1}X') = \sigma^2(n - p - 1)$
The RSS for Multiple Linear Regression
• Now we know $\mu = X\beta$
• Thus, $\mu'A\mu = (X\beta)'(I - X(X'X)^{-1}X')X\beta = \beta'X'X\beta - \beta'X'X(X'X)^{-1}X'X\beta = 0$
• This implies that $E(\text{RSS}) = E(e'e) = E\left(\sum_{i=1}^{n} e_i^2\right) = (n - p - 1)\sigma^2$
• Thus,
$$E\left(\frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}\right) = E(\text{MRSS}) = \sigma^2$$
• where p is the number of covariates. One can easily obtain the corresponding result
for simple linear regression by setting p = 1
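The result $E(\text{RSS}/(n-p-1)) = \sigma^2$ is easy to check by simulation; a sketch (all settings illustrative):

```r
# Sketch: checking E(RSS / (n - p - 1)) = sigma^2 by simulation.
set.seed(42)
n <- 60; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))
beta <- c(1, 0.5, -1, 2)

mrss <- replicate(5000, {
  y <- drop(X %*% beta + rnorm(n, sd = sigma))
  sum(lm.fit(X, y)$residuals^2) / (n - p - 1)   # RSS / (n - p - 1)
})
mean(mrss)   # should be close to sigma^2 = 4
```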
The RSS and SSreg for Multiple Linear Regression
• $\text{SSreg} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ can be constructed similarly to the SLR case, and under the
null hypothesis below we have
$$F_0 = \frac{\text{SSreg}/p}{\text{RSS}/(n - p - 1)} \sim F(p,\, n - p - 1)$$
• Since RSS and SSreg can both be represented in vectorized form, if we can show
that the corresponding vectors are orthogonal (their dot product is zero), then the
two sums of squares are independent
• We can perform an F-test. The null hypothesis is
$$H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0$$
and the alternative is $H_1$: at least one $\beta_j \neq 0$
ANOVA table
• The ANOVA table for multiple regression looks like the following:

Source of variation | Sum of squares | DF        | Mean squares           | F value
Regression          | SSreg          | p         | MSreg = SSreg/p        | F0 = MSreg/MRSS
Residuals           | RSS            | n − p − 1 | MRSS = RSS/(n − p − 1) |
Total               | SST            | n − 1     |                        |

• We can create the ANOVA table in R using the anova command (see the sketch below)
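One caveat, hedged: in R, anova() on a single fitted model prints sequential sums of squares (one row per predictor) rather than the single pooled Regression row above; the pooled F0 is the F-statistic that summary() reports. An illustrative sketch with simulated data:

```r
# Illustrative sketch: relating R output to the ANOVA table above.
set.seed(7)
n <- 40; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.8 * x1 - 0.6 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

anova(fit)      # sequential SS: one row per predictor plus the Residuals row
summary(fit)    # the F-statistic line is the pooled F0 = MSreg / MRSS

# Pooled version assembled by hand:
RSS   <- sum(resid(fit)^2)
SSreg <- sum((fitted(fit) - mean(y))^2)
F0 <- (SSreg / p) / (RSS / (n - p - 1))
c(F0 = F0, p_value = pf(F0, p, n - p - 1, lower.tail = FALSE))
```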
Adjusted R²
• Recall that the coefficient of determination is defined as $R^2 = \text{SSreg}/\text{SST}$
• However, as the number of variables in the model increases, $R^2$ also increases
even if the model is not right (Why?)
• This is very closely related to the concept of overfitting (discussed later)
• We can correct for this increase by using an adjusted $R^2$
• The adjusted $R^2$ is given by
$$R^2_{\text{adjusted}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{SST}/(n - 1)}$$
• This accounts for the addition of multiple predictors
• If we are comparing models with different numbers of predictors, we should use
$R^2_{\text{adjusted}}$ rather than $R^2$ (WHY?)
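A sketch computing both quantities by hand and checking them against summary() (simulated, illustrative data; note that the pure-noise predictor x2 still nudges R² upward):

```r
# Sketch: R^2 and adjusted R^2 by hand vs. summary().
set.seed(11)
n <- 40; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.8 * x1 + rnorm(n)   # x2 is pure noise, yet it still raises R^2
fit <- lm(y ~ x1 + x2)

SST <- sum((y - mean(y))^2)
RSS <- sum(resid(fit)^2)
r2     <- 1 - RSS / SST                              # = SSreg / SST
r2_adj <- 1 - (RSS / (n - p - 1)) / (SST / (n - 1))
c(manual = r2,     lm = summary(fit)$r.squared)      # should agree
c(manual = r2_adj, lm = summary(fit)$adj.r.squared)  # should agree
```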
Partial F-test
Testing a subset of predictors
• Sometimes our interest could be to investigate the significance of a group (subset)
of predictors
• When we say we are trying to test a subset of predictors for their relationship with
the response, we are really testing which of two possible models is better
• Consider the full model, which includes all the p predictors we think represent the
true relationship with the response
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
• We fit this model and notice that the first k predictors, k < p, don't have
significant t-tests
• Then we can just remove the first k predictors and refit the model
$$Y = \beta_0 + \beta_{k+1} X_{k+1} + \beta_{k+2} X_{k+2} + \cdots + \beta_p X_p + \epsilon$$
• The second model is called the reduced model
• We can test whether the reduced model is a better fit using the partial F-statistic
$$F_0 = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}})/k}{\text{RSS}_{\text{full}}/(n - p - 1)} \sim F(k,\, n - p - 1)$$
under $H_0 : \beta_1 = \cdots = \beta_k = 0$
Example: nyc dataset
• To perform a partial F-test we are going to use the nyc.csv dataset. The file is
uploaded on Quercus
• Data from surveys of customers of 168 Italian restaurants in the target area are
available
• The data has the following variables:
1. Y : Price = the price (in $US) of dinner (including 1 drink & a tip)
2. x1 : Food = customer rating of the food (out of 30)
3. x2 : Décor = customer rating of the décor (out of 30)
4. x3 : Service = customer rating of the service (out of 30)
5. x4 : East = dummy variable = 1 (0) if the restaurant is east (west) of Fifth Avenue
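A hedged sketch of how this test could be run in R (the file path, column names, and the choice to drop Service are illustrative assumptions; anova() on nested fits computes the partial F-statistic):

```r
# Sketch: partial F-test on the nyc data (path and dropped subset illustrative).
nyc <- read.csv("nyc.csv")   # assumed columns: Price, Food, Decor, Service, East

full    <- lm(Price ~ Food + Decor + Service + East, data = nyc)
reduced <- lm(Price ~ Food + Decor + East, data = nyc)   # drops Service (k = 1)

anova(reduced, full)   # partial F-test of H0: the dropped coefficient(s) equal 0

# Same statistic assembled by hand:
rss_f <- sum(resid(full)^2); rss_r <- sum(resid(reduced)^2)
k <- 1; df_f <- df.residual(full)            # df_f = n - p - 1
F0 <- ((rss_r - rss_f) / k) / (rss_f / df_f)
pf(F0, k, df_f, lower.tail = FALSE)          # matches the anova() p-value
```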
Diagnostic Checking
Diagnostic Checking
• Recall the assumptions of linear regression:
• Linearity
• Homoscedasticity in the error terms
• Normality of the errors
• One of the important tasks before we move on with our analyses is to check these
assumptions
• These checks are often referred to as diagnostic checking
• In this lecture we are going to discuss how to check them!
Diagnostic Checking
• Recall a simple linear regression model
Y = β0 + β1X + ϵ
E (Y |X ) = β0 + β1X
ϵ = Y − E (Y |X )
• Obviously we don’t know the true relationship between Y and X
• We can only estimate the relationship
• The fitted regression $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X$ produces the estimate for $E(Y \mid X)$
• e is an unbiased estimate of ϵ
• Thus e can be used to check the validity of the model
Anscombe’s Four Data Sets
• Anscombe (1973) constructed 4 small toy datasets to illustrate how blindly
fitting a simple linear regression model can lead to very misleading conclusions
about the data. (anscombe.txt from the textbook website)
• Each dataset contains 11 observations, with a single predictor variable and
response.
• The responses in each dataset are different, but the predictors in 3 of the 4
datasets are identical.
Anscombe’s Four Data Set
• The data are as follows:
case x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
• It’s not obvious by looking at the raw data, but a linear model would only be
appropriate for one of these datasets.
• How should we check that?
Anscombe’s Four Data Set
[Figure: scatterplots of the four Anscombe datasets: y1 vs x1, y2 vs x2, y3 vs x3, and y4 vs x4]
Anscombe’s Four Data Set
• Now let's look at the estimated coefficients (rounded to two decimal places)
Dataset    β̂0     β̂1
x1, y1     3.00    0.50
x2, y2     3.00    0.50
x3, y3     3.00    0.50
x4, y4     3.00    0.50
• Does this mean that a linear model is appropriate for all four datasets?
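R ships Anscombe's quartet as the built-in data frame anscombe, so this table can be reproduced in a few lines (a sketch):

```r
# Reproducing the coefficient table from R's built-in anscombe data frame.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
t(sapply(fits, coef))   # each row: intercept near 3.00, slope near 0.50
```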
Anscombe’s Four Data Sets
[Figure: the four Anscombe scatterplots shown again, alongside the conclusions below]
• Dataset 1: the linear model is appropriate
• Dataset 2: the relationship is quadratic, so a linear model is not appropriate
• Dataset 3: the fitted line is influenced by one outlier, so not appropriate
• Dataset 4: the slope of the regression is determined by a single point, so not
appropriate
Anscombe’s Four Data Sets
• Anscombe's datasets were meant to illustrate that one should always supplement
one's modelling with an investigation of the modelling assumptions
• We first need to check whether the relationship between X and Y is linear
• A scatterplot does not always help, for example when we have more than one
predictor (multiple regression)
• An individual observation can have a massive impact on model fit
• What would be an easy way to check model assumptions?
Residual Plots
• One way to check the assumptions is through residual plots
• Plotting residuals allows us to visually inspect the model assumptions (WHY?)
• Residuals measure the remaining variability in the data after fitting a model.
Thus, they are more sensitive to any irregularities
• There are two main types of residual scatter plots that we use:
• Residual vs predictor plot, i.e., plot e against the observed values of X. Not
appropriate for multiple regression
• Residual vs fitted plot, i.e., plot e against ŷ
• Both plots are useful for checking whether the assumptions hold (see the sketch below)
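In R, each plot takes one line; a minimal sketch with simulated, illustrative data:

```r
# Sketch: the two residual plots for an illustrative simulated fit.
set.seed(5)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

plot(x, resid(fit), ylab = "Residuals", main = "Residuals vs Predictor")
abline(h = 0, lty = 2)
plot(fitted(fit), resid(fit), ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```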
Residual Plots
• Assumptions hold if there is no evident pattern seen in the residual plots
• Specifically, we hope that the residuals are uniformly scattered around 0 for the
full range of predictor values
• Important patterns to be aware of for each of the assumptions are:
• Linearity of the relationship: any systematic pattern in the residuals, such as a
curve
• Independence of the errors: clusters of residuals that have obvious separation
from the rest
• Homoscedasticity: any pattern, but especially a fanning pattern, where residuals
gradually become more spread out
Anscombe’s residual vs predictors
[Figure: residuals e1–e4 plotted against the predictors Ansc$x1–Ansc$x4]
Anscombe’s residual vs fitted
[Figure: residuals e1–e4 plotted against the fitted values y1hat–y4hat]
Residual Plots
• Why are residual plots important?
• It can be shown that if the linear assumption is correct then ei ≈ ϵi
• So the residuals should resemble the random errors, which should not vary with
different values of X
• For example, assume that the true relationship is modeled by
$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 X^2$$
• This is a quadratic model (will be discussed later)
• If a linear model is fit to these data then we have
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
• Assume $\hat{\beta}_0 \approx \beta_0$ and $\hat{\beta}_1 \approx \beta_1$
• Then $e_i = y_i - \hat{y}_i \approx \epsilon_i + \beta_2 x_i^2$
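The plots on the next slide can be reproduced with a short simulation along these lines (a sketch; the coefficients and sample size are illustrative):

```r
# Sketch: fit a straight line to quadratic data and look at the residuals.
set.seed(9)
x <- runif(200, -2, 2)
y <- 1 + x + x^2 + rnorm(200, sd = 0.5)   # true E(Y|X) is quadratic
fit <- lm(y ~ x)                          # misspecified straight-line fit
e <- resid(fit)

plot(x, y)                                    # curved cloud of raw data
plot(x, e); abline(h = 0, lty = 2)            # systematic curve in e vs x
plot(fitted(fit), e); abline(h = 0, lty = 2)  # same curve vs fitted values
```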
Residual Plot Quadratic Function
[Figure: simulated quadratic data: y vs x, residuals e vs x, and residuals e vs the fitted values yhat.sim]
Residual plot for the Production data
[Figure: Production data: Residuals vs Predictors (Order Size) and Residuals vs Fitted (Fitted Values)]
Diagnostic Steps
There are a number of steps that it is good practice to follow to determine whether you
have fit a valid model (see the sketch after this list):
1. Visually assess model assumptions using residual plots
2. Determine which (if any) data points have x-values with unusually large effect on
regression model (leverage points)
3. Determine which (if any) data points are outliers in their responses
4. Assess the influence on the fitted model of any bad leverage points
5. Examine whether constant error variance assumption is reasonable
6. If data were collected over time, examine whether the data are correlated with
time
7. If sample size is small or prediction intervals are of interest, examine whether the
normality of errors assumption is reasonable.
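Several of these steps are covered at once by R's built-in diagnostic plots; a sketch (reusing the illustrative nyc fit from earlier, with assumed column names):

```r
# Sketch: R's built-in diagnostics cover several of these steps at once.
nyc <- read.csv("nyc.csv")   # illustrative; any lm fit works here
fit <- lm(Price ~ Food + Decor + Service + East, data = nyc)
par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
```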