STAM4000
Quantitative Methods
Week 10
Multiple linear regression and
inference
http://claudiaflowers.net/rsch8140/Lec1.html
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.
Week 10
Multiple linear regression and inference

Learning Outcomes
#1 Assumptions in linear regression
#2 Multiple linear regression
#3 Inference in regression
Why does this matter?
We can create linear models with more than one X variable.
We can also estimate and test regression statistics.
#1 Assumptions in linear regression
https://line.17qq.com/articles/ncpkdmmlv_p3.html
#1 Assumptions in linear regression
Use the acronym LINE
• Linearity: the underlying relationship between X and Y is linear
• Independence of errors: error values are statistically independent
• Normality of error: error values (ε) are normally distributed for any given
value of X
• Equal variance (homoscedasticity): the probability distribution of the errors
has constant variance
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group
Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Residual analysis
The residual is the difference between an observed data value and the predicted
data value from the fitted line (the fitted value).
Error = observed Y − expected value of Y from the model, i.e. ε = Y − E(Y).
The residual, e = Y − Ŷ, is the estimate of the error.
• Zero residual: the fitted value equals the observed value
• Positive residual: the fitted value is less than the observed value
• Negative residual: the fitted value is greater than the observed value
#1 More on residual analysis
A consequence of the least squares fitting algorithm is that the sum of the residuals, Σ(Y − Ŷ) = 0, and hence their mean is 0.
The variance of the errors is estimated by s² = Σ(Y − Ŷ)² / (n − k − 1), where k is the number of X variables (so the divisor is n − 2 in simple linear regression).
The estimated standard deviation of the errors is s = √[ Σ(Y − Ŷ)² / (n − k − 1) ].
The estimated standard deviation of the errors is referred to as the Standard Error in Excel output.
The Standard Error of the estimate is the typical error that occurs when the least squares
regression equation is used to estimate the value of Y for given values of the X variables.
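To make the formulas concrete, here is a minimal Python sketch (an addition, not part of the slides) that computes the residuals and the standard error of the estimate for a one-X fit, using the weekly sales data from the store example later in this topic.

```python
import numpy as np

# Sketch: residuals and the standard error of the estimate for a one-X fit.
# Data: the weekly sales example used on the later slides of this topic.
customers = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
sales = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])  # $000

b1, b0 = np.polyfit(customers, sales, 1)   # least squares slope and intercept
fitted = b0 + b1 * customers               # Y-hat for each observation
residuals = sales - fitted                 # observed minus fitted

n, k = len(sales), 1                       # k = number of X variables
print(round(residuals.sum(), 6))           # ~0: the residuals always sum to zero
s2 = (residuals ** 2).sum() / (n - k - 1)  # estimated error variance
s = np.sqrt(s2)                            # the "Standard Error" in Excel output
print(round(s, 3))
```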
#1 Residual analysis
Excel calculates residuals as part of its regression analysis, for example:
We can use these residuals to determine whether the error variable is non-
normal, whether the error variance is constant, and whether the errors are
independent…
RESIDUAL OUTPUT
Observation   Predicted Weekly Sales   Residuals (= Observed Y − Predicted Y)
1             251.92316                -6.923162
2             273.87671                38.12329
3             284.85348                -5.853484
4             304.06284                3.937162
…             …                        …
#1 Residual analysis for linearity
[Figure: paired plots of Y vs X and residuals vs X. "Not Linear": the residuals show a curved pattern over X. "Linear" (✓): the residuals scatter randomly about zero.]
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Residual analysis for independence
[Figure: plots of residuals vs X. "Not independent": the residuals follow a systematic pattern over X. "Independent" (✓): the residuals show no pattern.]
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Residual analysis for normality
A normal probability plot of the residuals can be used to check for normality: if the errors are normally distributed, the plotted points fall close to a straight diagonal line.
[Figure: normal probability plot, percent vs residual.]
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Residual analysis for equal variance (homoscedasticity)
[Figure: plots of Y vs X and residuals vs X. "Non-constant variance": the spread of the residuals fans out as X increases. "Constant variance" (✓): the spread of the residuals is roughly the same across X.]
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Example
The manager of a computer games store wishes to:
• Examine the residuals between weekly sales ($000) and the number of customers making purchases.

Weekly sales ($000)   No. customers
245                   1400
312                   1600
279                   1700
308                   1875
199                   1100
219                   1550
405                   2350
324                   2450
319                   1425
255                   1700

[Figure: scatterplot of weekly sales ($000) versus number of customers.]

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#1 Example
Does not appear to violate any regression assumptions as:
i) the residual plot is well scattered, so linearity, independence and homoscedasticity are satisfied.
ii) the normal probability plot follows an upward-sloping diagonal line.

Number of customers   Weekly sales ($000)   Predicted weekly sales ($000)   Residuals
1400                  245                   251.92                          -6.92
1600                  312                   273.88                          38.12
1700                  279                   284.85                          -5.85
1875                  308                   304.06                          3.94
1100                  199                   218.99                          -19.99
1550                  219                   268.39                          -49.39
2350                  405                   356.20                          48.80
2450                  324                   367.18                          -43.18
1425                  319                   254.67                          64.33
1700                  255                   284.85                          -29.85

[Figures: weekly sales residual plot (residuals = observed weekly sales − predicted weekly sales, plotted against number of customers) and normal probability plot of weekly sales ($000) against sample percentile.]
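As an illustrative sketch (an addition, not from the slides), the two diagnostic plots above can be reproduced in Python from the ten observations in the table.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sketch: the two diagnostic plots for the weekly sales example.
customers = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
sales = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(customers, sales, 1)
residuals = sales - (b0 + b1 * customers)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: a patternless scatter around zero supports linearity,
# independence and equal variance.
ax1.scatter(customers, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Number of customers")
ax1.set_ylabel("Residuals")
ax1.set_title("Weekly sales residual plot")

# Normal probability plot: points close to the diagonal support normality.
stats.probplot(residuals, dist="norm", plot=ax2)
ax2.set_title("Normal probability plot of residuals")

plt.tight_layout()
plt.show()
```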
#2 Multiple linear regression
https://www.pinterest.com.au/pin/515662226059772035/
#2 Model with multiple linear regression (MLR)

The population multiple regression equation with k independent variables:
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε
where β0 is the Y-intercept, β1, …, βk are the population slopes (population coefficients), and ε is the random error.

The sample (fitted) multiple regression model (equation) with k independent variables:
Ŷ = b0 + b1X1 + b2X2 + ⋯ + bkXk
where b0 is the estimated Y-intercept and b1, …, bk are the estimated slopes (coefficients).
Note: with b1, b2 etc. we may, instead, use the names of the X variables in the subscripts to the coefficients.

This Photo by Unknown Author is licensed under CC BY-SA
#2 Interpretation of coefficients or slopes in multiple linear regression (MLR)

In MLR, we have multiple X variables:
• There is a separate coefficient or slope for each X variable.

In MLR, to interpret the coefficient or slope of an X variable:
• We try to describe the effect of one X variable on the Y variable whilst:
  o holding all other X variables constant, or
  o after allowing for the effects of all other X variables.
#2 Multiple linear regression: correlation coefficients and two coefficients of determination

In MLR, we can use EXCEL to create a correlation table of correlation coefficients, r.
• This table summarises the direction and strength of the linear relationship between the Y variable and each individual X variable separately.

In MLR EXCEL regression output, r (sample correlation coefficient notation: r, R, Multiple R):
• This value of r measures the correlation between the observations and the fitted values.

In MLR, we will have two coefficients of determination:
• r²: in MLR, as in SLR, we can multiply this by 100% to get the percentage of variation in the dependent variable, Y, explained by the variation in the independent variables, X1, …, Xk.
• Adjusted r²: amended to take into consideration the number of independent variables, k.
#2 Adjusted r²
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e

Recall, the coefficient of determination, r², measures the proportion of variation in the Y variable explained by the regression on the X variable.
However, with multiple linear regression we now have more than one X variable – this affects r²:
• r² never decreases when a new X variable is added to the model – this can be a disadvantage when comparing models.

So, in MLR, we can also use the adjusted r²:
• Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used
• Penalises excessive use of unimportant explanatory variables
• Adjusted r² will be smaller than r²
• Adjusted r² is useful for comparing models
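A small sketch of the adjustment (the formula itself is an assumption here; the slide states the idea but not the expression): adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1). Checking it against the wage example that follows reproduces Excel's value.

```python
# Sketch: the usual adjustment formula, checked against the wage example output.
r2, n, k = 0.576, 40, 3          # R Square, observations, number of X variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))          # 0.540, matching Excel's "Adjusted R Square"
```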
#2 Example

A random sample of forty employees in a large, multinational company was collected and values for the following variables were recorded: annual wage ($000s), work experience (years), absenteeism (days per year absent) and years of education. A snip of the data is below. EXCEL was used to create the scatterplots, and the regression output is in the next slide.

a) Write down the regression equation.
b) Interpret the adjusted coefficient of determination.
c) Interpret the coefficient of education.
d) Interpret the coefficient of absenteeism.
e) Forecast the wage of an employee with 15 years of experience, 0 days absenteeism and 18 years of education. Is this estimate reliable? Explain.
f) Calculate and comment on the residual for your estimate in part e), given the actual data value, here, is 180, in $000 (in the table below).

Experience (years)   Absenteeism (days/year)   Education (years)   Wage ($000)
0                    12                        10                  25
2                    9                         15                  75
10                   2                         17                  120
15                   0                         18                  180
...                  ...                       ...                 ...

The complete data set may be found in the EXCEL file named "STAM4000 Data for Week 10.xls" on the subject portal, Weekly materials, Week 10.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#2 Example

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759
R Square            0.576
Adjusted R Square   0.540
Standard Error      27.165
Observations        40

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606

Notes on reading the output:
• The row labels are the names of the independent (explanatory or predictor) X variables.
• Multiple R = correlation coefficient, r = correlation between the observations and the fitted values.
• Adjusted R Square = adjusted r², an adjusted coefficient of determination. In multiple linear regression it is best to use the adjusted r², as it takes into account the effect of additional X variables on the fit of the model.
• The estimated coefficients are b0 = −64.832 (Intercept), b1 = −0.030 (Experience), b2 = −0.463 (Absenteeism) and b3 = 10.534 (Education).

This Photo by Unknown Author is licensed under CC BY-NC-ND
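For readers working in Python rather than Excel, the output above can be reproduced with statsmodels. This is a minimal sketch: the file name, sheet name and column labels below are assumptions taken from the slides, so adjust them to match your copy of the data.

```python
import pandas as pd
import statsmodels.api as sm

# Sketch: reproducing the Excel regression output for the wage example.
# File/sheet/column names are assumptions; edit them to match the workbook.
df = pd.read_excel("STAM4000 Data for Week 10.xls", sheet_name="Wage")

X = df[["Experience (years)", "Absenteeism (days/year)", "Education (years)"]]
y = df["Wage ($000)"]

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())                     # coefficients, SEs, t stats, p-values, ANOVA
print(model.rsquared, model.rsquared_adj)  # R Square and Adjusted R Square
print(model.conf_int(alpha=0.05))          # 95% confidence intervals, as in Excel
```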
#2 Example

a) Write down the regression equation.
Theoretically, for our example: Ŷ = b0 + b1X1 + b2X2 + b3X3
which translates to,
Ŷ = b0 + b1Experience + b2Absenteeism + b3Education
which translates to,
Ŷ = b0 + bExperience Experience + bAbsenteeism Absenteeism + bEducation Education
The actual answer:
Ŷ = −64.832 − 0.030 Experience − 0.463 Absenteeism + 10.534 Education

b) Interpret the adjusted coefficient of determination.
Adjusted r² = 0.540, which tells us that 54% of the variation in wage is explained by this regression on experience, absenteeism and education – a moderately strong linear relationship.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#2 Example

c) Interpret the coefficient of education.
Holding experience and absenteeism constant, we estimate that for each extra year of education, wages increase by 10.534 in $000s, i.e. by $10,534 per annum, on average.

d) Interpret the coefficient of absenteeism.
After allowing for the effects of experience and education, we estimate that for each extra day of absenteeism, wages decrease by 0.463 in $000, i.e. by $463 per annum, on average.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#2 Example

e) Forecast the wage of an employee with 15 years of experience, 0 days absenteeism and 18 years of education. Is this estimate reliable? Explain.
From the snip of data, we understand that these data values are in the sample data range, so our estimate will not be an extrapolation. (In fact, these data values exist exactly in the sample data.)
Ŷ = −64.832 − 0.030 Experience − 0.463 Absenteeism + 10.534 Education
  = −64.832 − 0.030(15) − 0.463(0) + 10.534(18)
  = 124.33 in $000
  = $124,330 per annum, estimated, on average, by the model.
As adjusted r² = 0.540, about 54% of the variation in wage is explained by this model, so the estimate is moderately reliable.

f) Calculate and comment on the residual for your estimate in part e), given the actual data value, here, is 180 in $000.
Residual = data − model
         = $180,000 − $124,330
         = $55,670 > 0, so this model has underestimated the annual wage for this employee.

This Photo by Unknown Author is licensed under CC BY-NC-ND
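The arithmetic in parts e) and f) can be checked with a few lines of Python (a quick sketch using the fitted coefficients from the output above).

```python
# Quick check of parts e) and f), using the fitted coefficients.
b0, b_exp, b_abs, b_edu = -64.832, -0.030, -0.463, 10.534

wage_hat = b0 + b_exp * 15 + b_abs * 0 + b_edu * 18   # forecast, in $000
print(round(wage_hat, 2))      # 124.33 -> about $124,330 per annum

residual = 180 - wage_hat      # observed minus predicted, in $000
print(round(residual, 2))      # 55.67 -> the model underestimates by about $55,670
```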
#2 Exercise

A random sample of thirty rental properties was collected and values recorded for: weekly rent ($/wk), distance from the city centre (km), number of bedrooms, number of bathrooms and age of the property (years).
This Photo by Unknown Author is licensed under CC BY
The complete data set may be found in the EXCEL file named "STAM4000 Data for Week 10.xls" on the subject portal, Weekly materials, Week 10.
EXCEL was used to create the regression output on the next slide. Use this to answer the following:
a) Write down the regression equation.
b) Interpret the coefficient of distance.
c) Forecast the weekly rent for a property that is 5 km from the city centre, with 3 bedrooms, 1 bathroom, and is
1 year old. Is this estimate reliable?
d) You are now told that the actual rent is $600 per week for a property that is 5 km from the city centre, with 3
bedrooms, 1 bathroom, and is 1 year old. Calculate and comment on the residual.
Distance (km) Bedrooms Bathrooms Age (years) Rent ($/wk)
23 3 2 35 350
30 2 1 40 280
15 2 1 25 450
14 1 1 10 375
2 2 2 5 750
... ... ... ... ...
#2 Exercise

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.907
R Square            0.823
Adjusted R Square   0.795
Standard Error      105.323
Observations        30

                Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept       612.985        86.397           7.095    1.95E-07   435.046     790.923
Distance (km)   -22.704        3.287            -6.907   3.07E-07   -29.473     -15.934
Bedrooms        7.417          21.853           0.339    0.737      -37.590     52.423
Bathrooms       101.921        35.376           2.881    0.008      29.063      174.779
Age (years)     2.799          2.277            1.229    0.230      -1.891      7.489
This Photo by Unknown Author is licensed under CC BY
#3 Inference in regression
https://www.cartoonstock.com/directory/s/statistics.asp
#3 Inference about the population regression relationship

Confidence intervals we will cover:
• for the intercept, β0
• for the slopes (coefficients), β1, β2, etc.
EXCEL automatically gives 95% confidence intervals for the population intercept and the population slope(s).

Hypothesis tests we will cover:
• for the intercept, β0
• for the slopes (coefficients), β1, β2, etc.
• for the overall model
EXCEL automatically gives the calculated test statistics and the corresponding p-values to test the following:
i) two-tailed tests about ZERO for the population intercept and the population slope(s)
ii) the significance of the overall model
#3 Confidence intervals in regression

Confidence intervals for the:
• intercept, β0:   b0 ± t critical × SE of b0
• slope (or coefficient) β1:   b1 ± t critical × SE of b1
• slope (or coefficient) β2:   b2 ± t critical × SE of b2
• … and so on, for all k coefficients

Note: t critical here depends on:
o the CI percentage
o degrees of freedom = df = n − k − 1, where k = number of X variables in the regression model. Note df = n − k − 1 = residual degrees of freedom.

Note:
• SE of b0 is the standard error of b0
• SE of b1 is the standard error of b1
• SE of b2 is the standard error of b2, etc.

EXCEL automatically provides 95% confidence intervals (CI) in the last two columns of the regression output. To create a CI for another percentage of confidence, we use the relevant values from the regression output and find t critical from the t table.
#3 Example

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759
R Square            0.576
Adjusted R Square   0.540
Standard Error      27.165
Observations        40

ANOVA
              df   SS          MS          F        Significance F
Regression    3    36053.136   12017.712   16.285   7.49E-07
Residual      36   26566.239   737.951
Total         39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606

For the 95% CI for the population slope (coefficient) of education, we read this directly from the regression output: (6.462, 14.606).

E.g. say we wanted a 90% CI for the population slope (coefficient) of education. For t critical, we use residual df = 36; as the t tables do not list df = 36, we can use the df = 35 row and the 0.05 column to find t critical = 1.69. Then, using the CI formula,
b_Education ± t critical × SE of b_Education
= 10.534 ± 1.69 × 2.008
= (7.140, 13.928)

This Photo by Unknown Author is licensed under CC BY-NC-ND
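The same interval can be obtained in Python with the exact residual degrees of freedom (a sketch; the slight difference from the slide comes only from using df = 36 rather than the t-table's df = 35 row).

```python
from scipy import stats

# Sketch: 90% CI for the education coefficient, using residual df = 36.
b_edu, se_edu, df_resid = 10.534, 2.008, 36

t_crit = stats.t.ppf(1 - 0.10 / 2, df_resid)       # about 1.688
lower, upper = b_edu - t_crit * se_edu, b_edu + t_crit * se_edu
print(round(t_crit, 3), round(lower, 3), round(upper, 3))
# roughly (7.14, 13.92), close to the table-based (7.140, 13.928)
```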
#3 Hypothesis tests in regression

Hypothesis tests we will cover:
• t-test for the population intercept, β0
• t-tests for the population slopes (coefficients), β1, β2, etc.
• F-test for the overall population model

Steps in hypothesis testing, with regression:
• Write the hypotheses
• Use either of these methods:
  o Critical value method: use the EXCEL-provided t calculated test statistic if testing about zero, OR create a calculated test statistic if testing for a non-zero value, and then use t-tables to find the t critical.
  o p-value method: only for tests about zero, use the appropriate p-value and compare it to the significance level, α – QUICKER.
• Decision
• Conclusion
#3 Formula for t-tests of individual coefficients or slopes

Formula for t-tests for the population intercept, β0, or for the population slopes (coefficients), β1, β2, etc.:

t calc = (b_i − β_i) / (SE of b_i)
where
β_i = population coefficient of X_i
b_i = sample coefficient of X_i
SE of b_i = standard error of the sample coefficient of X_i

As we are testing values of the population coefficients, we have β_i in the hypotheses.
To find t critical, we can use degrees of freedom = df = n − k − 1 = residual df.
#3 Quick t-tests for individual coefficients – created by EXCEL

t-tests about zero for individual coefficients are called tests of significance.
Excel automatically provides the p-value for these:
i) Two-tailed test about zero: if p-value ≤ α, reject Ho, accept Ha, etc.
ii) One-tailed test about zero: if p-value/2 ≤ α, reject Ho, accept Ha, etc.
Note: for a one-tailed test about zero, you must halve the p-value, but keep the α as is.

E.g. test the significance of the population intercept, β0:
Ho: β0 = 0
Ha: β0 ≠ 0
Use the p-value from the regression output for the intercept and compare it to α, etc.

E.g. left-tailed test for β0:
Ho: β0 = 0
Ha: β0 < 0    use p-value/2 and α

E.g. right-tailed test for β0:
Ho: β0 = 0
Ha: β0 > 0    use p-value/2 and α
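The decision rule above can be written as a small helper function (a sketch added here, not from the slides), applied to the intercept p-value from the wage example.

```python
# Sketch: the p-value decision rule for tests about zero.
def reject_h0(p_value, alpha=0.05, two_tailed=True):
    """Reject H0 when the (halved, if one-tailed) p-value is at most alpha."""
    p = p_value if two_tailed else p_value / 2
    return p <= alpha

# Intercept of the wage example: p-value = 0.0783
print(reject_h0(0.0783))                     # two-tailed: False, retain H0
print(reject_h0(0.0783, two_tailed=False))   # one-tailed: 0.0392 <= 0.05 -> True
```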
#3 Example

For the wages example, use the copy of the regression output on the next slide to answer the following:
a) Test the significance of the coefficient of experience, at α of 5%.
b) Test the significance of the coefficient of absenteeism, at α of 5%.
c) Test the significance of the coefficient of education, at α of 5%.
d) Test that the coefficient of education is greater than 10, at α of 5%.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#3 Example

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759
R Square            0.576
Adjusted R Square   0.540
Standard Error      27.165
Observations        40

ANOVA
              df   SS          MS          F        Significance F
Regression    3    36053.136   12017.712   16.285   7.49E-07
Residual      36   26566.239   737.951
Total         39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606

Use the P-values to do separate t-tests of significance for the individual population slopes (also called coefficients).

This Photo by Unknown Author is licensed under CC BY-NC-ND
#3 Example solution

a) Test the significance of the coefficient of experience, at α of 5%.
Ho: βExperience = 0
Ha: βExperience ≠ 0
Decision: p-value of 0.9764 > α of 0.05, so we retain Ho.
Conclusion: There is no significant linear relationship between wage and experience at α of 5%.

b) Test the significance of the coefficient of absenteeism, at α of 5%.
Ho: βAbsenteeism = 0
Ha: βAbsenteeism ≠ 0
Decision: p-value of 0.7693 > α of 0.05, so we retain Ho.
Conclusion: There is no significant evidence of a linear relationship between wage and absenteeism at α of 5%.

c) Test the significance of the coefficient of education, at α of 5%.
Ho: βEducation = 0
Ha: βEducation ≠ 0
Decision: p-value of 7.07E−06 = 7.07 × 10⁻⁶ ≈ 0 < α of 0.05, so we reject Ho and accept Ha.
Conclusion: The coefficient of education is statistically significant at α of 5%.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#3 Example solution continued

d) Test that the coefficient of education is greater than 10, at α of 5%.
Ho: βEducation = 10
Ha: βEducation > 10
Use
t calc = (b_Education − βEducation) / (SE of b_Education) = (10.534 − 10) / 2.008 = 0.266
To find t critical, we can use degrees of freedom = df = n − k − 1 = residual df = 36; as the t tables do not list df = 36, we use the df = 35 row. With α of 5% = 0.05 in the right tail, use the t 0.05 column of the t tables: t critical = 1.69.
Decision: As t calc of 0.266 < t critical of 1.69, t calc does NOT lie in the rejection region. We retain Ho at α of 5%.
Conclusion: There is no significant evidence that the population coefficient of education is greater than 10.

This Photo by Unknown Author is licensed under CC BY-NC-ND
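Part d) can also be checked in Python (a sketch; the p-value line is an extra check not shown on the slide).

```python
from scipy import stats

# Sketch of part d): t-test of H0: beta_Education = 10 vs Ha: beta_Education > 10.
b_edu, se_edu, df_resid = 10.534, 2.008, 36

t_calc = (b_edu - 10) / se_edu            # about 0.266
t_crit = stats.t.ppf(0.95, df_resid)      # about 1.688 (t tables at df = 35 give 1.69)
p_value = stats.t.sf(t_calc, df_resid)    # right-tail p-value, about 0.40

print(round(t_calc, 3), round(t_crit, 3), round(p_value, 3))
# t_calc < t_crit (and p-value > 0.05), so retain H0 at the 5% level.
```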
#3 Example: one-direction tests of significance

If we wanted to test for a negative linear relationship between wage and absenteeism:
Ho: βAbsenteeism = 0
Ha: βAbsenteeism < 0
Decision: As p-value/2 = 0.7693/2 ≈ 0.3847 > α of 0.05, we retain Ho.
Conclusion: There is no significant negative linear relationship between wage and absenteeism.

If we wanted to test for a positive linear relationship between wage and education:
Ho: βEducation = 0
Ha: βEducation > 0
Decision: Use p-value/2 = 7.07E−06 / 2 = (7.07 × 10⁻⁶)/2 ≈ 0 < α of 0.05, so we reject Ho and accept Ha.
Conclusion: There is a significant positive linear relationship between wage and education.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#3 Example: test the significance of the overall model
Also called a test of goodness of fit, or an F-test.

Here, we can use the Significance F value as the p-value to test the significance of the overall model. Test at α of 5%.
Ho: βExperience = βAbsenteeism = βEducation = 0
Ha: βExperience ≠ 0 and/or βAbsenteeism ≠ 0 and/or βEducation ≠ 0
Decision: The p-value of 7.49E−07 = 7.49 × 10⁻⁷ ≈ 0 < α of 0.05, so we can reject Ho and accept Ha at α of 5%.
Conclusion: There is significant evidence of a linear relationship between wage and experience, absenteeism and education, i.e. at least one of the population coefficients (slopes) is non-zero in the population.
(Note: see the notes for different versions of these hypotheses.)

This Photo by Unknown Author is licensed under CC BY
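The F statistic and its Significance F can be reproduced from the ANOVA output of the wage example (repeated on the next slide); this is a sketch of that calculation.

```python
from scipy import stats

# Sketch: the overall F-test from the wage example's ANOVA output.
ms_regression, ms_residual = 12017.712, 737.951
df_regression, df_residual = 3, 36

f_calc = ms_regression / ms_residual                  # about 16.285
p_value = stats.f.sf(f_calc, df_regression, df_residual)
print(round(f_calc, 3), p_value)                      # p-value ~ 7.5e-07 = Significance F
```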
#3 Example continued

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759
R Square            0.576
Adjusted R Square   0.540
Standard Error      27.165
Observations        40

ANOVA
              df   SS          MS          F        Significance F
Regression    3    36053.136   12017.712   16.285   7.49E-07
Residual      36   26566.239   737.951
Total         39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606

Use Significance F as the p-value to test the significance of the overall model.

This Photo by Unknown Author is licensed under CC BY-NC-ND
#3 Exercise

A random sample of thirty rental properties was collected and values for the following variables were recorded: weekly rent ($/wk), distance from the city centre (km), number of bedrooms, number of bathrooms and age of the property (years).
EXCEL was used to create the regression output on the next slide. Use this to answer the following:
a) Write down and interpret the 95% confidence interval for the coefficient of distance.
b) Test the significance of a linear relationship between rent and distance, at α of 5%.
c) Test for a negative linear relationship between rent and distance, at α of 5%.
d) Test the significance of the overall model at α of 5%.

This Photo by Unknown Author is licensed under CC BY
#3 Exercise

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.907
R Square            0.823
Adjusted R Square   0.795
Standard Error      105.323
Observations        30

ANOVA
              df   SS            MS            F             Significance F
Regression    4    1292394.73    323098.6825   29.12659158   4.39E-09
Residual      25   277322.7701   11092.9108
Total         29   1569717.5

                Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept       612.985        86.397           7.095    1.95E-07   435.05      790.92
Distance (km)   -22.704        3.287            -6.907   3.07E-07   -29.47      -15.93
Bedrooms        7.417          21.853           0.339    0.7372     -37.59      52.42
Bathrooms       101.921        35.376           2.881    0.0080     29.06       174.78
Age (years)     2.799          2.277            1.229    0.2304     -1.89       7.49

This Photo by Unknown Author is licensed under CC BY
Supplementary Exercises

• Students are advised that Supplementary Exercises to this topic may be found on the subject portal under "Weekly materials".
• Solutions to the Supplementary Exercises may be available on the portal under "Weekly materials" at the end of each week.
• Time permitting, the lecturer may ask students to work through some of these exercises in class.
• Otherwise, it is expected that all students work through all Supplementary Exercises outside of class time.
Extension
• The following slides are an extension to this week’s topic.
• The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
Example
A random sample of forty employees in a large, multinational company was
collected and values for the following variables were recorded: annual wage
($000s), work experience (years), absenteeism (days per year) and years of
education.
Open the EXCEL file named “STAM4000 Week 10 Excel.xls” and use the sheet
named “Wage”.
a)Use EXCEL to create separate scatterplots of wage against each of the
explanatory variables.
b) Briefly describe each scatterplot.
This Photo by Unknown Author is licensed under
CC BY-NC-ND
Example solution

a) [Figures: i) scatterplot of wage ($000) against education (years); ii) scatterplot of wage ($000) against experience (years); iii) scatterplot of wage ($000) against absenteeism (days).]

b)
i) Scatterplot of wage against education: strong (as most points are tightly clustered), positive (upward sloping), linear relationship, with one unusual point – an employee with 10 years of education but a very high wage of $140,000.
ii) Scatterplot of wage against experience: possibly curved, due to unusual points for employees with around 20 years of experience but relatively low wages.
iii) Scatterplot of wage against absenteeism: negative (downward sloping), moderately strong (as points are not tightly clustered), possibly linear relationship; one employee had absenteeism of 15 days – a relatively large number.

This Photo by Unknown Author is licensed under CC BY-NC-ND
Example

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Data are collected for 15 weeks.

Dependent (response) variable, Y: pie sales (units per week), the number of pies sold per week.
Independent (explanatory or predictor) variables, X1, X2: Price (in $), Advertising ($100s).

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Example continued

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA
              df   SS          MS         F         Significance F
Regression    2    29460.027   14730.01   6.53861   0.01201
Residual      12   27033.306   2252.776
Total         14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888

Ŷ = b0 + b1Price + b2Advertising
Ŷ = 306.526 − 24.975 Price + 74.131 Advertising

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Example continued

Write the regression equation. Interpret the coefficients.

Estimated Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
where,
• Sales is the number of pies sold per week
• Price is in $
• Advertising is in $100s, i.e. hundreds of dollars.

b1 = −24.975:
Holding advertising constant, the number of pies sold will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.

b2 = 74.131:
Holding price constant, the number of pies sold will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price.

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Example continued

Predict sales for a week in which the selling price is $5.50 and advertising is $350. Explain.

Estimated Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
                = 306.526 − 24.975(5.50) + 74.131(3.5)
                = 428.62
Note that Advertising is in $100s, so $350 means that X2 = 3.5.
Predicted sales is 428.62 pies.
• The values of selling price, $5.50, and advertising, $350, are in the sample data range, so the estimate is not an extrapolation.
• Adjusted R Square = 0.4417, which tells us that 44.17% of the variation in pie sales is explained by this regression on price and advertising, so the estimate is moderately reliable.

If you are now told that in a week in which the selling price is $5.50 and advertising is $350 the sales were actually 400 pies, find the residual and comment.
Residual = data − model = 400 − 428.62 = −28.62 pies. As the residual is negative, this model has overestimated the number of pies sold here.

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
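A quick Python check of the prediction and residual above (a sketch using the fitted coefficients; remember advertising enters in hundreds of dollars).

```python
# Quick check of the pie-sales prediction and residual.
b0, b_price, b_adv = 306.526, -24.975, 74.131

sales_hat = b0 + b_price * 5.50 + b_adv * 3.5   # advertising of $350 enters as 3.5 ($100s)
print(round(sales_hat, 2))                      # about 428.62 pies

residual = 400 - sales_hat
print(round(residual, 2))                       # about -28.62: the model overestimates here
```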
Example continued

Test the following, at α of 5%:

a) Test the significance of the coefficient of price.
Ho: βPrice = 0
Ha: βPrice ≠ 0
p-value = 0.0398 < α of 0.05, so we reject Ho and accept Ha, concluding that there is a significant linear relationship between the number of pies sold and the price of the pie.

b) Test the significance of a positive linear relationship between sales and advertising.
Ho: βAdvertising = 0
Ha: βAdvertising > 0
Use p-value/2 = 0.0145/2 = 0.0073 < α of 0.05, so we reject Ho and accept Ha, concluding that there is a significant positive linear relationship between the number of pies sold and the amount spent on advertising.

c) Test the significance of the overall model, i.e. a goodness of fit test.
Ho: βPrice = βAdvertising = 0
Ha: βPrice ≠ 0 and/or βAdvertising ≠ 0
Use Significance F = p-value = 0.0120 < α of 0.05, so we reject Ho and accept Ha, concluding that there is an overall significant linear relationship, i.e. at least one population coefficient is non-zero.

Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e