Programming assignment example - B200F
Date: 2022-03-30
Integrated Business Foundation BUS B200F
DM Module | Simple Linear Regression
Lee Shau Kee School of Business and Administration
Module 9
Decision Making Skills
Simple Linear Regression
Lecture Outline
Regression and Correlation
Least Squares Regression Equation
Regression in Excel
Regression Assumptions
Inferences about the Slope and Correlation Coefficients
Estimation of Mean Values and Prediction of Individual Values
Remarks for Regression
What is Regression Analysis
Regression Analysis is used to
Predict the value of a Dependent Variable (Response, Y) based on the value of at least one Independent Variable (Predictor, Xi)
Explain the impact of changes in an Independent Variable (Predictor, Xi) on the Dependent Variable (Response, Y)
Simple Linear Regression
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in the continuous variable Y are assumed to be related to changes in X
Dependent and Independent Variables
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable (or factor) used to explain the dependent
variable
Examples:
Dependent variable (Y)       Independent variable (X)
Exam score                   Test score
Sales of ice-cream           Outdoor temperature
Project completion time      Number of workers
Check for Linear Relationship
To observe if there is any linear relationship between variables X and Y, we
need to take n pairs of observations, (x1,y1), …, (xn,yn), then either
Plot the data set into a scatter plot (of Y against X) to show the relationship between the
two variables
Perform a correlation analysis to measure the strength of the linear association between
the two variables
$r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$
Types of Relationships between Variables
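Correlation Coefficient (r)
Measures the strength of the linear relationship between two variables
Sample correlation coefficient, r
$r = \dfrac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$
where
$S_{XX} = \sum (X - \bar{X})^2 = \sum X^2 - \dfrac{(\sum X)^2}{n} = (n-1) s_X^2$
$S_{YY} = \sum (Y - \bar{Y})^2 = \sum Y^2 - \dfrac{(\sum Y)^2}{n} = (n-1) s_Y^2$
$S_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \dfrac{\sum X \sum Y}{n}$
and sX and sY are the sample standard deviations of X and Y respectively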
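The computational formulas for the sums of squares and r can be sketched in Python as follows. This is only an illustrative sketch: the function names `sums_of_squares` and `correlation` and the five data pairs are assumptions made for the example, not part of the lecture.

```python
import math

def sums_of_squares(x, y):
    """Return (S_XX, S_YY, S_XY) from the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def correlation(x, y):
    """Sample correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, chosen only to keep the arithmetic small
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_xx, s_yy, s_xy = sums_of_squares(x, y)
r = correlation(x, y)
print(s_xx, s_yy, s_xy)  # 10.0 6.0 6.0
print(round(r, 4))       # 0.7746
```

Here r is positive and fairly close to +1, so the toy data show a reasonably strong positive linear association.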
Range of r
Values range from -1 to +1, i.e. -1 ≤ r ≤ 1
r > 0 Positive linear relationship between X and Y, i.e. X↑ Y↑
r < 0 Negative linear relationship between X and Y, i.e. X↑ Y↓ or X↓ Y↑
r = 0 No linear relationship between X and Y
[Figure: three scatter plots with r close to +1, r close to -1, and r close to 0]
Simple Linear Regression Model
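$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where
$Y_i$: Dependent Variable
$\beta_0$: Population Y intercept
$\beta_1$: Population Slope Coefficient
$X_i$: Independent Variable
$\varepsilon_i$: Random Error term
$\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component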
Random Error and Regression Line
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population
regression line
$\hat{Y}_i = b_0 + b_1 X_i$
where
$\hat{Y}_i$: Estimated (or predicted) Y value for observation i
$b_0$: Estimate of the regression intercept
$b_1$: Estimate of the regression slope
$X_i$: Value of X for observation i
Least Squares Estimation (LSE)
A method to find the regression equation
By minimizing the sum of squared errors (or residuals) between the actual y
values and predicted y values:
$\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} \left(Y_i - (b_0 + b_1 X_i)\right)^2$
Least Squares Regression Coefficients
Least squares estimate for slope coefficient:
$b_1 = \dfrac{S_{XY}}{S_{XX}} = \dfrac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$
Interpretation: the estimated change in the mean value of Y when X is increased by one unit
Least squares estimate for intercept coefficient:
$b_0 = \dfrac{\sum Y}{n} - b_1 \dfrac{\sum X}{n} = \bar{Y} - b_1 \bar{X}$
Interpretations: the estimated mean value of Y when X is zero; or the part of Y that is not affected by X
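The slope and intercept formulas translate directly into code. The sketch below is illustrative only: `least_squares` and `sse` are hypothetical helper names, and the data are made up. It also checks numerically that moving the coefficients slightly away from the least squares values increases the sum of squared errors.

```python
def least_squares(x, y):
    """Return (b0, b1) that minimise the sum of squared errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, not taken from the lecture
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(round(b0, 4), round(b1, 4))  # 2.2 0.6

# Any nearby line has a larger SSE than the least squares line
best = sse(x, y, b0, b1)
nearby = [sse(x, y, b0 + d0, b1 + d1)
          for d0 in (-0.1, 0.1) for d1 in (-0.05, 0.05)]
print(all(best < other for other in nearby))  # True
```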
Example: Real Estate Agent
A real estate agent wishes to examine the relationship between the selling price of a
home (in USD1,000) and its size (in square feet) in Arizona, U.S.A. She sampled 10
houses in the state at random and the results are as follows.
a) Construct a scatter plot between House Price and Square Feet.
b) Find the least squares regression line and add it to the scatter plot.
c) Interpret the slope and intercept coefficients.
d) Find the correlation coefficient and interpret its meaning.
e) Predict the price of a house of size 2000 square feet.
Square Feet (X) 1400 1600 1700 1875 1100 1550 2350 2450 1425 1700
House Price (Y) 245 312 279 308 199 219 405 324 319 255
Example: Real Estate Agent (Answer)
Answer:
a) [Figure: scatter plot of House Price ($1000s) against Square Feet]
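b) From the data set, we have
n = 10, $\sum X = 17150$, $\sum X^2 = 30983750$,
$\sum Y = 2865$, $\sum Y^2 = 853423$, $\sum XY = 5085975$,
$\bar{X} = 1715$, $\bar{Y} = 286.5$
$S_{XX} = \sum X^2 - \dfrac{(\sum X)^2}{n} = 30983750 - \dfrac{17150^2}{10} = 1571500$
$S_{XY} = \sum XY - \dfrac{\sum X \sum Y}{n} = 5085975 - \dfrac{17150 \times 2865}{10} = 172500$
$b_1 = \dfrac{S_{XY}}{S_{XX}} = \dfrac{172500}{1571500} = 0.1098$
$b_0 = \bar{Y} - b_1 \bar{X} = 286.5 - 0.1098 \times 1715 = 98.2483$
The least squares regression line is given as
$\hat{Y} = 98.2483 + 0.1098 X$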
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line]
house price = 98.2483 + 0.1098 (square feet)
Slope = 0.10977
Intercept = 98.2483
c) Intercept: there is a fixed cost of USD98,248.33 (= b0*USD1,000) that is independent of the size of the house.
(Note: it is meaningless to interpret this as "the average price of a house of size 0 square feet is USD98,248.33".)
Slope: for each extra square foot, the house price is expected to increase by USD109.8 (= b1*USD1,000) on average.
d) $r = \dfrac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \dfrac{172500}{\sqrt{1571500 \times 32600.5}} = 0.7621$
There is a relatively strong positive linear relationship between House Price and Square Feet.
e) $\hat{Y} = 98.2483 + 0.1098 \times 2000 = 317.7838$ (computed with the unrounded slope)
The predicted price for a house of 2000 square feet is USD317,784.
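The hand calculations in parts b), d) and e) can be checked with a few lines of Python. This is a sketch for verification only; the variable names are illustrative, and the data are the ten houses from the example.

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

n = len(square_feet)
sum_x, sum_y = sum(square_feet), sum(house_price)
s_xx = sum(x * x for x in square_feet) - sum_x ** 2 / n
s_yy = sum(y * y for y in house_price) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(square_feet, house_price)) - sum_x * sum_y / n

b1 = s_xy / s_xx                    # slope
b0 = sum_y / n - b1 * sum_x / n     # intercept
r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
predicted = b0 + b1 * 2000          # predicted price at 2000 square feet

print(s_xx, s_xy, s_yy)            # 1571500.0 172500.0 32600.5
print(round(b0, 4), round(b1, 5))  # 98.2483 0.10977
print(round(r, 4))                 # 0.7621
print(round(predicted, 4))         # 317.7838
```

Keeping the slope unrounded until the last step is what gives 317.7838; using the rounded value 0.1098 would give a slightly different prediction.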
Linear Regression in Excel
Activate the ‘Data Analysis’ add-in (click File / Options / Add-Ins / Go, select ‘Analysis ToolPak’, and then click OK)
Input the data of Exercise 1.1 into two adjacent columns in an Excel file, then click Data / Data Analysis / Regression, specify the ‘Input Y Range’ (B1:B8) and the ‘Input X Range’ (A1:A8), and tick ‘Labels’.
In the dialog shown, the ranges are $B$1:$B$11 and $A$1:$A$11 (a label row plus 10 observations).
Regression Output in Excel
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

The regression equation is:
house price = 98.2483 + 0.10977 (square feet)
Regression Assumptions
Linearity
The relationship between X and Y is linear
Independence of Errors
Error values are statistically independent
Normality of Error
Error values are normally distributed for any given value of X
Equal Variance (also called homoscedasticity)
The probability distribution of the errors has constant variance
Residual
Residual (e_i)
Observed Error
Estimate of the random error $\varepsilon_i$
$e_i = \hat{\varepsilon}_i = Y_i - \hat{Y}_i$
Sum of Squared Errors (SSE)
$SSE = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = S_{YY} - \dfrac{S_{XY}^2}{S_{XX}}$
Least Squares Estimation
Choose the coefficients b0 and b1 such that SSE is minimized
Standard Error of Estimate
Mean Square Error
Point estimate of the constant variance of errors, $\sigma^2 = \mathrm{Var}(\varepsilon_i)$
$\hat{\sigma}^2 = s^2 = MSE = \dfrac{SSE}{n-2}$
Standard error of the estimate, s
$s = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}}$
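As a rough illustration, the residuals, SSE, MSE and s can be computed for the real estate data from the earlier example. The sketch below uses illustrative variable names and also confirms that summing squared residuals agrees with the shortcut $S_{YY} - S_{XY}^2/S_{XX}$.

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

n = len(square_feet)
x_bar, y_bar = sum(square_feet) / n, sum(house_price) / n
s_xx = sum((x - x_bar) ** 2 for x in square_feet)
s_yy = sum((y - y_bar) ** 2 for y in house_price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals e_i = Y_i - Yhat_i, then SSE by direct summation and by the shortcut
residuals = [y - (b0 + b1 * x) for x, y in zip(square_feet, house_price)]
sse = sum(e * e for e in residuals)
sse_shortcut = s_yy - s_xy ** 2 / s_xx

mse = sse / (n - 2)   # point estimate of the error variance
s = math.sqrt(mse)    # standard error of the estimate

print(round(sse, 4), round(sse_shortcut, 4))  # 13665.5652 13665.5652
print(round(mse, 4), round(s, 4))             # 1708.1957 41.3303
```

These values match the Residual SS, Residual MS and Standard Error entries in the Excel regression output.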
Standard Error of the Estimate (cont’d)
s (standard error of the estimate) is a measure of the variation of the observed
Y values from the regression line
The magnitude of s should always be judged relative to the size of the Y values
in the sample data
Measures of Variation
Total variation is made up of two parts
$SST = SSR + SSE$
Total Sum of Squares (Total Variation): $SST = \sum (Y_i - \bar{Y})^2 = S_{YY} = (n-1) s_Y^2$
Regression Sum of Squares (Explained Variation): $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
Error Sum of Squares (Unexplained Variation): $SSE = \sum (Y_i - \hat{Y}_i)^2$
where: $\bar{Y}$ = Mean value of the dependent variable
$Y_i$ = Observed value of the dependent variable
$\hat{Y}_i$ = Predicted value of Y for the given Xi value
Coefficient of Determination, r²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called r-squared and is denoted as r²
$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
Values range from 0 to 1, i.e. 0 ≤ r² ≤ 1
r² = 1 Perfect match between the regression line and the data points
r² = 0 No linear relationship between X and Y
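The decomposition of total variation and the two equivalent forms of r² can be verified numerically. The following sketch (illustrative variable names, real estate data from the earlier example) computes SST, SSR and SSE directly from their definitions.

```python
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

n = len(square_feet)
x_bar, y_bar = sum(square_feet) / n, sum(house_price) / n
s_xx = sum((x - x_bar) ** 2 for x in square_feet)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in house_price)              # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)                   # explained variation
sse = sum((y - f) ** 2 for y, f in zip(house_price, fitted))  # unexplained variation

r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 32600.5 18934.9348 13665.5652
print(round(r_squared, 4), round(1 - sse / sst, 4))  # 0.5808 0.5808
```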
Coefficient of Determination, r² (cont’d)
r² = 1: Perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
0 < r² < 1: Weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X.
r² = 0: No linear relationship between X and Y; the value of Y does not depend on X. (None of the variation in Y is explained by variation in X.)
Example: Real Estate Agent (Revisit)
Continuing the real estate example (selling price in USD1,000 against size in square feet for the 10 sampled houses):
f) Find and interpret the standard error of the estimate.
g) Find and interpret the coefficient of determination.
h) Based on the results in parts (f) and (g), comment on whether the predictions made by this regression model are useful.
Square Feet (X) 1400 1600 1700 1875 1100 1550 2350 2450 1425 1700
House Price (Y) 245 312 279 308 199 219 405 324 319 255
Answer: (Revisit)
Recap of part b): n = 10, $\bar{X} = 1715$, $\bar{Y} = 286.5$, $S_{XX} = 1571500$, $S_{XY} = 172500$, $b_1 = 0.1098$, $b_0 = 98.2483$, so $\hat{Y} = 98.2483 + 0.1098 X$.
f) $SSE = S_{YY} - \dfrac{S_{XY}^2}{S_{XX}} = 32600.5 - \dfrac{172500^2}{1571500} = 13665.5652$
$s = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{13665.5652}{10-2}} = 41.3303$
The standard error of the estimate is moderately small relative to house prices in the $200K – $400K range.
g) $r^2 = 1 - \dfrac{SSE}{SST} = 1 - \dfrac{SSE}{S_{YY}} = 1 - \dfrac{13665.5652}{32600.5} = 0.5808$
58.08% of the variation in house prices is explained by the variation in square feet.
h) With a small standard error of estimate and a moderately strong positive relationship
between the two variables, the regression model is useful in predicting Y.
Regression Output in Computer
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
Inferences about the Slope
The standard error of the estimate for the slope coefficient of the regression line is:
$s_{b_1} = \dfrac{s}{\sqrt{S_{XX}}}$
where $s = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n-2}}$
Confidence Interval for population slope β1
$b_1 \pm t_{\alpha/2} \dfrac{s}{\sqrt{S_{XX}}}$
d.f. = n – 2
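A possible sketch of the slope confidence interval in Python is shown below. The function name `slope_confidence_interval` is hypothetical, and the critical value is passed in by the caller rather than computed; 2.306 is the standard t-table value for 8 degrees of freedom at α/2 = 0.025. The data are the ten houses from the real estate example.

```python
import math

def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) limits of b1 +/- t_crit * s / sqrt(S_XX)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    sse = s_yy - s_xy ** 2 / s_xx
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s / math.sqrt(s_xx)     # standard error of the slope
    return b1 - t_crit * s_b1, b1 + t_crit * s_b1

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

# t_crit = 2.306 is looked up in a t table (d.f. = 10 - 2 = 8, alpha/2 = 0.025)
lower, upper = slope_confidence_interval(square_feet, house_price, t_crit=2.306)
print(round(lower, 4), round(upper, 4))  # 0.0337 0.1858
```

The limits agree with the Lower 95% and Upper 95% columns of the Square Feet row in the Excel output.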
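t-test for the population slope β1
Hypotheses
$H_0: \beta_1 = \beta_{10}$ vs $H_1: \beta_1 \neq \beta_{10}$, $H_1: \beta_1 > \beta_{10}$, or $H_1: \beta_1 < \beta_{10}$
Test Statistic
$t_{obs} = \dfrac{b_1 - \beta_{10}}{s_{b_1}} = \dfrac{b_1 - \beta_{10}}{s / \sqrt{S_{XX}}}$
where $\beta_{10}$ is the hypothetical value
d.f. = n – 2
For $H_1: \beta_1 > \beta_{10}$, reject H0 if $t_{obs} > t_{\alpha}$
For $H_1: \beta_1 < \beta_{10}$, reject H0 if $t_{obs} < -t_{\alpha}$
For $H_1: \beta_1 \neq \beta_{10}$, reject H0 if $t_{obs} > t_{\alpha/2}$ or $t_{obs} < -t_{\alpha/2}$
In each case, equivalently reject H0 if p-value ≤ α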
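The test statistic can be sketched as a small Python function. The name `slope_t_statistic` and its `beta_10` parameter are illustrative assumptions; the critical value 2.306 is the standard two-sided t-table value for 8 degrees of freedom at α = 0.05 and is supplied by hand. With a hypothesised slope of zero, the result should reproduce the t Stat shown for Square Feet in the Excel output.

```python
import math

def slope_t_statistic(x, y, beta_10=0.0):
    """Return t_obs = (b1 - beta_10) / (s / sqrt(S_XX))."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
    return (b1 - beta_10) / (s / math.sqrt(s_xx))

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

t_obs = slope_t_statistic(square_feet, house_price, beta_10=0.0)
t_crit = 2.306  # t table, d.f. = 8, alpha/2 = 0.025 (two-sided test)
reject_h0 = abs(t_obs) > t_crit

print(round(t_obs, 4))  # 3.3294
print(reject_h0)        # True
```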
Inferences about the Slope (cont’d)
To test the significance of a linear regression model, one can set up a test of:
$H_0: \beta_1 = 0$ (no linear relationship) vs $H_1: \beta_1 \neq 0$ (linear relationship exists)
This test is included in the regression output of Excel and SPSS (and many other statistical software packages)
If the regression assumption of linearity holds, we should reject $H_0: \beta_1 = 0$ at the significance level α
Example: Real Estate Agent (Revisit)
Continuing the real estate example (selling price in USD1,000 against size in square feet for the 10 sampled houses):
i) Construct a 95% confidence interval for the slope coefficient.
j) Based on the interval in i), suggest whether the regression model is significant.
k) At α = 0.05, test whether every 20 extra square feet of a house can induce an average
increase of the house price by USD800 or more.
Square Feet (X) 1400 1600 1700 1875 1100 1550 2350 2450 1425 1700
House Price (Y) 245 312 279 308 199 219 405 324 319 255
Answer: (Revisit)
Recap of part b): n = 10, $\bar{X} = 1715$, $\bar{Y} = 286.5$, $S_{XX} = 1571500$, $S_{XY} = 172500$, $b_1 = 0.1098$, $b_0 = 98.2483$, so $\hat{Y} = 98.2483 + 0.1098 X$.
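i) A 95% confidence interval for β1 is
$b_1 \pm t_{\alpha/2} \dfrac{s}{\sqrt{S_{XX}}} = 0.1098 \pm 2.306 \times \dfrac{41.3303}{\sqrt{1571500}} = (0.0337,\ 0.1858)$
We are 95% confident that the average impact on house price is between USD33.7 and USD185.8 per square foot of house size.
[Regression output annotations: $b_1$, $s_{b_1}$, $t = (b_1 - 0)/s_{b_1}$, p-value]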
j) The confidence interval for β1 lies entirely above zero, which gives enough evidence to reject H0: β1 = 0 and to conclude that the regression model is significant.
k) It is required to test
H0: 20β1 = 0.8 (= USD800/USD1,000) vs H1: 20β1 > 0.8
or equivalently
H0: β1 = 0.04 vs H1: β1 > 0.04
Test statistic
$t_{obs} = \dfrac{b_1 - \beta_{10}}{s / \sqrt{S_{XX}}} = \dfrac{0.1098 - 0.04}{41.3303 / \sqrt{1571500}} = 2.1161 > t_{0.05;8} = 1.860$, so reject $H_0$
There is sufficient evidence at the 5% level that every 20 extra square feet can induce an average increase of the house price by USD800 or more.
Inference about Correlation Coefficient
Hypotheses to test the population correlation coefficient ρ
H0: ρ = 0 (no correlation between X and Y)
H1: ρ ≠ 0 (correlation exists)
Test statistic
$t_{obs} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$, with ρ = 0 under H0
where it takes n-2 degrees of freedom
and $r = +\sqrt{r^2}$ if $b_1 > 0$, $r = -\sqrt{r^2}$ if $b_1 < 0$
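In simple linear regression, the t statistic for testing ρ = 0 coincides with the t statistic for testing β1 = 0. The sketch below (illustrative variable names, real estate data from the example) computes both so the agreement can be seen numerically.

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

n = len(square_feet)
x_bar, y_bar = sum(square_feet) / n, sum(house_price) / n
s_xx = sum((x - x_bar) ** 2 for x in square_feet)
s_yy = sum((y - y_bar) ** 2 for y in house_price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, house_price))

# Statistic based on the sample correlation coefficient (rho = 0 under H0)
r = s_xy / math.sqrt(s_xx * s_yy)
t_from_r = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

# Statistic based on the slope (beta_1 = 0 under H0)
b1 = s_xy / s_xx
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
t_from_slope = (b1 - 0) / (s / math.sqrt(s_xx))

print(round(t_from_r, 4), round(t_from_slope, 4))  # 3.3294 3.3294
```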
Example: Real Estate Agent (Revisit)
Continuing the real estate example (selling price in USD1,000 against size in square feet for the 10 sampled houses):
l) At α = 0.05, test ρ for a linear relationship between square feet and house price.
m) Construct and interpret a 95% confidence interval for the mean price of houses of size
2000 square feet.
n) Construct and interpret a 95% prediction interval for the price of a house of size 2000
square feet.
Square Feet (X) 1400 1600 1700 1875 1100 1550 2350 2450 1425 1700
House Price (Y) 245 312 279 308 199 219 405 324 319 255
l) H0: ρ = 0 vs H1: ρ ≠ 0
Test statistic
$t_{obs} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{0.7621 - 0}{\sqrt{\dfrac{1 - 0.7621^2}{10 - 2}}} = 3.3294 > t_{0.025} = 2.306$, so reject $H_0$
There is sufficient evidence of a linear association at the 5% level of significance.
What is predicted?
Before using the regression model, we need to assess how well the model fits
the data
If we are satisfied with how well the model fits the data, we can use it to
predict...
Population mean response E(Y | X=xp)
Individual response Y | X=xp
The point on the regression line corresponding to a particular value xp of the independent variable X is
$\hat{Y} = b_0 + b_1 x_p$
It is unlikely that this value will exactly equal the mean value of Y when X equals xp, i.e. E(Y | X=xp)
Therefore, we need to place bounds on how far the predicted value might be from the actual value
We can do this by calculating a confidence interval for the mean value of Y, E(Y | X=xp), and a prediction interval for an individual value of Y | X=xp
Confidence Interval and Prediction Interval
Both the confidence interval for the mean value of Y and the prediction interval for an individual value of Y employ a quantity called the distance value
The distance value for a particular value xp of X is
$\dfrac{1}{n} + \dfrac{(\bar{X} - x_p)^2}{S_{XX}}$
The distance value is a measure of the distance between the value xp of X and $\bar{X}$
Notice that the farther xp is from $\bar{X}$, the larger the distance value
Goal: to form intervals around Y to express the uncertainty about the value of Y for a
given xp
Confidence Interval for E(Y | X=xp)
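A (1 – α)*100% confidence interval for the population mean response E(Y | X=xp) is given as:
$\hat{Y} \pm t_{\alpha/2}\, s \sqrt{\text{Distance value}} = \hat{Y} \pm t_{\alpha/2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(\bar{X} - x_p)^2}{S_{XX}}}$
A (1 – α)*100% prediction interval for an individual response Y | X=xp is given as:
$\hat{Y} \pm t_{\alpha/2}\, s \sqrt{1 + \text{Distance value}} = \hat{Y} \pm t_{\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(\bar{X} - x_p)^2}{S_{XX}}}$
The extra term 1 under the square root adds to the interval width to reflect the added uncertainty for an individual case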
m) A 95% confidence interval for the mean price of houses of size 2000 square feet is
$\hat{Y} = 98.2483 + 0.1098 \times 2000 = 317.7838$
$\hat{Y} \pm t_{\alpha/2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(\bar{X} - x_p)^2}{S_{XX}}} = 317.7838 \pm 2.306 \times 41.3303 \sqrt{\dfrac{1}{10} + \dfrac{(1715 - 2000)^2}{1571500}} = (280.6644,\ 354.9032)$
We are 95% confident that the mean price of houses of size 2000 square feet falls between USD280,664 and USD354,903.
n) A 95% prediction interval for the price of a house of size 2000 square feet is
$\hat{Y} = 98.2483 + 0.1098 \times 2000 = 317.7838$
$\hat{Y} \pm t_{\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(\bar{X} - x_p)^2}{S_{XX}}} = 317.7838 \pm 2.306 \times 41.3303 \sqrt{1 + \dfrac{1}{10} + \dfrac{(1715 - 2000)^2}{1571500}} = (215.5027,\ 420.0649)$
We are 95% confident that the price of a house of size 2000 square feet falls between USD215,503 and USD420,065.
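The intervals in parts m) and n) can be reproduced with a short Python sketch. The helper name `intervals_at` is hypothetical, and the critical value 2.306 (t table, 8 degrees of freedom, α/2 = 0.025) is passed in rather than computed.

```python
import math

def intervals_at(x, y, x_p, t_crit):
    """Return (y_hat, confidence interval, prediction interval) at X = x_p."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
    y_hat = b0 + b1 * x_p
    distance = 1 / n + (x_bar - x_p) ** 2 / s_xx       # distance value
    ci_half = t_crit * s * math.sqrt(distance)         # mean response
    pi_half = t_crit * s * math.sqrt(1 + distance)     # individual response
    return (y_hat,
            (y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # USD1,000

y_hat, ci, pi = intervals_at(square_feet, house_price, x_p=2000, t_crit=2.306)
print(f"{y_hat:.2f}")                # 317.78
print(f"{ci[0]:.2f} {ci[1]:.2f}")    # 280.66 354.90
print(f"{pi[0]:.2f} {pi[1]:.2f}")    # 215.50 420.06
```

The prediction interval is always wider than the confidence interval at the same xp, because of the extra 1 under the square root.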
Remark 1: Causation
A statistically significant regression of Y on X need not imply a causal relationship
between the two variables
A non-significant linear regression need not imply the lack of a causal relationship if
the causal relationship is non-linear
Remark 2: Small Samples
Significant regressions can be obtained by chance, i.e. even when no (linear) causal
relationship exists
This is especially true if sample sizes are small
Remark 3: Extrapolation and Interpolation
Be careful when predictions lie outside the range of the sample (extrapolation)
Be careful when predictions are for values where data are sparse (interpolation)