Information Technology
FIT1006
Business Information Analysis
Week 4 – Correlation and Regression
Topics covered (Correlation) :
Bivariate data
The linear model
Calculating q and r by hand
Calculating r using MicroSoft Excel and SYSTAT
Interpreting q and r
Visual estimation of q and r
Motivating Question
In 1998, Choice magazine tested 1500 toothbrushes. A summary of price and functionality score is on the right.
Is the functionality of the toothbrush related to the price? (Selvanathan, 4th Ed, p 679)
Answers later …
Price Functionality
3.96 83
3.99 81
3.69 80
2.96 78
3.69 76
2.99 76
3.98 74
2.79 73
3.49 73
2.95 72
1.95 69
2.99 68
2.92 66
3.95 65
3.95 65
2.97 64
3.99 61
3.20 61
4.95 59
0.69 57
1.96 57
3.35 56
1.00 51
2.99 51
1.99 51
1.08 49
1.67 46
1.00 42
0.66 40
Motivating Question – Scatterplot
[Scatterplot: PRICE (0–5) on the x-axis vs FUNCTIONALITY (30–90) on the y-axis]
The Q-Correlation

q = ((N_A + N_C) − (N_B + N_D)) / (N_A + N_B + N_C + N_D)

[Quadrant diagram, split by the x and y medians: A upper-right, B upper-left, C lower-left, D lower-right]
Question 1
From the scatterplot on the RHS below, the q-correlation coefficient is:
A. + 0.4
B. – 0.3
C. – 0.2
D. – 0.4
E. None of the above.
[Scatterplot split by the x and y medians; quadrant counts: 3, 3, 2, 2]

q = ((2 + 2) − (3 + 3)) / (3 + 2 + 3 + 2) = −0.2, so the answer is C.
q-Correlation
To calculate q, find the horizontal and vertical medians and divide the data into four quadrants.
Count the number of observations in each quadrant. Do not count any observations lying on the median lines.
Calculate the q-correlation as follows:

q = ((N_A + N_C) − (N_B + N_D)) / (N_A + N_B + N_C + N_D)

where A is the upper-right quadrant, B the upper-left, C the lower-left and D the lower-right.
Note that q is robust to outliers.
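The counting rule above can be sketched in code. This is a minimal illustration written for these notes (the function name and quadrant handling are this sketch's assumptions, not lecture code), using the standard-library median:

```python
# Sketch of the q-correlation rule: split the data at the x and y medians,
# count points per quadrant, and skip any point lying on a median line.
from statistics import median

def q_correlation(points):
    """q = ((N_A + N_C) - (N_B + N_D)) / (N_A + N_B + N_C + N_D)."""
    mx = median(p[0] for p in points)
    my = median(p[1] for p in points)
    a = b = c = d = 0
    for x, y in points:
        if x == mx or y == my:
            continue  # do not count observations on the median lines
        if x > mx and y > my:
            a += 1  # upper-right
        elif x < mx and y > my:
            b += 1  # upper-left
        elif x < mx and y < my:
            c += 1  # lower-left
        else:
            d += 1  # lower-right
    return ((a + c) - (b + d)) / (a + b + c + d)

# Perfectly increasing data lands only in quadrants A and C, so q = +1:
print(q_correlation([(1, 1), (2, 2), (3, 3), (4, 4)]))  # 1.0
```

A perfectly decreasing set, e.g. `[(1, 4), (2, 3), (3, 2), (4, 1)]`, gives q = −1 by the same rule.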
Question 2
From the scatterplot on the RHS below, the q-correlation coefficient is:
A. + 0.7
B. + 1.0
C. – 0.1
D. + 0.1
E. None of the above.
[Scatterplot split by the x and y medians; quadrant counts: 4 and 3 in the two positive-association quadrants, 0 in the others. Do not count dots on the median lines.]

q = ((4 + 3) − (0 + 0)) / (0 + 4 + 3 + 0) = +1, so the answer is B.
Question 3
Which plot has (or plots have) a q-correlation closest to 0?
[Four scatterplots A – D, each split by the x and y medians]
Question 4
Which plot has a q-correlation closest to – 1?
[Four scatterplots A – D; one plot is annotated with q = +1]
Linear relationship
When we determine the degree of correlation between variables, we are assuming that the variables have a linear relationship.
For two variables x and y, we say that y = ax + b + e, where the errors e are random and Normally distributed.
This is sometimes written as e ~ N(0, σ²).
Pearson’s r
Pearson’s r is the most commonly used measure of correlation. s_xy is the covariance of x and y.
You should be able to calculate r if given the sum terms: Σx, Σy, Σx², Σy², Σxy, and n.

r = s_xy / (s_x s_y) = (Σxy − (Σx)(Σy)/n) / sqrt( (Σx² − (Σx)²/n) (Σy² − (Σy)²/n) )
Calculating r
Pearson’s r is built into Microsoft Excel, SYSTAT and quite possibly your calculator.
In Excel, use CORREL(RANGE1, RANGE2), or draw a scatter plot and fit a linear model.
In SYSTAT, you should be able to use the menu:
– Graph --> Plots --> Scatterplot
– Analyze --> Correlations --> Simple
For multivariate data, you should be able to use:
– Graph --> Scatterplot Matrix (SPLOM)
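As a check on the sum-term formula, here is a short sketch (written for these notes, not lecture code) that computes r for the toothbrush data from the motivating question; a later slide reports r = 0.664 for this data set:

```python
# Pearson's r for the toothbrush data via the sum-term formula:
# r = (Σxy − ΣxΣy/n) / sqrt((Σx² − (Σx)²/n)(Σy² − (Σy)²/n))
from math import sqrt

price = [3.96, 3.99, 3.69, 2.96, 3.69, 2.99, 3.98, 2.79, 3.49, 2.95,
         1.95, 2.99, 2.92, 3.95, 3.95, 2.97, 3.99, 3.20, 4.95, 0.69,
         1.96, 3.35, 1.00, 2.99, 1.99, 1.08, 1.67, 1.00, 0.66]
func = [83, 81, 80, 78, 76, 76, 74, 73, 73, 72, 69, 68, 66, 65, 65,
        64, 61, 61, 59, 57, 57, 56, 51, 51, 51, 49, 46, 42, 40]

n = len(price)
sx, sy = sum(price), sum(func)
sxx = sum(x * x for x in price)
syy = sum(y * y for y in func)
sxy = sum(x * y for x, y in zip(price, func))

r = (sxy - sx * sy / n) / sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
print(round(r, 3))  # expected to be close to the reported 0.664
```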
Question 6
Pearson’s r is an appropriate correlation measure for:
A. A – F.
B. B, C, D, F.
C. C, D, F.
D. C, B, D.
E. C, D.
[Six scatterplots labelled A – F]
Question 7
For which plot is r closest to 0?
[Four scatterplots A – D]
Question 8
If a data point moves as shown, which of the following is true?
A. r increases, q unchanged
B. r decreases, q unchanged
C. r increases, q increases
D. r decreases, q decreases
E. None of the above.
Estimating correlation
From: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Question 9
For the motivating problem, r is closest to:
A. 0.1
B. 0.3
C. 0.5
D. 0.7
E. 0.9
[Scatterplot: PRICE (0–5) vs FUNCTIONALITY (30–90)]
Motivating Question
[Scatterplot: PRICE (0–5) vs FUNCTIONALITY (30–90), with fitted trend line]
r = 0.664
Interpreting correlation
Some cautions:
Non-linear relationships will often have low correlation.
Bivariate data can be subject to outliers, which can distort (often decrease) the value of the correlation coefficient.
Correlation does not imply causation. Two variables might have a strong correlation but not be directly related. (They might, e.g., both be related to a third variable.)
We tend to use correlation comparatively – that is, one set of observations has a greater correlation than another set.
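The outlier caution can be seen numerically. This small illustration (constructed for these notes, with made-up data) shows a single stray point pulling Pearson's r well below 1 on otherwise perfectly linear data:

```python
# One outlier can drag Pearson's r down even when the rest of the data
# lies exactly on a line.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]           # perfectly linear: r = 1
print(round(pearson_r(xs, ys), 3))            # 1.0

xs_out, ys_out = xs + [6], ys + [0]           # one point far below the trend
print(round(pearson_r(xs_out, ys_out), 3))    # 0.143
```

The q-correlation, being based on medians and counts, is far less affected by such a point.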
Discussion: Multiple Plots
Data set XR15-19 is admissions data looking at success factors based on GPA (grade point average) over the first 3 years at university.
We have:
• HSC (Higher School Certificate) grades
• TAE (Tertiary admission score)
• ACT – hours/week on extra-curricular activities.
Which is the best predictor of GPA?
Discussion: SPLOM (Scatterplot Matrix)
[Scatterplot matrix of GPA, HSC, TAE and ACT]
Scatterplots over multiple variables
For enrichment: go to http://www.gapminder.org/
Scatterplots over multiple variables
Multivariate display shows: Income, Life expectancy,
Geographic region and Population over time.
The Iris Data
A famous data set – see Wikipedia. It compares the sepal width & length and petal width & length for 3 species of iris.
Studied by R. A. Fisher (1890–1962), the famous British statistician and geneticist, who is buried in Adelaide, Australia.
The Iris Data
[Scatterplot matrix of SEPALLEN, SEPALWID, PETALLEN, PETALWID and SPECIES]
Regression
The equation of the trend line is the other piece of important information we get from bivariate data. This is covered in the next lecture.
Reading/Questions (Selvanathan)
Reading:
– 7th Ed Sections 4.3, 5.5.
Questions:
– 7th Ed Questions 4.37, 4.38, 4.43, 4.44, 5.77, 5.81, 5.84,
5.85.
A short break …
Perhaps back in approximately 5 minutes …
Information Technology
FIT1006
Business Information Analysis
Lecture
Linear Regression
Topics covered:
Estimating the regression equation by eye.
Fitting a regression using Microsoft Excel and SYSTAT.
Measuring the goodness of fit.
Modelling with the regression equation.
Linear Regression
Regression is the practice of describing the (linear) relationship between two or more quantitative variables. Thus if we know the value of one variable, we can estimate the value of the related variable of interest.
Origin: the 19th-century scientist Francis Galton collected data on the heights of fathers and their sons. He found that tall fathers had slightly shorter sons and that short fathers had slightly taller sons. Thus in each case there was a regression (reversion) to the mean. Over time the details of the investigation have been largely forgotten, but the name has stuck to this method of modelling.
Motivating Question
In 1998, Choice magazine tested 1500 toothbrushes. A summary of price and functionality score is on the right.
What is the relationship between price and functionality?
How reliable is the model?
(Selvanathan, 4th Ed, p 679)
Answers later …
[Price / Functionality data table as before (n = 29)]
Motivating Question – Scatterplot
[Scatterplot: PRICE (0–5) vs FUNCTIONALITY (30–90)]
The underlying assumption
When we calculate the regression of y on x, we are assuming that the relationship between x and y is linear, so that y = ax + b + e, where the errors e are random and Normally distributed.
More precisely, y = ax + b + e with e ~ N(0, σ²).
We want to find the values of a, b and σ.
(Note: the textbook uses slightly different notation.)
The basic idea
We want to fit a line through the data that gives the most appropriate fit of the fitted model (line) for the data.
If we assume that the errors are Normally distributed, N(0, σ²), and that the best method of fitting is Maximum Likelihood Estimation (MLE), then we get least squares.
That is, we fit the line that minimises the sum of the squared errors, or differences, between the fitted model (line) and the data.
[Scatterplot of points with a fitted line]
The equation of a straight line
We can use the basic equation of a straight line as the model for our regression equation. A line with gradient ‘a’ and y-intercept ‘b’ has equation y = ax + b.
[Diagram: line crossing the y-axis at b; gradient = rise/run = a]
Least Squares Regression
Ordinary Least Squares (OLS) Regression minimises the sum of squared errors in the data.
The OLS regression of y on x as y = ax + b is:

a = s_xy / s_x² = (Σxy − (Σx)(Σy)/n) / (Σx² − (Σx)²/n)  and  b = ȳ − a x̄

(note: ȳ = Σy/n, etc.)
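The OLS formulas above can be sketched directly from the sum terms. This is a small helper written for these notes (the function name is this sketch's choice, not lecture code):

```python
# OLS fit from sum terms: a = (Σxy − ΣxΣy/n) / (Σx² − (Σx)²/n), b = ȳ − a·x̄
def ols_fit(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (sxy - sx * sy / n) / (sxx - sx**2 / n)
    b = sy / n - a * sx / n   # b = y-bar minus a times x-bar
    return a, b

# y = 2x exactly, so the fit recovers a = 2, b = 0:
print(ols_fit([1, 2, 3], [2, 4, 6]))  # (2.0, 0.0)
```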
Question 1: In which plot is ‘b’ greatest?
[Four plots A – D, each with axes through 0]
Assume: y = ax + b
Question 2a: In which plot is ‘a’ greatest?
[Four plots A – D, each with axes through 0]
Assume: y = ax + b
Question 2b: In which plot is ‘a’ greatest in magnitude?
[Four plots A – D, each with axes through 0]
Assume: y = ax + b
How good is the fit?
One measure of how well the regression model fits is the proportion of variation in y that is ‘explained’ by the regression equation.
[Diagram: fitted line showing explained variation and unexplained variation]
Coefficient of Determination
• The coefficient of determination is the proportion of variation in y that is ‘explained’ by variation in x through the regression equation.
• The coefficient of determination is r² – the square of Pearson’s correlation coefficient, r.
• Often, 100r² is calculated and the result expressed as a percentage.
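A quick numerical check of the definition, using the r reported elsewhere in these slides for the toothbrush data (r = 0.664; the SYSTAT output lists the matching Squared Multiple R as 0.441):

```python
# The coefficient of determination is simply r squared.
r = 0.664
r2 = r ** 2
print(round(r2, 3))        # 0.441
print(f"{100 * r2:.1f}%")  # the same value as a percentage
```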
Question 3: In which plot is r² closest to 1?
[Four plots A – D]

Question 4: In which plot is r² closest to 0?
[Four plots A – D]
Regression in Microsoft Excel
Regression is a built-in analysis function, or you can also calculate the formulas manually with:
– a = SLOPE(y values, x values)
– b = INTERCEPT(y values, x values)
Also,
– r = CORREL(y values, x values)
– r² = CORREL(y values, x values)^2
Regression is also a ‘Chart Tool’ if you first draw a scatter plot and then choose this option.
[Price (x) / Functionality (y) data table as before]
Regression in SYSTAT
SYSTAT calculates regression and gives a
diagnostic output of the fitted model.
• Select: Analyze --> Regression --> Linear -->
Least Squares
• The dependent variable is the one we’re trying to
predict.
• The independent variable is the one that is free to
change.
• The model and residuals can be saved to a data
file.
Regression by hand
Use the same terms you calculated for Pearson’s correlation: Σx, Σy, Σx², Σy², Σxy, and n.
Know how to calculate the OLS regression equation using your calculator:

a = (Σxy − (Σx)(Σy)/n) / (Σx² − (Σx)²/n)  and  b = ȳ − a x̄,  where ȳ = Σy/n, etc.
Motivating Question
Let’s try to fit the Line of Best Fit by eye:
[Scatterplot: PRICE (0–5) vs FUNCTIONALITY (30–90)]
Question 5: For the toothbrush problem, which assumption is true?
A. Price and Function are both independent.
B. Function is independent.
C. Price is dependent.
D. Price is independent.
The question is perhaps ambiguous: Price probably depends upon Functionality, but Price is given on the x-axis, where independent variables go.
[Scatterplot: PRICE (0–5) vs FUNCTIONALITY (30–90)]
SYSTAT Output (a) report
Dependent Variable ¦ FUNCTIONALITY
N ¦ 29
Multiple R ¦ 0.664                      <- Pearson’s r
Squared Multiple R ¦ 0.441              <- coefficient of determination, r²
Adjusted Squared Multiple R ¦ 0.421
Standard Error of Estimate ¦ 9.187

Regression Coefficients B = (X'X)^-1 X'Y
Effect   ¦ Coefficient  Standard Error  Std. Coefficient  Tolerance  t      p-Value
---------+-------------------------------------------------------------------------
CONSTANT ¦ 44.025       4.567           0.000             .          9.640  0.000
PRICE    ¦ 6.939        1.503           0.664             1.000      4.618  0.000

Y-intercept: b = 44.025; gradient: a = 6.939.
The least squares regression line is: Y = 6.939x + 44.025
SYSTAT Output (a) report
Analysis of Variance
Source     ¦ SS        df  Mean Squares  F-Ratio  p-Value
-----------+--------------------------------------------------
Regression ¦ 1,800.032  1  1,800.032     21.325   0.000
Residual   ¦ 2,279.003 27     84.408

Durbin-Watson D-Statistic ¦ 0.946
First Order Autocorrelation ¦ 0.482

Information Criteria
AIC ¦ 214.860
AIC (Corrected) ¦ 215.820
Schwarz's BIC ¦ 218.962

A little aside: Akaike’s Information Criterion (AIC), AIC (Corrected) and Schwarz’s Bayesian Information Criterion (BIC) are popular and relatively easy to calculate, but they are by no means the entire story when measuring goodness of fit.
SYSTAT Output (a) report
Tweaking the data by changing data point 19 from (4.95, 59) to (4.95, 39) results in a warning:
Analysis of Variance
Source     ¦ SS        df  Mean Squares  F-Ratio  p-Value
-----------+--------------------------------------------------
Regression ¦ 1,257.113  1  1,257.113     10.008   0.004
Residual   ¦ 3,391.577 27    125.614
*** WARNING *** :
Case 19 is an Outlier (Studentized Residual : -4.698)
Other warnings are for ‘leverage’ and large residuals.
Question 6
If the regression equation is Function = 44 + Price × 7, a toothbrush having a Price of $3 would have a ‘Function’ value of:
A. 44
B. 51
C. 54
D. 65
E. None of the above.

Function = 44 + 3 × 7 = 65, so the answer is D.
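Prediction with a fitted line is just substitution. A one-line sketch using the rounded equation from the question above (the helper name is this sketch's choice):

```python
# Predict 'Function' from the rounded regression equation on the slide:
# Function = 44 + Price * 7
def predict_function(price):
    return 44 + price * 7

print(predict_function(3))  # 65
```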
The Challenger Disaster
The Space Shuttle Challenger disaster occurred on January 28, 1986, when Space Shuttle Challenger broke apart 73 seconds into its flight, leading to the deaths of its seven crew members. (Text and images: Wikipedia)
A very sad story. The relevant data set of O-ring incidents prior to the flight – along with other data sets – is available from a number of places.
The Challenger Disaster
Data:
Flight No  Date      Temp F  Temp C  # Failures
 1   04-12-81  66  18.9  0
 2   11-12-81  70  21.1  1
 3   03-22-82  69  20.6  0
 4   06-27-82  80  26.7  *
 5   11-11-82  68  20.0  0
 6   04-04-83  67  19.4  0
 7   06-18-83  72  22.2  0
 8   08-30-83  73  22.8  0
 9   11-28-83  70  21.1  0
10   02-03-84  57  13.9  1
11   04-06-84  63  17.2  1
12   08-30-84  70  21.1  1
13   10-05-84  78  25.6  0
14   11-08-84  67  19.4  0
15   01-24-85  53  11.7  3
16   04-12-85  67  19.4  0
17   04-29-85  75  23.9  0
18   06-17-85  70  21.1  0
19   07-29-85  81  27.2  0
20   08-27-85  76  24.4  0
21   10-03-85  79  26.1  0
22   10-30-85  75  23.9  2
23   11-26-85  76  24.4  0
24   01-12-86  58  14.4  1
Temperature on launch: 31 °F (−0.6 °C)
From: http://wps.aw.com/wps/media/objects/15/15719/projects/ch5_challenger/index.html
The Challenger Disaster
[Scatterplot: temperature at launch vs # Failures, with fitted line y = −0.1091x + 2.7193]
Necessary Skills
Calculate the least squares regression by hand (using your calculator) for a small data set.
Interpret the basic SYSTAT output and comment on any data points that have a significant effect on the regression model.
Draw a scatterplot and superimpose the line of best fit.
Calculate Pearson’s r and r², and comment on the goodness of fit of the regression.
Assignment 1 – quick points
Issued on Moodle – deadline 28th of March 2024 (11:55pm)
Please carefully and thoroughly read instructions
Re-cap:
Topics covered (Correlation)
Bivariate data
The linear model
Calculating q and r by hand
Calculating r using Excel and SYSTAT
Interpreting q and r
Visual estimation of q and r
Re-cap:
Topics covered (Linear regression)
Estimating the regression equation by eye.
Fitting a regression using MicroSoft Excel and SYSTAT.
Measuring the goodness of fit.
Modelling with the regression equation.
Reading/Questions
Reading:
– 7th Ed Sections 15.1 - 15.4, 15.7, 16.1*, 16.2*.
– *Additional reading on multiple regression.
Questions:
– 7th Ed Questions 15.6, 15.7, 15.8, 15.10, 15.12, 15.14,
15.19, 15.21, 15.17, 15.63, 15.64.
– See week 5 Applied session 4