CIS 315
WEEK 9
MARCH 8, 2021
Data Analytics
Chapter 3:
Data Visualization
Chapter 2:
Descriptive Analytics
Chapter 7:
Regression
Chapter 8:
Time Series Analysis &
Forecasting
Chapter 12, 13, & 14:
Optimization & Prescriptive
Analytics
Experimental Design
Chapter 11:
Simulation
Today’s Agenda
• Experimental Design
• Sampling
Sampling
• So far you have been given
observational data
• Little to no control over variables
• Merely observe their values
• For example, Age, Income,
etc…
Sampling
• But if you designed an experiment…
• Then you could control one or
more variables
• And observe their effect
Sampling
• Experiment
• Apply treatments to experimental
units (such as people, animals,
land, etc.) and then observe the
effect of the treatments on the
experimental units
Sampling
• Observational Studies
• Observe subjects and measure
variables of interest without
assigning treatments to subjects.
Sampling
Suppose you want to study the effect of smoking on
lung capacity in women
Experiment
How would you design an experiment?
Observational Study
How would you design an observational study?
Sampling
Suppose you want to study the effect of smoking on
lung capacity in women
Experiment
Find 100 women, age 20, who do not currently
smoke
Randomly assign half (50) to the smoking treatment
and the other half to the no smoking treatment
Those in the smoking treatment should smoke a
pack a day for 10 years, while those in the no
smoking treatment should remain smoke free for 10
years.
After 10 years, measure lung capacity for each of
the 100 women
Analyze, interpret, and draw conclusions from the
data.
Observational Study
Find 100 women, age 30, of whom 50 have
been smoking a pack a day for 10 years while
the other 50 have remained smoke free for those
10 years
Measure lung capacity for each of the 100
women
Analyze, interpret, and draw conclusions from
the data.
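The random assignment step of the experiment can be sketched in a few lines of Python. This is only an illustration: the subject identifiers, the function name assign_treatments, and the seed are assumptions made for the example, not part of the course material.

```python
import random

def assign_treatments(units, labels=("smoking", "no smoking"), seed=None):
    """Randomly split the experimental units into two equal-sized groups."""
    rng = random.Random(seed)      # the seed only makes this sketch repeatable
    shuffled = list(units)
    rng.shuffle(shuffled)          # a random order removes any selection pattern
    half = len(shuffled) // 2
    return {labels[0]: shuffled[:half], labels[1]: shuffled[half:]}

# Hypothetical identifiers for the 100 women in the example.
women = [f"subject_{i:03d}" for i in range(1, 101)]
groups = assign_treatments(women, seed=315)
print({label: len(members) for label, members in groups.items()})
```

In the observational study there is no such step: each woman already belongs to a group before the study begins.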
Sampling
An economist obtains the unemployment rate and gross
state product for a sample of states over the past 10
years, with the objective of examining the relationship
between the unemployment rate and the gross state
product by census region.
Experiment or Observational Study?
Observational Study
Sampling
A psychologist tests the effect of three different
feedback programs by randomly assigning five rats to
each program and recording their response times at
specified intervals during the program.
Experiment or Observational Study?
Experiment
Sampling
assigned to the experimental unit
Random Experiment
Is this a good choice?
Sampling
We want to test the null hypothesis that the
treatment means are all equal against the
alternative that at least two differ
The objective of a randomized design is
usually to compare the treatment means
\( H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k \)
\( H_a: \) at least two treatment means differ
Sampling
For example, suppose you randomly selected five males and
five females and looked at their SAT scores.
[Figure: dot plot of Female and Male SAT scores on an axis from 450 to 650]
Female Average: 550
Male Average: 590
Can we conclude that there is a difference in test
scores between Females and Males?
No, because the difference between the means is
small relative to the sampling variability
Sampling
For example, suppose you randomly selected five males and
five females and looked at their SAT scores.
Female Average: 550
Male Average: 590
Can we conclude that there is a difference in test
scores between Females and Males?
Probably, as the difference in the means is
large relative to the sampling variability
[Figure: dot plot of Female and Male SAT scores on an axis from 450 to 650]
Sampling
The key to sampling is to compare the difference between
the treatment means with the amount of sampling variability
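SST = Sum of Squares for Treatments
\( SST = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \)
where \( n_i \) is the sample size of the ith treatment, \( \bar{x}_i \) is the mean of the ith treatment and \( \bar{x} \) is the mean of the overall sample
SSE = Sum of Squares for Error
\( SSE = \sum_{j=1}^{n_1}(x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_{2j} - \bar{x}_2)^2 + \cdots + \sum_{j=1}^{n_k}(x_{kj} - \bar{x}_k)^2 \)
Looks complicated, but we can rewrite to…
\( SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2 \)
where \( s^2 \) is the sample variance, \( s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \)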
Sampling
For example, suppose you randomly selected five males and
five females and looked at their SAT scores.
[Figure: dot plot of Female and Male SAT scores on an axis from 450 to 650]
\( SST = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 = 5(550 - 570)^2 + 5(590 - 570)^2 = 4000 \)
\( SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2 = (5 - 1)(2250) + (5 - 1)(2250) = 18000 \)
Don’t worry about calculating these right now…
But what we are really after is the MST and MSE…
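As a rough check of the two sums of squares, the sketch below recomputes them from the summary statistics in the SAT example (group sizes of 5, means of 550 and 590, sample variances of 2250). The function names are made up for this illustration.

```python
def sst_from_summary(sizes, means):
    """Sum of Squares for Treatments from group sizes and group means."""
    n = sum(sizes)
    overall_mean = sum(size * mean for size, mean in zip(sizes, means)) / n
    return sum(size * (mean - overall_mean) ** 2 for size, mean in zip(sizes, means))

def sse_from_summary(sizes, variances):
    """Sum of Squares for Error from group sizes and sample variances."""
    return sum((size - 1) * var for size, var in zip(sizes, variances))

# Values taken from the slide: five females (mean 550), five males (mean 590).
sizes = [5, 5]
means = [550, 590]
variances = [2250, 2250]

sst = sst_from_summary(sizes, means)      # 5(550-570)^2 + 5(590-570)^2
sse = sse_from_summary(sizes, variances)  # (5-1)(2250) + (5-1)(2250)
print(sst, sse)
```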
Sampling
MST = Mean Square for Treatments
(measures the variability among the treatment means)
\( MST = \frac{SST}{k - 1} = \frac{4000}{2 - 1} = 4000 \), where k-1 is the degrees of freedom
MSE = Mean Square for Error
(measures the variability within the treatments)
\( MSE = \frac{SSE}{n - k} = \frac{18000}{10 - 2} = 2250 \)
Sampling
Use the SST, SSE, MST, MSE => F-Statistic
\( F\text{-statistic} = \frac{MST}{MSE} \)
\( F\text{-statistic} = \frac{4000}{2250} = 1.78 \)
The F-statistic is used to test whether the means of the treatment
groups are equal (H0) or different (Ha)
Can we reject the null hypothesis that the means of the
treatments are equal?
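The chain from sums of squares to the F-statistic can be written as one small function. This is a sketch using the numbers from the SAT example (SST = 4000, SSE = 18000, n = 10 observations, k = 2 treatments); the function name is an assumption for illustration.

```python
def one_way_f(sst, sse, n, k):
    """Return (MST, MSE, F) for a completely randomized design."""
    mst = sst / (k - 1)   # variability among the treatment means
    mse = sse / (n - k)   # variability within the treatments
    return mst, mse, mst / mse

mst, mse, f_stat = one_way_f(sst=4000, sse=18000, n=10, k=2)
print(f"MST = {mst:.0f}, MSE = {mse:.0f}, F = {f_stat:.2f}")
```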
Sampling
This graph will change based on the degrees of freedom in
the numerator and denominator, but what you want is that
your F-statistic is larger than the value of F at the
designated level of significance
Sampling
table like this with the degrees
of freedom of the numerator in
the columns and the degrees of
freedom of the denominator in
the rows for a certain level of
significance… but now we
have technology that will give
you these numbers…
Sampling
Let’s choose level of significance of α=0.05. For our
example, the cut-off value of F0.05 is 5.32
Our F-statistic was 1.78
Can we reject the null hypothesis that means of the
treatments are equal?
NO!
Our F-stat = 1.78 < 5.32 = F0.05
We fail to reject the null hypothesis that the means are equal.
Sampling
Suppose you randomly selected five males and five females
and looked at their SAT scores.
[Figure: dot plot of Female and Male SAT scores on an axis from 450 to 650, with less spread within each group]
Let's do the same as before, but with this data.
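Sampling
\( SSE = (5 - 1)(62.5) + (5 - 1)(62.5) = 500 \)
\( MSE = \frac{SSE}{n - k} = \frac{500}{10 - 2} = 62.5 \)
\( MST = 4000 \) (the same as in the previous example)
\( F\text{-statistic} = \frac{MST}{MSE} = \frac{4000}{62.5} = 64.0 \)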
Can we reject the null hypothesis that means of the
treatments are equal?
YES!
Our F-stat = 64.0 > 5.32 = F0.05
We can reject the null hypothesis that the means are equal.
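The two SAT data sets share the same MST but have different MSE values, so the decision flips. The sketch below applies the same decision rule to both, taking the cut-off F0.05 = 5.32 from the slides as a given input rather than computing it; the helper name is hypothetical.

```python
F_CRITICAL = 5.32  # cut-off value of F0.05 quoted in the slides for this example

def reject_equal_means(f_stat, f_critical):
    """Reject H0 (equal treatment means) when F exceeds the critical value."""
    return f_stat > f_critical

mst = 4000                      # same treatment means in both data sets
f_wide = mst / 2250             # first data set: large spread within groups
f_narrow = mst / 62.5           # second data set: small spread within groups

for label, f_stat in [("first data set", f_wide), ("second data set", f_narrow)]:
    decision = "reject H0" if reject_equal_means(f_stat, F_CRITICAL) else "fail to reject H0"
    print(f"{label}: F = {f_stat:.2f}, {decision}")
```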
Sampling
Since we rejected the null hypothesis that the means are equal,
we can conclude that the SAT mean score of males differs
from that of females.
[Figure: dot plot of Female and Male SAT scores on an axis from 450 to 650]
Sampling
This type of analysis is called ANOVA or Analysis of Variance
Source       df       SS                      MS                   F
Treatments   k − 1    SST                     MST = SST/(k − 1)    F = MST/MSE
Error        n − k    SSE                     MSE = SSE/(n − k)
Total        n − 1    SS(Total) = SST + SSE
Sampling
Total Sum of Squares
SS(Total)
df=n-1
Sum of Squares for Treatments
SST
df=k-1
Sum of Squares for Error
SSE
df=n-k
Sampling
Example: Find the F-statistic and determine if we can reject
the null hypothesis of the means being equal at 0.10 level of
significance (F0.10=2.87)
Source       df           SS        MS                  F
Treatments   k − 1 = 3    2794.39   MST = SST/(k − 1)   F = MST/MSE
Error        n − k = 36   762.30    MSE = SSE/(n − k)
Total        n − 1 = 39   3556.69
Based on the table, tell me something about the experiment…
k=4, so we are comparing 4
different things
n=40, so we have 40 observations
Sampling
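Source       df           SS        MS                  F
Treatments   k − 1 = 3    2794.39   MST = SST/(k − 1)   F = MST/MSE
Error        n − k = 36   762.30    MSE = SSE/(n − k)
Total        n − 1 = 39   3556.69
\( MST = \frac{SST}{k - 1} = \frac{2794.39}{4 - 1} = 931.46 \)
\( MSE = \frac{SSE}{n - k} = \frac{762.30}{40 - 4} = 21.18 \)
\( F = \frac{MST}{MSE} = \frac{931.46}{21.18} = 43.99 \)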
Sampling
F = 931.46/21.18 = 43.99 > F0.10 = 2.87
Can we reject the null hypothesis that the means of the
treatments are equal at 0.10 level of significance?
Yes!
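The missing cells of this ANOVA table can be filled in directly from the df and SS columns. The following sketch uses the values given in the example and treats the critical value 2.87 as a given input; the variable names are chosen for illustration only.

```python
# Values given in the example table.
sst, sse, ss_total = 2794.39, 762.30, 3556.69
df_treatments, df_error = 3, 36
f_critical = 2.87          # critical value stated in the example

k = df_treatments + 1      # number of treatments being compared
n = df_error + k           # total number of observations

mst = sst / df_treatments
mse = sse / df_error
f_stat = mst / mse

print(f"k = {k}, n = {n}")
print(f"MST = {mst:.2f}, MSE = {mse:.3f}, F = {f_stat:.2f}")
print("reject H0" if f_stat > f_critical else "fail to reject H0")
```

The sums of squares also add up as the table requires: SST + SSE equals SS(Total).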
Sampling Example
Robotics researchers investigated whether robots could be
trained to behave like ants in an ant colony. Robots were
trained and randomly assigned to “colonies” (i.e. groups)
consisting of 3, 6, 9, or 12 robots. The robots were assigned
the tasks of foraging for “food” and recruiting another robot
when they identified a resource-rich area. One goal of the
experiment was to compare the mean energy expended (per
robot) of the four different sizes of colonies.
Sampling Example
1. Experiment or Observational Study? If experiment, what
kind?
Randomized Experiment
2. Identify the treatments and the dependent variable.
Treatments: 3, 6, 9, 12 robots & Dependent Variable: Energy Expended
3. Set up the null and alternative hypotheses of the test.
\( H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 \), \( H_a: \) at least two treatment means differ
4. The following results were reported:
• F=7.70
• numerator df=3, denominator df=56
• F0.05=2.76
Interpret the results.
F = 7.70 > F0.05 = 2.76, so reject H0 and conclude that the means
differ for the robot treatments
Sampling Example
We rejected H0 that all the means are equal for the robot
treatments, but that doesn’t tell you anything about the
difference between each pair of treatments
Now we want to test \( \mu_1 = \mu_2 \), \( \mu_1 = \mu_3 \), \( \mu_1 = \mu_4 \), \( \mu_2 = \mu_3 \), \( \mu_2 = \mu_4 \), \( \mu_3 = \mu_4 \)
Essentially testing if the mean of the treatment with
3 robots is the same as the mean of the treatment
with 6 robots, etc…
When you have equal treatment sample sizes you
use the Tukey Method
You don’t want to do this by hand, instead have
the computer calculate the t-statistic and reject
that the means are equal if the absolute value of the
t-statistic is larger than the t-statistic critical value (same
procedure as when we used the F-statistic)
Sampling Example
A=3 robots, B=6 robots, C=9 robots, D=12 robots
Pair   Mean 1   Mean 2   Variance 1   Variance 2   Observations (each)   df   t Stat   P(T<=t) one-tail   t Critical one-tail
AB     250.78   261.06   22.42        14.95        10                    18   -5.32    0.00               1.33
AC     250.78   269.95   22.42        20.26        10                    18   -9.28    0.00               1.33
AD     250.78   249.32   22.42        27.07        10                    18   0.66     0.26               1.33
BC     261.06   269.95   14.95        20.26        10                    18   -4.74    0.00               1.33
BD     261.06   249.32   14.95        27.07        10                    18   5.73     0.00               1.33
CD     269.95   249.32   20.26        27.07        10                    18   9.48     0.00               1.33
Mean 1 and Mean 2 are the means of the two samples in each pair
For AB: 5.32 > 1.33, so we can reject the
hypothesis that the mean of A
is the same as the mean of B
Do this for every
combination.
For which combination can you
not reject the hypothesis that
the means are the same?
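The t Stat row can be reproduced from the means and variances in the table. The sketch below assumes a pooled-variance two-sample t statistic, which is consistent with the table's df of 18 for two samples of 10; it is a plain pairwise t comparison and not an implementation of the Tukey procedure itself. The critical value 1.33 is taken from the table.

```python
from itertools import combinations
from math import sqrt

# (mean, variance) per colony size, copied from the table; 10 observations each.
colonies = {"A": (250.78, 22.42), "B": (261.06, 14.95),
            "C": (269.95, 20.26), "D": (249.32, 27.07)}
N = 10
T_CRITICAL = 1.33  # one-tail critical value reported in the table (df = 18)

def pooled_t(mean1, var1, mean2, var2, n1, n2):
    """Two-sample t statistic assuming equal population variances."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stats = {}
for first, second in combinations(colonies, 2):
    t = pooled_t(*colonies[first], *colonies[second], N, N)
    t_stats[first + second] = t
    verdict = "reject" if abs(t) > T_CRITICAL else "cannot reject"
    print(f"{first}{second}: t = {t:6.2f} -> {verdict} equal means")

# Pairs whose means cannot be distinguished at this critical value.
not_rejected = [pair for pair, t in t_stats.items() if abs(t) <= T_CRITICAL]
```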
Sampling
We can also introduce a better experimental design with
better controls to help account for variability
Take our SAT score example, what else could
we control for?
• School
• GPA
• Classes
• SES
Instead of selecting independent samples, we
choose experimental units (students in this
example) that are matched sets.
The matched sets are called blocks.
Sampling
Source       df              SS          MS                          F
Treatments   k − 1           SST         MST = SST/(k − 1)           F = MST/MSE
Blocks       b − 1           SSB         MSB = SSB/(b − 1)           F = MSB/MSE
Error        n − k − b + 1   SSE         MSE = SSE/(n − k − b + 1)
Total        n − 1           SS(Total)
Sampling
Randomized Block Design
Attempting to reduce the sampling variability
of the experimental units in each block, which
in turn reduces the MSE
Now again we compare SAT scores of male
and female high school seniors, but now we
select matched pairs of females and males
according to their GPA and school
Sampling
Block Female SAT Score Male SAT Score Block Mean
School A, 2.75 GPA 540 530 535
School B, 3.00 GPA 570 550 560
School C, 3.25 GPA 590 580 585
School D, 3.50 GPA 640 620 630
School E, 3.75 GPA 690 690 690
Treatment Mean 606 594
Sampling
Follow the same procedure but now with the blocks
Calculate the Sum of Squares for Treatments (SST), a measure of variation
between female and male means
\( SST = \sum_{i=1}^{k} b(\bar{x}_{T_i} - \bar{x})^2 \)
Square the distance between each treatment mean
and the overall mean, multiply each squared
distance by the number of measurements for the
treatment, and then sum over treatments
\( \bar{x}_{T_i} \) is the sample mean
for the ith treatment,
b is the number of
blocks, k is the
number of treatments
Sampling
\( SST = \sum_{i=1}^{k} b(\bar{x}_{T_i} - \bar{x})^2 = 5(606 - 600)^2 + 5(594 - 600)^2 = 360 \)
Block Female SAT Score Male SAT Score Block Mean
School A, 2.75 GPA 540 530 535
School B, 3.00 GPA 570 550 560
School C, 3.25 GPA 590 580 585
School D, 3.50 GPA 640 620 630
School E, 3.75 GPA 690 690 690
Treatment Mean 606 594
Here 5 is the number of blocks and 600 is the overall mean
Sampling
Now calculate the Sum of Squares for Blocks (SSB)
Measure of variation among the five block means
representing different schools and GPA
\( SSB = \sum_{i=1}^{b} k(\bar{x}_{B_i} - \bar{x})^2 \)
Square the difference between each
block mean and the overall mean, multiply each
squared difference by the number of measurements
for each block, and then sum over all blocks
\( \bar{x}_{B_i} \) is the sample mean
for the ith block, k is
the number of
treatments
Sampling
\( SSB = 2(535 - 600)^2 + 2(560 - 600)^2 + 2(585 - 600)^2 + 2(630 - 600)^2 + 2(690 - 600)^2 = 30100 \)
Block Female SAT Score Male SAT Score Block Mean
School A, 2.75 GPA 540 530 535
School B, 3.00 GPA 570 550 560
School C, 3.25 GPA 590 580 585
School D, 3.50 GPA 640 620 630
School E, 3.75 GPA 690 690 690
Treatment Mean 606 594
\( SSB = \sum_{i=1}^{b} k(\bar{x}_{B_i} - \bar{x})^2 \), where 2 is the number of treatments and 600 is the overall mean
Sampling
In a randomized block design, the sampling
variability is measured by subtracting the portion
attributed to treatments and blocks from the total
sum of squares, SS(Total)
\( SS(Total) = \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
Sampling
Block Female SAT Score Male SAT Score Block Mean
School A, 2.75 GPA 540 530 535
School B, 3.00 GPA 570 550 560
School C, 3.25 GPA 590 580 585
School D, 3.50 GPA 640 620 630
School E, 3.75 GPA 690 690 690
Treatment Mean 606 594
\( SS(Total) = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (540 - 600)^2 + (530 - 600)^2 + \cdots + (690 - 600)^2 = 30600 \), where 600 is the overall mean
Sampling
In a randomized block design, the sampling
variability is measured by subtracting the portion
attributed to treatments and blocks from the total
sum of squares, SS(Total)
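\( SSE = SS(Total) - SST - SSB = 30600 - 360 - 30100 = 140 \)
\( SS(Total) = SST + SSB + SSE \)
where SST is the Sum of Squares for Treatments, SSB is the Sum of Squares
for Blocks, and SSE is the Sum of Squares for Error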
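All four sums of squares for the randomized block design can be recomputed from the ten SAT scores in the table. The sketch below is one possible way to lay it out, with the scores listed in block order (School A through School E); the variable names are illustrative.

```python
# SAT scores from the table, one entry per block (School A .. School E).
female = [540, 570, 590, 640, 690]
male = [530, 550, 580, 620, 690]

treatments = [female, male]
k = len(treatments)          # number of treatments (2)
b = len(female)              # number of blocks (5)
overall_mean = sum(sum(scores) for scores in treatments) / (k * b)

# SST: b times the squared distance of each treatment mean from the overall mean.
sst = sum(b * (sum(scores) / b - overall_mean) ** 2 for scores in treatments)

# SSB: k times the squared distance of each block mean from the overall mean.
blocks = list(zip(*treatments))   # pairs (female score, male score) per block
ssb = sum(k * (sum(block) / k - overall_mean) ** 2 for block in blocks)

# SS(Total): squared distance of every observation from the overall mean.
ss_total = sum((x - overall_mean) ** 2 for scores in treatments for x in scores)

# SSE is what remains after removing the treatment and block portions.
sse = ss_total - sst - ssb
print(overall_mean, sst, ssb, ss_total, sse)
```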
Sampling
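Randomized Design
Total Sum of Squares, SS(Total), df=n-1, splits into:
• Sum of Squares for Treatments, SST, df=k-1
• Sum of Squares for Error, SSE, df=n-k
Randomized Block Design
Total Sum of Squares, SS(Total), df=n-1, splits into:
• Sum of Squares for Treatments, SST, df=k-1
• Sum of Squares for Blocks, SSB, df=b-1
• Sum of Squares for Error, SSE, df=n-b-k+1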
Sampling
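\( MST = \frac{SST}{k - 1} = \frac{360}{2 - 1} = 360 \)
\( MSE = \frac{SSE}{n - b - k + 1} = \frac{140}{10 - 5 - 2 + 1} = 35 \)
\( F\text{-statistic} = \frac{MST}{MSE} = \frac{360}{35} = 10.29 \)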
Sampling
Source       df                                   SS      MS                                                    F
Treatments   k − 1 = 2 − 1 = 1                    360     MST = SST/(k − 1) = 360/(2 − 1) = 360                 F = MST/MSE = 360/35 = 10.29
Blocks       b − 1 = 4                            30100   MSB = SSB/(b − 1) = 30100/4 = 7525                    F = MSB/MSE = 7525/35 = 215
Error        n − k − b + 1 = 10 − 2 − 5 + 1 = 4   140     MSE = SSE/(n − k − b + 1) = 140/(10 − 2 − 5 + 1) = 35
Total        n − 1 = 9                            30600
Use the Treatments F
to test
the difference in
the means of the
treatments
Use the Blocks F
to test
the difference in
the means of the
blocks
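The degrees of freedom, mean squares, and both F-statistics in this table follow mechanically from the sums of squares. A minimal sketch, using the SAT example's values (SST = 360, SSB = 30100, SSE = 140, k = 2, b = 5) and assuming one observation per treatment in each block:

```python
sst, ssb, sse = 360, 30100, 140
k, b = 2, 5
n = k * b                       # one observation per treatment in each block

df_treatments = k - 1
df_blocks = b - 1
df_error = n - k - b + 1
df_total = n - 1                # equals the sum of the three df values above

mst = sst / df_treatments
msb = ssb / df_blocks
mse = sse / df_error

f_treatments = mst / mse        # tests for a difference among treatment means
f_blocks = msb / mse            # tests for a difference among block means
print(f"df: {df_treatments}, {df_blocks}, {df_error}, total {df_total}")
print(f"F(treatments) = {f_treatments:.2f}, F(blocks) = {f_blocks:.0f}")
```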
Sampling
F0.05 = 7.71
F = 10.29 > F0.05 = 7.71
Can we reject the null hypothesis that the mean
SAT scores are the same for females and males?
YES! And we can conclude that the mean SAT
scores differ for females and males.
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design yielded the following results:
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design
yielded the following results:
1. How many blocks and treatments were used in this experiment?
3 blocks, 5 treatments
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design
yielded the following results:
2. How many observations were collected in the experiment?
15 observations
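The answers to questions 1 and 2 come from reading the df column backwards. The sketch below shows that reasoning with a hypothetical helper; the cross-check at the end assumes one observation per treatment in each block.

```python
def design_from_df(df_treatments, df_blocks, df_total):
    """Recover (treatments, blocks, observations) from the df column."""
    k = df_treatments + 1   # df for treatments is k - 1
    b = df_blocks + 1       # df for blocks is b - 1
    n = df_total + 1        # df for total is n - 1
    return k, b, n

k, b, n = design_from_df(df_treatments=4, df_blocks=2, df_total=14)
print(f"{b} blocks, {k} treatments, {n} observations")

# Cross-checks against the rest of the table.
df_error = n - k - b + 1    # should match the Error row
```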
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design
yielded the following results:
3. Specify the null and alternative hypotheses you would use to
compare the treatment means.
\( H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \), \( H_a: \) at least two treatment means differ
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design
yielded the following results:
4a. Which test statistic should you use to test the null hypothesis
regarding treatment means?
F-statistic = MST/MSE
4b. Which test statistic should you use to test the null hypothesis
regarding block means?
F-statistic = MSB/MSE
Sampling Example
df SS MS F
Treatments 4 501 125.25 9.11
Blocks 2 225 112.5 8.18
Error 8 110 13.75
Total 14 836
A randomized block design
yielded the following results:
5. Conduct the test for treatment means against F0.05=3.84 and
interpret the results.
F = 9.11 > F0.05 = 3.84
Reject H0 that the treatment means are equal
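A reported ANOVA table can be checked for internal consistency before it is interpreted. The sketch below rebuilds the MS and F columns of this example from its df and SS columns and then applies the test from question 5; the dictionary layout is just one convenient choice, and the critical value 3.84 is taken from the question.

```python
# df and SS columns as reported in the example table.
table = {
    "Treatments": {"df": 4, "ss": 501},
    "Blocks": {"df": 2, "ss": 225},
    "Error": {"df": 8, "ss": 110},
}
TOTAL_DF, TOTAL_SS = 14, 836
F_CRITICAL = 3.84  # F0.05 given in question 5

# The rows must add up to the Total row.
df_sum = sum(row["df"] for row in table.values())
ss_sum = sum(row["ss"] for row in table.values())

# Each mean square is the row's SS divided by its df.
ms = {name: row["ss"] / row["df"] for name, row in table.items()}
f_treatments = ms["Treatments"] / ms["Error"]
f_blocks = ms["Blocks"] / ms["Error"]

print(ms)
print(f"F(treatments) = {f_treatments:.2f}, F(blocks) = {f_blocks:.2f}")
print("reject H0" if f_treatments > F_CRITICAL else "fail to reject H0")
```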
Today’s Agenda
• Experimental Design
• Experiment vs Observational
• Random Sampling
• ANOVA
• Randomized Block Design
Next Class
LAB DAY
To Do List
Homework Assignment #6
Due 3/12/2021 by 12:00pm (noon)