Date: 2024-10-06

Lecture 1
ECON 2100, FALL 2024
Overview
• Population vs Sample
• Methods of Sampling
• Types of Variables
• Data Visualization
• Descriptive Statistics
• Population Parameters
Population vs. Sample
POPULATION
A population contains all of the items or
individuals of interest that we seek to study.
SAMPLE
A sample contains only a portion of a
population of interest.
Population vs. Sample
Population: All the items or individuals about which we want to draw conclusion(s).
Sample: A portion of the population of items or individuals.
Say we wish to find out the fraction of students
who speak Spanish at RPI:
The entire student body at RPI is the population,
while the Math majors alone are a sample.
Lucio wants to know whether the food he serves in
his restaurant is within a safe range of temperatures.
He randomly selects 70 entrees and measures their
temperatures just before he serves them to his
customers. Identify the population and the sample:
a. The population is all of the hot entrees Lucio
serves; the sample is the entrees that are a safe
temperature.
b. The population is the 70 selected entrees; the
sample is the entrees that are a safe temperature.
c. The population is all of the entrees Lucio serves;
the sample is the 70 selected entrees.
Answer: c. The population is all of the entrees Lucio serves;
the sample is the 70 selected entrees.
Probability Sample: Simple Random
Sample
• Every individual or item from the frame has an equal chance
of being selected.
• Selection may be with replacement (selected individual is
returned to frame for possible reselection) or without
replacement (selected individual isn’t returned to the frame).
• Samples obtained from table of random numbers or
computer random number generators.
- We will see how to do this in R
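The lecture does this in R; as a language-neutral illustration, here is a minimal Python sketch of drawing a simple random sample with and without replacement. The frame of 850 numbered items, the sample size of 5, and the seed are assumptions chosen only for illustration.

```python
import random

# Hypothetical sampling frame: items numbered 1..850 (an assumed frame size).
frame = list(range(1, 851))

rng = random.Random(2100)  # arbitrary fixed seed so the sketch is reproducible

# Without replacement: a selected item is not returned to the frame,
# so it can appear at most once in the sample.
without_replacement = rng.sample(frame, k=5)

# With replacement: a selected item is returned to the frame
# and may be selected again.
with_replacement = rng.choices(frame, k=5)

print(without_replacement)
print(with_replacement)
```

In both cases every item in the frame has the same chance of being picked on each draw.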

Selecting a Simple Random Sample
Using Random Number Table
Sampling Frame For Population With 850 Items
Item Name | Item #
Bev R. | 001
Ulan X. | 002
. | .
. | .
Joann P. | 849
Paul F. | 850
Portion Of Random Number Table
49280 88924 35779 00283 81163 07275
11100 02340 12860 74697 96644 89439
09893 23997 20048 49420 88872 08401
The first 5 items in a simple
random sample
Item # 492
Item # 808
Item # 892 -- does not exist so ignore
Item # 435
Item # 779
Item # 002
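The selection above can be reproduced by reading the first row of the random number table three digits at a time. This is a small Python sketch; skipping repeated item numbers is an assumption that corresponds to sampling without replacement.

```python
# Digits copied from the first row of the random number table above.
row = "49280 88924 35779 00283 81163 07275"
frame_size = 850   # items are numbered 001..850
sample_size = 5

digits = row.replace(" ", "")
selected = []
for start in range(0, len(digits) - 2, 3):
    item = int(digits[start:start + 3])
    # Ignore numbers that are not in the frame (such as 892) and,
    # as an assumption, any repeats (sampling without replacement).
    if 1 <= item <= frame_size and item not in selected:
        selected.append(item)
    if len(selected) == sample_size:
        break

print(selected)  # [492, 808, 435, 779, 2]
```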
Probability Sample: Systematic Sample
• Decide on sample size: n
• Divide frame of N individuals into groups of k individuals: k=N/n
• Randomly select one individual from the 1st group.
• Select every kth individual thereafter.
Example: N = 40, n = 4, k = 10
[Figure: frame of N = 40 individuals with the first group of k = 10 marked.]
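The steps above can be sketched in Python using the slide's numbers (N = 40, n = 4, k = 10). The function name `systematic_sample` and the seed are hypothetical, and the sketch assumes N is a multiple of n.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick one random item from the first group of k = N // n items,
    then take every kth item after it."""
    k = len(frame) // n
    start = rng.randrange(k)      # random position within the first group
    return frame[start::k][:n]

frame = list(range(1, 41))        # N = 40 individuals, numbered 1..40
sample = systematic_sample(frame, n=4, rng=random.Random(0))
print(sample)
```

Whatever the random start, consecutive selections are always exactly k = 10 positions apart.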
Probability Sample: Stratified Sample
• Divide population into two or more subgroups (called strata) according to
some common characteristic.
• A simple random sample is selected from each subgroup, with sample sizes
proportional to strata sizes.
• Samples from subgroups are combined into one.
• This is a common technique when sampling population of voters, stratifying
across racial or socio-economic lines.
[Figure: population divided into 4 strata.]
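A minimal Python sketch of proportional stratified sampling follows. The strata (students grouped by class year), their sizes, and the function name are made up for illustration; with other sizes, rounding could make the total differ slightly from n.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw a simple random sample from each stratum, with sample sizes
    proportional to stratum sizes, and combine them into one sample."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        size = round(n * len(members) / total)   # proportional allocation
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical strata: student IDs grouped by class year.
strata = {
    "freshman": [f"F{i}" for i in range(400)],
    "sophomore": [f"S{i}" for i in range(300)],
    "junior": [f"J{i}" for i in range(200)],
    "senior": [f"R{i}" for i in range(100)],
}
sample = stratified_sample(strata, n=50, rng=random.Random(1))
print(len(sample))
```

With these sizes the allocation is 20, 15, 10, and 5 students, so every stratum is represented in the same proportion as in the population.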
Probability Sample: Cluster Sample
• Population is divided into several “clusters,” each representative of the population.
• A simple random sample of clusters is selected.
• All items in the selected clusters can be used, or items can be chosen from a cluster
using another probability sampling technique.
• A common application of cluster sampling involves election exit polls, where certain
election districts are selected and sampled.
[Figure: population divided into 16 clusters, with the randomly selected clusters forming the sample.]
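Cluster sampling can be sketched in Python as follows, covering both options from the bullets: using all items in the chosen clusters, or sampling again within them. The 16 districts of 25 voters each, the number of clusters chosen, and the seed are assumptions.

```python
import random

# Hypothetical population: 16 clusters (election districts) of 25 voters each.
clusters = {c: [f"district{c}-voter{v}" for v in range(25)] for c in range(16)}

rng = random.Random(7)
chosen = rng.sample(sorted(clusters), k=4)   # simple random sample of clusters

# Option 1: use every item in the selected clusters.
one_stage = [voter for c in chosen for voter in clusters[c]]

# Option 2: take a simple random sample of 5 items within each selected cluster.
two_stage = [voter for c in chosen for voter in rng.sample(clusters[c], k=5)]

print(chosen, len(one_stage), len(two_stage))
```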
Probability Sample:
Comparing Sampling Methods
Simple random sample and Systematic sample:
◦ Simple to use.
◦ May not be a good representation of the population’s
underlying characteristics.
Stratified sample:
◦ Ensures representation of individuals across the entire
population.
Cluster sample:
◦ More cost effective.
◦ Less efficient (need larger sample to acquire the same level of
precision).
Stratified Sampling vs. Cluster Sampling
Stratified Sampling:
◦ Researcher decides the criterion for division.
◦ Homogeneity within subgroups and heterogeneity between subgroups.
◦ Ex. Students at RPI are divided based on year/major, and then individuals are sampled from each subgroup.
Cluster Sampling:
◦ Natural division.
◦ Heterogeneity within subgroups and homogeneity between subgroups.
◦ Ex. Determine the proportion of students in the Capital Region who are science majors. Divide into clusters based on schools. Then, randomly sample schools.
1. Interview every 10th student who enters the school in the morning.
a. Random Sampling
b. Cluster Sampling
c. Systematic Sampling
d. Stratified Sampling
2. Assign each car in a dealership a number and then use a random-number
table to select the cars to be inspected.
a. Random Sampling
b. Cluster Sampling
c. Systematic Sampling
d. Stratified Sampling
3. A teacher wants to know how well her students are doing on a topic. She
randomly picks one class to survey.
a. Random Sampling
b. Cluster Sampling
c. Systematic Sampling
d. Stratified Sampling
Answers:
1. c. Systematic Sampling
2. a. Random Sampling
3. b. Cluster Sampling
Classifying Variables By Type
• Categorical (qualitative) variables take categories as their values such
as “yes”, “no”, or “blue”, “brown”, “green”.
• Numerical (quantitative) variables have values that represent a
counted or measured quantity.
⸰ Discrete variables arise from a counting process.
⸰ Continuous variables arise from a measuring process.
Examples of Types of Variables
Question | Responses | Variable Type
Do you have an Instagram profile? | Yes or No | Categorical
How many text messages have you sent in the past three days? | --------------- | Numerical (discrete)
How long did the mobile app update take to download? | --------------- | Numerical (continuous)
Types of Variables
Variables
• Categorical (Defined Categories)
  Examples: Marital Status, Political Party, Eye Color
  ⸰ Nominal
  ⸰ Ordinal (Ordered Categories)
    Examples: Ratings such as Good, Better, Best or Low, Med, High
• Numerical
  ⸰ Discrete (Counted Items)
    Examples: Number of Children, Defects per hour
  ⸰ Continuous (Measured Characteristics)
    Examples: Weight, Voltage
1. List all quantitative variables.
Ans: Age, Height, LDL, # children
2. List all qualitative variables.
Ans: Gender, BG, Happy, Smoke, SC
3. List all continuous variables.
Ans: Height, Age, LDL
4. List all discrete variables.
Ans: # children
5. List all ordinal variables.
Ans: Happy, SC
6. List all nominal variables.
Ans: Gender, BG, Smoke
Visualizing Categorical Data:
The Bar Chart
• The bar chart visualizes a categorical variable as a series of
bars. The length of each bar represents either the frequency or
percentage of values for each category.
Reason For Shopping Online? | Percent
Better prices | 37%
Avoiding holiday crowds or hassles | 29%
Convenience | 18%
Better selection | 13%
Ships directly | 3%
Visualizing Categorical Data:
The Pie Chart
• The pie chart is a circle broken up into slices that represent
categories. The size of each slice of the pie varies according to
the percentage in each category.
Reason For Shopping Online? | Percent
Better prices | 37%
Avoiding holiday crowds or hassles | 29%
Convenience | 18%
Better selection | 13%
Ships directly | 3%
Visualizing Numerical Data: The Histogram
Class | Frequency | Relative Frequency | Percentage
10 but less than 20 | 3 | .15 | 15
20 but less than 30 | 6 | .30 | 30
30 but less than 40 | 5 | .25 | 25
40 but less than 50 | 4 | .20 | 20
50 but less than 60 | 2 | .10 | 10
Total | 20 | 1.00 | 100
[Figure: Histogram: Age Of Students. Vertical axis: Frequency (0 to 8). Horizontal axis: 5, 15, 25, 35, 45, 55, More.]
(In a percentage histogram the vertical axis would be defined to
show the percentage of observations per class.)
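The relative frequency and percentage columns follow directly from the class frequencies. A short Python sketch, using the frequencies from the histogram table (the variable names are illustrative):

```python
# Class frequencies from the histogram table.
frequencies = {
    "10 but less than 20": 3,
    "20 but less than 30": 6,
    "30 but less than 40": 5,
    "40 but less than 50": 4,
    "50 but less than 60": 2,
}

total = sum(frequencies.values())

# Relative frequency = class frequency / total; percentage = 100 * relative frequency.
relative = {label: count / total for label, count in frequencies.items()}

for label, count in frequencies.items():
    print(f"{label}: frequency={count}, "
          f"relative={relative[label]:.2f}, percentage={relative[label] * 100:.0f}")
```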
Visualizing Two Numerical Variables:
The Scatter Plot
• Scatter plots are used for numerical data consisting of paired
observations taken from two numerical variables.
• One variable is measured on the vertical axis and the other
variable is measured on the horizontal axis.
• Scatter plots are used to examine possible relationships between
two numerical variables.
Scatter Plot Example
Volume per day | Cost per day
23 | 125
26 | 140
29 | 146
33 | 160
38 | 167
42 | 170
50 | 188
55 | 195
60 | 200
[Figure: scatter plot, Cost per Day vs. Production Volume. Horizontal axis: Volume per Day (20 to 70). Vertical axis: Cost per Day (0 to 250).]

Summary Definitions
• The central tendency is the extent to which the values of a
numerical variable group around a typical or central value.
• The variation is the amount of dispersion or scattering away
from a central value that the values of a numerical variable
show.
• The shape is the pattern of the distribution of values from the
lowest value to the highest value.
Measures of Central Tendency:
The Mean
• The arithmetic mean (often just called the “mean”) is the most
common measure of central tendency.
◦ For a sample of size n:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Where $\bar{X}$ (pronounced x-bar) = sample mean
n = sample size
$X_i$ = ith observed value
Measures of Central Tendency:
The Mean
• The most common measure of central tendency.
• Mean = sum of values divided by the number of values.
• Affected by extreme values (outliers).
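[Figure: two dot plots on an axis from 11 to 20. First data set: 11, 12, 13, 14, 15 (Mean = 13). Second data set: 11, 12, 13, 14, 20 (Mean = 14).]
$$\frac{11+12+13+14+15}{5} = \frac{65}{5} = 13 \qquad \frac{11+12+13+14+20}{5} = \frac{70}{5} = 14$$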
Measures of Central Tendency:
The Median
• In an ordered array, the median is the “middle” number (50%
above, 50% below).
• Less sensitive than the mean to extreme values.
[Figure: two dot plots on an axis from 11 to 20; Median = 13 in both.]
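To see the difference in sensitivity, this small Python sketch compares the mean and the median on the two data sets from the mean example (11, 12, 13, 14, 15 and 11, 12, 13, 14, 20). Reusing those data sets here is an assumption about what the figures show.

```python
from statistics import mean, median

original = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]   # the largest value is replaced by 20

# The mean shifts when one value becomes extreme; the median does not.
mean_shift = mean(with_outlier) - mean(original)
median_shift = median(with_outlier) - median(original)

print(mean(original), median(original))
print(mean(with_outlier), median(with_outlier))
print(mean_shift, median_shift)
```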
Measures of Central Tendency:
Locating the Median
• The location of the median when the values are in numerical order
(smallest to largest):
• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two
middle numbers.
$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$
Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median
in the ranked data.
• n = 7, then the median is in which position?
Ans: (7+1)/2 = 4th position
• n = 8, then the median is in which position?
Ans: (8+1)/2 = 4.5th position, i.e., the average of the 4th and 5th positions
• Ex 1. Find the median for 1, 4, 5, 9, 21, 22
Ans: (5+9)/2 = 7
• Ex 2. Find the median for 12, 32, 35, 78, 90
Ans: 35
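The position rule can be written out in Python. This is a sketch under the rule stated above; the function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Locate the median via the (n + 1) / 2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based position, may end in .5
    if position.is_integer():
        return ordered[int(position) - 1]   # odd n: the middle value
    lower = ordered[int(position) - 1]      # even n: average the two
    upper = ordered[int(position)]          # middle values
    return (lower + upper) / 2

print(median_by_position([1, 4, 5, 9, 21, 22]))    # 7.0
print(median_by_position([12, 32, 35, 78, 90]))    # 35
```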
Measures of Central Tendency:
The Mode
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
[Figure: two dot plots. First, on an axis from 0 to 14: Mode = 9. Second, on an axis from 0 to 6: No Mode.]
Measures of Central Tendency:
Review Example
House Prices:
$2,000,000
$ 500,000
$ 300,000
$ 100,000
$ 100,000
Sum $ 3,000,000
• Mean: ($3,000,000/5)
= $600,000
• Median: middle value of ranked
data
= $300,000
• Mode: most frequent value
= $100,000
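The review example can be checked with Python's standard `statistics` module, as a quick sketch:

```python
from statistics import mean, median, mode

# House prices from the review example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(prices)       # sum of values divided by the number of values
price_median = median(prices)   # middle value of the ranked data
price_mode = mode(prices)       # most frequent value

print(price_mean, price_median, price_mode)
```

The single $2,000,000 house pulls the mean to twice the median, which is why median home prices are often the reported figure.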
Measures of Central Tendency:
Which Measure to Choose?
• The mean is generally used, unless extreme values (outliers)
exist.
• The median is often used instead, since it is not sensitive to
extreme values. For example, median home prices may be
reported for a region.
• In many situations it makes sense to report both the mean and
the median.
Quiz
1. What is the mode of the following numbers?
4, 9, 6, 3, 4, 2
Ans: 4
2. What is the median of the following numbers?
3, 5, 6, 7, 9, 6, 8
Ans: 6
3. A data set can have more than one median. True
or False?
Ans: False
Measures of Variation
• Measures of variation give
information on the spread or
variability or dispersion of
the data values.
• Measures of variation: Range, Variance, Standard Deviation.
[Figure: two distributions with the same center but different variation.]
Measures of Variation:
The Range
• Simplest measure of variation.
• Difference between the largest and the smallest values:
Range = Xlargest – Xsmallest
Example:
[Figure: dot plot on an axis from 0 to 14, with smallest value 1 and largest value 13.]
Range = 13 - 1 = 12
Measures of Variation:
Why the Range Can Be Misleading
• Does not account for how the data are distributed.
• Sensitive to outliers.
[Figure: two dot plots on an axis from 7 to 12 with differently distributed values; in both, Range = 12 - 7 = 5.]
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
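A short Python sketch of the two data sets above shows how a single outlier changes the range; the helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# The two data sets above: identical except for the last value.
clustered = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(clustered))      # 4
print(value_range(with_outlier))   # 119
```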
Measures of Variation:
The Sample Variance
• Average (approximately) of squared deviations of values
from the mean.
◦ Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
Where $\bar{X}$ = arithmetic mean
n = sample size
Xi = ith value of the variable X
Measures of Variation:
The Sample Standard Deviation
• Most commonly used measure of variation.
• Shows variation about the mean.
• Is the square root of the variance.
• Has the same units as the original data.
◦ Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
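As a sketch, the sample variance and standard deviation can be computed straight from their definitions and compared with Python's `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The data set is a made-up example.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [11, 12, 13, 14, 15]      # hypothetical sample
s2 = sample_variance(data)       # squared deviations 4 + 1 + 0 + 1 + 4 = 10; 10 / 4 = 2.5
s = math.sqrt(s2)                # standard deviation, in the same units as the data

print(s2, s)
print(variance(data), stdev(data))
```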
Measures of Variation:
Comparing Standard Deviations
[Figure: two distributions, one with a smaller standard deviation and one with a larger standard deviation.]
Locating Extreme Outliers:
Z-Score
• To compute the Z-score of a data value, subtract the mean and
divide by the standard deviation.
• The Z-score is the number of standard deviations a data value is
from the mean.
• A data value is considered an extreme outlier if its Z-score is less
than -3.0 or greater than +3.0.
• The larger the absolute value of the Z-score, the farther the data
value is from the mean.
Locating Extreme Outliers:
Z-Score
$$Z = \frac{X - \bar{X}}{S}$$
Where X represents the data value
$\bar{X}$ is the sample mean
S is the sample standard deviation
Locating Extreme Outliers:
Z-Score
• Suppose the mean math SAT score is 490, with a standard
deviation of 100.
• Compute the Z-score for a test score of 620.
$$Z = \frac{X - \bar{X}}{S} = \frac{620 - 490}{100} = \frac{130}{100} = 1.3$$
A score of 620 is 1.3 standard deviations above the
mean and would not be considered an outlier.
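The SAT example and the cutoff of 3.0 for extreme outliers can be sketched in Python; the function names are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_extreme_outlier(z, cutoff=3.0):
    """Extreme outlier rule from the lecture: Z below -3.0 or above +3.0."""
    return z < -cutoff or z > cutoff

z = z_score(620, mean=490, sd=100)
print(z, is_extreme_outlier(z))
```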
Numerical Descriptive
Measures for Population
• Descriptive statistics discussed previously described a
sample, not the population.
• Summary measures describing a population, called
parameters, are denoted with Greek letters.
• Important population parameters are the population mean,
variance, and standard deviation.
• A population parameter is a constant, while a sample
statistic varies from sample to sample.
Numerical Descriptive Measures
for Population: The Mean µ
• The population mean is the sum of the values in the population
divided by the population size, N.
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Where μ = population mean
N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures
for Population: The Variance σ²
• Average of squared deviations of values from the mean.
◦ Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Where μ = population mean
N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures for
Population: The Standard Deviation σ
• Most commonly used measure of variation.
• Shows variation about the mean.
• Is the square root of the population variance.
• Has the same units as the original data.
◦ Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
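The population formulas divide by N rather than n - 1. In this Python sketch a small made-up data set is treated as an entire population, and the result is checked against `statistics.pvariance` and `statistics.pstdev`.

```python
import math
from statistics import pstdev, pvariance

population = [11, 12, 13, 14, 15]   # hypothetical: treated as the whole population

N = len(population)
mu = sum(population) / N                               # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N    # divide by N, not N - 1
sigma = math.sqrt(sigma2)

print(mu, sigma2, sigma)
print(pvariance(population), pstdev(population))
```

Dividing the same sum of squared deviations by N - 1 instead would give 2.5, the value obtained when the data are treated as a sample.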
Sample Statistics vs.
Population Parameters
Measure | Population Parameter | Sample Statistic
Mean | μ | $\bar{X}$
Variance | σ² | S²
Standard Deviation | σ | S
Quartile Measures
• Quartiles split the ranked data into 4 segments with an
equal number of values per segment.
[Figure: ranked data divided at Q1, Q2, and Q3 into four segments, each holding 25% of the values.]
⸰ The first quartile, Q1, is the value for which 25% of the
values are smaller and 75% are larger.
⸰ Q2 is the same as the median (50% of the values are
smaller and 50% are larger).
⸰ Only 25% of the values are greater than the third quartile, Q3.
Quartile Measures:
Locating Quartiles
• Find a quartile by determining the value in the appropriate
position in the ranked data, where:
First quartile position: Q1 = (n+1)/4 ranked value
Second quartile position: Q2 = (n+1)/2 ranked value
Third quartile position: Q3 = 3(n+1)/4 ranked value
Where n is the number of observed values.
Quartile Measures:
Calculation Rules
When calculating the ranked position use the following
rules:
◦ If the result is a whole number then it is the ranked position to
use.
◦ If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then
average the two corresponding data values.
◦ If the result is not a whole number or a fractional half then
round the result to the nearest integer to find the ranked
position.
Quartile Measures
Calculating The Quartiles: Example
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = (12+13)/2 = 12.5.
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16.
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = (18+21)/2 = 19.5.
Q1 and Q3 are measures of non-central location.
Q2 = median, is a measure of central tendency.
Quartile Measures
Calculating The Quartiles: Example
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21
(n = 8)
Q1 is in the (8+1)/4 = 2.25 position of the ranked data, which rounds to the 2nd position,
so Q1 = 12.
Q2 is in the (8+1)/2 = 4.5 position of the ranked data, the average of the 4th and 5th values,
so Q2 = median = (16+16)/2 = 16.
Q3 is in the 3(8+1)/4 = 6.75 position of the ranked data, which rounds to the 7th position,
so Q3 = 18.
Q1 and Q3 are measures of non-central location.
Q2 = median, is a measure of central tendency.
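The three calculation rules can be sketched in Python. This follows the lecture's rules rather than any library's quantile method, so results may differ from built-in quantile functions. The function names are hypothetical, and the sketch assumes at least three values; exact fractions avoid floating-point trouble when testing for a "fractional half".

```python
from fractions import Fraction

def value_at(ordered, position):
    """Value at a 1-based ranked position, following the lecture's three rules."""
    if position.denominator == 1:          # whole number: use that ranked position
        return ordered[int(position) - 1]
    if position.denominator == 2:          # fractional half: average the two neighbors
        lower = ordered[int(position) - 1]
        upper = ordered[int(position)]
        return (lower + upper) / 2
    return ordered[round(position) - 1]    # otherwise: round to the nearest position

def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2, and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at(ordered, Fraction(k * (n + 1), 4)) for k in (1, 2, 3))

print(quartiles([11, 12, 13, 16, 16, 17, 18, 21, 22]))   # n = 9
print(quartiles([11, 12, 13, 16, 16, 17, 18, 21]))       # n = 8
```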
Quartile Measures:
The Interquartile Range (IQR)
• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the data.
• The IQR is also called the midspread because it covers the
middle 50% of the data.
• The IQR is a measure of variability that is not influenced by
outliers or extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.
Calculating the Interquartile Range
Example:
[Figure: boxplot with X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70; each of the four segments holds 25% of the data.]
Interquartile range
= 57 - 30 = 27
The Five Number Summary
The five numbers that help describe the center, spread and shape
of data are:
⸰ Xlargest
⸰ Third Quartile (Q3)
⸰ Median (Q2)
⸰ First Quartile (Q1)
⸰ Xsmallest
[Figure: each of the four intervals between consecutive summary values contains 25% of the data.]
Five Number Summary and
The Boxplot
• The boxplot is a graphical display of the data based on the five-
number summary:
Example:
[Figure: boxplot marking Xsmallest, Q1, Median, Q3, and Xlargest.]
Five Number Summary:
Shape of Boxplots
• If data are symmetric around the median then the box and
central line are centered between the endpoints.
• A boxplot can be shown in either a vertical or horizontal
orientation.
[Figure: symmetric boxplot marking Xsmallest, Q1, Median, Q3, and Xlargest.]
Distribution Shape and
The Boxplot
[Figure: three boxplots with panels Left-Skewed, Symmetric, and Right-Skewed, each marking Q1, Q2, and Q3.]
Two Measures of the Relationship
Between Two Numerical Variables
• Scatter plots allow us to examine the relationship between
two numerical variables.
• Two quantitative measures of such relationships:
⸰ The Covariance
⸰ The Coefficient of Correlation
The Covariance
• The covariance measures the strength of the linear relationship
between two numerical variables (X & Y).
The sample covariance:
$$\text{cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
• Only concerned with the strength of the relationship.
• No causal effect is implied.
Interpreting Covariance
• Covariance between two variables:
cov(X,Y) > 0: X and Y tend to move in the same direction.
cov(X,Y) < 0: X and Y tend to move in opposite directions.
• The covariance has a major flaw:
◦ It is not possible to determine the relative strength of the relationship from
the size of the covariance.
Coefficient of Correlation
• Measures the relative strength of the linear relationship between
two numerical variables.
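Sample coefficient of correlation:
$$r = \frac{\text{cov}(X,Y)}{S_X S_Y}$$
Where,
$$\text{cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
$$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$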
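The formulas above can be applied to the volume and cost data from the scatter plot example. This Python sketch uses hypothetical function names; it also rescales one variable to illustrate why covariance alone cannot convey relative strength.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # cov(X, X) is the sample variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Data from the scatter plot example (volume per day, cost per day).
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]

cov = sample_covariance(volume, cost)
r = sample_correlation(volume, cost)

# Changing the units of one variable rescales the covariance but leaves r unchanged.
rescaled_volume = [12 * v for v in volume]
cov_rescaled = sample_covariance(rescaled_volume, cost)
r_rescaled = sample_correlation(rescaled_volume, cost)

print(cov, r)
print(cov_rescaled, r_rescaled)
```

The covariance is positive, so volume and cost tend to move in the same direction, and r stays the same after rescaling while the covariance grows twelvefold.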
Features of the
Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
◦ Unit free.
◦ Range between –1 and 1.
◦ The closer to –1, the stronger the negative linear relationship.
◦ The closer to 1, the stronger the positive linear relationship.
◦ The closer to 0, the weaker the linear relationship.
Scatter Plots of Sample Data with
Various Coefficients of Correlation
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, and r = +1.]
Quiz
1. Suppose we measure the height-weight correlation. The heights in the data
set are changed from feet to inches, so all values are multiplied by 12.
The correlation coefficient for the new data will be:
a. 12 times the original
b. 144 times larger than the original
c. The same as original
2. The pairs in a data set are exchanged, so the x-coordinates are now
the y-coordinates and all values are multiplied by 12. The correlation
coefficients for the original data and for the new data are:
a. Opposites
b. Reciprocals
c. The same
3. Let Y be a random variable. Then V(Y) equals:
a. $\sqrt{E[(Y - \mu_Y)^2]}$
b. $E[|Y - \mu_Y|]$
c. $E[(Y - \mu_Y)^2]$
d. $E[(Y - \mu_Y)]$
4. To infer the political tendencies of the students at your university, you
sample 150 of them. Only one is a simple random sample:
a. make sure that the proportion of minorities is the same in your sample
as in the entire student body
b. call every fiftieth person in the student directory at 9 a.m. If the person
does not answer the phone, you pick the next name listed, and so on.
c. go to the main dining hall on campus and interview students randomly
there.
d. have your statistical package generate 150 random numbers in the range
from 1 to the total number of students in your academic institution, and
then choose the corresponding names in the student telephone
directory.