Date: 2023-09-12
STAM4000
Quantitative Methods
Week 1
Data, Sampling Methods
and Summarisation
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.
Learning Outcomes
#1 Classify different types of data
#2 Recognise sampling methods
#3 Identify errors in sampling
#4 Summarise data with a frequency table
Why does this matter?
Analyse the past, to help the present and plan the future.
TWO BRANCHES OF STATISTICS
• Descriptive statistics: Summarise the behaviour of a data set.
• Inferential statistics: Use information from a sample of data to draw conclusions about the entire population.
https://blogs.sap.com/2013/08/08/bi-cartoons/
#1 Classify different types of data
#1 Data or datum?
What are data?
Data are information about a variable.
What is a variable?
A quantity or quality that can take different values.
Sources of data:
i. Primary: data collector = user
E.g.: A café collects data on types of coffee sold last week.
ii. Secondary: data collector ≠ user
E.g.: A café uses data about the most popular coffees sold in Australia to determine the types of coffee to sell in its café.
Anyone need a coffee?
Popularity of coffees sold in Australia:
Latte 27%
Flat white 25%
Cappuccino 20%
Long black 8%
Espresso 6%
Other 14%
Exercise
Ecco Shoes is a Danish company that manufactures shoes, bags, etc.
Describe each of the following as a primary source or a secondary
source of data. Briefly explain your choice.
a) The customer service department of Ecco Shoes emails online
customers asking about their recent purchase experience.
b) The product management team uses data collected by the
customer service department to learn what customers think
about products sold by Ecco Shoes.
The 5 W’s and ‘How’ of data
Would you like a French pastry to go with your coffee?
These pastries look great, but before we bite into one, we may
want to ask a few questions:
• Who are we interested in? The pastries.
• What information on the pastries do we have?
• Why do we want information on the pastries?
• Where was the information on pastries collected?
• When was the information on pastries collected?
• How was the information on pastries collected?
These questions about pastries are the same for data.
The above are the 5 W’s and ‘How’ of data – these give context
and meaning to data.
https://unsplash.com/photos/a-niLBbZF4o
Exercise
The number of COVID-19 positive cases from around the world is being reported on a daily basis, but how reliable are these data?
The World Health Organisation (WHO) receives its case numbers from governments around the world, an external secondary source of data.
The information supplied to WHO depends on the government of the country supplying the data.
Describe the 5 W’s and ‘How’ of the number of COVID-19 positive cases from the WHO.
Types of data
Numerical (quantitative):
numbers with units
Discrete data: countable (finite) or has
defined gaps between values
E.g.: number of tickets sold, test marks
Continuous data: infinite possible
values or no set gaps between values
E.g.: measurements such as weight,
height
Categorical (qualitative):
labels or numbers without units
Nominal data: labels or names
E.g.: a subject code such as STAM4000
Ordinal data: ranked values
E.g.: Measure of risk: 1. High risk, 2. Medium risk, 3. Low risk
Exercise
Match each example (orange) with two of the corresponding concepts (blue):
Examples: Weight; Tripadvisor restaurant rating; Number of apps on your phone; Revenue; STAM4000; Marks in the exam
Concepts: Numerical; Categorical; Ordinal; Discrete; Nominal; Continuous
#1 More on types of quantitative data
Interval
• Data are always numerical and distances between consecutive integers have meaning.
• Differences between numbers are equal.
• The location of zero is a matter of convenience or convention and not a natural or fixed data point.
• Examples: Celsius temperature, Calendar time, Standardised exam score.
Ratio
• Highest level of measurement.
• Contains the same properties as interval data, with the additional property that zero has a meaning and represents the absence of the phenomenon being measured.
• Examples: Height, Weight and Volume, Profit and Loss, Revenues and Expenses, P/E Ratio, Inventory.
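The practical difference between interval and ratio data can be sketched with a short calculation. The temperatures and weights below are illustrative values, not course data, and the conversion helpers are hypothetical names introduced for this sketch. The point is that a ratio of two Celsius readings changes when the zero point moves, while a ratio of two weights does not change when the units change.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Shift the zero point: Kelvin has a true zero, Celsius does not."""
    return celsius + 273.15


def kg_to_lb(kilograms: float) -> float:
    """Change units without moving zero: 0 kg is still 0 lb."""
    return kilograms * 2.20462


# Interval data: differences are meaningful, ratios are not.
warm_c, cool_c = 20.0, 10.0              # illustrative Celsius readings
difference = warm_c - cool_c             # a 10-degree gap is meaningful
naive_ratio = warm_c / cool_c            # 2.0, but 20 °C is not "twice as hot"
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ratio data: zero means "none", so ratios survive a change of units.
heavy_kg, light_kg = 80.0, 40.0          # illustrative weights
ratio_kg = heavy_kg / light_kg
ratio_lb = kg_to_lb(heavy_kg) / kg_to_lb(light_kg)

print(f"Celsius ratio: {naive_ratio:.3f}, Kelvin ratio: {kelvin_ratio:.3f}")
print(f"Weight ratio in kg: {ratio_kg:.3f}, in lb: {ratio_lb:.3f}")
```

The two temperature ratios disagree because the Celsius zero is a convention, whereas the two weight ratios agree because zero weight means the absence of weight.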
#2 Recognise sampling methods
https://unsplash.com/photos/Ss3U6bEtKww
Population
Entire group of individuals or items.
Sample
A selection or subset of the population.
Example
The population of households in the
city of Venice, Italy.
A sample of households in the city of
Venice, would be a subset selected (circled)
using a specific sampling method.
Exercise
Is each of the following a population or a sample? Briefly explain your choice.
a) The closing price of Alphabet Inc. shares (parent company of Google) on the stock
exchange for the last 5 business days.
b) Annual household expenditure for all Australian households, from the 2016 Australian
Census.
c) Of all the people who use social media, those who use only Twitter.
d) Pass rate for each KBS subject offered in T3 2020.
Which should we use - a population or a sample?
Population:
Advantage of using a population of data: the ‘accuracy’ of the data.
Disadvantages of using a population of data:
• Costly both in time and money.
• Information may be irrelevant by the time all information is gathered.
• May destroy the population.
Sample:
Advantages of using a sample of data:
• Lower cost in time and money.
• Will not destroy the population.
Disadvantages of using a sample of data:
• Our sample may not be random and representative due to:
o sampling errors
o non-sampling errors (bias)
The disadvantages of using a population are the reason we use samples.
Random and Representative?
Samples should be random and representative.
Random
A sample is random, if being selected into the
sample is unpredictable.
Representative
A sample is representative of the population, if the
sample is made up of the same characteristics and
in the same proportions as the population.
Good sample data is based on a random,
representative sample, OTHERWISE we may have
BIAS - see later.
https://davidkane9.github.io/PPBDS/one-parameter.html
Exercise
In the lead up to the 2019 Australian Federal Election, polling companies tried to predict who would become prime minister: Scott Morrison from the Liberal Party or Bill Shorten from the Labor Party.
A number of polling companies gathered their samples using only “robocalls” to call voters on their landlines, from the readily available “White Pages”, which lists most Australian residential landline phone numbers. The robocalls were computerised phone calls, with pre-recorded messages, that asked respondents to choose between Scott Morrison and Bill Shorten.
Most of the polls predicted Bill Shorten to win.
The winner: Scott Morrison.
Do you think the data gathered by the polling companies were random and representative samples of the population? Briefly explain your reasoning.
(https://www.smh.com.au/federal-election-2019/celebrations-and-sorrow-creates-headaches-for-both-sides-of-politics-20190519-p51p0s.html)
Sampling methods
A sampling method is the process, scenario, strategy or technique used to gather a sample of individuals or items from a population.
4 types of sampling methods we will discuss:
• Simple random sampling
• Stratified sampling
• Cluster sampling
• Systematic sampling
Simple random sampling (SRS)
A sample is drawn so that every possible sample of the size we plan to draw (that is, every combination of individuals/items of that size) has an equal chance of being selected.
Example: Suppose a soccer federation decides to randomly test players for drugs. There are 10 soccer teams in the league with 20 players per team. The federation would like to drug test a sample of 40 soccer players. Describe a simple random sampling method.
• The players are sorted in alphabetical order in the registration list.
• A random number generator selects 40 distinct numbers between 1 and 200.
• Using the registration list, the players at those positions in the list are selected.
• These players are asked to be drug tested.
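The simple random sampling steps for the soccer example can be sketched in Python. The player names are hypothetical placeholders (the slides give only the counts: 10 teams of 20 players), and the fixed seed is an assumption made purely so that the sketch gives the same sample each time it runs.

```python
import random

# Hypothetical registration list: 10 teams x 20 players = 200 players.
players = [
    f"Team{team:02d}-Player{number:02d}"
    for team in range(1, 11)
    for number in range(1, 21)
]
players.sort()  # alphabetical registration list

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# random.sample draws without replacement, so no player is chosen twice
# and every group of 40 players is equally likely to be selected.
srs = rng.sample(players, k=40)

print(f"Selected {len(srs)} of {len(players)} players, for example {srs[:3]}")
```

Drawing without replacement matters here: a generator that could repeat a number would select fewer than 40 distinct players.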
Stratified sampling
A sample is drawn by first dividing the population into groups of similar individuals (called strata), and then taking a simple random sample from each stratum.
We will use the same example of testing soccer players for drugs.
Now, describe a stratified sampling method.
• The soccer federation divides the soccer players into four different age groups, treating each age group as a stratum:
o Up to 20 years
o Above 20 years up to 25 years
o Above 25 years up to 30 years
o Above 30 years
• The soccer federation randomly selects 10 players from each age group, and drug tests these players.
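A stratified sample for the soccer example might be drawn as in the sketch below. The player IDs and ages are invented for illustration, and the function `age_group` is a hypothetical helper. Ages of exactly 25 and 30 are placed in the lower group, an assumption made so that every player belongs to exactly one stratum.

```python
import random

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# Hypothetical data: 200 players with ages cycling from 17 to 36.
players = [(f"P{i:03d}", 17 + (i % 20)) for i in range(200)]


def age_group(age: int) -> str:
    """Assign each age to exactly one stratum."""
    if age <= 20:
        return "up to 20"
    if age <= 25:
        return "above 20 up to 25"
    if age <= 30:
        return "above 25 up to 30"
    return "above 30"


# Step 1: divide the population into strata of similar individuals.
strata: dict[str, list[str]] = {}
for player_id, age in players:
    strata.setdefault(age_group(age), []).append(player_id)

# Step 2: take a simple random sample of 10 players from each stratum.
stratified_sample = {
    group: rng.sample(members, k=10) for group, members in strata.items()
}

for group, chosen in stratified_sample.items():
    print(f"{group}: {len(chosen)} of {len(strata[group])} players selected")
```

Sampling 10 from every stratum guarantees each age group appears in the sample, which a simple random sample of 40 does not guarantee.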
Cluster sampling
A sample is drawn by first dividing the population into groups (clusters) of different individuals or items from the population.
One or more of the clusters is/are randomly selected, and a census is then performed on each selected cluster.
We will use the same example of testing soccer players for drugs.
Now, describe a cluster sampling method.
• There are 10 soccer teams in the federation.
• Each soccer team is a combination of different age groups etc.
• A cluster sampling method is used when the soccer federation randomly selects two soccer teams and drug tests every player in each of those two selected teams.
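Cluster sampling for the soccer example can be sketched as follows, treating each team as a cluster. Team and player names are hypothetical placeholders, and the seed is fixed only for repeatability.

```python
import random

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# Hypothetical clusters: 10 teams, each with 20 players.
teams = {
    f"Team{team:02d}": [f"Team{team:02d}-Player{n:02d}" for n in range(1, 21)]
    for team in range(1, 11)
}

# Step 1: randomly select two whole clusters (teams).
chosen_teams = rng.sample(sorted(teams), k=2)

# Step 2: perform a census of each selected cluster (test every player).
cluster_sample = [player for team in chosen_teams for player in teams[team]]

print(f"Teams selected: {chosen_teams}")
print(f"Players tested: {len(cluster_sample)}")
```

Only the choice of teams is random; once a team is chosen, every one of its players is included.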
Systematic sampling
A sample is gathered by using a list or a location, and a fixed interval to select members from a sampling frame.
We will use the same example of testing soccer players for drugs.
Now, describe a systematic sampling method.
• A systematic sampling method is used when the soccer federation uses the soccer player registration list in a systematic manner:
o With the list in alphabetical order, say, every 5th soccer player on that list can be selected and asked to be drug tested.
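Systematic sampling for the soccer example can be sketched with list slicing. The player names are hypothetical, and the random starting position is an assumption added for this sketch; the slide only specifies selecting every 5th player.

```python
import random

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# Hypothetical alphabetical registration list of 200 players.
players = sorted(
    f"Team{team:02d}-Player{n:02d}"
    for team in range(1, 11)
    for n in range(1, 21)
)

sample_size = 40
interval = len(players) // sample_size   # 200 / 40 = every 5th player

# Assumed refinement: begin at a random position within the first interval.
start = rng.randrange(interval)

# Take every 5th player from the starting position onwards.
systematic_sample = players[start::interval]

print(f"Interval: {interval}, start index: {start}, "
      f"players selected: {len(systematic_sample)}")
```

With 200 players and an interval of 5, any starting position from 0 to 4 yields exactly 40 players.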
Exercise
Giant cruise ships have been causing damage to Venice, Italy. The mayor’s office of Venice would like to understand how the 50,000 residents feel about cruise ships travelling in the Grand Canal, through the historic city.
Listed below are several proposed sampling methods to survey residents.
For each of the following, describe what type of sampling method is used.
a) The city of Venice is divided into 6 districts (sestieri). The mayor’s office randomly selects one district and surveys every household in that district.
b) Under Italian law, records of births, marriages, and deaths are maintained by the Registrar of Vital Statistics in Venice. Every 50th person born in Venice from 1920 to 2002 is selected and surveyed.
c) Using the electoral roll, 1000 Venetian residents are randomly selected and asked to participate in the survey.
d) Using the record of births, residents are divided into four age groups: 18 to 30 years, greater than 30 years to 45 years, greater than 45 years to 60 years, and greater than 60 years old. Say, 250 residents from each group are randomly selected and surveyed.
#3 Identify errors in sampling
https://www.davisenterprise.com/comics/the-wizard-of-id-350/?amp=1
Errors in sampling
Is our sample any good?
Lots of things can go wrong with sampling:
bad sample = unreliable analysis = bad decisions
There are 2 main sources of error:
• Sampling Error (not really an error)
• Non-sampling Error = BIAS
https://www.team-consulting.com/insights/10-human-factors-study-myths-6-to-10/
Sampling Error
• The difference between a measurement
in a sample, and a measurement in the population, due to
random processes or by chance alone.
• A legitimate difference expected; not really an “error”.
• Depends on sample size. The sample size is
represented by n.
• We can decrease sampling error if we increase the
sample size.
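The link between sampling error and sample size n can be illustrated with a small simulation. Everything in it is an assumption made for illustration: the population of 10,000 wait times is invented, the helper `average_sampling_error` is a hypothetical name, and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# Hypothetical population: 10,000 wait times spread evenly over 0-30 minutes.
population = [rng.uniform(0, 30) for _ in range(10_000)]
population_mean = statistics.fmean(population)


def average_sampling_error(n: int, repetitions: int = 200) -> float:
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(repetitions):
        sample = rng.sample(population, k=n)  # simple random sample of size n
        gaps.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(gaps)


errors_by_n = {n: average_sampling_error(n) for n in (10, 100, 1000)}

for n, error in errors_by_n.items():
    print(f"n = {n:>4}: average sampling error = {error:.2f} minutes")
```

Each sample is random, so the gap never disappears entirely, but on average it shrinks as n grows.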
Non-sampling Error = BIAS
• Bias is due to bad sampling design.
• Bias is more serious than sampling error.
o If bias is present, we CANNOT draw valid
conclusions.
• Bias can exist regardless of sample size.
• An increase in the sample size will NOT
decrease bias.
Can bias exist in a census?
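The claim that a larger sample does not reduce bias can be illustrated with a simulation of a poorly designed sampling frame. The figures are invented for illustration: a hypothetical population of 10,000 voters in which landline owners support candidate A less often than everyone else, and a poll (the hypothetical function `landline_poll`) that can only reach landline owners.

```python
import random

rng = random.Random(4000)  # fixed seed (an assumption) for repeatability

# Hypothetical population of 10,000 voters: (has_landline, supports_A).
landline = [(True, True)] * 2400 + [(True, False)] * 3600      # 40% support A
mobile_only = [(False, True)] * 2800 + [(False, False)] * 1200  # 70% support A
population = landline + mobile_only

true_support = sum(supports for _, supports in population) / len(population)


def landline_poll(n: int) -> float:
    """Estimate support for A from n voters drawn only from landline owners."""
    sample = rng.sample(landline, k=n)
    return sum(supports for _, supports in sample) / n


estimates = {n: landline_poll(n) for n in (100, 1000, 6000)}

print(f"True support for A: {true_support:.0%}")
for n, estimate in estimates.items():
    print(f"Landline-only poll, n = {n:>4}: estimated support = {estimate:.0%}")
```

Even when every landline owner is surveyed (n = 6000), the estimate settles at 40% rather than the true 52%, because the design excludes part of the population.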
https://twitter.com/rodemmerson/status/1027263226923208704
Comparison of Errors
Sampling errors are
errors caused by the
mere act of using a
sample to represent
the population.
Non-sampling errors
(BIAS) are errors not
caused by this, and as
such, they can exist in
a census.
[Figure: Comparison of sampling error and bias. Error is plotted against sample size for two series: non-sampling error (bias) and sampling error (chance).]
Some types of bias …
Voluntary response bias:
• A general invitation to participate in a survey is broadcast.
• Only individuals who care enough choose to participate.
• E.g. polls on the radio, television or social media.
Non-response bias:
• Individuals are specifically invited to participate in a survey.
• A large number of individuals with the same characteristic(s) do not participate.
• These characteristics are now missing from the sample.
• E.g. polling of invited individuals in the 2016 and 2020 US presidential elections.
More bias ...
What else can lead to bias?
Response bias:
• An individual’s response is influenced by leading questions or gifts associated with the survey.
• E.g. It is so cold today – do you really believe in global warming?
Undercoverage and convenience sampling:
• A bad sampling frame limits the opportunity for members of a population to be included in a sample.
• E.g. Surveys in shopping malls during business hours, when others are working elsewhere.
Exercise
It is compulsory (mandatory) in Australia for all eligible citizens to vote in federal elections, by-elections and referendums. Over the years, there has been an attempt to stop mandatory voting.
A researcher wants to understand people’s opinion on mandatory voting, and has proposed the following sampling scenarios. For each scenario, indicate some possible errors:
a) Run a poll on the local TV news, asking people to dial one of two numbers to indicate whether they would agree or disagree with the stopping of mandatory voting.
b) Randomly select one street in each city and contact each of the households in that street asking their opinion on mandatory voting.
c) Hold a meeting in each capital city, and tally the opinions on mandatory voting, expressed by those who attend the meetings.
d) Go through the electoral roll, selecting every 100,000th voter and ask their opinion on mandatory voting.
#4 Summarise data with a frequency table
https://towardsdatascience.com/data-scientists-guide-to-summarization-fc0db952e363
#4 Summarise data with a frequency table
Class:
• Names or labels (if categorical data)
or
• Class interval for quantitative data, where “(x” is read as “values greater than x” and “x]” is read as “values less than or equal to x”
Frequency: number of “counts” for each class.
Relative Frequency: The fraction (or %) in each class relative to the total
frequency
Cumulative Frequency: running total.
Cumulative Relative Frequency: running percentage.
Note: classes must be mutually exclusive and collectively exhaustive.
Example: Class (50, 65] is read as “values greater than 50 AND less than or equal to 65”.
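The reading of a class such as (50, 65] can be written as a one-line test, which also shows why adjacent classes of this form are mutually exclusive. The function names and the three classes below are hypothetical, chosen only to illustrate the notation.

```python
def in_class(value: float, lower: float, upper: float) -> bool:
    """True when value lies in (lower, upper]: above lower, at most upper."""
    return lower < value <= upper


# Hypothetical adjacent classes sharing their boundaries.
classes = [(50, 65), (65, 80), (80, 95)]


def classes_containing(value: float) -> list[tuple[int, int]]:
    """List every class that contains the value."""
    return [(lo, hi) for lo, hi in classes if in_class(value, lo, hi)]


for value in (50, 65, 65.1, 95):
    print(f"{value} falls in: {classes_containing(value)}")
```

A boundary value such as 65 lands in exactly one class, (50, 65], because the next class starts strictly above 65. A value of 50 lands in none of these classes, so a lower class would be needed for the set to be collectively exhaustive.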
Example
The call centre of an electricity provider has received a number of complaints from customers that the call wait time (minutes) is too long. The manager of the call centre claims that most wait times are 15 minutes or less.
To investigate the complaints, a consumer group telephoned the electricity provider 25 times and recorded the wait times below.
25.5 23.5 24.3 26.5 28.2
19.7 28.5 25.5 28.5 26.5
7.9 3.2 27.9 26.0 23.8
23.9 24.6 23.3 28.2 17.6
15.5 26.6 22.5 6.5 28.3
Table I: Raw data of call wait times (minutes)
Example
Creating the frequency table for quantitative data for the call waiting example:
• How do we choose the class width?
• Decide the number of classes we want.
E.g.: say 6 classes
• Find the range = max – min of the raw data.
E.g.: using Table I, range = 28.5 – 3.2 = 25.3 minutes
• Estimate the class width = range / number of classes.
E.g.: 25.3 / 6 ≈ 4.2 minutes
• Round up the class width to a suitable integer (whole number).
E.g.: width of 5 minutes
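The class width steps for the call wait times can be checked with a few lines of Python. The data are the 25 wait times from Table I; rounding up with `math.ceil` is one way to reach a "suitable integer" and is an assumption of this sketch.

```python
import math

# Call wait times (minutes) from Table I.
wait_times = [
    25.5, 23.5, 24.3, 26.5, 28.2,
    19.7, 28.5, 25.5, 28.5, 26.5,
    7.9, 3.2, 27.9, 26.0, 23.8,
    23.9, 24.6, 23.3, 28.2, 17.6,
    15.5, 26.6, 22.5, 6.5, 28.3,
]

number_of_classes = 6                               # chosen in advance
data_range = max(wait_times) - min(wait_times)      # 28.5 - 3.2
estimated_width = data_range / number_of_classes    # about 4.2 minutes
class_width = math.ceil(estimated_width)            # round up to a whole number

print(f"Range: {data_range:.1f} minutes")
print(f"Estimated class width: {estimated_width:.1f} minutes")
print(f"Class width used: {class_width} minutes")
```

Six classes of width 5 starting at 0 span 0 to 30 minutes, which covers every observation in Table I.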
Example
Using Table II, we can answer the following:
a) How many customers waited more than 5 minutes but less than or equal to 10 minutes? Answer: 2
b) What proportion of customers waited more than 25 minutes but less than or equal to 30 minutes? Answer: 48%
c) How many customers waited 25 minutes or less? Answer: 13 customers
d) What proportion of customers waited 15 minutes or less? Answer: 12%
e) The manager of the call centre claims that most wait times are 15 minutes or less. Comment.
Answer: The manager’s claim is not supported for this sample of data, as only 12% waited 15 minutes or less.
Table II: Frequencies of call wait times
Class (min) | Frequency | Relative frequency | Cumulative frequency | Relative cumulative frequency
(0, 5] | 1 | 4% | 1 | 4%
(5, 10] | 2 | 8% | 3 | 12%
(10, 15] | 0 | 0% | 3 | 12%
(15, 20] | 3 | 12% | 6 | 24%
(20, 25] | 7 | 28% | 13 | 52%
(25, 30] | 12 | 48% | 25 | 100%
Total | 25 | 100% | N/A | N/A
Note: the first class, (0, 5] means:
• 0 minutes is NOT included in this class
• 5 minutes IS included in this class
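Table II can be rebuilt from the raw wait times in Table I with a short loop. This is a minimal sketch using only the standard library; the variable names and the dictionary layout of each row are choices made for this illustration.

```python
# Call wait times (minutes) from Table I.
wait_times = [
    25.5, 23.5, 24.3, 26.5, 28.2,
    19.7, 28.5, 25.5, 28.5, 26.5,
    7.9, 3.2, 27.9, 26.0, 23.8,
    23.9, 24.6, 23.3, 28.2, 17.6,
    15.5, 26.6, 22.5, 6.5, 28.3,
]

class_width = 5
total = len(wait_times)
rows = []
running_total = 0

for lower in range(0, 30, class_width):
    upper = lower + class_width
    # Class (lower, upper]: greater than lower, less than or equal to upper.
    frequency = sum(1 for x in wait_times if lower < x <= upper)
    running_total += frequency
    rows.append({
        "class": f"({lower}, {upper}]",
        "frequency": frequency,
        "relative": 100 * frequency / total,          # percentage of all calls
        "cumulative": running_total,                  # running total
        "cumulative_relative": 100 * running_total / total,
    })

for row in rows:
    print(f"{row['class']:>9} {row['frequency']:>3} {row['relative']:>5.0f}% "
          f"{row['cumulative']:>3} {row['cumulative_relative']:>5.0f}%")
```

The third row, (10, 15], carries a cumulative relative frequency of 12%, which is the figure used to assess the manager's claim about waits of 15 minutes or less.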
Exercise
Mario’s Gelateria has 24 stores across Australia. The chain recorded each store’s monthly revenue ($000) for January 2020 as follows:
130.7 112.2 130.3 136.5
73.5 120.3 106.5 106.2
114.3 98.5 133.5 143.3
115.3 85.8 110.6 100.5
135.2 112.5 118.5 150.5
92.3 86.3 115.8 105.5
a) Create a table of frequencies, relative frequencies, cumulative frequencies and relative cumulative frequencies. Use (60, 75] as your first class.
b) How many stores had monthly revenue greater than $150,000 but less than or equal to $165,000?
c) What proportion of stores had a monthly revenue of $90,000 or less?
d) How many stores had a monthly revenue greater than $105,000 up to or including $150,000?
e) Which revenue class was the most common in this data set? Explain.
Supplementary Exercises
• Students are advised that Supplementary Exercises to this topic may be found on the
subject portal under “Weekly materials”.
• Solutions to the Supplementary Exercises may be available on the portal under “Weekly
materials” at the end of each week.
• Time permitting, the lecturer may ask students to work through some of these exercises
in class.
• Otherwise, it is expected that all students work through all Supplementary Exercises
outside of class time.
Extension
• The following slides are an extension to this week’s topic.
• The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
Example
Here is a list of the top 10 companies in Australia as of February 2020.
a) Create a table of frequencies and relative frequencies, by industry.
b) What is the percentage share and category of the most common industry?
https://disfold.com/top-companies-australia-asx/
Example Solution
a) Frequency table by industry.
Industry (alphabetical order) | Frequency | Relative frequency (%)
Biotechnology | 1 | 10%
Capital Markets | 1 | 10%
Diversified Banks | 4 | 40%
Grocery Stores | 1 | 10%
Home Improvement Retail | 1 | 10%
Metals & Mining | 2 | 20%
TOTAL | 10 | 100%
b) The most common industry in the top 10 companies is Diversified Banks with a relative frequency of 40%.
Exercise
Sixteen Kaplan Business School students earned the following grades in a
subject offered last trimester.
a) Is this data categorical or numerical? Briefly explain.
b) Create a table summarising the frequencies and relative frequencies, by
hand.
c) How many students received a credit?
d) What proportion of students received a high distinction?
e) What proportion of students did not fail?
P D HD C
F C P P
C HD F C
F P P P
Exercise solution
a) This is categorical data as we have names of grades.
b) Frequency table:
Grade | Frequency | Relative frequency
F | 3 | 18.75%
P | 6 | 37.5%
C | 4 | 25%
D | 1 | 6.25%
HD | 2 | 12.5%
Total | 16 | 100%
c) Four students received a credit.
d) 12.5% of students received a high distinction.
e) 81.25% of students did not fail.
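Although the exercise asks for the table to be built by hand, the counts can be checked with Python's `collections.Counter`. The grades are the sixteen values listed in the exercise; the display order F, P, C, D, HD is a choice made for this sketch.

```python
from collections import Counter

# The sixteen grades from the exercise, read row by row.
grades = "P D HD C F C P P C HD F C F P P P".split()

counts = Counter(grades)
total = len(grades)
grade_order = ["F", "P", "C", "D", "HD"]  # lowest to highest grade

# Relative frequency: each count as a percentage of all students.
relative = {grade: 100 * counts[grade] / total for grade in grade_order}

for grade in grade_order:
    print(f"{grade:>2} {counts[grade]:>2} {relative[grade]:>6.2f}%")

# Proportion of students who did not fail.
did_not_fail = 100 * (total - counts["F"]) / total
print(f"Did not fail: {did_not_fail}%")
```

Because grades are labels rather than measurements, only counts and proportions are computed; no class intervals are needed.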