Chapter 1 – Defining and Collecting Data
STAT1008
Quantitative Research Methods
Key learning points
• Basic definitions:
– Population vs. sample; parameter vs. statistic
• Survey sampling methods
• Types of variables
• Sources of survey error
Example data set
• Electric Vehicle (EV) Consumer Survey Data Set
• Source: California Clean Vehicle Rebate Project
http://cleanvehiclerebate.org/eng/survey-dashboard/ev
• Study aim: Data were collected to examine the demographics of
electric vehicle buyers and identify patterns in the decision-making
process among EV buyers
Population and Sample
• Population: All members of the group about which you
want to draw a conclusion
– Q: What is the population of interest for the EV example?
• Sample: Portion of the population selected for analysis
• Why do we analyse the sample rather than the entire
population? Isn’t it better to have information on all
members of a population, and not just a subset?
Parameters and statistics
• Parameter: A numerical measure that describes a
characteristic of the population
• Statistic: A numerical measure that describes a
characteristic of the sample. Calculated using the
sample data. Used as an estimate of the population
parameter
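The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The ages below are simulated purely for illustration (they are not the EV survey data), and the names `population_ages` and `sample_ages` are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated ages for 91,085 people (not real survey data).
population_ages = [random.randint(18, 85) for _ in range(91_085)]

# Parameter: computed from every member of the population.
population_mean = statistics.mean(population_ages)

# Statistic: computed from a sample and used to estimate the parameter.
sample_ages = random.sample(population_ages, 1_000)
sample_mean = statistics.mean(sample_ages)

print(f"parameter (population mean age): {population_mean:.2f}")
print(f"statistic (sample mean age):     {sample_mean:.2f}")
```

In practice the parameter is usually unknown, which is why the statistic is used as its estimate.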
Data collection method
• Before looking at the data, let’s consider how the data
were collected. That is, what was the sampling method?
• Why is it important to acknowledge the sampling
method? Do we have an unbiased sample? What is an
unbiased sample?
• EV consumer survey – all rebate participants (as listed on the rebate
register) in California received a survey invitation by email with their
application approval notice.
• Response is voluntary
• Of the 91,085 registered program participants, 19,460 responded
(response rate = 21%)
– Pros?
– Cons?
• Other methods of data collection?
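The response rate quoted above follows directly from the two counts on the slide. A small Python check, using only those two figures (the variable names are illustrative):

```python
invited = 91_085    # registered program participants (from the slide)
responded = 19_460  # participants who responded (from the slide)

response_rate = responded / invited
non_respondents = invited - responded

print(f"response rate = {response_rate:.1%}")
print(f"non-respondents = {non_respondents:,}")
```

The size of the non-respondent group is worth keeping in mind when weighing the pros and cons of a voluntary survey.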
• It is important to understand how the data were collected,
as it affects the makeup of your sample and whether
the sample is representative of your target population.
• What possible biases could be present in the EV survey
data collection method?
Types of survey sampling methods
• Sampling frame – the list of items that make up the population. What is
the sampling frame for the EV survey data set?
• Once you select a frame, you draw a sample from that frame.
• Non-probability sample – items are selected without knowing their
probability of selection. Convenient and low cost, but selection
bias is problematic.
– e.g. convenience sampling, judgement sampling, quota sampling
• Probability sample – items are selected with known selection probabilities,
which allows unbiased inferences about the population
Simple random sample
• Simple random sample – every item in the frame has an
equal chance of being selected.
– Example: randomly select 1,000 rebate participants for inclusion in the
survey. Each participant in the frame has the same probability of
selection, 1,000/91,085 ≈ 1.1%.
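A simple random sample of this kind can be sketched in Python with the standard library. The frame below is a hypothetical list of participant IDs standing in for the rebate register; it is an assumption for illustration, not the actual register.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: IDs 1..91085 standing in for the rebate register.
frame = list(range(1, 91_086))
n = 1_000

# random.sample draws without replacement; every ID is equally likely to be chosen.
srs = random.sample(frame, n)

selection_probability = n / len(frame)
print(f"sample size: {len(srs)}")
print(f"probability of selection: {selection_probability:.4f}")
```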
Stratified sampling
• Stratified sample – divide the frame into subpopulations (strata), then
perform simple random sampling within each stratum
– Pros: ensures specific groups of the population are adequately represented
– Example: divide rebate participants by income group, perform simple
random sampling of units within each income group. Stratified
random sampling then ensures all income groups are adequately
represented.
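The income-group example can be sketched as follows. The frame, the three income groups, their proportions, and the 5% sampling fraction are all invented for illustration; the EV data set may define income groups differently.

```python
import random
from collections import defaultdict

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical frame: (participant_id, income_group) pairs with invented proportions.
income_groups = ["low", "middle", "high"]
frame = [
    (i, random.choices(income_groups, weights=[1, 3, 6])[0])
    for i in range(10_000)
]

# Step 1: divide the frame into strata.
strata = defaultdict(list)
for participant_id, group in frame:
    strata[group].append(participant_id)

# Step 2: simple random sample within each stratum (here, 5% of each stratum).
sampling_fraction = 0.05
stratified_sample = {
    group: random.sample(members, round(len(members) * sampling_fraction))
    for group, members in strata.items()
}

for group in income_groups:
    print(group, len(strata[group]), len(stratified_sample[group]))
```

Sampling the same fraction from every stratum is only one possible design choice; a smaller stratum could instead be sampled at a higher rate to guarantee a minimum number of respondents.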
Systematic sampling
• Systematic sampling – choose a sampling interval k (e.g. k = 10, 20, …), start
with a randomly chosen item among the first k items in the sampling frame,
then pick every kth item thereafter
– Example: product testing in a manufacturing factory.
– Example: in EV data set, select every 100th participant as listed
alphabetically in the sampling frame
– Prone to selection bias given that the probability of selection will be
affected by the order in which the items in the frame appear.
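Selecting every 100th participant can be sketched in Python as below. The frame is a hypothetical list of IDs in register order, and the random starting position within the first k items is a common convention that is assumed here.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 91,085 participant IDs in the order they appear in the register.
frame = list(range(1, 91_086))
k = 100  # sampling interval

# Random start among the first k items, then every kth item thereafter.
start = random.randrange(k)
systematic_sample = frame[start::k]

print(f"start index: {start}, sample size: {len(systematic_sample)}")
```

Because the sample is fully determined by the start and by the ordering of the frame, any pattern in that ordering carries straight into the sample.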
Cluster sampling
• Cluster sampling - divide items in frame into clusters. Take a random
sample of clusters then collect data on every item in that sampled cluster.
– A cluster sample typically gives less precise estimates than a systematic or
simple random sample of the same size (especially if values tend to be similar
within the same cluster), but it can be much cheaper.
– Cluster examples: households, postcode, electorate (more cost-effective than
SRS if population is spread over a large region)
– EV survey data example: suppose the ZIP code is recorded for each
participant. Then cluster sample by ZIP code within each county.
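A minimal Python sketch of cluster sampling by ZIP code is given below. The ZIP codes and the frame are invented for illustration, and the sketch samples clusters from the whole frame rather than within each county, which is a simplification of the example above.

```python
import random
from collections import defaultdict

random.seed(11)  # fixed seed so the sketch is reproducible

# Hypothetical frame: (participant_id, zip_code) pairs with invented ZIP codes.
zip_codes = [f"9{n:04d}" for n in range(200)]
frame = [(i, random.choice(zip_codes)) for i in range(10_000)]

# Group the items in the frame into clusters.
clusters = defaultdict(list)
for participant_id, zip_code in frame:
    clusters[zip_code].append(participant_id)

# Take a random sample of clusters, then keep every item in each sampled cluster.
sampled_zips = random.sample(sorted(clusters), 20)
cluster_sample = [pid for z in sampled_zips for pid in clusters[z]]

print(f"{len(sampled_zips)} clusters sampled, {len(cluster_sample)} participants")
```

Note that the final sample size is not fixed in advance: it depends on how many participants fall in the sampled clusters.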
Some definitions of a dataset
• What does a row in the dataset represent?
– Each row is a unit of observation: the entity/item on which we collect
data
• What information is contained in the columns?
– Each column is a variable (a characteristic/feature of each unit of
observation)
• Rows and columns together form a data set. Note the tabular
format for presenting data, which is the format we will focus on in this class
Types of variables
• Why is it important to classify variables by type?
• Categorical (values are class labels or levels)
– nominal (no natural order, e.g. make of car={Ford, Subaru, Hyundai, …} )
– ordinal (natural order: e.g. very unsatisfied, unsatisfied, neutral, satisfied,
very satisfied)
• Numerical (a measurement number: height, weight, age, number of children,
…)
– discrete (e.g. a count, number of car trips in past week = 0, 1, 2, … )
– continuous (e.g. age, height, …)
Types of variables – EV survey example
1. Age – numerical, continuous
• Numerical – has a numerical value that quantifies the amount/size of something.
Age is numerical because it quantifies how many years old the person is.
• Continuous – can take on any value between specified limits. Although age is
usually reported in whole numbers, your exact age in years, months, days,
etc. can be reported as a real number. E.g. if your 18th birthday was exactly 3
months ago, your age is 18.25 years
2. Sex: 1=Female, 2=Male – categorical nominal
• Categorical – values fall into two or more classes. The classes can be coded as
numbers as above, but the numbers are mere labels and have no quantitative
meaning. The variable Sex in the data set is a 2-level categorical variable.
• Nominal – no ranking is implied by the levels of the categorical variable.
• In this case, the coding of 1=Female and 2=Male does not imply that females are
better than males, nor vice versa
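The practical consequence of these two classifications can be sketched in Python. The five records below are invented for illustration; only the coding 1=Female, 2=Male comes from the slide.

```python
import statistics
from collections import Counter

# Hypothetical records using the coding from the slide: Sex 1 = Female, 2 = Male.
ages = [34.5, 41.0, 29.25, 52.0, 47.75]
sex_codes = [1, 2, 2, 1, 2]
sex_labels = {1: "Female", 2: "Male"}

# Numerical variable: a mean is meaningful.
mean_age = statistics.mean(ages)

# Categorical nominal variable: count each level rather than averaging the codes.
sex_counts = Counter(sex_labels[code] for code in sex_codes)

print(f"mean age: {mean_age}")
print(f"sex counts: {dict(sex_counts)}")
```

Averaging the codes would give 1.6 here, a number with no interpretation, which is why the numeric labels should be treated as categories.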
Types of variables
• When classifying variables in your data set by type, you are making an
assumption about the structure of the data.
• Different variable types imply different data structures and constraints, and
convey different information.
• The variable type assumption(s) affect the choice of statistical tools to
analyse and model the data.
• It is important to make valid variable type assumptions and to correctly
incorporate the assumed data structure into your analysis.
Survey error – can you trust the data source?
• Data are prone to errors. Four main types of survey error:
1. Coverage error – certain groups of items are excluded from
the sampling frame, so they have no chance of being selected.
Is coverage error an issue for the EV consumer survey?
2. Non-response error – failure to collect data on all items in the
survey, often recorded as a blank data entry.
E.g. income level is typically a sensitive question and is often missing
3. Sampling error – chance differences from sample to sample
EV example: repeat the survey at the beginning and at the end of the year. A different
subset of participants will respond, so the sample statistics will differ by chance.
4. Measurement error – values recorded in the survey are
different from the true response. E.g. a leading question ("Do you
have a problem with your boss?" vs "Tell me about your
relationship with your boss"); incorrect interpretation of a question
– EV example
– The respondent may have difficulty recalling the number of PEVs observed, or
whether any were observed at all.
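Of the four error types, sampling error (item 3) is the easiest to illustrate numerically. The Python sketch below draws several samples of the same size from one simulated population and compares their means. The population is invented for illustration and is not the EV survey data.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible

# Hypothetical population of ages (simulated, not the EV survey data).
population = [random.randint(18, 85) for _ in range(91_085)]
population_mean = statistics.mean(population)

# Draw several independent samples of the same size and compare their means.
sample_means = [
    statistics.mean(random.sample(population, 500)) for _ in range(5)
]

for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.2f} (population mean = {population_mean:.2f})")
```

The sample means differ from one another and from the population mean purely by chance, even though every sample was drawn the same way.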