2025 S1 ECOS 3997 (Stream 2: Data literacy)
Week 2: What is data?
Ellen Stuart
School of Economics
The University of Sydney
6 March 2025
Review
Our goal: Empirically estimate the revenue-maximizing linear income tax rate
What have we done so far?
• Wk1: Used a simple model to tell us what we need out of the data: the elasticity of taxable income (e)
What are we going to do today?
• Take a step back to think about “what is data?”
• Discuss some of the data sources frequently used in empirical economics
Lifecycle of empirical work
• Ask a Question
• Obtain Data
• Understand the Data
• Prediction and Inference
• Conclusions
Adapted from UC Berkeley Data Science 100.
What is data?
What is a dataset?
• Datasets are organized collections of information
• In other words: collections of data
What is data?
• Data are individual pieces of information (or a single observation)
• A single piece of information is a datum
Some other useful vocabulary:
• Data unit (a.k.a. record, observation): one entity of the population being
studied
– E.g., one person, one business
– These usually make up the rows of a dataset
• Data item (a.k.a. variable): one type of information (characteristic, attribute)
collected for each data unit
– E.g., height, date of birth
– These usually make up the columns of a dataset
Dataset example
Name | Age | Fav. Truck | Fav. Color
Joe | 6 | Cement mixer | Red
Zach | 3 | Garbage truck | Black
Joe and Zach are data units / records / observations / rows...
Age, favorite truck, and favorite color are data items / variables / fields...
The whole table is a dataset
Joe’s age is a single piece of data (a datum)
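A minimal sketch of the same table in Python, assuming a list of dictionaries as the container (an illustrative choice, not something the lecture prescribes). Each dictionary is one data unit, each key is a data item, and a single value is a datum.

```python
# The dataset: a collection of records (rows), one per child.
dataset = [
    {"name": "Joe", "age": 6, "fav_truck": "Cement mixer", "fav_color": "Red"},
    {"name": "Zach", "age": 3, "fav_truck": "Garbage truck", "fav_color": "Black"},
]

# Data units / records / observations: the rows.
n_records = len(dataset)

# Data items / variables / fields: the columns shared by every record.
variables = list(dataset[0].keys())

# A datum: one single piece of information, here Joe's age.
joe_age = dataset[0]["age"]

print(n_records, variables, joe_age)
```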
• This dataset organizes information about Joe and Zach
• The same information is contained in the following paragraph, but is not
organized for easy analysis
Joe and Zach are two delightful children who happen to love trucks. As in, they
really love trucks. They become sad when they are watching the garbage truck
work, and then the garbage truck drives away. Joe is 6, and Zach is 3. Did I
mention they love trucks? They love all trucks, but Joe’s favorite truck is the
cement mixer. His favorite color is red. In contrast, Zach’s favorite color is
black and his favorite truck is a garbage truck.
Name | Age | Fav. Truck | Fav. Color
Joe | 6 | Cement mixer | Red
Zach | 3 | Garbage truck | Black
Data can be quantitative or qualitative
• Quantitative data: measures of counts or values
• Answers questions like: How many? How often? How much?
• Represented by a number (can be continuous or discrete)
• In our previous example, “age” was a discrete quantitative data item
• Qualitative data: measures of types
• Answers questions like: What kind? Where?
• Represented by a word, phrase, or code
• May be ordinal (e.g., shirt sizes) or nominal (e.g., eye color)
• In our previous example, “favorite truck” and “favorite color” were nominal
qualitative data items
Quantitative and qualitative data are both critical for economic analysis
• Example: suppose a new childcare subsidy policy increases labor supply
(quantitative)
• We might want to know where the increase is coming from. In other words,
what mechanism increased labor supply? Through what channel?
– Intensive margin? (Same people worked more?)
– Extensive margin? (New people joined labor force? If yes, who were those
people? Women? Single parents? Need qualitative data!)
Many variables do not sit neatly in one of these categories
Question: Are Yelp stars a number (quantitative discrete) or not (qualitative
ordinal)?
Understanding the difference is important: it affects which statistics can meaningfully be produced
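One way to see why the distinction matters, sketched in Python with made-up star ratings (the numbers below are hypothetical, not real Yelp data). The mode only needs category labels and the median only needs an ordering, whereas the mean additionally assumes that the gap between 1 and 2 stars is the same as the gap between 4 and 5 stars.

```python
from statistics import mean, median, mode

# Hypothetical star ratings for one restaurant (illustrative only).
stars = [5, 4, 4, 1, 5, 4, 2]

# Meaningful even for nominal data: the most common category.
modal_stars = mode(stars)

# Meaningful if stars are treated as ordinal: only the ranking is used.
median_stars = median(stars)

# Only meaningful if stars are treated as quantitative (equal spacing).
mean_stars = mean(stars)

print(modal_stars, median_stars, round(mean_stars, 2))
```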
What do we do with data?
Once we have some data, what do we do with it?
• Clean & process the data
• Analyze & visualize the data
This is a cyclical process
What does it mean to clean/process data?
Ideally, a dataset looks something like this:
First Name | Last Name | Date of Birth | Income | Marital Status
Sarah | Kay | 1980-01-07 | 300,000 | 0
Matt | Reed | 1979-12-16 | 85,000 | 0
Ben | Smith | 1964-09-28 | 120,000 | 0
Hannah | Zhu | 1966-04-11 | 1,200,000 | 1
However, “raw” data usually looks more like this:
First Name | Last Name | Date of Birth | Income | Marital Status
Sarah | 564 | 1980-01-07 | 300,000 | 0
Matt | Reed | Dec 16 1979 | 85,000 | NA
Ben | Smith | 1964-09-28 | 120,000 | NA
Hannah | Zhu | 11/04/1966 | 1,200,000,000,000,000 | 1
• invalid values
• incorrect/inconsistent formatting
• outlier values
• missing information
• among others!
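A rough sketch of how the problems listed above might be flagged in Python, using the raw table as input. The validity rules here (letters-only last names, a 10 million income cap, the list of accepted date formats with day/month/year for slashed dates) are assumptions made for illustration; real cleaning rules depend on the dataset and are themselves judgment calls.

```python
from datetime import datetime

# Raw records copied from the table above; "NA" marks a missing value.
raw = [
    ("Sarah", "564", "1980-01-07", "300,000", "0"),
    ("Matt", "Reed", "Dec 16 1979", "85,000", "NA"),
    ("Ben", "Smith", "1964-09-28", "120,000", "NA"),
    ("Hannah", "Zhu", "11/04/1966", "1,200,000,000,000,000", "1"),
]

INCOME_CAP = 10_000_000  # assumed threshold for "implausibly large"
DATE_FORMATS = ["%Y-%m-%d", "%b %d %Y", "%d/%m/%Y"]  # assumed formats


def parse_date(text):
    """Return (date, was_standard); try the ISO format first, then fallbacks."""
    for i, fmt in enumerate(DATE_FORMATS):
        try:
            return datetime.strptime(text, fmt).date(), i == 0
        except ValueError:
            continue
    return None, False


def flag(record):
    """List the problems found in one raw record."""
    first, last, dob, income, married = record
    flags = []
    if not last.isalpha():
        flags.append("invalid value")
    _, standard = parse_date(dob)
    if not standard:
        flags.append("inconsistent formatting")
    if int(income.replace(",", "")) > INCOME_CAP:
        flags.append("outlier value")
    if married == "NA":
        flags.append("missing information")
    return flags


flags_by_name = {record[0]: flag(record) for record in raw}
for name, flags in flags_by_name.items():
    print(name, flags)
```

Flagging a value is not the same as fixing it: what to do with each flagged record is a separate decision.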
Data cleaning is the process of taking raw data and dealing with those (and other)
errors so that the data can be analyzed
Data cleaning inevitably involves decisions that may impact the results of the
analysis
Revisiting our example:
First Name | Last Name | Date of Birth | Income | Marital Status
Sarah | 564 | 1980-01-07 | 300,000 | 0
Matt | Reed | Dec 16 1979 | 85,000 | NA
Ben | Smith | 1964-09-28 | 120,000 | NA
Hannah | Zhu | 11/04/1966 | 1,200,000,000,000,000 | 1
• If we drop the records with missing marital status, 50% of the remaining sample looks married (instead of 25%, as in the ideal version of the table)
• And the outlier income: how do we know it’s not the real value? If the value
is real, do we leave it? If not, what do we do with it?
All empirical results are impacted by decisions like these
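The arithmetic behind the 50% versus 25% comparison can be sketched as follows. Treating a missing value as "not married" is only one possible decision among several, shown here to make the contrast concrete.

```python
# Marital status from the raw table: 1 = married, 0 = not, None = missing (NA).
marital_status = [0, None, None, 1]

# Decision A: drop records with missing marital status.
observed = [m for m in marital_status if m is not None]
share_dropping_missing = sum(observed) / len(observed)

# Decision B: treat missing as 0 (not married), as in the ideal table.
filled = [0 if m is None else m for m in marital_status]
share_filling_zero = sum(filled) / len(filled)

print(share_dropping_missing, share_filling_zero)  # 0.5 0.25
```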
Once we have made decisions about how to handle the raw data, it is time to
analyze the data
What are we doing when we analyze data?
• Answering a question
• Telling a story
Types of analysis:
• Visual
• Quantitative
Example: baby sleep patterns (visual analysis)
Example: baby sleep patterns (quantitative analysis)
What was the average total amount of sleep at night (7pm - 7am)?
• Kid 1: 8.5 hours
• Kid 2: 7.5 hours
What was the average nap length (sleep between 7am - 7pm)?
• Kid 1: 86 minutes
• Kid 2: 80 minutes
How old was each kid the first time they slept 6+ hours 3+ nights in a row?
• Kid 1: 1 month, 8 days
• Kid 2: 8 months, 2 days
Where does data come from?
Claim from the US Centers for Disease Control and Prevention (CDC):
In 2017-2018, 42.7% of Americans were obese
Question: How do they know?
(Note: this example comes from Emily Oster and is used with her permission.)
What would be the ideal data to measure this?
The whole population would weigh themselves at the same time of day, every day,
for some extended period of time and accurately report the data
This would give us accurate, precise, and regularly updated information about
obesity rates...but is also a bit alarming to contemplate!
More realistic options:
• De-identified medical records
• Fitness/diet-tracker apps
• Driver’s license applications & renewals
Pros:
• Doesn’t require new data collection
• Sample sizes are likely large
Cons:
• Each of these samples is selected (i.e., they don’t represent the full U.S.
population)
• In some cases, they rely on self-reported data
This gets at a bigger question: where does data come from?
• Everywhere!
(But actually, where does it come from?)
Primary data - researcher generates dataset. Examples:
• Observations
• Surveys
• Experiments
Secondary data - researcher uses dataset generated by someone else. Examples:
• Primary data collected by other researchers (e.g., HILDA)
• Organizations
• Businesses
• Text
• Browsing history
These sources are not mutually exclusive (i.e., data from these different sources
can be combined)
Example: an experiment where the treatment changes a question on a survey
Data in Economics
Economists draw on data from the following sources:
• Survey data
• Administrative data
• Data from experiments
• ...and more!
Data in Economics: Survey data
Back to our example: In 2017-2018, 42.7% of Americans were obese (CDC)
Question: How do they know?
Survey data: National Health and Nutrition Examination Survey (NHANES)
• Started in the 1960s, current iteration since 1999
• ≈ 5,000 individuals included each year
Two components:
1. Survey component: Demographics (race, income, education), health
conditions, and diet
2. Examination: Medical and dental tests (including weight, blood pressure)
Exams are done in specialized NHANES mobile examination units
The NHANES is designed as a representative sample
• Ideally: randomly survey 5,000 people from full U.S. population of ∼ 300m
• Infeasible (esp. given mobile examination units)
• In reality:
– Choose 15 random counties each year
– Choose random households within these counties
– Choose random individuals within the households
• From this, we can back out inferences about the full population (IF the people
randomly picked are actually surveyed)
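The three-stage selection described above can be sketched in Python as follows. The population is entirely hypothetical (100 counties with 50 households each), and the sketch picks one person per household with equal probabilities at each stage; it is a simplified illustration, not a description of the actual NHANES procedure.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 counties, 50 households each, 1-5 people each.
population = {
    f"county_{c}": {
        f"household_{c}_{h}": [
            f"person_{c}_{h}_{p}" for p in range(rng.randint(1, 5))
        ]
        for h in range(50)
    }
    for c in range(100)
}


def multistage_sample(population, n_counties, n_households, rng):
    """Stage 1: counties; stage 2: households; stage 3: one person each."""
    sample = []
    for county in rng.sample(sorted(population), n_counties):
        households = population[county]
        for household in rng.sample(sorted(households), n_households):
            sample.append(rng.choice(households[household]))
    return sample


sample = multistage_sample(population, n_counties=15, n_households=10, rng=rng)
print(len(sample), sample[:3])
```

Note that equal probabilities at each stage do not give every person the same overall chance of selection (someone in a five-person household is less likely to be picked than someone living alone), which is one reason survey weights are needed.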
Main issue: non-response
• Half of people contacted are willing to be surveyed
• Not all of those are willing to undergo the examination
• Problem: refusal to participate is non-random
So... what do you do?
Short answer: “reweight” the data
• Imagine your data has 100 people: 90 Sydneysiders, 10 Melbournians
• Overall population is 50% Sydneysiders, 50% Melbournians
• To reweight the data to be representative of the population:
– Count each of the 10 Melbournians 5 times
– Count each of the 90 Sydneysiders 5/9ths of a time
Reweighting becomes very complicated when the sample is imbalanced on multiple dimensions
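The weights in the Sydney/Melbourne example follow from one rule: weight = population share / sample share. A minimal sketch in Python, using exact fractions so the 5/9 shows up as written (the variable names are illustrative):

```python
from fractions import Fraction

sample_counts = {"Sydney": 90, "Melbourne": 10}
population_shares = {"Sydney": Fraction(1, 2), "Melbourne": Fraction(1, 2)}

n = sum(sample_counts.values())

# Weight = population share / sample share, for each group.
weights = {
    city: population_shares[city] / Fraction(count, n)
    for city, count in sample_counts.items()
}

# Weighted counts now match the population split (50 and 50 out of 100).
weighted_counts = {
    city: weights[city] * sample_counts[city] for city in sample_counts
}

print(weights)
print(weighted_counts)
```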
Catch: can only reweight based on things observed in the data
“The fundamental problem is that if your sample is selected based on features you
cannot see—unobservables—then you’re kind of out of luck for making precise
conclusions.” (Emily Oster)
Sampling is one of many challenges associated with getting credible survey data
Example: the University of Michigan offers an entire class called “Methods and
Theory of Sample Design”:
The theory underlying the methods of survey sampling widely used in
practice. It covers the basic techniques of simple random sampling,
stratification, systematic sampling, cluster and multi-stage sampling,
and probability proportional to size sampling. It also examines methods
of variance estimation for complex sample designs, including the Taylor
series expansion method, balanced repeated replications, and jackknife
methods.
This class is part of the Master’s Program in Survey Methodology offered by UM, an
entire degree devoted to understanding how to create credible survey data
What are some other issues to be aware of when using survey data?
• Non-responsiveness
– Even if we do our best to sample in a representative way, we still have issues if
not everyone responds (and they do not!)
• Survey fatigue and question ordering
• Leading questions (ex: what do you think of our new and improved product?)
• Rating scales mean different things to different people (ex: restaurant ratings)
• Self-reporting (ex: remembering what you ate, incentives to report a certain
thing)
Survey data also has advantages:
• Can include exactly (or close to) the data items of interest
• Lots of ways to be administered (online (Mechanical Turk), by phone, email,
snail mail, street corner...)
• Some methods allow for remote administering
• Helpful for qualitative responses
• Easy to implement (especially if you aren’t concerned with proper survey
design)
• Fast data collection times (potentially)
• Can collect large volumes of data at relatively low cost
Data in Economics: Administrative data
What is administrative data?
• Data that organizations (administrations) collect as part of their regular
management and operations
• Examples:
– Tax returns filed by individuals and organizations
– School enrollments and scores stored by universities
– Medicare enrollments and claims
– Credit card transactions used by financial institutions
Pros of using administrative data:
• Can have huge sample sizes
– Allows for analyses of small subgroups
– Allows for detection of small but meaningful changes in behavior
• Often easier to follow the same individual over time
• Not necessarily self-reported
• If data from multiple organizations can be linked, a broader set of issues can be explored
– E.g., education data and child tax credit data
• Cheap: data collection has already happened
Cons of using administrative data:
• Only people covered by the administration/organization are included
• Data not collected for research purposes
– Limited to the variables collected by the organization
– If a form changes, a variable may just stop being collected
– Potentially limited documentation
• Data often only available with a lag
• Can be expensive and/or hard for researchers to access
– Issues around confidentiality and disclosure
– Issues around “proprietary” data
• Subject to various sources of measurement error
Common sources of measurement error in administrative data:
• Invalid values
• Incorrect formatting
• Internal inconsistencies
• Outlier values
• Missing information
• Imputed information
Example: date of birth in the U.S. tax records (birth years 1937-1943)
[Figure: number of individuals (thousands, vertical axis from 0 to 80) by date of birth (horizontal axis from -90 to 90)]
“The Jan. 1 birth date is the common birth date we assign” - United States Citizenship and Immigration Services (USCIS), explaining how they process new immigrants who lack birth certificates.
Source: Bryant et al. (2024) Working Paper
Quick aside: Cool Australian data
There is amazing survey and administrative Australian data available to Economics
researchers, including (but not limited to):
• Household, Income and Labour Dynamics in Australia (HILDA)
• ATO Longitudinal Information Files (ALife)
• Person Level Integrated Data Asset (PLIDA)
• Business Longitudinal Analysis Data Environment (BLADE)
• Linked Employer-Employee Database (LEED)
Data in Economics: Experiments
Experiments are another source of data for economic research
• Lab experiments
• Field experiments
Lab experiments seek to understand human behavior and choices made in a
controlled setting
Closely related to lab experiments used in social psychology
– Experimental economics tends to be rooted in “economic thinking” (utility
and disutility, supply and demand, opportunity costs, sunk costs, market
frictions, etc.)
Pro: Highly controlled environment
Con: Highly artificial environment
Classic example: The Ultimatum Game (Güth et al. 1982)
Set-up:
• Player 1 has a sum of money, $X
• Player 1 decides how much of that sum to offer to Player 2, $Y
• Player 2 knows the value of $X and $Y and decides to accept or reject the offer
• If Player 2 accepts the offer, Player 2 receives $Y and Player 1 receives $X-$Y
• If Player 2 refuses the offer, both players receive nothing
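The payoff rules of the set-up can be written as a short Python function. The responder's decision rule below (accept any offer of at least a fixed share of the total) is an assumption added purely so the sketch can be run; the game itself leaves that decision to Player 2.

```python
def ultimatum_payoffs(total, offer, accepted):
    """Return (player_1_payoff, player_2_payoff) for one round."""
    if not 0 <= offer <= total:
        raise ValueError("offer must be between 0 and the total")
    if accepted:
        return total - offer, offer
    return 0, 0


def responder_accepts(total, offer, min_share):
    """Assumed rule: accept offers of at least min_share of the total."""
    return offer >= min_share * total


total = 100
for offer in (10, 30, 50):
    accepted = responder_accepts(total, offer, min_share=0.25)
    print(offer, accepted, ultimatum_payoffs(total, offer, accepted))
```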
Let’s play a round to see how this works
• Assume you are Player 2
• Player 1 has $100
• They will offer you some amount $0 ≤ $Y ≤ $100
• If you accept, you get $Y (and Player 1 gets $100 − $Y)
• If you reject, you get nothing (and Player 1 gets nothing)
Question: What is the lowest amount you would accept?
The Ultimatum Game
• Offers below 20-30% of the total are often rejected
• Can be played as a once-off game or repeated game
• Many variations (e.g., “competitive ultimatum game”)
• Framing effects matter: giving versus splitting versus taking (Leliveld et al.
2008), characterizing the game as a windfall (Lightner et al. 2017)
• Provides insights into how well our “rational agent” assumptions hold (with
caveats)
• Results differ across countries, cultures (e.g., shared communities more likely
to offer fair splits)
Field experiments: Actively applying a treatment “in the real world” (i.e., the field)
rather than in a lab
Less controlled than a lab experiment, but closer to the real world
Within field experiments, there is variation on whether subjects are aware that they
are taking part in an experiment
Example: tax salience in the U.S.
Example: Chetty, Looney, and Kroft (2009)
Context:
In the U.S., the price listed on the shelf does not include sales tax (GST
equivalent)
⇒ price faced at register ≠ price on shelf
Research question: how salient is U.S. sales tax to consumers?
Experiment: manipulate the salience of sales tax
Setting: U.S. supermarkets (30% of products subject to sales tax)
Treatment: Posted tax-inclusive prices on the shelf for a subset of products subject
to sales tax (≈ 7%)
Data: scanner data on weekly price and weekly quantity sold
Method: experimental difference-in-differences
Treatment group:
• Products: Cosmetics, Deodorants, and Hair Care Accessories
• Store: One large store in Northern California
• Time period: 3 weeks (February 22, 2006 – March 15, 2006)
Control groups:
• Products: Other products in same aisle (toothpaste, skin care, shave)
• Stores: Two nearby stores similar in demographic characteristics
• Time period: Calendar year 2005 and first 6 weeks of 2006
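The core difference-in-differences comparison can be sketched in Python as follows. The quantities are made-up numbers chosen for illustration, not results from Chetty, Looney, and Kroft (2009), and the sketch uses a single control group rather than the several control dimensions listed above.

```python
# Hypothetical average weekly quantities sold (made-up numbers, not the
# paper's results), before and during the intervention.
quantity = {
    ("treated", "before"): 100.0,
    ("treated", "during"): 90.0,
    ("control", "before"): 80.0,
    ("control", "during"): 78.0,
}


def diff_in_diff(q):
    """(Change in treated group) minus (change in control group)."""
    treated_change = q[("treated", "during")] - q[("treated", "before")]
    control_change = q[("control", "during")] - q[("control", "before")]
    return treated_change - control_change


effect = diff_in_diff(quantity)
print(effect)  # -8.0
```

Subtracting the control group's change removes whatever would have happened to sales anyway (seasonality, store-wide trends), under the assumption that treated and control products would otherwise have moved in parallel.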
Results: Tax-inclusive pricing resulted in a decline in quantity demanded
This is consistent with consumers, on average, underestimating the total price prior
to the intervention (i.e., tax was not fully salient)
As with all of the sources of data we’re discussing today, this field experiment
faces criticisms
There are many other fascinating examples of field experiments in economics
Some examples:
• Slemrod, Blumenthal, Christian (2001): Methods to reduce tax noncompliance
• Bertrand and Mullainathan (2004): Racial discrimination in the labor market
• Heller et al. (2017): High school mentoring programs as crime reduction
A special type of field experiment that is used particularly in development
economics is the Randomized Controlled Trial (RCT)
RCT: Identify the eligible population, select a random sample, and then randomly
assign individuals to a treatment (experimental) group and a control group
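The two randomization steps in that definition can be sketched in Python. The eligible population, the sample size, and the 50/50 split are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical eligible population of 1,000 individuals.
eligible = [f"person_{i}" for i in range(1000)]

# Step 1: draw a random sample from the eligible population.
sample = rng.sample(eligible, 200)

# Step 2: randomly assign half to treatment and half to control.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:100], shuffled[100:]

print(len(treatment), len(control))
```

Because assignment is random, the two groups should be similar on average in both observed and unobserved characteristics, so differences in outcomes can be attributed to the treatment.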
In 2019, Abhijit Banerjee, Esther Duflo, and Michael Kremer were awarded the Nobel
Prize in Economics for their groundbreaking work using RCTs to investigate
innovative methods to reduce poverty around the world
For example, Professor Kremer has conducted several RCTs in Kenyan schools to
understand how different interventions improve student performance
• Buying new textbooks? No impact.
• Providing de-worming pills? Reduced absenteeism by 25%
The results of this second study were ultimately scaled into a nationwide program
by the Kenyan government, and then by the Indian government
RCTs are often referred to as the “gold standard” for causal inference (i.e., for
estimating the extent to which X causes Y)
They can provide incredible evidence and insight into practical, real-world solutions
for big problems
Like all sources of data in empirical economics, they are not fault-free
• Expensive (both money & time)
• Will any estimated effects generalize?
• Will any estimated effects scale?
Pros of experiments:
• (Mostly) truly random variation
• (Relatively more) control of behavior environment
• Easier to replicate
• Easier to try and target a specific behavior
Cons of experiments:
• Designed by humans ⇒ subject to human error
• Some (or many) levels removed from the real world
• Scale, generalizability, cost (time, money)
• The Hawthorne Effect (idea that people behave differently when observed)
– Stronger version: people anticipate the hypothesis of the experiment and act to
try and please the experimenter
• Many results from lab experiments are WEIRD (Western, educated,
industrialized, rich and democratic)
– Estimated to be ∼80% of study participants, only 12% of world population
– Results don’t always hold up in other cultures
• How much do we worry about informed consent?
– Should individuals always be told they are participating in an experiment?
– How would that impact how we use experiments in research?
Data in Economics: Other sources of data
Other sources of data: websites and text
It’s becoming increasingly popular to scrape websites and/or to try to create
analyzable data out of text
These methods involve additional tools from Data Science and Computer Science
(especially Natural Language Processing, the branch of AI that deals with language
and text)
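As a very small illustration of turning text into analyzable data, the sketch below counts words in two sentences from the truck example earlier in the lecture. It uses only the Python standard library; the tokenization rule (lowercase runs of letters and apostrophes) is a simplifying assumption, and real text analysis typically relies on dedicated NLP tools.

```python
import re
from collections import Counter

# Two sentences from the truck paragraph earlier in the lecture.
text = (
    "Did I mention they love trucks? They love all trucks, "
    "but Joe's favorite truck is the cement mixer."
)

# Lowercase, then keep runs of letters and apostrophes as tokens.
tokens = re.findall(r"[a-z']+", text.lower())

# Bag of words: how often each token appears.
counts = Counter(tokens)

print(counts["trucks"], counts["truck"], counts["love"])
```

Note that "truck" and "trucks" are counted as different words, which is the kind of decision (here, whether to merge word forms) that text analysis forces the researcher to make.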
Apps and the internet:
• Yelp (e.g., Luca and Zervas 2016)
• Uber (e.g., Angrist et al. 2021)
• Airbnb (e.g., Kakar et al. 2018)
Text analysis:
• Tweets (e.g., Baylis 2020)
• Newspapers (e.g., Baker et al. 2016)
• Wills (e.g., Hines, Kummerfeld, and Stuart, in progress!)
A note of caution: before scraping a website, you must check its terms of use
https://www.yelp-support.com/article/Can-I-copy-or-scrape-data-from-the-Yelp-site
Wrapping up
Today:
• Overview of data: what it is and where it comes from
• Pros and cons of different data sources
Key take-aways:
• Even the best (realistic) data is not perfect
• It’s critical to think about where data comes from and what it’s really
measuring
• When you read the headline “study shows...” you should ask yourself “I
wonder what data they used.”
How does this relate to our research question?
• To answer our research question, we need an estimate of the elasticity of
taxable income
• When we start discussing different ways that economists have tried to estimate
this elasticity, one key input will be the data and the associated pros and cons
Next week (Week 3):
• Lecture: Summary statistics
• Tutorial: Prompt 1 (next slide)
References:
• Statistical Terms and Concepts, ABS,
https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts
• Where Does Data Come From? Emily Oster,
https://www.parentdata.org/p/where-does-data-come-from
Prompt 1
Two common sources of information on income are administrative data (e.g.,
tax returns) and survey data. What are some of the trade-offs between these
two data sources? Why might the tax return data be more credible than the
survey data? Why might the survey data be more reliable? Would you prefer
to work with one or the other? Why?
Note: to receive full marks, you must compare the two types of data in
addition to separately describing them.
