2025 S1 ECOS 3997 (Stream 2: Data literacy)
Week 2: What is data?
Ellen Stuart
School of Economics
The University of Sydney
6 March 2025

Review
Our goal: Empirically estimate the revenue-maximizing linear income tax rate
What have we done so far?
• Wk1: Used a simple model to tell us what we need out of the data (e)
What are we going to do today?
• Take a step back to think about “what is data?”
• Discuss some of the data sources frequently used in empirical economics

Lifecycle of empirical work
• Ask a Question
• Obtain Data
• Understand the Data
• Prediction and Inference
• Conclusions
Adapted from UC Berkeley Data Science 100.

What is data?
What is a dataset?
• Datasets are organized collections of information
• In other words: collections of data
What is data?
• Data are individual pieces of information (or a single observation)
• A single piece of information is a datum

What is data?
Some other useful vocabulary:
• Data unit (a.k.a. record, observation): one entity of the population being studied
  – E.g., one person, one business
  – These usually make up the rows of a dataset
• Data item (a.k.a. variable): one type of information (characteristic, attribute) collected for each data unit
  – E.g., height, date of birth
  – These usually make up the columns of a dataset

What is data?
Dataset example
Name  Age  Fav. Truck     Fav. Color
Joe   6    Cement mixer   Red
Zach  3    Garbage truck  Black
Joe and Zach are data units / records / observations / rows...
Age, favorite truck, and favorite color are data items / variables / fields...
The whole table is a dataset
Joe’s age is a single piece of data (a datum)

What is data?
• This dataset organizes information about Joe and Zach
• The same information is contained in the following paragraph, but is not organized for easy analysis

What is data?
Joe and Zach are two delightful children who happen to love trucks. As in, they really love trucks.
They become sad when they are watching the garbage truck work, and then the garbage truck drives away. Joe is 6, and Zach is 3. Did I mention they love trucks? They love all trucks, but Joe’s favorite truck is the cement mixer. His favorite color is red. In contrast, Zach’s favorite color is black and his favorite truck is a garbage truck.

Name  Age  Fav. Truck     Fav. Color
Joe   6    Cement mixer   Red
Zach  3    Garbage truck  Black

What is data?
Data can be quantitative or qualitative
• Quantitative data: measures of counts or values
  – Answers questions like: How many? How often? How much?
  – Represented by a number (can be continuous or discrete)
  – In our previous example, “age” was a discrete quantitative data item
• Qualitative data: measures of types
  – Answers questions like: What kind? Where?
  – Represented by a word, phrase, or code
  – May be ordinal (e.g., shirt sizes) or nominal (e.g., eye color)
  – In our previous example, “favorite truck” and “favorite color” were nominal qualitative data items

What is data?
Quantitative and qualitative data are both critical for economic analysis
• Example: suppose a new childcare subsidy policy increases labor supply (quantitative)
• We might want to know where the increase is coming from. In other words, what mechanism increased labor supply? Through what channel?
  – Intensive margin? (Same people worked more?)
  – Extensive margin? (New people joined labor force? If yes, who were those people? Women? Single parents? Need qualitative data!)

What is data?
Many variables do not sit neatly in one of these categories
Question: Are Yelp stars a number (quantitative discrete) or not (qualitative ordinal)?
Understanding the difference is important: it affects what statistics can be produced

What do we do with data?
Once we have some data, what do we do with it?
• Clean & process the data
• Analyze & visualize the data
This is a cyclical process

What do we do with data?
What does it mean to clean/process data?
Ideally, a dataset looks something like this:
First Name  Last Name  Date of Birth  Income     Marital Status
Sarah       Kay        1980-01-07     300,000    0
Matt        Reed       1979-12-16     85,000     0
Ben         Smith      1964-09-28     120,000    0
Hannah      Zhu        1966-04-11     1,200,000  1

What do we do with data?
However, “raw” data usually looks more like this:
First Name  Last Name  Date of Birth  Income                 Marital Status
Sarah       564        1980-01-07     300,000                0
Matt        Reed       Dec 16 1979    85,000                 NA
Ben         Smith      1964-09-28     120,000                NA
Hannah      Zhu        11/04/1966     1,200,000,000,000,000  1
• invalid values
• incorrect/inconsistent formatting
• outlier values
• missing information
• among others!

What do we do with data?
Data cleaning is the process of taking raw data and dealing with those (and other) errors so that the data can be analyzed
Data cleaning inevitably involves decisions that may impact the results of the analysis

What do we do with data?
Revisiting our example:
First Name  Last Name  Date of Birth  Income                 Marital Status
Sarah       564        1980-01-07     300,000                0
Matt        Reed       Dec 16 1979    85,000                 NA
Ben         Smith      1964-09-28     120,000                NA
Hannah      Zhu        11/04/1966     1,200,000,000,000,000  1
• If we ignore the missing data, 50% of our sample look married (instead of 25%)
• And the outlier income: how do we know it’s not the real value? If the value is real, do we leave it? If not, what do we do with it?
All empirical results are impacted by decisions like these

What do we do with data?
Once we have made decisions about how to handle the raw data, it is time to analyze the data
What are we doing when we analyze data?
• Answering a question
• Telling a story
Types of analysis:
• Visual
• Quantitative

What do we do with data?
Example: baby sleep patterns (visual analysis)

What do we do with data?
Example: baby sleep patterns (quantitative analysis)
What was the average total amount of sleep at night (7pm - 7am)?
• Kid 1: 8.5 hours
• Kid 2: 7.5 hours
What was the average nap length (sleep between 7am - 7pm)?
• Kid 1: 86 minutes
• Kid 2: 80 minutes
How old was each kid the first time they slept 6+ hours 3+ nights in a row?
• Kid 1: 1 month, 8 days
• Kid 2: 8 months, 2 days

Where does data come from?
Claim from the US Centers for Disease Control and Prevention (CDC):
In 2017-2018, 42.7% of Americans were obese
Question: How do they know?
(Note: this example comes from Emily Oster and is used with her permission.)

Where does data come from?
What would be the ideal data to measure this?
The whole population would weigh themselves at the same time of day, every day, for some extended period of time and accurately report the data
This would give us accurate, precise, and regularly updated information about obesity rates...but is also a bit alarming to contemplate!

Where does data come from?
More realistic options:
• De-identified medical records
• Fitness/diet-tracker apps
• Driver’s license applications & renewals

Where does data come from?
Pros:
• Doesn’t require new data collection
• Sample sizes are likely large
Cons:
• Each of these samples is selected (i.e., they don’t represent the full U.S. population)
• In some cases, they rely on self-reported data

Where does data come from?
This gets at a bigger question: where does data come from?
• Everywhere! (But actually, where does it come from?)

Where does data come from?
Primary data - researcher generates dataset. Examples:
• Observations
• Surveys
• Experiments
Secondary data - researcher uses dataset generated by someone else. Examples:
• Primary data collected by other researchers (e.g., HILDA)
• Organizations
• Businesses
• Text
• Browsing history

Where does data come from?
These sources are not mutually exclusive (i.e., data from these different sources can be combined)
Example: an experiment where the treatment changes a question on a survey

Data in Economics
Economists draw on data from the following sources:
• Survey data
• Administrative data
• Data from experiments
• ...and more!

Data in Economics: Survey data
Back to our example:
In 2017-2018, 42.7% of Americans were obese (CDC)
Question: How do they know?

Data in Economics: Survey data
Survey data: National Health and Nutrition Examination Survey (NHANES)
• Started in 1960s, current iteration since 1999
• ≈ 5,000 individuals included each year
Two components:
1. Survey component: Demographics (race, income, education), health conditions, and diet
2. Examination: Medical and dental tests (including weight, blood pressure)
Exams are done in specialized NHANES mobile examination units

Data in Economics: Survey data
The NHANES is designed as a representative sample
• Ideally: randomly survey 5,000 people from full U.S. population of ∼ 300m
• Infeasible (esp. given mobile examination units)
• In reality:
  – Choose 15 random counties each year
  – Choose random households within these counties
  – Choose random individuals within the households
• From this, we can back out inferences about the full population (IF the people randomly picked are actually surveyed)

Data in Economics: Survey data
Main issue: non-response
• Half of people contacted are willing to be surveyed
• Not all of those are willing to undergo the examination
• Problem: refusal to participate is non-random
So... what do you do?

Data in Economics: Survey data
Short answer: “reweight” the data
• Imagine your data has 100 people: 90 Sydneysiders, 10 Melbournians
• Overall population is 50% Sydneysiders, 50% Melbournians
• To reweight the data to be representative of the population:
  – Count each of the 10 Melbournians 5 times
  – Count each of the 90 Sydneysiders 5/9ths of a time
Reweighting becomes very complicated when the sample is imbalanced on multiple dimensions

Data in Economics: Survey data
Catch: can only reweight based on things observed in the data
“The fundamental problem is that if your sample is selected based on features you cannot see—unobservables—then you’re kind of out of luck for making precise conclusions.” (Emily Oster)

Data in Economics: Survey data
Sampling is one of many challenges associated with getting credible survey data
Example: the University of Michigan offers an entire class called “Methods and Theory of Sample Design”:
The theory underlying the methods of survey sampling widely used in practice. It covers the basic techniques of simple random sampling, stratification, systematic sampling, cluster and multi-stage sampling, and probability proportional to size sampling. It also examines methods of variance estimation for complex sample designs, including the Taylor series expansion method, balanced repeated replications, and jackknife methods.
This class is part of the Master’s Program in Survey Methodology offered by UM, an entire degree devoted to understanding how to create credible survey data

Data in Economics: Survey data
What are some other issues to be aware of when using survey data?
• Non-responsiveness
  – Even if we do our best to sample in a representative way, we still have issues if not everyone responds (and they do not!)
• Survey fatigue and question ordering
• Leading questions (ex: what do you think of our new and improved product?)
• Rating scales mean different things to different people (ex: restaurant ratings)
• Self-reporting (ex: remembering what you ate, incentives to report a certain thing)

Data in Economics: Survey data
Survey data also has advantages:
• Can include exactly (or close to) the data items of interest
• Lots of ways to be administered (online (Mechanical Turk), by phone, email, snail mail, street corner...)
• Some methods allow for remote administering
• Helpful for qualitative responses
• Easy to implement (especially if you aren’t concerned with proper survey design)
• Fast data collection times (potentially)
• Can collect large volumes of data at relatively low cost

Data in Economics: Administrative data
What is administrative data?
• Data that organizations (administrations) collect as part of their regular management and operations
• Examples:
  – Tax returns filed by individuals and organizations
  – School enrollments and scores stored by universities
  – Medicare enrollments and claims
  – Credit card transactions used by financial institutions

Data in Economics: Administrative data
Pros of using administrative data:
• Can have huge sample sizes
  – Allows for analyses of small subgroups
  – Allows for detection of small but meaningful changes in behavior
• Often easier to follow the same individual over time
• Not necessarily self-reported
• If we can link data from multiple organizations, we can explore a broader set of issues
  – E.g., education data and child tax credit data
• Cheap: data collection has already happened

Data in Economics: Administrative data
Cons of using administrative data:
• Only people covered by the administration/organization are included
• Data not collected for research purposes
  – Limited to the variables collected by the organization
  – If a form changes, a variable may just stop being collected
  – Potentially limited documentation
• Data often only available with a lag
• Can be expensive and/or hard for researchers to access
  – Issues
around confidentiality and disclosure
  – Issues around “proprietary” data
• Subject to various sources of measurement error

Data in Economics: Administrative data
Common sources of measurement error in administrative data:
• Invalid values
• Incorrect formatting
• Internal inconsistencies
• Outlier values
• Missing information
• Imputed information

Data in Economics: Administrative data
Example: date of birth in the U.S. tax records (birth years 1937-1943)
[Figure: Number of Individuals (Thousands) plotted against Date of Birth]
“The Jan. 1 birth date is the common birth date we assign” - United States Citizenship and Immigration Services (USCIS) explaining how they process new immigrants who lack birth certificates.
Source: Bryant et al. (2024) Working Paper

Quick aside: Cool Australian data
There is amazing survey and administrative Australian data available to Economics researchers, including (but not limited to):
• Household, Income and Labour Dynamics in Australia (HILDA)
• ATO Longitudinal Information Files (ALife)
• Person Level Integrated Data Asset (PLIDA)
• Business Longitudinal Analysis Data Environment (BLADE)
• Linked Employer-Employee Database (LEED)

Data in Economics: Experiments
Experiments are another source of data for economic research
• Lab experiments
• Field experiments

Data in Economics: Experiments
Lab experiments seek to understand human behavior and choices made in a controlled setting
Closely related to lab experiments used in social psychology
  – Experimental economics tends to be rooted in “economic thinking” (utility and disutility, supply and demand, opportunity costs, sunk costs, market frictions, etc.)
Pro: Highly controlled environment
Con: Highly artificial environment

Data in Economics: Experiments
Classic example: The Ultimatum Game (Güth et al.
1982)
Set-up:
• Player 1 has a sum of money, $X
• Player 1 decides how much of that sum to offer to Player 2, $Y
• Player 2 knows the value of $X and $Y and decides to accept or reject the offer
• If Player 2 accepts the offer, Player 2 receives $Y and Player 1 receives $X − $Y
• If Player 2 rejects the offer, both players receive nothing

Data in Economics: Experiments
Let’s play a round to see how this works
• Assume you are Player 2
• Player 1 has $100
• They will offer you some amount $0 ≤ $Y ≤ $100
• If you accept, you get $Y (and Player 1 gets $100 − $Y)
• If you reject, you get nothing (and Player 1 gets nothing)
Question: What is the lowest amount you would accept?

Data in Economics: Experiments
The Ultimatum Game
• Offers below 20-30% often rejected
• Can be played as a once-off game or repeated game
• Many variations (e.g., “competitive ultimatum game”)
• Framing effects matter: giving versus splitting versus taking (Leliveld et al. 2008), characterizing the game as a windfall (Lightner et al. 2017)
• Provides insights into how well our “rational agent” assumptions hold (with caveats)
• Results differ across countries, cultures (e.g., shared communities more likely to offer fair splits)

Data in Economics: Experiments
Field experiments: Actively applying a treatment “in the real world” (i.e., the field) rather than in a lab
Less controlled than a lab experiment, but closer to the real world
Within field experiments, there is variation on whether subjects are aware that they are taking part in an experiment
Example: tax salience in the U.S.

Data in Economics: Experiments
Example: Chetty, Looney, and Kroft (2009)
Context: In the U.S., the price listed on the shelf does not include sales tax (GST equivalent)
⇒ price faced at register ≠ price on shelf
Research question: how salient is U.S. sales tax to consumers?

Data in Economics: Experiments
Experiment: manipulate the salience of sales tax
Setting: U.S.
supermarkets (30% of products subject to sales tax)
Treatment: Posted tax-inclusive prices on the shelf for a subset of products subject to sales tax (≈ 7%)

Data in Economics: Experiments
Data: scanner data on weekly price and weekly quantity sold
Method: experimental difference-in-differences
Treatment group:
• Products: Cosmetics, Deodorants, and Hair Care Accessories
• Store: One large store in Northern California
• Time period: 3 weeks (February 22, 2006 – March 15, 2006)
Control groups:
• Products: Other products in same aisle (toothpaste, skin care, shave)
• Stores: Two nearby stores similar in demographic characteristics
• Time period: Calendar year 2005 and first 6 weeks of 2006

Data in Economics: Experiments
Results: Tax-inclusive pricing resulted in a decline in quantity demanded
This is consistent with consumers, on average, underestimating the total price prior to the intervention (i.e., tax was not fully salient)
As with all of the sources of data we’re discussing today, this field experiment faces criticisms

Data in Economics: Experiments
There are many other fascinating examples of field experiments in economics
Some examples:
• Slemrod, Blumenthal, Christian (2001): Methods to reduce tax noncompliance
• Bertrand and Mullainathan (2004): Racial discrimination in the labor market
• Heller et al.
(2017): High school mentoring programs as crime reduction

Data in Economics: Experiments
A special type of field experiment that is used particularly in development economics is the Randomized Controlled Trial (RCT)
RCT: Identify the eligible population, select a random sample, and then randomly assign individuals to a treatment (experimental) group and a control group
In 2019, Abhijit Banerjee, Esther Duflo, and Michael Kremer were awarded the Nobel Prize in Economics for their groundbreaking work using RCTs to investigate innovative methods to reduce poverty around the world

Data in Economics: Experiments
For example, Professor Kremer has conducted several RCTs in Kenyan schools to understand how different interventions improve student performance
• Buying new textbooks? No impact.
• Providing de-worming pills? Reduced absenteeism by 25%
The results of this second study were ultimately scaled into a nationwide program by the Kenyan government, and then by the Indian government

Data in Economics: Experiments
RCTs are often referred to as the “gold standard” for causal inference (i.e., for estimating the extent to which X causes Y)
They can provide incredible evidence and insight into practical, real-world solutions for big problems
Like all sources of data in empirical economics, they are not fault-free
• Expensive (both money & time)
• Will any estimated effects generalize?
• Will any estimated effects scale?
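
The RCT recipe above (randomly assign units to treatment and control, then compare outcomes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the "true effect" of 2.0 is made up, and the function names `assign_treatment` and `difference_in_means` are hypothetical helpers, not part of any study discussed in these slides.

```python
import random
import statistics


def assign_treatment(ids, seed=0):
    """Randomly split unit IDs into a treatment half and a control half."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def difference_in_means(outcomes, treated, control):
    """Estimate the average treatment effect as mean(treated) - mean(control)."""
    treated_mean = statistics.mean([outcomes[i] for i in treated])
    control_mean = statistics.mean([outcomes[i] for i in control])
    return treated_mean - control_mean


# Simulated (made-up) data: 1,000 units, with a true effect of 2.0 by construction.
TRUE_EFFECT = 2.0
ids = list(range(1000))
treated, control = assign_treatment(ids, seed=1)

rng = random.Random(42)
outcomes = {
    i: rng.gauss(10.0, 3.0) + (TRUE_EFFECT if i in treated else 0.0)
    for i in ids
}

estimate = difference_in_means(outcomes, treated, control)
print(f"Estimated effect: {estimate:.2f} (true effect in this simulation: {TRUE_EFFECT})")
```

Because assignment is random, the two groups are similar on average in everything except the treatment, so the simple difference in means recovers something close to the effect built into the simulation. Real RCTs add complications this sketch ignores (non-compliance, attrition, clustering).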

Data in Economics: Experiments
Pros of experiments:
• (Mostly) truly random variation
• (Relatively more) control of behavior environment
• Easier to replicate
• Easier to try and target a specific behavior

Data in Economics: Experiments
Cons of experiments:
• Designed by humans ⇒ subject to human error
• Some (or many) levels removed from the real world
• Scale, generalizability, cost (time, money)
• The Hawthorne Effect (idea that people behave differently when observed)
  – Stronger version: people anticipate the hypothesis of the experiment and act to try and please the experimenter
• Many results from lab experiments are WEIRD (Western, educated, industrialized, rich and democratic)
  – Estimated to be ∼80% of study participants, only 12% of world population
  – Results don’t always hold up in other cultures
• How much do we worry about informed consent?
  – Should individuals always be told they are participating in an experiment?
  – How would that impact how we use experiments in research?

Data in Economics: Other sources of data
Other sources of data: websites and text
It’s becoming increasingly popular to scrape websites and/or to try to create analyzable data out of text
These methods involve additional tools from Data Science and Computer Science (especially Natural Language Processing, the branch of AI that deals with language and text)

Data in Economics: Other sources of data
Apps and the internet:
• Yelp (e.g., Luca and Zervas 2016)
• Uber (e.g., Angrist et al. 2021)
• Airbnb (e.g., Kakar et al. 2018)
Text analysis:
• Tweets (e.g., Baylis 2020)
• Newspapers (e.g., Baker et al. 2016)
• Wills (e.g., Hines, Kummerfeld, and Stuart, in progress!)

Data in Economics: Other sources of data
A note of caution: before scraping a website, you must check its terms of use
https://www.yelp-support.com/article/Can-I-copy-or-scrape-data-from-the-Yelp-site

Wrapping up
Today:
• Overview of data: what it is and where it comes from
• Pros and cons of different data sources
Key take-aways:
• Even the best (realistic) data is not perfect
• It’s critical to think about where data comes from and what it’s really measuring
• When you read the headline “study shows...” you should ask yourself “I wonder what data they used.”

Wrapping up
How does this relate to our research question?
• To answer our research question, we need an estimate of the elasticity of taxable income
• When we start discussing different ways that economists have tried to estimate this elasticity, one key input will be the data and the associated pros and cons

Wrapping up
Next week (Week 3):
• Lecture: Summary statistics
• Tutorial: Prompt 1 (next slide)
References:
• Statistical Terms and Concepts, ABS, https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts
• Where Does Data Come From? Emily Oster, https://www.parentdata.org/p/where-does-data-come-from

Prompt 1
Two common sources of information on income are administrative data (e.g., tax returns) and survey data. What are some of the trade-offs between these two data sources? Why might the tax return data be more credible than the survey data? Why might the survey data be more reliable? Would you prefer to work with one or the other? Why?
Note: to receive full marks, you must compare the two types of data in addition to separately describing them.
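
Supplementary sketch: reweighting
The reweighting example from the survey data slides (a sample of 90 Sydneysiders and 10 Melbournians, drawn from a population that is 50% each) can be reproduced with a short Python sketch. The rule used here, weight = population share / sample share, is one simple way to get the slide's numbers; the function name `reweight` is a hypothetical helper, and real survey weights are typically built on many dimensions at once.

```python
from collections import Counter


def reweight(sample_groups, population_shares):
    """Return one weight per group: population share divided by sample share."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return {
        group: population_shares[group] / (counts[group] / n)
        for group in counts
    }


# The sample from the slide: 90 Sydneysiders and 10 Melbournians.
sample = ["Sydney"] * 90 + ["Melbourne"] * 10
weights = reweight(sample, {"Sydney": 0.5, "Melbourne": 0.5})

# Weighted head counts: each group should now count as 50 of the 100 people.
weighted_counts = {
    group: weights[group] * count for group, count in Counter(sample).items()
}
print(weights)
print(weighted_counts)
```

Each Melbournian gets a weight of 0.5 / 0.10 = 5 and each Sydneysider a weight of 0.5 / 0.90 = 5/9, matching the slide. The sketch also makes the catch visible: the weights depend only on the group label, so they can only correct for imbalance on characteristics that are observed in the data.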