Introduction to Handling Data
ECON20222 - Lecture 2
Ralf Becker and Martyn Andrews
February 2023
Ralf Becker and Martyn Andrews Introduction to Handling Data February 2023 1 / 40
Aim for today
Explore data
Review hypothesis testing
Review simple regression analysis
Become more familiar with R
Preparing your workfile
We add the basic libraries needed for this week’s work:
library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphics
library(stargazer) # to produce nice results tables
New Dataset - Wellbeing
Doing Economics Project 8 deals with international wellbeing data.
Data are from the European Values Study (EVS).
A large catalogue of questions on Perceptions of Life, Politics and
Society, Work, Religion etc
48 mainly European countries
Four waves/years of data (1981, 1990, 1999 and 2008)
129,515 observations (people/respondents)
load("WBdata.Rdata") # import data
This will load two objects into your environment
wb_data - the actual data file
wb_data_Des - a table which contains a description of each
variable
To get to this dataset a significant amount of data handling and cleaning
had to happen (see Project 8 in Doing Economics.)
Wellbeing Data
str(wb_data) # prints some basic info on variables
## tibble [129,515 x 19] (S3: tbl_df/tbl/data.frame)
## $ S002EVS : chr [1:129515] "1981-1984" "1981-1984" "1981-1984" "1981-1984" ...
## $ S003 : chr [1:129515] "Belgium" "Belgium" "Belgium" "Belgium" ...
## $ S006 : num [1:129515] 1001 1002 1003 1004 1005 ...
## $ A009 : num [1:129515] 3 5 2 5 5 5 5 5 4 4 ...
## $ A170 : num [1:129515] 9 9 3 9 9 9 9 10 8 10 ...
## $ C036 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C037 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C038 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C039 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C041 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ X001 : chr [1:129515] "Male" "Male" "Male" "Female" ...
## $ X003 : num [1:129515] 53 30 61 60 60 19 38 39 44 76 ...
## $ X007 : chr [1:129515] "Single/Never married" "Married" "Separated" "Married" ...
## $ X011_01 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ X025A : chr [1:129515] NA NA NA NA ...
## $ Education_1: num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ Education_2: chr [1:129515] NA NA NA NA ...
## $ X028 : chr [1:129515] "Full time" "Full time" "Unemployed" "Housewife" ...
## $ X047D : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
Data Description
wb_data_Des[1:10,] # prints the first 10 rows of the description table
## Names Labels
## 1 S002EVS EVS-wave
## 2 S003 Country/region
## 3 S006 Respondent number
## 4 A009 Health
## 5 A170 Life satisfaction
## 6 C036 Work Q1
## 7 C037 Work Q2
## 8 C038 Work Q3
## 9 C039 Work Q4
## 10 C041 Work Q5
## Description
## 1 EVS-wave
## 2 Country/region
## 3 Original respondent number
## 4 State of health (subjective), 1 = Very Poor, 5 = Very good
## 5 Satisfaction with your life
## 6 To develop talents you need to have a job, 1 = Strongly Agree, 5 = Strongly Disagree
## 7 Humiliating to receive money without having to work for it, 1 = Strongly Agree, 5 = Strongly Disagree
## 8 People who don't work become lazy, 1 = Strongly Agree, 5 = Strongly Disagree
## 9 Work is a duty towards society, 1 = Strongly Agree, 5 = Strongly Disagree
## 10 Work should come first even if it means less spare time, 1 = Strongly Agree, 5 = Strongly Disagree
Data - Countries
Let’s find out which countries are in the sample:
unique(wb_data$S003) # unique finds all the different values in a variable
## [1] "Belgium" "Canada" "Denmark"
## [4] "France" "Germany" "Iceland"
## [7] "Ireland" "Italy" "Malta"
## [10] "Netherlands" "Norway" "Spain"
## [13] "Sweden" "Great Britain" "United States"
## [16] "Northern Ireland" "Austria" "Bulgaria"
## [19] "Czech Republic" "Estonia" "Finland"
## [22] "Hungary" "Latvia" "Lithuania"
## [25] "Poland" "Portugal" "Romania"
## [28] "Slovakia" "Slovenia" "Croatia"
## [31] "Greece" "Russian Federation" "Turkey"
## [34] "Albania" "Armenia" "Bosnia Herzegovina"
## [37] "Belarus" "Cyprus" "Northern Cyprus"
## [40] "Georgia" "Luxembourg" "Moldova"
## [43] "Montenegro" "Serbia" "Switzerland"
## [46] "Ukraine" "Macedonia" "Kosovo"
Point out what unique does.
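A minimal sketch of what unique() does, using a small made-up vector in place of wb_data$S003 (the values below are illustrative only, not taken from the dataset):

```r
# Toy vector standing in for a country variable (illustrative values only)
country <- c("Belgium", "Canada", "Belgium", "Denmark", "Canada")

# unique() keeps each distinct value once, in order of first appearance
distinct_countries <- unique(country)
print(distinct_countries)

# The number of distinct values is then just the length of that vector
n_countries <- length(distinct_countries)
print(n_countries)
```

Applied to wb_data$S003, the same idea gives the list of 48 countries shown above.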
Data - Waves
Let’s find out how many observations/respondents we have for each
country (S003) in each wave (S002EVS).
Use the piping technique of the tidyverse:
table1 <- wb_data %>% group_by(S002EVS,S003) %>% # groups by Wave and Country
summarise(n = n()) %>% # calculating no of obs
spread(S002EVS,n) %>% # put Waves across columns
print(n=4)
## # A tibble: 48 x 5
## S003 `1981-1984` `1990-1993` `1999-2001` `2008-2010`
##
## 1 Albania NA NA NA 1200
## 2 Armenia NA NA NA 1224
## 3 Austria NA 1432 NA 1216
## 4 Belarus NA NA NA 1237
## # ... with 44 more rows
For each country ($j = 1, \dots, 48$) we have observations from potentially
four years ($t = 1, \dots, 4$). For each country-year ($jt$) we have
$i = 1, \dots, n_{jt}$ observations, e.g. $n_{Austria,1990} = 1432$. Each observation
can be identified/indexed by $ijt$. This data structure is called a repeated cross-section.
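The country-by-wave counts can also be sketched in base R. The toy data frame below is hypothetical (one row per respondent, with made-up column names wave and country); it only illustrates how the cell counts $n_{jt}$ arise.

```r
# Hypothetical toy data: one row per respondent (not the real EVS data)
toy <- data.frame(
  wave    = c("1990-1993", "1990-1993", "2008-2010", "2008-2010", "2008-2010"),
  country = c("Austria", "Austria", "Austria", "Albania", "Albania")
)

# Cross-tabulate: rows are countries (j), columns are waves (t), cells are n_jt
n_jt <- table(toy$country, toy$wave)
print(n_jt)

# Number of respondents for one country-wave cell
n_jt["Austria", "1990-1993"]
```

Note that table() reports 0 for a country-wave with no respondents, whereas the tidyverse pipeline above shows NA for such cells.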
Data - Some graphical representation
Summarise data by country and wave.
A170: All things considered, how satisfied are you with your life as
a whole these days? (1 Dissatisfied to 10 Satisfied)
A009: All in all, how would you describe your state of health these
days? Would you say it is ... (1 Very poor to 5 Very good)
table2 <- wb_data %>% group_by(S002EVS,S003) %>% # groups by Wave and Country
summarise(Avg_LifeSatis = mean(A170),Avg_Health = mean(A009))
head(table2,4)
## # A tibble: 4 x 4
## # Groups: S002EVS [1]
## S002EVS S003 Avg_LifeSatis Avg_Health
##
## 1 1981-1984 Belgium 7.37 4.01
## 2 1981-1984 Canada 7.82 4.20
## 3 1981-1984 Denmark 8.21 4.18
## 4 1981-1984 France 6.71 3.72
table2 now contains an observation for each country-year with
Avg_LifeSatis and Avg_Health
Data - Some graphical representation
ggplot(table2,aes(Avg_Health,Avg_LifeSatis, colour=S002EVS)) +
geom_point() +
ggtitle("Health v Life Satisfaction")
[Figure: scatter plot "Health v Life Satisfaction". x-axis: Avg_Health; y-axis: Avg_LifeSatis; points coloured by wave (S002EVS: 1981-1984, 1990-1993, 1999-2001, 2008-2010).]
Data - Some graphical representation
Summarise data by country and wave.
A170: All things considered, how satisfied are you with your life as
a whole these days? (1 Dissatisfied to 10 Satisfied)
C041: Work should come first even if it means less spare time, 1 =
Strongly Agree, 5 = Strongly Disagree
table2 <- wb_data %>% group_by(S002EVS,S003) %>%
summarise(Avg_LifeSatis = mean(A170),Avg_WorkFirst = mean(C041))
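One thing to watch in group summaries like this: C041 contains missing values (NA) in the early waves, and mean() returns NA as soon as one value in a group is missing. A minimal base R sketch with made-up numbers (the toy data frame and its values are assumptions for illustration):

```r
# Hypothetical toy data: wave B has one missing answer
toy <- data.frame(
  wave = c("A", "A", "B", "B"),
  C041 = c(2, 4, NA, 3)
)

# Default behaviour: a single NA makes the group mean NA
mean_default <- tapply(toy$C041, toy$wave, mean)
print(mean_default)

# With na.rm = TRUE the mean is taken over the observed values only
mean_narm <- tapply(toy$C041, toy$wave, mean, na.rm = TRUE)
print(mean_narm)
```

Whether dropping missing values is appropriate depends on why they are missing, so na.rm = TRUE is a choice to make deliberately rather than a default fix.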
Data - Some graphical representation
ggplot(table2,aes( x=Avg_WorkFirst, y=Avg_LifeSatis,colour=S002EVS)) +
geom_point() +
ggtitle("Work First v Life Satisfaction")
[Figure: scatter plot "Work First v Life Satisfaction". x-axis: Avg_WorkFirst; y-axis: Avg_LifeSatis; points coloured by wave (S002EVS: 1981-1984, 1990-1993, 1999-2001, 2008-2010).]
Data - Some graphical representation
There seems to be a negative relationship between Attitude to Work
(WC) and Life Satisfaction (LS): in countries with a “work-centric”
ethic (low values of C041) people were on average happier.
Is there such a relationship inside countries as well?
We will calculate correlations for each country-wave, e.g. Austria in 2008:
$$Corr_{Aut,2008}(LS_{i,Aut,2008}, WC_{i,Aut,2008}) = \frac{Cov_{Aut,2008}(LS_{i,Aut,2008}, WC_{i,Aut,2008})}{s_{LS,Aut,2008}\; s_{WC,Aut,2008}}$$
table3 <- wb_data %>% filter(S002EVS == "2008-2010") %>%
group_by(S003) %>% # groups by Country
summarise(cor_LS_WF = cor(A170,C041,use = "pairwise.complete.obs"),
med_income = median(X047D)) %>%
arrange(cor_LS_WF)
Point out that correlations are in [−1, 1]. They are standardised
covariances. Ensure you revise how to calculate sample s.d. and
covariances!
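As a revision aid, here is a small sketch that computes the sample covariance, the sample standard deviations and the correlation by hand and compares the result with R's cor(). The two vectors are made-up numbers, not EVS data.

```r
# Made-up answers for six respondents (illustrative only)
life_sat   <- c(7, 8, 5, 9, 6, 8)
work_first <- c(3, 2, 4, 1, 3, 2)
n <- length(life_sat)

# Sample covariance: average cross-product of deviations, dividing by n - 1
cov_manual <- sum((life_sat - mean(life_sat)) * (work_first - mean(work_first))) / (n - 1)

# Sample standard deviations, also dividing by n - 1
sd_ls <- sqrt(sum((life_sat - mean(life_sat))^2) / (n - 1))
sd_wf <- sqrt(sum((work_first - mean(work_first))^2) / (n - 1))

# Correlation = covariance standardised by the two standard deviations
cor_manual <- cov_manual / (sd_ls * sd_wf)

print(cor_manual)
print(cor(life_sat, work_first))  # should agree with the manual value
```

Because the covariance is divided by both standard deviations, the result is unit-free and always lies in [-1, 1].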
Data - Some graphical representation
ggplot(table3,aes( cor_LS_WF, med_income)) +
geom_point() +
ggtitle("Corr(Life Satisfaction, Work First) v Median Income")
[Figure: scatter plot "Corr(Life Satisfaction, Work First) v Median Income". x-axis: cor_LS_WF (roughly -0.1 to 0.1); y-axis: med_income.]
Note that the correlation between LS and WC is close to 0 in most
countries.
Data on Maps
Geographical relationships are sometimes best illustrated with maps.
Sometimes these will reveal a pattern.
R can create great maps (but it requires a bit of setup - see the
additional file on BB). You need the following
A shape file for each country
The statistics for each country
A procedure to merge these bits of information into one
[...]
) %>% # pick British data
filter(S002EVS == "2008-2010") # pick latest wave
Now we run a regression of the Life Satisfaction variable (A170) against
a constant only.
$$LifeSatis_i = \alpha + u_i$$
mod1 <- lm(A170~1,data=test_data)
Regression Analysis - Example 1
We use the stargazer function to display regression results
##
## ===============================================
## Dependent variable:
## ---------------------------
## A170
## -----------------------------------------------
## Constant 7.530***
## (0.063)
##
## -----------------------------------------------
## Observations 997
## R2 0.000
## Adjusted R2 0.000
## Residual Std. Error 2.001 (df = 996)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimate for the constant, α̂, is the sample mean.
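A quick way to convince yourself of this is to run a constant-only regression on a handful of made-up numbers (the vector y below is illustrative, not the A170 data) and compare the estimated constant with mean():

```r
# Made-up life satisfaction answers (illustrative only)
y <- c(8, 7, 9, 5, 10, 6)

# Regression on a constant only
mod0 <- lm(y ~ 1)

alpha_hat <- coef(mod0)[["(Intercept)"]]
print(alpha_hat)   # estimated constant
print(mean(y))     # sample mean, should be identical
```

The constant that minimises the sum of squared residuals is the sample mean, which is why the two numbers coincide.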
Regression Analysis - Example 1
Testing $H_0: \mu_{A170} = 0$ can be achieved by a one-sample t-test:
##
## One Sample t-test
##
## data: test_data$A170
## t = 118.84, df = 996, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.405255 7.653922
## sample estimates:
## mean of x
## 7.529589
We can use the above regression to achieve the same:
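$t\text{-test} = \hat{\alpha}/se_{\hat{\alpha}} = 7.530/0.063 = 119.524$ (the small difference from 118.84 arises because the estimate and standard error are rounded to three decimals)
This $H_0$ makes no sense, as the smallest possible answer to A170 is 1.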
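The equivalence between the one-sample t-test and the t-ratio from a constant-only regression can be checked on a small made-up sample (the vector y is illustrative; t.test() with mu = 0 is assumed to be the call behind the output above):

```r
# Made-up life satisfaction answers (illustrative only)
y <- c(8, 7, 9, 5, 10, 6)

# One-sample t-test of H0: mean = 0
tt <- t.test(y, mu = 0)
t_from_test <- unname(tt$statistic)

# Same test via a regression on a constant only
mod0 <- lm(y ~ 1)
est <- coef(summary(mod0))["(Intercept)", "Estimate"]
se  <- coef(summary(mod0))["(Intercept)", "Std. Error"]
t_from_reg <- est / se

print(t_from_test)
print(t_from_reg)  # identical when unrounded values are used
```

Using the unrounded estimate and standard error removes the small discrepancy seen when dividing the rounded table entries.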
Regression Analysis - Example 2
We now estimate a regression model which includes a constant and the
household’s monthly income (in 1,000 Euros) as an explanatory variable
($Inc_i$, or variable X047D in our dataset).
$$LifeSatis_i = \alpha + \beta\, Inc_i + u_i$$
mod1 <- lm(A170~X047D,data=test_data)
How do we interpret the estimate $\hat{\beta}$?
Regression Analysis - Example 2
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## A170
## -----------------------------------------------
## X047D 0.184***
## (0.039)
##
## Constant 7.190***
## (0.095)
##
## -----------------------------------------------
## Observations 997
## R2 0.022
## Adjusted R2 0.021
## Residual Std. Error 1.980 (df = 995)
## F Statistic 22.302*** (df = 1; 995)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As income increases by one unit (an increase of Euro 1,000) we should
expect Life Satisfaction to increase by 0.184 units on average.
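To make the interpretation concrete, the sketch below plugs the two reported coefficients (7.190 and 0.184) into the regression line and compares predictions at two income levels. The helper function predict_ls is a hypothetical name introduced here for illustration.

```r
# Coefficients as reported in the results table above
alpha_hat <- 7.190
beta_hat  <- 0.184

# Hypothetical helper: fitted life satisfaction at a given income (in 1,000 Euros)
predict_ls <- function(inc) alpha_hat + beta_hat * inc

ls_at_2 <- predict_ls(2)   # household with Euro 2,000 monthly income
ls_at_3 <- predict_ls(3)   # household with Euro 3,000 monthly income

print(ls_at_2)
print(ls_at_3 - ls_at_2)   # difference equals the slope coefficient
```

The difference between the two predictions is the slope, whichever pair of incomes one unit apart is chosen.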
Regression Analysis - Example 2
Let’s present a graphical representation.
ggplot(test_data, aes(x=X047D, y=A170)) +
labs(x = "Income", y = "Life Satisfaction") +
geom_jitter(width=0.2, size = 0.5) + # Use jitter - try geom_point() instead
geom_abline(intercept = mod1$coefficients[1],
slope = mod1$coefficients[2], col = "blue")+
ggtitle("Income v Life Satisfaction, Britain")
[Figure: jittered scatter plot "Income v Life Satisfaction, Britain" with the fitted regression line in blue. x-axis: Income; y-axis: Life Satisfaction.]
Regression Analysis - What does it actually do?
Two interpretations
1) Finds the regression line (via $\hat{\alpha}$ and $\hat{\beta}$) that minimises the
residual sum of squares $\sum_i (LifeSatis_i - \hat{\alpha} - \hat{\beta}\, Inc_i)^2$. → Ordinary
Least Squares (OLS)
2) Finds the regression line (via $\hat{\alpha}$ and $\hat{\beta}$) that ensures that the
residuals ($\hat{u}_i = LifeSatis_i - \hat{\alpha} - \hat{\beta}\, Inc_i$) are uncorrelated with
the explanatory variable(s) (here $Inc_i$).
In many ways 2) is the more insightful one.
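Interpretation 1) can be illustrated on simulated data: any line other than the OLS line gives a larger residual sum of squares. The data-generating values below (intercept 7, slope 0.2, error s.d. 2) and the function name rss are assumptions made purely for this sketch.

```r
set.seed(42)

# Simulated data (assumed parameters, not estimates from the EVS data)
inc      <- runif(200, min = 0, max = 8)
life_sat <- 7 + 0.2 * inc + rnorm(200, sd = 2)

mod   <- lm(life_sat ~ inc)
a_hat <- coef(mod)[[1]]
b_hat <- coef(mod)[[2]]

# Residual sum of squares for an arbitrary line a + b * inc
rss <- function(a, b) sum((life_sat - a - b * inc)^2)

rss_ols     <- rss(a_hat, b_hat)          # at the OLS estimates
rss_shift_a <- rss(a_hat + 0.1, b_hat)    # intercept moved away from OLS
rss_shift_b <- rss(a_hat, b_hat - 0.05)   # slope moved away from OLS

print(c(rss_ols, rss_shift_a, rss_shift_b))
```

Both perturbed lines have a larger residual sum of squares than the OLS line, as interpretation 1) requires.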
Regression Analysis - What does it actually do?
$$LifeSatis = \alpha + \beta\, Inc + u$$
Assumptions
One of the regression assumptions is that the (unobserved) error terms u
are uncorrelated with the explanatory variable(s), here Inc. Then we
call Inc exogenous.
This implies that Cov(Inc, u) = Corr(Inc, u) = 0
In sample
$$LifeSatis_i = \hat{\alpha} + \hat{\beta}\, Inc_i + \hat{u}_i$$
Where $\hat{\alpha} + \hat{\beta}\, Inc_i$ is the regression line.
In sample, $Corr(Inc_i, \hat{u}_i) = 0$ is ALWAYS TRUE BY
CONSTRUCTION.
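The distinction between the unobserved errors and the residuals can be seen in a small simulation, where (unlike in real data) the true errors u are known. All parameter values below are assumptions for illustration.

```r
set.seed(42)

# Simulated data with known error term u (assumed parameters)
inc      <- runif(200, min = 0, max = 8)
u        <- rnorm(200, sd = 2)
life_sat <- 7 + 0.2 * inc + u

mod   <- lm(life_sat ~ inc)
u_hat <- resid(mod)

# Residuals: correlation with the regressor is zero up to floating-point error
cor_resid <- cor(inc, u_hat)
print(cor_resid)

# True errors: the sample correlation is small but not forced to be exactly zero
cor_true <- cor(inc, u)
print(cor_true)
```

The first correlation is zero by construction whatever the data look like, so it carries no information about whether the exogeneity assumption holds.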
Regression Analysis - Underneath the hood?
$$LifeSatis = \alpha + \beta\, Inc + u$$
What happens if you call
mod1 <- lm(A170~X047D,data=test_data)?
You will recall the following from Year 1 stats:
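$$\hat{\beta} = \frac{\widehat{Cov}(LifeSatis, Inc)}{\widehat{Var}(Inc)}$$
$$\hat{\alpha} = \overline{LifeSatis} - \hat{\beta}\, \overline{Inc}$$
The software will then replace $\widehat{Cov}(LifeSatis, Inc)$ and $\widehat{Var}(Inc)$ with
their sample estimates to obtain $\hat{\beta}$ and then use that and the two
sample means to get $\hat{\alpha}$.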
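These two formulas can be checked against lm() directly. The sketch uses simulated data with assumed parameter values; only cov(), var(), mean() and lm() from base R are needed.

```r
set.seed(7)

# Simulated data (assumed parameters, for illustration only)
inc      <- runif(100, min = 0, max = 8)
life_sat <- 7 + 0.2 * inc + rnorm(100, sd = 2)

# Slope: sample covariance divided by sample variance of the regressor
beta_hat <- cov(life_sat, inc) / var(inc)

# Intercept: mean of y minus slope times mean of x
alpha_hat <- mean(life_sat) - beta_hat * mean(inc)

mod <- lm(life_sat ~ inc)

print(c(alpha_hat, beta_hat))
print(unname(coef(mod)))   # should match the hand-computed values
```

For a regression with a single explanatory variable, lm() therefore returns the same numbers as the Year 1 formulas.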
Regression Analysis - Underneath the hood?
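Need to recognise that in a sample $\hat{\beta}$ and $\hat{\alpha}$ are really random variables.
$$
\begin{aligned}
\hat{\beta} &= \frac{\widehat{Cov}(LifeSatis, Inc)}{\widehat{Var}(Inc)} \\
&= \frac{\widehat{Cov}(\alpha + \beta\, Inc + u, Inc)}{\widehat{Var}(Inc)} \\
&= \frac{\widehat{Cov}(\alpha, Inc) + \beta\, \widehat{Cov}(Inc, Inc) + \widehat{Cov}(u, Inc)}{\widehat{Var}(Inc)} \\
&= \beta\, \frac{\widehat{Var}(Inc)}{\widehat{Var}(Inc)} + \frac{\widehat{Cov}(u, Inc)}{\widehat{Var}(Inc)} \\
&= \beta + \frac{\widehat{Cov}(u, Inc)}{\widehat{Var}(Inc)}
\end{aligned}
$$
The fourth line uses $\widehat{Cov}(\alpha, Inc) = 0$ (as $\alpha$ is a constant) and $\widehat{Cov}(Inc, Inc) = \widehat{Var}(Inc)$.
So $\hat{\beta}$ is a function of the random term $u$ and hence is itself a random
variable. Once $\widehat{Cov}(LifeSatis, Inc)$ and $\widehat{Var}(Inc)$ are replaced by
sample estimates we get ONE value which is drawn from a random
distribution.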
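A short simulation makes the idea of a random estimator tangible: drawing many samples from the same (assumed) model gives a different slope estimate each time. The true slope of 0.2, the sample size and the function name one_beta_hat are assumptions chosen for this sketch.

```r
set.seed(123)

# Draw one sample from an assumed model and return the slope estimate
one_beta_hat <- function(n = 200) {
  inc      <- runif(n, min = 0, max = 8)
  u        <- rnorm(n, sd = 2)
  life_sat <- 7 + 0.2 * inc + u       # true beta is 0.2 by assumption
  cov(life_sat, inc) / var(inc)
}

# Repeat for 1,000 independent samples
beta_hats <- replicate(1000, one_beta_hat())

print(mean(beta_hats))   # close to the true value 0.2
print(sd(beta_hats))     # spread of the estimates across samples
hist(beta_hats, main = "Sampling distribution of the slope estimate")
```

In practice we only ever see one draw from this distribution, which is why standard errors are reported next to the estimate.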
Regression Analysis - The Exogeneity Assumption
Why is assuming $Cov(Inc, u) = 0$ important when, in sample, we are
guaranteed $Cov(Inc_i, \hat{u}_i) = 0$?
If $Cov(Inc_i, u_i) = 0$ is not true, then
1) Estimating the model by OLS imposes an incorrect relationship
2) The estimated coefficients α̂ and β̂ are biased (on average incorrect
if we had many samples)
3) The regression model has no causal interpretation
As we cannot observe $u_i$, the assumption of exogeneity cannot be tested
and we need to make an argument using economic understanding.
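The consequences of a failed exogeneity assumption can be sketched in a simulation where, unlike in practice, we control the error term. Below, one income variable is independent of u and another is built to be correlated with u; all names and parameter values are assumptions for illustration.

```r
set.seed(321)
n <- 5000

u         <- rnorm(n, sd = 2)
inc_exog  <- runif(n, min = 0, max = 8)   # independent of u
inc_endog <- inc_exog + 0.5 * u           # correlated with u by construction

# Same true model in both cases: intercept 7, slope 0.2
y_exog  <- 7 + 0.2 * inc_exog  + u
y_endog <- 7 + 0.2 * inc_endog + u

mod_exog  <- lm(y_exog ~ inc_exog)
mod_endog <- lm(y_endog ~ inc_endog)

b_exog  <- coef(mod_exog)[[2]]
b_endog <- coef(mod_endog)[[2]]
print(c(b_exog, b_endog))   # the second estimate is far from 0.2

# The residuals are still uncorrelated with the regressor in sample
cor_resid_endog <- cor(inc_endog, resid(mod_endog))
print(cor_resid_endog)
```

The endogenous regression produces a badly biased slope, yet its residuals look exactly as uncorrelated with income as in the exogenous case, which is why the assumption cannot be checked from the residuals.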
Regression Analysis - Outlook
$$y = \alpha + \beta\, x + u$$
Much of empirical econometric analysis is about making the exogeneity
assumption (Corr(x, u) = 0) more plausible/as plausible as possible.
But this begins with thinking why an explanatory variable x is
endogenous.
1) Most models have more than one explanatory variable.
2) Including more relevant explanatory variables can make the
exogeneity assumption more plausible.
3) But fundamentally, if Cov(u, x) = 0 is implausible we need to find
another variable z for which Cov(u, z) = 0 is plausible. A lot of the
remainder of this unit is about elaborating on this issue.
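Point 2) can be illustrated with a simulation in which a second variable (called health here, purely as a hypothetical example) affects both income and life satisfaction. Leaving it out pushes it into the error term and makes income endogenous; including it restores a slope close to the assumed true value of 0.2.

```r
set.seed(99)
n <- 5000

# Hypothetical setup: health drives both income and life satisfaction
health   <- rnorm(n)
inc      <- 4 + 1.0 * health + rnorm(n)
life_sat <- 7 + 0.2 * inc + 0.8 * health + rnorm(n)

# Short regression: health is omitted and ends up in the error term
b_short <- coef(lm(life_sat ~ inc))[["inc"]]

# Long regression: health is included as a second explanatory variable
b_long <- coef(lm(life_sat ~ inc + health))[["inc"]]

print(c(b_short, b_long))   # only the second is close to 0.2
```

This only works when the relevant variable is observed; when it is not, we are in the situation described in point 3).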
Outlook
Over the next weeks you will learn
Simple OLS regression with a dummy variable
Endogeneity
Multiple regression
Difference-in-Difference (DiD) estimator