Date: 2022-11-25
PPGA 503 Measurement and
Data Analysis for Policy
Week 12
Lab Session
Outline
• Interaction effects
• Model diagnostics
• Linearity, heteroskedasticity, multicollinearity
• Transformations
Interaction effects
Interaction effects – What are they?
[Diagram: independent variable → dependent variable, with a moderator variable and a mediator variable influencing the relationship.]
How do we interpret the interaction effect of a categorical vs a continuous variable?
Interaction effects – Categorical variables
To help you see interaction effects, you can produce a margins plot after running a regression. In Stata the commands look like this (they must be entered in sequence):
regress dependent_var independent_var1#independent_var2
margins independent_var1#independent_var2
marginsplot
Terms:
• Margins – statistics computed from a model's predictions while manipulating the values of the covariates
(i.e. the predicted values of your outcome from your model, after you run a regression and set specific
conditions)
• The command margins variable basically tells Stata to predict values of the outcome at each level of the variable(s) you specify
• Margins Plot – a graph plotting the predicted values of interest
• Bonus:
• If you don’t want to see the confidence intervals in your marginsplot, add “, noci” at the end.
• E.g. marginsplot, noci
• If you want to change the “main character” (the x-axis variable) in a marginsplot, add “, x(variable_name)”
• E.g. marginsplot, x(displacement)
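Putting the commands above together, here is a sketch of a full categorical-interaction run using Stata’s built-in auto dataset (the variable choices – foreign and rep78 – are purely illustrative):

```stata
* Sketch: categorical x categorical interaction with the built-in auto data
sysuse auto, clear
regress price i.foreign#i.rep78    // interact car origin with repair record
margins foreign#rep78              // predicted price for each combination
marginsplot, noci                  // plot the predictions, without CIs
```

Each line must be run in sequence: margins uses the estimates left behind by regress, and marginsplot uses the predictions left behind by margins.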
Interaction effects – Continuous variables
What about continuous variables?
•You just need to add a “c.” in front of the variable name to tell
Stata it is a continuous variable
•Visualising is harder (because both variables take a huge range of values, there
are infinitely many “lines” you could draw to map the
relationship)
•So you need to tell Stata which lines to draw
Interaction effects – Continuous variables
Commands for interacting continuous variables
regress dependent_var c.independent_var1#c.independent_var2
margins, at(independent_var1=(min(step)max) independent_var2=(min(step)max))
marginsplot
E.g.
regress price c.weight#c.displacement
margins, at(weight=(1800(1000)4800) displacement=(80(80)400))
marginsplot
Note: to determine an appropriate range of values for each variable
• Keep it within the range of the data (don’t extrapolate) – check with “summarize” or “tab”
• You shouldn’t have too many lines – keep the step size appropriately large
Practice exam questions
• What are interaction effects, and when should you include interaction terms
when constructing a regression model?
• Think of a simple bivariate regression (DV=Malaria infection rate,
IV=No. of mosquito nets installed). What are the possible variables
which might interact with the IV to influence its effect on the DV?
Practice exam questions
• What are the Gauss-Markov assumptions for OLS regression?
• How do we test for them?
• How do we account for violations of these assumptions in practice?
• Why do we care about outliers in a regression analysis?
• How do we check for outliers?
• What can we do to account for them?
Model diagnostics
Remember the Gauss-Markov assumptions?
1. Linear relationship between y and x
2. Random sample of data (to make inferences)
3. No perfect collinearity in explanatory variables x
4. Exogeneity: zero conditional mean (the expected value of the error u is 0 for all
values of x)
a) the independent variables X are not determined by the dependent variable
b) violated by omitted variable bias and autocorrelation
5. Homoskedasticity: the variance of the error is the same for all values of x
Running model diagnostics
Linear relationship
• Pre-estimation: represent the variables in a scatterplot
to see if the relationship is linear
• scatter DV IV
• lowess DV IV
• Post-estimation: plot residuals against fitted values
• rvfplot, yline(0)
Heteroskedasticity
• Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity – estat hettest (run after a regression)
• The null hypothesis is homoskedasticity – a low p-value is
strong evidence of heteroskedasticity
• What you can do about it
• It is either evidence of a poor data-gathering process, or an
indicator of omitted variable bias – consider other IVs!
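A minimal sketch of the test in practice, again using the built-in auto dataset (the regressors are illustrative; hettest is the older shorthand for estat hettest):

```stata
* Sketch: Breusch-Pagan / Cook-Weisberg test after a regression
sysuse auto, clear
regress price weight mpg
estat hettest       // H0: constant variance; a low p-value suggests heteroskedasticity
```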
Fixing heteroscedasticity
• Review the model / redefine variables: change the model from using raw measures to using
rates and per-capita values (basically, if you expect the impact of a variable on the
outcome to be non-linear – choose another variable)
• Use robust / clustered standard errors (NOT the same as robust regression)
• A method to get standard errors that are valid under heteroskedasticity
• Just add “, robust” or “, vce(cluster cluster_var)” at the end of your regression.
• E.g. regress Y X1 X2, robust and regress Y X1 X2, vce(cluster cluster_var)
• Note: clustering assumes that the variance within each cluster is relatively constant (homoskedastic)
• Weighted regression: assigns each data point a weight based on the variance of its fitted value,
giving smaller weights to observations associated with higher variances to shrink their squared
residuals
• Simply replace the ‘regress’ command with ‘vwls’
• Transform the dependent variable: transform your original data into different values that
produce well-behaved residuals (use this sparingly)
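The three fixes above can be sketched as follows with the auto dataset (variable choices are illustrative; rep78 is used here as a stand-in cluster variable, and note that vwls without an sd() option needs observations that share the same x-values):

```stata
* Sketches of the fixes for heteroskedasticity
sysuse auto, clear
regress price weight mpg, robust              // heteroskedasticity-robust SEs
regress price weight mpg, vce(cluster rep78)  // SEs clustered by repair record
vwls price foreign rep78                      // variance-weighted least squares
```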
Running model diagnostics
Multicollinearity
• Run a pairwise assessment of correlation
among the IVs: pwcorr IV1 IV2
• Values range from −1 to 1; worry about large magnitudes
• Look at the Variance Inflation Factor: vif (run after a regression)
• 1 = no multicollinearity
• >4 = start to be worried!
• What you can do about it
• If the IVs fall on different parts of the causal
chain, retain the one you think is more
important and drop the others
• If the IVs measure the same underlying
concept, drop either one
• Increase the sample size
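A quick sketch of both checks on the auto dataset (the three size-related regressors are chosen purely because they are likely to be correlated):

```stata
* Sketch: diagnosing multicollinearity
sysuse auto, clear
pwcorr weight length displacement          // pairwise correlations among the IVs
regress price weight length displacement
vif                                        // VIF > 4 is a common warning sign
```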
Running model diagnostics
Outliers
• Check the leverage of individual observations
• Use scatterplots
• Use: lvr2plot, mlabel(id)
• id here refers to a variable you can use to
identify the observations
Transformations
Rescale
• Basically, generate new variables by dividing or multiplying existing variables
using Stata commands you already know
• gen new_variable = old_variable/X
• E.g. gen price_1000 = price/1000
Standardise
• One way is to check statistics like the mean and standard deviation and then manipulate
the variable using basic Stata commands
• E.g. if I want to measure the number of standard deviations away from the mean:
• summarize old_variable – run this to get the mean and standard deviation of the variable
• gen new_variable = (old_variable - mean)/standard_deviation
• E.g. gen price_se = (price - 6165.257)/2949.496
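Rather than hard-coding the mean and standard deviation, you can let Stata fill them in (summarize leaves them behind as r(mean) and r(sd)), or do the whole thing in one step with egen’s std() function – a sketch with the auto data:

```stata
* Two equivalent sketches for standardising price
sysuse auto, clear
quietly summarize price
gen price_std  = (price - r(mean)) / r(sd)   // by hand, using stored results
egen price_std2 = std(price)                 // same result in one step
```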
Transformations
Logarithmic Transformation
• Use log() when generating a new variable to get the variable in its logarithmic form
• gen new_variable = log(old_variable)
• E.g. gen price_log = log(price)
• Note that in Stata, ln(x) = log(x), since log(x) defaults to the natural
log rather than log base 10 – specify log10(x) if you want log base 10!
• Commonly used when you observe a logarithmic or exponential relationship
Quadratic Transformation
• Applying an index or surd to your variable (squaring and square-rooting etc.)
• Indices
• gen new_variable = old_variable^X
• E.g. gen price_squared = price^2
• Surds
• gen new_variable = old_variable^(1/X)
• E.g. gen price_squareroot = price^(1/2)
• Commonly used for variables such as age where you expect a U-shaped effect!
• If you have a positive effect of age and a negative effect of age
squared, that means that as people get older the effect of age is
lessened
• Break the values of age into categories to get a better sense of the different relationships across different values of age
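Instead of generating the squared variable by hand, you can use factor-variable notation: c.var##c.var includes both the variable and its square in one step. A sketch with the auto data (mpg stands in for a variable like age; the at() range follows the observed mpg values):

```stata
* Sketch: fitting and visualising a quadratic effect
sysuse auto, clear
regress price c.mpg##c.mpg     // includes mpg and mpg-squared
margins, at(mpg=(12(6)36))     // predicted price across the mpg range
marginsplot
```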