Date: 2023-05-10

ECOM30001/ECOM90001
Basic Econometrics
Semester 1, 2023
Week 6, Lecture 2
Practical Advice I
1 Research Question
2 Description of Data
3 Empirical Model
4 Econometric Methodology
5 Results
6 Graphs, Tables, and Equations
7 Some Practical Issues
8 Functional Forms
Research Question
What is required?: A clear statement of the research question
- Why is this research question important? Why is this topic of interest?
- Does ‘economic’ theory provide any guidance on the expected direction or
magnitude of your main effect of interest? Explain why or why not.
- Who might be interested in this topic? Why?
- What might we learn from studying your research question? Any
contribution to public policy debate?
Description of Data
- What is the sampling frame of your data? What type of individuals/
observations are included in the raw sample?
- What is the sample size of the raw data?
- Have you restricted the sample to a subset of the observations? Why?
- Provide a clear motivation for restricting your sample to a smaller
population.
- The most likely reason here would be to reduce heterogeneity in your
sample to allow a narrower focus in your research question.
- In contrast, you may want to use the full sample to allow more general
conclusions from your analysis.
- Either way, carefully explain and motivate your choice of sample restrictions, if
any.
Empirical Model
- A clear and precise equation representing your intended econometric
model.
- Clearly identify your parameter of interest and a clear statement about the
specific hypothesis being examined.
- A clear statement describing the dependent variable (outcome variable),
including a statement about the units of measurement, if applicable.
- A clear statement describing the explanatory variables (independent
variables), including a statement about their units of measurement, if
applicable.
- Your discussion of your econometric model should indicate whether your
dependent and independent variables are continuous or indicator variables.
- Your description of your econometric model should include a discussion
about why you think the specific explanatory variables should be included
in the model, either using ‘economic’ intuition or just common sense.
Econometric Methodology
- A clear description of the econometric methodology to be used.
- Does your proposed methodology produce an unbiased estimator of the
population parameters? What conditions or assumptions are required for
your estimator to be an unbiased estimator? What are the consequences if
these assumptions are not satisfied?
- Probably the biggest issue will be the possibility of omitted variable bias,
which is directly related to your choice of explanatory variables. What are
the consequences for your estimator if there are omitted variables? Do you
think that your choice of explanatory variables might mitigate the issue of
omitted variables? Explain why or why not.
- Be sure to update your proposed methodology as we cover additional
topics:
Binary Choice Methods (dependent (outcome) variable is an indicator
(dummy) variable):
- Linear Probability Model or Probit?
- Interpretation of estimates: average marginal effects
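A minimal sketch of the Linear Probability Model versus Probit comparison, using simulated data. The variable names `x` and `d` and the coefficient values are illustrative assumptions, not the course data. For the probit, the average marginal effect of a continuous regressor is the sample average of the normal density at the fitted index, multiplied by the coefficient; for the LPM, the coefficient itself is the marginal effect.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
# hypothetical data-generating process: P(d = 1 | x) = pnorm(0.3 + 0.8 * x)
d <- as.integer(0.3 + 0.8 * x + rnorm(n) > 0)

lpm    <- lm(d ~ x)                                        # Linear Probability Model
probit <- glm(d ~ x, family = binomial(link = "probit"))   # Probit

# probit average marginal effect: mean of dnorm(x'b) times the coefficient on x
xb         <- predict(probit, type = "link")
ame_probit <- mean(dnorm(xb)) * coef(probit)["x"]
ame_lpm    <- coef(lpm)["x"]

print(c(LPM = unname(ame_lpm), Probit_AME = unname(ame_probit)))
```

In this simulated setting the two estimates are close. Note that the LPM errors are heteroskedastic by construction, so robust standard errors would normally accompany the LPM estimates.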
Heteroskedasticity:
- Do you suspect heteroskedasticity? Why?
- What are the consequences for the OLS estimator if you ignore
heteroskedasticity?
- Solutions? (Robust) standard errors? Feasible GLS?
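As a sketch of what "robust" standard errors do, the following computes the heteroskedasticity-robust (White) covariance matrix by hand in base R, with the common HC1 degrees-of-freedom scaling. The data are simulated and the names are illustrative assumptions; in practice a package such as sandwich would usually be used instead.

```r
set.seed(2)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)   # hypothetical: error variance grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- resid(fit)
XtX_inv <- solve(crossprod(X))              # (X'X)^{-1}
meat    <- crossprod(X * e)                 # sum over i of e_i^2 * x_i x_i'
V_hc0   <- XtX_inv %*% meat %*% XtX_inv     # White (HC0) covariance matrix
V_hc1   <- V_hc0 * n / (n - ncol(X))        # HC1 small-sample scaling

se_ols    <- sqrt(diag(vcov(fit)))          # conventional OLS standard errors
se_robust <- sqrt(diag(V_hc1))              # robust standard errors
print(cbind(se_ols, se_robust))
```

The coefficient estimates are unchanged; only the standard errors (and therefore t-statistics and confidence intervals) differ.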
Endogeneity/IV Estimator: E[ε|X] ≠ 0
- Why is COV(X, ε) ≠ 0 a problem? What are the consequences for the
OLS estimator?
- Sources of COV(X, ε) ≠ 0:
- Measurement error
- Omitted variables
- Is there a solution? Is there a valid instrumental variable? For the data
available, the answer is most likely NO.
- Fine just to mention the possibility of endogeneity and the implications
Creating Indicator Variables for Categorical Variables
- Both UK census data and Melbourne house prices data have categorical
variables included in the data
- Cannot include categorical variables in the analysis: need to first create a set of
indicator variables for each level of the categorical variable
- Use the fastDummies package
- Example: create indicator variables for the regionname variable in the house prices
data
library(rio)          # provides import()
library(fastDummies)  # provides dummy_cols()
houseprices <- import("houseprices.csv")
# first convert the categorical variable to a factor variable
houseprices$regionname <- factor(houseprices$regionname)
# quickly create indicator variables for the categorical (factor) variable
# by default, dummy_cols() creates indicator variables for all character and factor columns
# Note: regionname is the only such column here
# The following will create 8 dummy variables for regionname
houseprices_dum <- fastDummies::dummy_cols(houseprices)
# assign labels to the levels of the (original) regionname variable
houseprices_dum$regionname <- factor(houseprices_dum$regionname,
                                 levels = c(1:8),
                                 labels = c(
                                   "Eastern Metropolitan",
                                   "Eastern Victoria",
                                   "Northern Metropolitan",
                                   "Northern Victoria",
                                   "South-Eastern Metropolitan",
                                   "Southern Metropolitan",
                                   "Western Metropolitan",
                                   "Western Victoria"
                                 ))
Descriptive Statistics
Be sure to include a table of descriptive statistics for your variables
in your empirical model:
- Purpose: What are the characteristics of your sample?
- Is your main effect of interest evident in the descriptive
statistics?
- Sample means, standard deviation, minimum, maximum
- Include summary statistics for each level for categorical
variables of interest
- Always include a paragraph discussing these descriptive
statistics
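One possible way to build such a table in base R is sketched below. The data frame `dat` and its variables are simulated placeholders, not the course data, and `describe()` is a helper defined here for illustration.

```r
set.seed(3)
# hypothetical sample; variable names are illustrative only
dat <- data.frame(
  lnwage = rnorm(200, mean = 6, sd = 0.4),
  educ   = sample(8:18, 200, replace = TRUE),
  city   = factor(sample(c("rural", "city"), 200, replace = TRUE))
)

# helper (defined here): mean, standard deviation, minimum, maximum
describe <- function(v) c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))

# one row per numeric variable
desc_all <- t(sapply(dat[c("lnwage", "educ")], describe))
print(round(desc_all, 3))

# the same statistics for lnwage within each level of the categorical variable
desc_by_city <- t(sapply(split(dat$lnwage, dat$city), describe))
print(round(desc_by_city, 3))
```

The resulting matrices can then be given readable row labels before being placed in the report.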
Estimation Results
Be sure to include a discussion of your estimation results:
- Purpose: What is the magnitude of your main effect of
interest? Is it realistic? Expected sign?
- Are your results consistent with your (theoretical) hypothesis?
Why or why not?
- Always include results for your preferred model
- Be careful with your interpretation of your results:
- Units of measurement
- Elasticities? Semi-elasticities?
- Robustness of results:
- Diagnostic tests: RESET test; Jarque-Bera test; White test for
heteroskedasticity
- Different functional forms: transformations of variables
- Interactions? Non-linear effects?
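Two of these diagnostics can be sketched by hand in base R. The example below uses simulated data with a quadratic relationship (an assumption made for illustration): the RESET test adds powers of the fitted values to a linear specification and tests them jointly, and the Jarque-Bera statistic is computed from the skewness S and kurtosis K of the residuals as JB = (n/6)(S² + (K − 3)²/4).

```r
set.seed(4)
n <- 400
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(n)     # hypothetical quadratic relationship

fit_lin <- lm(y ~ x)                        # deliberately misspecified (linear only)

# RESET: add powers of the fitted values and test them jointly with an F test
yhat    <- fitted(fit_lin)
fit_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))
reset_p <- anova(fit_lin, fit_aug)[["Pr(>F)"]][2]

# Jarque-Bera on the residuals of the quadratic specification
e  <- resid(lm(y ~ x + I(x^2)))
s2 <- mean(e^2)
S  <- mean(e^3) / s2^1.5                    # skewness
K  <- mean(e^4) / s2^2                      # kurtosis
JB <- n / 6 * (S^2 + (K - 3)^2 / 4)         # chi-squared with 2 df under normality
jb_p <- pchisq(JB, df = 2, lower.tail = FALSE)

print(c(RESET_p = reset_p, JB = JB, JB_p = jb_p))
```

A small RESET p-value here signals that the linear functional form is inadequate, which is what the simulated design implies.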
Conclusion
The aim of the conclusion is to provide a clear summary of your
main points.
A conclusion should bring together different sections of the report.
Topics you might include are:
- Summary: To recap the main points ...
- Make your message clear and leave the reader in no doubt
about your main results.
- Confirm: Do your results confirm your (main) hypothesis?
- Implications of your research: Policy conclusions?
- Important: What are the limitations of your research? Data?
Methodology?
Graphs: Some General Principles
- The value of using graphs in data analysis comes when they show
important patterns in the data.
- A graph needs to be legible and well designed.
- It is possible (and usually required) to change the shape, size, font,
colour, darkness, orientation and location along an axis of the graph
to maximise its visual impact.
- Avoid cluttering the graph with unnecessary visual
elements.
- Make sure that the end product is informative and not misleading.
- Define the variables on the axes and use graph titles to describe the
graph.
- Scales of the axes should be chosen to include the range of data.
- The size of symbols for plotted data points should be chosen so as
not to obscure the location of points that are underneath.
- If connecting lines are used, they should be neither so thin that they
fade into the background nor so thick that they obscure data points.
- Inner grid-lines should only be used if you want the reader to extract
specific values. If they are used, they should be visible but light,
for example a thin grey line.
- Reference lines may be used to highlight an important value. They are
usually shaded a lighter colour so as not to interfere with the data.
- If a legend (key) is used, it should be placed so as not to interfere
with the interpretation of the graph.
Graphs: Some More General Principles
- Graphics on a white background are the most legible.
- Using too many colours in a graph may be confusing.
- An alternative is to use different shades of the same colour.
- Bright or dark colours, such as red and black, can be used to emphasise
the most important line.
- Colours are also useful for encoding different categories say in a scatter
plot.
- Avoid red-green contrasts.
- Colour coding with different colours to show numerical information
does not work well as this requires an ordering. It is difficult to
perceive an ordering to red, green, blue etc.
- Instead use a colour scale from lightest to darkest or vice versa with a
smooth progression.
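Several of these principles can be illustrated with a short base R sketch. The data, group labels and titles below are invented for illustration only: the plot uses a white background, labelled axes, a light inner grid, a lighter reference line, and light-to-dark shades of one colour for an ordered category.

```r
x   <- seq(0, 24, by = 2)
y   <- 0.084 - 0.0046 * x                    # hypothetical effect that declines with x
grp <- rep(1:3, length.out = length(x))      # hypothetical ordered category

# light-to-dark shades of a single colour encode the ordered variable
shades <- grey(c(0.7, 0.4, 0.1))

plot(x, y, type = "n",
     main = "Hypothetical marginal effect by experience",
     xlab = "Experience (years)", ylab = "Marginal effect",
     xlim = range(x), ylim = range(c(y, 0)))
grid(col = "grey90", lty = 1)                # light inner grid lines
abline(h = 0, col = "grey60", lty = 2)       # reference line, lighter than the data
lines(x, y, col = "grey30", lwd = 1.5)       # connecting line, neither too thin nor too thick
points(x, y, pch = 19, cex = 0.9, col = shades[grp])
legend("topright", legend = paste("Group", 1:3),
       pch = 19, col = shades, bty = "n")    # legend placed away from the data
```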
Presentation of Regression Results
- It is almost never appropriate to present the raw output of an
(econometrics) software package directly.
- Typically these results include information that is not needed in the description
of the analysis.
- Typically these results report (often) confusing and uninformative variable
names, not suitable to be included in tables.
- The standard R output reports:
- estimated coefficients
- standard errors
- sample t-test statistics for H0: βk = 0
- p-values for the sample t-test statistic for the two-sided test of H0: βk = 0
- some other model summary statistics (R², sample F).
The standard error, the t-statistic, the p-value (and the confidence interval) all tell
the same story: is the coefficient estimate significant or not? You do not need to
report them all.
You will need to convert the results generated by the computer into
a form that the reader can easily understand.
- Use the stargazer package
- Really useful for comparing estimates from different specifications
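A minimal sketch of a side-by-side table with stargazer, assuming the package is installed. The data are simulated and the variable labels are illustrative; the call is wrapped in a check so the script still runs if stargazer is unavailable.

```r
set.seed(6)
n <- 300
dat <- data.frame(educ  = sample(8:18, n, replace = TRUE),
                  exper = runif(n, 0, 25))
# hypothetical log-wage equation used only to generate example data
dat$lnwage <- 4.6 + 0.08 * dat$educ + 0.03 * dat$exper + rnorm(n, sd = 0.4)

m1 <- lm(lnwage ~ educ, data = dat)          # specification 1
m2 <- lm(lnwage ~ educ + exper, data = dat)  # specification 2

if (requireNamespace("stargazer", quietly = TRUE)) {
  # one column per specification, with readable labels instead of raw variable names
  stargazer::stargazer(m1, m2, type = "text",
                       dep.var.labels   = "ln(wage)",
                       covariate.labels = c("Education (years)", "Experience (years)"),
                       keep.stat        = c("n", "rsq"))
}
```

Setting `type = "latex"` or `type = "html"` instead of `"text"` produces a table suitable for a report.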
Equations
- Mathematical notation should be as simple as possible. Where appropriate,
define any symbols that you use.
- Equations should be identified by consecutive numbers in parentheses at the
end of the equation.
- When numbering an equation, be sure the number is set far enough away from
the equation that it does not appear to be part of the equation.
- If possible, do not let an equation spill from one page to another.
Equations: An Example
When only a sample of data is drawn from a population, the
sample standard deviation may be estimated according to the
following:
\[
s = \sqrt{\frac{\sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2}{N - 1}} \qquad (2.5)
\]
where $\{X_1, X_2, \ldots, X_N\}$ are the observed values of the data in the
sample, $\bar{X}$ is the mean value of the observations, and $N$ is the
sample size.
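As a quick numerical check of this formula, the following computes the sample standard deviation directly and compares it with R's built-in `sd()`. The data values are arbitrary illustrative numbers.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)               # illustrative values
N <- length(x)
s <- sqrt(sum((x - mean(x))^2) / (N - 1))    # sample standard deviation, divisor N - 1
print(c(by_hand = s, built_in = sd(x)))
```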
Estimated Variance of Non-Linear Estimators (Delta Method)
Example:
\[
y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i
\]
Estimated marginal effect:
\[
\frac{\Delta E[y \mid X]}{\Delta X} = b_1 + 2 b_2 X_i
\]
The marginal effect varies with the level of X. There are three (3) ways to measure
the marginal effect:
1 Average Marginal Effect: average of the marginal effects calculated at each $X = X_i$
2 Marginal Effect at Mean: marginal effect calculated at $X = \bar{X}$
3 Marginal Effect at Value: marginal effect calculated at a specific value $X = \tilde{X}$
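The three measures can be sketched in base R for the quadratic model. The data are simulated, the coefficient values are illustrative assumptions, and `me()` is a helper defined here.

```r
set.seed(7)
n <- 1000
X <- runif(n, 0, 23)
y <- 4.6 + 0.084 * X - 0.0023 * X^2 + rnorm(n, sd = 0.4)   # hypothetical model

fit <- lm(y ~ X + I(X^2))
b   <- unname(coef(fit))                 # b[1] = b0, b[2] = b1, b[3] = b2

me  <- function(x) b[2] + 2 * b[3] * x   # estimated marginal effect at X = x

ame <- mean(me(X))                       # 1. average marginal effect over the sample
mem <- me(mean(X))                       # 2. marginal effect at the sample mean
mer <- me(10)                            # 3. marginal effect at a chosen value, X = 10
print(c(AME = ame, MEM = mem, ME_at_10 = mer))
```

Because the marginal effect in a quadratic model is linear in X, the first two measures coincide here; they generally differ in models such as the probit.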
The estimated turning point occurs at:
\[
\hat{X}^* = -\frac{b_1}{2 b_2}
\]
Is $\hat{X}^*$ within the range of values that the regressor X can take? How do we
undertake hypothesis tests on $X^*$?
First, determine if there is a quadratic relationship between X and y by conducting
the two-sided test of H0: β2 = 0. Rejection of H0 suggests the hypothesis of a linear
relationship should be rejected.
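A sketch of the delta method calculation done by hand for the turning point $-b_1/(2 b_2)$: the estimated variance is $g' V g$, where $V$ is the estimated covariance matrix of the coefficients and $g$ is the gradient of the turning point with respect to $(b_0, b_1, b_2)$. The data are simulated under assumed coefficient values.

```r
set.seed(8)
n <- 1000
X <- runif(n, 0, 23)
y <- 4.6 + 0.084 * X - 0.0023 * X^2 + rnorm(n, sd = 0.4)   # hypothetical model

fit <- lm(y ~ X + I(X^2))
b   <- unname(coef(fit))
V   <- unname(vcov(fit))
b1  <- b[2]
b2  <- b[3]

tp <- -b1 / (2 * b2)                         # estimated turning point
# gradient of -b1 / (2 * b2) with respect to (b0, b1, b2)
g  <- c(0, -1 / (2 * b2), b1 / (2 * b2^2))
se_tp <- sqrt(drop(t(g) %*% V %*% g))        # delta-method standard error
ci    <- tp + c(-1, 1) * qnorm(0.975) * se_tp  # approximate 95% confidence interval
print(c(estimate = tp, se = se_tp, lower = ci[1], upper = ci[2]))
```

The interval can then be compared with the observed range of X to judge whether the turning point lies within the data.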
Econometric model:
\[
\ln \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i + \beta_3 \text{exper}_i^2 + \beta_4 \text{disadv}_i + \beta_5 \text{city}_i + \varepsilon_i
\]
Call:
lm(formula = lnwage ~ educ + exper + I(exper^2) + factor(disadv) +
    factor(city), data = wage)

Residuals:
     Min       1Q   Median       3Q      Max
-1.74187 -0.23369  0.02547  0.25113  1.38291

Coefficients:
                  Estimate Std. Error t value             Pr(>|t|)
(Intercept)      4.6392013  0.0680696  68.154 < 0.0000000000000002 ***
educ             0.0822234  0.0035176  23.375 < 0.0000000000000002 ***
exper            0.0841655  0.0068085  12.362 < 0.0000000000000002 ***
I(exper^2)      -0.0022754  0.0003254  -6.993    0.000000000003292 ***
factor(disadv)1 -0.1756205  0.0148026 -11.864 < 0.0000000000000002 ***
factor(city)1    0.1155612  0.0150656   7.671    0.000000000000023 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3832 on 3004 degrees of freedom
Multiple R-squared:  0.2555, Adjusted R-squared:  0.2543
F-statistic: 206.2 on 5 and 3004 DF,  p-value: < 0.00000000000000022

> margins_summary(reg1)
  factor     AME     SE        z      p   lower   upper
   city1  0.1156 0.0151   7.6705 0.0000  0.0860  0.1451
 disadv1 -0.1756 0.0148 -11.8641 0.0000 -0.2046 -0.1466
    educ  0.0822 0.0035  23.3751 0.0000  0.0753  0.0891
   exper  0.0439 0.0023  18.8398 0.0000  0.0393  0.0484
# calculate ame at specific values for experience
experlist <- seq(0, 23, by = 2)
summary(margins(reg1, variables = "exper", at = list(exper = experlist)))
 factor   exper     AME     SE       z      p   lower  upper
  exper  0.0000  0.0842 0.0068 12.3616 0.0000  0.0708 0.0975
  exper  2.0000  0.0751 0.0056 13.4162 0.0000  0.0641 0.0860
  exper  4.0000  0.0660 0.0044 14.8837 0.0000  0.0573 0.0746
  exper  6.0000  0.0569 0.0034 16.8662 0.0000  0.0503 0.0635
  exper  8.0000  0.0478 0.0025 18.7652 0.0000  0.0428 0.0527
  exper 10.0000  0.0387 0.0022 17.3277 0.0000  0.0343 0.0430
  exper 12.0000  0.0296 0.0026 11.2805 0.0000  0.0244 0.0347
  exper 14.0000  0.0205 0.0035  5.8703 0.0000  0.0136 0.0273
  exper 16.0000  0.0114 0.0046  2.4888 0.0128  0.0024 0.0203
  exper 18.0000  0.0022 0.0057  0.3925 0.6947 -0.0090 0.0135
  exper 20.0000 -0.0069 0.0069 -0.9861 0.3241 -0.0205 0.0068
  exper 22.0000 -0.0160 0.0082 -1.9474 0.0515 -0.0320 0.0001
[Figure: Average Marginal Effect for Experience, with 95% Confidence Intervals. Horizontal axis: Experience (0 to 24); vertical axis: Average Marginal Effect (−0.04 to 0.10).]
medata1 <- cplot(reg1, "exper", what = "effect",draw = FALSE)
Econometric model:
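\[
\ln \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i + \beta_3 \text{exper}_i^2 + \beta_4 \text{disadv}_i + \beta_5 \text{city}_i + \varepsilon_i
\]
\[
\hat{X}^* = -\frac{b_2}{2 b_3}
\]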
# use delta method to calculate se for the turning point
# requires car package
library(car)
deltaMethod(reg1, "-b2/(2*b3)", parameterNames = c("b0", "b1", "b2", "b3", "b4", "b5"))
             Estimate      SE   2.5 % 97.5 %
-b2/(2 * b3)  18.4943  1.3249 15.8976 21.091
Why Use Logs?
There are several reasons why (natural) logarithms are used so much in
applied work:
1 When y > 0, models using ln(y) as the dependent variable often
satisfy the Classical Linear Model assumptions more closely than
models using the level of y.
Strictly positive variables often have conditional distributions that are
heteroskedastic or skewed. Transforming using natural logs can
mitigate, if not eliminate, both problems.
2 Taking logs restricts the range of the variable which makes estimates
less sensitive to outlying observations.
This is particularly true of variables that can be large monetary values
such as firms’ annual sales. Population variables also tend to vary
widely.
Why Use Logs?: Limitations
A log transformation cannot be used if a variable takes on zero or negative
values.
Solution:
In cases where a variable y is nonnegative but can take on the value 0,
ln(1 + y) is sometimes used.
Generally, using ln(1 + y) and then interpreting the estimates as if the
variable were ln(y) is acceptable when the data on y contain relatively few
zeros.
Another alternative is to use ln(1 + y) and include a dummy (indicator)
variable for the observations with y = 0.
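A brief sketch of this alternative in R, using made-up values of y: `log1p()` computes ln(1 + y), and a separate indicator flags the observations with y = 0.

```r
y <- c(0, 0, 3, 10, 250, 1200)      # illustrative non-negative values, including zeros
ln1py  <- log1p(y)                  # ln(1 + y), defined at y = 0
zero_d <- as.integer(y == 0)        # indicator for observations with y = 0
print(data.frame(y, ln1py, zero_d))
```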
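It is more difficult to predict the original variable. The estimated model
provides predictions for ln(y), not y.
Consider the model:
\[
z_i = \ln y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} + \varepsilon_i, \qquad \varepsilon_i \mid X_i \sim N(0, \sigma^2)
\]
so $y_i = \exp(z_i)$:
\[
f(y) = \frac{1}{y} \, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\ln y - \mu)^2}{2\sigma^2} \right)
\]
with:
\[
E[y \mid X] = \exp\left( \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
A consistent but not unbiased prediction for the level variable y is given by:
\[
\hat{y} = \exp\left( b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_K X_{Ki} \right) \exp\left( \frac{\hat{\sigma}^2}{2} \right)
\]
where $\hat{\sigma}^2$ is the estimator of the error variance.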
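The effect of the correction factor can be seen in a small simulation. The data-generating process below is an assumption chosen for illustration; the naive prediction exp(fitted) systematically under-predicts the level of y, while the corrected prediction does not.

```r
set.seed(9)
n <- 5000
x <- rnorm(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.8))     # hypothetical log-normal outcome

fit        <- lm(log(y) ~ x)
sigma2_hat <- summary(fit)$sigma^2             # estimator of the error variance

naive     <- exp(fitted(fit))                  # ignores the exp(sigma^2 / 2) term
corrected <- exp(fitted(fit)) * exp(sigma2_hat / 2)

print(c(mean_y = mean(y), naive = mean(naive), corrected = mean(corrected)))
```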
It is not legitimate to compare the R² from models where y is the
dependent variable in one case and ln(y) is the dependent variable in the
other. These measures explain variations in different variables.
Recall, in the standard linear regression model estimated by OLS:
\[
\hat{y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_K X_{Ki}
\]
the usual R² is simply the square of the correlation between $y_i$ and $\hat{y}_i$.
This suggests that to obtain an R² when the dependent variable is $\ln(y_i)$
that can be compared to the usual R² when the dependent variable is $y_i$:
1 obtain predictions:
\[
\hat{y}_i = \exp\left( b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_K X_{Ki} \right) \exp\left( \frac{\hat{\sigma}^2}{2} \right)
\]
2 compute the squared correlation between $y_i$ and $\hat{y}_i$
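The two steps can be sketched as follows, using simulated data generated under an assumed log-linear model. The level model's R² is compared with the squared correlation between y and the level predictions from the log model.

```r
set.seed(10)
n <- 1000
x <- runif(n, 0, 5)
y <- exp(0.5 + 0.4 * x + rnorm(n, sd = 0.5))   # hypothetical data

fit_level <- lm(y ~ x)                         # dependent variable: y
fit_log   <- lm(log(y) ~ x)                    # dependent variable: ln(y)

# step 1: predictions of the level of y from the log model
yhat_log <- exp(fitted(fit_log)) * exp(summary(fit_log)$sigma^2 / 2)

# step 2: squared correlation between y and the level predictions
r2_level          <- summary(fit_level)$r.squared
r2_log_comparable <- cor(y, yhat_log)^2
print(c(level_model = r2_level, log_model_comparable = r2_log_comparable))
```

Since the correction factor is a positive constant, it does not change the correlation; it matters for the predictions themselves rather than for this R² comparison.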
Interpretation in Log Models
Consider the model:
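\[
z_i = \ln y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} + \varepsilon_i, \qquad \varepsilon_i \mid X_i \sim N(0, \sigma^2)
\]
with:
\[
E[y \mid X] = \exp\left( \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
so for a ‘small’ change in the continuous variable $X_k$:
\[
\frac{\Delta E[y \mid X]}{\Delta X_k} = \beta_k \exp\left( \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
with:
\[
100 \times \left( \frac{\Delta E[y \mid X]}{\Delta X_k} \, \frac{1}{E[y \mid X]} \right) = \frac{\%\Delta E[y \mid X]}{\Delta X_k} = 100 \times \beta_k
\]
so $(100 \times \beta_k)$ represents the percentage change in $E[y \mid X]$ associated with
a change in the level of $X_k$, for a ‘small’ change in $X_k$ (semi-elasticity).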
Consider the model:
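\[
z_i = \ln y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \varepsilon_i, \qquad \varepsilon_i \mid X_i \sim N(0, \sigma^2)
\]
with:
\[
E[y \mid X] = \exp\left( \beta_0 + \beta_1 X_i + \beta_2 D_i \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
For the indicator variable $D_i$:
\[
E[y \mid X, D = 1] = \exp(\beta_2) \exp\left( \beta_0 + \beta_1 X \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
\[
E[y \mid X, D = 0] = \exp\left( \beta_0 + \beta_1 X \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
so:
\[
\begin{aligned}
\frac{\%\Delta E[y \mid X]}{\Delta D}
&= 100 \times \left( \frac{E[y \mid X, D = 1] - E[y \mid X, D = 0]}{E[y \mid X, D = 0]} \right) \\
&= 100 \times \frac{\exp\left( \beta_0 + \beta_1 X \right) \exp\left( \frac{\sigma^2}{2} \right) \left[ \exp(\beta_2) - 1 \right]}{\exp\left( \beta_0 + \beta_1 X \right) \exp\left( \frac{\sigma^2}{2} \right)} \\
&= 100 \times \left\{ \exp(\beta_2) - 1 \right\}
\end{aligned}
\]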
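To see how much the exact formula differs from reading 100 times the coefficient directly, the following applies both to the two indicator coefficients from the wage regression output reported earlier in these slides (city and disadv).

```r
b <- c(city = 0.1155612, disadv = -0.1756205)  # estimates from the wage regression output

approx_pct <- 100 * b                          # 100 * coefficient (small-change approximation)
exact_pct  <- 100 * (exp(b) - 1)               # exact percentage difference for an indicator

print(round(rbind(approx_pct, exact_pct), 2))
```

The gap between the two rows grows with the absolute size of the coefficient, which is why the exact formula is preferred for indicator variables.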
Linear-Log Model
Now consider a linear-log model of the form:
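\[
y = \beta_0 + \beta_1 \ln X + \varepsilon
\]
so:
\[
\beta_1 = \frac{\Delta E[y \mid X]}{\Delta \ln X} \approx \frac{\Delta E[y \mid X]}{\Delta X / X}
\]
so:
\[
\frac{\beta_1}{100} \approx \frac{1}{100} \times \left( \frac{\Delta E[y \mid X]}{\Delta X / X} \right) = \left( \frac{\Delta E[y \mid X]}{\%\Delta X} \right)
\]
so $\beta_1 / 100$ represents the level change in $E[y \mid X]$ associated with a
one percent change in the level of X, for a small change in X.
Extrapolating this approximation, $\beta_1$ is the change in $E[y \mid X]$ associated
with a 100% change in X. The approximation is accurate only for small changes:
the exact change in $E[y \mid X]$ from doubling X is $\beta_1 \ln 2 \approx 0.69\,\beta_1$.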
Log-Log Model
Now consider a log-log model of the form:
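\[
z_i = \ln y_i = \beta_0 + \beta_1 \ln X_i + \varepsilon_i
\]
with:
\[
E[y \mid X] = \exp\left( \beta_0 + \beta_1 \ln X \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
so for a ‘small’ change in the continuous variable $X$:
\[
\frac{\Delta E[y \mid X]}{\Delta \ln X} = \beta_1 \exp\left( \beta_0 + \beta_1 \ln X \right) \exp\left( \frac{\sigma^2}{2} \right)
\]
Noting that $\Delta \ln X \approx \Delta X / X$:
\[
\frac{100}{100} \times \left( \frac{\Delta E[y \mid X] / E[y \mid X]}{\Delta X / X} \right) = \frac{\%\Delta E[y \mid X]}{\%\Delta X} = \beta_1
\]
so $\beta_1$ represents the (approximate) percentage change in $E[y \mid X]$
associated with a one percent change in the level of $X$. Note that the
parameter $\beta_1$ can then be interpreted as an elasticity.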
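A small numerical check of the elasticity interpretation, using assumed values for the parameters (β0 = 1, β1 = 0.8) and an arbitrary starting level of X. The exp(σ²/2) term cancels in the ratio, so it is omitted from the helper `Ey()` defined here.

```r
beta0 <- 1      # assumed value, for illustration
beta1 <- 0.8    # assumed elasticity
X0    <- 50     # arbitrary starting level of X

# E[y | X] up to the constant factor exp(sigma^2 / 2), which cancels in ratios
Ey <- function(X) exp(beta0 + beta1 * log(X))

# percentage change in E[y | X] when X rises by 1%
pct_change_y <- 100 * (Ey(1.01 * X0) / Ey(X0) - 1)
print(pct_change_y)
```

The result is close to, but not exactly, β1, which reflects the "approximate" qualifier in the interpretation.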
Example: Log-Linear Model
\[
\ln \text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{exper}_i + \beta_3 \text{exper}_i^2 + \beta_4 \text{disadv}_i + \beta_5 \text{city}_i + \varepsilon_i
\]
# estimate model by OLS
library(car)  # provides deltaMethod()
reg1 <- lm(lnwage ~ educ + exper + I(exper^2) + factor(disadv) + factor(city),
           data = wage)
deltaMethod(reg1, "100*(exp(b5)-1)", parameterNames = c("b0", "b1", "b2", "b3", "b4", "b5"))
                    Estimate      SE   2.5 % 97.5 %
100 * (exp(b5) - 1)  12.2503  1.6911  8.9358 15.565
deltaMethod(reg1, "100*(exp(b4)-1)", parameterNames = c("b0", "b1", "b2", "b3", "b4", "b5"))
                    Estimate       SE    2.5 %  97.5 %
100 * (exp(b4) - 1) -16.1064   1.2418 -18.5403 -13.672
- Workers residing in cities earn 12.25% higher average wages, relative
to workers in rural areas, controlling for education, experience, and
disadvantaged-region status.
- Workers residing in the disadvantaged region earn 16.11% lower
average wages, relative to workers in advantaged regions, controlling
for education, experience, and city status.