Econ 120C: Stata Problem Set 1
Due: January 29, 2022
Instructions
The do file to be used for this problem set is ps1 PID.do. The data file for this problem
set is bwght2.dta, which is downloaded from the web by the provided do file. First
create a working directory (a folder) on your system and save the do file there. Rename the
do file to “ps1 yourPID.do”. That is, if your PID is A34567890, then your file should be named
“ps1 A34567890.do”.
You will solve this problem set by modifying the lines in this do file itself. This do file will guide
you through solving this problem set. It has commented blanks for where your answers need to
be written. You must update these blanks within the same do file and upload the do file. Only
the final do file needs to be uploaded.
For the questions that require an answer in words, write the answer as a comment in the same
do file. In Stata, all lines starting with an asterisk (*) and all lines enclosed by “/*” and “*/”
are treated as comments. Make sure your code runs;¹ if it does not run, 25% will
be subtracted from your final score.
Please solve all the questions using Stata 17, which is available for download on Canvas.
Question 1
Consider the following regression model:
$$y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + u_i,$$
where $x_{1i} \sim N(0, 1)$ and $x_{2i} \sim N(0, 1)$ are independent of $u_i \sim N(0, 2^2)$. The regressors $x_{1i}$ and $x_{2i}$ are
correlated with correlation coefficient $\rho = 0.6$. We are interested in running regressions of $y_i$ on
our available $x$ variables, and want to understand how the distribution of $\hat{\beta}_1 + \hat{\beta}_2$ varies with
the sample size.
In this problem, we generate data according to the distributions above with $\beta_0 = 1$, $\beta_1 = 1$, $\beta_2 = 2$
and run the regression $R = 5000$ times for each specification and sample size we are interested
in. This will allow us to look at the distribution of the estimates.
¹ By “run”, we mean that the TA can click “do” in the do-file editor and the whole do-file runs through and
produces the desired results.
Suppose that we simulate the variables above from a model with sample size $N = 50$. Run
the regression by filling in the code in the for loop. The for loop takes the regression you
include and stores $\hat{\beta}_1$ and $\hat{\beta}_2$ (the coefficients on $x_1$ and $x_2$, respectively), $\mathrm{se}(\hat{\beta}_1)$ and $\mathrm{se}(\hat{\beta}_2)$ (the
standard errors of the corresponding coefficients), and $\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2)$ (the estimated covariance
between the coefficients), which you will need to answer the following questions. In your do-file,
please answer each of the questions using this simulated data (the questions are written in the
do-file as well).
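The setup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the provided do file: the variable names (b1, b2, se1, se2, cov12), the seed, and the use of postfile to collect results are all hypothetical choices, and the error term is drawn with standard deviation 2 on the assumption that its variance is $2^2$.

```stata
* Hypothetical sketch of the simulation; the provided do file may be
* organized differently.
clear all
* Arbitrary seed, chosen only so that results are reproducible
set seed 12345

* Sample size per replication and number of replications
local N = 50
local R = 5000

* Correlation matrix of (x1, x2) with rho = 0.6
matrix C = (1, 0.6 \ 0.6, 1)

* Collect one row of results per replication
tempname sim
tempfile results
postfile `sim' b1 b2 se1 se2 cov12 using `results'

forvalues r = 1/`R' {
    quietly {
        drop _all
        set obs `N'
        * Standard normal regressors with correlation 0.6
        drawnorm x1 x2, corr(C)
        * Assumes the error has standard deviation 2 (variance 2^2)
        generate u = rnormal(0, 2)
        generate y = 1 + 1*x1 + 2*x2 + u
        regress y x1 x2
        * e(V) is ordered x1, x2, _cons, so V[1,2] is cov(b1, b2)
        matrix V = e(V)
        post `sim' (_b[x1]) (_b[x2]) (_se[x1]) (_se[x2]) (V[1,2])
    }
}
postclose `sim'

* Load the simulated estimates, one observation per replication
use `results', clear
```

After the loop, the data in memory hold one observation per replication, which is the form the questions below work with.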
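1. First, use the output from the 5000 regressions and generate two additional variables:
• The estimate of $\beta_1 + \beta_2$:
$$\hat{\beta}_{12} = \hat{\beta}_1 + \hat{\beta}_2 \tag{1}$$
• The standard error of $\hat{\beta}_{12}$:
$$\mathrm{se}(\hat{\beta}_{12}) = \sqrt{\mathrm{se}(\hat{\beta}_1)^2 + \mathrm{se}(\hat{\beta}_2)^2 + 2\,\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2)}. \tag{2}$$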
2. Generate a histogram of $\hat{\beta}_{12}$. Is it centered around the true value? Comment on the shape
of this distribution.
3. Summarize the simulated data. What is the standard deviation of $\hat{\beta}_{12}$? What is the mean
of $\mathrm{se}(\hat{\beta}_{12})$? Are these numbers close to each other? Explain. What can you say about the
sign of the correlation between $\hat{\beta}_1$ and $\hat{\beta}_2$?
4. Now repeat the analysis with a larger sample size ($N = 500$). Consider the histograms, and
comment on what changed.
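As a rough illustration of the steps in this list, the commands below assume the simulation results are already in memory, one observation per replication, under the hypothetical variable names b1, b2, se1, se2 and cov12. The standard error formula is the usual variance-of-a-sum identity, Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b), applied to the two coefficient estimates.

```stata
* Assumes variables b1, b2, se1, se2 and cov12 (hypothetical names)
* hold the stored results from the simulated regressions.
generate b12  = b1 + b2
generate se12 = sqrt(se1^2 + se2^2 + 2*cov12)

* Histogram with a vertical line at the true value beta1 + beta2 = 1 + 2 = 3
histogram b12, xline(3)

* Compare the standard deviation of b12 with the mean of se12
summarize b12 se12

* Sample correlation between the two coefficient estimates
correlate b1 b2
```

For the last item, the same commands can be rerun on results simulated with the sample size set to 500.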
Question 2
We will now be using the birth weight dataset from Wooldridge (2020).² This data can be
accessed within Stata by running the code in the do file (the lines bcuse bwght2.dta, clear).
1. First, use the describe command to see the variables in the data set.
2. Use the keep command to keep only the following variables in the dataset: bwght, cigs,
drink, mage, fage, male.
3. For this exercise, we are only interested in individuals with drink ≤ 2. Use the drop command
to get rid of the observations that do not satisfy this condition.
4. Find the mean, standard deviation, the median, and the range of birth weights for children
in the sample.
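5. Run a regression to estimate coefficients in equation (3). Interpret the estimate of $\beta_1$. Is
it statistically significant?
$$\text{bwght} = \beta_0 + \beta_1\,\text{drink} + \beta_2\,\text{cigs} + \beta_3\,\text{mage} + \beta_4\,\text{fage} + \beta_5\,\text{male} + u \tag{3}$$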
² Wooldridge, J. M. 2020. Introductory Econometrics: A Modern Approach, 7e.
6. Create a dummy variable called D1 which equals 1 if an individual has on average one
drink per week and equals 0 otherwise. Create a dummy variable called D2 which equals
1 if an individual has on average two drinks per week and equals 0 otherwise.
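7. Run a regression to estimate coefficients in equation (4). Interpret the estimates of $\delta_1$ and
$\delta_2$. Which model, (3) or (4), is more flexible? Explain.
$$\text{bwght} = \beta_0 + \delta_1 D_1 + \delta_2 D_2 + \beta_2\,\text{cigs} + \beta_3\,\text{mage} + \beta_4\,\text{fage} + \beta_5\,\text{male} + v \tag{4}$$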
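The data steps in this question might look roughly like the sketch below. It is one possible sequence rather than the intended solution, and it assumes that drink is recorded in whole drinks per week, so that after the restriction it takes only the values 0, 1 and 2. The load command is copied from the problem set text; bcuse is a user-written command, so it may need to be installed first.

```stata
* Load the data as in the provided do file. If bcuse is not installed,
* "ssc install bcuse" installs it.
bcuse bwght2.dta, clear

describe
keep bwght cigs drink mage fage male

* Stata treats missing values as larger than any number, so this also
* drops observations where drink is missing.
drop if drink > 2

* The detail option reports the median (50th percentile) along with
* the mean, standard deviation, minimum and maximum.
summarize bwght, detail
display "Range of bwght: " r(max) - r(min)

* Equation (3): drink enters linearly
regress bwght drink cigs mage fage male

* Dummies for exactly one and exactly two drinks per week
* (assumes drink takes whole-number values)
generate D1 = (drink == 1)
generate D2 = (drink == 2)

* Equation (4): separate intercept shifts for one and two drinks
regress bwght D1 D2 cigs mage fage male
```

Because the observations with missing drink are dropped before the dummies are created, D1 and D2 are never set to 0 for a missing value.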