Date: 2023-11-09
Assignment 2
ECN620
Deadline: November 12, 11:59pm
Instructions
1. Students should complete this assignment using R.
2. All R work must be saved in a separate R-Script file or R-Markdown file (preferably).
3. Some questions require essay-type answers. Including them in R-Markdown is straightforward. If you
choose to submit an R-Script instead, you may add all required text explanations in a separate MS
Word or PDF file, or include them as annotations in your R file.
4. One of the first lines of your R script must be: setwd("yourdirectoryhere"). Once I change the setwd("")
command line to the directory where I have the data files for the assignment, R should execute the whole script
smoothly.
5. Load all necessary packages at the beginning of your document.
6. Your R script should have all commands that you used to process the original and any additional data
files, including the commands that you use to convert Excel data (or any other format) into R.
7. Assignments should be submitted through the Assignment portal on D2L. Assignments sent as email attachments
will not be graded and will receive a mark of zero.
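A script header consistent with points 4 to 6 might look like the sketch below. The directory string is the placeholder from the instructions, and using the readxl package to import Excel files is an assumption (any Excel reader would do). The existence checks are there only so that the sketch runs on a machine without the data; a submitted script would call setwd() and library() directly.

```r
# First lines of the script: the grader edits only this path.
data_dir <- "yourdirectoryhere"            # placeholder from the instructions
if (dir.exists(data_dir)) setwd(data_dir)  # guard so the sketch runs anywhere

# Load every package up front. A real script would use library(readxl);
# requireNamespace() is used here so the sketch does not fail without it.
have_readxl <- requireNamespace("readxl", quietly = TRUE)

# Keep the import step in the script so the raw Excel file is all that is needed.
if (have_readxl && file.exists("Experiment.xlsx")) {
  experiment <- readxl::read_excel("Experiment.xlsx")
  str(experiment)
}
```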
Question 1. Experimental research design
This question is based on the data file Experiment.xlsx. This is experimental data from one of the academic
articles. This is the abstract of the article:
“We explore the power of behavioral economics to influence the level of effort exerted by students in a low
stakes testing environment. We find a substantial impact on test scores from incentives when the rewards
are delivered immediately. There is suggestive evidence that rewards framed as losses outperform those
framed as gains. Nonfinancial incentives can be considerably more cost-effective than financial incentives
for younger students, but are less effective with older students. All motivating power of incentives vanishes
when rewards are handed out with a delay. Our results suggest that the current set of incentives may lead to
underinvestment.”
In this assignment you will replicate some of the results of that study. The basic idea of the paper is
to investigate whether providing students with the “right” incentives to do well in school causes better
performance, and it uses randomized experiment methodology. The objective of the research is formulated by
the authors:
“One of the biggest puzzles in education is why investment among many students is so low given the high
returns. One explanation is that the current set of long-run returns does not sufficiently motivate some
students to invest effort in school. If underinvestment is a problem, then there is a role for public policy in
stimulating investment.”
One explanation is hyperbolic discounting: when weighing the costs and benefits of taking an action
that has immediate costs but future benefits (i.e., studying!), people sometimes choose not to act because
they put too little weight on the (ever-so-distant) benefit.
To test whether they could get students to overcome hyperbolic discounting, the authors showed up on the
day of the test (right before the test) and randomly offered some students a financial reward if they could
improve their test score relative to last year’s score. Therefore, if the students tried hard, they would realize
the benefit immediately after the test rather than in the distant future. Specifically, the authors visited a
high school in Bloom Township (Bloom), a small school district south of Chicago, and conducted a random
experiment in which they randomly sorted students into three groups: high incentives (the student receives
$20 if they improve), low incentives (the student receives $10 if they improve), and control group (the student
receives nothing if they improve).
For this assignment, we answer a few questions about this context and conduct some simple analyses with
the authors’ data.
Non-experimental research design
Before getting to the actual data, suppose you want to answer the general question “Does providing students
with financial incentives to do well on tests improve test performance?” in a different way. You go to a local
high school and collect data on students’ test performance. You
then survey the parents of all students in your sample, asking parents, “On the day of the last test your child
wrote, did you promise to pay him or her at least $10 if they improved their performance relative to the last
time they took a similar test?”
Based on the answers to the survey, you create the following grouping variable for your sample of i = 1, ..., N
students:
$$
D_i =
\begin{cases}
1 & \text{if the parents promised to pay;} \\
0 & \text{if the parents did not promise to pay.}
\end{cases}
$$
You then specify the following econometric model to describe the test score performance of student i:
Yi = β0 + β1 ·Di + Ui
1.1.A. Provide an example of a factor that could be in Ui.
1.1.B. Explain (in words) why it is problematic to simply compare conditional means as a way of answering the
research question. That is, why is it not a good idea to compare the mean test score of students with Di = 1
to the mean test score of students with Di = 0?
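The comparison of conditional means can be explored with a small simulation. The sketch below uses entirely synthetic data: the unobserved factor (called involvement here), the effect sizes, and the way parents decide to promise a reward are all assumptions chosen for illustration, not features of any real survey.

```r
set.seed(620)
n <- 10000

# Hypothetical unobserved factor that sits in U_i.
involvement <- rnorm(n)

# Assumption: more involved parents are more likely to promise a reward.
D <- as.integer(involvement + rnorm(n) > 0)

# The true effect of the promise is set to 2 points by construction.
true_effect <- 2
Y <- 50 + true_effect * D + 5 * involvement + rnorm(n)

# Naive comparison of conditional means.
naive_diff <- mean(Y[D == 1]) - mean(Y[D == 0])

# The bivariate regression of Y on D returns the same number.
ols_slope <- unname(coef(lm(Y ~ D))["D"])

c(true = true_effect, naive = naive_diff, ols = ols_slope)
```

Because D is correlated with the omitted factor in this setup, the difference in means mixes the effect of the promise with the effect of the factor itself.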
Experimental research design
Now we move on to the experimental data. Open up the Experiment.xlsx data file. As with any random
experiment, we first want to make sure randomization was successful. The variables treatment and
treatment_full_name specify the treatment status of each observation in the sample. The experiment was
run in two different years and the variables are coded a little differently in each year.
Treatment in the 2009 sessions was as follows:
Di = INF: if the student was offered a low financial reward ($10)
Di = INH: if the student was offered a high financial reward ($20)
Di = NS: if the student was told nothing and offered nothing (control group)
Treatment in the 2010 session was as follows:
Di = INLH: if the student was offered a high financial reward ($20) but it was framed as a loss
Di = INH: if the student was offered a high financial reward ($20)
Di = INC: if the student was told nothing and offered nothing (control group)
The experiment wave is recorded in the variable session. In the first year of the experiment (2009), session
takes the values “Bloom 2009 spring” and “Bloom 2009 winter”. In the second year it is “Bloom
2010”. For the rest of the problem, keep only the 2009 observations.
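Restricting the sample to 2009 can be sketched as follows. The data frame below is a tiny synthetic stand-in for the imported file; only the column names session and treatment and the session labels are taken from the description above.

```r
# Synthetic stand-in for the imported data.
experiment <- data.frame(
  session   = c("Bloom 2009 spring", "Bloom 2009 winter", "Bloom 2010",
                "Bloom 2009 spring", "Bloom 2010"),
  treatment = c("NS", "INF", "INLH", "INH", "INC"),
  stringsAsFactors = FALSE
)

# Keep both 2009 sessions and drop the 2010 wave.
sessions_2009 <- c("Bloom 2009 spring", "Bloom 2009 winter")
exp2009 <- experiment[experiment$session %in% sessions_2009, ]

# Check which treatment codes remain after filtering.
table(exp2009$treatment)
```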
1.2.A. Find the proportion of students who are eligible for a free lunch in the control group, $\overline{FL}_C$, and the
proportion who are eligible in the low incentive group, $\overline{FL}_L$. Find the difference in the proportions, $\overline{FL}_C - \overline{FL}_L$.
1.2.B. Find the variance of the free lunch variable among a pooled sample of students in the control group
and students in the low incentive group (do not include students in the other groups).
1.2.C. Find the variance of the estimator $\overline{FL}_C - \overline{FL}_L$.
1.2.D. What is the value of the t-statistic for the hypothesis test that $\overline{FL}_C - \overline{FL}_L = 0$?
1.2.E. Using your calculations, conduct a hypothesis test that the proportion of students who are eligible for
a free lunch in the control group is the same as the proportion who are eligible in the low incentive group.
Compare it to R’s built-in hypothesis testing command that we showed in class. What do you conclude?
Based on your analysis with the free lunch variable, was randomization successful?
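The sequence of calculations in 1.2.A to 1.2.E can be sketched on synthetic data. Several things below are assumptions: the column name free_lunch is hypothetical, the variance of the difference is computed as the pooled-sample variance times (1/n_C + 1/n_L), and t.test() is only a guess at the built-in command shown in class.

```r
set.seed(1)

# Synthetic data; `free_lunch` is a hypothetical 0/1 column name.
n_c <- 200
n_l <- 200
dat <- data.frame(
  treatment  = rep(c("NS", "INF"), times = c(n_c, n_l)),
  free_lunch = rbinom(n_c + n_l, size = 1, prob = 0.7)
)

fl_c <- dat$free_lunch[dat$treatment == "NS"]   # control group
fl_l <- dat$free_lunch[dat$treatment == "INF"]  # low incentive group

# 1.2.A: difference in proportions
diff_hat <- mean(fl_c) - mean(fl_l)

# 1.2.B: variance of the variable in the stacked sample of both groups
s2_pooled <- var(c(fl_c, fl_l))

# 1.2.C: variance of the difference, assuming a common variance
var_diff <- s2_pooled * (1 / n_c + 1 / n_l)

# 1.2.D: t-statistic for H0: difference = 0
t_manual <- diff_hat / sqrt(var_diff)

# 1.2.E: built-in equal-variance test for comparison
builtin   <- t.test(fl_c, fl_l, var.equal = TRUE)
t_builtin <- unname(builtin$statistic)

c(manual = t_manual, builtin = t_builtin, p_value = builtin$p.value)
```

The two statistics should be close but not identical: t.test() with var.equal = TRUE pools the within-group variances, whereas var() of the stacked sample measures variation around the overall mean.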
1.2.F. Test the hypothesis that the average test score of students in the control group and the low incentives
($10) treatment group are the same. What do you conclude? (Note: test score is given by “current_score_t”
in this data set)
1.2.G. Test the hypothesis that the average test score of students in the control group and the high incentives
($20) treatment group are the same. What do you conclude?
Question 2.
For this question we will use the data file “Pollution.xlsx”. The description of the variables is below. A researcher
sets out to explore the effect of air pollution on human health. For that purpose, data on 58,648 individuals living
in the same country were collected. The hypothesis that the researcher is interested in testing concerns the effect
of the distance from a person’s residence to the nearest air-polluting facility on that individual’s health. In the data,
that distance is stored in the variable distance. Specifically, it measures the distance (in kilometers) from
the geographic location of a person’s main residence to the nearest industrial facility that emits potentially
hazardous substances into the air in quantities that exceed a certain threshold. The data file also
has information on gender (variable male, which is equal to 1 for men) and the age of each individual. The
problem the researcher faces is finding a consistent and reliable measure of an individual’s health status (the outcome
variable). Since health can be measured in different ways, the researcher collected information on three
different measures that reflect an individual’s health. The first measure is recorded in the variable health. It is
a health status self-reported by the survey participants, ranging from 1 (the worst) to 10 (the best). The
second is the person’s weight (in kilograms), stored in the variable weight. The last measure is the binary
variable hospital, which takes the value of 1 if the individual was admitted to a hospital at least
once during the period of study, and zero otherwise.
2.A. Estimate the relationship between health and distance to the nearest source of pollution using three
different regression models with health, weight, and hospital as dependent variables, and distance to the
nearest pollution source as the explanatory variable. Discuss the results.
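A minimal sketch of the three regressions is shown below. The data are simulated, so the numbers carry no information about the real file; only the variable names health, weight, hospital, and distance follow the description above. With a binary outcome such as hospital, lm() estimates a linear probability model.

```r
set.seed(2)
n <- 1000

# Synthetic stand-in for Pollution.xlsx (all relationships are made up).
pollution <- data.frame(distance = runif(n, 0.5, 50))
pollution$health   <- pmin(10, pmax(1, round(5 + 0.03 * pollution$distance + rnorm(n))))
pollution$weight   <- 75 + rnorm(n, sd = 12)
pollution$hospital <- rbinom(n, 1, plogis(-1 - 0.02 * pollution$distance))

# One simple regression per health measure.
m_health   <- lm(health   ~ distance, data = pollution)
m_weight   <- lm(weight   ~ distance, data = pollution)
m_hospital <- lm(hospital ~ distance, data = pollution)  # linear probability model

# Collect the slope estimates and standard errors in one table.
models <- list(health = m_health, weight = m_weight, hospital = m_hospital)
slopes <- t(sapply(models, function(m) {
  summary(m)$coefficients["distance", c("Estimate", "Std. Error")]
}))
slopes
```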
2.B. Discuss the validity of the research design used by the researcher. In particular, is it safe to argue that
the effects estimated in (2.A) reflect the causal effect of pollution on health?
2.C. Use one of the three models from (2.A) to estimate the elasticity of health with respect to the distance
to the nearest source of pollution.
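Two common ways of obtaining an elasticity are sketched below on synthetic data: a log-log regression, whose slope is a constant elasticity, and a level-level regression with the elasticity evaluated at the sample means. Which approach the assignment intends is not stated, so both are shown as assumptions; the simulated outcome is a positive continuous index with a true elasticity of 0.1 built in.

```r
set.seed(3)
n <- 5000

# Synthetic data with a true elasticity of 0.1 by construction.
distance <- runif(n, 0.5, 50)
health   <- exp(1 + 0.1 * log(distance) + rnorm(n, sd = 0.2))

# Log-log model: the slope is the (constant) elasticity.
m_loglog  <- lm(log(health) ~ log(distance))
elast_log <- unname(coef(m_loglog)["log(distance)"])

# Level-level model: elasticity = slope * mean(x) / mean(y).
m_lin      <- lm(health ~ distance)
elast_mean <- unname(coef(m_lin)["distance"]) * mean(distance) / mean(health)

c(log_log = elast_log, at_means = elast_mean)
```

The log transformation requires strictly positive values, which rules out a binary outcome such as hospital for the log-log form.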
2.D. Using the estimates from (2.C), construct a scatter plot of the predicted and actual health status. Add
the OLS regression equation to that plot.
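One possible layout for such a plot, using base R graphics and synthetic data, is sketched below. Plotting against distance, drawing the predictions as a line, and placing the equation in the title are design assumptions; a level-level model is used here for simplicity.

```r
set.seed(4)
n <- 300

# Synthetic data (the relationship is made up for illustration).
distance <- runif(n, 0.5, 50)
health   <- 5 + 0.03 * distance + rnorm(n)

fit  <- lm(health ~ distance)
pred <- fitted(fit)

# Build the equation label from the estimated coefficients.
b <- coef(fit)
eq_label <- sprintf("health = %.2f + %.3f * distance", b[1], b[2])

# Actual values as grey points, predicted values as a red line.
plot(distance, health, col = "grey50", pch = 16,
     xlab = "Distance (km)", ylab = "Health status", main = eq_label)
lines(sort(distance), pred[order(distance)], col = "red", lwd = 2)
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("grey50", "red"), pch = c(16, NA), lty = c(NA, 1), bty = "n")
```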
2.E. Lastly, verify two properties of the OLS estimator: property (1) from p. 75 of the class notes
“Class6_regressions”, and property (2) from p. 76.
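The class notes are not reproduced here, so which two properties are meant is unknown. Purely as an assumption, the sketch below checks two standard algebraic properties of OLS with an intercept: the residuals sum to zero, and the residuals are orthogonal to the regressor. The same pattern of computing a quantity from resid() and comparing it to zero applies to whichever properties the notes list.

```r
set.seed(5)

# Synthetic data for illustration.
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)

fit   <- lm(y ~ x)
u_hat <- resid(fit)

# Assumed property: residuals sum to zero.
sum_resid <- sum(u_hat)

# Assumed property: residuals are orthogonal to the regressor.
sum_x_resid <- sum(x * u_hat)

# Both should be zero up to floating-point error.
c(sum_resid = sum_resid, sum_x_resid = sum_x_resid)
```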
Question 3.
For this question, we will use the “WAGE2” data from Assignment 1 again.
3.A. Regress ln(wage) on educ and IQ. Report and interpret both the coefficient on educ and the coefficient
on IQ.
Now do the following:
3.B. Regress educ on IQ and save the residuals from this regression. Let’s label the residuals $\widehat{educ}_r$.
[Remember: if you estimate the model $educ_i = \alpha_0 + \alpha_1 IQ_i + \epsilon_i$ and obtain fitted values (or model predictions)
as $\widehat{educ}_i = \hat{\alpha}_0 + \hat{\alpha}_1 IQ_i$, then the estimated residual is $\widehat{educ}_r = educ_i - \widehat{educ}_i = educ_i - \hat{\alpha}_0 - \hat{\alpha}_1 IQ_i$.]
3.C. Regress ln(wage) on $\widehat{educ}_r$. Report and interpret the coefficient on $\widehat{educ}_r$. How does it compare to the
coefficient on educ from part (A)? Given the relationship between the coefficient here and that in part (A),
what can you say about how to interpret the estimated coefficients in a multiple regression? (Hint: we talked
about the partialling out interpretation in class.)
3.D. Now regress ln(wage) on IQ only and save the residuals from this regression. Let’s call these residuals
$\widehat{\ln(wage)}_r$. Now regress $\widehat{\ln(wage)}_r$ on $\widehat{educ}_r$ (from above) and report the coefficient. How does it compare
to the estimated coefficient from part (A) and the estimated coefficient from part (C)? $\widehat{\ln(wage)}_r$ is the part
of ln(wage) that is independent of IQ and $\widehat{educ}_r$ is the part of educ that is independent of IQ. Given this
knowledge, add to your explanation in part (C) about how to interpret estimated coefficients in a multiple
regression.
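The mechanics of parts (A) to (D) can be sketched on simulated data. The variables below are a synthetic stand-in for WAGE2 (the coefficients and distributions are made up), and the object names educ_r and lwage_r are hypothetical labels for the two sets of residuals.

```r
set.seed(6)
n <- 2000

# Synthetic stand-in for WAGE2 with correlated educ and IQ.
IQ    <- rnorm(n, mean = 100, sd = 15)
educ  <- 6 + 0.07 * IQ + rnorm(n, sd = 2)
lwage <- 5 + 0.05 * educ + 0.005 * IQ + rnorm(n, sd = 0.4)

# Part (A): multiple regression of ln(wage) on educ and IQ.
b_multi <- unname(coef(lm(lwage ~ educ + IQ))["educ"])

# Part (B): residuals from regressing educ on IQ.
educ_r <- resid(lm(educ ~ IQ))

# Part (C): ln(wage) on the educ residuals.
b_c <- unname(coef(lm(lwage ~ educ_r))["educ_r"])

# Part (D): residualize ln(wage) on IQ, then regress residual on residual.
lwage_r <- resid(lm(lwage ~ IQ))
b_d <- unname(coef(lm(lwage_r ~ educ_r))["educ_r"])

c(multiple = b_multi, part_C = b_c, part_D = b_d)
```

Printing the three coefficients side by side makes the comparison asked for in parts (C) and (D) immediate; the standard errors reported by the three regressions are not expected to coincide in the same way.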
3.E. Continuing with the regression equation from part (A), add tenure, age, and age squared as controls.
How does this affect the coefficient on education? Is it consistent with your expectations?