ECON 113 Fall 2021
Problem Set #4: Statistical Inference in Multiple Regression
Note: Questions #1 and #2 both use the dataset rental.dta. Question #3 uses houseprices.dta.
(#1) (Data exercise). You should use Stata for parts (j) and (k) of this problem.
Are rental rates influenced by the student population in a college town? Let rent be the average monthly
rent paid on rental units in a college town in the United States. Let pop denote the total city population,
avginc the average city income, and pctstu the student population as a percentage of the total
population.
One model to test for a relationship is
ln(rent) = β0 + β1ln(pop) + β2ln(avginc) + β3pctstu + u
a) State the null hypothesis that the size of the student body relative to the population has no ceteris
paribus effect on monthly rents. State the alternative that there is an effect.
b) What signs do you expect for β1 and β2?
The equation estimated using 1990 data from rental.dta for 64 college towns is
predicted ln(rent) = .043 + .066 lnpop + .507 lnavginc + .0056 pctstu
                    (.844) (.039)       (.081)          (.0017)
n = 64, R² = .458
Here are the summary statistics and correlation matrix for the four variables:
    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
      lnrent |        64    6.026034     .200436    5.717028   6.829794
       lnpop |        64    11.16897    .6245325    10.16119   13.35808
    lnavginc |        64    10.04073    .2556954    9.133676   10.93857
      pctstu |        64    27.84786    13.61892    11.45658   71.20982

             |   lnrent     lnpop  lnavginc    pctstu
-------------+---------------------------------------
      lnrent |   1.0000
       lnpop |   0.2195    1.0000
    lnavginc |   0.6029    0.3692    1.0000
      pctstu |   0.0598   -0.5869   -0.3127    1.0000
c) Interpret the slope coefficient of 0.0056 on pctstu. Be careful when stating the units.
d) If lnavginc were omitted from the regression, what do you think would happen to the coefficient on
pctstu? Explain briefly.
e) Using the estimates from above and one of the following critical values, test the hypothesis stated
in part (a) at the 5% level. Explain why you chose the critical value that you did. [Hint: you will
need to calculate the t-stat for β̂3.]
Stata note: the expression invt(df,p) gives the value t of a t-distributed random variable T with df
degrees of freedom for which Pr(T ≤ t) = p. That is, it gives the critical value that leaves probability p in
one tail (the lower tail) of a t-distribution with df degrees of freedom.
. display invt(60,.005)
-2.660283
. display invt(60,.01)
-2.3901195
. display invt(60,.025)
-2.0002978
. display invt(60,.05)
-1.6706489
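A related function may help when reading these values. As a minimal sketch: Stata's invttail(df,p) is the upper-tail counterpart of invt(df,p), so by symmetry of the t-distribution it returns the same magnitude with the opposite sign. The 30 degrees of freedom and the 20% level below are hypothetical, chosen only for illustration and not taken from this problem.

```stata
* Lower-tail quantile: t such that Pr(T < t) = .10 (negative)
display invt(30, .10)
* Upper-tail quantile: t such that Pr(T > t) = .10 (same magnitude, positive)
display invttail(30, .10)
* A two-sided test at a hypothetical 20% level puts .20/2 in each tail
display invttail(30, .20/2)
```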
f) Now focus on the slope coefficient on lnpop (β1). Construct a (two-sided) 90% confidence interval
for β1. (Continue to use the estimates from part (b) and choose the appropriate critical values from
part (e).) Write a statement that explains this 90% confidence interval.
g) Construct a (two-sided) 95% confidence interval for β1. (Again, choose the appropriate critical
value from part (e)).
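A two-sided confidence interval has the form estimate ± critical value × standard error. The sketch below shows the arithmetic in Stata under stated assumptions: the estimate (.50), standard error (.20), and degrees of freedom (30) are hypothetical and do not come from rental.dta.

```stata
* Hypothetical inputs, for illustration only
scalar b  = .50
scalar se = .20
* Critical value for a two-sided 90% interval: 5% in each tail
scalar c  = invttail(30, .05)
* Lower and upper endpoints of the interval
display b - c*se
display b + c*se
```

For a 95% interval, the only change would be the tail probability passed to invttail (.025 in place of .05).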
h) Based on your answers to (f) and (g), what can you conclude about the (2-sided) p-value for β̂1:
A. p > .10
B. p = .10
C. p = .05
D. .05 < p < .10
E. p < .05
i) Use the output in (b) to calculate the t-statistic for β̂1. It is: __________.
Now use the output from the Stata command:
. display ttail(60,1.70)
.04715491
Stata note: the expression ttail(df,t) gives the probability p = Pr(T > t) for a t-distributed
random variable T with df degrees of freedom. That is, it gives the probability in the upper tail of a
t-distribution with df degrees of freedom, beyond the value t.
Use this information (including knowledge of the t-statistic you calculated) to obtain the (two-sided)
p-value associated with β̂1. Your answer here should be consistent with your answer to part (h).
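As a rough illustration of how a one-tail probability relates to a two-sided p-value, the sketch below doubles the upper-tail area beyond the absolute value of a t-statistic. The t-statistic (-1.25) and the 30 degrees of freedom are hypothetical and unrelated to the rental regression.

```stata
* Hypothetical t-statistic and degrees of freedom, for illustration only
scalar tstat = -1.25
* Two-sided p-value: twice the upper-tail area beyond |t|
display 2*ttail(30, abs(tstat))
```

Taking the absolute value first keeps the calculation correct whether the t-statistic is positive or negative.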
j) Now estimate the above regression yourself, and where possible, use the output to verify your
answers thus far. Some Stata notes:
• The data set rental.dta contains data from two years, 1980 and 1990. The variable “year”
takes on one of two values (80 or 90). To estimate the regression for the 1990 data only, type
“if” after the last variable in the regression statement and then the expression “year==90”.
Note that there is no comma before “if”:
regress y x1 x2 x3 if year==90
• You should run the command once with the default significance level of 5% (and 95%
confidence intervals) and then again with the option (typed after a comma):
, level(90)
to get output corresponding to the 10% significance level and 90% confidence intervals.
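Put together, the two runs might look like the sketch below. The variable names lnrent, lnpop, lnavginc, and pctstu are taken from the summary table in this problem; it is an assumption that they exist under exactly these names in rental.dta and that the file sits in the current working directory.

```stata
* Load the data (assumes rental.dta is in the working directory)
use rental.dta, clear
* 1990 observations only; default 95% confidence intervals
regress lnrent lnpop lnavginc pctstu if year==90
* Same regression, reporting 90% confidence intervals
regress lnrent lnpop lnavginc pctstu if year==90, level(90)
```

The level() option changes only the reported confidence intervals; the coefficients, standard errors, t-statistics, and p-values are the same in both runs.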
k) Based on your regression output from part (j), which of the three variables (lnpop, lnavginc,
pctstu) is statistically significant at:
a. a 10% level?
b. a 5% level?
c. a 1% level?
d. a 0.1 % level?
(#2) (Data exercise). This problem also uses the data set rental.dta. In question #1, you estimated the
following model for rental rates in college towns, using all 64 observations:
predicted log(rent) = .043 + .066 log(pop) + .507 log(avginc) + .0056 pctstu
                     (.844) (.039)          (.081)             (.0017)
n = 64, R² = .458
Now suppose that you accidentally deleted a few observations from the data for year 1990 before running
the regression. Let’s assume that the observations were deleted at random, so the sample you have left
is still a random sample.
a) Do you think your estimate of β̂1 would change? Why or why not? If so, can you predict whether
β̂1 will get bigger or smaller?
b) Explain why you might expect the t-stat for β̂1 to get smaller.
c) Is it possible that the t-stat will actually get bigger? Explain.
Now, try estimating the regression after dropping 4 observations. Do this as follows.
First, “keep” only the observations corresponding to year 1990 by typing:
keep if year==90
(type describe to verify that you now have a data set with only 64 observations, and tab year to
verify that year is equal to 90 for all of them.)
Next, re-estimate the regression using all 64 observations. (In #1, you did this by adding “if year==90” to
the end of the command. Now you no longer need this “if” condition.)
Now, to estimate the regression using only the first 60 observations (i.e., ignoring the last 4), type
regress y x … in 1/60
Try it again using only the last 60 observations:
regress y x … in 5/64
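Written out for this regression, the two subsample runs might look like the sketch below. Stata's in qualifier takes an observation range written with a forward slash (first/last). The variable names are those shown in the summary table for #1 and are assumed to match the names in rental.dta; the sketch also assumes that only the 64 observations for 1990 remain in memory.

```stata
* Assumes "keep if year==90" has already been run (64 observations in memory)
* First 60 observations
regress lnrent lnpop lnavginc pctstu in 1/60
* Last 60 observations
regress lnrent lnpop lnavginc pctstu in 5/64
```

Because in refers to positions in the current sort order, re-sorting the data between runs would change which observations are dropped.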
d) What happened to β̂1 and to the t-statistic in each case?
(#3) (Data exercise). Use the data on housing prices (houseprices.dta) to estimate a simple linear
regression of price on number of bedrooms (bdrms).
a) Is the coefficient on bdrms statistically significant at a:
• 10% significance level? YES / NO
• 5% level? YES / NO
• 1% level? YES / NO
b) Now control for both the size of the house and the size of the lot in your regression. Conditional on the
size of the house and the size of the lot, is the predicted effect of an additional bedroom on the sale price
of a house significant at the:
• 10% significance level? YES / NO
• 5% level? YES / NO
• 1% level? YES / NO
c) A realtor tells you that you should expect to pay an extra $150 on average ($.15K) for each additional
square foot in this housing market, holding constant the size of the lot and the number of bedrooms.
Suppose that you decide you will trust this realtor’s advice unless you are 95% confident that the
statement is wrong based on your own regression analysis. Do you trust the realtor? [Hint: test the null
hypothesis that the statement is true.]
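One way to carry out a test of a nonzero null value in Stata is the test command run after regress, sketched below. The variable names sqrft and lotsize are assumptions (only price and bdrms are named in this problem), so check the actual names with describe. The sketch also assumes price is recorded in thousands of dollars, as the "$.15K" in the question suggests.

```stata
* sqrft and lotsize are assumed variable names; verify with describe
regress price sqrft lotsize bdrms
* Test H0: the coefficient on sqrft equals .15
test sqrft = .15
```

For a single restriction, test reports an F statistic that equals the square of the corresponding t-statistic, together with its p-value.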
d) The p-value of .128 on bdrms in your estimated multiple regression model implies that...
• ... an additional bedroom does not have a statistically significant effect on the home price (using conventional thresholds for significance) once we control for the size of the house and the lot. TRUE/ FALSE ?
• ... an additional bedroom does not have an economically meaningful effect on the home price (once we control for the size of the house and the lot). TRUE/ FALSE ?
• ... the true coefficient on bdrms in the price regression is zero (once we control for the size of the house and the lot). TRUE/ FALSE ?
e) Suppose you were able to collect additional data on housing prices and quadruple the size of your
random sample (i.e., you now have 88 × 4 = 352 observations). If you re-estimate the multiple regression model using the new sample, would you expect the new coefficient on bdrms to be statistically significant at the 5% level? Explain.