Mini-Project 2 (due W 3/24 by 11:59pm)
You may work with other students to complete the assignment, but each student needs to turn in
his or her own work (e.g., it is ok to discuss how to write the code to solve a question, but it is not
ok to copy and paste each other’s code). Show all work to receive full credit. For questions that ask
you to calculate something, I need to see the full calculation, not just the answer. Note that the
“calculation” will often (almost always) be done using code, so I would need to see the code that
leads up to and produces the final answer. Similarly, for questions that require software to perform
an analysis or generate a plot, I need to see the R code that produced the results. You may attach
relevant R code as an Appendix at the end of the assignment, or include the code as part of your
answer to the question that the code supports.
You should upload your assignment solution as a single pdf to the “Assignments” section of our
course in Canvas. Click on the name of the assignment, then click the “Submit Assignment” button,
then upload the file containing your solution, then click “Submit Assignment” a final time. The
filename should be in the format LastName_FirstName_ProjectNumber.pdf. For example, if I was
submitting the assignment, I would name it poythress_jc_proj2.pdf.
1. We will analyze a dataset containing information about the number of FEMA buyouts of
flood-prone properties from 1989 to 2016, which can be downloaded from Canvas in the file
fema_buyouts.csv. The dataset contains information at both the county and state level, but
we will focus on the county-level data. The response variable of interest is NumBuyouts. Of
interest is how certain socioeconomic and demographic factors are associated with the number
of buyouts. The covariates of interest are:
ALAND: land area in m2
AWATER: water area in m2
FDD: number of federal disaster declarations
FDD_IA: number of federal disaster declarations with individual assistance
CountyIncome: average household income
CountyEducation: proportion with high school education
CountyRace: proportion white
CountyPopulation: total population
CountyPopDens: population density
CountyLanguage: proportion proficient in English
(a) Make a scatterplot matrix of the covariates FDD–CountyLanguage. Do any of the covariates
appear correlated with one another? If so, which ones, and what effect might correlation
among the covariates have on the analysis? Do any observations have relatively large values
of a particular covariate? Should we consider transforming those covariates? If yes, why?
And which transformations should we consider?
Also make scatterplots of ln(NumBuyouts+1) vs. ln(ALAND) and ln(NumBuyouts+1) vs.
ln(ALAND). We might argue that either ALAND or AWATER should be treated as an exposure
variable, since either could serve as a proxy for the number of properties at risk of flooding.
Does the relationship between NumBuyouts and ALAND or AWATER suggest that either should
be treated as an exposure variable?
(b) For now, don’t treat ALAND or AWATER as an exposure variable (i.e., don’t use an offset).
Assume the response NumBuyouts is a Poisson random variable and fit a Poisson regression
model. Use all of the covariates ALAND, AWATER, FDD–CountyLanguage and any transforma-
tions of the covariates you wish to perform model selection to find the best model possible
for the response. You may also construct new variables from combinations of two or more
variables [e.g., AWATER/(AWATER+ALAND) would represent the proportion of the county that is
water, which may be a relevant covariate]. You can use whichever model selection algorithm
and criterion for the “best model,” but you should justify your choices.
Does the final model you selected appear to fit the data well? If your final model happened
to include either ALAND or AWATER or some function of one them, does it suggest one or the
other should be treated as an exposure variable?
(c) Make a histogram of NumBuyouts. Are there any unusual features apparent in the histogram
(where“unusual”is in the context of assuming the counts follow a Poisson distribution)? How
might the distribution of the counts be related to the lack-of-fit you may have encountered
for the models you fit in the previous part? [Hint: You may want to make a custom set
of breaks in the histogram, because the features may be difficult to see using the default
(d) Use the zeroinfl function from the pscl package to fit a zero-inflated Poisson (ZIP) model.
As in part (b) fit a model with many/all of the covariates and transformations of the co-
variates, then perform model selection to find a reduced model that includes some subset of
those covariates. The model you select for the count part of the model need not include the
same covariates as the model you select for the zero-inflated part of the model.
Refer to Faraway’s example of fitting a ZIP model in R.
If you fit a model, look at the summary, and see NAs for the SEs, Z values, and P -values,
try standardizing the covariates first.
The step function appears to work for the object returned by zeroinfl, but only for
the count part of the model. You might consider removing covariates “by hand” and
using LRTs to justify their removal for the zero-inflated part of the model (like Faraway
(e) Even though the ZIP model accounts for lack-of-fit due to excess zeros, it’s still possible
that the counts are overdispersed (or that the mean structure of the model is misspecified,
even after model selection). We should somehow check for lack-of-fit before interpreting
the model or drawing any conclusions. The zeroinfl function can also fit a zero-inflated
negative binomial (ZINB) model by changing the dist argument.
Fit a ZINB model with the same set of covariates in the count and zero-inflated parts of
the model that were in the final model you selected in the previous part. Presumably most
or all of the covariates you selected in your final ZIP model were significant. Are they still
significant in the ZINB model?
We discussed comparing the Poisson vs. NB models through a LRT of the overdispersion
parameter. We could do that for the zero-inflated versions of the models as well if we could
be confident that the asymptotic distribution of the LRT statistic under the null is the
same in the regular and zero-inflated versions of the models. However, I am unsure whether
or not that is the case. Alternatively, we could use AIC to compare the ZIP and ZINB
models. Unfortunately, we would need to know 1) that no constants have been left off the
log-likelihood 2) the AIC function counts the number of parameters of zeroinfl objects
properly and 3) the zeroinfl function uses the MLE for the estimate of the dispersion
parameter. Again, I am unsure of any of those things. However, the summary of the ZINB
model includes a Wald test for log(theta) (where theta is the overdispersion parameter),
which may not be ideal, but is at least something we can use to determine whether or not
there is evidence of overdispersion. Based on the summary of the fitted ZINB model, is there
evidence for overdispersion in the counts? That is, should we prefer the ZIP model or the
ZINB model for the purpose of drawing conclusions from the data?
(f) Suppose we decide that we want to base our conclusions on the ZINB model. If there
are covariates included in the ZINB model fit in the previous part that are not significant,
we should perform model selection once again before interpreting the model. Use whichever
procedure and criterion you prefer to perform model selection on the ZINB fit in the previous
part. Just make sure to clearly state why the criterion justifies removing a covariate from
the model, should you choose to remove any.
(g) Interpret the effect of the covariates in the final model you selected in the previous part. In
particular, how are the covariates included in the zero-inflated part of the model associated
with the odds that a county had no FEMA property buyouts from 1989 to 2016? How are
the covariates included in the count part of the model associated with the mean number of
FEMA property buyouts, among counties that had at least one FEMA buyout? Does the
association between each covariate and the number of FEMA buyouts match your intuition?
[Hint: If your final model includes standardized versions of the covariates, it’s OK to interpret
the covariate effects in loose terms. For example, if Income was included in the model, is an
increase in a county’s average income associated with an increase or decrease in the number
of FEMA buyouts? You don’t need to phrase the interpretation as “for a 1-unit increase in
standardized income, the mean number of FEMA buyouts increases/decreases by a factor
of ....” A precise quantitative interpretation of the effect of a covariate in anything but its
original units makes things messy and complicated and difficult to understand.]
2. Do Exercise 2 on page 126 of Faraway.
3. We will analyze a dataset about chronic respiratory disease, which can be downloaded from Can-
vas in the file respire.dat. The dataset has column names, so use the header=TRUE argument
when you read it into R.
The dataset has information on three covariates:
air: air pollution (low or high)
exposure: job exposure (yes or no)
smoking: non-smoker, ex-smoker, or current smoker
The goal is to analyze the covariates’ relationships with the response – the counts of individuals
falling into four chronic respiratory disease categories:
Level 1: no symptoms
Level 2: cough or phlegm < 3 months/year
Level 3: cough or phlegm ≥ 3 months/year
Level 4: cough or phlegm ≥ 3 months/year + shortness of breath
Thus, we have an ordinal multinomial response. Furthermore, we could argue that the categories
arise from a sequential mechanism, so that a continuation ratio logit model would be reasonable.
(a) After reading the data into R, take a look at the dataset to see how it is structured. Now use
the vglm function from the VGAM package to fit parallel and non-parallel versions of the cumu-
lative logit, adjacent category logit, and continuation ratio logit models. Use main effects for
air, exposure, and smoking as the covariates in the models. For each type of model, use a
LRT to determine whether the non-parallel version fits better than the parallel version. Is the
preferred version of the model (parallel vs. non-parallel) consistent across the three different
types of models? How many more parameters do the non-parallel versions of the models have
vs. the parallel versions? [Hint: The model type can be changed via the family argument,
where the names are cumulative, acat, and cratio. The parallel argument controls
which version of the model is fitted. So for example, the non-parallel version of the contin-
uation ratio logit model would specified with the argument family=cratio(parallel=F).
Note that we could also change the link argument, but logit is the default for all three
types, so we don’t need to adjust it.]
(b) For each type of model (cumulative, adjacent category, and continuation ratio), choose
either the parallel version or non-parallel version to proceed with, based on the LRTs from
the previous part. Use the drop1 function to perform LRTs to determine whether any
covariates can be removed from the models. If yes, refit the models with the covariate(s)
In your final models (there should be one for each of the three types), is there evidence for
lack-of-fit? Report the values of the statistics on which you base your conclusions regarding
lack-of-fit. Why is the statistic you chose an appropriate measure of goodness-of-fit?
(c) For each of the final models chosen in the previous part, interpret the effects of the covariates
on chronic respiratory disease. Note that each type of model involves a subtly different
function of the probabilities of the respiratory disease levels, and your interpretations should
reflect that. [Hint: If you look at the summary of the fitted model, besides the reference
levels for the covariates, it also tells you which linear predictors are being modelled. Not
only should this help you interpret the effects, but you can also reverse the direction of the
probabilities by refitting the model with the reverse=T argument if that leads to a more
convenient way to interpret the effects.]
FYI, if the models you chose were the non-parallel slopes versions, the interpretations can get
quite messy and tedious. In other words, something like “current smokers have a higher odds
of more severe respiratory disease” would be too simplistic; it neither takes into account the
non-parallel slopes nor the choice of model for the probabilities. So I am expecting the detail
and nuance of the interpretation to be commensurate with the complexity of the model.
(d) Suppose we wanted to pick just one type of model among the cumulative logit, adjacent
category logit, and continuation ratio logit. How would you compare the three final models
from the previous two parts? LRTs? AIC or BIC? Log-likelihood? Some other method?
Choose a model comparison method and determine which model is preferred among the
three final models you selected in the previous two parts. Justify your choice of model
comparison method by explaining why it is appropriate (and why some of the other choices
are not appropriate).