Mini-Project 2 (due W 3/24 by 11:59pm)

Instructions:

You may work with other students to complete the assignment, but each student needs to turn in

his or her own work (e.g., it is ok to discuss how to write the code to solve a question, but it is not

ok to copy and paste each other’s code). Show all work to receive full credit. For questions that ask

you to calculate something, I need to see the full calculation, not just the answer. Note that the

“calculation” will often (almost always) be done using code, so I would need to see the code that

leads up to and produces the final answer. Similarly, for questions that require software to perform

an analysis or generate a plot, I need to see the R code that produced the results. You may attach

relevant R code as an Appendix at the end of the assignment, or include the code as part of your

answer to the question that the code supports.

You should upload your assignment solution as a single pdf to the “Assignments” section of our

course in Canvas. Click on the name of the assignment, then click the “Submit Assignment” button,

then upload the file containing your solution, then click “Submit Assignment” a final time. The

filename should be in the format LastName_FirstName_ProjectNumber.pdf. For example, if I were

submitting the assignment, I would name it poythress_jc_proj2.pdf.

Questions:

1. We will analyze a dataset containing information about the number of FEMA buyouts of

flood-prone properties from 1989 to 2016, which can be downloaded from Canvas in the file

fema_buyouts.csv. The dataset contains information at both the county and state level, but

we will focus on the county-level data. The response variable of interest is NumBuyouts. Of

interest is how certain socioeconomic and demographic factors are associated with the number

of buyouts. The covariates of interest are:

ALAND: land area in m²

AWATER: water area in m²

FDD: number of federal disaster declarations

FDD_IA: number of federal disaster declarations with individual assistance

CountyIncome: average household income

CountyEducation: proportion with high school education

CountyRace: proportion white

CountyPopulation: total population

CountyPopDens: population density

CountyLanguage: proportion proficient in English

(a) Make a scatterplot matrix of the covariates FDD–CountyLanguage. Do any of the covariates

appear correlated with one another? If so, which ones, and what effect might correlation

among the covariates have on the analysis? Do any observations have relatively large values

of a particular covariate? Should we consider transforming those covariates? If yes, why?

And which transformations should we consider?

Also make scatterplots of ln(NumBuyouts+1) vs. ln(ALAND) and ln(NumBuyouts+1) vs.

ln(AWATER). We might argue that either ALAND or AWATER should be treated as an exposure

variable, since either could serve as a proxy for the number of properties at risk of flooding.

Does the relationship between NumBuyouts and ALAND or AWATER suggest that either should

be treated as an exposure variable?
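A minimal sketch of the part (a) plots, using simulated stand-in data so the code runs on its own (in practice you would read fema_buyouts.csv; the column names are taken from the list above):

```r
# Simulated stand-in for the county-level data; in practice:
# fema <- read.csv("fema_buyouts.csv")
set.seed(1)
n <- 100
fema <- data.frame(
  NumBuyouts       = rpois(n, 2),
  ALAND            = runif(n, 1e8, 1e10),
  AWATER           = runif(n, 1e6, 1e9),
  FDD              = rpois(n, 10),
  FDD_IA           = rpois(n, 5),
  CountyIncome     = rnorm(n, 5e4, 1e4),
  CountyEducation  = runif(n, 0.6, 1),
  CountyRace       = runif(n),
  CountyPopulation = rpois(n, 5e4),
  CountyPopDens    = runif(n, 1, 500),
  CountyLanguage   = runif(n, 0.7, 1)
)

# Scatterplot matrix of the covariates FDD through CountyLanguage
pairs(fema[, c("FDD", "FDD_IA", "CountyIncome", "CountyEducation",
               "CountyRace", "CountyPopulation", "CountyPopDens",
               "CountyLanguage")])

# Exposure-variable diagnostics on the log scale: a roughly linear
# trend with slope near 1 would support treating the covariate as
# an offset
plot(log(fema$ALAND),  log(fema$NumBuyouts + 1),
     xlab = "ln(ALAND)",  ylab = "ln(NumBuyouts + 1)")
plot(log(fema$AWATER), log(fema$NumBuyouts + 1),
     xlab = "ln(AWATER)", ylab = "ln(NumBuyouts + 1)")
```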

(b) For now, don’t treat ALAND or AWATER as an exposure variable (i.e., don’t use an offset).

Assume the response NumBuyouts is a Poisson random variable and fit a Poisson regression

model. Using all of the covariates ALAND, AWATER, FDD–CountyLanguage, and any transformations

of the covariates you wish, perform model selection to find the best model possible

for the response. You may also construct new variables from combinations of two or more

variables [e.g., AWATER/(AWATER+ALAND) would represent the proportion of the county that is

water, which may be a relevant covariate]. You can use whichever model selection algorithm

and criterion for the “best model” you prefer, but you should justify your choices.

Does the final model you selected appear to fit the data well? If your final model happened

to include either ALAND or AWATER or some function of one of them, does it suggest one or the

other should be treated as an exposure variable?
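One way the part (b) workflow might look in code: a Poisson GLM followed by AIC-based backward elimination with step(). The simulated data and the particular transformations shown (logs of the area and population variables, the proportion-water variable) are illustrative assumptions, not the required choices:

```r
# Simulated stand-in for the county-level data (subset of covariates)
set.seed(1)
n <- 200
fema <- data.frame(
  ALAND            = runif(n, 1e8, 1e10),
  AWATER           = runif(n, 1e6, 1e9),
  FDD              = rpois(n, 10),
  CountyIncome     = rnorm(n, 5e4, 1e4),
  CountyPopulation = rpois(n, 5e4)
)
fema$NumBuyouts <- rpois(n, exp(-1 + 0.1 * fema$FDD))

# Example derived covariate: proportion of the county that is water
fema$PropWater <- with(fema, AWATER / (AWATER + ALAND))

full <- glm(NumBuyouts ~ log(ALAND) + log(AWATER) + FDD + CountyIncome +
              log(CountyPopulation) + PropWater,
            family = poisson, data = fema)
best <- step(full, trace = 0)   # stepwise search minimizing AIC
summary(best)

# Crude lack-of-fit check: residual deviance vs. chi-squared on its df
pchisq(deviance(best), df.residual(best), lower.tail = FALSE)
```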

(c) Make a histogram of NumBuyouts. Are there any unusual features apparent in the histogram

(where “unusual” is in the context of assuming the counts follow a Poisson distribution)? How

might the distribution of the counts be related to the lack-of-fit you may have encountered

for the models you fit in the previous part? [Hint: You may want to make a custom set

of breaks in the histogram, because the features may be difficult to see using the default

breakpoints.]
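The custom-breaks idea in the hint, sketched on simulated zero-inflated counts standing in for NumBuyouts:

```r
# Zero-inflated Poisson counts as a stand-in for NumBuyouts
set.seed(1)
y <- ifelse(rbinom(500, 1, 0.6) == 1, 0L, rpois(500, 4))

# One bar per integer value so the spike at zero is visible
hist(y, breaks = seq(-0.5, max(y) + 0.5, by = 1),
     main = "Histogram of NumBuyouts (simulated)", xlab = "count")

# Excess zeros relative to a Poisson with the same mean
mean(y == 0)        # observed proportion of zeros
exp(-mean(y))       # zero probability implied by a Poisson fit
```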

(d) Use the zeroinfl function from the pscl package to fit a zero-inflated Poisson (ZIP) model.

As in part (b), fit a model with many/all of the covariates and transformations of the

covariates, then perform model selection to find a reduced model that includes some subset of

those covariates. The model you select for the count part of the model need not include the

same covariates as the model you select for the zero-inflated part of the model.

Some hints:

Refer to Faraway’s example of fitting a ZIP model in R.

If you fit a model, look at the summary, and see NAs for the SEs, Z values, and P-values,

try standardizing the covariates first.

The step function appears to work for the object returned by zeroinfl, but only for

the count part of the model. You might consider removing covariates “by hand” and

using LRTs to justify their removal for the zero-inflated part of the model (like Faraway

did).
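A sketch of the zeroinfl call (assuming the pscl package is installed), on simulated data with a structural-zero mechanism. The two-part formula and the by-hand LRT for pruning the zero-inflated part follow the hints above:

```r
library(pscl)   # assumed installed

# Simulated counts with excess zeros
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- ifelse(rbinom(n, 1, plogis(-1 + x2)) == 1, 0L,
             rpois(n, exp(0.5 + 0.8 * x1)))
# Standardize the covariates first, per the hint
d  <- data.frame(y = y,
                 x1 = as.numeric(scale(x1)),
                 x2 = as.numeric(scale(x2)))

# count model | zero-inflation model (the two parts may differ)
zip <- zeroinfl(y ~ x1 + x2 | x1 + x2, data = d)
summary(zip)

# step() only searches the count part; prune the zero part by hand
# with an LRT, e.g. testing whether x1 can leave the zero part:
zip0 <- zeroinfl(y ~ x1 + x2 | x2, data = d)
lrt  <- as.numeric(2 * (logLik(zip) - logLik(zip0)))
pchisq(lrt, df = 1, lower.tail = FALSE)
```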

(e) Even though the ZIP model accounts for lack-of-fit due to excess zeros, it’s still possible

that the counts are overdispersed (or that the mean structure of the model is misspecified,

even after model selection). We should somehow check for lack-of-fit before interpreting

the model or drawing any conclusions. The zeroinfl function can also fit a zero-inflated

negative binomial (ZINB) model by changing the dist argument.

Fit a ZINB model with the same set of covariates in the count and zero-inflated parts of

the model that were in the final model you selected in the previous part. Presumably most

or all of the covariates you selected in your final ZIP model were significant. Are they still

significant in the ZINB model?

We discussed comparing the Poisson vs. NB models through a LRT of the overdispersion

parameter. We could do that for the zero-inflated versions of the models as well if we could

be confident that the asymptotic distribution of the LRT statistic under the null is the

same in the regular and zero-inflated versions of the models. However, I am unsure whether

or not that is the case. Alternatively, we could use AIC to compare the ZIP and ZINB

models. Unfortunately, we would need to know 1) that no constants have been left off the

log-likelihood, 2) that the AIC function counts the number of parameters of zeroinfl objects

properly, and 3) that the zeroinfl function uses the MLE for the estimate of the dispersion

parameter. Again, I am unsure about all of those things. However, the summary of the ZINB

model includes a Wald test for log(theta) (where theta is the overdispersion parameter),

which may not be ideal, but is at least something we can use to determine whether or not

there is evidence of overdispersion. Based on the summary of the fitted ZINB model, is there

evidence for overdispersion in the counts? That is, should we prefer the ZIP model or the

ZINB model for the purpose of drawing conclusions from the data?
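Switching the dist argument gives the ZINB fit. A minimal sketch (pscl assumed installed, simulated overdispersed counts), where the Log(theta) row of the summary is the Wald test the question refers to:

```r
library(pscl)   # assumed installed

# Simulated overdispersed counts with excess zeros
set.seed(1)
n  <- 500
x1 <- rnorm(n)
y  <- ifelse(rbinom(n, 1, 0.3) == 1, 0L,
             rnbinom(n, size = 1.5, mu = exp(0.5 + 0.8 * x1)))
d  <- data.frame(y, x1)

zinb <- zeroinfl(y ~ x1 | 1, dist = "negbin", data = d)
summary(zinb)   # the count-part table includes a Log(theta) row
zinb$theta      # estimated overdispersion parameter
```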

(f) Suppose we decide that we want to base our conclusions on the ZINB model. If there

are covariates included in the ZINB model fit in the previous part that are not significant,

we should perform model selection once again before interpreting the model. Use whichever

procedure and criterion you prefer to perform model selection on the ZINB fit in the previous

part. Just make sure to clearly state why the criterion justifies removing a covariate from

the model, should you choose to remove any.

(g) Interpret the effect of the covariates in the final model you selected in the previous part. In

particular, how are the covariates included in the zero-inflated part of the model associated

with the odds that a county had no FEMA property buyouts from 1989 to 2016? How are

the covariates included in the count part of the model associated with the mean number of

FEMA property buyouts, among counties that had at least one FEMA buyout? Does the

association between each covariate and the number of FEMA buyouts match your intuition?

[Hint: If your final model includes standardized versions of the covariates, it’s OK to interpret

the covariate effects in loose terms. For example, if Income was included in the model, is an

increase in a county’s average income associated with an increase or decrease in the number

of FEMA buyouts? You don’t need to phrase the interpretation as “for a 1-unit increase in

standardized income, the mean number of FEMA buyouts increases/decreases by a factor

of ....” A precise quantitative interpretation of the effect of a covariate in anything but its

original units makes things messy and complicated and difficult to understand.]
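For part (g), the directions of the effects can be read off the exponentiated coefficients of the two parts of the model. A sketch on a hypothetical ZINB fit (pscl assumed installed, simulated data):

```r
library(pscl)   # assumed installed

# Minimal ZINB fit so there is an object to interpret (simulated data)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
y  <- ifelse(rbinom(n, 1, 0.3) == 1, 0L,
             rnbinom(n, size = 1.5, mu = exp(0.5 + 0.8 * x1)))
d  <- data.frame(y, x1)
zinb <- zeroinfl(y ~ x1 | x1, dist = "negbin", data = d)

# Count part: exp(beta) multiplies the mean count among counties
# not in the structural-zero class
exp(coef(zinb, model = "count"))
# Zero part: exp(gamma) multiplies the odds of a structural zero,
# i.e., of a county having no buyouts at all
exp(coef(zinb, model = "zero"))
```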

2. Do Exercise 2 on page 126 of Faraway.

3. We will analyze a dataset about chronic respiratory disease, which can be downloaded from Can-

vas in the file respire.dat. The dataset has column names, so use the header=TRUE argument

when you read it into R.

The dataset has information on three covariates:

air: air pollution (low or high)

exposure: job exposure (yes or no)

smoking: non-smoker, ex-smoker, or current smoker

The goal is to analyze the covariates’ relationships with the response – the counts of individuals

falling into four chronic respiratory disease categories:

Level 1: no symptoms

Level 2: cough or phlegm < 3 months/year

Level 3: cough or phlegm ≥ 3 months/year

Level 4: cough or phlegm ≥ 3 months/year + shortness of breath

Thus, we have an ordinal multinomial response. Furthermore, we could argue that the categories

arise from a sequential mechanism, so that a continuation ratio logit model would be reasonable.

(a) After reading the data into R, take a look at the dataset to see how it is structured. Now use

the vglm function from the VGAM package to fit parallel and non-parallel versions of the cumu-

lative logit, adjacent category logit, and continuation ratio logit models. Use main effects for

air, exposure, and smoking as the covariates in the models. For each type of model, use a

LRT to determine whether the non-parallel version fits better than the parallel version. Is the

preferred version of the model (parallel vs. non-parallel) consistent across the three different

types of models? How many more parameters do the non-parallel versions of the models have

vs. the parallel versions? [Hint: The model type can be changed via the family argument,

where the names are cumulative, acat, and cratio. The parallel argument controls

which version of the model is fitted. So, for example, the non-parallel version of the contin-

uation ratio logit model would be specified with the argument family=cratio(parallel=F).

Note that we could also change the link argument, but logit is the default for all three

types, so we don’t need to adjust it.]
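The 3(a) comparison, sketched with VGAM (assumed installed) on simulated individual-level ordinal data. Note the real respire.dat may be grouped counts, in which case the response in the formula would be a matrix of counts rather than a factor:

```r
library(VGAM)   # assumed installed

# Simulated 4-level ordinal response with two factor covariates
set.seed(1)
n <- 400
air     <- factor(sample(c("low", "high"), n, replace = TRUE))
smoking <- factor(sample(c("non", "ex", "current"), n, replace = TRUE))
y <- factor(sample(1:4, n, replace = TRUE, prob = c(.4, .3, .2, .1)),
            ordered = TRUE)
d <- data.frame(y, air, smoking)

par_fit  <- vglm(y ~ air + smoking,
                 family = cumulative(parallel = TRUE),  data = d)
npar_fit <- vglm(y ~ air + smoking,
                 family = cumulative(parallel = FALSE), data = d)

# LRT: does the non-parallel version fit significantly better?
lrtest(npar_fit, par_fit)

# Repeat the same pattern with family = acat(...) and family = cratio(...)
```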

(b) For each type of model (cumulative, adjacent category, and continuation ratio), choose

either the parallel version or non-parallel version to proceed with, based on the LRTs from

the previous part. Use the drop1 function to perform LRTs to determine whether any

covariates can be removed from the models. If yes, refit the models with the covariate(s)

removed.

In your final models (there should be one for each of the three types), is there evidence for

lack-of-fit? Report the values of the statistics on which you base your conclusions regarding

lack-of-fit. Why is the statistic you chose an appropriate measure of goodness-of-fit?

(c) For each of the final models chosen in the previous part, interpret the effects of the covariates

on chronic respiratory disease. Note that each type of model involves a subtly different

function of the probabilities of the respiratory disease levels, and your interpretations should

reflect that. [Hint: If you look at the summary of the fitted model, besides the reference

levels for the covariates, it also tells you which linear predictors are being modelled. Not

only should this help you interpret the effects, but you can also reverse the direction of the

probabilities by refitting the model with the reverse=T argument if that leads to a more

convenient way to interpret the effects.]

FYI, if the models you chose were the non-parallel slopes versions, the interpretations can get

quite messy and tedious. In other words, something like “current smokers have a higher odds

of more severe respiratory disease” would be too simplistic; it neither takes into account the

non-parallel slopes nor the choice of model for the probabilities. So I am expecting the detail

and nuance of the interpretation to be commensurate with the complexity of the model.

(d) Suppose we wanted to pick just one type of model among the cumulative logit, adjacent

category logit, and continuation ratio logit. How would you compare the three final models

from the previous two parts? LRTs? AIC or BIC? Log-likelihood? Some other method?

Choose a model comparison method and determine which model is preferred among the

three final models you selected in the previous two parts. Justify your choice of model

comparison method by explaining why it is appropriate (and why some of the other choices

are not appropriate).
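For 3(d), the three model types are not nested in one another, so pairwise LRTs do not apply; an information criterion comparison is straightforward. A sketch (VGAM assumed installed, toy data as in the earlier sketches):

```r
library(VGAM)   # assumed installed

set.seed(1)
n   <- 400
air <- factor(sample(c("low", "high"), n, replace = TRUE))
y   <- factor(sample(1:4, n, replace = TRUE, prob = c(.4, .3, .2, .1)),
              ordered = TRUE)
d <- data.frame(y, air)

# The three "final" models (here all parallel, for illustration)
fits <- list(
  cumulative = vglm(y ~ air, family = cumulative(parallel = TRUE), data = d),
  acat       = vglm(y ~ air, family = acat(parallel = TRUE),       data = d),
  cratio     = vglm(y ~ air, family = cratio(parallel = TRUE),     data = d)
)

sapply(fits, AIC)   # the smallest AIC identifies the preferred model
```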
