Date: 2022-12-06

Coursework - CHIC600
Instructions
This assessment is designed to test your knowledge of the material covered during the course. With this in
mind, you need to provide descriptive answers and describe the different steps of your analysis. The solutions
to the exercises below should be provided in a PDF that includes the R code, its output (figures, tables,
etc.) and the text with the description. My suggestion is to work with an R Markdown document. If you
use R Markdown, please submit both the .Rmd and the PDF file.
Case study 1
The data in the file UN11.csv contains several variables, including ppgdp (the gross national product per
person in U.S. dollars) and fertility (the birth rate per 1000 females), both from the year 2009. The data
are for 199 localities, mostly United Nations (UN) member countries. The data were collected from the United
Nations. We are interested in studying the association between fertility and ppgdp.
Exercise 1
• Use exploratory data analysis to summarise the data and generate some hypotheses about the relationship
between the variables of interest. Report the graphs produced and describe them.
• Fit a normal linear model with the following systematic component:
$\mu_i = \beta_0 + \beta_1\,\mathrm{ppgdp}_i$
with $\mu_i = E[\mathrm{fertility}_i]$. State the assumptions behind the normal linear model and use diagnostic
analysis to check that they are valid. Could the model be improved? If so, re-fit the model with
appropriate changes.
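As a rough illustration of the fit-and-diagnose workflow described above, the following sketch uses simulated data in place of UN11.csv. The simulated values, the object names (`un`, `fit_raw`, `fit_log`) and the choice of a log transformation of ppgdp are assumptions made for the example, not the required answer.

```r
# Simulated stand-in for UN11.csv (hypothetical values, not the real data)
set.seed(1)
n <- 199
ppgdp <- exp(rnorm(n, mean = 8, sd = 1.5))
fertility <- 8 - 0.6 * log(ppgdp) + rnorm(n, sd = 0.8)
un <- data.frame(ppgdp, fertility)

# Normal linear model with ppgdp on its original scale
fit_raw <- lm(fertility ~ ppgdp, data = un)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit_raw)

# One possible re-fit (an assumption): log-transform the explanatory variable
fit_log <- lm(fertility ~ log(ppgdp), data = un)
plot(fit_log)
par(mfrow = c(1, 1))
```

Comparing the two sets of diagnostic plots side by side is one way to judge whether the change improved the model.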
Exercise 2
• Test the hypothesis $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. Report the p-value for this test and interpret
the result.
• Calculate and report the value of the coefficient of determination $R^2$, and explain its meaning.
• Calculate and report the expected fertility for a locality with ppgdp = 1000, along with a 95%
confidence interval.
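The three quantities requested above can all be read from a fitted `lm` object. The sketch below shows one way to extract them, again on simulated data; the data values and the use of log(ppgdp) as the explanatory variable are assumptions for illustration only.

```r
# Simulated stand-in for UN11.csv (hypothetical values, not the real data)
set.seed(2)
n <- 199
ppgdp <- exp(rnorm(n, mean = 8, sd = 1.5))
fertility <- 8 - 0.6 * log(ppgdp) + rnorm(n, sd = 0.8)
un <- data.frame(ppgdp, fertility)

fit <- lm(fertility ~ log(ppgdp), data = un)
s <- summary(fit)

# Two-sided t-test of H0: beta1 = 0, taken from the coefficient table
p_value <- s$coefficients["log(ppgdp)", "Pr(>|t|)"]

# Coefficient of determination
r2 <- s$r.squared

# Expected response at ppgdp = 1000 with a 95% confidence interval for the mean
pred <- predict(fit, newdata = data.frame(ppgdp = 1000),
                interval = "confidence", level = 0.95)

print(c(p_value = p_value, r2 = r2))
print(pred)
```

Note that `interval = "confidence"` gives an interval for the expected value; `interval = "prediction"` would instead give an interval for a new observation.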
Exercise 3
• Add pctUrban, an indicator of urbanisation, to the regression model. How does introducing this
variable affect the estimate of the regression coefficient for log(ppgdp)? Explain
why.
• Provide an interpretation of the estimated regression coefficient for pctUrban and a measure of
uncertainty for this estimate.
• After controlling for pctUrban, what change in expected fertility is associated with a 15% increase
in ppgdp?
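When ppgdp enters the model on the log scale and fertility is left untransformed, multiplying ppgdp by 1.15 shifts the linear predictor by the log(ppgdp) coefficient times log(1.15). The sketch below checks this numerically on simulated data; the data, the coefficient values and the untransformed response are all assumptions for illustration.

```r
# Simulated stand-in for UN11.csv (hypothetical values, not the real data)
set.seed(3)
n <- 199
log_gdp <- rnorm(n, mean = 8, sd = 1.5)
# Urbanisation made to correlate with log(ppgdp), clipped to [5, 100]
pctUrban <- pmin(pmax(10 + 6 * log_gdp + rnorm(n, sd = 12), 5), 100)
fertility <- 8 - 0.5 * log_gdp - 0.01 * pctUrban + rnorm(n, sd = 0.8)
un <- data.frame(ppgdp = exp(log_gdp), pctUrban, fertility)

fit <- lm(fertility ~ log(ppgdp) + pctUrban, data = un)

# Estimate, standard error and 95% CI for the pctUrban coefficient
print(summary(fit)$coefficients["pctUrban", c("Estimate", "Std. Error")])
print(confint(fit, "pctUrban"))

# Change in expected fertility for a 15% increase in ppgdp, pctUrban held fixed
b1 <- coef(fit)[["log(ppgdp)"]]
change_15 <- b1 * log(1.15)
ci_15 <- confint(fit, "log(ppgdp)") * log(1.15)

# Numerical check: difference of two predictions that differ only in ppgdp
d <- predict(fit, data.frame(ppgdp = 1150, pctUrban = 50)) -
  predict(fit, data.frame(ppgdp = 1000, pctUrban = 50))
print(c(change_15 = change_15, check = unname(d)))
```

If the response had also been log-transformed, the same quantity would describe a change in expected log fertility instead.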
Exercise 4
The variable group is a factor with levels oecd for countries that are members of the OECD, the Organization
for Economic Co-operation and Development, africa for countries on the African continent, and other for
all other countries. No OECD countries are located in Africa. Replace pctUrban with group and fit a normal
linear model that has log(ppgdp) and group as explanatory variables.
• Plot the fitted regressions for each group as a function of log(ppgdp) along with confidence intervals.
Now add the observed points to the plot; do you think that we should allow the slopes of these regressions
to vary by group? How can we change the current model to obtain different slopes by group? Apply
this change if you deem it appropriate.
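A minimal sketch of the common-slope model, the plot with confidence bands, and one way of letting slopes differ by group (an interaction term) is given below. The data are simulated, and the group proportions and effect sizes are assumptions made only so that the example runs.

```r
# Simulated stand-in for UN11.csv (hypothetical values, not the real data)
set.seed(4)
n <- 199
group <- factor(sample(c("oecd", "other", "africa"), n, replace = TRUE,
                       prob = c(0.15, 0.58, 0.27)))
log_gdp <- rnorm(n, mean = 8, sd = 1.5)
fertility <- 8 - 0.5 * log_gdp + ifelse(group == "africa", 1, 0) +
  rnorm(n, sd = 0.8)
un <- data.frame(ppgdp = exp(log_gdp), group, fertility)

fit_par <- lm(fertility ~ log(ppgdp) + group, data = un)  # common slope
fit_int <- lm(fertility ~ log(ppgdp) * group, data = un)  # slope per group

# F-test comparing the two nested models
cmp <- anova(fit_par, fit_int)
print(cmp)

# Fitted lines with 95% confidence bands on a grid of ppgdp values
grid <- expand.grid(
  ppgdp = exp(seq(min(log_gdp), max(log_gdp), length.out = 50)),
  group = levels(un$group)
)
grid <- cbind(grid, predict(fit_par, newdata = grid, interval = "confidence"))

plot(log(un$ppgdp), un$fertility, col = as.integer(un$group), pch = 16,
     xlab = "log(ppgdp)", ylab = "fertility")
for (g in levels(un$group)) {
  d <- grid[grid$group == g, ]
  k <- match(g, levels(un$group))
  lines(log(d$ppgdp), d$fit, col = k)
  lines(log(d$ppgdp), d$lwr, col = k, lty = 2)
  lines(log(d$ppgdp), d$upr, col = k, lty = 2)
}
legend("topright", legend = levels(un$group), col = 1:3, lty = 1)
```

Replacing `fit_par` with `fit_int` in the `predict()` call draws the lines with group-specific slopes.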
Case study 2
The data for this exercise relate to a study of the prevalence of Loa Loa (eyeworm,
https://www.cdc.gov/parasites/loiasis/index.html) in a series of surveys undertaken in 197 villages in
Cameroon and southern Nigeria. The variables are the following:
• longitude: Longitude in degrees.
• latitude: Latitude in degrees.
• examined: Number of people tested.
• infected: Number of positive test results.
• elevation: Height above sea-level in meters.
• mean_ndvi: Mean of all Normalised Difference Vegetation Index (NDVI) values recorded at village
location, 1999-2001.
• max_ndvi: Maximum of all NDVI values recorded at village location, 1999-2001.
• min_ndvi: Minimum of all NDVI values recorded at village location, 1999-2001.
• stdev_ndvi: Standard deviation of all NDVI values recorded at village location, 1999-2001.
Our interest lies in understanding how Loa Loa prevalence (infected / examined) varies according to the
environmental variables provided. This dataset is provided in the loaloa.csv file.
Exercise 1
• Use the empirical logit transformation $y_i^* = \log\left(\frac{y_i + 0.5}{m_i - y_i + 0.5}\right)$ to achieve approximate
normality (with $y_i$ the number of people infected and $m_i$ the number of people tested). Produce a
histogram of the prevalence $p_i = y_i / m_i$ and of $y_i^*$.
• Conduct an exploratory analysis to understand which environmental variables could be useful in a
normal linear model that has $y_i^*$ as the response variable. Report the plots produced and comment on them.
• Check also how strongly these variables are correlated with one another. Why is this important
to assess?
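The empirical logit transformation, the two histograms and the correlation check can be sketched as follows. The data frame `loa` is simulated as a stand-in for loaloa.csv, using only a subset of the listed variables; all values are hypothetical.

```r
# Simulated stand-in for loaloa.csv (hypothetical values, not the real data)
set.seed(5)
n <- 197
examined <- sample(30:200, n, replace = TRUE)
elevation <- runif(n, 0, 2000)
max_ndvi <- runif(n, 0.6, 0.95)
mean_ndvi <- max_ndvi - runif(n, 0.05, 0.25)
eta <- -5 + 5 * max_ndvi - 0.001 * elevation
infected <- rbinom(n, size = examined, prob = plogis(eta))
loa <- data.frame(examined, infected, elevation, mean_ndvi, max_ndvi)

# Prevalence and empirical logit; the 0.5 terms keep the log finite
# when infected is 0 or equal to examined
loa$prev <- loa$infected / loa$examined
loa$elogit <- log((loa$infected + 0.5) / (loa$examined - loa$infected + 0.5))

par(mfrow = c(1, 2))
hist(loa$prev, main = "Prevalence", xlab = "infected / examined")
hist(loa$elogit, main = "Empirical logit", xlab = "y*")
par(mfrow = c(1, 1))

# Pairwise correlations between candidate explanatory variables
cor_mat <- cor(loa[, c("elevation", "mean_ndvi", "max_ndvi")])
print(round(cor_mat, 2))
```

With the real data, the remaining NDVI summaries would be added to the correlation matrix in the same way.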
Exercise 2
• Assuming that prediction is your main goal, use an appropriate method of variable selection to find
the model with the highest predictive power.
• Explain what would be the most suitable GLM model in this case.
• Write down the random and systematic components of your chosen model (use the same explanatory
variables as the final model found in point 4).
Exercise 3
• Fit the GLM and compute both Wald-type and profile-likelihood confidence intervals for
the model parameters. Do they differ?
• Could we use raw residuals ($y_i - \hat{\mu}_i$) for diagnostic plots? Use the appropriate type of residuals to
check model assumptions.
• Interpret the estimated regression coefficients of your chosen GLM.
• Are there alternative GLM models that could be used as an approximation of your chosen model?
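The mechanics of the steps in this exercise can be sketched as below, assuming for illustration that the chosen model is a binomial GLM with logit link. That choice, the explanatory variables and the simulated data are assumptions, not the expected answer. `confint.default()` gives Wald intervals, while `confint()` on a `glm` object gives profile-likelihood intervals (provided by the MASS package in R versions before 4.4).

```r
# Simulated stand-in for loaloa.csv (hypothetical values, not the real data)
set.seed(6)
n <- 197
examined <- sample(30:200, n, replace = TRUE)
elevation <- runif(n, 0, 2000)
max_ndvi <- runif(n, 0.6, 0.95)
eta <- -5 + 5 * max_ndvi - 0.001 * elevation
infected <- rbinom(n, size = examined, prob = plogis(eta))
loa <- data.frame(examined, infected, elevation, max_ndvi)

# Binomial GLM with logit link; response given as (successes, failures)
fit <- glm(cbind(infected, examined - infected) ~ elevation + max_ndvi,
           family = binomial, data = loa)

ci_wald <- confint.default(fit)  # Wald: estimate +/- z * standard error
ci_prof <- confint(fit)          # profile likelihood
print(ci_wald)
print(ci_prof)

# Deviance and Pearson residuals as alternatives to raw residuals
r_dev <- residuals(fit, type = "deviance")
r_pear <- residuals(fit, type = "pearson")
plot(fitted(fit), r_dev, xlab = "Fitted prevalence",
     ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Coefficients on the odds-ratio scale
print(exp(coef(fit)))
```

With large village samples the two types of interval tend to be close; differences are more likely when counts are small or prevalence is near 0 or 1.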