MATH 523, Assignment 2

Date: 2022-03-10

Johanna G. Nešlehová
Generalized Linear Models MATH 523
McGill University, Winter Term 2022
Assignment 2 due on March 25 at noon.
Q1 Lecture 9a
Consider a binomial GLM with an arbitrary link function g and n responses that have
been entered in a grouped format. Using the same notation as in the lecture notes,
show that:
(1) The maximum likelihood estimates of β do not depend on whether the data have
been entered in a grouped or ungrouped format.
(2) The Fisher information matrix does not depend on whether the data have been
entered in a grouped or ungrouped format. Conclude that the asymptotic covariance
matrix of $\hat{\beta}$ (and consequently the standard errors of $\hat{\beta}_j$, $j = 1, \ldots, p$) does
not depend on the data entry format. Hint: It is easiest to verify the entry
at position (j, k) of the Fisher information for arbitrary j, k, rather than carrying
out the matrix multiplication.
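Both claims in Q1 can be checked numerically before proving them. The following is a minimal sketch on simulated data, not part of the assignment: the covariate x, the sample sizes, the true coefficients and the probit link are all assumptions chosen for illustration.

```r
set.seed(523)

# Simulated ungrouped data: one binary response per row (hypothetical covariate x)
x <- rep(c(0, 1, 2, 3), times = c(20, 30, 25, 25))
y <- rbinom(length(x), size = 1, prob = plogis(-0.5 + 0.6 * x))
ungrouped <- data.frame(x = x, y = y)

# Grouped version: number of successes and group size per distinct value of x
grouped <- data.frame(
  x    = sort(unique(x)),
  succ = as.vector(tapply(y, x, sum)),
  m    = as.vector(tapply(y, x, length))
)

# Same model, same (non-canonical) link, two data entry formats
fit_u <- glm(y ~ x, family = binomial(link = "probit"), data = ungrouped)
fit_g <- glm(cbind(succ, m - succ) ~ x, family = binomial(link = "probit"),
             data = grouped)

# Estimates and standard errors side by side
cbind(ungrouped = coef(fit_u), grouped = coef(fit_g))
cbind(ungrouped = sqrt(diag(vcov(fit_u))), grouped = sqrt(diag(vcov(fit_g))))
```

The two columns should agree up to the convergence tolerance of the fitting algorithm. The deviances of the two fits generally differ, which is consistent with the claims above because they concern only the estimates and the Fisher information.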
Q2 Suppose that $m_i Y_i$ is binomial $(m_i, \pi_i)$, where $g(\pi_i) = X_i \beta$ and $i = 1, \ldots, n$. Consider
the null model, for which $\pi_1 = \cdots = \pi_n$. Show that
$$\hat{\pi} = \frac{\sum_{i=1}^{n} m_i y_i}{\sum_{i=1}^{n} m_i}.$$
When $m_i = 1$ for all $i \in \{1, \ldots, n\}$, show that the Pearson $X^2$ statistic,
which is defined as the sum of the squared Pearson residuals, equals $n$. Decide whether
or not $X^2$ is useful for testing whether a binomial GLM fits the data well when
the response is binary.
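The identity in Q2 can be illustrated numerically. This is a minimal sketch on simulated binary data; the sample size and success probability are arbitrary assumptions, and a numerical check does not replace the requested proof.

```r
set.seed(1)

# Simulated binary responses (m_i = 1 for all i); n and prob are arbitrary
n <- 50
y <- rbinom(n, size = 1, prob = 0.3)

# Null model: intercept only, so all fitted probabilities are equal
fit0 <- glm(y ~ 1, family = binomial)
pi_hat <- unname(fitted(fit0)[1])

# Pearson statistic: sum of squared Pearson residuals
X2 <- sum(residuals(fit0, type = "pearson")^2)

c(pi_hat = pi_hat, ybar = mean(y), X2 = X2, n = n)
```

In this sketch the fitted probability coincides with the sample proportion and $X^2$ coincides with $n$ whatever the simulated responses are, which is the behavior Q2 asks you to explain.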
Q3 R exercise
Consider the following data on home-well contamination in 3020 households in Araihazar
upazila, Bangladesh. The response variable is switch (a binary variable indicating whether
or not the household switched to another well from an unsafe well). The other variables
collected for each household are arsenic (the level of arsenic contamination in the
household’s original well, in hundreds of micrograms per liter), dist100 (distance in
100-meter units to the closest known safe well), educ (years of education of the head of
the household) and assoc (whether or not any members of the household participated
in any community organizations: no or yes). The data are available in MyCourses under
Datasets. Load the data and compute dist100 as follows.
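# Read the wells data (path relative to the working directory)
wells <- read.table("../Datasets/wells.dat")
# Make the columns of wells accessible by name
attach(wells)
# Convert the distance from meters to 100-meter units
dist100 <- dist/100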
(1) Report whether the data have been entered in a grouped or ungrouped form, and
which explanatory variables are continuous and which are factors.
(2) Fit a logistic regression model with the intercept and arsenic. Assess the fit
of this model graphically as follows: divide arsenic into 30 approximately equally
filled categories, group the data accordingly, and display the empirical logits of
switching to a safe well for each category together with the fitted regression line.
Do you think the model is adequate? Perform an approximate goodness-of-fit test of the
model using the above binning and Pearson’s $X^2$ statistic; conclude at the 5%
level.
(3) Find the most appropriate logistic regression model for the data. Use the
deviance, but also consider practical significance by looking at the AIC and the size
of the effect of the predictors.
(4) Try to simplify the model you found in part (3) by replacing educ by a binary
factor predictor feduc, constructed as follows:
# 0: at most 8 years of education, 1: 9 years or more
feduc <- as.numeric(educ > 8)
This predictor feduc records whether the person has at most a primary education (i.e.
up to 8 years) or a secondary education and above (i.e. 9 years or more).
(5) Compare the final models from parts (3) and (4) using AIC and ROC curves. Which
one do you prefer and why? Interpret the final model you selected.
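The binning, the empirical logits and the binned Pearson test described in part (2) can be sketched as follows. Since wells.dat is not reproduced here, the sketch runs on simulated stand-in data: the names arsenic_sim and switch_sim and the simulation parameters are hypothetical. The 1/2 continuity correction in the empirical logits and the choice of 30 - 2 degrees of freedom are assumptions of this sketch, not prescriptions from the assignment.

```r
set.seed(2022)

# Simulated stand-in for the wells data (hypothetical values, not the real data)
n <- 3020
arsenic_sim <- rexp(n, rate = 1) + 0.5
switch_sim <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.4 * arsenic_sim))

fit <- glm(switch_sim ~ arsenic_sim, family = binomial)

# 30 roughly equally filled bins, using empirical quantiles as break points
breaks <- quantile(arsenic_sim, probs = seq(0, 1, length.out = 31))
bin <- cut(arsenic_sim, breaks = breaks, include.lowest = TRUE)

m    <- as.vector(table(bin))                      # households per bin
s    <- as.vector(tapply(switch_sim, bin, sum))    # switchers per bin
xbar <- as.vector(tapply(arsenic_sim, bin, mean))  # mean arsenic level per bin

# Empirical logits with a 1/2 continuity correction (assumed convention)
emp_logit <- log((s + 0.5) / (m - s + 0.5))

# Empirical logits against the fitted line on the logit scale
plot(xbar, emp_logit, xlab = "arsenic (bin mean)", ylab = "empirical logit")
abline(a = coef(fit)[1], b = coef(fit)[2])

# Approximate Pearson test on the binned data: expected switchers per bin
e <- as.vector(tapply(fitted(fit), bin, sum))
pbar <- e / m
X2 <- sum((s - e)^2 / (m * pbar * (1 - pbar)))
p_value <- pchisq(X2, df = length(m) - length(coef(fit)), lower.tail = FALSE)
c(X2 = X2, p_value = p_value)
```

With the real data, arsenic and switch would replace the simulated vectors. Because the simulated responses are generated from the fitted model class, the empirical logits should scatter around the line here; a systematic curvature with the real data would point to a missing transformation or term.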
Q4 R exercise
Consider a study on the duration of unemployment (1: short-term unemployment,
less than 6 months; 0: long-term unemployment) with explanatory variables gender
(1: male, 0: female) and level of education (0: lower, 1: higher). The data are
summarized in the table below.
Gender  Education Level  Short-Term Unemployment  Long-Term Unemployment
1       0                313                      126
1       1                90                       41
0       0                196                      132
0       1                42                       43
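One way to enter this table in R is as grouped binomial data, with one row per combination of gender and education. This is a minimal sketch: the data frame name unemp and its column names are hypothetical, and the two models compared are only examples of candidates, not the intended answer.

```r
# Table entered in grouped form: one row per (gender, education) cell
unemp <- data.frame(
  gender    = c(1, 1, 0, 0),          # 1: male, 0: female
  education = c(0, 1, 0, 1),          # 0: lower, 1: higher
  short     = c(313, 90, 196, 42),    # short-term unemployment counts
  long      = c(126, 41, 132, 43)     # long-term unemployment counts
)

# Response given as (successes, failures) = (short, long)
fit_main <- glm(cbind(short, long) ~ gender + education,
                family = binomial, data = unemp)
fit_sat  <- glm(cbind(short, long) ~ gender * education,
                family = binomial, data = unemp)

# Likelihood ratio test of the interaction term
anova(fit_main, fit_sat, test = "Chisq")
```

With four cells, the model with the interaction is saturated, so the residual deviance of the main-effects model is exactly the likelihood ratio statistic for the interaction.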
(1) Analyze these contingency table data with logistic regression using the duration
of unemployment as a response.
(2) Describe the dependence relationship between the explanatory variables and the
response (conditional independence, homogeneous association etc.) in the model
selected in part (1).
(3) For the model selected in part (1), calculate the relevant odds ratios that
describe the effect of the explanatory variable(s) on the response along with a 95%
confidence interval.
(4) Calculate the expected counts from the model selected in part (1) and compare
them to the observed counts using Pearson’s $X^2$ statistic. Test goodness of fit
using an appropriate $\chi^2$ null distribution and conclude at the 5% level.
(5) Interpret the final model in one or two sentences in layman's terms that a
non-statistician can understand (no formulas).
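The computations in parts (3) and (4) can be sketched as follows. The main-effects model used here is only a placeholder for whichever model is selected in part (1), the data frame and column names are hypothetical, and the intervals are Wald intervals, which is one of several reasonable choices.

```r
# Table entered in grouped form (hypothetical column names)
unemp <- data.frame(
  gender    = c(1, 1, 0, 0),
  education = c(0, 1, 0, 1),
  short     = c(313, 90, 196, 42),
  long      = c(126, 41, 132, 43)
)

# Placeholder model; substitute the model chosen in part (1)
fit <- glm(cbind(short, long) ~ gender + education,
           family = binomial, data = unemp)

# Odds ratios with 95% Wald confidence intervals
est <- coef(summary(fit))
z <- qnorm(0.975)
or    <- exp(est[, "Estimate"])
lower <- exp(est[, "Estimate"] - z * est[, "Std. Error"])
upper <- exp(est[, "Estimate"] + z * est[, "Std. Error"])
cbind(OR = or, lower = lower, upper = upper)

# Expected counts in both response columns under the fitted model
m <- unemp$short + unemp$long
exp_short <- m * fitted(fit)
exp_long  <- m - exp_short

# Pearson statistic over all eight cells of the table
observed <- c(unemp$short, unemp$long)
expected <- c(exp_short, exp_long)
X2 <- sum((observed - expected)^2 / expected)
p_value <- pchisq(X2, df = df.residual(fit), lower.tail = FALSE)
c(X2 = X2, df = df.residual(fit), p_value = p_value)
```

The row for the intercept in the odds ratio table is a baseline odds rather than an odds ratio and would normally be left out of the report. The statistic computed over the eight cells equals the sum of the squared Pearson residuals of the grouped fit.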