3860B/9155B课程,这是一个针对留学生的专业课程。该课程旨在帮助留学生提高他们的语言能力,并为他们的未来学习和工作做好充分的准备。在本课程中,学生将学习语法、词汇、听力、阅读和写作等语言技能,并通过实际模拟和练习来加深对这些技能的理解。此外,课程还包括了许多实用的语言技巧和策略,帮助学生在课堂外应用所学知识。我们的教师均具有丰富的教学经验和高度的语言能力,他们将给学生提供全面的指导和帮助,使他们能够以最快的速度实现语言进步。总的来说,这是一门非常有益的课程,为留学生提供了一个学习语言的绝佳机会。
Assignment 2
STATS 3860B/9155B - Instructor: Dr. Camila de Souza
Winter 2023
• This assignment is due Friday, March 10th, at 11:55 pm (EST).
• You must write your R code and answers using Rmarkdown generating a
single pdf file.
• Submissions must be made via Gradescope. You must carefully assign each
question part to its corresponding page (or pages) on your pdf file. Question parts
with no pages assigned to them will receive zero marks.
• Each student must submit their own work. Scholastic offences are taken seriously,
and students are directed to read the appropriate policy, specifically, the definition of
what constitutes a Scholastic Offence, at the following Web site: http://www.uwo.ca/u
nivsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf
Question 1
The data set NMES1988 in the AER package contains a sample of individuals over 65 who
are covered by Medicare in order to assess the demand for health care through physician
office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by
installing and loading the AER package and then running data(NMES1988). More background
information and references about the NMES1988 data can be found in help pages for the AER
package.
a) Show through graphical means that there are more respondents with 0 visits than
might be expected under a Poisson model.
b) Fit a ZIP model for the number of physician office visits using chronic, health, and
insurance as predictors for the Poisson count, and chronic and insurance as the
predictors for the binary part of the model. Then, provide interpretations in context for
the following estimated model parameters:
• coefficient of chronic in the Poisson part of the model
• coefficient of poor health in the Poisson part of the model
• the intercept in the logistic part of the model
1
• coefficient of insurance in the logistic part of the model
c) Use the estimated coefficients from part b) to calculate the predicted probability of
“always zero” (never visiting the doctor) for an individual who has insurance and two
chronic conditions. Then check your answer using the predict() function to compute
the same probability.
d) Use the estimated coefficients from part b) to calculate the predicted probability of
visiting the doctor five times for an individual with poor health, four chronic conditions
and without insurance. Then check your answer using the predict() function to
compute the same probability.
Question 2
The dataset melanoma gives data on a sample of patients suffering from melanoma (skin
cancer) cross-classified by the type of cancer and the location on the body.
suppressMessages(library(faraway))
str(melanoma)
## 'data.frame': 12 obs. of 3 variables:
## $ count: num 22 16 19 11 2 54 33 17 10 115 ...
## $ tumor: Factor w/ 4 levels "freckle","indeterminate",..: 1 4 3 2 1 4 3 2 1 4 ...
## $ site : Factor w/ 3 levels "extremity","head",..: 2 2 2 2 3 3 3 3 1 1 ...
a) Display the data in a two-way table. Make a mosaic plot and comment on the evidence
of independence.
b) Check for independence between site and tumour type using a Chi-squared test.
c) Fit a Poisson GLM model and use it to check for independence.
d) Make a two-way plot of the deviance residuals from the last model. Comment on the
larger residuals.
Question 3
The UCBAdmissions dataset presents data on applicants to graduate school at Berkeley for
the six largest departments in 1973 classified by admission and sex.
suppressMessages(library(faraway))
str(UCBAdmissions)
## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## - attr(*, "dimnames")=List of 3
## ..$ Admit : chr [1:2] "Admitted" "Rejected"
## ..$ Gender: chr [1:2] "Male" "Female"
## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
2
a) Show that this provides an example of the Simpson’s paradox.
b) Determine the most appropriate dependence model between the variables. Explain in
detail each step you take.
c) Fit a binomial regression with admissions status as the response and show the relationship
to your model in the previous question.
Question 4
The hsb data was collected as a subset of the “High School and Beyond” study conducted by
the National Education Longitudinal Studies program of the National Center for Education
Statistics. The variables are gender, race, socioeconomic status (SES), school type, chosen
high school program type, scores on reading, writing, math, science and social studies. The
response variable is the chosen high school program type (prog), which is multinomial with 3
levels.
library(faraway)
library(nnet)
data("hsb")
hsb <- hsb[,-1] ## removing first column corresponding to student ID
str(hsb)
## 'data.frame': 200 obs. of 10 variables:
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
## $ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
## $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
## $ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
## $ read : int 57 68 44 63 47 44 50 34 63 57 ...
## $ write : int 52 59 33 44 52 52 59 46 57 55 ...
## $ math : int 41 53 54 47 57 51 42 45 54 52 ...
## $ science: int 47 63 58 53 53 63 53 39 58 50 ...
## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
a) Fit a multinomial regression model for prog (with baseline level academic) and all nine
predictors.
b) Interpret the coefficients corresponding to the five subjects (scores on reading, writing,
math, science and social studies) in terms of odds.
c) Regarding to part b), identify which one of the five subjects gives unexpected results and
suggest an explanation for this behavior. Any reasonable explanation will be accepted.
Question 5
An earlier study examined the effect of workplace rules in Minnesota which require smokers
to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period
3
Table 1: A small subset of hypothetical data on Minnesota workplace rules on smoking.
subject x (location) y (cigarettes)
1 0 3
2 1 0
3 1 0
4 1 1
5 0 2
6 0 1
was recorded, along with whether the smoker was at home or at work. A (very) small subset
of the data appears in Table 1 and it is also available in the smoking.csv file.
• Model 1: Assume that Y ∼ Poisson(λ); there is no difference between home and work.
• Model 2: Assume that Y ∼ Poisson(λW ) when the smoker is at work, and Y ∼
Poisson(λH) when the smoker is at home.
• Model 3: Assume that Y ∼ Poisson(λ) and log(λ) = β0 + β1x.
a) Write out the likelihood L(λ) and log-likelihood logL(λ) in Model 1. Use the data
values in Table 1, and simplify where possible.
b) Intuitively, what would be a reasonable estimate for λ based on this data? Why?
c) Use R to produce a plot of the likelihood function L(λ) . Find the maximum likelihood
estimator for λ in Model 1 using an optimization routine in R (for example, optim(),
but not the glm() function).
d) Write out the log-likelihood function logL(λW , λH) in Model 2. Use the data values in
Table 1, and simplify where possible.
e) Intuitively, what would be reasonable estimates for λW and λH based on this data?
Why?
f) Find the maximum likelihood estimators for λW and λH in Model 2 using an optimization
routine in R (for example, optim(), but not the glm() function).
g) Write out the log-likelihood function logL(β0, β1) in Model 3. Again, use the data values
in Table 1, and simplify where possible.
h) Find the maximum likelihood estimators for β0 and β1 in Model 3 using an optimization
routine in R (for example, optim(), but not the glm() function). Use R to produce a
3D plot of the log-likelihood function.
i) Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the
MLEs for Model 3 agree with the MLEs for Model 2.
4
Question 6
This question refers to exercise 4 of Chapter 8 of the textbook; however, instead of working
with the Galapagos data, you will work with the dataset in Table 1 from the previous question.
The purpose of this question is to reproduce the details of the GLM fitting of this data via
the IRWLS algorithm.
a) Consider the Poisson GLM fit for Model 3 in part i) of the previous question. Report
the estimated values of the coefficients and deviance.
For parts b), c), d), e), f) and g) refer to Exercise 4 Chapter 8 of the textbook (page 172).
5