1. (24 points)
Professionals in health-care analytics have recently been trying to understand the key
factors that affect a patient's chance of acquiring an infection while admitted to a
hospital. 113 patients were randomly sampled and, initially, to predict InfectionRisk (Y) [higher
values ⇒ higher risk, and we are not extrapolating in any of the questions below], two potential
cause variables were examined:
• Stay: the number of days the patient remained in hospital.
• XRay: the number of XRays the patient received.
Fig. 1 describes this regression output. Call this Model F1. The PRESS value for this model
was 137.53.
a. (7 pts)
• (2 points) Check at α = 0.05 whether at least one of these two variables would be useful in
explaining variation in the response. (Need the null and alternate hypotheses, the p-value,
and the decision).
• (3 points) Check at α = 0.05 whether (holding Stay constant) an
increase in XRay frequency leads to an increase in InfectionRisk. (Need
the null and alternate hypotheses, the p-value, and the decision).
• (2 points) John will have to stay for 10 days and will have 20 XRays. Estimate through a
95% interval (a range of numbers) his InfectionRisk.
b. (8 pts) Model F1 was next generalized to include a second-order interaction term:
Stay×XRay. Fig. 2 summarizes that fit. Call this Model F2. The PRESS value for this model
was 133.78.
• (2 points) Observe that we are only explaining about 38% of the total
response variation. So I concluded that the variables involved must be
bad at α = 0.1. Am I right? Justify/criticize.
• (4 points) Sam (a friend of John from part (a)) will also have to stay for 10 days, but will
have 21 XRays. Who has a higher InfectionRisk? And by how much?
• (2 points) Had you calculated the interval estimate from Model F2, would it be wider or
narrower than what you found in (a)? Why?
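A minimal sketch of how an interaction term changes the effect of one extra XRay, assuming a fitted model of the form b0 + b_stay·Stay + b_xray·XRay + b_inter·Stay·XRay. The coefficient values and the function names `predict_f2` and `xray_effect` are hypothetical and chosen only for illustration; the actual estimates are the ones reported in Fig. 2.

```python
# Hypothetical coefficients for illustration only; the real estimates are in Fig. 2.
def predict_f2(stay, xray, b0, b_stay, b_xray, b_inter):
    """Fitted value of a model with a Stay x XRay interaction term."""
    return b0 + b_stay * stay + b_xray * xray + b_inter * stay * xray


def xray_effect(stay, b_xray, b_inter):
    """Change in the fitted value per extra XRay, holding Stay fixed."""
    return b_xray + b_inter * stay


coefs = dict(b0=1.0, b_stay=0.2, b_xray=0.03, b_inter=-0.001)  # made-up numbers

john = predict_f2(10, 20, **coefs)   # Stay = 10, XRay = 20
sam = predict_f2(10, 21, **coefs)    # Stay = 10, XRay = 21
difference = sam - john              # equals b_xray + b_inter * 10
```

With an interaction in the model, the difference between two patients who differ by one XRay depends on the shared value of Stay, not on the XRay coefficient alone.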
c. (4 pts) Next, a computer glitch deleted all the information on the cause variables
Stay and XRay, and the results from Models F1 and F2 (i.e., Figs. 1 and 2) are lost too. A new person
who will stay for 10 days and will have 15 XRays wants to estimate his InfectionRisk. What will
it be? Also, find R^2 and R^2_a for this new model. (Hint: You may use Fig. 3.)
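One possible reading of this part is that, with no cause variables left, the only available model is the intercept-only one. The sketch below shows what such a model gives, under that assumption and using made-up response values (the real summary of Y is in Fig. 3).

```python
from statistics import mean

# Made-up response values for illustration; not the hospital data.
y = [4.1, 1.6, 2.7, 5.6, 5.7]

# With no predictors, the least-squares fit is the sample mean for everyone.
y_bar = mean(y)

sse = sum((yi - y_bar) ** 2 for yi in y)   # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares (identical here)

n, p = len(y), 0                           # p = number of predictors
r2 = 1 - sse / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Under this reading the prediction does not depend on the new person's Stay or XRay values at all.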
d. (5 pts) Hospitals often cite lack of funding as the reason behind
alarming InfectionRisks. A regression line was fitted connecting
InfectionRisk (Y) to funding (X) (in millions of dollars):

    \hat{Y}_i = 2.1 - 1.5 X_i        (1)

(Data on funding are confidential; a person who had access to them fitted this line.) The standard
error corresponding to the slope estimate was 2. Test at α = 0.10 whether an increase in funding
leads to a significant decrease in InfectionRisk. You may assume
pt(-1.5,111)=0.068227, pt(-1.5,112)=0.068214, pt(-0.75,111)=0.2274207, pt(-0.75,112)=0.2274137.
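A sketch of the mechanics of a one-sided (lower-tail) t-test on a slope, using only the numbers quoted in this part. Two assumptions are made here and are not stated in the question: the slope estimate is taken to be negative (-1.5), and the line is taken to have been fitted to the same n = 113 patients, so that the degrees of freedom are n - 2. The dictionary `pt_table` simply stores the lower-tail probabilities supplied above.

```python
# Values quoted in the question; the negative sign of the slope is an assumption.
slope_hat = -1.5
se_slope = 2.0
n = 113          # assumed: same sample size as in the rest of the question
alpha = 0.10

t_stat = slope_hat / se_slope   # test statistic for H0: slope = 0
df = n - 2                      # simple linear regression

# Lower-tail t probabilities supplied in the question, keyed by (t, df).
pt_table = {
    (-1.5, 111): 0.068227,
    (-1.5, 112): 0.068214,
    (-0.75, 111): 0.2274207,
    (-0.75, 112): 0.2274137,
}

p_value = pt_table[(t_stat, df)]   # P(T <= t_stat) for Ha: slope < 0
reject_h0 = p_value < alpha
```

The statistic divides the estimate by its standard error before looking up the tail probability; looking up the raw slope instead would pick the wrong table entry.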
2. (36 points) (The story from question 1 gets continued here) Later, it
was discovered that
several other potential cause variables might also affect the chances of
getting infected. Here’s
a larger list:
• Stay: Length of stay at the hospital, in days.
• Age: The age of the patient, in years.
• XRay: The frequency of XRay administration.
• Beds: Number of available beds at the hospital.
• Census: Total number of patients admitted.
• Nurses: Number of trained nurses available.
• Facilities: Number of facilities available at the hospital.
a. (3 pts) Within one minute, you'll need to choose a few of these variables as future
higher-order and/or interaction term candidates. You can use a computer
that does one regression in one second. Would you implement a stepwise
search or an all-possible-combinations (i.e., brute-force) search? Justify.
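The two search strategies differ mainly in how many regressions they need. A rough count, assuming first-order models only and a pure forward-selection pass as the stepwise worst case (real stepwise routines may stop earlier or add removal checks), can be sketched as:

```python
from math import comb

k = 7                    # candidate variables in the list above
seconds_available = 60   # one regression per second, one minute in total

# All-possible-combinations search: every non-empty subset of the k variables.
all_subsets = sum(comb(k, r) for r in range(1, k + 1))   # same as 2**k - 1

# Forward-selection upper bound: k fits in step 1, k - 1 in step 2, and so on.
forward_upper_bound = k * (k + 1) // 2

brute_force_fits_in_time = all_subsets <= seconds_available
stepwise_fits_in_time = forward_upper_bound <= seconds_available
```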
b. (9 pts) A stepwise selection search among these variables generated Fig. 4, while a
brute-force search generated Figs. 5-9.
• (2 points) How many first-order non-interaction regressions did the stepwise algorithm
actually do for this dataset?
• (3 points) If financial constraints force us to choose only two cause
variables, which combination would you choose? Use both stepwise and
brute force to answer this and check
whether we have agreement between the two methods.
• (4 points) Next, imagine we have no restrictions on the size of the model. Use one of the
two methods to propose good cause-combinations under each of the following scenarios:
– If I were to explain the maximum proportion of variation in the
response after penalizing for model complexity.
– If I were to estimate the chances of a friend (who is going to be admitted) getting
infected.
c. (9 pts) “Stay”, “XRay”, and “Facilities” were ultimately chosen from
the previous
variable screening steps. Next, the hospitals were classified according
to their locations and two
indicator variables X4 and X5 were brought in according to the following
table:
                    X4   X5
East Coast           0    0
Central America      0    1
West Coast           1    0
A first-order non-interaction model was constructed with these five
variables. The fit is shown
in Fig. 10. Call this Model Q1. The PRESS value for this model is
125.1996.
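The coding table above can be sketched as a lookup, which also makes the role of the baseline region explicit. The names `REGION_CODES`, `encode_region`, and `region_shift` are hypothetical, and the coefficients in the example call are made up; the actual estimates for X4 and X5 are in Fig. 10.

```python
# Indicator coding copied from the table: region -> (X4, X5).
REGION_CODES = {
    "East Coast": (0, 0),        # baseline: both indicators are zero
    "Central America": (0, 1),
    "West Coast": (1, 0),
}


def encode_region(region):
    """Return the (X4, X5) pair for a region name."""
    return REGION_CODES[region]


def region_shift(region, b4, b5):
    """Shift in the fitted value relative to the baseline region."""
    x4, x5 = encode_region(region)
    return b4 * x4 + b5 * x5


# Made-up coefficients, for illustration only.
shifts = {r: region_shift(r, b4=0.5, b5=-0.3) for r in REGION_CODES}
```

In a first-order model without interactions, each indicator's coefficient is the difference between that region and the baseline with all other variables held fixed.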
• (3 points) Compare F2 and Q1.
• (3 points) Using Q1, estimate the increase in InfectionRisk if we choose a hospital in a)
the “East Coast”, b) the “West Coast”, and c) “Central America”.
• (3 points) Based on Fig. 10, at α = 0.05, is there a need to differentiate between hospitals
in the regions i) “East Coast” and “West Coast” ii) “East Coast” and
“Central America”?
d. (15 pts)
Model Q1 was reduced next according to the classification:
                              X6
East Coast/Central America     0
West Coast                     1
and an interaction between “Stay” and X6 was conjectured to generate
Model Q2, with a PRESS
= 123.1746. Fig. 11 describes that fit.
• (2 points) Test at α = 0.2 whether the interaction term is significant.
• (3 points) A person is considering moving from an “East Coast/Central America” hospital
to one in the “West Coast”. Regardless of the hospital he chooses, he estimates his
length of “Stay” to be 10 days (we are not extrapolating). By how much is this person's
“InfectionRisk” likely to change?
• (4 points) Use Model Q2 to discover a blind spot, if any, and interpret it.
• (3 points) A new person arrives who has to stay admitted for 10 days, will need 10 XRays, and
wants to choose a hospital in the “East Coast” with 30 “Facilities” (we are not extrapolating).
This person was not used to derive any of Models F1, F2, Q1, Q2. Using one
of these four models, predict (through a point estimate) the InfectionRisk value of such a
person.
• (3 points) Find R^2_Jack for the model you chose in the previous part, out of the four
available.
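A sketch of how PRESS relates to a jackknife R^2, assuming the usual definition R^2_Jack = 1 - PRESS/SST. The PRESS values are the ones quoted in the questions; the total sum of squares `hypothetical_sst` is a made-up number, since SST appears only in the figures. Picking the model with the smallest PRESS is one common criterion for predicting a new observation, not necessarily the intended answer.

```python
# PRESS values quoted in the questions above.
press = {"F1": 137.53, "F2": 133.78, "Q1": 125.1996, "Q2": 123.1746}


def r2_jackknife(press_value, sst):
    """Jackknife (prediction) R^2, assuming R^2_Jack = 1 - PRESS / SST."""
    return 1 - press_value / sst


# Smaller PRESS means smaller leave-one-out prediction error.
best_model = min(press, key=press.get)

hypothetical_sst = 200.0   # made-up total sum of squares; read the real one off the output
r2_jack = r2_jackknife(press[best_model], hypothetical_sst)
```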