Homework Four
STAT 4051
language easy for the investigators to understand. Do not include any ‘computer-ese’ (e.g.,
code or computer output) as part of your answers to parts 1-6.
In 1986, no longitudinal surveys existed in China; all surveys were narrowly focused health,
economic, or demographic surveys. Furthermore, no raw data from any survey had been allowed
out of the country. Under China's reform and opening-up policy, the country was being transformed
from one facing famine and extreme food shortages to one in which the food supply met basic
needs, and the initial stages of a major transformation of the food distribution and marketing
system were under way. The China Health and Nutrition Survey (CHNS) was established with the
goal of developing a multipurpose longitudinal survey that would allow the group to examine a
series of economic, sociological, demographic, and health questions of interest. The CHNS
investigators recruited households ranging from 1 to 14 members (mode 4 members) in nine provinces
to participate in this comprehensive study.
We are interested in the following factors potentially related to each adult participant’s body
mass index.
• $\mathrm{BMI}_i$: body mass index (in kg/m$^2$) for participant $i$ in 1989
• $\mathrm{AGE}_i$: age (in years) for participant $i$ in 1989
• $\mathrm{CAL}_i$: average daily caloric intake (in kilocalories divided by 1000) for subject $i$ in 1989
• $\mathrm{PRO}_i$: average percentage of calories from protein for subject $i$ in 1989
• $\mathrm{MALE}_i$: takes value 1 if subject $i$ is male and 0 otherwise
• $\mathrm{URBAN}_i$: takes value 1 if subject $i$ lived in an urban area in 1989 and 0 otherwise
The data are available in ASCII format on Canvas in the file ga1.dat.
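As a rough illustration of how the variables listed above might be summarized, the following sketch builds separate summaries for continuous and binary variables, including the percentage missing. It runs on simulated stand-in data, not on ga1.dat: the column names, the sample size, and the injected missing values are all assumptions made for the example, and the layout of the real file (delimiter, header row, missing-value code) is not specified here.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL stand-in data. The real values would come from ga1.dat;
# reading it might look like pd.read_csv("ga1.dat", sep=r"\s+"), but the
# delimiter, header, and missing-value coding of that file are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "BMI": rng.normal(22, 3, n),
    "AGE": rng.integers(20, 70, n).astype(float),
    "CAL": rng.normal(2.7, 0.6, n),
    "PRO": rng.normal(12, 2, n),
    "MALE": rng.integers(0, 2, n).astype(float),
    "URBAN": rng.integers(0, 2, n).astype(float),
})
# Inject some missing values so the missingness column is not trivially zero.
df.loc[rng.choice(n, 10, replace=False), "CAL"] = np.nan

continuous = ["BMI", "AGE", "CAL", "PRO"]
binary = ["MALE", "URBAN"]

# Percentage of missing observations for every variable.
pct_missing = df.isna().mean() * 100

# Continuous variables: mean, sample SD, minimum, maximum (NaN skipped).
cont_table = df[continuous].agg(["mean", "std", "min", "max"]).T
cont_table["pct_missing"] = pct_missing[continuous]

# Binary variables: frequency and percentage of 1s among observed values.
bin_rows = []
for var in binary:
    observed = df[var].dropna()
    count_one = int((observed == 1).sum())
    bin_rows.append({
        "variable": var,
        "n_equal_1": count_one,
        "pct_equal_1": 100 * count_one / len(observed),
        "pct_missing": pct_missing[var],
    })
bin_table = pd.DataFrame(bin_rows).set_index("variable")

print(cont_table.round(2))
print(bin_table.round(2))
```

Percentages for the binary variables are computed among non-missing observations here; reporting them out of all observations is an equally defensible choice as long as the table says which was used.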
After meeting with the investigators, you determine that you should fit the following model to
the data to answer their questions.
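$$\mathrm{BMI}_i = \beta_0 + \beta_1(\mathrm{AGE}_i - 30) + \beta_2(\mathrm{CAL}_i - 3) + \beta_3(\mathrm{PRO}_i - 10) + \beta_4\,\mathrm{MALE}_i + \beta_5\,\mathrm{URBAN}_i + \beta_6\,\mathrm{URBAN}_i(\mathrm{CAL}_i - 3) + \varepsilon_i \tag{1}$$
assuming $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$.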
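To make the structure of model (1) concrete, here is a minimal sketch of fitting it by ordinary least squares and testing a linear hypothesis of the form Cβ = θ0 (the form used in the questions that follow) with an F statistic. Everything numeric in it is an assumption: the data are simulated, the "true" coefficients are made up, and the example C matrix (protein and sex jointly unrelated to BMI) is purely illustrative and is not one of the hypotheses asked about in this assignment.

```python
import numpy as np
from scipy import stats

# SIMULATED data with made-up parameter values; not the CHNS data.
rng = np.random.default_rng(1)
n = 500
age = rng.uniform(20, 70, n)
cal = rng.normal(2.7, 0.6, n)      # kilocalories / 1000
pro = rng.normal(12, 2, n)         # percent of calories from protein
male = rng.integers(0, 2, n)
urban = rng.integers(0, 2, n)

# Columns follow model (1): intercept, AGE-30, CAL-3, PRO-10,
# MALE, URBAN, URBAN*(CAL-3).
X = np.column_stack([
    np.ones(n), age - 30, cal - 3, pro - 10,
    male, urban, urban * (cal - 3),
])
beta_true = np.array([21.0, 0.03, 0.5, 0.05, -0.3, 0.8, 0.2])  # hypothetical
y = X @ beta_true + rng.normal(0, 2.5, n)

# Least-squares estimates and the unbiased estimate of sigma^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df_resid = n - X.shape[1]
sigma2_hat = resid @ resid / df_resid

# Standard errors and 95% t-based confidence intervals for each beta_j.
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(0.975, df_resid)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# General linear hypothesis H0: C beta = theta0.
# Illustrative C only: beta_3 = 0 and beta_4 = 0 simultaneously.
C = np.array([
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
], dtype=float)
theta0 = np.zeros(2)
q = C.shape[0]                      # number of constraints (numerator df)
diff = C @ beta_hat - theta0
F_stat = diff @ np.linalg.solve(C @ XtX_inv @ C.T, diff) / (q * sigma2_hat)
p_value = stats.f.sf(F_stat, q, df_resid)

print("beta_hat:", beta_hat.round(3))
print("sigma2_hat:", round(float(sigma2_hat), 3))
print("95% CIs:\n", ci.round(3))
print(f"F = {F_stat:.3f} on ({q}, {df_resid}) df, p = {p_value:.4f}")
```

Because the covariates are centered at 30 years, 3000 kilocalories, and 10 percent protein, the intercept in this parameterization is the expected BMI of a rural woman at those reference values, which is why centering makes the intercept interpretable.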
1. Provide a clearly labeled table describing the variables provided. For continuous variables,
report means, standard deviations, minima, and maxima. For binary variables, provide
frequencies and percentages. For each variable, be sure to indicate what percentage of
observations are missing.
2. Provide point estimates ($\hat{\beta}$, $\hat{\sigma}^2$) and accompanying interval estimates for $\beta$. Provide clear
interpretations of the associations between BMI and the predictor variables age, caloric intake,
protein, gender, and urbanicity.
3. Often we are interested in hypotheses of the form $C\beta = \theta_0$. Provide the $C$ and $\theta_0$ matrices
needed to test the hypothesis that caloric intake is unrelated to body mass index using a
single test. Carry out the test using what you know about linear regression, report the results
(presenting the test statistic, degrees of freedom, test used, and p-value), and interpret the
test results in language someone without a statistics background can understand.
4. What are the typical assumptions used in estimation and inference under the model in (1)?
Determine whether these assumptions appear to hold in these data to the extent possible,
providing evidence to support your determination.
5. Describe in words the hypothesis tested by each choice of C and θ0 below.
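(a) $C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$, $\theta_0 = 0$
(b) $C = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$, $\theta_0 = 22$
(c) $C = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$, $\theta_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$
(d) $C = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$, $\theta_0 = 0$. How does the hypothesis tested by this contrast
differ from the one tested by the previous contrast?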
6. Read the introduction notes on Canvas and then answer the following question.
Suppose the CHNS investigators wish to use these baseline data to help design a future study
of a similar population. This new study will follow 2000 men and 2000 women from age 40
to age 50, with BMI measurements taken every two years. Investigators anticipate that the
correlation between any two BMI measures on an individual will be approximately 0.65 and
assume that the rate of BMI change over time will be a linear increase. Using a two-sided test
with $\alpha = 0.05$ and power $= 0.80$, what is the minimum detectable difference in the rate of
BMI change between men and women under these settings? Provide two plots showing how
the minimum detectable difference in slopes changes as you vary (a) $\rho$ and (b) $N$. Clearly
describe all assumptions you make in order to carry out the power analysis.
7. To facilitate reproducible research, a single file containing all code to read in data (processing
code) and reproduce analysis (analytic code) should be uploaded to Canvas. This code should
be clearly documented so that a colleague can run it without making any edits.
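For the power question in item 6, one commonly used approximation for comparing slopes between two equal-sized groups can be sketched as follows. It assumes an exchangeable correlation ρ between any two measurements on a person, a common variance σ² at every age, equal group sizes, no dropout, and a normal-approximation test; under those assumptions the minimum detectable slope difference is δ = (z_{1−α/2} + z_{power}) · sqrt(2σ²(1−ρ) / (m Σ_j (t_j − t̄)²)), where m is the number of people per group and t_j are the measurement ages. The notes on Canvas may prescribe a different formula, and the value of σ below is a placeholder rather than an estimate from the data.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_slope_diff(m_per_group, times, sigma, rho,
                              alpha=0.05, power=0.80):
    """Minimum detectable difference in slopes between two groups.

    Assumes exchangeable correlation rho, common SD sigma, equal group
    sizes, complete data, and a two-sided normal-approximation test.
    """
    n = len(times)
    mean_t = sum(times) / n
    ss_t = sum((t - mean_t) ** 2 for t in times)  # sum of squared deviations
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Variance of the difference between the two estimated group slopes.
    var_diff = 2 * sigma ** 2 * (1 - rho) / (m_per_group * ss_t)
    return (z_alpha + z_power) * sqrt(var_diff)


ages = [40, 42, 44, 46, 48, 50]   # measurements every two years
sigma_guess = 3.0                 # HYPOTHETICAL SD of BMI in kg/m^2

# Design described in item 6: 2000 men, 2000 women, rho = 0.65.
mdd = min_detectable_slope_diff(2000, ages, sigma_guess, rho=0.65)
print(f"Minimum detectable slope difference: {mdd:.4f} kg/m^2 per year")

# Grids that could feed the two requested plots.
rho_grid = [round(0.1 * k, 1) for k in range(1, 10)]
mdd_by_rho = [min_detectable_slope_diff(2000, ages, sigma_guess, r)
              for r in rho_grid]

m_grid = [250, 500, 1000, 2000, 4000]
mdd_by_m = [min_detectable_slope_diff(m, ages, sigma_guess, 0.65)
            for m in m_grid]
```

Under this formula the detectable difference shrinks as ρ increases, because within-person slopes are estimated more precisely when repeated measures are highly correlated, and it shrinks in proportion to one over the square root of the group size.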
We are very thankful to Prof. Amy Herring of Duke University, who provided this interesting
data set.