程序代写案例-B350F
时间:2021-11-27
BIA B350F – Using R for multiple regression
1

Introduction
Multiple regression is an extension of simple linear regression that can be used to predict the score on
the dependent variable (DV) from scores on several independent variables (IVs).
This laboratory session will use the ‘regress’ and ‘regressmi’ R data sets to illustrate how to use R to
build, evaluate and validate multiple regression model. The ‘regressmi’ data set is created from the
‘regress’ data set with a number of cases were randomly deleted and a fair amount of missing data for
illustrating how to use multiple imputation in regression analysis. The data sets contain the following
variables.
Regress/Regressmi
Variable Description
SUBJNO Subject number
TIMEDRS Visits to health professionals
MENHEAL Mental health symptoms
PHYHEAL Physical health symptoms
STRESS Life change stress

Download the data sets from the OLE and store it in the folder ‘B350F’. Then, load the data into the R
Console. Note that “regress” and “regressmi” are both CSV files.
Task 4-1 Standard multiple regression
1. Correlation analysis
Before build the regression model, you can use the cor() to investigate the relationship between a
DV and the IVs. If the multiple correlation is different from zero, you may then further ask which IVs
are important for predicting the score of DV.

Suppose you want to investigate whether the number of visits to health professional (timedrs) can
be predicted from the mental health symptoms (menheal), physical health symptoms (phyheal) and
life change stress (stress). Run the following R program and examine the output
## extract timedrs. menheal, phyheal and stress
library(car)
x <- subset(regress, select = timedrs:stress)
cor(x)
scatterplotMatrix(x = x, diagonal = TRUE)

## use rcor.test to perform Pearson's correlation test
library(ltm)
rcor.test(x)

In Task 2-7, you have learnt to use cor() create a correlation matrix. In the R program above, you
created a scatter plot matrix with histograms plotted at the diagonals, using scatterplotMatrix()
from the R package car. You also created a table for the Pearson Correlation test, using rcor.test
from the R package ltm. This function conducts pairwise Pearson’s correlation tests for every single
variable in your input dataframe.

Do the multiple correlations between DV (timedrs) and IVs (menheal, phyheal, and stress) differ
from zero? Do these variables seem to be normal?
BIA B350F – Using R for multiple regression
2


2. Standard multiple regression
Since the correlations between DV and IVs are all significant, you may perform the standard multiple
regression to include all IVs for this analysis and construct the confident interval. The following R
program begin the analysis:
reg <- lm(timedrs ~ menheal + phyheal + stress, data = regress)
summary(reg)
confint(reg, level=0.95)

In lm(), multiple regression model is specified as y ~ x1 + x2 + … + xn. By default, it fits the complete
model specified in the lm()statement. The summary()displays of the model results. The
confint()compute the confident interval for model parameters.

a. Run the R program and write down the sample regression model and the confident interval
of the model parameters.

b. Next, type anova(reg). What output do you obtain?

c. Suppose the order of the regressors in the regression model is reversed as follows:
reg1 <- lm(timedrs ~ stress + phyheal + menheal, data = regress)

Use the summary() and anova()to produce the summary and ANOVA table of the regression model,
and compare the results against that obtained in (a) and (b). Are they the same? If not, which of
them (summary, ANOVA table, or both) are different and why reversing the order of the regressors
makes a difference in the result(s)?
3. Evaluate model fitness
lm() provides standard analysis tables including the analysis of variance (ANOVA) and parameter
estimates tables provide details about the fitted model.

F-test and the adjusted R-square for the model:
Residual standard error: 9.708 on 461 degrees of freedom
Multiple R-squared: 0.2188, Adjusted R-squared: 0.2137
F-statistic: 43.03 on 3 and 461 DF, p-value: < 2.2e-16


BIA B350F – Using R for multiple regression
3

Test for regression coefficients:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.704848 1.124195 -3.296 0.001058 **
menheal -0.009666 0.129029 -0.075 0.940318
phyheal 1.786948 0.221074 8.083 5.6e-15 ***
stress 0.013615 0.003612 3.769 0.000185 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Testing the validity of the model and regression coefficients and comment of the goodness of the
fitted model.

4. Use ANOVA table for model comparisons
You just created your first regression model reg in part 3. In fact, you can utilize anova() to see if
there is another model that fits better than yours.

The following R program produces two new regression models. reg2 has a constant term (baseline);
reg3 has only “phyheal” and “stress” as predictors. Then, anova() compares both reg2 and reg3
against your first model, reg1.

reg2 <- lm(timedrs ~ 1, data = regress)
reg3 <- lm(timedrs ~ phyheal + stress, data = regress)
anova(reg2, reg)
anova(reg3, reg)

Run the R program above. Based on the output from ANOVA, answer the following questions.
a. Comparing your results between reg and reg2. Which one fits better?
b. Comparing your results between reg and reg3. Which one fits better?
c. Why the difference between reg and reg3 is NOT statistically significant?
d. Which model would you choose to present your data? Why would you choose this one?

5. Construct a complete ANOVA table
By default, R will break down the residual sum of squares (SSR) for each predictor. However,
sometimes you may want to know the overall SSR. The following R function produces the overall SSR
for us. After running the function by clicking the ‘source’ on the shortcut menu, run
simpleAnova(reg). Compare your result against anova(reg).
simpleAnova <- function(object, ...) {

# Compute anova table
tab <- anova(object, ...)

# Obtain number of predictors
p <- nrow(tab) - 1

# Add predictors row
predictorsRow <- colSums(tab[1:p, 1:2])
predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

# F-quantities
Fval <- predictorsRow[3] / tab[p + 1, 3]
BIA B350F – Using R for multiple regression
4

pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
predictorsRow <- c(predictorsRow, Fval, pval)

# Simplified table
tab <- rbind(predictorsRow, tab[p + 1, ])
row.names(tab)[1] <- "Predictors"
return(tab)

}

6. Regression diagnostic
Before you accept a regression model, it is important to examine influence and fit diagnostics to see
whether the model might be unduly influenced by a few observations and whether the data support
the assumptions that underlie the linear regression. The fit diagnostic includes the following plots
and test:
 residuals versus the predicted values (aka residual vs. fit plot)
 square root of standardized residuals versus the predicted values
 standardized residuals versus the leverage
 normal quantile-quantile plot (Q-Q plot) of the residuals
 Cook’s versus observation number
 histogram of the residuals
 Durbin-Watson test to assess the serially correlated errors

Before studying the plots, you should understand the following terms in linear regression:
a. Residual: The difference between the predicted value (based on the regression equation) and
the actual, observed value
b. Outlier: In linear regression, an outlier is an observation with large residual. In other words, it
is an observation whose dependent-variable value is unusual given its value on the predictor
variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or
other problem.
c. Leverage: An observation with an extreme value on a predictor variable is a point with high
leverage. Leverage is a measure of how far an IV deviates from its mean. High leverage points
can have a great amount of effect on the estimate of regression coefficients.
d. Cook’s distance (or Cook’s D): A measure that combines the information of leverage and
residual of the observation.

BIA B350F – Using R for multiple regression
5

The residual vs. fit plot provides hints for verifying the normality, homoscedasticity, and linearity
assumptions as illustrated in the following diagram.

(Source: textbook)
Let’s examine the following diagnosis plots for our model in #3 one by one. After running the
regression model in #2, run the following R program to obtain the diagnosis plots.

## regression diagnosis
par(mfrow=c(2, 2))
plot(reg)

First of all, the residual vs. fit plot indicates that the normality and homoscedasticity cannot be
fulfilled. Second, the Q-Q plot of residual also indicate the failure of normality, with a skewed
Assumptions met Failure of normality Nonlinearity Heteroscedasticity
BIA B350F – Using R for multiple regression
6

distribution of residuals. Finally, the residuals vs. leverage plot reveals that there is one observation
with very high leverage (#405) that might be overly influencing the fit produced.

Run the following R program and then find the # of observations that are highly influential. You
have seen and run this program at Task4-2, so you may feel familiar with it.

# plot cook's distance
cutoff<-4/(nrow(regress)-length(reg$coefficients)-1)
plot(reg,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")


# Influence Plot
influencePlot(reg, id.method="identify", main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )


Run the following R program to check the linearity between dependent variable and predictors.
Which predictor variable (or regressor) has nonlinear relationship with the dependent variable?

plot(regress$menheal, resid(reg),
main = "residual by regressors for timedrs")

plot(regress$phyheal, resid(reg),
main = "residual by regressors for timedrs")

plot(regress$stress, resid(reg), hist() boxplot()
main = "residual by regressors for timedrs")

The “car” package provides a function for the Durbin–Watson test. Run the following R program to
use the Durbin-Watson test to check the independence of errors. Are the residuals serially
correlated?

durbinWatsonTest(reg)


7. Outlier detection and distribution analysis
Write an R program to analyze the normality of the DV and IVs, and detect the univariate outliers of
DV and IVs. Hint: utilizing hist()and Boxplot()from the R package “car”, as well as the Q-Q plot. If
you can’t remember the format of Boxplot(), refer to Task 4-2 for details.

Identify the outliers and comment on the distribution of each variable. Suggest appropriate
transformation to improve the normality of highly skewed variables.


BIA B350F – Using R for multiple regression
7

8. Transformation
Variable Distribution Transformation Outlier
Timedrs Substantial positive skewness Log10 Subno=405 and 290
Phyheal Substantial positive skewness Log10 Subno=277 and 373
Menheal Slightly to moderate positive
skewness
- -
Stress Moderate positive skewness SQRT Subno=405 and 403

The distribution analysis results above show that all variables except ‘menheal’ have significant
positive skewness. After testing, it found that the following transformation can enhance the
normality of the variables:
logarithmic transformation – ‘timedrs’ and ‘phyheal’
square root transformation – ‘stress’.
Since ‘ltimedrs’ has zero value, a constant k=1 is added to the original value before taking logarithm.
A new R dataset “regress_res” is created to store the transformed data. Run the following R
program to obtain the transformed data.

log_timedrs <- log10(1 + regress$timedrs)
log_phyheal <- log10(regress$phyheal)
menheal <- regress$menheal
sqrt_stress <- sqrt(regress$stress)
regress_res <- data.frame(log_timedrs, log_phyheal, menheal, sqrt_stress)


Now, repeat the procedure from #2 to #5 of Task 4-1. Build a multiple regression with the
transformed data. You can utilize the following R program as a starting point, and add your own
code.

## rebuild the model with transformed data
regx <- lm(log_timedrs ~ menheal + log_phyheal + sqrt_stress,
data = regress_res)
summary(regx)

## regression diagnosis
par(mfrow=c(2, 2))
plot(regx)

Evaluate your new model and compare it with the original model obtained in #3.

(Note: Normality assumption is only for the residual. However, when the regression function is not
linear and the error terms are not normal and have unequal variances, In general (but not always):
1. Transform the dependent variable corrects problems with error terms (and may help the non-
linearity).
2. Transform the independent variables primarily correct the non-linearity.

But, transformation makes it difficult to interpret of the model coefficients.)
BIA B350F – Using R for multiple regression
8

9. Multicollinearity diagnostics
When an IV is nearly a linear combination of other regressors in the model, the affected estimates
are unstable and have high standard errors. This problem is called collinearity or multicollinearity.
We can use the Farrar-Glauber test (F-G test), which is a Chi-square test for detecting the strength
of the multicollinearity over the whole set of predictor variables.

A collinearity problem occurs when one or more predictor variables have a VIF (variance inflation
factor) larger than 10. Run the following R program and examine the results

library(mctest)
omcdiag(mod=regx)
imcdiag(mod=regx)

We first create a separate dataset that contains only the predictor variables (it is stored as “dat” at
the above R program). Next, you need to load the R package mctest in order to run the F-G test. The
test has two parts.

 omcdiag() tests if overall multicollinearity is present in our model or not. After running this
function, you will obtain following output
Call:
omcdiag(mod = regx)

Overall Multicollinearity Diagnostics
MC Results detection
Determinant |X'X|: 0.6155 0
Farrar Chi-Square: 224.2690 1
Red Indicator: 0.4117 0
Sum of Lambda Inverse: 4.0475 0
Theil's Method: 0.0074 0
Condition Number: 9.7468 0

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test

From the output above, the Chi-square statistic is 224.2690, which is very large. It implies that
multicollinearity may be present in our model.
 imcdiag() tells us which variable(s) may be highly correlated. After running this function, you
will obtain the following output.
Call:
imcdiag(mod = regx)

All Individual Multicollinearity Diagnostics Result

VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
menheal 1.4614 0.6843 106.5886 213.6385 0.8272 2.2664 0 0.0030 1.2448
log_phyheal 1.3863 0.7213 89.2461 178.8785 0.8493 2.1500 0 0.0031 1.0987
sqrt_stress 1.1998 0.8335 46.1456 92.4910 0.9130 1.8606 0 0.0036 0.6565

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
BIA B350F – Using R for multiple regression
9


menheal , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.3768

* use method argument to check which regressors may be the reason of collinear
ity
===================================

The VIF, TOL and Wi columns provide the diagnostic output for variance inflation factor,
tolerance and Farrar-Glauber F-test respectively. As mentioned earlier, when one or more
predictor variables have a VIF > 10, you may have a collinearity problem. Here, the VIF’s of the
predictor variables are less than 2. Therefore, multicollinearity may be present, but it is NOT
statistically significant.
To confirm this result, you run a correlation test for our predictor variables. Here is the test
result from rcor.test().
log_phyheal menheal sqrt_stress
log_phyheal ***** 0.511 0.317
menheal <0.001 ***** 0.383
sqrt_stress <0.001 <0.001 *****

upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values

Note that “log_phyheal” and “menheal” are moderately correlated (r = 0.511). Therefore, there
might be a multicollinearity issue between these two variables.
(Note: it is not impossible that the F-G test gives contradictory results – omcdiag() indicates the
overall multicollinearity is present, but imcdiag()tells you multicollinearity between variables is
not significant. You are highly advised to run the whole F-G test and the correlation test, before
making a final decision)


BIA B350F – Using R for multiple regression
10

Task 4-2 Model selection methods
By default, lm() will use every single predictor variable available in our dataset to build a statistical
model. However, sometimes you end up with a model where some of the variables are not
statistically significant. In view of this, you want to build an “optimum” statistical model where all
predictor variables are significant.

We will utilize the dataset “swiss” and the R function stepAIC() to build our “optimum” model. The
dataset “swiss” consists of fertility measure and socio-economic indicators for each of 47 French-
speaking provinces of Switzerland in 1888. All data are standardized. Type “?swiss” to examine the
details. Here is a brief summary of the dataset “swiss”.

Variable name description
Fertility Ig, ‘common standardized fertility measure’
Agriculture % of males involved in agriculture as occupation
Examination % draftees receiving highest mark on army examination
Education % education beyond primary school for draftees.
Catholic % ‘catholic’ (as opposed to ‘protestant’).
Infant.Mortality live births who live less than 1 year.

1. Forward selection
stepAIC() can be called after loading the R package MASS. This function allows you to perform a
forward selection, backward selection, or stepwise selection, for choosing predictor variables.

stepAIC()with forward selection is operated in the following procedure.
a. Starting from a null model (Y ~ 1).
b. The null model’s AIC is compare against all predictor variables’ AIC.
c. If a predictor variable’s AIC is smaller than the null model’s AIC, this variable will be added
into our null model. Variable with the lowest AIC will be added first.
d. Then, the new model’s AIC is compared against the remaining predictor variable’s AIC.
e. Step 2 to 4 repeats until all predictor variables are added, or, remaining predictor variable’s
AIC is higher than the new model’s AIC.
Let’s look at the following R output for illustration.
Start: AIC=238.35
Fertility ~ 1

Df Sum of Sq RSS AIC
+ Education 1 3162.7 4015.2 213.04
+ Examination 1 2994.4 4183.6 214.97
+ Catholic 1 1543.3 5634.7 228.97
+ Infant.Mortality 1 1245.5 5932.4 231.39
+ Agriculture 1 894.8 6283.1 234.09
7178.0 238.34

The model (Fertility ~ 1) has AIC = 238.35. The predictor variable “Education” has the smallest AIC
(213.04), so it will be added into the model, and the new model becomes Fertility ~ education.
Now, run the following R program and examine the result.
BIA B350F – Using R for multiple regression
11

fit1 <- lm(Fertility ~ 1, data = swiss)
fit2 <- lm(Fertility~ ., data = swiss)
step1 <- stepAIC(fit1, direction = "forward", scope= formula(fit2))
step1$anova

In the R program above, you created two models: a null model (fit1) and a full model (fit2). The
stepAIC has three main arguments:
 fit1 tells R what your initial model is. Here, you start from fit1, the null model
 direction = "forward" tells R to use forward selection. You can change the direction to
“backward” for backward selection, or “both” for stepwise selection.
 scope= formula(fit2)gives the upper bound of the “optimum” model. Your final model
should be somewhere between fit1 and fit2.
Finally, the step1$anova shows you which variables are added during the model selection process,
and which variables the final model contains. Note: this is what R believes as the “optimal” model.
You still need to use your own judgement to make your decision
Which variables are added in each step and which variables the final model have?

2. Backward selection
stepAIC()with backward selection is operated in the following procedure.
a. Starting from a full model. All predictor variables are included
b. The full model’s AIC is compare against all predictor variables’ AIC.
c. If a predictor variable’s AIC is smaller than the full model’s AIC, this variable will be removed
into our null model. Variable with the lowest AIC will be removed first.
d. Then, the new model’s AIC is compared against the remaining predictor variable’s AIC.
e. Step 2 to 4 repeats until all predictor variables are removed, or, remaining predictor
variable’s AIC is higher than the new model’s AIC.

step2 <- stepAIC(fit2, direction = "backward")
step2$anova

Run the above R program. Study how variables are deleted when direction = "backward". Which
variable is deleted first, and what variables are retained in the final model?

3. Stepwise selection
The stepwise is a combination of the forward and backward selection.

step3 <- stepAIC(fit1, direction = "both", scope= formula(fit2))
step3$anova

Run the above R program. Study how variables are added/deleted in the stepwise method. Which
variable is added/deleted in each step? What variables are retained in the final model?



BIA B350F – Using R for multiple regression
12

4. In your lecture note, you learned that model selection is done by comparing the p-value of a
predictor against the alpha-to-enter / alpha-to-remove level. However, stepAIC compares
predictors by its AIC value. To help you understand better how model selection is done by p-value,
here is an example for you. This example performs a stepwise selection using p-value as the criteria.
First, download and open the R script “stepwise”. You will see something like this:

Next, the “source” button as labelled in the figure above. You will see something loaded in the
environment, like this:

You have just loaded a function into R. R functions can perform complex tasks, especially for
repetitive and lengthy tasks.
BIA B350F – Using R for multiple regression
13

Now, the function “stepwise” contains 5 arguments: full model, initial model, alpha-to-enter, alpha-
to-leave, source of data. Run the following R program.
result <- stepwise(Fertility ~ 1 + Agriculture + Examination + Education +
Catholic + Infant.Mortality,
Fertility ~ 1, 0.05, 0.1, data = swiss)

You can see how p-value of each predictor changes as each of them is included / excluded during
model selection. If a predictor has p-value between 0.05 and 0.1, it will be included into the final
model.
Finally, type summary(result) . This gives you the regression table of the final model.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
Education -0.98026 0.14814 -6.617 5.14e-08 ***
Catholic 0.12467 0.02889 4.315 9.50e-05 ***
Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
Agriculture -0.15462 0.06819 -2.267 0.02857 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10

Does it look familiar to you? You can compare this output against the output from stepwise selection,
using stepAIC(). Type summary(step3) and compare the output.













BIA B350F – Using R for multiple regression
14

Task 4-3 Cross-validation
Cross-validation is a technique to evaluate predictive models by randomly partitioning the original
sample into a training set to train the model, and a test set to evaluate it. A recommended split is 80%
for the statistical regression analysis and the remaining 20% as the cross-validation sample. After the
statistical regression on the larger subsample is run, predicted scores are created for the smaller cross-
validation sample using the regression coefficients produced by the analysis. Finally, predicted scores
and actual scores are correlated to find R2 for the testing sample. A large discrepancy between R2 for the
testing and training samples indicates overfitting and lack of generalizability of the results of the
analysis.
1. Build model with training sample
The following R program will randomly split the data set into 80% for the statistical regression
analysis and the remaining 20% as the cross-validation sample. The program does the followings:
a. The smp_size returns a number equivalent to 80% of all data from “regress_new”.
b. set.seed()is used so that the same set of training / test data can be obtained each time
c. The sample() tells R to draw numbers randomly. The total numbers drawn is equal to 80%
of the number of observations in “regress”. The result is stored as train_ind
d. Two new dataset is created, One is train, containing the training data, and one is test,
containing the test data
regress_new <- read.csv("regress_new.csv")
smp_size <- floor(0.8 * nrow(regress_new))
set.seed(123)
train_ind <- sample(seq_len(nrow(regress_new)), size = smp_size)

train <- regress_new[train_ind, ]
test <- regress_new[-train_ind, ]

Finally, a regression model is created using the training data set.
reg <- lm(log_timedrs ~ log_phyheal + sqrt_stress, data = train)
summary(reg)

Run the cross-validation program. What is the R2 for the training sample?

2. Output sample prediction
The following R program uses the model to predict the “log_timedrs”, named as ‘pred_ltimedrs’, for
the training sample. The predict() will generate predicted values of “log_timedrs”, using our
regression from training sample and our test data. The rcor.test is then used to calculate the
correlation between the actual score ‘timedrs’ and the predicted score ‘pred_ltimedrs’. After that,
you can calculate the coefficient of determination R2 , which is equal to 2.
pred_ltimedrs <- predict(reg, test)
dat <- data.frame(pred_ltimedrs, test$log_timedrs)
library(ltm)
rcor.test(dat)

What is the R2 for the testing sample? Compare to the R2 of the training sample, which of them is
larger? Is the R2 for the testing sample is significantly different from that for the training sample?
BIA B350F – Using R for multiple regression
15

Task 4-4 Making prediction from existing multiple regression
The following R program illustrates how to use the regression model to predict values. A data set
pred.data containing the values of predictor variables for two new observations is created. Next, you
build a statistical model, reg3, using the transformed data “regress_new”. The predict() uses the reg3
and pred.data to predict the values of log_timedrs. The level=0.95 specifies the 95% confidence
level; the interval = “predict” (or “confidence”) requests to output the lower bound and the upper
bound of a 100(1-α)% confidence interval of an individual prediction (or mean prediction).
X <- c(999, 1000)
menheal <- c(7, 10)
log_phyheal <- c(0.3, 0.5)
sqrt_stress <- c(15, 3)

pred.data <- data.frame(X, menheal, log_phyheal, sqrt_stress)

reg3 <- lm(log_timedrs ~ menheal + log_phyheal + sqrt_stress,
data = regress_new)

# individual prediction
predict(reg3, pred.data, level = 0.95, interval="predict")

# mean prediction
predict(reg3, pred.data, level = 0.95, interval="confidence")

Run the R program above and study the result.


BIA B350F – Using R for multiple regression
16

Self-study Task 4-5 Multiple imputation
Multiple imputation is a statistical technique for analyzing incomplete data sets, that is, data sets for
which some entries are missing. Application of the technique requires three steps: imputation, analysis
and pooling, as explained below.
 Imputation: Impute (or fill in) the missing entries of the incomplete data sets, not once, but m
times. Imputed values are drawn for a distribution that can be different for each missing entry.
 Analysis: Analyze each of the m completed data sets and produce m analysis results.
 Pooling: Integrate the m analysis results into a final result.
The following figure illustrates these steps for applying multiple imputation in regression analysis.









1. Imputed data sets
The missing data in the “regressmi” data set is imputed by mice()in the R library mice. Run the
following R program.
library(mice)
# understand the missing pattern
md.pattern(regress_mi)

## impute the data
dat <- mice(regress_mi,m=5,maxit=10,meth='pmm',seed=500)
regoutmi <- complete(dat, 2)

The md.pattern tells you the number of observations that have/have no missing value. After that,
you call out mice() to impute the missing data for us. The function mice() has 4 components.
 regress_mi , the name the dataset being imputed
 m=5 specifies to impute the missing data by 5 times
 maxit=10 is the number of iterations. The default number is 5
 seed=500 is equivalent as set.seed(500)

Next, a dataset called “dat” is generated. It consists of 5 set of imputed data. The R program extract
the 2nd set of imputed data, and stored it as “regoutmi”. You can choose any number between 1 and
Data set with
missing data

Imputed data
sets
Analyze results
Pool results
BIA B350F – Using R for multiple regression
17

5 from “dat” and extract that set of data. If the value of m increases, you have more imputed
dataset to choose from.
Now, type head(regress_mi) and head(regoutmi)to compare the difference of the two dataset.
There should not be any missing values in “regoutmi”.

2. Pool results
pool() is used to combine all the parameter estimates and associated standard errors or covariance
matrix generated by lm() for each imputed data in the following R program. Run it and examine the
results.
modelFit1 <- with(dat ,lm(timedrs ~ menheal + phyheal + stress))
summary(pool(modelFit1))

estimate std.error statistic df p.value
(Intercept) -2.1871747 0.71178076 -3.0728207 106.92598 2.687379e-03
menheal 8.5663513 1.13419216 7.5528218 107.44475 1.485767e-11
phyheal 0.1806579 0.58865665 0.3068987 84.80810 7.595145e-01
stress 0.1936725 0.04344885 4.4574830 68.15435 2.044192e-05

Now, write an R program to build a statistical model for the imputed dataset “regoutmi”. You should
obtain an output similar to part 2 of Task 4-1. Next, compare the R output above against the R
output from the dataset “regoutmi”.

3. Handling data by listwise deletion
Instead of handling the missing by imputation, lm() by default will use listwise deletion to handle
missing data (i.e., delete an entire observation if any value is missing) as follows.
reg2 <- lm(timedrs ~ menheal + phyheal + stress, data = regress_mi)
summary(reg2)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.85903 0.79399 -3.601 0.000375 ***
menheal 8.91211 1.27041 7.015 1.73e-11 ***
phyheal 0.46044 0.64241 0.717 0.474125
stress 0.20671 0.04584 4.509 9.58e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Run the program and compare this model with the imputed regression model.




BIA B350F – Using R for multiple regression
18

Self-study Task 4-6 Bootstrapping
Bootstrapping is a resampling technique that generates an empirical distribution of a test statistic or set
of test statistics, from the original sample. It has two main advantages:
 There is no need to assume if the original sample follows a normal distribution
 The sample size can be as small as 10.
It’s easiest to demonstrate the logic of bootstrapping with an example. Say that you want to calculate
the 95 percent confidence interval for a sample mean. Your sample has 10 observations, a sample mean
of 40, and a sample standard deviation of 5. What if there is NO information about the sampling
distribution? You could use a bootstrapping approach instead. Here are the steps:
a. Randomly select 10 observations from the sample, with replacement after each selection. Some
observations may be selected more than once, and some may not be selected at all.
b. Calculate and record the sample mean.
c. Repeat steps 1 and 2 a thousand times.
d. Order the 1,000 sample means from smallest to largest.
e. Find the sample means representing the 2.5th and 97.5th percentiles. In this case, it’s the 25th
number from the bottom and top. These are your 95 percent confidence limits.
(Quoted from “R in Action – 2nd edition)

We can apply non-parametric bootstrapping to obtain a 95% confidence interval for the R2 value in a
regression model. Here is an example, which utilizes the “regress” dataset from Task 4-2.
1. First, Write a function to obtain the R2 value from our regression model
rsq <- function(formula, data, indices) {
d <- data[indices,]
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}

2. The function “rsq” will return the R-square value from a regression model (you will build the model
at the next step). Then, use the following code to draw 1000 bootstrap replications
library(boot)
set.seed(1234)
results <- boot(data=regress_new, statistic=rsq, R=1000,
formula = log_timedrs ~ log_phyheal + sqrt_stress)

Here, you called the R library boot, and called out the boot function. It takes the following
arguments:
 data, which dataset you are using
 statistic, which statistic is bootstrapped. Here, you want to bootstrap the R2
 R, number of bootstrap replicates. You can adjust this number as you like
 formula, you want to bootstrap the statistic, based on the input formula


BIA B350F – Using R for multiple regression
19

3. Next, you can print the bootstrapped results, and plot the associated graph. You can see an
estimated R2 from bootstrapping, as well as the standard error of the estimate.
print(results)
plot(results)

Compare your estimated R2 against part 2 of Task 4-3 (the R2 from the testing sample). Which one is
higher? Why is there a difference between two figures?

4. You can use the following R program to extract a 95% confident interval for the estimated R2.
boot.ci(results, type=c("perc", "bca"))

Does the confident interval cross over zero? What can you conclude from this interval?

essay、essay代写