Homework 02
your ID here
2022-03-11T15:00:00 GMT
Contents
0. Data
1. Fit a Random Forest model
2. Compute and Compare Predictive Performance
3. Examine the model
Bonus: Use IML to create a PDP+ICE chart
4. Summarize
0. Data
We’re going to use a mail response data set from a real direct marketing campaign located in mailing.csv.
Each record represents an individual who was targeted with a direct marketing offer. The offer was a
solicitation to make a charitable donation. This data was provided by the authors of our textbook, and I’m
not sure of the original source.
The columns (features) are:
Income      household income
Firstdate   date associated with the first gift by this individual
Lastdate    date associated with the most recent gift
Amount      average amount given by this individual over all periods
rfaf2       frequency code
rfaa2       donation amount code
pepstrfl    flag indicating a star donor
glast       amount of last gift
gavr        amount of average gift
class       outcome variable, 1 if they made a donation
The target variable is class, which equals one if they gave in this campaign and zero otherwise.
load('./mailing_train_test.RData')
glimpse(mailing_train)
## Rows: 4,000
## Columns: 14
## $ Income 7, 2, 4, 6, 6, 3, 1, 5, 2, 0, 0, 0, 6, 3, 6, 0, 5, 6, 7, 3,~
## $ Firstdate 9101, 9501, 9410, 9110, 9510, 8610, 8612, 8809, 8703, 9410,~
## $ Lastdate 9512, 9508, 9602, 9512, 9511, 9601, 9701, 9601, 9511, 9512,~
## $ Amount 0.30, 0.23, 0.24, 0.17, 0.11, 0.45, 0.43, 0.23, 0.25, 0.09,~
## $ rfaf2 3, 2, 4, 2, 2, 3, 1, 4, 2, 1, 1, 2, 1, 1, 4, 1, 2, 1, 3, 1,~
## $ glast 10, 20, 8, 20, 15, 5, 5, 5, 5, 15, 15, 10, 18, 25, 10, 15, ~
## $ gavr 6.41, 15.00, 5.42, 11.50, 14.00, 3.29, 6.87, 6.56, 11.78, 1~
## $ class 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ rfaa2_D 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ rfaa2_E 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ rfaa2_F 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,~
## $ rfaa2_G 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
## $ pepstrfl_0 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,~
## $ pepstrfl_X 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,~
1. Fit a Random Forest model
Our outcome variable is class. Provide a quick look at the distribution of class for the training data and
test data.
# your code here
Comparing the distribution of the outcome variable in training and test, do they look balanced?
Write a sentence HERE about the balance of outcomes in the data.
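As a rough illustration of what "a quick look at the distribution" can mean, the sketch below tabulates counts and proportions of a binary outcome with base R. The vectors train_class and test_class are hypothetical toy values, not the mailing data; with the real data you would pass the class column of each data frame instead.

```r
# Toy stand-ins for the outcome columns (hypothetical values, not the real data)
train_class <- c(1, 1, 0, 1, 0, 0, 1, 0)
test_class  <- c(0, 1, 1, 0)

# Counts per class
train_counts <- table(train_class)

# Proportions per class; values near 0.5 suggest balanced outcomes
train_props <- prop.table(train_counts)
test_props  <- prop.table(table(test_class))

print(train_counts)
print(train_props)
print(test_props)
```

Comparing the two proportion tables side by side is one simple way to judge whether the training and test sets have a similar outcome mix.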
Fit a random forest model to predict class.
# your code here
Examine the results.
# your code here
What is the out-of-bag (OOB) error rate?
Write HERE.
2. Compute and Compare Predictive Performance
Use the confusionMatrix function to compute several metrics of predictive performance.
# your code here
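To make the reported metrics less of a black box, here is a minimal hand computation of a confusion matrix and three common metrics in base R. It is not the caret confusionMatrix function the assignment asks for; the actual and prob vectors are made-up toy values, and the 0.50 cutoff is applied explicitly.

```r
# Toy data (hypothetical): true labels and predicted probabilities of class 1
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob   <- c(0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3)

# Turn probabilities into class predictions with a 0.50 cutoff
predicted <- as.integer(prob >= 0.5)

# Cross-tabulate predictions against the truth; fixed levels keep the
# table 2x2 even if one class is never predicted
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

tp <- as.numeric(cm["1", "1"])  # predicted 1, actually 1
tn <- as.numeric(cm["0", "0"])  # predicted 0, actually 0
fp <- as.numeric(cm["1", "0"])  # predicted 1, actually 0
fn <- as.numeric(cm["0", "1"])  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of true 1s that were caught
specificity <- tn / (tn + fp)   # share of true 0s that were caught

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

When describing model performance, it can help to look at sensitivity and specificity separately, since accuracy alone can hide poor performance on one class.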
How would you describe the performance of this model?
Write HERE
The confusion matrix assumed a 0.50 cutoff for prediction. Now create a ROC plot and compute the AUC
for the training set.
# your code here
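For intuition about what a ROC curve and the AUC measure, the sketch below computes both by hand in base R on toy data. This is an illustration only: the helper auc_rank is a hypothetical name, it uses the rank-based (Mann-Whitney) formula rather than any particular ROC package, and the actual and prob vectors are made up.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
auc_rank <- function(actual, prob) {
  r     <- rank(prob)              # average ranks, so ties count as half
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (hypothetical): true labels and predicted probabilities of class 1
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob   <- c(0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3)

# ROC points: true and false positive rates at every distinct cutoff
thresholds <- sort(unique(prob), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(prob[actual == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(prob[actual == 0] >= t))

auc <- auc_rank(actual, prob)
print(data.frame(threshold = thresholds, fpr = fpr, tpr = tpr))
print(auc)
```

Plotting tpr against fpr gives the ROC curve. Running the same calculation on training and test predictions lets you compare the two AUC values directly.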
Now create a ROC chart for the test set and compute the test AUC.
# your code here
Is the model underfit, overfit, or correctly fit to the data?
Write a couple sentences HERE about this model fit.
3. Examine the model
Examine the variable importance of the model.
# your code here
Make some individual predictions of the model. Choose one case from the data and see what the model
predicts for that one person.
# your code here
Using the most important variable from the plot above, change the value of that variable to something new
and make a new prediction for that one case (e.g., set the value to something very small or very large). How
does the prediction change?
# your code here
Let’s now try to predict the outcome for this case as that important variable changes from its minimum
value to its maximum value.
1. Create a grid of at least 100 points from the minimum value to the maximum value.
2. Duplicate the case you used above the same number of times.
3. Add the grid of points to the data.
4. Predict the outcome using this new fake data and save the predicted probability in the dataset.
# your code here
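The four steps above can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the data frame toy, its columns x1 and x2, and the logistic regression fit are stand-ins, not the mailing data or the random forest from this assignment. The mechanics of building the grid and the fake data are the same regardless of the model.

```r
set.seed(42)

# Synthetic training data (hypothetical; not the mailing data)
n   <- 200
toy <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-2 + 0.4 * toy$x1))

# Stand-in model: logistic regression instead of a random forest
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# The single case whose variable we will vary
one_case <- toy[1, ]

# 1. Grid of 100 points from the minimum to the maximum of the variable
grid <- seq(min(toy$x1), max(toy$x1), length.out = 100)

# 2. Duplicate the case once per grid point
fake <- one_case[rep(1, length(grid)), ]

# 3. Overwrite the variable of interest with the grid values
fake$x1 <- grid

# 4. Predict and store the predicted probability in the data
fake$pred_prob <- predict(fit, newdata = fake, type = "response")

head(fake)
```

With a model fit by the randomForest package, the prediction step would instead use predict with type = "prob", which returns a matrix with one column per class, so you would keep only the column for class 1.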
Now plot the grid values against the predicted probability. How does the probability of responding to the
mailer change in response to the variable? Tip: You can use coord_cartesian to change the x-axis limits
to focus on specific areas of detail if you want.
# your code here
Bonus: Use IML to create a PDP+ICE chart
Use the iml package to create a chart that combines individual conditional expectation (ICE) curves with a
partial dependence plot (PDP).
It’s highly recommended that you provide the prediction object with only a subset of the data. I’m not sure the
RStudio Cloud instance can handle the full data, and it would take a very long time. Use coord_cartesian to
narrow the y-axis limits to a region where you can observe the effect (it’s tiny!).
# your code here, if you want
4. Summarize
What are your thoughts on the impact of these different features on the likelihood for a person to respond
to our donation requests? (100-200 words)
Thoughts HERE
Things you could write about:
• What changes would you make to the modeling approach to improve the predictions?
• What recommendations would you make to the organization?
• What other kinds of data would be useful in improving these predictions?