ISE 529 Predictive Analytics
Homework 6
Submit on April 26 by 6 pm.
1. Consider the Hitters data set, which contains 20 variables on major league baseball players. The goal
is to predict the Salary of the players. Salary is not available for some players; remove these rows
from the data set first. Then use random_state=1 to divide the data set into a training set (50%)
and a test set (50%).
a) (20 pts.) Fit a random forest model with B = 100 trees, max_features = 10, and
random_state=1. Report the test MSE. What are the three most important predictors?
b) (20 pts.) Fit a gradient boosting model with 100 trees, learning rate 0.10, and max_depth = 4.
Use random_state=1. Report the test MSE. What are the three most important predictors?
c) (10 pts.) Fit a multiple linear regression model using only the two most important predictors
found by the random forest in part a). Report the test MSE.
2. Consider the Caravan.csv data set. The goal is to predict Purchase (this is a classification
problem). The variables PVRAAUT and AVRAAUT are highly unbalanced (most rows belong to a few
categories), so remove these variables from the data set. Use random_state=1
to divide the data set into a training set (40%) and a test set (60%).
a) (20 pts.) Fit a random forest model with 500 trees and max_features = 29 to the training set
with Purchase as the response and the other variables as predictors. Use random_state=1.
What are the three most important predictors? Report the test accuracy rate.
b) (20 pts.) Fit a boosting model to the training set with max_depth = 4 and Purchase as the
response and the other variables as predictors. Use random_state=1. Use 1000 trees, and
learning rate 0.01. What are the three most important predictors? Report the test accuracy
rate.
c) (10 pts.) Fit a logistic regression model with max_iter=900 and random_state=1 to the training
set to predict Purchase from the other variables. Report the test accuracy rate.
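The steps in Problem 2 can be sketched the same way. Again, this is only an illustration under assumptions: the data frame below is synthetic (hypothetical predictors v0 to v39 plus placeholder PVRAAUT and AVRAAUT columns), not the real Caravan data, and the models are fit on the raw predictors without any scaling, which the assignment neither requires nor rules out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Caravan (hypothetical columns, NOT the real data).
rng = np.random.default_rng(0)
n, p = 500, 40
data = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"v{i}" for i in range(p)])
data["PVRAAUT"] = 0  # placeholder for the unbalanced variables to be dropped
data["AVRAAUT"] = 0
signal = 2 * data["v0"] + data["v1"] + rng.normal(size=n)
data["Purchase"] = np.where(signal > 2.5, "Yes", "No")

# Drop the two unbalanced variables, then separate predictors and response.
data = data.drop(columns=["PVRAAUT", "AVRAAUT"])
X = data.drop(columns="Purchase")
y = data["Purchase"]

# 40% training / 60% test split with the seed required by the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.4, test_size=0.6, random_state=1
)

models = {
    # a) 500 trees, 29 candidate predictors per split.
    "random_forest": RandomForestClassifier(
        n_estimators=500, max_features=29, random_state=1
    ),
    # b) 1000 trees, learning rate 0.01, depth 4.
    "boosting": GradientBoostingClassifier(
        n_estimators=1000, learning_rate=0.01, max_depth=4, random_state=1
    ),
    # c) Logistic regression with a raised iteration limit.
    "logistic": LogisticRegression(max_iter=900, random_state=1),
}

accuracy = {}
top3 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    # Only the tree ensembles expose impurity-based importances.
    if hasattr(model, "feature_importances_"):
        ranked = pd.Series(model.feature_importances_, index=X.columns)
        top3[name] = ranked.sort_values(ascending=False).index[:3].tolist()

print(accuracy)
print(top3)
```

Because most rows can fall in a single class, it may help to compare each accuracy rate with the share of the majority class in the test set, available as y_test.value_counts(normalize=True).max().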
Submit your report as a PDF file on Blackboard (no screen captures).