1. (5+10+10=25 pts) PCA
For this problem, we will try to quantify the impact of dimensionality reduction on logistic
regression.
(a) Normalize the features of the wine quality dataset (where applicable). Train an
unregularized logistic regression model on the normalized dataset and predict the
probabilities on the normalized test data.
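A minimal sketch of one possible approach with scikit-learn is shown below; the file names and variable names (xTrain, yTrain, xTest, and so on) are assumptions, not the actual wine quality files.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# assumed file names -- replace with the actual wine quality files
xTrain = pd.read_csv("wine_train_x.csv")
yTrain = pd.read_csv("wine_train_y.csv").values.ravel()
xTest = pd.read_csv("wine_test_x.csv")

# fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
xTrainNorm = scaler.fit_transform(xTrain)
xTestNorm = scaler.transform(xTest)

# penalty=None gives an unregularized model (use penalty='none' on older scikit-learn)
clf = LogisticRegression(penalty=None, max_iter=5000)
clf.fit(xTrainNorm, yTrain)

# predicted probability of the positive class on the normalized test data
probsNorm = clf.predict_proba(xTestNorm)[:, 1]
```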
(b) Run PCA on the normalized training dataset. How many components are needed to
capture at least 95% of the variance in the original data? Discuss what characterizes the
first 3 principal components (i.e., which original features are important).
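One possible way to answer this with scikit-learn's PCA, continuing from the normalized arrays in the sketch above (xTrain and xTrainNorm are assumed names):

```python
import numpy as np
from sklearn.decomposition import PCA

# fit PCA with all components on the normalized training data
pca = PCA()
pca.fit(xTrainNorm)

# smallest number of components whose cumulative explained variance reaches 95%
cumVar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumVar, 0.95) + 1)
print("components needed for 95% variance:", k)

# loadings of the first 3 principal components; the largest-magnitude entries
# indicate which original features dominate each component
for i in range(3):
    loadings = pd.Series(pca.components_[i], index=xTrain.columns)
    print(f"PC{i + 1}:")
    print(loadings.abs().sort_values(ascending=False).head())
```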
(c) Train an unregularized logistic regression model using the PCA dataset and predict the
probabilities on the appropriately transformed test data (i.e., for PCA, the test data
should be transformed to reflect the loadings on the k principal components). Plot the
ROC curves for both models (normalized dataset, PCA dataset) on the same graph.
Discuss your findings from the ROC plot.
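A sketch of the PCA-space model and the combined ROC plot, again continuing from the sketches above (k, xTrainNorm, xTestNorm, probsNorm, yTrain, and the test-label file name are assumed):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# project both splits onto the first k principal components learned from the training data
pcaK = PCA(n_components=k)
xTrainPca = pcaK.fit_transform(xTrainNorm)
xTestPca = pcaK.transform(xTestNorm)

clfPca = LogisticRegression(penalty=None, max_iter=5000)
clfPca.fit(xTrainPca, yTrain)
probsPca = clfPca.predict_proba(xTestPca)[:, 1]

# ROC curves for both models on the same axes
yTest = pd.read_csv("wine_test_y.csv").values.ravel()  # assumed file name
for name, probs in [("normalized features", probsNorm), ("PCA features", probsPca)]:
    fpr, tpr, _ = roc_curve(yTest, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--", linewidth=0.5)  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```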
2. (30+10+5=45 pts) Almost Random Forest
For this problem, you will be implementing a variant of the random forest using the decision
trees from scikit-learn. However, instead of subsetting the features for each node of each
tree in your forest, you will choose a random subspace that the tree will be created on. In
other words, each tree will be built using a bootstrap sample and random subset of the features.
The template code for the random forest is available in rf.py
(a) Build the adaptation of the random forest. Note that the forest will support the following
parameters:
• nest: the number of trees in the forest
• maxFeat: the maximum number of features to consider in each tree
• criterion: the split criterion – either gini or entropy
• maxDepth: the maximum depth of each tree
• minSamplesLeaf: the minimum number of samples per leaf node
Note that you’ll need to implement the train and predict functions in the template
code; a sketch of one possible class structure is given after the function descriptions below.
• train: Given a feature matrix and the labels, learn the random forest using the
data. The return value should be the OOB error associated with the trees up to that
point. For example, at 5 trees, calculate the random forest predictor by averaging
only those trees where the bootstrap sample does not contain the observation.
• predict: Given a feature matrix, predict the responses for each sample.
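A minimal sketch of one possible implementation is shown below, using DecisionTreeClassifier from scikit-learn. The class layout, attribute names, and return format are illustrative assumptions; the actual rf.py template may organize things differently.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier


class RandomForest:
    def __init__(self, nest, maxFeat, criterion, maxDepth, minSamplesLeaf):
        self.nest = nest
        self.maxFeat = maxFeat
        self.criterion = criterion
        self.maxDepth = maxDepth
        self.minSamplesLeaf = minSamplesLeaf
        self.trees = []  # list of (fitted tree, column indices it was trained on)

    def train(self, xFeat, y):
        xFeat, y = np.asarray(xFeat), np.asarray(y)
        n, p = xFeat.shape
        votes = [[] for _ in range(n)]  # out-of-bag votes accumulated per sample
        oobError = {}
        for b in range(self.nest):
            # bootstrap the rows and pick a random subspace of maxFeat columns
            rows = np.random.choice(n, size=n, replace=True)
            cols = np.random.choice(p, size=self.maxFeat, replace=False)
            tree = DecisionTreeClassifier(criterion=self.criterion,
                                          max_depth=self.maxDepth,
                                          min_samples_leaf=self.minSamplesLeaf)
            tree.fit(xFeat[np.ix_(rows, cols)], y[rows])
            self.trees.append((tree, cols))

            # record this tree's votes on the samples its bootstrap left out
            oob = np.setdiff1d(np.arange(n), rows)
            if len(oob) > 0:
                for i, pred in zip(oob, tree.predict(xFeat[np.ix_(oob, cols)])):
                    votes[i].append(pred)

            # OOB error using the trees built so far: each sample is predicted
            # by majority vote over the trees that did not see it in training
            errs = [Counter(v).most_common(1)[0][0] != y[i]
                    for i, v in enumerate(votes) if v]
            oobError[b + 1] = float(np.mean(errs)) if errs else np.nan
        return oobError

    def predict(self, xFeat):
        xFeat = np.asarray(xFeat)
        # majority vote across all trees, each using its own column subset
        allPreds = np.array([t.predict(xFeat[:, c]) for t, c in self.trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in allPreds.T])
```

Accumulating the out-of-bag votes as each tree is added keeps the returned dictionary consistent with the "trees up to that point" definition above.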
(b) Find the best parameters on the wine quality training dataset based on classification
error. Justify your selection with a few plots or tables.
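One way to organize the search is a small grid over the forest's parameters, scored here by the final OOB error as a proxy for classification error (a held-out validation split would also work); the parameter ranges and the xTrain/yTrain names are illustrative assumptions.

```python
import itertools

# illustrative grid; widen or refine these ranges based on the results
grid = {
    "nest": [25, 50, 100],
    "maxFeat": [3, 5, 8],
    "maxDepth": [3, 5, 10],
    "minSamplesLeaf": [1, 5, 10],
}
results = []
for nest, maxFeat, maxDepth, minLeaf in itertools.product(*grid.values()):
    model = RandomForest(nest, maxFeat, "gini", maxDepth, minLeaf)
    oobError = model.train(xTrain, yTrain)  # assumed wine quality training data
    results.append((oobError[nest], nest, maxFeat, maxDepth, minLeaf))
results.sort()
print("best (OOB error, nest, maxFeat, maxDepth, minSamplesLeaf):", results[0])
```

Plotting the returned OOB-error dictionary against the number of trees for a few values of maxFeat or maxDepth is one convenient way to produce the justification plots or tables.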
(c) Using your optimal parameters, how well does your version of the random forest perform
on the test data? How does this compare to the estimated OOB error?