AMS3640 Data Mining
Tutorial 3: Classification
A. Loading Packages
library(tidyverse)
library(ggplot2)
install.packages("mlbench")
install.packages("e1071")
library(e1071)
install.packages("caret")
library(caret)
install.packages("rpart")
library(rpart)
First, we import the data "Zoo" from a package called "mlbench".
data(Zoo, package="mlbench")
Zoo_df<-Zoo
head(Zoo_df)
Zoo_tibble <- as_tibble(Zoo)
Zoo_tibble
Note: data.frames in R can have row names. The Zoo data set uses the animal names
as row names; as_tibble() drops them by default.
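As a small illustration of row names, the sketch below uses a made-up two-animal data frame and base R only. The name toy_rows and its values are hypothetical and are not taken from Zoo. It shows how to read the row names and how to copy them into an ordinary column so that they survive a conversion that drops row names.

```r
# Hypothetical toy data; the values are made up for illustration.
toy_rows <- data.frame(hair = c(TRUE, FALSE), legs = c(4, 2),
                       row.names = c("cat", "hen"))
rownames(toy_rows)            # the animal names are stored as row names

# Copy the row names into a regular column, then reset the row names.
toy_named <- cbind(animal = rownames(toy_rows), toy_rows)
rownames(toy_named) <- NULL
toy_named
```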
B. Data Preparations
We convert all TRUE/FALSE (logical) values into factors (nominal variables). Many
modeling functions require this. Always check summary() to make sure the data is
ready for model learning.
Zoo_df <- Zoo_df %>%
modify_if(is.logical, factor, levels = c(TRUE, FALSE)) %>%
modify_if(is.character, factor)
Zoo_df %>% summary()
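For readers who want to see what the conversion does without the pipe, here is a rough base R sketch of the same idea on a made-up data frame. The name toy_lgl and its values are hypothetical. Only logical columns are touched; numeric columns such as legs stay numeric.

```r
# Hypothetical toy data with one logical and one numeric column.
toy_lgl <- data.frame(hair = c(TRUE, FALSE, TRUE), legs = c(4, 2, 4))

# Find the logical columns and convert only those to factors,
# fixing the level order to TRUE, FALSE as in the tutorial code.
is_lgl <- vapply(toy_lgl, is.logical, logical(1))
toy_lgl[is_lgl] <- lapply(toy_lgl[is_lgl], factor, levels = c(TRUE, FALSE))

str(toy_lgl)
levels(toy_lgl$hair)
```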
C. Decision Trees
Recursive partitioning (rpart, similar to CART) uses the Gini index to choose
splits and uses early stopping (pre-pruning) to limit tree growth.
tree_default <- Zoo_df %>% rpart(type ~ ., data = .)
tree_default
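The Gini index mentioned above measures how mixed the class labels in a node are: one minus the sum of squared class proportions. The helper below is a minimal sketch written for illustration (the function name gini is our own, not part of rpart); rpart computes the split criterion internally.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

gini(c("bird", "bird", "bird", "bird"))      # pure node: impurity 0
gini(c("bird", "bird", "mammal", "mammal"))  # evenly mixed two-class node: 0.5
```

A split is attractive when the child nodes it produces have lower weighted impurity than the parent node.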
Notes: %>% supplies the data for rpart. Since data is not the first argument of
rpart, the syntax data = . specifies where the piped data (Zoo_df) goes. The
call is equivalent to tree_default <- rpart(type ~ ., data = Zoo_df).
The formula type ~ . models the type variable using all other features, which
are represented by the dot. The class variable must be a factor (nominal);
otherwise rpart will create a regression tree instead of a classification tree.
Use as.factor() if necessary.
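To check which kind of tree rpart has built, one option is to inspect the method component of the fitted object. The sketch below uses the built-in iris data rather than Zoo, purely as an assumption for a self-contained example; the numeric copy iris_num is hypothetical.

```r
library(rpart)

# Factor response: rpart fits a classification tree.
fit_class <- rpart(Species ~ ., data = iris)
fit_class$method

# Same data with the response recoded as numbers: rpart fits a regression tree.
iris_num <- transform(iris, Species = as.numeric(Species))
fit_reg <- rpart(Species ~ ., data = iris_num)
fit_reg$method
```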
install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(tree_default, extra = 2)
Note: extra = 2 prints, for each node (including the leaves), the number of
correctly classified training objects and the total number of training objects
falling into that node (correct/total).
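The correct/total counts for the leaves can also be reproduced by hand. The following is a sketch on the built-in iris data (an assumption made to keep the example self-contained); it relies on the where component of an rpart object, which records the leaf each training row ends up in.

```r
library(rpart)

iris_tree <- rpart(Species ~ ., data = iris)
iris_pred <- predict(iris_tree, iris, type = "class")

# For each training row, the row of iris_tree$frame (its leaf) it falls into.
leaf <- iris_tree$where

# Per leaf: how many training rows land there, and how many are classified correctly.
leaf_total   <- tapply(iris_pred, leaf, length)
leaf_correct <- tapply(iris_pred == iris$Species, leaf, sum)

data.frame(frame_row = names(leaf_total),
           correct   = as.vector(leaf_correct),
           total     = as.vector(leaf_total))
```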
D. Create a Full Tree
To create a full tree, we set the complexity parameter cp to 0 (split even if the
split does not improve the fit) and set minsplit, the minimum number of
observations a node must contain to be split, to its smallest useful value of 2
(see: ?rpart.control). Note: full trees overfit the training data!
tree_full <- Zoo_df %>%
rpart(type ~ ., data = ., control = rpart.control(minsplit = 2, cp = 0))
rpart.plot(tree_full, extra = 2)
tree_full
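To get a feel for how much larger a full tree is, the sketch below compares a default tree and a full tree on the built-in iris data (used here only as an assumption for a self-contained example). The helpers n_leaves and train_acc are our own names, not rpart functions.

```r
library(rpart)

fit_default <- rpart(Species ~ ., data = iris)
fit_full <- rpart(Species ~ ., data = iris,
                  control = rpart.control(minsplit = 2, cp = 0))

# Count leaf nodes: leaves are marked "<leaf>" in the frame of an rpart object.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Training accuracy: share of training rows predicted correctly.
train_acc <- function(fit) {
  mean(predict(fit, iris, type = "class") == iris$Species)
}

c(default = n_leaves(fit_default), full = n_leaves(fit_full))
c(default = train_acc(fit_default), full = train_acc(fit_full))
```

A higher training accuracy for the full tree is not evidence that it generalizes better; that can only be judged on data not used for training.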
Training error of the default tree (with pre-pruning)
predict(tree_default, Zoo_df)
pred <- predict(tree_default, Zoo_df, type="class")
head(pred)
confusion_table <- with(Zoo_df, table(type, pred))
confusion_table
correct <- confusion_table %>% diag() %>% sum()
correct
error <- confusion_table %>% sum() - correct
error
accuracy <- correct / (correct + error)
accuracy
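The arithmetic behind these steps can be followed on a tiny made-up example. The vectors toy_truth and toy_pred below are hypothetical; the diagonal of the confusion table holds the correctly classified objects.

```r
# Hypothetical true labels and predictions for six objects.
toy_truth <- factor(c("a", "a", "b", "b", "c", "c"))
toy_pred  <- factor(c("a", "b", "b", "b", "c", "a"), levels = levels(toy_truth))

toy_table <- table(toy_truth, toy_pred)
toy_table

toy_correct  <- sum(diag(toy_table))          # 1 + 2 + 1 = 4 correct
toy_accuracy <- toy_correct / sum(toy_table)  # 4 of 6 objects
toy_error    <- 1 - toy_accuracy              # training error rate
c(correct = toy_correct, accuracy = toy_accuracy, error_rate = toy_error)
```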
Get a confusion table with more statistics (using caret)
confusionMatrix(data = pred, reference = Zoo_df %>% pull(type))
E. Make Predictions for New Data
Make up my own animal: A lion with feathered wings
my_animal <- tibble(hair = TRUE, feathers = TRUE, eggs = FALSE, milk = TRUE,
airborne = TRUE, aquatic = FALSE, predator = TRUE, toothed = TRUE,
backbone = TRUE, breathes = TRUE, venomous = FALSE, fins = FALSE,
legs = 4, tail = TRUE, domestic = FALSE, catsize = FALSE, type = NA)
Fix columns to be factors like in the training set.
my_animal <- my_animal %>% modify_if(is.logical, factor, levels =
c(TRUE, FALSE))
my_animal
Make a prediction using the default tree
predict(tree_default , my_animal, type = "class")
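Besides the predicted class, predict() for rpart trees can also return class probabilities with type = "prob". The sketch below shows both on the built-in iris data with a made-up flower (iris_fit and new_flower are hypothetical names and values used only for illustration). The new data must contain the same predictor columns, with the same types, as the training data.

```r
library(rpart)

iris_fit <- rpart(Species ~ ., data = iris)

# A made-up observation with the same predictor columns as the training data.
new_flower <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                         Petal.Length = 4.5, Petal.Width = 1.4)

new_class <- predict(iris_fit, new_flower, type = "class")  # predicted label
new_prob  <- predict(iris_fit, new_flower, type = "prob")   # class proportions in the leaf
new_class
new_prob
```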
F. K-Nearest Neighbors
Note: kNN uses Euclidean distance, so the data should be standardized (scaled)
first. Here legs takes values between 0 and 8 while all other variables are
between 0 and 1, so without scaling legs would dominate the distance.
# Scale every column except column 17, the class variable type.
Zoo_scaled <- Zoo_tibble %>%
mutate_at(vars(-17), function(x) as.vector(scale(x)))
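The effect of scaling on Euclidean distance can be seen on a tiny made-up matrix. The values in toy_x are hypothetical; legs ranges widely while tail is 0 or 1, mirroring the situation described above.

```r
# Hypothetical animals: legs has a much wider range than the 0/1 tail column.
toy_x <- cbind(legs = c(0, 4, 8), tail = c(0, 1, 1))
rownames(toy_x) <- c("animal_1", "animal_2", "animal_3")

dist(toy_x)            # raw data: differences in legs dominate the distances

toy_x_scaled <- scale(toy_x)  # center each column and divide by its standard deviation
dist(toy_x_scaled)     # standardized: both columns contribute on a comparable scale
```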
knnFit <- Zoo_scaled %>% train(type ~ .,
method = "knn",
data = ., tuneLength = 5)
knnFit
knnFit$finalModel
The fitted model can be used directly with predict()
predict(knnFit, head(Zoo_scaled)) # Note: new data must be scaled in the same way
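To make the idea behind kNN concrete, here is a hand-written sketch of a nearest-neighbor vote in base R. It is not the implementation caret uses; knn_predict is a hypothetical helper, and the built-in iris data stands in for Zoo so the example is self-contained.

```r
# Minimal kNN sketch: Euclidean distance to all training rows, then a majority vote.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # distance from new_x to each row
  nearest <- order(d)[seq_len(k)]                 # indices of the k closest rows
  votes <- table(train_y[nearest])                # count class labels among them
  names(votes)[which.max(votes)]                  # ties go to the first level
}

iris_x <- scale(as.matrix(iris[, 1:4]))  # standardize the predictors first
iris_y <- iris$Species

knn_predict(iris_x, iris_y, iris_x[1, ], k = 5)
```

In practice the choice of k matters; train() above tries several values of k and keeps the one with the best resampled accuracy.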
G. Linear Support Vector Machines
svmFit <- Zoo_tibble %>% train(type ~.,
method = "svmLinear",
data = ., tuneLength = 5)
svmFit
svmFit$finalModel
The fitted model can be used directly with predict()
predict(svmFit, head(Zoo_tibble))