ISE 535 Data Mining Exam 2 Due on May 4 by 1 pm
1. Here we develop a classification model to predict whether a text message is spam, using the words in the
message. The file sms.csv contains SMS messages. Column type identifies each message as spam or non-spam
(called ham). Column text contains the message text. Use
library(tm) # VCorpus( ), tm_map( ), findFreqTerms( )
# read all as character columns
df0 <- read.csv("sms.csv", stringsAsFactors = FALSE)
str(df0)
to store this file in a dataframe with 2 character columns, then convert the first column to a factor. The
classification model will predict whether a message is spam by using the words in the message, ignoring their
order. Thus, we first clean the data and split each message into words, then build the model.
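A minimal sketch of the factor conversion mentioned above (the level names follow from the ham/spam coding of column type):

```r
# convert the type column to a factor so it can serve as the class label
df0$type <- factor(df0$type)
table(df0$type)   # check the ham/spam class counts
```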
a) (10 pts.) Text messages may contain words, spaces, numbers, and punctuation. To split the message into
individual words, noise characters need to be removed. The text data mining library tm is useful.
# build a corpus (a collection of messages suitable for text mining)
sms_corpus <- VCorpus(VectorSource(df0$text))
# examine it
as.character(sms_corpus[[1]])
lapply(sms_corpus[1:2], as.character)
# change all words to lowercase
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
as.character(sms_corpus_clean[[1]])
# remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
# remove stop words
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
# remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
# example of word stemming
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns"))
#
# replace words by stem words
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
# eliminate unneeded whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
# compare original with the final clean corpus
lapply(sms_corpus[1:3], as.character)
lapply(sms_corpus_clean[1:3], as.character)
b) (10 pts.) Convert the tm object sms_corpus_clean to a document-term matrix (DTM) as follows. How
many binary columns does the matrix have?
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
c) (10 pts.) Split the matrix into a train set (first 4169 rows) and a test set. Then simplify both sets by
keeping only the words that appear at least 5 times in the training data.
dim(sms_dtm)
# split into train and test sets
m = 4169
sms_dtm_train <- sms_dtm[1:m, ]
sms_dtm_test <- sms_dtm[(m+1):5559, ]
dim(sms_dtm_train)
sms_train_labels <- df0[1:m, ]$type
sms_test_labels <- df0[(m+1):5559, ]$type
# vector with words appearing at least 5 times
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
# show some of them
set.seed(2)
sample(sms_freq_words,12)
# DTMs with only the frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
# a function that converts word counts to Yes/No indicators
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
# apply convert_counts() to the columns of the train/test sets
sms_train <- apply(sms_dtm_freq_train, 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, 2, convert_counts)
dim(sms_test)
d) (20 pts.) Use the train set to build a Naive Bayes model. Use it to predict the test set with threshold equal
to 0.50. Report the TPR and FPR.
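One way part d) could be sketched, assuming the e1071 package and treating spam as the positive class (an outline, not the graded solution):

```r
library(e1071)   # naiveBayes()

# train on the Yes/No word-indicator matrix from part c)
sms_nb <- naiveBayes(sms_train, factor(sms_train_labels))

# posterior probabilities, then the 0.50 threshold on P(spam)
probs <- predict(sms_nb, sms_test, type = "raw")
pred  <- ifelse(probs[, "spam"] > 0.50, "spam", "ham")

# confusion matrix with spam as the positive class
tab <- table(Predicted = pred, Actual = sms_test_labels)
TPR <- tab["spam", "spam"] / sum(tab[, "spam"])   # TP / (TP + FN)
FPR <- tab["spam", "ham"]  / sum(tab[, "ham"])    # FP / (FP + TN)
c(TPR = TPR, FPR = FPR)
```

The column name "spam" in probs assumes the factor levels are c("ham", "spam"), which is the default alphabetical ordering.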
e) (20 pts.) Change the threshold to improve the positive-class accuracy (TPR) on the test set. Report the
improved TPR and FPR.
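For part e), one approach is to sweep the threshold on P(spam) and compare the resulting TPR/FPR pairs (again a sketch assuming e1071; it retrains the same model so the snippet is self-contained):

```r
library(e1071)

sms_nb <- naiveBayes(sms_train, factor(sms_train_labels))
probs  <- predict(sms_nb, sms_test, type = "raw")

# evaluate TPR/FPR over a grid of candidate thresholds
for (t in c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  pred <- ifelse(probs[, "spam"] > t, "spam", "ham")
  tab  <- table(factor(pred, levels = c("ham", "spam")),
                factor(sms_test_labels, levels = c("ham", "spam")))
  cat("threshold", t,
      "TPR", round(tab["spam", "spam"] / sum(tab[, "spam"]), 3),
      "FPR", round(tab["spam", "ham"]  / sum(tab[, "ham"]), 3), "\n")
}
```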
2. (30 pts.) The following US map shows the number of coronavirus cases per county as of April 21, 2020. The
file usmap.csv has the relevant data (the size of each circle shows the number of cases in that county). Use
the following to help you reproduce the map (continental US only, you may ignore Alaska) as closely as possible
us <- c(left = -125, bottom = 25.75, right = -67, top = 49)
US.map = get_stamenmap(us, zoom = 5, maptype = "toner-lite")
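A sketch of how the map could be assembled with ggmap (the column names lon, lat, and cases in usmap.csv are assumptions; adjust them to the actual file):

```r
library(ggmap)   # get_stamenmap(), ggmap(); loads ggplot2

usmap <- read.csv("usmap.csv")

us <- c(left = -125, bottom = 25.75, right = -67, top = 49)
US.map <- get_stamenmap(us, zoom = 5, maptype = "toner-lite")

# overlay one circle per county, sized by case count
ggmap(US.map) +
  geom_point(data = usmap,
             aes(x = lon, y = lat, size = cases),
             color = "red", alpha = 0.3) +
  labs(size = "Cases")
```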
Submit your report (code and output) as a pdf file to Blackboard (no screen captures). Read your pdf file before
submitting. One submission per student.