Things to Keep in Mind:

I. If you run into a syntax error while preparing this assignment, do not panic. Show the code you ran, and show the error you encountered. Explain the purpose of the step, and state what you were trying to do. If the error is preventing you from running any follow-on steps, again, focus on the explanation -- show that you understand the purpose of the step, rather than just giving up.

II. Use your resources. Whether it’s the consultation sessions in Zoom, the Zoom recitations, e-mail, the video library, your classmates, or the web, there are many places to look for help or to clarify any questions that you may have. As the AD699 slogan says, “Get After It!”

To submit this assignment, you will upload two files into Blackboard. One file will be the R script that you used, and the other will be your write-up, submitted in the form of a PDF. Your PDF should clearly demonstrate your code and your results for all steps. For any part of the prompt that asks you a question, or asks you to describe something, you should include a written answer in your write-up. You may use any reporting format that clearly demonstrates your code, results, and interpretation statements. If you do not already use R Markdown or R Notebooks, you may wish to explore these options.

Main Topic: Classification

Tasks:

● K-Nearest Neighbors: The model that we’ll build will aim to predict whether a college will have a high graduation rate. To answer this question, we will use the College dataset from the ISLR package in R. A description of this dataset can be found on our class Blackboard page, in the same folder where you found this assignment prompt.

1. Bring this dataset into your R environment. Once you have brought the ISLR package into your environment, you can do this with:
   > data(College)
2. We are going to build a classification model with Grad.Rate as our response variable. Call the str() function on your dataset and show the results.
   a. What type of variable is Grad.Rate?
   b. If Grad.Rate is not currently a factor, convert it into a factor by binning it. Use the median to create two levels for this factor -- any records at or above the median should be labeled “High Rate” and any records below the median should be labeled “Low Rate.”
3. Are there any NAs in this dataset? Show the code that you used to find this out. If there are any NA values in any particular column, replace them with the median value for that column.
4. Creating two new features:
   a. Create a new variable called ‘selective.’ Selective should be found by taking Accept divided by Apps (Accept/Apps).
   b. Create another new variable called ‘yield.’ Yield should be found by taking Enroll divided by Accept (Enroll/Accept).
5. Using your assigned seed value (from Assignment 2), partition your entire dataset into training (60%) and validation (40%) sets.
6. Make up a fake college (yes, really!)
   a. Give your college a name (there’s no R code needed here, and you won’t use the name when you run k-nn, but give the school a name anyway, and just write it here).
   b. Use the runif() function to give your college values for each of these numeric predictor attributes: Expend, S.F.Ratio, perc.alumni, selective, and yield. Use the min and max values from your training set as the lower and upper boundaries for runif().
7. Normalize your data using the preProcess() function from the caret package. Use Table 7.2 from the book as a guide for this.
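
The mechanics of steps 2b, 4, 5, 6b, and 7 can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the toy values, the seed of 1, and the reduced predictor list are placeholders, and base R's scale() is swapped in for caret's preProcess() so that the sketch runs without extra packages. The key idea it shows is that the centering and scaling statistics come from the training set and are then applied to every other set, including the made-up college.

```r
# Toy stand-in for the College data; all values here are made up.
set.seed(1)  # placeholder seed, not an assigned seed value
n <- 50
toy <- data.frame(
  Apps      = sample(500:5000, n, replace = TRUE),
  Expend    = runif(n, 4000, 20000),
  Grad.Rate = sample(30:100, n, replace = TRUE)
)
toy$Accept <- round(toy$Apps * runif(n, 0.3, 0.95))
toy$Enroll <- round(toy$Accept * runif(n, 0.2, 0.7))

# Step 2b: bin the response at the median (at or above -> "High Rate").
cutoff <- median(toy$Grad.Rate)
toy$Grad.Rate <- factor(ifelse(toy$Grad.Rate >= cutoff, "High Rate", "Low Rate"))

# Step 4: derived features.
toy$selective <- toy$Accept / toy$Apps
toy$yield     <- toy$Enroll / toy$Accept

# Step 5: 60/40 partition by sampling row indices.
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Step 6b: one random value per predictor, bounded by the training range.
# (Only three predictors are used here to keep the sketch short.)
predictors <- c("Expend", "selective", "yield")
fake <- as.data.frame(lapply(train[predictors], function(x) runif(1, min(x), max(x))))

# Step 7, base-R stand-in for preProcess(): center and scale with
# statistics taken from the training set only, then reuse them everywhere.
mu  <- sapply(train[predictors], mean)
sdv <- sapply(train[predictors], sd)
train_norm <- scale(train[predictors], center = mu, scale = sdv)
valid_norm <- scale(valid[predictors], center = mu, scale = sdv)
fake_norm  <- scale(fake[predictors],  center = mu, scale = sdv)
```

With caret, the same pattern is to fit preProcess() on the training predictors and then call predict() with that object on the training set, the validation set, and the made-up college.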
8. Using the knn() function from the FNN package, and using a k-value of 7, generate a predicted classification for your college. For your input variables, use Expend, S.F.Ratio, perc.alumni, selective, and yield. What outcome category was it predicted to belong to? Also, who were your college’s 7 nearest neighbors? How many of them were High Rate, and how many were Low Rate?
9. Use your validation set to help you determine an optimal k-value. Use Table 7.3 from the textbook as a guide here.
10. Using either the base graphics package or ggplot, make a scatterplot with the various k values that you used in step 9 on your x-axis, and the accuracy metrics on the y-axis.
11. Re-run your knn() function with the optimal k-value that you found previously. What result did you obtain? Was it different from the result you saw when you first ran the k-nn function? Also, what were the outcome classes for each of your college’s k-nearest neighbors?

● Naive Bayes: Again in this section, we will be performing classification. Using the appointments.csv dataset, our outcome variable will be No.Show. Our dataset comes from Brazil, and it contains information about medical patients. A dataset description can be found on Blackboard.

1. After downloading the file from Blackboard, bring appointments into your R environment.
2. Data preparation.
   a. Run the str() function to check the data type for the variables in this dataframe.
   b. We will not use the variables PatientID or AppointmentID in our analysis. Remove them.
   c. Age is not a factor, but we can turn it into a factor. Bin the ages in any way that creates groups that contain roughly similar numbers of records.
   d. ScheduledDay and AppointmentDay need to be formatted in a way that will make them useful for our model. First, be sure that these variables are seen as dates in R. Then, once they’ve been converted to dates, make three new categorical variables:
      i. DateGap: This will be the difference between AppointmentDay and ScheduledDay.
         1. Once you have created this variable, you will need to bin it into a factor. Do this in a way that creates groups of relatively similar sizes.
      ii. WeekDayAppoint: This will be the day of the week for the appointment.
      iii. ScheduleAppoint: This will be the day of the week on which the appointment is scheduled.
   e. Once you have completed the previous step, remove the ScheduledDay and AppointmentDay variables from the dataset.
   f. Filter the dataset so that only the records from the 10 most common neighborhoods remain.
3. Preparatory data analysis
   a. Let’s take a look at a few variables from the dataset, and explore the way that these might impact NoShow. Choose any three predictor variables from the dataset. For the three that you chose, make a barplot for each one. Each barplot should show one of your chosen categories on the x-axis, with NoShow as the fill variable. You should build proportional barplots (you can achieve this by adding position = "fill" inside your geom layer). You should generate three separate barplots for this step.
   b. Based on the barplots that you see here, are there any generalizations that you can make about these variables’ relationship with NoShow? Do some variables look like they’ll have more predictive power than others?
4. Using your seed value (the same one from Assignment #2), partition your data into training (60%) and validation (40%) sets.
5. Build a naive Bayes model, with the response variable NoShow. Use all of the other variables in your training set as inputs.
6. Show a confusion matrix that compares the performance of your model against the training data, and another that shows its performance against the validation data (just use the accuracy metric for this analysis). How did your training set’s performance compare with your validation set’s performance?
7. If you had used the naive rule as an approach to classification, how would you have classified all the records in your training set? (Note: Although their names are very similar, the naive rule for classification is very different from a naive Bayes approach to classification).
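
To make the distinction in step 7 concrete, here is a minimal base-R sketch of the naive rule together with the accuracy calculation from step 6. The "Yes"/"No" labels and the two small vectors are invented for illustration and are not drawn from the appointments data. The naive rule ignores every predictor and assigns all records to the most common class in the training set, so its accuracy is a baseline that any real model should beat.

```r
# Made-up outcome labels standing in for the training and validation responses.
train_actual <- factor(c("No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes"))
valid_actual <- factor(c("No", "Yes", "No", "No", "Yes"), levels = levels(train_actual))

# Naive rule: ignore every predictor and predict the most common training class.
majority <- names(which.max(table(train_actual)))
naive_train_pred <- factor(rep(majority, length(train_actual)), levels = levels(train_actual))
naive_valid_pred <- factor(rep(majority, length(valid_actual)), levels = levels(train_actual))

# Confusion matrix: rows are predictions, columns are actual classes.
cm_train <- table(Predicted = naive_train_pred, Actual = train_actual)
cm_valid <- table(Predicted = naive_valid_pred, Actual = valid_actual)

# Accuracy is the share of records on the diagonal of the confusion matrix.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
accuracy(cm_train)  # 7 of the 10 toy training labels are "No"
accuracy(cm_valid)  # 3 of the 5 toy validation labels are "No"
```

The same accuracy() helper works on a confusion table built from a fitted model's predictions, which makes the comparison between training and validation performance a matter of two function calls.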
8. Create a lift chart for your model. Show the lift chart, and explain its meaning in 2-3 sentences.
9. It’s time to make up a fake medical patient!
   a. What is your person’s name? (There is no R code required for this step -- you can just make up a name, or use your own).
   b. For each of the predictor variables in the dataset, determine a category that your person will belong to. You don’t need to use any randomization functions here -- you can just assign any category value to your person for each category. Create a new dataframe for your person that includes his/her category values.
   c. Use the predict() function in R to predict whether your person will be a No Show. What outcome did your model predict?
   d. Use the predict() function in R in a slightly different way to determine the probability that your person will be a No Show. What probability did it assign to your person?
   e. Now, determine why you saw those numbers. For this step, you should use R, but do not use any functions from any packages. Instead, use the information that you see in your model’s results in order to generate an attend score (noshow will be 0) and a NoShow score (noshow will be a 1). Use your knowledge of the naive Bayes calculation process to demonstrate how the naive Bayes algorithm generated the probability prediction that you saw in a previous step.
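
The calculation described in step 9e can be sketched in base R on a tiny made-up table. Everything in this example is an assumption for illustration: the column names AgeGroup and SMS, their levels, and the eight records are invented, and no smoothing is applied. Each class score is the class prior multiplied by the conditional probability of each of the person's category values given that class, and the two scores are then normalized to give a probability.

```r
# Made-up training records; column names and levels are illustrative only.
toy <- data.frame(
  AgeGroup = c("young", "young", "old", "old", "young", "old", "young", "old"),
  SMS      = c("yes", "no", "yes", "no", "no", "no", "yes", "yes"),
  NoShow   = c(1, 1, 0, 0, 1, 0, 0, 0)
)

# The made-up person to score.
person <- list(AgeGroup = "young", SMS = "no")

# Prior probability of each class (0 = attended, 1 = no-show).
prior <- prop.table(table(toy$NoShow))

# P(predictor value | class): row-wise proportions of each cross-tabulation.
cond_age <- prop.table(table(toy$NoShow, toy$AgeGroup), margin = 1)
cond_sms <- prop.table(table(toy$NoShow, toy$SMS), margin = 1)

# Score = prior * product of the conditional probabilities for this person.
attend_score <- prior[["0"]] * cond_age["0", person$AgeGroup] * cond_sms["0", person$SMS]
noshow_score <- prior[["1"]] * cond_age["1", person$AgeGroup] * cond_sms["1", person$SMS]

# Normalizing the two scores gives the posterior probability of a no-show.
p_noshow <- noshow_score / (attend_score + noshow_score)
p_noshow
```

In this toy table the attend score is (5/8)(1/5)(2/5) = 0.05 and the NoShow score is (3/8)(1)(2/3) = 0.25, so the normalized no-show probability is 0.25 / 0.30, or about 0.83. With a real fitted model, the priors and conditional probabilities would be read from the model's printed output rather than recomputed from the data.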