AD699: Data Mining for Business Analytics
Things to Keep in Mind:

I. If you run into a syntax error while preparing this assignment, do not panic. Show the code you ran, and show the error you encountered. Explain the purpose of the step, and state what you were trying to do. If the error is preventing you from running any follow-on steps, again, focus on the explanation -- show that you understand the purpose of the step, rather than just giving up.

II. Use your resources. Whether it’s the consultation sessions in Zoom, the Zoom recitations, e-mail, the video library, your classmates, or the web, there are many places to look for help or to clarify any questions that you may have. As the AD699 slogan says, “Get After It!”

To submit this assignment, you will upload two files into Blackboard. One file will be the R script that you used, and the other will be your write-up, submitted in the form of a PDF. Your PDF should clearly demonstrate your code and your results for all steps. For any part of the prompt that asks you a question, or asks you to describe something, you should include a written answer in your write-up. You may use any reporting format that clearly demonstrates your code, results, and interpretation statements. If you do not already use R Markdown or R Notebooks, you may wish to explore these options.

Main Topic: Classification

Tasks:

● K-Nearest Neighbors: The model that we’ll build will aim to predict whether a college will have a high graduation rate. To answer this question, we will use the College dataset from the ISLR package in R. A description of this dataset can be found on our class Blackboard page, in the same folder where you found this assignment prompt.

1. Bring this dataset into your R environment. Once you have brought the ISLR package into your environment, you can do this with:
   > data(College)
2. We are going to build a classification model with Grad.Rate as our response variable. Call the str() function on your dataset and show the results.
   a. What type of variable is Grad.Rate?
   b. If Grad.Rate is not currently a factor, convert it into a factor by binning it. Use the median to create two levels for this factor -- any records at or above the median should be labeled “High Rate” and any records below the median should be labeled “Low Rate.”
3. Are there any NAs in this dataset? Show the code that you used to find this out. If there are any NA values in any particular column, replace them with the median value for that column.
4. Creating two new features:
   a. Create a new variable called ‘selective.’ Selective should be found by taking Accept divided by Apps. (Accept/Apps)
   b. Create another new variable called ‘yield.’ Yield should be found by taking Enroll divided by Accept. (Enroll/Accept)
5. Using your assigned seed value (from Assignment 2), partition your entire dataset into training (60%) and validation (40%) sets.
6. Make up a fake college (yes, really!)
   a. Give your college a name (there’s no R code needed here, and you won’t use the name when you run k-nn...but give the school a name anyway, and just write it here).
   b. Use the runif() function to give your college values for each of these numeric predictor attributes: Expend, S.F.Ratio, perc.alumni, selective, and yield. Use the min and max values from your training set as the lower and upper boundaries for runif().
7. Normalize your data using the preProcess() function from the caret package. Use Table 7.2 from the book as a guide for this.
8. Using the knn() function from the FNN package, and using a k-value of 7, generate a predicted classification for your college. For your input variables, use Expend, S.F.Ratio, perc.alumni, selective, and yield. What outcome category was it predicted to belong to? Also, who were your college’s 7 nearest neighbors? How many of them were High Rate, and how many were Low Rate?
9. Use your validation set to help you determine an optimal k-value. Use Table 7.3 from the textbook as a guide here.
10. Using either the base graphics package or ggplot, make a scatterplot with the various k values that you used in step 9 on your x-axis, and the accuracy metrics on the y-axis.
11. Re-run your knn() function with the optimal k-value that you found previously. What result did you obtain? Was it different from the result you saw when you first ran the k-nn function? Also, what were the outcome classes for each of your college’s k-nearest neighbors?

● Naive Bayes: Again in this section, we will be performing classification. Using the appointments.csv dataset, our outcome variable will be No.Show. Our dataset comes from Brazil, and it contains information about medical patients. A dataset description can be found on Blackboard.

1. After downloading the file from Blackboard, bring appointments into your R environment.
2. Data preparation.
   a. Run the str() function to check the data type for the variables in this dataframe.
   b. We will not use the variables PatientID or AppointmentID in our analysis. Remove them.
   c. Age is not a factor, but we can turn it into a factor. Bin the ages in any way that creates groups that contain mostly similar numbers of records.
   d. ScheduledDay and AppointmentDay need to be formatted in a way that will make them useful for our model. To make them more useful, first be sure that these variables are seen as dates in R. Then, once they’ve been converted to dates, make three new categorical variables:
      i. DateGap: This will be the difference between AppointmentDay and ScheduledDay.
         1. Once you have created this variable, you will need to bin it into a factor. Do this in a way that creates groups of relatively similar sizes.
      ii. WeekDayAppoint: This will be the day of the week for the appointment.
      iii. ScheduleAppoint: This will be the day of the week on which the appointment is scheduled.
   e. Once you have completed the previous step, remove the ScheduledDay and AppointmentDay variables from the dataset.
   f. Filter the dataset so that only the records from the 10 most common neighborhoods remain.
3. Preparatory data analysis
   a. Let’s take a look at a few variables from the dataset, and explore the way that these might impact NoShow. Choose any three predictor variables from the dataset. For the three that you chose, make a barplot for each one. Each barplot should show one of your chosen categories on the x-axis, with NoShow as the fill variable. You should build proportional barplots (you can achieve this by adding position="fill" inside your geom layer). You should generate three separate barplots for this step.
   b. Based on the barplots that you see here, are there any generalizations that you can make about these variables’ relationship with NoShow? Do some variables look like they’ll have more predictive power than others?
4. Using your seed value (the same one from Assignment #2), partition your data into training (60%) and validation (40%) sets.
5. Build a naive Bayes model, with the response variable NoShow. Use all of the other variables in your training set as inputs.
6. Show a confusion matrix that compares the performance of your model against the training data, and another that shows its performance against the validation data (just use the accuracy metric for this analysis). How did your training set’s performance compare with your validation set’s performance?
7. If you had used the naive rule as an approach to classification, how would you have classified all the records in your training set? (Note: Although their names are very similar, the naive rule for classification is very different from a naive Bayes approach to classification).
8. Create a lift chart for your model. Show the lift chart, and explain its meaning in 2-3 sentences.
9. It’s time to make up a fake medical patient!
   a. What is your person’s name? (There is no R code required for this step -- you can just make up a name, or use your own).
   b. For each of the predictor variables in the dataset, determine a category that your person will belong to. You don’t need to use any randomization functions here -- you can just assign any category value to your person for each category. Create a new dataframe for your person that includes his/her category values.
   c. Use the predict() function in R to predict whether your person will be a No Show. What outcome did your model predict?
   d. Use the predict() function in R in a slightly different way to determine the probability that your person will be a No Show. What probability did it assign to your person?
   e. Now, determine why you saw those numbers. For this step, you should use R, but do not use any functions from any packages. Instead, use the information that you see in your model’s results in order to generate an attend score (NoShow will be 0) and a NoShow score (NoShow will be 1). Use your knowledge of the naive Bayes calculation process to demonstrate how the naive Bayes algorithm generated the probability prediction that you saw in a previous step.
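
As a rough illustration of the hand calculation described in the final step, the sketch below works through the naive Bayes arithmetic in base R on a tiny made-up dataset. Everything in it is hypothetical and chosen only for illustration: the data frame `toy`, the predictors `AgeGroup` and `SMS`, the `patient` list, and the helper `class_score()`. It is not the appointments data, and it assumes no Laplace smoothing, so a fitted model configured differently may report different conditional probabilities.

```r
# Hypothetical toy data: two categorical predictors and a binary outcome.
# These values are invented for illustration; they are not from appointments.csv.
toy <- data.frame(
  AgeGroup = c("young", "young", "old", "old", "young", "old", "old", "young"),
  SMS      = c("yes", "no", "no", "yes", "yes", "no", "yes", "no"),
  NoShow   = c(1, 0, 0, 0, 1, 0, 1, 0)
)

# The made-up patient whose outcome we want to score (hypothetical values).
patient <- list(AgeGroup = "young", SMS = "yes")

# Prior probability of each class: P(NoShow = 0) and P(NoShow = 1).
priors <- prop.table(table(toy$NoShow))

# Score for one class = prior * product of P(predictor value | class).
class_score <- function(cls) {
  rows <- toy[toy$NoShow == cls, ]
  # Share of records in this class that match the patient on each predictor.
  cond <- sapply(names(patient), function(v) mean(rows[[v]] == patient[[v]]))
  unname(priors[as.character(cls)]) * prod(cond)
}

attend_score <- class_score(0)  # NoShow = 0
noshow_score <- class_score(1)  # NoShow = 1

# Normalising the two scores gives the posterior probability of a no-show.
p_noshow <- noshow_score / (attend_score + noshow_score)
print(c(attend = attend_score, noshow = noshow_score, p_noshow = p_noshow))
```

With this toy data the attend score is 5/8 × 2/5 × 1/5 = 0.05 and the NoShow score is 3/8 × 2/3 × 1 = 0.25, so the normalised no-show probability is 0.25 / 0.30, or about 0.83. For the real assignment, the priors and conditional probabilities would be read from the fitted model's output rather than recomputed from a toy table.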