DA5020 – Practicum III
In this practicum you will use the k-nearest neighbor algorithm to predict a continuous variable. Each
question in the practicum follows the CRISP-DM framework. The practicum was designed in this manner
to help you to practice and conceptualize each phase, based on the requirements of an actual project.
This is a group practicum which means that you may choose to work in groups of up to three students.
You may fully collaborate and submit the same work. However, you must include all students' names
on all submitted work. If a group member is not adequately contributing, the remaining team members
may "vote to eject" the student from the team by emailing me the reason. In such an event, the team
member who was "fired" must still complete the project individually by the due date.
If you are working in groups, you can self-signup in Canvas or notify me via email by April 22, 2021 and
I will create the group for you. Ensure that you include your name and the name(s) of your group
member(s) in the email and cc them.
——————————————————————————————————————————
Practicum Tasks
CRISP-DM: Business Understanding
The NYC Taxi and Limousine Commission (TLC) publishes a dataset on yellow and green taxi trip
records which include: pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances,
itemized fares, rate types, payment types, and driver-reported passenger counts. For more information on
the dataset, visit the following website and view the accompanying data dictionary for additional
information.
Description of the Problem
You are hired as a Machine Learning Engineer, on the Data Insights and Analytics Team, for the NYC
Taxi and Limousine Commission (TLC). Your first assignment is to analyze the trip data from the Green
Taxis; more specifically, you need to evaluate where passengers use these cabs and how frequently.
However, your main objective is to evaluate the factors that contribute toward cab drivers being
incentivized (i.e. what determines whether or not they receive a tip). This will enable you to build a model
that can be used to predict the tip amount for future trips.
In this use-case, you will conduct your analysis using the NYC Green Taxi Trip Records for February
2020 and build a k-nn regression model to predict the tip amount. You are free to use any libraries to
support your analysis.
Question 1 — (20 points) +10 optional points
CRISP-DM: Data Understanding
• Load the NYC Green Taxi Trip Records data directly from the URL into a data frame or tibble.
• Data exploration: explore the data to identify any patterns and analyze the relationships between the
features and the target variable i.e. tip amount. At a minimum, you should analyze: 1) the distribution,
2) the correlations 3) missing values and 4) outliers — provide supporting visualizations and explain
all your steps.
Tip: remember that you have worked with this dataset in your previous assignments. You are free to
reuse any code that support your analysis.
• Feature selection: identify the features/variables that are good indicators and should be used to predict
the tip amount. Note: this step involves selecting a subset of the features that will be used to build the
predictive model. If you decide to omit any features/variables ensure that you briefly state the reason.
• Feature engineering: (+10 bonus points): create a new feature and analyze its effect on the target
variable (e.g. the tip amount). Ensure that you calculate the correlation coefficient and also use
visualizations to support your analysis. Summarize your findings and determine if the new feature is a
good indicator to predict the tip amount. If it is, ensure that you include it in your model. If it is not a
good indicator, explain the reason.
NOTE: If you attempt this bonus question, ensure that you create a meaningful feature (and nothing
arbitrary). If you are unable to think about something meaningful, do not become fixated on this. There
is another bonus question that you can attempt later in the practicum.
Question 2 — (20 points)
CRISP-DM: Data Preparation
• Prepare the data for the modeling phase and handle any issues that were identified during the
exploratory data analysis. At a minimum, ensure that you:
• Preprocess the data: handle missing data and outliers, perform any suitable data transformation
steps, etc. Also, ensure that you filter the data. The goal is to predict the tip amount, therefore you
need to ensure that you extract the data that contains this information. Hint: read the data dictionary.
• Normalize the data: perform either max-min normalization or z-score standardization on the
continuous variables/features.
• Encode the data: determine if there are any categorical variables that need to be encoded and
perform the encoding.
• Prepare the data for modeling: shuffle the data and split it into training and test sets. The percent
split between the training and test set is your decision. However, clearly indicate the reason.
Question 3 — (30 points)
CRISP-DM: Modeling
• In this step you will develop the k-nn regression model. Create a function with the following name and
arguments: knn.predict(data_train, data_test, k);
• data_train represents the observations in the training set,
• data_test represents the observations from the test set, and
• k is the selected value of k (i.e. the number of neighbors).
Perform the following logic inside the function:
• Implement the k-nn algorithm and use it to predict the tip amount for each observation in the test
set i.e. data_test.
• Note: You are not required to implement the k-nn algorithm from scratch. Therefore, this step
may only involve providing the training set, the test set, and the value of k to your chosen k-nn
library.
• Calculate the mean squared error (MSE) between the predictions from the k-nn model and the
actual tip amount in the test set.
• The knn-predict() function should return the MSE.
Question 4 — (30 points)
CRISP-DM: Evaluation
• Determine the best value of k and visualize the MSE. This step requires selecting different values of k
and evaluating which produced the lowest MSE. At a minimum, ensure that you perform the following:
• Provide at least 20 different values of k to the knn.predict() function (along with the training set
and the test set).
Tip: use a loop! Use a loop to call knn.predict() 20 times and in each iteration of the loop, provide
a different value of k to knn.predict(). Ensure that you save the MSE that’s returned.
• Create a line chart and plot each value of k on the x-axis and the corresponding MSE on the y-axis.
Explain the chart and determine which value of k is more suitable and why.
• What are your thoughts on the model that you developed and the accuracy of its predictions? Would
you advocate for its use to predict the tip amount of future trips? Explain your answer.
Question 5 — (10 optional/bonus points)
In this optional (bonus) question, you can: 1) use your intuition to create a compelling visualization that
tells an informative story about one aspect of the dataset OR 2) optimize the k-nn model and evaluate the
effect of the percentage split, between the training and test set, on the MSE. Choose ONE of the
following:
• Create a compelling visualization that tells an informative story about how these cabs are used.
OR
• Evaluate the effect of the percentage split for the training and test sets and determine if a different split
ratio improves your model’s ability to make better predictions.
Ensure that you perform the steps of the bonus question in a new R chunk!
Note: all charts that are displayed should have the following:
• An informative title (and subtitle if applicable)
• Labels on the x-axis and y-axis that indicate the units of measurement.
• A caption that indicates the purpose of the chart.
Useful Resources
• NYC Green Taxi Trip Records - February 2020 and Green Trips Data Dictionary
• Visit the following webpage to obtain more information on the dataset: NYC TLC Trip Record Data
• Normalizing Data with R
• kNN Algorithm using R
• KNN Algorithm: A Practical Implementation Of KNN Algorithm In R
• k-Nearest Neighbor: An Introductory Example
• Quick Guide to Creating Scatterplots in R with ggplot
Submission Details
• This practicum contains bonus points that can contribute to your practicum average.
• Your submission must contain two files: the .Rmd file and a knitted PDF or HTML (from
the .rmd). Name your .Rmd file, DA5020.P3.FirstName.LastName.Rmd and your PDF/HTML
DA5020.P3.FirstName.LastName.{pdf,html}, where FirstName.LastName is your first and last name.
• The .Rmd file must be fully commented and properly "chunked" R code and detailed explanations.
Make sure that it is easy to recognize which question you answer and that your code runs from
beginning to end (because that is how we will test it). Code that doesn't execute, stops, throws errors
will receive no points. If the TAs have to "debug" your code or spend any effort getting it to run,
substantial points will be deducted.
• Not submitting a knitted PDF or HTML will result in reduction of 30 points.
• Not submitting the .Rmd file (or both) will result in a score of 0.