MET CS 777 - Big Data Analytics
Spark Data Wrangling
GitHub Classroom Invitation Link
The goal of this assignment is to implement a set of Spark programs in python (using Apache Spark).
Specifically, your Spark jobs will analyzing a data set consisting of New York City Taxi trip reports in the
Year 2013. The dataset was released under the FOIL (The Freedom of Information Law) and made public
by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/).
2 Taxi Data Set
The data set itself is a simple text file. Each taxi trip report is a different line in the file. Among other things,
each trip report includes the starting point, the drop-off point, corresponding timestamps, and information
related to the payment. The data are reported by the time that the trip ended, i.e., upon arrive in the order of
the drop-off timestamps. The attributes present on each line of the file are, in order:
0 medallion an md5sum of the identifier of the taxi - vehicle bound (Taxi ID)
1 hack license an md5sum of the identifier for the taxi license (Driver ID)
2 pickup datetime time when the passenger(s) were picked up
3 dropoff datetime time when the passenger(s) were dropped off
4 trip time in secs duration of the trip
5 trip distance trip distance in miles
6 pickup longitude longitude coordinate of the pickup location
7 pickup latitude latitude coordinate of the pickup location
8 dropoff longitude longitude coordinate of the drop-off location
9 dropoff latitude latitude coordinate of the drop-off location
10 payment type the payment method -credit card or cash
11 fare amount fare amount in dollars
12 surcharge surcharge in dollars
13 mta tax tax in dollars
14 tip amount tip in dollars
15 tolls amount bridge and tunnel tolls in dollars
16 total amount total paid amount in dollars
Table 1: Taxi Data Set fields
The data files are in comma separated values (CSV) format. Example lines from the file are:
You can use the following PySpark Code to cleanup the data and get the required field.
1 l i n e s = sc . t e x t F i l e ( s y s . a rgv [ 1 ] )
2 t a x i l i n e s = l i n e s . map ( lambda x : x . s p l i t ( ’ , ’ ) )
4 # E x c e p t i o n Hand l ing and removing wrong d a t a l i n e s
5 d e f i s f l o a t ( v a l u e ) :
6 t r y :
7 f l o a t ( v a l u e )
8 r e t u r n True
9 e x c e p t :
10 r e t u r n F a l s e
12 # For example , remove l i n e s i f t h e y don ’ t have 16 v a l u e s and . . .
13 d e f c o r r e c t R o w s ( p ) :
14 i f ( l e n ( p ) == 17) :
15 i f ( i s f l o a t ( p [ 5 ] ) and i s f l o a t ( p [ 1 1 ] ) ) :
16 i f ( f l o a t ( p [ 5 ] ) !=0 and f l o a t ( p [ 1 1 ] ) ! = 0 ) :
17 r e t u r n p
19 # c l e a n i n g up d a t a
20 t e x i l i n e s C o r r e c t e d = t a x i l i n e s . f i l t e r ( c o r r e c t R o w s )
You can also pre-process the data and store it in your own cluster storage.
3 Obtaining the Dataset
Small data set. (93 MB compressed, uncompressed 384 MB) for implementation and testing purposes
(roughly 2 million taxi trips). This is available at Amazon S3:
You can download or access the data sets using the following internal URLs:
Small Data Set gs://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set gs://metcs777/taxi-data-sorted-large.csv.bz2
Table 2: Data set on Google Cloud Storage - URLs
Small Data Set s3://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set s3://metcs777/taxi-data-sorted-large.csv.bz2
Table 3: Data set on Amazon AWS - URLs
4 Assignment Tasks
4.1 Task 1 : Top-10 Active Taxis (5 points)
Many different taxis have had multiple drivers. Write and execute a Spark Python program that computes
the top ten taxis that have had the largest number of drivers. Your output should be a set of (medallion,
number of drivers) pairs.
Note: You should consider that this is a real world data set that might include wrongly formatted data
lines. You should clean up the data before the main processing, a line might not include all of the fields. If
a data line is not correctly formatted, you should drop that line and do not consider it.
4.2 Task 2 - Top-10 Best Drivers (7 Points)
We would like to figure out who the top 10 best drivers are in terms of their average earned money per
minute spent carrying a customer. The total amount field is the total money earned on a trip. In the end, we
are interested in computing a set of (driver, money per minute) pairs.
4.3 Task 3 - Best time of the daty to Work on Taxi (8 Points)
We would like to know which hour of the day is the best time for drivers that has the highest profit per miles.
Consider the surcharge amount in dollar for each taxi ride (without tip amount) and the distance in miles,
and sum up the rides for each hour of the day (24 hours) – consider the pickup time for your calculation.
The profit ratio is the ration surcharge in dollar divided by the travel distance in miles for each specific time
of the day.
Profit Ratio = (Surcharge Amount in US Dollar) / (Travel Distance in miles)
We are interested to know the time of the day that has the highest profit ratio.
4.4 Task 4 - (For Advanced Students – no points)
Here are two further tasks for advanced groups.
• How many percent of taxi customers pay with cash and how many percent using electronic cards?
Analyze these payment methods for different time of the day and provide a list of percents for each
day time? As a result provide two numbers for total percentages and a list like (hour of day, percent
• We would like to measure the efficiency of taxis drivers by finding out their average earned money per
mile. (Consider the total amount which includes tips, as their earned money) Implement a Spark job
that can find out the top-10 efficient taxi divers.
• What are mean, median, first and third quantiles of tip amount? How do find the median?
• Using the IQR outlier detection method find out the top-10 outliers.
5 Important Considerations
5.1 Machines to Use
One thing to be aware of is that you can choose virtually any configuration for your EMR cluster - you can
choose different numbers of machines, and different configurations of those machines. And each is going to
cost you differently!
Pricing information is available at: http://aws.amazon.com/elasticmapreduce/pricing/
Since this is real money, it makes sense to develop your code and run your jobs locally, on your laptop,
using the small data set. Once things are working, you’ll then move to Amazon EMR.
We are going to ask you to run your Spark jobs over the “real” data using 3 machines with 4 cores
and 8GB RAM each as workers. This provides 4 cores per machine (16 cores total) so it is quite a bit of
horsepower. On the Google cloud take 4 machines with 4 cores and 8 GB of memory.
As you can see on EC2 Price list , this costs around 50 cents per hour. That is not much, but IT WILL
ADD UP QUICKLY IF YOU FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and stop
your machine as soon as you are done working. You can always come back and start your machine or create
a new one easily when you begin your work again. Another thing to be aware of is that Amazon charges
you when you move data around. To avoid such charges, do everything in the ”N. Virginia” region. That’s
where data is, and that’s where you should put your data and machines.
• You should document your code very well and as much as possible.
• You code should be compilable on a unix-based operating system like Linux or MacOS.
5.2 Academic Misconduct Regarding Programming
In a programming class like our class, there is sometimes a very fine line between ”cheating” and acceptable
and beneficial interaction between peers. Thus, it is very important that you fully understand what is and
what is not allowed in terms of collaboration with your classmates. We want to be 100% precise, so that
there can be no confusion.
The rule on collaboration and communication with your classmates is very simple: you cannot transmit
or receive code from or to anyone in the class in any way—visually (by showing someone your code),
electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to
someone) or in any other way we have not yet imagined. Any other collaboration is acceptable.
The rule on collaboration and communication with people who are not your classmates (or your TAs or
instructor) is also very simple: it is not allowed in any way, period. This disallows (for example) posting
any questions of any nature to programming forums such as StackOverflow. As far as going to the web and
using Google, we will apply the ”two line rule”. Go to any web page you like and do any search that you
like. But you cannot take more than two lines of code from an external resource and actually include it in
your assignment in any form. Note that changing variable names or otherwise transforming or obfuscating
code you found on the web does not render the ”two line rule” inapplicable. It is still a violation to obtain
more than two lines of code from an external resource and turn it in, whatever you do to those two lines after
you first obtain them.
Furthermore, you should cite your sources. Add a comment to your code that includes the URL(s) that
you consulted when constructing your solution. This turns out to be very helpful when you’re looking at
something you wrote a while ago and you need to remind yourself what you were thinking.
Create a single document that has results for all three tasks. For each task, copy and paste the result that
your last Spark job wrote to Amazon S3. Also for each task, for each Spark job you ran, include a screen
shot of the Spark History.
Figure 1: Screenshot of Spark History
Please zip up all of your code and your document (use .zip only, please!), or else attach each piece of
code as well as your document to your submission individually.
Please have the latest version of your code on the GitHub. Zip the files from GitHub and submit as your
latest version of assignment work to the Blackboard. We will consider the latest version on the Blackboard
but it should exactly match your code on the GitHub