DA5020 – Practicum II
This assignment provides you with an opportunity to use the MongoDB document based data store.
MongoDB is a commonly used non-relational database. In this practicum you will practice using
MongoDB in the cloud . Register for a free MongoDB Atlas account. Ensure that you select the free 1
option (no credit card is needed). After registration is completed, create a free cluster (this will be done in
question 1).
Alternatively, you can setup mongoDB on your computer. This will require software installations,
command line servers, and keys to be installed. Have patience. However, it is recommended that you use
MongoDB Atlas — which is free and does not require installing software on your computer.
This is a group practicum which means that you may choose to work in groups of up to three
students. You may fully collaborate and submit the same work. However, you must put all students'
names on all submitted work. If a group member is not adequately contributing, the remaining
team members may "vote to eject" the student from the team by emailing me the reason. In such an
event, the team member who was "fired" must still complete the project individually by the due
date.
If you are working in groups, you can self-signup in Canvas using one of the Practicum 2 groups or notify
me via email by Mar 25, 2021 and I will create the group for you. Ensure that you include your name and
the name(s) of your group member(s) in the email and cc them.
——————————————————————————————————————————
In this practicum, you will work with data from the Bureau of Transportation Statistics which reports the
Airline/Carrier On-Time Performance from 1987 to present. The dataset contains airlines, which are also
called operators, that voluntarily share its data and notify the organization about its flight hours, distance,
arrival and departure delays at various airports across the US and its territories. For more information on
the dataset and to learn more about the description of the fields, visit the following website.
You are given the task of performing a comprehensive analysis of this dataset. However, due to the
volume of data, a subset has been extracted for you for all flights that originated or departed from the
following states in 2019: Arizona: AZ, Nevada: NV and California: CA.
You will need to Install the mongolite library in your R environment.
Question 1 — Configure the Database: MongoDB Atlas (10 points)
1. Create a free account with MongoDB Atlas
2. Watch the following tutorial entitled: Getting Started with MongoDB Atlas - Free Tier to create a free
cluster and mongoDB will be installed automatically for you. When you watch the video, pay
attention to the following terms: cluster name, database user, network access and connection
information. You will need to perform the following when you create your cluster:
• Cluster name: Ensure that your cluster is called da5020-cluster.
You can perform this on your local machine. However, the cloud gives you the flexibility to scale your analysis 1
(e.g use more or less computing resources for large-scale data sources). Additionally, you will only need to create
one database for your group which you can all access at a central location.
• Database User: Create a database user and password so that you can access the cluster. This is 2
different from the credentials for your mongoDB Atlas account.
• If you work in a group, create a database user for each group member. You only need to create
one cluster for your group; however, each group member is expected to use their respective
username and password to access the database.
• Create a database user for the TA with the following information username: da5020_ta and
password: da5020NEU; grant them read and write access to any database.
• Take a screenshot of the database users that were created and submit it with the
practicum.
• Network Access: In the Mongo DB Atlas Dashboard, click Network access => IP Address and
select “Allow Access From Anywhere”. This setting is not very secure but is only required for
grading purposes to allow the TAs access to your database (and you won’t need to whitelist their
IP address) .
• Connection String: Get your connection information so that you can connect to your database in
R studio. In the Mongo DB Atlas Dashboard, click clusters => connect => connect to your
application. You will be presented with information on how to connect to your cluster. Copy the
connection string (view the image below). This will be your url to connect to the database in R.
Open R studio and create a variable called mongo_url and paste the connection string. Ensure that
you replace
with your actual password and with the name of the
database that you will create in question 2.
Note: if you decided to install mongoDB on your computer, skip steps 1 and 2 above and follow the
instructions at this link to install the necessary software.
3. Create the database and load the data.
Open R studio, connect to your mongoDB instance and create a database called airline_performance and
a collection called flights_2019. You will need to use your connection string from Question 1.2 above.
After which, insert the attached CSV data “2019_ONTIME_REPORTING_FSW.csv” using the insert
Ensure that you will remember the username and password for the database user, because you will use it in your R 2
environment. Note: the credentials for the database user is different from your MongoDB Atlas login credentials.
function from the mongolite package. It will take a few minutes to insert all the data because it contains
approximately 2 million observations.
——————————————————————————————————————————
Using R Studio, create queries using mongoDB (via the mongolite library) to answer the remaining
questions. Ensure that you write queries to answer each question below and do not load the entire
dataset in your R environment.
Question 2 — (10 points)
Calculate the number of flights in the dataset. Also, derive the mean and standard deviation of the elapsed
time for all flights. Explain the results.
Question 3 — (20 points)
Calculate the total flights for each airline/operator. Visualize the top 10 results and show the carrier name
and the frequency. Explain the results.
• Ensure that you indicate the full name of each carrier, in lieu of the carrier code. This will require web
scraping. Here is a helpful resource with the list of airline codes and the respective names. You can
also use an alternative webpage of your choice.
Question 4 — (10 points)
Select the top 5 airlines, from question 3 (above), and calculate the total flight time for each month
(grouped by airline). Explain and visualize the results. Hint: the total flight time is not equivalent to the
frequency of flights and ensure that you display the total hours and not the total minutes.
Question 5 — (15 points)
Analyze the data to determine if there are certain days of the week that the above 5 airlines are the
busiest. Explain and visualize the results. You can define busy as either of the following: 1) the frequency
of flights from Sun. to Sat., 2) the total flight hours from Sun. to Sat., 3) the average flight hours from
Sun. to Sat., or 4) another metric. Justify your logic regarding how you defined busy and explain the
results. Note: you do not need to conduct this analysis on a monthly basis; you can do this quarterly or
annually. You can decide the best way to convey this data.
Question 6 — (20 points)
Select any (1) aircraft, and explore the data to determine where it often travels. Calculate its average
arrival and departure delays at the airports. After which analyze all the results to identify any patterns that
are evident and also indicate which airline operates that aircraft. Explain your findings. Note: the
TAIL_NUM can help you to identify each unique aircraft.
Question 7 — (15 points)
For each of the original 3 states (i.e. AZ, NV, CA), analyze the most popular outbound/destination
airports. For example, if a flight originated in CA (at any of its airports), where do they often go?
Comment on your findings and visualize the top results.
Question 8 — (+10 optional/bonus points)
Build one additional query to test a hypothesis or answer a question that you have about the dataset. Your
query should retrieve data from MongoDB and evaluate the pattern/trend that you intend to explore.
Prepare supporting visualizations for your analysis. If necessary, you can integrate any additional data that
provide more details or support your analysis/findings.
Note: all charts that are displayed should have the following:
• An informative title (and subtitle if applicable)
• Labels on the x-axis and y-axis that indicate the units of measurement.
• A caption that indicates the purpose of the chart.
Submission Details
• Your submission must contain three files: the .Rmd file, a knitted PDF or HTML and a screenshot of
the database users. Name your .Rmd file, DA5020.P2.FirstName.LastName.Rmd and your PDF/HTML
DA5020.P2.FirstName.LastName.{pdf,html}, where FirstName.LastName is your first and last name.
• The .Rmd file must be fully commented and properly "chunked" R code and detailed explanations.
Make sure that it is easy to recognize which question you answer and that your code runs from
beginning to end (because that is how we will test it). Code that doesn't execute, stops, throws errors
will receive no points. If the TAs have to "debug" your code or spend any effort getting it to run,
substantial points will be deducted.
• Not submitting a knitted PDF or HTML will result in reduction of 30 points.
• Not submitting the .Rmd file (or both) will result in a score of 0.
• Include the names of all group members in your RMD file.
Useful Resources
• MongoDB Atlas
• Getting Started with MongoDB Atlas - Free Tier
• Install mongo DB on your computer
• Mongo DB Atlas and R
• Query the days of the week in MongoDB
• List of airline codes
• Mapping the US
• Advanced Mapping
• Bureau of Transportation Statistics
• Data dictionary