School of Computer Science
Uwe Roehm
DATA3404: Scalable Data Management 1.Sem./2024
Assignment 1: SQL in PostgreSQL vs Databricks
Group Assignment (10%) Due Date: Sunday 28th April (Wk 9) 11/04/2024
Introduction
This is the first part of the major practical assignment of DATA3404 in which you have to write a
series of SQL statements and Databricks / Apache Spark queries to analyse an air traffic data set.
We provide you with the schema and dataset. Your task is to implement some given data analysis
queries on both platforms, and to write a short report on your results and the query plans.
You find links to online documentation, data, and hints on tools and schema needed for this
assignment in the 'Modules / Assignment' section in Canvas.
Dataset Description
This assignment is based on an Aviation On-time data set which includes data about airports, airlines, aircraft, and flights. This data set has the following structure (primary key columns are marked with an asterisk):

Airports (airport_code*, airport_name, city, state, country, lat, long)

Aircrafts (tailnum*, manufacturer, model, status, aircraft_type, engine_type, year)

Airlines (carrier_code*, airline_name, country)

Flights (carrier_code, flight_number, flight_date, origin, destination, tail_number, scheduled_departure_time, scheduled_arrival_time, actual_departure_time, actual_arrival_time, distance)
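As an illustration only, two of these tables could be declared as follows. The column types here are assumptions, not the official schema; the authoritative definitions come from the import scripts provided in Canvas.

```sql
-- Illustrative sketch only: column types are assumptions, not the official schema.
CREATE TABLE airlines (
    carrier_code VARCHAR PRIMARY KEY,
    airline_name VARCHAR,
    country      VARCHAR
);

CREATE TABLE flights (
    carrier_code             VARCHAR,   -- references airlines.carrier_code
    flight_number            VARCHAR,
    flight_date              DATE,
    origin                   VARCHAR,   -- references airports.airport_code
    destination              VARCHAR,   -- references airports.airport_code
    tail_number              VARCHAR,   -- references aircrafts.tailnum
    scheduled_departure_time VARCHAR,
    scheduled_arrival_time   VARCHAR,
    actual_departure_time    VARCHAR,
    actual_arrival_time      VARCHAR,
    distance                 INT
);
```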
You will be provided with scaffold Jupyter notebook files for setting up your Databricks workspace
similar to the introductory tutorial. Note that you can take a naïve approach to processing flights: if a flight has a row in the flights CSV file, it is a flight, regardless of whether it was cancelled, duplicated, or affected by any other data quality issue. In general, if a row is missing data that a query needs, you can ignore that row.
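For example, a query that needs the aircraft of a flight would simply drop rows whose tail number is missing. A minimal sketch, assuming the table and column names above:

```sql
-- Ignore rows that are missing data the query needs:
SELECT *
FROM flights
WHERE tail_number IS NOT NULL;
```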
Downloading Datasets
For the raw CSVs to use in SQL, links will be available in Canvas Modules.
For Databricks, use the Assignment Bootstrap notebook to handle this downloading into DBFS.
Questions
There are two question sections:
• In the first section, each student chooses an individual question they would like to answer;
• In the second section, each team answers the question in this section together.
Individual Questions - Choose one for each team member to complete
Each team member chooses a single and distinct question to answer from the following:
1. Determine the names of the five airlines which flew to the most destinations in 2000.
Expected SQL output: Five row table with columns: airline name, count of destinations
Execution time comparison: Compare the execution times of your query on PostgreSQL and
on Spark/Databricks for the small and the medium datasets.
2. Determine which airport has the largest total departure delay time. (A flight is late to
depart if and only if its actual departure time is after its scheduled departure time.)
Expected SQL output: One row table with columns: airport name, total departure delay
Execution time comparison: Compare the execution times of your query on PostgreSQL and
on Spark/Databricks for the small and the medium datasets.
3. Determine the top three aircraft with the longest total distance flown, ignoring all
aircraft with an unknown model.
Expected SQL output: Three row table with columns: manufacturer, model, total distance
Execution time comparison: Compare the execution times of your query on PostgreSQL and
on Spark/Databricks for the small and the medium datasets.
4. Determine for each state which airline flies to that state the most.
Expected SQL output: Table with columns: state, airline, count of flights
Execution time comparison: Compare the execution times of your query on PostgreSQL and
on Spark/Databricks for the small and the medium datasets.
Team Question
This section contains the group question that must be answered by the whole team together.
1. Team Question: List the average, min and max departure delays (in minutes) of aircraft
models from BOEING and AIRBUS for the busiest US airport (by number of departing flights).
A flight is late to depart if its actual departure time is after its scheduled departure time.
Try to answer with a single SQL query (hint: check the WITH clause).
• Expected SQL Output: Result table with columns: airport name, city, num departures,
manufacturer, model, average departure delay, min departure delay, max departure delay
• Formatting: only consider the first 4 characters of each aircraft model; list the result
alphabetically by manufacturer, and then by descending average delay of their models
• Compare the query execution plans between the two systems, i.e. of both PostgreSQL
and Spark/Databricks for the medium data size.
• Suggest an index that would change the query plan and improve the runtime of your
query on PostgreSQL, and demonstrate its usefulness. Does it work on Databricks too?
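As a structural hint only (not a solution), a WITH clause names an intermediate result that the rest of the query can then reuse. The sketch below uses the table and column names from the schema given earlier:

```sql
-- Sketch of the WITH-clause pattern: compute one 'top' group first,
-- then join the main query against it.
WITH busiest AS (
    SELECT origin, COUNT(*) AS num_departures
    FROM flights
    GROUP BY origin
    ORDER BY num_departures DESC
    LIMIT 1
)
SELECT f.*
FROM flights f
JOIN busiest b ON f.origin = b.origin;
```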
Deliverables and Submission Details
There are three deliverables per group:
1. a brief report/documentation outlining your outputs;
2. source code - SQL: a single SQL file that answers the chosen individual and all the team questions; and
3. source code - Jupyter notebook: a single .DBC archive or SQL source file that answers the chosen individual and all the team questions.
Here are the specifics associated with each deliverable item.
Report
Filename recommendation: data3404_y24s1_assignment1_assignmentgroupname.pdf
• Your group name, including tutorial code and group number (e.g. TUT01-AssignmentGroup-1)
• The answers (output) you receive for each question when executing against the small and
medium datasets for SQL/Postgres and for Spark/Databricks.
• For the individual queries, compare the execution times between PostgreSQL and Databricks.
• For the team query, compare the query execution plans using EXPLAIN on the medium dataset
for both PostgreSQL and Databricks. Include a short paragraph describing the query plan
differences between the two platforms and why that might be the case.
• For the team query, suggest an index that improves its runtime and demonstrate its effectiveness
with either new execution plans or the measured runtimes.
• A short explanation of 'what is going on' in the general sense for each SQL statement. Note
that this does not need to go into technical detail (that will come in Assignment 2) - instead
you should essentially explain what each SQL clause or function is there for; why you used it
and what effect you expect it to have. A short paragraph for each question is plenty.
• A group contribution statement with the following headings:
– Your group members’ names and SIDs
– Which Question 1 subquestion they completed
– Whether they contributed meaningfully to the team question and the report (yes or no)
• This does not have a strict page limit, but you should keep it relevant to ensure feedback can
be useful. In particular:
– Do not include large code dumps. You are already submitting all of your code. Use 1
or 2 line snippets if you absolutely must. Remember that including your code is not
explaining your code.
– Do not keep writing more in the hope that some of what you include will be correct. You
are more likely to run into issues including incorrect information than you are to gain
by including correct information.
SQL File
Filename recommendation: data3404_y24s1_assignment1_assignmentgroupname.sql
A single .SQL file that contains all of your completed questions. Use /*comments*/ to indicate
where each question starts and ends, and importantly for question 1 the name and SID of the
group member who completed that sub-question. This .SQL file must be able to be run against a
Postgres installation that has the datasets imported with the appropriate schema (i.e., the tables
have been created and populated already) and return the correct results as indicated in your report.
Try to answer each question with just a single (complex) SQL query, though the creation of utility
functions, e.g. for the delay computation, is allowed.
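Such a delay helper might look like the following sketch. It assumes the time columns are HHMM-style integers (e.g. 1435 for 14:35); verify this against the actual schema before using it, and adapt the arithmetic if the columns are TIME values instead.

```sql
-- Hypothetical helper: minutes elapsed from a scheduled to an actual time.
-- Assumption: times are stored as HHMM-style integers (e.g. 1435 = 14:35).
CREATE OR REPLACE FUNCTION delay_minutes(actual INT, sched INT)
RETURNS INT AS $$
    SELECT (actual / 100) * 60 + (actual % 100)
         - ((sched / 100) * 60 + (sched % 100));
$$ LANGUAGE SQL IMMUTABLE;
```

A query can then call delay_minutes(actual_departure_time, scheduled_departure_time) instead of repeating the arithmetic in several places.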
Jupyter DBC Archive or SQL Source File
Filename recommendation: data3404_y24s1_assignment1_assignmentgroupname.dbc
A single Jupyter SQL source file (or .DBC archive) that contains all of your completed questions.
Use markdown cells to indicate the same information as for the SQL file. This file must be able to be
run attached to a Databricks Community cluster that has had the Assignment Bootstrap notebook
run on it, and no other configurations made.
Submission Deadline and Submission Boxes in Canvas
Due Date: All deliverables are due in Week 9, no later than Sunday 28th April. Late submission
penalty: -5% of the marks per day late. The marking rubric is in Canvas.
There are two submission boxes in Canvas:
1. Submit your report into the "Assignment 1 Report" TurnItIn box in Canvas, one per group.
Filename recommendation: data3404_y24s1_assignment1_assignmentgroupname.pdf
2. Please submit the source code (SQL file(s) and Jupyter notebook) and a soft copy of design
documentation as a zip or tar file into the "Assignment 1 Code" submission box in Canvas,
one per group. Name your zip archive after your group number X with the following
name pattern: data3404_y24s1_assignment1_assignmentgroupname.zip
Students must retain electronic copies of their submitted assignment files and databases, as the
unit coordinator may request to inspect these files before marking of an assignment is completed. If
these assignment files are not made available to the unit coordinator when requested, the marking
of this assignment may not proceed.
All the best!
Group member participation
This is a group assignment. The mark awarded for your assignment is conditional on you being
able to explain any of your answers to your tutor or the lecturer if asked.
If your group is experiencing difficulties with the content, you should ask on Ed (use a private
post if you need to discuss code or report writing directly).
Level of contribution Proportion of final grade received
No participation. 0%
Did individual task but not group task (or vice-versa) 50%
Major contributor to the group’s submission. 100%