MAJOR ASSIGNMENT: TASK 2
INTRODUCTION
This is the final part of the DATA3404 major practical assignment, analysing the air traffic data set. You will
find links to the supplied data and starting SQL and Spark code in the ‘Major Assignment’ section in Canvas
modules.
STARTING SCENARIO
The questions you will be analysing have been changed to limit the amount of work required for this task while
still focusing on the optimisation analysis you need to perform. These questions have already been completed in both a SQL form and a Spark form. The work done for both approaches is fairly questionable: the output is correct, but there appears to be inefficiency in the way the queries are constructed. Your task is to identify where those inefficiencies might be, and to correct them.
DATASETS
The datasets have been modified – the supplied SQL file will allow you to import the tweaked datasets into
PostgreSQL, and the supplied Task 2 Bootstrap file will obtain the same datasets in their raw CSV format. The
primary difference is in data cleaning – to avoid the need to redo the work you have already shown you can
perform in Task 1, these source files have been cleaned ahead of time.
QUESTIONS
QUESTION 1
Determine the airport that is flown to (i.e. is the destination) most often. Of the airlines that have flights recorded to that airport, determine which one does so using the largest number of different aircraft.
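For illustration only, one way the logic of Question 1 could be expressed in PySpark is sketched below. This is not the provided query, and the dataframe name (flights) and the column names (dest_airport, carrier_code, tail_number) are assumptions about the cleaned schema rather than the names actually used in the assignment files.

from pyspark.sql import functions as F

# Assumed dataframe and columns: flights(dest_airport, carrier_code, tail_number).
# Step 1: the destination airport with the most recorded flights.
top_dest = (flights
            .groupBy("dest_airport")
            .count()
            .orderBy(F.desc("count"))
            .limit(1))

# Step 2: among airlines flying to that airport, the one using the most distinct aircraft.
answer = (flights
          .join(top_dest.select("dest_airport"), "dest_airport")  # keep only flights to the top airport
          .groupBy("carrier_code")
          .agg(F.countDistinct("tail_number").alias("num_aircraft"))
          .orderBy(F.desc("num_aircraft"))
          .limit(1))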
QUESTION 2
For each airline, determine its most common origin airport. Among these airline/most-common-origin-airport pairs, identify the airline whose most common origin airport accounts for the highest percentage of that airline's total flights.
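Similarly, a minimal PySpark sketch of Question 2's logic (again, not the provided query; the dataframe name flights and the column names origin_airport and carrier_code are assumptions) might use a window over each airline:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Flights per (airline, origin airport) pair, using assumed column names.
per_origin = (flights
              .groupBy("carrier_code", "origin_airport")
              .agg(F.count("*").alias("flights_from_origin")))

w = Window.partitionBy("carrier_code")

answer = (per_origin
          # rank origins within each airline by flight count (1 = most common origin)
          .withColumn("rank", F.row_number().over(w.orderBy(F.desc("flights_from_origin"))))
          # total flights per airline, used to compute the percentage
          .withColumn("total_flights", F.sum("flights_from_origin").over(w))
          .filter(F.col("rank") == 1)
          .withColumn("pct_of_total", F.col("flights_from_origin") / F.col("total_flights") * 100)
          .orderBy(F.desc("pct_of_total"))
          .limit(1))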
ALTERNATE VERSIONS
Your task is to modify these queries such that they are more efficient. You should do this using a systematic
process to help you write your report:
• Analyse the current operation of these queries using the techniques covered for SQL and Spark
• Identify something that can be improved, and change the query accordingly so that it executes more efficiently
• Ensure the output is unchanged, and that you anticipate it would remain unchanged for any valid inputs, i.e. do not rely on the specific data you have available for testing.
• If you are unable to identify an improvement, you should write something that is equivalent in output
but involves a different execution.
• Observe an improvement in the SQL execution plan, the Spark DAG, and the SparkMeasure analysis.
• You may wish to use the larger flights datasets to check this, particularly the very large dataset, as this will make it clear whether there is a speed improvement. Note that these versions have not been cleaned in the way the flights_small_cleaned dataset has. To achieve similar results, apply dropna to the dataframe when it is loaded, correct the column names, and use inner joins when connecting with the airline/airport/aircraft data (see the sketch after this list).
• If any bugs are discovered in the provided queries, they should be treated as working as intended: your modification should retain any bugged behaviour rather than try to correct it to match the stated question.
• The cache clearing in the Spark code should be retained: do not use caching that requires the queries to be run multiple times to achieve efficiency. The efficiency improvement should come from the first run of the query. Caching that occurs within the query itself is fine.
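As a rough guide to the preparation described in the list above, a minimal PySpark sketch follows. The file path, the original column names being renamed, and the airlines/aircraft dataframes are assumptions for illustration; adapt them to the datasets produced by the Task 2 Bootstrap notebook.

from pyspark.sql import functions as F

# Load the larger, uncleaned flights CSV (the path below is an assumption).
flights_large = (spark.read
                 .option("header", True)
                 .option("inferSchema", True)
                 .csv("/FileStore/tables/flights_large.csv")
                 .dropna())  # drop rows with missing values, mirroring the cleaned dataset

# Correct column names where they differ from the cleaned schema (illustrative mapping only).
flights_large = (flights_large
                 .withColumnRenamed("carrier code", "carrier_code")
                 .withColumnRenamed("tail number", "tail_number"))

# Use inner joins so flights without a matching airline/airport/aircraft record are excluded,
# as they would be after cleaning (airlines and aircraft are assumed dataframes).
joined = (flights_large
          .join(airlines, "carrier_code", "inner")
          .join(aircraft, "tail_number", "inner"))

# The cache clearing already present in the provided Spark code should be kept, e.g.:
spark.catalog.clearCache()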
Note that each question should be approached by the group as a whole. Each re-written version should have
the authors who contributed to that version indicated in a comment.
Note also that there is quite a lot you might identify and change; you do not need to optimise these queries to be perfectly efficient! All you need to do is identify one aspect of inefficiency in each question and correct it. However, some inefficiencies are common to both questions, particularly in Spark: for your improvement to count, you will need to use a different approach for each question rather than cutting and pasting the same improvement into both.
WHAT TO SUBMIT
In the ‘report’ submission box you should submit a pdf report with the following information:
• Your group name, including tutorial code and group number
• An analysis for each question using the following headings:
o SQL
▪ Initial State
Describe the execution of the query as provided. Write a relational algebra
expression tree which describes this query. Use the execution plan from PostgreSQL
to support your analysis of what the query is doing.
▪ Alternative Version
Describe what your alternative version has changed. Indicate this by comparing the
different relational algebra expression trees and execution plans between the
versions.
o Spark
▪ Initial State
Describe the execution of the query as provided. Use the DAG output from Databricks/Spark as well as the SparkMeasure output to support your analysis of what this query is doing (a sketch of one way to capture the SparkMeasure metrics is given after this list). You may wish to break down the notebook further into different cells to achieve this; be careful that in doing so you don't change the way the query operates.
▪ Alternative Version
Describe what your alternative version has changed. Indicate this by comparing the
DAG and SparkMeasure output between the versions.
▪ Optional for higher marks: A discussion on how the original and alternative versions
of the queries would perform if being executed on a multiple-node cluster. Identify
any issues to do with parallelism/distributed joins that might impact either version
differently, and discuss whether you believe your alternative version would still be
an improvement with more data on a highly parallel cluster (e.g. the very large
dataset executed on 100+ nodes). This will be speculative, but you should have
enough knowledge about the way distributed joins work to identify some key issues.
o There is no need to include the query output itself as this should not change from what is
provided.
• A group contribution statement with the following headings:
o Your group members’ names and SIDs
o Which alternative versions they contributed to
o Whether they contributed meaningfully to the analysis and report (yes or no)
• This does not have a strict page limit, but you should keep it relevant to ensure feedback can be
useful. In particular:
o Don’t include large code dumps. You are already submitting all of your code. Use 1 or 2 line
snippets if you absolutely must. Remember that including your code is not explaining your
code.
o Don't keep writing more in the hope that some of what you include will be correct. You are more likely to run into issues by including incorrect information than you are to gain by including additional correct information.
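One way the SparkMeasure output referred to above could be captured around a single run of a query is sketched below, assuming the sparkmeasure Python package is available on the cluster (as the bootstrap notebook provides); query_dataframe is a placeholder for the question's final dataframe.

from sparkmeasure import StageMetrics

# Collect stage-level metrics (task time, shuffle read/write, etc.) around one run of the query.
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
result = query_dataframe.collect()  # placeholder action that forces the query to execute
stagemetrics.end()

stagemetrics.print_report()  # summary of stages, tasks and shuffle activity

# The physical plan behind the Spark DAG can also be inspected directly:
query_dataframe.explain(True)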
In the ‘code’ submission box you should submit a zip folder with the following contents:
• A single .SQL file that contains your alternative versions. Use /*comments*/ to indicate where each question starts and ends, and the names and SIDs of the group members who contributed to that question.
This .SQL file must be able to be run against a Postgres installation that has had the datasets imported with the provided schema.
• A single Jupyter notebook that contains all of your completed questions. Use markdown cells to
indicate the same information as for the SQL file.
This notebook must be able to be run attached to a Databricks Community cluster that has had the
Assignment Bootstrap Task 2 notebook run on it, with SparkMeasure installed, and no other
configurations made.

MARKING
The marking guide is broken up into rubric items that follow the EMRN pattern described in the first lecture.
• SQL Q1+Q2 (2% + 2%)
o E: A valid alternative version has been written that has a different execution. The SQL
statements are clean, idiomatic, and elegant.
o M: A valid alternative version has been written with a different execution.
o R: The alternative version gives invalid output or does not have a different execution plan, but the statement is close enough that some direct feedback can be given for a corrected revision to be made.
o N: The alternative version is incorrect or has the same execution, and the statement is not
sufficiently close to correct for more directed feedback than ‘see the course material’, or is
missing.
• SQL Q1+Q2 Improvement (0.5% + 0.5%)
o This is a simple yes or no tied to a result of E or M for SQL Q1 and Q2. If the valid alternative
version represents a tangible improvement in efficiency, this mark is awarded.
• Spark Q1+Q2 (2% + 2%)
o E: A valid alternative version has been written that has a different execution. The Spark code
is clean, idiomatic, and elegant.
o M: A valid alternative version has been written with a different execution.
o R: The alternative version gives invalid output or does not have a different execution plan, but the code is close enough that some direct feedback can be given for a corrected revision to be made.
o N: The alternative version is incorrect or has the same execution, and the code is not
sufficiently close to correct for more directed feedback than ‘see the course material’, or is
missing.
• Spark Q1+Q2 Improvement (0.5% + 0.5%)
o This is a simple yes or no tied to a result of E or M for Spark Q1 and Q2. If the valid
alternative version represents a tangible improvement in efficiency, this mark is awarded.
• Report SQL (2% + 2%) – Q1 and Q2
o E: All required analysis is present and valid. Analysis shows clear insight into the underlying
SQL execution engine, and the way declarative SQL statements are interpreted and
optimised into an execution plan.
o M: All required analysis is present and valid. Analysis shows a basic understanding of the SQL
execution engine, and the way declarative SQL statements are interpreted and optimised
into an execution plan.
o R: All required analysis is present. Analysis is understandable, but has errors that need to be
corrected based on specific feedback.
o N: Any required analysis is missing, or analysis shows clear lack of understanding of the way
SQL statements are executed.
• Report Spark (2% + 2%) – Q1 and Q2
o E: All required analysis is present and valid. Analysis shows clear insight into the underlying
Spark dataframe execution engine, and the way Spark code is interpreted and optimised into
an execution plan.
o M: All required analysis is present and valid. Analysis shows a basic understanding of the
Spark dataframe execution engine, and the way Spark code is interpreted and optimised into
an execution plan.
o R: All required analysis is present. Analysis is understandable, but has errors that need to be
corrected based on specific feedback.
o N: Any required analysis is missing, or analysis shows clear lack of understanding of the way
Spark code is executed.
• Report Parallelism Discussion (2%)
o E: Discussion identifies valid differences between the alternative versions when executing in
parallel, or identifies that there are no differences. Discussion shows clear insight into
parallelism concerns, particularly shuffles and distributed joins.
o M: Discussion identifies valid differences between the alternative versions when executing in
parallel, or identifies that there are no differences. Discussion does not include incorrect
information about parallelism concerns.
o R: Discussion is present and shows some insight, but has invalid observations or information
that must be corrected based on specific feedback.
o N: Discussion is missing, unclear, or does not show any understanding of parallelism concerns.
A result of (E)xemplary for a line item yields 100% of the available marks. (M)eets Expectations yields 75%.
(R)evision Required yields 0%, but with the chance to resubmit that aspect (and that aspect only) for re-
grading by the resubmission due date, to then receive the full 75% if successful. (N)ot Sufficient For Feedback
yields 0% with no chance to resubmit.
GROUP MANAGEMENT
The same group management recommendations and outcomes apply for Task 2.
If a group member does not contribute as required, the following outcomes will occur:
• Didn’t contribute to any rewritten query: excluded from group marks, 0 for the assessment
• Contributed to a rewritten query but did not otherwise participate: 50% of the earned marks, i.e. a mark of 7/10 will become 3.5.