COMP5338: -无代写|学霸联盟

COMP5338: -无代写

时间：2025-10-31

School of Computer Science
Dr. Ying Zhou
COMP5338: Advanced Data Models Sem. 2/2025
Neo4j Project (Individual, 20%)
27.10.2025
1 Introduction
In this assignment, you are asked to design and build a Neo4j graph model represent-
ing airport and the airline route using data sets downloaded from OpenFlights Data and
OurAirports. You are also asked to implement Cypher queries and to anlayse their perfor-
mance.
2 Data Set
The data set is available on the unit’s Canvas site and consists of files from two sources:
• OpenFlights (.dat): airports.dat, airlines.dat, planes.dat, routes.dat.
• OurAirports (.csv): airports.csv, runways.csv, countries.csv, regions.csv.
Read the official column descriptions for each source. You do not need to load every
field; it is acceptable (and encouraged) to omit columns that are unused by the workloads.
The following are recommended join and cleaning rules:
• Airport identity: use ICAO as the primary join key across sources.
– OpenFlights airports.dat→ property ICAO.
– OurAirports airports.csv→ column ident (same as ICAO).
If an airport lacks ICAO, you may retain it as OpenFlights-only (do not attempt fuzzy
matching). Drop those unmatched airports that are only in OurAirport data source.
• Field precedence (when both sources exist for the same ICAO):
– Name, latitude, longitude: prefer OpenFlights if available.
– City, country (name): keep from OpenFlights for continuity with legacy work-
loads.
1
– Type, iso_country, iso_region: take from OurAirports.
• Runway–airport link: in runways.csv, link via airport_ident→ Airport.ICAO.
• Routes quality: remove routes which reference missing airlines or airports (after the
airport/airline nodes are loaded).
3 Workloads to implement
You are asked to design a graph schema and load the data according to your schema to
build the graph and implement the following workloads, each using a single Cypher query,
which could includes multiple subqueries. You will receive zero marks for any workload
that utilizes multiple queries.
W1 Compute the average runway length (in meters) by airport type. You need to first
collect all individual runways attached to airports of each type and take the mean of
their lengths. Convert feet to meters using the formula
meters = feet× 0.3048.
Only include runways with a known length (non-null) and that are not marked as
closed. Your query should return four columns, sorted by the average length in de-
scending order:
• airport_type — the value of Airport.type;
• avg_runway_m — the average runway length (meters) rounded to one decimal
place;
• runway_count — the number of runways of that type;
• airports_with_runways — the number of distinct airports of that type which
have at least one included runway.
W1 partial output
airport_type avg_runway_m runway_count airports_with_runways
large_airport 3085 905 467
medium_airport 2020.5 5726 4165
... ... ... ...
small_airport 1342 2832 2212
heliport 311 32 27
2
W2 Identify the top 5 airlines that serve the widest range of time zones, measured
by the number of distinct time zone names across all airports they serve (as either
sources or destinations). Only count airports whose tz_name is present (non-null).
Your query should return the following, sorted by the number of distinct time zones
in descending order:
• airline — the airline name,
• tz_name_count — the number of distinct tz_name values across its network,
• airports_count — the number of distinct airports with a known tz_name that
the airline serves.
W2 partial output before skip
airline tz_name_count airports_count
Air France 134 385
Turkish Airlines 118 254
... ... ...
W3 Identify the airlines that serve only intra-region routes. Treat a route as intra-region
if the source and destination airports have the same region code (e.g., US-AK).
Return the top 10 airlines by the number of such intra-region routes. Your query
should return:
• airline — the airline name,
• route_count — the number of routes operated by that airline,
• regions_served — the list of distinct region codes the airline serves.
W3 partial output
airline route_count regions_served
Era Alaska 177 [US-AK]
Bering Air 65 [US-AK]
Pacific Coastal Airline 30 [CA-BC]
Astral Aviation 28 [US-AK]
SATA Air Acores 28 [PT-20]
Aero Flight 26 [US-AK]
WebJet Linhas A 26 [CA-QC, CA-NL]
... ... ...
3
W4 Identify all one-stop connecting itineraries from Sydney Kingsford Smith Inter-
national Airport (SYD) to London Heathrow Airport (LHR) in which both legs are
operated by the same airline and each leg is non-stop. For each qualifying itinerary
(SYD → via → LHR), compute the geodesic distance (in kilometers) for each leg using
the airports’ WGS84 coordinates.
Ignore any leg where coordinates are missing. Return the results sorted by airline
name (and then by total distance ascending if desired). Your query should return:
• airline — the airline name,
• cities — a three-element sequence [Sydney, via city, London],
• leg1_km — distance from SYD to the via city (km, rounded as you see fit),
• leg2_km — distance from the via city to LHR (km),
• total_km — leg1_km + leg2_km.
W4 partial output
airline cities leg1_km leg2_km total_km
Air Canada [Sydney, Vancouver, London] 12515.0 7587.1 20102.0
... ... ...
4
W5 Identify the top five countries ranked by the number of airports which have at least
one widebody-capable runway, and for each of these countries report the single airport
(within that country) which has the most such runways together with that count.
In this workload, a runway is considered widebody-capable should have a width of at
least 60 meters and a length of at least 3000 meters. Also, the runway should not be
marked as closed.
You should return the following sorted by widebody_airport_count descendingly, then
country ascendingly:
• country_code — country ISO code (e.g., US, AU);
• country — country name;
• widebody_airport_count — number of airports in the country with at least one
qualifying runway;
• top_airport — identifier for the airport in that country with the most qualifying
runways (e.g., IATA/ICAO/name);
• qualifying_runways — number of qualifying runways at top_airport.
W5 Partial Output
country_code country airport_count top_airport qualifying_runways
US United States 89 JFK 3
RU Russia 25 SVO 3
CN China 20 PVG 5
... ... ... ...
5
W6 Identify the plane types that most frequently serve airports whose runways are
only unpaved, e.g no paved surface. Count, for each plane type, the number of
distinct airports which appear as either a source or a destination on a route that uses
that plane type.
Below is how you should classify paved and unpaved.
• Paved if the recorded surface string contains one of: asphalt, asph, concrete,
conc, bitumen, cement, macadam, paved.
• Unpaved if it contains one of: grass, turf, gravel, grvl, dirt, clay, sand,
soil, cinder, laterite, snow, ice, water, unpaved, unprepared.
Ignore runways marked as closed (closed=true). An airport qualifies as only unpaved
if and only if it has at least one unpaved runway and no paved runway.
Return the top 10 plane types sorted by the count in descending order of the airport
count, then by plane name in ascending order . Your query should return:
• plane_iata — the IATA equipment code for the plane type,
• plane_name — the human-readable name (if available),
• airports_only_unpaved_count — the number of distinct airports (source or
destination) that have only unpaved runways and are served by routes using this
plane type.
W6 Partial Output
plane_iata plane_name airports_only_unpaved_count
"DHT" "De Havilland Canada DHC-6 Twin
Otter"
25
... ... ...
"YN2" "Harbin Yunshuji Y12" 10
... ... ...
6
4 Query Validation
For each workload, include one or more short Cypher queries that validate part or all
of your result. These validation queries should be easy to run and interpret; they help
demonstrate that you understand the workload definition and allow us to diagnose small
discrepancies (e.g., different filtering assumptions).
What to include
• State the claim you are validating. For example: "For large_airport, our by-runway
average length is 3085 m over 905 runways and 467 airports."
• Provide a focused validation query which recomputes the relevant metric only for
the chosen case (e.g., a single airline, airport type, region, or itinerary).
• Show the expected evidence/output (counts, sums, lists) which confirms the claim
within a reasonable tolerance (e.g., rounding to 1 decimal place).
• State any assumptions you used (e.g.,"ignored runways with unknown surface").
Validation patterns (pick one or more per workload)
• Spot-check a single entity (e.g., one airline) and recompute the exact aggregates
your main query reported.
• Constraint checks with EXISTS/NOT EXISTS to prove no counterexamples exist
(e.g., no inter-region routes for an "intra-region only" airline).
• Membership checks to confirm that every row in your top-K satisfies the selection
predicate (and that a known non-qualifier is excluded).
• Cross-tab sanity show that partitions add up (e.g., sum of per-type counts equals
the overall count) or that derived numbers are coherent.
7
5 Report
Prepare a concise, professional report which describes and justifies your graph model,
explains the workloads and their validation, and analyzes performance for at least
three workloads.
Structure
Your report must contain the following sections:
• Introduction (very brief)
• Graph Model
• Workload Implementation and Validation
• Performance Observation
• Conclusion (very brief)
Introduction / Conclusion (brief). There are no strict rules on these two sections; keep
each to a short paragraph that frames your approach and closes with key takeaways. They
should integrate smoothly with the rest of the report and contribute to overall profession-
alism.
Graph Model
Describe what you modeled and why. This section must include:
• A brief description of the node labels and relationship types, including important
properties
• A screenshot of the schema produced by CALL db.schema.visualization.
• Your design justification, including key trade-offs and at least one plausible alterna-
tive you considered.
• Data provenance: specify which portions of the graph are sourced from which files
(OpenFlights vs OurAirports). Also list any fields that were present in the files but
not loaded into the graph and explain why (e.g., out of scope).
8
Workload Implementation and Validation
For workloads W1–W6, present each workload using the following template:
• Workload Description. Restate the workload in your own words and clarify any
assumptions.
• Cypher query. Include the final query you used. Annotate key parts briefly (e.g.,
how you filter, group, or compute distances).
• Sample output. Include a small table or a few rows to illustrate result shape and
formatting (rounded where appropriate).
• Validation. Provide at least one short validation query per workload that checks part
or all of the result (e.g., spot-check recomputation for a single entity, a membership
check that returns no violations, or a small consistency/cross-tab check). Include all
requirement elements as described in the query validation section.
Performance Observation
Analyze the execution of at least three workloads. For each selected workload:
• Run PROFILE and summarize the plan: key operators, estimated vs actual rows,
total DB hits, memory usage if shown.
• Index usage: note which indexes were used (or not) and why (e.g., predicate shape
prevents index seek).
• Bottlenecks & improvements: identify the expensive step(s) (e.g., label scan +
DISTINCT), and describe potential or implemented improvements (e.g., different
starting point, selective predicates earlier, additional index, rewriting collect+size
as COUNT DISTINCT, using CALL {...} to isolate a subquery).
9
6 Deliverables and Submission Guidelines
Files to submit
• One .cypher file per workload named Wn.cypher (e.g., W1.cypher, W2.cypher,
. . . ). Each file must contain:
– the workload implementation query, and
– one or more validation queries for that workload.
Separate sections using clear comment dividers, e.g.:
// ===== Workload Implementation =====
// (your main query)
// ===== Validation =====
// (one or more short queries)
• dataloading.cypher: all schema constraints/indexes and data-loading statements
needed to build your graph (assume CSVs are in Neo4j’s import/ folder; use file:///...
URIs).
Packaging
• Place all required .cypher files (e.g., W1.cypher–W6.cypher and dataloading.cypher)
into a single .zip and submit it via the designated Canvas link.
• Do not include any data files, database folders, or generated exports in the zip.
Execution expectations
• Your files should run on Neo4j 5.x in a clean database, in this order: dataloading.cypher
→ W1.cypher→ . . . Wn.cypher.
• Use portable file references (e.g., LOAD CSV FROM 'file:///airports.csv'). Avoid
absolute paths.
• Your query should not rely on APOC or other procedures.
Report submission
• Submit your written report as a single PDF to the separate report submission link on
Canvas.
10
7 Generative AI Usage Guidelines
You are permitted (but not required) to use Generative Artificial Intelligence (AI) tools. If
you choose to use one or more such tools, you must appropriately acknowledge your use
of such tools. You can do this by including an acknowledgment section at the end of your
report where you need to describe the AI tool(s) that you used, what you used it to do,
what prompt(s) you provided, and how AI output was used or adapted by you.
8 Point distribution (total = 20)
• Data loading script: 2.0
• Workload implementation script (W1–W6): 1.5 each; total: 9.0
• Workload validation script (W1–W6): 0.5 each; total 3.0
• Report: 6.0
– 1 point for graph model
– 3 points for workload implementation and validation
– 1.5 point for performance observation
– 0.5 point for overall presentation
Revision History
Date Changes
2025-10-27 Updated W3 partial output;
2025-10-27 Updated the partial results; added indicator for result continuing to all
partial output
2025-10-20 Updated the GenAI usage guidelines
2025-10-17 Preview release.
11

学霸联盟