CASA0006: Data Science for Spatial Systems
Assessment Guidelines

Deadline: 5pm GMT, Monday 26th April 2021
Word count: minimum 2000 words (not including Python scripts)

The coursework for this module will consist of an individual assignment that tests your ability to conduct in-
depth data analysis. Each student is required to submit a single Python Notebook which contains both the
code required to conduct the data analysis and accompanying text which provides context and interpretation.

This coursework represents 100% of the overall module assessment.


Select any open dataset relating to an urban or spatial system of your choice and conduct an advanced
analysis of the dataset. A complete data analysis process should be undertaken – this will include data
validation and cleaning, a data pre-processing phase (e.g. text, image, clustering analysis), and
comprehensive analysis (including relevant visualisations) of the data, identifying important trends and
insights contained within the dataset. Each stage of the data treatment and analysis process should be well
documented and in keeping with the exploratory, narrative theme described during the course. Marks will be
awarded for both the technical analysis process and the interpretation and choice of analysis methods. The
dataset (or datasets) you choose to analyse is left completely open and should relate to an urban or spatial
system.

The data analysis process should be captured within a single Jupyter Python notebook. This notebook
should contain all of the code used to complete each of the three stages of the work, in addition to the full
documentation of the analysis process and interpretation of results. The documentation must be a minimum
of 2000 words; note that the Python code you provide does not count towards this word count.
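
As a rough self-check on the word count, the sketch below counts words in the Markdown cells of a notebook and skips code cells. It assumes the standard nbformat 4 JSON layout (a top-level "cells" list with "cell_type" and "source" fields); the function names and the file name in the final comment are illustrative only, and the count is crude because Markdown symbols such as ‘#’ are counted as words. It is not the assessors' counting method.

```python
import json
import re


def count_markdown_words(notebook: dict) -> int:
    """Count whitespace-separated tokens in the Markdown cells of a notebook.

    `notebook` is the parsed JSON of an .ipynb file (nbformat 4 layout is
    assumed). Code cells are skipped, since Python code does not count
    towards the 2000-word minimum.
    """
    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        total += len(re.findall(r"\S+", text))
    return total


def count_words_in_file(path: str) -> int:
    """Load an .ipynb file from disk and count its Markdown words."""
    with open(path, encoding="utf-8") as handle:
        return count_markdown_words(json.load(handle))


# Tiny in-memory notebook, used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Introduction\n", "Some context here."]},
        {"cell_type": "code", "source": ["import pandas as pd\n"]},
        {"cell_type": "markdown", "source": "A short discussion."},
    ]
}
print(count_markdown_words(example))
# For a real notebook: count_words_in_file("analysis.ipynb")  # hypothetical file name
```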

Note that the submission should contain a Python notebook ending with ‘.ipynb’ and usually a data file. Other
submitted files will be ignored. For instance, if you submit only a PDF file, you will get a mark of 0.

In terms of ‘how many methods to use’, you are not expected to use every method taught in the module. Rather,
you can use two to four methods that are relevant to the research question. If you use a method incorrectly
(e.g. using k-means for regression), you will be penalised.

A breakdown of how the notebook will be marked is as follows:

• Analysis and interpretation of data – 70%
- Analysis context and aims (incl. reference to relevant literature and projects)
- Data collection, handling, cleaning and management
- Depth and scope of data analysis
- Appropriateness of data visualisation
- Interpretation and reporting of analysis and major findings
- Clarity of presentation of results
• Demonstration of technical skills – 20%
- Choice and rationale of data analysis methods used
• Creativity of analytical work – 10%
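
To illustrate how the three weights could combine, here is a minimal sketch that treats the overall mark as a weighted average of per-component marks out of 100. The guidelines only give the percentages, so the weighted-average formula, the component names, and the example marks are all assumptions made for illustration.

```python
# Weights taken from the marking breakdown, in whole percentage points.
WEIGHTS = {
    "analysis_and_interpretation": 70,
    "technical_skills": 20,
    "creativity": 10,
}


def overall_mark(component_marks: dict) -> float:
    """Combine per-component marks (each out of 100) into one weighted mark."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = set(WEIGHTS) - set(component_marks)
    if missing:
        raise ValueError(f"missing component marks: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_marks[name] for name in WEIGHTS) / 100


# Hypothetical marks, for illustration only.
example_marks = {
    "analysis_and_interpretation": 65,
    "technical_skills": 70,
    "creativity": 60,
}
print(overall_mark(example_marks))
```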

At submission, the notebook should execute in full and within a short time. Please share the dataset in a
GitHub repo and then read this dataset remotely in the notebook (e.g. using the ‘read_csv’ function as shown in
workshops). If the data size exceeds the file size limit of GitHub (100 MB), you can submit a .zip file containing
the notebook and data file. Regarding libraries, please stick to the libraries within the recommended and
original computing environment (via Docker/Vagrant/Anaconda). If you really need to use other libraries
(including fastai), you need to clearly state the names and version numbers of these libraries.
If the data cleaning and pre-processing stages require considerable time for execution, it is acceptable to
provide the processed data alongside a detailed description of the processing phase. If you use SQL to
pre-process the data, please provide the processed data without including the details of the SQL. The assessors will
return work that has not been provided in an easily executed format, and such work will incur late penalty deductions.
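
The sketch below shows one possible way to read a dataset remotely from a GitHub repo and to report library versions. The helper names, the account, repository, branch, and file path are placeholders rather than anything prescribed by the module; `pandas.read_csv` accepts a URL, and `importlib.metadata` is part of the standard library from Python 3.8.

```python
from importlib import metadata


def raw_github_url(user: str, repo: str, branch: str, path: str) -> str:
    """Build the raw-content URL for a file stored in a public GitHub repo."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path.lstrip('/')}"


def load_remote_csv(url: str):
    """Read a CSV straight from a URL into a DataFrame (needs network access)."""
    import pandas as pd  # imported here so the other helpers work without pandas

    return pd.read_csv(url)


def library_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Placeholder account, repository and file names: replace with your own.
DATA_URL = raw_github_url("your-username", "your-repo", "main", "data/dataset.csv")
print(DATA_URL)
# df = load_remote_csv(DATA_URL)  # uncomment once the placeholders are real

# Report the versions of any libraries used beyond the recommended environment.
print(library_versions(["pandas", "fastai"]))
```

Printing the versions in a cell near the top of the notebook is one simple way to state them clearly.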

Before your submission, please use the Jupyter ‘Restart & Run All’ function (or an equivalent) to
ensure that the code runs and the results are well presented.
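
If you prefer to run this check from a script, one equivalent is to execute the notebook from a fresh kernel with `jupyter nbconvert --to notebook --execute`. The sketch below wraps that command; the function names, the notebook file name, the output name, and the ten-minute timeout are assumptions, and it requires Jupyter and nbconvert to be installed.

```python
import subprocess


def build_execute_command(notebook_path: str, output_name: str = "executed.ipynb") -> list:
    """Return the nbconvert command that runs every cell from a fresh kernel."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", notebook_path,
        "--output", output_name,
    ]


def notebook_runs_cleanly(notebook_path: str, timeout_seconds: int = 600) -> bool:
    """Execute the notebook and report whether it finished without errors.

    nbconvert exits with a non-zero status when a cell raises an error.
    The timeout value is an arbitrary choice for this sketch.
    """
    result = subprocess.run(
        build_execute_command(notebook_path),
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode == 0


# Show the command without running it; the file name is hypothetical.
print(" ".join(build_execute_command("analysis.ipynb")))
# To run the check: notebook_runs_cleanly("analysis.ipynb")
```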

Structure of the notebook

These sections should be included in this notebook:
• Introduction
• Literature review
• Research question
• Presentation of data
• Methodology
• Results
• Discussion
• Conclusion

You can combine ‘Introduction’ and ‘Literature review’ into one section of ‘Introduction’, or ‘Results’ and
‘Discussion’ into a section of ‘Results and Discussion’. Note that in the literature review, you need to include at
least three relevant studies. In ‘Research question’, you need to state the question explicitly, ending with a
question mark. For example, ‘What is the relationship between Covid-19 mortality rate and local deprivation in
the UK?’ or ‘Is it possible to predict Covid-19 mortality rate using socio-demographic variables in the UK?’
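
The structure rules above can be sketched as a quick check on the headings of your notebook text. This is only an illustration: it assumes that sections are marked with Markdown ‘#’ headings whose text matches the section names exactly (ignoring case), and the dictionary and function names are hypothetical.

```python
import re

# For each required section, the headings that are taken to satisfy it.
# A combined 'Introduction' covers the literature review, and a combined
# 'Results and Discussion' covers both results and discussion.
REQUIRED_SECTIONS = {
    "Introduction": {"introduction"},
    "Literature review": {"literature review", "introduction"},
    "Research question": {"research question"},
    "Presentation of data": {"presentation of data"},
    "Methodology": {"methodology"},
    "Results": {"results", "results and discussion"},
    "Discussion": {"discussion", "results and discussion"},
    "Conclusion": {"conclusion"},
}


def extract_headings(markdown_text: str) -> set:
    """Collect Markdown headings of the form '# Title', lower-cased."""
    headings = set()
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            headings.add(match.group(1).lower())
    return headings


def missing_sections(markdown_text: str) -> list:
    """List the required sections that no heading in the text satisfies."""
    found = extract_headings(markdown_text)
    return [name for name, accepted in REQUIRED_SECTIONS.items() if not (accepted & found)]


# Illustrative outline that combines sections and omits the conclusion.
example = """# Title
## Introduction
## Research question
## Presentation of data
## Methodology
## Results and Discussion
"""
print(missing_sections(example))
```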

A title of the notebook is needed. You can use the proposed research question as the title, but other options are

Example Workbooks

Listed below are a number of example data analysis projects using Python and various libraries, combining
code and narrative (to varying extents) within a notebook format. In general, we expect a more systematic and
complete analysis than that offered here – following the steps outlined above.

• Using Python to see how the Times writes about men and women -
• How Clean are San Francisco’s Restaurants? -
• Predicting use on NYC Metro -
• San Francisco Drug Geography -
• New York Taxi Analysis - - Excellent visualisations
• Buzzfeed analysis of Segregation in St Louis -
08-st-louis-county-segregation/blob/master/notebooks/segregation-analysis.ipynb - needs better
• Graph Properties of the Twitter Stream -
• Logistic models of well switching in Bangladesh -
.ipynb - lacks descriptions of the data
• Clustering Samsung smartphone accelerometer data -
• Exploratory Analysis of the 2014 World Cup Final -
• Data mining Twitter using tweepy -
tent=14023248&utm_medium=social&utm_source=twitter - very informative!
• Flight Arrivals -
2013/blob/master/python/lecture_27_arrival.ipynb - lacks full documentation!
• Very nice analysis of how the Circle Line rogue train was caught with data -
79405c86ab6a#.oabdxcg86 - GitHub notebook, rather than Jupyter
Once marked, we would encourage you to submit your completed workbooks to or for wider sharing.

Example Datasets

We’d encourage you to find an interesting dataset that you want to work on. Here are a few examples in
case you are struggling to find one.

• NYC GPS taxi data -
• Yelp dataset -
• UK Land Registry house sales data -
• Stop and Search Data by US State -
• Traffic Accident and Traffic Flow data for 16 years -
• Real-time crime data in Seattle -
• Various FOI data releases can be found on WhatDoTheyKnow -
• Crime Data in Buenos Aires -
• Lots of open data for Bahrain -
• City Cellular Traffic Map -
• Flight data (requires Google account) -
• Beijing GPS taxi data -
• International Migration data -
• Plant Diversity in American National Parks Biodiversity -
• Wildlife Trade Database -
• H1-B Visa Petitions -
• Baltimore Crime Data -
• Chicago Crime Data -
• AWS Honeypot Cyber Attack Data (with originating lat/lngs) -
• Vancouver Crime Data -