COMP6214 Open Data Innovation
Stuart E. Middleton, sem03@soton.ac.uk
Updated: 26th January 2022
Deliverables and deadlines
Deliverable(s): Coursework 1 PDF report
Deadline: Module week 8
Marking scheme:
Task 1 Data cleaning 8 marks
Task 2 Data modelling 12 marks
Task 3 Data visualization 20 marks
Total 40 marks
Task
This coursework has three parts assessing your ability to clean and model an open dataset and then
visualize that same dataset. You will write a single PDF report which includes results for all three
tasks, which will then be marked according to the assignment marking scheme. You will not submit
your code, just a report with screenshots (i.e. evidence of visualizations produced by your code).
Task 1 - Data cleaning [8 marks]
You must identify errors within the provided assignment dataset and correct them. The assignment
dataset is an Excel spreadsheet obtained from a UK government open data website. It can be
downloaded from the module wiki page (see link to CW1-BusinessImpactsOfCovid19Data.xlsx).
A total of 10 errors have been introduced to this dataset for you to find and correct. You can use
techniques learnt from the lecture content to help you, or other tools you find on the web.
Your report must document how you found the errors (including tools used and justification for why
they were used), what they were, how you corrected them and what validation approach you used
to check the clean dataset was error free.
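As an illustration only, the kind of validation check described above can be sketched in a few lines of Python. The column names (industry, response, percent), the sample rows and the rules are all hypothetical and are not taken from the assignment dataset; the sketch only shows how simple automated checks can flag missing values, out-of-range values, stray whitespace and duplicate records after cleaning.

```python
from collections import Counter

# Hypothetical rows standing in for one worksheet of a dataset;
# the column names and values are invented for illustration only.
ROWS = [
    {"industry": "Manufacturing", "response": "Currently trading", "percent": 85.2},
    {"industry": "manufacturing ", "response": "Currently trading", "percent": 85.2},
    {"industry": "Construction", "response": "Currently trading", "percent": 120.0},
    {"industry": "Retail", "response": "Currently trading", "percent": None},
]


def normalise(text):
    """Collapse whitespace and case so near-duplicate labels compare equal."""
    return " ".join(text.split()).casefold()


def validate(rows):
    """Return a list of (row_index, problem) pairs found in the rows."""
    problems = []
    seen = Counter()
    for index, row in enumerate(rows):
        value = row["percent"]
        if value is None:
            problems.append((index, "missing percent"))
        elif not 0.0 <= value <= 100.0:
            problems.append((index, "percent out of range"))
        if row["industry"] != row["industry"].strip():
            problems.append((index, "stray whitespace in industry"))
        # Count each normalised (industry, response) pair to spot duplicates.
        key = (normalise(row["industry"]), normalise(row["response"]))
        seen[key] += 1
        if seen[key] > 1:
            problems.append((index, "duplicate industry/response pair"))
    return problems


if __name__ == "__main__":
    for index, problem in validate(ROWS):
        print(f"row {index}: {problem}")
```

Running the same checks on the cleaned data and getting an empty list is one possible form of validation evidence, although it only covers the rules that were written down.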
Task 2 - Data modelling [12 marks]
You must model the dataset in the open data format RDF and populate the model using the data from
the dataset. You should export your RDF triples as Turtle (TTL) formatted output.
You can use techniques learnt from the lecture content to help you. You can use the example code
package which can be downloaded from the module wiki page (see link to java-rdf-example-code.zip),
or other tools you find on the web.
Your report must document the knowledge representation (i.e. ontology classes and predicates) you
chose to represent the knowledge extracted from the dataset as RDF. You should justify why you chose
this knowledge representation in the context of other choices, and why you think it has a good
balance between expressiveness and conciseness and delivers conceptual clarity. You should provide
a clear diagram showing the ontology used alongside instance frequency statistics (i.e. number of
instances of each class), and a small snippet from the TTL file you serialized (i.e. max half a page of
TTL). You should also explain your data ingest approach and RDF model construction and
serialization approach.
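As a minimal sketch of what ingest, model construction, serialization and instance frequency statistics can look like, the Python below builds triples from a couple of rows and writes them as Turtle. It is not the module's Java example package and it uses no RDF library. The namespace http://example.org/cw1#, the classes ex:Industry and ex:Observation, the predicates ex:forIndustry and ex:percent, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical namespace; a real model would choose its own.
PREFIXES = (
    "@prefix ex: <http://example.org/cw1#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)

# Hypothetical rows; column names and values are invented for illustration.
ROWS = [
    {"industry": "Manufacturing", "percent": 85.2},
    {"industry": "Construction", "percent": 79.5},
]


def local_name(label):
    """Turn a label into a simple local name by replacing non-alphanumerics."""
    return "".join(ch if ch.isalnum() else "_" for ch in label.strip())


def build_triples(rows):
    """Create (subject, predicate, object) tuples for each row."""
    triples = []
    industries = set()
    for number, row in enumerate(rows, start=1):
        industry = "ex:" + local_name(row["industry"])
        if industry not in industries:
            # Declare each industry instance once only.
            triples.append((industry, "a", "ex:Industry"))
            industries.add(industry)
        observation = f"ex:observation{number}"
        triples.append((observation, "a", "ex:Observation"))
        triples.append((observation, "ex:forIndustry", industry))
        triples.append((observation, "ex:percent", f'"{row["percent"]}"^^xsd:double'))
    return triples


def to_turtle(triples):
    """Serialize the triples as one Turtle statement per line."""
    lines = [f"{s} {p} {o} ." for s, p, o in triples]
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


def instance_counts(triples):
    """Count instances per class from the rdf:type ('a') triples."""
    return Counter(o for _, p, o in triples if p == "a")


if __name__ == "__main__":
    triples = build_triples(ROWS)
    print(to_turtle(triples))
    print(dict(instance_counts(triples)))
```

Modelling each spreadsheet row as an observation linked to shared instances is only one of several possible representations; the report is expected to justify whichever one is chosen against the alternatives.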
Task 3 - Data visualization [20 marks]
You must create a multi-dimensional interactive visualisation of your RDF model for the assignment
dataset using a Linked Data Visualization tool. Your visualisation should have suitable interactivity
that allows for manipulation, filtering, and detailed analysis of the data.
You should aim to develop a multidimensional (greater than 2 dimensions) visualisation that enables
rich exploration of the data. Note that 'multidimensional' refers to the dimensions of the data, not
the visualisation (i.e. you are expected to use values from at least 3 columns from the provided
dataset to create your visualisation from one or more worksheets).
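To make the idea of data dimensions concrete, the Python sketch below filters and aggregates records that vary along three dimensions plus one measure, which is the kind of data operation that sits behind interactive filtering in a visualisation tool. The dimension names (industry, size_band, wave), the measure (percent) and the values are hypothetical and are not taken from the assignment dataset.

```python
from collections import defaultdict

# Hypothetical records with three dimensions and one measure.
RECORDS = [
    {"industry": "Manufacturing", "size_band": "0-49", "wave": 1, "percent": 80.0},
    {"industry": "Manufacturing", "size_band": "250+", "wave": 1, "percent": 90.0},
    {"industry": "Construction", "size_band": "0-49", "wave": 1, "percent": 70.0},
    {"industry": "Manufacturing", "size_band": "0-49", "wave": 2, "percent": 84.0},
]


def filter_records(records, **selected):
    """Keep records whose value for each named dimension is in the allowed set."""
    return [
        record
        for record in records
        if all(record[dim] in allowed for dim, allowed in selected.items())
    ]


def mean_by(records, dimension, measure="percent"):
    """Average the measure for each distinct value of one dimension."""
    groups = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record[measure])
    return {key: sum(values) / len(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Filter on two dimensions, then break the result down by the third.
    chosen = filter_records(RECORDS, industry={"Manufacturing"}, wave={1})
    print(mean_by(chosen, "size_band"))
```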
Examples of tools you might use are in the resources section of this document and lecture content.
Your report must describe the visualisation tools and techniques used in enough detail to show a
deep understanding. You should justify your choice of visualisation tools and techniques in the
context of the problem and alternatives that are available. Your report should describe a
hypothetical scenario for which your multi-dimensional interactive visualisation could be used with
the assignment dataset. You should walk the reader through this scenario, using your interactive
system running on the assignment dataset to provide sufficient screenshots to show its rich features,
multi-dimensional capabilities and support for interactive data manipulation, filtering, and analysis.
Report structure
Your PDF report should not be longer than 20 pages (including all sections and screenshots) and use
font size 12. Include a title, your name and student number but no abstract or table of contents.
Failing to use the required structure or exceeding the page limit will be penalized.
Your assignment PDF report should have the following section headings:
Title, Student Name, Student Number
1 Data cleaning
1.1 Approach to data cleaning with justification
1.2 Errors identified and validation approach used to check cleaned dataset
2 Data modelling
2.1 Knowledge representation for RDF model with justification
2.2 Approach to data ingest
2.3 Ontology with instance frequency statistics and TTL snippet
3 Data visualization
3.1 Approach to multi-dimensional interactive visualisation with justification
3.2 Hypothetical scenario for multi-dimensional interactive visualisation with walk-through
4 References and links
Support resources
Open data tools
OpenRefine
A free, open source, powerful tool for working with messy data
https://openrefine.org/
CSV Lint
CSVLint helps you to check that your CSV file is readable. And you can use it to check
whether it contains the columns and types of values that it should.
http://csvlint.io/
JSON Lint
JSONLint - The JSON Validator
https://jsonlint.com
Linked data visualization tools
LD-VOWL: Visualizing Linked Data Endpoints
http://vowl.visualdataweb.org/ldvowl.html
Tableau
https://www.tableau.com/learn/get-started
Marbles
https://mes.github.io/marbles/
D3
https://d3js.org/
Coursework package
Example Java code to turn an Excel spreadsheet into RDF (Turtle serialization)
see module wiki link to java-rdf-example-code.zip
Notes and Restrictions
You can use any tools you find on the web to help you. Your report must justify your choices in the
context of other tools/approaches and explain how you used them in enough detail to show a deep
understanding of the tools used.
Learning Outcomes
D2. Apply appropriate validation, cleaning and transformation to use, reuse and combine a
multitude of complex datasets
B2. Critically evaluate a large range of Infographics and interaction techniques suitable for different
tasks
Late submissions
Late submissions will be penalised according to the standard rules. All submissions must use the
handin system (emailed submissions cannot be accepted).
If you need an extension, apply before the deadline via special considerations (link below). We advise
you to submit a draft report before the deadline, then submit a revised report only if your extension
is approved. Submitting a late report with an extension request pending, then having the extension
request refused will trigger an automatic late penalty. The module team do not process extension
requests; we are simply notified by the special considerations team if you are successful.
https://www.southampton.ac.uk/quality/assessment/special_considerations.page
Plagiarism
The final report needs to be the student’s own work unless mentioned otherwise. You must not copy
text from third party sources without clear attribution (i.e. paraphrasing) or work in teams (i.e.
collusion). If you quote text/images/tables be clear you have done so, "quote" the
text/images/tables and provide a citation from the original source. Reports will be checked with
Turnitin.
This is important as any violations, deliberate or otherwise, will be automatically reported to the
Academic Integrity Officer.