CB9166/ORBS7120 – Big Data Analytics and Visualisation
Individual project instructions

1. Assessment structure
The individual project accounts for 80% of the module grade. Please choose one data set from
the list below for your project. The report should be 2000 words long, excluding
references. All relevant literature and resources for your project should be properly
cited in the Harvard referencing style (you will find this website helpful). For your
convenience, a template is provided here.

Marks allocated to
20% 1. Introduction to data and research question (700 words)
Please introduce the data set used and its background. The relevant
literature (e.g., academic journal articles and textbooks) should be
surveyed and properly cited in the Harvard referencing style. More
importantly, please identify a problem to be addressed with this data set
(i.e., the research question). Please note that the problem should be
specific (i.e., relevant in the application domain and linked to the variables
available from the data set).

15% 2. Data processing and exploration (300 words)
Please explain: Which variables are available from the data set? Which
variables have been selected for the analysis and why? What data
transformations have been done and why?
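
As an illustration of the kind of data transformation that could be reported in this section, here is a minimal sketch of text cleaning using only the Python standard library. The function name `clean_text`, the short stop-word list and the sample sentence are assumptions made for illustration; they are not part of the module materials, and a real project would likely use a fuller stop-word list and a dedicated text analytics library.

```python
import re

# Hypothetical stop-word list for illustration only; far from complete.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "i", "to", "of"}


def clean_text(text):
    """Lower-case the text, remove URLs and non-letter characters,
    split on whitespace, and drop stop words. Returns a list of tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# Made-up example sentence, not taken from any of the recommended data sets.
tokens = clean_text("This dress is GREAT!! See https://example.com")
print(tokens)
```

Each step (lower-casing, URL removal, stop-word filtering) is a transformation whose purpose can be explained in the report, for example that removing URLs prevents link fragments from dominating word counts.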

25% 3. Data visualisation and interpretation (600 words)
Please provide at least three data visualisations as descriptive analytical
results (e.g., properties of the variables selected) and advanced analytical
results (e.g., relationships between the variables selected, machine
learning results). Please follow the best practices for data visualisation
taught in the module. Importantly, please interpret the results and
findings in detail. Note that the data visualisations should be nontrivial
representations of information, yet easy to interpret.

20% 4. Data insights and conclusions (400 words)
Please provide the insights drawn from the analytics and summarise the
findings. In particular, is the problem (i.e., research question) identified at
the beginning addressed by the analytics? How?

20% 5. Writing, styling and references
The clarity, logic and presentation of the report, including spelling,
grammar and punctuation. The general styling and references should be
clear and consistent.

2. Recommended data sets
Please find a list of recommended data sets below. All of them have significant textual
content (a major type of unstructured data). Therefore, text analytics tools should be
employed. Your analysis could build on existing code shared by the online community (e.g.,
from Kaggle.com). If so, please cite the original sources (links or relevant publications)
properly in the Harvard referencing style.
• [Business] Amazon review data: This dataset includes reviews (ratings, text,
helpfulness votes), product metadata (descriptions, category information, price,
brand, and image features), and links (also viewed/also bought graphs), covering 29
product categories from Amazon. You may focus on a single category for your
individual project, but please do not use the software or magazine categories, as they
were already used as examples in class. Please note that you will be asked to complete a
short form regarding proper use of the data when downloading it for the first time.
• [Society] COVID19 tweets: Tweets carrying the #covid19 hashtag. Collection started on
25/7/2020, with an initial batch of 17k tweets.
• [Finance] Daily news for stock market prediction: A combination of historical news
headlines from the Reddit WorldNews channel and stock data for the Dow Jones
Industrial Average (DJIA).
• [Society] US Election 2020 Tweets: Tweets containing the hashtags of the candidates’
names collected during the election period.
• [Business] Women's e-commerce clothing reviews: A women's clothing e-commerce
data set centred on the reviews written by customers.
You may choose another data set not listed here. If so, please contact the module
convenor for approval before conducting the project.
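
Since all the recommended data sets are text-heavy, a simple starting point for text analytics is to compare the most frequent terms across groups (for example, review ratings). The sketch below uses only the Python standard library. The function name `top_terms_by_rating` and the three sample records are hypothetical and are not drawn from any of the data sets above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (rating, review text) records for illustration only.
sample = [
    (5, "Lovely fabric and a perfect fit"),
    (1, "Poor fit and the fabric felt cheap"),
    (5, "Perfect colour, lovely dress"),
]


def top_terms_by_rating(records, n=3):
    """Count word occurrences separately for each rating and
    return the n most common (word, count) pairs per rating."""
    counts = defaultdict(Counter)
    for rating, text in records:
        tokens = re.findall(r"[a-z]+", text.lower())  # crude tokeniser
        counts[rating].update(tokens)
    return {rating: counter.most_common(n) for rating, counter in counts.items()}


result = top_terms_by_rating(sample, n=2)
for rating, terms in sorted(result.items()):
    print(rating, terms)
```

Term counts of this kind could feed a bar chart or similar visualisation. In practice, stop words would normally be removed first so that words such as "and" do not crowd out informative terms.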