CITS 2401 Computer Analysis and Visualisation Page 1 of 6
Assignment 3: Advanced Data Processing and Visualisation Worth: 5% of
the unit Submission: Answer the questions on the quiz server. Deadline:
14 Oct. 2021 6pm Late submissions: late submissions attract 5% penalty
per day up to 7 days (i.e., 20 Oct. 2021 6pm). Afterwards, the mark will
be 0 (zero). Also, any plagiarised work will be forwarded to Academic
Conduct Advisor. 1. Outline In an online review site, customers can
share their experiences which can help others to make informed decisions
about cafes, supermarkets, restaurants, auto-repair shops, and so on. A
review made by a customer may have several emotions (e.g., sad, happy,
surprise, disgust, joy, etc.) associated with the review texts. Also,
the reviews in the online sites have other valuable data such as the
information related to the user (e.g., userID) who made the review,
posting time of the review, location coordinates, etc. The reviews of
customers have a huge impact on business as a customer’s emotions about
the business site are concealed in their perspective reviews. Therefore,
such information from an online review site can provide several
opportunities for a data analyst to analyse and visualize the data in a
sophisticated manner. In this assignment, you need to perform some tasks
that are related to the scope of location-enabled online review data
analytics. The instructions and the expected outcomes of the tasks are
provided in Section 2 of this document. Materials necessary to complete
this assignment include lectures up to week 11 Interpolation and Curve
Fitting, and their related labs. You are provided two sample csv files
location_review_data.csv and user_friend_data.csv, where, the first file
contains 13 columns with header created_at, user_ID, review_ID,
latitude, longitude, location_ID, friend_count, follower_count, sad,
happy, surprise, disgust, joy. The first eight columns of the file
contain information on (1) time of posting a review, (2) user id (who
made the review), (3) review id (id associated with review), (4)
latitude, (5) longitude, (6) location id of the location where the
review posted, (7) number of friends, (8) number of followers of the
user who posted the review. The last five columns of the file contain
the scores (between 0 and 1) for five user- emotions. Note, the column
values associated with friend_count, follower_count are masked. The
second csv file user_friend_data.csv contains user_ID, friend_count,
follower_count, user_locID, where the column user_locID contains the
location of the users (user_ID). Note: 1. This is an individual
assignment, please don't share your solution/code/files with others
(only high- level discussion is allowed, e.g., the syntax of the
formula, use of modules with other examples etc.). If it is found to be
not your original work, then you will be reported to Academic Conduct
Advisor. 2. You should not use any print() statement in your submitted
code on the Quiz Server. However, you can use print() statement to test
your code’s output in Python’s IDE (e.g., Thonny, Wing101, etc.). 3. In
case you find any technical difficulties, please post them on Help Forum
in LMS. Our teaching team will assist you. But, do not share your code
on the Help forum. You can ask lab facilitators on your registered
face-to-face/online labs in case you face any problem on your
programming. CITS 2401 Computer Analysis and Visualisation Page 2 of
6 2. Tasks You will find example test cases for the below mentioned
tasks once you start the quiz on the Quiz Server. Two sample csv files
location_review_data.csv and user_friend_data.csv are available in the
Assignment 3 Instructions (CITS2401-Sem2) Folder on Quiz Server. Task 1.
Load data in NumPy Array (4) Write a function load_mydata(filename)
that considers filename (type string) as input. The filename should be a
csv (comma separated values) file with the columns created_at, user_ID,
review_ID, latitude, longitude, location_ID, friend_count,
follower_count, sad, happy, surprise, disgust, joy as given in the
sample data file location_review_data.csv. The function will extract the
data (including the column headers) corresponding to the nine columns
in the order as follows: 1. created_at 2. user_ID 3. review_ID 4.
location_ID 5. sad 6. happy 7. surprise 8. disgust 9. joy The extracted
data should be stored in the NumPy array format (i.e., it should produce
'numpy.ndarray'>). No other post-processing is needed at this point. The resulting output will now be
known as data.
Note: when importing, set the delimiter to be ',' (i.e., a comma) and the quotechar to be '”' (i.e., a
double quotation mark).
Task 2. Formatting Datatype (4)
The datatype of the array created by NumPy in Task 1 is unstructured. This is because, in the default
setting, NumPy decides the datatype for each value. Also, the output in Task 1 contains the header row
that may not be required in our assignment. So, remove the header row and convert all the columns to
type float (i.e., "float") apart from the columns specified by the input parameter indexes (mentioned
below). Also, the remaining columns which are not mentioned in indexes should be in Unicode of length
30 characters (i.e., "
Write a function data_type_format(data, indexes) that can complete the above-mentioned task,
where, the input data is a NumPy array and indexes contains the column indices (in list) which are to
be converted into converted to type float.
Task 3. Formatting Date Time (4)
The "created_at" column of the NumPy array (say in the output array 'data' in Task 1) is obtained by
loading the csv input file (e.g., location_review_data.csv) contains the time of each review post. But,
the current time format is not ideal for comparison between an old review and a new one. To change
this, we are going to reformat the column (we got the type of the "created_at" column as Unicode of
length 30 after conversion in Task 2).
Write a function convert_to_datetime(array) that converts the current (e.g., original) timestamp
format into a new format as follows:
Current format: [day] [month] [day value] [hour]:[minute]:[second] [time zone difference] [year]
New format : [year]-[month value]-[day value] [hour]:[minute]:[second]
For example, a current format value:
Wed Feb 05 17:20:01 +0000 2020
will be converted to:
2020‐02‐05 17:20:01
CITS 2401 Computer Analysis and Visualisation
Page 3 of 6
Task 4. Box Plot (6)
The dataset location_review_data.csv provided for this assignment contains a series of
measurements (i.e., scores) for five emotions (such as sad, happy, surprise, disgust, joy) corresponding
to each review made. The box (or boxes and whiskers) plot is a good choice to visualise and compare
the emotions in the dataset.
Write a function box_plot(data, output_name) that takes the input data (type NumPy Array)
and output_name (type String) where data contains the data available in
the location_review_data.csv file (Hint: Q1) and output_name is the file name (output.png) to be
plotted using the below instructions.
Make a boxplot for each of the 'sad', 'happy', 'surprise', 'disgust', and 'joy' emotions from the
data, in the order provided.
Figure size should be set to (10, 8).
The linestyle is set to '-', linewidth = 2, and color to red (hint: lookup medianprops).
Patch_artist of the boxplot is set to True.
Set the facecolor order as brown, yellow, blue, white, cyan.
Set the title to be 'Distribution of Review Emotions'.
Add x-axis data labels 'Sad', 'Happy', 'Surprise', 'Disgust', 'Joy' in this order.
Set the y-axis grid to True.
Set the x-axis label to 'Emotions'.
Set the y-axis label to 'Scores'.
Finally, save the graph as output_name with a default value "output.png".
The expected output is shown below using the provided using the sample dataset
(location_review_data.csv).
CITS 2401 Computer Analysis and Visualisation
Page 4 of 6
Task 5. Count Outliers (5)
As seen in the boxplot figure, there are several points that are outside of the boxes. For this task, we
will calculate the number of outliers for a selected emotion (e.g., sad, happy, surprise, disgust, joy),
specified by upper and lower boundary values.
Write a function count_outliers(emotion, lower, upper) that considers three inputs emotion,
lower, upper, where emotion contains the records corresponding to any one among the five emotions
(e.g., sad, happy, surprise, disgust, joy). The function count_outliers(emotion, lower,
upper) should return the number of outliers in the provided data range emotion (among the five
emotions mentioned above), with given lower and upper percentile. For example, if lower = 25
and upper = 75, then you should regard all the points below 25% and above 75% as outliers, inclusive.
Note: The input parameter emotion is of class numpy.ndarray.
Task 6. Euclidean Distance between Two Emotions Array (2)
The Euclidean distance (ED) between two numpy array arr1 = [x1 y1] and arr2 = [x2 y2] can be
calculated as follows.
The ED calculation formula can be extended to higher dimension vectors also.
Write a function euclidean_dist(emotion1, emotion2) that takes two vectors emotion1, emotion2 of
class numpy.ndarray as input parameters, and returns the Euclidean distance (ED) score
between emotion1, emotion2. If the sizes of the input parameters are different, the function should
return None.
Note: You cannot use for or while loop for this question. You can use the math module and it has been
imported already in the Quiz Server (e.g., import math.)
Task 7. Import Specific Columns to Dataframe (4)
Write a program import_to_dataframe(filename, columns) which can import the contents of
specific columns (type list of integers) of an input csv file filename to a pandas dataframe.
Here, columns is a list of integers that contains the index of the columns to be imported into the
dataframe (e.g., ).
For example, if columns = [0, 1, 2], the dataframe will import the first three columns of the
input filename.
Task 8. Convert NumPy Array into Dataframe (4)
Pandas provide various methods that can be used to handle data more efficiently. In this task, you should
convert NumPy array data into Pandas dataframe type.
Write a function convert_to_dataframe(data) that consider input data of class numpy.ndarray and
convert it into the Pandas dataframe type.
Note: Consider the data's dtype names as column headers and their associated data values.
Task 9. Load Data to Dataframe (4)
Write a function load_mydataframe(filename) where the input parameter filename (type string) is a
csv file (e.g., location_review_data.csv) having 13 columns (e.g., created_at, user_ID, review_ID,
latitude, longitude, location_ID, friend_count, follower_count, sad, happy, surprise, disgust, joy). The
function load_mydataframe(filename) should extract 9 (Nine) columns in the order as follows:
1. created_at
2. user_ID
3. review_ID
4. location_ID
5. sad
6. happy
CITS 2401 Computer Analysis and Visualisation
Page 5 of 6
7. surprise
8. disgust
9. joy
The extracted data should be stored in the pandas dataframe format (i.e., it should produce 'pandas.core.frame.DataFrame'>). No other post-processing is needed at this point. The resulting output
will now be known as my_dataframe.
Note: when importing, set the delimiter to be ',' (i.e., a comma) and the quotechar to be '”' (i.e., a
double quotation mark).
Task 10. Load User info Records to Dataframe (3)
Write a function load_user_info(filename) that considers the input filename (type string) as a .csv
file and returns the Pandas dataframe of the records from the filename. You can assume that there
must be some data in the file filename (e.g., the file filename should not be empty).
Note: The data separator of the input file should be a comma (e.g., ',').
Task 11. Merging Dataframes (5)
You may notice that the friend_count and follower_count columns of the sample csv
file location_review_data.csv are masked (e.g., no numeric values are available). You are provided
another csv file user_friend_data.csv that contains the number of friends (e.g., friend_count) and
the number of followers (e.g., follower_count) corresponding to each user_ID. In Task 9 (e.g., Q 9)
and Task 10 (e.g., Q 10), you have loaded the file location_review_data.csv (with specific 9 columns)
and user_friend_data.csv (4 columns) into the memory space as a Pandas dataframe type,
respectively. As both the dataframe has the common column header user_ID, we will use that
information to join two dataframes (refer Task 9 and Task 10) to form a single dataframe.
Write a function merge_dfs(df_data, df_users) that takes two dataframes df_data and df_users as
inputs. Note, the inputs df_data and df_users contains the dataframes returned by the
functions load_mydataframe() and load_user_info() of Task 9 (e.g., Q 9) and Task 10 (e.g., Q 10),
respectively.
In this task, the function merge_dfs(df_data, df_users) should perform the following tasks:
(1) Join the two dataframes df_data and df_users using the user_ID (Note: check user_ID should
be common to both the dataframes). The join method is 'inner' (will come in handy). As not all
rows will match the user_ID (as some user_ID (i.e., 151421, 45097) may not available
in df_users dataframe), make sure to drop those records (row containing these user_IDs that
are not available in both dataframes). Finally, the function will create a single
dataframe df_merged by joining the dataframes df_data and df_users.
(2) Finally, return the dataframe df_merged
Hint: You must use the pandas dataframe join method.
Note: depending on your approach, this task may take time.
Task 12. Combining of the Above Tasks (Task 1 to Task 11) (5)
Submit your Python code in the box (under Question 12 on Quiz Server) containing the functions to solve
all the above-mentioned tasks. You also need to upload a python file (.py) on the Quiz Server as detailed
in Section 3. Failure to upload the Python file will result in a penalty of 50% of entire assignment.
CITS 2401 Computer Analysis and Visualisation
Page 6 of 6
3. Submission
You are expected to have read and understood the University's guidelines on academic conduct.
Following this policy, you may discuss with other students the general principles required to understand
this project, but the work you submit must be the result of your own effort. Plagiarism detection, and
other systems for detecting potential malpractice, will therefore be used. Besides, if what you submit is
not your own work then you will have learned little and will, therefore, likely, fail the final exam.
You must submit your project before the submission deadline listed above. Following UWA policy, a late
penalty of 5% will be deducted for each day (24 hours), after the deadline, that the assignment is
submitted. No submissions will be allowed after 7 days following the deadline except approved special
consideration cases.
Submit your Python code in the quiz answer box containing all the functions related to the tasks on the
quiz server (already mentioned in Task 13), as well as a python file which you should name as
A3_StudentId.py. For example, if your student ID is 12345678, then your file name is
A3_12345678.py.
Fail to follow the instructions in this assignment may result in zero grade.