FIT5196-S1-2021 Assessment 2
This is an individual assessment worth 35% of your total mark for FIT5196.
Due date: Please check
Assessment 2: Exploratory Data Analysis and Data Cleansing (Weight: 35%)
Data Cleansing (60%)
For this assessment, you are required to write Python code to analyze your dataset and to find
and fix the problems in the data. The input and output of this task are shown below:
Table 1. The input and output of the task
Input Output Other Deliverables



Note1: All files except PDF must be zipped into a file named
Note2: is to be replaced with your student ID.
Note3: Each student can find their three input files here

Exploring and understanding the data is one of the most important parts of the data wrangling
process. You are required to apply graphical and/or non-graphical Exploratory Data Analysis
(EDA) methods to understand the data first and then find the data problems. You are required to:
• Detect and fix errors in _dirty_data.csv
• Detect and remove outlier rows in _outlier_data.csv
o (outliers are to be found with respect to the delivery_cost attribute)
• Impute the missing values in _missing_data.csv
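One common non-graphical way to flag outliers in delivery_cost is the boxplot (IQR) rule. The sketch below is illustrative only: the helper name iqr_outlier_mask, the multiplier k = 1.5, and the toy values are assumptions, not part of the specification. Because the delivery cost depends on branch, season, and distance, the same rule may be more meaningful when applied per group or to the residuals of a fitted model rather than to the raw column.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy data; the real input would be loaded with pd.read_csv.
orders = pd.DataFrame(
    {"delivery_cost": [70.0, 72.5, 68.0, 75.0, 71.0, 69.5, 250.0]}
)

mask = iqr_outlier_mask(orders["delivery_cost"])

# Keep only the rows that are not flagged as outliers.
cleaned = orders[~mask].reset_index(drop=True)
```

The choice of k and of the grouping is a judgment call that should be justified by the EDA.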
As a starting point, here is what we know about the dataset at hand:
The dataset contains furniture delivery data from an online furniture store in Melbourne,
Australia. The furniture store has five branches around the Melbourne area. All five branches
share the same catalogs, but they have different management, so they operate differently.
Each instance of the data represents a single order from the online furniture store. The description
of each data column is shown in Table 2.
Table 2. Description of the columns
Column Description
Sales_id the unique id of the sale
Date The date the order was made, given in DD-MM-YYYY format
Time The time the order was made, given in hh:mm:ss format
Shopping_cart A list of tuples representing the order items: the first element of
each tuple is the item ordered, and the second element is the quantity
ordered for that item.
For example, in [('wardrobe', 3)], 'wardrobe' is the item ordered and
3 is the quantity.
Parcel_size A categorical value representing the size of the parcel,
namely small, medium, or large
Price A float value representing the order total price
Customer_Lat Latitude of the customer
Customer_Long Longitude of the customer
Is_loyal_customer A logical (Boolean) variable denoting whether the customer has a
loyalty card with the store (1 if the customer has a loyalty card, else 0)
Nearest_storehouse_id the unique id of the storehouse
Nearest_storehouse A string denoting the name of the nearest storehouse to the customer
dist_to_nearest_storehouse A float representing the distance, in kilometers, between the
storehouse and the customer.
delivery_cost A float representing the delivery fee of the order
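
Since Shopping_cart lists items with quantities and Price is the order total, unit prices can be estimated by treating each order as one linear equation. The sketch below assumes that the total is exactly the sum of quantity times unit price, with no discounts. The item names 'desk' and 'sofa', all prices, and the helper cart_matrix are hypothetical and chosen only for illustration.

```python
import ast

import numpy as np


def cart_matrix(carts, items):
    """Build an (orders x items) matrix of ordered quantities."""
    index = {name: j for j, name in enumerate(items)}
    matrix = np.zeros((len(carts), len(items)))
    for i, cart in enumerate(carts):
        for name, qty in cart:
            matrix[i, index[name]] += qty
    return matrix


# Hypothetical rows as they might appear in the Shopping_cart and Price columns.
raw_carts = [
    "[('wardrobe', 3)]",
    "[('wardrobe', 1), ('desk', 2)]",
    "[('desk', 1), ('sofa', 1)]",
    "[('sofa', 2), ('wardrobe', 1)]",
]
prices = np.array([1500.0, 900.0, 1000.0, 2100.0])

# The cart is stored as text in the csv, so parse it safely into Python objects.
carts = [ast.literal_eval(text) for text in raw_carts]
items = sorted({name for cart in carts for name, _ in cart})

A = cart_matrix(carts, items)

# Least squares tolerates having more orders than items; with consistent
# data it recovers the exact unit prices.
unit_prices, *_ = np.linalg.lstsq(A, prices, rcond=None)
price_by_item = dict(zip(items, unit_prices.round(2)))
```

On real data, the system should be built only from rows believed to be clean, because a single wrong cart or price shifts the least-squares estimate.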

1. The output csv files must have the exact same columns as the input.
2. There is at least one anomaly in the dataset from each category of the data anomalies (i.e.,
syntactic, semantic, and coverage).
3. In the file _dirty_data.csv, any row can carry no more than one anomaly (i.e.,
there can only be one anomaly in a single row, and all anomalies are fixable).
4. There are no data anomalies in the file _outlier_data.csv, only outliers and
duplicates. Similarly, there are no data anomalies other than missing value problems in the
file _missing_data.csv.
5. A useful Python package to solve a linear system of equations is numpy.linalg.
6. Price of the shopping cart is based on the item and quantity of item purchased by the
customer.
7. The radius of the earth is 6371 km.
8. Delivery cost is calculated using a different method for each branch.
The delivery cost depends linearly (but in different ways for each branch) on:
a. Distance of the customer's location from the storehouse - as a continuous variable
b. Season of the year - as a discrete variable
In Melbourne, the seasons of the year are classified as:
a. Summer – from 1st December 00:00:00 to 28th February 23:59:59
b. Autumn – from 1st March 00:00:00 to 31st May 23:59:59
c. Winter – from 1st June 00:00:00 to 31st August 23:59:59
d. Spring – from 1st September 00:00:00 to 30th November 23:59:59

In a leap year, summer ends on 29th February 23:59:59.

If a customer has a loyalty card, they get a 10% discount on the delivery fee.
9. As EDA is part of this assessment, no further information will be given publicly regarding
the data. However, you can brainstorm with the teaching team during tutorials and
consultation sessions.
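
The distance and season inputs described in notes 7 and 8 can be derived with two small helpers. This is a minimal sketch: the function names haversine_km and season_of are hypothetical, and the use of the haversine great-circle formula is an assumption suggested by the stated earth radius, not something the specification confirms.

```python
from datetime import date, datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # radius given in note 7


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def season_of(d: date) -> str:
    """Map a date to its Melbourne season as defined in note 8."""
    if d.month in (12, 1, 2):
        return "Summer"
    if d.month in (3, 4, 5):
        return "Autumn"
    if d.month in (6, 7, 8):
        return "Winter"
    return "Spring"


# Dates in the data use the DD-MM-YYYY format described in Table 2.
order_date = datetime.strptime("29-02-2020", "%d-%m-%Y").date()
order_season = season_of(order_date)
```

Classifying by month handles leap years automatically, because 29th February always falls in month 2.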

Methodology (20%)
The report should demonstrate the methodology (including all steps) used to achieve the correct results.

Documentation (20%)
The cleaning task must be explained in a well-formatted report (with appropriate sections and
subsections). Please remember that the report must explain the complete EDA used to examine the
data, your methodology for finding the data anomalies, and the suggested approach to fixing them.