BIA B452F Assignment 1
Weighting: 25%
Deadlines:
• Part A (10%) – 26 March 2021 (Friday)
• Part B (15%) – 9 April 2021 (Friday)
Learning outcomes:
• Explain and select analytic techniques for business intelligence and big data analysis.
• Apply data visualization tools and predictive analytics to summarize and analyze business data.
Important note:
• There may not be a single correct answer to the questions. Answers may differ from one student to
another and still be equally valid.
• This is an individual assignment. Copying some or all of another student’s assignment is plagiarism.
• Discussing your assignments with other students and seeking their comments and advice is
acceptable but it is not acceptable for two students to hand in assignments that are substantially the
same. When you collaborate on an individual assignment, it is important that the final product is
your own work.
Investigating Online Shoppers’ Purchasing Intention
In this assignment, you will perform exploratory and clustering analyses to investigate online shoppers’
purchasing intention based on clickstream data obtained from the navigation paths of online shoppers.
The numerical and categorical features used to study purchasing intention are given in Tables 1 and 2
below. The sample dataset “online_shoppers.csv” consists of feature vectors belonging to 12,330 sessions.
Table 1 – Numerical features
Feature name               Feature description
Administrative             Number of pages visited by the visitor about account management
Administrative duration    Total amount of time (in seconds) spent by the visitor on account
                           management related pages
Informational              Number of pages visited by the visitor about Web site, communication
                           and address information of the shopping site
Informational duration     Total amount of time (in seconds) spent by the visitor on informational
                           pages
Product related            Number of product related pages visited by the visitor
Product related duration   Total amount of time (in seconds) spent by the visitor on product
                           related pages
Bounce rate                Average bounce rate value of the pages visited by the visitor.
                           The “Bounce Rate” of a Web page is the percentage of visitors who
                           enter the site from that page and then leave (“bounce”) without
                           triggering any other requests to the analytics server during that
                           session.
Exit rate                  Average exit rate value of the pages visited by the visitor.
                           The “Exit Rate” of a Web page is the percentage of all pageviews
                           of that page that were the last in the session.
Page value                 Average page value of the pages visited by the visitor.
                           The “Page Value” feature represents the average value for a Web page
                           that a user visited before completing an e-commerce transaction.
Special day                Closeness of the site visiting time to a special day
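The bounce rate and exit rate definitions in Table 1 can be illustrated with a small worked example. The sketch below is only an illustration: the four sessions and the page names ("home", "product", "cart") are made up, and the dataset itself already provides these rates averaged per session.

```r
# Hypothetical sessions: each vector is the ordered list of pages viewed.
sessions <- list(
  c("home", "product", "cart"),
  c("home"),                    # a single-page session, i.e. a bounce
  c("product", "home"),
  c("home", "product")
)
pages <- unique(unlist(sessions))

# Count pageviews, exits (last page of a session), entrances (first page),
# and bounces (entrance with no further page) for each page.
views     <- sapply(pages, function(p) sum(unlist(sessions) == p))
exits     <- sapply(pages, function(p)
  sum(sapply(sessions, function(s) tail(s, 1) == p)))
entrances <- sapply(pages, function(p)
  sum(sapply(sessions, function(s) s[1] == p)))
bounces   <- sapply(pages, function(p)
  sum(sapply(sessions, function(s) length(s) == 1 && s[1] == p)))

# Exit rate: share of a page's views that were the last in the session.
exit_rate <- exits / views
# Bounce rate: share of a page's entrances that were bounces
# (undefined, here NA, for pages that were never an entrance).
bounce_rate <- ifelse(entrances > 0, bounces / entrances, NA)

print(round(exit_rate, 3))
print(round(bounce_rate, 3))
```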
Table 2 – Categorical features
Feature name               Feature description
OperatingSystems           Operating system of the visitor
Browser                    Browser of the visitor
Region                     Geographic region from which the session has been started by the
                           visitor
TrafficType                Traffic source by which the visitor has arrived at the Web site (e.g.,
                           banner, SMS, direct)
VisitorType                Visitor type as: 0 – Returning Visitor, 1 – New Visitor, and -1 – Other
Weekend                    Boolean value indicating whether the date of the visit falls on a weekend
Month                      Month value of the visit date
Revenue                    Boolean value indicating whether the visit has been finalized with a
                           transaction
Part A – Exploratory Analysis (40 marks)
In this task, you have to apply exploratory analysis to reveal online shoppers’ purchasing intention, so
that the findings could be used to formulate customized promotions for online shoppers. You have to define
your own research questions and use summary statistics and data visualization to perform initial
investigations on the data. For example, you can use correlation analysis to identify the factors (i.e.,
variables) that help predict a visitor’s purchasing intention and likelihood of abandoning the site, and
then define and use appropriate criteria to generate graphical representations showing the impact of these
factors. You are also required to clearly explain your observations.
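As one possible starting point for the correlation analysis mentioned above, the sketch below ranks numerical features by their correlation with the purchase outcome. It runs on synthetic data so that it is self-contained; the column names (PageValues, ExitRates, BounceRates, Revenue) and the way the outcome is simulated are assumptions for illustration, not properties of “online_shoppers.csv”.

```r
# Synthetic stand-in for the real dataset (all values are simulated).
set.seed(1)
n <- 200
shoppers <- data.frame(
  PageValues  = rexp(n, rate = 0.1),
  ExitRates   = runif(n, 0, 0.2),
  BounceRates = runif(n, 0, 0.2)
)
# Assumption for the simulation only: purchases become more likely with
# higher page values and less likely with higher exit rates.
p_buy <- plogis(0.1 * shoppers$PageValues - 10 * shoppers$ExitRates - 1)
shoppers$Revenue <- rbinom(n, size = 1, prob = p_buy)

# Pearson correlation of each numerical feature with the 0/1 outcome
# (equivalent to the point-biserial correlation).
num_cols <- c("PageValues", "ExitRates", "BounceRates")
cors <- sapply(shoppers[num_cols], function(x) cor(x, shoppers$Revenue))

# Rank features by absolute correlation and plot them.
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))
barplot(cors_sorted, ylab = "Correlation with Revenue", las = 2)
```

With the real data, the same pattern would apply after reading the file with read.csv() and checking the actual column names. Correlation only captures linear association, so it is a screening step rather than evidence of predictive power.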
You have to pre-process the data before constructing data visualizations. For example, you have to handle
missing data, code the variables, and perform data aggregation. You may use different approaches to handle
the missing data and make any reasonable assumptions in the analysis, if necessary. You may also use any
appropriate visualization methods in your analysis. However, you have to justify your methods and any
assumptions made.
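The three pre-processing steps named above (missing data, variable coding, aggregation) can be sketched as follows. This is a minimal illustration on a made-up six-row data frame; the column names and string labels are assumptions, and median imputation is only one of several defensible ways to treat missing values.

```r
# Hypothetical sample rows; real column names and labels may differ.
raw <- data.frame(
  Month = c("Feb", "Feb", "Mar", "Mar", "May", "May"),
  VisitorType = c("Returning_Visitor", "New_Visitor", "Other",
                  "Returning_Visitor", "New_Visitor", "Returning_Visitor"),
  ProductRelated_Duration = c(120, NA, 300, 45, NA, 600),
  Revenue = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# 1. Missing data: replace NA with the column median (one option; dropping
#    rows or model-based imputation are alternatives to justify).
med <- median(raw$ProductRelated_Duration, na.rm = TRUE)
raw$ProductRelated_Duration[is.na(raw$ProductRelated_Duration)] <- med

# 2. Coding: map visitor types to the numeric codes listed in Table 2,
#    and make Month an ordered-by-calendar factor.
visitor_codes <- c(Returning_Visitor = 0, New_Visitor = 1, Other = -1)
raw$VisitorCode <- unname(visitor_codes[raw$VisitorType])
raw$Month <- factor(raw$Month, levels = month.abb)

# 3. Aggregation: share of sessions ending in a transaction, per month.
conv_by_month <- aggregate(Revenue ~ Month, data = raw, FUN = mean)
print(conv_by_month)
```

The aggregated table is the kind of input a bar chart of conversion rate by month would use.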
Part B – Clustering Analysis (60 marks)
(a) In this task, you will use unsupervised learning to segment the sample dataset. You have to apply the
K-means and Expectation Maximization (EM) algorithms to cluster the online shoppers’ purchasing intention
dataset and interpret the cluster results. You must decide which features to include in the clustering.
The main purpose of this analysis is to help the business better understand how to use the clusters to
predict the behavior of online shoppers in real time and take actions accordingly to reduce shopping cart
abandonment and improve purchase conversion rates.
Specifically, you have to perform the following tasks:
• Load and prepare the data (e.g. data cleansing and data normalization).
• Train a K-means model on the data and select k using a scree plot and a silhouette plot.
• Rerun the model with the optimal number of clusters.
• Apply Expectation Maximization (EM) Algorithm to cluster the data.
• Compare the results of the two methods using silhouette plots and the Dunn index, and select
the best clusters.
• Perform exploratory analysis on the clusters (e.g. descriptive statistics, 2D and 3D scatterplots,
histograms, correlation analysis, etc.) and interpret the clustering results.
(Note: you can only use numerical variables for K-means and EM.) (50 marks)
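The workflow listed above can be sketched end to end as follows. This is a hedged illustration, not a model answer: it uses three synthetic Gaussian blobs instead of the shopper data, assumes the cluster and mclust packages are installed (mclust's Mclust() is one common EM implementation for Gaussian mixtures), and computes the Dunn index with a small hand-written helper, dunn_index(), rather than a package function.

```r
library(cluster)  # silhouette()
library(mclust)   # Mclust(): EM for Gaussian mixtures (assumed installed)

set.seed(42)
# Synthetic stand-in for the selected numerical features: three blobs.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)
x <- scale(x)   # normalization: zero mean, unit variance per feature
d <- dist(x)    # Euclidean distances, reused by silhouette and Dunn

# Scree plot: total within-cluster sum of squares for each candidate k.
ks <- 2:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Average silhouette width for each k; pick the k that maximizes it.
mean_sil <- function(cl, d) mean(silhouette(cl, d)[, "sil_width"])
avg_sil <- sapply(ks, function(k)
  mean_sil(kmeans(x, centers = k, nstart = 25)$cluster, d))
best_k <- ks[which.max(avg_sil)]

# Rerun K-means with the chosen k, then fit EM with the same number of components.
km <- kmeans(x, centers = best_k, nstart = 25)
em <- Mclust(x, G = best_k, verbose = FALSE)
plot(silhouette(km$cluster, d), main = "K-means silhouette")

# Dunn index: smallest between-cluster distance over largest within-cluster distance.
dunn_index <- function(d, cl) {
  m <- as.matrix(d)
  same <- outer(cl, cl, "==")
  min(m[!same]) / max(m[same])
}

comparison <- data.frame(
  method = c("K-means", "EM"),
  mean_silhouette = c(mean_sil(km$cluster, d), mean_sil(em$classification, d)),
  dunn = c(dunn_index(d, km$cluster), dunn_index(d, em$classification))
)
print(comparison)
```

Higher values of both the mean silhouette width and the Dunn index indicate more compact, better separated clusters. On real data the scree plot and silhouette plot may disagree on k, in which case the choice needs to be justified.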
(b) Explain why K-means can only use numerical variables for clustering and discuss how to cluster mixed
data types (i.e., both numerical and categorical variables) in R. (Note: you don’t need to write the R
program.) (10 marks)
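For orientation only, one commonly used approach to mixed data in R is to compute Gower dissimilarities and cluster them with partitioning around medoids (PAM). The sketch below assumes the cluster package and uses a made-up six-row data frame with hypothetical column names. Other approaches exist (for example k-prototypes), and this is not presented as the expected answer.

```r
library(cluster)  # daisy() for Gower dissimilarity, pam() for k-medoids

# Hypothetical mixed-type sample: one numerical and two categorical columns.
mixed <- data.frame(
  PageValues  = c(0, 0, 35.2, 40.1, 1.5, 38.0),
  VisitorType = factor(c("Returning", "New", "Returning",
                         "Returning", "New", "Returning")),
  Weekend     = factor(c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
)

# Gower dissimilarity: range-scaled differences for numerical columns and
# simple mismatch for factors, averaged into a value between 0 and 1.
gower <- daisy(mixed, metric = "gower")

# PAM works directly on a dissimilarity object, so no cluster means are needed.
fit <- pam(gower, k = 2, diss = TRUE)
print(fit$clustering)
```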
Grading Criteria
Each submission will be graded based on both the analysis process and included visualizations. Here are
our grading criteria:
• Appropriate data cleansing and transformation.
• Sufficient breadth of analysis, exploring multiple questions.
• Sufficient depth of analysis, with appropriate follow-up questions.
• Expressive & effective visualizations crafted to investigate analysis questions.
• Clearly written, understandable captions that communicate primary insights.
Submission Details
Upload your completed work to OLE before each deadline, as follows:
1. Part A – Exploratory Analysis (Mar 26, Friday)
• Analysis report – “Assignment 1 (Part A)”
• R program (or R markdown) – “Assignment 1 (Part A) – R program”
2. Part B – Clustering Analysis (Apr 9, Friday)
• Analysis report – “Assignment 1 (Part B)”
• R program (or R markdown) – “Assignment 1 (Part B) – R program”
Marks will be deducted for any non-compliance with the submission requirements.