Date: 2023-07-25
CSCI316 (SIM) 2023 Session 3 Individual Assignment 1
CSCI316 – Big Data Mining Techniques and Implementation
Individual Assignment 1
2023 Session 3 (SIM)
15 Marks
Deadline: Refer to the submission link of this assignment on Moodle
Two (2) tasks are included in this assignment. The specification of each task starts on a separate page.
You must implement and run all your Python code in Jupyter Notebook. The deliverables include one
Jupyter Notebook source file (with the .ipynb extension) and one PDF document for each task.
Note: To generate a PDF file from a notebook source file, you can either (i) use the web browser’s PDF
printing function, or (ii) click “File” at the top of the notebook, choose “Download as” and then “PDF via
LaTeX”.
All results of your implementation must be reproducible from your submitted Jupyter Notebook source
files. In addition, the submission must include all execution outputs as well as a clear explanation of your
implementation algorithms (e.g., in Markdown cells or as comments in your Python code).
Submissions must be made online using the submission link associated with Assignment 1 for this
subject on Moodle. The size limit for all submitted materials is 20MB. DO NOT submit a zip file.
This is an individual assignment. Plagiarism of any part of the assignment will result in a mark of 0 for
the assignment for all students involved.
Task 1
(5 marks)
Data set: Customer Churn Dataset
https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset
Customer churn refers to the phenomenon where customers discontinue their relationship or subscription with
a company or service provider. It represents the rate at which customers stop using a company's products or
services within a specific period. Churn is an important metric for businesses as it directly impacts revenue,
growth, and customer retention.
In the context of the Churn dataset, the churn label indicates whether a customer has churned or not. A churned
customer is one who has decided to discontinue their subscription or usage of the company's services. On the
other hand, a non-churned customer is one who continues to remain engaged and retains their relationship with
the company.
Understanding customer churn is crucial for businesses to identify patterns, factors, and indicators that
contribute to customer attrition. By analysing churn behaviour and its associated features, companies can
develop strategies to retain existing customers, improve customer satisfaction, and reduce customer turnover.
Predictive modelling techniques can also be applied to forecast and proactively address potential churn,
enabling companies to take proactive measures to retain at-risk customers.
Objective
Use Pandas in Python to clean and pre-process this dataset. You may not use any ML library (including
scikit-learn) for this task.
Requirements
(1) Create one Pandas dataframe for both the training and test data.
(2) Identify the missing values. Propose a method to clean the missing values.
The following steps are performed based on the cleaned data.
(3) Perform z-score normalization of the values in the attribute “Last Interaction”. Show the mean and
variance of the normalized values.
(4) Create five bins for the attribute “Total Spend” such that the bins contain (approximately) equal
numbers of records.
(5) Apply one-hot-encoding to the attribute “Contract Length”.
(6) Define at least one new attribute based on existing attributes, and explain the reasoning behind your
definition.
For requirements (2) – (5), append the new columns to the existing Pandas dataframe.
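Requirements (1) to (5) can be sketched with Pandas alone. The example below is only an illustration on made-up toy data, not a model solution: the two small dataframes stand in for the training and test files, the `source` column and the names of the appended columns are hypothetical, and dropping incomplete rows is just one possible cleaning method (imputation is another).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test files (all values are made up).
train = pd.DataFrame({
    "Last Interaction": [3.0, 10.0, np.nan, 25.0, 14.0, 8.0, 29.0],
    "Total Spend": [120.0, 560.0, 310.0, np.nan, 980.0, 415.0, 700.0],
    "Contract Length": ["Monthly", "Annual", "Quarterly", "Monthly",
                        "Annual", "Quarterly", "Monthly"],
})
test = pd.DataFrame({
    "Last Interaction": [7.0, 30.0, 18.0, 1.0, 22.0],
    "Total Spend": [450.0, 75.0, 640.0, 830.0, 205.0],
    "Contract Length": ["Annual", "Monthly", "Quarterly", "Annual", "Monthly"],
})

# (1) One dataframe for both splits; "source" is a hypothetical helper column
#     that remembers where each row came from.
df = pd.concat([train.assign(source="train"), test.assign(source="test")],
               ignore_index=True)

# (2) Count missing values per column, then drop incomplete rows
#     (an assumed cleaning choice for this sketch).
missing_counts = df.isna().sum()
df = df.dropna().reset_index(drop=True)

# (3) z-score normalization using the population standard deviation (ddof=0),
#     so the normalized column has mean 0 and population variance 1.
col = df["Last Interaction"]
df["Last Interaction (z)"] = (col - col.mean()) / col.std(ddof=0)
z_mean = df["Last Interaction (z)"].mean()
z_var = df["Last Interaction (z)"].var(ddof=0)

# (4) Equal-frequency binning: qcut places bin edges at quantiles.
df["Total Spend (bin)"] = pd.qcut(df["Total Spend"], q=5, labels=False)

# (5) One-hot encoding, appended as new columns.
dummies = pd.get_dummies(df["Contract Length"], prefix="Contract Length",
                         dtype=int)
df = pd.concat([df, dummies], axis=1)

print(missing_counts)
print(z_mean, z_var)
print(df["Total Spend (bin)"].value_counts().sort_index())
```

Note that `Series.std` and `Series.var` default to the sample estimate (`ddof=1`), so `ddof=0` is passed explicitly wherever the population variance is wanted. On real data with many repeated “Total Spend” values, `pd.qcut` may need `duplicates="drop"`, and the bins will then be only approximately equal in size.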
Deliverables
• A Jupyter Notebook source file named _task1.ipynb which contains your
implementation source code in Python
• A PDF document named _task1.pdf which is generated from your Jupyter
Notebook source file, and presents a clear and accurate explanation of your implementation and results.
Task 2
(10 marks)
Data set: Customer Churn Dataset
https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset
Objective
The objective of this task is to implement a Decision Tree (DT) classifier from scratch to predict the churn label.
You may not use any ML library for this task.
Requirements
(1) Implement three DT models, one for each of the split criteria Information Gain, Gain Ratio and Gini
Index. You can use either binary splits or multiway splits.
(2) It is recommended that your implementation includes a “tree induction function”, a “classification
function” and other functions (which are up to you).
(3) After implementing the three DT models, use a simple ensemble method to build a new classifier. This
new classifier simply applies a voting function to the prediction outcomes of the three DT models.
(4) Present a clear and accurate explanation of your implementation and results in text. In particular, report
the accuracy of the models, and report whether the ensemble method achieves an improvement.
(5) Note. You may, but are not required to, use any suitable pre-processing method. You may also use
any reasonable early stopping criteria (pre-pruning parameters such as number of splits, minimum data
set size, and split threshold) to improve the training speed. If you do so, you must explain the criteria
clearly.
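The three split criteria in requirement (1) and the voting function in requirement (3) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a prescribed design: it scores a multiway split on one categorical attribute, and all function names are hypothetical. Tree induction and classification are deliberately left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a sequence of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by attribute value (one group per distinct value)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value, so it cannot split
        return 0.0
    return information_gain(values, labels) / split_info


def gini_index(values, labels):
    """Size-weighted Gini impurity of the partitions (lower is better)."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))


def majority_vote(*predictions):
    """Combine per-model prediction lists by taking the most common label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy usage: the attribute separates the two classes perfectly.
values = ["Monthly", "Monthly", "Annual", "Annual"]
labels = [1, 1, 0, 0]
print(information_gain(values, labels), gain_ratio(values, labels),
      gini_index(values, labels))
print(majority_vote([1, 0, 1], [1, 1, 0], [0, 1, 1]))
```

Information Gain and Gain Ratio are maximised when choosing a split, whereas the weighted Gini Index is minimised. With three voters and a binary churn label a tie cannot occur; with more classes, `most_common` breaks ties by first occurrence, which may or may not be the desired behaviour.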
Deliverables
• A Jupyter Notebook source file named _task2.ipynb which contains your
implementation source code in Python
• A PDF document named _task2.pdf which is generated from your Jupyter
Notebook source file.