Data Mining & Machine Learning


Semester 1, 2021

Due: Friday 16 April at midnight.
Weighting: 50%
Note: This assignment may be completed individually or in groups of size 2.
Submission: A soft copy needs to be submitted through Turnitin (a link for this purpose
will be set up in Blackboard). When submitting the assessment, make sure that the
name(s) and student ID(s) are indicated on the front page of the report.

The aim of this assignment is two-fold. Firstly, in Part A, you are required to
conduct a literature review of data mining applications in industry, which will
provide you with further insight into the ways that data mining is used. Secondly,
in Part B, you will solve two real-world data mining problems using Python.

Part A
Your survey should cover two different application areas (ensure that these are
from different domains – e.g. banking, health, etc). The survey is intended to assist
you in establishing a suitable framework (application area, tools, algorithms) on
which your mining project will be based.

• Background information on the organisation that initiated the data mining
exercise.
• A brief description of the target application (e.g. detecting credit card fraud,
diagnosing heart disease, etc.) and the objectives of the data mining
exercise undertaken.
• A description of the data used in the mining exercise (the level of detail
published here will differ due to commercial sensitivity, hence flexibility will
be used in the marking of this section).
• A description of the mining tools (data mining software) used, together with
an identification (no details required) of the mining algorithms and how the
mining algorithms were applied on the data.
• Discussion of the outcomes and benefits (be as specific as possible, talk
about accuracy of results, potential or actual savings in dollar terms
or time savings; do not talk in vague, general terms) to the organisation
that resulted from the mining exercise. This discussion should contain, in
addition to the published material, your own reflection on the level of
success achieved by the organisation in meeting their stated aims and
objectives.

The total length of your report for Part A is expected to be no longer than 3 pages
(1.5 pages for each case study). The criteria that will be used for assessment in
Part A are as follows:

Criterion                          Mark
Overall Quality of Presentation    6
Background                         6*2=12
Tools and Mining algorithms        8*2=16
Outcomes and Benefits              8*2=16

Part B
This part allows you to solve two real-world data mining problems using Python.
In the two questions given below, justification of your answers carries a high
proportion (50%) of the marks awarded.

Q1: Application Area 1 (dataset for this is Mortgage.csv; dataset
description is in Mortgage.txt)

This application is concerned with predicting the outcome of mortgage
applications. The dataset contains 700 applications for mortgages for which
outcomes (paid back=0, default on loan=1) are known. A further 150 mortgages
have been granted, but the outcomes for these are not yet known as the
loans are still in progress.

You are required to build a model using the Decision Tree learner and answer
the following questions based on the model built. Use the data segment on the
700 mortgages whose outcomes are known. In building the model, use the
10-fold cross-validation option for testing.

Your answers below need to be supported by suitable evidence, wherever
appropriate. Some examples of suitable evidence are Decision Trees,
Confusion Matrices, Model Visualizations and Summary Statistics.
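As an illustration of the kind of evidence described above, the following is a minimal sketch of a Decision Tree evaluated with 10-fold cross-validation, reporting a confusion matrix and per-class success rates. It assumes scikit-learn is the Python library in use, and it substitutes synthetic data from `make_classification` for Mortgage.csv, since the column names of that file are not given here. The class balance and all parameter values are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 700 labelled mortgages (assumption: the real
# features and outcome column would be loaded from Mortgage.csv instead).
X, y = make_classification(n_samples=700, n_features=8, n_informative=4,
                           weights=[0.75, 0.25], random_state=0)

# 10 stratified folds keep the 0/1 outcome ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Each record is predicted by a tree that did not see it during training.
y_pred = cross_val_predict(model, X, y, cv=cv)

# Rows are actual outcomes (0, 1); columns are predicted outcomes (0, 1).
cm = confusion_matrix(y, y_pred, labels=[0, 1])
accuracy = accuracy_score(y, y_pred)

# Per-class success rate: correct predictions divided by actual class size.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(f"accuracy={accuracy:.3f}, per-class success rate={per_class_recall}")
```

Reading the per-class success rates alongside the overall accuracy is one way to support a written answer with evidence rather than a single headline figure.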

a) Using an appropriate method, identify the top 4 most influential features
in classifying this dataset. [5 marks]

b) Now build a model using the Decision Tree Classifier. By adjusting two
suitable parameters (one at a time), reduce the size of the tree to no
more than 10 to 15 nodes in order to improve the interpretability of the
model generated. Which of the two parameters yielded better accuracy
while producing smaller trees? [5 marks]

c) Describe the role of the two parameters in the model building that you
used in b) above. Do you expect that manipulating the parameters in the
same way will improve accuracy for other types of datasets? Justify your
answer. [8 marks]

d) Examine the Confusion Matrix carefully. You will notice that the success
rate of predictions for the “default on loan” (1) outcome is significantly
smaller than the corresponding success rate for the 0 outcome. Why do
you think this happens? Will a suitable visualization help to explain this
phenomenon? [5 marks]

e) Suppose the model is applied to the 150 mortgages whose outcomes are not
yet known. Do you expect to replicate the same level of success as with
the 700 mortgages that you built the model from, or do you expect the
prediction to be significantly worse? Justify your answer. Hint: Examine
the data distributions of the two sets of data and look for similarities
or differences between the two. [10 marks]
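For part b) above, one possible way to compare two size-limiting parameters is sketched below. It assumes scikit-learn's `DecisionTreeClassifier` and picks `max_depth` and `max_leaf_nodes` purely as examples; other parameters (such as `min_samples_leaf`) could equally be chosen. The data is synthetic and stands in for Mortgage.csv, and `summarise` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 700 labelled mortgages (assumption).
X, y = make_classification(n_samples=700, n_features=8, n_informative=4,
                           weights=[0.75, 0.25], random_state=0)


def summarise(**params):
    """Return (node count, mean 10-fold CV accuracy) for one tree setting."""
    tree = DecisionTreeClassifier(random_state=0, **params)
    # cross_val_score fits clones, so `tree` itself is still unfitted here.
    accuracy = cross_val_score(tree, X, y, cv=10).mean()
    # Fit on all records only to measure the size of the resulting tree.
    n_nodes = tree.fit(X, y).tree_.node_count
    return n_nodes, accuracy


# Adjust one parameter at a time, as the question asks.
results = {
    "unpruned": summarise(),
    "max_depth=3": summarise(max_depth=3),            # at most 15 nodes
    "max_leaf_nodes=8": summarise(max_leaf_nodes=8),  # at most 15 nodes
}

for name, (n_nodes, accuracy) in results.items():
    print(f"{name:18s} nodes={n_nodes:4d} accuracy={accuracy:.3f}")
```

A binary tree of depth 3 has at most 15 nodes, and a binary tree with 8 leaves has exactly 15, which is why these two settings fit the size limit in the question. Which setting gives better accuracy depends on the data, so the figures printed for synthetic data say nothing about the mortgage dataset.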
COMP606 Assignment Part 1 page 4 of 4

Q2: Application Area 2 (dataset for this is Heart.arff)

This application is from the Medical domain and is concerned with predictions of
heart disease for a collection of individuals from whom relevant medical data has
been obtained. The objective is to predict whether a given individual will suffer from
heart disease (outcome 2) or not (outcome 1) in a year’s time from gathering the
data.

For this dataset, you will use both the Decision Tree Classifier and Naïve Bayes
algorithms to mine the data. Use the 10-fold cross-validation option for testing the
performance of both models on testing data.
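The comparison described above could be set up along the following lines. This is only a sketch under stated assumptions: it uses scikit-learn, it uses `GaussianNB` as the Naïve Bayes variant (a categorical variant may suit Heart.arff better, depending on its attribute types), and it runs on synthetic two-class data rather than the real file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease records (assumption).
X, y = make_classification(n_samples=270, n_features=10, n_informative=5,
                           random_state=1)

# Using the same folds for both learners keeps the comparison like-for-like.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "naive_bayes": GaussianNB(),
}

# One accuracy value per fold, for each model.
fold_scores = {name: cross_val_score(model, X, y, cv=cv)
               for name, model in models.items()}

for name, scores in fold_scores.items():
    print(f"{name:14s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the spread across folds as well as the mean helps to judge whether a difference between the two learners is larger than the fold-to-fold variation.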

a. Use an appropriate method of feature selection to identify the most significant
features. State the method used and list the features produced. Discuss the
independence assumption between the features in the Naïve Bayes algorithm and
support your answer with reference to the selected features.
[5 marks]

b. Run the Naïve Bayes algorithm with the most significant features identified
above. Produce a probability model table similar to the example discussed in
class. Use this probability table to identify the top 3 (feature, value) pairs that
predict the presence of heart disease. Discuss your findings and show all
working. [7 marks]

c. Now run the Decision Tree algorithm and compare the list of the most
significant features with the top 3 features produced by the Decision Tree
model. Identify similarities and differences. Discuss any differences.
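For part b above, the sketch below shows one way a Naïve Bayes probability table might be built and used to rank (feature, value) pairs. It is not the class example: the records, the feature names `chest_pain` and `exercise_angina`, the Laplace smoothing, and the ranking by likelihood ratio are all assumptions made for illustration. Only the outcome coding (2 = heart disease, 1 = no heart disease) comes from the brief.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (chest_pain, exercise_angina, outcome).
records = [
    ("typical", "no", 1), ("typical", "no", 1), ("atypical", "no", 1),
    ("atypical", "yes", 1), ("asymptomatic", "no", 1),
    ("asymptomatic", "yes", 2), ("asymptomatic", "yes", 2),
    ("asymptomatic", "no", 2), ("atypical", "yes", 2), ("typical", "yes", 2),
]
features = ["chest_pain", "exercise_angina"]

class_counts = Counter(record[-1] for record in records)
value_counts = defaultdict(Counter)  # (feature, outcome) -> value frequencies
values = defaultdict(set)            # feature -> set of observed values
for *xs, outcome in records:
    for feature, value in zip(features, xs):
        value_counts[(feature, outcome)][value] += 1
        values[feature].add(value)


def cond_prob(feature, value, outcome, alpha=1):
    """P(feature = value | outcome), with Laplace smoothing (assumption)."""
    numerator = value_counts[(feature, outcome)][value] + alpha
    denominator = class_counts[outcome] + alpha * len(values[feature])
    return numerator / denominator


# Probability table: one row per (feature, value), one column per outcome.
table = {(f, v): {c: cond_prob(f, v, c) for c in sorted(class_counts)}
         for f in features for v in sorted(values[f])}

# Rank pairs by how much more likely they are under outcome 2 than outcome 1.
ranked = sorted(table, key=lambda fv: table[fv][2] / table[fv][1], reverse=True)

for feature, value in ranked:
    row = table[(feature, value)]
    print(f"{feature}={value:13s} P(.|1)={row[1]:.3f} P(.|2)={row[2]:.3f}")
```

On these toy records the highest-ranked pair is `exercise_angina=yes`, followed by `chest_pain=asymptomatic`. The real ranking depends entirely on the features selected from Heart.arff.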

Good luck!