Data Mining
COSC 2111/2110
Assignment 2 Neural Networks
Assessment Type: You can do this assignment by yourself or in a group of
2. If you are working in a group, please establish a group in Assignment 2
Group on Canvas. Submit online via Canvas → Assignments → Assignment 2.
Marks are awarded for meeting requirements as closely as possible.
Clarifications/updates may be made via announcements/relevant discussion
forums.
Due Date: End of week 11, Monday 12th October 2020, 11:59pm
Marks: 40
1 Overview
In this assignment you are asked to explore the use of neural networks
for classification
and numeric prediction (you may choose to use ‘Javanns’ or
‘MultilayerPerceptron’ in
Weka). You are also asked to carry out a data mining investigation on a
real-world
data file. You are required to write a report on your findings. Your
assignment will be
assessed on demonstrated understanding of concepts, algorithms,
methodology, analysis
of results and conclusions. Please make sure your answers are labelled
correctly with the
corresponding part and sub-question numbers, to make it easier for the
marker to follow.
2 Learning Outcomes
This assessment relates to the following learning outcomes of the
course.
• CLO 1: Demonstrate advanced knowledge of data mining concepts and
techniques.
• CLO 2: Apply the techniques of clustering, classification, association
finding, feature selection and visualisation on real world data.
• CLO 3: Determine whether a real world problem has a data mining
solution.
• CLO 4: Apply data mining software and toolkits in a range of
applications.
• CLO 5: Set up a data mining process for an application, including data
preparation,
modelling and evaluation.
3 Assignment Details
3.1 Part 1: Classification with Neural Networks (12 marks)
This part involves predicting the Class attribute in the following file:
hypothyroid.arff
in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/arff/UCI/
The main goal is to achieve the lowest classification error with the
lowest amount of
overfitting.
For the neural network training runs build a table with the following
headings:
Run   Architecture   Parameters   Train   Train   Epochs   Test   Test
No                                MSE     Error            MSE    Error
1     ii-hh-oo       lr=.2        0.5     30%     500      0.6    40%
1. Describe the data preprocessing tasks (including data encoding) that
are required. How many outputs and how many inputs will there be? How do
you handle numeric and nominal attributes? What normalizations are
required? How do you deal with missing values (if present)? Include your
data preprocessing scripts (if necessary) as an appendix (not part of the
page count).
2. Develop a script (or elaborate a pre-processing procedure in Weka) to
generate the
necessary training, validation and test data files. How do you determine
when to
stop training a neural network? Include your data preparation script (if
necessary)
as an appendix (not part of the page count).
3. Describe how a trained neural network determines the class label of an
unseen test instance (e.g., the “analyze” strategy in Javanns).
4. Assuming that no hidden layer is used, carry out 5 train and test
runs for a network.
Comment on the limitations of this single-layer “perceptron” network, as
opposed
to a network where one or more hidden layers are employed.
5. Assuming that one hidden layer is used, use Javanns (or Weka) to
carry out 5 train
and test runs for a network with 5 hidden nodes. Comment on the
variation in the
training runs and the degree of overfitting. Comment on the differences
(if any)
you observe in results on the networks with or without the hidden layer.
6. Experiment with different numbers of hidden nodes. What seems to be
the right
number of hidden nodes for this problem?
7. For the network with 5 hidden nodes, explore different combinations
of learning
rate and momentum. What do you conclude?
8. Compare the classification accuracy of Javanns (or Weka
MultilayerPerceptron)
with the classification accuracy of Weka J48. Comment on the pros and
cons of
employing these two classifiers for classification tasks.
9. [Optional for COSC2110] After experimenting with both Javanns and Weka
MultilayerPerceptron, what are the pros and cons of these two software
programs for neural network training? Which would you choose to use,
Javanns or Weka? Provide your reasoning.
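As an illustration of the encoding and normalization questions in item 1, the following is a minimal Python sketch, not a required or official method. It assumes min-max scaling to [0, 1] for numeric attributes and one-of-N (one-hot) encoding for nominal attributes. The function names and the toy columns are hypothetical and are not taken from hypothyroid.arff.

```python
def min_max_scale(values):
    """Scale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode nominal values as 0/1 vectors, one position per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical toy columns (assumed values, for illustration only).
ages = [20.0, 40.0, 60.0]
sexes = ["F", "M", "F"]

scaled_ages = min_max_scale(ages)
sex_categories, sex_encoded = one_hot(sexes)
```

With this encoding, each numeric attribute contributes one network input and each nominal attribute contributes one input per category, which is one way to arrive at the input count asked for above.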
Report Length Up to two pages.
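One common answer to the question in item 2 about when to stop training is early stopping on the validation error. The sketch below is only an illustration under assumptions: the patience value and the per-epoch validation MSE figures are made up, and neither Javanns nor Weka is claimed to implement stopping in exactly this way.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), where best_epoch is 1-based.

    Training is treated as stopped once the validation error has failed
    to improve for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best_error:
            best_error, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error stopped improving
    return best_epoch, best_error


# Hypothetical validation MSE per epoch: it falls, then rises as the
# network starts to overfit the training data.
val_mse = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.30]
best_epoch, best_error = early_stopping_epoch(val_mse, patience=3)
```

In this toy series the stop is triggered at epoch 7, so the later value at epoch 8 is never seen. This shows the trade-off in choosing the patience: too small a value may stop before a later improvement.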
3.2 Part 2: Numeric Prediction with Neural Networks (10 marks)
This part involves the following file: heart-v1.arff
in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/arff/UCI/
The main goal is to achieve the lowest mean absolute error with the
lowest amount of
overfitting.
The task is to predict the value of the chol attribute. Build a similar
table of runs to
the one in Part 1.
1. Describe the data preprocessing tasks (including data encoding) that
are required.
How many outputs and how many inputs will there be? How do you handle
numeric
and nominal attributes? What scaling or normalization is required?
Include your
data preprocessing scripts (if necessary) as an appendix (not part of
the page count).
2. Modify your script (or the Weka pre-processing procedure) from Part 1
to generate
the necessary training, validation and test data files. Describe how you
calculate the
mean-absolute error. Does it require scaling of neural network inputs,
and reverse
scaling of the neural network outputs? Include your data preparation
scripts (if
needed) as an appendix (not part of the page count).
3. Assuming that no hidden layer is used, use Javanns (or Weka) to carry
out 5
train and test runs for a network. Comment on the limitations of this
single-layer
“perceptron” network, as opposed to a network where one or more hidden
layers
are employed.
4. Assuming that one hidden layer is used, use Javanns (or Weka) to
carry out 5 train
and test runs for a network with 5 hidden nodes. Comment on the
variation in the
training runs and the degree of overfitting. Comment on the differences
(if any)
you observe in results on the networks with or without the hidden layer.
5. Experiment with different numbers of hidden nodes. What seems to be
the right
number of hidden nodes for this problem?
6. For the network with 5 hidden nodes, explore different combinations
of learning
rate and momentum. What do you conclude?
7. Perform a run with 5 hidden nodes and no validation data. Stop
training when the
MSE is no longer changing. Get the error on the training and test data.
Comment
on the degree of overfitting.
8. What are the differences between the relative absolute error and the
mean absolute error? Which one would you prefer to use, and why?
9. Compare the mean absolute error of Javanns (or Weka
MultilayerPerceptron) with
the mean absolute error of Weka M5P. Comment on the pros and cons of
employing
these two classifiers for numeric prediction tasks.
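To illustrate the reverse-scaling question in item 2, here is a minimal Python sketch. It assumes the target was min-max scaled to [0, 1] before training, so the network outputs must be mapped back to the original units before the mean absolute error is computed. The range bounds, target values and network outputs are hypothetical and are not taken from heart-v1.arff.

```python
def inverse_min_max(scaled, lo, hi):
    """Map values from [0, 1] back to the original [lo, hi] range."""
    return [s * (hi - lo) + lo for s in scaled]


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical range of the target in the training data (assumed values).
chol_min, chol_max = 100.0, 500.0

actual_chol = [200.0, 300.0, 400.0, 250.0]   # test targets, original units
scaled_outputs = [0.375, 0.5, 0.625, 0.375]  # network outputs in [0, 1]

predicted_chol = inverse_min_max(scaled_outputs, chol_min, chol_max)
mae = mean_absolute_error(actual_chol, predicted_chol)

# The same error measured on the scaled outputs is smaller by the factor
# (chol_max - chol_min), so it is not comparable across tools.
scaled_actual = [(a - chol_min) / (chol_max - chol_min) for a in actual_chol]
mae_scaled = mean_absolute_error(scaled_actual, scaled_outputs)
```

Reporting the error in original units makes it comparable with tools that report errors on the unscaled target.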
Report Length Up to two pages.
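The difference between the two error measures in item 8 can be sketched in a few lines of Python. This is an illustration under an assumption: the relative absolute error is taken here as the total absolute error divided by the total absolute error of a baseline that always predicts the mean of the actual values. A given tool may compute the baseline differently (for example, from the training targets), so check its documentation. The data values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute error, in the units of the target attribute."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted, baseline=None):
    """Total absolute error relative to a constant-prediction baseline.

    If no baseline is given, the mean of `actual` is used (an assumption).
    The result is unit-free; values below 1.0 beat the baseline.
    """
    if baseline is None:
        baseline = sum(actual) / len(actual)
    model_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(a - baseline) for a in actual)
    return model_error / baseline_error


# Hypothetical targets and predictions (assumed values).
actual = [200.0, 300.0, 400.0, 250.0]
predicted = [250.0, 300.0, 350.0, 250.0]

mae = mean_absolute_error(actual, predicted)
rae = relative_absolute_error(actual, predicted)
```

The mean absolute error depends on the scale of the target, whereas the relative absolute error is a ratio and can be compared across attributes with different units.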
3.3 Part 3: Data Mining (15 marks)
This part of the assignment is concerned with the movie data file
IMDB-movie-data.csv,
which is in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/other/
The movie data was collected from the IMDb web site which claims to be
“the world’s
most popular and authoritative source for movie, TV and celebrity
content”. It was collected to answer the question “How can we tell the
greatness of a movie before it is released
in cinema?” There is a full description at:
https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset.
IMDB-movie-data.csv has some changes from the kaggle file, mostly to
make the genre
information more usable.
Your task is to analyse this data with appropriate classification,
clustering, association
finding, attribute selection and visualisation techniques selected from
the Weka menus and
identify any “golden nuggets” in the data. If you don’t use any of the
above techniques,
you need to say why. You need to provide a report for this analysis,
focusing on the
following two aspects:
1. Describing the strategy you adopted, your methodology, the runs you
performed,
any “golden nuggets” you found and your conclusions.
2. Discussing the advantages and disadvantages of each of your chosen
data mining
methods. Make sure you provide a rationale of your choices, and why it
worked
well (or not well) for discovering the “golden nuggets”.
Report Length Up to two pages.
3.4 Part 4: Self-reflection (3 marks)
In this task, you will need to provide a recorded video presentation (3
or 4 minutes,
with no more than 5 presentation slides) of your reflection on what you
have learnt from
this course on Data Mining. In particular, you should focus on answering
the following
questions:
• Have you gained a much improved understanding of key data mining
concepts and
major techniques? What is your reflection on the journey (considering
now that
you have completed your assignment 2)?
• What is your knowledge and understanding now in determining whether
there is a
data mining solution for a real-world problem?
• What have you learned from doing both assignment 1 and 2, in terms of
helping you
extract meaningful patterns (i.e., “golden nuggets”) for a real-world
data mining
problem?
You will need to record the presentation in either WEBM or MP4 format
(using Studio
in Canvas or any software of your own choice). Both the recorded video
presentation
and the presentation slides (PDF format) should be submitted through
Canvas.
4 Alternative for this assignment
It is possible for your group to choose to work on some other real-world
data sets from
the Kaggle Competition website: https://kaggle.com. You still need to
complete all
four parts (Parts 1, 2, 3 and 4 as described in Section 3), with the only
difference being
the data sets you choose to use. You need to consult the lecturer
individually about this request
and get approval before going ahead with it.
5 Submission Instructions
You need to submit the following 3 files via Canvas:
• one PDF file for the report covering Part 1 - Part 3 (note that each
part has a 2-page
limit, not counting the appendix, where you can include the scripts you
developed).
• one WEBM (or MP4) video file for Part 4.
• one PDF file for your presentation slides for Part 4.
5.1 Pair work submission
If you work as a pair, then please include a brief paragraph (at the end
of your report
PDF file) describing how the two of you worked together (i.e., who did
what), and specify
the percentage of your contribution to the whole assignment. You may be
called upon to
give a quick presentation to demonstrate how each of you contributed to
the solution of
this assignment.
5.2 Late submission penalty
After the due date, you will have 5 business days to submit your
assignment as a late
submission. Late submissions will incur a penalty of 10% per day. After
these five days,
Canvas will be closed and you will lose ALL the assignment marks.
Assessment declaration:
When you submit work electronically, you agree to the assessment
declaration:
https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/assessment-declaration
6 Academic integrity and plagiarism (standard warning)
Academic integrity is about honest presentation of your academic work.
It means acknowledging the work of others while developing your own
insights, knowledge and ideas.
You should take extreme care that you have:
• Acknowledged words, data, diagrams, models, frameworks and/or ideas of
others
you have quoted (i.e. directly copied), summarised, paraphrased,
discussed or mentioned in your assessment through the appropriate
referencing methods
• Provided a reference list of the publication details so your reader
can locate the
source if necessary. This includes material taken from Internet sites.
If you do not
acknowledge the sources of your material, you may be accused of
plagiarism because
you have passed off the work and ideas of another person without
appropriate
referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting
misconduct.
Plagiarism covers a variety of inappropriate behaviours, including:
• Failure to properly document a source
• Copyright material from the internet or databases
• Collusion between students
For further information on our policies and procedures, please refer to
the following:
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity.
7 Marking guidelines
Factors contributing to the final mark will include the number of tasks
attempted, the
amount of exploration and demonstrated understanding of the algorithms,
methodology, logical analysis, presentation of results and conclusions
(see the marking rubrics in
Canvas).