FINAL ASSESSMENT
AUGUST 2021 SEMESTER
MODULE NAME : DATA MINING
MODULE CODE : ITS61504
EXAM DURATION : 24 HOURS
DATE : 04/12/2021 8:00 AM – 05/12/2021 8:00 AM (MYT GMT+8)
SUBMISSION DEADLINE: 05/12/2021 8:00 AM (MYT GMT+8)
This paper consists of SIX (6) printed pages, inclusive of this page.
Instruction to Candidates:
1. Answer ALL questions.
2. This is an open-book examination. Students are not allowed to transcribe (cut and
paste) material directly from another source into their submission.
3. The Turnitin similarity limit for this module is 20% overall and less than 5% from any
single source, excluding program source code.
4. Severe disciplinary action will be taken against those caught violating assessment rules
such as colluding, plagiarizing or transcribing.
5. The final assessment answers handed in should be 5 – 12 pages in total for
non-programming modules, with 1.5 line spacing in 12 pt Times New Roman.
6. Submission link is here. (Do not submit the question paper)
7. The breakdown of exam questions by Module Learning Outcome(s) and their associated
weightage is as follows:
MLO Section(s)/ Question(s) Marks
MLO1 Question 1 / 20
MLO2 Question 2 / 20
MLO3 Question 3 / 20
MLO3 Question 4 / 20
MLO4 Question 5 / 20
TOTAL / 100
8. Start each answer on a separate page.
9. Complete the front cover of the examination answer booklet and question paper. Write
the question numbers attempted on the front cover of the answer booklet.
Data Mining
ITS61504
Lorita Angeline
202108FE
Part I: Association Rule Mining
1. Table 1 shows the records of computer purchases. Using the Apriori method, manually
calculate all itemsets (from one item up to the maximum number of items you can find).
Prune the itemsets using a minimum support of 35% and a minimum confidence of 75%. (20 marks)
a) Manually calculate all itemsets (10 marks)
b) Prune itemsets with minimum support of 35% (5 marks)
c) Prune itemsets with minimum confidence of 75% (5 marks)
Table 1: Computer purchase transactional record
TID  Age          Income  Student  Credit Rating  Class (buy comp)
1    lessEqual30  High    No       Fair           No
2    lessEqual30  High    No       Excellent      No
3    31…40        High    No       Fair           Yes
4    greatThan40  Medium  No       Fair           Yes
5    greatThan40  Low     Yes      Fair           Yes
6    greatThan40  Low     Yes      Excellent      No
7    31…40        Low     Yes      Excellent      Yes
8    lessEqual30  Medium  No       Fair           No
9    lessEqual30  Low     Yes      Fair           Yes
10   greatThan40  Medium  Yes      Fair           Yes
11   lessEqual30  Medium  Yes      Excellent      Yes
12   31…40        Medium  No       Excellent      Yes
13   31…40        High    Yes      Fair           Yes
14   greatThan40  Medium  No       Excellent      No
15   31…40        Medium  Yes      Fair           Yes
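For reference, support and confidence can be illustrated with a minimal R sketch. It assumes the usual definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule X -> Y is the count of transactions containing both X and Y divided by the count containing X. The helper names (count_itemset, support, confidence) and the "Attribute=Value" item encoding are illustrative choices, and only four columns of TIDs 1 to 5 are used, so the printed values do not answer Question 1.

```r
# Income, Student, Credit Rating and Class for TIDs 1-5 of Table 1,
# encoded as "Attribute=Value" items (the encoding is an illustrative choice).
transactions <- list(
  c("Income=High",   "Student=No",  "Credit=Fair",      "Class=No"),
  c("Income=High",   "Student=No",  "Credit=Excellent", "Class=No"),
  c("Income=High",   "Student=No",  "Credit=Fair",      "Class=Yes"),
  c("Income=Medium", "Student=No",  "Credit=Fair",      "Class=Yes"),
  c("Income=Low",    "Student=Yes", "Credit=Fair",      "Class=Yes")
)

# Number of transactions that contain every item of the itemset.
count_itemset <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# support(X) = count(X) / N
support <- function(itemset, transactions) {
  count_itemset(itemset, transactions) / length(transactions)
}

# confidence(X -> Y) = count(X and Y) / count(X), computed from integer
# counts so that threshold comparisons are not affected by rounding.
confidence <- function(lhs, rhs, transactions) {
  count_itemset(c(lhs, rhs), transactions) / count_itemset(lhs, transactions)
}

# Minimum number of transactions needed to reach 35% support in 15 records.
min_count <- ceiling(0.35 * 15)

print(support(c("Credit=Fair", "Class=Yes"), transactions))
print(confidence("Credit=Fair", "Class=Yes", transactions))
print(min_count)
```

Because 0.35 × 15 = 5.25, an itemset in the full table must appear in at least 6 of the 15 transactions to meet the minimum support.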
Part II: Case Study
In metropolitan cities like Kuala Lumpur, a prospective home buyer considers several
factors such as location, size of the land, proximity to parks, schools, hospitals and power
generation facilities and, most importantly, the house price. House price prediction informs
significant financial decisions for individuals working in the housing market as well as for
potential buyers. Whether investing or buying a house to live in, a person entering the
housing market is interested in the potential gain. Table 2 shows property listings and the
factors used for house price prediction. The full dataset is available on TIMeS (data_kl.csv).
Features:
• Rooms: Number of rooms
• Price: Price in Ringgit Malaysia (MYR)
• Distance: Distance from KL downtown
• Bedroom2: Number of Bedrooms
• Bathroom: Number of Bathrooms
• Car: Number of parking spaces
• Landsize: Land size
Table 2: Property listing and the factors for house price prediction
Rooms Price Distance Bedroom2 Bathroom Car Landsize
2 1480000.0 2.5 2.0 1.0 1.0 202.0
2 1035000.0 2.5 2.0 1.0 0.0 156.0
3 1465000.0 2.5 3.0 2.0 0.0 134.0
3 850000.0 2.5 3.0 2.0 1.0 94.0
4 1600000.0 2.5 3.0 1.0 2.0 120.0
2 941000.0 2.5 2.0 1.0 0.0 181.0
3 1876000.0 2.5 4.0 2.0 0.0 245.0
2 1636000.0 2.5 2.0 1.0 2.0 256.0
3 1000000.0 2.5 (blank) 1.0 1.0 238.0
2 745000.0 2.5 2.0 1.0 1.0 113.0
1 300000.0 2.5 1.0 1.0 1.0 0.0
2 1097000.0 2.5 3.0 1.0 2.0 220.0
2 542000.0 2.5 2.0 1.0 (blank) 195.0
2 760000.0 2.5 2.0 (blank) (blank) (blank)
1 481000.0 2.5 1.0 1.0 (blank) (blank)
(Questions 2 – 5 are based on the case study on predicting house prices.)
2. Clean the dataset data_kl.csv using pre-processing techniques in R. Describe each
type of noise and anomaly detected in the full dataset. (20 marks)
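As a starting point for spotting missing values, the following R sketch counts NA cells per column and keeps only the complete rows. It uses four rows of Table 2 rather than the full file. The positions of the NA cells are an assumption based on where the printed table appears blank, and the column names are assumed to match the Features list. The full dataset would normally be loaded with read.csv("data_kl.csv") instead.

```r
# Four rows from Table 2; NA marks cells that appear blank in the printed
# table (the blank positions are an assumption based on the table layout).
listings <- data.frame(
  Rooms    = c(2, 3, 2, 1),
  Price    = c(1480000, 1000000, 542000, 481000),
  Distance = c(2.5, 2.5, 2.5, 2.5),
  Bedroom2 = c(2, NA, 2, 1),
  Bathroom = c(1, 1, 1, 1),
  Car      = c(1, 1, NA, NA),
  Landsize = c(202, 238, 195, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(listings))
print(na_per_column)

# Rows that have no missing value in any column.
complete_rows <- listings[complete.cases(listings), ]
print(nrow(complete_rows))
```

Dropping incomplete rows is only one option. Imputation, for example with a column median, is another, and either choice should be justified in the answer.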
3. Write an R program to develop a prediction model that predicts the house price
for a new property listing. You are allowed to amend the dataset (justify your amendments).
Apply your model to the new house listing, newList: (20 marks)
newList <- data.frame(Rooms = 2,
                      Distance = 2.5,
                      Bedroom2 = 2,
                      Bathroom = 1,
                      Car = 0,
                      Landsize = 181)
4. Based on the original dataset (or your updated dataset), can regression modelling be
applied? Which type(s) of regression modelling do you suggest? Justify your opinion with
a clear description. Use and modify an R program to build a regression model, and apply
the model to the new house listing, newList. (20 marks)
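One possible starting point, offered as a sketch and not as the expected answer, is multiple linear regression with R's lm() and predict(). The example fits only the eleven complete rows of Table 2 and, as a simplifying assumption, uses just Rooms and Landsize as predictors. Distance is left out because it is constant (2.5) in these rows. The new listing's column is named Rooms here so that it matches the training data.

```r
# Complete rows of Table 2 (rows with blank cells are left out); only the
# columns used by this simplified model are included.
listings <- data.frame(
  Rooms    = c(2, 2, 3, 3, 4, 2, 3, 2, 2, 1, 2),
  Price    = c(1480000, 1035000, 1465000, 850000, 1600000, 941000,
               1876000, 1636000, 745000, 300000, 1097000),
  Landsize = c(202, 156, 134, 94, 120, 181, 245, 256, 113, 0, 220)
)

# Multiple linear regression: Price explained by Rooms and Landsize.
fit <- lm(Price ~ Rooms + Landsize, data = listings)
print(coef(fit))

# New listing; column names must match the predictors used in the model.
newList <- data.frame(Rooms = 2, Landsize = 181)
predicted_price <- predict(fit, newdata = newList)
print(predicted_price)
```

With so few rows the fitted coefficients are unstable. A model for the assessment should be trained on the cleaned full dataset, and the choice of predictors should be justified.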
5. Which model is preferable for house price prediction? Evaluate and describe the
performance of both models (developed in Questions 3 and 4) using performance metrics.
(20 marks)
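Two commonly used metrics for numeric predictions are the root mean squared error (RMSE) and the mean absolute error (MAE). The R sketch below shows how they could be computed. The actual values are the prices of the first four rows of Table 2, while the predicted values are hypothetical numbers made up for illustration and do not come from any fitted model.

```r
# Prices of the first four rows of Table 2.
actual <- c(1480000, 1035000, 1465000, 850000)
# HYPOTHETICAL model outputs, for illustration only.
predicted <- c(1400000, 1100000, 1500000, 900000)

# Root mean squared error: penalises large errors more heavily.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the error, in MYR.
mae <- function(actual, predicted) mean(abs(actual - predicted))

model_rmse <- rmse(actual, predicted)
model_mae  <- mae(actual, predicted)
print(c(RMSE = model_rmse, MAE = model_mae))
```

For a fair comparison, both models should be scored on the same held-out rows, since metrics computed on the training data tend to be optimistic.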
- END OF QUESTION PAPER -
Submission Requirements
1. Font type : Times New Roman
2. Font size : 12
3. Line spacing : 1.5
4. Alignment : Justify Text
5. Document type : .pdf, .R
6. Number of pages : 5 – 12 pages
7. A report of your answer should consist of the following (in order):
a) Cover page (Name, ID, Date, Signature, Score)
b) Report of your answer script
c) Appendixes (line spacing = 1.0)
• R programming
• List of references (APA format)
• Report of similarity score (percentage of similarity score from each source needs
to be shown)
8. Start each question on a separate page.
9. All figures and tables are labelled properly.
10. File naming conventions: StudentName_FinalAssessment
Notes:
• Include in-text citation to support your answers and add the list of references at the end of your
report (APA format). The list of references is to be alphabetized by the first author's last name, or
(if no author is listed) the organization or title.
• You are required to add screenshots of the code and results for each question.
• The program code must be appended to the main report (put in Appendix).
• The original program files (*.R) are required to be attached to the report upon submission.
No Student Name Student ID Date Signature Score
1
ITS61504 Data Mining
Final Exam - Alternative Assessment
Marking Rubric (August 2021)
Criteria Excellent Good Average Poor
(90 – 100) (75 – 89) (40 – 74) (0 – 39)
Q 1: Describe association rule mining (MLO 1)
Excellent: All itemsets are calculated, the Apriori method is applied correctly and the solution is clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Good: All itemsets are calculated, the Apriori method is applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Average: Two itemsets are calculated, the Apriori method is NOT applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is between 2% and 4%.
Poor: One or no itemset is calculated, the Apriori method is NOT applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is greater than or equal to 5%.
Q 2: Pre-processing techniques (MLO 2)
Excellent: All types of noise and anomalies are detected with a high degree of accuracy. The code is applied correctly and the solution is clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Good: All types of noise and anomalies are detected with a moderate degree of accuracy. The code is applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Average: Some noise and anomalies are detected with a moderate degree of accuracy. The code is applied correctly and the solution is NOT elaborated in a step-by-step manner. The similarity is between 2% and 4%.
Poor: Misses most of the noise and anomalies, and focuses on irrelevant aspects. The code is applied incorrectly and the solution is NOT elaborated in a step-by-step manner. The similarity is greater than or equal to 5%.
Q 3: Machine Learning techniques (MLO 3)
Excellent: Demonstrates comprehensive analysis of the Machine Learning model and is able to build and train the model. The code is applied correctly and the solution is clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Good: Demonstrates enough evaluation of the Machine Learning model and is able to build and train the model. The code is applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Average: Demonstrates enough evaluation of the Machine Learning model but is UNABLE to build and train the model. The code is applied incorrectly and the solution is NOT elaborated in a step-by-step manner. The similarity is between 2% and 4%.
Poor: The description of the Machine Learning model is not valid and the candidate is unable to build and train the model. The code is applied incorrectly and the solution is NOT elaborated in a step-by-step manner. The similarity is greater than or equal to 5%.
Q 4: Regression Modelling (MLO 3)
Excellent: Demonstrates comprehensive analysis of the Regression model and is able to build and train the model. The code is applied correctly and the solution is clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Good: Demonstrates enough evaluation of the Regression model and is able to build and train the model. The code is applied correctly and the solution is NOT clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Average: Demonstrates enough evaluation of the Regression model but is UNABLE to build and train the model. The code is applied incorrectly and the solution is NOT elaborated in a step-by-step manner. The similarity is between 2% and 4%.
Poor: The description of the Regression model is not valid and the candidate is unable to build and train the model. The code is applied incorrectly and the solution is NOT elaborated in a step-by-step manner. The similarity is greater than or equal to 5%.
Q 5: Performance Evaluation (MLO 4)
Excellent: Critically evaluates the performance of the models by examining strengths and weaknesses with performance metrics. The solution is clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Good: Able to evaluate the performance of the models with performance metrics but misses some important strengths or weaknesses. The solution is NOT clearly elaborated in a step-by-step manner. The similarity is less than 2%.
Average: Able to evaluate the performance of the models with some performance metrics but misses the strengths or weaknesses. The solution is NOT elaborated in a step-by-step manner. The similarity is between 2% and 4%.
Poor: The performance evaluation of the models with performance metrics is not correct. The solution is NOT elaborated in a step-by-step manner. The similarity is greater than or equal to 5%.
- END -