ECLT5810 E-Commerce Data Mining Techniques
2020 – 2021 Second Term
Due: March 30, 2021 (Tuesday) 17:00
This is an individual assignment, and you must use Weka to complete it. In Assignment 1, you
conducted the data preprocessing for the term deposit subscription prediction task. In this
assignment, the dataset from Assignment 1 serves as the training set only. You need to use the
training data to build a decision tree model and a logistic regression model to predict the term
deposit subscription of clients. A testing set will also be given. The performance of the models
will be analyzed using both the training set and the testing set. The following are the
requirements of the task. In this assignment, please use the ARFF format instead of the CSV format,
as ARFF is more compatible with Weka.
1. Conduct variable transformation and variable selection with reference to Assignment 1,
using the dataset (bank-additional.csv). Please save your file in ARFF format. Or you can simply use
2. Build a decision tree model to predict the term deposit subscription of clients. Save the model as
3. Build a logistic regression model to predict the term deposit subscription of clients. Save the model
4. Assess the two models using the training data only. You may choose the test options yourself (but
do not use a supplied test set). Compare the training accuracy and other metrics. Report the results*
and write down your comments on the comparison.
5. Both learned models are analyzed using the testing set stored in another dataset (bank-additional-
6. Before prediction, you should perform the same variable transformation and variable selection
techniques on the testing dataset, using the statistics of the training dataset. For example, in
normalization, use the min and max values of the training set when handling the testing dataset.
Please save your feature-engineered dataset in ARFF format.
7. Perform prediction on the testing dataset using the models trained in the previous steps. Report
the results*. Write down your comments on the results (Does the model perform well or not? Why?)
8. Modify or extend any steps above in order to achieve a better accuracy on the testing set. Note that for
the learning models, only decision tree and logistic regression can be used. Report the results*.
State and explain the modifications you have made.
9. State briefly and concisely any explanatory notes on the methodologies you used to improve the
models’ performance. Discuss briefly the weaknesses and assumptions of the models and
methodologies. Specifically, you should state which model you have chosen as the finalized model and
explain your reasons. Do not write more than one A4 page for this requirement.
* Use screenshots from Weka to report the results.
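For requirements 4 and 7, metrics other than accuracy can be read from, or derived from, the confusion matrix that Weka prints in its classifier output. The following is only a minimal Python sketch of how those metrics relate to the four cells of a two-class confusion matrix. The counts are hypothetical and are not results from the assignment datasets, and the function name is made up for illustration.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive common metrics from a two-class confusion matrix.

    tp/fn: positive-class instances predicted as positive/negative.
    fp/tn: negative-class instances predicted as positive/negative.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the calculation.
result = confusion_metrics(tp=30, fn=70, fp=20, tn=880)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is high while the recall of the positive class is low, which is one reason to compare more than one metric when commenting on the two models.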
1. You need to report the results of your project in a Word/PDF file. Please rename the file according
to the “id-name” format. For example, if your name is Chan Tai Ming (please use your full name rather
than Peter / Mary) and your student ID is 1100123456, your report must be stored in a file named 1100123456-
2. Please submit your file to Blackboard under “Course Content” – “Assignment 2”. You are allowed
to submit multiple times before the due time, and the last attempt will be graded. Late submissions
will not be accepted.
3. For any enquiries, please email firstname.lastname@example.org.
In step 6, you are required to perform the same variable transformation and variable selection techniques
on the new dataset. Since discretization depends on the data, the resulting bins on the training and testing
data will be different even if you apply the same configuration. In practice, we should use the ranges
obtained from the training data and apply the same ranges to the testing data. To do so, we can perform
manual discretization on the testing data.
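To see why the bins differ, consider equal-width binning, where the cut points depend on the minimum and maximum of whatever data is supplied. The Python sketch below is only an illustration under that assumption. The values are hypothetical and the helper name is made up; it is not a Weka function.

```python
def equal_width_cut_points(values, bins):
    """Return the interior cut points of equal-width binning over `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


# Hypothetical attribute values, not taken from the assignment datasets.
train_values = [18, 25, 33, 41, 60, 78]
test_values = [22, 30, 35, 47, 52]

train_cuts = equal_width_cut_points(train_values, 3)
test_cuts = equal_width_cut_points(test_values, 3)

print("cut points from training data:", train_cuts)
print("cut points from testing data: ", test_cuts)
```

The same configuration (three equal-width bins) yields different cut points on the two datasets, so a model trained on the training bins would see differently defined categories at test time unless the training ranges are reused.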
Suppose the ranges for the first attribute obtained from training data are (-inf-20.0], (20.0-40.0] and (40.0-
After the testing data is loaded in Weka, go to the Preprocess tab and choose MathExpression, which is under
Enter this expression: ifelse(A>20, ifelse(A>40, 3, 2), 1)
Enter 2-last in ignoreRange if we only want to apply it to the first attribute.
Then, we need to use NumericToNominal to turn it into a nominal attribute.
After applying NumericToNominal, we will have (-inf-20.0] as label 1, (20.0-40.0] as label 2 and (40.0-inf)
as label 3. Now, we need to rename each label to its range.
Choose RenameNominalValues, which is under unsupervised->attribute.
Enter 1:(-inf-20.0],2:(20.0-40.0],3:(40.0-inf) in valueReplacements.
Enter 1 in selectedAttributes as we only want to apply it to the first attribute.
We will then obtain the same discretization as for our training data.
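The logic of the steps above can be checked outside Weka with a short Python sketch. It mirrors the expression ifelse(A>20, ifelse(A>40, 3, 2), 1) and the label renaming. The function names and the sample values are hypothetical, and the sketch is not a replacement for doing the steps in Weka.

```python
# Mapping from the numeric code to the range label, as entered
# in valueReplacements.
LABELS = {1: "(-inf-20.0]", 2: "(20.0-40.0]", 3: "(40.0-inf)"}


def bin_code(a):
    """Mirror of the expression ifelse(A>20, ifelse(A>40, 3, 2), 1)."""
    if a > 20:
        return 3 if a > 40 else 2
    return 1


def discretize(values):
    """Apply the training ranges to a list of numeric values."""
    return [LABELS[bin_code(v)] for v in values]


# Hypothetical test values, including the boundary cases 20 and 40.
sample = [15, 20, 20.5, 40, 41]
for value, label in zip(sample, discretize(sample)):
    print(value, "->", label)
```

Note that the boundary values 20 and 40 fall into the lower bin, consistent with the closed upper ends of (-inf-20.0] and (20.0-40.0].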
For Normalization and Standardization, you should use the values obtained from training data to handle
the testing data, e.g. the min and max values for Normalization, mean and std for Standardization. You can
use MathExpression to manually apply those methods on the testing data.
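As an illustration of the arithmetic, the Python sketch below applies training-set statistics to testing values and builds an expression string with the training constants substituted in, using A for the attribute value as in the expression shown earlier. All data values and function names are hypothetical. The sketch uses the sample standard deviation, which is an assumption; use whichever mean and std values Weka reports for your training data.

```python
import statistics


def normalize_with_train_stats(value, train_min, train_max):
    """Min-max normalization using the training min and max."""
    return (value - train_min) / (train_max - train_min)


def standardize_with_train_stats(value, train_mean, train_std):
    """Standardization using the training mean and standard deviation."""
    return (value - train_mean) / train_std


def normalization_expression(train_min, train_max):
    """Build an expression string with the training constants filled in.

    Each constant is parenthesized so negative values stay unambiguous.
    """
    return f"(A-({train_min}))/(({train_max})-({train_min}))"


# Hypothetical training and testing values for one numeric attribute.
train = [20, 30, 40, 50, 60]
test = [25, 40, 70]

train_min, train_max = min(train), max(train)
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)  # sample standard deviation (assumption)

normalized = [normalize_with_train_stats(v, train_min, train_max) for v in test]
standardized = [standardize_with_train_stats(v, train_mean, train_std) for v in test]

print("normalized:  ", normalized)
print("standardized:", standardized)
print("expression:  ", normalization_expression(train_min, train_max))
```

A testing value outside the training range (70 here) maps outside [0, 1] after normalization. This is expected when training statistics are reused and is not an error.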