DSCI553 Foundations and Applications of Data Mining
Assignment 5 Clustering
Deadline: Apr. 23rd 11:59 PM PST
1. Overview of the Assignment
In Assignment 5, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithm. Doing the
assignment helps you get familiar with clustering algorithms using various distance measurements. The
datasets you are going to use are synthetic.
2. Assignment Requirements
2.1 Programming Language and Library Requirements
a. You must use Python to implement all the tasks. You can only use standard Python libraries (i.e.,
external libraries like numpy or pandas are not allowed). Spark RDD is optional for Python. If you
want to use Spark, please specify the following environment in your code:
a. There will be 10% bonus for Scala implementation in each task. Spark RDD is required for Scala. You
can get the bonus only when both Python and Scala implementations are correct.
b. Spark DataFrame and DataSet are not allowed.
2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3.0
We will use Vocareum to automatically run and grade your submission. You must test your scripts on the
local machine and the Vocareum terminal before submission.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We emphasize
this point because you may find Python implementations of some of the required functions on the Web.
Please do not look for or at any such code!
Plagiarism detection will combine all the code we can find from the Web (e.g., Github) as well as other
students’ code from this and other (previous) sections. We will report all detected plagiarism to the
Since the BFR algorithm has a strong assumption that the clusters are normally distributed with
independent dimensions, we generated synthetic datasets by the following steps: 1) Initializing
some random centroids 2) For each centroid, sampling data points following the normal
distribution in each dimension. Together, the mean of the normal distribution in each dimension
is the centroid of the clusters, and the standard deviation is set manually 3) For each centroid,
we added some data points as the outliers. Figure 1 shows an example of the data points (in CSV
format). The first column is the data point indexes. The other columns represent the
features/dimensions of the data points.
Figure 1: An example of the data points with 3 dimensions
You can access and download the following datasets either under the directory
(resource/asnlib/publicdata/) on Vocareum or Google Drive (USC email only):
a. Five folders are named from “test1” to “test5”. Each folder contains multiple files storing data points.
We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk
of points) to the memory and process these data points with the BFR algorithm.
b. Five files are named from “cluster1.json” to “cluster5.json”. Each file provides the ground truth cluster
for the data points in the corresponding folder from “test1” to “test5”. The key is the data point index
(as a string). The value is its corresponding cluster index. The cluster index of the outliers is -1. The
number of clusters for each dataset is test1: 10, test2: 10, test3: 5, test4: 8, test5: 15.
c. We have generated ten testing sets using the same method. Five of them are provided for your
implementation. The rest of them are used for grading. Notice that the number of dimensions, the
number of files, and the number of data points in each dataset could vary.
4. Task (8pts)
You need to submit the following files on Vocareum: (all lowercase)
a. [REQUIRED] Python scripts: bfr.py
b. [REQUIRED FOR SCALA] Scala scripts: bfr.scala; one Jar package: hw5.jar
c. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).
4.1 Task description
You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should
implement K-Means as the in-memory clustering algorithm that you will use in BFR. You will iteratively
load the data points from a file and process these data points with the BFR algorithm. See below
pseudocode for your reference.
In BFR, there are three sets of points that you need to keep track of: Discard set (DS), Compression set
(CS), Retained set (RS). For each cluster in the DS and CS, the cluster is summarized by:
N: The number of points
SUM: the sum of the coordinates of the points
SUMSQ: the sum of squares of coordinates
The conceptual steps of the BFR algorithm: Please refer to the slides.
The implementation details of the BFR algorithm: Please refer to Section 7.
4.2 Execution commands
Python command: $ python3 bfr.py
Scala command: $ spark-submit --class bfr hw5.jar
Params : the folder containing the files of data points
: the number of clusters
: the output file of cluster results
: the output file of intermediate results
4.3 Output format
a. You must write your clustering results in the JSON format (see Figure 2). The key is the data point
index (as string). The value is its corresponding cluster index. The indexes for both data points and
clusters start from 0.
Figure 2: An example of the output clustering results
b. You must output the intermediate results in the CSV format (use the same headers in Figure 3). Each
line in the intermediate results represents the following information about each iteration/data trunk:
- round id (starting from 1): # of rounds must be # of data chunks in the folder.
- the number of clusters in the discard set
- the total accumulated number of the discarded points: The number should only go up with
- the number of clusters in the compression set
- the total number of the compressed points
- the number of points in the retained set
Figure 3: An example of the intermediate results
We will use normalized mutual information (NMI) score to evaluate your clustering results. To obtain the
full point(0.8pt) for each dataset, the NMI should be above 0.8, and the intermediate result is correct. If
either criterion (i.e., NMI and intermediate result) does not meet, you will lose all points for a dataset.
The submission report will tell you which part of your intermediate results is incorrect. We will grade your
code on all ten datasets.
5. About Vocareum
a. Your code can directly access the datasets under the directory: ../resource/asnlib/publicdata/
b. You should upload the required files under your workspace: work/
c. You must test your scripts on both the local machine and the Vocareum terminal before submission.
d. During submission period, the Vocareum will run and evaluate the results for test1 and test2.
e. You will receive a submission report after Vocareum finishes executing your scripts. The submission
report shows NMI for each dataset. If the intermediate results are incorrect, the submission shows
f. The total execution time of submission period should be less than 600 seconds. The execution time
of grading period need to be less than 3,000 seconds.
g. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade on
your last submission.
6. Grading Criteria
(% penalty = % penalty of possible points you get)
a. You can use your free 5-day extension separately or together. You must submit a late-day request via
https://forms.gle/6aDASyXAuBeV3LkWA. This form is recording the number of late days you use for
each assignment. By default, we will not count the late days if no request submitted.
b. There will be 10% bonus for each task if your Scala implementations are correct. Only when your
Python results are correct, the bonus of Scala will be calculated. There is no partial point for Scala.
c. There will be no point if your submission cannot be executed on Vocareum.
d. There is no regrading. Once the grade is posted on the Blackboard, we will only regrade your
assignments if there is a grading error. No exceptions.
e. There will be 20% penalty for the late submission within one week and no point after that.
The implementation details of the BFR algorithm (you can/should have your own implementation; this
is only for reference). Suppose the number of clusters is K and the number of dimensions is d.
1. Load the data points from one file.
2. Run K-Means on the data points or a random subset of the data points. For the implementation, you
will apply K-means on a subset since you cannot handle all data points in the first file. Initialize the
algorithm with a large number of centroids (e.g., 3 or 5 times of K) and use the Euclidean distance as
the similarity measurement.
3. Among the result clusters from step 2, move all the clusters that contain only one or very few points
(you define “very few”) to RS as the outliers. You will now have two groups of data points: the outlier
data points and inlier data points.
4. Run the clustering algorithm again to cluster the inlier data points into K clusters. Use these K clusters
as your DS. Discard these points and generate the DS statistics.
5. Run the clustering algorithm again to cluster the outlier data points using a large number of clusters
(e.g., 3 or 5 times of K). Generate CS and their statistics from the clusters with more than one data
point and use the remaining as your new RS.
6. The above steps finish the initialization of DS. So far, you have K DS (from step 4), some number
of CS (from step 5), and some number of RS (from step 5).
7. Load the data points from another file (step 2 loads a portion of the first file, load the remaining data
points from the first file).
8. For the new data points, compare them to each DS using the Mahalanobis Distance and assign them
to the nearest DS clusters if the distance is < √ (e.g., 2√).
9. For the new data points which are not assigned to any DS cluster, compare them to each of the CS
using the Mahalanobis Distance and assign them to the nearest CS clusters if the distance is < √
10. For the new data points which are not assigned to any DS or CS cluster, add them to your RS.
11. Run the clustering algorithm on the RS with a large number of centroids (e.g., 3 or 5 time of K).
Generate CS and their statistics from the clusters with more than one data point and add them to
your existing CS list. Use the remaining points as your new RS.
12. Merge CS clusters that have a Mahalanobis Distance < √ (e.g., 2√).
13. Output your intermediate result after all the data points in the data file have been evaluated.
Repeat step 6 to 12.
If this is the last run (after the last chunk of data), merge your CS and RS clusters into the closest DS clusters.