
DSCI553 Foundations and Applications of Data Mining

Spring 2021

Assignment 5 Clustering

Deadline: Apr. 23rd 11:59 PM PST

1. Overview of the Assignment

In Assignment 5, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithms. Completing the assignment will help you become familiar with clustering algorithms that use various distance measures. The datasets you will use are synthetic.

2. Assignment Requirements

2.1 Programming Language and Library Requirements

a. You must use Python to implement all the tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). Spark RDD is optional for Python. If you want to use Spark, please specify the required Spark environment settings in your code.

b. There will be a 10% bonus for a Scala implementation of each task. Spark RDD is required for Scala. You can get the bonus only when both the Python and Scala implementations are correct.

c. Spark DataFrame and DataSet are not allowed.

2.2 Programming Environment

Python 3.6, Scala 2.11, and Spark 2.3.0

We will use Vocareum to automatically run and grade your submission. You must test your scripts on the

local machine and the Vocareum terminal before submission.

2.3 Write your own code

Do not share code with other students!!

For this assignment to be an effective learning experience, you must write your own code! We emphasize

this point because you may find Python implementations of some of the required functions on the Web.

Please do not look for or at any such code!

Plagiarism detection will compare your code against all the code we can find on the Web (e.g., GitHub) as well as other students’ code from this and previous sections. We will report all detected plagiarism to the university.

3. Dataset

Since the BFR algorithm makes the strong assumption that clusters are normally distributed with independent dimensions, we generated the synthetic datasets as follows: 1) initialize some random centroids; 2) for each centroid, sample data points following a normal distribution in each dimension, where the mean of the distribution in each dimension is the corresponding coordinate of the centroid and the standard deviation is set manually; 3) for each centroid, add some data points as outliers. Figure 1 shows an example of the data points (in CSV format). The first column is the data point index. The other columns represent the features/dimensions of the data points.

Figure 1: An example of the data points with 3 dimensions

You can access and download the datasets either under the directory (resource/asnlib/publicdata/) on Vocareum or from Google Drive (USC email only):

https://drive.google.com/drive/folders/1mZgx9vHpsjwtjwbZGdnVjrVypKoZJaP9?usp=sharing

a. Five folders are named from “test1” to “test5”. Each folder contains multiple files storing data points. We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk of points) into memory and process these data points with the BFR algorithm.

b. Five files are named from “cluster1.json” to “cluster5.json”. Each file provides the ground truth cluster

for the data points in the corresponding folder from “test1” to “test5”. The key is the data point index

(as a string). The value is its corresponding cluster index. The cluster index of the outliers is -1. The

number of clusters for each dataset is test1: 10, test2: 10, test3: 5, test4: 8, test5: 15.

c. We have generated ten testing sets using the same method. Five of them are provided for your

implementation. The rest of them are used for grading. Notice that the number of dimensions, the

number of files, and the number of data points in each dataset could vary.
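The ground-truth files are plain JSON, so the standard json module is enough to inspect them, for instance to confirm the cluster counts listed above. A minimal sketch (using a tiny inline stand-in for the real file; in the assignment you would open e.g. cluster1.json instead):

```python
import json

# Stand-in for: with open("../resource/asnlib/publicdata/cluster1.json") as f:
#                   ground_truth = json.load(f)
ground_truth = json.loads('{"0": 3, "1": -1, "2": 3, "3": 0}')

# Count the real clusters, excluding the outlier label -1.
n_clusters = len(set(ground_truth.values()) - {-1})
```
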

4. Task (8pts)

You need to submit the following files on Vocareum: (all lowercase)

a. [REQUIRED] Python scripts: bfr.py

b. [REQUIRED FOR SCALA] Scala scripts: bfr.scala; one Jar package: hw5.jar

c. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).

4.1 Task description

You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should implement K-Means as the in-memory clustering algorithm that you will use inside BFR. You will iteratively load the data points from a file and process them with the BFR algorithm. See the pseudocode in Section 7 for reference.
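As a starting point, a minimal in-memory K-Means (Lloyd's algorithm) in standard-library Python might look like the following. The fixed iteration count and uniform random seeding are simplifying assumptions; a real submission should add a convergence check and smarter centroid seeding (e.g., picking far-apart initial centroids):

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain-Python Lloyd's algorithm with Euclidean distance.

    points: list of equal-length tuples of floats.
    Returns a list of cluster labels in [0, k), one per point.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive seeding: k distinct points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:  # keep the old centroid if the cluster emptied out
                d = len(members[0])
                centroids[c] = tuple(
                    sum(p[j] for p in members) / len(members) for j in range(d)
                )
    return labels
```
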

In BFR, there are three sets of points that you need to keep track of: Discard set (DS), Compression set

(CS), Retained set (RS). For each cluster in the DS and CS, the cluster is summarized by:

N: The number of points

SUM: the sum of the coordinates of the points

SUMSQ: the sum of squares of coordinates
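These summaries are additive (two clusters can be merged by adding their N, SUM, and SUMSQ component-wise), and they let you recover each cluster's centroid and per-dimension variance without keeping the points themselves, via centroid_i = SUM_i / N and variance_i = SUMSQ_i / N − (SUM_i / N)². A sketch in standard-library Python:

```python
def summarize(points):
    """Build the (N, SUM, SUMSQ) summary for one DS/CS cluster."""
    d = len(points[0])
    n = len(points)
    s = [sum(p[j] for p in points) for j in range(d)]
    sq = [sum(p[j] ** 2 for p in points) for j in range(d)]
    return n, s, sq


def centroid_and_variance(n, s, sq):
    """Per-dimension centroid and variance recovered from a summary."""
    centroid = [si / n for si in s]
    variance = [sqi / n - (si / n) ** 2 for si, sqi in zip(s, sq)]
    return centroid, variance
```
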

The conceptual steps of the BFR algorithm: Please refer to the slides.

The implementation details of the BFR algorithm: Please refer to Section 7.

4.2 Execution commands

Python command: $ python3 bfr.py

Scala command: $ spark-submit --class bfr hw5.jar

Params (in order):

- the folder containing the files of data points

- the number of clusters

- the output file of cluster results

- the output file of intermediate results

4.3 Output format

a. You must write your clustering results in the JSON format (see Figure 2). The key is the data point

index (as string). The value is its corresponding cluster index. The indexes for both data points and

clusters start from 0.

Figure 2: An example of the output clustering results

b. You must output the intermediate results in the CSV format (use the same headers as in Figure 3). Each line in the intermediate results represents the following information about each iteration/data chunk:

- round id (starting from 1): the number of rounds must equal the number of data chunks in the folder.

- the number of clusters in the discard set

- the total accumulated number of discarded points: this number should only increase across iterations.

- the number of clusters in the compression set

- the total number of the compressed points

- the number of points in the retained set

Figure 3: An example of the intermediate results
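Both outputs can be written with the standard json and csv modules. A sketch is below; the header names are an assumption (Figure 3 is not reproduced here), so match them to the actual headers shown in Figure 3:

```python
import csv
import json

# Assumed header names -- verify against Figure 3 in the handout.
HEADERS = ["round_id", "nof_cluster_discard", "nof_point_discard",
           "nof_cluster_compression", "nof_point_compression",
           "nof_point_retained"]


def write_outputs(cluster_path, intermediate_path, assignments, rounds):
    """assignments: {point_index (str): cluster_index (int)};
    rounds: list of per-round stat tuples, ordered as in HEADERS."""
    with open(cluster_path, "w") as f:
        json.dump(assignments, f)
    with open(intermediate_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(HEADERS)
        w.writerows(rounds)
```
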

4.4 Grading

We will use the normalized mutual information (NMI) score to evaluate your clustering results. To obtain the full points (0.8 pt) for each dataset, the NMI should be above 0.8 and the intermediate results must be correct. If either criterion (i.e., NMI or intermediate results) is not met, you will lose all points for that dataset. The submission report will tell you which part of your intermediate results is incorrect. We will grade your code on all ten datasets.
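For local sanity checks against the provided ground truth, NMI can be computed in pure Python. The sketch below uses arithmetic-mean normalization, NMI = I(X;Y) / ((H(X)+H(Y))/2); the grader's exact normalization variant is not specified in the handout, so treat this as an approximation for self-evaluation only:

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(labels_true)
    ct = Counter(labels_true)            # marginal counts, true labels
    cp = Counter(labels_pred)            # marginal counts, predicted labels
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information I(X;Y) from the empirical joint distribution.
    mi = sum((nab / n) * math.log(n * nab / (ct[a] * cp[b]))
             for (a, b), nab in joint.items())
    h_true = -sum((c / n) * math.log(c / n) for c in ct.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in cp.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0
```

Note that NMI is invariant to relabeling, so your cluster indices need not match the ground-truth indices to score well.
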

5. About Vocareum

a. Your code can directly access the datasets under the directory: ../resource/asnlib/publicdata/

b. You should upload the required files under your workspace: work/

c. You must test your scripts on both the local machine and the Vocareum terminal before submission.

d. During the submission period, Vocareum will run and evaluate the results for test1 and test2.

e. You will receive a submission report after Vocareum finishes executing your scripts. The submission

report shows NMI for each dataset. If the intermediate results are incorrect, the submission shows

the reason(s).

f. The total execution time during the submission period should be less than 600 seconds. The execution time during the grading period needs to be less than 3,000 seconds.

g. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade your last submission.

6. Grading Criteria

(a % penalty is deducted from the points you would otherwise earn)

a. You can use your free 5-day extension separately or together. You must submit a late-day request via https://forms.gle/6aDASyXAuBeV3LkWA. This form records the number of late days you use for each assignment. By default, we will not count late days if no request is submitted.

b. There will be a 10% bonus for each task if your Scala implementation is correct. The Scala bonus will be calculated only if your Python results are correct. There are no partial points for Scala.

c. There will be no points if your submission cannot be executed on Vocareum.

d. There is no regrading. Once the grade is posted on Blackboard, we will only regrade your assignment if there is a grading error. No exceptions.

e. There will be a 20% penalty for late submissions within one week, and no points after that.

7. Appendix

The implementation details of the BFR algorithm (you can/should have your own implementation; this

is only for reference). Suppose the number of clusters is K and the number of dimensions is d.

1. Load the data points from one file.

2. Run K-Means on the data points or a random subset of the data points. For this implementation, you will apply K-Means to a subset, since you cannot handle all the data points in the first file at once. Initialize the algorithm with a large number of centroids (e.g., 3 or 5 times K) and use the Euclidean distance as the similarity measurement.

3. Among the result clusters from step 2, move all the clusters that contain only one or very few points

(you define “very few”) to RS as the outliers. You will now have two groups of data points: the outlier

data points and inlier data points.

4. Run the clustering algorithm again to cluster the inlier data points into K clusters. Use these K clusters

as your DS. Discard these points and generate the DS statistics.

5. Run the clustering algorithm again to cluster the outlier data points using a large number of clusters (e.g., 3 or 5 times K). Generate the CS and their statistics from the clusters with more than one data point and use the remaining points as your new RS.

6. The above steps finish the initialization of the DS. At this point, you have K DS clusters (from step 4), some number of CS clusters (from step 5), and some remaining points in the RS (from step 5).

7. Load the data points from another file. (Since step 2 loads only a portion of the first file, first load the remaining data points from the first file.)

8. For the new data points, compare them to each DS cluster using the Mahalanobis distance and assign them to the nearest DS cluster if the distance is < α√d (e.g., 2√d).

9. For the new data points which are not assigned to any DS cluster, compare them to each of the CS clusters using the Mahalanobis distance and assign them to the nearest CS cluster if the distance is < α√d (e.g., 2√d).

10. For the new data points which are not assigned to any DS or CS cluster, add them to your RS.

11. Run the clustering algorithm on the RS with a large number of centroids (e.g., 3 or 5 times K).

Generate CS and their statistics from the clusters with more than one data point and add them to

your existing CS list. Use the remaining points as your new RS.

12. Merge CS clusters that have a Mahalanobis distance < α√d (e.g., 2√d).

13. Output your intermediate result after all the data points in the data file have been evaluated. Repeat steps 7 to 12 for each remaining file.

If this is the last run (after the last chunk of data), merge your CS and RS clusters into the closest DS clusters.
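The Mahalanobis distance used in steps 8, 9, and 12 can be computed directly from a cluster's (N, SUM, SUMSQ) summary, using the per-dimension standard deviations the summary implies. A sketch with α = 2 as the example threshold from the steps above (the variance floor is an added guard, not part of the handout):

```python
import math


def mahalanobis(point, n, s, sq):
    """Distance from `point` to the cluster summarized by (N, SUM, SUMSQ),
    normalizing each dimension by that dimension's standard deviation."""
    dist_sq = 0.0
    for j, x in enumerate(point):
        mean = s[j] / n
        var = sq[j] / n - mean ** 2
        if var <= 0:          # degenerate dimension: guard against div-by-zero
            var = 1e-10
        dist_sq += (x - mean) ** 2 / var
    return math.sqrt(dist_sq)


def should_assign(point, n, s, sq, alpha=2.0):
    """True if the point falls within alpha * sqrt(d) of the cluster."""
    d = len(point)
    return mahalanobis(point, n, s, sq) < alpha * math.sqrt(d)
```
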
