Date: 2023-10-30

2023/10/30 18:02 COMP9313-T3 – Ed Lessons
https://edstem.org/au/courses/13867/lessons/43582/slides/298593
Project 2
Frequent item set mining in E-commerce transaction logs (16 marks)
Background: Frequent item set mining is a fundamental task in data mining. It involves
identifying sets of items (or elements) that frequently occur together in a given dataset. This
technique is widely used in various applications, including market basket analysis,
recommendation systems, bioinformatics, and network analysis.
An item set is a collection of one or more items. For example, in a market basket analysis,
an item set could represent a list of products that a customer purchases in a single
transaction. To find the frequent item sets, we first need to compute the
"support" of each item set, which is the proportion of transactions in which the item set
appears.
To be specific, the support of an item set is the number of transactions in which the item set
appears, divided by the total number of transactions. For example, suppose we have a
dataset of 1000 transactions, and the item set {milk, bread} appears in 100 of those
transactions. The support of the item set {milk, bread} would be calculated as follows:
Support({milk, bread}) = Number of transactions containing
{milk, bread} / Total number of transactions
= 100 / 1000
= 10%
So the support of the item set {milk, bread} is 10%. This means that in 10% of the
transactions, the items milk and bread were both purchased.
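As a quick check of the arithmetic above, here is a minimal plain-Python sketch of the support calculation. It is not Spark code, and the function name `support` and the toy transactions are illustrative assumptions rather than part of the project template.

```python
def support(transactions, item_set):
    """Fraction of transactions that contain every item in item_set."""
    item_set = set(item_set)
    if not transactions:
        return 0.0
    # A transaction counts when item_set is a subset of its items.
    hits = sum(1 for t in transactions if item_set <= set(t))
    return hits / len(transactions)


# Hypothetical toy data: 1000 transactions, 100 of which contain both items.
transactions = [{"milk", "bread"}] * 100 + [{"milk"}] * 400 + [{"eggs"}] * 500
print(support(transactions, {"milk", "bread"}))  # 0.1
```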
Problem Definition: You are given an E-Commerce dataset of customer purchase
transaction logs collected over time. Each record in the dataset has the following five fields
(see the example dataset):
InvoiceNo: the unique ID to record one purchase transaction
Description: the name of the item in a transaction
Quantity: the amount of the items purchased
InvoiceDate: the time of the transaction
UnitPrice: the price of a single item
Your task is to use Spark to detect the top-k frequent item sets from the log for each
month. To keep the problem simple, you are only required to find frequent item sets
containing three items. The support of an item set X in a month M is computed as:
Support(X) = (Number of transactions containing X in M) / (Total number of transactions in M)
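The per-month definition can be illustrated with a small plain-Python sketch (not the required RDD or DataFrame solution). It assumes that all records sharing an InvoiceNo form one transaction, that each record has already been reduced to a hypothetical (invoice_no, description, (year, month)) tuple, and that repeated items within a transaction count once. The function name and the toy records are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def monthly_triple_support(records):
    """records: iterable of (invoice_no, description, (year, month)) tuples.

    Returns {(year, month): {(item1, item2, item3): support}}.
    """
    # Collect the distinct items of each transaction, keyed by month and invoice.
    baskets = defaultdict(set)
    for invoice_no, description, month in records:
        baskets[(month, invoice_no)].add(description)

    totals = defaultdict(int)                       # transactions per month
    counts = defaultdict(lambda: defaultdict(int))  # triple counts per month
    for (month, _invoice), items in baskets.items():
        totals[month] += 1
        # sorted() puts each triple in alphabetical order, so every
        # 3-item set maps to exactly one key.
        for triple in combinations(sorted(items), 3):
            counts[month][triple] += 1

    return {
        month: {t: c / totals[month] for t, c in triples.items()}
        for month, triples in counts.items()
    }


# Hypothetical records: three transactions in 1/2010 and one in 2/2010.
records = [
    (1, "A", (2010, 1)), (1, "B", (2010, 1)), (1, "C", (2010, 1)), (1, "D", (2010, 1)),
    (2, "A", (2010, 1)), (2, "B", (2010, 1)), (2, "C", (2010, 1)),
    (3, "A", (2010, 1)), (3, "C", (2010, 1)), (3, "D", (2010, 1)),
    (4, "A", (2010, 2)), (4, "B", (2010, 2)), (4, "C", (2010, 2)),
]
print(monthly_triple_support(records))
```

Enumerating every 3-item combination per transaction grows cubically with basket size, so this is only meant to make the definition concrete.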
Output Format: The output format is "MONTH/YEAR,(Item1|Item2|Item3), support
value", where the three items are ordered alphabetically. You need to sort the results first
by time in ascending order, then by the support value in descending order, and finally by the
item set in alphabetical order. If a month has fewer than k item sets, just output all item sets
in order for that month.
For example, given the sample dataset and k=2, your result should be like this:
1/2010,(A|B|C),0.6666666666666666
1/2010,(A|C|D),0.6666666666666666
2/2010,(A|B|C),1.0
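The ordering and formatting rules can be sketched in plain Python as follows. The function name `format_top_k` and the support values are hypothetical, chosen only so that the output matches the example lines above. The sketch assumes each item set is already an alphabetically ordered tuple and reads "alphabetical order of the item set" as an item-by-item comparison.

```python
def format_top_k(results, k):
    """results: {(year, month): {(item1, item2, item3): support}}."""
    lines = []
    for (year, month) in sorted(results):  # time in ascending order
        ranked = sorted(
            results[(year, month)].items(),
            key=lambda kv: (-kv[1], kv[0]),  # support descending, then items ascending
        )
        for items, sup in ranked[:k]:  # a month with fewer than k sets yields all of them
            lines.append(f"{month}/{year},({'|'.join(items)}),{sup}")
    return lines


# Hypothetical support values, not taken from the real sample dataset.
results = {
    (2010, 2): {("A", "B", "C"): 1.0},
    (2010, 1): {
        ("A", "C", "D"): 2 / 3,
        ("A", "B", "D"): 1 / 3,
        ("A", "B", "C"): 2 / 3,
    },
}
for line in format_top_k(results, 2):
    print(line)
```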
Code Format: The code template has been provided. You need to submit two solutions, one
using only RDD APIs and the other one using only DataFrame APIs. Your code should take
three parameters: the input file, the output folder, and the value of k. Assuming k=2, you
need to use the command below to run your code:
$ spark-submit project2_rdd.py "file:///home/sample.csv" "file:///home/output" 2
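The three parameters arrive as ordinary command-line arguments. The sketch below shows one way they might be read. The helper `parse_args` is hypothetical, and the provided template may already handle arguments differently.

```python
def parse_args(argv):
    """Return (input_path, output_path, k) from a spark-submit style argv."""
    if len(argv) != 4:
        raise SystemExit("usage: <script> <input file> <output folder> <k>")
    input_path, output_path = argv[1], argv[2]
    k = int(argv[3])  # k arrives as a string on the command line
    if k < 1:
        raise SystemExit("k must be a positive integer")
    return input_path, output_path, k


# The argument list that the command above would pass to the script.
example_argv = ["project2_rdd.py", "file:///home/sample.csv", "file:///home/output", "2"]
print(parse_args(example_argv))
```

In a real script the list would come from `sys.argv`.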
Submission
Deadline: Tuesday 31st October 11:59:59 PM