Frequent Itemsets 1
IERG4300 / IEMS5709 Web-Scale Information Analytics, Fall 2021
Frequent Itemsets and Association Rule Mining
Prof. Wing C. Lau, Department of Information Engineering
wclau@ie.cuhk.edu.hk

Frequent Itemsets 2
Acknowledgements
¢ The slides used in this chapter are adapted from:
l CS246 Mining Massive Data-sets, by Jure Leskovec, Stanford University, with the author's permission. All copyrights belong to the original author of the material.

Frequent Itemsets 3
Association Rule Discovery
Supermarket shelf management – the market-basket model:
¢ Goal: Identify items that are bought together by sufficiently many customers
¢ Approach: Process the sales data collected with barcode scanners to find dependencies among items
¢ A classic rule:
l If one buys diapers and milk, then he is likely to buy beer
l Don't be surprised if you find six-packs next to diapers!
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Frequent Itemsets 4
The Market-Basket Model
¢ A large set of items
l e.g., things sold in a supermarket
¢ A large set of baskets, each of which is a small subset of items
l e.g., the things one customer buys on one day
¢ A general many-many mapping (association) between two kinds of things
l But we ask about connections among "items", not "baskets"
(Same example baskets as above: TID 1–5.)

Frequent Itemsets 5
Association Rules: Approach
¢ Given a set of baskets (input: the baskets above)
¢ Want to discover association rules
l People who bought {x, y, z} tend to buy {v, w}
• Amazon!
¢ 2-step approach:
l 1) Find frequent itemsets
l 2) Generate association rules
Output – rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Frequent Itemsets 6
Applications – (1)
¢ Items = products; Baskets = sets of products someone bought in one trip to the store
¢ Real market baskets: Chain stores keep TBs of data about what customers buy together
l Tells how typical customers navigate stores, lets them position tempting items
l Suggests tie-in "tricks", e.g., run a sale on diapers and raise the price of beer
l High support needed, or no $$'s
¢ Amazon's "people who bought X also bought Y"

Frequent Itemsets 7
Applications – (2)
¢ Baskets = sentences; Items = documents containing those sentences
l Items that appear together too often could represent plagiarism
l Notice items do not have to be "in" baskets
¢ Baskets = patients; Items = drugs & side-effects
l Has been used to detect combinations of drugs that result in particular side-effects
l But requires an extension: absence of an item needs to be observed as well as presence

First: Define
l Frequent itemsets
l Association rules: Confidence, Support, Interestingness
Then: Algorithms for finding frequent itemsets
l Finding frequent pairs
l A-Priori algorithm
l PCY algorithm + refinements

Frequent Itemsets 9
Frequent Itemsets
¢ Simplest question: Find sets of items that appear together "frequently" in baskets
¢ Support for itemset I: number of baskets containing all items in I
l Often expressed as a fraction of the total number of baskets
¢ Given a support threshold s, the sets of items that appear in at least s baskets are called frequent itemsets
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Support of {Beer, Bread} = 2

Frequent Itemsets 10
Example: Frequent Itemsets
¢ Items = {milk, coke, pepsi, beer, juice}
¢ Minimum support = 3 baskets
B1 = {m, c, b}   B2 = {m, p, j}   B3 = {m, b}   B4 = {c, j}
B5 = {m, p, b}   B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
¢ Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}

Frequent Itemsets 11
Association Rules
¢ Association rules: if-then rules about the contents of baskets
¢ {i1, i2, …, ik} → j means: "if a basket contains all of i1, …, ik then it is likely to contain j"
¢ In practice there are many rules; we want to find the significant/interesting ones!
¢ Confidence of this association rule is the probability of j given I = {i1, …, ik}:
conf(I → j) = support(I ∪ j) / support(I) = Pr[ j | I ]
*Note: support(I ∪ j) = # (or %) of baskets that contain BOTH I AND j

Frequent Itemsets 12
Interesting Association Rules
¢ Not all high-confidence rules are interesting
l The rule X → milk may have high confidence for many itemsets X, because milk is simply purchased very often (independent of X), so the confidence will be high
¢ Interest of an association rule I → j: the difference between its confidence and the fraction of baskets that contain j:
Interest(I → j) = conf(I → j) − Pr[ j ] = Pr[ j | I ] − Pr[ j ]
l Interesting rules are those with high positive or negative interest values

Frequent Itemsets 13
Example: Confidence and Interest
(Same baskets B1–B8 as in the example above.)
¢ Association rule: {m, b} → c
l Confidence = 2/4 = 0.5
l Interest = |0.5 − 5/8| = 1/8
• Item c appears in 5/8 of the baskets
• The rule is not very interesting!

Frequent Itemsets 14
Finding Association Rules
¢ Problem: Find all association rules with support ≥ s and confidence ≥ c
l Note: the support of an association rule is the support of the set of items on the left side
¢ Hard part: finding the frequent itemsets!
l If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be "frequent"
conf(I → j) = support(I ∪ j) / support(I)

Frequent Itemsets 15
Mining Association Rules
¢ Step 1: Find all frequent itemsets I
l (we will explain this next)
¢ Step 2: Rule generation
l For every subset A of I, generate a rule A → I \ A
• Since I is frequent, A is also frequent
• Variant 1: single pass to compute the rule confidence
conf(A,B → C,D) = supp(A,B,C,D) / supp(A,B)
• Variant 2:
Observation**: if A,B,C → D is below confidence, so is A,B → C,D
Can generate "bigger" rules from smaller ones!
l Output the rules above the confidence threshold

Frequent Itemsets 16
Mining Association Rules (cont'd)
¢ Claim: If A,B,C → D is below confidence, so is A,B → C,D. Why?
Since Supp(ABC) <= Supp(AB), we have
Conf(ABC → D) = Supp(ABCD) / Supp(ABC) >= Supp(ABCD) / Supp(AB) = Conf(AB → CD).
Thus, IF Conf(AB → CD) >= threshold THEN Conf(ABC → D) >= threshold as well; equivalently, IF Conf(ABC → D) < threshold THEN Conf(AB → CD) is also below threshold.
This means we can first check Conf(AB → CD): if it is above the threshold, we can simply generate the additional rules, e.g., ABC → D, ABD → C.
=> Can generate "bigger" rules from smaller ones!
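To make the definitions above concrete, here is a minimal Python sketch (not part of the original slides) that computes support, confidence and interest on the toy baskets from the example; the helper names are purely illustrative.

```python
# Toy baskets from the example above (m = milk, c = coke, p = pepsi, b = beer, j = juice)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    """Number of baskets containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for b in baskets if s <= b)

def confidence(I, j):
    """conf(I -> j) = support(I U {j}) / support(I)."""
    return support(set(I) | {j}) / support(I)

def interest(I, j):
    """interest(I -> j) = conf(I -> j) - Pr[j]."""
    return confidence(I, j) - support({j}) / len(baskets)

print(confidence({"m", "b"}, "c"))   # 0.5
print(interest({"m", "b"}, "c"))     # 0.5 - 5/8 = -0.125
```

For the rule {m, b} → c this reproduces the confidence of 0.5 and an interest of −1/8, whose magnitude is the 1/8 quoted in the example.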
Frequent Itemsets 17
Example
B1 = {m, a, b}   B2 = {m, p, j}   B3 = {m, a, b, n}   B4 = {a, j}
B5 = {m, p, b}   B6 = {m, a, b, j}   B7 = {a, b, j}   B8 = {b, a}
¢ Min support s = 3, confidence c = 0.75
¢ 1) Frequent itemsets:
l {b,m} {a,b} {a,m} {a,j} {m,a,b}
¢ 2) Generate rules:
l b→m: c = 4/6   b→a: c = 5/6   b,a→m: c = 3/5
l m→b: c = 4/5   …   b,m→a: c = 3/4
l b→a,m: c = 3/6

Frequent Itemsets 18
A Compact Way to Store/Track Frequent Itemsets
You only need to store the so-called Maximal Frequent Itemsets.
Definition: a frequent set for which NO immediate superset is frequent.
Nice properties:
l All subsets of a maximal frequent itemset are frequent, AND
l Every frequent itemset must be a subset of some maximal frequent itemset.
=> By enumerating ALL subsets of all maximal frequent itemsets, you will NOT miss any frequent itemset! Also, every subset you get is a frequent itemset!

Frequent Itemsets 19
Example: Maximal Frequent Itemsets (s = 3)
Itemset  Count  Maximal?
A        4      No
B        5      No
C        3      No
AB       4      Yes
AC       2      No
BC       3      Yes
ABC      2      No
l A, B, C are frequent, but each has a frequent superset (e.g., AB or BC), so none of them is maximal.
l AB and BC are frequent, and their only superset, ABC, is not frequent, so they are maximal.

Finding Frequent Itemsets

Frequent Itemsets 21
Itemsets: Computation Model
¢ Back to finding frequent itemsets
¢ Typically, data is kept in flat files rather than in a database system:
l Stored on disk
l Stored basket-by-basket
l Baskets are small, but we have many baskets and many items
• Expand baskets into pairs, triples, etc. as you read the baskets
• Use k nested loops to generate all sets of size k
[File layout: items are positive integers, and boundaries between baskets are –1.]
Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.

Frequent Itemsets 22
Computation Model
¢ The true cost of mining disk-resident data is usually the number of disk I/O's
¢ In practice, association-rule algorithms read the data in passes – all baskets are read in turn
¢ We measure the cost by the number of passes an algorithm makes over the data

Frequent Itemsets 23
Main-Memory Bottleneck
¢ For many frequent-itemset algorithms, main memory is the critical resource
l As we read baskets, we need to count something, e.g., occurrences of pairs of items
l The number of different things we can count is limited by main memory
l Swapping counts in/out is a disaster (why?)

Frequent Itemsets 24
Finding Frequent Pairs
¢ The hardest problem often turns out to be finding the frequent pairs of items {i1, i2}
l Why? Often frequent pairs are common, frequent triples are rare
• Why? The probability of being frequent drops exponentially with size, while the number of sets grows more slowly with size.
¢ Let's first concentrate on pairs, then extend to larger sets
¢ The approach:
l We always need to generate all the itemsets
l But we would only like to count/keep track of those itemsets that in the end turn out to be frequent

Frequent Itemsets 25
Naïve Algorithm
¢ Naïve approach to finding frequent pairs
¢ Read the file once, counting in main memory the occurrences of each pair:
l From each basket of n items, generate its n(n-1)/2 pairs by two nested loops
¢ Fails if (#items)^2 exceeds main memory
l Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
• Suppose 10^5 items and counts are 4-byte integers
• Number of pairs of items: 10^5 (10^5 - 1) / 2 ≈ 5*10^9
• Therefore, 2*10^10 bytes (20 gigabytes) of memory needed

Frequent Itemsets 26
Counting Pairs in Memory
Two approaches:
¢ Approach 1: Count all pairs using a matrix
¢ Approach 2: Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c"
l If integers and item ids are 4 bytes, we need approximately 12 bytes for each pair with count > 0
l Plus some additional overhead for the hashtable
Note:
¢ Approach 1 only requires 4 bytes per pair
¢ Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)

Frequent Itemsets 27
Comparing the 2 Approaches
[Figure: triangular matrix – 4 bytes per pair; triples – 12 bytes per occurring pair.]

Frequent Itemsets 28
Triangular Matrix Approach
l n = total number of items
l Count the pair of items {i, j} only if i < j
¢ Keep pair counts in lexicographic order:
l {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…,{2,n}, {3,4},…
¢ Pair {i, j} is at position (i – 1)(n – i/2) + j – i
¢ Total number of pairs: n(n – 1)/2; total bytes ≈ 2n^2
¢ Triangular Matrix requires 4 bytes per pair
¢ Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)
l Beats triangular matrix if less than 1/3 of
possible pairs actually occur
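As a sanity check of the lexicographic layout described above, here is a small Python sketch (illustrative only, not from the slides) that maps each pair {i, j}, 1 ≤ i < j ≤ n, to its slot in a flat count array using the position formula (i – 1)(n – i/2) + (j – i):

```python
n = 5  # total number of items, numbered 1..n

# One 4-byte count per unordered pair {i, j} with i < j, kept in a flat array.
counts = [0] * (n * (n - 1) // 2)

def pair_index(i, j, n):
    """Map the pair {i, j} (1-based, i < j) to a 0-based slot.

    The 1-based position in lexicographic order is (i - 1)(n - i/2) + (j - i),
    written here with exact integer arithmetic as (i - 1)(2n - i)/2 + (j - i).
    """
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1

# The order {1,2}, {1,3}, ..., {1,n}, {2,3}, ... maps to slots 0, 1, 2, ...
expected = 0
for i in range(1, n + 1):
    for j in range(i + 1, n + 1):
        assert pair_index(i, j, n) == expected
        expected += 1
```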
The A-Priori Algorithm
Frequent Itemsets 30
A-Priori Algorithm – (1)
¢ A two-pass approach called
a-priori limits the need for
main memory
¢ Key idea: monotonicity
l If a set of items I appears at
least s times, so does every subset J of I.
¢ Contrapositive for pairs:
If item i does not appear in s baskets, then no pair
including i can appear in s baskets
Frequent Itemsets 31
A-Priori Algorithm – (2)
¢ Pass 1: Read baskets and count in main memory the
occurrences of each individual item
• Requires only memory proportional to #items, usually enough
memory even for 1 billion (= n) different types of items!
¢ Items that appear >= s times are the frequent items
¢ Pass 2: Read baskets again and count in main memory
only those pairs where both elements are frequent (from
Pass 1)
l Requires memory proportional to the square of the number of frequent items only
(for counts)
l Plus a list of the frequent items (so you know what must be
counted)
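A minimal Python sketch of the two passes just described, assuming the baskets fit in a Python list (a real implementation would stream each pass from disk); function and variable names are illustrative:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs with support threshold s."""
    # Pass 1: count occurrences of each individual item.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose members are BOTH frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(apriori_pairs(baskets, 3))   # {('b', 'c'): 4, ('b', 'm'): 4, ('c', 'j'): 3}
```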
Frequent Itemsets 32
Main-Memory: Picture of A-Priori
[Figure: main-memory layout of A-Priori. Pass 1: counts of individual items. Pass 2: table of frequent items plus counts of pairs of frequent items (the candidate pairs).]
Frequent Itemsets 33
Detail for A-Priori
¢ You can use the triangular
matrix method with:
n = number of frequent items
l May save space compared with
storing triples
¢ Trick: re-number frequent
items 1,2,… and keep a table
relating new numbers to
original item numbers
[Figure: main-memory layout. Pass 1: item counts. Pass 2: a frequent-items table mapping old item numbers to the new numbering 1, 2, …, plus the (triangular-matrix) counts of pairs of frequent items.]
Frequent Itemsets 34
Frequent Triples, Etc.
¢ For each k, we construct two sets of
k-tuples (sets of size k):
l Ck = candidate k-tuples = those that might be frequent sets
(support > s) based on information from the pass for k–1
l Lk = the set of truly frequent k-tuples
[Pipeline: C1 (all items) → filter (count the items) → L1 → construct → C2 (all pairs of items from L1) → filter (count the pairs) → L2 → construct → C3 → … (construction of C3 to be explained).]
Frequent Itemsets 35
Example
¢ Hypothetical steps of the A-Priori algorithm
l C1 = { {b} {c} {j} {m} {n} {p} }
l Count the support of itemsets in C1
l Prune non-frequent: L1 = { b, c, j, m }
l Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
l Count the support of itemsets in C2
l Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
l Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} }
l Count the support of itemsets in C3
l Prune non-frequent: L3 = { {b,c,m} }
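For the construction of C3 "to be explained", a common choice is the standard join-and-prune step shown in the sketch below (illustrative Python, not taken from the slides). Note that it is stricter than the C3 listed in this example, which contains every triple of the frequent items: starting from L2 above, only {b, c, m} survives as a candidate.

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    """Build C_k from L_{k-1}: join frequent (k-1)-sets that share their first
    k-2 items, then prune any candidate with a non-frequent (k-1)-subset."""
    prev = sorted(tuple(sorted(x)) for x in L_prev)
    prev_set = set(prev)
    Ck = set()
    for a in prev:
        for b in prev:
            if a < b and a[:k - 2] == b[:k - 2]:
                cand = tuple(sorted(set(a) | set(b)))
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck

L2 = [{"b", "m"}, {"b", "c"}, {"c", "m"}, {"c", "j"}]
print(gen_candidates(L2, 3))   # {('b', 'c', 'm')}
```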
Frequent Itemsets 36
A-Priori for All Frequent Itemsets
¢ One pass for each k (itemset size)
¢ Needs room in main memory to count
each candidate k–tuple
¢ For typical market-basket data and reasonable support
(e.g., 1%), k = 2 requires the most memory
¢ Many possible extensions:
l Lower the support s as itemset gets bigger
l Association rules with intervals:
• For example: Men over 65 have 2 cars
l Association rules when items are in a taxonomy
• Bread, Butter → FruitJam
• BakedGoods, MilkProduct → PreservedGoods
PCY (Park-Chen-Yu) Algorithm
Frequent Itemsets 38
PCY (Park-Chen-Yu) Algorithm
¢ Observation:
In pass 1 of a-priori, most memory is idle
l We store only individual item counts
l Can we use the idle memory to reduce
memory required in pass 2?
¢ Pass 1 of PCY: In addition to item counts, maintain a
hash table with as many buckets as fit in memory
l Keep a count for each bucket into which
pairs of items are hashed
• Just the count, not the pairs that hash to the bucket!
Frequent Itemsets 39
PCY Algorithm – First Pass
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item's count;
    FOR (each pair of items in the basket):    (new in PCY)
        hash the pair to a bucket;
        add 1 to the count for that bucket;
¢ Pairs of items need to be generated from the input file; they are not present in the file
¢ We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times
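A minimal Python sketch of this first pass (illustrative; the bucket count and hash function are arbitrary choices, and Python's built-in hash stands in for whatever pair-hashing is used in practice):

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 1_000_003   # illustrative; in practice, as many buckets as fit in memory

def pcy_pass1(baskets):
    """PCY Pass 1: count items and, in the otherwise idle memory, count the
    buckets into which pairs are hashed (the pairs themselves are not stored)."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):            # new in PCY
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    return item_counts, bucket_counts

# Usage: item_counts, bucket_counts = pcy_pass1(baskets)
```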
Frequent Itemsets 40
Observations about Buckets
¢ If a bucket contains a frequent pair, then
the bucket is surely frequent
l But the hash alone does not let us eliminate any non-frequent pair that falls in this bucket
¢ Even without any frequent pair, a bucket
can still be frequent (why?)
¢ But, for a bucket with total count less than s,
none of its pairs can be frequent
l Pairs that hash to a non-frequent bucket can be eliminated from the
candidate list (even if the pair consists of 2 frequent items)
¢ Pass 2:
Only count pairs that hash to frequent buckets
Frequent Itemsets 41
PCY Algorithm – Between Passes
¢ Replace the buckets by a bit-vector:
l 1 means the bucket count reached the support s (a frequent bucket); 0 means it did not
¢ 4-byte integer counts are replaced by bits, so the bit-vector requires only 1/32 of the memory the hash table used in Pass 1
¢ Also, decide which items are frequent
and list them for the second pass
Frequent Itemsets 42
PCY Algorithm – Pass 2
¢ Count all pairs {i, j} that meet the
conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in the bit vector is 1
(i.e., frequent bucket)
¢ Both conditions are necessary for the
pair to have a chance of being frequent
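Continuing the Pass-1 sketch above, the following illustrative Python turns the bucket counts into a bitmap and then counts only the candidate pairs satisfying both conditions; it takes the Pass-1 outputs as arguments, and the same pair hash must be used in both passes:

```python
from collections import Counter
from itertools import combinations

def pcy_pass2(baskets, item_counts, bucket_counts, s, num_buckets):
    """PCY Pass 2, given the Pass-1 outputs: count {i, j} only if both items
    are frequent AND the pair hashes to a frequent bucket."""
    # Between the passes: replace the 4-byte bucket counts by one bit each.
    bitmap = [c >= s for c in bucket_counts]
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)      # condition 1
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % num_buckets]:         # condition 2
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```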
Frequent Itemsets 43
Main-Memory: Picture of PCY
[Figure: main-memory layout of PCY. Pass 1: item counts plus a hash table of bucket counts for pairs. Pass 2: frequent items, the bitmap (replacing the hash table), and counts of candidate pairs.]
Frequent Itemsets 44
Main-Memory Details
¢ Buckets require a few bytes each:
l Note: we don’t have to count past s
l #buckets is O(main-memory size)
¢ On second pass, a table of (item, item, count) triples is
essential (we cannot use triangular matrix approach,
why?)
l Thus, hash table must eliminate approx. 2/3 of the candidate pairs
for PCY to beat A-Priori.
Frequent Itemsets 45
Refinement: Multistage Algorithm
¢ Limit the number of candidates to be counted
l Remember: Memory is the bottleneck
l Still need to generate all the itemsets but we only want to
count/keep track of the ones that are frequent
¢ Key idea: After Pass 1 of PCY, rehash only those pairs
that qualify for Pass 2 of PCY
l i and j are frequent, and
l {i, j} hashes to a frequent bucket from Pass 1
¢ On middle pass (i.e. the new Pass 2), fewer pairs
contribute to buckets, so fewer false positives
¢ Requires 3 passes over the data
Frequent Itemsets 46
Main-Memory: Multistage
[Figure: main-memory layout of Multistage.
Pass 1: item counts + first hash table; count the items and hash pairs {i, j} into the first table.
Pass 2: frequent items + Bitmap 1 + second hash table; hash pairs {i, j} into the second table iff i and j are frequent and {i, j} hashes to a frequent bucket in Bitmap 1.
Pass 3: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs; count pairs {i, j} iff i and j are frequent, {i, j} hashes to a frequent bucket in Bitmap 1, and {i, j} hashes to a frequent bucket in Bitmap 2.]
Frequent Itemsets 47
Multistage – Pass 3
¢ Count only those pairs {i, j} that satisfy these
candidate pair conditions:
1. Both i and j are frequent items
2. Using the first hash function, the pair
hashes to a bucket whose bit in the
first bit-vector is 1.
3. Using the second hash function, the pair
hashes to a bucket whose bit in the
second bit-vector is 1.
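A compact, illustrative Python sketch of the whole 3-pass Multistage flow under the conditions above; the two hash functions h1 and h2 are stand-ins built from Python's hash and are assumptions, not prescribed by the slides:

```python
from collections import Counter
from itertools import combinations

def h1(pair, n):                          # two "independent" hash functions;
    return hash(("h1",) + pair) % n       # ad-hoc stand-ins for illustration

def h2(pair, n):
    return hash(("h2",) + pair) % n

def multistage(baskets, s, n_buckets):
    """Three-pass Multistage: PCY Pass 1, a re-hash pass, then the counting pass."""
    # Pass 1: item counts + first hash table (exactly as in PCY).
    item_counts, table1 = Counter(), [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for p in combinations(items, 2):
            table1[h1(p, n_buckets)] += 1
    freq_items = {i for i, c in item_counts.items() if c >= s}
    bitmap1 = [c >= s for c in table1]

    # Pass 2: re-hash ONLY pairs of frequent items that hit a frequent bucket of table 1.
    table2 = [0] * n_buckets
    for basket in baskets:
        kept = sorted(set(basket) & freq_items)
        for p in combinations(kept, 2):
            if bitmap1[h1(p, n_buckets)]:
                table2[h2(p, n_buckets)] += 1
    bitmap2 = [c >= s for c in table2]

    # Pass 3: count a pair only if it passes all three candidate conditions.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & freq_items)
        for p in combinations(kept, 2):
            if bitmap1[h1(p, n_buckets)] and bitmap2[h2(p, n_buckets)]:
                pair_counts[p] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```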
Frequent Itemsets 48
Important Points
1. The two hash functions have to be independent
2. We need to check both hashes on the third pass
l If not, we would end up counting pairs of frequent items that
hashed first to an infrequent bucket (during Pass 1) but
happened to hash to a frequent bucket during the new Pass 2.
Frequent Itemsets 49
Refinement: Multihash
¢ Key idea: Use several independent hash tables on the first
pass
¢ Risk: Halving the number of buckets doubles the average
count
l We have to be sure most buckets will still not reach count s
¢ If so, we can get a benefit like multistage,
but in only 2 passes
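A short illustrative sketch of the Multihash idea, assuming two hash tables that split the Pass-1 memory between them (the hash functions are again ad-hoc stand-ins):

```python
from collections import Counter
from itertools import combinations

def multihash_pass1(baskets, n_buckets):
    """Multihash Pass 1: item counts plus TWO hash tables, each with half the buckets."""
    half = n_buckets // 2
    item_counts = Counter()
    t1, t2 = [0] * half, [0] * half
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for p in combinations(items, 2):
            t1[hash(("a",) + p) % half] += 1   # ad-hoc hash function #1
            t2[hash(("b",) + p) % half] += 1   # ad-hoc hash function #2
    return item_counts, t1, t2

# Pass 2 then counts {i, j} only if i and j are frequent AND the pair falls in a
# frequent bucket of BOTH bitmaps derived from t1 and t2 (compare PCY Pass 2).
```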
Frequent Itemsets 50
Main-Memory: Multihash
[Figure: main-memory layout of Multihash. Pass 1: item counts plus two hash tables (first and second). Pass 2: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]
Frequent Itemsets 51
PCY: Extensions
¢ Either multistage or multihash can use more than two
hash functions
¢ In multistage, there is a point of diminishing returns, since
the bit-vectors eventually consume all of main memory
¢ For multihash, the bit-vectors together occupy exactly the space of one PCY bitmap; but since a fixed amount of main memory must hold all the hash tables, too many hash functions make all the bucket counts exceed s, and thus fail to eliminate any non-frequent pairs!
Frequent Itemsets
in ≤ 2 Passes
Frequent Itemsets 53
Frequent Itemsets in ≤ 2 Passes
¢ A-Priori, PCY, etc., take k passes to find frequent itemsets
of size k
¢ Can we use fewer passes?
¢ Use 2 or fewer passes for all sizes,
but may miss some frequent itemsets
l Random sampling
l SON (Savasere, Omiecinski, and Navathe)
l Toivonen (see textbook [MMDS Ch6])
Frequent Itemsets 54
Random Sampling (1)
¢ Take a random sample of the market baskets
¢ Run A-priori or one of its improvements
in main memory
l So we don’t pay for disk I/O each
time we increase the size of itemsets
l MUST reduce support threshold
proportionally to match the sample size
[Figure: main memory holds a copy of the sample baskets plus space for the counts.]
Frequent Itemsets 55
Random Sampling (2)
¢ Optionally, verify that the candidate pairs are truly frequent
in the entire data set by a second pass (avoid false
positives)
¢ But you don’t catch sets frequent in the whole but not in
the sample
l Smaller threshold, e.g., s/125, helps catch more truly frequent
itemsets
• But requires more space
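A small illustrative sketch of the sampling approach; the in-memory routine is a pairs-only stand-in for full A-Priori, and the proportional scaling of the threshold follows the advice above:

```python
import random
from collections import Counter
from itertools import combinations

def frequent_pairs_in_memory(baskets, s):
    """Plain in-memory A-Priori restricted to pairs (stand-in for the full algorithm)."""
    item_counts = Counter(i for b in baskets for i in set(b))
    freq = {i for i, c in item_counts.items() if c >= s}
    pair_counts = Counter(p for b in baskets
                          for p in combinations(sorted(set(b) & freq), 2))
    return {p for p, c in pair_counts.items() if c >= s}

def sampled_frequent_pairs(all_baskets, s, fraction=0.01, seed=0):
    """Run the in-memory algorithm on a random sample, with the support
    threshold scaled down in proportion to the sample size."""
    random.seed(seed)
    sample = [b for b in all_baskets if random.random() < fraction]
    scaled_s = max(1, int(s * fraction))   # e.g. use s/100 for a 1% sample
    candidates = frequent_pairs_in_memory(sample, scaled_s)
    # An optional second pass over all_baskets can verify these candidates
    # against the true threshold s (eliminating false positives).
    return candidates
```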
Frequent Itemsets 56
SON Algorithm – (1)
¢ Repeatedly read small subsets of the baskets into main
memory and run an in-memory algorithm to find all
frequent itemsets
l Note: we are not sampling, but processing the entire file in
memory-sized chunks
¢ An itemset becomes a candidate if it is found to be
frequent in any one or more subsets of the baskets.
Frequent Itemsets 57
SON Algorithm – (2)
¢ On a second pass, count all the candidate itemsets and
determine which are frequent in the entire set
¢ Key idea: an itemset cannot be frequent in the entire set
of baskets unless it is frequent in at least one subset.
Frequent Itemsets 58
SON – Distributed Version
¢ SON lends itself to distributed/ parallel implementation, e.g.
using MapReduce
¢ Baskets distributed among many nodes
l Compute frequent itemsets at each node
l Distribute candidate itemsets to all nodes
l Accumulate the counts of all candidates
¢ Can be done with two MapReduce jobs:
l First MapReduce job to produce the candidate itemsets
l Second MapReduce job to calculate the true frequent itemsets.
Frequent Itemsets 59
SON: Map/Reduce
¢ Job 1: Find candidate itemsets
l Map?
l Reduce?
¢ Job 2: Find true frequent itemsets
l Map?
l Reduce?
Frequent Itemsets 60
SON: MapReduce Implementation
Mapper for Job 1
¢ Run the A-Priori algorithm on the chunk, using a support threshold scaled down to the chunk (e.g., p·s for a chunk holding fraction p of all baskets)
¢ Output the frequent
itemsets for that chunk
(F, c), where F is the key
(itemset) and c is count
(or proportion)
Reducer for Job 1
¢ Output the candidate itemsets to be verified in Job 2
¢ Given (F, c), discard c and output all candidate itemsets F
Frequent Itemsets 61
SON: MapReduce Implementation (cont’d)
Mapper for Job 2
¢ For each candidate itemset produced by Job 1, count its frequency in the local chunk
Reducer for Job 2
¢ Aggregate the output of the Job-2 Mappers and sum the counts to get the frequency of each candidate itemset across the entire input file
¢ Filter out the itemsets with
support smaller than s
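A compact, single-machine Python sketch of the two SON jobs above (plain functions stand in for real MapReduce mappers and reducers; the pairs-only local A-Priori and the per-chunk threshold s/#chunks are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

def local_frequent_itemsets(chunk, local_s):
    """Stand-in for running A-Priori on one chunk (singletons and pairs only)."""
    items = Counter(i for b in chunk for i in set(b))
    freq = {i for i, c in items.items() if c >= local_s}
    out = {(i,) for i in freq}
    pairs = Counter(p for b in chunk
                    for p in combinations(sorted(set(b) & freq), 2))
    out |= {p for p, c in pairs.items() if c >= local_s}
    return out

def son(chunks, s):
    """SON: Job 1 finds per-chunk candidates, Job 2 counts them over the whole file."""
    n_chunks = len(chunks)
    # Job 1 -- map: frequent itemsets per chunk; reduce: union of all candidates.
    candidates = set()
    for chunk in chunks:
        candidates |= local_frequent_itemsets(chunk, max(1, s // n_chunks))
    # Job 2 -- map: count each candidate in each chunk; reduce: sum and filter by s.
    totals = Counter()
    for chunk in chunks:
        for basket in chunk:
            bset = set(basket)
            for cand in candidates:
                if set(cand) <= bset:
                    totals[cand] += 1
    return {c: n for c, n in totals.items() if n >= s}
```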