Mining Multiple-level Association Rules in Large Databases
IEEE Transactions on Knowledge and Data Engineering, 1999

Authors:
• Jiawei Han, Simon Fraser University, British Columbia (now: University of Illinois Urbana-Champaign, CS Dept)
• Yongjian Fu, University of Missouri-Rolla, Missouri (now: Simon Fraser University, British Columbia)

Story: Ramesh Kumar
Adapted from screenplay by Zhenyu Lu (2007)

OUTLINE
1. Introduction
2. Multiple-Level Association Rules
3. A Method For Mining M-L Association Rules
4. Performance Study
5. Mining Cross-Level Associations
6. Interesting Association Rules
7. Conclusions and Future Work
8. Exam Questions

INTRODUCTION
• Why Multiple-Level (ML) Association Rules
• Pre-requisites for M-L Data Mining (MLDM*)
• Possible Approaches and Rationale
• Assumptions
• How this differs from previous research

*MLDM = Multiple-Level Data Mining

WHY MLDM?
• Rule A => 70% of customers who bought diapers also bought beer
• Rule B => 45% of customers who bought cloth diapers also bought dark beer
• Rule C => 35% of customers who bought Pampers also bought Samuel Adams beer

WHY MLDM?
• What are the conceptual differences between the three rules?
  • Rule A applies at a generic, higher level of abstraction (product)
  • Rule B applies at a more specific level of abstraction (category)
  • Rule C applies at the lowest level of abstraction (brand)
• This process is called drilling down.

WHY MLDM?
• More specific information is found at lower levels
• Hence it is essential to mine at different levels of any concept tree
• Association rules at different levels enable different strategies
• Helps to factor out uninteresting or coincidental rules

PRE-REQUISITES FOR MLDM
1. Data representation at different levels of abstraction
   • Level 1: {DIAPERS, BEER}
   • Level 2: {CLOTH, DISPOSABLE} {REGULAR, LITE}
   • Level 3: {'BUMKINS', 'KUSHIES', 'PAMPERS', 'HUGGIES'} {'BUDWEISER', 'MILLER LITE', 'SAMUEL ADAMS', 'HEINEKEN'}
2. Efficient methods for ML rule mining (focus of this paper)

HIERARCHY TYPES
• Generalization-Specialization (is-a relationships)
• Generalization-Specialization with multiple inheritance
• Whole-Part hierarchies (is-part-of; has-part)

GENERALIZATION-SPECIALIZATION
Vehicle
  4-Wheels
    Sedan
    SUV
  2-Wheels
    Bicycle
    Motorcycle

GENERALIZATION-SPECIALIZATION WITH MULTIPLE INHERITANCE
Vehicle
  Commuting
    Car
    Bicycle
  Recreational
    Bicycle
    Snowmobile
(Bicycle appears under both Commuting and Recreational.)

WHOLE-PART HIERARCHIES
Computer
  Motherboard
    RAM
    CPU
  Hard Drive
    RW Head
    Platter

FOCUS OF THE PAPER
Determining efficient methods for mining multiple-level association rules

DIFFERENT METHODS
• Apply the single-level Apriori algorithm to each of the multiple levels under the same miniconf and minisup.
• Potential problems?
  • Higher levels of abstraction have higher support and lower levels have lower support
  • What is the single best minisup for all levels?
  • Too high a minisup => too few frequent itemsets at lower levels
  • Too low a minisup => far too many uninteresting rules

POSSIBLE SOLUTIONS
• Have different minisup for different levels
• Maybe also different miniconf at different levels
• Progressively decrease minisup as we go down the tree to lower levels
• This paper studies a progressive deepening method developed by extending the Apriori algorithm

MAIN ASSUMPTION
• Explore only descendants of frequent items at any level.
• In other words, if an item is non-frequent at one level, its descendants no longer figure in further analysis.
• ARE THERE ANY PROBLEMS THAT CAN ARISE BECAUSE OF THIS ASSUMPTION?

Explore only descendants of frequent items at any level. What is a potential problem with this approach?
• Will this potentially eliminate interesting rules for itemsets at one level whose ancestors are not frequent at higher levels?
• If so, it can be addressed by 2 workarounds:
  • Use 2 minisup values at higher levels: one for filtering infrequent items, the other for passing down frequent items to lower levels; the latter is called the level passage threshold (lph)
  • The lph may be adjusted by the user to allow descendants of sub-frequent items
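The two-threshold idea can be sketched in a few lines. This is only an illustration, assuming absolute support counts; the function name `apply_thresholds`, the item names and the counts are all hypothetical and do not come from the paper.

```python
from typing import Dict, Set, Tuple

def apply_thresholds(counts: Dict[str, int], minsup: int,
                     passage: int) -> Tuple[Set[str], Set[str]]:
    """Split items into (frequent, passed_down) using two support thresholds.

    `minsup` decides which items are reported as frequent at this level;
    `passage` (the level passage threshold) decides whose descendants
    are still examined at the next level.
    """
    if passage > minsup:
        raise ValueError("the passage threshold should not exceed minsup")
    frequent = {item for item, n in counts.items() if n >= minsup}
    passed_down = {item for item, n in counts.items() if n >= passage}
    return frequent, passed_down

# Hypothetical support counts for four high-level items.
counts = {"milk": 5, "bread": 5, "juice": 3, "candy": 2}
frequent, passed_down = apply_thresholds(counts, minsup=4, passage=3)
print(sorted(frequent))     # items reported as frequent
print(sorted(passed_down))  # items whose descendants are explored further
```

With these made-up numbers, "juice" is sub-frequent: it is not reported at the higher level, but its descendants are still examined at the next level.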

DIFFERENCES FROM PREVIOUS RESEARCH
• Other studies use the same minisup across different levels of the hierarchy
• This study…
  • Uses different minisup values at different levels of the hierarchy
  • Analyzes different optimization techniques
  • Studies the use of interestingness measures

MULTIPLE LEVEL ASSOCIATION RULES
• Definitions
• Example Taxonomy
• Working of the algorithm

DEFINITIONS-1
Assume the database contains:
1. An item dataset containing item descriptions: {<Ai>, <description>}
2. A transaction dataset T containing a set of transactions {Tid, {Ap…Aq}}, where Tid is a transaction identifier (key)

DEFINITION 2.1
• A pattern or an itemset A is one item Ai or a set of conjunctive items Ai ∧ … ∧ Aj.
• The support of a pattern A in a set S, σ(A|S), is the number of transactions in S that contain A versus the total number of transactions in S.
• The confidence of a rule A => B in S is given by φ(A=>B) = σ(A∪B)/σ(A), i.e., it is the conditional probability of B occurring given that A has occurred.
• Specify 2 thresholds, minisup (σ’) and miniconf (φ’), with different values at different levels.
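Both measures in Definition 2.1 can be checked by direct counting. The sketch below is a minimal illustration, not the paper's implementation; the function names `support` and `confidence` and the toy transactions are assumptions made for this example.

```python
from typing import FrozenSet, Iterable, List

def support(pattern: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `pattern`."""
    items = frozenset(pattern)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """sigma(A u B) / sigma(A): estimated probability of B given A."""
    a_items = frozenset(a)
    return support(a_items | frozenset(b), transactions) / support(a_items, transactions)

# Toy transactions (hypothetical), not taken from the paper.
toy = [frozenset(t) for t in (
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
)]

print(support({"milk"}, toy))                # milk is in 3 of the 4 transactions
print(confidence({"milk"}, {"bread"}, toy))  # 2 of those 3 also contain bread
```

Note that `confidence` divides by the support of A, so it is only defined for antecedents that occur at least once.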

DEFINITION 2.2
A pattern A is frequent in set S at level l if:
• the support of A is no less than the corresponding minimum support threshold σ’
A rule A => B is strong for a set S if:
a. each ancestor of every item in A and B is frequent at its corresponding level,
b. A ∧ B is frequent at the current level, and
c. the confidence of A => B is no less than the miniconf at that level
This ensures that the patterns examined at the lower levels arise from itemsets that have a high support at higher levels.

HOW DOES THE METHOD WORK?
Example: Find multiple-level strong association rules for purchase patterns related to category, content and brand
1. Retrieve relevant data from TABLE 2 and merge it into a generalized sales_item table, with the relevant bar codes replaced by a bar_code set as in TABLE 3
2. Find frequent patterns and strong rules at the highest level. 1-item, 2-item, …, k-itemsets may be discovered of the form {bread, vegetable, milk, …}
3. At the next level the process is repeated, but the itemsets will be more specific, e.g. {2% milk, lettuce, white bread}
4. Repeat steps 2 to 3 at all levels until no more frequent patterns are found

OUTLINE
1. Introduction
2. Multiple-Level Association Rules
3. An Algorithm For Mining M-L Association Rules
4. Performance Study
5. Mining Cross-Level Associations
6. Presentation Of Interesting Association Rules
7. Conclusions and Future Work
8. My Own Conclusions
9. Exam Questions

ALGORITHM: Taxonomy For This Exercise
food
  L1: milk, bread
  L2: milk: 2%, chocolate; bread: wheat, white
  L3: milk brands: Dairyland, Foremost; bread brands: Old Mills, Wonder
L1, L2 and L3 correspond to the 3 levels of the hierarchy

ALGORITHM: Dataset For This Exercise

TABLE 1 Sales-transaction Table
Trans-id   Bar_code_set
351428     {17325, 92108, …}
653234     {23423, 56432, …}

TABLE 2 sales_item (Description) Relation
Bar_code   Category   Brand      Content   Size    Price
17325      milk       Foremost   2%        1 Gal   $3.31
…          …          …          …         …       …

TABLE 3 Generalized sales_item Description Table
GID   Barcode_set                 Category   Content   Brand
112   {17325, 31414, 91265, …}    Milk       2%        Foremost
…     …                           …          …         …

ALGORITHM: Explanation Of GID = 112 (Foremost 2% Milk)
• Level 1: item 1, 'Milk'
• Level 2: item 1, '2%'
• Level 3: item 2, 'Foremost'
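Because each digit of a GID encodes one level of the hierarchy, an item can be generalized to a higher level by masking its trailing digits with '*', which is the notation the following slides use ({1**}, {11*}). A minimal sketch, assuming one digit per level; the helper name `generalize` is hypothetical.

```python
def generalize(gid: str, level: int) -> str:
    """Keep the first `level` digits of an encoded item and mask the rest with '*'."""
    if not 1 <= level <= len(gid):
        raise ValueError("level out of range for this code")
    return gid[:level] + "*" * (len(gid) - level)

print(generalize("112", 1))  # level-1 view of GID 112 (milk)
print(generalize("112", 2))  # level-2 view (2% milk)
print(generalize("112", 3))  # level-3 view (Foremost 2% milk)
```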

Transaction Table T[1]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
T6    {211, 323, 524}
T7    {323, 411, 524, 713}

ALGORITHM: Step 1
Level-1 minisup = 4

T[1]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
T6    {211, 323, 524}
T7    {323, 411, 524, 713}

Level-1 frequent 1-itemsets L[1, 1]
Itemset   Support
{1**}     5
{2**}     5

ALGORITHM: Step 2
Filtered T[2]: only keep items in L[1, 1] from T[1]

L[1, 1]
Itemset   Support
{1**}     5
{2**}     5

Filtered T[2]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222}
T3    {112, 122, 221}
T4    {111, 121}
T5    {111, 122, 211, 221}
T6    {211}

L[1, 2]
Itemset       Support
{1**, 2**}    4
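Steps 1 and 2 amount to one counting pass followed by one filtering pass. The sketch below reproduces them on the example table. It is an illustration under two assumptions: thresholds are absolute counts, and T2 is taken to be {111, 211, 222, 323}, the reading under which the support counts quoted on the later slides hold. The helper names `mask`, `frequent_items` and `filter_table` are hypothetical.

```python
from collections import Counter

# Encoded transaction table T[1]. T2 is assumed to contain 211 (see the note above).
T1 = {
    "T1": {"111", "121", "211", "221"},
    "T2": {"111", "211", "222", "323"},
    "T3": {"112", "122", "221", "411"},
    "T4": {"111", "121"},
    "T5": {"111", "122", "211", "221", "413"},
    "T6": {"211", "323", "524"},
    "T7": {"323", "411", "524", "713"},
}

def mask(gid, level):
    """Generalize an encoded item to `level`, e.g. mask('112', 1) -> '1**'."""
    return gid[:level] + "*" * (len(gid) - level)

def frequent_items(table, level, minsup):
    """Count each generalized item once per transaction; keep those meeting minsup."""
    counts = Counter()
    for items in table.values():
        counts.update({mask(g, level) for g in items})
    return {item: n for item, n in counts.items() if n >= minsup}

def filter_table(table, frequent_level1):
    """Keep only items whose level-1 ancestor is frequent; drop emptied transactions."""
    filtered = {}
    for tid, items in table.items():
        kept = {g for g in items if mask(g, 1) in frequent_level1}
        if kept:
            filtered[tid] = kept
    return filtered

L11 = frequent_items(T1, level=1, minsup=4)
T_filtered = filter_table(T1, L11)
print(L11)
for tid, items in T_filtered.items():
    print(tid, sorted(items))
```

T7 disappears from the filtered table because none of its items descends from a frequent level-1 item.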

ALGORITHM: Step 3
Level-2 minsup = 3

Filtered T[2]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222}
T3    {112, 122, 221}
T4    {111, 121}
T5    {111, 122, 211, 221}
T6    {211}

L[2, 1]
Itemset   Support
{11*}     5
{12*}     4
{21*}     4
{22*}     4

L[2, 2]
Itemset       Support
{11*, 12*}    4
{11*, 21*}    3
{11*, 22*}    4
{12*, 22*}    3
{21*, 22*}    3

L[2, 3]
Itemset            Support
{11*, 12*, 22*}    3
{11*, 21*, 22*}    3

Algorithm: Level 3 Ops
Level-3 minisup = 3

Filtered T[2]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222}
T3    {112, 122, 221}
T4    {111, 121}
T5    {111, 122, 211, 221}
T6    {211}

L[3, 1]
Itemset   Support
{111}     4
{211}     4
{221}     3

L[3, 2]
Itemset       Support
{111, 211}    3

This Leads Us To The Following Algorithm…

Algorithm ML_T2L1
• Purpose: To find multiple-level frequent itemsets for mining strong association rules in a transaction database
• Input:
  • T[1]: a hierarchy-information encoded transaction table of the form <TID, Item-set>
  • minisup threshold for each level l, in the form minsup[l]
• Output: Multiple-level frequent itemsets
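A simplified sketch of the progressive deepening loop, in the spirit of ML_T2L1, is given below. It is not the authors' implementation: the function names are hypothetical, thresholds are absolute counts, rule generation is omitted, and T2 is assumed to be {111, 211, 222, 323} so that the counts agree with the worked example.

```python
from collections import Counter
from itertools import combinations

# Encoded transactions T[1]; the second row (T2) is assumed to contain 211.
T1 = [
    {"111", "121", "211", "221"},
    {"111", "211", "222", "323"},
    {"112", "122", "221", "411"},
    {"111", "121"},
    {"111", "122", "211", "221", "413"},
    {"211", "323", "524"},
    {"323", "411", "524", "713"},
]

def mask(gid, level):
    """Generalize an encoded item to `level`, e.g. mask('112', 2) -> '11*'."""
    return gid[:level] + "*" * (len(gid) - level)

def count_support(table, candidates, level):
    """Support count of each candidate itemset, with rows generalized to `level`."""
    counts = Counter()
    for items in table:
        row = {mask(g, level) for g in items}
        counts.update(c for c in candidates if c <= row)
    return counts

def ml_t2l1(t1, minsup, max_level):
    """Return {(level, k): {itemset: count}}; minsup maps level -> absolute count."""
    result = {}
    table = t1
    for level in range(1, max_level + 1):
        if level == 2:
            # Build T[2] once: keep only descendants of frequent level-1 items.
            keep = {next(iter(s)) for s in result[(1, 1)]}
            table = [{g for g in items if mask(g, 1) in keep} for items in t1]
            table = [items for items in table if items]
        singles = {mask(g, level) for items in table for g in items}
        if level > 1:
            # Only children of frequent items at the previous level are candidates.
            parents = {next(iter(s)) for s in result[(level - 1, 1)]}
            singles = {i for i in singles if mask(i, level - 1) in parents}
        candidates = {frozenset([i]) for i in singles}
        k = 1
        while candidates:
            counts = count_support(table, candidates, level)
            frequent = {c: n for c, n in counts.items() if n >= minsup[level]}
            if not frequent:
                break
            result[(level, k)] = frequent
            k += 1
            # Apriori join, then prune candidates with an infrequent (k-1)-subset.
            joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
            candidates = {c for c in joined
                          if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        if (level, 1) not in result:
            break  # nothing frequent at this level, so stop descending
    return result

found = ml_t2l1(T1, minsup={1: 4, 2: 3, 3: 3}, max_level=3)
for (level, k), sets in sorted(found.items()):
    print(level, k, sorted((sorted(s), n) for s, n in sets.items()))
```

Each printed line lists the frequent k-itemsets for one level, which can be compared against the L[l, k] tables of the worked example.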

Algorithm Variations
• Algorithm ML_T1LA
  • Uses only the first encoded transaction table T[1]
  • No filtered T[2] is generated during the processing
  • Support for the candidate sets at all levels is computed at the same time by scanning T[1] once
  • Pros: Only one table and at most k scans (for k-itemsets)
  • Cons: T[1] may contain infrequent items, and large space is required to keep track of all candidates

Algorithm Variations
• Algorithm ML_TML1
  • Generates multiple encoded transaction tables T[1], …, T[max_l+1]
  • A new table T[l+1] is generated at each level l by filtering out infrequent items
  • T[l+1] is used to generate the itemsets at the next level as well as T[l+2]
  • Pros: May save a substantial amount of processing
  • Cons: Can be inefficient if only a few items are filtered out at each level processed

Algorithm Variations
• Algorithm ML_T2LA
  • Uses 2 encoded transaction tables, T[1] and T[2], as in the ML_T2L1 algorithm
  • Support for the candidate sets at all levels is computed at the same time by scanning T[2] once
  • Avoids generation of successive filtered tables T[l+1] at each level
  • Scans T[1] twice to generate T[2]
  • Then scans T[2] once for the generation of each frequent k-itemset
  • Pros: Potentially efficient if T[2] consists of far fewer items than T[1]
  • Cons: ?

Performance Study
• Assumptions:
  • The maximal level in the concept hierarchy is 3
  • A randomized transaction generation algorithm is used to generate a set of synthetic databases
  • I = 1000; L = number of potentially frequent itemsets = 2000; D = total number of transactions = 100,000
  • Two data settings are used: DB1 (average frequent item length = 4, average transaction size = 10) and DB2 (average frequent item length = 6, average transaction size = 20)

Performance Study
[Figure: two performance plots]
• Average frequent item length = 4, average transaction size = 10: minisup[2] = 2%, minisup[3] = 0.75%
• Average frequent item length = 6, average transaction size = 20: minisup[2] = 3%, minisup[3] = 1%

Performance Study
[Figure: two performance plots]
• Average frequent item length = 4, average transaction size = 10: minisup[1] = 60%, minisup[3] = 0.75%
• Average frequent item length = 6, average transaction size = 20: minisup[1] = 55%, minisup[3] = 1%

Performance Study
[Figure: two performance plots]
• First plot: minisup[1] = 60%, minisup[2] = 2%
• Second plot: minisup[1] = 55%, minisup[2] = 3%

Performance Study
Conclusions:
1. The relative performance of the four algorithms depends strongly on the threshold setting (i.e., the power of the filter at each level).
2. Parallel derivation of L[l, k] is useful, and deriving a transaction table T[2] is usually beneficial.
3. ML_T1LA is found to be the best or the second best algorithm.

Cross-level association
[Figure: hierarchy food: milk (m1, m2), bread (b1, b2)]
• Level-by-level mining: mine rules like milk => bread and m2 => b1
• Expanded to cross-level mining: mine rules like milk => b1

Cross-level association
Two adjustments:
1. A single minisup is used at all levels
2. When the frequent k-itemsets are generated, items at all levels are considered; itemsets which contain an item and its ancestor are excluded
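The second adjustment is a simple test on each candidate itemset. The sketch below assumes the digit encoding with '*' masks used in the worked example (e.g. '1**' is an ancestor of '11*' and '112'); the function names `is_ancestor` and `valid_cross_level` are hypothetical.

```python
def is_ancestor(a: str, b: str) -> bool:
    """True if encoded item `a` is a proper ancestor of `b`, e.g. '1**' of '112'."""
    prefix_a = a.rstrip("*")
    prefix_b = b.rstrip("*")
    return len(prefix_a) < len(prefix_b) and prefix_b.startswith(prefix_a)

def valid_cross_level(itemset) -> bool:
    """Reject candidate itemsets that contain both an item and one of its ancestors."""
    items = list(itemset)
    return not any(is_ancestor(x, y) for x in items for y in items)

print(valid_cross_level({"1**", "211"}))  # items from different branches
print(valid_cross_level({"1**", "111"}))  # an item together with its own ancestor
```

A candidate such as {1**, 211} mixes levels but stays valid, since the two items lie in different branches; {1**, 111} is excluded because 1** is an ancestor of 111.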

Cross-level association
Results:
1. Two algorithms were implemented: ML_T2LA-C and ML_T1LA-C
2. Both algorithms are an order of magnitude slower than their counterparts for minisup ranging from 1% to 5%

Filtering of uninteresting association rules
• Redundant rules
  • Redundant rule = a rule that can be computed from a higher-level rule
  • Example:
    • milk => bread (12% sup, 85% con)
    • chocolate milk => bread (1% sup, 84% con)
    • Not interesting if 8% of milk is chocolate milk
• Removal of redundant rules:
  • When a rule R passes the minimum confidence test, it is checked against every strong rule R' of which R is a descendant. If the confidence of R, φ(R), falls within a given variation of the expected confidence, it is removed.
  • Can reduce rules by 30% to 60%
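The arithmetic behind the example can be written out. The sketch below rests on a simplifying assumption that is not taken from the paper: when only the antecedent item is specialized, the child rule is expected to inherit the parent's confidence and a proportional share of its support. The function names and the tolerance value 0.02 are hypothetical.

```python
def expected_from_parent(parent_sup: float, parent_conf: float, child_share: float):
    """Expected (support, confidence) of the specialized rule.

    Simplifying assumption: the child rule gets `child_share` of the parent's
    support and the same confidence as the parent.
    """
    return parent_sup * child_share, parent_conf

def is_redundant(conf: float, expected_conf: float, tolerance: float) -> bool:
    """True if the observed confidence is within `tolerance` of the expected one."""
    return abs(conf - expected_conf) <= tolerance

# milk => bread: 12% support, 85% confidence; 8% of milk is chocolate milk.
exp_sup, exp_conf = expected_from_parent(0.12, 0.85, 0.08)
print(round(exp_sup, 4), exp_conf)
print(is_redundant(0.84, exp_conf, tolerance=0.02))
```

Under that assumption the expected support of chocolate milk => bread is 12% × 8% = 0.96%, close to the observed 1%, and the observed 84% confidence is within the tolerance of the expected 85%, so the rule adds nothing beyond its parent.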

Filtering of uninteresting association rules
• Unnecessary rules
  • Unnecessary rule = a rule that provides little extra information beyond a previous rule
  • Example:
    • 80% of customers who buy milk also buy bread
    • 80% of customers who buy milk and butter also buy bread
    • Little information is gained here
• Definition 6.2: A rule R, A ∧ C => B, is unnecessary if there is a rule R’, A => B, and φ(R) ∈ [φ(R’) − β, φ(R’) + β], where β is a user-defined constant and A, B and C are itemsets

Filtering of uninteresting association rules (continued)
• Removal of unnecessary rules:
  • To filter out unnecessary association rules, for each strong rule R: A => B, we test every rule R’: A − C => B, where C ⊂ A. If the confidence of R, φ(R), is not significantly different from that of R’, φ(R’), R is not presented to the user.
  • Experiments reveal that this reduces rules by 50% to 80%
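Definition 6.2 reduces to an interval test on two confidence values. A minimal sketch follows; the function name `is_unnecessary` and the value chosen for β are assumptions made for the example.

```python
def is_unnecessary(conf_r: float, conf_r_prime: float, beta: float) -> bool:
    """R (A and C => B) is unnecessary if its confidence lies within
    beta of the confidence of the simpler rule R' (A => B)."""
    return conf_r_prime - beta <= conf_r <= conf_r_prime + beta

# milk => bread and milk + butter => bread both hold with 80% confidence.
print(is_unnecessary(conf_r=0.80, conf_r_prime=0.80, beta=0.05))
# A longer rule with clearly higher confidence would be kept.
print(is_unnecessary(conf_r=0.95, conf_r_prime=0.80, beta=0.05))
```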

CONCLUSIONS
• Extended association rules from single-level to multiple-level.
• A top-down progressive deepening technique is developed for mining multiple-level association rules.
• Filtering of uninteresting association rules.

FUTURE WORK
• Can study developing efficient algorithms for mining multiple-level sequential patterns
• Another interesting issue is the study of mining multiple-level correlations in databases

Exam Question 1
Q. What is a major drawback to multiple-level data mining using the same minisup at all levels of a concept hierarchy?
A. Large support exists at higher levels of the hierarchy; smaller support at lower levels. To ensure that sufficiently strong association rules are generated at the lower levels, we must reduce the minimum support at higher levels, which in turn could result in the generation of many uninteresting rules at higher levels. Thus we are faced with the problem of determining the optimal minisup for all levels.

Exam Question 2
Q. What are the 2 pre-requisites to performing multiple-level association rule mining?
A. To explore multiple-level association rule mining, one needs to provide: 1) data at multiple levels of abstraction, and 2) efficient methods for multiple-level rule mining.

Exam Question 3
Q. Give an example of a multiple-level association rule.
A. Multiple-level association rules operate on a taxonomy or concept hierarchy. At a higher level in the hierarchy one may have a very general rule, such as 80% of people who buy bread also buy milk. At a lower level in the hierarchy the rule becomes more specific. For example, 24% of the people who buy Foremost 2% milk also buy Wonder bread.