
Mining Confident Rules Without Support Requirements
Ke Wang, Yu He, D. W. Cheung, F. Y. L. Chin
CIKM 01 1
Association Rules
• Given a table over A1, …, Ak, C
• Find all rules {Ai=ai} ⇒ C=c of minimum confidence and minimum support
• Support: sup({Ai=ai}) = # records containing {Ai=ai}
• Confidence: sup({Ai=ai}, C=c) / sup({Ai=ai})
CIKM 01 2
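To make the two measures concrete, here is a minimal Python sketch; the table, attribute names, and helper names are illustrative, not from the paper.

# A small illustrative table and the two measures from this slide.
table = [
    {"Age": "young", "Gender": "M", "Buy": "yes"},
    {"Age": "young", "Gender": "F", "Buy": "no"},
    {"Age": "young", "Gender": "F", "Buy": "yes"},
    {"Age": "old",   "Gender": "M", "Buy": "no"},
]

def support(table, conditions):
    """sup(conditions): number of records containing every (attribute, value) pair."""
    return sum(all(row[a] == v for a, v in conditions.items()) for row in table)

def confidence(table, antecedent, class_attr, class_value):
    """sup(antecedent, C=c) / sup(antecedent)."""
    denom = support(table, antecedent)
    return support(table, {**antecedent, class_attr: class_value}) / denom if denom else 0.0

print(support(table, {"Age": "young"}))                   # 3
print(confidence(table, {"Age": "young"}, "Buy", "yes"))  # 2/3 = 0.666...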
Low Support Rules
• Interesting rules may have unknown or low support
• High support rules may have low confidence
• Often, patterns are fragmented into many low support rules
Goal: find all rules above the minimum confidence
CIKM 01 3
Confidence-based Pruning
• Without minimum support, the classic support-based pruning is inapplicable
• Confident rules are neither downward closed nor upward closed
• Need new strategies for pushing the confidence requirement
CIKM 01 4
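To make the closure remark concrete, here is a minimal sketch with hypothetical counts (not from the paper) showing that adding a condition can either raise or lower confidence, so neither the general-to-specific nor the specific-to-general direction gives an Apriori-style pruning property.

# Hypothetical support counts for the rules of the next slide.
# r : Age=young              => Buy=yes
# r': Age=young, Gender=M    => Buy=yes   (a specialization of r)
sup_young       = 100
sup_young_buy   = 60    # conf(r)  = 0.60
sup_young_M     = 40
sup_young_M_buy = 36    # conf(r') = 0.90  -- specializing can gain confidence
sup_young_F     = 60
sup_young_F_buy = 24    # conf(Age=young, Gender=F => Buy=yes) = 0.40 -- or lose it

print(sup_young_buy / sup_young)      # 0.60
print(sup_young_M_buy / sup_young_M)  # 0.90
print(sup_young_F_buy / sup_young_F)  # 0.40

With minconf = 0.8, r is not confident but its specialization r' is; with minconf = 0.5, r is confident but its Gender=F specialization is not.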
Confidence-based Pruning
r1: Age=young ⇒ Buy=yes
r2: Age=young, Gender=M ⇒ Buy=yes
r3: Age=young, Gender=F ⇒ Buy=yes
Observation 1: if r1 is confident, so is one of r2 and r3 (r1 specialized by Gender)
Observation 2: if no specialized rule of r1 is confident, r1 can be pruned
CIKM 01 5
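A short justification of Observation 1 (my gloss, not on the slide), using the support notation of slide 2 and assuming Gender takes only the values M and F:

\[
\mathrm{conf}(r_1)
 = \frac{\mathrm{sup}(\text{young},\text{yes})}{\mathrm{sup}(\text{young})}
 = \frac{\mathrm{sup}(\text{young},M,\text{yes}) + \mathrm{sup}(\text{young},F,\text{yes})}
        {\mathrm{sup}(\text{young},M) + \mathrm{sup}(\text{young},F)}
 \le \max\{\mathrm{conf}(r_2), \mathrm{conf}(r_3)\},
\]

since a mediant \((a_1+a_2)/(b_1+b_2)\) never exceeds the larger of \(a_1/b_1\) and \(a_2/b_2\). Hence if \(r_1\) meets minconf, at least one of \(r_2\), \(r_3\) does too; Observation 2 is the contrapositive, applied to any attribute not in the rule.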
Confidence-based Pruning
• Level-wise rule generation: generate a candidate rule x ⇒ c only if, for every attribute A not in x ⇒ c, some A-specialization of x ⇒ c is confident
CIKM 01 6
The Algorithm
Input: table T over A1, …, Am, C, and minconf
Output: all confident rules
1. k = m;
2. Rule_k = all confident m-rules;
3. while k > 1 and Rule_k is not empty do
4.   generate Cand_k-1 from Rule_k;
5.   compute the confidence of Cand_k-1 in one pass of T;
6.   Rule_k-1 = all confident candidates in Cand_k-1;
7.   k--;
8. return all Rule_k;
CIKM 01 7
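A minimal in-memory sketch of this level-wise loop, assuming a rule is represented as (frozenset of (attribute, value) pairs, class value); the paper's implementation is disk-based (next slides), and all function names here are mine.

from collections import defaultdict

def all_m_rules(table, attrs, class_attr, minconf):
    """Step 2: the confident m-rules, whose antecedents use all m attributes."""
    sup_x, sup_xc = defaultdict(int), defaultdict(int)
    for row in table:
        x = frozenset((a, row[a]) for a in attrs)
        sup_x[x] += 1
        sup_xc[(x, row[class_attr])] += 1
    return {(x, c) for (x, c), n in sup_xc.items() if n / sup_x[x] >= minconf}

def generate_candidates(rules_k, attrs):
    """Step 4 (slide 6): keep x => c only if every attribute A outside x has
    some confident A-specialization of x => c among the k-rules."""
    spec_by = defaultdict(set)   # (x, c) -> attributes with a confident specialization
    for x, c in rules_k:
        for a, v in x:
            spec_by[(x - {(a, v)}, c)].add(a)
    return {(x, c) for (x, c), covered in spec_by.items()
            if set(attrs) - {a for a, _ in x} <= covered}

def confident_candidates(table, cands, class_attr, minconf):
    """Steps 5-6: one pass of T to keep the confident candidates."""
    attr_sets = {tuple(sorted(a for a, _ in x)) for x, _ in cands}
    sup_x, sup_xc = defaultdict(int), defaultdict(int)
    for row in table:
        for attr_tuple in attr_sets:
            x = frozenset((a, row[a]) for a in attr_tuple)
            sup_x[x] += 1
            sup_xc[(x, row[class_attr])] += 1
    return {(x, c) for (x, c) in cands
            if sup_x[x] and sup_xc[(x, c)] / sup_x[x] >= minconf}

def mine_confident_rules(table, attrs, class_attr, minconf):
    rules_k = all_m_rules(table, attrs, class_attr, minconf)
    result = set(rules_k)
    k = len(attrs)
    while k > 1 and rules_k:
        rules_k = confident_candidates(table, generate_candidates(rules_k, attrs),
                                       class_attr, minconf)
        result |= rules_k
        k -= 1
    return result

# e.g. mine_confident_rules(table, ["Age", "Gender"], "Buy", minconf=0.6)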
Disk-based Implementation
• Assumption: T, Rule_k, Cand_k-1 are stored on disk
• We focus on
  – generating Cand_k-1 from Rule_k and
  – computing the confidence for Cand_k-1
• Key: clustering T, Rule_k, Cand_k-1 according to attributes Ai
CIKM 01 8
Clustering by Hash Partitioning
• h_i: the hash function for attribute Ai, i = 1, …, m
• Table T is partitioned into T-buckets
• Rule_k is partitioned into R-buckets
• Cand_k-1 is partitioned into C-buckets
• A bucket-id is the sequence of hash values of the attributes involved: [b1, …, bk]
CIKM 01 9
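A minimal sketch of bucket-id computation; the modulus and the single parameterized hash function standing in for the per-attribute h_i are my own choices, not the paper's. The id pairs each attribute with its hash value, mirroring the "[A1.1, A3.2]" notation on the next slide.

N_BUCKETS = 4

def h(attr, value):
    """Stand-in for the per-attribute hash functions h_i (deterministic within a run)."""
    return hash((attr, value)) % N_BUCKETS

def bucket_id(attrs, values):
    """[b1, ..., bk] for the attributes in `attrs`, taken from the mapping `values`."""
    return tuple((a, h(a, values[a])) for a in sorted(attrs))

# A tuple of T is hashed on all attributes; a candidate only on its antecedent.
row = {"A1": "x", "A2": "y", "A3": "z"}
print(bucket_id(["A1", "A2", "A3"], row))   # T-bucket id
print(bucket_id(["A1", "A3"], row))         # C-bucket id of a candidate over A1, A3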
Pruning by Checking Bucket Ids
• A tuple in a T-bucket supports a candidate in a C-bucket only if the T-bucket id matches the C-bucket id
  – E.g., T-bucket [A1.1, A2.1, A3.2] matches C-buckets [A1.1, A3.2] and [A1.1, A2.1]
• A C-bucket [b1, …, bk] is nonempty only if, for every other attribute A, some R-bucket [b1, …, bk, bA] is nonempty
CIKM 01 10
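A minimal sketch of the matching test, assuming the (attribute, hash value) bucket-id tuples from the previous sketch; the function name is mine.

def matches(t_bucket_id, c_bucket_id):
    """A C-bucket matches a T-bucket iff they agree on every attribute the candidate mentions."""
    t = dict(t_bucket_id)
    return all(a in t and t[a] == b for a, b in c_bucket_id)

t_id = (("A1", 1), ("A2", 1), ("A3", 2))
print(matches(t_id, (("A1", 1), ("A3", 2))))   # True  (the slide's example)
print(matches(t_id, (("A1", 1), ("A2", 1))))   # True
print(matches(t_id, (("A2", 2),)))             # False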
Hypergraph H_k-1
• A vertex corresponds to a T-bucket
• An edge corresponds to a C-bucket; the edge contains a vertex if and only if the C-bucket matches the T-bucket
• H_k-1 is kept in memory
CIKM 01 11
The Optimal Blocking
• Assume that we can read several T-buckets at a time, called a T-block
• For each T-block, we need to access the matching C-buckets from disk
• We want the blocking into T-blocks that minimizes the total access of C-buckets
• This problem is NP-hard
CIKM 01 12
Heuristics
• Heuristic I: the more T-buckets match a C-bucket, the higher the priority of those T-buckets for the next T-block
• Heuristic II: the more C-buckets match a T-bucket, the higher the priority of that T-bucket for the next T-block
CIKM 01 13
[Figure: matching between C-buckets C1–C4 and T-buckets T1–T5]
• (T1 T2 T3)(T4 T5): C1, C2, C4 read twice, C3 read once
• Heuristic I: (T1 T2 T5)(T3 T4): C1, C2, C4 read once, C3 read twice
• Heuristic II: (T1 T3 T5)(T2 T4): C1, C4 read twice, C2, C3 read once
CIKM 01 14
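A greedy sketch of Heuristic I, under one reasonable reading of the slide: repeatedly take the C-bucket with the most still-unplaced matching T-buckets and pull those T-buckets into the current T-block. The matching relation below is hypothetical (the edges of the figure above are not reproduced), and the names and block size are mine.

# Hypothetical matching relation: C-bucket -> T-buckets it matches.
matching = {
    "C1": {"T1", "T2", "T5"},
    "C2": {"T1", "T2"},
    "C3": {"T3", "T4"},
    "C4": {"T2", "T5"},
}

def heuristic_one_blocking(matching, block_size):
    """Greedy Heuristic I: grow each T-block from the C-bucket with the most
    still-unplaced matching T-buckets, so that C-bucket tends to be read once."""
    unplaced = {t for ts in matching.values() for t in ts}
    blocks = []
    while unplaced:
        block = set()
        while len(block) < block_size and unplaced:
            best = max(matching, key=lambda c: len(matching[c] & unplaced))
            for t in sorted(matching[best] & unplaced)[:block_size - len(block)]:
                block.add(t)
                unplaced.discard(t)
        blocks.append(sorted(block))
    return blocks

def cbucket_reads(matching, blocks):
    """Total C-bucket accesses: a C-bucket is read once per T-block it matches."""
    return sum(any(t in m for t in block) for block in blocks for m in matching.values())

blocks = heuristic_one_blocking(matching, block_size=3)
print(blocks, cbucket_reads(matching, blocks))   # [['T1', 'T2', 'T5'], ['T3', 'T4']] 4

Here cbucket_reads is the quantity that the optimal blocking of slide 12 minimizes.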
Experiments
• Synthetic datasets from “An Interval Classifier for Database Mining Applications”, VLDB 92
• 9 attributes, 1 class
• Default data size = 100K
CIKM 01 15
[Slides 16–19: experimental result figures, not reproduced in this transcript]
Conclusion
• The experiments show that the proposed confidence-based pruning is effective
CIKM 01 20