
Mining Confident Rules Without Support Requirements
Ke Wang, Yu He, D. W. Cheung, F. Y. L. Chin
CIKM 01

Association Rules
• Given a table over A1, …, Ak, C
• Find all rules {Ai = ai} → C = c of minimum confidence and minimum support
• Support: sup(Ai = ai) = #records containing Ai = ai
• Confidence: sup(Ai = ai, C = c) / sup(Ai = ai)
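For concreteness, here is a minimal Python sketch of the two measures as defined above; the data layout and function names are my own illustration, not code from the paper.

```python
# Sketch of the support/confidence definitions for rules {Ai=ai} -> C=c.
# Records are dicts mapping attribute names to values; "C" is the class.

def support(records, pairs):
    """sup(pairs) = number of records containing every (attr, value) pair."""
    return sum(all(r.get(a) == v for a, v in pairs) for r in records)

def confidence(records, antecedent, c):
    """conf(x -> C=c) = sup(x, C=c) / sup(x)."""
    sup_x = support(records, antecedent)
    return 0.0 if sup_x == 0 else support(records, antecedent + [("C", c)]) / sup_x

records = [
    {"Age": "young", "Gender": "M", "C": "yes"},
    {"Age": "young", "Gender": "F", "C": "no"},
    {"Age": "old",   "Gender": "M", "C": "yes"},
]
print(confidence(records, [("Age", "young")], "yes"))  # 0.5
```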

Low Support Rules
• Interesting rules are often unknown in advance and have low support
• High support rules may have low confidence
• Often, patterns are fragmented into many low support rules
⇒ Find all rules above the minimum confidence, with no minimum support

Confidence-based Pruning
• Without minimum support, the classic support-based pruning is inapplicable
• Confident rules are neither downward closed nor upward closed
• Need new strategies for pushing the confidence requirement into the mining

Confidence-based Pruning
r1: Age=young → Buy=yes
r2: Age=young, Gender=M → Buy=yes
r3: Age=young, Gender=F → Buy=yes
• Observation 1: if r1 is confident, so is one of r2 and r3 (specialized by Gender): the records covered by r1 split between r2 and r3, so at least one of them reaches r1's confidence
• Observation 2: if no specialized rule of r1 is confident, r1 can be pruned

Confidence-based Pruning
• Level-wise rule generation: generate a candidate rule x → c only if, for every attribute A not in x → c, some A-specialization of x → c is confident.
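This generation test can be stated compactly in code. The sketch below is my reading of the slide, not the paper's implementation: it fixes one class value c, represents an antecedent as a frozenset of (attribute, value) pairs, and takes the antecedents of the confident k-rules found so far.

```python
# Hedged sketch of the candidate-generation test (names are mine).
# antecedent: frozenset of (attribute, value) pairs of a (k-1)-rule x -> c
# confident_k_rules: antecedents of the confident k-rules found so far
# domains: maps each attribute to its set of possible values

def is_candidate(antecedent, attributes, confident_k_rules, domains):
    used = {a for a, _ in antecedent}
    for attr in set(attributes) - used:
        # x -> c must have at least one confident attr-specialization
        if not any(antecedent | {(attr, v)} in confident_k_rules
                   for v in domains[attr]):
            return False
    return True
```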

The Algorithm
Input: table T over A1, …, Am, C, and miniconf
Output: all confident rules
1. k = m;
2. Rule_k = all confident m-rules;
3. while k > 1 and Rule_k is not empty do
4.   generate Cand_(k-1) from Rule_k;
5.   compute the confidence of Cand_(k-1) in one pass of T;
6.   Rule_(k-1) = all confident candidates in Cand_(k-1);
7.   k--;
8. return all Rule_k;
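To make the loop concrete, here is an in-memory Python reconstruction under stated assumptions: the paper's version is disk-based and computes confidences in one pass of T, whereas this sketch recomputes them naively; all names are mine.

```python
# In-memory sketch of the level-wise algorithm (not the paper's
# disk-based implementation). Rules share one consequent C=c; an
# antecedent is a frozenset of (attribute, value) pairs.
from itertools import product

def mine_confident_rules(records, attributes, c, miniconf):
    domains = {a: {r[a] for r in records} for a in attributes}

    def conf(x):
        covered = [r for r in records if all(r[a] == v for a, v in x)]
        if not covered:
            return 0.0
        return sum(r["C"] == c for r in covered) / len(covered)

    # Line 2: all confident m-rules (one value for every attribute).
    m = len(attributes)
    rules = {frozenset(zip(attributes, vals))
             for vals in product(*(domains[a] for a in attributes))}
    rules = {x for x in rules if conf(x) >= miniconf}
    result = {m: rules}

    # Lines 3-7: level-wise, shrinking antecedents by one attribute.
    k = m
    while k > 1 and rules:
        # Line 4: a (k-1)-candidate needs a confident A-specialization
        # for EVERY attribute A it does not mention.
        cands = {x - {pair} for x in rules for pair in x}

        def qualifies(x):
            missing = set(attributes) - {a for a, _ in x}
            return all(any(x | {(a, v)} in rules for v in domains[a])
                       for a in missing)

        cands = {x for x in cands if qualifies(x)}
        # Lines 5-6: keep the confident candidates.
        rules = {x for x in cands if conf(x) >= miniconf}
        k -= 1                                   # Line 7
        result[k] = rules
    return result                                # Line 8
```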

Disk-based Implementation
• Assumption: T, Rule_k, Cand_(k-1) are stored on disk
• We focus on:
  – generating Cand_(k-1) from Rule_k, and
  – computing the confidence for Cand_(k-1)
• Key: clustering T, Rule_k, Cand_(k-1) according to the attributes Ai

Clustering by Hash Partitioning
• h_i: the hash function for attribute Ai, i = 1, …, m
• Table T is partitioned into T-buckets
• Rule_k is partitioned into R-buckets
• Cand_(k-1) is partitioned into C-buckets
• A bucket id is the sequence of hash values of the attributes involved: [b1, …, bk]
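As an illustration of the bucket-id scheme (the hash function and bucket count below are my own choices, not the paper's):

```python
# Illustrative bucket-id construction. Each attribute Ai gets a hash
# function h_i; a bucket id is the sequence of hash values of the
# attributes involved.

NUM_HASH_VALUES = 4

def h(attr, value):
    """Per-attribute hash function h_i (a stand-in)."""
    return hash((attr, value)) % NUM_HASH_VALUES

def bucket_id(pairs):
    """Bucket id: (attribute, hash value) pairs in a fixed attribute order."""
    return tuple((a, h(a, v)) for a, v in sorted(pairs))

# A full tuple defines a T-bucket id over all attributes; a rule or
# candidate defines an R-/C-bucket id over just its own attributes.
print(bucket_id([("A1", "x"), ("A2", "y"), ("A3", "z")]))
```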

Pruning by Checking Bucket Ids
• A tuple in a T-bucket supports a candidate in a C-bucket only if the T-bucket id matches the C-bucket id
  – E.g., T-bucket [A1.1, A2.1, A3.2] matches C-buckets [A1.1, A3.2] and [A1.1, A2.1]
• A C-bucket [b1, …, bk] is nonempty only if, for every other attribute A, some R-bucket [b1, …, bk, b_A] is nonempty
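The matching test reduces to a subset check on bucket ids; a sketch continuing the representation from the previous snippet:

```python
# A C-bucket id matches a T-bucket id iff its (attribute, hash) pairs
# are a subset of the T-bucket's.

def matches(t_id, c_id):
    return set(c_id) <= set(t_id)

t = (("A1", 1), ("A2", 1), ("A3", 2))        # T-bucket [A1.1, A2.1, A3.2]
print(matches(t, (("A1", 1), ("A3", 2))))    # True:  [A1.1, A3.2]
print(matches(t, (("A1", 1), ("A2", 1))))    # True:  [A1.1, A2.1]
print(matches(t, (("A1", 0), ("A2", 1))))    # False: different hash for A1
```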

Hypergraph H_(k-1)
• A vertex corresponds to a T-bucket
• An edge corresponds to a C-bucket, and contains a vertex if and only if the C-bucket matches the T-bucket
• H_(k-1) is kept in memory

The Optimal Blocking
• Assume that we can read several T-buckets at a time, called a T-block
• For each T-block, we need to access the matching C-buckets from disk
• We want the optimal blocking of T-buckets into T-blocks so that the access of C-buckets is minimized
• This problem is NP-hard

Heuristics
• Heuristic I: the more T-buckets match a C-bucket, the higher priority such T-buckets should have in the next T-block
• Heuristic II: the more C-buckets match a T-bucket, the higher priority this T-bucket should have in the next T-block

Example: [figure of a bipartite matching between C-buckets C1–C4 and T-buckets T1–T5; its edges are not recoverable from the text]
• Blocking (T1 T2 T3)(T4 T5): C1, C2, C4 read twice, C3 read once
• Heuristic I: (T1 T2 T5)(T3 T4): C1, C2, C4 read once, C3 read twice
• Heuristic II: (T1 T3 T5)(T2 T4): C1, C4 read twice, C2, C3 read once
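Since the exact blocking is NP-hard, the heuristics lend themselves to greedy realizations. The sketch below is my interpretation of Heuristic I, not the paper's algorithm: it repeatedly packs into the next T-block the T-buckets that share the C-bucket with the most unplaced matches, so each C-bucket tends to be read by few blocks.

```python
# Greedy sketch of Heuristic I (my interpretation of the slide).
# match: dict mapping each C-bucket to the set of T-buckets it matches.

def blocks_heuristic_1(t_buckets, match, block_size):
    unplaced = set(t_buckets)
    blocks = []
    while unplaced:
        block = []
        while len(block) < block_size and unplaced:
            # C-bucket with the most unplaced matching T-buckets
            best_c = max(match, key=lambda c: len(match[c] & unplaced))
            group = (match[best_c] & unplaced) or unplaced
            for t in sorted(group)[: block_size - len(block)]:
                block.append(t)
                unplaced.discard(t)
        blocks.append(block)
    return blocks

# With the slide's T1..T5 and a matching like the lost figure's, a call
# such as blocks_heuristic_1(ts, match, 3) would yield a blocking like
# (T1 T2 T5)(T3 T4).
```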

Experiments
• Synthetic datasets from “An Interval Classifier for Database Mining Applications”, VLDB 92
• 9 attributes, 1 class
• Default data size = 100K

[Slides 16–19: experimental results, charts only; not recoverable from the text]

Conclusion
• The experiments show that the proposed confidence-based pruning is effective.