- Number of slides: 67
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
Continuous and Categorical Attributes How to apply association analysis formulation to non-asymmetric binary variables? Example of Association Rule: {Number of Pages ∈ [5, 10) ∧ (Browser=Mozilla)} → {Buy = No} © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 2
Handling Categorical Attributes l Transform categorical attribute into asymmetric binary variables l Introduce a new “item” for each distinct attribute-value pair – Example: replace Browser Type attribute with u Browser Type = Internet Explorer u Browser Type = Mozilla © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 3
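The transformation above amounts to one-hot encoding every categorical attribute. A minimal sketch in Python using pandas, assuming a hypothetical session table whose column names (Browser, Buy) and values are illustrative only:

```python
import pandas as pd

# Hypothetical web-session data; attribute names and values are illustrative.
sessions = pd.DataFrame({
    "Browser": ["Mozilla", "Internet Explorer", "Mozilla", "Opera"],
    "Buy":     ["No", "No", "Yes", "No"],
})

# One asymmetric binary "item" per distinct attribute-value pair,
# e.g. Browser=Mozilla, Browser=Internet Explorer, Buy=Yes, Buy=No.
items = pd.get_dummies(sessions, prefix_sep="=")
print(items.columns.tolist())
print(items.astype(int))
```

Each resulting 0/1 column can then be treated as an ordinary item by any frequent itemset algorithm.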
Handling Categorical Attributes l Potential Issues – What if the attribute has many possible values? u Example: the attribute country has more than 200 possible values u Many of the attribute values may have very low support – Potential solution: aggregate the low-support attribute values – What if the distribution of attribute values is highly skewed? u Example: 95% of the visitors have Buy = No u Most of the items will be associated with the (Buy=No) item – Potential solution: drop the highly frequent items © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 4
Handling Continuous Attributes l Different kinds of rules: – Age ∈ [21, 35) ∧ Salary ∈ [70K, 120K) → Buy – Salary ∈ [70K, 120K) ∧ Buy → Age: μ=28, σ=4 l Different methods: – Discretization-based – Statistics-based – Non-discretization based u min-Apriori © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 5
Handling Continuous Attributes l Use discretization l Unsupervised: – Equal-width binning – Equal-depth binning – Clustering l Supervised: use class information, e.g., a table of class counts (Anomalous vs. Normal) over attribute values v1 … v9, grouped into bin1, bin2, bin3 © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 6
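A minimal sketch of the unsupervised options using pandas; the attribute values and the number of bins are made up for illustration:

```python
import pandas as pd

# Hypothetical ages to be discretized into 3 bins.
age = pd.Series([23, 25, 27, 31, 34, 38, 45, 52, 60, 64])

equal_width = pd.cut(age, bins=3)   # bins of equal length
equal_depth = pd.qcut(age, q=3)     # bins with (roughly) equal numbers of values

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```

Clustering-based discretization could be sketched the same way by binning on cluster labels (e.g., from k-means) instead of fixed cut points.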
Discretization Issues l Size of the discretized intervals affects support & confidence {Refund = No, (Income = $51,250)} → {Cheat = No} {Refund = No, (60K ≤ Income < 80K)} → {Cheat = No} {Refund = No, (0K ≤ Income < 1B)} → {Cheat = No} – If intervals are too small u may not have enough support – If intervals are too large u may not have enough confidence l Potential solution: use all possible intervals © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 7
Discretization Issues l Execution time – If intervals contain n values, there are on average O(n²) possible ranges l Too many rules {Refund = No, (Income = $51,250)} → {Cheat = No} {Refund = No, (51K ≤ Income < 52K)} → {Cheat = No} {Refund = No, (50K ≤ Income < 60K)} → {Cheat = No} © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 8
Approach by Srikant & Agrawal l Preprocess the data – Discretize attribute using equi-depth partitioning u Use partial completeness measure to determine number of partitions u Merge adjacent intervals as long as support is less than max-support l Apply existing association rule mining algorithms l Determine interesting rules in the output © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 9
Approach by Srikant & Agrawal l Discretization will lose information (X is approximated by X′) – Use partial completeness measure to determine how much information is lost C: frequent itemsets obtained by considering all ranges of attribute values P: frequent itemsets obtained by considering all ranges over the partitions P is K-complete w.r.t. C if P ⊆ C, and for every X ∈ C there exists X′ ∈ P such that: 1. X′ is a generalization of X and support(X′) ≤ K × support(X), where K ≥ 1 2. for every Y ⊆ X there exists Y′ ⊆ X′ such that support(Y′) ≤ K × support(Y) Given K (partial completeness level), can determine number of intervals (N) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 10
Interestingness Measure {Refund = No, (Income = $51,250)} → {Cheat = No} {Refund = No, (51K ≤ Income < 52K)} → {Cheat = No} {Refund = No, (50K ≤ Income < 60K)} → {Cheat = No} l Given an itemset Z = {z1, z2, …, zk} and its generalization Z′ = {z1′, z2′, …, zk′}: P(Z): support of Z EZ′(Z): expected support of Z based on Z′ – Z is R-interesting w.r.t. Z′ if P(Z) ≥ R × EZ′(Z) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 11
Interestingness Measure l For a rule S: X → Y and its generalization S′: X′ → Y′ P(Y|X): confidence of X → Y P(Y′|X′): confidence of X′ → Y′ ES′(Y|X): expected confidence of X → Y based on S′ l Rule S is R-interesting w.r.t. its ancestor rule S′ if – Support: P(S) ≥ R × ES′(S), or – Confidence: P(Y|X) ≥ R × ES′(Y|X) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 12
Statistics-based Methods l Example: Browser=Mozilla ∧ Buy=Yes → Age: μ=23 l Rule consequent consists of a continuous variable, characterized by its statistics – mean, median, standard deviation, etc. l Approach: – Withhold the target variable from the rest of the data – Apply existing frequent itemset generation on the rest of the data – For each frequent itemset, compute the descriptive statistics of the corresponding target variable u Frequent itemset becomes a rule by introducing the target variable as rule consequent – Apply statistical test to determine interestingness of the rule © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 13
Statistics-based Methods l How to determine whether an association rule is interesting? – Compare the statistics for the segment of the population covered by the rule (mean μ) versus the segment of the population not covered by the rule (mean μ′) – Statistical hypothesis testing: u Null hypothesis: H0: μ′ = μ + Δ u Alternative hypothesis: H1: μ′ > μ + Δ u Test statistic Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2) has zero mean and variance 1 under the null hypothesis © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 14
Statistics-based Methods l Example: r: Browser=Mozilla ∧ Buy=Yes → Age: μ=23 – Rule is interesting if the difference between μ and μ′ is greater than 5 years (i.e., Δ = 5) – For r, suppose n1 = 50, s1 = 3.5 – For r′ (complement): n2 = 250, s2 = 6.5 – For a one-sided test at 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64 – Since Z is greater than 1.64, r is an interesting rule © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 15
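A small sketch of this test in Python. The sample sizes and standard deviations come from the slide; the mean age of the uncovered segment (mu_rest = 30) is an assumed value used only to make the example runnable:

```python
from math import sqrt
from scipy.stats import norm

def rule_is_interesting(mu_rule, n1, s1, mu_rest, n2, s2, delta, alpha=0.05):
    """One-sided test of H1: mu_rest > mu_rule + delta."""
    z = (mu_rest - mu_rule - delta) / sqrt(s1**2 / n1 + s2**2 / n2)
    return z, z > norm.ppf(1 - alpha)   # critical value is about 1.64 for alpha = 0.05

# Slide values plus an assumed mu_rest = 30 for the complement segment.
z, interesting = rule_is_interesting(mu_rule=23, n1=50, s1=3.5,
                                     mu_rest=30, n2=250, s2=6.5, delta=5)
print(round(z, 2), interesting)   # about 3.11, True
```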
Min-Apriori (Han et al.) Document-term matrix. Example: W1 and W2 tend to appear together in the same document © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 16
Min-Apriori l Data contains only continuous attributes of the same “type” – e.g., frequency of words in a document l Potential solution: – Convert into 0/1 matrix and then apply existing algorithms u lose word frequency information – Discretization does not apply as users want association among words, not ranges of words © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 17
Min-Apriori l How to determine the support of a word? – If we simply sum up its frequency, the support count will be greater than the total number of documents! u Normalize the word vectors – e.g., using the L1 norm u Each word then has a support equal to 1.0 © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 18
Min-Apriori l New definition of support: sup(C) = Σi∈T minj∈C D(i, j), where D(i, j) is the normalized frequency of word j in document i. Example: Sup(W1, W2, W3) = 0 + 0 + 0.17 = 0.17 © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 19
Anti-monotone property of Support Example: Sup(W1) = 1, Sup(W1, W2) = 0.9, Sup(W1, W2, W3) = 0.17; the support never increases as words are added to the itemset © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 20
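A minimal sketch of min-Apriori support in Python; the raw document-term counts below are made up, but the computation (L1-normalize each word, then sum the per-document minima) follows the definition above:

```python
import numpy as np

# Hypothetical raw word counts: rows = documents, columns = words W1..W5.
D = np.array([
    [2, 2, 0, 0, 1],
    [0, 0, 1, 2, 2],
    [1, 3, 2, 0, 0],
], dtype=float)

# Normalize each word (column) by its L1 norm so every single word has support 1.0.
D = D / D.sum(axis=0)

def min_support(word_cols):
    """sup(C) = sum over documents of the minimum normalized value of the words in C."""
    return D[:, list(word_cols)].min(axis=1).sum()

print(round(min_support([0]), 2))        # 1.0
print(round(min_support([0, 1]), 2))     # 0.73  (never larger than any subset's support)
print(round(min_support([0, 1, 2]), 2))  # 0.33
```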
Multi-level Association Rules © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 21
Multi-level Association Rules l Why should we incorporate a concept hierarchy? – Rules at lower levels may not have enough support to appear in any frequent itemsets – Rules at lower levels of the hierarchy are overly specific u e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc. are all indicative of an association between milk and bread © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 22
Multi-level Association Rules l How do support and confidence vary as we traverse the concept hierarchy? – If X is the parent item for both X1 and X2, then σ(X) ≤ σ(X1) + σ(X2) – If σ(X1 ∪ Y1) ≥ minsup, and X is the parent of X1, Y is the parent of Y1, then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup, and σ(X ∪ Y) ≥ minsup – If conf(X1 → Y1) ≥ minconf, then conf(X1 → Y) ≥ minconf © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 23
Multi-level Association Rules l Approach 1: – Extend current association rule formulation by augmenting each transaction with higher level items Original Transaction: {skim milk, wheat bread} Augmented Transaction: {skim milk, wheat bread, milk, bread, food} l Issues: – Items that reside at higher levels have much higher support counts if support threshold is low, too many frequent patterns involving items from the higher levels u – Increased dimensionality of the data © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 24
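A tiny sketch of the transaction augmentation in Approach 1; the concept hierarchy is a hypothetical child-to-parent map chosen to match the slide's example:

```python
# Hypothetical concept hierarchy: child -> parent.
hierarchy = {
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def augment(transaction):
    """Add every ancestor of every item in the transaction (Approach 1)."""
    items = set(transaction)
    for item in transaction:
        while item in hierarchy:      # walk up the hierarchy to the root
            item = hierarchy[item]
            items.add(item)
    return items

print(augment({"skim milk", "wheat bread"}))
# {'skim milk', 'wheat bread', 'milk', 'bread', 'food'} (set order may vary)
```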
Multi-level Association Rules l Approach 2: – Generate frequent patterns at highest level first – Then, generate frequent patterns at the next highest level, and so on l Issues: – I/O requirements will increase dramatically because we need to perform more passes over the data – May miss some potentially interesting cross-level association patterns © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 25
Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 26
Examples of Sequence Database:
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
(Figure: a sequence drawn as a timeline of elements/transactions, each containing events/items, e.g., E1 E2 → E1 E3 → E2 → E2 E3 E4)
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 27
Formal Definition of a Sequence l A sequence is an ordered list of elements (transactions) s = <e1 e2 e3 …> – Each element contains a collection of events (items) ei = {i1, i2, …, ik} – Each element is attributed to a specific time or location l Length of a sequence, |s|, is given by the number of elements of the sequence l A k-sequence is a sequence that contains k events (items) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 28
Examples of Sequence l Web sequence: < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} > l Sequence of initiating events causing the nuclear accident at 3-Mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm) < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} > l Sequence of books checked out at a library: < {Fellowship of the Ring} {The Two Towers} {Return of the King} > © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 29
Sequential Pattern Mining: Definition l Given: – a database of sequences – a user-specified minimum support threshold, minsup l Task: – Find all subsequences with support ≥ minsup © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 31
Sequential Pattern Mining: Challenge l Given a sequence: <{a b} {c d e} {f} {g h i}> – Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc. l How many k-subsequences can be extracted from a given n-sequence? For <{a b} {c d e} {f} {g h i}>, n = 9; choosing k = 4 of the n events (e.g., a, d, e, i, which yields <{a} {d e} {i}>) gives C(9, 4) = 126 possible 4-subsequences © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 32
Sequential Pattern Mining: Example Minsup = 50% Examples of frequent subsequences (with supports of 60% or 80%): < {1, 2} > < {2, 3} > < {2, 4} > < {3} {5} > < {1} {2} > < {1} {2, 3} > < {2} {2, 3} > < {1, 2} {2, 3} > © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 33
Extracting Sequential Patterns l Given n events: i1, i2, i3, …, in l Candidate 1-subsequences: <{i1}>, <{i2}>, <{i3}>, …, <{in}> l Candidate 2-subsequences: <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}> l Candidate 3-subsequences: <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …, <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, … © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 34
Generalized Sequential Pattern (GSP) l Step 1: – Make the first pass over the sequence database D to yield all the frequent 1-element sequences l Step 2: Repeat until no new frequent sequences are found – Candidate Generation: u Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items – Candidate Pruning: u Prune candidate k-sequences that contain infrequent (k-1)-subsequences – Support Counting: u Make a new pass over the sequence database D to find the support for these candidate sequences – Candidate Elimination: u Eliminate candidate k-sequences whose actual support is less than minsup © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 35
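The support-counting step reduces to checking whether a candidate is a subsequence of each data sequence. A minimal sketch in Python, ignoring timing constraints; the toy sequence database is made up:

```python
def contains(data_seq, pattern):
    """True if `pattern` is a subsequence of `data_seq` (no timing constraints).
    A sequence is a list of elements; each element is a set of items."""
    i = 0
    for element in data_seq:
        if i < len(pattern) and pattern[i] <= element:  # element contains pattern[i]
            i += 1
    return i == len(pattern)

# Hypothetical sequence database, one sequence per object.
db = [
    [{1, 2}, {2, 3}, {5}],
    [{1}, {2, 3}, {4}],
    [{2, 4}, {3}, {5}],
    [{1, 2}, {3}, {2, 3}],
]

pattern = [{1}, {2, 3}]
support = sum(contains(s, pattern) for s in db) / len(db)
print(support)   # 0.75: fraction of sequences containing <{1} {2, 3}>
```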
Candidate Generation l Base case (k=2): – Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}> l General case (k>2): – A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2 u The resulting candidate after merging is given by the sequence w1 extended with the last event of w2 – If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1 – Otherwise, the last event in w2 becomes a separate element appended to the end of w1 © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 36
Candidate Generation Examples l Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element l Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element l We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with <{2 6} {4 5}> © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 37
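A minimal sketch of the general-case (k > 2) merge in Python; sequences are lists of sorted tuples, and the example inputs are the ones from this slide. This only illustrates the merge rule, it is not the full GSP candidate-generation code:

```python
def drop_first_event(seq):
    head = seq[0]
    return seq[1:] if len(head) == 1 else [head[1:]] + seq[1:]

def drop_last_event(seq):
    tail = seq[-1]
    return seq[:-1] if len(tail) == 1 else seq[:-1] + [tail[:-1]]

def merge(w1, w2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    if len(w2[-1]) > 1:                    # last two events of w2 share an element
        return w1[:-1] + [w1[-1] + (w2[-1][-1],)]
    return w1 + [w2[-1]]                   # last event of w2 becomes a new element

print(merge([(1,), (2, 3), (4,)], [(2, 3), (4, 5)]))      # [(1,), (2, 3), (4, 5)]
print(merge([(1,), (2, 3), (4,)], [(2, 3), (4,), (5,)]))  # [(1,), (2, 3), (4,), (5,)]
print(merge([(1,), (2, 6), (4,)], [(1,), (2,), (4, 5)]))  # None (not merged)
```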
GSP Example © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 38
Timing Constraints (I) For a pattern such as {A B} {C} {D E}: max-gap (xg): time between consecutive elements must be ≤ xg; min-gap (ng): time between consecutive elements must be > ng; maximum span (ms): time between the first and last elements must be ≤ ms. Example with xg = 2, ng = 0, ms = 4:
Data sequence | Subsequence | Contain?
< {2, 4} {3, 5, 6} {4, 7} {4, 5} {8} > | < {6} {5} > | Yes
< {1} {2} {3} {4} {5} > | < {1} {4} > | No
< {1} {2, 3} {3, 4} {4, 5} > | < {2} {3} {5} > | Yes
< {1, 2} {3} {2, 3} {3, 4} {2, 4} {4, 5} > | < {1, 2} {5} > | No
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 39
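A minimal check of these constraints in Python, assuming we already know the timestamps of the data-sequence elements matched by each pattern element; the defaults mirror the slide's example (xg = 2, ng = 0, ms = 4):

```python
def satisfies_timing(times, max_gap=2, min_gap=0, max_span=4):
    """Check max-gap, min-gap and maximum-span constraints for one occurrence.
    `times` lists the timestamp of each matched element, in pattern order."""
    gaps_ok = all(min_gap < t2 - t1 <= max_gap for t1, t2 in zip(times, times[1:]))
    span_ok = (times[-1] - times[0]) <= max_span
    return gaps_ok and span_ok

# <{6} {5}> matched at timestamps 2 and 4 of <{2,4} {3,5,6} {4,7} {4,5} {8}>: OK
print(satisfies_timing([2, 4]))   # True
# <{1} {4}> in <{1} {2} {3} {4} {5}> can only match at timestamps 1 and 4: gap of 3 > max-gap
print(satisfies_timing([1, 4]))   # False
```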
Mining Sequential Patterns with Timing Constraints l Approach 1: – Mine sequential patterns without timing constraints – Postprocess the discovered patterns l Approach 2: – Modify GSP to directly prune candidates that violate timing constraints – Question: u Does Apriori principle still hold? © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 40
Apriori Principle for Sequence Data Suppose: xg = 1 (max-gap) ng = 0 (min-gap) ms = 5 (maximum span) minsup = 60% <{2} {5}> support = 40% but <{2} {3} {5}> support = 60% Problem exists because of max-gap constraint No such problem if max-gap is infinite © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 41
Contiguous Subsequences l s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions hold: 1. s is obtained from w by deleting an item from either e1 or ek 2. s is obtained from w by deleting an item from any element ei that contains at least two items 3. s is a contiguous subsequence of s′ and s′ is a contiguous subsequence of w (recursive definition) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 42
Modified Candidate Pruning Step l Without maxgap constraint: – A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent l With maxgap constraint: – A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 43
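A small sketch that enumerates the contiguous (k-1)-subsequences used by the modified pruning step, following the definition above (delete one item from the first or last element, or from any element that has at least two items); sequences are lists of tuples and the example candidate is hypothetical:

```python
def contiguous_subsequences(seq):
    """All contiguous (k-1)-subsequences of a k-sequence (a list of tuples of items)."""
    result = []
    for i, elem in enumerate(seq):
        interior = 0 < i < len(seq) - 1
        if interior and len(elem) < 2:
            continue                      # cannot delete the only item of a middle element
        for item in elem:
            shrunk = tuple(x for x in elem if x != item)
            result.append(seq[:i] + ([shrunk] if shrunk else []) + seq[i + 1:])
    return result

# Candidate <{1} {2 3} {4}>: prune it if any of these is infrequent.
for s in contiguous_subsequences([(1,), (2, 3), (4,)]):
    print(s)
# [(2, 3), (4,)], [(1,), (3,), (4,)], [(1,), (2,), (4,)], [(1,), (2, 3)]
```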
Timing Constraints (II) In addition to xg (max-gap), ng (min-gap) and ms (maximum span), a window size ws allows the items of one pattern element to be drawn from different data-sequence elements whose timestamps differ by at most ws. Example with xg = 2, ng = 0, ws = 1, ms = 5:
Data sequence | Subsequence | Contain?
< {2, 4} {3, 5, 6} {4, 7} {4, 6} {8} > | < {3} {5} > | No
< {1} {2} {3} {4} {5} > | < {1, 2} {3} > | Yes
< {1, 2} {2, 3} {3, 4} {4, 5} > | < {1, 2} {3, 4} > | Yes
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 44
Modified Support Counting Step l Given a candidate pattern: <{a, c}> – Any data sequence that contains <… {a c} …>, <… {a} … {c} …> (where time({c}) − time({a}) ≤ ws), or <… {c} … {a} …> (where time({a}) − time({c}) ≤ ws) will contribute to the support count of the candidate pattern © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 45
Other Formulation l In some domains, we may have only one very long time series – Example: u monitoring network traffic events for attacks u monitoring telecommunication alarm signals l Goal is to find frequent sequences of events in the time series – This problem is also known as frequent episode mining (Figure: a single long stream of events, E1 E3 E1 E2 E4 E2 E3 E5 E1 E2 E3 E1, with occurrences of a pattern highlighted)
General Support Counting Schemes Assume: xg = 2 (max-gap) ng = 0 (min-gap) ws = 0 (window size) ms = 2 (maximum span)
Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs l Useful for Web Mining, computational chemistry, bioinformatics, spatial data sets, etc l © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 48
Graph Definitions © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 49
Representing Transactions as Graphs l Each transaction is a clique of items © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 50
Representing Graphs as Transactions © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 51
Challenges l Node may contain duplicate labels l Support and confidence – How to define them? l Additional constraints imposed by pattern structure – Support and confidence are not the only constraints – Assumption: frequent subgraphs must be connected l Apriori-like approach: – Use frequent k-subgraphs to generate frequent (k+1)-subgraphs u What is k? © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 52
Challenges… l Support: – number of graphs that contain a particular subgraph l Apriori principle still holds l Level-wise (Apriori-like) approach: – Vertex growing: u k is the number of vertices – Edge growing: u k is the number of edges © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 53
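Counting this notion of support requires a subgraph isomorphism test against every graph in the database. A rough sketch using networkx's VF2 matcher; the graph database and labels are made up, and note that networkx's subgraph_is_isomorphic checks node-induced subgraphs, which is a simplification of the general containment used in frequent subgraph mining:

```python
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges):
    """edges: list of (node_u, label_u, node_v, label_v, edge_label)."""
    g = nx.Graph()
    for u, lu, v, lv, le in edges:
        g.add_node(u, label=lu)
        g.add_node(v, label=lv)
        g.add_edge(u, v, label=le)
    return g

# Hypothetical graph database and candidate subgraph.
db = [
    make_graph([(1, "a", 2, "b", "p"), (2, "b", 3, "a", "q")]),
    make_graph([(1, "a", 2, "b", "p")]),
    make_graph([(1, "b", 2, "b", "q")]),
]
candidate = make_graph([(1, "a", 2, "b", "p")])

node_match = isomorphism.categorical_node_match("label", None)
edge_match = isomorphism.categorical_edge_match("label", None)

# Support = number of database graphs that contain the candidate subgraph.
support = sum(
    isomorphism.GraphMatcher(g, candidate,
                             node_match=node_match,
                             edge_match=edge_match).subgraph_is_isomorphic()
    for g in db
)
print(support)   # 2
```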
Vertex Growing © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 54
Edge Growing © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 55
Apriori-like Algorithm l Find frequent 1-subgraphs l Repeat – Candidate generation u Use frequent (k-1)-subgraphs to generate candidate k-subgraphs – Candidate pruning u Prune candidate subgraphs that contain infrequent (k-1)-subgraphs – Support counting u Count the support of each remaining candidate – Eliminate candidate k-subgraphs that are infrequent l In practice, it is not this simple; there are many other issues © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 56
Example: Dataset © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 57
Example © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 58
Candidate Generation l In Apriori: – Merging two frequent k-itemsets will produce a candidate (k+1)-itemset l In frequent subgraph mining (vertex/edge growing) – Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 59
Multiplicity of Candidates (Vertex Growing) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 60
Multiplicity of Candidates (Edge growing) l Case 1: identical vertex labels © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 61
Multiplicity of Candidates (Edge growing) l Case 2: Core contains identical labels Core: The (k-1) subgraph that is common between the joint graphs © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 62
Multiplicity of Candidates (Edge growing) l Case 3: Core multiplicity © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 63
Adjacency Matrix Representation • The same graph can be represented in many ways © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 64
Graph Isomorphism l Two graphs are isomorphic if they are topologically equivalent (identical up to a relabeling of their vertices) © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 65
Graph Isomorphism l Test for graph isomorphism is needed: – During candidate generation step, to determine whether a candidate has been generated – During candidate pruning step, to check whether its (k-1)-subgraphs are frequent – During candidate counting, to check whether a candidate is contained within another graph © Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 66
Graph Isomorphism l Use canonical labeling to handle isomorphism – Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding – Example: u Lexicographically largest adjacency matrix String: 0010001111010110 © Tan, Steinbach, Kumar Introduction to Data Mining Canonical: 0111101011001000 4/18/2004 67
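A brute-force sketch of canonical labeling in Python: try every vertex ordering and keep the lexicographically largest row-major adjacency string. This ignores vertex and edge labels for brevity and is exponential in the number of vertices, so it only illustrates the idea and does not replace the smarter canonical forms used by real subgraph miners:

```python
from itertools import permutations
import numpy as np

def canonical_code(adj):
    """Lexicographically largest adjacency-matrix string over all vertex orderings."""
    adj = np.asarray(adj)
    n = len(adj)
    best = ""
    for perm in permutations(range(n)):
        code = "".join(str(adj[i, j]) for i in perm for j in perm)
        best = max(best, code)
    return best

# Two adjacency matrices of the same (isomorphic) 3-vertex path graph.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
B = [[0, 0, 1],
     [0, 0, 1],
     [1, 1, 0]]
print(canonical_code(A) == canonical_code(B))   # True: isomorphic graphs get the same code
```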