Скачать презентацию Association Rules Mining with SQL Kirsten Nelson Deepen Скачать презентацию Association Rules Mining with SQL Kirsten Nelson Deepen

a96423e1fbf6d3cddce900ae38a053e3.ppt

  • Количество слайдов: 71

Association Rules Mining with SQL Kirsten Nelson Deepen Manek November 24, 2003 1 Association Rules Mining with SQL Kirsten Nelson Deepen Manek November 24, 2003 1

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 2

Early data mining applications n n n Most early mining systems were developed largely Early data mining applications n n n Most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each All data was read into memory before beginning computation This limits the amount of data that can be mined 3

Advantage of SQL and RDBMS n n Make use of database indexing and query Advantage of SQL and RDBMS n n Make use of database indexing and query processing capabilities More than a decade spent on making these systems robust, portable, scalable, and concurrent Exploit underlying SQL parallelization For long-running algorithms, use checkpointing and space management 4

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 5

Use of Database in Data Mining n “Loose coupling” of application and data n Use of Database in Data Mining n “Loose coupling” of application and data n n How would you write an Apriori program? Use SQL statements in an application Use a cursor interface to read through records sequentially for each pass Still two major performance problems: n n Copying of record from database to memory Process context switching for each record retrieved 6

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 7

Tightly-coupled applications n n Push computations into the database system to avoid performance degradation Tightly-coupled applications n n Push computations into the database system to avoid performance degradation Take advantage of user-defined functions (UDFs) Does not require changes to database software Two types of UDFs we will use: n n Ones that are executed only a few times, regardless of the number of rows Ones that are executed once for each selected row 8

Tight-coupling using UDFs Procedure Tightly. Coupled. Apriori(): begin exec sql connect to database; exec Tight-coupling using UDFs Procedure Tightly. Coupled. Apriori(): begin exec sql connect to database; exec sql select alloc. Space() into : blob from onerecord; exec sql select * from sales where Gen. L 1(: blob, TID, ITEMID) = 1; not. Done : = true; 9

Tight-coupling using UDFs while not. Done do { exec sql select apriori. Gen(: blob) Tight-coupling using UDFs while not. Done do { exec sql select apriori. Gen(: blob) into : blob from onerecord; exec sql select * from sales where item. Count(: blob, TID, ITEMID)=1; exec sql select Gen. Lk(: blob) into : not. Done from onerecord } 10

Tight-coupling using UDFs exec sql select get. Result(: blob) into : result. Blob from Tight-coupling using UDFs exec sql select get. Result(: blob) into : result. Blob from onerecord; exec sql select dealloc. Space(: blob) from onerecord; compute Answer using result. Blob; end 11

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 12

Methodology n n n Comparison done with Association Rules against IBM DB 2 Only Methodology n n n Comparison done with Association Rules against IBM DB 2 Only consider generation of frequent itemsets using Apriori algorithm Five alternatives considered: n n n Loose-coupling through SQL cursor interface – as described earlier UDF tight-coupling – as described earlier Stored-procedure to encapsulate mining algorithm Cache-mine – caching data and mining on the fly SQL implementations to force processing in the database n Consider two classes of implementations n n SQL-92 – four different implementations SQL-OR (with object relational extensions) – six implementations 13

Architectural Options n Stored procedure n n Apriori algorithm encapsulated as a stored procedure Architectural Options n Stored procedure n n Apriori algorithm encapsulated as a stored procedure Implication: runs in the same address space as the DBMS Mined results stored back into the DBMS. Cache-mine n n n Variation of stored-procedure Read entire data once from DBMS, temporarily cache data in a lookaside buffer on a local disk Cached data is discarded when execution completes Disadvantage – requires additional disk space for caching Use Intelligent Miner’s “space” option 14

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 15

Terminology n Use the following terminology n T: table of items n n n Terminology n Use the following terminology n T: table of items n n n Ck: candidate k-itemsets n n {tid, item} pairs Data is normally sorted by transaction id Obtained from joining and pruning frequent itemsets from previous iteration Fk: frequent items sets of length k n Obtained from Ck and T 16

Candidate Generation in SQL – join step n Generate Ck from Fk-1 by joining Candidate Generation in SQL – join step n Generate Ck from Fk-1 by joining Fk-1 with itself insert into Ck select I 1. item 1, …, I 1. itemk-1, I 2. itemk-1 from Fk-1 I 1, Fk-1 I 2 where I 1. item 1 = I 2. item 1 and … I 1. itemk-2 = I 2. itemk-2 and I 1. itemk-1 < I 2. itemk-1 17

Candidate Generation Example § F 3 is {{1, 2, 3}, {1, 2, 4}, {1, Candidate Generation Example § F 3 is {{1, 2, 3}, {1, 2, 4}, {1, 3, 5}, {2, 3, 4}} § C 4 is {{1, 2, 3, 4}, {1, 3, 4, 5}} Table F 3 (I 1) Table F 3 (I 2) item 1 item 2 item 3 1 2 4 1 3 5 2 3 4 18

Pruning n Modify candidate generation algorithm to ensure all k subsets of Ck of Pruning n Modify candidate generation algorithm to ensure all k subsets of Ck of length (k-1) are in Fk-1 n n n Do a k-way join, skipping itemn-2 when joining with the nth table (2

Pruning Example n n Evaluate join with I 3 using previous example C 4 Pruning Example n n Evaluate join with I 3 using previous example C 4 is {1, 2, 3, 4} Table F 3 (I 1) Table F 3 (I 2) Table F 3 (I 3) item 1 item 2 item 3 item 1 1 2 3 1 2 4 1 3 4 1 3 5 2 3 4 item 2 item 3 20

Support counting using SQL n Two different approaches n Use the SQL-92 standard n Support counting using SQL n Two different approaches n Use the SQL-92 standard n n Use ‘standard’ SQL syntax such as joins and subqueries to find support of itemsets Use object-relational extensions of SQL (SQL-OR) n n User Defined Functions (UDFs) & table functions Binary Large Objects (BLOBs) 21

Support Counting using SQL-92 n 4 different methods, two of which detailed in the Support Counting using SQL-92 n 4 different methods, two of which detailed in the papers n n n K-way Joins Sub. Query Other methods not discussed because of unacceptable performance n n 3 -way join 2 Group-Bys 22

SQL-92: K-way join n n Obtain Fk by joining Ck with table T of SQL-92: K-way join n n Obtain Fk by joining Ck with table T of (tid, item) Perform group by on the itemset insert into Fk select item 1, …, itemk, count(*) from Ck, T t 1, …, T tk, where t 1. item = Ck. item 1, … , and tk. item = Ck. itemk and t 1. tid = t 2. tid … and tk-1. tid = tk. tid group by item 1, …, itemk having count(*) > : minsup 23

K-way join example n n C 3={B, C, E} and minimum support required is K-way join example n n C 3={B, C, E} and minimum support required is 2 Insert into F 3 {B, C, E, 2} 24

K-way join: Pass-2 optimization n n When calculating C 2, no pruning is required K-way join: Pass-2 optimization n n When calculating C 2, no pruning is required after we join F 1 with itself Don’t calculate and materialize C 2 - replace C 2 in 2 -way join algorithm with join of F 1 with itself insert into F 2 select I 1. item 1, I 2. item 1, count(*) from F 1 I 1, F 1 I 2, T t 1, T t 2 where I 1. item 1 < I 2. item 1 and t 1. item = I 1. item 1 and t 2. item = I 2. item 1 and t 1. tid = t 2. tid group by I 1. item 1, I 2. item 1 having count(*) > : minsup 25

SQL-92: Sub. Query based n n Split support counting into cascade of k subqueries SQL-92: Sub. Query based n n Split support counting into cascade of k subqueries nth subquery Qn finds all tids that match the distinct itemsets formed by the first n items of Ck insert into Fk select item 1, …, itemk, count(*) from (Subquery Qk) t Group by item 1, item 2 … , itemk having count(*) > : minsup Subquery Qn (for any n between 1 and k): select item 1, …, itemn, tid from T tn, (Subquery Qn-1) as rn-1 (select distinct item 1, …, itemn from CK) as dn where rn-1. item 1 = dn. item 1 and … and rn-1. itemn-1 = dn. itemn and rn-1. tid = tn. tid and tn. item = dn. itemn 26

Example of Sub. Query based n Using previous example from class n C 3 Example of Sub. Query based n Using previous example from class n C 3 = {B, C, E}, minimum support = 2 n Q 0: No subquery Q 0 n Q 1 in this case becomes select item 1, tid From T t 1, (select distinct item 1 from C 3) as d 1 where t 1. item = d 1. item 1 27

Example of Sub. Query based cnt’d n Q 2 becomes select item 1, item Example of Sub. Query based cnt’d n Q 2 becomes select item 1, item 2, tid from T t 2, (Subquery Q 1) as r 1, (select distinct item 1, item 2 from C 3) as d 2 where r 1. item 1 = d 2. item 1 and r 1. tid = t 2. tid and t 2. item = d 2. item 2 28

Example of Sub. Query based cnt’d n Q 3 becomes select item 1, item Example of Sub. Query based cnt’d n Q 3 becomes select item 1, item 2, item 3, tid from T t 3, (Subquery Q 2) as r 2, (select distinct item 1, item 2, item 3 from C 3) as d 3 where r 2. item 1 = d 3. item 1 and r 2. item 2 = d 3. item 2 and r 2. tid = t 3. tid and t 3. item = d 3. item 3 29

Example of Sub. Query based cnt’d n Output of Q 3 is n Insert Example of Sub. Query based cnt’d n Output of Q 3 is n Insert statement becomes insert into F 3 select item 1, item 2, item 3, count(*) from (Subquery Q 3) t group by item 1, item 2 , item 3 having count(*) > : minsup n Insert the row {B, C, E, 2} n For Q 2, pass-2 optimization can be used 30

Performance Comparisons of SQL -92 approaches n Used Version 5 of DB 2 UDB Performance Comparisons of SQL -92 approaches n Used Version 5 of DB 2 UDB and RS/6000 Model 140 n n n 200 Mhz CPU, 256 MB main memory, 9 GB of disk space, Transfer rate of 8 MB/sec Used 4 different item sets based on real-world data Built the following indexes, which are not included in any cost calculations n n n Composite index (item 1, …, itemk) on Ck k different indices on each of the k items in Ck (item, tid) and (tid, item) indexes on the data table T 31

Performance Comparisons of SQL -92 approaches n n Best performance obtained by Sub. Query Performance Comparisons of SQL -92 approaches n n Best performance obtained by Sub. Query approach Sub. Query was only comparable to loose-coupling in some cases, failing to complete in other cases n n Data. Set C, for support of 2%, Sub. Query outperforms loosecoupling but decreasing support to 1%, Sub. Query takes 10 times as long to complete Lower support will increase the size of Ck and Fk at each step, causing the join to process more rows 32

Support Counting using SQL with object-relational extensions n 6 different methods, four of which Support Counting using SQL with object-relational extensions n 6 different methods, four of which detailed in the papers n n n Gather. Join Gather. Count Gather. Prune Vertical Other methods not discussed because of unacceptable performance n n Horizontal SBF 33

SQL Object-Relational Extension: Gather. Join n Generates all possible k-item combinations of items contained SQL Object-Relational Extension: Gather. Join n Generates all possible k-item combinations of items contained in a transaction and joins them with Ck n n An index is created on all items of Ck Uses the following table functions n n Gather: Outputs records {tid, item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid Comb-K: returns all k-item combinations from the transaction n Output has k attributes T_itm 1, …, T_itmk 34

Gather. Join insert into Fk select item 1, …, itemk, count(*) from Ck, (select Gather. Join insert into Fk select item 1, …, itemk, count(*) from Ck, (select t 2. T_itm 1, …, t 2. itmk from T, table(Gather(T. tid, T. item)) as t 1, table(Comb-K(t 1. tid, t 1. item-list)) as t 2) where t 2. T_itm 1 = Ck. item 1 and … and t 2. T_itmk = Ck. itemk group by Ck. item 1, …, Ck. itemk having count(*) > : minsup 35

Example of Gather. Join n n t 1 (output from Gather) looks like: t Example of Gather. Join n n t 1 (output from Gather) looks like: t 2 (generated by Comb-K from t 1) will be joined with C 3 to obtain F 3 n n 1 row from Tid 10 1 row from Tid 20 4 rows from Tid 30 Insert {B, C, E, 2} 36

Gather. Join: Pass 2 optimization n n When calculating C 2, no pruning is Gather. Join: Pass 2 optimization n n When calculating C 2, no pruning is required after we join F 1 with itself Don’t calculate and materialize C 2 - replace C 2 with a join to F 1 before the table function n Gather is only passed frequent 1 -itemset rows insert into F 2 select I 1. item 1, I 2. item 1, count(*) from F 1 I 1, (select t 2. T_itm 1, t 2. T_itm 2 from T, table(Gather(T. tid, T. item)) as t 1, table(Comb-K(t 1. tid, t 1. item-list)) as t 2 where T. item = I 1. item 1) group by t 2. T_itm 1, t 2. T_itm 2 having count(*) > : minsup 37

Variations of Gather. Join Gather. Count n n Perform the GROUP BY inside the Variations of Gather. Join Gather. Count n n Perform the GROUP BY inside the table function Comb-K for pass 2 optimization Output of the table function Comb-K n n n Not the candidate frequent itemsets (Ck) But the actual frequent itemsets (Fk) along with the corresponding support Use a 2 -dimensional array to store possible frequent itemsets in Comb-K n May lead to excessive memory use 38

Variations of Gather. Join Gather. Prune n n Push the join with Ck into Variations of Gather. Join Gather. Prune n n Push the join with Ck into the table function Comb-K Ck is converted into a BLOB and passed as an argument to the table function. n Will have to pass the BLOB for each invocation of Comb-K - # of rows in table T 39

SQL Object-Relational Extension: Vertical n n n For each item, create a BLOB containing SQL Object-Relational Extension: Vertical n n n For each item, create a BLOB containing the tids the item belongs to n Use function Gather to generate {item, tidlist} pairs, storing results in table Tid. Table n Tid-list are all in the same sorted order Use function Intersect to compare two different tid-lists and extract common values Pass-2 optimization can be used for Vertical n n Similar to K-way join method Upcoming example does not show optimization 40

Vertical insert into Fk select item 1, …, itemk, count(tid-list) as cnt from (Subquery Vertical insert into Fk select item 1, …, itemk, count(tid-list) as cnt from (Subquery Qk) t where cnt > : minsup Subquery Qn (for any n between 2 and k) Select item 1, …, itemn, Intersect(rn-1. tid-list, tn. tid-list) as tid-list from Tid. Table tn, (Subquery Qn-1) as rn-1 (select distinct item 1, …, itemn from CK) as dn where rn-1. item 1 = dn. item 1 and … and rn-1. itemn-1 = dn. itemn-1 and tn. item = dn. itemn Subquery Q 1: (select * from Tid. Table) 41

Example of Vertical n Using previous example from class n n C 3 = Example of Vertical n Using previous example from class n n C 3 = {B, C, E}, minimum support = 2 Q 1 is Tid. Table 42

Example of Vertical cnt’d n Q 2 becomes Select item 1, item 2, Intersect(r Example of Vertical cnt’d n Q 2 becomes Select item 1, item 2, Intersect(r 1. tid-list, t 2. tid-list) as tid-list from Tid. Table t 2, (Subquery Q 1) as r 1 (select distinct item 1, item 2 from C 3) as d 2 where r 1. item 1 = d 2. item 1 and t 2. item = d 2. item 2 43

Example of Vertical cnt’d n Q 3 becomes select item 1, item 2, item Example of Vertical cnt’d n Q 3 becomes select item 1, item 2, item 3, intersect(r 2. tid-list, t 3. tid-list) as tid-list from Tid. Table t 3, (Subquery Q 2) as r 2 (select distinct item 1, item 2, item 3 from C 3) as d 3 where r 2. item 1 = d 3. item 1 and r 2. item 2 = d 3. item 2 and t 3. item = d 3. item 3 44

Performance Comparisons using SQL-OR 45 Performance Comparisons using SQL-OR 45

Performance Comparisons using SQL-OR 46 Performance Comparisons using SQL-OR 46

Performance comparison of SQL object-relational approaches n Vertical has best overall performance, sometimes an Performance comparison of SQL object-relational approaches n Vertical has best overall performance, sometimes an order of magnitude better than other 3 approaches n n n Pass-2 optimization has huge impact on performance of Gather. Join n n Majority of time is transforming the data in {item, tid-list} pairs Vertical spends too much time on the second pass For Dataset-B with support of 0. 1 %, running time for Pass 2 went from 5. 2 hours to 10 minutes Comb-K in Gather. Join generates large number of potential frequent itemsets we must work with 47

Hybrid approach n Previous charts and algorithm analysis show n n n Vertical spends Hybrid approach n Previous charts and algorithm analysis show n n n Vertical spends too much time on pass 2 compared to other algorithms, especially when the support is decreased Gather. Join degrades when the # of frequent items per transaction increases To improve performance, use a hybrid algorithm n n n Use Vertical for most cases When size of candidate itemset is too large, Gather. Join is a good option if number of frequent items per transaction (N f) is not too large When Nf is large, Gather. Count may be the only good option 48

Architecture Comparisons n Compare five alternatives n Loose-Coupling, Stored-procedure n n n Basically the Architecture Comparisons n Compare five alternatives n Loose-Coupling, Stored-procedure n n n Basically the same except for address space program is being run in Because of limited difference in performance, focus solely on stored procedure in following charts Cache-Mine UDF tight-coupling Best SQL approach (Hybrid) 49

Performance Comparisons of Architectures 50 Performance Comparisons of Architectures 50

Performance Comparisons of Architectures cnt’d 51 Performance Comparisons of Architectures cnt’d 51

Performance Comparisons of Architectures cnt’d n Cache-Mine is the best or close to the Performance Comparisons of Architectures cnt’d n Cache-Mine is the best or close to the best performance in all cases n n Factor of 0. 8 to 2 times faster than SQL approach Stored procedure is the worst n Difference between Cache-Mine directly related to the number of passes through the data n n n Passes increase when the support goes down May need to make multiple passes if all candidates cannot fit in memory UDF time per pass decreases 30 -50% compared to stored procedure because of tighter coupling with DB 52

Performance Comparisons of Architectures cnt’d n SQL approach comes in second in performance to Performance Comparisons of Architectures cnt’d n SQL approach comes in second in performance to Cache-Mine n n Somewhat better than Cache-Mine for high support values 1. 8 – 3 times better than Stored-procedure/loose-coupling approach, getting better when support value decreases Cost of converting to Vertical format is less than cost of converting to binary format in Cache-Mine For second pass through data, SQL approach takes much more time than Cache-Mine, particularly when we decrease the support 53

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 54

Taxonomies - example Beverages Soft Drinks Snacks Alcoholic Drinks Pretzels Chocolate Bar Parent Coke Taxonomies - example Beverages Soft Drinks Snacks Alcoholic Drinks Pretzels Chocolate Bar Parent Coke Example rule: Soft Drinks Pretzels with 30% confidence, 2% support Beer Beverages Soft Drinks Beverages Pepsi Child Alcoholic Drinks Soft Drinks Pepsi Soft Drinks Coke Alcoholic Drinks Beer Snacks Pretzels Snacks Chocolate Bar 55

Taxonomy augmentation n n Algorithms similar to previous slides Requires two additions to algorithm Taxonomy augmentation n n Algorithms similar to previous slides Requires two additions to algorithm n n n Pruning itemsets containing an item and its ancestor Pre-computing the ancestors for each item Will also consider support counting 56

Pruning items and ancestors In the second pass we will join F 1 with Pruning items and ancestors In the second pass we will join F 1 with F 1 to give C 2 n This will give, for example: beverages, pepsi snacks, coke pretzels, chocolate bar n But beverages, pepsi is redundant! n 57

Pruning items and ancestors The following modification to the SQL statement eliminates such redundant Pruning items and ancestors The following modification to the SQL statement eliminates such redundant combinations from being selected: insert into C 2 (select I 1. item 1, I 2. item 1 from F 1 I 1, F 1 I 2 where I 1. item 1 < I 2. item 1) except (select ancestor, descendant from Ancestor union select descendant, ancestor from Ancestor) n 58

Pre-computing ancestors n An ancestor table is created n n Format (ancestor, descendant) Use Pre-computing ancestors n An ancestor table is created n n Format (ancestor, descendant) Use the transitive closure operation insert into Ancestor with R-Tax (ancestor, descendant) as (select parent, child from Tax union all select p. ancestor, c. child from R-Tax p, Tax c where p. descendant = c. parent) select ancestor, descendant from R-Tax 59

Support Counting n Extensions to handle taxonomies n n n Straightforward, but Non-trivial Need Support Counting n Extensions to handle taxonomies n n n Straightforward, but Non-trivial Need an extended transaction table n n For example, if we have {coke, pretzels} We add also {soft drinks, pretzels}, {beverages, pretzels}, {coke, snacks}, {soft drinks, snacks}, {beverages, snacks} 60

Extended transaction table Can be obtained by the following SQL Query to generate T* Extended transaction table Can be obtained by the following SQL Query to generate T* select item, tid from T union select distinct A. ancestor as item, T. tid from T, Ancestor A where A. descendant = T. item n The “select distinct” clause gets rid of items with common ancestor – e. g. don’t want {beverages, beverages} being added twice from {pepsi, coke} n 61

Pipelining of Query n n No need to actually build T* Make following modification Pipelining of Query n n No need to actually build T* Make following modification to SQL: insert into Fk with T*(tid, item) as (Query for T*) select item 1, …, itemk, count(*) from Ck, T* t 1, …, T* tk, where t 1. item = Ck. item 1, … , and tk. item = Ck. itemk and t 1. tid = t 2. tid … and tk-1. tid = tk. tid group by item 1, …, itemk having count(*) > : minsup 62

Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and Organization of Presentation n n Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL n n Candidate generation, pruning, support counting K-way join, Sub. Query, Gather. Join, Vertical, Hybrid Integrating taxonomies Mining sequential patterns 63

Sequential patterns n n Similar to papers covered on Nov 17 Input is sequences Sequential patterns n n Similar to papers covered on Nov 17 Input is sequences of transactions n n n E. g. ((computer, modem), (printer)) Similar to association rules, but dealing with sequences as opposed to sets Can also specify maximum and minimum time gaps, as well as sliding time windows n Max-gap, min-gap, window-size 64

Input and output formats n Input has three columns: n n Sequence identifier (sid) Input and output formats n Input has three columns: n n Sequence identifier (sid) Transaction time (time) Idem identifier (item) Output format is a collection of frequent sequences, in a fixed-width table n n (item 1, eno 1, …, itemk, enok, len) For smaller lengths, extra column values are set to NULL 65

GSP algorithm n n n Similar to algorithms shown earlier Each Ck has transactions GSP algorithm n n n Similar to algorithms shown earlier Each Ck has transactions and times, but no length – has fixed length of k Candidates are generated in two steps n Join – join Fk-1 with itself n n n Sequence s 1 joins with s 2 if the subsequence obtained by dropping the first item of s 1 is the same as the one obtained by dropping the last item of s 2 When generating C 2, we need to generate sequences where both of the items appear as a single element as well as two separate elements Prune n All candidate sequences that have a non-frequent contiguous (k -1) subsequence are deleted 66

GSP – Join SQL insert into Ck select I 1. item 1, I 1. GSP – Join SQL insert into Ck select I 1. item 1, I 1. eno 1, . . . , I 1. itemk-1, I 1. enok-1, I 2. itemkk-1, I 1. enok-1 + I 2. enok-1 – I 2. enok -2 from Fk-1 I 1, Fk-1 I 2 where I 1. item 2 = I 2. item 1 and. . . and I 1. itemk-1 = I 2. itemk-2 and I 1. eno 3 -I 1. eno 2 = I 2. eno 2 – I 2. eno 1 and. . . and I 1. enok-1 – I 1. enok-2 = I 2. enok-2 – I 2. enok-3 67

GSP – Prune SQL n n n Write as a k-way join, similar to GSP – Prune SQL n n n Write as a k-way join, similar to before There at most k contiguous subsequences of length (k-1) for which Fk-1 needs to be checked for membership Note that all (k-1) subsequences may not be contiguous because of the max-gap constraint between consecutive elements. 68

GSP – Support Counting n n In each pass, we use the candidate table GSP – Support Counting n n In each pass, we use the candidate table Ck and the input data-sequences table D to count the support K-way join n n We use select distinct before the group by to ensure that only distinct data-sequences are counted We have additional predicates between sequence numbers to handle the special time elements 69

GSP – Support Counting SQL (Ck. enoj = Ck. enoi and abs(dj. time – GSP – Support Counting SQL (Ck. enoj = Ck. enoi and abs(dj. time – di. time)≤ window -size) or (Ck. enoj = Ck. enoi + 1 and dj. time – di. time maxgap and dj. time – di. time > min-gap) or (Ck. enoj > Ck. enoi + 1) 70

References 1. Developing Tightly-Coupled Data Mining Applications on a Relational Database System n Rakesh References 1. Developing Tightly-Coupled Data Mining Applications on a Relational Database System n Rakesh Agrawal, Kyuseok Shim, 1996 2. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications n n Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998 Refers to 1) above 3. Mining Generalized Association Rules and Sequential Patterns Using SQL Queries n n Shiby Thomas, Sunita Sarawagi, 1998 Refers to 1) and 2) above 71