A Short Introduction to Sequential Data Mining
Koji IWANUMA, Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence

Two Main Frameworks of Sequential Mining
Sequential pattern mining for multiple data sequences:
  Sequence ID   Purchase data record
  1
  2             <(wheat, milk), bread, (berry, sausage)>
  3             <(bread, pumpkin, sausage)>
  4
  5
Sequential pattern mining for a single data sequence:
  Data sequence

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database:
  SID   sequence
  10
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
[...] is a subsequence of [...].
Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern.
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji

Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji

Sequential Pattern Mining Algorithms for Multiple Data Sequences
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
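To make the pattern-growth idea concrete, here is a heavily simplified sketch in the spirit of PrefixSpan. It assumes every element holds exactly one single-character item, so each data sequence is a plain string; the published algorithm also handles itemset elements and uses more careful projection. The function name prefixspan and the toy database are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def prefixspan(db: List[str], min_sup: int) -> Dict[str, int]:
    """Return each frequent subsequence of the strings in db with its support."""
    results: Dict[str, int] = {}

    def grow(prefix: str, projected: List[str]) -> None:
        # Count each item at most once per projected sequence.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue  # infrequent extensions cannot grow into frequent patterns
            pattern = prefix + item
            results[pattern] = sup
            # Project: keep what follows the first occurrence of the item.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, next_projected)

    grow("", db)
    return results


# Hypothetical toy database of three item-only sequences.
toy_db = ["abcb", "acb", "bca"]
for pattern, sup in prefixspan(toy_db, min_sup=2).items():
    print(pattern, sup)
```

Each recursive call works only on the suffixes that follow the current prefix, which is why pattern-growth methods avoid generating and testing candidates that never occur in the data.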

Mining Sequential Patterns from a Very-Long Single Sequence
A series of daily newspaper articles: < > (figure labels: typhoon; flood, landslide)

Sequential Pattern Mining Algorithms for a Single Data Sequence
Discovery of frequent episodes in event sequences, based on a sliding-window system [Mannila 1998]:
  The frequency measure becomes anti-monotonic, but it has a problem: a single occurrence can be counted more than once.
Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]:
  No anti-monotonic frequency measure is investigated.
On-line approximation algorithm for mining frequent items, not for frequent subsequences:
  Lossy counting algorithm [Manku and Motwani, VLDB'02]
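The duplicate-counting problem of a sliding-window frequency can be seen in a small sketch. This is a simplification of the window-based measure, not the definition from the cited work: it assumes one event per time step, counts only windows that lie fully inside the sequence, and handles serial (ordered) episodes only. The function names and the filler event "sun" are hypothetical.

```python
from typing import List, Sequence


def contains_serial(window: Sequence[str], episode: Sequence[str]) -> bool:
    """True if the episode's events occur in the window in the given order."""
    it = iter(window)
    # `event in it` advances the iterator, so each match must come after the previous one.
    return all(event in it for event in episode)


def window_count(sequence: List[str], episode: List[str], width: int) -> int:
    """Number of full windows of the given width that contain the episode."""
    return sum(
        contains_serial(sequence[start:start + width], episode)
        for start in range(len(sequence) - width + 1)
    )


# "typhoon" followed by "flood" occurs exactly once in this sequence.
events = ["sun", "typhoon", "flood", "sun", "sun", "sun"]

print(window_count(events, ["typhoon", "flood"], width=4))
print(window_count(events, ["typhoon"], width=4))
print(window_count(events, ["flood"], width=4))
```

The single occurrence of typhoon followed by flood falls inside two overlapping windows, so it is counted twice. The count for the full episode never exceeds the count for either of its sub-episodes, which is the anti-monotonic property.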

Research in Our Laboratory
Sequential data mining from a very large single data sequence.
Main target: sequential textual data, especially newspaper-article corpora
  Objective: to generate a robust and useful large-scale event-sequence corpus.
  Application 1: topic tracking/detection in information retrieval.
  Application 2: automated content tracking on the Web.
  Application 3: semi-automatic scenario/story creation.
Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

Technical Topics (1/2)
A new framework for extracting frequent subsequences from a single long data sequence, presented at the IEEE International Conference on Data Mining 2005 (ICDM 2005):
  A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting.
  A fast on-line algorithm for some limited cases.

Technical Topics (2/2)
Ongoing and future work:
  On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from system output.
  Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system:
    A method using compression based on context-free grammar inference/learning.
  A faster extraction algorithm based on a method for simultaneously searching for multiple strings over compressed data.

References:
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj

Thanks for your attention!!