
- Number of slides: 12

A Short Introduction to Sequential Data Mining
Koji IWANUMA, Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence

Two Main Frameworks of Sequential Mining
Sequential pattern mining for multiple data sequences
[Table: Sequence ID, Purchase data record; rows 1, 2]

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database:
SID  sequence
10
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40
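
The notion of a "frequent subsequence" can be made concrete with a small sketch. The code below is only an illustration: the function names `contains` and `support` are hypothetical, each sequence is modeled as a list of item sets, and only the two database rows that survive in the slide (SID 20 and 30) are used.

```python
# Illustrative sketch (names are hypothetical, not from the slides).
# A sequence is a list of elements; each element is a set of items.

def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Each pattern element must be a subset of a distinct sequence element,
    and the matched elements must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily advance to the first remaining element that covers it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database, pattern):
    """Count the data sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database.values())


# Only the rows preserved in the slide text (SID 20 and SID 30).
database = {
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
}

print(support(database, [{"a"}, {"c"}, {"b"}]))   # <a c b>
print(support(database, [{"a", "b"}, {"c"}]))     # <(ab) c>
```

With a minimum support threshold of 2, the pattern <a c b> would count as frequent in this two-row excerpt, while <(ab) c> would not.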

Challenges in Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
- find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
- be highly efficient and scalable, involving only a small number of database scans
- be able to incorporate various kinds of user-specific constraints
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji

Sequential Pattern Mining Algorithms for Multiple Data Sequences
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
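
To give a feel for the pattern-growth idea, here is a minimal sketch in the spirit of PrefixSpan. It is a deliberate simplification and not the published algorithm: every element is a single item (no item sets), projected databases are copied rather than pseudo-projected, and the function name and toy data are assumptions made for illustration.

```python
# Simplified pattern-growth sketch in the spirit of PrefixSpan.
# Assumption: each sequence is a list of single items (no item sets).

def prefixspan(database, min_support):
    """Return a list of (pattern, support) for all frequent patterns."""
    results = []

    def mine(prefix, projected):
        # Count, for each item, the projected sequences that contain it.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1

        for item in sorted(counts):
            if counts[item] < min_support:
                continue
            new_prefix = prefix + (item,)
            results.append((new_prefix, counts[item]))
            # Project: keep the suffix after the first occurrence of item.
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            mine(new_prefix, new_projected)

    mine((), [list(seq) for seq in database])
    return results


# Hypothetical toy database, not taken from the slides.
toy_db = ["abc", "acb", "ab"]
for pattern, sup in prefixspan(toy_db, min_support=2):
    print("".join(pattern), sup)
```

Each recursive call only scans the suffixes that follow the current prefix, which is how pattern-growth methods avoid generating and testing candidates against the whole database.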

Mining Sequential Patterns from a Very Long Single Sequence
[Figure: a series of daily newspaper articles; example events: typhoon; flood, landslide]

Sequential Pattern Mining Algorithms for a Single Data Sequence
- Discovery of frequent episodes in event sequences, based on a sliding-window system [Mannila 1998]: the frequency measure is anti-monotonic, but it has a problem, i.e., duplicate counting of an occurrence.
- Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]: no anti-monotonic frequency measure has been investigated.
- On-line approximation algorithms for mining frequent items, not frequent subsequences: the lossy counting algorithm [Manku and Motwani, VLDB’02]
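
The duplicate-counting problem of window-based frequency can be seen in a small sketch. The code assumes one event per time step and counts every window of a fixed width that overlaps the sequence, including partial windows at both ends; the function names and the toy sequence are hypothetical and only approximate the sliding-window setting cited above.

```python
# Sketch of a window-based frequency count for a serial episode.
# Assumptions: one event per time step; windows may hang over either end.

def contains_serial(window, episode):
    """True if the events of `episode` occur in `window` in that order."""
    it = iter(window)
    # `ev in it` consumes the iterator up to the first match, so each
    # episode event must be found after the previous one.
    return all(ev in it for ev in episode)


def window_count(events, episode, width):
    """Return (windows containing the episode, total number of windows)."""
    n = len(events)
    hits = 0
    for start in range(-(width - 1), n):
        window = events[max(start, 0):min(start + width, n)]
        if contains_serial(window, episode):
            hits += 1
    return hits, n + width - 1


events = list("xabxx")          # the episode <a, b> occurs exactly once
print(window_count(events, ["a", "b"], width=3))   # (2, 7)
```

Although <a, b> occurs only once in the sequence, two different windows contain it, so the single occurrence is counted twice.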

Research in Our Laboratory
Sequential data mining from a very large single data sequence.
- Main target: sequential textual data, especially newspaper article corpora
- Objective: to generate a robust and useful large-scale corpus of event sequences
- Application 1: topic tracking/detection in information retrieval
- Application 2: automated content tracking on the Web
- Application 3: semi-automatic scenario/story creation
- Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

Technical Topics (1/2)
A new framework for extracting frequent subsequences from a single long data sequence, presented at the IEEE International Conference on Data Mining 2005 (ICDM 2005):
- A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting
- A fast on-line algorithm for some limited cases
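
For reference, the Apriori (anti-monotonic) property mentioned above can be stated generically as follows. The notation is an assumption introduced here, not the authors' own: S is the single data sequence, freq is any frequency measure, sigma is the minimum frequency threshold, and the square subset symbol reads "is a subsequence of". This is the general property, not the definition of the ICDM 2005 measure itself.

```latex
% Anti-monotonic (Apriori) property of a frequency measure (generic form).
% s' \sqsubseteq s : s' is a subsequence of s;  S : the data sequence.
s' \sqsubseteq s \;\Longrightarrow\; \mathrm{freq}(s', S) \ge \mathrm{freq}(s, S)

% Consequence used for pruning: an infrequent pattern has no frequent extension.
\mathrm{freq}(s', S) < \sigma \;\wedge\; s' \sqsubseteq s
  \;\Longrightarrow\; \mathrm{freq}(s, S) < \sigma
```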

Technical Topics (2/2)
Ongoing and future work:
- On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from the system output
- Methods for finding meta-structures embedded in the huge number of frequent sequences generated by the system
- A method using compression based on context-free grammar inference/learning
- A faster extraction algorithm based on a method for simultaneously searching for multiple strings over compressed data
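
As a rough illustration of compression by context-free grammar inference, the sketch below uses a Re-Pair-style scheme: the most frequent adjacent pair of symbols is repeatedly replaced by a fresh nonterminal. This is a well-known generic technique chosen for illustration, not necessarily the method the authors have in mind; the function names and the rule naming (`R0`, `R1`, ...) are hypothetical.

```python
# Re-Pair-style grammar compression sketch (illustrative only).
# Assumption: input symbols never collide with the rule names "R0", "R1", ...
from collections import Counter


def repair_compress(sequence):
    """Return (compressed sequence, rules); rules map a name to a pair."""
    seq = list(sequence)
    rules = {}
    while len(seq) >= 2:
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats; nothing left to factor out
        name = f"R{len(rules)}"
        rules[name] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the terminal symbols it derives."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


compressed, rules = repair_compress("abababab")
print(compressed, rules)
```

The inferred rules form a small context-free grammar whose nonterminals expose repeated structure, which is the kind of meta-structure the list above refers to.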

References: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj

Thanks for your attention!