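Zhejiang University undergraduate course "Introduction to Data Mining": courseware
Lecture 6: Sequence Mining Techniques
Xu Congfu, Associate Professor, Institute of Artificial Intelligence, Zhejiang University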


Outline
- Introduction to sequence data mining
- Sequential pattern mining algorithms
- Data stream mining

Sequence Databases & Sequential Patterns
- Transaction databases, time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Most databases contain sequential or ordered data:
  - Sequences of transactions, Weblogs, time series, video, scientific & engineering processes
- Applications of sequential pattern mining
  - Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  - Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures

What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence: <(ef)(ab)(df)cb>
  - An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- A sequence database:

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
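The subsequence and support definitions on this slide can be checked mechanically. The following is a minimal sketch, assuming every item is a single lowercase letter; the helper names `parse`, `is_subsequence` and `support` are illustrative and do not come from the slides.

```python
def parse(seq):
    """Parse '<a(abc)(ac)d(cf)>' into a list of item sets, one set per element."""
    body = seq.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(set(body[i + 1:j]))
            i = j + 1
        else:
            elements.append({body[i]})
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if the pattern elements are subsets of distinct elements of
    `sequence`, in the same order (greedy earliest match)."""
    pos = 0
    for p in pattern:
        while pos < len(sequence) and not p <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db.values())


db = {sid: parse(s) for sid, s in {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}.items()}

print(is_subsequence(parse("<a(bc)dc>"), db[10]))  # True
print(support(parse("<(ab)c>"), db))               # 2
```

With min_sup = 2, the support of 2 (sequences 10 and 30) is what makes <(ab)c> a sequential pattern.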

Challenges in Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient and scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints

Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm
  - Agrawal & Srikant. "Mining sequential patterns," ICDE'95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant'94)
  - If a sequence S is not frequent, then none of the super-sequences of S is frequent
  - E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well
- Example database (support threshold min_sup = 2):

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>

GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm
  - proposed by Srikant and Agrawal, EDBT'96
- Outline of the method
  - Initially, every item in the DB is a candidate of length 1
  - For each level (i.e., sequences of length k) do
    - scan the database to collect the support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori

Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
  - <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan the database once and count support for the candidates (min_sup = 2)

  Cand   Sup
  <a>    3
  <b>    5
  <c>    4
  <d>    3
  <e>    3
  <f>    2
  <g>    1
  <h>    1

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>
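The first GSP scan on this example can be reproduced in a few lines. This is a sketch under the assumption that each sequence is stored as a list of item sets; the function name `length1_support` is illustrative only.

```python
from collections import Counter

# The five-sequence example database, one set per element.
db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}


def length1_support(db):
    """Count, for every item, the number of sequences that contain it."""
    counts = Counter()
    for elements in db.values():
        counts.update(set().union(*elements))  # each item counted once per sequence
    return counts


min_sup = 2
counts = length1_support(db)
frequent = sorted(item for item, c in counts.items() if c >= min_sup)

print(dict(sorted(counts.items())))
# {'a': 3, 'b': 5, 'c': 4, 'd': 3, 'e': 3, 'f': 2, 'g': 1, 'h': 1}
print(frequent)
# ['a', 'b', 'c', 'd', 'e', 'f']
```

Support is counted per sequence, not per occurrence, which is why b has support 5 even though it appears more than five times in total.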

Generating Length-2 Candidates
- Sequence candidates <xy> built from the 6 frequent items (36 candidates):

         <a>    <b>    <c>    <d>    <e>    <f>
  <a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
  <b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
  <c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
  <d>    <da>   <db>   <dc>   <dd>   <de>   <df>
  <e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
  <f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

- Itemset candidates <(xy)> (15 candidates):

         <b>      <c>      <d>      <e>      <f>
  <a>    <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
  <b>             <(bc)>   <(bd)>   <(be)>   <(bf)>
  <c>                      <(cd)>   <(ce)>   <(cf)>
  <d>                               <(de)>   <(df)>
  <e>                                        <(ef)>

- 51 length-2 candidates in total
- Without the Apriori property, 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of the candidates
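The candidate counts on this slide follow from simple counting: n*n ordered pairs <xy> (including <xx>) plus n*(n-1)/2 unordered itemset pairs <(xy)>. The sketch below enumerates them with the standard `itertools` module; the variable names are illustrative.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates generated from a list of frequent items."""
    seq_cands = [f"<{x}{y}>" for x, y in product(items, repeat=2)]   # <xy>
    set_cands = [f"<({x}{y})>" for x, y in combinations(items, 2)]   # <(xy)>
    return seq_cands + set_cands


with_apriori = len(length2_candidates("abcdef"))     # 6 frequent items
without_apriori = len(length2_candidates("abcdefgh"))  # all 8 items
pruned = (without_apriori - with_apriori) / without_apriori

print(with_apriori, without_apriori, f"{pruned:.2%}")  # 51 92 44.57%
```

Dropping the two infrequent items g and h before generating candidates is exactly the Apriori pruning step: it removes 41 of the 92 possible length-2 candidates.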

The GSP Mining Process (min_sup = 2)
- 1st scan: 8 cand. (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 seq. pat.
- 2nd scan: 51 cand. (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 seq. pat., 10 cand. not in DB at all
- 3rd scan: 46 cand. (<abb> <aab> <aba> <bab> …), 19 length-3 seq. pat., 20 cand. not in DB at all
- 4th scan: 8 cand. (<abba> <(bd)bc> …), 6 length-4 seq. pat.
- 5th scan: 1 cand., 1 length-5 seq. pat.: <(bd)cba>
- At each level, candidates are discarded either because they cannot pass the support threshold or because they do not occur in the DB at all

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>

Candidate Generate-and-Test: Drawbacks
- A huge set of candidate sequences is generated
  - Especially 2-item candidate sequences
- Multiple scans of the database are needed
  - The length of each candidate grows by one at each database scan
- Inefficient for mining long sequential patterns
  - A long pattern grows from short patterns
  - The number of short patterns is exponential in the length of the mined patterns

The SPADE Algorithm
- SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
- A vertical-format sequential pattern mining method
- A sequence database is mapped to a large set of
  - Item: <SID, EID>
- Sequential pattern mining is performed by
  - growing the subsequences (patterns) one item at a time by Apriori candidate generation
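To make the vertical format concrete, the sketch below converts a small sequence database into per-item id-lists of (SID, EID) pairs and grows a length-2 pattern by joining two id-lists. It assumes EID is the 1-based position of the element inside its sequence, and `sequence_join` is a simplified stand-in for SPADE's temporal join rather than the published procedure; all function names are illustrative.

```python
from collections import defaultdict

# Four-sequence example database, one string of items per element.
db = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def to_vertical(db):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, elements in db.items():
        for eid, element in enumerate(elements, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return dict(idlists)


def sequence_join(idlist_x, idlist_y):
    """Id-list of <xy>: occurrences of y that follow some x in the same sequence."""
    first_x = {}
    for sid, eid in idlist_x:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return [(sid, eid) for sid, eid in idlist_y
            if sid in first_x and eid > first_x[sid]]


def support(idlist):
    """Number of distinct sequences in an id-list."""
    return len({sid for sid, _ in idlist})


vertical = to_vertical(db)
print(vertical["a"])
# [(10, 1), (10, 2), (10, 3), (20, 1), (20, 4), (30, 2), (40, 3)]
ab = sequence_join(vertical["a"], vertical["b"])
print(ab, support(ab))
# [(10, 2), (20, 3), (30, 5), (40, 5)] 4
```

Once the id-lists are built, support counting needs no further pass over the original sequences; only id-lists are joined.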

The SPADE Algorithm (figure)
Slide footer: 17 March 2018, Data Mining: Principles and Algorithms

Bottlenecks of GSP and SPADE
- A huge set of candidates could be generated
  - 1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates!
- Multiple scans of the database in mining
- Breadth-first search
- Mining long sequential patterns
  - Needs an exponential number of short candidates
  - A length-100 sequential pattern needs about 10^30 candidate sequences!
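The orders of magnitude on this slide can be checked with a little arithmetic. The sketch below reuses the n*n + n*(n-1)/2 count of length-2 candidates from the GSP example. Reading the figure for a length-100 pattern as the number of nonempty sub-patterns of 100 items (2^100 - 1) is an assumption made here for illustration.

```python
import math


def num_length2_candidates(n):
    """n*n sequence candidates <xy> plus n*(n-1)/2 itemset candidates <(xy)>."""
    return n * n + n * (n - 1) // 2


print(num_length2_candidates(1000))  # 1499500

# Assumed reading: every nonempty subset of the 100 items is a shorter candidate.
sub_patterns = sum(math.comb(100, i) for i in range(1, 101))
print(sub_patterns == 2 ** 100 - 1)  # True
print(f"{sub_patterns:.2e}")         # 1.27e+30
```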

Prefix and Suffix (Projection)
- <a>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>:

  Prefix   Suffix (prefix-based projection)
  <a>      <(abc)(ac)d(cf)>
  <ab>     <(_c)(ac)d(cf)>

Mining Sequential Patterns by Prefix Projections
- Step 1: find length-1 sequential patterns
  - <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  - The ones having prefix <a>;
  - The ones having prefix <b>;
  - …
  - The ones having prefix <f>

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

Finding Seq. Patterns with Prefix <a>
- Only need to consider projections w.r.t. <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  - Further partition into 6 subsets
    - Having prefix <aa>;
    - …
    - Having prefix <af>

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
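The <a>-projected database can be derived programmatically. The sketch below handles only a length-1 prefix and assumes single-letter items stored alphabetically inside each element; `project` and `show` are hypothetical helper names, and the "_" marker follows the slide's notation for the remainder of the element in which the prefix item was matched.

```python
def project(sequence, item):
    """Suffix of `sequence` w.r.t. the length-1 prefix <item>, or None if absent."""
    for i, element in enumerate(sequence):
        if item in element:
            rest = [x for x in element if x > item]  # items after `item` in this element
            head = [["_"] + rest] if rest else []
            return head + [list(e) for e in sequence[i + 1:]]
    return None


def show(suffix):
    """Render a projected suffix in the slide's notation."""
    parts = []
    for e in suffix:
        if e[0] == "_":
            parts.append("(_" + "".join(e[1:]) + ")")
        elif len(e) > 1:
            parts.append("(" + "".join(e) + ")")
        else:
            parts.append(e[0])
    return "<" + "".join(parts) + ">"


db = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

projected = [show(project(s, "a")) for s in db.values() if project(s, "a") is not None]
print(projected)
# ['<(abc)(ac)d(cf)>', '<(_d)c(bc)(ae)>', '<(_b)(df)cb>', '<(_f)cbc>']
```

Only the first occurrence of the prefix item is used, so each sequence contributes at most one suffix to the projected database.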

Completeness of PrefixSpan

  SDB
  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  - Having prefix <a>: <a>-projected database
      <(abc)(ac)d(cf)>
      <(_d)c(bc)(ae)>
      <(_b)(df)cb>
      <(_f)cbc>
    - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      - Having prefix <aa>: <aa>-proj. db …
      - …
      - Having prefix <af>: <af>-proj. db …
  - Having prefix <b>: <b>-projected database …
  - Having prefix <c>, …, <f>: …

Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by pseudo-projections
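The pattern-growth idea can be sketched in a few lines when every element holds exactly one item. This is a deliberate simplification: itemset extensions such as <(ab)> are not handled, and the function name `prefixspan` and the toy database are illustrative, not taken from the slides.

```python
from collections import Counter


def prefixspan(db, min_sup, prefix=()):
    """Return {pattern: support} for sequences whose elements are single items."""
    patterns = {}
    counts = Counter()
    for seq in db:
        counts.update(set(seq))  # count each item once per (projected) sequence
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = sup
        # Projected database: what follows the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.update(prefixspan(projected, min_sup, new_prefix))
    return patterns


toy_db = ["abc", "acb", "ab"]
result = prefixspan(toy_db, min_sup=2)
for pattern, sup in result.items():
    print("<" + "".join(pattern) + ">", sup)
# <a> 3
# <ab> 3
# <ac> 2
# <b> 3
# <c> 2
```

No candidate list is ever built: frequent items are counted directly inside each projected database, and every recursive call works on strictly shorter suffixes.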

Speed-up by Pseudo-Projection
- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When the (projected) database can be held in main memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix
- Example: s = <a(abc)(ac)d(cf)>
  - Prefix <a>: s|<a> = (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
  - Prefix <ab>: s|<ab> = (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
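A pseudo-projection can be represented as nothing more than an offset into the original sequence. The sketch below assumes 1-based offsets that count items across elements, which matches the offsets 2 and 4 in the slide's example. It supports only sequence extensions (the next item must lie in a later element than the last matched one), and the helper names are hypothetical.

```python
def flatten(sequence):
    """List of (element_index, item) pairs in sequence order."""
    return [(i, x) for i, element in enumerate(sequence) for x in element]


def extend(flat, offset, item):
    """Sequence-extend the matched prefix by `item`.
    Returns the new 1-based offset of the postfix, or None if no match."""
    last_element = flat[offset - 2][0] if offset > 1 else -1
    for pos in range(offset - 1, len(flat)):
        element, x = flat[pos]
        if x == item and element > last_element:
            return pos + 2  # first item after the match, 1-based
    return None


def view(sequence, flat, offset):
    """Render the postfix starting at `offset` (display only; nothing is stored)."""
    if offset > len(flat):
        return "<>"
    start = flat[offset - 1][0]
    consumed = sum(1 for e, _ in flat[:offset - 1] if e == start)
    first = sequence[start][consumed:]
    if consumed:
        parts = ["(_" + first + ")"]
    else:
        parts = [first if len(first) == 1 else "(" + first + ")"]
    parts += [e if len(e) == 1 else "(" + e + ")" for e in sequence[start + 1:]]
    return "<" + "".join(parts) + ">"


s = ["a", "abc", "ac", "d", "cf"]      # <a(abc)(ac)d(cf)>
flat = flatten(s)

off_a = extend(flat, 1, "a")           # projection on <a>
off_ab = extend(flat, off_a, "b")      # projection on <ab>
print(off_a, view(s, flat, off_a))     # 2 <(abc)(ac)d(cf)>
print(off_ab, view(s, flat, off_ab))   # 4 <(_c)(ac)d(cf)>
```

Each projection costs one integer per sequence instead of a copied postfix, which is where the time and space savings come from when the database fits in main memory.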

Pseudo-Projection vs. Physical Projection
- Pseudo-projection avoids physically copying postfixes
  - Efficient in running time and space when the database can be held in main memory
- However, it is not efficient when the database cannot fit in main memory
  - Disk-based random accessing is very costly
- Suggested approach:
  - Integration of physical and pseudo-projection
  - Swapping to pseudo-projection when the data set fits in memory

Performance on Data Set C10T8S8I8 (figure)

Performance on Data Set Gazelle (figure)

Effect of Pseudo-Projection (figure)

3. Data Stream Mining
- Introduction to data streams
- Frequent pattern mining algorithms for data streams

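I. Introduction to Data Streams
- Concept
  - A sequence of continuous, ordered points x_1, …, x_i, …, x_n is called a data stream; the points can be read only once or a few times, in a fixed order
- Characteristics
  - Huge data volume, possibly unbounded
  - Frequent changes and a need for fast response
  - Linear-scan algorithms; the data can be queried only a limited number of times
    - Random access is expensive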

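DBMS vs. DSMS

  DBMS                                                                   DSMS
  Persistent relations                                                   Transient streams
  One-time queries                                                       Continuous queries
  Random access                                                          Sequential access
  "Unbounded" disk space                                                 Bounded main memory
  Only the current state matters                                         Arrival order of the data is critical
  Relatively low update rate                                             Unknown data arrival rate
  Little "real-time service"                                             Real-time response
  Data assumed to be precise                                             Stale/imprecise data
  Access plan determined by the query processor at database design time  Variable data and data volumes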

DSMS (architecture diagram)
Diagram labels: User/Application; Continuous Query; Results; Multiple streams; Stream Query Processor; Scratch Space (main memory and/or disk)

DSMS (architecture diagram)
Diagram labels: Input streams; Register Query; Streamed Result; Stored Result; DSMS; Archive; Scratch; Stored Relations

Current DSMS Projects
- STREAM (Stanford): a general-purpose DSMS
- Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia Tech): triggers, incremental view maintenance
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tradebot (www.tradebot.com): stock tickers & streams
- Tribeca (Bellcore): network monitoring
- Streaminer (UIUC): new project for stream data mining

Application Areas
- New application areas: data arrives as continuous, ordered "streams"
  - Network monitoring and traffic engineering
  - Telecom call records
  - Network security
  - Financial applications
  - Manufacturing processes
  - Web logs and clickstreams

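Application Examples
- Network security
  - Packet streams, user session information
  - Queries: URL filtering, anomaly detection, sources of network attacks and viruses
- Financial applications
  - Transaction data streams, stock quotes, news feeds
  - Queries: arbitrage opportunity analysis, patterns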

Current Research Directions
- Stream data model
  - STanford stREam datA Manager (STREAM)
  - Data Stream Management System (DSMS)
- Stream query model
  - Continuous queries
  - Sliding windows
- Stream data mining
  - Clustering & summarization (Guha, Motwani et al.)
  - Correlation of data streams (Gehrke et al.)
  - Classification of stream data (Domingos et al.)

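II. Introduction to Frequent Pattern Mining on Data Streams

  Aspect              Static data                        Stream data
  Relation            Static, persistent                 Transient, volatile
  Query mode          One-time                           Continuous
  Access mode         Random access                      Sequential access
  Storage capacity    Unbounded secondary storage        Bounded main memory
  Response speed      No requirement, or best effort     Must be fast
  Storage behavior    Passive storage                    Active storage
  Update rate         Low                                Unpredictable
  Response behavior   Little "real-time service"         Real-time response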

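Requirements for Frequent Pattern Mining on Data Streams
① The data stream can be scanned only once;
② The number of data items to be processed is unbounded;
③ Data processing requests must be answered in real time.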

Abstract Architecture of a Data Stream Management System (figure)

III. Frequent Pattern Mining Algorithms for Data Streams

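- Deterministic-bounds approximation algorithms: compute an approximate result that is guaranteed to fall within an interval around the true result;
- Probabilistic-bounds approximation algorithms: compute an approximate result that falls within such an interval with high probability.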
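As one concrete illustration of a deterministic-bounds approximation, the sketch below implements Lossy Counting (Manku and Motwani) for single items. That algorithm is not named in these slides and is used here purely as an example of the category. With error parameter epsilon and N items seen so far, every reported count f satisfies true_count - epsilon*N <= f <= true_count, and the stream is read exactly once.

```python
import math


def lossy_count(stream, epsilon):
    """One-pass approximate item counts with a deterministic error bound."""
    width = math.ceil(1 / epsilon)   # bucket width
    counts, deltas = {}, {}          # deltas: maximum possible undercount per item
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1
        if n % width == 0:           # bucket boundary: drop low-count entries
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts, n


stream = "aabcaadeaafg"
counts, n = lossy_count(stream, epsilon=0.25)
print(counts, n)  # {'a': 6} 12
```

Memory stays small because rare items are evicted at every bucket boundary, at the price of undercounting any item by at most epsilon*N.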

Algorithm Comparison

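Sliding Window Techniques
- Natural sliding window (figure: time axes ending at "Now", divided at natural granularities; axis labels: 12 months, 31 days, 24 hours, 4 qtrs; and 7 days, 24 hrs, 4 qtrs, 15 minutes, 25 sec.)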
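The simplest form of a sliding window keeps exact counts over only the most recent items. The sketch below uses a plain fixed-size window, not the multi-granularity natural window shown in the figure; the class name `SlidingWindowCounter` and its methods are illustrative assumptions.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Exact item counts over the last `size` items of a stream."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def add(self, item):
        """Append a new item and expire the oldest one if the window is full."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent(self, min_sup):
        """Items occurring at least min_sup times inside the current window."""
        return {x: c for x, c in self.counts.items() if c >= min_sup}


swc = SlidingWindowCounter(size=4)
for item in "aabbbc":
    swc.add(item)

print(list(swc.window))   # ['b', 'b', 'b', 'c']
print(swc.frequent(2))    # {'b': 3}
```

Item a was frequent earlier in the stream but has slid out of the window, so only b is reported.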

Logarithmic sliding window

References
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.
