960b0b3e61e0f2cab400fd60a9b67f17.ppt
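- Number of slides: 43
Zhejiang University undergraduate course "Introduction to Data Mining", lecture slides. Lecture 6: Sequence Mining Techniques. Xu Congfu (徐从富), Associate Professor, Institute of Artificial Intelligence, Zhejiang University
Outline
- Introduction to sequence data mining
- Sequential pattern mining algorithms
- Data stream mining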
Sequence Databases & Sequential Patterns
- Transaction databases and time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Most databases contain sequential or ordered data:
  - Sequences of transactions, Weblogs, time series, video, scientific & engineering processes
- Applications of sequential pattern mining
  - Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
  - Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures
What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences.
- A sequence: <(ef)(ab)(df)cb>
- A sequence database:

  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

- An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
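The two claims on this slide (the subsequence relation and the support of <(ab)c>) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper names `parse`, `is_subsequence` and `support` are illustrative, and the parser assumes every item is a single lowercase letter.

```python
import re

def parse(text):
    """Parse '<a(abc)d>' into a list of itemsets (assumes single-letter items)."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]

def is_subsequence(pattern, sequence):
    """True if the elements of `pattern` are subsets of distinct elements of
    `sequence`, matched in order (greedy leftmost matching)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())

# The sequence database from the slide.
SDB = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

print(is_subsequence(parse("<a(bc)dc>"), SDB[10]))  # subsequence check
print(support(parse("<(ab)c>"), SDB))               # compare with min_sup = 2
```

Only sequences 10 and 30 contain <(ab)c>, so its support equals the threshold of 2.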
Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient and scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm
  - Agrawal & Srikant. “Mining sequential patterns,” ICDE’95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning’01)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)
A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant’94)
  - If a sequence S is not frequent,
  - then none of the super-sequences of S is frequent
  - E.g., <hb> is infrequent, and so are <hab> and <(ah)b>
- Given support threshold min_sup = 2

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>
GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm
  - proposed by Srikant and Agrawal, EDBT’96
- Outline of the method
  - Initially, every item in DB is a candidate of length 1
  - for each level (i.e., sequences of length k) do
    - scan database to collect support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
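The outline above can be sketched as a level-wise loop. This is a simplified illustration under stated assumptions rather than GSP as published: candidates are produced by appending one frequent item to each frequent length-k sequence (instead of GSP's join step), support is counted by a plain scan per candidate, and all function names are hypothetical. Items are assumed to be single lowercase letters, and the database is the five-sequence running example.

```python
import re
from itertools import chain

def parse(text):
    """'<a(bd)c>' -> tuple of frozensets (assumes single-letter items)."""
    return tuple(frozenset(t.strip("()")) for t in re.findall(r"\([a-z]+\)|[a-z]", text))

def contains(seq, pat):
    """Greedy leftmost test that `pat` is a subsequence of `seq`."""
    i = 0
    for elem in seq:
        if i < len(pat) and pat[i] <= elem:
            i += 1
    return i == len(pat)

def extend(pat, items):
    """Length-(k+1) candidates: add one item as a new element or to the last element."""
    for x in items:
        yield pat + (frozenset([x]),)
        if x > max(pat[-1]):                      # keep itemsets in alphabetical order
            yield pat[:-1] + (pat[-1] | {x},)

def drop_one(pat):
    """All length-k subsequences of a length-(k+1) pattern (remove one item)."""
    for i, elem in enumerate(pat):
        for x in elem:
            rest = elem - {x}
            yield pat[:i] + ((rest,) if rest else ()) + pat[i + 1:]

def levelwise(db, min_sup):
    items = sorted(set(chain.from_iterable(chain.from_iterable(db))))
    level = [(frozenset([x]),) for x in items]    # every item is a length-1 candidate
    frequent = {}
    while level:
        counts = {c: sum(contains(s, c) for s in db) for c in level}
        freq_k = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(freq_k)
        freq_items = [x for x in items if (frozenset([x]),) in frequent]
        # Apriori pruning: keep a candidate only if all its length-k subsequences are frequent.
        level = {c for p in freq_k for c in extend(p, freq_items)
                 if all(sub in freq_k for sub in drop_one(c))}
    return frequent

DB = [parse(s) for s in ("<(bd)cb(ac)>", "<(bf)(ce)b(fg)>", "<(ah)(bf)abf>",
                         "<(be)(ce)d>", "<a(bd)bcb(ade)>")]
patterns = levelwise(DB, min_sup=2)
```

With min_sup = 2 the result contains the six frequent single items and the length-5 pattern <(bd)cba>, while <hb> is never generated because <h> is already infrequent.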
Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
  - <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates (min_sup = 2)

  Cand  Sup
  <a>   3
  <b>   5
  <c>   4
  <d>   3
  <e>   3
  <f>   2
  <g>   1
  <h>   1

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>
Generating Length-2 Candidates

       <a>   <b>   <c>   <d>   <e>   <f>
  <a>  <aa>  <ab>  <ac>  <ad>  <ae>  <af>
  <b>  <ba>  <bb>  <bc>  <bd>  <be>  <bf>
  <c>  <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
  <d>  <da>  <db>  <dc>  <dd>  <de>  <df>
  <e>  <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
  <f>  <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

       <b>     <c>     <d>     <e>     <f>
  <a>  <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
  <b>          <(bc)>  <(bd)>  <(be)>  <(bf)>
  <c>                  <(cd)>  <(ce)>  <(cf)>
  <d>                          <(de)>  <(df)>
  <e>                                  <(ef)>

- 51 length-2 candidates
- Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
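The candidate counts on this slide can be reproduced by enumerating the two candidate shapes: ordered pairs <xy> and unordered same-element pairs <(xy)>. A small sketch (the function name `length2_candidates` is illustrative only):

```python
from itertools import combinations, product

def length2_candidates(items):
    """All length-2 candidates over `items`: <xy> for every ordered pair
    (including x == y) and <(xy)> for every unordered pair of distinct items."""
    sequences = ["<%s%s>" % pair for pair in product(items, repeat=2)]
    itemsets = ["<(%s%s)>" % pair for pair in combinations(items, 2)]
    return sequences + itemsets

pruned = len(length2_candidates("abcdef"))      # only the 6 frequent items
unpruned = len(length2_candidates("abcdefgh"))  # all 8 items, no Apriori pruning
saving = 1 - pruned / unpruned

print(pruned, unpruned, "%.2f%%" % (100 * saving))
```

The saving comes entirely from dropping <g> and <h> after the first scan.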
The GSP Mining Process (min_sup = 2)
- 5th scan: 1 cand., 1 length-5 seq. pat.: <(bd)cba>
- 4th scan: 8 cand., 6 length-4 seq. pat.: <abba> <(bd)bc> …
- 3rd scan: 46 cand., 19 length-3 seq. pat.: <abb> <aab> <aba> <bab> … (20 cand. not in DB at all)
- 2nd scan: 51 cand., 19 length-2 seq. pat.: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)> (10 cand. not in DB at all)
- 1st scan: 8 cand., 6 length-1 seq. pat.: <a> <b> <c> <d> <e> <f> <g> <h>
- Figure legend: "Cand. cannot pass sup. threshold"; "Cand. not in DB at all"

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>
Candidate Generate-and-test: Drawbacks
- A huge set of candidate sequences is generated.
  - Especially 2-item candidate sequences.
- Multiple scans of the database are needed.
  - The length of each candidate grows by one at each database scan.
- Inefficient for mining long sequential patterns.
  - A long pattern grows up from short patterns.
  - The number of short patterns is exponential in the length of the mined patterns.
The SPADE Algorithm
- SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
- A vertical-format sequential pattern mining method
- A sequence database is mapped to a large set of
  - Item: <SID, EID>
- Sequential pattern mining is performed by
  - growing the subsequences (patterns) one item at a time by Apriori candidate generation
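To make the vertical format concrete, the sketch below maps a sequence database to per-item lists of <SID, EID> pairs and grows a length-2 pattern by joining two such lists. It is an illustration under assumptions, not Zaki's implementation: the function names are hypothetical, EIDs are simply element positions starting at 1, and the data is the four-sequence database used earlier in these slides.

```python
from collections import defaultdict

def vertical(db):
    """Map {sid: [element, ...]} to {item: [(sid, eid), ...]}."""
    idlists = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists

def join_sequence(left, right):
    """Id-list of <x y>: occurrences of y that come after some occurrence of x."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return [(sid, eid) for sid, eid in right if sid in first and eid > first[sid]]

def join_itemset(left, right):
    """Id-list of <(xy)>: x and y occur in the same element of the same sequence."""
    return sorted(set(left) & set(right))

def support(idlist):
    """Support = number of distinct sequence ids in the id-list."""
    return len({sid for sid, _ in idlist})

# Each element is written as a string of single-letter items.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}
ids = vertical(DB)
ab_seq = join_sequence(ids["a"], ids["b"])   # pattern <ab>
ab_set = join_itemset(ids["a"], ids["b"])    # pattern <(ab)>
```

Once the id-lists are built, supports are computed from the lists alone, without rescanning the original sequences.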
The SPADE Algorithm
17 March 2018 Data Mining: Principles and Algorithms
Bottlenecks of GSP and SPADE
- A huge set of candidates could be generated
  - 1,000 frequent length-1 sequences generate 1000 × 1000 + 1000 × 999 / 2 = 1,499,500 length-2 candidates!
- Multiple scans of database in mining
- Breadth-first search
- Mining long sequential patterns
  - Needs an exponential number of short candidates
  - A length-100 sequential pattern needs about 10^30 candidate sequences!
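The two figures on this slide can be worked out as follows. The first line reuses the length-2 counting rule from the 8-item example (ordered pairs plus same-element pairs). The second line assumes that the 10^30 figure counts every non-empty sub-pattern of a pattern with 100 distinct items; that reading is an assumption, since the slide does not spell it out.

```latex
\[
  \underbrace{n \cdot n}_{\langle xy \rangle}
  + \underbrace{\frac{n(n-1)}{2}}_{\langle (xy) \rangle}
  \;\Bigg|_{\,n = 1000}
  = 1{,}000{,}000 + 499{,}500 = 1{,}499{,}500
\]
\[
  \sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 1.27 \times 10^{30}
\]
```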
Prefix and Suffix (Projection)
- <a>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>

  Prefix  Suffix (Prefix-Based Projection)
  <a>     <(abc)(ac)d(cf)>
  <ab>    <(_c)(ac)d(cf)>
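The projection table can be reproduced with a few lines of code. This is a minimal sketch that only handles prefixes grown one single-item element at a time (such as <a> and then <ab>); `parse`, `project` and `show` are illustrative names, and items are assumed to be single lowercase letters.

```python
import re

def parse(text):
    """'<a(abc)d>' -> list of tuples of alphabetically sorted items."""
    return [tuple(sorted(t.strip("()"))) for t in re.findall(r"\([a-z]+\)|[a-z]", text)]

def project(sequence, item):
    """Suffix of `sequence` after the first element that contains `item`.
    Items left over in that element are kept as a partial element ('_', ...)."""
    for i, element in enumerate(sequence):
        if element[0] == "_":
            continue  # a partial element belongs to the previous prefix element
        if item in element:
            rest = tuple(x for x in element if x > item)
            head = [("_",) + rest] if rest else []
            return head + list(sequence[i + 1:])
    return None  # the sequence does not contain the item

def show(sequence):
    """Render a (projected) sequence in the slide notation."""
    return "<" + "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")"
                         for e in sequence) + ">"

s = parse("<a(abc)(ac)d(cf)>")
suffix_a = project(s, "a")           # projection w.r.t. prefix <a>
suffix_ab = project(suffix_a, "b")   # projection w.r.t. prefix <ab>
print(show(suffix_a), show(suffix_ab))
```

The `(_c)` element records that c shared an element with the last prefix item b, so it can only extend that element and not start a new one.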
Mining Sequential Patterns by Prefix Projections
- Step 1: find length-1 sequential patterns
  - <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  - The ones having prefix <a>;
  - The ones having prefix <b>;
  - …
  - The ones having prefix <f>

  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
- Only need to consider projections w.r.t. <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  - Further partition into 6 subsets
    - Having prefix <aa>;
    - …
    - Having prefix <af>

  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Completeness of PrefixSpan
- SDB

  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  - Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      - Having prefix <aa>: <aa>-proj. db …
      - …
      - Having prefix <af>: <af>-proj. db …
  - Having prefix <b>: <b>-projected database …
  - Having prefix <c>, …, <f>: …
Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by pseudo-projections
Speed-up by Pseudo-projection
- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When the (projected) database can be held in main memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix
- Example: s = <a(abc)(ac)d(cf)>
  - Prefix <a>: s|<a> is stored as (pointer to s, 2), standing for <(abc)(ac)d(cf)>
  - Prefix <ab>: s|<ab> is stored as (pointer to s, 4), standing for <(_c)(ac)d(cf)>
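The pattern-growth recursion with pseudo-projection can be sketched compactly. The code below is an illustrative simplification, not the published PrefixSpan implementation: each projected sequence is represented by a sequence index plus the index of the element where the prefix's last element was matched (an element offset rather than the item offset used on the slide), items are assumed to be single lowercase letters, and all names are hypothetical.

```python
import re
from collections import defaultdict

def parse(text):
    return [frozenset(t.strip("()")) for t in re.findall(r"\([a-z]+\)|[a-z]", text)]

def show(pattern):
    parts = ("".join(sorted(e)) for e in pattern)
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"

def prefixspan(db, min_sup):
    """Return {pattern string: support} for all frequent sequential patterns."""
    found = {}

    def grow(pattern, pointers):
        # pointers: {sequence index: element index of the leftmost match of pattern[-1]}
        found[show(pattern)] = len(pointers)
        last = pattern[-1]
        seq_ext = defaultdict(dict)  # x -> pointers for pattern + <x>
        set_ext = defaultdict(dict)  # x -> pointers for pattern with x added to its last element
        for sid, start in pointers.items():
            for j in range(start, len(db[sid])):
                element = db[sid][j]
                if j > start:                      # x starts a new element
                    for x in element:
                        seq_ext[x].setdefault(sid, j)
                if last <= element:                # x joins the last element
                    for x in element:
                        if x > max(last):
                            set_ext[x].setdefault(sid, j)
        for x in sorted(seq_ext):
            if len(seq_ext[x]) >= min_sup:
                grow(pattern + [frozenset(x)], seq_ext[x])
        for x in sorted(set_ext):
            if len(set_ext[x]) >= min_sup:
                grow(pattern[:-1] + [last | {x}], set_ext[x])

    first = defaultdict(dict)                      # first occurrence of every item
    for sid, seq in enumerate(db):
        for j, element in enumerate(seq):
            for x in element:
                first[x].setdefault(sid, j)
    for x in sorted(first):
        if len(first[x]) >= min_sup:
            grow([frozenset(x)], first[x])
    return found

SDB = [parse(s) for s in ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>",
                          "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
result = prefixspan(SDB, min_sup=2)
```

No postfix is ever copied here: a projected database is just a dictionary of offsets into the original sequences, which is the point of pseudo-projection when the data fits in main memory.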
Pseudo-Projection vs. Physical Projection
- Pseudo-projection avoids physically copying postfixes
  - Efficient in running time and space when the database can be held in main memory
- However, it is not efficient when the database cannot fit in main memory
  - Disk-based random accessing is very costly
- Suggested approach:
  - Integration of physical and pseudo-projection
  - Swapping to pseudo-projection when the data set fits in memory
Performance on Data Set C10T8S8I8 [figure]
Performance on Data Set Gazelle [figure]
Effect of Pseudo-Projection [figure]
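3. Data Stream Mining
- Introduction to stream data
- Frequent-pattern mining algorithms for stream data
I. Introduction to Data Streams
- Concept
  - A sequence x1, …, xi, …, xn made up of a series of continuous, ordered points is called a data stream; the points can be read only once or a few times, in a fixed order
- Characteristics
  - Huge data volume, possibly unbounded
  - Frequent changes and fast responses
  - Linear-scan algorithms with a limited number of queries
    - random access is expensive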
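DBMS vs. DSMS

  DBMS                                        DSMS
  Persistent relations                        Transient streams
  One-time queries                            Continuous queries
  Random access                               Sequential access
  "Unbounded" disk space                      Bounded main memory
  Only the current state matters              Arrival order of the data is critical
  Relatively low update rate                  Unknown data arrival rate
  Little "real-time service"                  Real-time response
  Data assumed to be precise                  Stale / imprecise data
  Access plan fixed by the query processor    Changing data and data volume
  at database design time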
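DSMS [figure: multiple streams enter a Stream Query Processor, which uses Scratch Space (main memory and/or disk); the User/Application submits a Continuous Query and receives Results]
DSMS [figure: Input streams enter the DSMS; queries are registered (Register Query); outputs are a Streamed Result and a Stored Result; the DSMS uses Archive, Scratch, and Stored Relations]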
Current DSMS Projects
- STREAM (Stanford): a general-purpose DSMS
- Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia Tech): triggers, incr. view maintenance
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tradebot (www.tradebot.com): stock tickers & streams
- Tribeca (Bellcore): network monitoring
- Streaminer (UIUC): new project for stream data mining
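Application Domains
- New application domains: data arrives in the form of continuous, ordered "streams"
  - Network monitoring and traffic engineering
  - Telecom call records
  - Network security
  - Financial applications
  - Manufacturing processes
  - Web logs and clickstreams
Application Examples
- Network security
  - Packet streams, user session information
  - Queries: URL filtering, anomaly detection, sources of network attacks and viruses
- Finance
  - Trading data streams, stock quotes, message feeds
  - Queries: arbitrage opportunity analysis, patterns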
Current Research Directions
- Stream data model
  - STanford stREam datA Manager (STREAM)
  - Data Stream Management System (DSMS)
- Stream query model
  - Continuous queries
  - Sliding windows
- Stream data mining
  - Clustering & summarization (Guha, Motwani et al.)
  - Correlation of data streams (Gehrke et al.)
  - Classification of stream data (Domingos et al.)
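II. Introduction to Frequent Pattern Mining on Stream Data

                            Static data                            Stream data
  Nature of relations       Static and stable                      Transient and volatile
  Query mode                One-time                               Continuous queries
  Access mode               Random access                          Sequential access
  Storage capacity          Unbounded secondary storage            Bounded main memory
  Response speed            No requirement, or as fast as possible Must be fast
  Storage characteristic    Passive storage                        Active storage
  Update rate               Low                                    Unpredictable
  Response characteristic   Little "real-time service"             Real-time response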
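Requirements for Frequent Pattern Mining on Stream Data
① The data stream can be scanned only once;
② The data items to be processed are unbounded;
③ Data-processing requests must be answered in real time.
Abstract Architecture of a Data Stream Management System [figure]
III. Frequent Pattern Mining Algorithms for Stream Data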
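- Deterministic-bounds approximation algorithms: compute an approximate result that is guaranteed to fall within an interval determined by the true result;
- Probabilistic-bounds approximation algorithms: compute an approximate result that falls within an interval determined by the true result with high probability.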
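As an illustration of the deterministic-bounds category, the sketch below uses the Misra-Gries frequent-items summary. The slides do not name this algorithm; it is swapped in here only because it is short and well known. It makes a single pass, keeps at most k - 1 counters, and each estimate f' satisfies f - n/k <= f' <= f, where f is the true count and n the number of items seen so far.

```python
from collections import Counter

def misra_gries(stream, k):
    """One-pass frequency summary with at most k - 1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all counters and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("abacabadabacabae")
estimates = misra_gries(stream, k=4)
truth = Counter(stream)   # exact counts, kept here only to check the bound
```

The exact counter is there only for comparison; on a real stream the summary alone would be kept in memory.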
Algorithm Comparison [figure]
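Sliding Window Techniques
- Natural sliding window
[Figure: time axes ending at "Now", labelled with window granularities: 12 months, 31 days, 24 hours, 4 qtrs; and 7 days, 24 hrs, 4 qtrs, 15 minutes, 25 sec.]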
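One way to read the granularities in this figure is as a window that keeps recent history finely and older history coarsely. The sketch below is an assumption about how such a window could be maintained, not the slides' algorithm: counts arrive once per quarter-hour, and whenever a level fills up (4 quarters, 24 hours, 31 days) its values are summed into one value at the next coarser level. The class and method names are hypothetical.

```python
from collections import deque

LEVELS = [("quarter", 4), ("hour", 24), ("day", 31), ("month", 12)]

class TiltedTimeWindow:
    def __init__(self):
        # One queue per granularity, newest value first.
        # The coarsest level simply forgets values older than 12 months.
        self.slots = [deque() for _ in LEVELS[:-1]] + [deque(maxlen=LEVELS[-1][1])]

    def add(self, count, level=0):
        """Record one new count at `level` (0 = one quarter-hour)."""
        slots = self.slots[level]
        slots.appendleft(count)
        capacity = LEVELS[level][1]
        if level < len(LEVELS) - 1 and len(slots) == capacity:
            total = sum(slots)          # merge a full level into one coarser unit
            slots.clear()
            self.add(total, level + 1)

    def snapshot(self):
        return {name: list(s) for (name, _), s in zip(LEVELS, self.slots)}

window = TiltedTimeWindow()
for _ in range(5):
    window.add(1)                       # five quarter-hours, one event each
after_five = window.snapshot()
for _ in range(91):
    window.add(1)                       # 96 quarter-hours in total = one full day
after_day = window.snapshot()
```

With these capacities the window never holds more than 3 + 23 + 30 + 12 = 68 values, regardless of how long the stream runs.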
Logarithmic sliding window [figure]
References
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.