Knowledge Discovery (Data Mining), Chapter 6: Association Rules
Zhongzhi Shi (史忠植), Institute of Computing Technology, Chinese Academy of Sciences
2018/3/15

Outline
- Introduction
- The Apriori algorithm
- Frequent-pattern tree and the FP-growth algorithm
- Multidimensional association rule mining
- Correlation rules
- Improvements to association rules
- Summary

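Association Rules
- An association rule reflects the interdependence and association between one thing and others. If two or more things are associated, one of them can be predicted from the others.
- An association rule expresses a relationship between items.
- Example: cereal, milk => fruit. "People who buy cereal and milk also buy fruit."
- A store can put milk and cereal on special offer so that people buy more fruit.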

Market Basket Analysis
- Analyze a table of transactions:
    Person  Basket
    A       Chips, Salsa, Cookies, Crackers, Coke, Beer
    B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
    C       Chips, Salsa, Frozen Pizza, Frozen Cake
    D       Lettuce, Spinach, Milk, Butter
- Can we hypothesize the following?
  - Chips => Salsa
  - Lettuce => Spinach

Basic Concepts
- Typically, the data consists of (TID, Basket) pairs:
  - TID: the transaction ID
  - Basket: a subset of the items

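Association Rule Mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern (itemset, sequence, etc.) that occurs frequently in a database.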

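Basic Concepts
- Itemset
- Transaction
- Association rule
- Transaction data set (example below)
- Transaction identifier (TID): every transaction is associated with an identifier, called its TID.
    Transaction-id  Items bought
    10              A, B, C
    20              A, C
    30              A, D
    40              B, E, F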

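Measures for Association Rules: Support
- Support s: the ratio of the number of transactions in D that contain both A and B to the total number of transactions.
- The rule A => B has support s in data set D, where s is the percentage of transactions in D that contain A ∪ B (that is, both A and B).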

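Measures for Association Rules: Confidence
- Confidence c: the ratio of the number of transactions in D that contain both A and B to the number of transactions that contain A.
- The rule A => B has confidence c in D, where c is the percentage of the transactions containing A that also contain B. It can be written as a conditional probability: confidence(A => B) = P(B|A).
- The conditional probability P(B|A) is the probability that B occurs given that A occurs.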

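Measures for Association Rules
- Association rules are included or excluded according to two criteria:
  - Minimum support: how frequently all the items of the rule occur together in transactions.
  - Minimum confidence: how frequently the occurrence of the left-hand item(set) implies the occurrence of the right-hand item(set).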

Market Basket Analysis
    Transaction ID  Basket
    A               Chips, Salsa, Cookies, Crackers, Coke, Beer
    B               Lettuce, Spinach, Oranges, Celery, Apples, Grapes
    C               Chips, Salsa, Frozen Pizza, Frozen Cake
    D               Lettuce, Spinach, Milk, Butter, Chips
- What is I?
- What is T for transaction B?
- What is s(Chips => Salsa)?
- What is c(Chips => Salsa)?
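
The two measures can be checked on the four baskets above with a few lines of Python. This is a minimal sketch: the function names `support` and `confidence` are illustrative choices, not part of the slides.

```python
# Baskets from the slide above (note that basket D includes Chips).
baskets = {
    "A": {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},
    "B": {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},
    "C": {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},
    "D": {"Lettuce", "Spinach", "Milk", "Butter", "Chips"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

s = support({"Chips", "Salsa"}, baskets)       # 2 of the 4 baskets
c = confidence({"Chips"}, {"Salsa"}, baskets)  # 2 of the 3 baskets with Chips
print(f"s(Chips => Salsa) = {s:.2f}, c(Chips => Salsa) = {c:.3f}")
```

Because basket D contains Chips without Salsa, the confidence drops to 2/3 while the support stays at 1/2.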

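Frequent Itemsets
- Itemset: any set of items.
- k-itemset: an itemset containing k items.
- Frequent (or large) itemset: an itemset that satisfies minimum support.
- If I contains m items, how many itemsets can be generated?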

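Strong Association Rules
- Given an itemset, it is easy to generate association rules.
  - Itemset: {Chips, Salsa, Beer}
  - Beer, Chips => Salsa
  - Beer, Salsa => Chips
  - Chips, Salsa => Beer
- Strong rules are interesting.
  - Strong rules are usually defined as those that satisfy both minimum support and minimum confidence.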

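Association Rule Mining
- Two basic steps:
  - Find all frequent itemsets
    - those that satisfy minimum support.
  - Find all strong association rules
    - generate association rules from the frequent itemsets;
    - keep the rules that satisfy minimum confidence.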
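
The second step can be sketched as follows: every non-empty proper subset of a frequent itemset is tried as a left-hand side, and the rule is kept if its confidence reaches the threshold. The transaction list and the name `rules_from_itemset` are hypothetical, made up only for this illustration.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(itemset, transactions, min_conf):
    """Return (lhs, rhs, confidence) for each rule lhs => rhs that passes min_conf."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            lhs_count = support_count(lhs, transactions)
            if lhs_count and whole / lhs_count >= min_conf:
                rules.append((lhs, itemset - lhs, whole / lhs_count))
    return rules

# Hypothetical transactions, not taken from the slides.
transactions = [
    {"Chips", "Salsa", "Beer"},
    {"Chips", "Salsa", "Beer"},
    {"Chips", "Salsa"},
    {"Chips", "Beer"},
    {"Beer"},
]
rules = rules_from_itemset({"Chips", "Salsa", "Beer"}, transactions, min_conf=0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"confidence {conf:.2f}")
```

All rules derived from one itemset share the same support, so only the confidence has to be recomputed for each left-hand side.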


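The Apriori Algorithm
- Researchers at the IBM Almaden Research Center, R. Agrawal and colleagues, proposed the AIS and SETM algorithms in 1993.
- In 1994 they proposed Apriori and AprioriTid. Both algorithms use the itemsets found in the previous pass to generate the new candidate itemsets, which avoids generating unnecessary intermediate itemsets and improves efficiency.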

Generating Frequent Itemsets
- Naïve algorithm:
    n <- |D|
    for each subset s of I do
        l <- 0
        for each transaction T in D do
            if s is a subset of T then
                l <- l + 1
        if minimum support <= l/n then
            add s to frequent subsets
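
The naïve procedure translates almost line for line into Python. The sketch below skips the empty subset and runs on the four-transaction table from the basic-concepts slide; the function name is an illustrative choice.

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, min_support):
    """Test every non-empty subset of the item universe against every transaction."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            s = frozenset(subset)
            count = sum(s <= t for t in transactions)  # one scan of D per subset
            if count / n >= min_support:
                frequent[s] = count
    return frequent

# Transactions 10, 20, 30, 40 from the basic-concepts slide.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = naive_frequent_itemsets(db, min_support=0.5)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

With six items the loop already visits 63 subsets, which is why the exhaustive approach does not scale.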

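Generating Frequent Itemsets
- Analysis of the naïve algorithm:
  - Subsets of I: $O(2^m)$
  - For each subset, scan the n transactions and test whether s is a subset of T: $O(2^m n)$ subset tests
- The cost grows exponentially with the number of items!
- Can we do better?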

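The Apriori Property
- Theorem (Apriori property): if A is a frequent itemset, then every subset of A is also a frequent itemset.
- Proof: let n be the number of transactions. Suppose A is a subset of l transactions. If A' ⊆ A, then A' is a subset of l' transactions, with l' ≥ l. Hence if l/n ≥ s (the minimum support), then l'/n ≥ s also holds.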

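The Apriori Algorithm
- Apriori is a classic algorithm for mining the frequent itemsets of Boolean association rules. Its name comes from the fact that it uses prior knowledge of frequent-itemset properties.
- Idea: Apriori uses an iterative approach called level-wise search, in which k-itemsets are used to find (k+1)-itemsets. First the set of frequent 1-itemsets, L1, is found. L1 is used to find L2, the set of frequent 2-itemsets; L2 is used to find L3; and so on, until no new frequent k-itemsets are found. Finding each Lk requires one full scan of the database.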

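Generating Frequent Itemsets
- Core idea: build candidate k-itemsets from the frequent (k-1)-itemsets.
- Method:
  - Find all frequent 1-itemsets.
  - Extend the frequent (k-1)-itemsets to obtain candidate k-itemsets.
  - Prune the candidates that do not satisfy minimum support.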

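Apriori: A Candidate Generate-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Method:
  - Generate candidate (k+1)-itemsets from the frequent k-itemsets, and
  - test the candidates against the DB.
- Performance studies show that Apriori is efficient and scalable.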

The Apriori Algorithm: An Example
Database TDB (a minimum support count of 2 is implied by the pruning shown):
    Tid  Items
    10   A, C, D
    20   B, C, E
    30   A, B, C, E
    40   B, E
1st scan, C1:
    Itemset  sup
    {A}      2
    {B}      3
    {C}      3
    {D}      1
    {E}      3
L1:
    Itemset  sup
    {A}      2
    {B}      3
    {C}      3
    {E}      3
C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 with counts:
    Itemset  sup
    {A, B}   1
    {A, C}   2
    {A, E}   1
    {B, C}   2
    {B, E}   3
    {C, E}   2
L2:
    Itemset  sup
    {A, C}   2
    {B, C}   2
    {B, E}   3
    {C, E}   2
C3 (generated from L2): {B, C, E}
3rd scan, L3:
    Itemset    sup
    {B, C, E}  2

The Apriori Algorithm
Algorithm: Apriori
Input: database D of transactions; minimum support threshold min_sup.
Output: L, the frequent itemsets in D.
Procedure:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = find_frequent_1_itemsets(D);
    for (k = 2; Lk-1 ≠ ∅; k++) do begin {
        Ck = apriori_gen(Lk-1, min_sup);
        for each transaction t in database D do {   // scan D for counts
            Ct = subset(Ck, t);   // get the subsets of t that are candidates
            for each candidate c ∈ Ct
                c.count++;
        }
        Lk = candidates in Ck with min_support
    } end
    return L = ∪k Lk;

The Apriori Algorithm
Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup: minimum support threshold)
    for each itemset l1 ∈ Lk-1
        for each itemset l2 ∈ Lk-1
            if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
                c = join(l1, l2);   // join step: generate candidates
                if has_infrequent_subset(c, Lk-1) then
                    delete c;       // prune step: remove unfruitful candidate
                else add c to Ck;
            }
    return Ck

The Apriori Algorithm
The join step generates the candidate itemsets Ck from pairs of itemsets in Lk-1.
Procedure join(p, q)
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

The Apriori Algorithm
Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)   // use prior knowledge
    for each (k-1)-subset s of c
        if s ∉ Lk-1 then
            return TRUE;
    return FALSE

The Apriori Algorithm
- How are candidates generated?
  - Step 1: self-join Lk
  - Step 2: prune
- How is the support of the candidates counted?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Prune:
    - acde is removed because ade is not in L3
  - C4 = {abcd}

How Are Candidates Generated?
- Assume the items in Lk-1 are listed in a fixed order.
- Step 1: self-join Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: prune
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
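
The level-wise loop with its join and prune steps can be put together in a short Python sketch. Itemsets are represented as sorted tuples and candidates are counted by a plain scan rather than with a hash tree; both are simplifications chosen for this illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join + prune: candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: same first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(sub in prev_set for sub in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_count):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        candidates = apriori_gen(level, k)
        counts = {c: 0 for c in candidates}
        for t in transactions:          # one full scan of the database per level
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Database TDB from the worked example, minimum support count 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_count=2)
# The L3 example from the candidate-generation slide.
c4 = apriori_gen(["abc", "abd", "acd", "ace", "bcd"], 4)
print(sorted(frequent.items()))
print(c4)
```

On TDB the sketch finds {B, C, E} with count 2 as the only frequent 3-itemset, and on the L3 example the prune step discards acde because ade is missing.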

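How Is the Support of Candidates Counted?
- Why is counting the support of candidates a problem?
  - The total number of candidates can be huge.
  - One transaction may contain many candidates.
- Method:
  - Candidate itemsets are stored in a hash tree.
  - A leaf node of the hash tree contains a list of itemsets and counts.
  - An interior node contains a hash table.
  - Subset function: finds all the candidates contained in a transaction.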

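Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - A huge number of candidates
  - The heavy workload of counting candidate support
- Improving Apriori: general ideas
  - Reduce the number of scans of the transaction database.
  - Shrink the number of candidates.
  - Make counting candidate support easier.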

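The AprioriTid Algorithm
- AprioriTid is an improvement on Apriori.
- Advantage: it interacts with the database only once and does not need to access it repeatedly.
- It extends the Ck of Apriori so that each entry changes from {c} to {TID, c}, where the TID uniquely identifies a transaction.
- It introduces Bk, which organizes the itemsets by transaction instead of passively waiting to be matched against Ck.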

The AprioriTid Algorithm
- Example: minsupp = 2
- Database:
    TID  Items
    100  1 3 4
    200  2 3 5
    300  1 2 3 5
    400  2 5

AprioriTid Example
    TID  Itemsets
    100  {1} {3} {4}
    200  {2} {3} {5}
    300  {1} {2} {3} {5}
    400  {2} {5}
    Itemset  Support
    {1}      2
    {2}      3
    {3}      3
    {5}      3

AprioriTid Example
    TID  Itemsets
    100  {{1 3}}
    200  {{2 3} {2 5} {3 5}}
    300  {{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}}
    400  {{2 5}}
    Itemset  Support
    {1 3}    2
    {2 3}    2
    {2 5}    3
    {3 5}    2

AprioriTid Example
    TID  Itemsets
    100  empty
    200  {{2 3 5}}
    300  {{2 3 5}}
    400  empty

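The AprioriTid Algorithm
- The tables above show Bk and Lk; Ck is the same as that produced by Apriori and so is not written out.
- As can be seen, Bk is obtained from Bk-1, with no need to fetch data from the database.
- Disadvantage: it needs a lot of memory; with too many transactions the resources may not suffice.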
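
The way Bk is derived from Bk-1 can be sketched in Python. Each entry of Bk-1 holds the candidate (k-1)-itemsets contained in one transaction, and a candidate k-itemset is present exactly when its two generating (k-1)-subsets are. The helper `gen_candidates` uses a brute-force formulation (all k-combinations whose (k-1)-subsets are frequent) that yields the same set as join and prune; it is an assumption made for brevity.

```python
from itertools import combinations

def gen_candidates(prev_frequent, k):
    """All k-itemsets whose (k-1)-subsets are all frequent (same result as join + prune)."""
    prev = set(prev_frequent)
    items = sorted({i for s in prev for i in s})
    return {c for c in combinations(items, k)
            if all(sub in prev for sub in combinations(c, k - 1))}

def apriori_tid(db, min_count):
    """db maps TID -> set of items; returns {itemset tuple: support count}."""
    # B1: the only pass that reads the database itself.
    b = {tid: {(i,) for i in items} for tid, items in db.items()}
    counts = {}
    for entry in b.values():
        for c in entry:
            counts[c] = counts.get(c, 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent, k = dict(level), 2
    while level:
        candidates = gen_candidates(level, k)
        counts = {c: 0 for c in candidates}
        next_b = {}
        for tid, entry in b.items():        # reads B(k-1), never the database
            # c is in the transaction iff both generating (k-1)-subsets are.
            found = {c for c in candidates
                     if c[:-1] in entry and c[:-2] + c[-1:] in entry}
            for c in found:
                counts[c] += 1
            if found:                        # transactions with no candidates drop out
                next_b[tid] = found
        b = next_b
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
result = apriori_tid(db, min_count=2)
print(sorted(result.items()))
```

On the example database only transactions 200 and 300 survive into B3, which is why later passes tend to get cheaper while B2 can be larger than the database itself.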


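Bottleneck of Frequent Pattern Mining
- Multiple database scans are costly.
- Mining long patterns needs many scans and generates a huge number of candidates.
  - To find the frequent itemset i1 i2 … i100:
    - Number of scans: 100
    - Number of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
- Bottleneck: candidate generation and test.
- Can we avoid generating candidates?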
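
The candidate count can be verified directly with exact integer arithmetic; this is only a check of the sum quoted above.

```python
from math import comb

# Number of non-empty sub-itemsets of a 100-item frequent itemset.
total = sum(comb(100, k) for k in range(1, 101))

print(total == 2**100 - 1)   # the closed form holds exactly
print(f"{total:.2e}")        # order of magnitude of the candidate count
```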

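Mining Frequent Patterns without Candidate Generation
- Grow long patterns from short ones using locally frequent items.
  - "abc" is a frequent pattern.
  - Get all transactions containing "abc": DB|abc.
  - "d" is a locally frequent item in DB|abc => abcd is a frequent pattern.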

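The FP-Growth Algorithm (Han, Pei, Yin 2000)
- A problematic aspect of Apriori is its candidate generation,
  - the source of exponential growth.
- An alternative is a divide-and-conquer strategy.
  - Idea: compress the database into a frequent-pattern tree that holds the relevant information about frequent items.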

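Mining Frequent Patterns with the FP-Tree
- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partitioning.
- Method:
  - For each frequent item, construct its conditional pattern base and then its conditional FP-tree.
  - Repeat the process on each newly created conditional FP-tree
  - until the resulting FP-tree is empty or contains only a single path. A single path generates all combinations of its sub-paths, each of which is a frequent pattern.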

Frequent 1-Itemsets
- Minimum support is 20% (a count of 2).
Transaction database:
    TID  Items
    1    I1, I2, I5
    2    I2, I4
    3    I2, I3, I6
    4    I1, I2, I4
    5    I1, I3
    6    I2, I3
    7    I1, I3
    8    I1, I2, I3, I5
    9    I1, I2, I3
Support counts:
    Itemset  Support count
    {I1}     6
    {I2}     7
    {I3}     6
    {I4}     2
    {I5}     2
    {I6}     1
Frequent 1-itemsets:
    Itemset  Support count
    {I1}     6
    {I2}     7
    {I3}     6
    {I4}     2
    {I5}     2

FP-Tree Construction
- Sort the frequent items in descending order of support:
    Itemset  Support count
    {I2}     7
    {I1}     6
    {I3}     6
    {I4}     2
    {I5}     2

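FP-Tree Construction
- Create the root node (null).
- Scan the database:
  - Transaction 1: I1, I2, I5
  - Sorted: I2, I1, I5
- Process the transaction:
  - Add nodes in item order, each labeled with the item and its count.
  - Maintain the header table.
Header table after transaction 1: I2 1, I1 1, I3 0, I4 0, I5 1
Tree after transaction 1:
    null
      (I2, 1)
        (I1, 1)
          (I5, 1)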

FP-Tree Construction
Tree after transaction 2 (I2, I4):
    null
      (I2, 2)
        (I1, 1)
          (I5, 1)
        (I4, 1)

FP-Tree Construction
Header table after all nine transactions: I2 7, I1 6, I3 6, I4 2, I5 2
Final tree:
    null
      (I2, 7)
        (I1, 4)
          (I5, 1)
          (I4, 1)
          (I3, 2)
            (I5, 1)
        (I4, 1)
        (I3, 2)
      (I1, 2)
        (I3, 2)

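FP-Tree Construction
- Scan the transaction database D once to obtain the set F of frequent items and their supports. Sort F in descending order of support as L, the list of frequent items.
- Create the root of the FP-tree and label it NULL.
- For each transaction in D:
  - Select and sort the frequent items of the transaction according to the order of L. Let the sorted frequent-item list be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T).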

FP-Tree Construction
- insert_tree([p|P], T):
  - If T has a child N such that N.item-name = p.item-name, then increment the count of N by 1;
  - else create a new node N, set its count to 1, link its parent link to T, and link its node-link to the nodes with the same item-name via the node-link structure.
  - If P is nonempty, call insert_tree(P, N) recursively.
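
The construction procedure can be sketched in Python as follows. The `Node` class and the function name are illustrative; insertion is written as a loop rather than the recursive insert_tree, and ties in support are broken by item name, which reproduces the order I2, I1, I3, I4, I5 used above.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> child Node

def build_fp_tree(transactions, min_count):
    """Return (root, header); header maps each frequent item to its node-links."""
    # First scan: support counts of single items.
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: n for i, n in freq.items() if n >= min_count}
    # L: frequent items in descending support, ties broken by name.
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}
    # Second scan: insert each transaction's sorted frequent items.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)   # extend the node-link chain
            child.count += 1
            node = child
    return root, header

db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
root, header = build_fp_tree(db, min_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```

Summing the counts along each node-link chain gives back the support of the item, which is a quick sanity check on the tree.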

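Mining the FP-Tree
- Start from the last item in the header table.
- Find all paths containing that item
  - by following the node-links.
- Identify the conditional patterns
  - the patterns in those paths that meet the frequency requirement.
- Construct the conditional FP-tree C.
  - Append the item to all paths in C to generate frequent patterns.
  - Mine C recursively (appending the item).
- Remove the item from the header table and the tree.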

Mining the FP-Tree: Item I5
- Prefix paths of I5: (I2 I1: 1), (I2 I1 I3: 1)
- Conditional path: (I2 I1: 2)
- Frequent patterns generated from the conditional FP-tree: (I2 I1 I5: 2), (I2 I5: 2), (I1 I5: 2)

Mining the FP-Tree
    Item  Conditional pattern base           Conditional FP-tree       Frequent patterns generated
    I5    {(I2 I1: 1), (I2 I1 I3: 1)}        <I2: 2, I1: 2>            I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
    I4    {(I2 I1: 1), (I2: 1)}              <I2: 2>                   I2 I4: 2
    I3    {(I2 I1: 2), (I2: 2), (I1: 2)}     <I2: 4, I1: 2>, <I1: 2>   I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
    I1    {(I2: 4)}                          <I2: 4>                   I2 I1: 4

Mining the FP-Tree
Procedure FP_growth(Tree, α)
    if Tree contains a single path P then
        for each combination (denoted β) of the nodes in the path P
            generate pattern β ∪ α with support = minimum support of the nodes in β;
    else for each ai in the header of Tree {
        generate pattern β = ai ∪ α with support = ai.support;
        construct the conditional pattern base of β and then the conditional FP-tree Treeβ of β;
        if Treeβ ≠ ∅ then
            call FP_growth(Treeβ, β);
    }
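
The recursion can be illustrated with a simplified Python sketch. To stay short it keeps each conditional pattern base as a plain list of (prefix path, count) pairs instead of compressing it into a conditional FP-tree, and it omits the single-path shortcut; the pattern-growth logic is otherwise the same. The function names are illustrative.

```python
from collections import Counter

def fp_growth(pattern_base, min_count, suffix=()):
    """pattern_base: list of (items, count), items ordered by the global item order."""
    counts = Counter()
    for items, n in pattern_base:
        for i in items:
            counts[i] += n
    patterns = {}
    for item, n in counts.items():
        if n < min_count:
            continue
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = n
        # Conditional pattern base of item: the prefix before it on each path.
        cond = [(items[:items.index(item)], c)
                for items, c in pattern_base if item in items]
        patterns.update(fp_growth(cond, min_count, new_suffix))
    return patterns

def mine(transactions, min_count):
    """Order every transaction by descending item support, then grow patterns."""
    freq = Counter(i for t in transactions for i in set(t))
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {i: r for r, i in enumerate(order)}
    base = [(tuple(sorted(set(t), key=rank.get)), 1) for t in transactions]
    return fp_growth(base, min_count)

db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
patterns = mine(db, min_count=2)
for p, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(p), n)
```

Because every path is ordered the same way, each frequent itemset is reached exactly once, through its lowest-ranked item first.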

Parallel Association Rule Mining
- Three parallel algorithms based on Apriori: CD, DD, and CaD.
- Discovering frequent itemsets (1) is much more expensive than generating rules (2).
- Phase 1: each node generates candidate k-itemsets locally from the frequent (k-1)-itemsets. How to partition?
- Phase 2: match the candidate itemsets against the transactions and collect the local counts. How to distribute?
- Phase 3:
  - determine the global counts for the itemsets (how to find them?);
  - find the frequent k-itemsets and replicate them on all nodes.

Parallel Association Rule Mining: Notation
    k-itemset   An itemset having k items
    $L_k$       Set of frequent k-itemsets (those with minimum support); each member has 2 fields: itemset and support count
    $C_k$       Set of candidate k-itemsets (potentially frequent itemsets); each member has 2 fields: itemset and support count
    $P_i$       Processor with id i
    $D_i$       The dataset local to processor $P_i$
    $DR_i$      The dataset local to processor $P_i$ after repartitioning
    $C_k^i$     The candidate set maintained by processor $P_i$ during the kth pass (there are k items in each candidate)

Count Distribution (CD)
- Objective: minimize communication.
- Techniques:
  - Straightforward parallelization of Apriori
  - Carry out redundant duplicate computations in parallel to avoid communication
  - Only count values need to be communicated (no data tuples are exchanged)
- Processors can scan their local data asynchronously in parallel.

Count Distribution (CD)
Algorithm, pass 1:
  (1) Each processor $P_i$ generates its local candidate itemset $C_1^i$ depending on the items present in its local data partition $D_i$.
  (2) Develop and exchange the local counts $C_1^i$.
  (3) Develop the global support counts $C_1$.

Count Distribution (CD)
Algorithm, pass k > 1:
  (1) $P_i$ generates the complete $C_k$ using the complete $L_{k-1}$ created at the end of pass k-1. Every processor has the identical $L_{k-1}$, so each generates an identical $C_k$ and puts its count values in a common order into a count array.
  (2) $P_i$ makes a pass over its data partition $D_i$ and develops local support counts for the candidates in $C_k$.
  (3) $P_i$ exchanges its local $C_k$ counts with all other processors to develop the global $C_k$ counts. All processors must synchronize.
  (4) $P_i$ computes $L_k$ from $C_k$.
  (5) $P_i$ independently decides whether to terminate or to continue to the next pass.
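
One pass of Count Distribution can be simulated in a single process to show what is exchanged. In this sketch each "processor" is just a list of local transactions, and the exchange step is modeled as summing the local count arrays; the partitioning of the data and the function names are assumptions for the illustration, not a real message-passing implementation.

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, candidates):
    """Step (2): a processor counts every candidate on its own data partition."""
    counts = Counter()
    for t in partition:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return counts

def count_distribution_pass(partitions, candidates, min_count):
    # Every simulated processor counts the same complete Ck on its own Di.
    per_processor = [local_counts(d, candidates) for d in partitions]
    # Step (3): only count arrays are combined; no transactions are moved.
    global_counts = sum(per_processor, Counter())
    # Step (4): each processor would derive the same Lk from the global counts.
    return {c: n for c, n in global_counts.items() if n >= min_count}

# Hypothetical split of a four-transaction database across two processors.
d1 = [{"A", "C", "D"}, {"B", "C", "E"}]
d2 = [{"A", "B", "C", "E"}, {"B", "E"}]
c2 = list(combinations("ABCE", 2))      # complete C2 built from L1 = {A, B, C, E}
l2 = count_distribution_pass([d1, d2], c2, min_count=2)
print(sorted(l2.items()))
```

The amount of data exchanged depends only on the size of the candidate set, not on the number of transactions, which is the point of the scheme.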


Count Distribution (CD)
Disadvantages:
- CD does not exploit the aggregate memory of the system.
- Processors must synchronize and develop the global counts at the end of each pass.
    Number of processors  Total amount of memory  Usage of memory per processor
    1                     |M|                     M over [Ck]
    N                     N * |M|                 M over [Ck]

Data Distribution (DD)
- Objective: use the aggregate main memory of the system effectively.
- Technique:
  - Partition the candidates into disjoint sets, which are assigned to different processors. Each processor works with the entire dataset but only a portion of the candidate set.
  - Each processor counts mutually exclusive candidates. On an N-processor configuration, DD can count in a single pass a candidate set that would require N passes in CD.

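Data Distribution (DD)
Basic idea (example with 2 processors):
- [Figure: $C_k$ is generated from $L_{k-1}$ and split into $C_k^1$ on processor 1 and $C_k^2$ on processor 2; both parts are counted against the data to produce $L_k$.]
- Data Distribution processes only a subset of $C_k$ on each processor, to utilize the aggregate memory.
- Processors exchange data to develop the global counts for $C_k^i$.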

Data Distribution (DD)
Algorithm:
Pass 1: same as the CD algorithm.
Pass k > 1:
  (1) $P_i$ generates $C_k$ from $L_{k-1}$. It retains only 1/N of the itemsets, forming $C_k^i$.
  (2) $P_i$ develops support counts for the itemsets in $C_k^i$ for ALL transactions (using local data pages and data pages received from other processors).
  (3) At the end of the data pass, $P_i$ calculates $L_k^i$ using the local $C_k^i$.
  (4) Processors exchange $L_k^i$ so that every processor has the complete $L_k$ for generating $C_{k+1}$ for the next pass (this requires the processors to synchronize).
  (5) $P_i$ can independently decide whether to terminate or to continue on to the next pass.

Data Distribution (DD)
[Figure: the local sets $L_k^i$ are exchanged to form the complete $L_k$.]

Data Distribution (DD)
Disadvantage: heavy communication. Each processor must broadcast its local data and frequent itemsets to all other processors and synchronize in every pass.

Candidate Distribution (CaD)
- Problem: CD and DD require the processors to synchronize at the end of each pass.
- Basic idea: remove the dependence among processors.
  - Data dependence: complete transactions are required to compute the support counts (in CD).
  - Frequent-itemset dependence: a global itemset $L_k$ is needed during the pruning step of the Apriori candidate generation algorithm (in DD).

Candidate Distribution (CaD)
- Remove the data dependence:
  - Each processor $P_i$ works on $C_k^i$, a disjoint subset of $C_k$.
  - $P_i$ derives the global support counts for $C_k^i$ from local data.
  - Data is replicated among the processors in order to achieve this.
- Reduce the frequent-itemset dependence:
  - A processor does not wait for the complete pruning information to arrive from the other processors.
  - It prunes the candidate set as much as possible.
  - Late-arriving pruning information is used in subsequent passes.

Candidate Distribution (CaD)
Algorithm:
Pass k < l: use either the CD or the DD algorithm.
Pass k = l:
  (1) Partition $L_{k-1}$ among the N processors.
  (2) $P_i$ generates $C_k^i$ logically using only its $L_{k-1}^i$ partition (with standard pruning).
  (3) $P_i$ develops global counts for the candidates in $C_k^i$, and the database is repartitioned into $DR_i$ at the same time (this requires communicating local data).
  (4) $P_i$ receives $L_k^j$ from all other processors, as needed for pruning $C_{k+1}^i$.
  (5) $P_i$ computes $L_k^i$ from $C_k^i$ and asynchronously sends it to the other N-1 processors.
Pass k > l:
  (1) $P_i$ collects all frequent itemsets sent by other processors.
  (2) $P_i$ generates $C_k^i$ using the local $L_{k-1}^i$, taking care of pruning with the $L_{k-1}^j$ received so far.
  (3) $P_i$ passes over $DR_i$ and counts $C_k^i$.
  (4) $P_i$ computes $L_k^i$ from $C_k^i$ and asynchronously sends it to the other N-1 processors.

Comparison of Parallel Association Algorithms
- Count Distribution attempts to minimize communication by replicating the candidate sets in the memory of each processor.
- Data Distribution maximizes the use of aggregate memory by letting each processor work with the entire dataset but only a portion of the candidate set.
- Candidate Distribution eliminates the synchronization costs at the end of every pass and maximizes the use of aggregate memory, while limiting heavy communication to a single redistribution pass.

Comparison of Parallel Association Algorithms
- Count Distribution
  - Generate the complete $C_k$
  - Generate the complete $L_k$
  - Exchange support counts for $C_k^i$
  - Synchronize at each pass
- Data Distribution
  - Generate the complete $C_k$
  - Partition $C_k$
  - Generate the partial $L_k^i$
  - Exchange data to compute support counts for $C_k^i$
  - Exchange $L_k^i$
  - Synchronize at each pass
- Candidate Distribution
  - Partition $L_k$
  - Generate the partial $C_k^i$
  - Generate the partial $L_k^i$
  - Exchange data to repartition the database (once only)
  - Exchange $L_k^i$ (no synchronization required)

Comparison of Parallel Association Algorithms
- Count Distribution (CD)
  - Advantages: low communication cost; only counts are exchanged, not data.
  - Disadvantages: does not exploit aggregate memory; must synchronize at the end of every pass; what happens when the candidate itemsets do not fit in the memory of each node?
- Data Distribution (DD)
  - Advantages: better utilization of the aggregate system memory.
  - Disadvantages: each processor needs to broadcast its local data to every other processor in every pass; synchronization is needed.
- Candidate Distribution (CaD)
  - Advantages: processors can proceed independently without synchronizing at the end of every pass.
  - Disadvantages: the entire dataset must be communicated in a single redistribution pass; the communication cost is higher than the savings in synchronization cost.

Constructing an FP-Tree from a Transaction Database
min_support = 3
    TID  Items bought               (Ordered) frequent items
    100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
    200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
    300  {b, f, h, j, o, w}         {f, b}
    400  {b, c, k, s, p}            {c, b, p}
    500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending order of support to form the F-list.
3. Scan the DB again and construct the FP-tree.
Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
F-list = f-c-a-b-m-p
FP-tree:
    {}
      f: 4
        c: 3
          a: 3
            m: 2
              p: 2
            b: 1
              m: 1
        b: 1
      c: 1
        b: 1
          p: 1

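Partitioning Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the F-list (F-list = f-c-a-b-m-p):
  - patterns containing p
  - patterns containing m but not p
  - …
  - patterns containing c but none of a, b, m, p
  - pattern f
- The partition is complete and non-redundant.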

Finding Patterns Containing p from the Conditional Database of p
- Start at the frequent-item header table of the FP-tree.
- Traverse the FP-tree by following the link of each frequent item p.
- Accumulate all the transformed prefix paths of item p to form the conditional pattern base of p.
    Item  Conditional pattern base
    c     f: 3
    a     fc: 3
    b     fca: 1, f: 1, c: 1
    m     fca: 2, fcab: 1
    p     fcam: 2, cb: 1

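Recursion: Mining Each Conditional FP-Tree
- m-conditional FP-tree: {} -> f: 3 -> c: 3 -> a: 3
- Conditional pattern base of "am": (fc: 3); am-conditional FP-tree: {} -> f: 3 -> c: 3
- Conditional pattern base of "cm": (f: 3); cm-conditional FP-tree: {} -> f: 3
- Conditional pattern base of "cam": (f: 3); cam-conditional FP-tree: {} -> f: 3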

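A Special Case: A Single Prefix Path in the FP-Tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P.
- Mining can be decomposed into two parts:
  - Reduce the single prefix path to one node.
  - Concatenate the mining results of the two parts.
[Figure: a tree whose single prefix path a1: n1 -> a2: n2 -> a3: n3 leads into a branching part containing b1: m1, C1: k1, C2: k2, and C3: k3 is split into the prefix path r1 = a1: n1 -> a2: n2 -> a3: n3 plus the branching part rooted at r1.]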

Scaling FP-Growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection.
- First partition the database into a set of projected databases.
- Then construct and mine an FP-tree for each projected database.
- Two techniques: parallel projection vs. partition projection
  - Parallel projection is space costly.

Partition-Based Projection
- Parallel projection needs a lot of disk space.
- Partition projection saves disk space.
    Tran. DB:    fcamp, fcabm, fb, cbp, fcamp
    p-proj DB:   fcam, cb, fcam
    m-proj DB:   fcab, fca, fca
    b-proj DB:   f, cb, …
    a-proj DB:   fc, …
    c-proj DB:   f, …
    f-proj DB:   …
    am-proj DB:  fc, fc, fc
    cm-proj DB:  f, f, f

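Improvement Approaches
- Use a hash table to store the support counts of candidate k-itemsets.
- Remove transactions that contain no frequent itemsets.
- Sample the data.
- Partition the data:
  - If an itemset is frequent, it must be frequent in at least one data partition.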

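Benefits of the FP-Tree Structure
- Completeness
  - Preserves the complete information for frequent itemset mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items are in descending order of support: the more frequently an item occurs, the more likely it is to be shared
  - Never larger than the original database (not counting node-links and count fields)
  - For the Connect-4 database, the compression ratio can exceed 100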

Visualizing Association Rules: Pane Graph

Visualizing Association Rules: Rule Graph


Mining Various Kinds of Rules or Regularities
- Multi-level and quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
- Classification, clustering, iceberg cubes, etc.

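Multi-Level Association Rules
- Items often form hierarchies.
- Flexible support settings: items at lower levels are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels.
- Explore shared multi-level mining.
- Example: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 4%] at level 2.
  - Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%
  - Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%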

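Multi-Level/Multi-Dimensional (ML/MD) Association Rules with Flexible Support Constraints
- Why set flexible support?
  - Real-life occurrence frequencies of items vary greatly
    - for example diamonds, watches, and pens in a shopping basket.
  - Uniform support may not be an interesting model.
- A flexible model:
  - Lower-level items, higher-dimensional combinations, and long patterns usually have lower support.
  - General rules should be easy to specify and understand.
  - Special items and special groups of items may be given individual minimum support settings and higher priority.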

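Multidimensional Association Rules
- Single-dimensional rule: buys(X, "milk") => buys(X, "bread")
- Multidimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
- Categorical attributes: a finite number of possible values, with no ordering among the values
- Quantitative attributes: numeric, with an inherent ordering among the values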

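Multi-Level Association Rules: Redundancy Filtering
- Some rules may be redundant because of "ancestor" relationships between items.
- Example:
  - milk => wheat bread [support = 8%, confidence = 70%]
  - 2% milk => wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value based on the ancestor of the rule.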

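Multi-Level Association Rules: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine the high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent items: 2% milk (5%), wheat bread (4%)
- Different minimum support thresholds across levels lead to different algorithms:
  - If the same minimum support is used at every level, toss t if any ancestor of t is infrequent.
  - If a reduced minimum support is used at lower levels, examine only those descendants whose ancestors have frequent or non-negligible support.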

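Techniques for Mining Multidimensional Association Rules
- Search for frequent k-predicate sets:
  - Example: {age, occupation, buys} is a 3-predicate set.
  - Techniques can be categorized by how quantitative attributes such as age are treated.
- 1. Static discretization of quantitative attributes
  - Quantitative attributes are statically discretized using predefined concept hierarchies.
- 2. Quantitative association rules
  - Quantitative attributes are dynamically discretized into bins based on the distribution of the data.
- 3. Distance-based association rules
  - A dynamic discretization process that considers the distance between data points.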

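Static Discretization of Quantitative Attributes
- Discretize before mining, using concept hierarchies.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- Data cubes are well suited to mining.
- The cells of an n-dimensional cuboid correspond to predicate sets.
- Mining from data cubes can be much faster.
- Cuboids: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)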

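Quantitative Association Rules
- Numeric attributes are discretized dynamically
  - so that the confidence or compactness of the mined rules is maximized.
- 2-D quantitative association rules: $A_{quan1} \wedge A_{quan2} \Rightarrow A_{cat}$
- Example: age(X, "30-34") ∧ income(X, "24K-48K") => buys(X, "high resolution TV")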

Mining Distance-Based Association Rules
- Binning methods do not capture the semantics of interval data.
- Distance-based partitioning gives a more meaningful discretization, considering:
  - the density or number of points in an interval
  - the "closeness" of points in an interval

Interestingness Measure: Correlations (Lift)
- play basketball => eat cereal [40%, 66.7%] is misleading.
  - The overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
- Measure of dependent/correlated events: lift
                 Basketball  Not basketball  Sum (row)
    Cereal       2000        1750            3750
    Not cereal   1000        250             1250
    Sum (col.)   3000        2000            5000
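
The lift of both rules can be computed from the contingency table. The sketch assumes the usual definition lift(A => B) = P(A ∧ B) / (P(A) P(B)); the function name and argument order are illustrative.

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A => B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

n = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250

lift_cereal = lift(2000, basketball, cereal, n)          # basketball => cereal
lift_not_cereal = lift(1000, basketball, not_cereal, n)  # basketball => not cereal

print(f"lift(basketball => cereal)     = {lift_cereal:.2f}")
print(f"lift(basketball => not cereal) = {lift_not_cereal:.2f}")
```

A lift below 1 for the first rule indicates a negative correlation between playing basketball and eating cereal, while the second rule has a lift above 1, in line with the remarks on the slide.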


Correlation Rules
- "Beyond Market Baskets," Brin et al.
- Suppose we perform association rule mining:
           c    ¬c   row
    t      20   5    25
    ¬t     70   5    75
    col    90   10   100
- tea => coffee has 20% support and 80% confidence,
- but 90% of the people buy coffee anyway!

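Correlation Rules
- One measure is to compute the correlation.
- If two random variables A and B are statistically independent, then $P(A \wedge B) / (P(A)\,P(B)) = 1$.
- For tea and coffee: $P(t \wedge c) / (P(t)\,P(c)) = 0.2 / (0.25 \times 0.9) = 0.89$.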

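Correlation Rules
- Use the $\chi^2$ statistical test to test independence.
- Let n be the total number of baskets.
- Let k be the number of items considered.
- Let r be a rule consisting of items of the form $i_j$ or $\bar{i}_j$.
- Let O(r) be the number of baskets containing rule r (that is, its frequency).
- For a single item $i_j$, let $E[i_j] = O(i_j)$ (and, conversely, $E[\bar{i}_j] = n - E[i_j]$).
- $E[r] = n \cdot E[r_1]/n \cdot \ldots \cdot E[r_k]/n$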

Correlation Rules
- The $\chi^2$ statistic is defined as $\chi^2 = \sum_{r} (O(r) - E[r])^2 / E[r]$.
- Look up the significance value in a statistics textbook.
  - There are k-1 degrees of freedom.
  - If the test fails, independence cannot be rejected; otherwise the contingency table represents dependence.

Example
- Back to tea and coffee:
           c    ¬c   row
    t      20   5    25
    ¬t     70   5    75
    col    90   10   100
- $E[t] = 25$, $E[\bar{t}] = 75$, $E[c] = 90$, $E[\bar{c}] = 10$
- $E[tc] = 100 \cdot 25/100 \cdot 90/100 = 22.5$
- $O(tc) = 20$
- Contribution to $\chi^2$: $(20 - 22.5)^2 / 22.5 = 0.278$
- Calculating the other three cells in the same way gives $\chi^2 = 3.704$.
- Not significant at the 95% level (3.84 for k = 2).
- The independence assumption cannot be rejected.
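
The whole computation fits in a few lines of Python, which makes it easy to check every cell. The dictionary layout is an illustrative choice; summing the four contributions this way gives roughly 3.70, which stays below the 3.84 threshold.

```python
# Observed counts for the tea/coffee contingency table.
observed = {
    ("t", "c"): 20, ("t", "not c"): 5,
    ("not t", "c"): 70, ("not t", "not c"): 5,
}
n = sum(observed.values())

# Marginal totals play the role of E[t], E[not t], E[c], E[not c].
row_totals, col_totals = {}, {}
for (r, c), o in observed.items():
    row_totals[r] = row_totals.get(r, 0) + o
    col_totals[c] = col_totals.get(c, 0) + o

# Expected count of each cell under independence: n * E[r1]/n * E[r2]/n.
expected = {(r, c): row_totals[r] * col_totals[c] / n for (r, c) in observed}

contributions = {cell: (o - expected[cell]) ** 2 / expected[cell]
                 for cell, o in observed.items()}
chi2 = sum(contributions.values())

for cell in observed:
    print(cell, "expected", expected[cell], "contribution", round(contributions[cell], 3))
print("chi-square =", round(chi2, 3), "significant at 95%:", chi2 > 3.84)
```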

Interest
- If the $\chi^2$ test shows significance, we want to find the most interesting cell(s) in the table.
- $I(r) = O(r) / E[r]$
- Look for values far away from 1:
  - $I(tc) = 20/22.5 = 0.89$
  - $I(t\bar{c}) = 5/2.5 = 2$
  - $I(\bar{t}c) = 70/67.5 = 1.04$
  - $I(\bar{t}\bar{c}) = 5/7.5 = 0.67$

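Properties of the $\chi^2$ Statistic
- Upward closed
  - If a k-itemset is correlated, then all of its supersets are also correlated.
- Look for minimal correlated itemsets
  - those with no correlated subset.
- Can a-priori and the $\chi^2$ statistic be combined efficiently?
  - There is no generate-and-prune as in the support-confidence framework.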

Other Measures
- Lift
  - Measures the correlation between items, but without a statistical test.
    TID  Items
    1    I1, I2, I5
    2    I2, I4
    3    I2, I3, I6
    4    I1, I2, I4
    5    I1, I3
    6    I2, I3
    7    I1, I3
    8    I1, I2, I3, I5
    9    I1, I2, I3

Other Measures
- Conviction
  - Measures the strength of a rule.
  - Computed on the same nine-transaction table as lift.
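
Both measures can be computed on the nine-transaction table. The formulas are not spelled out on the slides, so this sketch assumes the commonly used definitions lift(A => B) = supp(A ∪ B) / (supp(A) supp(B)) and conviction(A => B) = (1 - supp(B)) / (1 - conf(A => B)).

```python
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def supp(items):
    """Fraction of transactions containing every item."""
    return sum(items <= t for t in db) / len(db)

def lift(a, b):
    return supp(a | b) / (supp(a) * supp(b))

def conviction(a, b):
    conf = supp(a | b) / supp(a)
    # A rule that never fails has infinite conviction.
    return float("inf") if conf == 1 else (1 - supp(b)) / (1 - conf)

a, b = {"I1"}, {"I2"}
print(f"lift(I1 => I2)       = {lift(a, b):.3f}")
print(f"conviction(I1 => I2) = {conviction(a, b):.3f}")
print(f"conviction(I5 => I1) = {conviction({'I5'}, {'I1'})}")
```

Unlike lift, conviction is not symmetric: it depends on which side of the rule is the consequent.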


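Improvements to Association Rules
- Lin et al. addressed the data skew problem in rule mining algorithms, which gives the algorithms better load balance. Park et al. proposed using a hash-table structure for association rule mining.
- Agrawal first proposed transaction reduction; Han and Park, among others, also worked on reducing the size of the data. Sampling was proposed by Toivonen.
- Brin et al. used dynamic itemset counting to find frequent itemsets. Aggarwal proposed finding frequent itemsets using graph theory and lattice theory. The Prutax algorithm finds frequent itemsets by lattice traversal.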

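Improvements to Association Rules
- The association rule model has many extensions, such as sequential pattern mining and mining over sequential time intervals. Others include mining spatial association rules, periodic association rules, negative association rules, and intra-transaction association rules.
- Guralnik proposed a formal description language for the sequential time-interval problem, so that users can describe the intervals they are interested in, and built an efficient data structure, the SP-tree (sequential pattern tree), together with a bottom-up mining algorithm. Maximal pattern mining was proposed by Bayardo et al.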

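Improvements to Association Rules
- Later work turned to frequent closed itemsets, for which Pei gave an efficient mining algorithm.
- The periodic association rules of B. Özden et al. target transaction databases with a time attribute and find rules that satisfy minimum support and confidence at regular time intervals.
- S. Ramaswamy et al. at Bell Labs developed periodic association rules further and proposed an algorithm for mining calendric association rules, for use in market basket analysis.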

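Improvements to Association Rules
- Hannu Toivonen et al. introduced the negative border into rule discovery: each mining run stores not only the frequent itemsets but also the negative border, so that the next run needs fewer scans. Srikant et al. studied the context of association rules and proposed a rule interest measure for removing redundant rules.
- Zaki used itemset clustering to find the maximal approximate potential frequent itemsets and then lattice traversal to generate the frequent itemsets in each cluster. CAR (class association rules) is a classification approach proposed by Liu et al.; it combines classification techniques with association rule ideas and comes with a solution and algorithms.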

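Improvements to Association Rules
- Cheung et al. proposed incremental algorithms for association rules. Thomas et al. brought the negative border concept into them and developed incremental algorithms further.
- There are also parallel and distributed mining algorithms, for example those based on the Apriori framework.
- Oates et al. turned the MSDD algorithm into a distributed algorithm. Other parallel algorithms exist as well, such as those that use a vertical database layout to explore itemset clustering.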

ARCS (Association Rules Clustering System)
ARCS stages:
  1. Binning
  2. Frequent itemsets
  3. Clustering
  4. Optimization

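Features of ARCS
- ARCS successfully applies the concept of clustering to classification.
  - But it is limited to classification based on 2-D rules of the form A ∧ B => Class_i.
  - It discretizes attribute values by binning, so the accuracy of ARCS depends strongly on the degree of discretization used.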

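Classification Based on Association Rules (CBA)
- Classification rule mining versus association rule mining
  - Goal
    - Classification: a small set of rules to serve as a classifier
    - Association: all rules that satisfy minimum support and minimum confidence
  - Syntax
    - Classification: X => y
    - Association: X => Y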

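Why and How to Integrate
- Both classification rule mining and association rule mining are indispensable in practical applications.
- The integration focuses on a special subset of association rules whose right-hand side is restricted to the classification class attribute.
  - CARs: class association rules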

CBA: Three Steps
- Discretize continuous values, if any.
- Generate all class association rules (CARs).
- Build a classifier from the generated CARs.

The CAR Set
- Generate the complete set of CARs that satisfy the user-specified minimum support and minimum confidence constraints.
- Build a classifier from the CARs.

Rule Generation: Basic Concepts
- Ruleitem <condset, y>: condset is a set of items and y is a class label. Each ruleitem represents a rule: condset -> y.
- Support count of the condset (condsupCount): the number of cases in D that contain condset.
- Support count of the rule (rulesupCount): the number of cases in D that contain condset and are labeled with class y.
- Support = (rulesupCount / |D|) * 100%
- Confidence = (rulesupCount / condsupCount) * 100%

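Rule Generation: Basic Concepts (Cont.)
- Frequent ruleitems
  - A ruleitem is frequent if its support is not less than the minimum support minsup.
- Accurate rule
  - A rule is accurate if its confidence is not less than the minimum confidence minconf.
- Possible rule
  - Among all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of that set of ruleitems.
- The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.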

Rule Generation: An Example
- A ruleitem: <{(A, 1), (B, 1)}, (class, 1)>
- Assume
  - the support count of the condset (condsupCount) is 3,
  - the support count of the ruleitem (rulesupCount) is 2,
  - |D| = 10.
- Then for (A, 1), (B, 1) -> (class, 1):
  - supt = 20%, that is (rulesupCount / |D|) * 100%
  - confd = 66.7%, that is (rulesupCount / condsupCount) * 100%
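
The two counts and the derived percentages can be reproduced on a small data set. The ten cases below are hypothetical, constructed only so that condsupCount = 3 and rulesupCount = 2; the function name `ruleitem_stats` is likewise an illustrative choice.

```python
def ruleitem_stats(cases, condset, label):
    """Return (condsupCount, rulesupCount, support %, confidence %) for condset -> label."""
    # Classes of the cases whose attributes include every (attribute, value) of condset.
    matching = [cls for attrs, cls in cases if condset.items() <= attrs.items()]
    condsup = len(matching)
    rulesup = sum(cls == label for cls in matching)
    support = 100 * rulesup / len(cases)
    confidence = 100 * rulesup / condsup
    return condsup, rulesup, support, confidence

# Hypothetical data set D with |D| = 10: (attribute values, class label).
cases = (
    [({"A": 1, "B": 1}, 1)] * 2      # contain the condset, class 1
    + [({"A": 1, "B": 1}, 0)]        # contains the condset, other class
    + [({"A": 0, "B": 1}, 1)] * 3
    + [({"A": 1, "B": 0}, 0)] * 4
)
condsup, rulesup, support, confidence = ruleitem_stats(cases, {"A": 1, "B": 1}, 1)
print(condsup, rulesup, f"supt = {support:.0f}%", f"confd = {confidence:.1f}%")
```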

RG: The Algorithm
    F1 = {large 1-ruleitems};
    CAR1 = genRules(F1);
    prCAR1 = pruneRules(CAR1);   // count the item and class occurrences to determine the frequent 1-ruleitems and prune them
    for (k = 2; Fk-1 ≠ Ø; k++) do
        Ck = candidateGen(Fk-1);   // generate the candidate ruleitems Ck using the frequent ruleitems Fk-1
        for each data case d ∈ D do   // scan the database
            Cd = ruleSubset(Ck, d);   // find all the ruleitems in Ck whose condsets are supported by d
            for each candidate c ∈ Cd do
                c.condsupCount++;
                if d.class = c.class then c.rulesupCount++;   // update the support counts of the candidates in Ck
            end
        end
        Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};   // select the new frequent ruleitems to form Fk
        CARk = genRules(Fk);   // select the ruleitems that are both accurate and frequent
        prCARk = pruneRules(CARk);
    end
    CARs = ∪k CARk;
    prCARs = ∪k prCARk;

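Classifier Builder M1: Basic Concepts
- Given two rules ri and rj, define ri ≻ rj (ri has higher precedence than rj) if
  - the confidence of ri is greater than that of rj, or
  - their confidences are the same, but the support of ri is greater than that of rj, or
  - their confidences and supports are both the same, but ri was generated earlier than rj.
- Our classifier has the following format:
  - <r1, r2, …, rn, default_class>,
  - where ri ∈ R, and ra ≻ rb if b > a.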

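M1: Three Steps
The basic idea is to choose a set of high-precedence rules in R to cover D.
- Sort the set of generated rules R.
- Select rules for the classifier from R, following the sorted sequence, and put them in C.
  - Each selected rule must correctly classify at least one additional case.
  - Also select the default class and compute the errors.
- Discard the rules in C that do not improve the accuracy of the classifier.
  - Keep the position of the rule with the lowest error and discard the rest of the rules in the sequence.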

M1: Algorithm
    R = sort(R);   // Step 1: sort R according to the relation "≻"
    for each rule r ∈ R in sequence do
        temp = Ø;
        for each case d ∈ D do   // go through D to find the cases covered by each rule r
            if d satisfies the conditions of r then
                store d.id in temp and mark r if it correctly classifies d;
        if r is marked then
            insert r at the end of C;   // r will be a potential rule because it can correctly classify at least one case d
            delete all the cases with the ids in temp from D;
            select a default class for the current C;   // the majority class in the remaining data
            compute the total number of errors of C;
        end
    end   // Step 2
    Find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;
    Add the default class associated with p to the end of C, and return C (our classifier).   // Step 3

M1: Two Conditions It Satisfies
- Each training case is covered by the rule with the highest precedence among the rules that can cover the case.
- Every rule in C correctly classifies at least one remaining training case when it is chosen.

MOUCLAS
- Assumption: the MOUCLAS algorithm assumes that the initial association rules can be agglomerated into clustering regions.
- A Mouclas Pattern (MP) is an implication of the form Cluster(D)t => y, where Cluster(D)t is a cluster of D, t = 1 to m, and y is a class label.

MOUCLAS
- Definitions:
  - Frequency of Mouclas Patterns
  - Accuracy of Mouclas Patterns
  - Reliability of Mouclas Patterns
- Task of Mouclas: to discover the MPs whose support, confidence, and reliability are greater than the user-specified minimum support threshold (minsup), minimum confidence threshold (minconf), and minimum reliability threshold (minR) respectively, and to construct a classifier based upon the MPs.

The MOUCLAS algorithm
Two steps:
- Step 1. Discover the frequent, accurate, and reliable MPs.
- Step 2. Construct a classifier, called De-MP, based on the MPs.

The MOUCLAS algorithm
The core of the first step of the Mouclas algorithm is to find all cluster_rules that satisfy minsup, minconf, and minR. Let C denote the dataset D after dimensionality reduction. A cluster_rule represents an MP, namely a rule cluset → y, where cluset is a set of itemsets from a cluster Cluster(C)_t and y is a class label, y ∈ Y.

The MOUCLAS algorithm
Algorithm of the first step: Mouclas mining of frequent, accurate, and reliable Mouclas patterns (MPs)
- Input: a training transaction database D; minimum support threshold (minsup); minimum confidence threshold (minconf); minimum reliability threshold (minR)
- Output: a set of frequent, accurate, and reliable Mouclas patterns (MPs)

The MOUCLAS algorithm
Methods:
(1) Reduce the dimensionality of the transactions d, which reduces the data size by removing irrelevant or redundant attributes (dimensions) from the training data.
(2) Identify the clusters of database C for all transactions d after dimensionality reduction on the attributes A_j in C, based on the Mountain function, a fuzzy set membership function that is especially capable of transforming quantitative attribute values in transactions into linguistic terms.
(3) Generate the set of MPs that are frequent, accurate, and reliable, that is, those that satisfy the user-specified minimum support (minsup), minimum confidence (minconf), and minimum reliability (minR) constraints.

The MOUCLAS algorithm
X = reduceDim(I);  // reduce the dimensionality of the set of all items I in D
Cluster(C)_t = genCluster(C);  // identify the complete clusters of C
for each Cluster(C)_t do
    E = genClusterrules(cluset, class);  // generate a set of candidate cluster_rules
    for each transaction d ∈ C do
        E_d = genSubClusterrules(E, d);  // find all the cluster_rules in E whose cluset are supported by d
        for each e ∈ E_d do
            e.clusupCount++;  // accumulate the clusupCount of the cluset of cluster_rule e
            if d.class = e.class then e.cisupCount++  // accumulate the cisupCount of cluster_rule e supported by d
        end
    end
    F = {e ∈ E | e.cisupCount ≥ minsup};  // construct the set of frequent cluster_rules
    MP = genRules(F);  // generate MP using the genRules function by minconf and minR
end
MPs = ∪ MP;  // discover the final set of MPs
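The counting part of this procedure (clusupCount, cisupCount, and the frequency and accuracy filters) can be sketched in Python. This is only an illustration under stated assumptions: dimensionality reduction and Mountain-function clustering are taken as already done, so each transaction arrives as a hypothetical `(cluster_id, class_label)` pair; minsup is treated as a fraction of all transactions; confidence is taken to be cisupCount / clusupCount. The reliability filter (minR) is left out because its definition is not given here.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Assignment = Tuple[Hashable, str]          # (cluster_id, class_label), hypothetical format
ClusterRule = Tuple[Hashable, str, float, float]  # (cluster_id, label, support, confidence)


def mine_cluster_rules(
    assignments: List[Assignment], minsup: float, minconf: float
) -> List[ClusterRule]:
    n = len(assignments)
    # clusupCount: number of transactions supporting each cluset (cluster).
    clusup = Counter(cluster for cluster, _ in assignments)
    # cisupCount: number of transactions supporting each (cluset, class) pair.
    cisup = Counter(assignments)

    rules: List[ClusterRule] = []
    for (cluster, label), count in cisup.items():
        support = count / n                   # assumed: fraction of all transactions
        confidence = count / clusup[cluster]  # assumed: cisupCount / clusupCount
        if support >= minsup and confidence >= minconf:
            rules.append((cluster, label, support, confidence))
    return rules
```

Calling `mine_cluster_rules(pairs, 0.3, 0.6)` on a list of such pairs returns only the cluster and class combinations that are both frequent and accurate under these assumed definitions.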

The MOUCLAS algorithm
The task of the second step of the Mouclas algorithm: use a heuristic method to generate a classifier, named De-MP, in which the discovered MPs cover D and are organized in decreasing order of precedence based on their confidence and support.
Let R be the set of frequent, accurate, and reliable MPs generated in the previous step, and let MPdefault_class denote the default class, which has the lowest precedence. The De-MP classifier can then be written as <MP_1, MP_2, …, MP_n, MPdefault_class>, where MP_i ∈ R, i = 1 to n, MP_a ≻ MP_b if n ≥ b > a ≥ 1 and a, b ∈ i, and C ⊆ ∪ cluset of MP_i.

The MOUCLAS algorithm
Algorithm: Mouclas constructing the De-MP classifier
- Input: a training database after dimensionality reduction, C; the set of frequent, accurate, and reliable Mouclas patterns (MPs)
- Output: the De-MP classifier
- Methods:
(1) Determine the order of all discovered MPs based on the definition of precedence and sequence them in decreasing order of precedence.
(2) Select candidate MPs for the De-MP classifier from R, following the descending sequence of MPs.
(3) Discard the MPs that cannot contribute to improving the accuracy of the De-MP classifier and keep the final set of MPs to construct the De-MP classifier.

The MOUCLAS algorithm
R = sort(R);  // sort MPs based on their precedence
for each MP ∈ R in sequence do
    temp = Ø;
    for each transaction d ∈ C do
        if d satisfies the cluset of MP then
            store d.ID in temp;
    if MP correctly classifies d then
        insert MP at the end of L;
        delete the transactions whose IDs are in temp from C;
        select a default class for the current L;  // determine the default class based on the majority class of the remaining transactions in C
    end
    compute the total number of errors of L;  // compute the total number of errors made by the current L and the default class
end
Find the first MP in L with the lowest total number of errors and discard all the MPs after it in L;
Add the default class associated with that MP to the end of L;
De-MP classifier = L
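Once the ordered list <MP_1, …, MP_n, MPdefault_class> has been built, one plausible way to apply it to a new transaction is to return the class of the first MP whose cluset the transaction satisfies, and fall back to the default class otherwise. The Python sketch below assumes this first-match reading. The `in_cluset` membership predicates and the tuple representation of an MP are hypothetical stand-ins for the Mountain-function clusters.

```python
from typing import Callable, List, Sequence, Tuple

Transaction = Sequence[float]
# Hypothetical representation of an MP: (cluset membership test, class label).
MP = Tuple[Callable[[Transaction], bool], str]


def classify(de_mp: List[MP], default_class: str, d: Transaction) -> str:
    # MPs are assumed to be stored in decreasing precedence order,
    # so the first match is the highest-precedence MP covering d.
    for in_cluset, label in de_mp:
        if in_cluset(d):
            return label
    # No MP covers d: use the default class, which has the lowest precedence.
    return default_class
```

Keeping the default class separate from the MP list makes the fallback explicit and guarantees that every transaction receives a label.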

Example of MOUCLAS application
The well logging data sets include the attributes (well logging curves) GR (gamma ray), RDEV (deep resistivity), RMEV (shallow resistivity), RXO (flushed zone resistivity), RHOB (bulk density), NPHI (neutron porosity), PEF (photoelectric factor), and DT (sonic travel time). A hypothetically useful MP may suggest a relation between the well logging data and the class label of the oil/gas formation.

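Summary
Association rule mining is a fundamental task in data mining, and several algorithms address it:
- Apriori: uses a provable mathematical property to improve performance.
- FP-Growth: no longer generates candidate itemsets and relies on an efficient data structure.
- Correlation rules: evaluate interestingness on a statistical basis.
- Constraint-based association rule mining.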

References
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM-SIGMOD 2000, pp. 1-12, Dallas, TX, May 2000.

References
- S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
- J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based, Multidimensional Data Mining. COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.
- Craig A. Struble. Association Rule Mining. Slides, Marquette University.
- Yalei Hao, Markus Stumptner, Gerald Quirchmayr, and Qing He. Data Mining by MOUCLAS Algorithm for Petroleum Reservoir Characterization from Well Logging Data. AIAI 2004.

Questions?
www.intsci.ac.cn/shizz/
