7f4c32de3b7419e3827159c8beff871f.ppt

- Количество слайдов: 19

Predictive Analytics in SQL and Datalog UCLA CS 240 A Course Notes* 1

Rollups, Data Cubes: Descriptive Analytics In the 90 s DBMS were extended with Descriptive Analytics—a major technical and commercial success • From GROUP BY of SQL-2 aggregates to GROUP BY ROLLUP/DATACUBE: an obvious step at the language level. • The sort-based support of traditional aggregates extended to supergroups. Some technical challenges, but the general DBMS framework of big data residing in secondary store was not challenged. While the need to go beyond that and support Discovery analytics was immediately realized no satisfactory solution was found. 2

Memorable Attempts z Apriori in DB 2 Sarawagi et al. [SIGMOD 1998]: only cache-mining works at the required speed ! z Imielinski and Mannila [CACM 1996]: Define Declarative Language constructs for Data Mining. The rest will follow, as future research will invent query optimization techniques that will make these constructs very efficient. z Inductive Databases: a research field that explored this idea for a few years and … then gave up. z But recently things started changing 3

MBA: applicable to many other contexts Telecommunication: Each customer is a transaction containing the set of customer’s phone calls Atmospheric phenomena: Each time interval (e. g. a day) is a transaction containing the set of observed event (rains, wind, etc. ) Etc. 4

Association Rules z Express how product/services relate to each other, and tend to group together z “if a customer purchases three-way calling, then will also purchase call-waiting” z simple to understand z actionable information: bundle three-way calling and call-waiting in a single package 5

Useful, trivial, unexplicable z Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together”. z Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances”. z Unexplicable: “When a new hardaware store opens, one of the most sold items is toilet rings. ” 6

Basic Concepts Transaction: Relational format Compact format

Frequent Itemsets Support({dairy}) = 3 (75%) Support({fruit}) = 3 (75%) Support({dairy, fruit}) = 2 (50%) If = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not. 8

Association Rules: Measures +Let A and B be a partition of I : A B [s, c] A and B are itemsets s = support of A B = support(A B) c = confidence of A B = support(A B)/support(A) + Measure for rules: + minimum support + minimum confidence +The rules holds if : s and c 9

Association Rules: Meaning A B [ s, c ] Support: denotes the frequency of the rule within transactions. A high value means that the rule involve a great part of database. support(A B [ s, c ]) = p(A B) Confidence: denotes the percentage of transactions containing A which contain also B. It is an estimation of conditioned probability. confidence(A B [ s, c ]) = p(B|A) = p(A & B)/p(A). 10

Association Rules - Example Min. support 50% Min. confidence 50% For rule A C: support = support({A, C}) = 50% confidence = support({A, C})/support({A}) = 66. 6% The Apriori principle: Any subset of a frequent itemset must be frequent 11

Problem Statement z The database consists of a set of transactions. z Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket). z An association rule is an implication of the form X Y , which says that customers who buy item X are also likely to buy item Y. In practice we are only interested in relationships between high volume items (aka frequent items) z Confidence: X Y holds with confidence C% if C% of transactions that contain X also contain Y. z Support: X Y has support S% if S% of transactions contain X Y. Observe that the support level for X is to that for X and that their inverse ratio is the confidence of X Y: Y confidence(X Y) = support(X Y)/support(X) 12

Algorithms: Apriori z A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994) Data base D TID Items 10 20 a, b, a, e b, 1 -candidates 30 40 c, d c, e b, c, Scan D e Min_sup=2 3 -candidates Scan D Itemset bce Freq 3 -itemsets Itemset bce Sup 2 Itemset a b c d e Freq 1 -itemsets Sup 2 3 3 1 3 Itemset a b c Sup 2 3 3 e 3 Freq 2 -itemsets Itemset ac bc be ce Sup 2 2 3 2 2 -candidates Counting Itemset ab ac ae bc be ce Sup 1 2 3 2 Itemset ab ac ae bc be ce Scan D 13

Performance Challenges of Frequent Sets (aka Frequent Pattern) Mining z Challenges y Data structures: Hash tables and Prefix trees y Multiple scans of transaction database y Huge number of candidates y Tedious workload of support counting for candidates z Improving Apriori: Many algorithms proposed. General ideas y Reduce number of transaction database scans y Shrink number of candidates y Facilitate support counting of candidates y FP without candidate generation [Han, Pei, Yin 2000]. 14

Apriori Summary z Scanning the database and counting occurrences z Pruning the itemsets below the minimum support level: [Particularly after the first step, we might want to prune the database D as well] z Combining frequent sets of size n into candidate larger sets of size n + 1 [or even larger]. Monotonicity Condition: The support level of a set is always smaller than that of every subset 15

Apriori in DB 2 z S. Sarawagi, S. Thomas, R. Agrawal: "Integrating Association Rule Mining with Databases: Alternatives and Implications", Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000. z Very difficult to integrate Apriori into DBMS: the only approach that works is cache-mining. And similar conclusions apply to other KDD algorithms z Lackluster commercial success of DBMS vendors withpredictive analytics– in stark contrast with their success in descriptive analytics, i. e. rollups and data cubes. 16

Extracting the Rules z For rule A C: z Support for rule: support for set of items = support({A, C})=50% z Confidence: support for the rule over support for its left side= support({A, C})/support({A})=66. 6% 17

Rule Implications z Lemma: If X Y Z, then XY Z, and XZ Y z This properties can be used to limit the number of rules tested. z Example: For frequent itemset ABC z If AB C does not hold, then neither do A BC or B AC. Then we can test BC A and if this hold we need to test C AB 18

Apriori in Datalog 19