Databases 2 Storage optimization functional dependencies normal forms

Databases 2 Storage optimization: functional dependencies, normal forms, data ware houses

Functional Dependencies • X -> A is an assertion about a relation R that whenever two tuples of R agree on all the attributes of X, then they must also agree on the attribute A. – Say “X -> A holds in R. ” – Notice convention: …, X, Y, Z represent sets of attributes; A, B, C, … represent single attributes. – Convention: no set formers in sets of attributes, just ABC, rather than {A, B, C }. 2

Example • • Drinkers(name, addr, beers. Liked, manf, fav. Beer). Reasonable FD’s to assert: 1. name -> addr (the name determine the address) 2. name -> fav. Beer (the name determine the favourite beer) 3. beers. Liked -> manf (every beer has only one manufacturer) 3

Example name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway Voyager Wicked. Ale Pete’s Wicked. Ale Spock Enterprise Bud A. B. Bud name -> addr Beers. Liked -> manf name -> Fav. Beer 4

FD’s With Multiple Attributes • No need for FD’s with more than one attribute on right. – But sometimes convenient to combine FD’s as a shorthand. – Example: name -> addr and name -> fav. Beer become name -> addr fav. Beer • More than one attribute on left may be essential. – Example: bar beer -> price 5

Keys of Relations • K is a key for relation R if: 1. 2. Set K functionally determines all attributes of R For no proper subset of K is (1) true. w If K satisfies (1), but perhaps not (2), then K is a superkey. w Consequence: there are no two tuples having the same value in every attribute of the key. w Note E/R keys have no requirement for minimality, as in (2) for relational keys. 6

Example • Consider relation Drinkers(name, addr, beers. Liked, manf, fav. Beer). • {name, beers. Liked} is a superkey because together these attributes determine all the other attributes. – name -> addr fav. Beer – beers. Liked -> manf 7

Example name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway Voyager Wicked. Ale Pete’s Wicked. Ale Spock Enterprise Bud A. B. Bud Every pair is different => rest of the attributes are determined 8

Example, Cont. • {name, beers. Liked} is a key because neither {name} nor {beers. Liked} is a superkey. – name doesn’t -> manf; beers. Liked doesn’t -> addr. • In this example, there are no other keys, but lots of superkeys. – Any superset of {name, beers. Liked}. 9

Example name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway Voyager Wicked. Ale Pete’s Wicked. Ale Spock Enterprise Bud A. B. Bud name doesn’t -> manf Beers. Liked doesn’t -> addr 10

E/R and Relational Keys • Keys in E/R are properties of entities • Keys in relations are properties of tuples. • Usually, one tuple corresponds to one entity, so the ideas are the same. • But --- in poor relational designs, one entity can become several tuples, so E/R keys and Relational keys are different. 11

Example name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway Voyager Wicked. Ale Pete’s Wicked. Ale Spock Enterprise Bud A. B. Bud name, Beersliked relational key Beers beer manf Bud A. B. Wicked. Ale Pete’s Drinkers In E/R name is a key for Drinkers, and beers. Liked is a key for Beers name addr Janeway Voyager Spock Enterprise 12

Where Do Keys Come From? 1. We could simply assert a key K. Then the only FD’s are K -> A for all atributes A, and K turns out to be the only key obtainable from the FD’s. 2. We could assert FD’s and deduce the keys by systematic exploration. u E/R gives us FD’s from entity-set keys and manyone relationships. 13

Armstrong’s axioms • Let X, Y, Z R (i. e. X, Y, Z are attribute sets of R) Armstrong’s axioms: • Reflexivity: if Y X then X -> Y • Augmentation: if X -> Y then for every Z: XZ -> YZ • Transitivity: if X -> Y and Y -> Z then X -> Z 14

FD’s From “Physics” • While most FD’s come from E/R keyness and many-one relationships, some are really physical laws. • Example: “no two courses can meet in the same room at the same time” tells us: hour room -> course. 15

Inferring FD’s: Motivation • In order to design relation schemas well, we often need to tell what FD’s hold in a relation. • We are given FD’s X 1 -> A 1, X 2 -> A 2, …, Xn -> An , and we want to know whether an FD Y -> B must hold in any relation that satisfies the given FD’s. – Example: If A -> B and B -> C hold, surely A -> C holds, even if we don’t say so. 16

Inference Test • To test if Y -> B, start assuming two tuples agree in all attributes of Y. • Use the given FD’s to infer that these tuples must also agree in certain other attributes. • If B is eventually found to be one of these attributes, then Y -> B is true; otherwise, the two tuples, with any forced equalities form a two-tuple relation that proves Y -> B does not follow from the given FD’s. 17

Closure Test • An easier way to test is to compute the closure of Y, denoted Y +. • Basis: Y + = Y. • Induction: Look for an FD’s left side X that is a subset of the current Y +. If the FD is X -> A, add A to Y +. 18

X Y+ A new Y+ 19

Finding All Implied FD’s • Motivation: “normalization, ” the process where we break a relation schema into two or more schemas. • Example: ABCD with FD’s AB ->C, C ->D, and D ->A. – Decompose into ABC, AD. What FD’s hold in ABC ? – Not only AB ->C, but also C ->A ! 20

Basic Idea • To know what FD’s hold in a projection, we start with given FD’s and find all FD’s that follow from given ones. • Then, restrict to those FD’s that involve only attributes of the projected schema. 21

Simple, Exponential Algorithm 1. For each set of attributes X, compute X +. 2. Add X ->A for all A in X + - X. 3. However, drop XY ->A whenever we discover X ->A. u Because XY ->A follows from X ->A. 4. Finally, use only FD’s involving projected attributes. 22

A Few Tricks • Never need to compute the closure of the empty set or of the set of all attributes: – ∅ += ∅ – R + =R • If we find X + = all attributes, don’t bother computing the closure of any supersets of X: – X + = R and X Y => Y + = R 23

Example • ABC with FD’s A ->B and B ->C. Project onto AC. – A +=ABC ; yields A ->B, A ->C. • We do not need to compute AB + or AC +. – B +=BC ; yields B ->C. – C +=C ; yields nothing. – BC +=BC ; yields nothing. 24

Example, Continued • Resulting FD’s: A ->B, A ->C, and • Projection onto AC : A ->C. B ->C. – Only FD that involves a subset of {A, C }. 25

A Geometric View of FD’s • Imagine the set of all instances of a particular relation. • That is, all finite sets of tuples that have the proper number of components. • Each instance is a point in this space. 26

Example: R(A, B) {(1, 2), (3, 4)} {} {(5, 1)} {(1, 2), (3, 4), (1, 3)} 27

An FD is a Subset of Instances • • • For each FD X -> A there is a subset of all instances that satisfy the FD. We can represent an FD by a region in the space. Trivial FD : an FD that is represented by the entire space. – Example: A -> A. 28

Example: A -> B for R(A, B) {(1, 2), (3, 4)} A -> B {} {(5, 1)} {(1, 2), (3, 4), (1, 3)} 29

Representing Sets of FD’s • If each FD is a set of relation instances, then a collection of FD’s corresponds to the intersection of those sets. – Intersection = all instances that satisfy all of the FD’s. 30

Example Instances satisfying A->B, B->C, and CD->A A->B B->C CD->A 31

Implication of FD’s • If an FD Y -> B follows from FD’s X 1 -> A 1, …, Xn -> An , then the region in the space of instances for Y -> B must include the intersection of the regions for the FD’s Xi -> Ai. – That is, every instance satisfying all the FD’s Xi > Ai surely satisfies Y -> B. – But an instance could satisfy Y -> B, yet not be in this intersection. 32

Example A->B A->C B->C 33

Normalization: Anomalies • Goal of relational schema design is to avoid anomalies and redundancy. – Update anomaly : one occurrence of a fact is changed, but not all occurrences. – Deletion anomaly : valid fact is lost when a tuple is deleted. 34

Example of Bad Design Drinkers(name, addr, beers. Liked, manf, fav. Beer) name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway ? ? ? Wicked. Ale Pete’s ? ? ? Spock Enterprise Bud ? ? ? Bud Data is redundant, because each of the ? ? ? ’s can be figured out by using the FD’s name -> addr fav. Beer and beers. Liked -> manf. 35

This Bad Design Also Exhibits Anomalies name addr Beers. Liked manf Fav. Beer Janeway Voyager Bud A. B. Wicked. Ale Janeway Voyager Wicked. Ale Pete’s Wicked. Ale Spock Enterprise Bud A. B. Bud • Update anomaly: if Janeway is transferred to Intrepid, will we remember to change each of her tuples? • Deletion anomaly: If nobody likes Bud, we lose track of the fact that Anheuser-Busch manufactures Bud. 36

Boyce-Codd Normal Form • We say a relation R is in BCNF : if whenever X ->A is a nontrivial FD that holds in R, X is a superkey. – Remember: nontrivial means A is not a member of set X. – Remember, a superkey is any superset of a key (not necessarily a proper superset). 37

Example • Drinkers(name, addr, beers. Liked, manf, fav. Beer) • FD’s: name->addr fav. Beer, beers. Liked->manf • Only key is {name, beers. Liked}. • In each FD, the left side is not a superkey. • Any one of these FD’s shows Drinkers is not in BCNF 38

Another Example • • Beers(name, manf. Addr) FD’s: name->manf, manf->manf. Addr Only key is {name}. name->manf does not violate BCNF, but manf ->manf. Addr does. 39

Decomposition into BCNF • Given: relation R with FD’s F. • Look among the given FD’s for a BCNF violation X ->B. – If any FD following from F violates BCNF, then there will surely be an FD in F itself that violates BCNF. • Compute X +. – Not all attributes, or else X is a superkey. 40

Decompose R Using X -> B • Replace R by relations with schemas: 1. 2. R 1 = X +. R 2 = (R – X +) U X. w Project given FD’s F onto the two new relations. 1. Compute the closure of F = all nontrivial FD’s that follow from F. 2. Use only those FD’s whose attributes are all in R 1 or all in R 2. 41

Decomposition Picture R 1 R-X + X X +-X R 2 R 42

Example • • • Drinkers(name, addr, beers. Liked, manf, fav. Beer) F = name->addr, name -> fav. Beer, beers. Liked->manf Pick BCNF violation name->addr. Close the left side: {name}+ = {name, addr, fav. Beer}. Decomposed relations: 1. Drinkers 1(name, addr, fav. Beer) 2. Drinkers 2(name, beers. Liked, manf) 43

Example, Continued • We are not done; we need to check Drinkers 1 and Drinkers 2 for BCNF. • Projecting FD’s is complex in general, easy here. • For Drinkers 1(name, addr, fav. Beer), relevant FD’s are name->addr and name->fav. Beer. – Thus, name is the only key and Drinkers 1 is in BCNF. 44

Example, Continued • For Drinkers 2(name, beers. Liked, manf), the only FD is beers. Liked->manf, and the only key is {name, beers. Liked}. – Violation of BCNF. • beers. Liked+ = {beers. Liked, manf}, so we decompose Drinkers 2 into: 1. Drinkers 3(beers. Liked, manf) 2. Drinkers 4(name, beers. Liked) 45

Example, Concluded • The resulting decomposition of Drinkers : 1. Drinkers 1(name, addr, fav. Beer) 2. Drinkers 3(beers. Liked, manf) 3. Drinkers 4(name, beers. Liked) w Notice: Drinkers 1 tells us about drinkers, Drinkers 3 tells us about beers, and Drinkers 4 tells us the relationship between drinkers and the beers they like. 46

Third Normal Form - Motivation • There is one structure of FD’s that causes trouble when we decompose. • AB ->C and C ->B. – Example: A = street address, B = city, code. C = zip • There are two keys, {A, B } and {A, C }. • C ->B is a BCNF violation, so we must decompose into AC, BC. 47

We Cannot Enforce FD’s • The problem is that if we use AC and BC as our database schema, we cannot enforce the FD AB ->C by checking FD’s in these decomposed relations. • Example with A = street, B = city, and C = zip on the next slide. 48

An Unenforceable FD street zip 545 Tech Sq. city 02138 02139 zip Cambridge 02138 02139 Join tuples with equal zip codes. street 545 Tech Sq. city Cambridge zip 02138 02139 Although no FD’s were violated in the decomposed relations, FD street city -> zip is violated by the database as a whole. 49

3 NF Let’s Us Avoid This Problem • 3 rd Normal Form (3 NF) modifies the BCNF condition so we do not have to decompose in this problem situation. • An attribute is prime if it is a member of any key. • X ->A violates 3 NF if and only if X is not a superkey, and also A is not prime. 50

Example • In our problem situation with FD’s AB ->C and C ->B, we have keys AB and AC. • Thus A, B, and C are each prime. • Although C ->B violates BCNF, it does not violate 3 NF. 51

What 3 NF and BCNF Give You • There are two important properties of a decomposition: 1. Recovery : it should be possible to project the original relations onto the decomposed schema, and then reconstruct the original. 2. Dependency preservation : it should be possible to check in the projected relations whether all the given FD’s are satisfied. 52

3 NF and BCNF, Continued • We can get (1) with a BCNF decompsition. – Explanation needs to wait for relational algebra. • We can get both (1) and (2) with a 3 NF decomposition. • But we can’t always get (1) and (2) with a BCNF decomposition. – street-city-zip is an example. 53

A New Form of Redundancy • Multivalued dependencies (MVD’s) express a condition among tuples of a relation that exists when the relation is trying to represent more than one many-many relationship. • Then certain attributes become independent of one another, and their values must appear in all combinations. 54

Example Drinkers(name, addr, phones, beers. Liked) • A drinker’s phones are independent of the beers they like. • Thus, each of a drinker’s phones appears with each of the beers they like in all combinations. • This repetition is unlike redundancy due to FD’s, of which name->addr is the only one. 55

Tuples Implied by Independence If we have tuples: name addr phones beers. Liked Sue a p 1 b 1 Sue a p 2 b 2 Then these tuples must also be in the relation. Sue a p 1 b 2 Sue a p 2 b 1 56

Definition of MVD • A multivalued dependency (MVD) X ->->Y is an assertion that if two tuples of a relation agree on all the attributes of X, then their components in the set of attributes Y may be swapped, and the result will be two tuples that are also in the relation. 57

Example • The name-addr-phones-beers. Liked example illustrated the MVD name->->phones and the MVD name ->-> beers. Liked. 58

Picture of MVD X ->->Y X Y others equal exchange 59

MVD Rules • Every FD is an MVD. – If X ->Y is a FD, then swapping Y ’s between two tuples that agree on X doesn’t change the tuples. – Therefore, the “new” tuples are surely in the relation, and we know X ->->Y. • Complementation : If X ->->Y, and Z is all the other attributes, then X ->->Z. 60

Splitting Doesn’t Hold • Like FD’s, we cannot generally split the left side of an MVD. • But unlike FD’s, we cannot split the right side either --- sometimes you have to leave several attributes on the right side. 61

Example • Consider a drinkers relation: Drinkers(name, area. Code, phone, beers. Liked, manf) • A drinker can have several phones, with the number divided between area. Code and phone (last 7 digits). • A drinker can like several beers, each with its own manufacturer. 62

Example, Continued • Since the area. Code-phone combinations for a drinker are independent of the beers. Likedmanf combinations, we expect that the following MVD’s hold: name ->-> area. Code phone name ->-> beers. Liked manf 63

Example Data Here is possible data satisfying these MVD’s: name area. Code phone beers. Liked manf Sue Sue 650 415 555 -1111 555 -9999 Bud Wicked. Ale A. B. Pete’s But we cannot swap area codes or phones my themselves. That is, neither name ->-> area. Code nor name ->-> phone holds for this relation. 64

Fourth Normal Form • The redundancy that comes from MVD’s is not removable by putting the database schema in BCNF. • There is a stronger normal form, called 4 NF, that (intuitively) treats MVD’s as FD’s when it comes to decomposition, but not when determining keys of the relation. 65

4 NF Definition • A relation R is in 4 NF if whenever X ->->Y is a nontrivial MVD, then X is a superkey. – “Nontrivial means that: 1. Y is not a subset of X, and 2. X and Y are not, together, all the attributes. – Note that the definition of “superkey” still depends on FD’s only. 66

BCNF Versus 4 NF • Remember that every FD X ->Y is also an MVD, X ->->Y. • Thus, if R is in 4 NF, it is certainly in BCNF. – Because any BCNF violation is a 4 NF violation. • But R could be in BCNF and not 4 NF, because MVD’s are “invisible” to BCNF. 67

Decomposition and 4 NF • If X ->->Y is a 4 NF violation for relation R, we can decompose R using the same technique as for BCNF. 1. XY is one of the decomposed relations. 2. All but Y – X is the other. 68

Example Drinkers(name, addr, phones, beers. Liked) FD: MVD’s: name -> addr name ->-> phones name ->-> beers. Liked • Key is {name, phones, beers. Liked}. • All dependencies violate 4 NF. 69

Example, Continued • Decompose using name -> addr: 1. Drinkers 1(name, addr) – In 4 NF, only dependency is name -> addr. 2. Drinkers 2(name, phones, beers. Liked) – Not in 4 NF. MVD’s name ->-> phones and name ->-> beers. Liked apply. No FD’s, so all three attributes form the key. 70

Example: Decompose Drinkers 2 • Either MVD name ->-> phones or name ->-> beers. Liked tells us to decompose to: – Drinkers 3(name, phones) – Drinkers 4(name, beers. Liked) 71

On-Line Application Processing • Warehousing • Data Cubes • Data Mining 72

Overview • Traditional database systems are tuned to many, small, simple queries. • Some new applications use fewer, more timeconsuming, complex queries. • New architectures have been developed to handle complex “analytic” queries efficiently. 73

The Data Warehouse • The most common form of data integration. – Copy sources into a single DB (warehouse) and try to keep it up-to-date. – Usual method: periodic reconstruction of the warehouse, perhaps overnight. – Frequently essential for analytic queries. 74

OLTP • Most database operations involve On-Line Transaction Processing (OTLP). – Short, simple, frequent queries and/or modifications, each involving a small number of tuples. – Examples: • Answering queries from a Web interface • Sales at cash registers • Selling airline tickets 75

OLAP • Of increasing importance are On-Line Analytical Processing (OLAP) queries. – Few, but complex queries --- may run for hours. – Queries do not depend on having an absolutely up -to-date database. • Sometimes called Data Mining. 76

OLAP Examples 1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2. Analysts at Wal-Mart look for items with increasing sales in some region. 77

Common Architecture • Databases at store branches handle OLTP. • Local store databases copied to a central warehouse overnight. • Analysts use the warehouse for OLAP. 78

Star Schemas • A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table : a very large accumulation of facts such as sales. w Often “insert-only. ” 2. Dimension tables : smaller, generally static information about the entities involved in the facts. 79

Example: Star Schema • Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. • The fact table is a relation: Sales(bar, beer, drinker, day, time, price) 80

Example, Continued • The dimension tables include information about the bar, beer, and drinker “dimensions”: Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone) 81

Dimensions and Dependent Attributes • Two classes of fact-table attributes: 1. Dimension attributes : the key of a dimension table. 2. Dependent attributes : a value determined by the dimension attributes of the tuple. 82

Example: Dependent Attribute • price is the dependent attribute of our example Sales relation. • It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time attributes). 83

Approaches to Building Warehouses 1. ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas. 2. MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube. ” 84

ROLAP Techniques 1. Bitmap indexes : For each key value of a dimension table (e. g. , each beer for relation Beers) create a bit-vector telling which tuples of the fact table have that value. 2. Materialized views : Store the answers to several useful queries (views) in the warehouse itself. 85

Typical OLAP Queries • Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables. • Example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales. bar = Bars. bar AND Sales. beer = Beers. beer AND Sales. drinker = Drinkers. drinker; 86

Typical OLAP Queries --- 2 • The typical OLAP query will: 1. Start with a star join. 2. Select for interesting tuples, based on dimension data. 3. Group by one or more dimensions. 4. Aggregate certain attributes of the result. 87

Example: OLAP Query For each bar in Palo Alto, find the total sale of each beer manufactured by Anheuser. Busch. 2. Filter: addr = “Palo Alto” and manf = “Anheuser-Busch”. 3. Grouping: by bar and beer. 4. Aggregation: Sum of price. • 88

Example: In SQL SELECT bar, beer, SUM(price) FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’ GROUP BY bar, beer; 89

Using Materialized Views • A direct execution of this query from Sales and the dimension tables could take too long. • If we create a materialized view that contains enough information, we may be able to answer our query much faster. 90

Example: Materialized View • • Which views could help with our query? Key issues: 1. It must join Sales, Bars, and Beers, at least. 2. It must group by at least bar and beer. 3. It must not select out Palo-Alto bars or Anheuser-Busch beers. 4. It must not project out addr or manf. 91

Example --- Continued • Here is a materialized view that could help: CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS SELECT bar, addr, beer, manf, SUM(price) sales FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers Since bar -> addr and beer -> manf, there is no real GROUP We need addr andaddr, beer, grouping. BY bar, manf in the SELECT. 92

Example --- Concluded • Here’s our query using the materialized view BABMS: SELECT bar, beer, sales FROM BABMS WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’; 93

MOLAP and Data Cubes • Keys of dimension tables are the dimensions of a hypercube. – Example: for the Sales data, the four dimensions are bars, beers, drinkers, and time. • Dependent attributes (e. g. , price) appear at the points of the cube. 94

Marginals • The data cube also includes aggregation (typically SUM) along the margins of the cube. • The marginals include aggregations over one dimension, two dimensions, … 95

Example: Marginals • Our 4 -dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). • It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples, … 96

Structure of the Cube • Think of each dimension as having an additional value *. • A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s. • Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s. 97

Drill-Down • Drill-down = “de-aggregate” = break an aggregate into its constituents. • Example: having determined that Joe’s Bar sells very few Anheuser-Busch beers, break down his sales by particular A. -B. beer. 98

Roll-Up • Roll-up = aggregate along one or more dimensions. • Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed for each drinker. 99

Materialized Data-Cube Views • Data cubes invite materialized views that are aggregations in one or more dimensions. • Dimensions may not be completely aggregated --- an option is to group by an attribute of the dimension table. 100

Example • A materialized view for our Sales data cube might: 1. 2. 3. 4. Aggregate by drinker completely. Not aggregate at all by beer. Aggregate by time according to the week. Aggregate according to the city of the bar. 101

Data Mining • • Data mining is a popular term for queries that summarize big data sets in useful ways. Examples: 1. Clustering all Web pages by topic. 2. Finding characteristics of fraudulent creditcard use. 102

Market-Basket Data • An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store. • Summary of basket data is frequent itemsets = sets of items that often appear together in baskets. 103

Example: Market Baskets • If people often buy hamburger and ketchup together, the store can: 1. Put hamburger and ketchup near each other and put potato chips between. 2. Run a sale on hamburger and raise the price of ketchup. 104

Finding Frequent Pairs • The simplest case is when we only want to find “frequent pairs” of items. • Assume data is in a relation Baskets(basket, item). • The support threshold s is the minimum number of baskets in which a pair appears before we are interested. 105

Frequent Pairs in SQL Look for two SELECT b 1. item, b 2. item Basket tuples FROM Baskets b 1, Baskets b 2 with the same basket and WHERE b 1. basket = b 2. basketdifferent items. First item must precede second, AND b 1. item < b 2. item so we don’t GROUP BY b 1. item, b 2. item count the same pair twice. HAVING COUNT(*) >= s; Throw away pairs of items that do not appear at least s times. Create a group for each pair of items that appears in at least one basket. 106

A-Priori Trick --- 1 • Straightforward implementation involves a join of a huge Baskets relation with itself. • The a-priori algorithm speeds the query by recognizing that a pair of items {i, j } cannot have support s unless both {i } and {j } do. 107

A-Priori Trick --- 2 • Use a materialized view to hold only information about frequent items. INSERT INTO Baskets 1(basket, Items that item) appear in at least s baskets. SELECT * FROM Baskets WHERE item IN ( SELECT ITEM FROM Baskets GROUP BY item 108

A-Priori Algorithm 1. Materialize the view Baskets 1. 2. Run the obvious query, but on Baskets 1 instead of Baskets. • Baskets 1 is cheap, since it doesn’t involve a join. • Baskets 1 probably has many fewer tuples than Baskets. – Running time shrinks with the square of the number of tuples involved in the join. 109

Example: A-Priori • Suppose: 1. A supermarket sells 10, 000 items. 2. The average basket has 10 items. 3. The support threshold is 1% of the baskets. • • At most 1/10 of the items can be frequent. Probably, the minority of items in one basket are frequent -> factor 4 speedup. 110