
Algorithms for massive data sets
Lecture 3 (March 2, 2003)
Synopses, Samples & Sketches
Synopses
• Synopsis (from Webster): a condensed statement or outline (as of a narrative or treatise)
• Synopsis (here): a succinct data structure that lets us answer queries efficiently
Typical Queries
• Statistics (count, median, variance, aggregates)
• Patterns (clustering, associations, classification)
• Nearest Neighbors (L1, L2, Hamming norm)
• Property Testing (skewness, independence)
• etc.
Why use Synopses?
• Can’t store the whole data: e.g. Web data
• Resides in main memory, so query response is fast: e.g. OLAP data
• Remote transmission at minimal cost
• Minimal effect on storage cost
Classification of Synopses
• Are they useful for more than one kind of query?
  – General purpose: e.g. samples
  – Specific purpose: e.g. distinct-values estimators
• What granularity?
  – One per database: e.g. a sample of the whole relation
  – One per distinct value of an attribute: e.g. profiles for customers in a call database
Some Numbers
• AQUA Project (Bell Labs):
  – DB size: 420 MB
  – Synopsis size: 420 KB (0.1%) to 12.5 MB (3%)
  – Accuracy: within 10% for 0.1% of DB size
  – Running time: less than 0.3% of the time for the full query
• Quantile Summary (Khanna et al):
  – DB size: 10^9 tuples
  – Synopsis size: 1249 tuples
  – Accuracy: 1%
Synopses need not be fancy!
• Maintaining the mean (μ) of numbers
• What about variance? (See the sketch below.)
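The formulas on this slide were images and did not survive extraction. As a minimal sketch of the standard trick (names here are illustrative, not from the slides): three counters suffice to answer both mean and variance queries in constant space.

import math

class MeanVarianceSynopsis:
    """Constant-space synopsis: three counters answer mean and variance."""

    def __init__(self):
        self.count = 0       # number of values seen
        self.total = 0.0     # running sum
        self.total_sq = 0.0  # running sum of squares

    def insert(self, x):
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.count

    def variance(self):
        m = self.mean()
        return self.total_sq / self.count - m * m  # E[X^2] - E[X]^2

synopsis = MeanVarianceSynopsis()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    synopsis.insert(x)
print(synopsis.mean(), synopsis.variance())  # 5.0 4.0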
Objectives
• Small size
• Fast update and query
• Provable error guarantees (need not give exact answers)
• Composable: useful in distributed scenarios
A coarse classification
• Sampling based: this lecture
• Sketches
• Histograms
Sampling
• Where and how samples are used
• How samples are maintained
  – Single relation
• Types of samples:
  – Oblivious
  – Value based
• Limitations of oblivious samples
Samples in DSS
• Decision Support Systems (DSS): SQL query → exact answer, but long response times!
• Exact answers are NOT always required
  – DSS applications are usually exploratory: early feedback helps identify “interesting” regions
  – Aggregate queries: precision to the “last decimal” not needed
    • e.g., “What percentage of the US sales are in NJ?” (display as bar graph)
  – Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option
Sampling: Basics
• Idea: a small random sample S of the data often represents all the data well
  – For a fast approximate answer, apply the query to S and “scale” the result (see the sketch below)
  – E.g., R.a is {0, 1}, S is a 20% sample:
      select count(*) from R where R.a = 0
      → select 5 * count(*) from S where S.a = 0
  – [Figure: the R.a column as a bit string, with the sampled tuples shown in red]
    Est. count = 5*2 = 10, exact count = 10
• Leverage the extensive literature on confidence intervals for sampling: the actual answer is within the interval [a, b] with a given probability, e.g., 54,000 ± 600 with prob. 90%
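As a small sketch of the scale-up step (the relation, predicate, and sampling rate below are hypothetical): apply the query to a Bernoulli sample and multiply by the inverse sampling rate.

import random

def approximate_count(relation, predicate, rate=0.2, seed=1):
    """Estimate |{t in relation : predicate(t)}| from a 'rate' sample."""
    rng = random.Random(seed)
    sample = [t for t in relation if rng.random() < rate]   # ~20% sample S
    return sum(1 for t in sample if predicate(t)) / rate    # scale by 1/rate

R_a = [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0] * 100               # the R.a column
print(approximate_count(R_a, lambda a: a == 0))             # close to 500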
The Aqua Architecture
[Figure: browser/Excel clients send a SQL query Q over the network to the Data Warehouse (e.g., Oracle); the result returns as HTML/XML; warehouse data updates flow into the warehouse]
Picture without Aqua:
• User poses a query Q
• Data Warehouse executes Q and returns the result
• Warehouse is periodically updated with new data
The Aqua Architecture
[Figure: the query Q is intercepted by the AQUA Rewriter and rewritten to Q’ against the AQUA Synopses stored in the warehouse; the AQUA Tracker maintains the synopses as warehouse data updates arrive; the result returns with error bounds]
Picture with Aqua:
• Aqua is middleware, between the user and the warehouse
• Aqua synopses are stored in the warehouse
• Aqua intercepts the user query and rewrites it to be a query Q’ on the synopses; the data warehouse returns an approximate answer
    Q:  select count(*) from R where R.a = 0
    Q’: select 5 * count(*) from S where S.a = 0
Schema & Queries
[Figure: the schema DAG. Edges: L → O (order), O → C (cust), C → N (nation), L → PS (part, supp), PS → P, PS → S, N → R]
• Most queries involve foreign-key joins between tables followed by (grouping and) aggregation.
Example Query
[Figure: the example query itself was an image and is not recoverable]
What samples are right?
• Naïve approach: maintain samples of each relation in the schema
• Problem: a sample of the join is not a join of the samples, even for foreign-key joins
• Example: [Figure: relations A (tuples a1, a2) and B (tuple b1); joining independently drawn samples of A and B misses most of the tuples of A ⋈ B]
Foreign-Key Joins
• Foreign-key join: effectively, a central “fact” table is appended with columns from the dimension tables.
• Sampling from the join is the same as sampling from the “fact” table itself.
• Synopsis: for every table that may be a “fact” table for some join, sample from that table and join the sample with the dimension tables.
Synopsis
[Figure: the same schema DAG as before, with a sample maintained at each node]
• For every node in the DAG (see the sketch below):
  – Maintain a sample corresponding to that table.
  – Join the sample with the tables corresponding to all its descendants in the graph.
  – This is the maximal join for which the table is a “fact” table.
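A toy sketch of how such a synopsis might be built (the micro-schema and helper names are hypothetical; Python dicts stand in for relations): sample the fact table uniformly, then widen each sampled row with its dimension attributes.

import random

def join_synopsis(fact_rows, dimensions, fk_cols, sample_size, seed=1):
    """Sample the 'fact' table, then widen each sampled row with the
    attributes of its dimension rows (its maximal foreign-key join)."""
    rng = random.Random(seed)
    sample = rng.sample(fact_rows, min(sample_size, len(fact_rows)))
    widened_rows = []
    for row in sample:
        widened = dict(row)
        for fk, dim in zip(fk_cols, dimensions):
            widened.update(dim[row[fk]])   # FK lookup: always matches
        widened_rows.append(widened)
    return widened_rows

orders = {1: {"o_date": "2003-01-01"}, 2: {"o_date": "2003-02-15"}}
lineitem = [{"okey": 1, "price": 10.0}, {"okey": 2, "price": 7.5},
            {"okey": 1, "price": 3.2}]
print(join_synopsis(lineitem, [orders], ["okey"], sample_size=2))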
Bells and whistles!
• How to allocate memory across samples of different “fact” tables?
• Group-By queries:
  – Are uniform samples best, or can we do better?
• The aggregate attribute may be skewed:
  – Are uniform samples best, or can we do better?
• We may revisit these issues later
  – We have not seen any equations for a while!
How to sample?
• Consider a single table with only insertions
• Want to maintain a sample of this table
• Three semantics of sampling:
  – Coin flip
  – Fixed size without replacement
  – Fixed size with replacement
• The first one (coin flip) is easy to maintain under insertions
• Exercise: can we switch between the different kinds of samples? If so, how?
Reservoir Sampling
• Given: a stream of elements (tuples), viewed as insertions into a relation
• Aim: at every instant, maintain a uniform random sample of size n without replacement
• Method (accept the first n elements):
  – Let t be the number of elements seen so far
  – On seeing the (t+1)st element, include it with probability n/(t+1)
  – If included, evict one of the previous sample elements uniformly at random (see the sketch below)
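A direct transcription of the method into Python (a sketch; the slide gives only the pseudocode above):

import random

def reservoir_sample(stream, n, seed=1):
    """Maintain a uniform size-n sample without replacement in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(item)              # accept the first n elements
        elif rng.random() < n / t:              # include with probability n/t
            reservoir[rng.randrange(n)] = item  # evict uniformly at random
    return reservoir

print(reservoir_sample(range(1000), n=10))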
Proof of Correctness
• It is easy to see that at every instant the size of the sample is exactly n
• Claim: after seeing t elements, every element belongs to the sample with probability n/t
• Exercise: prove the claim using induction (a worked step follows)
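For reference, the inductive step in LaTeX (assume every element is in the sample with probability n/t after t arrivals; the newcomer itself is included with probability n/(t+1) by construction). An old element survives arrival t+1 unless the newcomer is included and lands on its slot:

\Pr[\text{in sample after } t+1]
  = \frac{n}{t}\Bigl(1-\frac{n}{t+1}\cdot\frac{1}{n}\Bigr)
  = \frac{n}{t}\cdot\frac{t}{t+1}
  = \frac{n}{t+1}.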
Efficiency
• Let N be the number of records seen
• Each record (beyond the first n records) is added to the reservoir with probability n/t
• The average number of records added is
    n + Σ_{t=n+1…N} n/t ≈ n(1 + ln(N/n))
• Consider any reservoir sample: the t-th element has to be a part of the sample with probability no less than n/t
• Thus, the quantity above is also a lower bound on the additions made to the reservoir (time spent)
Efficiency
• The naïve algorithm makes N calls to RANDOM() and takes time O(N)
• Consider the following random variable: let S(n, t) denote the number of elements skipped, where n is the size of the reservoir and t is the number of elements processed so far
• Aim: study this random variable and sample from its distribution using O(1) operations
• Idea: generate S(n, t) and skip that many records, doing nothing
Observations
• S(n, t) is non-negative
• Let F(s) denote Prob{S(n, t) ≤ s} for s ≥ 0. Then
    F(s) = 1 − t↓n / (t+s+1)↓n
  where a↓b denotes the falling power a(a−1)(a−2)…(a−b+1) and a↑b denotes the rising power a(a+1)(a+2)…(a+b−1)
Observations
• Subtracting the two terms corresponding to s and s−1, we get the probability density function f(s):
    f(s) = (n / (t+s+1)) · t↓n / (t+s)↓n
• We can compute the expected value, which is E[S(n, t)] = (t−n+1)/(n−1)
• Here is a simple way to sample from the distribution of S(n, t). We already calculated its CDF F(s). Generate a random number U between 0 and 1 and find the smallest s such that U ≤ F(s), i.e. the smallest s with
    t↓n / (t+s+1)↓n ≤ 1 − U
Observations
• This reduces the number of calls to RANDOM() to the optimum: one per insertion into the reservoir
• There are two ways to find the smallest s that satisfies the previous inequality (see the sketch below):
  – Linear scan: gives an O(N) time algorithm
  – Binary search / Newton’s interpolation method: gives a running time of O(n²(1 + log(N/n)))
• Note: this is still not optimal. Read the paper for an (up to constants) optimal algorithm.
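A sketch in the spirit of Vitter’s Algorithm X (linear-scan inversion of the CDF; the helper names are mine, not the paper’s): one uniform draw generates each skip, and skipped records are consumed without any coin flips (the eviction slot still uses one extra draw per insertion).

import itertools
import random

def reservoir_sample_skips(stream, n, seed=1):
    """Reservoir sampling that draws S(n, t) directly: find the smallest s
    with U <= F(s) by scanning Pr[S > s] = t_fall_n / (t+s+1)_fall_n."""
    rng = random.Random(seed)
    it = iter(stream)
    reservoir = list(itertools.islice(it, n))
    t = len(reservoir)
    while True:
        u, s = rng.random(), 0
        prob_gt = (t + 1 - n) / (t + 1)           # Pr[S > 0]
        while prob_gt > 1.0 - u:                  # stop at the smallest valid s
            s += 1
            prob_gt *= (t + s + 1 - n) / (t + s + 1)
        try:
            for _ in range(s):
                next(it)                          # skip s records, untouched
            item = next(it)                       # the (t+s+1)st record enters
        except StopIteration:
            return reservoir
        reservoir[rng.randrange(n)] = item        # evict uniformly at random
        t += s + 1

print(reservoir_sample_skips(range(100000), n=10))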
What have we seen so far?
• How to sample efficiently (reservoir sampling)
  – A method to sample without replacement in a single scan
  – Optimized the number of calls to RANDOM()
  – Overall processing time can also be optimized
• How samples are used in DSS, and which samples should be kept in order to answer queries
• What next?
  – Queries in DSS are not simple counts over the entire relation
  – Typically they have grouping followed by aggregation over an attribute that may have high variance
Error using sampling
• R = {y_1, y_2, …, y_N}, sample size n
• Variance in the data values: S² = (1/N) Σ_i (y_i − μ)²
• Error = Std Dev = √E[(μ − μ*)²] ≈ S/√n, where μ is the true mean and μ* the sample mean
Group-By Queries
• SELECT avg(salary) FROM census GROUP BY state
• Some of the states have very few tuples compared to others; e.g. CA has 70 times more people than WY
• If we sample uniformly from the entire relation, there will be very few tuples corresponding to WY, and hence a large error in its avg(salary) estimate
Error Metric (Group-By)
• Let c*_i be the true answer (aggregate) corresponding to group i
• Let c_i be the estimate obtained from the sample
• The (relative) error e_i is given by |c*_i − c_i| / c*_i
• The cumulative error is the L1, L2, or L∞ norm of the error vector {e_i}
Optimal sampling strategy
• For every group, the error is inversely proportional to √n, where n is the number of sample tuples from that group
• In order to reduce the maximum error among all groups, we should have an equal number of samples from each group (Senate)
• But this strategy is not optimal if the query has no group-by and ranges over the entire relation; in that case a uniform sample of the entire relation is optimal (House)
Basic-Congress Sampling
• Unfortunately, unlike the U.S. Congress, we don’t have room to seat both Senators and House Representatives!
• Hence we do the following (a sketch of the allocation follows):
  – Let X be the total number of seats allotted to Congress
  – For a state, say CA, let CA_S (resp. CA_H) be the seats allotted to it assuming Congress were made up only of the Senate (resp. the House)
  – The final seat allocation to each state CA is proportional to max(CA_S, CA_H), subject to the total number of seats being X
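A sketch of the allocation rule (function and variable names are illustrative; the paper’s actual procedure may differ in details such as rounding):

def basic_congress(group_sizes, X):
    """Allocate a sample budget X across groups: take the larger of the
    'house' (proportional) and 'senate' (equal) shares, then rescale so
    the shares sum to X."""
    total = sum(group_sizes.values())
    g = len(group_sizes)
    raw = {grp: max(X * sz / total,    # house share: proportional
                    X / g)             # senate share: equal
           for grp, sz in group_sizes.items()}
    scale = X / sum(raw.values())      # renormalize to the budget X
    return {grp: share * scale for grp, share in raw.items()}

# CA has ~70x the population of WY; both still get a usable share
print(basic_congress({"CA": 70_000, "WY": 1_000, "NJ": 17_000}, X=1_000))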
Comments
• No error guarantees: only a best-effort solution
• Cannot use reservoir sampling anymore
  – The full paper talks about one-pass algorithms, but admits that they don’t work in all cases
• What if the variance in the values (S) is large?
  – Outlier indexing
Error using sampling (recap)
• R = {y_1, y_2, …, y_N}, sample size n
• Variance in the data values: S² = (1/N) Σ_i (y_i − μ)²
• Error = Std Dev = √E[(μ − μ*)²] ≈ S/√n
Presence of Data Skew
• Outliers (deviant tuples): 9,950 tuples with value 1 and 50 tuples with value 1000; exact answer = 59,950
• Take a uniform sample of size 100 and scale by 100:
  – Case 1: the sample misses all outliers → sum estimate = 10,000
  – Case 2: the sample contains an outlier → sum estimate ≥ 109,900
• Either way, error > 83%
Outlier Indexing Scheme
• Preprocessing: partition R into R_O (the outliers, stored in full as an index) and R_NO (the non-outliers, of which a sample is kept)
• Query: run Q exactly on R_O to get A1; run Q on the sample of R_NO and extrapolate to get A2
• Answer: A = A1 + A2 (a sketch follows)
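A sketch of the two-part estimate for a SUM query, continuing the skewed example from two slides back (names are mine; the value-based split below is a toy stand-in for a real outlier index):

import random

def outlier_indexed_sum(R, outlier_values, sample_size, seed=1):
    """A = exact sum over the outlier index (A1) + extrapolated sum over
    a uniform sample of the non-outliers (A2)."""
    rng = random.Random(seed)
    R_NO = [x for x in R if x not in outlier_values]  # toy value-based split
    a1 = sum(outlier_values)                 # exact: no sampling error here
    sample = rng.sample(R_NO, sample_size)
    a2 = sum(sample) * len(R_NO) / sample_size        # extrapolate
    return a1 + a2

R = [1] * 9950 + [1000] * 50                 # the skewed data from before
print(outlier_indexed_sum(R, outlier_values=[1000] * 50, sample_size=100))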
Selection of the Outlier Index
• Objective: remove at most τ outliers such that the non-outliers have the least variance
• Theorem: for a sorted (multi)set of values v_1 ≤ v_2 ≤ … ≤ v_N, the optimal outlier set is a prefix plus a suffix of the sorted order, i.e. {v_1, …, v_k} ∪ {v_m, …, v_N} for some split (a sketch follows)
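A sketch that applies the theorem directly (brute force over the τ+1 prefix/suffix splits; a real implementation would use prefix sums to get O(N) time after sorting):

import statistics

def best_outlier_set(values, tau):
    """Per the theorem, the optimal outlier set of size tau is a prefix
    plus a suffix of the sorted order, so it suffices to try all splits."""
    v = sorted(values)
    best_var, best_set = None, None
    for k in range(tau + 1):                 # k smallest + (tau - k) largest
        rest = v[k:len(v) - (tau - k)]
        var = statistics.pvariance(rest)     # variance of the non-outliers
        if best_var is None or var < best_var:
            best_var, best_set = var, v[:k] + v[len(v) - (tau - k):]
    return best_set

print(best_outlier_set([1, 1, 2, 3, 3, 100, 120, -50], tau=3))
# -> [-50, 100, 120]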
Comments
• Cannot do reservoir sampling
• There is a one-pass algorithm for the selection of outliers
Types of Samples
• Oblivious samples: we do not look at the value of the attribute while sampling
• Value-based sampling: e.g. the distinct sampling of Gibbons et al
• Limitations of oblivious sampling:
  – See: Sampling algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.
Summary
• The obvious type of synopsis: samples
• Use of samples in DBs, in particular DSS
  – The idea of maintaining samples of ‘fact’ tables
• How to sample without replacement in a single pass, without knowing the size of the relation a priori
  – Reservoir sampling, and tricks to make it efficient
• Shortcomings of sampling in DBs
  – Group-By queries: congressional samples
  – High skew in the data: outlier indexing, stratified sampling
References
• Join Synopses for Approximate Query Answering. S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.
• Congressional Samples for Approximate Answering of Group-By Queries. S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.
• Overcoming Limitations of Sampling for Aggregation Queries. S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. ICDE 2001.
• A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. S. Chaudhuri, G. Das, and V. Narasayya. SIGMOD 2001.
• Random Sampling with a Reservoir. J. S. Vitter. ACM Trans. on Mathematical Software 11(1): 37-57 (1985).
Sampling over Sliding Windows
• Samples of streaming data
• Need to account for staleness of the data
• A data element is fresh if it belongs to the last n elements
• Problem statement: given a stream of elements, maintain a uniform random sample of size k over the window of the last n elements
A Simple, Unsatisfying Approach
• Choose a random subset X = {x_1, …, x_k}, X ⊆ {0, 1, …, n−1}
• The sample always consists of the non-expired elements whose indexes are equal to x_1, …, x_k (modulo n)
• Only uses O(k) memory
• Technically produces a uniform random sample of each window, but is unsatisfying because the sample is highly periodic
• Unsuitable for many real applications, particularly those with periodicity in the data
Reservoir Sampling: Why It Doesn’t Work
• Suppose an element in the reservoir expires
• We need to replace it with a randomly chosen element from the current window
• However, in the data stream model we have no access to past data
• We could store the entire window, but this would require O(n) memory
Chain-Sample
• Include each new element i in the sample with probability 1/min(i, n)
• As each element is added to the sample, choose the index of the element that will replace it when it expires
• When the ith element expires, the window will be (i+1 … i+n), so choose the index from this range
• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements
• When an element is chosen to be discarded from the sample, discard its “chain” as well (a sketch follows)
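A sketch for sample size 1 (run k independent copies for a size-k sample; helper names are mine, and details such as the order of the expiry check follow my reading of the slide, not the paper’s code):

import random

def chain_sample(stream, n, seed=1):
    """One uniform sample over a sliding window of the last n elements,
    maintained via a chain of potential replacements; yields the current
    sample after each arrival."""
    rng = random.Random(seed)
    chain = []          # [(index, value), ...]; chain[0] is the sample
    successor = None    # index that will extend the chain when it arrives
    for i, item in enumerate(stream, start=1):
        if rng.random() < 1.0 / min(i, n):
            chain = [(i, item)]                  # new sample; old chain dies
            successor = rng.randint(i + 1, i + n)
        elif successor == i:
            chain.append((i, item))              # store the replacement
            successor = rng.randint(i + 1, i + n)
        if chain and chain[0][0] <= i - n:       # head expired: advance chain
            chain.pop(0)
        yield chain[0][1] if chain else None

for sample in chain_sample([3, 5, 1, 4, 6, 2, 8, 5, 2, 3, 5, 4], n=4):
    print(sample, end=" ")

When the head of the chain expires, its stored replacement was drawn uniformly from exactly the window that is current at that moment, which is what keeps the sample uniform.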
Example
[Figure: a chain-sample run on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, showing the chain of potential replacements]
Memory Usage of Chain-Sample
• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x:
    T(x) = 0 for x ≤ 0
    T(x) = 1 + (1/n) Σ_{j<x} T(j) for x > 0
• Solving the recurrence gives T(x) = (1 + 1/n)^x (verified below), so the expected chain length is at most (1 + 1/n)^n ≤ e, and the expected memory usage is O(k)
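Checking the closed form, in LaTeX (guess T(x) = (1 + 1/n)^x and substitute into the recurrence):

\frac{1}{n}\sum_{j=0}^{x-1}\Bigl(1+\tfrac{1}{n}\Bigr)^{j}
  = \frac{1}{n}\cdot n\Bigl[\Bigl(1+\tfrac{1}{n}\Bigr)^{x}-1\Bigr]
  = \Bigl(1+\tfrac{1}{n}\Bigr)^{x}-1,
\qquad\text{so}\qquad
1+\frac{1}{n}\sum_{j=0}^{x-1}T(j)=\Bigl(1+\tfrac{1}{n}\Bigr)^{x}=T(x),
\qquad
T(n)=\Bigl(1+\tfrac{1}{n}\Bigr)^{n}\le e.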
Memory Usage of Chain-Sample
• A chain consists of “hops” with lengths 1 … n
• A chain of length j can be represented by a partition of n into j ordered integer parts
  – j−1 hops with sum less than n, plus a remainder
• Each such partition has probability n^−j
• The number of such partitions is C(n, j) < (ne/j)^j
• The probability of any such partition is small, O(n^−c), when j = O(k log n)
• Uses O(k log n) memory with high probability
Comparison of Algorithms

  Algorithm      Expected      High-Probability
  Periodic       O(k)          O(k)
  Oversample     O(k log n)    O(k log n)
  Chain-Sample   O(k)          O(k log n)

• Chain-sample is preferable to oversampling:
  – Better expected memory usage: O(k) vs. O(k log n)
  – Same high-probability memory bound of O(k log n)
  – No chance of failure due to the sample size shrinking below k