Congressional Samples for Approximate Answering of Group-By Queries

Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala
Presented by: Muhammed Z. Miah
February 14, 2006
CS 6392 - DB Exploration

Introduction
- Limitations of Uniform Sampling
  - Presence of skewed data in aggregate values
  - Effect of low selectivity in selection queries
  - Presence of small groups in group-by queries
- Biased Sampling for Group-By Queries
  - (Precomputed) biased sampling: a hybrid union of biased and uniform sampling

Aqua System (Architecture)

Problems with Group-By Queries
- Decision support queries routinely segment the data into groups.
- For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups, e.g., the state of California has nearly 70 times the population of Wyoming.
- As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups. This leads to poor accuracy for answers on those groups, because accuracy depends heavily on the number of sample tuples that belong to each group.
- For a uniform sample, the standard error is inversely proportional to √n, where n is the size of the uniform random sample.
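To make the 1/√n effect concrete, here is a minimal sketch of how a uniform sample splits across two groups whose sizes differ by a factor of 70. The group names, the group sizes, and the helper functions are hypothetical, chosen only for illustration. The calculation uses expected counts rather than drawing an actual random sample.

```python
import math

def expected_uniform_allocation(group_sizes, sample_size):
    """Expected number of sample tuples per group under uniform sampling."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}

def error_factor(n_sample):
    """Standard error scales as 1/sqrt(n); return that scaling factor."""
    return 1.0 / math.sqrt(n_sample)

# Hypothetical group sizes with a 70:1 ratio (not real census figures).
group_sizes = {"large": 700_000, "small": 10_000}
alloc = expected_uniform_allocation(group_sizes, sample_size=7_100)

# How much larger the error factor is for the small group than for the large one.
error_ratio = error_factor(alloc["small"]) / error_factor(alloc["large"])
```

Under these assumptions the small group receives 1/70 of the tuples the large group does, so its error factor is larger by √70, roughly 8.4 times.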

Solution (Congressional Sampling)
- Congressional samples are a hybrid union of uniform and biased samples.
- The strategy adopted is to divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Consider the US Congress, which is a hybrid of the House and the Senate. The House has representatives from each state in proportion to its population. The Senate has an equal number of representatives from each state.
- Then apply the House and Senate scenario to represent the different groups:
  - House sample: a uniform random sample of the relation, so each group is represented in proportion to its size.
  - Senate sample: sample an equal number of tuples from each group.
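The two allocations can be sketched as follows for a single grouping. This is an illustrative sketch, not the authors' implementation: the function names and the group sizes are hypothetical, and it assumes every group holds at least X/g tuples so the Senate share never exceeds a group's size.

```python
def house_allocation(group_sizes, X):
    """House: space proportional to group size (expected uniform sample)."""
    total = sum(group_sizes.values())
    return {g: X * n / total for g, n in group_sizes.items()}

def senate_allocation(group_sizes, X):
    """Senate: the sample space X divided equally among the groups."""
    share = X / len(group_sizes)
    return {g: share for g in group_sizes}

# Hypothetical group sizes, for illustration only.
group_sizes = {"A": 7000, "B": 2900, "C": 100}
house = house_allocation(group_sizes, X=300)
senate = senate_allocation(group_sizes, X=300)
```

With these numbers the House gives group C only 3 tuples, while the Senate gives every group 100, which is why the small group's estimates improve under the Senate allocation.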

Solution (Congressional Sampling)
- Define a strategy S1 as follows:
  - Divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
- Congressional approach: consider the entire set of possible group-by queries over a relation R.
- Let $\mathcal{G}$ be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h on any other grouping $T \subseteq G$ is the union of one or more groups g from $\mathcal{G}$.
- Constructing Congress:
  1. Apply S1 on each $T \subseteq G$.
  2. Let $\mathcal{T}$ be the set of non-empty groups under the grouping T, and let $m_T = |\mathcal{T}|$ be the number of such groups.
  3. By S1, each of the non-empty groups in $\mathcal{T}$ should get a uniform random sample of $X/m_T$ tuples from the group.

Solution (Congressional Sampling)
- Constructing Congress (continued):
  4. Thus, for each subgroup g in $\mathcal{G}$ of a group h in $\mathcal{T}$ (the set of non-empty groups under T, of which there are $m_T$), the expected space allocated to g is simply $S_{g,T} = \frac{X}{m_T} \cdot \frac{n_g}{n_h}$, where $n_g$ and $n_h$ are the number of tuples in g and h respectively.
  5. Then, for each group g, take the maximum over all T of $S_{g,T}$ as the sample size for g, and scale it down to limit the space used to X. The final formula is:
     $$\mathrm{SampleSize}(g) = X \cdot \frac{\max_{T \subseteq G} S_{g,T}}{\sum_{j \in \mathcal{G}} \max_{T \subseteq G} S_{j,T}}$$
  6. For each group g in $\mathcal{G}$, select a uniform random sample of size SampleSize(g).
- Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
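The construction can be sketched in a few lines. The sketch assumes the per-grouping share is (X / m_T) · (n_g / n_h), takes the maximum over every subset T of the grouping attributes (including the empty grouping, i.e. no group-by), and then rescales so the total is X. The function name, the input layout, and the example counts are hypothetical, and the result is a fractional target size rather than an actual drawn sample.

```python
from itertools import combinations

def congress_allocation(group_counts, num_attrs, X):
    """group_counts maps each finest-level group (a tuple of attribute
    values, one per grouping attribute) to its tuple count n_g."""
    # Every grouping T, as a tuple of attribute positions; () means no group-by.
    groupings = [T for r in range(num_attrs + 1)
                 for T in combinations(range(num_attrs), r)]
    best = {g: 0.0 for g in group_counts}
    for T in groupings:
        # n_h: size of each group h under T, summed over its subgroups g.
        n_h = {}
        for g, n_g in group_counts.items():
            h = tuple(g[i] for i in T)
            n_h[h] = n_h.get(h, 0) + n_g
        m_T = len(n_h)  # number of non-empty groups under T
        for g, n_g in group_counts.items():
            h = tuple(g[i] for i in T)
            s_gT = (X / m_T) * (n_g / n_h[h])
            best[g] = max(best[g], s_gT)
    # Scale the per-group maxima down so the total space used is X.
    scale = X / sum(best.values())
    return {g: v * scale for g, v in best.items()}

# Hypothetical relation with two grouping attributes and three non-empty groups.
counts = {("x", "p"): 600, ("x", "q"): 300, ("y", "p"): 100}
alloc = congress_allocation(counts, num_attrs=2, X=100)
```

In this example the per-group maxima are 60, 50 and 50 (total 160), so after scaling by 100/160 the groups receive 37.5, 31.25 and 31.25 tuples. A purely proportional allocation would have given the smallest group only 10.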

Rewriting
- Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate.
- For each tuple, let its scale factor ScaleFactor be the inverse sampling rate for its stratum. All the sample tuples belonging to a group will have the same ScaleFactor. Thus the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor.
- There are two approaches to doing this:
  a) store the ScaleFactor (SF) with each tuple in the sample relation: Integrated
  b) use a separate table to store the ScaleFactors for the groups: Normalized, Key-normalized, Nested-integrated
- Each approach has its pros and cons.
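As a rough illustration of the scaling step, the sketch below mimics the Integrated layout, where each sample tuple carries its own scale factor. It assumes the usual scale-up rules: SUM is estimated by summing value × SF, COUNT by summing SF, and AVG by their ratio. The row layout, function name, and data are hypothetical, and error-bound derivation is not covered.

```python
from collections import defaultdict

def estimate_group_aggregates(sample_rows):
    """sample_rows: iterable of (group_key, value, scale_factor) tuples,
    i.e. an Integrated-style sample with SF stored alongside each tuple."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for group, value, sf in sample_rows:
        sums[group] += value * sf   # scaled-up SUM
        counts[group] += sf         # scaled-up COUNT
    return {g: {"sum": sums[g], "count": counts[g], "avg": sums[g] / counts[g]}
            for g in sums}

# Hypothetical sample: group "A" was sampled at rate 1/50, group "B" at 1/2.
rows = [("A", 10.0, 50.0), ("A", 20.0, 50.0),
        ("B", 5.0, 2.0), ("B", 7.0, 2.0), ("B", 9.0, 2.0)]
estimates = estimate_group_aggregates(rows)
```

A normalized variant would instead keep the scale factors in a separate per-group table and join it to the sample at query time, trading per-tuple storage for a join.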

Computation and Maintenance
- One-Pass Algorithm
- [AGP99b] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, November 1999.

Experiments
- Testbed
  - On Aqua, with Oracle (v7)
- Accuracy of Sample Allocation Strategies
  - Performance for Different Query Sets
    - Queries with no group-bys, three group-bys, and two group-bys
  - Effect of Sample Size
    - Error drops as more space is allocated to store the samples
    - Congress: error drops rapidly with increasing sample size, and accuracy stays high even for arbitrary group-bys
- Performance of Rewriting Strategies

Extensions
- Generalization to Multiple Criteria
- Generalization to Other Queries

Related Work
- Online Aggregation
- Histograms
- Wavelets
- Biased Sampling (Stratified Sampling)

Conclusions
- Congressional samples are effective for group-by queries with arbitrary group-bys (including none).
- The new strategies were validated experimentally, both in their ability to produce accurate estimates for group-by queries and in their execution efficiency.

THANK YOU
Happy Valentine's Day