9d24d37bfcbe7d87a319c6311828c071.ppt
- Количество слайдов: 18
Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath Poosala Presented By: Daniel Kuang February 14, 2006 CS 6392 - DB Exploration 1
Outline Problems with Group-By queries n Congressional sampling n Rewriting n Performance n Conclusion n February 14, 2006 CS 6392 - DB Exploration 2
Problems with Group-By Queries n Decision support queries routinely segment the data into groups. n For example, a group-by query on the U. S. census database could be used to determine the per capita income per state. However , there can be a huge discrepancy in the sizes of different groups, e. g. , the state of California has nearly 70 times the population of Wyoming. n As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups because accuracy is highly dependent on the number of sample tuples that belong to that group. n Standard error is inversely proportional to √n for uniform sample. n is the uniform sample random size. February 14, 2006 CS 6392 - DB Exploration 3
Solution (Congressional Sampling) n Consider US Congress which is hybrid of House and Senate. House has representative from each state in proportion to its population. Senate has equal number of representative from each state. n Then apply House and Senate scenario for representing different groups. House sample: Uniform random sampling from each group. Senate sample: Sample an equal number of tuples from each group. February 14, 2006 CS 6392 - DB Exploration 4
Solution (Congressional Sampling) A B House Sg, 0 Senate Sg, AB Basic Congress before scaling Basic Congress a 1 b 1 30 25 30 27. 3 n a 1 Consider a 30 relation 25 with two grouping attributes A, and B R b 2 30 27. 3 n Number of tuples for the groups a 1 (a 1, b 1) – 3000, (a 1, b 2) – 3000, (a 1, b 3) – 1500, (a 2, b 3) b 3 15 25 25 22. 7 n Basic Congress (sample size = 100) a 2 b 3 25 25 25 22. 7 February 14, 2006 CS 6392 - DB Exploration -- 2500 5
Solution (Congressional Sampling) A B House Sg, 0 Senate Sg, AB Basic Congress before scaling Basic Congress a 1 b 1 30 25 30 27. 3 a 1 b 2 30 25 30 27. 3 a 1 b 3 15 25 25 22. 7 a 2 b 3 25 25 25 22. 7 Sg, A Sg, B Congress before scaling Congress 20 (of 50) 33. 3 23. 5 10 (of 50) 12. 5 (of 33. 3) 25 17. 7 50 20. 8 (of 33. 3) 50 35. 3 February 14, 2006 CS 6392 - DB Exploration 6
Congressional Sampling n Basic congress sample size allocated to each group n Congress sample size allocated to each group February 14, 2006 CS 6392 - DB Exploration 7
Rewriting n n n Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate. Scale. Factor be the inverse sampling rate for its strata. How to associate each tuple with its scalefactor: a) store the Scale. Factor(SF) with each tuple in sample relation b) use a separate table to store the Scale. Factors for the groups Key Grouping column Aggregate column K A B C Q k 1 a 1 b 1 c 1 q 1 From Rel k 2 a 1 b 1 c 2 q 2 Group by A, B Select A, B, sum(Q) Relation Rel with two example tuples February 14, 2006 CS 6392 - DB Exploration 8
Rewriting (Integrated Rewriting) February 14, 2006 CS 6392 - DB Exploration 9
Normalized Rewriting February 14, 2006 CS 6392 - DB Exploration 10
Key-normalized Rewriting February 14, 2006 CS 6392 - DB Exploration 11
Nested-integrated Rewriting February 14, 2006 CS 6392 - DB Exploration 12
Performance n Three Queries Grouping on returnflag, linestatus, shipdate skewed group sizes z = 1. 5 Sample Percentage at 7% February 14, 2006 CS 6392 - DB Exploration 13
Performance February 14, 2006 CS 6392 - DB Exploration 14
Performance February 14, 2006 CS 6392 - DB Exploration 15
Performance February 14, 2006 CS 6392 - DB Exploration 16
Performance Times taken for different sample percentages Actual query time = 40 sec February 14, 2006 CS 6392 - DB Exploration 17
Conclusions n Congressional samples are effective for group-by queries with arbitrary group-bys (including none) February 14, 2006 CS 6392 - DB Exploration 18
9d24d37bfcbe7d87a319c6311828c071.ppt