- Number of slides: 13

Supporting Top-k Join Queries in Relational Databases
Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid
Presented by: Z. Joseph, CSE-UT Arlington

Introduction
- Searches are often done on multiple features
- Each feature produces a different ranking for the query
- The rankings on the different features must therefore be joined and aggregated

Example
- Find a location for a house such that the combined cost of the house and five years of tuition at a nearby school is minimal.
- The exact location is not fixed in the query; for each location, both the house features and the school features have to be analyzed.

Motivation
- Current techniques decouple the join from the sorting (ranking) of results.
- Sorting is expensive and is a blocking operation.
- The cost is more apparent when the ranking features and the join features are different.

Rank-Join Algorithm
1) Generate new valid join combinations.
2) Compute the score for each combination.
3) For each incoming input:
   a) Calculate the total score of the last seen feature value of that input and the top-ranked feature value of every other feature in the query.
   b) Store the maximum of these totals as T (the threshold).
4) Store the top k combinations in a priority queue.
5) Halt when the lowest score in the queue is ≥ T.
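The threshold in step 3 can be illustrated with a small sketch. It assumes that higher scores are better and that the combined score is a plain sum; the function name `threshold` and its parameters are made up for this illustration and do not come from the paper.

```python
def threshold(top_scores, last_scores, combine=sum):
    """Upper bound T on the score of any join result not yet produced.

    top_scores[i]:  score of the first (best) tuple read from input i
    last_scores[i]: score of the most recent tuple read from input i
    combine:        monotone aggregation function (assumed here: sum)
    """
    candidates = []
    for i in range(len(top_scores)):
        # Last seen score from input i, top score from every other input.
        mixed = list(top_scores)
        mixed[i] = last_scores[i]
        candidates.append(combine(mixed))
    return max(candidates)


# Two inputs: the best scores seen are 10 and 8, the latest are 6 and 7.
T = threshold([10, 8], [6, 7])
print(T)  # max(6 + 8, 10 + 7) = 17
```

A join result already in the queue with a score of at least T can be reported safely, because no combination that uses an unseen tuple can score higher.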

Optimality
- Rank-Join is instance-optimal over all correct top-k join algorithms.
  - Guarantees that the cost of Rank-Join is O(cost of any other algorithm).
  - Mathematically: Cost(Rank-Join) ≤ c · Cost(any other algorithm) + c'
- c is the optimality ratio
- c, c' > 0

Rank-Join Continued …
- The join strategy is crucial
  - Recommended: Ripple Join
- Alternates between tuples of the inputs
- Flexible in the way it sweeps out (rectangular, etc.)
- Retains ordering in considering samples
- Variants of Rank-Join
  - Hash Rank Join (HRJN)
  - Block Ripple Join

Hash Rank Join (HRJN) Operator
- Built on the idea of the hash ripple join
  - Inputs are held in two hash tables
- Maintains the highest (first) and lowest (last selected) objects from each relation
- Results are added to a priority queue
- Advantages:
  - Smaller space requirement
  - Can be pipelined
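The operator described on this slide can be sketched as a Python generator. This is a minimal illustration, not the authors' implementation: it assumes two inputs given as lists of `(join_key, score)` sorted by descending score, a combined score that is the sum of both scores, and strict alternation between the inputs. The name `hrjn` and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def hrjn(left, right, k):
    """Yield up to k join results, best combined score first."""
    inputs = [left, right]
    tables = [defaultdict(list), defaultdict(list)]  # one hash table per input
    pos = [0, 0]          # next unread position in each input
    top = [None, None]    # first (highest) score seen per input
    last = [None, None]   # most recent (lowest) score seen per input
    queue = []            # max-heap emulated with negated scores
    emitted = 0
    turn = 0
    while emitted < k:
        if pos[0] >= len(left) and pos[1] >= len(right):
            break  # both inputs are exhausted
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn  # skip an exhausted input
        key, score = inputs[turn][pos[turn]]
        pos[turn] += 1
        if top[turn] is None:
            top[turn] = score
        last[turn] = score
        tables[turn][key].append(score)
        # Probe the other hash table for new valid join combinations.
        for other in tables[1 - turn][key]:
            heapq.heappush(queue, (-(score + other), key))
        turn = 1 - turn
        if top[0] is None or top[1] is None:
            continue  # the threshold needs one tuple from each input
        T = max(last[0] + top[1], top[0] + last[1])
        # Anything scoring at least T cannot be beaten by unseen tuples.
        while queue and emitted < k and -queue[0][0] >= T:
            neg, result_key = heapq.heappop(queue)
            yield result_key, -neg
            emitted += 1
    # Inputs exhausted: the remaining queue content is final.
    while queue and emitted < k:
        neg, result_key = heapq.heappop(queue)
        yield result_key, -neg
        emitted += 1


houses = [("a", 10), ("b", 8), ("c", 5)]
schools = [("b", 9), ("a", 6), ("c", 4)]
print(list(hrjn(houses, schools, 2)))  # [('b', 17), ('a', 16)]
```

Because results are yielded as soon as they clear the threshold, a consumer can stop after the first few results without the inputs being read to the end, which is the pipelining advantage the slide mentions.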

Hash Rank Join (HRJN) Operator: Problems
- Local Ranking Problem
  - Results from three or more input streams
- Larger queue sizes
- More database accesses
- Buffer Problem
  - Cannot predict how many partial joins will result

HRJN Solutions?
- Block Ripple Joins
  - Do comparisons as blocks
- Score-Guided Strategy
  - If the thresholds are very different, this may be because one of the rankings has larger scores that descend at a slower rate
  - Can then take more inputs from the slower-descending ranking so that its threshold moves closer to the other thresholds
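One way to read the score-guided strategy for two inputs is sketched below: compute one threshold term per input and read next from the input whose term is larger. This is an interpretation under the assumption of a summed score; `next_input` is a hypothetical helper name.

```python
def next_input(top, last):
    """Return the index (0 or 1) of the input to read from next.

    top[i]:  first (highest) score seen from input i
    last[i]: most recent (lowest) score seen from input i
    """
    t_left = last[0] + top[1]   # threshold term governed by input 0
    t_right = top[0] + last[1]  # threshold term governed by input 1
    # The larger term dominates T, so reading more from that input
    # is what pulls the overall threshold down.
    return 0 if t_left >= t_right else 1


# Input 0 descends slowly (10 -> 9.5), input 1 quickly (9 -> 3).
print(next_input(top=[10, 9], last=[9.5, 3]))  # 0
```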

Optimal Join-Order
- Try to read the least number of input records needed to get a correct ranking
- No clear way of estimating the order of joins
- Have a heuristic: Footrule Distance
  - Simple measure of similarity between two rankings
  - First join the most similar rankings
  - This would quickly yield join results by accessing fewer records
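The heuristic can be sketched as follows, using Spearman's footrule (the sum of absolute position differences) as the distance. The sketch assumes every ranking orders the same set of items; the feature names and the sample rankings are invented for illustration.

```python
from itertools import combinations


def footrule_distance(ranking_a, ranking_b):
    """Sum over all items of |position in a - position in b|."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))


# Hypothetical rankings of the same four locations by three features.
rankings = {
    "price": ["x", "y", "z", "w"],
    "tuition": ["y", "x", "z", "w"],
    "distance": ["w", "z", "y", "x"],
}

# Join the pair of rankings with the smallest distance first.
first_pair = min(
    combinations(rankings, 2),
    key=lambda pair: footrule_distance(rankings[pair[0]], rankings[pair[1]]),
)
print(first_pair)  # ('price', 'tuition')
```

A distance of 0 means the two rankings are identical, so their top items are likely to join after only a few reads from each input.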

Rank-Join Algorithm: Benefits
What can it do?
- Integrates well with query plans
- Produces results as fast as possible
- Provides performance guarantees
- Minimizes space requirements
- Offers a mechanism to determine the best order of joining to execute the query optimally
- Can be improved further if random access is available
- Can eliminate on-the-fly duplicate elimination

References
- Ihab Ilyas, Walid Aref, Ahmed Elmagarmid: "Supporting top-k join queries in relational databases" (2004)
- Jing Chen: DBIR Spring 2005, CSE-UT Arlington, http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/top-k_join.ppt