c93dd533566fb6cd88e019cd22d13792.ppt
- Number of slides: 16
Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
Songting Chen, Hua-Gang Li*, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara
VLDB 2006, Seoul, Korea
Background
• XML
– Hierarchical (tree-structured) data
– Provides flexibility to model semi-structured data
– Widely accepted as a universal data exchange format
• Queries over XML
– XPath, XQuery [W3C]
– Extensively used by many applications
– Adopted by a number of commercial systems
State-of-the-art: XML Query Processing
Query classes: Path, Tree, Generalized Tree Pattern (GTP)
• Algebraic approach
– Binary structural joins [Timber]: large intermediate results
– Optimize multiple path expressions of XQuery [Chen et al.]: expensive post-processing
• Holistic approach
– PathStack [Bruno et al.]
– Twig2Stack
– ?
Processing Generalized Tree Pattern (GTP) Queries
Type: Algebraic approach [Chen et al.]
– Mandatory axis: structural joins
– Optional axis: structural outer joins
– Return node: –
– Group return node: grouping
– Non-return node: duplicate elimination
Example (query nodes A, B, D, C)
XQuery: FOR $b in //A[E]/B, $d in $b/D LET $c = $b/C RETURN $b, $c, $d
[Figure: example data a1, a2, b1 for //A//B and //A/B, with a Sort step]
Our goal: avoid ALL of these!
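One way to picture a GTP is as a query tree whose nodes carry the flags from the table above. The sketch below is only an illustration: the class `GTPNode`, its field names, and the helper `return_tags` are hypothetical, and the mapping of the slide's XQuery onto flags (FOR variables as mandatory return nodes, the LET variable as an optional grouped return node, A and E as non-return nodes) is an assumed reading of the example, not something the slide spells out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GTPNode:
    tag: str
    axis: str = "/"          # axis from the parent: "/" (child) or "//" (descendant)
    optional: bool = False   # True for an optional edge (assumed for LET bindings)
    is_return: bool = False  # True if the node appears in the RETURN clause
    grouped: bool = False    # True if its matches are returned as one group
    children: List["GTPNode"] = field(default_factory=list)


def return_tags(node: GTPNode) -> List[str]:
    """Collect the tags of all return nodes in pre-order."""
    found = [node.tag] if node.is_return else []
    for child in node.children:
        found.extend(return_tags(child))
    return found


def build_example() -> GTPNode:
    """Assumed GTP for: FOR $b in //A[E]/B, $d in $b/D LET $c = $b/C RETURN $b, $c, $d"""
    c = GTPNode("C", optional=True, is_return=True, grouped=True)
    d = GTPNode("D", is_return=True)
    b = GTPNode("B", is_return=True, children=[d, c])
    e = GTPNode("E")  # predicate node: mandatory, not returned
    return GTPNode("A", axis="//", children=[e, b])
```

Calling `return_tags(build_example())` lists the nodes whose matches must be output; every other node only constrains the match.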
Motivation: PathStack [Bruno et al.]
• Query: //A//B; data: a1, a2, b1, b2; stacks S[A] = (a1, a2) and S[B] = (b1, b2)
• Key observation: minimize intermediate results through a compact representation of path matches, by
– Inter-node: recording the ancestor-descendant (AD) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
– Intra-node: recording the AD relationship between elements within the same query node, e.g., b1, b2
• TwigStack [Bruno et al.] minimizes intermediate results by:
– Outputting only those path matches that are in the final twig results
– However, such optimality cannot be guaranteed [Choi et al.]
– Not helpful for processing GTP queries
• Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
– Would it be useful for processing GTP queries as well?
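The compact encoding for //A//B can be sketched as follows. This is a simplified two-node version written for illustration, not the published PathStack algorithm: the function name `path_stack_ad` and the region positions in `DATA` are assumptions chosen so that a1 contains a2, which contains b1, which contains b2.

```python
def path_stack_ad(elements):
    """Evaluate //A//B over (name, tag, left, right) tuples sorted by `left`.

    Returns the compact encoding (each B with a pointer to the top of S[A])
    and the path matches expanded from that encoding.
    """
    stack_a = []   # S[A]: open A elements, outermost first, as (name, right)
    stack_b = []   # S[B]: open B elements, as (name, right, pointer into S[A])
    encoded, matches = [], []
    for name, tag, left, right in elements:
        # Pop every element that ended before the current one starts.
        while stack_a and stack_a[-1][1] < left:
            stack_a.pop()
        while stack_b and stack_b[-1][1] < left:
            stack_b.pop()
        if tag == "A":
            stack_a.append((name, right))
        elif tag == "B" and stack_a:
            pointer = len(stack_a) - 1  # inter-node edge to the top of S[A]
            stack_b.append((name, right, pointer))
            encoded.append((name, stack_a[pointer][0]))
            # Everything at or below the pointer in S[A] is an ancestor.
            matches.extend((a_name, name) for a_name, _ in stack_a[:pointer + 1])
    return encoded, matches


# Hypothetical positions: a1 [1,8] > a2 [2,7] > b1 [3,6] > b2 [4,5]
DATA = [("a1", "A", 1, 8), ("a2", "A", 2, 7), ("b1", "B", 3, 6), ("b2", "B", 4, 5)]
```

On `DATA`, two pointers (b1→a2, b2→a2) stand in for four path matches, which is the saving the slide refers to.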
Hierarchical Stack Encoding
• Inter-node: //A//B
– Can still use explicit edges
• Intra-node: A
– Matching elements form a tree structure as well
• Associate each query node with a hierarchical stack
– Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
• Matching can be determined once the entire sub-tree of e has been seen
• Requires post-order document traversal
[Figure: elements a1, a2, a3, a4 and the hierarchical stack HS[A]]
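The push rule can be illustrated with a short post-order traversal. This is a minimal sketch under stated assumptions, not the paper's stack-merging algorithm: it records only which elements qualify for each HS[E] (as flat lists, without the inter-node edges or the intra-node tree), it assumes every query tag occurs once in the query, and the document tree is a reconstruction of the deck's running example. The names `Elem`, `QNode`, and `bottom_up_match` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Elem:
    name: str
    tag: str
    children: List["Elem"] = field(default_factory=list)


@dataclass
class QNode:
    tag: str
    axis: str = "//"   # axis from the parent query node: "/" or "//"
    children: List["QNode"] = field(default_factory=list)


def flatten(q: QNode) -> List[QNode]:
    nodes = [q]
    for child in q.children:
        nodes.extend(flatten(child))
    return nodes


def bottom_up_match(root: Elem, query: QNode) -> Dict[str, List[str]]:
    """Return, per query tag, the elements that satisfy its sub-twig (post-order)."""
    qnodes = flatten(query)
    hs: Dict[str, List[str]] = {q.tag: [] for q in qnodes}

    def visit(e: Elem):
        direct, below = set(), set()   # tags matched by children / by any descendant
        for child in e.children:
            c_here, c_below = visit(child)
            direct |= c_here
            below |= c_below
        here = set()
        for q in qnodes:
            if q.tag != e.tag:
                continue
            # e is pushed only if every child query node is satisfied below e.
            if all((c.tag in direct) if c.axis == "/" else (c.tag in below)
                   for c in q.children):
                here.add(q.tag)
                hs[q.tag].append(e.name)
        return here, below | here

    visit(root)
    return hs


def example_document() -> Elem:
    """Assumed reconstruction of the running example's document tree."""
    b2 = Elem("b2", "B", [Elem("d2", "D"), Elem("c1", "C")])
    d1 = Elem("d1", "D", [b2])
    b1 = Elem("b1", "B", [d1, Elem("c2", "C")])
    a2 = Elem("a2", "A", [b1])
    b3 = Elem("b3", "B", [Elem("d3", "D")])
    return Elem("a1", "A", [a2, b3])


# Query //A/B with B//D and B//C
QUERY = QNode("A", "//", [QNode("B", "/", [QNode("D", "//"), QNode("C", "//")])])
```

Because each element is decided only after its whole sub-tree has been visited, b3 (no C below it) and a1 (no qualifying B child) are never pushed.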
Twig2Stack: Running Example
• Query: //A/B with branches B//D and B//C
• Document elements, as [left, right], level:
– a1 [1, 20], 1
– a2 [2, 15], 2
– b1 [3, 14], 3
– d1 [4, 11], 4
– b2 [5, 10], 5
– d2 [6, 7], 6
– c1 [8, 9], 6
– c2 [12, 13], 4
– b3 [16, 19], 2
– d3 [17, 18], 3
• Merging stacks: HS[A] = {a2}, HS[B] = {b1, b2}, HS[C] = {c1, c2}, HS[D] = {d1, d2, d3}
• TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
• Twig2Stack requires neither path joins nor path enumeration!
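The match counts quoted for TwigStack can be checked with the usual region-encoding tests. The sketch below is a brute-force illustration, not either algorithm: the assignment of element names to `[left, right], level` labels is reconstructed from the flattened figure and should be treated as an assumption, and the helper names are hypothetical.

```python
from itertools import product

# name -> (left, right, level); assumed reconstruction of the slide's figure
REGIONS = {
    "a1": (1, 20, 1), "a2": (2, 15, 2), "b1": (3, 14, 3), "d1": (4, 11, 4),
    "b2": (5, 10, 5), "d2": (6, 7, 6), "c1": (8, 9, 6), "c2": (12, 13, 4),
    "b3": (16, 19, 2), "d3": (17, 18, 3),
}


def is_ancestor(anc: str, desc: str) -> bool:
    """anc contains desc iff desc's interval lies strictly inside anc's."""
    a_left, a_right, _ = REGIONS[anc]
    d_left, d_right, _ = REGIONS[desc]
    return a_left < d_left and d_right < a_right


def is_parent(parent: str, child: str) -> bool:
    """Parent-child is ancestor-descendant plus a level difference of one."""
    return is_ancestor(parent, child) and REGIONS[child][2] == REGIONS[parent][2] + 1


def by_tag(prefix: str):
    return [name for name in REGIONS if name.startswith(prefix)]


def path_matches(leaf_prefix: str):
    """Enumerate matches of //A/B//X for X given by leaf_prefix ('d' or 'c')."""
    return [(a, b, x)
            for a, b, x in product(by_tag("a"), by_tag("b"), by_tag(leaf_prefix))
            if is_parent(a, b) and is_ancestor(b, x)]


def twig_matches():
    """Join the two path-match lists on their shared (A, B) prefix."""
    return {(a, b, d, c)
            for (a, b, d) in path_matches("d")
            for (a2, b2, c) in path_matches("c")
            if (a, b) == (a2, b2)}
```

Under this reconstruction the path (a1, b3, d3) is enumerated but never joins with a C match, which is exactly the kind of intermediate result the bottom-up encoding avoids producing.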
GTP Result Enumeration
• Bottom-up computation vs. top-down enumeration
– Visit only those elements that are in the twig matches
• Handling grouping results
– Automatic grouping through inter-node edges
• Handling duplicates and out-of-order results
– Problems come from non-return nodes
– If D is a return node while B is not
• b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
– Observation: the intra-node hierarchy provides hints
[Figure: hierarchical stacks with elements a4, b1, b2, c1, c2, d1, d2, d3]
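The duplicate problem and one possible use of the intra-node hint can be sketched as below. This is an assumed reading rather than the deck's stated method: it supposes that b2 sits beneath b1 in HS[B] and that B//D is a descendant axis, so everything reachable from b2 is already reachable from b1 and b2 can be skipped. The function and variable names are hypothetical.

```python
def naive_descendants(b_to_d, b_elements):
    """Concatenate the D results of every B element; duplicates are possible."""
    out = []
    for b in b_elements:
        out.extend(b_to_d[b])
    return out


def hierarchy_aware_descendants(b_to_d, stack_parent):
    """Enumerate only from B elements that have no ancestor in HS[B].

    Assumption: for a descendant axis, a nested B element's D results are a
    subset of its ancestor's, so skipping nested elements loses nothing.
    """
    out = []
    for b, ds in b_to_d.items():
        if stack_parent.get(b) is None:
            out.extend(ds)
    return out


# Edges from the slide: b1 -> d1, d2, d3 and b2 -> d2, d3
B_TO_D = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}
# Assumed intra-node hierarchy: b2 is nested inside b1 in HS[B]
STACK_PARENT = {"b1": None, "b2": "b1"}
```

The second function needs no sort or duplicate-elimination pass afterwards, because the duplicates are never generated.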
Experiment Setup
• Implementation
– Twig2Stack: Java 1.4.2
– TwigStack, TJFast: Java 1.4.2
• Kindly provided by Jiaheng Lu from the National University of Singapore (NUS)
• Datasets
– XMark, DBLP, TreeBank
• Metrics
– Query processing time
– IO time
Processing Full Twig Queries
• Optimization of query processing: TwigStack, Twig2Stack
• Optimization of IO: TJFast
Not yet done: Memory Usage
• Hierarchical stack encoding could hold the entire document in memory in the worst case
– Unlike the DOM approach, only matches need to be stored
• Tag match
• (Partial) twig match
• Predicate evaluation
• Early result enumeration dramatically reduces memory usage
– Enumerate query results before the end of the document and release the buffer
– Main idea: a hybrid of the top-down (PathStack) and bottom-up (Twig2Stack) approaches
Early Result Enumeration (ERM)
• Enumerate results and release the buffer when elements in the top branch node are popped from the PathStack
[Figure: the running example with PathStack stacks S[A], S[B], S[C], S[D] alongside hierarchical stacks HS[A], HS[B], HS[C], HS[D]]
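The timing of the buffer release can be illustrated with a small event simulation. This is only a sketch of the idea, not the deck's algorithm: it buffers every completed element rather than only twig matches, it takes B to be the top branch node of //A/B[.//D][.//C] (an assumed reading), it tracks elements above the branch node on a plain stack of open elements, and the document is an assumed reconstruction of the running example. The names `simulate`, `events_of`, and `DOC` are hypothetical.

```python
def events_of(tree):
    """Yield SAX-like (kind, name, tag) events for a (name, tag, children) tree."""
    name, tag, children = tree
    yield ("start", name, tag)
    for child in children:
        yield from events_of(child)
    yield ("end", name, tag)


def simulate(events, branch_tag=None, upper_tags=("A",)):
    """Return (peak buffer size, flushed batches) for one pass over the events."""
    open_upper = []    # top-down part: open elements above the branch node
    buffer = []        # bottom-up part: completed elements awaiting enumeration
    open_branch = 0    # number of currently open branch-node elements
    peak, batches = 0, []
    for kind, name, tag in events:
        if tag in upper_tags:
            if kind == "start":
                open_upper.append(name)
            else:
                open_upper.pop()
            continue
        if kind == "start":
            if tag == branch_tag:
                open_branch += 1
            continue
        buffer.append(name)             # element is complete at its end tag
        peak = max(peak, len(buffer))
        if tag == branch_tag:
            open_branch -= 1
            if open_branch == 0:        # outermost branch element just closed
                batches.append((list(open_upper), list(buffer)))
                buffer.clear()          # release the buffer early
    return peak, batches


DOC = ("a1", "A", [
    ("a2", "A", [
        ("b1", "B", [
            ("d1", "D", [("b2", "B", [("d2", "D", []), ("c1", "C", [])])]),
            ("c2", "C", []),
        ]),
    ]),
    ("b3", "B", [("d3", "D", [])]),
])
```

Running `simulate(list(events_of(DOC)), branch_tag="B")` flushes once when b1 closes and once when b3 closes, so the buffer never has to hold both sub-trees at the same time; with `branch_tag=None` nothing is released until the end.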
Memory Usage
• Small sub-tree: dblp, article, title, year
• Huge sub-tree: site, open_auctions, bid, reserve, bidder, increase
Conclusions and Future Work
• Proposed a bottom-up GTP processing solution
– A twig encoding scheme
– A GTP enumeration algorithm that avoids any post-processing operations
– A hybrid scheme to reduce memory usage
• Future directions
– Handling worst-case memory issues
– Optimizing IO cost by exploiting indexes
– Handling other axes, full XQuery, graph input
– Handling XML streams
– …
Processing GTP
• Optimization of non-return nodes
• Automatic grouping