- Number of slides: 157
Module 5: Implementation of XQuery (Rewrite, Indexes, Runtime System)
XQuery: a language at the crossroads
- Query languages
- Functional programming languages
- Object-oriented languages
- Procedural languages
- Some new features: context-sensitive semantics
- Processing XQuery has to learn from all those fields, plus innovate
XQuery processing: old and new
- Functional programming
  + Environment for expressions
  + Expressions nested with full generality
  + Lazy evaluation
  - Data model, schemas, type system, and query language
  - Contextual semantics for expressions
  - Side effects
  - Non-determinism in logic operations, others
  - Streaming execution
  - Logical/physical data mismatch, appropriate optimizations
- Relational query languages (SQL)
  + High-level construct (FLWOR/Select-From-Where)
  + Streaming execution
  + Logical/physical data mismatch and the appropriate optimizations
  - Data model, schemas, type system, and query language
  - Expressive power
  - Error handling
  - 2-valued logic
XQuery processing: old and new (cont.)
- Object-oriented query languages (OQL)
  + Expressions nested with full generality
  + Nodes with node/object identity
  - Topological order for nodes
  - Data model, schemas, type system, and query language
  - Side effects
  - Streaming execution
- Imperative languages (e.g. Java)
  + Side effects
  + Error handling
  - Data model, schemas, type system, and query language
  - Non-determinism for logic operators
  - Lazy evaluation and streaming
  - Logical/physical data mismatch and the appropriate optimizations
  - Possibility of handling large volumes of data
Major steps in XML Query processing
- [Figure: processing pipeline. Stages: Query, Parsing & Verification, Compilation, Code rewriting, Code generation. Representations: internal query/program representation, lower-level internal query representation, executable code. Also shown: data access pattern (APIs).]
(SQL) Query Processing 101
    SELECT * FROM Hotels h, Cities c WHERE h.city = c.name;
(SQL) Join Ordering
- Cost of a Cartesian product: n * m
  - n, m: sizes of the two input tables
- R x S x T; card(R) = card(T) = 1; card(S) = 10
  - (R x S) x T costs 10 + 10 = 20
  - (R x T) x S costs 1 + 10 = 11
- For queries with many joins, join ordering is responsible for orders-of-magnitude differences
  - Milliseconds vs. decades in response time
- How relevant is join ordering for XQuery?
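The arithmetic on this slide can be checked with a small sketch. It assumes the slide's cost model (a product of inputs with n and m rows costs n * m and produces n * m rows) and left-deep plans only; the function name `plan_cost` is made up for this illustration.

```python
from itertools import permutations

def plan_cost(order, card):
    """Total cost of a left-deep chain of Cartesian products.

    A product of inputs with n and m rows costs n * m and yields n * m rows.
    """
    rows = card[order[0]]
    total = 0
    for name in order[1:]:
        step = rows * card[name]  # cost of this product
        total += step
        rows = step               # size of the intermediate result
    return total

# Cardinalities from the slide.
card = {"R": 1, "S": 10, "T": 1}

cost_rs_t = plan_cost(("R", "S", "T"), card)  # (R x S) x T
cost_rt_s = plan_cost(("R", "T", "S"), card)  # (R x T) x S

# Exhaustive search over all left-deep orders.
costs = {order: plan_cost(order, card) for order in permutations(card)}
best = min(costs, key=costs.get)
```

Exhaustive enumeration is only feasible for a handful of inputs, which is why real optimizers rely on search strategies and cost estimation.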
(SQL) Query Rewrite
    SELECT * FROM A, B, C WHERE A.a = B.b AND B.b = C.c
  is transformed to
    SELECT * FROM A, B, C WHERE A.a = B.b AND B.b = C.c AND A.a = C.c
- Why is this transformation good (or bad)?
- How relevant is this for XQuery?
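A minimal sketch of how a rewriter could derive the extra predicate, assuming all predicates are plain column equalities (which are transitive in SQL). The union-find helper and the name `implied_equalities` are illustrative, not taken from any particular system.

```python
from itertools import combinations

def implied_equalities(predicates):
    """Return every column equality implied by transitivity."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in predicates:
        parent[find(a)] = find(b)  # merge the two equivalence classes

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), []).append(col)

    closure = set()
    for members in classes.values():
        for a, b in combinations(sorted(members), 2):
            closure.add((a, b))
    return closure

given = [("A.a", "B.b"), ("B.b", "C.c")]
closure = implied_equalities(given)
added = closure - set(given)  # predicates the rewrite adds
```

The added predicate gives the optimizer a direct A-C join to consider, at the price of one more (redundant) condition to evaluate.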
Code rewriting
- Code rewriting goals
  1. Reduce the level of abstraction
  2. Reduce the execution cost
- Code rewriting concepts
  - Code representation
    - db: algebras
  - Code transformations
    - db: rewriting rules
  - Cost transformation policy
    - db: search strategies
  - Code cost estimation
Code representation
- Is "algebra" the right metaphor? Or expressions? Annotated expressions? Automata?
- Standard algebra for XQuery?
  - Core algebra in the XQuery Formal Semantics
- Redundant algebra or not?
- Logical vs. physical algebra?
  - What is the "physical plan" for 1+1?
- Additional structures, e.g. dataflow graphs? Dependency graphs?
- See "Compiler Transformations for High-Performance Computing", Bacon, Graham, Sharp
Automata representation
- Path expressions: $x/chapter//section/title
- [Figure: automaton for the path, with transitions labelled chapter, *, section, title]
- [YFilter'03, Gupta'03, etc.]
TLC Algebra (Jagadish et al. 2004)
- XML query tree patterns (called twigs)
  - Annotated with predicates
- Tree matching as basic operation
  - Logical and physical operation
- Tree pattern matching => tuple bindings (i.e. relations)
- Tuples combined via classical relational algebra
  - Select, project, join, duplicate-elim., …
- [Figure: example twig pattern over nodes A, B, C, D, E]
XQuery Expressions (BEA implementation)
- Expressions built during parsing
- (Almost) 1-1 mapping between expressions in XQuery and internal ones
  - Differences: Match(expr, NodeTest) for path expressions
- Annotated expressions
  - E.g. unordered is an annotation
  - Annotations exploited during optimization
- Redundant algebra
  - E.g. general FLWR, but also LET and MAP
  - E.g. typeswitch, but also instanceof and conditionals
- Support for dataflow analysis is fundamental
Expressions
- Constants, Complex Constants
- IfThenElseExpr, CastExpr, InstanceOfExpr, TreatExpr
- Variable, ForLetVariable, Parameter, CountVariable, ExternalVariable
Expressions (cont.)
- NodeConstructor
- FirstOrderExpressions, FunctParamCast, MatchExpr
- SecondOrderExpr, SortExpr, CreateIndexExpr, FLWRExpr, MapExpr, LetExpr, QuantifiedExpr
Expression representation example
    for $line in $doc/Order/OrderLine where xs:integer(fn:data($line/SellersID)) eq 1 where $line/SellersID eq 1 return
Dataflow Analysis
- Annotate each operator (attribute grammars)
  - Type of output (e.g., BookType*)
  - Is output sorted? Does it contain duplicates?
  - Has output node ids? Are node ids needed?
- Annotations computed in walks through the plan
  - Intrinsic: e.g., preserves sorting
  - Synthetic: e.g., type, sorted
  - Inherited: e.g., node ids are required
- Optimizations based on annotations
  - Eliminate redundant sort operators
  - Avoid generation of node ids in streaming apps
Dataflow Analysis: Static Type
(operators of the plan from top to bottom, each with its static output type)
- Match("book"): elem book of BookType
- FO:children: elem book of BookType or elem thesis of BookT
- FO:children: elem bib of BibType
- validate as "bib.xsd": doc of BibType
- doc("bib.xml"): item*
XQuery logical rewritings
- Algebraic properties of comparisons
- Algebraic properties of Boolean operators
- LET clause folding and unfolding
- Function inlining
- FLWOR nesting and unnesting
- FOR clauses minimization
- Constant folding
- Common sub-expressions factorization
- Type-based rewritings
- Navigation-based rewritings
- "Join ordering"
(SQL) Query Rewrite
    SELECT A.a FROM A WHERE A.a IN (SELECT x FROM X);
  is transformed to (assuming x is a key):
    SELECT A.a FROM A, X WHERE A.a = X.x
- Why is this transformation good (or bad)?
- When can this transformation be applied?
Algebraic properties of comparisons
- General comparisons are not reflexive and not transitive
  - (1, 3) = (1, 2) (but also !=, <, >, <=, >= !)
  - Reasons: implicit existential quantification, dynamic casts
- Negation rule does not hold
  - fn:not($x = $y) is not equivalent to $x != $y
- Value comparisons are almost transitive
  - Exception: xs:decimal, due to the loss of precision
- Impact on grouping, hashing, indexing, caching!
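The existential semantics behind these observations can be mimicked in a few lines. This is a deliberately simplified sketch: sequences are Python tuples of integers, and the dynamic casts that XQuery also performs are ignored. The names `general_eq` and `general_ne` are invented for the illustration.

```python
def general_eq(xs, ys):
    """XQuery-style '=': true if SOME pair of items is equal."""
    return any(x == y for x in xs for y in ys)

def general_ne(xs, ys):
    """XQuery-style '!=': true if SOME pair of items differs."""
    return any(x != y for x in xs for y in ys)

# (1, 3) = (1, 2) and (1, 3) != (1, 2) both hold.
both_hold = general_eq((1, 3), (1, 2)) and general_ne((1, 3), (1, 2))

# Not transitive: a = b and b = c, yet a = c fails.
a, b, c = (1, 2), (2, 3), (3, 4)
transitive_here = (not (general_eq(a, b) and general_eq(b, c))) or general_eq(a, c)

# Not reflexive: the empty sequence is not equal to itself.
reflexive_on_empty = general_eq((), ())

# Negation rule fails: not(x = y) differs from x != y.
negation_rule_holds = (not general_eq((1, 3), (1, 2))) == general_ne((1, 3), (1, 2))
```

Any hash table, index or cache keyed on such a comparison would need an equivalence relation, which is exactly what these counterexamples rule out.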
What is a correct rewriting
- E1 -> E2 is a legal rewriting iff
  - Type(E2) is a subtype of Type(E1)
  - FreeVar(E2) is a subset of FreeVar(E1)
  - For any binding of free variables:
    - If E1 must return an error (according to the semantics), then E2 must return an error (not necessarily the same error)
    - If E1 can return a value (non-error), then E2 must return a value among the values accepted for E1, or an error
- Note: XQuery is non-deterministic
- This definition allows the rewrite E1 -> ERROR
  - Trust your vendor: she does not do that for all E1
Properties of Boolean operators
- Among the most useful logical rewritings: PCNF and PDNF
- And, or are commutative & allow short-circuiting
  - For optimization purposes
- But they are non-deterministic
  - Surprise for some programmers :(
  - if (($x castable as xs:integer) and (($x cast as xs:integer) eq 2)) …
- 2-valued logic
  - () is converted into fn:false() before use
- Conventional distributivity rules for and, not, or do hold
LET clause folding n Traditional FP rewriting let $x : = 3 3+2 return $x +2 n Not so easy ! let $x : = (, ) NO. Side effects. (Node identity) return ($x, $x ) declare namespace ns=“uri 1” NO. Context sensitive let $x : =
LET clause folding (cont.)
- Impact of unordered{..} (context sensitive)
    let $x := ($y/a/b)[1] return unordered { $x/c }
  - the c's of a specific b parent (in no particular order)
  is not equivalent to
    unordered { ($y/a/b)[1]/c }
  - the c's of "some" b (in no particular order)
LET clause folding: fixing the node construction problem
- Sufficient conditions for rewriting
    (: before LET :) let $x := expr1 (: after LET :) return expr2
  into
    (: before LET :) (: after LET :) return expr2'
  where expr2' is expr2 with substitution {$x/expr1}:
  - expr1 never generates new nodes in the result, OR
  - $x is used (a) only once and (b) not as part of a loop and (c) not as input to a recursive function
- Dataflow analysis required
LET clause folding: fixing the namespace problem
- Context sensitivity for namespaces
  1. Namespace resolution during query analysis
  2. Namespace resolution during evaluation
- (1) is not a problem if:
  - Query rewriting is done after namespace resolution
- (2) could be a serious problem (***)
  - XQuery avoided it for the moment
  - Restrictions on context-sensitive operations like string -> QName casting
LET clause unfolding
- Traditional rewriting
    for $x in (1 to 10) return ($input+2)+$x
  =>
    let $y := ($input+2) for $x in (1 to 10) return $y+$x
- Not so easy!
  - Same problems as above: side effects, NS handling and unordered/ordered{..}
  - Additional problem: error handling
      for $x in (1 to 10) return if ($x lt 1) then ($input idiv 0) else $x
    =>
      let $y := ($input idiv 0) for $x in (1 to 10) return if ($x lt 1) then $y else $x
- Guaranteed only if the runtime consistently implements lazy evaluation. Otherwise dataflow analysis and error analysis are required.
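The error-handling problem can be reproduced outside XQuery. The sketch below transliterates the `idiv 0` example into Python, with `//` standing in for `idiv` and a hypothetical `Lazy` wrapper standing in for a runtime that evaluates LET bindings on demand.

```python
class Lazy:
    """Evaluate a thunk at most once, on first use."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._done = False
        self._value = None

    def get(self):
        if not self._done:
            self._value = self._thunk()
            self._done = True
        return self._value

def original(inp):
    # The division sits in a branch that is never taken for x in 1..10.
    return [(inp // 0) if x < 1 else x for x in range(1, 11)]

def hoisted_eager(inp):
    y = inp // 0  # evaluated up front: raises although never needed
    return [y if x < 1 else x for x in range(1, 11)]

def hoisted_lazy(inp):
    y = Lazy(lambda: inp // 0)  # nothing is evaluated yet
    return [y.get() if x < 1 else x for x in range(1, 11)]

try:
    hoisted_eager(5)
    eager_failed = False
except ZeroDivisionError:
    eager_failed = True
```

With the lazy binding the hoisted query returns the same result as the original; with the eager binding it introduces an error the original never raises.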
Function inlining
- Traditional FP rewriting technique
    define function f($x as xs:integer) as xs:integer {$x+1}
    f(2)   =>   2+1
- Not always!
  - Same problems as for LET (NS handling, side effects, unordered{…})
  - Additional problems: implicit operations (atomization, casts)
      define function f($x as xs:double) as xs:boolean {$x instance of xs:double}
      f(2)   =>   (2 instance of xs:double)   NO
- Make sure this rewriting is done after normalization
FLWR unnesting
- Traditional database technique
    for $x in (for $y in $input/a/b where $y/c eq 3 return $y/d) where $x/e eq 4 return $x
  =>
    for $y in $input/a/b, $x in $y/d where ($x/e eq 4) and ($y/c eq 3) return $x
- Problem simpler than in OQL/ODMG
  - No nested collections in XML
- Order-by, count variables and unordered{…} limit the applicability
FLWR unnesting (cont.)
- Another traditional database technique
    for $x in $input/a/b where $x/c eq 3 return (for $y in $x/d where $x/e eq 4 return $y)
  =>
    for $x in $input/a/b, $y in $x/d where ($x/e eq 4) and ($x/c eq 3) return $y
- Same comments apply
FOR clauses minimization n Yet another useful rewriting technique for $x in $input/a/b, for $x in $input/a/b $y in $input/c where ($x/d eq 3) return $input/c/e return $y/e for $x in $input/a/b, for $x in $input/a/b $y in $input/c where $x/d eq 3 and $input/c/f eq 4 NO where $x/d eq 3 and $y/f eq 4 return $input/c/e return $y/e NO for $x in $input/a/b for $x $input/a/b $y in $input/c where ($x/d eq 3) return
Constant folding
- Yet another traditional technique
    for $x in (1 to 10) where $x eq 3 return $x+1
    =>  for $x in (1 to 10) where $x eq 3 return (3+1)   YES
    for $x in $input/a where $x eq 3 return {$x}
    =>  for $x in $input/a where $x eq 3 return {3}   NO
    for $x in (1.0, 2.0, 3.0) where $x eq 1 return ($x instance of xs:integer)
    =>  for $x in (1.0, 2.0, 3.0) where $x eq 1 return (1 instance of xs:integer)   NO
Common sub-expression factorization
- Preliminary questions
  - Same expression?
  - Same context?
  - Error "equivalence"?
  - Create the same new nodes?
    for $x in $input/a/b where $x/c lt 3
    return if ($x/c lt 2)
           then if ($x/c eq 1) then (1 idiv 0) else $x/c+1
           else if ($x/c eq 0) then (1 idiv 0) else $x/c+2
  =>
    let $y := (1 idiv 0)
    for $x in $input/a/b where $x/c lt 3
    return if ($x/c lt 2)
           then if ($x/c eq 1) then $y else $x/c+1
           else if ($x/c eq 0) then $y else $x/c+2
Type-based rewritings
- Type-based optimizations:
  - Increase the advantages of lazy evaluation
    - $input/a/b/c  =>  ((($input/a)[1]/b[1])/c)[1]
  - Eliminate the need for expensive operations (sort, dup-elim)
    - $input//a/b  =>  $input/c/d/a/b
  - Static dispatch for overloaded functions
    - e.g. min, max, avg, arithmetics, comparisons
    - Maximizes the use of indexes
  - Elimination of no-operations
    - e.g. casts, atomization, boolean effective value
  - Choice of various run-time implementations for certain logical operations
Dealing with backwards navigation n Replace backwards navigation with forward navigation for $x in $input/a/b for $y in $input/a, return
More compiler support for efficient execution
- Streaming vs. data materialization
- Node identifiers handling
- Document order handling
- Scheduling for parallel execution
- Projecting input data streams
When should we materialize?
- Traditional operators (e.g. sort)
- Other conditions:
  - Whenever a variable is used multiple times
  - Whenever a variable is used as part of a loop
  - Whenever the content of a variable is given as input to a recursive function
  - In case of backwards navigation
- Those are the ONLY cases
- In most cases, materialization can be partial and lazy
- Compiler can detect those cases via dataflow analysis
How can we minimize the use of node identifiers?
- Node identifiers are required by the XML data model but onerous (time, space)
- Solution:
  1. Decouple the node construction operation from the node id generation operation
  2. Generate node ids only if really needed
     - Only if the query contains (after optimization) operators that need node identifiers (e.g. sort by doc order, is, parent, <<), OR node identifiers are required for the result
- Compiler support: dataflow analysis
How can we deal with path expressions?
- Sorting by document order and duplicate elimination are required by the XQuery semantics but very expensive
- Semantic conditions
  - $document/a/b/c
    - Guaranteed to return results in doc order and not to contain duplicates
  - $document/a//b
    - Guaranteed to return results in doc order and not to contain duplicates
  - $document//a/b
    - NOT guaranteed to return results in doc order, but guaranteed not to contain duplicates
  - $document//a//b and $document/a/../b
    - Nothing can be said in general
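These guarantees can be observed on a tiny tree. The sketch below evaluates each step naively (concatenating per-node results, with no sorting and no duplicate elimination) over a hypothetical document `<a><a><b/></a><b/></a>`; the `Node` class and step functions are invented for the illustration.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.pre = None  # pre-order number = position in document order

def number(root):
    """Assign pre-order numbers to all nodes."""
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        node.pre = counter
        counter += 1
        stack.extend(reversed(node.children))
    return root

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def child_step(nodes, name):
    return [c for n in nodes for c in n.children if c.name == name]

def descendant_step(nodes, name):
    return [d for n in nodes for d in descendants(n) if d.name == name]

# Document root > a(pre 1) > [a(pre 2) > b(pre 3), b(pre 4)]
doc = number(Node("doc", [Node("a", [Node("a", [Node("b")]), Node("b")])]))

def pres(nodes):
    return [n.pre for n in nodes]

child_child = pres(child_step(child_step([doc], "a"), "b"))            # /a/b
child_desc = pres(descendant_step(child_step([doc], "a"), "b"))        # /a//b
desc_child = pres(child_step(descendant_step([doc], "a"), "b"))        # //a/b
desc_desc = pres(descendant_step(descendant_step([doc], "a"), "b"))    # //a//b
```

Only the last two paths need a sort and/or duplicate elimination here, which is the kind of fact a compiler can derive statically from the axes alone.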
Parallel execution
    ns1:WS1($input) + ns2:WS2($input)
    for $x in (1 to 10) return ns:WS($x)
- Obviously, certain subexpressions of an expression can (and should...) be executed in parallel
  - Scheduling based on data dependency
  - Horizontal and vertical partitioning
  - Interaction between errors and parallelism
- See David J. DeWitt, Jim Gray: Parallel Database Systems: The Future of High Performance Database Systems.
XQuery expression analysis
- How many times does an expression use a variable?
- Is an expression using a variable as part of a loop?
- Is an expression a map on a certain variable?
- Is an expression guaranteed to return results in doc order?
- Is an expression guaranteed to return (node-)distinct results?
- Is an expression a "function"?
- Can the result of an expression contain newly created nodes?
- Is the evaluation of an expression context-sensitive?
- Can an expression raise user errors?
- Is a subexpression of an expression guaranteed to be executed?
- Etc.
Compiling XQuery vs. XSLT
- Empirical assertion: it depends on the entropy level in the data (see M. Champion, xml-dev):
  - XSLT is easier to use if the shape of the data is totally unknown (entropy high)
  - XQuery is easier to use if the shape of the data is known (entropy low)
- Dataflow analysis is possible in XQuery, much harder in XSLT
  - Static typing, error detection, lots of optimizations
- Conclusion: less entropy means more potential for optimization, unsurprisingly.
Data Storage and Indexing
Questions to ask for XML data storage
- What actions are done with XML data?
- Where does the XML data live?
- How is the XML data processed?
- In which granularity is XML data processed?
- There is no one-size-fits-all solution!?! (This is an open research question.)
What?
- Possible uses of XML data
  - ship (serialize)
  - validate
  - query
  - transform (create new XML data)
  - update
  - persist
- Example:
  - UNICODE reasonably good to ship XML data
  - UNICODE terrible to query XML data
Where?
- Possible locations for XML data
  - wire (XML messages)
  - main memory (intermediate query results)
  - disk (database)
  - mobile devices
- Example:
  - Compression great for wire and mobile devices
  - Compression not good for main memory (?)
How?
- Alternative ways to process XML data
  - materialized, all or nothing
  - streaming (on demand)
  - anything in between
- Examples:
  - trees good for materialization
  - trees bad for stream-based processing
Granularity?
- Possible granularities for data processing:
  - documents
  - items (nodes and atomic values)
  - tokens (events)
  - bytes
- Example:
  - tokens good for fine granularity (items)
  - tokens bad for whole documents
Scenario I: XML Cache n Cache XHTML pages or results of Web Service calls ship yes wire validate maybe m. -m. query no disk yes materialize yes stream maybe yes docs/ granularity items transform maybe update no 52
Scenario II: Message Broker
- Route messages according to simple XPath rules
- Do simple transformations
    ship       yes     wire   yes     materialize  no
    validate   yes     m.-m.  yes     stream       yes
    query      yes     disk   no      granularity  docs
    transform  yes
    update     no
Scenario III: XQuery Processor
- apply complex functions
- construct query results
    ship       no      wire   yes     materialize  yes
    validate   yes     m.-m.  yes     stream       yes
    query      yes     disk   maybe   granularity  item
    transform  yes
    update     no
Scenario IV: XML Database n Store and archive XML data ship yes wire validate yes m. -m. query yes disk no materialize yes stream yes granularity collection ? transform yes update yes 55
Object Stores vs. XML Stores
- Similarities
  - nodes are like objects
  - identifiers to access data
  - support for updates
- Differences
  - XML: tree not graph
  - XML: everything is ordered
  - XML: streaming is essential
  - XML: dual representation (lexical + binary)
  - XML: data is context-sensitive
XML Data Representation Issues
- Data model issues
  - InfoSet vs. PSVI vs. XQuery data model
- Storage structures: basic issues
  1. Lexical-based vs. typed-based vs. both
  2. Node identifiers support
  3. Context-sensitive data (namespaces, base-uri)
  4. Data + order: separate or intermixed
  5. Data + metadata: separate or intermixed
  6. Data + indexes: separate or intermixed
  7. Avoiding data copying
- Storage alternatives: trees, arrays, tables
- Indexing
- APIs
- Storage optimizations
  - compression?, pooling?, partitioning?
Lexical vs. Type-based
- The data model requires both properties, but allows storing only one and computing the other
- Functional dependencies
  - string + type annotation -> typed value
  - value + type annotation -> schema-normalized string
- Example
  - "0001" + xs:integer -> 1
  - 1 + xs:integer -> "1"
- Tradeoffs:
  - Space vs. accuracy
  - Redundancy: cost of updates
  - Indexing: restricted applicability
Node Identifiers Considerations
- XQuery Data Model requirements
  - identify a node uniquely (implements identity)
  - lives as long as the node lives
  - robust to updates
- Identifiers might include additional information
  - Schema/type information
  - Document order
  - Parent/child relationship
  - Ancestor/descendant relationship
  - Document information
- Required for indexes
Simple Node Identifiers
- Examples:
  - Alternative 1 (data: trees)
    - id of document (integer)
    - pre-order number of node in document (integer)
  - Alternative 2 (data: plain text)
    - file name
    - offset in file
- Encode document ordering (Alternative 1)
  - identity: doc1 = doc2 AND pre1 = pre2
  - order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)
- Not robust to updates
- Not able to answer more complex queries
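Alternative 1 maps directly onto a pair with lexicographic comparison. A minimal sketch, assuming integer document ids and pre-order numbers; the `NodeId` type is hypothetical and relies on Python's built-in tuple ordering, which matches the identity and order rules on the slide.

```python
from typing import NamedTuple

class NodeId(NamedTuple):
    doc: int  # id of the document
    pre: int  # pre-order number of the node within that document

a = NodeId(doc=1, pre=7)
b = NodeId(doc=1, pre=9)
c = NodeId(doc=2, pre=0)

# identity: doc1 = doc2 AND pre1 = pre2
same_node = a == NodeId(1, 7)

# order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)
in_doc_order = sorted([c, b, a])
```

The weakness noted on the slide is visible here too: inserting a node before `a` would shift the pre-order numbers of every later node in the document.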
Dewey Order (Tatarinov et al. 2002)
- Idea:
  - Generate surrogates for each path
  - 1.2.3 identifies the third child of the second child of the first child of the given root
- Assessment:
  - good: order comparison, ancestor/descendant easy
  - bad: updates expensive, space overhead
- Improvement: ORDPath bit encoding, O'Neil et al. 2004 (Microsoft SQL Server)
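Why order comparison and ancestor tests are "easy" with Dewey identifiers can be shown in a short sketch. It assumes identifiers are dot-separated positive integers; the helper names are made up for the example.

```python
def parse(dewey):
    """'1.2.3' -> (1, 2, 3); components must be compared as integers."""
    return tuple(int(part) for part in dewey.split("."))

def is_ancestor(anc, desc):
    """anc is a proper ancestor of desc iff it is a proper prefix of it."""
    a, d = parse(anc), parse(desc)
    return len(a) < len(d) and d[:len(a)] == a

def doc_order(ids):
    """Tuple comparison puts a prefix (ancestor) before its extensions."""
    return sorted(ids, key=parse)

ordered = doc_order(["1.2.1.3", "1.10", "1.2", "1", "1.2.1.1"])
```

Comparing the raw strings instead of the parsed tuples would be wrong: "1.10" sorts before "1.2" as text, and "1.2" is a string prefix of "1.21" without being its ancestor.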
Example: Dewey Order
- [Figure: tree of person, name, child and hobby nodes labelled with Dewey identifiers 1, 1.1, 1.2, 1.2.1, 1.2.1.1, 1.2.1.3]
XML Storage Alternatives
- Plain text (UNICODE)
- Trees with random access
- Binary XML / arrays of events (tokens)
- Tuples (e.g., mapping to RDBMS)
Plain Text
- Use XML standards to encode data
- Advantages:
  - simple, universal
  - indexing possible
- Disadvantages:
  - need to re-parse (re-validate) all the time
  - no compliance with XQuery data model (collections)
  - not an option for XQuery processing
Trees n XML data model uses tree semantics use Trees/Forests to represent XML instances n annotate nodes of tree with data model info n n Example
Trees
- Advantages
  - natural representation of XML data
  - good support for navigation, updates
  - index built into the data structure
  - compliance with DOM standard interface
- Disadvantages
  - difficult to use in streaming environment
  - difficult to partition
  - high overhead: mixes indexes and data
  - index everything
- Example: DOM, others
- Lazy trees possible: minimize I/Os, able to handle large volumes of data
Natix (trees on disk)
- Each sub-tree is stored in a record
- Store records in blocks as in any database
- If a record grows beyond the size of a block: split
- Split: establish proxy nodes for subtrees
- Technical details:
  - use B-trees to organize space
  - use special concurrency & recovery techniques
Natix
Binary XML as a flat array of "events"
- Linear representation of XML data
  - pre-order traversal of XML tree
- Node -> array of events (or tokens)
  - tokens carry the data model information
- Advantages
  - good support for stream-based processing
  - low overhead: separate indexes from data
  - logical compliance with SAX standard interface
- Disadvantages
  - difficult to debug, difficult programming model
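A rough sketch of the linearization, using Python's standard `xml.etree.ElementTree` only to obtain a tree to traverse. The token names (`BeginElement`, `Attribute`, `Text`, `EndElement`) and the sample order document are assumptions for this illustration; real token streams also carry type annotations and node ids.

```python
import xml.etree.ElementTree as ET

def tokens(elem):
    """Emit a flat token sequence by pre-order traversal of the tree."""
    yield ("BeginElement", elem.tag)
    for name, value in elem.attrib.items():
        yield ("Attribute", name, value)
    if elem.text and elem.text.strip():
        yield ("Text", elem.text.strip())
    for child in elem:
        yield from tokens(child)
        if child.tail and child.tail.strip():
            yield ("Text", child.tail.strip())  # text following the child
    yield ("EndElement", elem.tag)

doc = ET.fromstring(
    '<order id="4711"><date>2003-08-19</date><item>Shoes</item></order>'
)
stream = list(tokens(doc))
```

Because `tokens` is a generator, a consumer can pull events one at a time and stop early, which is the property that makes this representation attractive for streaming.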
Example Binary XML as an array of tokens xml version=„ 1. 0“>
No Schema Validation (no „ “) xml version=„ 1. 0“> Begin. Document() Begin. Element(„order“, „xs: untyped. Any“, 1)
Schema Validation (no „ “) Begin. Document() xml version=„ 1. 0“>
Binary XML
- Discussion as part of the W3C
- Processing XML is only one of the target goals
- Other goals:
  - Data compression for transmission: WS, mobile
- Open questions today: can we achieve all goals with a single solution? Will it be disruptive?
- Data model questions: Infoset or XQuery Data Model?
- Is streaming a strict requirement or not?
- More to come in the next months/years.
Compact Binary XML in Oracle
- Binary serialization of XML Infoset
  - Significant compression over textual format
  - Used in all tiers of the Oracle stack: DB, iAS, etc.
- Tokenizes XML tag names, namespace URIs and prefixes
  - Generic token table used by binary XML, XML index and in-memory instances
- (Optionally) exploits schema information for further optimization
  - Encode values in native format (e.g. integers and floats)
  - Avoid tokens when order is known
  - For fully structured XML (relational), format very similar to current row format (continuity of storage!)
- Provides for schema versioning / evolution
  - Allows any backwards-compatible schema evolution, plus a few incompatible changes, without data migration
XML Data represented as tuples
- Motivation: use an RDBMS infrastructure to store and process the XML data
  - transactions
  - scalability
  - richness and maturity of RDBMS
- Alternative relational storage approaches:
  - Store XML as Blob (text, binary)
  - Generic shredding of the data (edge, binary, …)
  - Map XML schema to relational schema
  - Binary (new) XML storage integrated tightly with the relational processor
Mapping XML to tuples
- External to the relational engine
  - Use when:
    - The structure of the data is relatively simple and fixed
    - The set of queries is known in advance
  - Processing involves hand-written SQL queries + procedural logic
  - Frequently used, but not advantageous
    - Very expensive (performance and productivity)
    - Server communication for every single data fetch
    - Very limited solution
- Internally by the relational engine
  - A whole tutorial in Sigmod'05
XML Example
Edge Approach (Florescu & Kossmann 99) Edge Table Sourc e 0 0 4711 666 Label Target person name child 4711 666 v 1 i 314 v 2 i 314 Value Table (String) Id v 1 v 2 v 3 Value Lilly Potter James Potter Harry Potter Value Table (Integer) Id v 4 Value 12 79
Binary Approach
- Partition the Edge table by label
    Person table          Name table            Child table           Age table
    Source  Target        Source  Target        Source  Target        Source  Target
    0       4711          4711    v1            4711    i314          314     v4
    0       666           666     v2            666     i314
    i314    314           314     v3
Tree Encoding (Grust 2004)
- For every node of the tree, keep info
  - pre: pre-order number
  - size: number of descendants
  - level: depth of node in tree
  - kind: element, attribute, name space, …
  - prop: name and type
  - frag: document id (forests)
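The pre/size/level columns can be computed in one traversal. The sketch below is a simplification under stated assumptions: it handles element nodes only (no attributes, text or namespaces), omits `frag`, and uses a made-up person/child document parsed with the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Return one row per element: pre, size, level, kind, prop."""
    rows = []

    def visit(elem, level):
        pre = len(rows)  # next free pre-order number
        rows.append({"pre": pre, "size": 0, "level": level,
                     "kind": "elem", "prop": elem.tag})
        for child in elem:
            visit(child, level + 1)
        # Everything appended since 'pre' is a descendant of this node.
        rows[pre]["size"] = len(rows) - pre - 1

    visit(root, 0)
    return rows

def descendants(rows, pre):
    """Descendants form the contiguous range pre+1 .. pre+size."""
    return rows[pre + 1: pre + 1 + rows[pre]["size"]]

doc = ET.fromstring(
    "<person><name/><child><person><name/></person></child></person>"
)
table = encode(doc)
under_child = [row["prop"] for row in descendants(table, 2)]
```

The payoff of storing `size` is that a descendant step becomes a range scan over `pre`, which a relational engine can answer with an ordinary B-tree index.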
Example: Tree Encoding pre size level kind prop elem person fra g 0 6 0 1 2 0 1 elem name 0 3 3 1 elem child 0 … … … 0 0 3 0 attr id elem person 0 0 1 82
XML Triple (R. Bayer 2003)
    Path               Surrogate   Value
    Author[1]/FN[1]    2.1.1.1     Rudolf
    Author[1]/LN[1]    2.1         Bayer
DTD -> RDB Mapping (Shanmugasundaram et al. 1999)
- Idea: translate DTDs into relations
  - Element types -> tables
  - Attributes -> columns
  - Nesting (= relationships) -> tables
  - "Inlining" reduces fragmentation
- Special treatment for recursive DTDs
- Surrogates as keys of tables
- (Adaptations for XML Schema possible)
DTD Normalisation
- Simplify DTDs
    (e1, e2)* -> e1*, e2*
    (e1, e2)? -> e1?, e2?
    (e1 | e2) -> e1?, e2?
    e1** -> e1*
    e1*? -> e1*
    e1?? -> e1?
    ..., a*, ... -> a*, ...
- Background
  - regular expressions
  - ignore order (in RDBMS)
  - generalized quantifiers (be less specific)
Example 86
Example: Relation "book"
    book(bookID, book.price, book.title, book.author.fname, book.author.lname, book.author.age)
Example: Relation "article"
    article(artID, art.title)
    artAuthor(artAuthorID, art.author.fname, art.author.lname, art.author.age)
Example (continued)
- Represent each element as a relation
  - an element might be the root of a document
    title(titleId, title)
    author(authorId, author.age, author.fname, author.lname)
    fname(fnameId, fname)
    lname(lnameId, lname)
Recursive DTDs
    book(bookId, book.title, book.author.name)
    author(authorId, author.name)
    author.book(author.bookId, author.book.title)
XML Data Representation Issues
- Data model issues
  - InfoSet vs. PSVI vs. XQuery data model
- Storage structures issues
  1. Lexical-based vs. typed-based vs. both
  2. Node identifiers support
  3. Context-sensitive data (namespaces, base-uri)
  4. Order support
  5. Data + metadata: separate or intermixed
  6. Data + indexes: separate or intermixed
  7. Avoiding data copying
- Storage alternatives: trees, arrays, tables
- Storage optimizations
  - compression?, pooling?, partitioning?
- Data access APIs
XML APIs: an overview
- DOM (any XML application)
- SAX (low-level XML processing)
- JSR 173 (low-level XML processing)
- TokenIterator (BEA, low-level XML processing)
- XQJ / JSR 225 (XML applications)
- Microsoft XMLReader Streaming API
1. For reasonable performance, the data storage, the data APIs and the execution model have to be designed together!
2. For composability reasons the runtime operators (i.e. output data) should implement the same API as the input data.
Classification Criteria
- Navigational access?
- Random access (by node id)?
- Decouple navigation from data reads?
- If streaming: push or pull?
- Updates?
- Infoset or XQuery Data Model?
- Target programming language?
- Target data consumer? Application vs. query processor
Decoupling
- Idea:
  - methods to navigate through data (XML tree)
  - methods to read properties at current position (node)
- Example: DOM (tree-based model)
  - navigation: firstChild, parentNode, nextSibling, …
  - properties: nodeName, getNamedItem, …
  - (updates: createElement, setNamedItem, …)
- Assessment:
  - good: read parts of document, integrate existing stores
  - bad: materialize temp. query results, transformations
Non Decoupling
- Idea:
  - Combined navigation + read properties
  - Special methods for fast forward, reverse navigation
- Example: BEA's TokenIterator (token stream)
    Token getNext(), void skipToNextNode(), …
- Assessment:
  - good: fewer method calls, stream-based processing
  - good: integration of data from multiple sources
  - bad: difficult to wrap existing XML data sources
  - bad: reverse navigation tricky, difficult programming model
Classification of APIs DM Nav. Rand. Decp. Upd. Platf. DOM Info. Set yes no yes - SAX Info. Set no no Java JSR 173 Info. Set (no) no yes no Java Tok. Iter XQuery (no) no no no Java XQJ XQuery yes yes Java MS Info. Set (no) no yes no . Net 97
Classification (Compression)
- XML specific?
- Queryable?
- (Updateable?)
Compression
- Classic approaches: e.g., Lempel-Ziv, Huffman
  - decompress before queries
  - miss special opportunities to compress XML structure
- XMill: Liefke & Suciu 2000
  - Idea:
    - separate data and structure -> reduce entropy
    - separate data of different types -> reduce entropy
    - specialized compression algorithms for structure and data types
  - Assessment:
    - Very high compression rates for documents > 20 KB
    - Decompress before query processing (bad!)
    - Indexing the data not possible (or difficult)
XMill Architecture
- [Figure: XML -> Parser -> Path Processor -> containers (Cont. 1 to Cont. 4) -> compressors -> compressed XML]
Xmill Example
Querying Compressed Data (Buneman, Grohe & Koch 2003)
- Idea:
  - extend XMill
  - special compression of skeleton
  - lower compression rates, but no decompression for XPath expressions
- [Figure: uncompressed skeleton (bib with book, title, auth. nodes) next to its compressed form with shared subtrees and multiplicity 2]
XML indexing No indexes, no performance n Indexing and storage: common design n Indexing and query compiler: common design n Different kind of indexes possible n Like in the storage case: there is no one size fits all n n it all depends on the use case scenario: type of queries, volume of data, volume of queries, etc 105
Kinds of Indexes 1. Value Indexes n n n 2. Structure Indexes n n 3. materialize results of path expressions (pendant to Rel. join indexes, OO path indices) Full text indexes n n n index atomic values; e. g. , //emp/salary/fn: data(. ) use B+ trees (like in relational world) (integration into query optimizer more tricky) Keyword search, inverted files (IR world, text extenders) Any combination of the above 106
Value Indexes: Design Considerations
- What is the domain of the index? (Physical Design)
  - All database
  - Document by document
  - Collection
- What is the key of the index? (Physical Design)
  - e.g., //emp/salary/fn:data(.), //emp/salary/fn:string(.)
  - singletons vs. sequences
  - string vs. typed-value
  - which type? homogeneous vs. heterogeneous domains
  - composite indexes and errors
- Index for what comparison? (Physical Design)
  - =: problematic due to implicit cast + exists
  - eq, le, …: less problematic
- When is a value index applicable? (Compiler)
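Why the choice between string keys and typed keys matters can be seen in a small Python sketch. The `ValueIndex` class is hypothetical; a sorted list with binary search stands in for a B+ tree, and the salary values are made up for illustration.

```python
import bisect


class ValueIndex:
    """Sorted (key, node_id) pairs; a stand-in for a B+ tree."""

    def __init__(self, entries, key_type=str):
        # key_type decides whether keys compare as strings or as typed values
        self.key_type = key_type
        self.entries = sorted((key_type(value), node_id) for value, node_id in entries)
        self.keys = [key for key, _ in self.entries]

    def lookup_eq(self, value):
        """Node ids whose key equals the given value."""
        key = self.key_type(value)
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return [node_id for _, node_id in self.entries[lo:hi]]

    def lookup_le(self, value):
        """Node ids whose key is less than or equal to the given value."""
        hi = bisect.bisect_right(self.keys, self.key_type(value))
        return [node_id for _, node_id in self.entries[:hi]]


# (lexical value of a salary element, node id); made-up data
salaries = [("900", 1), ("1000", 2), ("900.0", 3)]
string_index = ValueIndex(salaries, key_type=str)
typed_index = ValueIndex(salaries, key_type=float)
```

With string keys, `lookup_eq("900")` finds only node 1 and `lookup_le("950")` also returns the node holding "1000"; with float keys, nodes 1 and 3 are equal and "1000" is correctly excluded. The compiler therefore has to know which comparison semantics an index was built for before using it.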
Index for what comparison? Example: $x :=
SI Example 1: Patricia Trie (Cooper et al. 2001)
- Idea: Partitioned Patricia Tries to index strings
- Encode XPath expressions as strings (encode names, encode atomic values)
Example 2: XASR (Kanne & Moerkotte 2000)
- Implement axes as self-joins of the XASR table
Example 3: Multi-Dim. Indexes (Grust 2002)
- pre- and post-order numbering (XASR)
- multi-dimensional index for window queries
[Figure: pre/post plane with axes pre and post; the four regions around a context node are descendants, following, preceding, ancestors]
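The pre/post plane can be tried out with a short Python sketch. The tree encoding `(label, [children])`, the function names and the use of child-position paths as node handles are assumptions made for the example; a real system would answer the window queries with a multi-dimensional index or with self-joins on a table instead of a linear scan.

```python
def number_tree(tree):
    """Assign (label, pre, post) to every node of a tree given as (label, [children]).

    Nodes are addressed by their path of child positions, e.g. () for the root.
    """
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, path):
        label, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        for i, child in enumerate(children):
            visit(child, path + (i,))
        ranks[path] = (label, pre, counters["post"])
        counters["post"] += 1

    visit(tree, ())
    return ranks


def axis(ranks, context, name):
    """Evaluate an axis as a window (range) query on the pre/post plane."""
    _, cpre, cpost = ranks[context]
    windows = {
        "descendant": lambda pre, post: pre > cpre and post < cpost,
        "ancestor":   lambda pre, post: pre < cpre and post > cpost,
        "following":  lambda pre, post: pre > cpre and post > cpost,
        "preceding":  lambda pre, post: pre < cpre and post < cpost,
    }
    inside = windows[name]
    hits = [(pre, label) for label, pre, post in ranks.values() if inside(pre, post)]
    return [label for _, label in sorted(hits)]     # results in document order


tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [("f", [])])])
ranks = number_tree(tree)
```

For the context node b (path `(0,)`, pre 1, post 2) the four windows return its descendants c and d, its ancestor a, the following nodes e and f, and no preceding nodes.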
Oracle's XML Index
- Universal index for XML document collections
  - Indexes paths within documents
  - Indexes hierarchical information using Dewey-style order keys
  - Indexes values as strings, numbers, dates
  - Stores base table rowid and fragment "locator"
- No dependence on Schema
  - Any data that can be converted to number or date is indexed as such regardless of Schema
- Option to index only subset of XPaths
- Allows Text (Contains) search embedded within XPath
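Dewey-style order keys in general (this is not a description of Oracle's implementation) can be sketched as tuples of sibling positions. The function names and the tree encoding `(label, [children])` are assumptions for the example.

```python
def dewey_keys(tree, key=(1,)):
    """Yield (dewey_key, label) for a tree given as (label, [children])."""
    label, children = tree
    yield key, label
    for position, child in enumerate(children, start=1):
        yield from dewey_keys(child, key + (position,))


def is_ancestor(a, b):
    """a is a proper ancestor of b iff a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
keys = dict(dewey_keys(tree))        # {(1,): "a", (1, 1): "b", ...}
document_order = [keys[k] for k in sorted(keys)]
```

Sorting the keys gives document order, and a prefix test answers ancestor/descendant questions without touching the document, which is why such keys are convenient inside an index.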
XML Index Path Table (Oracle)
Summary for XML data storage
- Know what you want
  - query? update? persistence? …
- Understand the usage scenario right
- Get the big questions right
  - tree vs. arrays vs. tuples?
- Get the details right
  - compression? decoupling? indexes? identifiers?
- Open question:
  - Universal Approach for XML data storage??
XML processing benchmark
- We cannot really compare approaches until we decide on a comparison basis
- XML processing very broad
- Industry not mature enough
- Usage patterns not clear enough
- Existing XML benchmarks (XMark, XMach, etc.) limited
- Strong need for a TP benchmark
Runtime Algorithms
Query Evaluation
- Hard to discuss special algorithms
  - Strongly depend on the algebra
  - Strongly depend on the data storage, APIs and indexing
- Main issues:
  1. Streaming or materializing evaluation
  2. Lazy evaluation or not
Lazy Evaluation
- Compute expressions on demand
  - compute results only if they are needed
  - requires a pull-based interface (e.g. iterators)
- Example:
    declare function endlessOnes() as integer* { (1, endlessOnes()) };
    some $x in endlessOnes() satisfies $x eq 1
  The result of this program should be: true
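The same behavior can be reproduced with Python generators, which provide exactly the kind of pull-based interface mentioned above. This is an analogy, not an XQuery engine; the names `endless_ones` and `some` are chosen for the example.

```python
from itertools import islice


def endless_ones():
    """Lazily produce 1, 1, 1, ...; mirrors (1, endlessOnes())."""
    yield 1
    yield from endless_ones()     # only evaluated if the consumer asks for more


def some(items, predicate):
    """Existential quantifier: stops pulling as soon as one item satisfies."""
    return any(predicate(x) for x in items)


result = some(endless_ones(), lambda x: x == 1)   # terminates: one item is pulled
first_three = list(islice(endless_ones(), 3))
```

An eager evaluator would try to build the whole infinite sequence before testing the predicate and would never return; the lazy one pulls a single item and answers true.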
Lazy Evaluation
- Lazy Evaluation also good for SQL processors
  - e.g., nested queries
- Particularly important for XQuery
  - existential, universal quantification (often implicit)
  - top N, positional predicates
  - recursive functions (non terminating functions)
  - if then else expressions
  - match
  - correctness of rewritings, …
Stream-based Processing
- Pipe input data through query operators
  - produce results before input is fully read
  - produce results incrementally
  - minimize the amount of memory required for the processing
- Stream-based processing
  - online query processing, continuous queries
  - particularly important for XML message routing
- Traditional in the database/SQL community
Stream based processing issues
- Streaming burning questions:
  - push or pull?
  - Granularity of streaming? Byte, event, item?
  - Streaming with flexible granularity?
- Pure streaming?
  - Processing XQuery needs some data materialization
  - Compiler support to detect and minimize data materialization
- Notes:
  - Streaming + Lazy Evaluation possible
  - Partial Streaming possible/necessary
Token Iterator (Florescu et al. 2003)
- Each operator of algebra implemented as iterator
  - open(): prepare execution
  - next(): return next token
  - skip(): skip all tokens until first token of sibling
  - close(): release resources
- Conceptually, the same as in RDBMS …
  - pull-based
  - multiple producers, one consumer
- … but more fine-grained
  - good for lazy evaluation; bad due to high overhead
  - special tokens to increase granularity
  - special methods (i.e., skip()) to avoid fine-grained access
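The four-method interface can be sketched in Python over a flat token list. This is a minimal illustration under stated assumptions, not the implementation from the cited work: the token kinds `BE` (begin element), `EE` (end element) and `TEXT`, and the convention that `next()` returns `None` at the end of the stream, are chosen for the example.

```python
class TokenIterator:
    """Pull-based iterator over a flat token list.

    Tokens are tuples: ("BE", name) begins an element, ("EE", name) ends it,
    ("TEXT", value) is character data. These token kinds are assumptions.
    """

    def __init__(self, tokens):
        self._tokens = tokens
        self._pos = None

    def open(self):
        self._pos = 0                       # prepare execution

    def next(self):
        if self._pos >= len(self._tokens):
            return None                     # end of stream
        token = self._tokens[self._pos]
        self._pos += 1
        return token

    def skip(self):
        """Skip the rest of the element whose begin token was just returned."""
        depth = 1
        while depth > 0 and self._pos < len(self._tokens):
            kind = self._tokens[self._pos][0]
            depth += (kind == "BE") - (kind == "EE")
            self._pos += 1

    def close(self):
        self._pos = None                    # release resources


# <book><author>A</author></book><book><title>T</title></book>
tokens = [("BE", "book"), ("BE", "author"), ("TEXT", "A"), ("EE", "author"), ("EE", "book"),
          ("BE", "book"), ("BE", "title"), ("TEXT", "T"), ("EE", "title"), ("EE", "book")]
it = TokenIterator(tokens)
it.open()
first = it.next()       # begin of the first book
it.skip()               # jump over the first book's content
second = it.next()      # begin of the second book (the sibling)
inside = it.next()      # first token inside the second book
it.close()
```

A consumer that only needs the second item calls `skip()` once instead of pulling every token of the first item, which is how the fine-grained interface avoids fine-grained access.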
XML Parser as TokenIterator
[Figure sequence: the XML Parser is wrapped as a TokenIterator; after open(), successive next() calls return the tokens BE(book) and BE(author)]
$x[3]
[Figure sequence: a top 3 iterator placed on top of $x. A next() call on top 3 is translated into calls on $x, in this order: next(), skip(), next(), skip(), next(), next(); the last frame shows a further next() and null being returned]
Common Subexpressions
[Figure sequence: the result of a common sub-expression is pulled once via next() by a Buffer Iterator Factory; each consumer (top 3, other fct.) reads it through its own buffer scan using next()*/skip()*]
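A buffer factory of this kind can be sketched with Python generators. The class and method names are hypothetical; the sketch shows one shared input that is evaluated lazily and at most once, while every consumer gets its own scan over the buffered items.

```python
from itertools import islice


class BufferFactory:
    """Evaluates a shared input once, lazily, and hands out independent scans."""

    def __init__(self, source):
        self._source = iter(source)   # result of the common sub-expression
        self._buffer = []             # items pulled so far

    def scan(self):
        """Return a generator that replays the buffer, then pulls on demand."""
        position = 0
        while True:
            if position == len(self._buffer):
                try:
                    self._buffer.append(next(self._source))
                except StopIteration:
                    return            # shared input is exhausted
            yield self._buffer[position]
            position += 1


pulled = []                           # records how often the input is evaluated


def expensive():
    for i in range(1, 6):
        pulled.append(i)
        yield i * i


factory = BufferFactory(expensive())
top3 = list(islice(factory.scan(), 3))    # first consumer needs only three items
pulled_after_top3 = list(pulled)
everything = list(factory.scan())         # second consumer reads all items
```

After the first consumer, only three items of the shared input have been computed; the second consumer reuses them from the buffer and pulls the remaining two. No item is computed twice.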
Iterator Tree
for $line in $doc/OrderLine where xs:integer(fn:data($line/SellersID)) eq 1 return
Streaming: push vs. pull
- Pull:
  - Data consumer requests data from data producer
  - Similar in spirit with the iterator model (SQL engines)
  - Lazy evaluation easier to integrate
- Push:
  - Data triggers operations to be executed
  - More natural for evaluating automata
  - Control is still transmitted from data consumer to data producer
- See Fegaras '04 for a comparison
- Remark: pull and push can be mixed; adapters and some buffering required
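The two styles, and an adapter between them, can be contrasted on a simple filter operator. All names here are hypothetical, and the adapter buffers the complete output, which is the simplest possible choice rather than a streaming one.

```python
def pull_filter(source, predicate):
    """Pull: the consumer drives; each request pulls items from the producer."""
    for item in source:
        if predicate(item):
            yield item


class PushFilter:
    """Push: the producer drives; each arriving item triggers the operator."""

    def __init__(self, predicate, sink):
        self.predicate = predicate
        self.sink = sink              # callback invoked for every passing item

    def on_item(self, item):
        if self.predicate(item):
            self.sink(item)


def push_to_pull(items, make_operator):
    """Adapter: run a push operator over items and buffer its output for pulling."""
    buffer = []
    operator = make_operator(buffer.append)
    for item in items:
        operator.on_item(item)
    return iter(buffer)


def is_even(x):
    return x % 2 == 0


pulled = list(pull_filter(range(6), is_even))
adapted = list(push_to_pull(range(6), lambda sink: PushFilter(is_even, sink)))
```

Both variants produce the same items; they differ in who holds control. The pull version stops doing work as soon as the consumer stops asking, while the push version processes whatever the producer delivers, so the adapter needs a buffer.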
Memoization (Diao et al. 2004)
- Memoization: cache results of expressions
  - common subexpressions (intra-query)
  - multi-query optimization (inter-query)
  - semantic caching (inter-process)
- Lazy Memoization: cache partial results
  - occurs as a side-effect of lazy evaluation
  - cache data and state of query processing
  - optimizer detects when state needs to be kept
XQuery implementations
1. Extensions of existing data management systems
  - Relational: e.g. DB2, Oracle 10g, Yukon (Microsoft)
  - Non-relational: e.g. SleepyCat
2. New, specialized XML stores and XML processors
  - Open source: e.g. dbXML, eXist, Saxon
  - Commercial: e.g. MarkLogic, BEA
  - Data stores vs. query processors only
3. Integrators
  - do not store data per se, but they do aggregate XML data coming from multiple data sources
  - E.g. LiquidData (BEA), DataDirect
"Native XML database !!??"
XQuery implementations (cont.)
- BEA: http://edocs.bea.com/liquiddata/docs10/prodover/concepts.html
- Bluestream Database Software Corp.'s XStreamDB: http://www.bluestream.com/dr/?page=Home/Products/XStreamDB/
- Cerisent's XQE: http://cerisent.com/cerisent-xqe.html
- Cognetic Systems's XQuantum: http://www.cogneticsystems.com/XQuery.html
- GAEL's Derby: http://www.gael.fr/derby/
- GNU's Qexo (Kawa-Query): http://www.qexo.org/ Compiles XQuery on-the-fly to Java bytecodes. Based on and part of the Kawa framework. An online sandbox is available too. Open-source.
- Ipedo's XML Database v3.0: http://www.ipedo.com
- IPSI's IPSI-XQ: http://ipsi.fhg.de/oasys/projects/ipsi-xq/index_e.html
- Lucent's Galax: http://db.bell-labs.com/galax/. Open-source.
- Microsoft's XML Query Language Demo: http://XQueryservices.com
- Nimble Technology's Nimble Integration Suite: http://www.nimble.com/
- OpenLink Software's Virtuoso Universal Server: http://demo.openlinksw.com:8890/xqdemo
- Oracle's XML DB: http://otn.oracle.com/tech/xmldb/htdocs/querying_xml
- Politecnico di Milano's XQBE: http://dbgroup.elet.polimi.it/XQuery/xqbedownload.html
- QuiLogic's SQL/XML-IMDB: http://www.quilogic.cc/xml.htm
XQuery implementations (cont.)
- Software AG's
  - Tamino XML Server: http://www.softwareag.com/tamino/News/tamino_41.htm
  - Tamino XML Query Demo: http://tamino.demozone.softwareag.com/demoXQuery/index.html
- Sonic Software's
  - Stylus Studio 5.0 (XQuery, XML Schema and XSLT IDE): http://www.stylusstudio.com
  - Sonic XML Server: http://www.sonicsoftware.com/products/additional_software/extensible_information_server/
- Sourceforge's Saxon: http://saxon.sourceforge.net/. Open-source.
- Sourceforge's XQEngine: http://sourceforge.net/projects/xqengine/. Open-source.
- Sourceforge's XQuench: http://xquench.sourceforge.net/. Open-source.
- Sourceforge's XQuery Lite: http://sourceforge.net/projects/phpxmlclasses/. See also documentation and description. PHP implementation, open-source.
- Worcester Polytechnic Institute's RainbowCore: http://davis.wpi.edu/~dsrg/rainbow/. Java.
- Xavier C. Franc's Qizx/Open: http://www.xfra.net/qizxopen. Java, open-source.
- X-Hive's XQuery demo: http://www.x-hive.com/XQuery
- XML Global's GoXML DB: http://www.xmlglobal.com/prod/xmlworkbench/
- XQuark Group and Université de Versailles Saint-Quentin's XQuark Fusion and XQuark Bridge, open-source (see also the XQuark home page)
Outline of the Presentation
- Why XML?
- Processing XML
- XQuery: the good, the bad, and the ugly
  - XML data model, XML type system, XQuery basic constructs
  - Major XQuery applications
- XML query processing
  - Compilation issues
  - Data storage and indexing
  - Runtime algorithms
- Open questions in XML query processing
- The future of XML processing (as I see it)
Some open problems
1. XQuery equivalence
2. XQuery subsumption
3. Answering queries using views
4. Memoization for XQuery
5. Caching for XQuery
6. Partial and lazy indexes for XML and XQuery
7. XQueries independent of updates
8. XQueries independent of schema changes
9. Reversing an XML transformation
10. Data lineage through XQuery
11. Keys and identities on the Web
Some open problems (cont.)
12. Declarative description of data access patterns; query optimization based on such descriptions
13. Integrity constraints and assertions for XML
14. Query reformulation based on XML integrity constraints
15. XQuery and full text search
16. Parallel and asynchronous execution of XQuery
17. Distributed execution of XQuery in a peer-to-peer environment
18. Automatic testing of schema verification
19. Optimistic XQuery type checking algorithm
20. Debugging and explaining XQuery behavior
21. XML diff-grams
22. Automatic XML Schema mappings
Research topics (1)
- XML query equivalence and subsumption
  - Containment and equivalence of a fragment of XPath, Gerome Miklau, Dan Suciu
- Algebraic query representation and optimization
  - Algebraic XML Construction and its Optimization in Natix, Thorsten Fiebig, Guido Moerkotte
  - TAX: A Tree Algebra for XML, H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava, et al.
  - Honey, I Shrunk the XQuery! --- An XML Algebra Optimization Approach, Xin Zhang, Bradford Pielech, Elke A. Rundensteiner
  - XML queries and algebra in the Enosys integration platform, the Enosys team
- XML compression
  - An Efficient Compressor for XML Data, Hartmut Liefke, Dan Suciu
  - Path Queries on Compressed XML, Peter Buneman, Martin Grohe, Christoph Koch
  - XPRESS: A Queriable Compression for XML Data, Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung
Research topics (2)
- Views and XML
  - On views and XML, Serge Abiteboul
  - View selection for XML stream processing, Ashish Gupta, Alon Halevy, Dan Suciu
- Query cost estimations
  - Using histograms to estimate answer sizes for XML, Yuqing Wu, Jignesh M. Patel, H. V. Jagadish
  - StatiX: Making XML Count, J. Freire, P. Roy, J. Simeon, J. Haritsa, M. Ramanath
  - Selectivity Estimation for XML Twigs, Neoklis Polyzotis, Minos Garofalakis, Yannis Ioannidis
  - Estimating the Selectivity of XML Path Expressions for Internet Scale Applications, Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
Research topics (3)
- Full Text search in XML
  - XRANK: Ranked Keyword Search over XML Documents, L. Guo, F. Shao, C. Botev, Jayavel Shanmugasundaram
  - TeXQuery: A Full-Text Search Extension to XQuery, S. Amer-Yahia, C. Botev, J. Shanmugasundaram
  - Phrase matching in XML, Sihem Amer-Yahia, Mary F. Fernandez, Divesh Srivastava, Yu Xu
  - XIRQL: A language for Information Retrieval in XML Documents, N. Fuhr, K. Großjohann
  - Integration of IR into an XML Database, Cong Yu
  - FleXPath: Flexible Structure and Full-Text Querying for XML, Sihem Amer-Yahia, Laks V. S. Lakshmanan, Shashank Pandit
Research topics (4)
- XML Query relaxation/approximation
  - Approximate matching of XML Queries, AT&T, Sihem Amer-Yahia, Nick Koudas, Divesh Srivastava
  - Approximate XML Query Answers, SIGMOD '04, Neoklis Polyzotis, Minos N. Garofalakis, Yannis E. Ioannidis
  - Approximate Tree Embedding for Querying XML Data, T. Schlieder, F. Naumann
  - Co-XML (Cooperative XML) -- UCLA
Research topics (5)
- Security and access control in XML
  - LockX: A system for efficiently querying secure XML, SungRan Cho, Sihem Amer-Yahia, Laks V. S. Lakshmanan, Divesh Srivastava
  - Cryptographically Enforced Conditional Access for XML, Gerome Miklau, Dan Suciu
  - Author-Chi - A System for Secure Dissemination and Update of XML Documents, Elisa Bertino, Barbara Carminati, Elena Ferrari, Giovanni Mella
  - Compressed accessibility map: Efficient access control for XML, Ting Yu, Divesh Srivastava, Laks V. S. Lakshmanan, H. V. Jagadish
  - Secure XML Querying with Security Views, Chee-Yong Chan, Wenfei Fan, Minos Garofalakis
Research topics (6)
- Indexes for XML
  - Accelerating XPath Evaluation in Any RDBMS, Torsten Grust, Maurice van Keulen, Jens Teubner
  - Index Structures for Path Expressions, Dan Suciu, Tova Milo
  - Indexing and Querying XML Data for Regular Path Expressions, Quanzhong Li, Bongki Moon
  - Covering Indexes for Branching Path Queries, Raghav Kaushik, Philip Bohannon, Jeff Naughton, Hank Korth
  - A Fast Index Structure for Semistructured Data, Brian Cooper, Nigel Sample, M. Franklin, Gisli Hjaltason, Moshe Shadmon
  - Anatomy of a Native XML Base Management System, Thorsten Fiebig et al.
Research topics (7)
- Query evaluation, algorithms
  - Mixed Mode XML Query Processing, A. Halverson, J. Burger, L. Galanis, A. Kini, R. Krishnamurthy, A. N. Rao, F. Tian, S. Viglas, Y. Wang, J. F. Naughton, D. J. DeWitt
  - From Tree Patterns to Generalized Tree Patterns: On Efficient Evaluation of XQuery, Z. Chen, H. V. Jagadish, Laks V. S. Lakshmanan, S. Paparizos
  - Holistic twig joins: Optimal XML pattern matching, Nicolas Bruno, Nick Koudas, Divesh Srivastava
  - Structural Joins: A Primitive for Efficient XML Query Pattern Matching, Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel
  - Navigation- vs. index-based XML multi-query processing, Nicolas Bruno, Luis Gravano, Nick Koudas, Divesh Srivastava
  - Efficiently supporting order in XML query processing, Maged El-Sayed, Katica Dimitrova, Elke A. Rundensteiner
Research topics (8)
- Streaming evaluation of XML queries
  - Projecting XML Documents, Amelie Marian, Jerome Simeon
  - Processing XML Streams with Deterministic Automata, Todd J. Green, Gerome Miklau, Makoto Onizuka, Dan Suciu
  - Stream Processing of XPath Queries with Predicates, Ashish Gupta, Dan Suciu
  - Query processing of streamed XML data, Leonidas Fegaras, David Levine, Sujoe Bose, Vamsi Chaluvadi
  - Query Processing for High-Volume XML Message Brokering, Yanlei Diao, Michael J. Franklin
  - Attribute Grammars for Scalable Query Processing on XML Streams, Christoph Koch, Stefanie Scherzinger
  - XPath Queries on Streaming Data, Feng Peng, Sudarshan S. Chawathe
  - An efficient single-pass query evaluator for XML data streams, Dan Olteanu, Tim Furche, François Bry
Research topics (9)
- Graphical query languages
  - XQBE: A Graphical Interface for XQuery Engines, Daniele Braga, Alessandro Campi, Stefano Ceri
- Extensions to XQuery
  - Grouping in XML, Stelios Paparizos, Shurug Al-Khalifa, H. V. Jagadish, Laks V. S. Lakshmanan, Andrew Nierman, Divesh Srivastava, Yuqing Wu
  - Merge as a Lattice-Join of XML Documents, Kristin Tufte, David Maier
  - Active XQuery, A. Campi, S. Ceri
- XML integrity constraints
  - Keys for XML, Peter Buneman, Susan Davidson, Wenfei Fan, Carmem Hara, Wang-Chiew Tan
  - Constraints for Semistructured Data and XML, Peter Buneman, Wenfei Fan, Jérôme Siméon, Scott Weinstein
Some DB research projects
- Timber
  - Univ. Michigan, AT&T, Univ. British Columbia
  - http://www.eecs.umich.edu/db/timber/
- Natix
  - Univ. Mannheim
  - http://www.dataexmachina.de/natix.html
- XSM
  - Univ. San Diego
  - http://www.db.ucsd.edu/Projects/XSM/xsm.htm
- Niagara
  - Univ. Madison, OGI
  - http://www.cs.wisc.edu/niagara/