
09b3c68e7c5c4193b2f478e331b60502.ppt
- Number of slides: 113
XML Storage
XML Storage • Suppose that we are given some XML documents • How should they be stored? • Why does it matter? – Storage implies which type of use can be efficiently made of the XML – Usage requirements determine which type of storage is needed
3 Basic Strategies • Files • Relational Database • Native XML Database • What advantages do you think that each approach has? • What disadvantages do you think that each approach has?
XML Files
Idea • Store XML “as is”, in a file system – When querying, parse the document and traverse it to find the query answer • Obvious Advantage: Simple storage system • Obvious Disadvantage: – Must parse the XML document every time it is queried – Does not take advantage of indexes to quickly get to “interesting” elements (in order to reach a given element, must traverse everything appearing beforehand in the document)
Sample Document
How is an XML document Parsed? • Basic types of parsers: – DOM parser: Creates a tree out of the document – SAX parser: Does not create any data structures. Notifies program for every element seen (“pushes” parsing events to users) – Pull parsing (StAX): Similar to SAX in memory requirements. Uses an iterator-style interface • Parsers have implementations in virtually every programming language
DOM Parser • DOM = Document Object Model • Parser creates a tree object out of the document • User accesses data by traversing the tree • The API allows for constructing, accessing and manipulating the structure and content of XML documents
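Document as Tree • Methods like: getRoot, getAttributes, getChildren, etc.
[Figure: the sample document drawn as a tree. A transaction root with an account (89-344), a buy subtree and a sell subtree; labels shown include shares (100, 30), ticker (WEBM, GE) and exch (NASDAQ, NYSE).]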
Advantages and Disadvantages • How would you answer a query like: – /transaction/buy – //ticker • Advantages: – Natural and relatively easy to use – Can repeatedly query tree without reparsing • Disadvantages: – High memory requirements – the whole document is kept in memory – Must parse the whole document and construct many objects before use
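As a rough illustration of the two queries above, here is a minimal sketch using Python's standard `xml.dom.minidom`. The sample document is hypothetical (the slide's own sample is not reproduced here), and the helper `child_elements` is a name introduced only for this example.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

doc = parseString(SAMPLE)     # the whole document is parsed into a tree first
root = doc.documentElement    # the <transaction> element


def child_elements(node, name):
    """One child-axis step: element children of `node` with tag `name`."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]


# /transaction/buy : walk one level down from the root.
buys = child_elements(root, "buy") if root.tagName == "transaction" else []

# //ticker : collect every ticker element anywhere in the tree.
tickers = [t.firstChild.data for t in doc.getElementsByTagName("ticker")]

print([b.getAttribute("shares") for b in buys])
print(tickers)
```

Both queries run against the same in-memory tree, which is the "query repeatedly without reparsing" advantage; the cost is that the full tree is built before the first query can be answered.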
SAX Parser • SAX = Simple API for XML • Parser creates “events” (i.e., notifications) while traversing the document • Goes through the document one time only
Document as Events
Advantages and Disadvantages • How would you answer a query like: – /transaction/buy – find accounts in which something is bought or sold from the NASDAQ • Advantages: – Requires less memory – Fast • Disadvantages: – Cannot read backwards
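A possible way to answer /transaction/buy in a single SAX pass is to keep only the current path of open elements. The sketch below uses Python's standard `xml.sax`; the sample document and the class name `PathHandler` are assumptions made for this example.

```python
import xml.sax

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = b"""<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""


class PathHandler(xml.sax.ContentHandler):
    """Records the attributes of every element whose root-to-node path matches."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path
        self.path = []       # names of the currently open elements
        self.matches = []

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path == self.target_path:
            self.matches.append(dict(attrs.items()))

    def endElement(self, name):
        self.path.pop()


handler = PathHandler(["transaction", "buy"])
xml.sax.parseString(SAMPLE, handler)
print(handler.matches)
```

Memory use is proportional to the depth of the document rather than its size. The NASDAQ query is harder for the same reason the slide gives: the account appears before the buy and sell elements, so the handler would have to remember it, because the parser cannot go back.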
Compression • Even if XML is stored “as-is”, we would like to compress the data – Important since XML is very verbose! • Types of compression: – Compression-oriented: Goal is to maximize compression ratio – Query-oriented: Integrate compression with XPath processor, so that evaluation can be performed directly on compressed data • Ideas?
Storing XML in a Relational Database
Why? • Relational databases have been developed for about 30 years • There is extensive knowledge on how to use them efficiently • Why not take advantage of this knowledge? • Main Challenges: – get XML into database (inserting): translating XML into tables – get XML out of database (querying): translating XPath into SQL
Reminder • A relational database simply contains some tables • Each table can have any number of columns (also called attributes) • Data items in each column are atomic, i.e., single values • A schema is a description of a set of tables, i.e., the table name and each table’s column names
Difficulties • DTDs can be complex • Modeling Mismatch – Conceptually, relational databases, i. e. , tables, have 2 levels: tables and attributes – XML documents have arbitrary nesting • XML documents can have set-valued attributes and recursion
Relational Databases: Option 1 The Schema-less Case
Option 1: Store Tree Structure
Option 1: Store Tree Structure (cont.)
[Figure: the person document as a tree with numbered nodes: 1 person; its children 2 name, 3 tel, 4 tel, 5 email; text nodes 6 “Bart Simpson”, 7 and 8 (the phone numbers 051-011 022 and 02-444 7777) and 9 “bart@tau.ac.il”.]
1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information
Option 1: Store Tree Structure (cont.)
[Figure: the same numbered person tree, shown next to the table that stores it.]
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …        …             …
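The node table can be sketched with Python's standard `sqlite3` and `xml.etree.ElementTree`. This is only an illustration under assumptions: the table name `node`, the column names, the helper `shred`, and the id assignment order (any unique ids would do) are choices made for this example, not part of the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML = ("<person><name>Bart Simpson</name><tel>051-011 022</tel>"
       "<tel>02-444 7777</tel><email>bart@tau.ac.il</email></person>")


def shred(root):
    """Turn an element tree into (id, type, value, parent_id) rows."""
    rows, next_id = [], 0

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        rows.append((node_id, "element", elem.tag, parent_id))
        if elem.text and elem.text.strip():
            next_id += 1
            rows.append((next_id, "text", elem.text.strip(), node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return rows


rows = shred(ET.fromstring(XML))
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
             "value TEXT, parent_id INTEGER)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# /person/tel/text(): one self-join per step of the path.
sql = """
SELECT t.value
FROM node p
JOIN node c ON c.parent_id = p.id
JOIN node t ON t.parent_id = c.id
WHERE p.parent_id IS NULL AND p.type = 'element' AND p.value = 'person'
  AND c.type = 'element' AND c.value = 'tel'
  AND t.type = 'text'
ORDER BY t.id
"""
tel_values = [row[0] for row in conn.execute(sql)]
print(tel_values)
```

The query shows why long child-axis paths get expensive in this layout: each additional step adds another join against the same table.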
How Good Is This? • Simple schema, can work with any document • Translation from XML to tables is easy • What about the translation back? – is this transformation lossless?
Answering XPath Queries • Can you answer an XPath query that: – Just uses the Child axis, e.g., /a/b/c/d/e – Uses the Descendant axis at the beginning of the query, e.g., //a/b – Uses the Descendant axis in the middle of the query, e.g., /a/b//e – Uses the Following, Preceding, Following-Sibling axis?
Solving the Problem • With the current modeling, it is not possible to evaluate many different types of steps of XPath queries • To solve this problem, we: – number the nodes by DFS ordering – store, for each node, the id of its last descendant
Can you answer these queries, now?
[Figure: the person tree renumbered in DFS order: 1 person, 2 name, 3 “Bart Simpson”, 4 phones, 5 tel, 6 “051-011 022”, 7 tel, 8 “02-444 7777”, 9 email, 10 “bart@tau.ac.il”.]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …        …       …         …
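With DFS numbering and a last-descendant id, the descendant and following axes reduce to range checks on ids. The following is a minimal sketch; the nested-tuple tree encoding and the function names are assumptions made for this example.

```python
# A tree node is (label, [children]); this encoding is an assumption of the sketch.
TREE = ("person", [
    ("name", [("Bart Simpson", [])]),
    ("phones", [
        ("tel", [("051-011 022", [])]),
        ("tel", [("02-444 7777", [])]),
    ]),
    ("email", [("bart@tau.ac.il", [])]),
])


def number_nodes(tree):
    """Return {id: (label, parent_id, last_desc)} using DFS (pre-order) ids."""
    rows, counter = {}, 0

    def visit(node, parent_id):
        nonlocal counter
        label, children = node
        counter += 1
        node_id = counter
        for child in children:
            visit(child, node_id)
        # After visiting the subtree, `counter` is the id of the last descendant.
        rows[node_id] = (label, parent_id, counter)

    visit(tree, None)
    return rows


def descendants(rows, node_id):
    """Descendant axis: ids strictly after node_id, up to its last descendant."""
    last_desc = rows[node_id][2]
    return [i for i in sorted(rows) if node_id < i <= last_desc]


def following(rows, node_id):
    """Following axis: every node that starts after the subtree of node_id ends."""
    last_desc = rows[node_id][2]
    return [i for i in sorted(rows) if i > last_desc]


rows = number_nodes(TREE)
print(rows[4], descendants(rows, 4), following(rows, 4))
```

In a relational setting the same checks become range predicates in a WHERE clause, so a descendant step needs one join instead of an unknown number of parent-child joins.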
Summary: Main Problems • No convenient method for creating XML as output • Each element in the path expression requires an additional join – Can become very expensive
Relational Databases: Option 2, Taking Advantage of DTDs • Based on: “Relational Databases for Querying XML Documents: Limitations and Opportunities”, by Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton
Framework
[Figure: an XML translation layer on top of a relational database system. Labels shown: DTD, XML documents, XML query and XML result on the XML side; relational schema, tuples, SQL query, translation information and relational result on the database side.]
Example XML
Considering the DTD • If a DTD is given, then it defines what types of XML documents will be of interest • Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations
Reducing the Complexity • DTDs can be very complex • Before translating a DTD to a relational schema, simplify the DTD • Property of the Simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2 – almost means that it conforms, if the ordering of subelements is ignored
Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1|e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...
..., a?, ..., a, ... → a*, ...
..., a, ..., a?, ... → a*, ...
..., a*, ..., a, ... → a*, ...
..., a, ..., a*, ... → a*, ...
Simplification example, applying the rules step by step:
(b|c|e)?, (e?|f+)
→ (b?, c?, e?)?, e??, f+?   using (e1|e2) → e1?, e2?
→ b??, c??, e??, e??, f+?   using (e1, e2)? → e1?, e2?
→ b??, c??, e??, e??, f*?   using e1+ → e1*
→ b?, c?, e?, e?, f*   using e1?? → e1? and e1*? → e1*
→ b?, c?, e*, f*   using ..., a?, ..., a?, ... → a*, ...
You try it • Can you simplify the expression (b|c|e)?, (e?|(f?, (b, b)*))* using the simplification rules?
DTD Graphs • In order to describe a technique for converting a DTD to a schema it is convenient to first describe DTDs (or rather simplified DTDs) as graphs • Its nodes are elements, attributes and operators in the DTD • Each element appears exactly once in the graph • Attributes and operators appear as many times as they are in the DTD • Cycles indicate recursion
DTD Example
Corresponding DTD Graph
[Figure: the DTD graph for the example DTD; the legend marks attribute nodes.]
Very Naïve Storage • Store a table for each element name, with columns – ID – parentID (if has an incoming edge) – parentCODE (if has an incoming edge) – isRoot – Textual data, for elements of type PCDATA or attributes of type CDATA
Partial example:
book (bookID: int, isRoot: boolean)
booktitle (titleID: int, data: string, parentID: int, parentCODE: int, isRoot: boolean)
article (articleID: int, isRoot: boolean)
contactauthor (contactauthorID: int, parentCODE: int, isRoot: boolean)
authorid (authoridID: int, data: string, parentID: int, parentCODE: int, isRoot: boolean)
title (titleID: int, data: string, parentID: int, parentCODE: int, isRoot: boolean)
…
Disadvantages? • Many, many joins! • Some relations serve basically no purpose (such as contactauthor) • Solution: Inlining! – Store some of the data of the children within the table of the parent – When? Suggestions?
Creating the Schema: Shared Inline Technique • When creating the schema for a DTD, we create a relation for: – each element with in-degree greater than 1 – each element with in-degree 0 – each element below a * – one element from each set of mutually recursive elements, all having in-degree 1 • All other elements are “inlined” into their parent’s relation (i.e., added to their parent’s relation) – Note that the parent may also be inlined
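The four rules can be sketched as a small function over a DTD graph. The edge list below is hypothetical: it is modeled loosely on the book/article/monograph example, attribute nodes are given made-up names such as `editor.name`, and the "below a *" information is encoded as a flag on the edge. None of this is taken verbatim from the slides.

```python
from collections import defaultdict

# (parent, child, child_is_below_star); a hypothetical DTD graph for illustration.
EDGES = [
    ("book", "booktitle", False), ("book", "author", False),
    ("article", "title", False), ("article", "author", True),
    ("article", "contactauthor", False),
    ("contactauthor", "contactauthor.authorid", False),
    ("monograph", "title", False), ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True), ("editor", "editor.name", False),
    ("author", "name", False), ("author", "address", False),
    ("author", "author.authorid", False),
    ("name", "firstname", False), ("name", "lastname", False),
]


def shared_inline_relations(edges):
    """Return the set of nodes that get their own relation under Shared Inlining."""
    nodes = {n for parent, child, _ in edges for n in (parent, child)}
    indegree, children, starred = defaultdict(int), defaultdict(set), set()
    for parent, child, below_star in edges:
        indegree[child] += 1
        children[parent].add(child)
        if below_star:
            starred.add(child)

    # Rules 1-3: in-degree 0, in-degree > 1, or directly below a *.
    relations = {n for n in nodes
                 if indegree[n] == 0 or indegree[n] > 1 or n in starred}

    def reachable(start):
        seen, stack = set(), list(children[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(children[n])
        return seen

    # Rule 4: pick one node from each recursive set that has no relation yet.
    reach = {n: reachable(n) for n in nodes}
    for n in sorted(nodes):
        if n in reach[n]:
            cycle = {m for m in reach[n] if n in reach[m]}
            if not (cycle & relations):
                relations.add(n)
    return relations


print(sorted(shared_inline_relations(EDGES)))
```

On this hypothetical graph the function selects book and article (in-degree 0), title and author (in-degree greater than 1) and monograph (below a *); everything else, including editor, would be inlined into a parent's relation.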
In the Relations, Store: • Id of node • Boolean isRoot column, for each of the inlined fields (omitted in the examples) • Text content of all leaf nodes that are inlined • For all nodes with an incoming edge: – parentID – parentCODE
Relations for which elements?
[Figure: the DTD graph again; the legend marks attribute nodes.]
book (bookID: integer, booktitle: string)
article (articleID: integer, article.contactauthorid: string)
monograph (monographID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)
• What are the parentID and parentCODE columns for?
(Same relations as on the previous slide.) • How many isRoot columns would you add to article? To monograph?
Advantages/Disadvantages • Advantages: – Reduces number of joins for queries like “get the first and last names of an author” – Efficient for queries such as “list all authors with name Jack” • Disadvantages: – Extra join needed for “Article with a given title name”
Notes • Can/Should we use foreign keys to connect child tuples with their parents, e. g. , titles with what they belong to? • How can we answer queries, such as: – //title – //article//name
Another Option: Hybrid Inlining Technique • Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node
What, in addition, will be inlined?
[Figure: the DTD graph again; the legend marks attribute nodes.]
book (bookID: integer, booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, authorid: string)
article (articleID: integer, article.contactauthorid: string, article.title: string)
monograph (monographID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, authorid: string)
• Why do we still have an author relation?
Advantages/Disadvantages • Advantages: – Reduces joins through shared elements (that are not set or recursive elements) – Reduces joins for queries like “get first and last names of a book author” (like Shared) • Disadvantages: – Requires more SQL sub-queries to retrieve all authors with first name Jack (i. e. , unions) • Tradeoff between reducing number of unions and reducing number of joins – Shared and Hybrid target union- and join-reduction, respectively
XML in Major Databases • All major databases now have some level of support for XML • Example: Oracle – XML data type (can have a column which contains XML documents) – XPath processing of XML values – Some indexing capabilities – XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)
Try It • Consider the DTD: ]>
Try It • Simplify the DTD and draw the DTD graph that corresponds to the simplified DTD. • Show the schema that would be created using the Shared Inline Technique. • Show the schema that would be created using the Hybrid Inlining Technique.
Native Databases for XML
Basic Idea • Store XML as a tree • Main Challenge: make querying efficient (recall the difficulties when storing XML as a file) – appropriate indexing – efficient query processing • Several native XML database systems have been developed: – TIMBER (University of Michigan) – ToX (University of Toronto) – etc.
Natix • Subtrees are stored in blocks. When a block is full, another block is used.
[Figure: a document tree rooted at bib, split across blocks.]
Indexing • In order to do efficient query processing, indexes are used • Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint • More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff)
Indexing Strategy • We will discuss different indexing strategies and query processing with these indices – Element and value inverted lists – Rotated paths – Graph-based indexes
Element and Value Inverted Lists
Basic Indexes • At minimum, the following indexes are usually stored: – Value indexes: for each value appearing in the tree there is a list of nodes containing the value – Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element • Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
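Element and value indexes are ordinary inverted lists from a name or value to node ids. Here is a minimal sketch; the node list is an assumption, read off the numbered transaction tree used in the index examples, and the function name is made up for this illustration.

```python
from collections import defaultdict

# (node id, kind, label); an assumed reading of the numbered sample tree.
NODES = [
    (1, "element", "transaction"), (2, "element", "account"), (3, "text", "89-344"),
    (4, "element", "buy"), (5, "element", "shares"), (6, "text", "100"),
    (7, "element", "ticker"), (8, "element", "exch"), (9, "text", "NYSE"),
    (10, "text", "WEBM"),
    (11, "element", "sell"), (12, "element", "shares"), (13, "text", "30"),
    (14, "element", "ticker"), (15, "element", "exch"), (16, "text", "NYSE"),
    (17, "text", "GE"),
]


def build_indexes(nodes):
    """Return (element_index, value_index), each mapping a label to sorted node ids."""
    element_index, value_index = defaultdict(list), defaultdict(list)
    for node_id, kind, label in sorted(nodes):
        target = element_index if kind == "element" else value_index
        target[label].append(node_id)
    return dict(element_index), dict(value_index)


element_index, value_index = build_indexes(NODES)
print(element_index["exch"], value_index["NYSE"])
```

Keeping each list sorted by node id is what later makes merge-style processing of two lists possible.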
Example: Value Indexes [Figure: the sample transaction tree with nodes numbered 1 to 17.] Value index entries shown: WEBM → 10; NYSE → 9, 16
Example: Element Indexes [Figure: the same numbered tree.] Element index entries shown: buy → 4; exch → 8, 15
Example: Structure Indexes [Figure: the same numbered tree.] Structure index entry shown: //buy//exch → 8
Query Processing • Suppose that we only have value indexes and element indexes • How should we process the query: //buy//exch ? – Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements – Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements • Which is a better strategy?
//buy//exch: Strategy 1 [Figure: the numbered transaction tree with the element index entries buy → 4 and exch → 8, 15; the subtree of node 4 is traversed.]
//buy//exch: Strategy 2 [Figure: the same tree and index entries; the ancestors of nodes 8 and 15 are traversed.]
Both Strategies Are BAD! • Both strategies require traversal of the tree • Many disk reads • Will be inefficient, if tree is large! • GOAL: Answer queries using indices only, without traversing the XML tree
Improving the Execution • Instead of storing a running id for each element, store a triple: (start, end, level) • Find buy elements • Find exch elements • Merge these two lists by finding exch elements that are nested within buy elements • Level is used in case we are interested in finding children, not descendants
//buy//exch: Improved • Each entry is a triple (start, end, level) • buy: (4, 10, 2) • exch: (8, 9, 4), (15, 17, 4) • Merge the 2 lists by finding descendant elements • What does this remind you of?
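The containment test behind the merge can be written directly on the triples. This sketch assumes well-nested intervals (two intervals either nest or are disjoint), so checking the descendant's start position is enough; the type `Pos` and the function names are introduced only for this example, and the join is a deliberately naive nested loop.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    start: int
    end: int
    level: int


def is_descendant(anc: Pos, desc: Pos) -> bool:
    """desc lies inside anc's interval (assumes intervals nest or are disjoint)."""
    return anc.start < desc.start <= anc.end


def is_child(anc: Pos, desc: Pos) -> bool:
    """A child is a descendant exactly one level below."""
    return is_descendant(anc, desc) and desc.level == anc.level + 1


def nested_loop_join(alist, dlist):
    """Naive structural join: compare every ancestor with every descendant."""
    return [(a, d) for d in dlist for a in alist if is_descendant(a, d)]


buy = [Pos(4, 10, 2)]
exch = [Pos(8, 9, 4), Pos(15, 17, 4)]
pairs = nested_loop_join(buy, exch)
print(pairs)
```

The nested loop costs |Alist| × |Dlist| comparisons; exploiting the sort order of both lists is what brings this down to a single pass.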
Merging Lists • What is the complexity of merging the lists? • Is it enough to go through each list once? – Assuming the lists are sorted by start? • Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
Merging Lists: Example • Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
[Figure: a tree whose root a is (1, 7, 1), with children b (2, 2, 2) and a (3, 6, 2); the inner a has children b (4, 4, 3) and b (5, 5, 3).]
a list: (1, 7, 1), (3, 6, 2) • b list: (2, 2, 2), (4, 4, 3), (5, 5, 3) • Where should we go on the b list?
Merging Lists: Example • We did extra work • Need a method to find the correct place to start in the b list
Minimizing the Work • Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart • See: – Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002 – Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002 – Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002
Goal • Efficiently find all pairs of nodes n, m such that m is a descendant (or child) of n, and n and m have the user-specified labels – E.g., a//b, c//d, e/f • Recall: – For any label, we have a sorted list (i.e., an index) of nodes with that label – The sorted list of ids contains both the starting position of a node and its ending position
Stack-Tree Algorithms: Intuition • A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree. • An ancestor-descendant structural relationship is manifested as the ancestor appearing earlier on the stack than the descendant. • Unfortunately, a depth-first traversal requires going over the whole tree. – DON’T GO OVER THE TREE!! ONLY THE INDEX
Stack-Tree Algorithms • We will study the algorithm – Stack-Tree-Desc that returns the result ordered by (desc-start, anc-start) • Paper also discusses the algorithm – Stack-Tree-Anc that returns the result ordered by (anc-start, desc-start) • Why is the ordering of the result of interest?
Stack-Tree-Desc
a = Alist->firstNode; d = Dlist->firstNode; OutputList = NULL;
while (lists are not finished or stack is not empty) {
  if (a.startPos < d.startPos) then e = a; else e = d;
  while (stack not empty and e.startPos > stack.Top().endPos) stack.Pop();
  if (e == a) { stack.Push(a); a = a->nextNode; }
  else { for each a’ in stack do append (a’, d) to OutputList; d = d->nextNode; }
}
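The pseudocode above can be sketched in Python as follows. This is one possible reading, not the paper's reference implementation: entries are (start, end, level) tuples, both lists are assumed to be sorted by start and to contain distinct nodes, and the loop stops as soon as no further output is possible (the descendant list is exhausted, or the ancestor list is exhausted and the stack is empty).

```python
def stack_tree_desc(alist, dlist):
    """Return (ancestor, descendant) pairs ordered by (desc-start, anc-start).

    alist, dlist: lists of (start, end, level) tuples sorted by start.
    """
    output, stack = [], []
    i = j = 0
    while j < len(dlist) and (i < len(alist) or stack):
        # Process whichever current node starts first.
        take_ancestor = i < len(alist) and alist[i][0] < dlist[j][0]
        e = alist[i] if take_ancestor else dlist[j]

        # Pop stack entries that end before e starts: they cannot contain e
        # or anything that comes after it.
        while stack and e[0] > stack[-1][1]:
            stack.pop()

        if take_ancestor:
            stack.append(e)
            i += 1
        else:
            # Everything left on the stack is an ancestor of e, bottom to top.
            for a in stack:
                output.append((a, e))
            j += 1
    return output


alist = [(1, 7, 1), (3, 6, 2)]
dlist = [(2, 2, 2), (4, 4, 3), (5, 5, 3)]
for pair in stack_tree_desc(alist, dlist):
    print(pair)
```

Only the two index lists are read; the tree itself is never touched. Each ancestor is pushed and popped at most once, and each descendant is visited once, which is where the linear bound in the input and output sizes comes from.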
Stack-Tree-Desc: //section//paragraph
[Figure: an article tree with nested section elements a1, a2, a3 and paragraph elements d1 to d7.]
Alist (section): a1, a2, a3 • Dlist (paragraph): d1, d2, d3, d4, d5, d6, d7
Note: These lists are not created at the beginning of the algorithm. They are already available!
Stack-Tree-Desc
[Figure: step-by-step run over the lists a1, a2, a3 and d1 to d7, showing the stack contents after each step.]
Output: (a1, d1); (a1, d2), (a2, d2); (a1, d3), (a2, d3), (a3, d3); (a1, d4), (a2, d4), (a3, d4); (a1, d5), (a2, d5); (a1, d6)
Analysis of Stack-Tree-Desc • O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant structural relationships. – Each Alist element is pushed once and popped once, so stack operations take O(|Alist|). – The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|).
Questions and Disadvantages • Can a similar algorithm be used to compute other axes? – e.g., child, following • How can we use an algorithm for computing a single “step” to compute an entire XPath Query? – E.g., //a//b[//c/d]//e
Tree Patterns Can Be Computed From Structural Relationships
[Figure: a tree pattern over book, title (“XML”) and author (“jane”), drawn with child and descendant edges and broken down into single-edge patterns.]
• The algorithm presented computes only a single-edge query. Results can be combined to answer the entire query.
Graph-Based Indexes: DataGuides
Exploiting Regularity • XML documents tend to have a very repetitive structure • Structure can be summarized in a (relatively) small graph, called a dataguide • Nodes in a dataguide point to their corresponding node in the XML document • Strategy: Evaluate query over graph. Then find corresponding nodes in document – Very efficient if dataguide fits into main memory
Notes • In this work, we will model documents as graphs with the labels on the edges • We will only consider path queries (no branching) • Our XML documents can be arbitrary graphs • There are many different types of indexes that exploit the same idea – this was the first (1997)
An Example DataGuide: Intuition • How would you evaluate the queries: – //Name – /Restaurant/Owner
DataGuides: Formally • Given a data source (i.e., XML document) X, a graph D is a dataguide for X if: – every path of labels appearing in X appears exactly once in D (conciseness) – every path of labels appearing in D appears at least once in X (accuracy)
Example Revisited • Observe that every path in X also appears in D • Observe that no path (from the root) appears twice in D [Figure: Document X next to DataGuide D.]
Is this a DataGuide? [Figures: four slides, each showing Document X, a graph with edges labeled A, B, C and D, next to a different candidate graph marked “?”.]
Choosing a DataGuide [Figure: Document X next to two candidate DataGuides, Option 1 and Option 2, with edges labeled A, B, C and D.] • What does D point to?
Strong DataGuide: Formally • Consider source X and dataguide D • Let p, p’ be two label paths • Let p(X) be the set of nodes reached in X by traversing path p • We define p ≡X p’ if p(X) = p’(X) – That is, p and p’ are indistinguishable on X – D is a strong DataGuide for a database X if the equivalence relations ≡D and ≡X are the same
Strong DataGuides • Is (b) a strong dataguide for (a)? • Is (c) a strong dataguide for (a)?
Creating a Strong Dataguide • Strong dataguides can be used as indexes since they are unambiguous • How big might a strong dataguide be? • Can it be created efficiently? – In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one – If XML is a tree, can be created in linear time
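The determinization idea can be sketched as a subset construction: each DataGuide node stands for the set of source nodes reached by some label path. The code below is an illustrative sketch under assumptions, not the MakeDataGuide pseudocode from the slides; the source graph, its node ids, and the function names are hypothetical.

```python
# Hypothetical edge-labeled source: node -> list of (label, target).
SOURCE = {
    1: [("A", 2), ("A", 4), ("B", 6)],
    2: [("C", 3)],
    4: [("C", 5)],
    6: [("C", 5)],
}


def make_dataguide(source, root):
    """Return {target set: {label: target set}}, built by subset construction."""
    start = frozenset([root])
    guide, pending = {}, [start]
    while pending:
        targets = pending.pop()
        if targets in guide:          # this set of source nodes is already a node
            continue
        by_label = {}
        for node in targets:
            for label, child in source.get(node, []):
                by_label.setdefault(label, set()).add(child)
        guide[targets] = {label: frozenset(kids) for label, kids in by_label.items()}
        pending.extend(guide[targets].values())
    return guide


def evaluate(guide, root, path):
    """Follow a label path in the DataGuide; return the matching source nodes."""
    current = frozenset([root])
    for label in path:
        current = guide.get(current, {}).get(label)
        if current is None:
            return frozenset()
    return current


guide = make_dataguide(SOURCE, 1)
print(sorted(evaluate(guide, 1, ["A", "C"])), sorted(evaluate(guide, 1, ["B", "C"])))
```

Because every distinct set of source nodes can become its own DataGuide node, the number of nodes can grow exponentially on general graphs. On a tree, the sets reached by different label paths never overlap, which is consistent with the linear-time claim above.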
MakeDataGuide(n) {
  dg = NewObject()
  targetHash.Insert({n}, dg)
  RecursiveMake({n}, dg)
}
RecursiveMake(t1, d1) {
  p = set of
Can you create a Strong DataGuide? • Intuition: If the sets of nodes which are reachable for simple paths are equal, then the simple paths are represented as a single node. • Compute on blackboard
[Figure: two source graphs with nodes 1 to 6 and edges labeled A, B and C, each shown next to its strong DataGuide, whose nodes are sets of source nodes such as {2, 4} and {3, 5}.]
Summary • Advantages: – if dataguide can fit in memory, evaluation can be performed efficiently for path queries • Disadvantages: – May be large (why is this worse here than for the rotated lexicon?) – Only good for simple queries. Which axes?
Try It • Construct a strong dataguide for this document, using the algorithm shown • Show an example of a database, strong dataguide and XPath query such that evaluating the XPath query on the dataguide (and then finding the corresponding database nodes) yields a different answer than evaluating the query directly on the database.