XML Storage XML Storage Suppose that

Скачать презентацию XML Storage XML Storage Suppose that

09b3c68e7c5c4193b2f478e331b60502.ppt

Количество слайдов: 113

XML Storage

XML Storage • Suppose that we are given some XML documents • How should they be stored? • Why does it matter? – Storage implies which type of use can be efficiently made of the XML – Usage requirements determine which type of storage is needed

3 Basic Strategies • Files • Relational Database • Native XML Database • What advantages do you think that each approach has? • What disadvantages do you think that each approach has?

XML Files

Idea • Store XML “as is”, in a file system – When querying, parse the document and traverse it to find the query answer • Obvious Advantage: Simple storage system • Obvious Disadvantage: – Must parse the XML document every time it is queried – Does not take advantage of indexes to quickly get to “interesting” elements (in order to reach a given element, must traverse everything appearing beforehand in the document)

Sample Document <transaction> <account>89 -344</account> <buy shares=“ 100”> <ticker exch=“NASDAQ”>WEBM</ticker> </buy> <sell shares=“ 30”> Sample Document 89 -344 WEBM What must we read GE to be able to get information about the ticker element?

How is an XML document Parsed? • Basic types of parsers: – DOM parser: Creates a tree out of the document – SAX parser: Does not create any data structures. Notifies program for every element seen (“pushes” parsing events to users) – Pull parsing (St. AX): Similar to SAX in memory requirements. Uses an Iterator style interface • Parsers have implementations in virtually every query language

DOM Parser • DOM = Document Object Model • Parser creates a tree object out of the document • User accesses data by traversing the tree • The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree Methods like: transaction get. Root account buy sell 89 -344 shares 100 shares ticker get. Attributes etc. ticker 30 exch NASDAQ get. Children exch WEBM NYSE GE

Advantages and Disadvantages • How would you answer a query like: – /transaction/buy – //ticker • Advantages: – Natural and relatively easy to use – Can repeatedly query tree without reparsing • Disadvantages: – High memory requirements – the whole document is kept in memory – Must parse the whole document and construct many objects before use

SAX Parser • SAX = Simple API for XML • Parser creates “events” (i. e. , notifications) while traversing tree • Goes through the document one time only

Document as Events <transaction> End tag: account Start 89 -344 Text: tag: transaction account Document as Events End tag: account Start 89 -344 Text: tag: transaction account Attribute: 89 -344 Value: tag: buy Start 100 shares WEBM GE

Advantages and Disadvantages • How would you answer a query like: – /transaction/buy – find accounts in which something is bought or sold from the NASDAQ • Advantages: – Requires less memory – Fast • Disadvantages: – Cannot read backwards

Compression • Even if XML is stored “as-is”, we would like to compress the data – Important since XML is very verbose! • Types of compression: – Compression-oriented: Goal is to maximize compression ratio – Query-oriented: Integrate compression with XPath processor, so that evaluation can be performed directly on compressed data • Ideas?

Storing XML in a Relational Database

Why? • Relational databases have been developed for about 30 years • There is extensive knowledge on how to use them efficiently • Why not take advantage of this knowledge? • Main Challenges: – get XML into database (inserting): translating XML into tables – get XML out of database (querying): translating XPath into SQL

Reminder • Relational Database simply contains some tables • Each table can have any number of columns (also called attributes) • Data items in each column are atomic, i. e. , single values • A schema is a description of a set of tables, i. e. , the table name and each table’s column names

Difficulties • DTDs can be complex • Modeling Mismatch – Conceptually, relational databases, i. e. , tables, have 2 levels: tables and attributes – XML documents have arbitrary nesting • XML documents can have set-valued attributes and recursion

Relational Databases: Option 1 The Schema-less Case

Option 1: Store Tree Structure Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau. ac. il person name tel email Bart Simpson 051 – 011 022 02 – 444 7777 bart@tau. ac. il

Option 1: Store Tree Structure (cont. ) 1 person 2 name 3 tel 4 5 tel email 051 – 011 022 6 Bart Simpson 9 bart@tau. ac. il 7 02 – 444 7777 8 1. Assign each node a unique id 2. For each node, store type and value 3. For each node, store parent information

Option 1: Store Tree Structure (cont. ) 1 person 2 name 3 tel 4 5 tel email 051 – 011 022 6 Bart Simpson 9 bart@tau. ac. il 7 02 – 444 7777 8 Node Type Value 1 element person 6 text … … Parent. ID null Bart Simpson 2

How Good Is This? • Simple schema, can work with any document • Translation from XML to tables is easy • What about the translation back? – is this transformation lossless?

Answering XPath Queries • Can you answer an XPath query that: – Just uses the Child axis, e. g. , /a/b/c/d/e – Uses the Descendent axis at the beginning of the query, e. g. , //a/b – Uses the Descendent axis in the middle of the query, e. g. , /a/b//e – Uses the Following, Preceding, Following-Sibling axis?

Solving the Problem • With the current modeling, it is not possible to evaluate many different types of steps of XPath queries • To solve this problem, we: – number the nodes by DFS ordering – store, for each node, the id of its last descendent

Can you answer these queries, now? 2 name 3 Bart Simpson 1 person 4 phones 7 5 tel 9 email tel bart@tau. ac. il 051 – 011 022 6 02 – 444 7777 8 Node Type Value Parent. ID Last. Desc 1 element person null 10 4 element phones 1 8 … … 10

Summary: Main Problems • No convenient method to creating XML as output • Each element in the path expression requires an additional join – Can become very expensive

Relational Databases: Option 2, Taking Advantage of DTDs Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities By: Shanmugasundaram, Tufte, He, Zhang, De. Witt, Naughton

Framework DTD XML Documents Query XML Result XML Translation Layer Relational Schema Tuples SQL Query Translation Information Relational Database System Relational Result

Example XML <book> <booktitle> The Selfish Gene </booktitle> <author id = “dawkins”> <name> <firstname> Example XML The Selfish Gene Richard Dawkins Wouldn’t it be nice to store this as a table

with the columns: Timbuktu • booktitle • author_id 99999 • firstname

• lastname • city • zip

Example XML <book> <booktitle> The Selfish Gene </booktitle> <author id = “dawkins”> <name> <firstname> Example XML The Selfish Gene Richard do this only We can Dawkins all XML if

Timbuktu 99999

documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?

Considering the DTD • If a DTD is given, then it defines what types of XML documents will be of interest • Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations –

Reducing the Complexity • DTDs can be very complex • Before translating a DTD to a relational schema, simplify the DTD • Property of the Simplification: If D 2 is a simplification of D 1, then every document that conforms to D 1 also almost conforms to D 2 – almost means that it conforms, if the ordering of subelements is ignored

Simplification Rules (e 1, e 2)* e 1*, e 2* (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1* . . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … …, . . . a? , …, a, … a*, … …, . . . a, …, a? , … a*, … …, . . . a*, …, a, … a*, … …, . . . a, …, a*, …

(e 1, e 2)* e 1*, e 2* (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1*. . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … (b|c|e)? , (e? |f+)

(e 1, e 2)* e 1*, e 2* (b|c|e)? , (e? |f+) (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1*. . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … (b? , c? , e? )? , e? ? , f+?

(e 1, e 2)* e 1*, e 2* (b|c|e)? , (e? |f+) (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? (b? , c? , e? )? , e? ? , f+? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1*. . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … b? ? , c? ? , e? ? , f+?

(e 1, e 2)* e 1*, e 2* (b|c|e)? , (e? |f+) (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? (b? , c? , e? )? , e? ? , f+? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1* b? ? , c? ? , e? ? , f+? b? ? , c? ? , e? ? , f*? . . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … b? , c? , e? , f*

(e 1, e 2)* e 1*, e 2* (b|c|e)? , (e? |f+) (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2? (b? , c? , e? )? , e? ? , f+? e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1* b? ? , c? ? , e? ? , f+? b? ? , c? ? , e? ? , f*? . . . , a*, . . . , a? , . . . a*, . . . b? , c? , e? , f* . . . , a? , . . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … b? , c? , e*, f*

You try it • Can you simplify the expression – (b|c|e)? , (e? |(f? , (b, b)*))* e 1** e 1*? e 1* e 1? * e 1* e 1? ? e 1? e 1+ e 1* . . . , a*, . . . , a? , . . . a*, … …, . . . a, …, a, … a*, … (e 1, e 2)* e 1*, e 2* (e 1, e 2)? e 1? , e 2? (e 1|e 2) e 1? , e 2?

DTD Graphs • In order to describe a technique for converting a DTD to a schema it is convenient to first describe DTDs (or rather simplified DTDs) as graphs • Its nodes are elements, attributes and operators in the DTD • Each element appears exactly once in the graph • Attributes and operators appear as many times as they are in the DTD • Cycles indicate recursion

DTD Example

Corresponding DTD Graph attribute

Very Naïve Storage • Store a table for each element name, with columns – ID – parent. ID (if has an incoming edge) – parent. CODE (if has an incoming edge) – is. Root – Textual data, for elements of type PCDATA or attributes of type CDATA

book (book. ID: int, is. Root: boolean) booktitle (title. ID: int, data : string, parent. ID: int, parent. Code: int, is. Root: boolean) article (article. ID: int, is. Root: boolean) contactauthor (contactauthor. ID: int, parent. Code: int, is. Root: boolean) authorid (authorid. ID: int, data: string, parent. ID: int, parent. Code: int, is. Root: boolean) title (title. ID: int, data: string , parent. ID: int, parent. CODE: int, is. Root: boolean) …. . Partial example ….

Disadvantages? • Many, many joins! • Some relations serve basically no purpose (such as contactauthor) • Solution: Inlining! – Store some of the data of the children within the table of the parent – When? Suggestions?

Creating the Schema: Shared Inline Technique • When creating the schema for a DTD, we create a relation for: – each element with in-degree greater than 1 – each element with in-degree 0 – each element below a * – one element from each set of mutually recursive elements, having in-degree 1 • All other elements are “inlined” into their parent’s relation (i. e. , added into their parents relations) – Note that parent may also be inlined

In the Relations, Store: • Id of node • Boolean is. Root column, for each of the inlined fields (omitted in the examples) • Text content of all leaf nodes that are inlined • For all nodes with an incoming edge: – parent. ID – parent. CODE

Relations for which elements? attribute

book (book. ID: integer, booktitle : string) article (article. ID: integer, article. contactauthorid: string) monograph (monograph. ID: integer, monograph. parent. CODE: integer, monograph. editor. name: string) title (title. ID: integer, title: string , What are these for? title. parent. ID: integer, title. parent. CODE: integer) author (author. parent. ID: integer, author. parent. CODE: integer, author. ID: integer, authorid: string author. address: string, author. name. firstname: string, author. name. lastname: string, )

book (book. ID: integer, booktitle : string) article (article. ID: integer, article. contactauthorid: string) monograph (monograph. ID: integer, monograph. parent. CODE: integer, monograph. editor. name: string) title (title. ID: integer, title: string , title. parent. ID: integer, title. parent. CODE: integer) How many is. Root columns would you add to article? author (author. parent. ID: integer, author. parent. CODE: integer, To monograph? author. ID: integer, authorid: string author. address: string, author. name. firstname: string, author. name. lastname: string, )

Advantages/Disadvantages • Advantages: – Reduces number of joins for queries like “get the first and last names of an author” – Efficient for queries such as “list all authors with name Jack” • Disadvantages: – Extra join needed for “Article with a given title name”

Notes • Can/Should we use foreign keys to connect child tuples with their parents, e. g. , titles with what they belong to? • How can we answer queries, such as: – //title – //article//name

Another Option: Hybrid Inlining Technique • Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

What, in addition, will be inline? attribute

book (book. ID: integer, booktitle : string, author. name. firstname: string, author. name. lastname: string, author. address: string, authorid: string) article (article. ID: integer, article. contactauthorid: string, article. title: string) monograph (monograph. ID: integer, monograph. parent. CODE: integer, monograph. title: string, author. name. firstname: string, author. name. lastname: string, author. address: string, authorid: Why string, do we monograph. editor. name: string, ) still have an author (author. ID: integer, author. parent. ID: integer, author relation? author. parent. CODE: integer, author. name. firstname: string, author. name. lastname: string, author. address: string, authorid: string)

Advantages/Disadvantages • Advantages: – Reduces joins through shared elements (that are not set or recursive elements) – Reduces joins for queries like “get first and last names of a book author” (like Shared) • Disadvantages: – Requires more SQL sub-queries to retrieve all authors with first name Jack (i. e. , unions) • Tradeoff between reducing number of unions and reducing number of joins – Shared and Hybrid target union- and join-reduction, respectively

XML in Major Databases • All major databases now have some level of support for XML • Example: Oracle – XML data type (can have a column which contains XML documents) – XPath processing of XML values – Some indexing capabilities – XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)

Try It • Consider the DTD: <!DOCTYPE a [ <!ELEMENT a (d+|b)? > <!ELEMENT Try It • Consider the DTD: ]>

Try It • Simplify the DTD and draw the DTD graph that corresponds to the simplified DTD. • Show the schema that would be created using the Shared. Inline Technique. • Show the schema that would be created using the Hybrid. Inlining Technique.

Native Databases for XML 62

Basic Idea • Store XML as a tree • Main Challenge: make querying efficient (recall the difficulties when storing XML as a file) – appropriate indexing – efficient query processing • Several native XML database systems have been developed: – TIMBER (University of Michigan) – To. X (University of Toronto) – etc. 63

Natix Subtrees are stored in blocks. When a block is full another block is used. bib Pointer to block containing child . . . . . . book title author 64

Indexing • In order to do efficient query processing, indexes are used • Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint • More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff) 65

Indexing Strategy • We will discuss different indexing strategies and query processing with these indices – Element and value inverted lists – Rotated paths – Graph-based indexes 66

Element and Value Inverted Lists 67

Basic Indexes • At minimum, the following indexes are usually stored: – Value indexes: for each value appearing in the tree there is a list of nodes containing the value – Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element • Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression 68

Example: Value Indexes 1 transaction 2 account 4 3 11 buy 89 -344 sell 12 shares 5 shares WEBM 10 13 6 NYSE 9 16 7 100 14 ticker 30 15 exch 8 exch 9 NYSE 16 10 WEBM NYSE 17 GE 69

Example: Element Indexes 1 transaction 2 account 4 3 11 buy 89 -344 sell 12 shares 5 shares buy 4 13 6 exch 8 15 7 100 14 ticker 30 15 exch 8 exch 9 NYSE 16 10 WEBM NYSE 17 GE 70

Example: Structure Indexes 1 transaction 2 account 4 3 11 buy 89 -344 sell 12 shares 5 shares //buy//exch 8 13 6 7 100 14 ticker 30 15 exch 8 exch 9 NYSE 16 10 WEBM NYSE 17 GE 71

Query Processing • Suppose that we only have value indexes and element indexes • How should we process the query: //buy//exch ? – Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements – Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements • Which is a better strategy? 72

//buy//exch: Strategy 1 1 transaction 2 account 4 3 11 buy 89 -344 buy sell 12 shares 5 shares 4 13 6 exch 8 15 7 100 14 ticker 30 15 exch 8 exch 9 NYSE 16 10 WEBM NYSE 17 GE 73

//buy//exch: Strategy 2 1 transaction 2 account 4 3 11 buy 89 -344 buy sell 12 shares 5 shares 4 13 6 exch 8 15 7 100 14 ticker 30 15 exch 8 exch 9 NYSE 16 10 WEBM NYSE 17 GE 74

Both Strategies Are BAD! • Both strategies require traversal of the tree • Many disk reads • Will be inefficient, if tree is large! • GOAL: Answer queries using indices only, without traversing the XML tree

Improving the Execution • Instead of storing a running id for each element, store triple: (start, end, level) • Find buy elements • Find exch elements • Merge these two lists by finding exch elements that are nested within buy elements • Level is used in case we are interested in finding children, not descendents 76

//buy//exch: Improved End. Level Start buy (4, 10, 2) Merge the 2 lists by finding descendent elements exch (8, 9, 4) (15, 17, 4) What does this remind you of? 77

Merging Lists • What is the complexity of merging the lists? • Is it enough to go through each list once? – Assuming the lists are sorted by start? • Example: Suppose we want to find all pairs of a and b such that b is a descendent of a a b b 78

Merging Lists: Example • Example: Suppose we want to find all pairs of a and b such that b is a descendent of a a 1, 7, 1 2, 2, 2 b a b (1, 7, 1) (2, 2, 2) (3, 6, 2) (4, 4, 3) 4, 4, 3 b (5, 5, 3) a 3, 6, 2 b 5, 5, 3 Where should we go on the b list? 79

Merging Lists: Example • Example: Suppose we want to find all pairs of a and b such that b is a descendent of a a 1, 7, 1 2, 2, 2 b a (1, 7, 1) (3, 6, 2) b (2, 2, 2) (4, 4, 3) 4, 4, 3 b a 3, 6, 2 b 5, 5, 3 (5, 5, 3) 80

Merging Lists: Example • We did extra work • Need a method to find the correct place to start in the b list 1, 7, 1 a 2, 2, 2 b a (1, 7, 1) (3, 6, 2) b (2, 2, 2) (4, 4, 3) 4, 4, 3 b a 3, 6, 2 b 5, 5, 3 (5, 5, 3) 81

Minimizing the Work • Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart • See: – Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents” Proc. of VLDB 2002 – Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jingesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002 – Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002 82

Goal • Efficiently find all pairs of nodes n, m such that m is a descendent (child) of n, and n and m have the user specified labels – E. g. , a//b, c//d, e/f • Recall: – For any label, we have a sorted list (i. e. , an index) of nodes with that label – The sorted list of ids contains both the starting position of a node and its ending position

Stack-Tree Algorithms: Intuition • A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree. • An ancestor-descendant structural relationship is manifested as the ancestor appearing earlier on the stack than the descendant. • Unfortunately, a depth-first traversal requires going over all the tree. – DON’T GO OVER THE TREE!! ONLY THE INDEX 84

Stack-Tree Algorithms • We will study the algorithm – Stack-Tree-Desc that returns the result ordered by (desc-start, anc-start) • Paper also discusses the algorithm – Stack-Tree-Anc that returns the result ordered by (anc-start, desc-start) • Why is the ordering of the result of interest? 85

Stack-Tree-Desc a = Alist->first node; d = Dlist->first node; Output. List = NULL; while (lists are not finished or stack is not empty) { if (a. start. Pos < d. start. Pos) then e = a; else e = d; while (stack not empty and e. start. Pos > stack. Top(). end. Pos) stack. Pop(); if (e == a) { stack. Push(a); a = a->next. Node; } else for each a’ in stack do append (a’, d) to Output. List; d = d->next. Node; } } a d 86

Stack-Tree-Desc: section//paragraph article section paragraph section paragraph Bla, . . 87

Stack-Tree-Desc: //section//paragraph article Alist section paragraph section paragraph Bla, . . 88

Stack-Tree-Desc: //section//paragraph article Dlist section paragraph section paragraph Bla, . . 89

Stack-Tree-Desc: //section//paragraph article a 1 d 1 paragraph d 2 d 3 paragraph Bla, . . a 2 a 3 section paragraph d 7 d 6 d 5 d 4 Bla, . . 90

Stack-Tree-Desc: //section//paragraph section paragraph a 1 d 1 article a 1 section d 1 paragraph d 2 d 3 paragraph Bla, . . a 2 section a 2 paragraph d 2 a 3 d 4 d 5 d 6 d 7 d 6 Note: These lists are not created at the d 5 paragraph a 3 beginning of the section algorithm. d 4 paragraph They are already available! Bla, . . 91

Stack-Tree-Desc a 1 d 1 a 2 a 1 d 7 d 2 a 3 d 1 a 2 d 2 a 3 d 6 d 5 d 3 d 4 d 5 d 6 d 4 a 3 d 7 a 2 Stack: Output: a 1 (a 1, d 1) (a 1, d 2), (a 2, d 2) (a 1, d 3), (a 2, d 3), (a 3, d 3) (a 1, d 4), (a 2, d 4), (a 3, d 4) (a 1, d 5), (a 2, d 5) (a 1, d 6) 92

Questions and Disadvantages • Can a similar algorithm be used to compute other axes? – e. g. , child, following • How can we use an algorithm for computing a single “step” to compute an entire XPath Query? – E. g. , //a//b[//c/d]//e 94

Tree Pattern Can Computed From Structural Relationships Descendent Child edge book title XML book title book author title XML author jane Algorithm presented only computed a single edge query. Results can be combined to answer entire query. 95

Graph-Based Indexes: Data. Guides 96

Exploiting Regularity • XML documents tend to have a very repetitive structure • Structure can be summarized in a (relatively) small graph, called a dataguide • Nodes in a dataguide point to their corresponding node in the XML document • Strategy: Evaluate query over graph. Then find corresponding nodes in document – Very efficient if dataguide fits into main memory 97

Notes • In this work, we will model documents as graphs with the labels on the edges • We will only consider path queries (no branching) • Our XML documents can be arbitrary graphs • There are many different types of indexes that exploit the same idea – this was the first (1997) 98

An Example Data. Guide: Intuition How would you evaluate the queries: //Name /Restaurant/Owner 99

Data. Guides: Formally • Given a data source (i. e. , XML document) X, a graph D is a dataguide for X if: – every path of labels appearing in X appears exactly once in D (conciseness) – every path of labels appearing in D appears at least once in X (accuracy) 100

Example Revisited • Observe that every path in X also appears in D • Observe that no path (from the root) appears twice in D Document: X Data. Guide: D 101

Is this a Data. Guide? 1 1 A 1 C 1 D B 1 1 C C 1 D 1 1 1 C A B 1 1 D D 1 Document: X B 1 C 1 D 1 1 ? 102

Is this a Data. Guide? 1 1 A 1 C 1 D B 1 1 C C 1 D 1 1 1 C A B 1 1 D D 1 Document: X B 1 1 C C 1 D 1 B 1 1 D 1 ? 103

Is this a Data. Guide? 1 1 A 1 C 1 D B 1 1 C C 1 D 1 1 1 C A B 1 1 D D 1 Document: X B 1 1 C C 1 D 1 C 1 1 D 1 ? 104

Is this a Data. Guide? 1 1 A 1 C 1 D B C A 1 1 B 1 C C 1 D 1 B 1 D 1 1 Document: X ? 1 105

Choosing a Data. Guide 1 1 A 1 C 1 D B 1 1 C C 1 D 1 1 1 C A B 1 1 D D 1 Document: X 1 B 1 C 1 D 1 1 Option 1 A B 1 C 1 D 1 Option 2 What does D point to? 106

Strong Data. Guide: Formally • Consider source X and dataguide D • Let p, p’ be two label paths • Let p(X) be the set of nodes reached in X by traversing path p • We define p ≡X p’ if p(X) = p’(X) – That is, p and p’ are indistinguishable on X – D is a strong Data. Guide for a database X if the equivalence relations ≡D and ≡X are the same 107

Strong Data. Guides • Is (b) a strong dataguide for (a)? • Is (c) a strong dataguide for (a)? 108

Creating a Strong Dataguide • Strong dataguides can be used as indexes since they are unambiguous • How big might a strong dataguide be? • Can it be created efficiently? – In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one – If XML is a tree, can be created in linear time 109

Make. Data. Guide(n) { dg = New. Object() target. Hash. Insert({n}, dg) Recursive. Make({n}, dg) } Recursive. Make(t 1, d 1) { p = set of children pairs of each object in t 1 foreach (unique label l in p) { t 2 = set of node-ids paired with l in p d 2 = target. Hash. Lookup(t 2) if (d 2 != nil) { add an edge from d 1 to d 2 with label l } else { d 2 = New. Object() target. Hash. Insert(t 2, d 2) add an edge from d 1 to d 2 with label l Recursive. Make(t 2, d 2) } } } 110

Can you create a Strong Data. Guide? • Intuition: If the sets of nodes which are reachable for simple paths are equal, then the simple paths are represented as a single node. • Compute on blackboard A A 2 C 1 B 4 C 3 5 Source A 6 C 1 2, 4 C 3, 5 A A B 6 C 5 Strong Data. Guide 2 C 3 1 1 A B 4 6 C C C B 2, 4 6 C C 5 3, 5 Source Strong Data. Guide 111

Summary • Advantages: – if dataguide can fit in memory, evaluation can be performed efficiently for path queries • Disadvantages: – May be large (why is this worse here than for the rotated lexicon? ) – Only good for simple queries. Which axes? 112

Try It • Construct a strong dataguide for this document, using the algorithm shown • Show an example of a database, strong dataguide and XPath query such that evaluating the XPath query on the dataguide (and then finding the corresponding database nodes) yields a different answer than evaluating the query directly on the database.