098ba089244614239ab1866fc256381c.ppt
- Number of slides: 66
Storing XML Sihem Amer-Yahia AT&T Labs - Research
What’s XML?
- W3C standard since 1998
  - Subset of SGML (ISO Standard Generalized Markup Language)
- Data-description markup language
  - In contrast, HTML is a text-rendering markup language
- De facto format for data exchange on the Internet
  - Electronic commerce
  - Business-to-business (B2B) communication
XML: A Wire Protocol
- XML = a minimal wire representation for data and storage exchange
  - A low-level wire transfer format, like IP in networking
- Minimal level of standardization for distributed components to interoperate
  - Platform, language, and vendor agnostic
  - Easy to understand and extensible
- Data exchange enabled via XML transformations
Core XML Technologies
- XML validation: contract for data exchange
  - DTD, RELAX NG, XML Schema
- XML API: programmatic access to XML
  - DOM, SAX
- Transformation languages for data exchange and display
  - XSL, XSLT, XPath, XQuery
XML Data Model Highlights
- Tagged elements describe the semantics of data
  - Easier to parse for a machine and for a human
- An element may have attributes
- An element can contain nested sub-elements
- Sub-elements may themselves be tagged elements or character data
- Tree structure
  - Can capture any data model
  - Easier to navigate
An XML Document
    <?xml version="1.0"?>
    <!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
    <sigmodRecord>
      <issue>
        <volume>1</volume>
        <number>1</number>
        <articles>
          <article>
            <title>XML Research Issues</title>
            <initPage>1</initPage>
            <endPage>5</endPage>
            <authors>
              <author AuthorPosition="00">Tom Hanks</author>
            </authors>
          </article>
        </articles>
      </issue>
    </sigmodRecord>
Document Type Definition (DTD)
- An XML document may have a DTD
- Grammar for describing document structure
- Terminology
  - Well-formed: tags are properly nested and correctly closed
  - Valid: the document has a DTD and conforms to it
- Validation is useful for data exchange
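The well-formed versus valid distinction can be illustrated with a minimal Python sketch. It only checks well-formedness, because the standard-library parser used here does not validate against a DTD; the helper name `is_well_formed` and the two sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML (tags properly nested and closed)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs: the second one never closes <title>.
GOOD = "<article><title>XML Research Issues</title></article>"
BAD = "<article><title>XML Research Issues</article>"

print(is_well_formed(GOOD), is_well_formed(BAD))
```

Checking validity would additionally require a validating parser and the DTD itself.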
W3C XML Schema
- Rich set of scalar types
- User-defined simple types
- Complex types factor common structure
- Sequences, choice, repetition, recursion of elements
- Sub-typing supports schema reuse
- Integrity constraints
DTD vs XML Schema
- DTD
    <!ELEMENT article (title, initPage, endPage, author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT initPage (#PCDATA)>
    <!ELEMENT endPage (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
- XML Schema
    <xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="initPage" type="xsd:string"/>
          <xsd:element name="endPage" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
XML API: DOM
- Hierarchical (tree) object model for XML documents
- Associates a list of children (or a text value) with every node
- Preserves the sequence of elements in the XML document
- May be expensive to materialize for a large XML collection
DOM Features
- DOM API supports:
  - Navigation: access all attribute nodes, children, first/last child, next/previous sibling, parent, …
  - Creation: create new node
  - Modification: append, insert, remove, replace node
- DOM parser support for validation
  - Most support DTD
  - Some support XML Schema
  - See: http://www.w3.org/XML/Schema
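The three groups of DOM operations listed above (navigation, creation, modification) can be sketched with Python's `xml.dom.minidom`, used here as one DOM implementation among many. The tiny document is an assumption modelled on the article example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<article><title>XML Research Issues</title><initPage>1</initPage></article>"
)
article = doc.documentElement

# Navigation: first child, next sibling, parent.
title = article.firstChild
init_page = title.nextSibling
sibling_name = init_page.tagName
parent_is_article = init_page.parentNode is article

# Creation: build a new <endPage> node with a text child.
end_page = doc.createElement("endPage")
end_page.appendChild(doc.createTextNode("5"))

# Modification: append the new node and remove an existing one.
article.appendChild(end_page)
article.removeChild(init_page)

result = article.toxml()
print(result)
```

The whole tree lives in memory, which is what makes materialization expensive for large collections.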
XML API: SAX
- Event-driven: fires an event for every start tag and end tag
- Does not require the whole document to be parsed up front: reads the XML document in streaming fashion
- Read-only interface
- Consumes less memory than DOM
- Can be significantly faster than DOM
SAX Features
- Stack-oriented (LIFO) access
  - Read-once processing of very large documents
  - E.g., load an XML document into a storage system
- SAX parser support for validation
  - Most support DTD
  - Microsoft XML Parser (MSXML) supports XML Schema
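The stack-oriented, read-once style of SAX processing can be sketched with Python's `xml.sax`. The handler class `TitleCollector` and the sample input are assumptions for illustration; the handler keeps only a stack of open elements plus the data it wants to retain.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect <title> text while tracking open elements on a LIFO stack."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.max_depth = 0
        self.titles = []
        self._buf = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.max_depth = max(self.max_depth, len(self.stack))
        self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf).strip())
        self.stack.pop()


handler = TitleCollector()
xml.sax.parseString(
    b"<articles><article><title>XML Research Issues</title></article></articles>",
    handler,
)
print(handler.titles, handler.max_depth)
```

Memory use is bounded by the nesting depth rather than the document size, which is why this style suits bulk loading.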
XSL
- Styling is rendering information for consumption
- XSL = a language to express styling ("Stylesheet Language")
- Two components of a stylesheet
  - Transform: maps a source tree to a target tree using template rules expressed in XSLT
  - Format: controls appearance
XSLT
- XPath acts as the pattern language
- Primary goal is to transform XML vocabularies into XSL formatting vocabularies
- But it is often adequate for many other transformation needs
XPath
- [www.w3.org/TR/xpath]
- Common sub-language of
  - XSLT: a loosely typed "scripting" language
  - XQuery: a strongly typed query language
- Syntax for tree navigation and node selection
- Navigation is described using location paths
XPath
- .   : current node
- ..  : parent of the current node
- /   : root node, or a separator between steps in a path
- //  : descendants of the current node
- @   : attributes of the current node
- *   : "any" (node with unrestricted name)
- []  : a predicate for a given step
- [n] : the element with the given ordinal number from a list of elements
XPath 2.0
- Arithmetic: Expr +, -, *, div, mod Expr
- Logical: Expr or/and Expr; not(Expr)
- Comparison: Expr =, !=, <, <=, >, >= Expr
- Conditional: if Expr then Expr else Expr
- Iteration: for Var in Expr return Expr
- Quantified: some/every Var in Expr satisfies Expr
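XPath Example
- List the titles of articles that have an author named "Tom Hanks"
  - //article[.//author="Tom Hanks"]/title
- Find the titles of articles authored by "Tom Hanks" in volume 1
  - //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
- Inside a predicate, a path must be relative to the context node (.//author, volume); a path that starts with / or // is evaluated from the document root instead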
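The two example queries can be approximated in Python. `xml.etree.ElementTree` implements only a subset of XPath, so this sketch uses a path expression for the volume test and plain iteration for the author test. The second article and its author are hypothetical data added so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the SIGMOD Record document; the second
# <article> is hypothetical.
DOC = """
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
      <article>
        <title>Hypothetical Other Article</title>
        <authors><author AuthorPosition="00">Someone Else</author></authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>
"""
root = ET.fromstring(DOC)


def titles_by(scope, author):
    """Titles of <article> descendants of `scope` with a matching <author>."""
    return [
        art.findtext("title")
        for art in scope.iter("article")
        if any(a.text == author for a in art.iter("author"))
    ]


# Titles of all articles by Tom Hanks.
all_titles = titles_by(root, "Tom Hanks")

# Same, restricted to issues whose <volume> child equals "1".
volume1_titles = [
    t
    for issue in root.findall(".//issue[volume='1']")
    for t in titles_by(issue, "Tom Hanks")
]
print(all_titles, volume1_titles)
```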
Beyond XPath
- Joining, aggregating XML from multiple documents
- Constructing new XML
- Recursive processing of recursive XML data
- Supported by XSLT & XQuery
- Differences between XSLT & XQuery
  - Safety: XQuery enforces input & output types
  - Compositionality: XQuery maps XML to XML; XSLT maps XML to anything
XQuery
- Functional language
- A query is an expression
- Expressions are recursively constructed
- Includes XPath as a sub-language
- SQL-like FLWR expression
- Borrows features from many other languages: XQL, XML-QL, ML, ...
XQuery: FLWR expression
- FOR/LET clauses
  - Ordered list of tuples of bound variables
- WHERE clause
  - Pruned list of tuples of bound variables
- RETURN clause
  - Instance of the XML Query data model
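The clause-by-clause data flow of a FLWR expression can be imitated in Python. This is an analogy rather than an XQuery implementation: the binding dictionaries stand in for tuples of bound variables, and the sample document (including the second article) is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<articles>"
    "<article><title>XML Research Issues</title><author>Tom Hanks</author></article>"
    "<article><title>Hypothetical Other Paper</title><author>Jane Doe</author></article>"
    "</articles>"
)

# FOR/LET: an ordered list of tuples of bound variables ($b, $authors).
bindings = [
    {"b": art, "authors": [a.text for a in art.iter("author")]}
    for art in doc.iter("article")
]

# WHERE: prune the tuple list.
kept = [t for t in bindings if "Tom Hanks" in t["authors"]]

# RETURN: construct new XML from each surviving tuple.
result = ET.Element("result")
for t in kept:
    title = ET.SubElement(result, "title")
    title.text = t["b"].findtext("title")

output = ET.tostring(result, encoding="unicode")
print(output)
```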
XQuery: Example
List the titles of the articles authored by "Tom Hanks"
Query expression:
    for $b in document("sigmodRecord.xml")//article
    where $b//author = "Tom Hanks"
    return <title>{ $b/title/text() }</title>
Query result:
    <title>XML Research Issues</title>
XQuery: Example
List the articles authored by "Tom Hanks"
Query expression:
    <articles>
    {
      for $b in document("sigmodRecord.xml")//article
      where $b//author = "Tom Hanks"
      return $b
    }
    </articles>
Query result:
    <articles>
      <article>
        <title>XML: Where are we heading for?</title>
        <initPage>6</initPage>
        <endPage>10</endPage>
        <authors><author AuthorPosition="00">Tom Hanks</author></authors>
      </article>
    </articles>
Where’s the XML Data?
[Figure: business application logic exchanging XML data over SOAP/CORBA/Java RMI; diagram labels: Export, Import, View, Wrap, Legacy databases, Warehouse, XML data, Minimal result]
XML and Databases
- Data stored in SQL databases needs to be published in XML for data exchange
  - Specification schemes for publishing needed
  - Efficient publishing algorithms needed
- Storage and retrieval of XML documents
  - Need to support mapping schemes
  - Need to support data-manipulation XML APIs
Storing XML
- Storage is the foundation of efficient XML processing
- XML demands its own storage techniques
  - Characteristics of XML data: optional elements & values, repetition, choice, inherent order, large text fragments, mixed content
  - Characteristics of XML queries: document order & structure, full-text search, transformation
- Goals of tutorial
  - Existing storage features for XML
  - New storage features for XML
Outline
I. Introduction
  - XML Documents
  - XML Queries
II. Existing Storage Techniques
  - Non-native
  - Native
III. Physical Storage Features for XML
I. Introduction
Classes of XML Documents
- Structured
  - "Un-normalized" relational data
  - Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
- Mixed
  - Structured data embedded in large text fragments
  - Ex: on-line manuals, transcripts, tax forms
- An application may process XML in both classes
  - Ex: SOAP messages: header is structured; payload is mixed
Structured Data: HL7 Lab Report
Health-care industry data-exchange format
    <HL7>
      <PATIENT>
        <PID IDNum="PATID1234">
          <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
          <DTofBi><date>1961-06-13</date></DTofBi>
          <Sex>M</Sex>
        </PID>
        <OBX SetID="1">
          <ObsVa>150</ObsVa>
          <ObsId>Na</ObsId>
          <AbnFl>Above high</AbnFl>
        </OBX>
        ...
Queries on Structured Data
- Analogs of SQL
- Select-Project-Join, sort by value
  - Ex: Return admission records of patients discharged on 8/30/01, sorted by family and given names
- Grouping & schema transformation
  - Ex: Return per-patient record of admission, lab reports, doctors’ observations
Mixed Data: Library of Congress
Documents of U.S. legislation
    <bill bill-stage="Introduction">
      <congress>110th CONGRESS</congress>
      <session>1st Session</session>
      <legis-num>H.R. 133</legis-num>
      <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
      <action date="June 5, 2008">
        <action-desc>
          <sponsor>Mr. English</sponsor> (for himself and
          <cosponsor>Mr. Coyne</cosponsor>) introduced the following bill;
          which was referred to the
          <committee-name>Committee on Financial Services</committee-name> ...
        </action-desc>
Queries on Mixed Data
- Full-text search operators
  - Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
- Queries on structure & text
  - Ex: Return the <text> element containing both "exemption" & "social security", and the preceding & following <text> elements
- Queries that span (ignore) structure
  - Ex: Return <bill> that contains "referred to the Committee on Financial Services"
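The proximity operator in the first example ("within 6 intervening words") can be sketched as a small Python function. It works on text that has already been flattened, so markup is ignored; the function name `within` and the sample sentence are assumptions for illustration.

```python
import re


def within(text, w1, w2, max_gap=6):
    """True if w1 and w2 occur with at most `max_gap` words between them."""
    words = re.findall(r"\w+", text.lower())
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in pos1
        for j in pos2
    )


# Hypothetical bill text: 5 words separate "striking" and "amended".
SAMPLE = "by striking the word and inserting as amended"
print(within(SAMPLE, "striking", "amended"))
```

A storage system would answer such queries from a positional text index rather than by rescanning the text.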
Properties of XML Data
- Variance in structured content
  - Elements of the same type have different structure
- A nested sub-element might depend on its parent
  - Direct access to the sub-element is not required
- Order is significant in sequence & mixed content
- Structured data embedded in text
  - Schema known a priori or "open content model"
- Desirable: explicit support in the storage system
Properties of Queries
- Query expressions depend on data properties
  - Variance
    - /PATIENT/(SURGERY | CHECK-UP)
  - Document order: XPath axes
    - /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"]
  - Node identity: equality, union/intersect/except
- If these are not supported in the storage system, then operators are semantically incorrect or incomplete
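The document-order example depends on sibling order being preserved by the storage layer. The Python sketch below imitates the `following-sibling` test by scanning the children that come after each candidate; `xml.etree.ElementTree` has no sibling axes, so the scan is done by hand. The function name and both sample documents are assumptions.

```python
import xml.etree.ElementTree as ET


def sponsors_followed_by(bill_xml, first, later):
    """co-sponsor nodes with text `first` that precede a sibling with text `later`."""
    bill = ET.fromstring(bill_xml)
    children = list(bill)  # children in document order
    return [
        node
        for i, node in enumerate(children)
        if node.tag == "co-sponsor"
        and node.text == first
        and any(
            s.tag == "co-sponsor" and s.text == later
            for s in children[i + 1:]
        )
    ]


IN_ORDER = (
    "<bill><co-sponsor>Mrs. Clinton</co-sponsor>"
    "<co-sponsor>Mr. Torricelli</co-sponsor></bill>"
)
REVERSED = (
    "<bill><co-sponsor>Mr. Torricelli</co-sponsor>"
    "<co-sponsor>Mrs. Clinton</co-sponsor></bill>"
)

print(len(sponsors_followed_by(IN_ORDER, "Mrs. Clinton", "Mr. Torricelli")))
print(len(sponsors_followed_by(REVERSED, "Mrs. Clinton", "Mr. Torricelli")))
```

The two documents hold the same set of elements and differ only in order, so a store that drops order cannot tell them apart.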
II. Existing Storage Techniques
Storage Techniques
- Non-native
  - (Object) relational, OO, LDAP directories
  - Indexing, recovery, transactions, updates, optimizers
  - Mapping from XML to target data model necessary
  - Captures variance in structured content
  - No support for mixed content
  - Recovering XML documents is expensive!
- Native
  - Logical data model is XML
  - Physical storage features designed for XML
Non-native Techniques
- Generic
  - Mapping from XML data to relational tables
  - Models XML as a tree: semi-structured approach
  - Does not use DTD or XML Schema
- Schema-driven
  - Mapping from schema constructs to relational
  - Fixed mapping from DTD to relational schema
  - Flexible mapping from XML Schema to relational
- User-defined
  - Labor-intensive
Generic Mappings
- Edge relation
  - Store all edges in one table
  - Scalar values stored in a separate table
- Attribute relations
  - Horizontal partition of the Edge relation
  - Scalar values inlined in the same table
- Universal relation
  - Full outer join, redundancy
- Captures node identity & document order
- Element reconstruction requires multiple joins
Edge Relation Example
[Figure: tree of the HL7 document with node identifiers &0 to &8; PID (&3) has children @IDNum (&5, "PATID1234"), PaNa (&6, "Jones Wm"), and DTofBi (&7), whose date child (&8) holds 1961-06-13; OBX is &4]
Edge table
    source  ordinal  tag      flag    target
    &0      1        HL7      ref     &1
    &1      1        PATIENT  ref     &2
    &2      1        PID      ref     &3
    &2      2        OBX      ref     &4
    &3      1        IDNum    string  &5
    &3      2        PaNa     string  &6
    &3      3        DTofBi   ref     &7
Value table
    node  value
    &5    PATID1234
    &6    Jones William
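The Edge and Value tables can be loaded into SQLite to show why path queries and element reconstruction need multiple joins. This is a minimal sketch, assuming the rows shown in the example; the table names `edge` and `node_value` and the SQL are illustrative, not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edge (source TEXT, ordinal INTEGER, tag TEXT, flag TEXT, target TEXT);
CREATE TABLE node_value (node TEXT PRIMARY KEY, value TEXT);
INSERT INTO edge VALUES
  ('&0', 1, 'HL7',     'ref',    '&1'),
  ('&1', 1, 'PATIENT', 'ref',    '&2'),
  ('&2', 1, 'PID',     'ref',    '&3'),
  ('&2', 2, 'OBX',     'ref',    '&4'),
  ('&3', 1, 'IDNum',   'string', '&5'),
  ('&3', 2, 'PaNa',    'string', '&6'),
  ('&3', 3, 'DTofBi',  'ref',    '&7');
INSERT INTO node_value VALUES ('&5', 'PATID1234'), ('&6', 'Jones William');
""")

# Path /HL7/PATIENT/PID/PaNa: one self-join of edge per step,
# plus one join to fetch the scalar value.
name_rows = conn.execute("""
SELECT v.value
FROM edge e1
JOIN edge e2 ON e2.source = e1.target
JOIN edge e3 ON e3.source = e2.target
JOIN edge e4 ON e4.source = e3.target
JOIN node_value v ON v.node = e4.target
WHERE e1.tag = 'HL7' AND e2.tag = 'PATIENT'
  AND e3.tag = 'PID' AND e4.tag = 'PaNa'
""").fetchall()

# Document order of PID's children is recovered from the ordinal column.
pid_children = [
    row[0]
    for row in conn.execute(
        "SELECT tag FROM edge WHERE source = '&3' ORDER BY ordinal"
    )
]
print(name_rows, pid_children)
```

A path of depth four already costs four self-joins, which illustrates the reconstruction cost noted for generic mappings.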
Generic Mappings: LDAP Directories
- Flexible schema; easy schema evolution
  - Supports heterogeneous elements with optional values
- Captures node identity & document order
- Query language captures a subset of XPath
LDAP Example
Object classes:
    XMLElement OC {
      SUBCLASS OF {XMLNode}
      MUST CONTAIN {order}
      MAY CONTAIN {value}
      TYPE order INTEGER
    }
    XMLAttribute OC {
      SUBCLASS OF {XMLNode}
      MUST CONTAIN {value}
      TYPE value STRING
    }
[Figure: PID element with children @IDNum ("PATID1234"), PaNa ("Jones Wm"), DTofBi (date 1961-06-13), and Sex (M)]
Directory entries:
    oc: XMLElement    oid: 1    name: PID    order: 1
    oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
    oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
Schema-driven Mappings
- Repetition: separate tables
- Non-repeated sub-elements may be "inlined"
- Optionality: nullable fields
- Choice: multiple tables or universal table
- Order: explicit ordinal value
- Mixed content ignored
- Element reconstruction may require multi-table joins because of normalization
Fixed Mapping: Hybrid Inlining
DTD:
    <!ELEMENT PATIENT (Name, (OBX)*)>
    <!ELEMENT OBX (Name, Value)>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Value (#PCDATA)>
[Figure: DTD graph in which PATIENT points to Name and, through *, to OBX; OBX points to Name and Value]
Resulting relations:
    PATIENT (ID: Int, Name: Str)
    OBX (ID: Int, parentID: Int, parentCODE: Str, Name: Str, Value: Str)
- An element with in-degree = 0 or > 1 in the DTD graph maps to a relation
- Elements with in-degree = 1 are inlined, except those reached by *
- Non-* and non-recursive elements with in-degree > 1 are inlined
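Shredding a document into the two relations of this example might look as follows in Python with SQLite. This is a sketch under assumptions: the sample document (including the second OBX) is hypothetical, column types are simplified to SQLite types, and the loader is hand-written rather than derived automatically from the DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical instance of the PATIENT DTD.
DOC = """<PATIENT><Name>Jones</Name>
<OBX><Name>Na</Name><Value>150</Value></OBX>
<OBX><Name>K</Name><Value>4</Value></OBX>
</PATIENT>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENT (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE OBX (ID INTEGER PRIMARY KEY, parentID INTEGER,
                  parentCODE TEXT, Name TEXT, Value TEXT);
""")

patient = ET.fromstring(DOC)

# Name is inlined into PATIENT; each repeated OBX becomes its own row.
cur = conn.execute(
    "INSERT INTO PATIENT (Name) VALUES (?)", (patient.findtext("Name"),)
)
patient_id = cur.lastrowid
for obx in patient.findall("OBX"):
    conn.execute(
        "INSERT INTO OBX (parentID, parentCODE, Name, Value) VALUES (?, 'PATIENT', ?, ?)",
        (patient_id, obx.findtext("Name"), obx.findtext("Value")),
    )

# Reconstructing patient-with-observations needs a join.
rows = conn.execute("""
SELECT p.Name, o.Name, o.Value
FROM PATIENT p JOIN OBX o ON o.parentID = p.ID
ORDER BY o.ID
""").fetchall()
print(rows)
```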
Flexible Mapping: LegoDB
- Canonical mapping from XML Schema to relational
  - Every complex type maps to a relation
- Semantics-preserving XML Schema to XML Schema transformations
  - Ex: inlining/outlining, union factorization/distribution, repetition split
- Greedy algorithm for choosing the mapping
  - Mapping cost determined by the query mix
  - Uses a relational optimizer to estimate the cost of a mapping
LegoDB Example
Inline type in parent vs. outline type in own relation
- Outlined
  - XML:
      type OBX = element value { Integer }, Description
      type Description = element description { String }
  - Relational:
      TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
      TABLE Description (Description_id INT, description STRING, parent_OBX INT)
- Inlined
  - XML:
      type OBX = element value { Integer }, element description { String }
  - Relational:
      TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)
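The trade-off between the two configurations shows up in the queries they require. The Python/SQLite sketch below loads one hypothetical observation under each schema and retrieves the same (value, description) pair; column types are simplified to SQLite types, and the helper `run` is an assumption for illustration.

```python
import sqlite3


def run(schema, query):
    """Create a fresh in-memory database from `schema` and run `query`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    return conn.execute(query).fetchall()


# Outlined: Description lives in its own relation.
OUTLINED = """
CREATE TABLE OBX (OBX_id INT, value INT, parent_PATIENT INT);
CREATE TABLE Description (Description_id INT, description TEXT, parent_OBX INT);
INSERT INTO OBX VALUES (1, 150, 7);
INSERT INTO Description VALUES (1, 'Above high', 1);
"""

# Inlined: description is a column of OBX.
INLINED = """
CREATE TABLE OBX (OBX_id INT, value INT, description TEXT, parent_PATIENT INT);
INSERT INTO OBX VALUES (1, 150, 'Above high', 7);
"""

outlined_rows = run(
    OUTLINED,
    "SELECT o.value, d.description FROM OBX o "
    "JOIN Description d ON d.parent_OBX = o.OBX_id",
)
inlined_rows = run(INLINED, "SELECT value, description FROM OBX")
print(outlined_rows, inlined_rows)
```

Both return the same answer, but the outlined schema pays for a join, while the inlined schema widens every OBX row. Which one is cheaper depends on the query mix, which is the quantity a cost-based approach estimates.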
User-Defined Mappings
- No automatic translation from DTD or XML Schema
- Annotated schemas or special-purpose queries
- Value-based semantics only
  - Document structure represented by keys/foreign keys
- No explicit representation of document order or node identity
- Some support for mixed content
Oracle 9i
- Canonical mapping into user-defined object-relational tables
    table PERSON(Name NAME, Alist ALIST)
    object NAME(FN STR, LN STR)
    table ALIST of ADDR
    object ADDR(City CITY)
    <row>
      <Person>
        <Name><FN>…</FN><LN>…</LN></Name>
        <Addr><City>…</City></Addr>*
      </Person>
    </row>
- Arbitrary XML input
  - XSLT preprocessing into multiple XML documents, loaded individually
- Stores XML documents in CLOBs (character large objects)
  - Permits full-text search
- Hybrid of canonical mapping & CLOB
IBM DB2 XML Extender
- Declarative decomposition of arbitrary XML
  - Pure relational mapping (no object features used)
    <element_node name="Order">
      <table name="order_tab"/>
      <table name="part_tab"/>
      <condition>
        order_tab.order_key = part_tab.order_key
      </condition>
      <attribute_node name="key">
        <table name="order_tab"/>
        <column name="order_key"/>
      </attribute_node>
    </element_node>
- Mixed content: CLOBs + side tables for indexing structured data embedded in text
MS SQL Server
- Generic Edge technique with inlined scalar values
- User-defined decomposition of XML into multiple tables
  - XML data mapped into DOM
  - XPath expressions specify XML values to map into tables
  - Rows in table. Ex: /Customer/Orders maps to a row in table ORDER
  - Columns in row. Ex: ./OrderDate maps to OrderDateColumn
- Text content modeled in CLOBs
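The row-pattern and column-pattern idea can be sketched in Python. This imitates the concept only and is not SQL Server's API: the function `shred`, the sample document, and the dates are assumptions, and paths are written relative to the root element because `xml.etree.ElementTree` does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET


def shred(xml_text, row_path, column_paths):
    """One tuple per node matched by `row_path`; one field per column path."""
    root = ET.fromstring(xml_text)
    return [
        tuple(row.findtext(path) for path in column_paths)
        for row in root.findall(row_path)
    ]


# Hypothetical customer document.
DOC = (
    "<Customer>"
    "<Orders><OrderDate>2001-08-30</OrderDate></Orders>"
    "<Orders><OrderDate>2001-09-02</OrderDate></Orders>"
    "</Customer>"
)

# Row pattern selects Orders elements; column pattern is relative to each row.
order_rows = shred(DOC, "./Orders", ["./OrderDate"])
print(order_rows)
```

Each resulting tuple could then be inserted into the ORDER table with an ordinary SQL INSERT.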
Native Techniques
- Built from scratch
  - Natix (University of Mannheim, Germany)
  - Xyleme (France)
  - Xindice (Apache, open source)
- Re-tool existing systems to handle XML
  - Tamino: hierarchical database (ADABAS)
  - eXcelon: OODB
- Design efficient data structures for compact storage and fast access; data partitioning; indexing on both values and structure
Natix
- Unit of storage = element
- Elements clustered to minimize page hits
- Inter-element pointers capture document structure
- Low-level algorithmic support for read/write/insert/delete operations
- No use of DTDs or XML Schema
Xyleme
- Data layout: based on Natix
- Indexing: sophisticated indexing of text and elements
- Query support: XPath, XQuery, updates
- A data warehouse for XML content: store, classify, index, integrate, query, and monitor massive volumes of XML content
- Semantic services: extensible thesauri and schema mappers that enable the system to go beyond simple indexing
Xyleme/Natix
[Figure: not recoverable from the export]
eXcelon XIS
- Extends ObjectStore, an object-oriented database
- Data layout: stores parsed nodes (accessible through a DOM interface)
- Indexing: value indexes for strings and numbers; text indexes; structural indexes
- Query support: DOM, XSLT, XPath, XQuery, updates
- Other features:
  - Node-level management
  - Data is stored in a pre-parsed format: only the data objects needed for an operation are loaded into memory
  - Create, add, delete, update elements directly
  - Handles arbitrary XML documents, without a schema or DTD
  - But can enforce schemas if necessary
  - Triggers; transactions; distributed caching mechanism
Software AG Tamino
- Extends Adabas (nested relations)
- Indexing: value and structure
- Query support:
  - Full-text search operators
  - Queries return the entire document or some projection of it
  - No construction of new XML values (Ex: XQL)
- Access control at the node level; transactions; triggers; backup/restore; compression; support for multimedia documents, e.g., video
Other Native Systems
- Xindice: http://xml.apache.org/xindice/
  - Query support: XPath as its query language and XML:DB XUpdate as its update language
  - APIs: XML:DB API for Java development; other languages via an available XML-RPC plugin
- GoXML
  - XQuery, full-text searching
  - Tree insert, replace, and delete
Update Support
- XQuery does not support updates (yet…)
- How to update?
  - Flat streams: overwrite document
  - Non-native: SQL
  - Native: DOM, proprietary APIs
- But how do you know you have not violated the schema for which the mapping was defined?
  - Flat streams: re-parse document
  - Non-native: need to understand the mapping and maintain integrity constraints
  - Native: supported in some systems (e.g., eXcelon)
Summary
- Non-native
  - Treats target system as a black box
  - Mismatch between data models requires mapping
  - Supporting order-sensitive queries can be expensive
  - May require changes to schema to support new tags
  - Introduces redundancies & necessity of joins
  - No control of physical layout of data
- Native
  - No mismatch between logical data models
  - Focus on physical layout (clustering, indices, …)
  - Extensible: no schema or DTD needed
Conclusion
- XML data requires new storage features
  - Real applications depend upon XML data properties
  - Normalization is not always appropriate
- Schema of XML data should drive storage
  - Real-world data comes with its own schema
  - Schema as a basis for querying
- Handling mixed content is an important research problem
- Full version of the slides at http://www.research.att.com/~sihem
More Resources
- W3C Documents: http://www.w3.org/TR/
- W3C XML Query page: http://www.w3.org/XML/Query.html
- XML Query Implementations & Demos
  - Galax (AT&T, Lucent, and Avaya): http://www-db.research.bell-labs.com/galax/
  - Quip (Software AG): http://www.softwareag.com/developer/quip/
  - XQuery demo (Microsoft): http://131.107.228.20/xquerydemo/
  - Fraunhofer IPSI XQuery Prototype: http://xml.ipsi.fhg.de/xquerydemo/
  - XQengine (Fatdog): http://www.fatdog.com/
  - X-Hive: http://217.77.130.189/xquery/index.html
  - OpenLink: http://demo.openlinksw.com:8391/xquery/demo.vsp