- Number of slides: 39
Storing XML
Based on a tutorial by Sihem Amer-Yahia, given at ICDE 2002.
Storing XML
- Effective storage is key to efficient XML processing
- XML demands its own storage techniques
  - Characteristics of XML data: optional elements and values, repetition, choice, inherent order, large text fragments, mixed content
  - Characteristics of XML queries: document order and structure, full-text search, transformation
Outline
I. Introduction
  - XML Documents
  - XML Queries
II. Existing Storage Techniques
  - Non-native
  - Native
III. Physical Storage Features for XML
I. Introduction
Classes of XML Documents
- Structured
  - “Un-normalized” relational data
  - Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
- Mixed
  - Structured data embedded in large text fragments
  - Ex: on-line manuals, transcripts, tax forms
- An application may process XML in both classes
  - Ex: SOAP messages, where the header is structured and the payload is mixed
Structured Data: HL7 Lab Report
Health-care industry data-exchange format

<HL7>
  <PATIENT>
    <PID IDNum="PATID1234">
      <PaNa><FaNa>Jones</FaNa><GiNa>William</GiNa></PaNa>
      <DTofBi><date>1961-06-13</date></DTofBi>
      <Sex>M</Sex>
    </PID>
    <OBX SetID="1">
      <ObsVa>150</ObsVa>
      <ObsId>Na</ObsId>
      <AbnFl>Above high</AbnFl>
    </OBX>
    ...
Queries on Structured Data
- Essentially XQuery (already discussed in detail)
- Select-Project-Join, sort by value
  - Ex: Return admission records of patients discharged on 8/30/01, sorted by family and given names
- Grouping & schema transformation
  - Ex: Return a per-patient record of admission, lab reports, and doctors’ observations
- And so forth.
Mixed Data: Library of Congress
Documents of U.S. Legislation

<bill bill-stage="Introduction">
  <congress>110th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 133</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action date="June 5, 2008">
    <action-desc>
      <sponsor>Mr. English</sponsor> (for himself and
      <cosponsor>Mr. Coyne</cosponsor>) introduced the following bill; which was referred to the
      <committee-name>Committee on Financial Services</committee-name>. . .
    </action-desc>
Queries on Mixed Data
- Full-text search operators
  - Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
- Queries on structure & text
  - Ex: Return the <text> element containing both "exemption" & "social security", together with the preceding & following <text> elements
- Queries that span (ignore) structure
  - Ex: Return each <bill> that contains “referred to the Committee on Financial Services”
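The first and third query types above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<bills>` wrapper, the `id` attribute, and the helper names `within` and `matching_bills` are hypothetical and do not come from the tutorial. The text of each `<bill>` is flattened with `itertext()`, so that markup inside the bill is ignored, and a word-distance test is then applied.

```python
import re
import xml.etree.ElementTree as ET

def within(text, a, b, max_gap):
    """True if words a and b occur with at most max_gap words between them."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(i != j and abs(i - j) - 1 <= max_gap
               for i in pos_a for j in pos_b)

def matching_bills(xml_text, a, b, max_gap):
    """Return the (hypothetical) id of every <bill> whose text satisfies within()."""
    root = ET.fromstring(xml_text)
    hits = []
    for bill in root.iter("bill"):
        # Joining itertext() spans (ignores) the element structure inside the bill.
        flat = " ".join(bill.itertext())
        if within(flat, a, b, max_gap):
            hits.append(bill.get("id"))
    return hits

SAMPLE = (
    "<bills>"
    "<bill id='1'><text>by <b>striking</b> section 2 as <i>amended</i></text></bill>"
    "<bill id='2'><text>striking one two three four five six seven eight amended</text></bill>"
    "</bills>"
)

print(matching_bills(SAMPLE, "striking", "amended", 6))
```

A scan like this touches every document; a storage system that supports such operators would typically answer them from a positional text index instead.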
Properties of XML Data
- Variance in structured content
  - Elements of the same type may have different structure
- A nested sub-element might depend on its parent
  - Direct access to the sub-element is not required
- Order is significant in sequences and mixed content
- Structured data is embedded in text
  - Schema known a priori or “open content model”
- These properties require explicit support in the storage system
Properties of Queries
- Query expressions depend on data properties
  - Variance: /PATIENT/(SURGERY | CHECK-UP)
  - Document order: XPath axes
    /bill/co-sponsor[./text() = "Mrs. Clinton" and following-sibling::co-sponsor/text() = "Mr. Torricelli"]
  - Node identity: equality, union/intersect/except
- If these are not supported in the storage system, then operators are semantically incorrect or incomplete. (Why?)
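One way to approach the "(Why?)" is to evaluate the document-order query against the same siblings in two different orders. The sketch below is an illustration, not part of the tutorial: `has_following_sibling` is a hypothetical helper that hand-evaluates the `following-sibling` predicate (Python's `ElementTree` does not implement that axis), and the reversed copy stands in for a store that returns siblings in arbitrary order.

```python
import xml.etree.ElementTree as ET

def has_following_sibling(parent, tag, first_text, later_text):
    """Evaluate parent/tag[text() = first_text and
    following-sibling::tag/text() = later_text] by position in the child list."""
    texts = [child.text for child in parent if child.tag == tag]
    for i, text in enumerate(texts):
        if text == first_text and later_text in texts[i + 1:]:
            return True
    return False

bill = ET.fromstring(
    "<bill>"
    "<co-sponsor>Mrs. Clinton</co-sponsor>"
    "<co-sponsor>Mr. Torricelli</co-sponsor>"
    "</bill>"
)
in_order = has_following_sibling(bill, "co-sponsor", "Mrs. Clinton", "Mr. Torricelli")

# Simulate a store that does not record sibling order (here: reversed).
shuffled = ET.Element("bill")
shuffled.extend(list(bill)[::-1])
out_of_order = has_following_sibling(shuffled, "co-sponsor", "Mrs. Clinton", "Mr. Torricelli")

print(in_order, out_of_order)
```

The same data yields different answers depending on the order in which the store hands back the siblings, so a store that drops order cannot evaluate this predicate correctly.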
II. Known Storage Techniques
Storage Techniques
- Non-native
  - (Object-)relational, OO, LDAP directories
  - Indexing, recovery, transactions, updates, and optimizers are provided by the target system
  - Mapping from XML to the target data model is necessary
  - Captures variance in structured content
  - No support for mixed content
  - Recovering XML documents is expensive!
- Native
  - Logical data model is XML
  - Physical storage features designed for XML
Non-native Techniques
- Generic
  - Mapping from XML data to relational tables
  - Models XML as a tree: semi-structured approach
  - Does not use DTD or XML Schema
- Schema-driven
  - Mapping from schema constructs to relational
  - Fixed mapping from DTD to relational schema
  - Flexible mapping from XML Schema to relational
- User-defined
  - Labor-intensive
Generic Mappings
- Edge relation
  - Stores all edges in one table
  - Scalar values stored in a separate table
- Attribute relations
  - Horizontal partition of the Edge relation
  - Scalar values inlined in the same table
- Universal relation
  - Full outer-join, redundancy
- Captures node identity & document order
- Element reconstruction requires multiple joins
Edge Relation Example

Document tree (figure): &0 -HL7-> &1 -PATIENT-> &2; &2 -PID-> &3, &2 -OBX-> &4, …; &3 -@IDNum-> &5 (PATID1234), &3 -PaNa-> &6 (“Jones Wm”), &3 -DTofBi-> &7; &7 -date-> &8 (1961-06-13)

Edge Table
source  ordinal  tag      flag    target
&0      1        HL7      ref     &1
&1      1        PATIENT  ref     &2
&2      1        PID      ref     &3
&2      2        OBX      ref     &4
&3      1        IDNum    string  &5
&3      2        PaNa     string  &6
&3      3        DTofBi   ref     &7

Value Table
Node  Value
&5    PATID1234
&6    Jones William
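The Edge and Value tables above can be produced by a single preorder walk over the document. The following is a minimal sketch, assuming integer node ids in place of the `&n` labels and a hypothetical function name `shred`. It ignores mixed content (text next to sub-elements is dropped) and is not the loader of any particular system.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(xml_text):
    """Return (edges, values) for an XML string.

    edges:  list of (source, ordinal, tag, flag, target) rows
    values: dict mapping a leaf node id to its string value
    """
    ids = count(1)            # node 0 plays the role of the document root
    edges, values = [], {}

    def visit(elem, source, ordinal):
        target = next(ids)
        children = list(elem)
        if children or elem.attrib:
            # Interior node: the edge points at another node ("ref").
            edges.append((source, ordinal, elem.tag, "ref", target))
            pos = 0
            for name, val in elem.attrib.items():
                pos += 1
                leaf = next(ids)
                edges.append((target, pos, "@" + name, "string", leaf))
                values[leaf] = val
            for child in children:
                pos += 1
                visit(child, target, pos)
        else:
            # Leaf element: the scalar goes to the separate value table.
            edges.append((source, ordinal, elem.tag, "string", target))
            values[target] = (elem.text or "").strip()

    visit(ET.fromstring(xml_text), 0, 1)
    return edges, values

doc = ('<PID IDNum="PATID1234"><Sex>M</Sex>'
       '<DTofBi><date>1961-06-13</date></DTofBi></PID>')
edges, values = shred(doc)
for row in edges:
    print(row)
```

Rebuilding an element from these rows means following `target` to `source` links level by level, which is where the multiple joins mentioned on the previous slide come from.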
Generic Mappings: LDAP Directories
- Flexible schema; easy schema evolution
  - Supports heterogeneous elements with optional values
- Captures node identity & document order
- Query language captures a subset of XPath
LDAP Example

XMLElement OC {
  SUBCLASS OF {XMLNode}
  MUST CONTAIN {order}
  MAY CONTAIN {value}
  TYPE order INTEGER
  TYPE value STRING
}

XMLAttribute OC {
  SUBCLASS OF {XMLNode}
  MUST CONTAIN {value}
}

Document tree (figure): PID with children @IDNum (“PATID1234”), PaNa (“Jones Wm”), DTofBi (date: 1961-06-13), Sex (M)

Directory entries:
  oc: XMLElement    oid: 1    name: PID    order: 1
  oc: XMLAttribute  oid: 1.1  name: IDNum  value: PATID1234
  oc: XMLElement    oid: 1.2  name: PaNa   order: 1  value: Jones Wm
Schema-driven Mappings
- Repetition: separate tables
- Non-repeated sub-elements may be “inlined”
- Optionality: nullable fields
- Choice: multiple tables or universal table
- Order: explicit ordinal value
- Mixed content ignored
- Element reconstruction may require multi-table joins because of normalization
Fixed Mapping: Hybrid Inlining

<!ELEMENT PATIENT (Name, (OBX)*)>
<!ELEMENT OBX (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>

DTD graph (figure): PATIENT -> Name; PATIENT -*-> OBX; OBX -> Name; OBX -> Value

Resulting relations:
  PATIENT (ID: Int, Name: Str)
  OBX (ID: Int, parentID: Int, parentCODE: Str, Name: Str, Value: Str)

- An element with in-degree = 0 or > 1 in the DTD graph maps to a relation
- Elements with in-degree = 1 are inlined, except those reached by *
- Non-* & non-recursive elements with in-degree > 1 are inlined
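The inlining decision can be computed directly from the DTD graph. The sketch below encodes one reading of the three rules on this slide: an element gets its own relation if it has in-degree 0, is reached through `*`, or is recursive with in-degree > 1, and is inlined otherwise. The input format (a dict of `(child, starred)` pairs) and the function name `hybrid_inlining` are assumptions made for illustration only.

```python
def hybrid_inlining(dtd):
    """dtd maps each element to a list of (child, starred) pairs.

    Returns a dict element -> "relation" or "inlined".
    """
    elements = set(dtd) | {c for kids in dtd.values() for c, _ in kids}
    indegree = {e: 0 for e in elements}
    starred = set()
    for parent, kids in dtd.items():
        for child in {c for c, _ in kids}:     # count each parent once
            indegree[child] += 1
        starred |= {c for c, s in kids if s}

    def reaches_itself(start):
        """True if start lies on a cycle of the DTD graph."""
        stack = [c for c, _ in dtd.get(start, [])]
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(c for c, _ in dtd.get(node, []))
        return False

    plan = {}
    for e in sorted(elements):
        if indegree[e] == 0 or e in starred:
            plan[e] = "relation"
        elif indegree[e] > 1 and reaches_itself(e):
            plan[e] = "relation"
        else:
            plan[e] = "inlined"
    return plan

# DTD graph of the slide: PATIENT (Name, (OBX)*), OBX (Name, Value)
dtd = {
    "PATIENT": [("Name", False), ("OBX", True)],
    "OBX": [("Name", False), ("Value", False)],
}
plan = hybrid_inlining(dtd)
print(plan)
```

For the slide's DTD this gives relations for PATIENT and OBX, with Name and Value inlined, which matches the two relations shown above.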
Flexible Mapping: LegoDB
- Canonical mapping from XML Schema to relational
  - Every complex type maps to a relation
- Semantics-preserving XML Schema to XML Schema transformations
  - Ex: inlining/outlining, union factorization/distribution, repetition split
- Greedy algorithm for choosing a mapping
  - Mapping cost determined by the query mix
  - Uses the relational optimizer to estimate the cost of a mapping
LegoDB Example
- Inline a type in its parent vs. outline the type in its own relation

Outlined
  XML:
    type OBX = element value { Integer }, Description
    type Description = element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, parent_PATIENT INT)
    TABLE Description (Description_id INT, description STRING, parent_OBX INT)

Inlined
  XML:
    type OBX = element value { Integer }, element description { String }
  Relational:
    TABLE OBX (OBX_id INT, value STRING, description STRING, parent_PATIENT INT)
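The trade-off between the two layouts shows up in the SQL needed to fetch an observation together with its description. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in relational engine. The table names `OBX_outlined` and `OBX_inlined` are hypothetical, chosen so both layouts can live in one database, and the sample row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Outlined: Description lives in its own relation.
CREATE TABLE OBX_outlined (OBX_id INT, value TEXT, parent_PATIENT INT);
CREATE TABLE Description (Description_id INT, description TEXT, parent_OBX INT);
-- Inlined: description is a column of OBX.
CREATE TABLE OBX_inlined (OBX_id INT, value TEXT, description TEXT, parent_PATIENT INT);

INSERT INTO OBX_outlined VALUES (1, '150', 10);
INSERT INTO Description VALUES (1, 'Above high', 1);
INSERT INTO OBX_inlined VALUES (1, '150', 'Above high', 10);
""")

# Outlined layout: reconstructing the OBX element needs a join.
outlined = conn.execute("""
    SELECT o.value, d.description
    FROM OBX_outlined AS o
    JOIN Description AS d ON d.parent_OBX = o.OBX_id
""").fetchall()

# Inlined layout: a single-table scan returns the same answer.
inlined = conn.execute(
    "SELECT value, description FROM OBX_inlined"
).fetchall()

print(outlined, inlined)
```

Both queries return the same rows. Which layout is cheaper depends on the query mix, for example whether queries usually read descriptions together with values.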
User-Defined Mappings
- No automatic translation from DTD or XML Schema
- Value-based semantics only
  - Document structure represented by keys/foreign keys
- No explicit representation of document order or node identity
- Some support for mixed content
Oracle 9i
- Canonical mapping into user-defined object-relational tables
    table PERSON(Name NAME, Alist ALIST)
    object NAME(FN STR, LN STR)
    table ALIST of ADDR
    object ADDR(City CITY)

    <row>
      <Person>
        <Name><FN>…</FN><LN>…</LN></Name>
        <Addr><City>…</City></Addr>*
      </Person>
    </row>
- Arbitrary XML input
  - XSLT preprocessing into multiple XML documents, loaded individually
- Stores XML documents in CLOBs (character large objects)
  - Permits full-text search
- Hybrid of canonical mapping & CLOB
MS SQL Server
- Generic Edge technique with inlined scalar values
- User-defined decomposition of XML into multiple tables
  - XML data mapped into DOM
  - XPath expressions specify the XML values to map into tables
  - Rows in a table. Ex: /Customer/Orders maps to a row in table ORDER
  - Columns in a row. Ex: ./OrderDate maps to OrderDateColumn
- Text content modeled in CLOBs
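The row-pattern and column-pattern idea can be illustrated independently of any product. The sketch below is not SQL Server's API. It uses Python's `ElementTree`, whose limited XPath support is evaluated relative to the root element, and a hypothetical helper `shred_rows` with made-up sample data.

```python
import xml.etree.ElementTree as ET

def shred_rows(xml_text, row_path, column_paths):
    """Map each node matched by row_path to one row (a dict).

    column_paths maps a column name to a path evaluated relative to the row node.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        rows.append({col: node.findtext(path)
                     for col, path in column_paths.items()})
    return rows

doc = (
    "<Customer>"
    "<Orders><OrderDate>2001-08-30</OrderDate></Orders>"
    "<Orders><OrderDate>2001-09-02</OrderDate></Orders>"
    "</Customer>"
)

# "./Orders" is relative to the <Customer> root, i.e. /Customer/Orders.
rows = shred_rows(doc, "./Orders", {"OrderDateColumn": "./OrderDate"})
print(rows)
```

Each matched `<Orders>` element becomes one row, and each relative path fills one column of that row.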
Native Techniques
- Built from scratch
  - Natix (University of Mannheim, Germany)
  - Xyleme (France)
  - Xindice (Apache, open source)
  - TIMBER (U. of Michigan; uses Shore)
- Re-tool existing systems to handle XML
  - Tamino: hierarchical database (ADABAS)
  - eXcelon: OODB
- Design efficient data structures for compact storage and fast access; data partitioning; indexing on both values and structure
Natix
- Unit of storage = element
- Elements clustered to minimize page hits
- Inter-element pointers capture document structure
- Low-level algorithmic support for read/write/insert/delete operations
- No use of DTDs or XML Schema
Xyleme
- Data layout: based on Natix
- Indexing: sophisticated indexing of text and elements
- Query support: XPath, XQuery, updates
- A data warehouse for XML content: store, classify, index, integrate, query, and monitor massive volumes of XML content
- Semantic services: extensible thesauri and schema mappers that enable the system to go beyond simple indexing
Software AG Tamino
- Extends Adabas (nested relations)
- Indexing: value and structure
- Query support:
  - Full-text search operators
  - Queries return the entire document or some projection of the document
  - No construction of new XML values (unlike XQuery)
- Access control at the node level; transactions; triggers; backup/restore; compression; support for multimedia documents, e.g., video
TIMBER
- Underlying storage manager: SHORE
- Stores XML documents as trees in preorder
- Uses node pedigrees ([StartPos, EndPos, Level]) in an essential way to capture order and identity
- Supports indexes on tag, value, and pedigree
- Underlying algebra: TAX (tree algebra for XML)
- Underlying physical algebra: the key operator is the structural join, with its semi-, outer-, and anti- variants
- Optimal join ordering: a problem similar to that in RDBs
- Cost of constructing output is dominated by that of finding valid bindings from the DB for query variables
- Tree pattern and generalized tree pattern match
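The pedigree idea can be sketched as interval labeling: a node is an ancestor of another exactly when its [StartPos, EndPos] interval contains the other's, and Level distinguishes parent-child from ancestor-descendant. The code below is an illustrative sketch, not TIMBER's implementation. The function names `label` and `structural_join` are hypothetical, and the join is a plain nested loop rather than the merge-based algorithms a real system would use.

```python
import xml.etree.ElementTree as ET

def label(xml_text):
    """Return (tag, start, end, level) tuples in document (preorder) order."""
    out = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(out)
        out.append(None)              # reserve the preorder position
        for child in elem:
            visit(child, level + 1)
        counter += 1
        out[slot] = (elem.tag, start, counter, level)

    visit(ET.fromstring(xml_text), 0)
    return out

def structural_join(nodes, anc_tag, desc_tag, parent_child=False):
    """Return (ancestor start, descendant start) pairs by interval containment."""
    pairs = []
    for a_tag, a_start, a_end, a_level in nodes:
        if a_tag != anc_tag:
            continue
        for d_tag, d_start, d_end, d_level in nodes:
            contained = a_start < d_start and d_end < a_end
            if d_tag == desc_tag and contained:
                if not parent_child or d_level == a_level + 1:
                    pairs.append((a_start, d_start))
    return pairs

nodes = label("<PATIENT><PID><Sex>M</Sex></PID><OBX><ObsVa>150</ObsVa></OBX></PATIENT>")
for n in nodes:
    print(n)
print(structural_join(nodes, "PATIENT", "Sex"))
```

Because the test needs only the three numbers, ancestor-descendant pairs can be found from index entries alone, without navigating the stored tree.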
Other Native Systems
- Xindice: http://xml.apache.org/xindice/
  - Query support: XPath as its query language and XML:DB XUpdate as its update language
  - APIs: XML:DB API for Java development; other languages via an available XML-RPC plugin
- GoXML
  - XQuery, full-text searching
  - Tree insert, replace, and delete
Update Support
- XQuery does not support updates (yet…)
- How to update?
  - Flat streams: overwrite the document
  - Non-native: SQL
  - Native: DOM, proprietary APIs
- But how do you know you have not violated the schema for which the mapping was defined?
  - Flat streams: re-parse the document (how do we check integrity constraints?)
  - Non-native: need to understand the mapping and maintain integrity constraints
  - Native: supported in some systems (e.g., eXcelon)
Summary
- Non-native
  - Treats the target system as a black box
  - Mismatch between data models requires a mapping
  - Supporting order-sensitive queries can be expensive
  - May require changes to the schema to support new tags
  - Introduces redundancies & the necessity of joins
  - No control over the physical layout of data
- Native
  - No mismatch between logical data models
  - Focus on physical layout (clustering, indices, …)
  - Extensible: no schema or DTD needed
Conclusion
- XML data requires new storage features
  - Real applications depend upon XML data properties
  - Normalization is not always appropriate
- The schema of XML data should drive storage
  - Real-world data comes with its own schema
  - Schema as a basis for querying
- Handling mixed content is an important research problem
More Resources
- W3C Documents: http://www.w3.org/TR/
- W3C XML Query page: http://www.w3.org/XML/Query.html
- XML Query Implementations & Demos
  - Galax (AT&T, Lucent, and Avaya): http://www-db.research.bell-labs.com/galax/
  - Quip (Software AG): http://www.softwareag.com/developer/quip/
  - XQuery demo (Microsoft): http://131.107.228.20/xquerydemo/
  - Fraunhofer IPSI XQuery Prototype: http://xml.ipsi.fhg.de/xquerydemo/
  - XQengine (Fatdog): http://www.fatdog.com/
  - X-Hive: http://217.77.130.189/xquery/index.html
  - OpenLink: http://demo.openlinksw.com:8391/xquery/demo.vsp