Mapping between XML and relational By Guy

Mapping between XML and relational – By Guy Karlebach Based on XQuery from the Experts, Katz et al. Chapter 6 For the foreseeable future, most business data will continue to be stored in relational database system. To exploit the benefits of XML a need for mapping arises.

XML vs. Relational data is flat and untagged, whereas XML data contains tags and is hierarchical. The order of rows in a relational table is unimportant, in contrast to XML documents.

XML vs. Relational Tags: For example, separating an article to subtopics. <introduction title=“…”> , <methods>, <results> … In a relational database each of these tags may be a different table or column.

XML vs. Relational Hierarchy: For example, describing the chain of command in the army <General> … <Colonel> … <Lieutenant> … <Sergeant> … <Private> In a relational database this hierarchy may be represented by using foreign keys, though this may be confusing – for example if an XML parent points only to one child and several parents may point to the same row child in the DB.

A possible model for the mapping application DB Result XML Acts End application XML Processor XML Result Relational DB DB Dialect

Types of End Applications which are XML – oriented and interact with a relational DB Read-Only applications: 1. Complete document emission – For example, an application that visualizes the data of XML files that are stored in the database, which needs XML tags that instruct it through the visualization process (BSML documents has a visualization section). The emitted data can be further processed, e. g. using XSLT.

Types of End Applications which are XML – oriented and interact with a relational DB 2. Document fragment emission Doing this by further processing emitted documents is not efficient. Therefore we need a way to ‘push the selection criterion into the database’. The database already has an innate selection capability. An example for this is an application that visualizes parts of XML documents which are stored in a relational DB.

Types of End Applications which are XML – oriented and interact with a relational DB 3. Document XML Querying & transformation – Query for XML documents/fragments Add markup Change the relational output.

Types of End Applications which are XML – oriented and interact with a relational DB Example: XML for Tables. <Employees. Table>{ for $i in xp: table(“EXAMPLE_DB", “EMPLOYEE")/EMPLOYEE/row , $p in xp: table(“EXAMPLE_DB", EMPLOYER")/EMPLOYER/row[DEPART_ID = $I/DEPART_ID] return <row>{ <employee. Id> {$i/ID} </employee. Id> <manager. Id> {$p/ID} </manager. Id> }</row> }</Employees. Table>

Types of End Applications which are XML – oriented and interact with a relational DB Consider the following XQuery: for $x in doc (“purchases. xml”) / Purchase. Order / originator where $x / contact. Address / zip-four = “ 98102 -1234” return $x / contact. Name Translation to SQL: SELECT contact. Name FROM purchaseorder , originator WHERE purchaseorder. id = originator. poid AND originator. zip 4 = “ 98102”

Types of End Applications which are XML – oriented and interact with a relational DB The translation procedure can be described in the following manner: 1. Start with the SQL query that would be used to emit the entire Purchase. Order element list. (SELECT Purchase. Order. originator, Purchase. Order. id , …) 2. Translate the XQuery selection conditions into equivalent SQL selection conditions. 3. Prune the SQL query so that it will reference only the tables and columns necessary either to evaluate the selection conditions or to return required results.

Types of End Applications which are XML – oriented and interact with a relational DB Writing applications: 4. Update of XML data In the DB – Change of data in the DB according to an XML locating information (such as XPath expressions). A more complex version if this type of application would make changes in the XML schema of the XML data which is stored in the relational DB.

Sources of the application’s input data There are two possibilities: 1. The original format of the data is relational. In this case the application needs a ‘virtual’ XML view of an originally relational data. 2. The original format of the data is XML. This is possible in two different scenarios: a. The input data is currently XML – e. g. shredding. b. The input data used to be XML but was mapped into a DB, and the application has access to the relational form. In the following discussion we will refer to the way (XML or relational) in which the data is stored.

Relational view of data which is stored as XML – concepts There are 3 different ways to view XML as relational data: 1. LOB – A field in a DB table contains XML data. In this case the mapping back to XML is trivial (Easy implementation of a complete document emission application). No need to know the XML schema of the data which is received for storage.

Relational view of data which is stored as XML – concepts 2. Composed – The XML data is separated among different table fields and possibly different tables. The process in which XML documents are entered into a relational DB in a composed form is called shredding. When XML data is stored in a relational DB in a composed form a table field can ‘belong’ to more than one XML document. E. g. an author that created many documents.

Relational view of data which is stored as XML – concepts Knowing the XML schema/DTD helps composed viewing of input XML data because it allows to define mapping rules between the XML form and the relational form. We will later discuss automatic creation of such mappings. The process in which composed relational data is transformed to create XML documents or elements is called composition. Complete document emission is not trivial, but on the other hand the DB mechanisms such as querying can be used on the data. Composed and LOB representations can be combined.

Relational view of data which is stored as XML – concepts 3. XML native data type – A possible functionality of the DB itself. Fields in the DB can be defined as XML, and the DB may allow special operations on this data type. Example: (Native XML data type in green) SELECT … FROM … WHERE TABLE. FIELD 1=“Year/Region=‘middle east’”

XML view of data which is stored as relational – concepts The translation from relational to XML is divided into two types: 1. Implicit translation – The user of the translation mechanism has no control over the results of the translation. For example: A constant XML structure that mirrors the DB tables. <Table. Employee><Column 1>…

XML view of data which is stored as relational – concepts 2. Explicit translation – The user of the translation mechanism can affect its results. This effect may include: a. Choosing tables and columns that will participate in the translation. b. Mapping between rows/columns in the DB and XML elements or attributes. c. Naming XML elements in the translation result. d. Adding elements or attributes with constant values, or wrapper elements. E. g. , adding <time> to <hours> …<minutes> …<seconds> e. Defining the result of the translation as a function of the DB content. e. g. creating a <date> child of <meeting> only if the date of the meeting has not occurred yet.

XML view of data which is stored as relational – concepts Does explicit translation of relational data to XML allow data base update through XML interface? Querying through XML interface is understandably possible, but when it comes to update giving the user the power of explicit translation may cause difficulties. For example: If the user can define in the result a text node which is the concatenation of two database table fields, how will this one new field be used to update the original database fields upon its change? Of course, rules can be defined, but the task gets more complicated and less safe. We will review more aspects of explicit translation in the next two slides.

XML view of data which is stored as relational – Join Queries Joining relational DB tables that are viewed as XML captures the structures of the join query. For example: [ A JOIN B JOIN C ] U [ A JOIN D ] <A> content of A… content of B… <C> content of C… </C> <D> content of D …</D> </A>

XML view of data which is stored as relational – Group Queries Grouping queries allow the user to produce more compact XML data. The elements by which the grouping is made become common XML elements in their group, as is demonstrated below: <Element. For. Group. A> <Grouping. Column. Of. Group. A> 1996 < Grouping. Column. Of. Group. A /> < Grouping. Column. Of. Group. A > books </ Grouping. Column. Of. Group. A > … <Row. In. Group. A> <column> middle east </column> <column> 20, 000 <column> …

Composition Techniques We will discuss three general types of composition techniques: 1. Default mappings 2. Extended SQL 3. Annotated schemas 1. Default Mappings This is a mirroring of the database tables’ structure as XML. This is an implicit translation method. Example: <Table. Order> <Column. Date>. . </Column. Date> …

Composition Techniques Advantage of default mappings: Uncomplicated update commands. Disadvantage: Many applications require a specific XML format. This demands a XML format other then the one which represents a table, and sometimes also data selection. Further transforming ability can be added, but that dispels the update advantage. 2. Extended SQL The idea is to add XQuery capabilities to SQL. An example for this implementation type is the SQL/XML language. SELECT e. id, XMLELEMENT( NAME “Emp”, XMLATTRIBUTES( e. name AS “name”) , e. hire, e. dept) AS “result” FROM EMPLOYEE e

Composition Techniques XMLELEMENT is a DB’s native XML data type, and it stores the XML result of the query in a field called “result” in the relational table which is the result of the query. Following is an example that creates a more complex hierarchy and also prevents redundancy by a grouping command: SELECT XMLELEMENT ( NAME “Department”, XMLELEMENT( NAME “ department. Name”, e. dept ) XMLELEMENT( NAME “families”, XMLAGG( XMLELEMENT( NAME “family” , e. lastname ) ) AS “ families_in_departments” FROM EMPLOYEE e GROUP BY e. dept

Composition Techniques Result: <department> <name> physics </name> <families> <family> Smith </family> <family> Thompson </family> </families> </department> <department> …

Composition Techniques 3. Annotated schemas This composition technique is based on providing an XML document or a DTD or XML schema which contains instructions for the composition process. Example: Annotated XSD Schema Language The default of this language is XML schema without any annotations, which means that each complex element matches a table in the DB and each simple element in it matches a field in that table. Annotations which are added to the schema override the default composition instructions, and also define the composition in the case of complex elements within complex elements that appear in the schema.

Composition Techniques In the following slide: Blue shows the relational tables from which the XML elements draw values Green shows which elements should appear ‘together’ in the result Light Blue matches between the attribute and the column for its content in the relational ‘Orders’ table. If it is lacking in a simple element/attribute definition the mappings looks for a relational table field that matches the attribute/simple element name.

Composition Techniques Annotated schema example: <xs: element name=“Cusomer” type=“Customer. Type” sql: relation=“Customers”> < xs: complex. Type name=“Customer. Type”> < xs: attribute name=“Order. ID” type=“xs: integer” sql: field=“ID”/> … ( more fields as simple elements or attributes) … < xs: sequence> < xs: element name=“Order” max. Occurs=“unbounded” sql: relation=“Orders”> … (fields as simple elements or attributes) … < xs: annotation> < xs: appinfo> <sql: relationship parent=“Customers” parent-key=“Customer. ID” child=“Orders” child-key=“ Customer. ID”/> </xs: appinfo> </ xs: annotation> </ xs: element> </ xs: sequence> </ xs: complex. Type> </xs: element>

Composition Techniques Remarks: 1. parent-key and child-key can contain a list of values in the case where more than one column creates the relational table’s key. 2. The mapping marked in green allows a transitive mapping, e. g. coupling of a Customer element with an element from a relational table which ‘Orders’ points to but ‘Customers’ doesn’t. 3. The advantage of this mapping is that in addition to composition it also defines XML to relational update or shredding. The SQL/XML output, on the other hand , is not restricted to a certain schema and therefore the opposite mapping is not obvious. 4. There is a limited way to apply selection to value-retrieval: <xs: element name=“Bill. To” type=“xs: string” sql: relation=“Addresses” sql: field=“Stree. Address” sql: limit-field=“Address. Type” sql: limit-value=“billing”/> The Address. Type column’s value must be ‘billing’ in order to use the Street. Address column.

Composition Techniques There used to be a tendency to use languages which their sole purpose was accepting input in one form (relational or XML) and transforming it into the other. This approach is no longer common because it requires learning a new language that has no other use , and it did not add expression ability beyond the other mechanisms abilities.

Composition – Automatic generation of XML schema from relational data Usually useful for data which was originally relational (the case of an uncareful shredding also exists). Motivation: 1. Our originally relational data enjoys the benefits of XML 2. Error prevention 3. Reduces workload 4. Provides a methodology different DB systems can conform Based on “Schema Conversion Methods between XML and Relational Models, Dongwon Lee et al.

Composition – Automatic generation of XML schema from relational data Own House# Name Id Own #1 1 Bob 20 #3 2 Bob 10 #4 3 Bob 4 Pete 10 #6 4 Pete {1, 2, 3} Bob 10 nest #2 1 Bob 20 #3 4 Pete 10 {4, 5} Pete 20 10 #5 Id #1 10 #2 Name House# 20 #4 #7 5 Pete 20

Composition – Automatic generation of XML schema from relational data Method 1: Try nesting according to each column, and continue nesting in the tables which are the nesting opertaions’ results. Choose the longest of the nesting series. If only one nesting operation succeeded create only one XML element from the relational table (each field will be either an attribute or a child element). Otherwise, create an XML element for the relational table, and: For each column by which nesting was performed: Create a child element of the table element. If that column was nullable in the relational DB schema it will create an XML element of multiplicity *. Otherwise (not nullable) it will appear as an XML element of multiplicity +. For each column by which the nesting was not performed: All the columns which we didn’t nest by will appear as child elements of multiplicity 1 or 0, using the same considerations ( nullable or not). Note that this method does not provide a mechanism for linking between the tables’ elements.

Composition – Automatic generation of XML schema from relational data Method 2: Based only on the relational schema and not the content of the relational tables. a. Create a directed graph where each element is a vertex and there is a directed edge A B if B has a foreign key (or a referencing field) that points to A. b. Define basic vertexes as vertexes that satisfy one of the following: 1. There are no edges that point to them 2. All the edges that point to them have matching edges that point to the pointing edge’s source. c. From each basic vertex perform BFS on the graph. Every discovery of a new vertex creates a child element of the basic node. The multiplicity of the child element is set to ? If the edge represents a key element (that is, a foreign key in the child that point to the parent and also acts as the child’s only key field ) and otherwise it is set to *. d. If the BFS reaches a vertex a second time instead of creating a child element for it create an attribute IDREF if the edge represents a key and IDREFS otherwise. (We use IDREF/S in order avoid data redundancy).

Shredding – Planning of the relational DB schema Creating a target relational schema is the first step in any shredding procedure. Some considerations: 1. Adding shredding information beyond the content of the shredded document: Examples of Information that is not contained in shredded XML documents: table keys, or foreign keys. Information about composition can also be added to the shredded document before shredding: For example adding document-unique IDs + a document ID to elements and then assigning these IDs to a special field in the DB. This way we produce a field which is a unique key among all the documents which are shredded in a relatively simple procedure (the ID attribute type won’t work here, for instance). If we want to support update of the DB through incoming XML documents (for example a doctor who updates a clinical document which was previously shredded), we have to keep composition enabling information.

Shredding – Planning of the relational DB schema 2. Spreading/Uniting XML elements in relational tables: Creating a single table for several XML elements from the same XML schema/DTD is called Inlining. In other cases it might make sense to create separate tables for certain elements, for example when a child element appears with cardinality 0. . ∞. Creating a single table for several XML elements from different XML schemas/DTDs which resemble one another is called consolidation. The benefits of consolidation are: a. Saving space b. Simplifies the DB schema c. Allows integration of data sources for research (e. g. BSML and Hap. Map).

Shredding – Planning of the relational DB schema 3. Supporting the flexibility of XML documents: The structure that an XML schema/DTD defines is more flexible than that of a relational database. Some examples: a. The maximal length of strings may not be limited. This requires estimation of the data size and using datatypes such as CLOB when these are needed. This may cause space waste. b. XML elements which are defined in the XML schema/DTD may not appear in the XML document or may appear a changing number of times. c. XML schema may use xs: any which allows the occurrence of arbitrary elements (for example the Inclusion of any well-formed XML code). d. The case of an XML input that does not have a schema or that has a changing schema is still more complicated. Possible solutions: Learning. Creating document-specific schemas and crossing them. Making a wide use of LOB representation. Using an edge table.

Shredding – Planning of the relational DB schema Edge table: A single relational table describes the whole XML document: parent. Id The id of the row of the XML parent element of the element which was mapped to this row . . . child. Id Name The id of this row (which contains a single element/attribute ) The name of the element which is contained in this row in the XML document Value The non-element content of the element which is stored in this row

Shredding – Planning of the relational DB schema 4. Prevention of data redundancy: Shredding that involves large number of XML documents can result in data redundancy. Suppose that we have a repository of XML documents that describe the people who live in a certain very long street. Each document contains some data about one neighbor: his house number in the street and the house numbers of all the people who live in a region of 50 houses around his house. When all the data will enter the database keeping only the house number of the one neighbor that lives north to that person is a worthwhile consideration (a solution that keeps all the house numbers requires a connecting person__neighbor table).

Shredding – Automatic generation of relational DB schema Motivation: 1. Error prevention 2. Reduces workload 3. Provides a methodology different DB systems can conform Of course, not always an automatic algorithm captures the semantics of the XML model. We will see an example of creating a relational DB schema from a DTD Definition: Top nodes in the XML tree that satisfy one of the following conditions: 1. Root 2. A child of cardinality * or + 3. A node that may be its own descendant (recursion) 4. A node that can appear between two nodes that appear recursively one inside the er Based on “Schema Conversion Methods between XML and Relational Models, Dongwon Lee et al.

Shredding – Automatic generation of relational DB schema The algorithm: For each top node: 1. create a table 2. perform inlining of all the element and attribute nodes that can be reached from that node, contain PCDATA , and they are not top nodes. Explanation: The root table is the base table. Children of unbounded cardinality should get their own table because the number of fields to contain them all cannot be predetermined. The same goes for elements which can appear recursively and the elements described as type 4 in the previous slide. The rest of the nodes can be ‘safely’ inlined.

A shredding tool that uses XPATH XML Data Mediator is a tool for bi-directional data conversion between XML and structured data formats such as relational. It is an explicit mapping tool. It replaces the programming effort by the simpler effort of writing a script that describes the relationships between the XML constructs and the corresponding RDBMS constructs. XML Data Mediator can be used as a stand-alone utility, or it can be integrated as a library in other applications.

A shredding tool that uses XPATH <? xml version=“ 1. 0”? > <xrt: xrt xmlns: : xrt=http: //www. xrt. org/xrt xmlns: xsl=http: //www. w 3. org/1999/XSL/Transform> <xrt: xml 2 rdbms> <xrt: locator xrt: name=“EX 2 -store”/> <xrt: transaction xrt: tid=“level 1”> <!-First transaction --> <xrt: sqlinsert xrt: sid=“upsert-YEARS_HIGHER_THAN_PRECEDING” xrt: tablename=“SEMINAR. YEARS_HIGHER_THAN_PRECEDING” xrt: template-name=“table_years_higher_than_preceding”> <xrt: key xrt: name=“YEAR_NAME” xrt: type=“integer”/> <xrt: column xrt: name=“REGION” xrt: type=“string”/> <xrt: column xrt: name=“PARENT_ID” xrt: type=“string”/> <xrt: column xrt: name=“NODE_ID” xrt: type=“string”/> </xrt: sqlinsert> </xrt: transaction> </xrt: xml 2 rdbms> <xrt: xml 2 xml> <xsl: stylesheet version=“ 1. 0” xmlns: xsl=http: //www. w 3. org/1999/XSL/Transform xmlns: xrt=http: //www. xrt. org/xrt> <xsl: import href=“some. Directory/my. File. xsl”/> <xsl: template name=“main. Template”> <xsl: call-template name=“table_years_higher_than_preceding”/> </xsl: template> </xsl: stylesheet> <xrt: xml 2 xml>

A shredding tool that uses XPATH my. File. xsl: <? xml version=“ 1. 0”? > <xsl: stylesheet version=“ 1. 0” xmlns: xsl=http: //www. w 3. org/1999/XSL/Transform xmlns: xrt=http: // www. xrt. org/xrt> <xsl: template name=“table_years_higher_than_preceding ”> < xsl: for-each select= “ salesdata/year[preceding sibling: : year[position()=1]/region[name='west']/sales[@unit='millions'] < region[name='west']/sales[@unit='millions']] “ > < xsl: element name=“xrt: record”> < xsl: element name=“xrt: column”> < xsl: attribute name=“xrt: name”>YEAR_NAME</xsl: attribute > < xsl: attribute name=“xrt: value”><xsl: value-of select=“theyear”/></xsl: attribute > </ xst: element > < xsl: element name=“xrt: column”> < xsl: atttribute name=“xrt: name”>REGION</xsl: attribute > < xsl: attribute name=“xrt: value”><xsl: value-of select=“region”/></xsl: attribute > </ xsl: element> < xsl: element name=“xrt: column”> < xsl: atttribute name=“xrt: name”>PARENT_ID</xsl: attribute > < xsl: attribute name=“xrt: value”><xsl: value-of select=“generate-id(. . )”/></ xsl: attribute > </ xsl: element> <xsl: element name=“xrt: column”> < xsl: atttribute name=“xrt: name”>NODE_ID</xsl: attribute > < xsl: attribute name=“xrt: value”><xsl: value-of select=“region”/>generate-id(. )</ xsl: attribute > </ xsl: element> </ xsl: for-each> </xsl: template> </xsl: stylesheet >

Performance Analysis - composition We will separate the work into two conceptual stages: 1. Performing an SQL query that will “prepare” the required data 2. ‘Editing’ the query results by adding tags and hierarchy. A major bottleneck in composition performance is creating XML hierarchy through JOIN queries. We will present 3 techniques to deal with this problem: Simple Join: The most straightforward approach. Perform one JOIN between all the relational tables that we need in the output document (for instance, using parent key id’s). Edit the relational result into XML hierarchy. The relational result of the query should be arranged so that child elements will appear after their parents. Passing over the results will indicate a new level in hierarchy wherever there’s a key column change.

Performance Analysis - composition Simple join requires a key column for each non-leaf element. Here’s an example where it doesn’t happen: (The solution here would be assigning a different key column to each row of the EXPERIMENT table) Experiment Patient Parent id Weight Change Bob Group 1 75 Weight Change Bob Group 1 76

Performance Analysis - composition Another requirement of the simple join approach is that the hierarchy in the target XML document be linear. To understand this, suppose that every XML element in the target document had two child elements. In this case the preliminary DB query will consist of 3*Non-leaf Elements JOIN operations. For example, suppose we want to produce the following markup from the relational tables A, B 1, B 2, C 1, C 2, D 1: <A> <C 1/> <C 2/> <D 1/> <D 2/> </A> The preliminary query will be: A JOIN [(B 1 JOIN C 1) JOIN (B 1 JOIN C 2)] JOIN A JOIN [(B 2 JOIN D 1) JOIN (B 2 JOIN D 2)] because every element in the result table must be preceded by its parent element’s id.

Performance Analysis - composition Dependent Join: In this approach we define first an SQL query for the table that contains the root element of the XML document. Using the resulting table we create a query for each child element, and so on recursively. (remember that we know in advance which tables we should join but we don’t know in advance that each join will result in a new child element. When there are no more child elements in a certain path we have to ‘climb up’ back). At each recursive step we add the current elements to the XML document.

Performance Analysis - composition For example: <A> <C 1/> <C 2/> <D 1/> <D 2/> </A> Step 1: Step 2: (A JOIN B 1) JOIN (A JOIN B 2) (B 1 JOIN C 1) JOIN (B 1 JOIN C 2) create markup . . Experimentally, the dependent join approach shows poor performance in large mappings.

Performance Analysis - composition Sorted outer join (SOU): This is an improvement on the simple join approach which removes the limitation on hierarchy. The idea is to deal in one query with several linear hierarchies which will result in a nonlinear hierarchy. First we compose many ‘legal’ simple join queries. We achieve this by giving each one such tables that its result will contain the keys of only one path from the root to one of the leafs. Next we perform a query which is the UNION of all these query results. The result is a table in which each row contains a linear path from the root to a leaf. Processing this table creates the XML document. We still require that each row will have a unique key field.

Performance Analysis - composition For example: <A> <C 1/> <C 2/> <D 1/> <D 2/> </A> The query: Step 1: (A JOIN B 1 JOIN C 1) result 1 Step 2: (A JOIN B 1 JOIN C 2) result 2 Step 3: (A JOIN B 2 JOIN D 1) result 3 Step 4: (A JOIN B 2 JOIN D 2) result 4 Step 5: result 1 U result 2 U result 3 U result 4 (It is also important to notice that in retrieving the a path like A-B-C, we will preserve in the result the keys of all three tables but only the value columns of C).