- Number of slides: 96
XML: The Big Picture and Some Gory Details
(A brief tutorial with an eye towards e-records and archival)
Bertram Ludaescher
ludaesch@sdsc.edu
Data Intensive Computing Environments (DICE) Group
San Diego Supercomputer Center, UCSD

DICE Members
Staff
• Reagan Moore
• Chaitan Baru
• Amarnath Gupta
• Bertram Ludäscher
• Richard Marciano
• Arcot Rajasekar
• Wayne Schroeder
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu
• + NN * 4
Students
• Pratik Mukhopadhyay
• Azra Mulic
• Kevin Munroe
• Paul Nguyen
• Michail Petropolis
• Nicholas Puz
• Pavel Velikhov
• +/- NN
XML Tutorial, Bertram Ludäscher

Tutorial Outline
• Roadmap & Overview
• What about XML vs. E-records and Archives? (or: why it’s good to be here ;-)
• XML 101
• XML 232
• Querying & Transforming XML
• Mediation of Information using XML (MIX)
• Other Projects...

Some History (or: from fat via lean…
• SGML (Standard Generalized Markup Language)
  – ISO Standard, 1986, for data storage & exchange
  – Metalanguage for defining languages (through DTDs)
  – A famous SGML language: HTML!!
  – Separation of content and display
  – Used by the U.S. government & contractors, large manufacturing companies, technical information publishers, ...
  – SGML reference is 600 pages long
• XML (eXtensible Markup Language)
  – W3C (World Wide Web Consortium, http://www.w3.org/XML/) recommendation in 1998
  – Simple subset (80/20 rule) of SGML: “ASCII of the Web”, “Semantic Web”.
  – XML specification is 26 pages long

… to skinny and back!)
• Canonical XML
  – “normalization”, equivalence testing of XML documents
• SML (Simple Markup Language)
  – “Reduce to the max”: No Attributes / No Processing Instructions (PI) / No DTD / No non-character entity-references / No CDATA marked sections / Support for only UTF-8 character encoding / No optional features
• XML Schema
  – XML Schema definition language
  – Back to complex: Part I (Structures), Part II (Data Types), Part III, ahem, 0 (Primer)
• X-Zoo (Xoo?), “Brave New X-World”
  – Specifications: CSS, Digital Signatures, ebxml Project Teams, ebXML, IETF Specifications, Internationalization, IOTP (Internet Open Trading Protocol), OASIS, Requirements Documents, SMIL, SVG (Scalable Vector Graphics), Topic Maps, W3C Activity Pages, W3C Notes, W3C Standards-in-progress, WAP, WebDAV, XHTML, XLink, XPath, XSLT
  – Vocabularies: DTDs, Music, P3P, RDF, RSS, SMIL, W3C Standards-in-progress, WML, XHTML, XSL FO's, XSLT, XUL
  – Vertical Industries: Advertising, Commerce, Consortiums, Construction, Food, Insurance, Legal, Medical, Music, OASIS, Real Estate, Science, Space Exploration, Telecommunications, Travel, Weather

… but … FEAR NOT!

Back to the Future (or Archival for the Past...)
A time traveler sends a message in the virtual bottle, containing parts of the universal library of human and post-human mankind, back into the last third of the 20th century...
• ... when the Web, XML, WAP, B2B, and Petabytes were unheard of
• ... RAM was so precious that it was ok to deal with nibbles
• ... MS-DOS was still called CP/M
• ... and in fact Bill hadn’t moved into the garage yet but worked on a homework assignment by Christos, trying to sort pancakes faster (Gates, W. H. and Papadimitriou, C. "Bounds for Sorting by Prefix Reversal." Discr. Math. 27, 47-57, 1979.)
• Task: make sense out of the futuristic message in the past!

Our past futurist’s (future archeologist’s?) supercomputer looked like this …
62k CP/M VER 2.23 (Z80/DJDMA/VT100)
A>dir
[Screen listing: four-column CP/M directory of drive A: with .COM, .ASM, .HLP and .SUB files, among them PIP, STAT, DDT, ED, FORMAT, SYSGEN, MOVCPM, SUBMIT, XSUB, WS and MBASIC; the column layout was lost in extraction]
A>mbasic
BASIC-80 Rev. 5.22 [CP/M Version]
32783 Bytes free
Ok
Ever wondered where the 8-letter filenames and 3-letter extensions came from? ;-)

Message in the bottle: 1 • • • ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@>^@^C^@þÿ ^@^F^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^P^@^@%^@^@ ^@^A^@^@^@þÿÿÿ^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@ q^@^D^@^@^@^R¿^@^@^@^P^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@ ^@ ^@Some Quotations from the Universal Library^M 1 Famous Quotes^M 1. 1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day? ^MThou art more lovely and more temperate. ^MRough winds do shake the darling buds of May, ^MAnd summer's lease hath all too short a date. ^MSometime too hot the eye of heaven shines, ^MAnd often is his gold complexion dimmed. ^MAnd every fair from fair some declines, ^MBy chance or nature's changing course untrimmed. ^MBut thy eternal summer shall not fade, ^MNor lose possession of that fair thou owest, ^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest. ^MSo long as men can breathe, or eyes can see, ^MSo long live this, and this gives life to thee. ^M 1. 2 By William II^M[1, p. 265]^M223 The obvious mathematical breakthrough would be development of^Man easy way to factor large prime numbers. "^MReferences^M[1] W. H. Gates. The Road Ahead. Viking Penguin, 1995. ^M[2] W. Shakespeare. The Sonnets of Shakespeare. 609. ^M^@^@^@^@^@^@^@^@^@^@^@^ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^A^@þÿ^C^@^@ÿÿÿÿ^F^B^@^@^@À^@^@^@^@F^X^@^@^@Microsoft Word Document^@^@MSWord. Doc^@^P^@^@^@Word. Document. 8^@ô 9²q^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^ XML Tutorial, Bertram Ludäscher Tutorial, 9
Message in the bottle: 2 • • • • • • {rtf 1ansicpg 1252uc 1 deff 0deflang 1033deflangfe 1033{fonttbl{f 0fromanfcharset 0fprq 2{*panose 02020603050405020304}Times New Roman; } {f 1fswissfcharset 0fprq 2{*panose 020 b 060402020204}Arial; }^M {f 17fromanfcharset 238fprq 2 Times New Roman CE; }{f 18fromanfcharset 204fprq 2 Times New Roman Cyr; }{f 20fromanfcharset 161fprq 2 Times New R oman Greek; }{f 21fromanfcharset 162fprq 2 Times New Roman Tur; }^M … Some Quotations from the Universal Library^M par }pardplain s 2sb 240sa 60keepnwidctlparoutlinelevel 1adjustright bif 1cgrid {cgrid 0 1 Famous Quotes^M par }pardplain s 3sb 240sa 60keepnwidctlparoutlinelevel 2adjustright f 1cgrid {cgrid 0 1. 1 By William I^M par }pardplain s 4sb 240sa 60keepnwidctlparoutlinelevel 3adjustright bf 1cgrid {cgrid 0 [2, Sonnet XVIII]^M par }pardplain widctlparadjustright fs 20cgrid {f 1fs 24cgrid 0 Shall I compare thee to a summer's day? ^M par Thou art more lovely and more temperate. ^M par Rough winds do shake the darling buds of May, ^M … par }pardplain s 3sb 240sa 60keepnwidctlparoutlinelevel 2adjustright f 1cgrid {cgrid 0 1. 2 By William II^M par }pardplain s 4sb 240sa 60keepnwidctlparoutlinelevel 3adjustright bf 1cgrid {cgrid 0 [1, p. 265]^M par }pardplain widctlparadjustright fs 20cgrid {f 1fs 24cgrid 0 ldblquote The obvious mathematical breakthrough would be development of^M par an easy way to factor large prime numbers. "^M par }pardplain s 2sb 240sa 60keepnwidctlparoutlinelevel 1adjustright bif 1cgrid {cgrid 0 References^M par }pardplain widctlparadjustright fs 20cgrid {f 1fs 24cgrid 0 [1] W. H. Gates. The Road Ahead. Viking Penguin, 1995. ^M par [2] W. Shakespeare. The Sonnets of Shakespeare. 1609. }{fs 28 ^M par }} XML Tutorial, Bertram Ludäscher Tutorial, 10
Message in the bottle: 3 %!PS-Adobe-2. 0 %%Creator: dvipsk 5. 58 f Copyright 1986, 1994 Radical Eye Software %%Title: msg. dvi %%Pages: 1 … /X{S N}B /TR{translate}N /isls false N /vsize 11 72 mul N /hsize 8. 5 72 mul N /landplus 90{false}def /@rigin{isls{[0 landplus 90{1 -1}{-1 1} ifelse 0 0 0]concat}if 72 Resolution div 72 VResolution div neg scale … Te. XDict begin 39158280 55380996 1000 600 ( msg. dvi) @start /Fa 16 117 df<000001 C 0000000000003 C 000000 07 C 000000000000 FC 000000000001 FE 0000003 FE 0000007 FE 000000 FFE 000000 EFE 000001 EFE 0 000001 CFE 000000000038 FE 000000000070 FE 0 … %%End. Setup 1 0 bop 659 872 a Ff(Some)44 b(Quotations)f(from)f(the)i(Univ)l(ersal)h (Library)515 1470 y Fe(1)134 b(F)-11 b( amous)45 b(Quotes)515 1669 y Fd(1. 1)112 b(By)37 b(William)d(I)515 1822 y Fc([2)o(, )d(Sonnet)h (XVI)s(I])722 2004 y Fb(Shall)c(I)g(compare)e(thee)i(to)f(a)g (summer's)g(da)n(y? )722 2104 y(Thou)h(art)f(more)f( lo)n(v)n(ely)h(and)g (more)g(temp)r(erate. )722 2204 y(Rough)g(winds)h(do)f(shak)n(e)g(the)h (darling)e(buds)i(of)g(Ma)n(y)-7 b(, )722 2303 y(And)28 b(summer's)g(lease)e(hath)i(all)f(to)r(o)h(short)f(a)g(date. )722 2403 y(Sometime)h(to)r(o)f(hot)h(the)g(ey)n(e)f(of)h(hea)n(v)n(en)e (shines, )722 2503 y(And)i(often)g(is)g(his)f(gold)g(complexion)g XML Tutorial, Bertram Ludäscher Tutorial, 11
Message in the bottle: 4
\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day? \\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
And summer's lease hath all too short a date. \\
Sometime too hot the eye of heaven shines, \\
And often is his gold complexion dimmed. \\
…
\qquad So long as men can breathe, or eyes can see, \\
\qquad So long live this, and this gives life to thee. \\
\end{verse}
...
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}

Message in the bottle: 5
<HTML>
<HEAD>
<TITLE>Some Quotations from the Universal Library</TITLE>
</HEAD>
<BODY>
<B><FONT FACE="Arial" SIZE=5><P>Some Quotations from the Universal Library</P>
</FONT><I><FONT FACE="Arial"><P>1 Famous Quotes</P>
</B></I><P>1.1 By William I</P>
<B><P>[2, Sonnet XVIII]</P></B>
<P>Shall I compare thee to a summer's day?</P>
<P>Thou art more lovely and more temperate.</P>
<P>Rough winds do shake the darling buds of May,</P>
<P>And summer's lease hath all too short a date.</P>
<P>Sometime too hot the eye of heaven shines,</P>
<P>And often is his gold complexion dimmed.</P>
...
<P>So long as men can breathe, or eyes can see,</P>
<P>So long live this, and this gives life to thee.</P>
<P>1.2 By William II</P>
<B><P>[1, p. 265]</P>
</B><P>"The obvious mathematical breakthrough would be development of</P>
<P>an easy way to factor large prime numbers."</P>
<B><I><P>References</P>
</B></I><P>[1] W. H. Gates. The Road Ahead. Viking Penguin, 1995.</P>
<P>[2] W. Shakespeare. The Sonnets of Shakespeare. 1609.</P></FONT></BODY>
</HTML>

Message in the bottle: 6
<?xml version="1.0"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate.</line>
              <line>Rough winds do shake the darling buds of May,</line>
            </verse>
            …
        <subsection>
          <title>By William II</title>
          <quote bibref="gates-road-ahead-1995">
            <title>Page 265</title>
            <line>``The obvious mathematical breakthrough would be development of an easy way to factor large prime numbers.’’</line>
          </quote>
        </subsection>
    </book>
    …
  </books>
</universal_library>

XML as a Self-Describing Format
• can be “understood” using any (archaic CP/M) editor
• can be parsed easily
• contains its own structure (= parse tree) in the data
  => allows the e-archeologist to rediscover schema and content (= semantics!?)
• may also include an explicit schema description (DTD)
  => “meta-model”: definition of a language w.r.t. which it is valid
• allows separation of marked-up content from presentation (=> style sheets)
• as a self-describing format good for “archival into the past”
  => not bad for archival into the future

Some thoughts on how XML can help with e-record management...
• Assumption: represent e-records in XML
  => self-describing format (good for archival)
  => get a semistructured data model (flexible: encode regular tables, nested structures, objects, or even (cleaned up) HTML)
  => many tools (and many more to come; (re)use code): parsers, validators, query languages, storage
  => standards (good for interoperation, integration, etc.):
     – generic standards (XML, DTDs, XML Schema, XPath, ...)
     – community/industry standards (= specific markup languages)

... thoughts continued
• “E-Record Quality Assurance”:
  – by “subscribing” to a certain XML DTD/XML Schema/XML ???, you can make sure that “the same language is spoken”
  – validation using DTDs provides a first simple quality control:
    • are the right tags used?
    • is the nesting of elements ok w.r.t. the DTD?
    • are the order and multiplicity of elements ok?
  – if you need more => use validation w.r.t. an XML Schema
    • now: check also data types
    • use specialization and other mechanisms from object-oriented modeling
    • more integrity checking possible (cardinalities, …)
  – still want more integrity checks (ICs) or even “policies”?
    => use a declarative rule language for specifying the constraints and policies at design time. Implement them at run time, e.g., by adding the ICs to the XML DTD/Schema/…
    => checking ICs and policies is similar to issuing specific queries against the data
    => use query processors (relational DBs, XML tools) for integrity checking when possible
    => for evolution of records, look at versioning models for databases and temporal database models and query languages

Back to XML: Different Perspectives
• Document (SGML) Community
  – data = linear text documents
  – mark up (annotate) text pieces to describe context, structure, semantics of the marked text
• Database Community
  – XML as a (most prominent) example of the semistructured data model
    => captures the whole spectrum from highly structured, regular data to unstructured data

More Perspectives on XML
• "XML is the cure for your data exchange, information integration, e-commerce, [x-2-y, U name it] problems" (“snake oil/silver bullet theory”)
• "XML is nothing but (another) syntax (for Lisp, trees, …)" (“nothing new under the sun”)
  (books
    (book
      (author “Shakespeare”)
      (title “Sonnets”)
      (verse
        (line “Shall I compare…”)
        (line …) …)))

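The "just another syntax for trees" view can be made concrete with a few lines of Python. The sketch below is only an illustration: the function name `to_sexpr` and the tiny sample document are assumptions, and attributes and mixed content are deliberately ignored.

```python
import xml.etree.ElementTree as ET


def to_sexpr(elem):
    """Render an element as a Lisp-style S-expression.

    Illustration only: attributes and tail text are ignored.
    """
    parts = [elem.tag]
    text = (elem.text or "").strip()
    if text:
        parts.append('"%s"' % text)
    # Children keep their document order, just like the XML tree.
    parts.extend(to_sexpr(child) for child in elem)
    return "(" + " ".join(parts) + ")"


# Hypothetical sample document mirroring the slide's example.
doc = ET.fromstring(
    "<books><book><author>Shakespeare</author>"
    "<title>Sonnets</title></book></books>"
)
print(to_sexpr(doc))
# prints: (books (book (author "Shakespeare") (title "Sonnets")))
```

The mapping is one-way here; going back from S-expressions to XML would additionally need a convention for attributes.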
So what is XML (all about)?
Executive Summary:
• XML = HTML - idiosyncrasies (simplified syntax) + user-definable ("semantic") tags
• Separation of data and its presentation
  => simple, very flexible data exchange format: semistructured data model
  => new applications:
    • Information exchange (B2B), sharing (diglib), integration ("mediation"), archival, ...
    • Web site management (XML+XSL stylesheets), ...

Many X-cellent(?) Acronyms...
• XML (Extensible Markup Language)
• XML Namespaces
• XML DTDs, XML Schema
• RDF (Resource Description Framework)
• XSL (Extensible Style Sheet Language)
• XPath (the part shared by XSLT and XPointer), XLink
• XQL, XML-QL (XML Query Language)
• XMAS (XML Matching And Structuring language)
• eXcelon, ...
=> XML++ (i.e. += X-tensions) >> just syntax
=> a family of technologies (XML extensions, tools, ...)
=> generic standards and industry/community standards

XML Applications & Industry Initiatives
http://www.oasis-open.org/cover/xml.html#applications
• Advertising: adXML (place an ad onto an ad network or to a single vendor)
• Literature: Gutenberg (convert the world’s great literature into XML)
• Directories: dirXML (Novell’s Directory Services Markup Language, DSML)
• Web Servers: apacheXML (parsers, XSL, web publishing)
• Travel: openTravel (information for airlines, hotels, and car rental places)
• News: NewsML (creation, transfer and delivery of news)
• Human Resources: XML-HR (standardization of HR/electronic recruiting XML definitions)
• International Development: IDML (improve the management and exchange of information for sustainable development)
• Voice: VoxML (markup language for voice applications)
• Wireless: WAP, Wireless Application Protocol (wireless devices on the World Wide Web)
• Weather: OMF (Weather Observation Markup Format; simulation)
• Geospatial: ANZMETA (distributed national directory for land information)
• Banking: MBA (Mortgage Bankers Association of America: credit report, loan file, underwriting…)
• Healthcare: HL7 (DTDs for prescriptions, policies & procedures, clinical trials)
• Math: MathML (Mathematical Markup Language)
• Surveys: DDI (Data Documentation Initiative; “codebooks” in the social and behavioral sciences)

XML E-commerce Initiatives
• CommerceNet
  – eCo Framework: XML specs. to support interoperability among e-businesses
• Commerce One
  – Common Business Library (CBL): set of business components, docs. in DTD, XDR, SOX
• BizTalk
  – Microsoft spec. based on XML schemas
  – cXML (Commerce XML): tag-sets for e-procurement into BizTalk
• Electronic Data Interchange (EDI)
  – RosettaNet: common format for online ordering
  – FpML (Financial products Markup Language): sharing of financial data (interest rate & foreign exchange products)
• Open Buying on the Internet (OBI)
  – OBI: high volume B2B purchasing transactions over the Internet (Office Depot, Lockheed, barnesandnoble, AX...
• E-commerce and XML
  – VISA Invoices: the Visa Extensible Markup Language (XML) Invoice Specification provides a comprehensive list of data elements contained in most invoices, including: Buyer/Supplier, Shipping, Tax, Payment, Currency, Discount, and Line Item Detail.
• B2B Integration
  – code360: XML-Broker is middleware that manages XML-based transactions
  – Bluestone XML Suite: enables developing and deploying e-commerce, electronic data interchange, application integration and supply chain management applications. Bluestone XML Suite products include: XML-Server, VisualXML, XML-Contact and XwingML.
  – webMethods: provides companies with integrated direct links to buyers and suppliers

What’s Wrong with HTML?
Rendered: Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina. “Object Fusion in Mediator Systems”. In VLDB 96.
HTML confuses presentation with content:
<DT>
  <IMG SRC="greenball.gif">
  <A NAME="object-fusion"></A>
  Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina.
  <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">
  "Object Fusion in Mediator Systems".
  </A>
  In <I>VLDB 96.</I>
</DT>

... What’s Wrong with HTML ...
No Explicit Structure, Semantics, or Object-Orientation
<DT>
  <IMG SRC="greenball.gif">
  <A NAME="object-fusion"></A>
  Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina.
  <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">
  "Object Fusion in Mediator Systems".
  </A>
  In <I>VLDB 96.</I>
</DT>
[Slide callouts label the plain name list as Author, the link text as Title, and the italic text as Conference; none of these roles is explicit in the markup.]

... And Some Repercussions
• Lack of schema/semantics when querying the Web (HTML):
  – "find documents (books, papers, ...) where author = Michael Jackson" (... and learn how software engineering meets the moon walker ...)
  – "create a list of M. Jackson's books and (if available) their prices"
=> HTML is inappropriate for
  - data exchange
  - automation of information management (retrieval, manipulation, integration)

XML is Based on Markup
Markup indicates structure and semantics, decoupled from presentation:
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y. Papakonstantinou</author>
      <author>S. Abiteboul</author>
      <author>H. Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>

Elements and their Content
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y. Papakonstantinou</author>
      <author>S. Abiteboul</author>
      <author>H. Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>
[Slide callouts: element name (e.g. paper); element (start tag through end tag); element content (the children of authors); empty element (fullPaper); character content (the text inside title).]

Element Attributes
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y. Papakonstantinou</author>
      <author>S. Abiteboul</author>
      <author>H. Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>
[Slide callouts: attribute name (ID); attribute value ("object-fusion").]

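The element and attribute terminology above maps directly onto what a parser hands back. As a rough sketch (using Python's standard `xml.etree.ElementTree`; the variable names are illustrative assumptions), the bibliography example can be taken apart like this:

```python
import xml.etree.ElementTree as ET

BIB = """<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y. Papakonstantinou</author>
      <author>S. Abiteboul</author>
      <author>H. Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>"""

root = ET.fromstring(BIB)            # element named "bibliography"
paper = root.find("paper")           # child element
paper_id = paper.get("ID")           # attribute value of attribute "ID"
authors = [a.text for a in paper.iter("author")]  # character content
full_paper = paper.find("fullPaper")
# An empty element has neither text nor child elements.
is_empty = full_paper.text is None and len(full_paper) == 0

print(root.tag, paper_id, authors, full_paper.attrib, is_empty)
```

Nothing in the code depends on how the document would be displayed, which is the point of decoupling markup from presentation.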
XML = Labeled Ordered Trees
[Figure: tree with root bibliography and child paper; paper has children authors, fullpaper, title, ...; authors has author children with leaves "Yannis", "Serge", ...; title has the leaf "Object Fusion".]
<bibliography>
  <paper ...>
    <authors>
      <author>Yannis</author>
      <author>Serge</author>
      ...
    </authors>
    <title>Object Fusion</title>
    ...
  </paper>
</bibliography>
• semistructured data: labeled trees/graphs
• can also represent relational and object-oriented data

In Search of the Lost Structure & Semantics
• How do I learn and use the element structure of a document?
• How do I share structure and metadata/semantics with my community?
• How to make all this automatable?

Adding Structure and Semantics
• XML Document Type Definitions (DTDs):
  – define the structure of "allowed" documents (i.e., valid w.r.t. a DTD)
  – database schema => improve query formulation, execution, ...
• XML Schema
  – defines structure and data types
  – allows developers to build their own libraries of interchanged data types
• XML Namespaces
  – identify your vocabulary

XML DTDs as Extended CFGs
XML DTD:
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
Grammar:
bibliography -> paper*
paper -> authors fullPaper? title booktitle
authors -> author+
lhs = element (name)
rhs = regular expression over elements + strings (PCDATA)

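Reading each right-hand side as a regular expression suggests a very small checker: spell the child names of an element as a "word" and match it against the content model. The following is a sketch of that idea only, not a DTD validator; the table `CONTENT_MODELS`, the helper names, and the decision to treat elements without an entry as unconstrained are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The three content models from the slide, written as regular expressions
# over child element names (each name followed by one space).
CONTENT_MODELS = {
    "bibliography": r"(paper )*",
    "paper": r"authors (fullPaper )?title booktitle ",
    "authors": r"(author )+",
}


def children_ok(elem):
    """Check the sequence of child names against the element's model."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return True  # assumption: no recorded rule means "not checked"
    word = "".join(child.tag + " " for child in elem)
    return re.fullmatch(model, word) is not None


def conforms(elem):
    """Apply the check recursively to the whole tree."""
    return children_ok(elem) and all(conforms(child) for child in elem)


good = ET.fromstring(
    "<bibliography><paper><authors><author>A</author></authors>"
    "<title>T</title><booktitle>B</booktitle></paper></bibliography>"
)
# Same elements, but title precedes authors: order matters.
bad = ET.fromstring(
    "<bibliography><paper><title>T</title><authors><author>A</author>"
    "</authors><booktitle>B</booktitle></paper></bibliography>"
)
print(conforms(good), conforms(bad))  # prints: True False
```

A real validating parser additionally checks attributes, entities, and character content, which this sketch leaves out.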
Document Type Definitions (DTDs)
Define and Constrain Element Names & Structure
Element type declarations:
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>
Attribute list declarations:
<!ATTLIST fullPaper source ENTITY #REQUIRED>
<!ATTLIST paper ID ID>

Element Declarations
<!-- sequence of 0 or more paper -->
<!ELEMENT bibliography (paper*)>
<!-- authors followed by optional fullPaper, followed by title, followed by booktitle -->
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!-- sequence of 1 or more author -->
<!ELEMENT authors (author+)>
<!-- character content -->
<!ELEMENT author (#PCDATA)>
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>
<!ATTLIST fullPaper source ENTITY #REQUIRED>
<!ATTLIST paper ID ID>

Element Content Declarations
Declaration            Meaning
<element2>             Exactly one <element2>
cardinality:
  R?                   Zero or one instances of R
  R*                   Zero or more instances of R
  R+                   One or more instances of R
R1|R2|…|Rn             One instance of R1 or R2 or … Rn
R1, R2, …, Rn          Sequence of R’s, order matters
#PCDATA                Character content
EMPTY                  Empty element
(#PCDATA | e | …)*     Mixed content
ANY                    Anything goes

Attributes
<person ID="yannis"> Yannis’ info </person>
<bibliography>
  <paper ID="object-fusion" ROLE="publication">
    <authors>
      <author authorRef="yannis">Y. Papakonstantinou</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <related papers="semistructured-data mediators"/>
  </paper>
</bibliography>
[Slide callouts: ID is the object identity attribute; ROLE is CDATA (character data); authorRef is an IDREF (intradocument reference); source is a reference to an external ENTITY.]

Attribute Types
Type              Meaning
ID                Token unique within the document
IDREF             Reference to an ID token
IDREFS            Reference to multiple ID tokens
ENTITY            External entity (image, video, …)
ENTITIES          External entities
CDATA             Character data
NMTOKENS          Name tokens
NOTATION          Data other than XML
Enumeration       Choices
Conditional Sec   INCLUDE & IGNORE declarations
Attributes may be: REQUIRED, IMPLIED (optional)
Attributes can have: default values, which may be FIXED

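The ID and IDREF rows describe two integrity rules: ID values must be unique within a document, and every IDREF must point at an existing ID. A validating parser enforces both from the DTD. The sketch below only imitates the rules by hand; the wrapper element `db`, the attribute name `authorRef`, and the dangling reference "serge" are hypothetical and chosen to trigger a violation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one person, one paper, one dangling reference.
DOC = """<db>
  <person ID="yannis">Yannis' info</person>
  <paper ID="object-fusion">
    <author authorRef="yannis">Y. Papakonstantinou</author>
    <author authorRef="serge">S. Abiteboul</author>
  </paper>
</db>"""


def check_ids(root, id_attr="ID", ref_attr="authorRef"):
    """Return (duplicate IDs, references without a matching ID)."""
    ids, duplicates = set(), set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            # A value seen before violates uniqueness.
            (duplicates if value in ids else ids).add(value)
    refs = {e.get(ref_attr) for e in root.iter() if e.get(ref_attr) is not None}
    return sorted(duplicates), sorted(refs - ids)


duplicates, dangling = check_ids(ET.fromstring(DOC))
print(duplicates, dangling)  # prints: [] ['serge']
```

In a real setting the attribute names would come from the ATTLIST declarations rather than being passed in as parameters.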
Uses of XML Entities
• Physical partition
  – size, reuse, "modularity", … (both XML docs & DTDs)
• Non-XML data
  – unparsed entities: binary data
• Non-standard characters
  – character entities
• Shorthand for phrases & markup

Entities & Physical Structure
A logical element can be split into multiple physical entities:
Mylife.xml:  DTD ... <mylife> [Chap1.xml] [Chap2.xml] </mylife>
Chap1.xml:   <teen>yada</teen>
Chap2.xml:   <adult>blah..</adult>

External Text Entities
External text entity declaration (the system identifier is a URL):
<!ENTITY chap1 SYSTEM "chap1.xml">
Entity references:
<mylife> &chap1; &chap2; </mylife>
Logically equivalent to inlining the file contents:
<mylife> <teen>yada</teen> <adult>blah</adult> </mylife>

Types of Entities
• Internal (to a doc) vs. External (use URI)
• General (in XML doc) vs. Parameter (in DTD)
• Parsed (XML) vs. Unparsed (non-XML)

Internal Text Entities
Internal text entity declaration:
<!ENTITY WWW "World Wide Web">
Entity reference:
<p>We all use the &WWW;.</p>
Logically equivalent to the text actually appearing:
<p>We all use the World Wide Web.</p>

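The "logically equivalent" claim can be observed with a parser: an internal entity declared in the document's internal DTD subset is expanded before the application sees the text. A minimal sketch with Python's standard `xml.etree.ElementTree` follows (the DOCTYPE wrapper around the slide's declaration is added here so the snippet is a complete document):

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY WWW "World Wide Web">
]>
<p>We all use the &WWW;.</p>"""

p = ET.fromstring(DOC)
# The application never sees "&WWW;", only the replacement text.
print(p.text)  # prints: We all use the World Wide Web.
```

External entities such as the chapter files are a different matter: whether and how they are fetched depends on the parser and its configuration.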
Unparsed (& "Binary") Entities
Declare an external and unparsed entity:
<!ENTITY fusion SYSTEM "fusion.ps" NDATA ps>
Declare the attribute type to be ENTITY:
<!ATTLIST fullPaper source ENTITY #REQUIRED>
Element with ENTITY attribute:
<fullPaper source="fusion"/>
NOTATION declaration (helper app):
<!NOTATION ps SYSTEM "ghostview.exe">

From Docs to Data: XML Schema
• XML DTDs (part of the XML spec.)
  – flexible, semistructured data model (nesting, ANY, ?, *, |, ...)
  – but document-oriented (SGML heritage)
  – no support for namespaces, datatypes, inheritance (e.g., type of book.title may be different from poem.title)
• XML Schema (W3C working draft)
  – schema definition language in XML
  – data-oriented: data types
  – extends capabilities of DTD

XML Schema: Example
Define an order "record" with (mandatory) fields and an (optional) attribute:
<type name="Order">
  <element name="name" type="string"/>
  <element name="street" type="string"/>
  <element name="zip" type="integer"/>
  <...>
  <attribute name="orderDate" type="date"/>
</type>

XML Schema: Example
New types can be derived by extension or restriction:
<type name="personName">
  <element name="title" minOccurs="0"/>
  <element name="forename" minOccurs="0" maxOccurs="*"/>
  <element name="surname"/>
</type>
<type name="extendedName" source="personName" derivedBy="extension">
  <element name="generation" minOccurs="0"/>
</type>
<type name="simpleName" source="personName" derivedBy="restriction">
  <restrictions>
    <element name="title" maxOccurs="0"/>
    <element name="forename" minOccurs="1" maxOccurs="1"/>
  </restrictions>
</type>

W3C Work on XML Schemas
• Structures:
  – Specify complex element structure and
  – Set constraints on the permitted values of the content of those elements
• Datatypes:
  – Sets forth a standard of content datatypes and
  – Sets rules for generating new types from them

Further Approaches
• RELAX (REgular LAnguage description for XML)
  – Standardized by INSTAC XML SWG of Japan.
  – Compared with DTD, RELAX has new features:
    · RELAX grammars are represented in the XML instance syntax
    · RELAX borrows rich data types of XML Schema Part 2
    · RELAX is namespace-aware
    · many others
• XML-Data, XML-DR, DCD, SOX, DDML, DSD, Schematron...

Normalized Data/Metadata Representation
• Resource Description Framework (RDF)
  – Metadata model
  – The designer can describe objects, add properties to define and describe them, and also make complicated statements about the objects (statements about relationships between resources).
  – The specification comes in two sections:
    • Model & Syntax (viewed as directed, labeled graphs)
    • RDF Schemas (using an XML vocabulary)

Resource Description Framework (RDF)
• Metadata is useful for information retrieval (esp. if no other schema info or semantics is available)
• Idea: representation-independent encoding of metadata as triples (Resource, PropertyType, Value):
  – (uri1, DC:creator, uri2), (uri2, vCard:name, smith), ...
• "Semantic Net":
  uri1 --DC:creator--> uri2 --vCard:name--> smith

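Because triples are independent of any particular serialization, they can be held in the simplest possible data structure and still be queried by following edges. The sketch below uses plain Python tuples rather than an RDF library; the helper names `objects` and `creator_names` are illustrative assumptions.

```python
# The two triples from the slide, as (resource, property type, value).
triples = [
    ("uri1", "DC:creator", "uri2"),
    ("uri2", "vCard:name", "smith"),
]


def objects(subject, prop):
    """All values v such that (subject, prop, v) is a known triple."""
    return [o for s, p, o in triples if s == subject and p == prop]


def creator_names(resource):
    """Follow DC:creator, then vCard:name: a two-step path in the graph."""
    return [
        name
        for creator in objects(resource, "DC:creator")
        for name in objects(creator, "vCard:name")
    ]


print(creator_names("uri1"))  # prints: ['smith']
```

The same two-step traversal corresponds to walking the two labeled edges of the semantic net drawn on the slide.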
Identifying Vocabularies
• My element may not be your element:
  – geometry context: <element>line</element>
  – chemistry context: <element>oxygen</element>
  – SGML/XML context: ...
• use XML namespaces to identify the vocabulary

XML Namespaces
• mechanism for globally unique tag names:
  <h:html xmlns:xdc="http://www.xml.com/books"
          xmlns:h="http://www.w3.org/HTML/1998/html4">
    <h:head><h:title>Book Review</h:title></h:head>
    ...
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
      ...
  </h:html>
  => mix of different tag vocabularies without confusion
• namespaces only identify the vocabulary; additional mechanisms required for structure and meaning of tags

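A namespace-aware parser resolves each prefix to its URI, so the two `title` elements end up with different qualified names. The sketch below uses Python's standard `xml.etree.ElementTree`; the elided parts of the slide's document are left out and a closing `</xdc:bookreview>` tag is added so that the snippet is well-formed. The prefix map `NS` is an assumption of the example, not part of the document.

```python
import xml.etree.ElementTree as ET

DOC = """<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <xdc:bookreview>
    <xdc:title>XML: A Primer</xdc:title>
  </xdc:bookreview>
</h:html>"""

# Prefixes used in queries are local to this code; only the URIs matter.
NS = {
    "h": "http://www.w3.org/HTML/1998/html4",
    "xdc": "http://www.xml.com/books",
}

root = ET.fromstring(DOC)
html_title = root.find("h:head/h:title", NS).text
book_title = root.find("xdc:bookreview/xdc:title", NS).text

print(root.tag)     # prints: {http://www.w3.org/HTML/1998/html4}html
print(html_title)   # prints: Book Review
print(book_title)   # prints: XML: A Primer
```

As the slide notes, this only tells the two vocabularies apart; it says nothing about the structure or meaning of either one.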
Processing XML
• Non-validating parser:
  – checks that XML doc is syntactically well-formed
• Validating parser:
  – checks that XML doc is also valid w.r.t. a given DTD
• Parsing yields tree/object representation:
  – Document Object Model (DOM) API
• Or a stream of events (open/close tag, data):
  – Simple API for XML (SAX)

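The well-formedness check of a non-validating parser can be sketched in a few lines. This example relies on Python's standard `xml.etree.ElementTree`, which reports syntax errors but does not validate against a DTD; the function name `is_well_formed` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML; no DTD validation is attempted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<paper><title>Object Fusion</title></paper>"))  # True
print(is_well_formed("<paper><title>Object Fusion</paper>"))          # False
```

Checking validity w.r.t. a DTD needs a validating parser, which is outside what this sketch shows.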
DOM Structure Model and API
• hierarchy of Node objects:
  – document, element, attribute, text, comment, ...
• language-independent programming DOM API:
  – get... first/last child, prev/next sibling, childNodes
  – insertBefore, replace
  – getElementsByTagName
  – ...
• alternative event-based SAX API (Simple API for XML)
  – does not build a parse tree (reports events when encountering begin/end tags)
  – for (partially) parsing large documents

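The navigation calls listed above look much the same in every DOM binding. As one possible illustration, here they are in Python's `xml.dom.minidom`; the one-line sample document (written without whitespace between tags so that no whitespace text nodes appear) is an assumption of the example.

```python
from xml.dom.minidom import parseString

dom = parseString(
    '<bibliography><paper ID="object-fusion">'
    "<title>Object Fusion in Mediator Systems</title>"
    "<booktitle>VLDB 96</booktitle>"
    "</paper></bibliography>"
)

paper = dom.documentElement.firstChild        # first child of the root
title = paper.firstChild                      # element node "title"
sibling = title.nextSibling                   # element node "booktitle"
titles = dom.getElementsByTagName("title")    # search the whole tree

print(paper.nodeName, paper.getAttribute("ID"))
print(title.firstChild.data)                  # text node below title
print(sibling.nodeName, len(paper.childNodes), len(titles))
```

The whole document is held in memory as Node objects, which is what makes random navigation possible and large documents expensive.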
DOM Summary
• Object-oriented approach to traverse the XML node tree
• Automatic processing of XML docs
• Manipulation & updating of XML on client & server
• Database interoperability mechanism
• Memory-intensive

SAX Event-Based API
• Pros:
  – The whole file doesn’t need to be loaded into memory
  – XML stream processing
  – Simple and fast
  – Allows you to ignore less interesting data
• Cons:
  – limited expressive power (query/update) when working on streams
    => application needs to build (some) parse-tree when necessary

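Both the pros and the cons show up in even a tiny handler: the parser only reports events, and the application keeps exactly the state it needs. The sketch below uses Python's `xml.sax`; the class name `AuthorCollector` and the sample input are illustrative assumptions.

```python
import xml.sax


class AuthorCollector(xml.sax.ContentHandler):
    """Collect the text of every author element, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._buffer = None  # not None only while inside an author element

    def startElement(self, name, attrs):
        if name == "author":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so pieces are accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self.authors.append("".join(self._buffer))
            self._buffer = None


handler = AuthorCollector()
xml.sax.parseString(
    b"<paper><authors><author>Y. Papakonstantinou</author>"
    b"<author>S. Abiteboul</author></authors>"
    b"<title>Object Fusion in Mediator Systems</title></paper>",
    handler,
)
print(handler.authors)  # prints: ['Y. Papakonstantinou', 'S. Abiteboul']
```

Anything that needs context beyond the current event (for example, "authors of papers whose title matches") forces the handler to keep more state, which is the partial parse tree the slide refers to.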
Querying XML
• What can be done to XML so far:
  – generation: from HTML, DBs, manually, …
  – parsing: with/without DTD (valid/well-formed XML)
  – accessing: APIs for XML applications:
    • DOM (in memory, tree-based), SAX (event-based)
• Now: Query languages for XML
  – XML-QL, XMAS, XPath, XSL(T), XQL, ...

Querying XML
• Why not just query XML with SAX or DOM?
  – SAX: very simple “event-based” queries: ok
  – DOM: simple navigational queries (getChildNodes, getNextSibling, getElementsByTagName, …): ok
• But: these are “low-level” APIs
  – comparable to iterator/cursor APIs for RDBs (but more powerful!)
  – used to write XML applications
  – “high-level” querying, restructuring and transformation (and updates??) is tedious
  – => need an analogue to high-level relational query languages (SQL, QBE, Logic (Datalog), …)
  => Query languages for XML
Querying XML
• No "official" W3C XML QL yet (but bits and pieces)
• numerous quite different XML QLs are popping up
• some XML QL overviews, comparisons, and resources:
  – XML Query Languages: Experiences and Exemplars (co-authored by several XML QL gurus)
  – XML and Query Languages (OASIS Cover Pages)
  – Comparative Analysis of Five XML Query Languages (A. Bonifati, S. Ceri)
  – A Data Model and Algebra for XML Query (Philip Wadler et al.; “functional (Haskell) perspective”)
  – XML-QL vs XSLT queries (Geert Jan Bex and Frank Neven; for (future) XSLT experts only ;-)
  – Introduction to XMAS (the XML QL of the MIX project)
Querying XML
• Different XML QL paradigms depending on the community:
  – (relational, oo, semistructured) database perspective
    • Lorel, YaTL, XML-QL, XMAS, FLORA/FLORID, ...
  – document processing perspective
    • XQL, XSL(T), XPath, ...
  – functional programming perspective
    • QLs with structural recursion, …
Important QL Features (DB Perspective)
– typical parts of a query:
  • (match) pattern (selects parts of the source XML tree without looking at data)
  • filter condition (selects further, now looking at the data)
  • answer construction (putting the results together, possibly reordered, grouped, etc.)
– reordering based on nested queries, grouping, sorting, or Skolem functions
– tag variables, path expressions for defining the patterns without requiring knowledge of the DTD
Selection Queries with XQL/XPath
• Find the root element (bookstore) of this document:
    /bookstore
• Find all author elements anywhere within the current document:
    //author
• Find all books where the value of the style attribute on the book is equal to the value of the specialty attribute of the bookstore element at the root of the document:
    //book[/bookstore/@specialty = @style]
Sample Queries with XQL/XPath
• Find the root element (bookstore) of this document:
    /bookstore
• Find all author elements anywhere within the current document:
    //author
• Find all books where the value of the style attribute on the book is equal to the value of the specialty attribute of the bookstore element at the root of the document:
    //book[/bookstore/@specialty = @style]
• Find all books with author/first-name equal to 'Bob' and all magazines with price less than 10:
    //(book[author/first-name = 'Bob'] $union$ magazine[price $lt$ 10])
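The four sample queries can be approximated in Python with the standard `xml.etree.ElementTree`. This is a sketch under stated assumptions: ElementTree implements only a subset of XPath and no XQL, so the attribute comparison and the union are written as ordinary Python filters, and the bookstore document is invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the bookstore the queries assume.
DOC = """
<bookstore specialty="novel">
  <book style="novel"><title>A</title><author><first-name>Bob</first-name></author></book>
  <book style="textbook"><title>B</title><author><first-name>Ann</first-name></author></book>
  <magazine><title>M</title><price>5</price></magazine>
  <magazine><title>N</title><price>12</price></magazine>
</bookstore>
"""
root = ET.fromstring(DOC)

# /bookstore : the root element
root_tag = root.tag

# //author : all author elements at any depth
authors = root.findall(".//author")

# //book[/bookstore/@specialty = @style]
# Comparing two attributes is outside ElementTree's XPath subset,
# so the predicate is evaluated in Python.
matching_books = [
    book for book in root.iter("book")
    if book.get("style") == root.get("specialty")
]

# //(book[author/first-name = 'Bob'] $union$ magazine[price $lt$ 10])
bob_books = [
    book for book in root.iter("book")
    if book.findtext("author/first-name") == "Bob"
]
cheap_magazines = [
    mag for mag in root.iter("magazine")
    if float(mag.findtext("price")) < 10
]
union_titles = [item.findtext("title") for item in bob_books + cheap_magazines]

print(root_tag, len(authors))
print([book.findtext("title") for book in matching_books])
print(union_titles)
```

A full XPath engine would accept the path expressions directly; the Python filters here only mirror their meaning.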
Presenting XML: Extensible Stylesheet Language (XSL)
• Why Stylesheets?
  – separation of content (XML) from presentation (XSL)
• Why not just CSS for XML?
  – XSL is far more powerful:
    • selecting elements
    • transforming the XML tree
    • content-based display (result may depend on data)
XSL Overview
• XSL stylesheets are denoted in XML syntax
• XSL components:
  1. a language for transforming XML documents (XSLT: integral part of the XSL specification)
  2. an XML formatting vocabulary (Formatting Objects: >90% of the formatting properties inherited from CSS)
XSLT Processing Model
[Figure: XML source tree + XSLT stylesheet -> Transformation -> result tree (XML, HTML, csv, text…)]
XSLT Processing Model
• XSL stylesheet: collection of template rules
• template rule: (pattern => template)
• main steps:
  – match pattern against source tree
  – instantiate template (replace current node “.” by the template in the result tree)
  – select further nodes for processing
• control can be a mix of
  – recursive processing ("push": <xsl:apply-templates> ...)
  – program-driven ("pull": <xsl:for-each> ...)
But first: some syntactic sugar, PLEASE...
• instead of something complicated like
    y=f(x)
• in the brave new XSLT world you can “simply” write this as:
    <xsl:variable name="y">
      <xsl:call-template name="f">
        <xsl:with-param name="x"/>
      </xsl:call-template>
    </xsl:variable>
Template Rule: Example
    <xsl:template match="product">                          (pattern: the match attribute)
      <table>                                               (template: the element content)
        <xsl:apply-templates select="sales/domestic"/>
      </table>
      <table>
        <xsl:apply-templates select="sales/foreign"/>
      </table>
    </xsl:template>
(i) match pattern: process <product> elements
(ii) instantiate template: replace each product with two HTML tables
(iii) select the <product> grandchildren (“sales/domestic”, “sales/foreign”) for further processing
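The three steps (match, instantiate, select) can be mimicked in a few lines of Python. This is an analogy only, not how an XSLT processor is implemented: rules are matched on the bare tag name instead of XPath patterns, the fallback rule is a simplification of XSLT's built-in rules, and the `domestic`/`foreign` templates and the sample input are invented.

```python
import xml.etree.ElementTree as ET

# Rule table: match pattern -> template function. Real XSLT patterns are
# XPath-based; matching on the bare tag name is a simplifying assumption.
RULES = {}

def template(match):
    """Register a function as the template for elements named `match`."""
    def register(fn):
        RULES[match] = fn
        return fn
    return register

def apply_templates(nodes):
    """Instantiate the matching template for each selected node ("push" style)."""
    parts = []
    for node in nodes:
        rule = RULES.get(node.tag)
        if rule is not None:
            parts.append(rule(node))
        else:
            # Simplified stand-in for XSLT's built-in rule: copy the element's
            # leading text and keep processing its children.
            parts.append((node.text or "") + apply_templates(list(node)))
    return "".join(parts)

@template("product")
def product(node):
    # Mirrors the slide: two tables, each filled by further template processing.
    domestic = apply_templates(node.findall("sales/domestic"))
    foreign = apply_templates(node.findall("sales/foreign"))
    return f"<table>{domestic}</table><table>{foreign}</table>"

@template("domestic")
def domestic_row(node):
    return f"<tr><td>domestic: {node.text}</td></tr>"

@template("foreign")
def foreign_row(node):
    return f"<tr><td>foreign: {node.text}</td></tr>"

source = ET.fromstring(
    "<product><sales><domestic>10</domestic><foreign>20</foreign></sales></product>"
)
result = apply_templates([source])
print(result)
```

Control flows through `apply_templates`, so the shape of the source tree drives the output, which is the "push" flavor; a "pull" stylesheet would instead loop explicitly over selected nodes.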
Match/Select Patterns
• match patterns ⊆ select patterns (XPath expressions, defined in http://w3.org/TR/xpath)
• Examples:
  – /mybook/chapter[2]/section/*
  – chapter|appendix
  – chapter//para
  – div[@class="appendix" and position() mod 2 = 1]//para
  – ../@lang
XSLT Processing Flavors: Recursive Descent Processing
• take some XML file on books: books.xml
• now prepare it with style: books.xsl
• and enjoy the result: books.html
• the recipe for cooking this was:
    java com.icl.saxon.StyleSheet books.xml books.xsl > books.html
• and now some different flavors: books2.xsl, books3.xsl
Creating the Result Tree...
• Literal result elements: non-XSL elements (e.g., HTML) appear “literally” in the result tree
• Constructing elements:
    <xsl:element name="…">
      attribute & children definition
    </xsl:element>
  (similar for xsl:attribute, xsl:text, xsl:comment, …)
• Generating text:
    <xsl:template match="person">
      <p>
        <xsl:value-of select="@first-name"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="@surname"/>
      </p>
    </xsl:template>
Creating the Result Tree...
• Further XSL elements for...
  – Numbering
    • <xsl:number value="position()" format="1 ">
  – Conditions
    • <xsl:if test="position() mod 2 = 0">
  – Repetition...
Creating the Result Tree: Repetition
    <xsl:template match="/">
      <html>
        <head> <title>customers</title> </head>
        <body>
          <table>
            <tbody>
              <xsl:for-each select="customers/customer">
                <tr>
                  <th> <xsl:apply-templates select="name"/> </th>
                  <xsl:for-each select="order">
                    <td> <xsl:apply-templates/> </td>
                    ...
      </html>
    </xsl:template>
Creating the Result Tree: Sorting
    <xsl:template match="employees">
      <ul>
        <xsl:apply-templates select="employee">
          <xsl:sort select="name/last"/>
          <xsl:sort select="name/first"/>
        </xsl:apply-templates>
      </ul>
    </xsl:template>

    <xsl:template match="employee">
      <li>
        <xsl:value-of select="name/first"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="name/last"/>
      </li>
    </xsl:template>
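For comparison, here is roughly what the sorting stylesheet computes, sketched in Python with the standard `xml.etree.ElementTree`. The employee data is invented, and Python's plain string comparison is only an approximation of the ordering an XSLT processor would apply.

```python
import xml.etree.ElementTree as ET

# Hypothetical input matching the element names used by the stylesheet.
DOC = """<employees>
  <employee><name><first>Zoe</first><last>Adams</last></name></employee>
  <employee><name><first>Bob</first><last>Smith</last></name></employee>
  <employee><name><first>Al</first><last>Smith</last></name></employee>
</employees>"""

root = ET.fromstring(DOC)

def sort_key(employee):
    """Primary key name/last, secondary key name/first, as in the two xsl:sort elements."""
    return (employee.findtext("name/last"), employee.findtext("name/first"))

# Build the <ul> result tree: one <li> per employee, in sorted order.
ul = ET.Element("ul")
for employee in sorted(root.findall("employee"), key=sort_key):
    li = ET.SubElement(ul, "li")
    li.text = f'{employee.findtext("name/first")} {employee.findtext("name/last")}'

html = ET.tostring(ul, encoding="unicode")
print(html)
```

The two-part sort key plays the role of the two nested `xsl:sort` elements: ties on the last name are broken by the first name.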
More on XSL
• XSL(T):
  – Conflict resolution for multiple applicable rules
  – Modularization: <xsl:include>, <xsl:import>
  – …
• XSL Formatting Objects
  – a la CSS
• XPath (navigation syntax + functions) = XSLT ∩ XPointer
• ...
The MIX Project: Mediation of Information using XML
Joint effort between SDSC and the UCSD CSE Department
Mediation of Information using XML (MIX)
[Figure: an XML Query is posed against XML View Document(s); each view is produced by a Wrapper over a source. Sources shown: Data Source (e.g. home ads), Native XML Database, Legacy Source.]
Export:
• Schema & Metadata (DTD, RDF, …)
• Capabilities
Integrated / Mediated views
[Figure: Integrated XML View; Definition in XML Query Lang; Mediator; XML View Document(s) from Wrapper + Data Source (twice) and from an XML Data Source]
A Typical Mediation Scenario
[Figure: User Interface <-> Query / Results <-> Mediator (integrated views over heterogeneous sources) <-> Query “fragment” <-> Wrapper (converts incoming query and outgoing data) <-> sources: SQL Database, GIS, HTML]
MIX Components
• MIXm Mediator tool-kit
  – allows definition of views across multiple resources
  – views are expressed in a declarative query language
  – query engine to execute queries on views
• XML Matching And Structuring (XMAS) query language
  – operates on a given set of XML documents to produce a new XML document, using XMAS algebra
An XML Query (XMAS)
Query patterns:
    $C: <*.condo> <address zip=$Z/> </condo> AT www.condo.com
    AND
    $S: <*.school type=elementary> <address zip=$Z/> </school> AT schools.org
    ...
Answer construction:
    <folder> $C $S for $S </folder> for $C
Sample source document:
    <RealEstateAgent>
      <name>J. Smith</name>
      <condos>
        <condo>
          <address ... zip=92037>
          <price>$170k OBO</price>
          <bedrooms>2</bedrooms>
        </condo>
      </condos>
    </RealEstateAgent>
Sample answer document:
    <condosAndSchools>
      <folder>
        <condo>
          <address ... zip=92037>
          <price>$170k OBO</price>
          <bedrooms>2</bedrooms>
        </condo>
        <school>
          <name>La Jolla High</name>
          <address … zip=92037>
        </school>
        <school>…</school>
      </folder>
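One plausible reading of the query above is a join on the zip code with the matching schools grouped under each condo. The Python sketch below follows that reading; it is an assumption about the query's meaning rather than the XMAS semantics as implemented in MIX, and the sample documents (including the school names) are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the documents at the two sources.
CONDOS = ET.fromstring("""<RealEstateAgent><name>J. Smith</name><condos>
  <condo><address zip="92037"/><price>$170k OBO</price><bedrooms>2</bedrooms></condo>
  <condo><address zip="92122"/><price>$150k</price><bedrooms>1</bedrooms></condo>
</condos></RealEstateAgent>""")

SCHOOLS = ET.fromstring("""<schools>
  <school type="elementary"><name>School A</name><address zip="92037"/></school>
  <school type="elementary"><name>School B</name><address zip="92037"/></school>
  <school type="high"><name>School C</name><address zip="92037"/></school>
</schools>""")

def condos_and_schools(condos_doc, schools_doc):
    """Join condos with elementary schools on zip; one <folder> per matched condo."""
    result = ET.Element("condosAndSchools")
    # <*.condo>: a condo element at any depth of the source
    for condo in condos_doc.iter("condo"):
        zip_code = condo.find("address").get("zip")
        # Pattern plus filter: elementary schools sharing the condo's zip ($Z)
        matches = [
            school for school in schools_doc.iter("school")
            if school.get("type") == "elementary"
            and school.find("address").get("zip") == zip_code
        ]
        if not matches:
            continue  # the AND in the query requires both patterns to match
        # Answer construction: the condo followed by its group of schools
        folder = ET.SubElement(result, "folder")
        folder.append(condo)
        folder.extend(matches)
    return result

answer = condos_and_schools(CONDOS, SCHOOLS)
folders = list(answer)
first_folder_tags = [child.tag for child in folders[0]]
school_names = [s.findtext("name") for s in folders[0].findall("school")]
print(first_folder_tags, school_names)
```

The shared variable `$Z` becomes the equality test on `zip`, and the nested "for $S ... for $C" grouping becomes the inner list built once per condo.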
MIX components...
• DOM-VXD: DOM Virtual XML Document extension
  – a “lazy” implementation of DOM. Supports browsing/navigation of XML documents with a server-side, “compute as you go” model
• Blended Browsing and Querying (BBQ) interface
  – supports navigation and querying of XML documents
  – generates XMAS queries on mediator views
  – generates XMAS queries modified by DOM-VXD operations to incrementally evaluate the result set, to support navigation of XML documents
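The "compute as you go" idea can be illustrated with Python generators. This is a generic sketch of lazy, navigation-driven evaluation and not the DOM-VXD implementation: `LazySource`, `lazy_view`, and the integer items (standing in for XML fragments) are all hypothetical.

```python
from typing import Callable, Iterable, Iterator

class LazySource:
    """Hypothetical source that records how many items were actually fetched."""

    def __init__(self, items: Iterable[int]):
        self._items = list(items)
        self.fetched = 0

    def __iter__(self) -> Iterator[int]:
        for item in self._items:
            self.fetched += 1      # count every item pulled from the source
            yield item

def lazy_view(source: Iterable[int], predicate: Callable[[int], bool]) -> Iterator[int]:
    """A view over the source that does work only when the client asks for more."""
    for item in source:
        if predicate(item):
            yield item

source = LazySource(range(1, 1001))
view = lazy_view(source, lambda n: n % 10 == 0)

# The client "navigates" to the first two results only.
first_two = [next(view), next(view)]
print(first_two, source.fetched)
```

Only the part of the source needed to answer the client's navigation steps is read; the remaining items are never touched unless the client moves on.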
Navigation driven evaluation
[Figure: client sends navigation commands to the Lazy Mediator (view definition q(s1 … sn)) and receives the result; the mediator sends source navigation commands to the XML sources s1 ... sn]
Blended Browsing and Querying UI (BBQ)
Another MIX Example: CDL/AMICO Mediator Prototype (AMICO/XML Demo)
[Figure: BBQ Interface sends an XMAS query to the MIXm View based on the AMICO DTD and gets back an XML doc of paintings; Q1: Find title, type, and image ID of paintings; Q2: Find creator and related metadata; Wrapper over MARC Database; AMICO XML Database; Request for image (X.509) to SRB/MCAT and HPSS returns a tif file]
XSL Stylesheet for AMICO Answer Docs
... and the Result (+BBQ)
[Figure: BBQ query composition; XML answer document; XSL rendered output]
Projects at DICE/SDSC
• National Archives and Records Administration, NARA
  – Persistent Archives and Electronic Records
• NHPRC/NARA
• XML and GIS
  – aXioMap
• I2T: An Information Integration Testbed for Digital Government
Projects at SDSC (… cont)
• AMICO
  – In conjunction with the California Digital Library (CDL)
  – Part of the NSF DLI-2 project
• ESRI
• Community of Science, Inc.
• Networked Earthquake Engineering Simulation (NEES)
  – NSF program
Information Based Computing
[Figure: Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science; Information Management: Data Storage, Archival Storage, Collection Building, Digital Library; Digital Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL]
Integrating Data Set Management
• Model-Based Information Management
  – Rule-based ontology mapping, conceptual-level mediation - CMIX
• Data Grid
  – Data federation across multiple libraries - MIX
• Digital Library
  – Interoperable services for information discovery and presentation - SDLIP
• Data Collection
  – Tools for managing data set collections on databases - MCAT
• Data Handling
  – Systems for data retrieval from remote storage - SRB
• Persistent Archives
  – Storage of data collections for 30 years
Model-Based Mediation
• Knowledge-based mediation
  – conceptual-level integration
• Rule-based ontology maps
  – map source XML to CM to FL (ontologies, views)
• Models for exporting
  – rules
  – integrity constraints
  – query capabilities
  – data & schema (XML/DTDs)
Federation of Brain Data
[Figure: MODEL-BASED Mediation over sources PROTLOC, ANATOM; CCB, Montana SU; Surface atlas, Van Essen Lab; stereotaxic atlas, LONI; MCell, CNL, Salk; NCMIR, UCSD; outputs: Result (XML/XSLT), Result (VML)]
Further Information
• xml.com
• w3.org
• xml.org
• ibm.com/xml
• ...
• Mediation of Information using XML (MIX):
  – www.npaci.edu/DICE/MIX/
  – www.db.ucsd.edu/Projects/MIX/