Presentation: XML und Data Management - Introduction, Hachim Haddouti


  • Number of slides: 49

XML und Data Management - Introduction
Hachim Haddouti
Al Akhawayn University, SSE
H.Haddouti@alakhawayn.ma
http://mail.alakhawayn.ma/~H.Haddouti

Table of Contents
• Intro
• Motivation
• Semistructured data
  – Why do we need semistructured data?
  – What is semistructured data?

Motivation
• Some data really is unstructured
• Examples:
  – World Wide Web
  – Data exchange formats
  – Data integration

Motivation - Web
• Why do we want to treat the Web as a database?
  – Maintain integrity
  – Query based on structure (as opposed to content)
  – Introduce some “organization”
• However, the Web has no structure; refer to it as an enormous graph
• Some people claim the database research community missed the boat when it comes to the World Wide Web

Motivation – Data Formats
• Much (most?) of the world’s data is in data formats
  – Formats defined for interchange and archiving of data
• Formats vary in generality
  – ASN.1, EDI quite general
  – Scientific data formats tend to be “fixed schema”
• Textual representation given by data formats is not immediately translatable into a standard relational/object-oriented representation

ASN.1
• Abstract Syntax Notation number One (ASN.1)
  – International standard that aims at specifying data used in communication protocols
  – e.g., format for transporting data between two layers of a network operating system
• Now used for storing bibliographic and genetic data
• Want more? Go to www.asn1.elibel.tm.fr/en/introduction/

Sample ASN.1
Module-order DEFINITIONS AUTOMATIC TAGS ::=
BEGIN
Order ::= SEQUENCE {
    header Order-header,
    items  SEQUENCE OF Order-line }
Order-header ::= SEQUENCE {
    reference NumericString (SIZE (12)),
    date      NumericString (SIZE (8)) -- MMDDYYYY --,
    client    Client,
    payment   Payment-method }
Client ::= SEQUENCE {
    name     PrintableString (SIZE (1..20)),
    street   PrintableString (SIZE (1..50)) OPTIONAL,
    postcode NumericString (SIZE (5)),
    town     PrintableString (SIZE (1..30)),
    country  PrintableString (SIZE (1..20)) DEFAULT "France" }
Payment-method ::= CHOICE {
    check       NumericString (SIZE (15)),
    credit-card Credit-card,
    cash        NULL }
Credit-card ::= SEQUENCE {
    type        Card-type,
    number      NumericString (SIZE (20)),
    expiry-date NumericString (SIZE (6)) -- MMYYYY -- }
Card-type ::= ENUMERATED {cb(0), visa(1), eurocard(2), diners(3), american-express(4)}
-- etc
END

EDI
• EDI = Electronic Data Interchange
• Provides a collection of standard message formats and an element dictionary to support exchange of data via any electronic messaging service

Sample EDI Invoice File
ISA~00~ ~ZZ~YOUR COMM-ID ~14~SLKP COMM-ID ~000227~1053~U~00401~000000012~0~P~>
GS~IN~YOUR COMM-ID~SLKP COMM-ID~20000227~1053~3~X~004010
ST~810~0001
BIG~19991118~001001~19990926~11441~~~DR
N1~RE~REMIT COMPANY, INC~92~002377703
N3~P.O. BOX 111
N4~ANYTOWN~NC~27106
N1~ST~SARA LEE FOOTWEAR
N3~SHIPPING STREET
N4~OUR TOWN~PA~17855
N1~BT~SARA LEE FOOTWEAR~92~10
N3~470 W. HANES MILL RD
N4~WINSTON SALEM~NC~27105
ITD~05~3~~~~~60
DTM~011~19991118
IT1~0001~1470~YD~2~~BP~BUYERPART
PID~F~~~~Square Rubber Hose
TDS~294000
ISS~1470~YD
CTT~1~1470
SE~19~0001
GE~1~3
IEA~1~000000012
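The invoice above is positional text: each segment starts with an identifier (ST, BIG, N1, ...) followed by elements separated by "~". A minimal Python sketch of how such text could be split is shown here. It assumes one segment per line and "~" as the element separator, as in the sample; parse_segments is a hypothetical helper, not part of any EDI library, and real interchanges declare their own separators and terminators.

```python
def parse_segments(text, element_sep="~"):
    """Split EDI-style text into (segment_id, elements) pairs.

    Assumption: one segment per line, elements separated by element_sep.
    """
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(element_sep)
        segments.append((parts[0], parts[1:]))
    return segments


# A few segments copied from the sample invoice.
sample = """ST~810~0001
BIG~19991118~001001~19990926~11441~~~DR
N1~RE~REMIT COMPANY, INC~92~002377703
N4~ANYTOWN~NC~27106
SE~19~0001"""

segments = parse_segments(sample)
for segment_id, elements in segments:
    print(segment_id, elements)
```

Note that the meaning of each element depends on its position and on the element dictionary, not on anything in the text itself. This is the gap that a self-describing data model is meant to close.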

Motivation - Browsing
• To query a database, one needs to understand the schema
• Schemas may be hard to understand; users may want to start by querying data with little or no knowledge of the schema
  – Where in the database is the string “Casablanca”?
  – Are there integers in the database greater than 2^16?
  – What objects in the db have an attribute name that starts with “act”?
• Some extensions to relational query languages have been proposed for such queries
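The three browsing questions can be answered without a schema once the data is self-describing. The sketch below assumes a nested encoding in which a complex value is a list of (label, value) pairs; the data, the labels and the walk helper are all invented for illustration, and the integer threshold is taken to be 2**16.

```python
def walk(value, path=()):
    """Yield (path, atom) for every atomic value in a nested structure."""
    if isinstance(value, list):      # complex value: list of (label, value)
        for label, child in value:
            yield from walk(child, path + (label,))
    else:                            # atomic value: number or string
        yield path, value


# Hypothetical data, invented for this illustration only.
db = [
    ("film", [("title", "Casablanca"), ("actor", "A. Example"), ("runtime", 102)]),
    ("film", [("title", "Untitled"), ("actress", "B. Example"), ("tickets", 70000)]),
]

# Where in the database is the string "Casablanca"?
where_casablanca = [path for path, atom in walk(db) if atom == "Casablanca"]

# Are there integers greater than 2**16?
big_integers = [atom for _, atom in walk(db)
                if isinstance(atom, int) and atom > 2**16]

# Which attribute names start with "act"?
act_labels = sorted({label for path, _ in walk(db)
                     for label in path if label.startswith("act")})

print(where_casablanca, big_integers, act_labels)
```

Each query is a full traversal. That is the price of not knowing the schema, and one reason query languages for semistructured data pay attention to path expressions.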

Motivation – Integration of Heterogeneous Data
• With the growing amount of information distributed across multiple sources comes an increased need for tools and algorithms that provide an integrated, unified interface to information
• Information integration is another application that calls for a flexible, dynamic, self-describing data model

Content from Multiple Sources
• Who offers the cheapest 10+ Nm motor, matching with my XYZ 123 drive, operating <12 V, available within 2 days?
• …and, if possible, is a preferred supplier
[Figure: the query draws on several sources: Product Catalog, Pricing, Delivery, Supplier Information, ERP (P + D), and 3rd party content, spread over multiple suppliers]

Supply Chain Management
• Flaky Inc.’s shipment is coming two days later than needed…
• Given the state of our inventory, expected orders, identity (and value) of customers, and pricing and delivery options, how can we satisfy our best customer at the price we promised them?
[Figure: data sources: Expected Supplies, Outstanding Customer Orders, Sales Pipeline Info, Product Returns, Shipments to ACME (Flaky Inc., Credible Inc.), Pricing & Delivery]

… Or Viewed in Different Terms
“Heterogeneities are everywhere”
[Figure: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information

Integrated View of Heterogeneous Data
[Figure: an Integration System placed over World Wide Web, Digital Libraries, Scientific Databases, Personal Databases]
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing

World of Data Access and Integration Servers
• Provide the eBusiness application with unified access to the data sources
• Data from multiple sources appear as if they come from one (potentially virtual) database
• As ubiquitous as application servers
• Driven by initiatives in
  – Customer Relationship Management
  – Supply Chain Management
  – eCommerce and eContent
  – Business Intelligence and Warehousing

Most-Generic Integration System Architecture
[Figure: Clients, Integrator, Wrapper ... Wrapper, Information Source ... Information Source]

How Do We Represent Data in the Integration System?
• Relational Data Model
  – Set of rows and columns
  – Fixed set of simple data types
• Data cube
  – Specialized warehouse management system
  – Uses a single, multi-dimensional relation as model
• Neither!! Both models are too rigid to accommodate heterogeneous data from multiple sources

Bottom Line
• We need a bridge between the repositories where the data resides (e.g., data warehouse, transactional databases) and where it is used (Web interface, business application)
• Data model that allows for the exchange of data with structure
• Relaxes the strictures of existing, highly structured database systems

New Data Model: Semistructured Data Model

Semistructured Data: Particularities (1)
• Structure is irregular
  – Data heterogeneities
  – Pieces of data missing
  – Extra information is recorded (annotations)
  – Type variations (Dollars/Euros – Address)
• Structure may be implicit
  – Often in files: text + grammar (e.g., SGML)
    • Need to parse – structuring may be hidden

Semistructured Data: Particularities (2)
• Structure may be partial
  – Parts of data lack structure (e.g., images)
  – Some data may yield little structure (e.g., plain text)
• Types are only indicative
  – Unlike databases, some sources may not have a strict typing policy

Semistructured Data: Particularities (3)
• A-priori schema vs. a-posteriori:
  – Database: fix schema, then populate
  – Web: design a lot of Web pages, then define schema to facilitate access
• Schema is large
• Schema often ignored in queries
  – IR queries and browsing

Semistructured Data: Particularities (4)
• Schema is rapidly evolving
• Data element type is eclectic
  – Structure of a piece of information may depend on point of view
  – e.g., Person object contains name, address, phone as strings and picture as gif file

Example
{name: “Alan”, tel: 2157786, email: “agb@abc.com”}
{name: {first: “Alan”, last: “Black”}, tel: 2157786, email: “agb@abc.com”}
• Different from usual tuples in that we allow duplicates:
{name: “Alan”, tel: 2157786, tel: 2159989, email: “agb@abc.com”}

Graph Representation
[Figure: the records drawn as edge-labeled trees. In the first, a node has edges name, tel, email leading to “Alan”, 2157786, “agb@abc.com”. In the second, the name edge leads to a node with edges first (“Alan”) and last (“Black”), alongside tel (2157786) and email (“agb@abc.com”)]

Example
• Possible to describe sets of tuples:
{person: {name: “Alan”, tel: 2157786, email: “agb@abc.com”},
 person: {name: “Sara”, tel: 2344381, email: “srb@math.edu”},
 person: {name: “Fred”, tel: 7767546, email: “fht@pto.org”}
}

Example – Variations in Structure
• Possible to describe sets of heterogeneous tuples:
{person: {name: “Alan”, phone: 2157786, email: “agb@abc.com”},
 person: {name: {first: “Sara”, last: “Green”}, tel: 2344381, email: “srb@math.edu”},
 person: {name: “Fred”, tel: 7767546, height: 6’4”}
}

Base Types
• Numbers: (1234, 4532, …)
• Strings: (“Alan”, “ssdrf”, “agb@abc.com”, …)
• Labels: (name, email, …)
• Distinguishable by syntax
• Other types such as gif, date, wav, etc. can be added as needed
  – Each value has a tag that indicates its type and possibly an encoding
  – Most data formats have their own tagging

Representing Relational Databases
• Relational schema r1(a, b, c) and r2(c, d)
  – r1 and r2 are relations
  – a, b, c, d are column names
• Instance is some data that conforms to this specification
  – Usually depicted as rows in a table

Pictorially
r1:
a    b    c
a1   b1   c1
a2   b2   c2
a3   b3   c3

r2:
c    d
c2   d2
c3   d3
c4   d4

Self-describing Approach
• Using our new model and syntax, we can describe the whole database formally as {r1: i1, r2: i2}, where i1 and i2 are sets of rows:
{r1: {row: {a: a1, b: b1, c: c1},
      row: {a: a2, b: b2, c: c2},
      row: {a: a3, b: b3, c: c3}
     },
 r2: {row: {c: c2, d: d2},
      row: {c: c3, d: d3},
      row: {c: c4, d: d4}
     }
}
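The translation from tables to the self-describing form is mechanical: every tuple becomes a row whose fields carry their column names. A minimal Python sketch follows, assuming complex values are encoded as lists of (label, value) pairs; relation_to_ssd is a hypothetical helper name.

```python
def relation_to_ssd(columns, rows):
    """Turn a relation into a list of ("row", [(column, value), ...]) pairs."""
    return [("row", list(zip(columns, row))) for row in rows]


# The instance from the slide: r1(a, b, c) and r2(c, d).
database = [
    ("r1", relation_to_ssd(["a", "b", "c"],
                           [("a1", "b1", "c1"),
                            ("a2", "b2", "c2"),
                            ("a3", "b3", "c3")])),
    ("r2", relation_to_ssd(["c", "d"],
                           [("c2", "d2"),
                            ("c3", "d3"),
                            ("c4", "d4")])),
]

for relation, rows in database:
    for label, fields in rows:
        print(relation, label, fields)
```

The column names, which a relational system stores once in the schema, are now repeated in every row. That redundancy is what makes the data self-describing.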

Other Representations
[Figure: the database as a tree. The root has edges r1 and r2; each leads to a node with one row edge per tuple; each row node has one edge per column (a, b, c for r1; c, d for r2) leading to its value]

Other Representations
[Figure: an alternative tree without row edges. The root has one r1 edge and one r2 edge per tuple; each leads directly to a node with one edge per column (a, b, c for r1; c, d for r2)]

The Object Exchange Model (OEM)
• Common model for heterogeneous information exchange, “schema-less”
• Each object: OID, Label, Type, Value
  – OID = unique identifier or NULL
  – Label = character string descriptor
  – Type = atomic data type or set
  – Value = atomic value or set of object references
• “Help pages” for labels
• Two query languages
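The four-part OEM object maps naturally onto a small record type. The Python sketch below is only an illustration of the structure listed above: the class name, the store dictionary, the OID strings and the sample values are all assumptions, not part of any OEM implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Atomic = Union[int, float, str]


@dataclass
class OEMObject:
    oid: Optional[str]                 # unique identifier, or None for NULL
    label: str                         # character string descriptor
    type: str                          # atomic type name, or "set"
    value: Union[Atomic, List[str]]    # atomic value, or list of referenced OIDs


store = {}


def add(obj):
    """Register an object under its OID so references can be resolved."""
    store[obj.oid] = obj
    return obj


# Hypothetical objects, invented for illustration.
add(OEMObject("&t1", "title", "string", "Some Title"))
add(OEMObject("&a1", "author", "string", "Some Author"))
add(OEMObject("&b1", "book", "set", ["&t1", "&a1"]))


def resolve(obj):
    """Return the atomic value, or the referenced objects for a set."""
    if obj.type == "set":
        return [store[oid] for oid in obj.value]
    return obj.value


print([child.label for child in resolve(store["&b1"])])
```

Keeping references as OIDs rather than nested objects is what lets several parents share one child and lets the data form a graph rather than a tree.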

OEM
• Provides:
  – Flexibility: rigid domain models are not needed for those software components which do not require one
  – Extensibility: information servers can use whatever information is available and can rapidly make their knowledge available on an experimental basis
  – Stability: the structure of the information remains stable even as new information is added
• Removes dependencies on compile-time object definitions

Representing Semistructured Data Using OEM
[Figure: OEM objects <book, {t1, a…, t1: …, a1: …, with callouts Label, Set Value, Memory Addresses, Atomic Value]

Representing Semistructured Data Using OEM
[Figure: <collection, {b1, a1, ...}> followed by the objects b1:, t:, a:, n:, p:, a1:, v:, w:, x:, ...]

Example: ACeDB
• ACeDB (A C. elegans Database)
• Genome database system developed since 1989 primarily by Jean Thierry-Mieg (CNRS, Montpellier) and Richard Durbin (Sanger Institute)
• Provides a custom database kernel, with a non-standard data model designed specifically for handling scientific data flexibly
• ACeDB is used both for managing data within genome projects and for making genomic data available to the wider scientific community
  – Popular with biologists for its flexibility and ability to accommodate missing data
  – Underlying data model is quite general

Sample ACeDB Schema
?person  name  firstname UNIQUE Text       - at most one first name
               lastname UNIQUE Text        - at most one last name
         tel Int                           - several numbers
?book    authors ?person                   - set of persons
         title UNIQUE Text                 - at most one title
         chapter-headings Int UNIQUE Text  - array of strings
…

Sample ACeDB Data
&ASmith  person name firstname “Alan”      - ASmith is key/OID
                     lastname “Smith”
&JMiller person name firstname “Janet”
                     lastname “Miller”
&LH17.23.15  authors &ASmith
                     &JMiller
             title “A Very Brief History of Time”
             chapter-headings 1 “The Beginning”
             chapter-headings 2 “The Middle”
             chapter-headings 3 “The End”
…

Is ACeDB Semistructured?
• Any label other than the top identifier can be missing
• OIDs provided by user
• ACeDB requires a schema, but data may be missing; no strong typing (labels instead)

Proposal for Generic SS Data Syntax
• Semistructured data expressions: ssd-expressions
  – Standard syntax for labels and for atomic values
  – Object identifiers start with an ampersand, e.g., &123
<ssd-expr>     ::= <value> | oid <value> | oid
<value>        ::= atomicvalue | <complexvalue>
<complexvalue> ::= {label: <ssd-expr>, … , label: <ssd-expr>}
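The three cases of an ssd-expression (a value, an oid followed by a value, or a bare oid) can be mirrored one to one in a small printer. The Python sketch below is an assumption-laden illustration: the Ssd class and to_text function are hypothetical names, atomic values are limited to numbers and strings, and output uses straight quotes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Ssd:
    # value: atomic (int/str), a list of (label, Ssd) pairs,
    #        or None when the expression is a bare oid reference
    value: Any = None
    oid: Optional[str] = None


def to_text(expr):
    """Render an Ssd object in the ssd-expression syntax."""
    if expr.value is None:                       # case: oid
        return expr.oid
    if isinstance(expr.value, list):             # complex value
        inner = ", ".join(f"{label}: {to_text(child)}"
                          for label, child in expr.value)
        body = "{" + inner + "}"
    elif isinstance(expr.value, str):            # atomic string
        body = f'"{expr.value}"'
    else:                                        # atomic number
        body = str(expr.value)
    # case: oid value, or plain value
    return f"{expr.oid}{body}" if expr.oid else body


mary = Ssd(oid="&o1", value=[
    ("name", Ssd("Mary")),
    ("age", Ssd(45)),
    ("child", Ssd(oid="&o2")),
])
print(to_text(mary))
```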

Example
{person: &o1{name: “Mary”, age: 45, child: &o2, child: &o3},
 person: &o2{name: “John”, age: 17, relatives: {mother: &o1, sister: &o3}},
 person: &o3{name: “Jane”, country: “Canada”, mother: &o1}
}
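Object identifiers turn the nested expression into a graph: a child edge from &o1 and a mother edge back to &o1 form a cycle that plain nesting could not express. The Python sketch below stores each identified object once and follows label paths through the references. The encoding (lists of (label, value) pairs, oids as strings that are keys of objects) and the helper names are assumptions made for illustration.

```python
# Each identified object from the example, stored once under its oid.
objects = {
    "&o1": [("name", "Mary"), ("age", 45), ("child", "&o2"), ("child", "&o3")],
    "&o2": [("name", "John"), ("age", 17),
            ("relatives", [("mother", "&o1"), ("sister", "&o3")])],
    "&o3": [("name", "Jane"), ("country", "Canada"), ("mother", "&o1")],
}


def deref(value):
    """Replace an oid by the object it names; leave other values alone."""
    if isinstance(value, str) and value in objects:
        return objects[value]
    return value


def follow(start, path):
    """Return all values reached from `start` by following the labels in `path`."""
    frontier = [deref(start)]
    for label in path:
        next_frontier = []
        for node in frontier:
            if isinstance(node, list):           # only complex values have edges
                next_frontier.extend(deref(v) for lbl, v in node if lbl == label)
        frontier = next_frontier
    return frontier


children_names = follow("&o1", ["child", "name"])
via_cycle = follow("&o2", ["relatives", "mother", "child", "name"])
print(children_names, via_cycle)
```

The second query starts at John, goes to his mother, and comes back down to her children, so it passes through John's own object again. This is only possible because references are resolved lazily instead of being copied.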

Pictorially (edge-labeled graph)
[Figure: the root has three person edges, to &o1, &o2 and &o3. &o1 has edges name (“Mary”), age (45) and two child edges, to &o2 and &o3. &o2 has edges name (“John”), age (17) and relatives, whose node has edges mother (to &o1) and sister (to &o3). &o3 has edges name (“Jane”), country (“Canada”) and mother (to &o1)]

Terminology
• Terminology used to describe semistructured data is that of basic graph theory

Basic Graph Theory
• Graph (N, E): set of nodes N and set of edges E
• Each edge e is associated with a pair of nodes, source node s(e) and target node t(e)
• Path is a sequence e_1, e_2, … , e_k of edges such that t(e_i) = s(e_{i+1}), 1 ≤ i ≤ k-1
  – Number of edges in the path is its length
• Node r is a root for graph (N, E) if there is a path from r to n for every node n in N, n ≠ r
• A cycle in a graph is a path between a node and itself
  – Graph with no cycles is acyclic
• A rooted graph is a tree if there is a unique path from r to n for every n ∈ N, n ≠ r
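These definitions translate directly into code. The Python sketch below represents an edge as a (source, label, target) triple and checks the root, cycle and tree properties. The function names and node names are assumptions, and the tree test uses an equivalent in-degree condition (root has no incoming edge, every other node has exactly one) instead of enumerating paths.

```python
from collections import defaultdict


def adjacency(edges):
    """Map each source node to the list of its target nodes."""
    out = defaultdict(list)
    for source, _label, target in edges:
        out[source].append(target)
    return out


def reachable(edges, r):
    """Set of nodes reachable from r (including r itself)."""
    out = adjacency(edges)
    seen, stack = {r}, [r]
    while stack:
        node = stack.pop()
        for nxt in out[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_root(nodes, edges, r):
    return set(nodes) <= reachable(edges, r)


def has_cycle(nodes, edges):
    out = adjacency(edges)
    state = {}                                   # node -> "active" or "done"

    def visit(node):
        state[node] = "active"
        for nxt in out[node]:
            if state.get(nxt) == "active":       # back edge closes a cycle
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in nodes)


def is_tree(nodes, edges, r):
    indegree = defaultdict(int)
    for _source, _label, target in edges:
        indegree[target] += 1
    return (is_root(nodes, edges, r)
            and indegree[r] == 0
            and all(indegree[n] == 1 for n in nodes if n != r))


# The person graph (atomic leaves omitted); "rel" names the relatives node.
nodes = ["root", "&o1", "&o2", "&o3", "rel"]
edges = [("root", "person", "&o1"), ("root", "person", "&o2"),
         ("root", "person", "&o3"), ("&o1", "child", "&o2"),
         ("&o1", "child", "&o3"), ("&o2", "relatives", "rel"),
         ("rel", "mother", "&o1"), ("rel", "sister", "&o3"),
         ("&o3", "mother", "&o1")]

# A single record with three fields is a tree.
tree_nodes = ["rec", "n1", "n2", "n3"]
tree_edges = [("rec", "name", "n1"), ("rec", "tel", "n2"), ("rec", "email", "n3")]

print(is_root(nodes, edges, "root"), has_cycle(nodes, edges),
      is_tree(nodes, edges, "root"), is_tree(tree_nodes, tree_edges, "rec"))
```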

Sample Graphs
[Figure: two sample graphs: a directed graph with a cycle and no root, and a tree]

Summary
Def.: The semistructured data model is a syntax for data with no separate syntax for types, i.e., data that has no separate schema language or data definition language.
• Data graph vs. ssd-expressions
• Our semistructured data model is that of an edge-labeled graph
  – Each edge has a label