  • Number of slides: 87

XML
Major Sources:
• CIS 550 Course Notes, U. Penn, http://www.cis.upenn.edu/~cis550/slides/xml.ppt (source for many slides)
• Yaron Kanza’s slides (source for many slides)
• Brian Travis, XML Day at Microsoft Tech·Ed 99
• XML Black Book
• Other sources

Part I: Background
What’s the difference between the world of documents and information retrieval, and the world of databases and query interfaces?

Documents vs Databases
Document world:
• plenty of small documents
• usually static
• implicit structure: tagging
• human friendly
• content: section, paragraph, toc, form/layout, annotation
• paradigms: “Save as”, wysiwyg
• meta-data: author name, date, subject
Database world:
• a few large databases
• usually dynamic
• explicit structure (schema): records
• machine friendly
• content: schema, data, methods
• paradigms: Atomicity, Consistency, Isolation, Durability
• meta-data: schema description

What to do with them
Documents:
• editing
• printing
• spell-checking
• counting words
• retrieving (IR)
• searching
Database:
• updating
• cleaning
• querying
• composing/transforming

HTML
• Lingua franca for publishing hypertext on the World Wide Web
• Designed to describe how a Web browser should arrange text, images and push-buttons on a page
• Easy to learn, but does not convey structure
• Fixed tag set
[Figure: an HTML fragment with the opening tag and the text (PCDATA) labelled; the text reads “Welcome to the XML course” and “Introduction”.]

Thin red line
• The line between the document world and the database world is not clear.
• In some cases, both approaches are legitimate.
• An interesting middle ground is data formats, of which XML is an example.

The Structure of XML
• XML consists of tags and text
• Tags come in pairs: an opening tag and a matching closing tag
• They must be properly nested: a tag opened inside an element is closed before that element closes (good); closing the outer tag first, so that the two elements overlap, is bad
(You can’t do overlapping tags in HTML.)
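The nesting rule can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the tag names `date`, `day` and `month` are assumptions chosen for illustration and are not taken from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical tag names, used only to illustrate the rule.
nested = "<date><day>25</day><month>8</month></date>"
overlapping = "<date><day>25</date></day>"  # outer tag closed first

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

The parser rejects the second string because the `day` element is still open when `date` is closed.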

XML text
XML has only one “basic” type: text. It is bounded by tags, e.g. the title The Big Sleep and the year 1935 (1935 is still text).
XML text is called PCDATA (for parsed character data). It uses Unicode; a character can be written as a numeric character reference, e.g. &#x05DE; for the Hebrew letter Mem.
Later we shall see how new types are specified by XML-data.
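A short sketch of what “everything is text” means in practice, using Python's standard parser. The element names `year` and `letter` are hypothetical; the second element uses the numeric character reference for the Hebrew letter Mem (U+05DE).

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, for illustration only.
year = ET.fromstring("<year>1935</year>")
print(type(year.text).__name__, year.text)  # the content arrives as a string

# A numeric character reference is resolved by the parser into the character itself.
letter = ET.fromstring("<letter>&#x05DE;</letter>")
print(letter.text == "\u05de")
```

Any conversion to a number or a date is left to the application (or to a typing layer on top of XML).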

XML structure
Nesting tags can be used to express various structures. E.g. a tuple (record) with the fields:
Jeff Cohen
04-828-1345
054-470-778
jeffc@cs.technion.ac.il

XML structure (cont.)
• We can represent a list by using the same tag repeatedly: ...

XML structure (cont.)
• We can represent a list by using the same tag repeatedly:
Yossi Orr, 04-828-1345, yossio@cs.technion.ac.il
Irma Levy, 03-426-1142, irmal@yourmail.com

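Terminology
The segment of an XML document between an opening and a corresponding closing tag is called an element.
Example: a person record with the values Malcolm Atchison, (215) 898 4321 and mp@dcs.gla.ac.sc.
• The whole record, from its opening tag to its closing tag, is an element.
• Each tagged value inside it is an element, a sub-element of the record.
• A segment that does not run from an opening tag to its matching closing tag is not an element.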

XML is tree-like
[Figure: a tree with root person and children name, tel and email; the leaves are Malcolm Atchison, (215) 898 4321 and mp@dcs.gla.ac.sc.]
Semistructured data models typically put the labels on the edges.

Mixed Content
An element may contain a mixture of sub-elements and PCDATA, e.g. an airline element that holds the name British Airways together with the text “World’s favorite airline”.
Data of this form is not typically generated from databases. It is needed for consistency with HTML.
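To make mixed content concrete, here is a minimal sketch with Python's `xml.etree.ElementTree`. The exact markup is an assumption (only the `airline` and `name` tags and the two text fragments are recoverable from the slide).

```python
import xml.etree.ElementTree as ET

# Assumed markup: a sub-element followed by PCDATA inside the same parent.
doc = "<airline><name>British Airways</name> World's favorite airline</airline>"
airline = ET.fromstring(doc)

name = airline.find("name")
print(name.text)          # text inside the sub-element
print(name.tail.strip())  # PCDATA that follows the sub-element inside <airline>
```

ElementTree stores the trailing PCDATA in the `tail` of the preceding sub-element, which is one common way for tree APIs to represent mixed content.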

A Complete XML Document
Jeff Cohen
04-828-1345
054-470-778
jeffc@cs.technion.ac.il

The Header Tag
• You can leave out the encoding attribute and the processor will use the UTF-8 default.

Two ways of representing a DB
projects: title, budget, managedBy
employees: name, ssn, age

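Project and Employee relations in XML
Projects and employees are intermixed:
<db>
  <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
  <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
  <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
  <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
  ...
</db>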

Project and Employee relations in XML (cont’d)
Employees follow projects:
projects: (Pattern recognition, 10000, Joe), (Auto guided vehicles, 70000, Sandra), ...
employees: (Joe, 344556, 34), (Sandra, 2234, 35), ...

Project and Employee relations in XML (cont’d)
Or without “separator” tags:
projects: (Pattern recognition, 10000, Joe), (Auto guided vehicles, 70000, Sandra), ...
employees: (Joe, 344556, 34), (Sandra, 2234, 35), ...

Attributes
An (opening) tag may contain attributes. These are typically used to describe the content of an element, e.g. a dictionary entry whose words cheese, fromage and branza each carry an attribute describing the word, followed by the meaning “A food made …”.

Attributes (cont’d)
Another common use for attributes is to express dimension or type, e.g. a picture whose height (2400) and width (96) carry a dimension attribute and whose data (M05-.+C$@02!G96YE) carries its type.
A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.

Attributes (cont’d)
Jeff Cohen, 04-828-1345, 054-470-778, jeffc@cs.technion.ac.il
Irma Levy, 03-426-1142, irmal@yourmail.com

When to use attributes
It’s not always clear when to use attributes. The same person (F. MacNiel, fmacn@dcs.barra.ac.sc, ...) can carry the value 123 45 6789 either as an ssno attribute on the person tag or as a sub-element next to the name and email.

Using IDs
Jeff Cohen, 04-828-1345, 054-470-778, jeffc@cs.technion.ac.il
Irma Levy, 03-426-1142, irmal@yourmail.com

ODL schema
class Movie ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set casts inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name )
{
  attribute string name;
  relationship set acted_In inverse Movie::casts;
  attribute int age;
  attribute set directed;
};

An example
A db element containing movie elements (each with an id attribute, a title, a director, a cast with an idrefs attribute, and a budget) and actor elements:
movies:
• Waking Ned Divine, Kirk Jones III, 100,000
• Dragonheart, Rob Cohen, 110,000
• Moondance, Dagmar Hirtz, 90,000
• ...
actors:
• David Kelly
• Sean Connery, 68
• Ian Bannen
• ...
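The idrefs attribute on cast acts like a set of pointers to actor elements. Below is a minimal sketch of following those pointers with Python's standard library; the id values (`m1`, `a1`, ...), the `actor`/`name` element names and the particular cast assignment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: ids and the idrefs assignment are assumptions.
doc = """
<db>
  <movie id="m1"><title>Waking Ned Divine</title><cast idrefs="a1 a3"/></movie>
  <actor id="a1"><name>David Kelly</name></actor>
  <actor id="a2"><name>Sean Connery</name></actor>
  <actor id="a3"><name>Ian Bannen</name></actor>
</db>
"""
db = ET.fromstring(doc)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in db.iter() if el.get("id") is not None}


def cast_names(movie):
    """Resolve the space-separated idrefs of a movie's cast to actor names."""
    refs = movie.find("cast").get("idrefs").split()
    return [by_id[ref].findtext("name") for ref in refs]


print(cast_names(by_id["m1"]))
```

Note that nothing in the markup itself says the references point at actors; the lookup would happily return any element with a matching id.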

Part II: Document Type Descriptors
Imposing structure on XML documents


In XMLSpy Grid View

Document Type Descriptors
• Document Type Descriptors (DTDs; the XML specification calls them Document Type Definitions) impose structure on an XML document.
• There is some relationship between a DTD and a schema, but it is not close, hence the need for additional “typing” systems.
• The DTD is a syntactic specification.

Example: The Address Book
A person entry contains, in order:
• MacNiel, John: exactly one name
• Dr. John MacNiel: at most one greeting
• 1234 Huron Street; Rome, OH 98765: as many address lines as needed (in order)
• (321) 786 2543; (321) 786 2543; (321) 786 2543: mixed telephones and faxes
• jm@abc.com: as many as needed

Specifying the structure
• name: to specify a name element
• greet?: to specify an optional (0 or 1) greet element
• name, greet?: to specify a name followed by an optional greet

Specifying the structure (cont)
• addr*: to specify 0 or more address lines
• tel | fax: a tel or a fax element
• (tel | fax)*: 0 or more repeats of tel or fax
• email*: 0 or more email elements

Specifying the structure (cont)
So the whole structure of a person entry is specified by
name, greet?, addr*, (tel | fax)*, email*
This is known as a regular expression. Why is it important?
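One way to see why the regular-expression view is useful: the content model can be checked with an ordinary regex engine applied to the sequence of child tag names. The sketch below is an illustration under that assumption, not how a validating parser is actually implemented; the helper name `matches_person_model` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*
# written over a string of child tag names, each followed by one space.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def matches_person_model(person) -> bool:
    """Check the order of a person's child elements against the content model."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None


ok = ET.fromstring(
    "<person><name>MacNiel, John</name><addr>1234 Huron Street</addr>"
    "<tel>(321) 786 2543</tel><email>jm@abc.com</email></person>"
)
bad = ET.fromstring(
    "<person><greet>Dr. John MacNiel</greet><name>MacNiel, John</name></person>"
)

print(matches_person_model(ok))
print(matches_person_model(bad))
```

The second entry fails because greet appears before name, which the model does not allow.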

Regular Expressions
Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example:
name, addr*, email
This suggests a simple parsing program.
[Figure: an automaton with a name transition, a repeating addr transition and a final email transition.]
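The “simple parsing program” suggested by the automaton for name, addr*, email could look like the following sketch. The state numbering and the function name `accepts` are assumptions; the transitions follow the expression directly.

```python
# States: 0 = start, 1 = after name (addr may repeat), 2 = after email (accepting).
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
    (1, "email"): 2,
}
ACCEPTING = {2}


def accepts(tags) -> bool:
    """Run the automaton over a sequence of child tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING


print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Because each (state, tag) pair has at most one successor, the checker reads every tag exactly once and never needs to backtrack.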

Another example
name, address*, (tel | fax)*, email*
[Figure: an automaton with transitions labelled name, address, tel, fax and email.]
Adding in the optional greet further complicates things.

Internal DTD for the address book

Rest of the address book
<addressbook>
  <person>
    <name> Jeff Cohen </name>
    <greet> Dr. Cohen </greet>
    <email> jc@penny.com </email>
  </person>
</addressbook>

Our relational DB revisited
projects: title, budget, managedBy
employees: name, ssn, age

Two DTDs for the relational DB
<!DOCTYPE db [ <!ELEMENT db (projects, employees)> ... ]>
<!DOCTYPE db [ ... ]>

Recursive DTDs
<!DOCTYPE genealogy [ <!ELEMENT genealogy (person*)> ... ]>
The person element is declared recursively: besides its own fields it contains two nested person elements, the mother and the father.
What is the problem with this? XMLSpy does notice it!

Recursive DTDs cont’d.
<!DOCTYPE genealogy [ <!ELEMENT genealogy (person*)> ... ]>
The person element again refers to the mother and the father.
What is now the problem with this?

Some things are hard to specify
Each employee element is to contain name, age and ssn elements in some order.
Suppose there were many more fields!
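A quick way to see the difficulty: a DTD content model has no “any order” operator, so the straightforward encoding lists every permutation. The sketch below only generates that naive expression to show how fast it grows; the function name is hypothetical.

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Spell out 'all fields, in any order' as an alternation of sequences."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return " | ".join(alternatives)


model = any_order_model(["name", "age", "ssn"])
print(model)

# The number of alternatives is n!, so ten fields would need 3628800 of them.
print(factorial(10))
```

Even for three fields the expression has six alternatives, and in this naive form several alternatives start with the same tag, so it would also have to be factored before it could serve as a deterministic content model.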

General Definitions of Entities
ANY: the element can have any content.
EMPTY: the element has no content.

Summary of XML regular expressions
• A: the tag A occurs
• e1, e2: the expression e1 followed by e2
• e*: 0 or more occurrences of e
• e?: optional, 0 or 1 occurrences
• e+: 1 or more occurrences
• e1 | e2: either e1 or e2
• (e): grouping

Deterministic Requirement
• Content models in element type declarations should be deterministic.
• Formally, the Glushkov automaton is deterministic.
• The states of this automaton are the positions of the regular expression (semantic actions).
• The transitions are based on the ‘follows set’.
• The associated automata are succinct.
• A regular language may not have an associated deterministic grammar.

Specifying attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED>
The dimension attribute is required; the accuracy attribute is optional.
CDATA is the “type” of the attribute: it means string, so the attribute may take any literal string as a value.

Specifying ID and IDREF attributes
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name)>
...
]>

Some conforming data
Jane Doe
John Doe
Mary Doe
Jack Doe

Consistency of ID and IDREF attribute values
• If an attribute is declared as ID, the associated values must all be distinct (no confusion)
• If an attribute is declared as IDREF, the associated value must exist as the value of some ID attribute (no dangling “pointers”)
• Similarly for all the values of an IDREFS attribute
• ID and IDREF attributes are not typed
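The two consistency rules above can be sketched as a small checker. Python's standard parser does not validate against a DTD, so the roles of the attributes are passed in by hand; the attribute names `id`, `mother`, `father` and `children`, and the sample ids, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_ids(root, id_attr="id", idref_attrs=("mother", "father"),
              idrefs_attrs=("children",)):
    """Return (duplicate ID values, referenced values with no matching ID)."""
    ids = Counter(el.get(id_attr) for el in root.iter()
                  if el.get(id_attr) is not None)
    duplicates = sorted(value for value, count in ids.items() if count > 1)

    referenced = []
    for el in root.iter():
        for attr in idref_attrs:        # IDREF: a single reference
            if el.get(attr) is not None:
                referenced.append(el.get(attr))
        for attr in idrefs_attrs:       # IDREFS: space-separated references
            referenced.extend((el.get(attr) or "").split())
    dangling = sorted(set(referenced) - set(ids))
    return duplicates, dangling


# Hypothetical data; "joe" is deliberately a dangling reference.
family = ET.fromstring("""
<family>
  <person id="jane" children="mary jack"><name>Jane Doe</name></person>
  <person id="john" children="mary jack"><name>John Doe</name></person>
  <person id="mary" mother="jane" father="john"><name>Mary Doe</name></person>
  <person id="jack" mother="jane" father="joe"><name>Jack Doe</name></person>
</family>
""")
duplicates, dangling = check_ids(family)
print(duplicates, dangling)
```

A validating parser performs the same two checks using the ID, IDREF and IDREFS declarations from the DTD instead of hand-supplied attribute names.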

Formally
• Validity constraint: One ID per Element Type. No element type may have more than one ID attribute specified.
• Validity constraint: ID Attribute Default. An ID attribute must have a declared default of #IMPLIED or #REQUIRED.
• Validity constraint: IDREF. Values of type IDREF must match the Name production, and values of type IDREFS must match Names; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

A useful abbreviation
When an element has empty content we can use <tag blahbla/> for <tag blahbla></tag>.

An alternative specification

The revised data
Jane Doe
John Doe
Ami Doe
Tami Doe

ODL schema
class Movie ( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set cast inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name )
{
  attribute string name;
  relationship set acted_In inverse Movie::cast;
  attribute int age;
  attribute set directed;
};

Schema.dtd
<!ELEMENT movie (title, director, cast, budget)>
<!ATTLIST movie id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT director (#PCDATA)>
<!ELEMENT cast EMPTY>
<!ATTLIST cast idrefs IDREFS #REQUIRED>
<!ELEMENT budget (#PCDATA)>

Schema.dtd (cont’d)
<!ELEMENT actor (name, acted_In, age?, directed*)>
<!ATTLIST actor id ID ...
...
]>

Data
Oh God!
Woody Allen
$2M
George Burns

Constraints on IDs and IDREFs
• ID stands for identifier. No two ID attributes may have the same value (of type CDATA)
• IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value
• IDREFS specifies several (one or more) identifiers

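Connecting the document with its DTD
• In line: the DTD is written inside the document’s DOCTYPE declaration, <!DOCTYPE ... [ … ]>
• Another file: the DOCTYPE declaration names a separate DTD file
• A URL: the DOCTYPE declaration gives the URL of the DTD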

Connecting the document with its DTD
Both:
• file c:/schema.dtd: <!ELEMENT db (movie+, ...
• file to be validated: <!DOCTYPE ... [ ... ]> followed by the data (Oh God!, Woody Allen, $2M, George Burns)

Well-formed and Valid Documents
• Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes
• Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

DTDs vs. Schemas (or Types)
• By database (or programming language) standards DTDs are rather weak specifications.
  – Only one base type: PCDATA
  – No useful “abstractions”, e.g. sets
  – IDREFs are untyped. You point to something, but you don’t know what!
  – No constraints, e.g. child is inverse of parent
  – No methods
  – Tag definitions are global
• Some of the XML extensions impose something like a schema or type on an XML document. We may see these later.

Part III: Entities
To take storage into account

What are Entities
An entity is a shortcut to a set of information.
You might think of an entity as being a bit like a macro.
Entities allow a document to be divided among different storage devices.

Why to use entities:
• Entities save typing.
• Entities can reduce errors.
• Entities are easy to update.
• Entities can act as placeholders for TBD information.

Defining Entities
• You can define entities in your local document as part of the DOCTYPE definition.
• You can also link to external files that contain the entity data. This, too, is done through the DOCTYPE definition.
• A third option is to define the entities in your external DTD.
• Use a local definition when the entity is being used only in this one particular file.
• Use a linked, external file when the entity is being used in many document sets.

Kinds of Entities
There are two kinds of entities:
• general entities
• parameter entities
Entities are also:
• internal or external
• parsed or unparsed
Possibilities (first 4 are parsed):
1. Internal Parameter
2. External Parameter
3. Internal General
4. External General
5. Unparsed

General entities
The definition of general entities in the DTD:
<!ENTITY Name EntityDefinition>
The usage of the entity in the document is by &Name;

Example
Oh God!
Woody Allen
$2M

Browser View

Non-parsed Entities

Data
Oh God!
Woody Allen
$2M

Parameter Entities
Parameter entities are used only within DTDs. They carry information for use in the markup declaration.
• Internal entities: references are within the DTD.
• External entities: references draw information from outside files.
Parameter Entity declaration:
<!ENTITY % Name EntityDefinition>
Can’t use in internal DTD subset

Parameter Entity Example

Entities Definition
Local Definition: <!DOCTYPE [ <!ENTITY copyright ... ]>
Global Definition: ... ]>

Example

Example (cont.)
<PRESSRELEASE>
<HEAD> Mini-globe revolutionizes keychain industry </HEAD>
<LEAD> Today As The World Spins introduces a new approach to key chains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one. ...
&trademark; &copyright;
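How a general entity reference such as &copyright; gets replaced can be seen with a small internal declaration. This is a minimal sketch: the replacement text and the `note` root element are assumptions, since the original declarations are not recoverable here.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity value; only the entity name comes from the slides.
doc = """<!DOCTYPE note [
  <!ENTITY copyright "Copyright 2000, As The World Spins">
]>
<note>&copyright;</note>"""

note = ET.fromstring(doc)
print(note.text)  # the reference has been replaced by the entity's text
```

Changing the one ENTITY declaration updates every place in the document where &copyright; is used, which is the “easy to update” benefit of entities.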

Using CDATA
<HEAD1> Entering a Kennel Club Member </HEAD1>
<DESCRIPTION> Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this:
<![CDATA[ ... Sir Fredrick of Ledyard's End ... ]]>
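The point of the CDATA section is that markup inside it is treated as plain text. A minimal sketch with Python's standard parser follows; the NAME tag is shown without its attributes because their values are not recoverable from the slide.

```python
import xml.etree.ElementTree as ET

doc = (
    "<DESCRIPTION>Use the NAME tag: "
    "<![CDATA[<NAME>Sir Fredrick of Ledyard's End</NAME>]]>"
    "</DESCRIPTION>"
)
desc = ET.fromstring(doc)

print(desc.text)  # the <NAME> ... </NAME> markup survives as literal characters
print(len(desc))  # number of child elements of DESCRIPTION
```

Without the CDATA wrapper the parser would build a NAME child element instead of keeping the tag as visible text.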


Namespaces
• Namespaces are a way of preventing name clashes among elements from more than one source within the same XML document.
• They are also useful in identifying elements that are meaningful for a particular XML application.
• See http://www.w3.org/TR/REC-xml-names/

Namespaces
• A URI is either a URL or a URN.
• An XML namespace is, literally, identified by a URI reference.
• The reference need not point to an actual resource!
• A URI reference may be associated with more than one prefix.
• Prefixes are used in XML documents in forming element and attribute names (prefix:localname).
• Two prefixes that are associated with the same URI are said to be in the same namespace.
• Declaring a namespace: identifying a namespace used in the document.
• DTDs are unaware of namespaces.

Example
Defining the Namespace ATDB:
<document xmlns:ATDB='http://www.cs.huji.ac.il/atdb-schema'>
Using a tag from the ATDB Namespace:
<ATDB:myTag> This is an xml tag. </ATDB:myTag>
ATDB:myTag is a qualified name.
Using a tag not from the namespace: an unprefixed tag containing “This is a ‘made in Israel’ tag.”
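What a namespace-aware parser does with the prefix can be sketched as follows. The URI and the `ATDB:myTag` name come from the slide; the unprefixed `otherTag` element is a hypothetical stand-in for the tag that is not in the namespace.

```python
import xml.etree.ElementTree as ET

ATDB = "http://www.cs.huji.ac.il/atdb-schema"
doc = f"""<document xmlns:ATDB='{ATDB}'>
  <ATDB:myTag>This is an xml tag.</ATDB:myTag>
  <otherTag>This tag is in no namespace.</otherTag>
</document>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(tags)
```

ElementTree reports the prefixed element as `{URI}localname`: the prefix itself is only a shorthand, and it is the URI that identifies the namespace.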

Scope of Namespaces
• A prefix is associated with the namespace in the element scope in which it is defined.
• Example (birthdate is associated with no namespace):
John Smith
12-11-87
Technion City 234

Default Namespaces
• A default namespace applies to all elements in its scope.
• However, it does not override explicit prefixes (their nonprefixed child elements are default-bound).
• Example (name and birthdate are bound):
John Smith
12-11-87
Technion City 234
• Non-prefixed attribute names are associated with no namespace even when in scope.
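The difference between elements and attributes under a default namespace can be checked directly. In this sketch the namespace URI `http://example.org/people`, the `person` root and the `id` attribute are assumptions for illustration; the name and birthdate values are from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical default namespace URI and attribute.
doc = """<person xmlns="http://example.org/people" id="p1">
  <name>John Smith</name>
  <birthdate>12-11-87</birthdate>
</person>"""

person = ET.fromstring(doc)
child_tags = [child.tag for child in person]
attr_names = list(person.attrib)

print(child_tags)  # unprefixed child elements are bound to the default namespace
print(attr_names)  # the unprefixed attribute stays in no namespace
```

Both child elements pick up the default namespace, while the attribute name is reported without any namespace part.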

Summary
• XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema)
• DTDs provide some useful syntactic constraints on documents. As schemas they are weak
• How to store large XML documents?
• How to query them?
• How to map between XML and other representations?