

  • Number of slides: 67

Web Data Management: WebOQL

OVERVIEW
  • The data model supports abstractions for modeling record-based data, structured documents and hypertexts.
  • WebOQL supports querying small databases represented as documents (such as catalogs), restructuring single pages (for example, converting a large page into smaller pages), and restructuring sets of pages (for example, creating an index page that contains a hyperlink to each page and adding to each page a hyperlink back to the index page).
  • It also supports restructuring the content of a web site in order to show the same content in another view.

Data Model
The WebOQL data model introduces the hypertree: a tree-based data model representing structured documents that contain hyperlinks. Hypertrees are ordered arc-labeled trees with two kinds of arcs, internal and external.
  • Internal arcs represent structured objects.
  • External arcs represent references (links). They cannot have descendants, and their records must contain a 'URL' field.

Data Model
Example (figure; arc labels of the example hypertree): [Group: students] [Name: moshe, Sem: 5] [Name: arik, Sem: 8] [Label: arik home page, URL: www…/index.html] [Label: moshe home page, URL: www…/index.html] [Group: professors] [Name: oded, Seniority: 8] [Label: seminar in www, URL: www…/s.html] [Label: databases, URL: www…/index.html]

Data Model
Hypertrees are a useful data structure because they support three important abstractions:
  • Collections
  • Nesting
  • Ordering
The notion of reference, which is central to the structure of the web, is captured through the distinction between internal and external arcs. Because nodes are untyped, a tree can hold heterogeneous records on its arcs.
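The hypertree just described can be sketched in Python. This is only one possible encoding, not the WebOQL implementation: the `Arc` class is a hypothetical name, and the example.org URLs are placeholders because the slides elide the real ones. The sketch shows ordering (lists), nesting (subtrees), heterogeneous records (`Sem` versus `Seniority`) and the two constraints on external arcs.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    """One labeled arc of a hypertree plus the subtree hanging below it."""
    label: dict                                   # the record on the arc
    subtree: list = field(default_factory=list)   # ordered child arcs
    external: bool = False                        # True for a link

    def __post_init__(self):
        # External arcs are references: they need a URL and have no descendants.
        if self.external and "URL" not in self.label:
            raise ValueError("external arc must carry a URL field")
        if self.external and self.subtree:
            raise ValueError("external arc cannot have descendants")


# Hypothetical URLs; the slide elides the real ones.
example = [
    Arc({"Group": "students"}, [
        Arc({"Name": "moshe", "Sem": 5}),
        Arc({"Name": "arik", "Sem": 8}),
        Arc({"Label": "arik home page", "URL": "http://example.org/arik.html"},
            external=True),
    ]),
    Arc({"Group": "professors"}, [
        Arc({"Name": "oded", "Seniority": 8}),
        Arc({"Label": "databases", "URL": "http://example.org/db.html"},
            external=True),
    ]),
]

# Collect the target of every external arc, in document order.
links = [a.label["URL"] for group in example for a in group.subtree if a.external]
print(links)
```

Keeping the record as a plain `dict` is a deliberate choice here: it mirrors the untyped nodes of the model, so two sibling arcs can carry completely different fields.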

Data Abstractions
  • Web: a pair (t, F), where t is a hypertree (the schema) and F is a browsing function, F : URLs → Hypertrees.
  • Page: the hypertree F(u), where u is a URL.

Tree operators
Definitions:
  • Tails: the tails of a tree t are the trees obtained by chopping prefixes off t.
  • Simple trees: the simple trees of a tree t are the trees composed of one arc that stems from the root of t together with its subtree.
  • Subtrees: the subtrees of t are the trees at the ends of the arcs that stem from the root of t.

(Figure: a tree t, the simple trees of t, and the subtrees of t. Arc labels shown: [Label: 1], [Label: 2], [Label: 3], [A: 1], [A: 2], [B: 1]; one subtree is null.)

Tails of t (the ! operator chops prefixes). (Figure; arc labels shown: [Label: 1], [A: 2], [Label: 3], [Label: 2], [B: 1].)
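The three definitions can be made concrete with a small Python sketch. It assumes a tree is encoded as an ordered list of `(label, subtree)` pairs, and it assumes a shape for t (three arcs, each with one child arc), since the original figure is not recoverable. Whether t itself counts as one of its own tails is also an assumption here; the sketch includes it.

```python
def simple_trees(tree):
    """Each root arc together with its subtree, as a one-arc tree."""
    return [[arc] for arc in tree]


def subtrees(tree):
    """The trees found at the end of each root arc."""
    return [subtree for _label, subtree in tree]


def tails(tree):
    """Trees obtained by chopping 0, 1, 2, ... leading arcs off the tree."""
    return [tree[i:] for i in range(len(tree))]


# Assumed shape of tree t: three root arcs, each with a single child arc.
t = [
    ({"Label": 1}, [({"A": 1}, [])]),
    ({"Label": 2}, [({"A": 2}, [])]),
    ({"Label": 3}, [({"B": 1}, [])]),
]

for name, trees in [("simple trees", simple_trees(t)),
                    ("subtrees", subtrees(t)),
                    ("tails", tails(t))]:
    print(name, len(trees))
```

Because trees are ordered, plain list slicing is enough to express tails, and each simple tree is itself a valid tree that the other operators accept.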

Tree operators
Concatenate: Tree1 + Tree2
Connects two trees at their roots.
(Figure: trees t1 and t2 and the result t1 + t2. Arc labels shown: [label1: a], [label1: a1], [label1: a2], [label1: b], [label1: c1], [label1: c2].)

Tree operators
Hang: [Arc1 / Tree1]
Hangs the tree from a new arc.
(Figure: [label1: a / t1], where t1 has the arcs [label1: a1] and [label1: a2].)

Tree operators
Prime: Tree'
Returns the first subtree of the argument.
(Figure: t1 and t1'. Arc labels shown: [label1: a], [label1: b], [label1: a1], [label1: a2].)

Tree operators
Head: Tree&[x]
Returns the first x simple trees of the argument. If x is not specified, only the first simple tree is returned.
(Figure: t1 and t1&. Arc labels shown: [label1: a], [label1: b], [label1: a1], [label1: a2].)

(Figure: example trees q4, q4', q5&, q6, q5!, q7 and q5&2.)

HANG
[Label: "papers from smith", Format: "ps.Z" / q1]
(Figure: an arc [Label: Papers from smith, Format: ps.Z] above the arcs [Title: Recent…, Url: http://…] and [Title: Are…, Url: http://www…].)
HANG + concatenate
[Tag: "UL" / [Tag: "LI", Text: "First Child"] + [Tag: "LI", Text: "Second Child"] + [Tag: "LI", Text: "Third Child"]] + [Url: "http://a.b.c.", Label: "Click Here"]
(Figure: the resulting tree, with arcs such as [Tag: UL] and [Tag: LI, Text: FirstChild].)
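The hang and concatenate example can be replayed in Python as a rough sketch. It assumes trees are ordered lists of `(label, subtree)` pairs; the function names `concat`, `hang`, `prime`, `head`, `tail` and the helper `li` are hypothetical stand-ins for the WebOQL operators `+`, `[ / ]`, `'`, `&` and `!`.

```python
def concat(t1, t2):
    """t1 + t2: join two trees at their roots, keeping arc order."""
    return t1 + t2


def hang(label, tree=None):
    """[label / tree]: a new one-arc tree with `tree` hanging below the arc."""
    return [(label, tree if tree is not None else [])]


def prime(tree):
    """tree': the first subtree of the argument (empty if there is none)."""
    return tree[0][1] if tree else []


def head(tree, n=1):
    """tree&n: the first n simple trees of the argument."""
    return tree[:n]


def tail(tree, n=1):
    """tree!: the argument with its first n simple trees chopped off."""
    return tree[n:]


def li(text):
    """Hypothetical helper: a one-arc tree for a list item."""
    return hang({"Tag": "LI", "Text": text})


items = concat(concat(li("First Child"), li("Second Child")), li("Third Child"))
doc = concat(hang({"Tag": "UL"}, items),
             hang({"Url": "http://a.b.c.", "Label": "Click Here"}))

print([label for label, _ in doc])          # the UL arc, then the link arc
print([label["Text"] for label, _ in prime(doc)])
```

Every operator returns a new list and leaves its arguments untouched, which keeps intermediate results such as `items` reusable in later expressions.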

Tree operators
  • Peek: Arc.field. Extracts a field from an arc's label; for example, Example.Group can have the value 'students'. If the field does not exist, 'null' is returned.
  • IsField: Arc?field. Tests for the presence of a field in an arc's label; for example, Example?Group evaluates to true, while Example?Name evaluates to false.
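A short Python sketch of Peek and IsField, assuming trees are lists of `(label, subtree)` pairs and that both operators look at the label of the first arc. The names `peek` and `is_field` are hypothetical, and Python's `None` stands in for 'null'.

```python
def peek(tree, field):
    """tree.field: the field's value on the first arc's label, or None."""
    return tree[0][0].get(field) if tree else None


def is_field(tree, field):
    """tree?field: True when the first arc's label has the field."""
    return bool(tree) and field in tree[0][0]


example = [({"Group": "students"}, [])]

print(peek(example, "Group"), is_field(example, "Group"), is_field(example, "Name"))
```

Returning `None` rather than raising keeps queries over irregular data simple: a missing field just yields a null value that the where clause can test.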

Definitions
  • Page: a hypertree that has an associated URL identifying it.
  • Web: a collection of interrelated pages.
  • Each external arc of a page is a link in the web.
  • Schema: a web can optionally have a distinguished page that provides an entry point to the web.

  • No schema: one must know the URL of one or more pages. (Figure: pages http://a.b.c./one.html, http://a.b.c./three.html and http://a.b.c./two.html.)

(Figure: a WebOQL query maps a web to another web that has a schema and a new page. Pages shown: http://a.b.c./three.html, http://a.b.c./one.html, http://a.b.c./four.html, http://a.b.c./two.html.)

[Tag: "UL" / [Tag: "LI", Text: "First Child"] + [Tag: "LI", Text: "Second Child"] + [Tag: "LI", Text: "Third Child"]] + [Url: "http://a.b.c.", Label: "Click Here"]
(Figure: the tree built by this expression, with arcs [Tag: UL], [Tag: LI, Text: FirstChild] and [Url: "http://a.b.c.", Label: "Click Here"].)
Rendered as HTML:

  • First Child
  • Second Child
  • Third Child
Click Here

(Figure: arcs [Tag: LI, Text: First Child], [Tag: LI, Text: Second Child], [Tag: LI, Text: Third Child] and [Url: http://a.b.c., Label: Click here].)
Tree representing an HTML document consisting of a list and a hyperlink.
  • Trees are ordered.
  • Arcs are labeled with records rather than atomic values.

(Figure: the paper database csPapers. Group arcs: [Group: Card…], [Group: ProgLang], [Group: DBMS]. Paper arcs carry Title, Authors and Publications fields, for example [Title: Are…, Authors: Smith…, Publications: ACM…] and [Title: Rec…, Authors: Smith…, Publications: Tech…]. External arcs: [Label: Abstract, Url: www…] and [Label: Full Paper, Url: www…].)

SELECT - FROM - WHERE
This familiar query language construct is the main construct of WebOQL queries.
  • Select: the query to evaluate, for example [y.Label, y.URL]
  • From: the definition of variables, for example x in example, y in x!
  • Where: a boolean condition, for example x.Seniority = 8

SELECT - FROM - WHERE
For each instantiation of the variables in the from clause, check the condition in the where clause; if it is true, evaluate the query in the select clause and append its value to the result.
Result: [Label: seminar in www, URL: www…/s.html] [Label: databases, URL: www…/index.html]
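The evaluation rule reads naturally as nested loops. The Python sketch below is an illustration under several assumptions: trees are lists of `(label, subtree)` pairs, the links hang below the professor's arc, the URLs are example.org placeholders (the slides elide them), and the query is written as `select [y.Label, y.URL] from g in example, x in g', y in x' where x.Seniority = 8`, which is an adaptation chosen to fit that assumed nesting rather than the exact query on the slide.

```python
def simple_trees(tree):
    return [[arc] for arc in tree]


def prime(tree):
    return tree[0][1] if tree else []


def peek(tree, field):
    return tree[0][0].get(field) if tree else None


# Assumed nesting and placeholder URLs.
example = [
    ({"Group": "students"}, [
        ({"Name": "moshe", "Sem": 5}, []),
        ({"Name": "arik", "Sem": 8}, []),
    ]),
    ({"Group": "professors"}, [
        ({"Name": "oded", "Seniority": 8}, [
            ({"Label": "seminar in www", "URL": "http://example.org/s.html"}, []),
            ({"Label": "databases", "URL": "http://example.org/index.html"}, []),
        ]),
    ]),
]


def links_of_senior_staff(tree):
    result = []
    for g in simple_trees(tree):                    # from g in example
        for x in simple_trees(prime(g)):            #      x in g'
            for y in simple_trees(prime(x)):        #      y in x'
                if peek(x, "Seniority") == 8:       # where x.Seniority = 8
                    record = {"Label": peek(y, "Label"), "URL": peek(y, "URL")}
                    result.append((record, []))     # select, appended to result
    return result


print([label["Label"] for label, _ in links_of_senior_staff(example)])
```

Arcs without a `Seniority` field simply fail the condition, because `peek` returns `None` for them; no schema check is needed before the loop.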

Select [y.title, y.publication] From x in csPapers, y in x'
Missing data: where a paper has no Publication field, the value is undefined.

  • Compute a listing of the papers' publication data grouped by title.
Select [x.Title / Select [z.Publication] from y in csPapers, z in y' where x.Title = z.Title] From w in csPapers, x in w'

  • Schema: a distinguished hypertree.
  • Browsing function: maps strings (URLs) to hypertrees. It defines a graph in which the nodes are pages and there is an arc from node a to node b if the content of the page at node a contains an external arc whose url attribute is the URL of the page at node b.
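The graph induced by a browsing function can be computed directly. The Python sketch below assumes the browsing function is a dictionary from URL to hypertree, that trees are lists of `(label, subtree)` pairs, and that an arc counts as external when its label has a `Url` field. The page contents are invented for illustration; only the URL pattern follows the slides.

```python
def external_urls(tree):
    """Yield the Url of every external arc, at any depth, in document order."""
    for label, subtree in tree:
        if "Url" in label:
            yield label["Url"]
        yield from external_urls(subtree)


def link_graph(pages):
    """Map each page URL to the URLs of the pages it links to within the web."""
    return {url: [u for u in external_urls(tree) if u in pages]
            for url, tree in pages.items()}


ONE = "http://a.b.c./one.html"
TWO = "http://a.b.c./two.html"
THREE = "http://a.b.c./three.html"

# Hypothetical page contents.
pages = {
    ONE: [({"Tag": "UL"}, [({"Label": "two", "Url": TWO}, [])]),
          ({"Label": "three", "Url": THREE}, [])],
    TWO: [({"Label": "one", "Url": ONE}, [])],
    THREE: [({"Tag": "P", "Text": "no links"}, [])],
}

graph = link_graph(pages)
print(graph)
```

Links that point outside the dictionary are dropped here, which corresponds to treating the browsing function as undefined for URLs that are not part of the web.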

Analogy with relational databases:
  • Hypertrees correspond to relations.
  • Webs correspond to databases.
  • The schema of a web corresponds to the catalog of a database.

  • Select [x.Tag] From x in browse("http://www.cs.toronto.edu")
Result: [Tag: head] [Tag: body]

  • A select-from-where (SFW) query creates a web.
  • Select the titles and URLs of papers authored by Smith:
Select [y.Title, y'.URL] as schema From x in csPapers, y in x' Where y.authors ~ "smith"

Queries
  • Create a web page with URL "Group Names" whose content is the list of group names (assume that there is no such page in the current web):
Select [x.Group] as "Group Names" from x in csPapers

Queries
  • Create several pages, one for each research group (using the group name as URL). Each page contains the publications of the corresponding group:
Select x' as x.Group from x in csPapers

Data Model
  • Records as labels on arcs
  • Internal and external arcs
(Figure: hypertree of a city overview page. Internal arcs: [Tag: H1, Text: City Overview…], [Tag: UL, Text: one of the…], [Tag: LI, Text: One of the…], [Tag: LI, Text: If you are interested…], [Tag: LI, Text: All the hotels…], and untagged text arcs [Tag: XYZ, Text: One of the…], [Tag: XYZ, Text: If you are…], [Tag: XYZ, Text: Contains…], [Tag: XYZ, Text: …]. External arcs: [Label: Theatres Online, Url: http://www…, Base: http://www…, Text: This page contains...], [Label: All the Hotels, Url: http://www…, Base: http://www…, Text: These are all…], [Label: Sports Zone, Url: http://www…, Base: http://www…, Text: Sports Zone…].)

Query: list elements containing "ticket"
doc := "http://www.citynet.com/overview.html";
[Tag: "UL" / Select y from y in doc!' where y'.Text ~ "ticket"]
(Figure: the result tree, with arcs [Tag: UL], [Tag: LI], [Tag: XYZ, Text: One of the…], [Tag: XYZ, Text: If you are…], [Tag: XYZ, Text: …], [Label: Theatres Online, Url: http://www…, Base: http://www…, Text: This page contains...] and [Label: Sports Zone, Url: http://www…, Base: http://www…, Text: Sports Zone…].)

Web restructuring
Using the tree operators, we have shown how a tree can be restructured. To restructure a web, we need a function that maps one web to another. The new web has some hypertree as its schema, while its browsing function is an extension of the old web's browsing function: it targets URLs that were not previously targeted. In WebOQL this is done with the as clause.

Web restructuring
In general, the select clause of WebOQL has the form:
Select q1 as s1, q2 as s2, …, qn as sn
Each si can be either the keyword schema or a string query. An as clause whose target is schema defines the schema of the web.
Example: [Title: y.Group] as schema
Result: [Title: students] [Title: professors]

Web restructuring (continued)
In Select q1 as s1, q2 as s2, …, qn as sn, an as clause whose target evaluates to a string defines a page, and the string is treated as the URL of that page.
Example: [x.Name] as y.Group
Result: a page with URL students containing [Name: moshe] [Name: arik]

Web restructuring
After a web is created there are two possibilities: either query it further (restructure it) or return it to the host application. To return the web to the host application so that it can be shown in a browser, the pages must be formatted in an HTML-compliant way. This is done by restructuring the web using HTML tags as labels.

Document restructuring
Web documents are a typical example of semistructured data, since they do not have a fixed schema and can have various irregularities. In an HTML document most tags may appear any number of times or not at all. WebOQL uses a wrapper that creates an abstract syntax tree (AST) from an arbitrary HTML document. This is feasible because the markup tags of HTML reflect the logical relationships between the information items. Example:

item 1. item 2.

  • Generate a web consisting of a page for each research group, containing the title and authors of all its publications, and an index page that lists all the groups and provides links to their pages.
newWeb:
Select unique [Name: x.Group, Url: x.Group] as schema, [y.Title, y.Authors] as x.Group From x in csPapers, y in x'

(Figure: the web produced by the query. The "as schema" clause yields the index arcs [Name: Card Punching, Url: Card Punching], [Name: ProgLang, Url: ProgLang], [Name: …, Url: …]. The "as x.Group" clause yields one page per group, for example the pages Card Punching and ProgLang, with arcs such as [Titles: Recent…, Authors: Smith], [Titles: Arc…, Authors: Smith], [Titles: Assembly Lan…, Authors: James J] and [Titles: Cobol…, Authors: John, …].)
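The effect of the two as clauses can be sketched in Python. This is an illustration, not WebOQL's evaluator: it assumes trees are lists of `(label, subtree)` pairs, models the resulting web as a schema tree plus a dictionary from URL to page, and reads `unique` as "skip arcs that are already present". The ProgLang paper is a made-up placeholder, because the slide elides the real titles.

```python
def add_unique(tree, arc):
    """Append the arc unless an identical one is already in the tree."""
    if arc not in tree:
        tree.append(arc)


def new_web(cs_papers):
    schema, pages = [], {}
    for group_label, papers in cs_papers:          # from x in csPapers
        for paper_label, _ in papers:              #      y in x'
            group = group_label.get("Group")
            # [Name: x.Group, Url: x.Group] as schema
            add_unique(schema, ({"Name": group, "Url": group}, []))
            # [y.Title, y.Authors] as x.Group
            page = pages.setdefault(group, [])
            add_unique(page, ({"Title": paper_label.get("Title"),
                               "Authors": paper_label.get("Authors")}, []))
    return schema, pages


cs_papers = [
    ({"Group": "Card Punching"}, [
        ({"Title": "Recent Discoveries in Card Punching",
          "Authors": "Peter Smith, John Brown"}, []),
        ({"Title": "Are Magnetic Media Better ?",
          "Authors": "Peter Smith, John Brown"}, []),
    ]),
    ({"Group": "ProgLang"}, [
        # Hypothetical entry.
        ({"Title": "A placeholder paper", "Authors": "A. Author"}, []),
    ]),
]

schema, pages = new_web(cs_papers)
print([label["Name"] for label, _ in schema])
print({url: len(page) for url, page in pages.items()})
```

Each (x, y) instantiation contributes to both targets, so the index and the group pages are built in a single pass over the data.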

NewerWeb:
newWeb
| select [Tag: "H3", Text: y.Title] + [Tag: "BR", Text: y.Publication] + [Tag: "BR", Text: y.Authors] + [Tag: "P"] as x.Name
  from x in schema, y in x.Name
| select [Tag: "H2", Text: "Publications of the" * x.Name * " Group"] + x.Name + [Tag: "A", Label: "To Index", Url: "http://a.b.c/Index of Projects.html"]
  as "http://a.b.c/" * x.Name * ".html"
  from x in schema

| select [Url: "http://a.b.c/Index of Projects.html"] as schema,
  [Tag: "H2", Text: "Index of Projects"] +
  [Tag: "UL" / select [Tag: "LI" / [Tag: "A", Label: x.Name, Url: "http://a.b.c/" * x.Name * ".html"]] from x in schema]
  as "http://a.b.c/Index of Projects.html"

<H2> Index of Projects </H2> <UL> <LI> <A HREF = "http://a. …
Index of Projects

Index Page

<H2>Publications of the Card Punching group </H2> <H3> Recent Discoveries in …
Publications of the Card Punching group
Recent Discoveries in Card Punching
Technical Report TR015
Peter Smith, John Brown

Are Magnetic Media Better ?
ACM TOCP Vol. 3 No. (1942) pp. 23-37
Peter Smith, John Brown

To index

Group Pages

Document restructuring
Navigation patterns: In the examples so far, the variables used in the queries ranged over the simple trees of the tree being queried. On the WWW, however, variables may need to range over several linked subtrees whose structure is not fully known to us.
select [x.Text] from x in "someone's.html" via ^*[Tag = "H2"]
  • ^ is a record predicate that is true for every internal arc.
  • [Tag = "H2"] is a record predicate that is true for every arc that has an 'H2' tag.

Document restructuring
Navigation patterns (continued):
select [x.Text] from x in "someone's.html" via >*[not(Tag = "H2")]
  • > is a record predicate that is true for every external arc.
  • [not(Tag = "H2")] is a record predicate that is true for every arc that does not have an 'H2' tag.

Document restructuring
Navigation patterns: When a navigation pattern is omitted, the query is treated as if it had a navigation pattern that always evaluates to true. Variables are instantiated in left-to-right order using either depth-first or breadth-first search. Depth-first is the default; to use breadth-first search, the keyword viabfs is used instead of via.
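The difference between via and viabfs is only the order in which arcs are visited. The Python sketch below illustrates that order on a single document, under simplifying assumptions: trees are lists of `(label, subtree)` pairs, the pattern `^*[Tag = "H2"]` is approximated as "any arc at any depth whose Tag is H2", and links to other pages are not followed. The function `walk` and the sample document are hypothetical.

```python
from collections import deque


def walk(tree, breadth_first=False):
    """Yield arc labels left to right, depth-first by default."""
    pending = deque(tree)
    while pending:
        label, subtree = pending.popleft()
        yield label
        if breadth_first:
            pending.extend(subtree)                 # children wait their turn
        else:
            pending.extendleft(reversed(subtree))   # children go next, in order


doc = [
    ({"Tag": "DIV", "Text": "wrapper"}, [
        ({"Tag": "H2", "Text": "deep"}, []),
    ]),
    ({"Tag": "H2", "Text": "shallow"}, []),
]

via = [l["Text"] for l in walk(doc) if l.get("Tag") == "H2"]
viabfs = [l["Text"] for l in walk(doc, breadth_first=True) if l.get("Tag") == "H2"]
print(via, viabfs)
```

Both traversals return the same set of matches; the order matters when the result tree is ordered, or when a query only keeps the first few matches.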

Navigation Pattern
  • [not (Tag = "A")]* : a path of any length composed of arcs that do not have a Tag attribute with value "A".
  • [Tag = "LI"][Tag = "A"] : a path of length 2.
  • ^*> : all paths in a tree that lead from the root to an external arc.
Select [x.Url] from x in "http://a.b.c./index.html" via [not (Tag = "Table")]*>
Returns all the external arcs in the document at the given URL that do not occur within a table.

Select [x.Url, x.Text] From x in "http://a.b.c./root.html" via (^*[Labled "Next"]>)*
What will this query produce?

(Figure: arcs [Tag: H3, Text: Price…], [Tag: UL], [Tag: LI].)
Select X!& From X in "http://a.b.c./large.html" via ^*[Tag = "H3"] Where X!.Tag = "UL" and X.Text ~ "Price"

(Figure: arcs [Tag: H2, Text: Publications of the…], [Tag: H3, Text: …], [Tag: BR, Text: …], [Tag: P, Text: …] and [Label: To index, Url: …, Base: http://a.b.c./cardpunching.html, Text: indexofprojects].)
Tree generated by query:
[Tag: "OL" / Select [Tag: "LI" / X&3] from X in "http://a.b.c./cardpunching.html"! where X.Tag = "H3"]
(Figure: arcs [Tag: OL], [Tag: LI], [Tag: H3], [Tag: LI].)

[Tag: "OL" / Select [Tag: "LI" / Select y from y in X while not y.Tag = "P"] From X in "http://a.b.c./IrregularDoc.html"! where X.Tag = "H3"]

Project web
select [x.proj name, x.proj descr] as "projects", [x.emp name, x.emp phone] as "people", [x.proj name] as "x.proj name", [x.emp name] as "x.emp name"
From x in "SQLDb. Select proj name, emp phone, proj descr from proj, emp, worksin where Emp.id = worksIn.empid and proj.id = worksIn.projId;"
Generate a web containing a page for each project, a page for each person, and two index pages listing all the projects and all the people. A person's page contains pointers to the projects in which he or she is involved, and a project page contains pointers to the pages of the people involved in it.

(Figure: hypertree of the page http://www…/cspapers.html. Internal arcs include [Tag: H1, Text: Publications of Research…], [Tag: H2, Text: Card Punching…], [Tag: H2, Text: Programming…], [Tag: H2, Text: Databases…], [Tag: UL, Text: Recent…], [Tag: UL, Text: Cobol in AI Sam James…], [Tag: LI, Text: Recent…], [Tag: LI, Text: Are Magnetic…], [Tag: LI, Text: Cobol in…], [Tag: LI, Text: Assembly for…], [Tag: CITE, Text: Are Magnetic…], [Tag: XYZ, Text: Are Magnetic], [Tag: B, Text: Peter Smith…], [Tag: BR, Text: ACM TOCP Vol. 3 No. (1942) pp 23-37]. External arcs: [Label: Full Version, Url: http://www…/paper2.ps.z, Base: http://www…/cspapers.html, Text: 1k098k79…] and [Label: Abstract, Url: http://www…/abstr2.html, Base: http://www…/cspapers.html, Text: Are Magnetic Media…].)

Retrieve the titles and authors of each paper:
Select [Title: y''.Text, Authors: y''!!.Text] From x in "http://www.a.b.c./paper.html", y in x' Where x.Tag = "UL"
Here x ranges over the simple trees of the document and y over the elements under UL.

Select [title: y''.Text, authors: y''!!.Text, publications: y''!3.Text, ps-url: y'!4.Url, abstract-url: y'!!.Url] as "pubsdb:insert" From X in "http://www.a.b.c./paper.html", y in X!' Where X.Tag = "H2"

(Figure: hypertree of the page http://www…/trs.html. Internal arcs include [Tag: H1, Text: Reports in…], [Tag: HR, Text: ], [Tag: H2, Text: David Rice], [Tag: H2, Text: John Smith], [Tag: CITE, Text: Indexing], [Tag: CITE, Text: Efficient], [Tag: BR, Text: ], [Tag: P, Text: ], [Tag: XYZ, Text: CS-TR-0327…], [Tag: XYZ, Text: CS-TR-0120…], [Tag: XYZ, Text: CS-TR-0029…]. External arcs: [Label: Indexing Sound, Url: http://www…/pl.ps.gz, Base: http://www…/trs.html, Text: …], [Label: Abstract Available Online, Url: http://www…/pl.html, Base: http://www…/trs.html, Text: Indexing Sound…], [Label: Efficient Clustering…, Url: http://www…/p2.ps.gz, Base: http://www…/trs.html, Text: …] and [Label: Temporal Constraints, Url: http://www…/p3.ps.gz, Base: http://www…/trs.html, Text: …].)

Select [title: Y.Text, author: X.Text, publications: Y!!.Text, ps-url: Y'.Url, abstract-url: Y!4.Url] as "PubsDb:insert" From X in "http://www.x.y.z./papers.html", Y in X! while not (Y.Tag = "HR") where X.Tag = "H2" and Y.Tag = "CITE"

Figure 5.6 Instantiation of Variables in Query 4. (Figure: the cspapers hypertree with the variables X, y, y' and y'' marked. X is bound at [Tag: H2, Text: Card Punching…]. Other arcs shown include [Tag: H1, Text: Publications of Research…], [Tag: UL, Text: Recent…], [Tag: LI, Text: Recent…], [Tag: CITE, Text: Recent…], [Tag: XYZ, Text: Recent…], [Tag: BR, Text: Technical…], [Tag: B, Text: Peter Brown…], and the external arcs [Label: Full Version, Url: http://www…/paperl.ps.z, Base: http://www…/cspapers.html, Text: …] and [Label: Abstract, Url: http://www…/abstrl.html, Base: http://www…/cspapers.html, Text: It is company…].)

Query 4: csPapers
select [Group: X.Text / select [Title: y''.Text, Authors: y''!!.Text, Publication: y''!3.Text / [Label: "Abstract", Url: y'!!.Url] + [Label: "Full Version", Url: y'!4.Url]] from y in X!'] from X in "http://www.a.b.c./papers.html" where X.Tag = "H2"

Architecture
(Figure: a query enters through the API and a web is returned. The API sits on top of the Query Engine, which exchanges URLs and trees with the Wrapper Manager. Below the Wrapper Manager are the wrappers: one for a DBMS, one for the file system, and one for each of Web 1 … Web k.)

  • Each node corresponds either to a subdocument enclosed in an occurrence of a paired tag (for example, the root node corresponds to the subdocument enclosed between <HTML> and </HTML>) or to a subdocument enclosed between an occurrence of a non-paired tag and the tag that follows it.
  • Arcs leading to nodes that correspond to the <A> tag, and for which the protocol of the associated URL is http, are external. All other arcs are internal.

  • The incoming arc to a node contains the attributes of the subdocument represented by this node.
  • Internal arcs are labeled with a record containing two fields: Tag and Text.
  • Tag is the HTML tag corresponding to the subtree that is the destination of the arc.
  • The value of Text depends on whether Tag is paired or non-paired.
  • If Tag is paired, the value of Text is the text enclosed between the opening and closing tags, excluding markup.
  • If Tag is non-paired, the value of Text is the text between the tag and the tag that comes after it in the document.

  • External arcs are labeled with a record containing four fields: Label, Url, Base and Text.
  • Label is the label of the hyperlink, that is, the text enclosed between the opening and closing tags of the link; Url is the value of the href attribute; Base is the URL of the document being processed; and Text is the text of the referred document, excluding markup.
  • A dummy tag is used to enclose pieces of text that are not explicitly tagged.
  • Rules are applied recursively to the text inside occurrences of paired tags.
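The wrapper rules above can be approximated with Python's standard `html.parser`. This is a rough sketch under stated assumptions, not the WebOQL wrapper: the class `HypertreeWrapper` and the function `wrap` are hypothetical names, trees are lists of `(label, subtree)` pairs, the Text field of external arcs is left out because filling it would require fetching the referenced document, the dummy tag for untagged text is not generated, and a non-paired tag is simply closed when its enclosing element closes.

```python
from html.parser import HTMLParser


class HypertreeWrapper(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.tree = []     # arcs at the root of the document
        self.open = []     # stack of (tag, label, subtree, text_parts)

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("http:"):
            label = {"Label": "", "Url": href, "Base": self.base}  # external
        else:
            label = {"Tag": tag.upper(), "Text": ""}               # internal
        self.open.append((tag, label, [], []))

    def handle_data(self, data):
        # Text belongs to every element that currently encloses it.
        for _tag, _label, _subtree, parts in self.open:
            parts.append(data)

    def handle_endtag(self, tag):
        if all(entry[0] != tag for entry in self.open):
            return                                   # stray end tag: ignore
        while self._close_innermost() != tag:        # also closes unpaired tags
            pass

    def _close_innermost(self):
        tag, label, subtree, parts = self.open.pop()
        text = " ".join(" ".join(parts).split())     # normalise whitespace
        if "Url" in label:
            label["Label"] = text
            subtree = []                             # external arcs are leaves
        else:
            label["Text"] = text
        parent = self.open[-1][2] if self.open else self.tree
        parent.append((label, subtree))
        return tag


def wrap(html, base):
    parser = HypertreeWrapper(base)
    parser.feed(html)
    parser.close()
    while parser.open:                               # close anything left open
        parser._close_innermost()
    return parser.tree


sample = ('<ul><li>First Child</li><li>Second Child</li></ul>'
          '<a href="http://a.b.c.">Click Here</a>')
tree = wrap(sample, "http://a.b.c./doc.html")        # hypothetical base URL
print(tree)
```

Arcs are attached to their parent when the element closes, which preserves document order among siblings and lets the Text field be computed once all enclosed text has been seen.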

<HTML> <H1> Publications of Research Groups at Cs Dept</H1> <H…
Publications of Research Groups at Cs Dept
Card Punching

  • Recent Advances in Card Punching
    Peter Smith, John Brown
    Technical Report TR 015

    Abstract

<a href="http://../paper.ps.Z"> Full version</a> </LI>
Full version
  • Are Magnetic Media Better?
Peter Smith, John Brown, Tom
ACM TOCP Vol. 3, No. , pp

Abstract
Full version
Programming lang