
c67e7ce619ffc050a62d869f87863a07.ppt
- Number of slides: 89
Knowledge Discovery over the Deep Web, Semantic Web and XML Aparna S. Varde, Fabian M. Suchanek, Richi Nayak and Pierre Senellart DASFAA 2009, Brisbane, Australia 1
Introduction • The Web is a vast source of information • Various developments in the Web – Deep Web – Semantic Web – XML Mining – Domain-Specific Markup Languages • These enhance knowledge discovery 2
Agenda • Section 1: Deep Web – Slides by Pierre Senellart • Section 2: Semantic Web – Slides by Fabian M. Suchanek • Section 3: XML Mining – Slides by Richi Nayak • Section 4: Domain-Specific Markup Languages – Slides by Aparna Varde • Summary and Conclusions 3
Section 1: Deep Web Pierre Senellart Department of Computer Science and Networking Telecom ParisTech Paris, France 4
What is the Deep Web? Definition (Deep Web, Hidden Web): all the content of the Web that is not directly accessible through hyperlinks, in particular HTML forms and Web services. Size estimates: [Bri 00] 500 times more content than on the surface Web; tens of thousands of databases. [HPWC 07] about 400,000 deep Web databases. 5
Sources of the Deep Web Examples: • Yellow Pages and other directories; • Library catalogs; • Publication databases; • Weather services; • Geolocation services; • US Census Bureau data; • etc. 6
Discovering Knowledge from the Deep Web • The content of the deep Web is hidden from classical Web search engines (they just follow links) • But it is very valuable and of high quality! • Even services that allow access through the surface Web (e.g., e-commerce) have more semantics when accessed from the deep Web • How can we benefit from this information? • How can we do it automatically, in an unsupervised way? 7
Extensional Approach [Figure: diagram with the labels WWW, discovery, siphoning, bootstrap, indexing, index] 8
Notes on the Extensional Approach • Main issues: – Discovering services – Choosing appropriate data to submit forms – Using data found in result pages to bootstrap the siphoning process – Ensuring good coverage of the database • Approach favored by Google [MHC+06], used in production • Not always feasible (huge load on Web servers) 9
Intensional Approach [Figure: diagram with the labels WWW, discovery, probing, analyzing, form wrapped as a Web service, query] 11
Notes on the Intensional Approach • More ambitious [CHZ 05, SMM+08] • Main issues: – Discovering services – Understanding the structure and semantics of a form – Understanding the structure and semantics of result pages (wrapper induction) – Semantic analysis of the service as a whole • No significant load imposed on Web servers 12
Discovering deep Web forms • Crawling the Web and selecting forms • But not all forms! – Hotel reservation – Mailing list management – Search within a Web site • Heuristics: prefer GET to POST, no password, no credit card number, more than one field, etc. • Given domain of interest: use focused crawling to restrict to this domain 13
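The heuristics on this slide can be sketched as a simple filter over the forms found on a crawled page. This is only an illustration using Python's standard `html.parser`: the names `FormCollector`, `looks_like_deep_web_form` and `candidate_forms` are hypothetical, "prefer GET" is treated as a hard requirement for simplicity, and the credit-card test is a naive substring check on the field name.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect each <form> with its method and the types of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # forms already closed
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            method = (attrs.get("method") or "get").lower()
            self._current = {"method": method, "fields": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self._current["fields"].append({"name": attrs.get("name") or "", "type": kind})

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_deep_web_form(form):
    """Apply the slide's heuristics (thresholds are assumptions)."""
    ignored = ("hidden", "submit", "button", "reset", "image")
    visible = [f for f in form["fields"] if f["type"] not in ignored]
    has_password = any(f["type"] == "password" for f in visible)
    asks_card = any("card" in f["name"].lower() for f in visible)
    return (form["method"] == "get" and not has_password
            and not asks_card and len(visible) > 1)


def candidate_forms(html):
    """Return the forms of a page that pass the heuristics."""
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if looks_like_deep_web_form(f)]
```

A real crawler would combine such a filter with focused crawling, so that only pages from the domain of interest are inspected.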
Web forms • Simplest case: associate each form field with some domain concept • Assumption: fields independent from each other (not always true!), can be queried with words that are part of a domain instance 14
Structural analysis of a form (1/2) 1) Build a context for each field: label tag; id and name attributes; text immediately before the field. 2) Remove stop words, stem 3) Match this context with concept names or concept ontology 4) Obtain in this way candidate annotations 15
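Steps 1 to 4 above can be sketched as follows. The stop-word list, the crude `stem` function (a stand-in for a real stemmer such as Porter's) and the concept dictionary are all illustrative assumptions, not part of the original approach.

```python
import re

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "for", "by", "in", "your", "enter", "please"}


def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Step 2: lowercase, split into words, drop stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}


def field_context(label, field_id, name, preceding_text):
    """Step 1: gather the label, id, name and preceding text of a field."""
    raw = " ".join([label, field_id, name, preceding_text])
    raw = re.sub(r"([a-z])([A-Z])", r"\1 \2", raw)  # split camelCase identifiers
    return tokens(raw)


def candidate_annotations(context, concepts):
    """Steps 3-4: concepts whose normalised names overlap the field context."""
    return sorted(c for c, names in concepts.items()
                  if tokens(" ".join(names)) & context)


concepts = {"Author": ["author", "writer"], "Title": ["title"], "Year": ["year", "date"]}
context = field_context("Authors:", "f1", "authorName", "Search by")
annotations = candidate_annotations(context, concepts)
```

Here the field labelled "Authors:" with name `authorName` receives the single candidate annotation `Author`; the candidates still need to be confirmed by probing.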
Structural analysis of a form (2/2) For each field annotated with concept c: 1) Probe the field with a nonsense word to get an error page 2) Probe the field with instances of concept c 3) Compare the pages obtained by probing with the error page (e.g., by clustering along the DOM tree structure of the pages), to distinguish error pages from result pages 4) Confirm the annotation if enough result pages are obtained 17
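The probing loop can be sketched as below. As a simplification, the clustering along the DOM tree is replaced by a Jaccard similarity between sets of root-to-node tag paths; `submit` is a hypothetical callable that fills the field with a word and returns the HTML of the response, and the nonsense word, `threshold` and `min_results` values are arbitrary assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the set of root-to-node tag paths as a rough DOM signature."""

    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag (tolerates sloppy HTML).
            while self.stack.pop() != tag:
                pass


def signature(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def similarity(a, b):
    """Jaccard similarity of two path sets."""
    return len(a & b) / len(a | b) if a | b else 1.0


def confirm_annotation(submit, instances, nonsense="zzxqjv",
                       threshold=0.8, min_results=2):
    """Steps 1-4: probe, compare with the error page, count result pages."""
    error_sig = signature(submit(nonsense))
    results = [w for w in instances
               if similarity(signature(submit(w)), error_sig) < threshold]
    return len(results) >= min_results
```

Pages whose structure is close to that of the error page are counted as errors; the annotation is confirmed only if enough probes yield structurally different pages.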
Bootstrapping the siphoning • Siphoning (or probing) a deep Web database requires a lot of relevant data with which to submit the form • Idea: use the most frequent words in the content of the result pages • This allows bootstrapping the siphoning with just a few words! 19
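A minimal sketch of this bootstrapping idea: starting from a few seed words, the most frequent words of each result page are queued as new queries. The callable `submit` (returning the text of a result page), the query budget and the number of words kept per page are assumptions made for the illustration.

```python
import re
from collections import Counter, deque


def siphon(submit, seeds, max_queries=50, words_per_page=5):
    """Probe a form breadth-first; frequent result words become new queries."""
    queue, tried, pages = deque(seeds), set(), {}
    while queue and len(tried) < max_queries:
        word = queue.popleft()
        if word in tried:
            continue
        tried.add(word)
        text = submit(word)
        pages[word] = text
        # Count words of at least three letters in the result page.
        counts = Counter(re.findall(r"[a-z]{3,}", text.lower()))
        for candidate, _ in counts.most_common(words_per_page):
            if candidate not in tried:
                queue.append(candidate)
    return pages
```

In practice the budget matters: as noted for the extensional approach, aggressive siphoning can impose a heavy load on the Web server.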
Inducing wrappers from result pages Pages resulting from a given form submission: • share the same structure • contain a set of records with fields • have an unknown presentation! Goal: build wrappers for a given kind of result page, in a fully automatic way. 20
Information extraction systems [CKGS 06] 21
Unsupervised Wrapper Induction • Use the (repetitive) structure of the result pages to infer a wrapper for all pages of this type • Possibly: use in parallel with annotation by recognized concept instances to learn with both the structure and the content 22
Some perspectives • Dealing with complex forms (fields allowing Boolean operators, dependencies between fields, etc.) • Static analysis of JavaScript code to determine which fields of a form are required, etc. • A lot of this is also applicable to Web 2.0/AJAX applications 23
References
[Bri 00] BrightPlanet. The deep Web: Surfacing hidden value. White paper, July 2000.
[CHZ 05] K. C.-C. Chang, B. He, and Z. Zhang. Towards large scale integration: Building a metaquerier over databases on the Web. In Proc. CIDR, Asilomar, USA, Jan. 2005.
[CKGS 06] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of Web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10): 1411–1428, Oct. 2006.
[CMM 01] V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large Web sites. In Proc. VLDB, Roma, Italy, Sep. 2001.
[HPWC 07] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Accessing the deep Web: A survey. Communications of the ACM, 50(2): 94–101, May 2007.
[MHC+06] J. Madhavan, A. Y. Halevy, S. Cohen, X. Dong, S. R. Jeffery, D. Ko, and C. Yu. Structured data meets the Web: A few observations. IEEE Data Engineering Bulletin, 29(4): 19–26, Dec. 2006.
[SMM+08] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic Wrapper Induction from Hidden-Web Sources with Domain Knowledge. In Proc. WIDM, Napa, USA, Oct. 2008. 24
Section 2: Semantic Web Fabian M. Suchanek Databases and Information Systems Max Planck Institute for Informatics Saarbrücken, Germany 25
Motivation Query: scientists from Brisbane. Today's state of the art (keyword matching): „Australia's scientists visit Brisbane. The National Science Education Unit invites Australian scientists to gather in Brisbane“ (www.nsceu.au/brisbane). Vision of the Semantic Web: „Sam Smart is a scientist from Brisbane.“ [Figure: graph with the edge bornIn pointing to Brisbane and the edge label pointing to „Sam Smart“] 26
The Semantic Web is the project of creating a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. Goals: • make computers „understand“ the data they store • allow them to answer „semantic“ queries • allow them to share information across different systems Techniques (= this talk): • defining semantics in a machine-readable way (RDFS) • identifying entities in a globally unique way (URIs) • defining logical consistency in a uniform way (OWL) • linking together existing resources (LOD) http://www.w3.org/2001/sw/ 27
The Resource Description Framework (RDF) RDF is a format of knowledge representation that is similar to the Entity-Relationship Model. Statement: a triple of subject, predicate and object. Example: SamSmart (subject) bornIn (predicate/property) Brisbane (object). http://www.w3.org/TR/rdf-primer/ RDF is used as the only knowledge representation language. => All information is represented in a simple, homogeneous, computer-processable way. 28
n-ary relationships can always be reduced to binary relationships by introducing a new identifier. Example: „SamSmart livesIn Brisbane in 2009“ becomes • living42 aboutPerson SamSmart • living42 aboutPlace Brisbane • living42 aboutTime 2009 29
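The reduction can be written as a small helper that creates a fresh identifier and attaches one binary statement per role. The function name `reduce_nary` and the way identifiers are generated (a counter starting at 42, chosen only to echo the slide's example) are assumptions for illustration.

```python
from itertools import count

_ids = count(42)  # arbitrary start value, echoing the slide's example identifier


def reduce_nary(relation, roles):
    """Turn one n-ary fact into binary triples attached to a fresh identifier."""
    event = f"{relation}{next(_ids)}"
    return [(event, role, value) for role, value in roles.items()]


triples = reduce_nary("living", {
    "aboutPerson": "SamSmart",
    "aboutPlace": "Brisbane",
    "aboutTime": "2009",
})
```

Each call produces a new identifier, so two different "living" facts never share their role statements.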
Uniform Resource Identifiers (URIs) A URI is similar to a URL, but it is not necessarily downloadable. It identifies a concept („resource“ = „entity“) uniquely. Resource and URI: • SamSmart: http://brisbane-corp.au/people/SamSmart • bornIn: http://mpii.de/yago/resource/bornIn • Brisbane: http://brisbane.au http://www.ietf.org/rfc3986.txt URIs are used as globally unique identifiers for resources. => Knowledge can be interlinked: a knowledge base on one server can refer to concepts from a knowledge base on another server. 30
Namespaces A namespace is a shorthand notation for the first part of a URI. bornIn Brisbane Without namespaces, our statement is a triple of 3 URIs, which is quite verbose
Popular Namespaces: Basic • rdf: The basic RDF vocabulary http://www.w3.org/1999/02/22-rdf-syntax-ns# • rdfs: RDF Schema vocabulary (predicates for classes etc., later in this talk) http://www.w3.org/2000/01/rdf-schema# • owl: Web Ontology Language (for reasoning, later in this talk) http://www.w3.org/2002/07/owl# • dc: Dublin Core (predicates for describing documents, such as „author“, „title“ etc.) http://purl.org/dc/elements/1.1/ • xsd: XML Schema (definition of basic datatypes) http://www.w3.org/2001/XMLSchema# Standard namespaces are used for basic concepts => The basic concepts are the same across all RDF knowledge bases 32
Popular Namespaces: Specific • dbp: The DBpedia ontology (real-world predicates and resources, e.g. Albert Einstein) http://dbpedia.org/resource/ • yago: The YAGO ontology (real-world predicates and resources, e.g. Albert Einstein) http://mpii.de/yago/resource/ • foaf: Friend Of A Friend (predicates for relationships between people) http://xmlns.com/foaf/0.1/ • cc: Creative Commons (types of licences) http://creativecommons.org/ns# ... and many, many more. A number of specific namespaces already exist => Knowledge engineers don't have to start from scratch 33
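Namespace prefixes are pure string abbreviations, which the following sketch makes explicit. The helper names `expand` and `shorten` are hypothetical; the prefix table uses the standard W3C namespace URIs for rdf, rdfs and owl and the YAGO namespace quoted in this talk.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "yago": "http://mpii.de/yago/resource/",
}


def expand(qname, prefixes=PREFIXES):
    """Replace the prefix of a name such as 'yago:bornIn' by its namespace URI."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {qname!r}")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Abbreviate a full URI with the longest matching namespace, if any."""
    best = max((p for p, ns in prefixes.items() if uri.startswith(ns)),
               key=lambda p: len(prefixes[p]), default=None)
    return f"{best}:{uri[len(prefixes[best]):]}" if best else uri
```

A URI that matches no known namespace is returned unchanged by `shorten`, so the abbreviation is always lossless.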
Literals „Sam Smart“ label bornIn Brisbane example:SamSmart yago:bornIn
Classes A class is a resource that represents a set of similar resources. More general classes subsume more specific classes. [Figure: graph with the labels person, subclassOf, scientist, type, bornIn, Brisbane] example:SamSmart yago:bornIn
„Meta-Data“ Meta-Data is data about classes and properties. Properties themselves are resources in RDF. [Figure: graph with the labels type, Class, Property, domain, range, bornIn, Brisbane, person, city] • yago:bornIn rdf:type rdf:Property • yago:bornIn rdfs:domain example:person • yago:bornIn rdfs:range example:city • example:person rdf:type rdfs:Class http://www.w3.org/TR/rdf-schema/ RDFS can be used to talk about classes and properties, too => There is no concept of „meta-data“ in RDFS 36
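Because schema statements are ordinary triples, a program can use them to derive new facts. The sketch below applies three standard RDFS rules (domain, range, subclass) until nothing new is derived. It is a naive, quadratic illustration over prefixed names as plain strings; the function name `rdfs_closure` is hypothetical.

```python
RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"
DOMAIN, RANGE = "rdfs:domain", "rdfs:range"


def rdfs_closure(triples):
    """Apply the domain, range and subclass rules until a fixpoint is reached."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == DOMAIN and s2 == p:   # subject gets the domain class
                    new.add((s, RDF_TYPE, o2))
                if p2 == RANGE and s2 == p:    # object gets the range class
                    new.add((o, RDF_TYPE, o2))
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))  # instance of the superclass too
        if new <= facts:
            return facts
        facts |= new


facts = rdfs_closure([
    ("yago:bornIn", DOMAIN, "example:person"),
    ("yago:bornIn", RANGE, "example:city"),
    ("example:scientist", SUBCLASS, "example:person"),
    ("example:SamSmart", "yago:bornIn", "example:Brisbane"),
])
```

From the single statement about SamSmart, the closure derives that SamSmart is a person and that Brisbane is a city.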
Reasoning „A person can only be born in one place“: yago:bornIn rdf:type owl:FunctionalProperty. „Meat is not Fruit“: example:Meat owl:disjointWith example:Fruit. [Figure: graph with the labels FunctionalProperty, type, bornIn, Class, Meat, disjointWith, Fruit] The owl namespace defines vocabulary for set operations on classes, restrictions on properties and equivalence of classes. The OWL vocabulary can be used to express properties of classes and predicates => We can express logical consistency 37
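The two constraints on this slide can be checked mechanically, as in the sketch below. It assumes that different names denote different things (a unique-name assumption); a full OWL reasoner would instead conclude that two values of a functional property are the same individual unless they are stated to be different. The function name `violations` and the sample facts are illustrative.

```python
def violations(triples):
    """Report functional-property and disjointness conflicts (unique names assumed)."""
    facts = set(triples)
    functional = {s for s, p, o in facts
                  if p == "rdf:type" and o == "owl:FunctionalProperty"}
    disjoint = {(s, o) for s, p, o in facts if p == "owl:disjointWith"}
    types, values, problems = {}, {}, []
    for s, p, o in facts:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        if p in functional:
            values.setdefault((s, p), set()).add(o)
    for (s, p), objs in sorted(values.items()):
        if len(objs) > 1:
            problems.append(f"{s} has {len(objs)} values for functional property {p}")
    for s, classes in sorted(types.items()):
        for a, b in sorted(disjoint):
            if a in classes and b in classes:
                problems.append(f"{s} is both {a} and {b}")
    return problems


kb = [
    ("yago:bornIn", "rdf:type", "owl:FunctionalProperty"),
    ("example:Meat", "owl:disjointWith", "example:Fruit"),
    ("example:SamSmart", "yago:bornIn", "example:Brisbane"),
    ("example:SamSmart", "yago:bornIn", "example:Sydney"),   # hypothetical conflict
    ("example:Steak", "rdf:type", "example:Meat"),
    ("example:Steak", "rdf:type", "example:Fruit"),          # hypothetical conflict
]
```

Calling `violations(kb)` reports both the second birthplace and the resource typed as Meat and Fruit at once.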
Reasoning: Flavors of OWL There exist 3 different flavors of OWL that trade off expressivity with tractability. http://www.w3.org/TR/owl-guide/ • OWL Full is very powerful, but undecidable • OWL DL has the expressive power of Description Logics • OWL Lite is a simplified subset of OWL DL [Figure: nested diagram of OWL Full, OWL DL and OWL Lite with the labels reification, full RDF, classes as instances, disjointWith, set operations on classes, cardinality constraints] 38
Formats of RDF data RDF is just the model of knowledge representation; there exist different formats to store it. 1. In a database („triple store“) with the schema FACT(resource, predicate, resource) 2. As triples in plain text („Notation 3“, „Turtle“) @prefix yago http://mpii.de/yago/resource yago:SamSmart yago:bornIn
Existing OWL/RDF knowledge bases: General A number of knowledge bases already exist in RDF. Dataset, URL, #Statements:
• Freebase (community collaboration): http://www.freebase.com, 2.5m
• OpenCyc (spin-off from commercial ontology Cyc): http://www.opencyc.org, 60k
• DBpedia (extraction from Wikipedia, focus on coverage): http://www.dbpedia.org, 270m
• YAGO (extraction from Wikipedia, focus on accuracy): http://mpii.de/yago, 20m 40
Existing OWL/RDF knowledge bases: Specific Dataset, URL, #Statements:
• MusicBrainz (Artists, Songs, Albums): http://www.musicbrainz.org, 23k
• Geonames (Countries, Cities, Capitals): http://www.geonames.org, 85k
• DBLP (Papers, Authors, Citations): http://www4.wiwiss.fu-berlin.de/dblp/, 15m
• US Census (Population statistics): http://www.rdfabout.com/demo/census/, 1000m
... and many more. => The Semantic Web already has a reasonable number of knowledge bases 41
The Linking Open Data Project yago:AlbertEinstein owl:sameAs dbpedia:Albert_Einstein 42
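One way a consumer of linked data might use owl:sameAs statements is to group all identifiers that denote the same entity. The sketch below does this with a small union-find structure; the function name `same_as_clusters` and the `freebase:einstein` and Brisbane identifiers are hypothetical, added only to show transitivity and a second cluster.

```python
def same_as_clusters(triples):
    """Group resources connected (directly or transitively) by owl:sameAs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


links = [
    ("yago:AlbertEinstein", "owl:sameAs", "dbpedia:Albert_Einstein"),
    ("dbpedia:Albert_Einstein", "owl:sameAs", "freebase:einstein"),  # hypothetical
    ("yago:Brisbane", "owl:sameAs", "dbpedia:Brisbane"),             # hypothetical
]
clusters = same_as_clusters(links)
```

Facts stated about any member of a cluster can then be treated as facts about the same real-world entity.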
Querying the knowledge bases: SPARQL SPARQL is a query language for RDF data. It is similar to SQL. Example: Which scientists are from Brisbane? First define our namespaces: PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX example: ... Then pose the query in SQL style: SELECT ?x WHERE { ?x rdf:type example:scientist . ?x example:bornIn example:Brisbane } http://www.w3.org/TR/rdf-sparql-query/ 43
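To show what the WHERE clause of such a query does, the sketch below matches a list of triple patterns against an in-memory list of triples and returns all variable bindings. It is not a SPARQL engine, only an illustration of basic graph pattern matching; the function name `match` and the `example:AnnClever` facts are hypothetical.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


data = [
    ("example:SamSmart", "rdf:type", "example:scientist"),
    ("example:SamSmart", "example:bornIn", "example:Brisbane"),
    ("example:AnnClever", "rdf:type", "example:scientist"),   # hypothetical
    ("example:AnnClever", "example:bornIn", "example:Sydney"),  # hypothetical
]
query = [
    ("?x", "rdf:type", "example:scientist"),
    ("?x", "example:bornIn", "example:Brisbane"),
]
answers = [b["?x"] for b in match(query, data)]
```

The shared variable `?x` plays the role of a join: only resources that satisfy both patterns are returned.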
Sample Query on YAGO Which scientists are from Brisbane? 44
References
Specifications
• RDF: http://www.w3.org/TR/rdf-primer/
• RDFS: http://www.w3.org/TR/rdf-schema/
• URIs: http://www.ietf.org/rfc3986.txt
• Literals: http://www.ietf.org/rfc3986.txt
• OWL: http://www.w3.org/TR/owl-guide/
• SPARQL: http://www.w3.org/TR/rdf-sparql-query/
Projects
• YAGO: Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. „YAGO - A Core of Semantic Knowledge“ (WWW 2007)
• DBpedia: S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, R. Cyganiak, Z. Ives. „DBpedia: A Nucleus for a Web of Open Data“ (ISWC 2007)
• LOD: Christian Bizer, Tom Heath, Danny Ayers, Yves Raimond. „Interlinking Open Data on the Web“ (ESWC 2007) 45
Section 3: XML Mining Richi Nayak Faculty of Information Technology Queensland University of Technology Brisbane, Australia 46
Outline • What is XML? • What is XML mining? • Why should we do XML mining? • How do we do XML mining? • Future directions 47
XML XML: eXtensible Markup Language XML vs. HTML: restricted set of tags, e.g.