cae69287e429c1001be2a9d3597ac1f4.ppt
- Number of slides: 39
Harvesting, Searching, and Ranking Knowledge from the Web
  Gerhard Weikum, weikum@mpi-inf.mpg.de, http://www.mpi-inf.mpg.de/~weikum/
  joint work with Shady Elbassuoni, Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Mauro Sozio, Fabian Suchanek
My Vision
  Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
  Approach:
  1) harvest and combine
     a) hand-crafted knowledge sources (Semantic Web, ontologies)
     b) automatic knowledge extraction (Statistical Web, text mining)
     c) social communities and human computing (Social Web, Web 2.0)
  2) express knowledge queries, search, and rank
  3) everything efficient and scalable
Why Google and Wikipedia Are Not Enough
  Answer "knowledge queries" (by scientists, journalists, analysts, etc.) such as:
  • drugs or enzymes that inhibit proteases (HIV)
  • connections between Thomas Mann and Goethe
  • German Nobel prize winner who survived both world wars and outlived all of his four children
  • how are Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama related
  • politicians who are also scientists
Why Google and Wikipedia Are Not Enough
  What is lacking?
  Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best. (Frank Zappa, 1940 – 1993)
  → extract facts from Web pages
  → capture user intention by concepts, entities, relations
Related Work
  (diagram of related systems, clustered by area)
  Areas: information extraction & ontology building; Web entity search & QA; semistructured IR & graph search
  Systems: START, Answers, Hakia, Powerset, EntityRank, Cimple, DBlife, Libra, Kylin, KOG, TextRunner, UIMA, Freebase, Cyc, ExpertFinder, TopX, Avatar, XQ-FT, Tijah, Banks, DBexplorer, DBpedia, Yago, Naga, SPARQL, SWSE
Relevant Projects
  • KnowItAll / TextRunner (UW Seattle)
  • IntelligenceInWikipedia (UW Seattle)
  • DBpedia (U Leipzig & FU Berlin)
  • SeerSuite (PennState)
  • Cimple / DBlife (U Wisconsin & Yahoo)
  • Avatar / System T (IBM Almaden)
  • Libra (MS Research Beijing)
  • SQoUT (Columbia U)
  • Wikipedia Entities (Yahoo Barcelona)
  • Expert Finding (U Amsterdam)
  • Expertise Finding (U Twente)
  ... and G, Y, MS for products, locations, ... and more
  Selected overviews in: ACM SIGMOD Record 37(4), Dec 2008
Outline
  Motivation
  • Information Extraction & Knowledge Harvesting (YAGO)
  • Consistent Growth of Knowledge (SOFIE)
  • Ranking for Search over Entity-Relation Graphs (NAGA)
  • Efficient Query Processing (RDF-3X)
  • Conclusion
Information Extraction (IE): Text to Records
  Person            BirthDate    BirthPlace   ...
  Max Planck        4/23, 1858   Kiel
  Albert Einstein   3/14, 1879   Ulm
  Mahatma Gandhi    10/2, 1869   Porbandar

  Person            ScientificResult
  Max Planck        Quantum Theory

  Constant            Value          Dimension
  Planck's constant   6.226 × 10^23  Js

  Person            Collaborator
  Max Planck        Albert Einstein
  Max Planck        Niels Bohr

  Person            Organization
  Max Planck        KWG / MPG

  • extracted facts often have confidence < 1 (DB with uncertainty)
  • sometimes: confidence << 1
  • high computational costs
  • combine NLP, pattern matching, lexicons, statistical learning
High-Quality Knowledge Sources
  General-purpose ontologies and thesauri: WordNet family
  200 000 concepts and relations; can be cast into
  • description logics or
  • graph, with weights for relation strengths (derived from co-occurrence statistics)
  Example:
  scientist, man of science (a person with advanced knowledge)
    => cosmographer, cosmographist
    => biologist, life scientist
    => chemist
    => cognitive scientist
    => computer scientist
    ...
    => principal investigator, PI
    …
    HAS INSTANCE => Bacon, Roger Bacon
    …
Exploit Hand-Crafted Knowledge
  Wikipedia and other lexical sources
Exploit Hand-Crafted Knowledge
  Wikipedia and other lexical sources
  {{Infobox_Scientist
  | name = Max Planck
  | birth_date = [[April 23]], [[1858]]
  | birth_place = [[Kiel]], [[Germany]]
  | death_date = [[October 4]], [[1947]]
  | death_place = [[Göttingen]], [[Germany]]
  | residence = [[Germany]]
  | nationality = [[Germany|German]]
  | field = [[Physicist]]
  | work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
  | alma_mater = [[Ludwig-Maximilians-Universität München]]
  | doctoral_advisor = [[Philipp von Jolly]]
  | doctoral_students = [[Gustav Ludwig Hertz]]</br> …
  | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
  | prizes = [[Nobel Prize in Physics]] (1918)
  …
Exploit Hand-Crafted Knowledge
  Wikipedia, WordNet, and other lexical sources
YAGO: Yet Another Great Ontology [F. Suchanek et al.: WWW'07]
  • Turn Wikipedia into formal knowledge base (semantic DB); keep source pages as witnesses
  • Exploit hand-crafted categories and infoboxes
  • Represent facts as knowledge triples: relation (entity1, entity2), drawn as entity1 → relation → entity2 (in FOL, compatible with RDF, OWL-lite, XML, etc.)
  • Map relations into WordNet concept DAG
  Examples: Max_Planck bornIn Kiel, Kiel isInstanceOf City
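The triple representation can be sketched in a few lines of Python. This is only an illustration, assuming a plain in-memory set and reading the slide's example as two chained facts; the `Triple` type and the `objects_of` helper are hypothetical names, not part of YAGO.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One fact in subject-relation-object form (illustrative type, not YAGO's API)."""
    subject: str
    relation: str
    obj: str


# The example facts from the slide, read as two chained triples.
facts = {
    Triple("Max_Planck", "bornIn", "Kiel"),
    Triple("Kiel", "isInstanceOf", "City"),
}


def objects_of(subject: str, relation: str) -> Set[str]:
    """Return all objects o such that relation(subject, o) is a stored fact."""
    return {t.obj for t in facts if t.subject == subject and t.relation == relation}


birthplaces = objects_of("Max_Planck", "bornIn")
```

A set of tuples keeps each fact unique and makes lookups by any component a simple filter; a real store would add indexes.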
Difficulties in Wikipedia Harvesting
  • instanceOf relation: misleading and difficult category names
    "disputed articles", "particle physics", "American Music of the 20th Century", "naturalized citizens of the United States", …
  • subclass relation: mapping categories onto WordNet classes:
    "Nobel laureates in physics" → Nobel_laureates, "people from Kiel" → person
  • entity name synonyms & ambiguities:
    "St. Petersburg", "Saint Petersburg", "M31", "NGC 224" means ...
  • type (consistency) checking for rejecting false candidates:
    AlmaMater (Max Planck, Kiel) with expected argument types Person, University
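The type check in the last bullet can be sketched as follows. The relation signatures, the entity types, and the function name `type_check` are assumptions made for this example; only the rejected candidate AlmaMater (Max Planck, Kiel) comes from the slide.

```python
# Hypothetical relation signatures (domain type, range type), for illustration only.
SIGNATURES = {"almaMater": ("Person", "University")}

# Hypothetical entity types, as they might come from the class hierarchy.
ENTITY_TYPES = {
    "Max_Planck": {"Person"},
    "Kiel": {"City"},
    "LMU_München": {"University"},
}


def type_check(relation, subject, obj):
    """Accept a candidate fact only if both arguments match the relation's signature."""
    domain, range_ = SIGNATURES[relation]
    return (domain in ENTITY_TYPES.get(subject, set())
            and range_ in ENTITY_TYPES.get(obj, set()))


candidates = [
    ("almaMater", "Max_Planck", "Kiel"),         # rejected: Kiel is a City, not a University
    ("almaMater", "Max_Planck", "LMU_München"),  # passes the type check
]
accepted = [c for c in candidates if type_check(*c)]
```

Passing the check does not make a candidate true; it only filters out candidates whose argument types cannot fit the relation.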
YAGO Knowledge Base [F. Suchanek et al.: WWW 2007]
  (figure: excerpt of the YAGO graph in three layers)
  concepts: Entity, Person, Location, Scientist, Biologist, Physicist, City, Country (linked by subclass edges)
  individual entities: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, April 23, 1858, October 4, 1947 (linked by instanceOf, bornIn, bornOn, diedOn, hasWon, FatherOf edges)
  words, phrases: "Max Planck", "Max Karl Ernst Ludwig Planck", "Dr. Planck" (linked to Max_Planck by means edges)
  Online access and download at http://www.mpi-inf.mpg.de/yago/
YAGO Knowledge Base [F. Suchanek et al.: WWW 2007]
  (same graph excerpt as on the previous slide, overlaid with a size comparison)
  RDF triples (entity1-relation-entity2, subject-predicate-object)

                Entities    Facts
  KnowItAll                 30 000
  SUMO          20 000      60 000
  WordNet       120 000     80 000
  Cyc           300 000     5 Mio.
  TextRunner    n/a         8 Mio.
  YAGO          1.9 Mio.    19 Mio.
  DBpedia       1.9 Mio.    103 Mio.
  Freebase      ???         156 Mio.

  Accuracy 95%
  Online access and download at http://www.mpi-inf.mpg.de/yago/
Long Tail of Wikipedia (Intelligence-in-Wikipedia Project) [Wu / Weld: WWW 2008]
  YAGO & DBpedia mappings of entities onto classes are valuable assets
  Learning infobox attributes: sparse & noisy training data
  (figure: classes Physicist, Computer Scientist, Musician, Artist, University, Organization)
Outline
  Motivation
  Information Extraction & Knowledge Harvesting (YAGO)
  • Consistent Growth of Knowledge (SOFIE)
  • Ranking for Search over Entity-Relation Graphs (NAGA)
  • Efficient Query Processing (RDF-3X)
  • Conclusion
Maintaining and Growing YAGO
  (pipeline diagram)
  WordNet + Wikipedia → YAGO Core Extractors → YAGO Core Checker → YAGO Core (knows all entities)
  Web sources → YAGO Gatherer → Hypotheses → YAGO Scrutinizer → Growing YAGO (focus on facts)
SOFIE: Self-Organizing Framework for IE [F. Suchanek et al.: WWW 2009]
  Reconcile
  • textual/linguistic pattern-based IE with statistics: seeds → patterns → facts → patterns → ...
  • declarative rule-based IE with constraints
    functional dependencies: hasCapital is a function
    inclusion dependencies: isCapitalOf ⊆ isCityOf
From Facts to Patterns to Hypotheses
  Spouse statements (known facts and hypotheses):
    Spouse (HillaryClinton, BillClinton)
    Spouse (MelindaGates, BillGates)
    Spouse (AngelaMerkel, JoachimSauer)
    Spouse (CarlaBruni, NicolasSarkozy)
    Spouse (LarryPage, Google)
    Spouse (AngelaMerkel, UlrichMerkel)
  Pattern occurrences (with counts):
    occurs (X and her husband Y, AngelaMerkel, JoachimSauer) [4]
    occurs (X and her husband Y, MelindaGates, BillGates) [2]
    occurs (X and her husband Y, CarlaBruni, NicolasSarkozy) [3]
    occurs (X married to Y, MelindaGates, BillGates) [2]
    occurs (X loves Y, LarryPage, Google) [5]
  Pattern hypotheses:
    expresses (and her husband, Spouse)
    expresses (married to, Spouse)
    expresses (loves, Spouse)
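The step from facts to patterns to hypotheses can be sketched as below. This is a simplified illustration, not SOFIE's algorithm: it assumes that the Clinton and Gates pairs are the facts already known, and it uses a hard rule (a pattern must co-occur with a known fact) where SOFIE weighs all hypotheses jointly. The variable names are hypothetical.

```python
from collections import defaultdict

# Assumption for illustration: these two Spouse facts are already in the knowledge base.
known_spouse = {("HillaryClinton", "BillClinton"), ("MelindaGates", "BillGates")}

# Pattern occurrences from the slide: (pattern, X, Y, number of occurrences).
occurrences = [
    ("X and her husband Y", "AngelaMerkel", "JoachimSauer", 4),
    ("X and her husband Y", "MelindaGates", "BillGates", 2),
    ("X and her husband Y", "CarlaBruni", "NicolasSarkozy", 3),
    ("X married to Y", "MelindaGates", "BillGates", 2),
    ("X loves Y", "LarryPage", "Google", 5),
]

# Step 1: a pattern seen with a known fact becomes a candidate for expressing Spouse.
support = defaultdict(int)
for pattern, x, y, count in occurrences:
    if (x, y) in known_spouse:
        support[pattern] += count

# Step 2: every other pair seen with a supported pattern becomes a Spouse hypothesis.
hypotheses = {
    (x, y)
    for pattern, x, y, _ in occurrences
    if pattern in support and (x, y) not in known_spouse
}
```

Under this hard rule "X loves Y" never gains support, so Spouse (LarryPage, Google) is not even proposed; on the slide it remains a hypothesis that has to be rejected by other means.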
Adding Consistency Constraints
  (Spouse statements, pattern occurrences, and pattern hypotheses as on the previous slide)
  Rules and constraints:
    occurs (P, X, Y) ∧ expresses (P, Spouse) ⇒ Spouse (X, Y)
    occurs (P, X, Y) ∧ Spouse (X, Y) ⇒ expresses (P, Spouse)
    Spouse (X, Y) ∧ Y ≠ Z ⇒ ¬Spouse (X, Z)
    Spouse (X, Y) ⇒ Type (X, Person) ∧ Type (Y, Person)
Representation by Clauses
  (Spouse statements, pattern occurrences, and pattern hypotheses as on the previous slides)
  • Clauses connect facts, patterns, hypotheses, constraints
  • Treat hypotheses as variables, facts as constants:
    (¬1 ∨ ¬A ∨ 1), (¬1 ∨ ¬A ∨ B), (¬1 ∨ ¬C), (¬D ∨ E), (¬D ∨ F), ...
  • Clauses can be weighted by pattern statistics
  • Solve weighted Max-Sat problem: assign truth values to variables s.t. total weight of satisfied clauses is max!
  Grounded rules:
    occurs (and her husband, AngelaMerkel, JoachimSauer) ∧ expresses (and her husband, Spouse) ⇒ Spouse (AngelaMerkel, JoachimSauer)
    occurs (and her husband, CarlaBruni, NicolasSarkozy) ∧ expresses (and her husband, Spouse) ⇒ Spouse (CarlaBruni, NicolasSarkozy)
    Spouse (AngelaMerkel, JoachimSauer) ⇒ ¬Spouse (AngelaMerkel, UlrichMerkel)
    Spouse (LarryPage, Google) ⇒ Type (LarryPage, Person) ∧ Type (Google, Person)
    ...
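A toy weighted Max-Sat instance makes the idea concrete. The sketch below enumerates all assignments, which is feasible only for a handful of variables and is not how a real solver works. The variable names, the clause weights (pattern counts for soft clauses, 100 for constraints), and the weak evidence clause for Ulrich Merkel are assumptions made for this example.

```python
from itertools import product

# Variables are hypotheses; known facts are folded into the clauses as constants.
VARS = ["expr_husband", "expr_loves", "sp_AM_JS", "sp_AM_UM", "sp_LP_G", "type_G_person"]

# Each clause: (weight, [(variable, required truth value), ...]).
# A clause is satisfied if at least one literal has its required value.
CLAUSES = [
    (2,   [("expr_husband", True)]),                            # pattern seen with a known fact
    (4,   [("expr_husband", False), ("sp_AM_JS", True)]),       # pattern implies Spouse(AM, JS)
    (1,   [("sp_AM_UM", True)]),                                # hypothetical weak evidence
    (100, [("sp_AM_JS", False), ("sp_AM_UM", False)]),          # at most one spouse
    (5,   [("expr_loves", False), ("sp_LP_G", True)]),          # "loves" implies Spouse(LP, Google)
    (100, [("sp_LP_G", False), ("type_G_person", True)]),       # spouses must be persons
    (100, [("type_G_person", False)]),                          # Google is not a person
]


def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(w for w, lits in CLAUSES if any(assignment[v] == val for v, val in lits))


def solve_max_sat():
    """Try every assignment and keep the one with the highest satisfied weight."""
    best = None
    for values in product([False, True], repeat=len(VARS)):
        assignment = dict(zip(VARS, values))
        score = satisfied_weight(assignment)
        if best is None or score > best[0]:
            best = (score, assignment)
    return best


best_weight, best_assignment = solve_max_sat()
```

In this instance the best assignment accepts Spouse (AngelaMerkel, JoachimSauer) and rejects both Spouse (AngelaMerkel, UlrichMerkel) and Spouse (LarryPage, Google), giving up only the weight-1 clause.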
SOFIE: Consistent Growth of YAGO [F. Suchanek et al.: WWW 2009]
  • self-organizing framework for scrutinizing hypotheses about new facts, enabling automated growth of the knowledge base
  • unifies pattern-based IE, consistency checking and entity disambiguation
  Experimental evidence:
  • input: biographies of 400 US senators, 3500 HTML files
  • output: birth/death date & place, politicianOf (state)
  • run-time: 7 h parsing, 6 h hypotheses, 2 h weighted Max-Sat
  • precision: 90-95%, except for death place
  • discovered patterns:
    politicianOf: X was a * of Y, X represented Y, ...
    deathDate: X died on Y, X was assassinated on Y, ...
    deathPlace: X was born in Y
Open Issues
  • Temporal Knowledge: temporal validity of all facts (spouses, CEOs, etc.)
  • Total Knowledge: all possible relations ("Open IE"), but in canonical form
    worksFor, employedAt, isEmployeeOf, ... → affiliation
  • Multimodal Knowledge: photos, videos, sound, sheet music of entities (people, landmarks, etc.) and facts (marriages, soccer matches, etc.)
  • Scalable Knowledge Gathering: high-quality IE at the rate at which news, blogs, Wikipedia updates are produced!
Scalability: Benchmark Proposal
  for all people in Wikipedia (100,000's) gather all spouses, incl. divorced & widowed, and corresponding time periods!
  >95% accuracy, >95% coverage, in one night
  • redundancy of sources helps, but stresses scalability even more
  • consistency constraints are potentially helpful:
    • functional dependencies: {husband, time} → wife
    • inclusion dependencies: marriedPerson ⊆ adultPerson
    • age/time/gender restrictions: birthdate + Δ < marriage < divorce
Outline
  Motivation
  Information Extraction & Knowledge Harvesting (YAGO)
  Consistent Growth of Knowledge (SOFIE)
  • Ranking for Search over Entity-Relation Graphs (NAGA)
  • Efficient Query Processing (RDF-3X)
  • Conclusion
NAGA: Graph Search with Ranking [G. Kasneci et al.: ICDE 2008, ICDE 2009]
  Graph-based search on knowledge bases with built-in ranking based on confidence and informativeness
  Simple query:
    ?x isa Politician .
    ?x isa Scientist
  Complex query (with regular expr.), textual form:
    ?x hasWon NobelPrize .
    ?x fatherOf ?y .
    ?x bornOn ?b . FILTER (?b < Jul-28-1914)
    ?x diedOn ?d . ...
  Complex query, graph form (figure labels):
    ?x isa Scientist; ?x (bornIn | livesIn | citizenOf).locatedIn* Germany; ?x hasWon Nobel Prize; ?x fatherOf ?y; ?x bornOn ?b with ?b < Jul 28, 1914; ?x diedOn ?d with ?d > Sep 2, 1945
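Evaluating the simple query amounts to matching triple patterns against the fact graph and joining on shared variables. The following is a minimal sketch of that matching step only, without NAGA's ranking or regular-expression edges; the `match` function and the toy fact list are assumptions made for this example.

```python
def match(patterns, facts, binding=None):
    """Yield variable bindings satisfying all triple patterns.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for fact in facts:
        new = dict(binding)
        ok = True
        for term, value in zip(first, fact):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, facts, new)


# Toy fact list for illustration.
facts = [
    ("Angela_Merkel", "isa", "Politician"),
    ("Angela_Merkel", "isa", "Scientist"),
    ("Max_Planck", "isa", "Scientist"),
    ("Benjamin_Franklin", "isa", "Politician"),
    ("Benjamin_Franklin", "isa", "Scientist"),
]

query = [("?x", "isa", "Politician"), ("?x", "isa", "Scientist")]
results = sorted(b["?x"] for b in match(query, facts))
```

The nested scan is quadratic in the number of facts; it is meant to show the semantics of the query, not an efficient evaluation strategy.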
Statistical Language Models (LMs) for Entity Ranking [work by U Amsterdam, MSR Beijing, U Twente, Yahoo Barcelona, ...]
  LM (entity e) = prob. distr. of words seen in context of e
  query q: "Dutch soccer player Barca"
  candidate entities:
    e1: Johan Cruyff
    e2: Ruud van Nistelrooy
    e3: Ronaldinho
    e4: Zinedine Zidane
    e5: FC Barcelona
  (figure: context words per entity, weighted by extraction accuracy)
  context words: Dutch goalgetter; soccer champion; Dutch player; Ajax Amsterdam; trainer Barca 8 years; Camp Nou; played soccer; FC Barcelona; Jordi Cruyff son; Zizou; champions league 2002; Real Madrid; van Nistelrooy; Dutch soccer world cup; best player 2005; lost against Barca
LM for Fact (Entity-Relation) Ranking
  query q:
    q1: ?x hasWon NobelPrize
    q2: ?x bornIn Germany
  fact pool for candidate answers (with witness counts):
    f1:  Einstein hasWon NobelPrize      200
    f2:  Gruenberg hasWon NobelPrize     50
    f3:  Gruenberg hasWon JapanPrize     20
    f4:  Vickrey hasWon NobelPrize       50
    f5:  Cerf hasWon TuringAward         100
    f6:  Einstein bornIn Germany         100
    f7:  Gruenberg bornIn Germany        20
    f8:  Goethe bornIn Germany           200
    f9:  Schiffer bornIn Germany         150
    f10: Vickrey bornIn Canada           10
    f11: Cerf bornIn USA                 100
  instantiation (user interests), plus smoothing:
    LM(q1): Einstein hasWon NP → 200/300
            Gruenberg hasWon NP → 50/300
            Vickrey hasWon NP → 50/300
    LM(q2): Einstein bornIn G → 100/470
            Gruenberg bornIn G → 20/470
            Goethe bornIn G → 200/470
            Schiffer bornIn G → 150/470
  witnesses may be weighted by confidence
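The probabilities on this slide are relative witness counts among the facts that match the query pattern. The sketch below recomputes them, reading the pool's counts as 200 for Einstein hasWon NobelPrize and 20 for Gruenberg bornIn Germany, which is what the slide's totals of 300 and 470 imply. Smoothing and confidence weighting are left out, and the function name `language_model` is hypothetical.

```python
from fractions import Fraction

# (subject, relation, object, witness count) from the slide's fact pool.
FACT_POOL = [
    ("Einstein", "hasWon", "NobelPrize", 200),
    ("Gruenberg", "hasWon", "NobelPrize", 50),
    ("Gruenberg", "hasWon", "JapanPrize", 20),
    ("Vickrey", "hasWon", "NobelPrize", 50),
    ("Cerf", "hasWon", "TuringAward", 100),
    ("Einstein", "bornIn", "Germany", 100),
    ("Gruenberg", "bornIn", "Germany", 20),
    ("Goethe", "bornIn", "Germany", 200),
    ("Schiffer", "bornIn", "Germany", 150),
    ("Vickrey", "bornIn", "Canada", 10),
    ("Cerf", "bornIn", "USA", 100),
]


def language_model(relation, obj):
    """Unsmoothed LM for the query '?x relation obj': witness count / total count."""
    matching = {s: c for s, r, o, c in FACT_POOL if r == relation and o == obj}
    total = sum(matching.values())
    return {s: Fraction(c, total) for s, c in matching.items()}


lm_q1 = language_model("hasWon", "NobelPrize")   # denominators sum to 300
lm_q2 = language_model("bornIn", "Germany")      # denominators sum to 470
ranked_q2 = sorted(lm_q2, key=lm_q2.get, reverse=True)
```

Exact fractions are used so the results can be compared directly with the slide's 200/300 and 100/470 style values.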
NAGA Example
  Query: ?x isa politician . ?x isa scientist
  Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel, …
Outline
  Motivation
  Information Extraction & Knowledge Harvesting (YAGO)
  Consistent Growth of Knowledge (SOFIE)
  Ranking for Search over Entity-Relation Graphs (NAGA)
  • Efficient Query Processing (RDF-3X)
  • Conclusion
Scalable Semantic Web: Pattern Queries on Large RDF Graphs
  schema-free RDF triples: subject-property-object (SPO)
  example: Einstein hasWon NobelPrize
  SPARQL triple patterns:
    Select ?p ?c Where {
      ?p isa scientist .
      ?p hasWon NobelPrize .
      ?p bornIn ?t .
      ?t inCountry ?c .
      ?c partOf Europe }
  large join queries, unpredictable workload, difficult physical design, difficult query optimization
  Storage layouts (figure):

  AllTriples
    S         P        O
    Einstein  hasWon   Nobel
    Einstein  bornIn   Ulm
    Ronaldo   hasWon   FIFA
    Spain     partOf   Europe
    France    partOf   Europe
    …         …        …

  Person
    S         hasWon   bornIn   ...
    Einstein  Nobel    Ulm
    Ronaldo   FIFA     Rio

  Country
    S   partOf   capital   ...

  hasWon
    S         O
    Einstein  Nobel
    Ronaldo   FIFA
    …         …

  bornIn
    S   O
    …   …

  Semantic-Web engines (Sesame, Jena, etc.) did not provide scalable query performance
Scalable Semantic Web: RDF-3X Engine [T. Neumann et al.: VLDB'08]
  • RISC-style, tuning-free system architecture
  • map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples:
    SPO, SOP, PSO, POS, OSP, OPS, SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*
    very high compression
  • efficient merge joins with order-preservation
  • join-order optimization by dynamic programming over subplan result-order
  • statistical synopses for accurate result-size estimation
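The three core ideas on this slide (dictionary encoding, one sorted index per ordering of S, P, O, and merge joins over sorted id lists) can be sketched on a toy scale. This is an illustration under simplifying assumptions, not RDF-3X's implementation: the indexes are plain sorted Python lists without compression or aggregated indexes, prefix lookups scan linearly, and all function names are hypothetical.

```python
from itertools import permutations

triples = [
    ("Einstein", "hasWon", "Nobel"),
    ("Einstein", "bornIn", "Ulm"),
    ("Ronaldo", "hasWon", "FIFA"),
    ("Ronaldo", "bornIn", "Rio"),
]

# Dictionary: map every literal to a small integer id.
dictionary = {}


def encode(term):
    return dictionary.setdefault(term, len(dictionary))


encoded = [tuple(encode(t) for t in triple) for triple in triples]

# Exhaustive indexing: one sorted copy of the triples per ordering of S, P, O.
indexes = {}
for order in permutations("SPO"):
    positions = ["SPO".index(c) for c in order]
    indexes["".join(order)] = sorted(tuple(t[p] for p in positions) for t in encoded)


def scan(index_name, prefix):
    """Return the suffixes of all index entries starting with prefix (linear scan for brevity)."""
    n = len(prefix)
    return [entry[n:] for entry in indexes[index_name] if entry[:n] == prefix]


def subjects(predicate, obj):
    """Sorted subject ids for a pattern '?s predicate obj', read from the POS index."""
    return [s for (s,) in scan("POS", (dictionary[predicate], dictionary[obj]))]


def merge_join(left, right):
    """Intersect two sorted id lists in a single pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


joined = merge_join(subjects("hasWon", "Nobel"), subjects("bornIn", "Ulm"))
reverse = {i: term for term, i in dictionary.items()}
names = [reverse[i] for i in joined]
```

Because every ordering is available, each triple pattern can be answered from an index whose output is already sorted on the join variable, which is what makes the merge join applicable without an extra sort.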
Performance Experiments
  Librarything social-tagging excerpt (36 Mio. triples)
  Benchmark queries such as:
    Select ?t Where {
      ?b hasTitle ?t .
      ?u romance ?b .
      ?u love ?b .
      ?u mystery ?b .
      ?u suspense ?b .
      ?u crimeNovel ?c .
      ?u hasFriend ?f . ... }
  books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...
  RDF-3X on PC (2 GHz, 2 GB RAM, 30 MB/s disk) compared to:
  • column-store (for property tables) using MonetDB
  • triples store (with selected indexes) using PostgreSQL
  (chart: execution time [s])
  similar results on YAGO, Uniprot (845 Mio. triples) and Billion-Triples
Outline
  Motivation
  Information Extraction & Knowledge Harvesting (YAGO)
  Consistent Growth of Knowledge (SOFIE)
  Ranking for Search over Entity-Relation Graphs (NAGA)
  Efficient Query Processing (RDF-3X)
  • Conclusion
Take-Home Message
  • turn Wikipedia, Web, news, literature, ... into comprehensive knowledge base of facts → YAGO core
  • reconcile rule-based & pattern-based info extraction (Semantic-Web & Statistical-Web) with consistency constraints → YAGO growth with SOFIE
  • enable search & ranking over entity-relation graphs → NAGA, RDF-3X
  Information is not Knowledge is not Wisdom is not Truth is not Beauty is not Music is the best. (Frank Zappa, 1940 – 1993)
Technical Challenges
  • Handling Time
    • extracting temporal attributes
    • reasoning on validity times of facts
    • life-cycle management of KB
  • Scalable Performance
    • high-quality dynamic IE at the rate of news/blogs/Wikipedia updates
    • "Marital Knowledge" benchmark
  • Query Language and Ranking
    • querying expressive but simple (Sparql-FT?)
    • LM-based ranking vs. PR/HITS-style vs. learned scoring from user behavior
    • efficient top-k queries on ER graphs
  ... and more
Thank You!