Harvesting Knowledge from Web Data and Text. CIKM 2010 Tutorial (1/2 Day). Hady W. Lauw (1), Ralf Schenkel (2), Fabian Suchanek (3), Martin Theobald (4), and Gerhard Weikum (4). (1) Institute for Infocomm Research, Singapore; (2) Saarland University, Saarbruecken; (3) INRIA Saclay, Paris; (4) Max Planck Institute for Informatics, Saarbruecken
All slides for download: http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Outline • Part I – What and Why – Available Knowledge Bases • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Motivation (speech bubbles): "Elvis, we need you!" / "I can hear you!" Elvis Presley, 1935–1977. Will there ever be someone like him again?
Motivation Another Elvis Presley: The Early Years. "Elvis spent more weeks at the top of the charts than any other artist." www.fiftiesweb.com/elvis.htm
Motivation Another singer called Elvis, young. Personal relationships of Elvis Presley – Wikipedia: "... when Elvis was a young teen ... another girl whom the singer's mother hoped Presley would ... The writer called Elvis 'a hillbilly cat'" en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley
Motivation Dear Mr. Page, you don’t understand me. I just... Elvis Presley - Official page for Elvis Presley. "Welcome to the Official Elvis Presley Web Site, home of the undisputed King of Rock 'n' Roll and his beloved Graceland..." www.elvis.com/
Motivation Other (more serious?) queries: • when is Madonna’s next concert in Europe? • which protein inhibits atherosclerosis? • who was king of England when Napoleon I was emperor of France? King George III • has any scientist ever won the Nobel Prize in Literature? Bertrand Russell • which countries have an HDI comparable to Sweden’s? • which scientific papers have led to patents? • is there another famous singer named “Elvis”?
This Tutorial Mr. Page, let’s try this again. Is there another singer named Elvis? In this tutorial, we will explain • how knowledge is organized • what knowledge bases exist already • how we can construct knowledge bases • how we can query knowledge bases. Query graph: ?x –type→ singer, ?x –label→ “Elvis”
Ontologies Diagram (four layers): Classes: entity at the top; person and location are subclassOf entity; scientist and singer are subclassOf person; city is subclassOf location. Relations: type, bornIn. Instances: an entity of type singer, bornIn Tupelo (of type city). Labels/words: the instance carries labels “Elvis” and “The King”. The same entity with two labels: synonymy. The same label for two entities: homonymy.
Classes Diagram: entity ←subclassOf– person; person ←subclassOf– scientist, singer; singer ←type– ?. Transitivity: type(x, y) ∧ subclassOf(y, z) ⇒ type(x, z)
Relations Diagram: entity ←subclassOf– person, location; person ←subclassOf– singer; location ←subclassOf– city; bornIn has domain person and range city; singer ←type– ? –bornIn→ Tupelo –type→ city. Domain and range constraints: domain(r, c) ∧ r(x, y) ⇒ type(x, c); range(r, c) ∧ r(x, y) ⇒ type(y, c). Looks like higher order, but is not. Consider introducing a predicate fact(r, x, y).
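Both rule forms are Horn clauses over the graph, so a naive fixpoint loop suffices to apply them. The following is a minimal sketch in Python; the toy KB (entity and class names) is hypothetical and only illustrates the two rules above:

```python
# Toy KB: a subclass chain, some facts, and domain/range declarations.
subclass_of = {"singer": "person", "person": "entity", "city": "location"}
facts = {("type", "Elvis", "singer"), ("bornIn", "Elvis", "Tupelo")}
domain_of = {"bornIn": "person"}   # domain(bornIn, person)
range_of = {"bornIn": "city"}      # range(bornIn, city)

def saturate(facts):
    """Apply the three rules until no new fact is derived (naive fixpoint)."""
    changed = True
    while changed:
        changed = False
        new = set()
        for (r, x, y) in facts:
            if r == "type" and y in subclass_of:
                new.add(("type", x, subclass_of[y]))   # transitivity
            if r in domain_of:
                new.add(("type", x, domain_of[r]))     # domain constraint
            if r in range_of:
                new.add(("type", y, range_of[r]))      # range constraint
        if not new <= facts:
            facts |= new
            changed = True
    return facts

print(sorted(saturate(set(facts))))
# derives e.g. type(Elvis, person), type(Elvis, entity), type(Tupelo, city)
```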
Event Entities An event entity is an artificial entity introduced to represent an n-ary relationship. Diagram: event entity ElvisGrammy with edges winner→ Elvis, prize→ Grammy Award, year→ 1967 (relation won). Event entities allow representing arbitrary relational data as binary graphs. Table: Row | Winner | Prize | Year; Row 42 | Elvis Presley | Grammy Award | 1967; Row 43 | ...
Reification is the method of creating an entity that represents a fact. Diagram: fact #42 = won(Elvis, Grammy Award), annotated with #42 –year→ 1967 and #42 –source→ Wikipedia; fact #43 = bornIn(Elvis, Tupelo). There are different ways to reify a fact; this is the one used in this talk.
RDF The Resource Description Framework (RDF) is a W3C standard that provides a standard vocabulary to model ontologies. An RDF ontology can be seen as a directed labeled multi-graph where • the nodes are entities • the edges are labeled with relations. Edges (facts) are commonly written • as triples <Elvis, bornIn, Tupelo> • as literals bornIn(Elvis, Tupelo). Diagram: resource ←subclassOf– location ←subclassOf– city ←type– Tupelo ←bornIn– Elvis. [W3C recommendation: RDF, 2004]
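As a concrete illustration of the triple view, here is a minimal sketch using the Python rdflib library (our choice of tooling, not prescribed by the slides); the example.org namespace and the facts are hypothetical:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Elvis, EX.bornIn, EX.Tupelo))        # <Elvis, bornIn, Tupelo>
g.add((EX.Tupelo, EX.type, EX.city))
g.add((EX.city, EX.subclassOf, EX.location))

# Every edge of the multi-graph is one subject-predicate-object triple.
for s, p, o in g:
    print(s, p, o)
print(g.serialize(format="turtle"))
```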
Outline • Part I – What and Why ✔ – Available Knowledge Bases • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Cyc What if we could make all common sense knowledge computer-processable? Cyc project, Douglas Lenat • started in 1984 • driven by a staff of 20 • goal: formalize knowledge manually [Lenat, Comm. ACM, 1995]
Cyc: Language CycL is the formal language that Cyc uses to represent knowledge. (Semantics based on first-order logic, syntax based on LISP.) (#$forall ?A (#$implies (#$isa ?A #$Animal) (#$thereExists ?M (#$mother ?A ?M)))) (#$arity #$GovernmentFn 1) (#$arg1Isa #$GovernmentFn #$GeopoliticalEntity) (#$resultIsa #$GovernmentFn #$RegionalGovernment) (#$governs (#$GovernmentFn #$Canada) #$Canada) + a logical reasoner. http://cyc.com/cycdoc/ref/cycl-syntax.html
Cyc: Knowledge #$Love: Strong affection for another agent arising out of kinship or personal ties. Love may be felt towards things, too: warm attachment, enthusiasm, or devotion. #$Love is a collection, as further explained under #$Happiness. Specialized forms of #$Love are #$Love-Romantic, platonic love, maternal love, infatuation, agape, etc. guid: bd589433-9c29-11b1-9dad-c379636f7270; direct instance of: #$FeelingType; direct specialization of: #$Affection; direct generalization of: #$Love-Romantic. http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love Facts and axioms about: transportation, ecology, everyday living, chemistry, healthcare, animals, law, computer science... “If a computer network implements IEEE 802.11 Wireless LAN Protocol and some computer is a node in that computer network, then that computer is vulnerable to decryption.” http://cyc.com/cyc/technology/whatiscyc_dir/maptest
Cyc: Summary
           | Cyc                             | SUMO
License    | proprietary, free for research  | GNU GPL
Entities   | 500k                            | 20k
Assertions | 5M                              | 70k
Relations  | 15k                             |
Tools      | reasoner, NL understanding tool | reasoner
URL        | http://cyc.com                  | http://ontologyportal.org
References | [Lenat, Comm. ACM 1995]         | [Niles & Pease, FOIS 2001]
http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc, http://ontologyportal.org
SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit, driven by Adam Pease of Articulate Software.
WordNet What if we could make the English language computer-processable? George Miller • started in 1985 • Cognitive Science Laboratory, Princeton University • written by lexicographers • goal: support automatic text analysis and AI applications [Miller, CACM 1995]
WordNet: Lexical Database Words map to senses. Synonymous words share a sense: “camera” and “photographic camera” both denote sense 1. Polysemous words have several senses: “camera” also denotes sense 2, television camera.
WordNet
WordNet: Semantic Relations Examples: Hypernymy: Toaster is-a Kitchen Appliance. Meronymy: Optical Lens is part of Camera. Is-value-of: Slow and Fast are values of Speed.
WordNet: Semantic Relations
Relation (POS)            | Meaning             | Examples
Synonymy (N, V, Adj, Adv) | same sense          | (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv)       | opposite            | (fast, slow), (buy, sell)
Hypernymy (N)             | is-a                | (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N)              | part                | (camera, optical lens), (camera, view finder)
Troponymy (V)             | manner              | (buy, subscribe), (sell, retail)
Entailment (V)            | X must mean doing Y | (buy, pay), (sell, give)
WordNet: Hierarchy Hypernymy (is-a) relations form a hierarchy, e.g.: instrumentation > equipment > photographic equipment; instrumentation > device > lamp > flash.
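The hypernym hierarchy can be walked programmatically; a small sketch with NLTK's WordNet interface (assuming nltk and its wordnet corpus are installed) follows:

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time corpus download
from nltk.corpus import wordnet as wn

node = wn.synsets("camera")[0]         # first sense of "camera"
while node.hypernyms():                # follow is-a links toward the root
    node = node.hypernyms()[0]
    print(node.name())
# prints a chain of increasingly general synsets, ending in entity.n.01
```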
WordNet: Size
Type                       | Number
#words                     | 155k
#senses                    | 117k
#word-sense pairs          | 207k
%words that are polysemous | 17%
License: proprietary, free for research. http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html Downloadable at http://wordnet.princeton.edu
Wikipedia If a small number of people can create a knowledge base, how about a LARGE number of people? Jimmy Wales • started in 2001 • driven by Wikimedia Foundation, and a large number of volunteers • goal: build world’s largest encyclopedia
Wikipedia: Entities and Attributes (screenshots: article pages provide the entities, infoboxes their attributes)
Wikipedia: Synonymy and Polysemy Redirection (synonyms) Disambiguation (polysemy)
Wikipedia: Classes/Categories Class hierarchy different from WordNet
Wikipedia: Others Inter-lingual links; navigation/topic box
Wikipedia: Numbers English: • 1B words • 2.8M articles • 152K contributors. All (250 languages): • 1.74B words • 9.25M articles • 283K contributors. vs. Britannica: • 25× as many words • ½ the average article length. License: Creative Commons Attribution-ShareAlike (CC-BY-SA). (Growth chart, 2001–2008.) Downloadable at http://download.wikimedia.org/
Automatically Constructed Knowledge Bases • Manual approaches (Cyc, WordNet, Wikipedia) – produce high quality knowledge bases – labor-intensive and limited in scope. Can we construct the knowledge bases automatically? YAGO, …, etc.
YAGO Can we exploit Wikipedia and WordNet to build an ontology? YAGO • started as a PhD thesis in 2007 • now a major project at the Max Planck Institute for Informatics in Germany • goal: extract an ontology from Wikipedia with high accuracy and consistency [Suchanek et al., WWW 2007]
YAGO: Construction Diagram: WordNet supplies the taxonomy (Person ←subclassOf– Singer); the Wikipedia category “Rock Singer” yields Elvis Presley –type→ Rock Singer; the infobox (“Born: 1935 ...”) yields Elvis Presley –born→ 1935. Steps: exploit infoboxes, exploit conceptual categories, add WordNet. (The slide’s article body is filler text: “do not read this, better listen to the talk”.)
YAGO: Consistency Checks Diagram: candidate facts (type Rock Singer vs. type Physics, born 1935) are checked against the taxonomy (Person ←subclassOf– Singer ←subclassOf– Guitarist). • Check uniqueness of entities and functional arguments • Check domains and ranges of relations • Check type coherence
YAGO: Relations
About People: actedIn, bornIn/onDate, diedIn/onDate, hasChild, hasSpouse, familyName, graduatedFrom, ...
About Locations: establishedOnDate, established from/until, hasCapital, isCalled, hasPopulation, locatedIn, hasCurrency, hasInflation, hasPolitician, ...
About Other Things: happenedIn, created/onDate, foundIn, discovered, produced, hasProductionLanguage, hasISBN, hasPredecessor, ...
ca. 100 relations with range and domain
YAGO: Numbers
          | YAGO                                                 | YAGO+GeoNames
Entities  | 2.6M (0.5M organizations, 0.8M people, 0.5M classes) | 10M
Facts     | 30M                                                  | 240M
Relations | 86                                                   | 92
Precision | 95%
License: Creative Commons Attribution-NonCommercial (CC-BY-NC). Downloadable at http://mpii.de/yago, incl. converters for RDF, XML, databases
DBpedia Can we harvest facts more exhaustively with community effort? • community effort started in 2007 • driven by Free University Berlin, University of Leipzig, OpenLink • goal: "extract structured information from Wikipedia and to make this information available on the Web" [Bizer et al., Journal of Web Semantics 2009]
DBpedia: Ontology In YAGO, the taxonomy is based on WordNet classes. DBpedia: • places entities extracted from Wikipedia into its own ontology • hand-crafted: 259 classes, 6 levels, 1200 properties • emphasizes recall • only half of the extracted entities are currently placed in its own ontology • alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)
DBpedia: Mapping Rules DBpedia mapping rules: • map Wikipedia infoboxes and tables to its ontology • target datatypes (normalize units, ignore deviant values). Community effort: • hand-craft mapping rules • expand ontology. <http://en.wikipedia.org/wiki/Elvis_Presley> {{Infobox musical artist |Name = Elvis Presley |Background = solo_singer |Birth_name = Elvis Aaron Presley}} becomes <http://dbpedia.org/page/Elvis_Presley> foaf:name “Elvis Presley”; background “solo_singer”; foaf:givenName “Elvis Aaron Presley”. Note that the values do not change.
DBpedia: Numbers
Type          | Number
Facts         | English: 257M (YAGO: 240M); all languages: 1B
Entities      | 3.4M overall (YAGO: 10M); 1.5M in the DBpedia ontology
People        | 312k
Locations     | 413k
Organizations | 140k
License: Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0), plus • 5.5M links to external Web pages • 1.5M links to images • 5M links to other RDF data sets. Downloadable at http://dbpedia.org
Freebase What if we could harvest both automatic extraction and user contribution? • started in 2000 • driven by Metaweb, part of Google since Jul 2010 • goals: • “an open shared database of the world's knowledge” • “a massive, collaboratively-edited database of cross-linked data”
Freebase Like DBpedia and YAGO, Freebase imports data from Wikipedia. Differently, it: • also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz) • includes individually contributed data • lets users collaboratively edit its data (without having to edit Wikipedia).
Freebase: User Contribution Edit entities: • create new entities • assign a new type/class to an entity • add/change attributes • connect to other entities • upload/edit images. Review: • flag vandalism • flag entities to be merged/deleted • vote on flagged content (3 unanimous votes, or an expert acts as tie-breaker). Edit schema: • define a new class, specifying its attributes • a class definition can only be changed by its creator/admin • a class is not part of the commons until peer-reviewed & promoted by staff/admin. Data Game: • find aliases in Wikipedia redirects • extract dates of events from Wikipedia articles • use the Yahoo image search API to find candidate images.
Freebase: Community Experts Admins • tie breaker in reviews • split entities • “rewind” changes • create new classes and attributes • respond to community suggestions New experts inducted by current experts. Promoted by staff or other admins. Members • contribute (edit, review, vote) Anyone can be a member.
Freebase: Numbers
Type       | Number
Facts      | 41M
Entities   | 13M (YAGO: 10M)
People     | 2M
Locations  | 946k
Businesses | 567k
Film       | 397k
License: Creative Commons Attribution (CC-BY). Downloadable at http://download.freebase.com
Question Answering Systems Objective: answer user queries from an underlying knowledge base. • data from Wikipedia and user edits • natural language translation of queries • 9M entities, 300M facts • computes answers from an internal knowledge base of curated, structured data • stores not just facts, but also algorithms and models
Application: Semantic Similarity • Task: determine similarity between two words – topological distance of two words in the graph – taxonomic distance: hierarchical is-a relations • Example application: correct real-word spelling errors, e.g., “Tofu is made from soy jeans.”: soy and bean are close in the taxonomy (both legumes under physical entity), while jean/trouser sit under garment, far from soy, so “jeans” should be “beans”. [Hirst et al., Natural Language Engineering 2001]
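Taxonomic distance of this kind can be approximated with WordNet path similarity; a hedched sketch using NLTK, where the first listed senses are chosen purely for illustration:

```python
from nltk.corpus import wordnet as wn  # assumes the wordnet corpus is installed

soy = wn.synsets("soy")[0]     # first listed senses, for illustration only
bean = wn.synsets("bean")[0]
jean = wn.synsets("jean")[0]

# path_similarity = 1 / (1 + length of shortest is-a path between the senses)
print(soy.path_similarity(bean))   # expected to be comparatively high
print(soy.path_similarity(jean))   # expected to be low (legume vs. garment)
```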
Application: Sentiment Orientation • Task: determine an adjective’s polarity (positive or negative) – words of the same polarity are connected by synonymy relations – words of opposite polarity by antonymy relations • Example application: overall sentiment of customer reviews. Diagram: suitable, appropriate, proper, right link to GOOD; spoiled, defective, forged, risky link to BAD. [Hu et al., KDD 2004]
Application: Annotation of Web Data • Task: given a data source in the form of a Web table – Annotate column with entity type – Annotate pair of columns with relationship type – Annotate table cell with entity ID [Limaye et al. , VLDB 2010]
Application: Map Annotation Idea: • Determine geographical entities in the vicinity (by GPS coordinates) • Show information about these entities (from DBpedia). Possible applications: • Map search on the Internet • Augmented-reality applications [Becker et al., Linking Open Data Workshop 2008]
Application: Faceted Search (DBpedia Browser) • Attributes and values are suggested based on frequency (?) • Search is “full text search within results” • Constraints are listed for possible deletion • Suggestions are based on the current consideration set
Summary • Part I covers what knowledge bases are – Knowledge representation model (RDF) – Manual knowledge bases: • WordNet: expert-driven, English words • Wikipedia: community-driven, entities/attributes – Automatically extracted knowledge bases: • YAGO: Wikipedia + WordNet, automated, high precision • DBpedia: Wikipedia + community-crafted mapping rules, high recall • Freebase: Wikipedia + other databases + user edits • Part II will cover how to extract information included in the knowledge bases
References for Part I
• C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154–165, 2009.
• C. Becker, C. Bizer: DBpedia Mobile: A Location-Enabled Linked Data Browser. Linking Open Data Workshop, 2008.
• G. Hirst and A. Budanitsky: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11(1): 87–111, 2001.
• M. Hu and B. Liu: Mining and Summarizing Customer Reviews. KDD, 2004.
• J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke: Using WordNet to Measure Semantic Orientations of Adjectives. LREC, 2004.
• D. Lenat: CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
• G. Limaye, S. Sarawagi, and S. Chakrabarti: Annotating and Searching Web Tables Using Entities, Types and Relationships. VLDB, 2010.
• G. A. Miller: WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39-41, 1995.
• F. M. Suchanek, G. Kasneci and G. Weikum: Yago - A Core of Semantic Knowledge. WWW, 2007.
• I. Niles and A. Pease: Towards a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Ogunquit, Maine, October 17-19, 2001.
• World Wide Web Consortium: RDF Primer. W3C Recommendation, 2004. http://www.w3.org/TR/rdf-primer/
Outline • Part I – What and Why ✔ – Available Knowledge Bases ✔ • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Other topics
Entities & Classes Which entity types (classes, unary predicates) are there? scientists, doctoral students, computer scientists, …, female humans, married humans, … Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)? subclassOf(computer scientists, scientists), subclassOf(scientists, humans), … Which individual entities belong to which classes? instanceOf(Surajit Chaudhuri, computer scientists), instanceOf(Barbara Liskov, computer scientists), instanceOf(Barbara Liskov, female humans), … Which names denote which entities? means(“Lady Di”, Diana Spencer), means(“Diana Frances Mountbatten-Windsor”, Diana Spencer), …, means(“Madonna”, Madonna Louise Ciccone), means(“Madonna”, Madonna (painting by Edvard Munch)), …
Binary Relations Which instances (pairs of individual entities) are there for given binary relations with specific type signatures? hasAdvisor(JimGray, MikeHarrison), hasAdvisor(HectorGarciaMolina, GioWiederhold), hasAdvisor(SusanDavidson, HectorGarciaMolina), graduatedAt(JimGray, Berkeley), graduatedAt(HectorGarciaMolina, Stanford), hasWonPrize(JimGray, TuringAward), bornOn(JohnLennon, 9-Oct-1940), diedOn(JohnLennon, 8-Dec-1980), marriedTo(JohnLennon, YokoOno). Which additional & interesting relation types are there between given classes of entities? competedWith(x, y), nominatedForPrize(x, y), …, divorcedFrom(x, y), affairWith(x, y), …, assassinated(x, y), rescued(x, y), admired(x, y), …
Higher-arity Relations & Reasoning • Time, location & provenance annotations • Knowledge representation – how do we model & store these? • Consistency reasoning – how do we filter out inconsistent facts that the extractor produced? Facts (RDF triples): 1: (JimGray, hasAdvisor, MikeHarrison); 2: (SurajitChaudhuri, hasAdvisor, JeffUllman); 3: (Madonna, marriedTo, GuyRitchie); 4: (NicolasSarkozy, marriedTo, CarlaBruni); 5: (ManchesterU, wonCup, ChampionsLeague). Facts about facts: 6: (1, inYear, 1968); 7: (2, inYear, 2006); 8: (3, validFrom, 22-Dec-2000); 9: (3, validUntil, Nov-2008); 10: (4, validFrom, 2-Feb-2008); 11: (2, source, SigmodRecord); 12: (5, inYear, 1999); 13: (5, location, CampNou); 14: (5, source, Wikipedia)
Outline • Part I – What and Why ✔ – Available Knowledge Bases ✔ • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Outline • Part II – Extracting Knowledge • Pattern-based Extraction • Consistency Reasoning • Higher-arity Relations: Space & Time
Framework: Information Extraction (IE) “Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …” Source-centric IE (one source): 1) recall! 2) precision; extracts e.g. instanceOf(Surajit, scientist), inField(Surajit, computer science), hasAdvisor(Surajit, Jeff Ullman), almaMater(Surajit, Stanford U), workedFor(Surajit, HP), friendOf(Surajit, Umesh Dayal), … Yield-centric harvesting (many sources): 1) precision! 2) recall; near-human quality! hasAdvisor: Student | Advisor: Surajit Chaudhuri | Jeffrey Ullman; Jim Gray | Mike Harrison; … almaMater: Student | University: Surajit Chaudhuri | Stanford U; Jim Gray | UC Berkeley; …
Framework: Knowledge Representation • RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples / binary relations; highly structured, but no (prescriptive) schema; first-order logical reasoning over binary predicates (this tutorial!) • Frames, F-Logic, description logics: OWL/DL/lite • Also: higher-order logics, epistemic logics. Facts (RDF triples): 1: (JimGray, hasAdvisor, MikeHarrison); 2: (SurajitChaudhuri, hasAdvisor, JeffUllman); 3: (Madonna, marriedTo, GuyRitchie); 4: (NicolasSarkozy, marriedTo, CarlaBruni). Reification: facts about facts: 5: (1, inYear, 1968); 6: (2, inYear, 2006); 7: (3, validFrom, 22-Dec-2000); 8: (3, validUntil, Nov-2008); 9: (4, validFrom, 2-Feb-2008); 10: (2, source, SigmodRecord). Temporal, spatial & provenance annotations can refer to reified facts via fact identifiers (approx. equivalent to higher-arity RDF: Sub Prop Obj Time Location Source).
Picking Low-Hanging Fruit (First)
Deterministic Pattern Matching [Kushmerick 97; Califf & Mooney 99; Gottlob 01, …]
Wrapper Induction [Gottlob et al: VLDB'01, PODS'04, …] • Wrapper induction: • Hierarchical document structure, XHTML, XML • Pattern learning for restricted regular languages (ELog, combining concepts of XPath & FOL) • Visual interfaces • See e.g. http://www.lixto.com/, http://w4f.sourceforge.net/
Tapping on Web Tables [Cafarella et al: PVLDB'08; Sarawagi et al: PVLDB'09] Problem: discover interesting relations, e.g. wonAward: Person × Award, nominatedForAward: Person × Award, …, from many table headers and co-occurring cells.
Relational Fact Extraction From Plain Text • Hearst patterns [Hearst: COLING'92] – POS-enhanced regular expression matching in natural-language text: NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn; NP0 {,} {NP1, NP2, … NPn-1} {,} or other NPn; … “The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.” ⇒ isA(“Bambara ndang”, “bow lute”) • Noun classification from predicate-argument structures [Hindle: ACL'90] – Clustering of nouns by similar verbal phrases – Similarity based on co-occurrence frequencies (mutual information):
      | beer | wine
drink | 9.34 | 10.20
sell  | 4.21 | 3.75
have  | 0.84 | 1.38
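To make the first Hearst pattern concrete, here is a toy Python sketch; real implementations match over POS-tagged text, while this simplification restricts the class to a single noun, treats instances as plain word sequences, and uses an invented example sentence:

```python
import re

# one Hearst pattern: "<class> such as <inst1>, <inst2> and <instN>"
PATTERN = re.compile(
    r"(?P<cls>\w+),?\s+such as\s+(?P<insts>[\w\s]+(?:,\s*[\w\s]+)*)[.;]"
)

def hearst_isa(text):
    facts = []
    for m in PATTERN.finditer(text):
        cls = m.group("cls")
        for inst in re.split(r",|\s+and\s+|\s+or\s+", m.group("insts")):
            if inst.strip():
                facts.append(("isA", inst.strip(), cls))
    return facts

print(hearst_isa("He admires musicians such as Elvis Presley, "
                 "Johnny Cash and Carl Perkins."))
# [('isA', 'Elvis Presley', 'musicians'),
#  ('isA', 'Johnny Cash', 'musicians'),
#  ('isA', 'Carl Perkins', 'musicians')]
```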
DIPRE [Brin: WebDB'98] • DIPRE: “Dual Iterative Pattern Relation Extraction” – (Almost) unsupervised, iterative gathering of facts and patterns – Positive & negative examples as seeds for the target relation, e.g. +(Hillary, Bill), +(Carla, Nicolas), –(Larry, Google) – Specificity threshold for new patterns based on occurrence frequency. Example iteration: seeds (Hillary, Bill), (Carla, Nicolas) yield the patterns “X and her husband Y”, “X and Y on their honeymoon”; these yield (Angelina, Brad), (Victoria, David), …; further patterns “X and Y and their children”, “X has been dating with Y”, “X loves Y” also yield noise such as (Larry, Google) …
DIPRE/Snowball/QXtract [Brin: WebDB'98; Agichtein, Gravano: SIGMOD'01+'03] • DIPRE: “Dual Iterative Pattern Relation Extraction” – (Almost) unsupervised, iterative gathering of facts and patterns – Positive & negative examples as seeds for the target relation, e.g. +(Hillary, Bill), +(Carla, Nicolas), –(Larry, Google) – Specificity threshold for new patterns based on occurrence frequency • Snowball/QXtract [Agichtein, Gravano: DL'00, SIGMOD'01+'03] – Refined patterns and statistical measures – >80% recall at >85% precision over a large news corpus – The QXtract demo additionally allowed user feedback in the iteration loop
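A toy rendering of the bootstrapping loop these slides describe; the corpus, the seeds, and the crude length-based specificity filter are hypothetical simplifications of the real systems:

```python
import re

corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Berlin.",
    "Victoria and her husband David smiled.",
    "Larry and Google on their honeymoon.",   # noise that should stay out
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def learn_patterns(pairs, texts):
    """Generalize the text between a known pair into a pattern string."""
    patterns = set()
    for x, y in pairs:
        for t in texts:
            m = re.search(re.escape(x) + r"(.{1,30}?)" + re.escape(y), t)
            if m:
                patterns.add(m.group(1))       # e.g. " and her husband "
    return {p for p in patterns if len(p.strip()) > 4}  # crude specificity

def apply_patterns(patterns, texts):
    pairs = set()
    for p in patterns:
        for t in texts:
            for m in re.finditer(r"(\w+)" + re.escape(p) + r"(\w+)", t):
                pairs.add((m.group(1), m.group(2)))
    return pairs

for _ in range(2):                             # two bootstrapping rounds
    seeds |= apply_patterns(learn_patterns(seeds, corpus), corpus)
print(seeds)   # now also contains ('Victoria', 'David')
```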
Help from NLP: Dependency Parsing! • Analyze the lexico-syntactic structure of sentences – Part-of-speech (POS) tagging & dependency parsing – Prefer shorter dependency paths for fact candidates. “Carla has been seen dating with Ben.” (NNP VBZ VBN VBG IN NNP) ⇒ dating(Carla, Ben). Software tools: CMU Link Parser: http://www.link.cs.cmu.edu/link/; Stanford Lex Parser: http://nlp.stanford.edu/software/lex-parser.shtml; OpenNLP Tools: http://opennlp.sourceforge.net/; ANNIE Open-Source Information Extraction: http://www.aktors.org/technologies/annie/; LingPipe: http://alias-i.com/lingpipe/ (commercial license)
Open-Domain Gathering of Facts (Open IE) [Etzioni, Cafarella et al: WWW'04, IJCAI'07; Weld, Hoffmann, Wu: SIGMOD-Rec'08] Analyze verbal phrases between entities for new relation types • unsupervised bootstrapping with short dependency paths: “Carla has been seen dating with Ben.” “Rumors about Carla indicate there is something between her and Ben.” • self-supervised classifier for (noun, verb-phrase, noun) triples: “… seen dating with …” (Carla, Ben), (Carla, Sofie), …; “… partying with …” (Carla, Ben), (Paris, Heidi), … • build statistics & prune sparse candidates • group/cluster candidates for new relation types and their facts: {datesWith, partiesWith}, {affairWith, flirtsWith}, {romanticRelation}, … But: results are often noisy, clusters are not canonicalized relations; far from near-human quality
Learning More Mappings [Wu & Weld: CIKM'07, WWW'08] Kylin Ontology Generator (KOG): learn a classifier for subclassOf across Wikipedia & WordNet (> 3M entities, > 1M w/ infoboxes, > 500,000 categories) using • YAGO as training data • advanced ML methods (MLNs, SVMs) • rich features from various sources: • category/class name similarity measures • category instances and their infobox templates: template names, attribute names (e.g. knownFor), #articles, instances/classes • Wikipedia edit history: refinement of categories • Hearst patterns: C such as X; X and Y and other C's; … • other search-engine statistics: co-occurrence frequencies
Entity Disambiguation Names vs. entities: “Penn” → Sean Penn or University of Pennsylvania? “U Penn” → University of Pennsylvania; “Penn State” → Pennsylvania State University; “PSU” → Pennsylvania State University, Pennsylvania (US State), or Passenger Service Unit? • ill-defined with zero context • known as record linkage for names in record fields • Wikipedia offers rich candidate mappings: disambiguation pages, redirects, inter-wiki links, anchor texts of href links
Individual Entity Disambiguation Contexts: “… Penn Into the Wild …” → Sean Penn; “… Penn XML Treebank …” → University of Pennsylvania; “… Penn Univ. Park …” → Penn State University. Typical approaches: • name similarity: edit distances, n-gram overlap, … • context similarity: record level • context similarity: words/phrases level • context similarity: text around names, classes & facts around entities. Challenge: efficiency & scalability
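Two of the similarity signals named above can be sketched in a few lines; the names, contexts, entity profiles, and the 50/50 weighting below are hypothetical:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=3):
    """Character n-gram multiset of a lowercased string."""
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def jaccard(a, b):
    """n-gram overlap between two names."""
    a, b = set(ngrams(a)), set(ngrams(b))
    return len(a & b) / len(a | b)

def cosine(ctx1, ctx2):
    """Bag-of-words cosine similarity between two contexts."""
    v1, v2 = Counter(ctx1.lower().split()), Counter(ctx2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (sqrt(sum(c * c for c in v1.values()))
            * sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

mention, context = "Penn", "Penn annotated the XML Treebank corpus"
candidates = {
    "University of Pennsylvania": "university treebank corpus linguistics",
    "Sean Penn": "actor film into the wild",
}
for entity, profile in candidates.items():
    score = 0.5 * jaccard(mention, entity) + 0.5 * cosine(context, profile)
    print(entity, round(score, 3))   # the university should score higher here
```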
Collective Entity Disambiguation [Doan et al: AAAI'05; Singla, Domingos: ICDM'07; Chakrabarti et al: KDD'09, …] • Consider a set of names {n1, n2, …} in the same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, … • Define a joint objective function (e.g. likelihood for a probabilistic model) that rewards coherence of the mappings μ(n1) = x1 ∈ E1, μ(n2) = x2 ∈ E2, … • Solve the optimization problem. Example: “Stuart Russell” → Stuart Russell (DJ) or Stuart Russell (computer scientist)? “Michael Jordan” → Michael Jordan (NBA) or Michael Jordan? Mapping both names coherently favors the two computer scientists together.
Declarative Extraction Frameworks • IBM's SystemT [Krishnamurthy et al: SIGMOD Rec.'08, ICDE'08] – Fully declarative extraction framework – SQL-style operators, cost models, full optimizer support • DBLife/Cimple [DeRose, Doan et al: CIDR'07, VLDB'07] – Online community portal centered around the DB domain (regular crawls of DBLP, conferences, homepages, etc.) • More commercial endeavors: Freebase.com, WolframAlpha.com, Sig.ma, TrueKnowledge.com, Google.com/squared
(DBLife screenshot: an entity profile assembled from Google Images, DBLP, homepages, DBWorld, and Google Scholar)
Probabilistic Extraction Models • Hidden Markov Models (HMMs) [Rabiner: Proc. IEEE'89; Sutton, McCallum: MIT Press'06] – Markov chain (directed graphical model) with “hidden” states Y, observations X, and transition probabilities – Factorizes the joint distribution P(Y, X) – Assumes independence among observations • Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira: ML'01; Sarawagi, Cohen: NIPS'04] – Markov random field (undirected graphical model) – Models the conditional distribution P(Y|X) (less strict independence assumptions). Example: “I went skiing with Fernando Pereira in British Columbia.” • Joint segmentation and disambiguation of input strings onto entities and classes: NER, POS tagging, etc. • Trained, e.g., on bibliographic entries; no manual labeling required
Pattern-Based Harvesting [Hearst 92; Brin 98; Agichtein 00; Etzioni 04; …] Facts & Fact Candidates (Hillary, Bill) (Carla, Nicolas) Patterns X and her husband Y X and Y on their honeymoon (Angelina, Brad) (Victoria, David) (Hillary, Bill) X and Y and their children (Carla, Nicolas) X has been dating with Y (Yoko, John) (Kate, Pete) (Carla, Benjamin) (Larry, Google) (Angelina, Brad) (Victoria, David) X loves Y … • good for recall • noisy, drifting • not robust enough for high precision
Outline • Part II – Extracting Knowledge • Pattern-based Extraction ✔ • Consistency Reasoning • Higher-arity Relations: Space & Time
French Marriage Problem isMarriedTo: person × person vs. isMarriedTo: frenchPolitician × person …
French Marriage Problem Facts in KB: married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) New facts or fact candidates: married (Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) married (Michelle, Barack) married (Yoko, John) married (Kate, Leonardo) married (Carla, Sofie) married (Larry, Google) 1) for recall: pattern-based harvesting 2) for precision: consistency reasoning
Reasoning about Fact Candidates Use consistency constraints to prune false candidates! First-order-logic rules (restricted): spouse(x, y) ∧ diff(y, z) ⇒ ¬spouse(x, z); spouse(x, y) ∧ diff(w, x) ⇒ ¬spouse(w, y); spouse(x, y) ⇒ f(x); spouse(x, y) ⇒ m(y); spouse(x, y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y)). Rules reveal inconsistencies ⇒ find consistent subset(s) of atoms (“possible world(s)”, “the truth”). Ground atoms: spouse(Hillary, Bill), spouse(Carla, Nicolas), spouse(Cecilia, Nicolas), spouse(Carla, Ben), spouse(Carla, Mick), spouse(Carla, Sofie), f(Hillary), f(Carla), f(Cecilia), f(Sofie), m(Bill), m(Nicolas), m(Ben), m(Mick). Rules can be weighted (e.g. by the fraction of ground atoms that satisfy a rule) ⇒ uncertain / probabilistic data ⇒ compute a probability distribution over (a subset of) ground atoms being “true”.
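As a minimal illustration (not SOFIE's or the MLN's actual joint inference, which the following slides cover), even a greedy pruner can enforce the first functionality rule on weighted candidates; the weights below are hypothetical:

```python
# Greedy stand-in for consistency reasoning: keep, per first argument x,
# only the highest-confidence spouse(x, y) candidate, enforcing
# spouse(x,y) AND diff(y,z) => NOT spouse(x,z). The symmetric rule on the
# second argument would be handled analogously. Weights are made up.
candidates = {
    ("Carla", "Nicolas"): 0.9,
    ("Carla", "Ben"): 0.4,
    ("Carla", "Mick"): 0.2,
    ("Cecilia", "Nicolas"): 0.5,
}

def prune_functional(cands):
    best = {}
    for (x, y), w in cands.items():
        if x not in best or w > cands[(x, best[x])]:
            best[x] = y
    return {(x, y): cands[(x, y)] for x, y in best.items()}

print(prune_functional(candidates))
# {('Carla', 'Nicolas'): 0.9, ('Cecilia', 'Nicolas'): 0.5}
```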
Markov Logic Networks (MLNs) [Richardson, Domingos: ML 2006] Map logical constraints & fact candidates into a probabilistic graphical model: Markov Random Field (MRF). FOL rules: s(x, y) ∧ diff(y, z) ⇒ ¬s(x, z); s(x, y) ∧ diff(w, x) ⇒ ¬s(w, y); s(x, y) ⇒ f(x); s(x, y) ⇒ m(y); f(x) ⇒ ¬m(x); m(x) ⇒ ¬f(x). Base facts w/ entities: s(Carla, Nicolas), s(Cecilia, Nicolas), s(Carla, Ben), s(Carla, Sofie), … Grounding: literal → Boolean var; reasoning: literal → binary RV. Grounded clauses, e.g.: ¬s(Ca, Nic) ∨ ¬s(Ce, Nic); ¬s(Ca, Nic) ∨ ¬s(Ca, Ben); ¬s(Ca, Nic) ∨ ¬s(Ca, So); ¬s(Ca, Nic) ∨ m(Nic); ¬s(Ce, Nic) ∨ m(Nic); ¬s(Ca, Ben) ∨ m(Ben); ¬s(Ca, So) ∨ m(So)
Markov Logic Networks (MLNs) [Richardson, Domingos: ML 2006] (cont'd) The grounded literals become binary random variables; RVs are coupled by an MRF edge if they appear in the same clause. MRF assumption: P[Xi | X1..Xn] = P[Xi | MB(Xi)] (Markov blanket); the joint distribution has product form over all cliques. Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
Markov Logic Networks (MLNs) [Richardson, Domingos: ML 2006] (cont'd) Inference yields marginal probabilities, e.g.: s(Ce, Nic): 0.1, s(Ca, Nic): 0.2, s(Ca, So): 0.7, m(Nic): 0.8, s(Ca, Ben): 0.5, m(Ben): 0.6, m(So): 0.7. Consistency reasoning: prune low-confidence facts! StatSnowball [Zhu et al: WWW'09], BioSnowball [Liu et al: KDD'10], EntityCube, MSR Asia: http://entitycube.research.microsoft.com/
Related Alternative Probabilistic Models Constrained Conditional Models [Roth et al. 2007]: log-linear classifiers with a constraint-violation penalty, mapped into Integer Linear Programs. Factor Graphs with Imperative Variable Coordination [McCallum et al. 2008]: RVs share “factors” (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RVs. (Factor graph over the grounded atoms s(Ca, Nic), s(Ce, Nic), m(Nic), s(Ca, Ben), s(Ca, So), m(Ben), m(So).) Software tools: alchemy.cs.washington.edu, code.google.com/p/factorie/, research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route [Suchanek, Sozio, Weikum: WWW'09] Facts in KB: married(Hillary, Bill), married(Carla, Nicolas), married(Angelina, Brad) + new fact candidates: married(Cecilia, Nicolas)?, married(Carla, Benjamin)?, married(Carla, Mick)?, married(Carla, Sofie)?, married(Larry, Google)? Patterns: X and her husband Y; X and Y and their children; X has been dating with Y; X loves Y. Direct approach: • KB facts are true; fact candidates & patterns are hypotheses • grounded constraints become clauses with hypotheses as variables • cast into Weighted MaxSat with weights from pattern statistics • customized approximation algorithm • unifies: fact/candidate consistency, pattern goodness, entity disambiguation. www.mpi-inf.mpg.de/yago-naga/sofie/
SOFIE: Facts & Patterns Consistency [Suchanek, Sozio, Weikum: WWW'09] Constraints to connect facts, fact candidates & patterns: • pattern-fact duality: occurs(p, x, y) ∧ expresses(p, R) ⇒ R(x, y); occurs(p, x, y) ∧ R(x, y) ⇒ expresses(p, R) • name(-in-context)-to-entity mapping: a name maps to one entity: means(n, e1) ⊕ means(n, e2) ⊕ … • functional dependencies: spouse(x, y): x → y, y → x • relation properties: asymmetry, transitivity, acyclicity, … • type constraints, inclusion dependencies: spouse ⊆ Person × Person; capitalOfCountry ⊆ cityOfCountry • domain-specific constraints: bornInYear(x) + 10 years ≤ graduatedInYear(x); hasAdvisor(x, y) ∧ graduatedInYear(x, t) ∧ graduatedInYear(y, s) ⇒ s < t. www.mpi-inf.mpg.de/yago-naga/sofie/
SOFIE: Facts & Patterns Consistency (cont'd) [Suchanek, Sozio, Weikum: WWW'09] The constraints on the previous slide are • grounded into a large propositional Boolean formula in CNF • solved with a MaxSat solver for joint inference (a complete truth assignment to all candidate patterns & facts). www.mpi-inf.mpg.de/yago-naga/sofie/
SOFIE Example Known facts: Spouse(HillaryClinton, BillClinton) [100], Spouse(CarlaBruni, NicolasSarkozy) [40]. Pattern occurrences: occurs(X and her husband Y, Hillary, Bill) [60], occurs(X Y and their children, Hillary, Bill) [20], occurs(X and her husband Y, Victoria, David) [60], occurs(X dating with Y, Rebecca, David) [20], occurs(X dating with Y, Victoria, Tom) [60]. Hypotheses [weight 1]: Spouse(Victoria, David), Spouse(Rebecca, David), Spouse(Victoria, Tom), expresses(X and her husband Y, Spouse), expresses(X Y and their children, Spouse), expresses(X dating with Y, Spouse). Grounded constraints, e.g.: ∀x, y, z: Spouse(x, y) ∧ y ≠ z ⇒ ¬Spouse(x, z), e.g. Spouse(Victoria, David) ⇒ ¬Spouse(Victoria, Tom); ∀x, y, w: Spouse(x, y) ∧ x ≠ w ⇒ ¬Spouse(w, y), e.g. Spouse(Victoria, David) ⇒ ¬Spouse(Rebecca, David); ∀p, x, y: occurs(p, x, y) ∧ expresses(p, R) ⇒ R(x, y), e.g. occurs(dating, Rebecca, David) ∧ expresses(dating, Spouse) ⇒ Spouse(Rebecca, David); ∀p, x, y: occurs(p, x, y) ∧ R(x, y) ⇒ expresses(p, R), e.g. occurs(husband, Victoria, David) ∧ Spouse(Victoria, David) ⇒ expresses(husband, Spouse)
Soft Rules vs. Hard Constraints Enforce FDs (mutual exclusion) as hard constraints: hasAdvisor(x, y) ∧ diff(y, z) ⇒ ¬hasAdvisor(x, z); combining them with weighted constraints is no longer regular MaxSat ⇒ constrained (weighted) MaxSat instead. Generalize to other forms of constraints: Hard constraint: hasAdvisor(x, y) ∧ graduatedInYear(x, t) ∧ graduatedInYear(y, s) ⇒ s < t. Soft constraint: firstPaper(x, p) ∧ firstPaper(y, q) ∧ author(p, x) ∧ author(p, y) ∧ inYear(p) > inYear(q) + 5 years ⇒ hasAdvisor(x, y) [0.6]. Open issue for arbitrary constraints: Datalog-style grounding (deductive & potentially recursive) ⇒ rethink reasoning!
Pattern Harvesting, Revisited [Suchanek et al: KDD'06; Nakashole et al: WebDB'10, WSDM'11] Narrow / nasty / noisy patterns: X and his famous advisor Y; X carried out his doctoral research in math under the supervision of Y; X jointly developed the method with Y. Using narrow & dropping nasty patterns loses recall! Using noisy patterns loses precision & slows down MaxSat. POS-lifted n-gram itemsets as patterns: X { PRP ADJ advisor } Y; X { his doctoral research, under the supervision of } Y; X { PRP doctoral research, IN DET supervision of } Y. Confidence weights, using seeds and counter-seeds: seeds (MosheVardi, CatrielBeeri), (JimGray, MikeHarrison); counter-seeds (MosheVardi, RonFagin), (AlonHalevy, LarryPage); confidence of pattern p ~ #(p with seeds) / #(p with counter-seeds)
Outline • Part II – Extracting Knowledge • Pattern-based Extraction ✔ • Consistency Reasoning ✔ • Higher-arity Relations: Space & Time
Higher-arity Relations: Space & Time • YAGO2 Preview
                           | Just Wikipedia | Incl. Gazetteer Data
#Relations                 | 86             | 92
#Classes                   | 563,374        | 563,997
#Entities                  | 2,639,853      | 9,819,683
#Facts                     | 495,770,281    | 996,329,323
 - basic relations         | 20,937,244     | 61,188,706
 - types & classes         | 8,664,129      | 181,977,830
 - space, time & provenance| 466,168,908    | 753,162,787
Size (CSV format)          | 23.4 GB        | 37 GB
Estimated precision > 95% (for basic relations excl. space, time & provenance). www.mpi-inf.mpg.de/yago-naga/
French Marriage Problem (Revisited) (Timeline: JAN–DEC.) Facts in KB: 1: married(Hillary, Bill); 2: married(Carla, Nicolas), validFrom(2, 2008); 3: married(Angelina, Brad). New fact candidates: 4: married(Cecilia, Nicolas), validFrom(4, 1996), validUntil(4, 2007); 5: married(Carla, Benjamin), validFrom(5, 2010); 6: married(Carla, Mick), validFrom(6, 2006); 7: divorced(Madonna, Guy), validFrom(7, 2008); 8: domPartner(Angelina, Brad)
Challenge: Temporal Knowledge Harvesting For all people in Wikipedia (100,000s), gather all spouses, incl. divorced & widowed, and the corresponding time periods! >95% accuracy, >95% coverage, in one night
Difficult Dating
(Even More Difficult) Implicit Dating explicit dates vs. implicit dates relative to other dates
(Even More Difficult) Implicit Dating vague dates relative dates narrative text relative order
TARSQI: Extracting Time Annotations http://www.timeml.org/site/tarsqi/ [Verhagen et al: ACL'05] Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A prodemocracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking reelection. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. “It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>,” said Mr. Leong, the former chairman of the Hong Kong Bar Association. (Slide annotation: extraction errors!)
13 Relations between Time Intervals [Allen, 1984; Allen & Hayes 1989] A Before B / B After A; A Meets B / B MetBy A; A Overlaps B / B OverlappedBy A; A Starts B / B StartedBy A; A During B / B Contains A; A Finishes B / B FinishedBy A; A Equal B
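A few of the thirteen relations, written as predicates over (begin, end) pairs; the remaining relations follow the same scheme, and the playsFor intervals below are hypothetical:

```python
def before(a, b):   return a[1] < b[0]
def meets(a, b):    return a[1] == b[0]
def overlaps(a, b): return a[0] < b[0] < a[1] < b[1]
def during(a, b):   return b[0] < a[0] and a[1] < b[1]
def starts(a, b):   return a[0] == b[0] and a[1] < b[1]
def finishes(a, b): return a[1] == b[1] and b[0] < a[0]
def equal(a, b):    return a == b

beckham_real = (2003, 2007)   # hypothetical playsFor intervals
ronaldo_real = (2002, 2007)
print(overlaps(ronaldo_real, beckham_real))  # False: they share an end point
print(during(beckham_real, ronaldo_real))    # False: they share an end point
print(finishes(beckham_real, ronaldo_real))  # True
```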
Possible Worlds in Time [Wang, Yahya, Theobald: MUD Workshop '10] Derived fact: teamMates(Beckham, Ronaldo) ← playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2). Base facts (state relations, independent): playsFor(Beckham, Real) and playsFor(Ronaldo, Real), each with per-bin confidences over the timeline '00–'07 (e.g. 0.4, 0.6, 1.0, 0.1, 0.2, 0.4, 0.9, 0.2). Derived fact (state relation, non-independent): per-bin confidences over '03–'07 (e.g. 0.36, 0.16, 0.12, 0.08). Need lineage!
Possible Worlds in Time (cont'd) [Wang, Yahya, Theobald: MUD Workshop '10] • Closed and complete representation model (incl. lineage), cf. the Stanford Trio project [Widom: CIDR'05; Benjelloun et al: VLDB'06] • The interval representation remains linear in the number of bins • Confidence computation per bin is #P-complete • In general requires possible-worlds-based sampling techniques (Gibbs-style sampling, Luby-Karp, etc.)
Open Problems and Challenges in IE (I) • High precision & high recall at affordable cost: robust pattern analysis & reasoning; parallel processing, lazy / lifted inference, … • Types and constraints: soft rules & hard constraints, rich DL, beyond CWA; explore & understand different families of constraints • Declarative, self-optimizing workflows: incorporate pattern & reasoning steps into IE queries/programs • Scale, dynamics, life-cycle: grow & maintain a KB with near-human quality over long periods • Open-domain knowledge harvesting: turn names, phrases & table cells into entities & relations
Open Problems and Challenges in IE (II) • Temporal querying (revived): query language (T-SPARQL?), no schema; confidence weights & ranking • Gathering implicit and relative time annotations: biographies & news, relative orderings; aggregate & reconcile observations • Incomplete and uncertain temporal scopes: incorrect, incomplete, unknown begin/end; vague dating • Consistency reasoning: extended MaxSat, extended Datalog, probabilistic graphical models, etc., for resolving inconsistencies on uncertain facts & uncertain time
Outline • Part II – Extracting Knowledge • Pattern-based Extraction ✔ • Consistency Reasoning ✔ • Higher-arity Relations: Space & Time ✔
References for Part II
• E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik: Snowball: a prototype system for extracting relations from large text collections. SIGMOD, 2001.
• J. Allen: Towards a general theory of action and time. Artif. Intell. 23(2), 1984.
• M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni: Open information extraction from the web. IJCAI, 2007.
• R. Baumgartner, S. Flesca, G. Gottlob: Visual web information extraction with Lixto. VLDB, 2001.
• S. Brin: Extracting patterns and relations from the World Wide Web. WebDB, 1998.
• M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang: WebTables: exploring the power of tables on the web. PVLDB 1(1), 2008.
• M. E. Califf, R. J. Mooney: Relational learning of pattern-match rules for information extraction. AAAI, 1999.
• P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan: DBLife: A community information management platform for the database research community. CIDR, 2007.
• A. Doan, L. Gravano, R. Ramakrishnan, S. Vaithyanathan (Eds.): Special issue on information extraction. SIGMOD Record 37(4), 2008.
• O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates: Web-scale information extraction in KnowItAll. WWW, 2004.
• G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca: The Lixto data extraction project - back and forth between theory and practice. PODS, 2004.
• R. Gupta, S. Sarawagi: Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB 2(1), 2009.
• M. A. Hearst: Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
• D. Hindle: Noun classification from predicate-argument structures. ACL, 1990.
• R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, H. Zhu: SystemT: a system for declarative information extraction. SIGMOD Record 37(4), 2008.
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: Collective Annotation of Wikipedia Entities in Web Text. KDD, 2009.
• N. Kushmerick: Wrapper induction: efficiency and expressiveness. Artif. Intell. 118(1-2), 2000.
• J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ML, 2001.
• X. Liu, Z. Nie, N. Yu, J.-R. Wen: BioSnowball: automated population of Wikis. KDD, 2010.
• A. McCallum, K. Schultz, S. Singh: FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. NIPS, 2009.
• N. Nakashole, M. Theobald, G. Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB, 2010.
• L. R. Rabiner: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
• M. Richardson, P. Domingos: Markov Logic Networks. ML, 2006.
• D. Roth, W. Yih: Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press, 2007.
• S. Sarawagi: Information extraction. Foundations and Trends in Databases 1(3), 2008.
• S. Sarawagi, W. W. Cohen: Semi-Markov conditional random fields for information extraction. NIPS, 2004.
• W. Shen, X. Li, A. Doan: Constraint-Based Entity Matching. AAAI, 2005.
• P. Singla, P. Domingos: Entity resolution with Markov Logic. ICDM, 2006.
• F. M. Suchanek, M. Sozio, G. Weikum: SOFIE: a self-organizing framework for information extraction. WWW, 2009.
• F. M. Suchanek, G. Ifrim, G. Weikum: Combining linguistic and statistical analysis to extract relations from web documents. KDD, 2006.
• C. Sutton, A. McCallum: An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
• R. C. Wang, W. W. Cohen: Language-independent set expansion of named entities using the web. ICDM, 2007.
• Y. Wang, M. Yahya, M. Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. VLDB/MUD, 2010.
• D. S. Weld, R. Hoffmann, F. Wu: Using Wikipedia to bootstrap open information extraction. SIGMOD Record 37(4), 2008.
• F. Wu, D. S. Weld: Autonomously semantifying Wikipedia. CIKM, 2007.
• F. Wu, D. S. Weld: Automatically refining the Wikipedia infobox ontology. WWW, 2008.
• A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland: TextRunner: Open information extraction on the web. HLT-NAACL, 2007.
• J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen: StatSnowball: a statistical approach to extracting entity relationships. WWW, 2009.
Outline • Part I – What and Why ✔ – Available Knowledge Bases ✔ • Part II – Extracting Knowledge ✔ • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Outline for Part III • Part III.1: Querying Knowledge Bases – A short overview of SPARQL – Extensions to SPARQL • Part III.2: Searching and Ranking Entities • Part III.3: Searching and Ranking Facts
SPARQL • Query language for RDF from the W3C • Main component: – select-project-join combination of triple patterns ⇒ graph pattern queries on the knowledge base
SPARQL – Example query: Find all actors from Ontario (that are in the knowledge base). Example knowledge graph: Mike_Myers isA actor, bornIn Scarborough; Jim_Carrey isA actor, isA vegetarian, bornIn Newmarket; Scarborough and Newmarket locatedIn Ontario; Ontario locatedIn Canada; Albert_Einstein isA physicist, isA vegetarian, bornIn Ulm; Otto_Hahn isA chemist, bornIn Frankfurt; physicist and chemist isA scientist; Ulm and Frankfurt locatedIn Germany; Germany locatedIn Europe.
SPARQL – Example query: Find all actors from Ontario (that are in the knowledge base): SELECT ?person WHERE { ?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario. } Find subgraphs of this form: actor ←isA– ?person –bornIn→ ?loc –locatedIn→ Ontario, where actor and Ontario are constants and ?person, ?loc are variables.
SPARQL – More Features • Eliminate duplicates in results: SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c} • Return results in some order: SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person), with optional LIMIT n clause • Optional matches and filters on bound variables: SELECT ?person WHERE {?person isA actor. OPTIONAL {?person bornIn ?loc}. FILTER (!BOUND(?loc))} • More operators: ASK, DESCRIBE, CONSTRUCT
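A runnable sketch of the slide's graph pattern with Python's rdflib (our choice of engine, not prescribed by the slides); the mini-graph is the hypothetical Ontario example under an example.org namespace:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
for s, p, o in [
    ("Mike_Myers", "isA", "actor"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Scarborough", "locatedIn", "Ontario"),
]:
    g.add((EX[s], EX[p], EX[o]))

q = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE {
  ?person ex:isA ex:actor .
  ?person ex:bornIn ?loc .
  ?loc ex:locatedIn ex:Ontario .
}"""
for row in g.query(q):
    print(row.person)   # http://example.org/Mike_Myers
```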
SPARQL: Extensions from W3C SPARQL 1.1 draft: • Aggregations (COUNT, AVG, …) • Subqueries • Negation: syntactic sugar for OPTIONAL {?x …} FILTER(!BOUND(?x))
SPARQL: Extensions from Research (1) More complex graph patterns: • Transitive paths [Anyanwu et al., WWW 07]: SELECT ?p, ?c WHERE { ?p isA scientist. ?p ??r ?c. ?c isA Country. ?c locatedIn Europe. PathFilter(cost(??r) < 5). PathFilter(containsAny(??r, ?t)). ?t isA City. } • Regular expressions [Kasneci et al., ICDE 08]: SELECT ?p WHERE { ?p isA scientist. ?p (bornIn | livesIn | citizenOf) locatedIn* Europe. }
SPARQL: Extensions from Research (2) Queries over federated RDF sources: • Determine the distribution of triple patterns as part of the query (for example in ARQ from Jena) • Automatically route triple predicates to useful sources • Potentially requires mapping of identifiers from different sources
RDF+SPARQL: Systems • BigOWLIM • OpenLink Virtuoso • Jena with different backends • Sesame • OntoBroker • SW-Store, Hexastore, RDF-3X (no reasoning). System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)
Outline for Part III • Part III.1: Querying Knowledge Bases • Part III.2: Searching and Ranking Entities – Entity Importance: Graph Analysis – Entity Search: Language Models • Part III.3: Searching and Ranking Facts
Why ranking is essential • Queries often have a huge number of results: – scientists from Canada – conferences in Toronto – publications in databases – actors from the U. S. • Ranking as integral part of search • Huge number of app-specific ranking methods: paper/citation count, impact, salary, … • Need for generic ranking Harvesting Knowledge from Web Data
Extending Entities with Keywords Remember: entities occur in facts and in documents. Associate entities with terms in those documents, e.g., Guido Westerwelle and Nicolas Sarkozy with term clouds such as chancellor, Germany, scientist, election, Stuttgart 21, France.
Digression 1: Graph Authority Measures Idea: incoming links are endorsements & increase page authority; authority is higher if links come from high-authority pages. Authority(page q) = stationary probability of visiting q in a random walk: uniformly random choice of links + random jumps
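A minimal power-iteration sketch of this random-walk authority measure on a hypothetical three-page link graph (damping factor 0.85 is an assumption, not from the slide):

```python
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = list(links)
rank = {n: 1.0 / len(nodes) for n in nodes}
d = 0.85                                   # probability of following a link

for _ in range(50):                        # iterate to (near) convergence
    new = {n: (1 - d) / len(nodes) for n in nodes}   # random-jump mass
    for n, outs in links.items():
        for m in outs:
            new[m] += d * rank[n] / len(outs)        # endorsement mass
    rank = new

print({n: round(r, 3) for n, r in rank.items()})
```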
Graph-Based Entity Importance Combine several paradigms: • Keyword search on associated terms to determine candidate entities • PageRank or a similar measure to determine important entities • Ranking can combine entity rank with a keyword-based score
Digression 2: Language Models (LMs) State-of-the-art model in text retrieval: • each document di has an LM: a generative probability distribution of terms with parameters θi • a query q is viewed as a sample from LM(θ1), LM(θ2), … • estimate the likelihood P[q | LM(θi)] that q is a sample of the LM of document di (q is “generated by” di) • rank by descending likelihoods (best “explanation” of q)
Language Models for Text: Example A document d (e.g., the term multiset A A B B C C C D E E E) is a sample of a model M and is used for parameter estimation; then estimate the likelihood of observing the query: P[A A B C E | M]
Language Models for Text: Smoothing The document d plus a background corpus and/or smoothing (Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …) are used for parameter estimation; then estimate the likelihood of observing the query: P[A B C E F | M]
Some LM Basics Independence assumption: P[q | d] = ∏_{t ∈ q} P[t | d]. Simple MLE: P[t | d] = freq(t, d) / |d| (overfitting!). Mixture model for smoothing: P[t | d] = λ · freq(t, d)/|d| + (1 − λ) · P[t | corpus], with P[t] estimated from a query log or the corpus. Rank by ascending “improbability”: KL divergence (Kullback-Leibler divergence, aka relative entropy): KL(q ∥ d) = ∑_t P[t | q] · log(P[t | q] / P[t | d])
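The mixture model and the ranking step fit in a few lines; a sketch with Jelinek-Mercer smoothing over two hypothetical documents (with an MLE query model, ranking by KL divergence orders documents the same way as this query likelihood):

```python
from collections import Counter
from math import log

docs = {
    "d1": "elvis presley rock singer graceland elvis".split(),
    "d2": "elvis costello punk singer london".split(),
}
corpus = [t for d in docs.values() for t in d]   # background model
bg = Counter(corpus)
LAM = 0.8                                        # mixture weight (assumed)

def p_term(t, doc):
    """Jelinek-Mercer smoothed term probability."""
    tf = Counter(doc)
    return LAM * tf[t] / len(doc) + (1 - LAM) * bg[t] / len(corpus)

def log_likelihood(query, doc):
    return sum(log(p_term(t, doc)) for t in query.split())

for name, doc in docs.items():
    print(name, round(log_likelihood("elvis rock", doc), 3))
# d1 scores higher: it "explains" the query best
```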
Entity Search with LM Ranking Query: keywords; answer: entities. LM(entity e) = probability distribution of words seen in the context of e (weighted by confidence). Query q: “French player who won world championship”. Candidate entities: e1: David Beckham (context: played for ManU, Real, LA Galaxy; champions league; England lost match against France; married to spice girl; …), e2: Ruud van Nistelrooy, e3: Ronaldinho, e4: Zinedine Zidane (context: Zizou; champions league 2002; Real Madrid won final; best player; France world cup 1998; …), e5: FC Barcelona. [Z. Nie et al.: WWW'07]
Outline for Part III • Part III.1: Querying Knowledge Bases • Part III.2: Searching and Ranking Entities • Part III.3: Searching and Ranking Facts – General ranking issues – NAGA-style ranking – Language Models for facts
What makes a fact „good“?
• Confidence: prefer results that are likely correct
  – accuracy of info extraction
  – trust in sources (authenticity, authority)
  – e.g. bornIn(Jim Gray, San Francisco) from „Jim Gray was born in San Francisco“ (en.wikipedia.org), vs. livesIn(Michael Jackson, Tibet) from „Fans believe Jacko hides in Tibet“ (www.michaeljacksonsightings.com)
• Informativeness: prefer results with salient facts; statistical estimation from frequency in answer, frequency on Web, frequency in query log
  – q: Einstein isa ? → Einstein isa scientist, rather than Einstein isa vegetarian
  – q: ?x isa vegetarian → Einstein isa vegetarian, rather than Whocares isa vegetarian
• Diversity: prefer a variety of facts (E won …, E discovered …, E played …, rather than E won … repeatedly)
• Conciseness: prefer results that are tightly connected
  – size of answer graph, cost of Steiner tree
  – e.g. Einstein won NobelPrize & Bohr won NobelPrize, rather than Einstein isa vegetarian & Cruise born 1962 & Bohr died 1962
How can we implement this?
• Confidence (prefer results that are likely correct): empirical accuracy of IE; PR/HITS-style estimate of trust; combine into max { accuracy(f, s) * trust(s) | s ∈ witnesses(f) } (sketch below); PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …]
• Informativeness (prefer results with salient facts, estimated from frequency in answer, on Web, in query log): IR models such as tf*idf [K. Chang et al., …]; statistical Language Models
• Diversity (prefer a variety of facts): statistical Language Models
• Conciseness (prefer tightly connected results: size of answer graph, cost of Steiner tree): graph algorithms (BANKS, STAR, …) [J. X. Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., A. Markovetz et al., B. C. Ooi et al., G. Kasneci et al., …]
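The confidence combination from the first bullet, as a tiny sketch (all accuracy and trust values below are invented):

```python
def confidence(fact, witnesses, accuracy, trust):
    """max over witness sources s of accuracy(fact, s) * trust(s)."""
    return max(accuracy[(fact, s)] * trust[s] for s in witnesses[fact])

witnesses = {"bornIn(JimGray,SF)": ["en.wikipedia.org", "someblog.example"]}
accuracy = {("bornIn(JimGray,SF)", "en.wikipedia.org"): 0.9,
            ("bornIn(JimGray,SF)", "someblog.example"): 0.6}
trust = {"en.wikipedia.org": 0.95, "someblog.example": 0.3}

print(confidence("bornIn(JimGray,SF)", witnesses, accuracy, trust))   # 0.855
```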
LMs: From Entities to Facts
Document / entity LMs:
• LM for doc/entity: probability distribution of words
• LM for query: (probability distribution of) words
• LMs: rich for docs/entities, super-sparse for queries → richer query LM with query expansion, etc.
Triple LMs:
• LM for facts: (degenerate probability distribution of) a triple
• LM for queries: (degenerate probability distribution of) a triple pattern
• LMs: apples and oranges →
  – expand query variables by S, P, O values from DB/KB
  – enhance with witness statistics
  – the query LM then is a probability distribution of triples!
LMs for Triples and Triple Patterns
Triples (facts f) with witness statistics (total: 2600):
f1: Beckham p ManchesterU (200), f2: Beckham p RealMadrid (300), f3: Beckham p LAGalaxy (20), f4: Beckham p ACMilan (30), f5: Kaka p ACMilan (300), f6: Kaka p RealMadrid (150), f7: Zidane p ASCannes (20), f8: Zidane p Juventus (200), f9: Zidane p RealMadrid (350), f10: Tidjani p ASCannes (10), f11: Messi p FCBarcelona (400), f12: Henry p Arsenal (200), f13: Henry p FCBarcelona (150), f14: Ribery p BayernMunich (100), f15: Drogba p Chelsea (150), f16: Casillas p RealMadrid (20)
Triple patterns (queries q) and their LMs:
• q: Beckham p ?y → Beckham p ManU 200/550, Beckham p Real 300/550, Beckham p Galaxy 20/550, Beckham p Milan 30/550
• q: ?x p ASCannes → Zidane p ASCannes 20/30, Tidjani p ASCannes 10/30
• q: ?x p ?y → LM(q): {t: P[t | t matches q] ~ #witnesses(t)}, e.g. Messi p FCBarcelona 400/2600, Zidane p RealMadrid 350/2600, Kaka p ACMilan 300/2600, …
• q: Cruyff ?r FCBarcelona → Cruyff playedFor FCBarca 200/500, Cruyff playedAgainst FCBarca 50/500, Cruyff coached FCBarca 250/500
LM(answer f): {t: P[t | t matches f] ~ 1 for f}
Smooth all LMs; rank results by ascending KL(LM(q) | LM(f))
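A sketch of this construction using the slide's witness counts for the pattern Beckham p ?y; the particular smoothing scheme for the degenerate answer LMs is an assumption.

```python
import math

# witness counts for triples matching the pattern "Beckham p ?y" (from the slide)
witnesses = {("Beckham", "p", "ManU"): 200,
             ("Beckham", "p", "RealMadrid"): 300,
             ("Beckham", "p", "LAGalaxy"): 20,
             ("Beckham", "p", "ACMilan"): 30}

total = sum(witnesses.values())                       # 550
lm_q = {t: w / total for t, w in witnesses.items()}   # query LM: P[t] ~ #witnesses(t)

def answer_lm(fact, universe, eps=0.05):
    """Degenerate answer LM (~1 for the fact itself), smoothed over all candidates."""
    n = len(universe)
    return {t: (1 - eps) + eps / n if t == fact else eps / n for t in universe}

def kl(p, q):
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

# rank answers by ascending KL(LM(q) || LM(f)): frequently witnessed triples first
for f in sorted(witnesses, key=lambda f: kl(lm_q, answer_lm(f, witnesses))):
    print(f)    # ('Beckham', 'p', 'RealMadrid') with 300/550 comes first
```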
LMs for Composite Queries
q: Select ?x, ?c Where { ?x bornIn France. ?x playsFor ?c. ?c in UK. }
• queries q with subqueries q1 … qn; results are n-tuples of triples t1 … tn
• example answers: [Henry bI F, Henry p Arsenal, Arsenal in UK], [Drogba bI F, Drogba p Chelsea, Chelsea in UK]
• LM(q): P[q1 … qn] = ∏i P[qi]
• LM(answer): P[t1 … tn] = ∏i P[ti]
• KL(LM(q) | LM(answer)) = Σi KL(LM(qi) | LM(ti))
Facts with witness counts: f21: Zidane bI F (200), f22: Tidjani bI F (20), f23: Henry bI F (200), f24: Ribery bI F (200), f25: Drogba bI F (30), f26: Drogba bI IC (100), f27: Zidane bI ALG (50); f1: Beckham p ManU (200), f7: Zidane p ASCannes (20), f8: Zidane p Juventus (200), f9: Zidane p RealMadrid (300), f10: Tidjani p ASCannes (10), f12: Henry p Arsenal (200), f13: Henry p FCBarca (150), f14: Ribery p Bayern (100), f15: Drogba p Chelsea (150); f31: ManU in UK (200), f32: Arsenal in UK (160), f33: Chelsea in UK (140)
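Because the per-subquery LMs are treated as independent, a tuple's score is just the sum of its per-triple KL scores. A sketch (the KL values themselves are invented stand-ins):

```python
# Composite-query scoring sketch: tuple score = sum of per-subquery KL scores.
# kl_scores[i][t] stands for KL(LM(q_i) || LM(t)) for each candidate triple t.

kl_scores = [
    {"Henry bI F": 0.4, "Drogba bI F": 1.1},           # subquery 1: ?x bornIn France
    {"Henry p Arsenal": 0.5, "Drogba p Chelsea": 0.6}, # subquery 2: ?x playsFor ?c
    {"Arsenal in UK": 0.3, "Chelsea in UK": 0.35},     # subquery 3: ?c in UK
]

answers = [("Henry bI F", "Henry p Arsenal", "Arsenal in UK"),
           ("Drogba bI F", "Drogba p Chelsea", "Chelsea in UK")]

def tuple_score(tup):
    return sum(kl_scores[i][t] for i, t in enumerate(tup))

for tup in sorted(answers, key=tuple_score):
    print(round(tuple_score(tup), 2), tup)   # Henry's tuple ranks first here
```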
Extensions: Keywords
Problem: not everything is triplified
• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)
• Semantics: triples match the structural predicates, witnesses match the text predicates
European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?
Select ?p Where {
  ?p instanceOf Composer.
  ?p bornIn ?t. ?t inCountry ?c. ?c locatedIn Europe.
  ?p hasWon ?a. ?a name AcademyAward.
  ?p contributedTo ?movie [western, gunfight, duel, sunset].
  ?p composed ?music [classical, orchestra, cantata, opera]. }
Extensions: Keywords
Problem: not everything is triplified
• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)
• Grouping of keywords or phrases boosts expressiveness
French politicians married to Italian singers?
Select ?p1, ?p2 Where {
  ?p1 instanceOf ?c1 [France, politics].
  ?p2 instanceOf ?c2 [Italy, singer].
  ?p1 marriedTo ?p2. }
CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where {
  ?r instOf researcher [“computer science“].
  ?a workedOn ?x [“Manhattan project“].
  ?r hasAdvisor ?a. }
or, with relaxed (variable) predicates:
  { ?r ?p1 ?o1 [“computer science“].
    ?a ?p2 ?o2 [“Manhattan project“].
    ?r ?p3 ?a. }
LMs for Keyword-Augmented Queries
q: Select ?x, ?c Where {
  France ml ?x [goalgetter, “top scorer“].
  ?x p ?c. ?c in UK [champion, “cup winner“, double]. }
• subqueries qi with keywords w1 … wm; results are still n-tuples of triples ti
• LM(qi): P[triple ti | w1 … wm] = λ Σk P[ti | wk] + (1−λ) P[ti] (sketch below)
• LM(answer fi) analogous
• KL(LM(q) | LM(answer)) = Σi KL(LM(qi) | LM(fi))
• result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords
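A sketch of the keyword-conditioned mixture; note that the exact aggregation over keywords (sum vs. product) and the weight lam are my reading of the formula above, so treat them as assumptions.

```python
def keyword_lm(p_t_given_w, p_t, keywords, lam=0.6):
    """Mixture of keyword-conditioned and plain triple LMs:
    score(t) ~ lam * mean_k P[t | wk] + (1 - lam) * P[t].
    p_t_given_w: dict keyword -> dict triple -> prob (estimated from witness texts).
    Returns unnormalized scores; normalize over triples to get an LM."""
    def score(t):
        kw = sum(p_t_given_w[w].get(t, 0.0) for w in keywords) / len(keywords)
        return lam * kw + (1 - lam) * p_t.get(t, 0.0)
    triples = set(p_t) | {t for w in keywords for t in p_t_given_w[w]}
    return {t: score(t) for t in triples}

p_t = {"Henry p Arsenal": 0.4, "Ribery p Bayern": 0.6}            # plain triple LM
p_t_given_w = {"champion": {"Ribery p Bayern": 0.8, "Henry p Arsenal": 0.2},
               "double":   {"Ribery p Bayern": 0.9, "Henry p Arsenal": 0.1}}

print(keyword_lm(p_t_given_w, p_t, ["champion", "double"]))
# triples whose witnesses mention the keywords get boosted
```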
Extensions: Query Relaxation
q: … Where { ?x bornIn F. ?x p ?c. ?c in UK. }
relaxed q(i), e.g. q(1): … Where { ?x bornIn IC. ?x p ?c. ?c in UK. }
example answers: [Drogba bI F, Drogba p Chelsea, Chelsea in UK], [Drogba bI IC, Drogba p Chelsea, Chelsea in UK], [Zidane bI F, Zidane p Real, Real in ESP]
• LM(q*) = λ0 LM(q) + λ1 LM(q(1)) + λ2 LM(q(2)) + …
• replace entity e in q by e(i) in q(i): precompute P := LM(e ?p ?o) and Q := LM(e(i) ?p ?o), set λi ~ ½ (KL(P|Q) + KL(Q|P)) (sketch below)
• replace relation r in q by r(i) in q(i): LM(?s r(i) ?o)
• replace entity e in q by variable ?x in q(i): LM(?x r ?o)
• the LMs of e, r, … are probability distributions of triples!
Facts with witness counts: f21: Zidane bI F (200), f22: Tidjani bI F (20), f23: Henry bI F (200), f24: Ribery bI F (200), f26: Drogba bI IC (100), f27: Zidane bI ALG (50); f1: Beckham p ManU (200), f7: Zidane p ASCannes (20), f9: Zidane p Real (300), f10: Tidjani p ASCannes (10), f12: Henry p Arsenal (200), f15: Drogba p Chelsea (150); f31: ManU in UK (200), f32: Arsenal in UK (160), f33: Chelsea in UK (140)
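A sketch of the relaxation-weight computation: compare the triple distributions of the original and the substituted entity via symmetric KL. The crude smoothing constant and the mapping from divergence to a mixing weight lam are assumptions.

```python
import math

def sym_kl(p, q, eps=1e-4):
    """0.5 * (KL(P||Q) + KL(Q||P)) over a shared triple vocabulary;
    missing triples get a small floor probability eps (crude smoothing)."""
    keys = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(t, eps) * math.log(a.get(t, eps) / b.get(t, eps))
                   for t in keys)
    return 0.5 * (kl(p, q) + kl(q, p))

# P := LM(e ?p ?o) for e = France, Q := LM(e' ?p ?o) for e' = IvoryCoast (toy values)
P = {"bI Zidane": 0.45, "bI Henry": 0.45, "bI Drogba": 0.10}
Q = {"bI Drogba": 0.90, "bI Zidane": 0.10}

div = sym_kl(P, Q)
lam = 1.0 / (1.0 + div)   # assumed mapping: more similar entities get larger weight
print(div, lam)
```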
Extensions: Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer. ?p playedFor ?c. }
Without diversification:
1 Beckham, ManchesterU; 2 Beckham, RealMadrid; 3 Beckham, LAGalaxy; 4 Beckham, ACMilan; 5 Zidane, RealMadrid; 6 Kaka, RealMadrid; 7 Cristiano Ronaldo, RealMadrid; 8 Raul, RealMadrid; 9 van Nistelrooy, RealMadrid; 10 Casillas, RealMadrid
With diversification:
1 Beckham, ManchesterU; 2 Beckham, RealMadrid; 3 Zidane, RealMadrid; 4 Kaka, ACMilan; 5 Cristiano Ronaldo, ManchesterU; 6 Messi, FCBarcelona; 7 Henry, Arsenal; 8 Ribery, BayernMunich; 9 Drogba, Chelsea; 10 Luis Figo, Sporting Lissabon
• rank results f1 … fk by ascending λ KL(LM(q) | LM(fi)) − (1−λ) KL(LM(fi) | LM({f1 .. fk} ∖ {fi}))
• implemented by greedy re-ranking of the fi‘s in a candidate pool
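A greedy re-ranking sketch in the spirit of this objective: at each step pick the candidate that balances closeness to the query against redundancy with what is already selected. The relevance/similarity stand-ins replace the two KL terms, and lam is invented.

```python
def diversify(candidates, rel, sim, k=3, lam=0.7):
    """Greedy MMR-style re-ranking.
    rel[c]: relevance of candidate c to the query (higher = better),
    sim[(a, b)]: redundancy between two candidates (higher = more similar)."""
    selected = []
    pool = set(candidates)
    while pool and len(selected) < k:
        def mmr(c):
            red = max((sim.get((c, s), sim.get((s, c), 0.0)) for s in selected),
                      default=0.0)
            return lam * rel[c] - (1 - lam) * red
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

cands = ["Beckham,ManU", "Beckham,Real", "Zidane,Real", "Messi,Barca"]
rel = {"Beckham,ManU": 0.9, "Beckham,Real": 0.85, "Zidane,Real": 0.8, "Messi,Barca": 0.75}
sim = {("Beckham,ManU", "Beckham,Real"): 0.9, ("Beckham,Real", "Zidane,Real"): 0.5}

print(diversify(cands, rel, sim))   # avoids two near-duplicate Beckham results early
```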
Searching and Ranking – Summary
• Don‘t re-invent the wheel: LMs are an elegant and expressive means for ranking; consider both data & workload statistics
• Extensions should be conceptually simple: can capture informativeness, personalization, relaxation, diversity – all in the same framework
• Unified ranking model for a complete query language: still work to do
References for Part III
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
• SPARQL New Features and Rationale, W3C Working Draft, 2 July 2009, http://www.w3.org/TR/2009/WD-sparql-features-20090702/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: Towards Support for Subgraph Extraction Queries in RDF Databases. WWW Conference, 2007
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Soumen Chakrabarti: Dynamic Personalized PageRank in Entity-Relation Graphs. WWW Conference, 2007
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: Searching Entities Directly and Holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-Model-Based Ranking for Queries on RDF-Graphs. CIKM, 2009
• Djoerd Hiemstra: Language Models. Encyclopedia of Database Systems, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-Based Keyword Search in Databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Mounia Lalmas: XML Retrieval. Morgan & Claypool Publishers, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X Engine for Scalable Management of RDF Data. VLDB Journal 19(1), 2010
• Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web Object Retrieval. WWW Conference, 2007
• Desislava Petkova, W. Bruce Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI, 2006
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active Knowledge: Dynamically Enriching RDF Knowledge Bases by Web Services. SIGMOD Conference, 2010
• Pavel Serdyukov, Djoerd Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR, 2008
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Outline
• Part I
  – What and Why ✔
  – Available Knowledge Bases ✔
• Part II
  – Extracting Knowledge ✔
• Part III
  – Ranking and Searching ✔
• Part IV
  – Conclusion and Outlook
But back to the original question…
Will there ever be a famous singer called Elvis again?
?x hasGivenName “Elvis”. ?x type singer.
But back to the original question…
http://mpii.de/yago
We found him!
?x = Elvis_Costello
?singer = wordnet_singer_110599806
?d = 1954-08-25
Can we find out more about this guy?
But back to the original question…
http://mpii.de/yago
Alright, and even more?
Linking Open Data: Goal
(figure: two knowledge bases, YAGO and a fictitious “Costellopedia”, each with facts about Elvis Costello such as plays guitar and born 1954)
Can we combine knowledge from different sources?
Linking Open Data: URIs
1. Define a name space, e.g. http://dbpedia.org/resource or http://elvis.org
2. Define entity names in that name space, e.g. http://dbpedia.org/resource/Elvis_Costello or http://costello.org/Elvis
Every entity has a worldwide unique identifier (a Uniform Resource Identifier, URI). There is a W3C standard for that. [W3C URI]
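Steps 1 and 2, sketched with rdflib (a widely used Python RDF library); the costello.org namespace and the born/plays relations are the slide's fictional example.

```python
from rdflib import Graph, Namespace, Literal

COSTELLO = Namespace("http://costello.org/")   # 1. define a name space
elvis = COSTELLO.Elvis                         # 2. an entity URI in that name space

g = Graph()
g.add((elvis, COSTELLO.born, Literal(1954)))       # <.../Elvis> <.../born> 1954
g.add((elvis, COSTELLO.plays, Literal("guitar")))
print(g.serialize(format="turtle"))                # emit the triples as Turtle
```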
Linking Open Data: Cool URIs
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online: a client dereferences http://costello.org/Elvis and the server answers with the entity’s data (e.g. born 1954)
There is a W3C description for that. [W3C CoolURI]
Linking Open Data: Links
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
4. Define equivalence links
This is an entity resolution problem. Use:
• similar identifiers
• similar labels (names)
• keys (e.g., the ISBN)
• common properties
Goal of the W3C group [Bizer JSWIS 2009]
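Step 4, sketched with rdflib: once entity resolution concludes that two URIs denote the same entity, publish an owl:sameAs link (http://costello.org/Elvis is the slide's fictional URI; Elvis_Costello is DBpedia's actual identifier).

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
g.add((URIRef("http://costello.org/Elvis"),
       OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Elvis_Costello")))
print(g.serialize(format="turtle"))   # emits the owl:sameAs equivalence link
```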
Linking Open Data: Status so far
Currently (2010):
• 200 ontologies
• 25 billion triples
• 400 m links
http://richard.cyganiak.de/2007/10/lod/imagemap.html
Querying Semantic Data
Sindice is an index for the Semantic Web developed at DERI in Galway, Ireland. http://sindice.com
Sindice exploits:
• RDF dumps available on the Web
• RDF information embedded into HTML pages
• RDF data available via cool URIs
• inter-ontology links
[Tummarello ISWC 2007]
Querying Semantic Data
… far from perfect, but far from useless …
Conclusion
• We have seen the knowledge representation model of ontologies, RDF. In a nutshell, RDF is a kind of distributed entity-relationship model
• We have seen numerous existing knowledge bases: manually constructed (Cyc and WordNet) and automatically constructed (YAGO, DBpedia, Freebase, TrueKnowledge etc.)
• We have seen techniques for creating such knowledge bases (pattern-based extraction and reasoning-based extraction, with uncertainty)
• We have seen techniques for querying and ranking the knowledge (by SPARQL and language-model-based ranking)
• We have seen that many knowledge bases already exist and that there is ongoing work to interlink them
• We have seen that there is indeed a promising singer called Elvis
The End
The slides are available at http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Feel free to contact us with further questions:
• Hady Lauw, Institute for Infocomm Research, Singapore, http://hadylauw.com
• Fabian M. Suchanek, INRIA Saclay, Paris, http://suchanek.name
• Martin Theobald, Max Planck Institute for Informatics, Saarbrücken, http://mpii.de/~mtb
• Ralf Schenkel, Saarland University, http://people.mmci.uni-saarland.de/~schenkel/
References for Part IV
• [W3C URI] W3C: “Architecture of the World Wide Web, Volume One”, Recommendation, 15 December 2004, http://www.w3.org/TR/webarch/
• [W3C CoolURI] W3C: “Cool URIs for the Semantic Web”, Interest Group Note, 03 December 2008, http://www.w3.org/TR/cooluris/
• [Bizer JSWIS 2009] C. Bizer, T. Heath, T. Berners-Lee: “Linked Data – The Story So Far”, International Journal on Semantic Web and Information Systems, 5(3): 1–22, 2009
• [Tummarello ISWC 2007] G. Tummarello, R. Delbru, E. Oren: “Sindice.com: Weaving the Open Linked Data”, ISWC/ASWC 2007