
  • Number of slides: 44

YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET
  • FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM
  • Subbalakshmi Iyer

Motivation for an Ontology
  • Natural language communication
  • Automated text translation
  • Finding information on the internet
  • Computer-processable collection of knowledge

What is an Ontology?
  • An ontology is the description of a domain, its classes and properties, and the relationships between those classes by means of a formal language.
  • It is a collection of knowledge about the world, a knowledge base.
  • Example ontologies: large taxonomies categorizing Web sites (such as on Yahoo!); categorizations of products for sale and their features (such as on Amazon.com)

Uses of Ontologies
  • Machine Translation
  • Word Sense Disambiguation
  • Document Classification
  • Question Answering
  • Entity and fact-oriented Web Search

What is YAGO
  • Yet Another Great Ontology
  • Part of the YAGO-NAGA project
  • Goal: to build a knowledge base that is large-scale, domain-independent, automatically constructed, and highly accurate
  • Uses Wikipedia and WordNet

More about YAGO
  • 2 million entities
  • 20 million facts
  • Facts represented as RDF triples
  • Accuracy of 95%
  • Example: Elvis Presley isA singer subClassOf person
  • Example: Elvis Presley bornOnDate 1935-01-08
  • Example: Elvis Presley bornIn Tupelo locatedIn Mississippi (state) locatedIn USA

The YAGO model
  • Slight extension of RDFS
  • Represents knowledge as entities, classes, relations, facts, and properties of relations (like transitivity)
  • Simple and decidable model

Knowledge Representation in YAGO
  • All objects are entities, e.g. Elvis Presley, Grammy Award
  • 2 entities can stand in a relationship, e.g. hasWonAward: Elvis Presley hasWonAward Grammy Award
  • The triple of entity, relationship, entity is a fact, e.g. Elvis Presley hasWonAward Grammy Award is a fact

Knowledge Representation in YAGO - 2
  • Numbers, dates and strings are also entities: Elvis Presley bornInYear 1935
  • Words are entities: “Elvis” means Elvis Presley
  • An entity is an instance of a class: Elvis Presley type singer
  • Classes are also entities: singer type class

Knowledge Representation in YAGO - 3
  • Classes have hierarchies: singer subClassOf person
  • Relations are also entities: subClassOf type atr (acyclic transitive relation)
  • Each fact has a fact identifier, which can itself appear in a fact: #1 foundIn Wikipedia
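
The representation on the three slides above can be sketched in a few lines of Python. This is only an illustration, not YAGO's storage format: the `Fact` type, the `facts_about` helper, and the numbering of the identifiers are assumptions made for the example.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    fact_id: str   # identifier such as "#1" (the numbering here is illustrative)
    subject: str   # first argument: an entity, a word, or another fact's identifier
    relation: str
    obj: str       # second argument

facts = [
    Fact("#1", "Elvis Presley", "hasWonAward", "Grammy Award"),
    Fact("#2", '"Elvis"', "means", "Elvis Presley"),   # words are entities
    Fact("#3", "Elvis Presley", "type", "singer"),     # instance of a class
    Fact("#4", "singer", "subClassOf", "person"),      # classes are entities
    Fact("#5", "#1", "foundIn", "Wikipedia"),          # a fact about fact #1
]

def facts_about(entity, kb):
    """Return all facts whose first argument is the given entity or fact id."""
    return [fact for fact in kb if fact.subject == entity]
```

Because a fact identifier is an ordinary string here, `facts_about("#1", facts)` retrieves the provenance fact in the same way as facts about any other entity.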

Key Contributions of YAGO
  • Information Extraction from Wikipedia: Infoboxes, Category Pages
  • Combination with WordNet: Taxonomy
  • Quality Control: Canonicalization, Type Checking

Information Extraction - 1
  • Entities from Wikipedia
  • Each page title is a candidate entity
  • Wiki Markup Language
  • Wikipedia dump as of September 2008

Information Extraction - WML

Information Extraction Techniques
  • Infobox Harvesting: Wikipedia Infoboxes
  • Word-Level Techniques: Wikipedia Redirects
  • Category Harvesting: Wikipedia Categories
  • Type Extraction: Wikipedia Categories, WordNet Classes

1. Information Extraction from Wikipedia – Infobox Harvesting
  • Wikipedia Infobox

Attribute Map
  • Infobox: Born: January 8, 1935
  • Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect): Born maps to bornOnDate
  • Relation Map (columns: Relation, Domain, Range): bornOnDate has domain person and range yagoDate
  • Fact: Elvis Presley bornOnDate January 8, 1935

Attribute Map
  • Infobox: Died: August 16, 1977
  • Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect): Died maps to diedOnDate
  • Relation Map (columns: Relation, Domain, Range): diedOnDate has domain person and range yagoDate
  • Fact: Elvis Presley diedOnDate August 16, 1977

Attribute Map
  • Infobox: Genre: Rock and Roll
  • Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect): Genre maps to isOfGenre
  • Relation Map (columns: Relation, Domain, Range): isOfGenre has domain entity and range yagoClass
  • Fact: Elvis Presley isOfGenre Rock and Roll

Attribute Map
  • Infobox: Birth Name: Elvis Aaron Presley
  • Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect): birth name maps to means
  • Relation Map (columns: Relation, Domain, Range): means has domain yagoWord and range entity
  • Fact: Elvis Aaron Presley means Elvis Presley

Manifold Attributes
  • Some attributes may have multiple values, e.g. a person may have multiple children
  • Multiple facts are generated, e.g. one hasChild fact for each child
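
A minimal sketch of infobox harvesting with an attribute map, including the manifold case. The dictionary layout, the `children` attribute name, the semicolon separator, and the placeholder person are assumptions for illustration; the slides do not show how YAGO stores its attribute map or splits multi-valued attributes.

```python
# Hypothetical attribute map: infobox attribute -> (relation, is_manifold).
ATTRIBUTE_MAP = {
    "born": ("bornOnDate", False),
    "died": ("diedOnDate", False),
    "genre": ("isOfGenre", False),
    "children": ("hasChild", True),   # manifold: one fact per value
}

def harvest_infobox(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.lower())
        if entry is None:
            continue  # attribute not covered by the map: no fact is generated
        relation, manifold = entry
        # Assumption: manifold values are separated by semicolons.
        values = [v.strip() for v in value.split(";")] if manifold else [value]
        facts.extend((entity, relation, v) for v in values)
    return facts

elvis = harvest_infobox(
    "Elvis Presley",
    {"Born": "January 8, 1935", "Genre": "Rock and Roll", "Website": "n/a"},
)
family = harvest_infobox("Example Person", {"Children": "Child A; Child B"})
```

Here `elvis` holds one bornOnDate fact and one isOfGenre fact, while `family` holds one hasChild fact per listed child.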

Indirect Attributes - 1
  • Some attributes do not concern the article entity, but another fact
  • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect): gdp ppp maps to hasGDP; gdp year maps to during and is marked indirect
  • E.g. the attribute gdp year does not describe the article entity (Republic of Singapore) but the hasGDP fact: it gives the year, 2008, for which the GDP value holds
  • Therefore, facts generated: #14: Singapore hasGDP 238.755 billion, and #14 during 2008
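
The indirect case can be sketched as follows: the second fact takes the identifier of the first fact as its subject. The function name `harvest_gdp`, the dictionary keyed by fact identifier, and the way the second identifier is numbered are assumptions for the example; only the `#14` identifier and the two relations come from the slide.

```python
def harvest_gdp(entity, infobox, first_id=14):
    """Generate a hasGDP fact and, if a year is given, a 'during' fact about it."""
    gdp_id = f"#{first_id}"
    facts = {gdp_id: (entity, "hasGDP", infobox["gdp ppp"])}
    if "gdp year" in infobox:
        # Indirect attribute: the subject is the identifier of the hasGDP fact,
        # not the article entity.
        facts[f"#{first_id + 1}"] = (gdp_id, "during", infobox["gdp year"])
    return facts

singapore = harvest_gdp(
    "Singapore", {"gdp ppp": "238.755 billion", "gdp year": "2008"}
)
```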

Indirect Attributes - 2
  • Singapore Infobox

Type of Infobox
  • Song Infobox, American Pie: Released October, 1971; Format vinyl record; Genre Folk Rock; Length 8:33 mins; Label United Artists; Writer Don McLean
  • Car Infobox, Tesla Roadster: Manufacturer Tesla Motors; Production 2008-present; Class Roadster; Length 3,946 mm; Width 1,873 mm; Height 1,127 mm

Type of Infobox: Attribute Map
  • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect): car#length maps to hasLength; song#length maps to hasDuration
  • Song Infobox: American Pie hasDuration 8:33
  • Car Infobox: Tesla Roadster hasLength 3946
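
One way to picture the type-qualified entries `car#length` and `song#length` is a lookup keyed by the pair (infobox type, attribute). The tuple key and the `relation_for` helper are assumptions for illustration, not YAGO's actual data structure.

```python
# Hypothetical lookup: (infobox type, attribute) -> relation.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}

def relation_for(infobox_type, attribute):
    """Resolve an attribute to a relation, taking the infobox type into account."""
    return TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))

song_fact = ("American Pie", relation_for("Song", "Length"), "8:33")
car_fact = ("Tesla Roadster", relation_for("Car", "Length"), "3946")
```

The same attribute name, Length, thus yields hasDuration for a song and hasLength for a car.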

Information Extraction - Word Level Techniques
  • Wikipedia Redirects: a virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
  • Each redirect gives a ‘means’ fact, e.g. “Presley, Elvis” means Elvis Presley
  • Parsing Person Names: extract the name components and establish the relations givenNameOf and familyNameOf
  • E.g. Presley familyNameOf Elvis Presley; Elvis givenNameOf Elvis Presley
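
A small sketch of both word-level techniques. The naive name split (first token as given name, last token as family name) is an assumption made for the example; the slides do not describe how YAGO actually parses person names.

```python
def means_facts(redirects):
    """Each redirect (source title -> target page) yields one 'means' fact."""
    return [(f'"{source}"', "means", target) for source, target in redirects.items()]

def name_facts(entity):
    """Naive name parsing: first token is the given name, last token the family name."""
    parts = entity.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into name components
    return [
        (parts[0], "givenNameOf", entity),
        (parts[-1], "familyNameOf", entity),
    ]

redirect_facts = means_facts({"Presley, Elvis": "Elvis Presley"})
elvis_names = name_facts("Elvis Presley")
```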

Wikipedia Categories
  • Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
  • Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
  • Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

Facts created from Wikipedia Categories
  • Rhine locatedIn Germany
  • Bryan Adams bornOnDate 1959
  • Bryan Adams hasWonAward Grammy Award
  • Abraham Lincoln politicianOf United States

Information Extraction - Category Harvesting
  • Relational Categories
  Regular Expression                 Relation
  ([0-9]{3,4}) births                bornOnDate
  ([0-9]{3,4}) deaths                diedOnDate
  ([0-9]{3,4}) establishments        establishedOnDate
  ([0-9]{3,4}) books|novels          writtenOnDate
  Mountains|Rivers in (.*)           locatedIn
  Presidents|Governors of (.*)       politicianOf
  (.*) winners                       hasWonPrize
  [A-Za-z]+ (.*) winners             hasWonPrize
  Table: Some Category Heuristics
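
The category heuristics can be sketched with Python's `re` module. Only four of the patterns are shown, the alternations are wrapped in non-capturing groups so that the preposition applies to both alternatives, and full-string matching is used; these are choices made for the sketch, not necessarily how YAGO applies its patterns.

```python
import re

# (pattern, relation) pairs adapted from the table of category heuristics.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"(?:Mountains|Rivers) in (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
]

def category_facts(entity, categories):
    """Apply each heuristic to each category name and collect the resulting facts."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
    return facts

adams = category_facts("Bryan Adams", ["1959 births", "Living people"])
lincoln = category_facts("Abraham Lincoln", ["Presidents of the United States"])
```

The captured group is used verbatim as the second argument, so the second call yields "the United States"; mapping that string to a canonical entity is a separate step that this sketch leaves out.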

2. Connecting Wikipedia and WordNet – What is WordNet
  • Lexical database for the English language
  • Created at the Cognitive Science Laboratory of Princeton University
  • Groups English words into sets of synonyms called synsets
  • Provides short, general definitions
  • Provides hypernym/hyponym relations, e.g. canine is the hypernym, dog is the hyponym

Connecting Wikipedia and WordNet – Type Extraction
  • Goal: create a class hierarchy, e.g. singer subClassOf performer subClassOf artist
  • Hyponymy relation from WordNet
  • The Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’

Classifications of Categories
  • Conceptual Categories, e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
  • Administrative Categories, e.g. Albert Einstein is in ‘Articles with unsourced statements’
  • Relational Information, e.g. ‘1879 births’
  • Thematic Vicinity, e.g. ‘Physics’

Identification of Conceptual Categories
  • Only conceptual categories are used
  • Shallow linguistic parsing of category names, e.g. category ‘American people in Japan’
  • Break the category name into pre-modifier (‘American’), head (‘people’), and post-modifier (‘in Japan’)
  • If the head is plural, then the category is a conceptual category
  • Extract the class from the Wikipedia category
  • Connect it to a class from WordNet, e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

Algorithm
  Function wiki2wordnet(c)
  Input: Wikipedia category name c
  Output: WordNet synset
  1  head = headCompound(c)
  2  pre = preModifier(c)
  3  post = postModifier(c)
  4  head = stem(head)
  5  If there is a WordNet synset s for pre + head
  6      return s
  7  If there are WordNet synsets s1, …, sn for head
  8      (ordered by their frequency for head)
  9      return s1
  10 fail

Explanation of Algorithm
  • Input: American people in Japan
  1. Pre-modifier: American
  2. Head: people
  3. Post-modifier: in Japan
  4. Stem(head): person
  5. If there is a WordNet synset for ‘American person’
  6. return that synset
  7. If there are synsets s1, …, sn for ‘person’
  8. (ordered by frequency for ‘person’)
  9. return s1
  10. Fail
  • Output: person
  • Result: American people in Japan subClassOf person
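
The algorithm can be tried out with a runnable sketch. Everything WordNet-related here is a toy stand-in: `TOY_WORDNET`, `TOY_STEMS`, the preposition list, and the rough `parse_category` split are assumptions for illustration, where the real system uses WordNet, a stemmer, and shallow linguistic parsing.

```python
# Toy stand-in for WordNet: term -> synsets ordered by frequency (hypothetical data).
TOY_WORDNET = {
    "person": ["person (a human being)", "person (grammatical category)"],
    "singer": ["singer (a person who sings)"],
}
# Toy stemmer table (hypothetical).
TOY_STEMS = {"people": "person", "singers": "singer"}
PREPOSITIONS = {"in", "of", "from", "by"}

def parse_category(name):
    """Rough split into (pre-modifier, head, post-modifier) at the first preposition."""
    words = name.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    if cut == 0:
        return "", "", name
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])

def wiki2wordnet(category):
    """Return the chosen synset for a category name, or None where the algorithm fails."""
    pre, head, _post = parse_category(category)       # steps 1-3
    head = TOY_STEMS.get(head.lower(), head.lower())  # step 4: stem the head
    compound = f"{pre} {head}".strip().lower()
    if pre and compound in TOY_WORDNET:               # steps 5-6: synset for pre + head
        return TOY_WORDNET[compound][0]
    synsets = TOY_WORDNET.get(head, [])               # steps 7-9: most frequent synset
    if synsets:
        return synsets[0]
    return None                                       # step 10: fail

result = wiki2wordnet("American people in Japan")
```

With the toy data there is no synset for "american person", so the sketch falls through to the most frequent synset for "person", mirroring the walkthrough on the slide.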

Fig.: WordNet search for “person”
Fig.: WordNet search for ‘American Person’

Exceptions
  • Complete hierarchy of classes: upper classes from WordNet, leaves from Wikipedia
  • 2 dozen cases failed
  • Categories with head compound “capital”: in Wikipedia, it means “capital city”; in WordNet, it means “financial asset”
  • These cases were corrected manually

3. Quality Control
  • Canonicalization: each fact and each entity reference is unique; an entity is always referred to by the same identifier in all facts in YAGO
  • Type Checking: eliminates individuals that do not have a class; eliminates facts that do not respect domain and range constraints
  • An argument of a fact in YAGO is always an instance of the class required by the relation

Canonicalization - 1
  • Redirect Resolution
  • Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
  • These links may not be correct Wikipedia page identifiers
  • Check whether each argument is a correct Wikipedia identifier
  • Replace it by the correct, redirected identifier
  • E.g. Hermitage Museum locatedIn St. Petersburg becomes Hermitage Museum locatedIn Saint Petersburg

Canonicalization - 2
  • Removal of Duplicate Facts
  • Sometimes, 2 heuristics deliver the same fact; canonicalization eliminates one of them
  • E.g. category ‘1935 births’ yields the fact: Elvis Presley bornOnDate 1935
  • Infobox attribute ‘Born: January 8, 1935’ yields the fact: Elvis Presley bornOnDate January 8, 1935
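
A minimal sketch of both canonicalization steps, assuming a redirect table is available as a plain dictionary (`REDIRECTS` is hypothetical). It only removes exact duplicates; how YAGO reconciles facts that differ in precision, such as 1935 versus January 8, 1935, is not spelled out on the slides and is not modelled here.

```python
# Hypothetical redirect table: non-canonical page title -> canonical identifier.
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}

def canonicalize(facts, redirects):
    """Resolve redirects in both arguments, then drop exact duplicate facts."""
    seen = set()
    result = []
    for subject, relation, obj in facts:
        fact = (redirects.get(subject, subject), relation, redirects.get(obj, obj))
        if fact not in seen:       # keep only the first copy of each fact
            seen.add(fact)
            result.append(fact)
    return result

cleaned = canonicalize(
    [
        ("Hermitage Museum", "locatedIn", "St. Petersburg"),
        ("Hermitage Museum", "locatedIn", "Saint Petersburg"),
    ],
    REDIRECTS,
)
```

After redirect resolution the two input facts become identical, so only one of them survives.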

Type Checking - 1
  • Reductive Type Checking: sometimes the class of an entity cannot be determined; facts about such entities are discarded
  • E.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
  • Inductive Type Checking: type constraints can be used to generate facts
  • E.g. Elvis Presley bornOnDate January 8, 1935; so, Elvis Presley is a person
  • A regular expression check ensures that the entity name follows the pattern of given name and family name
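
The reductive check can be sketched as a filter over facts. The `TYPES` and `SUBCLASS_OF` tables are hypothetical sample data, and the sketch covers only the reductive case; the inductive step with its name-pattern check is left out.

```python
# Relation map as on the slides: relation -> (domain, range).
RELATION_MAP = {"bornOnDate": ("person", "yagoDate")}
# Hypothetical class hierarchy and entity types, for illustration only.
SUBCLASS_OF = {"singer": "person"}
TYPES = {"Elvis Presley": {"singer"}, "January 8, 1935": {"yagoDate"}}

def is_instance(entity, cls):
    """True if a known class of the entity equals cls or is a subclass of it."""
    for current in TYPES.get(entity, ()):
        while current is not None:
            if current == cls:
                return True
            current = SUBCLASS_OF.get(current)  # walk up the hierarchy
    return False

def reductive_type_check(facts):
    """Keep only facts whose arguments respect the relation's domain and range."""
    kept = []
    for subject, relation, obj in facts:
        domain, range_ = RELATION_MAP[relation]
        if is_instance(subject, domain) and is_instance(obj, range_):
            kept.append((subject, relation, obj))
    return kept

checked = reductive_type_check([
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("Proposed Article", "bornOnDate", "January 8, 1935"),  # entity without a class
])
```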

Type Checking - 2
  • Type Coherence Checking: sometimes, classification yields wrong results
  • E.g. Abraham Lincoln is an instance of 13 classes; 12 are subclasses of the class ‘person’, e.g. lawyer, president; the 13th class is the class ‘cabinet’
  • The class hierarchy of YAGO is partitioned into branches, e.g. locations, artifacts, people, other physical entities, and abstract entities
  • The branch that most types lead to is determined
  • Other types are purged
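
Type coherence checking amounts to a majority vote over branches, which can be sketched as below. The `BRANCH_OF` table, including the branch assigned to ‘cabinet’, is an assumption for illustration; in YAGO the branch would follow from the class hierarchy.

```python
from collections import Counter

# Hypothetical mapping from class to its branch of the hierarchy.
BRANCH_OF = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}

def purge_incoherent_types(classes, branch_of):
    """Keep only the classes that lie in the branch most of the classes lead to."""
    if not classes:
        return []
    counts = Counter(branch_of[c] for c in classes)
    majority_branch, _count = counts.most_common(1)[0]
    return [c for c in classes if branch_of[c] == majority_branch]

lincoln_types = purge_incoherent_types(["lawyer", "president", "cabinet"], BRANCH_OF)
```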

References
  • YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany
  • Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University
  • Wikipedia: http://en.wikipedia.org/wiki/Main_Page
  • WordNet: http://wordnet.princeton.edu/

Thank You, Any Questions?