YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM Subbalakshmi Iyer
Motivation for an Ontology: natural language communication; automated text translation; finding information on the Internet; a computer-processable collection of knowledge.
What is an Ontology? An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language. It is a collection of knowledge about the world, a knowledge base. Example ontologies: large taxonomies categorizing Web sites (such as on Yahoo!); categorizations of products for sale and their features (such as on Amazon.com).
Uses of Ontologies: machine translation; word sense disambiguation; document classification; question answering; entity- and fact-oriented Web search.
What is YAGO? Yet Another Great Ontology. Part of the YAGO-NAGA project. Goal: to build a knowledge base that is large-scale, domain-independent, automatically constructed, and highly accurate. Uses Wikipedia and WordNet.
More about YAGO: 2 million entities; 20 million facts; facts represented as RDF triples; accuracy of 95%. Examples: Elvis Presley isA singer; singer subClassOf person; Elvis Presley bornOnDate 1935-01-08; Elvis Presley bornIn Tupelo; Tupelo locatedIn Mississippi (state); Mississippi (state) locatedIn USA.
The YAGO model: a slight extension of RDFS. Represents knowledge as entities, classes, relations, and facts, plus properties of relations such as transitivity. A simple and decidable model.
Knowledge Representation in YAGO: All objects are entities, e.g. Elvis Presley, Grammy Award. Two entities can stand in a relationship, e.g. hasWonAward: Elvis Presley hasWonAward Grammy Award. The triple of entity, relationship, entity is a fact, e.g. Elvis Presley hasWonAward Grammy Award is a fact.
Knowledge Representation in YAGO - 2: Numbers, dates and strings are also entities, e.g. Elvis Presley bornInYear 1935. Words are entities, e.g. "Elvis" means Elvis Presley. An entity is an instance of a class, e.g. Elvis Presley type singer. Classes are also entities, e.g. singer type class.
Knowledge Representation in YAGO - 3: Classes are arranged in a hierarchy, e.g. singer subClassOf person. Relations are also entities, e.g. subClassOf type atr (acyclic transitive relation). Each fact has a fact identifier, e.g. #1 foundIn Wikipedia.
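The representation described on these three slides can be sketched in a few lines of Python. This is only an illustrative toy, not YAGO's actual storage format: the `Fact` and `FactStore` names are hypothetical, and the only ideas taken from the slides are that facts are triples, that every fact gets an identifier, and that an identifier can itself be the subject of another fact.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Fact:
    fact_id: str   # e.g. "#1"; the identifier can be used as an entity
    subject: str
    relation: str
    obj: str


class FactStore:
    """Toy in-memory store (hypothetical helper, not part of YAGO)."""

    def __init__(self):
        self._ids = count(1)
        self.facts = {}

    def add(self, subject, relation, obj):
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = Fact(fact_id, subject, relation, obj)
        return fact_id


store = FactStore()
f1 = store.add("Elvis Presley", "hasWonAward", "Grammy Award")
f2 = store.add(f1, "foundIn", "Wikipedia")    # a fact about fact #1
store.add("Elvis Presley", "type", "singer")  # entity is an instance of a class
store.add("singer", "subClassOf", "person")   # class hierarchy
```

Because the fact identifier is an ordinary string here, a statement about a fact needs no special machinery: it is just another triple whose subject is `#1`.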
Key Contributions of YAGO: information extraction from Wikipedia (infoboxes, category pages); combination with the WordNet taxonomy; quality control (canonicalization, type checking).
Information Extraction - 1: Entities come from Wikipedia; each page title is a candidate entity. Wiki Markup Language (WML). Wikipedia dump as of September 2008.
Information Extraction - WML
Information Extraction Techniques (technique and its source): infobox harvesting (Wikipedia infoboxes); word-level techniques (Wikipedia redirects); category harvesting (Wikipedia categories); type extraction (Wikipedia categories, WordNet classes).
1. Information Extraction from Wikipedia – Infobox Harvesting (figure: Wikipedia infobox)
Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect): Born → bornOnDate. Relation Map (columns: Relation, Domain, Range): bornOnDate has domain person and range yagoDate. The infobox line "Born: January 8, 1935" yields the fact Elvis Presley bornOnDate January 8, 1935.
Attribute Map: Died → diedOnDate. Relation Map: diedOnDate has domain person and range yagoDate. The infobox line "Died: August 16, 1977" yields the fact Elvis Presley diedOnDate August 16, 1977.
Attribute Map: Genre → isOfGenre. Relation Map: isOfGenre has domain entity and range yagoClass. The infobox line "Genre: Rock and Roll" yields the fact Elvis Presley isOfGenre Rock and Roll.
Attribute Map: birth name → means (inverse: the attribute value becomes the subject). Relation Map: means has domain yagoWord and range entity. The infobox line "Birth Name: Elvis Aaron Presley" yields the fact Elvis Aaron Presley means Elvis Presley.
Manifold Attributes: Some attributes may have multiple values, e.g. a person may have multiple children. One fact is generated per value, e.g. one hasChild fact for each child.
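As a rough illustration of how an attribute map with inverse and manifold flags could drive infobox harvesting, consider the sketch below. The `AttributeRule` type, the `harvest` function, the `children` attribute key, and the comma as value separator are all assumptions made for the example; the slides only give the relation names and the flag columns.

```python
from typing import NamedTuple


class AttributeRule(NamedTuple):
    relation: str
    inverse: bool = False    # the attribute value becomes the subject
    manifold: bool = False   # the attribute may list several values


# Relation names follow the slides; the keys are assumed spellings.
ATTRIBUTE_MAP = {
    "born": AttributeRule("bornOnDate"),
    "died": AttributeRule("diedOnDate"),
    "genre": AttributeRule("isOfGenre"),
    "birth name": AttributeRule("means", inverse=True),
    "children": AttributeRule("hasChild", manifold=True),
}


def harvest(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, raw in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute.lower())
        if rule is None:
            continue  # attributes without a mapping produce no fact
        # Only manifold attributes are split, so dates keep their commas.
        values = raw.split(",") if rule.manifold else [raw]
        for value in (v.strip() for v in values):
            if rule.inverse:
                facts.append((value, rule.relation, entity))
            else:
                facts.append((entity, rule.relation, value))
    return facts


elvis_facts = harvest(
    "Elvis Presley",
    {"Born": "January 8, 1935", "Birth Name": "Elvis Aaron Presley"},
)
# "Person X", "Child A" and "Child B" are placeholders, not real data.
child_facts = harvest("Person X", {"Children": "Child A, Child B"})
```

Keeping the flags in the map, rather than in the harvesting loop, means a new attribute only needs a new table entry.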
Indirect Attributes - 1: Attribute Map: gdp ppp → hasGDP; gdp year → during (indirect). Some attributes do not concern the article entity but another fact, e.g. the GDP year attribute does not describe the article entity (Republic of Singapore) but qualifies the GDP fact with the year 2008. Therefore, the facts generated are: #14: Singapore hasGDP 238.755 billion; #14 during 2008.
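The Singapore example can be sketched as a two-pass harvest: direct attributes first, then indirect attributes that attach to the identifier of an earlier fact. The map layout (a relation plus the attribute whose fact it qualifies) and the function name are assumptions for illustration, and the numbering simply starts at 14 to mirror the slide.

```python
# attribute -> (relation, attribute whose fact it qualifies, or None if direct)
ATTRIBUTE_MAP = {
    "gdp ppp": ("hasGDP", None),
    "gdp year": ("during", "gdp ppp"),
}


def harvest_with_indirect(entity, infobox, first_id=1):
    """Return {fact id: (subject, relation, object)} for one infobox."""
    facts = {}
    id_by_attribute = {}
    next_id = first_id

    # Pass 1: direct attributes describe the article entity itself.
    for attribute, value in infobox.items():
        relation, target = ATTRIBUTE_MAP.get(attribute, (None, None))
        if relation is None or target is not None:
            continue
        fact_id = f"#{next_id}"
        next_id += 1
        facts[fact_id] = (entity, relation, value)
        id_by_attribute[attribute] = fact_id

    # Pass 2: indirect attributes describe a fact produced in pass 1.
    for attribute, value in infobox.items():
        relation, target = ATTRIBUTE_MAP.get(attribute, (None, None))
        if relation is None or target is None or target not in id_by_attribute:
            continue
        fact_id = f"#{next_id}"
        next_id += 1
        facts[fact_id] = (id_by_attribute[target], relation, value)

    return facts


singapore = harvest_with_indirect(
    "Singapore",
    {"gdp year": "2008", "gdp ppp": "238.755 billion"},
    first_id=14,
)
```

Two passes are used so that the result does not depend on the order in which the infobox lists its attributes.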
Indirect Attributes - 2 Singapore Infobox
Type of Infobox. Song infobox, American Pie: Released: October 1971; Format: vinyl record; Genre: Folk Rock; Length: 8:33 mins; Label: United Artists; Writer: Don McLean. Car infobox, Tesla Roadster: Manufacturer: Tesla Motors; Production: 2008-present; Class: Roadster; Length: 3,946 mm; Width: 1,873 mm; Height: 1,127 mm.
Type of Infobox: the Attribute Map is keyed by infobox type: car#length → hasLength; song#length → hasDuration. Song infobox: American Pie hasDuration 8:33. Car infobox: Tesla Roadster hasLength 3946.
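A minimal sketch of this type-qualified lookup, assuming the map is keyed by the pair (infobox type, attribute); the function name and the lowercase normalisation are illustrative choices, not something the slides specify.

```python
# (infobox type, attribute) -> relation, mirroring car#length and song#length
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Look up the relation for an attribute within a given infobox type."""
    key = (infobox_type.lower(), attribute.lower())
    return TYPED_ATTRIBUTE_MAP.get(key)  # None when no rule applies


song_relation = relation_for("Song", "Length")
car_relation = relation_for("Car", "Length")
```

The same attribute name thus yields `hasDuration` for American Pie and `hasLength` for the Tesla Roadster.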
Information Extraction - Word-Level Techniques. Wikipedia redirects: a virtual redirect page for "Presley, Elvis" links to "Elvis Presley". Each redirect gives a means fact, e.g. "Presley, Elvis" means Elvis Presley. Parsing person names: extract the name components and establish the relations givenNameOf and familyNameOf, e.g. Presley familyNameOf Elvis Presley; Elvis givenNameOf Elvis Presley.
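Both word-level techniques are simple enough to sketch directly. The functions below are hypothetical, and the name parser makes a strong simplifying assumption (first token is the given name, last token is the family name) that real person names often violate.

```python
def means_facts(redirects):
    """One means fact per redirect: the redirect title means the target entity."""
    return [(source, "means", target) for source, target in redirects.items()]


def name_facts(entity):
    """Naive name parsing: assumes 'Given ... Family' token order."""
    parts = entity.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into given and family name
    return [
        (parts[0], "givenNameOf", entity),
        (parts[-1], "familyNameOf", entity),
    ]


redirect_facts = means_facts({"Presley, Elvis": "Elvis Presley"})
elvis_names = name_facts("Elvis Presley")
```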
Wikipedia Categories
Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
Facts created from Wikipedia Categories: Rhine locatedIn Germany; Bryan Adams bornOnDate 1959; Bryan Adams hasWonAward Grammy Award; Abraham Lincoln politicianOf United States.
Information Extraction - Category Harvesting: Relational Categories
Table: Some Category Heuristics (regular expression → relation)
([0-9]{3,4}) births → bornOnDate
([0-9]{3,4}) deaths → diedOnDate
([0-9]{3,4}) establishments → establishedOnDate
([0-9]{3,4}) books|novels → writtenOnDate
Mountains|Rivers in (.*) → locatedIn
Presidents|Governors of (.*) → politicianOf
(.*) winners → hasWonPrize
[A-Za-z]+ (.*) winners → hasWonPrize
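The category heuristics can be tried out with Python's `re` module. This is a sketch under stated assumptions rather than the rules YAGO actually runs: the alternations are wrapped in explicit groups, `in` is widened to `in|of` and a leading "the" is skipped so that the slide's own examples ("Rivers of Germany", "Presidents of the United States") match, matching is case-insensitive for "winners", and only the first matching rule fires.

```python
import re

# (compiled pattern, relation); group 1 captures the object of the fact.
CATEGORY_RULES = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) (?:in|of) (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (?:the )?(.*)"), "politicianOf"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def category_facts(entity, categories):
    """Apply the first matching heuristic to each category name."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_RULES:
            match = pattern.fullmatch(category.strip())
            if match:
                facts.append((entity, relation, match.group(1)))
                break  # at most one fact per category
    return facts


bryan = category_facts(
    "Bryan Adams",
    ["Canadian Singers", "1959 births", "Living people", "Grammy Award Winners"],
)
rhine = category_facts("Rhine", ["Rift Valleys", "North Sea", "Rivers of Germany"])
```

Categories that match no rule, such as "Living people", simply produce no relational fact; they may still be used for type extraction.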
2. Connecting Wikipedia and WordNet – What is WordNet? A lexical database for the English language, created at the Cognitive Science Laboratory of Princeton University. Groups English words into sets of synonyms called synsets. Provides short, general definitions. Provides hypernym/hyponym relations, e.g. canine is a hypernym of dog, and dog is a hyponym of canine.
Connecting Wikipedia and WordNet – Type Extraction. Goal: create a class hierarchy, e.g. singer subClassOf performer subClassOf artist, using the hyponymy relation from WordNet. Example: the Wikipedia class 'American people in Japan' is a subclass of the WordNet class 'person'.
Classifications of Categories: conceptual categories, e.g. Albert Einstein is in 'Naturalized citizens of the United States'; administrative categories, e.g. Albert Einstein is in 'Articles with unsourced statements'; relational information, e.g. '1879 births'; thematic vicinity, e.g. 'Physics'.
Identification of Conceptual Categories: Only conceptual categories are used. Shallow linguistic parsing breaks a category name into pre-modifier, head and post-modifier, e.g. 'American people in Japan' becomes pre-modifier 'American', head 'people', post-modifier 'in Japan'. If the head is plural, the category is a conceptual category. The class is then extracted from the Wikipedia category and connected to a class from WordNet, e.g. the Wikipedia class 'American people in Japan' has to be made a subclass of the WordNet class 'person'.
Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1 head = headCompound(c)
2 pre = preModifier(c)
3 post = postModifier(c)
4 head = stem(head)
5 If there is a WordNet synset s for pre + head
6   return s
7 If there are WordNet synsets s1, ..., sn for head
8   (ordered by their frequency for head)
9   return s1
10 fail
Explanation of Algorithm
Input: 'American people in Japan'
1. Head: people
2. Pre-modifier: American
3. Post-modifier: in Japan
4. stem(head): person
5. If there is a WordNet synset for 'American person'
6. return that synset
7. If there are synsets s1, ..., sn for 'person'
8. (ordered by frequency for 'person')
9. return s1
10. Otherwise fail
Output: person
Result: American people in Japan subClassOf person
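The walkthrough above can be reproduced with a small runnable sketch. Everything that stands in for real linguistic resources is an assumption: `TOY_WORDNET` is a tiny hand-made dictionary with placeholder sense labels (not real WordNet data), the splitter treats the first preposition as the start of the post-modifier, and the stemmer only strips a trailing "s" or consults a one-entry irregular-plural table. The plural-head test for conceptual categories is included as the first check.

```python
PREPOSITIONS = {"in", "of", "from", "by", "at", "on"}
IRREGULAR_PLURALS = {"people": "person"}

# Placeholder lexicon: phrase -> senses, most frequent first.
TOY_WORDNET = {
    "person": ["person#1", "person#2", "person#3"],
    "singer": ["singer#1"],
}


def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    cut = next(
        (i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
        len(words),
    )
    if cut == 0:
        return "", "", " ".join(words)
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])


def stem(word):
    """Very naive singularisation; a real system would use a proper stemmer."""
    lower = word.lower()
    if lower in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lower]
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower


def wiki2wordnet(category):
    """Return a toy synset label for a conceptual category, else None."""
    pre, head, _post = split_category(category)
    if not head or stem(head) == head.lower():
        return None  # head is not plural: not a conceptual category
    head = stem(head)
    compound = f"{pre} {head}".strip().lower()
    if pre and compound in TOY_WORDNET:
        return TOY_WORDNET[compound][0]  # steps 5-6: synset for pre + head
    senses = TOY_WORDNET.get(head)
    return senses[0] if senses else None  # steps 7-9, or fail with None


result = wiki2wordnet("American people in Japan")
```

With the toy lexicon there is no entry for "american person", so the lookup falls through to the most frequent sense of "person", which matches the outcome of the walkthrough.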
Fig. : Word. Net search for “person” Fig. : Word. Net search for ‘American Person’
Exceptions: The result is a complete hierarchy of classes, with upper classes from WordNet and leaves from Wikipedia. Two dozen cases failed, e.g. categories with the head compound "capital": in Wikipedia it means "capital city", whereas in WordNet it means "financial asset". These cases were corrected manually.
3. Quality Control. Canonicalization: each fact and each entity reference is unique; an entity is always referred to by the same identifier in all facts in YAGO. Type checking: eliminates individuals that do not have a class, and eliminates facts that do not respect domain and range constraints; an argument of a fact in YAGO is always an instance of the class required by the relation.
Canonicalization - 1: Redirect Resolution. Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments. These links may not be correct Wikipedia page identifiers, so each argument is checked and, where necessary, replaced by the correct, redirected identifier. E.g. Hermitage Museum locatedIn St. Petersburg becomes Hermitage Museum locatedIn Saint Petersburg.
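Redirect resolution amounts to following a redirect table until a real page identifier is reached. The sketch below assumes the redirects are available as a plain dictionary; the function names are hypothetical, and the loop guard against redirect cycles is an added precaution rather than something the slide mentions.

```python
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}


def resolve(identifier, redirects):
    """Follow redirects until a non-redirect identifier is reached."""
    seen = set()
    while identifier in redirects and identifier not in seen:
        seen.add(identifier)  # guard against redirect cycles
        identifier = redirects[identifier]
    return identifier


def canonicalize(fact, redirects):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    return (resolve(subject, redirects), relation, resolve(obj, redirects))


canonical = canonicalize(
    ("Hermitage Museum", "locatedIn", "St. Petersburg"), REDIRECTS
)
```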
Canonicalization - 2: Removal of Duplicate Facts. Sometimes two heuristics deliver the same fact; canonicalization eliminates one of them. E.g. the category '1935 births' yields the fact Elvis Presley bornOnDate 1935, and the infobox attribute 'Born: January 8, 1935' yields the fact Elvis Presley bornOnDate January 8, 1935.
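One possible way to implement this step is sketched below. The slide only says that one of the two facts is eliminated; the sketch assumes that the less specific value is the one dropped, and it uses plain substring containment as a crude stand-in for "less specific". Both choices are assumptions for illustration.

```python
def drop_less_precise(facts):
    """Keep one fact per (subject, relation) when one value contains another."""
    unique = list(dict.fromkeys(facts))  # exact duplicates removed, order kept
    kept = []
    # Visit longer values first so a shorter value can be compared against them.
    for subject, relation, value in sorted(
        unique, key=lambda f: len(f[2]), reverse=True
    ):
        covered = any(
            ks == subject and kr == relation and value in kv
            for ks, kr, kv in kept
        )
        if not covered:
            kept.append((subject, relation, value))
    return kept


deduplicated = drop_less_precise([
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("Elvis Presley", "bornIn", "Tupelo"),
])
```

A production system would compare parsed dates rather than strings, since substring containment can produce false matches on unrelated values.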
Type Checking - 1. Reductive type checking: sometimes the class of an entity cannot be determined; facts about such entities are discarded, e.g. Wikipedia entities that have been proposed for an article but do not have a page yet. Inductive type checking: type constraints can be used to generate facts, e.g. from Elvis Presley bornOnDate January 8, 1935 it follows that Elvis Presley is a person. A regular-expression check ensures that the entity name follows the pattern of a given name and a family name.
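The two checks can be combined in a short sketch. It makes several simplifying assumptions that are not in the slides: each entity has at most one class, only domain constraints are checked (range checks are omitted), the name pattern is a simple capitalised-words regular expression, and the `river` class and the "Unwritten Article" entity are made-up test data.

```python
import re

DOMAINS = {"bornOnDate": "person", "diedOnDate": "person"}
SUBCLASS_OF = {"singer": "person"}  # child class -> parent class
PERSON_NAME = re.compile(r"[A-Z][a-z]+( [A-Z][a-z]+)+")


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def type_check(facts, types):
    """Return (facts that pass the checks, types including inferred ones)."""
    types = dict(types)
    # Inductive step: a person-only relation plus a name-like identifier
    # lets us infer that an untyped subject is a person.
    for subject, relation, _ in facts:
        if (
            subject not in types
            and DOMAINS.get(relation) == "person"
            and PERSON_NAME.fullmatch(subject)
        ):
            types[subject] = "person"
    # Reductive step: drop facts with untyped subjects or violated domains.
    kept = []
    for subject, relation, obj in facts:
        cls = types.get(subject)
        if cls is None:
            continue
        domain = DOMAINS.get(relation)
        if domain is not None and not is_a(cls, domain):
            continue
        kept.append((subject, relation, obj))
    return kept, types


checked, inferred = type_check(
    [
        ("Elvis Presley", "bornOnDate", "January 8, 1935"),
        ("Rhine", "bornOnDate", "1959"),
        ("Unwritten Article", "locatedIn", "Nowhere"),
    ],
    {"Rhine": "river"},
)
```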
Type Checking - 2: Type Coherence Checking. Sometimes classification yields wrong results, e.g. Abraham Lincoln is an instance of 13 classes: 12 are subclasses of the class 'person' (e.g. lawyer, president), and the 13th is the class 'cabinet'. The class hierarchy of YAGO is partitioned into branches, e.g. locations, artifacts, people, other physical entities, and abstract entities. The branch that most types lead to is determined, and the other types are purged.
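Type coherence checking is essentially a majority vote over branches, which can be sketched as follows. The `BRANCH_OF` table is a hypothetical stand-in for walking up the class hierarchy, and placing 'cabinet' in the artifacts branch is an assumption; the slide does not say which branch it falls into.

```python
from collections import Counter

# class -> top-level branch (assumed assignments for illustration)
BRANCH_OF = {
    "lawyer": "people",
    "president": "people",
    "cabinet": "artifacts",
}


def purge_incoherent_types(classes, branch_of):
    """Keep only the classes that lie in the most common branch."""
    if not classes:
        return []
    votes = Counter(branch_of[c] for c in classes)
    winner, _count = votes.most_common(1)[0]
    return [c for c in classes if branch_of[c] == winner]


lincoln_types = purge_incoherent_types(
    ["lawyer", "president", "cabinet"], BRANCH_OF
)
```

In the Abraham Lincoln example the people branch wins twelve to one, so 'cabinet' is the type that gets purged.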
References
YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany.
Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University.
Wikipedia: http://en.wikipedia.org/wiki/Main_Page
WordNet: http://wordnet.princeton.edu/
Thank You, Any Questions?