

Inducing Ontologies from Folksonomies using Natural Language Understanding
Marta Tatu, Dan Moldovan
Lymba Corporation
Presenter: Chris Irwin Davis

Overview
Folksonomy → NLP → Ontology
• lexical normalization of tags
   • typographical errors, spelling variations
   • singular/plural forms, lower case
   • space/punctuation used as delimiters
• semantic consistency
   • same tag in different contexts
   • tag synonymy
• tag-tag relations
§ folksonomy-based applications
   § social annotations (author vs. user)
   § browse/search bookmarks
   § resource discovery (recommendations)
   § collaborative tagging (across folksonomies)
§ reasoning applications
LREC 2010, May 19th, 2010

Semantic Approach
1. Folksonomy semantic representation
2. Tag understanding
   o Lexical: language identification, tokenization and spelling corrections, capitalization restoration
   o Syntactic: part-of-speech tagging, syntactic parsing
   o Semantic: acronym understanding, word sense disambiguation, named entity recognition, semantic parsing
3. Deriving the ontological structure
   o Semantic relations between tags
• Sources of information
   o Tag text semantics
   o Social bookmarking annotations
   o Machine understanding of bookmark content

Representing Folksonomies
• knowledge[NN]1
• advertisign: advertising[NN]1
• americanhistory: American[JJ]1 TOPIC history[NN]2
• read-now: now[RB]3 TEMPORAL read[VB]1
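The notation on this slide pairs each raw tag with one or more concepts written as lemma[POS]sense, optionally linked by a semantic relation such as TOPIC or TEMPORAL. A minimal Python sketch of such a structure could look like the following. The class and field names are hypothetical and are not taken from the authors' system; only the example values come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Concept:
    """One disambiguated word: lemma, part-of-speech tag, sense number."""
    lemma: str
    pos: str
    sense: int

    def __str__(self) -> str:
        return f"{self.lemma}[{self.pos}]{self.sense}"


@dataclass(frozen=True)
class TagRepresentation:
    """A raw tag plus its concepts; `relation` links two concepts when present."""
    tag: str
    concepts: Tuple[Concept, ...]
    relation: Optional[str] = None

    def __str__(self) -> str:
        if self.relation is None:
            return " ".join(str(c) for c in self.concepts)
        first, second = self.concepts
        return f"{first} {self.relation} {second}"


# Example values copied from the slide (hypothetical container types).
advertising = TagRepresentation("advertisign", (Concept("advertising", "NN", 1),))
american_history = TagRepresentation(
    "americanhistory",
    (Concept("American", "JJ", 1), Concept("history", "NN", 2)),
    relation="TOPIC",
)
read_now = TagRepresentation(
    "read-now",
    (Concept("now", "RB", 3), Concept("read", "VB", 1)),
    relation="TEMPORAL",
)

print(advertising)       # advertising[NN]1
print(american_history)  # American[JJ]1 TOPIC history[NN]2
print(read_now)          # now[RB]3 TEMPORAL read[VB]1
```

Keeping the raw tag next to its normalized concepts means the misspelled "advertisign" stays retrievable while later steps work only with the corrected, sense-tagged form.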

Representing Folksonomies
Associated (user, document) pairs

Representing Folksonomies [figure]

System Architecture [figure]

Tag Understanding
Sources used to understand tags: tag text, document content, social bookmarking data
[Table: an X marks each source used by a processing step]
• Lexical: language identification; tokenization and spell checking; capitalization restoration
• Syntactic: part-of-speech tagging; syntactic parsing
• Semantic: abbreviation and acronym expansion; word sense disambiguation (+ NER); semantic parsing

Acronym/Abbreviation Understanding
• Abbreviation dictionary: (abbreviation - expansion - domain of usage)
   o 118,055 distinct abbreviations
   o 137 domains: Law, Music, TV/Radio Stations, Countries, Airport, Domain Names, Chat, Emoticons, etc.
   o 25% of the abbreviations have more than one definition
      • (unambiguous) Zip codes – (76012 : Arlington, TX)
      • (ambiguous) SS : 192 definitions in 66 domains
         o Social Security – Business and US Government, Screen Saver – File Extensions, Stainless Steel – Housing and Products, Subtropical Storm – Meteorology, Style Sheet – Software
• Check whether the tag is part of the abbreviation dictionary
• Use lexical chains to link document content to abbreviation domain
• Use co-occurring tags to identify correct expansion
• Use text alignment to find new abbreviation definitions within document content
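The last bullet, finding new abbreviation definitions inside the document content, can be illustrated with a very simple stand-in: scan the text for runs of consecutive words whose initial letters spell the abbreviation. This is only a sketch under that assumption. The slide does not describe the authors' text alignment method, and the function name `find_expansions` is hypothetical.

```python
import re
from collections import Counter


def find_expansions(abbreviation: str, text: str) -> Counter:
    """Count word sequences in `text` whose initials spell `abbreviation`.

    Simplifying assumption: one word per abbreviation letter, matched
    case-insensitively. Real alignment would also need to handle stop
    words, multi-letter contributions, and punctuation boundaries.
    """
    letters = abbreviation.lower()
    words = re.findall(r"[A-Za-z]+", text.lower())
    n = len(letters)
    found = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if all(word[0] == letter for word, letter in zip(window, letters)):
            found[" ".join(window)] += 1
    return found


sample = (
    "Public relations teams write a press release. "
    "Good public relations matter."
)
print(find_expansions("PR", sample).most_common(1))  # [('public relations', 2)]
```

Candidates found this way would still need to be filtered, for example against the abbreviation dictionary's domains, before being accepted as new definitions.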

Acronym/Abbreviation Understanding
• “PR” ~ 1409 documents
• 87 definitions for PR
   o Press Release, Public Relations, Puerto Rico, Page Rank, Public Radio, Permanent Resident/Residency, etc.
• http://prsarahevans.com/2009/06/do-you-have-a-strategy-for-online-comments
   o “PR” = “public relations” (6 times in document content)
   o Other tags of the bookmark: “public”, “relations”, “media”, “strategy”
• http://www.bbc.co.uk/pressoffice/pressreleases/category/new_media_index.shtml
   o “PR” = “press releases” (in document content)
• http://escape.topuertorico.com
   o “PR” = “Puerto Rico” (in document content)
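The “PR” examples combine two kinds of evidence: how often a candidate expansion occurs in the document content, and whether its words appear among the bookmark's other tags. A minimal sketch of such a scoring step follows. The additive scoring and the name `choose_expansion` are assumptions made for illustration, not the authors' method, and the sample document text is invented.

```python
from typing import Iterable, List, Optional


def choose_expansion(
    candidates: List[str], document_text: str, co_tags: Iterable[str]
) -> Optional[str]:
    """Pick the candidate expansion best supported by content and co-occurring tags.

    Score (an assumption): occurrences of the phrase in the document
    plus the number of its words that also appear as tags.
    Returns None when no candidate has any support.
    """
    text = document_text.lower()
    tags = {tag.lower() for tag in co_tags}
    best, best_score = None, 0
    for expansion in candidates:
        phrase = expansion.lower()
        score = text.count(phrase) + sum(1 for word in phrase.split() if word in tags)
        if score > best_score:
            best, best_score = expansion, score
    return best


candidates = ["press release", "public relations", "Puerto Rico", "page rank"]

# Invented document text; the tags are those listed on the slide.
doc = "Public relations strategy for online comments. Public relations pros agree."
print(choose_expansion(candidates, doc, ["public", "relations", "media", "strategy"]))
# public relations
```

Returning `None` when nothing matches mirrors the main error source reported in the evaluation: a tag that cannot be identified within the document.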

Evaluation
• Experimental data
   o ~150,000 (user, document, tag) triples from del.icio.us
      • 8,460 tags; 83,827 documents; 58,198 users
• Main error source: tag cannot be identified within document
   o Lack of document content (images, non-EN content, etc.)
• Errors propagate from initial processing steps to later ones
   o Bad capitalization leads to bad named entity recognition

Ontological Tag-Tag Relations
• EQUALITY relations
   o same lemma, part-of-speech, and sense number
   o EQ(activity, activities), EQ(after-effects, AfterEffects), EQ(opinion, Opnion), etc.
• SYNONYMY clusters
   o Same synset id
   o SYN(OS, operating.system), SYN(LA, losangeles), SYN(nyt, nytimes)
• ISA relations between named entities and type tags
   o ISA(OracleCorporation, organization), ISA(davidfosterwallace, person)
• WordNet relations between tags
   o ISA(vegan, vegetarian), ANTONYMY(peace, war), PART_WHOLE(Businesses, markets), ENTAIL(proofreading, +read), SIMILARITY(important, general), DOMAIN(light, physics)
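EQUALITY relations follow directly from the tag representation: two tags are equal when their (lemma, part-of-speech, sense number) triples match. The sketch below groups tags on that key. It assumes the tags have already been analyzed, and the sense numbers in the sample data are hypothetical placeholders. SYN clusters could be built the same way with a synset id as the key.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Analysis = Tuple[str, str, int]  # (lemma, part-of-speech, sense number)


def eq_clusters(analyses: Dict[str, Analysis]) -> Dict[Analysis, List[str]]:
    """Group raw tags that share the same lemma, POS, and sense number."""
    clusters: Dict[Analysis, List[str]] = defaultdict(list)
    for tag, key in analyses.items():
        clusters[key].append(tag)
    return dict(clusters)


# Tags from the slide; the sense numbers are hypothetical.
analyses = {
    "activity": ("activity", "NN", 1),
    "activities": ("activity", "NN", 1),
    "opinion": ("opinion", "NN", 1),
    "Opnion": ("opinion", "NN", 1),
    "war": ("war", "NN", 1),
}

for key, tags in eq_clusters(analyses).items():
    print(key, tags)
```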

Ontological Tag-Tag Relations
• Lexical chains of size 2 and semantic calculus
   – tag1 rel1 synset rel2 tag2
      • rel1 & rel2 ⇒ rel3
      • rel3(tag1, tag2) is added to the ontology
   – ISA(integration, events, ) ⇐ ISA(integration, group_action/NN/1) and ISA(group_action/NN/1, events, )
   – PART_WHOLE(lobby, hotels) ⇐ PART_WHOLE(lobby, building/NN/1) and ISA(building/NN/1, hotels)
• ISA relations between “modifier head” and “head” tags
   – ISA(book-cover, covers)
   – ISA(theoryofmind, theory)
   – ISA(photoshoptutorials, )
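The size-2 lexical chain step can be read as a lookup in a relation composition table: when tag1 reaches an intermediate synset by rel1 and that synset reaches tag2 by rel2, the table gives the rel3 to add between the two tags. The sketch below encodes only the two compositions visible in the slide's examples. The full rule set of the semantic calculus is not given here, so the table is an illustrative assumption, as are the names `COMPOSITION` and `derive`.

```python
from typing import Optional, Tuple

Relation = Tuple[str, str, str]  # (relation name, source, target)

# Only the two compositions shown in the slide's examples (assumed table).
COMPOSITION = {
    ("ISA", "ISA"): "ISA",
    ("PART_WHOLE", "ISA"): "PART_WHOLE",
}


def derive(first: Relation, second: Relation) -> Optional[Relation]:
    """Compose rel1(tag1, synset) and rel2(synset, tag2) into rel3(tag1, tag2)."""
    rel1, tag1, middle1 = first
    rel2, middle2, tag2 = second
    if middle1 != middle2:
        return None  # the two links do not share the intermediate synset
    rel3 = COMPOSITION.get((rel1, rel2))
    if rel3 is None:
        return None  # no rule for this pair of relations
    return (rel3, tag1, tag2)


print(derive(("ISA", "integration", "group_action/NN/1"),
             ("ISA", "group_action/NN/1", "events")))
# ('ISA', 'integration', 'events')
print(derive(("PART_WHOLE", "lobby", "building/NN/1"),
             ("ISA", "building/NN/1", "hotels")))
# ('PART_WHOLE', 'lobby', 'hotels')
```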

Ontological Tag-Tag Relations
• Relations between “modifier_i head_i” tags (i = 1, 2)
   – ISA(build-solar-panel, create-solar-panel)
   – SIMILARITY(socialnetworks, socialweb)
• Rule:
   (modifier1 ISA modifier2 & head1 ISA head2)
   OR (modifier1 ISA modifier2 & head1 SYN head2)
   OR (modifier1 SYN modifier2 & head1 ISA head2)
   ⇒ “modifier1 head1” REL “modifier2 head2”

Evaluation
• 9,820 EQ clusters for the 8,460 unique tags
   o Same abbreviation expanded to different definitions
   o EQ: tutorial, tutorials,
• 8,801 SYN clusters
   o Largest cluster (133 bookmarks): car, automobiles, autos, cars, automobile
• 17% of tags placed into incorrect SYN cluster
   o Errors caused by imperfect word sense disambiguation
• 5,439 ontological tag-tag relations
   o 3,869 ISA, 601 SIMILARITY, 429 PART_WHOLE, etc.
   o 1,778 relations derived using WordNet’s lexical chains and Lymba’s semantic calculus rules

Folksonomic Ontology
• Portion of ontology generated from experimental folksonomy [figure]

Folksonomic Ontology
• Portion of ontology generated from experimental folksonomy [figure]

Thank you!
For questions: email marta@lymba.com