- Number of slides: 18
Extracting Lexical Reference Rules from Wikipedia
Eyal Shnarch, Libby Barak, Ido Dagan
Bar Ilan University, Israel

Motivation: lexical inference
• Question Answering: “Which British luxury car was the best selling of 2008?”
• Information Retrieval: “The Beatles” (slide graphic: Abbey Road, George Harrison)

Lexical reference
• Lexical Reference (LR)
– a term in a text implies concrete reference to the meaning of a target term (Glickman et al., 2006)
– Narrower than similarity / lexical association
– Wider than the common lexical relations
  • synonymy, hyponymy, meronymy, etc.
– Slide graphic: The Beatles; Abbey Road, Yellow Submarine, Sgt. Pepper
• Classical LR resources:
– WordNet: covers dictionary knowledge, costly to build, partial
– Distributional similarity: low precision
– Indicative (Hearst) patterns: limited coverage, mostly IS-A

Goals
• Automatically learn LR rules
– from a knowledge base created for human consumption (Wikipedia)
– Focusing on generic information elements in a Web resource
– We did not utilize Wikipedia-specific features:
  • Info-boxes, category tags, list pages, disambiguation pages
– Publicly available rule-base
• Improve inference applications by applying LR rules

Extraction methods
• Be-complement: a noun in the position of a complement of the verb ‘be’
• All-nouns: all nouns in the definition
• Redirect: various terms redirected to a canonical title
• Parenthesis: a means of disambiguation
• Link: hyperlinks in the entire page

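As a rough illustration of the two simplest methods listed above, the following Python sketch derives candidate rules from page titles and redirects. The helper names, the example title, the direction of each rule, and the choice to emit redirect rules in both directions are all assumptions made for illustration, not details taken from the slides.

```python
import re

# Matches titles such as "Graceland (album)": a term followed by a
# parenthesized disambiguating gloss at the end of the title.
_PAREN_TITLE = re.compile(r"^(?P<term>.+?)\s*\((?P<gloss>[^()]+)\)$")


def parenthesis_rule(title):
    """Return a (term, gloss) candidate rule, or None if no parenthesis."""
    match = _PAREN_TITLE.match(title.strip())
    if match is None:
        return None
    return (match.group("term"), match.group("gloss"))


def redirect_rules(redirects):
    """Turn {alias: canonical_title} into candidate rules.

    Emitting both directions is an assumption of this sketch.
    """
    rules = []
    for alias, canonical in redirects.items():
        rules.append((alias, canonical))
        rules.append((canonical, alias))
    return rules


# Hypothetical inputs, for illustration only.
print(parenthesis_rule("Graceland (album)"))
print(redirect_rules({"GCC": "Gulf Cooperation Council"}))
```

The Be-complement and All-nouns methods would additionally need a syntactic parse of the definition sentence, which this sketch leaves out.
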
Output analysis
• 8 million candidate rules learnt
– Mostly Named Entities (as expected), but many common words/terms
• Manual analysis:
– 800 rules sampled, annotated for LR (Kappa 0.7)

Per method:
Extraction method    Precision %    Est. # correct rules
Redirect             87             1.8 M (33%)
Be-complement        78             1.6 M (29%)
Parenthesis          71             0.09 M (2%)
Link                 70             0.5 M (9%)
All-nouns            49             1.5 M (27%)
Total                               5.5 M (100%)

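The table's figures can be cross-checked with a few lines of arithmetic. The Python sketch below divides each method's estimated number of correct rules by its precision to get the implied number of candidates, and recomputes each method's share of the correct rules. It assumes the methods' rule sets do not overlap, which the slide does not state; the variable names are illustrative.

```python
# (precision, estimated correct rules in millions), as read from the table
per_method = {
    "Redirect": (0.87, 1.8),
    "Be-complement": (0.78, 1.6),
    "Parenthesis": (0.71, 0.09),
    "Link": (0.70, 0.5),
    "All-nouns": (0.49, 1.5),
}

# Correct rules = candidates * precision, so candidates = correct / precision.
implied_candidates = {
    name: correct / precision for name, (precision, correct) in per_method.items()
}

total_correct = sum(correct for _, correct in per_method.values())
total_candidates = sum(implied_candidates.values())

# Share of all correct rules contributed by each method, in whole percent.
shares = {
    name: round(100 * correct / total_correct)
    for name, (_, correct) in per_method.items()
}

print(f"total correct: {total_correct:.2f} M")
print(f"implied candidates: {total_candidates:.2f} M")
print(shares)
```

Under the no-overlap assumption, the implied candidate total comes out close to the 8 million candidate rules reported on the slide.
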
Interesting relations in All-nouns

Relation      Rule                                    Text
Location      Lyon → France                           Lyon city in France
Occupation    Thomas H. Cormen → computer science     Thomas H. Cormen professor of computer science
Creation      The Da Vinci Code → Dan Brown           The Da Vinci Code novel by Dan Brown
Origin        Willem van Aelst → Dutch                Willem van Aelst Dutch artist
Alias         Dean Moriarty → Benjamin Linus          Dean Moriarty alias of Benjamin Linus on Lost
Spelling      Egushawa → Agushaway                    Egushawa, also spelled Agushaway...

Need to better utilize this method

Ranking All-nouns rules
• Nouns in the definition vary in their likelihood to be referred to by the title
– Depends greatly on the syntactic path connecting the title and the noun
[Figure: dependency-path diagram; surviving labels: film, subj, vrel]

All-nouns analysis

                       Per method                           Accumulated
Extraction method      Precision %   Est. # correct rules   Precision %   Correct rules %
Redirect               87            1,851,384              87            31
Be-complement          78            1,618,913              82            60
Parenthesis            71            94,155                 82            60
Link                   70            485,528                80            68
All-nouns (top)        60            684,238                76            83
All-nouns (middle)     46            380,572                72            90
All-nouns (bottom)     41            515,764                66            100

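One way to read the "Accumulated" columns is as running totals over the methods in the order listed. The Python sketch below recomputes them from the per-method columns, assuming the methods' rule sets are disjoint. That assumption is not stated on the slide, and the slide's own figures may have been derived differently (for example, by accounting for rules found by more than one method), so the output need not match the table exactly.

```python
# (method, per-method precision %, estimated number of correct rules)
rows = [
    ("Redirect", 87, 1_851_384),
    ("Be-complement", 78, 1_618_913),
    ("Parenthesis", 71, 94_155),
    ("Link", 70, 485_528),
    ("All-nouns (top)", 60, 684_238),
    ("All-nouns (middle)", 46, 380_572),
    ("All-nouns (bottom)", 41, 515_764),
]


def accumulate(rows):
    """Return (method, accumulated precision %, accumulated correct %)."""
    total_correct = sum(correct for _, _, correct in rows)
    cum_correct = 0.0
    cum_candidates = 0.0
    result = []
    for name, precision_pct, correct in rows:
        cum_correct += correct
        # Implied candidates for this method = correct / precision.
        cum_candidates += correct / (precision_pct / 100.0)
        result.append(
            (
                name,
                100.0 * cum_correct / cum_candidates,
                100.0 * cum_correct / total_correct,
            )
        )
    return result


accumulated = accumulate(rows)
for name, acc_precision, acc_share in accumulated:
    print(f"{name:20s} {acc_precision:5.1f} {acc_share:5.1f}")
```
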
Error analysis

Improve precision: rule filtering
• Incorrect rules tend to relate terms that are unlikely to co-occur
• Filter by Dice coefficient threshold
– slide examples (rule layout lost in extraction): *Subset sum problem, cryptography; magic, cryptography
• Partially overcome the "Wrong NP part" error by adjusting Dice
– slide example (rule layout lost in extraction): *aerial tramway, car, cable car

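To make the filtering step concrete, here is a minimal Python sketch of thresholding rules by the Dice coefficient, 2·C(x, y) / (C(x) + C(y)), where C denotes occurrence and co-occurrence counts. The function names, the toy counts, and the threshold are hypothetical. The adjusted variant, which discounts co-occurrences where the right-hand side appears only inside a larger noun phrase such as "cable car", is one plausible reading of "adjusting Dice" and not necessarily the authors' exact formula.

```python
def dice(count_xy, count_x, count_y):
    """Standard Dice coefficient from co-occurrence and marginal counts."""
    denom = count_x + count_y
    return 2.0 * count_xy / denom if denom else 0.0


def adjusted_dice(count_xy, count_xy_in_larger_np, count_x, count_y):
    """Assumed adjustment: ignore co-occurrences inside a larger NP."""
    effective = max(count_xy - count_xy_in_larger_np, 0)
    return dice(effective, count_x, count_y)


def keep_rule(score, threshold):
    """Keep a rule only if its score reaches the threshold."""
    return score >= threshold


# Made-up counts: the two terms co-occur 10 times, 8 of them only because
# the right-hand side is part of a larger noun phrase.
plain = dice(10, 20, 30)
adjusted = adjusted_dice(10, 8, 20, 30)
print(plain, keep_rule(plain, 0.1))
print(adjusted, keep_rule(adjusted, 0.1))
```

With these toy numbers the plain score passes the threshold while the adjusted score does not, which is the effect the adjustment is meant to have on rules like the starred example.
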
Two task-based evaluations
• Unsupervised Text Categorization:
– 20 Newsgroups collection
– Given a category name, expand it using LR rules, e.g. Cryptography: cryptology, cryptographic, cryptographer, decrypt, adversary, certificate, digital signature, cipher
– Compare document and expanded category name
  • Cosine similarity score
  • Classify to best-scoring category (single-classification)
• Recognizing Textual Entailment (RTE)
– Usage within inference engine (Bar-Haim et al., 2008)

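The text categorization setup described above can be sketched in a few lines of Python. This is a simplified illustration, not the evaluated system: it assumes whitespace-tokenized documents, raw term counts as vector weights, and single-word expansion terms, and the `rules` mapping and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(category, rules):
    """Category name plus the terms that LR rules map to it."""
    return Counter([category] + rules.get(category, []))


def classify(doc_tokens, categories, rules):
    """Assign the document to the single best-scoring category."""
    doc = Counter(doc_tokens)
    return max(categories, key=lambda c: cosine(doc, expand(c, rules)))


# Toy rule base: category name -> terms referring to it (illustrative).
rules = {
    "cryptography": ["cipher", "decrypt", "cryptographer"],
    "medicine": ["doctor", "physician", "treatment"],
}
doc = "the cipher was hard to decrypt".split()
print(classify(doc, ["medicine", "cryptography"], rules))
```

Without expansion, the toy document shares no term with either category name, so both scores are zero; the expansion terms are what let the cosine score separate the categories.
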
Wikipedia’s contribution
• Rule-base utility by TC system:
– Politics: opposition, coalition, whip
– Cryptography: key exchange, certificate, cryptosystem, digital signature
– Mac: Radius, PowerBook, Grap
– Religion: belief, heaven, creation, missionary
– Medicine: doctor, physician, treatment, clinical, MD
• Rule-base utility by RTE system:
– Michael Crichton, Jurassic Park
– Gulf Cooperation Council, GCC

Results: text categorization

Rule base                            Recall %   Precision %   F1
Baselines
  No Expansions                      19         54            28
  Kazama & Torisawa, 2007            19         53            28
  Snow 400K                          19         54            28
  Lin dependency similarity          25         39            30
  WordNet                            30         47            37
Extraction methods from Wikipedia
  Redirect + Be-complement           22         55            31
  All rules                          31         38            34
  All rules + Dice                   31         49            38
Union
  WordNet + Wiki (all rules + Dice)  35         47            40

• Wikipedia’s performance comparable to WordNet
• Union works best (complementary)

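The F1 column can be recomputed from the recall and precision columns as the harmonic mean, F1 = 2PR / (P + R). The short Python check below does this for every row of the table; the dictionary layout and names are just for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


# rule base -> (recall %, precision %, reported F1), as read from the table
rows = {
    "No Expansions": (19, 54, 28),
    "Kazama & Torisawa, 2007": (19, 53, 28),
    "Snow 400K": (19, 54, 28),
    "Lin dependency similarity": (25, 39, 30),
    "WordNet": (30, 47, 37),
    "Redirect + Be-complement": (22, 55, 31),
    "All rules": (31, 38, 34),
    "All rules + Dice": (31, 49, 38),
    "WordNet + Wiki (all rules + Dice)": (35, 47, 40),
}

recomputed = {
    name: round(f1(precision, recall))
    for name, (recall, precision, _) in rows.items()
}

for name, (_, _, reported) in rows.items():
    print(f"{name:36s} reported={reported} recomputed={recomputed[name]}")
```
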
Results: RTE

System configuration    Accuracy %   Accuracy drop %
WordNet + Wikipedia     60.0         -
without Wikipedia       58.9         1.1
without WordNet         57.7         2.3

– External knowledge resources typically contribute around 0.5-2% in accuracy for current RTE systems (Iftene and Balahur-Dobrescu, 2007; Dinu and Wang, 2009).

Conclusions
• Future work:
– Improve rule ranking criteria
– Exploit graph structure (slide graphic: Adi Shamir, Cryptographer, Cryptography)
• Large-scale resource of lexical reference rules
– proven beneficial within two application settings
• Automatically built resource, comparable to WordNet, that provides complementary knowledge
– Combination of resources much more effective than each alone
• Use our resource (and cite us)
– will soon be publicly available


