cd0d32d80fa488a79cc531ce17ed3315.ppt
- Number of slides: 23
Interpreting noun compounds using paraphrases
András Dobó, University of Oxford
Stephen G. Pulman, University of Oxford
Interpreting noun compounds using paraphrases
1. Motivation
2. Related work
3. Method
4. Results
5. Summary
6. Future work
Motivation
- English is full of noun compounds, which are sequences of nouns acting as a single noun
- Their interpretation is crucial for many NLP tasks
- Using dictionaries is infeasible, so automated methods are needed
Related work
- Statistical approaches
- Web queries or large corpora
- Two main categories of methods
  - Inventory-based approaches
    - Small number of abstract relational categories
    - Criticized for numerous reasons
  - Paraphrasing approaches
    - Verbs and prepositions as paraphrases
    - Water bottle = bottle that is for water, giving the paraphrase "be for"
Method
- Paraphrasing method
- Ranked list of paraphrases for each noun compound (NC)
- Uses large corpora to search for paraphrases
  - Second noun is the head
  - subject = second noun, object = first noun
- Validates paraphrases using web queries
- Two main approaches in the search for paraphrases
Subject-paraphrase-object triples
- Counts the frequency of all (subject, paraphrase, object) triples in the corpus
- Then for each NC it searches for those triples where subject = second noun, object = first noun
- This yields a list of suitable paraphrases for each NC
- Ranks paraphrases for each NC using a scoring method based on their frequency
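The triple-based search described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_paraphrases`, the tuple layout, and the toy triples are all assumptions, and a real system would read the triples from a parsed corpus.

```python
from collections import Counter


def rank_paraphrases(triples, noun_compound):
    """Rank paraphrases for a two-noun compound by raw triple frequency.

    `triples` is an iterable of (subject, paraphrase, object) tuples
    (hypothetical input format); `noun_compound` is (first noun, second noun).
    """
    first_noun, second_noun = noun_compound
    counts = Counter(triples)
    matching = Counter()
    for (subject, paraphrase, obj), freq in counts.items():
        # The head (second noun) acts as subject, the first noun as object.
        if subject == second_noun and obj == first_noun:
            matching[paraphrase] += freq
    # most_common() sorts by descending frequency.
    return matching.most_common()


# Toy data, invented purely for illustration.
toy_triples = (
    [("bottle", "hold", "water")] * 3
    + [("bottle", "be for", "water")] * 2
    + [("bottle", "contain", "milk")]
)
ranking = rank_paraphrases(toy_triples, ("water", "bottle"))
```

Here `ranking` lists "hold" before "be for", because the toy corpus contains three matching triples for the former and two for the latter; the "milk" triple is ignored since its object is not the first noun.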
Subject-paraphrase and paraphrase-object pairs
- Counts the frequency of all (subject, paraphrase) and (paraphrase, object) pairs in the corpus
- Then for each NC it searches for those pairs where subject = second noun, object = first noun
- This yields two lists of paraphrases for each NC
- Ranks paraphrases for each NC using a scoring method based on their frequency
Scoring methods
- Subject-paraphrase-object triples version:
  - Simply the frequency of the relevant (subject, paraphrase, object) triple
- Subject-paraphrase and paraphrase-object pairs version:
  - Using frequencies is not suitable
  - The product of the pointwise mutual information of the relevant (subject, paraphrase) and (paraphrase, object) pairs
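The pairs-based score can be illustrated with a short sketch. The slides do not say how the probabilities are estimated, so the relative-frequency estimates below are an assumption, as are the names `pmi` and `pair_score` and the toy counts.

```python
import math
from collections import Counter


def pmi(pair_counts, pair):
    """Pointwise mutual information log(p(x, y) / (p(x) * p(y))).

    Probabilities are relative frequencies within `pair_counts`
    (an assumed estimator). Returns None for unseen pairs.
    """
    joint = pair_counts.get(pair, 0)
    if joint == 0:
        return None
    total = sum(pair_counts.values())
    left = sum(c for (a, _), c in pair_counts.items() if a == pair[0])
    right = sum(c for (_, b), c in pair_counts.items() if b == pair[1])
    return math.log(joint * total / (left * right))


def pair_score(sp_counts, po_counts, subject, paraphrase, obj):
    """Product of the PMI of (subject, paraphrase) and (paraphrase, object)."""
    sp = pmi(sp_counts, (subject, paraphrase))
    po = pmi(po_counts, (paraphrase, obj))
    if sp is None or po is None:
        return None
    return sp * po


# Toy counts, invented purely for illustration.
sp_counts = Counter({("bottle", "hold"): 4, ("cup", "break"): 4})
po_counts = Counter({("hold", "water"): 3, ("break", "glass"): 3})
score = pair_score(sp_counts, po_counts, "bottle", "hold", "water")
```

Note that a plain product turns two negative PMI values into a positive score; the slides do not state how that case is handled.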
Used corpora and their preprocessing
- Search for paraphrases:
  - British National Corpus
    - 100 million words
    - Grammatical relations from parser
  - Web 1T 5-gram Corpus
    - Generated from 1 trillion words of web page text
    - Grammatical relations from POS patterns
      - Noun verb determiner noun
- Validation of paraphrases:
  - The Web through Google and Yahoo!
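Extracting a grammatical relation from a POS pattern such as "noun verb determiner noun" might look like the sketch below. The coarse tag names (`NOUN`, `VERB`, `DET`), the function name `triple_from_ngram`, and the pre-tagged input format are assumptions; the slides do not specify the tagger or tagset.

```python
def triple_from_ngram(tagged, pattern=("NOUN", "VERB", "DET", "NOUN")):
    """Return (subject, verb, object) from the first window matching `pattern`.

    `tagged` is a list of (word, tag) pairs for one n-gram (hypothetical
    format). The determiner is matched but not returned.
    """
    tags = tuple(tag for _, tag in tagged)
    width = len(pattern)
    for start in range(len(tags) - width + 1):
        if tags[start:start + width] == tuple(pattern):
            words = [word for word, _ in tagged[start:start + width]]
            return (words[0], words[1], words[3])
    return None  # no window of the n-gram matches the pattern


# Toy 5-gram, invented purely for illustration.
ngram = [
    ("this", "DET"),
    ("bottle", "NOUN"),
    ("holds", "VERB"),
    ("the", "DET"),
    ("water", "NOUN"),
]
triple = triple_from_ngram(ngram)
```

Pattern matching of this kind is cheap but error-prone compared with full parsing, which is consistent with the plan on the "Future work" slide to parse the corpus instead.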
Passive paraphrases
- Their surface subject is actually their object
- (subject, paraphrase) = (paraphrase 2, object)
  - paraphrase: passive, without preposition
  - paraphrase 2: active version of paraphrase
  - subject (of the passive) = object (of the active)
- Their frequencies are counted together
Passive paraphrases
- (subject, paraphrase, object) = (subject 2, paraphrase 2, object 2)
  - paraphrase: passive, with the preposition "by"
  - paraphrase 2: active version of paraphrase, without preposition
  - object 2 = subject, subject 2 = object
- Their frequencies are counted together
- Such (paraphrase, object) and (subject 2, paraphrase 2) pairs are treated the same way
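Counting passive "by" triples together with their active counterparts can be sketched as below. The sketch assumes each observation already carries the verb lemma and a passive flag; that input format and the names `normalise` and `count_triples` are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def normalise(subject, verb, obj, passive=False):
    """Map a passive 'by' triple onto its active counterpart.

    In "water is held by the bottle" the surface subject ("water") is the
    logical object and the 'by' phrase ("bottle") is the logical subject.
    """
    if passive:
        return (obj, verb, subject)
    return (subject, verb, obj)


def count_triples(observations):
    """Count active and passive observations under one shared key."""
    counts = Counter()
    for subject, verb, obj, passive in observations:
        counts[normalise(subject, verb, obj, passive)] += 1
    return counts


# Toy observations, invented purely for illustration.
observations = [
    ("bottle", "hold", "water", False),  # "the bottle holds water"
    ("water", "hold", "bottle", True),   # "water is held by the bottle"
]
counts = count_triples(observations)
```

Both toy observations end up under the key `("bottle", "hold", "water")`, so their frequencies are added together.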
Patientive ambitransitive verbs
- Three main groups of verbs: strictly transitive, strictly intransitive, ambitransitive
- Strictly intransitive verbs have two subclasses: unergative and unaccusative
- Ambitransitive verbs have two subclasses too: agentive and patientive
- Patientive ambitransitive verbs in intransitive use behave in the same way as passive verbs, so they are treated the same way
Using synonyms, hypernyms, sister words etc.
- No paraphrases are found for several NCs
- Hypothesis: NCs comprising semantically similar words are interpreted the same way
- Using semantically similar words in the search for paraphrases
  - Synonyms, hypernyms, sister words from WordNet
  - Semantically similar words that are automatically found with a method proposed by Dekang Lin
Validation of paraphrases
- Some paraphrases are incorrect, so validation is needed
- Hypothesis: if a paraphrase is suitable for an NC, then there should exist at least some web pages containing the NC paraphrased by that paraphrase
Validation of paraphrases
- Google and Yahoo! queries
  - Simple queries: “n2 Infl THAT p n1 Infl”
  - Extended queries:
    - Multiple verb tenses
    - Wildcard characters (up to 9)
- Score for each paraphrase is recalculated
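Generating the query strings for such a validation step could look roughly like this. Several details are assumptions made for illustration: that THAT ranges over "that", "which" and "who", that wildcards sit between the paraphrase and the first noun, that the caller supplies the inflected forms, and the function name `build_queries` itself. No search engine API is called here.

```python
from itertools import product


def build_queries(n1_forms, n2_forms, paraphrase, max_wildcards=0,
                  complementisers=("that", "which", "who")):
    """Build exact-phrase queries of the form "n2 THAT p [* ...] n1".

    `n1_forms` / `n2_forms` are inflected forms of the first and second noun;
    `paraphrase` is an already inflected verb phrase (all assumed inputs).
    """
    queries = []
    for n2, comp, n1 in product(n2_forms, complementisers, n1_forms):
        for k in range(max_wildcards + 1):
            # k wildcards between the paraphrase and the first noun.
            words = [n2, comp, paraphrase] + ["*"] * k + [n1]
            queries.append('"' + " ".join(words) + '"')
    return queries


# Toy example, invented purely for illustration.
queries = build_queries(
    n1_forms=["water"],
    n2_forms=["bottle", "bottles"],
    paraphrase="holds",
    max_wildcards=1,
    complementisers=("that",),
)
```

The hit counts returned for these queries would then feed into the recalculated score; how exactly they are combined is not stated on the slide.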
Testing and evaluation
- Tested on the first 50 NCs of SemEval-2 Task #9
- 3 best paraphrases for each NC
- 5 native speakers recruited for evaluation
  - They scored each paraphrase from 1 to 5
  - Their agreement was checked using Krippendorff’s alpha, and it was too low
  - The (noun compound, paraphrase) pairs with the highest disagreement were therefore omitted
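For readers unfamiliar with Krippendorff's alpha, the sketch below computes it as 1 minus the ratio of observed to expected disagreement. The slides do not say which distance metric was used for the 1 to 5 scores, so the interval metric (squared difference) is an assumption, as are the function name and the toy ratings.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (a - b) ** 2.

    `units` is a list of rating lists, one per rated item; items with fewer
    than two ratings are not pairable and are skipped.
    """
    units = [list(u) for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two ratings")
    # Observed disagreement: pairs of ratings within the same item.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: pairs of ratings across all items.
    expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


# Toy ratings, invented purely for illustration.
agree = krippendorff_alpha_interval([[1, 1], [5, 5]])
disagree = krippendorff_alpha_interval([[1, 5], [5, 1]])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.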
Best version
- Subject-paraphrase-object triples version
- Web 1T 5-gram Corpus
- Combination of two basic versions:
  - No substitute words
  - Sister words
  - Scores are recalculated in a way that favors paraphrases returned by the first version
- Validation: Google, present simple, up to 1 wildcard
Results
- Mixed performance
  Noun compound    1st rank   2nd rank        3rd rank
  arts museum      be of      be devoted to   be for
  bird droppings   be in      be for          be
- Average scores
  Rank of paraphrase   Average score
  1st rank             3.1842
  2nd rank             2.7687
  3rd rank             2.5583
- Promising results given the difficulty of the task
Results
Best scoring NCs                       Worst scoring NCs
Noun compound           Avg. score     Noun compound             Avg. score
broadway youngster      4.7500         championship bout         2.0000
cell membrane           4.6000         buddhist philosophy       1.8000
cattle population       4.4000         cell block                1.7500
arts museum             4.3333         banana industry           1.7333
business sector         4.2000         ancestor spirits          1.6000
arts colleges           4.0000         anode loss                1.5000
backwoods protagonist   3.8750         bird droppings            1.2667
antibiotic regimen      3.8667         bow scrape                1.2500
census population       3.8667         activity spectrum         1.0000
business applications   3.7000         altitude reconnaissance   1.0000
Future work
- Parsing the Web 1T 5-gram Corpus
  - Much lower error rate in obtaining the grammatical relations
- Extended validation part
  - Employing synonyms, hypernyms, sister words or semantically similar words
- Combining the different extensions
Summary
- Interpreting noun compounds is crucial for many NLP tasks
- We presented a method for noun compound interpretation that searches for paraphrases in large corpora and issues web queries to validate the results
- The results are promising, and could be further improved
Acknowledgements
- Attendance at this workshop was partly supported by the Hungarian National Office for Research and Technology within the framework of the R&D project MASZEKER (Modell-Alapú Szemantikus Kereső Rendszer – Model Based Semantic Search System).
Thank you!