
- Number of slides: 22
UNC-CH at DUC 2007: Query Expansion, Lexical Simplification, and Sentence Selection Strategies for Multi-Document Summarization
Catherine Blake, Julia Kampov, Andreas Orphanides, David West, Cory Lown

Goals in 2007
• Get a system up and running
• Components
  • Query Expansion
    • WordNet
  • Lexical Compression
    • Linguistically motivated pruning
  • Sentence Selection
    • Clustering

System Architecture
Query Expansion - Approach
• Goal: Increase responsiveness
• Approach
  • A – Weak Baseline: any term in topic or query
  • B – Baseline: remove stop words, including a small set of tailored terms
  • C – Weak WordNet: WordNet synsets from terms in B
  • D – WordNet: synsets from C + synonyms

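The four term-set strategies can be sketched roughly as follows. This is a minimal illustration, not the UNC-CH implementation: the stop list, the tailored terms, and the `related` lookup table (a stand-in for WordNet synsets and synonyms) are all hypothetical, since the slides do not list them.

```python
import re

# Hypothetical stop list and tailored terms; the slides do not give the real ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "what", "is", "are"}
TAILORED_TERMS = {"describe", "discuss", "identify"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strategy_a(topic, query):
    """A (weak baseline): any term in the topic or the query."""
    return set(tokenize(topic)) | set(tokenize(query))


def strategy_b(topic, query):
    """B (baseline): strategy A minus stop words and tailored terms."""
    return strategy_a(topic, query) - STOP_WORDS - TAILORED_TERMS


def expand(terms, related):
    """C/D-style expansion: add related terms from a lookup table.

    `related` is an assumed dict standing in for a WordNet lookup.
    """
    expanded = set(terms)
    for term in terms:
        expanded |= set(related.get(term, ()))
    return expanded
```

For example, `expand(strategy_b(topic, query), related)` would play the role of strategy C or D, depending on how much of WordNet the `related` table is taken to cover.
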
Query Expansion - Evaluation
• Query selection
  • Rank 2006 queries by overall responsiveness
• Relevance
  • 3 annotators identified sentences with “information pertinent to the topic” for 9 topics
  • For evaluation, a sentence was identified when a term from A, B, C, or D appeared in a gold-standard sentence
• Inter-rater reliability
  • Annotators didn’t know how the system would summarize text, but knew that the task was going to be automated
  • Topics 6 and 34 had fair to moderate agreement
  • Annotators reached consensus for topics 6 and 34
  • Annotators then reworked other topics

Query Selection
Query Expansion – Evaluation
Lexical Simplification Decision: No WordNet Query Expansion
Lexical Simplification
• Goal
  • Increase linguistic quality
• Approach
  • Representation
    • Typed dependency tree (de Marneffe et al., 2006)
    • Stanford Parser version 1.5 (Klein & Manning, 2002; 2003)
  • Identify short, stand-alone sentences
  • Prune both original and short sentences using
    • Parser tags
    • Cue phrases identified in previous DUC submissions

Short Stand-Alone Sentences
• Sub-Sentences

Pruning
• Noun Appositive
  • Example: For nearly a decade, Queen Latifah, the first lady of hip-hop, has been bobbing and weaving questions about …
• Participial Modifier
  • Example: Indeed, some people reading this report could get the impression that Amnesty believes violence can be a legitimate instrument, the statement said

Pruning
• Lead Adverbials
  • 15 cue phrases from previous DUCs
• Attribution
  • Parser tags
  • Cue phrases: said, according
• Example: Separately, the report said that the murder rate by Indians in 1996 was 4 per 100000, below the national average …

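As a rough illustration of cue-phrase pruning, the sketch below strips a sentence-initial adverbial when it matches a cue list. It uses plain string matching as a stand-in for the parser-tag approach on the slides, and the cue list is hypothetical: the 15 cue phrases taken from previous DUC submissions are not given here.

```python
# Hypothetical cue list; the actual 15 cue phrases are not listed on the slides.
LEAD_CUES = ("separately", "indeed", "however")


def prune_lead_adverbial(sentence):
    """Drop a leading 'Cue, ' adverbial and re-capitalize the remainder.

    Sentences that do not start with a known cue are returned unchanged.
    """
    head, sep, rest = sentence.partition(", ")
    if sep and rest and head.lower() in LEAD_CUES:
        return rest[0].upper() + rest[1:]
    return sentence
```

A parser-based version would presumably check that the leading span is tagged as an adverbial before removing it, which avoids deleting sentence-initial words that merely look like cues.
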
Lexical Simplification
Sentence Selection - Settings
• No WordNet query expansion
• Percentage of Topic/Query Terms
  • original + base form
  • (num stemmed terms in query) / (num stemmed terms in sentence)
• Percentage of Unique Terms
  • (num stemmed terms in new sentence that are not in selected sentences) / (num stemmed terms in sentence)
• Weighted Term Frequency * IDF

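The two percentage features might be computed along these lines. This is a sketch under assumptions: stemming is taken to have been done already, and the topic/query numerator is read as the number of the sentence's stemmed terms that also occur in the topic or query, which is one interpretation of the slide. The function and parameter names are illustrative.

```python
def pct_topic_terms(sentence_stems, query_stems):
    """Share of a sentence's stemmed terms that also occur in the topic/query.

    Assumption: the numerator counts sentence terms matching the query.
    """
    if not sentence_stems:
        return 0.0
    hits = sum(1 for stem in sentence_stems if stem in query_stems)
    return hits / len(sentence_stems)


def pct_unique_terms(sentence_stems, selected_stems):
    """Share of a sentence's stemmed terms not yet in the selected sentences."""
    if not sentence_stems:
        return 0.0
    new = sum(1 for stem in sentence_stems if stem not in selected_stems)
    return new / len(sentence_stems)
```

Both functions take the sentence as a list of stems and the comparison vocabulary as a set, so repeated terms in the sentence count once per occurrence.
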
Sentence Selection - Settings
• Weighted Term Frequency (tottf)

Feature                      Weight
Stopword or punctuation      0
Topic/Query ^ ¬Summary       1
Topic/Query ^ Summary        0.5
¬Topic/Query ^ ¬Summary      0.01
¬Topic/Query ^ Summary       0.001

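The weight table can be read as a per-term lookup. The sketch below assumes that tottf is the sum of these weights over the terms of a candidate sentence; the slide gives the weights but not the aggregation, so the summation and all names here are assumptions.

```python
def term_weight(term, topic_query, summary, stop_or_punct):
    """Return the weight of one term according to the feature table."""
    if term in stop_or_punct:
        return 0.0
    in_query = term in topic_query
    in_summary = term in summary
    if in_query and not in_summary:
        return 1.0
    if in_query and in_summary:
        return 0.5
    if not in_summary:
        return 0.01  # not in topic/query, not yet in summary
    return 0.001  # not in topic/query, already in summary


def tottf(sentence_terms, topic_query, summary, stop_or_punct):
    """Assumed aggregation: sum the term weights over the sentence."""
    return sum(
        term_weight(t, topic_query, summary, stop_or_punct)
        for t in sentence_terms
    )
```

Under this reading, topic/query terms that the summary already covers still contribute, but at half weight, which discourages redundancy without excluding on-topic sentences.
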
Sentence Selection
• Clustering
  • Oracle clustering tool
  • K-means
  • 1000 iterations
  • removed determiners, prepositions, etc.
• Favor sentences from
  • Different clusters
  • Popular clusters, i.e. lots of sentences
  • How representative the sentence is of the cluster

Sentence Selection – Evaluation
• ROUGE-1 score

ID   Description                 DUC 06    DUC 07
I    tottf/numWdSent * CW        0.3981    0.4212
F    %WdTopic * CW               0.3979    0.4171
E    tottf * CW                  0.3977    0.4183
B    CW                          0.3947    0.4169
D    Tfidf                       0.3912    0.4086
G    tottf                       0.3904    0.4109
H    tottf/numWdSent             0.3754    0.3913
A    %WdTopic + %WdNew + CW      0.3749    0.3963
C    %WdTopic + %WdNew           0.3623    0.3786

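The top-scoring strategy (ID I) combines weighted term frequency, sentence length, and cluster weight. A minimal sketch of that score and a ranking step is below; the cluster weight (CW) is treated as a precomputed input because the slides do not define how it is derived, and the dictionary keys are illustrative.

```python
def score_strategy_i(tottf, num_words, cluster_weight):
    """Strategy I: tottf / number of words in the sentence * cluster weight."""
    if num_words == 0:
        return 0.0
    return tottf / num_words * cluster_weight


def rank_sentences(candidates):
    """Sort candidate sentences from highest to lowest strategy-I score.

    Each candidate is a dict with assumed keys 'tottf', 'num_words', 'cw'.
    """
    return sorted(
        candidates,
        key=lambda c: score_strategy_i(c["tottf"], c["num_words"], c["cw"]),
        reverse=True,
    )
```

Dividing by sentence length keeps long sentences from winning purely on term count, which matters when the summary has a fixed word budget.
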
Sentence Selection – Evaluation
Official DUC 2007 Evaluation
• UNC-CH = System 22
• Automatic Evaluation
  • ROUGE-2 score 0.10329 (13th)
• Manual Evaluation
  • Responsiveness = 2.956 (7th)
  • Linguistic Quality = 2.987 (24th)

What we have learned so far
• Sentence selection
  • Optimal strategy: weighted term frequency / sentence length * cluster weight
  • Clustering really helps
• Lexical simplification
  • Rework sub-sentences
  • Pronoun resolution
• Query expansion had negligible effect

Next Steps
• Alternative Query Expansion
  • Error analysis of medical questions underway
  • Concept representation
  • Unified Medical Language System (UMLS)
• Tune sentence selection strategy
• Lexical simplification
  • Rework sub-sentences
  • Add basic pronoun resolution
• Sentence Re-Ordering
  • Combine with lexical simplification

Acknowledgements
• The organizers for running this conference and providing manual summaries
• Previous DUC paper authors for making their system designs explicit
• Monica Sanchez and Stephanie Haas for earlier discussions
• Thom Hailey, Scott Krauss and Toshiba Burns-Johnson for annotating queries