
Predicting System Performance for Automatic Summarization
Annie Louis, University of Pennsylvania
Advisor: Ani Nenkova
IBM Open House 2009 – SMiLe
Goal of summarization system ~ select important content from input
◦ Single/multiple document(s)
◦ News reports, scientific literature, search results
Content quality judgements can be obtained from humans
◦ Compare system summary with human summary
◦ Direct rating by human judges
What factors are predictive of human judgements?
◦ Properties of systems
◦ Content selection features
Better understanding of content selection ~
◦ Improve systems
◦ Automatic evaluation
1. Input difficulty for summarization
Some inputs are more difficult than others
Standard system design ~ handle a variety of inputs
◦ News – events, biographies, opinions
◦ Search results – practically any topic
But often one method is applied to all inputs
Systems end up with variable performance on different inputs
Average system scores on 100-word summaries (score range 0-4)
◦ mean 0.55
◦ min 0.07
◦ max 1.65
Input type influences summary quality
Descriptions of a single event or subject ~ easy
◦ Hurricane Andrew
◦ Mad cow disease
Collections of opinions ~ difficult
◦ Senate, lawyers, public on a new policy
Can input difficulty be used to predict expected performance?
◦ Identify measurable indicators of input difficulty
◦ Specialized content selection methods
◦ Flag expected poor-quality summaries
Difficult inputs are longer
◦ More tokens
◦ Large vocabulary sizes
Difficult inputs have less redundancy
◦ Low values for pair-wise cosine overlap
◦ High-entropy vocabulary
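A minimal sketch of how these two redundancy indicators might be computed, assuming whitespace tokenization and bag-of-words counts; the function names are illustrative, not the original implementation.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean_pairwise_cosine(docs: list[str]) -> float:
    """Average cosine overlap over all document pairs in the input.
    Low values ~ little redundancy ~ a harder input."""
    bags = [Counter(d.lower().split()) for d in docs]
    pairs = list(combinations(bags, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def vocabulary_entropy(docs: list[str]) -> float:
    """Entropy of the input's word distribution.
    High entropy ~ diffuse vocabulary ~ a harder input."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```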
Difficult inputs are topically less cohesive
◦ Low KL divergence – input vs. a large random collection
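A sketch of this cohesiveness feature under the same bag-of-words assumptions; the smoothing constant and the choice of background collection are placeholders.

```python
import math
from collections import Counter

def kl_divergence(input_words: list[str], background_words: list[str],
                  smoothing: float = 0.5) -> float:
    """KL(input || background) over smoothed word distributions.
    A low value means the input reads like generic text rather than a
    focused topic - a signal of a harder input."""
    p, q = Counter(input_words), Counter(background_words)
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + smoothing * len(vocab)
    q_total = sum(q.values()) + smoothing * len(vocab)
    return sum(
        ((p[w] + smoothing) / p_total)
        * math.log2(((p[w] + smoothing) / p_total)
                    / ((q[w] + smoothing) / q_total))
        for w in vocab
    )
```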
Input difficulty can estimate average system performance
Accuracy on inputs with extreme high and low scores
◦ multi-document inputs – 74%
◦ single documents – 84%
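One way such a predictor could be set up, shown as a hedged sketch: a binary classifier over the indicators above, trained on inputs whose average system scores were extreme. scikit-learn and logistic regression are illustrative assumptions, not necessarily what the original work used, and the helpers are the ones sketched earlier in this deck.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def difficulty_features(docs: list[str],
                        background_words: list[str]) -> list[float]:
    """Bundle the difficulty indicators into one feature vector."""
    tokens = [w for d in docs for w in d.lower().split()]
    return [
        len(tokens),                              # input length in tokens
        len(set(tokens)),                         # vocabulary size
        mean_pairwise_cosine(docs),               # redundancy
        vocabulary_entropy(docs),                 # vocabulary entropy
        kl_divergence(tokens, background_words),  # topical cohesiveness
    ]

# X: feature vectors for inputs with extreme average scores;
# y: 1 = easy (high-scoring input), 0 = difficult (low-scoring input).
# print(cross_val_score(LogisticRegression(max_iter=1000), X, y).mean())
```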
2. Input-based evaluation
Input-summary similarity is predictive of human judgements of quality
Summaries very similar to the input could be of higher quality
◦ Intuitive
◦ Many ways to measure similarity
What is a good objective function? How well will it perform?
Divergence between input & summary vocabularies
◦ KL divergence
◦ JS divergence
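A sketch of the JS divergence measure between the input and summary word distributions; tokenization and names are assumptions. JS is the symmetric, bounded relative of KL, computed against the mixture of the two distributions.

```python
import math
from collections import Counter

def js_divergence(input_words: list[str], summary_words: list[str]) -> float:
    """JS(input, summary); lower ~ the summary better reflects the input."""
    p, q = Counter(input_words), Counter(summary_words)
    vocab = set(p) | set(q)
    p_tot, q_tot = sum(p.values()), sum(q.values())
    # Mixture distribution M = (P + Q) / 2
    m = {w: 0.5 * (p[w] / p_tot) + 0.5 * (q[w] / q_tot) for w in vocab}

    def kl_to_m(dist: Counter, total: int) -> float:
        return sum((dist[w] / total) * math.log2((dist[w] / total) / m[w])
                   for w in vocab if dist[w] > 0)

    return 0.5 * kl_to_m(p, p_tot) + 0.5 * kl_to_m(q, q_tot)
```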
Vector space similarity
◦ Cosine overlap
Frequency-based generative model
◦ Frequent words in input ~ more likely in summary
◦ Likelihood of the summary under a unigram model of the input
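A sketch of this likelihood measure: score the summary under a unigram language model estimated from the input. Laplace smoothing is an illustrative assumption.

```python
import math
from collections import Counter

def summary_log_likelihood(input_words: list[str],
                           summary_words: list[str]) -> float:
    """Log-probability of the summary under the input's unigram model.
    Summaries that reuse frequent input words score higher."""
    counts = Counter(input_words)
    vocab_size = len(set(input_words) | set(summary_words))
    total = sum(counts.values())
    return sum(
        math.log2((counts[w] + 1) / (total + vocab_size))  # Laplace smoothing
        for w in summary_words
    )
```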
Information-theoretic features are most indicative
◦ Best feature – JS divergence
◦ Correlation of 0.88 with human scores
3. Wisdom of multiple systems
System summaries are collectively indicative of importance
Multiple systems ~ multiple methods to select content
Unsupervised methods
◦ Frequency/position
◦ Discourse structure
◦ Graph-based measures of centrality
Supervised content selection
◦ Using sentences selected by humans
Consensus among systems – very important content
Can system summaries be used for evaluation?
JS divergence between
◦ the distribution of the combined vocabulary of all system summaries
◦ the vocabulary distribution of an individual system summary
Low divergence ~ better summary
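A sketch of this pooled evaluation, reusing the js_divergence helper sketched earlier; pooling each summary into its own reference is a simplification for illustration.

```python
def rank_by_consensus(system_summaries: list[str]) -> list[tuple[float, str]]:
    """Rank summaries by JS divergence from the pooled vocabulary of all
    system summaries; lowest divergence (most consensus) comes first."""
    tokenized = [s.lower().split() for s in system_summaries]
    pooled = [w for words in tokenized for w in words]
    scored = [(js_divergence(pooled, words), summary)
              for words, summary in zip(tokenized, system_summaries)]
    return sorted(scored)  # low divergence ~ better summary
```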
Very high correlation with human judgements – 0.93
Could be useful to combine output from multiple systems
Conclusions
Some inputs are more difficult
◦ need for specialized content selection methods
Input-summary similarity is predictive of quality
◦ can be optimized using information-theoretic features
Collective knowledge of systems is indicative of importance
◦ system combination might improve performance