
Predicting System Performance for Automatic Summarization
Annie Louis, University of Pennsylvania
Advisor: Ani Nenkova
IBM Open House 2009 – SMiLe
Goal of summarization system ~ select important content from input
◦ Single/multiple document(s)
◦ News reports, scientific literature, search results
Content quality judgements can be obtained from humans
◦ Compare system summary with human summary
◦ Direct rating by human judges
What factors are predictive of human judgements?
◦ Properties of systems
◦ Content selection features
Better understanding of content selection ~
◦ Improve systems
◦ Automatic evaluation
1. Input difficulty for summarization
Some inputs are more difficult than others
Standard system design ~ handle a variety of inputs
◦ News – events, biographies, opinions
◦ Search results – practically any topic
But often one method is applied to all inputs
Systems end up with variable performance on different inputs
Average system scores on 100-word summaries (score range 0-4)
◦ mean 0.55
◦ min 0.07
◦ max 1.65
Input type influences summary quality
Descriptions of a single event or subject ~ easy
◦ Hurricane Andrew
◦ Mad cow disease
Collections of opinions ~ difficult
◦ Senate, lawyers, public on a new policy
Can input difficulty be used to predict expected performance?
◦ Identify measurable indicators of input difficulty
◦ Specialized content selection methods
◦ Flag expected poor-quality summaries
Difficult inputs are longer
◦ More tokens
◦ Large vocabulary sizes
Difficult inputs have less redundancy
◦ Low values for pair-wise cosine overlap
◦ High-entropy vocabulary
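A minimal sketch of how these two redundancy indicators might be computed, assuming whitespace tokenization and bag-of-words counts; the function names are illustrative, not the original implementation.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean_pairwise_cosine(docs: list[str]) -> float:
    """Average cosine overlap over all document pairs in the input.
    Low values ~ little redundancy ~ a harder input."""
    bags = [Counter(d.lower().split()) for d in docs]
    pairs = list(combinations(bags, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def vocabulary_entropy(docs: list[str]) -> float:
    """Entropy of the input's word distribution.
    High entropy ~ diffuse vocabulary ~ a harder input."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```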
Difficult inputs are topically less cohesive
◦ Low KL divergence – input vs. a large random collection
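A sketch of this cohesiveness feature under the same bag-of-words assumptions; the smoothing constant and the choice of background collection are placeholders.

```python
import math
from collections import Counter

def kl_divergence(input_words: list[str], background_words: list[str],
                  smoothing: float = 0.5) -> float:
    """KL(input || background) over smoothed word distributions.
    A low value means the input reads like generic text rather than a
    focused topic - a signal of a harder input."""
    p, q = Counter(input_words), Counter(background_words)
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + smoothing * len(vocab)
    q_total = sum(q.values()) + smoothing * len(vocab)
    return sum(
        ((p[w] + smoothing) / p_total)
        * math.log2(((p[w] + smoothing) / p_total)
                    / ((q[w] + smoothing) / q_total))
        for w in vocab
    )
```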
Input difficulty can estimate average system performance
Accuracy on inputs with extreme high and low scores
◦ multi-document inputs – 74%
◦ single documents – 84%
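One way such a predictor could be set up, shown as a hedged sketch: a binary classifier over the indicators above, trained on inputs whose average system scores were extreme. scikit-learn and logistic regression are illustrative assumptions, not necessarily what the original work used, and the helpers are the ones sketched earlier in this deck.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def difficulty_features(docs: list[str],
                        background_words: list[str]) -> list[float]:
    """Bundle the difficulty indicators into one feature vector."""
    tokens = [w for d in docs for w in d.lower().split()]
    return [
        len(tokens),                              # input length in tokens
        len(set(tokens)),                         # vocabulary size
        mean_pairwise_cosine(docs),               # redundancy
        vocabulary_entropy(docs),                 # vocabulary entropy
        kl_divergence(tokens, background_words),  # topical cohesiveness
    ]

# X: feature vectors for inputs with extreme average scores;
# y: 1 = easy (high-scoring input), 0 = difficult (low-scoring input).
# print(cross_val_score(LogisticRegression(max_iter=1000), X, y).mean())
```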
2. Input-based evaluation
Input-summary similarity is predictive of human judgements of quality
Summaries very similar to the input could be of higher quality
◦ Intuitive
◦ Many ways to measure similarity
What is a good objective function? How well will it perform?
Divergence between input & summary vocabularies
◦ KL divergence
◦ JS divergence
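A sketch of the JS divergence measure between the input and summary word distributions; tokenization and names are assumptions. JS is the symmetric, bounded relative of KL, computed against the mixture of the two distributions.

```python
import math
from collections import Counter

def js_divergence(input_words: list[str], summary_words: list[str]) -> float:
    """JS(input, summary); lower ~ the summary better reflects the input."""
    p, q = Counter(input_words), Counter(summary_words)
    vocab = set(p) | set(q)
    p_tot, q_tot = sum(p.values()), sum(q.values())
    # Mixture distribution M = (P + Q) / 2
    m = {w: 0.5 * (p[w] / p_tot) + 0.5 * (q[w] / q_tot) for w in vocab}

    def kl_to_m(dist: Counter, total: int) -> float:
        return sum((dist[w] / total) * math.log2((dist[w] / total) / m[w])
                   for w in vocab if dist[w] > 0)

    return 0.5 * kl_to_m(p, p_tot) + 0.5 * kl_to_m(q, q_tot)
```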
Vector space similarity
◦ Cosine overlap
Frequency-based generative model
◦ Frequent words in input ~ more likely in summary
◦ Likelihood of the summary under a unigram model of the input
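A sketch of this likelihood measure: score the summary under a unigram language model estimated from the input. Laplace smoothing is an illustrative assumption.

```python
import math
from collections import Counter

def summary_log_likelihood(input_words: list[str],
                           summary_words: list[str]) -> float:
    """Log-probability of the summary under the input's unigram model.
    Summaries that reuse frequent input words score higher."""
    counts = Counter(input_words)
    vocab_size = len(set(input_words) | set(summary_words))
    total = sum(counts.values())
    return sum(
        math.log2((counts[w] + 1) / (total + vocab_size))  # Laplace smoothing
        for w in summary_words
    )
```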
Information-theoretic features are most indicative
◦ Best feature – JS divergence
◦ Correlation of 0.88 with human scores
3. Wisdom of multiple systems
System summaries are collectively indicative of importance
Multiple systems ~ multiple methods to select content
Unsupervised methods
◦ Frequency/position
◦ Discourse structure
◦ Graph-based measures of centrality
Supervised content selection
◦ Using sentences selected by humans
Consensus among systems – very important content
Can system summaries be used for evaluation?
JS divergence between
◦ the distribution of the combined vocabulary of all system summaries
◦ the vocabulary distribution of an individual system summary
Low divergence ~ better summary
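A sketch of this pooled evaluation, reusing the js_divergence helper sketched earlier; pooling each summary into its own reference is a simplification for illustration.

```python
def rank_by_consensus(system_summaries: list[str]) -> list[tuple[float, str]]:
    """Rank summaries by JS divergence from the pooled vocabulary of all
    system summaries; lowest divergence (most consensus) comes first."""
    tokenized = [s.lower().split() for s in system_summaries]
    pooled = [w for words in tokenized for w in words]
    scored = [(js_divergence(pooled, words), summary)
              for words, summary in zip(tokenized, system_summaries)]
    return sorted(scored)  # low divergence ~ better summary
```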
Very high correlation with human judgements – 0.93
Could be useful to combine output from multiple systems
Conclusions
Some inputs are more difficult
◦ need for specialized content selection methods
Input-summary similarity is predictive of quality
◦ can be optimized using information-theoretic features
Collective knowledge of systems is indicative of importance
◦ system combination might improve performance