Predicting System Performance for Automatic Summarization - Annie Louis


  • Number of slides: 24

Predicting System Performance for Automatic Summarization. Annie Louis, University of Pennsylvania. Advisor: Ani Nenkova. IBM Open House 2009 - SMiLe

Goal of a summarization system ~ select important content from the input. Single/multiple document(s): news reports, scientific literature, search results.

Content quality judgements can be obtained from humans: compare the system summary with a human summary, or direct rating by human judges.

What factors are predictive of human judgements? Properties of systems, content selection features. Better understanding of content selection, improved systems, automatic evaluation.

1. Input difficulty for summarization. Some inputs are more difficult than others.

Standard system design: handle a variety of inputs ◦ News – events, biographies, opinions ◦ Search results – practically any topic. But often one method is applied to all inputs.

Systems end up with variable performance on different inputs. Average system scores on 100-word summaries (scale 0–4): mean 0.55, min 0.07, max 1.65.

Input type influences summary quality. Descriptions of a single event or subject ~ easy ◦ Hurricane Andrew ◦ Mad cow disease. Collections of opinions ~ difficult ◦ Senate, lawyers, public on a new policy.

Can input difficulty be used to predict expected performance? ◦ Identify measurable indicators of input difficulty ◦ Specialized content selection methods ◦ Flag expected poor-quality summaries.

Difficult inputs are longer: more tokens, large vocabulary sizes.

Difficult inputs have less redundancy: low values for pair-wise cosine overlap, high-entropy vocabulary.

Difficult inputs are topically less cohesive: low KL divergence between the input and a large random collection.
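
The indicators on the three slides above (length and vocabulary size, redundancy, topical cohesion) can all be computed from word counts. The following is a minimal sketch of one way to do so; the whitespace tokenizer, the add-alpha smoothing, and the feature names are illustrative assumptions, not the exact setup used in this work.

import math
from collections import Counter
from itertools import combinations

def tokens(text):
    return text.lower().split()                    # assumed whitespace tokenizer

def distribution(counts, vocab, alpha=0.5):
    # unigram word distribution with add-alpha smoothing (smoothing is an assumption)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def entropy(dist):
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w], 2) for w in p if p[w] > 0)

def difficulty_indicators(input_docs, background_counts):
    # input_docs: list of document strings forming one summarization input
    # background_counts: Counter over a large random background collection
    all_tokens = [t for doc in input_docs for t in tokens(doc)]
    input_counts = Counter(all_tokens)
    vocab = set(input_counts) | set(background_counts)
    doc_counts = [Counter(tokens(doc)) for doc in input_docs]
    pairwise = [cosine(a, b) for a, b in combinations(doc_counts, 2)]
    return {
        "num_tokens": len(all_tokens),                       # more tokens ~ harder
        "vocab_size": len(input_counts),                     # larger vocabulary ~ harder
        "vocab_entropy": entropy(distribution(input_counts, set(input_counts), alpha=0)),   # higher ~ harder
        "avg_pairwise_cosine": sum(pairwise) / len(pairwise) if pairwise else 1.0,          # lower ~ harder
        "kl_input_vs_background": kl(distribution(input_counts, vocab),
                                     distribution(background_counts, vocab)),               # lower ~ harder
    }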

Input difficulty can estimate average system performance. Accuracy on inputs with extreme high and low scores: ◦ multi-document inputs – 74% ◦ single documents – 84%.
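
A minimal sketch of how such a prediction could be set up, assuming the indicators from the previous sketch are used as features of a generic binary classifier; logistic regression is an illustrative choice here, not necessarily the model used in this work.

from sklearn.linear_model import LogisticRegression

FEATURES = ["num_tokens", "vocab_size", "vocab_entropy",
            "avg_pairwise_cosine", "kl_input_vs_background"]

def feature_vector(indicators):
    return [indicators[name] for name in FEATURES]

def train_difficulty_classifier(training_inputs, labels, background_counts):
    # labels: 1 = expected low summary scores (difficult), 0 = expected high scores (easy)
    X = [feature_vector(difficulty_indicators(docs, background_counts))
         for docs in training_inputs]
    return LogisticRegression(max_iter=1000).fit(X, labels)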

2. Input-based evaluation. Input-summary similarity is predictive of human judgements of quality.

Summaries very similar to the input could be of higher quality. Intuitive. Many ways to measure similarity: what is a good objective function? How well will it perform?

Divergence between input & summary vocabularies: KL divergence, JS divergence.
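
For reference, the standard definitions over two word distributions P and Q (e.g. the input's and the summary's unigram distributions; the direction used for KL and the log base are conventions not fixed by the slide):

\mathrm{KL}(P \,\|\, Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)}

\mathrm{JS}(P, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, A) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, A), \qquad A = \tfrac{1}{2}(P + Q)

Unlike KL, JS divergence is symmetric and bounded (between 0 and 1 with log base 2).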

Vector space similarity: cosine overlap.
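
The overlap here is the usual cosine similarity between term vectors v_I (input) and v_S (summary); whether raw counts or tf-idf weights are used is not specified on the slide:

\cos(v_I, v_S) = \frac{v_I \cdot v_S}{\lVert v_I \rVert \, \lVert v_S \rVert} = \frac{\sum_w v_I(w)\, v_S(w)}{\sqrt{\sum_w v_I(w)^2}\; \sqrt{\sum_w v_S(w)^2}}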

Frequency-based generative model: frequent words in the input ~ more likely in the summary. Likelihood under a unigram model.
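
One standard way to write this: estimate a (smoothed) unigram distribution p_I from the input and score the summary S by its log-likelihood under that model; the smoothing choice is left open here:

\log P(S \mid I) = \sum_{w \in S} n_S(w)\, \log p_I(w)

where n_S(w) is the number of times word w occurs in the summary.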

Information-theoretic features are most indicative. Best feature – JS divergence, with correlations of 0.88 with human scores.

3. Wisdom of multiple systems. System summaries are collectively indicative of importance.

Multiple systems ~ multiple methods to select content. Unsupervised methods ◦ Frequency/position ◦ Discourse structure ◦ Graph-based measures of centrality. Supervised content selection ◦ Using sentences selected by humans. Consensus among systems – very important content.

Can system summaries be used for evaluation? JS divergence between the distribution of the combined vocabulary of all system summaries and the vocabulary distribution of an individual system summary. Low divergence ~ better summary.
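
A minimal sketch of this pseudo-model evaluation, assuming whitespace-tokenized summaries and add-alpha smoothing; whether the summary being scored is excluded from the pooled vocabulary is an implementation choice not specified here.

import math
from collections import Counter

def smoothed_dist(counts, vocab, alpha=0.5):
    # unigram distribution with add-alpha smoothing (smoothing is an assumption)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def js_divergence(p, q):
    avg = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log(a[w] / b[w], 2) for w in a if a[w] > 0)
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def score_against_pool(system_summaries):
    # system_summaries: one token list per system, all for the same input
    pooled = Counter(t for toks in system_summaries for t in toks)
    vocab = set(pooled)
    p_pool = smoothed_dist(pooled, vocab)
    # lower JS divergence from the pooled distribution ~ better summary
    return [js_divergence(smoothed_dist(Counter(toks), vocab), p_pool)
            for toks in system_summaries]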

Very high correlations with human judgements: 0.93. Could be useful to combine output from multiple systems.

Conclusions. Some inputs are more difficult ◦ need for specialized content selection methods. Input-summary similarity is predictive of quality ◦ can be optimized using information-theoretic features. Collective knowledge of systems is indicative of importance ◦ system combination might improve performance.