
  • Number of slides: 35

Evaluating Summary Content Selection
Pyramid Method: Work in Progress
Rebecca Passonneau
Ani Nenkova

OUTLINE
1. Motivation
2. Problems
3. DUC Evaluations
4. Pyramid Method: Current Status
5. Open Issues
6. Conclusions

EVALUATION GOALS
§ Define parameters of the problem
  o What is summarization?
§ Compare systems
  o Is the metric meaningful?
§ Track progress
  o When does output improve?
§ Cost Effectiveness
  o Can it be (partly) automated?

PICTURING CONTENT “OVERLAP”

Philippine Airlines (PAL) experienced a crisis in 1998. Unable to make payments on a $2.1 billion debt, it was faced by a pilot's strike in June and the region's currency problems which reduced passenger numbers and inflated costs. On September 23 PAL shut down after the ground crew union turned down a settlement which it accepted two...

Starting in May 1998, Philippine Airlines (PAL) laid off 5000 of its 13,000 workers. A 3-week pilots' strike in June and a currency crisis that reduced passenger numbers made payments on PAL's $2 billion debt impossible. President Estrada brokered an agreement to suspend collective bargaining for 10 years in exchange for 20% of PAL stock and union seats on its board. The large ground crew union initially voted no. After PAL shut down operations for 13 days starting Sept. 23rd, leaving much of the country without air service and foreign...

OBSTACLES
§ Humans select different content
§ Humans present same content differently
§ Lack clear standard of “good” summary [Contrasts with translation: L1(C) L2(C)]
§ Need objective method to get at subjective notion of what a summary IS

PREVIOUS WORK: Pessimism
Human Judgments
§ Extraction
  Ø Low Agreement (Rath, 1961; Salton et al., 1997)
  Ø Inconsistent over time (Rath, 1961; Lin & Hovy, 2002)
§ Abstraction
  Ø Depends on individual’s orientation (Gerrig et al., 1991)
Automated Evaluation
§ Extraction (Pastra & Saggion, 2003 EACL)
  Ø 3 humans; multiple “models”; inconclusive
§ Abstraction (Lin & Hovy, 2002 ACL)
  Ø Accepts inconsistent judgments as target
  Ø Difficult to extend

PREVIOUS WORK: Optimism
Good design methodology leads to better understanding of areas of agreement
Ø High compression rate leads to high agreement (Jing et al., 1998)
Ø Content variation offset by logarithmic growth in pool of distinct content units (Halteren & Teufel, 2003)
Ø Content can be reliably annotated (Beck et al., 1991)

HOW TO GET AT “CONTENT” FROM ITS “EXPRESSION”
1. ADAPT BLEU MT EVALUATION
   a) Collect multiple “model” summaries
   b) Quantify n-gram overlap
2. IDENTIFY ABSTRACT CONTENT UNITS
   a) DUC
   b) Reading Comprehension
3. A THIRD WAY
   a) Content unit “level”
   b) Multiple expressions of same content unit

DUC: THE CURRENT APPROACH
§ Yearly evaluation of systems on new data sets
§ NIST evaluations performed by humans
§ Widely cited results
§ Does it work?
  • Compare current systems
  • Track individual system progress
  • Track community progress from year to year
  • Identify specific strengths/weaknesses
  • Can it eventually be automated?

DUC SCORING METHOD
§ Datasets: human/machine summaries
§ Designate “model” human summary
§ (Automatically) identify content units in “model” summary
§ Split “peer” summaries into sentences
§ Human judges evaluate “peer” against model

COMPUTE DUC SCORES
1. For each EDU:
   a) Does peer sentence express any part?
   b) How much? (0, 20, 40, 60, 80, 100%)
2. Average EDU percent overlap scores
3. Resulting score ranges from 0 to 1
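
The averaging step above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide, not NIST's actual scoring software; the function name `duc_coverage` and the sample judgments are hypothetical.

```python
# Hypothetical helper illustrating the slide's arithmetic, not NIST's tool.
ALLOWED_PERCENTS = {0, 20, 40, 60, 80, 100}

def duc_coverage(edu_overlaps):
    """Average per-EDU overlap judgments (in percent) into a 0..1 score."""
    if not edu_overlaps:
        raise ValueError("need at least one EDU judgment")
    for pct in edu_overlaps:
        if pct not in ALLOWED_PERCENTS:
            raise ValueError(f"judgment {pct} is not one of the six allowed levels")
    # Mean of the percentages, rescaled from 0..100 to 0..1.
    return sum(edu_overlaps) / (100 * len(edu_overlaps))

# Made-up judgments for a model summary with four EDUs.
example_score = duc_coverage([100, 40, 0, 60])
print(example_score)
```

Because every model unit enters the mean with the same weight, the sketch also makes concrete the "all model units created equal" drawback that the deck raises about DUC scores.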

DRAWBACKS TO DUC SCORES
• Very sensitive to choice of “model”
• All “model” units created equal
• Difficult to interpret scores
  o Human summary scores as low as 0.1
  o Scores vary for same summarizer
  o Scores vary for same summary
• Systems cannot be differentiated

DUC SCATTERPLOT

FOUNDATION OF PYRAMID
§ A few CUs appear in many summaries
§ Humans can identify same/different CUs
Weight CUs differentially
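
One way to picture "weight CUs differentially" is to count, for each content unit, how many model summaries express it, and to group units with equal counts into tiers. The sketch below assumes that reading; the function name `build_pyramid` is hypothetical, and the data is a small subset of the SCUs from the PAL example (summaries H, I, J) shown later in the deck.

```python
from collections import Counter, defaultdict

def build_pyramid(model_scus):
    """Group SCU ids into tiers keyed by weight (number of summaries expressing them)."""
    weights = Counter()
    for scus in model_scus.values():
        weights.update(set(scus))  # each summary counts once per SCU
    tiers = defaultdict(list)
    for scu, weight in weights.items():
        tiers[weight].append(scu)
    # Highest tier first, SCU ids sorted for readability.
    return {w: sorted(ids) for w, ids in sorted(tiers.items(), reverse=True)}

# Subset of the PAL annotation: which SCUs each human summary expresses.
model_scus = {
    "H": {1, 2, 3, 5, 6, 7},
    "I": {1, 2, 3, 6, 7},
    "J": {1, 2, 3, 5},
}
pyramid = build_pyramid(model_scus)
print(pyramid)
```

With three model summaries the highest possible weight is 3, which is why the PAL example has a top tier labelled W=3.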

MULTIPLE GOOD SUMMARIES
This pyramid predicts 6 different good summaries consisting of 4 SCUs:
(pyramid figure not reproduced)
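
The pyramid itself is not shown here, so the tier sizes in this sketch are an assumption: a top tier of 2 SCUs above a tier of 4 SCUs is one configuration that yields exactly 6 best summaries of size 4 (both top SCUs plus any 2 of the next 4). The function name `count_optimal_summaries` and the weights 4 and 3 are hypothetical.

```python
from math import comb

def count_optimal_summaries(tiers, size):
    """Count distinct SCU sets of the given size that maximize total weight.

    tiers: list of (weight, number_of_scus) pairs; size: SCUs per summary.
    """
    remaining = size
    count = 1
    # Fill from the heaviest tier down; only a partly used tier offers a choice.
    for _weight, n_scus in sorted(tiers, reverse=True):
        take = min(n_scus, remaining)
        count *= comb(n_scus, take)
        remaining -= take
    if remaining > 0:
        raise ValueError("pyramid has fewer SCUs than the requested size")
    return count

# Assumed configuration: 2 SCUs in the top tier, 4 in the tier below.
assumed_tiers = [(4, 2), (3, 4)]
n_good = count_optimal_summaries(assumed_tiers, size=4)
print(n_good)
```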

SCU ANNOTATION EXAMPLE

PAL PYRAMID TIER: W=3 (N=4)
SCU 1: PAL has $2.1 billion debt
  H2 [PAL’s $2 billion debt]1
  I1 [and with a rising $2.1 billion debt,]1
  J3 [PAL is buried under a $2.2 billion dollar debt]1
SCU 2: PAL enforced a shutdown
  H5 [After PAL shut down operations]2
  I1 [stopped all operations]2
  J5 [by a]2 [shutdown]2
SCU 3: PAL in crisis
  H1 [Philippine Airlines]3
  I1 [Philippines Airlines (PAL),]3 [devastated]3
  J1 [The fate]3 [is uncertain.]3

PAL PYRAMID TIER: W=2 (N=8)
SCU 5: PAL unable to repay debt
  H2 [made payments on]5 [impossible.]5
  J3 [it cannot repay]5
SCU 6: PAL experienced pilots' strike
  H2 [A]6 [pilots' strike]6
  I1 [by pilot]6 [strikes]6
SCU 7: this PAL crisis occurred in 1998
  H1 [1998,]7
  I1 [in 1998]7
...

ANNOTATION: KEEPING TRACK
H1 [Starting in May]23 [1998,]7 [Philippine Airlines]3 [laid off 5000 of its 13,000 workers.]24
H2 [A]6 [3-week]25 [pilots' strike]6 [in June]11 [and a currency crisis]12 [that reduced passenger numbers]13
H3 [President Estrada brokered an agreement to suspend collective bargaining for 10 years]17 [in exchange for 20% of PAL stock and union seats on its board.]26
H4 [The large ground crew union initially voted no.]18
H5 [After PAL shut down operations]2 [for 13 days]4 [starting Sept. 23rd,]8 [leaving much of the country without air service]27 [and foreign carriers flying some domestic routes,]9 [61% voted yes.]19
...

RELIABILITY
§ Number of SCUs
  Ø Two annotators: 33 versus 37
  Ø Consensus annotation: 35
§ Count of Pairwise Agreements (PAs)
  Ø SCU Label
  Ø SCU Members
§ Comparison of Annotations to Consensus
  Ø Recall/Precision not valid
  Ø 65/69 PAs
  Ø Most “disagreements” due to membership size
  Ø Only 2 “conflicts”

ANOTHER CONSISTENCY TEST
Pyramid: A H C J
Consensus: .95 .89 .85 .76
Annotation 1: .97 .83 .82
Annotation 2: .94 .87 .84 .74

PYRAMID SCORE PART I
1. For N summaries, score each “peer” against a pyramid with N-1 tiers
2. “Peer” annotation
   a) Gives SCU “size”
   b) Yields a residue of SCUs not in pyramid
3. Compute D (Observed distribution) where D=sum of weights of SCUs
EG: Summary A (D30042), size=20
D=(6 x 3) + (6 x 2) + (4 x 1) + (4 x 0) = 34
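
The worked example for Summary A can be reproduced with a short Python sketch. The function name `observed_weight` and the dictionary layout (weight mapped to the number of the peer's SCUs at that weight, with weight 0 for the residue of SCUs not in the pyramid) are illustrative assumptions; the numbers come from the slide.

```python
def observed_weight(scus_per_weight):
    """D: sum of the pyramid weights of the SCUs found in the peer summary."""
    return sum(weight * n for weight, n in scus_per_weight.items())

# Summary A: 6 SCUs of weight 3, 6 of weight 2, 4 of weight 1,
# and 4 SCUs not in the pyramid (weight 0).
summary_a = {3: 6, 2: 6, 1: 4, 0: 4}

size_a = sum(summary_a.values())   # SCU "size" of the peer
d_a = observed_weight(summary_a)   # observed weight D
print(size_a, d_a)
```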

PYRAMID SCORE PART II
1. Compute Max = Ideal Sum of weights of SCUs, given the summary SCU size
2. Pyramid of H, I, J:
   a) 9 SCUs in tier, w=3
   b) 10 SCUs in tier, w=2
   c) 12 SCUs in tier, w=1
3. Size=20, Max=(9 x 3) + (10 x 2) + (1 x 1)=48
4. P=D/Max; for Summary A, P=34/48=.71
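
A minimal Python sketch of the Max and P computation, assuming Max is obtained by filling the summary's SCU slots from the heaviest tier downward, as the arithmetic on the slide suggests. The names `max_weight` and `pyramid_score` are hypothetical; the tier sizes and the observed weight D=34 for Summary A are taken from the slides.

```python
def max_weight(tier_sizes, size):
    """Ideal sum of weights for a summary with `size` SCUs.

    tier_sizes maps weight -> number of SCUs in that tier.
    """
    remaining = size
    total = 0
    for weight in sorted(tier_sizes, reverse=True):
        take = min(tier_sizes[weight], remaining)
        total += weight * take
        remaining -= take
    # Any slots left after the pyramid is exhausted add nothing.
    return total

def pyramid_score(observed, tier_sizes, size):
    """P = D / Max."""
    return observed / max_weight(tier_sizes, size)

# Pyramid built from summaries H, I, J.
tiers_hij = {3: 9, 2: 10, 1: 12}
max_a = max_weight(tiers_hij, size=20)
p_a = pyramid_score(34, tiers_hij, size=20)
print(max_a, round(p_a, 2))
```

Normalizing by Max for the peer's own SCU size means a short summary is compared with the best summary of the same size, not with the whole pyramid.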

COMPARISON TO DUC SCORES: HUMAN SUMMARIES

MACHINE SUMMARY EXAMPLE
African countries voted in June to ignore the U.N. flight ban which was imposed in 1992 to try and force Libya to hand over for trial two suspects wanted in the 1988 bombing of an American airliner over Lockerbie, Scotland. The reported jailing of the three officials comes as Gadhafi is under pressure to accept a plan to turn over for trial two other Libyans wanted for the 1988 bombing of Pan Am flight 103 over Lockerbie, Scotland, that led to 270 deaths. The visit was Farrakhan's …

COMPARISON TO DUC SCORES: MACHINE SUMMARIES
SYSTEM    DUC   PYRAMID
Sys 06*   .30   .79
Sys 13    .03   .24
Sys 14    .25   .51
Sys 16*   .25   .26
Sys 17*   .03   .17
Sys 18    .03   .20
Sys 20    .10   .64

MACHINE SUMMARIES
System 6
PAL, Asia’s oldest airline, has been unable to make payments on dlrs 2.1 billion debt after being devasted by a pilot’s strike and by Asia’s currency crisis. PAL earlier accepted a preliminary investment offer from Cathay Pacific, Ailing Philippine Airlines and prospective investor Cathy Pacific Airways have clashed over...

MACHINE SUMMARIES
System 16
President Joseph Estrada on Saturday urged militant unionists at Philippine Airlines to accept a vote by workers approving a 10-year no-strike deal to revive the debt-laden airline. President Joseph Estrada said Saturday the financially troubled airlines will resume its international flights on Sunday by flying him to Singapore...

MACHINE SUMMARIES
System 17
Christmas is a sacred holiday in the Philippines, and nowhere is that more evident than at the headquarters of Philippine Airlines. But Ramos, who was intent on privatizing the economy, opened the industry to competition, licensing rivals like Air Philippines, Cebu Pacific, and Grand Air. PAL closed for nearly 2 weeks on Sep. 23 after...

OPEN ISSUES
§ Distribution of SCUs NOT an independent variable
  Ø Ordering
  Ø Knowledge
  Ø Informational Goal
§ Can Pyramid Scoring be Automated?

SCU INTERDEPENDENCIES
1. SCU 4 presupposes SCU 1:
   SCU 1 (w=4): PAL has a debt > 2 billion
   SCU 4 (w=3): PAL cannot make its debt payments
2. SCU 7, SCU 8 depend on SCU 2:
   SCU 2 (w=4): PAL shut down operations
   SCU 7 (w=3): shutdown began on 9/23
   SCU 8 (w=3): shutdown lasted 2 weeks

SCUs and DEPENDENCY/TAG GR
A3 [On September 23]7 [PAL shut down]2 [after the ground crew union turned down a settlement]18 [which it accepted two weeks later.]19
SCU 7
  1 On IN 5 shut t 0
  2 September NNP 4 PAL t 2
  3 23 CD 4 PAL t 2

“LARGE” CONSTITUENTS
1. PAL experienced a crisis in 1998.
2. Unable to make payments on a $2.1 billion debt,
3. it was faced by a pilot's strike in June
4. and the region's currency problems
5. which reduced passenger numbers and inflated costs.
6. On September 23 pal shut down
7. after the ground crew union turned down a settlement
8. which it accepted two weeks later.
9. PAL resumed domestic flights on October 7
10. and [resumed] international flights on October 26.
11. Resolution of the basic financial problems was elusive, however,
12. and as of December 18 pal was still $2.2 billion in debt
13. and [pal was] losing close to $1 million a day.

DOCSET TF*IDF TERMS: $2, airline, billion, day, debt, pal (6 of 13 LCs)
Terms | LC | SCU
1 | 1. Philippine Airlines (pal) experienced a crisis in 1998. | SCU 3 w=3
3 | 2. Unable to make payments on a $2.1 billion debt, | SCU 1 w=4
1 | 6. On September 23 pal shut down | SCU 2 w=4 & SCU 7 w=3
1 | 9. pal resumed domestic flights on October 7 | SCU 10 w=2
4 | 12. and as of December 18 pal was still $2.2 billion in debt | NO SCU
1 | 13. and losing close to $1 million a day. | SCU 15 w=2
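
The per-constituent term counts on this slide can be reproduced with a small Python sketch. The tokenization rule (lowercase, keep a leading dollar sign, split at punctuation, exact token match with no stemming) is an assumption chosen because it matches the counts shown; the slides do not say how the matching was actually done. The name `matched_terms` is hypothetical, and the LC texts follow the wording used on this slide.

```python
import re

TERMS = {"$2", "airline", "billion", "day", "debt", "pal"}

def matched_terms(text, terms=TERMS):
    """Distinct docset terms that occur as whole tokens in the text."""
    # Assumed tokenizer: "$2.1" yields "$2" and "1"; "airlines" stays distinct from "airline".
    tokens = set(re.findall(r"\$?\w+", text.lower()))
    return tokens & terms

large_constituents = {
    1: "Philippine Airlines (pal) experienced a crisis in 1998.",
    2: "Unable to make payments on a $2.1 billion debt,",
    3: "it was faced by a pilot's strike in June",
    4: "and the region's currency problems",
    5: "which reduced passenger numbers and inflated costs.",
    6: "On September 23 pal shut down",
    7: "after the ground crew union turned down a settlement",
    8: "which it accepted two weeks later.",
    9: "pal resumed domestic flights on October 7",
    10: "and [resumed] international flights on October 26.",
    11: "Resolution of the basic financial problems was elusive, however,",
    12: "and as of December 18 pal was still $2.2 billion in debt",
    13: "and losing close to $1 million a day.",
}

counts = {lc: len(matched_terms(text)) for lc, text in large_constituents.items()}
hits = {lc: n for lc, n in counts.items() if n > 0}
print(hits)
```

Under this assumption six of the thirteen constituents contain at least one term, and the highest-scoring one (LC 12) is the constituent with no SCU, which is a useful caution when using term counts as a stand-in for SCU weight.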

CONCLUSIONS
§ Define parameters of the problem
  o What is summarization?
§ Compare systems and/or humans
  o Is the metric meaningful?
§ Track progress
  o When does output improve?
§ Cost Effectiveness
  o Can it be (partly) automated?