Interlingua Annotation of Multilingual Corpora (IAMTC) Project
Lori Levin and Teruko Mitamura
Language Technologies Institute, Carnegie Mellon University
Feb 23, 2005

IAMTC project members
Collaboration: New Mexico, Maryland, Columbia, MITRE, CMU, ISI
Members:
• Bonnie Dorr (Maryland)
• David Farwell (NMSU)
• Rebecca Green (Maryland)
• Nizar Habash (Columbia)
• Stephen Helmreich (NMSU)
• Eduard Hovy (ISI)
• Lori Levin (CMU)
• Keith Miller (MITRE)
• Teruko Mitamura (CMU)
• Owen Rambow (Columbia)
• Flo Reeder (MITRE)
• Advaith Siddharthan (Columbia)

IL-Annotation Outcomes
• IL design
  – Three levels of depth: IL0, IL1, and IL2
• Annotation methodology
  – Manuals, tools, evaluations
• Annotated parallel texts
  – Foreign language original and multiple English translations
  – Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish

Uniqueness of Annotation Effort
• Multi-parallel
  – Three versions of each text
    • Original language and two English translations
  – Shows multiple surface realizations of the same meaning
• Multi-lingual
  – Each text is in at least two languages (English and one other)
  – The methodology is applied to multi-parallel corpora in six languages.
    • Arabic, French, Hindi, Japanese, Korean, Spanish

Motivation
• Interlingua designed for MT
  – Multiple English translations of the same source show translation divergences. Some phenomena:
    • Lexical level: word changes
    • Syntactic level: phrasing, thematization, nominalization
    • Semantic level: additional/different content
    • Discourse level: multi-clause structure, anaphora
    • Pragmatic level: speech acts, implicatures, style, interpersonal
• Causes of divergence
  – Genuine ambiguity/vagueness of source meaning
  – Translator error/reinterpretation

IL Development: Staged, deepening
• IL0:
  – Shows simple dependency structure
• IL1:
  – Replace open class lexical items with concept names
  – Replace grammatical relation labels with semantic role labels
• IL2: (under development)
  – Separates shared portions and unresolved portions of divergent sentences

Details of IL0
• Deep syntactic dependency representation:
  – Removes auxiliary verbs, determiners, and some function words
  – Normalizes passives, clefts, etc.
  – Removes strongly governed prepositions
  – Includes syntactic roles (Subj, Obj)

Construction of IL0
• Dependency parsers
  • Connexor (English), Tapanainen and Jarvinen, 1997
  • Kabocha (Japanese)
• Hand-corrected
• Extensive manual and instructions on IAMTC Wiki website
  – for English, Spanish, Japanese, and possibly others

Syntactic Variation Resolved at IL0
• Passive
  • The gangster killed at least 3 innocent bystanders.
  • At least 3 innocent bystanders were killed by the gangster.
• Other transitivity alternations

Example of IL0
TrEd, Pajas, 1998
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL0
• Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Word              POS   Relation
announced         V     Root
Mohamed           PN    Subj
Sheikh            PN    Mod
Defense_Minister  PN    Mod
who               Pron  Subj
also              Adv   Mod
of                P     Mod
UAE               PN    Obj
at                P     Mod
ceremony          N     Obj
inauguration      N     Mod
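
The node list above can be held in a small tree structure. The Python sketch below is one possible encoding, not the project's own format: the class `Node` and the helper `render` are hypothetical names, and the head attachments are an assumed reading, since the slide's tree layout did not survive and only word, part of speech, and relation are given.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One IL0 node: surface word, part of speech, relation to its head."""
    word: str
    pos: str
    rel: str
    children: List["Node"] = field(default_factory=list)

    def add(self, word, pos, rel):
        """Attach a new dependent and return it so calls can be chained."""
        child = Node(word, pos, rel)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented lines, one node per line (pre-order)."""
    lines = ["  " * depth + f"{node.word} {node.pos} {node.rel}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# ASSUMPTION: which node attaches to which head is inferred, not given.
root = Node("announced", "V", "Root")
subj = root.add("Mohamed", "PN", "Subj")
subj.add("Sheikh", "PN", "Mod")
minister = subj.add("Defense_Minister", "PN", "Mod")
minister.add("who", "Pron", "Subj")
minister.add("also", "Adv", "Mod")
of_node = minister.add("of", "P", "Mod")
of_node.add("UAE", "PN", "Obj")
at_node = root.add("at", "P", "Mod")
ceremony = at_node.add("ceremony", "N", "Obj")
ceremony.add("inauguration", "N", "Mod")

print("\n".join(render(root)))
```

A pre-order walk of this tree visits the nodes in the same order as the slide lists them, which is the only check on the assumed attachments that the source allows.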

Details of IL1
• Associate open-class lexical items with Omega Ontology items
• Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR…
• No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure…
• Nodes may receive more than one concept
  – Average: about 1.2

Construction of IL1
• TIAMAT annotation tool
• Manual for converting IL0 to IL1 is available

Syntactic Variation Resolved at IL1
• Lexical Synonymy
  – The toddler sobbed, and he attempted to console her.
  – The baby wailed, and he tried to comfort her.
• Thematic Divergence
  – Bob enjoys playing with his kids.
  – Playing with his kids pleases Bob.

Example of IL1
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL1: internal representation
The study led them to ask the Czech government to recapitalize CSA at this level.
Semantic Roles
[3, lead, V, lead, Root, LEAD

Tiamat: annotation interface
For each new sentence:
For each word to be annotated (shown with dependents)

Tiamat: annotation interface
For each new sentence:
Step 1: find Omega concepts for objects and events
Candidate concepts

Tiamat: annotation interface (note: similarity to PDT annotation interface)
For each new sentence:
Step 1: find Omega concepts for objects and events
Candidate concepts
Step 2: select event frame (theta roles)

Details of IL2
• Start capturing meaning:
  – Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…)
  – Conversives (buy vs. sell) at the FrameNet level
  – Non-literal language usage (open the door to customers vs. start doing business)
  – Extended paraphrases involving syntax, lexicon, grammatical features
  – Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions
• Still excluded:
  – Quantification and negation
  – Discourse structure
  – Pragmatics

Variation Resolved at IL2
• Morphological Derivation
  – I was surprised that he destroyed the old house.
  – I was surprised by his destruction of the old house.
• Differences in clause subordination
  – This is Joe’s new car, which he bought in New York.
  – This is Joe’s new car. He bought it in New York.
• N-N Compounds
  – She loves velvet dresses.
  – She loves dresses made of velvet.

IL2 (continued)
• Head Switching
  – Mike Mussina excels at pitching.
  – Mike Mussina pitches well.
  – Mike Mussina is a good pitcher.
• Lexical Conflation
  – Lindbergh flew across the Atlantic Ocean.
  – Lindbergh crossed the Atlantic Ocean by plane.

Not normalized
• Comparatives vs. Superlatives
  – He’s smarter than everybody else.
  – He’s the smartest one.
• Different Sentence Types
  – Who composed the Brandenburg Concertos?
  – Tell me who composed the Brandenburg Concertos.
• Inverse Relationship
  – Only 20% of the participants arrived on time.
  – 80% of the participants were late.
• Inference
  – The Porto player kicked the ball into the net.
  – The Porto player scored a goal.
• Viewpoint Variation
  – Stop getting in the way.
  – Stop trying to help.

Note from Lori
• In my version of PowerPoint the color blocks on the next slide don’t line up with the text correctly.
• I didn’t have time to fix it, so I inserted the other version of the same slide.
• If you have time to fix the color box version, then you can delete the two slides after that.
• Otherwise, you can delete the color box version.

Theoretical goal: Getting at meaning
K1E1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The Subscribers cannot switch again to another provider for the first 3 months, but they cancel the switch in 14 days if they are not satisfied with services like voice quality.
K1E2: Starting January 1st of next year, customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they cancel the change anytime within 14 days if problems such as poor call quality are experienced.
• Semantically identical
• Semantically equivalent
• Semantically different:
  • Additional/less information
  • Different information

Getting at Meaning (Two translations of Korean original text)
Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The Subscribers cannot switch again to another provider for the first 3 months, but they cancel the switch in 14 days if they are not satisfied with services like voice quality.
Starting January 1st of next year customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they cancel the change anytime within 14 days if problems such as poor call quality are experienced.

Color Key
• Black: same meaning and same expression
• Green: small syntactic difference
• Blue: Lexical difference
• Red: Not contained in the other text
• Purple: Larger difference.
  – Need to use some inference to know that the meaning is the same

Getting at meaning (Two translations of a Japanese original text)
• This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."
More lexical similarity.
• This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the “year of the merger” for all we know.
More differences in dependency relations.

Common Aspects of Meaning
This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."
This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the “year of the merger” for all we know.
• Big mergers continue this year
• Mergers continue in addition to the birth of Mitsubishi Chemical
• Birth of Mitsubishi Chemical
• Someone announces the birth of Mitsubishi Chemical
• Someone records this year as the year of the merger

Divergences that can be resolved
This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."
This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the “year of the merger” for all we know.
• Mergers are big
• Someone announces the birth of Mitsubishi Chemical
• Someone records something as the year of the merger

Benefits for Other Projects
• MT
• Question Answering
• Summarization
• Information Retrieval
• Information Extraction
• Text Mining
• Etc.

Approaches to Evaluation
• Inter-annotator agreement (completed)
• Sentence generation from extracted annotation structure
• Comparison of interlingual structures (graph comparisons)
• Ontology growth (or shrinkage) rate (per unit of text)
  – Competing goals:
    • Addressing coverage gaps (1/3 of open class words marked as having no concept)
    • Omega seems too rich: Hard to distinguish between senses; Granularity of concept selection

Inter-annotator Agreement
• Is the IL sufficiently defined to permit consistent annotation?
  – Ontology
  – Theta-roles
  – Coverage and precision

Evaluation webpage

Inter-annotator agreement
• Difficulty is that more than one sense can be selected for a given annotation
  – Standard kappa does not apply in this case
• Two alternatives for calculating expected probability of agreement:
  – Agreement and kappa for positive senses
  – Agreement and kappa for all senses
• Both were explored
  – Positive sense agreement, kappa shown here

Positive agreement annotations
• Construct a table for each word:
  – For each annotator and each sense whether or not that sense was selected by that annotator
• Calculate agreement =
• Calculate kappa using Monte Carlo simulation of P(E)
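
The procedure above can be sketched in Python under explicit assumptions. The slide's agreement formula was not preserved, so the sketch substitutes a Dice-style overlap between each pair of annotators as a stand-in; that choice, the random model used for P(E) (each annotator keeps the number of senses chosen per word but picks them uniformly at random), and all function names are assumptions rather than the project's actual method.

```python
import random
from itertools import combinations


def pairwise_positive_agreement(selections):
    """Mean Dice overlap over annotator pairs for one word.

    selections: list of sets, one per annotator, holding the senses chosen.
    ASSUMPTION: Dice overlap stands in for the slide's missing formula.
    Pairs in which neither annotator chose anything are skipped.
    """
    scores = []
    for a, b in combinations(selections, 2):
        if a or b:
            scores.append(2 * len(a & b) / (len(a) + len(b)))
    return sum(scores) / len(scores) if scores else None


def corpus_agreement(words):
    """Average the per-word agreement over all words that can be scored."""
    scores = []
    for _senses, selections in words:
        score = pairwise_positive_agreement(selections)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores)


def monte_carlo_kappa(words, trials=2000, seed=0):
    """Estimate P(E) by simulation and return (P(A), P(E), kappa).

    words: list of (candidate_senses, selections) where candidate_senses
    is a list and selections is a list of sets, one per annotator.
    Kappa is undefined if the estimated P(E) equals 1.
    """
    rng = random.Random(seed)
    observed = corpus_agreement(words)
    total = 0.0
    for _ in range(trials):
        # Redraw every annotator's choices, keeping how many were chosen.
        simulated = [
            (senses, [set(rng.sample(senses, len(sel))) for sel in selections])
            for senses, selections in words
        ]
        total += corpus_agreement(simulated)
    expected = total / trials
    return observed, expected, (observed - expected) / (1 - expected)


# Hypothetical data: one word with four candidate senses, three annotators.
example = [(["s1", "s2", "s3", "s4"], [{"s1"}, {"s1", "s2"}, {"s3"}])]
p_a, p_e, kappa = monte_carlo_kappa(example)
print(f"P(A)={p_a:.3f}  P(E)={p_e:.3f}  kappa={kappa:.3f}")
```

Simulation is used here because, with multiple senses allowed per annotation, a closed-form chance-agreement term is awkward to derive; a different random model would give a different P(E).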

Evaluation results – positive examples

                                                    A#     APA      Kappa
Mikrokosmos
  Annotators who finished 95% of their annotations  3.50   0.7445   0.7432
  Annotators who finished 90% of their annotations  4.42   0.7310   0.7296
  Annotators who finished 50% of their annotations  6.33   0.6105   0.6085
  All annotators                                    9.42   0.4552   0.4540
WordNet
  Annotators who finished 95% of their annotations  6.08   0.6600   0.6565
  Annotators who finished 90% of their annotations  7.00   0.6538   0.6502
  Annotators who finished 50% of their annotations  8.33   0.5982   0.5941
  All annotators                                    9.42   0.5174   0.5125
Theta Roles
  Annotators who finished 95% of their annotations  5.75   0.5378   0.5089
  Annotators who finished 90% of their annotations  6.58   0.5492   0.5210
  Annotators who finished 50% of their annotations  8.00   0.4845   0.4522
  All annotators                                    9.42   0.3924   0.3544

All cases count
• Count 0,0 and 1,1 agreements – T00, T11
• Count 0,1 and 1,0 disagreements – T10, T01
• Count number of 0 & 1 for annotators 1 & 2 – A01, A11; A02, A12
• Divide all counts by the number of senses
• Agreement = T00 + T11
• Kappa = 2 * ((T00 * T11) – (T10 * T01)) / ((A01 * A12) + (A02 * A11)) [marginal prob.]
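
The counts and formulas above translate directly into code. The Python sketch below is illustrative only: the function name and the 0/1 list encoding are assumptions, as is the reading that in `Txy` the first digit is annotator 1's choice and the second is annotator 2's.

```python
def all_cases_agreement(ann1, ann2):
    """Agreement and kappa over all senses of one word for two annotators.

    ann1, ann2: equal-length lists of 0/1 flags, one flag per sense
    (1 = the annotator selected that sense).
    """
    if len(ann1) != len(ann2) or not ann1:
        raise ValueError("need two non-empty flag lists of equal length")
    n = len(ann1)
    pairs = list(zip(ann1, ann2))

    # Joint proportions. ASSUMPTION: first index is annotator 1's flag.
    t00 = pairs.count((0, 0)) / n
    t11 = pairs.count((1, 1)) / n
    t10 = pairs.count((1, 0)) / n
    t01 = pairs.count((0, 1)) / n

    # Marginal proportions: A01/A11 for annotator 1, A02/A12 for annotator 2.
    a01, a11 = t00 + t01, t10 + t11
    a02, a12 = t00 + t10, t01 + t11

    agreement = t00 + t11
    denominator = a01 * a12 + a02 * a11
    # Kappa is undefined when both annotators give one constant answer.
    kappa = (2 * (t00 * t11 - t10 * t01) / denominator
             if denominator else float("nan"))
    return agreement, kappa


# Hypothetical word with ten candidate senses.
first = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
second = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(all_cases_agreement(first, second))
```

For a 2x2 table this expression is algebraically the same as the usual (P(A) − P(E)) / (1 − P(E)) form of Cohen's kappa, with P(E) computed from the marginals.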

All Cases Agreement / Kappa

                          All cases        Exclude zero-pairs
             Zero-Pairs   Agree   Kappa    Agree   Kappa
Theta Roles  78.58        0.945   0.418    0.943   0.392
WordNet      112.16       0.886   0.564    0.879   0.534
Mikrokosmos  258.5        0.811   0.522    0.784   0.433

Annotation Issues
1. Post-annotation consistency checking
   – Novice annotators may make inconsistent annotations within the same text.
   – Intra-annotator consistency checking procedure
     • e.g. If two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences
2. Post-annotation reconciliation

Post-annotation reconciliation
• Question: How much can annotators be brought into agreement?
• Procedure:
  – Annotator sees all annotations, votes Yes/Maybe/No on each
  – Annotators then discuss all differences (telephone conference)
  – Annotators then vote again, independently
  – We collapse all Yes and Maybe votes and compare them with No votes to identify all serious disagreements
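
The final step of this procedure can be sketched as follows. This is a hedged illustration only: the vote encoding, the function name, and the reading that a "serious disagreement" is any annotation that still receives both an accepting (Yes or Maybe) vote and a No vote are assumptions, not details given on the slide.

```python
def serious_disagreements(votes):
    """Return the annotations on which annotators still seriously disagree.

    votes: dict mapping an annotation id to a dict of
    annotator name -> "Yes" | "Maybe" | "No" (second-round votes).
    """
    flagged = []
    for annotation, by_annotator in votes.items():
        # Collapse Yes and Maybe into a single "accept" class.
        collapsed = {
            "accept" if vote in ("Yes", "Maybe") else "reject"
            for vote in by_annotator.values()
        }
        # ASSUMPTION: a mix of accept and reject counts as serious.
        if len(collapsed) > 1:
            flagged.append(annotation)
    return flagged


# Hypothetical second-round votes from three annotators.
second_round = {
    "word12:concept_A": {"ann1": "Yes", "ann2": "Maybe", "ann3": "Yes"},
    "word12:concept_B": {"ann1": "Yes", "ann2": "No", "ann3": "Maybe"},
    "word31:AGENT": {"ann1": "No", "ann2": "No", "ann3": "No"},
}
print(serious_disagreements(second_round))
```

Collapsing Yes and Maybe first means that differences in confidence alone are not flagged; only accept-versus-reject splits are.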

Results of Reconciliation
• Annotators derive common methodology
• Small errors and oversights removed during discussion
• Inter-annotator agreement improved
• Serious problems of interpretation or error identified

Annotation across Translations
Question: How different are the translations?
• Procedure:
  – Annotator sees annotations across both translations, identifies differences of form and meaning
  – Annotator selects ‘true’ meaning(s)
• Results (work still in progress):
  – Impacts ontology richness/conciseness
  – Improvement in Interlingua representation ‘depth’
  – Useful for IL2 design development
• Observations:
  – This is very hard work
  – Methodology unclear: what is seen first, how to show alternatives, what to do with results…

Outcomes: how have we done?
• IL design
  – IL0 and IL1 finished
  – IL2 in the works
• Annotation methodology
  – Manuals for IL0 in at least three languages
  – Manual for converting IL0 to IL1
  – Annotation tools for IL0 and IL1
  – Evaluation of inter-coder agreement
  – Procedure for annotator reconciliation
• Around 144 annotated parallel texts in IL0 and IL1
  – Six texts from six different source languages
  – Two English translations of each text
  – 10-12 annotators for each text

Next Steps
• Foreign language annotation standards and tools
• Development of IL2
• Addressing coverage gaps (1/3 of open class words marked as having no concept)

Contact information
• URLs and Wiki pages:
  – Project website: http://aitcnet.org/nsf/iamtc/
  – PIs: http://sparky.umiacs.umd.edu:8000/IAMTC.wiki
  – Annotators: http://sparky.umiacs.umd.edu:8000/IAMTCAnnotator/IAMTC-Annotator.wiki
• Text Annotation: anyone interested to try???
  – Download the tools
  – Download the texts
  – Have fun (if you’re so inclined!)…

Extra Slides

IAMTC Tasks
• Interlingua Content Development
  – Three level design: IL0, IL1, IL2 (and possibly more…)
  – Linguistic/semantic divergences
    • Noun-noun compound
    • Thematic roles
    • Named entities and Time expressions
    • Conjunctions
    • Ontology reduction
• Tool Development
• Evaluation Methodology
• Annotation of 7 languages