0d66d5acb68b5eda896ee3751a26f521.ppt
- Number of slides: 19
AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
Nov 14, 2003 ITIC Site Visit

Background and Objectives
• Full MT of text is problematic:
  – Requires large amounts of resources and long development time
  – Quality of output varies
• Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
• Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
• Text Extraction technology has made much progress in the past decade [TIPSTER, TREC, EELD]
• Research Question: Can Extraction-based MT result in improved accuracy and utility of information for analysts?

Extraction-based MT
• “Traditional” Approach:
  – Develop information extraction capability for the source language
  – Runtime Extractor produces a template of extracted feature-value information
  – If desired, English Generator can render the information in the form of text
• Drawback: Adapting extraction technology to a new foreign language is difficult
  – Requires significant expertise in the foreign language
  – Significant amounts of human development time
  – Not clear that it is an attractive solution

AMTEXT Approach
• Attempt to leverage our work on automatic learning of MT transfer rules
• Develop an elicitation corpus specifically designed for targeted extraction patterns
• Learn generalized transfer rules for targeted extraction patterns from the elicitation corpus
• Acquire a high-accuracy Named-Entity translation lexicon + a limited translation lexicon for targeted vocabulary
• Runtime: use partial parser + transfer rules to translate only the matched portions of SL text

AMTEXT Extraction-based MT
[Diagram labels: Word-aligned elicited data; Source Text; Learning Module; Transfer Rules; Run Time Transfer System; Partial Parser; Transfer Engine; Extracted Target Text; NE Translation Lexicon; Word Translation Lexicon]
Example transfer rule shown in the diagram:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
  ((X1::Y1) (X4::Y4) (X5::Y5))
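
The rule shown on this slide can be read as a source pattern, a target pattern, and a set of slot alignments. The following is a minimal Python sketch of how such a rule might be applied at run time, not the project's Transfer Engine. The names `TransferRule` and `apply_rule`, the `(token, tag)` input format, and the toy lexicon are all hypothetical; the sample sentence borrows `hayom` (today) from the rule-learning example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class TransferRule:
    source: List[str]            # source-side pattern; "NE-P" and "TE" are slots
    target: List[str]            # target-side pattern
    alignments: Dict[int, int]   # 1-based source position -> target position

# The rule from the slide: (X1::Y1) (X4::Y4) (X5::Y5).
RULE = TransferRule(
    source=["NE-P", "pagash", "et", "NE-P", "TE"],
    target=["NE-P", "met", "with", "NE-P", "TE"],
    alignments={1: 1, 4: 4, 5: 5},
)

def apply_rule(rule: TransferRule,
               tagged: List[Tuple[str, Optional[str]]],
               translate: Callable[[str], str]) -> Optional[str]:
    """Return the target text if `tagged` matches the source pattern, else None.

    `tagged` holds (token, tag) pairs; tag is a slot category such as "NE-P",
    or None for a literal word (assumed input format).
    """
    if len(tagged) != len(rule.source):
        return None
    for (token, tag), pattern in zip(tagged, rule.source):
        if (tag or token) != pattern:
            return None
    output = list(rule.target)
    # Fill each aligned target slot with the translation of the source filler.
    for x, y in rule.alignments.items():
        output[y - 1] = translate(tagged[x - 1][0])
    return " ".join(output)

# Hypothetical lexicon and pre-tagged input.
LEXICON = {"sharon": "Sharon", "bush": "Bush", "hayom": "today"}
SENTENCE = [("sharon", "NE-P"), ("pagash", None), ("et", None),
            ("bush", "NE-P"), ("hayom", "TE")]

print(apply_rule(RULE, SENTENCE, lambda t: LEXICON.get(t, t)))
# prints: Sharon met with Bush today
```

Only the aligned slots are translated through the lexicon; the verb and function words come directly from the target side of the rule.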

Elicitation Example [four slides of figures; images not preserved]

Learning Transfer Rules
• Different notion of rule generalization than in our full XFER approach
• Generalize from examples to NEs that play specific roles in target extraction pattern
• Verbs and function words may not be generalized
• Example:
  Sharon will meet with Bush today
  sharon yipagesh &im bush hayom
• Goal Rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]
  ((X1::Y1) (X4::Y5) (X5::Y6))
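
The generalization step on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Learning Module: the function names `generalize` and `format_rule`, the word alignment, and the source-side tags are hypothetical inputs chosen so that the example pair yields the goal rule.

```python
from typing import Dict, List, Tuple

def generalize(src: List[str], tgt: List[str],
               word_align: List[Tuple[int, int]],
               src_tags: Dict[int, str]):
    """Turn one word-aligned example into a rule.

    Tagged source words (NEs, time expressions) become slots on both sides;
    verbs and function words stay literal. Positions are 1-based.
    """
    src_pattern = [src_tags.get(i, tok) for i, tok in enumerate(src, start=1)]
    tgt_pattern = list(tgt)
    slot_align = []
    for x, y in word_align:
        if x in src_tags:
            tgt_pattern[y - 1] = src_tags[x]
            slot_align.append((x, y))
    return src_pattern, tgt_pattern, sorted(set(slot_align))

def format_rule(src_pattern, tgt_pattern, slot_align) -> str:
    constraints = " ".join(f"(X{x}::Y{y})" for x, y in slot_align)
    return (f"S::S [{' '.join(src_pattern)}] -> "
            f"[{' '.join(tgt_pattern)}] ({constraints})")

# Example pair from the slide; the alignment and tags are assumed.
SRC = ["sharon", "yipagesh", "&im", "bush", "hayom"]
TGT = ["Sharon", "will", "meet", "with", "Bush", "today"]
ALIGN = [(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
TAGS = {1: "NE-P", 4: "NE-P", 5: "TE"}

print(format_rule(*generalize(SRC, TGT, ALIGN, TAGS)))
```

With these inputs the printed rule matches the goal rule on the slide: only the three tagged positions become slots, and only their alignments are kept as constraints.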

Acquisition of Named Entity Translation Lexicon
• Utilize Fei Huang’s work on building Named Entity Translation Lexicons based on transliteration models
• NE Lexicon will be split into meaningful subcategories: PNs, Organizations, Locations, etc.
• NE translation lexicon augmented with NEs from elicited data
• Goal: High coverage and high accuracy identification of NEs that play a part in the transfer rules

Named Entity Translation Lexicon
• English-Arabic lexicon from Fei:
  – Trained on TIDES Newswire Data
  – 7522 entries sorted by transliteration score
• Example (score # English # Arabic transliteration):
  4.51948528108464 # Israel # AsrAAyl
  4.05498190544419 # Kabul # kAbwl
  3.66368346525326 # Paris # bArys
  3.65527347080481 # Afghanistan # AfgAnstAn
  3.47030997281853 # Pakistan # bAkstAn
  3.23199522148251 # Moscow # mwskw
  3.20392400497002 # Arafat # ErfAt
  3.13060360328543 # Beirut # byrwt
  3.06872591580516 # Russia # rwsyA

Named Entity Identification
• NE Identifinder for English:
  – Available from BBN
  – Will be used for identifying English NEs within elicited data; Arabic NEs are then obtained from word alignments
• NE Identifinder for Arabic:
  – Requested from BBN, so far no response
  – Will use if available, can manage without it (naïve identification based on NE translation lexicon)
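
The naïve fallback mentioned above, identifying NEs by looking tokens up in the NE translation lexicon, might look like the following Python sketch. The function name `identify_nes`, the lexicon layout, and the "Location" labels are assumptions for illustration; the two entries reuse transliterated forms from the lexicon example, and multi-word NEs are deliberately ignored.

```python
from typing import Dict, List, Tuple

# Toy lexicon: transliterated source form -> (English form, subcategory).
NE_LEXICON: Dict[str, Tuple[str, str]] = {
    "mwskw": ("Moscow", "Location"),
    "byrwt": ("Beirut", "Location"),
}

def identify_nes(tokens: List[str],
                 lexicon: Dict[str, Tuple[str, str]]
                 ) -> List[Tuple[int, str, str, str]]:
    """Return (position, source token, English form, category) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            english, category = lexicon[token]
            hits.append((i, token, english, category))
    return hits

# "w1" and "w2" are placeholders standing in for non-NE words.
TOKENS = ["w1", "mwskw", "w2", "byrwt"]
print(identify_nes(TOKENS, NE_LEXICON))
```

Exact lookup like this finds only NEs already in the lexicon, which is why the slides aim for high lexicon coverage.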

Acquisition of Limited Word Translation Lexicon
• Vocabulary of interest is limited, based on the specific actions and objects that are of interest, so it can be scoped on the English side
• Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
• Statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign language side
• If time/resources permit, experiment with incorporating expanded vocabulary into transfer rules

Partial Parsing
• Input: Full text in the foreign language
• Output: Translation of extracted/matched text
• Goal: Extract by effectively matching transfer rules with the full text
  – Identify/parse NEs and words in restricted vocabulary
  – Identify transfer-rule (source-side) patterns
  – Handle expected high levels of ambiguity
• Example (diagram labels: NE-P, TE):
  Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  Sharon will meet with Bush today
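
A minimal Python sketch of gap-tolerant pattern matching over a full sentence, assuming tokens have already been tagged by an NE identification step. The function `find_matches`, the `max_gap` parameter, and the tags assigned below are hypothetical and do not describe the project's partial parser. It returns every candidate match, which makes the ambiguity noted on the slide visible.

```python
from typing import List, Optional, Tuple

def find_matches(tagged: List[Tuple[str, Optional[str]]],
                 pattern: List[str],
                 max_gap: int = 4) -> List[List[int]]:
    """Match `pattern` as an ordered subsequence of `tagged`.

    At most `max_gap` tokens may be skipped between consecutive pattern
    elements. Returns the token positions of every possible match.
    """
    results: List[List[int]] = []

    def fits(item: Tuple[str, Optional[str]], element: str) -> bool:
        token, tag = item
        return (tag or token) == element

    def extend(pos: int, k: int, chosen: List[int]) -> None:
        if k == len(pattern):
            results.append(chosen)
            return
        # The first element may start anywhere; later ones must be close by.
        stop = len(tagged) if k == 0 else min(len(tagged), pos + max_gap + 1)
        for i in range(pos, stop):
            if fits(tagged[i], pattern[k]):
                extend(i + 1, k + 1, chosen + [i])

    extend(0, 0, [])
    return results

# Sentence from the slide with punctuation dropped; the tags are assumed.
TAGGED = [("sharon", "NE-P"), ("meluve", None), ("b-sar", None),
          ("ha-xuc", None), ("shalom", "NE-P"), ("yipagesh", None),
          ("im", None), ("bush", "NE-P"), ("hayom", "TE")]
PATTERN = ["NE-P", "yipagesh", "im", "NE-P", "TE"]

for match in find_matches(TAGGED, PATTERN):
    print([TAGGED[i][0] for i in match])
```

Two candidates come back, one with `sharon` as the subject and one with `shalom`. The slide's translation corresponds to the first, so a real system would need some way to rank or filter such candidates.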

Scope of Pilot System
• Arabic-to-English
• Newswire text (available from TIDES)
• Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
• Limited translation patterns:
  – <subj-NE> <verb> <obj> <LOC>* <TE>*
• Limited vocabulary

Evaluation Plan
• Compare AMTEXT approach to full-text Arabic-to-English SMT, on a limited task of translation of relations within the scope of coverage
  – Establish a test set for evaluation
  – Define an appropriate metric: Precision/Recall/F1 of relations and entities
  – Compare performance
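
For reference, precision, recall, and F1 over extracted relations can be computed as in the Python sketch below. Representing a relation as an `(X, action, Y)` tuple and scoring by exact match are assumptions; the slides do not specify the matching criterion, and the sample relations are invented purely to exercise the function.

```python
from typing import Iterable, Tuple

Relation = Tuple[str, str, str]  # (X, action, Y), an assumed representation

def precision_recall_f1(predicted: Iterable[Relation],
                        gold: Iterable[Relation]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of relations."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented sample data, not results from the project.
GOLD = [("Sharon", "meet", "Bush"), ("Arafat", "announce", "ceasefire")]
PREDICTED = [("Sharon", "meet", "Bush"), ("Sharon", "attend", "summit")]

print(precision_recall_f1(PREDICTED, GOLD))  # (0.5, 0.5, 0.5)
```

The same function can score entities by passing sets of entity strings or tuples instead of relation triples.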

Current Status
• Initial small elicitation corpus translated and aligned
• Extraction of elicitation phrases from Penn-TB in advanced stages
• Identifying scope of coverage: relations, actions, translation patterns
• Preliminary NE translation lexicon available

Work Plan
• Creation of full elicitation corpus: Nov-03
• Translation/alignment of elicitation corpus: Nov/Dec-03
• Install and integrate BBN English Identifinder: Dec-03
• Acquire initial NE translation lexicon: Dec-03
• Acquire initial word translation lexicon: Dec-03
• Develop and integrate partial parser: Dec-03/Feb-04
• Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
• Integration of preliminary complete system: Feb-04
• Design of evaluation: Feb-04
• System testing and modifications: Feb/Apr-04
• Test-set evaluation: Apr-04