- Number of slides: 44
COMS E 6998: Topics in Computer Science Spring 2013 Machine Translation Dr. Nizar Habash Research Scientist Center for Computational Learning Systems Columbia University
Session #1 • Introductions • Syllabus Explanation • Lecture – Why Machine Translation – Multilingual Challenges for MT – MT Approaches – MT Evaluation
Why (Machine) Translation? Languages in the world • 6,800 living languages • 600 with written tradition • 100 languages are spoken by 95% of world population Translation Market • $26 Billion Global Market (2010) • Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)
Multilingualism Tower of Babel • Genesis 11:1-9: "1 And the whole earth was of one language, and of one speech. ... 9 Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth: and from thence did the Lord scatter them abroad upon the face of all the earth." • Foremost symbol of multilingualism as a problem
Multilingualism Language Families
Multilingualism Rosetta Stone • Ancient Egyptian stele (196 BCE) • Key to modern understanding of Egyptian hieroglyphs • Trilingual document: – ancient Egyptian hieroglyphs – Egyptian demotic script – ancient Greek • Common symbol of parallel corpora and translation solutions
Modern Rosetta Stones?
Multilingual Challenges • nai you duo shi means buttered toast • naiyou means butter • duoshi means toast • duo means many • shi can mean private (as in the army rank)
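The ambiguity on this slide can be sketched in a few lines of Python. This is only an illustration, assuming an unspaced romanized string stands in for unsegmented Chinese text; the function name `segmentations` and the toy `LEXICON` are hypothetical, and the glosses are the slide's own.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix and segment whatever is left over.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon built from the slide (hypothetical, for illustration only).
LEXICON = {
    "naiyou": "butter",
    "duoshi": "toast",
    "duo": "many",
    "shi": "private (army rank)",
}

for seg in segmentations("naiyouduoshi", LEXICON):
    print(" + ".join(f"{word} ({LEXICON[word]})" for word in seg))
```

Even with four lexicon entries the string has two valid segmentations, "butter + many + private" and "butter + toast", so a segmenter needs more than a word list to pick the intended reading.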
Shatt Al-Arab Fresh Fish
Machine Translation Science Fiction • Star Trek Universal Translator an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix"….
Machine Translation Science Fiction • Futurama Universal Translator Dr. Farnsworth: "This is my Universal Translator, although it only translates into an incomprehensible dead language" Cubert: "Hello!" Machine: "Bonjour!" Dr. Farnsworth: "Incomprehensible gibberish"
Machine Translation Science Fiction • The Babel Fish, from "The Hitch Hiker's Guide to the Galaxy" (Douglas Adams), "is small, yellow and leech-like, ... if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language..."
Machine Translation Reality http://www.medialocate.com/
Machine Translation Reality
• Currently, Google offers translations between the following languages (over 3,000 pairs): Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish
“BBC found similar support”!!!
Why Machine Translation? • Full Translation – Domain specific, e.g., weather reports • Machine-aided Translation – Requires post-editing • Cross-lingual NLP applications – Cross-language IR – Cross-language Summarization • Testing grounds – Extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.
Road Map • Multilingual Challenges for MT • MT Approaches • MT Evaluation
Multilingual Challenges • Orthographic Variations – Ambiguous spelling: ﻛﺘﺐ ﺍﻻﻭﻻﺩ ﺍﺷﻌﺎﺭﺍ ﺍﻷﻻ ﺍﺷﺍﺭ – Ambiguous word boundaries • Lexical Ambiguity – Bank: ﺑﻨﻚ (financial) vs. ﺿﻔﺔ (river) – Eat: essen (human) vs. fressen (animal)
Multilingual Challenges Morphological Variations • Affixational (prefix/suffix) vs. Templatic (Root+Pattern) – English: write, kill, do vs. written, killed, done – Arabic: ﻛﺘﺐ ﻗﺘﻞ ﻓﻌﻞ vs. ﻣﻜﺘﻮﺏ ﻣﻘﺘﻮﻝ ﻣﻔﻌﻮﻝ • Tokenization (aka segmentation+normalization) into conj + article + noun + plural – English: And the cars → and the cars – Arabic: ﻭﺍﻟﺴﻴﺎﺭﺍﺕ → w Al SyArAt – French: Et les voitures → et le voitures
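The Arabic tokenization example can be sketched as naive prefix stripping. This is a deliberately simplistic illustration, not how a real Arabic tokenizer works: the clitic list, the `min_stem` threshold, and the function name `naive_tokenize` are assumptions, and the input uses the romanized form shown on the slide.

```python
# Hypothetical proclitic inventory, in the transliteration used on the slide:
# "w" is the conjunction "and", "Al" is the definite article "the".
PROCLITICS = ["w", "Al"]


def naive_tokenize(word, min_stem=3):
    """Peel proclitics off the front of `word`, in a fixed order.

    A clitic is only split off if at least `min_stem` characters remain,
    a crude guard against cutting into the stem itself.
    """
    tokens = []
    rest = word
    for clitic in PROCLITICS:
        if rest.startswith(clitic) and len(rest) - len(clitic) >= min_stem:
            tokens.append(clitic)
            rest = rest[len(clitic):]
    tokens.append(rest)
    return tokens


print(naive_tokenize("wAlSyArAt"))  # the slide's "and the cars" example
print(naive_tokenize("wld"))        # short word starting with "w": left whole
```

The length guard is the weak point: many stems legitimately begin with "w" or "Al", so string matching alone over-segments or under-segments, which is why tokenization in practice relies on morphological analysis and disambiguation.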
Morphology ﻳﻘﺮﺃ ﺍﻟﻄﺎﻟﺐ ﺍﻟﻤﺠﺘﻬﺪ ﻛﺘﺎﺑﺎ ﻋﻦ ﺍﻟﺼﻴﻦ ﻓﻲ ﺍﻟﺼﻒ read the-student the-diligent a-book about china in the-classroom the diligent student is reading a book about china in the classroom 这位勤奋的学生在教室读一本关于中国的书 this quant diligent de student in classroom read one quant about china de book • Arabic: very rich morphology: number, gender, case, person, aspect, voice, several clitics, etc. – Arabic tokenization • English: simple morphology • Chinese: no morphology – quantifiers & verbal aspects
Syntax ﻳﻘﺮﺃ ﺍﻟﻄﺎﻟﺐ ﺍﻟﻤﺠﺘﻬﺪ ﻛﺘﺎﺑﺎ ﻋﻦ ﺍﻟﺼﻴﻦ ﻓﻲ ﺍﻟﺼﻒ read the-student the-diligent a-book about china in the-classroom the diligent student is reading a book about china in the classroom 这位勤奋的学生在教室读一本关于中国的书 this quant diligent de student in classroom read one quant about china de book • Subj-Verb: Arabic V Subj or Subj V; English Subj V; Chinese Subj … V • Verb-PP: Arabic V … PP; English V PP; Chinese PP V • Adjectives: Arabic N Adj; English Adj N; Chinese Adj de N • Possessives: Arabic N Poss; English N of Poss or Poss 's N; Chinese Poss de N • Relatives: Arabic N Rel; English N Rel; Chinese Rel de N
Translation Divergences conflation • English: I am not here • Arabic: ﻟﺴﺖ ﻫﻨﺎ (I-am-not here) • French: Je ne suis pas ici (I not am not here)
Translation Divergences categorial, thematic and structural • English: I am cold • Arabic: ﺍﻧﺎ ﺑﺮﺩﺍﻥ (I cold) • Spanish: tengo frío (I-have cold) • Hebrew: קר לי (cold for-me)
Translation Divergences head swap and categorial • English: I swam across the river quickly • Arabic: ﺍﺳﺮﻋﺖ ﻋﺒﻮﺭ ﺍﻟﻨﻬﺮ ﺳﺒﺎﺣﺔ (I-sped crossing the-river swimming)
Translation Divergences head swap and categorial • English: I swam across the river quickly • Hebrew: חציתי את הנהר בשחיה במהירות (I-crossed obj river in-swim speedily)
Translation Divergences head swap and categorial [Figure: dependency trees for the English, Arabic, and Hebrew versions of "I swam across the river quickly", with nodes labeled by part of speech (verb, noun, prep, adverb)]
Translation Divergences Orthography+Morphology+Syntax • English: mom's car (car possessed-by mom) • Chinese: 妈妈的车 (mama de che) • Arabic: ﺳﻴﺎﺭﺓ ﻣﺎﻣﺎ (sayyArat mama) • French: la voiture de maman
Road Map • Multilingual Challenges for MT • MT Approaches • MT Evaluation
MT Strategies (1954-2004) • Knowledge Acquisition Strategy axis, from all manual to fully automated: hand-built by experts; hand-built by non-experts; learn from annotated data; learn from un-annotated data • Knowledge Representation Strategy axis, from shallow/simple to deep/complex: word-based only; phrase tables; syntactic constituent structure; semantic analysis; interlingua • Systems plotted: electronic dictionaries; original direct approach; typical transfer system; classic interlingual system; example-based MT; original statistical MT • New Research Goes Here! • Slide courtesy of Laurie Gerber
MT Approaches MT Pyramid • Source side (Analysis): source word, source syntax, source meaning • Target side (Generation): target meaning, target syntax, target word • Gisting: source word to target word
MT Approaches Gisting Example • Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología. • Word-for-word gist: Envelope her basis out speak experiences them settle at 1988 one methodology. • Fluent translation: On the basis of these experiences, a methodology was arrived at in 1988.
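Gisting of this kind amounts to a dictionary lookup per word, with no reordering and no disambiguation. The sketch below reproduces the slide's output under one assumption: the toy dictionary `GIST_DICT` is reverse-engineered from the example by aligning the Spanish and English words one to one, and the function name `gist` is hypothetical.

```python
# Toy bilingual dictionary (an assumption): one fixed English word per
# Spanish word, read off the slide's example by position.
GIST_DICT = {
    "sobre": "envelope",
    "la": "her",
    "base": "basis",
    "de": "out",
    "dichas": "speak",
    "experiencias": "experiences",
    "se": "them",
    "estableció": "settle",
    "en": "at",
    "una": "one",
    "metodología": "methodology",
}


def gist(sentence):
    """Translate word by word; unknown tokens (e.g. numbers) pass through."""
    words = sentence.rstrip(".").split()
    translated = [GIST_DICT.get(word.lower(), word) for word in words]
    return " ".join(translated).capitalize() + "."


print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

Each word gets one context-free translation, so "sobre" (here the preposition "on") comes out as the noun "envelope". That is the failure mode the slide illustrates.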
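MT Approaches MT Pyramid • Source side (Analysis): source word, source syntax, source meaning • Target side (Generation): target meaning, target syntax, target word • Gisting: source word to target word • Transfer: source syntax to target syntax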
MT Approaches Transfer Example • Transfer Lexicon – Map SL structure to TL structure – SL: poner (:subj X, :obj mantequilla, :mod en (:obj Y)) – TL: butter (:subj X, :obj Y) – Example: X puso mantequilla en Y → X buttered Y
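A transfer-lexicon entry like this one can be sketched as a function that matches a source-language structure and builds the corresponding target-language structure. The nested-dict representation, the key names (`head`, `subj`, `obj`, `mod`), and the function name are all assumptions made for illustration, not the format of any particular MT system.

```python
def transfer_poner_mantequilla(node):
    """Apply one transfer rule:

    poner(:subj X, :obj mantequilla, :mod en(:obj Y))  ->  butter(:subj X, :obj Y)
    """
    mod = node.get("mod") or {}
    matches = (
        node.get("head") == "poner"
        and node.get("obj") == "mantequilla"
        and mod.get("head") == "en"
    )
    if not matches:
        return node  # rule does not apply; leave the structure unchanged
    return {"head": "butter", "subj": node.get("subj"), "obj": mod.get("obj")}


# Source-language structure for "X puso mantequilla en Y".
source = {
    "head": "poner",
    "subj": "X",
    "obj": "mantequilla",
    "mod": {"head": "en", "obj": "Y"},
}

target = transfer_poner_mantequilla(source)
print(target)  # target-language structure for "X buttered Y"
```

The rule works on structure rather than word order, so the object of the Spanish preposition becomes the direct object of the English verb. Word-for-word gisting cannot express that mapping.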
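MT Approaches MT Pyramid • Source side (Analysis): source word, source syntax, source meaning • Target side (Generation): target meaning, target syntax, target word • Gisting: source word to target word • Transfer: source syntax to target syntax • Interlingua: source meaning to target meaning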
MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)
MT Approaches MT Pyramid: resources by level • Meaning level: Interlingual Lexicons • Syntax level: Transfer Lexicons • Word level: Dictionaries/Parallel Corpora • Source side: Analysis; target side: Generation
MT Approaches Statistical vs. Rule-based • Source side (Analysis): source word, source syntax, source meaning • Target side (Generation): target meaning, target syntax, target word
To be continued …