- Number of slides: 96
Automatic Translation of Human Languages Kevin Knight USC/Information Sciences Institute USC/Computer Science Department
Machine Translation (MT) 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。 → The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Why People Get Into This Field • Passion about understanding how human language works – What makes one sequence of words grammatical, and another not? • Interest in foreign languages – What’s the difference between English and Chinese? • Desire to change the world – How will the world be different when the language barrier disappears?
Why It’s Challenging • Each word has tons of meanings – I’ll get a cup of coffee – I didn’t get that joke – I get up at 8 am – I get nervous – Yeah, I get around … ? ? ? • Each word has zillions of contexts • Word order is very different
Why It’s Challenging • Output must be a grammatical, sensible, never-before-uttered sentence! • Computers consume lots of human language – Google, Yahoo, Altavista … – Speech recognizers … • More challenging to also produce human language – What makes one sequence of words grammatical, and another not?
Recent Progress. 2002 system output: insistent Wednesday may recurred her trips to Libya tomorrow for flying. Cairo 6-4 (AFP) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. 2003 system output: Egyptair Has Tomorrow to Resume Its Flights to Libya. Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
2005: news broadcast → foreign language speech recognition → English translation → searchable archive
Statistical Machine Translation Man, this is so boring. Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”… Translated documents
Things are Consistently Improving. [Figure: annual evaluation of Arabic-to-English MT systems; y-axis: translation quality (10 to 70); x-axis: year (2002 to 2006); annotation on the curve: "Exceeded commercial-grade translation here."]
Progress Driven by Experiments! [Figure: USC/ISI Syntax-Based MT System, Chinese/English, NIST 2002 Test Set; y-axis: translation quality (15 to 35); x-axis: date (Mar 1, Apr 1, May 1, 2005).]
Warren Weaver (1947) Ciphertext: ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv . . . The slides decode it step by step: (1) mark every cipher letter n as plaintext e; (2) read fpn as "the"; (3) fill in the resulting t and h wherever f and p occur; (4) try cv = "of", which makes owoktvcv end in "fof", so back off; (5) try cv = "is", which makes owoktvcv end in "sis".
Warren Weaver (1947) Ciphertext: ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv . . . Plaintext: decipherment is the analysis of documents written in ancient languages . . .
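The decipherment on these slides can be checked mechanically. The sketch below is only an illustration: the function names `derive_key` and `decode` are made up here, and the key is read off the finished ciphertext/plaintext pair rather than discovered by search. It also replays the two guesses for the cipher word "cv".

```python
CIPHERTEXT = "ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv"
PLAINTEXT = "decipherment is the analysis of documents written in ancient languages"


def derive_key(cipher, plain):
    """Read a letter-substitution key off an aligned cipher/plain pair."""
    key = {}
    for c, p in zip(cipher, plain):
        if c == " ":
            continue
        # setdefault stores the first plaintext letter seen for c;
        # a later mismatch means this is not a simple substitution cipher.
        if key.setdefault(c, p) != p:
            raise ValueError(f"cipher letter {c!r} maps to two plaintext letters")
    return key


def decode(cipher, key, unknown="_"):
    """Apply a (possibly partial) key; undecided letters are shown as '_'."""
    return "".join(ch if ch == " " else key.get(ch, unknown) for ch in cipher)


full_key = derive_key(CIPHERTEXT, PLAINTEXT)
print(decode(CIPHERTEXT, full_key))

# Replaying the two guesses for "cv" on the word "owoktvcv":
print(decode("owoktvcv", {"c": "o", "v": "f"}))  # guess "of"
print(decode("owoktvcv", {"c": "i", "v": "s"}))  # guess "is"
```

A real decipherment program would have to search over keys and score each candidate plaintext, for instance with a language model; this sketch only verifies the end result.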
Warren Weaver (1947) When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
Spanish/English text
1a. Garcia and associates. | 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. | 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. | 3b. sus asociados no son fuertes.
4a. Garcia has a company also. | 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. | 5b. sus clientes estan enfadados.
6a. the associates are also angry. | 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. | 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. | 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. | 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. | 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. | 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. | 12b. los grupos pequenos no son modernos.
Spanish/English text Translate: Clients do not sell pharmaceuticals in Europe. (Use the twelve sentence pairs from the previous slide.)
Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok. | 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. | 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. | 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. | 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. | 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. | 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. | 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. | 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. | 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. | 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. | 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. | 12b. wat nnat forat arrat vat gat.
Centauri/Arcturan [Knight, 1997] Same assignment and sentence pairs as above. Slide annotation: process of elimination
Centauri/Arcturan [Knight, 1997] Same assignment and sentence pairs as above. Slide annotation: cognate?
Centauri/Arcturan [Knight, 1997] Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } Same sentence pairs as above. Slide annotation: zero fertility
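One way to start on the Centauri/Arcturan puzzle is to intersect the Arcturan sentences that accompany every occurrence of a Centauri word. The sketch below is only that first step, with a made-up helper name `candidates`; it is not the full statistical procedure. Some words resolve at once, while others stay ambiguous and need process of elimination.

```python
# The twelve sentence pairs from the slide (Centauri, Arcturan), punctuation removed.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidates(word, corpus=CORPUS):
    """Arcturan words that appear in every sentence whose Centauri side has `word`."""
    target_sets = [set(tgt.split()) for src, tgt in corpus if word in src.split()]
    return set.intersection(*target_sets) if target_sets else set()


for w in ["farok", "hihok", "yorok"]:
    print(w, sorted(candidates(w)))
```

Here "farok" and "hihok" each narrow to a single candidate, while "yorok" still has three. Two of those three are better explained by other Centauri words, which is where elimination comes in.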
Bilingual Training Data. [Figure: millions of words (English side) per language pair.] Plus 1M-20M words for many language pairs. (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)
Sample Learning Curves. [Figure: BLEU score vs. number of sentence pairs used in training, for Swedish/English, French/English, German/English, and Finnish/English. Experiments by Philipp Koehn.]
MT Evaluation. Traditionally difficult because there is no “right answer”: 20 human translators will translate the same sentence 20 different ways.
New Evaluation Metric (“BLEU”) (Papineni et al, ACL-2002) Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport. Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance. • N-gram precision (score is between 0 & 1) – What percentage of machine n-grams can be found in the reference translation? – An n-gram is a sequence of n words – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the”) • Brevity penalty – Can’t just type out single word “the” (precision 1.0!) *** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)
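The two ingredients named on this slide, clipped n-gram precision and a brevity penalty, can be sketched in a few lines. This is a simplified sentence-level version written for illustration, assuming whitespace tokenisation, n-grams up to 4, and no smoothing; it is not the official scoring script. The example sentences come from the "BLEU in Action" slide.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with counts clipped
    so the same portion of a reference cannot be matched twice."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def brevity_penalty(candidate, references):
    """Penalise candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty n-gram level zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


reference = "the gunman was shot to death by the police .".split()
for hyp in ["the gunman was shot to death by the police .",
            "the gunman was shot dead by the police .",
            "the gunman was police kill ."]:
    print(round(bleu(hyp.split(), [reference]), 3), hyp)
```

Scores for single sentences are coarse, and the last hypothesis scores zero because it has no matching 4-gram. The metric is normally computed over a whole test set, with several reference translations per sentence.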
Multiple Reference Translations Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport. Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places. Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert. Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter. Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU Tends to Predict Human Judgments. [Figure: human judgments plotted against a variant of BLEU; slide from G. Doddington (NIST).]
BLEU in Action 枪手被警方击毙 。 (Foreign Original) the gunman was shot to death by the police. (Reference Translation) the gunman was police kill. wounded police jaya of the gunman was shot dead by the police. the gunman arrested by police kill. the gunmen were killed. the gunman was shot to death by the police. gunmen were killed by police ? SUB>0 al by the police. the ringer is killed by the police killed the gunman. #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
BLEU in Action (same example and system outputs #1 to #10 as on the previous slide). Colour legend: green = 4-gram match (good!); red = word not matched (bad!)
Word-Based Statistical MT
Statistical MT Systems. Training: Spanish/English bilingual text → statistical analysis; English text → statistical analysis. Translation: Spanish (Que hambre tengo yo) → broken English (What hunger have I, Hungry I am so, I am so hungry, Have I that hunger …) → English (I am so hungry)
Statistical MT Systems. Training: Spanish/English bilingual text → statistical analysis → Translation Model P(s|e); English text → statistical analysis → Language Model P(e). Translation: Spanish (Que hambre tengo yo) → broken English → English (I am so hungry), computed by the decoding algorithm: argmax over e of P(e) * P(s|e)
Bayes Rule. Given a source sentence s, the decoder should consider many possible translations … and return the target string e that maximizes P(e | s). By Bayes Rule, we can also write this as P(e) x P(s | e) / P(s) and maximize that instead. P(s) never changes while we compare different e’s, so we can equivalently maximize P(e) x P(s | e).
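Written out as a chain of equalities, the argument on this slide reads as follows. The hat on e is notation added here for the decoder's output; s is the source sentence and e ranges over candidate target strings.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \; P(e \mid s) \\
        &= \arg\max_{e} \; \frac{P(e)\, P(s \mid e)}{P(s)}
           && \text{(Bayes' rule)} \\
        &= \arg\max_{e} \; P(e)\, P(s \mid e)
           && \text{($P(s)$ does not depend on $e$)}
\end{align*}
```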
Three Problems for Statistical MT • Language model – Given an English string e, assigns P(e) by formula – good English string -> high P(e) – random word sequence -> low P(e) • Translation model – Given a pair of strings
The Classic Language Model Word N-Grams Goal of the language model: (1) He is on the soccer field / He is in the soccer field (2) Is table the on cup the / The cup is on the table (3) Rice shrine / American shrine / Rice company / American company
The Classic Language Model Word N-Grams Generative story: w1 = START; repeat until END is generated: produce word w2 according to a big table P(w2 | w1); w1 := w2. P(I saw water on the table) = P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table). Probabilities can be learned from online English text.
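The generative story above is a bigram model, and its "big table" can be estimated by counting. The sketch below is a minimal illustration: the two-sentence training corpus is a toy assumption, the function names are made up here, and there is no smoothing, so any unseen bigram gives probability zero.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"


def train_bigram(sentences):
    """Estimate P(w2 | w1) by relative frequency over START/END-padded sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}


def sentence_prob(model, sentence):
    """P(sentence) as a product of bigram probabilities, as in the slide's formula."""
    tokens = [START] + sentence.split() + [END]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= model.get(w1, {}).get(w2, 0.0)  # unseen bigram -> 0 (no smoothing)
    return prob


# Toy training text (an assumption for illustration; real models use far more data).
model = train_bigram(["he is on the soccer field", "the cup is on the table"])
print(sentence_prob(model, "the cup is on the table"))   # grammatical word order
print(sentence_prob(model, "is table the on cup the"))   # scrambled word order
```

Practical language models use longer n-grams and smoothing, so that sensible but unseen word sequences do not receive zero probability.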
Translation Model? Generative story: Mary did not slap the green witch → source-language morphological analysis → source parse tree → semantic representation → generate target structure → Maria no dió una bofetada a la bruja verde
Translation Model? Generative story: Mary did not slap the green witch → source-language morphological analysis → source parse tree → semantic representation → generate target structure → Maria no dió una bofetada a la bruja verde. What are all the possible moves and their associated probability tables?
The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993] Generative story: Mary did not slap the green witch → (fertility, n(3|slap)) Mary not slap slap slap the green witch → (NULL insertion, P-Null) Mary not slap slap slap NULL the green witch → (word translation, t(la|the)) Maria no dió una bofetada a la verde bruja → (distortion, d(j|i)) Maria no dió una bofetada a la bruja verde. Probabilities can be learned from raw bilingual text.
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … All word alignments equally likely All P(french-word | english-word) equally likely
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased.
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle)
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … settling down after another iteration
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … Inherent hidden structure revealed by EM training! For details, see: • “A Statistical MT Tutorial Workbook” (Knight, 1999). • “The Mathematics of Statistical Machine Translation” (Brown et al, 1993) • Software: GIZA++
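The EM behaviour walked through on the word alignment slides can be reproduced with a few lines of IBM Model 1 training. The sketch below rests on several assumptions: the three sentence pairs are a reading of the slide fragments (la maison / the house, la maison bleue / the blue house, la fleur / the flower), there is no NULL word, and `train_model1` and `best_translation` are names made up for this example. GIZA++ implements the full model family.

```python
from collections import defaultdict

# Assumed reading of the slide's parallel fragments: (French words, English words).
PAIRS = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]


def train_model1(pairs, iterations=20):
    """EM for t(f | e), starting with every P(french-word | english-word) equal."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of links leaving e
        for fs, es in pairs:
            for f in fs:
                # E-step: spread one unit of credit for f over the English
                # words of the same sentence, in proportion to current t(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, e):
    """French word with the highest t(f | e) for a given English word."""
    return max((p, f) for (f, e2), p in t.items() if e2 == e)[1]


t = train_model1(PAIRS)
for e in ["the", "house", "blue", "flower"]:
    print(e, "->", best_translation(t, e), round(t[(best_translation(t, e), e)], 3))
```

No alignment is ever given to the program. The preference for la/the, maison/house, bleue/blue and fleur/flower emerges from co-occurrence alone, which is the hidden structure the slide refers to.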
Word Alignment … la maison bleue … la fleur … … the house … the blue house … the flower … P(juste | fair) = 0.411 P(juste | correct) = 0.027 P(juste | right) = 0.020 … new French sentence → possible English translations, to be rescored by language model
Decoding: actual process of translating a new sentence. Given foreign sentence f, find English sentence e that maximizes P(e) x P(f | e). Example: Que hambre tengo yo. Candidate English words per source word: Que: what, that, so, where; hambre: hunger, hungry; tengo: have, am, make; yo: I, me
Decoder: Actually Translates New Sentences. The search runs from start, through the 1st, 2nd, 3rd, and 4th target word, to end (all source words covered). Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far) [Jelinek, 1969; Brown et al, 1996 US Patent; Och, Ueffing, and Ney, 2001]
Dynamic Programming Beam Search. Same search lattice (start, 1st through 4th target word, end with all source words covered), where each hypothesis also keeps a link to its best predecessor. Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far) [Jelinek, 1969; Brown et al, 1996 US Patent; Och, Ueffing, and Ney, 2001]
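A stripped-down version of this search fits in a short program. Everything numeric below is a toy assumption made up for illustration: the `TM` and `LM` tables are hand-written stand-ins for learned P(f | e) and bigram P(e) values, there is no fertility or distortion model, and the language model is a bigram, so hypotheses are recombined on coverage plus last word only (the slides also keep the next-to-last word).

```python
import math

# Hypothetical toy tables, NOT learned from data.
# TM[f][e] stands in for P(f | e); LM[(prev, w)] stands in for P(w | prev).
TM = {
    "que":    {"what": 0.4, "that": 0.3, "so": 0.2, "where": 0.1},
    "hambre": {"hunger": 0.6, "hungry": 0.4},
    "tengo":  {"have": 0.5, "am": 0.3, "make": 0.2},
    "yo":     {"I": 0.7, "me": 0.3},
}
LM = {("<s>", "I"): 0.5, ("I", "am"): 0.5, ("am", "so"): 0.5,
      ("so", "hungry"): 0.5, ("hungry", "</s>"): 0.5}
UNSEEN = 1e-3  # floor probability for bigrams missing from the toy LM


def lm(prev, word):
    return LM.get((prev, word), UNSEEN)


def decode(source, beam_size=5):
    """Beam search; stack n holds hypotheses that cover n source words."""
    # hypothesis = (log score, covered source positions, English words so far)
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(0.0, frozenset(), ("<s>",))]
    for n in range(len(source)):
        best = {}  # recombination: keep the best hypothesis per (coverage, last word)
        for score, covered, words in stacks[n]:
            for i, f in enumerate(source):
                if i in covered:
                    continue
                for e, p in TM[f].items():
                    new_score = score + math.log(p) + math.log(lm(words[-1], e))
                    key = (covered | {i}, e)
                    if key not in best or new_score > best[key][0]:
                        best[key] = (new_score, covered | {i}, words + (e,))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        stacks[n + 1] = sorted(best.values(), key=lambda h: h[0], reverse=True)[:beam_size]
    # All source words covered: add the end-of-sentence LM score and pick the best.
    final = max(stacks[-1], key=lambda h: h[0] + math.log(lm(h[2][-1], "</s>")))
    return " ".join(final[2][1:])


print(decode("que hambre tengo yo".split()))
```

Storing the whole word tuple in each hypothesis replaces the best-predecessor links of a real decoder, which keeps the sketch short at the cost of memory.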
The Classic Results • Foreign original: la politique de la haine. Reference translation: politics of hate. IBM 4 + N-grams + Stack: the policy of the hatred. • Foreign original: nous avons signé le protocole. Reference translation: we did sign the memorandum of agreement. IBM 4 + N-grams + Stack: we have signed the protocol. • Foreign original: où était le plan solide ? Reference translation: but where was the solid plan ? IBM 4 + N-grams + Stack: where was the economic base ? • the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
Flaws of Word-Based MT • Multiple English words for one French word – IBM models can do one-to-many (fertility) but not many-to-one • Phrasal Translation – “real estate”, “note that”, “interest in” • Syntactic Transformations – Verb at the beginning in Arabic – Translation model penalizes any proposed re-ordering – Language model not strong enough to force the verb to move to the right place
Phrase-Based Statistical MT
Phrase-Based Statistical MT. German input: Morgen fliege ich nach Kanada zur Konferenz. English output: Tomorrow I will fly to the conference in Canada. • Foreign input segmented into phrases – “phrase” is any sequence of words • Each phrase is probabilistically translated into English (HUGE TABLE!!) – P(to the conference | zur Konferenz) – P(into the meeting | zur Konferenz) • Phrases are probabilistically re-ordered. See [Koehn et al, 2003] for an intro. This is state-of-the-art.
Advantages of Phrase-Based • Many-to-many mappings can handle noncompositional phrases (e. g. , “real estate”) • Local context is very useful for disambiguating – “Interest rate” … – “Interest in” … • The more data, the longer the learned phrases – Sometimes whole sentences
How to Learn the Phrase Translation Table? • One method: “alignment templates” (Och et al, 1999) • Start with word alignment, build phrases from that. Maria no dió una bofetada a la bruja verde Mary did not slap the green witch This word-to-word alignment is a by-product of training a translation model like IBM-Model-3. This is the best (or “Viterbi”) alignment.
IBM Models are 1-to-Many • Run IBM-style aligner in both directions, then merge: E→F best alignment + F→E best alignment → MERGE (union or intersection)
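The merge step itself is plain set arithmetic once both alignments use the same (foreign position, English position) convention. The two toy alignments below are assumptions made up for the fragment "Maria no" / "Mary did not", and `symmetrize` is a name invented for this sketch.

```python
def symmetrize(f_to_e, e_to_f, method="intersection"):
    """Merge two directional word alignments given as sets of (f_index, e_index)."""
    if method == "intersection":
        return f_to_e & e_to_f   # high precision: links both runs agree on
    if method == "union":
        return f_to_e | e_to_f   # high recall: links either run proposes
    raise ValueError(f"unknown method: {method}")


# Hypothetical output of the two aligner runs for "Maria no" / "Mary did not".
# One direction can attach "no" to only one English word; the other can attach
# both "did" and "not" to "no".
f_to_e = {(0, 0), (1, 2)}            # Maria-Mary, no-not
e_to_f = {(0, 0), (1, 1), (1, 2)}    # Maria-Mary, no-did, no-not

print(sorted(symmetrize(f_to_e, e_to_f, "intersection")))
print(sorted(symmetrize(f_to_e, e_to_f, "union")))
```

Practical systems often start from the intersection and add selected links from the union; that heuristic is outside this sketch.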
How to Learn the Phrase Translation Table? • Collect all phrase pairs that are consistent with the word alignment Maria no dió una bofetada a la bruja verde Mary did not slap the green witch one example phrase pair
Consistent with Word Alignment. [Figure: three candidate phrase pairs drawn over “Maria no dió” and “Mary did not slap”; one is marked consistent and two are marked inconsistent (x).] Phrase alignment must contain all alignment points for all the words in both phrases!
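The consistency rule translates directly into a check over alignment points, and enumerating all spans then yields the phrase pairs. The word alignment below is an assumption chosen for illustration ("no" linked to both "did" and "not", "a" left unaligned); the slides do not list the alignment points explicitly, and the function names are made up here.

```python
F = "Maria no dió una bofetada a la bruja verde".split()
E = "Mary did not slap the green witch".split()

# Assumed word alignment as (Spanish index, English index) pairs.
ALIGNMENT = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)}


def consistent(alignment, f_span, e_span):
    """True if no alignment point leaves the box and at least one lies inside it.
    Spans are inclusive (lo, hi) index pairs."""
    (f_lo, f_hi), (e_lo, e_hi) = f_span, e_span
    inside = 0
    for f, e in alignment:
        f_in = f_lo <= f <= f_hi
        e_in = e_lo <= e <= e_hi
        if f_in != e_in:      # one end inside the box, the other outside
            return False
        inside += f_in
    return inside > 0


def extract_phrases(f_words, e_words, alignment):
    """All (Spanish phrase, English phrase) pairs consistent with the alignment."""
    pairs = set()
    for f_lo in range(len(f_words)):
        for f_hi in range(f_lo, len(f_words)):
            for e_lo in range(len(e_words)):
                for e_hi in range(e_lo, len(e_words)):
                    if consistent(alignment, (f_lo, f_hi), (e_lo, e_hi)):
                        pairs.add((" ".join(f_words[f_lo:f_hi + 1]),
                                   " ".join(e_words[e_lo:e_hi + 1])))
    return pairs


phrases = extract_phrases(F, E, ALIGNMENT)
print(("Maria no", "Mary did not") in phrases)       # consistent
print(("Maria no dió", "Mary did not") in phrases)   # "dió" links out to "slap"
```

The brute-force enumeration is quartic in sentence length, so real extractors cap the phrase length. Unaligned words such as "a" may attach to either neighbouring phrase, which is why both (la, the) and (a la, the) come out under this alignment.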
Word Alignment Induced Phrases. Maria no dió una bofetada a la bruja verde / Mary did not slap the green witch. Phrase pairs, shortest first: (Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
Phrase Pair Probabilities • A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus. – We hope so! • We can calculate phrase substitution probabilities P(f-f-f | e-e-e) • We can use these in decoding • Much better results than word-based translation!
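A common way to turn phrase pair counts into P(f-f-f | e-e-e) is relative frequency: divide the count of the pair by the count of the English phrase. The sketch below assumes that estimator; the list of extracted pairs and its counts are hypothetical, made up for illustration.

```python
from collections import Counter


def phrase_probs(extracted_pairs):
    """Relative-frequency estimate of P(f_phrase | e_phrase) from a list of
    (f_phrase, e_phrase) pairs collected over the whole bilingual corpus."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}


# Hypothetical extraction output (counts invented for the example).
extracted = (
    [("zur Konferenz", "to the conference")] * 3
    + [("zu der Konferenz", "to the conference")]
    + [("zur Konferenz", "into the meeting")]
)

probs = phrase_probs(extracted)
print(probs[("zur Konferenz", "to the conference")])
print(probs[("zu der Konferenz", "to the conference")])
```

A pair seen only once gets an overconfident estimate under this scheme, as ("zur Konferenz", "into the meeting") does here.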
Syntax and Semantics in Statistical MT
MT Pyramid. SOURCE side, bottom to top: words, phrases, syntax, semantics. Apex: interlingua. TARGET side, top to bottom: semantics, syntax, phrases, words.
Why Syntax? • Need much more grammatical output • Need accurate control over re-ordering • Need accurate insertion of function words • Word translations need to depend on grammatically-related words
Linguistic Transformations using Tree Automata Original input: Transformation: S NP S VP NP VP PRO VBZ NP he enjoys SBAR VBG VP listening P NP to music
Linguistic Transformations using Tree Automata Original input: Transformation: S NP PRO he VP VBZ enjoys NP NP SBAR VBG PRO VP he listening P NP to music NP , wa , SBAR VBG , o VP listening P NP to music , VBZ enjoys
Linguistic Transformations using Tree Automata Original input: Transformation: S NP PRO he VP VBZ enjoys NP NP kare SBAR VBG VP , wa , SBAR VBG listening P NP to music , o VP listening P NP to music , VBZ enjoys
Linguistic Transformations using Tree Automata. Original input: (S (NP (PRO he)) (VP (VBZ enjoys) (NP (SBAR (VBG listening) (VP (P to) (NP music)))))). Final output: kare , wa , ongaku , o , kiku , no , ga , daisuki , desu
Automata + Linguistics + Learning. Linguistic Theory: Transformational Grammar (Chomsky 57). Automata Theory: Tree Automata (Rounds 70). Applications: MT (05), Compression (01), QA (03), Generation (00). Algorithms: Efficient Automata Algorithms, Generic Toolkits.
Making Good Progress • Algorithms + Data + Evaluation + Computers • Interdisciplinary work – Natural language processing – Machine learning – Linguistics – Automata theory • Lots of room for improvement!
Future Ph.D. Theses? “Syntax-based Language Models for Improving Statistical MT” “Discriminative Training of Millions of Features for MT” “Semantic Representations Induced from Multilingual EU and UN Data” “What Makes One Language Pair More Difficult to Translate Than Another” “A State-of-the-Art MT System Based on Syntactic Transformations” “New Training Methods for High-Quality Word Alignment” + many unpredictable ones…
Thank you. If you are interested in getting research experience in this area, and are a very good programmer, contact: knight@isi.edu