What’s New in Statistical Machine Translation. Kevin Knight, USC/Information Sciences Institute, USC/Computer Science Department

Machine Translation
美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

Thousands of Languages Are Spoken
MANDARIN SPANISH ENGLISH BENGALI HINDI PORTUGUESE RUSSIAN JAPANESE GERMAN: 885, 000 332, 000 322, 000 189, 000 182, 000, 000 170, 000 125, 000 98, 000
WU (China) JAVANESE KOREAN FRENCH VIETNAMESE TELUGU YUE (China) MARATHI TAMIL: 77, 175, 000 75, 500, 800 75, 000 72, 000 67, 662, 000 66, 350, 000 66, 000 64, 783, 000 63, 075, 000
TURKISH URDU MIN NAN (China) JINYU (China): 59, 000 58, 000 49, 000 45, 000
GUJARATI POLISH ARABIC UKRAINIAN: 44, 000, 000 42, 500, 000 41, 000
ITALIAN XIANG (China) MALAYALAM HAKKA (China): 37, 000 36, 015, 000 34, 022, 000 34, 000
KANNADA ORIYA PANJABI SUNDA: 33, 663, 000 31, 000 30, 000 27, 000
Source: Ethnologue

Recent Progress in Statistical MT
2002: insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
2003: Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

2005: news broadcast → foreign language speech recognition → English translation → searchable archive

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [first guess: n = e]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [next guess: fpn = the]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [letters filled in so far: e, h, t]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [trial guess: cv = of]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [with cv = of, owoktvcv would end in "fof"]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [guess cv = of withdrawn]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... [trial guess: cv = is, so owoktvcv ends in "sis"]

Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ... = decipherment is the analysis of documents written in ancient languages ...

Warren Weaver (1947) Can this be computerized? The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish … Collected mechanically from a Turkish body of text, or corpus
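A table of letter-pair frequencies of the kind mentioned on this slide can be collected with a few lines of code. The sketch below is only an illustration: the function name `letter_pair_frequencies` is hypothetical, and a short English string stands in for the Turkish corpus.

```python
from collections import Counter


def letter_pair_frequencies(corpus: str) -> dict:
    """Relative frequency of each adjacent letter pair inside the words of `corpus`."""
    counts = Counter()
    for word in corpus.lower().split():
        # Keep letters only, so punctuation does not create spurious pairs.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(zip(letters, letters[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {a + b: n / total for (a, b), n in counts.items()}


if __name__ == "__main__":
    table = letter_pair_frequencies("the then")
    for pair, freq in sorted(table.items(), key=lambda kv: -kv[1]):
        print(pair, round(freq, 2))
```

A decipherer would compare such a table, built from plain text in the target language, against the pair statistics of the ciphertext.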

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947
“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful.” - Norbert Wiener, April 1947

Spanish/English corpus
1a. Garcia and associates. / 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. / 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. / 3b. sus asociados no son fuertes.
4a. Garcia has a company also. / 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. / 5b. sus clientes estan enfadados.
6a. the associates are also angry. / 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. / 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. / 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. / 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. / 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. / 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. / 12b. los grupos pequenos no son modernos.

Spanish/English corpus (sentence pairs 1-12 as on the previous slide) Translate: Clients do not sell pharmaceuticals in Europe.

Centauri/Arcturan [Knight 97] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
1a. ok-voon ororok sprok. / 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. / 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. / 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. / 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. / 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. / 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. / 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. / 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. / 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. / 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. / 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. / 12b. wat nnat forat arrat vat gat.

Centauri/Arcturan [Knight 97] (same assignment and sentence pairs 1-12) process of elimination
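One way to start on the assignment is to intersect the Arcturan sides of every sentence pair in which a given Centauri word occurs. The sketch below is an illustration under that assumption, not the procedure from [Knight 97]; the names `CORPUS` and `candidate_translations` are hypothetical. Words that remain ambiguous after intersection (for example `yorok`) still need the process of elimination named on the slide.

```python
# Sentence pairs 1-12 from the slide: (Centauri, Arcturan).
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidate_translations(word, corpus=CORPUS):
    """Arcturan words present in every pair whose Centauri side contains `word`."""
    candidates = None
    for centauri, arcturan in corpus:
        if word in centauri.split():
            targets = set(arcturan.split())
            candidates = targets if candidates is None else candidates & targets
    return candidates if candidates is not None else set()


if __name__ == "__main__":
    for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
        print(w, sorted(candidate_translations(w)))
```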

Centauri/Arcturan [Knight 97] (same assignment and sentence pairs 1-12) cognate?

Centauri/Arcturan [Knight 97] Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } (sentence pairs 1-12 as before) zero fertility

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947
The required statistical tables have millions of entries…? Too much for the computers of Weaver’s day. Not enough RAM!

IBM Candide Project (1988-1994)
• How to get quantities of human translation in computer readable form?
  – parallel corpus
[Photos: IBM’s John Cocke, inventor of CKY parsing & RISC processors; Canadian bureaucrat]

IBM Candide Project [Brown et al 93]
French/English Bilingual Text → Statistical Analysis; English Text → Statistical Analysis
French: J’ai si faim → Broken English: What hunger have I, Hungry I am so, I am so hungry, Have me that hunger … → English: I am so hungry

Mathematical Formulation
Given source sentence f:
$$\operatorname*{argmax}_e P(e \mid f) = \operatorname*{argmax}_e \frac{P(f \mid e) \cdot P(e)}{P(f)} \quad \text{(by Bayes Rule)}$$
$$= \operatorname*{argmax}_e P(f \mid e) \cdot P(e) \quad \text{(}P(f)\text{ is the same for all } e\text{)}$$
French (J’ai si faim) → Translation Model P(f | e) → Broken English → Language Model P(e) → English (I am so hungry)
Decoding algorithm: argmax_e P(e) · P(f | e)

Language Modeling
Goal of a language model for MT:
He is on the soccer field / He is in the soccer field
Is table the on cup the / The cup is on the table
American shrine / American company
Need to make these decisions, because translation model may not have a lot of context information!

The Classic Language Model: Word Bigrams
Process model of English: Generate each word based only on the previous word.
P(I saw water on the table) = P(I | START) · P(saw | I) · P(water | saw) · P(on | water) · P(the | on) · P(table | the) · P(END | table)
Probabilities can be tabulated from an online English corpus … just like Weaver’s Turkish case.
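The bigram product on this slide can be tabulated and evaluated as in the sketch below. It uses plain relative-frequency estimates with no smoothing, and the three training sentences are made up for illustration; the function names are hypothetical.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Relative-frequency estimates of P(w2 | w1) from whitespace-tokenised sentences."""
    pair_counts, history_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["START"] + sentence.split() + ["END"]
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1
            history_counts[w1] += 1
    return {pair: n / history_counts[pair[0]] for pair, n in pair_counts.items()}


def sentence_probability(model, sentence):
    """Product of bigram probabilities, including the START and END transitions."""
    tokens = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= model.get((w1, w2), 0.0)  # unseen bigram gets probability 0 (no smoothing)
    return prob


if __name__ == "__main__":
    # Toy training text, invented for this example.
    model = train_bigram_model([
        "I saw water on the table",
        "I saw the cup on the table",
        "the cup is on the table",
    ])
    print(sentence_probability(model, "I saw water on the table"))
    print(sentence_probability(model, "is table the on cup the"))
```

A practical system would smooth the estimates so that unseen bigrams do not force the whole product to zero.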

Trigram Language Model
to the said royal purchase plan trustco part operations of its is international expand banking
the banking trustco is said to expand its purchase part of its royal international plan operations
royal trustco said the purchase is part of its plan to expand its international banking operations
N-grams have a lot of semantics in them! [Soricut & Marcu, 05]

Trigram Language Model
with the stressed relationship part own longstanding its for chinese boeing , ,
for its part, stressed the longstanding relationship with its own, chinese boeing, for its part, stressed its own longstanding relationship with the chinese
[Soricut & Marcu, 05]

Translation Model?
Process model of translation: Mary did not slap the green witch → Source-language morphological analysis → Source parse tree → Semantic representation → Generate target structure → Maria no dió una bofetada a la bruja verde
What are all the possible moves and what probability tables control those moves?

The Classic Translation Model: Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:
Mary did not slap the green witch
Mary not slap slap slap the green witch (fertility: n(3|slap), 50k entries)
Mary not slap slap slap NULL the green witch (NULL insertion: P-Null, 1 entry)
Maria no dió una bofetada a la verde bruja (word translation: t(la|the), 25m entries)
Maria no dió una bofetada a la bruja verde (re-ordering: d(j|i), 2500 entries)
Trainable
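The four moves of this process model can be replayed in code for the slide's example. The sketch below is deterministic: it applies one given sequence of choices rather than sampling from the n, P-Null, t and d tables, and the function name, argument names and the small `T_OPTIONS` table are all hypothetical.

```python
def generative_story(english, fertility, null_at, t_options, picks, order):
    """Replay one derivation: fertility, NULL insertion, word translation, re-ordering."""
    # Step 1 (fertility): copy each English word n times; n = 0 deletes the word.
    copied = [w for w in english for _ in range(fertility.get(w, 1))]
    # Step 2 (NULL insertion): add one NULL token at the chosen position.
    copied.insert(null_at, "NULL")
    # Step 3 (word translation): one target word per token, checked against the options.
    if len(picks) != len(copied):
        raise ValueError("need exactly one translation pick per token")
    for source_token, target_word in zip(copied, picks):
        if target_word not in t_options[source_token]:
            raise ValueError(f"{target_word!r} is not an option for {source_token!r}")
    # Step 4 (re-ordering): output position j takes the token at index order[j].
    return [picks[i] for i in order]


# Hypothetical translation options covering only the slide's example.
T_OPTIONS = {
    "Mary": {"Maria"}, "not": {"no"}, "slap": {"dió", "una", "bofetada"},
    "NULL": {"a"}, "the": {"la"}, "green": {"verde"}, "witch": {"bruja"},
}

if __name__ == "__main__":
    result = generative_story(
        english="Mary did not slap the green witch".split(),
        fertility={"did": 0, "slap": 3},
        null_at=5,
        t_options=T_OPTIONS,
        picks="Maria no dió una bofetada a la verde bruja".split(),
        order=[0, 1, 2, 3, 4, 5, 6, 8, 7],
    )
    print(" ".join(result))
```

In the statistical model each of these choices carries a probability, and the product of those probabilities is the score of the derivation.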

The Classic Translation Model: Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:
Mary did not slap the green witch
?
Maria no dió una bofetada a la bruja verde
Tables: n(3|slap), 50k entries; P-Null, 1 entry; t(la|the), 25m entries; d(j|i), 2500 entries
Still trainable!

Classic Formula for P(f | e)
$$P(f \mid e) = \sum_{a} \binom{m - \Phi_0}{\Phi_0} \cdot P_{\text{Null}}^{\,m - 2\Phi_0} \cdot (1 - P_{\text{Null}})^{\Phi_0} \cdot \prod_{i=0}^{l} \Phi_i! \cdot \frac{1}{\Phi_0!} \cdot \prod_{i=1}^{l} n(\Phi_i \mid e_i) \cdot \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \cdot \prod_{j:\, a_j \neq 0} d(j \mid a_j, l, m)$$
Terms: the sum runs over alignment possibilities a; the binomial and P-Null factors are the NULL stuff; n is fertility; t is word translation; d is re-ordering.
Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.

Unsupervised EM Training: … la maison bleue … la fleur … / … the house … the blue house … the flower … All P(french-word | english-word) equally likely

Unsupervised EM Training: … la maison bleue … la fleur … / … the house … the blue house … the flower … “la” and “the” observed to co-occur frequently, so P(la | the) is increased.

Unsupervised EM Training: … la maison bleue … la fleur … / … the house … the blue house … the flower … “maison” co-occurs with both “the” and “house”, but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of “la” (pigeonhole principle)

Unsupervised EM Training: … la maison bleue … la fleur … / … the house … the blue house … the flower … settling down after another iteration

Unsupervised EM Training: … la maison bleue … la fleur … / … the house … the blue house … the flower … Inherent hidden structure revealed by EM training!
• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
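The EM iterations sketched on these slides can be reproduced on the toy corpus with a simplified model. The code below is an illustration only: it trains word-translation probabilities in the style of IBM Model 1 (no fertility, NULL word or re-ordering tables), it is not GIZA++, and it assumes that the first French fragment is "la maison", matching "the house". The function name `train_model1` is hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=30):
    """EM for word-translation probabilities; returns t[(f, e)] = P(f | e)."""
    french_vocab = {f for french, _ in pairs for f in french}
    # Start with all P(french-word | english-word) equally likely.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for french, english in pairs:
            for f in french:
                # E-step: spread one unit of count for f over the English words
                # of the same sentence, in proportion to the current t values.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


if __name__ == "__main__":
    corpus = [
        ("la maison".split(), "the house".split()),  # assumed first pair
        ("la maison bleue".split(), "the blue house".split()),
        ("la fleur".split(), "the flower".split()),
    ]
    t = train_model1(corpus)
    for pair in [("la", "the"), ("maison", "the"), ("maison", "house"), ("fleur", "flower")]:
        print(pair, round(t[pair], 3))
```

Running it shows the behaviour described on the slides: probability mass for "maison" moves from "the" to "house" because "the" is already explained by "la".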

Sample Translation Probabilities [Brown et al 93]
Translation Model
e          f              P(f | e)
national   nationale      0.47
           national       0.42
           nationaux      0.05
           nationales     0.03
the        le             0.50
           la             0.21
           les            0.16
           l’             0.09
           ce             0.02
           cette          0.01
farmers    agriculteurs   0.44
           les            0.42
           cultivateurs   0.05
           producteurs    0.02

Translation Model (table of P(f | e) as on the previous slide)
[Diagram: new French sentence f and potential translation e, scored by P(f | e)]

Language Model
w1     w2        P(w2 | w1)
of     the       0.13
       a         0.09
       another   0.01
       some      0.01
hong   kong      0.98
       said      0.01
       stated    0.01
Translation Model (table of P(f | e) as on the earlier slide)
[Diagram: new French sentence f and potential translation e, scored by P(f | e) and P(e)]

Language Model and Translation Model (tables as on the earlier slides)
[Diagram: new French sentence f and potential translation e; P(f | e) · P(e) is the score for e]
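The score P(f | e) · P(e) can be computed for a handful of candidate translations as in the sketch below. Several simplifications are assumed: P(f | e) is a Model-1-style product in which every French word may align to any English word, the translation entries are the ones shown in the sample table, and the bigram values are invented for illustration because the slide's language model table does not cover these words. All names are hypothetical.

```python
# Translation entries taken from the sample table: T[(f, e)] = P(f | e).
T = {("les", "the"): 0.16, ("les", "farmers"): 0.42, ("agriculteurs", "farmers"): 0.44}

# Invented bigram values, for illustration only: BIGRAM[(w1, w2)] = P(w2 | w1).
BIGRAM = {("START", "the"): 0.2, ("the", "farmers"): 0.01, ("farmers", "END"): 0.1}

FLOOR = 1e-6  # probability assigned to bigrams missing from the table


def translation_prob(f_words, e_words, t):
    """Model-1-style P(f | e): average t(f | e_i) over the English words, per French word."""
    prob = 1.0
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_words) / len(e_words)
    return prob


def language_prob(e_words, bigram):
    """Bigram P(e), with START and END transitions."""
    tokens = ["START"] + list(e_words) + ["END"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram.get((w1, w2), FLOOR)
    return prob


def best_translation(f_words, candidates, t, bigram):
    """Candidate e with the highest P(e) * P(f | e)."""
    return max(candidates,
               key=lambda e: language_prob(e, bigram) * translation_prob(f_words, e, t))


if __name__ == "__main__":
    french = ["les", "agriculteurs"]
    candidates = [("the", "farmers"), ("farmers", "the")]
    print(best_translation(french, candidates, T, BIGRAM))
```

Under this simplified translation model both word orders get the same P(f | e), so the language model alone decides between them.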

Search for Best Translation: voulez – vous taire !

Search for Best Translation: voulez – vous taire ! → you – you quiet !

Search for Best Translation: voulez – vous taire ! → quiet you – you !

Search for Best Translation: voulez – vous taire ! → shut you – you !

Search for Best Translation: voulez – vous taire ! → you shut !

Search for Best Translation: voulez – vous taire ! → you shut up !

Classic Decoding Algorithm
Given f, find the English string e that maximizes P(e) · P(f | e)
NP-Complete [Knight 99].
Brown et al 93: “In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”

Beam Search Decoding [Brown et al US Patent #5,477,451]
[Diagram: start → 1st English word → 2nd English word → 3rd English word → 4th English word → end (all source words covered), with best predecessor links]
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
[Jelinek 69; Och, Ueffing, and Ney, 01]
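The hypothesis bookkeeping listed above can be illustrated with a much-reduced decoder. The sketch below is not the patented algorithm or the one in the cited papers: it assumes every source word produces exactly one English word, uses a bigram language model (so only the last English word is kept as context), does no hypothesis recombination, and prunes each stack to a fixed beam size. The translation entries come from the sample table shown earlier in the deck; the bigram values are invented, and all names are hypothetical.

```python
import math


def beam_search_decode(source, t_table, bigram, beam_size=5, floor=1e-6):
    """source: list of foreign words; t_table[f] = {e: P(f | e)}; bigram[(w1, w2)] = P(w2 | w1)."""
    n = len(source)
    # stacks[k] holds hypotheses covering k source words.
    # A hypothesis is (log score so far, coverage set, English words so far).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), ("<s>",)))
    for covered in range(n):
        # Prune: expand only the best `beam_size` hypotheses in this stack.
        stacks[covered].sort(key=lambda h: h[0], reverse=True)
        for score, coverage, words in stacks[covered][:beam_size]:
            for j, f in enumerate(source):
                if j in coverage:
                    continue  # source word j is already translated
                for e, p_tm in t_table.get(f, {}).items():
                    p_lm = bigram.get((words[-1], e), floor)
                    new_score = score + math.log(p_tm) + math.log(p_lm)
                    stacks[covered + 1].append((new_score, coverage | {j}, words + (e,)))
    # All source words covered: add the end-of-sentence transition and pick the best.
    finished = [(s + math.log(bigram.get((w[-1], "</s>"), floor)), w)
                for s, _, w in stacks[n]]
    if not finished:
        return []
    _, best_words = max(finished)
    return list(best_words[1:])


if __name__ == "__main__":
    t_table = {"les": {"the": 0.16, "farmers": 0.42}, "agriculteurs": {"farmers": 0.44}}
    bigram = {("<s>", "the"): 0.2, ("the", "farmers"): 0.01, ("farmers", "</s>"): 0.1}
    print(beam_search_decode(["les", "agriculteurs"], t_table, bigram))
```

Grouping hypotheses by the number of covered source words keeps the comparison inside each stack roughly fair, since all of its members have paid for the same amount of translation.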

Classic Results
• nous avons signé le protocole. (Foreign Original)
• we did sign the memorandum of agreement. (Human Translation)
• we have signed the protocol. (MT)
• où était le plan solide ? (Foreign Original)
• but where was the solid plan ? (Human Translation)
• where was the economic base ? (MT)
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars
and very slow = one page per day

Okay! I know, so far, this talk should be called … What’s Old in Statistical Machine Translation!!

Further Developments
• Follow-on projects
  – Hong Kong
  – Aachen
  – Behavior Design Corporation
• JHU Summer Workshop 1999
  – Build & distribute statistical MT tools
  – Create standard training & testing data
  – Disseminate tutorial material
  – “MT in a Day”
  – Ask new questions

How Much Data Do We Need? (Plot: quality of automatically trained machine translation system vs. amount of bilingual training data.)

Advances in Statistical MT 2000-2004

Ready-to-Use Online Bilingual Data (Chart: millions of words, English side.) (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Ready-to-Use Online Bilingual Data (Chart: millions of words, English side.) (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn). + European parliament data [Koehn 05]

BLEU Evaluation Metric (Papineni et al 02)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
• N-gram precision (score is between 0 & 1)
  – What percentage of machine n-grams can be found in the reference translation?
  – Gross measure over 1000 test sentences.
  – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the”)
• Brevity penalty
  – Can’t just type out single word “the” (and get precision 1.0)
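
The two ingredients on this slide, clipped n-gram precision and the brevity penalty, can be sketched in a few lines. This is a simplified single-sentence, single-reference version under stated assumptions: the metric as described is computed over a whole test set, and the function names here are illustrative rather than taken from any toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    reference n-gram can be matched at most as often as it occurs."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def brevity_penalty(cand_len, ref_len):
    """Penalize candidates that are shorter than the reference."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(candidate), len(reference)) * geo_mean

REF = "the gunman was shot to death by the police .".split()
```

Typing out “the the the” gets unigram precision 2/3 here because the reference contains “the” only twice, and the single word “the” has precision 1.0 but a severe brevity penalty.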

BLEU in Action 枪手被警方击毙 。 (Foreign Original) the gunman was shot to death by the police. (Reference Translation) MT outputs #1 to #10: the gunman was police kill. wounded police jaya of the gunman was shot dead by the police. the gunman arrested by police kill. the gunmen were killed. the gunman was shot to death by the police. gunmen were killed by police ? SUB>0 al by the police. the ringer is killed by the police killed the gunman.

BLEU in Action 枪手被警方击毙 。 (Foreign Original) the gunman was shot to death by the police. (Reference Translation) MT outputs #1 to #10: the gunman was police kill. wounded police jaya of the gunman was shot dead by the police. the gunman arrested by police kill. the gunmen were killed. the gunman was shot to death by the police. gunmen were killed by police ? SUB>0 al by the police. the ringer is killed by the police killed the gunman. Color key: green = 4-gram match (good!); red = word not matched (bad!)

BLEU Tends to Predict Human Judgments (slide from G. Doddington, NIST)

Experiment-Driven Progress Evaluate new MT research ideas every day! (and be alerted about bugs…) (Plot: BLEU, axis from 15 to 35, over Mar 1 to May 1, 2005; ISI Syntax-Based MT, Chinese/English, NIST 2002 Test Set.)

Draw Learning Curves (Plot: BLEU score vs. # of sentence pairs used in training, for Swedish/English, French/English, German/English, Finnish/English. Experiments by Philipp Koehn.)

Flaws of Word-Based MT
• Can’t translate multiple English words to one French word
• Can’t translate phrases
  – “real estate”, “note that”, “interest in”
• Isn’t sensitive to syntax
  – Adjectives/nouns should swap order
  – Verb comes at the beginning in Arabic
• Doesn’t understand the meaning (?)

The MT Triangle (Diagram levels: words, syntax, logical form, interlingua; SOURCE on one side, TARGET on the other.)

The MT Swimming Pool (Diagram levels: words, syntax, logical form, interlingua.)

Commercial Rule-Based Systems (Diagram levels: words, syntax, logical form, interlingua; SOURCE on one side, TARGET on the other.)

Knight et al 95 - meaning-based translation - composition rules (Diagram labels: interlingua, logical form, syntax, words, SOURCE, TARGET, Language Model.)

Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text (Diagram labels: interlingua, logical form, syntax, words, SOURCE, TARGET, Language Model.)

Yamada/Knight (01, 02) - tree/string model - used existing target language parser (Diagram labels: interlingua, logical form, syntax, words, SOURCE, TARGET, Language Model.)

Well, these all seem like good ideas. Which one had the most dramatic effect on MT quality? None of them!

Phrases
How do you translate “real estate” into French?
Examples: real estate, real number, dance card, memory stick, …
(Diagram levels: words, phrases, syntax, logical form, interlingua; SOURCE and TARGET.)

Phrase-Based Statistical MT
Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference In Canada
• Foreign input segmented into phrases
  – “phrase” just means “word sequence”
• Each phrase is probabilistically translated into English
  – P(to the conference | zur Konferenz)
  – P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.

How to Learn the Phrase Translation Table?
• One method: “alignment templates” [Och et al 99]
• Start with word alignment
• Collect all phrase pairs that are consistent with the word alignment

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
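
A minimal sketch of collecting phrase pairs that are consistent with a word alignment, using the slide's example. Two things are assumptions made for illustration: the alignment links in `ALIGN` (chosen so that “a” is unaligned and “no” links to both “did” and “not”), and the function name `extract_phrases`. This simplified version also does not extend target spans over unaligned English words, which full extractors usually do.

```python
def extract_phrases(src, tgt, alignment, max_len=9):
    """Return all (source phrase, target phrase) pairs consistent with the
    alignment: no word inside the pair is linked to a word outside it."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in [i1, i2].
            tgt_pos = [t for s, t in alignment if i1 <= s <= i2]
            if not tgt_pos:
                continue  # a span with no links yields no phrase pair
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Every link into the target span must come from inside [i1, i2].
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

SRC = "Maria no dió una bofetada a la bruja verde".split()
TGT = "Mary did not slap the green witch".split()
# Assumed links as (source index, target index) pairs.
ALIGN = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)]
PAIRS = extract_phrases(SRC, TGT, ALIGN)
```

With these links, “dió” alone is rejected because “slap” is also linked to “una” and “bofetada”, while the unaligned “a” can attach to either neighbor, giving both (dió una bofetada a, slap) and (a la, the).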

Phrase Pair Probabilities
• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
• No EM training
• Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)
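
The relative-frequency estimate amounts to two counters and a division. A minimal sketch, assuming the extracted phrase pairs arrive as a flat list of (foreign, English) tuples with repeats; `phrase_table` and the sample data are illustrative.

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Estimate P(f | e) = count(f, e) / count(e) from extracted phrase pairs."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical extraction output from a tiny corpus.
SAMPLE = [("a la", "the"), ("la", "the"), ("la", "the"), ("bruja", "witch")]
TABLE = phrase_table(SAMPLE)
```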

Phrase-Based MT
• This is currently the best way to do Statistical MT!
• What took so long to move from words to phrases?
  – Missing RAM
    • 25m parameters → billions of parameters
    • Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
    • Now solved in commercial deployments
  – Missing computing power
  – Many competing ideas to shake out
    • Koehn 03 summarizes several variations
  – Empirical effectiveness even better than intuition would predict
• This is not building a ladder to the moon!
  – If you can’t translate “real estate” into French, you are sunk

Advanced Training Methods
argmax_e P(e | f) = argmax_e P(e) x P(f | e) / P(f) = argmax_e P(e) x P(f | e)

Advanced Training Methods
argmax_e P(e | f) = argmax_e P(e) x P(f | e) / P(f) = argmax_e P(e)^2.4 x P(f | e)
… works better!

Advanced Training Methods
argmax_e P(e | f) = argmax_e P(e) x P(f | e) / P(f) = argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1
Rewards longer hypotheses, since these are unfairly punished by P(e)

Advanced Training Methods
argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1 x FEAT^3.7 …
Lots of features vote on every potential translation. Exponential model.
Problem: How to set the exponent weights?
IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system
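
In log space the exponential model is just a weighted sum of log feature values. The sketch below uses the exponents from the slide as weights; the feature values in `HYP` are hypothetical, the FEAT term and further features are left out, and `loglinear_score` is an illustrative name.

```python
import math

# Exponents from the slide; FEAT and any further features are omitted here.
WEIGHTS = {"lm": 2.4, "tm": 1.0, "length": 1.1}

def loglinear_score(features, weights=WEIGHTS):
    """Weighted sum of log feature values, i.e. the log of
    P(e)^2.4 * P(f | e)^1.0 * length(e)^1.1."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

# Hypothetical feature values for one candidate translation.
HYP = {"lm": 1e-4, "tm": 1e-3, "length": 4}
SCORE = loglinear_score(HYP)
```

Setting the weights is then a search over a handful of numbers, which is what makes tuning them directly against BLEU feasible.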

(Plot by Emil Ettelaie, WTM fixed at 1.0; labeled points: 20.64% BLEU and 17.96% BLEU.)

Maximum BLEU Training
• Novel algorithm developed by [Och 03]
• Opened gates to “feature hacking”
  – Word-based feature to smooth phrase pair counts (“Model 1 Inverse”)
  – Phrase-specific propensities to re-order
• Currently limited to ~25 features

Advances in Statistical MT 2005

Google’s Language Model
• Previously, largest language model was trained on 1b words of English
• 20b words of news → significant impact on news translation
• 200b words of web → helpful

Maryland’s Hiero system [Chiang 05]
• Previously:
  – ne mange pas → does not eat
• New phrase pairs with variables and reordering
  – ne X pas → does not X
  – le X1 du X2 → X2 's X1
• Nesting
  – “does not X” itself becomes an X
• CKY decoder (John Cocke)
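
Phrase pairs with variables can be illustrated with a toy rule applier. This is only a sketch of the idea and not Hiero's CKY decoder: `apply_rule`, the tiny lexicon, and the French test phrases other than “ne mange pas” are assumptions made for illustration, and matching uses a regular expression instead of chart parsing.

```python
import re

def apply_rule(src_pattern, tgt_pattern, src_phrase, translate):
    """Match src_phrase against src_pattern (variables are X, X1, X2, ...),
    translate each variable's span, and plug it into tgt_pattern.
    Returns None if the rule does not match."""
    variables = [tok for tok in src_pattern.split() if re.fullmatch(r"X\d*", tok)]
    regex = " ".join(
        f"(?P<{tok}>.+)" if tok in variables else re.escape(tok)
        for tok in src_pattern.split()
    )
    match = re.fullmatch(regex, src_phrase)
    if match is None:
        return None
    return " ".join(
        translate(match.group(tok)) if tok in variables else tok
        for tok in tgt_pattern.split()
    )

# Hypothetical word lexicon used to translate the variable spans.
LEXICON = {"mange": "eat", "livre": "book", "professeur": "teacher"}

def translate_word(phrase):
    return LEXICON.get(phrase, phrase)
```

Nesting falls out of the same mechanism: `translate` could itself try rules on the variable's span before falling back to the lexicon.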

ISI’s Syntax-Based MT System
• First strong showing for an SMT system that knows what nouns and verbs are!
• Why syntax? “Frequent high-tech exports are bright spots foreign trade growth of Guangdong has made important contributions.”
  – Need much more grammatical output
  – Need accurate control over re-ordering
  – Need accurate insertion of function words

String Output 枪手 被 警方 击毙. The gunman killed by police.

Tree Output 枪手 被 警方 击毙. The gunman killed by police. (Tree labels: DT NN VBD IN NN; NPB PP NP-C VP S)

Tree Output 枪手 被 警方 击毙. Gunman by police shot. (Tree labels: NN IN NN VBD; NPB PP NP-C VP S)

Tree Output 枪手 被 警方 击毙. The gunman was killed by police. (Tree labels: DT NN AUX VBN IN NN; NPB PP NP-C VP S)

Sample Rules Learned from Data VP SBAR VB said IN x0:S that NP x0:NP PP IN from x1:NP "说" "," x0 "说" x0 "他" "说" "," x0 "指出" "," x0 0.57 0.09 0.02 x1 x0 "来自" x1 x0 x1 "的" x0 "从" x1 x0 "来自" x1 "的" x0 0.27 0.15 0.06

Sample Rules Learned from Data S x0:NP x0 x1 x2 x0 x1 "," x2 x0 x1 x2 x1 x0 x2 VP x1:VB 0.54 0.44 (Chinese/English) x2:NP S x0:NP 0.82 0.02 x2:NP (Arabic/English) subject-verb inversion

Format is Expressive Phrasal Translation VP S está, cantando PRO VBZ VBG Non-contiguous Phrases VP hay, x0 VP there VB singing is Non-constituent Phrases poner, x0 VB x0:NP PRT put on are Context-Sensitive Word Insertion NPB DT the x0:NNS x0 Multilevel Re-Ordering NP S x0:NP VP x1:VB Lexicalized Re-Ordering x1, x0, x2:NP2 PP x0:NP x1, , x0 P x1:NP of [Knight & Graehl, 2005]

Story Gets More Interesting…
Applications: MT
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting…
Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting…
Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT (05), Compression (01), QA (03), Generation (00)
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting…
Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT (05), Compression (01), QA (03), Generation (00)
Automata Theory: Tree Transducers (Rounds 70)
Algorithms: Efficient Transducer Algorithms, Generic Tree Toolkits

Summary
• Making good progress
• Algorithms + Data + Evaluation + Computers
• Interdisciplinary work
  – Natural language processing
  – Machine learning
  – Linguistics
  – Automata theory
• Hope that more people will join!

Thank you

Syntax-Based vs Phrase-Based (Plot: BLEU, axis from 15 to 35, over Mar 1 to May 1, 2005; labeled line: phrase-based system; Chinese/English, NIST 2002 Test Set.)

Future Ph.D. Theses?
“Syntax-based Language Models for Improving Statistical MT”
“Discriminative Training of Millions of Features for MT”
“Semantic Representations Induced from Multilingual EU and UN Data”
“What Makes One Language Pair More Difficult to Translate Than Another”
“A State-of-the-Art MT System Based on Syntactic Transformations”
“New Training Methods for High-Quality Word Alignment”
+ many unpredictable ones…

Summary
• Phrase-based models are state-of-the-art
  – Word alignments
  – Phrase pair extraction & probabilities
  – N-gram language models
  – Beam search decoding
  – Feature functions & learning weights
• But the output is not English
  – Fluency must be improved
  – Better translation of person names, organizations, locations
  – More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
  – Need good accuracy across a variety of domains/languages

Available Resources
• Bilingual corpora
  – 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
  – Lots of French/English, Spanish/French/English, LDC
  – European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
  – 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
  – Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
  – Xiaoyi Ma, LDC (Champollion)
• Word alignment
  – GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
  – GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
  – Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
  – Shared task, NAACL-HLT’03 workshop
• Decoding
  – ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
  – ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

Some Papers Referenced on Slides
• ACL – [Och, Tillmann, & Ney, 1999] – [Och & Ney, 2000] – [Germann et al, 2001] – [Yamada & Knight, 2001, 2002] – [Papineni et al, 2002] – [Alshawi et al, 1998] – [Collins, 1997] – [Koehn & Knight, 2003] – [Al-Onaizan & Knight, 2002] – [Och & Ney, 2002] – [Och, 2003] – [Koehn et al, 2003]
• EMNLP – [Marcu & Wong, 2002] – [Fox, 2002] – [Munteanu & Marcu, 2002]
• AI Magazine – [Knight, 1997]
• www.isi.edu/~knight – [MT Tutorial Workbook]
• AMTA – [Soricut et al, 2002] – [Al-Onaizan & Knight, 1998]
• EACL – [Cmejrek et al, 2003]
• Computational Linguistics – [Brown et al, 1993] – [Knight, 1999] – [Wu, 1997]
• AAAI – [Koehn & Knight, 2000]
• IWNLG – [Habash, 2002]
• MT Summit – [Charniak, Knight, Yamada, 2003]
• NAACL – [Koehn, Marcu, Och, 2003] – [Germann, 2003] – [Graehl & Knight, 2004] – [Galley, Hopkins, Knight, Marcu, 2004]

Ready-to-Use Online Bilingual Data (Chart: millions of words, English side.) (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Ready-to-Use Online Bilingual Data (Chart: millions of words, English side.) + 1m-20m words for many language pairs (Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Ready-to-Use Online Bilingual Data (Chart: millions of words, English side.) ??? One Billion?

From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
  – Suppose one billion words of parallel data were sufficient
  – At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
  – De-formatting
  – Remove strange characters
  – Character code conversion
  – Document alignment
  – Sentence alignment
  – Tokenization (also called Segmentation)

Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

Sentence Alignment
1. The old man is happy. 2. He has fished many times. 3. His wife talks to him. 4. The fish are jumping. 5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces. 2. Su mujer habla con él. 3. Los tiburones esperan.

Sentence Alignment
1. The old man is happy. He has fished many times. 2. His wife talks to him. 3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces. 2. Su mujer habla con él. 3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
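
The last step, turning an alignment into sentence pairs, can be sketched as follows. The alignment itself is taken as given: `BEADS` is a hand-written assumption matching the slide's example, and `merge_beads` is an illustrative name. Finding the alignment automatically is the job of the sentence aligners listed under Available Resources.

```python
def merge_beads(src_sents, tgt_sents, beads):
    """Turn alignment beads into sentence pairs: drop beads with an empty
    side (unaligned sentences) and merge sentences in n-to-m beads."""
    pairs = []
    for src_ids, tgt_ids in beads:
        if not src_ids or not tgt_ids:
            continue  # unaligned sentence: thrown out
        pairs.append((" ".join(src_sents[i] for i in src_ids),
                      " ".join(tgt_sents[j] for j in tgt_ids)))
    return pairs

ENGLISH = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The fish are jumping.", "The sharks await."]
SPANISH = ["El viejo está feliz porque ha pescado muchas veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]
# Assumed beads: a 2-to-1 merge, a 1-to-1, an unaligned sentence, a 1-to-1.
BEADS = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]
SENTENCE_PAIRS = merge_beads(ENGLISH, SPANISH, BEADS)
```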

Tokenization (or Segmentation)
• English
  – Input (some byte stream): "There," said Bob.
  – Output (7 “tokens” or “words”): " There , " said Bob .
• Chinese
  – Input (byte stream): 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。
  – Output: 美国 关岛国 际机 场 及其 办公 室 均接获 一名 自称 沙地 阿拉 伯富 商拉登 等发 出 的 电子邮件。

Lower-Casing
• English
  – Input (7 words): " There , " said Bob .
  – Output (7 words): " there , " said bob .
Idea of tokenizing and lower-casing: The, the, “The, “the all count as the same word.
Smaller vocabulary size. More robust counting and learning.
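
For English, both steps fit in a few lines. This is a minimal sketch with a deliberately simple rule (runs of word characters, or single punctuation marks); the function names are illustrative, and the approach does not handle Chinese, where segmentation needs a lexicon or a trained model because there are no spaces to split on.

```python
import re

def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(text):
    """Tokenize, then lower-case every token."""
    return [token.lower() for token in tokenize(text)]

TOKENS = tokenize('"There," said Bob.')
LOWERED = normalize('"There," said Bob.')
```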

Recent Progress in Statistical MT
• Why is that?
  – Better algorithms that learn patterns from data
  – More data
  – Faster, cheaper computers with more RAM
  – Community-wide test sets
  – Novel automated evaluation methods
  – Shared software tools

Three Problems for Statistical MT
• Translation model
  – Given a pair of strings <f, e>, assigns P(f | e) by formula
  – <f, e> look like translations → high P(f | e)
  – <f, e> don’t look like translations → low P(f | e)
• Language model
  – Given an English string e, assigns P(e) by formula
  – good English string → high P(e)
  – random word sequence → low P(e)
• Decoding algorithm
  – Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) · P(f | e)

Web Language Models
French input → “She has a lot of nerve.” or “It has a lot of nerve.” ? (Counts shown: [20], [3])
[Soricut, Knight, Marcu, 02]
Used by Google in 2005 to increase performance of their research MT system!