- Number of slides: 52
Machine Translation, Language Divergence and Lexical Resources Pushpak Bhattacharyya Computer Science and Engineering Department IIT Bombay
Acknowledgement • NLP-AI members, CSE Dept, IIT Bombay.
What is MT? Conversion of source language text to target language text: Document in L1 → Computer Program → Document in L2
Kinds of MT Systems (How much of Human Participation) • Fully Automatic • Semi Automatic – Human Aided MT (HAMT): Pre-editing, Post-editing – Machine Aided HT (MAHT): On-line Dictionaries, Terminology Data Banks, Translation Memories
Kinds of MT Systems (domain coverage) • General Purpose (SYSTRAN in Europe) • Domain Specific (TAUM-METEO in Canada; translates weather reports between French and English)
Kinds of MT Systems (point of entry from source to the target text)
Why is MT difficult? Classical NLP problems • Ambiguity – Lexical – Structural • Ellipsis • Co-reference – Anaphora – Hypernymic
Why is MT Difficult Language Divergence • Lexico-Semantic Divergence • Structural Divergence
Language Divergence (English-Hindi: Noun to Adjective) • The demands on sportsmen today can lead to burnout at an early age. (noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard) • खिलाड़ियों से जो आज अपेक्षाएँ हैं, वे उन्हें कम उम्र में ही अकार्यशील कर सकती हैं।
Language Divergence (English-Hindi: Noun to Verb) • Every concert they gave us was a sellout. (an event for which all the tickets have been sold) • उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे।
Language Divergence (English-Hindi: Adjective to Adverb) • The children watched in wide-eyed amazement. (with eyes fully open because of fear, great surprise, etc.) • बच्चे आश्चर्य से आँखें फाड़े देख रहे थे।
Language Divergence (English-Hindi: Adjective to Verb) • He was in a bad mood at breakfast and wasn't very communicative. (able and willing to talk and give information to other people) • नाश्ते के समय वह ख़राब मूड में था और ज़्यादा बात-चीत नहीं कर रहा था।
Language Divergence (English-Hindi: Preposition to Adverb) • It gets cooler toward evening. (near a point in time) • शाम होते-होते ठंडक बढ़ जाती है।
Language Divergence (English-Hindi: idiomatic usage) • Given her interest in children, teaching seems the right job for her. (when you consider something) • बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है।
Language Divergence (Marathi-Hindi-English: case marking and postpositions transfer: works!) • प्रथम ताख्यात • वर्तमान (simple present) – तो जातो. – वह जाता है। – He goes. • स्थिरसत्य (universal truth) – पृथ्वी सूर्याभोवती फिरते. – पृथ्वी सूर्य के चारों ओर घूमती है। – The earth revolves round the sun.
Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!) • ऐतिहासिक सत्य (historical truth) – कृष्ण अर्जुनास सांगतो... – कृष्ण अर्जुन से कहता है... – Krushna says to Arjuna... • अवतरण (quoting) – दामले म्हणतात..., – दामले कहते हैं..., – Damle says...,
Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!) • सन्निहित भूत (immediate past) – कधी आलास? हा येतो इतकाच! – कब आये? बस अभी आया। – When did you come? Just now (I came). • निःसंशय भविष्य (certainty in future) – आता तो मार खातो खास! – अब वह मार खायेगा ही! – He is in for a thrashing. • आश्वासन (assurance) – मी तुम्हाला उद्या भेटतो. – मैं आप से कल मिलता हूँ। – I will see you tomorrow.
Language Divergence Theory: Lexico-Semantic Divergences • Conflational divergence • Structural divergence • Categorial divergence • Head swapping divergence • Lexical divergence
Language Divergence Theory: Syntactic Divergences • Constituent Order divergence • Adjunction divergence • Preposition-Stranding divergence • Null Subject divergence • Pleonastic divergence
MT approaches (Vauquois Triangle) • Interlingua Based • Direct • Transfer Based
Interlingua Methodology • Directly obtain the meaning of the source sentence. • Generate the target sentence from the meaning representation. – John gave the book to Mary. • Meaning representation: give-action – agent: John – object: the book – receiver: Mary • ATLAS system at Fujitsu: precursor to the worldwide project on UNL
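The give-action frame on this slide can be sketched in a few lines of Python. This is only an illustration of the idea that one language-neutral meaning representation feeds separate generators; the class and function names are hypothetical, and the second generator produces an English word-for-word gloss of a verb-final order rather than real target-language text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GiveAction:
    """Language-neutral frame for the give-action (field names follow the slide)."""
    agent: str
    object: str
    receiver: str


def generate_english(frame: GiveAction) -> str:
    # English surface order: agent, verb, object, "to" + receiver.
    return f"{frame.agent} gave {frame.object} to {frame.receiver}."


def generate_verb_final_gloss(frame: GiveAction) -> str:
    # Hypothetical second generator: an English gloss of a verb-final order.
    # It only illustrates reordering; it is not a real Hindi generator.
    return f"{frame.agent} {frame.receiver}-to {frame.object} gave."


frame = GiveAction(agent="John", object="the book", receiver="Mary")
print(generate_english(frame))           # John gave the book to Mary.
print(generate_verb_final_gloss(frame))  # John Mary-to the book gave.
```

Both outputs are produced from the same frame, which is the point of the interlingua design: analysis is done once, and each target language needs only its own generator.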
Competing approaches • Direct • Transfer based
Direct approach • Word replacements: I like mangoes → मैं अच्छा लग आम (I like (root) mangoes) • Morphology: मैं अच्छा लगता आम (I like mangoes) • Syntactic re-arrangement: मैं आम अच्छा लगता है (I mangoes like) • Idiomatization: मुझे आम अच्छा लगता है (I (dative) mangoes like)
Transfer Based • Source sentence processed for parsing, chunking etc. • [S [NP I] [VP [V like] [NP mangoes]]]
Transfer Based • Transfer structures obtained for the target sentence. • [S [NP I] [VP [NP mangoes] [V like]]]
Transfer Based • Morphology and language specific modifications • [S [NP मुझे] [VP [NP आम] [V अच्छा लगता है]]]
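The structural step of the transfer approach (the second of the three Transfer Based slides) can be sketched as a tree rewrite. The nested-list tree encoding and the function names below are assumptions made for illustration; the single rule shown, reordering a verb phrase from verb-object to object-verb, is the one the slides' example needs and is not a full transfer grammar.

```python
from typing import List, Optional, Union

Tree = Union[str, List["Tree"]]  # a leaf word, or [label, child, child, ...]


def label_of(node: Tree) -> Optional[str]:
    """Return the node label, or None for a leaf word."""
    return node[0] if isinstance(node, list) else None


def transfer_svo_to_sov(tree: Tree) -> Tree:
    """Reorder every VP of the form [V, NP] into [NP, V]; copy everything else."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [transfer_svo_to_sov(child) for child in children]
    if (label == "VP" and len(children) == 2
            and label_of(children[0]) == "V"
            and label_of(children[1]) == "NP"):
        children = [children[1], children[0]]
    return [label, *children]


def leaves(tree: Tree) -> List[str]:
    """Read the words off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


source = ["S", ["NP", "I"], ["VP", ["V", "like"], ["NP", "mangoes"]]]
target = transfer_svo_to_sov(source)
print(leaves(source))  # ['I', 'like', 'mangoes']
print(leaves(target))  # ['I', 'mangoes', 'like']
```

Word replacement and morphology (the third slide) would then operate on the leaves of the reordered tree.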
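Relation Between the Transfer and the Interlingua Models • Source language words → (parsing) → source language parse tree → (interpretation) → Interlingua • Interlingua → (generation) → target language parse tree → (generation) → target language words • Transfer maps the source language parse tree directly to the target language parse tree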
State of Affairs • Systran reports 19 different language pairs. • Only 8 are adequate for the intended use. • Even fewer are capable of quality written or spoken text translation.
Notable Systems in India • Anusaaraka (IITK and IIIT Hyderabad: information access: one of the earliest systems) • Angla-Hindi (IITK: Transfer Based) • Shakti and Shiva (IIIT Hyderabad: Use of simple modules to create complex and high level performance) • UNL Based system (IIT Bombay- part of the UN effort: emphasis on semantics) • Hindi-Tamil system (AU-KBC, Chennai: based on the approach at IIIT Hyderabad)
Semantics: use of Lexical Resources • WordNet • Word Sense Disambiguation
Wordnet • A lexical knowledge base built on conceptual lookup • Organizes concepts in a semantic network • Organizes lexical information in terms of word meaning, rather than word form • Can also be used as a thesaurus
Lexical Matrix
The Structure of Hindi Wordnet • 30,000 unique words • 13,000 synsets • Wordnet Relations 1. Lexical Relations (between word forms): Synonymy, Antonymy 2. Semantic Relations (between word meanings): Hyponymy/Hypernymy, Meronymy/Holonymy, Entailment/Troponymy
A small part of Hindi Wordnet
Hindi WordNet APIs (diagram: API functions around the Hindi Data) • findtheinfo • getindex • in_wn • read_synset • free_index_lookup • morphstr
The Hindi WSD System
Approach to WSD • Hindi Document → Context Bag • Hindi Wordnet → Semantic Bag • Context Bag and Semantic Bag are compared using Intersection Similarity
WSD Algorithm
1. For a polysemous word w needing disambiguation, collect the context words in its surrounding window. Let this collection be C, the context bag. The window is the current sentence together with the preceding and the following sentences.
2. For each sense s of w, do the following.
3. Let B be the bag of words obtained from the
   1. Synonyms in the synsets
   2. Glosses of the synsets
   3. Example Sentences of the synsets
   4. Hypernyms (recursively up to the roots)
   5. Glosses of Hypernyms
   6. Example Sentences of Hypernyms
WSD Algorithm (continued)
   7. Hyponyms (recursively up to the leaves)
   8. Glosses of Hyponyms
   9. Example Sentences of Hyponyms
   10. Meronyms (recursively up to the beginner synset)
   11. Glosses of Meronyms
   12. Example Sentences of Meronyms
4. Measure the overlap between C and B using intersection similarity.
5. Output the sense with the maximum overlap similarity value as the winner sense.
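The algorithm can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the toy English wordnet stands in for Hindi Wordnet, the stopword list is invented, and intersection similarity is taken to be the number of shared word types, since the slides do not give its exact definition.

```python
import re
from typing import Dict, Set

# Hypothetical toy wordnet; the real system reads these fields from Hindi Wordnet.
TOY_WORDNET: Dict[str, dict] = {
    "bank.finance": {
        "synonyms": ["bank", "depository"],
        "gloss": "a financial institution that accepts deposits and lends money",
        "examples": ["he cashed a cheque at the bank"],
        "hypernyms": ["institution"], "hyponyms": [], "meronyms": [],
    },
    "bank.river": {
        "synonyms": ["bank", "riverside"],
        "gloss": "sloping land beside a body of water",
        "examples": ["they sat on the bank of the river and fished"],
        "hypernyms": ["slope"], "hyponyms": [], "meronyms": [],
    },
    "institution": {
        "synonyms": ["institution", "establishment"],
        "gloss": "an organization founded for a purpose",
        "examples": [], "hypernyms": [], "hyponyms": [], "meronyms": [],
    },
    "slope": {
        "synonyms": ["slope", "incline"],
        "gloss": "an elevated geological formation",
        "examples": [], "hypernyms": [], "hyponyms": [], "meronyms": [],
    },
}
SENSES = {"bank": ["bank.finance", "bank.river"]}
STOPWORDS = {"a", "an", "the", "of", "on", "at", "and", "for", "that", "he", "they"}


def tokenize(text: str) -> Set[str]:
    """Lowercase word types of a text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def own_words(synset_id: str) -> Set[str]:
    """Words from the synonyms, gloss and example sentences of one synset."""
    entry = TOY_WORDNET[synset_id]
    return tokenize(" ".join(entry["synonyms"] + [entry["gloss"]] + entry["examples"]))


def related(synset_id: str, relation: str) -> Set[str]:
    """All synsets reachable by following one relation recursively."""
    found: Set[str] = set()
    stack = list(TOY_WORDNET[synset_id][relation])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(TOY_WORDNET[current][relation])
    return found


def semantic_bag(synset_id: str) -> Set[str]:
    """Bag B: the sense's own words plus those of its hypernyms, hyponyms, meronyms."""
    bag = own_words(synset_id)
    for relation in ("hypernyms", "hyponyms", "meronyms"):
        for other in related(synset_id, relation):
            bag |= own_words(other)
    return bag


def disambiguate(word: str, context: str) -> str:
    """Return the sense whose semantic bag shares the most words with the context bag C."""
    context_bag = tokenize(context) - {word}
    scores = {s: len(context_bag & semantic_bag(s)) for s in SENSES[word]}
    return max(scores, key=scores.get)


print(disambiguate("bank", "He walked along the river. The bank was muddy. Fish jumped in the water."))
print(disambiguate("bank", "She needs money. The bank approved the loan. Deposits are insured."))
```

The context string plays the role of the three-sentence window. Removing the target word from the context bag is a design choice of this sketch, so that the word does not count as evidence for every one of its own senses.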
Evaluation • Only Nouns • Test corpora from CIIL, Mysore. • Corpus from 8 domains, each containing around 2000 words on an average.
Result
Conclusions (Knowledge Based MT) • Language Divergence is the bottleneck • Not only for languages from distant families (English-Japanese) • But also for siblings within a family (Hindi-Marathi) • Solution lies in creating and exploiting knowledge structures
Conclusions (Statistical MT) • Complementary (not really competing) approach • Example: IBM approach to translation from/to English and other languages (French, Chinese, and currently Hindi) • Needs vast amounts of aligned text corpora • Basic idea is to maximize P(T|S) over all target sentences T; by Bayes' rule this needs language modeling (P(T)) and translation modeling (P(S|T))
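The last bullet follows from Bayes' rule. Written out, with \hat{T} as an assumed symbol for the chosen translation (the slide does not name it):

```latex
\begin{align}
\hat{T} &= \arg\max_{T} P(T \mid S) \\
        &= \arg\max_{T} \frac{P(S \mid T)\, P(T)}{P(S)} \\
        &= \arg\max_{T} P(S \mid T)\, P(T)
\end{align}
```

The denominator P(S) is dropped because the source sentence S is fixed during the search and does not affect which T attains the maximum. P(T) is the language model and P(S|T) is the translation model.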
Pre Editing The inspection team appointed by the United Nations visited Iraq early July, 2003. The
Post Editing • (I want to eat well today) – मैं आज अच्छा खाना चाहता हूँ – मुझे आज अच्छा खाना चाहिए
Terminology DB and Translation Memory • Special lexicon containing the domain terms and their translations – Nuclear Energy: आणविक ऊर्जा • Memories of previous translations – Apply fragments of previous translations to new translation situations • Available – He bought a pen: उसने एक कलम खरीदा – All ministers have huge houses: सभी मंत्रियों के पास बहुत बड़े घर हैं • New – He bought a huge house: उसने एक बहुत बड़ा घर खरीदा
Pitfall of Translation Memory • German: Ein Messer ist im Schrank; er misst Elektrizität. • TM 1: Ein Messer ist im Schrank -> A meter is in the cabinet. • TM 2: er misst Elektrizität -> It measures electricity. • New situation: Ein Messer ist im Schrank; er ist sehr scharf. • A meter is in the cabinet; it is very sharp (?). • Messer in German: Meter/Knife in English.
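The pitfall can be reproduced with a minimal segment-level memory. This is a sketch under stated assumptions: segments are split on semicolons and matched exactly after lowercasing, and the third memory entry is a hypothetical addition (the slide shows only its output) so that every segment of the new sentence has a match. Real translation memory tools typically use fuzzy matching instead.

```python
from typing import Dict

# Segment-level memory from the slide's examples; the last entry is hypothetical.
TRANSLATION_MEMORY: Dict[str, str] = {
    "ein messer ist im schrank": "a meter is in the cabinet",
    "er misst elektrizität": "it measures electricity",
    "er ist sehr scharf": "it is very sharp",
}


def normalize(segment: str) -> str:
    """Trim whitespace, drop a trailing period and lowercase for exact lookup."""
    return segment.strip().rstrip(".").lower()


def translate_with_memory(sentence: str) -> str:
    """Translate each semicolon-separated segment independently from the memory."""
    segments = [normalize(part) for part in sentence.split(";")]
    translated = [TRANSLATION_MEMORY.get(s, f"<no match: {s}>") for s in segments]
    text = "; ".join(translated) + "."
    return text[0].upper() + text[1:]


print(translate_with_memory("Ein Messer ist im Schrank; er misst Elektrizität."))
print(translate_with_memory("Ein Messer ist im Schrank; er ist sehr scharf."))
```

Because each segment is looked up in isolation, the stored "meter" reading of Messer is reused even when the second clause ("very sharp") calls for "knife". The memory has no way to let later context revise an earlier segment.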
Ambiguity Chair
Co-reference Resolution • Pronoun – Sequence of commands to a robot: • Place the wrench on the table. • Then paint it. – What does "it" refer to? (anaphora: back reference) • Learning of his intentions, Shivaji went to meet Afjal Khan, prepared with concealed weapons. – Who does "his" refer to? (cataphora: forward reference) • Hypernymic – Children love to see lions. These animals, however, are becoming extinct.
Ellipsis • Sequence of commands to the robot: Move the table to the corner. Also the chair. • The second command must be completed using the first part of the previous command.


