Transfer-based MT
Syntactic Transfer-based Machine Translation
• Direct and Example-based approaches: two ends of a spectrum
  – Recombination of fragments for better coverage.
• What if the matching/transfer is done at the syntactic parse level?
• Three steps:
  – Parse: syntactic parse of the source language sentence
    • Hierarchical representation of a sentence
  – Transfer: rules to transform the source parse tree into a target parse tree
    • e.g. Subject-Verb-Object → Subject-Object-Verb
  – Generation: regenerating the target language sentence from the parse tree
    • Morphology of the target language
• Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
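To make the three steps concrete, here is a minimal Python sketch. The tree encoding and the single SVO → SOV rule are invented for illustration; they stand in for a real rule inventory and a morphological generator, not any particular system's implementation.

```python
# A minimal sketch of the parse/transfer/generate pipeline.
# The tree representation and the single SVO -> SOV rule are
# illustrative assumptions, not a real transfer system.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                          # e.g. "S", "NP", "V"
    children: list = field(default_factory=list)
    word: str = ""                      # filled at leaves

def transfer(node: Node) -> Node:
    """Apply transfer rules bottom-up: here, one rule that turns
    an English-style S(NP V NP) into a Japanese-style S(NP NP V)."""
    kids = [transfer(c) for c in node.children]
    if node.label == "S" and [c.label for c in kids] == ["NP", "V", "NP"]:
        kids = [kids[0], kids[2], kids[1]]       # Subject-Object-Verb
    return Node(node.label, kids, node.word)

def generate(node: Node) -> str:
    """Read the target sentence off the transferred tree
    (a real generator would also handle target morphology)."""
    if node.word:
        return node.word
    return " ".join(generate(c) for c in node.children)

src = Node("S", [Node("NP", word="I"), Node("V", word="make"),
                 Node("NP", word="calls")])
print(generate(transfer(src)))          # -> "I calls make"
```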
Examples of Syntactic Transfer MT
(figure: aligned parse trees for Spanish "ajá, quiero usar mi tarjeta de crédito" and English "yeah, I wanna use my credit card")
• Mostly parallel parse structures
• Might have to insert words
  – pronouns, morphological particles
Example of Syntactic Transfer MT - 2
(figure: aligned parse trees for English "I need to make a collect call" and Japanese 私は (I) コレクト (collect) コールを (call) かける (make) 必要があります (need))
• Pros:
  – Allows for structure transfer
  – Re-orderings are typically restricted to parent-child nodes.
• Cons:
  – Transfer rules must be written for each language pair (N² sets of rules)
  – Hard to reuse rules when one of the languages is changed
Lexical-semantic (Linguistic) Divergences
• Structural differences between languages
  – Categorial Divergence
    • Translation of words in one language into words that have different parts of speech in another language
      – To be jealous – Tener celos (To have jealousy)
Issues: Linguistic Divergences
  – Conflational Divergence
    • Translation of two or more words in one language into one word in another language
      – To kick – Dar una patada (Give a kick)
Issues: Linguistic Divergences
  – Structural Divergence
    • Realization of verb arguments in different syntactic configurations in different languages
      – To enter the house – Entrar en la casa (Enter in the house)
Issues: Linguistic Divergences
  – Head-Swapping Divergence
    • Inversion of a structural-dominance relation between two semantically equivalent words
      – To run in – Entrar corriendo (Enter running)
Issues: Linguistic Divergences
  – Thematic Divergence
    • Realization of verb arguments that reflect different thematic-to-syntactic mapping orders
      – I like grapes – Me gustan uvas (To-me please grapes)
Divergence counts from Bonnie Dorr
32% of sentences in a UN Spanish/English corpus (5K sentences) contain divergences:

  Divergence      Spanish example       English example    Frequency
  Categorial      X tener hambre        Y have hunger      98%
  Conflational    X dar puñaladas a Z   X stab Z           83%
  Structural      X entrar en Y         X enter Y          35%
  Head Swapping   X cruzar Y nadando    X swim across Y    8%
  Thematic        X gustar a Y          Y likes X          6%
Transfer rules
Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
Why syntax-based SMT
Weaknesses of phrase-based SMT:
• Long-distance reordering: only phrase-level reordering
• Discontinuous phrases
• Generalization
• …
Other methods using syntactic knowledge:
• Word alignment integrating syntactic constraints
• Pre-ordering source sentences
• Reranking n-best output of translation models
SSMT based on formal structures
Compared with phrase-based SMT:
• Translation is carried out hierarchically
• The target structures finally generated are not necessarily real linguistic structures, but they
  – make long-distance reordering more feasible
  – introduce non-terminals/variables
  – handle discontinuous phrases: put x on, 在 x 时
  – allow generalization
SCFG
Formulated:
• Two CFGs and their correspondences: each synchronous production pairs a source rule with a target rule, together with a one-to-one correspondence (~) between the nonterminals on the two right-hand sides:
  A → ⟨α, β, ~⟩
• Or, probabilistically, each synchronous production carries a probability:
  P(A → ⟨α, β, ~⟩)
SCFG: an example
SCFG: derivation
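Since the example and derivation slides were figures, here is a toy sketch of how a synchronous derivation expands linked nonterminals on both sides at once. The grammar encoding (linked nonterminals as (symbol, index) tuples) and the example rules are assumptions for illustration only.

```python
import random

# Toy SCFG: nonterminals are (symbol, link) tuples, terminals are strings.
# Every link on the target side is assumed to also occur on the source side.
SCFG = {
    "S":  [(["yesterday", ("NP", 1), ("VP", 2)],
            [("NP", 1), ("VP", 2), "zuotian"])],     # reordering rule
    "NP": [(["he"], ["ta"])],
    "VP": [(["slept"], ["shuijiao le"])],
}

def derive(sym="S"):
    """Synchronously expand one nonterminal on both sides."""
    src_rhs, tgt_rhs = random.choice(SCFG[sym])
    link = {}
    for x in src_rhs:
        if isinstance(x, tuple):                      # linked nonterminal
            link[x[1]] = derive(x[0])
    src = [x if isinstance(x, str) else link[x[1]][0] for x in src_rhs]
    tgt = [x if isinstance(x, str) else link[x[1]][1] for x in tgt_rhs]
    return " ".join(src), " ".join(tgt)

print(derive())   # ('yesterday he slept', 'ta shuijiao le zuotian')
```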
ITG as reordering constraint
Two kinds of reordering:
• straight: X → [X1 X2] keeps the order of the two blocks in the target
• inverted: X → ⟨X1 X2⟩ swaps them
Coverage:
• Wu (1997): "been unable to find real examples" of cases where alignments would fail under this constraint, at least in "lightly inflected languages, such as English and Chinese."
• Wellington (2006): "we found examples", "at least 5% of the Chinese/English sentence pairs".
Weakness:
• No strong mechanism determining which order is better, inverted or straight.
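A brute-force sketch of what the constraint means: a target-side permutation is ITG-reachable iff it can be split recursively into two blocks, each contiguous in the target, combined either straight or inverted. This check is illustrative only, not Wu's chart-based parser.

```python
def itg_parseable(perm):
    """perm[i] = target position of source word i. True iff the permutation
    can be built with only straight and inverted binary combinations."""
    def contiguous(lo, hi):
        span = perm[lo:hi]
        return max(span) - min(span) == hi - lo - 1
    def parse(lo, hi):
        if hi - lo == 1:
            return True
        for mid in range(lo + 1, hi):
            # both halves must be target-contiguous; order straight or inverted
            if contiguous(lo, mid) and contiguous(mid, hi) and \
               parse(lo, mid) and parse(mid, hi):
                return True
        return False
    return parse(0, len(perm))

print(itg_parseable([1, 0]))        # True: a single inverted combination
print(itg_parseable([1, 3, 0, 2]))  # False: the "inside-out" permutation
```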
Chiang '05: Hierarchical Phrase-based Model (HPM)
Rules: synchronous rules with linked gaps, e.g. X → ⟨与 X1 有 X2 , have X2 with X1⟩
Glue rules: S → ⟨S1 X2 , S1 X2⟩ and S → ⟨X1 , X1⟩, which concatenate hypotheses monotonically
Model: log-linear
Decoder: CKY
Chiang '05: rule extraction
Chiang '05: rule extraction restrictions
• Initial phrase pairs: at most 15 symbols on the French side
• Final rules: at most 5 symbols (terminals plus nonterminals) on the French side
• At most two nonterminals on each side, and they may not be adjacent on the French side
• At least one pair of aligned terminals
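These restrictions amount to a simple filter over candidate rules. A sketch under an assumed rule representation (terminals as strings, nonterminal gaps as ints); only the final-rule checks are shown, since the 15-symbol limit applies earlier, to the initial phrase pairs:

```python
def keep_rule(f_rhs, e_rhs, aligned_terminal_pairs):
    """Sketch of the Chiang '05 final-rule restrictions.
    f_rhs/e_rhs: rule right-hand sides; the encoding is an assumption."""
    if len(f_rhs) > 5:                               # <= 5 symbols, French side
        return False
    if sum(isinstance(x, int) for x in f_rhs) > 2:   # <= 2 nonterminals
        return False
    for a, b in zip(f_rhs, f_rhs[1:]):               # no adjacent nonterminals
        if isinstance(a, int) and isinstance(b, int):
            return False
    return bool(aligned_terminal_pairs)              # >= 1 aligned terminal pair

# e.g. the rule X -> <与 X1 有 X2 , have X2 with X1> passes:
print(keep_rule(["与", 1, "有", 2], ["have", 2, "with", 1],
                {("与", "with")}))                    # True
```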
Chiang '05: Model
• Log-linear form: the weight of a derivation D is
  w(D) = ∏_{(X→⟨γ,α⟩)∈D} w(X→⟨γ,α⟩) · p_LM(e)^{λ_LM} · exp(−λ_wp |e|)
• and each rule's weight combines its features log-linearly:
  w(X→⟨γ,α⟩) = ∏_i φ_i(X→⟨γ,α⟩)^{λ_i}
Chiang '05: decoder
SSMT based on phrase structures
Using grammars with linguistic knowledge
• The grammars are based on SCFG
Two categories:
• Tree-string
  – Tree-to-string
  – String-to-tree
• Tree-tree
Yamada & Knight 2001, 2003
Yamada's work vs. SCFG
Insertion operation:
• A → ⟨w A1 , A1⟩
Reordering operation:
• A → ⟨A1 A2 A3 , A1 A3 A2⟩
Translating operation:
• A → ⟨x , y⟩
Yamada: weaknesses
• Single-level mapping, so no multi-level reordering
  – Yamada's fix: flatten the trees
• Word-based
  – Yamada's fix: phrasal leaves
Galley et al. 2004, 2006
• Translation model incorporates syntactic structure on the target language side
• Trained by learning "translation rules" from bilingual data
• The decoder uses a parser-like method to create syntactic trees as output hypotheses
Translation rules
• Target: multi-level subtrees
• Source: continuous or discontinuous phrases
Types of translation rules:
• Translating source phrases into target chunks
  – NPB(PRP/I) ↔ 我
  – NP-C(NPB(DT/this NN/address)) ↔ 这个 地址
Types of translation rules
• Have variables
  – NP-C(NPB(PRP$/my x0:NN)) ↔ 我 的 x0
  – PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ 去 x0 公园
• Combine previously translated results together (see the sketch below)
  – VP(x0:VBZ x1:NP-C) ↔ x1 x0
    • takes a noun phrase followed by a verb, switches their order, then combines them into a new verb phrase
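A sketch of how such a rule combines previously translated constituents. Representing translated chunks as (label, target string) pairs is an illustrative assumption, not the paper's data structure:

```python
# Sketch: applying the rule VP(x0:VBZ x1:NP-C) <-> x1 x0.

def apply_vp_rule(children):
    """If the children match the rule's English side (a VBZ then an NP-C),
    emit the Chinese side with the two translations swapped (x1 x0)."""
    if [label for label, _ in children] == ["VBZ", "NP-C"]:
        (_, x0), (_, x1) = children
        return ("VP", f"{x1} {x0}")
    return None

print(apply_vp_rule([("VBZ", "vbz-translation"), ("NP-C", "np-translation")]))
# -> ('VP', 'np-translation vbz-translation')
```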
Rule extraction
• Word-align a parallel corpus
• Parse the target side
• Extract translation rules
  – Minimal rules: cannot be decomposed
  – Composed rules: composed of minimal rules
• Estimate probabilities (see the sketch below)
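Probability estimation is typically relative frequency over extracted rule occurrences. A minimal sketch, assuming we condition on the rule's source side (one common choice):

```python
from collections import Counter, defaultdict

def estimate_rule_probs(extracted_rules):
    """extracted_rules: iterable of (source_side, target_side) pairs,
    one per occurrence. Returns p(target | source) by relative frequency."""
    counts = Counter(extracted_rules)
    totals = defaultdict(int)
    for (src, tgt), c in counts.items():
        totals[src] += c
    return {(src, tgt): c / totals[src] for (src, tgt), c in counts.items()}

probs = estimate_rule_probs([("NPB(PRP/I)", "我"), ("NPB(PRP/I)", "我"),
                             ("NPB(PRP/I)", "我们")])
print(probs[("NPB(PRP/I)", "我")])    # 2/3
```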
Rule extraction Minimal rule
Composed rules
Format is Expressive [Knight & Graehl, 2005]
(figure: example rules showing the format covers)
• Phrasal translation: está cantando ↔ VP(VBZ/is VBG/singing)
• Non-contiguous phrases: hay x0 ↔ there are x0
• Non-constituent phrases: poner x0 ↔ VB(put) x0:NP PRT(on)
• Context-sensitive word insertion: x0 ↔ NPB(DT/the x0:NNS)
• Multilevel re-ordering: x1 x0 x2 ↔ S(x0:NP VP(x1:VB x2:NP2))
• Lexicalized re-ordering: x1 … x0 ↔ NP(x0:NP PP(P/of x1:NP))
Decoder
• probabilistic CYK-style parsing algorithm with beams
• results in an English syntax tree corresponding to the Chinese sentence
• guarantees the output to have some kind of globally coherent syntactic structure
Decoding example
(a sequence of figures stepping through the CYK-style construction of the output tree)
Marcu et al. 2006 SPMT • Integrating non-syntactifiable phrases • Multiple features for each rule • Decoding with multiple models
SSMT based on phrase structures
Two categories:
• Tree-string
  – String-to-tree
  – Tree-to-string
• Tree-tree
Tree-to-string Liu et al. 2006 • Tree-to-string alignment template model
TAT
(figure: an example tree-to-string alignment template pairing a Chinese parse tree, with NP, NR, LCP, DNP, CC, LC, DEG nodes over the words 布什, 总统, 和, 美国, 间, with the English string "between President Bush and the United States")
TAT: extraction
Constraints:
• The source tree fragment must be a subtree
• It must be consistent with the word alignment
Restrictions on extraction (see the sketch below):
• Both the first and last symbols in the target string must be aligned to some source symbols
• The height of T(z) is limited to no greater than h
• The number of direct descendants of a node of T(z) is limited to no greater than c
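The restrictions translate into simple checks. A sketch with an assumed summary representation of the candidate template; the default values of h and c are illustrative:

```python
def admissible_tat(first_tgt_aligned, last_tgt_aligned,
                   height, max_branching, h=3, c=5):
    """Sketch of the TAT extraction restrictions; h and c are the height
    and branching limits named above."""
    if not (first_tgt_aligned and last_tgt_aligned):
        return False        # first/last target symbols must be aligned
    if height > h:          # height of T(z) no greater than h
        return False
    if max_branching > c:   # direct descendants of any node no greater than c
        return False
    return True
```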
TAT: Model
Decoding
Tree-to-string vs. string-to-tree
Tree-to-string:
• Integrates source structures into translation and reordering
• The output is not guaranteed to be grammatical
String-to-tree:
• Guarantees the output to have some kind of globally coherent syntactic structure
• Cannot use any knowledge from source structures
SSMT based on phrase structures
Two categories:
• Tree-string
  – String-to-tree
  – Tree-to-string
• Tree-tree
Tree-Tree
• Synchronous tree-adjoining grammar (STAG)
• Synchronous tree substitution grammar (STSG)
STAG
STAG: derivation
STSG
(figure: aligned French/English trees for "beaucoup d'enfants donnent un baiser à Sam" ↔ "kids kiss Sam quite often"; the same example is developed in the Eisner slides below)
STSG: elementary trees
(figure: the decomposition of that tree pair into elementary trees, with some nodes aligned to null)
Dependency structures
(figure: the sentence 外商 投资 企业 成为 中国 外贸 重要 增长点 shown as (a) a phrase-structure parse with IP, NP, VP, ADJP nodes and (b) a dependency structure rooted at the verb 成为)
For MT: dependency structures vs. phrase structures
Advantages of dependency structures over phrase structures for machine translation:
• Inherent lexicalization
• Closer to the meaning (semantic relations)
• Better representation of divergences across languages
SSMT based on dependency structures Lin 2004 • A Path-based Transfer Model for Machine Translation Quirk et al. 2005 • Dependency Treelet Translation: Syntactically Informed Phrasal SMT Ding et al. 2005 • Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Lin 2004
• Translation model trained by learning transfer rules from a bilingual corpus in which the source language sentences are parsed
• Decoding: finding the minimum path covering of the source language dependency tree
Lin 2004: path
Lin 2004: transfer rule
Quirk et al. 2005
• Translation model trained by learning treelet pairs from a bilingual corpus in which the source language sentences are parsed
• Decoding: CKY-style
Treelet pairs
Quirk 2005: decoding
Ding 2005
Summary
(figure: a Vauquois-style pyramid between Source Language and Target Language, with transfer possible at increasing levels of abstraction: Word → String (phrase or chunk) → Tree (formal, phrase, or dependency structure) → Semantic → Interlingua)
Introduction
State-of-the-art machine translation systems are based on statistical models rooted in the theory of formal grammars/automata.
Translation models based on finite-state devices cannot easily model translations between languages with strong differences in word ordering.
Recently, several models based on context-free grammars have been investigated, borrowing from the theory of compilers the idea of synchronous rewriting.
Slides from G. Satta
Introduction
Translation models based on synchronous rewriting:
• Inversion Transduction Grammars (Wu, 1997)
• Head Transducer Grammars (Alshawi et al., 2000)
• Tree-to-string models (Yamada & Knight, 2001; Galley et al., 2004)
• "Loosely tree-based" model (Gildea, 2003)
• Multi-Text Grammars (Melamed, 2003)
• Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all these.
Synchronous CFG A synchronous context-free grammar (SCFG) is based on three components: Context free grammar (CFG) for source language CFG for target language Pairing relation on the productions of the two grammars and on the nonterminals in their right-hand sides
Synchronous CFG
Example (Yamada & Knight, 2001):
[ VB → PRP(1) VB1(2) VB2(3) ,  VB → PRP(1) VB2(3) VB1(2) ]
[ VB2 → VB(1) TO(2) ,  VB2 → TO(2) VB(1) ga ]
[ TO → TO(1) NN(2) ,  TO → NN(2) TO(1) ]
[ PRP → he ,  PRP → kare ha ]
[ VB1 → adores ,  VB1 → daisuki desu ]
[ VB → listening ,  VB → kiku no ]
[ TO → to ,  TO → wo ]
[ NN → music ,  NN → ongaku ]
Synchronous CFG
Example (cont'd), deriving "he adores listening to music" ↔ "kare ha ongaku wo kiku no ga daisuki desu":
[ VB(1) , VB(1) ]
⇒ [ PRP(1) VB1(2) VB2(3) , PRP(1) VB2(3) VB1(2) ]
⇒ [ he adores VB(1) TO(2) , kare ha TO(2) VB(1) ga daisuki desu ]
⇒ [ he adores listening TO(1) NN(2) , kare ha NN(2) TO(1) wo kiku no ga daisuki desu ]
⇒ [ he adores listening to music , kare ha ongaku wo kiku no ga daisuki desu ]
Synchronous CFG A pair of CFG productions in a SCFG is called a synchronous production A SCFG generates pairs of trees/strings, where each component is a translation of the other A SCFG can be extended with probabilities: Each pair of productions is assigned a probability Probability of a pair of trees is the product of probabilities of synchronous productions involved
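In code, the probability of a generated tree pair is just a product over the synchronous productions used in its derivation. A minimal sketch, with invented production names and probabilities:

```python
import math

def pair_log_prob(derivation, rule_prob):
    """Log-probability of a tree pair under a probabilistic SCFG:
    the product (sum of logs) over the synchronous productions used."""
    return sum(math.log(rule_prob[r]) for r in derivation)

rule_prob = {"VB_reorder": 0.4, "PRP_he": 0.9, "VB1_adores": 0.7}
print(math.exp(pair_log_prob(["VB_reorder", "PRP_he", "VB1_adores"],
                             rule_prob)))   # = 0.4 * 0.9 * 0.7 ≈ 0.252
```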
Membership
The membership problem (Wu, 1997) for SCFGs is defined as follows:
• Input: an SCFG and a pair of strings [w1, w2]
• Output: Yes/No depending on whether w1 translates into w2 under the SCFG
Applications in segmentation, word alignment and bracketing of parallel corpora.
The assumption that the SCFG is part of the input is made here to investigate the dependency of problem complexity on grammar size.
Membership
Result: the membership problem for SCFGs is NP-complete.
Proof uses SCFG derivations to explore the space of consistent truth assignments that satisfy a 3SAT instance encoded in the source string.
Remarks: the result transfers to (Yamada & Knight, 2001), (Gildea, 2003), (Melamed, 2003), which are at least as powerful as SCFG.
Membership
Remarks (cont'd): the problem can be solved in polynomial time if:
• the input grammar is fixed or production length is bounded (Melamed, 2004)
• the grammar is an Inversion Transduction Grammar (Wu, 1997)
• the grammar is a Head Transducer Grammar (Alshawi et al., 2000)
For NLP applications, it is more realistic to assume a fixed grammar and varying input string.
Chart parsing Providing an exponential time lower bound for the membership problem would amount to showing P ≠ NP But we can show such a lower bound if we make some assumptions on the class of algorithms and data structures that we use to solve the problem Result: If chart parsing techniques are used to solve the membership problem for SCFG, a number of partial analyses is obtained that grows exponentially with the production length of the input grammar
Chart parsing
Chart parsing for CFGs works by combining completed constituents with partial analyses:
  A → B1 B2 B3 … Bn
(a partial analysis covering B1 B2 B3 is extended with a completed constituent B4)
Three indices are used to process each combination, for a total of O(n³) possible combinations that must be checked, n the length of the input string.
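For contrast with the synchronous case, here is the standard monolingual combination with its three indices, as a sketch of a CKY recognizer for a binarized CFG:

```python
from collections import defaultdict

def cky(words, binary_rules, lexical_rules):
    """Standard CKY recognizer. binary_rules: dict (B, C) -> set of A
    for rules A -> B C; lexical_rules: dict word -> set of A."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[i, i + 1] = set(lexical_rules.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):        # i, j, k: the O(n^3) loop
                for B in chart[i, j]:
                    for C in chart[j, k]:
                        chart[i, k] |= binary_rules.get((B, C), set())
    return chart[0, n]

rules = {("NP", "VP"): {"S"}}
lex = {"kids": {"NP"}, "sleep": {"VP"}}
print(cky(["kids", "sleep"], rules, lex))    # {'S'}
```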
Chart parsing
Consider the synchronous production:
[ A → B(1) B(2) B(3) B(4) ,  A → B(3) B(1) B(4) B(2) ]
representing the permutation:
B(1) B(2) B(3) B(4) ↦ B(3) B(1) B(4) B(2)
Chart parsing
When applying chart parsing, there is no way to keep partial analyses "contiguous" on both sides at once:
B(1) B(2) B(3) B(4) ↦ B(3) B(1) B(4) B(2)
Chart parsing The proof of our result generalizes the previous observations We show that, for some worst case permutations of length q, any combination strategy we choose leads to a number of indices growing with order at least sqrt(q) Then for SCFGs of size q, sqrt(q) is an asymptotic lower bound for the membership problem when chart parsing algorithms are used
Translation
A probabilistic SCFG provides the probability that tree t1 translates into tree t2: Pr([t1, t2]).
Accordingly, we can define the probability that string w1 translates into string w2:
  Pr([w1, w2]) = Σ_{t1 yields w1, t2 yields w2} Pr([t1, t2])
and the probability that string w translates into tree t:
  Pr([w, t]) = Σ_{t1 yields w} Pr([t1, t])
Translation The string-to-tree translation problem for probabilistic SCFGs is defined as follows: Input: Probabilistic SCFG and string w Output: tree t such that Pr([w, t ]) is maximized Application in machine translation Again, assumption that SCFG is part of the input is made to investigate the dependency of problem complexity on grammar size
Translation
Result: the string-to-tree translation problem for probabilistic SCFGs (summing over possible source trees) is NP-hard.
Proof reduces from the consensus problem:
• Strings generated by a probabilistic finite automaton or hidden Markov model have probabilities defined as sums of probabilities over several paths
• Maximizing such a summation is NP-hard (Casacuberta & de la Higuera, 2000; Lyngsø & Pedersen, 2002)
Translation Remarks: Source of complexity of the problem comes from the fact that several source trees can be translated into the same target tree Result persists if there is a constant bound on length of synchronous productions Open: can the problem be solved in polynomial time if probabilistic SCFG is fixed?
Learning Non-Isomorphic Tree Mappings for Machine Translation
(figure: aligned dependency trees for "wrongly report events to-John" and "misinform him of the events", illustrating cases where 2 words become 1, 0 words become 1, and dependents are reordered)
Slides from J. Eisner
Syntax-Based Machine Translation
Previous work assumes essentially isomorphic trees
• Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
But trees are not isomorphic!
• Discrepancies between the languages
• Free translation in the training data
(figure: the non-isomorphic tree pair for "wrongly report events to-John" ↔ "misinform him of the events")
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
"beaucoup d'enfants donnent un baiser à Sam" ↔ "kids kiss Sam quite often"
(figure: the French dependency tree over donnent, baiser, à, un, beaucoup, d', enfants, Sam aligned with the English tree over kiss, kids, Sam, often, quite)
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
(figure: the alignment pairs corresponding fragments and aligns leftover nodes, such as the English adverbs, to null)
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment...
(figure: an alternative alignment of the same trees that pairs unrelated words)
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from "little trees"...
(figure: the aligned pair decomposed into little trees)
Grammar = Set of Elementary Trees
(build slides; the elementary trees extracted from the aligned pair illustrate:)
• an idiomatic translation: the French fragment donnent … un baiser à ("give a kiss to") pairs with the single English elementary tree kiss
• lexical pairs such as enfants ↔ kids and Sam ↔ Sam
• "beaucoup d'" deletes an NP inside the tree
• "beaucoup d'" matches nothing in English (its English side is null)
• the English adverbial subtree (quite often) matches nothing in French (its French side is null)
Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A:
  P(T1, T2, A) = ∏_{(t1, t2, a)} p(t1, t2, a | n)
i.e., the product of the probabilities of the "little" trees that are used. The probability of each little tree pair, e.g.
  p( VP(report wrongly) ↔ VP(misinform NP) )
is given by a maximum entropy model.
Maxent model of little tree pairs
p( VP(report wrongly) ↔ VP(misinform NP) )
FEATURES
• report + wrongly → misinform? (use dictionary)
• report → misinform? (at root)
• wrongly → misinform?
• verb incorporates adverb child?
• verb incorporates child 1 of 3?
• children 2, 3 switch positions?
• common tree sizes & shapes?
• ... etc. ...
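The feature list maps naturally onto a feature function. A hedged sketch in which the little-tree interface (.root_word, .words, .child_labels) and the feature names are invented for illustration:

```python
def little_tree_features(t1, t2, bilingual_dict):
    """Sketch of maxent features for a little tree pair (t1, t2)."""
    f = {}
    # dictionary features
    f["pair_in_dict"] = t2.root_word in bilingual_dict.get(" ".join(t1.words), ())
    f["roots_in_dict"] = t2.root_word in bilingual_dict.get(t1.root_word, ())
    # structural features
    f["incorporates_adverb"] = ("Adv" in t1.child_labels
                                and "Adv" not in t2.child_labels)
    f["size_difference"] = len(t1.words) - len(t2.words)
    return f
```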
Inside Probabilities
The inside probability β of an aligned node pair (e.g. report ↔ misinform) sums, over every little tree pair that could be rooted there, the pair's model probability times the inside probabilities of its aligned children:
  β(report, misinform) = … + p( VP(report wrongly) ↔ VP(misinform NP) | VP, VP ) · β(events, events) · β(to-John, him) + …
Only O(n²) aligned node pairs need to be considered.
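The recursion behind these sums, as a sketch. The little_tree_pairs callback, yielding each candidate little tree pair's probability and the aligned child pairs it leaves open at its frontier, is an assumed abstraction; a real implementation would memoize over the O(n²) node pairs:

```python
def inside(n1, n2, little_tree_pairs):
    """Inside probability that the subtree rooted at n1 (source) aligns
    with the subtree rooted at n2 (target)."""
    total = 0.0
    for rule_prob, child_pairs in little_tree_pairs(n1, n2):
        p = rule_prob
        for c1, c2 in child_pairs:      # frontier nonterminal pairs
            p *= inside(c1, c2, little_tree_pairs)
        total += p
    return total
```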
An MT Architecture
• A shared dynamic programming engine underlies training and decoding:
  – Trainer: scores all alignments of two big trees T1, T2; inside-outside estimated counts update the parameters
  – Decoder: scores all alignments between a big tree T1 and a forest of big trees T2; the Viterbi alignment yields the output T2
• Both sit on top of a Probability Model p(t1, t2, a) of Little Trees:
  – the engine asks it to score each possible/proposed (t1, t2, a)
  – for each possible t1, the model proposes translations t2 of the little tree t1