
The Syntax-Morphology Interface and Natural Language Processing
Veronika Vincze, University of Szeged, Hungary
vinczev@inf.u-szeged.hu
Thematic Training Course on Processing Morphologically Rich Languages, 11-15 April 2011
Outline
• Introduction
• Syntax vs. morphology from a linguistic viewpoint
• Morphological coding systems in Hungarian
• Morphosyntactic information in Hungarian corpora
• Language-specific morphosyntactic problems
• Effects on IE, NER and MT
Syntax vs. morphology
• Typological differences among languages
• Agglutinative languages: morphology plays the stronger role (a lot of information is carried by morphemes)
• Isolating languages: syntax plays the stronger role (fewer morphemes, more constructions)
• Focus on Hungarian (agglutinative) and English (fusional/isolating)
Basic Hungarian syntax
• A lot of information is encoded in morphemes
• No fixed word order
• Information structure is reflected in word order (theme-rheme, old-new)
Péter szereti Marit.
Peter love-3Sg.Obj Mary-ACC
‘Peter loves Mary.’
Péter Marit szereti. ‘It is Mary who Peter loves.’
Marit szereti Péter. ‘It is Mary who Peter loves.’
Marit Péter szereti. ‘It is Peter who loves Mary.’
Szereti Péter Marit. ‘Peter LOVES Mary (and not hates).’
Szereti Marit Péter. ‘Peter LOVES Mary (and not hates).’
Morphosyntactic features of Hungarian
• Nominal declension (nouns, adjectives, numerals)
• Verbal conjugation
• Several hundred word forms for each lemma
• Grammatical relations are encoded primarily by morphemes -> morpho + syntactic
Nominal suffixes
A stem can be extended by:
• Derivational suffixes
• Plural
• Possessive
• Case suffixes
hat-ás-a-i-nak ‘to its effects’
stem-DERIV.SUFF-POSS.PL-DAT
egész-ség-ed-re ‘cheers’
stem-DERIV.SUFF-POSS.Sg2-SUB
Case suffixes in Hungarian
• ~20 cases ("rare" cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül…))
• Always at the right end of the word form
• Grammatical relations are encoded:
  – Arguments of the verb
  – Adjuncts (temporal and locative adverbials)
…and in English
Pisti szerdánként edzésre jár.
Steve Wednesday-DIST-TEMP training-SUB go-3Sg
‘Each Wednesday Steve goes to training.’
szerdánként – each Wednesday
edzésre – to training
Pisti bort iszik.
Steve wine-ACC drink-3Sg
‘Steve is drinking wine.’
Pisti-NOM – Steve – subject
bort – wine – object
Possessive in Hungarian
• A fiú kutyája
  The boy dog-POSS
  ‘The boy’s dog’
• A(z ő) kutyája
  The (he) dog-POSS
  ‘His dog’
  – Possessor in nominative
  – Possessed with a possessive marker
• A fiúnak a kutyája
  The boy-DAT the dog-POSS
  – Possessor in dative
  – Possessed with a possessive marker
…and in English
• The boy’s dog
• His dog
  – Possessor with a possessive marker (pronoun)
  – Possessed with no marker
• The dog of the boy
  – Possessive relation is marked by a preposition
Hungarian vs. English - nouns
• Number of word forms: several hundred (HU) vs. 2-3 (EN)
• Means to express grammatical relations:
  – Suffixes (HU)
  – Preposition, fixed position (word order), suffix, determiner (EN)
• Methods for morphological parsing are very different for Hungarian and English
Verbal suffixes
A stem can be extended by:
• Derivational suffixes
• Mood markers
• Tense markers
• Person/number suffixes
• Objective markers
vág-at-ná-k
cut-CAUS-COND-3Pl.Obj
‘they would have it cut’
Mood and tense in Hungarian
• Mood:
  – Indicative: default (not marked)
  – Conditional: suffixes (present), analytic form (past)
  – Imperative: suffixes
• Tense:
  – Present: default (not marked)
  – Past: suffixes
  – Future: analytic (auxiliary fog)
…and in English
• Mood:
  – Indicative: default (not marked)
  – Conditional: past tense forms + analytic forms (auxiliary would)
  – Imperative: auxiliaries + grammatical structure
• Tense:
  – Present: default (not marked)
  – Past: suffix / irregular forms (suppletion or ablaut, i.e. vowel change)
  – Future: analytic (auxiliary will)
Person & Number
• Hungarian: suffixes
  – fut-ok
  – fut-sz
  – fut-unk
  – fut-tok
  – fut-nak
  – 3Sg is the default (not marked!)
• English: 3Sg + pronouns / obligatory subject
  – I run
  – You run
  – He runs
  – We run
  – You run
  – They run
  – 3Sg marked!
Derivational suffixes in Hungarian
• Possibility/permission: fut-hat-ok, run-MOD-1Sg, ‘I may run’
• Reflexive: mos-akod-unk, wash-REFL-1Pl, ‘we wash ourselves’
• Frequentative: üt-öget-sz, hit-FREQ-2Sg, ‘you hit sg repeatedly’
• Causative: csinál-tat-nak, do-CAUS-3Pl, ‘they have sg done’
… and in English
• Possibility/permission: auxiliaries
• Reflexive: pronominal objects
• Frequentative: adverb
• Causative: construction
Hungarian vs. English - verbs
• Number of word forms: several hundred (HU) vs. 4-5 (EN)
• Means to express grammatical relations:
  – Suffixes + auxiliaries (HU)
  – Auxiliaries + reflexive pronouns + constructions (EN)
• A lot of syntactic information is encoded in Hungarian morphemes
Morphology       | Syntax                  | English
Nominal suffix   | verb-argument relation  | word order, preposition
                 | possessive              | suffix, preposition
Verbal suffix    | tense                   | suffix
                 | agreement               | pronoun, suffix
                 | modality                | auxiliary
                 | causation               | construction
                 | aspect                  |
                 | reflexivity             | pronoun
Morphosyntactic coding systems
• Language independent (?)
• Language dependent
• (Dis)advantages:
  – comparability
  – considering language-specific features
  – complexity
• Different information is necessary for each language
Hungarian coding systems
• HUMOR
  – recall Thursday Session 1
  – in the Hungarian National Corpus
• MSD
  – In Szeged Treebank
  – Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc
• KR
  – No database
  – Parser and POS-tagger available at: http://mokk.bme.hu/resources/hunmorph/index_html and http://code.google.com/p/hunpos/
MSD
• Morphosyntactic Description
• International coding system:
  – English
  – Romanian
  – Slovenian
  – Czech
  – Bulgarian
  – Estonian
  – Hungarian
MSD - 2
• Positional codes
• A given position encodes a given type of information
• Position 0: part-of-speech
• Position 1: (sub)type within POS
• Further positions: other grammatical information (person, number, case, etc.)
• Irrelevant positions are marked with a hyphen (-)
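The positional scheme can be illustrated with a minimal Python sketch. The function name `decode_msd` and the position labels are assumptions made for illustration; the labels are inferred from example tags used in this deck (such as Nc-sa---s1 and Vmip1s---n) and are not taken from the official MSD specification, which should be consulted for the full attribute tables.

```python
# Position labels are assumptions inferred from example tags in this deck
# (e.g. Nc-sa---s1, Vmip1s---n); unlabeled positions are reported by index.
NOUN_LABELS = ["pos", "type", None, "number", "case",
               None, None, None, "possessor_number", "possessor_person"]
VERB_LABELS = ["pos", "type", "mood", "tense", "person", "number",
               None, None, None, "definite_object"]


def decode_msd(tag, labels=None):
    """Map each non-hyphen position of an MSD tag to a label."""
    decoded = {}
    for index, value in enumerate(tag):
        if value == "-":
            continue  # hyphen: this position is irrelevant for the word form
        label = None
        if labels is not None and index < len(labels):
            label = labels[index]
        # Fall back to a generic name when no label is known for the position.
        decoded[label or f"position_{index}"] = value
    return decoded


print(decode_msd("Nc-sa---s1", NOUN_LABELS))
print(decode_msd("Vmip1s---n", VERB_LABELS))
```

Without a label list the function still works and simply reports `position_0`, `position_1`, and so on, which mirrors the slide's description of the code as a sequence of numbered positions.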
KR
• Created for Hungarian
• Hierarchical attribute-value matrices
• Default values (3Sg, singular…)
• Derivational information is encoded
• Compounds are also segmented
MSD vs. KR
• Differences between the two systems:
  – derivation
  – compounds
• Harmonization efforts aim at a morphological parser (magyarlanc) whose output is fully consistent with the Szeged Treebank (Farkas et al. 2010)
Nouns in MSD
Word form       | Lemma     | Code        | Gloss
kutya           | kutya     | Nc-sn       | ‘dog’
kutyámat        | kutya     | Nc-sa---s1  | ‘my dog-ACC’
kutyaházaikról  | kutyaház  | Nc-ph---p3  | ‘about their doghouses’
Obamához        | Obama     | Np-st       | ‘to Obama’
Verbs in MSD
Word form      | Lemma   | Code        | Gloss
futok          | fut     | Vmip1s---n  | ‘I run’
futhatsz       | fut     | Voip2s---n  | ‘you can run’
ütögették      | üt      | Vfis3p---y  | ‘they were hitting it’
csináltattunk  | csinál  | Vsis1p---n  | ‘we had sg made’
Morphosyntactically annotated Hungarian corpora
• Hungarian National Corpus
  – 100-million-word balanced reference corpus of present-day Hungarian
  – Word forms automatically annotated for stem, part of speech and inflectional information
  – http://corpus.nytud.hu/mnsz/index_eng.html
• Szeged Treebank
  – 1 million words, 82K sentences
  – Manually annotated for lemma, POS-tags
  – Constituency and dependency trees
  – http://www.inf.u-szeged.hu/rgai/nlp
Szeged Treebank
• Manually annotated treebank for Hungarian
  – Covers various linguistic styles
    • literature, newspapers, laws, student essays, computer books, etc.
    • multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian
  – Available free of charge for research
• Developed by
  – University of Szeged, HLT group
  – MorphoLogic Ltd.
  – Academy of Sciences, Research Institute for Linguistics
Szeged Treebank 2.
• TEI XML format
• Manually annotated
  – sentence split & word segmentation
  – morphological analysis
  – PTB-style syntactic structure
  – Verb argument structure
  – converted / extended to Dependency Grammar format manually
Szeged Treebank 3.
• Several versions
• Constituency and dependency versions
• Old MSD codes
• New (harmonized) MSD codes
• (Dependency) parser under development
• Being extended with folklore texts
Dependency vs. constituency
• Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees
• Constituency grammars are said to be well suited to languages with fixed word order
• Syntactic relations are determined
  – by the position in the tree (constituency grammar)
  – by dependency relations, i.e. labeled edges (dependency grammar)
Constituency trees in SzT 2.0
• Based on generative syntax (É. Kiss et al. 1999)
• Syntactic features of Hungarian also considered (i.e. not hardcore Chomskyan trees)
• Verb-argument relations are encoded by labels
• Very detailed information: different grammatical role for each case suffix
• Semantic information can also be found (temporal and locative adverbials)
Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
‘Aggie received all of her relatives the day before yesterday.’
Dependency trees in Szeged Dependency Treebank
• Based on SzT 2.0
• Automatic conversion and manual correction
• Word forms are the nodes of the tree
• Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
• Semantic information kept
• Sentences without 3Sg copula are distinctively marked
Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.
Virtual nodes
• No overt copula in present tense 3Sg
• Only the subject and the predicative noun/adjective are manifest
• No syntactic structure in SzT (grammatical roles are not marked)
• Virtual nodes in SzDT
I like to go to school because it is good to be at school though not always.
Szeged Treebank vs. Szeged Dependency Treebank
• Labeled relations in both cases -> the contrast is not so sharp
• Virtual nodes in SzDT -> grammatical structure marked for every sentence (IE, MT)
• No word order constraints in SzDT
• Word forms are marked
• Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))
Language-specific morphosyntactic problems
• Morphology vs. syntax:
  – Pseudo-subjects
  – Pseudo-objects
  – Pseudo-datives
• Morphological analysis of unknown words
• Lemmatization of named entities
Pseudo-subjects
• A noun in the nominative that is not the subject of the sentence -> special attention is required when parsing
• Possessor:
  a kisfiú labdája
  the boy ball-3Sg.POSS
  ‘the boy’s ball’
• Predicative noun:
  István juhász maradt.
  Stephen shepherd remain-PAST
  ‘Stephen remained a shepherd.’
• Object:
  A kutyám kergeti a macska.
  The dog-POSS chase-3Sg.Obj the cat
  ‘The cat is chasing my dog.’ (garden path sentence)
  A fiam szereti a lányod.
  The son-1Sg.POSS love-3Sg.Obj the daughter-2Sg.POSS
  ‘My son loves your daughter’ or ‘Your daughter loves my son’
Solutions
• Possessor:
  – SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
  – SzDT: ATT relation
• Predicative noun: PRED relation
  – Virtual node in SzDT
• Object: OBJ relation
  – Sometimes contextual information is needed even for humans…
Pseudo-objects
Adverbials with an apparently accusative ending:
Futottam egy jót.
Run-PAST-1Sg a good-ACC
‘I have had a good run.’
Nagyot aludtam.
Big-ACC sleep-PAST-1Sg
‘I have slept a lot.’
Intransitive verbs -> cannot be an object -> MODE relation
Pseudo-datives
Not all (semantic) subjects are in nominative:
• Dative subject:
  Sándornak kell elrendeznie az ügyeket.
  Alexander-DAT must arrange-INF-3Sg the issue-PL
  ‘Alexander has to arrange the issues.’
• DAT in both corpora
• Certain auxiliaries with dative subjects (exceptions)
• Dative-nominative parallelism in possessive as well
Unknown words
• Unknown words can be:
  – Compounds
  – Named entities
  – Derivations
• fémkapunk, félmillió, csokinyúl, NATO-hoz
• Methods for analysis (Zsibrita et al. 2010):
  – Segmentation into two or more analyzable parts
  – Expert rules to filter impossible combinations (*V+N)
  – Analysis of the last part goes to the whole word
  – Substitution for hyphenated words (pre-defined patterns for each morphological class)
félmillió
fél     | ADJ ‘half’, NUM ‘half’, N ‘half’, V ‘be afraid’
millió  | NUM ‘million’
Analysis: fél+millió Mc-snl
Expert rules:
  NUM + NUM
  * non-NUM + NUM
fémkapunk
fém   | N ‘metal’
kap   | V ‘get’
kapu  | N ‘gate’
-unk  | S 1Pl (verb)
-nk   | S 1Pl.Poss (noun)
Analyses:
  fém+kap+unk  Vmip1p---n
  fém+kapu+nk  Nc-sn---p1
Expert rules:
  N + N
  N-nonNOM + V
  * N-NOM + V
csokinyúl
csoki   | N ‘chocolate’
nyúl    | N ‘rabbit’, V ‘stretch out’
kinyúl  |
Analyses:
  csoki+nyúl      Vmip3s---n, Nc-sn
  cso+kinyúl (?)  Vmip3s---n
Expert rules:
  N + N
  N-nonNOM + V
  * N-NOM + V
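The segmentation examples (félmillió, fémkapunk, csokinyúl) can be traced with a small Python sketch. Everything here is a toy under stated assumptions: the hand-written `LEXICON` stands in for a real morphological analyser, the inflected form kapunk is listed directly instead of being segmented further, the function names are hypothetical, and any part-of-speech combination not mentioned on the slides is simply rejected.

```python
# Toy lexicon: form -> list of (coarse POS, MSD tag or None).
# Tags are given only where the slides provide them; None means "not needed here".
LEXICON = {
    "fém": [("N", "Nc-sn")],
    "kapunk": [("V", "Vmip1p---n"), ("N", "Nc-sn---p1")],
    "csoki": [("N", "Nc-sn")],
    "nyúl": [("N", "Nc-sn"), ("V", "Vmip3s---n")],
    "fél": [("ADJ", None), ("NUM", None), ("N", None), ("V", None)],
    "millió": [("NUM", "Mc-snl")],
}


def is_nominative(tag):
    # Assumption: position 4 of a noun tag holds the case, 'n' = nominative.
    return tag is not None and len(tag) > 4 and tag[4] == "n"


def allowed(first, last):
    """Expert rules from the slides; unlisted combinations are rejected."""
    first_pos, first_tag = first
    last_pos, _ = last
    if last_pos == "NUM":
        return first_pos == "NUM"            # NUM + NUM, * non-NUM + NUM
    if first_pos == "N" and last_pos == "N":
        return True                          # N + N
    if first_pos == "N" and last_pos == "V":
        return not is_nominative(first_tag)  # N-nonNOM + V, * N-NOM + V
    return False


def analyse_unknown(word):
    """Try every two-part split; the last part's tag goes to the whole word."""
    results = []
    for cut in range(1, len(word)):
        head, tail = word[:cut], word[cut:]
        for first in LEXICON.get(head, []):
            for last in LEXICON.get(tail, []):
                if allowed(first, last):
                    results.append((f"{head}+{tail}", last[1]))
    return results


for unknown in ("fémkapunk", "csokinyúl", "félmillió"):
    print(unknown, analyse_unknown(unknown))
```

In this sketch the verbal readings of fémkapunk and csokinyúl are filtered out because the first part is a nominative noun, so only the nominal analysis survives, which is what the starred rule on the slides is meant to achieve.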
NATO-hoz
NATO  | ?
hoz   | V ‘bring’, S ‘to’
Analyses:
  NATO-hoz             NATO: V  Vmip3s---n
  NATO-hoz (kalaphoz)  NATO: N  Np-st
Expert rules:
  N + - + S
  N-nonNOM + - + V
  * N-NOM + - + V, V + - + V
Substitution: NATO- -> kalap ‘hat’
Ordering of rules:
  1. substitution
  2. segmentation
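The substitution step for hyphenated words can be sketched in Python as follows. This is only an illustration under assumptions: `KNOWN_ANALYSES` is a hypothetical stand-in for an analyser that already knows kalap ‘hat’, the tag Nc-st for kalaphoz is inferred from the Np-st tag given for Obamához, and treating a capitalised unknown stem as a proper noun is a simplification introduced here, not a rule stated on the slides.

```python
# Hypothetical analyser output for the placeholder word with a given ending.
KNOWN_ANALYSES = {"kalaphoz": ("kalap", "Nc-st")}
PLACEHOLDER = "kalap"  # pre-defined pattern word for the noun class


def analyse_hyphenated(token):
    """Replace the unknown stem by a known placeholder and reuse its analysis."""
    stem, separator, ending = token.partition("-")
    if not separator or not stem or not ending:
        return None  # not of the form STEM-ENDING
    hit = KNOWN_ANALYSES.get(PLACEHOLDER + ending)
    if hit is None:
        return None  # fall back to segmentation (not shown in this sketch)
    _, tag = hit
    if stem[0].isupper():
        # Assumption: a capitalised unknown stem is a proper noun ('p' subtype).
        tag = tag[0] + "p" + tag[2:]
    return stem, tag


print(analyse_hyphenated("NATO-hoz"))
```

Running substitution before segmentation, as in the rule ordering on the slide, means the verbal reading of hoz ‘bring’ is never considered once the placeholder analysis succeeds.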
Lemmatization
• Lemmatization (i.e. dividing the word form into its root and affixes) is not a trivial task in morphologically rich languages such as Hungarian
• Common nouns: relying on a good dictionary
• NEs: cannot be listed
• Problem: the NE ends in an apparent suffix
Lemmatization of NEs
• Each ending that looks like a possible suffix is cut off the NE step by step:
  Citroenben (lemma)
  Citroen + ben ‘in (a) Citroen’
  Citroenb + en ‘on (a) Citroenb’
  Citroenbe + n ‘on (a) Citroenbe’
• Each possible lemma undergoes a Google and a Yahoo search; the most frequent one is chosen (Farkas et al. 2008)
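A minimal Python sketch of this candidate-and-frequency idea is given below. It is not the implementation of Farkas et al. (2008): the function names are hypothetical, the suffix list contains only the three endings from the Citroenben example, and the frequency table holds made-up numbers that stand in for web search hit counts.

```python
SUFFIXES = ["ben", "en", "n"]  # candidate endings from the Citroenben example


def candidate_lemmas(word_form, suffixes=SUFFIXES):
    """Return the full form plus every form with one apparent suffix removed."""
    candidates = [word_form]  # the word form itself may already be the lemma
    for suffix in suffixes:
        if word_form.endswith(suffix) and len(word_form) > len(suffix):
            candidates.append(word_form[:-len(suffix)])
    return candidates


def choose_lemma(word_form, frequency):
    """Pick the candidate with the highest frequency (0 if unseen)."""
    return max(candidate_lemmas(word_form), key=lambda c: frequency.get(c, 0))


# Made-up counts for illustration only; they are NOT real search results.
toy_counts = {"Citroenben": 10, "Citroen": 1000, "Citroenb": 0, "Citroenbe": 1}

print(candidate_lemmas("Citroenben"))
print(choose_lemma("Citroenben", toy_counts))
```

Passing the frequency source in as a plain dictionary keeps the sketch self-contained; a real system would replace it with whatever frequency lookup is available.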
NLP applications
• NER
  – NEs with suffixes
• Information extraction
  – Modality, uncertainty
  – Causation
• Machine translation
  – Morphemes vs. structures
Named Entities
• NEs should be recognized
• They should be morphosyntactically tagged -> proper syntactic/semantic analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
• Mini dictionary + suffix list + semantic frame
Entry         | Category | Gloss
a             | DET      | the
ben           | S        | in
Citroenben    | ?        |
en            | S        | on
meghatározó   | ADJ      | dominant
n             | S        | on
ot            | S        | ACC
Peugeot       | ?        |
szerez        | V        | acquire
t             | S        | ACC
tulajdonrész  | N        | interest
Possible analyses
• Citroenben
  – Citroenben
  – Citroen + ben ‘Citroen-INE’
  – Citroenb + en ‘Citroenb-SUP’
  – Citroenbe + n ‘Citroenbe-SUP’
• Peugeot
  – Peugeot
  – Peugeo + t ‘Peugeo-ACC’
  – Peuge + ot ‘Peuge-ACC’
[2=N] [3=N("részesedés"|"tulajdonrész"|"rész"|"tulajdonhányad")+compl1=4+modified_by_adj=5]
Analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
Tulajdonhányadot -> ACC/OBJ (3)
Citroenben -> INE (4)
Peugeot -> NOM/SUBJ (2)
‘Peugeot acquires a dominant interest in Citroen.’
Uncertainty
• Text Mining:
  – derive facts from free text
  – uncertainty and negation have an impact on the quality/nature of the information extracted
• Applications have to treat sentences / clauses containing uncertain or negated information differently from factual information
• Uncertainty: possible existence of a thing (neither its existence nor its non-existence is claimed)
Uncertainty detection
• Uncertainty detection in English: cues (words with uncertain content)
• One typical means to express uncertainty in Hungarian: -hat/het
  High school grades may influence health.
  A középiskolai jegyek kihathatnak az egészségre.
• Morphological analysis should reflect modality (Voip3s---n)
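If the morphological code marks modality, a first-pass cue detector reduces to a check on the tag, as in the following hedged Python sketch. The function names are hypothetical, and the only assumption about the tagset is the one visible in this deck: verb tags start with V and carry o in position 1 when the -hat/-het suffix is present (as in Voip2s---n for futhatsz).

```python
def is_modal_verb(msd):
    """True for verb tags whose subtype (position 1) is 'o', e.g. Voip2s---n."""
    return len(msd) > 1 and msd[0] == "V" and msd[1] == "o"


def uncertain_tokens(tagged_tokens):
    """Return the word forms whose tag signals possibility/permission."""
    return [form for form, msd in tagged_tokens if is_modal_verb(msd)]


# Word forms and tags taken from the MSD example slides of this deck.
sample = [("futok", "Vmip1s---n"), ("futhatsz", "Voip2s---n"), ("kutya", "Nc-sn")]
print(uncertain_tokens(sample))
```

Such a check only finds morphological cues; lexical cues of the kind used for English would still need a separate word list.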
Causation
• Semantic/thematic relations to be determined properly
• AGENT != SUBJECT
  Varrattam egy ruhát.
  sew-CAUS-PAST-1Sg a dress-ACC
  ‘I had a dress sewn.’
  Varrattam Marival egy ruhát.
  sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
  ‘I had Mary sew a dress.’
  Varrtam Marival egy ruhát.
  sew-PAST-1Sg Mari-INS a dress-ACC
  ‘I sewed a dress with Mary.’
• Causative information should be encoded (Vsip3s---n)
Argument structure of causative verbs
Sentence                      | Agent                 | Beneficiary | Patient
Varrattam egy ruhát.          | ?                     | I (NOM)     | ruha (ACC)
Varrattam Marival egy ruhát.  | Mari (INS)            | I (NOM)     | ruha (ACC)
Varrtam Marival egy ruhát.    | I (NOM) + Mari (INS)  | ?           | ruha (ACC)
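The role assignment for the three varr ‘sew’ sentences can be written down as a small rule in Python. This is a sketch under assumptions: the function name and the dictionary format are hypothetical, and the tags Vsis1s---n and Vmis1s---n are formed by analogy with Vsis1p---n (csináltattunk) from the MSD examples, with s in position 1 taken to mark a causative verb.

```python
def assign_roles(verb_msd, arguments):
    """arguments maps case labels to fillers, e.g. {"NOM": "I", "ACC": "ruha"}."""
    causative = verb_msd[:2] == "Vs"  # 's' subtype = causative (assumed from Vsis1p---n)
    agents = []
    beneficiary = None
    if causative:
        # The nominative subject causes the event; the instrumental NP, if any, does it.
        beneficiary = arguments.get("NOM")
        if "INS" in arguments:
            agents.append(arguments["INS"])
    else:
        # Non-causative: subject is the agent, instrumental NP is a co-agent.
        for case in ("NOM", "INS"):
            if case in arguments:
                agents.append(arguments[case])
    return {"agent": agents, "beneficiary": beneficiary,
            "patient": arguments.get("ACC")}


print(assign_roles("Vsis1s---n", {"NOM": "I", "ACC": "ruha"}))
print(assign_roles("Vsis1s---n", {"NOM": "I", "INS": "Mari", "ACC": "ruha"}))
print(assign_roles("Vmis1s---n", {"NOM": "I", "INS": "Mari", "ACC": "ruha"}))
```

An empty agent list or a `None` beneficiary corresponds to the question marks in the argument-structure overview: the role exists but is not expressed in the sentence.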
Machine translation
• Morpheme-based translation would be ideal
• Easier alignment of translational units
• Good morphological parser needed
• Easier to execute in dependency grammar
• Morpheme-based dependency structures
Alignments
Hungarian: at | varr | t | ruha
English: have | sewn | dress
Hungarian: ban | ház | am
English: in | house | my
Problems
• Not practical: no corpus available at the moment
• Portmanteau morphs – alignment problems
• Zero morphs – how many of them?
• 3 zero morphs in Hungarian nouns:
  könyv-Ø-Ø-Ø vs. könyveit
  book-Ø-Ø-Ø vs. book-POSS.PL-ACC
• (Mel’cuk 2006)
• Morphosyntactic codes might help
• Csinálhattátok Vois2p---y
• Reordering rules
Code position | Morpheme | Gloss
V             | csinál   | do
o             | hat      | can
i             |          |
s             | t        | PAST
2p            | tok      | you
y             | á        | it
csinálhattátok ‘you could do it’
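A reordering rule of this kind can be sketched in Python for the single example csinálhattátok. The slot names, the `PAST_FORMS` table and the fixed English order subject, modal, verb, object are all assumptions made for this illustration; the sketch handles only a past-tense modal and is not a general translation component.

```python
# Assumed English surface order for the slots used in this example.
PAST_FORMS = {"can": "could"}  # tiny illustrative table, not a full lexicon


def to_english(morphemes):
    """morphemes: list of (gloss, slot) pairs in Hungarian morpheme order."""
    slots = {slot: gloss for gloss, slot in morphemes}
    modal = slots.get("modal")
    if modal and slots.get("tense") == "PAST":
        modal = PAST_FORMS.get(modal, modal)  # fuse modal + PAST, e.g. can -> could
    ordered = [slots.get("subject"), modal, slots.get("verb"), slots.get("object")]
    return " ".join(word for word in ordered if word)


# csinál-hat-t-á-tok glossed morpheme by morpheme, as on the slide.
csinalhattatok = [("do", "verb"), ("can", "modal"), ("PAST", "tense"),
                  ("it", "object"), ("you", "subject")]
print(to_english(csinalhattatok))
```

The point of the sketch is that the morphosyntactic code tells the rule which function each morpheme has, so the English order can be produced without inspecting the Hungarian surface order.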
An example
Hungarian: hat | csinál | t | á | tok
English: can | do | d | Ø | you
could / you do
Syntax vs. case suffix
  – Pseudo-subject: extra rules; PRED, OBJ difficult for humans
  – Pseudo-object: list of adverbs with accusative ending
  – Pseudo-dative: list of verbs with dative subject
Unknown words (lemmas + suffixes): guessing (rules)
Information extraction
  – Thematic/semantic relations: proper morphosyntactic codes + rules
  – Uncertainty detection: proper morphosyntactic codes
Machine translation (morpheme-based): proper morphosyntactic codes
Summary
• Syntax-morphology interface in Hungarian
• Morphological coding systems
• Syntactic annotation in Hungarian corpora
• Morphosyntactic problems:
  – NER
  – IE
  – MT
References
É. Kiss K., Kiefer F., Siptár P.: Új magyar nyelvtan. Osiris Kiadó, Bp., 1999.
Farkas Richárd, Szeredi Dániel, Varga Dániel, Vincze Veronika 2010: MSD-KR harmonizáció a Szeged Treebank 2.5-ben. In: Tanács Attila, Vincze Veronika (szerk.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 349-353.
Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008: Web-based lemmatisation of Named Entities. In: Horák, Ales; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.): Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD 2008). Berlin, Heidelberg, Springer Verlag, LNCS 5246, pp. 53-60.
Koutny I., Wacha B.: Magyar nyelvtan függőségi alapon. Magyar Nyelv Vol. 87 No. 4 (1991) 393–404.
Mel’cuk, Igor 2006: Aspects of the Theory of Morphology. Mouton de Gruyter.
Prószéky, G., Koutny, I., Wacha, B.: Dependency Syntax of Hungarian. In: Maxwell, Dan; Klaus Schubert (eds.): Metataxis in Practice (Dependency Syntax for Multilingual Machine Translation). Foris, Dordrecht, The Netherlands (1989) 151–181.
Zsibrita János, Vincze Veronika, Farkas Richárd 2010: Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (szerk.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 275-283.