The Syntax-Morphology Interface and Natural Language Processing
Veronika Vincze
University of Szeged, Hungary
vinczev@inf.u-szeged.hu
Thematic Training Course on Processing Morphologically Rich Languages, 11-15 April 2011

Outline
• Introduction
• Syntax vs. morphology from a linguistic viewpoint
• Morphological coding systems in Hungarian
• Morphosyntactic information in Hungarian corpora
• Language-specific morphosyntactic problems
• Effects on IE, NER and MT

Syntax vs. morphology
• Typological differences among languages
• Agglutinative languages: the role of morphology is stronger (a lot of information in morphemes)
• Isolating languages: the role of syntax is stronger (fewer morphemes, more constructions)
• Focus on Hungarian (agglutinative) and English (fusional/isolating)

Basic Hungarian syntax
• A lot of information is encoded in morphemes
• No fixed word order
• Information structure is reflected in word order (theme-rheme, old-new)
Péter szereti Marit.
Peter love-3Sg.Obj Mary-ACC
‘Peter loves Mary.’
Péter Marit szereti. ‘It is Mary who Peter loves.’
Marit szereti Péter. ‘It is Mary who Peter loves.’
Marit Péter szereti. ‘It is Peter who loves Mary.’
Szereti Péter Marit. ‘Peter LOVES Mary (and does not hate her).’
Szereti Marit Péter. ‘Peter LOVES Mary (and does not hate her).’

Morphosyntactic features of Hungarian
• Nominal declension (nouns, adjectives, numerals)
• Verbal conjugation
• Several hundred word forms for each lemma
• Grammatical relations are encoded primarily by morphemes -> morpho + syntactic

Nominal suffixes
A stem can be extended by:
• Derivational suffixes
• Plural
• Possessive
• Case suffixes
hat-ás-a-i-nak ‘to its effects’
stem-DERIV.SUFF-POSS.PL-DAT
egész-ség-ed-re ‘cheers’
stem-DERIV.SUFF-POSS.Sg2-SUB

Case suffixes in Hungarian
• ~20 cases ("rare" cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül…))
• Always at the right end of the word form
• Grammatical relations are encoded:
– Arguments of the verb
– Adjuncts (temporal and locative adverbials)

…and in English
Pisti szerdánként edzésre jár.
Steve Wednesday-DIST-TEMP training-SUB go-3Sg
‘Each Wednesday Steve goes to training.’
Szerdánként – each Wednesday
Edzésre – to training

Pisti bort iszik.
Steve wine-ACC drink-3Sg
‘Steve is drinking wine.’
Pisti-NOM – Steve – subject
Bort – wine – object

Possessive in Hungarian
• A fiú kutyája
  The boy dog-POSS
  ‘The boy’s dog’
• A(z ő) kutyája
  The (he) dog-POSS
  ‘His dog’
  – Possessor in nominative
  – Possessed with a possessive marker
• A fiúnak a kutyája
  The boy-DAT the dog-POSS
  – Possessor in dative
  – Possessed with a possessive marker

…and in English
• The boy’s dog
• His dog
– Possessor with a possessive marker (pronoun)
– Possessed with no marker
• The dog of the boy
– Possessive relation is marked by a preposition

Hungarian vs. English - nouns
• Number of word forms: several hundred (HU) vs. 2-3 (EN)
• Means to express grammatical relations:
– Suffixes (HU)
– Preposition, fixed position (word order), suffix, determiner (EN)
• Methods for morphological parsing are very different for Hungarian and English

Verbal suffixes
A stem can be extended by:
• Derivational suffixes
• Mood markers
• Tense markers
• Person/number suffixes
• Objective markers
Vág-at-ná-k
Cut-CAUS-COND-3Pl.Obj
‘they would have it cut’

Mood and tense in Hungarian
• Mood:
– Indicative: default (not marked)
– Conditional: suffixes (present), analytic form (past)
– Imperative: suffixes
• Tense:
– Present: default (not marked)
– Past: suffixes
– Future: analytic (auxiliary fog)

…and in English
• Mood:
– Indicative: default (not marked)
– Conditional: past tense forms + analytic forms (auxiliary would)
– Imperative: auxiliaries + grammatical structure
• Tense:
– Present: default (not marked)
– Past: suffix / irregular forms (suppletives or ablaut (vowel change))
– Future: analytic (auxiliary will)

Person & Number
• Hungarian: suffixes
– Fut-ok
– Fut-sz
– Fut-unk
– Fut-tok
– Fut-nak
• 3Sg is the default (not marked!)
• English: 3Sg + pronouns / obligatory subject
– I run
– You run
– He runs
– We run
– You run
– They run
• 3Sg marked!

Derivational suffixes in Hungarian
• Possibility/permission: fut-hat-ok run-MOD-1Sg ‘I may run’
• Reflexive: mos-akod-unk wash-REFL-1Pl ‘we wash ourselves’
• Frequentative: üt-öget-sz hit-FREQ-2Sg ‘you hit sg repeatedly’
• Causative: csinál-tat-nak do-CAUS-3Pl ‘they have sg done’

… and in English
• Possibility/permission: auxiliaries
• Reflexive: pronominal objects
• Frequentative: adverb
• Causative: construction

Hungarian vs. English - verbs
• Number of word forms: several hundred (HU) vs. 4-5 (EN)
• Means to express grammatical relations:
– Suffixes + auxiliaries (HU)
– Auxiliaries + reflexive pronouns + constructions (EN)
• A lot of syntactic information is encoded in Hungarian morphemes

Morphology | Syntax | English
Nominal suffix | verb-argument relation | word order, preposition
Nominal suffix | possessive | suffix, preposition
Verbal suffix | tense, agreement, modality, causation, aspect, reflexivity | suffix; pronoun, suffix; auxiliary; construction; pronoun

Morphosyntactic coding systems
• Language independent (?)
• Language dependent
• (Dis)advantages:
– comparability
– considering language-specific features
– complexity
• Different information is necessary for each language

Hungarian coding systems
• HUMOR
– recall Thursday Session 1
– in the Hungarian National Corpus
• MSD
– In Szeged Treebank
– Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc
• KR
– No database
– Parser and POS-tagger available at: http://mokk.bme.hu/resources/hunmorph/index_html and http://code.google.com/p/hunpos/

MSD
• Morphosyntactic Description
• International coding system:
– English
– Romanian
– Slovenian
– Czech
– Bulgarian
– Estonian
– Hungarian

MSD - 2
• Positional codes
• A given position encodes a given type of information
• Position 0: part-of-speech
• Position 1: (sub)type within POS
• Further positions: other grammatical information (person, number, case, etc.)
• Irrelevant positions are marked with a hyphen (-)
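The positional scheme above can be illustrated with a small sketch. It only relies on what the slide states (position 0 is the part of speech, position 1 is the subtype, a hyphen marks an irrelevant position); the function name `split_msd` is hypothetical, and the remaining positions are left as plain numbers because their meaning depends on the part of speech.

```python
# Names for the two positions described on the slide; all later positions
# stay numeric because their meaning is POS-specific.
POSITION_NAMES = {0: "pos", 1: "type"}


def split_msd(code):
    """Map each filled position of a positional MSD code to its value.

    Hyphens mark irrelevant positions and are skipped.
    """
    features = {}
    for position, value in enumerate(code):
        if value == "-":
            continue
        features[POSITION_NAMES.get(position, position)] = value
    return features


print(split_msd("Nc-sa---s1"))
print(split_msd("Vmip1s---n"))
```

Keeping unnamed positions numeric avoids guessing at the tagset documentation; a real decoder would look up one feature table per part of speech.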

KR
• Created for Hungarian
• Hierarchical attribute-value matrices
• Default values (3Sg, singular…)
• Derivational information is encoded
• Compounds are also segmented

MSD vs. KR
• Differences between the two systems:
– derivation
– compounds
• Harmonization efforts aim at a morphological parser (magyarlanc) whose output is fully consistent with the Szeged Treebank (Farkas et al. 2010)

Nouns in MSD
kutya | Nc-sn | ‘dog’
kutyámat | kutya | Nc-sa---s1 | ‘my dog-ACC’
kutyaházaikról | kutyaház | Nc-ph---p3 | ‘about their doghouses’
Obamához | Obama | Np-st | ‘to Obama’

Verbs in MSD
futok | fut | Vmip1s---n | ‘I run’
futhatsz | fut | Voip2s---n | ‘you can run’
ütögették | üt | Vfis3p---y | ‘they were hitting it’
csináltattunk | csinál | Vsis1p---n | ‘we had sg made’

Morphosyntactically annotated Hungarian corpora
• Hungarian National Corpus
– 100-million-word balanced reference corpus of present-day Hungarian
– Word forms automatically annotated for stem, part of speech and inflectional information
– http://corpus.nytud.hu/mnsz/index_eng.html
• Szeged Treebank
– 1 million words, 82K sentences
– Manually annotated for lemma, POS-tags
– Constituency and dependency trees
– http://www.inf.u-szeged.hu/rgai/nlp

Szeged Treebank
• Manually annotated treebank for Hungarian
– Covers various linguistic styles
• literature, newspapers, laws, student essays, computer books, etc.
• multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian
– Available free of charge for research
• Developed by
– University of Szeged, HLT group
– MorphoLogic Ltd.
– Academy of Sciences, Research Institute for Linguistics

Szeged Treebank 2.
• TEI XML format
• Manually annotated
– sentence split & word segmentation
– morphological analysis
– PTB-style syntactic structure
– Verb argument structure
– converted / extended to Dependency Grammar format manually

Szeged Treebank 3.
• Several versions
• Constituency and dependency versions
• Old MSD codes
• New (harmonized) MSD codes
• (Dependency) parser under development
• Being extended with folklore texts

Dependency vs. constituency
• Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees
• Constituency grammars are said to be good for languages with fixed word order
• Syntactic relations are determined
– by the position in the tree (constituency grammar)
– by dependency relations (labeled edges) (dependency grammar)

Constituency trees in SzT 2.0
• Based on generative syntax (É. Kiss et al. 1999)
• Syntactic features of Hungarian are also considered (i.e. not hardcore Chomskyan trees)
• Verb-argument relations are encoded by labels
• Very detailed information: a different grammatical role for each case suffix
• Semantic information can also be found (temporal and locative adverbials)

Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
‘Aggie received all of her relatives the day before yesterday.’

Dependency trees in Szeged Dependency Treebank
• Based on SzT 2.0
• Automatic conversion and manual correction
• Word forms are the nodes of the tree
• Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
• Semantic information kept
• Sentences without 3Sg copula are distinctively marked
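A small sketch may help show why labeled dependency edges suit free word order. It encodes two orderings of the deck's example sentence Péter szereti Marit as (head, label, dependent) edges and checks that both give the same edge set. The head assignments are hand-written for illustration and are not taken from the treebank; the ROOT label and the function name are assumptions.

```python
def dependency_edges(tokens):
    """Turn (form, head_index, label) triples into a set of labeled edges.

    A negative head_index marks the root of the sentence.
    """
    edges = set()
    for form, head_index, label in tokens:
        head = "ROOT" if head_index < 0 else tokens[head_index][0]
        edges.add((head, label, form))
    return edges


# 'Peter loves Mary' in two word orders; the case suffix, not the
# position, identifies the object (Marit = Mary-ACC).
svo = [("Péter", 1, "SUBJ"), ("szereti", -1, "ROOT"), ("Marit", 1, "OBJ")]
ovs = [("Marit", 1, "OBJ"), ("szereti", -1, "ROOT"), ("Péter", 1, "SUBJ")]

print(dependency_edges(svo) == dependency_edges(ovs))
```

The grammatical relations are identical in both orders; what differs is the information structure, which this edge representation does not capture.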

Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.

Virtual nodes
• No overt copula in present tense 3Sg
• Only subject and predicative noun/adjective manifest
• No syntactic structure in SzT (grammatical roles are not marked)
• Virtual nodes in SzDT

I like to go to school because it is good to be at school though not always.

Szeged Treebank vs. Szeged Dependency Treebank
• Labeled relations in both cases -> the contrast is not so sharp
• Virtual nodes in SzDT -> grammatical structure marked for every sentence (IE, MT)
• No word order constraints in SzDT
• Word forms are marked
• Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))

Language-specific morphosyntactic problems
• Morphology vs. syntax:
– Pseudo-subjects
– Pseudo-objects
– Pseudo-datives
• Morphological analysis of unknown words
• Lemmatization of named entities

Pseudo-subjects
• A noun in nominative is not the subject of the sentence -> special attention required when parsing
• Possessor:
  a kisfiú labdája
  the boy ball-3Sg.POSS
  ‘the boy’s ball’
• Predicative noun:
  István juhász maradt.
  Stephen shepherd remain-PAST
  ‘Stephen remained a shepherd.’
• Object:
  A kutyám kergeti a macska.
  The dog-POSS chase-3Sg.Obj the cat
  ‘The cat is chasing my dog.’ (garden path sentence)
  A fiam szereti a lányod.
  The son-1Sg.POSS love-3Sg.Obj the daughter-2Sg.POSS
  ‘My son loves your daughter’ or ‘Your daughter loves my son’

Solutions
• Possessor:
– SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
– SzDT: ATT relation
• Predicative noun: PRED relation
– Virtual node in SzDT
• Object: OBJ relation
– Sometimes contextual information is needed even for humans…

Pseudo-objects
Adverbials with an apparently accusative ending:
Futottam egy jót.
Run-PAST-1Sg a good-ACC
‘I have had a good run.’
Nagyot aludtam.
Big-ACC sleep-PAST-1Sg
‘I have slept a lot.’
Intransitive verbs -> cannot be an object -> MODE relation

Pseudo-datives
Not all (semantic) subjects are in nominative:
• Dative subject:
  Sándornak kell elrendeznie az ügyeket.
  Alexander-DAT must arrange-INF-3Sg the issue-PL
  ‘Alexander has to arrange the issues.’
• DAT in both corpora
• Certain auxiliaries with dative subjects (exceptions)
• Dative-nominative parallelism in possessive as well

Unknown words
• Unknown words can be:
– Compounds
– Named entities
– Derivations
• Examples: fémkapunk, félmillió, csokinyúl, NATO-hoz
• Methods for analysis (Zsibrita et al. 2010):
– Segmentation into two or more analyzable parts
– Expert rules to filter impossible combinations (*V+N)
– Analysis of the last part goes to the whole word
– Substitution for hyphenated words (pre-defined patterns for each morphological class)

félmillió
fél: ADJ ‘half’, NUM ‘half’, N ‘half’, V ‘be afraid’
millió: NUM ‘million’
Analysis: fél+millió Mc-snl
Expert rules: NUM + NUM; * non-NUM + NUM

fémkapunk
fém: N ‘metal’
kap: V ‘get’
kapu: N ‘gate’
unk: S 1Pl (verb)
nk: S 1Pl.Poss (noun)
Analyses: fém+kap+unk Vmip1p---n; fém+kapu+nk Nc-sn---p1
Expert rules: N+N; N-nonNOM + V; * N-NOM + V

csokinyúl
csoki: N ‘chocolate’
nyúl: N ‘rabbit’, V ‘stretch out’
kinyúl
Analyses: csoki+nyúl Vmip3s---n / Nc-sn; cso+kinyúl (?) Vmip3s---n
Expert rules: N+N; N-nonNOM + V; * N-NOM + V
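The segmentation-plus-filtering idea behind these examples can be sketched as follows. This is a simplified illustration, not the method of Zsibrita et al. (2010): it handles two-part compounds only, ignores suffixes and case, and uses a toy lexicon built from the entries shown on the slides. `LEXICON`, `ALLOWED_PAIRS` and `analyze_compound` are hypothetical names.

```python
# Toy lexicon: word -> possible parts of speech (entries taken from the slides).
LEXICON = {
    "fél": {"ADJ", "NUM", "N", "V"},
    "millió": {"NUM"},
    "csoki": {"N"},
    "nyúl": {"N", "V"},
}

# Simplified expert rules: only these POS pairs may form a compound.
# Pairs such as non-NUM + NUM or nominative N + V are rejected.
ALLOWED_PAIRS = {("NUM", "NUM"), ("N", "N")}


def analyze_compound(word):
    """Return (left, left_pos, right, right_pos) for every allowed split.

    Following the slides, the analysis of the last part (right_pos)
    is taken as the analysis of the whole word.
    """
    analyses = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left not in LEXICON or right not in LEXICON:
            continue
        for left_pos in sorted(LEXICON[left]):
            for right_pos in sorted(LEXICON[right]):
                if (left_pos, right_pos) in ALLOWED_PAIRS:
                    analyses.append((left, left_pos, right, right_pos))
    return analyses


print(analyze_compound("félmillió"))
print(analyze_compound("csokinyúl"))
```

With these rules the verbal reading of csokinyúl is filtered out and only the N+N reading survives, which matches the intent of the starred rule on the slide.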

NATO-hoz
NATO: ?
hoz: V ‘bring’, S ‘to’
Expert rules: N+-+S; N-nonNOM + - + V; * N-NOM + - + V; V+-+V
Analyses: NATO-hoz NATO: V Vmip3s---n; NATO-hoz (kalaphoz) NATO: N Np-st
Substitution: NATO- -> kalap ‘hat’
Ordering of rules: 1. substitution 2. segmentation

Lemmatization
• Lemmatization (i.e. dividing the word form into its root and affixes) is not a trivial task in morphologically rich languages such as Hungarian
• Common nouns: relying on a good dictionary
• NEs: cannot be listed
• Problem: the NE ends in an apparent suffix

Lemmatization of NEs
• Each ending that seems to be a possible suffix is cut off the NE in a step-by-step fashion:
  Citroenben (lemma)
  Citroen + ben ‘in (a) Citroen’
  Citroenb + en ‘on (a) Citroenb’
  Citroenbe + n ‘on (a) Citroenbe’
• Each possible lemma undergoes a Google and a Yahoo search; the most frequent one is chosen (Farkas et al. 2008)
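The cut-and-rank procedure can be sketched as below. The web searches are replaced by a plain dictionary lookup, and the counts in `TOY_COUNTS` are invented purely for illustration; they are not search results. The suffix list is limited to the endings that appear in the deck's Citroenben and Peugeot examples, and all function names are hypothetical.

```python
# Endings treated as possible suffixes (only those shown in the slides).
SUFFIXES = ["ben", "en", "n", "ot", "t"]

# Invented frequencies standing in for web search hit counts.
TOY_COUNTS = {
    "Citroen": 1000, "Citroenb": 1, "Citroenbe": 2, "Citroenben": 10,
    "Peugeot": 900, "Peugeo": 3, "Peuge": 1,
}


def candidate_lemmas(word_form, suffixes=SUFFIXES):
    """Return the word form itself plus every form left after cutting a suffix."""
    candidates = [word_form]
    for suffix in suffixes:
        if word_form.endswith(suffix) and len(word_form) > len(suffix):
            candidates.append(word_form[: -len(suffix)])
    return candidates


def choose_lemma(word_form, counts):
    """Pick the candidate with the highest frequency; unseen candidates count as 0."""
    return max(candidate_lemmas(word_form), key=lambda c: counts.get(c, 0))


print(candidate_lemmas("Citroenben"))
print(choose_lemma("Citroenben", TOY_COUNTS))
print(choose_lemma("Peugeot", TOY_COUNTS))
```

Including the unmodified word form among the candidates matters for names such as Peugeot, where the final t only looks like an accusative suffix.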

NLP applications
• NER
– NEs with suffixes
• Information extraction
– Modality, uncertainty
– Causation
• Machine translation
– Morphemes vs. structures

Named Entities
• NEs should be recognized
• They should be morphosyntactically tagged -> proper syntactic/semantic analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
• Mini dictionary + suffix list + semantic frame

Mini dictionary and suffix list (form | tag | gloss):
a | DET | the
ben | S | in
Citroenben | ? |
en | S | on
meghatározó | ADJ | dominant
n | S | on
ot | S | ACC
Peugeot | ? |
szerez | V | acquire
t | S | ACC
tulajdonrész | N | interest

Possible analyses
• Citroenben
– Citroenben
– Citroen + ben ‘Citroen-INE’
– Citroenb + en ‘Citroenb-SUP’
– Citroenbe + n ‘Citroenbe-SUP’
• Peugeot
– Peugeot
– Peugeo + t ‘Peugeo-ACC’
– Peuge + ot ‘Peuge-ACC’

A semantic frame
[1=V("szerez"|"vásárol"|"vesz"|"megvásárol"|"felvásárol")+subject=2+direct_object=3]
[2=N]
[3=N("részesedés"|"tulajdonrész"|"rész"|"tulajdonhányad")+compl1=4+modified_by_adj=5]
[4=N+case=ine+ceg]
[5=A+measure+modified_by_number=6]
[6=NB]

Analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
Tulajdonhányadot -> ACC/OBJ (3)
Citroenben -> INE (4)
Peugeot -> NOM/SUBJ (2)
‘Peugeot acquires a dominant interest in Citroen.’

Uncertainty
• Text Mining:
– derive facts from free text
– uncertainty and negation have an impact on the quality/nature of the information extracted
• Applications have to treat sentences / clauses containing uncertain or negated information differently from factual information
• Uncertainty: possible existence of a thing (neither its existence nor its non-existence is claimed)

Uncertainty detection
• Uncertainty detection in English: cues (words with uncertain content)
• One typical means to express uncertainty in Hungarian: -hat/het
  High school grades may influence health.
  A középiskolai jegyek kihathatnak az egészségre.
• Morphological analysis should reflect modality (Voip3s---n)

Causation
• Semantic/thematic relations to be determined properly
• AGENT != SUBJECT
  Varrattam egy ruhát.
  sew-CAUS-PAST-1Sg a dress-ACC
  ‘I had a dress sewn.’
  Varrattam Marival egy ruhát.
  sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
  ‘I had Mary sew a dress.’
  Varrtam Marival egy ruhát.
  sew-PAST-1Sg Mari-INS a dress-ACC
  ‘I sewed a dress with Mary.’
• Causative information should be encoded (Vsip3s---n)
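If the tagger output carries these codes, modality and causation can be read directly from position 1 of a verbal MSD code. The sketch below assumes the subtype letters seen in the deck's examples (m in futok, o in futhatsz, f in ütögették, s in csináltattunk); the English labels and the function names are assumptions made here, not part of the MSD specification.

```python
# Position-1 letters of verbal MSD codes as they appear in the slides;
# the labels are inferred from the example glosses.
VERB_SUBTYPES = {
    "m": "main",
    "o": "modal",
    "f": "frequentative",
    "s": "causative",
}


def verb_subtype(msd):
    """Return the subtype label of a verbal MSD code, or None otherwise."""
    if len(msd) < 2 or msd[0] != "V":
        return None
    return VERB_SUBTYPES.get(msd[1])


def is_uncertainty_cue(msd):
    """A -hat/-het verb form (modal subtype) is treated as an uncertainty cue."""
    return verb_subtype(msd) == "modal"


def needs_causative_frame(msd):
    """Causative verbs require an argument structure where AGENT != SUBJECT."""
    return verb_subtype(msd) == "causative"


print(verb_subtype("Voip3s---n"))
print(is_uncertainty_cue("Voip3s---n"))
print(needs_causative_frame("Vsip3s---n"))
```

An information extraction pipeline could use such flags to route clauses to an uncertainty handler or to a causative argument-structure rule before assigning thematic roles.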

Argument structure of causative verbs
Sentence | Agent | Beneficiary | Patient
Varrattam egy ruhát. | ? | I (NOM) | ruha (ACC)
Varrattam Marival egy ruhát. | Mari (INS) | |
Varrtam Marival egy ruhát. | I (NOM) + Mari (INS) | ? | ruha (ACC)

Machine translation
• Morpheme-based translation would be ideal
• Easier alignment of translational units
• Good morphological parser needed
• Easier to execute in dependency grammar
• Morpheme-based dependency structures

Alignments
at | varr | t | ruha
have | sewn | dress
ban | in
ház | house
am | my

Problems
• Not practical: no corpus available at the moment
• Portmanteau morphs: alignment problems
• Zero morphs: how many of them?
  3 zero morphs in Hungarian nouns:
  könyv-Ø-Ø-Ø vs. könyveit
  book-Ø-Ø-Ø book-POSS.PL-ACC
• (Mel’cuk 2006)

• Morphosyntactic codes might help
• Csinálhattátok Vois2p---y
• Reordering rules
Code positions: V o i s 2 p y
Morphemes: csinál | hat | t | tok | á
Glosses: do | can | PAST | you | it
‘you could do it’

An example
hat | csinál /| t á tok
can | do /| d Ø you
could / you do

Problem | Subtype | Solution
Syntax vs. case suffix | Pseudo-subject | Extra rules; PRED, OBJ difficult for humans
Syntax vs. case suffix | Pseudo-object | List of adverbs with accusative ending
Syntax vs. case suffix | Pseudo-dative | List of verbs with dative subject
Unknown words (lemmas+suffixes) | | Guessing (rules)
Information extraction | Thematic/semantic relations | Proper morphosyntactic codes + rules
Information extraction | Uncertainty detection | Proper morphosyntactic codes
Machine translation (morpheme-based) | | Proper morphosyntactic codes

Summary
• Syntax-morphology interface in Hungarian
• Morphological coding systems
• Syntactic annotation in Hungarian corpora
• Morphosyntactic problems:
– NER
– IE
– MT

References
É. Kiss K., Kiefer F., Siptár P.: Új magyar nyelvtan, Osiris Kiadó, Bp., 1999.
Farkas Richárd, Szeredi Dániel, Varga Dániel, Vincze Veronika 2010: MSD-KR harmonizáció a Szeged Treebank 2.5-ben. In: Tanács Attila, Vincze Veronika (szerk.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 349-353.
Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008: Web-based lemmatisation of Named Entities. In: Horák, Ales; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.): Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD 2008), Berlin, Heidelberg, Springer Verlag, LNCS 5246, pp. 53-60.
Koutny I., Wacha B.: Magyar nyelvtan függőségi alapon. Magyar Nyelv Vol. 87 No. 4. (1991) 393–404.
Mel’cuk, Igor 2006: Aspects of the Theory of Morphology. Mouton de Gruyter.
Prószéky, G., Koutny, I., Wacha, B.: Dependency Syntax of Hungarian. In: Maxwell, Dan; Klaus Schubert (eds.): Metataxis in Practice (Dependency Syntax for Multilingual Machine Translation), Foris, Dordrecht, The Netherlands (1989) 151–181.
Zsibrita János, Vincze Veronika, Farkas Richárd 2010: Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (szerk.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 275-283.