Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 27, 2005

Course Outline
July 18: Intro to computational morphology; XFST
  Readings:
    Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    Karttunen and Beesley, “25 Years of Finite-State Morphology”
    Chapter 1: “Gentle Introduction” (B&K)
July 20: Regular expressions; more on XFST
  Readings:
    Chapter 2: “Systematic Introduction”
    Chapter 3: “The XFST interface”
July 25: More on XFST: Date Parser; concatenative morphotactics: the LEXC language
  Readings:
    Chapter 4: “The LEXC Language”
July 27: Constraining non-local dependencies: Flag Diacritics; complex morphotactics and alternations: Finnish Numerals
  Readings:
    Chapter 5: “Flag Diacritics”
August 1: Non-concatenative morphotactics: reduplication, interdigitation; realizational morphology
  Readings:
    Chapter 8: “Non-Concatenative Morphotactics”
    Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001. (An excerpt)
    Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003.
August 3: Optimality theory
  Readings:
    Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
    Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Syllabification revisited
define MarkNonDiphthongs [
  [. .] -> "." || [HighV | MidV] _ LowV ,              # i.a, e.a
                  LowV _ MidV ,                        # a.e
                  i _ [MidV - e] ,                     # i.o, i.ä
                  u _ [MidV - o] ,                     # u.e
                  y _ [MidV - ö] ,                     # y.e
                  $V i _ e ,                           # poiki.en
                  V u _ o ,                            #
                  $V y _ ö ,                           #
                  $V [MidV | LowV] _ [u|y] C [C | .#.] # oike.us
];
define Syllabify [ C* V+ C* @-> ... "." || _ C V ];
regex FinnWords .o. MarkNonDiphthongs .o. Syllabify;

Constraints
[Figure: Esperanto lexicon network; arc labels include MF+, ge, mal, ne, hund, bon, eg, et, ec, in, +Fem, o, a, j, +Pl, n]
MF%+ => _ ~$[%+Fem] %+Pl ;

Constraining by composition
xfst[0]: read lexc < adj-noun-tags.lexc
Root...2, Nouns...2, NounRoots...4, Nmf...5, ..
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.
xfst[1]: up gehundino
MF+hund+Noun+Fem+Sg
xfst[1]: regex "MF+" => _ ~$["+Fem"] "+Pl" ;
1.2 Kb, 2 states, 7 arcs, Circular
xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
xfst[1]: up gehundino
*** Not accepted ***
Fewer words, bigger network.

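The effect of the restriction "MF+" => _ ~$["+Fem"] "+Pl" can be illustrated outside xfst. The following is a minimal Python sketch, not the xfst implementation: it assumes an analysis has already been split into a list of symbols, and the function name satisfies_mf_constraint is hypothetical. It checks that every MF+ is followed by +Pl with no +Fem in between, which is what the composed filter enforces on the lexical side.

```python
def satisfies_mf_constraint(symbols):
    """Check the restriction  "MF+" => _ ~$["+Fem"] "+Pl"  on a symbol list.

    Every occurrence of MF+ must be followed, later in the string, by +Pl,
    and no +Fem may occur between that MF+ and the +Pl.
    """
    for i, sym in enumerate(symbols):
        if sym != "MF+":
            continue
        licensed = False
        for following in symbols[i + 1:]:
            if following == "+Fem":
                break  # +Fem reached before any +Pl: context not satisfied
            if following == "+Pl":
                licensed = True
                break
        if not licensed:
            return False
    return True


# The analysis rejected in the session above (assumed tokenization).
print(satisfies_mf_constraint(["MF+", "hund", "+Noun", "+Fem", "+Sg"]))
# A plural, non-feminine analysis passes the filter.
print(satisfies_mf_constraint(["MF+", "hund", "+Noun", "+Pl"]))
```

Composition compiles this check into the network itself, which is why the constrained network accepts fewer words but has more states and arcs.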
Esperanto with Flags
Multichar_Symbols +Noun +Adj +Nsuff +ASuff +Nize +Pl +Sg +Acc
  MF+ +Aug +Dim +Fem Op+ Neg+ @U.MF.Yes@ @U.MF.No@

LEXICON Root
Nouns ;
Adjectives ;

LEXICON Nouns
NounRoots ;
@U.MF.Yes@ Ge ;

LEXICON Ge
MF+:ge NounRoots ;

LEXICON NounRoots
bird Nmf ;
hund Nmf ;
kat Nmf ;

LEXICON Nmf
+Noun:0 AugDimFem ;

LEXICON AugDimFem
@U.MF.No@ Fem ;
+Dim:et AugDimFem ;
+Aug:eg AugDimFem ;
Nend ;
Adjend ;

LEXICON Fem
+Fem:in AugDimFem ;

Constraining by flags
xfst[0]: read lexc < esperanto-flags.lexc
xfst[1]: up gehundino
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
xfst[1]: set obey-flags off
variable obey-flags = off
xfst[1]: up gehundino
MF+hund+Noun+Fem+NSuff+Sg
xfst[1]: set show-flags on
variable show-flags = on
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@

Flags in the sigma
xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +Nsuff +Nize +Noun +Pl +Sg @U.MF.No@ @U.MF.Yes@
Size: 35
@U.MF.Yes@: UNIFY feature 'MF' with value 'Yes'
@U.MF.No@: UNIFY feature 'MF' with value 'No'
2 flag diacritics

Eliminating flags
xfst[1]: eliminate flag MF
3.2 Kb. 61 states, 89 arcs, Circular
Size: 35
xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize +Noun +Pl +Sg
Size: 33
The eliminate flag command composes the network with constraint networks that have the same effect as the flag diacritics that are removed.

Flag Diacritics
Special symbols for encoding features, that is, attribute-value pairs.
Checked at runtime to avoid the cost of compiling them into the structure of the network.
If a check fails, the path is abandoned.

Attributes and Values
Epsilon arcs with feature constraints.
@U.Feature.Value@   Unify ‘Feature’ with ‘Value’ if possible.
@C.Feature@         Set ‘Feature’ to the unspecified value.

Rules
There can be any number of attributes.
An attribute can have any number of values.
If the value of an attribute is unspecified, it unifies successfully with any given value and is set to that value.
If the value of an attribute is specified, it unifies only with the given value.

Actions: Unify, Positive Set
@U.Feature.Value@   Unify Value with the current setting of Feature, if possible. Otherwise fail.
@P.Feature.Value@   Set Feature to Value regardless of the current setting. Always succeeds.

More Actions: Negative Set, Clear
@N.Feature.Value@   Set Feature to the complement of Value regardless of the current setting. Always succeeds.
@C.Feature@         Make Feature be unspecified. Always succeeds.

More Actions: Require
@R.Feature.Value@   Succeed if Feature is set to Value. Otherwise fail.
@R.Feature@         Succeed if Feature has been set to some value. Otherwise fail.

More Actions: Equality
@E.Feature1.Feature2@   Succeed if Feature1 has the same value as Feature2. Otherwise fail.

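The actions listed on these slides can be simulated with a small interpreter. This is a rough Python sketch under stated assumptions, not the xfst runtime: the function name path_is_valid and the internal representation of settings are hypothetical. Two behaviours are assumptions not spelled out on the slides: a Unify against a negatively set feature succeeds for any other value, and an Equality test between two unspecified features succeeds.

```python
import re

# Matches @U.F.V@, @P.F.V@, @N.F.V@, @C.F@, @R.F.V@, @R.F@, @E.F1.F2@
FLAG_RE = re.compile(r"@([UPNCRE])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Return True if every flag diacritic along the path succeeds."""
    # feature name -> (is_positive, value); a missing key means "unspecified"
    features = {}
    for sym in symbols:
        match = FLAG_RE.fullmatch(sym)
        if match is None:
            continue  # ordinary symbol, no constraint to check
        action, feat, arg = match.groups()
        current = features.get(feat)
        if action == "U":
            if current is None:
                features[feat] = (True, arg)
            elif current[0]:
                if current[1] != arg:
                    return False
            else:
                # Assumption: a negative setting unifies with any other value.
                if current[1] == arg:
                    return False
                features[feat] = (True, arg)
        elif action == "P":
            features[feat] = (True, arg)
        elif action == "N":
            features[feat] = (False, arg)
        elif action == "C":
            features.pop(feat, None)
        elif action == "R":
            if arg is None:
                if current is None:
                    return False
            elif current != (True, arg):
                return False
        elif action == "E":
            # Assumption: two unspecified features count as equal.
            if features.get(feat) != features.get(arg):
                return False
    return True


# The gehundino path from the Esperanto example: Yes then No fails to unify.
print(path_is_valid(["@U.MF.Yes@", "g", "e", "h", "u", "n", "d",
                     "@U.MF.No@", "i", "n", "o"]))
```

With obey-flags on, a lookup routine would abandon any path for which such a check fails; with obey-flags off, the flags are treated as plain epsilons.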
Eliminating flags
The constraints on "@U.FEATURE.VALUE@" have the form
  ~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]
Constraint for eliminating @U.MF.No@:
  ~[?* ["@U.MF.Yes@"]                  # prohibit
       ~$["@P.MF.No@" | "@C.MF@"]      # allow
       "@U.MF.No@" ?*]

Finnish Numerals
Numbers and Numerals
The mapping from integers 0, 1, 2, 3 … to the corresponding numerals one, two, three … is a regular relation.
Some languages have a very simple numeral system, some are more complicated: seventy-three, soixante-treize, drei-und-siebzig.
We can compile transducers that map between the numbers and the corresponding numerals.

Number-to-Numeral transducer
[Figure: transducer with arrows labeled Analysis and Generation]
105 : hundred five, hundred and five, one hundred and five

The Goal Ahead: Finnish
[Figure: transducer with arrows labeled Analysis and Generation]
105+Sg+Gen : sadanviiden (hundred and five, Sg Gen)
28+Ord+Pl+Gen : kahdensienkymmenensienkahdeksansien (twenty-eighth, Pl Gen)

Finnish Numerals
Express ordinality, number, and case
  sata+Sg+Nom       sata        (100)
  sata+Ord+Sg+Nom   sadas       (100th)
  sata+Sg+Gen       sadan       (100)
  sata+Ord+Sg+Gen   sadannen    (100th)
  sata+Pl+Gen       satojen     (100)
  sata+Ord+Pl+Gen   sadansien   (100th)
Compound numerals written as one word
  2 • 1000 + 5 • 100 + 3 • 10 + 1
  kaksituhattaviisisataakolmekymmentäyksi = 2531

Singular vs. Plural
Numerals generally occur with singular nouns
  kaksi+Sg+Gen kenkä+Sg+Gen
  kahden kengän omistaja (owner of two shoes)
Sets and public events may be in plural
  kaksi+Pl+Gen kenkä+Pl+Gen
  kaksien kenkien omistaja (owner of two pairs of shoes)
  kolme+Ord+Pl+Nom olympialainen+Pl+Nom
  kolmannet olympialaiset (third olympic games)
  yksi+Pl+Nom hää+Pl+Nom
  yhdet häät (one wedding)

Morphotactics
All parts of compound numerals agree in all respects
two thousand five hundred (2500)
  kaksi+Sg+Gen tuhat+Sg+Gen viisi+Sg+Gen sata+Sg+Gen
  kahden tuhannen viiden sadan
two ten eighth (28th)
  kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
  kahde ns i en   kymmene ns i en   kahdeksa ns i en

Singular nominative is exceptional
Numeral with a noun
  kaksi+Gen kenkä+Gen
  kahden kengän (two shoes)
  kaksi+Nom kenkä+Part
  kaksi kenkää (two shoes)
Compound numeral
  kaksi+Gen tuhat+Gen viisi+Gen sata+Gen kolme+Gen (2503)
  kahden tuhannen viiden sadan kolmen
  kaksi+Nom tuhat+Part viisi+Nom sata+Part kolme+Nom (2503)
  (kaksi • tuhatta) + (viisi • sataa) + kolme

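The nominative pattern above (multiplier in the nominative, the power of ten in the partitive) can be sketched procedurally. The following Python function is only an illustration of that arithmetic for cardinals from 1 to 9999, not the finite-state solution developed in these slides; the name finnish_cardinal is hypothetical, and the handling of 11 to 19 with -toista is added from general knowledge of Finnish, since the slides do not cover it.

```python
UNITS = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
         "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]

# (value, nominative form, partitive form used after a multiplier of 2..9)
POWERS = [
    (1000, "tuhat", "tuhatta"),
    (100, "sata", "sataa"),
    (10, "kymmenen", "kymmentä"),
]


def finnish_cardinal(n):
    """Spell a cardinal 1..9999 in the singular nominative, as one word."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 are handled in this sketch")
    parts = []
    for value, nominative, partitive in POWERS:
        digit, n = divmod(n, value)
        if value == 10 and digit == 1 and n > 0:
            # 11..19: unit + "toista" (not covered on the slides)
            parts.append(UNITS[n] + "toista")
            n = 0
        elif digit == 1:
            parts.append(nominative)  # bare power: tuhat, sata, kymmenen
        elif digit > 1:
            parts.append(UNITS[digit] + partitive)  # e.g. kaksi + tuhatta
    if n:
        parts.append(UNITS[n])
    return "".join(parts)


print(finnish_cardinal(2531))  # kaksituhattaviisisataakolmekymmentäyksi
print(finnish_cardinal(2503))  # kaksituhattaviisisataakolme
```

A procedure like this covers only one cell of the paradigm. The other cases, the plural, and the ordinals all require agreement across every part of the compound, which is what motivates the lexicon-plus-rules approach that follows.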
Morphological Alternations
Semiregular stem alternations (one)
  yksi+Sg+Nom  : yksi
  yksi+Sg+Ess  : yhtenä
  yksi+Sg+Gen  : yhden
  yksi+Sg+Part : yhtä
  yksi+Pl+Gen  : yksien
Irregular stem alternations
  yksi+Ord+Sg+Nom : ensimmäinen (first)
Regular suffix alternations
  Vowel harmony:  kolme+Sg+Part : kolmea  vs.  neljä+Sg+Part : neljää
  Illative vowel: kolme+Sg+Ill : kolmeen  vs.  neljä+Sg+Ill : neljään
  Partitive t:    yksi+Sg+Part : yhtä  vs.  neljä+Sg+Part : neljää

Solution for Finnish
Numbers/Finnish Transducer
  Maps a number with morphological tags into an inflected Finnish numeral. Encodes morphotactic constraints.
.o.
lexc source lexicon
  Looping lexicon with all the forms of all Finnish single numerals concatenated in all possible ways. Composed with morphophonological rules.

Example
Input: 2 5 +Ord +Pl +Gen
Numbers/Finnish Transducer
  kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen
.o.
lexc source lexicon
  kaksi +Pl +Nom kymmenen +Part VIISI +Ord +Gen
  kahdet kymmentä viidennen (ungrammatical)
  kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen
  kahdensien kymmenensien viidensien

Sublexicon for One
LEXICON Yksi
YKSI+Sg:yksi Nom;                   ! singular nominative
YKSI+Sg:yhde WeakGrade;             ! weak stem (most cases)
YKSI+Sg:yhte StrongGrade;           ! strong stem (essive, ill.)
YKSI+Sg:yht Par;                    ! partitive stem
YKSI:yks PlStem1;                   ! plural stem
YKSI+Ord1+Sg:ensimmäinen Nom;       ! singular nominative
YKSI+Ord1+Sg:ensimmäise AnyGrade;   ! weak/strong stem
YKSI+Ord1+Sg:ensimmäis Par;         ! partitive stem
YKSI+Ord+Sg:yhdes Nom;              ! singular nominative
YKSI+Ord+Sg:yhdenne WeakGrade;      ! weak stem
YKSI+Ord+Sg:yhdente StrongGrade;    ! strong stem
YKSI+Ord+Sg:yhdet Par;              ! partitive stem
YKSI+Ord:yhdens PlStem1;            ! plural stem

Some sublexicons
LEXICON WeakGrade
SgGen;         ! Singular Genitive
PlNom;         ! Plural Nominative
InvarWeak;     ! Invariant (plural and singular) cases

LEXICON InvarWeak
+Tra:ksi Next;   ! Translative “into”
+Ine:ssA Next;   ! Inessive “in”
+Ela:stA Next;   ! Elative “from” (inside)
+Ade:llA Next;   ! Adessive “on”
+Abl:ltA Next;   ! Ablative “from” (outside)
+All:lle Next;   ! Allative “onto”
+Abe:ttA Next;   ! Abessive “without”

Sample paths for Two
kaksi+Sg+Nom        kaksi
kaksi+Sg+Gen        kahde n
kaksi+Sg+Ess        kahte na
kaksi+Sg+Par        kah TA
kaksi+Pl+Gen        kaks i en
kaksi+Pl+Ill        kaks i Vn
kaksi+Ord+Sg+Nom    kahde s
kaksi+Ord1+Sg+Nom   toinen
kaksi+Ord+Sg+Ill    kahde nte Vn
kaksi+Ord1+Sg+Ill   toise Vn

Morphophonological rules
define BackV [a | o | u];
define FrontV [ä | ö | y];
define Vow [BackV | FrontV | i | e];
define VHarmony [A -> a || BackV ~$[FrontV] _ .o. A -> ä];
define IllativeV [V -> a || a (h) _ ,
                  V -> e || e (h) _ ,
                  … ];
define PartitiveT [T -> 0 || Vow _ ];

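To see what VHarmony and PartitiveT do to a lexical string, the two rules can be imitated with plain string processing. This is a hedged Python sketch, not compiled replace rules: the function names are hypothetical, contexts are checked against the input string, and the final step that spells a surviving T as t is an assumption based on the sample path "kah TA" rather than a rule shown on the slide.

```python
BACK_V = set("aou")
FRONT_V = set("äöy")
VOW = BACK_V | FRONT_V | set("ie")


def apply_partitive_t(form):
    """T -> 0 || Vow _  : delete T when the preceding input symbol is a vowel."""
    out = []
    for i, ch in enumerate(form):
        if ch == "T" and i > 0 and form[i - 1] in VOW:
            continue
        out.append(ch)
    return "".join(out)


def apply_vowel_harmony(form):
    """A -> a || BackV ~$[FrontV] _ .o. A -> ä"""
    out = []
    for i, ch in enumerate(form):
        if ch != "A":
            out.append(ch)
            continue
        # Scan leftwards: the nearest harmonic vowel decides the outcome.
        back_context = False
        for prev in reversed(form[:i]):
            if prev in FRONT_V:
                break
            if prev in BACK_V:
                back_context = True
                break
        out.append("a" if back_context else "ä")
    return "".join(out)


def surface(form):
    """Apply PartitiveT, then VHarmony, then spell leftover T as t (assumption)."""
    return apply_vowel_harmony(apply_partitive_t(form)).replace("T", "t")


print(surface("kolmeTA"))  # kolmea
print(surface("neljäTA"))  # neljää
print(surface("kahTA"))    # kahta
```

In xfst the same effect comes from composing the rule transducers under the lexicon, so no procedural ordering code is needed at lookup time.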
Example again
Input: 2 5 +Ord +Pl +Gen
Numbers/Finnish Transducer
  KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen
.o.
lexc source lexicon .o. morphophonological rules
  KAKSI +Pl +Nom KYMMENEN +Part VIISI +Ord +Gen
  kahdet kymmentä viidennen (ungrammatical)
  KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen
  kahdensien kymmenensien viidensien

Remaining problems
Special ordinals for yksi (one), kaksi (two)
  ensimmäinen (1st) vs. kahdeskymmenesyhdes (21st)
  Compose the lexicon with an appropriate filter to eliminate unwanted variants.
No internal tags
  2+Sg+Gen 00+Sg+Gen
  Delete them: 0 <- Tag || _ $[Tag Tag+] .#. ;
Singular nominative as partitive in compounds
  %+Nom -> %+Par // %+Sg %+Nom ~$Tag %+Sg _ ;
Ordinal/Plural/Case agreement
  Flag diacritics!

Flags for Finnish numerals
@U.Type.Card@ @U.Type.Ord@
@U.Number.Sg@ @U.Number.Pl@
@U.Case.Nom@ @U.Case.Ess@ @U.Case.Ill@ @U.Case.Com@ @U.Case.Gen@ @U.Case.Par@
@U.Case.Tra@ @U.Case.Abe@ @U.Case.Ine@ @U.Case.Ela@ @U.Case.Ade@ @U.Case.Abl@
@U.Case.All@ @U.Case.Ins@
Example: 300+Sg+Gen : kolmensadan
  3 00 +Sg +Gen
  @U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@
  kolmensadan

Conclusion
Mapping from numbers to numerals can be done in a simple and elegant way even for languages with complex morphology.
Necessary for text-to-speech applications.
Tervetuloa kahdensienkymmenensienkahdeksansien olympialaisten avajaisiin!
Welcome to the opening ceremonies of the 28th Olympic Games!

Demo!