- Number of slides: 31
CPSC 503 Computational Linguistics
Lecture 3
Giuseppe Carenini
3/18/2018 CPSC 503 Winter 2009
NLP research at UBC
TOPICS (http://people.cs.ubc.ca/~rjoty/Webpage/)
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
• Subjectivity Detection, Domain Adaptation, Rhetorical Parsing
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP)
COLLABORATIONS: MSResearch
Linguistic Knowledge (English)
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
Formalisms and associated Algorithms
• State Machines (no prob.)
– Finite State Automata (and Regular Expressions)
– Finite State Transducers
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
– e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem
Today Sept 16
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
FST definition (Recap)
• $Q$: a finite set of states
• $I$, $O$: input and output alphabets (which may include ε)
• $\Sigma$: a finite alphabet of complex symbols i:o, with $i \in I$ and $o \in O$
• $q_0$: the start state
• $F$: a set of accept/final states ($F \subseteq Q$)
• A transition relation $\delta$ that maps $Q \times \Sigma$ to $2^Q$
E.g., $|Q| = 3$; $I = \{a, b, c, ε\}$; $O = \{a, b\}$; $|\Sigma| = ?$; $0 \le |\delta| \le ?$
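The definition above can be written down as a small data structure. This is a minimal sketch, not a reference implementation: the class name `FST`, the `add_arc` helper, and the use of the empty string for ε are assumptions made for illustration, and `sigma()` assumes that Σ contains every pair in I × O.

```python
from dataclasses import dataclass, field
from itertools import product

EPS = ""  # assumption: the empty string stands in for ε


@dataclass
class FST:
    states: set
    input_alphabet: set
    output_alphabet: set
    start: object
    finals: set
    # delta maps (state, (i, o)) to a set of next states, i.e. Q × Σ -> 2^Q
    delta: dict = field(default_factory=dict)

    def sigma(self):
        """All complex symbols i:o, assuming Σ is the full product I × O."""
        return set(product(self.input_alphabet, self.output_alphabet))

    def add_arc(self, src, i, o, dst):
        """Add one transition src --i:o--> dst."""
        self.delta.setdefault((src, (i, o)), set()).add(dst)


# The alphabets from the example on the slide; the two arcs are made up.
fst = FST(states={0, 1, 2},
          input_alphabet={"a", "b", "c", EPS},
          output_alphabet={"a", "b"},
          start=0,
          finals={2})
fst.add_arc(0, "a", "b", 1)
fst.add_arc(1, "c", "a", 2)
print(len(fst.sigma()))
```

Storing a set of target states per (state, symbol) pair keeps the structure nondeterministic, which matches a relation into $2^Q$.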
FST can be used as…
• Translators: input one string from $I$, output another from $O$ (or vice versa)
• Recognizers: input a string from $I \times O$
• Generator: output a string from $I \times O$
Terminology warning! E.g., if $I = \{a, b\}$; $O = \{a, b, ε\}$; ……
FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns; example arc label o:i]
Notes: X -> X:X; arc labels read lexical:surface
Examples
[Figure: two pairs of lexical and surface tapes; recoverable fragments: surface m i c e, lexical c a t +N +PL]
Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Ambiguity: more complex example
• What’s the right parse for Unionizable?
– Union-ize-able
– Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj……
Deal with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose…… (look at the neighboring words)
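Returning all paths can be sketched as a depth-first search over a nondeterministic transducer. The function `transduce_all`, the arc encoding, and the toy machine for "bought" are assumptions for illustration; the search assumes the machine has no cycles of ε-input arcs, otherwise it would not terminate.

```python
def transduce_all(arcs, start, finals, text):
    """Return every output string found on an accepting path that consumes text.

    arcs maps a state to a list of (input, output, next_state) triples;
    an empty input string plays the role of ε.
    """
    results = []
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in finals:
            results.append(out)
        for inp, outp, nxt in arcs.get(state, []):
            if inp == "":
                # ε input: move on without consuming a character
                stack.append((nxt, pos, out + outp))
            elif text.startswith(inp, pos):
                stack.append((nxt, pos + len(inp), out + outp))
    return sorted(results)


# Toy surface-to-lexical machine for the single word "bought".
# State 5 has two arcs on "t", which is where the ambiguity arises.
arcs = {
    0: [("b", "b", 1)],
    1: [("o", "u", 2)],
    2: [("u", "y", 3)],
    3: [("g", "", 4)],
    4: [("h", "", 5)],
    5: [("t", " +V +PAST", 6), ("t", " +V +PAST-PART", 6)],
}
analyses = transduce_all(arcs, 0, {6}, "bought")
print(analyses)
```

Both analyses come back without any ranking; choosing between them is left to a later step such as part-of-speech tagging.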
(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
Examples
• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed is added to a word ending with -y, -y changes to -ie or -i respectively (e.g., butterfly, try)
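The two rules can be sketched directly on strings before casting them as transducers. The function names `add_s` and `add_ed` are hypothetical, and the restriction of Y-replacement to a consonant followed by y is an added assumption (the slide only says "ending with -y"); all other spelling changes are ignored.

```python
import re


def add_s(word):
    """Attach -s, applying E-insertion and Y-replacement."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion: box -> boxes
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ies"      # Y-replacement: y -> ie before -s
    return word + "s"


def add_ed(word):
    """Attach -ed, applying Y-replacement only (a deliberate simplification)."""
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ied"      # Y-replacement: y -> i before -ed
    return word + "ed"


for w in ["kiss", "waltz", "bush", "watch", "box", "butterfly", "try", "cat"]:
    print(w, "->", add_s(w))
print("try", "->", add_ed("try"))
```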
Solution: Multi-Tape Machines
• Add an intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
– ^ morpheme boundary
– # word boundary
Multi-Level Tape Machines
[Figure: FST-1 followed by FST-2]
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST-1 with paths for some regular nouns and some irregular nouns; recoverable arc labels: +PL:^s#, o:i, +PL:^, ε:s, ε:#]
Example
[Figure: two pairs of lexical and intermediate tapes; the lexical tapes read f o x +N +PL and m o u s e +N +PL, the intermediate tapes are left blank]
FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2; recoverable arc label: #:ε]
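The effect of FST-2 in the intermediate-to-surface direction can be approximated with a regular-expression rewrite. This is a sketch of the rule's behaviour, not the transducer from the figure; the function name `intermediate_to_surface` is hypothetical.

```python
import re


def intermediate_to_surface(tape):
    """Apply E-insertion, then erase the boundary symbols ^ and #."""
    # Insert e when a stem ending in s, z, sh, ch or x meets the suffix s
    # at the end of the word: x^s# becomes xes#
    tape = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", tape)
    # Boundary symbols never appear on the surface tape
    return tape.replace("^", "").replace("#", "")


for tape in ["fox^s#", "cat^s#", "watch^s#"]:
    print(tape, "->", intermediate_to_surface(tape))
```

Anchoring the pattern on `^s#` keeps the rule from firing on other suffixes, so a tape such as `box^ing#` only loses its boundary symbols.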
Examples
[Figure: two pairs of intermediate and surface tapes; the intermediate tapes read f o x ^ s # and b o x ^ i n g #, the surface tapes are left blank]
Where are we?
[Figure not recovered]
Final Scheme: Part 1
[Figure not recovered]
Final Scheme: Part 2
[Figure not recovered]
Intersection(FST1, FST2) = FST3
• States of FST1 and FST2: $Q_1$ and $Q_2$
• States of the intersection: $Q_1 \times Q_2$
• Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
• Transitions of the intersection: $\delta_3$
For all $i, j, n, m, a, b$:
$\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff
– $\delta_1(q_{1i}, a{:}b) = q_{1n}$ AND
– $\delta_2(q_{2j}, a{:}b) = q_{2m}$
[Figure: an arc a:b from $q_{1i}$ to $q_{1n}$ in FST1 and an arc a:b from $q_{2j}$ to $q_{2m}$ in FST2 yield an arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
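The product construction for intersection can be sketched in a few lines. The dictionary encoding of the transition functions and the function name `intersect` are assumptions for illustration; each a:b pair is treated as one atomic symbol, and the machines are taken to be deterministic.

```python
def intersect(delta1, delta2):
    """Build the transitions of the intersection machine.

    Each delta maps (state, (a, b)) to a single next state.
    A paired state moves on a:b only if both machines move on a:b.
    """
    delta3 = {}
    for (q1, sym1), n1 in delta1.items():
        for (q2, sym2), n2 in delta2.items():
            if sym1 == sym2:
                delta3[((q1, q2), sym1)] = (n1, n2)
    return delta3


# Two made-up machines that share only the arc label a:b.
delta1 = {(0, ("a", "b")): 1, (1, ("c", "c")): 0}
delta2 = {("x", ("a", "b")): "y", ("y", ("a", "a")): "x"}
delta3 = intersect(delta1, delta2)
print(delta3)
```

Start and final states would be paired the same way: the start state is the pair of start states, and a pair is final when both components are final.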
Composition(FST1, FST2) = FST3
• States of FST1 and FST2: $Q_1$ and $Q_2$
• States of the composition: $Q_1 \times Q_2$
• Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
• Transitions of the composition: $\delta_3$
For all $i, j, n, m, a, b$:
$\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff there exists $c$ such that
– $\delta_1(q_{1i}, a{:}c) = q_{1n}$ AND
– $\delta_2(q_{2j}, c{:}b) = q_{2m}$
[Figure: an arc a:c from $q_{1i}$ to $q_{1n}$ in FST1 and an arc c:b from $q_{2j}$ to $q_{2m}$ in FST2 yield an arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
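Composition follows the same product pattern, except that the output symbol of the first machine must match the input symbol of the second. The encoding and the function name `compose` are assumptions for illustration; ε is treated as an ordinary symbol, which sidesteps the extra care that ε arcs need in a full implementation.

```python
def compose(delta1, delta2):
    """Build the transitions of the composed machine.

    Each delta maps (state, (input, output)) to a single next state.
    The result maps ((q1, q2), (a, b)) to a set of next state pairs,
    because different intermediate symbols c may lead to different targets.
    """
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            if c1 == c2:  # the shared intermediate symbol c
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


# Made-up machines: the first rewrites a to x, the second rewrites x to b.
delta1 = {(0, ("a", "x")): 1}
delta2 = {("p", ("x", "b")): "q", ("p", ("y", "b")): "r"}
delta3 = compose(delta1, delta2)
print(delta3)
```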
FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics, and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have $10^5$ to $10^7$ states and arcs
• Spanish (1996): $46 \times 10^3$ stems; $3.4 \times 10^6$ word forms
• Arabic (2002?): $131 \times 10^3$ stems; $7.7 \times 10^6$ word forms
Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
– Finding word boundaries in text (?!) …maxmatch
– Finding sentence boundaries: punctuation… but "." is ambiguous (look at the example in Fig. 3.22)
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11)
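The maxmatch idea for finding word boundaries can be sketched as a greedy longest-match loop. The function name `max_match`, the toy lexicon, and the fallback of emitting a single character when nothing matches are assumptions for illustration.

```python
def max_match(text, lexicon):
    """Greedy segmentation: repeatedly take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink from the right
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character and move on
        tokens.append(text[i:j])
        i = j
    return tokens


lexicon = {"the", "theta", "table", "bled", "own", "down", "there"}
tokens = max_match("thetabledownthere", lexicon)
print(tokens)
```

On this input the greedy choice picks "theta" first and therefore misses the intended reading "the table down there", which shows why the slide flags the approach with "(?!)".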
Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
– e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem
Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
• (condition) S1 -> S2
– ATIONAL -> ATE (relational -> relate)
– (*v*) ING -> ε, if the stem contains a vowel (motoring -> motor)
• Cascade of rules applied to: computerization
– ization -> -ize: computerize
– ize -> ε: computer
• Errors occur:
– organization -> organ, university -> universe
Code freely available in most languages: Python, Java, …
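A cascade of rewrite rules of the form (condition) S1 -> S2 can be sketched as follows. This is not the Porter algorithm: it contains only the four rules quoted on the slide, applies each at most once in a fixed order, and omits Porter's measure conditions. The names `RULES`, `has_vowel`, and `stem` are hypothetical.

```python
import re


def has_vowel(stem_part):
    """The (*v*) condition: the remaining stem contains a vowel."""
    return re.search(r"[aeiou]", stem_part) is not None


# (suffix S1, replacement S2, optional condition on the remaining stem)
RULES = [
    ("ational", "ate", None),
    ("ization", "ize", None),
    ("ize", "", None),
    ("ing", "", has_vowel),
]


def stem(word):
    """Apply the rules in order; each rule sees the previous rule's output."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            if condition is None or condition(base):
                word = base + replacement
    return word


for w in ["relational", "motoring", "sing", "computerization", "organization"]:
    print(w, "->", stem(w))
```

Because the rules are blind to meaning, the cascade that correctly reduces "computerization" also reduces "organization" to "organ", the kind of error the slide mentions.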
Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents
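The three steps can be sketched with a toy stemmer and a set-overlap score. Both are assumptions for illustration: `toy_stem` is a crude suffix stripper standing in for a real stemmer, and Jaccard overlap is only one of many possible similarity measures, not one named on the slide.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strip -ing or -s from longer words."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: the set of stems in a document or a query."""
    return {toy_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets."""
    q, d = stems(query), stems(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


score = similarity("motors", "motoring holidays")
print(score)
```

Without stemming the query and the document share no word at all; after stemming both contain "motor".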
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer
Next Time
• Read handout
– Probability
– Stats
– Information theory
• Next Lecture:
– Finish Chpt. 3, 3.10-11
– Start Probabilistic Models for NLP (Chpt. 4, 4.1-4.2 and 5.9!)


