CPSC 503 Computational Linguistics
Lecture 3
Giuseppe Carenini
3/18/2018 CPSC 503 Winter 2009

NLP research at UBC
TOPICS (http://people.cs.ubc.ca/~rjoty/Webpage/)
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
• Subjectivity Detection, Domain Adaptation, Rhetorical Parsing
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP)
COLLABORATIONS: MSResearch

Linguistic Knowledge (English)
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
Formalisms and associated Algorithms
• State Machines (no prob.)
  – Finite State Automata (and Regular Expressions)
  – Finite State Transducers
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners

Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem

Today Sept 16
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)

FST definition (Recap.)
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
• q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q × Σ to 2^Q
E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 <= |δ| <= ?
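
The definition above can be written down as a small data structure. This is only a sketch under two assumptions that the slide does not state: Σ is taken to be every pair in I × O, and |δ| is read as the number of individual (state, symbol, next state) triples. The names `SIGMA`, `delta` and `num_transitions` are illustrative.

```python
from itertools import product

EPS = ""  # stands for ε

# Sizes from the slide's example: |Q| = 3, I = {a, b, c, ε}, O = {a, b}
Q = {0, 1, 2}
I = {"a", "b", "c", EPS}
O = {"a", "b"}

# Assumption: Σ contains every pair i:o with i in I and o in O.
SIGMA = set(product(I, O))

# δ maps (state, i:o) to a subset of Q; pairs not listed map to the empty set.
# The arcs themselves are made up for illustration.
delta = {
    (0, ("a", "b")): {1},
    (1, ("c", "a")): {1, 2},
    (1, (EPS, "b")): {2},
}


def num_transitions(delta):
    """Count individual (state, symbol, next_state) triples."""
    return sum(len(targets) for targets in delta.values())


# Upper bound on the number of triples under the assumptions above.
max_transitions = len(Q) * len(SIGMA) * len(Q)
```

Storing δ as a dictionary of sets mirrors the "Q × Σ to 2^Q" signature directly, so nondeterminism needs no special handling.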

FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I × O
• Generator: output a string from I × O
Terminology warning! E.g., if I = {a, b}; O = {a, b, ε}; ……
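
A minimal sketch of the translator and recognizer roles, using a made-up three-arc FST. The arcs, `translate` and `recognize` are hypothetical names for illustration. The recognizer here takes two strings rather than a single string of i:o pairs, which is a simplification.

```python
EPS = ""  # stands for ε

# Hypothetical toy FST: arcs are (state, input_symbol, output_symbol, next_state).
ARCS = [
    (0, "a", "b", 1),
    (1, "b", "a", 1),
    (1, "b", EPS, 2),
]
START, FINAL = 0, {1, 2}


def translate(word):
    """Translator: return every output string the FST pairs with `word`.

    Assumes the FST has no cycle of ε-input arcs (true for ARCS above),
    otherwise this search would not terminate.
    """
    results = set()
    stack = [(START, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in FINAL:
            results.add(out)
        for src, i, o, dst in ARCS:
            if src != state:
                continue
            if i == EPS:
                stack.append((dst, pos, out + o))
            elif pos < len(word) and word[pos] == i:
                stack.append((dst, pos + 1, out + o))
    return results


def recognize(word, out):
    """Recognizer: accept the pair (word, out) if the FST relates them."""
    return out in translate(word)
```

Because the FST is nondeterministic, `translate` returns a set: one input may map to several outputs.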

FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns; arcs such as o:i are labeled lexical:surface]
Notes: X -> X:X

Examples
lexical: ___
surface: m i c e
lexical: c a t +N +PL
surface: ___

Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies

Ambiguity: more complex example
• What’s the right parse for Unionizable?
  – Union-ize-able
  – Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj……

Deal with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose…… look at the neighboring words
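
"Return all paths without choosing" can be sketched as an exhaustive segmentation of the word against a morpheme inventory. This is not an FST, only a stand-in with the same behaviour on the unionizable example. The inventory and the function name are hypothetical, and "iz" is listed as the surface form of -ize because the final e is dropped before -able.

```python
# Hypothetical toy inventory of surface morphemes, just enough for the example.
MORPHEMES = {"un", "ion", "union", "iz", "able"}


def all_segmentations(word, morphemes=MORPHEMES):
    """Return every way to split `word` into known morphemes."""
    if word == "":
        return [[]]  # one parse: the empty sequence
    parses = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in morphemes:
            for rest in all_segmentations(word[end:], morphemes):
                parses.append([prefix] + rest)
    return parses
```

Calling `all_segmentations("unionizable")` yields both parses; choosing between them is left to a later stage, as the slide says.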

(2) Spelling Changes
When morphemes are combined inflectionally the spelling at the boundaries may change
Examples
• E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when –s or -ed are added to a word ending with a –y, -y changes to –ie or –i respectively (e.g., butterfly, try)
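
The two rules above, written as plain string functions. This is a sketch that follows the slide's wording literally. `add_s` and `add_ed` are illustrative names, and the y-rule as stated does not check for a preceding consonant, so it would over-apply to words such as "boy".

```python
import re


def add_s(word):
    """Attach -s, applying E-insertion and Y-replacement as stated above."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion: kiss -> kisses
    if word.endswith("y"):
        return word[:-1] + "ies"      # Y-replacement: -y -> -ie before -s
    return word + "s"


def add_ed(word):
    """Attach -ed, applying Y-replacement as stated above."""
    if word.endswith("y"):
        return word[:-1] + "ied"      # Y-replacement: -y -> -i before -ed
    return word + "ed"
```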

Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
  – ^ morpheme boundary
  – # word boundary

Multi-Level Tape Machines
[Figure: FST-1 and FST-2 connecting the lexical, intermediate and surface tapes]
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape

FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST with paths for some regular nouns and some irregular nouns; arc labels include o:i, #, and +PL:^s#, which expands to the sequence +PL:^ ε:s ε:#]

Example
lexical: f o x +N +PL
intermediate: ___
lexical: m o u s e +N +PL
intermediate: ___
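
A lookup-based sketch of what FST-1 computes in the lexical-to-intermediate direction. It is not a transducer, and the two small lexicons are hypothetical. Writing the irregular plural directly on the intermediate tape with no ^ boundary is an assumption about how the irregular path is labeled.

```python
# Hypothetical sample lexicons.
REGULAR_NOUNS = {"fox", "cat", "dog"}
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(lexical):
    """Map a lexical string such as 'fox +N +PL' to its intermediate form."""
    stem, *tags = lexical.split()
    if tags != ["+N", "+PL"]:
        raise ValueError(f"unsupported tags: {tags}")
    if stem in REGULAR_NOUNS:
        return stem + "^s#"           # +PL:^s# on the regular path
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem] + "#"
    raise ValueError(f"unknown noun: {stem}")
```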

FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST for the E-insertion rule; arc labels include #:ε]

Examples
intermediate: f o x ^ s #
surface: ___
intermediate: b o x ^ i n g #
surface: ___
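
The E-insertion step from intermediate to surface can be approximated with one regular-expression rewrite followed by deletion of the boundary symbols. This is a sketch of the rule's effect, not the FST itself, and `intermediate_to_surface` is an illustrative name.

```python
import re


def intermediate_to_surface(form):
    """Apply E-insertion, then erase the ^ and # boundary symbols."""
    # Insert e between a stem ending in s, z, sh, ch or x and the suffix s.
    form = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", form)
    return form.replace("^", "").replace("#", "")
```

The rule only fires when the morpheme boundary is followed by exactly s#, so a form such as box^ing# passes through with just its boundary symbols removed.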

Where are we? [Figure]

Final Scheme: Part 1 [Figure]

Final Scheme: Part 2 [Figure]

Intersection(FST1, FST2) = FST3
• States of FST1 and FST2: Q1 and Q2
• States of intersection: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of intersection: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff
– δ1(q1_i, a:b) = q1_n AND
– δ2(q2_j, a:b) = q2_m
[Figure: arc a:b from q1_i to q1_n, arc a:b from q2_j to q2_m, and the resulting arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
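
The intersection rule translates almost line for line into code. The sketch below assumes deterministic transition functions stored as dictionaries from (state, (a, b)) to a next state, and covers transitions only, not start or final states. The function name is illustrative.

```python
def intersect(delta1, delta2):
    """Build δ3 over state pairs: an arc exists iff both machines
    have an arc with the same a:b label."""
    delta3 = {}
    for (p, sym1), p_next in delta1.items():
        for (q, sym2), q_next in delta2.items():
            if sym1 == sym2:
                delta3[((p, q), sym1)] = (p_next, q_next)
    return delta3
```

For example, if `delta1` has an a:b arc from 0 to 1 and `delta2` has an a:b arc from 0 to 0, the result has an a:b arc from (0, 0) to (1, 0).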

Composition(FST1, FST2) = FST3
• States of FST1 and FST2: Q1 and Q2
• States of composition: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of composition: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff there exists c such that
– δ1(q1_i, a:c) = q1_n AND
– δ2(q2_j, c:b) = q2_m
[Figure: arc a:c from q1_i to q1_n, arc c:b from q2_j to q2_m, and the resulting arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
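
Composition differs from intersection only in how the labels are matched: the output symbol of the first machine must equal the input symbol of the second. The sketch below uses the same hypothetical dictionary encoding for the inputs but stores sets of targets in the result, since different intermediate symbols c can lead to different state pairs. It ignores ε, which needs extra care in a real implementation.

```python
def compose(delta1, delta2):
    """Build δ3: an a:b arc exists iff some c gives a:c in δ1 and c:b in δ2."""
    delta3 = {}
    for (p, (a, c1)), p_next in delta1.items():
        for (q, (c2, b)), q_next in delta2.items():
            if c1 == c2:
                delta3.setdefault(((p, q), (a, b)), set()).add((p_next, q_next))
    return delta3
```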

FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms
• Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms

Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
  – Finding word boundaries in text (?!) …maxmatch
  – Finding sentence boundaries: punctuation… but “.” is ambiguous; look at the example in Fig. 3.22
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11)
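
The maxmatch idea mentioned above is a greedy longest-match segmentation: at each position take the longest lexicon word that starts there. A minimal sketch follows. The lexicon in the usage note is made up, and emitting a single character when nothing matches is one common fallback, chosen here as an assumption.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of `text` into lexicon words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the lexicon.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character and move on
        tokens.append(text[i:j])
        i = j
    return tokens
```

With the lexicon {"the", "theta", "table", "bled", "own", "down", "there"}, the string "thetabledownthere" comes out as theta / bled / own / there rather than the / table / down / there, which shows why greedy matching is unreliable for English.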

Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem

Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
  (condition) S1 -> S2
  – ATIONAL -> ATE (relational -> relate)
  – (*v*) ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
  – ization -> -ize: computerize
  – ize -> ε: computer
• Errors occur:
  – organization -> organ, university -> universe
Code freely available in most languages: Python, Java, …
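
A toy cascade containing only the four rules shown on this slide. It is not the Porter stemmer: the real algorithm has many more rules and extra conditions on the stem, which are omitted here. The grouping of rules into stages and all names are assumptions for illustration.

```python
import re


def has_vowel(stem):
    """The (*v*) condition: the stem contains a vowel."""
    return re.search(r"[aeiou]", stem) is not None


# Each stage is a list of (suffix, replacement, condition) rules.
# Within a stage, only the first rule whose suffix matches is considered.
STAGES = [
    [("ing", "", has_vowel)],
    [("ational", "ate", None), ("ization", "ize", None)],
    [("ize", "", None)],
]


def apply_stage(word, rules):
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if condition is None or condition(base):
                return base + replacement
            return word  # suffix matched but condition failed: no change
    return word


def stem(word):
    """Run the word through every stage in order."""
    for rules in STAGES:
        word = apply_stage(word, rules)
    return word
```

Running `stem("computerization")` passes through computerize on the way to computer, matching the cascade above, while `stem("sing")` is left alone because the remaining stem "s" has no vowel.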

Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on stems they contain)
Seems to work especially well with smaller documents
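
The three steps can be sketched end to end. Both the placeholder stemmer and the similarity measure are assumptions: `crude_stem` just strips a few suffixes, and Jaccard overlap of stem sets stands in for whatever similarity a real retrieval system would use.

```python
def crude_stem(word):
    """Placeholder stemmer (hypothetical): strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: reduce a document or query to its set of stems."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


def rank(query, documents):
    """Order documents from most to least similar to the query."""
    return sorted(documents, key=lambda doc: similarity(query, doc), reverse=True)
```

The query "indexing documents" shares no surface word with "index the document", yet the two match on the stems index and document.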

Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
  – Each stage is a separate transducer
  – The stages can be composed to get one big transducer

Next Time
• Read handout
  – Probability
  – Stats
  – Information theory
• Next Lecture:
  – Finish Chpt. 3, 3.10-11
  – Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)