CS 601 R, section 2: Statistical Natural Language Processing
Lectures #16 & 17: Part-of-Speech Tagging, Hidden Markov Models
Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.

Last Time
§ Maximum entropy models
  § A technique for estimating multinomial distributions conditionally on many features
  § A building block of many NLP systems

Goals
§ To be able to model sequences
§ Application: Part-of-Speech Tagging
§ Technique: Hidden Markov Models (HMMs)
§ Think of this as sequential classification

Parts-of-Speech
§ Syntactic classes of words
§ Useful distinctions vary from language to language
§ Tagsets vary from corpus to corpus [See M+S p. 142]
§ Some tags from the Penn tagset:
  CD   numeral, cardinal: mid-1890 nine-thirty 0.5 one
  DT   determiner: a all an every no that the
  IN   preposition or conjunction, subordinating: among whether out on by if
  JJ   adjective or numeral, ordinal: third ill-mannered regrettable
  MD   modal auxiliary: can may might will would
  NN   noun, common, singular or mass: cabbage thermostat investment subhumanity
  NNP  noun, proper, singular: Motown Cougar Yvette Liverpool
  PRP  pronoun, personal: hers himself it we them
  RB   adverb: occasionally maddeningly adventurously
  RP   particle: aboard away back by on open through
  VB   verb, base form: ask bring fire see take
  VBD  verb, past tense: pleaded swiped registered saw
  VBN  verb, past participle: dilapidated imitated reunifed unsettled
  VBP  verb, present tense, not 3rd person singular: twist appear comprise mold postpone

The full Penn tagset:
  CC    conjunction, coordinating: and both but either or
  CD    numeral, cardinal: mid-1890 nine-thirty 0.5 one
  DT    determiner: a all an every no that the
  EX    existential there: there
  FW    foreign word: gemeinschaft hund ich jeux
  IN    preposition or conjunction, subordinating: among whether out on by if
  JJ    adjective or numeral, ordinal: third ill-mannered regrettable
  JJR   adjective, comparative: braver cheaper taller
  JJS   adjective, superlative: bravest cheapest tallest
  MD    modal auxiliary: can may might will would
  NN    noun, common, singular or mass: cabbage thermostat investment subhumanity
  NNP   noun, proper, singular: Motown Cougar Yvette Liverpool
  NNPS  noun, proper, plural: Americans Materials States
  NNS   noun, common, plural: undergraduates bric-a-brac averages
  POS   genitive marker: ' 's
  PRP   pronoun, personal: hers himself it we them
  PRP$  pronoun, possessive: her his mine my ours their thy your
  RB    adverb: occasionally maddeningly adventurously
  RBR   adverb, comparative: further gloomier heavier less-perfectly
  RBS   adverb, superlative: best biggest nearest worst
  RP    particle: aboard away back by on open through
  TO    "to" as preposition or infinitive marker: to
  UH    interjection: huh howdy uh whammo shucks heck
  VB    verb, base form: ask bring fire see take
  VBD   verb, past tense: pleaded swiped registered saw
  VBG   verb, present participle or gerund: stirring focusing approaching erasing
  VBN   verb, past participle: dilapidated imitated reunifed unsettled
  VBP   verb, present tense, not 3rd person singular: twist appear comprise mold postpone
  VBZ   verb, present tense, 3rd person singular: bases reconstructs marks uses
  WDT   WH-determiner: that whatever whichever
  WP    WH-pronoun: that whatever which whom
  WP$   WH-pronoun, possessive: whose
  WRB   Wh-adverb: however whenever where why

Part-of-Speech Ambiguity
§ Example: "Fed raises interest rates 0.5 percent", with candidate tags per word:
  Fed: VBD VBN NNP | raises: VBZ NNS | interest: VB VBP NN | rates: VBZ NNS | 0.5: CD | percent: NN
§ Two basic sources of constraint:
  § Grammatical environment
  § Identity of the current word
§ Many more possible features:
  § … but we won’t be able to use them until next class

Why POS Tagging?
§ Useful in and of itself
  § Text-to-speech: record, lead
  § Lemmatization: saw[v] → see, saw[n] → saw
  § Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS} (see the regex sketch below)
§ Useful as a pre-processing step for parsing
  § Less tag ambiguity means fewer parses
  § However, some tag choices are better decided by parsers!
    The/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP-or-IN loan/NN commitments/NNS …
    The/DT average/NN of/IN interbank/NN offered/VBD-or-VBN rates/NNS plummeted/VBD …
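
As a concrete illustration of the quick-and-dirty chunk pattern above, here is a tiny regex version; it is my own sketch, not part of the lecture, and the word/TAG encoding and the example sentence are assumptions.

```python
import re

# Hypothetical word/TAG encoding of a tagged sentence.
tagged = "the/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP loan/NN commitments/NNS"

def tok(tags):
    # One word/TAG token whose tag is one of the given alternatives.
    return r"\S+/(?:%s)\b" % tags

# {JJ | NN}* {NN | NNS}: any run of adjectives/common nouns ending in a noun.
np_chunk = re.compile(r"(?:%s )*%s" % (tok("JJ|NN"), tok("NN|NNS")))
print(np_chunk.findall(tagged))   # -> ['branch/NN', 'loan/NN commitments/NNS']
```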

HMMs
§ We want a generative model over tag sequences t and observations w, using states s (the joint distribution it encodes is written out after this slide)
  [diagram: a chain of states s0 = <•,•>, s1 = <•,t1>, s2 = <t1,t2>, …, sn = <tn-1,tn>, with each state si emitting word wi]
§ Assumptions:
  § Tag sequence is generated by an order-n Markov model
  § This corresponds to a 1st-order model over tag n-grams
  § Words are chosen independently, conditioned only on the tag
  § These are totally broken assumptions: why?
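
For concreteness, the joint distribution the diagram encodes, written in my own notation (a reconstruction, not the slide's formula) for tag-pair states, so that transitions condition on the two previous tags:

```latex
% Reconstruction of the HMM joint distribution with tag-pair states.
P(t_1 \ldots t_n, w_1 \ldots w_n)
  = \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \, P(w_i \mid t_i),
  \qquad t_{-1} = t_0 = \bullet
```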

Parameter Estimation
§ Need two multinomials:
  § Transitions: P(t_i | t_{i-2}, t_{i-1})
  § Emissions: P(w_i | t_i)
§ Can get these off a collection of tagged sentences by relative-frequency counts (see the counting sketch below)
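
A minimal counting sketch of the estimation step, not the slides' code: for brevity it conditions transitions on a single previous tag rather than the tag pair above, and the names (estimate_hmm, tagged_sentences, START, STOP) are my own.

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences, START="<s>", STOP="</s>"):
    """Relative-frequency transition and emission estimates from a list of
    [(word, tag), ...] sentences (assumed input format)."""
    trans = defaultdict(lambda: defaultdict(float))  # previous tag -> tag -> count
    emit = defaultdict(lambda: defaultdict(float))   # tag -> word -> count
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1.0
            emit[tag][word] += 1.0
            prev = tag
        trans[prev][STOP] += 1.0
    # Normalize each row of counts into a conditional multinomial.
    for table in (trans, emit):
        for context in table:
            total = sum(table[context].values())
            for outcome in table[context]:
                table[context][outcome] /= total
    return trans, emit
```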

Practical Issues with Estimation
§ Use standard smoothing methods to estimate transition scores, e.g. by interpolating trigram, bigram, and unigram estimates
§ Emissions are trickier:
  § Words we’ve never seen before
  § Words which occur with tags we’ve never seen
  § One option: break out the Good-Turing smoothing
  § Issue: words aren’t black boxes:  343,127.23   11-year   Minteria   reintroducible
§ Another option: decompose words into features and use a maxent model along with Bayes’ rule (a cruder feature-based fallback is sketched below)
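
The maxent-plus-Bayes'-rule option is covered next class; as a placeholder, here is a much cruder illustration of the same idea, my own and not the slides' method: map an unknown word to a surface-feature "signature" and mix word-level and signature-level emission estimates. Every name here (signature, emission_prob, word_given_tag, sig_given_tag, alpha) is hypothetical.

```python
def signature(word):
    # Crude surface features: digits, capitalization, a short suffix.
    if any(ch.isdigit() for ch in word):
        return "<NUM>"
    shape = "<CAP>" if word[0].isupper() else "<low>"
    return shape + word[-3:].lower()          # e.g. "<low>ble" for "reintroducible"

def emission_prob(word, tag, word_given_tag, sig_given_tag, alpha=0.99):
    # Mix the word-level estimate with the signature-level estimate so that
    # unseen words like "Minteria" still receive nonzero probability.
    p_word = word_given_tag.get(tag, {}).get(word, 0.0)
    p_sig = sig_given_tag.get(tag, {}).get(signature(word), 0.0)
    return alpha * p_word + (1 - alpha) * p_sig
```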

Disambiguation
§ Given these two multinomials, we can score any word / tag sequence pair, e.g.
  Fed/NNP raises/VBZ interest/NN rates/NNS 0.5/CD percent/NN ./.
  P(NNP | <•,•>) · P(Fed | NNP) · P(VBZ | <•,NNP>) · P(raises | VBZ) · P(NN | <NNP,VBZ>) · …
§ In principle, we’re done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence); a scoring sketch follows this slide
  NNP VBZ NN NNS CD NN   logP = -23
  NNP NNS NN NNS CD NN   logP = -29
  NNP VBZ VB NNS CD NN   logP = -27
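
A small scoring sketch continuing the bigram-transition simplification used in the estimation sketch above (not the slides' tag-pair states); the 1e-12 floor is a stand-in for proper smoothing.

```python
import math

def score_sequence(words, tags, trans, emit, START="<s>", STOP="</s>"):
    """Log-probability of one candidate (words, tags) pair under the model."""
    logp = 0.0
    prev = START
    for word, tag in zip(words, tags):
        logp += math.log(trans.get(prev, {}).get(tag, 1e-12))   # transition score
        logp += math.log(emit.get(tag, {}).get(word, 1e-12))    # emission score
        prev = tag
    logp += math.log(trans.get(prev, {}).get(STOP, 1e-12))      # end of sentence
    return logp
```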

Finding the Best Trajectory
§ Too many trajectories (state sequences) to list
§ Option 1: Beam Search (sketched in code after this slide)
  [diagram: the beam grows from the empty hypothesis <> to {Fed:NNP, Fed:VBN, Fed:VBD}, then to {Fed:NNP raises:NNS, Fed:NNP raises:VBZ, Fed:VBN raises:NNS, Fed:VBN raises:VBZ}]
  § A beam is a set of partial hypotheses
  § Start with just the single empty trajectory
  § At each derivation step:
    § Consider all continuations of previous hypotheses
    § Discard most: keep the top k, or those within a factor of the best (or some combination)
§ Beam search works relatively well in practice
  § … but sometimes you want the optimal answer
  § … and you need optimal answers to validate your beam search
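
A minimal beam-search sketch under the same bigram simplification as above; the beam representation, the tagset argument, and k are my own choices rather than anything specified in the lecture.

```python
import math

def beam_tag(words, trans, emit, tagset, k=5, START="<s>"):
    # Each hypothesis: (log score, last tag, tag sequence so far).
    beam = [(0.0, START, [])]
    for word in words:
        candidates = []
        for score, prev, tags in beam:
            for tag in tagset:
                p = trans.get(prev, {}).get(tag, 0.0) * emit.get(tag, {}).get(word, 0.0)
                if p > 0.0:
                    candidates.append((score + math.log(p), tag, tags + [tag]))
        beam = sorted(candidates, reverse=True)[:k]   # discard most, keep top k
    return beam[0][2] if beam else []
```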

The Path Trellis
§ Represent paths as a trellis over states
  [trellis over "Fed raises interest": position 0 <•,•>; position 1 <•,NNP>, <•,VBN>; position 2 <NNP,NNS>, <NNP,VBZ>, <VBN,NNS>, <VBN,VBZ>; position 3 <NNS,NN>, <NNS,VB>, <VBZ,NN>, <VBZ,VB>]
§ Each arc (s1 at position i → s2 at position i+1) is weighted with the combined cost of:
  § Transitioning from s1 to s2 (which involves some unique tag t), e.g. P(VBZ | •, NNP)
  § Emitting word i given t, e.g. P(raises | VBZ)
§ Each state path (trajectory):
  § Corresponds to a derivation of the word and tag sequence pair
  § Corresponds to a unique sequence of part-of-speech tags
  § Has a probability given by multiplying the arc weights in the path

The Viterbi Algorithm
§ Dynamic program for computing the score of a best path up to position i ending in state s:
  best(i, s) = max over s' of best(i-1, s') · P(s | s') · P(w_i | s)
§ Also store a backtrace: which predecessor state achieved that best score
§ Memoized (recursive) solution
§ Iterative solution (a code sketch follows this slide)
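
An iterative Viterbi sketch, again under the bigram-transition simplification rather than the slides' tag-pair states; it assumes emissions are smoothed enough that every word can be emitted by at least one tag.

```python
import math

def viterbi_tag(words, trans, emit, tagset, START="<s>", STOP="</s>"):
    NEG_INF = float("-inf")
    logp = lambda x: math.log(x) if x > 0.0 else NEG_INF
    best = {START: 0.0}      # best log score of any path ending in each tag
    back = []                # one backpointer table per word position
    for word in words:
        new_best, pointers = {}, {}
        for tag in tagset:
            e = logp(emit.get(tag, {}).get(word, 0.0))
            if e == NEG_INF:
                continue
            # Best predecessor for this tag at this position.
            prev, score = max(((p, s + logp(trans.get(p, {}).get(tag, 0.0)))
                               for p, s in best.items()), key=lambda x: x[1])
            if score > NEG_INF:
                new_best[tag], pointers[tag] = score + e, prev
        best = new_best
        back.append(pointers)
    # Best final tag, then follow backpointers right to left.
    tag = max(best, key=lambda t: best[t] + logp(trans.get(t, {}).get(STOP, 0.0)))
    tags = [tag]
    for pointers in reversed(back[1:]):
        tag = pointers[tag]
        tags.append(tag)
    return list(reversed(tags))
```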

The Path Trellis as DP Table
[diagram: the same trellis over "Fed raises interest …" laid out as a dynamic-programming table, with states <•,•>; <•,NNP>, <•,VBN>; <NNP,VBZ>, <NNP,NNS>, <VBN,VBZ>, <VBN,NNS>; <VBZ,VB>, <VBZ,NN>, <NNS,VB>, <NNS,NN>]

How Well Does It Work?
§ Choose the most common tag:
  § 90.3% with a bad unknown word model
  § 93.7% with a good one!
§ TnT (Brants, 2000):
  § A carefully smoothed trigram tagger
  § 96.7% on WSJ text (state of the art is ~97.2%)
§ Noise in the data
  § Many errors in the training and test corpora, e.g.:
    The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD …
  § Probably about 2% guaranteed error from noise (on this data); e.g. “chief executive officer” appears tagged JJ JJ NN, NN JJ NN, JJ NN NN, and NN NN NN

What’s Next for POS Tagging
§ Better features!
  They/PRP left/VBD as/IN-or-RB soon/RB as/IN he/PRP arrived/VBD ./.
  § We could fix the confusion on the first “as” with a feature that looked at the next word
  Intrinsic/NNP-or-JJ flaws/NNS remained/VBD undetected/VBN ./.
  § We could fix the capitalization confusion by linking capitalized words to their lowercase versions
§ Solution: maximum entropy sequence models (next class)
§ Reality check:
  § Taggers are already pretty good on WSJ journal text…
  § What the world needs is taggers that work on other text!

HMMs as Language Models
§ We have a generative model of tagged sentences: P(t, w)
§ We can turn this into a distribution over sentences by summing over the tag sequences: P(w) = Σ_t P(t, w)
§ Problem: too many sequences!
§ (And beam search isn’t going to help this time)

Summing over Paths
§ Just like Viterbi, but with sum instead of max
§ Recursive decomposition: sum(i, s) = Σ over s' of sum(i-1, s') · P(s | s') · P(w_i | s)  (a code sketch follows this slide)
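
The same bigram-simplified sketch as the Viterbi code, with the max replaced by a sum; it returns P(w). A real implementation would work in log space or rescale the forward values to avoid underflow.

```python
def forward_prob(words, trans, emit, tagset, START="<s>", STOP="</s>"):
    alpha = {START: 1.0}          # total probability of paths ending in each tag
    for word in words:
        new_alpha = {}
        for tag in tagset:
            e = emit.get(tag, {}).get(word, 0.0)
            if e == 0.0:
                continue
            total = sum(a * trans.get(prev, {}).get(tag, 0.0)
                        for prev, a in alpha.items())
            if total > 0.0:
                new_alpha[tag] = total * e
        alpha = new_alpha
    return sum(a * trans.get(tag, {}).get(STOP, 0.0) for tag, a in alpha.items())
```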

The Forward-Backward Algorithm
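
The body of this slide is equations that did not survive extraction; for reference, the standard forward and backward recursions (my reconstruction, in the notation of the previous slides) are:

```latex
% Forward and backward recursions for an HMM over states s and words w_i.
\alpha_i(s) = \sum_{s'} \alpha_{i-1}(s')\, P(s \mid s')\, P(w_i \mid s)
\qquad
\beta_i(s) = \sum_{s'} P(s' \mid s)\, P(w_{i+1} \mid s')\, \beta_{i+1}(s')
% Together they give the per-position posterior used on the next slide:
% P(\text{state } s \text{ at position } i \mid w) = \alpha_i(s)\,\beta_i(s) / P(w)
```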

What Does This Buy Us?
§ Why do we want forward and backward probabilities?
§ Lets us ask more questions
  § Like: what fraction of sequences contain tag t at position i
§ Max-tag decoding (sketched after this slide):
  § Pick the tag at each point which has highest expectation
  § Raises accuracy a tiny bit
  § Bad idea in practice (why?)
§ Also: unsupervised learning of HMMs
  § At least in theory, more later…
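
A sketch of max-tag (posterior) decoding, assuming alpha[i][tag] and beta[i][tag] tables computed by forward and backward passes like the one above; the table names are mine, not the slides'.

```python
def max_tag_decode(alpha, beta, n_positions):
    tags = []
    for i in range(n_positions):
        # P(tag at i | words) is proportional to alpha_i(tag) * beta_i(tag);
        # the normalizer P(w) is the same for every tag, so the argmax ignores it.
        posteriors = {t: alpha[i][t] * beta[i].get(t, 0.0) for t in alpha[i]}
        tags.append(max(posteriors, key=posteriors.get))
    return tags
```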

How’s the HMM as a LM?
§ POS tagging HMMs are terrible as LMs!
  I bought an ice cream ___
  The computer that I set up yesterday just ___
§ Don’t capture long-distance effects like a parser could
§ Don’t capture local collocational effects like n-grams
§ But other HMM-based LMs can work very well
  [diagram: a class-based HMM language model, START → c1 → c2 → … → cn, with each class ci emitting word wi]

Next Time
§ Better Tagging Features using Maxent
  § Dealing with unknown words
  § Adjacent words
  § Longer-distance features
§ Soon: Named-Entity Recognition