- Number of slides: 20
CSCE 771 Natural Language Processing
Lecture 9: NLTK POS Tagging, Part 2
Topics
• Taggers
• Rule-Based Taggers
• Probabilistic Taggers
• Transformation-Based Taggers (Brill)
• Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
Overview
Last Time
• Overview of POS Tags
Today
• Part of Speech Tagging
• Parts of Speech
• Rule-Based taggers
• Stochastic taggers
• Transformational taggers
Readings
• Chapter 5.4-5.?
CSCE 771 Spring 2011
Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how
Rank tags from most to least common
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> print tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', 'CNJ', 'PRO', 'ADV', 'VD', ...]
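The ranking idea does not depend on the Brown corpus. A minimal sketch in Python 3, assuming a small hand-made list of (word, tag) pairs, with collections.Counter standing in for nltk.FreqDist. The names toy_tagged and rank_tags are illustrative only and are not part of NLTK.

```python
from collections import Counter

# Hypothetical toy data standing in for brown_news_tagged.
toy_tagged = [
    ("the", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("was", "VD"), ("in", "P"), ("the", "DET"),
    ("city", "N"), ("of", "P"), ("Atlanta", "NP"), ("fair", "ADJ"),
    ("city", "N"),
]

def rank_tags(tagged_words):
    """Return the tags ordered from most to least frequent."""
    counts = Counter(tag for (_word, tag) in tagged_words)
    return [tag for (tag, _count) in counts.most_common()]

print(rank_tags(toy_tagged))
```

In NLTK 3, FreqDist is itself a Counter subclass, so most_common() rather than keys() is the call that gives a frequency ordering there.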
What Tags Precede Nouns?
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NUM', 'V', 'PRO', 'CNJ', ',', 'VG', 'VN', ...]
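The bigram step can be sketched without NLTK by zipping a list with itself shifted by one position. This Python 3 example assumes a hypothetical toy sentence; tags_before is an illustrative helper, not an NLTK function.

```python
from collections import Counter

# Hypothetical toy data; in the slide this would be brown_news_tagged.
toy_tagged = [
    ("the", "DET"), ("old", "ADJ"), ("man", "N"), ("saw", "VD"),
    ("the", "DET"), ("dog", "N"), ("in", "P"), ("the", "DET"),
    ("park", "N"),
]

def tags_before(tagged_words, target_tag):
    """Count the tags that immediately precede words tagged target_tag."""
    # zip(seq, seq[1:]) yields adjacent (previous, current) pairs.
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(prev[1] for (prev, cur) in pairs if cur[1] == target_tag)

before_nouns = tags_before(toy_tagged, "N")
print(before_nouns.most_common())
```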
Most common Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',
 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',
 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',
 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]
Rank Tags for words using CFDs
• The word is the condition and the tag is the event
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> print cfd1['yield'].keys()
['V', 'N']
>>> print cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
Tags and counts for the word cut
print "ranked tags for the word cut"
cut_tags = cfd1['cut'].keys()
print "Counts for cut"
for c in cut_tags:
    print c, cfd1['cut'][c]

ranked tags for the word cut
Counts for cut
V 12
VD 10
N 3
VN 3
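A conditional frequency distribution is, in essence, one counter per condition. The following Python 3 sketch makes that structure explicit with a defaultdict of Counters. The data is a hypothetical toy list and build_cfd is an illustrative name, so the counts do not match the Treebank figures on the slide.

```python
from collections import Counter, defaultdict

# Hypothetical toy (word, tag) pairs standing in for the wsj sample.
toy_tagged = [
    ("cut", "V"), ("cut", "VD"), ("cut", "V"), ("cut", "N"),
    ("yield", "V"), ("yield", "N"), ("yield", "V"),
]

def build_cfd(pairs):
    """Map each condition to a Counter of the events seen with it."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd

cfd = build_cfd(toy_tagged)
# Ranked tags and their counts for one word.
for tag, count in cfd["cut"].most_common():
    print(tag, count)
```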
P(W | T): Flipping it around
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> print cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',
 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
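With the tag as the condition, the counts give a relative-frequency estimate of P(W | T): the count of a word under a tag divided by the total count for that tag. A minimal Python 3 sketch on hypothetical toy data (words_by_tag is an illustrative helper, and no smoothing is applied):

```python
from collections import Counter, defaultdict

# Hypothetical toy (word, tag) pairs.
toy_tagged = [
    ("been", "VN"), ("made", "VN"), ("made", "VD"), ("been", "VN"),
    ("said", "VD"), ("said", "VD"), ("made", "VN"), ("been", "VN"),
]

def words_by_tag(tagged_words):
    """Build tag -> Counter(word): the tag is the condition, the word the event."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        table[tag][word] += 1
    return table

by_tag = words_by_tag(toy_tagged)
vn_total = sum(by_tag["VN"].values())
# Relative-frequency estimate of P(word = 'been' | tag = 'VN').
p_been_given_vn = by_tag["VN"]["been"] / vn_total
print(p_been_given_vn)
```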
List of words for which VD and VN are both events
list1 = [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
print list1
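The same filter can be written as a subset test once each word is mapped to the set of tags it has been seen with. This is a Python 3 sketch on hypothetical toy data; words_with_all_tags is an illustrative name.

```python
from collections import defaultdict

# Hypothetical toy (word, tag) pairs.
toy_tagged = [
    ("cut", "VD"), ("cut", "VN"), ("said", "VD"), ("taken", "VN"),
    ("made", "VN"), ("made", "VD"), ("cut", "N"),
]

def words_with_all_tags(tagged_words, required_tags):
    """Return, sorted, the words observed with every tag in required_tags."""
    tags_of = defaultdict(set)
    for word, tag in tagged_words:
        tags_of[word].add(tag)
    required = set(required_tags)
    # required <= tags is the subset test: all required tags were seen.
    return sorted(w for w, tags in tags_of.items() if required <= tags)

print(words_with_all_tags(toy_tagged, ["VD", "VN"]))
```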
Print the 4 word/tag pairs before kicked/VD
idx1 = wsj.index(('kicked', 'VD'))
print wsj[idx1-4:idx1+1]
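Two edge cases are worth guarding when slicing a context window: index raises ValueError if the pair is absent, and a start index below zero would be read as counting from the end of the list. A hedged Python 3 sketch on a plain list of hypothetical toy pairs (context_before is an illustrative helper):

```python
def context_before(tagged_words, target, width=4):
    """Return up to `width` pairs before the first `target`, plus the target."""
    try:
        idx = tagged_words.index(target)
    except ValueError:
        return []  # target pair does not occur
    start = max(0, idx - width)  # clamp so the slice never wraps around
    return tagged_words[start:idx + 1]

# Hypothetical toy data: the target sits near the start of the list.
toy_tagged = [("he", "PRO"), ("kicked", "VD"), ("the", "DET"), ("ball", "N")]
print(context_before(toy_tagged, ("kicked", "VD")))
```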
Table 2.4
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
Example 5.2 (code_findtags.py)
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...
NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
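Example 5.2 relies on keys() returning words in frequency order, which holds for the NLTK version used in these slides. A Python 3 sketch of the same idea that makes the ordering explicit with Counter.most_common is given here. It runs on hypothetical toy data; the top_n parameter is an addition for illustration and is not in the original example.

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, top_n=5):
    """For each tag starting with tag_prefix, list its top_n most frequent words."""
    by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            by_tag[tag][word] += 1
    return {tag: [w for (w, _n) in counts.most_common(top_n)]
            for tag, counts in by_tag.items()}

# Hypothetical toy data in the style of the unsimplified Brown tags.
toy_tagged = [
    ("year", "NN"), ("years", "NNS"), ("year", "NN"), ("time", "NN"),
    ("year's", "NN$"), ("ran", "VBD"),
]
tagdict = findtags("NN", toy_tagged)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```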
Words following often
import nltk
from nltk.corpus import brown
print "For the Brown Tagged Corpus category=learned"
brown_learned_text = brown.words(categories='learned')
print "sorted words following often"
print sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

 VN    V   VD  ADJ  DET  ADV
 15   12    8    5    5    4
P , CNJ . TO VBZ VG WH
4 3 1 1 1 1
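The tabulated view is just the tag counts laid out in two aligned rows. A minimal Python 3 sketch, assuming hypothetical toy data and a fixed column width of 4; tags_after and tabulate are illustrative helpers and not the NLTK implementation.

```python
from collections import Counter

def tags_after(tagged_words, word):
    """Count the tags of the tokens that immediately follow `word`."""
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(nxt[1] for (cur, nxt) in pairs if cur[0] == word)

def tabulate(counts):
    """Format the counts as a header row of tags over a row of frequencies."""
    items = counts.most_common()
    header = " ".join(f"{tag:>4}" for (tag, _n) in items)
    row = " ".join(f"{n:>4}" for (_tag, n) in items)
    return header + "\n" + row

# Hypothetical toy data.
toy_tagged = [
    ("often", "ADV"), ("used", "VN"), ("and", "CNJ"), ("often", "ADV"),
    ("seen", "VN"), ("they", "PRO"), ("often", "ADV"), ("go", "V"),
]
print(tabulate(tags_after(toy_tagged, "often")))
```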
Highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...
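The ambiguity test amounts to counting the distinct tags seen for each lowercased word. A Python 3 sketch on hypothetical toy data (ambiguous_words and its min_tags parameter are illustrative; min_tags=4 corresponds to the `> 3` test in the slide):

```python
from collections import defaultdict

def ambiguous_words(tagged_words, min_tags=4):
    """Return {word: sorted tags} for words seen with at least min_tags tags."""
    tags_of = defaultdict(set)
    for word, tag in tagged_words:
        tags_of[word.lower()].add(tag)  # fold case so 'Best' and 'best' merge
    return {w: sorted(tags) for (w, tags) in tags_of.items()
            if len(tags) >= min_tags}

# Hypothetical toy data.
toy_tagged = [
    ("Best", "ADJ"), ("best", "ADV"), ("best", "NP"), ("best", "V"),
    ("better", "ADJ"), ("better", "ADV"), ("the", "DET"), ("The", "DET"),
]
for word, tags in ambiguous_words(toy_tagged).items():
    print(word, " ".join(tags))
```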
Tag Package
http://nltk.org/api/nltk.tag.html#module-nltk.tag