CSCE 771 Natural Language Processing Lecture 9 NLTK



Number of slides: 20

CSCE 771 Natural Language Processing
Lecture 9: NLTK POS Tagging Part 2
Topics
• Taggers
• Rule Based Taggers
• Probabilistic Taggers
• Transformation Based Taggers - Brill
• Supervised learning
Readings: Chapter 5.4-?
February 3, 2011

Overview
Last Time
• Overview of POS Tags
Today
• Part of Speech Tagging
• Parts of Speech
• Rule Based taggers
• Stochastic taggers
• Transformational taggers
Readings
• Chapter 5.4-5.?
CSCE 771 Spring 2011

Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning       Examples
ADJ   adjective     new, good, high, special, big, local
ADV   adverb        really, already, still, early, now
CNJ   conjunction   and, or, but, if, while, although
DET   determiner    the, a, some, most, every, no
EX    existential   there, there's
FW    foreign word  dolce, ersatz, esprit, quo, maitre

Table 5.1: Simplified Part-of-Speech Tagset (continued)
Tag   Meaning             Examples
MOD   modal verb          will, can, would, may, must, should
N     noun                year, home, costs, time, education
NP    proper noun         Alison, Africa, April, Washington
NUM   number              twenty-four, fourth, 1991, 14:24
PRO   pronoun             he, their, her, its, my, I, us
P     preposition         on, of, at, with, by, into, under
TO    the word to         to
UH    interjection        ah, bang, ha, whee, hmpf, oops
V     verb                is, has, get, do, make, see, run
VD    past tense          said, took, told, made, asked
VG    present participle  making, going, playing, working
VN    past participle     given, taken, begun, sung
WH    wh determiner       who, which, when, what, where, how

Rank tags from most to least common
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> print tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', 'CNJ', 'PRO', 'ADV', 'VD', ...]
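The ranking idea can be tried without the Brown corpus. The following is a minimal sketch in Python 3, assuming collections.Counter as a stand-in for nltk.FreqDist and a small invented word list (tagged_words is not corpus data). In NLTK 3, FreqDist is a Counter subclass, so most_common() is the way to get frequency order there as well.

```python
from collections import Counter

# Toy tagged text, invented for illustration (simplified tagset from Table 5.1).
tagged_words = [
    ("the", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("results", "N"), ("stand", "V"),
]

# Count the tags only, ignoring the words, as tag_fd does on the slide.
tag_counts = Counter(tag for (word, tag) in tagged_words)

# most_common() returns (tag, count) pairs from most to least frequent.
ranked_tags = [tag for (tag, count) in tag_counts.most_common()]
print(ranked_tags)
```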

What Tags Precede Nouns?
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NUM', 'V', 'PRO', 'CNJ', ',', 'VG', 'VN', ...]
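A hedged Python 3 sketch of the same bigram filter, assuming zip() over the list and its one-step shift as a stand-in for nltk.bigrams. The sentence is invented for illustration.

```python
from collections import Counter

# Toy tagged text, invented for illustration.
tagged_words = [
    ("the", "DET"), ("grand", "ADJ"), ("jury", "N"), ("praised", "VD"),
    ("the", "DET"), ("city", "N"), ("election", "N"), ("of", "P"),
    ("the", "DET"), ("year", "N"),
]

# zip(seq, seq[1:]) pairs every item with its successor.
pairs = zip(tagged_words, tagged_words[1:])

# Keep the tag of the first element whenever the second element is a noun.
preceding = Counter(a[1] for (a, b) in pairs if b[1] == "N")
print(preceding.most_common())
```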

Most common Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',
 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',
 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',
 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

Rank Tags for words using CFDs
• word as a condition and the tag as an event
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> print cfd1['yield'].keys()
['V', 'N']
>>> print cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']

Tags and counts for the word cut
print "ranked tags for the word cut"
cut_tags = cfd1['cut'].keys()
print "Counts for cut"
for c in cut_tags:
    print c, cfd1['cut'][c]
Output:
ranked tags for the word cut
Counts for cut
V 12
VD 10
N 3
VN 3
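To see what a conditional frequency distribution stores, here is a rough Python 3 sketch that assumes a defaultdict of Counters as a stand-in for nltk.ConditionalFreqDist. The counts come from an invented list, not from the Treebank sample.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged_words = [
    ("cut", "V"), ("cut", "V"), ("cut", "VD"), ("cut", "N"),
    ("yield", "N"), ("yield", "V"), ("yield", "V"),
]

# Outer key is the condition (the word); the inner Counter counts events (tags).
word_to_tags = defaultdict(Counter)
for word, tag in tagged_words:
    word_to_tags[word][tag] += 1

# Ranked tags and their counts for one condition.
for tag, count in word_to_tags["cut"].most_common():
    print(tag, count)
```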

P(W | T): Flipping it around
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> print cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',
 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
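Swapping the pair order makes the tag the condition and the word the event. A small Python 3 sketch under the same assumptions as before (defaultdict of Counters in place of ConditionalFreqDist, invented data):

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged_words = [
    ("been", "VN"), ("expected", "VN"), ("been", "VN"),
    ("said", "VD"), ("said", "VD"), ("made", "VN"), ("made", "VD"),
]

# Condition on the tag this time, and count the words seen with it.
tag_to_words = defaultdict(Counter)
for word, tag in tagged_words:
    tag_to_words[tag][word] += 1

# Words tagged VN, most frequent first.
vn_words = [word for (word, count) in tag_to_words["VN"].most_common()]
print(vn_words)
```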

List of words for which VD and VN are both events
list1 = [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
print list1
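The same membership test can be sketched in Python 3 with one set of tags per word. The names tags_by_word and both are illustrative, and the data is invented.

```python
from collections import defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged_words = [
    ("made", "VD"), ("made", "VN"), ("asked", "VD"), ("asked", "VN"),
    ("taken", "VN"), ("said", "VD"),
]

# Collect the set of tags observed for each word.
tags_by_word = defaultdict(set)
for word, tag in tagged_words:
    tags_by_word[word].add(tag)

# Keep the words whose tag set contains both VD and VN.
both = sorted(w for w, tags in tags_by_word.items() if {"VD", "VN"} <= tags)
print(both)
```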

Print the 4 word/tag pairs before kicked/VD
idx1 = wsj.index(('kicked', 'VD'))
print wsj[idx1-4:idx1+1]
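One detail worth guarding when slicing a context window: if the target sits fewer than four positions from the start, idx - 4 goes negative and the slice wraps around to the end of the list. A hedged Python 3 sketch on a plain list (context_before is a hypothetical helper, and the sentence is invented):

```python
# Toy tagged text, invented for illustration.
tagged_words = [
    ("the", "DET"), ("company", "N"), ("quietly", "ADV"), ("had", "VD"),
    ("kicked", "VD"), ("off", "P"), ("sales", "N"),
]

def context_before(pairs, target, width=4):
    """Return up to `width` pairs before the first `target`, plus the target."""
    idx = pairs.index(target)      # raises ValueError if target is absent
    start = max(0, idx - width)    # clamp so a negative start cannot wrap around
    return pairs[start:idx + 1]

window = context_before(tagged_words, ("kicked", "VD"))
print(window)
```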


Table 2.4
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
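To make the tabulate row of the table concrete, here is a rough Python 3 sketch of a tabulation limited to chosen samples and conditions. It is not NLTK's implementation: the tabulate function below is a hypothetical helper over a defaultdict of Counters, and the data is invented.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged_words = [("cut", "V"), ("cut", "V"), ("cut", "N"),
                ("yield", "N"), ("yield", "V")]

cfd = defaultdict(Counter)
for word, tag in tagged_words:
    cfd[word][tag] += 1

def tabulate(cfd, conditions, samples):
    """Return a text table: one row per condition, one column per sample."""
    width = max(len(label) for label in list(conditions) + list(samples)) + 1
    lines = [" " * width + "".join(s.rjust(width) for s in samples)]
    for cond in conditions:
        counts = "".join(str(cfd[cond][s]).rjust(width) for s in samples)
        lines.append(cond.rjust(width) + counts)
    return "\n".join(lines)

table = tabulate(cfd, conditions=sorted(cfd), samples=["N", "V"])
print(table)
```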

Example 5.2 (code_findtags.py)
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...

NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
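The findtags example relies on keys() coming back in frequency order, which holds for the NLTK version used in the lecture. A possible Python 3 rewrite, assuming plain Counters and most_common() instead, with an invented sample in place of the Brown news text (top_n is an added, hypothetical parameter):

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, top_n=5):
    """Map each tag starting with tag_prefix to its top_n most frequent words."""
    by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            by_tag[tag][word] += 1
    return {tag: [word for (word, count) in counts.most_common(top_n)]
            for tag, counts in by_tag.items()}

# Toy tagged text with Brown-style tags, invented for illustration.
sample = [
    ("year", "NN"), ("year", "NN"), ("time", "NN"), ("years", "NNS"),
    ("year's", "NN$"), ("said", "VBD"), ("President", "NN-TL"),
]

tagdict = findtags("NN", sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```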

words following often
import nltk
from nltk.corpus import brown
print "For the Brown Tagged Corpus category=learned"
brown_learned_text = brown.words(categories='learned')
print "sorted words following often"
print sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
Output:
 VN    V   VD  ADJ  DET  ADV
 15   12    8    5    5    4
P , CNJ . TO VBZ VG WH
4 3 1 1 1 1
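Both "often" queries (the words that follow it and the tags that follow it) can come from a single pass over the bigrams. A minimal Python 3 sketch, assuming an invented tagged sentence rather than the Brown learned category:

```python
from collections import Counter

# Toy tagged text, invented for illustration.
tagged_words = [
    ("it", "PRO"), ("is", "V"), ("often", "ADV"), ("used", "VN"),
    ("and", "CNJ"), ("often", "ADV"), ("seen", "VN"), ("but", "CNJ"),
    ("often", "ADV"), ("fails", "V"),
]

# Every (word, tag) pair that immediately follows the word "often".
followers = [b for (a, b) in zip(tagged_words, tagged_words[1:])
             if a[0] == "often"]

words_after = sorted(set(word for (word, tag) in followers))
tags_after = Counter(tag for (word, tag) in followers)

print(words_after)
print(tags_after.most_common())
```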

highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...
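The ambiguity test is a count of distinct tags per lowercased word. A hedged Python 3 sketch with invented data, where the threshold mirrors the slide's len(...) > 3:

```python
from collections import defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged_words = [
    ("Best", "ADJ"), ("best", "ADV"), ("best", "NP"), ("best", "V"),
    ("better", "ADJ"), ("better", "ADV"), ("the", "DET"),
]

# Lowercase the word so "Best" and "best" share one entry.
tags_by_word = defaultdict(set)
for word, tag in tagged_words:
    tags_by_word[word.lower()].add(tag)

# A word counts as highly ambiguous here if it has more than three distinct tags.
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items()
             if len(tags) > 3}

for word in sorted(ambiguous):
    print(word, " ".join(ambiguous[word]))
```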

Tag Package
http://nltk.org/api/nltk.tag.html#module-nltk.tag
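The frequency tables built in this lecture are already enough for a baseline probabilistic tagger: give each word its most frequent tag. The sketch below is plain Python 3 and is not the nltk.tag implementation. The functions train_unigram and tag_words, the default tag "N", and the training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_words):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_words(words, model, default="N"):
    """Tag known words from the model; unseen words get the default tag."""
    return [(word, model.get(word, default)) for word in words]

# Toy training data, invented for illustration.
training = [("the", "DET"), ("cut", "V"), ("cut", "V"), ("cut", "N"),
            ("budget", "N")]

model = train_unigram(training)
tagged = tag_words(["the", "cut", "deficit"], model)
print(tagged)
```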