
Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings

Janine Toole, Simon Fraser University, Burnaby, BC, Canada
From the ANLP-NAACL Proceedings, April 29 - May 4, 2000 (pp. 173-179)

Goal: automatic categorization of unknown words
• Unknown words (UknWrds): words not contained in the lexicon of an NLP system
• "Unknown-ness" is a property relative to a particular NLP system

Motivation
• Degraded system performance in the presence of unknown words
• A disproportionate effect is possible:
  – Min (1996): only 0.6% of the words in 300 e-mails were misspelled
  – Result: 12% of the sentences contained an error (discussed in Min and Wilson, 1998)
• Difficulties translating live closed captions (CC):
  – 5 seconds to transcribe dialogue, no post-editing

Reasons for unknown words
• Proper name
• Misspelling
• Abbreviation or number
• Morphological variant

And my favorite...
• Misspoken words
• Examples (courtesy H. K. Longmore):
  – *I'll phall you on the cone (call, phone)
  – *I did a lot of hiking by mysummer this self (myself this summer)

What to do?
• Identify the class of the unknown word
• Take action based on the goals of the system and the class of the word (see the sketch below):
  – correct the spelling
  – expand the abbreviation
  – convert the number format
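
A minimal sketch of this identify-then-act flow; the handler functions and class labels here are illustrative assumptions, not part of the paper:

    def correct_spelling(word):
        # Placeholder: a real system might return the closest ispell
        # suggestion instead.
        return word

    def expand_abbreviation(word):
        # Placeholder lookup; a real table would be domain-specific.
        return {"dr.": "doctor", "st.": "street"}.get(word.lower(), word)

    def handle_unknown(word, word_class):
        # Dispatch on the identified class; classes with no handler
        # (e.g. names) pass through unchanged.
        handlers = {"misspelling": correct_spelling,
                    "abbreviation": expand_abbreviation}
        return handlers.get(word_class, lambda w: w)(word)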

Overall System Architecture
• Multiple components, one per category
• Each returns a confidence measure (Elworthy, 1998)
• Evaluate the results from each component to determine the category
• One reason for this approach: it takes advantage of existing research

Simplified Version: Names & Spelling Errors
• Decision tree architecture:
  – combines multiple types of evidence about the word
• Results combined using a weighted voting procedure (sketched below)
• Evaluation: live CC data, replete with a wide variety of UknWrds
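
A sketch of the voting step under one reading of the slides: each component reports whether the word belongs to its category plus a confidence, and the most confident positive vote wins. The component interface is an assumption for illustration.

    def classify_unknown(word, components):
        # components: list of (category, classify) pairs, where
        # classify(word) -> (is_member, confidence)
        positives = []
        for category, classify in components:
            is_member, confidence = classify(word)
            if is_member:
                positives.append((confidence, category))
        if not positives:
            return "other"
        return max(positives)[1]  # highest-confidence category wins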

Name Identifier
• Proper names ==> proper name bucket
• Others ==> discard
• PN: a person, place, or concept, typically requiring capitals in English

Problems
• CC is ALL CAPS!
• No confidence measure available from existing PN recognizers
• Perhaps future PNRs will provide one?

Solution
• Build a custom PNR

Decision Trees
• Highly explainable: you can readily understand the features affecting an analysis
• Well suited to combining a variety of information
• Don't grow the tree from seed: use IBM's Intelligent Miner suite
• Ignore the DT algorithm itself; the point here is the application of DTs
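
The talk used IBM's Intelligent Miner; as a freely available stand-in, here is a minimal sketch of training such a tree with scikit-learn. The feature rows and labels are invented for illustration.

    # Stand-in sketch: the talk used IBM's Intelligent Miner; this uses
    # scikit-learn's decision tree instead.  Feature values are invented.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    train_feats = [{"pos": "NN", "pos-1": "DT", "word_len": 7},
                   {"pos": "NNP", "pos-1": "IN", "word_len": 9}]
    train_labels = ["misspelling", "name"]

    vec = DictVectorizer()                 # one-hot encodes the POS tags
    X = vec.fit_transform(train_feats)
    clf = DecisionTreeClassifier().fit(X, train_labels)

    # For a new word, predict_proba reports the class mix at the matched
    # leaf -- essentially the per-leaf confidence measure described later.
    probs = clf.predict_proba(vec.transform([{"pos": "NN", "pos-1": "DT",
                                              "word_len": 5}]))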

Proper Names - Features
• 10 features specified per UknWrd (see the sketch below):
  – the POS and detailed POS of the UknWrd and of the 2 words on either side
  – a rule-based system supplies the detailed tags
  – an in-house statistical parser supplies the POS
• Would include a feature indicating the presence of initial upper case, if the data had it
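
A sketch of extracting those 10 features (POS and detailed POS at offsets -2 to +2). The tag sequences are assumed to come from the taggers named above, and the padding value for sentence boundaries is an assumption.

    def name_features(pos_tags, detailed_tags, i):
        # 10 features: POS and detailed POS of the unknown word (offset 0)
        # and of the two words on either side.  "PAD" marks positions that
        # fall outside the sentence (an assumption; the slides do not say
        # how boundaries were handled).
        feats = {}
        for off in (-2, -1, 0, 1, 2):
            j = i + off
            inside = 0 <= j < len(pos_tags)
            feats[f"pos{off:+d}"] = pos_tags[j] if inside else "PAD"
            feats[f"detpos{off:+d}"] = detailed_tags[j] if inside else "PAD"
        return feats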

Misspellings
• An unintended, orthographically incorrect representation
• Relative to the NLP system
• 1 or more additions, deletions, substitutions, reversals, or punctuation errors

Orthography
• Word: orthography, or·thog·ra·phy, noun
  – 1a: the art of writing words with the proper letters according to standard usage
  – 1b: the representation of the sounds of a language by written or printed symbols
  – 2: a part of language study that deals with letters and spelling

Misspellings - Features
• Derived from prior research (including the author's own)
• Abridged list of the features used:
  – corpus frequency, word length, edit distance, ispell information, character sequence frequency, non-English characters

Misspelling Features (cont.)
• Word length (Agirre et al., 1998)
• Predictions of the correct spelling are more accurate if |w| > 4

Misspelling Features (cont.)
• Edit distance (see the sketch below):
  – 1 edit = 1 substitution, addition, deletion, or reversal
  – 80% of errors are within 1 edit distance of the intended word
  – 70% within 1 edit distance of the intended word
• Unix spell checker ispell:
  – edit distance feature = the distance from the UknWrd to the closest ispell suggestion, or 30 when ispell offers no suggestion
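
A minimal sketch of this distance, assuming the standard optimal-string-alignment variant of Damerau-Levenshtein (a reversal of two adjacent letters counts as one edit); the 30 fallback for words without an ispell suggestion would be applied by the caller.

    def edit_distance(a, b):
        # Optimal string alignment: substitutions, additions, deletions,
        # and adjacent reversals each count as one edit.
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # addition
                              d[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
        return d[m][n]

    # edit_distance("temt", "tempt") == 1; edit_distance("floyda", "florida") == 2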

Misspelling Features (cont.)
• Character sequence frequency:
  – e.g. wful, rql, etc.
  – a composite of the individual character sequence frequencies
  – relevance to 1 tree vs. many
• Non-English characters: transmission noise in the CC case, or foreign names
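
The slides do not say how the composite is formed; one plausible sketch scores a word by its rarest character trigram in a reference corpus (the trigram length and the min-combination are assumptions).

    from collections import Counter

    def ngram_counts(corpus_words, n=3):
        # Frequency of every character n-gram in a reference corpus.
        counts = Counter()
        for w in corpus_words:
            w = w.lower()
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
        return counts

    def char_seq_feature(word, counts, n=3):
        # Composite score: the frequency of the word's rarest n-gram.
        # Very low values (e.g. for "rql") point to a misspelling or
        # transmission noise.
        grams = [word[i:i + n].lower() for i in range(len(word) - n + 1)]
        return min((counts.get(g, 0) for g in grams), default=0)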

Decision Time
• Misspelling module says it's not a misspelling, PNR says it's a name -> name
• Both negative -> neither a misspelling nor a name
• What if both are positive?
  – the one with the highest confidence measure wins
  – confidence measure (see the sketch below):
    – per leaf, calculated from the training data
    – correct predictions / total # of predictions at that leaf
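
Continuing the scikit-learn stand-in from above, the per-leaf confidence can be computed like this (clf.apply maps each row to its leaf id; the paper's exact procedure may differ).

    from collections import defaultdict

    def leaf_confidences(clf, X_train, y_train):
        # Confidence per leaf = correct predictions / total number of
        # predictions made at that leaf, over the training data.
        totals, correct = defaultdict(int), defaultdict(int)
        for leaf, pred, gold in zip(clf.apply(X_train),
                                    clf.predict(X_train),
                                    y_train):
            totals[leaf] += 1
            correct[leaf] += int(pred == gold)
        return {leaf: correct[leaf] / totals[leaf] for leaf in totals}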

Evaluation - Dataset
• 7000 cases of UknWrds
• 2.6 million word corpus
• Live business news captions
• 70.4% manually identified as names
• 21.3% as misspellings
• The rest: other types of UknWrds

Dataset (cont.)
• 70% of the dataset randomly selected as the training corpus
• The remainder (2100 cases) used for the test corpus
• Test data: 10 samples drawn by random selection with replacement
• Total of 10 test datasets
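
A sketch of that resampling step; the per-sample size is an assumption, since the slides do not state it.

    import random

    def make_test_sets(test_pool, n_sets=10, size=None, seed=0):
        # Draw n_sets samples from the held-out pool, with replacement.
        rng = random.Random(seed)
        size = size or len(test_pool)
        return [[rng.choice(test_pool) for _ in range(size)]
                for _ in range(n_sets)]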

Evaluation - Training
• Train a DT on the misspelling features
• Train a DT on the misspelling & name features
• Train a DT on the name & misspelling features

Misspelling DT Results - Table 3
• Baseline: no recall
• 1st decision tree: 73.8% recall
• 2nd decision tree: an increase in precision, and a decrease in recall by a similar amount
• The name features are not predictive for identifying misspellings in this domain
• Not surprising: 8 of the 10 name features deal with information external to the word itself

Misspelling DT failures
• 2 classes of omissions
• Misidentifications:
  – foreign words

Omission type 1
• Words with the typical characteristics of English words
• Differ from the intended word by the addition or deletion of a syllable:
  – creditability for credibility
  – coordinatored for coordinated
  – representives for representatives

Omission type 2
• Words differing from the intended word by the deletion of a blank:
  – webpage
  – crewmembers
  – rainshower

Fixes
• Fix for the 2nd type:
  – a feature specifying whether the UknWrd can be split into 2 known words (sketched below)
• Fix for the 1st type is more difficult:
  – homophonic relationship
  – phonetic distance feature
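
A sketch of that split feature, assuming a simple lexicon lookup; the minimum part length is an assumption to avoid trivial splits.

    def splits_into_known_words(word, lexicon, min_part=2):
        # True if deleting one blank recovers two in-vocabulary words,
        # e.g. "rainshower" -> "rain" + "shower".
        for i in range(min_part, len(word) - min_part + 1):
            if word[:i] in lexicon and word[i:] in lexicon:
                return True
        return False

    # splits_into_known_words("rainshower", {"rain", "shower"}) -> True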

Name DT Results - Table 4
• 1st tree:
  – precision is a large improvement
  – recall is excellent
• 2nd tree:
  – increased recall & precision
  – unlike the 2nd misspelling DT; why?

Name DT failures
• Not identified as a name: names with determiners
  – the steelers, the pathfinder
• Adept at individual people and places
  – trouble with names whose distributions are similar to common nouns

Name DT failures (cont.)
• Incorrectly identified as names:
  – unusual character sequences: sxetion, fwlamg
• The misspelling identifier correctly identifies these as misspellings
• The decision-making component needs to resolve these cases

Unknown Word Categorizer
• Precision = # of correct misspelling or name categorizations / total # of times a word was identified as a misspelling or name
• Recall = # of times the system correctly identifies a misspelling or name / # of misspellings and names existing in the data
(see the sketch below)
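
Those definitions translate directly into code; the list-of-(predicted, gold) representation here is an assumption for illustration.

    def precision_recall(pairs):
        # pairs: (predicted, gold) label pairs, one per unknown word,
        # with labels "name", "misspelling", or "other".
        target = {"name", "misspelling"}
        predicted = sum(1 for p, g in pairs if p in target)
        relevant = sum(1 for p, g in pairs if g in target)
        correct = sum(1 for p, g in pairs if p in target and p == g)
        precision = correct / predicted if predicted else 0.0
        recall = correct / relevant if relevant else 0.0
        return precision, recall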

Confusion matrix of the tie-breaker
• Table 5: good results
• 5% of cases needed the confidence measure
• In the majority of cases the decision-maker rules in favor of the name prediction

Confusion matrix (cont.)
• The name DT has better results, so it is likely to have higher confidence measures
• UknWrd classed as a name when it is actually a misspelling (37 cases)
• Phonetic relation with the intended word: temt, tempt; floyda, Florida

Encouraging Results
• A productive approach
• Future focus:
  – improve the existing components (features sensitive to the distinction between names & misspellings)
  – develop components to identify the remaining types (abbreviations, morphological variants, etc.)
  – an alternative decision-making process

Portability
• Few linguistic resources required:
  – a corpus of the new domain (language)
  – spelling suggestions (ispell is available for many languages)
  – a POS tagger

Possible portability problems
• Edit distance:
  – assumes words consist of alphabetic characters that undergo substitutions/additions/deletions
  – less useful for Chinese or Japanese
• The general approach is still transferable:
  – consider the means by which misspellings differ from intended words
  – identify features to capture those differences

Related Research
• Assume all UknWrds are misspellings
• Rely on capitalization
• Expectations from scripts:
  – rely on world knowledge of the situation (e.g., naval ship-to-shore messages)

Related Research (cont.)
• (Baluja et al., 1999): a DT classifier to identify PNs in text
• 3 features: word level, dictionary level, POS information
• Highest F-score: 95.2%
  – slightly higher than the name module

But...
• Different tasks:
  – identify all words & phrases that are PNs
  – vs. identify only those words which are UknWrds
• Different data: case information
• If word-level features (case) are excluded: F-score of 79.7%

Conclusion
• An UknWrd categorizer to identify misspellings & names
• Individual components, each specializing in identifying a particular class of UknWrd
• 2 existing components use DTs
• Encouraging results in a challenging domain (live CC transcripts)!