
Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings
Janine Toole, Simon Fraser University, Burnaby, BC, Canada
From the ANLP-NAACL Proceedings, April 29 - May 4, 2000 (pp. 173-179)
Goal: automatic categorization of unknown words
• Unknown words (UnkWrds): words not contained in the lexicon of an NLP system
• "Unknown-ness" is a property relative to a particular NLP system
Motivation
• Degraded system performance in the presence of unknown words
• Disproportionate effect possible
  – Min (1996): only 0.6% of the words in 300 e-mails were misspelled
  – Result: 12% of the sentences contained an error (discussed in Min and Wilson, 1998)
• Difficulties translating live closed captions (CC)
  – 5 seconds to transcribe dialogue, no post-editing
Reasons for unknown words
• Proper name
• Misspelling
• Abbreviation or number
• Morphological variant
And my favorite...
• Misspoken words
• Examples (courtesy H. K. Longmore):
  – *I'll phall you on the cone (call, phone)
  – *I did a lot of hiking by mysummer this self (myself this summer)
What to do?
• Identify the class of the unknown word
• Take action based on the goals of the system and the class of the word:
  – correct the spelling
  – expand the abbreviation
  – convert the number format
Overall System Architecture
• Multiple components, one per category
• Each returns a confidence measure (Elworthy, 1998)
• Evaluate the results from each component to determine the category
• One reason for this approach: it takes advantage of existing research
Simplified Version: Names & Spelling Errors
• Decision tree architecture
  – combines multiple types of evidence about the word
• Results combined using a weighted voting procedure
• Evaluation: live CC data, replete with a wide variety of UnkWrds
Name Identifier
• Proper names ==> proper name bucket
• Others ==> discard
• PN: a person, place, or concept, typically requiring capitalization in English
Problems
• CC is ALL CAPS!
• No confidence measure from existing PN recognizers
• Perhaps future PNRs will work?
Solution
• Build a custom PNR
Decision Trees
• Highly explainable: one can readily see which features affect an analysis
• Well suited to combining a variety of information
• Don't grow the tree from seed: use IBM's Intelligent Miner suite
• Ignore the DT algorithm itself; the point is the application of DTs (see the sketch below)
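To make the setup concrete, here is a minimal sketch of training a decision tree over word features, using scikit-learn's DecisionTreeClassifier as a stand-in for IBM's Intelligent Miner (the approach does not depend on any particular DT package). The feature rows and labels are illustrative placeholders, not data from the paper.

```python
# Stand-in for the DT training step (scikit-learn, not Intelligent Miner).
from sklearn.tree import DecisionTreeClassifier

# Each row: [word_length, edit_distance_to_closest_suggestion, corpus_frequency]
X = [
    [9, 1, 0],   # "goverment"-like: 1 edit from a known word, unseen in corpus
    [8, 30, 3],  # "steelers"-like: no close ispell suggestion, recurs in corpus
    [11, 1, 0],  # another misspelling-like row
    [7, 30, 5],  # another name-like row
]
y = ["misspelling", "name", "misspelling", "name"]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[10, 1, 0]]))  # -> ['misspelling']
```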
Proper Names - Features
• 10 features specified per UnkWrd (see the sketch below):
  – POS and detailed POS of the UnkWrd and of the +/- 2 surrounding words
  – detailed tags come from a rule-based system
  – POS tags come from an in-house statistical parser
• Would include a feature indicating the presence of initial upper case, if the data had it
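A hypothetical sketch of the window-based POS features described above; `tagged` is an assumed list of (word, POS, detailed POS) triples for one caption line, and the tag values are placeholders rather than the paper's actual tagsets.

```python
# Window features: coarse and detailed POS for the unknown word and +/- 2 neighbours.
def name_features(tagged, i):
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < len(tagged):
            _, pos, detailed = tagged[j]
        else:
            pos = detailed = "PAD"  # padding beyond the sentence boundary
        feats[f"pos_{offset:+d}"] = pos
        feats[f"det_{offset:+d}"] = detailed
    return feats  # 10 features: 2 tags x 5 positions

sent = [("THE", "DET", "DET"), ("NEW", "ADJ", "ADJ"),
        ("STEELERS", "NOUN", "NOUN-PL"), ("WON", "VERB", "VBD")]
print(name_features(sent, 2))  # features for the unknown word "STEELERS"
```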
Misspellings
• An unintended, orthographically incorrect representation
• Relative to the NLP system
• 1 or more additions, deletions, substitutions, reversals, or punctuation changes
Orthography
• Word: orthography | or·thog·ra·phy | \ȯr-ˈthä-grə-fē\ n
  1a: the art of writing words with the proper letters according to standard usage
  1b: the representation of the sounds of a language by written or printed symbols
  2: a part of language study that deals with letters and spelling
Misspellings - Features
• Derived from prior research (including the author's own)
• Abridged list of features used:
  – corpus frequency, word length, edit distance, ispell information, character sequence frequency, non-English characters
Misspellings - Features (cont.)
• Word length (Agirre et al., 1998)
• Predictions for the correct spelling are more accurate if |w| > 4
Misspellings - Features (cont.)
• Edit distance (see the sketch below)
  – 1 edit == 1 substitution, addition, deletion, or reversal
  – prior studies report that 70-80% of errors fall within 1 edit distance of the intended word
• Unix spell checker: ispell
  – edit distance feature = distance from the UnkWrd to the closest ispell suggestion, or 30 when ispell offers no suggestion
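A minimal sketch of the edit distance just described, where substitutions, additions, deletions, and reversals (adjacent transpositions) each cost 1. This is the standard Damerau-Levenshtein recurrence rather than code from the paper, and the ispell helper assumes the suggestions are already available as a list.

```python
def edit_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein: substitution, addition, deletion, reversal."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
    return d[m][n]

def ispell_distance(word, suggestions):
    # Distance to the closest ispell suggestion, or 30 when there is none.
    return min((edit_distance(word, s) for s in suggestions), default=30)

print(edit_distance("temt", "tempt"))          # 1 (one addition)
print(ispell_distance("floyda", ["florida"]))  # 2
```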
Misspellings - Features (cont.)
• Character sequence frequency (see the sketch below)
  – e.g. wful, rql, etc.
  – a composite of the individual character sequence features (one tree vs. many trade-off)
• Non-English characters: transmission noise in the CC case, or foreign names
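A hypothetical sketch of a character-sequence-frequency feature: score a word by the corpus frequency of its rarest character trigram. The toy corpus, the trigram length, and the min-based scoring rule are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def char_ngrams(word: str, n: int = 3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

corpus_words = ["awful", "lawful", "tempt", "florida"]  # placeholder corpus
freq = Counter(g for w in corpus_words for g in char_ngrams(w))

def rarest_ngram_freq(word: str) -> int:
    # Frequency of the word's least common trigram; 0 flags un-English sequences.
    return min((freq[g] for g in char_ngrams(word)), default=0)

print(rarest_ngram_freq("wful"))  # > 0: "wfu"/"ful" occur in the corpus
print(rarest_ngram_freq("rql"))   # 0: "rql" never occurs in English-like text
```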
Decision Time
• Misspelling module says it's not a misspelling, PNR says it's a name -> name
• Both negative -> neither misspelling nor name
• What if both are positive? (see the sketch below)
  – the one with the highest confidence measure wins
  – confidence measure: per leaf, calculated from training data as correct predictions / total # of predictions at that leaf
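A minimal sketch of this tie-breaking rule. Leaf confidences are assumed to be precomputed from training counts; the function names and numbers are illustrative.

```python
def leaf_confidence(correct: int, total: int) -> float:
    # Per-leaf confidence from training data: correct / total at that leaf.
    return correct / total if total else 0.0

def categorize(name_pred, name_conf, misspell_pred, misspell_conf):
    if name_pred and misspell_pred:
        # Both positive: the prediction from the more confident leaf wins.
        return "name" if name_conf >= misspell_conf else "misspelling"
    if name_pred:
        return "name"
    if misspell_pred:
        return "misspelling"
    return "other"  # neither module claims the word

print(categorize(True, leaf_confidence(90, 100),
                 True, leaf_confidence(60, 100)))  # -> name
```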
Evaluation - Dataset
• 7,000 cases of UnkWrds
• 2.6 million word corpus
• Live business news captions
• 70.4% manually identified as names
• 21.3% as misspellings
• The rest: other types of UnkWrds
Dataset (cont.)
• 70% of the dataset randomly selected as the training corpus
• Remainder (2,100 cases) used as the test corpus
• Test data: 10 samples drawn by random selection with replacement (see the sketch below)
• Total of 10 test datasets
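A minimal sketch of this evaluation setup, assuming the ten samples are drawn with replacement from the 2,100 held-out cases; the `cases` list is a placeholder for the 7,000 labelled unknown-word cases.

```python
import random

cases = [("goverment", "misspelling"), ("STEELERS", "name")] * 3500  # placeholders
random.seed(0)
random.shuffle(cases)

split = int(0.7 * len(cases))
train, held_out = cases[:split], cases[split:]  # 4,900 train / 2,100 test

# Ten bootstrap-style test datasets: random selection with replacement.
test_sets = [random.choices(held_out, k=len(held_out)) for _ in range(10)]
print(len(train), len(held_out), len(test_sets))  # 4900 2100 10
```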
Evaluation - Training
• Train a misspelling DT on the misspelling features only
• Train a second misspelling DT on the combined misspelling & name features
• Likewise, train name DTs on the name features only and on the combined name & misspelling features
Misspelling DT Results - Table 3
• Baseline: no recall
• 1st decision tree: 73.8% recall
• 2nd decision tree: an increase in precision, with a decrease in recall of a similar amount
• Name features are not predictive for identifying misspellings in this domain
• Not surprising: 8 of the 10 name features deal with information external to the word itself
Misspelling DT failures
• 2 classes of omissions (next two slides)
• Misidentifications
  – foreign words
Omission type 1
• Words with the typical characteristics of English words
• Differ from the intended word by the addition or deletion of a syllable
  – creditability for credibility
  – coordinatored for coordinated
  – representives for representatives
Omission type 2
• Words differing from the intended word by the deletion of a blank
  – webpage
  – crewmembers
  – rainshower
Fixes
• Fix for the 2nd type (see the sketch below)
  – a feature specifying whether the UnkWrd can be split into 2 known words
• Fix for the 1st type is more difficult
  – homophonic relationship with the intended word
  – a phonetic distance feature
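A minimal sketch of the proposed "splits into two known words" feature. The lexicon here is a placeholder; a real system would consult its NLP lexicon.

```python
LEXICON = {"web", "page", "crew", "members", "rain", "shower", "tempt"}

def splits_into_known_words(word: str) -> bool:
    # True if the word is two lexicon words with the intervening blank deleted.
    return any(word[:i] in LEXICON and word[i:] in LEXICON
               for i in range(1, len(word)))

print(splits_into_known_words("webpage"))     # True  ("web" + "page")
print(splits_into_known_words("rainshower"))  # True  ("rain" + "shower")
print(splits_into_known_words("temt"))        # False
```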
Name DT Results - Table 4
• 1st tree
  – precision is a large improvement
  – recall is excellent
• 2nd tree
  – increased recall & precision
  – unlike the 2nd misspelling DT: why?
Name DT failures
• Not identified as a name: names with determiners
  – the steelers, the pathfinder
• Adept at individual people and places
  – trouble with names having distributions similar to common nouns
Name DT failures (cont.)
• Incorrectly identified as names
  – unusual character sequences: sxetion, fwlamg
• The misspelling identifier correctly identifies these as misspellings
• The decision-making component needs to resolve these conflicts
Unknown Word Categorizer
• Precision = # of correct misspelling or name categorizations / total # of times a word was identified as a misspelling or name
• Recall = # of times the system correctly identifies a misspelling or name / # of misspellings and names existing in the data (see the sketch below)
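A minimal sketch of these definitions over assumed (gold, predicted) label pairs; the data are placeholders.

```python
pairs = [("name", "name"), ("misspelling", "name"),
         ("misspelling", "misspelling"), ("name", "other"),
         ("other", "other")]  # (gold, predicted) placeholders

TARGET = {"name", "misspelling"}
correct = sum(1 for g, p in pairs if p in TARGET and g == p)
predicted = sum(1 for g, p in pairs if p in TARGET)  # times system said name/misspelling
relevant = sum(1 for g, p in pairs if g in TARGET)   # names/misspellings in the data

print(f"precision={correct / predicted:.2f}")  # 2/3 = 0.67
print(f"recall={correct / relevant:.2f}")      # 2/4 = 0.50
```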
Confusion matrix of the tie-breaker
• Table 5: good results
• Only 5% of cases needed the confidence measure
• In the majority of those cases, the decision-maker rules in favor of the name prediction
Confusion matrix (cont.)
• The name DT has better results and is therefore likely to have higher confidence measures
• UnkWrd classified as a name when it is actually a misspelling (37 cases)
• Phonetic relation with the intended word: temt/tempt; floyda/Florida
Encouraging Results
• A productive approach
• Future focus
  – improve existing components: features sensitive to the distinction between names & misspellings
  – develop components to identify the remaining types: abbreviations, morphological variants, etc.
  – explore an alternative decision-making process
Portability
• Few linguistic resources required:
  – a corpus of the new domain (language)
  – spelling suggestions (ispell is available for many languages)
  – a POS tagger
Possible portability problems
• Edit distance
  – assumes words consist of alphabetic characters that have undergone substitutions/additions/deletions
  – less useful for Chinese or Japanese
• The general approach is still transferable:
  – consider the means by which misspellings differ from intended words
  – identify features to capture those differences
Related Research
• Some approaches assume all UnkWrds are misspellings
• Others rely on capitalization
• Others rely on expectations from scripts
  – world knowledge of the situation, e.g. naval ship-to-shore messages
Related Research (cont.)
• Baluja et al. (1999): a DT classifier to identify PNs in text
• 3 feature types: word level, dictionary level, POS information
• Highest F-score: 95.2%
  – slightly higher than the name module here
But...
• Different tasks
  – identifying all words & phrases that are PNs
  – vs. identifying only those words which are UnkWrds
• Different data: case information is available there
• If word-level features (case) are excluded, the F-score falls to 79.7%
Conclusion
• An UnkWrd categorizer to identify misspellings & names
• Individual components, each specializing in identifying a particular class of UnkWrd
• The 2 existing components use DTs
• Encouraging results in a challenging domain (live CC transcripts)!