- Number of slides: 18
Text Preprocessing
Preprocessing step
• Aims to create a correct text representation, according to the adopted model.
• Steps:
  – Lexical analysis;
  – Case folding, numbers;
  – Stop-words elimination;
  – Stemming;
  – (other preprocessing procedures...)
Generating index terms
[Figure: logical view of the documents. Labels: Docs, Spaces and Signals, Stopwords, Nominal groups, Stemming, Manual indexing, Structure, Full text, Index terms]
• Stop-words elimination;
• Nominal groups detection;
• Stemming;
• Index terms generation;
• Other preprocessing procedures: synonyms, co-occurrences, latent semantic indexing, ...
Text preprocessing
• Most common procedures:
  – “Tokenization”:
    • Identification of text words;
    • Words are defined as “strings with continuous alphanumeric characters with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence”;
    • The most common word separators are the blank, the tab or the new-line.
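The word definition above can be sketched as a regular expression. This is a minimal illustration, not a reference tokenizer: the names `TOKEN_PATTERN` and `tokenize` are hypothetical, and the pattern assumes ASCII letters and digits only.

```python
import re

# Assumption: ASCII letters and digits only. Hyphens and apostrophes are kept
# only when they sit between alphanumeric runs (e.g. "I'll", "e-mail").
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the words of `text` in order, dropping separators and punctuation."""
    return TOKEN_PATTERN.findall(text)
```

Under these assumptions, `tokenize("I'll e-mail Wash. about New York")` keeps the contraction and the hyphenated word whole, but it also strips the period from the abbreviation and splits "New York" in two, so separating on blanks and punctuation is not enough on its own.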
Text preprocessing
• Problems:
  a) End-of-sentence marks vs. abbreviations; e.g. Wash.
  b) Apostrophes ( ‘ ): “magic words” vs. contractions; e.g. I’ll.
  c) Hyphens: single words vs. hyphenated words; e.g. email.
  d) Blank: sometimes does not indicate word separation; e.g. database and data base; New York and San Francisco.
  e) Numbers: 9365 1873.
Text preprocessing
– Case-Folding:
  • the THE The => THE;
– http://www.delorie.com/gnu/docs/diffutils/diff_6.html
– http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm
– http://www.dlib.org/dlib/november96/newzealand/11witten.html
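A small sketch of why case folding matters when counting terms. The function names are hypothetical. Folding to upper case follows the example above; folding to lower case is just as common.

```python
from collections import Counter


def fold_case(tokens):
    """Map every token to upper case so that case variants collapse."""
    return [token.upper() for token in tokens]


def term_counts(tokens):
    """Count terms after case folding."""
    return Counter(fold_case(tokens))
```

With this sketch, `term_counts(["the", "THE", "The"])` holds a single entry, `THE`, with count 3, instead of three separate terms.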
Text preprocessing
– Stop-Words removal:
  • an, the, is, are, and, or, so, because, ...;
  • list on the Web (524 words) in the BOW library, CMU.
– http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
– http://searchenginewatch.internet.com/facts/stopwords.html
– http://pen2.ci.santamonica.us/city/municode/stopwords.html
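Stop-word removal can be sketched as a set lookup. The list below contains only the eight example words from the slide, not the 524-word BOW list, and the names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Assumption: a toy stop list taken from the examples above.
STOP_WORDS = frozenset({"an", "the", "is", "are", "and", "or", "so", "because"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]
```

For example, `remove_stop_words(["The", "cat", "is", "on", "the", "mat"])` keeps "cat", "on" and "mat". "on" survives only because it is missing from this toy list.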
Text preprocessing
– Stemming:
  • compressed, compression => compress;
  • Porter's algorithm:
– http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html
– http://ils.unc.edu/keyes/java/porter
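As a rough illustration of suffix stripping in the spirit of Porter's algorithm, the sketch below implements the plural-removal rules (step 1a) and a handful of suffix rewrites. It is a deliberate simplification: it ignores the conditions on the remaining stem that the full algorithm checks before applying a rule, so it is not a complete Porter stemmer. The function names are hypothetical.

```python
# Plural removal rules of step 1a, tried in order; the first match wins.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

# A few suffix rewrites from Porter's rule tables (assumption: applied
# unconditionally, without the stem-length checks of the real algorithm).
SUFFIX_RULES = {
    "ational": "ate", "tional": "tion", "ization": "ize",
    "fulness": "ful", "ousness": "ous", "iveness": "ive",
    "icate": "ic", "alize": "al", "ical": "ic",
}


def strip_plural(word):
    """Apply the first matching plural rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def rewrite_suffix(word):
    """Rewrite the longest matching suffix, or return the word unchanged."""
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + SUFFIX_RULES[suffix]
    return word
```

Trying the longest suffix first matters: "relational" must match "ational" (giving "relate") before "tional" gets a chance, while "conditional" only matches "tional" (giving "condition").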
Stemming algorithm [Porter 86]
Steps:
1. Plural removal, including special cases such as "sses" and "ies";
2. Pattern matching to rewrite some suffixes, such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";
Stemming algorithm [Porter 86]
Steps:
3. Handling of special transformations such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> "";
4. Verification of composite words, including: "al", "ance", "er", "ic", "able", "ible", "ant", "ement", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise";
5. Verification of whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", and "pseudo".
Text preprocessing
– N-Grams:
  • APPLE => _APP, APPL, PPLE, PLE_
– http://www.cs.umbc.edu/ngram
– http://citeseer.nj.nec.com/miller99hidden.html
– http://citeseer.nj.nec.com/5655.html
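The APPLE example corresponds to character 4-grams over the word padded with one underscore on each side. A minimal sketch, where the function name and the default parameters are assumptions chosen to match the example:

```python
def char_ngrams(word, n=4, pad="_"):
    """Return all character n-grams of `word` padded once on each side."""
    padded = pad + word + pad
    # A window of width n slides one character at a time over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

`char_ngrams("APPLE")` yields `_APP`, `APPL`, `PPLE`, `PLE_`, as on the slide. A word whose padded length is below `n` yields an empty list.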
Text preprocessing
• Other techniques:
  • Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/):
    – Splits a sentence into its syntactic or grammatical components (POS tags);
    – Most useful in terms of information content: nouns, verbs, adjectives.
Brill POS Tagger Output
• Input: Mr. Red have a red ball
• Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN

Part of Speech Tags
DT    Determiner
NNP   Proper noun, singular
JJ    Adjective
VBP   Verb, non-3rd person singular present
NN    Noun, singular or mass
.     Sentence-final punctuation
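The word/TAG output format is easy to post-process. The sketch below parses a tagged line into (word, tag) pairs and keeps only nouns, verbs and adjectives. It only reads the output format shown above and does not call the Brill tagger. The function names and the tag-prefix filter are assumptions.

```python
# Assumption: noun, verb and adjective tags start with these prefixes.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")


def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a token such as "./." still works.
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def content_words(pairs):
    """Keep the words tagged as nouns, verbs or adjectives."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_TAG_PREFIXES)]
```

Applied to the output line above, `content_words` drops the determiner "a" and the sentence-final period and keeps the remaining five words.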
POS - nouns
• in general indicate generic entities (dog, tree); for English, consider only the plural noun variation;
• the plural is usually marked by the suffix -s (dogs, trees);
• the plural has exceptions: “es” (speeches) and irregular terms (woman: women);
• in addition, there is the possessive case (woman’s house), marked by a clitic.
Text preprocessing
– WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/
  • Is a database of lexemes [Miller 98];
  • Contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.);
  • Separates its entries according to their syntactic categories: nouns, verbs, adjectives, …;
  • Within a category, several semantic relations among words are stored.
WordNet
[Screenshot: search for a word, returning results as noun, verb, adjective and adverb]
WordNet Composition:
WordNet
WordNet contains the relations hyponym, hypernym, meronym and holonym:
• a hyponym is a more specific word: cat is a hyponym of animal;
• a hypernym is a more generic word: animal is a hypernym of cat;
• a part of the whole is a meronym: leaf is a meronym of tree;
• the whole which corresponds to a part is called a holonym.
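The four relations come in inverse pairs, which a toy dictionary can make concrete. This sketch uses only the two examples from the slide and does not use the WordNet database or its API. All names are hypothetical.

```python
# Toy data from the examples above: specific -> generic, part -> whole.
HYPERNYM_OF = {"cat": "animal"}
HOLONYM_OF = {"leaf": "tree"}


def invert(relation):
    """Turn a word -> target mapping into target -> list of words."""
    inverse = {}
    for word, target in relation.items():
        inverse.setdefault(target, []).append(word)
    return inverse


# Hyponyms are the inverse of hypernyms; meronyms are the inverse of holonyms.
HYPONYMS_OF = invert(HYPERNYM_OF)
MERONYMS_OF = invert(HOLONYM_OF)
```

Here `HYPONYMS_OF["animal"]` contains "cat" and `MERONYMS_OF["tree"]` contains "leaf", mirroring the definitions above.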