  • Number of slides: 18

Text Preprocessing

Preprocessing step
• Aims to create a correct text representation, according to the adopted model.
• Steps:
  – Lexical analysis;
  – Case folding, numbers;
  – Stop-words elimination;
  – Stemming;
  – (other preprocessing procedures...)

Generating index terms
[Figure: logical view of the documents. Labels: docs, spaces and signals, stopwords, nominal groups, stemming, manual indexing, structure, full text, index terms]
• Stop-words elimination;
• Nominal groups detection;
• Stemming;
• Index terms generation;
• Other preprocessing procedures: synonyms, co-occurrences, latent semantic indexing...

Text preprocessing
• Most common procedures:
  – “Tokenization”:
    • Identification of the words in the text;
    • Words are defined as “strings of contiguous alphanumeric characters with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence marks”;
    • The most common word separators are the blank, the tab, and the new-line.
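The word definition above can be sketched as a regular expression. This is a minimal illustration, not a reference tokenizer: the pattern and the name `tokenize` are assumptions, and only the straight apostrophe (') is handled.

```python
import re

# Hypothetical pattern: a run of alphanumeric characters, optionally
# continued by internal hyphens or apostrophes (e.g. "I'll", "e-mail").
# End-of-sentence marks and whitespace are never part of a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the words of `text`; blanks, tabs and new-lines separate them."""
    return TOKEN_RE.findall(text)


print(tokenize("I'll e-mail you.\tSee the data base\nin New York."))
```

Note that this sketch splits "data base" and "New York" into two tokens each, which is exactly the kind of case a plain separator-based approach cannot resolve.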

Text preprocessing
• Problems:
  a) End-of-sentence marks vs. abbreviations; e.g. Wash.
  b) Apostrophes (‘): “magic words” vs. contractions; e.g. I’ll.
  c) Hyphens: single words vs. hyphenated words; e.g. email.
  d) Blanks: sometimes do not indicate word separation; e.g. database and data base; New York and San Francisco.
  e) Numbers: 9365 1873.

Text preprocessing
– Case-folding:
  • the, THE, The => THE;
  – http://www.delorie.com/gnu/docs/diffutils/diff_6.html
  – http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm
  – http://www.dlib.org/dlib/november96/newzealand/11witten.html
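As a small sketch of case-folding, all variants of a token can be mapped to one form. The slide folds to upper case, so the example does the same; the function name `fold_case` is hypothetical.

```python
def fold_case(tokens):
    """Map every token to upper case so that case variants collapse."""
    return [token.upper() for token in tokens]


variants = ["the", "THE", "The"]
folded = fold_case(variants)
# All three variants now share a single index term.
print(set(folded))
```

Folding to lower case works equally well; what matters is that indexing and querying use the same convention.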

Text preprocessing
– Stop-words removal:
  • an, the, is, are, and, or, so, because, ...;
  • list on the Web (524 words) in the BOW library, CMU.
  – http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
  – http://searchenginewatch.internet.com/facts/stopwords.html
  – http://pen2.ci.santamonica.us/city/municode/stopwords.html
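Stop-word removal amounts to filtering tokens against a fixed list. The sketch below uses only the eight words named on the slide, not the 524-word BOW list; `STOP_WORDS` and `remove_stop_words` are illustrative names.

```python
# Only the examples listed on the slide; a real list is much longer.
STOP_WORDS = {"an", "the", "is", "are", "and", "or", "so", "because"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lower-cased form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words("The cat and the dog are friends".split()))
```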

Text preprocessing
– Stemming:
  • compressed, compression => compress;
  • Porter's algorithm:
  – http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html
  – http://ils.unc.edu/keyes/java/porter

Stemming algorithm [Porter 86]
Steps:
1. Plural removal, including special cases such as "sses" and "ies";
2. Matching of patterns against suffixes such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";

Stemming algorithm [Porter 86]
Steps:
3. Handling of special transformations such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> "";
4. Verification of composite words, including the suffixes: "al", "ance", "er", "ic", "able", "ible", "ant", "ement", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise";
5. Verification of whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", and "pseudo".

Text preprocessing
– N-grams:
  • APPLE => _APP, APPL, PPLE, PLE_
  – http://www.cs.umbc.edu/ngram
  – http://citeseer.nj.nec.com/miller99hidden.html
  – http://citeseer.nj.nec.com/5655.html
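The APPLE example corresponds to character 4-grams over the word padded with one underscore on each side. A minimal sketch, assuming that padding scheme (the function name `char_ngrams` is hypothetical):

```python
def char_ngrams(word, n=4, pad="_"):
    """Return all character n-grams of `word` after padding both ends once."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("APPLE"))
```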

Text preprocessing
• Other techniques:
  • Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/):
    – Separation of a sentence into its syntactic or grammatical components (POS tags);
    – Main use in terms of information content: nouns, verbs, adjectives.

Brill POS Tagger Output
• Input: Mr. Red have a red ball
• Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN

Part of Speech Tags
DT   Determiner
NNP  Proper noun, singular
JJ   Adjective
VBP  Verb, non-3rd person singular present
NN   Noun, singular or mass
.    Sentence-final punctuation
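Tagger output in the word/TAG form shown above is easy to post-process. The sketch below does not run the Brill tagger itself; it only parses an already tagged line and keeps nouns, verbs, and adjectives. The helper names and the tag-prefix filter are assumptions.

```python
# Tags for nouns, verbs and adjectives start with these prefixes.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")


def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that './.' yields ('.', '.').
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def content_words(pairs):
    """Keep only the words tagged as noun, verb or adjective."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_TAG_PREFIXES)]


tagged = "Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN"
print(content_words(parse_tagged(tagged)))
```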

POS - nouns
§ in general, nouns indicate generic entities (dog, tree); for English, consider only the plural noun variation;
§ the plural is usually characterized by the suffix -s (dogs, trees);
§ the plural has exceptions: “es” (speeches) and irregular terms (woman: women);
§ in addition, there is the possessive case (woman’s house), called a clitic.
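The noun variations above can be sketched as a naive normalizer. It covers only the cases named on the slide (regular -s, the "es" exception, one irregular form, and the possessive clitic written with a straight apostrophe); the table `IRREGULAR_PLURALS` and the function name are hypothetical, and many English nouns (for example "bus") would be handled incorrectly.

```python
# Only the irregular example from the slide; a real table is much larger.
IRREGULAR_PLURALS = {"women": "woman"}


def normalize_noun(token):
    """Strip the possessive clitic and reduce a plural noun to its singular."""
    word = token.lower()
    if word.endswith("'s"):
        word = word[:-2]  # possessive clitic: woman's -> woman
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]  # "es" plural: speeches -> speech
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # regular plural: dogs -> dog
    return word


for example in ["dogs", "trees", "speeches", "women", "woman's"]:
    print(example, "->", normalize_noun(example))
```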

Text preprocessing
– WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/
  • Is a database of lexemes [Miller 98];
  • Contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.);
  • Separates its entries according to their syntactic categories: nouns, verbs, adjectives, …;
  • Within a category, several semantic relations among words are stored.

WordNet
[Figure: search for a word and return of its noun, verb, adjective, and adverb entries]

WordNet Composition:

WordNet
WordNet contains the relations hyponym, hypernym, meronym, and holonym:
• a hyponym is a more specific word: cat is a hyponym of animal;
• a hypernym is a more generic word: animal is a hypernym of cat;
• a part of the whole is a meronym: leaf is a meronym of tree;
• the whole which corresponds to a part is called a holonym: tree is a holonym of leaf.
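The four relations come in inverse pairs, which a toy lookup table makes explicit. This sketch does not query WordNet; it hard-codes the two examples from the slide, and all names in it are hypothetical.

```python
# Toy data: each word maps to its more generic word / to its whole.
HYPERNYM_OF = {"cat": "animal"}
HOLONYM_OF = {"leaf": "tree"}


def invert(relation):
    """Build the inverse relation: value -> list of keys."""
    inverse = {}
    for specific, generic in relation.items():
        inverse.setdefault(generic, []).append(specific)
    return inverse


# Hyponymy is the inverse of hypernymy; meronymy is the inverse of holonymy.
HYPONYMS_OF = invert(HYPERNYM_OF)
MERONYMS_OF = invert(HOLONYM_OF)

print(HYPONYMS_OF["animal"])
print(MERONYMS_OF["tree"])
```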