
- Number of slides: 27
Information Retrieval Search Engine Technology (2) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev radev@umich.edu
SET/IR – W/S 2009 … 3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. …
Document preprocessing
• Dealing with formatting and encoding issues
• Hyphenation, accents, stemming, capitalization
• Tokenization:
  – Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can’t
  – Example: “The New York-Los Angeles flight”
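The tokenization cases on this slide can be tried against a deliberately naive tokenizer. This is only a sketch: the function name `naive_tokenize` and the rule "split on anything that is not an ASCII letter or digit" are assumptions made for illustration, not a method from the lecture.

```python
import re

def naive_tokenize(text):
    """Split on anything that is not an ASCII letter or digit (illustrative rule)."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Cases taken from the slide; the naive rule breaks possessives,
# contractions, phone numbers, and multi-word names apart.
for example in ["Paul's", "Dr. Willow", "555-1212", "can't",
                "The New York-Los Angeles flight"]:
    print(example, "->", naive_tokenize(example))
```

The output shows why tokenization needs more than a punctuation split: "555-1212" becomes two numbers, and "New York-Los Angeles" loses the grouping of the two city names.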
Non-English languages
– Arabic: كتاب
– Japanese: この本は重い。
– German: Lebensversicherungsgesellschaftsangestellter
Document preprocessing
• Normalization:
  – Casing (cat vs. CAT)
  – Stemming (computer, computation)
  – Soundex
  – Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi
• Index reduction
  – Dropping stop words (“and”, “of”, “to”)
  – Problematic for “to be or not to be”
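A minimal sketch of case folding and stop-word removal, showing why the query “to be or not to be” is problematic. The stop list is an assumption: “and”, “of”, “to” come from the slide, while “be”, “or”, “not” are added here only for the demonstration. The function names are hypothetical.

```python
# Illustrative stop list: "and", "of", "to" are from the slide;
# "be", "or", "not" are assumed additions for this demonstration.
STOP_WORDS = {"and", "of", "to", "be", "or", "not"}

def normalize(text):
    """Case-fold the text and split it on whitespace."""
    return text.lower().split()

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

print(drop_stop_words(normalize("The history of CAT and cat")))
print(drop_stop_words(normalize("To be or not to be")))
```

With this stop list the second query is reduced to an empty list, so nothing is left to look up in the index.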
Porter’s algorithm
Example: the word “duplicatable”
Forms shown on the slide: duplicatable, duplicate, duplic; rules applied: rule 4, rule 1b1, rule 3
Another rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step may be applied.
Links
• http://maya.cs.depaul.edu/~classes/ds575/porter.html
• http://www.tartarus.org/~martin/PorterStemmer/def.txt
Approximate string matching
• The Soundex algorithm (Odell and Russell)
• Uses:
  – spelling correction
  – hash function
  – non-recoverable
The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v: 1
   c, g, j, k, q, s, x, z: 2
   d, t: 3
   l: 4
   m, n: 5
   r: 6
The Soundex algorithm
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300; these are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively
Some problems: Rogers and Rodgers, Sinclair and St. Clair receive different codes
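The four Soundex steps can be sketched as follows. This is one reading of the steps as stated on these slides (adjacency is judged in the original name, and the first letter's own code counts for adjacency); other published Soundex variants treat h and w differently. The names `CODES` and `soundex` are chosen for illustration.

```python
# Letter groups and digits from step 2 of the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the LDDD Soundex code of a name, following the slide's four steps."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0])      # code of the retained first letter
    for ch in letters[1:]:
        code = CODES.get(ch)          # None for a, e, h, i, o, u, w, y
        if code is not None and code != prev:
            digits.append(code)       # step 3: skip repeats of an adjacent code
        prev = code
    # Step 4: pad with zeros or truncate to one letter plus three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]

for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
             "Rogers", "Rodgers", "Sinclair", "St. Clair"]:
    print(name, soundex(name))
```

Running it reproduces the codes on the slide and shows that Rogers/Rodgers and Sinclair/St. Clair end up with different codes even though they sound alike.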
SET/IR – W/S 2009 … 4. Word distributions. The Zipf distribution. The Benford distribution. Heaps’ law. TF*IDF …
Word distributions
• Words are not distributed evenly!
• Same goes for letters of the alphabet (ETAOIN SHRDLU), city sizes, wealth, etc.
• Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people or it takes 80% of the effort to build the easier 20% of the system), more examples coming up…
Shakespeare
• Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; …
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html (visited in Dec. 2006)
The BNC (Adam Kilgarriff)
Rank  Frequency  Word  Part of speech
1     6187267    the   det
2     4239632    be    v
3     3093444    of    prep
4     2687863    and   conj
5     2186369    a     det
6     1924315    in    prep
7     1620850    to    infinitive-marker
8     1375636    have  v
9     1090186    it    pron
10    1039323    to    prep
11    887877     for   prep
12    884599     i     pron
13    760399     that  conj
14    695498     you   pron
15    681255     he    pron
16    680739     on    prep
17    675027     with  prep
18    559596     do    v
19    534162     at    prep
20    517171     by    prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10(2), 1997, pp. 135-155.
Stop words
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in”: another 10%. Next 12 words: another 10%.
• Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
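The type, token, and coverage figures above can be computed for any token list along these lines. This is a sketch only: `type_token_stats` and its dictionary keys are hypothetical names, and the short sample is made up for the demonstration (it is not the Moby Dick text).

```python
from collections import Counter

def type_token_stats(tokens, top_k):
    """Count types and tokens, and the share of tokens covered by the top_k types."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    covered = sum(freq for _, freq in counts.most_common(top_k))
    return {
        "types": n_types,
        "tokens": n_tokens,
        "token_type_ratio": n_tokens / n_types,
        "top_k_coverage": covered / n_tokens,
    }

# Made-up sample text, lower-cased and split on whitespace.
sample = "the whale and the sea and the ship".split()
print(type_token_stats(sample, top_k=2))

# The ratio quoted on the slide for Moby Dick Ch. 1:
print(round(2256 / 859, 2))
```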
Zipf’s law
Rank × Frequency ≈ Constant
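A rough way to check the rank-frequency relation on real counts is to look at the products rank × frequency and to fit the slope of log frequency against log rank. The sketch below uses the five largest BNC counts listed earlier in these slides; the function names and the least-squares fit are assumptions made for illustration, not a procedure from the lecture.

```python
import math

def rank_frequency_products(frequencies):
    """Return rank * frequency for frequencies sorted in decreasing order."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]

def zipf_exponent(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank).

    Needs at least two positive frequencies.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

bnc_top5 = [6187267, 4239632, 3093444, 2687863, 2186369]  # the, be, of, and, a
print(rank_frequency_products(bnc_top5))
print(zipf_exponent(bnc_top5))
```

For an ideal Zipf distribution the products are all equal and the fitted exponent is 1; the head of a real frequency list only follows this approximately.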
Zipf's law is fairly general!
• Frequency of accesses to web pages
  – in particular the access counts on the Wikipedia page, with s approximately equal to 0.3
  – page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
• Words in the English language
  – for instance, in Shakespeare’s play Hamlet with s approximately 0.5
• Sizes of settlements
• Income distributions amongst individuals
• Size of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law
http://www.nslij-genetics.org/wli/zipf/
http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml
Zipf’s law (cont’d)
• Limitations:
  – Low and high frequencies
  – Lack of convergence
• Power law with coefficient c = -1
  – $y = kx^c$
• Li (1992)
  – typing words one letter at a time, including spaces
Heaps’ law
• Size of vocabulary: $V(n) = K n^{\beta}$
• In English, K is between 10 and 100, β is between 0.4 and 0.6.
[Plot: V(n) against n]
http://en.wikipedia.org/wiki/Heaps%27_law
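A small sketch of the vocabulary-size formula next to an empirical vocabulary growth curve. The function names are hypothetical, and the values K = 10 and β = 0.5 in the usage line are just one choice from the ranges quoted on the slide.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted vocabulary size V(n) = K * n**beta after n tokens."""
    return K * n ** beta

def vocabulary_growth(tokens):
    """Observed number of distinct types after each token of the text."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

# Illustrative parameter choice within the slide's ranges for English.
print(heaps_vocabulary(10_000, K=10, beta=0.5))
print(vocabulary_growth("to be or not to be that is the question".split()))
```

Plotting the observed growth curve against the prediction for a real corpus is a common way to estimate K and β.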
Heaps’ law (cont’d)
• Related to Zipf’s law: generative models
• Zipf’s and Heaps’ law coefficients change with language
Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.
The Benford law
• The first digit of a random number is d with probability $\log_{10}(1 + 1/d)$.
• Number ones are much more frequent than number nines.
• Useful in forensic accounting, political science:
  – Mebane, Walter R., Jr. 2006. “Detecting Attempted Election Theft: Vote Counts, Voting Machines and Benford's Law”
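The first-digit probabilities can be tabulated directly from the formula. A minimal sketch; the function names are hypothetical, and `leading_digit` is restricted to nonzero integers (such as counts) to keep it simple.

```python
import math

def benford_probability(d):
    """Probability that the first digit is d (1..9): log10(1 + 1/d)."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

def leading_digit(n):
    """First digit of a nonzero integer."""
    return int(str(abs(n))[0])

for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

Comparing these probabilities with the observed distribution of `leading_digit` over a set of counts is the basic idea behind the forensic uses mentioned on the slide.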
IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination:
N: number of documents
$d_k$: number of documents containing term k
$f_{ik}$: absolute frequency of term k in document i
$w_{ik}$: weight of term k in document i
$idf_k = \log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$
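A sketch of the IDF formula on a toy collection. The slide defines the weight of term k in document i but does not spell out how it is computed; the code assumes the common TF*IDF product (term frequency times IDF), and that is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter

def idf(n_docs, doc_freq):
    """idf_k = log2(N / d_k) + 1, as on the slide."""
    return math.log2(n_docs / doc_freq) + 1

def tf_idf_weights(documents):
    """Assumed weighting: w_ik = f_ik * idf_k for each term k in document i."""
    n_docs = len(documents)
    # d_k: number of documents that contain term k at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # f_ik: absolute frequency of term k in document i
        weights.append({term: freq * idf(n_docs, doc_freq[term])
                        for term, freq in tf.items()})
    return weights

# Toy collection of four tokenized documents (made up for illustration).
docs = [["deng", "china", "china"], ["china", "nato"], ["nato", "nato"], ["space"]]
for w in tf_idf_weights(docs):
    print(w)
```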
Vector-based matching
• The cosine measure
$$\mathrm{sim}(D, C) = \frac{\sum_k d_k \, c_k \, \mathrm{idf}(k)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$$
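A sketch of one reading of the cosine measure on this slide: the numerator sums the products of the two term weights times idf(k), and the denominator is assumed to be the square root of the product of the two sums of squared weights. Vectors are represented as dictionaries from term to weight; that representation, the function name, and the default idf of 1.0 for unlisted terms are illustrative assumptions.

```python
import math

def cosine_similarity(d, c, idf):
    """IDF-weighted cosine between two term-weight dictionaries d and c."""
    shared = set(d) & set(c)  # only shared terms contribute to the numerator
    numerator = sum(d[k] * c[k] * idf.get(k, 1.0) for k in shared)
    norm = math.sqrt(sum(v * v for v in d.values())
                     * sum(v * v for v in c.values()))
    return numerator / norm if norm else 0.0

doc = {"china": 3.0, "deng": 4.0}
cluster = {"china": 1.0, "nato": 2.0}
print(cosine_similarity(doc, cluster, idf={}))
```

With all idf values equal to 1 this reduces to the ordinary cosine, so a vector compared with itself scores 1 and vectors with no shared terms score 0.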
Asian and European news
Weight   Term
622.941  deng
306.835  china
196.725  beijing
153.608  chinese
152.113  xiaoping
124.591  jiang
108.777  communist
102.894  body
85.173   party
71.898   died
68.820   leader
43.402   state
38.166   people
Weights: 97.487 92.151 74.652 46.657 34.778 33.803 32.571 14.095 9.389 9.154 8.459 6.059
Terms: nato albright belgrade enlargement alliance french opposition russia government told would their which
Other topics
Weights: 120.385 99.487 90.128 70.224 59.992 50.160 49.722 47.782 40.889 35.778 27.063
Terms: shuttle space telescope hubble rocket astronauts discovery canaveral cape mission florida center
Weight   Term
74.652   compuserve
65.321   massey
55.989   salizzoni
29.996   bob
27.994   online
27.198   executive
15.890   interim
15.271   chief
11.647   service
11.174   second
6.781    world
6.315    president
Readings
• 2: MRS 9
• 3: MRS 13, MRS 14