- Number of slides: 40
Information Retrieval (3) Prof. Dragomir R. Radev radev@umich.edu
SI 650 Winter 2010 … 5. Evaluation of IR systems Reference collections TREC …
Relevance • Difficult to change: fuzzy, inconsistent • Methods: exhaustive, sampling, pooling, search-based
Contingency table
                 retrieved     not retrieved
relevant         w = tp        x = fn          n1 = w + x
not relevant     y = fp        z = tn
                 n2 = w + y                    N
Precision and Recall • Recall: w / (w + x) • Precision: w / (w + y)
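The two ratios can be checked with a few lines of Python. This is a minimal sketch: the function name and the counts in the usage lines are made up for illustration, and only the letters w, x, y come from the contingency table.

```python
def precision_recall(w, x, y):
    """Return (precision, recall) from contingency counts.

    w = relevant and retrieved (tp)
    x = relevant but not retrieved (fn)
    y = retrieved but not relevant (fp)
    """
    precision = w / (w + y) if (w + y) else 0.0  # share of retrieved docs that are relevant
    recall = w / (w + x) if (w + x) else 0.0     # share of relevant docs that were retrieved
    return precision, recall


# Hypothetical counts: 30 relevant retrieved, 20 relevant missed, 10 non-relevant retrieved
p, r = precision_recall(w=30, x=20, y=10)
print(p, r)
```

Note that z (the true negatives) does not appear in either ratio.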
Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by the search engine. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.
[From Salton’s book]
Interpolated average precision (e.g., 11 pt) Interpolation – what is precision at recall=0.5?
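One common way to answer the interpolation question is to take, at each recall level, the highest precision observed at that recall or any higher one. The sketch below assumes that definition; the function name and the toy ranking are illustrative only.

```python
def interpolated_precision_11pt(ranked_relevance, total_relevant):
    """Return (interpolated precisions at recall 0.0, 0.1, ..., 1.0, their average).

    ranked_relevance: list of booleans, True if the document at that rank is relevant.
    total_relevant:   number of relevant documents for the query.
    """
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated, sum(interpolated) / len(interpolated)


# Hypothetical ranking: relevant documents at ranks 1 and 3, two relevant in total
interp, avg = interpolated_precision_11pt([True, False, True, False, False], 2)
print(interp[5], avg)  # interp[5] is the interpolated precision at recall = 0.5
```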
Issues • Why not use accuracy A = (w+z)/N? • Average precision • Average P at given “document cutoff values” • Report when P = R • F measure: F = (b^2 + 1)PR / (b^2 P + R) • F1 measure: F1 = 2/(1/R + 1/P): harmonic mean of P and R
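A small sketch of the F measure and of why accuracy can mislead. The function names and the collection sizes are assumptions made for illustration; w, x, y, z follow the contingency table.

```python
def f_measure(p, r, b=1.0):
    """F = (b^2 + 1) * P * R / (b^2 * P + R); b = 1 gives the harmonic mean F1."""
    if p == 0 and r == 0:
        return 0.0
    return (b * b + 1) * p * r / (b * b * p + r)


def accuracy(w, x, y, z):
    """A = (w + z) / N, with N = w + x + y + z."""
    return (w + z) / (w + x + y + z)


# Hypothetical collection: 1000 documents, 10 relevant, and a system that retrieves nothing.
# Accuracy is dominated by the 990 true negatives even though recall is 0.
lazy_accuracy = accuracy(w=0, x=10, y=0, z=990)
f1 = f_measure(0.75, 0.6)
print(lazy_accuracy, f1)
```

Because relevant documents are usually a tiny fraction of the collection, a system that returns nothing still scores high on accuracy, which is one reason P, R, and F are preferred.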
Kappa • N: number of items (index i) • n: number of categories (index j) • k: number of annotators
Kappa example
         J1+    J1-    TOTAL
J2+      300    10     310
J2-      20     70     90
TOTAL    320    80     400
Kappa (cont’d) • P(A) = 370/400 = 0.925 • P(-) = (10+20+70+70)/800 = 0.2125 • P(+) = (10+20+300+300)/800 = 0.7875 • P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665 • K = (0.925 - 0.665)/(1 - 0.665) = 0.776 • Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
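The arithmetic can be reproduced in a few lines. This sketch assumes two judges, two categories, and pooled marginals (both judges' totals are added before dividing by 2N), which is how the numbers above are obtained; the function and argument names are illustrative.

```python
def kappa_two_judges(both_pos, j2pos_j1neg, j2neg_j1pos, both_neg):
    """Kappa for two judges and two categories, using pooled marginals."""
    n = both_pos + j2pos_j1neg + j2neg_j1pos + both_neg
    p_agree = (both_pos + both_neg) / n  # P(A): observed agreement

    # Pooled marginals: count each judge's decisions, divide by 2N
    p_pos = (2 * both_pos + j2pos_j1neg + j2neg_j1pos) / (2 * n)
    p_neg = (2 * both_neg + j2pos_j1neg + j2neg_j1pos) / (2 * n)

    p_expected = p_pos ** 2 + p_neg ** 2  # P(E): agreement expected by chance
    return (p_agree - p_expected) / (1 - p_expected)


# Counts from the Kappa example table: 300, 10, 20, 70
k = kappa_two_judges(both_pos=300, j2pos_j1neg=10, j2neg_j1pos=20, both_neg=70)
print(round(k, 3))
```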
Sample TREC query
March 16, 1989, Thursday, Home Edition
Business; Part 4; Page 1; Column 5; Financial Desk
586 words
AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS
By LINDA WILLIAMS, Times Staff Writer
The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws.
Several Fatalities
However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai.
Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said.
TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY
TREC (cont’d) • http://trec.nist.gov/tracks.html • http://trec.nist.gov/presentations.html
Most used reference collections • Generic retrieval: OHSUMED, CRANFIELD, CACM • Text classification: Reuters, 20 newsgroups • Question answering: TREC-QA • Web: DOTGOV, wt100g • Blogs: Buzzmetrics datasets • TREC ad hoc collections, 2-6 GB • TREC Web collections, 2-100 GB
Comparing two systems • Comparing A and B • One query? • Average performance? • Need: A to consistently outperform B [this slide: courtesy James Allan]
The sign test • Example 1: – A > B (12 times) – A = B (25 times) – A < B (3 times) – p ≈ 0.035 (significant at the 5% level) • Example 2: – A > B (18 times) – A < B (9 times) – p ≈ 0.122 (not significant at the 5% level) • http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html [this slide: courtesy James Allan]
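Both p-values can be reproduced with an exact binomial computation. The sketch below assumes a two-sided test in which ties are dropped and each remaining query is a fair coin flip under the null hypothesis; the function name is illustrative.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided sign test p-value; ties (A = B) are excluded before calling."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of seeing k or fewer wins for the weaker system under p = 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the tail for a two-sided test


p1 = sign_test_p(12, 3)   # Example 1 (the 25 ties are ignored)
p2 = sign_test_p(18, 9)   # Example 2
print(round(p1, 4), round(p2, 4))
```

Only the direction of each difference is used, so a large win and a tiny win count the same.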
Other tests • Student t-test: takes into account the actual performances, not just which system is better – http://www.fon.hum.uva.nl/Service/Statistics/Student_t_Test.html – http://www.socialresearchmethods.net/kb/stat_t.php • Wilcoxon Matched-Pairs Signed-Ranks Test – http://www.fon.hum.uva.nl/Service/Statistics/Signed_Rank_Test.html
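To illustrate how the t-test uses the size of each difference, here is a sketch of the paired t statistic using only the standard library. The per-query scores are invented, and the p-value lookup (t distribution with n - 1 degrees of freedom) is left to a table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for per-query differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical average-precision scores for systems A and B on three queries
scores_a = [0.5, 0.6, 0.7]
scores_b = [0.4, 0.4, 0.4]
t = paired_t_statistic(scores_a, scores_b)
print(round(t, 3))  # compare against the t distribution with n - 1 = 2 degrees of freedom
```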
IR Winter 2010 … 6. Automated indexing/labeling Compression …
Indexing methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic: e.g., TF*IDF based
LOC subject headings
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
http://www.loc.gov/catdir/cpso/lcco.html
Medicine
CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine
Automatic methods • TF*IDF: pick terms with the highest TF*IDF scores • Centroid-based: pick terms that appear in the centroid with high scores • The maximal marginal relevance principle (MMR) • Related to summarization, snippet generation
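The first bullet can be sketched as follows. This is one simple TF*IDF variant (raw term frequency times the natural log of N/df), chosen here for illustration; the toy documents, whitespace tokenization, and function name are all assumptions.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest TF*IDF scores."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Highest score first; ties broken alphabetically so the output is stable
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical three-document collection
docs = [
    "the ford bronco rolls over",
    "the suzuki samurai study",
    "the ford study of the bronco bronco",
]
labels = top_tfidf_terms(docs, 0, k=2)
print(labels)
```

A term that occurs in every document (here "the") gets an IDF of zero and is never picked as a label.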
Compression • Methods – Fixed length codes – Huffman coding – Ziv-Lempel codes
Fixed length codes • Binary representations – ASCII – Representational power (2^k symbols where k is the number of bits)
Variable length codes
• Alphabet:
A .-    B -...  C -.-.  D -..   E .     F ..-.  G --.   H ....  I ..
J .---  K -.-   L .-..  M --    N -.    O ---   P .--.  Q --.-  R .-.
S ...   T -     U ..-   V ...-  W .--   X -..-  Y -.--  Z --..
0 -----  1 .----  2 ..---  3 ...--  4 ....-
5 .....  6 -....  7 --...  8 ---..  9 ----.
• Demo: – http://www.scphillips.com/morse/
Most frequent letters in English • Most frequent letters: – E T A O I N S H R D L U • Demo: – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm • Also: bigrams: – TH HE IN ER AN RE ND AT ON NT
Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
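The tree-building step can be sketched with a priority queue. This is a minimal illustration, not a production codec: the function names are invented, and the sample text is reused from the Ziv-Lempel example later in the deck. Tie-breaking between equal weights is arbitrary, so the exact codes may differ from any hand-built tree.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a {symbol: bit string} table by repeatedly merging the two lightest subtrees."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # least frequent subtree
        w2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_decode(bits, codes):
    """Decode by reading bits until they match a code word (codes are prefix-free)."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "peter_piper_picked_a_peck_of_pickled_peppers"
codes = huffman_code(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded), "bits vs", 8 * len(text), "bits at 8 bits per character")
```

Flipping or dropping a single bit in `encoded` before decoding is a quick way to see how an error propagates through a variable-length code.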
[Figure: Huffman code tree over the symbols a-j, with branches labeled 0 and 1]
Exercise • Consider the bit string: 011011011110001001100011101001110001101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.
Extensions • Word-based • Domain/genre dependent models
Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a, b, c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment
• Triples: <0, 0, p> <0, 0, e> <0, 0, t> <2, 1, r> <0, 0, _> <6, 1, i> <8, 2, r> <6, 3, c> <0, 0, k> <7, 1, d> <7, 1, a> <9, 2, e> <9, 2, _> <0, 0, o> <0, 0, f> <17, 5, l> <12, 1, d> <16, 3, p> <3, 2, r> <0, 0, s>
• Decoded text at successive stages:
p
pe
peter_pi
peter_piper_picked
peter_piper_picked_a_peck_o
peter_piper_picked_a_peck_of_pickled_pep
peter_piper_picked_a_peck_of_pickled_peppers
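Decoding the triples is mechanical: go back a characters, copy b characters, then append c. The sketch below applies that rule to the triples from this example; the function name is illustrative, and real LZ77 implementations such as the one in GZIP add a bounded window and a different output encoding.

```python
def lz77_decode(triples):
    """Decode (a, b, c) triples: go back a characters, copy b characters, then append c."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # copy from text that is already decoded
        out.append(ch)
    return "".join(out)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
decoded = lz77_decode(triples)
print(decoded)  # peter_piper_picked_a_peck_of_pickled_peppers
```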
Links on text compression • Data compression: – http://www.data-compression.info/ • Calgary corpus: – http://links.uwaterloo.ca/calgary.corpus.html • Huffman coding: – http://www.compressconsult.com/huffman/ – http://en.wikipedia.org/wiki/Huffman_coding • LZ – http://en.wikipedia.org/wiki/LZ77
100 alternative search engines • http://rss.slashdot.org/~r/Slashdot/slashdot/~3/83468703/article.pl
Readings • 2: MRS 9 • 3: MRS 13, MRS 14 • 4: MRS 15, MRS 16


