

  • Number of slides: 56

January 7, 2005 - Information Retrieval, Handout #1 - (C) 2003, The University of Michigan

Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: TBA
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

Introduction

IR systems
• Google
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
• Nutch

Examples of IR systems
• Conventional (library catalog). Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.


Need for IR
• Advent of WWW - more than 4 billion documents indexed on Google
• How much information? 200 TB according to Lyman and Varian 2003. http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User's information need

Some definitions of Information Retrieval (IR)
Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests."
Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multimedia objects."

Syllabus (Part I)
• Introduction. Information Needs and Queries.
• Document preprocessing. Stemming. Document representations. TF*IDF. Indexing and Searching. Inverted indexes.
• IR Models. The Vector model. The Boolean model.
• Retrieval Evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
• Queries and Documents. Query Languages. Natural language querying.
• Word distributions. The Zipf distribution.
• Relevance feedback and query expansion.
• Approximate matching.
• Compression.
• Vector space similarity and clustering. k-means clustering.

Syllabus (Part II)
• Document classification. k-nearest neighbors. Naive Bayes. Support vector machines.
• Singular value decomposition and Latent Semantic Indexing.
• Probabilistic models. Document models. Language models.
• Crawling the Web. Hyperlink analysis. Measuring the Web.
• Hypertext retrieval. Web-based IR.
• Social network analysis for IR. Hubs and authorities. PageRank and HITS.
• Focused crawling. Resource discovery. Discovering communities.
• Collaborative filtering.
• Information extraction using Hidden Markov Models.
• Additional topics, e.g., relevance transfer, XML retrieval, text tiling, text summarization, question answering.

Readings
Books:
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1
Papers (tentative list):
• Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
• Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
• Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
• Bush "As we may think" The Atlantic Monthly 1945
• Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
• Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
• Davison "Topical locality on the Web" SIGIR 2000
• Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
• Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990

Readings
• Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
• Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
• Haveliwala "Topic-sensitive pagerank" WWW 2002
• Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000
• Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
• Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
• Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
• Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
• Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
• Singhal "Modern Information Retrieval: an Overview" IEEE 2001

Assignments
Homeworks: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.
Project: The final course project can be done in three different formats: (1) a programming project implementing a challenging and novel information retrieval application, (2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or (3) a SIGIR-style experimental IR paper.

Grading
• Three HW assignments (30%)
• Project (30%)
• Final (40%)

Sample queries (from Excite)
• In what year did baseball become an offical sport?
• play station codes.com
• birth control and depression
• government "WorkAbility I"+conference
• kitchen appliances
• where can I find a chines rosewood
• tiger electronics
• 58 Plymouth Fury
• How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
• emeril Lagasse
• Hubble
• M. S Subalaksmi
• running

Types of queries (AltaVista)
Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the key word in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box.
Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin.
Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings.
Example: Try wish*, to find wish, wishes, wishful, wishbone, and wishy-washy.

Types of queries
AND (&) Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.
OR (|) Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.
NOT (!) Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone--use it with another operator, like AND.
NEAR (~) Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.

Mappings and abstractions
Reality → Data
Information need → Query
(From Korfhage's book)

Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface

Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query

Other important terms
• Classification
• Cluster
• Similarity
• Information Extraction
• Term Frequency
• Inverse Document Frequency
• Precision
• Recall
• Inverted File
• Query Expansion
• Relevance Feedback
• Stemming
• Stopword
• Vector Space Model
• Weighting
• TREC/TIPSTER/MUC

Query structures
• Query viewed as a document?
  – Length
  – repetitions
  – syntactic differences
• Types of matches:
  – exact
  – range
  – approximate

Additional references on IR
• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries

Related courses elsewhere
• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) http://www.stanford.edu/class/cs276a/
• Cornell (Jon Kleinberg) http://www.cs.cornell.edu/Courses/cs685/2002fa/
• CMU (Yiming Yang and Jamie Callan) http://krakow.lti.cs.cmu.edu/classes/11-741/2004/index.html
• UMass (James Allan) http://ciir.cs.umass.edu/cmpsci646/
• UTexas (Ray Mooney) http://www.cs.utexas.edu/users/mooney/ir-course/
• Illinois (Chengxiang Zhai) http://sifaka.cs.uiuc.edu/course/498cxz04f/
• Johns Hopkins (David Yarowsky) http://www.cs.jhu.edu/~yarowsky/cs466.html

Readings for weeks 1 – 3
• MIR (Modern Information Retrieval)
  – Week 1
    • Chapter 1 "Introduction"
    • Chapter 2 "Modeling"
    • Chapter 3 "Evaluation"
  – Week 2
    • Chapter 4 "Query languages"
    • Chapter 5 "Query operations"
  – Week 3
    • Chapter 6 "Text and multimedia languages"
    • Chapter 7 "Text operations"
    • Chapter 8 "Indexing and searching"

Documents

Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• Bag of words method
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.

Document preprocessing
• Formatting
• Tokenization (Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
• Casing (cat vs. CAT)
• Stemming (computer, computation)
• Soundex
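
A minimal sketch of the tokenization and case-folding steps above. The regex and the decision to lowercase everything are illustrative choices, not the course's prescribed pipeline; note how the naive rule mishandles exactly the cases the slide flags:

```python
import re

def tokenize(text):
    # Lowercase, then split on runs of non-alphanumeric characters.
    # This naive rule breaks "Paul's" into "paul" and "s", and splits
    # "555-1212" into two tokens -- the kinds of decisions real
    # tokenizers must make deliberately.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Paul's house on Willow Dr., New York: 555-1212"))
```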

Document representations
• Term-document matrix (m x n)
• term-term matrix (m x m)
• document-document matrix (n x n)
• Example: 3,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices

Document representations
• Term-document matrix
  – Evaluating queries (e.g., (A B) C)
  – Storage issues
• Inverted files
  – Storage issues
  – Evaluating queries
  – Advantages and disadvantages
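
A small sketch of an inverted file over a toy collection (the documents and function name are made up for illustration). Compared with a full term-document matrix, only the nonzero entries are stored, which is what makes the structure attractive for sparse collections:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "computer information retrieval",
        2: "computer retrieval",
        3: "information"}
index = build_inverted_index(docs)
print(index["retrieval"])   # postings list for "retrieval"
```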

Additional issues
• Dealing with phrases?
• Proximity search
• Synonyms?

Porter's algorithm
Example: the word "duplicatable":
duplicatable → duplicat (rule 4)
duplicat → duplicate (rule 1b1)
duplicate → duplic (rule 3)
The application of another rule in step 4, removing "ic," cannot be applied since one rule from each step is allowed to be applied.

Porter's algorithm [figure: rule table]
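
A highly simplified suffix-stripping sketch in the spirit of Porter's algorithm, just to show the "at most one rule per step" control flow on the "duplicatable" example. These three rules are a toy subset of my own choosing; the real algorithm has five ordered steps with measure-based conditions:

```python
def toy_stem(word):
    """Apply at most one rule per 'step', mimicking Porter's control flow."""
    # Step-4-style rule: strip a long derivational suffix.
    for suffix in ("able", "ment", "ness"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[:-len(suffix)]
            break
    # Step-1b1-style rule: tidy the stem (here: restore a final 'e'
    # after '-at', as in duplicat -> duplicate).
    if word.endswith("at"):
        word += "e"
    # Step-3-style rule: strip '-ate'.
    if word.endswith("ate") and len(word) - 3 >= 4:
        word = word[:-3]
    return word

print(toy_stem("duplicatable"))  # duplicat -> duplicate -> duplic
```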

Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
Q' = a1·Q + a2·R - a3·N
Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
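
The update above (the Rocchio formulation) can be sketched componentwise as follows; the toy vocabulary, vectors, and weights are illustrative:

```python
def rocchio(query, relevant, nonrelevant, a1=1.0):
    """Q' = a1*Q + (1/|R|)*sum(R) - (1/|N|)*sum(N), per vector component."""
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    dims = len(query)
    sum_r = [sum(d[i] for d in relevant) for i in range(dims)]
    sum_n = [sum(d[i] for d in nonrelevant) for i in range(dims)]
    return [a1 * query[i] + a2 * sum_r[i] - a3 * sum_n[i] for i in range(dims)]

# Toy 3-term vocabulary; one relevant and one nonrelevant document.
q_new = rocchio([1, 1, 0], relevant=[[1, 0, 1]], nonrelevant=[[0, 1, 1]])
print(q_new)  # terms in relevant docs gain weight, others lose it
```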

Example
• Q = "safety minivans"
• D1 = "car safety minivans tests injury statistics" - relevant
• D2 = "liability tests safety" - relevant
• D3 = "car passengers injury reviews" - nonrelevant
• R = ?
• S = ?
• Q' = ?

Approximate string matching
• The Soundex algorithm (Odell and Russell)
• Uses:
  – spelling correction
  – hash function
  – non-recoverable

The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v : 1
   c, g, j, k, q, s, x, z : 2
   d, t : 3
   l : 4
   m, n : 5
   r : 6

The Soundex algorithm
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form "LDDD" by adding terminal zeros or by dropping rightmost digits
Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300 - same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
Some problems: Rogers and Rodgers, Sinclair and St. Clair
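
Steps 1-4 above can be sketched directly. This follows the slide's simpler adjacency rule (any uncoded letter breaks a run); standard Soundex implementations additionally treat h and w as non-breaking:

```python
def soundex(name):
    """Soundex code as described in steps 1-4 above."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        d = codes.get(ch, "")
        if d and d != prev:   # step 3: collapse adjacent same-code letters
            digits.append(d)
        prev = d              # an uncoded letter (vowel, h, w, y) breaks adjacency
    # Steps 1 and 4: keep the first letter, pad/truncate to LDDD.
    return name[0].upper() + "".join(digits + ["0", "0", "0"])[:3]

for n in ("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Ellery"):
    print(n, soundex(n))
```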

IR models

Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy retrieval
• Latent semantic indexing

Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval

Venn diagrams [figure: two overlapping document sets D1 and D2, regions labeled x, y, z, w]

Boolean model [figure: Venn diagram of sets A and B]

Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
  dinner AND sports AND symphony
  4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
  coffee AND croissant OR muffin
  raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF

Transformations
• De Morgan's Laws:
  NOT (A AND B) = (NOT A) OR (NOT B)
  NOT (A OR B) = (NOT A) AND (NOT B)
• CNF or DNF?
  – Reference librarians prefer CNF - why?
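
De Morgan's Laws can be checked directly on postings sets; the universe and the two postings lists below are arbitrary illustrative data:

```python
# Universe of document ids, plus postings for two terms A and B.
universe = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# NOT (A AND B) == (NOT A) OR (NOT B)
assert universe - (A & B) == (universe - A) | (universe - B)
# NOT (A OR B) == (NOT A) AND (NOT B)
assert universe - (A | B) == (universe - A) & (universe - B)
print("De Morgan's Laws hold on this collection")
```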

Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses

Exercise
• D1 = "computer information retrieval"
• D2 = "computer retrieval"
• D3 = "information"
• D4 = "computer information"
• Q1 = "information retrieval"
• Q2 = "information ¬computer"
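
One way to work through the exercise is to evaluate the queries against an inverted index. Reading Q1 as a conjunction and ¬ as NOT is an assumption about the intended Boolean semantics:

```python
docs = {1: "computer information retrieval",
        2: "computer retrieval",
        3: "information",
        4: "computer information"}

# Inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)
q1 = index["information"] & index["retrieval"]              # Q1: information AND retrieval
q2 = index["information"] & (all_docs - index["computer"])  # Q2: information AND NOT computer
print(sorted(q1), sorted(q2))
```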

Exercise
0  (none)
1  Swift
2  Shakespeare
3  Shakespeare, Swift
4  Milton
5  Milton, Swift
6  Milton, Shakespeare
7  Milton, Shakespeare, Swift
8  Chaucer
9  Chaucer, Swift
10 Chaucer, Shakespeare
11 Chaucer, Shakespeare, Swift
12 Chaucer, Milton
13 Chaucer, Milton, Swift
14 Chaucer, Milton, Shakespeare
15 Chaucer, Milton, Shakespeare, Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
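
The sixteen rows enumerate every subset of the four authors, the row number read as a 4-bit pattern. A sketch that evaluates the query over all of them (the bit-to-author assignment is inferred from the table):

```python
AUTHORS = ("swift", "shakespeare", "milton", "chaucer")  # bits 0..3

def row_terms(n):
    """Authors present in row n, reading n as a 4-bit pattern."""
    return {a for i, a in enumerate(AUTHORS) if n >> i & 1}

def matches(terms):
    c, m, sh, sw = ("chaucer" in terms, "milton" in terms,
                    "shakespeare" in terms, "swift" in terms)
    # ((chaucer OR milton) AND NOT swift) OR ((NOT chaucer) AND (swift OR shakespeare))
    return ((c or m) and not sw) or ((not c) and (sw or sh))

print([n for n in range(16) if matches(row_terms(n))])
```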

Stop lists
• 250-300 most common words in English account for 50% or more of a given text.
• Example: "the" and "of" represent 10% of tokens. "and", "to", "a", and "in" - another 10%. Next 12 words - another 10%.
• Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
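
The type/token bookkeeping above is easy to reproduce on any text; the sample sentence below is just a stand-in for the Moby Dick chapter:

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "and the dog barks at the fox")
tokens = text.split()
types = Counter(tokens)

print("tokens:", len(tokens))        # total word occurrences
print("types:", len(types))          # distinct words
print("token/type ratio:", round(len(tokens) / len(types), 2))
print("most common:", types.most_common(3))
```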

Vector models [figure: documents (Doc 2, Doc 3) as vectors along axes Term 1, Term 2, Term 3]

Vector queries
• Each document is represented as a vector
• non-efficient representations (bit vectors)
• dimensional compatibility

The matching process
• Document space
• Matching is done between a document and a query (or between two documents)
• distance vs. similarity
• Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.

Miscellaneous similarity measures
• The Cosine measure:
  sim(D, Q) = Σ (d_i × q_i) / sqrt(Σ (d_i)² × Σ (q_i)²)
  (set form: |X ∩ Y| / sqrt(|X| × |Y|))
• The Jaccard coefficient:
  sim(D, Q) = |X ∩ Y| / |X ∪ Y|

Exercise
• Compute the cosine measures (D1, D2) and (D1, D3) for the documents: D1 = <1, 3>, D2 = <100, 300> and D3 = <3, 1>
• Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
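
A sketch for checking the exercise numerically. The weighted min/max form of Jaccard is one common choice for non-binary vectors; the slides may intend the set version instead:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_weighted(x, y):
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print("cos(D1,D2) =", cosine(D1, D2))   # parallel vectors -> 1.0
print("cos(D1,D3) =", cosine(D1, D3))
print("euclid(D1,D2) =", euclidean(D1, D2))
print("manhattan(D1,D2) =", manhattan(D1, D2))
print("jaccard(D1,D3) =", jaccard_weighted(D1, D3))
```

Note that D1 and D2 point in the same direction, so their cosine is 1.0 even though their Euclidean distance is large - the classic argument for length-invariant similarity.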