
Information Retrieval: Introduction/Overview
Material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  http://www.sims.berkeley.edu/~hearst/irbook/
- Data Mining Introductory and Advanced Topics by Margaret H. Dunham
  http://www.engr.smu.edu/~mhd/book

Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
  - Library Science
  - Digital Libraries
  - Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about “data mining”.

DB vs IR
- Records (tuples) vs. documents
- Well-defined results vs. fuzzy results
- DB grew out of files and traditional business systems
- IR grew out of library science and the need to categorize/group/access books/articles

DB vs IR (cont’d)
- Data retrieval
  - which docs contain a set of keywords?
  - well-defined semantics
  - a single erroneous object implies failure!
- Information retrieval
  - information about a subject or topic
  - semantics is frequently loose
  - small errors are tolerated
- IR system:
  - interpret contents of information items
  - generate a ranking which reflects relevance
  - notion of relevance is most important

Motivation
- IR in the last 20 years:
  - classification and categorization
  - systems and languages
  - user interfaces and visualization
- Still, the area was seen as of narrow interest
- Advent of the Web changed this perception once and for all
  - universal repository of knowledge
  - free (low cost) universal access
  - no central editorial board
  - many problems though: IR seen as key to finding the solutions!

Basic Concepts
[Figure: logical view of the documents. Labels: docs; structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; full text; index terms.]
- Document representation viewed as a continuum: logical view of docs might shift

The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flow labels: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]

IR is Fuzzy
[Figure: two panels. Labels: Simple, Fuzzy, Accept, Reject.]

Indexing
- IR systems usually adopt index terms to process queries
- Index term:
  - a keyword or group of selected words
  - any word (more general)
- Stemming might be used:
  - connect: connecting, connections
- An inverted file is built for the chosen index terms

Indexing (cont’d)
[Figure: indexing. Labels: docs, index terms, doc, information need, query, match, ranking.]

Inverted Files
- There are two main elements:
  - vocabulary: set of unique terms
  - occurrences: where those terms appear
- The occurrences can be recorded as term offsets or byte offsets
- Term offsets are good for retrieving concepts such as proximity, whereas byte offsets allow direct access
[Figure: Vocabulary … Occurrences (byte offset) …]

Inverted Files (cont’d)
- The number of indexed terms is often several orders of magnitude smaller than the size of the documents (MBs vs. GBs)
- The space consumed by the occurrence list is not trivial. Each time a term appears, it must be added to a list in the inverted file
- That may lead to a considerable index overhead

Example
- Text (the number in parentheses is the starting position of each word):
  That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70)
- Inverted file:
  Vocabulary   Occurrences
  beautiful    70
  flowers      45, 58
  garden       18, 29
  house        6
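A minimal sketch of how such an inverted file could be built, assuming a small hand-picked stopword list (the slide does not give one; `STOPWORDS` is chosen here so that the vocabulary comes out as the four terms shown). The function name `build_inverted_file` is hypothetical. Offsets are 0-based character positions, so the first-sentence entries come out one lower than the slide's figures (house 5 rather than 6, garden 17 rather than 18), while the later entries agree.

```python
import re
from collections import defaultdict

# Assumed stopword list, chosen so the vocabulary matches the slide's four terms.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

TEXT = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")

def build_inverted_file(text, stopwords=STOPWORDS):
    """Map each non-stopword term to the 0-based offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in stopwords:
            occurrences[term].append(match.start())
    # Sort the vocabulary alphabetically, as in the slide's table.
    return dict(sorted(occurrences.items()))

inverted_file = build_inverted_file(TEXT)
for term, offsets in inverted_file.items():
    print(term, offsets)
```

Recording word positions instead of character offsets would only require counting tokens rather than calling `match.start()`, which is the variant that proximity and phrase queries rely on.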

Ranking
- A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query
- A ranking is based on fundamental premises regarding the notion of relevance, such as:
  - common sets of index terms
  - sharing of weighted terms
  - likelihood of relevance
- Each set of premises leads to a distinct IR model

Classic IR Models - Basic Concepts
- Each document is represented by a set of representative keywords or index terms
- An index term is a document word useful for remembering the document's main themes
- Usually, index terms are nouns because nouns have meaning by themselves
- However, search engines assume that all words are index terms (full text representation)

Classic IR Models - Basic Concepts (cont’d)
- The importance of the index terms is represented by weights associated with them
  - $k_i$: an index term
  - $d_j$: a document
  - $w_{ij}$: a weight associated with $(k_i, d_j)$
- The weight $w_{ij}$ quantifies the importance of the index term for describing the document contents

Classic IR Models - Basic Concepts (cont’d)
- $t$ is the total number of index terms
- $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
- $w_{ij} \geq 0$ is a weight associated with $(k_i, d_j)$
- $w_{ij} = 0$ indicates that the term does not belong to the doc
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$
- $g_i(\vec{d}_j) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$

The Boolean Model
- Simple model based on set theory
- Queries specified as boolean expressions
  - precise semantics and neat formalism
- Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
- Consider
  - $q = k_a \wedge (k_b \vee \neg k_c)$
  - $\vec{q}_{dnf} = (1, 1, 1) \vee (1, 1, 0) \vee (1, 0, 0)$
  - $\vec{q}_{cc} = (1, 1, 0)$ is a conjunctive component
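The example can be checked mechanically. The sketch below assumes the query reads ka AND (kb OR NOT kc), which is the reading consistent with the three conjunctive components listed on the slide. It enumerates all binary weight vectors over (ka, kb, kc), keeps those that satisfy the query, and scores a document 1 if its vector equals one of them and 0 otherwise. The function names are hypothetical.

```python
from itertools import product

def query(ka, kb, kc):
    """Assumed reading of the slide's example: ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)

def dnf_components(predicate, n_terms):
    """List the binary weight vectors (conjunctive components) that satisfy the query."""
    return [bits for bits in product((1, 0), repeat=n_terms)
            if predicate(*map(bool, bits))]

def boolean_sim(doc_weights, components):
    """Return 1 if the document's binary vector equals some component, else 0."""
    return 1 if tuple(doc_weights) in components else 0

Q_DNF = dnf_components(query, 3)
print(Q_DNF)                          # conjunctive components of the query
print(boolean_sim((1, 1, 0), Q_DNF))  # document containing ka and kb but not kc
print(boolean_sim((0, 1, 0), Q_DNF))  # document missing ka
```

A document either matches or it does not; there is no partial score, which is the limitation the vector model addresses.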

The Vector Model
- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- A ranked set of documents provides for better matching

The Vector Model (cont’d)
- $w_{ij} > 0$ whenever $k_i$ appears in $d_j$
- $w_{iq} \geq 0$ is associated with the pair $(k_i, q)$
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
- $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$
- To each term $k_i$ is associated a unitary vector $\vec{i}$
- The unitary vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
- The $t$ unitary vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space where queries and documents are represented as weighted vectors
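The slides do not name a similarity formula at this point. A common choice, assumed here, is the cosine of the angle between the document and query vectors. The sketch below ranks three hypothetical documents over a three-term vocabulary; the weights are made up for illustration.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two weight vectors (assumed similarity measure)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

def rank(docs, q):
    """Return (name, score) pairs ordered from most to least similar."""
    scored = [(name, cosine_similarity(vec, q)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights over a vocabulary of t = 3 index terms.
docs = {
    "d1": [1.0, 0.0, 2.0],
    "d2": [0.0, 3.0, 0.0],
    "d3": [1.0, 1.0, 1.0],
}
q = [1.0, 0.0, 1.0]

ranking = rank(docs, q)
for name, score in ranking:
    print(name, round(score, 3))
```

Unlike the Boolean model, d3 still receives a nonzero score even though it contains a term the query does not mention, which is the partial matching described above.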

Query Languages
- Keyword Based
- Boolean
- Weighted Boolean
- Context Based (Phrasal & Proximity)
- Pattern Matching
- Structural Queries

Keyword Based Queries
- Basic Queries
  - Single word
  - Multiple words
- Context Queries
  - Phrase
  - Proximity

Boolean Queries
- Keywords combined with Boolean operators:
  - OR: (e1 OR e2)
  - AND: (e1 AND e2)
  - BUT: (e1 BUT e2). Satisfy e1 but not e2
- Negation is only allowed using BUT, to allow efficient use of the inverted index by filtering another efficiently retrievable set.
- Naïve users have trouble with Boolean logic.

Boolean Retrieval with Inverted Indices
- Primitive keyword: Retrieve containing documents using the inverted index.
- OR: Recursively retrieve e1 and e2 and take the union of the results.
- AND: Recursively retrieve e1 and e2 and take the intersection of the results.
- BUT: Recursively retrieve e1 and e2 and take the set difference of the results.
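The four rules above map directly onto a recursive function over set operations. This is a minimal sketch, assuming queries are written as nested tuples `(operator, e1, e2)` and the inverted index maps each keyword to a set of document ids. Both the representation and the toy `INDEX` are hypothetical.

```python
# Hypothetical toy inverted index: keyword -> set of document ids.
INDEX = {
    "data": {1, 2, 4},
    "mining": {2, 4, 5},
    "web": {3, 4},
}

def retrieve(expr, index):
    """Evaluate a query: a keyword string or a tuple (op, e1, e2)."""
    if isinstance(expr, str):
        # Primitive keyword: look it up in the inverted index.
        return set(index.get(expr, set()))
    op, e1, e2 = expr
    left, right = retrieve(e1, index), retrieve(e2, index)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference: satisfy e1 but not e2
    raise ValueError(f"unknown operator: {op}")

print(retrieve(("AND", "data", "mining"), INDEX))
print(retrieve(("BUT", ("OR", "data", "mining"), "web"), INDEX))
```

Because BUT always subtracts from a set that was itself retrieved through the index, the evaluator never has to enumerate "all documents not containing a term".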

Phrasal Queries
- Retrieve documents with a specific phrase (ordered list of contiguous words)
  - “information theory”
- May allow intervening stop words and/or stemming.
  - “buy camera” matches: “buy a camera”, “buying the cameras”, etc.

Phrasal Retrieval with Inverted Indices
- Must have an inverted index that also stores the positions of each keyword in a document.
- Retrieve documents and positions for each individual word, intersect the documents, and then finally check for ordered contiguity of keyword positions.
- Best to start the contiguity check with the least common word in the phrase.
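The three steps above can be sketched as follows, assuming whitespace tokenization and a positional index of the form term -> {doc_id: [word positions]}. The function names and sample documents are hypothetical, and stop-word skipping and stemming are left out.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [word positions]} from {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(phrase, index):
    """Return ids of documents containing the words of `phrase` contiguously, in order."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Intersect the document sets of the individual words.
    candidates = set.intersection(*(set(index[w]) for w in words))
    # Anchor the contiguity check on the least common word in the phrase.
    anchor = min(range(len(words)),
                 key=lambda i: sum(len(p) for p in index[words[i]].values()))
    hits = set()
    for doc_id in candidates:
        for pos in index[words[anchor]][doc_id]:
            start = pos - anchor  # where the phrase would have to begin
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "information theory is a branch of mathematics",
    2: "theory of information retrieval",
    3: "the theory and information",
}
index = build_positional_index(docs)
print(phrase_search("information theory", index))
```

Anchoring on the rarest word keeps the number of candidate start positions small, since every other word only needs a membership test at a fixed offset from the anchor.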

Proximity Queries
- List of words with specific maximal distance constraints between terms.
- Example: “dogs” and “race” within 4 words match “…dogs will begin the race…”
- May also perform stemming and/or not count stop words.
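A minimal sketch of the distance check, assuming "within 4 words" means the word positions differ by at most 4 and that order does not matter (the slide does not pin down either point). The function name and the sample sentence are hypothetical; stemming and stop-word handling are omitted.

```python
def within_distance(text, word_a, word_b, max_distance):
    """True if some occurrence of word_a lies within max_distance word positions of word_b."""
    tokens = text.lower().split()
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)

sentence = "the dogs will begin the race soon"
print(within_distance(sentence, "dogs", "race", 4))
print(within_distance(sentence, "dogs", "race", 3))
```

With a positional inverted index, the two position lists would come straight from the index instead of being recomputed from the text.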

Pattern Matching
- Allow queries that match strings rather than word tokens.
- Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

Simple Patterns
- Prefixes: Pattern that matches the start of a word.
  - “anti” matches “antiquity”, “antibody”, etc.
- Suffixes: Pattern that matches the end of a word.
  - “ix” matches “fix”, “matrix”, etc.
- Substrings: Pattern that matches an arbitrary contiguous sequence of characters.
  - “rapt” matches “enrapture”, “velociraptor”, etc.
- Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them.
  - “tin” to “tix” matches “tip”, “tire”, “title”, etc.
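The four pattern types can be illustrated over a sorted vocabulary. This is only a sketch with hypothetical function names: prefix and range queries use binary search on the sorted list, while suffix and substring queries fall back to a linear scan, which hints at why dedicated data structures are needed for efficiency. The range is treated as inclusive of both endpoints, an assumption the slide does not settle.

```python
from bisect import bisect_left, bisect_right

# Sorted vocabulary assembled from the slide's example words.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tip", "tire", "title", "velociraptor"])

def prefix_matches(vocab, prefix):
    """Words starting with `prefix`; they are contiguous in a sorted list."""
    matches = []
    for word in vocab[bisect_left(vocab, prefix):]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

def suffix_matches(vocab, suffix):
    """Words ending with `suffix` (linear scan)."""
    return [word for word in vocab if word.endswith(suffix)]

def substring_matches(vocab, pattern):
    """Words containing `pattern` as a contiguous substring (linear scan)."""
    return [word for word in vocab if pattern in word]

def range_matches(vocab, low, high):
    """Words w with low <= w <= high in lexicographic order (inclusive by assumption)."""
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]

print(prefix_matches(VOCAB, "anti"))
print(suffix_matches(VOCAB, "ix"))
print(substring_matches(VOCAB, "rapt"))
print(range_matches(VOCAB, "tin", "tix"))
```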