Intelligent Information Retrieval CS 336 Xiaoyan Li Spring



Number of slides: 20

Intelligent Information Retrieval
CS 336
Xiaoyan Li
Spring 2006
Modified from Lisa Ballesteros’s slides

What is Information Retrieval?
• Includes the following:
  – Organization
  – Storage/Representation
  – Manipulation/Analysis
  – Search/Retrieval
• How far back in history can we find examples?

IR Through the Ages
• 3rd Century BCE
  – Library of Alexandria
    • 500,000 volumes
    • catalogs and classifications
• 13th Century A.D.
  – First concordance of the Bible
    • What is a concordance?
• 15th Century A.D.
  – Invention of printing
• 1600
  – University of Oxford Library
    • All books printed in England

IR Through the Ages
• 1755
  – Johnson’s Dictionary
    • Set standard for dictionaries
    • Included common language
    • Helped standardize spelling
• 1800
  – Library of Congress
• 1828
  – Webster’s Dictionary
    • Significantly larger than previous dictionaries
    • Standardized American spelling
• 1852
  – Roget’s Thesaurus

IR Through the Ages
• 1876
  – Dewey Decimal Classification
• 1880’s
  – Carnegie Public Libraries
    • 1,681 built (first public library 1850)
• 1930’s
  – Punched card retrieval systems
• 1940’s
  – Bush’s Memex
  – Shannon’s Communication Theory
  – Zipf’s “Law”

Historical Summary
• 1960’s
  – Basic advances in retrieval and indexing techniques
• 1970’s
  – Probabilistic and vector space models
  – Clustering, relevance feedback
  – Large, on-line, Boolean information services
  – Fast string matching
• 1980’s
  – Natural Language Processing and IR
  – Expert systems and IR
  – Off-the-shelf IR systems

IR Through the Ages
• Late 1980’s
  – First mini-computer and PC systems incorporating “relevance ranking”
• Early 1990’s
  – information storage revolution
• 1992
  – First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)

IR Through the Ages
• Mid 1990’s to present
  – Multimedia databases
• 1994 to present
  – The Internet and Web explosion
    • e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
  – Digital Libraries
  – Data Mining
  – Agents and Filtering
  – Knowledge and Distributed Intelligence
  – Information Organization
  – Knowledge Management

Historical Summary
• 1990’s
  – Large-scale, full-text IR and filtering experiments and systems (TREC)
  – Dominance of ranking
  – Many web-based retrieval engines
  – Interfaces and browsing
  – Multimedia and multilingual
  – Machine learning techniques

Trends in IR Technology
[Figure: retrieval technologies plotted against time (1970, 1990, and later) and the volume of on-line information (gigabytes, terabytes, petabytes). Technologies named: Boolean Retrieval and Filtering, Ranked Retrieval, Ranked Filtering, Concept-Based Retrieval, Information Extraction, Summarization, Distributed Retrieval, Data Mining, Visualization, Image and Video Retrieval. Eras along the time axis: batch systems, interactive systems, database systems, cheap storage, Internet, multimedia.]
• A 1-page Word document without any images takes about 10 kilobytes (KB) of disk space.
• 1 terabyte = one hundred million imageless Word documents.
• 1 petabyte = one thousand terabytes.
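The storage figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes), which the slide does not specify, and the variable names are illustrative.

```python
# Assumption: decimal (SI) units. The slide does not say which convention it uses.
KB = 10**3
TB = 10**12
PB = 10**15

# Slide's estimate for a one-page Word document without images.
DOC_SIZE = 10 * KB

docs_per_terabyte = TB // DOC_SIZE
terabytes_per_petabyte = PB // TB

print(f"{docs_per_terabyte:,} documents per terabyte")
print(f"{terabytes_per_petabyte:,} terabytes per petabyte")
```

Under these assumptions the quotient is 10^12 / 10^4 = 10^8, which agrees with the slide's "one hundred million" figure.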

Historical Summary
• The Future
  – Logic-based IR?
  – NLP?
  – Integration with other functionality
  – Distributed, heterogeneous database access
  – IR in context
  – “Anytime, Anywhere”

Information Retrieval
• Ad Hoc Retrieval
  – Given a query and a large database of text objects, find the relevant objects
• Distributed Retrieval
  – Many distributed databases
• Information Filtering
  – Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
• Multimedia Retrieval
  – Databases of other types of unstructured data, e.g. images, video, audio
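Information filtering reverses the ad hoc setup: the profiles stay fixed and each incoming document is tested against them. The sketch below is one simple way to illustrate that, not a method given on the slide. It assumes each profile is a set of terms and that a profile matches when it shares at least `min_overlap` terms with the document; the profile names, the threshold, and the sample story are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matching_profiles(document, profiles, min_overlap=2):
    """Return the names of profiles sharing at least min_overlap terms with the document."""
    doc_terms = tokenize(document)
    return sorted(
        name for name, terms in profiles.items()
        if len(doc_terms & terms) >= min_overlap
    )

# Hypothetical long-term queries (profiles), each stored as a set of terms.
profiles = {
    "finance": {"stock", "market", "shares", "earnings"},
    "weather": {"storm", "rain", "forecast", "flood"},
}

# Hypothetical item arriving from a newswire stream.
story = "Shares fell as the stock market reacted to weak earnings."
print(matching_profiles(story, profiles))  # ['finance']
```

A real filtering system would normally score and threshold matches with a ranking model rather than count shared terms; the overlap test is used here only to keep the example short.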

Information Retrieval
• Multilingual Retrieval
  – Retrieval in a language other than English
• Cross-language Retrieval
  – Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)

Information Retrieval
• Text Representation (Indexing)
  – given a text document, identify the concepts that describe the content and how well they describe it
    • what makes a “good” representation?
    • how is a representation generated from text?
    • what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
  – describe and refine information needs as explicit queries
    • what is an appropriate query language?
    • how can interactive query formulation and refinement be supported?
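One common answer to "how is a representation generated from text?" is an inverted index that maps each term to the documents containing it. The sketch below is a minimal illustration under stated assumptions: terms are lowercase alphabetic words, raw term frequency stands in for "how well" a term describes a document, and the two sample documents are invented.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

# Hypothetical two-document collection.
docs = {
    "d1": "Information retrieval finds relevant information.",
    "d2": "Database systems store structured data.",
}
index = build_index(docs)
print(index["information"])  # {'d1': 2}
print(index["systems"])      # {'d2': 1}
```

Practical indexers usually add steps this sketch omits, such as stopword removal and stemming.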

Information Retrieval
• Comparing Representations (Retrieval)
  – compare text and information need representations to determine which documents are likely to be relevant
    • what is a “good” model of retrieval?
    • how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
  – present documents for user evaluation and modify query based on feedback
    • what are good metrics?
    • what constitutes a good experimental testbed?
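The slide leaves "what are good metrics?" open. Two widely used ones, not named on the slide, are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that were retrieved). A minimal sketch, with hypothetical document IDs and relevance judgments:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical system output and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Both values depend on having relevance judgments for the collection, which is one reason the testbed question on the slide matters.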

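Information Retrieval and Filtering
[Diagram: an Information Need is turned into a Query and Text Objects are turned into Indexed Objects, each through a Representation step; Comparison of the two produces Retrieved Objects, which feed Evaluation/Feedback.]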
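The flow named on this slide (represent the query and the text objects, compare them, return retrieved objects) can be sketched end to end. The example below assumes a vector space model with TF-IDF weights and cosine similarity; that is one possible choice of representation and comparison, not one the slide prescribes. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vector(tokens, idf):
    """Weight each term by term frequency times inverse document frequency."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Represent documents and query, compare them, and return (doc_id, score) pairs, best first."""
    tokenized = {d: tokenize(text) for d, text in docs.items()}
    n = len(docs)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    # Assumption: log(N/df) + 1, so a term found in every document keeps some weight.
    idf = {t: math.log(n / df_t) + 1.0 for t, df_t in df.items()}
    doc_vecs = {d: tfidf_vector(toks, idf) for d, toks in tokenized.items()}
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = {d: cosine(q_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical collection.
docs = {
    "d1": "ranked retrieval of text documents",
    "d2": "boolean retrieval and filtering",
    "d3": "image and video databases",
}
results = rank("ranked text retrieval", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Here d1 ranks first because it shares all three query terms, d2 follows with one shared term, and d3 scores zero. The Evaluation/Feedback step is left out; it would adjust the query vector using documents the user marks as relevant.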

Features of a Modern IR Product
• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization
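Boolean and proximity operators, listed among the features above, are typically built on an index that records word positions. The sketch below is a rough illustration rather than any product's implementation: `boolean_and` and `near` are hypothetical helper names, and "within k words" is assumed to mean an absolute position difference of at most k.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Build a positional index: term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term][doc_id].append(pos)
    return index

def boolean_and(index, a, b):
    """Return documents that contain both terms."""
    return sorted(set(index.get(a, {})) & set(index.get(b, {})))

def near(index, a, b, k):
    """Return documents where a and b occur within k word positions of each other."""
    hits = []
    for doc_id in boolean_and(index, a, b):
        pos_a, pos_b = index[a][doc_id], index[b][doc_id]
        if any(abs(i - j) <= k for i in pos_a for j in pos_b):
            hits.append(doc_id)
    return hits

# Hypothetical collection.
docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "retrieval of stored information from large archives of legal text",
    "d3": "database systems manage structured data",
}
index = build_positional_index(docs)
print(boolean_and(index, "information", "retrieval"))  # ['d1', 'd2']
print(near(index, "information", "retrieval", 1))      # ['d1']
```

The proximity query is stricter than the Boolean one: d2 contains both words, but they sit three positions apart, so it only matches when k is 3 or more.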

Typical Systems
• IR systems
  – Verity, Fulcrum, Excalibur
• Database systems
  – Oracle, Informix
• Web search and In-house systems
  – West, LEXIS/NEXIS, Dialog
  – Yahoo, Google, MSN, Ask Jeeves

IR vs. Database Systems
• Emphasis on effective, efficient retrieval of unstructured data
• IR systems typically have very simple schemas
• Query languages emphasize free text, although Boolean combinations of words are also common

IR vs. Database Systems
• Matching is more complex than with structured data (semantics less obvious)
  – easy to retrieve the wrong objects
  – need to measure accuracy of retrieval
• Less focus on concurrency control and recovery, although update is very important