  • Number of slides: 22

Information Retrieval
Adapted from lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford), and Christopher Manning (Stanford)
Prasad, L1 Intro. IR

Unstructured (text) vs. structured (database) data in 1996 (chart)

Unstructured (text) vs. structured (database) data in 2006 (chart)

Structured vs. unstructured data
• Structured data: information in “tables”
    Employee   Manager   Salary
    Smith      Jones     50000
    Chang      Smith     60000
    Ivy        Smith     50000
• Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
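
The query on this slide can be sketched in a few lines. This is only an illustration: it assumes the table is held in memory as a list of Python dictionaries, and the `employees` and `matches` names are hypothetical, not part of the lecture.

```python
# Hypothetical in-memory copy of the Employee/Manager/Salary table from the slide.
employees = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy", "Manager": "Smith", "Salary": 50000},
]

# Salary < 60000 AND Manager = Smith:
# a numerical range test combined with an exact text match.
matches = [
    row["Employee"]
    for row in employees
    if row["Salary"] < 60000 and row["Manager"] == "Smith"
]

print(matches)  # ['Ivy']
```

Chang is excluded because 60000 is not strictly below the bound, and Smith is excluded because his manager is Jones. Every row either satisfies the condition or does not; there is no notion of a partial match.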

Unstructured data
• Typically refers to free text
• Allows
  • Keyword-based queries, including operators
  • More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
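
A keyword query with operators can be sketched as set tests over the words of each document. The documents, the `terms` tokenizer, and the `search_and`/`search_or` helpers below are all assumptions made up for illustration.

```python
import re

# Hypothetical free-text documents; the contents are invented for this sketch.
docs = {
    1: "Drug abuse among teenagers is rising",
    2: "New drug approved for heart disease",
    3: "Reports of abuse of power in local government",
}

def terms(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search_and(*keywords):
    """Ids of documents that contain every keyword (Boolean AND)."""
    return sorted(d for d, text in docs.items() if set(keywords) <= terms(text))

def search_or(*keywords):
    """Ids of documents that contain at least one keyword (Boolean OR)."""
    return sorted(d for d, text in docs.items() if set(keywords) & terms(text))

print(search_and("drug", "abuse"))  # [1]
print(search_or("drug", "abuse"))   # [1, 2, 3]
```

This handles only the keyword case. A “concept” query would also need to find pages that discuss drug abuse without using those exact words, which plain term matching cannot do.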

Semi-structured data
• In fact, almost no data is “unstructured”
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
  • Title contains data AND Bullets contain search
• … to say nothing of linguistic structure
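
The zone query above can be sketched by storing each slide as a record with a title zone and a bullets zone. The sample records and the `zone_contains` helper are hypothetical.

```python
# Hypothetical slides stored as zones; the records are invented for this sketch.
slides = [
    {"title": "Semi-structured data", "bullets": ["Facilitates semi-structured search"]},
    {"title": "Unstructured data", "bullets": ["Typically refers to free text"]},
    {"title": "What is IR?", "bullets": ["Access to data", "Search interfaces"]},
]

def zone_contains(zone_text, word):
    """True if the word appears as a whitespace-separated token of the zone."""
    return word.lower() in zone_text.lower().split()

def matches(slide):
    """Title contains 'data' AND Bullets contain 'search'."""
    in_title = zone_contains(slide["title"], "data")
    in_bullets = any(zone_contains(b, "search") for b in slide["bullets"])
    return in_title and in_bullets

hits = [s["title"] for s in slides if matches(s)]
print(hits)  # ['Semi-structured data']
```

The third slide mentions both words, but only in its bullets, so it does not match. That is the difference between a zone query and a plain keyword query over the whole slide.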

What is IR?
• Representation
  • Keywords/phrases, structure/fonts, counts, etc.
• Organization and storage
  • Inverted file index, compression, etc.
  • Hardware architecture and memory hierarchy
• Access to information items
  • Interface: spell-checker to tree-structured display
  • Visualization: labeled clusters, timelines, spring graphs, etc.

Ultimate focus of IR
• Satisfying the user's information need
  • Emphasis is on retrieval of information (not data)
• User information need
  • Printer reviews
  • Book prices and availability
  • Words in which all vowels appear
  • Anagrams/permutations of “art”
• Predicting which documents are relevant, and then linearly ranking them.
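
The last two example needs are not keyword lookups at all, which a short sketch makes concrete. The word list below is a made-up stand-in for a real dictionary.

```python
from itertools import permutations

# Hypothetical word list standing in for a dictionary.
words = ["education", "retrieval", "sequoia", "rat", "tar", "art", "cart"]

def has_all_vowels(word):
    """True if a, e, i, o and u each occur at least once in the word."""
    return set("aeiou") <= set(word.lower())

# Need 1: words in which all vowels appear.
all_vowel_words = [w for w in words if has_all_vowels(w)]

# Need 2a: every permutation of the letters of "art".
perms = sorted("".join(p) for p in permutations("art"))

# Need 2b: dictionary words that are anagrams of "art" (other than itself).
anagrams = [w for w in words if w != "art" and sorted(w) == sorted("art")]

print(all_vowel_words)  # ['education', 'sequoia']
print(perms)            # ['art', 'atr', 'rat', 'rta', 'tar', 'tra']
print(anagrams)         # ['rat', 'tar']
```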

DIKW hierarchy
• Data: symbolic units
  • E.g., records of customers.
  • E.g., bytes from sensors.
• Information: data with an interpretation (Who?, What?, When?, Where?).
  • E.g., records of current/new customers grouped by their ages.
  • E.g., variation in temperature readings.

DIKW hierarchy
• Knowledge: information organized with theoretical concepts or abstract ideas (How?)
  • E.g., how many customers have cancelled their accounts in the current fiscal year?
  • E.g., analysis of temperature variation over the years and its causes.
• Wisdom: understanding of fundamental principles + human judgement
  • E.g., what strategies can be employed to retain customers in the face of cheaper alternatives?
  • E.g., global warming issues and the future of Earth.

DIKW hierarchy: Clark 2004
[Diagram: Data (gathering of parts), Information (connection of parts), Knowledge (formation of a whole) and Wisdom (joining of wholes), plotted against Context and Understanding. Other labels: Experience (past), Novelty (future); Researching, Absorbing, Doing, Interacting, Reflecting.]

You see things; and you say "Why?" But I dream things that never were; and I say "Why not?"
George Bernard Shaw

Information vs. data retrieval
• DATA
  • Information retrieval: unstructured, open to interpretation
  • Data retrieval: structured, with well-defined semantics
• QUERY
  • Information retrieval: usually incomplete or ambiguous (w.r.t. the information need)
  • Data retrieval: well-defined semantics
• QUALITY OF RESULTS
  • Information retrieval: partial match allowed, relevance-based ranking
  • Data retrieval: exact match required, so a query returns no results or many
• FOUNDATIONS
  • Information retrieval: probabilistic underpinnings
  • Data retrieval: algebra/logic
• APPLICATION
  • Information retrieval: library
  • Data retrieval: accounting
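
The contrast in the quality of results can be sketched as follows. The documents are invented, and the term-overlap score is a deliberately simple stand-in for a relevance measure, not the probabilistic models the slide refers to.

```python
# Hypothetical documents and query, invented for this sketch.
docs = {
    "d1": "cheap color laser printer reviews",
    "d2": "laser printer prices",
    "d3": "book prices and availability",
}
query = "laser printer reviews"

def exact_match(query, docs):
    """Data-retrieval style: a document qualifies only if it equals the query."""
    return [d for d, text in docs.items() if text == query]

def ranked_partial_match(query, docs):
    """IR style: score by the number of shared terms, then rank best first."""
    q = set(query.split())
    scored = [(len(q & set(text.split())), d) for d, text in docs.items()]
    # Drop documents with no shared term; break ties by document id.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ordered if score > 0]

exact = exact_match(query, docs)
ranked = ranked_partial_match(query, docs)
print(exact)   # []
print(ranked)  # ['d1', 'd2']
```

The exact match returns nothing even though d1 is plainly useful, while the partial match returns a ranked list and leaves the final judgement to the user.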

User task
[Diagram: the user's task over the database is either Retrieval or Browsing.]
• Retrieval
  • Purposeful, e.g., HP multifunction printer information
• Browsing
  • Casual, e.g., Big Bang, CBR, element genesis, supernova, ...
  • Hyperlink-based
• Filtering by agents
  • Push, e.g., podcasts from the BBC's Naked Science

Logical view of documents
[Diagram: Docs pass through text operations (accents and spacing, stopwords, noun groups, stemming, manual indexing, structure recognition); the resulting logical views range from structure and full text to index terms.]
• Abstraction (essentials)
  • Structure, fonts, proximity, repetitions, etc.
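
A few of the text operations named on this slide can be sketched as a small pipeline from full text to index terms. The stopword list is a tiny illustrative sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer; both are assumptions.

```python
import re
import unicodedata

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}  # tiny illustrative list

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def crude_stem(word):
    """Toy suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Full text -> accents removed -> tokens -> stopwords removed -> stems."""
    text = strip_accents(text).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(index_terms("Résumés of the indexing engineers"))
# ['resum', 'index', 'engineer']
```

Each step discards detail (accents, function words, inflections), which is the abstraction the slide describes: fewer, more general index terms in place of the full text.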

The retrieval process
[Diagram: the User Interface passes the user need, as text, to Text Operations, which produce a logical view. Query Operations turn that logical view, plus user feedback, into a query. Indexing, through the DB Manager Module, builds the Index (an inverted file) from the Text Database. Searching runs the query against the Index and returns retrieved docs; Ranking orders them and returns ranked docs to the user. The boxes also carry the numeric labels 4, 10; 6, 7; 5; 8; 8; 2.]
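
The indexing, searching and ranking stages can be sketched with a toy inverted file. The documents are invented, and scoring by summed term frequency is an assumption chosen for brevity, not the ranking function of any particular system.

```python
from collections import Counter, defaultdict

# Hypothetical text database, invented for this sketch.
docs = {
    1: "information retrieval ranks documents",
    2: "database systems retrieve records",
    3: "retrieval of information from text documents",
}

def build_inverted_file(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Searching and ranking: sum term frequencies, best score first."""
    scores = Counter()
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    ordered = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in ordered]

index = build_inverted_file(docs)
ranked = search(index, "information retrieval text")
print(ranked)  # [3, 1]
```

Document 2 is never touched because “retrieve” and “retrieval” are different terms here. Text operations such as stemming, applied to documents and query alike, are what would bring it into the result.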

IR basics
• Models and retrieval evaluation
• Query languages and operations
  • Improve inferring of query context (query expansion, relevance feedback)
• Text operations
  • Improve gleaning of document semantics (stemming keywords)
• Efficient access: index and search
  • Visualization, multimedia, applications, …

Clustering and classification
• Clustering: given a set of docs, group them into clusters based on their contents.
• Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
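
Both tasks can be sketched with the same similarity measure. The version below assumes bag-of-words vectors compared by cosine similarity, a nearest-topic rule for classification, and a greedy single-pass scheme for clustering. These are simple illustrative choices among many, and all names and sample texts are hypothetical.

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(doc, topics):
    """Classification: pick the topic whose description is most similar to doc."""
    d = vec(doc)
    return max(topics, key=lambda name: cosine(d, vec(topics[name])))

def cluster(docs, threshold=0.3):
    """Clustering: join the first cluster whose seed doc is similar enough."""
    clusters = []
    for doc in docs:
        for members in clusters:
            if cosine(vec(doc), vec(members[0])) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append([doc])  # no similar cluster found, start a new one
    return clusters

topics = {
    "sports": "football match goal team score",
    "finance": "stock market price bank interest",
}
label = classify("the team scored a late goal", topics)
groups = cluster([
    "stock market falls",
    "football team wins match",
    "bank raises interest rates",
    "stock market rallies",
])
print(label)        # sports
print(len(groups))  # 3
```

Classification needs the topics up front, while clustering discovers the groups from the documents alone. Here the two stock market documents end up together and the other two stay on their own.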

The web and its challenges
• Unusual and diverse documents
• Unusual and diverse users, queries, information needs
• Beyond terms, exploit ideas from social networks
  • Link analysis, clickstreams, ...
• How do search engines work? And how can we make them better?

More sophisticated semi-structured search
• Title is about Object Oriented Programming AND Author something like stro*rup
  • where * is the wild-card operator
• Issues:
  • How do you process “about”?
  • How do you rank results?
• The focus of XML search.
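
The wild-card half of this query can be sketched with Python's standard `fnmatch` module, in which `*` matches any run of characters. The author list and the `wildcard_match` helper are made up for illustration; the “about” half of the query is the harder problem and is not addressed here.

```python
from fnmatch import fnmatchcase

# Hypothetical author field values, invented for this sketch.
authors = ["Stroustrup", "Strostrup", "Strong", "Kernighan"]

def wildcard_match(pattern, names):
    """Case-insensitive match where * stands for any run of characters."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]

hits = wildcard_match("stro*rup", authors)
print(hits)  # ['Stroustrup', 'Strostrup']
```

Note that `fnmatch` patterns also give `?` and `[...]` special meaning, so a real query language would need to escape or restrict those characters.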

More sophisticated information retrieval
• Cross-language information retrieval
• Question answering
• Summarization
• Text mining
• …

Future progress: factors/trends
• Large, uncontrolled publishing media
  • Quality issues
• Cheap, fast and wide access
  • Ease of use (query formulation)
• Variety and flexibility
  • Navigational and visualization aids
  • Directory-based (table of contents) vs. keyword-based (inverted file index)
• Index terms (automatic/human-created) vs. full text
• Privacy, security, copyright