Intelligent Information Retrieval some research trends Gabriella Pasi

Intelligent Information Retrieval: some research trends
Gabriella Pasi
Istituto per le Tecnologie della Costruzione, Sezione Tecnologie Informatiche Multimediali
Consiglio Nazionale delle Ricerche
via Ampère, 56, 20131 - Milano
e-mail: gabriella.pasi@itim.mi.cnr.it

The problem of Information Access
Development of the WWW; increasing amount of available information.
Need for systems which support fast and effective access to information.
Distinct nature of information needs; distinct ways to provide automatic support for information access.

The problem of Information Access
There are distinct ways to locate information, depending both on the way in which the information is represented and on the users’ needs:
– Navigating via links on web sites (point-and-click paradigm) requires the identification of a meaningful starting point
– Explicit specification of users’ needs requires an explicit query formulation (Information Retrieval Systems, Search Engines)
– Recommendations as a decision support aid require learning from “similar” preferences (Recommender Systems)
– Preference elicitation through “guided” dialogues requires user knowledge elicitation (Decision Support Systems)

The problem of Information Access
– Systems which support information access: the definition of systems which help users to access information relevant to their needs is based on the solution of a decision-making problem: how to select and rank information items which reflect the user’s preferences?
– Notion of relevance: what the user wants is relevant information. Relevance is a subjective property of information items. In this context the notion of preference is related to that of relevance.

Information Retrieval (IR) aims at defining systems able to find documents which satisfy someone’s information need. Information can be of any kind: textual, visual, or auditory, although most current IR systems store and enable the retrieval of only textual information organized in documents. The problem of identifying the documents relevant to specific needs is a decision-making problem, based on the assessment of the subjective notion of relevance. This is a very complex task, pervaded by imprecision and uncertainty.

Information Retrieval System: a basic scheme
[Diagram: documents (usually unstructured or semi-structured text), indexing mechanism, formal representation of documents, user, query formulation, query, matching mechanism, items estimated relevant]
Ultimate aim of the system: to estimate the relevance of documents on the basis of a comparison of the formal representation of documents and queries.
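As a minimal sketch of the matching step in this scheme, assume that documents and queries are both represented as term-weight dictionaries and that cosine similarity is the comparison. Cosine is one common choice, not one prescribed by the slide, and the names `cosine` and `rank` are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return document ids with a non-zero score, best match first."""
    scored = [(cosine(query, rep), doc_id) for doc_id, rep in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

# Toy formal representations (invented for illustration).
docs = {
    "d1": {"fuzzy": 1.0, "retrieval": 1.0},
    "d2": {"retrieval": 1.0, "images": 1.0, "satellite": 1.0},
    "d3": {"neural": 1.0},
}
query = {"fuzzy": 1.0, "retrieval": 1.0}
ranking = rank(query, docs)
```

Here `d1` matches the query exactly, `d2` shares only one term with it, and `d3` shares none, so it is left out of the ranked list.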

Techniques that improve the basic scheme of an IRS
Some techniques that improve retrieval capabilities are:
• Relevance Feedback
• Text Categorization
• Use of Thesauri
• Document clustering
• Cross-lingual Information Retrieval

Information Retrieval: main issues
• Formal representation of text (or other media): the text representation is usually based on keyword extraction and weighting
– how to improve document representations?
• Queries: usually based on selection criteria specified by terms
– how to define query languages that better express users’ needs?
• The matching mechanism: it compares the document and query representations
– what is a “good” model of retrieval? How to account for imprecision and uncertainty?
• Produced results: ranked lists of documents (degrees of relevance or probability of relevance)

Information Retrieval Systems
The relevance estimate strongly depends on the adopted IR model. How can the relevance estimate be improved?
Definition of “intelligent retrieval systems” by better interpreting and learning users’ preferences.
Flexible systems vs. intelligent systems:
• tolerance to uncertainty and imprecision (intrinsic in subjective evaluations)
• learning capabilities
Application of soft computing techniques:
• to simplify the user-system interaction (tolerance to an approximate expression of users’ needs)
• to improve the formal representation of the documents’ content
• to learn the user’s notion of relevance

“Intelligent” IR: some research directions
• IR models that manage uncertainty and vagueness: they model the uncertainty and/or imprecision intrinsic in the retrieval activity
• Relevance Feedback: to learn users’ preferences by refinement of queries
• Automated text categorization
• Vocabulary expansion and intelligent user interfaces
• Personalized indexing: to improve the formal representation of documents
• Flexible query languages: to improve the expression of users’ needs

IR models that deal with uncertainty and vagueness
• Probabilistic models: they estimate the probability of relevance of documents to a user’s query.
• Logical models: the estimate of the relevance of a document d with respect to a query q consists in determining the "logical status" of the implication d → q.
• Fuzzy models: relevance is modeled as a gradual property of documents. They capture the vagueness intrinsic in the retrieval activity.
• Neural models: they are used to design IRSs able to adapt to the characteristics of the IR environment, and in particular to the user's interpretation of relevance.

Relevance Feedback
[Diagram: users’ information needs, query formulation, Information Retrieval System, documents estimated relevant, relevance feedback]
Relevance feedback learns the user’s notion of relevance and adapts the system’s behavior to it.
A relevance feedback mechanism is an automatic process that generates improved queries on the basis of an initial query evaluation.
This process is directed by the user, who first analyzes the preference ordering estimated by the system over the retrieved information items and is then asked to indicate explicitly which of the retrieved items she/he actually judges relevant.
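The slide does not say which feedback method is meant. One classical instance is Rocchio's formula for vector-space queries, sketched below under the assumption that queries and documents are term-weight dictionaries. The coefficients `alpha`, `beta`, and `gamma` are conventional illustrative defaults, not values taken from the talk.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Build an improved query from the user's relevance judgments.

    query, and every item of relevant / nonrelevant, is a {term: weight} dict.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    def mean_weight(term, docs):
        # Average weight of the term over a (possibly empty) set of documents.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * mean_weight(term, relevant)
                  - gamma * mean_weight(term, nonrelevant))
        if weight > 0:  # terms pushed to zero or below are dropped
            new_query[term] = weight
    return new_query

# Toy judgments (invented): one item marked relevant, one not relevant.
initial = {"network": 1.0}
improved = rocchio(initial,
                   relevant=[{"network": 0.6, "pc": 0.9}],
                   nonrelevant=[{"chip": 0.8}])
```

The improved query gains the term `pc` from the document judged relevant, while `chip`, which occurs only in the rejected document, does not enter it.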

Automated text categorization
Text categorization (TC) aims at the automated classification of texts into predefined categories, thus organizing them and making retrieval more flexible and consequently more effective.
TC is applied in several domains, for example document indexing based on a controlled vocabulary, document filtering, and word sense disambiguation.
The dominant approach to text categorization is based on machine learning techniques: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of preclassified documents.
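To illustrate the inductive approach, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, learned from a handful of preclassified texts. Naive Bayes is only one of many possible learners and is chosen here for brevity. The training examples and the names `train` and `classify` are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learn per-category term counts from (text, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = text.lower().split()
        term_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return term_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # log prior
        total_terms = sum(term_counts[category].values())
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are ignored
                count = term_counts[category][token]
                score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

model = train([
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores in final match", "sport"),
])
label = classify("football match tonight", model)
```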

Vocabulary expansion and intelligent user interfaces
The query representation in IRSs is commonly based on the specification of keywords (or strings). In this case the retrieval mechanism performs a lexical match of words.
One of the main problems of IR systems is vocabulary mismatch. Vocabulary expansion can result from transforming the document and query representations, as with Latent Semantic Indexing (LSI), or it can be done by using a thesaurus.
The basic assumption of LSI is that there is an underlying, or latent, structure in word usage: for retrieval, statistically derived conceptual indices are used instead of individual words.
Fuzzy thesauri and pseudothesauri are used to expand the set of index terms of documents with new terms, taking into account their varying significance in representing the topics dealt with in the documents.
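The thesaurus-based expansion can be sketched as follows, assuming a fuzzy thesaurus stored as a nested dictionary of relatedness degrees in [0, 1]. The combination rule used here (minimum of term weight and relatedness, maximum over all sources) is a common fuzzy-set choice adopted as an assumption, and the thesaurus entries are invented.

```python
def expand_index(doc_weights, thesaurus):
    """Add related terms to a document's weighted index.

    doc_weights: {term: weight}
    thesaurus:   {term: {related_term: relatedness degree in [0, 1]}}
    """
    expanded = dict(doc_weights)
    for term, weight in doc_weights.items():
        for related, strength in thesaurus.get(term, {}).items():
            # A related term cannot be more significant than its source term
            # or than the strength of the association (min), and the best
            # supporting source wins (max).
            candidate = min(weight, strength)
            expanded[related] = max(expanded.get(related, 0.0), candidate)
    return expanded

# Hypothetical fuzzy thesaurus and document representation.
thesaurus = {
    "pc": {"computer": 0.8, "laptop": 0.5},
    "network": {"computer": 0.4},
}
doc = {"pc": 0.9, "network": 0.6}
expanded = expand_index(doc, thesaurus)
```

A query on `computer` would now retrieve this document with degree 0.8, even though the word never occurs in its original index.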

Document indexing
The most widely used automatic indexing procedures are based on term extraction and weighting: a document is represented by a collection of index terms with associated weights (the index term weights). An index term weight expresses the degree of significance of the index term as a descriptor of the document’s information content.
The vector space model, the probabilistic models and the fuzzy models adopt a weighted document representation.
Limitations:
• the weighted representation of documents does not take into account that a term can play a different role within a text, according to the distribution of its occurrences.
• usual indexing procedures behave as a black box, producing the same document representation for all users.
Need for “personalized” indexing procedures.
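Term extraction and weighting can be sketched with the familiar tf-idf scheme, used here only as a representative weighting function since the slide does not name one. Tokenization by whitespace and the function name `index_documents` are simplifying assumptions.

```python
import math
from collections import Counter

def index_documents(docs):
    """Return one {term: weight} dict per document, using tf * idf weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    index = []
    for tokens in tokenized:
        tf = Counter(tokens)
        index.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return index

docs = ["fuzzy retrieval of documents", "retrieval of images", "neural networks"]
index = index_documents(docs)
```

In the first document, `fuzzy` receives a higher weight than `retrieval` because it occurs in fewer documents of the collection. The result is the same for every user, which is exactly the black-box behavior listed above as a limitation.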

Personalized document indexing
To index structured documents, the formal representation of a document is defined by exploiting its logical structure (e.g. XML documents).
Given a term t, a distinct term weight is computed for each subpart si of the document, expressing the importance of the term as a descriptor in that subpart. The overall index term weight is computed by aggregating the “partial” weights in a user-driven way.
[Diagram: the partial weights Fs1(d, t), Fs2(d, t), Fs3(d, t), Fs4(d, t), ..., computed on the document sections (TITLE, AUTHORS, ABSTRACT, INTRODUCTION, ...), are combined by an aggregation function A into the overall weight F(d, t)]
The user specifies both her/his preference about the sections in which to privilege the search and the aggregation function.
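A minimal sketch of this user-driven aggregation follows. The partial weight is assumed to be a normalized within-section frequency, and the weighted mean is one possible aggregation function; both are illustrative choices, as are the names `section_weight`, `overall_weight`, and `weighted_mean`.

```python
def section_weight(section_text, term):
    """Partial weight Fs(d, t): relative frequency of the term in one section."""
    tokens = section_text.lower().split()
    return tokens.count(term) / len(tokens) if tokens else 0.0

def overall_weight(doc, term, aggregate):
    """Overall weight F(d, t): user-chosen aggregation of the partial weights."""
    partial = {name: section_weight(text, term) for name, text in doc.items()}
    return aggregate(partial)

def weighted_mean(importance):
    """Build an aggregation function from user-assigned section importances."""
    def aggregate(partial):
        total = sum(importance.get(name, 0.0) for name in partial)
        if total == 0:
            return 0.0
        return sum(importance.get(name, 0.0) * w for name, w in partial.items()) / total
    return aggregate

# Toy structured document (invented).
doc = {
    "title": "fuzzy retrieval",
    "abstract": "a model of retrieval for fuzzy queries and fuzzy documents",
}
title_first = overall_weight(doc, "fuzzy", weighted_mean({"title": 0.8, "abstract": 0.2}))
abstract_first = overall_weight(doc, "fuzzy", weighted_mean({"title": 0.2, "abstract": 0.8}))
best_section = overall_weight(doc, "fuzzy", lambda partial: max(partial.values()))
```

The same document and term yield different index weights for users with different section preferences, which is the point of personalizing the indexing step.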

Flexible Query Languages
A) Linguistic query weights as flexible constraints on weighted document representations
Weighted representation of documents: R(d1) = {0.2/computer, 0.6/network, 0.1/chip, 0.9/PC, 0.7/DOS}
Weighted query: q = <network; important> AND <PC; not very important>
Partial matching mechanism: the evaluation of the constraints depends on the weight semantics.
Soft constraints: the weights specify soft constraints on the weighted document representations.
The Retrieval Status Value of a document expresses the degree of constraint satisfaction.
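The evaluation of such a query can be sketched as follows. The membership function chosen for "important", the modelling of "very" as squaring and "not" as complement, and the use of the minimum for AND are all assumptions made for illustration; the slide does not fix these semantics. Document `d2` is invented for comparison.

```python
def important(weight):
    """Hypothetical meaning of 'important': 0 below 0.3, 1 above 0.7, linear between."""
    return max(0.0, min(1.0, (weight - 0.3) / 0.4))

def very(constraint):
    """Linguistic hedge, modelled here by squaring the membership degree."""
    return lambda weight: constraint(weight) ** 2

def negate(constraint):
    """Negation, modelled here as the complement of the membership degree."""
    return lambda weight: 1.0 - constraint(weight)

def rsv(doc, query):
    """Retrieval Status Value: degree to which all soft constraints hold (AND = min)."""
    return min(constraint(doc.get(term, 0.0)) for term, constraint in query)

d1 = {"computer": 0.2, "network": 0.6, "chip": 0.1, "PC": 0.9, "DOS": 0.7}
d2 = {"network": 0.8, "PC": 0.2}  # hypothetical second document

# q = <network; important> AND <PC; not very important>
query = [("network", important), ("PC", negate(very(important)))]
rsv_d1 = rsv(d1, query)
rsv_d2 = rsv(d2, query)
```

Under these assumed semantics `d1` fails the second constraint because PC is a highly significant term in it, whereas `d2` satisfies both constraints fully.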

Flexible Query Languages
B) Linguistic aggregation operators
• to simplify query formulation
• to improve expressiveness
Use of linguistic quantifiers to specify aggregation criteria (the behaviour of these operators lies between AND and OR), e.g. all, most of, at least k:
all → AND
most of
at least k (of N)
...
at least 1 → OR
Example: Boolean query
(images AND noise) OR (images AND satellite) OR (images AND meteo*) OR (noise AND satellite) OR (noise AND meteo*) OR (meteo* AND satellite)
Same query with linguistic quantifiers:
at least 2 (images, noise, satellite, meteo*)
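One way to evaluate "at least k" over graded matching degrees is to take the k-th largest degree, which reduces to OR (maximum) for k = 1 and to AND (minimum) for k = N. The sketch below assumes min/max semantics for AND/OR and checks the result against the expanded Boolean form. The document weights are invented, and the truncation in `meteo*` is not handled.

```python
from itertools import combinations

def at_least(k, degrees):
    """Degree to which at least k of the criteria hold: the k-th largest degree."""
    if not 1 <= k <= len(degrees):
        raise ValueError("k must be between 1 and the number of criteria")
    return sorted(degrees, reverse=True)[k - 1]

def boolean_expansion(k, degrees):
    """The same query written out as an OR (max) of all k-term ANDs (min)."""
    return max(min(group) for group in combinations(degrees, k))

# Hypothetical index term weights of one document.
doc = {"images": 0.9, "noise": 0.2, "satellite": 0.7}
terms = ["images", "noise", "satellite", "meteo"]
degrees = [doc.get(term, 0.0) for term in terms]

compact = at_least(2, degrees)            # at least 2 (images, noise, satellite, meteo)
expanded = boolean_expansion(2, degrees)  # the six-clause Boolean query
```

Both forms give the same value, but the quantified query stays one line long however many terms are listed.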

Flexible querying of Structured Documents
First step: the user ranks the logical sections of the documents in decreasing order of their perceived importance in bearing relevant information; section si is more important than section sj iff i < j (i and j being the positions of si and sj, respectively, in the ordered list).
Second step: the user formulates a flexible query based on soft constraints of the form "t in Q sections", where Q is a linguistic quantifier such as at least one, most, or all. Q specifies the number of document sections that should be taken into account to compute the overall weight of the index term t in a whole document d.
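The second step can be sketched with quantifier-guided ordered weighted averaging (OWA), a standard construction that is swapped in here as an assumption rather than taken from the talk. The sketch deliberately leaves out the importance ranking of the first step, and the definition of "most" as r squared is hypothetical.

```python
def owa_weights(quantifier, n):
    """Derive n OWA weights from a non-decreasing quantifier Q on [0, 1]."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]

def term_weight(section_weights, quantifier):
    """Overall weight of a term: OWA of its section weights, largest first."""
    ordered = sorted(section_weights, reverse=True)
    weights = owa_weights(quantifier, len(ordered))
    return sum(w * x for w, x in zip(weights, ordered))

def at_least_one(r):
    return 1.0 if r > 0 else 0.0

def most(r):
    return r ** 2  # hypothetical definition of "most"

def all_sections(r):
    return 1.0 if r >= 1 else 0.0

# Invented weights of a term t in four sections of a document d.
sections = [0.5, 0.8, 0.0, 0.2]
w_any = term_weight(sections, at_least_one)   # behaves like the maximum
w_most = term_weight(sections, most)
w_all = term_weight(sections, all_sections)   # behaves like the minimum
```

With these definitions the value for "most" always falls between the values for "all" and "at least one".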

Conclusions
Some approaches to the definition of “intelligent” Information Retrieval Systems have been presented. In particular, some promising research directions that could lead to the development of more effective IRSs have been outlined. Among these, the research efforts aimed at defining new indexing techniques for semi-structured documents (such as XML documents) are very important: the possibility of creating the documents’ surrogates in a user-driven way would allow the users’ interests to be modeled at the indexing level as well (usually this is limited to the query formulation level).