Statistical Language Models for Information Retrieval



Retrieval Engine Updated query Document collection top 10

Source -> Transmitter (encoder) -> Noisy Channel -> Receiver (decoder) -> Destination

Sampling

Estimation Total #words =100
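
A minimal sketch of the estimation step, assuming the simplest case: a unigram model estimated by relative frequency from a 100-word sample. The function name `mle_unigram` and the word counts are hypothetical, chosen only so that the total matches the 100 words on the slide.

```python
from collections import Counter

def mle_unigram(tokens):
    """Relative-frequency (maximum likelihood) estimate of a unigram LM."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical 100-word sample; the counts are invented for illustration.
sample = ["text"] * 10 + ["mining"] * 5 + ["the"] * 85
model = mle_unigram(sample)

print(len(sample))       # total number of words in the sample
print(model["text"])     # count("text") / total
print(model["mining"])   # count("mining") / total
```

Any word that does not occur in the sample gets probability zero under this estimate, which is why smoothing is needed later.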

Query = “data mining algorithms” Which model would most likely have generated this query?

Doc LM Query likelihood

Max. Likelihood Estimate Smoothed LM
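
The query-likelihood idea can be sketched as follows. The slides do not say which smoothing method is meant here, so this example assumes Jelinek-Mercer (linear interpolation with a collection model); the function name, the weight `lam`, and the two toy documents are all hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_counts, coll_len, lam=0.5):
    """log p(query | doc) under a smoothed unigram document LM.

    Assumes Jelinek-Mercer smoothing: the maximum likelihood document
    estimate is interpolated with the collection model using weight lam.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p_ml = doc_counts[w] / doc_len if doc_len else 0.0
        p_bg = coll_counts[w] / coll_len
        p = (1 - lam) * p_ml + lam * p_bg
        if p == 0.0:
            # Word unseen in the whole collection: likelihood is zero.
            return float("-inf")
        score += math.log(p)
    return score

# Toy collection (hypothetical documents).
docs = {
    "d1": "data mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
}
coll_counts = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll_counts.values())

query = "data mining algorithms".split()
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], coll_counts, coll_len),
    reverse=True,
)
print(ranking)
```

Without smoothing, `d2` would score negative infinity because it contains none of the query words; with smoothing it gets a finite but lower score than `d1`.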

TF weighting Words in both query and doc IDF-like weighting Ignore for ranking

long Verbose queries long short Keyword queries

Relevance

Ignored for ranking D

Document prior

An infinite mixture model Kernel-based density function

Document D Query Q

Background words Topic words
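
One way to separate background words from topic words is a two-component mixture fitted with EM. The sketch below assumes each word in the feedback documents is drawn either from a fixed background model (with probability `lam`) or from an unknown topic model. The function name, the value of `lam`, the background probabilities, and the word counts are hypothetical.

```python
from collections import Counter

def estimate_topic_model(tokens, background, lam=0.5, iters=50):
    """EM for p(w) = lam * background[w] + (1 - lam) * topic[w].

    background must give a positive probability for every word in tokens.
    Only the topic model is re-estimated; background and lam stay fixed.
    """
    counts = Counter(tokens)
    vocab = list(counts)
    topic = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the topic model.
        post = {
            w: (1 - lam) * topic[w]
            / ((1 - lam) * topic[w] + lam * background[w])
            for w in vocab
        }
        # M-step: renormalize the expected topic counts.
        norm = sum(counts[w] * post[w] for w in vocab)
        topic = {w: counts[w] * post[w] / norm for w in vocab}
    return topic

# Hypothetical feedback text and background probabilities.
feedback = ["the"] * 8 + ["of"] * 4 + ["airport"] * 4 + ["security"] * 4
background = {"the": 0.5, "of": 0.3, "airport": 0.001, "security": 0.001}

topic = estimate_topic_model(feedback, background)
for w in sorted(topic, key=topic.get, reverse=True):
    print(w, round(topic[w], 3))
```

Although "the" is the most frequent word in the feedback text, the background model explains most of its occurrences, so the topic model assigns more probability to "airport" and "security".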

Empirical divergence Divergence minimization

TREC topic 412: “airport security” Mixture model approach Web database Top 10 docs

Estimate with a bilingual lexicon or parallel corpora

QUERY MODELING Query Language Model Query USER MODELING Retrieval Decision: Documents Document Language Models DOC MODELING Loss Function

User Source Query Document

Boolean model Probabilistic relevance model Generative Relevance Theory Vector-space Model Two-stage LM KL-divergence model Subtopic retrieval model

Estimate Query Estimate Documents Query model parameters Query Language Model User model parameters Document Language Models Loss Function Set

Can LMs consistently (convincingly) outperform traditional methods without sacrificing efficiency? Can we do much better by going beyond unigram LMs?

How can we learn effectively from past relevance judgments? How can we break the document unit in a principled way?

How can we exploit user information and search context to improve search? What role can LMs play when combining text with relational data?

How can we develop an effective unified retrieval model for Web search? How can we exploit LMs to develop models for complex retrieval tasks?