Towards Context-Aware Search by Learning A Very Large

Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs Huanhuan Cao 1, Daxin Jiang 2, Jian Pei 3, Enhong Chen 1 and Hang Li 2 1 University of Science and Technology of China, 2 Microsoft Research Asia, 3 Simon Fraser University

Context of User Queries • A user usually raises multiple queries and conducts multiple rounds of interactions for an information need. One round of Interaction User query click query Current Query click Context

An Example • Suppose Ada plans to buy a new car and need some cars reviews. • But she doesn’t know to formulate an effective query. Consequently, she raises a series of queries about different cars. • No surprisingly, for each query, the review web sites are ranked low and not easy to be noticed.

Why Context is Useful? • Suppose we have such a search log: SID S 1 Search session Ford => Toyota => GMC => Allstate www. autohome. com S 2 Ford cars => Toyota cars => GMC cars => Allstate www. autohome. com S 3 Ford cars => Toyota cars => Allstate www. allstate. com S 4 GMC => GMC dealers www. gmc. com

Patterns in The Search Log • Pattern 1: – 50% users clicked a cars review web site www. autohome. com after asking a series of cars. • Ada will have better experience if the search engine knows pattern 1.

• Pattern 2: – 75% users searched for car insurances after a series of queries about different cars. • The search engine will provide more appropriate query suggestions and URL recommendations if it knows pattern 2. • Idea: Learning from search log to provide context-aware ranking, query suggestion and URL recommendation.

Related Work Mining “wisdom of the crowds” from search logs Improve ranking Use click-through data as implicit feedback Mining click-through data Query suggestion Mining session data Mixture: CACB URL recomendation Mining search trials Only CACB considers context, but: 1. CACB constraints a query to one search intent 2. CACB doesn’t use click information as context 3. CACB can only be used for query suggestion

Modeling Context by vl. HMM (variable length Hidden Markov Model)

Overview of Technique Details • • Definition of vl. HMM Parameters Estimation Challenges and Strategies Applications

Formal Definition • Given: – – A set of hidden states {s 1 … s. Ns}; A set of queries {q 1 … q. Nq}; A set of URLs {u 1 … u. Nu}; The maximal length Tmax of state sequences • A vl. HMM is a probability model defined as follows: – The transition probability distribution Δ = {P(si|Sj)}; – The initial state distribution Ψ = {P(si)}; – The emission probability distribution for each state sequence Λ = {P(q, U|Sj)};

Parameter Estimation • Let X = {O 1…ON} be the set of training sessions, where: – On is a sequence of pairs (qn, 1 , Un, 1) … (qn, Tn , Un, Tn) – qn, t and Un, t are the t-th query and the set of clicked URLs, respectively – Moreover, we use un, t, k to denote the k-th URL in Un, t. • The task is to find Θ* such that

EM • The original problem is in a complex form which may not have a closed-form solution. • Alternatively, we use an iterative method: EM (Expectation Maximum). • Objective function:

• E-step: • M-step:

Challenges for Training A Large vl. HMM • Challenge 1: – The EM algorithm needs a user-specified number of hidden states. – However, in our problem, the hidden states correspond to users' search intents, whose number is unknown. • Strategy: – We apply the mining techniques developed by our previous work as a prior process to the parameter learning process.

• Challenge 2: – Search logs may contain hundreds of millions of training sessions. – It is impractical to learn a vl. HMM from such a huge training data set using a single machine. • Strategy: – We deploy the learning task on a distributed system under the map-reduce programming model

• Challenge 3: – Each machine needs to hold the values of all parameters. – Since the log data usually contains millions of unique queries and URLs, the space of parameters is extremely large. • Strategy: – we develop a special initialization strategy based on the clusters mined from the click-through bipartite

Experiments • A large-scale search log from Live Search – Web searches in English from the US market • Training Data – 1, 812, 563, 301 search queries, – 2, 554, 683, 191 clicks – 840, 356, 624 sessions – 151, 869, 102 unique queries – 114, 882, 486 unique URLs. • Test Data – 100, 000 sessions extracted from another search log

Coverage • For each test session <(q 1 , U 1)…(q. T UT )>, the vl. HMM deals with each qi. When i > 1, <(q 1 , U 1)…(qi-1 , Ui-1)> is used as a context. • The total coverage is 58. 3%. • Denote the set of test cases without context as Test 0 and the other as Test 1. • For the covered cases in Test 1, 25. 5% contexts are recognized.

Re-ranking • Baseline: – Boost the URLs with high click times given the query. • Evaluation: – Sample 500 re-ranking URL pairs from Test 0 and from the cases whose context are recognized in Test 1, respectively. – Each re-ranking URL pair is judged as Improved or Degraded or Unsure by 3 experts.

The effectiveness of re-ranking by the vl. HMM and Baseline 1 on (a) Test 0 and (b) Test 1.

Examples of Re-ranking Search for games Visit the homepage of Ask Jeeves Up the URL about game Up the URL which introduces the history of Ask Jeeves

URL Recommendation • Baseline: – Recommend the URLs with high click times following the current query. • Evaluation: – “Leave-one-out" method: given <(q 1 , U 1)…(q. T UT )>, we use q. T-1 as the test query and consider UT as the ground truth. – Suppose the set of recommended URLs is R, the precision is |R∩UT |/|R| and the recall is |R ∩ UT |/|UT |.

The precision and recall of the URLs recommended by the vl. HMM and Baseline 2.

An Example of URL Recommendation Search for online store about electronics Online store about equipments

Query Suggestion • Baseline: – CACB, a context-aware concept based approach of query suggestion. • Evaluation: – The results of two approaches are comparable since they both consider contexts. – However, the ratio of recognizing contexts is increased by 55% by vl. HMM.

Summary • We propose a general approach to context-aware search by learning a vl. HMM from log data. • We tackle the challenges of learning a large vl. HMM with millions of states from hundreds of millions of search sessions. • The experimental results on a large real data set clearly show that our context-aware approach is both effective and efficient.

• Our recent works on context-aware search: • Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-tao Sun, Enhong Chen and Qiang Yang. Context-aware query classification. To appear in SIGIR’ 09. • Huanhuan Cao, Daxin Jiang, Jian Pei, Enhong Chen and Hang Li. Towards context-aware search by learning a large variable length Hidden Markov Model from search logs. To appear in WWW’ 09. • Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen and Hang Li. Context-aware query suggestion by mining click-through and session data. KDD’ 08, pages 875 -883, 2008. (This paper won the Best Application Paper Award of KDD’ 08)

Thanks