ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines
Gleb Skobeltsyn, Vassilis Plachouras, Flavio Junqueira, Ricardo Baeza-Yates
The 31st Annual International ACM SIGIR Conference, Singapore, 21 July 2008

Motivation
• Caching is crucial for Web search engines (WSE) to save resources
• Results caching:
  + is efficient with real queries
  - but its hit rate is limited by singletons (queries that occur only once)
• How to increase the hit rate further? Index pruning

Contents
• ResIn architecture
• Original query stream vs. query stream after the results cache (misses)
• Static pruned index:
  • Term pruning
  • Document pruning
  • A combination of both
• Conclusion

ResIn architecture
Query processing: 1. from the main index
[Diagram: front end, broker, and back ends (main index, term cache); arrows labeled query, result, top results]
• We study results caching and index pruning together
• … to reduce latency and load on back-end servers

ResIn architecture
Query processing: 2. from the results cache
[Diagram: a results cache is added after the front end; a hit returns the result, a miss sends the query on to the broker and the back ends (main index, term cache)]

ResIn architecture
Query processing: 3. from the pruned index
[Diagram: a pruned index is added after the results cache; a results-cache miss goes to the pruned index, a pruned-index hit returns the result, and a pruned-index miss sends the query on to the broker and the back ends (main index, term cache)]
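The three query-processing paths above can be sketched as a lookup chain. This is a minimal illustration, not the ResIn implementation: `ResultsCache`, `process_query`, and the two callables standing in for the pruned and main indexes are hypothetical names, and storing pruned-index answers in the results cache is an assumption.

```python
from collections import OrderedDict


class ResultsCache:
    """Tiny LRU cache mapping a query string to its top results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def process_query(query, cache, pruned_index, main_index):
    """Try the results cache, then the pruned index, then the main index."""
    results = cache.get(query)
    if results is not None:
        return results, "results cache"
    results = pruned_index(query)  # hypothetical: returns None on a miss
    source = "pruned index"
    if results is None:
        results = main_index(query)
        source = "main index"
    cache.put(query, results)  # assumption: every computed answer is cached
    return results, source


# Toy stand-ins for the two indexes (assumed, for illustration only).
main = lambda q: ["doc-for-" + q]
pruned = lambda q: ["doc-for-" + q] if q == "sigir 2008" else None

cache = ResultsCache(capacity=2)
trace = [process_query(q, cache, pruned, main)[1]
         for q in ["britney spears", "sigir 2008", "britney spears"]]
print(trace)
```

Only queries that miss both the results cache and the pruned index reach the broker and back ends, which is where the latency and load savings would come from.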

Original query stream (all queries) vs. query stream after the results cache (misses)

All queries vs. Misses: Experimental setup
• Real query log to test the results cache and generate a “miss-log”: 185 M queries from yahoo.co.uk
[Diagram: the original query log (all queries: Q1: britney spears, Q2: sigir 2007, Q3: britney spears, Q4: sigir 2008, …, Q185’000: last query) is replayed through an LRU results cache; hits (Q3: britney spears, …) are answered by the cache, misses (Q2: sigir 2007, Q4: sigir 2008, …) form the “miss-log”]
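A miss-log of this kind can be produced by replaying the query log through an LRU cache and recording every query that is not found. The sketch below assumes a cold cache, so the first occurrence of each query counts as a miss; `miss_log` and the capacity value are illustrative, not taken from the experiments.

```python
from collections import OrderedDict


def miss_log(queries, capacity):
    """Replay queries through an LRU results cache; return the misses in order."""
    cache = OrderedDict()
    misses = []
    for q in queries:
        if q in cache:
            cache.move_to_end(q)  # hit: refresh recency
            continue
        misses.append(q)  # miss: goes to the miss-log
        cache[q] = True
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict least recently used
    return misses


log = ["britney spears", "sigir 2007", "britney spears", "sigir 2008"]
misses = miss_log(log, capacity=2)
hit_rate = 1 - len(misses) / len(log)
print(misses, hit_rate)
```

Singleton queries always end up in the miss-log regardless of capacity, which is why the hit rate of a results cache is bounded.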

All queries vs. Misses: Number of terms in a query
• Average number of terms: 2.4 for all queries, 3.2 for misses
• Most single-term queries are hits in the results cache
• Queries with many terms are unlikely to be hits

All queries vs. Misses: Query result size distribution
• 2000 queries randomly selected from all queries and from misses:
  • Avg. result size for misses is ~100 times smaller than for all queries
  • Approx. half of the misses return fewer than 5000 results: SMALL!
• Similar results with a “small” UK document collection (78 M)

All queries vs. Misses: Term popularity distribution
• Log sizes: 185 M for all queries, 41 M for misses
• Each point -> avg. popularity of 1000 consecutive terms
• The order of terms for misses is the same as for all queries
• Terms which were popular before the results cache remain popular after

Static index pruning

Static pruned index
• Smaller version of the main index, returns:
  • the top-k response that is the same as the main index’s, or
  • a miss otherwise.
• Assumes Boolean query processing
• Types of pruning:
  • Term pruning: full posting lists for selected terms
  • Document pruning: truncated posting lists
  • Term+Document pruning: combination of both
[Figure: posting lists of terms t1 to t4 under Full index, Term pruning, Document pruning, and T+D pruning]
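The three pruning types differ only in which terms are kept and how much of each posting list is kept. A minimal sketch on a toy inverted index follows; the `prune` helper, its parameters, and the toy postings are hypothetical.

```python
def prune(index, keep_terms=None, pll_max=None):
    """Return a pruned copy of an inverted index {term: [doc, ...]}.

    keep_terms: if given, only these terms survive (term pruning).
    pll_max: if given, each list is cut to its first pll_max postings
             (document pruning; lists are assumed sorted by score).
    """
    pruned = {}
    for term, postings in index.items():
        if keep_terms is not None and term not in keep_terms:
            continue
        pruned[term] = list(postings) if pll_max is None else postings[:pll_max]
    return pruned


full = {
    "t1": ["D5", "D3", "D2", "D1"],
    "t2": ["D1", "D5"],
    "t3": ["D4", "D1", "D2"],
}
term_pruned = prune(full, keep_terms={"t1", "t3"})
doc_pruned = prune(full, pll_max=2)
td_pruned = prune(full, keep_terms={"t1", "t3"}, pll_max=2)
print(term_pruned, doc_pruned, td_pruned, sep="\n")
```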

Term Pruning: Performance
• Term pruning based on profit(t) = popularity(t) / df(t)
• Answers a query if all query terms are in the pruned index
UK document collection, 78 M documents:
• Performs well for all queries
• For misses as well: e.g., can process almost 50% of the queries with 25% of the index
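One way to turn the profit formula into a term selection is a greedy fill of a size budget, sketched below. The greedy strategy, the budget measured in postings, and the toy statistics are assumptions made for illustration; the slides only state that selection is based on profit(t).

```python
def term_prune(popularity, df, budget):
    """Pick terms by descending profit(t) = popularity(t) / df(t).

    budget: maximum total number of postings in the pruned index
            (an assumed size measure: a full posting list costs df(t)).
    """
    ranked = sorted(popularity, key=lambda t: popularity[t] / df[t], reverse=True)
    kept, used = set(), 0
    for t in ranked:
        if used + df[t] <= budget:
            kept.add(t)
            used += df[t]
    return kept


def can_answer(query_terms, kept):
    """The pruned index answers a query only if it holds all query terms."""
    return all(t in kept for t in query_terms)


# Toy statistics (hypothetical): query-log popularity and document frequency.
popularity = {"britney": 90, "spears": 80, "sigir": 5, "the": 100}
df = {"britney": 10, "spears": 20, "sigir": 5, "the": 1000}

kept = term_prune(popularity, df, budget=30)
print(sorted(kept))
print(can_answer(["britney", "spears"], kept), can_answer(["sigir", "2008"], kept))
```

Dividing by df(t) penalizes terms with long posting lists, so a very popular but very frequent term such as "the" in this toy data is left out.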

Result Caching + Term Pruning
• Results caching performance is independent of the collection size
• Results cache capacity is up to 10% of the full index size

Term pruning: Frequent terms in misses
• MinDF (df of the least frequent query term) correlates with the result size
• MaxDF (df of the most frequent query term) is high for most of the misses
[Figure: MinDF vs. MaxDF; labels: Gleb, Flavio, Vassilis, Ricardo]
• Many misses contain at least one frequent term
• => the term-pruned index has to include large posting lists
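MinDF and MaxDF are simple per-query statistics. Under conjunctive Boolean processing the result set is an intersection of posting lists, so it can never be larger than MinDF, which is consistent with the correlation noted above. A small sketch, with hypothetical df values:

```python
def min_max_df(query_terms, df):
    """Return (MinDF, MaxDF): the df of the rarest and of the most frequent query term."""
    dfs = [df[t] for t in query_terms]
    return min(dfs), max(dfs)


# Hypothetical document frequencies for a three-term query.
df = {"sigir": 12_000, "2008": 3_500_000, "singapore": 400_000}
min_df, max_df = min_max_df(["sigir", "2008", "singapore"], df)
print(min_df, max_df)
```

Here the result size is bounded by 12,000 documents, yet answering the query from a term-pruned index would still require the 3.5 M-posting list of the most frequent term.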

Document pruning
• Based on Fagin’s top-k intersection algorithm
• Keeps postings with high scores only:
  • Sufficient to compute top-k results for some queries
  • Determining correctness of the result requires computing a scoring threshold: LATENCY!
[Figure: posting lists of t1, t2, t3, sorted by score; top-2 results: D1, D2; score threshold: s(D2, t1) + s(D1, t2) + s(D2, t3)]
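The correctness check can be sketched as follows, assuming an additive score over query terms and posting lists that are all truncated. The threshold is the sum of the last kept score of each list, as on the slide; it bounds any document that appears in none of the truncated lists. The sketch additionally bounds documents seen in only some lists, which is a conservative extra check added here and not something the slides spell out. `top_k_from_pruned` and the toy scores are hypothetical.

```python
def top_k_from_pruned(lists, k):
    """Return the top-k docs if they are provably correct, else None (a miss).

    lists: {term: [(doc, score), ...]}, each sorted by descending score,
           truncated, and non-empty.
    """
    last = {t: pl[-1][1] for t, pl in lists.items()}  # lowest kept score per term
    threshold = sum(last.values())                    # bound for unseen docs
    scores = {t: dict(pl) for t, pl in lists.items()}
    seen = set().union(*(s.keys() for s in scores.values()))

    exact, bound = {}, {}
    for d in seen:
        if all(d in s for s in scores.values()):
            exact[d] = sum(s[d] for s in scores.values())
        else:
            # Missing postings can score at most the last kept score of that list.
            bound[d] = sum(s.get(d, last[t]) for t, s in scores.items())

    top = sorted(exact, key=exact.get, reverse=True)[:k]
    if len(top) < k:
        return None
    limit = max([threshold] + list(bound.values()))
    return top if exact[top[-1]] >= limit else None


hit_lists = {
    "t1": [("D1", 9), ("D2", 8), ("D5", 1)],
    "t2": [("D2", 9), ("D1", 7), ("D3", 1)],
}
miss_lists = {
    "t1": [("D1", 9), ("D2", 8)],
    "t2": [("D3", 9), ("D1", 7)],
}
hit = top_k_from_pruned(hit_lists, k=2)
miss = top_k_from_pruned(miss_lists, k=1)
print(hit, miss)
```

In the second example D1 is fully scored at 16, but D3 could still reach 17 in the full index, so the pruned index has to report a miss.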

Document pruning: Experimental setup
• Scoring function:
  • pr(d): query-independent score of the document d (PageRank)
  • ω, k: normalization constants:
    • ω = [0, 10, 20]
    • k = 1
• We try different values of PLLmax (maximum Posting List Length) and choose the one that maximizes the hit rate
• We only look at the upper bound for the hit rate: are the original top-10 results found in the top portions of all PLs?
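The upper-bound test in the last bullet can be sketched as a membership check: a query counts as a potential hit if every one of its original top results lies within the first PLLmax postings of every query term. The function names and the toy data are hypothetical, and the sketch uses two top results per query instead of ten for brevity.

```python
def upper_bound_hit(query_terms, top_results, posting_lists, pll_max):
    """True if all original top results lie in the top pll_max postings of every term."""
    for t in query_terms:
        head = set(posting_lists[t][:pll_max])
        if not all(d in head for d in top_results):
            return False
    return True


def upper_bound_hit_rate(queries, posting_lists, pll_max):
    """Fraction of (terms, top_results) pairs that pass the upper-bound test."""
    hits = sum(upper_bound_hit(terms, top, posting_lists, pll_max)
               for terms, top in queries)
    return hits / len(queries)


# Toy posting lists, sorted by descending score.
posting_lists = {
    "a": ["D1", "D2", "D3", "D4"],
    "b": ["D2", "D1", "D4", "D3"],
}
queries = [(["a", "b"], ["D1", "D2"]), (["a"], ["D4"])]

rate_short = upper_bound_hit_rate(queries, posting_lists, pll_max=2)
rate_full = upper_bound_hit_rate(queries, posting_lists, pll_max=4)
print(rate_short, rate_full)
```

This is only an upper bound because finding the top results in the truncated lists does not yet prove that they are the correct top results.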

Document pruning: performance
• Doc. pruning needs high PageRank weights
• It performs better for all queries than for misses

Term+Document pruning: performance
• T+D pruning is the best but expensive (high latency)
• profit2 is better than profit1
• Improvement is marginal for misses unless the PageRank weight is very high

Conclusions
• Results caching:
  • delivers good hit rates with a constant capacity
  • but hit rate is limited because of singletons
• Index pruning:
  • has no limit on hit rate,
  • but the pruned index size grows with the doc. collection: more expensive
• Static index pruning: addition to results caching, not replacement
  • Term pruning performs well for misses also => “compatible” with results cache
  • Document pruning: all queries - OK, misses - only with high PageRank weights
  • Term+Document pruning slightly improves over document pruning
Lesson learned: it is important to consider the interaction between the components

Thank you
Questions?