Solr Performance Key Innovations Yonik Seeley Lucid


Number of slides: 32

Solr Performance & Key Innovations
Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com, May 26 2011

Solr 3.1 Highlights
§ Numeric range facets (similar to date faceting).
§ New spatial search, including spatial filtering, boosting and sorting capabilities.
§ Example Velocity-driven search UI at http://localhost:8983/solr/browse
§ A new, faster term-vector-based highlighter.
§ Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
§ Distributed search support for the SpellCheck and Terms components.

Solr 3.1 Highlights (continued)
§ Suggester, a fast trie-based autocomplete component.
§ Sort results by any function query.
§ JSON document indexing.
§ CSV response format.
§ Apache UIMA integration for metadata extraction.
§ Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.

What's not in 3.1?
§ Result Grouping (AKA Field Collapsing)
§ Pivot Faceting
§ SolrCloud
§ Pseudo-fields
§ Pseudo-join
§ Relevancy function queries
§ Per-segment faceting
§ *Tons* of new Lucene performance/efficiency goodness

Recent Lucene Performance
§ TieredMergePolicy: the new default
  • Much better for incremental indexing / NRT
  • Ignores segment order when selecting best merge
  • Takes deletes into account
  • Does not over-merge (no cascading merges)
§ Finite State Transducer (FST) based terms index

DocumentsWriterPerThread (DWPT)
§ Flushing a new segment is now concurrent w/ indexing
§ Use multiple indexing threads/connections
§ When max mem is hit, the biggest DWPT is concurrently flushed
[Diagram: indexing threads -> IndexWriter -> in-memory DWPTs; each DWPT flushes its own segment to disk (_1_0.tiv _1_0.prx _1_0.frq …, _2_0.tiv _2_0.prx _2_0.frq …, _3_0.tiv _3_0.prx _3_0.frq …)]

SolrCloud
http://.../solr/collection1?distrib=true
[Diagram: load-balanced sub-requests go to shard1 (replica1, replica2) and shard2 (replica1, replica2, replica3); a ZooKeeper quorum of ZK nodes holds the tree below]
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/livenodes
  server1:8983/solr
  server2:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml

SolrCloud: Getting Started
http://wiki.apache.org/solr/SolrCloud
  java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
  • -Dbootstrap_confdir, -Dcollection.configName: upload ./solr/conf to ZK and call it "myconf"
  • -DzkRun: run an internal ZK server
http://localhost:8983/solr/collection1/admin/zookeeper.jsp

Distributed Requests
§ Explicitly specify node addresses to load-balance across
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  • A list of equivalent nodes is separated by "|"
  • Different phases of the same distributed request use the same node
§ Specify logical shard ids to search across
  shards=NY_shard,NJ_shard
§ Query across all shards in the collection
  http://localhost:8983/solr/collection1/select?distrib=true
§ SolrJ Java client that load-balances across all nodes in the cluster
  public CloudSolrServer(String zkHost)
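The shards syntax above can be sketched in a few lines of Python. This is only an illustration of the string format shown on the slide ("|" between equivalent nodes, "," between shards); `build_shards_param` is a hypothetical helper name, not part of Solr or SolrJ.

```python
def build_shards_param(shard_replicas):
    """Join equivalent nodes with '|' and distinct shards with ','."""
    return ",".join("|".join(replicas) for replicas in shard_replicas)


# Two shards, each with two equivalent nodes (addresses taken from the slide).
shards = build_shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
])
print("shards=" + shards)
```

Logical shard ids work the same way: one single-element list per shard.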

Extended Dismax Parser
Example: &defType=edismax&q=foo&qf=body
§ Superset of dismax
§ Designed to directly handle user queries w/o exceptions
  • Fixes edge cases where dismax could still throw exceptions (example inputs: OR, AND, NOT, -, ")
§ Full Lucene syntax support
  • Tries Lucene syntax first
  • Smart escaping is done if there are syntax errors
§ Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
§ Fielded queries (e.g. myfield:foo) even in degraded mode
§ uf parameter controls what field names may be directly specified in "q"

Extended Dismax Parser (continued)
§ boost parameter for multiplicative boost-by-function
§ Pure negative query clauses
  Example: solr OR (-solr)
§ Enhanced term proximity boosting
  • pf2=myfield results in term bigrams in sloppy phrase queries
    myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
§ Enhanced stopword handling
  • Stopwords omitted in main query, but added in optional proximity boosting part
    Example: q=solr is awesome & qf=myfield & pf2=myfield
    -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
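The pf2 rewriting above can be illustrated with a small Python sketch. It only mimics the bigram expansion shown on the slide by splitting on whitespace; it is not the edismax parser's actual code, and `bigram_phrases` is a hypothetical name.

```python
def bigram_phrases(field, text):
    """Return one quoted two-term phrase clause per adjacent pair of terms."""
    terms = text.split()
    return ['{}:"{} {}"'.format(field, a, b) for a, b in zip(terms, terms[1:])]


# Reproduces the slide's example: a three-term phrase becomes two bigrams.
print(" ".join(bigram_phrases("myfield", "aa bb cc")))
```

Note that the bigrams keep stopwords ("solr is", "is awesome"), which is how the proximity part can still use them when the main query drops them.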

Faceting Performance Improvements
§ For facet.method=enum, speed up initial population of the filterCache (i.e. first-time facet): from 30% to 32x improvement
§ Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
§ Optimized deep facet paging: up to 10x faster with really large facet.offsets
§ Less memory consumed by field cache entries
§ Per-segment faceting with facet.method=fcs
  • Only faster when re-opening index frequently (many times a second)
  • Only works for single-valued fields

Pivot Faceting
§ Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
§ Syntax: facet.pivot=field1,field2,field3,…
Example: facet.pivot=cat,inStock
                     #docs   #docs w/ inStock:true   #docs w/ inStock:false
  cat:electronics    14      10                      4
  cat:memory         3       3                       0
  cat:connector      2       0                       2
  cat:graphics card  2       0                       2
  cat:hard drive     2       2                       0

Pivot Faceting (continued)
http://...&facet=true&facet.pivot=cat,popularity

  "facet_counts": {
    "facet_pivot": {
      "cat,popularity": [
        { "field": "cat",
          "value": "electronics",
          "count": 14,                  <- 14 docs w/ cat==electronics
          "pivot": [
            { "field": "popularity",
              "value": "6",
              "count": 5},              <- 5 docs w/ cat==electronics && popularity==6
            { "field": "popularity",
              "value": "7",
              "count": 4},
            { "field": "popularity",
              "value": "1",
              "count": 2}]},
        { "field": "cat",
          "value": "memory",
          "count": 3,
          "pivot": []},
        […]
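What a two-level pivot counts can be sketched over an in-memory list of documents. This is a simplified illustration, assuming single-valued fields and hypothetical sample documents; it is not how Solr computes pivots internally, and `pivot_counts` is a made-up name.

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count docs per outer value, then per inner value within each outer value."""
    totals = Counter()
    nested = defaultdict(Counter)
    for doc in docs:
        totals[doc[outer]] += 1
        nested[doc[outer]][doc[inner]] += 1
    return [
        {"field": outer, "value": value, "count": count,
         "pivot": [{"field": inner, "value": v, "count": c}
                   for v, c in nested[value].most_common()]}
        for value, count in totals.most_common()
    ]


# Hypothetical documents, not the slide's data set.
sample_docs = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]
for entry in pivot_counts(sample_docs, "cat", "inStock"):
    print(entry)
```

The output mirrors the nested shape of the facet_pivot response: each outer bucket carries its own count plus a "pivot" list of inner buckets.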

Range Faceting
§ Like date faceting, but more generic
http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

  "facet_counts": {
    "facet_ranges": {
      "price": {
        "counts": {
          "0.0": 5,
          "50.0": 2,
          "100.0": 0,
          "150.0": 2,
          "200.0": 0,
          "250.0": 1,
          "300.0": 2,
          "350.0": 2,
          "400.0": 0,
          "450.0": 1},
        "gap": 50.0,
        "start": 0.0,
        "end": 500.0}}}}
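The bucketing behind a range facet can be sketched as follows. This is a minimal illustration, assuming buckets that include their lower bound and exclude their upper bound, and a range that divides evenly by the gap; `range_facet` is a hypothetical name and the prices are sample values.

```python
def range_facet(values, start, end, gap):
    """Count values per [lower, lower + gap) bucket, keyed by the lower bound."""
    counts = {}
    lower = start
    while lower < end:
        counts[str(float(lower))] = 0  # keep empty buckets, as in the response above
        lower += gap
    for value in values:
        if start <= value < end:
            bucket = start + ((value - start) // gap) * gap
            counts[str(float(bucket))] += 1
    return counts


# Sample prices; 500.0 falls outside the [0, 500) range and is not counted.
prices = [11.5, 19.95, 399.0, 500.0]
print(range_facet(prices, start=0, end=500, gap=50))
```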

Spatial Search
Step 1: Index some locations!
  The Alpine Shop   44.013617,-73.168264
Step 2: Decide where you are
  &pt=44.0153371,-73.16734 &d=1 &sfield=store
Step 3: Profit!
  Spatial filter: &fq={!geofilt}
  Bounding box: &fq={!bbox}
  Distance function: &sort=geodist() asc
  Returning the distance: &fl=geodist() (Pseudo-fields!)
Note: You can now sort by any arbitrary function query!
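To get a feel for the numbers in this example, here is a plain haversine great-circle calculation in Python. It is a stand-in, not Solr's geodist() implementation, and it assumes a mean Earth radius of 6371 km and that d is expressed in kilometers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)  # The Alpine Shop, from step 1
here = (44.0153371, -73.16734)   # the pt parameter, from step 2
distance = haversine_km(*here, *store)
print("within d=1:", distance <= 1.0)
```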

Pseudo-Fields
Returns other info along with document stored fields
§ Function queries
  fl=name,location,geodist(),add(myfield,10)
§ Fieldname globs
  fl=id,attr_*
§ Multiple "fl" (field list) values
  &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
§ Aliasing
  fl=id,location:loc,_dist_:geodist()
§ Future: inlined highlighting, "explain", sort-values, group-value

Result Grouping / Field Collapsing
§ Goal
  • Limit the number of results per category
  • "category" normally defined by unique values in a field
§ Uses
  • Web search: collapse by web site
  • Email threads: collapse by thread id
  • Ecommerce/retail: show the top 5 items for each store category (music, movies, etc.)
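The goal above (limit the number of results per category) can be sketched over an already-ranked result list. This is only an in-memory illustration with hypothetical documents; `group_results` and its arguments are made-up names that loosely echo Solr's group.limit and rows parameters, not Solr code.

```python
def group_results(ranked_docs, field, group_limit=1, rows=10):
    """Collapse a ranked list into groups, keeping the top docs of each group."""
    groups = {}  # dicts preserve insertion order, so groups stay ranked by top doc
    for doc in ranked_docs:
        key = doc.get(field)
        if key not in groups:
            if len(groups) == rows:
                continue  # already have enough groups; skip docs of new groups
            groups[key] = {"groupValue": key, "numFound": 0, "docs": []}
        group = groups[key]
        group["numFound"] += 1
        if len(group["docs"]) < group_limit:
            group["docs"].append(doc)
    return list(groups.values())


# Hypothetical ranked results, best match first.
ranked = [
    {"id": "d1", "manu": "Belkin"},
    {"id": "d2", "manu": "Apple"},
    {"id": "d3", "manu": "Belkin"},
]
for group in group_results(ranked, "manu", group_limit=1):
    print(group)
```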

Field Collapsing by Site

Result Grouping by Category Field Collapse on Product Type

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

  "grouped": {
    "manu_exact": {
      "matches": 3,
      "groups": [{
          "groupValue": "Belkin",
          "doclist": {"numFound": 2, "start": 0, "docs": [
              { "id": "IW-02",
                "name": "iPod & iPod Mini USB 2.0 Cable"}]
          }},
        { "groupValue": "Apple Computer Inc.",
          "doclist": {"numFound": 1, "start": 0, "docs": [
              {
              […]

Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

  "grouped": {
    "price:[0 TO 99.99]": {
      "matches": 3,
      "doclist": {"numFound": 2, "start": 0, "docs": [
          { "id": "IW-02",
            "name": "iPod & iPod Mini USB 2.0 Cable"},
          { "id": "F8V7067-APL-KIT",
            "name": "Belkin Mobile Power Cord for iPod"}]
      }},
    "price:[100 TO *]": {
      "matches": 3,
      "doclist": {"numFound": 1, "start": 0, "docs": [
          […]

Grouping Params
  parameter              meaning                                                                           default
  group.field=           Like facet.field: group by unique field values
  group.query=           Like facet.query: top docs that also match
  group.function=        function query
  group.limit=           How many docs per group                                                           1
  group.sort=            How to sort documents within a group                                              Same as sort
  rows=                  How many groups to return                                                         10
  sort=                  How to sort the groups relative to each other (based on top doc)
  group.format=          grouped/simple: if simple, a single flat list is used and rows units are "docs"   grouped
  group.main=true/false  If true, the first field grouping command is used as main result set              false

Pseudo-Join
Blog documents:
  id: blog1   name: Solr 'n Stuff   owner: Yonik Seeley   started: 2007-10-26
  id: blog2   name: lifehacker      owner: Gawker Media   started: 2005-1-31
Post documents:
  id: post1   blog_id: blog1   author: Yonik Seeley     title: Solr relevancy function queries   body: Lucene's default ranking […]
  id: post2   blog_id: blog1   author: Yonik Seeley     title: Solr result grouping              body: Result Grouping, also called […]
  id: post3   blog_id: blog2   author: Whitson Gordon   title: How to Install Netflix on Almost Any Android Device
Restrict to blogs mentioning netflix:
  fq={!join from=blog_id to=id}body:netflix
  • Finds all documents matching "netflix"
  • Maps to different docs by following blog_id to id
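The two steps of the join (find matching docs, then follow "from" values to "to" values) can be sketched in memory. This is an illustration only, assuming single-valued fields; `pseudo_join` is a hypothetical name, and the body text given to post3 is made up because the slide does not show it.

```python
def pseudo_join(docs, matches, from_field, to_field):
    """Return docs whose to_field equals the from_field of any matching doc."""
    join_values = set()
    for doc in docs:
        if matches(doc) and doc.get(from_field) is not None:
            join_values.add(doc[from_field])  # step 1: collect "from" values
    # step 2: map to the docs whose "to" field carries one of those values
    return [doc for doc in docs if doc.get(to_field) in join_values]


index = [
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post1", "blog_id": "blog1", "body": "lucene default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "result grouping"},
    {"id": "post3", "blog_id": "blog2", "body": "install netflix on android"},  # hypothetical body
]


def mentions_netflix(doc):
    return "netflix" in doc.get("body", "").split()


# Rough equivalent of fq={!join from=blog_id to=id}body:netflix
print([doc["id"] for doc in pseudo_join(index, mentions_netflix, "blog_id", "id")])
```

Only the joined-to documents come back; nothing from the matching posts is carried over, which is the "filters only" behavior of the join.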

Pseudo-Join Examples
§ Only show posts from blogs started after 2010
  q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
§ If any post in a blog mentions "obama", then search all posts in that blog for "bomb" (self-join)
  q=bomb&fq={!join from=blog_id to=blog_id}obama
§ If any blog post mentions "obama", then search all websites with the same blog owner for "bomb"
  q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama

Cross-Core Join
Single Solr server with two cores:
collection1
  id: doc1   security: managers              title: doc for managers only   body: …
  id: doc1   security: managers, employees   title: doc for everyone        body: …
sec1
  id: mary   security_groups: managers, employees
  id: john   security_groups: employees
http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john

Pseudo-Join vs Grouping
Pseudo-Join
§ O(n_terms_in_join_fields)
§ Single or multi-valued fields
§ Filters only (no info currently passed from the "from" docs to the "to" docs)
§ Chainable (one join can be the input to another)
§ Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs)
Result Grouping / Field Collapsing
§ O(n_docs_in_result)
§ Single-valued fields only
§ Can order docs within a group, and groups by top doc within that group, using normal sort criteria
§ Not currently chainable: can only group one field deep
§ Grouping does not currently affect the set of documents matching the query, so faceting is unaffected

Auto-Suggest
§ Many people previously used terms component
  • Can be slow for a large corpus
§ New auto-suggest builds off SpellCheck component
  • TST implementation: compact memory-based trie
  • FST implementation: slower to build, but smaller & faster lookup
  • Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

  "spellcheck": {
    "suggestions": [
      "ult", {
        "numFound": 1,
        "startOffset": 0,
        "endOffset": 3,
        "suggestion": ["ultrasharp"]},
      "collation", "ultrasharp"]}}
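Prefix lookup itself can be illustrated with a binary search over a sorted term list. To be clear, this swaps in a much simpler technique than the TST and FST structures named above; it only shows the lookup contract (prefix in, completions out). The term list is hypothetical and `suggest` is a made-up name.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_terms, prefix)  # first term >= prefix
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(results) == limit:
            break
        results.append(term)
    return results


# Hypothetical dictionary terms, sorted once up front.
terms = sorted(["belkin", "ipod", "ultrasharp", "usb"])
print(suggest(terms, "ult"))
```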

Index with JSON
  $ URL=http://localhost:8983/solr/update/json
  $ curl $URL -H 'Content-type:application/json' -d '
  [
    {
      "id" : "978-0641723445",
      "cat" : ["book", "hardcover"],
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s" : "fantasy",
      "inStock" : true,
      "price" : 12.50,
      "pages_i" : 384
    }
  ]'
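The same request can be prepared from Python's standard library. This sketch only builds the request object and deliberately does not send it, so it runs without a Solr server; `build_update_request` is a hypothetical helper, and the URL is the one from the slide.

```python
import json
import urllib.request


def build_update_request(url, docs):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_update_request(
    "http://localhost:8983/solr/update/json",
    [{
        "id": "978-0641723445",
        "cat": ["book", "hardcover"],
        "title": "The Lightning Thief",
        "author": "Rick Riordan",
        "inStock": True,
        "price": 12.50,
    }],
)
# urllib.request.urlopen(request) would send it to a running server.
print(request.get_method(), request.full_url)
```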

Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

  name,price,cat,popularity
  iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
  Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
  Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

§ Can handle multi-valued fields (see "cat" field in example)
§ Completely compatible with the CSV update handler (can round-trip)
§ Results are streamed: good for dumping entire parts of the index
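A client can read this response with Python's csv module. The sketch below parses the sample rows from the slide and splits the multi-valued "cat" column on commas; that split is an assumption that holds only when individual values contain no commas or escapes. `parse_csv_response` is a hypothetical name.

```python
import csv
import io

# The sample response from the slide, as the client would receive it.
response_text = (
    'name,price,cat,popularity\n'
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)


def parse_csv_response(text, multi_valued=("cat",)):
    """Parse CSV rows into dicts, splitting multi-valued columns into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


for row in parse_csv_response(response_text):
    print(row["name"], row["cat"])
```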

http://localhost:8983/solr/browse

Q&A