  • Number of slides: 26

Search
6 December 2006 10:30 INTL 6
Dr Ian Boston
University of Cambridge
Image © University of Cambridge 2006

Search: Problem Area
• Stovepipe Applications
  – All wanted search
• Can't search each tool
• Unified Search of all content
  – 1 Text box + a button
  – Just like Google
    • To start with
    • Slightly less content

Possible Solutions

Public/Private Search Engine
– Register your site with Google
  • What about the content/permissions?
  • Non-starter, content missing.
– Google Scholar
  • E.g. DSpace
– Google Researcher? Google Learner?
  • Sakai is not OpenAccess
  • Why would they?

Private Search Application
– Intranet solution
  • Install Apache Nutch
    › Add AuthZ code
  • Buy a Google Appliance
    › Configure to do some AuthZ
    › ~£40K, 0.5M pages
– Rendered content is only a view
  • Misses properties
  • Approximates linkage
    › Doesn't know about Sakai
– Nutch Prototype in 1.5.1

Entity Search
– Write a search engine!
  • Full-time job.
– Reuse Lucene
  • Scalability
    › Most have < 5M active documents
    › Nutch benchmarked
      » 5 boxes, 2 TB == 100M+ docs
      » http://wiki.apache.org/nutch/HardwareRequirements
  • Plumb in Lucene
    › Connect to Sakai Entity Bus
    › Connect to Entity Producers at the object level.
  • Learn from Nutch
    › Index Storage and Management
    › Scalability, Reliability
– MUST Cluster OOTB

Search Tool

Search Tool

Search Tool
• Permissions
  – Owning Entity checks permission on each result
• Rendering Highlighting
  – Matching terms highlighted
• RSS Feed of search results
• OpenSearch (FF 2.0, IE 7) and Sherlock/Mycroft (FF 1.5) integration
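
The permissions bullet describes post-filtering: each hit is handed back to the entity that owns it, and that owner decides whether the current user may see it. A minimal Python sketch of the idea, under stated assumptions: `Hit`, `filter_hits`, the owner registry, and the toy wiki rule are all hypothetical names invented for illustration, not Sakai's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    reference: str   # entity reference of the matching document
    tool: str        # name of the owning tool (hypothetical field)
    score: float


# Hypothetical registry type: (user_id, reference) -> allowed?
PermissionCheck = Callable[[str, str], bool]


def filter_hits(hits: List[Hit], user_id: str,
                owners: Dict[str, PermissionCheck]) -> List[Hit]:
    """Keep only the hits whose owning entity grants read access.

    Hits from tools with no registered owner are dropped (deny by default).
    """
    visible = []
    for hit in hits:
        check = owners.get(hit.tool)
        if check is not None and check(user_id, hit.reference):
            visible.append(hit)
    return visible


def wiki_can_read(user_id: str, reference: str) -> bool:
    # Toy rule assumed for the example: only "alice" may read site "s1".
    return reference.startswith("/wiki/s1/") and user_id in {"alice"}


hits = [Hit("/wiki/s1/home", "wiki", 0.9),
        Hit("/wiki/s2/home", "wiki", 0.8),
        Hit("/chat/s1/42", "chat", 0.7)]
visible = filter_hits(hits, "alice", {"wiki": wiki_can_read})
```

Checking after the search rather than inside the index keeps permission logic with the tool that owns the content, at the cost of one check per result.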

Admin Tool

Admin Tool
• Monitor Indexing progress
• Monitor Segments
• Request Worksite Index Rebuilds
• Request Complete Index Rebuilds
  – Expensive!

Tag Tool

Tag Tool
• Search for a term
• Discover other terms
  – Size indicates relevance within result set
• Needs some windowing on the word vectors
  – High frequency words not significant
  – Short words not significant
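
The windowing idea on this slide can be sketched as follows: count terms across the result set, discard short terms and terms that appear in nearly every document, then scale what remains to a font size. This is an illustrative Python sketch only. The function name, the thresholds (`min_len`, `max_doc_ratio`), and the point sizes are assumptions, not values from the talk.

```python
from collections import Counter
from typing import Dict, List


def tag_sizes(docs: List[List[str]], min_len: int = 4,
              max_doc_ratio: float = 0.8,
              min_pt: int = 10, max_pt: int = 24) -> Dict[str, int]:
    """Map each surviving term to a font size in points.

    docs          -- one term list per document in the result set
    min_len       -- shorter terms are treated as not significant
    max_doc_ratio -- terms in more than this share of documents are dropped
    """
    total = Counter()      # term -> occurrences across the result set
    doc_freq = Counter()   # term -> number of documents containing it
    for terms in docs:
        total.update(terms)
        doc_freq.update(set(terms))

    kept = {t: n for t, n in total.items()
            if len(t) >= min_len and doc_freq[t] / len(docs) <= max_doc_ratio}
    if not kept:
        return {}
    lo, hi = min(kept.values()), max(kept.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {t: min_pt + (n - lo) * (max_pt - min_pt) // span
            for t, n in kept.items()}


docs = [["search", "lucene", "the", "index", "lucene"],
        ["search", "sakai", "the", "index"],
        ["search", "lucene", "the", "wiki"]]
sizes = tag_sizes(docs)
```

In this sample, "search" is dropped because it occurs in every document and "the" because it is too short.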

Search API
• Simple API, one method: Search()
• Results paged at lowest level
• Access to secondary Indexes
  – “+Tool:wiki +Site: +cowslips +bluebell”
• Content terms use Porter Stemmer and Stop words
  – Stop words “and” “the” “a” ignored
  – Stemmer: looks == look, try == trying
• May be some i18n issues
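
To make the query handling concrete, here is a small Python sketch that splits a query into secondary-index filters (`field:value`) and content terms, then drops stop words and stems what is left. It rests on several assumptions: `parse_query` and `toy_stem` are hypothetical names, `toy_stem` is a crude suffix stripper standing in for the Porter stemmer (it is not the Porter algorithm), and `mysite` is a made-up site id.

```python
from typing import Dict, List, Tuple

STOP_WORDS = {"and", "the", "a"}  # the three stop words named on the slide


def toy_stem(word: str) -> str:
    """Crude suffix stripper standing in for the Porter stemmer (NOT Porter)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse_query(query: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a query into secondary-index filters and stemmed content terms."""
    filters: Dict[str, str] = {}
    terms: List[str] = []
    for token in query.split():
        token = token.lstrip("+")
        if not token:
            continue
        if ":" in token:
            field, _, value = token.partition(":")
            filters[field.lower()] = value
        elif token.lower() not in STOP_WORDS:
            terms.append(toy_stem(token.lower()))
    return filters, terms


# "mysite" is a made-up site id for the example.
filters, terms = parse_query(
    "+Tool:wiki +Site:mysite +cowslips and the +bluebell")
```

Because the same stemming is applied at index time and at query time, a search for "looks" matches documents containing "look", and "trying" matches "try".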

Internal Architecture
Image © Wikipedia Commons 2006

Architecture
[Diagram labels: RWiki Search, Tag Tool, Search Tool, Resources Tool, OSP Tools, Chat Tool, Message Service, Email Tool, Content Service, Announcements, Wiki Service, Wiki Tool, Search API, Search Service, Entity Content Producer, Index Builder, Lucene, Event Listener, Sakai Entity Bus, Index Queue, Clustered Index Store, Local Segment Store, Shared Segment Store]

Indexer
– Indexing Queue
  • Events arrive on the Bus
  • Added to the Queue transactionally
– Indexing
  • Index workers run concurrently (2 per Sakai node)
  • Take Events from the queue
  • Open an Abstract Lucene segment
  • Distributed lock manager
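
The queue-and-workers flow on this slide might look roughly like the Python sketch below. It is a single-process illustration under stated assumptions: `run_indexers` and the event fields (`ref`, `text`) are hypothetical, an in-memory dict stands in for a Lucene segment, and a plain `threading.Lock` stands in for the distributed lock manager. Only the figure of two workers per node comes from the slide.

```python
import queue
import threading
from typing import Dict, List

WORKERS_PER_NODE = 2          # figure taken from the slide
_STOP = object()              # sentinel telling a worker to exit


def run_indexers(events: List[dict]) -> Dict[str, str]:
    """Drain `events` with concurrent workers and return the built index."""
    pending = queue.Queue()
    index: Dict[str, str] = {}          # reference -> indexed text
    segment_lock = threading.Lock()     # local stand-in for the lock manager

    def worker() -> None:
        while True:
            event = pending.get()
            if event is _STOP:
                break
            with segment_lock:          # one writer on the segment at a time
                index[event["ref"]] = event["text"].lower()

    threads = [threading.Thread(target=worker)
               for _ in range(WORKERS_PER_NODE)]
    for t in threads:
        t.start()
    for event in events:
        pending.put(event)
    for _ in threads:
        pending.put(_STOP)              # one sentinel per worker
    for t in threads:
        t.join()
    return index


events = [{"ref": f"/doc/{i}", "text": f"Text {i}"} for i in range(20)]
built = run_indexers(events)
```

The sketch leaves out ordering between events for the same reference. A real indexer would have to ensure that an update and a later delete for one document are applied in sequence.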

Content
– Entity Content Producer
  • Digests a Token Stream
    › On Content
    › Using Stemmer and Stop Words
  • Provides index terms
    › Site ID
    › User info
    › Properties
    › Tool
    › Custom
  • RDF Structure
    › Requires a triple store
    › Sesame in Contrib
    › Mulgara/Kowari needs work.

Cluster Index Storage
– Not Distributed
  • Mirrored for Central Deposit
  • Not as scalable as Nutch with Google MapReduce
  • BUT no setup required
– Local Segments
  • Opened by IndexReaders, IndexWriters, IndexSearchers
  • High performance Seek
– Shared Segments
  • Central deposit of search segments
  • Synchronized with local copies
– Periodic Merging
  • Reduce open files
  • Eliminate deleted items
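
As a rough illustration of the last two groups of bullets, the Python sketch below models a segment as a dictionary of documents plus a set of deletion marks. It shows two things: a merge that folds several segments into one while discarding deleted documents, and a check for which shared segments a node still has to copy locally. This is a deliberately simplified model and not Lucene's on-disk format. `Segment`, `merge_segments`, and `missing_locally` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Segment:
    name: str
    docs: Dict[str, str]                             # doc id -> indexed text
    deleted: Set[str] = field(default_factory=set)   # ids marked deleted


def merge_segments(segments: List[Segment], name: str) -> Segment:
    """Fold segments (oldest first) into one, dropping deleted documents."""
    merged: Dict[str, str] = {}
    for seg in segments:
        for doc_id in seg.deleted:
            merged.pop(doc_id, None)      # deletion hides older copies
        for doc_id, text in seg.docs.items():
            if doc_id not in seg.deleted:
                merged[doc_id] = text     # newer copy replaces older one
    return Segment(name=name, docs=merged)


def missing_locally(shared: Set[str], local: Set[str]) -> Set[str]:
    """Segment names present in the shared store but absent on this node."""
    return shared - local


segs = [Segment("seg-1", {"a": "alpha", "b": "beta"}),
        Segment("seg-2", {"c": "gamma", "a": "alpha v2"}, deleted={"b"})]
merged = merge_segments(segs, "seg-merged")
to_fetch = missing_locally({"seg-1", "seg-2", "seg-3"}, {"seg-1"})
```

After the merge there is one segment to keep open instead of two, and the deleted document no longer occupies space.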

Production Deployment
Image © University of Cardiff 2006

Sites
• In production
  – Cambridge
    • 73K documents, 6 GB index, content in index.
    • Rebuild time = 45 minutes
  – Cape Town
    • 93K documents, 200 MB index, content not in index.
    • Rebuild time = ?
  – Others?
• Considering
  – Michigan
    • 1.7M documents
    • Rebuild time… Weeks?
    • Should not put the content in the index
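
The figures on this slide allow a back-of-envelope comparison of index size per document with and without stored content. The Python sketch below only does that arithmetic. It assumes decimal units (1 GB = 10^9 bytes), and the extension to Michigan is a naive linear extrapolation across sites with different content, so treat the results as orders of magnitude rather than predictions.

```python
def bytes_per_doc(index_bytes: float, documents: int) -> float:
    """Average index footprint of one document, in bytes."""
    return index_bytes / documents


GB, MB = 1e9, 1e6  # decimal units, assumed

cambridge = bytes_per_doc(6 * GB, 73_000)     # content stored in the index
cape_town = bytes_per_doc(200 * MB, 93_000)   # content kept out of the index

# Naive linear extrapolation to the 1.7M documents quoted for Michigan.
michigan_docs = 1_700_000
with_content_gb = michigan_docs * cambridge / GB
without_content_gb = michigan_docs * cape_town / GB

print(f"Cambridge: {cambridge / 1e3:.0f} KB/doc, "
      f"Cape Town: {cape_town / 1e3:.1f} KB/doc")
print(f"Michigan estimate: {with_content_gb:.0f} GB with content, "
      f"{without_content_gb:.1f} GB without")
```

On these numbers, storing content costs tens of kilobytes per document against a couple of kilobytes without it, which is consistent with the slide's advice to keep content out of the index at larger sites.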

Deployment Issues
• Indexing Times
  – Acceptable for smaller sites, a few hours
  – Pain at larger sites
    • Rolling per worksite index build
    • Dedicated indexing cluster (not serving pages)
• Storage strategies
  – First Attempts - Cambridge - Cape Town
    • Cape Town identified many problems - Thank you!
    • MySQL - Don't put segments in DB! - Extremely slow tables.
  – Node Layout
    • All nodes are indexers
  – Content in the Index or Out of the index
    • No content in index now
    • Results re-digested on search

Roadmap
Image from: http://marlin.sourceforge.net (a Gnome 2 media editor). Image © Marlin Pr

New Features
• Tagged Search Discovery
  – Based on word vectors
  – In trunk
  – Needs a lens - focus on distribution segment
• RDF Faceted Discovery
  – Merged word vectors and triples
  – Needs per worksite ontology tools
  – Needs triple store
    • Should be a Sakai-wide store.
      › Kowari - issues with community
      › Mulgara

Roadmap
• Parallel Indexing
  – Implemented, needs heavy testing
  – Learn from Nutch
  – Multiple active indexes
  – Big sites in production
  – Better merge algorithm
• Other tools using search
  – Use indexes for PK search
  – Issues over Queue delays
• Text Mining - Sydney - Rafael Calvo

Questions