bc2724ae072f56ef011341ddb7bdec9d.ppt
- Number of slides: 28
A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang, Smithsonian Institution
collections.si.edu
wangch@si.edu
Background Information
- Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge
- 19 museums and 9 research institutes
- 136 million collection objects
- 12 major museum collection information systems (with 30 databases)
- Hundreds of other databases
Issues we faced
- Users want information now!
  - Google effect and users' mentality: "If it is not online, it does not exist."
  - Users want immediate access to digital documents.
  - Separate databases are confusing to the public.
- We must act now!
Smithsonian’s Collection Searching Center Overview
- a discovery center for information with a single searching point
- faceted searching and content-sensitive navigation
- positive and negative browse & select options
- relevancy ranking of search results
- automatic stemming for word matching
Smithsonian’s Cross Searching Catalog Overview (continued)
- integrated searching of data from multiple types of databases
- scalability for large data sets
- a metadata center which interacts with other online applications
Project Team and Resources
- Andrew Gunther – Software development and implementation
- Jim Felley – Data conversion and implementation
- George Bowman – Database management and security configuration
- Randy Arnold – Project support
- Ching-hsien Wang – Program Manager
Since August 2007, we have integrated data from 12 major databases with 2 million records.
Starting from Multiple databases
Transform into a single Search Center
Cross Searching Demo – simple opening screen
Demo – search result screen
Demo – search history
Process Flow Diagram
[Diagram: source systems (Horizon, Digital Library, Digital Archives, Digital Museum) → data extraction and transformation → XML documents → Solr / Lucene index → output data in XML, JSON, or Python → applications (Cross Searching Catalog, online exhibition, virtual museum in Second Life, education interface, open access applications)]
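The output side of this flow, where an application asks Solr for results in a chosen format, can be sketched roughly as follows. This is a minimal illustration only: the base URL, the `title` field, and the sample response are assumptions, not the Smithsonian configuration. The `wt` parameter is Solr's standard way of selecting the response format.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, wt="json", rows=10):
    """Build a Solr select URL; `wt` picks the output format (xml, json, python)."""
    params = {"q": query, "wt": wt, "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


def extract_docs(raw_json):
    """Pull the list of matching documents out of a Solr JSON response."""
    return json.loads(raw_json)["response"]["docs"]


# Hypothetical response body, shaped like Solr's standard JSON output.
SAMPLE_RESPONSE = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1, "start": 0,'
    ' "docs": [{"id": "rec-1", "title": "Sample record"}]}}'
)

# Assumed local Solr address, for illustration only.
url = build_select_url("http://localhost:8983/solr", "airplane")
docs = extract_docs(SAMPLE_RESPONSE)
```

A client application would fetch `url` over HTTP and pass the body to `extract_docs`; the sample string stands in for that response here.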
Automated Process
[Diagram: database triggers on the Horizon source catalogs (Library, Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Smithsonian History, Research Bibliographies, Airplane Directory, ...) write changed records to the Solr_Index_Pending DB table → a Perl program converts records based on BIB# → XML data transformation → XML documents]
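The conversion step could look something like the sketch below. The slides describe a Perl program; Python is swapped in here purely for illustration, and the record layout and field names (`id`, `title`, `topic`) are hypothetical. The `<add><doc><field name="...">` structure is Solr's standard XML update format.

```python
import xml.etree.ElementTree as ET


def record_to_solr_xml(record):
    """Convert one record (dict of field -> value or list of values)
    into a Solr XML <add> document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, values in record.items():
        if isinstance(values, str):
            values = [values]
        # Multi-valued fields become repeated <field> elements.
        for value in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical rows waiting in the pending table, keyed by BIB#.
pending = [
    {"id": "bib-1001", "title": "Sample title", "topic": ["Art", "Photography"]},
]

xml_docs = [record_to_solr_xml(record) for record in pending]
```

Each string in `xml_docs` is the kind of document that would be posted to Solr's update handler.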
Define an Index Metadata Model: Free text data fields used for keyword searching & display
- Record Link
- Title/Object-name
- Identifier
- Physical Description
- Gallery Label
- Notes
- Publisher
- Object Type
- Taxonomic Name
- Language
- Topic
- Place
- Date
- Name
- Culture
- Set Name
- Data Source
- Credit Line
- Online Media Group
Facet data fields used for browsing and limiting
- Record ID
- Object Type
- Language
- Topic
- Place
- Date
- Name
- Culture
- Data Source
- Online Media Type
- Rights for Online Media File
- Related Record
- Usage Flag
- Taxon-Kingdom
- Taxon-Phylum
- Taxon-Division
- Taxon-Class
- Taxon-Order
- Taxon-Family
- Taxon-Sub-Family
- Scientific_name
- Common name
- Geo-Age-Era
- Geo-Age-System
- Geo-Age-Series
- Geo-Age-Stage
- Strat-Group
- Strat-Formation
- Strat-Member
Getting help from Solr
- Task-specific handlers: request handler, response handler, update handler
- The schema.xml file defines the fields to be indexed, displayed, and searchable.
- The solrconfig.xml file defines cache size, faceted field type, and request handler customization.
[Diagram: Solr on top of the Lucene index]
solrconfig.xml Example: facet field definition
[Example XML not preserved; it referenced the field object_type]
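At query time, faceting and the positive and negative select options can be expressed as request parameters. The sketch below is one hedged way to assemble them. `facet`, `facet.field`, and `fq` are standard Solr parameters; `object_type` comes from the slide, while `topic`, `place`, and the helper name `facet_params` are assumptions for illustration.

```python
from urllib.parse import urlencode


def facet_params(query, facet_fields, include=None, exclude=None):
    """Build Solr parameters for a faceted search.

    include / exclude are lists of (field, value) pairs: an include
    narrows results to that value, an exclude removes it.
    """
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in include or []:
        params.append(("fq", f'{field}:"{value}"'))
    for field, value in exclude or []:
        # A leading minus negates the filter query.
        params.append(("fq", f'-{field}:"{value}"'))
    return params


params = facet_params(
    "photographs",
    ["object_type", "topic"],
    include=[("topic", "Art")],
    exclude=[("place", "Africa, North")],
)
query_string = urlencode(params)
```

The sketch does not escape quotes inside values; a production version would need to.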
A system is only as good as the data that is in it.
Data mapping for multiple databases (truncated)
Faceted Categories
- Determine the most useful facets; more is not better.
  - The number of unique facets will affect system response time.
- Smithsonian has 4.6 million unique terms. Among them:
  - 864,000 names
  - 126,000 topics
  - 47,000 places
  - 139 dates (down from 40,000 before cleanup)
  - 1,000 types (down from 2,000 before cleanup)
Build the facet terms 650 $a Art $z Africa, North $v Periodicals.
Build the facet terms 655 $a Photographs $y 1840-1860.
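The two MARC examples above can be split into facet terms along these lines. In MARC 21, 650 is a topical subject heading and 655 a genre/form term, with $z a geographic, $y a chronological, and $v a form subdivision. The mapping of those subfields onto the facet names `topic`, `place`, `date`, and `object_type` is an assumption made for this sketch, not the project's documented mapping.

```python
import re

# Assumed mapping from MARC subfield code to facet name.
SUBFIELD_TO_FACET = {"a": "topic", "z": "place", "y": "date", "v": "object_type"}


def parse_subject(line):
    """Split a line like '650 $a Art $z Africa, North' into (tag, facets)."""
    tag, _, rest = line.partition(" ")
    facets = {}
    for code, value in re.findall(r"\$(\w)\s*([^$]*)", rest):
        # Drop surrounding whitespace and the trailing period of the heading.
        value = value.strip().rstrip(".").strip()
        facet = SUBFIELD_TO_FACET.get(code)
        if tag == "655" and code == "a":
            # In a genre/form field, $a names the type of material.
            facet = "object_type"
        if facet and value:
            facets.setdefault(facet, []).append(value)
    return tag, facets


tag_650, facets_650 = parse_subject("650 $a Art $z Africa, North $v Periodicals.")
tag_655, facets_655 = parse_subject("655 $a Photographs $y 1840-1860.")
```

Stripping the trailing period is a simplification; it would also trim headings that legitimately end in an abbreviation.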
Challenges
- Adapting LCSH and AAT terms in a whole new way
- Still seeking a good way to use See and See Also reference data
- Reducing data inconsistency in our records for better quality facet terms
- Character conversion challenges with MARC-8, Unicode, and UTF-8
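One small part of the inconsistency problem, variant spellings of the same facet term, can be illustrated as follows. This is a sketch under assumptions: the cleanup rules (Unicode NFC normalization, whitespace collapsing, trailing punctuation removal, case-insensitive matching) are examples rather than the project's actual rules, and it does not perform MARC-8 conversion, which needs a dedicated library.

```python
import unicodedata


def normalize_term(term):
    """Return a cleaned-up display form of a facet term."""
    # Compose base letter + combining mark into a single code point.
    term = unicodedata.normalize("NFC", term)
    # Collapse runs of whitespace and trim the ends.
    term = " ".join(term.split())
    return term.rstrip(" .,;")


def dedupe_terms(terms):
    """Merge terms that differ only in case, spacing, or trailing punctuation,
    keeping the first cleaned form seen."""
    seen = {}
    for term in terms:
        cleaned = normalize_term(term)
        seen.setdefault(cleaned.casefold(), cleaned)
    return list(seen.values())


raw = ["Photographs.", "photographs", "Photographs ", "Cafe\u0301", "Caf\u00e9"]
terms = dedupe_terms(raw)
```

Here five raw values collapse to two facet terms, the kind of reduction reported for dates and types after cleanup.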
Future plans
- Continue to add data from more digital library databases and museum collection databases
  - Working on National History museum and American Indian museum
- Complete the implementation of the capability to interact with external applications
  - Plan to support the "American Art and Artist" application
- Add new functionality such as my-list, list-sharing, and social tagging
- Support more visual displays such as Google map and time slider
A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang, Smithsonian Institution
www.siris.si.edu
wangch@si.edu


