
  • Number of slides: 28

A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
collections.si.edu
wangch@si.edu

Background Information
  • Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge
  • 19 museums and 9 research institutes
  • 136 million collection objects
  • 12 major museum collection information systems (with 30 databases)
  • Hundreds of other databases

Issues we faced
  • Users want information now!
      ◦ Google Effect and user's mentality: "if it is not online, it does not exist."
      ◦ Users want immediate access to digital documents.
      ◦ Separate databases are confusing to the public.
We must act now!

Smithsonian's Collection Searching Center Overview
  • a discovery center for information with a single searching point
  • faceted searching and content-sensitive navigation
  • positive and negative browse & select options
  • relevancy ranking of search results
  • automatic stemming for word matching

Smithsonian's Cross Searching Catalog Overview (continued)
  • integrated searching of data from multiple types of databases
  • scalability for large data sets
  • a metadata center which interacts with other online applications

Project Team and Resources
  • Andrew Gunther – Software development and implementation
  • Jim Felley – Data conversion and implementation
  • George Bowman – Database management and security configuration
  • Randy Arnold – Project support
  • Ching-hsien Wang – Program Manager
Since August 2007, we have integrated data from 12 major databases with 2 million records.

Starting from Multiple databases

Transform into a single Search Center

Cross Searching Demo – simple opening screen

Demo – search result screen

Demo – search history

Process Flow Diagram
[Diagram. Labels: Horizon; Digital Library; Digital Archives; Digital Museum; Data Extract and Transformation; Solr XML documents; Solr; Lucene Index; Output data in XML documents; Output data in JSON; Output data in Python; Cross Searching Catalog; Online Exhibition; Virtual Museum in 2nd Life; Education Interface; Open Access Applications]
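The diagram lists JSON among Solr's output formats. As a minimal sketch of how a consuming application might read facet counts from such output, the snippet below parses a hand-written sample response. The sample data and the function name `facet_pairs` are assumptions made for illustration and are not taken from the presentation; the flat term/count list is Solr's default JSON layout for facet fields.

```python
import json

# Hypothetical, hand-written sample shaped like a Solr JSON (wt=json) response.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 3, "start": 0, "docs": []},
  "facet_counts": {
    "facet_fields": {
      "object_type": ["Photographs", 2, "Books", 1]
    }
  }
}
"""


def facet_pairs(response_text, field):
    """Return [(term, count), ...] for one facet field of a Solr JSON response."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    # By default Solr alternates term, count, term, count, ... in a flat list.
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    for term, count in facet_pairs(SAMPLE_RESPONSE, "object_type"):
        print(term, count)
```

An application such as an online exhibition could use pairs like these to render the browse and limit options next to a result list.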

Automated Process
[Diagram. Labels: Horizon source databases, each with a Trigger: Library, Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Smithsonian History, Research Bibliographies, Airplane Directory, ...; Solr_Index_Pending DB table; XML Data Transformation; XML Documents]
A Perl program converts records based on BIB#.
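The automated process can be sketched roughly as follows: a database trigger queues the ID of each changed record in a pending table, and a converter turns the queued records into Solr XML documents. The presentation describes a Perl program working against the Horizon databases; this sketch swaps in Python and an in-memory SQLite database, and the table layout, trigger name, and the `bib_id` and `title` field names are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one source table, one pending queue, one trigger.
SCHEMA = """
CREATE TABLE records (bib_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE solr_index_pending (bib_id INTEGER PRIMARY KEY);
CREATE TRIGGER queue_changed_record AFTER UPDATE ON records
BEGIN
    INSERT OR IGNORE INTO solr_index_pending (bib_id) VALUES (NEW.bib_id);
END;
"""


def make_demo_db():
    """Create an in-memory database holding one sample record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO records VALUES (905285, 'Story of West Point')")
    return conn


def convert_pending(conn):
    """Convert every queued record to Solr <add> XML, then clear the queue."""
    rows = conn.execute(
        "SELECT r.bib_id, r.title FROM solr_index_pending p "
        "JOIN records r ON r.bib_id = p.bib_id ORDER BY r.bib_id"
    ).fetchall()
    add = ET.Element("add")
    for bib_id, title in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in (("bib_id", bib_id), ("title", title)):
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    conn.execute("DELETE FROM solr_index_pending")
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    conn = make_demo_db()
    # Editing a record fires the trigger, which queues its ID.
    conn.execute("UPDATE records SET title = 'Story of West Point: 1802-1943' "
                 "WHERE bib_id = 905285")
    print(convert_pending(conn))
```

Queuing only the record ID keeps the trigger cheap; the heavier transformation work happens later in the converter.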

Define an Index Metadata Model: Free text data fields used for Keyword searching & display
  • Record Link
  • Title/Object-name
  • Identifier
  • Physical Description
  • Gallery Label
  • Notes
  • Publisher
  • Object Type
  • Taxonomic Name
  • Language
  • Topic
  • Place
  • Date
  • Name
  • Culture
  • Set Name
  • Data Source
  • Credit Line
  • Online Media Group

Facet data fields used for browsing and limiting
  • Record ID, Object Type, Language, Topic, Place, Date, Name, Culture, Data Source
  • Online Media Type, Rights for Online Media File, Related Record, Usage Flag
  • Taxon-Kingdom, Taxon-Phylum, Taxon-Division, Taxon-Class, Taxon-Order, Taxon-Family, Taxon-Sub-Family
  • Scientific_name, Common name
  • Geo-Age-Era, Geo-Age-System, Geo-Age-Series, Geo-Age-Stage
  • Strat-Group, Strat-Formation, Strat-Member

Getting help from Solr
  • Task specific handlers: Request handler, Response handler, Update handler
  • Schema.xml file defines fields to be indexed, displayed, and searchable.
  • Solrconfig.xml file defines cache size, faceted field type, request handler customization.
[Diagram. Labels: Solr; Lucene Index]

Solrconfig.xml Example facet field definition
  <str name="facet.field">object_type</str>
  <str name="facet.field">language</str>
  <str name="facet.field">topic</str>
  <str name="facet.field">place</str>
  <str name="facet.field">date</str>
  <str name="facet.field">name</str>
  <str name="facet.field">culture</str>
  <str name="facet.field">online_media_type</str>
  <str name="facet.field">set_name</str>
  <str name="facet.field">data_source</str>
  <str name="facet.field">tax_kingdom</str>
  <str name="facet.field">tax_phylum</str>
  <str name="facet.field">tax_division</str>
  <str name="facet.field">tax_class</str>
  <str name="facet.field">tax_order</str>
  <str name="facet.field">tax_family</str>
  <str name="facet.field">tax_sub-family</str>
  <str name="facet.field">common_name</str>
  <str name="facet.field">scientific_name</str>
  <str name="facet.field">freetext</str>
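The same facet fields can also be requested per query instead of being set as defaults in Solrconfig.xml. The sketch below builds such a request URL, with positive and negative limits expressed as Solr filter queries (`fq`). The base URL, the function name `build_facet_query`, and the sample values are assumptions made for illustration; `q`, `wt`, `facet`, `facet.field`, and `fq` are standard Solr request parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr endpoint; the real deployment URL is not given in the slides.
BASE_URL = "http://localhost:8983/solr/select"


def build_facet_query(base_url, keywords, facet_fields, include=(), exclude=()):
    """Build a Solr select URL with facet fields and positive/negative limits.

    include and exclude are sequences of (field, value) pairs.
    """
    params = [("q", keywords), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    # A positive limit keeps only records carrying the facet value.
    params += [("fq", f'{field}:"{value}"') for field, value in include]
    # A negative limit (leading "-") removes records carrying the facet value.
    params += [("fq", f'-{field}:"{value}"') for field, value in exclude]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_facet_query(
        BASE_URL,
        "west point",
        ["object_type", "topic", "place", "date"],
        include=[("data_source", "Smithsonian Institution Libraries")],
        exclude=[("object_type", "Photographs")],
    )
    print(url)
    print(parse_qs(urlsplit(url).query)["fq"])
```

Passing the parameters as a list of pairs, rather than a dict, lets `facet.field` and `fq` repeat, which is how Solr expects multiple facet fields and filters.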

Data Example (abbreviated) – a Library Book
siris_sil_905285 SIL Smithsonian Institution Libraries STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE Story of West Point: 1802-1943; the West Point tradition in American life 1943 Books 1943

Data Example (abbreviated) – a Photograph
siris_arc_104765 EEPA Eliot Elisofon Photographic Archives AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE Aerial view of downtown Johannesburg, South Africa, [slide] http://sirismm.si.edu/eepa/eepa_05859.jpg Eliot Elisofon Photographic Archives EEPA EECL 15973 Elisofon, Eliot slide : col This photograph was taken when Eliot Elisofon was on ass magazine and traveled to Africa from August 18, 1959 to December 20, 1959 Photographs Mod. architecture/cityscape South Africa 1959 Eliot Elisofon Field photographs 1942-1972

Data Example (abbreviated) – a sculpture
siris_ari_7985 ARI Art Inventories DREXEL MONUMENT SCULPTURE The Drexel Monument, (sculpture) http://sirisartinventories.si.edu/ipac20/ipac.jsp?&profile=all&source=~!siartinventories&uri=full=3100001~!7985 0#focus http://americanart.si.edu/images/1966.47.36_1b.jpg Art Inventories IAS 75004286 Manger, Heinrich b. 1833 Chas. F. Heaton Francis M. Drexel Monument, (sculpture) metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete Index of American Sculpture, University of Delaware, 1985 Sculptures-Fountain Drexel, Francis M Illinois 1881. Cast 1882. Dedicated 1883 Manger, Heinrich Chas. F. Heaton Sculptures Portrait male Drexel, Francis M

A system is only as good as the data that is in it.

Data mapping for multiple databases (truncated)

Faceted Categories
  • Determine the most useful facets; more is not better.
      ◦ Number of unique facets will affect system response time
  • Smithsonian has 4.6 million unique terms. Among them:
      ◦ 864,000 names,
      ◦ 126,000 topics,
      ◦ 47,000 places,
      ◦ 139 dates (down from 40,000 before cleanup),
      ◦ 1,000 types (down from 2,000 before cleanup)

Build the facet terms
  MARC field: 650 $a Art $z Africa, North $v Periodicals.
  Facet terms: Art (Topic); Africa, North; Periodicals
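A minimal sketch of splitting a MARC subject field like the one above into separate facet terms. Only the `$a` to Topic mapping is visible in the slide; routing `$z` to a place facet and `$v` to an object-type facet is an assumption based on the usual meaning of those MARC subfields, and the names `SUBFIELD_TO_FACET`, `parse_field`, and `facet_terms` are hypothetical.

```python
import re

# (MARC tag, subfield code) -> facet field.
SUBFIELD_TO_FACET = {
    ("650", "a"): "topic",        # shown in the slide
    ("650", "z"): "place",        # assumption: geographic subdivision
    ("650", "v"): "object_type",  # assumption: form subdivision
}


def parse_field(line):
    """Split '650 $a Art $z Africa, North' into ('650', [('a', 'Art'), ...])."""
    tag, _, rest = line.partition(" ")
    pairs = re.findall(r"\$(\w)\s*([^$]*)", rest)
    # Drop surrounding spaces and the trailing period that ends a MARC field.
    return tag, [(code, value.strip().rstrip(".").strip()) for code, value in pairs]


def facet_terms(line):
    """Return {facet_field: [terms]} for one MARC field; skip unmapped subfields."""
    tag, subfields = parse_field(line)
    terms = {}
    for code, value in subfields:
        facet = SUBFIELD_TO_FACET.get((tag, code))
        if facet and value:
            terms.setdefault(facet, []).append(value)
    return terms


if __name__ == "__main__":
    print(facet_terms("650 $a Art $z Africa, North $v Periodicals."))
```

Splitting the subject string this way means a user can limit by "Africa, North" without also having to pick "Art" or "Periodicals".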

Build the facet terms
  MARC field: 655 $a Photographs $y 1840-1860.
  Facet terms: Photographs (type); 1840s, 1850s, 1860s (date)
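The date example suggests that year ranges are expanded into decade terms, which would keep the number of unique date facets small. A minimal sketch of that expansion follows; the function name `decades` is hypothetical, and the actual cleanup rules used at the Smithsonian are not described in the slides.

```python
import re


def decades(date_text):
    """Expand a year or year range such as '1840-1860.' into decade terms."""
    years = [int(year) for year in re.findall(r"\d{4}", date_text)]
    if not years:
        return []
    first = min(years) // 10 * 10  # round down to the start of the decade
    last = max(years) // 10 * 10
    return [f"{decade}s" for decade in range(first, last + 1, 10)]


if __name__ == "__main__":
    print(decades("1840 -1860."))
    print(decades("1959"))
```

A single year falls into one decade, while a range contributes every decade it touches.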

Challenges
  • Adapting LCSH and AAT terms in a whole new way
  • Still seeking a good way to use See and See Also reference data
  • Reduce data inconsistency in our records for better quality facet terms
  • Character conversion challenge with MARC-8, Unicode and UTF-8

Future plans
  • Continue to add data from more digital library databases and museum collection databases
      ◦ Working on National History museum, and American Indian museum.
  • Complete the implementation of the capability to interact with external applications
      ◦ Plan to support "American Art and Artist" application
  • Add new functionality such as my-list, list-sharing, social tagging.
  • Support more visual displays such as Google map and time slider

A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
www.siris.si.edu
wangch@si.edu