ad84bcacefe8a4ea0b50535b0d242728.ppt
- Number of slides: 50
The Biodiversity Heritage Library: A Knowledge Domain Enterprise Community-Driven Open Access. Indra Neil Sarkar, Ph.D., Marine Biological Laboratory; Thomas Garnett, Smithsonian Institution Libraries. Coalition for Networked Information, Washington, DC, December 11, 2007
Overview • Biodiversity Heritage Library (Tom Garnett) – Overview – Why? – How? – Sustainability • Building Knowledge Links (Neil Sarkar) – Knowledge Integration – Taxonomic Names Management – Taxonomic Intelligence – Linking to the Encyclopedia of Life 2007 12 11 - Coalition for Networked Information, Washington, DC Tom Garnett & Neil Sarkar, Biodiversity Heritage Library
The Biodiversity Heritage Library Tom Garnett
Biodiversity Heritage Library • Museums – Field Museum – Natural History Museum (London) – Smithsonian Institution – American Museum of Natural History • Botanical Gardens – Missouri Botanical Garden – New York Botanical Garden – Royal Botanic Gardens, Kew • University Libraries – Botany Libraries, Harvard University – Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University • Research Institute Library – Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBL/WHOI)
Biodiversity Heritage Library Collaborators: Internet Archive; International Commission on Zoological Nomenclature; Open Content Alliance; European Distributed Institute of Taxonomy; Global Biodiversity Information Facility (GBIF); many more under negotiation.
Biodiversity Heritage Library Mandates: Digitize the core literature on biodiversity. Open Access: all content can be repurposed, reused, reformatted. Congruent: must fit into a healthy knowledge ecology. [Image: Reptilia and Batrachia (1885-1902), by Albert C. L. G. Günther]
Biodiversity Heritage Library – Why? “In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original.” Charles Davies Sherborn (1861-1942), Epilogue to Index Animalium, March 1922
Biodiversity Heritage Library – Why? “Yet another physical difficulty is the task of assembling the library and indexes which will enable the student to work under proper conditions…. the beginner must now be prepared to spend liberally, or else must establish himself in an institution where a large library exists; if he work by himself with only a few books, he will have to confine himself to a very narrow specialty indeed.” J. M. Aldrich, 'The Limitations of Taxonomy', Science, April 22, 1927, vol. LXV, no. 1686, p. 381 [Image: Insecta. Diptera. Volume I (1886-1901)]
Biodiversity Heritage Library – Why? • “The cited half-life of publications in taxonomy is longer than in any other scientific discipline” (Macro-economic case for open access, Tom Moritz) • Current taxonomic literature often relies on texts and specimens more than 100 years old. [Image: Levinus Vincent, Elenchus tabularum, pinacothecarum, 1719]
Biodiversity Heritage Library – Why? The Taxonomic Impediment: “The taxonomic impediment is a term that describes the gaps of knowledge in our taxonomic system” (Darwin Declaration, 1998) [Image: Georges Louis Leclerc, comte de Buffon, Histoire naturelle : générale et particulière (Oiseaux), 1799-1808]
Biodiversity Heritage Library – Why? Convention on Biological Diversity, Article 17: “… exchange of information shall include exchange of results of technical, scientific and socio-economic research … It shall also, where feasible, include repatriation of information.” [Image: Henry Bates, Insecta. Coleoptera, 1881-1884]
Biodiversity Heritage Library – How? • Internet Archive establishes scanning centers in London, New York, Boston, Washington, etc. • High-quality, non-destructive scans. • Image files and text derived from OCR.
Biodiversity Heritage Library – How? What good are page image files, “dirty OCR”, and some metadata? Researchers are stuck like these guano diggers in Hawaii. [Image: “Guano diggers among the albatrosses. Laysan Island”, from Lionel Walter Rothschild, The avifauna of Laysan and the neighboring islands, 1893-1900]
Biodiversity Heritage Library – How? BHL Portal: http://www.biodiversitylibrary.org • Serve image and text files • Create volume, part, piece metadata • Ingest page-level metadata at scanning level • Apply Globally Unique Identifiers (GUIDs) for linking to other taxonomic services [Image: Jacob Christian Schäffer, Elementa entomologica..., 1766]
Biodiversity Heritage Library: Classes of texts. Each class presents a unique set of issues to resolve: • Public Domain (pre-1923) • Post-1923 monographs – some with copyright renewals – some without copyright renewals • Non-profit learned society journals with permissions • Commercial journals • Grey literature • Archival material; field and expedition notebooks
Biodiversity Heritage Library: BHL Seeks Permissions from Copyright Holders. Opt-in Copyright Model: The BHL will actively work with professional societies and associations to integrate their publications into the BHL in a way that serves the societies’ missions and goals. • BHL will digitize learned society backfiles and mount them through the BHL Portal at no cost. • Will provide a set of files to the publishers for reuse as they see fit. • Will index the articles using Taxonomic Intelligence, thereby vastly increasing their usability.
Biodiversity Heritage Library: Embedding Content in the Knowledge Ecology. The BHL is primarily funded as a component of the Encyclopedia of Life, an international effort to create an authoritative website for every species of the earth’s biota.
Biodiversity Heritage Library • Legal Sustainability Strategy – Avoid legal conflicts. – Keep copyright infringement risk low. It is impossible to eliminate it altogether. – Obtain permissions where feasible. – Where it isn’t feasible, move on.
Biodiversity Heritage Library • Scientific and Scholarly Support Strategy – Make it too useful not to support. – Embed it in current and developing workflows for identifying, tracking, documenting, and researching the biota. BHL is building on many documented use cases. – Network with many professional societies. – Automated structural markup of journal literature to bring the digitized OCR into conformance with the NLM DTD.
Biodiversity Heritage Library • Financial Sustainability Strategy – Quick ramp-up with high early costs (development, mass scanning, etc.). Drive long-term costs down the asymptote toward zero. – Derive some long-term costs from the operating budgets of the member institutions (examples under consideration: acquisitions budget, staff positions, etc.). – Integrate functions/tasks with wider efforts where appropriate, e.g. mass storage. – Clear roles for staff who wear multiple hats. Currently two full-time grant-funded positions, but more than 15 staff who make substantive contributions. – Make the BHL absolutely essential.
Biodiversity Heritage Library • The Long Now Strategy – Institutions that are creating the BHL exist to persist through time. That’s an important part of their business. Use them. – The future is uncertain, the technology landscape changes, people pass on. So create consortial structures that are low-overhead, flexible, and can respond quickly. Face-to-face (F2F) interaction is surprisingly necessary to create this.
Biodiversity Heritage Library • The Long Now Strategy (cont.) – Take risks. Why? – “We must, indeed, all hang together, or most assuredly we shall hang separately.” – Interoperability is the key. Repository islands will sink.
Biodiversity Heritage Library: Embedding Content in the Knowledge Ecology. Species names, taxon concepts, and the classification of living organisms are the basis for linking multiple disciplines such as evolutionary biology, taxonomy, genomics, agriculture, conservation, etc. Taxonomic intelligence algorithms are being developed to mine the BHL content to link species names with other biological resources.
BHL-based Knowledge Integration Neil Sarkar
http://www.idiagram.com/ideas/knowledge_integration.html
Knowledge Integration • Meet Information Needs • Map to Other Knowledge • Extract Domain Specific Features • Perform Data Mining for Novel Correlations • Automated Methods – Biomedical (Yes!) – Biodiversity (No!)
Biological Data Revolution: Biomedical Knowledge; Biodiversity Knowledge
Literature, Literature • Retrospective Biological Knowledge – Not Just PDFs! – Biodiversity Heritage Library • Contemporary Biological Knowledge – Titles, Abstracts, Metadata (MeSH) – Medline • Prospective Biological Knowledge – Track New Literature – Services Integrated Into Interfaces
“All accumulated information of a species is tied to a scientific name, a name that serves as a link between what has been learned in the past and what we today add to the body of knowledge.” ~ Grimaldi & Engel, 2005, Evolution of the Insects
Names Are Often Misspelled: Loligo pealeii / Loligo pealei
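One way to catch spelling variants such as the pair on this slide is approximate string matching against a list of known names. The sketch below is only an illustration using Python's standard `difflib`; the helper names, the three-entry lexicon, and the 0.9 threshold are assumptions made for the example, not a description of how BHL or uBio actually match names.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two name strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_variants(name: str, lexicon: list[str], threshold: float = 0.9) -> list[str]:
    """Return lexicon entries that are close to, but not identical to, `name`."""
    return [
        known
        for known in lexicon
        if known != name and similarity(name, known) >= threshold
    ]


# Toy lexicon built from names that appear on the slides.
lexicon = ["Loligo pealeii", "Escherichia coli", "Bacterium coli"]

print(likely_variants("Loligo pealei", lexicon))  # ['Loligo pealeii']
```

A fixed threshold is a blunt instrument: set too low it merges distinct names, set too high it misses real variants, so a production service would need curated review on top of any such score.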
Peranema – the fern; Peranema – the euglenid
Who Cares? Libraries, Publishers, Museums, Federal Agencies
Life on Earth
Names for Life on Earth: No Complete List of Scientific Names* • 112,133 • 741,872 • 49,382 (*Scientific Names ≠ Species) • Published Variants – Objective Synonyms: Bacterium coli, Escherichia coli, Bacillus coli – Mis-spellings: Escheria coli
Taxonomic Knowledge
Scientific Names Management • Collect Scientific Names – Digital Taxonomy Resources – Data Marts – Natural Language Text • Scientific Name Reconciliation – Many Names for Same Organism: Objective (Escherichia coli, Bacterium coli, Bacillus coli); Subjective (Brucella melitensis, Brucella canis, Brucella ovis) – Many Organisms for Same Name: Agathis montana is both a wasp and a plant
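The two reconciliation problems on this slide (many names for one organism, one name for many organisms) can be pictured as a many-to-many mapping between name strings and groups. The following is a minimal sketch under that assumption; the group identifiers and function names are hypothetical, and only the name strings come from the slide.

```python
from collections import defaultdict

# Hypothetical group identifiers; the name strings are the slide's own examples.
RECONCILIATION_GROUPS = {
    "ecoli": {"Escherichia coli", "Bacterium coli", "Bacillus coli"},
    "agathis-wasp": {"Agathis montana"},
    "agathis-plant": {"Agathis montana"},
}


def build_index(groups):
    """Map each name string to the set of group ids that contain it."""
    index = defaultdict(set)
    for group_id, names in groups.items():
        for name in names:
            index[name].add(group_id)
    return index


def synonyms(name, groups, index):
    """Return every other name string that shares a group with `name`."""
    found = set()
    for group_id in index.get(name, ()):
        found |= groups[group_id]
    found.discard(name)
    return found


def is_homonym(name, index):
    """A name that belongs to more than one group refers to more than one organism."""
    return len(index.get(name, ())) > 1


index = build_index(RECONCILIATION_GROUPS)
print(sorted(synonyms("Bacterium coli", RECONCILIATION_GROUPS, index)))
print(is_homonym("Agathis montana", index))
```

Keeping the index separate from the groups means a lookup by name string never has to scan every group, which matters once the lexicon holds millions of strings.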
uBio • 10.7 Million+ Name Strings • Reconciliation Groups • http://www.ubio.org
Ogden-Richards Semiotic Triangle • Thought/Reference • Symbols: “White”, “Blanc”, “Weiss” • Referent: XVFD
Ogden-Richards Semiotic Triangle • Species • Scientific Names: “Bacterium coli”, “Escherichia coli”, “E. coli” • LSID: urn:lsid:ubio.org:namebank:5369544
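An LSID is a URN with the fixed shape `urn:lsid:<authority>:<namespace>:<object>` plus an optional revision, so the identifier on this slide can be split into its parts mechanically. The parser below is a small sketch; the `LSID` type and `parse_lsid` function are names invented for this example, and it checks only the overall shape, not the allowed characters in each field.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


print(parse_lsid("urn:lsid:ubio.org:namebank:5369544"))
```

For the slide's example this yields authority `ubio.org`, namespace `namebank`, and object `5369544`, with no revision.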
Taxonomic Intelligence • Lexicon of Scientific Names • Reconciliation and Disambiguation • Hierarchical Inclusion • Integration into Information Retrieval • Linkage to Other Data Types (e.g., Molecular, Morphological, Phenotype)
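Hierarchical inclusion can be read as query expansion: a search for a higher taxon should also retrieve documents that mention only its members. The sketch below illustrates that reading with a toy parent-to-children table; the table, the function names, and the substring search are assumptions made for the example rather than the presenters' implementation.

```python
# Toy fragment of a classification, keyed parent -> children.
CHILDREN = {
    "Enterobacteriaceae": ["Escherichia"],
    "Escherichia": ["Escherichia coli"],
}


def expand(taxon, children):
    """Return the taxon followed by all of its descendants, depth-first."""
    names = [taxon]
    for child in children.get(taxon, []):
        names.extend(expand(child, children))
    return names


def search(query, documents, children):
    """Return documents that mention the query taxon or any taxon it includes."""
    terms = expand(query, children)
    return [doc for doc in documents if any(term in doc for term in terms)]


documents = [
    "Growth curves for Escherichia coli",
    "Notes on Loligo pealeii",
]

print(expand("Enterobacteriaceae", CHILDREN))
print(search("Enterobacteriaceae", documents, CHILDREN))
```

Here the family name never appears in the first document, yet the expanded query finds it through the genus and species names beneath the family.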
Biodiversity Heritage Library
Biodiversity Heritage Library
Biomedical Knowledge; Biodiversity Knowledge
Extracting Taxonomic Names • Named Entity Recognition – Taxonomic Name Recognition (TNR) • Current TNR Tools – TaxonGrab (AMNH) – FindIT (uBio) – FAT (Karlsruhe) – TaxonFinder (uBio)
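To give a feel for what taxonomic name recognition involves, here is a deliberately naive sketch: find capitalized-word plus lowercase-word pairs and keep those whose first word is a known genus. This is not the algorithm used by TaxonGrab, FindIT, FAT, or TaxonFinder; the pattern, the `find_names` helper, and the two-genus lexicon are assumptions for illustration only.

```python
import re

# Candidate binomial: a capitalized word followed by a lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def find_names(text, genera):
    """Return candidate binomials whose first word is a known genus name."""
    hits = []
    for match in BINOMIAL.finditer(text):
        if match.group(1) in genera:
            hits.append(match.group(0))
    return hits


# Toy genus lexicon taken from names on the slides.
genera = {"Loligo", "Escherichia"}
text = (
    "Specimens of Loligo pealeii were compared. "
    "The culture contained Escherichia coli."
)

print(find_names(text, genera))  # ['Loligo pealeii', 'Escherichia coli']
```

The lexicon check is what rejects ordinary phrases such as "Specimens of" and "The culture", which match the pattern but are not names. Abbreviated forms like "E. coli" and OCR-damaged strings would need additional handling.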
Tracking Biodiversity Knowledge • Taxonomically Intelligent Applications – Real-time Taxonomic Indexing – RSS – Taxonomic Portals
Encyclopedia of Life • Create one Web page for each species that is currently named (~1.8 million) • Integrate relevant literature (e.g., BHL) • BHL represented on EOL Board • $25M of funding in place
The Encyclopedia of Life: www.EOL.org
Acknowledgments: Christopher Freeland, Martin Kalfatovic, Graham Higley, BHL & EOL Teams, Catherine Norton, Patrick Leary, David Remsen, David Patterson; A. W. Mellon Foundation, Alfred P. Sloan Foundation, John D. & Catherine T. MacArthur Foundation
Neil Sarkar: sarkar@mbl.edu; Tom Garnett: garnett.T@si.edu