An evaluation of taxonomic name finding & next steps in Biodiversity Heritage Library (BHL) developments
Chris Freeland
Technical Director, BHL
Director of Bioinformatics, Missouri Botanical Garden
Freeland. TDWG Annual Conference. 20 October 2008
www.biodiversitylibrary.org

Goals of BHL
• Scan public domain biodiversity literature.
• Negotiate rights to copyrighted materials.
• Ingest content digitized by others.
• Provide interfaces & APIs for repository.
  – GUIs
  – Services for data mining & citation resolution
http://www.biodiversitylibrary.org

BHL Institutions
Botanical Gardens
  – Missouri Botanical Garden
  – New York Botanical Garden
  – Royal Botanic Gardens, Kew
University Libraries
  – Botany Libraries, Harvard University
  – Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
  – University of Illinois
Museums
  – American Museum of Natural History (New York)
  – Natural History Museum (London)
  – Smithsonian Institution (Washington)
  – The Field Museum (Chicago)
Bioinformatics Institutes
  – MBL/WHOI
  – uBio.org

Now Online
• More than:
  22,000 volumes
  9.2 million pages
  Only 290 million to go!
• Avg. monthly growth rate
  1,500 volumes
  600,000 pages
See you in 2048!
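The "See you in 2048!" estimate follows from the figures on this slide. A quick check, assuming the scanning rate stays flat at 600,000 pages per month and counting from the year of the talk (the constant and function names are illustrative):

```python
REMAINING_PAGES = 290_000_000   # "Only 290 million to go!"
PAGES_PER_MONTH = 600_000       # average monthly growth rate from the slide
TALK_YEAR = 2008                # assumption: count from the year of the talk

def years_to_finish(remaining: int, per_month: int) -> float:
    """Years needed to scan `remaining` pages at a constant monthly rate."""
    return remaining / per_month / 12

years = years_to_finish(REMAINING_PAGES, PAGES_PER_MONTH)
print(f"{years:.1f} years, finishing around {int(TALK_YEAR + years)}")
```

At a constant rate the backlog takes roughly 40 years, which is where the 2048 date comes from; any growth in the monthly rate would shorten it.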

Scanning Operations
BHL uses scanning centers established by Internet Archive for mass scanning. Some partner libraries also scan in-house.
Locations of BHL/IA Scanning Centers
Want to expand international footprint:
• mirrored content
• ingest from global data providers

Complexities of distributed, mass scanning
from NYBG
from Smithsonian

Open Access Data
The snakes of Australia; an illustrated and descriptive catalogue of all the known species. By Gerard Krefft...
Publisher: Sydney, T. Richards, Government Printer, 1869.
PDF  OCR  JP2  XML

Name Finding via TaxonFinder

Name finding via TaxonFinder
Diagram labels: Raw Image; Converted to text via OCR; Submit to; Extract names; NameBank; SOAP response
Name Finding in action with Taxonomic Intelligence…

Name Finding Stats to date*
• Have mined more than 30 million name string occurrences
  – 4.3 million unique
• More than 23.3 million name strings verified by NameBank
  – 1.1 million unique
*19 October 2008

APIs & Data Sharing
• Name Service (Documentation)
  – REST: XML or JSON
• Data Export (Documentation)
  – Monthly export of BHL titles, volumes, pages, names in delimited files
• Citation Resolver v0.1
  – available by end of 2008

Name Finding Evaluation
See Poster in hall
• Structured and performed by Qin Wei
  – Ph.D. student at UIUC, working with Bryan Heidorn
• Methodology
  – Scholarly volunteers manually identified scientific names on random sample of 392 pages in BHL corpus
  – Compared those against OCR, then two name finding algorithms (TaxonFinder & FAT)
• Goals
  – Spark discussion, set baseline for future work
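The comparison step in the methodology can be pictured as set arithmetic between the names the volunteers marked and the names an algorithm returned. This is a minimal sketch, assuming exact string matching on (page, name) pairs; the slides do not state the matching rule actually used, and the function name and the sample data are hypothetical.

```python
def precision_recall(gold: set, found: set) -> tuple[float, float]:
    """Precision and recall of `found` against the manually identified `gold` set."""
    true_positives = len(gold & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical sample: (page number, name string) pairs.
gold = {(1, "Morelia spilota"), (1, "Hoplocephalus"), (2, "Acanthophis antarcticus")}
found = {(1, "Morelia spilota"), (2, "Acanthophis antarcticus"), (2, "Sydney")}

p, r = precision_recall(gold, found)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under exact matching, a name that OCR has garbled can never count as found, which is one reason to report results both with and without OCR-damaged names.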

Characteristics of sample
Number of Pages                     392
Average Number of Words per Page    446.8
Average Number of Names per Page    7.7
Total Number of Names               3003
Total Number of Unique Names        2610 (= 86.91%)

OCR error rate for names only
Of the 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.
Top OCR errors
 1     n->v
 2     Omit Space
 3     e->c
 4     u->I
 5     u->n
 6     i->l
 7, 8  Insert Space, c->e (relative order unclear in source)
 9     l->i
10     r->i
11     u->ii
12     h->l
13     h->ii
14     e->o
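Both percentages quoted for the sample (86.91% unique names, 35.16% OCR error rate) can be reproduced from the raw counts. A small check using only figures from the slides; the helper name `percent` is illustrative:

```python
TOTAL_NAMES = 3003    # total names identified in the 392-page sample
UNIQUE_NAMES = 2610   # unique names in the sample
OCR_ERRORS = 1056     # names incorrectly transcribed by OCR

def percent(part: int, whole: int) -> str:
    """Format part/whole as a percentage with two decimals."""
    return f"{100 * part / whole:.2f}%"

print(percent(UNIQUE_NAMES, TOTAL_NAMES))  # share of names that are unique
print(percent(OCR_ERRORS, TOTAL_NAMES))    # share of names damaged by OCR
```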

Performance of algorithms
                                   TaxonFinder   FAT
Excluding names with OCR errors
  Precision                        40.32%        28.20%
  Recall                           36.62%        23.34%
  F-score                          38.47%        25.77%
Including names with OCR errors
  Precision                        43.77%        32.25%
  Recall                           25.82%        17.21%
  F-score                          34.80%        24.73%
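For reference, the F-score is conventionally the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below computes it from the reported precision and recall. The F-score values printed on the slide coincide with the arithmetic mean (P + R) / 2 rather than the harmonic mean, so which formula the evaluation used is an open question. Function names are illustrative, and the condition labels assume the two blocks of figures appear in the same order as the slide's two captions.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average of precision and recall, shown for comparison."""
    return (precision + recall) / 2

# (precision %, recall %, F-score % as printed on the slide)
reported = {
    ("TaxonFinder", "excluding OCR errors"): (40.32, 36.62, 38.47),
    ("FAT", "excluding OCR errors"): (28.20, 23.34, 25.77),
    ("TaxonFinder", "including OCR errors"): (43.77, 25.82, 34.80),
    ("FAT", "including OCR errors"): (32.25, 17.21, 24.73),
}

for (algo, cond), (p, r, f_slide) in reported.items():
    print(f"{algo:12s} {cond}: slide={f_slide:.2f} "
          f"harmonic={f1(p, r):.2f} arithmetic={arithmetic_mean(p, r):.2f}")
```

The gap between the two definitions widens as precision and recall diverge, so it is largest for the rows that include names with OCR errors.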

Considerations
• Improving OCR software is out of scope
  – Google’s Tesseract is only viable open source option
  – Flurry of activity in 2006-2007, quiet since
• Rekeying is expensive given size of corpus
  – Will not scale

Recommendations
• Enhance “fuzzy” retrieval in algorithms
  – Exception rules to overcome OCR errors
• More work needed in this space
  – More evaluations & experiments
  – Robust training sets
• reCAPTCHA for names?
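One way to read "exception rules to overcome OCR errors" is to expand each OCR token into the spellings it could have come from, using known character confusions, and look those up in a name list. The sketch below is an illustration under stated assumptions, not a description of how TaxonFinder or FAT work: the confusion pairs are a subset of the top OCR errors listed earlier, read as intended -> OCR output (the direction is an assumption), while the lexicon, the function names, and the one-substitution limit are hypothetical. Space insertions and omissions are not handled.

```python
# (intended, as seen in OCR) pairs; direction assumed from the "a->b" notation.
CONFUSIONS = [("n", "v"), ("l", "i"), ("e", "c"), ("r", "i"), ("u", "ii"),
              ("u", "n"), ("h", "l"), ("i", "l"), ("h", "ii"), ("c", "e"), ("e", "o")]

def candidates(token: str) -> set[str]:
    """Return the token plus every spelling reachable by undoing one confusion."""
    results = {token}
    for intended, seen in CONFUSIONS:
        start = token.find(seen)
        while start != -1:
            results.add(token[:start] + intended + token[start + len(seen):])
            start = token.find(seen, start + 1)
    return results

def resolve(token: str, lexicon: set[str]) -> set[str]:
    """Known names that the OCR token could plausibly represent."""
    return candidates(token) & lexicon

lexicon = {"Quercus", "Hoplocephalus"}   # hypothetical name list

print(resolve("Qiiercus", lexicon))  # "u" misread as "ii"
print(resolve("Qnercus", lexicon))   # "u" misread as "n"
```

Limiting the expansion to a single substitution keeps the candidate set small; allowing several substitutions per token would raise recall at the cost of more false matches, which is the trade-off the recommended evaluations would need to measure.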

Up next: BHL Article Repository
• for biodiversity articles
• “Safe harbor” model
  – BHL provides platform
  – Community provides content
    • Scientists, students, libraries
• Implemented using Fedora

And if that wasn’t enough…
• Additional services
  – Title Resolver, LSIDs
• Distributed architecture
  – data & applications
• Interface improvements
  – Internationalization
• Further evaluations & experiments
  – rich test bed for information retrieval

Contact
Chris Freeland
4344 Shaw Blvd.
St. Louis, MO 63110
chris.freeland@mobot.org
http://www.biodiversitylibrary.org