- Number of slides: 31
Digital Libraries & Document Image Analysis
Henry S. Baird
Statistical Pattern & Image Analysis research
Information Sciences & Technologies Lab

DLs as seen by a DIA Researcher
• 15 years in DIA R&D
• Lucky to have known/collaborated with:
  - PARC DL enthusiasts: Masinter, Street, Bloomberg, et al.
  - UC Berkeley Digital Library project: Wilensky, Fateman, et al.
  - CMU Universal Library project: Thibadeau, Hauptmann, et al.
  - Xerox Scanning Service Bureaus: Wallis, et al.
  - … many others with an interest in DLs
What challenges do DLs pose to DIA R&D?
ICDAR Aug 4, 2003 - HSB

Digital Library Dreams
Electronic networked DLs promise to provide more books, journals, etc. to more people, faster, at more places & times than physical libraries can hope to.
The Ideal DL: an international, interoperable, sustainable body of rich cultural materials in digital form

Document Images' Usefulness in DLs
raster image                       display, print
+ metadata (title, author, …)      + index, catalogue
+ OCRed text                       + retrieve (more or less well)
+ correct text                     + retrieve well, reuse, summarize, translate, …
+ layout format (e.g. RTF)         + reprinting
+ links (e.g. HTML)                + Web publishing
+ functional tags (e.g. XML)       + "semantic web"

Advantages of Digital Displays versus Ink-on-Paper
• Many…
  - networked -- potentially unbounded content
  - rapidly rewritable -- supports animation
  - radiant -- legible in the dark
  - sensitive -- markable, interactive
• Generally thought to be overwhelming, but …

Advantages of Ink-on-Paper versus Digital Displays
PAPER              DISPLAYS today          DISPLAYS in future
cheap              expensive               less expensive
large, many        small, few              larger, more
high-resolution    low-resolution          higher-resolution
lightweight        heavy                   lighter
thin               thick                   thinner
unpowered          powered                 lower power
stable             requires maintenance
eBooks, e-paper, notebooks, laptops, PDAs, …
A. Dillon, “Reading from Paper versus Screen: a critical review of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.

The fact is, for many uses Paper is Still Widely Preferred
“Paper [remains today] the medium of choice for reading, even when most high-tech [display] technologies are to hand” (Sellen & Harper, 2002)
Why is this? Paper allows:
  - flexible navigation through documents
  - cross-referencing of several documents
  - annotations
  - interweaving of reading and writing
A. J. Sellen & R. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.

Document Images are Doubly Disadvantaged within DLs
• They fail to support most uses that symbolically encoded, tagged data do
• They lose many key advantages they enjoyed on paper
A Threat: ‘If it’s not in Google, I don’t need it!’
Can they be made as useful in DLs as encoded data?
Can they sometimes work better in DLs than encoded data?
…these are challenges to us, the DIA R&D community.

The British Library
‘The World’s Knowledge’
• 38.8 M items catalogued
• website: 18.4 M page hits/year
Compare Google:
• >3 B pages
• 150 M searches/day
“[Reinforcing] the Library’s role as the pre-eminent global document supplier, digital scanning from print and microfilm originals will give researchers rapid, high quality delivery from over 100 million research articles, reports, and conference papers direct to their desktop.”
-- Lynne Brindley, Chief Executive, 2002-2003 Annual Report

Bibliothèque nationale de France
• The Digital Library
  - digitization of both printed books and graphic material
  - primarily in image mode to begin with
  - most out-of-copyright
• Gallica 2000
  - multimedia documents: Middle Ages -> early 20th century
  - 35,000 printed volumes: images
  - 1000 titles full text
  - “one of the largest DLs free of charge on the web”

Million Book DL Project
• 1 M books to be scanned by 2005
  - bitonal, 600 dpi
• Free-to-read, universally accessible
• Searchable by full text (where OCR is possible)
  - ABBYY FineReader OCR
• Books in public domain or copyrighted but out of print
• Fifteen partners:
  - US, India, China; est. 4000 person-years of clerical labor
  - Multinational, multilingual (mainly English)
• 20 Tbyte trusted repository
• Research testbed for summarization, OCR, automatic extraction of metadata, machine translation
Reddy, Raj and Gloriana St. Clair, “The Million Book Project,” CMU, Dec. 1, 2001.

Google Catalogs
• “1000’s” of scanned mail-order catalogs
• free for publishers, ‘few days’ turnaround
  - for a fee: link products to web sites
• free to users: download page images
• indexed by: vendor, date, page numbers, etc. (not by full text content)

Amazon.com plan
‘Look Inside the Book II’
• ~500 k books: in-copyright, non-fiction
• Scan (full color), OCR cover-to-cover
• Full-text search, download sample pages
• Free but limited access to page images
Can Google be far behind…?
• search document image files found on Web
David D. Kirkpatrick, “Amazon Plan Would Allow Searching Text of Many Books,” The New York Times, July 21, 2003.

Capturing Document Images
To digitize a book: $4 - $1000 each!
  - cheaply: bitonal, low quality, mass scanning, …
  - expensively: color, quality control, custom handling, …
Breakdown of costs:
  1/3  cataloging, description, indexing
  1/3  scanning, OCR, correction, markup
  1/3  quality control, file maintenance, admin
NOTE: DIA can help with all three
“The Price of Digitization,” Proc., NINCH Symposium (National Initiative for a Networked Cultural Heritage), New York, April 8, 2003.

Document Image Capture Operations
• Usually, large-scale batch operations
• Sometimes destructive:
  - cut off spines, discard covers, wear & tear
  - hot debate over ‘scan-and-discard’ policies
• Image quality standards are often subjective
  - usually: “completeness”; no missing pages, text
  - seldom: checked for human, machine legibility
  - rarely: guaranteed suitable for future uses
• Scan once, for ever:
  - seldom rescanned (Lesk: “not for 5-10 y”)
M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks. Morgan Kaufmann, San Francisco, CA, 1997.

The PARC Rare Book Scanner
• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25” x 11.75” field
• Throughput
  - 8-bit grey: 450 pages/h
  - 24-bit color: 120 pages/h
Bob Street & Steve Ready, PARC.

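The scanner figures above imply a raw data rate that can be worked out directly. The sketch below is only an illustration: it assumes the full 9.25" x 11.75" field is captured at 280 dpi for every page and stored uncompressed, neither of which the slide states. The function names are hypothetical.

```python
def frame_pixels(dpi, width_in, height_in):
    """Pixels in one captured frame, assuming the whole field is scanned."""
    return round(dpi * width_in) * round(dpi * height_in)


def raw_bytes_per_hour(dpi, width_in, height_in, bits_per_pixel, pages_per_hour):
    """Uncompressed bytes produced per hour at the given throughput."""
    bytes_per_page = frame_pixels(dpi, width_in, height_in) * bits_per_pixel // 8
    return bytes_per_page * pages_per_hour


# Figures taken from the slide: 280 dpi, 9.25" x 11.75" field.
grey_rate = raw_bytes_per_hour(280, 9.25, 11.75, 8, 450)    # 8-bit grey
color_rate = raw_bytes_per_hour(280, 9.25, 11.75, 24, 120)  # 24-bit color

print(f"grey:  {grey_rate / 1e9:.1f} GB/h")
print(f"color: {color_rate / 1e9:.1f} GB/h")
```

Under these assumptions each frame holds about 8.5 million pixels, and the two modes come out at roughly 3.8 GB/h (grey) and 3.1 GB/h (color), so the raw data rate is of the same order in both modes.
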
GUI & IP for Image Capture
• Calibration
  - color test targets
  - per-pixel gain/offset map
• Image Processing
  - performed on the fly: contrast, cleaning, etc.
  - crop, skew-correct
  - processing templates
• Assuring Quality
  - visual inspection
• Capturing Metadata
  - automatic page numbering: 1, 2, 3, ... / i, iii, ... / I, III, …
  - section labels
  - comments (manual)

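A per-pixel gain/offset map is commonly built with flat-field correction: capture a dark frame and a white reference frame, then rescale every pixel so the white reference maps to a common target level. The sketch below assumes that approach; the slide does not say how the PARC map was derived, and the function names and the `target` parameter are hypothetical.

```python
def build_gain_offset(dark, white, target=255.0):
    """Derive per-pixel gain and offset maps from two reference frames.

    dark  : frame captured with no light (per-pixel offset)
    white : frame captured from a uniform white target
    """
    gain, offset = [], []
    for d_row, w_row in zip(dark, white):
        g_row = []
        for d, w in zip(d_row, w_row):
            span = w - d
            # Guard against dead pixels where white <= dark.
            g_row.append(target / span if span > 0 else 1.0)
        gain.append(g_row)
        offset.append(list(d_row))
    return gain, offset


def correct(raw, gain, offset, target=255.0):
    """Apply corrected = (raw - offset) * gain, clamped to [0, target]."""
    return [
        [min(target, max(0.0, (r - o) * g)) for r, g, o in zip(r_row, g_row, o_row)]
        for r_row, g_row, o_row in zip(raw, gain, offset)
    ]


# Two pixels with different sensitivity viewing the same mid-grey surface.
dark = [[10, 20]]
white = [[210, 120]]
gain, offset = build_gain_offset(dark, white)
corrected = correct([[110, 70]], gain, offset)
```

In this example both pixels sit halfway between their own dark and white readings, so after correction they agree at about half of `target` even though their raw values differ.
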
DIA R&D for Image Quality Control
• Measuring document image quality
  - new test target designs
  - image processing algorithms
  - rigorous, quantitative standards
• Assuring quality
  - fast algorithms for on-the-fly image quality estimation
• Predicting human & machine legibility
What image quality features correlate well with human and OCR legibility? … and with other, later DIA tasks?
K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&R X, Santa Clara, CA, Jan 2003.
E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.

When Quality Control Goes Wrong
Front Page, 1852 Edition of the New York Times. Scanned from microfilm.
The Historical New York Times Project, CMU/NYT, 1999.

Extracting & Recognizing Content
These are central DIA R&D goals
But existing doc image understanding systems cannot guarantee high accuracy across the full range of documents:
  - typefaces, h/w styles      old fashioned
  - image qualities            poor & variable
  - layout geometries          deformed
  - writing systems            obsolete
  - languages                  rare
  - domains of discourse       arcane
DL’s scholarly & historical docs are often harder
S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, 1999.

Richly Meaningful Typographical Book Designs
Rare Botanical Reference Book
• Jepson’s A Flora of California, 1943.
• Authoritative, still in demand by scholars
• Only a few copies are left
• Difficult to OCR well
• Scanned at PARC, all page images put on the Univ. California, Berkeley Digital Library website

Cut into Word-box Images: layout analysis without OCR

Reflow Word Boxes into Textlines to Fit the Display Geometry
T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.

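Reflowing can be illustrated with a simple greedy line filler: take the word-box images in reading order and place each on the current line until the next one would overflow the display width. This is a minimal sketch of the general idea, not the algorithm of the cited paper; the `WordBox` type, the `reflow` function, and the gap parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    word_id: int  # position in reading order
    width: int    # width of the cropped word image, in pixels
    height: int   # height of the cropped word image, in pixels


def reflow(boxes: List[WordBox], display_width: int,
           word_gap: int = 4, line_gap: int = 2) -> List[Tuple[int, int, int]]:
    """Return (word_id, x, y) placements that fit within display_width."""
    placements = []
    x, y, line_height = 0, 0, 0
    for box in boxes:
        # Break the line if this box would overflow (never on an empty line,
        # so an over-wide box still gets placed rather than looping forever).
        if x > 0 and x + box.width > display_width:
            y += line_height + line_gap
            x, line_height = 0, 0
        placements.append((box.word_id, x, y))
        x += box.width + word_gap
        line_height = max(line_height, box.height)
    return placements


words = [WordBox(0, 40, 10), WordBox(1, 50, 12), WordBox(2, 30, 10)]
layout = reflow(words, display_width=100)
# layout == [(0, 0, 0), (1, 44, 0), (2, 0, 14)]
```

Because only the word images are moved, no recognition step is involved; the quality of the result depends on the word segmentation and on getting the reading order right.
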
Make Doc-Images Highly Portable, Legible Everywhere
No OCR errors! (Only layout errors.)
Preserve meaningful appearance
Challenges:
• reading order
• non-text
• navigation
• linking

Other ‘Pure-Image’ DIA for DLs
Not Dependent on Accurate Recognition
• For Text: seems feasible
  - Summarization of doc images w/out OCR
  - Outlining, condensing, linking
  - Reflowing tables
• For Non-text: seems dauntingly hard
  - Mathematics
  - Chemical formulae
  - Line-art drawings
  - Graphics generally
Vitally important to try since recognition & encoding are highly problematic

Personal Digital Libraries
• People are beginning to
  - collect
  - manage
  - share
  their own small DLs
• Scanned & encoded documents, mixed together
• How to assist ‘productive reading’
These users lack specialized skills
DIA tools need to be deskilled to a clerical level … and to work together far better
Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.

Interactive Digital Libraries
• Today’s DIA tools leave many errors in recognition, encoding, tagging, etc.
• How can these mistakes affordably be fixed?
• Invite volunteer help:
  - e.g. Gutenberg Project, Open Mind Initiative
• Challenge: provide interactive tools to
  - accept corrections on-line
  - enforce review, verification
  - efficiently make the most of every correction
  - DIA tools able to benefit from correction
Thanks to: George Nagy, David Stork, Dan Lopresti.

Collaborative DLs: DIA for the Masses
• Enable non-professionals to collaborate in improving, manually, on the best that automatic DIA tools can do, e.g.
  - one person may correct thresholding
  - another corrects OCR errors
  - yet another adds tags
• Offer DIA tools downloadable from the web, possibly under GPL-like licenses
• Dimp? (document image processing toolkit)
• interoperable via common data structures & file formats
Thanks to: Tom Breuel, Kris Popat, Bill Janssen.

DIA R&D Opportunities for DLs
Making Document Images as Useful as Symbolically Encoded Data
• Image capture, quality control
• Image improvement, rectification, etc.
• Content extraction, recognition, & analysis
• Legibility, presentation, reflowing
• Markup, indexing, retrieval, summarization
• Personal & interactive DLs
• Offering DIA tools to DL users
• … many more, no doubt

An Urgent Responsibility?
• Vast, irreplaceable, culturally vital legacy collections of paper documents are competing ineffectively for attention with billions of digital documents
• Thus paper archives are threatened with neglect, perceived irrelevance, … & eventually, oblivion?
The DIA community is uniquely qualified to help the DL community rescue them.

Contact
Henry S. Baird
Statistical Pattern & Image Analysis
baird@parc.com
www.parc.com/baird
+1-650-812-4481
FAX -4374