- Number of slides: 29
Where Preservation Meets Mass Digitization John A. Kunze California Digital Library LAUC Fall Assembly, UC Merced, 16 November 2007
The UC Libraries’ Digital Preservation Program UC-wide program: serves all 10 UC campuses – 208,000 students – 121,000 faculty and staff – 10+ libraries – Museums Located at the CDL
Preservation challenges: case studies With benefit of hindsight, what’s hard? • Policy • Making files small • Fast data transfer • Cheap, reliable storage • Lots of annoying files • Preserving the revenue stream
What’s digital preservation? Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions
Policy Challenges • How faithful • How long • How many replicas • How much manipulation • Right(s)mare
Fast data transfer challenges Lots of files, lots of data • Could take months to move and replicate Explore data transfer / replication options • Test with CDL and New York University Survey tool performance and usability Continuing conversations with the San Diego Supercomputer Center and the Library of Congress, with the goal of creating guidelines
Transfer tools tested
  Ubiquitous, usual suspects: RSYNC, SCP, SFTP, FTP
  • MogileFS (simple distributed filesystem, Perl scripts) http://www.danga.com/mogilefs/
  • High Performance SSH (no system gaming) http://www.psc.edu/networking/projects/hpn-ssh/
  But parallelism really works:
  • GridFTP (high security, from Grid community) http://www.globus.org/grid_software/data/gridftp.php
  • SRB (bundled Sget/Sput tools) http://www.sdsc.edu/srb/index.php/Main_Page
  • BBFTP (easy installation and use) http://doc.in2p3.fr/bbftp/
  • BBCP (easy installation and use) http://www.slac.stanford.edu/~abh/bbcp/
  Practically, combine parallelism with common tools: 20 x SCP!
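The "20 x SCP" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used in the tests: the function names, the round-robin batching, and the default of 20 streams are hypothetical choices, and the sketch assumes `scp` is installed with key-based authentication already set up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(paths, n_streams):
    """Deal paths into up to n_streams batches of similar size."""
    batches = [[] for _ in range(n_streams)]
    for i, path in enumerate(paths):
        batches[i % n_streams].append(path)
    return [b for b in batches if b]  # drop batches that stayed empty


def scp_command(batch, destination):
    """Build one scp invocation that copies a whole batch of files."""
    return ["scp", "-q", *batch, destination]


def parallel_scp(paths, destination, n_streams=20, dry_run=True):
    """Run one scp process per batch concurrently; return the commands."""
    commands = [scp_command(b, destination)
                for b in split_round_robin(paths, n_streams)]
    if not dry_run:
        with ThreadPoolExecutor(max_workers=n_streams) as pool:
            # check=True raises if any scp process exits non-zero
            list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands
```

With `dry_run=True` (the default here) the function only returns the commands it would run, which makes it easy to inspect the batching before moving any data. A real transfer would pass `dry_run=False` and a destination such as a hypothetical `user@host.example.org:/ingest/`.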
Upload Comparison
Making many files small Now we know how to move millions of files. How do we make them smaller?
What is mass digitization? Large-scale scanning of newspapers, books, videos, etc. from the world’s major libraries – Millions of items/hours to digitize, e.g.,
Why mass digitization? For better access and search – Page images remotely accessible – OCR (Optical Character Recognition) makes text visible to search engines Mass digitization is, for us, not intended to replace the physical item
“Page Image Compression for Mass Digitization” A study of page image tradeoffs with: • National Library of France (BnF) • Harvard University Libraries (HUL) – With Google Book Search: G9 Libraries – Harvard, Michigan, Stanford, NYPL, Oxford, University of California, etc. • University of California Berkeley (UCB) and the California Digital Library (CDL) – With Open Content Alliance: Internet Archive, Microsoft, University of Toronto, etc. Presented at IS&T Archiving 2007, Arlington, May 2007
Mass book digitization tradeoffs For our millions of volumes • Need to strike balance between size of the files and quality of the reading experience • Images need to work with OCR • Possibility of re-printing books (print on demand), but this was not investigated formally Recommendations common to all 3 groups: • JPEG 2000 JP2 (ISO/IEC 15444-1) file format • An all color, all lossy solution is feasible
Text pages: point size mixes, foxing, handwriting
Text page: fonts, paper color, bleed-through
Text page: wordy, tight 2-cols, uneven ink (details)
Color page: high information density (detail)
Color page: over-exposed, fine lines
Grayscale: coarse half-tones (detail)
Don’t forget audio/video Case: Swedish National Archive of Sound and Moving Images is digitizing 6 million hours of material – 50 different recording formats and catalogs, growing 10% per annum – E.g., 500,000 hours of open-reel 4 track using 16 simultaneous players, 8 players per operator – E.g., 220,000 hours VHS using 12 simultaneous players Digitizing and ingesting 42 TB/month
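A quick back-of-envelope calculation gives a feel for the scale of these figures. The sketch below is not from the talk; it assumes real-time playback, nonstop operation of every player, a 30-day month, and decimal units (1 TB = 10**6 MB).

```python
HOURS_PER_YEAR = 24 * 365


def wall_clock_years(content_hours, simultaneous_players):
    """Years of nonstop operation if every player digitizes in real time."""
    return content_hours / simultaneous_players / HOURS_PER_YEAR


def sustained_rate_mb_per_s(terabytes_per_month, days_per_month=30):
    """Average ingest rate in MB/s, using decimal units (1 TB = 10**6 MB)."""
    seconds = days_per_month * 24 * 3600
    return terabytes_per_month * 10**6 / seconds


print(f"open-reel: {wall_clock_years(500_000, 16):.1f} years")
print(f"VHS:       {wall_clock_years(220_000, 12):.1f} years")
print(f"ingest:    {sustained_rate_mb_per_s(42):.1f} MB/s")
```

Under those assumptions the open-reel material alone needs about 3.6 years of continuous playback, the VHS material about 2.1 years, and 42 TB/month works out to an average of roughly 16.2 MB/s sustained around the clock.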
Cheap, reliable storage OK, we can make files smaller and we can move lots of them quickly, but can we make disk cheaper and still reliable? • RAID (Redundant Arrays of Inexpensive Disks) 1980s • JBOD (Just a Bunch of Disks) 1990s • MAID (Massive Arrays of Idle Disks) 2000s
Lots of annoying files, or “making files fewer” Origin: web archiving Solution: aggregate W/ARC file format – Many “files” in one file for speed and ease – Records are sort of peers of files Generalization to mass digitization and other processing products
W/ARC File Anatomy
  WARC = Web ARChive file format
  W/ARC File: a sequence of W/ARC Records (append at will)
  W/ARC Record: text header, then content block
  Text header: length, source URI, date, type, …
  Content block: e.g., HTTP response headers and length bytes of HTML, GIF, PDF, …
  WARC is a fast-track ISO work item
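The record anatomy can be illustrated with a minimal Python sketch. It is a simplification, not a conforming implementation: the header field names follow the later-published WARC 1.0 layout, mandatory fields such as the record ID are omitted for brevity, and the function names are hypothetical.

```python
import io
from datetime import datetime, timezone


def write_record(stream, record_type, source_uri, content):
    """Append one record: a text header, a blank line, then the content block."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {source_uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(content)}\r\n"
        "\r\n"
    )
    stream.write(header.encode("utf-8") + content + b"\r\n\r\n")


def read_records(stream):
    """Yield (headers, content) pairs, using Content-Length to read each block."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.decode("utf-8").strip()}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b""):
                break  # a blank line ends the text header
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        content = stream.read(int(headers["Content-Length"]))
        yield headers, content


# Many "files" in one file: append records at will, then read them back.
archive = io.BytesIO()
write_record(archive, "response", "http://example.org/", b"<html>hello</html>")
write_record(archive, "resource", "http://example.org/logo.gif", b"GIF89a...")
archive.seek(0)
for headers, content in read_records(archive):
    print(headers["WARC-Target-URI"], len(content))
```

Because each header states the length of its content block, a reader can skip from record to record without parsing the payload, which is what makes one large aggregate file faster to handle than millions of small ones.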
Digitizing the Digital Origin: preservation of revenue stream Case of Data Desiccation: creating no-frills, sometimes feature-poor derivatives that retain most of the original scholarly value but are likely to be less perishable than the original format (similar to “digital microfilm”) Save desiccated derivatives along with the original, just in case no one ever again • Has the funds to touch files • Has the expertise to convert them properly
Example Photo of Mission San Luis de Tolosa [2]About the City [3]Visiting SLO [4]What’s New [5]City Government [6]Employment Opportunities [7]Bids & Proposals [8]Economic Development [9]FAQs [10]How are we doing? City of San Luis Obispo About the City [Choose a Destination..] [11]Search [12]Contact Us [13]City Home A Brief History Who we are and how we got started. The City of San Luis Obispo serves as the commercial, governmental and cultural hub of California’s Central Coast. One of California’s oldest communities, it began with the founding of Mission San Luis Obispo de Tolosa in 1772 by Father Junípero Serra as the fifth mission in the California chain of 21 missions. The mission was named after Saint Louis, a 13th Century Bishop of Toulouse, France. (San Luis Obispo is Spanish for "St. Louis, the Bishop".) It was first incorporated in 1856 as a General Law City, and became a Charter City in 1876. Where we’re located. With a population of 44,000, the City is located eight miles from the Pacific Ocean and is midway between San Francisco and Los Angeles at the junction of Highway 101 and scenic Highway 1. San Luis Obispo is the County Seat, and a number of federal and state regional offices and facilities are located here, including Cal Poly State University, Cuesta Community College, Regional Water Quality Board and Caltrans District offices. The City’s ideal weather and natural beauty provide numerous opportunities for outdoor recreation at nearby City and State parks, lakes, beaches and wilderness areas. Great place to live and visit. While San Luis Obispo grew relatively…
Example continued: endnotes
  … [18]About the City | [19]Visiting SLO | [20]What’s New | [21]City Government | [22]Employment [23]Bids & Proposals | [24]Economic Development | [25]FAQs | [26]How are we doing? [27]© 2006, City of San Luis Obispo
  References
  1. http://www.ci.san-luis-obispo.ca.us/briefhistory.asp#content
  2. http://www.ci.san-luis-obispo.ca.us/about.asp
  3. http://www.ci.san-luis-obispo.ca.us/visit.asp
  4. http://www.ci.san-luis-obispo.ca.us/whatsnew.asp
  5. http://www.ci.san-luis-obispo.ca.us/government.asp
  6. http://www.ci.san-luis-obispo.ca.us/humanresources/index.asp
  7. http://www.ci.san-luis-obispo.ca.us/finance/bids.asp
  8. http://www.ci.san-luis-obispo.ca.us/economicdevelopment/index.asp
  9. http://www.ci.san-luis-obispo.ca.us/faq.asp
  10. http://www.ci.san-luis-obispo.ca.us/how.asp
  11. http://www.ci.san-luis-obispo.ca.us/search2.asp
  12. http://www.ci.san-luis-obispo.ca.us/contact.asp
  13. http://www.ci.san-luis-obispo.ca.us/index.asp
  14. http://www.ci.san-luis-obispo.ca.us/visit.asp
  …
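The example resembles what a text-mode browser dump produces: visible text with numbered link markers, followed by a list of link targets. The slides do not say which tool generated it, so the following Python sketch is only one assumed way to desiccate HTML into that shape; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class Desiccator(HTMLParser):
    """Reduce HTML to plain text, replacing each link with a numbered marker."""

    def __init__(self):
        super().__init__()
        self.text = []   # pieces of visible text, in document order
        self.links = []  # link targets, in order of appearance

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)
            self.text.append(f"[{len(self.links)}]")

    def handle_data(self, data):
        self.text.append(data)


def desiccate(html):
    """Return the page text followed by a numbered list of link targets."""
    parser = Desiccator()
    parser.feed(html)
    parser.close()
    body = " ".join("".join(parser.text).split())  # collapse whitespace
    refs = [f"{n}. {url}" for n, url in enumerate(parser.links, start=1)]
    return body + "\n\nReferences\n" + "\n".join(refs)


page = '<p><a href="about.asp">About the City</a> | <a href="visit.asp">Visiting SLO</a></p>'
print(desiccate(page))
```

This sketch keeps every text node, so script and style contents would leak into the output; a more careful derivative would skip those elements and handle images, tables, and character encodings as well.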
Desiccation and Mass Digitization? How to make the OCR’d plain text version of a book as acceptable as possible? Very difficult problem: cf. work of Project Gutenberg and Distributed Proofreaders – Born-digital plain text prettier than OCR – Page numbers, footnotes, sidebars – Multiple columns and reading order At the same time, page/section/chapter structural layout is a mass digitization feature frontier
Questions? jak@ucop.edu