  • Number of slides: 29

Where Preservation Meets Mass Digitization
John A. Kunze
California Digital Library
LAUC Fall Assembly, UC Merced, 16 November 2007

The UC Libraries’ Digital Preservation Program
UC-wide program: serves all 10 UC campuses
– 208,000 students
– 121,000 faculty and staff
– 10+ libraries
– Museums
Located at the CDL

Preservation challenges: case studies
With benefit of hindsight, what’s hard?
• Policy
• Making files small
• Fast data transfer
• Cheap, reliable storage
• Lots of annoying files
• Preserving the revenue stream

What’s digital preservation?
Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions

Policy Challenges
• How faithful
• How long
• How many replicas
• How much manipulation
• Right(s)mare

Fast data transfer challenges
Lots of files, lots of data
• Could take months to move and replicate
Explore data transfer / replication options
• Test with CDL and New York University
Survey tool performance and usability
Continuing conversations with the San Diego Supercomputer Center and the Library of Congress, with the goal of creating guidelines

Transfer tools tested
Ubiquitous, usual suspects: RSYNC, SCP, SFTP, FTP
• MogileFS (simple distributed filesystem, Perl scripts) http://www.danga.com/mogilefs/
• High Performance SSH (no system gaming) http://www.psc.edu/networking/projects/hpn-ssh/
But parallelism really works:
• GridFTP (high security, from Grid community) http://www.globus.org/grid_software/data/gridftp.php
• SRB (bundled Sget/Sput tools) http://www.sdsc.edu/srb/index.php/Main_Page
• BBFTP (easy installation and use) http://doc.in2p3.fr/bbftp/
• BBCP (easy installation and use) http://www.slac.stanford.edu/~abh/bbcp/
Practically, combine parallelism with common tools: 20 x SCP!
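The "20 x SCP" idea can be sketched roughly as follows. This is only an illustration, not the setup used in the tests: the function names (`partition`, `scp_command`, `parallel_copy`), the round-robin batching, and the `host:/dst` destination are all assumptions. The `run` parameter defaults to `subprocess.run` and can be swapped for a stub, so the batching logic can be tried without a remote host.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def partition(paths, streams):
    """Deal paths round-robin into at most `streams` non-empty batches."""
    batches = [paths[i::streams] for i in range(streams)]
    return [batch for batch in batches if batch]


def scp_command(batch, destination):
    """Build one scp invocation that copies a whole batch of files."""
    return ["scp", "-q", *batch, destination]


def parallel_copy(paths, destination, streams=20, run=subprocess.run):
    """Run one scp process per batch, all batches concurrently.

    `run` is called once per command; pass a stub to dry-run the plan.
    Returns whatever `run` returns for each command, in batch order.
    """
    commands = [scp_command(b, destination) for b in partition(paths, streams)]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Dry run: print the planned commands instead of executing them.
    files = [f"page_{n:04d}.jp2" for n in range(45)]  # hypothetical file names
    for cmd in parallel_copy(files, "host:/dst", run=lambda c: c):
        print(" ".join(cmd))
```

Each stream is an ordinary SCP process, so the per-connection throughput is unchanged; any gain comes from running many connections at once.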

Upload Comparison (chart)

Making many files small
Now we know how to move millions of files
How to make them smaller?

What is mass digitization?
Large-scale scanning of newspapers, books, videos, etc. from the world’s major libraries
– Millions of items/hours to digitize, e.g.,

Why mass digitization?
For better access and search
– Page images remotely accessible
– OCR (Optical Character Recognition) makes text visible to search engines
Mass digitization is, for us, not intended to replace the physical item

“Page Image Compression for Mass Digitization”
A study of page image tradeoffs with:
• National Library of France (BnF)
• Harvard University Libraries (HUL)
  – With Google Book Search: G9 Libraries
  – Harvard, Michigan, Stanford, NYPL, Oxford, University of California, etc.
• University of California Berkeley (UCB) and the California Digital Library (CDL)
  – With Open Content Alliance: Internet Archive, Microsoft, University of Toronto, etc.
Presented at IS&T Archiving 2007, Arlington, May 2007

Mass book digitization tradeoffs
For our millions of volumes
• Need to strike balance between size of the files and quality of the reading experience
• Images need to work with OCR
• Possibility of re-printing books (print on demand), but this was not investigated formally
Recommendations common to all 3 groups:
• JPEG 2000 JP2 (ISO/IEC 15444-1) file format
• An all color, all lossy solution is feasible

Text pages: point size mixes, foxing, handwriting

Text page: fonts, paper color, bleed-through

Text page: wordy, tight 2-cols, uneven ink (details)

Color page: high information density (detail)

Color page: over-exposed, fine lines

Grayscale: coarse half-tones (detail)

Don’t forget audio/video
Case: Swedish National Archive of Sound and Moving Images is digitizing 6 million hours of material
– 50 different recording formats and catalogs, growing 10% per annum
– E.g., 500,000 hours of open-reel 4-track using 16 simultaneous players, 8 players per operator
– E.g., 220,000 hours of VHS using 12 simultaneous players
Digitizing and ingesting 42 TB/month
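To get a feel for these figures, the short calculation below converts them into a sustained data rate and a minimum wall-clock time. The input numbers come from the slide; everything else is an assumption made for illustration: decimal terabytes, a 30-day month, and players that run in real time, around the clock, with no downtime.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month
BYTES_PER_TB = 10**12               # assumption: decimal terabytes
HOURS_PER_YEAR = 24 * 365           # assumption: players never stop


def sustained_rate_mb_per_s(tb_per_month):
    """Average ingest rate in MB/s needed to absorb `tb_per_month`."""
    return tb_per_month * BYTES_PER_TB / SECONDS_PER_MONTH / 10**6


def wall_clock_years(hours_of_material, simultaneous_players):
    """Years needed if every player digitizes in real time, non-stop."""
    return hours_of_material / simultaneous_players / HOURS_PER_YEAR


print(round(sustained_rate_mb_per_s(42), 1))     # 16.2 (MB/s for 42 TB/month)
print(round(wall_clock_years(500_000, 16), 2))   # 3.57 (years, open-reel)
print(round(wall_clock_years(220_000, 12), 2))   # 2.09 (years, VHS)
```

Under these assumptions the open-reel material alone keeps 16 players busy for more than three and a half years, which suggests why the number of players per operator matters.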

Cheap, reliable storage
OK, we can make files smaller and we can move lots of them quickly, but can we make disk cheaper and still reliable?
• RAID (Redundant Arrays of Inexpensive Disks) 1980s
• JBOD (Just a Bunch of Disks) 1990s
• MAID (Massive Arrays of Idle Disks) 2000s

Lots of annoying files, or “making files fewer”
Origin: web archiving
Solution: aggregate W/ARC file format
– Many “files” in one file for speed and ease
– Records are sort of peers of files
Generalization to mass digitization and other processing products

W/ARC File Anatomy
WARC = Web ARChive file format
W/ARC File: a sequence of W/ARC Records; append at will
W/ARC Record:
– Text header: length, source URI, date, type, …
– Content block: e.g., HTTP response headers and length bytes of HTML, GIF, PDF, …
WARC is a fast-track ISO work item
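The record anatomy above (a text header carrying length, source URI, date, and type, followed by a content block, with records appended at will) can be sketched in a few lines. This is a simplified illustration and not a conforming WARC implementation: the header names are borrowed from the WARC format as later published, required fields such as a record ID are omitted, and the function names are hypothetical.

```python
import io
from datetime import datetime, timezone

CRLF = b"\r\n"


def append_record(stream, record_type, source_uri, block):
    """Append one record: text header, blank line, content block, separator."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {source_uri}",
        f"WARC-Date: {date}",
        f"Content-Length: {len(block)}",
    ]
    stream.write(CRLF.join(line.encode("utf-8") for line in header_lines))
    stream.write(CRLF + CRLF)   # blank line ends the header
    stream.write(block)         # exactly Content-Length bytes
    stream.write(CRLF + CRLF)   # separator before the next record


def read_records(stream):
    """Yield (headers, block) for every record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return                  # end of file
        if not line.strip():
            continue                # separator lines between records
        headers = {}                # `line` was the version line
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        yield headers, stream.read(int(headers["Content-Length"]))


if __name__ == "__main__":
    buf = io.BytesIO()
    append_record(buf, "response", "http://example.org/", b"<html>hi</html>")
    append_record(buf, "resource", "http://example.org/a.gif", b"GIF89a...")
    buf.seek(0)
    for headers, block in read_records(buf):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

Because each header states the block length, a reader can skip from record to record without parsing the content, which is part of what makes one large aggregate file faster to handle than many small ones.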

Digitizing the Digital
Origin: preservation of revenue stream
Case of Data Desiccation: creating no-frills, sometimes feature-poor derivatives that retain most of the original scholarly value but are likely to be less perishable than the original format (similar to “digital microfilm”)
Save desiccated derivatives along with the original, just in case no one ever again
• Has the funds to touch files
• Has the expertise to convert them properly

Example
Photo of Mission San Luis de Tolosa
[2]About the City [3]Visiting SLO [4]What’s New [5]City Government [6]Employment Opportunities [7]Bids & Proposals [8]Economic Development [9]FAQs [10]How are we doing?
City of San Luis Obispo
About the City [Choose a Destination..] [11]Search [12]Contact Us [13]City Home
A Brief History
Who we are and how we got started. The City of San Luis Obispo serves as the commercial, governmental and cultural hub of California’s Central Coast. One of California’s oldest communities, it began with the founding of Mission San Luis Obispo de Tolosa in 1772 by Father Junípero Serra as the fifth mission in the California chain of 21 missions. The mission was named after Saint Louis, a 13th Century Bishop of Toulouse, France. (San Luis Obispo is Spanish for "St. Louis, the Bishop".) It was first incorporated in 1856 as a General Law City, and became a Charter City in 1876.
Where we’re located. With a population of 44,000, the City is located eight miles from the Pacific Ocean and is midway between San Francisco and Los Angeles at the junction of Highway 101 and scenic Highway 1. San Luis Obispo is the County Seat, and a number of federal and state regional offices and facilities are located here, including Cal Poly State University, Cuesta Community College, Regional Water Quality Board and Caltrans District offices. The City’s ideal weather and natural beauty provide numerous opportunities for outdoor recreation at nearby City and State parks, lakes, beaches and wilderness areas.
Great place to live and visit. While San Luis Obispo grew relatively…

Example continued: endnotes
…
[18]About the City | [19]Visiting SLO | [20]What’s New | [21]City Government | [22]Employment [23]Bids & Proposals | [24]Economic Development | [25]FAQs | [26]How are we doing?
[27]© 2006, City of San Luis Obispo
References
1. http://www.ci.san-luis-obispo.ca.us/briefhistory.asp#content
2. http://www.ci.san-luis-obispo.ca.us/about.asp
3. http://www.ci.san-luis-obispo.ca.us/visit.asp
4. http://www.ci.san-luis-obispo.ca.us/whatsnew.asp
5. http://www.ci.san-luis-obispo.ca.us/government.asp
6. http://www.ci.san-luis-obispo.ca.us/humanresources/index.asp
7. http://www.ci.san-luis-obispo.ca.us/finance/bids.asp
8. http://www.ci.san-luis-obispo.ca.us/economicdevelopment/index.asp
9. http://www.ci.san-luis-obispo.ca.us/faq.asp
10. http://www.ci.san-luis-obispo.ca.us/how.asp
11. http://www.ci.san-luis-obispo.ca.us/search2.asp
12. http://www.ci.san-luis-obispo.ca.us/contact.asp
13. http://www.ci.san-luis-obispo.ca.us/index.asp
14. http://www.ci.san-luis-obispo.ca.us/visit.asp
…
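A desiccated derivative of this kind (plain text with bracketed link numbers and the URLs collected as endnotes) could be produced along the following lines. This is a minimal sketch using only the Python standard library; it is not the tool that generated the example above, and the class and function names are hypothetical. It drops images, layout, scripts, and styles, and keeps only the text and link targets.

```python
from html.parser import HTMLParser


class Desiccator(HTMLParser):
    """Collect visible text, replacing each link with a numbered marker."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments and [n] markers, in order
        self.links = []        # link targets; position n-1 holds reference n
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.parts.append(f"[{len(self.links)}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def desiccate(html):
    """Return plain text followed by a numbered list of link targets."""
    parser = Desiccator()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # collapse whitespace
    refs = [f"{n}. {url}" for n, url in enumerate(parser.links, 1)]
    return text + "\n\nReferences\n" + "\n".join(refs)


if __name__ == "__main__":
    page = (
        '<p><a href="http://example.org/about">About the City</a> | '
        '<a href="http://example.org/visit">Visiting SLO</a></p>'
        "<script>track()</script>"
    )
    print(desiccate(page))
```

The derivative loses the page's appearance but keeps its words and its outbound links, which is the trade the "digital microfilm" comparison describes.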

Desiccation and Mass Digitization?
How to make the OCR’d plain text version of a book as acceptable as possible?
Very difficult problem: cf. work of Project Gutenberg and Distributed Proofreaders
– Born-digital plain text prettier than OCR
– Page numbers, footnotes, sidebars
– Multiple columns and reading order
At the same time, page/section/chapter structural layout is a mass digitization feature frontier

Questions?
jak@ucop.edu