
  • Number of slides: 27

International Atomic Energy Agency
Digital Preservation Session
Tue, 4 Nov 2008
34th INIS Liaison Officers’ Meeting, 3-5 Nov 2008, Vienna, Austria
S. Rieder, G. St-Pierre, Y. Reynaud-Pulido, T. Kalapurackal
Database Production and Imaging Group, INIS Unit, INIS & NKM Section

Digital Preservation at INIS
Mission:
  • preservation of nuclear knowledge
  • serving as a reservoir of nuclear information
  • provision of quality information services
  • promotion of a culture of “information and knowledge sharing”
34th ILO Meeting, 3-5 Nov 2008, Vienna, AT. International Atomic Energy Agency

Digital Preservation at INIS
  • INIS Non-Conventional Literature (NCL)
      • Production of the INIS electronic Full Text Database
  • Digital Preservation Activities
      • Digitization projects
          • at IAEA
          • at Member States

Digital Preservation at INIS
Objectives:
  • Consistent, high level of image quality
  • Interoperability and accessibility of digitized resources
  • Long-term preservation of digital resources for future generations
Member States, IAEA
  • Develop good practices for digital preservation
      • ‘Overview of INIS Digital Preservation Practices’: INIS Information Letter No. 253 & Attachment (2008-10-03)
        http://www.iaea.org/inisnkm/marea/restrictedpdf/2008/infoletter253_attachment.pdf

INIS Digital Preservation Principles
  • INIS principles and workflow are based on Cornell University’s digital imaging tutorial:
    http://www.library.cornell.edu/preservation/tutorial/index.html
  • available in English, French, Spanish

INIS Workflow
  • Document Benchmarking
  • Document Preparation
  • Scanning
  • Quality Control
  • Image Enhancement
  • Metadata Creation/Validation
  • Export including Compression
  • Completeness Check
  • Back-up
  • Post-processing
  • Storage and dissemination

Benchmarking & Document Preparation
Benchmarking:
  • Can the ‘original’ content be adequately captured in digital form?
  • Do the physical format & condition meet digitizing requirements?
  • What is the type of material to be digitized?
  • Which resolution?
  • At which bit-depth?
  • Which compression parameters?
  • Estimated accuracy level for OCR?
  • Other considerations?
Preparation:
  • Physically (unbind, remove staples/clips, etc.)
  • Structurally (add/remove barcodes, separate chapters, parts, etc.)
  • Characteristics of paper (e.g. size, thickness, glossy/matt, condition)

Scanning: Capture Modes & Optical Resolution
Capture modes: depend on the physical form of the original
  • Bitonal: 1 bit/pixel, black & white (printed text)
  • Greyscale: 8 bits/pixel, 256 grey shades (black & white photographs)
  • Colour: 24 bits/pixel, about 16.7 million colours & grey shades (continuous tone & colour)
Optical Resolution:
  • “dots per inch” (DPI) or “pixels per inch” (PPI)
  • High resolution: fine detail
  • Bit depth: amount of information captured (greater bit depth, larger file size)
  • Greater bit depths: more accurate representation
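
The link between resolution, bit depth and file size can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the INIS tooling: the function name `uncompressed_bytes` is hypothetical, the page is assumed to be A4, and row padding and file headers are ignored.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw raster size in bytes for a page scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# A4 is 210 mm x 297 mm; convert to inches (25.4 mm per inch).
A4_IN = (210 / 25.4, 297 / 25.4)

if __name__ == "__main__":
    for name, bits in [("bitonal", 1), ("greyscale", 8), ("colour", 24)]:
        size = uncompressed_bytes(*A4_IN, dpi=300, bits_per_pixel=bits)
        print(f"{name:9s} {bits:2d} bits/pixel: {size / 1_000_000:.1f} MB")
```

Under these assumptions an A4 page at 300 dpi comes to roughly 1.1 MB bitonal, 8.7 MB greyscale and 26.1 MB colour before compression, which is why capture mode matters as much as resolution.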

Scanning at INIS: Capture & Optical Resolution
INIS practice:
  • Standard scanner settings (for plain b/w text):
      • bitonal (black & white)
      • 300 dpi
  • Special cases (colour, pictures):
      • greyscale and colour
      • 200-300 dpi with 8 bit depth (256 colours/tones)
      • IMPORTANT: post-processing image compression needed to reduce file size
  • NEVER use colour settings to scan B/W documents

Quality Control (QC)
Retain: value, utility and integrity of resources
Verify: quality, accuracy, consistency
INIS verifies:
  • accuracy & completeness (e.g. same number of pages?)
  • data integrity
  • correctness of metadata form and validity
  • correct matching of metadata and image files
  • ‘checksum’ algorithm (authenticity & integrity of digitized files)
  • number & order of bytes (e.g. after move, copy, transfer, burn)
  • visual inspection: resolution, colour, tone, appearance (attention: changeable light & monitors)
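
The checksum and byte-count checks listed above could be sketched as follows. This is an illustration only: the slides do not say which algorithm or tool INIS uses, so SHA-256 and the function names `file_checksum` and `same_content` are assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original, copy):
    """Compare the number of bytes first (cheap), then the checksums."""
    original, copy = Path(original), Path(copy)
    if original.stat().st_size != copy.stat().st_size:
        return False
    return file_checksum(original) == file_checksum(copy)
```

Calling `same_content("master.tif", "copy.tif")` (hypothetical file names) after a move, copy, transfer or burn returns `False` as soon as a single byte differs or is missing.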

Image Enhancement
Definition: any process applied to the raw scan to improve quality or legibility of the resource
Image Enhancement at INIS:
  • despeckling
  • deskewing
  • noise reduction
  • black border removal
  • colour and tone adjustment, etc.

Quality Control and Image Enhancement (1)
  • Skewed?

Quality Control and Image Enhancement (2)
  • Noisy (e.g. unnecessary dots)?

Quality Control and Image Enhancement (3)
  • Black border?

Quality Control and Image Enhancement (4)
IMPORTANT
  • Paper size must match document hard copy
  • A4 ≠ Letter size
  • Text cut = RESCAN
  • If noticed during QC of incoming PDF, INIS will request the Input Centre to resend the page
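
One way to see why A4 and Letter are not interchangeable is to compare the pixel dimensions each produces at a given resolution. The sketch below is an assumption-laden illustration, not an INIS procedure: the names `expected_pixels` and `guess_paper` and the 20-pixel tolerance are hypothetical.

```python
MM_PER_INCH = 25.4

# Nominal paper sizes in millimetres (width, height), portrait orientation.
PAPER_SIZES_MM = {"A4": (210.0, 297.0), "Letter": (215.9, 279.4)}


def expected_pixels(paper, dpi):
    """Pixel dimensions of a full page of `paper` scanned at `dpi`."""
    width_mm, height_mm = PAPER_SIZES_MM[paper]
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def guess_paper(width_px, height_px, dpi, tolerance_px=20):
    """Return the paper size whose expected dimensions match, or None."""
    for name in PAPER_SIZES_MM:
        exp_w, exp_h = expected_pixels(name, dpi)
        if (abs(width_px - exp_w) <= tolerance_px
                and abs(height_px - exp_h) <= tolerance_px):
            return name
    return None
```

At 300 dpi an A4 page is 2480 x 3508 pixels and a Letter page 2550 x 3300 pixels, so an A4 original scanned with a Letter setting loses about 200 pixel rows, which is the kind of cut text that calls for a rescan.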

File Formats
  • Very important: prefer ‘non-proprietary’ formats
  • Several standard file formats exist
      • different resolution, bit-depth, colour capabilities, etc.
INIS Digital Collection:
  1. From ‘Paper’ or ‘Microfiche’ to ‘Digital’:
      • Master images in TIFF Group IV (b/w), in JPEG (colour)
      • Majority full-text searchable PDF
  2. Digital files received from INIS National Centres:
      • PDF
      • Compression: JBIG2 (b/w), JPEG (colour)

Preservation Formats
  • PDF: open standard, official ISO 32000-1:2008
  • PDF/A: long-term archiving of electronic documents
      • Creation of PDF documents whose visual appearance will remain the same over the course of time
      • Official ISO standard: ISO 19005-1:2005
      • Further development ongoing: http://www.pdfa.org
INIS:
  • is considering adopting PDF/A
      • for efficient preservation
      • for long-term archival of the Agency’s and Member States’ nuclear information resources
  • pilot project in 2009

OCR: Optical Character Recognition
  • Printed text searchable as electronic text
  • Primary objective for INIS digitization projects: creation of ‘searchable full text’
INIS:
  • major tool for mass production: ABBYY FineReader 8
  • ~98% accuracy: printed text in Latin & Cyrillic characters
Satisfactory testing with Script and Arabic Characters:
  • Adobe Acrobat Professional 8.0: Chinese (Simplified), Japanese, Korean
  • ABBYY FineReader Pro 9: Hebrew, Thai
  • VERUS™ Professional: Arabic

Various OCR types
  • Typewritten
  • Hand print and cursive
  • Fraktur
  • Music scores
  • MICR (Magnetic Ink Character Recognition)

OCR process 1 (no or wrong dictionary)

OCR process 2 (proper dictionary)

OCR: value added
Image Layer:
  • Scanned (raster) image
  • Visual representation of the original document
Hidden Text (errors):
  • Enables full-text search
  • Extra information for search engines

Storage of Digital Files
  • Mandatory: reliable & controlled environment
      • Storage of master files: high quality, industry standard devices, e.g. CD-R, DVD, or other contemporary reliable media
      • Backup of master files: regularly, off-site, secure location
      • RAID: Redundant Array of Independent Disks
          • several drives act collectively as a single storage system
          • consider RAID for large production environment
INIS:
  • Thecus N5200B PRO
  • 5 x 3.5" SATA disks, RAID 5, 1 TB each
  • configured as local network data storage
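
For a RAID 5 array such as the one described, the usable capacity is the total minus one disk's worth of space, which holds the distributed parity. A minimal sketch of that arithmetic, with the hypothetical function name `raid5_usable_tb` and ignoring filesystem overhead:

```python
def raid5_usable_tb(disk_count, disk_size_tb):
    """Usable capacity of a RAID 5 array of equally sized disks, in TB.

    RAID 5 needs at least three disks and survives the failure of any
    single disk; one disk's worth of capacity is consumed by parity.
    """
    if disk_count < 3:
        raise ValueError("RAID 5 requires at least 3 disks")
    return (disk_count - 1) * disk_size_tb


if __name__ == "__main__":
    # Five 1 TB disks, as on the slide.
    print(raid5_usable_tb(5, 1.0), "TB usable")
```

With five 1 TB disks this gives 4 TB of usable space. RAID protects against a single disk failure only, so it complements rather than replaces the off-site back-up.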

Back-up and Off-Site Storage
  • Create: regular back-ups of master files
  • Store: remote from the original source in a secure location
INIS:
  • 1970 to 1997: ‘microfiche’
      • NCL full text: paper to microfiche
      • safe, long-term storage: INIS National Centres
      • full set of NCL microfiche: Austrian Central Library of Physics
  • From 1997: ‘digital’
      • NCL on CD: INIS Document Delivery Centres (National Centres)
      • Secure “off site” & back-up: Austrian Central Library of Physics
      • 2008: microfiche to PDF
          • Austrian Central Library of Physics
          • INIS National Centres
          • INIS Online Database

Preservation Planning
  • Contents of digital files must remain ‘meaningful’
Different processes:
  1. Refreshing: copy files from one storage medium to another
      • verify authenticity & integrity of the files (e.g. checksum)
  2. Migration: transfer files from one HW & SW to another or from one computer generation to next generations
      • format-based: move files from ‘obsolete’ format to ‘new’ format
  3. Emulation: re-create technical environment
      • maintain information about HW & SW = system reengineered
INIS: Refreshing
  • CD to DVD (until 2007)
  • from 2008: copy to Thecus storage device
  • When PDF/A implemented: ‘migration’
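
A refreshing pass, copying every file to a new medium and verifying it afterwards, might look like the sketch below. The function names `digest` and `refresh` and the choice of SHA-256 are assumptions for illustration; the slides do not describe the actual INIS procedure. Each file is read fully into memory here, which is fine for a sketch but may not suit very large masters.

```python
import hashlib
import shutil
from pathlib import Path


def digest(path):
    """SHA-256 hex digest of a file (whole file read into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def refresh(source_dir, target_dir):
    """Copy every file under source_dir to target_dir and verify each copy.

    Returns the list of source files whose copy did not match.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    failures = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        dst = target_dir / src.relative_to(source_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies content and timestamps
        if digest(src) != digest(dst):
            failures.append(src)
    return failures
```

An empty return value means every copied file matched its source byte for byte; any path in the list should be recopied before the old medium is retired.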

Metadata
  • Key role for digital resources:
      • Describe, process, manage, track, access, preserve
INIS:
  • comprehensive ‘bibliographic’ metadata describe the intellectual content of full text
  • bibliographic elements to identify & retrieve resources
  • INIS Database: digital resources with bibliographic metadata
  • Technical metadata for digital resources: automatic creation with PDF files
  • Future: more sophisticated approach with implementation of PDF/A

Thank you for your attention!
Your INIS Digital Preservation Team