192e6fefbc0ff5d782b3c50356a2be25.ppt
- Number of slides: 18
Robust Technologies for Automated Ingestion and Long-Term Preservation of Digital Information
PI: Joseph JaJa
Co-PIs: Allison Druin and Doug Oard
Major Collaborators: Library of Congress, National Archives, Shoah Visual History Foundation, ICDL, SDSC, Georgia Tech, SLAC
Scientific Research Objectives
- Development of tools and technologies for automated ingestion and management of preservation processes.
- Evaluation and demonstration of the tools on widely different collections.
- Overall layered architecture based on distributed repositories using open standards, web technologies, and data grid technologies.
- Overall approach captures all essential elements of the Open Archival Information System (OAIS) Reference Model.
Accomplishments
- Development of a Global Digital Format Registry prototype based on scalable and secure web technologies: FOCUS (FOrmat CUration Service).
- Automated ingestion tools and testing on the ICDL collection: ICDL Book Builder.
- Preliminary design of policy-driven integrity auditing for distributed archives.
- Detailed design of a deep archive based on erasure-resilient codes.
Format Obsolescence
- Handling of digital formats is an essential part of long-term preservation.
- Preservation of any object must include ways to render and, if necessary, transform the object.
- Need to preserve:
  - Different essential aspects of objects.
  - Tools for capturing the essential format characteristics of information stored as digital objects.
FOrmat CUration Service (FOCUS)
- Maintains persistent information on digital formats and on the applications used to access and manipulate them.
- Accessible either:
  - Directly through LDAP, or
  - Indirectly through SOAP (Web Services).
[Diagram labels: SOAP, Web Service Agent, Format Registry, LDAP]
FOCUS on LDAP/SOAP
- Interoperability
  - LDAP and SOAP provide standard, platform-independent models and protocols.
- Scalability
  - LDAP is a proven scalable technology.
  - The LDAP schema can be extended and the server can be replicated with ease.
  - The SOAP server side can be extended without affecting clients.
- Security
  - SOAP can run on top of SSL (https).
  - LDAP also provides its own secure authentication and authorization methods.
FOCUS Data Model
Directory tree: dc=umiacs, dc=umd, dc=edu
- ou=Format-Registry
  - ou=Applications
    - General descriptive properties.
    - Processing: formats taken as input and/or output.
    - Example entries: Adobe Acrobat v6.0, Adobe Photoshop v7.0, Jhove 1.0
  - ou=Formats
    - General descriptive properties.
    - Processing: rendering, editing, conversion, and validation services/systems.
    - Example entries: Adobe PDF v1.4, CompuServe GIF 1989a, JPEG Image Format 2000
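As a rough illustration of how entries in such a directory tree could be addressed, the sketch below builds LDAP distinguished names (DNs) under the registry base shown on the slide. The `cn` naming attribute, the nesting of the two branches under `ou=Format-Registry`, and the helper names are assumptions for illustration only; the slides do not specify the actual FOCUS schema.

```python
# Assumed base DN, read off the slide's directory tree.
BASE_DN = "ou=Format-Registry,dc=umiacs,dc=umd,dc=edu"

# Characters that must be backslash-escaped inside an RDN value (RFC 4514).
_SPECIAL = set('"+,;<>\\')


def escape_rdn_value(value):
    """Escape a string for use as an attribute value inside a DN."""
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        if ch in _SPECIAL:
            out.append("\\" + ch)
        elif i == 0 and ch in " #":
            out.append("\\" + ch)   # leading space or '#'
        elif i == last and ch == " ":
            out.append("\\" + ch)   # trailing space
        else:
            out.append(ch)
    return "".join(out)


def entry_dn(branch, name):
    """Build the DN of a registry entry (hypothetical 'cn' naming attribute)."""
    if branch not in ("Applications", "Formats"):
        raise ValueError("branch must be 'Applications' or 'Formats'")
    return "cn=%s,ou=%s,%s" % (escape_rdn_value(name), branch, BASE_DN)


if __name__ == "__main__":
    print(entry_dn("Formats", "Adobe PDF v1.4"))
    print(entry_dn("Applications", "Jhove 1.0"))
```

A DN built this way could then be passed to any LDAP client library as the search base or entry name.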
FOCUS Service Model
[Diagram labels: Web Service Agent, Format Registry]
- Identification Service: identifies the format of a specific digital object (DO) using its internal signature.
- Conversion Service: locates transformation services to convert a DO from its source format to the format of interest.
- Validation Service: determines a verification service to verify the format of a specific DO.
- Rendering Service: identifies the current rendering conditions for a specific digital format.
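Identification by internal signature can be sketched as a lookup of the leading "magic" bytes of an object. The table below uses the well-known signatures of a few formats mentioned in this deck; the function name and the table layout are illustrative and do not describe the actual FOCUS implementation.

```python
# (signature bytes, format label) pairs; labels are illustrative.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"GIF89a", "GIF 89a"),
    (b"GIF87a", "GIF 87a"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_format(data):
    """Return the label of the first signature that prefixes `data`, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


if __name__ == "__main__":
    print(identify_format(b"%PDF-1.4\n%..."))
    print(identify_format(b"plain text"))
```

A real identification service would typically read only the first few bytes of the object and would handle signatures at non-zero offsets; this sketch checks prefixes only.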
International Children’s Digital Library (ICDL)
- Joint project between UMD and the Internet Archive, funded by NSF and IMLS (Allison Druin).
- Goal: efficient search, browsing, and reading of a collection of 10,000 books in 100 languages.
- Current holdings: almost 1,000 books in over 30 languages, with innovative book readers and browsing tools.
- Books are digitized in TIFF format, and each page of each book is processed into 6 sizes of JPEG 2000.
Producer-Archive Workflow Network (PAWN)
- Distributed and secure ingestion of digital objects into the archive.
- Use of web/grid technologies (platform independent).
- Ease of integration with data grids or digital libraries.
- XML representation of metadata and bitstream.
  - Self-describing bitstream submissions.
- Accountability of transfer and guarantee of data integrity.
- Currently being used to ingest SLAC data into the National Archives.
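One common way to back a data-integrity guarantee during transfer is to record a checksum for each bitstream at the producer and recompute it at the archive. The sketch below shows that idea with SHA-256; the choice of hash, the manifest layout, and the function names are assumptions, since the slides do not say how PAWN implements integrity checking.

```python
import hashlib


def sha256_hex(data):
    """Hex digest of a bitstream."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(objects):
    """Producer side: map each object name to the checksum of its bytes."""
    return {name: sha256_hex(data) for name, data in objects.items()}


def verify_transfer(manifest, received):
    """Archive side: return the sorted names that are missing or corrupted."""
    failures = []
    for name, expected in manifest.items():
        data = received.get(name)
        if data is None or sha256_hex(data) != expected:
            failures.append(name)
    return sorted(failures)


if __name__ == "__main__":
    sent = {"page1.tif": b"\x00\x01", "page2.tif": b"\x02\x03"}
    manifest = build_manifest(sent)
    # Simulate a corrupted second page on arrival.
    arrived = {"page1.tif": b"\x00\x01", "page2.tif": b"\x02\xff"}
    print(verify_transfer(manifest, arrived))
```

An empty result means every object listed in the manifest arrived intact.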
More About PAWN
[Diagram labels: Producer 1, Producer 2, ..., Producer n; Scheduler; Bitstream Validation Service; Digital Archive]
ICDL Book Builder
- Purpose: archive the digital book collection of the ICDL (International Children’s Digital Library).
- The Builder allows users to:
  - Select books from the ICDL.
  - Map metadata from the ICDL database.
  - Create Submission Information Packets (SIPs) and transfer them into PAWN.
ICDL Ingestion Steps
Five-step process:
1. User queries the ICDL database with given criteria (e.g., book ID, title, number of pages).
2. Select books to ingest.
3. Choose the mapping of ICDL metadata into Dublin Core.
4. Download the book contents and create a SIP.
5. Send the packet to PAWN.
   - PAWN transfers it to the archive.
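The metadata-mapping step (step 3) can be pictured as a table from source fields to Dublin Core elements. In the sketch below, the ICDL field names (`book_title`, `author`, and so on) are hypothetical placeholders, since the slides do not list the database schema; the Dublin Core element names are real.

```python
# Hypothetical ICDL field name -> Dublin Core element name.
DEFAULT_MAPPING = {
    "book_title": "title",
    "author": "creator",
    "language": "language",
    "publisher": "publisher",
}


def map_to_dublin_core(record, mapping=DEFAULT_MAPPING):
    """Return {dc_element: [values]} for the mapped, non-empty fields of a record."""
    dc = {}
    for field, element in mapping.items():
        value = record.get(field)
        if value not in (None, ""):
            dc.setdefault(element, []).append(str(value))
    return dc


if __name__ == "__main__":
    record = {"book_title": "A Book", "author": "Someone", "num_pages": 32}
    print(map_to_dublin_core(record))
```

Fields with no mapping (such as `num_pages` here) are simply left out; values are kept as lists because Dublin Core elements are repeatable.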
Submission Information Packet (SIP)
- METS handles all areas of a SIP except the Physical Object.
- Descriptive Information can be embedded into METS as a third-party XML schema.
- The submission agreement constrains how a SIP is structured and described.
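Embedding descriptive information as a third-party schema inside METS can be sketched with the standard `dmdSec` / `mdWrap` / `xmlData` wrapper around Dublin Core elements. This is a minimal skeleton under assumed identifiers (`DMD1`, the `OBJID` value); it is not claimed to be schema-valid or to match the SIP profile the project actually used.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("dc", DC_NS)


def _mets(tag):
    """Qualified METS tag name for ElementTree."""
    return "{%s}%s" % (METS_NS, tag)


def build_mets(object_id, dc_fields):
    """Build a skeletal METS document wrapping Dublin Core descriptive metadata.

    dc_fields maps a Dublin Core element name to a list of string values.
    """
    root = ET.Element(_mets("mets"), {"OBJID": object_id})
    dmd = ET.SubElement(root, _mets("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, _mets("mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, _mets("xmlData"))
    for element, values in dc_fields.items():
        for value in values:
            ET.SubElement(xml_data, "{%s}%s" % (DC_NS, element)).text = value
    # METS requires a structural map; link its top div to the metadata section.
    struct_map = ET.SubElement(root, _mets("structMap"))
    ET.SubElement(struct_map, _mets("div"), {"DMDID": "DMD1"})
    return root


if __name__ == "__main__":
    doc = build_mets("book-1", {"title": ["A Book"], "creator": ["Someone"]})
    print(ET.tostring(doc, encoding="unicode"))
```

A fuller SIP would also carry a `fileSec` listing the page images, which is omitted here for brevity.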
Collaboration Examples and Success Stories
- A prototype Producer-Archive Workflow Network (PAWN) is currently being used to ingest the SLAC collection at NARA-II.
- Several parties have expressed interest in collaborating with us to further develop the design and implementation of the Global Digital Format Registry.
- Designed and built a “grid brick” for NARA-I, which is currently in use for demonstrating the distributed pilot persistent archive linking UMD, SDSC, NARA-II, and Georgia Tech.
Broad Impact
- A workshop organized by R. Moore, J. JaJa, and A. Rajasekar to assess the suitability of the SRB for long-term preservation was held on Dec 8-9, 2005. Over 70 people from the archiving, digital library, and grid communities participated.
- Interactions with NDIIPP partners, NARA partners, Don Sawyer’s group at NASA Goddard, etc.
Challenges
- Requirements and assumptions change constantly: most of the nontechnical issues are still open-ended, on top of the core problem of dealing with technology evolution.
- Graduate students would rather work in core disciplines.
- Open-ended research issues: there is no rigorous methodology to distinguish between different approaches, and no clear way to measure progress.