1c496c8e539d088e309343a9d621ac5a.ppt
- Number of slides: 40
CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007 • The Global Digital Format Registry (GDFR) Project • Stephen Abrams, Harvard University • Andreas Stanescu, OCLC
Digital preservation and format • Preservation is concerned with ensuring access to managed digital assets over time • Thus, preservation activities are focused on – Viability – Fixity – Authenticity – Interpretability – Renderability • The last two are primarily a function of format
Without format typing, all content is opaque • ffd8ffe000104a46494600010201 00830000ffed0fb050686f74 6f73686f7020332e30003842494d03e90a5072696e7420496e666f00 000000780000048000002f40240ffee03060252 0347052803fc000200000048 000002d80228000100000064 000000010003030300000001270f0001000000000060080019019000000000000000000003842 494d03ed0a5265736f6c7574696f6e000010008313a3000200...
Without format typing, all content is opaque • ffd8ffe000104a46494600010201 00830000ffed0fb050686f74 6f73686f7020332e30003842494d03e90a5072696e7420496e666f00 000000780000048000002f40240ffee03060252 0347052803fc000200000048 000002d80228000100000064 000000010003030300000001270f0001000000000060080019019000000000000000000003842 494d03ed0a5265736f6c7574696f6e000010008313a3000200... • SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 183 x 512, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
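The marker sequence shown with the hex dump (SOI, APP0, DQT, ...) is what a format-aware reader recovers once the bytes are known to be JPEG. The following is a minimal sketch of that step, not part of the GDFR software: the function name `list_segments`, the marker table (a small subset of the JPEG marker codes), and the hand-built `sample` bytes with dummy payloads are all assumptions made for illustration.

```python
import struct

# Names for a few common JPEG marker codes (illustrative subset only).
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT",
    0xDA: "SOS", 0xD9: "EOI",
}


def list_segments(data: bytes) -> list[str]:
    """Return marker names from the start of a JPEG stream up to SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    names = ["SOI"]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        names.append(MARKER_NAMES.get(code, f"0x{code:02X}"))
        if code == 0xDA:
            # Entropy-coded data follows SOS; this sketch stops here.
            break
        # The segment length is big-endian and counts its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return names


# Hand-built sample: SOI, a JFIF APP0 segment, a dummy DQT, and an SOS marker.
sample = (
    b"\xff\xd8"
    + b"\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00\x83\x00\x83\x00\x00"
    + b"\xff\xdb\x00\x03\x00"
    + b"\xff\xda\x00\x02"
)
print(list_segments(sample))
```

Without the knowledge that the stream is JPEG, none of the length fields or marker codes can be interpreted, which is the point the slide makes.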
Global Digital Format Registry “The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.” – Centrally organized collection and review – Distributed storage, discovery, and delivery on a network of independent but cooperating registries
What is a format? • “A serialized encoding of an abstract information model” • Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level – IEEE 754 floating point number – File system – In both case, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again 6
What’s wrong with MIME types? • Non-standardized documentation • Intended for human, not machine, consumption • Coarse granularity – image/tiff vs. TIFF 4.0 – 6.0; Baseline Class B, G, P, R; Extension Class Y; TIFF/EP; TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP; Exif 2.0 – 2.2; GeoTIFF; TIFF/FX; DNG
GDFR project • Two DLF-sponsored invitational workshops – University of Pennsylvania, January 2003 – Washington, March 2003 • Two independent demonstration projects – FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/ – FOCUS [Joseph JaJa, University of Maryland] www.umiacs.umd.edu/~joseph/focus-archiving06.pdf
GDFR project • Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation • Staffing and technical work subcontracted by HUL to OCLC (July 2006)
GDFR project oversight • Technical Working Group (TWG) – Bibliothèque nationale de France – British Library – California Digital Library – Digital Curation Centre, UK – Library of Congress – National Archives, UK – National Archives and Records Administration – National Library of Australia – National Library of New Zealand – Stanford University – University of Pennsylvania
General development goals • A generalized registry framework, specialized for the distributed GDFR application • Based on well-known products and protocols • Human and machine interfaces • Full information content expressible in XML form and re-instantiable from that expression • Platform independence • Globally fault tolerant • Open source
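The XML goal in the list above amounts to a lossless round trip: serialize a record to XML, then rebuild an equal record from that XML. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the element names `format` and `name` and the record layout are hypothetical and are not the GDFR schema.

```python
import xml.etree.ElementTree as ET


def to_xml(record: dict) -> str:
    """Serialize a record to an XML string (element names are hypothetical)."""
    root = ET.Element("format", id=record["id"])
    for name in record["names"]:
        ET.SubElement(root, "name").text = name
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict:
    """Re-instantiate a record from the XML produced by to_xml."""
    root = ET.fromstring(text)
    return {
        "id": root.get("id"),
        "names": [element.text for element in root.findall("name")],
    }


record = {"id": "example-1", "names": ["TIFF", "Tagged Image File Format"]}
restored = from_xml(to_xml(record))
print(restored == record)
```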
GDFR data model • Consistent with PRONOM registry
Identifiers • Canonical, GDFR-assigned identifier – “info” URI: info:rfa/gdfr1/Formats/1 • Other well-known identifiers – Common name: “TIFF”, “Tagged Image File Format” – MIME type: image/tiff – PRONOM identifier: info:pronom/fmt/7 – Library of Congress Format Description Document (FDD) identifier: fdd000022
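One way to picture the identifier scheme is a lookup table in which every well-known identifier resolves to the record carrying the canonical one. The sketch below is illustrative only: the class `FormatRecord`, the function `build_index`, and the alias keys are assumptions, and the identifier strings are the slide's examples with extraction spacing removed.

```python
from dataclasses import dataclass, field


@dataclass
class FormatRecord:
    """One registry entry: a canonical identifier plus well-known aliases."""
    canonical: str
    aliases: dict[str, list[str]] = field(default_factory=dict)


def build_index(records: list[FormatRecord]) -> dict[str, FormatRecord]:
    """Map every identifier (canonical or alias, lowercased) to its record."""
    index: dict[str, FormatRecord] = {}
    for record in records:
        index[record.canonical.lower()] = record
        for values in record.aliases.values():
            for value in values:
                index[value.lower()] = record
    return index


tiff = FormatRecord(
    canonical="info:rfa/gdfr1/Formats/1",
    aliases={
        "name": ["TIFF", "Tagged Image File Format"],
        "mime": ["image/tiff"],
        "pronom": ["info:pronom/fmt/7"],
        "fdd": ["fdd000022"],
    },
)
index = build_index([tiff])
print(index["image/tiff"].canonical)
```

A real registry would also have to handle one alias mapping to several formats (a MIME type such as image/tiff covers many TIFF variants); this sketch simply lets the last record win.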
Classification scheme • Eight facets – Genre (required): text, still-image, sound, aggregate, … – Role (required): family, file-format, encoding, serialization – Composition: unitary, container-bundle, container-wrapper – Form: binary, text – Constraint: structured, unstructured – Basis: sampled, symbolic – Domain: astronomy, cad-cam, gis, web-archive, … – Transform: compression, encryption, message-digest, …
Classification scheme • Examples – TIFF (Tagged Image File Format): genre: still-image; role: family; composition: container-wrapper; form: binary; basis: sampled – LZW (Lempel-Ziv-Welch): genre: still-image; role: encoding; transform: compression – SVG (Scalable Vector Graphics): genre: still-image; role: file-format; form: text; basis: symbolic
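The faceted examples above translate naturally into key-value records that can be validated and queried. This is a sketch under stated assumptions: the facet names and the three example classifications come from the slides, while the helper names `problems` and `select` and the dictionary layout are invented for illustration.

```python
FACETS = {"genre", "role", "composition", "form",
          "constraint", "basis", "domain", "transform"}
REQUIRED = {"genre", "role"}

# The three example classifications from the slide.
FORMATS = {
    "TIFF": {"genre": "still-image", "role": "family",
             "composition": "container-wrapper", "form": "binary",
             "basis": "sampled"},
    "LZW": {"genre": "still-image", "role": "encoding",
            "transform": "compression"},
    "SVG": {"genre": "still-image", "role": "file-format",
            "form": "text", "basis": "symbolic"},
}


def problems(classification: dict[str, str]) -> list[str]:
    """List unknown facets and missing required facets."""
    present = set(classification)
    issues = [f"unknown facet: {name}" for name in sorted(present - FACETS)]
    issues += [f"missing required facet: {name}"
               for name in sorted(REQUIRED - present)]
    return issues


def select(formats: dict[str, dict[str, str]], **query: str) -> list[str]:
    """Return the names of formats whose facets match every query term."""
    return sorted(
        name for name, facets in formats.items()
        if all(facets.get(key) == value for key, value in query.items())
    )


print(select(FORMATS, genre="still-image", form="binary"))
```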
Signatures • External signatures – File extension – Mac OS type – Mac OS X Uniform Type Identifiers (UTI) • Internal signatures – “Magic numbers” – Required vs. optional – Fixed vs. restricted vs. unrestricted
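Internal signatures of the fixed-position kind can be checked with a simple byte comparison. The sketch below covers only that case (the restricted and unrestricted positions mentioned above are not modeled). The magic numbers listed are the commonly documented ones for these formats; the `Signature` type and `identify` function are illustrative names, not a GDFR API.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    """A fixed-position internal signature ("magic number")."""
    format_name: str
    offset: int
    magic: bytes


SIGNATURES = [
    Signature("JPEG", 0, b"\xff\xd8\xff"),
    Signature("TIFF (little-endian)", 0, b"II*\x00"),
    Signature("TIFF (big-endian)", 0, b"MM\x00*"),
    Signature("PDF", 0, b"%PDF-"),
    Signature("PNG", 0, b"\x89PNG\r\n\x1a\n"),
]


def identify(data: bytes) -> list[str]:
    """Return the names of all formats whose signature matches the data."""
    return [
        sig.format_name
        for sig in SIGNATURES
        if data[sig.offset:sig.offset + len(sig.magic)] == sig.magic
    ]


print(identify(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```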
Grammar • Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation – BNF: Backus-Naur Form – BSDL: MPEG-21 Bitstream Syntax Description Language – DFDL: Data Format Description Language – EAST: CCSDS 644.0-B-2 – XCEL: Extensible Characterisation Extraction Language
Assessment • Assessment of a format, expressed in some formal typed notation – Cornell Virtual Remote Control (VRC) – DSTC PANIC – Library of Congress Sustainability, Quality, Function (SQF) – National Library of Australia AONS – OCLC INFORM
Documentation • Specification documents (and software files) can be managed and distributed in the network – Applicable only in cases of public domain resources or if explicit permission is granted by rights holders – Other documents (and software) will be referenced by full citation, including actionable links where possible – Mechanism for individuals or institutions to register locally-held copies, with terms of use
Software • Format role: input, output • Process type: characterize, create, edit, identify, … • Enables discovery of transformative processing chains
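If each registered tool records its input and output formats, a processing chain between two formats can be found with a graph search. The following is a minimal sketch using breadth-first search; the tool names and the `TOOLS` table are entirely hypothetical, and the slides do not say how GDFR itself performs this discovery.

```python
from collections import deque
from typing import Optional

# Hypothetical tool registrations: (tool name, input format, output format).
TOOLS = [
    ("doc2odt", "Word 97", "ODT"),
    ("odt2pdf", "ODT", "PDF 1.4"),
    ("pdf2pdfa", "PDF 1.4", "PDF/A"),
    ("tiff2jpeg", "TIFF 6.0", "JPEG"),
]


def find_chain(source: str, target: str) -> Optional[list[str]]:
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for tool, tool_input, tool_output in TOOLS:
            if tool_input == fmt and tool_output not in seen:
                seen.add(tool_output)
                queue.append((tool_output, chain + [tool]))
    return None  # no chain of registered tools connects the two formats


print(find_chain("Word 97", "PDF/A"))
```

Breadth-first search returns a chain with the fewest steps; a registry that tracked quality or loss per tool could instead weight the edges and use a shortest-path algorithm.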
Relationships • Modification: BWF → WAVE – Extension: DNG → TIFF 6.0 – Restriction: PDF/A → PDF 1.4 • Definition: NITF → XML DTD • Requisite: XML → Relax NG • Containment: ZIP → * • Equivalence: DXF (ASCII) → DXF (binary) • Version: Word 97 → Word 6.0 • Affinity: SPIFF → JPEG
GDFR node • Based on the OCLC IWSA / RFA framework
GDFR node • Java, Apache/Tomcat, Berkeley DB XML • GNU LGPL license – Including pre-existing OCLC technology and technology newly developed for the project • Release schedule – v0.1 (alpha): March 23, 2007 – v0.1 (beta): June 14, 2007 – v1.0: June 30, 2007 – v1.1: August 12, 2007 – v1.3: September 17, 2007 – v1.3.1: October 26, 2007
GDFR network • Peer-to-peer network of independent but cooperating registries communicating over a common protocol
GDFR network • Public notification of the availability of new data – RSS feed available at well-known public address to which remote nodes can subscribe • Remote harvesting of local data – OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) • Initially, a single source (root node) for all new data
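Harvesting over OAI-PMH comes down to issuing `ListRecords` requests and reading record headers from the XML response. The sketch below builds such a request and extracts identifiers from a response. The verb, the `metadataPrefix`, `from`, and `resumptionToken` arguments, and the XML namespace are defined by OAI-PMH 2.0; the base URL, the helper names, and the sample response are assumptions, and the metadata format a GDFR node would actually expose is not stated in the slides.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument in OAI-PMH.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml: str) -> list[str]:
    """Extract record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text
            for element in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# Hypothetical node address and a hand-written sample response.
url = list_records_url("http://registry.example.org/oai", from_date="2007-10-26")
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>info:rfa/gdfr1/Formats/1</identifier>
        <datestamp>2007-10-26</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""
print(url)
print(record_identifiers(sample_response))
```

In the arrangement described above, a remote node would learn from the RSS feed that new data exists and then harvest incrementally with a `from` date.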
Project status • Extensive internal testing of GDFR software in a stand-alone mode • Current project activities are focused on – Implementing the distribution and synchronization functions – Building the network – Data acquisition – Succession planning
Initial population • Manual addition is possible, but time-consuming • Automated update using Atom • What sources are available for bulk population? – PRONOM registry www.nationalarchives.gov.uk/pronom – Library of Congress Format Description Documents (FDD) www.digitalpreservation.gov/formats/fdd/descriptions.shtml – Unix / Linux magic(4) database
Subsequent population • RFC 2026, Internet Standards Process www.ietf.org/rfc2026.txt – “Iterations of review by the ... community and revision based upon experience” • Draft distribution and public discussion • Approval by “area” editors • Release to the network for distribution
Sustainability • The technological solution is the (relatively) easy part, but… – The technology is expendable – The important point is for the data to survive, evolve, and expand
Governance and succession • Mellon funding was for technical work only • At the end of the two-year project… – Harvard will continue maintenance for up to two years – Library of Congress has agreed to be a caretaker agency until a permanent body is identified
Governance and succession • NARA GDFR governance investigation – Part of the Electronic Records Archives (ERA) initiative – GDFR Governance Workshop, November 2007 • Bibliothèque et Archives, Canada • Corp. for National Research Initiatives • Digital Curation Centre, UK • Digital Library Federation • General Services Administration • Georgia Institute of Technology • Government Printing Office • Harvard University • IBM Watson Research Center • Koninklijke Bibliotheek, Netherlands • Library of Congress • MIT • NARA • NASA • NIST • National Library of Australia • National Library of New Zealand • San Diego Supercomputer Center • Stanford University • Statens Archiv, Sweden • Tessella Support Services • University of Pennsylvania
Administrative considerations • Policy – Who (and how many) can join the network? – What are the eligibility requirements? – What are the rights and obligations of membership? • Technical – Who will maintain and enhance the data model? – Who will maintain, enhance, distribute, and support the software?
Administrative considerations • Data – Who will contribute data? – Who will vouch for data authenticity? – Who will ensure data integrity? • Financial – What are the real human and system costs associated with GDFR operation? – Who pays, and how?
Summary • The GDFR is an enabling technology that will support digital repository and preservation activities – Supports the strong typing of digital assets at an appropriate level of granularity – Enables the future recovery of the syntax and semantics associated with typed digital assets – A means to pool and redistribute the expertise of the international digital preservation community
For more information… www.formatregistry.org • stephen_abrams@harvard.edu • andreas_stanescu@oclc.org