- Number of slides: 46
Repositories, Identifiers and the Kahn/Wilensky Framework Thanks to Michael Nelson, Bill Arms
Repositories vs archives • repositories – “any computer system whose primary function is to store digital material for use in a library” – a collection of “stuff” • archives – repositories that make longevity promises – covered in future lectures W. Arms
“Key Concepts in the Architecture of the Digital Library” • Bill Arms’s seminal article in the inaugural issue of D-Lib Magazine: – http://www.dlib.org/dlib/July95/07arms.html
The technical framework exists within a legal and social framework • DLs no longer represent systems specific to academics or information specialists – content influences how the DL is used • architecture must allow the implementation of various policies
Understanding of digital library concepts is hampered by terminology • “common English” != “professional English” – multiple professional jargons too • What do these words mean to you? – copy – publish – content – document – work
The underlying architecture should be separate from the content stored in the library • General purpose functions and content-specific functions should be separated • TL analogy: – the more specific the bookshelf is to holding actual books, the harder it is to repurpose the bookshelf in the future
Names and identifiers are the basic building block for the digital library • names != addresses • in any DL architecture diagram, (almost) anything that can be drawn can be named • consider the impact that handles/DOIs have had on the publishing/DL community
DOIs • The Digital Object Identifier (DOI) is an identification system for intellectual property in the digital environment. Developed by the International DOI Foundation on behalf of the publishing industry, its goals are to provide a framework for managing intellectual content, link customers with publishers, facilitate electronic commerce, and enable automated copyright management.
Digital library objects are more than collections of bits • objects = metadata + data – “but what is metadata?” figure 2 in http://www.dlib.org/dlib/July95/07arms.html
The digital library object that is used is different from the stored object • what you store is not necessarily what you get – storage and dissemination are separate events, and can represent separate formats • also, potentially separate from the application-specific format
Users want intellectual works, not digital objects • The DL architect’s needs should not inconvenience the users’ needs • recombination of objects – what is an object in your world view? figure 4 in http://www.dlib.org/dlib/July95/07arms.html
Repositories must look after the information they hold • “Repository Access Protocol” – Kahn/Wilensky Framework • http://www.cnri.reston.va.us/home/cstr/arch/k-w.html figure 3 in http://www.dlib.org/dlib/July95/07arms.html
A Framework for Distributed Digital Object Services • More commonly known as the Kahn/Wilensky Framework (KWF) • A high level document, not even detailed enough to be an architecture, that defines some of the key concepts and terms that form the basis for the next generation of DLs – DLs beyond “make the ftp server look nice”
Key KWF Terms • digital objects (DOs) – a unit of exchange for the DL with a particular data structure and characteristics • repository – the place where DOs live • handles – a unique, persistent name for a DO
KWF • An Originator makes a Digital Object, which consists of Data and a Handle • the Handle comes from a handle generator • the Digital Object can go in a Repository, which is accessed by the Repository Access Protocol (RAP) • the Repository registers the DO’s handle with a Handle Server, at which point the DO becomes a registered DO
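The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, method names, and the serial-number handle format are hypothetical, since the KWF describes concepts rather than an API.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class DigitalObject:
    handle: str   # key-metadata: the DO's unique, persistent name
    data: bytes   # the content, here a plain bit-sequence


class HandleGenerator:
    """Hypothetical generator issuing unique handles under one naming authority."""

    def __init__(self, authority: str):
        self.authority = authority
        self._serial = count(1)

    def new_handle(self) -> str:
        return f"{self.authority}/{next(self._serial)}"


class HandleServer:
    """Hypothetical registry mapping a handle to the repositories holding the DO."""

    def __init__(self):
        self.registry = {}

    def register(self, handle: str, repository_name: str) -> None:
        self.registry.setdefault(handle, []).append(repository_name)


class Repository:
    def __init__(self, name: str, handle_server: HandleServer):
        self.name = name
        self.handle_server = handle_server
        self.stored = {}

    def deposit(self, do: DigitalObject) -> None:
        # Storing the DO makes it a "stored DO" ...
        self.stored[do.handle] = do
        # ... and registering its handle makes it a "registered DO".
        self.handle_server.register(do.handle, self.name)


# The originator obtains a handle, builds the DO, and deposits it.
generator = HandleGenerator("NASA.LaRC")
server = HandleServer()
repo = Repository("ltrs", server)

do = DigitalObject(handle=generator.new_handle(), data=b"report bytes")
repo.deposit(do)
print(server.registry)
```

Here storing and registering happen in one `deposit` call; a real repository could equally run them as separate steps.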
Digital Objects • Digital object = data + key-metadata – data is typed; core types include: • bit-sequence / set-of-bit-sequences • digital-object / set-of-digital-objects • handle / set-of-handles – other types can be defined, and registered with a global type registry • definition and registration left undefined • similar to MIME? – key-metadata includes handle, possibly other metadata (left undefined in KWF)
MIME • Short for Multipurpose Internet Mail Extensions, a specification for formatting non-ASCII messages so that they can be sent over the Internet. Many e-mail clients now support MIME, which enables them to send and receive graphics, audio, and video files via the Internet mail system. In addition, MIME supports messages in character sets other than ASCII. • There are many predefined MIME types, such as GIF graphics files and PostScript files. It is also possible to define your own MIME types. • In addition to e-mail applications, Web browsers also support various MIME types. This enables the browser to display or output files that are not in HTML format. • MIME was defined in 1992 by the Internet Engineering Task Force (IETF). A new version, called S/MIME, supports encrypted messages. (Source: Webopedia)
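To make the idea of predefined types concrete, Python's standard `mimetypes` module can look up the MIME type conventionally associated with a file extension. The file names below are made up for illustration.

```python
import mimetypes

# guess_type() returns a (type, encoding) pair derived from the file
# extension; it does not inspect the file contents.
for name in ("figure.gif", "report.ps", "page.html"):
    mime_type, _encoding = mimetypes.guess_type(name)
    print(f"{name}: {mime_type}")
```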
Digital Objects • Typed data; example from KWF: – a DO subtype: computer-science-tech-report – with metadata: author, institution, series, etc. • Composite DOs: – a DO with data of type digital-object – non-composite DOs are elemental DOs – composite DOs can be used to collect similar works together • composite DO that contains a DO for each work of Shakespeare...
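A minimal sketch of typed data and composite DOs follows. The field names, the `example.plays/...` handles, and the helper `elemental_handles` are hypothetical; only the type names come from the KWF core types listed earlier.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DigitalObject:
    handle: str       # key-metadata
    data_type: str    # e.g. "bit-sequence" or "set-of-digital-objects"
    data: Union[bytes, "DigitalObject", List["DigitalObject"]]

    @property
    def is_composite(self) -> bool:
        # A composite DO carries data of type digital-object
        # (or a set of digital objects); anything else is elemental.
        return self.data_type in ("digital-object", "set-of-digital-objects")


def elemental_handles(do: DigitalObject) -> List[str]:
    """Walk a possibly nested composite DO and list its elemental DOs' handles."""
    if not do.is_composite:
        return [do.handle]
    members = do.data if isinstance(do.data, list) else [do.data]
    handles: List[str] = []
    for member in members:
        handles.extend(elemental_handles(member))
    return handles


hamlet = DigitalObject("example.plays/hamlet", "bit-sequence", b"...")
macbeth = DigitalObject("example.plays/macbeth", "bit-sequence", b"...")
works = DigitalObject(
    "example.plays/collected", "set-of-digital-objects", [hamlet, macbeth]
)
print(elemental_handles(works))
```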
Changing Digital Objects • Mutable DOs can be changed once placed in a repository – key-metadata cannot be changed -- the DO’s handle does not change! • Immutable DOs cannot be changed once placed in a repository – however, they can be deleted
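One way these rules might be enforced is sketched below. The class and exception names are assumptions made for illustration; the KWF itself does not prescribe an implementation.

```python
class ImmutableObjectError(Exception):
    """Raised when a caller tries to change an immutable DO (hypothetical name)."""


class StoredObject:
    def __init__(self, handle: str, data: bytes, mutable: bool):
        self._handle = handle   # key-metadata, fixed for the DO's lifetime
        self._data = data
        self.mutable = mutable

    @property
    def handle(self) -> str:
        # Read-only property: there is no setter, so the handle cannot change.
        return self._handle

    @property
    def data(self) -> bytes:
        return self._data

    def replace_data(self, new_data: bytes) -> None:
        if not self.mutable:
            raise ImmutableObjectError(f"{self._handle} is immutable")
        self._data = new_data


class SimpleRepository:
    def __init__(self):
        self._objects = {}

    def deposit(self, do: StoredObject) -> None:
        self._objects[do.handle] = do

    def delete(self, handle: str) -> None:
        # Deletion is allowed even for immutable DOs.
        del self._objects[handle]

    def __contains__(self, handle: str) -> bool:
        return handle in self._objects


repo = SimpleRepository()
draft = StoredObject("example/draft", b"v1", mutable=True)
final = StoredObject("example/final", b"v1", mutable=False)
repo.deposit(draft)
repo.deposit(final)

draft.replace_data(b"v2")      # allowed: the DO is mutable
repo.delete("example/final")   # allowed: immutable DOs can still be deleted
```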
Uniform Resource Identifiers • URI (RFC 2396) – URL (RFC 1738) – URN (RFC 2141)
URI vs URL vs URN • A URI can be classified as a locator or a name or both. • A Uniform Resource Locator (URL) is a URI that, in addition to identifying a resource, provides means of acting upon or obtaining a representation of the resource by describing its primary access mechanism or network "location". For example, the URL http://www.wikipedia.org/ is a URI that identifies a resource (Wikipedia's home page) and implies that a representation of that resource (such as the home page's current HTML code, as encoded characters) is obtainable via HTTP from a network host named www.wikipedia.org. • A Uniform Resource Name (URN) is a URI that identifies a resource by name in a particular namespace. A URN can be used to talk about a resource without implying its location or how to dereference it. For example, the URN urn:ISBN:0-395-36341-1 is a URI that, like an International Standard Book Number (ISBN), allows one to talk about a book, but doesn't suggest where and how to obtain an actual copy of it. • The URI syntax is essentially a URI scheme name like "http", "ftp", "mailto", "urn", etc., followed by a colon character, and then a scheme-specific part. (From Wikipedia)
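The "scheme, colon, scheme-specific part" syntax is easy to demonstrate. The sketch below splits a URI at its first colon and applies a rough classification; the set of locator schemes is an illustrative assumption, not a complete list.

```python
from typing import Tuple

# Illustrative, deliberately incomplete set of schemes treated as locators.
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (scheme, scheme-specific part) at the first colon."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError(f"not a URI: {uri!r}")
    return scheme.lower(), rest


def classify(uri: str) -> str:
    """Label a URI as a URN, a URL, or just a URI, based on its scheme."""
    scheme, _ = split_uri(uri)
    if scheme == "urn":
        return "URN"
    if scheme in LOCATOR_SCHEMES:
        return "URL"
    return "URI"


print(split_uri("urn:ISBN:0-395-36341-1"))    # ('urn', 'ISBN:0-395-36341-1')
print(classify("http://www.wikipedia.org/"))  # URL
```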
URIs & URNs • registered URI schemes – http://www.iana.org/assignments/uri-schemes • registered URN namespaces – http://www.iana.org/assignments/urn-namespaces
From RFC 2396 “A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable.”
URLs • URLs are tightly coupled with the physical location of an object, and are thus more likely to be transient – “Error 404 - File not found” • Tricks to make URLs more durable: – plan ahead when constructing web site structure – use good DNS CNAMEs – symbolic links on filesystems – http server redirects
URNs • But with all the tricks available, URLs are not suitable for archival use in DLs • how long will this URL (a report in LTRS): http://techreports.larc.nasa.gov/ltrs/PDF/1997/tm/NASA-97-tm112871.pdf – be good? – how to handle mirroring, replication, etc.? • “appropriate copy” problem… • mnemonic: – URL = IP address (128.82.5.173) – URN = IP name (blearg.cs.odu.edu)
Handles • Handles can be thought of as a Uniform Resource Name (URN) implementation – http://www.dlib.org/dlib/february96/02arms.html for historical comparison of efforts • http://www.handle.net/ contains info about the handle system – persistence – location independence – multiple instances • Handles are of the general form: GlobalAuthority.LocalAuthority/LocallyUniqueString or, for example: NASA.LaRC/tm112871
NASA.LaRC/tm112871 • “NASA” would be assigned from the global naming authority • “LaRC” would be created by whoever registered “NASA”, and the entire string “NASA.LaRC” would be registered • “tm112871” is a locally unique string generated by “LaRC” – ODU.CS/tm112871 is possible...
Handle Syntax • In URL-type syntax: – “hdl” is a scheme; handle is resolved into a URL by locally defined handle server • see http://ftp.ics.uci.edu/pub/ietf/uri/ for a good list of schemes and naming projects • Using a proxy server: – hdl.handle.net performs resolution • from: http://www.handle.net/draft-ietf-handle-system-01.html
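The handle structure described here can be taken apart mechanically. The following sketch assumes the simple form shown on the slides (dot-separated naming authority, a slash, then the local string); the type and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Handle(NamedTuple):
    naming_authority: Tuple[str, ...]   # e.g. ("NASA", "LaRC")
    local_name: str                     # e.g. "tm112871"


def parse_handle(text: str) -> Handle:
    """Split a handle at the first "/" and break the authority on dots."""
    authority, slash, local_name = text.partition("/")
    if not slash or not authority or not local_name:
        raise ValueError(f"not a handle: {text!r}")
    return Handle(tuple(authority.split(".")), local_name)


print(parse_handle("NASA.LaRC/tm112871"))
print(parse_handle("ODU.CS/tm112871"))
```

The two example handles share the local string `tm112871` yet remain distinct names, because uniqueness is only required within one naming authority.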
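For the proxy route, a client only needs to append the handle to the proxy's address. A minimal sketch, assuming the hdl.handle.net proxy named above and using the slides' illustrative handle (which is not claimed to resolve):

```python
from urllib.parse import quote

PROXY = "http://hdl.handle.net/"


def proxy_url(handle: str) -> str:
    # Percent-encode unusual characters but keep the "/" that separates
    # the naming authority from the locally unique string.
    return PROXY + quote(handle, safe="/")


print(proxy_url("NASA.LaRC/tm112871"))
```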
Handles • Observation: isn’t the handle system just the Domain Name System (DNS) all over again? • The need for URNs for just general WWW use is obvious; the need for them in DLs even more so...
Semantics in Names • Two schools of thought: – semantic clues in names, such as: – NASA.LaRC/tm112871 – www.larc.nasa.gov – are: • good: easy to parse, remember, map to real-world concepts, etc. • bad: names are not for human consumption, are hurtful or restrictive in the long run, etc.
“I Love Mom” (without Semantics) image from Eddie Kohler http://www.cs.ucla.edu/~kohler/
PURLs • Persistent URLs (PURLs) – http://purl.net/, OCLC – Maps stable URLs (registered in purl.net space) to transient URLs • What happens here: ist.psu.edu/giles • examples: – http://www.purl.org/DC – http://www.purl.org/NET/oai_explorer
DOIs • Digital Object Identifier System (DOIs) – http://www.doi.org/ – no semantics in the names (well, that’s not always true…) • driven by the publishing industry – examples: • doi:10.1045/september2002-rasmussen • 10.1145/544220.544284 – resolver: • http://dx.doi.org/
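The mapping from stable to transient URLs amounts to a redirect table. The sketch below is a toy model, not OCLC's implementation: the class name, the paths, the `example.org` targets, and the choice of a 302 status are all assumptions.

```python
class PurlResolver:
    """Toy redirect table: stable PURL path -> current (transient) URL."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path: str, target_url: str) -> None:
        # Registering an existing path again simply updates its target.
        self._targets[purl_path] = target_url

    def resolve(self, purl_path: str):
        """Return an (HTTP status, Location) pair, as a redirecting server might."""
        target = self._targets.get(purl_path)
        if target is None:
            return 404, None
        return 302, target


resolver = PurlResolver()
resolver.register("/NET/example", "http://old.example.org/page.html")
print(resolver.resolve("/NET/example"))

# The resource moves; only the table entry changes, not the published PURL.
resolver.register("/NET/example", "http://new.example.org/page.html")
print(resolver.resolve("/NET/example"))
```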
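Both example forms, with and without the `doi:` label, can be normalized before being handed to the resolver. A small sketch follows; the function names are hypothetical and the validation is intentionally loose.

```python
RESOLVER = "http://dx.doi.org/"


def normalize_doi(text: str) -> str:
    """Strip an optional "doi:" label and check for a prefix/suffix shape."""
    doi = text.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, slash, suffix = doi.partition("/")
    # DOI prefixes begin with "10."; the suffix is chosen by the registrant.
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {text!r}")
    return doi


def resolver_url(text: str) -> str:
    return RESOLVER + normalize_doi(text)


print(resolver_url("doi:10.1045/september2002-rasmussen"))
print(resolver_url("10.1145/544220.544284"))
```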
Repositories • “A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval” (KWF) • A stored DO is a DO that resides in a repository • A registered DO is a DO that the repository has registered with a handle server – storing and registering can be the same or different processes
Repositories • A repository keeps a properties record for each DO – contains key-metadata and any other metadata the repository chooses to keep • A repository of record (ROR) is the first repository that a DO is placed in – ROR authorizes additional instances of the DO • A dissemination is the result of an access service request
Repository Access Protocol (RAP) • “Protocol” may be misleading; it’s really just the skeleton for a protocol • RAP is designed to be simple – repositories themselves should be simple • KWF defines 3 basic operation classes: – ACCESS_DO – DEPOSIT_DO – ACCESS_REF • this is the catch-all operation for all meta-services...
RAP • RAP is fleshed out more in Cornell CS TR95-1540 • Where KWF suggested that the operations would take “metadata”, “key-metadata”, and “digital object” as arguments, TR95-1540 splits some of those into separate operations • RAP could be implemented as a subset of a more sophisticated protocol (Dienst, Z39.50, etc.) – prelude to the Open Archives Initiative (OAI) metadata harvesting protocol
Terms and Conditions • First lengthy discussion with respect to KWF in Cornell CS TR95-1593 • Terms and Conditions (TC) can be arbitrarily complex, but generally consist of: – permissions: read, write, etc. – authentication: person, group, etc. – payment – 3rd party intervention (possibly in support of the above)
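Since RAP is only a skeleton, any concrete rendering involves choices. The sketch below shows one possible shape: the three operation classes become methods, with signatures, the in-memory store, and the `list-handles` meta-service all being assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class RepositoryAccess(ABC):
    """Hypothetical Python rendering of the three KWF operation classes."""

    @abstractmethod
    def access_do(self, handle: str) -> bytes:
        """ACCESS_DO: request a dissemination of the DO named by the handle."""

    @abstractmethod
    def deposit_do(self, handle: str, data: bytes) -> None:
        """DEPOSIT_DO: place a DO in the repository."""

    @abstractmethod
    def access_ref(self, service: str) -> List[str]:
        """ACCESS_REF: catch-all entry point for meta-services."""


class InMemoryRepository(RepositoryAccess):
    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def access_do(self, handle: str) -> bytes:
        return self._store[handle]

    def deposit_do(self, handle: str, data: bytes) -> None:
        self._store[handle] = data

    def access_ref(self, service: str) -> List[str]:
        # Only one illustrative meta-service is provided here.
        if service == "list-handles":
            return sorted(self._store)
        raise ValueError(f"unknown meta-service: {service}")


repo = InMemoryRepository()
repo.deposit_do("NASA.LaRC/tm112871", b"report bytes")
print(repo.access_do("NASA.LaRC/tm112871"))
print(repo.access_ref("list-handles"))
```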
TC • TC are attached to: – each DO – dissemination – repository • TC are a precondition for any operation on the above • Repositories responsible for enforcing TC
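Treating TC as a precondition can be sketched as a check that runs before any operation and consults every attached TC. The permission model below (principal names mapped to allowed operations) is a deliberately simplified assumption; it covers neither payment nor third-party intervention.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TermsAndConditions:
    # Maps a principal (person or group name) to the operations it may perform.
    permissions: Dict[str, Set[str]] = field(default_factory=dict)

    def permits(self, principal: str, operation: str) -> bool:
        return operation in self.permissions.get(principal, set())


class AccessDenied(Exception):
    """Raised when an attached TC does not permit the operation."""


def enforce(principal: str, operation: str, *attached: TermsAndConditions) -> None:
    """Raise AccessDenied unless every attached TC permits the operation."""
    for tc in attached:
        if not tc.permits(principal, operation):
            raise AccessDenied(f"{principal} may not {operation}")


# TC attached at the three levels named on the slide.
repository_tc = TermsAndConditions({"staff": {"read", "write"}, "public": {"read"}})
object_tc = TermsAndConditions({"staff": {"read", "write"}, "public": {"read"}})
dissemination_tc = TermsAndConditions({"staff": {"read"}})

# The repository runs the check before serving a dissemination.
enforce("staff", "read", repository_tc, object_tc, dissemination_tc)
```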
Booch Diagram for TC • [Figure 1 from TR95-1593: Booch class diagram relating repository, digital object, dissemination, data, and terms and conditions, with 1 and N multiplicities]
Why Are TC Difficult? • Wide open model -- “everyone can access and do everything” is much simpler • How do you: – inform user of TC? – negotiate TC? – enforce TC? • esp. with respect to 3rd party enforcers – specify TC?
Access Rules and TC Figure 1 from TR-1540
Access Rules and TC • TR-1593 makes access_rules an instance of the class terms_and_conditions • Defines KWF concepts in a Common Object Request Broker Architecture (CORBA) context – CORBA is a standard/architecture/mechanism for object communication across heterogeneous everything... • http://www.corba.org/
KWF Now • The KWF was never “implemented” in a real DL (the 1995 Cornell TRs notwithstanding), yet it has influenced all repository & object model projects that followed – e.g. Warwick Framework, Fedora, Buckets/SODA, Dienst, OAI-PMH • T&C, or “Rights Expressions”, have mostly been moved out of the DL/repository protocols – cf. http://www.loc.gov/standards/relreport.pdf
Objects vs. Archives • “Repositories must look after the information they hold” – Can they really? – Most DL objects still bound to the applications that generate or render the objects