14ed7260d634227fa00254c9ae24abe5.ppt
- Number of slides: 16
Document Delivery Formats for the Web and Legal Digital Collections
Kevin Reiss
June 18th, 2004
Law Library, Rutgers-Newark School of Law
Delivery Formats & Issues
- Delivery Format: the type of file a user receives when accessing a document in a digital collection
- Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing
- There is no one format that is right for every type of collection.
- Important issues to consider:
  - Open v. Closed Formats
  - Usability and Accessibility
  - Subject Specific Concerns for Legal Materials
Open v. Closed Formats
- Who is "in control" of the document format you choose? A standards body? A single company or organization?
- Can you count on something that one entity controls to be supported over time?
- Advantages of Open Formats (a.k.a. Standards)
  - Interoperability and support over time
  - Integrate well with open-source or low-cost processing and IR tools
  - Help web content providers who need to support an increasing variety of devices and platforms
Usability & Accessibility
- What software do users need to view a particular format?
- Can a web browser natively display it?
- If the format requires a browser plug-in:
  - Is it free? Are users likely to have it installed?
  - Does it work on all computing platforms?
- Do public search engines index the format?
- Can dial-up modem users access the material in the collection?
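The dial-up question above can be made concrete with a rough transfer-time estimate. The following is a minimal sketch, assuming a nominal 56 kbit/s modem link; the file sizes are hypothetical and chosen only for illustration, and real-world throughput is typically lower than the nominal rate.

```python
def download_seconds(size_bytes, link_bits_per_second=56_000):
    """Idealized transfer time in seconds.

    Ignores protocol overhead, latency, and modem compression,
    so real downloads will usually take longer.
    """
    return size_bytes * 8 / link_bits_per_second


# Hypothetical file sizes, not taken from any real collection.
examples = {
    "50 KB encoded text page": 50 * 1024,
    "1 MB scanned page image": 1024 * 1024,
}

for label, size in examples.items():
    print(f"{label}: about {download_seconds(size):.0f} s at 56 kbit/s")
```

An estimate like this is only a lower bound, but it is enough to compare candidate formats for the same document before committing to one.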
Subject Specific Concerns for Legal Materials
- Legal digital projects usually manage texts, not images.
- Some types of legal materials are harder to maintain, i.e. codified material.
- Legal documents are almost exclusively printed in black & white.
- Preservation of the page structure is important for citation purposes.
- Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.
Possible Delivery Formats
- Pure image formats: TIFF, JPEG
- Open encoded formats: XML, HTML, ASCII, and Unicode
- Hybrid formats: PDF, DjVu (can contain both image and text)
- Proprietary formats: Microsoft Word, WordPerfect
Pure Images: TIFF, JPEG
- Raster (pixel-based); exclusively used for scanned collections
- TIFF is the best choice for archival scanned images
- Pros
  - Web browsers display them natively
  - Both are open formats
- Cons
  - Large file sizes make viewing on slow connections problematic
  - Text of the documents available only through OCR (Optical Character Recognition)
  - Weak support for multi-page documents
  - JPEGs have trouble displaying text when they are compressed to levels appropriate for the web
  - Contain metadata about the physical file itself, not the contents of the file
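The file-size concern for raster images follows from simple arithmetic on page dimensions, scan resolution, and bit depth. The sketch below assumes a US Letter page (8.5 x 11 inches) scanned at 300 dpi with no compression; these parameters are illustrative assumptions, not figures from the slides, and compressed TIFF or JPEG files will be smaller.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed pixel-data size of a scanned page, in bytes.

    Ignores file headers and per-row padding, so it is an estimate
    of the pixel payload only.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size and resolution: US Letter at 300 dpi.
for label, bpp in [("bitonal", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]:
    size = raw_image_bytes(8.5, 11, 300, bpp)
    print(f"{label}: {size / 1_000_000:.1f} MB uncompressed")
```

Because legal documents are almost always black and white, the bitonal case is the relevant baseline; grayscale or color scanning multiplies the payload by the bit depth.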
Imaged Formats Cont.
- OCR is an important consideration:
  - A 5% error rate doesn't have an impact on traditional IR measures
  - A 20% error rate significantly degrades the performance of traditional IR techniques [Doerman 98]
  - High quality OCR is now available for relatively low cost
    - Abbyy Finereader ($300)
    - Table and page layout recognition supported
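To judge where a collection falls relative to the error rates above, the OCR output for a sample page can be compared against a hand-corrected transcription. One common measure is the character error rate, computed as edit distance divided by the length of the reference text. This is a minimal sketch; the slides do not say which error measure the cited survey uses, so treating its percentages as character error rates is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: one substituted character in a seven-character word.
print(f"{character_error_rate('statute', 'statnte'):.1%}")
```

Measuring a handful of representative pages this way gives a rough indication of whether manual OCR correction is worth its cost for a given collection.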
Open Encoded Formats
- XML, HTML, ASCII, Unicode
- Typically easier to integrate into digital libraries [Baird 2004]
  - Created in 3 ways:
    - Born digital documents
    - Manually keyed documents
    - Corrected OCR
  - IR applications easy to build, open source support strong
  - International standards or W3C recommendations
  - Accessible with all current web technologies
  - Metadata easily embedded in XML and HTML documents
  - Can be created with any text editor
  - Improvements in OCR make encoding scanned collections feasible
Open Encoded Formats Cont.
- Cons:
  - These documents can be expensive for staff to create
    - Encoding in XML may have to be done by hand
    - Manual correction of OCR errors
  - Need technical expertise on staff to get the full benefits of these formats (the Perl programmer)
  - These don't necessarily preserve the "look" of printed documents
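The point that IR applications are easy to build on open encoded text can be illustrated with a toy full-text index. This is a minimal sketch of an inverted index with AND-style search, not a description of any particular IR tool; the document identifiers and texts are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical plain-text documents.
docs = {
    "doc1": "The court held that the statute applies.",
    "doc2": "The statute was amended by the legislature.",
}
index = build_index(docs)
print(sorted(search(index, "statute court")))
```

The same approach does not work on a pure image format until OCR has produced text, which is why the choice of delivery format affects retrieval as well as viewing.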
Hybrid Formats: PDF, DjVu
- PDF and DjVu are proprietary technologies that have substantial support in the open source community.
- Both can contain a layer of the document's text and an image of each page in a document.
- Both utilize cross-platform, freely available web browser plug-ins.
- Both try to preserve the look of print documents.
- Easy to export born digital documents to these formats using printer drivers ("print to PDF")
Adobe PDF
- Pros:
  - PDF has strong market acceptance in the legal community
  - PDF-Archive, a standard for using PDF as an archival format, is in development by AIIM [Association for Information and Image Management]
  - Adobe makes the PDF reference manual and software development kit freely available to developers.
  - Standard methodology for embedding metadata in documents: the XMP standard (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
- Cons:
  - Plug-in performance is poor for long documents
  - PDFs composed of scanned images can be very large, even for short documents
DjVu
- Designed to be a scan-to-web technology.
- Pros:
  - Best compression of any image format on the web
  - Users can load lengthy documents very quickly
  - The DjVu plug-in can be manipulated via CGI-style arguments
  - Use the Any2DjVu server to try out the format.
- Cons:
  - DjVu does not yet have great market acceptance in the legal community.
  - DjVu does not have a standard method for embedding metadata within documents.
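The CGI-style plug-in arguments mentioned above let a link open a document at a given page or zoom level. The sketch below builds such a link, assuming the `djvuopts` keyword convention used by DjVu browser plug-ins and the option names `page` and `zoom`; these names are recalled from plug-in documentation and should be checked against the plug-in actually deployed. The base URL is hypothetical.

```python
from urllib.parse import quote


def djvu_url(base_url, **options):
    """Append plug-in options to a DjVu document URL.

    Assumes the plug-in reads everything after the 'djvuopts'
    keyword in the query string as its own options.
    """
    if not options:
        return base_url
    pairs = "&".join(f"{quote(str(key))}={quote(str(value))}"
                     for key, value in options.items())
    return f"{base_url}?djvuopts&{pairs}"


# Hypothetical document URL; 'page' and 'zoom' are assumed option names.
print(djvu_url("http://example.org/reports/vol1.djvu", page=12, zoom="width"))
```

Links built this way are useful for citation: a search result can point directly at the page on which a hit occurs instead of at the start of a long volume.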
Proprietary Formats
- Word Processing Formats: MS Word, WordPerfect
- Not a good choice for document delivery on the web
- Cons:
  - These formats are completely closed
  - Poor cross-platform support
  - It is often problematic to index these documents using inexpensive or open source IR tools.
The New Jersey Digital Legal Library
- URL: http://njlegallib.rutgers.edu
- Digitize New Jersey legal materials not currently available online.
- Available for users in two formats: DjVu and PDF
- Current Workflow:
  - Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
  - Extract OCR text from the DjVu to XHTML using XSL stylesheets and DjVuLibre (the open source DjVu library)
  - Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
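The last workflow step relies on Dublin Core metadata embedded in each XHTML document. The sketch below shows what reading that metadata back might look like, assuming it is stored in `<meta>` elements whose `name` attribute starts with `DC.`; the sample document and its values are hypothetical, and this is not the project's own code or a description of how swish-e parses files.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.*" content="..."> pairs from an (X)HTML document."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing XHTML tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            self.metadata.setdefault(name, []).append(content)


def extract_dublin_core(xhtml):
    """Return a dict mapping each DC element name to its list of values."""
    parser = DublinCoreExtractor()
    parser.feed(xhtml)
    parser.close()
    return parser.metadata


# Hypothetical XHTML page as it might look after OCR text extraction.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sample page</title>
<meta name="DC.title" content="Hypothetical Report, Volume 1" />
<meta name="DC.date" content="1900" />
</head>
<body><p>Page text from OCR would appear here.</p></body>
</html>"""

print(extract_dublin_core(SAMPLE))
```

Keeping the metadata inside the same file as the text means a single indexing pass can cover both full-text and fielded search.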
References
1. Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto, CA, 2004.
2. Doerman, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70(3), 1998, pp. 287-298.