docWORKS/METAe
The Engine for Automated Metadata Extraction and XML Tagging
Claus Gravenhorst
Content Conversion Specialists
July 2004 – METS Opening Day UK
www.ccs-gmbh.de

CCS – Offices
What is docWORKS/METAe?
  • Production tool for converting printed documents into fully tagged digital objects
  • The METAe edition of docWORKS is the result of the EU-funded project METAe
  • Start of project: September 2000
  • End of project: August 2003
  • Product launch: March 2003, CeBIT exhibition

The project group
  1. Leopold-Franzens-Universität Innsbruck (Co-ordinator), Austria
  2. Universität Linz, Institut für Angewandte Informatik, University of Linz, Austria
  3. Mitcom Neue Medien GmbH (ABBYY Europe), Germany
  4. CCS Compact Computer Systeme, Germany
  5. Universidad de Alicante, Spain
  6. Friedrich-Ebert-Stiftung, Germany
  7. Cornell University Library, Department of Preservation and Conservation, USA
  8. Bibliothèque nationale de France
  9. The National Library of Norway, Rana division, Norway
  10. Biblioteca Statale A. Baldini, Italy
  11. Dipartimento di Sistemi e Informatica, University of Florence, Italy
  12. Karl-Franzens-Universität Graz, Universitätsbibliothek, Austria
  13. Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy
  14. Higher Education Digitisation Service HEDS, UK

Challenges
  • Digitization and retro-conversion of printed or textual material is becoming more and more important:
      - Keep knowledge and cultural heritage alive
      - Preserve the originals
      - Enable quick and enhanced access through highly structured documents
      - Open up new dimensions of research
      - Provide standardized output formats

Goals
  • Automate the conversion process
  • Make digitization more effective and safer
  • Increase the added value of digitized collections
  • Provide a standardized output format that allows metadata to be transformed for various applications and systems

docWORKS – System Overview
  • Input: Scanning, Import
  • docWORKS engine: Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction, Rules DB
  • Output: Export as METS/ALTO, METS/TEI, PDF, TIFF, JPEG

docWORKS – recording as much metadata as possible!
  • Descriptive metadata
      - Available data: library records, e.g. MARC
      - Format: METS with DC or MODS, linking to the catalogue record
      - docWORKS engine: imports subsets and links to the record; creates descriptive records for articles, pictures, …
  • Administrative metadata
      - Available data: TIFF images
      - Format: METS incl. NISO (MIX)
      - docWORKS engine: records metadata
  • Structural metadata, logical
      - Format: METS structural map
      - docWORKS engine: suggests labels of logical elements and structures
  • Structural metadata, physical
      - Format: ALTO (Analyzed Layout and Text Object)
      - docWORKS engine: provides a suggestion for the physical structure
  • User modes: automated; semi-automated; fully automated after defining a profile; automated, correction recommended; correction in special cases; correction recommended

docWORKS – Matching of Image Files and Page Numbers

Image file     Pagination                Page number
000001.tif     Not counted               Np
000002.tif     Not counted               Np
000003.tif     Counted                   I
000004.tif     Counted                   II
000005.tif     Counted                   III
000006.tif     Counted                   IV
000007.tif     Counted                   V
000008.tif     Counted                   VI
000009.tif     Counted                   1
000010.tif     Counted, not paginated    (2)
000011.tif     Counted                   3
000012.tif     Counted                   4
placeholder    Missing page              5
placeholder    Missing page              6
000013.tif     Counted                   7
000014.tif     Counted                   8
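The matching of image files to page numbers on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the docWORKS implementation: the `Page` class, its flags, and `assign_page_numbers` are hypothetical names, and the rule that an unprinted number is shown in parentheses is inferred from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Enough numerals for small front-matter counts (1..39).
ROMAN = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    out = ""
    for value, numeral in ROMAN:
        while n >= value:
            out += numeral
            n -= value
    return out


@dataclass
class Page:  # hypothetical record, one per scanned image or placeholder
    image: Optional[str]   # None marks a placeholder for a missing page
    counted: bool = True   # False: page is outside the page count ("Np")
    printed: bool = True   # False: counted, but no number printed on the page
    roman: bool = False    # True: front matter numbered I, II, III, ...


def assign_page_numbers(pages):
    """Return (image file or 'placeholder', page number) rows."""
    rows = []
    roman_count = arabic_count = 0
    for p in pages:
        if not p.counted:
            rows.append((p.image, "Np"))
            continue
        if p.roman:
            roman_count += 1
            number = to_roman(roman_count)
        else:
            arabic_count += 1
            number = str(arabic_count)
        if not p.printed:
            number = f"({number})"
        rows.append((p.image or "placeholder", number))
    return rows


pages = (
    [Page(f"{i:06d}.tif", counted=False) for i in (1, 2)]
    + [Page(f"{i:06d}.tif", roman=True) for i in range(3, 9)]
    + [Page("000009.tif"), Page("000010.tif", printed=False),
       Page("000011.tif"), Page("000012.tif"),
       Page(None), Page(None),          # two missing pages keep the count running
       Page("000013.tif"), Page("000014.tif")]
)
rows = assign_page_numbers(pages)
```

Placeholders advance the page count without consuming an image file, which is why 000013.tif ends up as page 7 rather than page 5.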

docWORKS – Structural Analysis
  • FRONT
  • MAIN
  • BACK

docWORKS – Structural Analysis
  • Chapter 1
  • Subchapter 1
  • Subchapter 2
  • Chapter 2

docWORKS – Structural Analysis
  • Title page
  • Statement page
  • Preface
  • Table of contents

docWORKS – Document layers
  • Various document layers are differentiated automatically. Selecting specific layers enables well-directed searches and the presentation of electronic text without unnecessary items:
      - Body text, independent of its presentation
      - Margin notes, footnotes
      - Pictures and captions
      - Advertisements
      - Annexes and supplements
      - Navigation layer: table of contents, running title, document index, page number, volume index
  • Book: separation of "intellectual" and "artificial" content

docWORKS – Digitization of books and journals (METAe)

docWORKS – Digitization of books and journals (METAe)

docWORKS – Digitization of scientific documents

docWORKS – Manual editing of descriptive metadata / volume

docWORKS – Manual editing of descriptive metadata / illustration

docWORKS – Basic Workflow
  • Digitization: Scanning, Quality Control
  • Images
  • Conversion: Quality Control
  • Output
  • Presentation
  • Export: XML/METS, PDF, DB, OPAC, MARC

docWORKS – Scalable Client / Server architecture
  • Servers: Server 1, Server 2, Server 3, …, Server n
  • Processing steps: Scan, Import, Auto-Import, Image Preprocessing, Layout Analysis, OCR, Structural Analysis, Export, Quality Control

docWORKS – METS / ALTO
  • METS document
  • TIFF
  • ALTO – Analyzed Layout and Text Object

docWORKS – METS
  • Header
  • MODS or DC, descriptive metadata
  • NISO Z39.87 (MIX), technical metadata
  • Structural Map: Physical Structure
  • Structural Map: Logical Structure
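The five parts listed on this slide map onto top-level METS elements. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it builds an empty skeleton only. The IDs (`DMD1`, `TECH1`), the choice of `DC` over `MODS`, and the `TYPE` values `PHYSICAL` and `LOGICAL` are assumptions made for illustration and are not taken from docWORKS output.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton() -> ET.Element:
    root = ET.Element(q("mets"))
    # Header
    ET.SubElement(root, q("metsHdr"))
    # Descriptive metadata (DC here; MODS would use MDTYPE="MODS")
    dmd = ET.SubElement(root, q("dmdSec"), ID="DMD1")       # ID is hypothetical
    ET.SubElement(dmd, q("mdWrap"), MDTYPE="DC")
    # Technical metadata for images (NISO MIX)
    amd = ET.SubElement(root, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), ID="TECH1")      # ID is hypothetical
    ET.SubElement(tech, q("mdWrap"), MDTYPE="NISOIMG")
    # File inventory, then the two structural maps
    ET.SubElement(root, q("fileSec"))
    ET.SubElement(root, q("structMap"), TYPE="PHYSICAL")
    ET.SubElement(root, q("structMap"), TYPE="LOGICAL")
    return root


mets = build_mets_skeleton()
struct_types = [sm.get("TYPE") for sm in mets.findall(q("structMap"))]
xml_text = ET.tostring(mets, encoding="unicode")
```

Keeping the physical and logical structure in two separate `structMap` elements lets page order and chapter hierarchy be described independently over the same files.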

docWORKS – ALTO
  • Styles
      - Paragraph (alignment, line spacing, etc.)
      - Font (name, size, bold, italic, etc.)
  • Layout
      - PrintSpace
      - TopMargin
      - InnerMargin
      - OuterMargin
      - BottomMargin
  • Objects in the 5 areas above:
      - Text block
      - Text lines
      - Strings [coordinates, string (as printed), substitution (hyphenation)]
      - Spaces
      - Composed block
      - Picture
      - Table
      - Formula
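The "string (as printed)" versus "substitution (hyphenation)" distinction can be illustrated with a small Python sketch. The fragment below is a simplified, namespace-free stand-in for ALTO written for this example, not real docWORKS output; it uses the ALTO attribute names `CONTENT`, `SUBS_TYPE` and `SUBS_CONTENT` and the `HYP` and `SP` elements. The helper names `reading_text` and `printed_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO-like fragment: "metadata" is hyphenated across two lines.
ALTO_FRAGMENT = """
<TextBlock>
  <TextLine>
    <String CONTENT="automated"/><SP/>
    <String CONTENT="meta" SUBS_TYPE="HypPart1" SUBS_CONTENT="metadata"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="data" SUBS_TYPE="HypPart2" SUBS_CONTENT="metadata"/><SP/>
    <String CONTENT="extraction"/>
  </TextLine>
</TextBlock>
"""


def reading_text(block: ET.Element) -> str:
    """Text for search and display: hyphenated words are joined."""
    words = []
    for string in block.iter("String"):
        subs = string.get("SUBS_TYPE")
        if subs == "HypPart1":
            words.append(string.get("SUBS_CONTENT"))  # full word, emitted once
        elif subs == "HypPart2":
            continue                                  # already covered by part 1
        else:
            words.append(string.get("CONTENT"))
    return " ".join(words)


def printed_lines(block: ET.Element) -> list:
    """Text exactly as printed, line by line, hyphens included."""
    lines = []
    for line in block.iter("TextLine"):
        parts = []
        for element in line:
            if element.tag == "SP":
                parts.append(" ")
            else:  # String or HYP
                parts.append(element.get("CONTENT"))
        lines.append("".join(parts))
    return lines


block = ET.fromstring(ALTO_FRAGMENT)
joined = reading_text(block)
as_printed = printed_lines(block)
```

Storing both forms means a full-text index can work on the joined words while a page viewer can still highlight each printed fragment at its own coordinates.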

docWORKS – METS / physical structure
[Diagram: METS document with DC, FILEGRP, PHYS and LOGICAL sections. Each page entry in PHYS carries ORDER (1, 2, 3, … 16, …), LABEL and ORDERLABEL (roman numerals up to VI for the front matter, then arabic numerals 1, 2, 3, …).]
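The relationship between ORDER and ORDERLABEL on this slide can be sketched as follows: ORDER is the running position of the page in the volume, ORDERLABEL is the number as it appears on the page. The function name `physical_struct_map`, the `TYPE` values and the sample labels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)
DIV = f"{{{METS_NS}}}div"


def physical_struct_map(order_labels):
    """Build a PHYSICAL structMap with one page div per label."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", TYPE="PHYSICAL")
    volume = ET.SubElement(struct_map, DIV, TYPE="volume")
    for order, label in enumerate(order_labels, start=1):
        ET.SubElement(
            volume, DIV,
            TYPE="page",
            ORDER=str(order),     # running sequence: 1, 2, 3, ...
            ORDERLABEL=label,     # number as printed: I, II, ..., 1, 2, ...
        )
    return struct_map


labels = ["I", "II", "III", "IV", "V", "VI", "1", "2", "3"]
struct_map = physical_struct_map(labels)
page_divs = struct_map[0].findall(DIV)
```

Because ORDER never restarts, a viewer can page through the volume in sequence while still displaying the printed page number to the reader.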

docWORKS – METS / physical structure
[Diagram: METS document with DC, FILEGRP, PHYS and LOGICAL sections. In PHYS, a DIV (page) holds fptr elements that point by FILEID (grouped with par) to the ALTO file and to the IMAGE file listed in FILEGRP.]

docWORKS – METS / logical structure
[Diagram: METS document with DC, FILEGRP, PHYS and LOGICAL sections. LOGICAL nests DIV (volume), DIV (issue), DIV (contrib.), DIV (chapter) and DIV (paragraph); the divs are linked to the descriptive metadata sections DCMD_PHYS, DCMD_ELEC, DCMD_ISSUE#, DCMD_#CONT# and DCMD_CHAP#. An fptr with a seq points by FILEID and BEGIN to text blocks, with their coordinates, in the ALTO files; XSLT is shown on the path to the displayed text.]
Sample text: "Those who have read the History of Columbus will, doubtless, remember the character and exploits..."
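How a logical DIV reaches its text through fptr, seq, FILEID and BEGIN can be sketched in Python. Everything concrete here is an assumption for illustration: the file IDs (`ALTO0001`, `ALTO0002`), the block IDs, the split of the paragraph across two pages, and the namespace-free ALTO stand-ins. Only the METS element and attribute names (`fptr`, `seq`, `area`, `FILEID`, `BEGIN`, `BETYPE`) come from the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
AREA = f"{{{METS_NS}}}area"

# A logical paragraph whose text continues from one page to the next.
LOGICAL_DIV = f"""
<div xmlns="{METS_NS}" TYPE="paragraph">
  <fptr>
    <seq>
      <area FILEID="ALTO0001" BETYPE="IDREF" BEGIN="TB7"/>
      <area FILEID="ALTO0002" BETYPE="IDREF" BEGIN="TB1"/>
    </seq>
  </fptr>
</div>
"""

# Simplified stand-ins for the ALTO files listed in the file group.
ALTO_FILES = {
    "ALTO0001": '<Page><TextBlock ID="TB7"><TextLine>'
                '<String CONTENT="Those"/><SP/><String CONTENT="who"/>'
                '</TextLine></TextBlock></Page>',
    "ALTO0002": '<Page><TextBlock ID="TB1"><TextLine>'
                '<String CONTENT="have"/><SP/><String CONTENT="read"/>'
                '</TextLine></TextBlock></Page>',
}


def resolve_text(div: ET.Element, alto_files: dict) -> str:
    """Follow each area in document order and collect the referenced words."""
    words = []
    for area in div.iter(AREA):
        page = ET.fromstring(alto_files[area.get("FILEID")])
        for block in page.iter("TextBlock"):
            if block.get("ID") == area.get("BEGIN"):
                words.extend(s.get("CONTENT") for s in block.iter("String"))
    return " ".join(words)


paragraph_text = resolve_text(ET.fromstring(LOGICAL_DIV), ALTO_FILES)
```

The logical map stores no text of its own; it only points into the page-level ALTO files, so one paragraph can span several pages without duplicating content.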

docWORKS – ALTO / page layout and text content

docWORKS – ALTO / hyphenated word

docWORKS – ALTO / hyphenated word

docWORKS – Workshop UK 2004
  • University Library of Southampton, September 28/29, free of charge
  • 1st day
      - Product information
      - Output, metadata standards
      - Workflow, use cases
  • 2nd day
      - "Hands on": working with your own samples
      - Individual consultancy sessions
  • Contact
      - Simon Brackenbury - s.c.brackenbury@soton.ac.uk
      - Hartmut Janczikowski - hartmut.janczikowski@ccs-gmbh.de

Thank you!
Claus Gravenhorst
claus.gravenhorst@ccs-gmbh.de
Content Conversion Specialists
www.ccs-gmbh.de
http://meta-e.uibk.ac.at/