8c3c012cf216350d6463da1a14141eac.ppt
- Number of slides: 29
DML–CZ: asks and bids
Jiří Rákosník, Institute of Mathematics AS CR, Praha
Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13. 3. 2009
DML–CZ, a brief description
- Digital Mathematics Library consisting of relevant mathematical literature published in the territory of the Czech Republic and Slovakia
- Funding: R&D programme Information Society of the Academy of Sciences
- 2005–2009
Partners
- Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
- Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving
- Faculty of Informatics MU, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
- Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
- Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, graphical adjustment and OCR in the Digitization Centre Jenštejn
The scope
- journals for mathematical research and education
- conference proceedings
- monographs, textbooks
- altogether more than 200 000 pages
Journals
Pages: 2008: 106 000; 2009: 133 000; 2010–: 30 000+
Digitization categories in the original table: retro (scan), retro-born-digital, born-digital
- Czechoslovak Mathematical Journal: 1951–1991, 1992–2008
- Aplikace Matematiky / Applications of Mathematics: 1956–1993, 1994–2008
- Archivum Mathematicum, Brno: 1965–1991, 1992–2007
- Commentationes Mathematicae Universitatis Carolinae: 1960–1990, 1991–2008
- Kybernetika: 1965–1997, 1998–2008
- Časopis pro pěstování matematiky a fysiky: 1872–1950
- Časopis pro pěstování matematiky: 1951–1990
- Mathematica Bohemica: 1991, 1992–2008
- Acta Univ. Palackianae Olomucensis. Mathematica: 1960–2003, 2004–2008
- Acta Mathematica et Informatica Univ. Ostraviensis: 1993–2003
- Acta Mathematica Univ. Ostraviensis: 2004–2008
- Mathematica Slovaca: 1951–2008
- Matematika–Fyzika–Informatika: 1991–2005, 2006–2009
- Pokroky matematiky, fyziky a astronomie: 1956–2005, 2006–2009
Proceedings
Pages: 2008: 7 750; 2009: 6 900; 2010–: not stated
Title: volumes
- Equadiff: 11
- Toposym: 10
- Asymptotic Statistics: 4
- Winter School Abstract Analysis: 33
- Nonlinear Analysis, Function Spaces, Applications: 8
- Function Spaces, Differential Operators, Nonlinear Analysis: 6
- …
Monographs
Pages: 2008: 4 500; 2009: 1 000; 2010–: not stated
Title: volumes
- Bernard Bolzano Collection: 21
- From the collection of The Royal Czech Society for Sciences: 15
- Other monographs: 2
Content
- multilingual: Czech, Slovak, Russian, English, German, French, Italian
- text, drawings, photographs (B&W)
- maths, physics, chemistry, education, reviews, personalia, politics
Inspiration
- GDZ:
  - technology for scanning, text adjustment, OCR
- Cellule MathDoc, NUMDAM
  - DML, document enhancement, presentation, services
Scanning
- parameters – 600 dpi, 4 bit depth
- scanning facilities – Digibook RGB 10000, A1 color book scanner, and two book scanners Zeutschel OS 7000, A2 B/W
- software – BookRestorer to make the scanned pages uniform (graphical adjustment, white space around the text body etc.)
- Sirius system for archival storage of scans (put on CDs as TIFFs)
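To give a feel for what the stated scanning parameters imply for storage, here is a rough arithmetic sketch. The page dimensions passed in are an assumption (the slides do not state page sizes), the function name is hypothetical, and the result is the raw bitmap size before any TIFF compression.

```python
import math


def uncompressed_bytes(width_mm: float, height_mm: float,
                       dpi: int = 600, bits_per_pixel: int = 4) -> int:
    """Raw bitmap size of one scanned page, ignoring file headers and compression."""
    # Convert millimetres to inches (25.4 mm per inch), then to pixels.
    width_px = round(width_mm / 25.4 * dpi)
    height_px = round(height_mm / 25.4 * dpi)
    # Total bits divided by 8 gives bytes; round up to a whole byte.
    return math.ceil(width_px * height_px * bits_per_pixel / 8)


# Assumed example: an A4-sized page (210 mm x 297 mm) at 600 dpi, 4 bits per pixel.
a4_bytes = uncompressed_bytes(210, 297)
print(f"{a4_bytes / 1e6:.1f} MB per page before compression")
```

Under these assumptions a single A4 page comes to roughly 17 MB uncompressed, which is why the compression and archival storage choices matter at the scale of 200 000 pages.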
Optical Character Recognition
- text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
- errors in maths reading → methods for separation of text OCR and mathematics OCR
- maths: Infty system (Suzuki et al., Japan)
  - layout analysis
  - character recognition
  - structure analysis of math. expressions
  - manual error correction
- PDF with one OCR layer, multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
- 99 %+ accuracy for text, 96 %+ for mathematics
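The slides do not say how the accuracy figures were measured. One common convention, assumed here purely for illustration, is character accuracy derived from the edit distance between the OCR output and a ground-truth transcription. Both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth characters recovered, clamped to the range [0, 1]."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))
```

With this definition, 99 % accuracy corresponds to about one character error per hundred characters of ground truth; a measure for mathematics would also need to account for two-dimensional structure, which this sketch ignores.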
Metadata and image enhancement/processing
- metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)
  - Unicode with TeX → possible conversion to MathML
  - maths standards rather than librarians’ standards
- metadata acquisition – Zbl/MR, OCR tagging, (retyping)
- image enhancements – TIFF, PDF, JBIG2 compression as a measure of quality
- semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
- references and fulltexts as part of metadata, English titles and MSC mandatory
- OAI-PMH export
  - trying to follow miniDML, T. Fischer etc.
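As an illustration of the "English titles and MSC mandatory" rule combined with Dublin Core, the following sketch builds a small record and rejects input that lacks either field. The `record` wrapper element, the function name, and all sample values are assumptions for the example; they are not taken from the DML-CZ schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(title_en, creators, msc_codes, language):
    """Return an Element holding Dublin Core fields; title and MSC are required."""
    if not title_en or not msc_codes:
        raise ValueError("English title and at least one MSC code are required")
    record = ET.Element("record")  # hypothetical wrapper element
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title_en
    for name in creators:
        ET.SubElement(record, f"{{{DC_NS}}}creator").text = name
    for code in msc_codes:
        ET.SubElement(record, f"{{{DC_NS}}}subject").text = code
    ET.SubElement(record, f"{{{DC_NS}}}language").text = language
    return record


# Placeholder values only; not a real DML-CZ record.
sample = build_dc_record("A placeholder title", ["A. Author"], ["46E35"], "cs")
print(ET.tostring(sample, encoding="unicode"))
```

Enforcing the mandatory fields at record-building time keeps incomplete records out of the export step instead of discovering them during harvesting.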
Metadata Editor
- metadata creation & DL integration
- developed in Brno for DML-CZ
- web-based application
  - web interface
  - suite of scripts
  - files in directories
  - internal database
Metadata Editor
- input data loading
- articles building
- metadata editing
- references processing
- verification
- pdf-compilation
- export to DML-CZ
[Figure labels: article 1, article 2, pages to be excluded]
Indexing, storage
- indexing
  - multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.)
- space
  - no problem to store and index that for all mathematics literature so far
- software
  - client/server architecture
  - Lucene indexing software (OSS)
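The idea of indexing several attribute layers per document can be sketched with a small in-memory inverted index. This is plain Python written for illustration, not Lucene's API; the class name, layer names, and sample texts are all assumptions.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index that keeps separate postings for each attribute layer."""

    def __init__(self):
        # layer name -> term -> set of document ids
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, layer, text):
        """Index whitespace-separated, lower-cased terms of text under one layer."""
        for term in text.lower().split():
            self._postings[layer][term].add(doc_id)

    def search(self, term, layers=None):
        """Return ids of documents containing term in any of the selected layers."""
        term = term.lower()
        selected = layers if layers is not None else list(self._postings)
        hits = set()
        for layer in selected:
            hits |= self._postings.get(layer, {}).get(term, set())
        return hits


index = LayeredIndex()
index.add("doc1", "ocr_text", "Sobolev spaces")
index.add("doc1", "lemmas", "sobolev space")
index.add("doc2", "lemmas", "operator")
print(index.search("space"))                     # searches every layer
print(index.search("spaces", layers=["lemmas"]))  # restricted to one layer
```

Keeping the layers apart lets a query choose between the raw OCR text and a normalised layer such as lemmas, at the cost of storing each document's terms more than once.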
Presentation
- delivery
  - customised digital library system DSpace (open source, created at MIT) for final articles delivery, search
  - Manakin interface
- planned visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)
Delivery
- web portal
  - unique and persistent URLs: PURL
- interfaces to other services
  - OAI-PMH harvesting – necessary to set up the content for OAI-PMH
  - BibTeX export
  - Googlebot optimization of metadata
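For readers unfamiliar with OAI-PMH harvesting, a harvester's request is an ordinary HTTP query. The sketch below only builds a `ListRecords` URL using the standard OAI-PMH 2.0 arguments; the base URL is a placeholder, not the DML-CZ endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        # In OAI-PMH 2.0, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix, from, or set.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, assumed for illustration only.
print(list_records_url("https://example.org/oai", from_date="2009-01-01"))
```

A harvester would issue the first request with `metadataPrefix`, then repeat with each `resumptionToken` returned by the repository until none is given.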
Further problems and questions
- paper classification
- automated MSC experiment
- automated MSC learning
- metadata from born-digital documents
- search
- OCR systems
- OCR XML postprocessing
- maths OCR
Bids
- Metadata Editor
- Applications for classification of publications
- Document markup enhancement
  - algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)
- Measuring mathematical similarity of publications
- OCR experience (possibly capacity)
- Adjusted metadata of high fidelity
- Experience (both good and bad) in workflow conduct
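The n-gram based language identification mentioned under document markup enhancement can be sketched as follows. This is a minimal character-trigram classifier using cosine similarity, written for illustration; it is not the DML-CZ implementation, the function names are hypothetical, and the two training strings are toy samples far too small for real use.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count character n-grams after lower-casing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity of two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


def identify_language(text, profiles, n=3):
    """Return the label whose profile is most similar to the text's profile."""
    sample = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Toy training data, assumed for the example only.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "czech": ngram_profile("příliš žluťoučký kůň úpěl ďábelské ódy a pes"),
}
print(identify_language("the dog and the fox", profiles))
```

Because the classifier takes any string, the same function can be applied per paragraph or per sentence; shorter inputs yield fewer n-grams and therefore less reliable decisions.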
Asks
- Interlinking system (the EuDML core?)
- Effective system for adjusting and standardizing scanned pages
- Metadata standards and metadata conversion/export tools
- Unified authority base, journal names abbreviations, …
- Effective maths OCR
Asks
- Coordinated effort/support in copyright issues
  - Directive 2001/29/EC on the harmonisation of certain aspects of copyright and related rights in the information society
  - Green Paper Copyright in the Knowledge Economy COM(2008) 466/3
  - Fifth Freedom in the single market: free movement of knowledge and innovation
  - ENCES (European Network for Copyright in support of Education and Science) http://www.ences.eu
  - moving wall
  - supporting Open Access activities
Asks
- Document markup enhancement
  - context dependent mapping from visual to logical markup
  - document classification, metrics, ontology construction, comparison with MSC 2000 classification
  - semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
  - document clustering (for visualization, …), identification of plagiarism
Mathematician’s expectations
- Reliability
  - rate of correspondence with the original document
  - persistency
- Search
  - multilingual
  - reliable identification of authors
  - interlinking with Zentralblatt and Mathematical Reviews
Mathematician’s expectations
- Copyright
  - free access / reasonable moving wall
- User friendly services
  - citations export in BibTeX/AmsTeX format
  - interlinking between repositories
  - unified layout design
- Sustainable development
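The citation export expectation is easy to picture with a small sketch that serialises a metadata dictionary as a BibTeX entry. The function name, citation key, and field values are placeholders assumed for the example; the sketch does not escape TeX special characters, which a real exporter would have to handle.

```python
def to_bibtex(key, fields, entry_type="article"):
    """Serialise a mapping of BibTeX field names to values as one entry."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        # Braces around the value keep its capitalisation intact in BibTeX.
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder metadata, not a real record.
print(to_bibtex("author2009", {
    "author": "A. Author",
    "title": "A placeholder title",
    "year": "2009",
}))
```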