
9b867b870cac4f98dd5ba8fa9424dd9b.ppt
- Number of slides: 108
Digital Library Projects at Bibliotheca Alexandrina Noha Adly 16 January 2006 Noha Adly 06 Bibliotheca Alexandrina
Infrastructure and Connectivity
§ Network
  – Fiber-optic backbone
    • The 11 floors of the library
    • The BA Conference Center (BACC)
    • The Science Museum and Planetarium
  – FTP used for horizontal cabling (2200+ outlets)
  – Gigabit Ethernet technology is deployed
  – Leased lines used to connect remote branches
    • CULTNAT
    • Shallalat
    • Swedish Institute (Anna Lindh Foundation)
§ Internet Connectivity
  – Bandwidth from 10 Mbps to 155 Mbps (STM-1)
  – Plans for wireless Internet access using Wi-Fi hotspots
  – Full Internet access through
    • Internet Cafe
    • BA Conference Center, for journalists and press agents
System Architecture
[Diagram: External, Firewall, DMZ, Network Backbone, PCs, Servers]
Infrastructure Overview
Public PCs
  Reading Tables              145
  Study Rooms                  85
  OPAC                         26
  Print Servers                12
  Young People                 13
  Children Library              7
  Taha Hussein Lib             10
  Internet Cafe                 7
  Information Literacy Lab      9
  Museums                      16
  Total                       330
Staff PCs
  FAP                         158
  LIS                         155
  ICT                         230
  CUL                         233
  EXT                          51
  Others                       71
  Total                       898
§ 74 servers
§ 2 firewalls
§ Corporate antivirus
Server Room
Services
[Diagram: Security, VTLS, Email, ERP, Intranet, MMV, Web casting, Web, Backup, Streaming, Anti-Virus, Office, Video Conf., etc.]
Video Conferencing
Access Control System – Staff
Ticketing Control System
ILS – Integrated Library System
Library Information System
§ Web based
§ Supports Arabisation
§ Trilingual interface (Arabic, French, English)
§ Integrated with Multimedia system
§ Available 24 x 7
§ In-house development tools
  – Payment Card System
  – Automated Circulation Overdue Notices
  – Membership system
  – Cataloging Performance Tracking
  – Circulation Reports and Statistics
  – Customized Reports
  – etc.
BA Website
Statistics – www.bibalex.org
Statistics
ISIS – Mission & Goals
§ Mission
  – Initiate, carry out and promote research and development activities and projects related to building a universal knowledge center
  – Act as an incubator for digital and technological projects, promoting and nurturing innovations in accordance with BA goals
§ Goals
  – Preserving the heritage for future generations
  – Universal access to human knowledge
Overview
§ Internet Archive
§ Million Book Project
§ UDBE: Universal Digital Book Encoder
§ DAR: Digital Asset Repository
§ The Digital Modern History of Egypt
  – Gamal Abdel Nasser
  – Description de l’Egypte
§ OACIS
Internet Archive
Overview
§ Web: 10 billion pages from 1996–2001
§ Television: 2000 hours of Egyptian and US TV
§ Movies: 1000 archival films
§ 100 terabytes of data
§ Storage on 200 computers
§ The second copy worldwide, after the original copy in San Francisco
Access Statistics
Second Generation Machines: Petabox
§ Designed to store and process one petabyte (a million gigabytes)
§ Features:
  – Low power: 6 kW per rack, and 60 kW for the whole system
  – High density: 64 TB per 40 U rack
  – Local computing to process the data
  – 800 low-end PCs
  – Multiple operating systems
  – Easy maintenance: one system administrator per petabyte
  – Software to automate mirroring
  – Inexpensive design
  – Inexpensive storage
Single Rack Configuration
§ Data Node (80): 1.2 TB, 1 GHz, 100 Mbps
§ Admin Node (2): 1.2 TB, 1 GHz, 2 x 100 Mbps
§ Switch (2): 48 x 100 Mbps, 2 x 1 Gbps
§ Router/Firewall (1): 2 x 3 GHz, 2 GB, 4 x 1 Gbps
§ Rack: 43 U
§ All boxes 1 U, except Router/Firewall 2 U
Progress
§ An agreement with the Internet Archive (IA) for building the Petabox has been signed
§ Hard disks for 2 petabytes have been purchased
§ 3700 hard disks to reach IA by February 1st, 2006
§ IA will build the machines and load them with the data of the web collection of 2002, 2003, 2004 and 2005
§ 1300 hard disks will be delivered to BA to be assembled locally
§ New machines for the 2006 collection will be designed and manufactured locally
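As a rough consistency check on the figures above, the two shipments add up to 5,000 disks, which for 2 petabytes implies about 400 GB per disk. This is only a sketch: it assumes decimal units (one petabyte as a million gigabytes, as on the Petabox slide) and that the 3,700 and 1,300 disks together make up the whole purchase, which the slide does not state explicitly.

```python
# Back-of-the-envelope check of the disk purchase figures.
# Assumption: 1 PB = 1,000,000 GB (decimal units).
# Assumption: the two shipments together cover the full 2 PB purchase.
GB_PER_PETABYTE = 1_000_000


def capacity_per_disk_gb(total_petabytes: float, disk_count: int) -> float:
    """Average capacity each disk must provide, in gigabytes."""
    return total_petabytes * GB_PER_PETABYTE / disk_count


disks_to_ia = 3700  # shipped to the Internet Archive
disks_to_ba = 1300  # delivered to BA for local assembly
total_disks = disks_to_ia + disks_to_ba
per_disk_gb = capacity_per_disk_gb(2, total_disks)

print(f"{total_disks} disks, about {per_disk_gb:.0f} GB each")
```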
Million Book Project
Goals
§ Long-term: Capture all books in digital format
§ Short-term: Digitize 1 million books by 2007
§ Provide a test bed to support research areas, such as
  – Scanning techniques
  – Optical character recognition
  – Intelligent indexing
  – Machine translation
  – Information retrieval
Partners
§ USA
  – Carnegie Mellon University
  – Internet Archive
§ China
  – Beijing University
  – Chinese Academy of Science
  – Fudan University
  – Chinese Ministry of Education
  – Nanjing University
  – State Planning Commission of China
  – Tsinghua University
  – Zhejiang University
Partners
§ India
  – Indian Institute of Science
  – International Institute of Information Technology, Hyderabad
  – Arulmigu Kalasalingam College of Engineering
  – Goa University
  – Indian Institute of Information Technology, Allahabad
  – Shanmugha Arts, Science, Technology & Research Academy
  – Tirumala Tirupati Devasthanams
  – Maharashtra Industrial Development Corporation
  – University of Pune
  – Anna University
§ … now increased to 22 centers
Digital Lab Workflow
Image Processing
§ Enhances the quality of the scanned images
  – Removes noise
  – Reduces file size
§ Functions performed
  – Despeckle: removes isolated black pixels
  – Deskew: detects and removes skew
  – Crop: removes the extra white space
  – Curvature correction
  – Removal of margins
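The despeckle step can be illustrated with a minimal sketch. It assumes a binary page image stored as a list of rows (1 = black, 0 = white) and treats a black pixel as "isolated" when none of its eight neighbours is black. The slide does not say how the lab's tools define isolation, so that rule and the function name are assumptions for illustration only.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated black pixels removed.

    `image` is a list of rows; 1 means black, 0 means white.
    Assumption: a pixel is isolated if none of its 8 neighbours is black.
    """
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]  # leave the input untouched
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            has_black_neighbour = any(
                image[nr][nc] == 1
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            )
            if not has_black_neighbour:
                cleaned[r][c] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # a single speck of noise
    [0, 0, 0, 1, 1],  # part of a real stroke
    [0, 0, 0, 1, 0],
]
cleaned = despeckle(page)
for row in cleaned:
    print(row)
```

Reading neighbours from the original image rather than the partially cleaned copy keeps the result independent of scan order.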
Image Processing Procedure
[Workflow diagram. Labels: ACDSee, Compress, Photoshop, Black edge, Centering, OTIFF, ScanFix, Noise, PTIFF.X, Recover, Photoshop, Resize, ScanFix, Skew, PTIFF, Recover]
OCR – Arabic
§ Poses unique challenges
  – Written cursively, with blocks of connected characters
  – A block of characters can have more than one baseline
  – Uses external objects such as dots, 'Hamza' and 'Madda'
  – Diacritization
  – Characters can have more than one shape according to their position
  – Overlapping makes it difficult to determine the spacing
§ Sakhr Automatic Reader is used
§ Tricky with old books
§ Requires learning
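The point that a character can have more than one shape according to its position can be seen in Unicode itself, which encodes separate presentation forms for the isolated, final, initial and medial shapes of each Arabic letter. The sketch below uses the letter Beh as an example and only Python's standard `unicodedata` module. It illustrates the writing-system property, not how the OCR engine named on the slide works internally.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# Unicode "Arabic Presentation Forms-B" code points for the four
# positional shapes of Beh.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for form in PRESENTATION_FORMS:
    # NFKC normalization folds each positional shape back to the letter.
    base = unicodedata.normalize("NFKC", form)
    print(describe(form), "->", describe(base))

shape_names = [unicodedata.name(f) for f in PRESENTATION_FORMS]
all_same_letter = all(
    unicodedata.normalize("NFKC", f) == BEH for f in PRESENTATION_FORMS
)
```

An OCR engine has to make the reverse mapping from pixels: four visually different glyphs must all be recognized as the same letter.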
Pre-OCR Text Enhancement
§ Condition of Arabic printings varies
  – Old/new
  – Light/heavy
  – Solid/dot-matrix
§ ScanFix’s smoothing and completion features improve recognition accuracy
§ Separate from actual processing phase
  – Must be tested under OCR right away
  – OCR specialists have a better feel for “good text”
Font Libraries
§ Improvement of Arabic OCR results through
  – Tweaking of OCR engine settings
  – Learning
§ Libraries for different fonts have been built to achieve higher recognition rates
§ Each library is a database of character glyphs that describes a particular type of script and improves OCR accuracy
§ Built from a carefully selected, classified and highly varied set of scanned images drawn from a batch of about 1000 books, which were reduced to 15 font groups
Font Classification
§ Classification criteria:
  – Script type
    • TA: Traditional Arabic
    • AR: Arabic Transparent
    • DT: DecoType Naskh and DecoType Naskh extension
  – Printing quality: High (H), Medium (M), and Low (L)
  – Font size: 1 (largest) to 5 (smallest)
§ “Group X”: virtual font to tag unclassifiable printings and handwriting
§ Minimum accuracy number assigned to each group based on testing results
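One way to picture the classification is as a small record combining the three criteria, with "Group X" as the fallback. The sketch below is hypothetical: the tag layout (`TA-H-1`), the class name and the `classify` helper are invented for illustration, and the slides do not show how the actual font groups are labeled or which combinations make up the 15 groups.

```python
from dataclasses import dataclass

# Codes taken from the slide; the tag format below is an assumption.
SCRIPT_TYPES = {
    "TA": "Traditional Arabic",
    "AR": "Arabic Transparent",
    "DT": "DecoType Naskh",
}
QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}
FONT_SIZES = range(1, 6)  # 1 (largest) to 5 (smallest)
UNCLASSIFIABLE = "X"      # "Group X" for unclassifiable printings


@dataclass(frozen=True)
class FontGroup:
    script: str
    quality: str
    size: int

    def __post_init__(self):
        if self.script not in SCRIPT_TYPES:
            raise ValueError(f"unknown script type: {self.script!r}")
        if self.quality not in QUALITIES:
            raise ValueError(f"unknown printing quality: {self.quality!r}")
        if self.size not in FONT_SIZES:
            raise ValueError(f"font size out of range: {self.size!r}")

    @property
    def tag(self) -> str:
        """Hypothetical label combining the three criteria."""
        return f"{self.script}-{self.quality}-{self.size}"


def classify(script: str, quality: str, size: int) -> str:
    """Return a group tag, or 'X' when the input fits no known class."""
    try:
        return FontGroup(script, quality, size).tag
    except ValueError:
        return UNCLASSIFIABLE


print(classify("TA", "H", 1))
print(classify("handwriting", "L", 3))
```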
Font Groups
Progress
§ Five scanning stations since October 2003
§ As of January 1st, 2006:
  – 22,214 books digitized and processed (6.7 million pages)
  – 15,550 books OCRed (4.6 million pages)
    • 11,101 Arabic books (3.3 million pages)
    • 4,449 Latin books (1.3 million pages)
§ Daily rates
  – Scanning: ≈ 2000 pages/person
  – Processing: ≈ 1800 pages/person
  – Latin OCR: ≈ 4000 pages/person
  – Arabic OCR: ≈ 1500 pages/person
§ The target is to scan and process 5000 pages/day/scanner, leading to ≈ 25,000 books/year
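The link between the per-scanner target and the yearly book figure can be checked with a few lines of arithmetic. The sketch assumes that the five scanning stations stay in use and that future books average the same length as those already digitized (about 302 pages); neither assumption is stated on the slide.

```python
# Figures from the slide.
books_done = 22_214
pages_done = 6_700_000
scanners = 5
target_pages_per_scanner_per_day = 5_000
target_books_per_year = 25_000

# Assumption: future books have the same average length as past ones.
avg_pages_per_book = pages_done / books_done

# Assumption: all five stations hit the target every working day.
pages_per_day = scanners * target_pages_per_scanner_per_day
books_per_day = pages_per_day / avg_pages_per_book
working_days_needed = target_books_per_year / books_per_day

print(f"average book length: {avg_pages_per_book:.0f} pages")
print(f"books per day at target rate: {books_per_day:.1f}")
print(f"working days needed for 25,000 books: {working_days_needed:.0f}")
```

Under these assumptions the yearly target corresponds to roughly 300 working days at full rate.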
Publishing
§ Challenges
  – Preservation of layout
  – Searchability of content and metadata
  – Efficient image compression
  – Accommodating low-bandwidth users
  – Easy browsing of books
  – Multipaging
  – Multilingual text support
Image-on-Text
§ Multilayered:
  – Visible page image
  – Hidden OCR text
§ View exact original layout while searching and highlighting
§ Supported by some OCR suites only
§ Supported formats: DjVu and PDF
UDBE – Universal Digital Book Encoder
§ Built around a Common OCR Format (COF)
Common OCR Format (COF)
§ Captures necessary image-on-text document information
§ Inspired by DjVu XML and DAFS (Document Attribute Format Specification)
§ XML-compliant: simple integration
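To make the image-on-text idea concrete, the sketch below stores recognized words with their positions on the page image in a small XML document, then searches the hidden text and returns the rectangles a viewer would highlight. The markup is hypothetical: the element and attribute names (`page`, `line`, `word`, `x`, `y`, `w`, `h`) and the file name are invented for illustration and are not the actual COF schema, which the slides do not show.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT the real COF schema: each word carries the
# bounding box of its position on the scanned page image.
SAMPLE_PAGE = """
<page image="page_0001.tif" width="2480" height="3508">
  <line>
    <word x="310" y="420" w="180" h="48">Digital</word>
    <word x="505" y="420" w="190" h="48">Library</word>
  </line>
  <line>
    <word x="310" y="490" w="210" h="48">Alexandria</word>
    <word x="535" y="490" w="190" h="48">library</word>
  </line>
</page>
"""


def find_highlights(page_xml: str, query: str):
    """Return (x, y, w, h) boxes of words matching the query, ignoring case."""
    root = ET.fromstring(page_xml)
    needle = query.casefold()
    boxes = []
    for word in root.iter("word"):
        if (word.text or "").casefold() == needle:
            boxes.append(tuple(int(word.get(k)) for k in ("x", "y", "w", "h")))
    return boxes


hits = find_highlights(SAMPLE_PAGE, "library")
print(hits)
```

The viewer keeps showing the page image, so the original layout is preserved, while the search runs entirely against the hidden text layer.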
Implementation
§ OCR Converter for Automatic Reader:
  – Supports 18 Latin languages, Arabic, and Persian
  – Features font learning capabilities
§ Format Handlers:
  – DjVu:
    • MRC imaging model: high-quality, low-file-size image compression from AT&T Labs
    • Implemented around DjVuLibre and LizardTech’s Document Express
  – PDF:
    • Widely used PostScript-like Portable Document Format from Adobe
    • Implemented in Java based on iText
UDBE Performance
Progress
§ A database for the books, metadata and status has been designed and implemented.
§ The complete cycle of the workflow for producing digital books has been automated, and integrated with the ILS.
§ This work has been extended to accommodate other types of materials including slides, maps, images, audio and video.
DAR – Digital Assets Repository
Goals
§ Automation of the digitization process
§ Integrating the actual content and metadata of a variety of object types into one homogeneous repository
§ Preservation and archiving of digital media produced by the Digital Lab or acquired by the Library in digital format
§ Enhancing the interoperability of, and seamless access to, the Library's digital assets
Standards
§ Digital objects descriptive metadata
  – VRA Core Categories
  – MARC 21
§ Metadata presentation
  – XML
  – MARC format
  – Dublin Core
§ Content dissemination
  – OAI-PMH
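For content dissemination, an OAI-PMH harvester asks a repository for records with a plain HTTP request whose query string names a verb and a metadata format. The sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`) records. The base URL is a placeholder and not a real BA endpoint, and the function name is invented for illustration; the parameter names come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; not an actual BA service.
BASE_URL = "http://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `metadataPrefix` is required by the protocol; `from` and `set`
    are optional selective-harvesting arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL, from_date="2005-01-01")
print(url)
```

The response to such a request is an XML document, which fits the XML and Dublin Core presentation choices listed above.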
System Architecture
Progress
§ DAF has been fully deployed for books since March 2004.
§ In January 2005, support for images and other material was introduced.
§ The first version of DAK was deployed in July 2005, with some parts still in beta.
§ A publishing tool has been implemented with a special viewer for digitized assets, and a viewer for books using image-on-text technology.
The Digital Modern History of Egypt
Gamal Abdel Nasser Collection
Nasser – Objectives
§ Digitize and publish the collection of the eminent Arab and Egyptian president Gamal Abdel Nasser
§ Provide online access to his collection through a web-based system mainly intended for research and documentation
Nasser – Collection
§ Documents published by the Public Records Office, London, UK (53,000+ pages)
§ Documents published by the United States Department of State (30,000+ pages)
§ Over 1,300 speeches, audio and printed
§ Over 51,000 photos and 1,000 portraits
§ More than 1,000 videos (50+ hours)
§ A complete archive of the articles published in the newspapers
§ The decrees issued by the Revolutionary Command Council (RCC)
§ The daily news of the President
Nasser – Collection
§ Minutes of the Central Committee for Arab Socialist Union (ASU)
§ 140+ handwritten documents (593 papers)
§ A complete archive of the "Bisaraha" articles by Mohammed Hassanein Haikal
§ Caricatures, stamps, coins and plastic arts illustrations
§ Books written by and about Nasser
§ More than 1,200 national songs
§ Over 130 poems
Nasser
§ The entire collection has been digitized
§ Database designed and populated with the digital objects and their metadata
§ Backend applications
  – Managing the contents
  – Categorization
  – Adding and refining descriptions
  – Adding keywords
§ Integration of all the different information sources and media under a single interface
§ Front end
  – A web-based interface
  – Full-text Arabic and English search engine
Nasser – Website
Description de l’Egypte
Description de l’Egypte
§ The work includes
  – 11 volumes of plates (950+ pages)
  – 9 volumes of text (7500+ pages)
  – Index book
§ The volumes recorded
  – Antiquities
  – Modern state
  – Natural history
§ They described cities, buildings, temples, monuments, arts, animals, plants, minerals, society, etc.
Digitization
§ The complete volumes of plates and text have been fully digitized.
Processing
Virtual Browser
The whole collection has been integrated into a virtual browser and made accessible to the public.
First Release
§ Provide the collection on DVD, in both English and French, for the public and for researchers
§ A relation between text and images has been established in a searchable form
§ Published with two versions of the pictures
  – Low resolution for quick browsing
  – High resolution for zooming with dynamic loading
Digitizing of the Botroseyya Collection
Botroseyya – Overview
§ This project aims at digitizing the documents pertaining to the Botros Ghaly family
§ The family has saved a large number of documents related to its political role since the late 1800s.
§ The project will attempt to
  – digitize the entire multilingual (Arabic, English, French, German, Italian and Turkish) collection, and
  – provide it in searchable form for historians, politicians and researchers.
Digitization of Mohamed Mahmoud Pasha Collection
Mohamed Mahmoud Pasha Collection – Overview
§ This project aims at digitizing the documents pertaining to Mohamed Mahmoud Pasha, one of the most famous Egyptian Prime Ministers
§ The project will attempt to
  – digitize the entire collection of rich and rare historical documents and materials never published before, and
  – provide it in searchable form for historians, politicians and researchers.
Al Hilal Digital Collection
Al Hilal – Overview
§ This project aims to publish an exhaustive digital copy of the issues of Al-Hilal since its first publication in 1892
§ Al-Hilal is considered the oldest continuously published cultural journal in the Arab world, and the only regular journal that has been issued for more than a hundred years
§ It had a marked effect on the history of the Arab world in general and the history of Egypt in particular
§ It played a leading role in modernizing Arab intellectual thinking, and opened new collaborations towards cultural evolution
Al Hilal – Progress
§ The volumes of years 1 to 50 were completely scanned, processed and indexed (about 51,000 pages).
§ An application has been implemented for browsing through the digital copies with searching facilities. The hierarchy of titles and subtitles helps users select the desired issues.
§ The issues of each decade are to be compiled on a CD including the necessary browsing and searching tools.
OACIS – Online Access to Consolidated Information on Serials
OACIS – Mission
§ Create a publicly and freely accessible, continuously updated listing of Middle East journals and serials, including those available in print, microform, and online
§ Improve access to Middle Eastern serials in libraries in the
  – United States
  – Europe
  – Middle East
§ Make scholarly literature from, and about, the Middle East widely and easily available to scholars around the world
OACIS – Statistics
Holds: 23,000+ unique title records
OACIS – BA contribution
§ Over 400 records have been uploaded to the database
§ 23 volumes have been digitized
§ Digitized documents have been integrated into the OACIS system through a digital viewer
§ A content retrieval web interface for the digitized serials has been developed
§ The OACIS catalog is updated regularly, on a quarterly basis
§ A mirror site of the system has been set up at BA and was released on 25th January 2005 (http://oacis.bibalex.org)
OACIS – Website
OACIS – Digital Viewer
OACIS – Search Contents
Arabic and Middle Eastern Electronic Library (AMEEL)
AMEEL – Overview
§ Develop an Arabic and Middle Eastern Electronic Library (AMEEL) containing a large collection of significant Middle Eastern resources
§ Bring together qualified partners to create a Middle East electronic library including:
  – Digital representations of traditional materials
  – "Born digital" contemporary materials
  – A service structure for Inter Library Loan
§ Building an access portal
Thank You