3410814f710c69cdafb13db025991da9.ppt
- Number of slides: 17
Documentary databases, with an introduction to ISIS database technology. An introduction by Egbert de Smet, Univ. of Antwerp, Belgium
Summary
• Relational database technology is not the best solution for all types of information
• Another model, 'No-SQL', is becoming more popular, especially in web environments
• ISIS uses a non-relational model for flexibility, but is mostly suited to semi-structured information, with semi-relational capabilities
• An ISIS database uses a master file (MST) with an XRF index, an Inverted File with postings (addresses) for searching, and a powerful 'Formatting Language'
17/03/2018 New trends in databases (1)
• No-SQL (non-relational) databases:
▫ not all information needs ACID guarantees
▫ websites, e.g., need 'scalability' and flexibility
• Examples:
▫ MongoDB (Drupal runs 5x faster than with MySQL)
▫ CouchDB
▫ Cassandra (Apache)
▫ Berkeley DB (Oracle, used in JISIS)
▫ Oracle NoSQL
▫ Big Data (Google)
• hybrid databases, e.g. Virtuoso
New trends in databases (2)
• 'schemaless' database structures: no fixed, predefined 'fields'
▫ each record has its own structure, with a structural ID
• 'fingerprints': a structural or content-based 'summary' of the document is stored in the record and indexed for faster retrieval
▫ e.g. the old 'ISO 2709' standard is actually an example of this
New trends in databases (3)
• 'Triple' stores to store the 'Semantic Web'
• all information is expressed as triples:
▫ subject, predicate/property, object
▫ e.g. X is author of Y, A is friend of B, etc.
• such triples are mostly stored in RDF
▫ they use references to authority elements (e.g. URIs, author IDs, thesauri) instead of literal values
▫ library catalogs: publication URIs, RDA/FRBR-based cataloging to add more relations to the catalog
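The triple idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (`author:X`, `pub:Y`, ...) and the `match` helper are made up for this example, and a real triple store would use RDF with URIs rather than plain strings.

```python
# Hypothetical data: every statement is one (subject, predicate, object) triple.
# The identifiers stand in for the URIs / authority IDs a real store would use.
triples = {
    ("author:X", "is_author_of", "pub:Y"),
    ("author:X", "is_author_of", "pub:Z"),
    ("person:A", "is_friend_of", "person:B"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Which publications did X write?" is a pattern with the object left open.
works_by_x = [o for (_, _, o) in match(subject="author:X", predicate="is_author_of")]
print(works_by_x)
```

Because every fact has the same three-part shape, new kinds of relations can be added without changing any schema.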
Documentary vs. relational
• In computer science, 'databases' are mostly assumed to be 'relational'
• That model fits administrative data well, such as addresses, staff records etc.
• Libraries, but also websites for example, use different, less structured data, where in fact each record carries its own structure: 'documents'
The relational model (SQL)
• Relational databases: all data go into tables
• Tables are 'matrices' with rows and columns
▫ Each row is a 'case' (or record)
▫ Columns are fields
▫ Because of the matrix structure, all records must have the same length and all fields have a fixed length, defined beforehand
• Splitting data into several tables, and relating or linking them, introduces some flexibility
A single matrix

ID  Title    Author 1  Author 2  Keywords
1   Title 1  Author_a  Author_b  Food; security
2   Title 2  Author_b  Author_c  Food; health
3   Title 3  Author_a            History; economy; food
4   Title 4  Author_a  Author_d  Food; agriculture

Problems:
• How to cope with an undefined number of authors?
• What is the maximum length of a title?
• How can one search for one single keyword?
Solution: create a table with authors/keywords and an intermediary table to link authors to titles.
The relational model

Titles:
ID  Title    Editor
1   Title_1  Editor_1
2   Title_2  Editor_2

Link table:
Title_ID  Author_ID
1         a
1         a
2         a
2         a
3         a
3         a 1
4         a 4

Authors:
ID  Name
1   Author_a
2   Author_b
3   Author_c
4   Author_d

Solution: linked tables or 'relations'
When sorted on Title_ID, all authors of a title are listed until the next title begins
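The linked-table solution can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table and column names (`titles`, `authors`, `title_author`) and the rows in the link table are assumptions chosen for illustration, not the exact data of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles  (id INTEGER PRIMARY KEY, title TEXT, editor TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    -- Intermediary table: one row per (title, author) pair, so a title
    -- can have any number of authors without adding columns.
    CREATE TABLE title_author (
        title_id  INTEGER REFERENCES titles(id),
        author_id INTEGER REFERENCES authors(id)
    );
""")
conn.executemany("INSERT INTO titles VALUES (?, ?, ?)",
                 [(1, "Title_1", "Editor_1"), (2, "Title_2", "Editor_2")])
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Author_a"), (2, "Author_b"), (3, "Author_c"), (4, "Author_d")])
# Illustrative link rows (assumed): Title_1 by a and b, Title_2 by b and c.
conn.executemany("INSERT INTO title_author VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (2, 3)])

# Sorting on title_id lists all authors of a title until the next title starts.
rows = conn.execute("""
    SELECT t.title, a.name
    FROM title_author AS ta
    JOIN titles  AS t ON t.id = ta.title_id
    JOIN authors AS a ON a.id = ta.author_id
    ORDER BY ta.title_id, a.id
""").fetchall()

for title, name in rows:
    print(title, name)
```

Note that reading one complete 'record' already requires touching three tables, which is the cost the next slide refers to.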
Relational model: (dis)advantages
• Advantages:
▫ ACID
▫ After 'normalizing' the data into several tables, each piece of data is stored only once, which keeps the data consistent and avoids redundancy
▫ E.g. if an author changes gender, it has to be changed only in that single record
▫ Data can be updated in place, in the storage space they already occupy
• Disadvantages:
▫ Empty fields take space: a lot of space is 'lost'
▫ One 'record' is split over many tables; the hard-disk heads have to read several blocks from different parts/sectors of the disk
▫ Every move within one table causes index pointers to move in all related tables as well
▫ No flexibility: the architecture has to be well planned and designed ahead
The No-SQL model

ID  Document_value
1   Text_string
2   XML_string
3   BLOB
4   ISO 2709

• Key-value pairs: each row has an identifier, and the document is stored in the value part
• The value part can have any structure, e.g. an XML file, a BLOB...
• This is better suited to 'documents' in a database: each record is a document with its own structure, without fixed lengths or fields
• Large websites can be stored using this model, e.g. in Drupal 7 MongoDB outperforms MySQL by 5 times
• Examples: Google BigTable, Cassandra, MongoDB, CouchDB, Berkeley DB
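A minimal sketch of the key-value idea, using a plain Python dictionary as a stand-in for the store. The keys, the sample values and the `put` helper are assumptions for illustration; real key-value databases add persistence, indexing and distribution on top of this.

```python
import xml.etree.ElementTree as ET

# The "database": an identifier mapped to an opaque value.
store = {}


def put(key, value):
    """Store raw bytes under a key; the store never inspects the value."""
    store[key] = bytes(value)


put(1, "Just a text string".encode("utf-8"))
put(2, b"<doc><title>Title 1</title><author>Author_a</author></doc>")
put(3, bytes([0x00, 0xFF, 0x10, 0x20]))  # arbitrary binary content (a BLOB)

# Each value carries its own structure; only the application knows
# how to interpret it, here by parsing value 2 as XML.
xml_title = ET.fromstring(store[2]).findtext("title")
print(xml_title)
```

Nothing had to be declared beforehand: three records with three different structures live side by side in the same 'table'.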
ISIS as a No-SQL database
• ISIS (1975!) used this model long before it was officially named as such ('avant la lettre')
• The 'value' is an ISO 2709 record with header, directory and field values
▫ header: numerical descriptions, e.g. total length of the record
▫ directory: ID, starting position and length of each field
▫ fields: the values themselves
• Semi-relational: with the 'REF' function ISIS can combine data from different databases at 'run-time' (meaning: only when needed, not by design)
• High flexibility: a field can occur 0 to x times, and the directory tells the software; fields can have variable length within a given maximum record length
• Records of any structure can be merged into one single database, or arranged into different databases linked with the REF function
• If an author changes gender: the original gender, correct at the time, is kept in that record, but at the price of redundancy (made up for by more efficient storage)
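The 'semi-relational' idea of combining databases at run-time can be imitated in Python. This is an analogy only, not ISIS Formatting Language syntax: the two dictionaries, the field names and the helpers `lookup`, `ref` and `display` are hypothetical and merely mimic a lookup that happens when a record is displayed.

```python
# Two independent "databases": MFN -> record, each field holding a list of
# occurrences. All names and values are invented for illustration.
catalog = {
    1: {"title": ["Title 1"], "author_key": ["A1", "A2"]},
    2: {"title": ["Title 3"], "author_key": ["A1"]},
}
authors = {
    1: {"key": ["A1"], "name": ["Author_a"]},
    2: {"key": ["A2"], "name": ["Author_b"]},
}


def lookup(db, field, value):
    """Return the MFN of the first record whose field contains value."""
    for mfn in sorted(db):
        if value in db[mfn].get(field, []):
            return mfn
    return None


def ref(db, mfn, field):
    """Fetch the occurrences of a field from another database's record."""
    record = db.get(mfn)
    return list(record.get(field, [])) if record else []


def display(catalog_mfn):
    """Build a display line; the link is resolved only now, at run-time."""
    record = catalog[catalog_mfn]
    names = []
    for key in record.get("author_key", []):
        mfn = lookup(authors, "key", key)
        if mfn is not None:
            names.extend(ref(authors, mfn, "name"))
    return f"{record['title'][0]} / {'; '.join(names)}"


before = display(2)
authors[1]["name"] = ["Author_A"]   # change the authority record once
after = display(2)                  # the catalog display follows automatically
print(before, "->", after)
```

The link exists only in the display logic, so the two databases stay structurally independent.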
The ISO 2709 record as a structural fingerprint
• example record:
008460000027700045000010031000000040003102300090003512001140004400 300020015800500020016010000110 01621000011001731090033001841210067002171220017002841230015003016000005003162200130003212000017004512 40004200468250001000510324002800520 332001000548343000500558350000500563#ABT ASSOCIATES INC. /AGRICULTUR#AMS#19951205 #Conducting Pan‑European research: a preliminary evaluation of a new methodology for European aquaculture research#B#K#Shaw, S. A. #Bailly, D. #^a. Univ. Strathclyde ^b. Glasgow^c. UK#3. Annu. Conf. of the European Association of Fisheries Economists #Dublin (Ireland)#10‑ 12 Apr 1991#^aen#Proceedings of the third Annual Conference of the European Association of Fisheries Economists, Dublin, Ireland, 10‑ 12 Apr il 1991#Hillis, J. P. ^ed. #^a. Dublin (Ireland)^b. The Stationery Office#^p 163‑ 175#Ir. Fish. Invest. [B. Mar. ]#0578‑ 7467#1994#^i 42#~
• Advantages:
▫ By reading only the header/directory of the record, the whole structure of the 'document' is known, and the parser does not need to parse the document itself
▫ The header/directory can be created by the software when the record is created, when there is plenty of time
• Disadvantage:
▫ Classic ISO 2709 provides only 5 positions for the total length, so a record can be at most 99,999 characters long (roughly 100 KB)
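The header/directory mechanism can be sketched in Python. Assumptions, stated loudly: this is a simplified ISO 2709-style layout (24-character header with the total length in positions 0-4 and the base address of the data in positions 12-16; 12-character directory entries made of a 3-digit tag, 4-digit length and 5-digit starting position; '#' ends a field and '~' ends the record, as in the example above). It works on ASCII text and is not a complete implementation of the standard or of the ISIS export format.

```python
FIELD_END, RECORD_END = "#", "~"
HEADER_LEN, ENTRY_LEN = 24, 12


def build_record(fields):
    """Serialize [(tag, value), ...] into header + directory + data."""
    directory, data, start = "", "", 0
    for tag, value in fields:
        chunk = value + FIELD_END
        directory += f"{tag:03d}{len(chunk):04d}{start:05d}"
        data += chunk
        start += len(chunk)
    directory += FIELD_END
    base = HEADER_LEN + len(directory)
    total = base + len(data) + len(RECORD_END)
    if total > 99999:  # only 5 digits are available for the total length
        raise ValueError("record too long for a 5-digit length")
    # 5 (length) + 7 (unused here) + 5 (base address) + 7 (directory map) = 24
    header = f"{total:05d}" + "0" * 7 + f"{base:05d}" + "0004500"
    return header + directory + data + RECORD_END


def read_directory(record):
    """Return (base address, [(tag, length, start), ...]) without reading data."""
    base = int(record[12:17])
    entries, pos = [], HEADER_LEN
    while pos < base - 1:  # the last directory character is the terminator
        entry = record[pos:pos + ENTRY_LEN]
        entries.append((int(entry[0:3]), int(entry[3:7]), int(entry[7:12])))
        pos += ENTRY_LEN
    return base, entries


def get_field(record, tag):
    """Jump straight to every occurrence of a tag using the directory."""
    base, entries = read_directory(record)
    return [record[base + start:base + start + length - 1]
            for t, length, start in entries if t == tag]


rec = build_record([(24, "Title 1"), (70, "Author_a"), (70, "Author_b")])
print(read_directory(rec)[1])
print(get_field(rec, 70))
```

The repeated tag 70 shows how a field can occur several times: the directory simply lists it twice, and no field value is scanned to find the second occurrence.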
The ISIS database model
• All ISO 2709 records (MFNs) are stored in the MST
• A 1st-order index (XRF) with fixed record length stores the MFNs and their starting positions
• A B-tree index creates an 'inverted file' (2nd-order index) that keeps the full 'addresses' (MFN, Field, Occurrence, Position) of the extracted search keys
• A powerful 'parser' language (PFT) allows detailed definition of the values to be extracted
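The interplay of the three files can be sketched in Python. This is a toy model under stated assumptions: a `bytearray` stands in for the MST, a dictionary for the XRF, and another dictionary for the B-tree inverted file; records are serialized as simple tab-separated lines rather than ISO 2709, and every word of every field is indexed, whereas in ISIS the extraction is defined by the database manager.

```python
from collections import defaultdict

mst = bytearray()               # master file: all records, one after another
xrf = {}                        # 1st-order index: MFN -> (offset, length)
inverted = defaultdict(list)    # 2nd-order index: key -> list of postings


def add_record(fields):
    """Append a record to the MST, register it in the XRF and index its words."""
    mfn = len(xrf) + 1
    blob = "".join(f"{tag}\t{value}\n" for tag, value in fields).encode("utf-8")
    xrf[mfn] = (len(mst), len(blob))
    mst.extend(blob)
    occurrence = defaultdict(int)
    for tag, value in fields:
        occurrence[tag] += 1
        for pos, word in enumerate(value.lower().split(), start=1):
            # Full "address" of the key: MFN, field, occurrence, position.
            inverted[word].append((mfn, tag, occurrence[tag], pos))
    return mfn


def fetch(mfn):
    """Read one record directly from its known location in the MST."""
    offset, length = xrf[mfn]
    text = bytes(mst[offset:offset + length]).decode("utf-8")
    return [(int(tag), value)
            for tag, value in (line.split("\t", 1) for line in text.splitlines())]


def search(term):
    """Return the MFNs of all records containing the term."""
    return sorted({posting[0] for posting in inverted.get(term.lower(), [])})


add_record([(24, "Food security"), (70, "Author_a")])
add_record([(24, "Food and health"), (70, "Author_b"), (70, "Author_c")])

print(search("food"))
print(inverted["author_c"])
print(fetch(2))
```

A search never scans the master file: the inverted file yields MFNs, the XRF yields the location, and a single read returns the record.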
The ISIS database model (2)
• compare to:
▫ file systems: files are opened by looking up their exact location in the file-system index
▫ memory management: all values of a program are stored in memory and retrieved by their exact memory location
▫ when the XRF fits in RAM: very fast
• but: in addition, the ISO 2709 header defines the document structure and speeds up the 'parsing'
Conclusion
• (CDS/)ISIS as a database model is very old yet still quite modern: many new databases follow the same model
• flexibility is built in as a matter of principle, which allows system managers, rather than programmers, to create any database structure (library catalogs, digital libraries, museums, archives, factual data...)
• the ABCD system as an example: much more than just library automation
THANK YOU
• Questions?
• Remarks?
• demo: ABCD / JISIS
• practical exercises
• Egbert.de.Smet@uantwerpen.be