Database Engine Design a k a Research DSL

Database Engine Design a. k. a. Research@ DSL Jayant Haritsa August 2007 SERC Research Symposium Slide 1

Database Management Systems (DBMS) • Efficient and convenient mechanisms for storing, querying and maintenance of enterprise data • Cornerstone of computer industry – Uses more than 80 percent of computers worldwide – Employs more than 70 percent of computer professionals – Largest monetary sector of computer business August 2007 SERC Research Symposium Slide 2

DBMS FEATURES (contd) • Declarative Access – state what you want, not how to get it • On-the-Fly Questions – ask new questions without writing new programs • PEACE OF MIND – changes to the database are guaranteed to be immune to subsequent system failures Sri Ravishankar of the Information World August 2007 SERC Research Symposium Slide 4

Current Database Systems • Commercial – IBM DB 2 / Oracle / Microsoft SQL Server / Sybase • Public-domain – Postgre. SQL / My. SQL / Berkeley DB August 2007 SERC Research Symposium Slide 5

DBMS Myths • Databases? Isn’t that the boring part of accounting? • Hazaar dumb Cobol programming! • Maha-bore - almost as dull as watching Rahul Dravid bat! • High-tech name for data entry! • Will only get job with TCS! • . . . August 2007 SERC Research Symposium Slide 6

DBMS Realities • Design of database engines has lots of really, really interesting intellectual problems with practical impact – theory, algorithms, data structures, experiments, prototypes • Turing awards – 1981: Edgar Codd (relational data model) – 1999: Jim Gray (transaction model) • Ullman, Silberschatz, Papadimitrou, … • Rajaraman, Patnaik, Balakrishnan, Jacob/Govindarajan … August 2007 SERC Research Symposium Slide 7

Database Systems Lab (DSL) Established 1995 August 2007 SERC Research Symposium Slide 8

Research Topics – Real-Time Database Systems – Distributed Transaction Management 1995 -2000 – OODBMS – Web Databases – Data Mining – XML Databases 2000 -2005 – Biological Databases – Query Optimization Last few years – Multilingual Databases – Music Databases August 2007 SERC Research Symposium Slide 9

Research Trajectory MIDDLEWARE OO Models Mining XML CORE DB TECHNOLOGY Transaction Processing August 2007 Access Methods SERC Research Symposium Query Processing Slide 10

Research Techniques • Theory – real-time, data mining, query optimization • Simulation studies – real-time, distributed, web dbms • Empirical evaluation – data mining, biological, multilingual dbms, query optimization • Prototype development – OODBMS (Flexible Manufacturing [MIDAS], VLSI [DIAS], Bio-diversity [Oshadhi, Bodhi] ) – XML (Storage [Lego. DB], Compression [XGrind] ) – Query Optimization (Clustering [Plastic], Visualization [Picasso] ) – Multilingual Databases (Cross-lingual SQL [Mira] ) August 2007 SERC Research Symposium Slide 11

SPINE: Putting Backbone into Genomic Sequence Indexing August 2007 SERC Research Symposium Slide 12

Standard Genomic Index: Suffix Tree [Weiner 1973] Vertically-compressed trie of suffixes augmented with links 0 12 3 4 5 6 78 9 Suffix Links (x. W → W) Data = ‘GTTAATTACT$’ A T ATTACT$ 3 GTTAATTACT$ 7 Search for Query = ‘TTA’ August 2007 Tree Edges CT$ A TTACT$ 4 8 TA $ 9 0 ATTACT$ 2 SERC Research Symposium CT$ 6 ATTACT$ 1 CT$ 5 Slide 13

Locate all Maximal Matching Substrings [Chang & Lawler 1990] • For each position in query sequence Q , locate all longest matching substrings of length ≥ in the indexed data sequence D – Example: D = ‘GTTAATTACT$’ Q = ‘CTAATGA’ and Result: { TAAT: <2, 1> August 2007 =3 AAT: <3, 2> } SERC Research Symposium Slide 14

Maximal Substring Search with Suffix Tree Index 0 1 2 34 5 6 7 8 9 Q = ‘CTAATGA’ = 3 D = ‘GTTAATTACT$’ A T CT$ ATTACT$ 3 GTTAATTACT$ 7 A TTACT$ 4 8 9 0 ATTACT$ 2 August 2007 TA $ SERC Research Symposium CT$ 6 ATTACT$ 1 CT$ 5 Slide 15

Features of Suffix Tree Index • Accurate retrieval – no false negatives (unlike BLAST) • Linear Time Complexity for both Construction and Search! – because of Suffix-links • Widely used – More than 40 -50 applications over biological sequences [Gusfield, 2002] – MUMmer [Celera Genomics], AVID, … August 2007 SERC Research Symposium Slide 16

Crippling Limitation • Viable only for sequences that are short enough for their associated suffix tree to fit completely in main memory … [Baeza-Yates and Navarro, 2000] • Best that has been built so far is for sequences of ~ 10 Mbp (Human Genome is 300 times longer!) August 2007 SERC Research Symposium Slide 17

Difficulties in Supporting Suffix Trees on Long Sequences - 1 Space overheads are enormous – Order(s) of magnitude larger than data! – Human Genome can be easily stored in main memory (~1 GB) but the index could be of the order of 10 -100 GB Disk-resident suffix trees for long sequences August 2007 SERC Research Symposium Slide 18

Difficulties in Supporting Suffix Trees on Long Sequences - 2 Tree Construction on Disk is Very Slow – Due to disk thrashing from random seeks The active suffix creeps through the text like a caterpillar … corresponding active node swings through the tree like a butterfly [Giegerich and Kurtz, 1995] August 2007 SERC Research Symposium Slide 19

Difficulties in Supporting Suffix Trees on Long Sequences - 3 Searching on Disk is Very Slow – Unbalanced Tree Structure • Shape of tree depends on sequence stochastic properties – “Multi-directional” traversals causes disk thrashing • Tree-Edge “Vertical Walk-Down” Suffix Tree Search • Edge + Link mesh • Two phase Search • Locate • Report ≥ • Suffix-Link “Horizontal Jump-Across” Combination of Batman and Spiderman ! August 2007 SERC Research Symposium Slide 20

The SPINE* Index A Horizontally-Compacted Trie Index [*Sequence Processing INdexing Engine] August 2007 SERC Research Symposium Slide 22

SPINE Index Structure D = ‘ACCACAC’ Complete horizontal compaction into single linear chain!! 0 0 0 Root node A C(0) 1 C 2 • Nodes • Forward Edges Link 1 2 3 4 C 2 – Links Vertebra August 2007 A(1) A – Vertebras (Backbone) – Ribs / Ext-Ribs • Backward Edges C 1 SERC Research Symposium Rib 1(2) 5 A 6 C 7 Extension rib Slide 23

Structural Advantages of SPINE w. r. t. Suffix Trees 1) Number of nodes is equal to length of string, whereas in suffix tree can go up to double. 2) Entire data sequence explicitly embedded in index throw away the data! 3) On-line incremental algorithm (by definition) – do not need to possess entire data sequence in advance D =‘ACCA’ 4) Node creation order and logical order are the same prefix-partitionable 0 A 1 C 2 C 3 A August 2007 SERC Research Symposium C (0) 4 A (1) Slide 24

Advantages of SPINE (contd) 5) Each node represents a set of suffixes whereas in suffix tree each node represents only a single suffix – Number of suffixes processed for construction and searching is smaller 0 A 1 C (0) C 2 C 3 A (1) A 4 6) Easy to develop buffering strategies for persistent implementations August 2007 SERC Research Symposium Slide 25

SPINE Performance Summary Data Sets Ecoli: 3. 5 Mbp Celegans: 15. 5 Mbp HC 21: 28. 5 Mbp HC 19: 57. 5 Mbp Suffix Tree (MUMmer - Celera Genomics) • Spine Space – ~ 2/3 of Suffix Tree • Spine Time • Construction: ~ 1/2 of Suffix Tree • Searching: ~ 1/2 of Suffix Tree August 2007 SERC Research Symposium Slide 26

SPINE Summary • First index based on horizontal (inter-path) compaction of the trie • Collapses into a single linear structure • Improved features and performance w. r. t. suffix trees, the classical index • • • August 2007 Prefix-partitionable (first index to have this property) Easily amenable to persistent disk implementation Retains linear time/space complexity Better construction speed and capacity Better search response times SERC Research Symposium Slide 27

Full details at http: //dsl. serc. iisc. ernet. in Questions? August 2007 SERC Research Symposium Slide 28

END PRESENTATION August 2007 SERC Research Symposium Slide 29