
The Analytic DBMS Market(s)
New opportunities with new technology
by Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS 2
contact@monash.com
http://www.DBMS2.com
Curt Monash
Analyst since 1981
Own firm since 1987
Publicly available research
Covered DBMS since the pre-relational days
Also analytics, search, etc.
Blogs, including DBMS 2 (www.DBMS2.com -- the source for most of this talk)
Feed at www.monash.com/blogs.html
White papers and more at www.monash.com
User and vendor consulting
Our agenda
Why there are specialty analytic DBMS
It’s not just the analytic area
Hardware issues
Tips for choosing among them
Segments and priorities
The selection process
Database diversity
High-end e-commerce
100-terabyte analytics
High-volume call center
Media-heavy web startup
Simple departmental application
(and many more)
11 kinds of data management software
1. High-end OLTP/general-purpose DBMS
2. Mid-range OLTP/general-purpose DBMS
3. Row-based analytic RDBMS
4. Column- or array-based analytic RDBMS
5. Text search engines
6. XML and OO DBMS (but these may merge with search)
7. RDF and other graphical DBMS (but these may merge with relational)
8. Event/stream processing engines (aka CEP)
9. Embedded DBMS for devices
10. Sub-DBMS file managers (e.g. SimpleDB, some MySQL uses)
11. Science DBMS
Why are there specialized analytic DBMS?
General-purpose database managers are optimized for updating short rows …
… not for analytic query performance
10-100X price/performance differences are not uncommon
At issue is the interplay between storage, processors, and RAM
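To make the row-versus-column point concrete, here is a back-of-the-envelope Python sketch (not from the talk); the table width, value size, and row count are illustrative assumptions, chosen only to show how much less data a narrow analytic query has to read under a columnar layout.

```python
# Back-of-the-envelope I/O estimate: row store vs. column store for an
# analytic query that touches only a few columns of a wide table.
# All sizes are hypothetical, chosen only to illustrate the ratio.

ROWS = 1_000_000_000        # 1 billion rows in a fact table
COLUMNS = 100               # wide table, common in analytic schemas
BYTES_PER_VALUE = 8         # assume ~8 bytes per column value
COLUMNS_IN_QUERY = 3        # the query references just 3 columns

# A row store scan reads every row in full (ignoring indexes and caching).
row_store_bytes = ROWS * COLUMNS * BYTES_PER_VALUE

# A column store reads only the columns the query references.
column_store_bytes = ROWS * COLUMNS_IN_QUERY * BYTES_PER_VALUE

print(f"Row store scan:    {row_store_bytes / 1e12:.2f} TB")
print(f"Column store scan: {column_store_bytes / 1e12:.3f} TB")
print(f"I/O ratio:         {row_store_bytes / column_store_bytes:.0f}x")
```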
Moore’s Law, Kryder’s Law, and a huge exception
Growth factors:
Transistors/chip: >100,000 since 1971
Disk density: >100,000 since 1956
Disk speed: 12.5 since 1956
The disk speed barrier dominates everything!
The “1,000:1” disk-speed barrier
RAM access times ~5-7.5 nanoseconds
CPU clock speed <1 nanosecond
Interprocessor communication can be ~1,000X slower than on-chip
Disk seek times ~2.5-3 milliseconds
Limit = ½ rotation, i.e., 1/30,000 minutes, i.e., 1/500 seconds = 2 ms
Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG
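A minimal sketch of the arithmetic above, using the slide's own figures for RAM access and disk seek times; the 15,000 RPM spindle speed is an assumption (it is what makes half a rotation equal 1/30,000 of a minute, as the slide states).

```python
# Arithmetic behind the disk-speed barrier, using the figures on the
# slide above. The 15,000 RPM figure is an assumption.

ram_access_s = 7.5e-9      # RAM access: ~5-7.5 nanoseconds (upper end)
disk_seek_s = 2.5e-3       # disk seek: ~2.5-3 milliseconds (lower end)

raw_ratio = disk_seek_s / ram_access_s
print(f"Raw disk seek vs. RAM access: ~{raw_ratio:,.0f}:1")
# Caching and tiering pull the effective gap closer to ~1,000:1,
# but it remains enormous.

rpm = 15_000
half_rotation_s = 0.5 * (60 / rpm)   # = 1/30,000 minute = 1/500 second
print(f"Half rotation at {rpm:,} RPM: {half_rotation_s * 1000:.0f} ms")
```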
Hardware strategies to optimize analytic I/O
Lots of RAM
Parallel disk access!!!
Lots of networking
Tuned MPP (Massively Parallel Processing) is the key
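As a rough illustration of the MPP idea (a conceptual sketch, not any vendor's actual architecture), the hypothetical Python snippet below splits a scan-and-aggregate across worker processes, one per data partition, and merges the partial results on a coordinator.

```python
# Toy illustration of the MPP idea: split a scan-plus-aggregate across
# worker processes, one per data partition, then combine the partial
# results on a coordinator. Conceptual sketch only.

import random
from concurrent.futures import ProcessPoolExecutor

def scan_partition(partition):
    """Each worker scans its own partition and returns a partial sum."""
    return sum(value for value in partition if value > 0.5)

def parallel_scan(partitions):
    """The coordinator farms partitions out and merges partial results."""
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(scan_partition, partitions))

if __name__ == "__main__":
    random.seed(0)
    # Pretend each partition lives on its own node or disk.
    partitions = [[random.random() for _ in range(100_000)] for _ in range(8)]
    print(f"Total over all partitions: {parallel_scan(partitions):,.2f}")
```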
Software strategies to optimize analytic I/O
Minimize data returned
Minimize index accesses
Page size
Precalculate results
Classic query optimization
Materialized views
OLAP cubes
Return data sequentially
Store data in columns
Stash data in RAM
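One way to picture the "precalculate results" strategy is a materialized aggregate that common queries read instead of rescanning the detail rows. The sketch below is purely illustrative; the table contents and column names are made up.

```python
# Conceptual sketch of "precalculate results": maintain a small summary
# (a materialized aggregate) so common queries never rescan the detail
# rows. The table contents and column names are made up for illustration.

from collections import defaultdict

detail_rows = [
    # (region, product, revenue)
    ("EMEA", "widgets", 120.0),
    ("EMEA", "gadgets", 75.0),
    ("APAC", "widgets", 210.0),
    ("APAC", "widgets", 40.0),
]

# Build the summary once (or refresh it incrementally as data loads).
revenue_by_region = defaultdict(float)
for region, _product, revenue in detail_rows:
    revenue_by_region[region] += revenue

# A "revenue by region" query now reads the tiny summary,
# not the full detail table.
print(dict(revenue_by_region))  # {'EMEA': 195.0, 'APAC': 250.0}
```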
16 contenders
Aster
Dataupia
Exasol
Greenplum
HP Neoview
IBM DB2 BCUs
Infobright
Kickfire
Kognitio
Microsoft Madison
Netezza
Oracle Exadata
ParAccel
Sybase IQ
Teradata
Vertica
Varied approaches
3 are trying to meld OLTP and analytic processing
2 have very specialized hardware
1 is purely RAM-centric
Several use InfiniBand; several stress GigE switches
6 are columnar
2 stress cloud/DaaS
Segmentation made simple
One database to rule them all
One analytic database to rule them all
Frontline analytic database
Very, very big analytic database
Big analytic database handled very cost-effectively
7 more precise segmentation issues
What is your tolerance for specialized hardware?
What is your tolerance for set-up effort?
What is your tolerance for ongoing administrative burden?
What are your insert and update requirements?
At what volumes will you run fairly simple queries?
What are your complex queries like?
and, most important,
Are you madly in love with your current DBMS?
Specialized hardware
Custom or unusual chips (rare)
Custom or unusual interconnects
Fixed configurations of common parts
Set-up effort
Hardware acquisition and installation
Database and index design
Data cleaning and integration
Porting of existing applications
Ongoing administration
Part of the set-up effort also translates to an ongoing administrative burden
Indexes, materialized views, cubes, etc. …
… unless the DBMS architecture minimizes their use
Inserts and updates
Finally we get to the performance criteria
Batch load
ELT (or ETLT) vs. pure ETL
Mini-batches or trickle feeds
True transactional updates
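To illustrate what mini-batches or trickle feeds look like from the loading side, here is a hypothetical sketch of a buffer that flushes rows in small batches; load_batch() merely stands in for whatever bulk-load interface a given DBMS exposes.

```python
# Sketch of the mini-batch / trickle-feed idea: buffer incoming rows and
# flush them to the analytic DBMS in small batches instead of one at a
# time. load_batch() is a hypothetical stand-in for a real bulk loader.

def load_batch(rows):
    print(f"Loaded {len(rows)} rows in one batch")

class MiniBatchLoader:
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = []

    def insert(self, row):
        """Accept a trickle of single rows, flushing when the buffer fills."""
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Push whatever has accumulated as one batch load."""
        if self.buffer:
            load_batch(self.buffer)
            self.buffer = []

loader = MiniBatchLoader(batch_size=3)
for i in range(7):
    loader.insert({"id": i})
loader.flush()  # don't forget the tail of the stream
```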
Concurrent queries
Major use cases
Traditional BI
Customer-facing apps
Product maturity is often key
Complex queries
This is where the glamour is
MPP to speed up I/O
Clever answers to the data redistribution problem
Table scans vs. random access
Columns vs. rows
Aggressive use of RAM
Compression (saving on disk cost isn’t the point)
… and fast analytics even beyond the queries
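One common answer to the data redistribution problem is to hash both tables on the join key so that matching rows land on the same node and can be joined locally. The sketch below illustrates the idea with made-up node counts and rows; it is not any specific product's implementation.

```python
# Sketch of hash redistribution for an MPP join: rows from both tables
# are shipped to the node chosen by a hash of the join key, so matching
# rows meet on the same node. Everything here is made up for illustration.

NODES = 4

def target_node(join_key):
    # Every node applies the same hash, so equal keys always co-locate.
    return hash(join_key) % NODES

orders = [("cust_1", 100), ("cust_2", 250), ("cust_3", 80)]
customers = [("cust_1", "EMEA"), ("cust_2", "APAC"), ("cust_3", "EMEA")]

# "Shuffle" phase: redistribute both tables by join key.
shuffled = {n: {"orders": [], "customers": []} for n in range(NODES)}
for key, amount in orders:
    shuffled[target_node(key)]["orders"].append((key, amount))
for key, region in customers:
    shuffled[target_node(key)]["customers"].append((key, region))

# Local join phase: each node joins only its own slice, with no further
# cross-node traffic.
for node, data in shuffled.items():
    local_join = [(k, amount, region)
                  for k, amount in data["orders"]
                  for k2, region in data["customers"] if k == k2]
    print(f"node {node}: {local_join}")
```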
The analytic DBMS selection process
Figure out what you’re trying to buy
Make a short list
Do free POCs
Evaluate and decide
Figure out what you’re trying to buy
Inventory your use cases
Current
Known future
Wish-list/dream-list future
Set constraints
People and platforms
Money
Establish target SLAs
Must-haves
Nice-to-haves
Short list basics
You might as well consider the incumbent(s)
Cash cost is an easy filter to apply
What is the crux of the deployment effort?
References can be scarce
Free POCs are a great invention
Most of the effort is in the set-up
The better you match your use cases, the more reliable the POC is
You might as well do POCs for several vendors – at (almost) the same time!
Where is the POC being held?
Can you plan this yourself, or do you need outside help?
Evaluate and decide
It all comes down to
Cost
Speed
Risk
and in some cases
Time to value
Upside
Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS 2
contact@monash.com
http://www.DBMS2.com