5e82229e9c7c4ca7975f6726ccce6a37.ppt
- Количество слайдов: 28
Column-Stores vs. Row-stores Harikrishna Bikmal 05 d 05019
Row vs. Column Stores
Column Stores Really good for read-mostly datawarehouses ◦ Lots of column scans and aggregations ◦ Writes tend to be in batch ◦ Top 3 in TPC-H rankings are column stores(Exasol, Par. Accel and Kickfire) ◦ Yahoo’s world largest data warehouse is a column store
Paper Looks At Key Question How much of the buzz around column-stores just marketing hype? ◦ Do you really need to buy Sybase IQ or Vertica? How far will your current row-store take you? Can you get column-store performance from a row-store? Can you simulate a column-store in a rowstore?
Benchmark Star Schema Benchmark (SSMB) ◦ Fact table: 17 columns, 60, 000 rows Table: Line. Order ◦ 4 dimension tables: largest one 80, 000 rows Tables: Customer, Supplier, Part, Date Contains 13 queries divided into four categories or flights Used System-x row-store database and for column store they used C-store
Simulate Column-Store Inside Row-Store
Average Query Time
Vertical Partitioning Tuple overheads Traditional row store takes 4 GB (compressed) Vertical partitioning takes 0. 7 to 1. 1 GB per column(compressed) ** Column joins Expensive hash joins Were unable to use merge join
Vertical Partitioning Horizontal Partitioning Fact table is vertically partitioned using year attribute of date table (dimension table) Most queries in SSMB have predicate on year attribute Vertically partitioned tables do not contain date foreign key (except one table)
Index only plans Common type of query: ◦ SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact. store_id = atores. store_id AND stores. country = “Canada” GROUP BY store_name ◦ Contains predicates on fact table using dimension tables and need to extract SELECT attributes for these tuple ids
Index only plans ◦ Available : Index maps values to tuple ids ◦ Required : Tuple ids to attribute values Tuple construction is slow ! ◦ ** Expensive hash joins of the columns of the fact table before filtering using dimension tables
So. . All indexes approach is a poor way to simulate a column-store Problems with vertical partitioning are NOT fundamental ◦ Store tuple header in a separate partition ◦ Allow virtual TIDs ◦ Allow HP using a foreign key on a different VP So can row-stores simulate columnstores?
Optimal set of Materialized Views
Column-Store Experiments Start with column-store (C-Store) Remove column-store-specific performance optimizations End with column-store with a roworiented query executer
Column Store specific optimizations ◦ Compression ◦ Late Materialization ◦ Block Processing ◦ Invisible Join
Column Store Compression ◦ Higher data value locality in column-stores Better ratio - reduced I/O ◦ Can use schemes like run-length encoding ◦ Easy to operate on directly for improved performance
Column Store Late Materialization ◦ Select and Aggregate operations render the construction of some tuples unnecessary ◦ Delays decompression ◦ Better cache performance Block Iteration ◦ Attributes extraction can be done in blocks instead of iterating on rows
Column Store Invisible Join ◦ Typical query
Invisible Join Typical query Contains selection predicates on fact table using one or more dimension tables, need to extract SELECT attributes for these tuple ids, aggregation using other dimension tables Joins between fact table and dimension tables for each selection predicate, aggregate grouping
Invisible Join Previous approaches either used early materialization or the ones with late materialization had many out of order attribute extraction from the dimension tables
Invisible Join Step 1
Invisible Join Step 2
Invisible Join Step 3
Invisible Join Between-Predicate Rewriting Use of range predicates instead of hash lookup in step 1 of invisible join If the set of keys obtained for join in step 1 results in a complete range of keys Use of dictionary table in case of predicates on sorted fields Invisible join gave up to 75% improvement in some queries
Break down of column store advantages t = block processing (T) C = compression (c) I = Invisible join (i) L = late Materialization(l)
Conclusions Attempt to emulate the physical layout of a column-store in a row-store Broke down the reasons why a columnstore is able to process column-oriented data so efficiently Introduced Invisible join
Questions
Thank You
5e82229e9c7c4ca7975f6726ccce6a37.ppt