
The Making of TPC-DS
Meikel Poess, Oracle Corporation
Raghunath Othayoth Nambiar, Hewlett-Packard Company
August 31, 2006, 32nd International Conference on Very Large Data Bases (VLDB 2006)

Agenda
1. Industry standard benchmark development
2. Limitations of TPC-H
3. Key elements of TPC-DS
4. Current state of the specification
5. Q&A

Benchmark Categories
• Industry standard benchmarks
− Transaction Processing Performance Council (TPC)
− Standard Performance Evaluation Corporation (SPEC)
• Application benchmarks
− SAP, Oracle Apps, JD Edwards, Exchange, Domino
• Special purpose benchmarks
− Dhrystone, Whetstone, Linpack, Iozone, Netperf, Stream

Industry Standard Benchmarks
− Broad industry representation (all decisions taken by the board)
− Verifiable (audit process)
− Domain-specific standard tests
− Resolution of disputes and challenges
TPC benchmarks:
• TPC-C (OLTP), TPC-E (new OLTP)
• TPC-H (DSS), TPC-DS (new DSS)
• TPC-App (dynamic web)
SPEC benchmarks:
• SPEC CPU (integer and floating point)
• SPEC SFS (system file server)
• SPECweb (web server)
• SPECpower (power consumption, new)

Why Benchmarks Are Important
• Vendor point of view
− Define the playing field (measurable, repeatable)
− Enable competitive analysis
− Monitor release-to-release progress
− Results understood by engineering, sales and customers
− Accelerate focused technology development
• Customer point of view
− Cross-vendor comparisons (performance, TCO)
− Evaluate new technologies
− Eliminate costly in-house characterization

Tracking Release-to-Release Progress (chart)
Example: SPEC CPU 2000 benchmark results on the HP ProLiant DL380, 2002 to date.
All SPEC CPU 2000 benchmark results stated above reflect results published as of July 25, 2006. For the latest SPEC CPU 2000 benchmark results, visit www.spec.org/cpu2000/.

Tracking Product-Line Progress (chart)
Example: TPC-C benchmark results on HP ProLiant servers over 10 years.

Competitive Analysis (chart)
Example: Top ten 3,000 GB TPC-H results by performance, as of 12-Sept-2006.

Transaction Processing Performance Council (TPC)

What Makes the TPC Unique
• The TPC is the only benchmark organization that requires price/performance scores across all of its benchmarks
• All tests require full documentation of the components and applications under test, so that the test can be replicated
• The TPC requires an independent audit of results prior to publication
• The TPC tests whole-system performance, not just a piece
• The TPC is database agnostic: Oracle, IBM DB2, Sybase, Microsoft SQL Server, NonStop SQL/MX and other databases
• The TPC provides cross-platform performance comparisons, a view of processor versus real performance, technology comparisons and actual cost-of-performance comparisons

TPC Business Model
• The TPC follows the philosophy of real-world benchmarks, so that its customers can:
− relate their business to the benchmark's business model
− relate their workload to the benchmark's workload
− understand the benchmark

TPC Members
• 20 member companies
• 4 associate members

TPC
• TPC organization
− Subcommittees
• OLTP
• DSS
• Web
• Pricing
− Technical Advisory Board
− Steering Committee
• Benchmark development/maintenance
− 6 face-to-face meetings per year
− Weekly conference calls

New Benchmark Development

Industry Standard Benchmark Development Activities
• Development of a new benchmark in a new domain
• Refinement of existing benchmarks
• New benchmark in an existing domain
− Good benchmarks drive industry and technology forward
− At some point, all reasonable advances have been made
− Benchmarks can become counterproductive by encouraging artificial optimizations
− So, even good benchmarks become obsolete over time
− As technology and user environments evolve, so should the benchmark

Benchmark Lifecycle (diagram)
A new idea leads to requirements; after a feasibility check, a subcommittee is formed; debate within the subcommittee defines a draft spec; the draft spec is implemented in prototypes; evaluation of the prototypes leads to the published benchmark spec; after refinement, vendors publish benchmark results; eventually the benchmark becomes obsolete and results in new requirements.

Industry Standard Benchmark Development, Challenges
• Development cycle
− Benchmark development can take years
• Technology and business could change significantly
• Members have their own agendas
− Hardware vs. software
− Scale-out vs. scale-up
− Proprietary vs. industry standard

New Industry Standard Benchmarks, Challenges
• Unknowns
− How does my product perform under the new load and metric?
• Risk factors
− There are failed benchmarks
• Investments
− TPC benchmark publications are expensive
− Vendors want to keep their existing publications
− Need to train engineers, sales and marketing, customers

TPC-H

TPC-H
• Measures generally applicable aspects of a decision support system
• Its basic ideas have been a standard since 1994
• Currently about 100 results from 14 vendors (system/database) on the website
• Has served industry and academia very well
• Shortcomings in:
− Data model
− Workload model
− Metric

Data Model Shortcomings
• Database schema
− 3rd normal form
− 8 tables
− On average 10 columns per table
− Commonly used database technologies are restricted
• Dataset
− Uniform data distributions
− Synthetic data
− No null values
− Linear scaling of almost all tables with the scale factor
− Unrealistic table sizing: at scale factor 100,000, 20 billion parts are sold to 15 billion customers at a rate of 150 billion orders a year (see the arithmetic below)
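
To see where these numbers come from: TPC-H scales almost all tables linearly with the scale factor SF, using the base cardinalities of the TPC-H specification.

  CUSTOMER = 150,000 × SF
  ORDERS   = 1,500,000 × SF
  PART     = 200,000 × SF

  At SF = 100,000: 150,000 × 100,000 = 15 billion customers,
  1,500,000 × 100,000 = 150 billion orders, and
  200,000 × 100,000 = 20 billion parts.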

Workload Model Shortcomings
• Query workload
− 22 SQL-92 queries
− Simple structure
− Only ad-hoc queries
• Update workload
− Simple insert and delete operations
− No data transformations
− Only two tables are maintained
− Random inserts and deletes based on non-contiguous keys in the dataset

Metric Shortcomings
• Primary performance metric
− Very complex
− Mix of geometric mean and arithmetic mean:

  QphH@Size = sqrt(Power@Size × Throughput@Size)
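
For reference, the two components of the TPC-H composite metric, as defined in the TPC-H specification: the power metric is built on a geometric mean of the 22 query times and 2 refresh times, while the throughput metric is an arithmetic measure of total elapsed time.

  Power@Size      = (3600 × SF) / (Π QI(i,0) × Π RI(j,0))^(1/24)
  Throughput@Size = (S × 22 × 3600 / Ts) × SF
  QphH@Size       = sqrt(Power@Size × Throughput@Size)

  where QI(i,0) and RI(j,0) are the elapsed times of query i and refresh
  function j in the power test, S is the number of throughput streams,
  and Ts is the elapsed time of the throughput test in seconds.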

Objectives for TPC-DS
• Realistic data model
• Complex workload
− Large query set
− ETL-like update model
• Simple and comprehensible metric
• Understandable business model

Data Model

Fact Tables
Store Sales, Store Returns, Catalog Sales, Catalog Returns, Web Sales, Web Returns, Inventory
• 3 sales channels: catalog, web and store
• 7 fact tables: 2 fact tables for each sales channel, plus Inventory
• 24 tables total
• Basic auxiliary data structures are allowed on all tables
• Complex auxiliary data structures are only allowed on Catalog Sales and Catalog Returns

Snowflake Schema (diagram)
The Store_Sales fact table is surrounded by the dimensions Date_Dim, Time_Dim, Item, Store, Promotion, Customer, Customer_Demographics, Household_Demographics, Customer_Address and Income_Band.

Schema: Store Channel with Dimensions (diagram)
Store_Sales joins to Date_Dim, Time_Dim, Item, Store, Promotion, Customer, Customer_Demographics, Household_Demographics and Customer_Address; Income_Band is snowflaked off Household_Demographics.
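
To make the star join concrete, here is a heavily abbreviated DDL sketch of the store channel. The column lists are shortened for illustration and the data types are approximate; the real TPC-DS tables carry many more columns.

-- Abbreviated sketch, not the full TPC-DS DDL
create table date_dim (
  d_date_sk integer primary key,
  d_date    date,
  d_year    integer,
  d_moy     integer                  -- month of year
);

create table item (
  i_item_sk   integer primary key,
  i_item_id   char(16),
  i_color     char(20),
  i_item_desc varchar(200)
);

create table store_sales (
  ss_sold_date_sk integer references date_dim (d_date_sk),
  ss_item_sk      integer references item (i_item_sk),
  ss_customer_sk  integer,           -- references customer
  ss_store_sk     integer,           -- references store
  ss_quantity     integer,
  ss_list_price   decimal(7,2),
  ss_sales_price  decimal(7,2),
  ss_net_paid     decimal(7,2)
);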

Data Model Advantages
• Complex relationships
• Fact-to-fact table relations
• Large number of tables (24)
• Large number of columns (18 per table on average)
• Auxiliary data structures are allowed on a subset of the schema
• Complex queries
• Star and "traditional" execution plans
• Satisfies hardware and software vendors
• Extends the lifetime of the benchmark due to its complexity

Dataset

Database Scaling
• Database size is defined in scale factors
• The scale factor indicates the raw data size in GB
• Auxiliary data structures and temporary storage are not included

Scale Factor   Database Size
1              1 GB
100            100 GB
300            300 GB
1000           1 TB
3000           3 TB
10000          10 TB
30000          30 TB
100000         100 TB

Fact Table Scaling (chart)
• Fact tables scale linearly with the scale factor

Database Scaling (Dimensions)
• Dimensions scale sub-linearly
• They account for only a fraction of the size of the fact tables

Table Sizes at SF 100 GB

Table            #Rows         Percent of Total
Store Sales      288 million   39
Store Returns    28.8 million  3.4
Catalog Sales    144 million   30
Catalog Returns  14.4 million  2.4
Web Sales        72 million    15
Web Returns      7.2 million   1
Inventory        390 million   9
Customer         2 million     0.5
Item             100,000       0.1
Catalog Page     24,000        0.002
Remaining        3.3 million   0.005

Data Content
• Some data has "real world" content:
− Last names: "Sanchez", "Ward", "Roberts"
− Addresses: "630 Railroad, Woodbine, Sullivan County, MO-64253"
• Data is skewed:
− Sales are modeled after US census data
− More green items than red
− Small and large cities

Sales Distribution (chart)
• 14% of all sales happen between January and July (Group 1)
• 28% of all sales happen between August and October (Group 2)
• 58% of all sales happen between November and December (Group 3)

Color Distribution (chart)
• 8% of all colors are in Group 1
• 24% of all colors are in Group 2
• 68% of all colors are in Group 3

Dataset Advantages
• Realistic table scaling
• Real-world data content
• Non-uniform distributions, which are challenging for:
− statistics collection
− the query optimizer

Query Model

Query Model
• Queries are designed to be realistic. They:
− Answer real-world questions
− Cover the system's functionality relevant to decision support applications
− Only allow tuning methods available to a DBA
− Cover all of the data, so that unrealistic creation of auxiliary data structures is unlikely
− Yet impose a controlled and repeatable workload

Query Templates
• TPC-DS requires a large query set
− e.g., a 100 TB benchmark runs 1,089 queries
• Queries are written in a query template language
• Each query template is unique
• Queries are automatically generated from the query templates
• More information about the query generator: Meikel Poess, John M. Stephens: Generating Thousand Benchmark Queries in Seconds. VLDB 2004: 1045-1053

Query Model
• Query language: SQL-99 + OLAP extensions
• Queries need to be executed "as is"
− No hints or rewrites allowed, except when approved by the TPC
• 99 different query templates
• 4 different query types:

Type         Simulates                                Implemented via                                 #Templates
Reporting    Finely tuned, recurring queries          Access the catalog sales channel tables        38
Ad-hoc       Sporadic queries, minimal tuning         Access the store and web sales channel tables  47
Iterative    Users issuing sequences of queries       Sequences of queries, each adding SQL elements 4
Data mining  Queries feeding data mining tools        Return large numbers of rows                   10

Ad-Hoc Query (template)

select i_item_id, s_state, grouping(s_state) g_state,
       avg(ss_quantity) agg1, avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, store, item
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_store_sk = s_store_sk
  and ss_cdemo_sk = cd_demo_sk
  and cd_gender = '[GEN]'
  and cd_marital_status = '[MS]'
  and cd_education_status = '[ES]'
  and d_year = [YEAR]
  and s_state in ('[STATE_A]', '[STATE_B]', '[STATE_C]',
                  '[STATE_D]', '[STATE_E]', '[STATE_F]')
group by rollup (i_item_id, s_state);
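
When the query generator instantiates this template, each bracketed token is replaced by a value drawn from the corresponding domain. One possible instantiation (the substituted values below are made up for illustration):

select i_item_id, s_state, grouping(s_state) g_state,
       avg(ss_quantity) agg1, avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, store, item
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_store_sk = s_store_sk
  and ss_cdemo_sk = cd_demo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'M'
  and cd_education_status = 'College'
  and d_year = 2001
  and s_state in ('TN', 'SD', 'MO', 'AL', 'MI', 'OH')
group by rollup (i_item_id, s_state);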

Reporting Query (template)

select count(distinct cs_order_number) as "order count",
       sum(cs_ext_ship_cost) as "total shipping cost",
       sum(cs_net_profit) as "total net profit"
from catalog_sales cs1, date_dim, customer_address, call_center
where d_date between '[YEAR]-[MONTH]-01'
                 and (cast('[YEAR]-[MONTH]-01' as date) + 60)
  and cs1.cs_ship_date_sk = d_date_sk
  and cs1.cs_ship_addr_sk = ca_address_sk
  and ca_state = '[STATE]'
  and cs1.cs_call_center_sk = cc_call_center_sk
  and cc_county in ('[COUNTY_A]', '[COUNTY_B]', '[COUNTY_C]',
                    '[COUNTY_D]', '[COUNTY_E]')
  and exists (select *
              from catalog_sales cs2
              where cs1.cs_order_number = cs2.cs_order_number
                and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
  and not exists (select *
                  from catalog_returns cr1
                  where cs1.cs_order_number = cr1.cr_order_number);

Iterative Query, Part 1

with frequent_ss_items as (
  select substr(i_item_desc, 1, 30) itemdesc, i_item_sk item_sk,
         d_date solddate, count(*) cnt
  from store_sales, date_dim, item
  where ss_sold_date_sk = d_date_sk
    and ss_item_sk = i_item_sk
    and d_year in ([YEAR], [YEAR]+1, [YEAR]+2, [YEAR]+3)
  group by substr(i_item_desc, 1, 30), i_item_sk, d_date
  having count(*) > 4),
max_store_sales as (
  select max(csales) cmax
  from (select c_customer_sk, sum(ss_quantity*ss_sales_price) csales
        from store_sales, customer, date_dim
        where ss_customer_sk = c_customer_sk
          and ss_sold_date_sk = d_date_sk
          and d_year in ([YEAR], [YEAR]+1, [YEAR]+2, [YEAR]+3)
        group by c_customer_sk) x),
best_ss_customer as (
  select c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
  from store_sales, customer
  where ss_customer_sk = c_customer_sk
  group by c_customer_sk
  having sum(ss_quantity*ss_sales_price) >
         0.95 * (select * from max_store_sales))
select sum(sales)
from ((select cs_quantity*cs_list_price sales
       from catalog_sales, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and cs_sold_date_sk = d_date_sk
         and cs_item_sk in (select item_sk from frequent_ss_items)
         and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer))
      union all
      (select ws_quantity*ws_list_price sales
       from web_sales, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and ws_sold_date_sk = d_date_sk
         and ws_item_sk in (select item_sk from frequent_ss_items)
         and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer))) y;

Iterative Query, Part 2 (reuses the WITH definitions from Part 1)

select c_last_name, c_first_name, sales
from ((select c_last_name, c_first_name,
              sum(cs_quantity*cs_list_price) sales
       from catalog_sales, customer, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and cs_sold_date_sk = d_date_sk
         and cs_item_sk in (select item_sk from frequent_ss_items)
         and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
         and cs_bill_customer_sk = c_customer_sk
       group by c_last_name, c_first_name)
      union all
      (select c_last_name, c_first_name,
              sum(ws_quantity*ws_list_price) sales
       from web_sales, customer, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and ws_sold_date_sk = d_date_sk
         and ws_item_sk in (select item_sk from frequent_ss_items)
         and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer)
         and ws_bill_customer_sk = c_customer_sk
       group by c_last_name, c_first_name)) y;

Iterative Query, Part 3 (adds a ship-to-bill predicate to Part 2)

select c_last_name, c_first_name, sales
from ((select c_last_name, c_first_name,
              sum(cs_quantity*cs_list_price) sales
       from catalog_sales, customer, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and cs_sold_date_sk = d_date_sk
         and cs_item_sk in (select item_sk from frequent_ss_items)
         and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
         and cs_bill_customer_sk = c_customer_sk
         and cs_bill_customer_sk = cs_ship_customer_sk
       group by c_last_name, c_first_name)
      union all
      (select c_last_name, c_first_name,
              sum(ws_quantity*ws_list_price) sales
       from web_sales, customer, date_dim
       where d_year = [YEAR] and d_moy = [MONTH]
         and ws_sold_date_sk = d_date_sk
         and ws_item_sk in (select item_sk from frequent_ss_items)
         and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer)
         and ws_bill_customer_sk = c_customer_sk
         and ws_bill_customer_sk = ws_ship_customer_sk
       group by c_last_name, c_first_name)) y;
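
The fourth query class, data mining, has no example on these slides. As a purely hypothetical sketch (not an official TPC-DS template), such a query returns a large, fine-grained result set for consumption by an external mining tool:

select c_customer_sk, c_birth_year, ca_state,
       count(*) num_purchases,
       sum(ss_quantity) total_quantity,
       sum(ss_net_paid) total_paid
from store_sales, customer, customer_address
where ss_customer_sk = c_customer_sk
  and c_current_addr_sk = ca_address_sk
group by c_customer_sk, c_birth_year, ca_state;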

Query Model Advantages
• SQL-99 with OLAP extensions
• Query templates allow for the generation of thousands of different queries
• Combines different query classes:
− Ad-hoc
− Reporting
− Iterative
− Data mining
• Star-schema and "traditional" query execution

Execution Rules

Benchmark Execution
Phases: System Setup (un-timed), then Database Load, Query Run #1, Data Maintenance and Query Run #2 (timed).

• System Setup: setup of servers/operating system, storage arrays including RAID, networks, database software and flat files
• Database Load: load of raw data; creation of system tables, table spaces, file groups and log files; creation of auxiliary data structures; statistics gathering (optional)
• Query Run #1: n streams run concurrently, each stream running 99 queries with different, random query substitutions; simulates n concurrent users
• Data Maintenance: load into fact tables; delete from fact tables; maintain slowly changing dimensions
• Query Run #2: repeat of Query Run #1

Benchmark Execution: Query Streams
Each query run executes n streams concurrently; every stream runs the 99 queries in its own permuted order, e.g.:

Stream 1: Q3, Q21, Q11, Q3, Q8, …, Q47, Q99
Stream 2: Q1, Q55, Q4, Q14, Q9, …, Q12, Q3, Q69
Stream n: Q94, Q3, Q1, Q84, …, Q34, Q23

The minimum number of streams depends on the scale factor:

Scale Factor   Number of Streams
1              n/a
100            3
300            5
1000           7
3000           9
10000          11
30000          13
100000         15

Benchmark Execution: Data Maintenance Flow (diagram)
Flat files → read → Transformation → load → Data Warehouse tables (all inside the DBMS)

Benchmark Execution (diagram)
Un-timed: System Setup. Timed: Database Load, Query Run #1, Data Maintenance, Query Run #2.
Concurrent streams in each query run, e.g.:
Stream 1: Q3, Q21, Q11, Q3, Q8, …, Q47, Q99
Stream 2: Q1, Q55, Q4, Q14, Q9, …, Q12, Q3, Q69
Stream n: Q94, Q3, Q1, Q84, …, Q34, Q23

Database Load
• Simulates a data warehouse reload
• Measures the system's ability to:
− Load data
− Create auxiliary data structures
− Gather statistics
• Is part of the metric because:
− Data warehouses get recreated
− It prevents the use of unrealistic auxiliary data structures
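
As an illustration of what "auxiliary data structures" and "statistics gathering" can mean in practice, a hedged sketch; the exact commands are DBMS-specific and not prescribed by the benchmark:

-- Hypothetical load-phase steps; syntax varies by DBMS
create index ss_sold_date_idx on store_sales (ss_sold_date_sk);
create index ss_item_idx on store_sales (ss_item_sk);

-- Gather optimizer statistics (Oracle-style legacy syntax shown)
analyze table store_sales compute statistics;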

Execution: Query Run #1
• Simulates execution of queries by multiple concurrent users
• Measures the system's ability to:
− Process concurrent query executions in the least amount of time
− Allocate resources efficiently among multiple concurrent users
• A minimum number of streams is required
• The maximum number of streams is not limited

Data Maintenance
• Simulates incremental raw data feeds from an OLTP system
• Includes the cost of maintaining auxiliary data structures in the timed phase
• The amount of data loaded is linear in the number of streams, which guarantees the significance of the data maintenance phase
• The approach is database-centric: no ETL tools
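
A rough sketch of the flavor of database-centric maintenance (the staging table name s_store_sales and the date range are hypothetical; the actual data maintenance functions are defined in the specification):

-- Insert newly arrived fact rows from a flat-file staging table
insert into store_sales
select * from s_store_sales;

-- Delete fact rows whose sales date falls into a given range
delete from store_sales
where ss_sold_date_sk in
      (select d_date_sk
       from date_dim
       where d_date between date '2000-01-01' and date '2000-01-14');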

Execution: Query Run #2
• Rerun of Query Run #1
• Measures the system's ability to repeat the results of Query Run #1 after Data Maintenance

Metric

Primary Metrics
• Three primary metrics:
− Queries per hour (QphDS)
− Price per query
− System availability
• Queries per hour:

  QphDS = (S × 198 × 3600 × SF) / (TQ1 + TQ2 + TDM + 0.01 × S × TLOAD)

− S: number of query streams
− SF: scale factor
− TQ1 and TQ2: elapsed times to complete query runs #1 and #2
− TDM: elapsed time to complete the data maintenance
− TLOAD: total elapsed time to complete the database load

Metric Explanation

  QphDS = (S × 198 × 3600 × SF) / (TQ1 + TQ2 + TDM + 0.01 × S × TLOAD)

• Numerator:
− S × 198 normalizes the result to the number of queries (99 queries × 2 runs × S streams)
− 3600 normalizes the result to hours
− SF normalizes the result to the scale factor
• Denominator:
− The factor 0.01 weights the load at 1%
− The factor S avoids diminishing the weight of the load as the number of streams grows
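
A worked example with made-up round numbers (purely illustrative): S = 3 streams at SF = 100, with TQ1 = TQ2 = 2 hours, TDM = 1 hour and TLOAD = 10 hours, all times expressed in seconds:

  QphDS = (3 × 198 × 3600 × 100) / ((7200 + 7200 + 3600) + 0.01 × 3 × 36000)
        = 213,840,000 / (18,000 + 1,080)
        ≈ 11,208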

Current Status of the TPC-DS Spec

Status of TPC-DS (lifecycle diagram)
The same lifecycle as before: new idea → requirements → subcommittee → draft spec → prototypes → benchmark spec → vendors publish results → obsolescence leads to new requirements.

More Information
• Specification: http://www.tpc.org/tpcds/default.asp
• Benchmark tools (will be available on the website soon):
− Dbgen
− Query templates

Q&A