Скачать презентацию Handling big dimensions in distributed data warehouses using Скачать презентацию Handling big dimensions in distributed data warehouses using

7b085759ed91917fd661af5d211d5a03.ppt

  • Количество слайдов: 23

Handling big dimensions in distributed data warehouses using the DWS technique Marco Costa DEI Handling big dimensions in distributed data warehouses using the DWS technique Marco Costa DEI – CISUC – University of Coimbra Critical Software S. A. Dependable Technologies for Critical Systems Copyright Critical Software S. A. 1998 -2003 All Rights Reserved.

Agenda n n Introduction The DWS technique n n n Description Problems with big Agenda n n Introduction The DWS technique n n n Description Problems with big dimensions The Selective Loading technique Experimental Results Conclusions © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 2

Critical Software Inc. Company Profile n International Software Engineering company. n Founded in 1998, Critical Software Inc. Company Profile n International Software Engineering company. n Founded in 1998, offices in Portugal, US, UK. n Entrepreneurial and independent SME. n Staff of 100, software engineers, Msc’s, Phd’s. Figures n Turnover of US 6 M (2004). n International market represents +70%. n Profitable since foundation (ebit= 17%, 2003). Headquarters, Portugal Quality, R&D n ISO 9001: 2000 Tick-IT certified (only in Iberia). n ISO 15504 / CMM level 3 n R&D focused, Patents submitted © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 3

Introduction n n Companies produce and store more and more data Data Warehouses have Introduction n n Companies produce and store more and more data Data Warehouses have large and continuously growing volumes of data to process High performance in query execution is crucial to enable interactivity in OLAP process Typically the performance is achieved through very expensive hardware platforms (e. g. high end servers) © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 4

Introduction n Parallel processing has been explored as one of the solutions to support Introduction n Parallel processing has been explored as one of the solutions to support large DW n n Intra-query parallelism Distributed DW n n For geographical reasons For performance n n n Load balancing of data Query execution Reduce communication between nodes © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 5

The DWS Technique n Distribution of a DW through a cluster of “low cost The DWS Technique n Distribution of a DW through a cluster of “low cost computers” n n n n Data partition technique Query re-write and parallel execution technique Approximated query answering Shared-nothing architecture – Federated Conceived specifically for data warehouses implemented with star-schema model High scalability Near linear speed up for data aggregation queries © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 6

The DWS Technique n Data partitioning / data placement n n n All nodes The DWS Technique n Data partitioning / data placement n n n All nodes have the same data model Dimension tables are replicated Fact tables are distributed through all nodes in an uniform way n n Row by row Random © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 7

The DWS Technique n Data partitioning / data placement n Row by row example The DWS Technique n Data partitioning / data placement n Row by row example © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 8

The DWS Technique n Query re-write n Partition the queries in steps: n n The DWS Technique n Query re-write n Partition the queries in steps: n n Partial Query (independently executed in each node) Merge Query Some queries might require more than one step Execution tree optimizer – determines the steps that need to be executed independently or can be included in the upper query © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 9

The DWS Technique n Query Re-write (example for 2 nodes) n A typical data The DWS Technique n Query Re-write (example for 2 nodes) n A typical data aggregation query: select t. calendar_month_desc "Month", Dimensions c. cust_city "City", p. prod_category "Category", avg(s. quantity_sold) "Quantity", Facts (aggregated) avg(s. amount_sold) "Amount" from sales s, customers c, times t, products p where s. time_id = t. time_id and s. cust_id = c. cust_id and s. prod_id = p. prod_id and t. calendar_year = 2000 group by t. calendar_month_desc, c. cust_city, p. prod_category © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 10

The DWS Technique n Query Re-write (example for 2 nodes) n Partial Query sent The DWS Technique n Query Re-write (example for 2 nodes) n Partial Query sent to all nodes: create table dws 110517101718101 as select t. calendar_month_desc, c. cust_city, p. prod_category, sum(s. quantity_sold) as dws 1_sum, count(s. quantity_sold) as dws 1_count, Collect partial sum(s. amount_sold) as dws 2_sum, count(s. amount_sold) as dws 2_count aggregations from sales s, customers c, times t, products p where s. time_id = t. time_id and s. cust_id = c. cust_id and s. prod_id = p. prod_id and t. calendar_year = 2000 group by t. calendar_month_desc, c. cust_city, p. prod_category © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 11

The DW-SP Technology n Query Re-write (example for 2 nodes) n Merge Query – The DW-SP Technology n Query Re-write (example for 2 nodes) n Merge Query – merge the partial results: create table dws_finalmerge_ as (select * from dws [email protected] 1 union all select * from dws [email protected] 2) select calendar_month_desc "month", cust_city "city", prod_category "category", sum(dws 1_sum) / sum(dws 1_count) "quantity", sum(dws 2_sum) / sum(dws 2_count) "amount" from dws_finalmerge_ group by calendar_month_desc, cust_city, prod_category © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. Gather partial Results Build final results Merge aggregations 12

The DWS Technique n Achievements n n n Optimal data load balance Optimal work The DWS Technique n Achievements n n n Optimal data load balance Optimal work load balance For each query each node processes the same amount of data as all the others, mostly within its local data Low communication between nodes High scalability n n n Near linear speed-up Nead linear scale-up Tested with APB 1 benchmark (Olap Council) and 10 nodes © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 13

The DWS Technique n The problem n n n Replication of dimension tables is The DWS Technique n The problem n n n Replication of dimension tables is not typically a problem (dimension tables represent 5% to 10% of the data) Business with big dimensions can not apply DWS The businesses that have big dimensions have high potential (e. g. airlines, telecoms, ebusiness) © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 14

The Selective Load Technique n Selective load the dimension tables n n Typical OLAP The Selective Load Technique n Selective load the dimension tables n n Typical OLAP aggregate facts according to restrictions applied to dimensions The join between facts and dimensions only need the dimension rows that exist in both tables Do not replicate the big dimension tables Load only the necessary rows to each node © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 15

The Selective Load Technique n Selective load the dimension tables n Example: Node of The Selective Load Technique n Selective load the dimension tables n Example: Node of a DWS cluster © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 16

The Selective Load Technique n High reduction of the number of rows to load The Selective Load Technique n High reduction of the number of rows to load to each node n Big dimensions n n n High number of rows (absolute size) Significant percentage of the number of rows in fact tables Produce sparse models (passenger in a flight company) Rows in the dimension table are related with low number of facts Worst scenario is having has many dimension rows as facts in each node © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 17

The Selective Load Technique n Dimension browsing queries? n n There’s not a complete The Selective Load Technique n Dimension browsing queries? n n There’s not a complete version of the big dimension table The union of all selective load partitions of the dimension table does not give a complete version of the dimension table Dimension rows with no fact won’t be loaded at all Apply the DWS data partitioning algorithm to the big dimension n Create a partitioned version of the dimension table distributed through all nodes Enables the dimension queries to benefit of DWS speed up and scale up Dimension browsing queries aiming big dimension will be executed in parallel by all nodes © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 18

Experimental Results n Experiments with TPC-H n n Facts: Lineitem Big Dimension: Orders Dimensions: Experimental Results n Experiments with TPC-H n n Facts: Lineitem Big Dimension: Orders Dimensions: Customer, Supplier, Region, Nation, Part Scenarios n n n Single Node – Centralized DB for reference DWS (5, 10, 20) – DWS with replication of dimensions for 5, 10 and 20 nodes DWS_SL (5, 10, 20) – DWS with selective load of big dimension for 5, 10 and 20 nodes © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 19

Experimental Results n Storage per node n n Replication of big dimension has a Experimental Results n Storage per node n n Replication of big dimension has a high impact Selective load reduces significantly the data volume Line. Item Orders_dist Total Single Node 3576, 25 1573, 56 5149, 82 DWS_5 715, 25 1573, 56 2288, 81 DWS_SL_5 715, 25 557, 97 314, 71 1587, 93 DWS_10 357, 63 1573, 56 1931, 19 DWS_SL_10 357, 63 312, 10 157, 36 827, 09 DWS_20 178, 81 1573, 56 1752, 38 DWS_SL_20 178, 81 157, 01 78, 68 414, 51 Table size (MB) © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 20

Experimental Results n Performance n n DWS speed up is inexistent due to the Experimental Results n Performance n n DWS speed up is inexistent due to the replication of the big dimension DWS_SL speed up is near linear © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 21

Conclusions n n DWS is a technique to distribute data warehouses through a cluster Conclusions n n DWS is a technique to distribute data warehouses through a cluster of (low cost) computers with near linear speed up and scale up for star schema models and aggregations queries The current work enables the use of the DWS technique for star schema models with large dimensions with linear speed up and scale up. Enables browsing dimension queries to experience the advantages of parallel execution in a DWS system. © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 22

Questions and Contacts Marco Costa, mcosta@criticalsoftware. com Henrique Madeira, henrique@dei. uc. pt Critical Software, Questions and Contacts Marco Costa, [email protected] com Henrique Madeira, [email protected] uc. pt Critical Software, S. A. Parque Industrial de Taveiro, Lote 48 3045 -504 Coimbra, PORTUGAL Tel +351 239989100, Fax+351 239989119 Critical Software Inc. 111 North Market Street, Suite 670 San Jose, California, USA, 95113 Tel. +1(408)9711231, Fax: +1(408)3513330 © Copyright Critical Software S. A. 1998 -2003 All Rights Reserved. 23