Скачать презентацию What is a Data Warehouse A single complete Скачать презентацию What is a Data Warehouse A single complete

e3534954f9dd65a7b43163dea47a7bc2.ppt

  • Количество слайдов: 31

What is a Data Warehouse? A single, complete and consistent store of data obtained What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand use in a business context. [Barry Devlin]

Data Warehouse Architecture Relational Databases Optimized Loader ERP Systems Extraction Cleansing Data Warehouse Engine Data Warehouse Architecture Relational Databases Optimized Loader ERP Systems Extraction Cleansing Data Warehouse Engine Purchased Data Legacy Data Metadata Repository Analyze Query

Data Warehouse for Decision Support & OLAP • Putting Information technology to help the Data Warehouse for Decision Support & OLAP • Putting Information technology to help the knowledge worker make faster and better decisions – Which of my customers are most likely to go to the competition? – What product promotions have the biggest impact on revenue? – How did the share price of software companies correlate with profits over last 10 years?

Why Separate Data Warehouse? • Performance – Op dbs designed & tuned for known Why Separate Data Warehouse? • Performance – Op dbs designed & tuned for known txs & workloads. – Complex OLAP queries would degrade perf. for op txs. – Special data organization, access & implementation methods needed for multidimensional views & queries. • Function – Missing data: Decision support requires historical data, which op dbs do not typically maintain. – Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. – Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) – e. g. , average amount spent on phone calls between 9 AM-5 PM in Pune during the month of December

To summarize. . . • OLTP Systems are used to “run” a business • To summarize. . . • OLTP Systems are used to “run” a business • The Data Warehouse helps to “optimize” the business

Schema Design • Database organization – must look like business – must be recognizable Schema Design • Database organization – must look like business – must be recognizable by business user – approachable by business user – Must be simple • Schema Types – Star Schema – Fact Constellation Schema – Snowflake schema

Dimension Tables • Dimension tables – Define business in terms already familiar to users Dimension Tables • Dimension tables – Define business in terms already familiar to users – Wide rows with lots of descriptive text – Small tables (about a million rows) – Joined to fact table by a foreign key – heavily indexed – typical dimensions • time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table • Central table – mostly raw numeric items – narrow rows, a Fact Table • Central table – mostly raw numeric items – narrow rows, a few columns at most – large number of rows (millions to a billion) – Access via dimensions

Star Schema • A single fact table and for each dimension one dimension table Star Schema • A single fact table and for each dimension one dimension table • Does not capture hierarchies directly T i m e c u s t date, custno, prodno, cityname, . . . f a c t p r o d c i t y

Snowflake schema • Represent dimensional hierarchy directly by normalizing tables. • Easy to maintain Snowflake schema • Represent dimensional hierarchy directly by normalizing tables. • Easy to maintain and saves storage T i m e c u s t p r o d date, custno, prodno, cityname, . . . f a c t c i t y r e g i o n

Fact Constellation • Fact Constellation – Multiple fact tables that share many dimension tables Fact Constellation • Fact Constellation – Multiple fact tables that share many dimension tables – Booking and Checkout may share many dimension tables in the hotel industry Promotion Hotels Travel Agents Booking Checkout Customer Room Type

De-normalization • Normalization in a data warehouse may lead to lots of small tables De-normalization • Normalization in a data warehouse may lead to lots of small tables • Can lead to excessive I/O’s since many tables have to be accessed • De-normalization is the answer especially since updates are rare

Indexing Techniques • Exploiting indexes to reduce scanning of data is of crucial importance Indexing Techniques • Exploiting indexes to reduce scanning of data is of crucial importance • Bitmap Indexes • Join Indexes • Bitsliced Indexing • Projection Indexing • Positional Indexing

Indexing Techniques • Bitmap index: – A collection of bitmaps -- one for each Indexing Techniques • Bitmap index: – A collection of bitmaps -- one for each distinct value of the column – Each bitmap has N bits where N is the number of rows in the table – A bit corresponding to a value v for a row r is set if and only if r has the value for the indexed attribute

Bit. Map Indexes • An alternative representation of RID-list • Specially advantageous for low-cardinality Bit. Map Indexes • An alternative representation of RID-list • Specially advantageous for low-cardinality domains • Represent each row of a table by a bit and the table as a bit vector • There is a distinct bit vector Bv for each value v for the domain • Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits

Bitmap Index M Y 0 1 0 F Y 1 1 1 F N Bitmap Index M Y 0 1 0 F Y 1 1 1 F N 1 0 0 M N 0 0 0 F Y 1 1 1 F N 1 0 0 Customer Query : select * from customer where gender = ‘F’ and vote = ‘Y’

Bit Map Index Base Table Customers where Region Index Region = W Rating Index Bit Map Index Base Table Customers where Region Index Region = W Rating Index And Rating = M

Bit. Map Indexes • Comparison, join and aggregation operations are reduced to bit arithmetic Bit. Map Indexes • Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time • Significant reduction in space and I/O (30: 1) • Adapted for higher cardinality domains as well. • Compression (e. g. , run-length encoding) exploited • Products that support bitmaps: Model 204, Target. Index (Redbrick), IQ (Sybase), Oracle 7. 3

Join Indexes • Pre-computed joins • A join index between a fact table and Join Indexes • Pre-computed joins • A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute – e. g. , a join index on city dimension of calls fact table – correlates for each city the calls (in the calls table) from that city

Join Indexes • Join indexes can also span multiple dimension tables – e. g. Join Indexes • Join indexes can also span multiple dimension tables – e. g. , a join index on city and time dimension of calls fact table

Star Join Processing • Use join indexes to join dimension and fact table Calls Star Join Processing • Use join indexes to join dimension and fact table Calls C+T Time C+T+L Location Plan C+T+L +P

Optimized Star Join Processing Time Apply Selections Location Plan Calls Virtual Cross Product of Optimized Star Join Processing Time Apply Selections Location Plan Calls Virtual Cross Product of T, L and P

Bitmapped Join Processing Bitmaps Calls 1 0 1 Location Calls 0 0 1 Plan Bitmapped Join Processing Bitmaps Calls 1 0 1 Location Calls 0 0 1 Plan Calls Time 1 1 0 AND

Indexing Techniques • Variant Indexes [O’Neil & Quass, 97] – B+-tree (all systems): • Indexing Techniques • Variant Indexes [O’Neil & Quass, 97] – B+-tree (all systems): • RID-list for each search-key value – Bitmapped (almost all systems): • Bit-vectors (usually compressed) instead of RID-lists – Projection (Sybase IQ): • Mirror copy of column – Bit-sliced (Sybase IQ): • “bit-level” projection index • Join Indexes [Valduriez et. al. , 86] – Bitmapped-Join Indexes [O’Neil & Graefe ‘ 95] (Informix) • Limitations: – These structures are maintained in addition to the base table. – Query response times are unacceptable in interactive contexts.

The Data. Index • Both a storage and an access structure – Indexing comes The Data. Index • Both a storage and an access structure – Indexing comes for “free” • Based on, and extends the notions of vertical partitioning and transposed files • Two kinds presented here: – Basic Data. Index (BDI) – Join Data. Index (JDI)

The Basic Data. Index (BDI) • Projection-like Index with matching column removed from the The Basic Data. Index (BDI) • Projection-like Index with matching column removed from the table • Can have multiple columns (e. g. , no TPC-D query asks for Ext. Price or Discount alone) Base Table Cust. Key CK 1 CK 2 CK 3 CK 4 BDI Cust. Key CK 1 CK 2 CK 3 CK 4 Qty Q 1 Q 2 Q 3 Q 4 Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4 BDI Qty Q 1 Q 2 Q 3 Q 4 BDI Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4

The Join Data. Index (JDI) – JDI is “BDI” of RIDs to foreign table. The Join Data. Index (JDI) – JDI is “BDI” of RIDs to foreign table. Base Dimension Table Name Address N 1 A 1 N 2 A 2 N 3 A 3 BDI Name N 1 N 2 N 3 BDI Address A 1 A 2 A 3 Cust. Key CK 1 CK 2 CK 3 BDI Cust. Key CK 1 CK 2 CK 3 Base Fact Table Cust. Key CK 1 CK 2 CK 3 JDI Tax T 1 T 2 T 3 T 4 Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4 BDI RIDs RID 1 RID 2 RID 3 – Joins can be processed efficiently Tax T 1 T 2 T 3 T 4 BDI Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4

Maintaining Logical Records • Order of records is conserved in Data. Indexes • A Maintaining Logical Records • Order of records is conserved in Data. Indexes • A simple arithmetic mapping is used to associate fields of a record • Records in each vertical partition can easily be mapped to blocks and vice-versa • RID: (Block ID, Slot Number within that Block) – (Block ID, Slot Number) to Position: – Position to (Block ID, Slot Number):

Query Processing with Data. Indexes • Two common classes of queries in data warehousing: Query Processing with Data. Indexes • Two common classes of queries in data warehousing: – Range queries – Star join queries • Example range query: SALES Table BDI SELECT Cust. Key FROM SALES WHERE Qty>10 • Steps: – Apply restrictions to form rowset(s) – Load display BDI(s) into memory – Display values Rowset 1 0 0 1 Cust. Key CK 1 CK 2 CK 3 CK 4 BDI Qty 25 5 7 15 BDI Cust. Key CK 1 CK 2 CK 3 CK 4

Star Join Queries • A fact table is joined with a set of dimension Star Join Queries • A fact table is joined with a set of dimension tables: SELECT Column-list FROM Fact. Table, Dimension. Tables WHERE Selection. Predicates AND Join. Predicates: “Fact. Attr 1 = Dimension. Attr 2” General Technique Used to Evaluate: 1. Apply Selection. Predicates on individual tables. 2. Perform Join on restricted set of rows or rowsets.