What is a Data Warehouse A single complete

Скачать презентацию What is a Data Warehouse A single complete

e3534954f9dd65a7b43163dea47a7bc2.ppt

Количество слайдов: 31

What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand use in a business context. [Barry Devlin]

Data Warehouse Architecture Relational Databases Optimized Loader ERP Systems Extraction Cleansing Data Warehouse Engine Purchased Data Legacy Data Metadata Repository Analyze Query

Data Warehouse for Decision Support & OLAP • Putting Information technology to help the knowledge worker make faster and better decisions – Which of my customers are most likely to go to the competition? – What product promotions have the biggest impact on revenue? – How did the share price of software companies correlate with profits over last 10 years?

Why Separate Data Warehouse? • Performance – Op dbs designed & tuned for known txs & workloads. – Complex OLAP queries would degrade perf. for op txs. – Special data organization, access & implementation methods needed for multidimensional views & queries. • Function – Missing data: Decision support requires historical data, which op dbs do not typically maintain. – Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. – Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) – e. g. , average amount spent on phone calls between 9 AM-5 PM in Pune during the month of December

To summarize. . . • OLTP Systems are used to “run” a business • The Data Warehouse helps to “optimize” the business

Schema Design • Database organization – must look like business – must be recognizable by business user – approachable by business user – Must be simple • Schema Types – Star Schema – Fact Constellation Schema – Snowflake schema

Dimension Tables • Dimension tables – Define business in terms already familiar to users – Wide rows with lots of descriptive text – Small tables (about a million rows) – Joined to fact table by a foreign key – heavily indexed – typical dimensions • time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table • Central table – mostly raw numeric items – narrow rows, a few columns at most – large number of rows (millions to a billion) – Access via dimensions

Star Schema • A single fact table and for each dimension one dimension table • Does not capture hierarchies directly T i m e c u s t date, custno, prodno, cityname, . . . f a c t p r o d c i t y

Snowflake schema • Represent dimensional hierarchy directly by normalizing tables. • Easy to maintain and saves storage T i m e c u s t p r o d date, custno, prodno, cityname, . . . f a c t c i t y r e g i o n

Fact Constellation • Fact Constellation – Multiple fact tables that share many dimension tables – Booking and Checkout may share many dimension tables in the hotel industry Promotion Hotels Travel Agents Booking Checkout Customer Room Type

De-normalization • Normalization in a data warehouse may lead to lots of small tables • Can lead to excessive I/O’s since many tables have to be accessed • De-normalization is the answer especially since updates are rare

Indexing Techniques • Exploiting indexes to reduce scanning of data is of crucial importance • Bitmap Indexes • Join Indexes • Bitsliced Indexing • Projection Indexing • Positional Indexing

Indexing Techniques • Bitmap index: – A collection of bitmaps -- one for each distinct value of the column – Each bitmap has N bits where N is the number of rows in the table – A bit corresponding to a value v for a row r is set if and only if r has the value for the indexed attribute

Bit. Map Indexes • An alternative representation of RID-list • Specially advantageous for low-cardinality domains • Represent each row of a table by a bit and the table as a bit vector • There is a distinct bit vector Bv for each value v for the domain • Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits

Bitmap Index M Y 0 1 0 F Y 1 1 1 F N 1 0 0 M N 0 0 0 F Y 1 1 1 F N 1 0 0 Customer Query : select * from customer where gender = ‘F’ and vote = ‘Y’

Bit Map Index Base Table Customers where Region Index Region = W Rating Index And Rating = M

Bit. Map Indexes • Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time • Significant reduction in space and I/O (30: 1) • Adapted for higher cardinality domains as well. • Compression (e. g. , run-length encoding) exploited • Products that support bitmaps: Model 204, Target. Index (Redbrick), IQ (Sybase), Oracle 7. 3

Join Indexes • Pre-computed joins • A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute – e. g. , a join index on city dimension of calls fact table – correlates for each city the calls (in the calls table) from that city

Join Indexes • Join indexes can also span multiple dimension tables – e. g. , a join index on city and time dimension of calls fact table

Star Join Processing • Use join indexes to join dimension and fact table Calls C+T Time C+T+L Location Plan C+T+L +P

Optimized Star Join Processing Time Apply Selections Location Plan Calls Virtual Cross Product of T, L and P

Bitmapped Join Processing Bitmaps Calls 1 0 1 Location Calls 0 0 1 Plan Calls Time 1 1 0 AND

Indexing Techniques • Variant Indexes [O’Neil & Quass, 97] – B+-tree (all systems): • RID-list for each search-key value – Bitmapped (almost all systems): • Bit-vectors (usually compressed) instead of RID-lists – Projection (Sybase IQ): • Mirror copy of column – Bit-sliced (Sybase IQ): • “bit-level” projection index • Join Indexes [Valduriez et. al. , 86] – Bitmapped-Join Indexes [O’Neil & Graefe ‘ 95] (Informix) • Limitations: – These structures are maintained in addition to the base table. – Query response times are unacceptable in interactive contexts.

The Data. Index • Both a storage and an access structure – Indexing comes for “free” • Based on, and extends the notions of vertical partitioning and transposed files • Two kinds presented here: – Basic Data. Index (BDI) – Join Data. Index (JDI)

The Basic Data. Index (BDI) • Projection-like Index with matching column removed from the table • Can have multiple columns (e. g. , no TPC-D query asks for Ext. Price or Discount alone) Base Table Cust. Key CK 1 CK 2 CK 3 CK 4 BDI Cust. Key CK 1 CK 2 CK 3 CK 4 Qty Q 1 Q 2 Q 3 Q 4 Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4 BDI Qty Q 1 Q 2 Q 3 Q 4 BDI Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4

The Join Data. Index (JDI) – JDI is “BDI” of RIDs to foreign table. Base Dimension Table Name Address N 1 A 1 N 2 A 2 N 3 A 3 BDI Name N 1 N 2 N 3 BDI Address A 1 A 2 A 3 Cust. Key CK 1 CK 2 CK 3 BDI Cust. Key CK 1 CK 2 CK 3 Base Fact Table Cust. Key CK 1 CK 2 CK 3 JDI Tax T 1 T 2 T 3 T 4 Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4 BDI RIDs RID 1 RID 2 RID 3 – Joins can be processed efficiently Tax T 1 T 2 T 3 T 4 BDI Ext. Price Discount E 1 D 1 E 2 D 2 E 3 D 3 E 4 D 4

Maintaining Logical Records • Order of records is conserved in Data. Indexes • A simple arithmetic mapping is used to associate fields of a record • Records in each vertical partition can easily be mapped to blocks and vice-versa • RID: (Block ID, Slot Number within that Block) – (Block ID, Slot Number) to Position: – Position to (Block ID, Slot Number):

Query Processing with Data. Indexes • Two common classes of queries in data warehousing: – Range queries – Star join queries • Example range query: SALES Table BDI SELECT Cust. Key FROM SALES WHERE Qty>10 • Steps: – Apply restrictions to form rowset(s) – Load display BDI(s) into memory – Display values Rowset 1 0 0 1 Cust. Key CK 1 CK 2 CK 3 CK 4 BDI Qty 25 5 7 15 BDI Cust. Key CK 1 CK 2 CK 3 CK 4

Star Join Queries • A fact table is joined with a set of dimension tables: SELECT Column-list FROM Fact. Table, Dimension. Tables WHERE Selection. Predicates AND Join. Predicates: “Fact. Attr 1 = Dimension. Attr 2” General Technique Used to Evaluate: 1. Apply Selection. Predicates on individual tables. 2. Perform Join on restricted set of rows or rowsets.