e53a2b198b4224f61ea8e97b403ed86e.ppt
- Количество слайдов: 34
9 Adv. DBS and Data Warehouse CSC 5301 Ch 1 and 2 Hachim Haddouti Ch 1 and 2, Hachim Haddouti 1
9 Do You Remember? u OLTP (examples? ) u OLAP u ROLAP / MOLAP u Star Schema u Fact Table u DSS u Slice/Dice, Drill Down/Roll. Up Ch 1 and 2, Hachim Haddouti 2
9 Requirements of OLAP Products FASMI Definition u FAST (simple query under 1 s, medium under 5 s, complex 20 s) u Analysis (analytical and statistical funtionality, ad hoc) u Shared u Multidimensional (essential to give different views) u Information (extraction of information from data) Ch 1 and 2, Hachim Haddouti 3
9 MD Data model u Asymmetric, one large dominant table in the center of the schema (multiple joins) and around Dimension tables u Better visualization u Business queries (why, what if) Ch 1 and 2, Hachim Haddouti 4
9 MD Model u Fact table: where numercial measurements of the business are stored, plus keys of dimension tables u Dimension table: textual description of dimension u Attributes: describing items in Dimension Ch 1 and 2, Hachim Haddouti 5
Data Warehouse Structure Ch 1 and 2, Hachim Haddouti 9 6
9 Basic Concepts of the MDD-Model Def. : A Dimension is a data type (almost always finite), which is used as a component of a composite (multidimensional) key. Def. : Dimension-Members are elements of a dimension. Examples: frequently enumeration types or intervals: Month =( January, February, …, December) Day = [ 1: 31] Ch 1 and 2, Hachim Haddouti 7
9 Further examples: BMW-engines = {1600, 1800, 2000, 2300, 2800, 3000, 3500, 4000, 5000, 2500 D, 3000 D} BMW-bodies = {3 rd, 5 th, 7 th. L, 3 rd Cabrio, 5 th. L, 8 th, Z 3} Note : not all combinations of engines and bodies are built Ch 1 and 2, Hachim Haddouti 8
9 Data Cube Example Dimension Product Cell 25 Dimension. Members Trekking Bike 35 Facts (Measures) 30 Time Mountain Bike June D Ch 1 and 2, Hachim Haddouti GB Region 9
9 Def. : Dimension-Attributes: additional attributes for a detailed description of the dimension members. Examples: u Number of days per month dimension Month = ((January, 31), (February, 29), : (April, 30), . . . , (December, 31)) u Gasoline type and number of cylinders of an engine dimension BMW-engine = {(1600, Super, 4), . . . , Ch 1 and 2, Hachim Haddouti (2500 D, Diesel, 5), . . . , (4000, Regular, 8)…} 10
Relational Modeling Dimension finite fixed relation Dimension. Element Dimension. Attribute Cube 9 key other non-key attributes Measures relation, on E/R level a relationship between dimensions key composed of foreign keys of dimensions non-key attributes of cube Dimensionality number of foreign keys Cube-key Ch 1 and 2, Hachim Haddouti 11
Relational Model BMW-engines Facts cyl B-E B-B fuel B-E 9 GRAIN Month M € # Name days Bodies B-B Ch 1 and 2, Hachim Haddouti … 12
9 E/R-Model: Star-Schema BMW-Engines Facts BMW-bodies Months i. e. simple star schema Ch 1 and 2, Hachim Haddouti 13
Relational Representation of Multidimensional Data Ch 1 and 2, Hachim Haddouti 9 14
Multidimensional Representation of 3 -dim Data: Dimensions with Measures or Facts Ch 1 and 2, Hachim Haddouti 9 15
Density and Sparsity 9 Def. : Dense data-cube: all combinations of (d 1, …, dm) occur. Sparse data-cube is not dense Def. : Note: u logical model assumes dense cubes u physical storage model deals with dense and sparse cubes. Ch 1 and 2, Hachim Haddouti 16
9 Data Warehouse Shape Purchase Storage Sales Personnel Financial Analyzes, Trends Internal Information Sources Data Warehouse External information sources Customer Ch 1 and 2, Hachim Haddouti Supllier Market competition 17
9 History Unchanged Vision: right information to the right time and place MIS (=Management Informationssystem) DSS (=Decision Support System) MAIS (=Marketing Informationssystem) EIS (=Executive Information System) 60' 70' Ch 1 and 2, Hachim Haddouti 80', Begin 90' Data-Warehousesystem EIS (=Enterprise Intelligence System) IDF (=Information Delivery Facility) Information Warehouse EIS (=Enterprise Information System) Mid 90' 18
Example 1 Ch 1 and 2, Hachim Haddouti 9 19
Example 2 Ch 1 and 2, Hachim Haddouti 9 20
9 Ch 2: The Grocery Store Case: 500 large grocery stores, in 3 Regions, each store with many depts; Steps in the design process u u Choose a business process to model (order, invoice, sales etc. ) Choose the grain of the business process (daily snapshots, monthly. . ) Choose the dimensions(e. g; time, customer, product) Choose the measured facts (Quantity sold, DH Sold) Ch 1 and 2, Hachim Haddouti 21
Grocery Store Item Movement u u u 9 Stock Keeping Unit (SKU), individual products (60 000) Universal Product Code (UPC), bar codes point of sale (POS), front door where customers buy takeaway, back door where vendors make deliveries temporary price reduction (TPR), promotions shelf/end-aisle display, ads like dispalys at stores Ch 1 and 2, Hachim Haddouti 22
Design Principles: Identifying the processes to model 9 The first step in the design is to decide what business process(es) to model, by combining an understanding of the business with an understanding of what data is available 1. Ø daily item movement (GRAIN is SKU by store by promotion by day); what^products are selling in which stores, prices, days The second step in the design is to decide on the grain of the fact table in each business process (daily item movement, why not each Transaction? ). 2. u u 3. market basket depleted cannibalized syndicated data suppliers ( comparing own sales with other competitive stores, eg. Top 10 sold by my competitor) A data ware house almost always demands data expressed at the lowest possible grain of each dimension, not because queries want to see individual low-level records, but because queries need to cut through the database in very precise ways. Ch 1 and 2, Hachim Haddouti 23
9 Design Principles 4. A careful grain statement determines the primary dimensionality of the fact table. It is then usually possible to additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accomodate this additional dimension. Ch 1 and 2, Hachim Haddouti 24
9 Grocery Store Schema Time dim Time key Time attributes … Promotion dim Promotion key Promotion attributes … Ch 1 and 2, Hachim Haddouti Sales Fact Time key Produkt key Store key Promotion key Other fact … Product dim Product key Product attributes … Store dim Store key Store attributes … 25
9 Design Principles u u Picking the business measurements for the fact table The number of base sales transaction line items in a business can be estimated by dividing the gross revenue of the business by the average price of a sales item. u Resisting normalization 5. The fact table in a dimensional schema is naturally highly normalized. Efforts to normalize any of the tables in a dimensional database solely in order to save disk space are a waste of time. 6. Ch 1 and 2, Hachim Haddouti 26
Grocery Store Schema showing measured facts Time dim Time key Time attributes … Promotion dim Promotion key Promotion attributes … Sales Fact Time key Produkt key Store key Promotion key DH_sales Units_sales DH_cost Customer_count Product dim Product key Product attributes … Store dim Store key Store attributes … Suppose: each attribute in Fact table has 4 Bytes, except Store. Key 2 Bytes Whole fact table is only 30 Bytes large, for 1 b row-fact table 30 GB Ch 1 and 2, Hachim Haddouti 9 27
9 Design Principles Preserving browsing 7. The dimension tables must not be normalized but should remain as flat tables. Normalized dimension tables destroy the ability to browse. Disk space savings gained by normalizing the dimension tables are typically less than one percent of the total disk space needed for the overall schema. Ch 1 and 2, Hachim Haddouti 28
9 Time Dimension The time dimension (time_key, holiday_flag, fiscal_period, season, event, . . ) Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension is needed to describe fiscal periods, seasons, holidays, weekends, and other calendar calculations that are difficult to get from the SQL data machinery. u Ch 1 and 2, Hachim Haddouti 29
9 The product dimension u u u Production dimension (product_key, SKU_desc, SKU_no, brand, . . ) merchandise hierarchy (eg. SKU to package to brand to subcategories to departments) drill up/down drilling down in a data warehouse is nothing more than adding row headers from the dimension tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling down. The product dimension is one of the two or three dimensions in nearly every data warehouse. Great care should be taken to fill this dimension with as many descriptive attributes as possible. Retail product dimension tables should have at least 50 attributes. Ch 1 and 2, Hachim Haddouti 30
9 Store Dimension u The store dimension (store_key, store_name, store_no, store_address, …) u pull down list u SYNONYM (create FIRST_OPEN_TIME as SYNONYM FOR DATE) Ch 1 and 2, Hachim Haddouti 31
9 The promotion dimension u Promotion dim (promotion_key, price_reduction_type, display_type, . . ) promotion_name, u causal dimension (temporary price reduction, coupons, . . ) u Lift/baseline sales (gain in sales during promotion) u time shifting u cannibalization u growing the market u profitability Ch 1 and 2, Hachim Haddouti 32
9 The grocery store facts Additive Attributes, we can compute: u gross profit (DH cost – DH Revenue) u gross margin ( gross profit / DH Revenue) 8. 9. A nonadditive calculation, such as the ratio like gross margin, can be calculated for any slice of the fact table by remembering to calculate the ratio of sums, not the sum of the ratios. In other words, the computation must be distributed over the sums, not the other way around. Customer counts are usually semi-additive when they occur in time snapshot fact tables because they double count activity across products during the customer event. In these cases they can be used correctly in user applications only by restricting the keys in the nonadditive dimensions to single values. Ch 1 and 2, Hachim Haddouti 33
Database sizing for the grocery chain u u u u 9 time dimension: 2 years x 365 days = 730 days store dimension: 300 stores, reporting sales each day product dimension: 30, 000 products in each store; 3, 000 sell each day/store promotion dimension: 1 item in no more than 1 promotion/store/day base fact records: 657 M key fields: 4; fact fields: 4; total fields: 8 base fact table size: 657 M x 8 fields x 4 bytes = 21 G Ch 1 and 2, Hachim Haddouti 34
e53a2b198b4224f61ea8e97b403ed86e.ppt