0e734703fba6761f8e2d8788df0c8354.ppt
- Количество слайдов: 44
Data Warehouse Systems Dimensional Model Advanced Gabriel David gtd@fe. up. pt MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 1
FAMILIES OF FACT TABLES MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 2
Families of fact tables n The four-step method is about one star • Fact table + dimensions n Data mart: set of coordinated stars • Conformed dimensions, common to several stars, with the same meaning across the DW n Four reasons to build families of fact tables in a data mart • A – Chains and circles • B – Heterogeneous products schema • C – Transactions and snapshots schema • D - Aggregates MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 3
A – Chains of fact tables n Process flow in a organization • Many times it has a start, several steps, and an end • Transactions or snapshots are captured for each step, possibly in several legacy systems • A step may correspond to a fact table • Example: supply chain and demand chain Supply chain Raw materials production (mining, agriculture) Ingredient acquisition Ingredient delivery Ingredient inventory Product composition list Manufacturing process control Production costs Packaging Transport to the store Finished product inventory Demand chain Finished product inventory Production shipments Distributors inventory Distributor shipment Retail inventory Retail sales MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 4
Method n n Define a fact table and a set of associated dimensions for each step Create the conformed dimensions • DW bus • Build each star separately • Not all the steps are necessarily implemented • In the end, the whole value chain is supported MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 5
Circles of fact tables n Businesses organized in value circles • All the entities execute or measure the same kind of transaction • Example of the national health system - Everybody generates or measures patient treatments • Data sharing requires conforming the dimensions for the several sources - Time (calendar) Patient Supplier (doctor) Local Treatment Diagnostic Employer Payer Value circle Hospitals Clinics Continued care institutions Doctor’s office Pharmacies Pharmaceutical companies Laboratories Employers Insurance companies Health subsystem MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 6
B-Heterogeneous products schema n n Situation: retail bank with many products Two perspectives on the account activity Example products Current account Saving account Loan Credit card Retirement saving account Deposit safe • Global view of all the accounts of all kinds Nuclear fact table - Cross sales, global analysis time_id account_id - 25 products query 25 fact agency_id tables? ! family_id - 1 nuclear fact table with all the balance accounts, just common facts tax_paid tax_earned num_transactions MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David Time Family Agency Account (nuclear dimention) 7
Second perspective n Specific view on each business line • Several specific facts • Group everything in the nuclear fact table 200 facts and many nulls? ! • Create an extension to both the fact table and the Account dimension for each business line (restricted to current accounts) Nuclear fact table time_id account_id agency_id family_id balance tax_paid tax_earned num_transactions specific fact key Current account specific fact table specific fact key num_uncovered num_ATM_usage num_non_ATM num_deposits total_deposits MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David Time Family Agency Account (nuclear dimension) + specific dimension key Specific dimension key + current account specific attributes 8
Implementation alternatives n Specific business lines produce a partition in the nuclear fact table • Disjoint accounts n Case data marts physically separated by business line • Nuclear tables (facts and dimensions) in a separate machine from the specific tables machines, but with copy here • Each line is copied just once n Case sharing the same database • Avoids the copy of the nuclear tables • To each nuclear record a join key is added which the querying tools and applications know how to use • A SQL query uses just one extension MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 9
C - Transactions and snapshots n Typical situation, in a operational system • Individual transactions - Withdraw from an account - Phone call - Product buyed • Snapshot at the end of a summary period - Daily summary - Monthly balance n Both may be interesting in a DW • Transactions – detailed behavior • Snapshots – activity status check MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 10
Transactions schema n Transactions schema in the DW is usually straightforward from the operational system • Extraction, key conversion • If there is no operational transactions source, artificial transactions may be reconstructed from the differences of successive snapshots ATM transaction time_id account_id transaction_id local_id audit_id account_number ref_transaction amount Time Account Transaction Local Audit MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 11
Comments on the fact table n Obvious external keys • Time, Local, Account, Transaction type n Other attributes with operational system keys • Account number, transaction reference • For direct reference and to return to the operational system n Audit dimension (special, new key) • Created by the extraction process itself to store the record history - Extraction timestamp, source table, extraction software version MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 12
More comments n Typically, a single fact • Amount – generic name for any type of transaction • Not usual to add more facts • More common to add more kinds of transactions - Just data, schema unchanged n Detailed behavior studies are possible • Number of mortgage return transactions far from home n Temporal analyses • Number of transactions at lunch time • Waiting queue analyses MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 13
Even more n Response time calculation • Organizational efficiency n Sequential behavior • Fraud analyses (insurance, credit cards) • Account cancelation alarm n Shopping basket analyses • Those who buy beer, what else do they buy? - May be used to put those products by the beer - Or on the other side of the shop to force crossing it • Powerful data mining requires fine grain (transactions) MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 14
Need of snapshots n n How to answer questions about the current status? Get the total current income • Transactions which are elementary income - Just add the values • Transactions which are deposits and withdrawals - It is required to pay attention to the differences • Payments in installments and insurance premium - More fuzzy, with effects spreading across several summary periods • Appropriate logic is required to transform transactions in summaries, justifying the need for periodic snapshots with these summaries MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 15
Snapshot information n A second fact table with the snapshots is needed, besides transactions (partner tables) • With luck, the snapshots are already in the operational system • Otherwise, an extraction algorithm is required to compute them, at the end of each reporting period (just once) • It may be interesting to perform the incremental creation of the snapshot along the period, updating it with the current snapshot, which is frozen at the end of the period, and a new one is started MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 16
Snapshot schema n Tightly related with the corresponding transactions schema • Looses some dimensions (Transaction, Local, …) • Keeps other (Account, Time, …) - Audit keeps information on the whole period • New dimension Status for the current situation along the period ATM Snapshot time_id account_id status_id audit_id income nr_transactions final_balance average_daily_balance + other summaries Time Account Status Audit MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 17
Snapshot facts characteristics n Facts structure is richer and open • Many summaries are conceivable for a certain period • Keep the granularity n Some facts are additive, in all dimensions • income, nr_transactions n Other are semi-additive (in all dimensions, except time) • final_balance, average_daily_balance • Over the time dimension the temporal average may be used MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 18
Intermediate snapshots n It is common to add more facts during the DW development or even during its lifecycle • Avoid to include obscure meaning or rarely used facts n Snapshots between reporting periods • Facts included in the snapshot: take the immediately previous period and add the incremental transactions from that date until the target date • Facts not included: process the transactional facts since the beginning of time until the target date n Transactions and snapshots together allow for a complete view of the organization MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 19
D - Aggregates n Aggregates are stored summaries • Usually by performance reasons • Main issues are relative to implementation n Storage in separate tables n Aggregation along one or more dimensions • These dimensions shorten, the other remain n Different granularity data available • Coarser granularity data are directly recorded in what could be an aggregation table (may be with really aggregate records, if they get the same granularity) • Ex: records of the global shipping cost of a delivery, versus the records of the individual line items cost in the invoice MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 20
FACTLESS FACT TABLES MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 21
Factless fact tables n n Situations where the design process of a star is followed but in the end the fact table has no facts Record of events • Ex: Class attendance • Dimensions: Date, Student, Course, Professor, Room • Facts: contain just the 5 external keys, there is no fact, no measure; one line for each student coming in a class • Even so it is useful: courses with more attendance? , etc. • To be uniform: put an Attendance fact with constant value 1, which proves useful in aggregations MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 22
Class attendance Course course_id name department level code laboratory Professor professor_id number name address department title degree Attendance time_id student_id course_id professor_id room_id attendance = 1 Room room_id typo local department places Time time_id SQL_date day_of_week month Student student_id ID name address major minor first_enrollment graduation_class MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 23
More tables without facts n Coverage table • Required when the primary DW fact table is sparse • Ex: record of a promotion in a retail chain; outlines which products, in which days, in which shops - The sales table has all the products • Dimensions: Time, Product, Shop, Promotion • To know which products were in the promotion but were not sold, one can’t resort to the normal sales fact table - It just records the sales; not the non-sales - Fill in the sales table with zero records has the disadvantage of increasing it too much and of recording the promotion just indirectly MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 24
Coverage table Shop shop_id name ID_shop address county region Promotion promotion_id name type price_type Publicity presentation coupon (Table with no facts) Promotion coverage time_id product_id shop_id promotion_id Time time_id SQL_date day_of_week month Product product_id SKU description brand category package size flavour MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 25
SPECIAL MODELING CASES MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 26
Many-to-many dimensions Supplier attributes… Diagnostic diagnostic_id designation classification HL 7 … Patient bill time_id patient_id supplier_id local_id financing. Inst_id procedure_id diagnostic_id debit. Patient debit. Financing. Inst Time attributes… Patient attributes… Procedure attributes… Local attributes… What if there are multiple diagnostics? Financing. Inst attributes… MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 27
Bridge table muitos-muitos Diagnostic diagnostic_id designation classification HL 7 … n n n Diagnostic. Group diagnostic. Group_id diagnostic_id weight Patient bill time_id patient_id supplier_id local_id financing_id procedure_id diagnostic. Group_id debit. Patient debit. Financing. Inst Each patient gets a group of diagnostics with an id The group id plus the individual diagnostic id make the key of the bridge table Each diagnostic is weighted such as the sum of all the weights in each group is 1. 0 MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 28
Typical questions to bridge tables n Charges by diagnostic (weighted join) • total is consistent n Impact of each diagnostic (non weighted join) • multiplies totals but is meaningful in the perspective of the diagnostic that shows up in many lines n Similar cases • Bank account with several owners • Profit distribution in a company with many Economic Activity Codes MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 29
Care with joins Sales client_id time_id product_id audit_id order_nr quantity value n Client client_id attributes… Bills client_id time_id product_id return_place_id status_id order_nr quantity value Record sales and returns • Two stars of different cardinality n If a report with all the sales and returns is needed, how to proceed? • A single Select may yield a wrong result • Ask two separate queries, one for each star • Perform an external join of the results MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 30
Roles in dimensions n Orders (type accumulating snapshot) • Order date • Assembly date • Shipping date • Delivery date • Payment date • Return date • Order status • Client • Product • Warehouse • Promotion Encomenda order_date_id assembly_date_id shipping_date_id delivery_date_id payment_date_id return_date_id client_id product_id warehouse_id promotion_id quantity value Time time_id SQL_date attributes… Client attributes… Product attributes… Warehouse attributes… Promotion attributes… MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 31
Creation of views n To avoid an SQL query to set all the dates equal, use views • A view implements a role (the key of the view gets the name of the corresponding external key in the fact table) • Different names in the columns to help report readability • Just one actual time dimension is kept MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 32
Frequent flyer passenger n Similar case • Flight segments table (Frequent Flyer) • Flight date • Segment origin airport • Segment destination airport • Trip origin airport • Trip destination airport • Flight • Class • Client MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 33
Telecoms n Several companies may be involved in a single call • Origin systems • Local commutation • Long distance • Added value service • Caller • Callee n A single Company dimension playing one or more roles in the call MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 34
Location dimensions n Location: feature present in many dimentions • Physical address, latitude-longitude, … n Example dimentions • • • n n Phone number Billing phone number Terminal equipment Network equipment Building Company Location aggregates by area code, county, sales region (multiple hierarchies) Embedded roles • Not a dimension shared by different roles, but sharing attributes by several dimensions MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 35
Acceptable snowflake n Recommend • Location subdimension as snowflake for all relevant dimensions • Views joining each dimension with Location - Similar to the views in roles - But heavier; if too heavy, materialized views • Star model between facts and these views MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 36
Snowflake Calls phone_id billing_phone_id company_id promotion_id caller_id callee_id duration value Callee callee_id census_id attributes… Census census_id attributes… Caller caller_id census_id attributes… Phone phone_id location_id attributes… Location location_id attributes … Billing_phone billing_phone_id location_id attributes … Company company_id location_id attributes … Promotion attributes … MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 37
Organizational Hierarchies n Representation of hierarchies of business entities • Holdings, groups, subgroups, companies, departments, warehouses • Every entity may be a client n First attempt • Client dimension with parent pointer • Does not work with GROUP BY • Even CONNECT does not work well because of the joins MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 38
Parent pointer Client client_id client_nr name address type industrial sector first buy date buying profile credit profile parent_client_id Fact table … client_id … quantity value n It is not a simple star MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 39
Representation requirements n Keep the granularity of the Client dimension • Allows direct join with the fact table n n n Aggregation of a complete subtree with GROUP BY Aggregation of a client immediate subsidiaries or of the subsidiaries with no children, at any level Parent and top level retrieval in a single step MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 40
Solution Fact table … client_id … quantity value n Special bridge table, of optional use • • n Bridge parent_client_id subsidiary_id levels leaf Client client_id client_nr name address type industrial_sector first buy date buying profile credit profile One record for each path from a client to a subsidiary Includes 0 length paths, from a client to itself Records the number of levels between both Normal star, if the bridge is not used Two usage modes • This example shows going down a hierarchy - With “levels=1” one gets the children - Com “leaf=‘Yes’” one gets just the leaves MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 41
Solution (cont) Fact table … client_id … quantity value n Bridge parent_client_id subsidiary_id levels leaf Client client_id client_nr name address type industrial_sector first buy date buying profile credit profile Two usage modes • This example shows going up a hierarchy, taking as query entry point the Client dimension - One gets all the facts about the ascendents - Restricting to “levels=1” the parent is got - Calculating the maximum number of levels, the root is got MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 42
Other applications n To deal with historical information, in terms of organizational relationships • Add validity period to the bridge record • Requires the specification of the reference date in each query n An hierarchy maintenance application is required n Product composition hierarchy • Small to medium size (the “compilation” of the hierarchy is exponential on the number of the nodes) MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 43
Temporal marks n Human resources of a large company • 100 000 employees • 100 attributes per employee n Kinds of queries • Monthly summaries - Counts, snapshot totals and accumulating totals - Problem: totals should work even if there is no activity in the period under review for a given employee • Anytime analysis, even out of the end of month interval • Employee process with each transaction individualized MAP-I / DWS / Orlando Belo, Maribel Santos, Gabriel David 44
0e734703fba6761f8e2d8788df0c8354.ppt