Скачать презентацию 11 Transforming Data Copyright Ó Oracle Corporation 1999 Скачать презентацию 11 Transforming Data Copyright Ó Oracle Corporation 1999

4173587ce8d15afd99b1d75a431b0629.ppt

  • Количество слайдов: 45

11 Transforming Data Copyright Ó Oracle Corporation, 1999. All rights reserved. 11 Transforming Data Copyright Ó Oracle Corporation, 1999. All rights reserved.

Overview Defining DW Concepts & Terminology Planning for a Successful Warehouse Choosing a Computing Overview Defining DW Concepts & Terminology Planning for a Successful Warehouse Choosing a Computing Architecture Meeting a Business Need Modeling the Data Warehouse Analyzing User Query Needs Planning Warehouse Storage ETT (Building the Warehouse) Supporting End User Access Project Management (Methodology, Maintaining Metadata) 11 -2 Copyright Ó Oracle Corporation, 1999. All rights reserved. Managing the Data Warehouse

Objectives After completing this lesson, you should be able to do the following: • Objectives After completing this lesson, you should be able to do the following: • • • 11 -3 Explain the importance of quality data Define the term “transformation” Identify transformation issues Describe techniques for transforming data List tools that can be used to transform data Copyright Ó Oracle Corporation, 1999. All rights reserved.

Importance of Data Quality Browser: http: // Hollywood Cu sto me rs+ : X Importance of Data Quality Browser: http: // Hollywood Cu sto me rs+ : X Browser: Cus tom http: // a rec orof as Hollywood ers+ : X Hollywood Speedy Pizza Browser: http: // X + Hollywood Summit Sports 11 -4 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Benefits of Quality Data • Clean data is essential for: – Targeting customers – Benefits of Quality Data • Clean data is essential for: – Targeting customers – Determining buying patterns – Identifying householders: private and commercial – Matching customers – Identify historical data • 11 -5 Dirty data must be removed. Copyright Ó Oracle Corporation, 1999. All rights reserved.

Standards • • 11 -6 Define a quality strategy. Decide on optimal data-quality level. Standards • • 11 -6 Define a quality strategy. Decide on optimal data-quality level. Copyright Ó Oracle Corporation, 1999. All rights reserved.

Quality Improvements • • • 11 -7 Consider modifying rules for operational data. Document Quality Improvements • • • 11 -7 Consider modifying rules for operational data. Document the sources. Create a data stewardship program. Design the cleanup process carefully. Initial cleanup and refresh routines may differ. Copyright Ó Oracle Corporation, 1999. All rights reserved.

Guidelines • Operational data should not be used directly in the warehouse • Operational Guidelines • Operational data should not be used directly in the warehouse • Operational data must be cleaned for each increment • Operational data is not simply fixed by modifying applications 11 -8 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Solutions • • Conventional COBOL, 4 GL Specialized tools Customized conversion process Business experts Solutions • • Conventional COBOL, 4 GL Specialized tools Customized conversion process Business experts Investigation Conditioning Standardization Integration 11 -9 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Management Poor data quality • • Own Take responsibility Resolve problems Data quality manager Management Poor data quality • • Own Take responsibility Resolve problems Data quality manager 11 -10 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transformation Clean up Consolidate Restructure Operational system Extract Data staging area Transport (Load) Transform Transformation Clean up Consolidate Restructure Operational system Extract Data staging area Transport (Load) Transform Warehouse Transformation eliminates operational data anomalies • • • Cleans Standardizes Presents subject-oriented data 11 -11 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Source Data Anomalies • • No unique key Data naming and coding anomalies Data Source Data Anomalies • • No unique key Data naming and coding anomalies Data meaning anomalies between groups Spelling and text inconsistencies CUSNUM NAME 90328575 90238475 90233479 90233489 90234889 90345672 11 -12 Oracle Corp Oracle Services Oracle Limited Oracle Computing Oracle Corp. UK Oracle Corp UK Ltd ADDRESS 100 NE 1 st Street, Tampa 100 NE. First St. , Tampa 100 North East 1 st St. , FLA 100 N. E. 1 st St. 15 Main Road, Ft. Lauderdale, FLA 181 North Street, Key West, FLA Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transformation Routines • • • Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating Transformation Routines • • • Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating data Transforming data before load 11 -13 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transforming Data: Problems and Solutions Multipart keys Product code = 12 M 65431345 Country Transforming Data: Problems and Solutions Multipart keys Product code = 12 M 65431345 Country Sales code territory 11 -14 Product Salesperson number code Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transforming Data • Multiple encoding m, f 1, 0 m, f male, female • Transforming Data • Multiple encoding m, f 1, 0 m, f male, female • Must pick up erroneous data mle, female 1 , NULL 11 -15 If field not in (‘m’, 1, ’male’) then … m, f else if field is NULL then … Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transforming Data • • Multiple local standards Tools or filters to preprocess cm cm Transforming Data • • Multiple local standards Tools or filters to preprocess cm cm inches DD/MM/YY DD-Mon-YY MM/DD/YY 1, 000 GBP USD 600 FF 9, 990 11 -16 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Multiple Files Problem • • Added complexity of multiple source files Start simple Multiple Multiple Files Problem • • Added complexity of multiple source files Start simple Multiple source files 11 -17 Logic to detect correct source Copyright Ó Oracle Corporation, 1999. All rights reserved. Extracted data

Transforming Data from Multiple Files File 11 -18 File File Copyright Ó Oracle Corporation, Transforming Data from Multiple Files File 11 -18 File File Copyright Ó Oracle Corporation, 1999. All rights reserved.

Missing Values Problem Solution • • Ignore Wait Mark rows Extract when time-stamped If Missing Values Problem Solution • • Ignore Wait Mark rows Extract when time-stamped If NULL then field = ‘A’ 11 -19 A Copyright Ó Oracle Corporation, 1999. All rights reserved.

Duplicate Value Problem Solution • • SQL self-join techniques RDMBS constraint utilities ACME Inc Duplicate Value Problem Solution • • SQL self-join techniques RDMBS constraint utilities ACME Inc SELECT … FROM table_a, table_b WHERE table_a. key (+) = table_b. key UNION SELECT … FROM table_a, table_b WHERE table_a. key = table_b. key (+) ACME Inc 11 -20 Copyright Ó Oracle Corporation, 1999. All rights reserved.

 • • tom http: // ers+ : Hollywood as Browser: Cus a rec • • tom http: // ers+ : Hollywood as Browser: Cus a rec Solution orof Element Names Problem X Customer CTAS SQL*Loader Browser: http: // X + Hollywood Client Browser: http: // Cu Hollywood sto me + rs: X 12345. 00 12780. 00 2345787. 00 87877. 98 5678. 00 11 -21 ABC CO 100% 110% GMBH LTD 230% GBUK INC 200% FFR ASSOC -10% Contact Name MCD CO Copyright Ó Oracle Corporation, 1999. All rights reserved. Customer

Element Meaning Problem Customer’s name Hollywood Cu s tom ers : a rec orof Element Meaning Problem Customer’s name Hollywood Cu s tom ers : a rec orof http: // + X Customer_detail 11 -22 All details except name • • • as Browser: All customer details Avoid misinterpretation Complex solution Document meaning in metadata Copyright Ó Oracle Corporation, 1999. All rights reserved.

Input Format Problem EBCDIC “ 123 -73” ASCII 12373 ACME Co. áøåëéí äáàéí 11 Input Format Problem EBCDIC “ 123 -73” ASCII 12373 ACME Co. áøåëéí äáàéí 11 -23 Beer (Pack of 8) Copyright Ó Oracle Corporation, 1999. All rights reserved.

Referential Integrity Problem Solution • • • SQL anti-join Server constraints Dedicated tools Department Referential Integrity Problem Solution • • • SQL anti-join Server constraints Dedicated tools Department 10 20 30 40 11 -24 Emp 1099 1289 1234 6786 Name Smith Jones Doe Harris Department 10 20 50 60 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Name and Address Problem • • No unique key Missing values Personal and commercial Name and Address Problem • • No unique key Missing values Personal and commercial names mixed Different addresses for same member Different names and spelling for same member Many names on one line One name on two lines NAME LOCATION Database 1 DIANNE ZIEFELD HARRY H. ENFIELD FRED AND SARA MULLEN N 100 D 589 M 300 Database 2 ZIEFLED, DIANNE ENFIELD, HARRY H MULLEN, SARA AND FRED 100 589 300 11 -25 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Name and Address Problem • Single-field format Mr. J. Smith, 100 Main St. , Name and Address Problem • Single-field format Mr. J. Smith, 100 Main St. , Bigtown, County Luth, 23565 • Multiple-field format Name Street Town County Code 11 -26 Mr. J. Smith 100 Main St. Bigtown County Luth 23565 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Clean and Organize 1. Create atomic values. 2. Standardize formats. 3. Verify data accuracy. Clean and Organize 1. Create atomic values. 2. Standardize formats. 3. Verify data accuracy. 4. Match with other records. 5. Identify private and commercial addresses and inhabitants. 6. Document in metadata. Requires sophisticated tools and techniques 11 -27 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Merging Data • Operational transactions do not usually map one-to-one with warehouse data • Merging Data • Operational transactions do not usually map one-to-one with warehouse data • Data for the warehouse is merged to provide information for analysis Pizza sales/returns by day, hour, seconds Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 Sale 1/2/98 12: 00: 02 Cheese Pizza $15. 00 Sale 1/2/98 12: 00: 02 Anchovy Pizza $12. 00 Return 1/2/98 12: 00: 03 Anchovy Pizza - $12. 00 Sale 11 -28 1/2/98 12: 00: 04 Sausage Pizza $11. 00 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Merging Data Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 Sale 1/2/98 12: Merging Data Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 Sale 1/2/98 12: 00: 02 Cheese Pizza $15. 00 Browser: http: // Cu sto me + rs: 1/2/98 12: 00: 02 Anchovy Pizza $12. 00 Return 1/2/98 12: 00: 03 Anchovy Pizza - $12. 00 Sale 12: 00: 04 Sausage Pizza $11. 00 1/2/98 Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 Sale 1/2/98 12: 00: 02 Cheese Pizza $15. 00 Sale 1/2/98 12: 00: 04 Sausage Pizza $11. 00 11 -29 Copyright Ó Oracle Corporation, 1999. All rights reserved. a reco rof Sale as X Hollywood

Adding a Date Stamp • • • Enables time analysis Label loaded data with Adding a Date Stamp • • • Enables time analysis Label loaded data with a date stamp Add time to fact and dimension data 11 -30 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Adding a Date Stamp Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Adding a Date Stamp Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key 11 -31 Item Table Item_id Dept_id Time_key Copyright Ó Oracle Corporation, 1999. All rights reserved.

Adding a Date Stamp • Fact table – Add triggers – Recode applications – Adding a Date Stamp • Fact table – Add triggers – Recode applications – Compare tables • • Dimension table Time representation – Point in time – Time span 11 -32 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Adding Keys to Data #1 Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 Adding Keys to Data #1 Sale 1/2/98 12: 00: 01 Ham Pizza $10. 00 #2 Sale 1/2/98 12: 00: 02 Cheese Pizza $15. 00 #3 Sale 1/2/98 12: 00: 02 Anchovy Pizza $12. 00 #4 Return 1/2/98 12: 00: 03 Anchovy Pizza - $12. 00 #5 Sale 12: 00: 04 Sausage Pizza $11. 00 1/2/98 Data values or artificial keys #dw 1 1/2/98 12: 00: 01 Ham Pizza $10. 00 #dw 2 Sale 1/2/98 12: 00: 02 Cheese Pizza $15. 00 #dw 3 11 -33 Sale 1/2/98 12: 00: 04 Sausage Pizza $11. 00 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Summarizing Data During extraction on staging area After loading onto the warehouse server http: Summarizing Data During extraction on staging area After loading onto the warehouse server http: // Hollywood tom ers + : as Browser: Cus a rec orof • • X Operational databases 11 -34 Staging area Warehouse database Copyright Ó Oracle Corporation, 1999. All rights reserved.

Maintaining Transformation Metadata Contains transformation rules, algorithms, and routines Browser: http: // X + Maintaining Transformation Metadata Contains transformation rules, algorithms, and routines Browser: http: // X + Hollywood a rec orof as Browser: Cu http: //s Hollywood tom + ers X : Cu s Browser: to http: // mer X + s: Hollywood Sources Stage Rules Extract Transform 11 -35 Publish Load Query Copyright Ó Oracle Corporation, 1999. All rights reserved.

Maintaining Transformation Metadata • • Key restructuring Coding differences Multiple sources Exception rules Format Maintaining Transformation Metadata • • Key restructuring Coding differences Multiple sources Exception rules Format differences Referential integrity fixes Aggregated data 11 -36 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Data Ownership and Responsibilities • • • Operational and application development teams Data warehouse Data Ownership and Responsibilities • • • Operational and application development teams Data warehouse development team Business benefit gained with a one-team approach Browser: http: // + Hollywood 11 -37 sto m ers + : X Browser: Cu http: // Hollywood sto me rof Browser: X Cu rs: + X as Hollywood a reco http: // Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transformation Timing and Location • Transformation is performed: – Before load – In parallel Transformation Timing and Location • Transformation is performed: – Before load – In parallel • May be initiated at different points 12 M 65431 12 -m-65421 12 m 65421 12 M 65421 “ 12 m 65421” 12 m 65421 “ 12 m 65421” “ ” 12 M 65431 Unlikely 11 -38 “ ” 12 M 65431 Probable Copyright Ó Oracle Corporation, 1999. All rights reserved. Possible

Choosing a Transformation Point • • 11 -39 Workload Environment impact CPU use Disk Choosing a Transformation Point • • 11 -39 Workload Environment impact CPU use Disk space • Network bandwidth • Parallel execution • Load window time • User information needs Copyright Ó Oracle Corporation, 1999. All rights reserved.

Monitoring and Tracking Transforms should: • • • Be self-documenting Provide summary statistics Handle Monitoring and Tracking Transforms should: • • • Be self-documenting Provide summary statistics Handle process exceptions 12 M 65431 12 -m-65421 12 m 65421 12 M 65421 “ 12 m 65421” 12 m 65421 “ 12 m 65421” “ ” 12 M 65431 11 -40 “ 1 ” 12 M 65431 Copyright Ó Oracle Corporation, 1999. All rights reserved. 2 3 4 5 1, 200 1, 400 100 6, 001 20, 890

Designing Transformation Processes • Analysis: – Sources and target mappings, business rules – Key Designing Transformation Processes • Analysis: – Sources and target mappings, business rules – Key users, metadata, grain • Design options: PL/SQL, replication, custom, third-party tools • Design issues: – Performance – Size of the staging area – Exception handling, integrity maintenance 11 -41 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Transformation Tools • • • Purchased SQL*Loader In-house developed 11 -42 Copyright Ó Oracle Transformation Tools • • • Purchased SQL*Loader In-house developed 11 -42 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Data Management, Quality and Auditing Tools • Data management: – Innovative Systems – Postalsoft Data Management, Quality and Auditing Tools • Data management: – Innovative Systems – Postalsoft – Vality Technology • Data quality and auditing: – Innovative Systems – Vality Technology 11 -43 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Summary • • This lesson discussed the following topics: Importance of data quality Transformation Summary • • This lesson discussed the following topics: Importance of data quality Transformation process Data transformation issues Data anomalies Name and address management Tools 11 -44 Copyright Ó Oracle Corporation, 1999. All rights reserved.

Practice 11 -1 Overview This practice covers the following topics: • • Answering a Practice 11 -1 Overview This practice covers the following topics: • • Answering a series of short questions Specifying true or false to a series of statements 11 -45 Copyright Ó Oracle Corporation, 1999. All rights reserved.