507bbe49f378ae512a7c7fdb2ac113fa.ppt
- Number of slides: 26
IBM Information Server Data Quality Everywhere © 2005 IBM Corporation
Increasing Focus on Data Quality
§ Businesses are beginning to realize that data quality issues not only cost them time and money, but also inhibit their ability to address core strategic projects
§ More and more businesses are establishing programs for data quality, to measure and improve the reliability of information
§ Analysts contend that companies with focused data quality programs will find more opportunities to outperform their peers

Business Drivers for Investment Depend on Data Quality
§ Empowering risk & compliance initiatives with the information they require
§ Optimizing revenue opportunities by ensuring effective and efficient interactions with customers, partners, and suppliers
§ Enabling collaborative business processes with consistent and trustworthy information
§ Reducing the total cost of ownership for maintaining consistent information across the enterprise

What is the Impact of Poor Data Quality?
"If you look at ... any business function in your company, you're going to find some direct cost there attributed to poor data quality." - Gartner 2006
Lost Sales Opportunity
“Hard” Losses
§ SKU misplaced or hard to find: 1.5%
§ Out of stocks attributed to the store: 1.7%
“Soft” Losses
§ Lost potential for cross-sell and up-sell (staff not trained or available): 2-4%
§ Reduced store visit frequency: 1-3%
§ Abandoned carts (poor service or excessive queues): 1-2%
Total: 7.2%-12%
Source: GMA/FMI/CIES 2003 (US grocery), ECR Europe 2003, Lineraires.com, California Management Review, IBM case studies, interviews and IBM Institute for Business Value analysis

Data Quality is a Subjective Business Standard
§ Data = facts used as a basis for decision making, suitable for storage on a computer
§ Quality = the general standard or grade of something
Data Quality = a subjective standard used to determine if a set of facts is suitable for a particular business purpose
Business purpose: Relevant? Accurate? Valid? Complete?
Ultimately, Data Quality = Trust

So, What Constitutes Data Quality?
§ Data is standardized
§ Data is fit for purpose (conforms to rules)
§ Each record is unique
§ View of information is complete
§ Records are certified against authoritative sources
§ Lineage is understood
§ Data quality is measured over time

Common Data Problems
§ Lack of information standards – Different formats & structures across different systems
  Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
  Catherine Roberts Four sixteen Columbus APT 2, Boston, MA 02116
  Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
§ Data surprises in individual fields – Data misplaced in the database
  Name | Tax ID | Telephone
  J Smith DBA Lime Cons. | 228-02-1975 | 6173380300
  Williams & Co. C/O Bill | 025-37-1888 | 415-392-2000
  1st Natl Provident | 34-2671434 | 3380321
  HP 15 State St. | 508-466-1200 | Orlando
§ Information buried in free-form fields
  WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
  WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
  USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
  RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
§ Data myopia – Lack of consistent identifiers inhibits a single view
  19-84-103 RS 232 Cable 6' M-F Cand. S CS-89641 6 ft. Cable Male-F, RS 232 #87951 C&SUCH 6 Male/Female 25 PIN 6 Foot Cable
§ The redundancy nightmare – Duplicate records with a lack of standards
  90328574 | IBM | 187 N. Pk. Str. Salem NH 01456
  90328575 | I.B.M. Inc. | 187 N. Pk. St. Salem NH 01456
  90238495 | Int. Bus. Machines | 187 No. Park St Salem NH 04156
  90233479 | International Bus. M. | 187 Park Ave Salem NH 04156
  90233489 | Inter-Nation Consults | 15 Main Street Andover MA 02341
  90345672 | I.B. Manufacturing | Park Blvd. Bostno MA 04106

Why Does this Problem Exist?
§ Most enterprises are running distinct sales, services, marketing, manufacturing and financial applications, each with its own “master” reference data.
§ No one system is the universally agreed-to system of record.
§ Enterprise application vendors do not guarantee a complete & accurate integrated view – they point to their dependence on the quality of the raw input data
§ Data quality continues to erode at the point of entry, though it is not a data entry problem

What Do You Need to Establish a Data Quality Program?
§ A foundation platform that centralizes quality rules and provides auditable data quality
§ Business-driven, data-centric design environment for data quality rules
§ An ongoing process for data quality
§ A way to measure quality over time
§ Universal deployment of quality rules across all points of entry
§ Data quality ownership and data governance
§ Management sponsorship and a corporate mandate for data quality improvement

IBM Information Server: A Platform for Data Quality
Unified Deployment
§ Understand: Discover, model, and govern information structure and content
§ Cleanse: Standardize, merge, and correct information
§ Transform: Combine and restructure information for new uses
§ Deliver: Synchronize, virtualize and move information for in-line delivery
Unified Metadata Management
Parallel Processing
Rich Connectivity to Applications, Data, and Content

A Process For Data Quality
§ Establish Data Quality Ownership & Sponsorship
§ Analyze Source Data
§ Measure & Baseline Data Quality
§ Standardize
§ Certify & Enrich
§ Match
§ Link or Survive
§ Re-Measure
§ Report

Understanding the Problem: Source System Analysis
§ Quality controls for completeness and validity of data values
§ Incomplete or invalid values set by value, range, or reference sources
§ Consistency checks for data formats

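The completeness and validity controls listed above can be illustrated with a small sketch. This is not the product's implementation; `check_column` and its parameters are hypothetical names, and the sample values are borrowed from the "Common Data Problems" slide.

```python
import re

def check_column(values, allowed=None, value_range=None, pattern=None):
    """Count missing and invalid entries in one column of source data.

    Hypothetical helper: `allowed` models a reference source, `value_range`
    a (low, high) range rule, and `pattern` a format-consistency rule.
    """
    def is_missing(v):
        return v is None or str(v).strip() == ""

    missing = sum(1 for v in values if is_missing(v))
    invalid = 0
    for v in values:
        if is_missing(v):
            continue  # completeness is counted separately from validity
        if allowed is not None and v not in allowed:
            invalid += 1  # fails the reference-source check
        elif value_range is not None and not (value_range[0] <= v <= value_range[1]):
            invalid += 1  # fails the range check
        elif pattern is not None and not re.fullmatch(pattern, str(v)):
            invalid += 1  # fails the format check
    return {"total": len(values), "missing": missing, "invalid": invalid}

# State codes checked against a (made-up) reference list
states = check_column(["MA", "NH", "", "Mass", None], allowed={"MA", "NH"})

# Telephone numbers checked against one assumed format
phones = check_column(
    ["508-466-1200", "6173380300", "3380321"],
    pattern=r"\d{3}-\d{3}-\d{4}",
)
```

Counting missing and invalid values separately keeps completeness and validity as two distinct measures, which is how the slide lists them.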
Measuring & Resolving: Designing Data Quality Rules
§ Data quality rules should be embedded into data flows
– Investigate source data
– Standardize information
– Match records together
– Survive the best data across sources into a new record

Investigation
Example input: 123 St. Virginia St.
§ Parsing: Separating multi-valued fields into individual pieces
  123 | St. | Virginia | St.
§ Lexical analysis: Determining business significance of individual pieces
  Number | Street Type | Alpha | Street Type
  123 | St. | Virginia | St.
§ Context Sensitive: Identifying various data structures and content
  House Number | Street Name | Street Type
  123 | St. Virginia | St.
“The instructions for handling the data are inherent within the data itself.”

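The three investigation stages can be sketched on the slide's own example, "123 St. Virginia St.". This is a toy illustration, not the product's rule set: the `STREET_TYPES` list and all function names are assumptions made for the example.

```python
# Hypothetical lookup table; a real rule set would be far larger.
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}

def parse(address):
    """Parsing: split a multi-valued field into individual tokens."""
    return address.split()

def lexical_classes(tokens):
    """Lexical analysis: classify each token on its own, ignoring context."""
    classes = []
    for token in tokens:
        key = token.rstrip(".").upper()
        if key.isdigit():
            classes.append("Number")
        elif key in STREET_TYPES:
            classes.append("Street Type")
        else:
            classes.append("Alpha")
    return classes

def contextual_fields(tokens):
    """Context-sensitive step: use token position to assign business fields.

    A leading number is the house number, a trailing street type is the
    street type, and everything in between is the street name, so a
    'St.' in the middle is kept as part of the name.
    """
    classes = lexical_classes(tokens)
    fields = {"House Number": "", "Street Name": "", "Street Type": ""}
    start, end = 0, len(tokens)
    if classes and classes[0] == "Number":
        fields["House Number"] = tokens[0]
        start = 1
    if end > start + 1 and classes[-1] == "Street Type":
        fields["Street Type"] = tokens[-1]
        end -= 1
    fields["Street Name"] = " ".join(tokens[start:end])
    return fields

tokens = parse("123 St. Virginia St.")
classes = lexical_classes(tokens)
fields = contextual_fields(tokens)
```

The same token "St." receives the same lexical class twice but ends up in two different fields, which is the point of the context-sensitive pass.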
Standardization - Address
Input File:
Address Line 1 | Address Line 2
639 N MILLS AVENUE | ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130 |
3142 WEST CENTRAL AV | TOLEDO OH 43606
843 HEARD AVE | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 | EVANS GA 30809
Result File:
House # | Dir | Str. Name | Type | Unit | No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE | 536 | ON | EVANS | E152 | GA | 30809 |
Results in strongly “typed”, fixed-fielded, standardized data

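The SOUNDEX column in the result file can be reproduced with the standard American Soundex rules. The sketch below is a generic implementation, not code from the product; the function name `soundex` is chosen for the example.

```python
# Standard Soundex digit groups; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to letter + three digits

cities = ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]
city_codes = {city: soundex(city) for city in cities}
```

Phonetic keys such as these give matching a spelling-tolerant field to compare or block on, alongside the standardized values themselves.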
Effective Matching is the most beneficial and technically challenging part of data quality
§ Matching should be based on statistical probability
§ Match rules should take into account frequency, discriminating values, & reliability of fields when determining which fields to weight in a match
§ Matching against more fields of data produces higher quality matches
§ Matching logic is a very business-sensitive issue – business users should be involved in the design of matching rules
§ Matching is a science that requires careful calibration of match rules – design should be iterative, and should give results based on real data
§ Matching design should allow for baseline comparison to ensure rule changes are improving quality
§ The matching engine should provide clerical review capabilities
§ Setting up clerical review and match cutoffs should be intuitive

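One common way to base matching on statistical probability is the classic Fellegi-Sunter weighting, sketched below. The slides do not state which formula the product uses, so treat this as an illustration: the m and u probabilities, the cutoffs, and all names are assumed values for the example.

```python
import math

def field_weight(agrees, m, u):
    """Weight contributed by one field comparison.

    m: probability the field agrees when two records truly match (reliability)
    u: probability the field agrees by chance between non-matching records
       (driven by how frequent the values are)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def composite_weight(rec_a, rec_b, probabilities):
    """Sum the field weights over every field used in the match pass."""
    return sum(
        field_weight(rec_a.get(field) == rec_b.get(field), m, u)
        for field, (m, u) in probabilities.items()
    )

def classify(weight, match_cutoff, clerical_cutoff):
    """Apply the match and clerical cutoffs to a composite weight."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical review"
    return "nonmatch"

# Assumed probabilities: zip is both more reliable and more discriminating.
probabilities = {"name": (0.90, 0.10), "zip": (0.95, 0.05)}
a = {"name": "ROBERTS", "zip": "02116"}
b = {"name": "ROBERTS", "zip": "02116"}   # agrees on both fields
c = {"name": "ROBERT", "zip": "02116"}    # disagrees on name only
d = {"name": "ROBERTS", "zip": "02341"}   # disagrees on zip only

results = [
    classify(composite_weight(a, other, probabilities),
             match_cutoff=6.0, clerical_cutoff=0.0)
    for other in (b, c, d)
]
```

With these assumed numbers the three pairs fall into the three bands in turn, which shows why cutoffs need tuning against real data: moving either cutoff changes which pairs go to clerical review.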
Designing Data Quality Rules
§ Holding area allows experimental match rules to be retained
§ Visual Histogram allows users to understand results
§ Pass Composer provides an intuitive overview of match passes
§ Decision Rules define match criteria
§ Cutoff Tuning allows match & clerical cutoffs to be visually fine-tuned
§ Data Viewer provides immediate feedback on match rule effects, using actual data

What Do You Do with Match Results?
§ Clerical review
§ Record linkage (cross-reference)
§ Survivorship
§ Append/Fix sources

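Survivorship can be sketched as building one record from a group of matched records, field by field. The rule used here, "keep the longest non-empty value", is only an assumed example; real survivorship rules are a business decision, and the function name `survive` is hypothetical.

```python
def survive(records, fields):
    """Build a surviving record from a group of matched records.

    Assumed rule for illustration: for each field, keep the longest
    non-empty value found in the group (first one wins on ties).
    """
    survivor = {}
    for field in fields:
        candidates = [r[field] for r in records if r.get(field)]
        survivor[field] = max(candidates, key=len) if candidates else None
    return survivor

matched_group = [
    {"name": "Kate A. Roberts", "zip": "02116", "phone": ""},
    {"name": "Catherine Roberts", "zip": "", "phone": ""},
    {"name": "Mrs. K. Roberts", "zip": "02116", "phone": ""},
]
best = survive(matched_group, ["name", "zip", "phone"])
```

Record linkage, by contrast, would leave the three source records in place and only store a shared cross-reference key for the group.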
Deployment Models for Data Quality Rules
Data quality rules need to be applied universally
§ In bulk movement and consolidation of data
§ Applied when data changes in source systems
§ Available as data quality services in a SOA
§ Embedded in federated queries
§ Callable directly from enterprise applications
Diagram labels: Logic Reuse, Request, Response, Query

Measuring Data Quality Over Time
§ Complete analysis of structure and content
§ Analysis can be run on a scheduled basis, or embedded in batch processes
§ View differences between current state and the baseline

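Comparing the current state with a baseline can be as simple as differencing two sets of measures. The sketch below assumes each analysis run yields a dictionary of metric names and scores between 0 and 1; the metric names, the numbers, and the function name are made up for the example.

```python
def compare_to_baseline(baseline, current):
    """Return the change in each metric that appears in both runs.

    Positive values mean the metric improved relative to the baseline.
    """
    return {
        metric: round(current[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in current
    }

# Assumed example scores from two analysis runs
baseline = {"completeness": 0.90, "validity": 0.80, "uniqueness": 0.97}
current = {"completeness": 0.95, "validity": 0.78, "uniqueness": 0.97}

delta = compare_to_baseline(baseline, current)
regressions = sorted(metric for metric, change in delta.items() if change < 0)
```

Run on a schedule or inside a batch job, the list of regressions gives an early signal that quality is eroding after a rule or source change.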
Lessons Learned & Best Practice: Involve the Business Early
§ Recruit an executive sponsor
  – Signals that the initiative is important
  – Assures that funds continue to be available
  – Discourages other business units from implementing conflicting projects
§ Convene a data quality working group
  – Assess and report on quality early in the process
  – May coincide with implementation teams or data warehousing teams
  – Business leads, but IT coordinates and facilitates
  – Strive for consensus
§ Have the business appoint a data quality steward for each business unit
  – For business units with large user populations, several stewards are appropriate

Lessons Learned & Best Practice: Control Scope Ruthlessly / Focus on Benefits
§ Business must own scope
  – Business should be owners, not renters
  – IT maintains its independence by not taking sides
  – Controlling scope encourages project discipline
§ Iterate
  – Projects which try to do it all in one pass generally fail
§ Measure, report, and deliver benefits regularly
  – Initial projects must provide at least some benefit, even if small, within 6-9 months
  – Subsequent phases should provide benefits every 3-6 months

Summary
§ Data quality is becoming an increasingly important organizational issue
§ Most critical business initiatives depend on quality information
§ Improving data quality requires a focused programmatic approach
§ The IBM Information Server provides all of this in a unified platform
§ At the core of any data quality program is a platform capable of providing auditable data quality services

How Can IBM Help?
§ Comprehensive platform for data quality
§ Experience and repeatable process for helping organizations set up data quality programs
§ Domain and industry-specific expertise in establishing repeatable data quality services
§ Data quality assessment offering to report on existing data quality and establish the business value of a data quality program
§ Contact your IBM representative for more information

Information On Demand 2006
Register Now: www.ibm.com/events/informationondemand
IBM Information On Demand 2006, October 15-20, 2006, Anaheim, California
§ The premier information management event for business and IT executives, managers, professionals, DBAs and developers
§ Select from over 800 sessions: a 2 1/2-day business leadership track with 180 sessions and a 5-day technical track with 650 sessions
§ Latest strategy and product announcements
§ Large Expo Center, hands-on labs
§ One-on-ones with executives and specialists
§ Birds of a Feather roundtables
Why attend:
§ Participate in the PREMIER discussion on the future of Information Management
§ Learn how the transformation to Information as a Service will help you unlock business value and drive competitive advantage
§ Hear how your peers are realizing ROI
§ Understand the roadmap to long-term strategic advantage
§ Learn best practices in your industry
§ Receive the best in technical education and free certification
§ Extensive opportunities for networking with both your peers and industry experts