283595aedcac0b21ef9917efff47aa0d.ppt
- Number of slides: 22
IBM Software Group
IBM Information Server Cleanse - QualityStage
© IBM Corporation

IBM Information Server: Delivering information you can trust

Support for Service-Oriented Architectures
- Understand: Discover, model, and govern information structure and content
- Cleanse: Standardize, merge, and correct information
- Transform: Combine and restructure information for new uses
- Deliver: Synchronize, virtualize and move information for in-line delivery

Platform Services: Parallel Processing, Connectivity, Metadata, Administration, Deployment

The IBM Solution: IBM Information Server
Delivering information you can trust

IBM Information Server
- Unified Deployment
- Understand, Cleanse, Transform, Deliver
- Unified Metadata Management
- Parallel Processing
- Rich Connectivity to Applications, Data, and Content

WebSphere QualityStage: Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views

Need for Data Quality

Data values as they appear across data sources:
- Kentucky Fried Chicken / KFC / Kent Fried Chick / Kentucky Fried
- 227 G CB&NAT STICK P QUE/MOZZ WRAPP. / 227 G CB&NATURAL STICK MOZZ WRAPPER
- Molly Talber DBA KFC / Mrs. M. Talber / John & Molly Talber, KFC, ATIMA

Critical Problems
- Need to create & maintain 360-degree views of customers, suppliers, products, locations, events
- Need to leverage data - make reliable decisions, comply with regulations, meet service agreements

Why?
- No common standards across organization
- Unexpected values stored in fields
- Required information buried in free-form fields
- Fields evolve - used for multiple purposes
- No reliable keys for consolidated views
- Operational data degrades 2% per month

Alternative Approaches
- Denial - problem misunderstood and ignored until too late; load and explode
- Hand-coding - clerical exception processing; very time consuming and resource intensive
- Simplistic cleansing apps - evolved from direct marketing & list hygiene, lack flexibility

Why Should I Care About Cleansing Information?
- Lack of information standards
  - Different formats & structures across different systems
- Data surprises in individual fields
  - Data misplaced in the database
- Information buried in free-form fields
- Data myopia
  - Lack of consistent identifiers inhibits a single view
- The redundancy nightmare
  - Duplicate records with a lack of standards

Importance of Data Quality
- Low data quality impacts an organization in several ways
  - Poor data quality leads to misguided marketing promotions
  - Cross-sell opportunities may be missed because the same customer appears several times in slightly different ways
  - Valued customers may not be recognized during support calls or other important touchpoints
  - Data mining is difficult because related items are not detected as related
- What is good data quality?
  - Two percent of “bad” data doesn’t sound that bad, does it?
  - Two percent of 10M rows means that you have 200K errors
- 200K errors add up to a big problem for analytics/operations/anything!

Enterprise initiatives… …to satisfy critical business requirements.
- Supply chain collaboration & item synchronization
- Inventory consolidation
- Single view of a customer or supplier
- Compliance
- ERP Implementations
- Business to Business Standards
- ERP instance consolidation
- IT System renovation
- Consolidation resulting from M&A activity

…need high quality data…
- Risk Management
- Reduce Costs & Increase Productivity
- Enterprise Data Warehouse
- Increase Revenue / CRM Payoff
- Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
- Business Intelligence Payoff

IBM WebSphere QualityStage
- Shared design environment with DataStage increases functionality and reduces development time
- Visual match rule interface simplifies match tuning
- Service orientation provides ‘continuous’ quality & delivers confidence in your data
- Parallel architecture shortens execution time

How will you get an accurate, consolidated view of your business?

Source data:
- Customers
- Products / Materials
- Transactions
- Vendors / Suppliers

WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship

Result: Target Database with Consolidated Views

Why Investigate
- Discover trends and potential anomalies in the data
- 100% visibility of single domain and free-form fields
- Identify invalid and default values
- Reveal undocumented business rules and common terminology
- Verify the reliability of the data in the fields to be used as matching criteria
- Gain complete understanding of data within context

Investigation - Free Form

Input: 123 St. Virginia St.

Parsing: Separating multi-valued fields into individual pieces
  123 | St. | Virginia | St.

Lexical analysis: Determining business significance of individual pieces
  number | street type | state | street type
  123 | St. | Virginia | St.

Context Sensitive: Identifying various data structures and content
  House Number | Street Name | Street Type
  123 | St. Virginia | St.

“The instructions for handling the data are inherent within the data itself.”
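
The three passes on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how QualityStage implements it: the lookup tables `STREET_TYPES` and `STATES`, the function names, and the positional rule in `interpret` are all assumptions made for the example.

```python
import re

# Hypothetical lookup tables; a real rule set would be far larger.
STREET_TYPES = {"ST", "AVE", "RD", "CIR"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}

def parse(text):
    """Parsing: split a free-form string into tokens, keeping a trailing period."""
    return re.findall(r"[A-Za-z0-9]+\.?", text)

def classify(token):
    """Lexical analysis: label one token without looking at its neighbours."""
    key = token.rstrip(".").upper()
    if key.isdigit():
        return "number"
    if key in STREET_TYPES:
        return "street type"
    if key in STATES:
        return "state"
    return "word"

def interpret(tokens):
    """Context-sensitive pass: use token position to settle ambiguous labels.

    Assumed rule: a leading number is the house number, a trailing street
    type is the street type, and everything in between is the street name.
    """
    labels = [classify(t) for t in tokens]
    if len(tokens) >= 3 and labels[0] == "number" and labels[-1] == "street type":
        return {
            "house_number": tokens[0],
            "street_name": " ".join(tokens[1:-1]),
            "street_type": tokens[-1],
        }
    return None

tokens = parse("123 St. Virginia St.")
labels = [classify(t) for t in tokens]
address = interpret(tokens)
print(tokens, labels, address)
```

Taken one token at a time, the first "St." looks like a street type and "Virginia" looks like a state; only the positional rule recognises "St. Virginia" as the street name.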

Rule Sets
- Pre-defined rules for parsing and standardizing:
  - Name
  - Address
  - Area (City, State and Zip)
- Multi-national address processing
- Validate structure:
  - Tax ID
  - US Phone
  - Date
  - Email
- Append ISO country codes
- Pre-process or filter name, address and area
- Rule sets are stored in the common repository
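
As a rough illustration of what "validate structure" means, the sketch below checks only the shape of a value, not whether it refers to anything real. The patterns, the date format, and the function names are assumptions chosen for the example; they are not the rule sets shipped with QualityStage.

```python
import re
from datetime import datetime

# Hypothetical structural patterns, chosen for illustration only.
US_PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")
TAX_ID = re.compile(r"^(\d{3}-\d{2}-\d{4}|\d{2}-\d{7})$")  # SSN-style or EIN-style
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_date(text, fmt="%m/%d/%Y"):
    """Return True if text parses as a calendar date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

def has_valid_structure(kind, value):
    """Dispatch to the structural check for one of the four field kinds."""
    value = value.strip()
    if kind == "date":
        return is_valid_date(value)
    patterns = {"us_phone": US_PHONE, "tax_id": TAX_ID, "email": EMAIL}
    return patterns[kind].match(value) is not None

print(has_valid_structure("us_phone", "(404) 555-0123"))
print(has_valid_structure("date", "13/40/1962"))
```

A check like this only flags values whose format is wrong; a well-formed phone number can still be disconnected.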

Standardization - Example

Input File:
Address Line 1 | Address Line 2
639 N MILLS AVENUE | ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130 |
3142 WEST CENTRAL AV | TOLEDO OH 43606
843 HEARD AVE | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 | EVANS GA 30809

Result File (values listed per column):
- House #: 639, 306, 3142, 843, 1139, 4275
- Dir: N, W, W
- Str. Name: MILLS, MAIN, CENTRAL, HEARD, GREENE, OWENS
- Type: AVE, ST, RD
- Unit No.: STE 536
- NYSIIS: MAL, MAN, CANTRAL, HAD, GRAN, ON
- City: ORLANDO, CUMMING, TOLEDO, AUGUSTA, EVANS
- SOUNDEX: O645, C552, T430, A223, E152
- State: FL, GA, OH, GA, GA, GA
- Zip: 32803, 30130, 43606, 30904, 30901, 30809
- ACCT#: 1234
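
The SOUNDEX column in the result file holds phonetic codes for the city names. The sketch below is a plain implementation of standard American Soundex, written to show how such codes are derived. It is not taken from QualityStage, and the function name is an assumption.

```python
# Standard American Soundex: map consonants to digits 1-6.
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)

def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")

for city in ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]:
    print(city, soundex(city))
```

Because similar-sounding spellings collapse to the same code, a code like this is useful as a coarse comparison or blocking key during matching.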

Why Match
- Identify duplicate entities within one or more files
- Perform householding
- Create consolidated view of customer
- Establish cross-reference linkage
- Enrich existing data with new attributes from external sources

Two Methods to Decide a Match

Are these two records a match?
  WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
  WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62

Deterministic Decision Tables:
- Fields are compared
- Letter grade assigned
- Combined letter grades are compared to a vendor-delivered file
- Result: Match; Fail; Suspect
  B B A A B D B A = BBAABDBA

Probabilistic Record Linkage:
- Fields are evaluated for degree-of-match
- Weight assigned: represents the “information content” by value
- Weights are summed to derive a total score
- Result: Statistical probability of a match
  +5 +2 +20 +3 +4 -1 +7 +9 = +49
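
A minimal sketch of the probabilistic approach, in the style of classic record linkage: each field has an assumed probability `m` of agreeing when two records really are the same entity and `u` of agreeing by chance, and the field weights are summed. The `m`/`u` values, the field names, and the use of exact equality instead of a degree-of-match comparison are all simplifying assumptions, so the score here will not reproduce the slide's +49.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees: rarer chance agreement scores higher."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (normally negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))

def score(rec_a, rec_b, params):
    """Sum the per-field weights for a pair of records."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total

# Hypothetical m/u probabilities per field, for illustration only.
PARAMS = {
    "first": (0.90, 0.05), "middle": (0.80, 0.10), "last": (0.95, 0.01),
    "house": (0.90, 0.02), "street": (0.90, 0.05), "type": (0.85, 0.30),
    "zip": (0.90, 0.01), "dob": (0.95, 0.001),
}

# The two records from the slide, split into fields.
A = {"first": "WILLIAM", "middle": "J", "last": "KAZANGIAN", "house": "128",
     "street": "MAIN", "type": "ST", "zip": "02111", "dob": "12/8/62"}
B = {"first": "WILLAIM", "middle": "JOHN", "last": "KAZANGIAN", "house": "128",
     "street": "MAINE", "type": "AVE", "zip": "02110", "dob": "12/8/62"}

# The per-field weights shown on the slide do sum to its stated total.
SLIDE_WEIGHTS = [5, 2, 20, 3, 4, -1, 7, 9]
print(sum(SLIDE_WEIGHTS))
print(round(score(A, B, PARAMS), 2))
```

The slide awards partial credit for near misses such as WILLIAM/WILLAIM, which is why most of its weights are positive. With exact comparison those fields count as full disagreements, so a production matcher would swap in a string-similarity comparison at that step.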

Why Survive
- Provide consolidated view of data
- Provide consolidated view containing the “best-of-breed” data
- Resolve conflicting values and fill missing values
- Cross-populate best available data
- Implement business and mapping rules
- Create cross-reference keys

Survivorship - Example

Survivorship Input (Match Output), values listed per column:
- Group: 1, 1, 23, 23, 23
- Legacy: D150, A1367, D689, A436, D352
- First: Bob, Robert, Ernest, Ernie
- Middle: A, Alex
- Last: Dixon, Dickson, Obrian, O’Brian, Obrian
- No.: 1500, 5901, 5901
- Dir.: SE, SW
- Str. Name: ROSS CLARK, 74TH, 74
- Type: CIR, ST, ST, ST
- Unit No.: STE 202, # 202

Consolidated Output:
- Group 1 (Legacy D150, A1367): Robert Dickson, 1500 SE ROSS CLARK CIR
- Group 23 (Legacy D689, A436, D352): Ernie Alex O’Brian, 5901 SW 74TH ST STE 202
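
The slide does not say which rules chose the surviving values, so the sketch below uses one common, assumed rule: per field, keep the most frequent non-empty value and break ties by taking the longest. The record layout and the split of values across the two input rows are illustrative. This rule reproduces the Group 1 output but would not, for example, prefer "Ernie" over "Ernest".

```python
from collections import Counter

def survive(records):
    """Build one consolidated record from a group of matched records.

    Assumed rule: most frequent non-empty value wins; ties go to the
    longest value. Fields missing in some records are filled from others.
    """
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)
    best = {}
    for name in fields:
        values = [(rec.get(name) or "").strip() for rec in records]
        values = [v for v in values if v]
        if not values:
            best[name] = ""
            continue
        counts = Counter(values)
        best[name] = max(counts, key=lambda v: (counts[v], len(v)))
    return best

# Illustrative match group loosely based on Group 1 from the slide.
GROUP_1 = [
    {"first": "Bob", "last": "Dixon", "no": "1500", "dir": "", "street": "ROSS CLARK", "type": "CIR"},
    {"first": "Robert", "last": "Dickson", "no": "1500", "dir": "SE", "street": "ROSS CLARK", "type": ""},
]

consolidated = survive(GROUP_1)
print(consolidated)
```

In practice each field usually gets its own rule (most recent, most trusted source, most complete), which is the "business and mapping rules" point from the previous slide.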

How Does WebSphere QualityStage Integrate

Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.

Data Extraction and Load Routines

QualityStage:
1. Investigation
2. Standardization
3. Integration
4. Survivorship

Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.

WebSphere DataStage and WebSphere QualityStage: Fully Integrated!

QualityStage: Data Quality Extensions
- IBM WebSphere QualityStage GeoLocator
- IBM WebSphere QualityStage Postal Verification Products
  - WAVES (WorldWide): IBM WebSphere Worldwide Address Verification Solution
- IBM WebSphere QualityStage Postal Certification Products
  - CASS (United States)
  - SERP (Canada)
  - DPID (Australia)
- IBM Information Server Data Quality Module for SAP
- IBM WebSphere QualityStage for Siebel

Key Strengths for IBM QualityStage
- Intuitive, “Design as you think” User Interface
  - Simple rule design & fine tuning
- Seamless Data Flow integration
- Intuitive rule design & fine tuning
- Defining the technology standard with SOA
- Industry leading probabilistic matching engine

Thank You