

  • Number of slides: 49

GIS Data Quality
Producing better data quality through robust business processes
Kim Ollivier
Bright Star TRAINING

Schedule Day One
Suggested breaks for the following times:
• Start: 9:00
• Session 1 (90 min)
• Morning tea: 10:30 to 10:45
• Session 2 (105 min)
• Lunch: 12:30 to 1:30
• Session 3 (90 min)
• Afternoon tea: 3:00 to 3:15
• Session 4 (105 min)
• Finish: 5:00
Each session will have an exercise or interactive discussion

Today
• Introduction
• What causes poor quality
• Lunch
• Assessing Quality processes
• GIS upgrade project examples

Tomorrow
• Metadata
• Designing rules
• Lunch
• Data warehouse and ETL
• Feature maintenance

Overview
• Introduce yourself
• Your goals for this course?
• Build a data quality system
• Avoid the worst traps
• Be able to describe a project scope
  • Budget, timeline, priorities

Sections of course based on [book cover image not reproduced]
With permission from the author
ISBN 978-0-09771400-2

What is Data Quality?
• “If they are fit for their intended uses in operations, decision making and planning.”
• “If they correctly represent the real-world construct to which they refer.”

Spatial Accuracy

Statistical Accuracy
• Completeness Score = Relevant / (Relevant + Missing)
• Accuracy Score = (Relevant - Errors) / Relevant
• Overall Score = (Relevant - Errors) / (Relevant + Missing)
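
The three scores can be sketched in a few lines of Python. This is only an illustration, assuming the slide's fractions read as completeness = relevant / (relevant + missing), accuracy = (relevant - errors) / relevant, and overall = (relevant - errors) / (relevant + missing). The function name `quality_scores` and the sample counts are made up for the example.

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return completeness, accuracy and overall scores as fractions in 0..1.

    relevant: records present in the dataset that should be there
    missing:  records that should be there but are absent
    errors:   relevant records that contain at least one error
    """
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    if missing < 0 or not 0 <= errors <= relevant:
        raise ValueError("missing must be >= 0 and errors within 0..relevant")

    completeness = relevant / (relevant + missing)
    accuracy = (relevant - errors) / relevant
    # Under the assumed definitions, overall equals completeness * accuracy.
    overall = (relevant - errors) / (relevant + missing)
    return {"completeness": completeness, "accuracy": accuracy, "overall": overall}


# Hypothetical counts: 900 records held, 100 known to be missing, 90 in error.
scores = quality_scores(relevant=900, missing=100, errors=90)
print(scores)
```

Keeping the three numbers separate shows whether a low overall score comes from gaps in coverage or from errors in what is held.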

Completeness
• LINZ Bulk Data Extract
• metadatameta.html

Data Profiling
• Find out what is there
• Assess the risks
• Understand data challenges early
• Have an enterprise view of all data

Profile Metrics
• Integrity
• Consistency
• Completeness, Density
• Validity
• Timeliness
• Accessibility
• Uniqueness
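
Some of these metrics can be computed per column with very little code. The sketch below covers only density, uniqueness and validity, and the definitions it uses (share of non-blank values, share of distinct values among the non-blank ones, share matching a pattern) are one possible interpretation rather than the course's own. The function `profile_column`, the parcel identifiers and the pattern are hypothetical.

```python
import re


def profile_column(values, valid_pattern=None):
    """Profile one column: density, uniqueness and (optionally) validity."""
    total = len(values)
    # Treat None and whitespace-only strings as blank.
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    result = {
        "count": total,
        "density": len(filled) / total if total else 0.0,
        "uniqueness": len(set(filled)) / len(filled) if filled else 0.0,
    }
    if valid_pattern is not None:
        regex = re.compile(valid_pattern)
        valid = sum(1 for v in filled if regex.fullmatch(str(v)))
        result["validity"] = valid / len(filled) if filled else 0.0
    return result


# Hypothetical parcel identifiers: one empty, one None, one duplicate, one malformed.
parcel_ids = ["P-0001", "P-0002", "", "P-0002", "p3", None]
report = profile_column(parcel_ids, valid_pattern=r"P-\d{4}")
print(report)
```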

Security
• Confidentiality
• Possession
• Integrity
• Authenticity
• Availability
• Utility

Consistency
• Discrepancies between attributes
• Exceptions in a cluster
• Spatial discrepancies
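
A discrepancy between attributes can often be found by grouping on one field and looking for minority values in a dependent field. The sketch below assumes that the most common value in each group is the expected one, which is a heuristic and not a rule from the course. The function name, field names and records are invented for illustration.

```python
from collections import Counter, defaultdict


def attribute_discrepancies(records, key_field, dependent_field):
    """For each key, report dependent values that differ from the most common one."""
    seen = defaultdict(Counter)
    for rec in records:
        seen[rec[key_field]][rec[dependent_field]] += 1

    findings = {}
    for key, counts in seen.items():
        if len(counts) > 1:  # more than one dependent value for the same key
            majority, _ = counts.most_common(1)[0]
            findings[key] = {
                "expected": majority,
                "exceptions": sorted(v for v in counts if v != majority),
            }
    return findings


# Made-up address records: one suburb is misspelt within its postcode cluster.
addresses = [
    {"postcode": "0622", "suburb": "Takapuna"},
    {"postcode": "0622", "suburb": "Takapuna"},
    {"postcode": "0622", "suburb": "Takapunna"},
    {"postcode": "0622", "suburb": "Takapuna"},
    {"postcode": "0620", "suburb": "Milford"},
    {"postcode": "0620", "suburb": "Milford"},
]
findings = attribute_discrepancies(addresses, "postcode", "suburb")
print(findings)
```

The output is a list of candidates for review, not a list of confirmed errors: a minority value may be the correct one.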

A GIS Data Quality System
• Assess
  • Data Quality Assessment
  • Data Profiling
• Improve
  • Data Cleaning
• Prevent
  • Monitoring Data Integration Interfaces
  • Ensuring Quality of Data Conversion and Consolidation
• Monitor
  • Recurrent Data Quality Assessment
• Recognise
  • Building Data Quality Metadata Warehouse

Course examples
• LINZ coordinate upgrade 1998-2003
• NSCC services upgrade 2008
• Valuation roll structure and matching
• ETL of utilities from SDE to AutoCAD
• Address location issues NAR, DRA
• Documents and examples on memory stick

Exercise 1: Nominate your database
Select a representative example dataset for later discussion
• You may be responsible for it
• Or, you have to integrate it
• Or, you have to load it
• Or, you supply it to others
Morning Tea

Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes

Processes Affecting Data Quality (all acting on the Database)
• Processes bringing data from outside
  • Initial Data Conversion
  • System Consolidations
  • Manual Data Entry
  • Batch Feeds
  • Real-Time Interfaces
• Processes causing data decay
  • Changes not captured
  • System Upgrades
  • New Data Uses
  • Loss of Expertise
  • Process Automation
• Processes changing data from within
  • Data processing
  • Data cleaning
  • Data purging

Outside: Initial Data Conversion
• Define data mapping
• Extract, Transform, Load (ETL)
• Drown in Data Problems
• Find Scapegoat

Outside: System Consolidation
• Often from mergers (Auckland?)
  • Unplanned, unreasonable timeframes
• Head-on two car wreck
• Square pegs into round holes
• Winner-loser merging (50% wrong)

Outside: Manual Data Entry
• High error rate
• Complex and poor entry forms
• Users find ways around checks
• Forcing non blanks does not work

Outside: Batch Feeds
• Large volumes mean lots of errors
• Source system subject to changes
• Errors accumulate
• Especially dangerous if triggers activated

Outside: Real-Time Interfaces
• Data between db’s in synchronisation
• Data in small packets out of context
• Too fast to validate
• Rejection loses record, so accepted
• Faster or better but not both!

Decay: Changes Not Captured
• Object changes are unnoticed by computers
• Retroactive changes may not be propagated

Decay: System Upgrades
• The data is assumed to comply with the new requirements
• Upgrades are tested against what the data is supposed to be, not what is actually there
• Once upgrades are implemented everything goes haywire

Decay: New Data Uses
• “Fitness to the purpose of use” may not apply
• Acceptable error rates may now be an issue
• Value granularity, map scale
• Data retention policy

Decay: Loss of Expertise
• Meaning of codes may change over time in ways that only “experts” know
• Experts know when data looks wrong
• Retirees rehired to work systems
• Auckland address points were entered on corners and the rest guessed, later used as exact

Decay: Process Automation
• Web 2.0 bots automate form filling
• Transactions are generated without ever being checked by people
• Customers given automated access are more sensitive to errors in their own data

Within: Data Processing
• Changes in the programs
• Programs may not keep up with changes in data collection
• Processing may be done at the wrong time

Special GIS Data Issues
• Coordinate data not usually readable
• Data models CAD v GIS
• Fuzzy matching is not Boolean (near)
• Atomic objects harder to define
• Features have 2, 3, 4, 5 dimensions
• Projection systems are not exact
• Topology requires special operators
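
The point that spatial matching is "near" rather than Boolean can be shown with a tolerance test in place of an equality test. The sketch below assumes planar coordinates in metres and a brute-force search, which is fine for illustration but would need a spatial index for real volumes. The function `match_points`, the identifiers and the coordinates are hypothetical.

```python
import math


def match_points(source, target, tolerance):
    """Pair each source point with the closest target point within tolerance.

    source, target: dicts of id -> (x, y) in the same projected coordinate system.
    Returns a dict of source id -> matched target id, or None when nothing is near.
    """
    matches = {}
    for source_id, source_xy in source.items():
        best_id, best_distance = None, tolerance
        for target_id, target_xy in target.items():
            distance = math.dist(source_xy, target_xy)  # Euclidean distance
            if distance <= best_distance:
                best_id, best_distance = target_id, distance
        matches[source_id] = best_id
    return matches


# Made-up points: "a" is about 0.5 m from "x"; "b" is 100 m from its nearest neighbour.
source = {"a": (1000.0, 2000.0), "b": (1500.0, 2500.0)}
target = {"x": (1000.3, 2000.4), "y": (1600.0, 2500.0)}

exact = source["a"] == target["x"]           # a Boolean test finds no match
matches = match_points(source, target, 1.0)  # a 1 m tolerance does
print(exact, matches)
```

The tolerance is a business decision tied to the accuracy of the source data, so it belongs in the rules and metadata rather than being buried in code.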

Within: Data Purging
• Highly risky for data quality
• Relevant data may be purged
• Erroneous data may fit criteria
• It may not work the next year

Within: Data Cleaning
• En masse processes may add errors
• Cleaning processes may have bugs
• Incomplete information about data

Assessing Data Quality
• Data profiling
• Interview users
• Examine data model
• Data Gazing

Data Gazing
• Count the records
• Just open the sources and scroll
• Sort and look at the ends
• Run some simple frequency reports
• See if the field names make sense
• What is missing that should be there
Lunch
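
Several of the data gazing steps (counting, sorting and looking at the ends, simple frequency reports, listing field names) can be done with the Python standard library alone. The sketch below is one way to do it, not a prescribed tool. The `SAMPLE` table, its field names and the function `gaze` are invented for the example.

```python
import csv
import io
from collections import Counter

# A made-up extract with typical problems: mixed case, a blank, an outlier.
SAMPLE = """id,road_type,width_m
1,Sealed,6.5
2,sealed,7.0
3,Sealed,
4,Metal,4.0
5,Sealed,650
"""


def gaze(csv_text, category_field, numeric_field, ends=2):
    """Count records, list fields, tally a category and show both ends of a number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    numbers = sorted(
        float(row[numeric_field]) for row in rows if row[numeric_field].strip()
    )
    return {
        "record_count": len(rows),
        "field_names": reader.fieldnames,
        "frequencies": Counter(row[category_field] for row in rows),
        "blank_numeric": sum(1 for row in rows if not row[numeric_field].strip()),
        "lowest": numbers[:ends],
        "highest": numbers[-ends:],
    }


summary = gaze(SAMPLE, category_field="road_type", numeric_field="width_m")
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the frequency report exposes the "Sealed" / "sealed" case mismatch, and the sorted ends expose the 650 m road width, which looks like a missing decimal point.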

Data Cleaning
• There are always lots of errors
• It is too much to inspect all by hand
• Data experts are rare and too busy
• It does not fix process errors
• You may make it worse

Automated Cleaning
• The only practical method
• Needs sophisticated pattern analysis
• Allow for backtracking
• Data quality rules are interdependent

Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata

Metadata
Includes everything known about the data
• Data model
• Business rules, relations, state
• Subclasses (lookup tables)
• GIS Metadata (NZGLS or ISO) XML
• Readme.txt

Data Exchange
• Batch or interactive
• ETL (Extract Transform Load)
• Replication
• Time differences in data

GIS in Business Processes
• Integrates many different sources
• Spatial patterns are revealed
• Display thousands of records simultaneously with direct access
• Location now seen as important

Scorecard (levels from top to bottom)
• DQ Score Summary
• Score Decompositions
• Intermediate Error Reports
• Atomic Level Data Quality Information
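
The scorecard levels can be read as successive roll-ups of the same atomic error records. The sketch below assumes that the atomic level is a list of (record, rule) failures and that a score is the share of records without errors. Both are illustrative assumptions, and the function `build_scorecard`, the rule names and the data are made up.

```python
from collections import defaultdict


def build_scorecard(record_ids, errors):
    """Roll atomic (record_id, rule_name) failures up into a scorecard."""
    total = len(record_ids)
    if total == 0:
        raise ValueError("no records to score")

    failed_by_rule = defaultdict(set)  # intermediate error reports, one per rule
    failed_records = set()
    for record_id, rule_name in errors:
        failed_by_rule[rule_name].add(record_id)
        failed_records.add(record_id)

    return {
        # A record with several errors counts once in the summary.
        "summary": 1 - len(failed_records) / total,
        "decomposition": {
            rule: 1 - len(ids) / total for rule, ids in failed_by_rule.items()
        },
        "error_reports": {rule: sorted(ids) for rule, ids in failed_by_rule.items()},
    }


# Ten hypothetical records; record 5 breaks two rules.
atomic_errors = [(2, "missing_owner"), (5, "missing_owner"), (5, "bad_geometry")]
scorecard = build_scorecard(list(range(1, 11)), atomic_errors)
print(scorecard)
```

Keeping the atomic records means every summary figure can be traced back to the individual features that caused it.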

Case Study
• Outline a GIS data quality system
• Measles Chart
• Prioritise
• Interview
• Build up a scorecard
Afternoon Tea

Assessment Exercise
• Split into pairs
• Interview one person about their dataset
• Collect basic information
• Devise a strategy for a profile
• Rotate pair with another
• Interview other person
• Verbal reports to class

Major Upgrade Projects
• LINZ Coordinate upgrade
• NSCC Coordinate upgrade

References
• Data Quality Assessment, by Arkady Maydanchik