STATISTICS NETHERLANDS – STATISTICS NORWAY
On the general flow of editing
Jeroen Pannekoek and Li-Chun Zhang
Work Session on Statistical Data Editing, Oslo, Norway, 24-26 September 2012
CBS-SSB

Introduction
• An overall data editing process involves all activities that transform raw micro-data with errors and missing values into edited statistical micro-data suitable for producing publication figures. GSBPM: review, validate and edit, impute, output control.
• To implement an E&I system we need more detailed descriptions, called statistical functions, each of which performs some action on the data.
• This paper tries to identify common statistical functions that are used as building blocks in different overall E&I processes or strategies.
• Decomposing the overall process can facilitate process design, re-use of methodological components, documentation, and generic software tools.

Contents
• Some classifications of data editing functions that are relevant for the process design.
• A summary of statistical data editing functions in some detail.
• Some process flow examples from the Netherlands and Norway, using the statistical functions as building blocks.
• Concluding remarks.

Classification of functions by purpose
• Verification
  Checking of hard and soft edit rules, calculation of scores, detection of systematic errors.
  Input: rules and data → Output: quality indicators and measures
  Less formal: graphical macro-editing, output control.
• Selection (for further processing)
  Selection of units for manual editing. Selection of variables to change (error localisation).
  Input: quality indicators and data → Output: selection of records or fields
• Amending
  Modifying selected data values to resolve problems detected by verification, including imputation of missing values.

Unit-mode versus batch-mode operation
Since manual editing is time-consuming, it should start during the sometimes lengthy data collection period. The same must then hold for any automatic editing function that is applied before manual editing.
• Unit-mode functions
  Proceed on a record-by-record basis and can be applied during the data collection phase.
• Batch-mode functions
  Use all of the data (or a large subset) and can only be applied near the end of the data collection phase.

Editing functions: verification (1/2)
• Edit-rules (unit-mode)
  Systems of connected balance edits:
    profit = turnover - total costs
    total costs = costs of employees + costs of purchases + …
  Non-negativity edits and inequalities. Ratio edits (soft).
• Score functions
  Measure the potential effect that editing a unit may have on estimates of totals or other aggregate parameters of interest.
  Based on measures of the deviation between observed values and predicted or "anticipated" values: s_i = f(x_j, x_j^a).
  – Unit-mode: x_j^a is based on historical data or another external source. Batch-mode: x_j^a is based on current data.
  – Also applied to measure and check the actual effect of (automatic) editing instead of the potential effect of editing. Then x_j^a is the edited value.
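The two verification functions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names (`profit`, `turnover`, `total_costs`), the rule names, and the particular form of f (a weighted absolute deviation) are hypothetical choices, since the slides do not specify them.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]

# Hypothetical hard edit rules: each maps a record to True (satisfied)
# or False (failed). Names and tolerance are illustrative assumptions.
HARD_EDITS: Dict[str, Callable[[Record], bool]] = {
    "profit_balance": lambda r: abs(
        r["profit"] - (r["turnover"] - r["total_costs"])
    ) < 1e-9,
    "turnover_nonneg": lambda r: r["turnover"] >= 0,
    "costs_nonneg": lambda r: r["total_costs"] >= 0,
}


def failed_edits(record: Record) -> List[str]:
    """Unit-mode verification: names of the hard edits this record fails."""
    return [name for name, rule in HARD_EDITS.items() if not rule(record)]


def score(observed: float, anticipated: float, weight: float = 1.0) -> float:
    """One assumed form of f: weighted absolute deviation between the
    observed value and its anticipated value."""
    return weight * abs(observed - anticipated)


record = {"profit": 20.0, "turnover": 100.0, "total_costs": 70.0}
print(failed_edits(record))      # the balance edit fails: 100 - 70 != 20
print(score(150.0, 100.0, 2.0))  # weighted deviation from the anticipated value
```

In unit-mode the `anticipated` argument would come from historical data or another external source; in batch-mode it would be estimated from the current data.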

Editing functions: verification (2/2)
• Extended score functions
  Score functions can be extended by adding indicators for further processing that are based on simple criteria other than the regular score function. For instance:
    >0: regular score value
    -9: "crucial" (dominates the totals in its branch) → manual editing
    -8: influential and main variables are missing → re-contact
    -7: non-influential and main variables are missing → unit non-response
• Macro-verification functions
  Are batch-mode by definition. They include all macro-editing activities: verifying aggregates, graphical inspection of distributions, graphical or model-based outlier detection, etc.
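A minimal sketch of how such an extended score might be coded. The flag names and the order in which the criteria are checked are assumptions made for illustration; the slides give only the code values and their meaning.

```python
from typing import Dict

# Follow-up action attached to each special code, as listed on the slide.
ACTIONS: Dict[int, str] = {
    -9: "manual editing",
    -8: "re-contact",
    -7: "unit non-response",
}


def extended_score(
    regular: float, crucial: bool, influential: bool, main_missing: bool
) -> float:
    """Return a special negative code when a simple criterion applies,
    otherwise the regular score value.

    Checking "crucial" first is an assumption, not stated in the source.
    """
    if crucial:
        return -9
    if main_missing:
        return -8 if influential else -7
    return regular


s = extended_score(3.2, crucial=False, influential=True, main_missing=True)
print(s, ACTIONS.get(int(s), "use regular score"))
```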

Editing functions: selection
• Selection of units for manual editing using regular scores
  By comparing to a predetermined threshold value (unit-mode).
  By ordering units by score and selecting the highest-ranking ones (batch-mode).
• Selection of variables for amendment: error localization (unit-mode)
  To resolve edit failures, some values need to be changed. The error localization problem is to select which variables to change.
  A generic automatic approach (Fellegi-Holt): select the smallest (weighted) number of variables to change.
• Macro-selection (batch-mode) of units for manual editing
  Implausible aggregates eventually lead to suspect units (drilling down).
  Graphical verification leads to selection of the most extraordinary units.
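The minimum-change idea behind Fellegi-Holt error localization can be illustrated with a brute-force search over small categorical domains. This is a toy sketch, not the algorithm used in production systems: the variables, domains, edits and weights are all hypothetical, and exhaustive enumeration is only practical for a handful of variables.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, str]
Edit = Callable[[Record], bool]  # True means the edit is satisfied


def localize_errors(
    record: Record,
    domains: Dict[str, Sequence[str]],
    edits: List[Edit],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[str, ...]:
    """Return a minimum-weight set of variables whose values can be
    changed so that the record satisfies every edit."""
    variables = list(record)
    w = weights or {v: 1.0 for v in variables}
    # Enumerate every subset of variables, cheapest (lowest weight) first.
    subsets = [
        s
        for size in range(len(variables) + 1)
        for s in combinations(variables, size)
    ]
    subsets.sort(key=lambda s: sum(w[v] for v in s))
    for subset in subsets:
        # Try every combination of replacement values for this subset.
        for values in product(*(domains[v] for v in subset)):
            trial = {**record, **dict(zip(subset, values))}
            if all(edit(trial) for edit in edits):
                return subset
    raise ValueError("no assignment satisfies the edits")


domains = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}
edits = [
    lambda r: not (r["age"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age"] == "child" and r["employed"] == "yes"),
]
record = {"age": "child", "marital": "married", "employed": "yes"}
print(localize_errors(record, domains, edits))
```

With equal weights, changing `age` alone resolves both edit failures. Giving `age` a high reliability weight makes the search prefer changing the other two variables instead.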

Editing functions: amendment
• Amendment of systematic errors (unit-mode)
  Errors with a detectable cause and a reliable correction mechanism.
  Generic: thousand errors, recognizable typos, rounding errors.
  Subject-related: specific "if-then" type correction rules.
• Deductive imputation of missing values (unit-mode)
  Some missing values can be determined uniquely from the hard edit rules, which gives the only feasible imputation.
• Model-based imputation (batch- or unit-mode)
  For most missing values we need model-based predicted values to impute. Batch-mode if current data are used to estimate the parameters.
• Adjustment for inconsistency (unit-mode)
  Adjustment of imputed values to ensure consistency with the edit rules.
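Two of the unit-mode amendment functions above lend themselves to a short sketch. The detection bounds for a thousand error (a ratio to a reference value between 300 and 3000) and the variable names are assumptions chosen for illustration; the slides do not give a detection rule.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]


def fix_thousand_error(
    value: float, reference: float, low: float = 300.0, high: float = 3000.0
) -> float:
    """Divide by 1000 when the value is roughly a factor 1000 larger than
    a reference value (e.g. last year's figure). Bounds are assumed."""
    if reference > 0 and low <= value / reference <= high:
        return value / 1000.0
    return value


def impute_from_balance(record: Record) -> Record:
    """Deductive imputation with the balance edit
    profit = turnover - total_costs: if exactly one of the three values
    is missing, it is determined uniquely by the other two."""
    out = dict(record)
    keys = ("profit", "turnover", "total_costs")
    missing = [k for k in keys if out[k] is None]
    if len(missing) != 1:
        return out  # nothing to deduce, or not uniquely determined
    if missing[0] == "profit":
        out["profit"] = out["turnover"] - out["total_costs"]
    elif missing[0] == "turnover":
        out["turnover"] = out["profit"] + out["total_costs"]
    else:
        out["total_costs"] = out["turnover"] - out["profit"]
    return out


print(fix_thousand_error(1250000.0, 1300.0))
print(impute_from_balance({"profit": None, "turnover": 100.0, "total_costs": 70.0}))
```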

Illustration of automatic editing
Data from child day care institutions: 500 records with 68 SBS-type variables and 40 hard edit-rules.

Stage                             Action                      # failed hard edits   # missing values
                                  none                        514                   0
Treatment of systematic errors    Thousand errors             514                   0
                                  Typing errors               476                   0
                                  Rounding errors             440                   0
Selection of fields to change     F-H error localization      -                     397
Automatic imputation/adjustment   Deductive imputation        -                     266
                                  Regression imputation       254                   0
                                  Adjustment of imp. values   0                     0

Process flow. Scenario A: Selective editing
[Flow chart]
Input micro data
1. Primary automated processing: 1a. Systematic errors; 1b. Evaluation of scores
2. Micro-selection: 2a. Selection using scores; 2b. (FH-)selection of fields
   Yes → 3. Clerical interactive editing
   No → 4. Automatic amendment of uncritical units: 4a. Imputation of missings; 4b. Adjustments
5. Macro-selection: macro-verification and selection
   Yes → 3. Clerical interactive editing
   No → Edited micro data
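Scenario A can be read as a composition of the statistical functions introduced earlier. The sketch below wires them together as plain callables. All function names and signatures are hypothetical placeholders, and clerical editing is represented by an ordinary function call purely for illustration.

```python
from typing import Callable, List, TypeVar

R = TypeVar("R")  # a record of any type


def scenario_a(
    records: List[R],
    correct_systematic: Callable[[R], R],         # step 1a (unit-mode)
    score: Callable[[R], float],                  # step 1b (unit-mode)
    threshold: float,                             # step 2: micro-selection
    manual_edit: Callable[[R], R],                # step 3: clerical editing
    auto_amend: Callable[[R], R],                 # step 4: impute and adjust
    macro_select: Callable[[List[R]], List[int]], # step 5 (batch-mode)
) -> List[R]:
    edited: List[R] = []
    for rec in records:
        rec = correct_systematic(rec)
        if score(rec) > threshold:
            rec = manual_edit(rec)   # critical unit
        else:
            rec = auto_amend(rec)    # uncritical unit
        edited.append(rec)
    # Macro-verification needs the whole batch; selected units go back
    # to clerical editing.
    for i in macro_select(edited):
        edited[i] = manual_edit(edited[i])
    return edited


# Toy usage with single numbers standing in for records.
result = scenario_a(
    [120.0, 500.0, 110000.0, -5.0],
    correct_systematic=lambda x: x / 1000 if x > 100000 else x,
    score=lambda x: abs(x - 100.0),
    threshold=200.0,
    manual_edit=lambda x: 100.0,
    auto_amend=lambda x: x,
    macro_select=lambda xs: [i for i, x in enumerate(xs) if x < 0],
)
print(result)
```

Steps 1 to 4 run record by record and can start during data collection; only the macro-selection step needs the complete batch.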

Process flow. Scenario B: More automatic editing
[Flow chart]
Input micro data
1. Primary automated processing (all unit-mode automatic editing): 1a. Systematic errors; 1b. (FH-)selection of fields; 1c. Imputation; 1d. Adjustments; 1e. Evaluation of scores
2. Micro-selection
   Yes → 3. Clerical interactive editing
   No → 4. Automatic amendment of uncritical units: 4a. Batch-mode imputation; 4b. Adjustments
5. Macro-selection: macro-verification and selection
   Yes → 3. Clerical interactive editing
   No → Edited micro data

Process flow. Scenario C: No timeliness problems
[Flow chart]
Input micro data
1. Primary automated processing: systematic errors
2. Macro-selection: macro-verification and selection, including batch-mode scores
   Yes → 3. (Partial) clerical interactive editing
   No → 4. Automatic amendment: 4a. Imputation of missings; 4b. Adjustments
Edited micro data

Concluding remarks
• The description of the overall process presented here can help communication between editing staff, project managers, process designers and methodologists. It clarifies the organization of the process and the choices that must be made.
• It also helps to define the functionalities and interfaces of generic software components by placing them in the context of the overall process scheme.
• Increasing automatic editing can greatly reduce the amount of manual editing. This may involve automatic editing of influential units and subject-specific "if-then" rules.