
- Number of slides: 25
The Best Practices in Census Data Processing Operation: Case of 2009 Census
By Cleophas Kiio, Director, ICT
15-Sep-10

Overview
• Data Processing Activities Review
• Planning for Data Processing
• Setting Up the Data Processing Site
• Implementation
• Data Capture
• Analysis
• Dissemination
• Archival

Data Processing Activities Review
Data processing (DP) follows the completion of field data collection and entails the following:
• Capture
• Cleaning/Editing
• Tabulation
• Analysis
• Dissemination
• Archival

Planning for Data Processing (DP)
1. Identification of methodology/technology:
– Keying From Paper (KFP): manual data entry, largely used in KNBS for small surveys
– Keying From Image (KFI): scanning
– Optical Mark Reading (OMR): scanning
– Optical/Intelligent Character Recognition (OCR/ICR): scanning
– Online data capture: use of PC
– Use of mobile devices (PDA)
• For the 2009 Census, KNBS chose scanning technology with OCR/ICR, having used the same in the 1999 Census.
• A study tour of the US Census Bureau was conducted to study its best practices.
• Major considerations were the budget and the availability of technical know-how.

Planning for Data Processing (DP) cont'd
2. Selection of tools and equipment:
• Computers: 125 high-capacity computers with dual screens were acquired.
• Servers: 3 high-end servers handled the census (each with 32 GB memory, multiple processors and 1 terabyte of secondary storage).
• Storage: 3 high-capacity Storage Area Networks (SANs) were procured, initially 5 terabytes (TB) each but later upgraded to 14 TB each.
• Software:
  – Capture software: given the challenges faced in the 1999 Census, when the Bureau used AFPS Pro from Top Image Systems (TIS), the Bureau chose the iCADE system (integrated Computer Assisted Data Entry system) developed by the US Census Bureau.
  – Cleaning/tabulation: CSPro (Census and Survey Processing System).
• Scanners: 3 new Kodak 1860 high-volume scanners were acquired in addition to the 2 existing Kodak 1900 scanners used during the 1999 Census. Capable of scanning over 200 pages per minute (ppm).
• Network infrastructure: all computers, scanners, servers and SANs were connected in a wide area network (WAN).

Planning for Data Processing (DP) cont'd
3. Design of questionnaires
• As standard practice, questionnaires are developed and designed with the technology to be used in mind.
• The 2009 Census questionnaires were designed by highly trained Bureau staff.
• Technical support was offered by the US Census Bureau.
• Precision in design was critical for compatibility with the iCADE system.

Setting Up the DP Site
1. Planning the layout (library, KFI, OCR/manual registration, server room, editing)
2. Installing the computer network
3. Installing the power supply system and provisioning for a power backup system: UPS and generators
4. Installing the furniture, lifts and air-conditioning
5. Procuring high-bandwidth internet
6. Providing a warehouse for storage
7. Recruiting staff

Implementation
– System installation and testing were completed after census enumeration
– Integrated Computer Assisted Data Entry (iCADE) system training
– In 2009 there were approximately 12 million A3 questionnaires
– Close to 500 personnel were engaged for the processing
– Processing took less than one (1) year to complete

a) 2009 Data Capture Processes
• Questionnaires were tracked with a custom-made tracking system with an inbuilt geocode list, to ascertain completeness and control flow
• Guillotining: trimming/cutting off the spirals
• iCADE system processes:
  o Batching: registering books from each enumeration area (EA) in iCADE
  o Scanning
  o Auto and manual registration
  o Exception review
  o OCR review
  o Key From Image (KFI)
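
The completeness check that a geocode list makes possible can be sketched as a set comparison. This is only an illustration, not the KNBS tracking system: the function name, the geocode format and the sample data are all hypothetical.

```python
def completeness_report(expected_eas, checked_in):
    """Compare checked-in books against the expected EA geocode list.

    expected_eas: iterable of EA geocodes (hypothetical format).
    checked_in:   iterable of (ea_geocode, book_number) pairs.
    """
    expected = set(expected_eas)
    received = {ea for ea, _book in checked_in}
    return {
        "missing": sorted(expected - received),      # EAs with no book yet
        "unexpected": sorted(received - expected),   # geocodes not on the list
        "complete": expected <= received,            # every expected EA seen
    }


# Hypothetical sample data, for illustration only.
expected_eas = ["101-01-001", "101-01-002", "101-01-003"]
checked_in = [
    ("101-01-001", 1),
    ("101-01-001", 2),
    ("101-01-003", 1),
    ("999-99-999", 1),
]

report = completeness_report(expected_eas, checked_in)
print(report)
```

A report like this flags both EAs whose books have not arrived and books carrying a geocode that is not on the master list.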

iCADE Process Flow
[Flow diagram. Boxes: Check-in and Guillotining; Library (Questionnaires Holding Area); Batching; Scanning; Auto and Manual Registration; Exception Review; OCR Review; Key From Image (KFI); Output Data; Images and Script Files Database (Server/SAN). Legend: Process Flow; Minimum Interaction.]

Capture Output
– Captured data was output to a text file, then auto-formatted as input to the CSPro software
– OCR characters read: 2,485,008,272 with an accuracy rate of 99.86% (0.14% error)
– KFI characters keyed: 228,771,647 with a 99.94% accuracy rate (0.055% error)
– This means OCR read over 90% of the characters with a very high accuracy rate (OCR review definitely helped achieve this accuracy, but customization algorithms also had to be added to reach this quality)
– 22,326,373 images from the census questionnaires
– 273,201 books in 144,098 batches
– 10,602 batches went to exception review; the other 133,496 batches bypassed exception review altogether and went straight into OCR
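
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative, and the error counts are estimates derived from the stated error rates rather than reported values.

```python
# Figures as reported on the Capture Output slide.
ocr_chars = 2_485_008_272
kfi_chars = 228_771_647
ocr_error_rate = 0.0014    # 0.14% error
kfi_error_rate = 0.00055   # 0.055% error

batches_exception = 10_602
batches_bypassed = 133_496
images = 22_326_373
books = 273_201

# Share of all captured characters that were read by OCR.
total_chars = ocr_chars + kfi_chars
ocr_share = ocr_chars / total_chars

# Estimated character errors implied by the stated rates.
ocr_errors_est = round(ocr_chars * ocr_error_rate)
kfi_errors_est = round(kfi_chars * kfi_error_rate)

# Batch totals and the share that needed exception review.
total_batches = batches_exception + batches_bypassed
exception_share = batches_exception / total_batches

# Average number of images per questionnaire book.
images_per_book = images / books

print(f"OCR share of characters: {ocr_share:.2%}")
print(f"Estimated OCR errors:    {ocr_errors_est:,}")
print(f"Estimated KFI errors:    {kfi_errors_est:,}")
print(f"Total batches:           {total_batches:,}")
print(f"Exception review share:  {exception_share:.2%}")
print(f"Images per book:         {images_per_book:.1f}")
```

The two batch counts sum to the 144,098 batches reported, and the OCR share comes out above 90%, consistent with the slide.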

b) 2009 Data Analysis
– KNBS used CSPro, free software from the US Census Bureau.
– This process required:
  • Subject matter specialists to provide editing rules
  • Programmers to implement the editing rules through programs
– The team developed the editing program with which the data was cleaned.

Editing/Cleaning and Imputation
• Editing is the systematic inspection of invalid and inconsistent responses, and their subsequent manual or automatic correction according to predetermined rules (edit specs).
• Imputation is the procedure of assigning values to missing, invalid or inconsistent data using a set of predefined criteria embedded in an editing program.
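
As a rough illustration of how edit specs and imputation fit together, the sketch below applies two validity edits and one consistency edit in plain Python. It is not the KNBS edit program, which was written in CSPro. The field names, valid ranges, starting donor values and the rule itself are hypothetical, and the sequential "last valid value" donor approach (a simple hot-deck style) is an assumption chosen for brevity, not a method stated in the source.

```python
VALID_SEX = {"M", "F"}     # hypothetical code list
MAX_AGE = 120              # hypothetical upper bound
MIN_MARRIAGE_AGE = 12      # hypothetical consistency threshold


def edit_and_impute(records):
    """Apply edit rules; return (cleaned records, log of changes)."""
    cleaned, log = [], []
    # Hypothetical starting donor values, replaced as valid records are seen.
    donor = {"sex": "F", "age": 25}
    for idx, original in enumerate(records):
        rec = dict(original)  # never modify the raw record

        # Validity edit: sex must be in the code list.
        if rec.get("sex") not in VALID_SEX:
            log.append((idx, "sex", rec.get("sex"), donor["sex"]))
            rec["sex"] = donor["sex"]
        else:
            donor["sex"] = rec["sex"]

        # Validity edit: age must be an integer within range.
        age = rec.get("age")
        if not isinstance(age, int) or not 0 <= age <= MAX_AGE:
            log.append((idx, "age", age, donor["age"]))
            rec["age"] = donor["age"]
        else:
            donor["age"] = age

        # Consistency edit: young children cannot be recorded as married.
        if rec["age"] < MIN_MARRIAGE_AGE and rec.get("marital") == "married":
            log.append((idx, "marital", "married", "never married"))
            rec["marital"] = "never married"

        cleaned.append(rec)
    return cleaned, log


raw = [
    {"sex": "M", "age": 34, "marital": "married"},
    {"sex": "X", "age": 8, "marital": "married"},
    {"sex": "F", "age": 250, "marital": "never married"},
]
cleaned, log = edit_and_impute(raw)
for entry in log:
    print(entry)  # (record index, field, old value, new value)
```

Keeping a log of every change supports the goal of identifying the types and sources of error, not just correcting them.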

Why Edit and Impute?
• Clean up data to facilitate analysis
• Identify types and sources of error
• Improve quality of census data
• Errors must be detected and their causes identified
• Appropriate corrective measures are taken to improve the overall data quality

Graphic Flow of Editing and Imputation
[Flow diagram. Elements: Code Book (Dictionary); Editing and Imputation (Edit Specs); iCADE Output Data; Data Cleaning Program; Clean Data.]

c) Data Tabulation
– Process of producing data outputs (tables, frequencies, cross-tabulations, …)
– Requires subject matter specialists to prepare dummy output layouts, supported by programmers
– Data is then presented in these tabular layouts.
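
A cross-tabulation of cleaned records can be sketched with the Python standard library. This is an illustration only, since the census tables were produced with CSPro. The field names, area names and records below are hypothetical.

```python
from collections import Counter


def cross_tabulate(records, row_key, col_key):
    """Count records for every (row value, column value) combination."""
    counts = Counter((rec[row_key], rec[col_key]) for rec in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Counter returns 0 for combinations that never occur.
    table = {row: {col: counts[(row, col)] for col in cols} for row in rows}
    return table, cols


def render(table, cols, row_label):
    """Lay the counts out as a plain-text table with row totals."""
    header = f"{row_label:<10}" + "".join(f"{c:>8}" for c in cols) + f"{'Total':>8}"
    lines = [header]
    for row, cells in table.items():
        body = "".join(f"{cells[c]:>8}" for c in cols)
        lines.append(f"{row:<10}{body}{sum(cells.values()):>8}")
    return "\n".join(lines)


# Hypothetical clean records, for illustration only.
clean_records = [
    {"area": "Area A", "sex": "Female"},
    {"area": "Area A", "sex": "Male"},
    {"area": "Area A", "sex": "Female"},
    {"area": "Area B", "sex": "Male"},
]

table, cols = cross_tabulate(clean_records, "area", "sex")
print(render(table, cols, "Area"))
```

The `render` function plays the role of the dummy output layout: the layout is fixed first, and the counts are then filled in from the clean data.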

Graphic Flow of Tabulation
[Flow diagram. Elements: Code Book (Dictionary); Area Names; Preferred Presentation (Tabulation Specs); Clean Data; Tabulation Program; Reports: Volume IA, Volume IB, Volume IC, Volume II, …]

d) Data Dissemination
– Providing the public with information through census books, fliers, CDs, DVDs and online databases (CensusInfo, IMIS, SMS service)
e) Data Archival
– Documentation for permanent storage, for further and future analysis

Challenges
– The warehouse was located about 10 km from the processing centre
– Inadequate processing space
– Printing was not perfect, which affected the OCR
– The limited number and constant breakdown of the KNBS dedicated lifts slowed down processing
– Power outages posed a major challenge
– Being a new system, it was accepted cautiously and slowly

Best Practices: Lessons Learnt
– A comprehensive DP plan should be developed with clearly defined objectives:
  1. Efficiency and effectiveness, to process in the shortest time possible
  2. Control of processing costs, to avoid budget overruns
  3. Quality data output
– Carry out risk analysis beforehand to identify potential pitfalls and put mitigation measures in place.

Best Practices: Lessons Learnt cont'd
– Cartographic mapping should be completed 1 year before the census, with geographical codes and related documentation (geo-codes) ready 6 months before enumeration.
– Census tools and equipment should be acquired in good time.
– The DP site should be ready 6 months before the enumeration date for test runs.
– Technical and maintenance support measures must be instituted and enforced.

Best Practices: Lessons Learnt cont'd
– Questionnaires and manuals should be ready 5 months before the census date to allow for logistics and pretesting.
– Total quality control at the printing press must be ensured for precision printing.
– Recruitment and training of staff should be done before the census date.
– The DP site should be located in close proximity to the questionnaire warehouse.

Conclusion
• Despite the challenges, it was possible to complete DP in less than a year after the census.
• However, with better planning and organization, it is possible to complete the exercise within 6 months after enumeration.
• The lessons learnt may form recommendations which, if adopted, would make this attainable.

Thank You!