Wye City Group Meeting on statistics on rural development and agriculture household income
An Open Source Approach to Disseminate Statistical Data on the Web
Giulio Barcaroli, Stefania Bergamasco, Stefano De Francisci, Leonardo Tininini
ISTAT
Rome, June 11-12, 2009

Outline
• The big picture
  – motivations (from the Wye Group Handbook)
  – generalized and open source software at ISTAT
• A first zoom: the ISTAR framework for integrated output management
• A further zoom: the ISTAR.MD module and its components
  – (open source) software architecture
  – motivations: do we really need another DWH / Web statistical dissemination system?
  – ISTAR.MD basic principles
  – the Web.MD navigation component
  – the Foxtrot.MD administration component
WCG Meeting - Leonardo Tininini - June 11-12, 2009

Motivations (from the Wye Group Handbook)
• Territory handling
  – [...] have clear understanding of what “rural” means and the geographical areas to which it is applied [...] use of various levels to suit the problem at hand; sometimes the concern will be a large area while for others something much smaller might be needed
  ⇒ the dissemination system should be able to deal, in a simple way, with data at different levels of territorial detail (and also with different ways of partitioning the territory)
• Integration, reliability, timeliness
  – Indicators [...] need to be drawn from many different data sources [...] They must be reliable, timely and avoid the pitfalls that come with the need to work with existing data and to mix sources
  ⇒ the dissemination system should support integration, provide automatic tools that avoid the bottlenecks caused by the frequent need for intervention by IT experts, and support the entire workflow from validated to disseminated data

Developing generalized/open software at ISTAT
• distributed organization that tends to produce a proliferation of similar (and often incompatible) subsystems
• strong need to harmonize and integrate the production processes
• development of generalized software supporting:
  – data integration
  – systems interoperability
  – reusability (especially in international cooperation projects)
• free/open source software
  – portable on many different hardware/software platforms
  – inherently modular
  – no cost for the final user (relevant in cooperation projects with developing countries, but not only...)
  – further fostering the reusability strategy

ISTAT open source tools for statistical cooperation
• CONCORD: editing and imputation of data
  – migrated from the original (SAS-based) to the current (Java-based) version, in the cooperation project with Bosnia and Herzegovina for its 2004 HBS
• MAUSS and GENESEES: sampling design and estimation, respectively
  – migrated from the original (SAS-based) to the current (R-based) versions: MAUSS-R and EVER (2007 and 2010 HBS in Bosnia and Herzegovina)
• ISTAR: statistical data dissemination, starting from validated microdata
  – originating from the system for the Web dissemination of the 2001 Italian Population and Housing Census data (Kosovo Pop. Census, HBS in B&H, ...)
• Other R packages (for sampling, data analysis, data mining, imputation of missing values, etc.), CSPro (CAPI), LimeSurvey (Web surveys), ARGUS (disclosure control), ...

ISTAR – The Integrated Output Management System
Main features
• supporting all phases of the statistical data life cycle (workflow) from validated data to dissemination
• integrated management of data (both elementary and aggregated), and metadata (e.g. classifications, reference metadata, glossaries and thesauri, metadata specifically designed for search engines)
• toolkit: collection of software modules and subsystems to be used in connection or independently, depending on the user's specific needs
  – modelling tools to design the semantic layer of the system (e.g. information content, data mappings, ETL procedures)
  – analysis and reporting tools, supporting both in-house data warehousing and (controlled) multi-dimensional navigation on the Web

ISTAR modules
• ISTAR.Meta to transform a (sequential) microdata file into a Data Mart (ready for data warehousing)
• ISTAR.Smol for OLAP and conventional data warehousing (ISTAT production sectors only)
• ISTAR.PD for table-oriented (and Partially Dimensional) Web dissemination of statistical data, with the 2 components:
  – ISTAR.Foxtrot.PD for administration and ETL processes
  – ISTAR.Web.PD for statistical dissemination by table-oriented Web navigation
• ISTAR.MD for data-warehouse-like (Multi Dimensional) Web dissemination of statistical data, with the 2 components:
  – ISTAR.Foxtrot.MD for administration and ETL processes
  – ISTAR.Web.MD for statistical dissemination by multi-dimensional Web navigation
• ISTAR.Glossary for managing glossaries and thesauri
• ISTAR.Doc for managing documentation information
• ISTAR.Search for managing all data related to search engine support

ISTAR - a little bit of history
• Originated from two Web applications developed to disseminate the data of the 2001 Italian Population and Housing Census (ancestors of ISTAR.Web.PD and ISTAR.Web.MD)
  – http://dawinci.istat.it/
• Enhanced and used in a cooperation project for disseminating the data of the 2004 Bosnia and Herzegovina HBS (and for the 2007 survey the whole ISTAR.MD module will be released)
  – http://hbsdw.istat.it/
• Enhanced and used in the last few years for disseminating on the Web the data from several ISTAT surveys
  – http://incipit.istat.it/
  – http://dip.istat.it/
  – http://lau.istat.it/
  – http://agri.istat.it/
• The ISTAR.MD module has recently evolved to support both MySQL and Oracle DBMSs and fully multi-lingual navigation
  – for the 2007 HBS of B&H the data will be available in Bosnian, Croatian, Serbian (using the Cyrillic alphabet) and English

ISTAR.MD open source architecture
[Architecture diagram. Components shown: browser, Apache, Tomcat, Java 2, ISTAR.MD application toolkit, JDBC, MySQL DBMS, POI-HSSF (spreadsheet files), XML configuration files]

ISTAR.MD: why not use a "table-based" system?
• Several applications are available for the dissemination of statistical tables on the Web (e.g. CountrySTAT / PC-Axis)
• (Statistical) table-based approach to Web dissemination
  – very useful to disseminate a (relatively) limited number of “key” statistical tables
  – possibility to combine heterogeneous data in a single table
  – as the number of tables increases, the selection of the “right” table can become more and more difficult
  – it is difficult (if not impossible) to support navigation across different tables (increase/decrease classificatory detail, add/remove classifications), typical of data warehouses

ISTAR.MD: why not use a conventional data warehouse?
• Strict correspondence between statistical databases (statistical dissemination systems) and data warehouses (Shoshani, 1997)
• Some DWHs are even open-source (e.g. the Pentaho suite)
• However, despite all similarities, several peculiarities exist...
  – sample surveys
  – data quality
  – preserving privacy
  – microdata unavailability
  – filter questions and heterogeneous hierarchies

“Sparse” dimensional cubes
• significance of data
• privacy protection
• ...
We may have few “points” corresponding to publishable data in the multidimensional cube
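
The effect of a publication rule on a cube can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two dimensions, the cell counts, and the single rule "a cell is publishable only if its count reaches a minimum threshold". The slides do not describe the actual rules used by ISTAR.MD.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # hypothetical dimensions: (sex, civil status)


def publishable_cells(cube: Dict[Cell, int], min_count: int) -> Dict[Cell, int]:
    """Keep only the cells whose count reaches the assumed publication threshold."""
    return {cell: n for cell, n in cube.items() if n >= min_count}


# Invented counts for a dense cube with four cells.
cube = {
    ("M", "single"): 120,
    ("M", "widowed"): 2,
    ("F", "single"): 135,
    ("F", "widowed"): 4,
}

# Only two of the four "points" remain publishable: the cube becomes sparse.
sparse = publishable_cells(cube, min_count=5)
```

The more dimensions are crossed, the smaller the counts in each cell tend to be, which is one way such a rule leaves only few publishable points.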

"Basic tables": the 4 fundamental components
[Diagram: a basic table (BT) and its four components]
• Object (measure)
• Classifications (dimensions)
• Territory (detail + area)
• Time
Example: Resident population by sex and civil status, regions of Northern Italy, 2001 Census

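The four components of a basic table can be written down as a small record type. This is only a sketch in Python: the class name and field names are hypothetical and do not come from the ISTAR.MD code; the values are those of the example on the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BasicTable:
    """Hypothetical record holding the four components of a basic table."""
    obj: str                           # object (measure)
    classifications: Tuple[str, ...]   # classifications (dimensions)
    territory_detail: str              # territory: level of detail
    territory_area: str                # territory: area covered
    time: str                          # time reference


# The example from the slide, expressed with the hypothetical record.
bt = BasicTable(
    obj="Resident population",
    classifications=("sex", "civil status"),
    territory_detail="region",
    territory_area="Northern Italy",
    time="2001 Census",
)
```
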
Maximum detail combinations and dimensional navigation
• Arbitrary number of maximum detail dimensional combinations (“maximum detail tables”) for each measure and year
• “constrained” navigation
  – roll-up
  – drill-down
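
A roll-up from a maximum detail table can be sketched as follows in Python. The table contents, the dimension names and the function `roll_up` are invented for illustration, and the sketch assumes an additive measure (plain counts). The "constraint" is modelled in the simplest possible way: only dimensions present in the maximum detail table can be requested.

```python
from collections import defaultdict

# Hypothetical maximum detail table: (sex, civil_status) -> count.
MAX_DETAIL_DIMS = ("sex", "civil_status")
MAX_DETAIL = {
    ("M", "single"): 10,
    ("M", "married"): 20,
    ("F", "single"): 12,
    ("F", "married"): 18,
}


def roll_up(keep):
    """Aggregate the maximum detail table onto the dimensions listed in `keep`."""
    unknown = [d for d in keep if d not in MAX_DETAIL_DIMS]
    if unknown:
        # Constrained navigation: nothing beyond the maximum detail is reachable.
        raise ValueError(f"not reachable from the maximum detail table: {unknown}")
    idx = [MAX_DETAIL_DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in MAX_DETAIL.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


by_sex = roll_up(("sex",))   # roll-up: civil status is removed
total = roll_up(())          # roll-up to the grand total
```

Drill-down is the reverse move: starting from `by_sex`, adding `civil_status` back leads to the maximum detail table itself, and no further.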

Navigating data with ISTAR.Web.MD

Choosing the statistical table components
• Objects and classifications are organized into hierarchies, mainly based on generalization
• Possibility to choose “generic” concepts that are automatically mapped to “real” tables by the system

An example of a "generic" query
Queries can be expressed in even more generic terms, e.g. by only specifying “Level of education” and “Age”. The system will complete the query with the appropriate objects (measures)
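
One way to picture the completion of a generic query is a lookup over a catalogue of available tables. The Python sketch below is not the ISTAR.Web.MD algorithm: the catalogue, the object names and the function `complete_query` are assumptions, and the matching rule (a table qualifies if it covers all requested classifications) is the simplest one that fits the description above.

```python
from typing import FrozenSet, List, NamedTuple, Set


class MaxDetailTable(NamedTuple):
    obj: str                          # object (measure)
    classifications: FrozenSet[str]   # classifications available for that object


# Hypothetical catalogue of maximum detail tables.
CATALOG = [
    MaxDetailTable("Resident population", frozenset({"Sex", "Age", "Level of education"})),
    MaxDetailTable("Employed persons", frozenset({"Sex", "Age", "Level of education"})),
    MaxDetailTable("Households", frozenset({"Number of members"})),
]


def complete_query(requested: Set[str]) -> List[str]:
    """Return the objects whose tables cover all the requested classifications."""
    return sorted(t.obj for t in CATALOG if requested <= t.classifications)


# The user specifies only the classifications; the objects are filled in.
objects = complete_query({"Level of education", "Age"})
```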

The Foxtrot.MD administration component
• Managing objects (and corresponding hierarchies, descriptions, ...)
• Managing classifications and corresponding hierarchies, values, ...
• Managing maximum detail tables, aggregation rules, ...
• Managing ETL workflow

ETL: from micro to aggregate data
• Checks on the structure of the microdata table (mandatory columns, territorial granularity, missing/unexpected territorial codes)
• Construction of auxiliary structures to speed up the aggregation computation
• Checks on the contents of the individual microdata table columns: missing or unexpected values, null values, special values corresponding to missing answers, etc.
• Microdata grouping and aggregation
• Handling of missing values, according to the results of the previous phases
• Generation of supporting files to improve the system's performance
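
Two of the steps above, the structural checks and the grouping/aggregation, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Foxtrot.MD implementation: the column names, the territorial codes, the code "9" for a missing answer and both function names are invented.

```python
from collections import Counter

MANDATORY = ("territory", "sex")     # assumed mandatory columns
KNOWN_TERRITORIES = {"R01", "R02"}   # assumed valid territorial codes
MISSING = "9"                        # assumed special value for a missing answer


def check_structure(rows):
    """Return a list of structural problems found in the microdata rows."""
    problems = []
    for i, row in enumerate(rows):
        for col in MANDATORY:
            if col not in row:
                problems.append(f"row {i}: missing column {col}")
        if row.get("territory") not in KNOWN_TERRITORIES:
            problems.append(f"row {i}: unexpected territorial code")
    return problems


def aggregate(rows):
    """Count rows by (territory, sex); missing answers go to a separate cell."""
    return Counter(
        (r["territory"], r["sex"] if r["sex"] != MISSING else "missing")
        for r in rows
    )


rows = [
    {"territory": "R01", "sex": "M"},
    {"territory": "R01", "sex": "F"},
    {"territory": "R01", "sex": "M"},
    {"territory": "R02", "sex": "9"},
]
problems = check_structure(rows)   # run the checks first
table = aggregate(rows)            # aggregate only once the checks pass
```

Keeping missing answers in their own cell, rather than dropping the rows, is one possible design choice; it lets the totals still match the number of microdata records.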

Some contacts...
• ISTAT Generalized Software
  – Giulio Barcaroli (barcarol@istat.it), Head of section for methods, tools and methodological support
• ISTAR Project
  – Stefano De Francisci (defranci@istat.it), Head of section for output management and integrated analysis
  – Stefania Bergamasco (bergamas@istat.it), Head of unit for Integrated Information Systems - ISTAR Project Manager
  – ISTAR Working Group: Cecilia Colasanti, Paolo Giacomi, Paola Giorgetti, Fausto Panicali, Paolo Piergentili, Domenico Scalzo, Nicoletta Severini, Leonardo Tininini
• ISTAT information technologies and methodologies
  – Carlo Vaccari (vaccari@istat.it), Head of section for information technologies and methodological support
  – Leonardo Tininini (tininini@istat.it), Head of unit for database technologies
• ISTAT international cooperation projects
  – Micaela Jouvenal (jouvenal@istat.it), Head of unit for technical cooperation