
Software for Improving Scientific Data Access Infrastructure Russ Rew Unidata Program Center
Overview • Problems in scientific data management • Some efforts toward finding solutions – Distributing near-real time data – A data model for the earth sciences – Advancing a metadata standard for the earth system science community – Serving metadata and data – Visualizing and analyzing geoscience data • Thoughts on the value of infrastructure
Thanks to: • GFD Dennou Club and members who visited Unidata in 2004 • Research Institute for Sustainable Humanosphere • The Japanese Meteorological Agency • National Science Foundation and UCAR • Unidata Program Center staff and associated community Masato Shiotani, Yasuhiro Morikawa, Masaki Ishiwata, Russ Rew, Takeshi Horinouchi, Ethan Davis, and Yoshi-Yuki Hayashi
Unidata • Funded primarily by the U.S. National Science Foundation • Mission: To provide data, tools, and community leadership for improving Earth-system education and research • At the Unidata Program Center, we – Provide access to data (via push and pull systems) – Develop open source tools and infrastructure for data access, analysis, visualization, and data management – Support users of our technologies: faculty, students, and researchers – Help to build, represent, and advocate for a community
Background • Science increasingly advances through collaborations and synthesis of data across disciplines • Tools are not keeping up with the need to analyze and combine data from different disciplines • We need to continue improving tools for data distribution, data modeling, metadata, remote access, and visualization – Within geosciences – Across scientific disciplines – For collaborations, including at international scale
Problems in the Current Scientific Data Landscape • Data volumes are approximately doubling every year • Data tools are not keeping pace with data volumes • Very large datasets demand new techniques for data management • Integrating data across organizations and disciplines is very difficult • Input/Output bandwidth is not keeping up with storage capacity • No single data model has achieved critical mass in the scientific community; each discipline has its own • There are many metadata models for each discipline and a lack of common conventions for metadata Summarized from Scientific Data Management in the Next Decade, by Jim Gray, David T. Liu, Maria Nieto-Santisteban, Alex Szalay, … CT Watch Quarterly, February 2005 www.ctwatch.org
Some Earth Science Data Characteristics • Large multidimensional arrays from forecast models • Coordinate systems to access data by region and time • Need for near-real time data • Read-only or read-mostly access to existing data • Security often a minor issue: data is usually freely available and shared, but data systems must – Protect data from accidental or malicious changes – Provide restricted access for a few data collections – Protect from denial of service caused by users asking for too much
Five Areas of Unidata Involvement 1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
Distributing Near-Real Time Data • Unidata’s Internet Data Distribution system – Delivers near-real time data: model outputs, surface, radar, upper-air, satellite observations, lightning, aircraft, observations from mesoscale networks, … – A collaboration of universities and other institutions – No data center: data products are injected from multiple sources – Unidata’s part includes development of client-server software, providing support, training workshops, coordination, negotiating data agreements
Why Not Just Use FTP? • Designed to send whole files, not many small products • If client pulls data from server, delays result from repeatedly asking server “is new data available yet?” • If server pushes data to clients, server must maintain connection information and state for each client • FTP can be slow for sending many small data products • In spite of these shortcomings, FTP is still used successfully in many data distribution systems, when: – Delay not an issue – Small number of clients per server – Only large products in whole files are distributed
Unidata’s LDM (Local Data Manager) • An alternative to FTP for data distribution • Protocols and client-server software for capturing, distributing, and organizing data in near-real time using reliable, event-driven data distribution • Supports subscriptions to subsets of data feeds • Suitable for pushing many small products, as well as large products • Highly configurable: can inject, distribute, capture, filter, and process arbitrary data products • Requires Unix system • Heart of the Internet Data Distribution system
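The contrast between polling a server and event-driven push with subscriptions to subsets of a feed can be illustrated with a toy example. The sketch below is not the LDM protocol or its API: the class ToyFeedHub, the feed names, and the product identifiers are all hypothetical, and everything runs in one process. It only shows the idea that a subscriber registers a feed and a pattern once, and matching products are then delivered as soon as they are injected.

```python
import re
from collections import defaultdict


class ToyFeedHub:
    """In-process stand-in for an event-driven product queue (hypothetical)."""

    def __init__(self):
        # feed name -> list of (compiled pattern, callback)
        self._subscriptions = defaultdict(list)

    def subscribe(self, feed, pattern, callback):
        """Register interest in products on `feed` whose ID matches `pattern`."""
        self._subscriptions[feed].append((re.compile(pattern), callback))

    def inject(self, feed, product_id, payload):
        """Push one product to every matching subscriber; return delivery count."""
        delivered = 0
        for regex, callback in self._subscriptions[feed]:
            if regex.search(product_id):
                callback(product_id, payload)
                delivered += 1
        return delivered


hub = ToyFeedHub()
received = []

# Subscribe to a subset of the (hypothetical) RADAR feed: one station only.
hub.subscribe("RADAR", r"^KFTG/", lambda pid, data: received.append(pid))

# Products arrive from any source; the subscriber never has to ask for them.
hub.inject("RADAR", "KFTG/N0R/20070101_0000", b"radar bytes")
hub.inject("RADAR", "KDEN/N0R/20070101_0000", b"radar bytes")
hub.inject("TEXT", "KFTG/bulletin", b"text bytes")
```

Only the first product reaches the subscriber, because the other two fail either the pattern or the feed match. The per-subscriber state kept by the hub is the cost of pushing that the FTP slide mentions.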
IDD (Internet Data Distribution) [Diagram: data sources feeding cooperating LDM nodes across the Internet] • Pushes data from multiple sources using cooperating LDMs • Over 170 institutions on 5 continents and growing
Real-Time Data Flows In the Beginning... • “a dizzying volume of information – on the order of 100 MBytes/day” (AMS paper on LDM-2, Davis and Rew, 1990) Now • 30 data feeds provide radar, satellite, text bulletins, lightning, model forecasts, surface and upper air observations, … • LDM-6 commonly handles 5 GB/hour input, with as many as 140,000 products/hour • LDM-6 was recently selected for data collection for the THORPEX Interactive Grand Global Ensemble (TIGGE) • A cluster LDM configuration can handle 400 downstream connections • Currently over 300 machines at 170 sites run LDM-6 continuously • Redundant feeds support reliability in case of failure of “upstream” machines or network • The National Weather Service uses LDM-6 to collect and relay NEXRAD level 2 radar data operationally for over 150 radars
IDD 2007 • Unidata IDD: North American data delivery and sharing network • IDD-Brasil: South American peer of North American IDD • IDD-Caribe (planning): Central American peer of North American IDD • Antarctic-IDD: Support of US Antarctic research community • Participants: United States, Canada, Puerto Rico, Costa Rica, Barbados, Venezuela, Chile, Brazil, Argentina, England, Portugal, Spain, Austria, Russia, Vietnam, China (Hong Kong), South Korea, Antarctica (incipient)
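To put the quoted rates side by side, here is a small back-of-the-envelope calculation. It assumes decimal units (1 GB = 10^9 bytes), which the slide does not specify, and it treats the 5 GB/hour and 140,000 products/hour figures as if they occurred together, which may not be the case.

```python
GB = 10**9  # decimal gigabyte (assumption; the slide does not say which)
MB = 10**6

early_bytes_per_day = 100 * MB      # "on the order of 100 MBytes/day" (1990)
ldm6_bytes_per_hour = 5 * GB        # LDM-6 input rate quoted above
ldm6_products_per_hour = 140_000    # product count quoted above

# Daily volume now, and how it compares with the 1990 figure.
ldm6_bytes_per_day = ldm6_bytes_per_hour * 24
growth_factor = ldm6_bytes_per_day / early_bytes_per_day

# Average product size and sustained network rate implied by the numbers.
mean_product_bytes = ldm6_bytes_per_hour / ldm6_products_per_hour
sustained_mbit_per_s = ldm6_bytes_per_hour * 8 / 3600 / 1e6

print(growth_factor, mean_product_bytes, sustained_mbit_per_s)
```

Under these assumptions the daily volume is about 1200 times the 1990 figure, the mean product is roughly 36 kB, and the sustained input rate is about 11 Mbit/s. The small mean product size is consistent with the point that a distribution system must handle many small products efficiently.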
Extract from 2006 American Meteorological Society presentation Data Products from CPTEC available on the IDD-Brasil W. G. Almeida, A. A. Lima, A. S. Pessoa, A. L. T. Ferreira, A. Bonfin, M. V. Mendes, Ferreira, S. H. - CPTEC/INPE G. O. Chagas, D. G. Coelho, M. G. Justi - UFRJ T. Yoksas - Unidata/UCAR
The IDD-Brasil § Began as a collaboration under Meteoforum project. § Participants: Unidata Program Center/UCAR, CPTEC/INPE, UFRJ, UFPA and USP § Inaugurated in January of 2004 with 4 nodes
IDD-Brasil Participation § The working paradigm for IDD-Brasil is: § You get free access to global data, tools and support § You give free access to your data-sets, provide infrastructure and support § Free data access and cooperation is a major topic of discussion in every Brazilian meteorological meeting
CPTEC’s Data Ingested into IDD-Brasil § ETA regional model, 40 km resolution (operational) § Automatic data-collecting network (operational) § GOES satellite imagery, full-resolution for South America (under testing) § Global T213 model (under testing) § Ensemble T126 Global model – 15 members (under testing)
INPE's automatic network: Data Collecting Platforms § More than 524 automated stations from INPE and cooperating Institutions § 50 Stations reporting Atmospheric Pressure § These were the first new data shared through the IDD-Brasil; soon they will also be on GTS (in BUFR format)
§ Location of all INPE data collecting platforms (PCD)
Conclusions I: § The IDD extension to Brazil (IDD-Brasil) is changing Brazilian Meteorology through: § Easier access to Global Data § Free availability of good analysis tools § Spreading ideas and practices § Free Data Sharing § Cooperation and mutual support
Conclusions II: § The IDD-Brasil is growing sharply § Today Brazil is the largest international IDD-user community § Numerical models of Brazilian institutions distributed by the IDD/IDD-Brasil are easily available to national and international users.
Conclusions III: § The data from several Brazilian mesonets may be distributed by the IDD/IDD-Brasil § These data are not available on GTS § They are very important because the data network in South America is sparse. § As a result of IDD expansion to Brazil, more Brazilian data are becoming available to the international community.
Contact Information Waldenio Gambi de Almeida, CPTEC/INPE, gambi@cptec.inpe.br; Maria G. A. Justi da Silva, UFRJ, justi@igeo.ufrj.br, www.meteorologia.ufrj.br; Tom Yoksas, Unidata/UCAR, yoksas@unidata.ucar.edu, www.unidata.ucar.edu
1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
How Adequate is the Relational Database Model for Scientific Data? • Designed and optimized for – Data in tables – Online transaction processing systems – Other business and enterprise problems • Successful in Geospatial Information Systems integration, such as ESRI ArcGIS • Also very successful in other disciplines, such as astronomy (Sloan Digital Sky Survey, U.S. National Virtual Observatory) • Not adequate for earth sciences data – N-dimensional arrays – Event-oriented systems, such as sensor webs or high-speed data streams – Indexing unstructured data, for example metadata in XML form – Supporting scientific analysis and visualization tools
Open Questions in Modeling Scientific Data • For how wide a realm is the relational database model adequate? • Is any data model that unifies data collections from many disciplines too complex to be useful? • Can one scientific data model be useful across many scientific disciplines? • Is one scientific data model even practical for earth sciences?
Network Common Data Form • A simple data model for scientific datasets • A format for portable, self-describing data • A programming library that uses efficient direct access and efficient subsetting of multidimensional arrays • Several programming interfaces: C, Fortran, C++, Java, Python, Perl, Ruby, . . . • Support for appending, sharing, and archiving data
The NetCDF-3 Data Model
Limitations of the NetCDF-3 Data Model • Too simple to represent some data structures and relationships • No real data structures, just scalars and multidimensional arrays • No “ragged arrays” or nested structures • Only one shared unlimited dimension, along which data can be appended • A flat name space for dimensions and variables • No strings, just arrays of characters • A limited set of numeric types • Only ASCII characters in names
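The classic model and two of its limitations (a flat name space and a single unlimited dimension) can be made concrete with a small in-memory sketch. This is an illustration only, not the netCDF library or its API: the class names ClassicDataset, Dimension, and Variable, and the example variable, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Dimension:
    name: str
    length: Optional[int]  # None marks the unlimited (record) dimension


@dataclass
class Variable:
    name: str
    dtype: str                       # e.g. "float", "short", "char"
    dims: Tuple[str, ...]            # names of shared dimensions; () is a scalar
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class ClassicDataset:
    dimensions: Dict[str, Dimension] = field(default_factory=dict)
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, object] = field(default_factory=dict)
    num_records: int = 0  # current length along the unlimited dimension

    def add_dimension(self, name, length=None):
        if name in self.dimensions:
            raise ValueError(f"dimension {name!r} already exists (flat name space)")
        if length is None and any(d.length is None for d in self.dimensions.values()):
            raise ValueError("only one unlimited dimension is allowed")
        self.dimensions[name] = Dimension(name, length)

    def add_variable(self, name, dtype, dims=()):
        if name in self.variables:
            raise ValueError(f"variable {name!r} already exists (flat name space)")
        for dim in dims:
            if dim not in self.dimensions:
                raise KeyError(f"undefined dimension {dim!r}")
        self.variables[name] = Variable(name, dtype, tuple(dims))
        return self.variables[name]

    def shape(self, name):
        """Current shape of a variable, using num_records for the unlimited dim."""
        lengths = []
        for dim in self.variables[name].dims:
            length = self.dimensions[dim].length
            lengths.append(self.num_records if length is None else length)
        return tuple(lengths)


ds = ClassicDataset()
ds.add_dimension("time")          # unlimited: data can be appended along it
ds.add_dimension("lat", 73)
ds.add_dimension("lon", 144)
temperature = ds.add_variable("temperature", "float", ("time", "lat", "lon"))
temperature.attributes["units"] = "K"
ds.num_records = 4                # four time steps appended so far
```

In this sketch every variable is a scalar or a rectangular array over shared dimensions, so there is no way to express a ragged array, a nested structure, or a second appendable dimension, which is the gap the limitations above describe.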
The NetCDF-4 Data Model
NetCDF’s Future • NetCDF-4 integrates netCDF with HDF5, another major standard format and data model • Parallel netCDF has proved suitable for high-performance computing • NetCDF-4 data model (CDM) improves interoperability with other scientific data representations • NetCDF-Java has advanced features, including access to remote data
NetCDF-4 Features Address limitations of netCDF-3 • User-defined compound types (portable structs) • User-defined variable-length types • Groups for nested scopes • Multiple unlimited dimensions • String type • Additional numeric types • Unicode names • Efficient dynamic schema changes • Multidimensional tiling (chunking) • Per variable compression • Parallel I/O • Reader-Makes-Right conversion
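Why multidimensional tiling (chunking) matters can be seen by counting how many chunks a given read has to touch. The sketch below is a minimal illustration under assumed shapes: the function name chunks_touched, the variable dimensions, and the chunk shapes are hypothetical, and this is not how the HDF5 or netCDF-4 libraries are implemented internally.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return the index tuples of all chunks that a hyperslab read intersects.

    start, count, chunk_shape are per-dimension sequences of equal length.
    """
    per_dimension = []
    for first_index, n_values, chunk_len in zip(start, count, chunk_shape):
        first_chunk = first_index // chunk_len
        last_chunk = (first_index + n_values - 1) // chunk_len
        per_dimension.append(range(first_chunk, last_chunk + 1))
    return list(product(*per_dimension))


# Hypothetical variable with dimensions (time=365, lat=180, lon=360).
# Read a full time series at one grid point.
start = (0, 90, 180)
count = (365, 1, 1)

# Layout A: one chunk per time step (like contiguous storage by record).
by_time_step = chunks_touched(start, count, chunk_shape=(1, 180, 360))

# Layout B: chunks that span all times over a small lat/lon tile.
by_tile = chunks_touched(start, count, chunk_shape=(365, 10, 10))

print(len(by_time_step), len(by_tile))
```

With the first layout the time-series read touches one chunk per time step, while the tiled layout serves it from a single chunk. A map read for one time step would favor the opposite choice, so chunk shapes are a trade-off between access patterns.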
The Unidata Common Data Model • For a common subset of abstractions in OPeNDAP, HDF5, and netCDF-4 • User-defined compound types (portable structs) • User-defined variable-length types • Groups for nested scopes • String type • Additional numeric types • Prototype implemented in netCDF-Java • Attempts a balance between simplicity and power of representation
NetCDF-Java • 100% Java library has advanced features compared to C-based interfaces • Prototype implementation of Common Data Model for access to netCDF-4, OPeNDAP, HDF5 – Provides netCDF interfaces to other formats: Grids (GRIB1, GRIB2), Radar (NEXRAD, NIDS, DORADE), Satellite (DMSP, GINI), Point Observations (BUFR) – Provides uniform coordinate systems layer • Includes access to THREDDS inventory catalogs
Common Data Model [Diagram layers: Applications; Scientific Datatypes: Point, Trajectory, Radial, Grid, Station, Swath; Coordinate Systems; Common Data Access Model: THREDDS, OPeNDAP, HDF5, GRIB, netCDF, ...]
1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
Some Metadata Issues • Every scientific discipline has its own metadata standard. Is any convergence likely? • How do you choose among the multiple candidates for metadata standards? • How can metadata be improved for existing data without rewriting the data?
Climate and Forecast (CF) Conventions • A widely used metadata standard for atmospheric, ocean, and climate data, based on netCDF • Specifies coordinate systems used in models, data cell properties and methods, packing, standard names for quantities, and grid mappings • CF-aware software can automatically determine space-time location of data variables • Originally intended for climate model output conventions, but use has broadened to weather and ocean models and observational data • Community governance structure now in place for maintaining and advancing the CF conventions, WMO Working Group on Coupled Modeling (WGCM)
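As one concrete instance of what the conventions specify, packing stores values as small integers together with scale_factor and add_offset attributes, and a reader recovers the physical value as packed * scale_factor + add_offset. The sketch below applies that rule to a plain Python list. The function name unpack and the sample values are hypothetical, and treating _FillValue as missing data is a simplification of the full rules.

```python
def unpack(packed_values, attributes):
    """Apply scale_factor/add_offset unpacking; fill values become None."""
    scale = attributes.get("scale_factor", 1.0)
    offset = attributes.get("add_offset", 0.0)
    fill = attributes.get("_FillValue")
    unpacked = []
    for value in packed_values:
        if fill is not None and value == fill:
            unpacked.append(None)          # missing data stays missing
        else:
            unpacked.append(value * scale + offset)
    return unpacked


# Attributes as they might appear on a packed 16-bit temperature variable.
attributes = {
    "standard_name": "air_temperature",
    "units": "K",
    "scale_factor": 0.01,
    "add_offset": 273.15,
    "_FillValue": -32768,
}

values = unpack([0, 1500, -32768], attributes)
print(values)
```

Because the attribute names and the unpacking formula are standardized, generic software can apply this step without knowing anything else about the dataset.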
Libcf • Purpose: to ease the creation and use of datasets conforming to the CF Conventions • In early stages of development and testing • C and Fortran interfaces available from Unidata in alpha release
Udunits (Unidata Units) • Library for manipulating units of physical quantities. – Conversion of unit specifications between formatted and binary forms – Arithmetic manipulation of unit specifications – Conversion of values between compatible scales of measurement • C, Fortran, and Java interfaces • Required by CF conventions • May soon be available as part of netCDF release
NcML (NetCDF Markup Language) • An XML representation of netCDF metadata, similar to CDL • A schema language for Earth science data – To get NcML from netCDF data, use ncdump -x or Java ToolsUI program – To create netCDF from NcML, use ToolsUI or (eventually) ncgen • Provides a way to add to or change metadata without rewriting, by referencing and overriding metadata in a file • Also supports aggregation of multiple files
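The idea of converting values between compatible scales of measurement can be sketched with a tiny lookup table in which each unit is described by a dimension, a scale, and an offset relative to a base unit. This is not the Udunits API or its unit database: the table UNITS, the unit spellings, and the function convert are hypothetical and cover only a handful of units.

```python
# unit -> (dimension, scale, offset), where base_value = value * scale + offset
UNITS = {
    "K":    ("temperature", 1.0, 0.0),
    "degC": ("temperature", 1.0, 273.15),
    "degF": ("temperature", 5.0 / 9.0, 459.67 * 5.0 / 9.0),
    "Pa":   ("pressure", 1.0, 0.0),
    "hPa":  ("pressure", 100.0, 0.0),
    "m":    ("length", 1.0, 0.0),
    "km":   ("length", 1000.0, 0.0),
}


def convert(value, from_unit, to_unit):
    """Convert a value between two units of the same dimension."""
    from_dim, from_scale, from_offset = UNITS[from_unit]
    to_dim, to_scale, to_offset = UNITS[to_unit]
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_unit} ({from_dim}) "
                         f"to {to_unit} ({to_dim})")
    base_value = value * from_scale + from_offset   # into the base unit
    return (base_value - to_offset) / to_scale      # out to the target unit


print(convert(0.0, "degC", "K"))
print(convert(1013.25, "hPa", "Pa"))
print(convert(32.0, "degF", "degC"))
```

Asking for an incompatible conversion, such as kilometers to kelvin, raises an error instead of returning a number, which is the kind of check that lets software combine data from different sources safely.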
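The way NcML changes metadata without rewriting data can be sketched as an overlay: the XML names the original file and lists only the attributes to add or override. In the sketch below the file name example.nc, the variable T, and the dictionary standing in for the file's existing metadata are hypothetical, and apply_ncml is an illustrative helper rather than part of any Unidata library. Only the attribute and variable elements are handled.

```python
import copy
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

NCML_TEXT = """\
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="example.nc">
  <attribute name="Conventions" value="CF-1.0"/>
  <variable name="T">
    <attribute name="units" value="K"/>
    <attribute name="standard_name" value="air_temperature"/>
  </variable>
</netcdf>
"""


def apply_ncml(original, ncml_text):
    """Return a copy of `original` metadata with NcML attributes overlaid."""
    merged = copy.deepcopy(original)          # the source metadata is untouched
    root = ET.fromstring(ncml_text)

    def qualified(tag):
        return f"{{{NCML_NS}}}{tag}"

    # Global attributes are direct children of the root element.
    for attribute in root.findall(qualified("attribute")):
        merged["global"][attribute.get("name")] = attribute.get("value")

    # Variable attributes are nested inside <variable> elements.
    for variable in root.findall(qualified("variable")):
        target = merged["variables"].setdefault(variable.get("name"), {})
        for attribute in variable.findall(qualified("attribute")):
            target[attribute.get("name")] = attribute.get("value")
    return merged


# Stand-in for the metadata already stored in the (hypothetical) file.
file_metadata = {
    "global": {"title": "model run"},
    "variables": {"T": {"units": "degK", "long_name": "temperature"}},
}

merged = apply_ncml(file_metadata, NCML_TEXT)
print(merged["variables"]["T"])
```

The nonstandard units string is corrected and a standard name is added in the merged view, while the stored metadata and the data arrays themselves are left as they were.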
1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
Client Server Data • Open-source Project for a Network Data Access Protocol, see opendap.org • A discipline-neutral protocol to get remote scientific data and metadata (not files) • Allows requests for subsets and aggregations • Software reference implementations for many kinds of data: netCDF, SQL (databases), HDF, FITS, JGOFS, … • In use in earth sciences, astronomy, medicine, … • Serves IPCC model output
FTP (File Transfer) versus OPeNDAP for Access to Remote Data • FTP accesses only whole files • OPeNDAP includes services for – Selected subset of data from a file – Aggregation of data in multiple files – Selected subset of aggregated data
• Protocol uses URLs and HTTP • Unidata provides OPeNDAP support • Several OPeNDAP servers available: pyDAP, FDS, GDS, DAPPER, THREDDS Data Server • OPeNDAP clients include: Ferret, GrADS, Matlab, IDL, ArcGIS, netCDF-Java, IDV • OPeNDAP version 2 now a NASA standard • Version 4 under development with a test version available: adds XML, new types, new functions, THREDDS catalogs, SOAP, outputs in HTML and ASCII • Will add authentication, more server-side processing
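Because the protocol uses URLs and HTTP, a subset request is just a URL with a constraint expression appended. The sketch below builds such a URL in the version 2 style, where each dimension gets a [start:stride:stop] index range with an inclusive stop. The server address, dataset path, and variable name are hypothetical, the helper names are illustrative, and no request is actually sent.

```python
def dap2_subset_url(dataset_url, variable, index_ranges, response="ascii"):
    """Build a subset request URL.

    index_ranges: one (start, stride, stop) triple per dimension, stop inclusive.
    response: suffix selecting the response form, e.g. "ascii", "dds", "das".
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in index_ranges
    )
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


def subset_size(index_ranges):
    """Number of values selected by the index ranges."""
    total = 1
    for start, stride, stop in index_ranges:
        total *= (stop - start) // stride + 1
    return total


# Hypothetical dataset: sea surface temperature with dims (time, lat, lon).
ranges = [(0, 1, 0), (10, 1, 20), (30, 2, 60)]
url = dap2_subset_url("http://hostname.edu/dods/sst.nc", "sst", ranges)
print(url)
print(subset_size(ranges))
```

The request selects one time step, eleven latitudes, and every second longitude in a range, so only a small fraction of the file crosses the network, whereas FTP would have to transfer the whole file.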
Thematic Real-time Environmental Distributed Data Services (THREDDS) • For data providers, implements data catalogs to present to users and applications • Catalogs are XML documents (metadata) describing and pointing to datasets accessible via client/server protocols (OPeNDAP, ADDE, WCS, HTTP) • Datasets may be found by discovery centers (master directories, digital libraries, data portals) via catalogs • Catalog hierarchy provides places to hang common metadata • Unidata coordinates THREDDS activities, community implements servers • Many partners as data providers, tool builders, interoperability experts from academia, government, industry
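How a client turns a catalog into data access URLs, and how the hierarchy carries common metadata, can be sketched with a small hand-written catalog. The catalog content, dataset names, paths, and server name below are hypothetical, and list_access_urls is an illustrative helper that handles only a single inherited service name, not the full catalog specification.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

CATALOG_TEXT = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Model output">
    <metadata inherited="true">
      <serviceName>dap</serviceName>
    </metadata>
    <dataset name="Run 00Z" urlPath="model/run_00.nc"/>
    <dataset name="Run 12Z" urlPath="model/run_12.nc"/>
  </dataset>
</catalog>
"""


def list_access_urls(catalog_text, server="http://hostname.edu"):
    """Return (dataset name, access URL) pairs for datasets with a urlPath."""
    def qualified(tag):
        return f"{{{THREDDS_NS}}}{tag}"

    root = ET.fromstring(catalog_text)
    bases = {s.get("name"): s.get("base") for s in root.findall(qualified("service"))}
    found = []

    def walk(node, inherited_service):
        service = inherited_service
        # Metadata marked inherited applies to this dataset and its children.
        for metadata in node.findall(qualified("metadata")):
            if metadata.get("inherited") == "true":
                name = metadata.findtext(qualified("serviceName"))
                if name:
                    service = name.strip()
        url_path = node.get("urlPath")
        if url_path and service in bases:
            found.append((node.get("name"), server + bases[service] + url_path))
        for child in node.findall(qualified("dataset")):
            walk(child, service)

    for top_level in root.findall(qualified("dataset")):
        walk(top_level, None)
    return found


for name, url in list_access_urls(CATALOG_TEXT):
    print(name, url)
```

The service name is stated once on the parent dataset and inherited by both model runs, which is a small example of the catalog hierarchy providing a place to hang common metadata.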
Motherlode Portal Catalog of Catalogs
NCDC Server
NCEP NAM Individual Run
THREDDS Data Server (TDS) • Serves data, THREDDS catalogs, and metadata • Reads and serves several kinds of data through a uniform CDM interface: netCDF, OPeNDAP, HDF5, GRIB, NEXRAD, … • Adds Earth-location coordinate systems to data • Provides OPeNDAP access and subsetting of any data readable with NetCDF-Java library • An integrated server provides data access through the OpenGIS Consortium Web Coverage Service (OGC/WCS) • Easy to install, 100% Java, freely available • Supports dynamic generation of catalogs
THREDDS Data Server [Architecture diagram: on hostname.edu, an HTTP Tomcat server hosts the THREDDS Server application (catalog.xml) with OPeNDAP, HTTPServer, and WCS services, built on the NetCDF-Java library and reading datasets and IDD data]
1. Distributing near-real time data 2. A data model for the earth sciences 3. Advancing a metadata standard for the earth system science community 4. Serving metadata and data 5. Visualizing and analyzing geoscience data
Integrated Data Viewer (IDV) • Unidata’s newest scientific analysis and visualization tool • Freely available 100% Java framework and reference application • Provides 2-D and 3-D displays of geoscience data • Stand-alone or networked application • Integrates data from different sources • Provides end-to-end test for technologies
Some IDV Features • Client-server data access from remote systems • Suite of data probes for interactive exploration (slice and dice) • Animations (temporal and spatial) • HTML interface for pedagogic materials • XML configuration and bundling allows collaboration with other educators • Java-based framework supports extensions built via plug-ins: for example, geosciences network (GEON) solid earth community
Catalog of catalogs in IDV (Catalog from within a Client)
Summary and Tentative Conclusions • “One Size Fits All” databases are not a comprehensive solution for scientific data or metadata. • Similarly, the old file-FTP approach is running out of steam for distributed data systems. • There is a limited time for establishing interdisciplinary data tools and services, before each discipline crystallizes on its own solutions. • Islands of non-interoperability that result would be unfortunate. • Flexibility to adapt to better solutions is required, maybe even before the best solutions are evident.
A Few Last Thoughts on Infrastructure …
What is Infrastructure? • The basic facilities, services, and installations needed for the functioning of a community – Utilities: water and power lines – Transportation and communications systems • Good infrastructure is reliable, sturdy, useful, long lasting, standardized, widely used, and invisible
Infrastructure: Stones in a Wall • Higher layers are built on lower layers • Stones may be replaced with other stones of similar size and shape • From the top, lower layers are invisible
Cyberinfrastructure: the Middle Layers [Layer diagram, top to bottom] • Community-Specific Knowledge Environments for Research and Education (collaboratories, grid community, e-science community, virtual community) • Customization for discipline- and project-specific applications • High performance computation services; Data, information, knowledge management services; Observation, measurement, fabrication services; Interfaces, visualization services; Collaboration services • Networking, Operating Systems, Middleware • Base Technology: computation, storage, communication From the “Atkins report” on Cyberinfrastructure
Is Developing Infrastructure Rewarding? • It’s abstract, so hard to explain at a party • You can’t take a picture or movie about it • If it works well, it is invisible • End users are often not aware of it • It doesn’t get referenced in scientific papers • It can be expensive to evolve and support • If not maintained, it eventually crumbles • You can’t sell it, so you have to give it away
Earth Science Infrastructure: Bricks in a Wall of Acronyms IDV GEMPAK IDD CDM LDM TDS THREDDS Unidata decoders NetCDF XML McIDAS Unix VisAD NetCDF Java NcML Udunits ArcGIS Developed by Unidata CONDUIT project Libcf NetCDF-4 GRIB SQL Ferret LEAD project OGC WCS CF CDL HTTP GrADS C Involvement by Unidata GALEON project OPeNDAP HDF5 BUFR NCO ADDE CSML HDF4 Fortran GML Java Other technologies Python, Ruby, …
Visible and Invisible Infrastructure Visible to End Users: IDV GEMPAK McIDAS ArcGIS GrADS Ferret NCO “Cloak of invisibility” IDD LDM THREDDS Unidata decoders NetCDF XML CDM Udunits Unix TDS netCDF Java NcML Libcf NetCDF-4 GRIB SQL CONDUIT project OGC WCS CF CDL HTTP VisAD BUFR C LEAD project OPeNDAP HDF5 ADDE CSML HDF4 Fortran GALEON project GML Java Python, Ruby, …
What Is Good Infrastructure? • Provides a useful service • Makes abstractions at the right level • Cloaks invisible details with a simple interface • Binds loosely to other infrastructure • Behaves reliably • Adapts easily to changes
An Example of Great Infrastructure: Popular Programming Languages • Base of huge collection of higher layers of infrastructure • People continue to build on top of this infrastructure • The opportunity to create a long-lasting and popular programming language is rare • John Backus (Fortran), John McCarthy (Lisp), Dennis Ritchie (C), Bjarne Stroustrup (C++), James Gosling (Java), Yukihiro “Matz” Matsumoto (Ruby) • Other great infrastructures: Unix, TCP/IP, HTTP, …
Rewards of Developing Infrastructure? • It “raises the level” for other developers • Beautiful and useful new layers and applications are built on top of it • You can feel a part of everything it supports • If it’s long lasting and widely used, you have made a difference for future generations • So, it’s one way to get closer to immortality • Infrastructure is abstract, but rewards can also be real • … like this trip to Japan!
For More Information http://www.unidata.ucar.edu/ support@unidata.ucar.edu russ@ucar.edu