- Number of slides: 20
“Practical” Provenance Using SciFlo Workflows
Brian Wilson, Gerald Manipon, and Hook Hua
Jet Propulsion Laboratory
Wilson et al., AGU Talk, Dec. 17, 2008
Outline
• Decade-Scale Climate Science Using Workflows
• Review of SciFlo: Scientific DataFlow Engine
• Uses of Provenance
• Auto-provenance from dataflow execution engine
  – processor versions, annotated results, etc.
• “Practical Provenance”
  – commit to workflow
  – get basic provenance for free
  – reproducible workflows
  – use web graph to store resources
Wilson et al., AGU Talk, Dec. 17, 2008; ESIP Federation, July 8, 2009
Large-Scale, Distributed Data Fusion
• Find Level-2 datasets across multiple data centers
  – Space/time granule query for multiple EOS (“A-Train”) instruments
  – AIRS, AMSR-E, AMSU, MODIS, CloudSat, GPS
• Co-locate retrievals using space/time metadata
  – Instantaneous “matchups” in space & time
• Read the data
  – Temperature, water vapor, quality flags, cloud properties (HDF)
• Understand the data
  – Units, quality control (non-trivial!!), etc.
• Publish merged products
  – Water vapor climatology, stratified by CloudSat cloud classes
• Publish multi-sensor “fused” products
  – Determine instrument biases, understand by stratifying
  – Fuse L2 data on a common grid
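The co-location step can be illustrated with a minimal sketch, assuming each retrieval carries only a time, a latitude, and a longitude. The `Retrieval` type, the `matchups` function, and the 1-hour / 100-km thresholds are hypothetical choices for illustration, not SciFlo's matchup operator.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class Retrieval:
    instrument: str
    time_s: float  # seconds since an arbitrary epoch
    lat: float     # degrees
    lon: float     # degrees


def great_circle_km(a, b):
    """Haversine distance between two retrievals, in km."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def matchups(xs, ys, max_dt_s=3600.0, max_km=100.0):
    """Return all (x, y) pairs close in both time and space (brute force)."""
    return [
        (x, y)
        for x in xs
        for y in ys
        if abs(x.time_s - y.time_s) <= max_dt_s and great_circle_km(x, y) <= max_km
    ]


# Made-up sample points: only the first GPS/AIRS pair is close in space and time.
gps = [Retrieval("GPS", 0.0, 10.0, 20.0), Retrieval("GPS", 0.0, -45.0, 100.0)]
airs = [Retrieval("AIRS", 1800.0, 10.2, 20.1), Retrieval("AIRS", 90000.0, 10.0, 20.0)]
pairs = matchups(gps, airs)
```

A production matchup would likely index granules by space/time metadata first instead of comparing every pair.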
AIRS/GPS Matchups
AIRS/GPS Temperature & Water Vapor Comparison Plots
SciFlo Workflow Engine
• Automate large-scale, multi-instrument science processing by authoring a dataflow document that specifies a tree of executable operators/services.
  – VizFlow Visual Authoring Tool (AJAX GUI in browser)
  – Distributed Dataflow Execution Engine (in Python)
• Data Grid: Move data “granules” to the operators using FTP, HTTP, or OPeNDAP URLs.
• Compute Grid: Move operators (executables) to the data.
• Built-in reusable operators provided for many tasks such as subsetting, co-registration, regridding, data fusion, etc.
• Custom operators easily plugged in by scientists.
• Leverage convergence of Web Services (SOAP) with Grid Services (Globus Toolkit v4).
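The idea of a dataflow document driving operator execution can be sketched as follows. This is a toy illustration only: the `FLOW` dictionary layout, the `@step` reference syntax, and the operator names are assumptions made for the example and do not reflect the actual SciFlo document format.

```python
from graphlib import TopologicalSorter

# Hypothetical operators: plain functions standing in for services/executables.
OPERATORS = {
    "query": lambda start, end: [f"granule_{d}" for d in range(start, end)],
    "subset": lambda granules: [g + ".sub" for g in granules],
    "count": lambda granules: len(granules),
}

# Hypothetical dataflow document: step name -> (operator name, inputs).
# An input string starting with "@" refers to the output of another step.
FLOW = {
    "find": ("query", [1, 4]),
    "cut": ("subset", ["@find"]),
    "n": ("count", ["@cut"]),
}


def is_ref(arg):
    return isinstance(arg, str) and arg.startswith("@")


def run(flow, operators):
    """Execute steps in dependency order and return every step's output."""
    deps = {name: {a[1:] for a in args if is_ref(a)} for name, (_, args) in flow.items()}
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        op, args = flow[name]
        resolved = [outputs[a[1:]] if is_ref(a) else a for a in args]
        outputs[name] = operators[op](*resolved)
    return outputs


results = run(FLOW, OPERATORS)
```

Because `run` keeps the output of every step, the returned dictionary already holds the intermediate results that a provenance record would point to.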
Service/Operator Orchestration
• Each SciFlo processing step is one of:
  – Template for XML (or string) generation
  – REST (HTTP GET) call: e.g. WMS/WCS, DAP URLs
  – SOAP service call: “have WSDL, will call”
  – XPath 2.0 transformation for XML mediation
  – XQuery 1.0 query/transformation
  – Command-line script or executable
  – Python method call
  – Scientist’s custom IDL or MATLAB script
  – Other (What do you need?)
VizFlowchart: GPS-AIRS Matchup & Temp. Profile Comparison
• Connect a series of services and operators into a dataflow
• Drag services/operators from menu, and drop onto the canvas
• Lay out the flowchart by moving nodes
• Connect the input/output ports by drawing lines
• User guided by matching up port names and types
SciFlo Applications
• AMAPS = Aerosol Modeling And Processing System
  – Amy Braverman, ACCESS PI; Joyce Penner, U. Michigan
  – Compare Aerosol Optical Depth (AOD) from MODIS, MISR, & AERONET to IMPACT model
• AQUA = Automated Query & Access
  – One-year ACCESS ECHO grant (Brian Wilson, PI)
  – Automated, repeatable access to 5-year EOS datasets for large-scale data mining
• MEaSUREs Project – Eric Fetzer, PI
  – Publish a temperature & water vapor climatology stratified by cloud scene (CloudSat classes) using A-Train data (AIRS, AMSR-E, AMSU, MODIS, MLS)
• Climate Virtual Observatory
  – Examine the biases of temperature retrievals from AIRS, AMSU, MLS by comparisons to GPS occultations
  – Stratify biases by geophysical conditions, cloud scene, etc.; study decade-scale trends.
Uses of Provenance
• Debugging production (instrument drift, algorithm bugs)
  – What data granule caused a production failure?
  – What executable version yielded “dubious” products?
• Traceable science
  – What data observed a climate event/anomaly?
  – What data contributed to analysis of a climate trend?
• Reproducible science
  – Re-generate the science analysis years later
  – Allow peer reviewers to reproduce and vary the science analysis by executing the workflows
Provenance in Production Systems
Two Approaches:
• Instrument the Production Scripts
  – Call out to ‘provenance capture’ API to record metadata
  – Could be web service calls
  – Intrusive
  – Only retain what you explicitly save
• Use Formal Workflow for Production
  – Annotated workflow document contains provenance
  – Versions of operators
  – Web service endpoints
  – Intermediate & final results, or pointers to them
  – Use links to limit combinatorial explosion
  – Workflow points to entire provenance, if URIs are permanent
Provenance Graph
• Chain of provenance is a directed graph linking inputs, processors with configuration, & computed outputs
• Saving the graph (replicas of resources)
  – Bullet proof
  – But unnecessary duplication
  – Combinatorial explosion
• Saving the graph (links to resources)
  – Provenance graph is on the web
  – But links rot, so more fragile
• Importance of permanent names (simplifies problem)
  – Permanent names under-used on the web
  – URLs can be permanent, just policy
  – Provenance system can guarantee permanence
  – Also could migrate to another system, e.g. XRI
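The "links to resources" form of the graph can be sketched as a small lookup table plus a traversal. The record layout, the `urn:example:` URIs, and the processor names below are all hypothetical and chosen only to illustrate tracing a result back to its inputs.

```python
# Hypothetical provenance records: each output URI maps to the processor
# (with version) that produced it and the input URIs it consumed.
PROVENANCE = {
    "urn:example:plot": {
        "processor": "compare v1.2",
        "inputs": ["urn:example:matchups"],
    },
    "urn:example:matchups": {
        "processor": "colocate v0.9",
        "inputs": ["urn:example:airs", "urn:example:gps"],
    },
}


def lineage(uri, graph):
    """Return every upstream URI reachable from `uri`, depth-first, without repeats."""
    seen = []
    stack = list(graph.get(uri, {}).get("inputs", []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        # URIs with no record of their own are treated as raw inputs.
        stack.extend(graph.get(node, {}).get("inputs", []))
    return seen


upstream = lineage("urn:example:plot", PROVENANCE)
```

With permanent URIs, the same traversal would dereference each link over the web instead of reading a local dictionary.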
Example: Subsetting Workflow
MISR Granule Subsetter
Annotated SciFlo Document
Auto-generated SciFlo Input Form
Input widgets
Processing Step #1
Call to versioned web service
Processing Step #2
Call to method in Python module
Version provenance:
  – Execution engine adds version annotation
  – Or here code bundle is versioned
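One way an engine could attach a version annotation to a Python method call is sketched below. The `example_ops` module, its `__version__` value, and the `call_annotated` helper are hypothetical; the slides do not show how SciFlo records the version.

```python
import importlib
import sys
import types

# Hypothetical operator module, registered in-process so the sketch is self-contained.
ops = types.ModuleType("example_ops")
ops.__version__ = "1.4.2"
ops.scale = lambda values, factor: [v * factor for v in values]
sys.modules["example_ops"] = ops


def call_annotated(module_name, func_name, *args):
    """Call module.func(*args) and return the result with a version annotation."""
    module = importlib.import_module(module_name)
    result = getattr(module, func_name)(*args)
    annotation = {
        "module": module_name,
        "function": func_name,
        # Modules without a __version__ attribute are recorded as "unknown".
        "version": getattr(module, "__version__", "unknown"),
    }
    return result, annotation


result, annotation = call_annotated("example_ops", "scale", [1, 2, 3], 10)
```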
Annotated Results Section
Results document exact granules used
Annotated Results (2)
Intermediate results from each processing step
Provenance Features
• Web Services & Operators versioned
  – Versioned URIs or annotations
  – Can snapshot code bundles, or better a virtual image of OS with installed operators
• Save (small) intermediate results
  – Particularly, list of input data granules returned from query
  – Can still reproduce workflow even if service/op unavailable
  – SciFlo user controls what is saved
• Provenance is contained in annotated SciFlo doc. and resources it links to (URIs):
  – Provenance is immediately returned to user with results
  – Distributed provenance implicit in links between workflows
  – Resource replicas can be bundled according to user policy
  – Database not necessary, but can be used for archiving/query
  – Provenance graph can be transformed to OPM/XML for interoperability
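Why saved intermediate results make a workflow reproducible even when a service is unavailable can be shown with a short sketch. The `run_step` helper, the `saved` dictionary, and the two stand-in query functions are hypothetical illustrations, not part of SciFlo.

```python
def run_step(name, service, saved):
    """Run `service`; if it is unreachable, fall back to the result saved under `name`."""
    try:
        result = service()
    except ConnectionError:
        if name not in saved:
            raise  # nothing was saved earlier, so the step cannot be replayed
        return saved[name], "replayed"
    saved[name] = result  # keep the (small) intermediate result for later replays
    return result, "live"


def live_query():
    # Stand-in for a granule query that succeeds.
    return ["granule_A", "granule_B"]


def dead_query():
    # Stand-in for the same query after the service has gone away.
    raise ConnectionError("service unavailable")


saved = {}
first = run_step("query", live_query, saved)
second = run_step("query", dead_query, saved)
```

The second call returns the same granule list as the first, marked as replayed, which is the behavior the "can still reproduce workflow" bullet describes.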
“Practical” Provenance
• Commit to Workflow (many other benefits)
  – Declarative production streams
  – Auto-parallel execution
• Get Basic Provenance For Free
  – No need to instrument production systems
  – Provenance graph implicit in SciFlo document & REST URLs
  – Web graph of permanent resources
  – By policy, can archive resource replicas (with URL redirection)
• Web Services Era
  – Workflow can be distributed using web services (multi-sensor, multi-data center science)
  – Distributed provenance is important
  – SciFlo or OPM docs. link to each other
  – Trace full graph by following REST URLs
• Reproducible workflows