  • Number of slides: 35

The GriPhyN Virtual Data System
GRIDS Center Community Workshop
Michael Wilde, wilde@mcs.anl.gov
Argonne National Laboratory
24 June 2005

Acknowledgements
…many thanks to the entire Trillium / OSG Collaboration, iVDGL and OSG Team, Virtual Data Toolkit Team, and all of our application science partners in ATLAS, CMS, LIGO, SDSS, Dartmouth DBIC and fMRIDC, SCEC, and Argonne’s Computational Biology and Climate Science Groups of the Mathematics and Computer Science Division.
The Virtual Data System group is:
  • ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  • U of Chicago: Catalin Dumitrescu, Ian Foster, Luiz Meyer (UFRJ, Brazil), Doug Scheftner, Jens Voeckler, Mike Wilde, Yong Zhao
GriPhyN and iVDGL are supported by the National Science Foundation.
Many of the research efforts involved in this work are supported by the US Department of Energy, Office of Science.
www.griphyn.org/vds

The GriPhyN Project
Enhance scientific productivity through…
  • Discovery, application and management of data and processes at petabyte scale
  • Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data: creating and managing datasets through workflow “recipes” and provenance recording.

Virtual Data Example: Galaxy Cluster Search
[Figure labels: DAG; Sloan Data; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)

A virtual data glossary
  • virtual data
      • defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
  • VDS: Virtual Data System
      • The tools to define, store, manipulate and execute virtual data workflows
  • VDT: Virtual Data Toolkit
      • A larger set of tools, based on NMI; VDT provides the Grid environment in which VDL workflows run
  • VDL: Virtual Data Language
      • A language (text and XML) that defines the functions and function calls of a virtual data workflow
  • VDC: Virtual Data Catalog
      • The database and schema that store VDL definitions

What must we “virtualize” to compute on the Grid?
  • Location-independent computing: represent all workflow in abstract terms
  • Declarations not tied to specific entities:
      • sites
      • file systems
      • schedulers
  • Failures: automated retry for data server and execution site unavailability

Expressing Workflow in VDL
    TR grep (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
    TR sort (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
    DV grep (a1=@{in:file1}, a2=@{out:file2});
    DV sort (a1=@{in:file2}, a2=@{out:file3});
[Diagram: file1 → grep → file2 → sort → file3]

Expressing Workflow in VDL
    TR grep (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
    TR sort (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }
    DV grep (a1=@{in:file1}, a2=@{out:file2});
    DV sort (a1=@{in:file2}, a2=@{out:file3});
Callouts on the code:
  • TR: define a “function” wrapper for an application
  • (in a1, out a2): define “formal arguments” for the application
  • DV: define a “call” to invoke the application
  • @{in:…}, @{out:…}: provide “actual” argument values for the invocation
  • Connect applications via output-to-input dependencies (file1 → grep → file2 → sort → file3)
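The output-to-input rule in the callouts can be illustrated outside VDL. The following is a minimal Python sketch, not part of VDS: the `Derivation` class and `dependencies` function are hypothetical names introduced here, and the sketch assumes each logical file is produced by at most one derivation. It infers that `sort` depends on `grep` purely from the file names the two calls share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical stand-in for a DV: a call with logical file names bound."""
    name: str
    inputs: tuple
    outputs: tuple


def dependencies(derivations):
    """Map each derivation to the derivations whose outputs it reads."""
    # Record which derivation produces each logical file name.
    producer = {}
    for dv in derivations:
        for lfn in dv.outputs:
            producer[lfn] = dv.name
    # A derivation depends on the producers of its inputs; files with no
    # producer (such as file1) are treated as pre-existing inputs.
    return {
        dv.name: sorted({producer[lfn] for lfn in dv.inputs if lfn in producer})
        for dv in derivations
    }


# The declaration order does not matter; only the file names do.
dvs = [
    Derivation("sort", inputs=("file2",), outputs=("file3",)),
    Derivation("grep", inputs=("file1",), outputs=("file2",)),
]
deps = dependencies(dvs)
print(deps)
```

No ordering is stated anywhere in the input: the dependency of `sort` on `grep` falls out of `file2` appearing as an output of one call and an input of the other.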

Essence of VDL
  • Elevates specification of computation to a logical, location-independent level
  • Acts as an “interface definition language” at the shell/application level
  • Can express composition of functions
  • Codable in textual and XML form
  • Often machine-generated to provide ease of use and higher-level features
  • Preprocessor provides iteration and variables

Compound Workflow
  • Complex structure
      • Fan-in
      • Fan-out
      • "left" and "right" can run in parallel
  • Uses input file
      • Register with RC
  • Supports complex file dependencies
      • Glues workflow
[Diagram nodes: preprocess, findrange, analyze]
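To see why "left" and "right" can run in parallel, the fan-out/fan-in shape can be written as a small dependency graph and grouped into levels. This is a Python sketch using the standard-library `graphlib` module (Python 3.9+). The node names `findrange_left` and `findrange_right` are labels chosen here for the two findrange calls, and `parallel_levels` is a hypothetical helper, not a VDS function.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must finish before it can start.
graph = {
    "preprocess": set(),
    "findrange_left": {"preprocess"},
    "findrange_right": {"preprocess"},
    "analyze": {"findrange_left", "findrange_right"},
}


def parallel_levels(graph):
    """Group nodes into batches whose members have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        # Every node returned here has all of its predecessors completed.
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = parallel_levels(graph)
for depth, nodes in enumerate(levels, start=1):
    print(depth, nodes)
```

The two findrange nodes land in the same batch because neither reads the other's output; `analyze` waits for both, which is the fan-in.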

Compound Transformations for nesting Workflows
  • Compound TR encapsulates an entire sub-graph:
    TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2, ) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

Compound Transformations (cont)
  • Multiple DVs allow easy generator scripts:
    DV d1->rangeAnalysis ( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
    DV d2->rangeAnalysis ( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
    ...
    DV d70->rangeAnalysis ( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
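A generator script of the kind this slide alludes to might look like the Python sketch below. It is an illustration, not the script used by the authors. It assumes the argument names follow the rangeAnalysis signature (fd, fc1, fc2, fb1, fb2, fa) and that logical file names are consecutive 5-digit uppercase hexadecimal counters, six per derivation, which matches the f.00000 to f.00005 and f.0019E to f.001A3 ranges shown. The slide does not say how p1 and p2 are chosen, so they are passed in as plain strings.

```python
def derivation(index, p1, p2, files_per_dv=6):
    """Return the VDL text of the index-th (1-based) rangeAnalysis DV."""
    base = (index - 1) * files_per_dv
    # Assumed naming scheme: consecutive 5-digit uppercase hex counters.
    lfn = [f"f.{base + k:05X}" for k in range(files_per_dv)]
    return (
        f"DV d{index}->rangeAnalysis( "
        f'fd=@{{out:"{lfn[5]}"}}, '
        f'fc1=@{{io:"{lfn[4]}"}}, fc2=@{{io:"{lfn[3]}"}}, '
        f'fb1=@{{io:"{lfn[2]}"}}, fb2=@{{io:"{lfn[1]}"}}, '
        f'fa=@{{io:"{lfn[0]}"}}, '
        f'p2="{p2}", p1="{p1}" );'
    )


# Hypothetical parameter list; only the three pairs visible on the slide.
params = {1: ("0", "100"), 2: ("0", "141.42135623731"), 70: ("18", "800")}
for index, (p1, p2) in params.items():
    print(derivation(index, p1, p2))
```

Because every DV calls the same compound TR, the generator only has to vary the file names and the two parameters.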

Using VDL
  • Generated directly for low-volume usage
  • Generated by scripts for production use
  • Generated by application tool builders as wrappers around scripts provided for community use
  • Generated transparently in an application-specific portal (e.g. quarknet.fnal.gov/grid)
  • Generated by drag-and-drop workflow design tools such as Triana

Basic VDL Toolkit
  • Convert between text and XML representation
  • Insert, update, remove definitions from a virtual data catalog
  • Attach metadata annotations to definitions
  • Search for definitions
  • Generate an abstract workflow for a data derivation request
  • Multiple interface levels provided:
      • Java API, command line, web service

Representing Workflow
  • Specifies a set of activities and control flow
  • Sequences information transfer between activities
  • VDS uses an XML-based notation called the “DAG in XML” (DAX) format
  • VDC represents a wide range of workflow possibilities
  • A DAX document represents the steps to create a specific data product

Executing VDL Workflows
[Diagram labels: Workflow spec; VDL Program; Virtual Data catalog; Virtual Data Workflow Generator; Abstract workflow; Create Execution Plan; Local planner; Statically Partitioned DAG; Dynamically Planned DAG; Grid Workflow Execution; DAGman & Condor-G; Job Planner; Job Cleanup]

OSG: The “target chip” for VDS Workflows
Supported by the National Science Foundation and the Department of Energy.

VDS Supported Via Virtual Data Toolkit
[Diagram labels: Sources (CVS); NMI; Build & Test Condor pool, 22+ Op. Systems; Build; Test; Package; Patching; GPT src bundles; Build Binaries; Binaries; RPMs; Pacman cache; VDT; Many Contributors]
A unique laboratory for testing, supporting, deploying, packaging, upgrading, & troubleshooting complex sets of software!
Slide courtesy of Paul Avery, UFL

Collaborative Relationships
[Diagram labels: Partner science projects; Partner networking projects; Partner outreach projects; Requirements; Prototyping & experiments; Production Deployment; Computer Science Research; Virtual Data Toolkit; Larger Science Community; Techniques & software; Tech Transfer; Other linkages (Work force, CS researchers, Industry); U.S. Grids; Int’l; Outreach]
Globus, Condor, NMI, iVDGL, PPDG, EU DataGrid, LHC Experiments, QuarkNet, CHEPREO, Dig. Divide
Slide courtesy of Paul Avery, UFL

VDS Applications
Application                          Jobs / workflow                            Levels   Status
ATLAS HEP Event Simulation           500K                                       1        In Use
LIGO Inspiral/Pulsar                 ~700                                       2-5      Inspiral In Use
NVO/NASA Montage/Morphology          1000s                                      7        Both In Use
GADU Genomics: BLAST, …              40K                                        1        In Use
fMRI DBIC AIRSN Image Proc           100s                                       12       In Devel
QuarkNet CosmicRay science           <10                                        3-6      In Use
SDSS Coadd; Cluster Search           40K; 500K                                  2; 8     In Devel / CS Research
FOAM Ocean/Atmos Model               2000 (core app runs 250 8-CPU jobs)        3        In use
GTOMO Image proc                     1000s                                      1        In Devel
SCEC Earthquake sim                  1000s                                               In use

A Case Study: Functional MRI
  • Problem: “spatial normalization” of images to prepare data from fMRI studies for analysis
  • Target community is approximately 60 users at Dartmouth Brain Imaging Center
  • Wish to share data and methods across the country with researchers at Berkeley
  • Process data from arbitrary user and archival directories in the center’s AFS space; bring data back to the same directories
  • Grid needs to be transparent to the users: literally, “Grid as a Workstation”

A Case Study: Functional MRI (2)
  • Based workflow on a shell script that performs a 12-stage process on a local workstation
  • Adopted a replica naming convention for moving users’ data to Grid sites
  • Created a VDL pre-processor to iterate transformations over datasets
  • Utilized resources across two distinct grids: Grid3 and Dartmouth Green Grid

Functional MRI Analysis
[Figure: analysis workflow graph]
Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

fMRI Dataset processing
    FOREACH BOLDSEQ
    DV reorient (  # Process Blood O2 Level Dependent Sequence
      input = [ @{in: "$BOLDSEQ.img"},
                @{in: "$BOLDSEQ.hdr"} ],
      output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"}
                 @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],
      direction = "y",
    );
    END
    DV softmean (
      input = [ FOREACH BOLDSEQ
                  @{in: "$CWD/FUNCTIONAL/har$BOLDSEQ.img"}
                END ],
      mean = [ @{out: "$CWD/FUNCTIONAL/mean"} ]
    );

fMRI Virtual Data Queries
Which transformations can process a “subject image”?
  • Q: xsearchvdc -q tr_meta dataType subject_image input
  • A: fMRIDC.AIR::align_warp
List anonymized subject-images for young subjects:
  • Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  • A: 3472-4_anonymized.img
Show files that were derived from patient image 3472-3:
  • Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  • A: 3472-3_anonymized.img 3472-3_anonymized.sliced.hdr atlas.img … atlas_z.jpg 3472-3_anonymized.sliced.img

US-ATLAS Data Challenge 2
Event generation using Virtual Data
[Figure: CPU-day plot; time axis labels: Mid July, Sep 10]

Provenance for DC2
How much compute time was delivered?
    | years | mon  | year |
    +-------+------+------+
    |   .45 |    6 | 2004 |
    |    20 |    7 | 2004 |
    |    34 |    8 | 2004 |
    |    40 |    9 | 2004 |
    |    15 |   10 | 2004 |
    |    15 |   11 | 2004 |
    |   8.9 |   12 | 2004 |
    +-------+------+------+
Selected statistics for one of these jobs:
    start: 2004-09-30 18:33:56
    duration: 76103.33
    pid: 6123
    exitcode: 0
    args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
    utime: 75335.86
    stime: 28.88
    minflt: 862341
    majflt: 96386
Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 kernel?
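The figures on this slide lend themselves to a couple of quick derived numbers. The Python sketch below simply sums the monthly table and relates the CPU time of the sample job to its duration. It assumes that `duration`, `utime` and `stime` are all in seconds, which the slide does not state. The variable names are chosen here for illustration.

```python
# CPU-years delivered per (year, month), copied from the table above.
cpu_years_by_month = {
    (2004, 6): 0.45,
    (2004, 7): 20,
    (2004, 8): 34,
    (2004, 9): 40,
    (2004, 10): 15,
    (2004, 11): 15,
    (2004, 12): 8.9,
}
total_cpu_years = sum(cpu_years_by_month.values())

# Sample job record; units assumed to be seconds.
duration = 76103.33  # wall-clock time
utime = 75335.86     # user-mode CPU time
stime = 28.88        # kernel-mode CPU time

wall_hours = duration / 3600
cpu_fraction = (utime + stime) / duration

print(f"total CPU-years, Jun-Dec 2004: {total_cpu_years:.2f}")
print(f"sample job wall time: {wall_hours:.1f} hours")
print(f"sample job CPU time / wall time: {cpu_fraction:.3f}")
```

Under that assumption the sample job ran for roughly 21 hours and spent about 99% of that time on the CPU. Per-job checks like this are possible because the provenance records keep the individual fields rather than only the monthly totals.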

LIGO Inspiral Search Application
  • Describe…
The Inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW Milwaukee), and the LSC Inspiral group.

Small Montage Workflow
~1200 node workflow, 7 levels
Mosaic of M42 created on the TeraGrid using Pegasus

FOAM: Fast Ocean/Atmosphere Model
250-Member Ensemble Run on TeraGrid under VDS
[Workflow diagram labels, repeated for Ensemble Members 1, 2, … N: Remote Directory Creation; FOAM run; Atmos Postprocessing; Ocean Postprocessing; Coupl Postprocessing; then Ensemble Member Results transferred to archival storage]
Work of: Rob Jacob (FOAM), Veronica Nefedova (Workflow design and execution)

FOAM: TeraGrid/VDS Benefits
[Figure labels: Climate Supercomputer; TeraGrid with NMI and VDS]
Visualization courtesy Pat Behling and Yun Liu, UW Madison

NMI Tools Experience
  • GRAM & Grid Information System
      • Tools needed to facilitate app deployment, debugging, and maintenance across sets of sites: “gstar” prototype at osg.ivdgl.org/twiki/bin/view/GriphynMainTWiki/GstarToolkit (work of Jed Dobson and Jens Voeckler)
  • Condor-G/DAGman
      • Efforts under way to provide means to dynamically extend a running DAG; also research exploring the influence of scheduling parameters on DAG throughput and responsiveness
  • Site Selection
      • Automated, opportunistic approaches being designed and evaluated
      • Policy-based approaches are a research topic (Dumitrescu, others)
  • RLS Namespace
      • Needed to extend data archives to the Grid and provide app transparency; some efforts underway to prototype this
  • Job Execution Records (accounting)
      • Several different efforts; desire to unify in OSG

Conclusion
  • Using VDL to express location-independent computing is proving effective: science users save time by using it over ad-hoc methods
      • VDL automates many complex and tedious aspects of distributed computing
  • Proving capable of expressing workflows across numerous sciences and diverse data models: HEP, Genomics, Astronomy, Biomedical
  • Makes possible new capabilities and methods for data-intensive science based on its uniform provenance model
  • Provides an abstract front-end for Condor workflow, automating DAG creation

Next Steps
  • Unified representation of data-sets, metadata, provenance and mappings to physical storage
  • Improved queries to discover existing products and to perform incremental work (versioning)
  • Improved error handling and diagnosis: VDS is like a compiler whose target chip architecture is the Grid; this is a tall order, and much work remains.
  • Leverage XQuery to formulate new workflows from those in a VO’s catalogs
