e9f2e186d9bc37e8e77567562c38004b.ppt
- Number of slides: 67
Scientific Workflows and Systems Ewa Deelman USC Viterbi School of Engineering
Outline • Scientific workflows • Business workflows • Different workflow systems – Taverna – Kepler – Triana – Askalon
Applications today • Complex – Involve many computational steps – Require many (possibly diverse) resources • Composed of individual application components – Components written by different individuals – Components require and generate large amounts of data – Components written in different languages • Reuse of individual intermediate data products • Need to keep track of how the data was produced Ewa Deelman deelman@isi.edu
Workflow Instance [Diagram: n parallel branches (Image 1, Image 2, …, Image n), each running Collect image then Adjust Color; the branches feed Co-Add image, then Visualize] Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu
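The workflow instance above can be sketched as a dependency graph. This is a minimal illustration, not code from Pegasus or any system named in these slides; the task names (`collect_1`, `adjust_1`, `coadd`, `visualize`) and the helper `build_instance` are hypothetical. It uses Python's standard `graphlib` module (Python 3.9+) to derive one valid execution order.

```python
from graphlib import TopologicalSorter


def build_instance(n):
    """Return {task: set of predecessor tasks} for a workflow over n images."""
    deps = {"coadd": set(), "visualize": {"coadd"}}
    for i in range(1, n + 1):
        deps[f"collect_{i}"] = set()                 # no prerequisites
        deps[f"adjust_{i}"] = {f"collect_{i}"}       # adjust follows collect
        deps["coadd"].add(f"adjust_{i}")             # co-add waits for every branch
    return deps


deps = build_instance(3)
# One valid serial schedule; the per-image branches are independent and
# could equally run in parallel.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The collect/adjust branches have no edges between them, which is what lets a workflow system run them concurrently on different resources.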
Business Workflows
Business Workflows • Designed to compose applications based on web services • BPEL – Standard language for service interactions – Has many constructs to deal with the invocation of web services, including fault handling, and support for conditional logic.
BPEL constructs • <receive>: Blocks until a matching message is received. Typically used to receive a message from the client or a callback from a partner web service. • <reply>: Sends a message in response to a message received via a <receive>. • <invoke>: Performs an invocation on a web service (one-way or request-response). • <assign>: Assigns a value to a variable. • <sequence>: Executes a list of activities sequentially in lexical order. • <flow>: Executes the activities in parallel. • <while>: Repeats an activity for as long as a condition holds. • <switch>: Selects one branch for execution from a set of branches based on a value.
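To make the control-flow constructs above concrete, here is a toy sketch of their semantics as Python combinators. It is not BPEL and does not use any BPEL engine API; the function names (`sequence`, `flow`, `while_`, `switch`, `assign`) and the shared `ctx` dictionary are assumptions made for illustration only. Message-oriented constructs such as <receive>, <reply> and <invoke> are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def assign(name, fn):
    """<assign>: store fn(ctx) in the variable `name`."""
    def run(ctx):
        ctx[name] = fn(ctx)
    return run


def sequence(*activities):
    """<sequence>: run activities one after another, in the order given."""
    def run(ctx):
        for act in activities:
            act(ctx)
    return run


def flow(*activities):
    """<flow>: run activities in parallel and wait for all of them."""
    def run(ctx):
        with ThreadPoolExecutor() as pool:
            # list() forces completion and re-raises any exception
            list(pool.map(lambda act: act(ctx), activities))
    return run


def while_(condition, body):
    """<while>: repeat body for as long as condition(ctx) is true."""
    def run(ctx):
        while condition(ctx):
            body(ctx)
    return run


def switch(cases, otherwise=None):
    """<switch>: run the first branch whose predicate matches."""
    def run(ctx):
        for predicate, act in cases:
            if predicate(ctx):
                return act(ctx)
        if otherwise is not None:
            otherwise(ctx)
    return run


process = sequence(
    assign("count", lambda c: 0),
    while_(lambda c: c["count"] < 3,
           assign("count", lambda c: c["count"] + 1)),
    flow(assign("a", lambda c: "left"), assign("b", lambda c: "right")),
    switch([(lambda c: c["count"] == 3, assign("status", lambda c: "done"))],
           otherwise=assign("status", lambda c: "unexpected")),
)
ctx = {}
process(ctx)
print(ctx)
```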
Many BPEL engines • ActiveBPEL • IBM BPEL4J • Oracle BPEL Process Manager • Microsoft Windows Foundation • …
Scientific vs Business Workflows Scientific workflows: • Large amounts of data • Varied granularity of computations • Large number of computations • Often standalone components • Non-programmers need to be able to compose them • Need to provide provenance info • Performance is important Business workflows: • Deal with services across domains • Do not deal with standalone application components • Usually not very data intensive – Data can be easily sent between services • Important to agree on standard interfaces so that MS & IBM can work together • Focus on functionality/interoperability rather than performance
Example of a business workflow
Example of Scientific Workflow • Workflow Specification • Components – Standalone computations – Designed by different individuals
Different workflow systems Taverna, a workbench for bioinformatics workflows Slides courtesy of Katy Wolstencroft
The Community Problems • Everything is Distributed – Data, Resources and Scientists • Heterogeneous data • Very few standards – I/O formats, data representation, annotation – Everything is a string! Integration of data and interoperability of resources is difficult
Lots of Resources NAR 2007 – 968 databases
Traditional Bioinformatics [Figure: page of a raw nucleotide sequence listing, positions 12181 to 12781]
Cutting and Pasting • Advantages: – Low Technology on both server and client side – Very Robust: Hard to break. – Data Integration happens along the way • Disadvantages: – Time Consuming (and painful!) • Can be repeated rarely • Limited to small data sets. – Error Prone: • Poor repeatability
Pipeline Programming • Advantages – Repeatable – Allows automation – Quick, reliable, efficient • Disadvantages – Requires programming skills – Difficult to modify – Requires local tool and database installation – Requires tool and database maintenance!!!
What we want as a solution A system that: • Allows automation • Allows easy repetition, verification and sharing of experiments • Works on distributed resources • Requires few programming skills • Runs on a local desktop / laptop
myGrid as a solution myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop. Built on computer science technologies of: • Web services • Workflows • Semantic web technologies
Workflows – General technique for describing and enacting a process – Describes what you want to do, not how you want to do it – High level description of the experiment [Diagram: RepeatMasker Web service, GenScan Web service, Blast Web service]
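The "what, not how" idea above can be sketched by keeping the workflow description (a list of step names) separate from the code that carries the steps out. This is only an illustration: the three functions are toy local stand-ins, not the real RepeatMasker, GenScan or Blast services, the chaining order is an assumption read off the diagram, and `enact` and `SERVICES` are hypothetical names.

```python
# The workflow says WHAT to run, as data.
WORKFLOW = ["repeat_masker", "gen_scan", "blast"]


def repeat_masker(sequence):
    """Toy stand-in: mask lowercase (repeat) bases with 'N'."""
    return "".join("N" if ch.islower() else ch for ch in sequence)


def gen_scan(masked):
    """Toy stand-in: treat unmasked stretches of 3+ bases as candidate genes."""
    return [frag for frag in masked.split("N") if len(frag) >= 3]


def blast(genes):
    """Toy stand-in: pretend to look up each candidate gene."""
    return {gene: f"hit-for-{gene}" for gene in genes}


# The service table says HOW each step is carried out; swapping an entry
# (for example, for a remote web-service client) leaves WORKFLOW unchanged.
SERVICES = {"repeat_masker": repeat_masker, "gen_scan": gen_scan, "blast": blast}


def enact(workflow, data, services=SERVICES):
    """Feed the output of each step into the next one."""
    for step in workflow:
        data = services[step](data)
    return data


result = enact(WORKFLOW, "ACGTacgtGGCCAtTT")
print(result)
```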
Workflows Workflow language specifies how bioinformatics processes fit together. High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows. Workflow is a kind of script or protocol that you configure when you run it. Easier to explain, share, relocate, reuse and repurpose. Workflow <=> Model Workflow is the integrator of knowledge The METHODS section of a scientific publication
Workflow Advantages • Automation – Capturing processes in an explicit manner – Tedium! Computers don’t get bored/distracted/hungry/impatient! – Saves repeated time and effort • Modification, maintenance, substitution and personalisation • Easy to share, explain, relocate, reuse and build • Releases Scientists/Bioinformaticians to do other work • Record – Provenance: what the data is like, where it came from, its quality
Taverna Workflow Components • Scufl: Simple Conceptual Unified Flow Language • Taverna: writing, running workflows & examining results • SOAPLAB: makes applications available [Diagram: Web Service, e.g. DDBJ BLAST; SOAPLAB Web Service wrapping Any Application]
An Open World • Open domain services and resources. • Taverna accesses 3000+ services • Third party – we don’t own them – we didn’t build them • All the major providers – NCBI, DDBJ, EBI … • Enforce NO common data model. • Quality Web Services considered desirable
Adding your own web services • SoapLab http://www.ebi.ac.uk/soaplab/ • Java API Consumer: import Java API of libSBML as workflow components
Shield the Scientist – Bury the Complexity [Diagram: Taverna Workbench application on top of the Scufl model (Simple Conceptual Unified Flow Language); workflow execution by the workflow enactor, which drives one processor per service type: Bio MOBY, WSRF, plain Web service, Soaplab, local Java app, Enactor, Styx client, R package, Bio MART, ...]
Kepler Slides courtesy of Bertram Ludaescher
Scientific Workflow Capture how a scientist works with data and analytical tools – data access, transformation, analysis, visualization – possible worldview: dataflow-oriented (cf. signal-processing) Scientific workflow (wf) benefits (compare w/ script-based approaches): – wf automation – wf & component reuse – wf design, documentation – wf archival, sharing – built-in concurrency (task-, pipeline-parallelism) – built-in provenance support – distributed execution (Grid) support – …
Ex: SEEK Ecological Niche Modeling Pipeline • Scientific Workflow paradigm: – Reusable components (“actors”): a scientist’s verbs/actions – Top-level workflows ≈ conceptual representation of the science process, sentences in the scientist’s language – Sub-workflows ≈ increasing levels of detail • Separation of concerns: – actors: what to do – parameters: configurable behavior – channels: dataflow, pipeline composition – directors: fix execution model, scheduling – semantic types: smart discovery, linking D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer.
Simple Kepler workflow using R (a statistics package) Data source from EcoGrid (metadata-driven ingestion) R processing script: res <- lm(BARO ~ T_AIR); res; plot(T_AIR, BARO); abline(res)
Plumbing with Style … (Norbert Podhorszki UC Davis, Scott Klasky ORNL) Monitor, Transfer, Convert, Archive • Plasma physics simulation on 2048 processors on Seaborg@NERSC (LBL) – Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence) – Generating 800 GB of data (3000 files, 6000 timesteps, 267 MB/timestep), 30+ hour simulation run • Under workflow control: – Monitor (watch) simulation progress (via remote scripts) – Transfer from NERSC to ORNL concurrently with the simulation run – Convert each file to HDF5 file – Archive files in 4 GB chunks into HPSS
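The archiving step above, grouping converted files into chunks of at most 4 GB, can be sketched as a simple in-order batching routine. This is an assumption about how such a step might work, not the actual Kepler workflow: the function `chunk_files`, the file names, the use of decimal units (1 GB = 10^9 bytes), and the reading of 267 MB as the size of each of the 3000 files are all illustrative choices.

```python
MB = 10**6
GB = 10**9


def chunk_files(files, limit_bytes):
    """Group (name, size) pairs, in arrival order, into chunks whose
    total size does not exceed limit_bytes."""
    chunks, current, used = [], [], 0
    for name, size in files:
        if size > limit_bytes:
            raise ValueError(f"{name} is larger than one chunk")
        if used + size > limit_bytes:
            chunks.append(current)      # close the full chunk
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)          # final, possibly partial chunk
    return chunks


# Hypothetical file list: 3000 files of 267 MB each.
files = [(f"gtc_{i:04d}.h5", 267 * MB) for i in range(3000)]
chunks = chunk_files(files, 4 * GB)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Because files are taken in arrival order, chunks can be closed and archived while the simulation is still producing data.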
Our Starting Point: Actor-Oriented Modeling Ports – each actor has a set of input and output ports – denote the actor’s signature – produce/consume data (a.k.a. tokens) – parameters are special “static” ports
Actor-Oriented Modeling Dataflow Connections – unidirectional actor “communication” channels – connect output ports with input ports – for composing analysis pipelines
Actor-Oriented Modeling Sub-workflows / Composite Actors – composite actors “wrap” sub-workflows – like actors, have signatures (i/o ports of sub-workflow) – hierarchical workflows (arbitrary nesting levels)
Actor-Oriented Modeling Directors – define the execution semantics of workflow graphs – execute the workflow graph (according to some schedule) – sub-workflows may have different directors – promotes reusability
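The actor, port, channel and director concepts above can be sketched in a few lines. This is not Kepler or Ptolemy II code; the classes `Actor` and `DagDirector`, the port names and the example actors are hypothetical, and the director shown is the simplest possible one (fire each actor once, in dependency order).

```python
from graphlib import TopologicalSorter


class Actor:
    """An actor with named input/output ports and static parameters."""

    def __init__(self, name, inputs, outputs, fire, **parameters):
        self.name = name
        self.inputs = inputs            # input port names
        self.outputs = outputs          # output port names
        self.fire = fire                # fire(inputs, parameters) -> {port: token}
        self.parameters = parameters


class DagDirector:
    """Execution model: fire every actor exactly once, predecessors first."""

    def run(self, actors, channels):
        # channels: list of ((src_actor, out_port), (dst_actor, in_port))
        by_name = {a.name: a for a in actors}
        deps = {a.name: set() for a in actors}
        for (src, _), (dst, _) in channels:
            deps[dst].add(src)
        tokens = {}                     # (actor, out_port) -> token
        for name in TopologicalSorter(deps).static_order():
            actor = by_name[name]
            inputs = {in_port: tokens[(src, out_port)]
                      for (src, out_port), (dst, in_port) in channels
                      if dst == name}
            outputs = actor.fire(inputs, actor.parameters)
            for port in actor.outputs:
                tokens[(name, port)] = outputs[port]
        return tokens


source = Actor("source", [], ["out"],
               lambda i, p: {"out": p["values"]}, values=[1, 2, 3])
scale = Actor("scale", ["in"], ["out"],
              lambda i, p: {"out": [x * p["factor"] for x in i["in"]]}, factor=10)
total = Actor("total", ["in"], ["sum"],
              lambda i, p: {"sum": sum(i["in"])})

channels = [(("source", "out"), ("scale", "in")),
            (("scale", "out"), ("total", "in"))]
tokens = DagDirector().run([source, scale, total], channels)
print(tokens[("total", "sum")])
```

The same actors and channels could be handed to a different director (for example a threaded one) without being rewritten, which is the reuse argument made on the slide.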
Models of Computation (A Wf Engineer’s Issue) Directors separate the concerns of orchestration and scheduling from conceptual design – Synchronous Dataflow (SDF) • Statically analyzable: schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director. – Process Networks (PN) • Generalizes SDF. Actors execute as separate threads/processes, with queues of unbounded size (Kahn/MacQueen networks). – Directed Acyclic Graph (DAG) • Special case of SDF. No loops, no pipelining. – Continuous Time (CT) • Connections represent the value of a continuous time signal at some point in time ... Often used to model physical processes. – Discrete Event (DE) • Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems. – …
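As an illustration of the Process Networks model described above, the sketch below runs three actors as separate threads connected by unbounded FIFO queues. It uses only Python's standard `threading` and `queue` modules and is not Kepler code; the actor functions and the `DONE` end-of-stream marker are assumptions made for the example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def producer(out_q, n):
    """Source actor: emit the tokens 0 .. n-1."""
    for i in range(n):
        out_q.put(i)
    out_q.put(DONE)


def square(in_q, out_q):
    """Transformer actor: blocking read, then emit the squared token."""
    while (tok := in_q.get()) is not DONE:
        out_q.put(tok * tok)
    out_q.put(DONE)


def collect(in_q, results):
    """Sink actor: gather tokens into a list."""
    while (tok := in_q.get()) is not DONE:
        results.append(tok)


q1, q2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
results = []
threads = [
    threading.Thread(target=producer, args=(q1, 5)),
    threading.Thread(target=square, args=(q1, q2)),
    threading.Thread(target=collect, args=(q2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Each actor only blocks on reads from its input queue, so the result does not depend on how the threads happen to be scheduled, and all three stages can overlap (pipeline parallelism).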
Everything is a service / actor…
Smart Discovery Browse for Components Search for Component Name Search for Category / Keyword Find a component (here: an actor) in different locations (“categories”) • … based on the semantic annotation of the component (or its ports)
Behold the Beauty of Scientific Workflow Design Author: Kristian Stevens, UC Davis
… Shimology Part 2: the ugly truth inside Author: Kristian Stevens, UC Davis
Triana Slides courtesy of Ian Taylor
Triana Focus • Two core underlying focuses: – Interactive graphical programming of the distributed tasks complex editing • Intuitive drag/drop flexible editing - copy/paste services, wizards for creating tools/toolboxes, user interfaces, adding nodes and multi-level grouping. • Has been used as a “graphical editor” for other languages, e.g. DAG, VDLx (DAX in progress). – Heterogeneous workflows - Bridge the gap between different distributed environments • Use cross-environment interfaces • led to integration with GAT (pre SAGA), GAP
Types of Uses – For fine-grained operations, specifying dataflow for local operations – Or coarse-grained composition of a distributed workflow – Or both - can connect heterogeneous tools (e.g. Web services, Java units, Jxta services) in one workflow Has been used as a dataflow system, a distributed-workflow environment, a workflow-management system, an automated scripting tool, and a workflow editor.
Current Capabilities • Local Java Units – 600 units in signal, image, audio, text processing, complete math/stats toolboxes etc. – Common units - flexible importers/exporters, graphing, duplicators – Data types - strong data types for a number of domains - includes run-time checking • Distributed Integration – GAT - Java GAT implementation - graphical representation of GAT primitives - supports GRAM, GridFTP, etc. – GAP - SOA publish, find, bind triad of operations • Bindings: Jxta, P2PS, Web Services, WS-RF – Group unit deployment • Legacy Applications – Can incorporate legacy applications easily (using local GAT adaptor) - standard file in/out interface
Distributed Work-flow Triana Service & Engine Distributing Triana Units or Groups (Java) Upperware Middleware Workflow, e.g. BPEL4WS Workflow Commands GAP Remote Legacy Applications Integrating Legacy applications into Workflow Distributed services Integrating Web Services or P2P Services GAT & GAP Triana Engine GAP
Triana, the GAT and the GAP A Graphical Grid Computing Environment or Portal • Service Based Computing: deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services – GAP Interface: P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes), Web Services (UDDI, SOAP) • Grid Computing: job submission, file services – GAT Interface: Condor, Unicore, Globus, RLS, SSH, GridFTP, PBS, SGE, GRMS, GridLab, LDR, WSRF.NET, other…
Audio Processing (Groups)
Group Units
GAT Interface • Main deliverable of GridLab • Application-level interface • With a set of adapters – That adapt the interface to an underlying capability • Versions in C++ and Java • Precursor to SAGA - Simple API for Grid Applications
GAT Adapters: Example GAT API Resource Streaming/ File Job Management Comms Management Monitoring Collection Management Copy File(Machine A, Machine B) GAT Engine GridFTP Adapter Jxta File Adapter Jxta Pipe GridFTP Connection Grid Environment P2P Environment
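The adapter idea in the example above (one application-level call, dispatched by an engine to whichever adapter matches the environment) can be sketched as follows. This is not the GAT API: the classes `GatEngine`, `GridFtpAdapter` and `JxtaFileAdapter`, the `copy_file` method, the URL schemes and the host names are all hypothetical, and the adapters only return a description of what they would do.

```python
class GridFtpAdapter:
    """Hypothetical adapter for a Grid environment."""
    scheme = "gsiftp"

    def copy(self, src, dst):
        return f"GridFTP copy {src} -> {dst}"


class JxtaFileAdapter:
    """Hypothetical adapter for a P2P environment."""
    scheme = "jxta"

    def copy(self, src, dst):
        return f"Jxta pipe copy {src} -> {dst}"


class GatEngine:
    """Application-level interface: the caller never names an adapter."""

    def __init__(self, adapters):
        self._adapters = {a.scheme: a for a in adapters}

    def copy_file(self, src, dst):
        scheme = src.split("://", 1)[0]
        adapter = self._adapters.get(scheme)
        if adapter is None:
            raise LookupError(f"no adapter for scheme {scheme!r}")
        return adapter.copy(src, dst)


engine = GatEngine([GridFtpAdapter(), JxtaFileAdapter()])
grid_result = engine.copy_file("gsiftp://machineA/data.h5", "gsiftp://machineB/data.h5")
p2p_result = engine.copy_file("jxta://machineA/data.h5", "jxta://machineB/data.h5")
print(grid_result)
print(p2p_result)
```

The application code calls `copy_file` the same way in both cases; only the set of registered adapters changes between a Grid and a P2P deployment.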
GAP Interface • Motivated by GAT • A simple service-based API for – Service Deployment – Service Discovery – Pipe Based Communication • Static application interface with multiple middleware bindings – P2PS (name…?) – JXTA – Web services [Diagram: GAP Interface over P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes) and Web Services (UDDI, SOAP)]
Deploying and Connecting To Remote Services • Running services are automatically discovered via the GAP Interface, and appear in the tool tree • User can drag remote services onto the workspace and connect cables to them like standard tools (except the cables represent actual JXTA/P2PS pipes)
Web Service Discovery • Triana allows users to query UDDI repositories • Alternatively, users can import services directly from WSDL
Complex Data Types • Users can build their own interface for creating/mediating between complex types • Alternatively, Triana can dynamically generate an interface from the WSDL2Java generated bean class
Askalon Slides Courtesy of Thomas Fahringer
ASKALON Application Development and Runtime Environment for the Grid Goal: simple, efficient, effective application development for the Grid • Invisible Grid • Application Modeling (UML) and programming at a high level of abstraction (AGWL) • Semantics technologies • Semi-automatic deployment • SOA-based runtime environment with stateful services • Analysis and optimization of performance, costs and reliability
ASKALON Workflow Composition and Runtime Environment [Diagram: UML-based Workflow Composition produces AGWL, e.g. <agwl> <parallel> activity </parallel> </agwl>; Runtime Middleware Services: Execution Engine, Scheduler, Resource Manager, Data Repository, Performance Analysis; Job, Globus toolkit, WSRF, The Grid]
Austrian Grid Parallel computer # CPU Clock Architecture Location altix1.jku hydra.gup schafberg.sbg grid.fhv.at gescher.vcpc karwendel.dps altix1.uibk hc-ma.uibk zid-grid 64 16 16 21 32 80 16 16 272 ITA2 Athlon ITA2 Xeon Opteron ITA2 Opteron P4 1.6 3 3 2.2 1.6 2.2 1.8 ccNUMA COW COW ccNUMA COW NOW Linz Salzburg Vorarlberg Vienna Innsbruck altix1 grid 21 CPUs MAUI RA FHV schafberg 16 CPUs Torque karwendel 80 CPUs SGE HPC 16 CPUs SGE UIBK RA ZID 16 CPUs PBS 64 CPUs PBS Uni-Sbg RA hydra 16 CPUs Torque Uni-Linz RA Grid 272 CPUs PBS/Torque UniVie CA RA gescher 16 CPUs MAUI • 517 CPUs distributed across 5 cities and over 20 parallel computers
ASKALON Workflows • Activity = basic or atomic unit of computation • Activity type – Functional description of the activity • Signature specified by data input/output ports – Semantically meaningful name • E.g. matrix multiplication, Gaussian elimination, povray, png2yuv, ffmpeg, FFT, LAPW, WASIM, … – Implementation-independent • Workflow = collection of activity types interconnected through control flow and data flow dependencies – Plus some advanced constructs • Activity deployment – Binds an activity type to a concrete installed implementation – Description of how to instantiate the activity – Registered by the application provider in a special registry of the Resource Management service
ASKALON: Abstract Grid Workflow Language (AGWL) • Atomic activities – abstract from the real implementation, e.g. Web services, legacy applications • Basic compound activities – Sequential constructs: <sequence> – Conditional constructs: <if>, <switch> – Loop constructs: <while>, <dowhile>, <forEach> – Directed Acyclic Graph constructs: <dag> • Advanced compound activities – Parallel section constructs: <parallel> – Parallel loop constructs: <parallelFor>, <parallelForEach> • Data flow constructs – dataIn/dataOut ports, collections, data repositories, data set distributions, etc. • Properties – provide hints about the behavior of activities – Predicted I/O data size, computational complexity, non-functional parameters • Constraints – Optimization metric (e.g. performance, cost, fault tolerance) – Scheduling constraints (e.g. compute architecture, disk, memory)
ASKALON Workflow Development Stack Application Developer AGWL Abstract Grid Workflow Language CGWR Concrete Grid Workflow Representation ASKALON Middleware Grid Workflow UML model XML Activity Type Java Activity Type ASKALON Activity Deployment Grid Activity Instance Concretizing Portal UML
Real-world Scientific Workflows with ASKALON • WIEN2k • Material science application • Technical University of Vienna – Institute of Theoretical Chemistry • Seven activity types • Over 500 activity instances • Statically unknown number of sequential loop iterations
Resource Management • Resource brokerage – Interface to MDS information service for resource discovery – Selection based on matchmaking • Advance reservation – Useful for co-allocation purposes • GLARE – Registry of activity deployments • Activity deployment – Binds an abstract activity type to a concrete implementation – Refers to an installed executable or a deployed Web/Grid service – Description of how to instantiate the activity – Registered in GLARE by the application provider
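The relationship described above between an abstract activity type and its concrete deployments can be sketched as a small registry. This is not the GLARE interface; the classes `Deployment` and `Registry`, their fields, and the example sites and locations are hypothetical and only illustrate the register/lookup idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    activity_type: str   # abstract type the deployment implements
    site: str            # where it is installed
    kind: str            # "executable" or "service"
    location: str        # path or endpoint used to instantiate it


class Registry:
    """Maps an abstract activity type to its registered deployments."""

    def __init__(self):
        self._by_type = {}

    def register(self, deployment):
        self._by_type.setdefault(deployment.activity_type, []).append(deployment)

    def lookup(self, activity_type, site=None):
        found = self._by_type.get(activity_type, [])
        return [d for d in found if site is None or d.site == site]


registry = Registry()
registry.register(Deployment("FFT", "site-a", "executable", "/opt/apps/fft"))
registry.register(Deployment("FFT", "site-b", "service", "https://site-b.example/fft"))
registry.register(Deployment("povray", "site-a", "executable", "/opt/apps/povray"))

fft_anywhere = registry.lookup("FFT")
fft_on_b = registry.lookup("FFT", site="site-b")
print([d.site for d in fft_anywhere], fft_on_b[0].location)
```

A workflow written against activity types ("FFT") stays unchanged when deployments are added or removed; only the registry contents differ from one Grid to another.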
Dynamic Bindings of Workflow Abstract - Concrete Askalon Runtime Environment Concrete Workflow Abstract Workflow x A B x y A Activity Type (abstract) B x y A B y Resource Manager Activity Deployment A G Node 1 Web Services Node 2 B A Node 4 D A x Executables C B Node 3 A y
Composite Activities Sequence • Composite activity – Sequence – Parallel activities – Conditional activities: if, switch – Sequential loops: for, while, for each – Parallel loops: parallel for, parallel for each – Sub-workflows [Diagram: A1 followed by A2, with data flow and control flow edges]
<sequence name="seq">
  <dataIn name="in" source=... />
  <activity name="A1">
    <dataIn name="in" source="seq/in" />
    ...
    <dataOut name="out" />
  </activity>
  <activity name="A2">
    <dataIn name="in" source="A1/out" />
    ...
    <dataOut name="out" />
  </activity>
  <dataOut name="out" source="A2/out" />
</sequence>
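The `source` attributes in the sequence above are what define the data flow between ports. As a rough sketch, the snippet below parses a completed version of that fragment with Python's standard `xml.etree.ElementTree` and lists the data-flow edges. The document `DOC` is a simplified, hypothetical stand-in (the elided parts are dropped and the outer `source` is invented as `user/input`); it has not been validated against the real AGWL schema, and `data_flow_edges` is an illustrative helper, not part of ASKALON.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical AGWL-like document (not schema-validated).
DOC = """
<sequence name="seq">
  <dataIn name="in" source="user/input"/>
  <activity name="A1">
    <dataIn name="in" source="seq/in"/>
    <dataOut name="out"/>
  </activity>
  <activity name="A2">
    <dataIn name="in" source="A1/out"/>
    <dataOut name="out"/>
  </activity>
  <dataOut name="out" source="A2/out"/>
</sequence>
"""


def data_flow_edges(xml_text):
    """Return (source_port, target_port) pairs implied by `source` attributes."""
    root = ET.fromstring(xml_text)
    edges = []
    for owner in [root, *root.iter("activity")]:
        for tag in ("dataIn", "dataOut"):
            for port in owner.findall(tag):
                source = port.get("source")
                if source:
                    edges.append((source, f"{owner.get('name')}/{port.get('name')}"))
    return edges


edges = data_flow_edges(DOC)
for src, dst in edges:
    print(f"{src} -> {dst}")
```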
If-then-else [Diagram: activities A0, A1, A2, A3 with then/else branches, steps labeled (1) to (4)]
<if>
  <dataIn ... >
  <condition> ... </condition>
  <then>
    <activity name="A2">
      <dataIn name="in" source="..." />
      ...
      <dataOut name="out" />
    </activity>
  </then>
  <else>
    <activity name="A3">
      <dataIn name="in" source="..." />
      ...
      <dataOut name="out" />
    </activity>
  </else>
  <dataOut name="ifout" source="A2/out, A3/out" />
</if>
Execution Engine • Workflow controller – Converts XML-based specification (AGWL) to internal representation – Executes the workflow according to control and data flow dependencies • One separate controller for every workflow instance • Event system – Other components can subscribe to the internal events – e.g. logging, controller, tool (WS-Notification), ... • Logging and database – For post-mortem performance analysis • AGWL GT4 WSRF wrapper – Sends WS-Notifications to the portal • Scheduler – Receives jobs ready to execute from the task loop – Retrieves the available resources from GridARM – Assigns the task to the best machine according to the selection criteria o Clock speed × number of free processors o Prediction information, memory available, … [Diagram: GT4 WSRF Service wrapping the Controller (AGWL Interpreter, Core, Task Loop, Fault Handler, Event System), Scheduler, GridARM, Execution / Launching Framework, Logging & Database]
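The first selection criterion listed for the scheduler above (clock speed multiplied by the number of free processors) can be sketched as a simple ranking function. This is an illustration under stated assumptions, not the ASKALON scheduler: `pick_machine`, the dictionary fields, the memory filter and the example machines are all hypothetical.

```python
def pick_machine(machines, min_memory_gb=0):
    """Return the name of the machine with the highest
    clock speed x free processors score, or None if none qualifies."""
    candidates = [m for m in machines
                  if m["free_cpus"] > 0 and m["memory_gb"] >= min_memory_gb]
    if not candidates:
        return None
    best = max(candidates, key=lambda m: m["clock_ghz"] * m["free_cpus"])
    return best["name"]


# Hypothetical resource snapshot, as a resource manager might report it.
machines = [
    {"name": "site-a", "clock_ghz": 1.6, "free_cpus": 8, "memory_gb": 8},
    {"name": "site-b", "clock_ghz": 2.2, "free_cpus": 4, "memory_gb": 32},
    {"name": "site-c", "clock_ghz": 3.0, "free_cpus": 0, "memory_gb": 64},
]

default_choice = pick_machine(machines)                    # score only
large_memory_choice = pick_machine(machines, min_memory_gb=16)
print(default_choice, large_memory_choice)
```

The fastest machine (site-c) is never chosen here because it has no free processors, which shows why the criterion combines both factors rather than ranking on clock speed alone.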