b238dd8397463e164ee78e0c6aef996d.ppt
- Number of slides: 48
The KEPLER Scientific Workflow System
Bertram Ludäscher, Ilkay Altintas, … & the Kepler Team
San Diego Supercomputer Center, University of California, San Diego
SDM Center AHM, LBL, August 3-5, 2004
Outline
• Project Overview
  – from Ptolemy II to Kepler
• Workflow Modeling Issues
  – from Dataflow to Control-flow (CCA et al.)
• Current Kepler Features
  – from plumbing to distributed execution
• Example Workflows
  – from bioinformatics to geoinformatics
• Future Plans
  – from today to tomorrow ;-)
Kepler, B. Ludäscher, SDSC
What is a Scientific Workflow (SWF)?
• Goals:
  – automate a scientist’s repetitive steps (data analysis, data transformation, computational steps, …)
  – can encompass data generation, aggregation, analysis, visualization (WF granularity)
  – design, test, share, deploy, execute, reuse, … SWFs
• Typical requirements/characteristics:
  – data-intensive and/or compute-intensive
  – plumbing-intensive
  – dataflow-oriented
  – distribution (data, processing)
  – user-interaction “in the middle”, … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
  – advanced programming constructs (map(f), zip, takewhile, …)
  – logging, provenance, “registering back” (intermediate) products…
• … easy to recognize a SWF when you see one!
Promoter Identification Workflow. Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: workflow diagram. Steps: EcoGrid Query (against registered Ecogrid databases), Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, Generate Metadata, Archive To Ecogrid. Data products: (a) species presence & absence points (native range, invasion area); (b) environmental layers (native range, invasion area); (c) integrated layers (native range, invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.]
Source: NSF SEEK (Deana Pennington et al., UNM)
Starting Point for SDM Center/SPA + SEEK: Ptolemy II
read! see! try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
An Early Example: Promoter Identification (SSDBM, AD 2003)
• Scientist models application as a “workflow” of connected components (“actors”)
• If all components exist, the workflow can be automated/executed
• Different directors can be used to pick an appropriate execution model (often “pipelined” execution: PN director)
Why Ptolemy II (and thus KEPLER)?
• Ptolemy II Objective:
  – “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural pipelining/streaming support
• User-Orientation
  – Workflow design & exec console (Vergil GUI)
  – “Application/Glue-Ware”
    • excellent modeling and design support
    • run-time support, monitoring, …
    • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
    • but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
  – Ptolemy II is mature, continuously extended & improved, well-documented (500+ pp)
  – open source system
  – Ptolemy II folks actively participate in KEPLER
KEPLER: An Open Collaboration
• “Founding projects”:
  – DOE SDM/SPA and NSF SEEK
• Open Source (BSD-style license)
• Intensive Communications:
  – Web-archived mailing lists
  – IRC (!)
• Co-development:
  – via shared CVS repository
  – joining as a new co-developer (currently):
    • get a CVS account (read-only)
    • local development + contribution via existing KEPLER member
    • be voted “in” as a member/co-developer
• Software & social engineering
  – How to better accommodate new groups/communities?
  – How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?
KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-)
• Ilkay Altintas: SDM, Resurgence
• Kim Baldridge: Resurgence, NMI
• Chad Berkley: SEEK
• Shawn Bowers: SEEK
• Terence Critchlow: SDM
• Tobin Fricke: ROADNet
• Jeffrey Grethe: BIRN
• Christopher H. Brooks: Ptolemy II
• Zhengang Cheng: SDM
• Dan Higgins: SEEK
• Efrat Jaeger: GEON
• Matt Jones: SEEK
• Werner Krebs: EOL
• Edward A. Lee: Ptolemy II
• Kai Lin: GEON
• Bertram Ludaescher: BIRN, SDM, SEEK, GEON
• Mark Miller: EOL
• Steve Mock: NMI
• Steve Neuendorffer: Ptolemy II
• Jing Tao: SEEK
• Mladen Vouk: SDM
• Xiaowen Xin: SDM
• Yang Zhao: Ptolemy II
• Bing Zhu: SEEK
History
• Gabriel (1986-1991)
  – Written in Lisp
  – Aimed at signal processing
  – Synchronous dataflow (SDF) block diagrams
  – Parallel schedulers
  – Code generators for DSPs
  – Hardware/software co-simulators
• Ptolemy Classic (1990-1997)
  – Written in C++
  – Multiple models of computation
  – Hierarchical heterogeneity
  – Dataflow variants: BDF, DDF, PN
  – C/VHDL/DSP code generators
  – Optimizing SDF schedulers
  – Higher-order components
• Ptolemy II (1996-2022)
  – Written in Java
  – Domain polymorphism
  – Multithreaded
  – Network integrated
  – Modal models
  – Sophisticated type system
  – CT, HDF, CI, GR, etc.
• PtPlot (1997-??)
  – Java plotting package
• Tycho (1996-1998)
  – Itcl/Tk GUI framework
• Diva (1998-2000)
  – Java GUI framework
• Copernicus (code generator)
• KEPLER (2003-2028)
  – scientific workflow extensions
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows
KEPLER = “Ptolemy II + X” for Scientific Workflows
Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
KEPLER then …
… and KEPLER today…
“… so, you see, scientific workflows need domain- and data-polymorphic actors & must scale to HPC!”
“What’s a scientific workflow?” “What’s a polymorphic actor?” “What is HPC?”
BTW: Kepler is NOT a GUI (Vergil is)
The KEPLER/Ptolemy II GUI (Vergil)
• “Directors” define the component interaction & execution semantics
• Large libraries of polymorphic components (“actors”) and directors (drag & drop)
Actor-Oriented Design
• Object orientation: what flows through an object is sequential control (cf. CCA, MPI)
  [diagram: class name; data; methods; call, return]
• Actor/Dataflow orientation: what flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!)
  [diagram: actor name; data (state); parameters; ports; input data, output data]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
Object-Oriented vs. Actor-Oriented Interfaces
• Object Oriented: TextToSpeech with initialize(): void, notify(): void, isReady(): boolean, getSpeech(): double[]
  – The OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.
• Actor/Dataflow Oriented:
  – The AO interface definition says “Give me text and I’ll give you speech”
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
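The contrast on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fake_synthesize` is a made-up stand-in for a real synthesizer, the Python method names mirror the slide's `initialize`/`notify`/`isReady`/`getSpeech`, and the constructor argument is hypothetical. None of it is Ptolemy II or Kepler code.

```python
from typing import Iterable, Iterator, List, Optional


def fake_synthesize(text: str) -> List[float]:
    """Hypothetical stand-in for a synthesizer: one made-up sample per character."""
    return [float(ord(ch)) for ch in text]


class TextToSpeechOO:
    """Object-oriented style: the required call order lives only in the caller's head."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._samples: Optional[List[float]] = None

    def initialize(self) -> None:
        self._samples = None

    def notify(self) -> None:
        self._samples = fake_synthesize(self._text)

    def is_ready(self) -> bool:
        return self._samples is not None

    def get_speech(self) -> List[float]:
        # Nothing in the interface says notify() must come first.
        if self._samples is None:
            raise RuntimeError("call notify() before get_speech()")
        return self._samples


def text_to_speech_actor(texts: Iterable[str]) -> Iterator[List[float]]:
    """Actor style: a stream of text tokens in, a stream of speech tokens out."""
    for text in texts:
        yield fake_synthesize(text)
```

With the actor form there is no call protocol to get wrong: `list(text_to_speech_actor(["hi"]))` simply yields one speech token per text token.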
Ptolemy II: Actor-Oriented Modeling
• Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director”
• Different directors for different modeling and execution needs (… can even be combined!)
• Better abstraction, modeling, component reuse, …
Behavioral Polymorphism in Ptolemy
[diagram: a producer actor and a consumer actor connected through IOPorts; the Director supplies the Receiver]
• These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, [G]BPL4??, … !)
• Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
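The idea that the director, not the component, supplies the receiver can be illustrated with a minimal Python sketch. All class and function names here (`FifoReceiver`, `MailboxReceiver`, `Port`, `Director`, `run`) are hypothetical and chosen for illustration; this is not the Ptolemy II API, only an analogy to it.

```python
from collections import deque


class FifoReceiver:
    """Buffers every token in arrival order (dataflow-like semantics)."""

    def __init__(self):
        self._queue = deque()

    def put(self, token):
        self._queue.append(token)

    def has_token(self):
        return bool(self._queue)

    def get(self):
        return self._queue.popleft()


class MailboxReceiver:
    """Keeps only the most recent token (overwrite semantics)."""

    def __init__(self):
        self._slot = []

    def put(self, token):
        self._slot = [token]

    def has_token(self):
        return bool(self._slot)

    def get(self):
        return self._slot.pop()


class Port:
    """The port code is the same under every director; only the receiver differs."""

    def __init__(self):
        self.receiver = None  # assigned by the director, never by the actor

    def send(self, token):
        self.receiver.put(token)

    def drain(self):
        tokens = []
        while self.receiver.has_token():
            tokens.append(self.receiver.get())
        return tokens


class Director:
    """Chooses the communication semantics by choosing the receiver class."""

    def __init__(self, receiver_factory):
        self._receiver_factory = receiver_factory

    def wire(self, port):
        port.receiver = self._receiver_factory()


def run(director):
    port = Port()
    director.wire(port)
    for token in (1, 2, 3):  # producer actor sends three tokens
        port.send(token)
    return port.drain()      # consumer actor reads whatever is available
```

The same producer and consumer code yields all three tokens under `Director(FifoReceiver)` and only the last one under `Director(MailboxReceiver)`, which is the sense in which interaction semantics are "factored out".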
Component Composition & Interaction
[diagram: connected components labeled DIR 1, DIR 2, DIR 3, DIR 4, ???]
• Components linked via ports
• Dataflow (and msg/ctl-flow)
• Where is the component interaction semantics defined??
  – each component is its own director!
• But still useful for special applications, e.g. parallel programs (MPI, …)
Source: GRIST/SC4DEVO workshop, July 2004, Caltech
Data/Control-Flow Spectrum
[spectrum: “clean” data(=ctl)-flow; special tokens flow; message passing, control flow]
• Data (tokens) flow
  – (almost) no other side effects
  – WYSIWYG (usually)
• References flow
  – token reference type may be “http-get”, “ftp-get”, “hsi put”…
  – generic handling still possible
• Application specific tokens flow
  – e.g. current Nimrod job management in Resurgence
  – “invisible contract” between components
  – Director is unaware of what’s going on … (sounds familiar? ;-)
• Specific message passing protocols (e.g., CSP, MPI)
  – for systems of tightly coupled components
CCA via special (“look the other way”) Director(s)? CCA!?
• Dataflow in CCA
  – a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling
• CCA/Message Passing in KEPLER
  – Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)
Domains and Directors: Semantics for Component Interaction
• CI – Push/pull component interaction
• CSP – concurrent threads with rendezvous
• CT – continuous-time modeling
• DE – discrete-event systems
• DDE – distributed discrete events
• FSM – finite state machines
• DT – discrete time (cycle driven)
• Giotto – synchronous periodic
• GR – 2-D and 3-D graphics
• PN – process networks
• SDF – synchronous dataflow
• SR – synchronous/reactive
• TM – timed multitasking
Callouts on the slide: “For (finer-grained) concurrent jobs!?” and “For (coarse-grained) Scientific Workflows!”
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
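As a rough illustration of the PN (process networks) entry in the list above, the following Python sketch runs each actor in its own thread and connects actors with blocking queues, so that a reader waits on an empty channel. It is an analogy using only the standard library, not the Ptolemy II PN director; the `DONE` sentinel, the bounded queue size, and all function names are assumptions made for the example.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker


def source(out_channel, items):
    """Producer process: writes each item, then signals end of stream."""
    for item in items:
        out_channel.put(item)
    out_channel.put(DONE)


def scale(in_channel, out_channel, factor):
    """Transformer process: loops forever, blocking while its input is empty."""
    while True:
        token = in_channel.get()  # blocking read
        if token is DONE:
            out_channel.put(DONE)
            return
        out_channel.put(token * factor)


def run_pipeline(items, factor):
    # Bounded queues are a simplification; they also make writers block when full.
    first = queue.Queue(maxsize=2)
    second = queue.Queue(maxsize=2)
    threads = [
        threading.Thread(target=source, args=(first, items)),
        threading.Thread(target=scale, args=(first, second, factor)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        token = second.get()
        if token is DONE:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results
```

Because every stage runs concurrently and only blocks on its own channels, downstream actors start working as soon as the first token arrives, which is the pipelined execution that makes PN attractive for streaming workflows.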
Polymorphic Actor Components Working Across Data Types and Domains
• Actor Data Polymorphism:
  – Add numbers (int, float, double, Complex)
  – Add strings (concatenation)
  – Add complex types (arrays, records, matrices)
  – Add user-defined types
• Actor Behavioral Polymorphism:
  – In dataflow, add when all connected inputs have data
  – In a time-triggered model, add when the clock ticks
  – In discrete-event, add when any connected input has data, and add in zero time
  – In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
  – In CSP, execute an infinite loop that performs rendezvous on input or output
  – In push/pull, ports are push or pull (declared or inferred) and behave accordingly
  – In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component reusability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
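Both kinds of polymorphism listed above can be sketched for an Add actor in Python. This is a simplified illustration, not Ptolemy II code: the function names are hypothetical, input ports are modeled as plain lists used as queues, and only the dataflow and discrete-event firing rules from the list are shown.

```python
import operator
from functools import reduce


def add_tokens(tokens):
    """Data polymorphism: the same '+' works for numbers, complex values, strings."""
    return reduce(operator.add, tokens)


def fire_dataflow(inputs):
    """Dataflow rule: fire only when every connected input holds a token."""
    if inputs and all(inputs):
        return add_tokens([port.pop(0) for port in inputs])
    return None  # not ready; no tokens are consumed


def fire_discrete_event(inputs):
    """Discrete-event rule: fire as soon as any connected input holds a token."""
    present = [port.pop(0) for port in inputs if port]
    return add_tokens(present) if present else None
```

The addition logic is written once; the two `fire_*` functions play the role a director would play in deciding when that logic runs.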
Directors and Combining Different Component Interaction Semantics
Possible app. in SWF:
• time-series aware …
• parameter-sweep aware …
• XY aware …
… execution models
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
A Few Specific Kepler Features and Example Workflows
Web Services Actors (WS Harvester)
[figure: screenshots with numbered steps 1 to 4]
• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Recent Actor Additions
Digression: Who are the clients?
• Domain scientists
  – C/Perl/Python/Java/WS/DB-enabled ones
  – others (the rest of us?)
• Goal: make life better for both categories!
  – Workflow automation
  – Plumbing support
  – Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)
GEON Mineral Classification Workflow
… inside the Classifier: BrowserUI actor w/ SVG client display
GEON Dataset Generation & Registration (and co-development in KEPLER)
[figure: workflow screenshot annotated with contributors: Matt et al. (SEEK); SQL database access (JDBC); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy)]
% Makefile
$> ant run
GEON Data Registration UI
GEON Data Registration in KEPLER
Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)
Data Analysis: Biodiversity Indices
Traffic info for a list of highways: uses the iterate (higher-order “map”) actor to access a highway info web service repeatedly, sending out one email per highway.
Re-engineered PIW w/ Iteration Constructs (AD 2004)
map(GenbankWS)
Input: {“NM_001924”, “NM_020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
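The map(GenbankWS) construct on this slide applies one service call to every element of a list token. A minimal Python sketch of that pattern follows. Everything in it is an assumption for illustration: `iterate` is a hypothetical name for the higher-order actor, `fake_genbank_lookup` stands in for the real web service, and the accession IDs and sequences are invented placeholders rather than GenBank data.

```python
from typing import Callable, Dict, Iterable, List

# Invented placeholder records; a real workflow would call the web service instead.
FAKE_RECORDS: Dict[str, str] = {
    "ID_A": "ACGT",
    "ID_B": "GGCA",
}


def fake_genbank_lookup(accession: str) -> str:
    """Hypothetical stand-in for one web service invocation."""
    return FAKE_RECORDS[accession]


def iterate(actor: Callable[[str], str], tokens: Iterable[str]) -> List[str]:
    """Higher-order 'map' actor: fire the inner actor once per input element."""
    return [actor(token) for token in tokens]
```

For example, `iterate(fake_genbank_lookup, ["ID_A", "ID_B"])` returns one sequence per accession ID, in input order, without any explicit loop wiring in the workflow itself.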
Streaming Real-time Data
Straightforward Example: Laser Strainmeter Channels in; Scientific Workflow; Earth-tide signal out
Seismic Waveforms
ORB
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
  – Fall/Winter ’04
KEPLER Today
• Support for SWF life cycle
  – Design, share, prototype, run, monitor, deploy, …
• Coarse-grained scientific workflows, e.g.,
  – web service actors, grid actors, command-line actors, …
• Fine-grained workflows and simulations, e.g.,
  – Database access, XSLT transformations, …
• Kepler Extensions
  – SDM Center/SPA: support for data- and compute-intensive workflows!
  – real-time data streaming (ROADNet)
  – other special and generic extensions (e.g. GEON, SEEK)
• Status
  – first release (alpha) was in May 2004
  – nightly builds w/ version tests
  – “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
  – Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)
KEPLER Tomorrow
• Application-driven extensions:
  – access to/integration with other IDMAF components
    • SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
  – support for execution of new SWF domains
    • Astrophysics: TSI/Blondin (SPA/NCSU)
    • Nuclear Physics: Swesty (SPA/LLNL)
    • …
• Generic extensions:
  – additional support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …)
  – (C-z; bg; fg)-ing (“detach” and reconnect)
  – workflow deployment models
• Additional “domain awareness” (e.g. via new directors)
  – time series, parameter sweeps, job scheduling, …
  – hybrid type system with semantic types
• Consolidation
  – More installers, regular releases, improved documentation, …
KEPLER & SPA
First alpha releases since May 2004
http://kepler.ecoinformatics.org
https://www-casc.llnl.gov/sdm/
Hybrid Types (Structure + Semantics)
• Services can be semantically compatible, but structurally incompatible
[diagram: a source service with output port Ps and a target service with input port Pt; each port has a structural type and a semantic type drawn from ontologies (OWL). Semantic types of Ps and Pt: compatible (⊑). Structural types of Ps and Pt: incompatible (⋠). The desired connection from Ps to Pt needs structural compatibility (≺).]
Source: [Bowers-Ludaescher, DILS’04]
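The situation on this slide, where two ports match semantically but not structurally, can be illustrated with a small Python sketch. It is not the method of the cited paper: the toy `IS_A` ontology, the concept names, the field names, and the exact-match structural check are all assumptions made up for the example.

```python
from typing import Dict, Optional

# Hypothetical mini-ontology: each concept maps to its direct parent.
IS_A: Dict[str, str] = {
    "SpeciesOccurrence": "Observation",
    "Observation": "Thing",
}


def subsumed_by(concept: Optional[str], ancestor: str) -> bool:
    """Semantic check: walk up the is-a chain looking for the ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def structurally_compatible(source_fields: Dict[str, str],
                            target_fields: Dict[str, str]) -> bool:
    """Structural check: the source must provide every field the target expects."""
    return all(source_fields.get(name) == field_type
               for name, field_type in target_fields.items())


def check_connection(source: dict, target: dict) -> Dict[str, bool]:
    """Report both compatibility results for a proposed Ps to Pt connection."""
    return {
        "semantic": subsumed_by(source["concept"], target["concept"]),
        "structural": structurally_compatible(source["fields"], target["fields"]),
    }
```

A source port typed `SpeciesOccurrence` with fields `lat`/`lon` and a target port typed `Observation` with fields `latitude`/`longitude` would pass the semantic check and fail the structural one, which is the case where a structural transformation between the two services is still needed.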