- Number of slides: 38
Data and the Grid: From Databases to Global Knowledge Communities
Ian Foster, Argonne National Laboratory, University of Chicago
www.mcs.anl.gov/~foster
Image Credit: Electronic Visualization Lab, UIC
October 2, 2002 Grid, Globus Toolkit, and OGSA
Keynote Talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003
2 My Presentation
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, and schedule scientific workflows
  – Virtual data as a unifying concept
www.mcs.anl.gov/~foster ARGONNE · CHICAGO
3 It’s Easy to Forget How Different 2003 is From 1993
• Enormous quantities of data: Petabytes
  – For an increasing number of communities, gating step is not collection but analysis
• Ubiquitous Internet: 100+ million hosts
  – Collaboration & resource sharing the norm
• Ultra-high-speed networks: 10+ Gb/s
  – Global optical networks
• Huge quantities of computing: 100+ Top/s
  – Moore’s law gives us all supercomputers
4 Consequence: The Emergence of Global Knowledge Communities
• Teams organized around common goals
  – Communities: “Virtual organizations”
• With diverse membership & capabilities
  – Heterogeneity is a strength not a weakness
• And geographic and political distribution
  – No location/organization possesses all required skills and resources
• Must adapt as a function of the situation
  – Adjust membership, reallocate responsibilities, renegotiate resources
5 The Emergence of Global Knowledge Communities
6 Global Knowledge Communities Often Driven by Data: E.g., Astronomy
Figure: No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Largest catalogs near 1 B objects
Data and images courtesy Alex Szalay, Johns Hopkins
7 Data Integration as a Fundamental Challenge
Diagram labels: Discovery, Access, R, RM, Security service, Policy service
• Many sources of data, services, computation
• Security & policy must underlie access & management decisions
• Registries organize services of interest to a community
• Data integration activities may require access to, & exploration of, data at many locations
• Resource management is needed to ensure progress & arbitrate competing demands
• Exploration & analysis may involve complex, multi-step workflows
8 Performance Requirements Demand Whole-System Management
• Assume
  – Remote data at 1 GB/s
  – 10 local bytes per remote
  – 100 operations per byte
• Remote data: >1 GByte/s achievable today (FAST, 7 streams, LA to Geneva)
Diagram: wide area link (end-to-end switched lambda?) at 1 GB/s; local network; parallel I/O at 10 GB/s; parallel computation at 1000 Gop/s
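
The I/O and compute figures in the diagram appear to follow directly from the three assumptions. A minimal sketch of that arithmetic, assuming "100 operations per byte" applies to each locally read byte (the function and variable names are illustrative, not from the talk):

```python
GB = 10**9  # bytes in a gigabyte (decimal)


def whole_system_requirements(remote_bytes_per_s, local_bytes_per_remote_byte,
                              ops_per_local_byte):
    """Return (local I/O rate in bytes/s, compute rate in operations/s)."""
    # Each remote byte is combined with several locally stored bytes.
    local_io = remote_bytes_per_s * local_bytes_per_remote_byte
    # Assumption: the operation count is charged per local byte read.
    compute = local_io * ops_per_local_byte
    return local_io, compute


local_io, compute = whole_system_requirements(1 * GB, 10, 100)
print(f"Parallel I/O needed:         {local_io / GB:.0f} GB/s")
print(f"Parallel computation needed: {compute / 10**9:.0f} Gop/s")
```

Under these assumptions a 1 GB/s wide-area stream implies 10 GB/s of local parallel I/O and 1000 Gop/s of computation, which is why the network, storage, and compute have to be managed as one system.
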
9 Data Integration: Key Challenges
• Of course, familiar issues: data organization, schema definition/mediation, etc.
• But also new challenges relating to dynamic, distributed communities
  – Establishment, negotiation, management, & evolution of multi-organizational federations
• And to the sheer number of resources, speed of networks, and volume of data
  – Coordination, management, provisioning, & monitoring of workflows & required resources
10 Enter Grid Technologies
• Infrastructure (“middleware”) for establishing, managing, and evolving multi-organizational federations
  – Dynamic, autonomous, domain independent
  – On-demand, ubiquitous access to computing, data, and services
• Mechanisms for creating and managing workflow within such federations
  – New capabilities constructed dynamically and transparently from distributed services
  – Service-oriented, virtualization
11 The Emergence of Open Grid Standards
Timeline figure (vertical axis: increased functionality, standardization; horizontal axis: 1990, 1995, 2000, 2005, 2010):
• Custom solutions
• Globus Toolkit: de facto standard, single implementation (alongside Internet standards)
• Open Grid Services Arch: real standards, multiple implementations (alongside Web services, etc.)
• Managed shared virtual systems: computer science research
14 OGSA Structure
• A standard substrate: the Grid service
  – Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification
  – A Grid service is a Web service
• … supports standard service specifications
  – Agreement, data access & integration, workflow, security, policy, diagnostics, etc.
  – Target of current & planned GGF efforts
• … and arbitrary application-specific services based on these & other definitions
15 Open Grid Services Infrastructure
Diagram: a client interacts with a Grid service implementation running in a hosting environment/runtime (“C”, J2EE, .NET, …)
• Introspection: What port types? What policy? What state?
• Lifetime management: explicit destruction; soft-state lifetime
• GridService (required)
• Other standard interfaces: factory, notification, collections
• Grid Service Handle → handle resolution → Grid Service Reference
• Data access: service data elements
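
The two lifetime-management styles named on this slide can be illustrated with a small sketch. This is a toy model, not the OGSI interface: the class `SoftStateService` and its methods `keep_alive`, `destroy`, and `is_alive` are hypothetical names chosen for the example. The idea shown is that a service instance expires on its own unless a client keeps extending its termination time, so state left behind by a failed client is eventually reclaimed.

```python
import time


class SoftStateService:
    """Toy service instance with explicit destruction and soft-state lifetime."""

    def __init__(self, lifetime_s, now=time.monotonic):
        self._now = now  # injectable clock, so the behavior is easy to test
        self.termination_time = now() + lifetime_s
        self.destroyed = False

    def keep_alive(self, lifetime_s):
        """Soft state: extend the lifetime, but only while still alive."""
        if not self.is_alive():
            return False
        self.termination_time = self._now() + lifetime_s
        return True

    def destroy(self):
        """Explicit destruction requested by a client."""
        self.destroyed = True

    def is_alive(self):
        return not self.destroyed and self._now() < self.termination_time


if __name__ == "__main__":
    clock = [0.0]
    service = SoftStateService(10, now=lambda: clock[0])
    clock[0] = 5.0
    print(service.keep_alive(10))  # client renews before expiry
    clock[0] = 16.0
    print(service.is_alive())      # no renewal arrived in time
```
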
16 Open Grid Services Infrastructure
GWD-R (draft-ggf-ogsi-gridservice-23), Open Grid Services Infrastructure (OGSI), February 17, 2003
http://www.ggf.org/ogsi-wg
Editors: S. Tuecke, ANL; K. Czajkowski, USC/ISI; I. Foster, ANL; J. Frey, IBM; S. Graham, IBM; C. Kesselman, USC/ISI; D. Snelling, Fujitsu Labs; P. Vanderbilt, NASA
“The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
17 Example: Reliable File Transfer Service
Diagram: clients request and manage file transfer operations, and query &/or subscribe to service data
• Interfaces: Grid Service, File Transfer, Notf’n Source, Policy
• Service data elements: Pending, Performance, Policy, Faults
• Internal state; fault monitor; perf. monitor
• Data transfer operations
18 OGSA and Data Integration
• OGSI provides key enabling mechanisms for distributed data integration
  – Introspect on distributed system elements
  – Create and manage distributed state
• We need more than OGSI, of course, e.g.,
  – WS-Agreement: negotiate agreements between service provider and consumer
  – OGSA-DAI: Data Access and Integration
  – WS-Management: service management
  – Security and policy
19 Infrastructure Architecture
Layered diagram (Virtual Integration Architecture), top to bottom:
• Data Intensive X-ology Researchers
• Data Intensive Applications for X-ology Research
• Simulation, Analysis & Integration Technology for X-ology
• Generic Virtual Data Access and Integration Layer
• OGSA: Job Submission, Brokering, Registry, Workflow, Banking, Authorisation, Structured Data Integration, Data Transport, Resource Usage, Transformation, Structured Data Access
• OGSI: Interface to Grid Infrastructure
• Compute, Data & Storage Resources
• Distributed Structured Data: Relational, XML, Semi-structured
Slide Courtesy Malcolm Atkinson, UK eScience Center
20 Data as Service: OGSA Data Access & Integration
• Service-oriented treatment of data appears to have significant advantages
  – Leverage OGSI introspection, lifetime, etc.
  – Compatibility with Web services
• Standard service interfaces being defined
  – Service data: e.g., schema
  – Derive new data services from old (views)
  – Externalize to e.g. file/database format
  – Perform queries or other operations
21 Data Access & Integration Services
Diagram: Client, Registry, Factory, Grid Data Service (GDS), and an XML / relational database; interactions are SOAP/HTTP, service creation, and API interactions
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with database
3c. Results of query returned to client as XML
Slide Courtesy Malcolm Atkinson, UK eScience Center
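
The registry, factory, and data-service interaction can be mimicked in a few lines. This is an in-process sketch only: the classes `Registry`, `Factory`, and `GridDataService` below are hypothetical stand-ins rather than the OGSA-DAI API, SQLite replaces the remote database, and plain method calls replace the SOAP/HTTP exchanges.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Manages access to one database on behalf of a client."""

    def __init__(self, connection):
        self._conn = connection

    def query(self, sql):
        cursor = self._conn.execute(sql)              # 3b: GDS talks to the database
        columns = [col[0] for col in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            item = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(item, name).text = str(value)
        return ET.tostring(root, encoding="unicode")  # 3c: results returned as XML


class Factory:
    def __init__(self, connection):
        self._conn = connection

    def create_service(self):
        return GridDataService(self._conn)            # 2b/2c: create GDS, return it


class Registry:
    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        return self._factories[topic]                 # 1a/1b: look up a factory


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxies (name TEXT, redshift REAL)")
conn.execute("INSERT INTO galaxies VALUES ('NGC 1', 0.015)")

registry = Registry()
registry.register("galaxies", Factory(conn))

gds = registry.find("galaxies").create_service()      # steps 1 and 2
xml_result = gds.query("SELECT name, redshift FROM galaxies")  # step 3
print(xml_result)
```

The indirection through the factory is what lets each client receive its own service instance, whose lifetime and state can then be managed independently of the underlying database.
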
22 Globus Toolkit v3 (GT3): Open Source OGSA Technology
• Implements and builds on OGSI interfaces
• Supports primary GT2 interfaces
  – Public key authentication
  – Scalable service discovery
  – Secure, reliable resource access
  – High-performance data movement (GridFTP)
• Numerous new services included or planned
  – SLA negotiation, service registry, community authorization, data access & integration, …
• Rapidly growing adoption and contributions
  – E.g., OGSA-DAI from U.K. eScience program
23 My Presentation
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, & schedule scientific workflows
  – Virtual data as a unifying concept
24 Science as Workflow
• Data integration = the derivation of new data from old, via coordinated computation(s)
  – May be computationally demanding
  – The workflows used to achieve integration are often valuable artifacts in their own right
• Thus we must be concerned with how we
  – Build workflows
  – Share and reuse workflows
  – Explain workflows
  – Schedule workflows
25 Sloan Digital Sky Survey Production System
26 Virtual Data Concept
• Capture and manage information about relationships among
  – Data (of widely varying representations)
  – Programs (& their execution needs)
  – Computations (& execution environments)
• Apply this information to, e.g.
  – Discovery: Data and program discovery
  – Workflow: Structured paradigm for organizing, locating, specifying, & requesting data
  – Explanation: provenance
  – Planning and scheduling
  – Other uses we haven’t thought of
27 Motivations
Diagram: Data is created-by a Transformation; a Derivation is an execution-of a Transformation; Data is consumed-by/generated-by a Derivation
• “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
• “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
• “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
• “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”
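
Two of these motivations, explaining how data was constructed and finding which derived data to recompute after a calibration error, reduce to walking a graph of recorded derivations. A minimal sketch, assuming a hypothetical in-memory `VirtualDataCatalog` (this is not the Chimera catalog or its virtual data language; dataset and transformation names are invented for the example):

```python
from collections import defaultdict, deque


class VirtualDataCatalog:
    """Records which transformation produced each dataset from which inputs."""

    def __init__(self):
        self._producers = {}                 # output -> (transformation, inputs)
        self._consumers = defaultdict(set)   # input -> datasets derived from it

    def record_derivation(self, output, transformation, inputs):
        self._producers[output] = (transformation, tuple(inputs))
        for source in inputs:
            self._consumers[source].add(output)

    def provenance(self, dataset):
        """List the derivation steps that led to `dataset`, nearest first."""
        steps, seen, queue = [], set(), deque([dataset])
        while queue:
            current = queue.popleft()
            if current in seen or current not in self._producers:
                continue  # already visited, or a primary (non-derived) dataset
            seen.add(current)
            transformation, inputs = self._producers[current]
            steps.append((current, transformation, inputs))
            queue.extend(inputs)
        return steps

    def affected_by(self, dataset):
        """Return every dataset derived, directly or indirectly, from `dataset`."""
        affected, queue = set(), deque([dataset])
        while queue:
            for derived in self._consumers[queue.popleft()]:
                if derived not in affected:
                    affected.add(derived)
                    queue.append(derived)
        return affected


catalog = VirtualDataCatalog()
catalog.record_derivation("calibrated", "calibrate", ["raw_images"])
catalog.record_derivation("galaxy_catalog", "find_galaxies", ["calibrated"])
catalog.record_derivation("size_histogram", "plot_sizes", ["galaxy_catalog"])

print(catalog.provenance("galaxy_catalog"))          # how was this data built?
print(sorted(catalog.affected_by("raw_images")))     # what must be recomputed?
```
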
28 Chimera Virtual Data System (www.griphyn.org/chimera)
• Virtual data catalog
  – Transformations, derivations, data
• Virtual data language
  – Catalog definitions
• Query tool
• Applications include browsers and data analysis applications
29 Chimera Virtual Data Schema
Diagram labels (only partly recoverable): describes, Metadata
Virtual Data in CMS HEP Analysis
Define a virtual data space for exploration by other scientists
Diagram: a space of derived datasets, each node labeled by its parameters (mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1)
• Knowledge capture
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis
Search for WW decays of the Higgs Boson for which only stable, final state particles are recorded? (query parameters shown: mass = 200, stability = 1)
Diagram: a space of derived datasets, each node labeled by its parameters (mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1)
• Knowledge capture
• On-demand data gen
• Workload mgmt
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis
Search for WW decays of the Higgs Boson and where only stable, final state particles are recorded: (query parameters shown: mass = 200, stability = 1)
Scientist discovers an interesting result – wants to know how it was derived.
Diagram: a space of derived datasets, each node labeled by its parameters (mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1)
• Knowledge capture
• On-demand data gen.
• Workload mgmt
• Explain provenance
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Virtual Data in CMS HEP Analysis
Search for WW decays of the Higgs Boson and where only stable, final state particles are recorded: (query parameters shown: mass = 200, stability = 1)
Scientist discovers an interesting result – wants to know how it was derived.
The scientist adds a new derived data branch (mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000) and continues to investigate …
Diagram: a space of derived datasets, each node labeled by its parameters (mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1)
• Knowledge capture
• On-demand data gen.
• Workload mgmt
• Explain provenance
• Collaboration
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
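
The "on-demand data gen" idea in this sequence can be sketched as a lookup keyed by parameter settings: a dataset is derived only if nobody has materialized it yet. The class `VirtualDataSpace` and the placeholder `simulate` function are hypothetical names for illustration; the parameter names mirror the labels on the slides.

```python
class VirtualDataSpace:
    """Datasets addressed by parameters; derived only when first requested."""

    def __init__(self, derive):
        self._derive = derive          # function that computes a dataset
        self._materialized = {}        # parameter key -> existing dataset
        self.derivation_count = 0      # how many real computations were run

    def request(self, **parameters):
        key = tuple(sorted(parameters.items()))
        if key not in self._materialized:
            # Not yet materialized: run the (possibly expensive) derivation.
            self._materialized[key] = self._derive(dict(parameters))
            self.derivation_count += 1
        return self._materialized[key]


def simulate(parameters):
    # Placeholder for a long-running simulation or analysis job.
    return "dataset for " + ", ".join(f"{k} = {v}" for k, v in sorted(parameters.items()))


space = VirtualDataSpace(simulate)
first = space.request(mass=200, decay="WW", stability=1)
again = space.request(stability=1, decay="WW", mass=200)   # same node, reused
branch = space.request(mass=200, decay="WW", stability=1, LowPt=20, HighPt=10000)
print(space.derivation_count)
```

Because the key is built from sorted parameters, a second scientist asking for the same node in a different argument order gets the existing result instead of triggering a new computation.
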
34 Virtual Data “Explorations” Can be Long-Lived Computations
• Production Run on the Integration Testbed
  – Simulate 1.5 million full CMS events for physics studies: ~500 sec per event on 850 MHz processor
  – 2 months continuous running across 5 testbed sites
  – Managed by a single person at the US-CMS Tier 1
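
A back-of-the-envelope calculation shows the scale of this run. The event count and per-event time come from the slide; treating "2 months" as 60 days of wall-clock time is an assumption made for this estimate.

```python
events = 1_500_000
seconds_per_event = 500          # on an 850 MHz processor, per the slide
wall_clock_days = 60             # assumption: "2 months" taken as 60 days

cpu_seconds = events * seconds_per_event
cpu_days = cpu_seconds / 86_400
cpu_years = cpu_days / 365
average_processors = cpu_days / wall_clock_days

print(f"Total CPU time: {cpu_seconds:,} s = {cpu_days:,.0f} CPU-days = {cpu_years:.1f} CPU-years")
print(f"Average processors kept busy: {average_processors:.0f}")
```

Under that assumption the run amounts to roughly 24 CPU-years, or about 145 processors kept continuously busy across the five sites.
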
35 Virtual Data in Sloan Galaxy Cluster Analysis
Figure labels: DAG; Sloan Data; Galaxy cluster size distribution
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, Chicago
36 Virtual Data in Genome Analysis
Figure label: DOESG Resource
37 Bringing it All Together: A Virtual Data Grid
38 Also Very Relevant: Workflow & Web Services
Diagram labels: User; WF-Pilot (design, execution monitoring); WF-Engine (scheduling and execution, web service invocation); WF-Compiler (translation, query rewriting, web service matching, semantic type checking, data type conversion); AWF; EWF; ET; AAV rules; Abstract Task (AT) Repository; ET schemas; Executable Task (ET) Repository; Genbank; BLAST; Data & Parameter Ontologies; conversion rules; Datatype & Conversion Repository
B. Ludäscher, I. Altintas, A. Gupta – http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-02-01.pdf
39 Summary
1) Data integration as a new opportunity
  – Driven by advances in technology & science
  – The need to discover, access, explore, analyze diverse distributed data sources
  – Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  – The need to organize, archive, reuse, explain, and schedule scientific workflows
  – Virtual data as a unifying concept
40 For More Information
• The Globus Project™
  – www.globus.org
• Technical articles
  – www.mcs.anl.gov/~foster
• Open Grid Services Arch.
  – www.globus.org/ogsa
• Chimera
  – www.griphyn.org/chimera
• Global Grid Forum
  – www.ggf.org
2nd Edition: November 2003