SKR 5800 Selected Topics in Distributed Computing Grid

SKR 5800 Selected Topics in Distributed Computing
Grid Computing: Introduction
Azizol Abdullah, Ph.D.
Department of Communication Technology and Network

Lecture Contents
- Why do we have Grid Computing?
- What is Grid Computing?
- Ian Foster’s 3-point checklist
- Defining Grid Computing
- What is Grid and Grid Computing?
- Why we need grids
- Why now?
- The Grid problem

Why Do We Have Grid Computing?
- The term was coined in 1996 by Ian Foster and Carl Kesselman
- Used to describe software needed by the rapidly growing, highly advanced high-performance computing (HPC) community
- Resources that scale with technology:
  - Supercomputers (MFlops in '96, now TFlops): big and not portable
  - Large data sets (GB in '96, now petabytes)
- Need fast networks to move data around to resources
- Need security:
  - NSF (and other government agencies) spend money to build infrastructure, so it is hard to get access

What is Grid Computing?
- Is it a new, unique idea or the next generation of distributed computing or metacomputing?
- Please find and read this paper: Ian Foster, “What is the Grid? A Three-point Checklist”
  http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf

Ian Foster’s 3-Point Checklist
- A Grid is a system that is able to:
  1. coordinate “resources that are not subject to centralized control”
  2. use “standard, open, general-purpose protocols and interfaces”
  3. “to deliver nontrivial qualities of service.”
- What does this mean?
  - We will try to understand this in this course.
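Read as a decision rule, the checklist above is a conjunction of three conditions. The following is a minimal sketch of that reading only; the `SystemProfile` fields and the two example systems are hypothetical labels chosen for illustration, not definitions taken from Foster's paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a system, one flag per checklist point."""
    name: str
    decentralized_control: bool    # resources not under centralized control
    open_standard_protocols: bool  # standard, open, general-purpose interfaces
    nontrivial_qos: bool           # delivers nontrivial qualities of service


def is_grid(system: SystemProfile) -> bool:
    """A system counts as a Grid only if all three points hold."""
    return (system.decentralized_control
            and system.open_standard_protocols
            and system.nontrivial_qos)


# Illustrative inputs (assumed values, for demonstration only).
cluster = SystemProfile("single-site batch cluster", False, False, True)
multi_site = SystemProfile("multi-institution testbed", True, True, True)
```

Under these assumed flags, `is_grid(multi_site)` holds and `is_grid(cluster)` does not, because failing any single point is enough to fail the checklist.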

Defining Grid Computing
- There are several competing definitions for “The Grid” and Grid computing
- These definitions tend to focus on:
  - Implementation of distributed computing
  - A common set of interfaces, tools and APIs
  - Some stress the inter-institutional aspect of grids and Virtual Organizations
  - “The Virtualization of Resources”: abstraction of resources

What is Grid and Grid Computing?
- Grid computing promises a standard, ‘complete’ set of distributed computing capabilities
- There is a lot of hype around grid computing
  - Traditional users need to get work done now!
  - Some CS researchers see it as a fad
  - But there is real-world value!
    - In e-science and e-business

What is Grid and Grid Computing? (cont.)
- Grid computing must provide basic functions:
  1. resource discovery and information collection & publishing
  2. data management on and between resources
  3. process management on and between resources
  4. common security mechanism underlying the above
  5. process and session recording/accounting
- Current grid computing tools such as Globus provide most of the above at some level
  - The current capabilities are incomplete
  - New web-service-based standards will help current tools become interoperable

The Grid
“Resource sharing & coordinated problem solving in dynamic … virtual organizations”
1. Enable integration of distributed services & resources
2. Using general-purpose protocols & infrastructure
3. To achieve useful qualities of service
“The Anatomy of the Grid”, Foster, Kesselman, Tuecke, 2001

Why we need grids

Grid3: An Operational Grid
- 28 sites (2100-2800 CPUs) & growing
- 400-1300 concurrent jobs
- 8 substantial applications + CS experiments
- Running since October 2003
http://www.ivdgl.org/grid3
Slide courtesy of Ian Foster

Data Grids for High Energy Physics
(Figure: tiered data grid; recoverable labels listed below)
- Online System: ~PBytes/sec from the detector
  - There is a “bunch crossing” every 25 nsecs
  - There are 100 “triggers” per second
  - Each triggered event is ~1 MByte in size
- Offline Processor Farm: ~20 TIPS, fed at ~100 MBytes/sec
- Tier 0: CERN Computer Centre, fed at ~100 MBytes/sec
- Tier 1: France, Germany and Italy Regional Centres, FermiLab (~4 TIPS); linked at ~622 Mbits/sec or Air Freight (deprecated)
- Tier 2: Caltech (~1 TIPS) and other Tier 2 Centres (~1 TIPS); linked at ~622 Mbits/sec
- Institutes: ~0.25 TIPS each, with a physics data cache; linked at ~622 Mbits/sec
- Tier 4: physicist workstations, fed at ~1 MBytes/sec
- 1 TIPS is approximately 25,000 SpecInt95 equivalents
- Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
Image courtesy Harvey Newman, Caltech
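The rates quoted on this slide can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: it treats the approximate figures as exact, assumes one event per trigger, and ignores protocol overhead on the links.

```python
# Figures quoted on the slide (all approximate).
TRIGGERS_PER_SEC = 100      # "triggers" per second
EVENT_SIZE_MBYTES = 1       # size of one triggered event
BUNCH_CROSSING_NS = 25      # one bunch crossing every 25 ns
LINK_MBITS_PER_SEC = 622    # wide-area link speed

# Data rate leaving the online system after triggering, in MBytes/sec.
trigger_rate_mbytes = TRIGGERS_PER_SEC * EVENT_SIZE_MBYTES

# Bunch crossings per second: 1e9 ns in a second divided by 25 ns.
crossings_per_sec = 1_000_000_000 // BUNCH_CROSSING_NS

# Fraction of crossings that the trigger keeps.
kept_fraction = TRIGGERS_PER_SEC / crossings_per_sec

# A 622 Mbit/s link expressed in MBytes/sec (8 bits per byte).
link_mbytes = LINK_MBITS_PER_SEC / 8

print(trigger_rate_mbytes)  # 100
print(crossings_per_sec)    # 40000000
print(kept_fraction)        # 2.5e-06
print(link_mbytes)          # 77.75
```

The first result matches the ~100 MBytes/sec figure on the diagram, and the last shows that a single ~622 Mbits/sec link carries somewhat less than the full triggered stream.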

Grid Physics Network (GriPhyN)
Enabling R&D for advanced data grid systems, focusing in particular on the Virtual Data concept
(Figure labels: ATLAS, CMS, LIGO, SDSS)
www.griphyn.org; Slide from C. Kesselman/Cal(IT)2 presentation

Why Now?
- The Internet as infrastructure
  - Increasing bandwidth, advanced services
- Advances in storage capacity
  - Terabytes, petabytes per site
- Increased availability of compute resources
  - clusters, supercomputers, etc.
- Advanced applications
  - simulation-based design, advanced scientific instruments, ...

The Grid Problem
- Flexible, secure, coordinated sharing of computation among dynamic collections of individuals, institutions, and resources
- Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, assuming the absence of…
  - central location
  - central control
  - omniscience
  - existing trust relationships
The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.

Elements of the Problem
- Resource sharing
  - Computers, storage, sensors, networks, …
  - Sharing always conditional: issues of trust, policy, negotiation, payment, …
- Coordinated problem solving
  - Beyond client-server: distributed data analysis, computation, collaboration, …
- Dynamic, multi-institutional virtual orgs
  - Community overlays on classic org structures
  - Large or small, static or dynamic

The Programming Problem
- Applications require resources (compute power, storage, data, instruments, displays) at many sites for many users.
- Some requirements:
  - Abstractions and models to increase speed/robustness/etc. of development
  - Tools to ease application development, diagnose common problems, and ease deployment
  - Code/tool sharing to allow reuse of code components developed by others

Grid must support computational workflows
- Locate “suitable” computers
- Authenticate with appropriate sites
- Allocate resources on those computers
- Initiate computation on those computers
- Configure those computations
- Select “appropriate” communication methods
- Compute with “suitable” algorithms
- Access data files, return output
- Respond “appropriately” to resource changes
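The first few steps above (locate, authenticate, allocate, initiate) can be sketched as a small pipeline. This is a toy model under stated assumptions: the `Site` class, its fields, the site names and the trust check are all hypothetical stand-ins, not the interface of any real grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical grid site; every field is an illustrative assumption."""
    name: str
    free_cpus: int
    trusted_users: set = field(default_factory=set)


def locate(sites, cpus_needed):
    """Locate "suitable" computers: sites with enough free CPUs."""
    return [s for s in sites if s.free_cpus >= cpus_needed]


def authenticate(site, user):
    """Stand-in for real credential checks at a site."""
    return user in site.trusted_users


def allocate(site, cpus_needed):
    """Claim CPUs at the site."""
    site.free_cpus -= cpus_needed


def run_workflow(sites, user, cpus_needed, task):
    """Locate, authenticate, allocate, then initiate the computation."""
    for site in locate(sites, cpus_needed):
        if not authenticate(site, user):
            continue  # refused here, so try the next suitable site
        allocate(site, cpus_needed)
        try:
            return site.name, task()
        finally:
            site.free_cpus += cpus_needed  # release the CPUs afterwards
    raise RuntimeError("no suitable site accepted the job")


sites = [
    Site("site-a", free_cpus=4, trusted_users={"alice"}),
    Site("site-b", free_cpus=64, trusted_users={"bob"}),
    Site("site-c", free_cpus=128, trusted_users={"alice", "bob"}),
]
result = run_workflow(sites, "alice", cpus_needed=32,
                      task=lambda: sum(range(10)))
```

Here `site-a` is too small and `site-b` does not trust the user, so the job runs at `site-c`. The later steps on the slide (configuration, communication methods, data access, reacting to resource changes) are left out of this sketch.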

Grid Requirements
- identity & authentication
- authorization & policy
- resource/service discovery
- resource allocation
- (co-)reservation, workflow
- remote data access
- rapid data transfer
- monitoring
- intrusion detection
- resource management
- accounting
- fault management
- system evolution
- and more…

Grid Computing - Functions
- Grid computing must typically provide these basic functions (Foster/Kesselman):
  - resource discovery and information collection & publishing
  - data management on and between resources
  - process management on and between resources
  - common security mechanism underlying the above
- In addition, it should include:
  - process and session recording/accounting

Grid Computing vs. Distributed Computing
- How does grid computing differ from traditional distributed computing?
- Where do grids get their names?
- Grid hardware
- Grid applications

Distributed Computing: A Quick Review
Andrew Tanenbaum: “A distributed system is a collection of independent computers that appear to the users of the system as a single computer.”

Distributed Systems: Hardware
- Distributed in the local area
- Memory organization:
  1. Shared-memory multiprocessors
     - Single virtual address space shared by all CPUs
  2. Multicomputers with private memories
     - Separate address spaces
- Interconnection network organization:
  1. Bus-based
     - A single shared network, backplane, bus or cable
  2. Switch-based
     - Individual connections between machines

Simplest Hardware: A Bus-based Shared-Memory Multiprocessor
(Figure labels: Processor, Cache, Memory, Bus)
- Shared memory
- Caches must be kept consistent
- Bus bandwidth limits scaling to ~64 processors

Bus-based Distributed Shared-Memory (DSM) Multiprocessor
(Figure labels: Memory, Cache, Processor, Bus)
- Each processor contains a portion of shared memory
- Local accesses fast, remote accesses slow
- “NUMA”: non-uniform memory access

Switch-Based Multicomputer: Workstation Cluster
(Figure labels: Workstation, Ethernet Switch)
- Workstations share resources: file servers, printers, storage archives
- Schedule jobs
- Use idle workstations

Hardware: What is different in a grid?
- Heterogeneous hardware environment
  - computing platforms
  - network connections
  - storage systems and caches
- Wide-area distribution
  - Wide-area network latency and bandwidth
  - Resources in different administrative domains
- Dynamic environment
  - Resources enter and leave the grid

Software: Issues in Distributed Operating Systems
- Communication models
  - Client-server model
  - Remote procedure call
  - Group communication
- In a grid:
  - Algorithms must tolerate wide-area latency for message transfers
  - Avoid large numbers of messages
  - Typically perform larger transfers, initiate remote jobs rather than procedure calls
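A simple cost model shows why wide-area latency pushes grid software toward fewer, larger transfers. The sketch below charges one latency per message plus the time the payload spends on the wire; the 50 ms latency and 10 MB/s bandwidth are assumed values chosen for illustration, not measurements.

```python
def transfer_time(num_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Cost model: one latency per message plus time on the wire."""
    return num_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed wide-area link: 50 ms latency per message, 10 MB/s bandwidth.
LATENCY_S = 0.05
BANDWIDTH = 10_000_000

# The same 1 MB payload sent as 1000 small messages or as one large transfer.
many_small = transfer_time(1000, 1_000_000, LATENCY_S, BANDWIDTH)
one_large = transfer_time(1, 1_000_000, LATENCY_S, BANDWIDTH)
```

Under these assumptions the chatty version takes roughly 50 seconds and the single transfer roughly 0.15 seconds, even though both move the same number of bytes. The model ignores pipelining and congestion, so it only indicates the direction of the effect.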

Software: Issues in Distributed Operating Systems
- Synchronization
  - Clock synchronization
  - Election algorithms: determine a coordinator
  - Atomic transactions
- In a grid:
  - With wide-area latencies, typically perform synchronization at a larger grain
  - Can implement atomic operations

Software: Issues in Distributed Operating Systems
- Processes and processors
  - Threads
  - Allocating processors
  - Scheduling and co-scheduling resources
  - Fault tolerance
- In a grid: scheduling, allocation, & fault tolerance issues get more complicated in the wide-area environment

Software: Issues in a Distributed Operating System
- Distributed file systems
  - File service that reads and writes files, controls access
  - Creating, deleting & managing directories
  - Naming
  - Sharing
  - Caching and consistency
  - Replication and updates
- In a grid: same issues, complicated by wide-area distribution, different administrative domains, enormous data sets

Software: Issues for a Distributed Operating System
- Distributed Shared Memory
  - Generally applies to machines in a LAN
  - Each processor contains memory corresponding to part of the shared memory address space
  - Each processor caches data from other processors
  - Many consistency algorithms
- In a grid: EASIER!
  - Globus does not support a shared address space
  - Legion has a single shared object space

Summary: Heterogeneity makes things harder in a grid
- Heterogeneous software and hardware
- Different administrative domains
- Different policies for use and management of local resources
  - Must do coordinated scheduling
- Different security policies
- Dynamic environment
  - Must discover resources
  - Robust in the presence of network, resource failures

Where do computational grids get their names?
- “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”
- Name (and definition) imply an analogy to the electric power grid
  - Power inexpensive, universally available
  - Enabled new devices and industries

An Infrastructure Analogy: The Electric Power Grid
- Revolutionary development: transmission and distribution of electricity
- Before: power accessible in crude forms
  - human work
  - horses
  - water power
  - steam engines
- Today: cheap, reliable power universally available

Electric Power Grid (cont. ) n n n n Power to billions of devices Efficient Low-cost Reliable North America: 10, 000 generators linked to billions of outlets Heterogeneous components, distributed ownership Interconnections between regions: share reserve capacity, trade excess power

Electric Power Grid (cont.)
- Required more than just technology
  - Regulatory, political and institutional development
  - Infrastructure for monitoring and management
- Huge social impact
  - Fundamentally changed work and home life
- Huge environmental impact
  - Consume resources, generate pollution, global warming, …

Based on Infrastructure Analogies: Desired Characteristics of Grids
- Pooling of resources
  - Compute cycles, data, people, sensors
- Dependable service
  - Predictable
  - Sustained performance
  - Often high-performance

Grid Characteristics (cont.)
- Consistent service
  - Standard services available
  - Via standard interfaces
  - Enable application development
- Pervasive
  - Services always available
- Inexpensive
  - Otherwise not widely accepted and used

A Grid Application Scenario
- A distributed simulation involving 10 supercomputers at 10 different locations
  - How do you know where they are?
  - How do you identify yourself to each?
  - How do you get permission to use them?
  - How do you submit remote jobs?
  - How do you get access to resources on all the machines simultaneously?
  - What happens if a machine fails?
  - How are input/output files managed?

Grid Services Architecture
(Figure: four layers, top to bottom; recoverable labels listed below)
- Applications: High-energy physics data analysis, Collaborative engineering, On-line instrumentation, Regional climate studies, Parameter studies
- Application Toolkit Layer: Distributed computing, Collab. design, Data-intensive, Remote control, Remote viz, ...
- Grid Services Layer: Information, Resource mgmt, Security, Data access, Fault detection, ...
- Grid Fabric Layer: Transport, Multicast, Instrumentation, Control interfaces, QoS mechanisms

Layered Grid Architecture (By Analogy to Internet Architecture)
Grid layers, top to bottom:
- Application
- Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
- Resource: “Sharing single resources”: negotiating access, controlling use
- Connectivity: “Talking to things”: communication (Internet protocols) & security
- Fabric: “Controlling things locally”: access to, & control of, resources
Internet Protocol Architecture, top to bottom: Application, Transport, Internet, Link
Slide courtesy of C. Kesselman, Cal(IT)2 presentation

Layered Grid Architecture
- Fabric Layer: provides the local services of a resource
  - computational, storage, network
- Connectivity Layer: core communication and authentication protocols
  - Enables exchange of data between fabric layer resources
  - Security and authentication important here

Layered Grid Architecture (cont.)
- Resource Layer: enables resource sharing
  - Builds on connectivity layer to control and access resources (Ex: data servers)
- Collective Layer: coordinates interactions across multiple resources
  - Ties multiple resources and services together (Ex: metacatalogues)
- Application Layer: user applications use collective, resource, and connectivity layers to perform grid operations in a virtual organization

Basic Grid Services
- Security
  - Authentication: both client and server
  - Authorization: what privileges does the client have?
  - Access control: sites want local control of operations that remote users are allowed to perform
  - Confidential data transfer using encryption

Basic Grid Services (cont.)
- Resource management
  - Mechanism for submitting jobs to remote locations
  - Local policies for use, management, resource configuration
- Scheduling of important resources
  - Coordinating scarce, expensive resources (e.g., cooperating supercomputers)
  - Advance reservations to guarantee:
    - Quality of service
    - Completion of operations (e.g., reserve disk space for a large data transfer)
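Advance reservation across cooperating resources can be illustrated with an all-or-nothing booking. This is a toy sketch: `ReservationBook`, `co_reserve` and the resource names are hypothetical, and a real system would also need the check and the booking to be atomic across sites, which this single-process example does not attempt.

```python
class ReservationBook:
    """Advance reservations for one resource, stored as (start, end) slots."""

    def __init__(self, name):
        self.name = name
        self.slots = []

    def is_free(self, start, end):
        # Half-open intervals are disjoint when one ends before the other starts.
        return all(end <= s or start >= e for s, e in self.slots)

    def reserve(self, start, end):
        if not self.is_free(start, end):
            raise ValueError(f"{self.name} is busy during [{start}, {end})")
        self.slots.append((start, end))


def co_reserve(books, start, end):
    """Reserve the same slot on every resource, or on none of them."""
    if not all(book.is_free(start, end) for book in books):
        return False
    for book in books:
        book.reserve(start, end)
    return True


# Hypothetical resources; times are hours from now.
sc1 = ReservationBook("supercomputer-1")
sc2 = ReservationBook("supercomputer-2")
sc2.reserve(2, 4)                       # sc2 already booked for hours 2 to 4

first = co_reserve([sc1, sc2], 3, 5)    # overlaps the existing booking on sc2
second = co_reserve([sc1, sc2], 4, 6)   # both resources are free
```

The first request is refused without booking anything on `sc1`, which is the point of co-reservation: a slot on one machine is useless to a coupled computation if the other machine is unavailable.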

Basic Grid Services (cont.)
- Information Services
  - Register and query information about grid resources
    - Where are all the Cray T3Es in the grid?
    - Where is a storage system with 250 gigabytes of free space that transfers data at 1 gigabit/sec?
  - Centerpiece for many Grid components
- Performance measurement services
  - What is the current bandwidth of the link from jupiter.isi.edu to apogee.sdsc.edu?
- Dynamic environment: assume the information service contains old information
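The register-and-query pattern, together with the warning about old information, can be sketched as a registry that timestamps every record and lets the client bound how stale an answer may be. The class, the attribute names (`free_gb`, `gbit_per_s`) and the resource names are assumptions made for this example, not the schema of any real information service.

```python
class InformationService:
    """Toy registry: resources publish attributes, clients query them."""

    def __init__(self):
        self._records = {}  # name -> (attributes, time of registration)

    def register(self, name, attributes, now):
        self._records[name] = (dict(attributes), now)

    def query(self, predicate, now, max_age):
        """Return names whose record matches and is at most max_age old."""
        return sorted(
            name
            for name, (attrs, stamp) in self._records.items()
            if now - stamp <= max_age and predicate(attrs)
        )


# Hypothetical resources and attribute names; times are in seconds.
info = InformationService()
info.register("store-1", {"free_gb": 300, "gbit_per_s": 1.0}, now=0)
info.register("store-2", {"free_gb": 500, "gbit_per_s": 1.0}, now=90)
info.register("store-3", {"free_gb": 100, "gbit_per_s": 10.0}, now=95)


def big_and_fast(attrs):
    """At least 250 GB free and at least 1 gigabit/sec, as in the slide's query."""
    return attrs["free_gb"] >= 250 and attrs["gbit_per_s"] >= 1.0


fresh = info.query(big_and_fast, now=100, max_age=60)
anything = info.query(big_and_fast, now=100, max_age=1000)
```

With a 60-second freshness bound only `store-2` is returned; relaxing the bound also returns `store-1`, whose record may no longer reflect reality.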

Basic Grid Services (cont.)
- Efficient Data Transfers
  - Secure (authentication, encryption)
  - Parallel transfers
  - Partial file transfers
  - Third-party transfers
  - Reliable transfers
- Replica Management Service
  - Large (petabyte-scale) datasets
  - Multiple stored and cached copies
  - Select the “best” copy with best performance
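One way to make “best copy” concrete is to estimate the transfer time from each replica and pick the minimum. The sketch below assumes the estimate is setup latency plus size divided by bandwidth; the `Replica` fields and the two sites are hypothetical, and a real replica service would draw these numbers from performance measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical replica record; the fields are illustrative assumptions."""
    site: str
    latency_s: float            # setup cost to reach the site
    bandwidth_mb_per_s: float   # measured or estimated throughput


def estimated_seconds(replica, size_mb):
    """Estimated time to fetch size_mb megabytes from this replica."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_per_s


def best_replica(replicas, size_mb):
    """Pick the copy with the lowest estimated transfer time."""
    return min(replicas, key=lambda r: estimated_seconds(r, size_mb))


replicas = [
    Replica("near-but-slow", latency_s=0.01, bandwidth_mb_per_s=5.0),
    Replica("far-but-fast", latency_s=0.5, bandwidth_mb_per_s=100.0),
]

small = best_replica(replicas, size_mb=1)
large = best_replica(replicas, size_mb=10_000)
```

The choice depends on the request: the nearby copy wins for a 1 MB read, while the distant high-bandwidth copy wins for a 10 GB transfer.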

Basic Grid Services (cont.)
- Fault detection
  - Detect and report “failure” of component of a computation
  - Limited by ability to distinguish between network partition and system failure
- Goal: make low-level operations reliable
- No libraries for checkpoint and restart
  - Can’t “checkpoint” a socket
  - Only application knows how to checkpoint and restart
  - Likewise, storage system must do logging
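A common way to detect “failure” is a heartbeat timeout. The sketch below is a minimal illustration with hypothetical names; it reports silent components as suspects rather than as failed, because a missing heartbeat cannot distinguish a crashed component from a network partition.

```python
class HeartbeatMonitor:
    """Flags components whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}

    def heartbeat(self, component, now):
        """Record that a component reported in at time `now`."""
        self._last_seen[component] = now

    def suspects(self, now):
        """Components that may have failed, or may only be unreachable."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


# Hypothetical component names; times are seconds on a shared clock.
monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("worker-1", now=0)
monitor.heartbeat("worker-2", now=0)
monitor.heartbeat("worker-1", now=40)   # worker-2 stays silent

late = monitor.suspects(now=45)
```

At time 45 only `worker-2` is suspected. What to do next (restart, reschedule, wait) is left to the application, in line with the slide's point that only the application knows how to checkpoint and restart.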

Major Grid Computing Infrastructure Projects
- The Globus Project
  - “Bag of services” model for grid computing
  - USC Information Sciences Institute and Argonne National Laboratory (Chicago)
  - We will use Globus for most of the examples in this class
- The Legion Project
  - Object-oriented approach to grid computing
- The Condor Project
  - Schedule computations on pool of resources

Application Examples
- Online instrumentation
- Distributed supercomputing
- Collaborative engineering
- High-throughput computing
- Remote job submission, meta-queueing

Online Instrumentation
(Figure labels listed below)
- Advanced Photon Source
- real-time collection
- tomographic reconstruction
- archival storage
- wide-area dissemination
- desktop & VR clients with shared controls
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago

Grid Applications: Distributed Supercomputing
- Solve problems that cannot be solved using a single system
- Example application: distributed, interactive simulation involving 100,000s of entities
- Difficult issues:
  - “Co-scheduling” of scarce, expensive resources
  - Algorithms that scale to many nodes and tolerate latency
  - Achieving and sustaining high performance across heterogeneous systems

Globus Example: Distributed Supercomputing
(Figure labels: NCSA Origin, Caltech Exemplar, CEWES SP, Maui SP)
- SF-Express: distributed, interactive simulation
- 100K vehicles (2002 goal) using 13 computers, 1386 nodes, 9 sites
- Largest DIS ever done
- Globus mechanisms for
  - Resource allocation
  - Distributed startup
  - I/O and configuration
  - Fault detection
P. Messina et al., Caltech

Grid Applications: High-Throughput Computing
- Schedule large numbers of loosely coupled or independent tasks
- Tie together idle workstations
  - Put unused cycles to work
- Example applications: chip design, solving cryptographic problems
- Systems:
  - Condor: manages pool of hundreds of workstations around the world
  - Entropia: startup company
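The defining property of high-throughput workloads is that the tasks are independent, so they can be farmed out to whatever workers are free. The sketch below illustrates that shape on one machine with Python's standard `concurrent.futures` thread pool standing in for a pool of idle workstations; it is not how Condor or Entropia work internally, and `simulate` is a placeholder task.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Stand-in for one independent task, e.g. one point in a parameter sweep."""
    return parameter * parameter


def run_sweep(parameters, workers=4):
    """Farm independent tasks out to a pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, parameters))


results = run_sweep(range(8))
```

Because no task depends on another, adding workers raises throughput without any change to the task code, which is what makes idle-cycle harvesting practical for this class of problem.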

High-Throughput Computing: SETI@home

Grid Applications: Data-Intensive Computing
- Geographically distributed data repositories, digital libraries and databases
- Up to petabytes of data
- Example applications: high-energy physics experiments, climate modeling, human genome project databases
- Challenging issues:
  - High-performance data transfers in wide-area environments
  - Management of caching and replication

Globus Data-Intensive Computing
(Figure; recoverable labels listed below)
- User query: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”
- Query manager: “Access datasets A, B; run A->meso->hydro; compare result with B”
- Components: Resource manager, Cache manager, Cache (x2), Historical data archive, Simulation data archive, Analysis engine (meso, hydro, compare)

Grid Applications: Collaborative Computing
- Enabling and enhancing human interactions
  - Virtual shared spaces
  - Shared resources: data archives, simulations
- Example applications: collaborative design or collaborative exploration of data sets
- Challenges:
  - Real-time requirements of human perception
  - Rich interactions

Globus Example: Collaborative Engineering
- Manipulate shared virtual space: simulation components
- Multiple flows: Control, Text, Video, Audio, Database, Simulation, Tracking, Haptics, Rendering
- Uses Globus communication
CAVERNsoft: UIC Electronic Visualization Laboratory

NEES (Network for Earthquake Engineering Simulation) Collaboratory
(Figure label: U. Nevada Reno)
www.neesgrid.org

How Grid Software Works: NSF Network for Earthquake Engineering Simulation (NEES)
Transform our ability to carry out research vital to reducing vulnerability to catastrophic earthquakes
Slide courtesy of Ian Foster

Building a NEES Collaboratory: What the User Wants
Secure, reliable, on-demand access to data, software, people, and other resources (ideally all via a Web Browser)
Slide courtesy of Ian Foster

How it Really Happens (A Simplified View)
(Figure; recoverable labels listed below)
- Tiers:
  - Users work with client applications
  - Application services organize VOs & enable access to other services
  - Collective services aggregate &/or virtualize resources
  - Resources implement standard access & management interfaces
- Components: Web Browser, Simulation Tool, Data Viewer Tool, Chat Tool, Web Portal, Registration Service, Credential Repository, Telepresence Monitor, Data Catalog, Certificate authority, Compute Server (x2), Camera, Database service (x2)
Slide courtesy of Ian Foster

How it Really Happens (without Grid Software)
(Figure; recoverable labels listed below)
- Component sources: Application Developer 12, Off the Shelf 10, Globus Toolkit 0, Grid Community 0
- Tiers:
  - Users work with client applications
  - Application services organize VOs & enable access to other services
  - Collective services aggregate &/or virtualize resources
  - Resources implement standard access & management interfaces
- Components: Web Browser, Simulation Tool, Data Viewer Tool, Chat Tool, Web Portal, Registration Service, Credential Repository, Telepresence Monitor, Data Catalog, Certificate authority, Compute Server (x2), Camera (x2), Database service (x3)
- The labels A, B, C, D and E also appear on the figure
Slide courtesy of Ian Foster

How it Really Happens (with Grid Software)
(Figure; recoverable labels listed below)
- Component sources: Application Developer 2, Off the Shelf 9, Globus Toolkit 5, Grid Community 3
- Tiers:
  - Users work with client applications
  - Application services organize VOs & enable access to other services
  - Collective services aggregate &/or virtualize resources
  - Resources implement standard access & management interfaces
- Components: Web Browser, Simulation Tool, Data Viewer Tool, CHEF, CHEF Chat Teamlet, MyProxy, Telepresence Monitor, Globus GRAM, Globus Index Service, Globus MCS/RLS, OGSA DAI (x3), Certificate Authority, Compute Server, Camera (x2), Database service
Slide courtesy of Ian Foster

Online Simulation Test (July 2003)
(Figure labels: Colorado; Illinois (simulation))

Grids Changing Science
- NSF National Earthquake Engineering Center
  - Integrated instrumentation, collaboration, simulation environment
- National Environmental
- High-energy Physics Grid (GriPhyN)
- CERN Data Grid

Current and Future Applications
- Interesting applications exist today
- More sophisticated applications will follow
- Characteristics
  - Appetite for resources (CPU, memory, storage)
    - Only satisfied by multiple systems
  - Synchronization
  - Need high availability of resources

Summary
- Grids will change the way we do science and engineering
- Transition of services and applications to production use
- Future will see increased sophistication and scope of services, tools, and applications