Grid Computing
Readings: Grid Book, Chapters 1, 2, 3, 22; "Implementing Distributed Synthetic Forces Simulations in Metacomputing Environments," Brunett, Davis, Gottschalk, Messina, Kesselman; http://www.globus.org
CSE 160/Berman
Outline
• What is Grid computing?
• Grid computing applications
• Grid computing history
• Issues in Grid Computing
• Condor, Globus, Legion
• The next step
What is Grid Computing?
• A Computational Grid is a collection of distributed, possibly heterogeneous resources that can be used as an ensemble to execute large-scale applications
• A Computational Grid is also called a "metacomputer"
Computational Grids
• The term "computational grid" comes from an analogy with the electric power grid:
  – Electric power is ubiquitous
  – You don't need to know the source of the power (transformer, generator) or the power company that serves it
  – The analogy breaks down in the area of performance
• The search for cycles in HPC is ever-present. Two foci of research:
  – "In the box" parallel computers: PetaFLOPS architectures
  – Increasing development of infrastructure and middleware to leverage the performance potential of distributed Computational Grids
Grid Applications
• Distributed Supercomputing
  – Distributed supercomputing applications couple multiple computational resources: supercomputers and/or workstations
  – Examples include:
    • SF Express (large-scale modeling of battle entities with complex interactive behavior for distributed interactive simulation)
    • Climate modeling (high resolution, long time scales, complex models)
Distributed Supercomputing Example – SF Express
• SF Express (Synthetic Forces Express): a large-scale distributed simulation of the behavior and movement of entities (tanks, trucks, airplanes, etc.) for interactive battle simulation
• Entities require information about
  – The state of the terrain
  – The location and state of other entities
• Information is updated several times a second
• Interest management allows entities to look only at relevant information, enabling scalability
SF Express
• Large-scale SF Express run goals
  – Simulate 50,000 entities in 8/97 and 100,000 entities in 3/98
  – Increase the fidelity and resolution of the simulation over previous runs
  – Improve
    • Refresh rate
    • Training environment responsiveness
    • Number of automatic behaviors
  – Ultimately use the simulation for real-time planning as well as training
• Large-scale runs are extremely resource-intensive
SF Express Programming Issues
• How should entities be mapped to computational resources?
• Entities receive information based on "interests"
  – Communication is reduced and localized through "interest management" (a minimal sketch follows below)
• A consistency model for entity information must be developed
  – Which entities can/should be replicated?
  – How should updates be performed?
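To make the interest-management idea concrete, here is a hedged sketch, not SF Express's actual code: entities subscribe to terrain cells, and a state update is delivered only to entities that registered interest in the cell where the update occurred. The class, cell size, and coordinates are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch of interest management: entities subscribe to
# terrain cells and receive only updates that occur in those cells.
CELL_SIZE = 10_000  # meters per terrain cell (illustrative value)

def cell_of(x, y):
    """Map a world coordinate to a discrete terrain cell."""
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

class InterestManager:
    def __init__(self):
        self.subscribers = defaultdict(set)  # cell -> set of entity ids

    def subscribe(self, entity_id, x, y, radius_cells=1):
        """Register interest in the cells around an entity's position."""
        cx, cy = cell_of(x, y)
        for dx in range(-radius_cells, radius_cells + 1):
            for dy in range(-radius_cells, radius_cells + 1):
                self.subscribers[(cx + dx, cy + dy)].add(entity_id)

    def interested_entities(self, x, y):
        """Entities that should receive a state update at (x, y)."""
        return self.subscribers.get(cell_of(x, y), set())

# Usage: only entity 42, subscribed near the update, receives it.
im = InterestManager()
im.subscribe(42, 15_000, 22_000)
im.subscribe(7, 900_000, 900_000)
print(im.interested_entities(14_500, 21_000))  # -> {42}
```

Filtering updates this way is what lets communication stay local as the entity count grows.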
SF Express Distributed Application Architecture
• Legend: D = data server, I = interest management, R = router, S = simulation node
[Figure: node diagram of the distributed architecture, showing routers connecting groups of simulation nodes with interest-management and data-server nodes]
50,000-entity SF Express Run
• 2 large-scale simulations run on August 11, 1997

Site          | Hardware      | Processors | Entities, First Run | Entities, Second Run
Caltech       | HP Exemplar   | 256        | 13,095              | 12,182
ORNL          | Intel Paragon | 1024       | 16,695              | 15,996
NASA, CA      | IBM SP-2      | 139        | 5,464               | 5,637
CEWES, VA     | IBM SP-2      | 229        | 9,739               | 9,607
Maui          | IBM SP-2      | 128        | 5,056               | 7,027
HP/Convex, TX | HP Exemplar   | 128        | 5,348               | 6,733
Total         |               | 1904       | 55,397              | 57,182
50,000-entity SF Express Run
• The simulation decomposed the terrain (Saudi Arabia, Kuwait, Iraq) contiguously among the supercomputers (see the decomposition sketch below)
• Each supercomputer simulated a specific area and exchanged interest and state information with the other supercomputers
• All data exchanges were flow-controlled
• The supercomputers were fully interconnected and dedicated to the experiment
• Success depended on "moderate to significant system administration, interventions, competent system support personnel, and numerous phone calls"
• Subsequent Globus runs focused on improving data and control management and operational issues for wide-area execution
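As a rough illustration of contiguous decomposition, and not the SF Express implementation, the sketch below splits a terrain's east-west extent into contiguous strips, one per site, sized in proportion to each site's processor count (counts taken from the table above; the terrain width and coordinates are illustrative).

```python
# Hypothetical sketch: divide a terrain's east-west extent into contiguous
# strips, one per site, with strip width proportional to processor count.
SITES = {            # processor counts from the 50,000-entity run table
    "Caltech": 256, "ORNL": 1024, "NASA, CA": 139,
    "CEWES, VA": 229, "Maui": 128, "HP/Convex, TX": 128,
}

def decompose(terrain_width_km, sites):
    """Return contiguous (start_km, end_km) strips, one per site."""
    total = sum(sites.values())
    strips, start = {}, 0.0
    for name, procs in sites.items():
        width = terrain_width_km * procs / total
        strips[name] = (start, start + width)
        start += width
    return strips

def owner(x_km, strips):
    """Site responsible for simulating entities at offset x_km."""
    for name, (lo, hi) in strips.items():
        if lo <= x_km < hi:
            return name
    raise ValueError("coordinate outside terrain")

strips = decompose(1500.0, SITES)   # ~1500 km wide theater (illustrative)
print(owner(300.0, strips))         # site owning entities near the 300 km mark
```

Because each site owns one connected area, most entity interactions stay within a single supercomputer and only boundary traffic crosses the wide-area links.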
High-Throughput Applications
• The Grid is used to schedule large numbers of independent or loosely coupled tasks, with the goal of putting unused cycles to work
• High-throughput applications include RSA key cracking, SETI@home (detection of extraterrestrial intelligence), and MCell
High-Throughput Applications
• The biggest master/slave parallel program in the world, with master = web site and slaves = individual computers (a work-queue sketch follows below)
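A hedged, minimal illustration of this master/slave structure, not the code of SETI@home or any real project: the master owns a queue of independent work units, and slaves pull units and return results whenever they have spare cycles. Threads stand in for volunteer machines, and the "computation" is a trivial placeholder.

```python
import queue
import threading

# Hypothetical master/slave sketch: the master owns a queue of independent
# work units; each slave pulls a unit, computes, and records the result.
work_units = queue.Queue()
results = {}
lock = threading.Lock()

def slave():
    while True:
        try:
            unit = work_units.get_nowait()
        except queue.Empty:
            return                      # no more work: slave goes idle
        answer = sum(unit)              # stand-in for the real computation
        with lock:
            results[tuple(unit)] = answer
        work_units.task_done()

# Master: enqueue work units, then start slaves (local threads standing in
# for volunteer machines polling a web-site master).
for unit in ([1, 2, 3], [4, 5], [6, 7, 8, 9]):
    work_units.put(unit)

slaves = [threading.Thread(target=slave) for _ in range(3)]
for t in slaves: t.start()
for t in slaves: t.join()
print(results)   # e.g. {(1, 2, 3): 6, (4, 5): 9, (6, 7, 8, 9): 30}
```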
High-Throughput Example – MCell
• MCell: a Monte Carlo simulation of cellular microphysiology
• The simulation is implemented as a large-scale parameter sweep
MCell
• MCell architecture: simulations are performed by independent processors with distinct parameter sets and shared input files
MCell Programming Issues
• How should we assign tasks to processors to optimize locality? (a sketch of one locality heuristic follows below)
• How can we use partial results during execution to steer the computation?
• How do we mine all the resulting data from the experiments for results
  – During execution
  – After execution
• How can we use all available resources?
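One plausible answer to the locality question, offered as a hypothetical sketch rather than MCell's actual scheduler: group parameter-sweep tasks by the shared input file they read and send each group to the same host, so each shared file is staged at most once per host. The task records, file names, and host names are all made up.

```python
from collections import defaultdict

# Hypothetical locality heuristic for a parameter sweep: tasks that share an
# input file are grouped and assigned to the same host, so each shared file
# is transferred to at most one host.
tasks = [
    {"id": 0, "params": {"dt": 1e-6}, "input": "membrane_A.mdl"},
    {"id": 1, "params": {"dt": 2e-6}, "input": "membrane_A.mdl"},
    {"id": 2, "params": {"dt": 1e-6}, "input": "membrane_B.mdl"},
    {"id": 3, "params": {"dt": 5e-6}, "input": "membrane_B.mdl"},
]
hosts = ["hostA", "hostB"]   # illustrative host names

def assign_by_input(tasks, hosts):
    """Map each shared input file to one host, round-robin over hosts."""
    groups = defaultdict(list)
    for t in tasks:
        groups[t["input"]].append(t)
    schedule = defaultdict(list)
    for i, (input_file, group) in enumerate(sorted(groups.items())):
        host = hosts[i % len(hosts)]
        schedule[host].append((input_file, [t["id"] for t in group]))
    return dict(schedule)

print(assign_by_input(tasks, hosts))
# {'hostA': [('membrane_A.mdl', [0, 1])], 'hostB': [('membrane_B.mdl', [2, 3])]}
```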
Data-Intensive Applications
• The focus is on synthesizing new information from large amounts of physically distributed data
• Examples include NILE (a distributed system for high-energy physics experiments using data from CLEO), SAR/SRB applications (a Grid version of MS TerraServer), and digital library applications
Data-Intensive Example – SARA
• SARA = Synthetic Aperture Radar Atlas, an application developed at JPL and SDSC
• Goal: assemble and process the files for the user's desired image
  – Radar data is organized into tracks
  – The user selects a track of interest and the properties to be highlighted
  – Raw data is filtered and converted to an image format
  – The image is displayed in a web browser
SARA Application Architecture
• The application structure is focused on optimizing the delivery and processing of distributed data
[Figure: client connected to compute servers and data servers]
• Computation servers and data servers are logical entities, not necessarily different nodes
SARA Programming Issues
• Which data server should replicated data be accessed from?
• Should computation be done at the data server, or should the data be moved to a compute server, or something in between? (a cost-model sketch follows below)
• How big are the data files and how often will they be accessed?
[Figure: candidate replica sites (OGI, UTK, UCSD), with AppLeS/NWS]
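As a hedged illustration of how such placement decisions might be made, and not the actual SARA or AppLeS code, the sketch below compares the estimated time to filter the data in place at each replica site against shipping the raw data to a faster compute server. The bandwidths and filter rates are made-up values of the kind a monitoring service such as NWS might supply.

```python
# Hypothetical cost model: filter at the data server vs. move raw data to a
# compute server. All rates below are illustrative assumptions.
replicas = {   # data server -> (bandwidth to compute server MB/s, local filter rate MB/s)
    "ucsd": (4.0, 25.0),
    "utk":  (1.5, 12.0),
    "ogi":  (0.8, 8.0),
}
file_size_mb = 500.0
compute_server_rate = 60.0   # MB/s filtering rate on the compute server (assumed)

def best_plan(replicas, size_mb):
    plans = []
    for site, (bw, local_rate) in replicas.items():
        # Plan A: filter in place at the data server, ship only the small image.
        plans.append((size_mb / local_rate, f"filter at {site}"))
        # Plan B: ship the raw data to the compute server, filter there.
        plans.append((size_mb / bw + size_mb / compute_server_rate,
                      f"move raw data from {site} to compute server"))
    return min(plans)            # (estimated seconds, description)

print(best_plan(replicas, file_size_mb))
# -> (20.0, 'filter at ucsd') with the assumed numbers above
```

With slow wide-area links, computing near the data usually wins; a faster link or a much faster compute server can flip the decision, which is why the file size and access frequency questions above matter.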
Tele-Immersion
• The focus is on the use of immersive virtual reality systems over a network
  – Combines generators, data sets, and simulations remote from the user's display environment
  – Often used to support collaboration
• Examples include
  – Interactive scientific visualization ("being there with the data"), industrial design, art and entertainment
Tele-Immersion Example – Combustion System Modeling
• A shared collaborative space
  – Links people at multiple locations (e.g., Chicago and San Diego)
  – Shares and steers scientific simulations on a supercomputer
• Combustion code developed by Lori Freitag at ANL
• The boiler application is used to troubleshoot and design better products
Early Experiences with Grid Computing
• Gigabit Testbeds Program
  – In the late '80s and early '90s, the gigabit testbed program was developed as a joint NSF, DARPA, and CNRI (Corporation for National Research Initiatives, Bob Kahn) initiative
  – Goals were to
    • investigate potential architectures for a gigabit/sec network testbed
    • explore its usefulness for end users
Gigabit Testbeds – Early '90s
• 6 testbeds formed:
  – CASA (southwest)
  – MAGIC (midwest)
  – BLANCA (midwest)
  – AURORA (northeast)
  – NECTAR (northeast)
  – VISTANET (southeast)
• Each had a unique blend of research in applications and in networking and computer science
Gigabit Testbeds

Testbed: Blanca
Sites: NCSA, UIUC, UCB, UWisc, AT&T
Hardware: Experimental ATM switches running over experimental 622 Mb/s and 45 Mb/s circuits developed by AT&T and the universities
Application focus: Virtual environments; remote visualization and steering; multimedia digital libraries
Remarks: Network spanned the US (UCB to AT&T). Network research included distributed virtual memory, real-time protocols, congestion control, signaling protocols, etc.

Testbed: Vistanet
Sites: MCNC, UNC, BellSouth
Hardware: ATM network at OC-12 (622 Mb/s) interconnecting HIPPI local area networks
Application focus: Radiation treatment planning applications involving a supercomputer, a remote instrument (radiation beam), and visualization
Remarks: Medical personnel planned radiation beam orientation using a supercomputer. Extended the planning process from 2 beams in 2 dimensions to multiple beams in 3 dimensions.

Testbed: Nectar
Sites: CMU, Bell Atlantic, Bellcore, PSC
Hardware: OC-48 (2.4 Gb/s) links between the PSC supercomputer facility and CMU
Application focus: Coupled supercomputers running chemical reaction dynamics; CS research
Remarks: Metropolitan area testbed with OC-48 links between PSC and the downtown CMU campus.
Gigabit Testbeds

Testbed: Aurora
Sites: MIT, IBM, Bellcore, Penn, MCI
Hardware: OC-12 network interconnecting 4 research sites and supporting the development of ATM host interfaces, ATM switches, and network protocols
Application focus: Telerobotics, distributed virtual memory, and operating system research
Remarks: East coast sites. Research focused mostly on network and computer science issues.

Testbed: Magic
Sites: Army Battle Lab, Sprint, UKansas, UMinn, LBL, Army HPC Lab
Hardware: OC-12 network to interconnect ATM-attached hosts
Application focus: Remote vehicle control applications and high-speed access to databases for terrain visualization and battle simulation
Remarks: Funded separately by DARPA after the CNRI initiative had begun.

Testbed: Casa
Sites: Caltech, SDSC, LANL, JPL, MCI, US West, PacBell
Hardware: HIPPI switches connected by HIPPI-over-SONET at OC-12
Application focus: Distributed supercomputing
Remarks: Targeted improving the performance of distributed supercomputing applications by strategically mapping application components onto resources.
I-WAY
• The first large-scale "modern" Grid experiment
• Put together for SC'95 (the "Supercomputing" conference)
• The I-WAY consisted of a Grid of 17 sites connected by the vBNS
• Over 60 applications ran on the I-WAY during SC'95
I-WAY "Architecture"
• Each I-WAY site was served by an I-POP (I-WAY Point of Presence) used for
  – authentication of distributed applications
  – distribution of associated libraries and other software
  – monitoring the connectivity of the I-WAY virtual network
• Users could use single authentication and job submission across multiple sites, or they could work directly with the individual sites
• Scheduling was done with a "human in the loop"
I-Soft – Software for the I-WAY
• Kerberos-based authentication
  – The I-POP initiated rsh to local resources
• AFS for distribution of software and state
• Central scheduler
  – Dedicated I-WAY nodes on each resource
  – Interface to the local scheduler
• Nexus-based communication libraries
  – MPI, CAVEcomm, CC++
• In many ways, the I-WAY experience formed the foundation of Globus
I-WAY Application: Cloud Detection
• Cloud detection from multimodal satellite data
  – Want to determine whether a satellite image is clear, partially cloudy, or completely cloudy
• Used a remote supercomputer to enhance the instruments with
  – Real-time response
  – Enhanced function and accuracy (of the pixel image)
• Developed by C. Lee (Aerospace Corporation), Kesselman (Caltech), et al.
PACIs
• 2 NSF Supercomputer Centers (PACIs): SDSC/NPACI and NCSA/Alliance, both committed to Grid computing
• vBNS backbone between NCSA and SDSC running at OC-12, with connectivity to over 100 locations at speeds ranging from 45 Mb/s to 155 Mb/s or more
PACI Grid
[Figure: map of the PACI Grid]
NPACI Grid Activities
• The Metasystems Thrust Area is one of the NPACI technology thrust areas
  – The goal is to create an operational metasystem for NPACI
• Metasystems players:
  – Globus (Kesselman)
  – Legion (Grimshaw)
  – AppLeS (Berman and Wolski)
  – Network Weather Service (Wolski)
Alliance Grid Activities
• The Grid Task Force and the Distributed Computing team are Alliance teams
• Globus is supported as the exclusive grid infrastructure by the Alliance
• The Grid concept is pervasive throughout the Alliance
  – The Access Grid was developed for use by distributed collaborative groups
• Alliance grid players include Foster (Globus), Livny (Condor), Stevens (ANL), Reed (Pablo), etc.
Other Efforts
• Centurion Cluster = Legion testbed
  – Legion cluster housed at UVA
  – 128 533-MHz DEC Alphas
  – 128 dual 400-MHz Pentium IIs
  – Fast Ethernet and Myrinet
• Globus testbed = GUSTO, which supports Globus infrastructure and application development
  – 125 sites in 23 countries as of 2/2000
  – Testbed aggregated from partner sites (including NPACI)
GUSTO (Globus) Computational Grid
[Figure: map of the GUSTO testbed]
IPG
• IPG = Information Power Grid
• A NASA effort in grid computing
• Globus is supported as the underlying infrastructure
• Application foci include aerospace design and environmental and space applications
Research and Development Foci for the Grid
• Applications
  – Questions revolve around the design and development of "Grid-aware" applications
  – Different programming models: polyalgorithms, components, mixed languages, etc.
  – Program development environments and tools are required for the development and execution of performance-efficient applications
[Layer diagram: Applications / Middleware / Infrastructure / Resources]
Research and Development Foci for the Grid
• Middleware
  – Questions revolve around the development of tools and environments that facilitate application performance
  – Software must be able to assess and utilize the dynamic performance characteristics of resources to support the application (a resource-selection sketch follows below)
  – Agent-based computing and resource negotiation
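To give the flavor of how middleware might use dynamic performance information, here is a hedged sketch, not Globus, AppLeS, or NWS code: candidate hosts are ranked by an estimated completion time that combines current CPU availability and bandwidth to the input data. The host names and measurement values are made up; in practice they might come from a monitoring service such as the Network Weather Service.

```python
# Hypothetical middleware-style resource selection: rank hosts by estimated
# completion time using dynamic measurements (all values are illustrative).
measurements = {
    # host: (available CPU fraction, peak MFLOP/s, bandwidth to input data MB/s)
    "alpha.site-a.edu": (0.35, 800.0, 6.0),
    "sp2.site-b.edu":   (0.80, 500.0, 2.5),
    "pc.site-c.edu":    (0.95, 300.0, 1.0),
}

def estimated_time(work_mflop, input_mb, cpu_frac, peak_mflops, bw_mbs):
    """Crude model: data-transfer time plus compute time on available cycles."""
    return input_mb / bw_mbs + work_mflop / (cpu_frac * peak_mflops)

def select_resource(work_mflop, input_mb, measurements):
    ranked = sorted(
        (estimated_time(work_mflop, input_mb, *m), host)
        for host, m in measurements.items()
    )
    return ranked[0]   # (estimated seconds, best host)

print(select_resource(work_mflop=50_000.0, input_mb=200.0,
                      measurements=measurements))
# -> (205.0, 'sp2.site-b.edu') with the assumed numbers above
```

Because the measurements change over time, such a selection would be re-evaluated at schedule time rather than fixed in advance, which is exactly the "dynamic" aspect the bullet above refers to.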
Research and Development Foci for the Grid
• Infrastructure
  – Development of infrastructure that presents a "virtual machine" view of the Grid to users
  – Questions revolve around providing basic services to the user (security, remote file transfer, resource management, etc.), as well as exposing performance characteristics
  – Services must be supported across heterogeneous resources and must interoperate
Research and Development Foci for the Grid
• Resources
  – Questions revolve around heterogeneity and scale
  – New challenges focus on combining wireless and wired, static and dynamic, low-power and high-power, cheap and expensive resources
  – The performance characteristics of grid resources vary dramatically; integrating them to support the performance of individual and multiple applications is extremely challenging