
Grid and High-Performance Computing in Israel: An Overview. Guy Tel-Zur, NRCN (tel-zur@computer.org). HPC 2006, Cetraro, Italy, July 5, 2006
Outline • Academia (IUCC, IAG) • Emphasis on the BGU activity • Industry & Trade (IGT) • Future plans
An overview of Grid and HPC in Israel. IUCC – The Inter-University Computation Center
Network Infrastructure - I
Network Infrastructure - II
Network Infrastructure - III • IUCC/ILAN, IIX, Med-1 total traffic • http://noc.ilan.net.il/stats/ILAN-GP0/linktogeant-petach-tikva-gp.html
The Israel Academic Grid (IAG) • http://iag.iucc.ac.il/ • Funded by the MOST • Steering & Technical Committees • Coordinates the Israeli activity in EGEE • IUCC is the CA for the IAG
EGEE • 4 GOCs (certified EGEE sites): Technion, TAU, WIS, OpenU – LCG-2, now moving to gLite • BGU is next – gLite 3.0 being installed these days
Grid Computing at the Technion – Israel Institute of Technology • Distributed Systems Laboratory • Prof. Assaf Schuster – Head • Projects: GMS, SuperLink Online, The Dependable Grid, EGEE, and more
GMS – Grid Monitoring System [Noam Palatin, Assaf Schuster, Ran Wolff. Forthcoming SIGKDD 2006, August, Chicago] • Distributively store all logs of a large batch system in local databases • Apply distributed data mining to the logs • Implementation using Condor • Taken up by the Intel NetBatch team, which started a $3M project
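The key point is that raw logs never leave the machines that produced them; only compact summaries are combined. A minimal sketch of that pattern (not the actual GMS algorithm; the node names, runtime statistics and z-score cut-off are all invented for the illustration):

```python
# Sketch only: each node mines its *local* batch-system log; only summary
# statistics travel, so raw logs stay where they were produced.
from statistics import mean, stdev

def local_summary(runtimes):
    """Per-node summary, computed where the log lives."""
    return {"n": len(runtimes), "mean": mean(runtimes), "std": stdev(runtimes)}

def find_outlier_nodes(summaries, z_cut=3.0):
    """Central step: flag machines whose mean job runtime deviates
    strongly from the population of per-node means."""
    means = [s["mean"] for s in summaries.values()]
    mu, sigma = mean(means), stdev(means)
    return [node for node, s in summaries.items()
            if sigma > 0 and abs(s["mean"] - mu) / sigma > z_cut]

# Toy data: node "wn07" is misconfigured and runs jobs ~10x slower.
logs = {"wn01": [62, 58, 61], "wn02": [59, 63, 60], "wn07": [610, 590, 605]}
summaries = {n: local_summary(r) for n, r in logs.items()}
print(find_outlier_nodes(summaries, z_cut=1.0))   # -> ['wn07']
```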
SuperLink Online [Danny Geiger, Miron Livny, Assaf Schuster, Mark Zilberstein. American Journal of Human Genetics, May 2006. HPDC, June 2006. Science NetWatch, May 2006.] • http://bioinfo.cs.technion.ac.il/superlink-online/ – a production portal for geneticists working at hospitals • Submitted tasks contain gene-mapping results from lab experiments • The portal user sees a single computer (!) • Implemented using a hierarchy of Condor pools – highest/smallest pool at the Technion (DSL), lowest/largest in Madison (GLOW) • In progress: linkage@home and EGEE BioMed implementations • Many success stories – genes causing various syndromes have been hunted down: http://bioinfo.cs.technion.ac.il/superlink-online/tips/papers.shtml
The Dependable Grid [Gabi Kliot, Miron Livny, Assaf Schuster, Artyom Sharov, Mark Zilberstein. HPDC hot topics, June 2006.] • Provides a High Availability (HA) library as a service for any Grid component • Decorative approach – no need to change the component (!) • Production-quality implementation following Condor's strict development standards • HA for the Condor matchmaker with zero LOC changes (!!!) • Part of the Condor 6.8 distribution • Deployed in many large Condor production pools • Plans to develop and support an open-source distribution
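The "decorative" approach can be pictured as a wrapper added around an unmodified component. A hedged Python sketch with hypothetical replica endpoints and retry policy (the real library wraps Condor daemons, not Python functions):

```python
# Sketch of the "decorative" HA idea: the component is wrapped from the
# outside and never modified. Endpoints and policy are hypothetical.
import time

def highly_available(replicas, retries_per_replica=2, backoff_s=0.5):
    """Decorator: try the active replica, fail over to the next on error."""
    def wrap(call):
        def invoke(*args, **kwargs):
            last_error = None
            for replica in replicas:                 # fixed failover order
                for _ in range(retries_per_replica):
                    try:
                        return call(replica, *args, **kwargs)
                    except ConnectionError as e:     # component unchanged
                        last_error = e
                        time.sleep(backoff_s)
            raise last_error
        return invoke
    return wrap

@highly_available(replicas=["mm-primary:9618", "mm-backup:9618"])
def match_jobs(endpoint, job):      # stands in for the matchmaker client
    print(f"matching {job} via {endpoint}")

match_jobs("job_42")
```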
An EGEE Certified Node [Dedi Carmeli, Max Kovgan, Assaf Schuster] • A 200-CPU Condor pool exposed to EGEE as a single resource • Resources are non-dedicated • Configuration of local priorities: local jobs preempt EGEE jobs • Works behind several firewalls, in a dedicated zone (isolation, security, privacy)
The Hebrew University: MOSIX. Prof. Amnon Barak. MOSIX is a cluster and organizational-grid management system. Targeted for: • x86-based Linux clusters • Organizations that wish to link several such clusters • Service centers – for dynamic partitioning of a cluster among users. The grid model: a federation of Linux clusters, servers and workstations whose owners trust each other and wish to cooperate from time to time. Main goal: automatic management. Geared for: HPC
Main features. Process migration and supervising algorithms for: – Automatic resource management and resource discovery – Grid-wide resource sharing – Adaptive workload distribution and load balancing – Flexible (dynamic) partitioning of nodes among users – Preserving running processes in disruptive configurations • Other services: batch jobs, checkpoint & recovery, live queuing and an on-line monitor of the grid and of each cluster • Outcome: the grid and each cluster perform like a single computer with multiple processors • An organizational grid – made possible by trust
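As a rough illustration of the adaptive load-balancing goal (MOSIX makes these decisions preemptively inside the kernel; this toy user-level sketch only shows the decision rule of moving work from the most to the least loaded node):

```python
# Toy sketch, not MOSIX itself: repeatedly migrate one process from the
# most loaded node to the least loaded one until the pool is balanced.
def balance(loads, threshold=1):
    """loads: node -> number of runnable processes. Mutates toward balance."""
    moves = []
    while True:
        hi = max(loads, key=loads.get)
        lo = min(loads, key=loads.get)
        if loads[hi] - loads[lo] <= threshold:
            return moves
        loads[hi] -= 1          # one preemptive process migration
        loads[lo] += 1
        moves.append((hi, lo))

cluster = {"n1": 9, "n2": 2, "n3": 4}
print(balance(cluster))  # three migrations to n2, one to n3
print(cluster)           # -> {'n1': 5, 'n2': 5, 'n3': 5}
```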
The core technology • Preemptive process migration, e.g. for load balancing or to evacuate guest processes from a disconnecting cluster • The user sees a Single-System Image (SSI) • No need to change applications, copy files to remote nodes or link applications with any library • Provides a secure run-time environment (sandbox) • Guest processes cannot modify resources on hosting nodes
The Hebrew University Organizational Grid • 12 MOSIX clusters, ~300 nodes, 2 campuses – in life sciences, the medical school, chemistry and computer science • Target: 2000 nodes • Applications: nano-technology, molecular dynamics, protein folding, genomics (BLAT, SW), meteorological weather forecasting (MM5), Navier-Stokes equations and turbulence (CFD), CPU simulation of new hardware designs (SimpleScalar) • Some users are interested in “pay-per-use” instead of cluster ownership • More information at http://www.MOSIX.org
Tel-Aviv University – School of Computer Science: Grid projects & local clusters • Condor pool (average 150 nodes, peak 300 nodes) – an opportunistic compute cluster used for bioinformatics, network simulations and classical HPC (MPI, neural networks, Monte Carlo, fluid dynamics, etc.) • PlanetLab – mainly infrastructure research, led by Princeton; very small compute power (4–20 nodes at each site) • EGEE-II – 20 nodes (soon ~100), used for physics, bioinformatics and general research
Ben-Gurion University of the Negev • Inter-campus Condor pool • Grid computing
The BGU Condor Pool • Started in 2000 • Today: 150+ nodes • Linux & Windows (2000, XP) • Campus-wide project • Non-dedicated resources (next slide)
Campus-Wide Condor Pool • ECE Dept. • IE&M Dept. • Nucl. Eng. Dept. • Public labs • Soon to be connected: CS Dept., Physics Dept.
• Currently there are 4 science projects
Condor at the BGU • Nucl. Eng. Dept., Itzhak Orion – MCNP simulations • CS Dept., Chen Keasar – protein structure prediction • Physics Dept., Yigal Meir – solid state • IE&M, O. Levi, J. Miao & G. Zaslavsky – 3D image reconstruction
I. Orion: MCNP and Condor • 48 hours for a single job of 2×10⁹ histories on a single CPU • One-layer imaging, at the desired resolution, requires 50 jobs – 100 days on a single CPU!!! • OS: Windows • Status: initial tests completed
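A hedged sketch of how such a layer could be farmed out: generate the 50 independent MCNP input decks, each with its own random seed, so Condor can run them concurrently. The deck template and the seed card are simplified placeholders, not a tested MCNP input:

```python
# Illustration only: 50 independent jobs for one imaging layer, each
# 2e9 histories with a distinct odd random seed, ready for "queue 50".
N_JOBS = 50
HISTORIES = 2_000_000_000          # ~48 h each on one 2006-era CPU

template = """layer imaging, job {job}
c ... geometry, source and tally cards elided ...
nps {nps}
dbcn {seed}
"""

for job in range(N_JOBS):
    with open(f"inp_{job:02d}", "w") as f:
        f.write(template.format(job=job, nps=HISTORIES, seed=2 * job + 1))

# Serially: 50 jobs x 48 h = 2400 CPU-hours, i.e. the 100 days above.
# Scavenging ~50 Windows nodes cuts the wall time to roughly 48 h.
```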
C. Keasar: Protein Folding • MESHI is a software package for protein modeling, written in Java. Ref: Kalisman N., Levi A., Maximova T., Reshef D., Zafriri-Lynn S., Gleyzer Y. and Keasar C. (2005) MESHI: a new library of Java classes for molecular modeling. Bioinformatics 21:3931-3932
Y. Meir: The Kondo Effect • “Investigate the dependence of the conductance and of the current through the quantum dot on temperature and, in particular, on its relation to the Kondo temperature” • “…The plan is to run the program for many sets of parameters of the quantum dot, giving rise to different Kondo temperatures, and for different temperatures. This will allow us to determine whether the physical properties of the system depend only on the ratio of these temperatures, and how.”
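A toy version of that test, with the widely used empirical interpolation G(T) = G0/(1 + (2^{1/s} − 1)(T/T_K)²)^s, s ≈ 0.22, standing in for the group's actual transport code; if the physics depends only on T/T_K, sweeps with different T_K must collapse onto a single curve (for this stand-in form they do by construction):

```python
# Toy parameter sweep: compute G for several Kondo temperatures TK and
# check that G depends only on the ratio T/TK, as the slide asks.
def conductance(T, TK, G0=1.0, s=0.22):
    return G0 / (1.0 + (2 ** (1 / s) - 1) * (T / TK) ** 2) ** s

for TK in (0.5, 1.0, 4.0):              # parameter sets -> different TK
    for ratio in (0.1, 1.0, 10.0):      # sample the same T/TK points
        print(f"TK={TK} T/TK={ratio} G={conductance(ratio * TK, TK):.4f}")
# Every TK prints the same three G values: here G is a function of T/TK
# alone (and G(T=TK) = G0/2 by construction of the empirical form).
```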
3D image reconstruction from equally sloped projections by parallel computing. Ofer Levi¹, John Miao² and Genady Zaslavsky¹ – ¹Department of Industrial Engineering and Management, Ben-Gurion University, Israel; ²Department of Physics and Astronomy and California NanoSystems Institute, University of California, USA • Equally sloped tomography of a 3D object from a number of 2D projections is an efficient analysis method for the accurate determination of intramolecular structures (Miao, Foerster and Levi 2005) • However, true-size data analysis by this method is highly time-consuming and cannot be done on a single workstation
Parallel Implementation • A new parallel-friendly method was developed and implemented • The parallel computations were managed by the Condor environment • The use of Condor and parallel computing enabled successful reconstruction of complex real data in a reasonable time – a task that was impossible before
Time Analysis • A typical computation of a medium-size 3D image (256³ voxels) takes approximately one month on a single 3 GHz machine • With Condor's help we reduced the total runtime to 4 days, with an average of 30 machines in use during the computation (an ideal 30-fold speed-up would give about one day; the gap reflects the opportunistic, non-dedicated nature of the pool) • The goal is to process as many 3D images as possible on a regular basis
Submitting Jobs
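The content of this slide did not survive the export; as a hedged stand-in, here is how a job reaches such a pool: a minimal vanilla-universe submit description handed to condor_submit (all file names and the requirements expression are illustrative):

```python
# Hypothetical stand-in for the "Submitting Jobs" slide: write a minimal
# Condor submit description and hand it to condor_submit.
import subprocess

SUBMIT = """universe   = vanilla
executable = run_model.sh
arguments  = $(Process)
output     = job_$(Process).out
error      = job_$(Process).err
log        = jobs.log
requirements = (OpSys == "LINUX") || (OpSys == "WINNT51")
queue 50
"""

with open("jobs.sub", "w") as f:
    f.write(SUBMIT)
subprocess.run(["condor_submit", "jobs.sub"], check=True)
```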
Grid Computing at the BGU • Currently: 6 nodes (12 processors) • gLite 3.0 is being installed these days • We expect to get more new computers from a tender that will take place within the next few days. Grid Computing at NRCN • A small Condor pool • Plan to operate a small Grid site (~40 processors) – IAG, IGT, EGEE-II…
Ganglia monitoring: http://grid4.bgu.ac.il/ganglia
The Israeli Association of Grid Technologies (IGT) • A non-profit association • Supported in part by the Ministry of Industry, Trade & Labor (MI T&L)
IGT Achievements • Founded May 1st, 2004 with 5 members • Today: 30 members • 16 conferences and more than 20 overseas speakers • “Grid in the Financial Sector” event with more than 130 people • Annual conference/expo with 200 people • Work groups: Grid-SOA, Grid-HPC, Virtualization, RDMA • Enterprise workshops • www.Grid.org.il • Knowledge center • Virtual community web site • IGT Grid Lab • International cooperation: Europe, USA • Grid Award contest
Members
Israeli Grid R&D • Fast networks: Mellanox – InfiniBand; Voltaire – InfiniBand • Software infrastructure: GigaSpaces – Grid application server; Xeround – networked database; Exanet – distributed file-system virtualization; BMC – data-center virtualization management; IBM Haifa – file/storage virtualization • Grid-based solutions: Elbit – management & control systems; Rafael – management & control systems; Xoreax – Grid-based software builds; SunGard – financial broker • Other: Shunra – Grid WAN simulations; Symantec (Precise) – performance management; IAI/Elta – internal Grid systems for engineering simulations; Intel Israel – NetBatch
IGT Grid Lab • Virtual Grid Lab – secured resource sharing over an Internet-based VPN, with Grid Lab management linking CPUs at IGT member sites (diagram)
HPC in the Industry, June 2006
(My) Near-Future Plans at BGU • Campus-wide Condor pool • More scientific projects • A web portal for submission of Condor jobs • An operational EGEE-II site • A joint Singapore–Israel Grid project (flocking between our Condor pools) • We are thinking about opening a new and unique course of study in “Grid Computing”!
IsraGrid • Similar to other national grid initiatives • We want to build a 1 Gb/s network for the IGT member organizations • RFI • Pending approval as a national project
Next Events in Israel • http://gccb2006.ulster.ac.uk/-Introduction-.html • The 2nd IGT annual event, December 2006
Summary • From the infrastructure point of view we are only at the beginning of the road – we have O(zero) government funding • But there is a lot of technological capability – many new projects in industry and academia
Thank You! References: • Condor at the BGU: http://www.ee.bgu.ac.il/~tel-zur/condor/ • “An Introduction to Parallel Processing” course at the BGU: http://www.ee.bgu.ac.il/~tel-zur/2006A/Welcome.html • Grid Computing at the BGU: http://www.ee.bgu.ac.il/~tel-zur/grid.html • IGT: http://www.grid.org.il
The Technion
GMS – Grid Monitoring System • “Detection and prediction of critical situations, errors or faulty results” • Collects, organizes and stores system-status data • Translates the data into semantically meaningful terms • Analyzes the resulting distributed dataset using suitable algorithms
GMS – Grid Monitoring System • Tested on a 100-CPU Condor pool • Employed a novel distributed outlier-detection algorithm • Piggybacks on Condor job-execution mechanisms • Detected three misconfigured machines • System and results reported in “Mining for Misconfigured Machines in Grid Systems”, accepted to ACM SIGKDD ’06 • The system is being packaged for open-source release • To be deployed on GLOW, Madison (x,000 machines)
SuperLink Online [Danny Geiger, Miron Livny, Assaf Schuster, Mark Zilberstein. American Journal of Human Genetics, May 2006. HPDC, June 2006. Science NetWatch, May 2006.] • http://bioinfo.cs.technion.ac.il/superlink-online/ – a production portal for geneticists working at hospitals • Submitted tasks contain gene-mapping results from lab experiments • The system performs linkage analysis using powerful Bayesian-network manipulation • Automatic and adaptive parallelization – the execution hierarchy • Turnaround times for a mixed workload (exponential distribution): short jobs – seconds on a single machine; large jobs – days on thousands of machines • The portal user sees a single computer (!) • Implemented using a hierarchy of Condor pools – highest/smallest pool at the Technion (DSL), lowest/largest in Madison (GLOW) • In progress: linkage@home and EGEE BioMed implementations • Many success stories – genes causing various syndromes have been hunted down: http://bioinfo.cs.technion.ac.il/superlink-online/tips/papers.shtml
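A sketch of the execution-hierarchy idea only (not SuperLink's actual scheduler): estimate a task's cost and place it on the smallest tier that can finish it within a deadline, escalating to larger but more distant pools otherwise. Tier names and the cost model are invented:

```python
# Toy tier-selection rule for a hierarchy of Condor pools, ordered from
# smallest/closest to largest/farthest. Not the production scheduler.
TIERS = [
    ("technion-dsl", 20),      # (pool name, usable CPUs) - illustrative
    ("campus-condor", 200),
    ("glow-madison", 2000),
]

def choose_tier(estimated_cpu_hours, deadline_hours=24):
    for name, cpus in TIERS:
        if estimated_cpu_hours / cpus <= deadline_hours:
            return name
    return TIERS[-1][0]        # largest pool is the last resort

print(choose_tier(0.01))       # seconds of work  -> 'technion-dsl'
print(choose_tier(30_000))     # days of work     -> 'glow-madison'
```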
The Dependable Grid [Gabi Kliot, Miron Livny, Assaf Schuster, Artyom Sharov, Mark Zilberstein. HPDC hot topics, June 2006.] • Provides a High Availability (HA) library as a service for any Grid component • Decorative approach – no need to change the component (!) • Decouples election from replication of state • A general approach for any consistency guarantees of state replication (in progress) • Production-quality implementation following Condor's strict development standards • HA for the Condor matchmaker with zero LOC changes (!!!) • Part of the Condor 6.8 distribution • Deployed in many large Condor production pools • Plans to develop and support an open-source distribution
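One way to picture "decoupling election from replication": the election half alone can be as simple as letting the lowest-ID replica with a fresh heartbeat lead, while state replication runs independently underneath. A toy sketch, not the Condor HA daemon's real protocol:

```python
# Toy leader election among matchmaker replicas: among replicas whose
# heartbeat is fresh, the lowest ID wins deterministically.
import time

HEARTBEAT_TIMEOUT = 5.0        # seconds without a beat => presumed dead

def elect_leader(last_beat, now=None):
    """last_beat: replica_id -> timestamp of its most recent heartbeat."""
    now = now if now is not None else time.time()
    alive = [r for r, t in last_beat.items() if now - t < HEARTBEAT_TIMEOUT]
    return min(alive) if alive else None

now = time.time()
beats = {"mm1": now - 1.0, "mm2": now - 0.5, "mm3": now - 60.0}  # mm3 dead
print(elect_leader(beats, now))            # -> 'mm1'
# If mm1 stops beating, the next election deterministically picks mm2,
# and the independently replicated state lets mm2 take over transparently.
```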
An EGEE Certified Node [Dedi Carmeli, Max Kovgan, Assaf Schuster] • A 200-CPU Condor pool exposed to EGEE as a single resource • Resources are non-dedicated • Configuration of local priorities: local jobs preempt EGEE jobs • Works behind several firewalls, in a dedicated zone (isolation, security, privacy) • Locally executed tasks cannot launch an LCG-based attack against Technion resources or others
DataMiningGrid [Valentin Kravchov, Assaf Schuster. Five EU partners. Several publications in 2006/7.] • EC FP6 STREP, 1.84 M€, 2004–2006. http://www.datamininggrid.org/ • Tools and services for deploying data-mining applications on the grid • Main achievement: a generic platform for enterprise data mining using the Triana workflow editor, GT4, Weka and others • Parallel/distributed mining algorithms
QosCosGrid [David Carmeli, Valentin Kravchov, Assaf Schuster, Benny Yoshpa. 11 EU partners + 1 Australian.] • EC FP6 STREP, 3.8 M€ • Sep. 2006 – 2009 • Quasi-Opportunistic Supercomputing for Complex Systems in Grid • 5 × 1,000-machine testbed (“Grid5000”)
Scalable Data Mining [Amir Bar-Or, Danny Keren, Denis Krivitski, Liran Liss, Tsachi Scharfmann, Assaf Schuster, Ran Wolff] • Scalable data mining, data-centric distributed algorithms • Large-scale distributed systems • Monitoring distributed data streams • “Infinite” scalability – local algorithms, performance independent of system size • Technion-patented paradigms • Publications 2003–2006 [ICDM, CCGRID, HPDC, SIGKDD, SIGMOD, SDM, DISC, TKDE, others]
HUJI
Example: support for disruptive configurations. When a cluster is disconnected: • All guest processes move out – to available grid nodes or to the home cluster • All local processes that were migrated away move back – long-running processes are preserved • Returning processes are automatically frozen in their home clusters (memory image stored on local disks) – no overloading of nodes in the home cluster; frozen processes are reactivated gradually • Example: hibernate during the day, run at night
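A toy rendering of the freeze-and-thaw policy above: returning processes are parked on disk and reactivated a few at a time as CPUs free up, so the home cluster is never overloaded. Batch size and the capacity model are invented for the illustration:

```python
# Toy gradual-reactivation policy, not MOSIX's actual mechanism.
def thaw_gradually(frozen, free_cpus, batch=2):
    """Yield batches of frozen process IDs to reactivate, never exceeding
    the currently free CPUs."""
    while frozen and free_cpus > 0:
        n = min(batch, free_cpus, len(frozen))
        running, frozen = frozen[:n], frozen[n:]
        free_cpus -= n
        yield running

for group in thaw_gradually(frozen=[101, 102, 103, 104, 105], free_cpus=3):
    print("reactivating", group)   # -> [101, 102] then [103]
```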
TAU
Research areas that use HPC • Bioinformatics – docking, phylogenetics, artificial life and evolutionary computation, systems biology • Language processing • Numerical analysis • http://www.cs.tau.ac.il/research.html