

  • Number of slides: 23

Purdue Campus Grid
Preston Smith (psmith@purdue.edu)
Condor Week 2006, April 24, 2006

Overview
• RCAC
  – Community Clusters
• Grids at Purdue
  – Campus
  – Regional
    • NWICG
  – National
    • OSG
    • CMS Tier-2
    • NanoHUB
    • TeraGrid
• Future Work

Purdue’s RCAC
• Rosen Center for Advanced Computing
  – Division of Information Technology at Purdue (ITaP)
  – Wide variety of systems: shared memory and clusters
    • 352-CPU IBM SP
    • Five 24-processor Sun F6800s, two 56-processor Sun E10Ks
    • Five Linux clusters

Linux clusters in RCAC
• Recycled clusters
  – Systems retired from student labs
  – Nearly 1000 nodes of single-CPU PIII, P4, and 2-CPU Athlon MP and EM64T Xeons for general use by Purdue researchers

Community Clusters
• Federate resources at a low level
• Separate researchers buy sets of nodes to federate into larger clusters
  – Enables larger clusters than a scientist could support on his own
  – Leverages central staff and infrastructure
    • No need to sacrifice a grad student to be a sysadmin!

Community Clusters
Macbeth
  § 126 nodes, dual Opteron (~1 Tflops)
  § 1.8 GHz
  § 4-16 GB RAM
  § InfiniBand; GigE for IP traffic
  § 7 owners (ME, Biology, HEP Theory)
Lear
  § 512 nodes, dual 64-bit Xeon (6.4 Tflops)
  § 3.2 GHz
  § 4 GB and 6 GB RAM
  § GigE
  § 6 owners (EE x2, CMS, Provost, VPR, TeraGrid)
Hamlet
  § 308 nodes, dual Xeon (3.6 Tflops)
  § 3.06 GHz to 3.2 GHz
  § 2 GB and 4 GB RAM
  § GigE, InfiniBand
  § 5 owners (EAS, BIO x2, CMS, EE)

Community Clusters
• Primarily scheduled with PBS
  – Contributing researchers are assigned a queue that can run as many “slots” as they have contributed.
• Condor co-schedules alongside PBS
  – When PBS is not running a job, a node is fair game for Condor!
    • But Condor work is subject to preemption if PBS assigns work to the node.
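The gating works through a custom PBSRunning attribute in each node's Condor configuration, which the PBS prologue and epilogue flip; a minimal preview of the startd-side expression is below (the full prologue and epilogue scripts appear on the final slides):

  # condor_config.local on a community-cluster node
  # PBSRunning is set to True/False by the PBS prologue/epilogue
  PBSRunning = False
  START = $(START) && ( $(PBSRunning) == False )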

Condor on Community Clusters
• All in all, Condor joins together four clusters (~2500 CPUs) within RCAC.

Grids at Purdue - Campus
• Instructional computing group manages a 1300-node Windows Condor pool to support instruction.
  – Mostly used by computer graphics classes for rendering animations
    • Maya, etc.
  – Work in progress to connect the Windows pool with RCAC pools.

Grids at Purdue - Campus
• Condor pools around campus
  – Physics department: 100 nodes, flocked
  – Envision Center: 48 nodes, flocked
• Potential collaborations
  – Libraries: ~200 nodes on Windows terminals
  – Colleges of Engineering: 400 nodes in existing pool
  – Or any department interested in sharing cycles!
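Flocking a departmental pool to RCAC is ordinary Condor configuration on both sides; a minimal sketch is below, with the flocking direction and all hostnames invented for illustration:

  # On a departmental submit machine (hostname hypothetical)
  FLOCK_TO = rcac-cm.rcac.purdue.edu

  # On the RCAC central manager (hostnames hypothetical)
  FLOCK_FROM = submit.physics.purdue.edu, submit.envision.purdue.edu
  # Execute nodes must also grant WRITE access to the flocked submit hosts
  HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)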

Grids at Purdue - Regional
• Northwest Indiana Computational Grid
  – Purdue West Lafayette
  – Purdue Calumet
  – Notre Dame
  – Argonne Labs
• Condor pools available to NWICG today.
• Partnership with OSG?

Open Science Grid
• Purdue active in Open Science Grid
  – CMS Tier-2 Center
  – NanoHUB
  – OSG/TeraGrid interoperability
• Campus Condor pools accessible to OSG
  – Condor is used for access to extra, non-dedicated cycles for CMS and is becoming the preferred interface for non-CMS VOs.
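For OSG users, jobs typically reach the site through its Globus gatekeeper and are handed to Condor underneath; a minimal Condor-G submit description illustrating that path is sketched below (the gatekeeper hostname and file names are hypothetical):

  # submit.condor: Condor-G job forwarded to the site's Condor pool
  universe      = grid
  grid_resource = gt2 osg-gw.rcac.purdue.edu/jobmanager-condor
  executable    = analysis.sh
  output        = job.out
  error         = job.err
  log           = job.log
  queue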

CMS Tier-2 - Condor
• MC production from UW-HEP ran this spring on RCAC Condor pools.
  – Processed roughly 23% of the entire production.
  – High rates of preemption, but that’s expected!
• 2006 will see the addition of dedicated Condor worker nodes to the Tier-2, in addition to the PBS clusters.
  – Condor running on resilient dCache nodes.

NanoHUB
[Architecture diagram: the nanoHUB science gateway with workspaces and research apps, the nanoHUB VO and Grid VM middleware, and virtual backends (a virtual cluster with VIOLIN), drawing capacity computing from campus grids (Purdue, GLOW) and capability computing.]

TeraGrid
• TeraGrid Resource Provider
• Resources offered to TeraGrid
  – Lear cluster
  – Condor pools
  – Data collections

TeraGrid
• Two current projects active in Condor pools via TeraGrid allocations
  – Database of Hypothetical Zeolite Structures
  – CDF Electroweak MC Simulation
    • Condor-G glide-in
    • Great exercise in OSG/TG interoperability
• Identifying other potential users

TeraGrid
• TeraDRE - Distributed Rendering on the TeraGrid
  – Globus, Condor, and IBRIX FusionFS enable Purdue’s TeraGrid site to serve as a render farm
    • Maya and other renderers available
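A render-farm workload like this maps naturally onto a vanilla-universe Condor submission with one job per frame; a minimal sketch is below (the wrapper script and scene file are invented for illustration):

  # render.submit: one Condor job per frame (file names hypothetical)
  universe                = vanilla
  executable              = render_frame.sh
  arguments               = scene.mb $(Process)
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = scene.mb
  output                  = frame_$(Process).out
  error                   = frame_$(Process).err
  log                     = render.log
  queue 100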

Grid Interoperability: “Lear”

Grid Interoperability
• Tier-2 to Tier-2 connectivity via dedicated TeraGrid WAN (UCSD -> Purdue)
• Aggregating resources at a low level makes interoperability easier!
  – OSG stack available to TG users and vice versa
• “Bouncer” Globus job forwarder

Future of Condor at Purdue
• Add resources
  – Continue growth around campus
    • RCAC
    • Other departments
• Add Condor capabilities to resources
  – TeraGrid data portal adding on-demand processing with Condor now
• Federation
  – Aggregate Condor pools with other institutions?

Condor at Purdue
• Questions?

PBS/Condor Interaction

PBS Prologue:

  # Prevent new Condor jobs and push any existing ones off
  /opt/condor/bin/condor_config_val -rset -startd PBSRunning=True > /dev/null
  # Tell the startd to pick up the new PBSRunning value
  /opt/condor/sbin/condor_reconfig -startd > /dev/null
  # If Condor currently has this node claimed, evict the running job
  if ( condor_status -claimed -direct $(hostname) 2>/dev/null | grep -q Machines )
  then
      condor_vacate > /dev/null
      sleep 5
  fi

PBS/Condor Interaction

PBS Epilogue:

  # Allow Condor to start jobs again now that the PBS job has finished
  /opt/condor/bin/condor_config_val -rset -startd PBSRunning=False > /dev/null
  /opt/condor/sbin/condor_reconfig -startd > /dev/null

Condor START expression in condor_config.local:

  PBSRunning = False
  # Only start jobs if PBS is not currently running a job
  PURDUE_RCAC_START_NOPBS = ( $(PBSRunning) == False )
  START = $(START) && $(PURDUE_RCAC_START_NOPBS)
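For condor_config_val -rset to be accepted by the startd, runtime configuration changes normally have to be enabled and the attribute whitelisted. A minimal sketch of the knobs involved is below; these settings are an assumption, not part of the slides, and the exact names and required access level should be checked against the Condor version in use:

  # condor_config.local (assumed settings, not shown on the slides)
  ENABLE_RUNTIME_CONFIG = True
  # Allow the PBSRunning attribute to be changed at runtime
  SETTABLE_ATTRS_CONFIG = PBSRunning
  # The host running the prologue/epilogue must also be granted the
  # corresponding (CONFIG or ADMINISTRATOR) access level.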