
SPRUCE: Special PRiority and Urgent Computing Environment
http://spruce.teragrid.org/
Pete Beckman
Argonne National Laboratory, University of Chicago

Modeling and Simulation Are a Critical Part of Decision Making
University of Chicago Urgent Computing - 2 Argonne National Lab

I Need it Now!
• Applications with dynamic data and result deadlines are being deployed
• Late results are useless
  - Wildfire path prediction
  - Storm/Flood prediction
  - Influenza modeling
• Some jobs need priority access: “Right-of-Way Token”

Example 1: Severe Weather Predictive Simulation from Real-Time Sensor Input
[Figure: severe weather example]
Source: Kelvin Droegemeier, Center for Analysis and Prediction of Storms (CAPS), University of Oklahoma. Collaboration with LEAD Science Gateway project.

Example 2: Real Time Neurosurgical Imaging Using Simulation (GENIUS project - HemeLB)
[Figure: medical example]
Getting the Patient-Specific Data
• Data is generated by MRA scanners at the National Hospital for Neurology and Neurosurgery
• HemeLB
• Trilinear interpolation: 512² pixels x 100 slices, res: 0.46875² mm x 0.8 mm
• Our graphical-editing tool: 2048² x 682 cubic voxels, res: 0.46875 mm
Source: Peter Coveney, GENIUS Project, University College London

Example 3: SURA Coastal Ocean Observing Program (SCOOP)
[Figure: flood modeling example]
Source: Center for Computation and Technology, Louisiana State University

How can we get cycles?
• Build supercomputers for the application
  - Pros: Resource is ALWAYS available
  - Cons: Incredibly costly (99% idle)
  - Example: Coast Guard rescue boats
• Share public infrastructure
  - Pros: Low cost
  - Cons: Requires complex system for authorization, resource management, and control
  - Examples: school buses for evacuation, cruise ships for temporary housing

Introducing SPRUCE
• The Vision:
  - Build cohesive infrastructure that can provide urgent computing cycles for emergencies
• Technical Challenges:
  - Provide high degree of reliability
  - Elevated priority mechanisms
  - Resource selection, data movement
• Social Challenges:
  - Who? When? What?
  - How will emergency use impact regular use?
  - Decision-making, workflow, and interpretation

Existing “Digital Right-of-Way” Emergency Phone System
• GETS (Government Emergency Telecommunications Service) calling cards are in widespread use and easily understood by the NS/EP user, simplifying GETS usage.
• GETS priority is invoked “call-by-call”.
• GETS is a ‘ubiquitous’ service in the Public Switched Telephone Network. If you can get a DIAL TONE, you can make a GETS call.
[Diagram label: GETS user organization]

SPRUCE Architecture Overview (1/2): Right-of-Way Tokens
[Diagram, steps 1 and 2. Labels: Event; Automated Trigger; Human Trigger (First Responder); Right-of-Way Token; SPRUCE Gateway / Web Services]

SPRUCE Architecture Overview (2/2): Submitting Urgent Jobs
[Diagram, steps 3 to 5. Labels: User Team Authentication; Choose a Resource; Urgent Computing Job Submission (Conventional Job Submission Parameters + Urgent Computing Parameters); SPRUCE Job Manager; Priority Job Queue; Local Site Policies; Supercomputer Resource]
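The token check implied by the two architecture slides can be sketched as follows. This is a minimal illustration, not the SPRUCE implementation: the `Token` fields, the function names, and the `yellow`/`orange` levels below `red` are assumptions made for the example (only `urgency=red` appears in these slides).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumed ordering of urgency levels; only "red" appears in the slides.
URGENCY_RANK = {"yellow": 1, "orange": 2, "red": 3}


@dataclass
class Token:
    """Hypothetical right-of-way token record (field names are illustrative)."""
    token_id: str
    max_urgency: str
    activated_at: datetime
    lifetime: timedelta
    users: set = field(default_factory=set)

    def is_active(self, now):
        # A token is usable only between activation and expiry.
        return self.activated_at <= now < self.activated_at + self.lifetime


def find_valid_token(tokens, user, urgency, now):
    """Return the first active token that lists `user` and permits `urgency`."""
    for token in tokens:
        allowed = URGENCY_RANK[urgency] <= URGENCY_RANK[token.max_urgency]
        if token.is_active(now) and user in token.users and allowed:
            return token
    return None


def submit_urgent(tokens, user, urgency, now):
    """Accept the job only if a valid token backs the requested urgency."""
    token = find_valid_token(tokens, user, urgency, now)
    if token is None:
        return f"No valid token found for user = {user}, aborting job submission"
    return f"accepted under token {token.token_id} at urgency {urgency}"
```

A job manager built this way rejects the submission before it ever reaches the local scheduler, which keeps the site-local priority queue free of unauthorized urgent jobs.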

Summary of Components
Central: [SPRUCE Server (WS interface, Web portal)]
• Token & session management
  - admin, user, job manager
Distributed, Site-local: [configuration & policy]
• Priority queue and local policies
Distributed, Site-local: [installed software]
• Authorization & management for job submission and queuing

Internal Architecture
• Client Interfaces
  - Web Portal or Workflow Tools (PHP / Perl, AJAX, Java, Axis 2)
  - Computing Resource: Job Manager & Scripts (Client-Side Job Tools)
• Clients reach the central server through SOAP requests
• Central SPRUCE Server
  - SPRUCE User Services || Validation Services
  - Axis 2 Web Service Stack
  - Tomcat Java Servlet Container
  - Apache Web Server
  - MySQL (via JDBC)
• Mirror (future work)

Site-Local Response Policies: How will Urgent Computing be treated?
• “Next-to-run” status for priority queue
  - wait for running jobs to complete
• Force checkpoint of existing jobs; run urgent job
• Suspend current job in memory (kill -STOP); run urgent job
• Kill all jobs immediately; run urgent job
• Provide differentiated CPU accounting
  - “Jobs that can be killed because they maintain their own checkpoints will be charged 20% less”
• Other incentives
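The "suspend in memory" policy can be illustrated on a single POSIX host with standard signals. This is only a sketch of the `kill -STOP` idea from the slide; the function names are hypothetical, and a real site would act through its batch scheduler rather than on raw process IDs.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Pause a process in memory; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a previously stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def run_urgent_with_suspension(running_pid, urgent_cmd):
    """Suspend the running job, run the urgent command, then resume the job."""
    suspend(running_pid)
    try:
        return subprocess.run(urgent_cmd, check=False).returncode
    finally:
        # Resume even if the urgent command fails to start.
        resume(running_pid)
```

Suspension keeps the preempted job's state in memory, so it only helps when the urgent job fits in the memory that remains; the checkpoint and kill policies on the slide trade that constraint for lost or repeated work.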

Emergency Preparedness Testing: “Warm Standby”
• For urgent computation, there is no time to port code
  - Applications must be in “warm standby”
  - Verification and validation runs test readiness periodically (Inca, MDS)
  - Reliability calculations
  - Only verified apps participate in urgent computing
• Grid-wide Information Catalog
  - Application was last tested & validated on
  - Also provides key success/failure history logs
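A warm-standby lookup against such a catalog might look like the sketch below. The catalog layout (a dict keyed by application and resource with `last_validated` and `passed` fields) and the freshness window are assumptions made for illustration; the slides do not describe the catalog schema.

```python
from datetime import datetime, timedelta


def verified_resources(catalog, app, now, max_age):
    """Return resources where `app` last passed validation within `max_age`."""
    ready = []
    for (entry_app, resource), record in catalog.items():
        if entry_app != app:
            continue
        fresh = now - record["last_validated"] <= max_age
        if record["passed"] and fresh:
            ready.append(resource)
    return sorted(ready)


# Hypothetical catalog entries: (application, resource) -> last validation run.
example_catalog = {
    ("weather_model", "site_a"): {"last_validated": datetime(2007, 11, 16), "passed": True},
    ("weather_model", "site_b"): {"last_validated": datetime(2007, 9, 1), "passed": True},
    ("weather_model", "site_c"): {"last_validated": datetime(2007, 11, 17), "passed": False},
}
```

With `now = datetime(2007, 11, 18)` and a seven-day window, only `site_a` qualifies: `site_b` was validated too long ago and `site_c` failed its last run.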

Resource Advisor
• Purpose
  - Given a set of distributed resources and a deadline, how does one select the “best” resource on which to run?
  - Analyze historical and live data to determine the likelihood of meeting a deadline.
• Generate a bound for the total turnaround time
  - Generate bounds for:
    - File staging (FT)
    - Allocation time, e.g., queue delay (AT)
    - Execution time (ET)
  - Overall turnaround time = FT + AT + ET
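The turnaround bound FT + AT + ET can be turned into a simple selection rule, sketched below. This simplifies the advisor to deterministic upper bounds rather than the likelihood estimate the slide mentions; the class, function, and resource names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Upper bounds, in minutes, for one candidate resource."""
    file_staging: float  # FT
    allocation: float    # AT, e.g. queue delay
    execution: float     # ET

    def turnaround(self):
        # Overall turnaround time = FT + AT + ET
        return self.file_staging + self.allocation + self.execution


def pick_resource(bounds_by_resource, deadline_minutes):
    """Return the resource with the smallest bound that meets the deadline, or None."""
    feasible = {
        name: b.turnaround()
        for name, b in bounds_by_resource.items()
        if b.turnaround() <= deadline_minutes
    }
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


candidates = {
    "site_a": Bounds(file_staging=10, allocation=45, execution=30),  # 85 min
    "site_b": Bounds(file_staging=20, allocation=5, execution=50),   # 75 min
    "site_c": Bounds(file_staging=5, allocation=120, execution=20),  # 145 min
}
```

With a 90-minute deadline, `site_a` and `site_b` are both feasible and `site_b` is chosen for its smaller bound; tightening the deadline to 60 minutes leaves no feasible resource.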

Selecting a Resource
[Diagram] A trigger produces an urgent severe weather job (deadline: 90 min). The Advisor combines three kinds of input to select the “best” HPC resource:
• Historical Data
  - Network bandwidth
  - Queue wait times
  - Warm standby validation
  - Local site policies
• Live Data
  - Network status
  - Job/Queue data
• Application-Specific Data
  - Performance model
  - Verified resources
  - Data repositories

Highly Available Resource Co-allocator (HARC)
• Some scenarios have definite early notice
  - SCOOP gets hurricane warnings a few hours to days in advance
  - No need for drastic steps like killing jobs
• HARC with SPRUCE for reservations
  - A reservation made via the portal will be associated with a token
  - Any user added to that active token can use the reservation
  - Can bypass local access control lists

Bandwidth Tokens

Deployment Status
• Deployed and Available on TeraGrid
  - UC/ANL
  - NCSA
  - SDSC
  - NCAR
  - Purdue
  - TACC
• Other sites
  - LSU
  - Virginia Tech
  - LONI

Roadmap
• Ongoing work
  - SPRUCE integration with Condor
  - Policy mapping for resources and applications
  - SPRUCE integration with network bandwidth
  - WS-GRAM compatibility
  - Notification system with triggers on token use
  - INCA Q/A monitoring system for SPRUCE services
• Future Work
  - Automatic restart tokens
  - Aggregation, extension of tokens, ‘start_by’ deadlines
  - Encode (and probe for) local site policies
  - Warm standby integration
  - Data movement, network reservation, data storage
  - Failover & redundancy of SPRUCE server

Imagine…
• A world-wide system for supporting urgent computing on supercomputers
• Slow, patient growth of large-scale urgent apps
• Expanding our notion of priority queuing, checkpoint/restart, CPU pricing, etc.
• A standardized set of web services for “request VM”, including all the complicated small bits
  - DHCP, VLANs, DNS, local storage, remote storage
• For Capability: 10 to 20 supercomputers available on demand
• For Capacity: Condor Flocks & Dynamic VMs provide availability of 250K “node instances”

Partners

Questions? Ready to Join?
spruce@ci.uchicago.edu
http://spruce.teragrid.org/

Screen Shots: Running Urgent Jobs

Direct SPRUCE Job Submission (No Grid Middleware)

    # spruce_sub urgency=red spruce_test.pbs
    No Valid Token found for user = beckman, aborting job submission
    # spruce_sub urgency=red spruce_test.pbs
    240559
    # qstat
    JobId   Name        User     S  Queue
    ------  ----------  -------  -  ------
    240552  Cylinder-1  gustav   Q  dque
    240556  STDIN       lgrinb   Q
    240559  spruce-job  beckman  R  spruce

SPRUCE Job Submission via Globus

    # grid-proxy-init
    Enter GRID pass phrase for this identity: *******
    Your proxy is valid until: Sat Nov 18 03:21:30 2007
    # cat globus_test.rsl
    <…>
    (resourceManagerContact = tg-grid1.uc.teragrid.org:2120/jobmanager-spruce)
    (executable = /home/beckman/spruce/mpihello)
    <…>
    (urgency = red)
    <…>
    # globusrun -o -f globus_test.rsl

Screen Shots: LEAD Demo Shots
[Nine screenshots; images not recoverable from the extracted text]

Screen Shots: Managing Tokens
[Nine screenshots; images not recoverable from the extracted text]

Screen Shots: Admin Views
[Four screenshots; images not recoverable from the extracted text]

Screen Shots: Resource Advisor
[Three screenshots; images not recoverable from the extracted text]

Screen Shots: SPRUCE Bandwidth
[Four screenshots; images not recoverable from the extracted text]