92e71748fc513705d727141ddf9bf072.ppt
- Number of slides: 61
SPRUCE: Special PRiority and Urgent Computing Environment
http://spruce.teragrid.org/
Pete Beckman, Argonne National Laboratory, University of Chicago
Modeling and Simulation Are a Critical Part of Decision Making
University of Chicago Urgent Computing - 2 Argonne National Lab
I Need it Now!
• Applications with dynamic data and result deadlines are being deployed
• Late results are useless
  - Wildfire path prediction
  - Storm/flood prediction
  - Influenza modeling
• Some jobs need priority access: a “Right-of-Way Token”
Example 1: Severe Weather Predictive Simulation from Real-Time Sensor Input
(Severe weather example)
Source: Kelvin Droegemeier, Center for Analysis and Prediction of Storms (CAPS), University of Oklahoma. Collaboration with the LEAD Science Gateway project.
Example 2: Real-Time Neurosurgical Imaging Using Simulation (GENIUS project - HemeLB)
(Medical example) Getting the Patient-Specific Data
• Data is generated by MRA scanners at the National Hospital for Neurosurgery and Neurology
• Diagram labels, in slide order:
  - HemeLB
  - Trilinear interpolation
  - 512² pixels x 100 slices, res: 0.46875² mm x 0.8 mm
  - Our graphical-editing tool
  - 2048² x 682 cubic voxels, res: 0.46875 mm
Source: Peter Coveney, GENIUS Project, University College London
Example 3: SURA Coastal Ocean Observing Program (SCOOP)
(Flood modeling example)
Source: Center for Computation and Technology, Louisiana State University
How can we get cycles?
• Build supercomputers for the application
  - Pros: Resource is ALWAYS available
  - Cons: Incredibly costly (99% idle)
  - Example: Coast Guard rescue boats
• Share public infrastructure
  - Pros: Low cost
  - Cons: Requires complex system for authorization, resource management, and control
  - Examples: school buses for evacuation, cruise ships for temporary housing
Introducing SPRUCE
• The Vision:
  - Build cohesive infrastructure that can provide urgent computing cycles for emergencies
• Technical Challenges:
  - Provide high degree of reliability
  - Elevated priority mechanisms
  - Resource selection, data movement
• Social Challenges:
  - Who? When? What?
  - How will emergency use impact regular use?
  - Decision-making, workflow, and interpretation
Existing “Digital Right-of-Way” Emergency Phone System
• Calling cards are in widespread use and easily understood by the NS/EP user, simplifying GETS usage.
• GETS priority is invoked “call-by-call”.
• GETS is a ‘ubiquitous’ service in the Public Switched Telephone Network. If you can get a DIAL TONE, you can make a GETS call.
(Diagram label: GETS USER ORGANIZATION)
SPRUCE Architecture Overview (1/2): Right-of-Way Tokens
Diagram labels (steps numbered 1 and 2 on the slide): Event; Automated Trigger; Human Trigger; First Responder; Right-of-Way Token; SPRUCE Gateway / Web Services
SPRUCE Architecture Overview (2/2): Submitting Urgent Jobs
Diagram labels (steps numbered 3 to 5 on the slide): User Team Authentication; Conventional Job Submission Parameters; Urgent Computing Parameters; Urgent Computing Job Submission; Choose a Resource; SPRUCE Job Manager; Local Site Policies; Priority Job Queue; Supercomputer Resource
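As a rough illustration of the submission path on this slide, the sketch below checks for an active right-of-way token before an urgent job is passed on to a priority queue. Everything in it is hypothetical: the `Token` fields, the `authorize_urgent_job` and `submit` helpers, and the urgency ordering (only `red` appears in these slides) are assumptions made for illustration, not the SPRUCE interfaces.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed ordering of urgency levels, lowest to highest.
# Only "red" appears in the slides; the other names are placeholders.
URGENCY_LEVELS = ("yellow", "orange", "red")


@dataclass(frozen=True)
class Token:
    """Hypothetical right-of-way token: who may use it, how urgent, until when."""
    users: frozenset
    max_urgency: str
    expires: datetime


def authorize_urgent_job(tokens, user, urgency, now):
    """Return True if some unexpired token covers this user at this urgency."""
    wanted = URGENCY_LEVELS.index(urgency)
    for token in tokens:
        active = now < token.expires
        allowed = URGENCY_LEVELS.index(token.max_urgency)
        if active and user in token.users and wanted <= allowed:
            return True
    return False


def submit(tokens, user, urgency, now):
    """Mimic the accept/reject decision made before a job reaches the queue."""
    if not authorize_urgent_job(tokens, user, urgency, now):
        return f"No valid token found for user = {user}, aborting job submission"
    return f"Job accepted for user = {user} with urgency = {urgency}"
```

In this sketch the local site policy step is left out on purpose; a site would apply its own rules after the token check succeeds.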
Summary of Components
Central
[SPRUCE Server (WS interface, Web portal)]
• Token & session management
  - admin, user, job manager
Distributed, Site-local
[configuration & policy]
• Priority queue and local policies
[installed software]
• Authorization & management for job submission and queuing
Internal Architecture
Diagram labels:
• Client side: Computing Resource (Job Manager & Scripts); Web Portal or Workflow Tools; Client Interfaces (Java, AJAX, Axis2, PHP / Perl); Client-Side Job Tools
• Transport: SOAP Request (two paths)
• Central SPRUCE Server: SPRUCE User Services || Validation Services; Axis2 Web Service Stack; Tomcat Java Servlet Container; Apache Web Server; JDBC; MySQL; Mirror (future work)
Site-Local Response Policies: How will Urgent Computing be treated?
• “Next-to-run” status for priority queue
  - wait for running jobs to complete
• Force checkpoint of existing jobs; run urgent job
• Suspend current job in memory (kill -STOP); run urgent job
• Kill all jobs immediately; run urgent job
• Provide differentiated CPU accounting
  - “Jobs that can be killed because they maintain their own checkpoints will be charged 20% less”
• Other incentives
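The “suspend in memory” policy above can be sketched with standard POSIX signals. The example below is a minimal illustration, assuming a Unix-like system and a job that is a direct child process; the helper names `python_cmd` and `suspend_run_resume` are made up for this sketch, and a production site would drive this through its resource manager rather than a script like this.

```python
import os
import signal
import subprocess
import sys


def python_cmd(code):
    """Build a command line that runs a short Python snippet (demo helper)."""
    return [sys.executable, "-c", code]


def suspend_run_resume(running_cmd, urgent_cmd):
    """Suspend a running job in memory, run an urgent job, then resume.

    Returns (was_stopped, urgent_returncode, job_returncode).
    """
    job = subprocess.Popen(running_cmd)

    # Equivalent of `kill -STOP <pid>`: the job stays in memory but gets no CPU.
    os.kill(job.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report the stop without reaping the child.
    _, status = os.waitpid(job.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # The urgent job runs while the original job is suspended.
    urgent_rc = subprocess.run(urgent_cmd).returncode

    # Equivalent of `kill -CONT <pid>`: the original job picks up where it left off.
    os.kill(job.pid, signal.SIGCONT)
    job_rc = job.wait()
    return was_stopped, urgent_rc, job_rc
```

For example, `suspend_run_resume(python_cmd("import time; time.sleep(0.3)"), python_cmd("print('urgent job')"))` suspends a short sleeping job, runs the urgent one, and then lets the first finish.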
Emergency Preparedness Testing: “Warm Standby”
• For urgent computation, there is no time to port code
  - Applications must be in “warm standby”
  - Verification and validation runs test readiness periodically (Inca, MDS)
  - Reliability calculations
  - Only verified apps participate in urgent computing
• Grid-wide Information Catalog
  - Application was last tested & validated on
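One way to picture the warm-standby catalog is as a list of validation records that is filtered for freshness before an urgent run. The sketch below is only an illustration: the `ValidationRecord` fields, the `warm_standby_resources` helper, and the 7-day freshness window are assumptions, not values taken from SPRUCE, Inca, or MDS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ValidationRecord:
    """Hypothetical catalog entry: one validation run of an app on a resource."""
    application: str
    resource: str
    last_validated: datetime
    passed: bool


def warm_standby_resources(records, application, now, max_age=timedelta(days=7)):
    """Return resources where the application passed validation recently.

    max_age is an assumed freshness window; a real site would set its own.
    """
    return sorted(
        r.resource
        for r in records
        if r.application == application
        and r.passed
        and now - r.last_validated <= max_age
    )
```

Only resources returned by such a filter would be offered for urgent runs, which matches the rule that unverified applications do not participate.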
Resource Advisor
• Purpose
  - Given a set of distributed resources and a deadline, how does one select the “best” resource on which to run?
    § Analyze historical and live data to determine the likelihood of meeting a deadline.
• Generate a bound for the total turnaround time
  - Generate bounds for:
    § File staging (FT)
    § Allocation time (e.g., queue delay) (AT)
    § Execution time (ET)
  - Overall turnaround time = FT + AT + ET
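The turnaround bound above, FT + AT + ET, lends itself to a small worked example. The sketch below assumes each resource already has upper bounds for the three terms (in minutes) and picks the feasible resource with the smallest total; the `Bounds` class, the `pick_resource` helper, and the "smallest bound wins" rule are illustrative assumptions, since the slides describe the selection criterion only as the likelihood of meeting the deadline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Assumed per-resource upper bounds, all in minutes."""
    resource: str
    file_staging: int   # FT
    allocation: int     # AT, e.g. queue delay
    execution: int      # ET

    @property
    def turnaround(self):
        """Overall turnaround bound = FT + AT + ET."""
        return self.file_staging + self.allocation + self.execution


def pick_resource(candidates, deadline):
    """Return the candidate with the smallest bound that meets the deadline.

    Returns None when no candidate's bound fits within the deadline.
    """
    feasible = [c for c in candidates if c.turnaround <= deadline]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.turnaround)
```

With bounds of 10 + 30 + 40 = 80 minutes on one site and 5 + 70 + 35 = 110 minutes on another, a 90-minute deadline leaves only the first site feasible, even though the second stages files faster.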
Selecting a Resource
• Input: Trigger; Urgent Severe Weather Job, Deadline: 90 Min
• Historical Data
  - Network bandwidth
  - Queue wait times
  - Warm standby validation
  - Local site policies
• Live Data
  - Network status
  - Job/Queue data
• Application-Specific Data
  - Performance Model
  - Verified Resources
  - Data Repositories
• Advisor output: “Best” HPC Resource
Highly Available Resource Co-allocator (HARC)
• Some scenarios have definite early notice
  - SCOOP gets hurricane warnings a few hours to days in advance
  - No need for drastic steps like killing jobs
• HARC with SPRUCE for reservations
  - A reservation made via the portal will be associated with a token
  - Any user added to that active token can use the reservation
  - Can bypass local access control lists
Bandwidth Tokens
Deployment Status
• Deployed and available on TeraGrid
  - UC/ANL
  - NCSA
  - SDSC
  - NCAR
  - Purdue
  - TACC
• Other sites
  - LSU
  - Virginia Tech
  - LONI
Roadmap
• Ongoing work
  - SPRUCE integration with Condor
  - Policy mapping for resources and applications
  - SPRUCE integration with network bandwidth
  - WS-GRAM compatibility
  - Notification system with triggers on token use
  - INCA Q/A monitoring system for SPRUCE services
• Future work
  - Automatic restart tokens
  - Aggregation, extension of tokens, ‘start_by’ deadlines
  - Encode (and probe for) local site policies
  - Warm standby integration
  - Data movement, network reservation, data storage
  - Failover & redundancy of SPRUCE server
Imagine…
• A world-wide system for supporting urgent computing on supercomputers
• Slow, patient growth of large-scale urgent apps
• Expanding our notion of priority queuing, checkpoint/restart, CPU pricing, etc.
• A standardized set of web services for “request VM”, including all the complicated small bits
  - DHCP, VLANs, DNS, local storage, remote storage
• For Capability: 10 to 20 supercomputers available on demand
• For Capacity: Condor Flocks & Dynamic VMs provide availability of 250K “node instances”
Partners
Questions? Ready to Join?
spruce@ci.uchicago.edu
http://spruce.teragrid.org/
Screen Shots Running Urgent Jobs
Direct SPRUCE Job Submission (No Grid Middleware)
# spruce_sub urgency=red spruce_test.pbs
No Valid Token found for user = beckman, aborting job submission
SPRUCE Job Submission via Globus
# grid-proxy-init
Enter GRID pass phrase for this identity: *******
Your proxy is valid until: Sat Nov 18 03:21:30 2007
# cat globus_test.rsl
<…>
(resourceManagerContact = tg-grid1.uc.teragrid.org:2120/jobmanager-spruce)
(executable = /home/beckman/spruce/mpihello)
<…>
(urgency = red)
<…>
# globusrun -o -f globus_test.rsl
Screen Shots LEAD Demo Shots
[Screenshots 1-9]
Screen Shots Managing Tokens
[Screenshots 1-9]
Screen Shots Admin Views
[Screenshots 1-4]
Screen Shots Resource Advisor
[Screenshots 1-3]
Screen Shots SPRUCE Bandwidth
[Screenshots 1-4]