


Grid Computing
Jeffrey K. Hollingsworth, hollings@cs.umd.edu
Department of Computer Science, University of Maryland, College Park, MD 20742
Copyright 2006, Jeffrey K. Hollingsworth

The Need for GRIDS
• Many Computation-Bound Jobs
  – Simulations
    • Financial
    • Electronic design
    • Science
  – Data mining
• Large-scale Collaboration
  – Sharing of large data sets
  – Coupled communication simulation codes

Available Resources - Desktops
• Networks of Workstations
  – Workstations have high processing power
  – Connected via high-speed network (100 Mbps+)
  – Long idle time (50-60%) and low resource usage
• Goal: Run CPU-intensive programs using idle periods (see the sketch below)
  – While owner is away: send guest job and run it
  – When owner returns: stop and migrate guest job away
  – Examples: Condor (University of Wisconsin)
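A minimal sketch of the idle-cycle harvesting loop just described, assuming a Unix-like host (for `os.getloadavg`). The `owner_idle_seconds` probe, thresholds, and polling interval are hypothetical placeholders; a real system such as Condor checkpoints and migrates the guest job instead of simply terminating it.

```python
import os
import subprocess
import time

IDLE_THRESHOLD_S = 300   # owner considered away after 5 minutes without input (assumed)
LOAD_THRESHOLD = 0.25    # 1-minute load average below which the host counts as idle (assumed)

def owner_idle_seconds() -> float:
    # Stub: a real harvester would query console/X11 input activity here.
    return 600.0

def host_is_idle() -> bool:
    one_min_load, _, _ = os.getloadavg()
    return owner_idle_seconds() > IDLE_THRESHOLD_S and one_min_load < LOAD_THRESHOLD

def harvest(guest_cmd: list[str], poll_s: float = 30.0) -> None:
    guest = None
    while True:
        if host_is_idle():
            if guest is None:
                # Owner is away: start the guest job at the lowest CPU priority.
                guest = subprocess.Popen(["nice", "-n", "19", *guest_cmd])
        elif guest is not None:
            # Owner is back: stop the guest. A real system would checkpoint the
            # job here and migrate it to another idle workstation.
            guest.terminate()
            guest.wait()
            guest = None
        time.sleep(poll_s)
```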

Computational Grids
• Environment
  – Collection of semi-autonomous computers
  – Geographically distributed
  – Goal: Use these systems as a coordinated resource
  – Heterogeneous: processors, networks, OS
• Target Applications
  – Large-scale programs: running for 100-1,000s of seconds
  – Significant need to access long-term storage
• Needs
  – Coordinated access (scheduling)
  – Specific time requests (reservations)
  – Scalable system software (1,000s of nodes)

Two Models of Grid Nodes
• Harvested Nodes (Desktop)
  – Computers on desktops
  – Have a primary user who has priority
  – Participate in the grid when resources are free
• Dedicated Nodes (Data Center)
  – Dedicated to computation-bound jobs
  – Various policies
    • May participate in the grid 24/7
    • May only participate when load is low

Available Processing Power
• CPU usage is low: 10% or less for 75% of the time
• Memory is available: 30 MB available 70% of the time

OS Support for Harvested Grid Computing
• Need to manage resources differently
  – Scheduler
    • Normally designed to be fair
    • Need strict priority
  – Virtual memory
    • Need priority for local jobs
  – File systems
• Virtual machines make things easier
  – Provide isolation
  – Manage resources

Starvation-Level CPU Scheduling
• Original Linux CPU scheduler
  – Run-time scheduling priority
    • nice value & remaining time quanta
    • T_i = 20 - nice_level + (1/2) * T_{i-1}
  – Possible to schedule niced processes
• Modified Linux CPU scheduler (modeled in the sketch below)
  – If runnable host processes exist
    • Schedule a host process with highest priority
  – Only when no host process is runnable
    • Schedule a guest process
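A toy model of the scheduling decision above, not the actual kernel modification; the `Task` fields and example task names are illustrative. It keeps the slide's quantum recurrence and the rule that a guest process is chosen only when no host process is runnable.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    is_guest: bool       # guest = harvested grid job; host = local user's process
    nice_level: int = 0
    quanta: float = 0.0  # remaining time quanta T_i

def refresh_quanta(task: Task) -> None:
    # Recurrence from the slide: T_i = 20 - nice_level + (1/2) * T_{i-1}
    task.quanta = 20 - task.nice_level + 0.5 * task.quanta

def pick_next(runnable: list[Task]) -> Task | None:
    """Starvation-level policy: guests run only when no host task is runnable;
    within a class, the task with the most remaining quanta is chosen."""
    hosts = [t for t in runnable if not t.is_guest]
    pool = hosts or [t for t in runnable if t.is_guest]
    return max(pool, key=lambda t: t.quanta) if pool else None

# The guest never runs while any host task is runnable.
editor = Task("editor", is_guest=False, quanta=5)
guest = Task("cvm_sor", is_guest=True, quanta=20)
print(pick_next([editor, guest]).name)   # -> editor
print(pick_next([guest]).name)           # -> cvm_sor
refresh_quanta(editor)                   # editor.quanta = 20 - 0 + 0.5 * 5 = 22.5
```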

Prioritized Page Replacement
• New page replacement algorithm (modeled in the sketch below)
  – No limit on taking free pages
  – [Diagram: main memory pages; above the high limit the host job gets priority, between the limits replacement is based only on LRU, below the low limit the guest job gets priority]
  – High limit: maximum pages the guest can hold
  – Low limit: minimum pages the guest can hold
• Adaptive page-out speed
  – When a host job steals a guest's page, page out multiple guest pages faster
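A toy model of the replacement policy, not the kernel implementation. The 70 MB / 50 MB thresholds are the ones used in the micro test on the next slide; the 4 KB page size and the 8-page burst are assumptions made for the illustration.

```python
# Thresholds in 4 KB pages (70 MB / 50 MB, as in the micro test below).
HIGH_LIMIT = 70 * 256
LOW_LIMIT = 50 * 256

def choose_victim(guest_pages: int, lru_owner: str) -> str:
    """Whose page to evict: 'guest', 'host', or whatever plain LRU says."""
    if guest_pages > HIGH_LIMIT:
        return "guest"      # guest holds more than its maximum: host jobs get priority
    if guest_pages < LOW_LIMIT:
        return "host"       # guest is at its protected minimum: its pages stay
    return lru_owner        # between the limits: decide purely by LRU

def pages_to_evict(victim: str, base: int = 1, burst: int = 8) -> int:
    # Adaptive page-out speed: when a host job steals a guest page, push out a
    # burst of guest pages so the host is not slowed by repeated page faults.
    return burst if victim == "guest" else base
```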

Micro Test
• Prioritized memory page replacement
  – Total available memory: 179 MB
  – Memory thresholds: high limit (70 MB), low limit (50 MB)
  – Guest job starts at time 20, acquiring 128 MB
  – Host job starts at time 38, touching 150 MB
  – Host job becomes I/O intensive at time 90
  – Host job finishes at time 130

Application Evaluation - Setup
• Experiment environment
  – Linux PC cluster
    • 8 Pentium II PCs, Linux 2.0.32
    • Connected by a 1.2 Gbps Myrinet
• Local workload for host jobs
  – Emulate an interactive local user
    • MUSBUS interactive workload benchmark
    • Typical programming environment
• Guest jobs
  – Run DSM parallel applications (CVM)
  – SOR, Water, and FFT
• Metrics
  – Guest job performance, host workload slowdown

Application Evaluation - Host Slowdown
• Run DSM parallel applications
  – 3 host workloads: 7%, 13%, 24% (CPU usage)
  – Host workload slowdown
  – For equal priority:
    • Significant slowdown
    • Slowdown increases with load
  – No slowdown with Linger priority

Application Evaluation - Guest Performance
• Run DSM parallel applications
  – Guest job slowdown
  – Slowdown proportional to MUSBUS usage
  – Running the guest at the same priority as the host provides little benefit to the guest job

Unique Grid Infrastructure
• Applies to both harvested and dedicated nodes
• Resource monitoring
  – Finding available resources
  – Need both CPUs and bandwidth
• Scheduling
  – Policies for sharing resources among organizations
• Security
  – Protect nodes from guest jobs
  – Protect jobs on foreign nodes

Security
• Goals
  – Don't require explicit accounts on each computer
  – Provide controlled access
    • Define policies on what jobs run where
    • Authenticate access
• Techniques (sketched below)
  – Certificates
  – Single account on the system for all grid jobs
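A minimal sketch of the "certificates plus a single grid account" idea above; the policy table, certificate subject strings, and job classes are hypothetical illustrations, not the API of any particular grid security package. Authentication of the certificate itself is assumed to have already happened.

```python
GRID_ACCOUNT = "griduser"   # one local account runs every accepted guest job

# Site policy: which authenticated certificate subjects may run which kinds of jobs.
POLICY = {
    "/O=Grid/OU=UMD/CN=Jeffrey Hollingsworth": {"serial", "parallel"},
    "/O=Grid/OU=PartnerLab/CN=Remote User":    {"serial"},
}

def authorize(cert_subject: str, job_class: str) -> str:
    """Return the local account a job should run under, or raise if refused."""
    allowed = POLICY.get(cert_subject, set())
    if job_class not in allowed:
        raise PermissionError(f"{cert_subject} may not run {job_class} jobs here")
    return GRID_ACCOUNT     # no per-user local accounts are needed

print(authorize("/O=Grid/OU=PartnerLab/CN=Remote User", "serial"))   # -> griduser
```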

Resource Monitoring
• Need to find available resources (a monitoring sketch follows below)
  – CPU cycles
    • With appropriate OS/system software
    • With sufficient memory & temporary disk
  – Network bandwidth
    • Between nodes running a parallel job
    • To the remote file system
• Issues
  – Time-varying availability
  – Passive vs. active monitoring
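A passive, node-local probe in the spirit of the monitoring tools shown on the next two slides. It assumes the third-party `psutil` package is available; the report fields and the 30-second period are illustrative, and a real monitor would push samples to a collector rather than print them.

```python
import socket
import time

import psutil  # assumed third-party dependency

def sample() -> dict:
    vm = psutil.virtual_memory()
    tmp = psutil.disk_usage("/tmp")
    return {
        "host": socket.gethostname(),
        "time": time.time(),
        "cpu_idle_pct": 100.0 - psutil.cpu_percent(interval=1.0),
        "mem_avail_mb": vm.available // (1024 * 1024),
        "tmp_free_mb": tmp.free // (1024 * 1024),
    }

def monitor(period_s: float = 30.0) -> None:
    """Periodically publish availability samples for this node."""
    while True:
        print(sample())        # stand-in for sending to a collector
        time.sleep(period_s)
```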

Ganglia Toolkit
[Screenshot courtesy of NPACI, SDSC, and UC Berkeley]

NetLogger
[Screenshot courtesy of Brian Tierney, LBL]

Scheduling
• Need to allocate resources on the grid
• Each site might:
  – Accept jobs from remote sites
  – Send jobs to other sites
• Need to accommodate co-scheduling
  – A single job that spans multiple sites
• Need for reservations
  – Time-certain allocation of resources

Scheduling Parallel Jobs
• Scheduling constraints
  – Different jobs use different numbers of nodes
  – Jobs provide an estimate of runtime
  – Jobs run from a few minutes to a few weeks
• Typical approach
  – One parallel job per node
    • Called space-sharing
  – Batch-style scheduling used
    • Even a single user often has more processes than can run at once
    • Need to have many nodes at once for a job

Typical Parallel Scheduler
• Packs jobs into a schedule by
  – Required number of nodes
  – Estimated runtime
• Backfills with smaller jobs (sketched below) when
  – Holes develop due to early job termination
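A simplified sketch of the packing-plus-backfilling idea (in the style of EASY backfilling), not the specific scheduler evaluated in this work; the `Job` fields and the example queue are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int            # required number of nodes
    est_runtime: float    # user-supplied runtime estimate (hours)

def start_now(queue: list[Job], running: list[tuple[float, int]],
              free_nodes: int, now: float) -> list[Job]:
    """Pick which queued jobs to launch at time `now`.
    `running` holds (expected_finish_time, nodes) for jobs already on the machine."""
    started = []
    # 1. FCFS: launch jobs from the head of the queue while they fit.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        started.append(job)
    if not queue:
        return started
    # 2. The head job is blocked: find the earliest time enough nodes free up for it.
    head, avail, shadow = queue[0], free_nodes, now
    for finish, nodes in sorted(running + [(now + j.est_runtime, j.nodes) for j in started]):
        avail += nodes
        if avail >= head.nodes:
            shadow = finish
            break
    # 3. Backfill: start later jobs that fit now and finish before the head's start time.
    for job in list(queue[1:]):
        if job.nodes <= free_nodes and now + job.est_runtime <= shadow:
            queue.remove(job)
            free_nodes -= job.nodes
            started.append(job)
    return started

q = [Job("wide", 96, 8.0), Job("huge", 128, 4.0), Job("small", 16, 1.0)]
print([j.name for j in start_now(q, running=[], free_nodes=128, now=0.0)])
# -> ['wide', 'small']: 'huge' must wait, and 'small' backfills into the hole in front of it
```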

Imprecise Calendars
• Data structure to manage grid scheduling (a sketch follows below)
  – Permits allocations of time to applications
  – Uses a hierarchical representation
    • Each level maintains a calendar for its managed nodes
  – Allows multiple temporal resolutions
• Key features:
  – Allows reservations
  – Supports co-scheduling across semi-autonomous sites
    • A site can refuse an individual remote job
    • Small jobs don't need inter-site coordination
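An illustrative sketch of a hierarchical calendar with coarse slots at the grid level and finer slots at each site. The class and method names are hypothetical, not the authors' implementation; the 336-node site size is taken from the cluster-of-clusters experiment later in the talk.

```python
class Calendar:
    """One level of an imprecise calendar: free node-counts per time slot."""
    def __init__(self, name: str, nodes: int, slot_hours: int, horizon_slots: int):
        self.name = name
        self.slot_hours = slot_hours
        self.free = [nodes] * horizon_slots   # free nodes in each slot
        self.children = []                    # finer calendars for managed sites

    def reserve(self, start_slot: int, num_slots: int, nodes: int) -> bool:
        """Tentatively reserve nodes for consecutive slots; a site may refuse."""
        window = self.free[start_slot:start_slot + num_slots]
        if len(window) < num_slots or any(f < nodes for f in window):
            return False
        for s in range(start_slot, start_slot + num_slots):
            self.free[s] -= nodes
        return True

# Two 336-node sites rolled up under one coarse grid-level calendar.
site_a = Calendar("site-a", 336, slot_hours=1, horizon_slots=24)
site_b = Calendar("site-b", 336, slot_hours=1, horizon_slots=24)
grid = Calendar("grid", 672, slot_hours=4, horizon_slots=6)
grid.children = [site_a, site_b]

# Co-scheduling a 512-node job: reserve a coarse grid slot, then refine it into
# fine-grained reservations at both sites.
ok = (grid.reserve(0, 1, 512)
      and site_a.reserve(0, 4, 336)
      and site_b.reserve(0, 4, 176))
print(ok)   # -> True
```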

Multiple Time/Space Resolutions
[Diagram: a calendar refined in space and refined in time]
• Parameters
  – Number and sizes of slots
  – Packing density
• Have multiple time scales at once
  – Near events at the finest temporal resolution

Evaluation
• Approach
  – Use traces of job submissions to real clusters
  – Simulate different scheduling policies
    • Imprecise calendars
    • Traditional back-filling schedulers
• Metrics for comparison (computed as in the sketch below)
  – Job completion time
    • Aggregate and by job size
  – Node utilization
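How the two comparison metrics could be computed from a simulated trace; the record layout (submit, start, end, nodes) and the toy trace are assumptions for illustration, not the format of the LANL or CTC workloads.

```python
def completion_times(trace):
    """Mean turnaround time (end - submit), overall and grouped by job size."""
    overall = sum(end - submit for submit, _, end, _ in trace) / len(trace)
    by_size = {}
    for submit, _, end, nodes in trace:
        by_size.setdefault(nodes, []).append(end - submit)
    return overall, {n: sum(v) / len(v) for n, v in by_size.items()}

def utilization(trace, total_nodes):
    """Fraction of node-time spent running jobs over the span of the trace."""
    busy = sum((end - start) * nodes for _, start, end, nodes in trace)
    span = max(end for _, _, end, _ in trace) - min(start for _, start, _, _ in trace)
    return busy / (total_nodes * span)

trace = [(0, 0, 4, 64), (1, 4, 6, 128), (2, 2, 3, 32)]   # (submit, start, end, nodes)
print(completion_times(trace))              # -> (3.33..., {64: 4.0, 128: 5.0, 32: 1.0})
print(utilization(trace, total_nodes=128))  # -> ~0.71
```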

Comparison with Partitioned Cluster
• Based on job data from LANL
• Treat each cluster as a trading partner

Balance of Trade
• Jobs are allowed to split across partitions
• Significant shift in work from the 128-node partition

Large Cluster of Clusters
• Each cluster has 336 nodes
  – Jobs < 1/3 of the nodes and < 12 node-hours are scheduled locally
  – Jobs were not split between nodes
• Data is one month of jobs per node
• Workload from the CTC SP-2

Balance of Trade: Large Clusters

Social, Political, and Corporate Barriers
• "It's my computer"
  – Even if the employer purchased it
• Tragedy of the Commons
  – Who will buy resources?
• Chargeback concerns
  – HW purchased for one project used by another
• Data security concerns
  – You want to run our critical jobs where?

Globus Toolkit
• Collection of tools
  – Security
  – Scheduling
  – Grid-aware parallel programming
• Designed for
  – Confederation of dedicated clusters
  – Support for parallel programs

Condor
• Core of tightly coupled tools
  – Monitoring of nodes
  – Scheduling (including batch queues)
  – Checkpointing of jobs
• Designed for
  – Harvested resources (dedicated nodes too)
  – Parameter sweeps using many serial program runs

Layout of the Condor Pool
[Diagram: Condor daemons (master, collector, negotiator, schedd, startd) distributed across a central manager, cluster nodes, and desktops; courtesy of the Condor Group, University of Wisconsin]

Conclusion
• What the Grid is
  – An approach to improve computation utilization
  – Support for data migration for large-scale computation
  – Several families of tools
  – Tools to enable collaboration
• What the Grid is not
  – Free cycles from heaven

Grid Resources
• Books
  – The Grid 2: Blueprint for a New Computing Infrastructure (Foster & Kesselman, eds.)
  – Grid Computing: Making the Global Infrastructure a Reality (Berman, Fox, & Hey, eds.)
• Software distributions
  – Condor: www.cs.wisc.edu/condor
  – Globus: www.globus.org