

  • Number of slides: 46

High-Performance Clusters part 2: Generality
David E. Culler
Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998

What’s Different about Clusters?
• Commodity parts?
• Communications Packaging?
• Incremental Scalability?
• Independent Failure?
• Intelligent Network Interfaces?
• Fast Scalable Communication?
=> Complete System on every node
  – virtual memory
  – scheduler
  – file system
  – . . .

Topics: Part 2
• Virtual Networks
  – communication meets virtual memory
• Scheduling
• Parallel I/O
• Clusters of SMPs
• VIA

General purpose requirements
• Many timeshared processes
  – each with direct, protected access
• User and system
• Client/Server, parallel clients, parallel servers
  – they grow, shrink, handle node failures
• Multiple packages in a process
  – each may have its own internal communication layer
• Use communication as easily as memory

Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other
• Many processes on a node can each have many endpoints, each with its own protection domain
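To make the endpoint abstraction a little more concrete, here is a minimal C sketch of what a user-level endpoint and virtual-network handle could look like. All names and fields below (vn_create_endpoint, vn_send, the struct layout) are illustrative assumptions, not the actual Berkeley virtual-network or AM-II API.

```c
/* Hypothetical sketch of a virtual-network endpoint, loosely modeled on
 * the ideas in this tutorial (not the actual AM-II API). */
#include <stddef.h>
#include <stdint.h>

typedef struct endpoint endpoint_t;   /* one "attachment to the network" */
typedef struct vnetwork vnetwork_t;   /* a set of endpoints that can name each other */

struct endpoint {
    uint32_t  vn_id;       /* which virtual network this endpoint belongs to */
    uint32_t  ep_index;    /* this endpoint's name within that virtual network */
    void     *tx_queue;    /* send queue, mapped into the owning process only */
    void     *rx_queue;    /* receive queue, protected the same way */
};

/* A process may open many endpoints; each is its own protection domain,
 * so a stray write through one endpoint cannot corrupt another. */
endpoint_t *vn_create_endpoint(vnetwork_t *vn);
int         vn_send(endpoint_t *ep, uint32_t dest_ep, const void *msg, size_t len);
int         vn_poll(endpoint_t *ep);   /* service pending arrivals on this endpoint */
```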

How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
  – active portion of the large logical space is bound to physical resources
[Figure: processes 1..n hold endpoints in host memory; the active subset is bound to NIC memory and serviced by the NIC processor and network interface]

Endpoint Transition Diagram
• HOT: endpoint frame resident in NIC memory, R/W
• WARM: paged to host memory, R/O
• COLD: paged to host memory
• Transitions: a write or message arrival promotes WARM -> HOT; Evict demotes HOT -> WARM; a read promotes COLD -> WARM; Swap demotes WARM -> COLD
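A minimal sketch of how a driver might track the HOT / WARM / COLD states and transitions in the diagram above; the enum, struct, and the nic_alloc_frame / nic_free_frame helpers are hypothetical, and the policy shown is only illustrative, not the actual NOW driver.

```c
/* Hypothetical per-endpoint state tracking for the transition diagram above. */
typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state_t;

typedef struct {
    ep_state_t state;
    int        nic_frame;            /* NIC frame index when HOT, -1 otherwise */
} ep_meta_t;

extern int  nic_alloc_frame(void);   /* assumed: grab (or steal) a NIC endpoint frame */
extern void nic_free_frame(int f);

/* A write or a message arrival needs the endpoint resident on the NIC. */
void ep_make_hot(ep_meta_t *ep)
{
    if (ep->state != EP_HOT) {
        ep->nic_frame = nic_alloc_frame();   /* may evict another HOT endpoint */
        ep->state     = EP_HOT;
    }
}

/* A read can be served from the host-memory copy: COLD -> WARM. */
void ep_make_warm_for_read(ep_meta_t *ep)
{
    if (ep->state == EP_COLD)
        ep->state = EP_WARM;
}

/* Eviction (HOT -> WARM) and swapping (WARM -> COLD) reclaim resources. */
void ep_evict(ep_meta_t *ep)
{
    if (ep->state == EP_HOT) {
        nic_free_frame(ep->nic_frame);
        ep->nic_frame = -1;
        ep->state     = EP_WARM;
    }
}

void ep_swap_out(ep_meta_t *ep)
{
    if (ep->state == EP_WARM)
        ep->state = EP_COLD;
}
```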

Network Interface Support
• NIC has endpoint frames (Frame 0 .. Frame 7), each with transmit and receive queues
• Services active endpoints
• Signals misses to the driver
  – using a system endpoint

Solaris System Abstractions
• Segment driver: manages portions of an address space
• Device driver: manages an I/O device
• Virtual network driver

LogP Performance
• Competitive latency
• Increased NIC processing
• Difference mostly
  – ack processing
  – protection check
  – data structures
  – code quality
• Virtualization cheap

Bursty Communication among many
[Figure: clients exchange message bursts with a server, which performs work between bursts]

Multiple VNs, Single-thread Server

Multiple VNs, Multithreaded Server

Perspective on Virtual Networks
• Networking abstractions are vertical stacks
  – new function => new layer
  – poke through for performance
• Virtual Networks provide a horizontal abstraction
  – basis for building new, fast services
• Open questions
  – What is the communication “working set”?
  – What placement, replacement, … ?

Beyond the Personal Supercomputer
• Able to timeshare parallel programs
  – with fast, protected communication
• Mix with sequential and interactive jobs
• Use fast communication in OS subsystems
  – parallel file system, network virtual memory, …
• Nodes have a powerful, local OS scheduler
• Problem: local schedulers do not know to run parallel jobs in parallel

Local Scheduling
• Schedulers act independently w/o global control
• Program waits while trying to communicate with its peers that are not running
• 10 - 100 x slowdowns for fine-grain programs!
=> need coordinated scheduling

Explicit Coscheduling
• Global context switch according to a precomputed schedule
• How do you build it? Does it work?

Typical Cluster Subsystem Structures
[Figure: two common structures, each with a local service (LS) and applications (A) on every node over the communication layer. Master-slave: a single master global service coordinates the per-node local services. Peer-to-peer: a global service (GS) instance runs on every node and the instances coordinate among themselves.]

Ideal Cluster Subsystem Structure
[Figure: a global service (GS) and local service (LS) instance on every node beneath the applications (A), with no explicit coordination traffic among the GS instances]
• Obtain coordination without explicit subsystem interaction, only the events in the program
  – very easy to build
  – potentially very robust to component failures
  – inherently “service on-demand”
  – scalable
• Local service component can evolve

Three approaches examined in NOW
• GLUNIX: explicit master-slave (user level)
  – matrix algorithm to pick the PP
  – uses stops & signals to try to force the desired PP to run
• Explicit peer-to-peer scheduling assist with VNs
  – co-scheduling daemons decide on the PP and kick the Solaris scheduler
• Implicit
  – modify the parallel run-time library to allow it to get itself coscheduled with the standard scheduler

Problems with explicit coscheduling
• Implementation complexity
• Need to identify parallel programs in advance
• Interacts poorly with interactive use and load imbalance
• Introduces new potential faults
• Scalability

Why implicit coscheduling might work
• Active message request-reply model
• Infer non-local state from local observations; react to maintain coordination
  – observation: fast response => implication: partner scheduled => action: keep spinning
  – observation: delayed response => implication: partner not scheduled => action: block (sleep)
[Figure: workstations WS 1-4 timesharing Jobs A and B; a requester spins while a response comes back quickly and blocks when it is delayed]

Obvious Questions
• Does it work?
• How long do you spin?
• What are the requirements on the local scheduler?

How Long to Spin?
• Answer: round trip time + context switch + msg processing
  – round-trip to stay scheduled together
  – plus wake-up to get scheduled together
  – keep spinning if serving messages
    » interval of 3 x wake-up
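As a concrete illustration of this rule, here is a minimal C sketch of a two-phase spin-then-block wait; the poll primitive, the event codes, and the threshold parameters are illustrative assumptions, not the NOW implicit-coscheduling code.

```c
/* Hypothetical two-phase wait for a request's reply: spin for roughly a
 * round trip + context switch + message-processing time, keep spinning
 * while serving incoming messages, and block if the reply stays away. */
#include <stdint.h>

typedef enum { VN_IDLE, VN_SERVED_MSG, VN_GOT_REPLY } vn_event_t;

extern uint64_t   cycles_now(void);            /* assumed cycle counter */
extern vn_event_t vn_poll(void);               /* assumed: service at most one message */
extern void       vn_block_until_reply(void);  /* assumed: sleep until the reply arrives */

void wait_for_reply(uint64_t spin_cycles,      /* ~ RTT + context switch + msg processing */
                    uint64_t serve_bonus)      /* ~ 3 x wake-up, per the slide above */
{
    uint64_t deadline = cycles_now() + spin_cycles;

    while (cycles_now() < deadline) {
        switch (vn_poll()) {
        case VN_GOT_REPLY:
            return;                                 /* fast response: partner is scheduled */
        case VN_SERVED_MSG:
            deadline = cycles_now() + serve_bonus;  /* keep spinning while serving messages */
            break;
        case VN_IDLE:
            break;                                  /* nothing yet; spin until the deadline */
        }
    }
    vn_block_until_reply();                         /* delayed response: yield the processor */
}
```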

Does it work?

Synthetic Bulk-synchronous Apps
• Range of granularity and load imbalance
  – spin wait: 10 x slowdown

With mixture of reads
• Block-immediate: 4 x slowdown

Timesharing Split-C Programs

Many Questions
• What about
  – mix of jobs?
  – sequential jobs?
  – unbalanced placement?
  – Fairness?
  – Scalability?
• How broadly can implicit coordination be applied in the design of cluster subsystems?
• Can resource management be completely decentralized?
  – computational economies, ecologies

A look at Serious File I/O
• Traditional I/O system: processors and memory on one side of the I/O bus, disks behind it
• NOW I/O system: a processor-memory pair with local disks on every node
• Benchmark problem: sort a large number of 100-byte records with 10-byte keys
  – start on disk, end on disk
  – accessible as files (use the file system)
  – Datamation sort: 1 million records
  – Minute sort: as much as you can sort in a minute

NOW-Sort Algorithm
• Read
  – N/P records from disk -> memory
• Distribute
  – scatter keys to the processors holding the result buckets
  – gather keys from all processors
• Sort
  – partial radix sort on each bucket
• Write
  – write records to disk (2-pass version: gather data runs onto disk, then a local, external merge sort)
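To illustrate the distribute step, here is a minimal C sketch in which each record's key selects the processor that owns its bucket; the record layout, the key-to-processor partitioning, and the vn_send_record primitive are illustrative assumptions rather than the actual NOW-Sort code.

```c
/* Hypothetical sketch of NOW-Sort's distribute phase: each record's 10-byte
 * key chooses the processor that owns its bucket, and the record is scattered
 * there (the send primitive and record layout are illustrative). */
#include <stddef.h>
#include <stdint.h>

#define KEY_BYTES 10
#define REC_BYTES 100

typedef struct {
    uint8_t key[KEY_BYTES];
    uint8_t data[REC_BYTES - KEY_BYTES];
} record_t;

extern void vn_send_record(int dest_proc, const record_t *r);  /* assumed messaging call */

/* Assume a roughly uniform key distribution: partition on the top key bytes. */
static int dest_processor(const record_t *r, int nprocs)
{
    unsigned prefix = ((unsigned)r->key[0] << 8) | r->key[1];   /* top 16 bits of the key */
    return (int)((prefix * (unsigned)nprocs) >> 16);            /* 0 .. nprocs-1 */
}

void distribute(const record_t *local, size_t n, int nprocs)
{
    for (size_t i = 0; i < n; i++)
        vn_send_record(dest_processor(&local[i], nprocs), &local[i]);
}
```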

Key Implementation Techniques
• Performance isolation: highly tuned local disk-to-disk sort
  – manage local memory
  – manage disk striping
  – memory-mapped I/O with madvise, buffering
  – manage overlap with threads
• Efficient communication
  – completely hidden under disk I/O
  – competes for I/O bus bandwidth
• Self-tuning software
  – probe available memory, disk bandwidth, trade-offs
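The memory-mapped I/O bullet can be made concrete with a small, generic POSIX sketch: map a run file read-only and advise the kernel that access will be sequential. This is ordinary mmap/madvise usage, not the actual NOW-Sort source, and the function name is made up.

```c
/* Minimal sketch of memory-mapped file I/O with access-pattern advice,
 * in the spirit of the local disk-to-disk sort (generic POSIX usage). */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_run_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                              /* the mapping stays valid after close */
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }

    /* Tell the VM system we will stream through the file once, so it can
     * read ahead aggressively and drop pages behind us. */
    madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);

    *len_out = (size_t)st.st_size;
    return p;
}
```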

World-Record Disk-to-Disk Sort
• Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
• but only in the wee hours of the morning

Towards a Cluster File System
• Remote disk (RD) system built on a virtual network
[Figure: a client linked with RDlib sends active messages to an RD server on the node that owns the disk]
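Here is a minimal, hypothetical sketch of how a remote-disk block read could be phrased as an active-message request/reply pair; the am_request / am_reply_bulk layer, the handler, and the block size are illustrative stand-ins, not the actual RDlib interface.

```c
/* Hypothetical sketch of a remote-disk read over active messages: the client
 * sends a request naming (block, destination buffer); the server-side handler
 * reads the block locally and replies into that buffer. Names are illustrative. */
#include <stddef.h>
#include <stdint.h>

#define RD_BLOCK_SIZE 8192

typedef void (*am_handler_t)(int src_node, void *arg, size_t len);

/* Assumed active-message layer (not a real library API). */
extern void am_request(int dest_node, am_handler_t h, const void *arg, size_t len);
extern void am_reply_bulk(int dest_node, void *dst_buf, const void *data, size_t len);
extern int  rd_local_read(uint64_t block, void *buf);   /* server's local disk read */

struct rd_read_req { uint64_t block; void *client_buf; };

/* Runs on the RD server when a read request arrives. */
void rd_read_handler(int src_node, void *arg, size_t len)
{
    struct rd_read_req *req = arg;
    uint8_t tmp[RD_BLOCK_SIZE];                 /* per-request staging buffer */
    (void)len;
    if (rd_local_read(req->block, tmp) == 0)
        am_reply_bulk(src_node, req->client_buf, tmp, sizeof tmp);
}

/* Client side (inside RDlib): issue the request; the reply data lands in buf. */
void rd_read(int server_node, uint64_t block, void *buf)
{
    struct rd_read_req req = { block, buf };
    am_request(server_node, rd_read_handler, &req, sizeof req);
}
```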

Streaming Transfer Experiment

Results
• Data distribution affects resource utilization, not delivered bandwidth

I/O Bus crossings

Opportunity: PDISK
• Fast communication: remote queues
• Fast I/O: streaming disk queues
• Producers dump data into the I/O river
• Consumers pull it out
• Hash data records across disks
• Match producers to consumers
• Integrated with work scheduling
[Figure: processors feeding remote queues that stream into per-disk queues]
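To make the "hash data records across disks" point concrete, here is a minimal C sketch in which a producer drops each record onto the disk queue chosen by a hash of its key; the disk_queue_append primitive is an assumed placeholder and the hash is ordinary FNV-1a, not anything from the PDISK design.

```c
/* Hypothetical sketch of hashing records across streaming disk queues,
 * as in the PDISK idea above (the queue primitive is an illustrative stand-in). */
#include <stddef.h>
#include <stdint.h>

extern void disk_queue_append(int disk, const void *rec, size_t len);  /* assumed */

static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Producers dump records into the "I/O river": each record lands on the disk
 * chosen by a hash of its key, spreading load evenly across all disks. */
void river_put(const uint8_t *key, size_t key_len,
               const void *rec, size_t rec_len, int ndisks)
{
    int disk = (int)(fnv1a(key, key_len) % (uint32_t)ndisks);
    disk_queue_append(disk, rec, rec_len);
}
```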

What will be the building block? NOWs of nodes or clusters of SMPs?
[Figure: the design space by processors per node, from NOW nodes with one processor, memory, and a network card, to SMP nodes whose processors share memory over a memory interconnect and reach the network cloud through one or more network cards]

Multi-Protocol Communication
[Figure: one communication layer with Send/Write and Rcv/Read paths over both shared memory and the network]
• Uniform Prog. Model is key
• Multiprotocol messaging
  – careful layout of msg queues
  – concurrent objects
  – polling the network hurts memory
• Shared Virtual Memory
  – relies on underlying msgs
• Pooling vs Contention
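A minimal sketch of what multiprotocol messaging means in practice: a single send routine that uses a shared-memory queue when the destination process lives on the same SMP and the network otherwise. All of the helpers below are assumed placeholders, not an existing API.

```c
/* Hypothetical multi-protocol send: same-SMP destinations go through a
 * shared-memory queue, remote ones through the network interface.
 * The queue/NIC primitives and node mapping are illustrative placeholders. */
#include <stdbool.h>
#include <stddef.h>

extern int  my_smp_node(void);
extern int  smp_node_of(int dest_proc);                     /* which SMP hosts dest_proc */
extern bool shmem_queue_push(int dest_proc, const void *msg, size_t len);
extern void nic_send(int dest_proc, const void *msg, size_t len);

void mp_send(int dest_proc, const void *msg, size_t len)
{
    if (smp_node_of(dest_proc) == my_smp_node()) {
        /* Same box: enqueue into a shared-memory queue. Queue layout matters;
         * false sharing or contention here can dominate the cost. */
        while (!shmem_queue_push(dest_proc, msg, len))
            ;                                   /* queue full: spin briefly (sketch only) */
    } else {
        /* Different box: hand the message to the network interface. */
        nic_send(dest_proc, msg, len);
    }
}
```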

LogP analysis of shared memory AM

Virtual Interface Architecture
[Figure: the VIA stack. Applications (Sockets, MPI, legacy codes, etc.) sit on the VI User Agent (“libvia”). Control operations (open, connect, map memory) take the slow path through the host VIA kernel driver; data operations (descriptor read/write and doorbells) take the fast user-level path. Each VI has a send and a receive queue, plus completion queues for finished requests, all implemented by a VI-capable NIC.]
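To illustrate the posting model in the figure, here is a hypothetical C sketch: the application builds a descriptor in registered memory, links it onto the VI's send queue, and rings a memory-mapped doorbell so the NIC picks up the work. The structures and field names are illustrative, not the VIPL API.

```c
/* Hypothetical sketch of posting a send in a VIA-like interface: build a
 * descriptor in registered memory, link it on the VI's send queue, then
 * write the doorbell (a memory-mapped NIC register). Illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef struct descriptor {
    struct descriptor *next;       /* descriptors form a queue the NIC walks */
    uint64_t           buf_addr;   /* registered (pinned) buffer address */
    uint32_t           length;
    uint32_t           status;     /* NIC writes completion status here */
} descriptor_t;

typedef struct {
    descriptor_t      *send_tail;  /* tail of the posted send queue */
    volatile uint32_t *doorbell;   /* memory-mapped doorbell page on the NIC */
} vi_t;

void vi_post_send(vi_t *vi, descriptor_t *d, void *buf, uint32_t len)
{
    d->buf_addr = (uint64_t)(uintptr_t)buf;
    d->length   = len;
    d->status   = 0;
    d->next     = NULL;

    if (vi->send_tail)
        vi->send_tail->next = d;   /* append to the descriptor queue */
    vi->send_tail = d;

    /* Ringing the doorbell tells the NIC there is new work on this VI;
     * the NIC then DMAs the descriptor and the data buffer. */
    *vi->doorbell = 1;
}
```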

VIA Implementation Overview
[Figure: request and block-transfer path. The host holds memory-mapped doorbells, descriptor queues, and data buffers in kernel memory mapped to the application. Roughly: (1) the application writes a doorbell page; (2)-(3) the NIC issues a DMA request and reads the descriptor; (4)-(5) it issues another DMA request and reads the Tx/Rx buffers; … (7) it performs a DMA write back to the host.]

Current VIA Performance

VIA ahead
• You will be able to buy decent clusters
• Virtualization in host memory is easy
  – will it go beyond pinned regions?
  – still need to manage active endpoints (doorbells)
• Complex descriptor queues will hinder low-latency short messages
  – NICs will chew on them, but many instructions on the host
• Need to re-examine where error handling, flow control, retry are performed
• Interactions with scheduling, I/O, locking, etc. will dominate application speed-up
  – will demand new development methodologies

Conclusions
• Complete system on every node makes clusters a very powerful architecture
  – can finally get serious about I/O
• Extend the system globally
  – virtual memory systems,
  – schedulers,
  – file systems, . . .
• Efficient communication enables new solutions to classic systems challenges
• Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN
  – where SPAA and PODC meet