
Origin 2000 ccNUMA Architecture
Joe Goyette, Systems Engineer
goyette@sgi.com
Presentation Overview
1. ccNUMA Basics
2. SGI's ccNUMA Implementation (O2K)
3. Supporting OS Technology
4. SGI's Next-Generation ccNUMA (O3K) (brief)
5. Q&A
ccNUMA
cc: cache coherent; NUMA: Non-Uniform Memory Access
• Memory is physically distributed throughout the system
• Memory and peripherals are globally addressable
• Local memory accesses are faster than remote accesses (hence Non-Uniform Memory Access)
• Local accesses on different nodes do not interfere with each other
Typical SMP Model
(diagram: processors with snoopy caches sharing a central bus, main memory, and I/O)
Typical MPP Model
(diagram: nodes, each with its own operating system, main memory, processor, and I/O, connected by an interconnect network such as GSN, 100BaseT, or Myrinet)
Scalable Cache-Coherent Memory
Shared-memory systems (SMP): easy to program, hard to scale
Massively parallel systems (MPP): easy to scale, hard to program
Scalable shared-memory systems (ccNUMA): easy to program and easy to scale
Origin ccNUMA vs. Other Architectures (conventional SMP, clusters/MPP, other NUMA)
> Single address space
> Modular design
> All aspects scale as the system grows
> Low-latency, high-bandwidth global memory
Origin ccNUMA Advantage
(diagram comparing interconnect bisections: fixed-bus SMP, other NUMA, clusters/MPP, and the Origin 2000 ccNUMA fabric of nodes and routers)
IDC: NUMA is the Future
"Buses are the preferred approach for SMP implementations because of their relatively low cost. However, scalability is limited by the performance of the bus."
"NUMA SMP ... appears to be the preferred memory architecture for next-generation systems." - IDC, September 1998

Architecture type    1996 share   1997 share   Change
Bus-based SMP        54.7%        41.0%        -13.7 pts.
NUMA SMP              3.9%        20.8%        +16.9 pts.
Message Passing      16.4%        15.3%         -1.1 pts.
Switch-based SMP     13.0%        12.1%         -0.9 pts.
Uni-processor         9.4%         5.5%         -3.9 pts.
NUMA (uni-node)       1.5%         5.3%         +3.8 pts.

Source: High Performance Technical Computing Market: Review and Forecast, 1997-2002, International Data Corporation, September 1998
SGI's First Commercial ccNUMA Implementation: Origin 2000 Architecture
History of Multiprocessing at SGI
(chart, 1993-2000: Challenge, 2-36 CPUs; Origin 2000 ccNUMA introduced at 2-32 CPUs, later growing through 2-64 to 2-256 CPUs; Origin 3000, 2-1024 CPUs)
Origin 2000 Logical Diagram
(diagram: 32-CPU hypercube (3D) of nodes (N) connected through routers (R))
Origin 2000 Node Board
Basic building block: main memory with directory (additional directory memory for >32P systems), the Hub ASIC, and processors with their caches.
MIPS R12000 CPU
• 64-bit RISC design, 0.25-micron CMOS process
• Single-chip, four-way superscalar RISC dataflow architecture
• 5 fully pipelined execution units
• Supports speculative and out-of-order execution
• 8 MB L2 cache on Origin 2000, 4 MB on Origin 200
• 32 KB 2-way set-associative instruction and data caches
• 2,048-entry branch prediction table
• 48-entry active list
• 32-entry two-way set-associative Branch Target Address Cache (BTAC)
• Doubled L2 way-prediction table for improved L2 hit rate
• Improved branch prediction using a global history mechanism
• Improved performance-monitoring support
• Maintains code and instruction set compatibility with the R10000
Memory Hierarchy
1. Local CPU registers
2. Local CPU cache (5 ns)
3. Local memory (318 ns)
4. Remote memory (554 ns)
5. Remote caches
Directory-Based Cache Coherency
Cache coherency: the system hardware guarantees that every cached copy remains a true reflection of the memory data, without software intervention.
Directory bits consist of two parts:
a. an 8-bit integer identifying the node that has exclusive ownership of the data
b. a bit map indicating which nodes have copies of the data in their caches
(a small illustrative sketch of this entry format follows)
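To make the directory format concrete, here is a minimal, purely illustrative sketch (not the actual Hub hardware layout; the field widths, node count, and variable names are assumptions) that models an exclusive-owner field plus a sharer bit map using standard Fortran bit intrinsics:

      program directory_entry_demo
c     Illustrative model only: one directory entry = exclusive-owner
c     field plus a per-node sharer bit map, as described above.
      integer owner, sharers, node
      owner   = -1
      sharers = 0
c     Nodes 0 and 3 read the cache line: set their sharer bits.
      sharers = ibset(sharers, 0)
      sharers = ibset(sharers, 3)
c     Node 2 writes the line: it becomes exclusive owner and the
c     directory is used to invalidate every other cached copy.
      owner   = 2
      do node = 0, 7
         if (btest(sharers, node) .and. node .ne. owner) then
            print *, 'send invalidate to node', node
         end if
      end do
      sharers = 0
      end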
Cache Example
1. Data is read into cache for a thread on CPU 0.
2. Threads on CPUs 1 and 2 read the data into their caches.
3. The thread on CPU 2 updates the data in its cache (the cache line is set exclusive).
4. Eventually the stale cached copies of the line are invalidated.
Router and Interconnect Fabric
• 6-way non-blocking crossbar (9.3 Gbytes/sec)
• Link Level Protocol (LLP) uses CRC error checking
• 1.56 Gbytes/sec (peak, full duplex) per port
• Packet-delivery prioritization (credits, aging)
• Uses an internal routing table and supports wormhole routing
• Internal buffers (SSR/SSD) down-convert 390 MHz external signaling to the core frequency
• Three ports connect to external 100-conductor NUMAlink cables
(diagram: global switch interconnect of nodes and routers)
Origin 2000 Module
(diagram: node boards, each with processors, caches, Hub, main memory, and directory, connected through the midplane to a router board and the XBOW I/O crossbar)
Modules Become Systems
Deskside (1 module): 2-8 CPUs
Rack (2 modules): 16 CPUs
Multi-rack (4 modules): 32 CPUs
Etc., up to 128 CPUs and beyond
Origin 2000 Grows to a Single Rack
Single-rack system:
• 2-16 CPUs
• 32 GB memory
• 24 XIO I/O slots
Origin 2000 Grows to Multi-Rack
Multi-rack system:
• 17-32 CPUs
• 64 GB memory
• 48 XIO I/O slots
• 32-processor hypercube building block
Origin 2000 Grows to Large Systems
Large multi-rack systems:
• 2-256 CPUs
• 512 GB memory
• 384 I/O slots
Bisection Bandwidth as the System Grows
(chart)
Memory Latency as the System Grows
(chart)
Origin 2000 Bandwidth Scales
STREAM Triad results (chart): Origin 2000/300 MHz, Origin 2000/250 MHz, Sun UE10000, Compaq/DEC 8400, HP/Convex V, HP/Convex SPP
Performance on an HPC Job Mix
SPECfp_rate95 results (chart): Origin 300 MHz, IBM, Origin 250 MHz, Origin 195 MHz, DEC, Sun, HP
Enabling Technologies
IRIX: NUMA-Aware OS and System Utilities
Default Memory Placement
Memory is allocated on a "first-touch" basis:
- on the node where the process that defines (first touches) the page is running
- or as close to it as possible (to minimize latency)
- developers should therefore initialize work areas in the newly created threads that will use them (see the sketch below)
The IRIX scheduler maintains process affinity:
- it re-schedules jobs on the processor where they ran last
- or on the other CPU in the same node
- or as close as possible (to minimize latency)
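A minimal sketch of first-touch-friendly initialization (illustrative only; the array names, sizes, and the use of OpenMP here are assumptions, not code from the presentation). Because a page is placed on the node of the thread that first touches it, initializing arrays inside the same parallel loop structure later used for computation keeps each thread's pages local:

      program first_touch
      integer n, i
      parameter (n = 4*1024*1024)
      double precision a(n), b(n)
c     Parallel initialization: each thread first-touches the pages of
c     the iterations it owns, so IRIX places those pages on (or near)
c     that thread's node.
!$OMP PARALLEL DO PRIVATE(i), SHARED(a, b)
      do i = 1, n
         a(i) = 0.0d0
         b(i) = 0.5d0*i
      end do
!$OMP END PARALLEL DO
c     The compute loop uses the same default iteration distribution,
c     so each thread mostly touches locally placed pages.
!$OMP PARALLEL DO PRIVATE(i), SHARED(a, b)
      do i = 1, n
         a(i) = a(i) + 2.0d0*b(i)
      end do
!$OMP END PARALLEL DO
      print *, 'a(n) =', a(n)
      end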
Alternatives to the "First-Touch" Policy
Round-robin allocation:
- data is distributed at run time among all nodes used for execution
- setenv _DSM_ROUND_ROBIN
Dynamic Page Migration
• IRIX can keep track of run-time memory access patterns and dynamically copy pages to a new node.
• This is an expensive operation: it requires a daemon, TLB invalidations, and the memory copy itself.
• setenv _DSM_MIGRATION ON
• setenv _DSM_MIGRATION_LEVEL 90
Explicit Placement: Source Directives

      integer i, j, n, niters
      parameter (n = 8*1024, niters = 1000)
c-----Note that the distribute directive is used after the arrays
c-----are declared.
      real a(n), b(n), q
c$distribute a(block), b(block)
c-----initialization
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo
c-----real work
      do it = 1, niters
         q = 0.01*it
c$doacross local(i), shared(a, b, q), affinity(i) = data(a(i))
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo
      end
Explicit Placement: dprof / dplace
• Used for applications that don't use libmp (i.e., explicit sproc, fork, pthreads, MPI, etc.); see the hypothetical example below
• dprof: profiles an application's memory access pattern
• dplace can:
  - change the page size used
  - enable page migration
  - specify the topology used by the threads of a parallel program
  - indicate resource affinities
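A hypothetical sketch only: the placement-file directives and the -place flag below are recalled from the IRIX dplace(1)/dplace(5) documentation and should be checked against the man pages on the target system. The file (assumed here to be named placement_file) requests two nodes, declares four threads, and spreads the threads across the nodes; the program is then run under that placement:

memories 2 in topology cube
threads 4
distribute threads across memories

dplace -place placement_file ./a.out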
SGI's 3rd-Generation ccNUMA Implementation: Origin 3000 Family
Compute Module vs. Bricks
(diagram: the Origin 2000 compute module (8P) versus the Origin 3000 system "bricks": C-brick (compute), R-brick (router), P-brick (PCI I/O))
Feeds and Speeds
(table)
Taking Advantage of Multiple CPUs
Parallel Programming Models Available on the Origin Family
Many Different Models and Tools to Choose From
• Automatic Parallelization Option: compiler flags
• Compiler source directives: OpenMP, c$doacross, etc.
• Explicit multi-threading: pthreads, sproc
• Message-passing APIs: MPI, PVM
Computing the Value of π: Simple Serial Version

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
Automatic Parallelization Option
• Add-on option for the SGI MIPSpro compilers
• The compiler searches for loops that it can parallelize

f77 -apo compute_pi.f77
setenv MP_SET_NUM_THREADS 4
./a.out
OpenMP Source Directives

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(X), SHARED(W), REDUCTION(+:sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
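One way to build and run the OpenMP version (a hedged sketch: -mp is the MIPSpro switch for directive-based parallelism, and OMP_NUM_THREADS is the standard OpenMP thread-count variable; adjust for your environment):

f77 -mp compute_pi.f
setenv OMP_NUM_THREADS 4
./a.out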
Message Passing Interface (MPI)

      program compute_pi
      include 'mpif.h'
      integer n, i, myid, numprocs, rc
      double precision w, x, sum, pi, mypi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
Message Passing Interface (MPI), continued

      mypi = w * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $     MPI_COMM_WORLD, ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
         print *, 'computed pi =', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
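A typical way to build and run the MPI version on IRIX with SGI's MPI (a hedged sketch: the -lmpi library name and the mpirun launcher match SGI's Message Passing Toolkit, but verify against the installed MPI):

f77 -o compute_pi compute_pi.f -lmpi
mpirun -np 4 ./compute_pi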