- Number of slides: 61
The Cray XD1 Computer and its Reconfigurable Architecture
Dave Strenski (stren@cray.com)
July 11, 2005
Outline
XD1 overview: Architecture; Interconnect; Active Manager
XD1 FPGAs: Architecture; Example execution; Core development strategy
FORTRAN to VHDL considerations: Memory allocation; Unrolling; One versus many cores
XD1 FPGA running examples: MTA kernel and Ising Model; FFT kernel from DSPlogic; Smith-Waterman kernel from Cray; LANL traffic simulation code; Other works in progress
Slide 2
Cray Today (Nasdaq: CRAY)
Formed on April 1, 2000 as Cray Inc.; headquartered in Seattle, WA; roughly 900 employees across 30 countries.
Four major development sites: Chippewa Falls, WI; Mendota Heights, MN; Seattle, WA; Vancouver, Canada.
Significant progress in the market: X1 sales and the Sandia National Laboratory Red Storm contract; Oak Ridge National Laboratory Leadership Class system; DARPA HPCS Phase II funding of $50M through 2006 for Cascade; acquired OctigaBay; 70+ Cray XD1s sold to date. Slide 3
Cray XD1 Overview Slide 4
Cray XD1 System Architecture
Compute: 12 AMD Opteron 32/64-bit x86 processors running High Performance Linux.
RapidArray Interconnect: 12 communications processors, 1 Tb/s switch fabric.
Active Management: dedicated processor.
Application Acceleration: 6 co-processors.
Processors are directly connected via the integrated switch fabric. Slide 5
Cray XD1 Chassis
Front: six two-way Opteron blades, fans, six SATA hard drives, six FPGA modules.
Rear: 0.5 Tb/s switch with 12 x 2 GB/s ports to the fabric; three I/O slots (e.g. JTAG); four 133 MHz PCI-X slots; connector for a 2nd 0.5 Tb/s switch and 12 more 2 GB/s ports to the fabric. Slide 6
Compute Blade
Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR400 registered ECC memory; RapidArray communications processor; connector to the main board. Slide 7
Cray Innovations: Balanced Interconnect, Active Management, Application Acceleration. The Cray XD1: performance and usability. Slide 8
Architecture (Xeon vs. Opteron)
Intel Xeon: the processor reaches memory through a Northbridge (DDR memory controller) and I/O through a Southbridge or PCI-X bridge; the PCI-X slot limits I/O to 1 GB/sec (the "speed limit").
AMD Opteron: 6.4 GB/sec on-chip DDR memory controller, 6.4 GB/sec HyperTransport, and a 3.2 GB/sec HT link to the RapidArray I/O. Slide 9
Removing the Bottleneck (GBytes of memory, GFLOPS of processor, GB/s of I/O and interconnect)
Xeon server: 5.3 GB/s DDR333 memory; 1 GB/s PCI-X I/O; 0.25 GB/s GigE interconnect.
Cray XD1: 6.4 GB/s DDR400 memory; 8 GB/s RapidArray (RA) interconnect.
Cray XT3: 6.4 GB/s DDR400 memory; SeaStar (SS) interconnect, 31 GB/s.
Cray X1: 34.1 GB/s memory; 102 GB/s interconnect. Slide 10
Communications Optimizations
Cray communications libraries: MPI 1.2, TCP/IP, PVM, Shmem, Global Arrays; system-wide process and time synchronization.
RapidArray communications processor: HT/RA tunnelling with bonding; routing with route redundancy; reliable transport; short-message latency optimization; DMA operations; system-wide clock synchronization.
Direct Connected Processor architecture: the AMD Opteron 2XX connects to the RapidArray communications processor at 3.2 GB/s, with 2 GB/s into the fabric. Slide 11
Synchronized Linux Scheduler Slide 12
Reducing OS Jitter
Cray XD1 Linux synchronization increases application scaling: it improves efficiency by 42% and lowers application license fees for an equivalent processor count. Slide 13
Direct Connect Topology
1 Cray XD1 chassis: 12 AMD Opteron processors, 58 GFLOPS, 8 GB/s between SMPs, 1.8 µsec interconnect, integrated switching.
3 Cray XD1 chassis: 36 AMD Opteron processors, 173 GFLOPS, 8 GB/s between SMPs, 2.0 µsec interconnect, integrated switching.
25 Cray XD1 chassis (two racks): 300 AMD Opteron processors, 1.4 TFLOPS, 2-8 GB/s between SMPs, 2.0 µsec interconnect, integrated switching. Slide 14
Fat Tree Topology
12 Cray XD1 chassis: 144 AMD Opteron processors, 691 GFLOPS, 4/8 GB/s between SMPs, 2.0 µsec interconnect; fat-tree switching, integrated first and third order; 6/12 RapidArray spine switches (24 ports each). Slide 15
MPI Latency
RapidArray short-message latency is 4 times lower than InfiniBand; the Cray XD1 has sent 2 KB before others have sent their first byte. Slide 16
MPI Throughput
The Cray XD1 delivers 2X the bandwidth of InfiniBand (1 KB message size). Slide 17
Active Manager System
Usability: single-system command and control; CLI and web access.
Resiliency: dedicated management processors, real-time OS, and communications fabric; proactive background diagnostics with self-healing.
Active Management software: automated management for exceptional reliability, availability, and serviceability. Slide 18
Active Manager GUI: SysAdmin. The GUI provides quick access to status information and system functions. Slide 19
Automated Management
Users and administrators get single-system command and control across the compute partitions, front-end partition, and file-services partition: partition management, Linux configuration, hardware monitoring, software upgrades, file-system management, data backups, network configuration, accounting and user management, security, performance analysis, and resource and queue management. Slide 20
Self-Monitoring
The Active Manager (dedicated management processor, OS, and fabric) monitors parity, heartbeat, temperature, fan speed, diagnostics, air velocity, voltage, and current across the hard drives, processors, memory, fans, power supplies, and interconnect. Slide 21
Thermal Management Slide 22
File Systems: Local Disks
One SATA hard drive per SMP; a local Linux EXT2/3 directory per drive, reachable over RapidArray. Slide 23
File Systems: SAN
An SMP with a Fibre Channel HBA acts as a file server (EXT2/3) for the SAN; compute SMPs mount it over NFS. Slide 24
Programming Environment
Operating system: Cray HPC enhanced Linux distribution (derived from SuSE 8.2).
Management: Active Manager for system administration and workload management.
Application Acceleration Kit: IP cores, reference designs, command-line tools, API, JTAG interface card.
Scientific libraries: AMD Core Math Library (ACML).
Shared memory access: Shmem, Global Arrays, OpenMP.
3rd-party tools: Fortran 77/90/95, HPF, C/C++, Java, Etnus TotalView.
Communications libraries: MPI 1.2.
The Cray XD1 is standards-based for ease of programming: Linux, x86, MPI. Slide 25
Cray XD1's FPGA Architecture Slide 26
The Rebirth of Co-processing
1978: Intel 8086 processor with 8087 math coprocessor.
2004: AMD Opteron with Xilinx Virtex-II Pro FPGA. Slide 27
Application Acceleration
Reconfigurable computing tightly coupled to the Opteron: the FPGA acts like a programmable co-processor and performs vector operations.
Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation.
Superlinear speedup for key algorithms. Slide 28
Processor/switch configurations: one or two RapidArray switches per chassis, giving 4 configurations. Slide 29
Application Acceleration FPGA
The compute processor hands a data set to the Application Acceleration FPGA, which applies fine-grained parallelism over every array element (do for each array element ... end) for 100x potential speedup. Slide 30
Compute Blade with Expansion Module: Opteron processor with DDR400 DRAM, RapidArray processor, and Application Acceleration FPGA. Slide 31
Interconnections
The Opterons connect over HyperTransport (HT) to the RapidArray transport and the expansion module; each expansion module also links to its neighbor modules over RapidArray. Slide 32
Module Detail
HyperTransport at 3.2 GB/s connects the node to the RAP; the Acceleration FPGA connects to the RAP at 3.2 GB/s and to its QDR II SRAM, with 2 GB/s RapidArray links to each neighbor compute module. Slide 33
Virtex-II Pro FPGA
Virtex-II series fabric with multi-gigabit transceivers (RocketIO MGTs), Block RAM, and embedded 300 MHz PowerPC cores.
XC2VP30 to XC2VP50: 422 MHz max clock rate; 30,000 to 53,000 LEs; 3 to 5 million 'system gates'; 136 to 232 Block RAMs; 136 to 232 18x18 multipliers; 8 to 16 RocketIO MGTs. Slide 34
Virtex-II Family Logic Blocks
A logic element (LE) is a LUT (with SRL16/RAM16 modes and carry chain) plus a register; 1 slice = 2 LEs; 1 CLB = 4 slices.
XC2VP30-6 examples:
Function                      f (MHz)  LEs   BRAM  Mult.  Number possible
64-bit adder                  194      66    0     0      450
64-bit accumulator            198      64    0     0      450
18x18 multiplier              259      88    0     1      136
SP FP multiplier              188      252   0     4      34
1024-point FFT (16-bit cplx)  140      5526  22    12     5
Slide 35
Module Variants
A variety of Application Acceleration variants can be manufactured by populating different pin-compatible FPGAs and QDR II RAMs.
FPGAs: XC2VP30-6 (30,816 LEs, 2 PowerPCs, 136 18x18 multipliers); XC2VP40-6 (43,632 LEs, 2, 192); XC2VP50-7 (53,136 LEs, 2, 232).
RAMs: K7R163682 (200 MHz, 512K x 36, qty 4, 8 MB module memory); K7R323682 (200 MHz, 1M x 36, qty 4, 16 MB); K7R643682 (future: 200 MHz, 2M x 36, qty 4, 32 MB).
Slide 36
Processor to FPGA
Since the Acceleration FPGA is connected to the local processing node through its HyperTransport I/O bus, the FPGA can be accessed directly using reads and writes. Additionally, a node can transfer large blocks of data to and from the Acceleration FPGA using a simple DMA engine in the FPGA's RapidArray Transport core. Slide 37
FPGA to Processor
The Acceleration FPGA can also directly access the memory of a processor; read and write requests can be performed in bursts of up to 64 bytes. The Acceleration FPGA can access processor memory without interrupting the processor. Memory coherency is maintained by the processor. Slide 38
FPGA to Neighbor
Each Acceleration FPGA is connected to its neighbors in a ring using the Virtex-II Pro MGT (RocketIO) transceivers. The XC2VP40 FPGAs provide a 2 GB/s link to each neighbor FPGA; the XC2VP50 FPGAs provide a 3 GB/s link. Slide 39
Cray XD1 FPGA Programming Slide 40
Hard, but it could be worse! Slide 41
Application Acceleration Interfaces
User logic connects to the RapidArray Transport core (TX/RX to the RAP) and to QDR RAM interface cores (ADDR(20:0), D(35:0), Q(35:0)) for each QDR II SRAM.
XC2VP30-50 running at up to 200 MHz. Four QDR II RAMs with over 400 HSTL-I I/Os at 200 MHz DDR (400 MTransfers/s). 16-bit simplified HyperTransport interface at 400 MHz DDR (800 MTransfers/s). The QDR and HT interfaces take up less than 20% of the XC2VP30; the rest is available for user applications. Slide 42
FPGA Linux API
Administration commands: fpga_open (allocate and open FPGA), fpga_close (close allocated FPGA), fpga_load (load binary into FPGA).
Operation commands: fpga_start (start FPGA, release from reset), fpga_reset (soft-reset the FPGA).
Mapping commands: fpga_set_ftrmem (map an application virtual address to allow access by the FPGA), fpga_memmap (map FPGA RAM into application virtual space).
Control commands: fpga_wrt_appif_val (write data into the application interface register space), fpga_rd_appif_val (read data from the application interface register space).
Status commands: fpga_status (get status of the FPGA).
DMA commands: fpga_put (send data to the FPGA), fpga_get (receive data from the FPGA).
Interrupt/blocking commands: fpga_intwait (block the process waiting for an FPGA interrupt). Slide 43
Additional High-Level Tools
High-level flow: SystemC and ANSI C/C++ (e.g. int mask(int a, int m) { return a & m; }) through C synthesis tools from Adelante, Celoxica, Forte Design Systems, Mentor Graphics, Prosilog, and Synopsys; MATLAB/Simulink from The MathWorks through Xilinx System Generator for DSP; plus the DSPlogic RCIO library.
Standard flow: VHDL/Verilog (e.g. process (a, m) begin z <= a and m; end process;) through synthesis tools from Mentor Graphics, Synopsys, Synplicity, and Xilinx to a gate-level EDIF file, then place and route to a binary file for the FPGA. Slide 44
Standard Development Flow
Write VHDL, Verilog, or C; verify and simulate with ModelSim; synthesize (Xilinx ISE) and merge with the supplied cores (RAP interface, QDR RAM interface, DSPlogic RCIO core); implement (place and route) to a binary file; add metadata; download to the XD1 and load/run on the Acceleration FPGA from the command line or an application; debug with Xilinx ChipScope or ModelSim. Slide 45
On-Target Debugging
Integrated Logic Analyzer (ILA) blocks capture and store internal logic events based on user-defined triggers. Trapped events can then be read out over JTAG (via a Xilinx Parallel Cable III/IV or MultiLINX and the OctigaBay JTAG I/O card) and displayed on a PC by the Xilinx ChipScope software. Slide 46
FORTRAN to VHDL ideas

      program test
      integer xyz
      integer a, b, c, n(1000), temp(1000)
      do i = 1, 1000
        n(i) = xyz(a, b, c, temp)
      end do
      end

The variable temp is allocated once, outside the loop calling the function. This is efficient FORTRAN code because you only allocate the space once. With an FPGA design you would instead want to allocate the temporary space on the FPGA itself. Slide 47
FORTRAN to VHDL ideas

Real version:
      program test
      integer xyz
      integer a, b, n(1000)
      real delta = 0.01
      do i = 1, 1000
        n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a .gt. b*delta) then
        xyz = a
      else
        xyz = b
      endif
      return
      end

Integer version:
      program test
      integer xyz
      integer a, b, n(1000)
      integer delta = 100   ! 1/delta
      do i = 1, 1000
        n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a*delta .gt. b) then
        xyz = a
      else
        xyz = b
      endif
      return
      end

Convert real variables to integers where possible. Slide 48
FORTRAN to VHDL ideas

      function xyz (i, j, mode)
      integer i, j, mode
      do i = 1, 1000
        do j = 1, 1000
          if (mode .eq. 2) then
            if (a(i,j,k) .gt. b(i,j,k)) then
              xyz = a
            else
              xyz = b
            end if
          else
            xyz = 0
          end if
        end do
      end do
      return
      end

Move code that doesn't change outside the function. Maybe make multiple cores, one for each mode. Slide 49
Mixing FPGAs and MPI
It gets a bit tricky mixing FPGAs with an MPI code. The XD1 has 2 or 4 Opterons per node but only one FPGA, and only one Opteron can grab the FPGA at a time. When two jobs share a node, one job's ranks may find the FPGA already held by the other job and unavailable. Slide 50
Cray XD1 FPGA Examples Slide 51
Random Number Example
The FPGA implements the "Mersenne Twister" RNG algorithm often used for Monte Carlo analysis. The algorithm generates integers with a uniform distribution and won't repeat for 2^19937 - 1 values.
The FPGA automatically transfers generated numbers into two buffers located in the processor's local memory. The processor application alternately reads the pseudo-random numbers from the two buffers; as the processor marks a buffer "empty", the FPGA refills it with new numbers. Slide 52
MTA Example
Load/start a.out in the Opteron's memory; call FPGA_OPEN, FPGA_LOAD, FPGA_SET_FTRMEM (allocate memory for buffers A and B), and FPGA_START. The FPGA checks the buffer flags, generates random numbers, and toggles a buffer's flag when it is full; the Opteron consumes the random numbers. The Opteron and FPGA run asynchronously until the application calls FPGA_CLOSE and the Opteron exits. Slide 53
Random Number Results
Original C code on a 2.2 GHz Opteron: ~101 million 32-bit integers/second.
VHDL code on the FPGA (XC2VP30-6) at 200 MHz: ~319 million 32-bit integers/second, using ~25% of the chip (including the RapidArray core).
The FPGA provides 3X the performance of the fastest available Opteron, and the algorithm takes up a small portion of the smallest FPGA. Performance is limited by the speed at which numbers can be written into processor memory, not by FPGA logic; the logic could easily produce 1.6 billion integers/second by increasing parallelism. Slide 54
Ising Model with Monte Carlo
Code was developed by Martin Siegert at Simon Fraser University. It uses the MTA random number generation design and runs 2.5 times faster with the FPGA. It should run faster still with the newest MTA design, which returns floating-point random numbers instead of integers. A tar file is available for the Cray XD1. Slide 55
FFT design from DSPlogic
Code was developed by Mike Babst and Rod Swift at DSPlogic. It uses 16-bit fixed-point data as input and 32-bit fixed-point data as output, which yields an accuracy similar to the single-precision results posted at the FFTW web site (www.fftw.org). A one-dimensional complex FFT of length 65536 on the FPGA is about 5 times faster than on the 2.2 GHz Opteron using FFTW. Packing the data more tightly can double the performance to 10x. Performance depends on the size of the data. Slide 56
Smith-Waterman
Code was developed internally by Cray. CUPS = cell updates per second; rate = FPGA frequency * cells/clock * number of S-W processing elements.
Current: 80 MHz * 1 * 32 = 2.6 billion CUPS, using 60% of the chip.
Optimization: 100 MHz * 1 * 50 = 5 billion CUPS.
Virtex-4 FPGA: 100 MHz * 150 = 15 billion CUPS.
Opteron using SSEARCH34: 100 million CUPS.
The current version runs 25 times faster than a 2.2 GHz Opteron. A nucleotide (4-bit) version is running in house; an amino-acid (8-bit) version is just finished and is being incorporated into SSEARCH to make it easier to use. Smith-Waterman on the FPGA is about 10 times faster than BLAST on the Opteron. Slide 57
Los Alamos Traffic Simulation
Code was developed by Justin Tripp, Henning Mortveit, Anders Hansson, and Maya Gokhale at Los Alamos National Laboratory. It uses the FPGA for straight road sections and the Opteron for everything else, and runs 34.4 times faster with the FPGA relative to a 2.2 GHz Opteron. System-integration issues must be optimized to exploit this speedup in the overall simulation. Slide 58
Other XD 1 FPGA Projects Financial company using the random number generation core for a Monte Carlo simulation. Seismic companies using FPGAs for FFT and convolutions. Pharmaceutical companies using FPGAs for searching and sorting. NCSA is working on a civil engineering “dirt” code. University of Illinois is working on porting part of NAMD to an FPGA. Slide 59
Other Useful FPGA Designs
JPEG 2000 developed by Barco Silex, currently running on Virtex FPGAs; working with them on a real-time, high-resolution compression project.
64-bit floating-point matrix multiplication by Ling Zhuo and Viktor Prasanna at the University of Southern California: 8.3 Gflops on an XC2VP125, compared to 5.5 Gflops on a 3.2 GHz Xeon.
Finite-Difference Time-Domain (FDTD) by Ryan Schneider, Laurence Turner, and Michal Okoniewski at the University of Calgary. Slide 60
Questions Slide 61