
Reconfigurable Supercomputing
Challenges in Computing
• Space and power limits:
• Large systems need space…
Ø A hypothetical 50-node, 1-rack-unit system requires an estimated 20 square feet of floor space for air flow
Ø One rack (dual core) delivers 100 CPU units
Ø Speedup potential is only 50 for a 99%-parallel application
• …and cooling systems
Ø The need for ambient-air cooling increases with CPU power
Ø Multiple cores still mean more cooling per node
• Bottom line: building-size refrigerators (data centers) are expensive to build, expand, or retrofit
Ø To build: $67 M to build a data center consuming 15 MW of power
Ø To operate: $12 M+ per year at 40% capacity [Stahlberg 06]
Challenges in Computing
• Challenges in the face of increased demands:
• Increasing amounts of data generated
Ø By research simulations and instruments
Ø By surveillance sensors, cameras, RFID
• Longer secure record keeping
Ø Government regulations (SOX, privacy, HIPAA)
Ø Protecting and leveraging R&D intellectual property
• Data volumes are soaring
Ø GenBank volume doubles every 18 months
Ø [Walmart preparing to add TBs regularly for RFID]
• Increased application complexity
• Demand for even lower times to solution
A Solution in Reconfigurable Computing
• FPGA computing is well established
Ø MAPLD, RECON, ACM/IEEE, etc.
Ø FCCM has been around for 14 years
• Diverse tools employed by logic designers in lieu of an ASIC approach
Ø Generalizable platform
Ø Reconfigurable – an option to change without physical replacement
• Well-developed embedded markets
Ø Vehicle control systems
Ø Defense systems (radars, missiles, aircraft)
Ø Medical systems
Ø Communication/networking hardware
Application-Specific vs. General-Purpose Supercomputing
• Application-specific acceleration:
Ø Cryptography,
Ø Image processing,
Ø Bioinformatics
• Supercomputing (HPC):
Ø Each node:
− A commercial microprocessor +
− One or more large commercial FPGAs
• Largest supercomputer companies in this space:
1. Cray Research,
2. SRC,
3. SGI (Silicon Graphics, Inc.)
Multi-Core Speedup
• Amdahl’s Law: Parallel Speedup = 1 / (Serial% + (1 − Serial%)/N)
Ø Serial% = 6.7%, N = 16 cores: Speedup ≈ 8
Ø Serial% = 20%, N = 6 cores: Speedup = 3
Multi-Core Speedup
• Amdahl’s Law:
Ø Speedup from multiple cores is limited.
Ø Even when the number of cores is large, speedup is limited by the serial part:
− If 88% of the operations are not parallelizable (regardless of the number of cores): max speedup = 1/(1 − 0.12) = 1.136!
− To achieve 50× speedup, at least 98% of the application must be appropriate for substantial acceleration (worked out below).
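For reference, the numbers quoted on these two slides all follow directly from the formula. A worked check in LaTeX, writing $s$ for the serial fraction:

\[
S(N) = \frac{1}{s + (1 - s)/N}
\]
\[
s = 0.067,\; N = 16:\quad S = \frac{1}{0.067 + 0.933/16} \approx 8
\qquad\quad
s = 0.20,\; N = 6:\quad S = \frac{1}{0.20 + 0.80/6} = 3
\]
\[
\lim_{N \to \infty} S(N) = \frac{1}{s}:
\quad s = 0.88 \;\Rightarrow\; S_{\max} = \frac{1}{0.88} \approx 1.136,
\qquad S_{\max} \ge 50 \;\Rightarrow\; s \le 0.02
\]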
FPGA Technology
• The potential of FPGAs in supercomputing went untapped until recently, due to:
Ø Limited FPGA resources (e.g., for FP operations),
Ø General unavailability of tightly coupled RC/CPU systems,
Ø Lack of high-level programming languages suitable for rapid FPGA code development.
• Great progress in FPGA technology in recent years:
Ø n × 10² MHz clock rates
Ø n × 10 Gbps serial I/O
Ø On-board scalar processor cores
Ø 45 nm process allowing for
− low power consumption,
− very high logic density
Ø Terabyte/sec memory bandwidth
Ø n × 10 Tera-ops/sec
− A powerful device for supercomputing
Common FPGA Acceleration Techniques
1. Coprocessor model:
Ø The FPGA is available via an I/O bus,
− typically PCI(X) or VME.
Ø Data is loaded into the FPGA in a DMA operation,
Ø Results are loaded back to main memory.
• Advantages:
Ø PCI and VME buses are available across all general-purpose computing platforms.
− Solid and well-known development environment.
Ø Inexpensive solution,
Ø Can use existing ASIC simulation software solutions.
Common FPGA Acceleration Techniques
1. Coprocessor model:
• Disadvantages:
Ø The I/O bus limits bandwidth to main memory.
Ø HDL use is difficult for most ISV or end-user programmers.
− Solution 1: Schematic-based tools allow the FPGA programmer to take existing modules (e.g., an FFT module) and link them together with other modules; a bitstream is created.
− Solution 2: A compiler converts C-like constructs into HDL.
− Problem: inefficient conversion.
− Tools usually allow some sections to be described in HDL.
− (A minimal host-side sketch of the coprocessor model follows.)
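To make the coprocessor model concrete, here is a minimal host-side sketch in C. The driver API (fpga_open, fpga_dma_write, fpga_start, fpga_wait, fpga_dma_read) is hypothetical and only illustrative of the flow described above; real vendor SDKs differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver API for a PCI(X)/VME-attached FPGA board.
 * These names are placeholders, not a real vendor library. */
int fpga_open(const char *device);
int fpga_dma_write(int dev, const void *src, size_t bytes); /* host -> FPGA  */
int fpga_start(int dev, uint32_t kernel_id);                /* kick off logic */
int fpga_wait(int dev);                                     /* block until done */
int fpga_dma_read(int dev, void *dst, size_t bytes);        /* FPGA -> host  */

/* Coprocessor model: DMA the input to the board, run the kernel,
 * then DMA the results back to main memory. */
int run_kernel(const float *in, float *out, size_t n)
{
    int dev = fpga_open("/dev/fpga0");   /* device path is illustrative */
    if (dev < 0)
        return -1;
    fpga_dma_write(dev, in, n * sizeof *in);
    fpga_start(dev, /* kernel_id = */ 0);
    fpga_wait(dev);
    fpga_dma_read(dev, out, n * sizeof *out);
    return 0;
}
```

Every invocation pays two trips across the I/O bus, which is exactly the bandwidth limitation the slide lists first.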
Challenges
• Challenges in using FPGAs for supercomputing:
1. Floating-point arithmetic: There is no native support for floating-point arithmetic on FPGAs.
− Floating-point soft cores tend to require large amounts of area.
− To achieve high clock frequencies, they must be deeply pipelined.
− Nonetheless, the floating-point performance of FPGAs is competitive with that of GPPs and is increasing at a faster rate.
Challenges
• Challenges in using FPGAs for supercomputing:
2. Low clock frequency: especially because of the programmable interconnect.
− The main way FPGAs obtain high performance is through the exploitation of pipelining and parallelism (see the sketch below).
− The use of pipelining and parallelism has limits.
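A minimal illustration in plain C of why this caveat matters: whether a loop can be pipelined or parallelized at all depends on its dependence structure, not on the tool.

```c
/* Pipelines and parallelizes well: iterations are independent, so a new
 * iteration can enter the pipeline every cycle, and the loop can also be
 * unrolled into parallel hardware copies. */
void scale(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* Pipelines poorly: each iteration needs the previous iteration's result,
 * so the loop-carried dependence stalls a deep pipeline. */
float running_sum(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];   /* depends on s from the previous iteration */
    return s;
}
```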
Challenges
• Challenges in using FPGAs for supercomputing:
3. Development of designs: Designs for FPGAs have traditionally been written in HDL.
− Unfamiliar to most in the scientific computing community.
− At this lower level, developing large designs can be even more difficult than with high-level programming languages.
− Development time has followed a digital-design timeframe:
− Months to define, design, develop, and deliver an executable image.
− Numerous efforts aim at facilitating high-level spec-to-hardware flows.
− These efforts are becoming especially important with the development of reconfigurable computers.
− All solutions are different (Nallatech, Mitrionics, DSPLogic, Celoxica, Impulse C, SRC, …)
Challenges
• Challenges in using FPGAs for supercomputing:
4. Acceleration of complete applications: Much of the early acceleration work focused on individual tasks (e.g., matrix operations and FFT).
− Impressive speedups in individual kernels do not necessarily translate into impressive speedups when the kernels are integrated back into complete applications.
Floating-Point Arithmetic
• FP arithmetic:
Ø Many HPC applications require double-precision FP arithmetic.
− So prevalent that the benchmark used to rank supercomputers (LINPACK) relies heavily on DP FP math.
Ø FPGA vendors tailor their products toward their dominant customers:
− DSP, network applications, embedded computing
− None need FP performance.
Ø Embedded FPUs:
− [Chong 09] A flexible, generic embedded FPU which, over a variety of applications, improves performance and saves a significant amount of FPGA real estate compared to implementations on current FPGAs.
− Can be configured to perform a wide range of operations:
− Floating-point adder and multiplier: one double-precision operation or two single-precision operations in parallel.
− Access to the FPU’s large integer multiplier, adder, and shifters.
Supercomputing Companies
• Supercomputing companies offering clusters featuring programmable logic:
1. SRC Computers,
2. DRC Computer Corp.,
3. Cray,
4. Starbridge Systems,
5. SGI.
Speedup Comparison
• Published FPGA supercomputing application results (table; SP: single precision, DP: double precision) [Craven 07]
Ø Shows only the trend (results are not normalized to a common processor).
Ø Many of the entries compare a highly optimized FPGA FP implementation with non-optimized software.
FPGAs for Floating-Point Arithmetic
Ø [Dou 05]: One of the highest performance benchmarks, 15.6 GFLOPS for FP matrix multiplication:
− Places 39 floating-point processing elements on a theoretical Xilinx XC2VP125 FPGA.
− A linear array of MAC elements, linked to a host processor providing memory access.
− Pipelined to a depth of 12, permitting operation at frequencies up to 200 MHz.
− Interpolating for a Virtex-II Pro device: 12.4 GFLOPS,
− 4× a 3.2 GHz Intel Pentium.
− Taken as an absolute upper limit on FPGA DP FP performance.
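As a sanity check, the headline number follows from the slide's own figures, assuming each MAC element contributes one multiply and one add per cycle (2 FLOPs):

\[
39 \text{ PEs} \times 2 \text{ FLOPs/cycle} \times 200\,\text{MHz} = 15.6 \text{ GFLOPS}
\]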
Cost-Performance Comparison
• Dou architecture vs. µP, considering cost-performance:
Ø The worst processor beats the best FPGA in current technology. But:
Ø Only chip cost is considered in the table.
− Costs for the circuit board and other components (motherboard, memory, network, etc.) are necessary to produce a functioning supercomputer.
Ø As most clusters incorporating FPGAs include a host processor to handle serial tasks and communication,
− the table favors FPGAs.
* Cell processor: from Sony
* System X: Virginia Tech’s supercomputing cluster
FPGA FP vs. Processors
• FPGA floating-point performance is improving faster than that of processors.
Ø Analysis shows that in 2009–2012 FPGAs will be as cost-effective as processors.
Xilinx Core vs. Dou Architecture
• Xilinx 18-bit multipliers:
Ø The Xilinx FP core combines 16 of the 18-bit multipliers to produce the 52-bit multiplication needed for DP FP multiply.
Ø The Dou design uses 9 of them.
• Non-standard data formats:
Ø The IEEE standard format prevents users from leveraging an FPGA’s configurability to effectively customize for a specific problem.
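One plausible accounting for the 16-vs-9 gap, offered as an assumption rather than a fact from the slide: a 52-bit mantissa product built from 18×18 hardware multipliers needs one partial product per pair of operand limbs. Treating the multipliers conservatively as 17-bit unsigned requires four limbs per operand, while exploiting the full 18 bits requires only three:

\[
\lceil 52/17 \rceil = 4 \;\Rightarrow\; 4 \times 4 = 16 \text{ multipliers},
\qquad
\lceil 52/18 \rceil = 3 \;\Rightarrow\; 3 \times 3 = 9 \text{ multipliers}
\]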
FPGA-Based Supercomputing
• The main differences from an accelerator board:
1. The FPGA board communicates with conventional processors, as well as other FPGA boards, through a high-bandwidth, low-latency interconnection network.
− An order of magnitude higher bandwidth,
− Latency on the order of microseconds.
Ø FPGA computation can share state with the microprocessor more easily,
Ø The granularity of computation on the FPGAs can be smaller.
2. There can be many FPGA boards on the interconnection network,
Ø so the aggregate system can solve very large supercomputing problems.
SRC Architecture
• MAP boards:
Ø Contain high-end Xilinx FPGAs.
Ø Exploit the DRAM interface of Pentium processors as a communication port to the MAP board.
• SNAP ASIC:
Ø Manages the protocol and communication between a dual-processor Pentium node and its associated MAP (FPGA) board.
SRC Architecture
• Cluster:
Ø These microprocessor/FPGA units can be combined using a commercial interconnect, such as Gigabit Ethernet, to form a cluster.
• Interconnections:
Ø A proprietary switching network connects multiple MAPstations.
Ø MAP boards can also communicate directly through chain ports,
− bypassing the interconnection network for large, multi-FPGA board designs.
SGI’s RASC
• Reconfigurable Application-Specific Computing:
Ø Combines
− the low power + volume cost + high performance of new FPGAs with
− the very high I/O capability of the SGI Altix system.
− e.g., 120 adds in parallel instead of using 12 adders in an Intel Itanium 2 (10× improvement).
Ø A software environment that allows control of powerful yet unfamiliar FPGAs in familiar C and common Linux tools (e.g., gdb).
Ø Applications in use:
− image processing,
− encryption.
SRC’s Explicit/Implicit Architecture
(Diagram: the Carte™ Programming Environment compiles Fortran and C sources into a unified executable spanning an implicit device and an explicit device, each with its own memory, joined by an I/O bridge.)
• Implicitly controlled device:
Ø Dense logic device
Ø Higher clock rates
Ø Typically fixed logic
Ø µP, DSP, ASIC, etc.
• Explicitly controlled device:
Ø Direct execution logic
Ø Lower clock rates
Ø Typically reconfigurable
Ø FPGA, CPLD, OPLD, etc.
• SRC’s explicitly controlled processor is called MAP®
MAP® Implementation
(Diagram: a MAP board with a controller FPGA (XC2V6000) and two user-logic FPGAs; annotations below.)
• Direct Execution Logic (DEL) made up of one or more user logic devices (User Logic 1 and 2, XC2V6000, connected at 2400 MB/s each).
• Six banks of dual-ported on-board memory (24 MB) at 4800 MB/s (6 × 64 b); multiple banks of on-board memory maximize local memory bandwidth.
• A separate 192-bit dual-ported memory (4 MB).
• 1400 MB/s sustained payload to microprocessor memory; multiple DMA engines; control circuits allow explicit control of memory prefetch and data access.
• 108-bit GPIO ports allow direct MAP-to-MAP chain connections or direct data input.
• Memory hierarchy supported by the DMA engines:
Ø Distributed SRAM in user logic: 264 KB @ 844 GB/s
Ø Block SRAM in user logic: 648 KB @ 260 GB/s
Ø On-board SRAM: 28 MB @ 9.6 GB/s
Ø Microprocessor memory: 8 GB @ 1400 MB/s
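The on-board memory numbers are mutually consistent. Assuming the banks run at 100 MHz (an inference from the figures, not stated on the slide):

\[
6 \text{ banks} \times 8\,\text{B} \times 100\,\text{MHz} = 4800 \text{ MB/s},
\qquad
\text{dual-ported: } 2 \times 4800 \text{ MB/s} = 9.6 \text{ GB/s}
\]

which matches the 9.6 GB/s quoted for the 28 MB (24 MB + 4 MB) of on-board SRAM.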
SRC MAPstation™
(Diagram: a MAPstation node, with the MAP connected through SNAP™ to a µP board with memory and PCI-X, plus GPIO ports and standard disk, storage area network, local area network, and wide area network connections.)
• SRC-6 uses standard external network connections.
• MAPstation configurations: single-MAP workstation, portable MAPstation, 2U tower.
• SNAP™:
Ø SRC-developed interface: µP boards connect to (and share memory with) the MAP processors.
Ø 4× more bandwidth than PCI-X 133.
SRC Cluster-Based Systems
(Diagram: an SRC-6 cluster of MAPstations™, each pairing a MAP® via SNAP™ with µPs, memory, and PCI-X, linked by Gigabit Ethernet or similar and by MAP GPIO ports, with standard disk, storage area network, local area network, and wide area network connections.)
• Utilizes standard clustering technology.
SRC MAPstation™ with Hi-Bar™
(Diagram: a MAPstation tower in which an SRC Hi-Bar™ switch connects two MAPs, common memory, and a SNAP™-attached µP board with memory and PCI-X/EXP, plus GPIO ports and standard disk, storage area network, local area network, and wide area network connections.)
• MAPstation towers hold up to 3 MAP or memory nodes.
• Example configuration: MAPstation with 2 MAPs and common memory.
HLL-to-FPGA Compilation
Tools
Ø HPC users are scientists, researchers, or engineers who want to accelerate some scientific application.
− Acquainted with programming languages (C, Fortran, MATLAB, …)
− Some HLL-to-gates synthesis tools: Celoxica, SRC, ….
• Problems:
Ø The state of these tools does not remove the need for hardware expertise:
− Hardware debugging and interfacing are still needed.
Ø Porting an existing scientific code to an RC platform using one of these languages is not as simple as recompiling the code with a different compiler to run on a different microprocessor.
− Requires adaptation of the code to the available FPGA resources,
− with which scientific application developers are not familiar.
Ø Inefficient synthesis:
− Translating an inherently sequential high-level description into parallel hardware eats into the performance of hardware accelerators.
HLL-to-FPGA Compilation
• Three approaches:
1. Compile a subset of an existing language (e.g., C or Java) to hardware.
− Typically omits some operations:
− dynamic memory allocation,
− recursion,
− complex pointer-based data structures.
2. Extend a base sequential language with constructs to manipulate bit widths, explicitly describe parallelism, and connect pieces of hardware.
− Celoxica’s Handel-C,
− Impulse C,
− the MAP C compiler in SRC’s Carte programming environment.
3. Create a new language for algorithmic description:
− University of Montreal’s SHard,
− the Mitrion-C data-flow language.
− Simplifies the compiler’s work,
− but can require programmers to significantly restructure the algorithmic description as well as rewrite it in a new syntax.
Impulse C
Ø From Impulse Accelerated Technologies
− www.ImpulseC.com
Ø Can process blocks of C code (most often represented by one or a small number of C subroutines) into VHDL/Verilog.
Ø Enables automated scheduling of C statements for increased parallelism, plus automated and semi-automated optimizations such as loop pipelining and unrolling.
Ø Interactive tools let designers iteratively analyze and experiment with alternative hardware pipelining strategies.
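A hedged sketch of the kind of C kernel such tools accept as input: a small FIR filter in which the outer loop is a candidate for pipelining and the fixed-bound inner loop for full unrolling. The code is plain C; the comments only mark where tool-specific directives would go, since the exact directive spellings differ per tool and are not taken from any vendor's documentation.

```c
#define TAPS 8   /* fixed tap count so the inner loop can be fully unrolled */

/* 16-bit fixed-point FIR filter.  An HLL-to-gates compiler can pipeline the
 * outer loop (one sample per cycle in steady state, via a tool-specific
 * pipeline directive) and unroll the inner loop into TAPS parallel
 * multipliers feeding an adder tree (via an unroll directive). */
void fir(const short *in, short *out, const short coeff[TAPS], int n)
{
    for (int i = TAPS - 1; i < n; i++) {   /* pipeline candidate */
        int acc = 0;
        for (int t = 0; t < TAPS; t++)     /* unroll candidate   */
            acc += coeff[t] * (int)in[i - t];
        out[i] = (short)(acc >> 15);       /* fixed-point rescale */
    }
}
```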
Mitrion-C
• Mitrion-C and the Mitrion Virtual Processor
Ø From Mitrionics
Ø www.mitrionics.com
Ø Offers a fully parallel programming language.
− In standard C, programmers describe the program’s order of execution,
− which does not fit well with parallel execution.
− Mitrion-C’s processing model is based on data dependencies,
− a much better fit.
SRC Carte
• Carte Design Environment:
Ø A traditional program-development methodology:
− Write code in C or Fortran,
− Compile,
− Debug via a standard debugger,
− Edit code,
− Recompile,
− ….
Ø When the application runs correctly in a microprocessor environment, it is recompiled and targeted for MAP (the direct execution logic processor).
SRC Carte
• Three compilation modes:
Ø Debug mode: Carte compiles microprocessor code using a MAP emulator to verify the interaction between the CPU and MAP.
Ø Simulation mode: Carte supports applications composed of C or Fortran and Verilog or VHDL.
− Compilation produces an HDL simulation executable that supports simulation of the generated logic.
Ø Hardware compilation mode: the target is the direct execution logic that runs in MAP’s FPGAs.
− In this mode, Carte optimizes for parallelism by pipelining loops, scheduling memory references, and supporting parallel code blocks and streams.
Handel-C
• From Celoxica
Ø www.celoxica.com
Ø Synthesizes user code to FPGAs.
Ø The user replaces the algorithmic loop in the original Fortran, C, or C++ source application with a Celoxica API call that invokes the C code to be compiled into the FPGA.
Ø Handel-C extends C with constructs for hardware design, such as parallelism and timing.
Trident
• Trident:
Ø Synthesizes circuits from an HLL.
Ø Won a 2006 R&D 100 award for innovative technology.
Ø Provides an open framework for exploring algorithmic C computation on FPGAs
− by mapping the C program’s FP operations to hardware FP modules.
Ø Users are free to select floating-point operators from a variety of standard libraries or to import their own.
− Libraries: e.g., FPLibrary, Quixilica.
Ø The compiler’s open source code is available on SourceForge:
− http://trident.sf.net
Trident
Ø The programmer manually partitions the program into software and hardware sections, and
Ø writes C code to coordinate the data communication between the two parts.
Ø The C code to be mapped to hardware must conform to the synthesizable subset of C (see the sketch below). Not permitted:
− Print statements,
− Recursion,
− Dynamic memory allocation,
− Function arguments or returned values,
− Calls to functions with variable-length argument lists,
− Arrays without a declared size.
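A hedged illustration of what these restrictions mean in practice, based only on the forbidden list above (the precise rules are in the Trident documentation). Data moves through fixed-size globals rather than through function arguments or return values; the names saxpy_hw/saxpy_sw and the buffers are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024
float in_buf[N];    /* arrays must have a declared size */
float out_buf[N];

/* Conforms to the subset: no arguments, no return value, no printing,
 * no dynamic allocation, no recursion. */
void saxpy_hw(void)
{
    for (int i = 0; i < N; i++)
        out_buf[i] = 2.0f * in_buf[i] + 1.0f;
}

/* Does NOT conform: takes arguments, returns a value, allocates
 * dynamically, and prints -- all on the forbidden list above. */
float *saxpy_sw(const float *x, int n)
{
    float *y = malloc(n * sizeof *y);   /* dynamic memory allocation */
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + 1.0f;
    printf("done\n");                   /* print statement */
    return y;                           /* returned value  */
}
```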
Floating Point Arithmetic on FPGAs
Floating-Point Arithmetic
Ø Floating-point arithmetic is essential to many scientific applications.
Ø Floating-point arithmetic (especially double precision) requires a great deal of hardware resources.
Ø Only recently has it become possible to implement many floating-point cores on a single FPGA.
• Disadvantages of FP cores:
Ø To achieve a high clock rate, FP cores for FPGAs must be deeply pipelined.
− This makes it difficult to reuse the same FP core for a series of computations that depend on one another (see the sketch below).
Ø They use a large area,
− so it is important to use as few FP cores in an architecture as possible.
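A minimal sketch in C of the standard workaround for the dependence problem: with an adder pipelined L stages deep, a naive accumulation can only issue one input every L cycles, while keeping L independent partial sums in flight lets the same core accept one input per cycle. The depth L = 8 here is an arbitrary example value, not taken from any particular core.

```c
#define L 8   /* hypothetical adder pipeline depth */

/* Reduction restructured into L independent dependence chains so a
 * deeply pipelined FP adder stays full, followed by a short combine.
 * (Reassociating FP additions can change rounding slightly.) */
float sum_partial(const float *x, int n)
{
    float part[L] = {0};
    for (int i = 0; i < n; i++)
        part[i % L] += x[i];   /* L independent chains */

    float s = 0.0f;
    for (int k = 0; k < L; k++)
        s += part[k];          /* final combine, only L-1 dependent adds */
    return s;
}
```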
References
Ø [Scrafano 06] Scrafano, “Accelerating Scientific Computing Applications with Reconfigurable Hardware,” Ph.D. dissertation, 2006.
− Presents the most in-depth analysis of the performance of double-precision floating-point FFTs on FPGAs.
Ø [Hemmert 05] K. Scott Hemmert and Keith D. Underwood, “An Analysis of the Double Precision Floating-Point FFT on FPGAs,” IEEE Symposium on Field-Programmable Custom Computing Machines, April 2005.
Ø [Govindu 04] Gokul Govindu, Ling Zhuo, Seonil Choi, and Viktor K. Prasanna, “Analysis of High-Performance Floating-Point Arithmetic on FPGAs,” Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004.
Ø [Zhuo 04] Ling Zhuo and Viktor K. Prasanna, “Scalable Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs,” Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004.
Benchmarks
Ø A supercomputer’s performance is often measured by its performance in executing the LINPACK benchmark, which is composed of linear algebra routines [86, 112]. Additionally, the fast Fourier transform (FFT), as well as LINPACK and several others, appears as a kernel in the HPC Challenge benchmark suite, showing its importance as a kernel in scientific computing applications [27].
• [27] Jack J. Dongarra and Piotr Luszczek, “Introduction to the HPC Challenge Benchmark Suite,” Technical Report ICL-UT-0501, University of Tennessee, 2005.
• [86] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL – A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers,” http://www.netlib.org/benchmark/hpl/.
• [112] Top500 supercomputing sites, http://www.top500.org.
References
Ø [Craven 07] Craven and Athanas, “Examining the Viability of FPGA Supercomputing,” EURASIP Journal on Embedded Systems, Article ID 93652, 2007.
Ø [SGI 04] Silicon Graphics, Inc., “Extraordinary Acceleration of Workflows with Reconfigurable Application-Specific Computing from SGI,” http://www.sgi.com/pdfs/3721.pdf, 2004.
Ø [Gokhale 05] Gokhale and Graham, “Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays,” Springer, 2005.
Ø [Tripp 07] Tripp, Gokhale, and Peterson, “Trident: From High-Level Language to Hardware Circuitry,” IEEE Computer, March 2007.
Ø [Dou 05] Dou et al., “64-bit Floating-Point FPGA Matrix Multiplication,” FPGA 2005.
Ø [Stahlberg 06] Stahlberg, Wohlever, and Strenski, “Defining Reconfigurable Supercomputing,” Cray User Group, 2006.