Portability for FPGA Applications: Warp Processing and SystemC Bytecode
Frank Vahid
Dept. of CS&E, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. students:
- Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
- Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
- Scotty Sirowy (current)
- David Sheldon (current)
- Chen Huang (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.

Portable Applications on PCs
One x86 binary runs on multiple platforms (Pentium, Opteron, Atom, dual core). How? Why?
Frank Vahid, UC Riverside

Portable Applications on PCs
- Standard software binary
- Dynamic software binary translation
[Figure: an x86 binary runs directly on an x86 µP, or is converted by SW binary translation into a VLIW binary. Labels: applications, tools, architectures, "ecosystem".]

Meanwhile, Circuits on FPGAs Show Large Speedups
- Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, …

FPGAs Entering Computing Mainstream
- AMD Opteron
- Intel QuickAssist
- Cray, SGI
- Mitrionics
- IBM Cell (research)
- Xilinx, Altera
[Figures: SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs); Xilinx Virtex II Pro (source: Xilinx).]

Circuits on FPGAs are Software
- Microprocessor binaries (instructions): bits loaded into program memory
- FPGA "binaries" (circuits), aka "bitstream": bits loaded into LUTs and SMs
- Both are "software", not hardware; the "hardware" is the processor or the FPGA
Source: IEEE Computer, Sep 2007

"Portable Applications" + "FPGAs"
- Standard software binary
- Dynamic translation
[Figure: an x86 binary runs on an x86 µP; SW binary translation produces a VLIW binary; by analogy, SW binary translation to an FPGA binary is "warp processing". Labels: applications, tools, architectures, "ecosystem".]

Warp Processing
Platform: µP with instruction memory (I Mem) and data cache (D$), profiler, FPGA, and on-chip CAD (the Dynamic Partitioning Module, DPM).

Software binary used in every step:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Beq reg3, 10, -5
    Ret reg4

1. Initially, the software binary is loaded into instruction memory.
2. The microprocessor executes the instructions in the software binary.
3. The profiler monitors instructions and detects critical regions in the binary. Here the repeated beq and add instructions reveal the critical loop.
4. On-chip CAD reads in the critical region.
5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG), recovering loops, arrays, subroutines, etc., which is needed to synthesize good circuits:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
6. On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, here a set of adders.
7. On-chip CAD maps the circuit onto the FPGA (CLBs and SMs).
8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more (>10x speedups for some apps). The loop body is replaced by instructions that interact with the FPGA. "Warp speed, Scotty."
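The profiling step can be sketched in C. This is only an illustration, assuming the profiler finds critical loops by counting taken backward branches per target address; the slides say only that the profiler monitors instructions, so the table size and the names Profiler, profiler_observe, and profiler_hottest are hypothetical.

```c
#include <stddef.h>

#define MAX_LOOPS 16 /* hypothetical table size */

/* Hypothetical profiler table: one counter per backward-branch target. */
typedef struct {
    unsigned target[MAX_LOOPS]; /* address of a loop head */
    unsigned count[MAX_LOOPS];  /* taken backward branches to that head */
    size_t used;                /* number of table entries in use */
} Profiler;

/* Record one taken branch. Only backward branches (target <= pc) are
 * treated as loop iterations; forward branches are ignored. */
static void profiler_observe(Profiler *p, unsigned pc, unsigned target)
{
    if (target > pc)
        return;
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < MAX_LOOPS) { /* new loop head; dropped if table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the most frequently targeted loop head, or -1 if none was seen. */
static long profiler_hottest(const Profiler *p)
{
    long best = -1;
    unsigned best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = (long)p->target[i];
        }
    }
    return best;
}
```

With the example binary numbered from 0, the Beq sits at address 7 and branches back 5 instructions to address 2 (the loop label), so repeated calls to profiler_observe(&p, 7, 2) make address 2 the hottest loop head.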

Warp Processing Challenges
- Can we decompile binaries sufficiently for synthesis?
- Can we just-in-time (JIT) compile to FPGAs?
[Figure: µP with I$ and D$, profiler, FPGA, and on-chip CAD. On-chip CAD flow labels: binary, profiling & partitioning, decompilation, CDFG, JIT FPGA compilation, FPGA binary, binary updater, microprocessor binary.]

Decompilation
- Recover high-level information from binary: branches, loops, arrays, subroutines, …
  - Adapted previous methods for processor-to-processor translation (UQBT)
  - Developed new synthesis-oriented methods (e.g., "reroll" loops, strength "promotion")

Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i = 0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg6
    Add reg3, 1
    Beq reg3, 10, -5
    Ret reg4

The slide overlays five animation frames, one per decompilation step.

1. Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

2. Data flow analysis:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

3. Function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
    loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

4. Control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

5. Array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

Result: the original C code and the decompiled code are almost identical representations.
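The array recovery idea can be checked with a small C sketch: the source-level loop and the register-level loop compute the same sum, because the shift left by 1 is just the index scaled by the 2-byte element size. This is an illustration, not the decompiler itself; the function names are hypothetical, int16_t stands in for the slide's short (assumed to be 16 bits), and memory is modeled as a plain byte array.

```c
#include <stdint.h>
#include <string.h>

/* Source-level view: sum ten 16-bit array elements. */
static long sum_source(const int16_t a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* Binary-level view: byte-addressed memory, registers named as on the
 * slide. reg2 is the byte offset of the array within mem. */
static long sum_binary(const unsigned char *mem, long reg2)
{
    long reg3 = 0; /* loop counter */
    long reg4 = 0; /* accumulator */
    do {
        long reg1 = reg3 << 1;   /* Shl: index * 2 bytes per element */
        long reg5 = reg2 + reg1; /* Add: base + byte offset */
        int16_t reg6;
        memcpy(&reg6, mem + reg5, sizeof reg6); /* Ld: 16-bit load */
        reg4 += reg6;
        reg3 += 1;
    } while (reg3 < 10);
    return reg4;
}
```

Calling sum_binary((const unsigned char *)a, 0) on the same array passed to sum_source returns the same value, which is the equivalence that lets the decompiler rewrite mem[reg2 + (reg3 << 1)] as array[reg3].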

Decompilation Results vs. C
- Synthesis from decompiled binary is competitive with synthesis from C

Decompilation Results on Optimized H.264: In-depth Study with Freescale
- Again, competitive with synthesis from C

Decompilation Effective Even with Compiler Optimizations
- Do compiler optimizations hurt decompilation?
- (Surprisingly) found optimized code synthesizes to even better circuits
[Chart: speedup when decompiled binary is partitioned and synthesized to FPGA; average speedup of 10 examples.]

Decompilation
Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis.
Stitt et al., ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)

Challenge: JIT Compile to FPGA
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
  - Logic synthesis: run single expand phase
  - Technology mapping: bottom-up graph clustering heuristic
  - Placement: place critical path first, then adjacent items
  - Routing: use resource graph that matches switch matrix / channel structure
- Tool comparison (figure drawn to scale; stages: logic synthesis, tech. map., placement, routing):
  - Commercial tool: 9.1 s, 60 MB
  - Ultra-lean Riverside JIT FPGA tools: 0.2 s, 3.6 MB
  - Ultra-lean Riverside JIT FPGA tools on a 75 MHz ARM7: 1.4 s, 3.6 MB
- Penalty: 1.3-2x in performance & size (even more might be acceptable)
[Figure labels for logic synthesis: expand, reduce, irredundant; on-set, dc-set, off-set.]

JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler with 40x speedup, 20x less memory, and 1.3x-2x circuit penalty.
Lysecky et al., DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)

Warp Processing Results
Performance speedup (most frequent kernel only) vs. 200 MHz ARM (1 = ARM-only execution)
- Average kernel speedup of 41
- Overall application speedup average is 7.4
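The gap between the kernel speedup and the overall application speedup is what Amdahl's law predicts when only part of the run time is accelerated. A minimal sketch follows; the function name is hypothetical and the 90% kernel fraction in the usage note is an illustrative assumption, not a figure from the slides.

```c
/* Amdahl's law: overall speedup when a fraction of the run time
 * (kernel_fraction, between 0 and 1) is sped up by kernel_speedup. */
static double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For example, overall_speedup(0.9, 41.0) evaluates to 8.2: a kernel that takes 90% of the run time and runs 41x faster yields roughly an 8x faster application. Averages over several benchmarks will not satisfy this formula exactly.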

Warping Thread-Based Applications
- Multi-core platforms, multithreaded apps
    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }
- Compiler produces a binary; OS schedules threads onto available µPs
- Remaining threads added to queue
- OS invokes on-chip CAD tools to create accelerators for f()
- OS schedules threads onto accelerators (possibly dozens), in addition to µPs
- Thread warping: use one core to create accelerator for waiting threads
- Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
[Figure: µPs and FPGA accelerators each running f(), plus on-chip CAD and an accelerator library (Acc. Lib); performance axis.]

Memory Access Synchronization (MAS)
- Must deal with widely known memory bottleneck problem
  - FPGAs great, but often can't get data to them fast enough
  - Data for dozens of threads can create bottleneck
    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }
    void f( int a[], int val ) {
      int result = 0;
      for (i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      . . .
    }
- Threaded programs exhibit unique feature: multiple threads often access same or overlapping data (here, the same array)
- Solution: fetch data once, broadcast to multiple threads (MAS)
[Figure: RAM feeds the FPGA through DMA; threads a(), b(), … run on the FPGA.]

Memory Access Synchronization (MAS)
- Detect overlapping memory regions: "windows"
- Synthesis creates active "smart buffer" [Guo/Najjar FPGA'04]
  - Actively fetches data, stores the reused data, delivers windows to threads
  - Active rather than passive component; designed for specific threads
    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }
    void f( int a[], int i ) {
      int result = 0;
      result += a[i] + a[i+1] + a[i+2] + a[i+3];
      . . .
    }
- Each thread accesses different addresses, but addresses may overlap
- Data (A[0-103]) streamed from RAM through DMA to the smart buffer
- Buffer delivers a window to each thread: A[0-3], A[1-4], …, A[6-9], …
- W/O smart buffer: 400 memory accesses
- With smart buffer: 104 memory accesses
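The access counts can be reproduced with a software model of the smart buffer. This is a sketch under stated assumptions, not the synthesized hardware: the buffer is modeled as a 4-word shift register, each thread's work is the 4-element sum from the slide, and the names window_sum, run_direct, and run_buffered are hypothetical.

```c
#include <stddef.h>

#define WINDOW 4 /* each thread reads a[i] .. a[i+3] */

/* Work done by one thread: sum of its 4-element window. */
static int window_sum(const int w[WINDOW])
{
    int s = 0;
    for (size_t k = 0; k < WINDOW; k++)
        s += w[k];
    return s;
}

/* Baseline: every thread reads its own window from RAM.
 * Returns the number of RAM reads. */
static size_t run_direct(const int *ram, size_t threads, int *out)
{
    size_t reads = 0;
    for (size_t i = 0; i < threads; i++) {
        int w[WINDOW];
        for (size_t k = 0; k < WINDOW; k++) {
            w[k] = ram[i + k];
            reads++;
        }
        out[i] = window_sum(w);
    }
    return reads;
}

/* Smart-buffer model: stream each RAM word once, keep the last WINDOW
 * words, and deliver a full window to one thread per new word.
 * Returns the number of RAM reads. */
static size_t run_buffered(const int *ram, size_t threads, int *out)
{
    int buf[WINDOW] = {0};
    size_t reads = 0;
    if (threads == 0)
        return 0;
    for (size_t addr = 0; addr < threads + WINDOW - 1; addr++) {
        for (size_t k = 0; k + 1 < WINDOW; k++) /* shift the window */
            buf[k] = buf[k + 1];
        buf[WINDOW - 1] = ram[addr];            /* fetch one new word */
        reads++;
        if (addr + 1 >= WINDOW)                 /* window is full */
            out[addr + 1 - WINDOW] = window_sum(buf);
    }
    return reads;
}
```

For 100 threads, run_direct performs 400 reads, matching the slide. run_buffered performs 103 reads (a[0] through a[102]) and produces identical results; the slide's figure of 104 corresponds to streaming A[0-103], one word more than this model needs.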

Speedups from Thread Warping
- Chose benchmarks with extensive parallelism
- Four-core (ARM11, 400 MHz) base system; Virtex IV FPGA at circuit-specific clock frequency (~100-300 MHz)
- Average 130x speedup
- But, FPGA uses additional area
  - Our FPGA size = ~36 ARM11s
  - Still 20x faster than 32-core system (and 11x faster than 64-core)
  - Simulation pessimistic, actual results likely better
  - FPGA more flexible

Warp Scenarios
Warping takes time (seconds, minutes, or more). When is it useful?
- Long-running applications
  - Scientific computing, etc.
- Recurring applications (save and reuse FPGA configurations)
  - Common in embedded systems
  - Might view as (long) boot phase
- For networked/docked devices, CAD can occur on server (ongoing work)
[Figure: two timelines. Long-running application: the µP runs while on-chip CAD works, then µP plus FPGA gives a single-execution speedup. Recurring application: the first execution runs on the µP with on-chip CAD; later executions on µP plus FPGA show the speedup.]

Why Dynamic?
- Static good, but hiding FPGA opens technique to all sw platforms
  - Standard languages/tools/binaries
- Static compiling to FPGAs: specialized language -> specialized compiler -> binary + netlist -> µP + FPGA
- Dynamic compiling to FPGAs: any language -> any compiler -> binary -> µP + FPGA with on-chip CAD
[Figure labels: applications, tools, architectures, "ecosystem".]

Synthesis-Friendly Applications
- Coding style impacts synthesis results

Synthesis-Friendly Application Coding Guidelines
- Conversion to Constants (CC)
- Conversion to Explicit Data Flow (CEDF)
- Conversion to Fixed Point (CF)
- Conversion to Explicit Memory Accesses (CEMA)
- Constant Input Enumeration (CIE)
- Loop Rerolling (LR)
- Conversion to Explicit Control Flow (CECF)
- Function Specialization (FS)
- Algorithmic Specialization (AS)
- Pass-By-Value Return (PVR)

Conversion to Explicit Control Flow (CECF)
- Problem: function pointers may prevent static control flow analysis
  - Synthesis unlikely to determine possible targets of function pointer
- Guideline: don't use function pointers. Replace with if-else, static calls
  - Makes possible targets explicit

Before:
    void f( int (*fp) (int) ) {
      . . .
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }
  Synthesized hardware: ? (the target of fp is unknown)

After:
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      . . .
      for (i=0; i < 10; i++) {
        if (fp == FUNC1)
          a[i] = f1(i);
        else if (fp == FUNC2)
          a[i] = f2(i);
        else
          a[i] = f3(i);
      }
    }
  Synthesized circuit: f1(i), f2(i), and f3(i) feed a 3x1 multiplexer selected by fp, which produces a[i].
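A compilable version of the CECF rewrite is sketched below to show that both forms compute the same result while only the second names every possible call target. The bodies of f1, f2, and f3 are placeholders invented for this example, and fill_with_pointer and fill_with_enum are hypothetical names for the slide's two versions of f.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Placeholder targets; the slide does not define f1, f2, or f3. */
static int f1(int x) { return x + 1; }
static int f2(int x) { return x * 2; }
static int f3(int x) { return x * x; }

/* Before: the call target is known only at run time, so a synthesis
 * tool cannot tell which circuit to build for fp(i). */
static void fill_with_pointer(int a[10], int (*fp)(int))
{
    for (int i = 0; i < 10; i++)
        a[i] = fp(i);
}

/* After: every possible target is a static call selected by an enum,
 * which maps naturally onto three circuits and a 3x1 multiplexer. */
static void fill_with_enum(int a[10], enum Target fp)
{
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)
            a[i] = f1(i);
        else if (fp == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

Calling fill_with_pointer(a, f2) and fill_with_enum(b, FUNC2) fills both arrays with the same values; the caller changes only from passing a function address to passing an enum constant.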

Speedups from Synthesis-Friendly Coding Guidelines
- 10 guidelines
- For ~1,000-line benchmark: 5-6 changes typical, tens of minutes each
- Simple guidelines increased speedup to 6.5x

Speedups from Synthesis-Friendly Coding Guidelines
- Original C code (Powerstone, Mediabench)
  - Original average speedup with FPGA: 2.6x (excludes brev)
- Refined C code with guidelines
  - Average speedup: 8.4x (excludes brev)
  - Guidelines led to 3.5x improvement of speedup

"Spatial" Algorithms for FPGAs
- As FPGAs become more common, app writers may expect FPGA presence
- Example: count patterns
  - Sequential algorithm
    - Hash table
    - 10s of cycles per pattern
  - Spatial algorithm (for FPGA)
    - Pipelined stages: the current pattern passes through levels 1 to m, each holding a pattern, compare logic, and a count
- Spatial algorithm: essence is the connectivity of components, not the sequencing of instructions

Spatial Algorithms for FPGAs
- Spatial algorithm 2
  - Pipelined binary tree: the current pattern passes through levels 1 to n, each with compare logic plus pattern memory and count memory
  - Memory per level: level 1 holds 1 pattern, level 2 holds 2 patterns, level 3 holds 4 patterns, …, level n holds 2^n patterns

Example
- Possible patterns pre-stored in binary search tree circuit
- Each level has compare logic and a pattern memory (level 1: 1 pattern, level 2: 2 patterns, level 3: 4 patterns, …, level n: 2^n patterns)
[Animation, five frames: the current patterns 73, 48, 23, 75, and 11 enter the pipeline one after another and each advances one stage (stage 1 to stage 4) per frame, so several patterns are in flight at once. Counts of 1 appear as patterns are matched.]
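The pipelined binary tree can be modeled in software to show the key property: one new pattern enters per clock tick while earlier patterns are still descending the tree. This is a minimal sketch with three levels; the stored keys, the heap-style node numbering, and the names Pipeline, clock_tick, and count_patterns are assumptions made for illustration, not details from the slides.

```c
#include <stddef.h>

#define LEVELS 3
#define NODES ((1u << LEVELS) - 1) /* 1 + 2 + 4 stored patterns */

/* Pipeline register in front of each level. */
typedef struct {
    int valid;    /* nonzero if this stage holds a pattern */
    int pattern;  /* the pattern being searched for */
    size_t node;  /* tree node to compare against at this level */
} StageReg;

typedef struct {
    int key[NODES];        /* search tree in heap order: children of n
                              are 2n+1 (smaller) and 2n+2 (larger) */
    unsigned count[NODES]; /* match count per stored pattern */
    StageReg stage[LEVELS];
} Pipeline;

/* Advance one clock tick; optionally accept a new pattern at level 1. */
static void clock_tick(Pipeline *p, int has_input, int pattern)
{
    /* Go from the last level to the first so that each pattern moves
     * exactly one level per tick. */
    for (size_t s = LEVELS; s-- > 0; ) {
        StageReg *r = &p->stage[s];
        if (!r->valid)
            continue;
        int key = p->key[r->node];
        if (r->pattern == key) {
            p->count[r->node]++;         /* match found at this level */
        } else if (s + 1 < LEVELS) {
            StageReg *next = &p->stage[s + 1];
            next->valid = 1;
            next->pattern = r->pattern;
            next->node = 2 * r->node + (r->pattern < key ? 1 : 2);
        }                                /* else: not in the tree */
        r->valid = 0;
    }
    if (has_input)
        p->stage[0] = (StageReg){1, pattern, 0};
}

/* Stream n patterns, then drain the pipeline. Returns ticks used. */
static size_t count_patterns(Pipeline *p, const int *stream, size_t n)
{
    for (size_t i = 0; i < n; i++)
        clock_tick(p, 1, stream[i]);
    for (size_t i = 0; i < LEVELS; i++)
        clock_tick(p, 0, 0);
    return n + LEVELS;
}
```

With keys {48, 23, 75, 11, 30, 73, 90} in heap order, streaming n patterns takes n + 3 ticks in total, that is, about one tick per pattern, in contrast to the tens of cycles per pattern quoted for the sequential hash-table version. In hardware all levels operate in the same clock cycle; the loop over levels only simulates that concurrency.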

Study of Spatial Algorithms in FCCM
- FCCM 2001-2006
  - 70 papers describing fast application on FPGA
  - Examined 35 in depth (every other one)
    - 6 used device-specific features
    - 9 represented expected synthesized circuit from the obvious sequential algorithm
    - 20 were spatially-oriented applications, e.g., earlier pipelined binary tree
[Table with columns Year, Application, and Type (Spatial, Temporal, or "-") for the 35 papers, 2001 to 2006; the per-row years and types did not survive extraction. Applications in table order: 3D Vec. Normalization, Efficient CAM, Automated Sensor, Regular Expression, Hyperspectral Image, Machine Vision, RC4, Set Covering, Template Matching, Triangle Mesh, Congruential Sieves, Content Scanning, F.P and Square Root, Gaussian Noise, TRNG, 3D FDTD Method, Deep Packet Filter, Online Floating Point, Molecular Dynamics, Pattern Matching, Seismic Migration, Software Deceleration, V.M Window, Data Mining, Cell Automata, Particle Graphics, Radiosity, Transient Waves, Road Traffic, All Pairs Shortest Path, Apriori Data Mining, Molecular Dynamics, Gaussian Elimination, Radiation Dose, Random Variates.]

Portable Spatial Applications?
- Current portable microprocessor binaries: sequential
  - Extensions for threads, processes, ...
- How support spatial constructs?
  - Ports, connections, timing model ...
- SystemC (www.systemc.org)
  - Adds libraries and macros, still standard C++
  - Sequential and spatial constructs
  - Compiling links in the simulation kernel
    - Self-executing simulation
  - Intended for SoC simulation

Bytecode
- Modern portability approach
  - Java, C#
- Virtual Machine (VM): program that executes bytecode
  - May JIT compile to native architecture
[Figure: compiler output runs on a VM on Pentium, a VM on Opteron, or a VM on Atom.]

SystemC Bytecode?
[Figure: SystemC -> compiler -> SystemC bytecode, which runs on a VM on Pentium, a VM on FPGA, or a VM on Opteron + FPGA.]

SystemC Bytecode Compiler
SystemC source:
    class EDGE_DETECTOR : public sc_module {
      // signal declarations
      …
      EDGE_DETECTOR() {
        SC_method(mainComp);
        sensitive << dataReady;
        SC_method(getPixel);
        sensitive << clock.pos();
      }
      void getPixel() {
        …
        dataReady.write(1);
      }
      void mainComp() {
        int i, j;
        for (i = 0; i < 3; i++) {
          for (j = 0; j < 3; j++) {
            sumX = sumX + mem.read() * GX[i][j];
          }
        }
        …
        edge.write(sumX + sumY);
      }
    }
Compiler flow: SystemC source -> Pinapa front end (AST, ELAB, link) -> bytecode back end (code generation, register allocation) -> SystemC bytecode

SystemC Bytecode Emulator
- Main processor executes SystemC bytecode held in instruction memory
- Bytecode uploadable via USB drive
- Accelerators speed up emulation
- "Warping" also possible: JIT compile bytecode portions to circuits on FPGA
[Figure labels: input memory, output memory, UART, buttons, LEDs, USB interface, read signal memory, write signal memory, profiler, accelerators 1 to 3 on the FPGA; µP with I$ and D$, FPGA, and on-chip CAD.]

Dynamic Enables Expandable Logic Concept
- Expandable RAM: system detects RAM during start, improves performance invisibly
- Expandable logic: warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: µPs with cache, RAM, and DMA, plus profiler, warp tools, and FPGAs; performance grows as RAM or logic is added.]

Summary
- FPGAs entering mainstream
- Portability of applications is important
- Dynamic binary translation to FPGAs: warp processing
  - Shown feasible; extensive future work
- Trends towards FPGA ubiquity
  - Microprocessor binaries need extensions for spatial constructs
  - One approach: SystemC bytecode and virtual machine
  - Can also be warped for circuit speed
http://www.cs.ucr.edu/~vahid/pubs