  • Number of slides: 38

Implementing Algorithms in FPGA-Based Reconfigurable Computers Using C-Based Synthesis
Doug Johnson, Technical Marketing Manager
NCSA/OSC Reconfigurable Systems Summer Institute, Urbana, Illinois, July 11-13, 2005

Celoxica
  • UK-based system design company
  • Provider of design tools, IP & services for Digital Imaging & Signal Processing
      - Image Processing
      - Video Processing
      - Sonar/Radar signal processing
      - Biometrics
      - Massively parallel data mining and matching
  • Complete solutions for Electronic System Level (ESL) Design
      - System/algorithm acceleration
      - Co-design partitioning
      - Co-simulation & co-verification (C/C++/SystemC/Handel-C/Matlab/VHDL/Verilog)
      - Hardware compilation & C synthesis to reconfigurable architectures
  • Consulting and professional services
      - Systems analysis and design strategy
      - System implementation capability

Presentation Objectives
Prerequisites
  • Motivations for using FPGAs in RC and HPC
  • RC FPGA systems hardware and infrastructure
Objectives
  • HPC algorithms and considerations for Reconfigurable Computing (RC)
  • Share a perspective on the state of the art for C-based HW design
  • Describe the C to FPGA flow, illustrated with code examples
… Look forward to some critical debate…

Agenda
  • Reconfigurable Computing
      - Considerations, core algorithm relationships, commercial applications
  • C-based design
      - The solution space (its place in EDA)
      - Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG 2000 Design Example

Agenda: Reconfigurable Computing (RC)
“RC = Using FPGAs for (algorithmic) computation”
  1. Embedded: Well established – body of knowledge/experience
  2. Enterprise: Some
  3. HPC: Starting Out

Reconfigurable Computing
[Timeline figure with marks at 1980, 1990, 2000 and 20X0?; milestones labelled: Commercial C-to-FPGA tools, FPGAs, Closely Coupled Systems, Partitioning Frameworks, Intimately Coupled Systems, Advanced Compilers, First RC Successes]
Promised Opportunities
  • Algorithm Acceleration
      - Exploit parallelism to increase performance with custom HW implementation
  • Algorithm Offload
      - Free CPU resource by offloading bottleneck processes
BIG Challenges
  • Development complexity
      - Design framework and methods, deployment and integration/middleware
  • Coupling to coprocessor/data bandwidth
  • Price/Performance/Power!
  • Choosing the right applications!

FPGA Computing and Methodology
High Performance Embedded and Reconfigurable Computing
  • Why FPGA Computing?
      - Moore’s Law showing signs of strain
      - Ability to parallelize in HW
      - Price/GOPS coming down rapidly
      - Hard IP blocks – excellent density
  • Example: Floating Point Performance
      - Maximum for Virtex-4 – 50 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
      - Maximum for Virtex-2 – 17.5 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
      - “Can fit 10’s of FPUs on 2 Xilinx Virtex-4’s” (Courtesy of Justin Tripp, LANL)
      - Use of hard macros for functions is mandatory (example: DSP48 on Virtex-4)
C-based design for FPGAs
  • Several offerings on commercial marketplace or in research
      - Commercial – Celoxica, Mentor Graphics, Impulse Technologies, Mitrion…
      - Research – Sandia, UC Riverside, LANL
  • RTL/HDL is the most widely used way to get to FPGAs but is not usable by SW engineers

2005 Conventional Wisdom for RC
  1. Small data objects
      - Data transfer overhead to coprocessor, high operation to byte ratio
        [callout: Closely coupled systems]
  2. Modest arithmetic
      - Difficult to design and implement complex algorithms in HW [callout: C-based design]
      - Integer/fixed precision calculations
      - Floating point too resource expensive [callout: High Density Devices]
  3. Data-parallelism [callout: Essential]
      - Parallelism essential: FPGA clocks are an order of magnitude slower than CPUs
      - Fine grain: wide data widths
      - Medium grain: operation/function routine
      - Coarse grain: multiple instantiations of application processes
  4. Pipeline-ability [callout: Fewer Issues with Latency in HPC]
      - Streaming Applications – most successful
  5. Simple Control
      - Difficult to design complex scheduling schemes in Parallel HW
        [callout: Soft Cores/C-based design]
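The "modest arithmetic" point can be illustrated with a minimal fixed-point sketch in C. The UQ8.8 format, the type name `uq8_8` and the function `uq8_8_mul` are assumptions made for this illustration and do not come from the presentation; the sketch only shows that a fractional multiply can be built from an integer multiply, an add, a shift and a compare, which is the kind of logic that maps cheaply onto FPGA fabric.

```c
#include <stdint.h>

/* Hypothetical unsigned fixed-point format: 8 integer bits, 8 fraction bits. */
typedef uint16_t uq8_8;

#define UQ8_8_ONE 256u   /* the value 1.0 in UQ8.8 */

/* Multiply two UQ8.8 values with round-to-nearest and saturation.
   Only integer multiply, add, shift and compare are used. */
uq8_8 uq8_8_mul(uq8_8 a, uq8_8 b)
{
    uint32_t wide = (uint32_t)a * (uint32_t)b;   /* UQ16.16 intermediate */
    wide = (wide + 128u) >> 8;                   /* round, rescale to 8 fraction bits */
    return (wide > 0xFFFFu) ? (uq8_8)0xFFFFu : (uq8_8)wide;
}
```

For example, 1.5 is `0x0180` and 2.0 is `0x0200` in this format, and their product is `0x0300`, which is 3.0.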

Further Considerations
  6. Exploiting “Soft” programmable HW
      - Configurable Applications: schedule and load HW content prior to HW execution
      - Reconfigurable Applications: dynamically change HW content during HW execution
        [callout: Few Compelling Examples in HPC]

Commercial RC Applications …using C-based design
Well established in embedded systems: “PROCESSING AT THE SENSOR” versus local and/or remote processing
Application areas: Digital Video Technology and Image Processing; Digital Signal Processing; Communications and Networking; Rapid Systems Prototyping
Market sectors: Consumer; Automotive & Industrial; Defense & Security
Examples
  • 3D LCD display development and test
  • Real-time verification of HDTV image processing algorithms
  • Robust image matching - product tracking and production line control
  • Automotive safety system incorporating sensor fusion
  • Robotic vision system for object detection and robot guidance
  • Engine control unit for 3-phase motors
  • Radar and sonar beamforming and spatial filtering
  • Computer aided tomography security system
  • Internet reconfigurable multimedia terminal, MP3, VoIP etc.
  • Ground traffic simulation testbed for broadband satellite network communications
  • Satellite based Internet data tracking system

Commercial RC Applications …using C-based design
Enterprise Computing
  • Content processing solutions
      - XML parsing, virus checking
      - Packet/Pattern Matching/Filtering
      - Compression/decompression
      - Security/Encryption – DES/3-DES, SHA, MD5, AES/Rijndael
High Performance Computing
  • Image processing
      - CT scan analysis, 3D modeling, Ray Tracing
  • Finite element analysis and simulation
  • Custom Vector Engines
  • Genome calculations
  • Seismic data processing

Core Algorithm Relationships in HPC
[Figure: diagram centred on "Basic Algorithms & Numerical Methods". Algorithm classes named in it include Discrete Events, Monte Carlo, Pattern Matching, PDE, ODE, Fields, Transport, Symbolic Processing, Graph Theoretic, n-body, Raster Graphics and Fourier Methods. Around them are the HPC application domains that depend on them, for example CFD, Weather and Climate, Seismic Processing, Cryptography, Genome Processing, Signal Processing, Data Mining, Quantum Chemistry, Molecular Modeling, MRI Imaging, VLSI Design and Air Traffic Control.]
Source: Rick Stevens - ANL

Core Algorithm Relationships in HPC
[Figure: the same diagram as on the previous slide, overlaid with the question "How do we map out the right Apps?"]
Source: Rick Stevens - ANL

Exploiting FPGA in HPC
Hardware:
  • “Enterprise Quality” co-processor system products (Cray XD1, SGI RASC)
  • Robust PCI/PCIx/VME-based FPGA card solutions for development
A software design methodology is essential:
  • SW dominated application sector
      - Target developers have a SW background
      - Register Transfer Level (RTL), Hardware Description Languages (HDL) are foreign
  • Complete designs can be specified in a C environment
      - Porting to HW implementations simplified
  • Platform abstractions through APIs and libraries
      - Simplified Specification, Development, Deployment
[callout: How do we select and benchmark?]

Agenda
  • Reconfigurable Computing
      - Considerations, core algorithm relationships, commercial applications
  • C-based design
      - The solution space (its place in EDA – Electronic Design Automation)
      - Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG 2000 Design Example

Embedded Hardware (HW) Design
[Figure: design-flow diagram running from Specification through Function, Architecture and Implementation to Physical Design, with the "C to FPGA/SoPC" path highlighted. Labels in the diagram: Algorithm Design, Block Design, Fixed Point extraction, DSP IP, TLM Frameworks, APIs/Libraries, IP Models, Fast Mixed Simulation, Architecture Exploration, Design Analysis, HW Accelerated Simulation, Custom Processors, C-Based Synthesis, HLL Synthesis, Interface Synthesis, Reconfigurable Prototypes, FPGA/SoPC, Implementation IP, Emulation Platforms, RTL Verification, RTL.]

C to FPGA Accelerated System
[Figure: C to FPGA flow diagram. Stages: Function & Architecture Design (AL: C/C++ Specification Model; CA: C for HW), then C-Based Synthesis, then Implementation. Labels in the diagram: Algorithm Design, Testbench, Software Model, System Model, Partitioning, APIs/Libraries, HW, COMMS, SW, Mixed Simulation, Architecture Exploration, Design Analysis, Optimization, BSP, RTL, EDIF, OBJ, Synthesis, P&R. Targets: FPGA and Processor.]

Challenges for C-based synthesis
  • Concurrency (Parallelism)
      - Compiler-determined (behavioral synthesis)
      - Explicit
  • Timing
      - Constraints
      - Explicit
      - Rules-based
  • Data Types
      - Annotations, additional or C++
  • Communication
      - Additional or C-like

Two Approaches to C-based Design
SystemC: SoC (System-on-a-Chip) Prototyping/Verification
  • Core Libraries: SCV, TLM, Master/Slave …
  • Standard Channels for Various MOC: Kahn Process Networks, Static Dataflow…
  • Primitive Channels: Signal, Timer, Mutex, Semaphore, FIFO, etc
  • Core Language: Modules, Ports, Processes, Events, Interfaces, Channels, Event Driven Sim Kernel
  • Data Types: 4-valued logic/vectors, Bits and bit-vectors, Arbitrary width integers, Fixed-point, C++ user-defined types
  • ANSI/ISO C++ Language Standard
Handel-C: C Algorithm to FPGA
  • Core Libraries: TLM (PAL/DSM), Fixed/Floating point …
  • Core Language: par{…}, seq{…}, Interfaces, Channels, Bit Manipulation, RAM & ROM, Single cycle assignment
  • Data Types: Bits and bit-vectors, Arbitrary width integers, Signals
  • ANSI/ISO C Language Standard
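Both stacks list arbitrary width integers and bit manipulation, which ANSI C lacks. A rough way to picture them in plain C is to mask a wider machine word down to the declared width, as in the sketch below. The helper names `wrap_to_width`, `add_width` and `bit_of` are hypothetical and are not part of SystemC or Handel-C; this is a software model of the behaviour, not how either tool implements it.

```c
#include <stdint.h>

/* Keep only the low `width` bits of v (1 <= width <= 32). */
uint32_t wrap_to_width(uint32_t v, unsigned width)
{
    return (width >= 32u) ? v : (v & ((1u << width) - 1u));
}

/* Addition as a `width`-bit register would see it: the carry out is dropped. */
uint32_t add_width(uint32_t a, uint32_t b, unsigned width)
{
    return wrap_to_width(a + b, width);
}

/* Read a single bit of a value (index must be below 32). */
unsigned bit_of(uint32_t v, unsigned index)
{
    return (unsigned)((v >> index) & 1u);
}
```

With a 9-bit width, adding 1 to 511 wraps to 0, which is the behaviour a 9-bit hardware register would show.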

System Design Refinement
[Diagram: four communicating processes A, B, C, D refined through three levels down to EDIF/RTL]
Function (AL, C/C++)
  • System Function
  • Coarse grain parallelism
    par{ processA(…); processB(…); processC(…); processD(…); }
Algorithm (CP, Handel-C)
  • Parallel algorithm design
  • Fine-grain parallelism
  • Bit/cycle true processes
  • Algorithm Testbench
    void processD(…){ unsigned 9 a, b, c; par{ a=1; b=2; } c=3; };
Architecture (CA, Handel-C)
  • Add interfaces
  • Signal/cycle accurate test
    void main(){ interface port_in… interface port_out… … }
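The effect of the `par{ a=1; b=2; } c=3;` fragment on timing can be sketched with a small cycle-counting model in C. It assumes the "single cycle assignment" rule listed for Handel-C earlier in the talk, so each assignment costs one clock and the branches of a `par` block share a clock. The `Regs` struct and both function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t a, b, c;
    unsigned clocks;      /* clock cycles consumed so far */
} Regs;

/* Sequential version: a = 1; b = 2; c = 3;  (assumed one clock per assignment) */
Regs run_sequential(Regs r)
{
    r.a = 1; r.clocks++;
    r.b = 2; r.clocks++;
    r.c = 3; r.clocks++;
    return r;
}

/* Slide version: par { a = 1; b = 2; } c = 3;
   Both assignments inside par complete on the same clock edge. */
Regs run_with_par(Regs r)
{
    r.a = 1; r.b = 2; r.clocks++;
    r.c = 3; r.clocks++;
    return r;
}
```

Under that assumption the parallel form finishes in two clocks instead of three while leaving the same values in the registers.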

Systems Integration
Implementation
  • Complete system design
  • Interface to pins
  • Multi-Clock domain
  • IP Integration
[Diagram: processes A, B, C, D with CLK and RST inputs, a Data interface, and IP blocks brought in as EDIF (Electronic Design Interchange Format) or as RTL from HDL; output is EDIF/RTL]
    set clock = external "CLK";
    set reset = external "RST";
    interface Data(…)…
    void main()
    {
        par{
            processA(…);
            processB(…);
            processC(…);
            processD(…);
        }
    }
Diagram callouts on the IP blocks: { interface processB(…)…}; and { interface processD(…)…};

Parallel Debug in C environment
Algorithm Design

Resource Usage/Speed Estimations
Architecture Exploration

FPGA Support
  • Technology mapping
  • Optimizations

Handel-C Template Multiplier (Pipelined)
    set clock = external "clk";

    void main()
    {
        …
        while(1)
            par {
                …
                process();
            }
    }

    void process()
    {
        unsigned W A, B, C;
        while(1)
            par {
                …
                Multiply(A, B, &C);
                …
            }
    }

    /* W-stage shift-and-add multiplier: stage i adds b << i when bit i of a is set */
    void Multiply(unsigned W A, unsigned W B, unsigned W *C)
    {
        static unsigned W a[W], b[W], c[W];
        par {
            a[0] = A;
            b[0] = B;
            c[0] = a[0][0] == 0 ? 0 : b[0];
            par (i = 1; i < W; i++) {
                a[i] = a[i-1] >> 1;
                b[i] = b[i-1] << 1;
                c[i] = c[i-1] + (a[i][0] == 0 ? 0 : b[i]);
            }
            *C = c[W-1];
        }
    }
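The pipelined behaviour of the template multiplier can be modelled in plain C by computing every register's next value from the values held before the clock edge. The sketch below is a software model written for this purpose, not output of any Celoxica tool; the width `W = 8`, the `MulPipe` struct and the function names are assumptions. It also assumes that all statements in the `par` block update together on each clock.

```c
#include <stdint.h>
#include <string.h>

enum { W = 8 };                        /* assumed operand width in bits */
#define MASK ((1u << W) - 1u)

typedef struct {
    uint32_t a[W], b[W], c[W];         /* one register set per pipeline stage */
    uint32_t out;                      /* models the *C output register */
} MulPipe;

/* Advance the pipeline by one clock. Every new value is computed from the
   state held before the clock edge, then the whole state is replaced. */
void mulpipe_clock(MulPipe *p, uint32_t A, uint32_t B)
{
    MulPipe n;
    n.a[0] = A & MASK;
    n.b[0] = B & MASK;
    n.c[0] = (p->a[0] & 1u) ? p->b[0] : 0u;
    for (int i = 1; i < W; i++) {
        n.a[i] = p->a[i - 1] >> 1;
        n.b[i] = (p->b[i - 1] << 1) & MASK;
        n.c[i] = (p->c[i - 1] + ((p->a[i] & 1u) ? p->b[i] : 0u)) & MASK;
    }
    n.out = p->c[W - 1];
    *p = n;
}

/* Push one operand pair through an empty pipeline and return its product
   (modulo 2^W). The result appears W + 1 clocks after the input clock. */
uint32_t mulpipe_single(uint32_t A, uint32_t B)
{
    MulPipe p;
    memset(&p, 0, sizeof p);
    mulpipe_clock(&p, A, B);
    for (int i = 0; i < W + 1; i++)
        mulpipe_clock(&p, 0u, 0u);
    return p.out;
}
```

Because a new operand pair can enter on every call to `mulpipe_clock`, the model delivers one product per clock once the pipeline is full, which is the point of the "Pipelined" label on the slide.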


Summary
Commercial C-based design is a reality. For the HPC and RC communities it offers:
  • Fastest route to accelerating SW designs in FPGA
      - Lower barrier to adoption than RTL technologies
      - Greater customization and productivity than block based approaches
      - Complete integration with RTL/block based approaches for “Power users”
  • Deterministic and quality results
      - State of the art tools used by embedded systems designers
  • RC platforms for rapid prototyping
      - Simple migration, development to deployment with full library support

Design Example: JPEG 2000 Image Compression Algorithm

Example Design: JPEG 2000 Compressor
Compressor blocks: Original Image, Pre processing, RGB to YUV conversion, DWT, Quantization, Tier-1 Encoder, Tier-2 Encoder, Coded Image, with Rate Control
Five Steps to HW Platform:
  1. Specification Model (then Algorithm Profiling)
  2. Functional System Model (then System Estimations)
  3. Architecture and Communication Model (then Optimization)
  4. Implementation Model (then Direct Synthesis, C to EDIF)
  5. HW Platform (Board level integration)
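The "RGB to YUV conversion" block can be pictured with the reversible colour transform defined in JPEG 2000 Part 1, which uses only integer adds, subtracts and a divide by four. The slides do not say whether the design used this transform or the irreversible one, so treating it as the reversible transform is an assumption, and the struct and function names are hypothetical.

```c
typedef struct { int r, g, b; } Rgb;
typedef struct { int y, u, v; } Yuv;

/* Division by 4 rounded toward minus infinity, also for negative inputs. */
static int floor_div4(int x)
{
    int q = x / 4;
    if (x % 4 < 0) q--;
    return q;
}

/* Forward transform: Y = floor((R + 2G + B) / 4), U = B - G, V = R - G. */
Yuv rct_forward(Rgb p)
{
    Yuv o;
    o.y = floor_div4(p.r + 2 * p.g + p.b);
    o.u = p.b - p.g;
    o.v = p.r - p.g;
    return o;
}

/* Inverse transform: recovers the original integers exactly. */
Rgb rct_inverse(Yuv c)
{
    Rgb o;
    o.g = c.y - floor_div4(c.u + c.v);
    o.r = c.v + o.g;
    o.b = c.u + o.g;
    return o;
}
```

The transform is exactly invertible because R + 2G + B equals 4G + U + V, so the floor term in the inverse removes what the forward step added.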

1. Specification Model
Function & Architecture Design, AL level (C/C++): Specification Model, Testbench, Software Model
  • 22 *.c and *.h files
  • 1468 lines of code
  • Algorithm Profiling
      - Memory
      - Processing Time
      - Data Flow
  • DWT/Tier 1 are the compute intensive blocks
Compressor blocks: Original Image, Pre processing, RGB to YUV conversion, DWT, Quantization, Tier-1 Encoder, Tier-2 Encoder, Coded Image, with Rate Control
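To give a feel for why the DWT is a natural candidate for hardware, here is a minimal C sketch of a one-dimensional integer 5/3 lifting wavelet step, the reversible filter defined in JPEG 2000 Part 1. The slides do not state which filter the design implemented, so the choice of the 5/3 filter is an assumption, as are the function names and the restriction to even-length inputs with mirrored edges.

```c
/* Division rounded toward minus infinity; b must be positive. */
static int floor_div(int a, int b)
{
    int q = a / b;
    if (a % b < 0) q--;
    return q;
}

/* Forward 1-D transform of x[0..n-1], n even and >= 2.
   s receives n/2 low-pass values, d receives n/2 high-pass values. */
void dwt53_forward(const int *x, int n, int *s, int *d)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {                          /* predict step */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; /* mirror at edge */
        d[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    for (int i = 0; i < half; i++) {                          /* update step */
        int left = (i > 0) ? d[i - 1] : d[0];                 /* mirror at edge */
        s[i] = x[2 * i] + floor_div(left + d[i] + 2, 4);
    }
}

/* Inverse transform: undoes the update step, then the predict step. */
void dwt53_inverse(const int *s, const int *d, int n, int *x)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        int left = (i > 0) ? d[i - 1] : d[0];
        x[2 * i] = s[i] - floor_div(left + d[i] + 2, 4);
    }
    for (int i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = d[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Each output depends only on a few neighbouring samples and on adds and shifts, so the two loops can be unrolled or pipelined in hardware without any floating point.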

2. Functional System Model
Function & Architecture Design: AL (C/C++ Specification Model, Testbench, Software Model) to CA (Handel-C System Model)
Partitioning: HW / SW
Compressor blocks: Original Image, Pre processing, RGB to YUV conversion, DWT, quantization, Tier-1 Encoder, Tier-2 Encoder, Coded Image, with Rate Control
Cycles/speed/area…
    /* Handel-C */
    extern "C" sw_block(…);

    void main(void){
        while(1)
            par{
                sw_block(…);
                hw_block(…);
            }
    }

    void hw_block(…)
    { … }

    /* C */
    void sw_block(…)
    { … }

3. Architecture and Communication Model
Function & Architecture Design: AL (C/C++) to CA (Handel-C)
Compressor blocks: Original Image, Pre processing, RGB to YUV conversion, DWT, quantization, Tier-1 Encoder, Tier-2 Encoder, Coded Image, with Rate Control
Communication between blocks: FIFO, DsmPortH2S, DsmRead(…), DsmWrite(…), DsmFlush(…)
Dataflow/Cycles/speed/area…
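The FIFO in this model decouples a producer block from a consumer block. A minimal software picture of that idea is a bounded ring buffer, sketched below in C. This is a generic illustration only: `Fifo`, `fifo_put`, `fifo_get` and the depth of 8 are hypothetical names and values, and the sketch makes no claim about how the DSM calls named on the slide are implemented or what arguments they take.

```c
#include <stdint.h>

#define FIFO_DEPTH 8u   /* hypothetical depth */

typedef struct {
    uint32_t slots[FIFO_DEPTH];
    unsigned head;    /* index of the next slot to read */
    unsigned count;   /* number of words currently stored */
} Fifo;

/* Producer side: returns 1 if the word was accepted, 0 if the FIFO is full. */
int fifo_put(Fifo *f, uint32_t word)
{
    if (f->count == FIFO_DEPTH) return 0;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return 1;
}

/* Consumer side: returns 1 and stores the oldest word in *out, 0 if empty. */
int fifo_get(Fifo *f, uint32_t *out)
{
    if (f->count == 0u) return 0;
    *out = f->slots[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

The full and empty return codes stand in for the back-pressure a hardware FIFO applies: a fast producer stalls when the buffer is full and a fast consumer stalls when it is empty.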

4. Implementation Model
[Diagram: processes A, B, C, D synthesized for a chosen Device Family; Implementation outputs are EDIF and RTL]
    void main(){ interface port_in… interface port_out… … }

Estimations from Synthesis
DWT ~ 6% VII 1000

5. Hardware Platform
[Diagram: blocks A, B, C, D mapped onto uP, HW (including the DWT) and RAM; flow is EDIF, then P&R, then FPGA]
From P&R Report for VII 1000-4
  • Slices: 758
  • Device utilization: 7%
  • Speed (MHz): 151
  • Lines of code: 395
Implementation Model Estimations
  • DWT ~6%
Board Level Integration
  • Specific I/O Implementations
  • Pin Location constraints
Implementation
  • MicroBlaze + Xilinx FPGA
  • Nios + Altera FPGA
  • Xilinx V2Pro
  • Toshiba MeP + FPGA
  • PowerPC + PLB + FPGA
  • PC + FPGA PCI Card
  • …etc

JPEG 2000 DWT Implementation
  • Example taken from a “Xilinx Design Challenge”
  • Comparison made with HDL approach
  • See Article in Xcell Volume 46: http://www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm

                        C-Based Design                     HDL
                        1st pass   2nd pass   Final
  Slices                646        546        758          800
  Device utilization    6%         5%         7%           7%
  Speed (MHz)           110        130        151          128
  Lines of code         386                   395          435
  Design time (days)    6                     7 (6+1)      20*

Simulation time (values as listed on the slide): 5 mins, 5 mins, 20 mins, +6 hours
* Doesn’t include partitioning spec. development
Lena used as testbench throughout, input bit width 12, max 1K image width
Observations
  • Area: comparable
  • Speed: using C faster
  • Design time: using C quicker
  • Expert vs Novice

JPEG 2000 MQ coder Implementation

                        Celoxica 1st Pass   Celoxica Final   HDL
  Slices                1,347               1,999            620
  Device utilization    12%                 18%              6%
  Speed (MHz)           89.5                115.5            76
  Lines of code         310                 330              800
  Design time (days)    10                  12 (10+2)        30*

Simulation time for Lena jpeg: 5 mins (Celoxica), hours (HDL)
* Doesn’t include partitioning spec. development
> Common language base eased porting to hardware of the MQ coder source & DSM allowed partition, co-verification & data to be moved between hardware & software
> Optimizations included adding parallelism, replacing for() loops with while() loops, & simplifying loop control.
> Design developed in a unified design environment
Observations
  • Area: HDL smaller
  • Speed: HC faster
  • Design time: HC quicker
  • Expert vs Novice