
cea81668cbc566151fc3c723faaa21c1.ppt
- Number of slides: 61
Reconfigurable Computing Nehir Sönmez 25-11-2004
Reconfigurable Computing Standard Definition: A reconfigurable computer is a device which computes by using post-fabrication spatial connections of compute elements. [DeHon] • An FPGA implementation of a processor core to run a program is excluded: it is not a spatial mapping of the problem. • ASIC implementations are excluded: they are not post-fabrication programmable. The definition restricts RC to mapping onto fine-grained devices (such as FPGAs). General-purpose computers, by contrast, compute by making connections in time.
What is Reconfigurable Computing? Computation using hardware that can adapt at the logic level to solve specific problems • Why is this interesting? – Some applications are poorly suited to microprocessor. – VLSI “explosion” provides increasing resources. – Hardware/Software – Relatively new research area.
Spatial Computation Example: grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project; • A hardware resource (multiplier or adder) is allocated for each operator in the compute graph. • The abstract computation graph becomes the implementation template.
Temporal Computation • A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph. • Close to a sequential processor/software solution. Many in-between cases exist.
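The contrast between the two slides above can be sketched in software. This is only an illustration, assuming hypothetical function names (`grade_spatial`, `grade_temporal`): the first function mirrors the compute graph with one operator per node, and the second reuses a single multiply-accumulate unit over several steps and counts those steps.

```python
import math

WEIGHTS = (0.2, 0.2, 0.2, 0.4)  # weights from the grade example


def grade_spatial(mt1, mt2, mt3, project):
    """One dedicated operator per graph node: 4 multipliers, 3 adders."""
    # All four products are independent, so in hardware they run in parallel.
    p1 = WEIGHTS[0] * mt1
    p2 = WEIGHTS[1] * mt2
    p3 = WEIGHTS[2] * mt3
    p4 = WEIGHTS[3] * project
    # Adder tree: two adders in the first level, one in the second.
    return (p1 + p2) + (p3 + p4)


def grade_temporal(mt1, mt2, mt3, project):
    """One multiply-accumulate unit reused over time; returns (value, steps)."""
    acc = 0.0
    steps = 0
    for weight, score in zip(WEIGHTS, (mt1, mt2, mt3, project)):
        acc += weight * score  # the same unit is used in every step
        steps += 1
    return acc, steps


if __name__ == "__main__":
    print(grade_spatial(80, 90, 70, 100))
    print(grade_temporal(80, 90, 70, 100))
```

Both versions compute the same value; the spatial one spends more operators, the temporal one spends more steps.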
Why is Custom Logic Faster Than Software? • Spatial vs. Temporal Computation – Processors divide computation across time, dedicated logic divides across space
Why is Custom Logic Faster Than Software? • Specialization – The instruction set may not provide the operations your program needs. – Processors provide hardware that may not be useful in every program or in every cycle of a given program: • Multipliers • Dividers • Instruction Memory – Processors need lots of memory to hold the instructions that make up a program and to hold intermediate results. • Bit Width Mismatches – In general, processors have a fixed bit width, and all computations are performed on that many bits. • Multimedia vector instructions (MMX) are a response to this.
Microprocessor-based Systems Data Storage (Register File) A B ALU C 64 – Generalized to perform many functions well. – Operates on fixed data sizes. – Inherently sequential.
Reconfigurable Computing If (A > B) { H = A; L = B; } Else { H = B; L = A; } A B Functional Unit H L – Create specialized hardware for each application. – Functional units optimized to perform a special task.
Dataflow • Superscalar must find dataflow graph at run time • RC constructs data flow graph at compile time • no logic control overhead • no window size limitations
Implementation Spectrum Microprocessor Reconfigurable Hardware ASIC – ASIC gives high performance at cost of inflexibility. – Processor is very flexible but not tuned to the application. – Reconfigurable hardware is a nice compromise.
Flexibility vs Data-Processing Rate
Field-Programmable Gate Array Tracks Logic Element LE LE LE – Each logic element outputs one data bit. – Interconnect programmable between elements. – Interconnect tracks grouped into channels.
FPGA Architecture Issues Logic Element – Need to explore architectural issues. – How much functionality should go in a logic element? – How many routing tracks per channel? – Switch “population”?
Real World Physical Issues S S Wires have real cost – Modelling FPGA delay. – Improving performance through buffering/segmentation. – Technology dependent. – The cost of reconfigurability.
Translating a Design to an FPGA C program (C = A+B) → Circuit (adder with inputs A, B and output C) → Array – CAD to translate a circuit from a text description to a physical implementation is well understood. – CAD to translate from a C program to a circuit is not well understood. – It is very difficult for application designers to write high-performance applications. Need for design automation!
High-level Compilers – Difficult to estimate hardware resources. – Some parts of program more appropriate for processor (hardware/software codesign). – Compiler must parallelize computation across many resources. – Engineers like to write in C rather than pushing little blocks around. A C = A+B B + C for (i = 0; i
Reconfigurable Hardware Logic Element A B C D Out A B C D = out – Each logic element operates on four one-bit inputs. – Output is one data bit. – Can perform any Boolean function of four inputs: 2^(2^4) = 65,536 (64K) functions!
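A 4-input logic element can be modelled as a 16-entry truth table. The sketch below is illustrative only; `make_lut`, `lut_eval`, and `count_functions` are hypothetical helper names, not part of any vendor tool. It shows why any Boolean function of four inputs fits in one element and where the function count comes from.

```python
from itertools import product


def make_lut(func, n_inputs=4):
    """Build the truth table (configuration bits) for a Boolean function."""
    # Rows are enumerated with the first input as the most significant bit.
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def lut_eval(table, *bits):
    """Evaluate the LUT: the inputs simply form an index into the table."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return table[index]


def count_functions(n_inputs):
    """Each of the 2**n table entries is 0 or 1, giving 2**(2**n) functions."""
    return 2 ** (2 ** n_inputs)


if __name__ == "__main__":
    parity = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
    print(parity)                    # 16 configuration bits
    print(lut_eval(parity, 1, 0, 1, 1))
    print(count_functions(4))
```

Changing the function only changes the 16 stored bits, not the circuit around them, which is what makes the element reconfigurable.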
Basic Logic Block Architecture
Xilinx - Spartan II Architecture • IOBs provide the interface between the package pins and the internal logic • CLBs provide the functional elements for constructing most logic • Dedicated block RAM memories of 4096 bits each • Clock DLLs for clock-distribution delay compensation and clock domain control • Versatile multi-level interconnect structure
Spartan II Configurable Logic Block • Basic block is a logic cell (LC) – A 4-input function generator (LUT) – Carry logic – Storage element • Each CLB contains – four LCs, organized in two similar slices. – logic that combines function generators to provide functions of five or six inputs. LUT capacity is completely determined by the number of inputs, not the complexity of the function.
Spartan II CLB
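Example: Two-Bit Adder Made of Full Adders. Each full adder (FA) takes inputs A, B and carry-in Ci and produces sum S and carry-out Co. The logic synthesis tool reduces the circuit to SOP form (X' denotes the complement of X): S = A'B'Ci + A'BCi' + AB'Ci' + ABCi and Co = A'BCi + AB'Ci + ABCi' + ABCi. One LUT with inputs A, B, Ci implements S and a second LUT with the same inputs implements Co.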
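The full-adder mapping can be checked with a small model. This is a sketch under stated assumptions: the names `build_lut`, `SUM_LUT`, `CARRY_LUT`, and `two_bit_add` are hypothetical, and the tables are derived from the arithmetic definition of a full adder rather than from any synthesis tool output.

```python
from itertools import product


def build_lut(output_index):
    """Truth table of one full-adder output: 0 selects S, 1 selects Co."""
    table = {}
    for a, b, ci in product((0, 1), repeat=3):
        total = a + b + ci
        table[(a, b, ci)] = (total & 1, total >> 1)[output_index]
    return table


SUM_LUT = build_lut(0)    # 3-input LUT producing S
CARRY_LUT = build_lut(1)  # 3-input LUT producing Co


def minterms(table):
    """Input combinations for which the LUT outputs 1 (the SOP terms)."""
    return sorted(key for key, value in table.items() if value == 1)


def two_bit_add(a, b):
    """Chain two full adders (two LUT pairs) into a 2-bit ripple adder."""
    carry = 0
    result = 0
    for bit in range(2):
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        sum_bit = SUM_LUT[(a_bit, b_bit, carry)]
        carry = CARRY_LUT[(a_bit, b_bit, carry)]
        result |= sum_bit << bit
    return result | (carry << 2)  # final carry becomes the third result bit


if __name__ == "__main__":
    print("S minterms: ", minterms(SUM_LUT))
    print("Co minterms:", minterms(CARRY_LUT))
    print(two_bit_add(3, 2))
```

Each LUT holds only eight configuration bits here, so both outputs of a full adder fit comfortably in 4-input function generators.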
Circuit Compilation 1. Technology Mapping: map the logic onto LUTs. 2. Placement: assign a logical LUT to a physical location. 3. Routing: select wire segments and switches for interconnection.
Processor + FPGA: three possibilities 1. FPGA serves as a coprocessor for data-intensive applications (processor chip and FPGA daughtercard connected by a backplane bus, e.g. PCI). 2. FPGA serves as an embedded computer for low-latency transfer (processor and FPGA on one chip). “Reconfigurable Functional Unit”
Processor + FPGA (cont.) 3. Processor integration Processor RF ALU FPGA – FPGA logic embedded inside processor. – A number of problems with 2 and 3. • Process technology an issue. • ALU much faster than FPGA generally. • FPGA much faster than the entire processor.
Multi-FPGA Systems F F F F F – Most applications don’t fit on one device. – Create need for partitioning designs across many devices. – Effectively a “netlist computer” Each FPGA is a logic processor interconnected in a given topology.
Xilinx XC4000 Cell – Two 4-input look-up tables – One 3-input look-up table – Two D flip-flops
Altera FLEX 10K
Xilinx Virtex CLB
Reconfiguration methodology • Static • Partially static (=partial reconfiguration) • Dynamic
The Design Process 1. Partition a program into sections to be implemented on hardware and software separately 2. Synthesize the computations destined for reconfigurable hardware into gate-level or circuit level description. 3. Map the circuit onto reconfigurable blocks and connect them using reconfigurable routing. 4. After compilation, the circuit is ready for configuration onto the hardware at runtime.
RC Objectives • RC objectives: specialization, performance, flexibility – Specialization: performance, power consumption – Flexibility: programming • Basic idea: “Programmable Hardware”
Reconfigurable Computing Reconfigurable Devices • Routing strategies A B C Continuous Routing A B C Structured Routing
Xilinx XC4000 Routing
Reconfigurable Instruction Set Processors • By including reconfigurability we can increase flexibility with high specialization Processor Reconfigurable Processor PLD
Reconfigurable Instruction Set Processors • Coprocessor based approach ··· Task 1 ··· Task K Software Task K+1 Task N Hardware • ASIP based approach · · · Software Hardware Task 1 Task 2 Task N
Reconfigurable Instruction Set Processors Coprocessor based approach (I) • Typical example: CPU + PCI board – Altera ARC-PCI – Compaq Pamette • System on Chip (SoC) – Altera's Excalibur device – Chameleon Systems, Inc.
Reconfigurable Instruction Set Processors Coprocessor based approach (II) • Altera ARC-PCI
Reconfigurable Instruction Set Processors Coprocessor based approach (III) • Compaq Pamette
Reconfigurable Instruction Set Processors Coprocessor based approach (IV) • Altera's Excalibur device – Embedded Processor: ARM, MIPS or NIOS
Reconfigurable Instruction Set Processors Coprocessor based approach (V) • Chameleon Systems, Inc.
Reconfigurable Instruction Set Processors ASIP based approach (I) • Reconfigurable unit within CPU Fetch Decode Issue Integer Unit FP Unit Branch Unit LD/ST Unit Reconfigurable Unit
Reconfigurable Instruction Set Processors ASIP based approach (II) • Challenge: CAD tools C Code Compiler Instruction Description (Configuration bits) Assembly Code
Reconfigurable Instruction Set Processors ASIP based approach (III) C Code Compiler Structure C Parsing Optimizations Inst. Identification Hardware Estimator Inst. Selection Config. Scheduling Hardware Generation Code Generation Assembly Code Configuration bits
Reconfigurable Instruction Set Processors ASIP based approach (II) • Example: Philips ConCISe Architecture [Figure: datapath in which the encoded instruction word drives the register file, MUX, ALU, and RFU over 32-bit data paths]
Why Compute With FPGAs? • Huge performance gap between software and hand-designed hardware systems – Often a 100-to-1 ratio of performance or performance/area • Hardware systems not so good for general computing – Big design and cost barriers to implementation – Not practical to buy a new machine every time you want to run a different program • Reconfigurable systems offer best-of-both-worlds – Run-time programmability – Hardware-level performance
Good Applications for Reconfigurable Computing • Relatively small application graph – FPGAs have limited capacity – Simple control flow helps a lot • Data Parallelism – Execute same computations on many independent data elements – Pipeline computations through the hardware • Small and/or varying bit widths – Take advantage of the ability to customize the size of operators
Reconfigurable Computing Successes • RSA Decryption – Programmable-Active-Memory machine set record for decryption of RSA-encrypted data • DNA Sequence Matching – Reconfigurable hardware has achieved 100x better performance than contemporary supercomputers • Signal Processing – FPGA-based filters often get 10x better performance than DSP chips – Benefit from customization of hardware to the application • Emulation – Use reconfigurable logic to simulate new processors at high speeds • Cryptographic Attacks – High-performance low-cost implementations for breaking encryption algorithms
FPGAs vs CPUs • Capacity: Instructions are a very dense representation; logic blocks aren’t. • Tools: Compilers for reconfigurable logic aren’t very good. – Some operations are hard to implement on FPGAs. • One approach to capacity is to exploit the 90-10 rule of software: – Run the 90% of code that takes 10% of execution time on a conventional processor. – Run the 10% of code that takes 90% of execution time on reconfigurable logic. • Programmable-reconfigurable processors
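The payoff of the 90-10 split can be estimated with Amdahl's law, which is brought in here as a standard model and is not something the slides themselves state. In this sketch the function name `overall_speedup` and the kernel speedup values are hypothetical, chosen only for illustration.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only part of the run time
    (accelerated_fraction) is sped up by a factor kernel_speedup."""
    remaining = 1.0 - accelerated_fraction          # time left on the CPU
    accelerated = accelerated_fraction / kernel_speedup
    return 1.0 / (remaining + accelerated)


if __name__ == "__main__":
    # 90% of execution time moved to reconfigurable logic; the kernel
    # speedups below are assumed values, not measurements.
    for kernel_speedup in (2.0, 10.0, 100.0):
        print(kernel_speedup, round(overall_speedup(0.9, kernel_speedup), 2))
```

Under this model the 10% of time left on the conventional processor caps the overall gain below 10x, however fast the reconfigurable logic runs its share.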
Fine-Grained System: Chimaera • Treat reconfigurable array as ALU within superscalar – Array implements some number of custom instructions for each program – Register file is interface between programmable and reconfigurable
Chimaera • Programmed in C – Instruction combining – Control localization – SIMD Within a Register • Simulation Studies – Example applications only require 8 RFUOPs in the reconfigurable array – Equivalent to 32 rows in RFU • Performance Results – Vary strongly from application to application – Also dependent on model used for RFU delay – Average speedup of 20-30%, one application sees >2x improvement
Coarse-Grained System: Garp • Small programmable processor with large reconfigurable array – Interface through memory system
Garp • Again, Programmed in C – Compiler attempts to map loop nests onto the reconfigurable array • Data Encryption Standard – Estimated 24x speedup over UltraSPARC • Image Dithering – 9x speedup • Sorting – 2x speedup
Advantages of RC • Relative to microprocessors: on average a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices. • Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels. • Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization. • Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure for memory capacity, BW, and low-latency and local communication patterns. • Modern FPGAs make good system-level components: • Relatively large number of IOs (many parallel memory ports). • High-BW communications. • Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers). • Low-level redundancy permits fault-tolerance and great cost savings. • Built-in microprocessors. • Is there still room for research in novel devices for RC?
Advantages of RC • Even in an application with fixed algorithms, reconfigurable devices may offer advantages over a full-custom or ASIC approach: • FPGAs are process drivers, therefore a generation ahead of ASIC. • Increasing NREs for ASIC and full-custom have pushed the "cross-over" point way out. • Time to market advantage. • Programmability leads to: • project risk management • extended product life-times • Dynamic reconfiguration might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization. • Largely unexploited (unproven) to date. • A few research projects have explored this idea.
RC Disadvantages • Reconfiguration time might be critical in run-time reconfigurable systems. • Low utilization of hardware resources in configurable systems.
FPGAs are Reconfigurable 1. Commercial applications have not taken advantage of reconfigurability. • Xilinx/Altera haven’t done much to help. • Methodologies/tools nearly nonexistent. 2. Volume/cost graphs don’t accurately capture the potential real costs and other advantages. Reconfiguration uses: • Field upgrades, product life extension, changing requirements. • In-system board-level testing and field diagnostics. • Tolerance to manufacturing faults. • Risk management in system development. • Runtime reconfiguration: higher silicon efficiency. • Time-multiplexed pre-designed circuits make maximum use of resources. • Runtime specialized circuit generation.
Silicon Usage
Performance: ~10x Speedup Efficiency: ~10x Lower Chip Costs: ~0.5x (increased yield) • Decreased complexity • Decreased design cost