  • Number of slides: 61

Reconfigurable Computing
Nehir Sönmez
25-11-2004

Reconfigurable Computing
Standard definition: A reconfigurable computer is a device which computes by using post-fabrication spatial connections of compute elements. [Dehon]
• An FPGA implementation of a processor core that runs a program is excluded: it is not a spatial mapping of the problem.
• ASIC implementations are excluded: they are not post-fabrication programmable.
The definition restricts RC to mapping onto fine-grained devices (such as FPGAs), whereas general-purpose computers compute by making connections in time.

What is Reconfigurable Computing?
Computation using hardware that can adapt at the logic level to solve specific problems.
• Why is this interesting?
  – Some applications are poorly suited to microprocessors.
  – The VLSI “explosion” provides increasing resources.
  – Hardware/Software
  – Relatively new research area.

Spatial Computation
Example: grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
• A hardware resource (multiplier or adder) is allocated for each operator in the compute graph.
• The abstract computation graph becomes the implementation template.

Temporal Computation
• A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
• Close to a sequential processor/software solution.
Many in-between cases exist.
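The difference between the two styles can be sketched in C using the grade example from the Spatial Computation slide. This is only a software model: the function names are made up for illustration, and C executes both versions sequentially. The first function mirrors the compute graph, where each `*` and `+` would get its own hardware unit; the second reuses a single multiply-accumulate step once per term, as a time-multiplexed resource would.

```c
#include <stddef.h>

/* Spatial style (hypothetical name): one expression whose operators map
 * one-to-one onto multipliers and adders in the compute graph. */
double grade_spatial(double mt1, double mt2, double mt3, double project) {
    return 0.2 * mt1 + 0.2 * mt2 + 0.2 * mt3 + 0.4 * project;
}

/* Temporal style (hypothetical name): a single multiply-accumulate
 * step is reused in four consecutive time steps. */
double grade_temporal(const double scores[4]) {
    static const double weights[4] = {0.2, 0.2, 0.2, 0.4};
    double acc = 0.0;
    for (size_t step = 0; step < 4; step++) {
        acc += weights[step] * scores[step];  /* same "unit" every step */
    }
    return acc;
}
```

Both functions compute the same grade; what differs is how many operator instances the structure implies (four multipliers and three adders versus one multiply-accumulate unit used four times).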

Why is Custom Logic Faster Than Software?
• Spatial vs. Temporal Computation
  – Processors divide computation across time; dedicated logic divides it across space.

Why is Custom Logic Faster Than Software?
• Specialization
  – The instruction set may not provide the operations your program needs.
  – Processors provide hardware that may not be useful in every program or in every cycle of a given program.
    • Multipliers
    • Dividers
• Instruction Memory
  – Processors need lots of memory to hold the instructions that make up a program and to hold intermediate results.
• Bit Width Mismatches
  – In general, processors have a fixed bit width, and all computations are performed on that many bits.
    • Multimedia vector instructions (MMX) are a response to this.

Microprocessor-based Systems
[Diagram labels: Data Storage (Register File), A, B, ALU, C, 64]
– Generalized to perform many functions well.
– Operates on fixed data sizes.
– Inherently sequential.

Reconfigurable Computing
if (A > B) { H = A; L = B; } else { H = B; L = A; }
[Diagram labels: A, B, Functional Unit, H, L]
– Create specialized hardware for each application.
– Functional units optimized to perform a special task.
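One way to picture the functional unit for this if/else is a comparator driving two multiplexers. The C sketch below models that structure in software; `mux2` and `sort2` are hypothetical names chosen for illustration, and the 32-bit operand width is an assumption, not something the slide specifies.

```c
#include <stdint.h>

/* 2:1 multiplexer model: returns x when sel is 1, y when sel is 0. */
static uint32_t mux2(uint32_t sel, uint32_t x, uint32_t y) {
    uint32_t mask = (uint32_t)0 - (sel & 1u);  /* all ones if sel, else zero */
    return (x & mask) | (y & ~mask);
}

/* Functional-unit model: one comparator feeds both multiplexers, so
 * H (larger) and L (smaller) are produced side by side. */
void sort2(uint32_t a, uint32_t b, uint32_t *h, uint32_t *l) {
    uint32_t a_gt_b = (uint32_t)(a > b);  /* comparator output */
    *h = mux2(a_gt_b, a, b);
    *l = mux2(a_gt_b, b, a);
}
```

Calling `sort2(3, 7, &h, &l)` leaves 7 in `h` and 3 in `l`. In hardware, the comparison and both selections would be a single combinational block rather than a sequence of instructions.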

Dataflow
• A superscalar processor must find the dataflow graph at run time.
• RC constructs the dataflow graph at compile time:
  • no control logic overhead
  • no window size limitations

Implementation Spectrum
[Diagram labels: Microprocessor, Reconfigurable Hardware, ASIC]
– ASIC gives high performance at the cost of inflexibility.
– Processor is very flexible but not tuned to the application.
– Reconfigurable hardware is a nice compromise.

Flexibility vs Data-Processing Rate

Field-Programmable Gate Array
[Diagram labels: Tracks, Logic Element (LE)]
– Each logic element outputs one data bit.
– Interconnect is programmable between elements.
– Interconnect tracks are grouped into channels.

FPGA Architecture Issues
[Diagram label: Logic Element]
– Need to explore architectural issues.
– How much functionality should go in a logic element?
– How many routing tracks per channel?
– Switch “population”?

Real World Physical Issues
[Diagram labels: S, S]
Wires have real cost.
– Modelling FPGA delay.
– Improving performance through buffering/segmentation.
– Technology dependent.
– The cost of reconfigurability.

Translating a Design to an FPGA
[Diagram labels: C program (C = A+B), Circuit (A, B, +, C), Array]
– CAD to translate a circuit from a text description to a physical implementation is well understood.
– CAD to translate from a C program to a circuit is not well understood.
– It is very difficult for application designers to write high-performance applications successfully.
Need for design automation!

High-level Compilers
– Difficult to estimate hardware resources.
– Some parts of the program are more appropriate for the processor (hardware/software codesign).
– Compiler must parallelize computation across many resources.
– Engineers like to write in C rather than pushing little blocks around.
[Diagram labels: A, B, +, C; C = A+B]
for (i = 0; i

Reconfigurable Hardware
[Diagram labels: Logic Element; inputs A, B, C, D; output Out]
– Each logic element operates on four one-bit inputs.
– Output is one data bit.
– Can perform any Boolean function of four inputs: 2^(2^4) = 2^16 = 65,536 (64K) functions!
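The 64K count follows from how a four-input logic element can be modelled: as a 16-entry truth table, one configuration bit per input combination. The sketch below illustrates that idea in C; `lut4_eval` and the bit ordering (A as the least significant index bit) are assumptions made here, not the configuration format of any particular device.

```c
#include <stdint.h>

/* Evaluate a 4-input LUT. 'config' holds the 16-entry truth table;
 * the inputs select which bit is driven to the output.
 * Assumed index order: A is bit 0, B is bit 1, C is bit 2, D is bit 3. */
unsigned lut4_eval(uint16_t config, unsigned a, unsigned b,
                   unsigned c, unsigned d) {
    unsigned index = (a & 1u) | ((b & 1u) << 1) |
                     ((c & 1u) << 2) | ((d & 1u) << 3);
    return (config >> index) & 1u;
}

/* Two example configurations out of the 65,536 possible ones. */
#define LUT4_AND 0x8000u  /* output 1 only when A = B = C = D = 1 */
#define LUT4_XOR 0x6996u  /* output 1 when an odd number of inputs are 1 */
```

Changing the function of the element means changing the 16-bit constant, not the evaluation logic, which is why every Boolean function of four inputs costs the same.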

Basic Logic Block Architecture

Xilinx - Spartan II Architecture
• IOBs provide the interface between the package pins and the internal logic.
• CLBs provide the functional elements for constructing most logic.
• Dedicated block RAM memories of 4096 bits each.
• Clock DLLs for clock-distribution delay compensation and clock domain control.
• Versatile multi-level interconnect structure.

Spartan II Configurable Logic Block
• The basic block is a logic cell (LC), which contains:
  – a 4-input function generator (LUT)
  – carry logic
  – a storage element
• Each CLB contains:
  – four LCs, organized in two similar slices
  – logic that combines function generators to provide functions of five or six inputs
LUT capacity is completely determined by the number of inputs, not by the complexity of the function.

Spartan II CLB

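Example: Two Bit Adder Made of Full Adders
[Diagram labels: A + B = D; full adder (FA) with inputs A, B, Ci and outputs S, Co; one LUT with inputs A, B, Ci per output]
The logic synthesis tool reduces the circuit to SOP form (X' denotes the complement of X; the overbars were lost in the slide text, so the terms below are the standard full-adder equations):
S = A'B'Ci + A'BCi' + AB'Ci' + ABCi
Co = A'BCi + AB'Ci + ABCi' + ABCi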
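A small C model can make the LUT mapping of the adder concrete. It assumes the standard full-adder truth tables (sum is the parity of A, B, Ci; carry-out is their majority) stored as 8-bit LUT contents, with A as the least significant index bit. The names `lut3`, `LUT_SUM`, `LUT_CARRY` and `add2` are illustrative only.

```c
#include <stdint.h>

/* Truth tables for one full adder, indexed by (Ci << 2) | (B << 1) | A. */
#define LUT_SUM   0x96u  /* 1 for indices 1, 2, 4, 7: odd number of ones */
#define LUT_CARRY 0xE8u  /* 1 for indices 3, 5, 6, 7: at least two ones  */

/* Evaluate a 3-input LUT whose contents are given by 'table'. */
static unsigned lut3(uint8_t table, unsigned a, unsigned b, unsigned ci) {
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((ci & 1u) << 2);
    return (table >> index) & 1u;
}

/* Two-bit ripple adder built from two full adders (two LUTs each).
 * Returns a 3-bit result: two sum bits plus the final carry. */
unsigned add2(unsigned a, unsigned b) {
    unsigned carry = 0, sum = 0;
    for (int bit = 0; bit < 2; bit++) {
        unsigned ai = (a >> bit) & 1u;
        unsigned bi = (b >> bit) & 1u;
        sum |= lut3(LUT_SUM, ai, bi, carry) << bit;
        carry = lut3(LUT_CARRY, ai, bi, carry);  /* ripples to next stage */
    }
    return sum | (carry << 2);
}
```

For example, `add2(3, 3)` yields 6. Each call to `lut3` corresponds to one LUT in the mapped circuit, so the two-bit adder uses four LUTs in this model.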

Circuit Compilation
1. Technology Mapping (diagram label: LUT).
2. Placement: assign a logical LUT to a physical location.
3. Routing: select wire segments and switches for interconnection.

Processor + FPGA
Three possibilities:
1. FPGA serves as a coprocessor for data-intensive applications. [Diagram labels: Proc chip, daughtercard, FPGA, Backplane bus (e.g. PCI)]
2. FPGA serves as an embedded computer for low-latency transfer: a “Reconfigurable Functional Unit”. [Diagram labels: Proc, FPGA, chip]

Processor + FPGA (cont.)
3. Processor integration [Diagram labels: Processor, RF, ALU, FPGA]
– FPGA logic embedded inside the processor.
– A number of problems with 2 and 3:
  • Process technology is an issue.
  • The ALU is generally much faster than the FPGA.
  • FPGA much faster than the entire processor.

Multi-FPGA Systems
[Diagram labels: F, F, F, F, F]
– Most applications don’t fit on one device.
– This creates a need for partitioning designs across many devices.
– Effectively a “netlist computer”: each FPGA is a logic processor, interconnected in a given topology.

Xilinx XC4000 Cell
– 2 4-input look-up tables
– 1 3-input look-up table
– 2 D flip-flops

Altera Flex 10K

Xilinx Virtex CLB

Reconfiguration methodology
• Static
• Partially static (= partial reconfiguration)
• Dynamic

The Design Process
1. Partition a program into sections to be implemented on hardware and software separately.
2. Synthesize the computations destined for reconfigurable hardware into a gate-level or circuit-level description.
3. Map the circuit onto reconfigurable blocks and connect them using reconfigurable routing.
4. After compilation, the circuit is ready for configuration onto the hardware at runtime.

RC Objectives
• RC objectives: specialization, performance, flexibility
[Diagram labels: Performance, Power consumption, Specialization, Flexibility, Programming]
• Basic idea: “Programmable Hardware”

Reconfigurable Computing: Reconfigurable Devices
• Routing strategies
[Diagram labels: A, B, C with Continuous Routing; A, B, C with Structured Routing]

Xilinx XC4000 Routing

Reconfigurable Instruction Set Processors
• By including reconfigurability we can increase flexibility while keeping high specialization.
[Diagram labels: Processor, Reconfigurable Processor, PLD]

Reconfigurable Instruction Set Processors
• Coprocessor based approach
  [Diagram labels: Task 1 ··· Task K under Software; Task K+1 ··· Task N under Hardware]
• ASIP based approach
  [Diagram labels: Software, Hardware; Task 1, Task 2, ··· Task N]

Reconfigurable Instruction Set Processors
Coprocessor based approach (I)
• Typical example: CPU + PCI board
  – Altera ARC-PCI
  – Compaq Pamette
• System on Chip (SoC)
  – Altera's Excalibur device
  – Chameleon Systems, Inc.

Reconfigurable Instruction Set Processors
Coprocessor based approach (II)
• Altera ARC-PCI

Reconfigurable Instruction Set Processors
Coprocessor based approach (III)
• Compaq Pamette

Reconfigurable Instruction Set Processors
Coprocessor based approach (IV)
• Altera's Excalibur device
  – Embedded processor: ARM, MIPS or Nios

Reconfigurable Instruction Set Processors
Coprocessor based approach (V)
• Chameleon Systems, Inc.

Reconfigurable Instruction Set Processors
ASIP based approach (I)
• Reconfigurable unit within CPU
[Diagram labels: Fetch, Decode, Issue, Integer Unit, FP Unit, Branch Unit, LD/ST Unit, Reconfigurable Unit]

Reconfigurable Instruction Set Processors
ASIP based approach (II)
• Challenge: CAD tools
[Diagram labels: C Code, Compiler, Instruction Description (Configuration bits), Assembly Code]

Reconfigurable Instruction Set Processors
ASIP based approach (III)
[Diagram: Compiler Structure. Labels: C Code, C Parsing, Optimizations, Inst. Identification, Hardware Estimator, Inst. Selection, Config. Scheduling, Hardware Generation, Code Generation, Assembly Code, Configuration bits]

Reconfigurable Instruction Set Processors
ASIP based approach (II)
• Example: Philips ConCISe Architecture
[Diagram labels: Encoded Instruction Word, Register File, MUX, ALU, RFU; bus widths 5, 32, 4]

Why Compute With FPGAs?
• Huge performance gap between software and hand-designed hardware systems
  – Often a 100-to-1 ratio of performance or performance/area
• Hardware systems are not so good for general computing
  – Big design and cost barriers to implementation
  – Not practical to buy a new machine every time you want to run a different program
• Reconfigurable systems offer the best of both worlds
  – Run-time programmability
  – Hardware-level performance

Good Applications for Reconfigurable Computing
• Relatively small application graph
  – FPGAs have limited capacity
  – Simple control flow helps a lot
• Data Parallelism
  – Execute the same computations on many independent data elements
  – Pipeline computations through the hardware
• Small and/or varying bit widths
  – Take advantage of the ability to customize the size of operators

Reconfigurable Computing Successes
• RSA Decryption
  – Programmable-Active-Memory machine set a record for decryption of RSA-encrypted data
• DNA Sequence Matching
  – Reconfigurable hardware has achieved 100x better performance than contemporary supercomputers
• Signal Processing
  – FPGA-based filters often get 10x better performance than DSP chips
  – Benefit from customization of hardware to the application
• Emulation
  – Use reconfigurable logic to simulate new processors at high speeds
• Cryptographic Attacks
  – High-performance, low-cost implementations for breaking encryption algorithms

FPGAs vs CPUs
• Capacity: instructions are a very dense representation; logic blocks aren’t
• Tools: compilers for reconfigurable logic aren’t very good
  – Some operations are hard to implement on FPGAs
One approach to capacity is to exploit the 90-10 rule of software:
  – Run the 90% of code that takes 10% of execution time on a conventional processor
  – Run the 10% of code that takes 90% of execution time on reconfigurable logic
• Programmable-reconfigurable processors
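The payoff of the 90-10 split can be estimated with Amdahl's law, which the slide does not name but which applies directly: if a fraction f of execution time is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The helper below is a sketch under that assumption; the function name is hypothetical and it ignores reconfiguration and data-transfer overheads.

```c
/* Overall speedup when 'fraction' of the original execution time
 * (0.0 to 1.0) runs 'accel' times faster and the rest is unchanged.
 * Amdahl's law; overheads of moving work to the FPGA are ignored. */
double amdahl_speedup(double fraction, double accel) {
    return 1.0 / ((1.0 - fraction) + fraction / accel);
}
```

With the slide's numbers, accelerating the hot 90% of execution time by 10x gives `amdahl_speedup(0.9, 10.0)`, roughly 5.3x overall, and no amount of acceleration can exceed 10x because the remaining 10% still runs on the conventional processor.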

Fine-Grained System: Chimaera
• Treat the reconfigurable array as an ALU within a superscalar processor
  – Array implements some number of custom instructions for each program
  – Register file is the interface between programmable and reconfigurable logic

Chimaera
• Programmed in C
  – Instruction combining
  – Control localization
  – SIMD Within a Register
• Simulation Studies
  – Example applications only require 8 RFUOPs in the reconfigurable array
  – Equivalent to 32 rows in the RFU
• Performance Results
  – Vary strongly from application to application
  – Also dependent on the model used for RFU delay
  – Average speedup of 20-30%; one application sees >2x improvement
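“SIMD Within a Register” treats one machine word as several narrow lanes. The slides do not show how the compiler applies it, so the following is only a generic software illustration of the idea: adding four independent 8-bit lanes packed into a 32-bit word without letting carries cross lane boundaries. The function name is made up for this sketch.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes independently (each lane wraps modulo 256).
 * The low 7 bits of every lane are added normally; the top bit of each
 * lane is then fixed up with XOR so no carry leaks into the next lane. */
uint32_t add_bytes_swar(uint32_t a, uint32_t b) {
    const uint32_t high = 0x80808080u;             /* top bit of each lane */
    uint32_t low_sum = (a & ~high) + (b & ~high);  /* carries stay in-lane */
    return low_sum ^ ((a ^ b) & high);             /* add top bits, no carry out */
}
```

On a fixed-width processor this takes several instructions per word; a reconfigurable unit could implement the same lane-wise adder directly by cutting the carry chain at each lane boundary.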

Coarse-Grained System: Garp
• Small programmable processor with a large reconfigurable array
  – Interface through the memory system

Garp
• Again, programmed in C
  – Compiler attempts to map loop nests onto the reconfigurable array
• Data Encryption Standard
  – Estimated 24x speedup over UltraSPARC
• Image Dithering
  – 9x speedup
• Sorting
  – 2x speedup

Advantages of RC
• Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices.
• Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
• Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware through the use of problem-specific architectures and low-level circuit specialization.
• Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure for memory capacity, BW, and low-latency and local communication patterns.
• Modern FPGAs make good system-level components:
  • Relatively large number of IOs (many parallel memory ports)
  • High-BW communications.
  • Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers).
  • Low-level redundancy permits fault tolerance and great cost savings.
  • Built-in microprocessors.
• Is there still room for research in novel devices for RC?

Advantages of RC
• Even in an application with fixed algorithms, reconfigurable devices may offer advantages over a full-custom or ASIC approach:
  • FPGAs are process drivers, and therefore a generation ahead of ASICs.
  • Increasing NREs for ASIC and full-custom designs have pushed the "cross-over" point way out.
  • Time-to-market advantage.
  • Programmability leads to:
    • project risk management
    • extended product lifetimes
• Dynamic reconfiguration might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
  • Largely unexploited (unproven) to date.
  • A few research projects have explored this idea.

RC Disadvantages
• Reconfiguration time might be critical in run-time reconfigurable systems.
• Low utilization of hardware resources in configurable systems.

FPGAs are Reconfigurable
1. Commercial applications have not taken advantage of reconfigurability.
  • Xilinx/Altera haven’t done much to help.
  • Methodologies/tools are nearly nonexistent.
2. Volume/cost graphs don’t accurately capture the potential real costs and other advantages.
Reconfiguration uses:
• Field upgrades, product life extension, changing requirements.
• In-system board-level testing and field diagnostics.
• Tolerance to manufacturing faults.
• Risk management in system development.
• Runtime reconfiguration: higher silicon efficiency.
  • Time-multiplexed pre-designed circuits make maximum use of resources.
  • Runtime specialized circuit generation.

Silicon Usage

Performance: ~10x speedup
Efficiency: ~10x
Lower chip costs: ~0.5x (increased yield)
• Decreased complexity
• Decreased design cost