  • Number of slides: 35

Industrial Experiences Pioneering Asynchronous Commercial Design
Peter A. Beerel
Fulcrum Microsystems
Calabasas Hills, CA, USA

Agenda
  • Introduction to Fulcrum
  • Description of Integrated Pipelining
      • Fulcrum's clockless circuit architecture
  • Description of Fulcrum's Design Flow
  • Overview of Nexus
      • Fulcrum's terabit crossbar
  • Overview of PivotPoint
      • Fulcrum's first commercial product
[Figures: Circuit A / Circuit B pipeline; design-flow diagram (Specification, Design & Verification, Synthesis & Floor Planning, Physical Design, Simulation & Verification, Database, Release to Manufacturing)]

Company Snapshot
  • "Clockless" semiconductor company
  • Located in Calabasas, CA (30 people)
  • Formed out of Caltech (1/00)
  • Technology proven in large-scale designs
  • Backed by top-tier investors (raised $14 M in June)

Fulcrum's Integrated Pipelining
  • Robust, power efficient, and high performance
  • Fast delay-insensitive style using domino logic without latches
  • Developed at Caltech by Fulcrum's founders
[Figure: dual-rail domino logic stage with acknowledge]

Integrated Pipelining
[Figure: Leaf Cells A, B, and C; each leaf cell contains dual-rail domino logic, control, input completion detection, and output completion detection]
Harnessing the power of domino logic
  • Addresses delay variability with completion sensing
  • Addresses power inefficiency with async handshakes
  • Leverages more efficient "N" transistors

Hierarchical Design
  • Multi-level hierarchy of communicating blocks
  • At each level blocks communicate along channels
[Figure: design hierarchy from the ASIC (Main FSM, Memory, Register Bank, Adder, Multiplier, Subtract/Divider, Adder/Mult., Reg A, Reg B, Reg C) down to leaf cells (BN-1, BN-2, BN-3; FAN-1, FAN-2, FAN-3, FA 0), with blocks connected by channels]

Leaf Cells
  • Definition
      • Smallest block that performs logic and communicates via channels
      • Based on a small number of pipeline templates guiding design
      • Forms basic building block for physical design
  • Features
      • Facilitates high throughput and low latency
      • Provides easy timing validation and analog verification
  • ~1,000 digital leaf cell types compose our leaf cell library
  • ~200 additional subtypes for different physical environments (e.g., loads)
[Figure: leaf cell template with blocks C, F, LCD, RCD, and D]

Template-Based Cell Design
  • Each pipeline style (QDI, timed…) has a different blueprint
  • Library uses a blueprint to implement the lowest level blocks
[Figures: blueprint for a QDI N-input M-output pipeline stage; 2-input 1-output pipeline stage; 1-input 2-output pipeline stage (each built from C, F, LCD, and RCD blocks)]

Summary of Characteristics
  • Delay-insensitive timing model
      • Gates and wires can have arbitrary delays
  • 4-phase 1-of-4 handshake
      • Uses 4 wires to send 2 bits
      • Plus an acknowledge wire for flow control
      • Returned to neutral between each data transfer
      • Self shielding
  • Precharge domino logic plus async handshake
      • Low latency; high frequency; robust
      • Auto power conservation; zero standby power
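The 4-phase 1-of-4 handshake above can be illustrated with a small behavioral sketch. It is not Fulcrum's circuit: the function names, the wire ordering, and the active-high acknowledge polarity are assumptions made for illustration only.

```python
NEUTRAL = (0, 0, 0, 0)  # all four data wires low between transfers


def encode_1of4(value):
    """Map a 2-bit value (0..3) onto four wires with exactly one wire high."""
    if value not in range(4):
        raise ValueError("a 1-of-4 code carries exactly 2 bits")
    return tuple(1 if i == value else 0 for i in range(4))


def is_valid(wires):
    """Completion detection: data is valid when exactly one wire is high."""
    return sum(wires) == 1


def is_neutral(wires):
    return tuple(wires) == NEUTRAL


def decode_1of4(wires):
    if not is_valid(wires):
        raise ValueError("wires do not hold a valid 1-of-4 code")
    return tuple(wires).index(1)


def four_phase_transfer(value):
    """List the (data wires, acknowledge) states of one 4-phase transfer.

    Acknowledge polarity is an assumption; real templates may invert it.
    """
    data = encode_1of4(value)
    return [
        (data, 0),     # 1. sender raises one data wire
        (data, 1),     # 2. receiver detects validity and acknowledges
        (NEUTRAL, 1),  # 3. sender returns the data wires to neutral
        (NEUTRAL, 0),  # 4. receiver releases acknowledge; channel is idle
    ]
```

Because every transfer ends in the neutral state, a receiver can tell two consecutive identical values apart without any clock.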

Fulcrum Design Flow
  • Hierarchical design flow
      • Executable specifications
      • Formal decomposition
      • Creates design hierarchy
  • Semi-custom synthesis & layout
      • Hierarchical floor planning
      • Automated transistor sizing
      • Semi-automated physical design
  • Supports synchronous & asynchronous designs
      • Hard macro from place & route
[Figure: flow diagram: Design Specification, Architecture Design & Verification, Micro-architecture Design & Verification, Synthesis & Floor Planning, Physical Design, Mitered Simulation & Verification, Database, Release to Manufacturing]

Managing Design Hierarchy
  • Proprietary object-oriented hardware language
      • Integrated hierarchical design/verification language
  • Defines cell specification & implementation
      • Specification: Java or communicating sequential processes (CSP)
      • Implementation: multiple forms
      • Sub-cells defined in terms of specification or implementation
  • Defines integrated test environment for each cell
      • Enables verification at all pairs of levels
  • Efficiency features
      • Supports refinement of cells and channels

Physical Design
  • Layout hierarchy based on design hierarchy
      • Hierarchical floor planning, semi-automated
      • Large-scale hand placement before sizing
      • Long-distance channels planned carefully
  • Timing closure by construction
      • Placement drives sizing
      • Can insert extra pipelining on long wires late in design
  • Tradeoffs between performance and design time
      • Hand layout where necessary
      • Automated layout where possible
  • Goals
      • Full-custom density and speed within ASIC design time

Design Verification: System-Level
  • Mission
      • Verify that executable spec = written spec + gate-level model
  • Use industry-standard tools & methods
      • Cadence NCSIM and efficient Java-Verilog interface
      • Directed random testing
      • Line & functional coverage
[Figure: test bench with Configuration Manager, Test Cases, Traffic Generator & Checker, Bus Functional Model, Monitor, and the Device Under Test (Executable Spec or Gate-level Verilog Model)]

Design Verification: Unit-Level
  • Mitered co-simulation for unit-level verification
      • Check correctness of digital model by comparing it to golden CSP/Java model
  • Features
      • Framework automated and regressed
      • Checks correctness
      • Checks delay insensitivity and/or throughput and latency
[Figure: Test Engine output is copied to a high-level model (Java/CSP) and a low-level model (CSP/PRS/CDL); the two outputs are compared (==) and logged]
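As a rough illustration of mitered co-simulation, the sketch below feeds the same stimulus to a high-level "golden" model and a lower-level model and logs every disagreement. The 8-bit adder, the ripple-carry stand-in for a low-level netlist, and all function names are hypothetical; Fulcrum's actual framework compares Java/CSP models against CSP/PRS/CDL descriptions.

```python
import random


def golden_adder(a, b):
    """High-level reference model: 8-bit addition with wraparound."""
    return (a + b) & 0xFF


def ripple_carry_adder(a, b):
    """Lower-level model: the same function built bit by bit."""
    result, carry = 0, 0
    for i in range(8):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def miter(golden, candidate, stimuli):
    """Apply each stimulus to both models and log every mismatch."""
    log = []
    for args in stimuli:
        expected = golden(*args)
        actual = candidate(*args)
        if expected != actual:
            log.append((args, expected, actual))
    return log


rng = random.Random(0)  # fixed seed keeps the regression reproducible
stimuli = [(rng.randrange(256), rng.randrange(256)) for _ in range(1000)]
mismatches = miter(golden_adder, ripple_carry_adder, stimuli)
```

An empty `mismatches` list means the two models agreed on every stimulus; a non-empty list gives the failing inputs directly, which is what makes the miter convenient in automated regressions.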

Analog Verification: Charge Sharing
  • SPICE-based charge sharing analysis
  • Test case generation and analysis automated
  • Charge-sharing problems solved in numerous ways
      • Symmetrization
      • Less transistor sharing
      • Delay perturbations
[Figure: Test Generator and Synthesis blocks]

Synthesis: Gate Generation / Sizing
  • Automated generation of transistor netlists
      • Dynamic logic generation
      • Transistor sharing
      • Symmetrization
      • Gate-library matching
  • Transistor sizing
      • Path-based sizing to meet amortized unit-delay model
  • Micro-architecture feedback
      • Identifies where fanout limits performance
[Figure: CSP feeds Logic Synthesis (with Gate Library), then Transistor Sizing (with floor planning information), producing a CDL netlist]

Fulcrum QDI v. Synchronous Flows
  • Saves clock tree design, analysis, optimization, and verification
  • No timing closure problems
      • Unexpected long-wire bottlenecks easily solved with additional pipeline buffers late in design cycle
  • QDI/DI timing model reduces timing analysis challenges
  • Fulcrum QDI hierarchical design facilitates:
      • Composability, re-use, and early bug detection
  • Hierarchical floor planning improves predictability of wires
  • Template-based leaf cell design simplifies logic design
  • Design reuse reduces criticality of high-level synthesis
  • Decomposition methodology amenable to formal verification

Globally Asynchronous, Locally Synchronous
  • SoC designs: many cores with different clock domains
  • Async circuits can interconnect multiple sync cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing
  • Fulcrum's "Nexus" is a high-speed on-chip interconnect:
      • 16-port, 36-bit asynchronous crossbar
      • Asynchronous cross-chip channels
      • Async-sync clock domain converters
      • Runs at 1.35 GHz in 130 nm process

Nexus System-on-Chip Interconnect
  • Non-blocking crossbar
  • 16 full-duplex ports
  • Flow control extends through the crossbar
  • Full speed arbitration
  • Arbitrary-length "bursts"
  • Bridges clock domains
  • Scales in bit width and ports
  • Process portable
[Figure: generic Nexus example; legend: synchronous IP block, asynchronous IP block, pipelined repeater, clock domain converter]

Nexus Burst Format
  • Fields: Data 36 bit, Tail 1 bit, Control 4 bit
  • Incoming from source module: data words D1, D2, D3, …, DN with tail bits 0, 0, 0, …, 1, and a "To" control field
  • Outgoing to target module: data words D1, D2, D3, …, DN with tail bits 0, 0, 0, …, 1, and a "From" control field
  • Arbitrary-length source-routed bursts provide flexibility
[Figure: burst layout between Source Module and Target Module]
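One way to read the burst format is sketched below: every word carries 36 data bits and a tail bit that is 1 only on the last word, and the crossbar swaps the "To" control field for a "From" field on the way out. Treating the 4-bit control field as a port number (16 ports need 4 bits) is an assumption, as are the `Word`, `make_burst`, and `route` names.

```python
from dataclasses import dataclass

DATA_BITS = 36
NUM_PORTS = 16  # assumption: the 4-bit control field selects one of 16 ports


@dataclass(frozen=True)
class Word:
    data: int  # 36-bit payload
    tail: int  # 1 on the last word of a burst, 0 otherwise


def make_burst(payload):
    """Wrap a non-empty list of 36-bit values into words with tail bits."""
    if not payload:
        raise ValueError("a burst needs at least one word")
    for value in payload:
        if not 0 <= value < (1 << DATA_BITS):
            raise ValueError("data word does not fit in 36 bits")
    last = len(payload) - 1
    return [Word(value, 1 if i == last else 0) for i, value in enumerate(payload)]


def route(source_port, to_port, burst):
    """Model the crossbar: deliver the burst to `to_port`, labeled with its source.

    Returns (output port, "From" control value, words).
    """
    for port in (source_port, to_port):
        if not 0 <= port < NUM_PORTS:
            raise ValueError("port number does not fit the 4-bit control field")
    return to_port, source_port, burst
```

The tail bit is what allows bursts of arbitrary length: the crossbar holds the connection until it sees a word with `tail == 1`, so no length field is needed.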

Sync-to-Async Conversion
  • Synchronous request/grant FIFO protocol
      • Data transferred if request and grant are both high on rising edge of clock
      • Compensates for any skew on asynchronous side
      • Low latency: 1/2 to 3/2 clock cycles at A2S
  • Seamlessly bridges different clock domains
[Figure: S2A and A2S converters between synchronous datapaths (Request, Grant, clock) and an asynchronous datapath]
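The request/grant rule can be shown with a cycle-level sketch of the synchronous side. It assumes a sender that raises request whenever it has a word pending and holds that word until it is accepted; the `run_sender` name and the list-of-grants interface are illustrative, not part of the Nexus converters.

```python
from collections import deque


def run_sender(words, grant_per_cycle):
    """Return (cycle, word) pairs for every rising edge where a transfer occurs.

    A word moves only on an edge where request and grant are both high;
    otherwise the sender keeps the same word for the next edge.
    """
    pending = deque(words)
    accepted = []
    for cycle, grant in enumerate(grant_per_cycle):
        request = bool(pending)  # request is high while data is waiting
        if request and grant:
            accepted.append((cycle, pending.popleft()))
    return accepted


transfers = run_sender(["a", "b", "c"], [0, 1, 0, 1, 1])
```

Here `transfers` is `[(1, "a"), (3, "b"), (4, "c")]`: nothing moves in cycles 0 and 2 because grant is low, and no word is lost or duplicated.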

Arbitration and Ordering
  • Unrelated sender/receiver links are independent
  • Bursts sent from multiple input ports to the same output port are serviced fairly by built-in arbitration circuitry
  • Bursts from A to B remain ordered
  • Producer-consumer and global-store ordering satisfied
      • A sends X to B, A notifies C, C can read X from B
      • A writes X to B, A writes Y to C; if D reads Y from C, it can read X from B
  • Split transactions implement loads
      • Load request and load completion bursts
      • Load completions returned out-of-order
  • Can tunnel common bus and cache coherence protocols
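The slide says only that competing bursts are "serviced fairly"; it does not name the arbitration scheme. As one common way to get that property, the sketch below uses round-robin selection per output port. Round-robin is a swapped-in technique chosen for illustration, and `round_robin_grant` is a hypothetical helper, not the Nexus arbiter.

```python
def round_robin_grant(requests, last_granted):
    """Pick the next requesting input port after `last_granted`, wrapping around.

    `requests` holds one truthy/falsy flag per input port. Returns the granted
    port index, or None when no port is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None


# Three inputs (0, 1, 3) keep requesting the same output port.
requests = [1, 1, 0, 1]
order = []
last = 3
for _ in range(6):
    last = round_robin_grant(requests, last)
    order.append(last)
```

With every requester persistently asking, `order` cycles through ports 0, 1, 3 in turn, so no input is starved.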

Example: Load/Store Systems
  • Option 1: Pure Master/Target Ports
      • Masters send Requests to Targets, which may return Completions
      • Each port must be either a Master or a Target so that Completions are never blocked by Requests
      • Devices which need to be both Masters and Targets are given two separate full-duplex ports
      • Could use two separate Nexus crossbars
  • Option 2: Peers
      • Modules which are both Masters and Targets implement an internal buffer to hold Requests so that Completions can bypass them
      • All Masters or Peers restrict number of outstanding Requests to avoid overflowing Request buffers

Example: Switch Fabric
  • Each module maintains input/output queues for traffic to/from each other module
  • Data is sent from an input queue to an output queue over Nexus as a series of short bursts
  • Flow control credits for each output queue are sent backward
  • Eliminates head-of-line blocking
  • Segmentation, buffering, and overspeed optimize performance during congestion
  • Used in PivotPoint, Fulcrum's first chip product
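The combination of per-destination queues and backward-flowing credits can be sketched as follows. This is a simplified model under stated assumptions: one credit equals one burst slot in the output queue, and the `CreditedLink` class and `service` function are hypothetical names rather than PivotPoint structures.

```python
from collections import deque


class CreditedLink:
    """Tracks the credits a sender holds for one remote output queue."""

    def __init__(self, credits):
        self.credits = credits       # free slots in the remote output queue
        self.output_queue = deque()  # bursts waiting at the destination

    def try_send(self, burst):
        """Send only when a credit is available; never overflow the queue."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.output_queue.append(burst)
        return True

    def drain_one(self):
        """Destination consumes a burst and returns one credit to the sender."""
        burst = self.output_queue.popleft()
        self.credits += 1
        return burst


def service(input_queues, links):
    """Try the head of every per-destination input queue once."""
    sent = []
    for dest, queue in input_queues.items():
        if queue and links[dest].try_send(queue[0]):
            sent.append((dest, queue.popleft()))
    return sent


links = {1: CreditedLink(1), 2: CreditedLink(1)}
links[1].try_send("w")  # destination 1 is now full (no credits left)
input_queues = {1: deque(["x"]), 2: deque(["y"])}
first_pass = service(input_queues, links)
links[1].drain_one()    # destination 1 frees a slot, returning a credit
second_pass = service(input_queues, links)
```

In the first pass only the burst for destination 2 is sent: the burst for congested destination 1 waits in its own queue instead of blocking traffic behind it, which is the head-of-line-blocking point the slide makes.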

Nexus Silicon Validation
TSMC 130 nm LV Results

             V      GHz    ns     pJ/bit
  Low-K      1.2    1.35   2.0    10.4
  Low-K      1.0    1.11   2.4     7.0
  FSG        1.2    1.10   2.5    11.2
  FSG        1.0    0.87   3.1     7.6

  • Crossbar area: 1.75 mm^2
  • Total interconnect area: 4.15 mm^2
  • Peak cross-section bandwidth: 778 Gb/s
[Figures: block diagram of Nexus validation chip (S1 to S7, ALU, Serial IO, Proc); plot of Nexus crossbar]
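The quoted peak bandwidth is consistent with all 16 ports moving one 36-bit word per cycle at 1.35 GHz. The slides do not state how the figure was derived, so the short calculation below is one plausible reconstruction; the function name is illustrative.

```python
PORTS = 16      # crossbar ports, from the Nexus description
DATA_BITS = 36  # data bits per port


def peak_bandwidth_gbps(freq_ghz, ports=PORTS, data_bits=DATA_BITS):
    """Peak cross-section bandwidth if every port transfers one word per cycle."""
    return ports * data_bits * freq_ghz


low_k_1v2 = peak_bandwidth_gbps(1.35)  # about 777.6 Gb/s, quoted as 778 Gb/s
```

The same function can be applied to the other rows of the table to see how the peak figure scales with the measured frequency.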

Nexus Summary
  • Nexus is an asynchronous crossbar interconnect designed to connect up to 16 synchronous modules in an SoC
  • Nexus can be used to implement load/store systems as well as switch fabrics
  • Systems using Nexus can be tested with standard equipment
  • Nexus runs at up to 1.35 GHz in TSMC 130 nm
  • Asynchronous interconnect is now viable for very high performance SoC designs

PivotPoint Blade Interconnect
  • World's first high-performance clockless chip
  • Large-scale SoC design
      • >32.5 M transistors (83% async)
      • 14 separate clock domains
  • Includes key Fulcrum IP
      • Nexus Terabit Crossbar
      • Quad-port 600 MHz async SRAM
  • Operates at over 1 GHz
  • Delivers 192 Gbps of nonblocking switching capacity
  • Testable via standard tools
      • JTAG; scan chain
  • Activity-based power scaling
  • 9-month project
[Figure: generic system "blade" with CPU/NPU/ASIC/FPGA devices, SPI-4 I/O (Phy/MAC), and a backplane interface (X 8)]

PivotPoint Leverages Nexus
  • Flexible architecture
      • 6 duplex SPI-4.2 interfaces
      • All paths are independent
  • Optimized for performance
      • 3 ns latency
      • Up to 14.4 Gbps per interface
      • Up to 32 Gbps per Nexus port
      • Full-rate buffer memories
      • Lossless flow control
  • Easily configurable
      • 16-bit CPU interface
      • JTAG support
  • Modest size and power
      • ~2 Watt per active interface
      • 1036-ball package
  • A true SoC GALS design
[Figure: block diagram with six SPI-4 interfaces, each with a Route Table and 16 KB Buffer, plus CPU Interface, JTAG Interface, Boundary Scan, and Control Bus (Serial Tree)]
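The per-port figures line up with the 192 Gbps switching capacity quoted for the chip if each of the six SPI-4.2 interfaces sits on its own 32 Gbps Nexus port. The slides do not spell out that derivation, so the arithmetic below is a consistency check under that assumption, not a stated specification.

```python
INTERFACES = 6           # duplex SPI-4.2 interfaces
NEXUS_PORT_GBPS = 32     # "up to 32 Gbps per Nexus port"
SPI4_INTERFACE_GBPS = 14.4  # "up to 14.4 Gbps per interface"

# Assumption: one Nexus port per SPI-4.2 interface.
switching_capacity_gbps = INTERFACES * NEXUS_PORT_GBPS  # 192

# Ratio of internal port rate to external interface rate (the "overspeed").
overspeed = NEXUS_PORT_GBPS / SPI4_INTERFACE_GBPS
```

Under this reading the crossbar ports run at a bit more than twice the external line rate, which leaves headroom for segmentation overhead and congestion.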

Testing: A Multi-Dimensional Approach
  • DFT
      • Synchronous scan chains for synchronous logic
      • Asynchronous scan-chain-like structures for asynchronous logic and sync-async interfaces
      • Standardized JTAG interface for testing
  • Fault grading
      • Verilog fault model for domino logic
      • Industry-standard fault grading tools
  • BIST
      • Use Nexus for observability in Nexus-based SoCs
      • RAM self test and repair

Differentiating Through Technology
Leveraging our clockless technology foundation
  • Differentiated Product Offering
      • High performance (latency, capacity)
      • Power efficient (linear scaling)
      • Robust in operation
  • Unique IP Blocks
      • Unmatched performance
      • Extremely robust (power and temperature)
      • Easy to integrate (benign behavior)
  • Clockless Technology Foundation
      • Silicon proven and customer validated
      • Mature CAD flow (integrated with commercial tools)
      • Robust cell library (thousands of unique cells)

Thank You!
Peter A. Beerel, Ph.D
VP Strategic CAD
pabeerel@fulcrummicro.com
818.871.8100
www.fulcrummicro.com
26775 Malibu Hills Road, Suite 200
Calabasas Hills, CA 91301

"A group of engineers wants to turn the microprocessor world on its head by doing the unthinkable: tossing out the clock and letting the signals move about unencumbered. For those designers, inspired by research conducted at Caltech, clocks are for wimps." Anthony Cata