Design Automation for Asynchronous Circuits
Alex Kondratyev, Cadence Berkeley Labs, Berkeley, CA, USA
In collaboration with Jordi Cortadella, Luciano Lavagno, Kelvin Lwin and Christos Sotiriou

Outline
§ What do we optimize?
§ End of deterministic design
§ Technical and business implications
§ Asynchronous design with commercial tools
  - Desynchronization
  - Delay-insensitive datapath

Optimization metrics
Late 1970s:
- Literals (nodes of a Boolean network)
- Levels of a Boolean network
Nowadays:
- Literals (nodes of a Boolean network)
- Levels of a Boolean network
- Wire length
[Figure labels: Area, Speed]
Tools are optimizing for area and speed!

Universal metrics
Power: small?
$P_{avg} = P_{dyn} + P_{short} + P_{leak}$
$P_{dyn} = a \cdot f_{clk} \cdot C \cdot V_{dd}^2$
[Figure: circuit with load C and supply $V_{dd}$, showing $P_{dyn}$, $P_{short}$ and $P_{leak}$]

Universal metrics
Power: small?
$P_{avg} = P_{dyn} + P_{short} + P_{leak}$
$P_{dyn} = a \cdot f_{clk} \cdot C \cdot V_{dd}^2$
Delay:
$t_d = Q_c / I_{ds} = C \cdot V_{dd} / \left( k (V_{dd} - V_t)^2 \right)$
[Figure labels: supply voltage, $I_{ds}$, power, delay]
Speed can be taken as a universal metric
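
The two formulas on this slide can be explored numerically. The following is a minimal sketch, assuming illustrative values for the activity factor, clock frequency, capacitance, threshold voltage and the constant `k` (none of these numbers come from the slides). It only shows the trend: lowering Vdd reduces dynamic power quadratically while increasing gate delay.

```python
def dynamic_power(a, f_clk, c, vdd):
    """P_dyn = a * f_clk * C * Vdd^2 (switching power only)."""
    return a * f_clk * c * vdd ** 2


def gate_delay(c, vdd, vt, k):
    """t_d = C * Vdd / (k * (Vdd - Vt)^2); valid only for Vdd > Vt."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c * vdd / (k * (vdd - vt) ** 2)


if __name__ == "__main__":
    # Hypothetical parameters chosen for illustration only.
    a, f_clk, c, vt, k = 0.1, 1e9, 1e-12, 0.3, 1e-3
    for vdd in (1.0, 0.9, 0.8):
        p = dynamic_power(a, f_clk, c, vdd)
        t = gate_delay(c, vdd, vt, k)
        print(f"Vdd={vdd:.1f}  P_dyn={p:.3e}  t_d={t:.3e}")
```

Because power and delay move in opposite directions with Vdd, any speed surplus can be traded for power by lowering the supply, which is the sense in which speed serves as a universal metric.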

Outline
§ What do we optimize?
§ End of deterministic design
§ Technical and business implications
§ Asynchronous design with commercial tools
  - Desynchronization
  - Delay-insensitive datapath
  - Fine-grain pipelining

Timing margins
§ Algorithms/tools (approximations)
§ Modeling (e.g. process corners)
§ Architecture (unbalanced computation)

Algorithms/tools
- False paths (< 5%)
- Common path pessimism removal
- Hierarchy hurts!!! 10-35% gain from floorplan flattening (Reshape)
Bad news: we do not know how far we are from the optimum
Good news: the optimum is not possible to find

Modeling
0.25 µm, Vdd = 2.5 ± 10%, T = 0, 125 C
0.13 µm, Vdd = 1.0 ± 10%, T = -40, 125 C
INVX2 (fall): fast = 0.76 x typical, slow = 1.47 x typical
INVX2 (fall): fast = 0.73 x typical, slow = 1.55 x typical
[Figure labels: slow, typical, fast]
Why panic?
New BIG players: signal integrity and process variability

Variability sources
- Environment (T, Vdd) + signal integrity (within-die only)
- Process variations (gate length L, wire width W, threshold voltage Vt)
  - Die-to-die (design independent)
  - Within-die (design dependent)

Environment + SI
Supply voltage: ± 10%
Temperature: -40 C to 125 C
[Figure: VDD and V'DD]
IR drop: decrease in the current from Vdd
Bad news: $10^7$ gates x 8 metal layers give $10^9$ RC elements in the VDD grid
Good news: field solvers can handle $10^6$ variables; abstraction, model reduction and IP reuse help further
Tools sign off IR drop at 5% of Vdd (still a 10% delay penalty)

Environment + SI
Crosstalk
[Figure: aggressor and victim wires, crosstalk pulse, coupling delay Tc (%)]
- Pruning by coupling
- Worst coupling estimation
- H-Spice simulation
- Compute switching windows
- Pruning by timing
Conservative analysis: up to 20% delay penalty (post-layout fixes)

Process variations
- Die-to-die: design independent, well modeled via worst-case files
- Within-die: design dependent, systematic and random!!
[Figure labels: Lgate, Wwire, Tt; within-die, die-to-die (Nassif'01)]

Measuring variability
- Microprocessor: at-speed functional testing
  [Figure: % chips vs frequency, split into Bin 1, Bin 2, Bin 3]
- ASIC: no delay testing, no binning
Strategically placed oscillators:
Problem: up to 15% delay variation in RO (Nassif'03)
Vertical/horizontal (4%), poly-Si spacing (7%), distance (5%)

Modeling variability
Model for gate delay (linear wrt variability sources):
$d_{var} = env_{var} + device_{var} + wire_{var}$
Independence of sources (within a group: model reduction, PCA or SVD)
For a single variability source:
$L_{var} = L_{spatial} + L_{random}$ (modeled by normally distributed random variables $N(0, \sigma)$)
Variation of path delay:
$D_{var} = \sum d_{var}(L_{var})$
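
As a worked illustration of the linear model, the sketch below computes the standard deviation of a path delay for one variability source. It assumes, beyond what the slide states, that the spatial component is identical for every gate on the path (fully correlated) and that the random component is independent per gate. The function name and the sensitivity values are hypothetical.

```python
import math


def path_delay_sigma(sensitivities, sigma_spatial, sigma_random):
    """Std. deviation of D_var = sum_i s_i * (L_spatial + L_random_i).

    Assumptions (illustrative only):
      - L_spatial ~ N(0, sigma_spatial) is shared by all gates on the path,
      - L_random_i ~ N(0, sigma_random) is independent for each gate,
      - s_i is the linear delay sensitivity of gate i to this source.
    """
    # The shared term adds up linearly along the path.
    shared = sum(sensitivities) * sigma_spatial
    # Independent terms add up in quadrature.
    independent = math.sqrt(sum(s * s for s in sensitivities)) * sigma_random
    return math.sqrt(shared ** 2 + independent ** 2)


if __name__ == "__main__":
    path = [1.0, 1.0, 1.0, 1.0]  # four identical gates
    print(path_delay_sigma(path, sigma_spatial=0.0, sigma_random=1.0))
    print(path_delay_sigma(path, sigma_spatial=1.0, sigma_random=0.0))
```

Under these assumptions the purely random component partly averages out along a path, while the spatially correlated component accumulates in full.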

Statistical timing analysis
[Figure: timing graph; reconvergence needs some care]
- Numerical computation of a distribution
- Approximate convolution (5% accuracy)
- Use upper and lower bounds (10% diff., Blaauw'03)
Algorithms have linear complexity!
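
The numerical computation of a distribution can be sketched with discrete delay distributions. This is a minimal illustration, not the algorithm of any cited tool: it assumes arrival times are independent, which is exactly the assumption that breaks at reconvergent paths. Distributions are plain dictionaries mapping a delay value to its probability.

```python
from collections import defaultdict


def add_delays(p, q):
    """Distribution of X + Y for independent X ~ p, Y ~ q (convolution)."""
    out = defaultdict(float)
    for dp, pp in p.items():
        for dq, pq in q.items():
            out[dp + dq] += pp * pq
    return dict(out)


def max_delays(p, q):
    """Distribution of max(X, Y) for independent X ~ p, Y ~ q."""
    out = defaultdict(float)
    for dp, pp in p.items():
        for dq, pq in q.items():
            out[max(dp, dq)] += pp * pq
    return dict(out)


if __name__ == "__main__":
    gate = {1: 0.5, 2: 0.5}        # hypothetical two-point gate delay
    print(add_delays(gate, gate))  # two gates in series
    print(max_delays(gate, gate))  # two paths merging at one node
```

Series connections use the sum and merging paths use the max. If the two merging paths share gates, their delays are correlated and the independent max is no longer exact.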

What does it buy?
The WC confidence margin must be big (chips work), but it is fully unknown
[Figure labels: trading yield, confidence margin, worst]
- STA helps to quantify risk (reduce the margin and be structure specific)
- STA might help to trade off confidence margin and yield (testing???)
Open issues:
- why normal?
- how to derive sensitivity coefficients?

Summing this up
[Figure: cycle time split into real computation time and clock overhead. Labels: worst-average 45%, SI 25%; non-balanced stages, clock skew, variability: 30%, 10%, 20%]
Some designs run twice as fast as the spec requires!
Everything boils down to $$$
Synchronous design is becoming a costly proposition

Is asynchronous an option?
It is about time, but …
"Must" requirements for an asynchronous CAD tool:
§ Competitive
  - added value with minimal (or no) penalty
  - scalable (capable of handling large designs)
§ Simple
  - minimal knowledge of asynchronous design
  - RTL input
§ Risk-free
  - does not change sign-off (STA)
  - complete solution in verification and testing
  - backup options (synchronous implementation)

Design options
- QDI approach: dual-rail logic
- Bundled approach: single-rail logic
[Figure: QDI datapath with a C-element producing done; bundled datapath with start, matched delay and done]

Sliding the trade-off curve
[Figure: trade-off curve from gates to blocks. Labels: automation efforts; QDI datapath (NCL, phased logic); bundled data (desynchronization); EMI, skew penalty; variability; average speed; penalties?]

Desynchronization flow
- Think synchronous
- Design synchronous: one clock and edge-triggered flip-flops
- De-synchronize (automatically)
- Run it asynchronously

Synchronous circuit
[Figure: MS flip-flops split into latches L0 and L1, driven by CLK]

De-synchronization
[Figure: latches L0 and L1, each driven by a C controller]

De-synchronization
Distributed controllers substitute the clock network
The data path remains intact!

Non-overlapping handshake protocol
[Figure: latches A, B, C, D with events A+ B- C+ D- and A- B+ C- D+]

Overlapping is also acceptable
[Figure: latches A, B, C, D with events A+ B+ C+ D+ and A- B- C- D-]

Concurrent model
[Figure: latches A, B, C holding data and a bubble; events A+ A- B+ B- C+ C-]
• + and - must alternate
• data available at the previous latch
• next latch must be closed before receiving new data

For any netlist

Synchronization layer

Synchronization layer
This is a circuit marked graph (CMG)

Properties of CMGs
Any CMG is live and safe
- Safeness: no data overwriting
- Liveness: no deadlock
[Figure: marked graph over events A+ B+ C+ A- B- C-]

Behavioral equivalence

Synchronous flow

De-synchronized flow

Flow equivalence [Guernic, Talpin, Lann, 2003]
[Figure: latches A and B]

Flow equivalence
[Figure: value sequences stored in latches A and B against CLK, shown for the synchronous behavior and for the de-synchronized behavior]
Theorem: The de-synchronization model preserves flow-equivalence
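
Flow equivalence compares, latch by latch, the sequence of values stored, while ignoring when each value was stored. A minimal sketch of that check follows. The event format `(time, latch, value)` and the sample traces are assumptions made for illustration and are not the values shown in the slide's figure.

```python
from collections import defaultdict


def latch_flows(events):
    """Project a trace of (time, latch, value) events onto each latch."""
    flows = defaultdict(list)
    for _time, latch, value in sorted(events, key=lambda e: e[0]):
        flows[latch].append(value)
    return dict(flows)


def flow_equivalent(events_a, events_b):
    """True if both traces store the same value sequence in every latch."""
    return latch_flows(events_a) == latch_flows(events_b)


if __name__ == "__main__":
    # Hypothetical traces: same values per latch, different timing.
    synchronous = [(0, "A", 1), (0, "B", 5), (10, "A", 3), (10, "B", 1)]
    desynchronized = [(0, "A", 1), (4, "B", 5), (7, "A", 3), (13, "B", 1)]
    print(flow_equivalent(synchronous, desynchronized))
```

In the de-synchronized trace the latches are no longer updated at common clock edges, yet each latch still sees the same values in the same order.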

Timing equivalence
del_b = del_a = del_c = del_d
[Figure: latches La, Lb, Lc, Ld controlled by A, B, C, D with matched delays del_a, del_b, del_c; event graph over A+, A-, B+, B-, C+, C-, D+, D-]
Synchronous-like behavior

Timing equivalence
del_b > del_a = del_c = del_d
[Figure: latches La, Lb, Lc, Ld controlled by A, B, C, D with matched delays del_a, del_b, del_c; event graph over A+, A-, B+, B-, C+, C-, D+, D-]
B keeps the same period and settles the rest

Compatibility
Synchronous: $T_{sync} \ge T_{comb} + T_{setup} + T_{skew} + T_{CQ}$
Desynchronized: $T_{desync} \ge T_{comb} + T_{controller} + T_{CQ}$
Statement: the desynchronized design is behavior- and timing-compatible with its synchronous counterpart
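
The two cycle-time bounds differ only in which overhead is added to the combinational delay. The sketch below evaluates both for hypothetical delay values in nanoseconds; the numbers are illustrative and are not taken from the DLX experiment reported later in the deck.

```python
def sync_cycle(t_comb, t_setup, t_skew, t_cq):
    """Lower bound on the clock period of the synchronous design."""
    return t_comb + t_setup + t_skew + t_cq


def desync_cycle(t_comb, t_controller, t_cq):
    """Lower bound on the cycle time of the desynchronized design."""
    return t_comb + t_controller + t_cq


if __name__ == "__main__":
    # Hypothetical delays in ns.
    print(sync_cycle(t_comb=3.0, t_setup=0.5, t_skew=0.25, t_cq=0.5))
    print(desync_cycle(t_comb=3.0, t_controller=0.75, t_cq=0.5))
```

With these numbers the controller delay equals setup plus skew, so both styles reach the same cycle time. A slower controller would make the desynchronized design slower, a faster one would make it quicker.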

Synchronous environment
[Figure: latches A, B, C and Clk; events Clk+ A+ B+ C+ and Clk- A- B- C-, with a timing arc]

Implementation of a controller
• Only local handshakes with adjacent controllers are necessary
• Synthesis by using intuition, common sense, … and petrify

Implementation of a controller

Delay matching
[Figure: combinational logic with a matched delay d]

Post-layout delay matching
[Figure: combinational logic]

Desynchronization. Gaining Trust
[Figure: synchronous RTL, "="]

Async DLX block diagram

Desynchronization. Gaining Trust
Synchronous RTL: synchronous vs desynchronized implementation

         Synchronous    Desynchronized
Cycle    4.4 ns         4.45 ns
Power    70.9 mW        71.2 mW
Area     372,656 µm²    378,058 µm²

DLX lessons. Positive
§ Asynchronous design with no area, power, delay penalties
§ 30% less EMI
§ Partial tolerance of variability (matched delays scale with the rest of the gates)
§ Binning!!!
[Figure: req, B, C, Clk; Error when Treq > Tclk]

DLX lessons. Negative
§ Asynchronous design with no area, power, delay advantage
§ Clock power is saved but latched designs have higher loads
§ P&R constraints of de-sync design are non-trivial
§ Matched delay variability might hurt
Hard work to come out even with synchronous

Can we do better?
§ Clustering
§ Timing optimization
§ Retiming of M-latches
[Figure: master (M) and slave (S) latches with early and late signals A, C, D]

Problems of delay matching
[Figure: matched delay z between Max(STA_delay) and Min(STA_delay)]
- Gate and wire profiles are different (must be compensated by margins)
- Matched delay margins vs inter-die variation matching??
Calls for the use of different architectures

Sliding the trade-off curve
[Figure: trade-off curve from gates to blocks. Labels: automation efforts; QDI datapath (NCL, phased logic); bundled data (desynchronization); EMI, skew penalty; variability; average speed]

Phased Logic (Linden'94)
LSB is the 'value' bit (v), MSB is the 'timing' bit (t)
Codes 00 and 11: even phase; codes 01 and 10: odd phase

t  v  phase  value
0  0  even   0
1  0  odd    0
1  1  even   1
0  1  odd    1
0  0  even   0
0  1  odd    1
1  1  even   1

A signal changes phase or value (only one bit changes)
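
The encoding on this slide can be written down directly: with `t` the timing bit and `v` the value bit, the codes with t = v (00 and 11) are even phase and the codes with t ≠ v (01 and 10) are odd phase. The sketch below generates the next code for a new data value so that the phase always flips; the function names are illustrative.

```python
def phase(t, v):
    """Even phase when the timing bit equals the value bit, odd otherwise."""
    return "even" if t == v else "odd"


def next_code(t, v, new_value):
    """Next (t, v) code carrying new_value in the opposite phase."""
    if phase(t, v) == "even":
        new_t = 1 - new_value   # odd phase requires t != v
    else:
        new_t = new_value       # even phase requires t == v
    return new_t, new_value


def encode_stream(values, start=(0, 0)):
    """Encode a sequence of data values, starting from the given code."""
    codes, (t, v) = [], start
    for value in values:
        t, v = next_code(t, v, value)
        codes.append((t, v))
    return codes


if __name__ == "__main__":
    # Values 0, 1, 1, 0, 1, 1 after an initial even-phase 0 (code 00).
    for t, v in encode_stream([0, 1, 1, 0, 1, 1]):
        print(t, v, phase(t, v), v)
```

Starting from code 00, this reproduces the rows of the slide's table, and consecutive codes always differ in exactly one bit.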

Phased logic gate
A PL gate has an internal state, Even or Odd.
A PL gate fires when all inputs match the gate phase.
[Figure: gate phase E with inputs E, E: gate ready to fire; gate phase E with inputs E, O: gate is not ready to fire; after firing: gate phase E to O]
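
The firing rule can be modelled in a few lines. This is a behavioral sketch only: it assumes that inputs are (t, v) pairs with even phase when t = v, and that the gate phase toggles after each firing, which is how the figure appears to read. The class and method names are hypothetical.

```python
def phase(t, v):
    """Even phase when the timing bit equals the value bit, odd otherwise."""
    return "even" if t == v else "odd"


class PhasedGate:
    """Behavioral model of a phased-logic gate (illustrative sketch)."""

    def __init__(self, func, gate_phase="even"):
        self.func = func              # Boolean function of the value bits
        self.gate_phase = gate_phase  # internal state: "even" or "odd"

    def ready(self, inputs):
        """The gate may fire only when every input is in the gate phase."""
        return all(phase(t, v) == self.gate_phase for t, v in inputs)

    def fire(self, inputs):
        """Return the output value and toggle the phase, or None if not ready."""
        if not self.ready(inputs):
            return None
        value = self.func(*(v for _t, v in inputs))
        # Assumption: the gate phase flips after each firing.
        self.gate_phase = "odd" if self.gate_phase == "even" else "even"
        return value


if __name__ == "__main__":
    gate = PhasedGate(lambda a, b: a & b)
    print(gate.fire([(0, 0), (1, 1)]))  # both inputs even: fires
    print(gate.fire([(1, 0), (1, 1)]))  # one input still even: waits
```

Waiting for all inputs to reach the gate phase is what lets the gate detect that a complete new set of operands has arrived without a clock.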

LUT-4 based implementation
[Figure: LUT4 computing new_v from a_v, b_v, c_v, d_v; input completion detection with a C-element over a_t, b_t, c_t, d_t and gate_phase; gates G1, G2, G3; D-latches with reset holding v and t; out_phase = gate_phase]
• Functionality: v(a_v, b_v, c_v, d_v)
• Phase: a_t, b_t, c_t, d_t, t
Area penalty!

NCL Design Flow
- VHDL
- Synchronous synthesis (GTECH library)
- Synchronous netlist
- 2-rail expansion + optimization (NCL library)
- Asynchronous NCL netlist
2-rail expansion options:
1. Pattern matching (Ligthart'00)
2. Completion separation (NCLX)

Introduction to NCL
2-phase functioning (evaluate (DATA), precharge (NULL)) + self-timed register interaction (acknowledgement of phases)
[Figure: registers with completion detection (CD) around combinational logic; NULL and DATA wavefronts, Ack+]
Micropipeline with delay-insensitive (DI) datapath

From 2 to 3-rail Scheme
[Figure: 2-rail gate with inputs x.0, x.1, y.0, y.1, function blocks F and outputs z.1, z.0]
z.1, z.0 are 2-rails but they do not acknowledge inputs
Not a DI scheme!!!

From 2 to 3-rail Scheme
[Figure: 2-rail gate. Functional part: function blocks F from x.0, x.1, y.0, y.1 to z.1, z.0. Completion part: C-element from x.go, y.go to z.go]
Rationale behind delay-insensitivity of the 3-rail scheme:
1. A 2-rail circuit is hazard-free under monotonic input changes
2. All input changes are observable at outputs
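
The completion part relies on a C-element, whose behavior is easy to state: the output rises when all inputs are 1, falls when all inputs are 0, and otherwise holds its previous value. The sketch below models that behavior. Using it to derive `z.go` from `x.go` and `y.go` follows the figure, but the class itself is an illustrative model rather than the NCLX cell.

```python
class CElement:
    """Muller C-element: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(i == 1 for i in inputs):
            self.out = 1
        elif all(i == 0 for i in inputs):
            self.out = 0
        # Otherwise the inputs disagree and the output holds its value.
        return self.out


if __name__ == "__main__":
    z_go = CElement()
    # (x.go, y.go) pairs: DATA arrives on x, then on y, then both go to NULL.
    for x_go, y_go in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(x_go, y_go, "->", z_go.update(x_go, y_go))
```

Because `z.go` changes only after every input `go` signal has changed, each input transition is acknowledged by an output transition, which is the second point of the rationale.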

NCLX flow (MUX)
- Tech. map: MUX with inputs a, s, b and output z
- Unate: z.1 from a.1, s.0, b.1
- 2-rail expansion: z.1, z.0 from a.1, s.0, b.1, a.0, s.1, b.0 (2-rail gate, incomplete)
- Completing: functional part produces z.1, z.0; completion part (C-element over a.go, s.go, b.go) produces z.go (2-rail gate, complete)
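
The difference between the incomplete and the complete 2-rail gate can be seen on a small model of the MUX. This is a sketch under stated assumptions: each signal is a pair (x.1, x.0) with NULL encoded as (0, 0), input `a` is selected when `s` is 0, and the rail equations below are one plausible unate expansion rather than the netlist produced by the NCLX tool.

```python
NULL = (0, 0)


def encode(bit):
    """Dual-rail DATA encoding as (x.1, x.0)."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Return 0 or 1 for a DATA codeword, None for NULL."""
    if pair == NULL:
        return None
    return 1 if pair == (1, 0) else 0


def mux_2rail(a, s, b):
    """Unate 2-rail MUX: z = a when s = 0, z = b when s = 1."""
    a1, a0 = a
    s1, s0 = s
    b1, b0 = b
    z1 = (a1 & s0) | (b1 & s1)
    z0 = (a0 & s0) | (b0 & s1)
    return (z1, z0)


if __name__ == "__main__":
    print(decode(mux_2rail(encode(1), encode(0), encode(0))))  # selects a
    print(mux_2rail(NULL, NULL, NULL))                         # stays NULL
    # Output is already valid although input b has not arrived yet:
    print(mux_2rail(encode(1), encode(0), NULL))
```

The last call shows why the functional part alone is marked incomplete: the output becomes DATA without acknowledging `b`. The separate completion signal `z.go` closes that gap.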

NCL lessons. Positive
§ Very low EMI
§ High security of computation
§ Automatic stand-by mode
§ Tolerance to variability

NCL lessons. Negative
§ Big area overhead: 2.7-3.0x
§ No performance advantage (average-case performance is swallowed by the penalty from NULL)
§ Completion introduces further penalties (power and delay)

Back in Business
Performance improvement:
§ Fast reset
  - partition a circuit into chunks 4-6 logic levels deep
  - apply reset to each chunk simultaneously
§ Use faster negative gates
  - negative gates are about 20% faster than unate gates
Area improvement:
§ Make completion by outputs only
  - a single NOR gate suffices

Penalties and Savings
[Figure labels: logic synthesis, place & route]
20-35% performance improvement at the expense of a 100% area penalty

Use Case
[Figure: FF, combinational logic, FF, with reset, clk and a comparator (comp) producing error]
The error signal provides a means to:
- Calibrate chips during manufacturing test
- Perform on-line delay testing

Use Case
§ Find timing-critical portions of the design
§ Re-implement them asynchronously
Up to 30% performance improvement at the cost of a 2x area penalty for the critical portions (appealing if the critical portion is small)

Best of both worlds. PLA designs
- Dynamic PLAs are naturally two-phase
- Delay matching of PLA is easy
- Bundled routing to cope with wire delay variations
- Critical path is always the go-done path
[Figure: PLA A to PLA E connected by data, each with go_X and done_X signals, joined through a C-element]

PLA vs SC
• PLA-based design vs standard-cell-based design (SC typical vs PLA typical)
Fully placed and globally routed

         SC_A    SC_D   PLA_A   PLA_D
apex7    4339    784    4729    818
apex6    14270   907    15291   838
C1355    14237   1351   12162   1354
C3540    29239   2074   29478   2083
k2       47332   1361   30789   1602

Remaining rows (alu2, alu4, x3, C5315, C6288): 11105 1274 7672 1160; 18884 1622 18683 1521; 80304 4401 75218 4235; 11353 862 14588 845; 31092 2005 42031 2076
Average: 1, 0.9923, 0.9975, 1