
- Number of slides: 90
Design Automation for Asynchronous Circuits. Alex Kondratyev, Cadence Berkeley Labs, Berkeley, CA, USA. In collaboration with Jordi Cortadella, Luciano Lavagno, Kelvin Lwin and Christos Sotiriou 1
Outline: What do we optimize? End of deterministic design; technical and business implications; asynchronous design with commercial tools (desynchronization, delay-insensitive datapath) 2
Optimization metrics. Late 1970s: literals (nodes of a Boolean network) and levels of a Boolean network. Nowadays: literals (nodes of a Boolean network), levels of a Boolean network, and wire length. These stand for area and speed: tools are optimizing for area and speed! 3
Universal metrics. Power: small? $P_{avg} = P_{dyn} + P_{short} + P_{leak}$, with $P_{dyn} = a \cdot f_{clk} \cdot C \cdot V_{dd}^2$ 4
Universal metrics. Power: small? $P_{avg} = P_{dyn} + P_{short} + P_{leak}$, with $P_{dyn} = a \cdot f_{clk} \cdot C \cdot V_{dd}^2$. Delay: $t_d = Q_c / I_{ds} = C \cdot V_{dd} / \left(k (V_{dd} - V_t)^2\right)$. The supply voltage sets $I_{ds}$, and through it both power and delay. Speed can be taken as a universal metric 5
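The trade-off implied by the two formulas on this slide, $P_{dyn} = a \cdot f_{clk} \cdot C \cdot V_{dd}^2$ and $t_d = C \cdot V_{dd} / (k (V_{dd} - V_t)^2)$, can be tried numerically. The sketch below is only an illustration: every constant in it (activity, clock frequency, capacitance, threshold voltage, k) is a hypothetical value chosen to show the trend, not data from the talk.

```python
def dynamic_power(activity, f_clk, cap, vdd):
    """Switching power: P_dyn = a * f_clk * C * Vdd^2."""
    return activity * f_clk * cap * vdd ** 2


def gate_delay(cap, vdd, vt, k):
    """Gate delay: t_d = C * Vdd / (k * (Vdd - Vt)^2); needs Vdd > Vt."""
    if vdd <= vt:
        raise ValueError("Vdd must exceed Vt")
    return cap * vdd / (k * (vdd - vt) ** 2)


# Hypothetical constants (assumptions, not from the slides).
A, F, C, VT, K = 0.1, 500e6, 10e-15, 0.3, 1e-4

for vdd in (1.2, 1.0, 0.8):
    p = dynamic_power(A, F, C, vdd)
    t = gate_delay(C, vdd, VT, K)
    print(f"Vdd={vdd:.1f} V  P_dyn={p:.3e} W  t_d={t:.3e} s")
```

Lowering Vdd in this model reduces dynamic power quadratically while the delay grows, which is why speed margin can be traded for power.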
Outline: What do we optimize? End of deterministic design; technical and business implications; asynchronous design with commercial tools (desynchronization, delay-insensitive datapath, fine-grain pipelining) 6
Timing margins come from: algorithms/tools (approximations); modeling (e.g. process corners); architecture (unbalanced computation) 7
Algorithms/tools: false paths (< 5%); common path pessimism removal; hierarchy hurts!!! (10-35% gain from floorplan flattening, Reshape). Bad news: we do not know how far we are from the optimum. Good news: the optimum is not possible to find 8
Modeling. Corner delays of INVX2 (fall) relative to typical. 0.25 µm, Vdd = 2.5 V ± 10%, T = 0, 125 °C: fast = 0.76 × typical, slow = 1.47 × typical. 0.13 µm, Vdd = 1.0 V ± 10%, T = -40, 125 °C: fast = 0.73 × typical, slow = 1.55 × typical. Why panic? New BIG players: signal integrity and process variability 9
Variability sources: environment (T, Vdd) plus signal integrity, within-die only; process variations (gate length L, wire width W, threshold voltage Vt), either die-to-die (design independent) or within-die (design dependent) 10
Environment + SI. Supply voltage: ± 10%. Temperature: -40 °C to 125 °C. IR drop: the supply seen by a gate falls from VDD to V'DD because of the current drawn through the resistive power grid. Bad news: $10^7$ gates × 8 metal layers give $10^9$ RC elements in the VDD grid. Good news: field solvers can handle $10^6$ variables; abstraction, model reduction, and IP reuse help further. Tools make IR-drop sign-off at 5% Vdd (still a 10% delay penalty) 11
Environment + SI. Crosstalk: an aggressor net induces a pulse or extra delay on a victim net. Analysis steps shown: pruning by coupling, pruning by timing, computing switching windows, worst coupling estimation, H-Spice simulation (delay Tc in %). Conservative analysis: up to 20% delay penalty (post-layout fixes) 12
Process variations. Die-to-die: design independent, well modeled via worst-case files. Within-die: design dependent, systematic and random!! [Figure: within-die vs die-to-die variation of L_gate and W_wire, Nassif'01] 13
Measuring variability. Microprocessor: at-speed functional testing, chips binned by frequency (Bin 1, Bin 2, Bin 3). ASIC: no delay testing, no binning. Strategically placed ring oscillators (RO) expose the problem: up to 15% delay variation in RO (Nassif'03), from vertical/horizontal orientation (4%), poly-Si spacing (7%), and distance (5%) 14
Modeling variability. Model for gate delay (linear w.r.t. variability sources): $d_{var} = env_{var} + device_{var} + wire_{var}$. Independence of sources (within a group: model reduction by PCA or SVD). For a single variability source: $L_{var} = L_{spatial} + L_{random}$ (modeled by normally distributed random variables $N(0, \sigma)$). Variation of path delay: $D_{var} = \sum d_{var}(L_{var})$ 15
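A small numerical reading of the split $L_{var} = L_{spatial} + L_{random}$: the sketch assumes (this is not stated on the slide) that the spatial component is fully shared by all gates of a path while the random components are independent per gate, and that each gate delay depends linearly on L with the same sensitivity. Under those assumptions the shared part adds linearly along the path and the independent part adds in quadrature. The function name and parameters are hypothetical.

```python
import math


def path_delay_sigma(n_gates, sensitivity, sigma_spatial, sigma_random):
    """Std. deviation of the delay of a path of n_gates identical gates.

    Assumptions (illustrative only):
      - gate delay variation = sensitivity * (L_spatial + L_random)
      - L_spatial ~ N(0, sigma_spatial) is common to all gates on the path
      - L_random  ~ N(0, sigma_random) is independent for every gate
    """
    shared = (n_gates * sensitivity * sigma_spatial) ** 2    # adds linearly
    independent = n_gates * (sensitivity * sigma_random) ** 2  # adds in quadrature
    return math.sqrt(shared + independent)


# Purely random variation averages out along the path ...
print(path_delay_sigma(16, 1.0, 0.0, 1.0))  # grows like sqrt(n)
# ... while spatially correlated variation does not.
print(path_delay_sigma(16, 1.0, 1.0, 0.0))  # grows like n
```

This is why the spatial (systematic) component dominates long paths even when its per-gate sigma is comparable to the random one.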
Statistical timing analysis. Reconvergence needs some care; options: numerical computation of a distribution; approximate convolution (5% accuracy); upper and lower bounds (10% difference, Blaauw'03). Algorithms have linear complexity! 16
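Where two paths reconverge, the arrival time is the maximum of two random delays, and the result is no longer normal. One common way to keep propagating normal distributions is moment matching with Clark's formulas; the sketch below uses that technique as an illustration and is not claimed to be the method behind the numbers on this slide. `rho` is the correlation between the two arrival times.

```python
import math


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Mean and std. deviation of max(X1, X2) for jointly normal X1, X2
    (Clark's first two moments); the result is then treated as normal."""
    a = math.sqrt(max(sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2, 0.0))
    if a == 0.0:
        # X1 - X2 is a constant: the max is simply the variable with the larger mean.
        return (mu1, sigma1) if mu1 >= mu2 else (mu2, sigma2)
    alpha = (mu1 - mu2) / a
    m1 = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + a * _pdf(alpha)
    m2 = ((mu1 ** 2 + sigma1 ** 2) * _cdf(alpha)
          + (mu2 ** 2 + sigma2 ** 2) * _cdf(-alpha)
          + (mu1 + mu2) * a * _pdf(alpha))
    return m1, math.sqrt(max(m2 - m1 * m1, 0.0))


mean, sigma = clark_max(0.0, 1.0, 0.0, 1.0)
print(mean, sigma)
```

Each max costs a constant amount of work, so a traversal of the timing graph stays linear in its size; the price is the approximation error introduced by forcing the max back into a normal shape.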
What does it buy? The worst-case (WC) confidence margin must be big (so that chips work), but it is fully unknown. [Figure: trading yield against confidence margin, relative to the worst case] STA helps to quantify risk (reduce the margin and be structure specific). STA might help to trade off confidence margin and yield (testing???). Open issues: why normal? How to derive sensitivity coefficients? 17
Outline: What do we optimize? End of deterministic design; technical and business implications; asynchronous design with commercial tools (desynchronization, delay-insensitive datapath, fine-grain pipelining) 18
Summing this up. [Figure: cycle time split into real computation time and clock overhead; overhead components labeled worst vs average case (45%), SI (25%), and non-balanced stages, clock skew, and variability (30%, 10%, 20%; the pairing of these three labels with values is not recoverable)] Some designs work twice as fast as the spec requires! Everything boils down to $$$. Synchronous design is becoming a costly proposition 19
Is asynchronous an option? It is about time, but there are "must" requirements for an asynchronous CAD tool. Competitive: added value with minimal (or no) penalty; scalable (capable of handling large designs). Simple: minimal knowledge of asynchronous design; RTL input. Risk-free: does not change sign-off (STA); complete solution in verification and testing; backup options (synchronous implementation) 20
Outline: What do we optimize? End of deterministic design; technical and business implications; asynchronous design with commercial tools (desynchronization, delay-insensitive datapath, fine-grain pipelining) 21
Design options. QDI approach: dual-rail logic, with done produced by completion detection (C-element). Bundled approach: single-rail logic, with done produced from start through a matched delay 22
Sliding the trade-off curve. [Figure: trade-off curve with labels: automation efforts; QDI datapath (NCL, phased logic); bundled data (desynchronization); EMI, skew penalty; variability; average speed; granularity from gates to blocks] Penalties? 23
Desynchronization flow: think synchronous; design synchronous (one clock and edge-triggered flip-flops); de-synchronize (automatically); run it asynchronously 24
Synchronous circuit MS flip-flop L 0 L 1 CLK 0 L 25
De-synchronization L 0 L 1 C C C 0 L 26
De-synchronization: distributed controllers (C) substitute the clock network. The data path remains intact! 27
Non-overlapping handshake protocol. [Figure: latches A, B, C, D with events A+, B-, C+, D- and A-, B+, C-, D+] 28
Overlapping is also acceptable. [Figure: latches A, B, C, D with events A+, B+, C+, D+ and A-, B-, C-, D-] 29
Concurrent model. [Figure: latches A, B (bubble), C (data) with events A+, A-, B+, B-, C+, C-] Rules: + and - must alternate; data must be available at the previous latch; the next latch must be closed before receiving new data 30
For any netlist 31
Synchronization layer 32
Synchronization layer 33
Synchronization layer This is a circuit marked graph (CMG) 34
Properties of CMGs: any CMG is live and safe. Safeness: no data overwriting. Liveness: no deadlock. [Figure: events A+, B+, C+, A-, B-, C-] 35
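Liveness and safeness can be observed on a small marked graph by playing the token game. The sketch below is a generic marked-graph simulator applied to a hypothetical three-latch example (the arcs and the initial marking are assumptions made for illustration, not the controller model of the talk). A bounded simulation is evidence, not a proof, of either property.

```python
def simulate(arcs, initial, steps):
    """Token game on a marked graph.

    arcs    : list of (source_transition, target_transition) pairs (the places)
    initial : dict mapping some arcs to their initial token count
    Returns (firing sequence, largest token count ever seen on an arc).
    """
    marking = {a: initial.get(a, 0) for a in arcs}
    transitions = sorted({t for arc in arcs for t in arc})
    fired, max_tokens = [], max(marking.values())
    for _ in range(steps):
        enabled = [t for t in transitions
                   if all(marking[a] > 0 for a in arcs if a[1] == t)]
        if not enabled:
            break  # deadlock: the graph is not live
        t = enabled[0]
        for a in arcs:
            if a[1] == t:
                marking[a] -= 1  # consume a token from every input arc
            if a[0] == t:
                marking[a] += 1  # produce a token on every output arc
        fired.append(t)
        max_tokens = max(max_tokens, max(marking.values()))
    return fired, max_tokens


# Hypothetical example: a ring of events plus "+ and - alternate" arcs.
ring = ["A+", "B+", "C+", "A-", "B-", "C-"]
arcs = [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]
arcs += [(x + "+", x + "-") for x in "ABC"] + [(x + "-", x + "+") for x in "ABC"]
initial = {("C-", "A+"): 1, ("A-", "A+"): 1, ("B-", "B+"): 1, ("C-", "C+"): 1}

fired, max_tokens = simulate(arcs, initial, 12)
print(fired)       # every event keeps firing: no deadlock observed
print(max_tokens)  # never more than one token per arc: no overwriting observed
```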
Behavioral equivalence 36
Synchronous flow 39
De-synchronized flow 46
Flow equivalence [Guernic, Talpin, Lann, 2003] A B 55
Flow equivalence. [Figure: value sequences captured by latches A and B under CLK, shown for the synchronous behavior and for the de-synchronized behavior] 56
Flow equivalence. [Figure: value sequences captured by latches A and B under CLK, shown for the synchronous behavior and for the de-synchronized behavior] Theorem: the de-synchronization model preserves flow-equivalence 57
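A minimal sketch of a flow-equivalence check, assuming the usual reading of the term: two behaviors are flow-equivalent when every latch captures the same sequence of values in both, regardless of how the captures of different latches interleave in time. The traces are hypothetical illustrations, not the values from the figure.

```python
def flows(trace):
    """Project a trace of (latch, value) capture events onto each latch."""
    per_latch = {}
    for latch, value in trace:
        per_latch.setdefault(latch, []).append(value)
    return per_latch


def flow_equivalent(trace_a, trace_b):
    """True when each latch sees the same value sequence in both traces."""
    return flows(trace_a) == flows(trace_b)


# Synchronous behavior: A and B capture together on every clock edge.
sync_trace = [("A", 1), ("B", 5), ("A", 3), ("B", 1), ("A", 2), ("B", 4)]
# De-synchronized behavior: same per-latch sequences, different interleaving.
desync_trace = [("A", 1), ("A", 3), ("B", 5), ("B", 1), ("A", 2), ("B", 4)]
# A behavior in which B skips a value is not flow-equivalent.
broken_trace = [("A", 1), ("B", 5), ("A", 3), ("A", 2), ("B", 4)]

print(flow_equivalent(sync_trace, desync_trace))  # True
print(flow_equivalent(sync_trace, broken_trace))  # False
```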
Timing equivalence. [Figure: latches La, Lb, Lc, Ld controlled by A, B, C, D through matched delays del_a, del_b, del_c, del_d, with events A+, A-, B+, B-, C+, C-, D+, D-] Case del_a = del_b = del_c = del_d: synchronous-like behavior 58
Timing equivalence. [Figure: latches La, Lb, Lc, Ld controlled by A, B, C, D through matched delays del_a, del_b, del_c, del_d, with events A+, A-, B+, B-, C+, C-, D+, D-] Case del_b > del_a = del_c = del_d: B keeps the same period and settles the rest 59
Compatibility. Synchronous: $T_{sync} \ge T_{comb} + T_{setup} + T_{skew} + T_{CQ}$. Desynchronized: $T_{desync} \ge T_{comb} + T_{controller} + T_{CQ}$. Statement: a desynchronized design is behavior and timing compatible with its synchronous counterpart 60
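The two period bounds on this slide differ only in which overhead is added to the combinational delay: setup plus skew in the synchronous case, controller delay in the desynchronized one. A tiny sketch with hypothetical numbers (in picoseconds, none taken from the talk) makes the comparison concrete.

```python
def min_sync_period(t_comb, t_setup, t_skew, t_cq):
    """Lower bound on the clock period: Tcomb + Tsetup + Tskew + TCQ."""
    return t_comb + t_setup + t_skew + t_cq


def min_desync_period(t_comb, t_controller, t_cq):
    """Lower bound on the handshake period: Tcomb + Tcontroller + TCQ."""
    return t_comb + t_controller + t_cq


# Hypothetical stage, all values in ps.
t_sync = min_sync_period(t_comb=3000, t_setup=100, t_skew=200, t_cq=150)
t_desync = min_desync_period(t_comb=3000, t_controller=250, t_cq=150)
print(t_sync, t_desync)
```

With these made-up numbers the desynchronized stage is slightly faster; whether that holds in practice depends on how the controller delay compares with setup plus skew.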
Synchronous environment. [Figure: latches A, B, C driven from Clk, with events Clk+, A+, B+, C+ and Clk-, A-, B-, C- linked by timing arcs] 61
Implementation of a controller • Only local handshakes with adjacent controllers are necessary • Synthesis by using intuition, common sense, … and petrify 62
Implementation of a controller 63
Delay matching Combinational logic d 64
Post-layout delay matching Combinational logic 65
Post-layout delay matching Combinational logic 66
Desynchronization. Gaining Trust Synchronous RTL = 67
Async DLX block diagram 68
Desynchronization: gaining trust. Two implementations of the same synchronous RTL. Synchronous: cycle 4.4 ns, power 70.9 mW, area 372,656 µm². Desynchronized: cycle 4.45 ns, power 71.2 mW, area 378,058 µm² 69
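The relative cost of desynchronization follows directly from the six numbers on this slide. The short calculation below only restates them as percentages; the dictionary keys are names chosen here for illustration.

```python
# Figures from the slide (cycle in ns, power in mW, area in the slide's unit).
sync = {"cycle": 4.4, "power": 70.9, "area": 372656}
desync = {"cycle": 4.45, "power": 71.2, "area": 378058}


def overhead_pct(base, new):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


overheads = {k: round(overhead_pct(sync[k], desync[k]), 2) for k in sync}
print(overheads)
```

All three overheads come out below 1.5%, which is the basis for the "no penalties" reading of the DLX experiment.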
DLX lessons, positive: asynchronous design with no area, power, or delay penalties; 30% less EMI; partial tolerance of variability (matched delays scale with the rest of the gates); binning!!! [Figure: Error is flagged when Treq > Tclk] 70
DLX lessons, negative: asynchronous design with no area, power, or delay advantage; clock power is saved, but latch-based designs have higher loads; P&R constraints of a de-synchronized design are non-trivial; matched delay variability might hurt. Hard work to come out even with synchronous 71
Can we do better? Clustering; timing optimization; retiming of M-latches. [Figure: master (M) and slave (S) latches with early and late arrivals] 72
Problems of delay matching: Max(STA_delay) vs Min(STA_delay); gate and wire profiles are different (must be compensated by margins); matched delay margins vs inter-die variation matching?? This calls for the use of different architectures 73
Sliding the trade-off curve. [Figure: trade-off curve with labels: automation efforts; QDI datapath (NCL, phased logic); bundled data (desynchronization); EMI, skew penalty; variability; average speed; granularity from gates to blocks] 74
Phased logic (Linden'94). Each signal is a bit pair (t, v): the LSB is the 'value' bit (v), the MSB is the 'timing' bit (t). Codes 00 and 11 are even phase; codes 10 and 01 are odd phase. Example sequence (t v, phase, value): 00 even 0; 10 odd 0; 11 even 1; 01 odd 1; 00 even 0; 01 odd 1; 11 even 1. A signal changes phase or value (only one bit changes) 75
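The encoding can be exercised in a few lines: the phase is even when the timing bit equals the value bit and odd otherwise, and the next code word is obtained by flipping exactly one bit. The function names below are chosen for this sketch; the value stream is the one listed on the slide (0, 0, 1, 1, 0, 1, 1).

```python
def phase_of(t, v):
    """Even phase for codes 00 and 11, odd phase for 10 and 01."""
    return "even" if t == v else "odd"


def next_code(t, v, new_value):
    """Next code word for new_value: the phase toggles, one bit changes."""
    if new_value == v:
        return 1 - t, v      # same value: flip the timing bit
    return t, new_value      # new value: flip the value bit only


def encode_stream(values):
    """Encode a value stream; the first value is sent in even phase."""
    t = v = values[0]
    codes = [(t, v)]
    for value in values[1:]:
        t, v = next_code(t, v, value)
        codes.append((t, v))
    return codes


codes = encode_stream([0, 0, 1, 1, 0, 1, 1])
for t, v in codes:
    print(t, v, phase_of(t, v), v)
```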
Phased logic gate. A PL gate has an internal state, even (E) or odd (O). A PL gate fires when all inputs match the gate phase. [Figure: gate phase E with inputs E, E: gate ready to fire; gate phase E with inputs E, O: gate is not ready to fire; after firing, the gate phase changes from E to O] 76
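The firing rule can be sketched as a small class. Inputs are (t, v) pairs whose phase is even when t equals v; the gate fires only when every input phase equals its own phase, and its phase toggles on firing. How the output token is encoded is deliberately left out, since the slide text does not spell it out; the class and method names are hypothetical.

```python
def phase_of(t, v):
    """Even phase for codes 00 and 11, odd phase for 10 and 01."""
    return "even" if t == v else "odd"


class PLGate:
    """Behavioral sketch of a phased-logic gate (illustrative only)."""

    def __init__(self, func, phase="even"):
        self.func = func      # Boolean function of the input value bits
        self.phase = phase    # internal gate phase

    def ready(self, inputs):
        """True when all input phases match the gate phase."""
        return all(phase_of(t, v) == self.phase for t, v in inputs)

    def fire(self, inputs):
        """Return the new output value and toggle the gate phase,
        or return None when the gate is not ready."""
        if not self.ready(inputs):
            return None
        value = self.func(*(v for _, v in inputs))
        self.phase = "odd" if self.phase == "even" else "even"
        return value


gate = PLGate(lambda a, b: a & b)
print(gate.ready([(0, 0), (1, 0)]))  # one odd input: not ready
print(gate.fire([(1, 1), (1, 1)]))   # both inputs even: fires
print(gate.phase)                    # gate phase has toggled
```

Because the gate phase toggles, the same inputs cannot fire the gate twice; it waits until every input has moved to the next phase.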
LUT-4 based implementation. [Figure: LUT4 with inputs a_v, b_v, c_v, d_v producing new_v; input completion detection on a_t, b_t, c_t, d_t and gate_phase with a C-element; resettable D-latches for v and t; out_phase = gate_phase] Functionality: v(a_v, b_v, c_v, d_v). Phase: a_t, b_t, c_t, d_t, t. Area penalty! 77
NCL design flow: VHDL; synchronous synthesis (GTECH library); synchronous netlist; 2-rail expansion + optimization (NCL library); asynchronous NCL netlist. Two approaches: 1. pattern matching (Ligthart'00); 2. completion separation (NCLX) 78
Introduction to NCL: 2-phase functioning (evaluate (DATA), precharge (NULL)) plus self-timed register interaction (acknowledgement of phases). [Figure: registers with completion detection (CD) around combinational logic; NULL and DATA wavefronts acknowledged by Ack] The result is a micropipeline with a delay-insensitive (DI) datapath 79
From 2- to 3-rail scheme. [Figure: 2-rail gate with inputs x.1, x.0, y.1, y.0, … and outputs z.1, z.0] z.1 and z.0 are 2-rail signals, but they do not acknowledge the inputs. Not a DI scheme!!! 81
From 2- to 3-rail scheme. [Figure: functional part (2-rail gate from x.1, x.0, y.1, y.0, … to z.1, z.0) plus completion part (C-element from x.go, y.go, … to z.go)] Rationale behind the delay-insensitivity of the 3-rail scheme: 1. the 2-rail circuit is hazard-free under monotonic input changes; 2. all input changes are observable at the outputs 82
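Why the plain 2-rail gate fails to acknowledge its inputs, and what the extra go rail adds, can be seen on a dual-rail AND. In this sketch a signal is a pair (x.1, x.0) with (0, 0) as NULL, and the go rail of a signal is modeled as 1 once that signal carries DATA; this modeling of go, like all names in the code, is an assumption made for illustration.

```python
NULL = (0, 0)


def encode(bit):
    """Dual-rail DATA code as (x.1, x.0)."""
    return (1, 0) if bit else (0, 1)


def and2_dual_rail(x, y):
    """Unate 2-rail expansion of AND: z.1 = x.1 & y.1, z.0 = x.0 | y.0."""
    return (x[0] & y[0], x[1] | y[1])


class CElement:
    """Muller C-element: output rises when all inputs are 1,
    falls when all are 0, and otherwise holds its value."""

    def __init__(self):
        self.out = 0

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


# x = 0 has arrived, y is still NULL: the functional part already outputs
# DATA 0, so z alone does not tell whether y ever arrived.
print(and2_dual_rail(encode(0), NULL))

go = CElement()
print(go.update(1, 0))  # x.go = 1, y.go = 0 -> z.go stays 0
print(go.update(1, 1))  # both inputs have arrived -> z.go = 1
```

The functional part stays fast and monotonic, while z.go makes the arrival of every input observable.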
NCLX flow, illustrated on a MUX (inputs a, b, select s, output z): technology mapping; unate conversion; 2-rail expansion (rails a.1, a.0, s.1, s.0, b.1, b.0 produce z.1, z.0), giving an incomplete 2-rail gate; completing, which adds a completion part (a.go, s.go, b.go into a C-element producing z.go) next to the functional part, giving a complete 2-rail gate 83
NCL lessons, positive: very low EMI; high security of computation; automatic stand-by mode; tolerance to variability 84
NCL lessons, negative: big area overhead (2.7-3.0x); no performance advantage (average-case performance is swallowed by the penalty from NULL); completion introduces further penalties (power and delay) 85
Back in business. Performance improvement: fast reset (partition a circuit into chunks 4-6 logic levels deep; apply reset to each chunk simultaneously); use faster negative gates (negative gates are about 20% faster than unate gates). Area improvement: make completion by outputs only (a single NOR gate suffices) 86
Penalties and savings (logic synthesis, place & route): 20-35% performance improvement at the expense of a 100% area penalty 87
Use case. [Figure: FF, combinational logic with reset, FF; a completion (comp) block raises error against clk] The error signal provides a means to: calibrate chips during manufacturing test; perform on-line delay testing 88
Use case: find the timing-critical portions of the design and re-implement them asynchronously. Up to 30% performance improvement at the cost of a 2x area penalty for the critical portions (appealing if the critical portion is small) 89
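How appealing this is depends on the share of the design that is timing critical. The arithmetic sketch below assumes that only the critical fraction of the area is doubled and everything else is untouched; the fractions used are hypothetical.

```python
def total_area_factor(critical_fraction, local_area_factor=2.0):
    """Whole-design area relative to the original when only the critical
    fraction is re-implemented with the given local area factor."""
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be within [0, 1]")
    return 1.0 + critical_fraction * (local_area_factor - 1.0)


for fraction in (0.05, 0.10, 0.50):
    print(fraction, total_area_factor(fraction))
```

With 10% of the area on the critical portion, the 2x local penalty costs about 10% of total area, which is the sense in which a small critical portion makes the trade attractive.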
Best of both worlds: PLA designs. Dynamic PLAs are naturally two-phase. Delay matching of a PLA is easy. Bundled routing copes with wire delay variations. The critical path is always the go-done path. [Figure: PLAs A to E connected by data and by go_X / done_X handshake signals, with a C-element] 90
PLA vs SC: PLA-based design vs standard-cell-based design (SC typical vs PLA typical). Fully placed and globally routed. Columns: A = area, D = delay.
Circuit   SC_A    SC_D   PLA_A   PLA_D
apex7     4339    784    4729    818
apex6     14270   907    15291   838
C1355     14237   1351   12162   1354
C3540     29239   2074   29478   2083
k2        47332   1361   30789   1602
Values for alu2, alu4, x3, C5315, and C6288 as they appear in the source (row and column assignment not recoverable): 1274 7672 1160 18884 1622 18683 1521 80304 4401 75218 4235 11105 11353 862 14588 845 31092 2005 42031 2076. Normalized averages for area and delay: 1, 0.9923, 0.9975, 1 91