
- Number of slides: 82
Low-Power Design Techniques in Digital Systems Prof. Vojin G. Oklobdzija University of California
Outline of the Talk
• Power trends in VLSI
• Scaling theory and predictions
• Research efforts in power reduction
• Efficiency measures and design guidelines
• Latches and Flip-Flops for Low-Power
  – Dual-Edge FFs
  – SOI
• Conclusion: Low-Power perspective 2
Power trends in VLSI 3
“CMOS circuits dissipate little power by nature.” So believed circuit designers (Kuroda-Sakurai, 95).
[Figure: power (W) versus year, 1980 to 1995, growing x4 every 3 years]
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.” (* Taken from Sakurai’s ISSCC 2001 presentation) 4
Gloom and Doom predictions Source: Shekhar Borkar, Intel 5
Source: Shekhar Borkar, Intel 6
Power versus Year (taken from ISSCC, uP Report, Hot-Chips)
• High-end: growing at 25% / year
• RISC: 12% / year
• x86: 15% / year
• Consumer (low-end): 13% / year 7
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA) (* Taken from Sakurai’s ISSCC 2001 presentation) 8
Power Delivery Problem (not just California) Your car starter ! Source: Shekhar Borkar, Intel 9
Trend in L di/dt
• di/dt is roughly proportional to I * f, where I is the chip’s current and f is the clock frequency, or I * Vdd * f / Vdd = P * f / Vdd, where P is the chip’s Vdd power.
• The trend is: P increases, f increases, Vdd decreases; on-chip L and package L slightly decrease.
• Therefore, L di/dt fluctuation increases significantly.
(* Taken from Norman Chang, HP) 10
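A rough numerical sketch of the proportionality on this slide, di/dt ~ P * f / Vdd. The function name and the generation-to-generation ratios are hypothetical, chosen only for illustration; they are not taken from the slide.

```python
def didt_relative(power_ratio, freq_ratio, vdd_ratio):
    """Relative change in di/dt, using di/dt ~ P * f / Vdd.

    Each argument is a new/old ratio between two chip generations.
    """
    return power_ratio * freq_ratio / vdd_ratio


# Hypothetical ratios: power x1.5, clock frequency x2, supply voltage x0.7.
growth = didt_relative(power_ratio=1.5, freq_ratio=2.0, vdd_ratio=0.7)
print(f"di/dt grows by about {growth:.1f}x")
```

With the inductance L only slightly decreasing, the L di/dt noise grows by nearly the same factor.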
Energy-Delay product is improving more than 2x / generation. Saving Grace! 11
• x86 efficiency improving dramatically: 4x / generation
• High-end processors: efficiency not improving
• Average improving 3x / generation 12
Scaling theory and predictions 13
The power dissipation has increased 1000 times over 15 years and is exceeding 70 Watts.
Scaling principles:
1. A “constant field scaling” theory [Dennard] assumes that device voltages as well as device dimensions are scaled by a scaling factor x (>1), resulting in a constant electric field in a device:
• power density remains constant
• circuit performance can be improved in terms of:
  – density: $x^2$
  – speed: $x$
  – power: $1/x^2$
  – power-delay product: $1/x^3$
Limitless progress in CMOS is promised with this scaling scenario. 14
In practice, neither the supply voltage nor the threshold voltage was scaled until 1990, leading to the theory of “constant voltage scaling”, which assumes a constant voltage. This assumption yields:
• speed improvement by $x^2$
• power density increasing rapidly, by $x^3$ 15
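The two scaling scenarios can be tabulated for a given scaling factor x. This is a minimal sketch that encodes only the factors stated on the slides; the function names and the example value of x are illustrative assumptions.

```python
def constant_field_scaling(x):
    """Constant-field scaling: dimensions and voltages both shrink by 1/x (x > 1)."""
    density = x ** 2
    power_per_circuit = 1 / x ** 2
    return {
        "density": density,
        "speed": x,
        "power_per_circuit": power_per_circuit,
        "power_delay_product": 1 / x ** 3,
        # circuits per area times power per circuit: stays constant
        "power_density": density * power_per_circuit,
    }


def constant_voltage_scaling(x):
    """Constant-voltage scaling: dimensions shrink by 1/x, voltages stay fixed."""
    return {"speed": x ** 2, "power_density": x ** 3}


x = 1 / 0.7  # hypothetical example: a 30% linear shrink
print(constant_field_scaling(x))
print(constant_voltage_scaling(x))
```

For this example shrink, constant-voltage scaling raises power density by roughly 2.9x per generation, while constant-field scaling keeps it flat.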
Constant-field scaling is not realistic; $x^{0.5}$ is satisfactory. However, even with that, the power dissipation would exceed that of ECL by 2001: a new philosophy is required! (* Taken from Sakurai and Kuroda, IEICE 95 paper) 16
High-Performance View Point on Power (* taken from Ron Preston, DEC Alpha)
$P = k C V^2 f$:
• Shrinking to the new technology (30% reduction in λ)
  – C decreases by 30%
  – f increases by 1/0.7 = 43%
  – $P_{new} = 0.7 \times (1/0.7) \times P_{old} = P_{old}$ (no change in power!)
• New design:
  – Double the number of devices
  – $P_{new} = 2 \times 0.7 \times (1/0.7) \times P_{old} = 2 \times P_{old}$ (power doubles!)
• Scale Vdd by 30% in the new design:
  – $P_{new} = 2 \times 0.7 \times (1/0.7) \times (0.7)^2 \times P_{old} \approx P_{old}$ (power stays constant!) 17
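The three cases on this slide follow directly from P = k C V^2 f. A small sketch of the arithmetic, with every quantity expressed as a new/old ratio (the function and parameter names are illustrative, not from the slide):

```python
def relative_power(c=1.0, v=1.0, f=1.0, devices=1.0):
    """Return P_new / P_old for P = k C V^2 f.

    c, v, f and devices are new/old ratios for capacitance, supply voltage,
    clock frequency and device count.
    """
    return devices * c * v ** 2 * f


shrink = relative_power(c=0.7, f=1 / 0.7)                         # pure shrink
new_design = relative_power(c=0.7, f=1 / 0.7, devices=2)          # twice the devices
scaled_vdd = relative_power(c=0.7, f=1 / 0.7, devices=2, v=0.7)   # Vdd reduced by 30%

print(f"shrink: {shrink:.2f}, new design: {new_design:.2f}, scaled Vdd: {scaled_vdd:.2f}")
```

The last case comes out at 2 x 0.49 = 0.98, which is why the slide treats the power as staying constant.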
High-Performance View Point on Power (* taken from Ron Preston, DEC Alpha)
Reality:
Chip     λ       Vdd     Freq.     Power
21164    0.5u    3.3 V   300 MHz   50 W
21264    0.35u   2.0 V   600 MHz   72 W
Change   -30%    -39%    +100%     +44%
Paradigm changes: more aggressive circuits, toggle rate increasing, out-of-order, speculative execution.
What to expect:
• Power will be limited by the package and cooling techniques
• Frequency will be determined by the power: as high as the package can take! 18
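For comparison, a back-of-the-envelope sketch of what P = k C V^2 f alone would predict for the 21164 to 21264 step. It assumes that capacitance tracks the 30% shrink (a ratio of 0.7) and that the device count is unchanged; both are simplifying assumptions of this sketch, not statements from the slide.

```python
def shrink_estimate(p_old, c_ratio, v_old, v_new, f_old, f_new):
    """Scale the old chip's power by the C, V^2 and f ratios of P = k C V^2 f."""
    return p_old * c_ratio * (v_new / v_old) ** 2 * (f_new / f_old)


# Values from the slide's table; c_ratio=0.7 is an assumption (see above).
estimate = shrink_estimate(p_old=50.0, c_ratio=0.7,
                           v_old=3.3, v_new=2.0,
                           f_old=300.0, f_new=600.0)
measured = 72.0  # watts, 21264

print(f"estimate: {estimate:.1f} W, measured: {measured:.0f} W, "
      f"ratio: {measured / estimate:.1f}x")
```

Under these assumptions the gap between the estimate and the measured 72 W is what the slide attributes to the paradigm changes (more aggressive circuits, higher toggle rate, out-of-order and speculative execution).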
Research Efforts in Low-Power Design
$P_{sw} = k \cdot C_L \cdot V_{cc}^2 \cdot f_{CLK}$
Reduce the active load:
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reduce switching activity:
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Technology scaling:
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Run it slower:
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop 19
Reducing the Power Dissipation
• The power dissipation can be minimized by reducing:
  – supply voltage
  – load capacitance
  – switching activity
• Reducing the supply voltage brings a quadratic improvement.
• Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed. 20
Voltage Scaling
There are three means to maintain the throughput:
• Reduce Vth to improve circuit speed
• Introduce parallel and pipelined architecture while using slower device speeds (this assumes a limitless number of transistors; in reality the transistor density is only increasing by 60% per year)
• Prepare multiple supply voltages and, for each cluster of circuits, choose the lowest supply voltage that satisfies the speed (a good level converter is necessary, one that exhibits small delay, consumes little power and takes small area) 21
22
Is there an optimal design point ? 23
Power Dissipation and Circuit Delay
Power: $P = p_t \cdot f_{CLK} \cdot C_L \cdot V_{DD}^2 + I_0 \cdot 10^{-V_{th}/S} \cdot V_{DD}$
Delay: $\frac{k \cdot Q}{I} = \frac{k \cdot C_L \cdot V_{DD}}{(V_{DD} - V_{th})^{\alpha}}$ ($\alpha = 1.3$)
[Figure: power (W) and delay (s) plotted against VDD (V) and Vth (V), with points A and B marked] (* Taken from T. Sakurai) 24
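A minimal sketch of the two expressions on this slide: dynamic plus subthreshold leakage power, and delay with alpha = 1.3. All default parameter values (pt, f_clk, c_l, i0, s, k) are hypothetical placeholders chosen so the code runs; only alpha = 1.3 comes from the slide.

```python
def power(vdd, vth, pt=0.5, f_clk=100e6, c_l=1e-9, i0=1e-3, s=0.1):
    """P = pt * fCLK * CL * VDD^2 + I0 * 10^(-Vth/S) * VDD."""
    dynamic = pt * f_clk * c_l * vdd ** 2
    leakage = i0 * 10 ** (-vth / s) * vdd
    return dynamic + leakage


def delay(vdd, vth, k=1.0, c_l=1e-9, alpha=1.3):
    """Delay = k * CL * VDD / (VDD - Vth)^alpha, valid only for VDD > Vth."""
    if vdd <= vth:
        raise ValueError("model requires vdd > vth")
    return k * c_l * vdd / (vdd - vth) ** alpha


# Sweep a few (VDD, Vth) points to see the trade-off.
for vdd, vth in [(3.0, 0.4), (1.0, 0.4), (1.0, 0.1)]:
    print(f"VDD={vdd} V, Vth={vth} V: "
          f"P={power(vdd, vth):.4g} W, delay={delay(vdd, vth):.4g} s")
```

Lowering VDD cuts the quadratic dynamic term but slows the circuit; lowering Vth recovers speed at the cost of exponentially higher leakage.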
Sensitivity to Vth fluctuation
[Figure: normalized delay versus VTH (V) for VDD = 1.0 V, 1.5 V, 3.0 V and 5.0 V, with ΔVTH = ±0.15 V and ±0.05 V] (* Taken from T. Sakurai) 25
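The sensitivity can be illustrated with the delay expression Delay ~ VDD / (VDD - Vth)^alpha and alpha = 1.3. This is a sketch under that model only; the nominal Vth of 0.3 V is a hypothetical choice, and the numbers are not read from the slide's plot.

```python
def delay(vdd, vth, alpha=1.3):
    """Un-normalized delay: VDD / (VDD - Vth)^alpha."""
    return vdd / (vdd - vth) ** alpha


def delay_spread(vdd, vth, dvth):
    """Return (fast, slow) delay normalized to the nominal-Vth delay."""
    nominal = delay(vdd, vth)
    fast = delay(vdd, vth - dvth) / nominal
    slow = delay(vdd, vth + dvth) / nominal
    return fast, slow


for vdd in (1.0, 1.5, 3.0, 5.0):
    fast, slow = delay_spread(vdd, vth=0.3, dvth=0.15)
    print(f"VDD={vdd} V: normalized delay from {fast:.2f} to {slow:.2f}")
```

The spread widens sharply as VDD approaches Vth, which is the practical obstacle to very low supply voltages.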
Power-Delay Product, Energy-Delay Product
Lowest voltage, highest threshold: no optimum (* from Sakurai, Kuroda, IEICE 95 paper)
• Power-Delay Product is a misleading measure; it will always favor a processor that operates at lower frequency
• Energy-Delay is more adequate, but Energy-Delay^2 should be used 26
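A small sketch of why the choice of metric matters. The two design points are hypothetical numbers invented for illustration: design B is the same kind of circuit run slower at lower power.

```python
def metrics(power_w, delay_s):
    """Return power-delay product (energy per operation), energy-delay and energy-delay^2."""
    energy = power_w * delay_s
    return {
        "PDP": energy,
        "EDP": energy * delay_s,
        "ED2": energy * delay_s ** 2,
    }


design_a = metrics(power_w=10.0, delay_s=1e-9)  # hypothetical fast, power-hungry design
design_b = metrics(power_w=2.0, delay_s=3e-9)   # hypothetical slow, low-power design

for name in ("PDP", "EDP", "ED2"):
    better = "A" if design_a[name] < design_b[name] else "B"
    print(f"{name}: A={design_a[name]:.3g}, B={design_b[name]:.3g}, lower is {better}")
```

Here the power-delay product favors the slower design B, while energy-delay and energy-delay^2 both favor A, with the delay penalty weighted more heavily in the latter.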
Power-Delay Product, Energy-Delay Product
Horowitz, Indermaur, Gonzales argue against Power-Delay, SLPE ’94 27
Energy-Delay**2 (*courtesy of Prof. T. Sakurai) 28
Energy-Delay Product vs. Energy-Delay**2 Nowka, Hofstee, Carpenter of IBM argue against Energy-Delay as a design efficiency measure (private communication) 29
Energy-Delay Product vs. Energy-Delay**2
Optimal point: (due to Vth being fixed?)
The same design should have relatively the same efficiency.
Nowka, Hofstee, Carpenter of IBM argue against Energy-Delay as a design efficiency measure (private communication) 30
Example: PowerPC
Feature: 601+, 604, 620; Diff.
• Frequency (MHz): 100, 133 (100); same
• CMOS process: 0.5u 5-metal, 0.5u 4-metal; ~same
• Cache: 32 KB total, 16K+16K, 64K; ~same
• Load/Store Unit: No, Yes
• Dual Integer Unit: No, Yes
• Register Renaming: No, Yes
• Peak Issue: 2 + Br, 4 Insts; ~double
• Transistors: 2.8 Million, 3.6 Million, 6.9 Million; +30% / +146%
• SPECint92: 105, 160, 225 (169); +50% / +61%
• SPECfp92: 125, 165, 300 (225); +30% / +80%
• Power: 4 W, 13 W, 30 W (22.5 W); +225% / +463%
• Spec/Watt: 26.5/31.2, 12.3/12.7, 7.5/10; -115% / -252%
• PF = Watt/Freq**3: 4.0E-6, 13.0E-6, 12.8E-6
• (PF/Trans)*E12: 1.43, 3.61, 1.86
• IPC: 1.05, 1.69
• PE = Watt/Spec**3: 3.46E-6, 3.17E-6, 2.63E-6
• PE*IPC**3 (*E6): 4.01, 12.98, 12.69 31
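The PE = Watt/Spec**3 row can be reproduced from the Power and SPECint92 rows of this table. A short sketch, assuming that "Spec" in the PE definition means SPECint92 (the dictionary layout and function name are illustrative):

```python
# Power (W) and SPECint92 as listed in the table for each PowerPC part.
chips = {
    "601+": {"watts": 4.0, "specint92": 105.0},
    "604": {"watts": 13.0, "specint92": 160.0},
    "620": {"watts": 30.0, "specint92": 225.0},
}


def pe(watts, spec):
    """Efficiency measure PE = Watt / Spec^3 (lower is better)."""
    return watts / spec ** 3


pe_values = {name: pe(c["watts"], c["specint92"]) for name, c in chips.items()}
for name, value in pe_values.items():
    print(f"{name}: PE = {value:.2e}")
```

The computed values agree with the PE row to the precision shown there, which supports reading "Spec" as SPECint92.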
Feature: Digital 21164, MIPS 10000, PowerPC 620, HP 8000, Sun Ultra-Sparc
• Freq: 500 MHz, 200 MHz, 180 MHz, 250 MHz
• Pipeline Stages: 7, 5-7, 5, 7-9, 6-9
• Issue Rate: 4, 4, 4
• Out-of-Order Exec.: 6 lds, 32, 16, 56, none
• Register Renam. (int/FP): none/8, 32/32, 8/8, 56, none
• Transistors / Logic transistors: 9.3M/1.8M, 5.9M/2.3M, 6.9M/2.2M, 3.9M*/3.9M, 3.8M/2.0M
• SPEC95 (Intg/Fl.Pt): 12.6/18.3, 8.9/17.2, 9/9, 10.8/18.3, 8.5/15
• Power: 25 W, 30 W, 40 W, 20 W
• SpecInt/Watt: 0.5, 0.3, 0.27, 0.43
• 1/Energy*Delay: 6.4, 2.6, 2.7, 2.9, 3.6
• Watt/Freq**3: 0.2E-6, 3.75E-6, 6.86E-6, 1.28E-6
• (PF/Trans)*E12: 0.022, 0.64, 0.54, 1.76, 0.34
• (PF/LTrans)*E12: 0.11, 1.63, 1.76, 0.64
• Watt/Spec**3: 12.5E-3, 41.5E-3, 31.7E-3, 32.5E-3 32
Use of Different Circuits Families 33
Capacitance Reduction
The load capacitance is the sum of:
• gate capacitance
• diffusion capacitance
• routing capacitance
Using a small number of transistors, or small transistors, contributes to the reduction in the gate capacitance and the diffusion capacitance. Pass-transistor logic may have an advantage because it comprises fewer transistors and exhibits smaller stray capacitance than conventional static CMOS logic. 34
Pass-Transistor Logic 35
Pass-Transistor Logic: CVSL, CPL, SRPL, DSL, DPL, DCVSPG 36
SAPL: Sense-Amplifying Pass-transistor Logic
All nodes are first discharged and then evaluated by inputs. Outputs are 100 mV above GND 37
Where does the power go ? 38
Power use is different from chip to chip (* from Sakurai, Kuroda, IEICE 95 paper):
• MPU 1 is a low-end microprocessor
• MPU 2 is a high-end CPU with large cache
• ASSP 1 is an MPEG-2 decoder
• ASSP 2 is an ATM switch 39
Design Example: Strong Arm 110 (* from D. Dobberpuhl)
Two power modes: idle and sleep
Power:
• 0.5 W using 1.1 V internal PS: 184 Dhrystone MIPS @ 162 MHz
• 1.1 W using 2 V internal PS: 245 Dhrystone MIPS @ 215 MHz
Power breakdown:
I-Cache     27%
D-Cache     16%
I-Unit      18%
Exec-Unit   8%
I-MMU       9%
D-MMU       8%
Clock       10%
Others      4% (PLL < 1%) 40
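The two operating points on this slide can be compared in MIPS per watt, and the breakdown can be checked for consistency. A short sketch using only the slide's numbers (the variable names are illustrative):

```python
# (watts, Dhrystone MIPS) for the two internal supply settings on the slide.
operating_points = {
    "1.1 V": (0.5, 184.0),
    "2 V": (1.1, 245.0),
}

mips_per_watt = {name: mips / watts for name, (watts, mips) in operating_points.items()}
for name, value in mips_per_watt.items():
    print(f"{name}: {value:.0f} MIPS/W")

# Power breakdown in percent, in the order listed on the slide.
breakdown = {
    "I-Cache": 27, "D-Cache": 16, "I-Unit": 18, "Exec-Unit": 8,
    "I-MMU": 9, "D-MMU": 8, "Clock": 10, "Others": 4,
}
print("breakdown total:", sum(breakdown.values()), "%")
print("caches:", breakdown["I-Cache"] + breakdown["D-Cache"], "%")
```

The lower supply gives up about a quarter of the performance but delivers noticeably more MIPS per watt.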
Design Example: Strong Arm 110 *from D. Dobberpuhl 41
Design Example: Strong Arm 110 (* from D. Dobberpuhl)
However, leakage currents start to affect stand-by power 42
Controlling both: VDD and VTH for low power 43
Controlling VDD and VTH for low power
Low power → Low VDD → Low speed → Low VTH → High leakage → VDD-VTH control
• Software-hardware cooperation
• Technology-circuit cooperation
*) MTCMOS: Multi-Threshold CMOS
*) VTCMOS: Variable Threshold CMOS
• Multiple: spatial assignment
• Variable: temporal assignment
(* from Prof. T. Sakurai) 44
Dual-VTH concept
• Low-VTH circuit (high leakage): critical paths
• High-VTH circuit (low leakage): non-critical paths
(* from Prof. T. Sakurai) 45
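One way to picture the dual-VTH idea is a slack-based assignment: a path keeps low-VTH devices only if it would miss timing with high-VTH devices. This is a hypothetical sketch, not the method from the slide; the path names, delays and the 1.3x high-VTH slowdown factor are all assumptions made for illustration.

```python
def assign_vth(path_delays, clock_period, high_vth_slowdown=1.3):
    """Map each path to 'high' or 'low' VTH.

    path_delays: path name -> delay with low-VTH devices (same unit as clock_period).
    high_vth_slowdown: assumed delay multiplier when high-VTH devices are used.
    """
    assignment = {}
    for name, low_vth_delay in path_delays.items():
        meets_timing = low_vth_delay * high_vth_slowdown <= clock_period
        assignment[name] = "high" if meets_timing else "low"
    return assignment


# Hypothetical path delays in ns, with a 1.0 ns clock period.
paths = {"alu_carry": 0.9, "decode": 0.5, "bypass": 0.8}
print(assign_vth(paths, clock_period=1.0))
```

Only the paths that cannot absorb the slowdown (here alu_carry and bypass) pay the leakage cost of low-VTH devices.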
Clustered Voltage Scaling for Multiple VDD’s
[Figure: conventional design versus CVS structure, showing FFs, level-shifting F/Fs and the critical path; the lower-VDD portion is shown as shaded]
Once VL is applied to a logic gate, VL is applied to subsequent logic gates until the F/Fs, to eliminate DC current paths. F/Fs restore VH.
M. Takahashi et al., “A 60 mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” ISSCC, pp. 36-37, Feb. 1998. (* from Prof. T. Sakurai) 46
If you don’t need to hustle, VDD should be as low as possible
• VDD should be lowered to the minimum level which ensures the real-time operation.
• Energy consumption is proportional to the square of VDD.
[Figure: normalized power versus normalized workload, for variable Vdd and fixed Vdd] (* from Prof. T. Sakurai) 47
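A first-order sketch of the two curves named on this slide. It assumes that with fixed Vdd the chip runs at full speed and then idles at zero power, and that with variable Vdd the clock frequency tracks the workload and Vdd can be scaled in proportion to frequency. The second assumption is a deliberate simplification (it ignores the threshold voltage and any minimum Vdd), so the curve is illustrative only.

```python
def fixed_vdd_power(workload):
    """Normalized power with fixed Vdd: full power for a fraction `workload` of the time."""
    return workload


def variable_vdd_power(workload):
    """Normalized power with variable Vdd, assuming f ~ workload and Vdd ~ f.

    Power ~ Vdd^2 * f, so under these assumptions it scales as workload^3.
    """
    return workload ** 3


for w in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"workload {w:.1f}: fixed {fixed_vdd_power(w):.3f}, "
          f"variable {variable_vdd_power(w):.3f}")
```

At half workload this model gives 0.5 for fixed Vdd against 0.125 for variable Vdd.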
Measured voltage waveforms
[Figure: measured VDD waveform over 1 sync frame (200 ms), switching between VDDmax and VDDmin, together with the sleep signal]
• VDDmax: 8% on average
• Sleep: 6% on average
(* from Prof. T. Sakurai) 48
Measured power characteristics
Total power = 0.8 W × 0.08 + 0.16 W × 0.86 + 0.07 W × 0.06 = 0.2 W
• Time for VDDmax: 8% (0.8 W, ƒ = 200 MHz)
• Time for VDDmin: 86% (0.16 W, ƒ = 100 MHz; power down to 1/5)
• Time for sleep: 6% (0.07 W)
[Figure: power P [W] versus supply voltage VDD [V]]
VDD hopping can cut down power consumption to 1/4 (* from Prof. T. Sakurai) 49
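The total on this slide is a time-weighted average over the three modes. A short sketch of the arithmetic, using only the slide's numbers (the variable names are illustrative):

```python
# (power in watts, fraction of time) for each mode on the slide.
modes = {
    "VDDmax": (0.8, 0.08),
    "VDDmin": (0.16, 0.86),
    "sleep": (0.07, 0.06),
}

total_power = sum(watts * fraction for watts, fraction in modes.values())
always_max = modes["VDDmax"][0]

print(f"average power: {total_power:.4f} W")
print(f"reduction versus always running at VDDmax: {always_max / total_power:.1f}x")
```

The exact average is slightly above 0.2 W, which is where the slide's "1/4" figure comes from.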
Simulation results
[Figure: normalized power P/PFIX versus transition delay TTD (ms), one panel for MPEG-2 video decoding and one for VSELP speech encoding; curves: RPC with 2 levels (f, f/2), 3 levels (f, f/2, f/3), 4 levels (f, f/2, f/3, f/4), infinite levels, and post-simulation analysis] (* from Prof. T. Sakurai) 50
Aggressive Voltage Scaling *Taken from Kuroda If we can dynamically scale Vdd and Vth the advantage is obvious 51
Example 52
TransMeta Example (* taken from Doug Laird’s presentation, January 19th, 2000) 53
TransMeta Example (* taken from Doug Laird’s presentation, January 19th, 2000) 54
TransMeta Example (* taken from Doug Laird’s presentation, January 19th, 2000)
• “Code Morphing” is another contributor to power reduction, since it eliminates unnecessary external memory accesses 55
TransMeta Example 56
Latches and Flip-Flops for Low-Power 57
Simulation Condition and Testbench
Timing
• Total FF overhead is setup + clock-to-output time
• Circuit optimization towards td-q
• Clock skew robustness obtained from observing the DQ curve
Power-Delay Product
• Overall performance parameter at fixed frequency 58
Flip-Flop Performance Comparison: Test bench
• Total power consumed
  – internal power
  – data power
  – clock power
• Measured for four cases
  – no activity (0000… and 1111…)
  – maximum activity (0101010…)
  – average activity (random sequence)
• Delay is (minimum D-Q): Clk-Q + Setup time 59
OLD TEST BENCH:
• Total Power = Drivers Power + Test Unit Power
• PDP-Optimized = Equal Trade-off on Power and Delay
• Improper Load on Drivers
NEW TEST BENCH:
• Drivers: Fixed Gain and Driving Test Unit Only
• Data-to-Output Delay
• PD^2P-Optimized = Best for Constant-Field Scaling 60
Comparison in terms of speed and EDPtot
Technology: 0.2u, Vdd = 2 V, T = 20 °C, measured @ 100 MHz
• Delay:
  – below 200 ps: SDFF 187 ps, HLFF 199 ps, K-6 ETL 200 ps
  – 200-300 ps: PowerPC latch 266 ps, 21264 Alpha FF 272 ps, Strong Arm FF 275 ps, mC2MOS latch 292 ps
  – above 500 ps: SSTC latch 592 ps, DSTC latch 629 ps, SSTC* latch 898 ps, DSTC* latch 1060 ps
• PDPtot @ 100 MHz:
  – below 30 fJ: PowerPC latch 28 fJ, HLFF 29 fJ
  – 30-50 fJ: SDFF 39 fJ, mC2MOS latch 40 fJ, 21264 Alpha FF 43 fJ, Strong Arm FF 45 fJ
  – 50-70 fJ: K-6 ETL 70 fJ
  – above 70 fJ: SSTC latch 95 fJ, DSTC latch 125 fJ 61
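The delay and PDPtot figures on this slide can be ranked programmatically. The sketch below includes only the structures for which the slide gives both numbers; the data layout and helper name are illustrative.

```python
# name -> (delay in ps, PDPtot in fJ), as listed on the slide.
flip_flops = {
    "SDFF": (187, 39),
    "HLFF": (199, 29),
    "K-6 ETL": (200, 70),
    "PowerPC latch": (266, 28),
    "21264 Alpha FF": (272, 43),
    "Strong Arm FF": (275, 45),
    "mC2MOS latch": (292, 40),
    "SSTC latch": (592, 95),
    "DSTC latch": (629, 125),
}


def ranked(index):
    """Return structure names sorted by delay (index 0) or by PDPtot (index 1)."""
    return sorted(flip_flops, key=lambda name: flip_flops[name][index])


by_delay = ranked(0)
by_pdp = ranked(1)
print("fastest first:   ", by_delay)
print("lowest PDP first:", by_pdp)
```

HLFF is the only structure that sits in the top group on both lists.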
Delay comparison • F-F design brings the fastest structures 62
Delay comparison • F-F design brings the fastest structures 63
Overall ranking @100 MHz
• EDPtot accepted as the overall cost function
• The proposed “low-power” latches from Yuan & Svensson do not show an advantage compared with the other presented structures (the optimization was not properly done; it is yet to be repeated under a different setup) 64
Overall ranking, zoomed
• Real signals have activity between 0 and 1.0 ( )
• Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of “ones”
• More “ones” above the point 65
Overall performance
• Real signals have activity between 0 and 1.0 ( )
• Precharged hybrid structures are the fastest, but their power consumption strongly depends on the probability of “ones”
• More “ones” above the point 66
Conventional Clk-Q vs. minimum D-Q
• Hidden positive setup time
• Degradation of Clk-Q 67
Internal Power distribution
• Four sequences characterize the boundaries for internal power consumption
  – …010101…: maximum
  – random, equal transition probability: average
  – …111111…: precharge activity
  – …000000…: leakage + internal clock processing 68
Comparison of Clock power consumption 69
Using Dual-Edge Flip-Flop (run at ½ of the frequency and save on the power consumed in the clock distribution tree) 70
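A behavioral sketch of the idea: a dual-edge flip-flop clocked at half the frequency captures the same data as a single-edge flip-flop at full frequency, while the clock net toggles about half as often. The waveforms are hypothetical sampled sequences; this is a timing illustration, not a circuit model.

```python
def clock_edges(levels):
    """Return a list of (index, kind) for every transition in a sampled clock waveform."""
    edges = []
    for i in range(1, len(levels)):
        if levels[i - 1] == 0 and levels[i] == 1:
            edges.append((i, "rise"))
        elif levels[i - 1] == 1 and levels[i] == 0:
            edges.append((i, "fall"))
    return edges


def capture(clock, data, dual_edge):
    """Sample data at rising edges only, or at both edges when dual_edge is True."""
    return [data[i] for i, kind in clock_edges(clock) if dual_edge or kind == "rise"]


data = list(range(16))                   # stand-in data stream, one value per time step
fast_clock = [0, 1] * 8                  # full-frequency clock
slow_clock = [0, 1, 1, 0] * 4            # half-frequency clock

single = capture(fast_clock, data, dual_edge=False)
dual = capture(slow_clock, data, dual_edge=True)
print("same captured data:", single == dual)
print("clock transitions:", len(clock_edges(fast_clock)), "vs", len(clock_edges(slow_clock)))
```

Since clock-tree power scales with the number of clock transitions, the half-frequency clock is where the saving comes from.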
Dual-Edge vs. Single-Edge Flip-Flops Comparison Delay [ps] Total Power [ W] • Fujitsu 0. 18 u process; Clock frequency 500 MHz (250 MHz for Dual Edge FFs) • Data activity ratio = 0. 5 • VDD = 1. 8 V • Temp = 25º 71
Dual-Edge vs. Single-Edge Flip-Flops Comparison Internal Power [ W] Clock Power [ W] Data Power [ W] • Fujitsu 0. 18 u process; Clock frequency 500 MHz (250 MHz for Dual Edge FFs) • Data activity ratio = 0. 5 • VDD = 1. 8 V • Temp = 25º 72
Silicon on Insulator (SOI) Technology 73
SOI Comparison: F = 1 GHz, activity ratio = 0.5, Le = 0.08 µm, VDD = 1.3 V, T = 25 C 74
In conclusion…. What can we expect that low power will bring to us ? 75
Wearable Computer 76
Wearable Computer 77
Wearable Computer 78
Digital Ink 79
Implantable Computer 80
Bluetooth 81
Mosquito
Year 2110
• Extrapolation of the trend with some saturation
• Many important, interesting applications: Home, Entertainment, Office, Translation, Health care
• More assembly technique: 3D
Year 2120???
• Combination of bio and semiconductor
• Bio-computer by DNA manipulation
• Sensor: infrared, humidity, CO2
• Brain: ultra small volume, small number of neuron cells, extremely low power, real time image processing, (artificial) intelligence, long lifetime
• 3D flight control 82