
  • Number of slides: 16

Drowsy Caches: Simple Techniques for Reducing Leakage Power
Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, Trevor Mudge
krisztian.flautner@arm.com, kimns@eecs.umich.edu, stevenmm@eecs.umich.edu, blaauw@eecs.umich.edu, tnm@eecs.umich.edu

Motivation
• Ever-increasing leakage power
  – as feature size shrinks
• Vt scales down
  – exponential increase in leakage power
• On-chip caches
  – responsible for 15%~20% of the total power
  – leakage power can exceed 50% of total cache power, according to our projection using Berkeley Predictive Models

Processor power trends
• Based on ITRS roadmap and transistor count estimates.
• Total power in this projection cannot come true.

An observation about data caches
• L1 data caches
• Working set: fraction of cache lines accessed in a time window.
• Window size = 2000 cycles.
• Only a small fraction of lines are accessed in a window.
[Figure: working set of the current window; working set of the current + 1, 8, and 32 previous windows]
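The working-set definition above can be sketched in a few lines of Python. This is only an illustration: the function name, the `(cycle, line_index)` trace format, and the `history` parameter (number of previous windows merged into the current one) are assumptions, not part of the slides.

```python
def working_set_fractions(trace, num_lines, window=2000, history=0):
    """Fraction of cache lines touched per time window.

    trace     -- iterable of (cycle, line_index) pairs (hypothetical format)
    num_lines -- total number of lines in the cache
    window    -- window size in cycles (2000 on the slide)
    history   -- number of previous windows merged into each window
    """
    # Group the touched line indices by window number.
    touched = {}
    for cycle, line in trace:
        touched.setdefault(cycle // window, set()).add(line)
    if not touched:
        return []

    fractions = []
    for w in range(max(touched) + 1):
        lines = set()
        # Union of the current window and up to `history` previous windows.
        for prev in range(max(0, w - history), w + 1):
            lines |= touched.get(prev, set())
        fractions.append(len(lines) / num_lines)
    return fractions


# Tiny made-up trace for an 8-line cache.
trace = [(0, 1), (10, 1), (1500, 2), (2500, 1), (6100, 7)]
current_only = working_set_fractions(trace, num_lines=8)
with_one_previous = working_set_fractions(trace, num_lines=8, history=1)
```

With `history=0` this corresponds to the "current window" curve, and `history` set to 1, 8, or 32 to the "current + previous windows" curves named in the figure.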

The Drowsy Cache approach
Instead of being sophisticated about predicting the working set, reduce the penalty for being wrong.
Algorithm:
• Periodically put all lines in cache into drowsy mode.
• When accessed, wake up the line.
• Optimize across circuit-microarchitecture boundary:
  – Use of the appropriate circuit technique enables simplified microarchitectural control.
• Requirement: state preservation in low leakage mode.
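The algorithm above can be sketched as a small simulation. The function name, the trace format, the assumption that all lines start awake, and the choice to ignore how wake-up stalls shift later accesses are all simplifications made for illustration only.

```python
def simulate_simple_policy(accesses, num_lines, window=4000, wake_penalty=1):
    """Simulate periodic drowsy transitions with wake-up on access.

    accesses     -- list of (cycle, line_index), sorted by cycle (hypothetical format)
    window       -- cycles between global drowsy transitions
    wake_penalty -- extra cycles charged for touching a drowsy line
    Returns (extra_cycles, wakeups).
    """
    drowsy = [False] * num_lines  # assumption: every line starts awake
    next_sweep = window           # a single global counter, no per-line state
    extra_cycles = 0
    wakeups = 0
    for cycle, line in accesses:
        # Apply every periodic sweep that is due before this access.
        while cycle >= next_sweep:
            drowsy = [True] * num_lines
            next_sweep += window
        if drowsy[line]:
            drowsy[line] = False  # wake the line; its contents were preserved
            extra_cycles += wake_penalty
            wakeups += 1
    return extra_cycles, wakeups


accesses = [(0, 0), (100, 0), (4000, 0), (4001, 0), (4500, 1), (9000, 1)]
result = simulate_simple_policy(accesses, num_lines=4)
```

The point of the sketch is that a wrong guess costs only `wake_penalty` cycles, because a drowsy line keeps its state and never has to be refetched from memory.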

Access control flow – Awake tags
[Flow diagram. Hit path: awake tag match, line wake up, line access. Miss path: awake tag miss, line wake up, replacement, memory.]
• Drowsy hit / miss adds at most 1 cycle latency
• Access to awake line is not penalized

Access control flow – Drowsy tags
[Flow diagram. Hit path: awake tag match, tag wake up, line wake up, line access. Miss path: awake tag miss, tag wake up, line wake up, replacement, memory. Afterwards: unneeded tags and lines back to drowsy.]
• Drowsy tags implementation is more complicated
• Is the complexity worth it?
  – Tags use about 7% of data bits (32 bit address)
  – Only small incremental leakage reduction
• Worst case: 3 cycle extra latency
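The "about 7%" figure can be checked with a short calculation. The cache geometry used here (32 KB, 4-way set associative, 32-byte lines) is an assumption chosen for illustration; the slide states only the 32-bit address. Valid and dirty bits are ignored.

```python
def tag_overhead(cache_bytes, ways, line_bytes, address_bits=32):
    """Tag bits per line as a fraction of data bits per line.

    Assumes power-of-two sizes and ignores valid/dirty/status bits.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = (line_bytes - 1).bit_length()  # bits selecting a byte in the line
    index_bits = (sets - 1).bit_length()         # bits selecting the set
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits / (line_bytes * 8)


# Hypothetical geometry: 32 KB, 4-way, 32-byte lines.
# 256 sets -> 8 index bits, 5 offset bits, 19 tag bits per 256 data bits.
overhead = tag_overhead(32 * 1024, ways=4, line_bytes=32)
```

Under these assumptions the ratio is 19/256, roughly 7.4%, which is consistent with the slide's estimate and shows why making the tags drowsy as well buys only a small additional leakage reduction.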

Low-leakage circuit techniques
• Gated-VDD
  – Pros: largest leakage reduction; fast mode switching; easy implementation
  – Cons: loses cell state
• ABB-MTCMOS
  – Pros: retains cell state
  – Cons: slow mode switching
• DVS
  – Pros: retains cell state; fast mode switching; more power reduction than ABB
  – Cons: more susceptible to SEU noise

Drowsy memory using DVS
• Low supply voltage for inactive memory cells
  – Low voltage reduces leakage current too!
  – P = I × V
  – Quadratic reduction in leakage power
[Figure: SRAM cell with the supply voltage for normal mode, the supply voltage for drowsy mode, and the leakage path marked]
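The "quadratic" claim follows from P = I × V if the leakage current is taken to fall roughly in proportion to the supply voltage. That proportionality is an idealization assumed here for illustration; real devices deviate from it. The voltages below are the 1 V and 0.3 V values from the cache line architecture slide.

```python
def leakage_power_ratio(v_drowsy, v_normal):
    """Drowsy-to-normal leakage power ratio under an idealized model.

    Assumption: leakage current scales linearly with supply voltage,
    so P = I * V scales with the square of the voltage.
    """
    return (v_drowsy / v_normal) ** 2


# VDD = 1 V in normal mode, VDDLow = 0.3 V in drowsy mode.
ratio = leakage_power_ratio(0.3, 1.0)
```

This idealized model gives a ratio of about 0.09. It is a sanity check on the trend only; the 6x and 10x reductions quoted elsewhere in the deck are the authors' own projections and do not come from this formula.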

Leakage reduction using DVS
• High-Vt devices for access transistors
  – reduce leakage power
  – increase access time of cache
• Right trade-off point
  – 91% leakage reduction
  – 6% cycle time increase
Projections for 0.07μm process

Drowsy cache line architecture
[Figure: drowsy cache line circuit. Labels: drowsy bit, voltage controller, drowsy (set), wake up (reset), drowsy signal, row decoder, word line driver, word line gate, word line, drowsy word line, power line, VDD (1 V), VDDLow (0.3 V), SRAMs]

Energy reduction
• Projections for 0.07μm process.
• High leakage: lines have to be powered up when accessed.
• Drowsy circuit
  – Without high-Vt device (in SRAM): 6x leakage reduction, no access delay.
  – With high-Vt device: 10x leakage reduction, 6% access time increase.

1 cycle vs. 2 cycle wake up
• Fast wakeup is important, but easy to accomplish!
  – Cache access time: 0.57 ns (for 0.07μm, from CACTI using 0.18μm baseline).
  – Speed depends on voltage controller size: 64 x Leff gives 0.28 ns (half cycle at 4 GHz), 32 x Leff gives 0.42 ns, 16 x Leff gives 0.77 ns.
• The impact of drowsy tags is quite similar to that of double-cycle wake up.

Policy comparison
[Figure legend: simple 2000, simple 4000, noaccess 4000]

Energy reduction

              Normalized total energy     Normalized leakage energy    Run-time
              DVS    Theoretical min.     DVS    Theoretical min.      increase
Awake tags    0.46   0.35                 0.29   0.15                  0.41%
Drowsy tags   0.42   0.31                 0.24   0.09                  0.84%

• > 50% total energy reduction
• > 70% leakage energy reduction
• Theoretical minimum assumes zero leakage in drowsy mode
• Total energy reduction within 0.1 of theoretical minimum
  – Diminishing returns for better leakage reduction techniques
• Above figures assume 6x leakage reduction; 10x possible with small additional run-time impact
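One way to read the leakage columns is with a simple two-state model, sketched below. The model itself is an assumption, not something the slides state: it treats the theoretical minimum (zero leakage in drowsy mode) as the share of leakage spent in awake lines, and scales the remaining share by the drowsy circuit's reduction factor.

```python
def normalized_leakage(awake_share, reduction):
    """Normalized leakage energy under an assumed two-state model.

    awake_share -- leakage share attributed to awake lines, read off the
                   'Theoretical min.' column (zero drowsy leakage)
    reduction   -- leakage reduction factor of the drowsy circuit (6 or 10)
    """
    return awake_share + (1.0 - awake_share) / reduction


awake_tags_6x = normalized_leakage(0.15, 6)
drowsy_tags_6x = normalized_leakage(0.09, 6)
awake_tags_10x = normalized_leakage(0.15, 10)
```

With a 6x reduction the model gives about 0.29 for awake tags and 0.24 for drowsy tags, matching the DVS column. Moving to 10x lowers the awake-tags value only to roughly 0.235, which illustrates the diminishing returns noted above.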

Conclusions
• Simple circuit technique
  – Need high-Vt transistors, low Vdd supply
• Simple architecture
  – No need to keep counter/predictor state for each line
  – Periodic global counter asserts drowsy signal
  – Window size (for periodic drowsy transition) depends on core: ~4000 cycles has good E-delay trade-off
• Technique also works well on in-order processors
  – Memory subsystem is already latency tolerant
• Drowsy circuit is good enough
  – Diminishing returns on further leakage reduction
  – Focus is again on dynamic energy