e932e8abff2a714732c217159efef96a.ppt
- Number of slides: 24
Morphable Computer Architectures for Highly Energy Aware Systems
PACC Program Review: Nov. 1-3; Annapolis, MD
Peter M. Kogge: CSE Dept., University of Notre Dame; kogge@cse.nd.edu
Kanad Ghose: CS Dept., SUNY-Binghamton; ghose@cs.binghamton.edu
Nikzad “Benny” Toomarian: Center for Integrated Space Microsystems (CISM), Jet Propulsion Lab; benny@cism.jpl.nasa.gov
May 23-24, 2000, Scottsdale, AZ; Kickoff_may_2000.ppt

Outline
• Quad Chart
• “Gear-Shifting” Simplified
• The Morph Program
• The Morph Architecture
• Test Bed & Benchmarks
Nov. 1-3, 2000, Annapolis, MD; Oct_2000_review.ppt

MORPH: Dynamic Low Energy Architectures
MORPH Adds An “Energy Gear” to Dynamically Configurable Embedded Systems
New Ideas
• Morphable microarchitecture to allow dynamic changes in energy expended per cycle
• Energy efficient morphable memory hierarchies
• Energy efficient ISA extensions to process data more energy efficiently
• Adaptive algorithms to select best configuration
• Energy aware run-time which can reconfigure system
IMPACT
• Focus on energy, not just power, management
• Develops suite of widely applicable energy-reducing architectural techniques
• Adds extra technology-independent degrees of freedom to dynamic energy control
• Provides an overall inherently more energy efficient embedded computing system
• Designed for transfer to real missions
Timeline (5/00, 11/00, 5/01, 11/01, 5/02): Profiles, Baseline, Morphable Node, Data Placement, Adaptive Algorithms, Run-time, Demo & Eval

What is “Gear-Shifting” all about?
• Definitions:
  - IPC = Instructions per Cycle
  - EPC = Energy per Cycle
  - C = Cycles per Second
  - Performance = “Instructions/second” = $\mathrm{IPC} \times C$
  - Power = “Energy/second” = $\mathrm{EPC} \times C$
  - M = performance required during some mode (instructions/second)
• Real world: performance needs change very dramatically
• Observations on Conventional Designs:
  - Conventional designs fix IPC at some $\mathrm{IPC}_{\max}$ to meet peak need
  - In such designs $\mathrm{EPC} = K \times \mathrm{IPC}^{a}$, where “a” can range to almost 4
  - Assume arbitrary clock selection (up to a maximum clock $C_{\max}$)
  - Ignore Vdd changes for now
• Power @ M = $K \times \mathrm{IPC}_{\max}^{a} \times (M / \mathrm{IPC}_{\max}) = K \times M \times \mathrm{IPC}_{\max}^{a-1}$
  - Dependent on clock only thru M

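The conventional-design power expression follows in one step from the definitions on this slide. The intermediate clock choice $C = M / \mathrm{IPC}_{\max}$ is not written on the slide; it is spelled out here as a reading aid, under the slide's own assumptions (free clock selection up to $C_{\max}$, fixed Vdd).

```latex
\begin{align*}
C &= \frac{M}{\mathrm{IPC}_{\max}}
  && \text{(slowest clock that still delivers } \mathrm{IPC}_{\max} \times C = M\text{)} \\
\mathrm{Power}(M) &= \mathrm{EPC} \times C
  = K \times \mathrm{IPC}_{\max}^{a} \times \frac{M}{\mathrm{IPC}_{\max}}
  = K \times M \times \mathrm{IPC}_{\max}^{a-1}
\end{align*}
```

Because the design is sized for peak need, the factor $\mathrm{IPC}_{\max}^{a-1}$ is paid at every performance level, including modes where M is small.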
Some Simplified Gear Equations
• Assume IPC smoothly changeable from $\mathrm{IPC}_{\min}$ to $\mathrm{IPC}_{\max}$
• Let $R = \mathrm{IPC}_{\max} / \mathrm{IPC}_{\min}$ = “dynamic ratio” of performance range
• Let g be a gear setting, ranging from 0 to 1 to change IPC
• $\mathrm{IPC}(g) = \mathrm{IPC}_{\min} + (\mathrm{IPC}_{\max} - \mathrm{IPC}_{\min})\,g = \mathrm{IPC}_{\max}\,[1/R + (1 - 1/R)\,g]$
• $\mathrm{EPC}(g) = K \times \{\mathrm{IPC}_{\max}\,[1/R + (1 - 1/R)\,g]\}^{a}$
• $\mathrm{Power}(g, C) = K \times \{\mathrm{IPC}_{\max}\,[1/R + (1 - 1/R)\,g]\}^{a} \times C$
[Figure: gears]
Large R: OUR CHALLENGE

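A minimal sketch of the gear equations on this slide, written as plain Python functions. The function names and the numeric values at the bottom are illustrative assumptions, not part of the original deck; only the formulas come from the slide.

```python
def ipc(g, ipc_max, r):
    """IPC at gear setting g in [0, 1], with r = ipc_max / ipc_min."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("gear setting g must lie in [0, 1]")
    return ipc_max * (1.0 / r + (1.0 - 1.0 / r) * g)


def epc(g, ipc_max, r, k, a):
    """Energy per cycle at gear g, using the slide's model EPC = K * IPC**a."""
    return k * ipc(g, ipc_max, r) ** a


def power(g, c, ipc_max, r, k, a):
    """Power at gear g and clock c (cycles/second): EPC(g) * c."""
    return epc(g, ipc_max, r, k, a) * c


# Illustrative (assumed) parameter values, in arbitrary units.
IPC_MAX, R, K, A = 4.0, 8.0, 1.0, 2.0
print(ipc(0.0, IPC_MAX, R))            # lowest gear: IPC_MAX / R
print(ipc(1.0, IPC_MAX, R))            # highest gear: IPC_MAX
print(power(1.0, 100.0, IPC_MAX, R, K, A))
```

At g = 0 the first form of the equation reduces to IPCmin and at g = 1 to IPCmax, which is a quick way to check that the two forms on the slide agree.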
A Gear-Shifting Strategy
To minimize power as we vary performance requirement M:
• Use most efficient IPCmin as long as possible (until clock at maximum)
  - G = 0
• Then smoothly vary g while using Cmax
[Figure: two plots against Performance Rqmt (axis marks 0, Imin x Cmax, Imax x Cmax): gear setting G from 0 to 1, and clock C from 0 to Cmax]

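The two-phase strategy on this slide can be sketched as a small selection function: stay in the lowest gear and scale the clock until the clock saturates, then hold the clock at its maximum and raise the gear. The function name `choose_gear` and the example numbers are assumptions for illustration; the gear-to-IPC mapping is the linear one from the gear equations.

```python
def choose_gear(m, ipc_max, r, c_max):
    """Pick (g, c) for a performance requirement m (instructions/second).

    Phase 1: g = 0 (IPC = ipc_max / r), clock scaled to meet m.
    Phase 2: clock pinned at c_max, g raised until IPC(g) * c_max = m.
    """
    ipc_min = ipc_max / r
    if m < 0 or m > ipc_max * c_max:
        raise ValueError("m is outside the achievable performance range")
    if m <= ipc_min * c_max:
        return 0.0, m / ipc_min
    # Invert IPC(g) = ipc_min + (ipc_max - ipc_min) * g at IPC = m / c_max.
    g = (m / c_max - ipc_min) / (ipc_max - ipc_min)
    return g, c_max


# Illustrative (assumed) values: ipc_max = 4, R = 8, c_max = 100.
for m in (25.0, 50.0, 225.0, 400.0):
    print(m, choose_gear(m, 4.0, 8.0, 100.0))
```

The switch between phases happens at M = IPCmin x Cmax, the first axis mark in the slide's plots.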
The Result
Power Savings Factor: ratio of power under optimal gear change to conventional fixed-IPC power
• Potentially huge for large R
• And we can still use all the other tricks to lower peak power!
[Figure: Power Savings Factor against Performance Rqmt M. Horizontal-axis marks: 0, Imin x Cmax, Imax x Cmax. Vertical-axis marks: $(1/R)^{a-1}$ and 1. Annotation: “Huge savings if applications spend most time here”]

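The savings factor can be written in closed form by dividing the gear-shifted power by the conventional power K x M x IPCmax^(a-1). The piecewise expression below is derived here from the deck's simplified model, not quoted from it, so treat it as a sketch; K cancels in the ratio, and the function name and example values are assumptions.

```python
def savings_factor(m, ipc_max, r, c_max, a):
    """Ratio of gear-shifted power to conventional fixed-IPC power at requirement m.

    Low range (m <= ipc_min * c_max): lowest gear, scaled clock,
        ratio = (ipc_min / ipc_max)**(a - 1) = (1 / r)**(a - 1).
    High range: clock at c_max, IPC = m / c_max,
        ratio = (m / (ipc_max * c_max))**(a - 1).
    """
    ipc_min = ipc_max / r
    if not 0 < m <= ipc_max * c_max:
        raise ValueError("m must lie in (0, ipc_max * c_max]")
    if m <= ipc_min * c_max:
        return (1.0 / r) ** (a - 1)
    return (m / (ipc_max * c_max)) ** (a - 1)


# Illustrative (assumed) values: ipc_max = 4, R = 8, c_max = 100.
for a in (2.0, 3.0):
    print(a, [savings_factor(m, 4.0, 8.0, 100.0, a) for m in (25.0, 200.0, 400.0)])
```

The factor is flat at (1/R)^(a-1) up to Imin x Cmax and rises to 1 at Imax x Cmax, so the benefit grows with both R and a and is largest for workloads that spend most of their time at low M.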
The Morph Program
• Develop a microarchitecture with a large dynamic R
  - “Multi-cluster” superscalar CPU
  - Intelligent placement of data within mixed memory type hierarchy
  - Inherently low energy caches
  - Low energy ISA extensions
• Define & use a realistic embedded benchmark suite
  - Drawn from deep-space processing needs, initially rovers
  - Include other DARPA benchmarks such as from DIS
  - Baseline on variety of systems
• Develop real-time algorithms for reconfiguration
• Demonstrate potential gains via simulation
  - SimpleScalar + energy models
• Technology transfer to potential future JPL missions

The Team
Overall Goals:
• Architectures with variable IPC, EPC
• Tools & S/W to manage morphing
• Realistic demonstrations
UNIVERSITY OF NOTRE DAME (Peter Kogge, Vincent Freeh, Jay Brockman)
• Morphable multi-cluster architecture
• “At the sense amps” ISA extension
• Runtime with hooks for dynamic morphing control
SUNY-BINGHAMTON (Kanad Ghose)
• Energy Aware Data Placement
• Morphable Caches, RFs
• Dynamic Bit Slicing
• Energy Eff VLIW archs
• Supporting compiler techniques
JET PROPULSION LABORATORY (Nikzad Toomarian, Mohammed Mojarradi, Savio Chau)
• Scenarios & benchmarks
• Baseline characterizations
• Runtime adaptation algorithms

Starting A Solution: Multi Cluster Architecture
[Figure: (a) Simple Pipeline; (b) Classical Superscalar with Issue Width (IW); (c) New Multi Cluster with w clusters, each of width IW/w]
• Problem: single large centralized register files with many ports
  - EPC/IPC ~ $(IW)^{k}$, k as high as 1.9
• Solution: multiple smaller register files with few ports
  - $w\,(IW/w)^{k} \ll (IW)^{k}$

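A quick numeric check of the inequality on this slide, treating (IW)^k as a relative register-file cost in arbitrary units. The function name and the example configuration (IW = 8, w = 4) are assumptions chosen for illustration; k = 1.9 is the upper value quoted on the slide.

```python
def clustered_cost_ratio(issue_width, clusters, k):
    """Return w * (IW / w)**k divided by IW**k.

    Values below 1 mean the clustered register files cost less, in this
    model, than one centralized file serving the full issue width.
    """
    if issue_width <= 0 or clusters <= 0:
        raise ValueError("issue_width and clusters must be positive")
    clustered = clusters * (issue_width / clusters) ** k
    centralized = issue_width ** k
    return clustered / centralized


# Illustrative (assumed) configuration: IW = 8 split into 1, 2, or 4 clusters.
for w in (1, 2, 4):
    print(w, clustered_cost_ratio(8, w, 1.9))
```

Algebraically the ratio simplifies to w^(1-k), so for any k > 1 the advantage grows with the number of clusters and does not depend on IW itself.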
Target Morph Configuration
[Figure: target node with embedded + external memory (EEPROM, FLASH, DRAM, SRAM), annotated with the features below]
• Energy-aware data placement
• Alternative ISA features
• Dynamic issue width
• Dynamic ALU width
• Low energy caches
• Selective substrate bias
• Variable multi-cluster microarchitecture
• Dynamic data path width

Evaluation Methodology
[Figure: scatter plot of design points for the PACC Benchmarks, EPC: Energy per Cycle against IPC: Instructions per Cycle, marking “Today’s Performance Only Design Point” and the “Energy Efficient Family”]

Multi-Cluster vs Conventional Results
[Figure: results plot comparing conventional and multi-cluster configurations; legible configuration labels include 1 x 8, 2 x 6, 4 x 4, 1 x 4, and 4 x 2; a region is marked “Savings”]
• Morph: dynamically change the cluster size & ride the EPC/IPC
• Up to 1/2 the energy at same IPC, or 20% better IPC at same energy

On-chip Caches: Addressing Dynamic & Static Leakage
• On-chip caches dissipate 25% to 45% of total energy
  - Likely to increase because of leakage
• Added line buffers (4 to 16) reduce dynamic energy dissipation by 40% to 65+%, with no penalty in access time and with 4% to 6% area penalty
• Use of dynamic activation of recently-accessed L2 cache areas reduces dynamic dissipation component by 40% to 80%
  - Only selected areas of L2 in active mode, rest in standby
  - Size of bit-cell groups controlled is critical
  - Additional L2 area penalty of approx. 8%
  - Heuristics for controlling transitions between active & standby modes

Addressing Dynamic & Static Dissipations in Caches

Exploiting Bit-Slice Inactivity in Datapaths
• Expectation: Higher-order data bits likely to be insignificant at least some of the time
• Opportunity: exploit byte slice inactivity over transfer paths, within storage devices (register files, caches) & function units
[Figures: “FOR SPECfp95 DP”; “FOR INTEGERS FROM SPECfp95”; a circuit to provide read-enables in RFs to avoid energy dissipation on access]

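To make “byte slice inactivity” concrete, the sketch below counts how many low-order bytes of a signed integer actually carry information; the remaining high-order bytes are pure sign extension and are the slices a datapath could leave inactive. This is an illustrative helper under a two's-complement assumption, not the circuit or measurement method from the slides; the name `significant_bytes` and the 4-byte word width are hypothetical.

```python
def significant_bytes(value, width_bytes=4):
    """Smallest number of bytes that holds `value` in two's complement.

    Bytes above this count are sign extension only (all 0x00 or all 0xFF).
    """
    for n in range(1, width_bytes + 1):
        limit = 1 << (8 * n - 1)
        if -limit <= value < limit:
            return n
    raise ValueError("value does not fit in the given word width")


def inactive_fraction(values, width_bytes=4):
    """Fraction of all byte slices that are sign extension across `values`."""
    total = width_bytes * len(values)
    active = sum(significant_bytes(v, width_bytes) for v in values)
    return (total - active) / total


# Illustrative (assumed) operand stream of small 32-bit integers.
sample = [5, -1, 300, 70000, -129, 127]
print([significant_bytes(v) for v in sample])
print(inactive_fraction(sample))
```

A read-enable per byte slice, as in the register-file circuit the slide mentions, would let a design skip the slices this count identifies as redundant.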
Deep Space: The Ultimate Power-Constrained Embedded System
• Limited energy/power sources
  - Renewable variable power: Solar cells
  - Constant power: RPGs
  - Fixed energy: batteries
• Multiple operational modes, all compute/energy constrained
  - Cruise
  - Communication: compression vs transmission
  - Data gathering vs analysis
  - Movement: collision avoidance
• Today:
  - “Pre-canned” power management by serialized operations
Morph Initial Focus: Rovers

Pathfinder Sojourner

Function                                                  Time and Calculation   Energy Required
motor heating: 1 motor at a time                          7.51 W x 1 hr          7.51 W-hr
motor heating: 2 motors at a time                         11.26 W x 0.5 hr       5.63 W-hr
driving (extreme terrain @ -80 deg. C)                    13.85 W x 0.5 hr       6.92 W-hr
hazard detection                                          7.33 W x 0.25 hr       1.83 W-hr
imaging (3 images @ 2 min/image)                          4.5 W x 0.1 hr         0.45 W-hr
image compression (compress 3 images @ 6 min/image)       3.7 W x 0.3 hr         1.2 W-hr
6 Mbit communication @ 50 min/sol                         6.27 W x 0.8 hr        5.2 W-hr
42, 10 sec health checks during day                       6.27 W x 0.1 hr        0.63 W-hr
remainder of 7 hr daytime CPU operation                   3.7 W x 4 hr           15.0 W-hr
WEB heating (as needed)                                   50 W-hr                50 W-hr
Total                                                                            95 W-hr

vs peak 15 W-hr Solar Cells + 150 W-hr non-rechargeable battery

Effects on application code:
• Many actions sequential, not simultaneous
• No dynamic scheduling, no autonomy
• Not even CPU-clock management
• Nowhere near enough CPU performance
• Designed to limit worst case power
• Dump excess power into heaters

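The Sojourner budget on this slide is simple power-times-duration arithmetic, which the sketch below reproduces. The figures are copied from the slide; the tuple layout and variable names are assumptions made for illustration. WEB heating is listed only as an energy, so it has no power or duration entry.

```python
# (function, power in W, duration in hr, energy listed on the slide in W-hr)
BUDGET = [
    ("motor heating: 1 motor at a time", 7.51, 1.0, 7.51),
    ("motor heating: 2 motors at a time", 11.26, 0.5, 5.63),
    ("driving (extreme terrain)", 13.85, 0.5, 6.92),
    ("hazard detection", 7.33, 0.25, 1.83),
    ("imaging", 4.5, 0.1, 0.45),
    ("image compression", 3.7, 0.3, 1.2),
    ("6 Mbit communication", 6.27, 0.8, 5.2),
    ("health checks", 6.27, 0.1, 0.63),
    ("remaining daytime CPU operation", 3.7, 4.0, 15.0),
    ("WEB heating (as needed)", None, None, 50.0),
]

# Sum of the energies exactly as listed on the slide.
listed_total = sum(listed for _, _, _, listed in BUDGET)

# Sum recomputed from power x duration where both are given.
recomputed_total = sum(
    listed if watts is None else watts * hours
    for _, watts, hours, listed in BUDGET
)

print(f"listed total:     {listed_total:.2f} W-hr")
print(f"recomputed total: {recomputed_total:.2f} W-hr")
```

Both sums land just under the 95 W-hr total shown on the slide; a few listed entries (for example 3.7 W x 0.3 hr shown as 1.2 W-hr) appear to be rounded up from the exact product.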
Athena/Mars ’03 Rovers
Rover Configuration
• 3 Hrs/day of solar @ 50 W
• 5 amp hr 16 V batteries
• More complex communication
• More complex on-board eqpt
• Still statically scheduled
[Figure labels: Pancam/Mini-TES; Instrument Arm Cluster: Raman Spectrometer, Alpha-Proton-X-Ray Spectrometer (APXS), Mössbauer Spectrometer, Microscopic Imager; Mini-Corer]

MUSES-CN Asteroid NanoRover
• [...] including RF telecommunications system for communications to lander or small-body orbiter for relay to Earth
  - Solar powered @ 1 watt
• Clock-adjustable CPU speed
• To run a command:
  - Determine available solar power
  - Minimum required power = device + CPU power
  - If available power < minimum required:
    - if parameter enables re-orienting, re-orient to maximize solar power
    - if still not enough and parameter enables waiting, wait up to parameter limit for solar power
    - if still not enough, abort command
  - Set CPU speed to maximum allowable based on (power available) - (minimum needed for devices)
  - Perform command: during command execution, if power drops significantly (or load shed indication? ...):
    - CPU speed is reduced to minimum required
    - Operate motors one-at-a-time
  - Return CPU speed to parameter-specified idle
• Still “sequential” operation

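The command procedure on this slide can be sketched as a short planning function. Everything named here (`CommandParams`, `plan_command`, the `reorient` and `wait_for_power` callbacks, and the wattages) is hypothetical and only illustrates the decision order; it is not the NanoRover flight software. The run-time response to a power drop (reduce CPU speed, run motors one at a time) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class CommandParams:
    device_power_w: float    # power the commanded devices need
    cpu_min_power_w: float   # CPU power at its slowest usable clock
    allow_reorient: bool     # parameter enabling re-orientation
    max_wait_s: float        # parameter limit for waiting; 0 disables waiting


def plan_command(available_w, params, reorient=None, wait_for_power=None):
    """Return ("run", cpu_power_budget_w) or ("abort", 0.0).

    `reorient()` and `wait_for_power(seconds)` are assumed callbacks that
    return the solar power available after the action.
    """
    minimum = params.device_power_w + params.cpu_min_power_w
    if available_w < minimum and params.allow_reorient and reorient:
        available_w = reorient()
    if available_w < minimum and params.max_wait_s > 0 and wait_for_power:
        available_w = wait_for_power(params.max_wait_s)
    if available_w < minimum:
        return "abort", 0.0
    # CPU may use whatever is left after the devices are served.
    return "run", available_w - params.device_power_w


# Illustrative (assumed) numbers for a rover with about 1 W of solar power.
params = CommandParams(device_power_w=0.4, cpu_min_power_w=0.2,
                       allow_reorient=True, max_wait_s=10.0)
print(plan_command(1.0, params))
print(plan_command(0.5, params))
print(plan_command(0.5, params, reorient=lambda: 0.9))
```

The returned CPU power budget stands in for “maximum allowable CPU speed”; mapping a power budget to a clock setting depends on hardware details the slide does not give.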
Some Morph Test Beds
• Different PowerPC configurations
  - Microarchitecture
  - Clock rates
  - ISA extensions
• Run rover/PACC application code
• Measure time/power
• Use as input to SimpleScalar simulation
PACC-Gold
• 400 MHz PPC 750
• Linux
PACC-Blue
• 400 MHz PPC 7400
• Enhanced superscalar + Altivec
• Linux
JPL PPC-SBC
• 200 MHz 750
• VxWorks
[Figure labels: Oscilloscope, Logic Analyzer, PowerPC 750, NT Box, Ethernet]

The NASA X2000 Avionics System
• Design for 10-20X reduction in power, at 10-20X performance increase
• With long-term survivability & technology scaling
• Application-specific adaptive configuration to match run-time power supply constraints
[Figure labels: high-rate input (camera); symmetric multiprocessor modules; reconfigurable hardware blocks; communication module (CDMA); high-speed bus (e.g. IEEE 1394); low-speed bus (e.g. I2C); bus power controller; altimeter subnet; microcontroller-directed subnet]
  - power regulations & control
  - analog telemetry sensors
  - safety inhibits
  - valve & pyro drive

X2000 FD Testbed with Power Awareness
[Figure: testbed block diagram. Several cPCI bus (6U chassis) units, each with a built-in power supply and a current meter, populated with PPC 750 (Synergy) boards, PMC, hard drives, 1394a I/F (Saderta), dual I2C I/F (JPL), and empty slots; one chassis also carries an FPGA rapid prototype. Other equipment: PCI bus analyzer, Micro Gyro, GPIB, PMC EPP adapter, terminal server, SUN Ultra 10 workstations, SUN E3500 workstation (35 GB HD), Pentium III w/1394a analyzer (Saderta), outlets for power measurement. Legend: COTS; COTS support equipment; JPL in-house product. Links: IEEE 1394, I2C, Ethernet, SCSI, RS232, IEEE 488]

Near Term Activities
• Extract Rover application code
• Run on SBC & Apples for baseline data
• Continue microarchitectural design and simulation
• Continue activities not mentioned here
  - Instruction annotation for energy-aware data access
  - Benchmark analysis for data placement
  - ISA extensions