
Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks
R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*
*IBM T. J. Watson Research Center +Barcelona Supercomputing Center
MICRO 2012 Tuesday, December 4, 2012 © 2012 IBM Corporation Barcelona Supercomputing Center

Why do we need micro-benchmarks?
What is the maximum power consumption? Any performance bug? Any reliability issues? … Micro-benchmarks!
§ Time-consuming and tedious
– Error-prone task
• Trial and error process
– Several micro-benchmarks are required
§ Deep expertise limited to few designers
– Detailed knowledge of the underlying architecture is required
Solution: automation needed!

MicroProbe: a micro-benchmark generation framework

MicroProbe Workflow
Inputs:
– User: micro-benchmark generation policy, e.g. "Endless loop, INT 50% FP 50%", "Max power stressmark", "Endless loop for each instruction of the ISA"
– Architecture definition files
MicroProbe Framework
Outputs:
– Micro-benchmarks for external tools: real platforms, simulators, models

MicroProbe: Distinguishing Features
Table columns: Feature | Previous works | MicroProbe
§ ISA queries
– Instruction type
– Operand length, binary encoding etc.
§ Micro-architecture queries
– Functional unit, latency, throughput, energy per instruction, average instruction power etc.
§ Micro-architecture models
– Set-associative cache model
§ Code generation
– Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
– Configurable passes
§ Design space exploration
– Integrated
– GA-based search
– Exhaustive search
– Customizable search
Entries recovered for the "Previous works" column: (manual), (no), (manual)

MicroProbe Usage and Design Overview
Research idea, expressed as micro-benchmark generation policies (user-defined scripts):
– Loop stressing the floating point unit
– Sequence of loads hitting 50% L1 and 50% L2
– Generate a stressmark for each functional unit of the architecture
– Search for the sequence of 2 loads and 2 integer operations with maximum IPC
MicroProbe Framework (Python API):
– Architecture module: ISA definitions, micro-architecture definitions, micro-architecture analytical models
– Code generation module: micro-benchmark synthesizer, passes
– Design space exploration module: search drivers
Other diagram labels: automatic bootstrap process, properties, external tools
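The policy "sequence of loads hitting 50% L1 and 50% L2" can be illustrated with a small set-associative cache model. The sketch below is not MicroProbe code and does not use its API. It assumes LRU replacement, 128-byte lines and 8-way caches sized 32 KB (L1) and 256 KB (L2); the associativity, line size and every function name are assumptions made for illustration. One load stream cycles over more lines than an L1 set can hold, so it always misses L1 but fits in L2. The other stream reuses a single line and always hits L1.

```python
from collections import OrderedDict

LINE = 128                    # assumed cache-line size in bytes
L1_SETS, L1_WAYS = 32, 8      # assumed: 32 sets * 8 ways * 128 B = 32 KB
L2_SETS, L2_WAYS = 256, 8     # assumed: 256 sets * 8 ways * 128 B = 256 KB


class LRUCache:
    """Tiny set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, sets, ways):
        self.sets, self.ways = sets, ways
        self.lines = [OrderedDict() for _ in range(sets)]

    def access(self, addr):
        """Return True on a hit; update the LRU state either way."""
        block = addr // LINE
        cache_set = self.lines[block % self.sets]
        if block in cache_set:
            cache_set.move_to_end(block)      # mark as most recently used
            return True
        if len(cache_set) == self.ways:
            cache_set.popitem(last=False)     # evict least recently used
        cache_set[block] = True
        return False


def load_addresses(n_loads):
    """Interleave an L1-thrashing stream with a single hot line."""
    stride = L1_SETS * LINE                   # same L1 set every `stride` bytes
    thrash = [i * stride for i in range(L1_WAYS + 1)]  # ways + 1 lines, L1 set 0
    hot = LINE                                # one line in L1 set 1
    seq = []
    for i in range(n_loads // 2):
        seq += [thrash[i % len(thrash)], hot]
    return seq


def simulate(seq, warmup):
    """Count L1 hits and L2 hits after skipping `warmup` cold accesses."""
    l1, l2 = LRUCache(L1_SETS, L1_WAYS), LRUCache(L2_SETS, L2_WAYS)
    l1_hits = l2_hits = 0
    for i, addr in enumerate(seq):
        in_l1 = l1.access(addr)
        in_l2 = (not in_l1) and l2.access(addr)   # L2 is probed only on an L1 miss
        if i >= warmup:
            l1_hits += in_l1
            l2_hits += in_l2
    return l1_hits, l2_hits, len(seq) - warmup
```

Once the first pass over the nine thrashing lines has warmed L2, every second load hits L1 and the others miss L1 but hit L2, which is the 50/50 split the policy asks for under these assumptions.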

Max-power Stressmark Generation
Use MicroProbe to generate a max-power stressmark:
– Characterize energy per instruction (EPI) and IPC (Architecture Module)
– Select N instructions with max (IPC * EPI)
– Form a basic endless loop (e.g. 4K) using the selected instructions (Code Generation Module)
– Generate micro-benchmarks with different orders of the selected N instructions
– Evaluate using the Design Space Exploration Module
– Pick the highest-power micro-benchmark
Example instructions: mulldo, xvnmsubmdp, lxvw4x, arranged in different orders inside the "Loop: …" bodies
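The selection and search steps above can be sketched in a few lines of Python. This is a hedged illustration, not the MicroProbe API: the function names are hypothetical, the IPC and EPI figures are placeholder numbers rather than measurements, and `toy_power` stands in for a real power measurement on the target platform.

```python
from itertools import permutations


def select_instructions(characterization, n):
    """Pick the n instructions with the largest IPC * EPI product."""
    ranked = sorted(characterization,
                    key=lambda name: characterization[name][0] * characterization[name][1],
                    reverse=True)
    return ranked[:n]


def candidate_loops(selected, loop_size):
    """Build one loop body per ordering of the selected instructions."""
    loops = []
    for order in permutations(selected):
        repeats = loop_size // len(order) + 1
        loops.append((list(order) * repeats)[:loop_size])
    return loops


def pick_max_power(loops, measure_power):
    """Return the loop body that the supplied measurement rates highest."""
    return max(loops, key=measure_power)


# Placeholder (IPC, normalized EPI) pairs, NOT measured values.
CHARACTERIZATION = {
    "mulldo": (1.5, 2.0),
    "subf": (2.0, 1.0),
    "lxvw4x": (2.0, 2.5),
    "lbz": (2.0, 0.8),
    "xvnmsubmdp": (2.0, 3.0),
}


def toy_power(loop):
    """Stand-in for a real measurement: rewards one adjacent pair."""
    bonus = {("mulldo", "lxvw4x"): 1.2}
    return sum(bonus.get(pair, 1.0) for pair in zip(loop, loop[1:]))


selected = select_instructions(CHARACTERIZATION, 3)
loops = candidate_loops(selected, loop_size=12)
best = pick_max_power(loops, toy_power)
```

Exhaustive permutation is only practical for small N. The deck lists GA-based and customizable searches as alternatives for larger spaces.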

MicroProbe: A Micro-benchmark Generation Framework
CASE STUDIES

Experimental Methodology
§ Platform:
– Processor: POWER7 @ 3 GHz
• 8-core, 4-way SMT
• 32 KB L1, 256 KB L2 and 4 MB L3 per core
– Memory: 32 GB DDR3 SDRAM @ 800 MHz
– OS: RHEL 5.7 + Linux 3.0.1
– EnergyScale architecture
• Power measurements in milliwatts
• Sampling rate up to 1 ms
§ In-house software collects power and performance counter traces [C. Lefurgy et al., IBM]

Case Study 1: EPI Characterization
Table columns: Category | Instruction | Core IPC | Normalized EPI (Global, Category) | Functional Units
– FXU: mulldo, subf, addic
– LSU: lxvw4x, lvewx, lbz
– VSU: xvnmsubmdp, xvmaddadp, xstsqrtdp
– Core IPC and normalized EPI values for the rows above, in extraction order: 1.40 2.00 1.68 2.00 2.60 1.69 1.00 2.88 2.81 2.14 2.35 2.31 1.32 2.60 1.69 1.00 1.35 1.31 1.00 1.78 1.75 1.00 1.73 1.58 1.16 1.49 1.36 1.00 5.12 5.01 4.24 5.51 5.29 4.80 1.21 1.18 1.00 1.15 1.10 1.00 8.36 7.16 5.97 10.00 9.49 8.40 1.20 1.00 1.19 1.13 1.00
– Simple Integer Operations (FXU or LSU): add, nor, and; core IPC 3.50
– Integer Memory Operations (LSU and FXU; LSU and 2 FXU): ldux, lwax, lfsu, lhaux; values 1.00, 1.00
– Vector/Float/Decimal memory operations (LSU and VSU; LSU and VSU and FXU): stxvw4x, stxsdx, stfd, stfsux, stfdux, stfdu; core IPC 0.48 each
Observations:
§ High differences in EPI across instructions stressing different micro-architecture components
§ High differences in EPI across instructions stressing the same micro-architecture components and at the same rate (IPC)

Case Study 2: Max-power Stressmark Generation
Candidate approaches:
– Use a complex computational intensive kernel
– Use complex instructions stressing different units
– Generate all possible combinations of instructions accessing different functional units with high IPC
– Use MicroProbe
Diagram labels:
– Expert manual: DAXPY loops; loops built from mullw, xvmaddadp and lxvd2x
– Expert DSE: loops
– MicroProbe: loops; selected instructions: mulldo, xvnmsubmdp, lxvw4x; heuristic: Max(EPI * IPC)

Max-power Stressmark Generation
[Figure]

Case Study 3: Counter-based Processor Power Model
Bottom-up power modeling method:
– Step 1: functional-unit micro-benchmarks and random micro-benchmarks (CMP1–SMT1): Dynamic Power f(PMCs), intercept
– Step 2: random micro-benchmarks (CMP1–SMT2/4): SMT effect
– Step 3: random micro-benchmarks (CMP1/8–SMT2/4), linear regression f(CMP): CMP effect, uncore power
Model terms: Dynamic Power f(PMCs); SMT effect (SMT enabled); CMP effect (# cores); Uncore power
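A bottom-up model of this kind can be sketched as a sum of separately fitted terms. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes the terms combine additively, that the dynamic term is an intercept plus a weighted sum of performance-counter rates, and that the CMP effect and uncore power are the slope and intercept of a least-squares line over the number of active cores. All names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dynamic_power(pmc_rates, weights, intercept):
    """Intercept plus a weighted sum of per-counter activity rates."""
    return intercept + sum(weights[name] * rate for name, rate in pmc_rates.items())


def predict_power(pmc_rates, weights, intercept,
                  smt_enabled, smt_effect, n_cores, cmp_effect, uncore):
    """Assumed additive combination of the four model terms."""
    return (dynamic_power(pmc_rates, weights, intercept)
            + smt_effect * (1 if smt_enabled else 0)
            + cmp_effect * n_cores
            + uncore)


# Hypothetical step 3: residual power observed with 1, 2, 4 and 8 active cores.
cmp_effect, uncore = fit_line([1, 2, 4, 8], [12.0, 14.0, 18.0, 26.0])

estimate = predict_power({"fxu": 1.0, "lsu": 0.5}, {"fxu": 2.0, "lsu": 3.0}, 1.0,
                         smt_enabled=True, smt_effect=0.5,
                         n_cores=4, cmp_effect=cmp_effect, uncore=uncore)
```

Fitting each term from a dedicated training set keeps every regression small, which mirrors the staged structure on the slide.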

Counter-based Processor Power Model Validation
§ Within acceptable error margins: < 4% on average
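The slide does not say how the average error is computed. One common choice, assumed here purely for illustration, is the mean absolute percentage error between measured and predicted power over a set of workloads. The function name and the sample numbers are hypothetical.

```python
def mean_abs_pct_error(measured, predicted):
    """Average of |predicted - measured| / measured, expressed in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Made-up power samples in watts, not data from the presentation.
error_pct = mean_abs_pct_error([100.0, 200.0], [96.0, 208.0])
```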

Counter-based Processor Power Model Validation on Corner Cases
§ Models trained using non-micro-architecture aware training sets show high errors and variability
§ Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average

Conclusions
§ MicroProbe is a productive micro-benchmark generation framework
– Adaptive and flexible
– Includes micro-architecture semantics
– Integrates design space exploration
§ Presented three case studies:
– Instruction-based EPI characterization
– Automated max-power stressmark generation
– CMP/SMT-aware bottom-up counter-based processor power model

MicroProbe: A Micro-benchmark Generation Framework
QUESTIONS?