
Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems
Murali Jayapala, Francisco Barat, Pieter Op de Beeck, Tom Vander Aa, Geert Deconinck (ESAT/ACCA, K.U. Leuven, Belgium)
Francky Catthoor, Henk Corporaal (IMEC, Leuven, Belgium)
Overview
• Context: introduction to the problem
• Motivation for L0 buffer organization and status
• Distributed L0 buffer organization
• Instruction memory exploration
  - Software and compiler transformations
• Conclusions
Context: Low Energy Embedded Systems
• Low power embedded systems
  - Battery operated (low energy): 10-50 MOPS/mW
  - Small, low cost, flexible
• Multimedia applications
  - Video, audio, wireless
  - High performance: 10-100 GOPS, real-time constraints
Context: Programmable Processor Based Embedded Systems
• Power breakdown in embedded processors
  - 43% of power in on-chip memory (StrongARM SA-110: a 160 MHz, 32 b, 0.5 W CMOS ARM processor)
  - 40% of power in internal memory (C6x, Texas Instruments Inc.)
  - 25-30% of power in the instruction memory
• To address the data memory issues: Data Transfer and Storage Exploration (DTSE) methodology, F. Catthoor et al.
Related Work: Significant Power Consumption in the Instruction Memory Hierarchy
• Compression (code size reduction)
  - L. Benini et al., "Selective Instruction Compression for Memory Energy Reduction ...", ISLPED 1999
  - P. Centoducatte et al., "Compressed Code Execution on DSP Architectures", ISSS 1999
  - T. Ishihara et al., "A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors", DATE 2000
• Software transformations
  - N. D. Zervas et al., "A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications", ICECS 2001
  - S. Parameswaran et al., "I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency", ICCAD 2001
(Diagram: main memory (off-chip) → L1 cache (on-chip) → core)
Overview (recap): Context • Motivation for L0 buffer organization and status • Distributed L0 buffer organization • Instruction memory exploration (software and compiler transformations) • Conclusions
Application Domain: Multimedia Characteristics (1)
[Chart: static vs. dynamic instruction count; less than 1-2% of the static instruction count (ICstatic) covers nearly 100% of the dynamic instruction count (ICdynamic)]
• High locality
Application Domain: Multimedia Characteristics (2)
[Chart: normalized dynamic instruction count vs. normalized static instruction count]
• Within a program, a few basic blocks or instructions take up most of the execution time (ICdynamic)
Motivation for an Additional Small Memory
• Application domain: high locality in a few basic blocks, but the total size of these high-locality basic blocks is still large
• If the L1 cache (on-chip) is made small:
  - Performance degrades: capacity (and compulsory) misses
  - System power increases: off-chip memory / bus activity increases
• A small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance
Related Work (Microarchitecture): Cache Design
• N. Jouppi et al., "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", ISCA 1990
  - Aim: to reduce miss penalty cycles
  - Miss caching, victim caching, stream buffers
Related Work (Microarchitecture): Cache Design
• J. D. Bunda et al., "Instruction-Processing Optimization Techniques for VLSI Microprocessors", PhD thesis, 1993
  - Aim: to reduce instruction cache energy
  - L0 buffer: cache block buffer (1 cache block + 1 tag)
  - Limitation: block thrashing
• J. Kin et al., "Filtering memory references to increase energy efficiency", IEEE Trans. on Computers, 2000
  - Aim: to reduce instruction cache energy
  - L0 buffer: filter cache, a small regular cache (< 1 KB); L0 access (hit) latency: 1 cycle, L1 access (hit) latency: 2 cycles
  - Limitation: energy reduced at the expense of performance (256-byte filter cache: 58% power reduction with 21% performance degradation)
Related Work (Architecture): Software-Controlled L0 Buffers
• R. S. Bajwa et al., "Instruction Buffering to Reduce Power in Processors for Signal Processing", IEEE Trans. VLSI Systems, vol. 5, no. 4, 1997
• L. H. Lee et al. (M-CORE), "Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops", ISLPED 1999
  - L0 buffer: buffer (< 1 KB) + local controller (LC); no tags
  - L0 / L1 access latency: 1 cycle
  - Used only for specific program segments (innermost loops)
  - Software control: special instructions (lbon, sbb) map program segments to the L0 buffer
• Operation: normal operation (fetch from L1 to the datapath) → initiation → filling → execution from the L0 buffer → termination (back to L1)
Related Work (Architecture): Software-Controlled L0 Buffers
• Assumed architecture
  - MIPS 4000 ISA, single-issue processor
  - L1 cache: 16 KB, direct mapped
  - Loop buffer (2 KB): depth = 128 instructions, width = 16 bytes
• Tools: SimpleScalar 2.0, Wattch power estimator
• Loops with fewer than 128 instructions were hand-mapped onto the loop buffer
Related Work (Architecture): Software-Controlled L0 Buffers
• Advantages
  - 50% (avg) energy reduction, with no performance degradation
  - Software control: only selected program segments are mapped
• Limitations
  - Supports only innermost loops (regular basic blocks); other frequently executed basic blocks are still fetched from the L1 cache
  - No support for control constructs within loops
• F. Vahid et al. [2001-2002]: hardware support for conditional constructs within loops
  - Identifies the loop address bounds (preloading the program segment/loop)
  - Handles subroutines, conditional constructs, and 1-level nested loops
Related Work (Architecture): Compiler-Controlled L0 Buffers
• N. Bellas et al., "Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors", ISLPED 1998
• Aim: reduce instruction cache energy by letting the compiler assume the role of allocating basic blocks to the L0 buffer
• L0 buffer: regular cache (< 1 KB; 128 instructions)
• Technique: profile → function inlining → identify basic blocks → code layout (basic blocks allocated to the L0 buffer address space)
• Advantages
  - Automated: a 'tool' can do this job
  - Uses the basic block as the atomic unit of allocation
  - 60% (avg) energy reduction in the instruction memory hierarchy [SPEC95]
• Limitation: tag overhead
Loop Buffers: Commercial Processors
• RISC DSP processors
  - SH-DSP: decoded instruction buffers; supports regular loops (no conditional constructs / nested loops)
• VLIW processors
  - StarCore SC140: supports regular and nested loops; conditional constructs through predication
  - STMicroelectronics ST120: supports nested loops and loops with conditional constructs
Overview (recap): Context • Motivation for L0 buffer organization and status • Distributed L0 buffer organization • Instruction memory exploration (software and compiler transformations) • Conclusions
Shortcomings
• So far: hardware, software and compiler optimizations increase the accesses/activity at the L0 buffers
• Bottlenecks to solve
  - L0 buffer organization
  - Interconnect from the L0 buffer to the datapath
  - An efficient buffer controller
  - An organization that scales with the number of FUs
[Figure: centralized organization – main memory (off-chip), L1 cache (on-chip), a single L0 buffer with local controller (LC) feeding all FUs]
Current Organizations for L0 Buffers
• Uncompressed L0 buffer (ref: Bajwa et al., L. H. Lee et al., F. Vahid et al.)
  - Buffer: width scales with the issue width (#FUs)
  - Interconnect: long
  - LC: simple addressing (counter based)
• Compressed L0 buffer (ref: TI execute-packet fetch mechanism)
  - Buffer: high storage density (no NOPs); width scales with the issue width (#FUs); overhead in the decompressor/dispatch
  - Interconnect: still centralized, long lines
  - LC: simple addressing (counter based)
Current Organizations for L0 Buffers (continued)
• Partitioned L0 buffer (ref: sub-banking)
  - Buffer: smaller memories
  - Interconnect: still long
  - LC: simple addressing (counter based); all banks must be accessed simultaneously, even if some of the FUs are not active
• Sub-banked / partitioned L0 buffer with compression (ref: T. Conte et al. [TINKER])
  - Buffer: smaller memories, overhead in the re-organizer
  - Interconnect: still centralized
  - LC: complex addressing (needs expensive tags)
  - No correlation between the partitioning and the FUs
Solution: Distributed Instruction Buffer Organization
• A balance of energy consumption between buffers, interconnect and local controllers is needed
• Buffers: sub-banked / partitioned in correlation with FU activation
• Interconnect: localized (limited connectivity between FUs and buffers)
• Buffer control (ATC: Address Translation and Control; IROC: Instruction Registers Operation and Control)
  - Stores instructions in each partition
  - Fetches instructions during loop execution
  - Regulates the accesses to each partition
• Each group of FUs, together with its buffer partition and local controller, forms an instruction cluster
Distributed L0 Buffer Operation
• Similar to conventional L0 buffer operation
• Initiation: special instruction LBON <offset>
• Filling: pre-fetching instructions from <start> to <end>
• Termination: when the program flow jumps to an address outside the <start> to <end> range
• Operation sequence: normal operation (fetch from L1 to the datapath) → initiation → filling → execution from the distributed L0 buffers → termination (back to L1); see the controller sketch below
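The following C sketch illustrates the initiation/filling/execution/termination sequence described above. It is a minimal model, not the actual controller: the state names, the lb_ctrl_t structure and the way LBON is signalled (is_lbon flag plus offset) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller modes mirroring the slide: normal L1 fetch,
 * filling the L0 partitions, and executing from the L0 buffers.         */
typedef enum { MODE_NORMAL, MODE_FILLING, MODE_L0_EXEC } lb_mode_t;

typedef struct {
    lb_mode_t mode;
    uint32_t  start_addr;   /* first instruction mapped to the L0 buffers */
    uint32_t  end_addr;     /* last instruction mapped (start + offset)   */
} lb_ctrl_t;

/* Called once per fetched instruction; 'is_lbon' marks the special
 * LBON <offset> instruction, 'pc' is the current fetch address.         */
static void lb_step(lb_ctrl_t *c, uint32_t pc, bool is_lbon, uint32_t offset)
{
    switch (c->mode) {
    case MODE_NORMAL:
        if (is_lbon) {                       /* initiation               */
            c->start_addr = pc + 1;
            c->end_addr   = pc + offset;
            c->mode       = MODE_FILLING;
        }
        break;
    case MODE_FILLING:                       /* copy <start>..<end> from L1 */
        if (pc >= c->end_addr)
            c->mode = MODE_L0_EXEC;
        break;
    case MODE_L0_EXEC:                       /* terminate on leaving range */
        if (pc < c->start_addr || pc > c->end_addr)
            c->mode = MODE_NORMAL;
        break;
    }
}

int main(void)
{
    lb_ctrl_t c = { MODE_NORMAL, 0, 0 };
    /* Fake fetch trace: LBON at address 100 maps the next 5 instructions. */
    lb_step(&c, 100, true, 5);   /* -> MODE_FILLING                        */
    lb_step(&c, 105, false, 0);  /* reached <end>, -> MODE_L0_EXEC         */
    lb_step(&c, 101, false, 0);  /* loop back-edge, stays in L0 execution  */
    lb_step(&c, 200, false, 0);  /* jump outside the range, -> MODE_NORMAL */
    return c.mode == MODE_NORMAL ? 0 : 1;
}
```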
The Buffer Operation: An Illustration
• Source: LBON <offset>; for (..) { ... if (..) {...} else {...} ... }
• VLIW schedule (slots: FU1, FU2, FU3, branch):
  S: OP11  OP21  OP31  NOP
     NOP   OP22  OP32  BNZ 'x'
     OP12  NOP   NOP   BR 'y'
  X: OP13  NOP   OP33  NOP      (if block)
  Y: OP14  OP23  NOP   BNZ 's'  (else block)
The Buffer Operation: An Illustration (continued)
• Each partition stores only the operations of its own FU; NOPs are not stored
  - FU1 partition: OP11, OP12, OP13, OP14
  - FU2 partition: OP21, OP22, OP23
  - FU3 partition: OP31, OP32, OP33
  - Branch partition: BNZ 'x', BR 'y', BNZ 's'
• The local controller (IROC) keeps START_ADDR, END_ADDR and per-PC IR_USE bits; the loop PC (NEW_PC) is translated into a local index in each partition (see the sketch below)
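A minimal C sketch of the per-partition fetch just described. The data layout (an ir_use bit vector per partition plus a densely packed operation array) and the prefix-count address translation are assumptions chosen to illustrate the idea; the toy contents only loosely follow the slide.

```c
#include <stdint.h>
#include <stdio.h>

#define LOOP_LEN 4   /* loop instructions, indexed by the shared loop PC */
#define N_PART   3   /* FU partitions in this toy example                */

/* One buffer partition: only non-NOP operations are stored, and a       */
/* per-loop-PC use bit says whether this partition issues anything.      */
typedef struct {
    uint8_t     ir_use[LOOP_LEN];   /* 1 = this FU is active at that PC  */
    const char *ops[LOOP_LEN];      /* densely packed operations         */
} partition_t;

/* Translate the shared loop PC into a local index: count how many times */
/* this partition was active before the current PC (prefix popcount).    */
static int local_index(const partition_t *p, int pc)
{
    int idx = 0;
    for (int i = 0; i < pc; i++)
        idx += p->ir_use[i];
    return idx;
}

int main(void)
{
    /* Toy contents: FU1 is active every cycle, FU2 and FU3 only on some */
    /* cycles, so their NOP slots never occupy buffer entries.           */
    partition_t part[N_PART] = {
        { {1, 1, 1, 1}, {"OP11", "OP12", "OP13", "OP14"} },
        { {1, 1, 0, 1}, {"OP21", "OP22", "OP23"} },
        { {1, 1, 1, 0}, {"OP31", "OP32", "OP33"} },
    };

    for (int pc = 0; pc < LOOP_LEN; pc++) {
        printf("pc=%d:", pc);
        for (int f = 0; f < N_PART; f++)
            printf("  FU%d=%s", f + 1,
                   part[f].ir_use[pc] ? part[f].ops[local_index(&part[f], pc)]
                                      : "NOP");
        printf("\n");
    }
    return 0;
}
```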
Energy Trade-Offs
• Energy = Σ_{i=1..#partitions} (E_buffer,i + E_LC,i + E_interconnect,i)
• Trade-off: more partitions mean smaller, cheaper buffer accesses, but more local controller and interconnect overhead
[Plot: normalized energy vs. #partitions (1 .. #FUs), showing the E_buffer, E_LC and E_interconnect components against the centralized baseline]
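A small sketch of the summation above. The per-partition cost fields are placeholders; in the actual flow they would come from memory, controller and interconnect energy models, not from hard-coded numbers.

```c
#include <stddef.h>
#include <stdio.h>

/* Per-partition energy contributions (placeholder values; real numbers  */
/* would come from the buffer, controller and interconnect models).      */
typedef struct {
    double e_buffer;        /* energy of the buffer accesses             */
    double e_lc;            /* energy of the local controller            */
    double e_interconnect;  /* energy of the wires to its FUs            */
} partition_energy_t;

/* Energy = sum over i = 1..#partitions of (E_buffer,i + E_LC,i + E_ic,i) */
static double total_l0_energy(const partition_energy_t *p, size_t n_partitions)
{
    double e = 0.0;
    for (size_t i = 0; i < n_partitions; i++)
        e += p[i].e_buffer + p[i].e_lc + p[i].e_interconnect;
    return e;
}

int main(void)
{
    /* Made-up per-partition numbers, only to show the summation. */
    partition_energy_t parts[3] = {
        { 1.0, 0.2, 0.3 }, { 0.8, 0.2, 0.2 }, { 0.5, 0.1, 0.2 }
    };
    printf("total L0 energy = %.2f\n", total_l0_energy(parts, 3));
    return 0;
}
```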
Profile-Based Clustering
• Instruction cluster: a group of functional units with a separate local controller and an instruction buffer partition
• Inputs: dynamic trace (FU activation per cycle during loop execution), static trace (loops mapped to L0), energy models (register-file based)
• Outputs: FU grouping into instruction clusters; width and depth of the instruction buffer in each partition
• Formulation:
  Minimize Energy(clust, dynamic profile, static profile)
  subject to Σ_{i=1..max_clusters} clust(i, j) = 1 for every FU j,
  where clust(i, j) = 1 if FU j is assigned to cluster i, and 0 otherwise
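A toy exhaustive-search sketch of this formulation: every assignment of FUs to clusters (each FU in exactly one cluster, as the constraint requires) is enumerated and the cheapest one is kept. The energy_of() function here is a made-up stand-in; the real cost would be evaluated from the activation traces and the buffer/controller energy models.

```c
#include <stdio.h>

#define N_FU         5    /* functional units in the datapath            */
#define MAX_CLUSTERS 3    /* upper bound on the number of clusters       */

/* Stand-in energy model: penalize grouping FUs with very different      */
/* activity levels (the activity numbers are invented for illustration). */
static double energy_of(const int clust[N_FU])
{
    static const double activity[N_FU] = { 0.9, 0.8, 0.3, 0.2, 0.1 };
    double cost = 0.0;
    for (int c = 0; c < MAX_CLUSTERS; c++) {
        double lo = 1.0, hi = 0.0;
        for (int f = 0; f < N_FU; f++)
            if (clust[f] == c) {
                if (activity[f] < lo) lo = activity[f];
                if (activity[f] > hi) hi = activity[f];
            }
        if (hi > lo) cost += hi - lo;
    }
    return cost;
}

/* Enumerate every FU-to-cluster assignment and remember the best one.   */
static void search(int fu, int clust[N_FU], double *best, int best_clust[N_FU])
{
    if (fu == N_FU) {
        double e = energy_of(clust);
        if (e < *best) {
            *best = e;
            for (int i = 0; i < N_FU; i++) best_clust[i] = clust[i];
        }
        return;
    }
    for (int c = 0; c < MAX_CLUSTERS; c++) {
        clust[fu] = c;
        search(fu + 1, clust, best, best_clust);
    }
}

int main(void)
{
    int clust[N_FU], best_clust[N_FU];
    double best = 1e30;
    search(0, clust, &best, best_clust);
    printf("best cost %.3f, assignment:", best);
    for (int i = 0; i < N_FU; i++) printf(" FU%d->c%d", i, best_clust[i]);
    printf("\n");
    return 0;
}
```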
Results
• Energy = Σ_{i=1..#partitions} (E_buffer,i + E_LC,i)
• Assumptions
  - Only the buffers and the controller are modeled (no interconnect as yet)
  - #FUs in the datapath = 10
  - Fixed schedule (activation trace), generated using Trimaran 2.0
[Plot: normalized energy vs. #partitions]
In Comparison With Other Schemes (results shown for ADPCM)
• Uncompressed: centralized L0 buffer
• Compressed: centralized L0 buffer, 2 additional registers for VL decoding
• Partitioned (no control): 2 partitions
• Clustered (width only): 3 partitions
• Clustered (width and depth): 2 partitions
[Bar chart: energy of the uncompressed, compressed, partitioned (sub-banked), clustered (width only) and clustered (width and depth) schemes]
Fully Distributed Instruction Memory Hierarchy
[Figure: main memory (off-chip) feeding multiple on-chip L1 caches (L1 clusters); each L1 cache feeds one or more L0 buffer partitions (L0 clusters), each serving its own group of FUs]
Overview (recap): Context • Motivation for L0 buffer organization and status • Distributed L0 buffer organization • Instruction memory exploration (software and compiler transformations) • Conclusions
Exploration Methodology: What We Have
• Flow: application → software transformations → compiler (scheduling) → clustering tool → instruction clusters, using the energy models
• Pareto curve generation (energy vs. delay)
  - One extreme optimized for performance: maximum cluster activity
  - One extreme optimized for energy: minimal cluster activity
  - For choosing the operating point at run-time
  - Enables the designer to assess the trade-off between energy and performance
Exploration Methodology: What We Want to Achieve
• Flow: application → software transformations → compiler (scheduling and clustering combined) → instruction clusters + schedule, using the energy models
• Pareto curve generation (energy vs. delay), as before: for choosing the operating point at run-time and for assessing the energy/performance trade-off; a small Pareto-selection sketch follows
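A sketch of how Pareto-optimal operating points could be selected, under the assumption that each candidate clustering/schedule combination yields one (energy, delay) pair. The numbers are invented; the real pairs would come out of the exploration flow above.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct { double energy, delay; } point_t;

/* b dominates a if b is at least as good in both energy and delay       */
/* and strictly better in one of them.                                   */
static bool dominated(point_t a, point_t b)
{
    return b.energy <= a.energy && b.delay <= a.delay &&
           (b.energy < a.energy || b.delay < a.delay);
}

int main(void)
{
    /* Made-up (energy, delay) results of different clustering/schedule  */
    /* combinations, normalized to the performance-optimized point.      */
    point_t pts[] = { {1.00, 1.00}, {0.70, 1.20}, {0.55, 1.45},
                      {0.80, 1.10}, {0.60, 1.60} };
    int n = sizeof pts / sizeof pts[0];

    printf("Pareto-optimal operating points:\n");
    for (int i = 0; i < n; i++) {
        bool keep = true;
        for (int j = 0; j < n && keep; j++)
            if (j != i && dominated(pts[i], pts[j]))
                keep = false;
        if (keep)
            printf("  energy=%.2f delay=%.2f\n", pts[i].energy, pts[i].delay);
    }
    return 0;
}
```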
Compiler Scheduling
• Compiler scheduling changes the functional unit activity, and hence the clustering result, and hence energy and performance
• Example 1: rescheduling the same operations so that only 2 instead of all 3 clusters need to be active gives an energy reduction without performance loss
• Example 2: rescheduling so that the 1st cluster needs 2 activations and the 2nd and 3rd clusters only 1 each (instead of 2 activations of all 3 clusters) gives an energy reduction at the expense of a performance loss
• See the activation-counting sketch below
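The sketch below counts, for two alternative schedules of the same operations, how often each instruction cluster has to wake up. The cluster-to-slot mapping and the schedules are invented for illustration; the point is only that scheduling decides which clusters are active.

```c
#include <stdio.h>

#define N_CLUSTERS 3
#define N_CYCLES   2
#define SLOTS      3   /* issue slots, one per cluster in this toy VLIW  */

/* 1 = the slot holds a real operation in that cycle, 0 = NOP.           */
typedef int schedule_t[N_CYCLES][SLOTS];

static void count_activations(const char *name, schedule_t s)
{
    int total = 0;
    printf("%s:", name);
    for (int c = 0; c < N_CLUSTERS; c++) {
        int act = 0;
        for (int cyc = 0; cyc < N_CYCLES; cyc++)
            act += s[cyc][c];        /* cluster wakes up if its slot is used */
        printf(" cluster%d=%d", c, act);
        total += act;
    }
    printf("  (total activations %d)\n", total);
}

int main(void)
{
    /* Same four operations, two schedules: the first spreads them over  */
    /* all clusters, the second packs them so one cluster stays idle.    */
    schedule_t spread = { {1, 1, 1}, {1, 0, 0} };
    schedule_t packed = { {1, 1, 0}, {1, 1, 0} };
    count_activations("spread over 3 clusters", spread);
    count_activations("packed into 2 clusters", packed);
    return 0;
}
```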
Software Transformations
• High-level code transformations can also change the clustering result, and hence energy and performance
• Loop transformations: loop splitting, loop merging, loop peeling (for nested loops), loop collapsing (nested loops), code movement across loops, etc.
• Example: loop splitting turns one loop into two loops, each with a narrower FU activity (a small sketch follows)
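A small C before/after example of loop splitting, the first transformation listed above. The statement mix is made up; the intended point is that after splitting, each loop body exercises a narrower set of FUs, so each loop can be served by a smaller instruction cluster configuration.

```c
#define N 256

/* Before: one loop whose body needs both the multiplier and the adder / */
/* memory FUs active every iteration, so all of their clusters stay on.  */
void before_split(const int *restrict a, const int *restrict b,
                  int *restrict c, int *restrict sum)
{
    for (int i = 0; i < N; i++) {
        c[i]  = a[i] * b[i];   /* multiply FU           */
        *sum += a[i];          /* add / memory-port FU  */
    }
}

/* After loop splitting: each loop activates a narrower set of FUs, so   */
/* each loop body can run from a smaller instruction cluster.            */
void after_split(const int *restrict a, const int *restrict b,
                 int *restrict c, int *restrict sum)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];
    for (int i = 0; i < N; i++)
        *sum += a[i];
}
```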
Overview (recap): Context • Motivation for L0 buffer organization and status • Distributed L0 buffer organization • Instruction memory exploration (software and compiler transformations) • Conclusions
Conclusions
• L0 buffer organization
  - Multimedia applications have high locality in small program segments
  - An additional small L0 buffer should be used
  - Current options for the L0 buffer are still not energy efficient
  - A distributed L0 buffer organization should be sought
  - But the clustering/partitioning should be application specific
• L1 cache organization: distributed (?)
• Instruction memory exploration
  - Software transformations and compiler scheduling can change the clustering results
  - An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)