

  • Number of slides: 22

Tuning of Loop Cache Architectures to Programs in Embedded System Design

Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship.

Introduction
• Traditional core-based microprocessor architecture
• Opportunity to tune the microprocessor architecture to the program

Introduction
• I-cache
  – Size
  – Associativity
  – Replacement policy
• JPEG
  – Compression
• Buses
  – Width
  – Bus-invert / gray code
[Block diagram: processor with I$ and D$, memory, bridge, and JPEG, USB, and CCDPP blocks]

Introduction
• Memory access can consume 50% of an embedded microprocessor's system power
  – Caches tend to be power hungry
    • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
    • ARM920T: caches consume half of total power (Segars 01)

Introduction
[Block diagram: processor with I$ and D$, memory, bridge, and JPEG, USB, and CCDPP blocks]
• Advantageous to focus on the instruction fetching subsystem

Introduction
• Techniques to reduce instruction fetch power
  – Program compression
    • Compress only a subset of frequently used instructions (Benini 1999)
    • Compress procedures in a small cache (Kirovski 1997)
    • Lookup-table based (Lekatsas 2000)
  – Bus encoding
    • Increment (Benini 1997)
    • Bus-invert (Stan 1995)
    • Binary/gray code (Mehta 1996)

Introduction
• Techniques to reduce instruction fetch power (cont.)
  – Efficient cache design
    • Small buffers — victim, non-temporal, speculative, and penalty — to reduce miss rate (Bahar 1998)
    • Memory array partitioning and variation in cache sizes (Ko 1995)
  – Tiny caches
    • Filter cache (Kin/Gupta/Mangione-Smith 1997)
    • Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
    • Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

Cache Architectures – Filter Cache
• Small L0 direct-mapped cache
• Utilizes standard tag comparison and miss logic
• Has low dynamic power
  – Short internal bitlines
  – Close to the microprocessor
• Performance penalty of 21% due to high miss rate (Kin 1997)
[Diagram: processor → filter cache (L0) → L1 memory]
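The L0 lookup described above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the class name, sizes, and hit/miss counters are assumptions, but the direct-mapped index/tag check is the standard mechanism the slide refers to.

```python
# Illustrative sketch of a direct-mapped L0 "filter cache" lookup: a tiny
# cache checked before L1, using standard tag comparison and miss logic.
# Sizes and names are assumptions for the example, not from the paper.

class FilterCache:
    def __init__(self, num_lines=16, words_per_line=4):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.tags = [None] * num_lines      # one tag per line; None = invalid
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on an L0 hit, False on a miss (fetch from L1)."""
        line_addr = addr // self.words_per_line
        index = line_addr % self.num_lines  # direct-mapped: one candidate line
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.misses += 1                    # miss: fill the line from L1
        self.tags[index] = tag
        return False
```

A tight loop that fits in L0 incurs only compulsory misses on the first iteration and hits thereafter — which is why the filter cache saves fetch energy, while loops that thrash it cause the miss-rate penalty the slide mentions.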

Cache Architectures – Dynamically Loaded Loop Cache
• Small tagless loop cache
• Alternative location to fetch instructions
• Dynamically fills the loop cache
  – Triggered by a short backwards branch (sbb) instruction
• Flexible variation
  – Allows loops larger than the loop cache to be partially stored
Example loop:
  ...
  add r1, 2
  ...
  sbb -5
Iteration 1: detect sbb instruction
Iteration 2: fill loop cache
Iteration 3: fetch from loop cache
[Diagram: mux selects between L1 memory and the dynamic loop cache feeding the processor]
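The three-iteration protocol above can be sketched as a small state machine. The state names and method signatures here are assumptions for illustration; the detect/fill/fetch sequencing is what the slide describes.

```python
# Sketch of the dynamically loaded loop cache controller from this slide:
# iteration 1 of a loop detects the taken sbb, iteration 2 fills the loop
# cache, and from iteration 3 onward instructions are fetched from the
# loop cache instead of L1. Names and encodings are illustrative.

IDLE, FILL, FETCH = "idle", "fill", "fetch"

class DynamicLoopCache:
    def __init__(self):
        self.state = IDLE
        self.loop_start = None

    def on_sbb_taken(self, pc, offset):
        """Called each time a short backwards branch at `pc` is taken."""
        start = pc + offset                  # offset is negative
        if self.state == IDLE or start != self.loop_start:
            self.loop_start = start          # iteration 1: detect the loop
            self.state = FILL
        elif self.state == FILL:
            self.state = FETCH               # filled during iteration 2
```

A new loop (different sbb target) resets the controller, modeling the fact that the dynamic variant re-fills on every loop it encounters rather than retaining contents across loops.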

Cache Architectures – Dynamically Loaded Loop Cache (cont.)
• Limitations
  – Does not support loops with control-of-flow changes (cofs)
  – cofs terminate loop cache filling and fetching
  – cofs include commonly found if-then-else statements
Example loop:
  ...
  add r1, 2
  bne r1, r2, 3
  ...
  sbb -5
Iteration 1: detect sbb instruction
Iteration 2: fill loop cache, terminate at cof
Iteration 3: fill loop cache, terminate at cof

Cache Architectures – Preloaded Loop Cache
• Small tagless loop cache
• Alternative location to fetch instructions
• Loop cache filled at compile time and remains fixed
  – Supports loops with cofs
• Fetch triggered by short backwards branch
• Start-address (sa) variation
  – Fetch begins on the first iteration
Example loop:
  ...
  add r1, 2
  bne r1, r2, 3
  ...
  sbb -5
Iteration 1: detect sbb instruction
Iteration 2: check whether the loop is preloaded; if so, fetch from the loop cache
[Diagram: mux selects between L1 memory and the preloaded loop cache feeding the processor]
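Because the preloaded variant is filled offline, the runtime check reduces to a lookup against the loop address registers. A minimal sketch, with assumed names and structure:

```python
# Sketch of the preloaded loop cache from this slide: loops are chosen at
# compile time and stay fixed at runtime, so the controller only checks
# whether a triggering branch targets a preloaded loop — there is no
# runtime filling, and cof instructions inside the loop are fine.

class PreloadedLoopCache:
    def __init__(self, preloaded_loops):
        # preloaded_loops: {start_address: [instructions]}, selected
        # offline; the number of entries models the loop address
        # registers (LARs) in the candidate configurations
        self.loops = dict(preloaded_loops)

    def lookup(self, loop_start):
        """Return the cached loop body, or None to fall back to L1."""
        return self.loops.get(loop_start)
```

Compile-time selection is what makes tuning matter here: which loops to preload, and how many LARs to provide, depends on the program's loop profile.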

Traditional Design
• Traditional pre-fabricated IC
  – Typically optimized for the best average case
  – Intended to run well across a variety of programs
  – A benchmark suite is used to determine which configuration to use
• On average, what is the best tiny cache configuration?

Evaluation Framework – Candidate Cache Configurations

Type                                     Size            Number of loops / line size   Configuration
Original dynamically loaded loop cache   8–1024 entries  n/a                           1–8
Flexible dynamically loaded loop cache   8–1024 entries  n/a                           9–16
Preloaded loop cache (sa)                8–1024 entries  2–3 loop address registers    17–32
Preloaded loop cache (sbb)               8–1024 entries  2–6 loop address registers    33–72
Filter cache                             8–1024 bytes    line size of 8 to 64 bytes    73–106

Evaluation Framework – Motorola's Powerstone Benchmarks

Benchmark   Lines of C   # Instructions Executed   Description
adpcm       501          63891                     Voice encoding
bcnt        90           1938                      Bit manipulation
binary      67           816                       Binary insertion
blit        94           22845                     Graphics application
compress    943          138573                    Data compression program
crc         84           37650                     Cyclic redundancy check
des         745          122214                    Data Encryption Standard
engine      276          410607                    Engine controller
fir         173          16211                     FIR filtering
g3fax       606          1128023                   Group three fax decode
jpeg        540          4594721                   JPEG compression
summin      74           1909787                   Handwriting recognition
ucbqsort    209          219978                    U.C.B. quicksort
v42         553          2442551                   Modem encoding/decoding

Simplified Tool Chain
Program instruction trace → loop selector (preloaded) → lcsim → loop cache stats → loop cache power calculator (with technology info) → loop cache power

Best on Average
• Configuration 30
  – Preloaded loop cache (sa), 512 entries, 3 loop address registers
  – 73% instruction fetch energy savings
• Configuration 105
  – Filter cache, 1024 entries, line size 32 bytes
  – 73% instruction fetch energy savings
[Chart comparing original, flexible, preloaded (sa), preloaded (sbb), and filter configurations]

Core-Based Design
• Core-based design
  – Known application
  – Opportunity to tune the architecture
• Is it worth tuning the architecture to the application, or is the average case good enough?

Best on Average
• Both configurations perform well for some benchmarks, such as engine and summin
• However, both configurations perform below average for binary, v42, and others

Results – binary
• Config 30 yields 61% savings
• Config 105 yields 65% savings
• Config 31 (preloaded, 1024 entries, 2 LARs) yields 79% savings
[Chart comparing original, flexible, preloaded (sa), preloaded (sbb), and filter configurations]

Results – v42
• Config 30 yields 58% savings
• Config 105 yields 23% savings
• Config 67 (preloaded, 512 entries, 6 LARs) yields 68% savings
[Chart comparing original, flexible, preloaded (sa), preloaded (sbb), and filter configurations]

Results – Averages
• adpcm — best case: 68% (preloaded); Config 105: 25%; improvement: 43%
• blit — best case: 96% (flexible); Config 30: 87%; improvement: 9%
• v42 — best case: 68% (preloaded); Config 105: 23%; improvement: 45%
• jpeg — best case: 92% (filter); Config 30: 69%; improvement: 23%
• Average case — best case: 84%; Config 30: 73%; Config 105: 73%; improvement: 11%
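The "improvement" figures above are simply the gap between the best per-program (tuned) savings and the savings of the fixed average-case configuration. A quick check using the numbers from this slide:

```python
# Recomputing the per-benchmark improvement figures from this slide:
# improvement = best tuned savings - fixed average-case config savings.
# All percentage values are taken directly from the slide data.

slide_data = {
    # benchmark: (best_case_savings_%, fixed_config_savings_%)
    "adpcm": (68, 25),   # fixed config here is 105
    "blit":  (96, 87),   # fixed config here is 30
    "v42":   (68, 23),   # fixed config here is 105
    "jpeg":  (92, 69),   # fixed config here is 30
}

improvements = {b: best - fixed for b, (best, fixed) in slide_data.items()}
# e.g. adpcm: 68 - 25 = 43 percentage points of additional savings
```

The same subtraction on the averages (84% best vs. 73% fixed) gives the 11% average improvement the deck reports.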

Conclusion and Future Work
• Shown the benefits of tuning the tiny cache to a particular program
  – On average yields an additional 11% energy savings
  – Up to an additional 40% for some programs
• Environment is automated but requires several hours to find the best configuration
  – Current methodology is too slow
  – A faster method based on equations is described in upcoming ICCAD 2002 work