
Ten Hardware Features That Affect Optimization
COMP 512, Rice University, Houston, Texas, Fall 2003

Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Hardware Features Affect Optimization
• Target machine defines cost of each operation
• Target machine defines set of available resources
• Target machine may provide unusual opportunities
  > Load multiple, predication, branches in delay slots (branch to register), …

Compiler writers/designers must understand hardware features
• Make good use of features that help
• Avoid downside impact of features that hurt

Ten Hardware Features That Affect Optimization
The list for today’s lecture:
1. Register windows
2. Partitioned register sets
3. Itanium’s rolling registers
4. x86 floating-point register stack
5. Predicated execution
6. Autoincrement & autodecrement
7. On-chip local memory
8. Hints to hardware
9. Branch-delay slots
10. Software-controlled processor speed

Register Windows
Architectural response to procedure call save/restore overhead
• Use hardware renaming to avoid most saves & restores at a call
• Partition register names into sets
  > Set shared with caller
  > Local set, maybe global set
  > Set shared with callee
• Manipulate the map at a call so that the caller’s output set becomes the callee’s input set
  > Intrinsic effect of call/return or separate operations
• Hardware or software mechanism to handle stack overflow

Register Windows: SPARC
SPARC
• 32 GPRs visible at any time
  > r0 to r7: global (visible to all)
  > r8 to r15: shared w/callee (outs)
  > r16 to r23: local
  > r24 to r31: shared w/caller (ins)
• Save & Restore operations shift the window
• No window on floats
• 40 to 520 physical registers
  > 520 physical registers is a lot; 40 physical registers is not

Using Register Windows
• Caller passes args in r8 to r15; callee sees them in r24 to r31
• Global set visible to all
• Can use r16 to r31 arbitrarily
• Can use r8 to r15 as scratch between calls
• Overflow handled by trap
  > Most save-restore activity now automated in overflow code
• Faster for non-recursive code
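To make the mapping concrete, here is a minimal Python sketch of the renaming idea: the caller’s out registers (r8 to r15) become the callee’s in registers (r24 to r31) without any copying. The window count and the simple modular indexing are assumptions for illustration, not SPARC’s exact implementation; overflow trapping is ignored.

    NWINDOWS = 8                       # assumed implementation: 8 windows
    PHYS = [0] * (8 + 16 * NWINDOWS)   # 8 globals + 16 new registers per window

    def phys_index(w, r):
        """Map logical register r (0-31) in window w to a physical index."""
        if r < 8:                          # r0-r7: globals, shared by everyone
            return r
        if r < 16:                         # r8-r15: outs, owned by window w
            slot = 16 * w + (r - 8)
        elif r < 24:                       # r16-r23: locals, owned by window w
            slot = 16 * w + 8 + (r - 16)
        else:                              # r24-r31: ins = the caller's outs
            slot = 16 * (w - 1) + (r - 24)
        return 8 + slot % (16 * NWINDOWS)  # wrap around; overflow trap ignored

    w = 1                           # caller running in window 1
    PHYS[phys_index(w, 8)] = 42     # caller writes an argument into its r8
    w += 1                          # 'save' at the call opens a new window
    print(PHYS[phys_index(w, 24)])  # callee reads the same value as r24 -> 42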

Register Windows: Itanium
• 32 “global” registers
• Variable size window (0-96), GPRs only
  > Starts with r32
• Background engine performs fill and spill operations on register stack overflows
  > Stall on return when fill is needed & incomplete
• Callee inherits window of same size as caller
  > Operation lets callee set window size & local-out line
• ISA includes alloc, flushrs, loadrs, & cover operations

Partitioned Register Sets
• Number of functional units keeps rising
• At some point, the register-FU MUX becomes too deep & slow
• One response is to partition the register set
  > Multiple register files, each with a cluster of FUs
  > Inter-cluster transfer mechanism, with limited bandwidth
  > Fast access to local register file
(Diagram: two register files, each feeding a cluster of FUs FU0 through FU7, connected by inter-cluster data paths, as in the TMS320C6x)

PRS: Cluster Assignment & Scheduling
• Compiler must place each operation & ensure operand availability
  > May necessitate inter-cluster copy operations
• Adds another complex problem to the back end

Bottom-up Greedy (BUG) algorithm [Ellis 81]
• Separate cluster assignment phase before scheduling
• Inserted all necessary data movement before scheduling

Unified Assign and Schedule (UAS) [Ozer et al. 98]
• Moved assignment into inner loop of backward list scheduler
• Produced better results than bottom-up greedy approach

Commercial practice
• Ad-hoc techniques based on coloring as prelude to scheduling
• Poor utilization of off-critical-path clusters
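A minimal sketch of the UAS idea, with cluster assignment inside the scheduler’s inner loop. The machine model is an assumption for illustration: two clusters, one issue slot per cluster per cycle, unit latencies, and inter-cluster copies charged as one extra cycle of latency rather than materialized as operations. The sketch also schedules forward, in the spirit of the variants on the next slide, rather than backward as in the original UAS paper.

    def uas_schedule(ops, deps, n_clusters=2):
        """ops: list of op names; deps: {op: [producers of its operands]}."""
        placed = {}                        # op -> (cycle, cluster)
        schedule = []                      # (cycle, cluster, op)
        ready = [o for o in ops if not deps[o]]
        cycle = 0
        while ready or len(placed) < len(ops):
            used = set()                   # clusters already issued to this cycle
            for op in list(ready):
                for c in range(n_clusters):
                    if c in used:
                        continue
                    # operands produced on another cluster cost one extra cycle,
                    # standing in for an inter-cluster copy operation
                    latest = 0
                    for p in deps[op]:
                        pc, pcl = placed[p]
                        latest = max(latest, pc + 1 + (1 if pcl != c else 0))
                    if latest <= cycle:
                        placed[op] = (cycle, c)
                        schedule.append((cycle, c, op))
                        used.add(c)
                        ready.remove(op)
                        break
            cycle += 1
            ready += [o for o in ops
                      if o not in placed and o not in ready
                      and all(p in placed for p in deps[o])]
        return schedule

    ops = ["a", "b", "c", "d"]
    deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
    for slot in uas_schedule(ops, deps):
        print(slot)        # e.g. (0, 0, 'a'), (0, 1, 'b'), (2, 0, 'c'), ...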

Cluster Assignment & Scheduling
Jingsong He’s work (MS Thesis)
• Follow pattern of UAS & move assignment into inner loop
• Use forward list scheduler
• Search backward for slots to insert inter-cluster copies
  > Multiplies the complexity by N; limit the search to the last 10 or 20 cycles if this worries you
• Use direct reference as last resort
• Two versions
  > TDF considers clusters in a fixed order
  > TDC considers clusters in order by operand count
• Both TDF & TDC outperform BUG & UAS
  > Measured by execution cycles, not some static count

Itanium’s Rolling Registers
Support for Software Pipelining
• Lam suggested a combination of Modulo Variable Expansion and unrolling to straighten out the flow of values
• Itanium supports a rolling-register set
  > Fixed size portion of floating-point & predicate register sets: PR32 to PR63 and FR32 to FR127
  > Code sets size of GPR rolling set (above GR32)
• Code uses adjacent registers for the same name in successive iterations of the pipelined loop
• Loop-oriented branches adjust the RRB (rotating register base)
  > rx+1 becomes rx after br.ctop or br.wtop
  > Other loop counting features simplify epilogue code
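A small sketch of the renaming arithmetic, assuming an Itanium-like scheme in which the rotating GPR region starts at r32 and the loop-closing branch decrements the rotating register base, so the value one pipeline stage writes as r32 is read by the next stage as r33. The region size and the direction of the update are simplifications for illustration.

    ROT_BASE = 32                      # rotating GPRs start at r32
    ROT_SIZE = 8                       # assumed size of the rotating region
    PHYS = [None] * ROT_SIZE           # the physical rotating registers
    rrb = 0                            # rotating register base

    def phys(logical):
        # Rename a rotating logical register (r32 and up) to a physical slot.
        return (logical - ROT_BASE + rrb) % ROT_SIZE

    def br_ctop():
        # Loop-closing branch: rotate by bumping the base, so the register a
        # stage wrote as r32 is read by the next stage under the name r33.
        global rrb
        rrb = (rrb - 1) % ROT_SIZE

    PHYS[phys(32)] = "result of iteration i"   # iteration i writes r32
    br_ctop()                                  # loop branch rotates the names
    print(PHYS[phys(33)])                      # iteration i+1 reads it as r33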

Register Stack: x86
Floating-point registers are organized as a rotating stack
• 8 FP registers
• ST[0] refers to top, ST[7] refers to bottom
• Memory operations always use ST[0]

Computational model differs from ILOC-like IRs
• Places a premium on code shape (RPN)

Generating code is well understood
• Infix to postfix translation is a postorder walk on the expression tree
• Stack optimization was studied in the 1970s and 1980s
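A short sketch of that postorder walk: emit both operands, then the operator, which is exactly the RPN shape the register stack wants. The x87-style mnemonics and operand forms here are illustrative, not a complete code generator.

    def gen_stack_code(node, out):
        """node is a leaf (a variable name) or a tuple (op, left, right)."""
        if isinstance(node, str):             # leaf: push the value
            out.append(f"fld   {node}")
            return
        op, left, right = node
        gen_stack_code(left, out)             # postorder: both operands first,
        gen_stack_code(right, out)
        mnemonic = {"+": "faddp", "-": "fsubp", "*": "fmulp", "/": "fdivp"}[op]
        out.append(f"{mnemonic} st(1), st(0)")  # then the operator, which pops
                                                # the top two entries and pushes
                                                # the result

    code = []
    gen_stack_code(("*", ("+", "a", "b"), "c"), code)   # (a + b) * c
    print("\n".join(code))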

Register Stack
The stack model complicates post-compilation optimization
• Translation from explicit to implicit names loses information
• Implicit names are inherently ambiguous
  > ST[i] can refer to FR0, FR1, FR2, …, FR7
• Simple translation from stack to infix code retains ambiguity

Das Gupta built SSA from x86 assembly in his Vizer system
• Model push and pop with a series of register copy operations
• Creates (truly) ugly IR, but captures the effect
• Allows analysis to build accurate SSA and use it
• Copy folding eliminates most of the 7x “extra” copy operations
• Reconstruct stack code on translation out of SSA via treewalk
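A sketch of the rewriting idea described above (not Das Gupta’s actual IR): expand each implicit stack operation into explicit copies over named registers FR0 to FR7 plus a compile-time depth counter, so later analysis sees ordinary register names and can build SSA. Copy folding would then remove most of the shift copies. The textual operation names are invented for illustration.

    def expand(stack_ops):
        # Rewrite implicit stack operations into explicit FRi register copies.
        out, depth = [], 0                       # depth = live stack entries
        for op, *args in stack_ops:
            if op == "push":                     # fld x
                for i in range(min(depth, 7), 0, -1):
                    out.append(f"FR{i} <- FR{i-1}")      # shift entries down
                out.append(f"FR0 <- load {args[0]}")     # new top of stack
                depth += 1
            elif op == "add":                    # faddp: combine top two, pop
                out.append("FR1 <- FR1 + FR0")
                for i in range(1, depth):                # shift entries back up
                    out.append(f"FR{i-1} <- FR{i}")
                depth -= 1
        return out

    for line in expand([("push", "a"), ("push", "b"), ("add",)]):
        print(line)      # ends with FR0 holding a + b, the new top of stack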

Predicated Execution
Pervasive predication changes code shape
• Can use if-conversion to avoid branches (EaC, § 7)
  > Need to evaluate tradeoffs (path lengths, density of executed ops)
• Subtler impacts abound
  > Branches become predicated jumps
  > Multiway branches: up to the number of FUs that can branch
  > Predicated prologue & epilogue in software pipelined loop
  > Run-time checks on ambiguous stores & loads
    - Test condition before loop & only load/store on overlap
  > More will emerge as clever students work with predication …

I do not believe that we have seen the killer app for predication
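A small sketch of if-conversion on a straight-line IR: a forward branch over a short block is removed by predicating the operations it used to skip. The tuple IR and opcode names are invented for illustration; they are not ILOC or any real ISA.

    def if_convert(block):
        """block: list of (opcode, args, pred); a forward branch looks like
        ('br_false', [cond, label], None), its target ('label', [name], None)."""
        out, i = [], 0
        while i < len(block):
            op, args, pred = block[i]
            if op == "br_false":
                cond, target = args
                # find the matching label later in this block
                j = next(k for k in range(i + 1, len(block))
                         if block[k][0] == "label" and block[k][1][0] == target)
                # predicate everything the branch used to skip
                for op2, args2, _ in block[i + 1:j]:
                    out.append((op2, args2, cond))   # executes only if cond holds
                i = j + 1                            # the branch and label vanish
            else:
                out.append((op, args, pred))
                i += 1
        return out

    before = [
        ("br_false", ["p1", "L1"], None),        # if p1 is false, skip the add
        ("add",      ["r1", "r2", "r3"], None),
        ("label",    ["L1"], None),
        ("store",    ["r3", "x"], None),
    ]
    for ins in if_convert(before):
        print(ins)       # the add comes out predicated on p1; no branch remains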

Autoincrement & Autodecrement
Many architectures support autoincrement (PDP-11, DSPs, IA-64)
• TMS320C25 relies heavily on indirect addressing
  > No address-immediate form
  > Code must perform explicit arithmetic or use autoincrement
• Data layout in memory has a significant impact on speed & size
  > Want offsets assigned so that successive references (to scalar variables) differ by an autoincrement or autodecrement
  > Folds address calculation into the addressing hardware
  > Eliminates instructions (space & time)
• Single-register problem modeled as a path covering problem (NP-complete)
• General problem (multiple index registers) is harder

This work may have application on Itanium (autoincrement)
See “Storage Assignment to Decrease Code Size,” S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, TOPLAS 18(3), May 1996, pages 235-253.
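A sketch of the greedy heuristic in the spirit of the Liao et al. paper cited above: build an access graph whose edge weights count adjacent references, cover it greedily with paths, and assign consecutive offsets along each path so those references can use autoincrement or autodecrement. Tie-breaking and frame-layout details are simplified.

    from collections import Counter, defaultdict

    def assign_offsets(access_seq):
        # 1. Access graph: edge weight = how often two variables are
        #    referenced back to back in the access sequence.
        weight = Counter()
        for a, b in zip(access_seq, access_seq[1:]):
            if a != b:
                weight[frozenset((a, b))] += 1

        # 2. Greedy path cover: take heavy edges first, keep every vertex at
        #    degree <= 2, and reject edges that would close a cycle.
        degree = Counter()
        parent = {v: v for v in set(access_seq)}      # union-find for cycles
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        adj = defaultdict(list)
        for edge, _ in weight.most_common():
            a, b = tuple(edge)
            if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
                adj[a].append(b); adj[b].append(a)
                degree[a] += 1; degree[b] += 1
                parent[find(a)] = find(b)

        # 3. Walk each path from an endpoint, handing out consecutive offsets,
        #    so adjacent accesses land in adjacent memory slots.
        offsets, next_off = {}, 0
        for v in sorted(set(access_seq)):
            if v in offsets or len(adj[v]) > 1:
                continue                          # start only at path endpoints
            prev, cur = None, v
            while cur is not None and cur not in offsets:
                offsets[cur] = next_off; next_off += 1
                succ = [w for w in adj[cur] if w != prev]
                prev, cur = cur, (succ[0] if succ else None)
        return offsets

    # c and d are often adjacent in the sequence, so they receive
    # neighboring offsets and those references can use autoincrement.
    print(assign_offsets(list("abcdacdbcd")))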

On-chip Local Memory
Many DSP chips have local memory rather than cache
• Local memory is not mapped or managed (as cache is)
• Takes less space & less power
• Programmer (or compiler) control of contents

How can the compiler manage this memory?
• Tile and copy for large arrays
  > Strip mine and interchange to create manageable data size
  > Copy in & copy out around inner loop(s)
• Spill memory?
  > Harvey showed that a couple of KB is enough
  > Interprocedural allocation problem
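A sketch of the tile-and-copy strategy: strip-mine the loop, copy each tile into the local memory before the compute loop, and copy results back afterward. The capacity constant and the Python list standing in for the scratchpad are assumptions for illustration.

    LOCAL_WORDS = 256                      # assumed capacity of local memory

    def scale_in_place(a, factor):
        local = [0] * LOCAL_WORDS          # software-managed scratchpad
        for start in range(0, len(a), LOCAL_WORDS):      # strip-mined outer loop
            n = min(LOCAL_WORDS, len(a) - start)
            local[:n] = a[start:start + n]               # copy in
            for i in range(n):                           # inner loop touches
                local[i] *= factor                       # only the local memory
            a[start:start + n] = local[:n]               # copy out

    data = list(range(1000))
    scale_in_place(data, 2)
    print(data[:5], data[-1])              # [0, 2, 4, 6, 8] 1998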

Hints to the Hardware
Intel, in particular, likes this mechanism for compiler-given advice

Itanium has
• Hints to the register-stack engine
  > Enforced lazy, eager, load intensive, store intensive
• Hints on loads, stores, & prefetches to help cache management
  > Temporal, non-temporal L1 (NT-L1), NT-L2, & NT-All
• Hints on branch behavior
  > Branch predict (brp) operations
  > Default predictions in absence of history info
  > Hints that govern prediction
    - Static not taken, static taken (no prediction resources)
    - Dynamic not taken, dynamic taken (use dynamic history)
  > Hints about amount of code to prefetch (few vs many lines)
  > Hints to deallocate prediction resources (keep vs free)

Branch Delay Slots
Many processors expose branch delay slots to scheduling
• SPARC has 1 slot, TMS320C6x has 5
• Bit in branch indicates whether next op is in the delay slot
• Filling delay slots eliminates wasted cycles

Branches in branch delay slots create complex control-flow
• Both SPARC & C6x allow this code
  > SPARC manual actively encourages its use
• Aggressive use can create complex code that is hard to decipher
  > Recall example from TI compiler …
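A toy sketch of delay-slot filling for a single-slot machine such as the SPARC: scan backward from the branch for an operation the branch does not depend on and that can legally move past the intervening operations; otherwise leave a nop. The (opcode, defs, uses) tuples are invented for illustration.

    def fill_delay_slot(block):
        """block: list of (opcode, defs, uses) tuples, ending with a branch."""
        branch_uses = set(block[-1][2])
        for i in range(len(block) - 2, -1, -1):
            defs, uses = set(block[i][1]), set(block[i][2])
            between = block[i + 1:-1]
            later_defs = set().union(*[set(o[1]) for o in between])
            later_uses = set().union(*[set(o[2]) for o in between])
            if defs & branch_uses:
                continue      # the branch needs this result computed first
            if defs & later_uses or uses & later_defs or defs & later_defs:
                continue      # moving past 'between' would break a dependence
            # moving an op from above the branch is safe: the delay slot
            # executes whether or not the branch is taken
            return block[:i] + block[i + 1:] + [block[i]]
        return block + [("nop", [], [])]          # nothing safe: emit a nop

    code = [
        ("mov",  ["r5"], ["y"]),     # independent of the compare and branch
        ("load", ["r1"], ["x"]),
        ("cmp",  ["cc"], ["r1"]),
        ("br",   [],     ["cc"]),
    ]
    for ins in fill_delay_slot(code):
        print(ins)                   # the mov ends up in the delay slot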

Unravelling Control-flow (TI TMS320C6x)

    ; stuff four branches into the pipe
            B     .S1   LOOP          ; branch to loop
            B     .S1   LOOP          ; branch to loop
            B     .S1   LOOP          ; branch to loop
      ||    ZERO  .L1   A2            ; zero A side product
      ||    ZERO  .L2   B2            ; zero B side product

    ; set up the loop
            B     .S1   LOOP          ; branch to loop
      ||    ZERO  .L1   A3            ; zero A side accumulator
      ||    ZERO  .L2   B3            ; zero B side accumulator
      ||    ZERO  .D1   A1            ; zero A side load value
      ||    ZERO  .D2   B1            ; zero B side load value

    ; single-cycle loop, ending with another branch
    LOOP:   LDW   .D1   *A4++, A1     ; load a[i] & a[i+1]
      ||    LDW   .D2   *B4++, B1     ; load b[i] & b[i+1]
      ||    MPY   .M1X  A1, B1, A2    ; a[i] * b[i]
      ||    MPYH  .M2X  A1, B1, B2    ; a[i+1] * b[i+1]
      ||    ADD   .L1   A2, A3, A3    ; ca += a[i] * b[i]
      ||    ADD   .L2   B2, B3, B3    ; cb += a[i+1] * b[i+1]
      || [B0] SUB .S2   B0, 1, B0     ; decrement loop counter
      || [B0] B   .S1   LOOP          ; branch to loop

            ADD   .L1X  A3, B3, A3    ; c = ca + cb

(From the Peephole Optimization lecture …)

Branch Delay Slots
Aggressive use of delay slots can create complex code that is hard to decipher
• Recall example from TI compiler …
  > Code is complex, loop structure is hidden, but it is fast

Software-adjustable Processor Speed
Important component of power-aware processors
Code can change the processor’s clock rate
• Slower execution requires less power (n² effect)
• Can have significant impact on battery life
• GnuEmacs needs 300K to 400K OPS

To generate code for this feature
• Compiler must estimate speed required for appropriate progress
  > Hard part is defining appropriate progress
• Compiler must insert code to change speed
  > May require brief delay to let processor stabilize
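A sketch of the code-generation side: estimate a region’s operation count, pick the slowest rate that still meets its deadline, and emit a speed change plus a stabilization delay before the region. The set_clock_rate call, the rate table, and the settling time are hypothetical assumptions, not a real processor interface.

    AVAILABLE_RATES_MHZ = [100, 200, 400, 600]     # assumed supported rates
    STABILIZE_US = 50                              # assumed settling time

    def pick_rate(estimated_ops, deadline_us):
        """Choose the slowest rate that still finishes estimated_ops in time,
        assuming roughly one operation per cycle (ops/us ~= MHz)."""
        for rate in AVAILABLE_RATES_MHZ:
            if estimated_ops / rate <= deadline_us - STABILIZE_US:
                return rate
        return AVAILABLE_RATES_MHZ[-1]             # fall back to full speed

    def emit_speed_change(estimated_ops, deadline_us):
        rate = pick_rate(estimated_ops, deadline_us)
        return [f"call set_clock_rate({rate})",    # hypothetical runtime call
                f"delay {STABILIZE_US}us"]         # let the processor stabilize

    print(emit_speed_change(estimated_ops=300_000, deadline_us=4_000))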