CSCE 430/830 Computer Architecture: Instruction-level parallelism, Loop Unrolling

Number of slides: 13

CSCE 430/830 Computer Architecture
Instruction-level parallelism: Loop Unrolling
Lecturer: Prof. Hong Jiang
Courtesy of Yifeng Zhu (U. Maine)
Fall, 2006
Portions of these slides are derived from: Dave Patterson © UCB
CSCE 430/830 ILP: Loop Unrolling

Running Example
• This code adds a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Assume the following latencies for all examples:

    Instruction producing result   Instruction using result   Execution in cycles   Latency in cycles
    FP ALU op                      Another FP ALU op          4                     3
    FP ALU op                      Store double               3                     2
    Load double                    FP ALU op                  1                     1
    Load double                    Store double               1                     0
    Integer op                     Integer op                 1                     0

FP Loop: Where are the Hazards?
• First translate into MIPS code:
  – To simplify, assume 8 is the lowest address

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

    Loop: L.D     F0, 0(R1)     ; F0 = vector element
          ADD.D   F4, F0, F2    ; add scalar from F2
          S.D     0(R1), F4     ; store result
          DSUBUI  R1, R1, 8     ; decrement pointer 8 B (DW)
          BNEZ    R1, Loop      ; branch R1 != zero
          NOP                   ; delayed branch slot

• Where are the stalls?

FP Loop Showing Stalls

    1 Loop: L.D     F0, 0(R1)     ; F0 = vector element
    2       stall
    3       ADD.D   F4, F0, F2    ; add scalar in F2
    4       stall
    5       stall
    6       S.D     0(R1), F4     ; store result
    7       DSUBUI  R1, R1, 8     ; decrement pointer 8 B (DW)
    8       BNEZ    R1, Loop      ; branch R1 != zero
    9       stall                 ; delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

• 9 clocks: can we rewrite the code to minimize stalls?
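The stall counts above can be reproduced with a small model. The following is a minimal sketch, assuming a single-issue, in-order pipeline in which an instruction waits only for its source registers; the type names, register numbering, and the rule that producer/consumer pairs missing from the table need no stall are assumptions made for this illustration, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes used in the latency table. */
typedef enum { LOAD, FP_ALU, STORE, INT_OP, BRANCH, NOP } Kind;

/* One instruction: class, destination register (NONE if absent),
   and up to two source registers (NONE if unused). */
typedef struct { Kind kind; int dst; int src[2]; } Instr;

/* Arbitrary register numbers for this sketch. */
enum { F0 = 0, F2 = 2, F4 = 4, R1 = 101, NONE = -1 };

/* Stall cycles needed between a producer and a dependent consumer.
   Assumption: pairs not listed in the table need 0 stalls. */
static int latency(Kind producer, Kind consumer) {
    if (producer == FP_ALU && consumer == FP_ALU) return 3;
    if (producer == FP_ALU && consumer == STORE)  return 2;
    if (producer == LOAD   && consumer == FP_ALU) return 1;
    return 0;
}

/* Returns the cycle (numbered from 1) in which the last instruction
   issues, i.e. the clocks taken by one pass through the sequence. */
int count_cycles(const Instr *code, size_t n) {
    int issue[64];
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;  /* in-order issue */
        for (int s = 0; s < 2; s++) {
            int reg = code[i].src[s];
            if (reg < 0) continue;
            /* Find the most recent earlier writer of this register. */
            for (size_t j = i; j-- > 0; ) {
                if (code[j].dst == reg) {
                    int ready = issue[j] + 1
                              + latency(code[j].kind, code[i].kind);
                    if (ready > cycle) cycle = ready;
                    break;
                }
            }
        }
        issue[i] = cycle;
    }
    return issue[n - 1];
}

/* The loop body in its original order. */
const Instr original_loop[6] = {
    { LOAD,   F0,   { R1,   NONE } },  /* L.D    F0, 0(R1)  */
    { FP_ALU, F4,   { F0,   F2   } },  /* ADD.D  F4, F0, F2 */
    { STORE,  NONE, { F4,   R1   } },  /* S.D    0(R1), F4  */
    { INT_OP, R1,   { R1,   NONE } },  /* DSUBUI R1, R1, 8  */
    { BRANCH, NONE, { R1,   NONE } },  /* BNEZ   R1, Loop   */
    { NOP,    NONE, { NONE, NONE } },  /* branch delay slot */
};

/* The same body with the store moved into the delay slot. */
const Instr scheduled_loop[6] = {
    { LOAD,   F0,   { R1,   NONE } },  /* L.D    F0, 0(R1)  */
    { FP_ALU, F4,   { F0,   F2   } },  /* ADD.D  F4, F0, F2 */
    { INT_OP, R1,   { R1,   NONE } },  /* DSUBUI R1, R1, 8  */
    { BRANCH, NONE, { R1,   NONE } },  /* BNEZ   R1, Loop   */
    { STORE,  NONE, { F4,   R1   } },  /* S.D    8(R1), F4  */
    { NOP,    NONE, { NONE, NONE } },  /* padding, not counted below */
};
```

Calling `count_cycles(original_loop, 6)` gives 9, and `count_cycles(scheduled_loop, 5)` gives 6 once the store fills the delay slot, in line with the clock counts quoted on the slides. The model ignores structural hazards and memory dependences.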

Revised FP Loop Minimizing Stalls

    1 Loop: L.D     F0, 0(R1)
    2       stall
    3       ADD.D   F4, F0, F2
    4       DSUBUI  R1, R1, 8
    5       BNEZ    R1, Loop      ; delayed branch
    6       S.D     8(R1), F4     ; altered when moved past DSUBUI

• Swap BNEZ and S.D by changing the address of S.D

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

• 6 clocks, but just 3 for execution and 3 for loop overhead. How can we make it faster?
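The change from 0(R1) to 8(R1) follows from simple address arithmetic: once the store sits after the DSUBUI, R1 is already 8 smaller, so the displacement must grow by 8 to reach the same element. A minimal sketch of that rule, with a hypothetical helper name:

```c
#include <assert.h>

/* New displacement for a memory access that is moved from before to after
   an instruction subtracting `decrement` from its base register.
   Old address = old_offset + base_old, and base_new = base_old - decrement,
   so the same address is (old_offset + decrement) + base_new. */
int offset_after_decrement(int old_offset, int decrement) {
    assert(decrement >= 0);
    return old_offset + decrement;
}
```

Here `offset_after_decrement(0, 8)` yields 8, the displacement used above. The same rule with a decrement of 32 turns a displacement of -24 into 8, which is the case that arises when the loop is unrolled four times.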

Unroll Loop Four Times (straightforward way)

    1  Loop: L.D     F0, 0(R1)
    2        ADD.D   F4, F0, F2
    3        S.D     0(R1), F4       ; drop DSUBUI & BNEZ
    4        L.D     F6, -8(R1)
    5        ADD.D   F8, F6, F2
    6        S.D     -8(R1), F8      ; drop DSUBUI & BNEZ
    7        L.D     F10, -16(R1)
    8        ADD.D   F12, F10, F2
    9        S.D     -16(R1), F12    ; drop DSUBUI & BNEZ
    10       L.D     F14, -24(R1)
    11       ADD.D   F16, F14, F2
    12       S.D     -24(R1), F16
    13       DSUBUI  R1, R1, #32     ; alter to 4*8
    14       BNEZ    R1, Loop
    15       NOP

• Each L.D is followed by a 1-cycle stall and each ADD.D by a 2-cycle stall
• 15 + 4 × (1 + 2) = 27 clock cycles, or 6.8 per iteration
• Assumes the iteration count is a multiple of 4 (R1 is a multiple of 32)
• Rewrite loop to minimize stalls?

Unrolled Loop Detail
• Do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – 1st executes (n mod k) times and has a body that is the original loop
  – 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  – For large values of n, most of the execution time will be spent in the unrolled loop
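At the source level, the pair of loops described above might look like the following sketch for the running example. The function name, the fixed unroll factor K = 4, and the convention that x[0] is unused (the example indexes from 1) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { K = 4 };  /* unroll factor, chosen for this sketch */

/* Adds s to x[1..n]; x[0] is left untouched, as in the running example. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    assert(x != NULL);
    size_t i = n;

    /* 1st loop: original body, executes (n mod K) times. */
    for (size_t r = n % K; r > 0; r--, i--)
        x[i] = x[i] + s;

    /* 2nd loop: unrolled body, executes (n / K) times.
       At this point i is a multiple of K, so i - 3 >= 1 whenever i > 0. */
    for (; i > 0; i -= K) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

Running the remainder loop first leaves an iteration count that is an exact multiple of K, so the unrolled loop needs no bounds checks inside its body.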

Unrolled Loop That Minimizes Stalls

    1  Loop: L.D     F0, 0(R1)
    2        L.D     F6, -8(R1)
    3        L.D     F10, -16(R1)
    4        L.D     F14, -24(R1)
    5        ADD.D   F4, F0, F2
    6        ADD.D   F8, F6, F2
    7        ADD.D   F12, F10, F2
    8        ADD.D   F16, F14, F2
    9        S.D     0(R1), F4
    10       S.D     -8(R1), F8
    11       S.D     -16(R1), F12
    12       DSUBUI  R1, R1, #32
    13       BNEZ    R1, Loop
    14       S.D     8(R1), F16      ; 8 - 32 = -24

• 14 clock cycles, or 3.5 per iteration
• What assumptions were made when the code was moved?
  – OK to move the store past DSUBUI even though it changes the register
  – OK to move loads before stores: do we get the right data?
  – When is it safe for the compiler to make such changes?

Compiler Perspectives on Code Movement
• Compiler is concerned about dependencies in the program
• Whether or not a dependence becomes a HW hazard depends on the pipeline
• Try to schedule to avoid hazards that cause performance losses
• (True) data dependencies (RAW if a hazard for HW):
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• If dependent, can't execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory ("memory disambiguation" problem):
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?

Where are the name dependencies?

    1  Loop: L.D     F0, 0(R1)
    2        ADD.D   F4, F0, F2
    3        S.D     0(R1), F4       ; drop DSUBUI & BNEZ
    4        L.D     F0, -8(R1)
    5        ADD.D   F4, F0, F2
    6        S.D     -8(R1), F4      ; drop DSUBUI & BNEZ
    7        L.D     F0, -16(R1)
    8        ADD.D   F4, F0, F2
    9        S.D     -16(R1), F4     ; drop DSUBUI & BNEZ
    10       L.D     F0, -24(R1)
    11       ADD.D   F4, F0, F2
    12       S.D     -24(R1), F4
    13       DSUBUI  R1, R1, #32     ; alter to 4*8
    14       BNEZ    R1, Loop
    15       NOP

• How can we remove them?

Where are the name dependencies?

    1  Loop: L.D     F0, 0(R1)
    2        ADD.D   F4, F0, F2
    3        S.D     0(R1), F4       ; drop DSUBUI & BNEZ
    4        L.D     F6, -8(R1)
    5        ADD.D   F8, F6, F2
    6        S.D     -8(R1), F8      ; drop DSUBUI & BNEZ
    7        L.D     F10, -16(R1)
    8        ADD.D   F12, F10, F2
    9        S.D     -16(R1), F12    ; drop DSUBUI & BNEZ
    10       L.D     F14, -24(R1)
    11       ADD.D   F16, F14, F2
    12       S.D     -24(R1), F16
    13       DSUBUI  R1, R1, #32     ; alter to 4*8
    14       BNEZ    R1, Loop
    15       NOP

• The original "register renaming"
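The effect of renaming can be pictured at the source level. In the sketch below, each copy of the body gets its own temporaries, named after the registers on the slide, so nothing but true data dependences constrains the order, and the loads, adds, and stores can be grouped. The function name is hypothetical, and a real C compiler is free to pick its own registers and schedule; this only illustrates the idea.

```c
#include <assert.h>
#include <stddef.h>

/* One unrolled step over x[i], x[i-1], x[i-2], x[i-3]; requires i >= 3.
   Distinct temporaries stand in for F0/F4, F6/F8, F10/F12, F14/F16. */
void unrolled_step_renamed(double *x, size_t i, double s) {
    assert(x != NULL && i >= 3);

    /* All four loads first. */
    double f0 = x[i], f6 = x[i - 1], f10 = x[i - 2], f14 = x[i - 3];

    /* Then all four adds; each depends only on its own load. */
    double f4 = f0 + s, f8 = f6 + s, f12 = f10 + s, f16 = f14 + s;

    /* Then all four stores. */
    x[i]     = f4;
    x[i - 1] = f8;
    x[i - 2] = f12;
    x[i - 3] = f16;
}
```

Had a single pair of temporaries been reused for all four copies, each load would have to wait until the previous copy's add had consumed the old value, which is the name dependence that renaming removes.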

Compiler Perspectives on Code Movement
• Name dependencies are hard to discover for memory accesses
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change then:
    0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
• There were no dependencies between some loads and stores, so they could be moved past each other
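For accesses that share one unchanged base register, the check reduces to comparing displacements. The following is a minimal sketch under that assumption, with hypothetical function names; it says nothing about accesses through different base registers such as 100(R4) and 20(R6), which is the hard case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if two accesses of `width` bytes at displacements a and b from the
   SAME, unchanged base register cannot overlap. */
bool disjoint_same_base(long a, long b, long width) {
    assert(width > 0);
    long d = (a > b) ? a - b : b - a;
    return d >= width;
}

/* True if every pair of displacements in offs[0..n-1] is disjoint. */
bool all_disjoint(const long *offs, size_t n, long width) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (!disjoint_same_base(offs[i], offs[j], width))
                return false;
    return true;
}
```

For the displacements 0, -8, -16, and -24 with 8-byte doubles, `all_disjoint` returns true, which is the fact the scheduler relied on when it moved loads above stores.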

Steps Compiler Performed to Unroll
• Check that it is OK to move the S.D after DSUBUI and BNEZ, and find the amount to adjust the S.D offset
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and adjust the loop termination and iteration code
• Determine whether loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent; this requires analyzing memory addresses and finding that they do not refer to the same address
• Schedule the code, preserving any dependences needed to yield the same result as the original code