Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
University of Michigan, Electrical Engineering and Computer Science

Automated C to Gates Solution
• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low-level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market
[Figure: app.c compiled to a chain of loop accelerators (LAs)]

Streaming Applications
• Data “streaming” through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets
[Figure: H.264 Encoder. Blocks: Image, Transform, Quantizer, Coder, Coded Image, Inverse Quantizer, Inverse Transform, Motion Estimator, Motion Predictor]
[Figure: W-CDMA Transmitter. Blocks: Data in, CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter, Data out]

Software Overview
[Figure: tool flow with numbered steps 1 to 4. Labels: Whole Application, Frontend Analyses, Loop Graph, System Level Synthesis, Accelerator Pipeline, SRAM Buffers]

Input Specification
• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    /* Kernel: one perfectly nested loop; every array access is explicit. */
    row_trans(char inp[8][8], char out[8][8]) {
      for(i=0; i<8; i++) {
        for(j=0; j<8; j++) {
          ... = inp[i][j];
          out[i][j] = ...;
        }
      }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    /* System: kernels called in sequence, linked by local arrays. */
    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Performance Specification
• High performance DCT
  – Process one 1024 x 768 image every 2 ms
  – Given 400 MHz clock
    • One image every 800000 cycles
    • One block every 64 cycles
• Low performance DCT
  – Process one 1024 x 768 image every 4 ms
  – One block every 128 cycles
• Performance goal: task throughput in number of cycles between tasks
[Figure: input image (1024 x 768) split into 8 x 8 blocks; each block is one task flowing through inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out (output coeffs)]
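The cycle budgets on this slide follow from simple arithmetic, sketched below in C. The function names are illustrative and not part of Streamroller. Exact division gives about 65 and 130 cycles per block; the slide quotes the nearby round figures 64 and 128.

```c
#include <assert.h>
#include <stdint.h>

/* Number of block-sized tasks in one image. Assumes the image
   dimensions are multiples of the block size. */
uint64_t blocks_per_image(uint64_t width, uint64_t height, uint64_t block)
{
    assert(block > 0 && width % block == 0 && height % block == 0);
    return (width / block) * (height / block);
}

/* Cycles available between consecutive tasks, rounded down.
   period_us is the time allowed for all `tasks` tasks, in microseconds. */
uint64_t cycles_per_task(uint64_t clock_hz, uint64_t period_us, uint64_t tasks)
{
    assert(tasks > 0);
    uint64_t cycles = clock_hz * period_us / 1000000u;
    return cycles / tasks;
}
```

For the high performance DCT, `blocks_per_image(1024, 768, 8)` gives 12288 blocks and `cycles_per_task(400000000, 2000, 12288)` gives 65 cycles.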

Building Blocks
• Multifunction Loop Accelerator [CODES/ISSS ’06]
• SRAM buffers
[Figure: kernel chain Kernel 1 → tmp1 → Kernel 2 → tmp2 → Kernel 3 → tmp3 → Kernel 4]

System Schema Overview
[Figure: Kernels 1 to 5 mapped onto three loop accelerators (LA 1, LA 2, LA 3); a timeline shows successive tasks overlapping across the LAs, with the task throughput marked on the time axis]

Cost Components
• Cost of loop accelerator data path
  – Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
  – Key parameter that decides LA cost
    • Low II → high performance → high cost
  – Loop execution time ≈ (trip count) x II
  – Appropriate II chosen to satisfy task throughput
[Figure: three kernels K1, K2, K3, each with TC=100. High performance: II=1 per kernel, throughput = 1 task/100 cycles. Low performance: II=2 per kernel, throughput = 1 task/200 cycles]
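The relation between II and task throughput can be sketched as follows, assuming each loop has its own LA and using the slide's approximation that execution time is trip count times II (pipeline fill and drain are ignored). The function names are illustrative.

```c
#include <assert.h>

/* Approximate loop execution time: trip count times initiation interval. */
unsigned loop_cycles(unsigned trip_count, unsigned ii)
{
    return trip_count * ii;
}

/* Largest II (the cheapest datapath) whose execution time still fits in
   the cycle budget between tasks. Returns 0 if even II = 1 is too slow. */
unsigned max_ii(unsigned trip_count, unsigned cycle_budget)
{
    assert(trip_count > 0);
    return cycle_budget / trip_count;
}
```

With TC=100, a budget of 100 cycles per task forces II=1, while a budget of 200 cycles allows II=2, matching the two cases in the figure.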

Cost Components (Contd.)
• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1 to K4, each with TC=100, mapped onto LA 1, LA 2, LA 3; throughput = 1 task/200 cycles; LA 1 occupied for 200 cycles]
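A minimal sketch of the occupancy check implied here, assuming the loops sharing one LA run back to back and each takes trip count times II cycles. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles one multifunction LA is busy per task: the sum of the execution
   times of all loops mapped onto it. */
unsigned la_busy_cycles(const unsigned *trip_count, const unsigned *ii, size_t n)
{
    unsigned busy = 0;
    for (size_t k = 0; k < n; k++) {
        assert(ii[k] >= 1);
        busy += trip_count[k] * ii[k];
    }
    return busy;
}

/* The LA keeps up only if it is free again before the next task arrives. */
int la_meets_throughput(const unsigned *trip_count, const unsigned *ii,
                        size_t n, unsigned period)
{
    return la_busy_cycles(trip_count, ii, n) <= period;
}
```

Two loops with TC=100 and II=1 on the same LA occupy it for 200 cycles, which is acceptable at 1 task/200 cycles but not at 1 task/100 cycles.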

Cost Components (Contd.)
• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (TC=100, II=1 each) on LA 1, LA 2, LA 3, connected through tmp1 and tmp2; the tmp1 buffer is in use by LA 2 while the next task runs K1, so adjacent tasks use different buffers]
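One way to estimate how many copies of a temporary array are needed is sketched below. The model is an assumption, not taken from the slides: a buffer is held from the cycle its producer starts writing until the cycle its consumer finishes reading, and a new task starts every `period` cycles.

```c
#include <assert.h>

/* Assumed model: buffers needed = ceil(lifetime / period), where lifetime
   runs from producer start to consumer end within one task. */
unsigned buffers_needed(unsigned producer_start, unsigned consumer_end,
                        unsigned period)
{
    assert(period > 0 && consumer_end >= producer_start);
    unsigned lifetime = consumer_end - producer_start;
    return (lifetime + period - 1) / period;   /* integer ceiling */
}
```

In the figure, tmp1 is written by K1 during cycles 0 to 100 and read by K2 during cycles 100 to 200. At 1 task/100 cycles this model gives two buffers, consistent with adjacent tasks using different buffers.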

ILP Formulation
• Variables
  – II for each loop
  – Which loops are combined into single LA
  – Number of buffers for temp array
• Objective function
  – Cost of LAs + cost of buffers
• Constraints
  – Overall task throughput should be achieved
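To make the decision space concrete, the sketch below replaces the ILP with an exhaustive search over a toy instance of three loops. It is not the authors' solver. The cost table is hypothetical, buffer cost is left out, and the LA cost uses the 0.5 * (sum + max) estimate from the Multifunction Accelerator Cost slide.

```c
#include <assert.h>
#include <stddef.h>

#define N_LOOPS 3   /* loops in the toy application */
#define MAX_II  2   /* largest II considered */

/* Hypothetical relative datapath cost of one loop, indexed by II. */
static const double COST_AT_II[MAX_II + 1] = {0.0, 100.0, 80.0};

/* Cost of one candidate design, or -1.0 if some LA misses the throughput. */
static double design_cost(const unsigned tc[N_LOOPS], const unsigned la[N_LOOPS],
                          const unsigned ii[N_LOOPS], unsigned period)
{
    double total = 0.0;
    for (unsigned a = 0; a < N_LOOPS; a++) {      /* at most N_LOOPS LAs */
        unsigned busy = 0;
        double sum = 0.0, max = 0.0;
        for (size_t k = 0; k < N_LOOPS; k++) {
            if (la[k] != a) continue;
            assert(ii[k] >= 1 && ii[k] <= MAX_II);
            double c = COST_AT_II[ii[k]];
            busy += tc[k] * ii[k];                /* constraint: LA occupancy */
            sum += c;
            if (c > max) max = c;
        }
        if (busy > period) return -1.0;
        total += 0.5 * (sum + max);               /* objective: LA cost */
    }
    return total;
}

/* Try every grouping and every II; return the cheapest feasible cost,
   or -1.0 if no design meets the throughput. */
double cheapest_design(const unsigned tc[N_LOOPS], unsigned period)
{
    const unsigned choices = N_LOOPS * MAX_II;    /* (LA, II) pairs per loop */
    unsigned total = 1;
    for (size_t k = 0; k < N_LOOPS; k++) total *= choices;

    double best = -1.0;
    for (unsigned code = 0; code < total; code++) {
        unsigned la[N_LOOPS], ii[N_LOOPS], rest = code;
        for (size_t k = 0; k < N_LOOPS; k++) {    /* decode one digit per loop */
            unsigned digit = rest % choices;
            rest /= choices;
            la[k] = digit / MAX_II;
            ii[k] = digit % MAX_II + 1;
        }
        double cost = design_cost(tc, la, ii, period);
        if (cost >= 0.0 && (best < 0.0 || cost < best)) best = cost;
    }
    return best;
}
```

With three loops of TC=100 and a period of 200 cycles, the cheapest design under this cost table groups two loops at II=1 on one LA and runs the third alone at II=2. Enumeration only works for tiny instances, which is why a solver-based formulation is used for real applications.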

Non-linear LA Cost
[Figure: relative cost (0 to 1.0) versus initiation interval (1 to 20), with IImin and IImax marked on the II axis]
$II_{min} \le II \le II_{max}$
$II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \dots + 14 \cdot II_{14}$ and $0 \le II_i \le 1$
$Cost(II) = C_1 \cdot II_1 + C_2 \cdot II_2 + C_3 \cdot II_3 + \dots + C_{14} \cdot II_{14}$
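The two sums above turn a non-linear cost curve into linear expressions over indicator variables. The sketch below evaluates them for a given selection. It assumes exactly one indicator is 1, a condition the slide does not state explicitly (it shows only the bounds 0 ≤ II_i ≤ 1). The cost values in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>

/* sel[i] is the 0/1 indicator for II = i + 1. Returns the encoded II,
   or 0 if the indicators are not a valid one-hot selection. */
unsigned decode_ii(const int *sel, size_t n)
{
    unsigned ii = 0, ones = 0;
    for (size_t i = 0; i < n; i++) {
        if (sel[i] != 0 && sel[i] != 1) return 0;
        ii += (unsigned)(i + 1) * (unsigned)sel[i];
        ones += (unsigned)sel[i];
    }
    return ones == 1 ? ii : 0;
}

/* Linear cost expression: sum of C_i * II_i, which picks out the single
   table entry belonging to the selected II. */
double selected_cost(const int *sel, const double *cost, size_t n)
{
    assert(decode_ii(sel, n) != 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += cost[i] * sel[i];
    return total;
}
```

For indicators {0, 0, 1, 0} and a cost table {1.0, 0.7, 0.5, 0.4}, `decode_ii` returns 3 and `selected_cost` returns 0.5, the table entry for II=3.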

Multifunction Accelerator Cost
[Figure: four loop accelerators LA 1 to LA 4 combined with varying degrees of overlap]
  – Worst case: no sharing, Cost = Sum
  – Realistic case: some sharing, Cost = between Sum and Max
  – Best case: full sharing, Cost = Max
• Impractical to obtain accurate cost of all combinations
• $C_{LA} = 0.5 \cdot (\mathrm{SUM}_{C_{LA}} + \mathrm{MAX}_{C_{LA}})$
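The midpoint estimate can be written directly as a small C function. The function name and the example costs are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated cost of a multifunction LA built from n single-loop LAs:
   halfway between no sharing (sum) and full sharing (max). */
double multifunction_la_cost(const double *cost, size_t n)
{
    assert(n > 0);
    double sum = 0.0, max = cost[0];
    for (size_t i = 0; i < n; i++) {
        sum += cost[i];
        if (cost[i] > max) max = cost[i];
    }
    return 0.5 * (sum + max);
}
```

For component costs 4.0 and 2.0 the estimate is 5.0, between the full-sharing cost 4.0 and the no-sharing cost 6.0. A single loop keeps its own cost.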

Case Study: “Simple” benchmark
[Figure: loop graph (TC=256) and the resulting groupings of loops into LA 1 to LA 4, with per-loop annotations, for designs at 512, 1536, 1792, and 2048 cycles]

Beamformer
• 10 loops
• Memory cost
  – 60% to 70%
• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
  – Not profitable to pick a lower performance LA
• Memory buffer cost significant
  – High performance producer/consumer better than more buffers

Conclusions
• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
  – Macro level hardware sharing can achieve significant cost savings
  – Memory cost is significant; need to simultaneously optimize for datapath and memory cost
