Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan Electrical Engineering and Computer Science

Introduction
• Emerging applications have high performance, cost, and energy demands
 – H.264, wireless, software radio, signal processing
 – 10-100 Gops required
 – 200 mW power budget
• Applications are dominated by tight loops processing large amounts of streaming data
[Figure: handheld device with 20 GB HD, 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, and memory card [ARM 2005]]

Loop Accelerators
• Order-of-magnitude performance and efficiency wins
 – Viterbi: 100x speedup vs. ARM9
• Automated C-to-gates solution
 – Correct by construction
 – Closes the designer productivity gap
 – Achieves short time-to-market

Loop Accelerator Template
• Hardware realization of a modulo scheduled loop
• Parameterized execution resources, storage, and connectivity

Loop Accelerator Design Flow
[Figure: synthesis flow. Inputs: C code (.c) and a performance (throughput) target. Stages: Alloc FUs (abstract architecture), Modulo Schedule (scheduled ops Op 1, Op 2, Op 3, …), Build Datapath, Instantiate Arch (concrete architecture). Output: the loop accelerator as Verilog (.v) plus control signals.]
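
The "Alloc FUs" stage must provide enough function units for the loop to meet its throughput target. The sketch below makes several assumptions. The throughput target has already been turned into an initiation interval (II). Every FU is fully pipelined and executes a single operation type. Under those assumptions, a type with n operations needs ceil(n / II) units. The function name `allocate_fus` and the operation mix are hypothetical.

```python
import math
from collections import Counter


def allocate_fus(op_types, ii):
    """Return the minimum FU count per operation type for initiation interval ii.

    Assumes fully pipelined, single-type FUs: each FU offers ii issue slots
    per loop iteration, so a type with n ops needs ceil(n / ii) FUs.
    """
    counts = Counter(op_types)
    return {op_type: math.ceil(n / ii) for op_type, n in counts.items()}


# Hypothetical loop body: three adds, two loads, one multiply, target II = 2.
ops = ["add", "add", "add", "load", "load", "mul"]
print(allocate_fus(ops, ii=2))  # {'add': 2, 'load': 1, 'mul': 1}
```

This is the usual resource-bound (ResMII) relation of modulo scheduling, solved for the FU count rather than for II. A smaller II means higher throughput but more FUs.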

Modulo Scheduling and Datapath Derivation
• Schedule to an abstract architecture (FUs)
• Determine register and interconnect requirements from the schedule
[Figure: source code (r1 = Mem[r2]; r3 = r1 + 12), its schedule (LOAD on FU 1, ADD on FU 2, times 1 to 4), and the derived datapath (MEM, FU 1, FU 2, constant 12)]
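
The sketch below shows how register and interconnect requirements can be read off a finished schedule. It rests on a simplified model that is an assumption, not the paper's: registers sit at the producing FU's output, a value needs as much register depth as its lifetime in cycles, and every dependence edge needs a wire from producer FU to consumer FU. Loop-carried dependences, which would add iteration distance × II to the lifetime, are omitted. The schedule is hypothetical and only loosely follows the figure.

```python
from collections import defaultdict


def derive_datapath(schedule, edges, latency):
    """Derive storage and interconnect requirements from a finished schedule.

    schedule: op -> (fu, start_time)
    edges:    (producer, consumer) data dependences within one iteration
    latency:  op -> cycles until its result is available
    Returns (register depth per producing FU, set of FU-to-FU wires).
    """
    depth = defaultdict(int)
    wires = set()
    for prod, cons in edges:
        p_fu, p_time = schedule[prod]
        c_fu, c_time = schedule[cons]
        # The value is ready at p_time + latency and must be held until c_time.
        lifetime = c_time - (p_time + latency[prod])
        if lifetime < 0:
            raise ValueError(f"{cons} starts before {prod} produces its value")
        depth[p_fu] = max(depth[p_fu], lifetime)
        wires.add((p_fu, c_fu))
    return dict(depth), wires


# Hypothetical schedule for r1 = Mem[r2]; r3 = r1 + 12 (1-cycle latencies assumed).
schedule = {"load": ("FU1", 1), "add": ("FU2", 4)}
print(derive_datapath(schedule, [("load", "add")], {"load": 1, "add": 1}))
# ({'FU1': 2}, {('FU1', 'FU2')})
```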

Cost Sensitive Scheduling
• Traditional scheduling is hardware unaware
• Intelligent scheduling is needed to reduce hardware cost
• Different scheduling alternatives are not equal
[Figure: two alternative schedules of ops +1, +2, LD 1, LD 2 on FU 1, FU 2, FU 3 over times 0 to 2, leading to different datapaths]

Scheduling to Reduce Cost
• Hardware cost is a function of the final schedule
• Increased hardware sharing = reduced cost
 – Reusing FU hardware is "free"
 – No additional cost for a longer lifetime
• Traditional metrics (register pressure) are not sufficient
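
The two "free" cases can be made concrete with a marginal-cost function. The sketch assumes a simple model: an op pays an FU's area only if that FU is not yet used, and it pays storage only for register depth beyond what its FU's output register already provides. The relative FU areas, the 32-bit width, and the per-bit cost are hypothetical placeholders.

```python
# Hypothetical relative FU areas; real numbers would come from a cell library.
FU_COST = {"add": 1.0, "load": 3.0, "mul": 4.0}


def marginal_cost(used_fus, reg_depth, fu, fu_type, lifetime, width=32, bit_cost=0.01):
    """Extra hardware cost of placing one op on `fu` whose result lives `lifetime` cycles.

    used_fus:  FUs already instantiated by earlier placements
    reg_depth: current register depth at each FU's output
    """
    cost = 0.0
    if fu not in used_fus:
        cost += FU_COST[fu_type]  # a new FU must be built; reusing one is free
    # Only depth beyond what the FU's output register already provides costs area.
    extra_depth = max(0, lifetime - reg_depth.get(fu, 0))
    cost += extra_depth * width * bit_cost
    return cost


used, depth = {"FU1"}, {"FU1": 3}
print(marginal_cost(used, depth, "FU1", "add", lifetime=2))  # 0.0: shared FU, fits existing depth
print(marginal_cost(used, depth, "FU2", "add", lifetime=0))  # 1.0: a new adder
```

A register-pressure metric would count the first placement as one more live value. This model sees it as costless, which illustrates why traditional metrics do not track area.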

Initial Approach: Greedy
• Standard iterative modulo scheduler, augmented with a hardware cost model
• Choose the alternative that increases cost the least
    while unscheduled ops remain {
        get valid alternatives for op
        for each alternative {
            get hardware cost
        }
        schedule op using min-cost alternative
        update hardware cost model
    }
• Hardware cost = FU cost + Storage cost + Wire cost
[Figure: FU types +, -, *, <<]
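
The greedy loop above can be sketched as runnable Python, with several simplifications:
- Ops are placed once, in topological order. A real iterative modulo scheduler can also evict and reschedule ops.
- Each FU contributes only its earliest legal modulo slot as an alternative.
- The cost of an alternative is the FU cost if the FU is new, plus 1 per new FU-to-FU wire. Storage cost is left out for brevity.

All names and costs are hypothetical.

```python
def greedy_schedule(ops, preds, fus, ii, latency, fu_cost, horizon=20):
    """Cost-aware greedy modulo scheduling (simplified: no eviction or backtracking).

    ops:   [(name, type)] in topological order
    preds: name -> list of predecessor names
    fus:   candidate FU name -> type (only FUs actually used are paid for)
    """
    schedule = {}   # op -> (fu, time)
    mrt = set()     # modulo reservation table: occupied (fu, time % ii) slots
    used = set()    # FUs already instantiated
    wires = set()   # FU-to-FU connections already built
    for name, typ in ops:
        srcs = [schedule[p][0] for p in preds.get(name, [])]
        earliest = max((schedule[p][1] + latency[p] for p in preds.get(name, [])), default=0)
        best = None
        for fu, fu_typ in fus.items():
            if fu_typ != typ:
                continue
            # Valid alternative on this FU: first slot >= earliest not taken modulo II.
            t = next((t for t in range(earliest, horizon) if (fu, t % ii) not in mrt), None)
            if t is None:
                continue
            cost = (0 if fu in used else fu_cost[typ]) + len({(s, fu) for s in srcs} - wires)
            if best is None or cost < best[0]:
                best = (cost, fu, t)
        if best is None:
            raise RuntimeError(f"no valid slot for {name} within horizon")
        _, fu, t = best
        schedule[name] = (fu, t)        # schedule op using min-cost alternative
        mrt.add((fu, t % ii))           # update hardware cost model
        used.add(fu)
        wires |= {(s, fu) for s in srcs}
    return schedule, used, wires


ops = [("ld1", "load"), ("ld2", "load"), ("a1", "add"), ("a2", "add")]
preds = {"a1": ["ld1"], "a2": ["ld2"]}
fus = {"L1": "load", "L2": "load", "A1": "add", "A2": "add"}
latency = {name: 1 for name, _ in ops}
sched, used, wires = greedy_schedule(ops, preds, fus, ii=2, latency=latency,
                                     fu_cost={"load": 3, "add": 1})
print(sched)  # {'ld1': ('L1', 0), 'ld2': ('L1', 1), 'a1': ('A1', 1), 'a2': ('A1', 2)}
```

Each decision looks only at the op being placed. That local scope is what the next results slide blames for local minima.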

Results – Greedy Scheduling
[Chart: cost breakdown into FU, storage, and MUX]
• Local scope: the scheduler gets stuck in local minima
• Much larger cost savings are possible
• 5% average cost savings

Optimal Modulo Scheduling
• Optimal modulo scheduling extends [Eichenberger '97]
• FU cost = $\sum_i \mathrm{cost}(FU_i)$
• Storage cost = $\sum_i \mathit{width}_i \times \mathit{depth}_i$
[Figure: a loop with Op 1, Op 2, Op 3 and its search space of (FU #, time) placements, e.g. (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]
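
To make the (FU #, time) search space concrete, the sketch below substitutes plain exhaustive enumeration for the ILP formulation that the slide cites. It is usable only for toy loops. It applies the slide's FU-cost and storage-cost terms, with storage taken as width × register depth per FU. Wire and MUX cost are omitted, and the width and per-bit cost are hypothetical.

```python
import itertools


def optimal_schedule(ops, edges, fus, ii, horizon, latency, fu_cost, width=32, bit_cost=0.01):
    """Exhaustively search every (FU, time) placement; return (min cost, schedule).

    ops: [(name, type)]; edges: (producer, consumer) pairs; fus: FU name -> type.
    Cost = sum of FU costs for FUs used + sum over FUs of width * register depth.
    """
    names = [name for name, _ in ops]
    alternatives = [[(fu, t) for fu, fu_typ in fus.items() if fu_typ == typ
                     for t in range(horizon)] for _, typ in ops]
    best = None
    for choice in itertools.product(*alternatives):
        sched = dict(zip(names, choice))
        slots = [(fu, t % ii) for fu, t in choice]
        if len(set(slots)) != len(slots):
            continue  # two ops collide in the modulo reservation table
        depth, feasible = {}, True
        for p, c in edges:
            lifetime = sched[c][1] - (sched[p][1] + latency[p])
            if lifetime < 0:
                feasible = False  # dependence violated
                break
            depth[sched[p][0]] = max(depth.get(sched[p][0], 0), lifetime)
        if not feasible:
            continue
        cost = sum(fu_cost[fus[fu]] for fu in {fu for fu, _ in choice})
        cost += sum(d * width * bit_cost for d in depth.values())
        if best is None or cost < best[0]:
            best = (cost, sched)
    return best


ops = [("ld1", "load"), ("ld2", "load"), ("a1", "add"), ("a2", "add")]
edges = [("ld1", "a1"), ("ld2", "a2")]
fus = {"L1": "load", "L2": "load", "A1": "add", "A2": "add"}
latency = {name: 1 for name, _ in ops}
cost, sched = optimal_schedule(ops, edges, fus, ii=2, horizon=4,
                               latency=latency, fu_cost={"load": 3, "add": 1})
print(cost)  # 4.0: one load unit and one adder shared by both pairs
```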

Results – Optimal Scheduling
[Chart: cost breakdown into FU, storage, and MUX]
• 27% average cost savings

Problem Decomposition
• Exact solutions are not practical
 – $(\#FU \times II \times stages)^{\#ops}$ possible schedules
 – 20 lines of C code → 100 hours
 – Excessive runtimes even for modest-size loops
• Decompose into more manageable subproblems
 – Partitioned scheduling
 – Time-space decomposition
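
The size of the schedule space can be evaluated directly from the bound above. Each op independently picks an FU, a modulo slot within II, and a stage. The machine parameters below (4 FUs, II = 2, 3 stages) are hypothetical.

```python
def schedule_space(num_fus, ii, stages, num_ops):
    """Upper bound on candidate schedules: (#FU * II * stages) ** #ops."""
    return (num_fus * ii * stages) ** num_ops


# Hypothetical machine: 4 FUs, II = 2, 3 stages -> 24 placements per op.
for n in (5, 10, 20):
    print(n, f"{schedule_space(4, 2, 3, n):.2e}")
# 5 7.96e+06
# 10 6.34e+13
# 20 4.02e+27
```

Each added operation multiplies the space by #FU × II × stages. Exact search therefore stops scaling well before a loop reaches realistic size.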

Partitioned Scheduling
• Partition the operations into small groups
• Schedule groups of operations sequentially
 – Account for the hardware contribution of previously scheduled groups
 – Backtrack if an infeasible state is reached
[Figure: operations 1 to 5 split into groups, each group passed in turn to the optimal modulo scheduler]
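
Partitioned scheduling with backtracking can be sketched as a depth-first search over groups. The search is exhaustive within a group, and FUs paid for by earlier groups are free to reuse. Whenever a group cannot be placed, the search returns to the previous group and tries that group's next-cheapest placement. For brevity the cost counts FUs only. The function name, its parameters, and the example are hypothetical.

```python
import itertools


def partitioned_schedule(groups, types, edges, fus, ii, horizon, latency, fu_cost):
    """Schedule op groups one at a time; backtrack when a later group cannot be placed.

    groups: list of lists of op names, scheduled in order
    types:  op -> type; edges: (producer, consumer); fus: FU name -> type
    Cost counts FUs used by all groups so far, so earlier FUs are reused for free.
    """
    def feasible(sched):
        slots = [(fu, t % ii) for fu, t in sched.values()]
        if len(slots) != len(set(slots)):
            return False
        return all(sched[c][1] >= sched[p][1] + latency[p]
                   for p, c in edges if p in sched and c in sched)

    def cost(sched):
        return sum(fu_cost[fus[fu]] for fu in {fu for fu, _ in sched.values()})

    def solve(g, sched):
        if g == len(groups):
            return sched
        alts = [[(fu, t) for fu, fu_typ in fus.items() if fu_typ == types[op]
                 for t in range(horizon)] for op in groups[g]]
        candidates = []
        for choice in itertools.product(*alts):
            trial = {**sched, **dict(zip(groups[g], choice))}
            if feasible(trial):
                candidates.append((cost(trial), trial))
        # Cheapest extension first; fall through to the next one on failure.
        for _, trial in sorted(candidates, key=lambda cand: cand[0]):
            result = solve(g + 1, trial)
            if result is not None:
                return result
        return None  # infeasible state: the caller backtracks

    return solve(0, {})


groups = [["ld1", "ld2"], ["a1", "a2"]]
types = {"ld1": "load", "ld2": "load", "a1": "add", "a2": "add"}
edges = [("ld1", "a1"), ("ld2", "a2")]
fus = {"L1": "load", "L2": "load", "A1": "add", "A2": "add"}
latency = {op: 1 for op in types}
sched = partitioned_schedule(groups, types, edges, fus, ii=2, horizon=4,
                             latency=latency, fu_cost={"load": 3, "add": 1})
print(sorted({fu for fu, _ in sched.values()}))  # ['A1', 'L1']
```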

Operation Partitioning
• Traditional partitioning: minimize edge cuts
 – Does not necessarily lead to good cost
• Goal: maximize hardware sharing opportunities within a group
[Figure: example dataflow graph with ops +, +, LD, <<, LD, +, *]
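
The sketch below contrasts the two partitioning goals on a hypothetical dataflow graph that uses the slide's op mix. The number of same-type op pairs inside a group stands in as a rough proxy for sharing opportunities, since each such pair could share one FU. The heuristic `partition_by_type` is a hypothetical stand-in, not the paper's partitioner.

```python
from itertools import combinations


def sharing_pairs(groups, types):
    """Same-type op pairs placed in the same group: each pair could share one FU."""
    return sum(1 for g in groups for a, b in combinations(g, 2) if types[a] == types[b])


def edge_cut(groups, edges):
    """Dependence edges whose endpoints fall in different groups."""
    where = {op: i for i, g in enumerate(groups) for op in g}
    return sum(1 for a, b in edges if where[a] != where[b])


def partition_by_type(ops, types, size):
    """Hypothetical sharing-oriented heuristic: cluster same-type ops, then chunk."""
    ordered = sorted(ops, key=lambda op: types[op])  # stable: keeps order within a type
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical dataflow graph using the op mix from the slide: + + LD << LD + *
types = {"add1": "+", "add2": "+", "ld1": "LD", "shl": "<<", "ld2": "LD", "add3": "+", "mul": "*"}
edges = [("ld1", "add1"), ("add1", "shl"), ("shl", "add3"),
         ("ld2", "add2"), ("add2", "mul"), ("mul", "add3")]

min_cut = [["ld1", "add1", "shl", "add3"], ["ld2", "add2", "mul"]]
by_type = partition_by_type(list(types), types, size=4)
for name, part in (("min-cut", min_cut), ("by-type", by_type)):
    print(name, "cut =", edge_cut(part, edges), "sharing pairs =", sharing_pairs(part, types))
# min-cut cut = 1 sharing pairs = 1
# by-type cut = 4 sharing pairs = 4
```

The split with the minimum cut keeps only one sharable pair together. The type-clustered split cuts more edges but exposes four sharing opportunities to the scheduler.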

Results – Partitioned Scheduling
[Chart: cost breakdown into FU, storage, and MUX]
• 8% average cost savings
• With a large number of partitions, results are similar to greedy

Partition Size for Sharp
• Improve cost by considering more ops at a time

Time-Space Decomposition
• Reduce scheduling complexity
• View all operations together
• Optimize for register depth during time assignment, and for register width and FU cost during space assignment
[Figure: ops 1 to 5 mapped onto FU 1 to FU 3 over times 0 and 1, either time first (time 0: ops 1, 2; time 1: ops 3, 4, 5) then space, or space first then time]
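
A two-phase decomposition can be sketched in time-then-space order:
- **Time phase.** Gives each op an ASAP start time. The phase caps each op type at its FU count per modulo slot so that the space phase is always feasible. ASAP stands in here for a depth-aware time assignment.
- **Space phase.** Binds each op to a free FU of its type in its slot, choosing the FU that adds the fewest new FU-to-FU wires. This wire count is a simplified stand-in for the width and FU cost terms named on the slide.

All names are hypothetical.

```python
from collections import Counter


def time_assign(ops, types, preds, latency, ii, fu_count):
    """Phase 1 (time): ASAP start times that respect dependences and allow at most
    fu_count[type] ops of each type per modulo slot. `ops` is in topological order."""
    need = Counter(types[op] for op in ops)
    assert all(n <= fu_count[t] * ii for t, n in need.items()), "not enough FU slots"
    times, usage = {}, Counter()
    for op in ops:
        t = max((times[p] + latency[p] for p in preds.get(op, [])), default=0)
        while usage[(types[op], t % ii)] >= fu_count[types[op]]:
            t += 1  # slot full for this type: slip one cycle
        usage[(types[op], t % ii)] += 1
        times[op] = t
    return times


def space_assign(ops, types, preds, times, fus, ii):
    """Phase 2 (space): bind each op to a free FU of its type in its modulo slot,
    choosing the FU that adds the fewest new FU-to-FU wires."""
    place, busy, wires = {}, set(), set()
    for op in ops:
        free = [fu for fu, fu_typ in fus.items()
                if fu_typ == types[op] and (fu, times[op] % ii) not in busy]

        def new_wires(fu):
            return len({(place[p], fu) for p in preds.get(op, [])} - wires)

        fu = min(free, key=new_wires)
        place[op] = fu
        busy.add((fu, times[op] % ii))
        wires |= {(place[p], fu) for p in preds.get(op, [])}
    return place, wires


ops = ["ld1", "ld2", "a1", "a2"]
types = {"ld1": "load", "ld2": "load", "a1": "add", "a2": "add"}
preds = {"a1": ["ld1"], "a2": ["ld2"]}
latency = {op: 1 for op in ops}
times = time_assign(ops, types, preds, latency, ii=2, fu_count={"load": 1, "add": 1})
place, wires = space_assign(ops, types, preds, times,
                            {"L1": "load", "A1": "add", "A2": "add"}, ii=2)
print(times)  # {'ld1': 0, 'ld2': 1, 'a1': 1, 'a2': 2}
print(place)  # {'ld1': 'L1', 'ld2': 'L1', 'a1': 'A1', 'a2': 'A1'}
```

Swapping the phases gives the space-then-time variant that the next results slide compares. It is not shown here.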

Results – Time-Space Scheduling
[Chart: cost breakdown into FU, storage, and MUX]
• Time, then space: 19% average cost savings
• Space, then time: 20% average cost savings

Real Cost Savings
• Viterbi, naïve scheduler: 0.66 mm²
• Viterbi, space-time decomposed scheduler: 0.37 mm²
• 43.2% overall area savings

Conclusion
• Automated C loop accelerator synthesis system
• The modulo scheduler must be cost aware
• Decomposition methods make the problem tractable
 – 20% average cost savings with space-time decomposition
 – Importance of a global view of all operations
• Individual savings of up to 43%
• Compile times of 1 to 30 minutes

Questions?
• For more information: http://cccp.eecs.umich.edu