Clustering of Large Designs for Channel-Width Constrained FPGAs
Marvin Tom, Guy Lemieux
University of British Columbia, Department of Electrical and Computer Engineering, Vancouver, BC, Canada

Overview
• Introduction, Goals and Motivation
  – Reduce channel width, lower cost, make circuits “routable”
• Reducing Channel Width By Depopulation
• Large Benchmark Circuits
• New Clustering Technique
  – Selective Depopulation
• Conclusions and Future Work

Mesh-Based FPGA Architecture
• Channel width
  – Number of routing tracks per channel
• Larger FPGA devices: more tiles
  – Channel width is fixed
[Figure: mesh of logic tiles (L) separated by routing channels]

Motivation: Area of FPGA Devices
[Figure: MCNC circuits mapped onto an FPGA; SIZE of layout tile]
Total Layout AREA = SIZE * Number of Layout Tiles

Motivation: Channel Width Demand
[Figure: MCNC circuits mapped onto an FPGA, with an interconnect range and a logic range]
• Interconnect range: user has no choice!
  – Devices built for worst-case channel width (fixed width)
  – Interconnect cost dominates (>70%)
• Logic range: user buys bigger device.

Goal: Reduce Channel Width
• Altera Cyclone
  – Channel width constraint of 80 routing tracks
• Constrained FPGA
  – Channel width constraint of 60 routing tracks
  – Smaller area, lower cost for low-channel-width circuits
• But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable…
• Can we make them routable in a Constrained FPGA?

Possible Solution
• Trade off logic utilization for channel width
  – User can always buy more logic… (not more wires)
[Figure: FPGA 1 vs. FPGA 2, trading CLB count for channel width]
• But… can we achieve lower Total Area? (= SIZE * CLB Count)
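The area question on this slide can be sketched numerically. This is a minimal illustration only: the linear tile-size model and every number in it (`logic_area`, `area_per_track`, the CLB counts) are assumptions made for the example, not values from the talk. They are chosen so that interconnect is a bit over 70% of the tile at 80 tracks.

```python
def tile_size(channel_width, logic_area=30.0, area_per_track=1.0):
    """Hypothetical tile SIZE: fixed logic area plus area that grows with channel width."""
    return logic_area + area_per_track * channel_width


def total_area(clb_count, channel_width):
    """Total Area = SIZE * CLB Count, as on the slide."""
    return tile_size(channel_width) * clb_count


# FPGA 1: fewer CLBs, wider channels. FPGA 2: more CLBs, narrower channels.
# Both configurations are made-up numbers for illustration.
fpga1_area = total_area(clb_count=1000, channel_width=80)
fpga2_area = total_area(clb_count=1200, channel_width=60)

print(fpga1_area, fpga2_area)
```

Under these assumed numbers, a 20% increase in CLB count still gives a smaller total area, because each tile shrinks when the channel narrows. Whether that holds for a real circuit is exactly what the rest of the talk investigates.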

Logic Element: BLE and CLB
• Basic Logic Element (BLE)
  – ‘k’-input LUT + FF
• Clustered Logic Block (CLB)
  – ‘N’ BLEs, ‘N’ outputs
  – ‘I’ shared inputs
• Note: I < k*N
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]
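The BLE/CLB parameters above can be written down as a small data structure. This is a sketch under assumptions: the class names, the `fits` helper, and the rule that a net driven inside the cluster does not consume one of the ‘I’ shared inputs are illustrative choices, not definitions taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BLE:
    output: str        # net driven by this BLE
    inputs: frozenset  # nets feeding the k-input LUT


@dataclass(frozen=True)
class CLBArch:
    N: int  # BLEs (and outputs) per CLB
    k: int  # LUT inputs per BLE
    I: int  # shared CLB inputs, with I < k*N


def fits(arch, bles):
    """Check whether a group of BLEs can share one CLB of this architecture."""
    if len(bles) > arch.N:
        return False
    if any(len(b.inputs) > arch.k for b in bles):
        return False
    # Assumption: nets produced inside the cluster are fed back locally.
    local = {b.output for b in bles}
    external = set().union(*(b.inputs for b in bles)) - local
    return len(external) <= arch.I


a = BLE("x", frozenset({"a", "b"}))
b = BLE("y", frozenset({"x", "c"}))  # uses the output of BLE a
print(fits(CLBArch(N=2, k=2, I=3), [a, b]))
print(fits(CLBArch(N=2, k=2, I=2), [a, b]))
```

The two BLEs need three external nets (a, b, c), so they fit when I=3 and do not fit when I=2.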

CLB Depopulation
• Normally: CLBs fully packed
  – Reduces total # of CLBs needed for circuit
• CLB Depopulation: Tessier, DeHon
  – Do not use all BLEs
  – Increase # CLBs used
  – Decrease channel width
  – Decrease overall area
• Problem
  – Increase in # CLBs high for large circuits
  – Our work: limits # CLB increase
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]

Uniform Depopulation
• Previous work
  – Depopulate each CLB by equal amount
• But… circuit observations
  – Regions of high routing demand
  – Regions of low routing demand
• Depopulate in low congestion areas??
  – Unnecessary increase in area

Non-Uniform Depopulation
• Our depopulation method:
  – Assume congestion is localized
  – Depopulate only congested areas
• We show non-uniform depopulation
  – Effective method of channel width reduction
  – Graceful tradeoff between channel width and area
  – Makes unroutable circuits routable

Depopulation Methods to Reduce Channel Width

CLB Depopulation
• General Approach
  – Use existing clustering tools
  – Do not fill CLB while clustering
1. Input-Limited
  • E.g. maximum 67% input utilization per CLB
  • Might use all BLEs
2. BLE-Limited
  • E.g. maximum 60% BLE utilization per CLB
  • Might use all inputs
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]
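The two limits can be expressed as caps handed to an existing clustering tool. A minimal sketch: the function name, the integer-percent interface, and rounding down are assumptions. N=5 mirrors the five BLEs drawn in the figure, while I=12 is an arbitrary example value.

```python
def depopulated_limits(N, I, ble_pct=100, input_pct=100):
    """Return (max BLEs, max inputs) a clusterer may use per CLB.

    Percentages are integers so the caps are computed without
    floating-point rounding; results are rounded down.
    """
    max_bles = N * ble_pct // 100
    max_inputs = I * input_pct // 100
    return max_bles, max_inputs


# Input-limited: cap inputs at 67%, BLEs may still all be used.
input_limited = depopulated_limits(N=5, I=12, input_pct=67)
# BLE-limited: cap BLEs at 60%, inputs may still all be used.
ble_limited = depopulated_limits(N=5, I=12, ble_pct=60)

print(input_limited, ble_limited)
```

With these example values the input-limited cap is 5 BLEs and 8 inputs, and the BLE-limited cap is 3 BLEs and 12 inputs.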

Reducing Channel Width Results
• Input-Limited (max cluster size 16)
  – No channel width control
• BLE-Limited
  – (Almost) monotonically increasing: good channel width control

Benchmark Circuit Creation
(We want BIG circuits!)
(What do REALLY BIG circuits look like?)

Benchmarking Circuits: Some Observations

                      20 Largest MCNC Benchmarks    Altera Cyclone Benchmarks [CICC 2003]
LUT Range             10:1 (1,000..10,000 LUTs)     10:1 (2,500..25,000 LUTs)
Channel Width Range   4:1 (20..80 tracks)           3:1 (40..120 tracks)

• Altera has bigger benchmarks than academics
  – We noted similar characteristics:
    • Some LARGE circuits routable with NARROW routing channels
    • Some SMALL circuits need WIDE routing channels
• What if each circuit is an IP block in a larger system…??

Benchmark Creation – IP Blocks
• Mimic process of creating large designs
  – “IP Blocks” <==> MCNC Circuits
  – SoC <==> Randomly integrate/stitch together “IP Blocks”
  – IP Blocks have varied interconnect needs
• Real-life large designs: System-on-Chip Methodology
  – IP blocks (own, 3rd party)
• Re-use improves productivity
  – Primarily integration and verification effort

Benchmark Creation – Large Designs
• Considered 3 stitching schemes…
  – Independent
    • IP Blocks are not connected to each other
  – Pipeline
    • Outputs of one IP block connected to inputs of next IP block
  – Clique
    • Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
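The three stitching schemes can be sketched as a function that lists which block each output pin drives. This is an illustration only: the `stitch` name, the round-robin reading of “uniformly distributed”, and the block names and pin counts are assumptions, and input-pin assignment is not modelled.

```python
def stitch(blocks, scheme):
    """Return (source block, output index, destination block) links.

    blocks: dict mapping block name -> number of output pins,
            in the order the blocks are chained for the pipeline scheme.
    """
    names = list(blocks)
    links = []
    if scheme == "independent":
        return links  # IP blocks are not connected to each other
    for i, src in enumerate(names):
        if scheme == "pipeline":
            targets = names[i + 1:i + 2]        # only the next block, if any
        elif scheme == "clique":
            targets = names[:i] + names[i + 1:]  # every other block
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if not targets:
            continue
        for out in range(blocks[src]):
            # Round-robin spreads the outputs evenly over the targets.
            links.append((src, out, targets[out % len(targets)]))
    return links


example = {"blockA": 4, "blockB": 2, "blockC": 3}  # hypothetical pin counts
independent = stitch(example, "independent")
pipeline = stitch(example, "pipeline")
clique = stitch(example, "clique")
print(len(independent), len(pipeline), len(clique))
```

In the pipeline case the last block's outputs are left unconnected here. How the original benchmarks handled them is not stated on the slide.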

MetaCircuit: Reducing Routed Channel Width?
• Observations
  – IP blocks are tightly connected internally
  – IP blocks have varied channel width needs
• Hypotheses
  1. Placement keeps each “IP block” together
  2. If an IP block has a large routed channel width, the MetaCircuit has a large routed channel width

Hypothesis Testing: MetaCircuit P&R Results
• Use VPR FPGA tools from University of Toronto
• Hypothesis 1
  – VPR placer successfully groups IP blocks from random initial placement
• Hypothesis 2
  – VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }

Consequences of Hypothesis 2
• Question
  – Does shrinking the channel width of a few IP blocks shrink the channel width of the MetaCircuit?
• How to shrink channel widths?
  – Selective CLB Depopulation!!
  – Depopulate hard-to-route IP blocks the most
• How much to depopulate?
  – Channel width profiling of IP block…

Meeting Channel Width Constraints: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)

IP Block Properties
• Cluster IP Blocks into N=16, k=6
• VPR: determine minimum channel width for each IP Block
• Sort IP Blocks based on channel width
[Figure: IP blocks ordered from easy-to-route circuits to hard-to-route circuits]

Channel Width Profiling of IP Block
• Cluster sizes
  – NA = FPGA Architecture Cluster Size (fixed)
  – NC = BLE-Limit Size (variable)
• Sweep NC for each IP block
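The NC sweep can be sketched as a loop around whatever flow clusters, places, and routes one IP block. Everything here is an assumption made for illustration: `profile_block` is a hypothetical helper, the sweep range 1..NA is a guess, and `fake_min_channel_width` is a made-up stand-in for the real tools, not a call into VPR or any clustering tool.

```python
def profile_block(block, na, min_channel_width):
    """Sweep NC = 1..NA and record the minimum routable channel width for each NC.

    min_channel_width(block, nc) is supplied by the caller; in a real flow it
    would re-cluster the block with BLE limit nc and then place and route it.
    """
    return {nc: min_channel_width(block, nc) for nc in range(1, na + 1)}


def fake_min_channel_width(block, nc):
    """Stand-in model with invented numbers: width grows with the BLE limit."""
    base = {"easy_block": 20, "hard_block": 40}[block]
    return base + 2 * nc


profile = profile_block("hard_block", na=16, min_channel_width=fake_min_channel_width)
print(profile[10], profile[16])
```

Passing the routing step in as a function keeps the sweep independent of any particular tool invocation.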

Analysis with Constraint
• Given channel-width constraint of 60 tracks
  – tseng routable (easy)
  – clma routable for NC <= 10
  – clma not routable for NC > 10
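Reading a profile against a constraint amounts to picking the largest NC whose minimum channel width still fits. A sketch under assumptions: the profile numbers below are invented so that they reproduce the qualitative outcome on this slide (tseng easy, clma routable only up to NC = 10 at 60 tracks). They are not measured data.

```python
def max_routable_nc(profile, constraint):
    """Largest BLE limit NC whose minimum channel width meets the constraint.

    profile: dict mapping NC -> minimum routable channel width.
    Returns None if no swept NC is routable.
    """
    routable = [nc for nc, width in profile.items() if width <= constraint]
    return max(routable) if routable else None


# Hypothetical profiles, for illustration only.
clma_profile = {8: 52, 10: 60, 12: 66, 16: 75}
tseng_profile = {8: 30, 10: 33, 12: 36, 16: 40}

clma_nc = max_routable_nc(clma_profile, constraint=60)
tseng_nc = max_routable_nc(tseng_profile, constraint=60)
print(clma_nc, tseng_nc)
```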

Our Technique: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)

Uniform Depopulation
• Minimum NC Cluster Size
  – Depopulate all clusters equally
  – E.g. use NC=10 for both IP Blocks

Non-Uniform Depopulation
• Maximal NC Cluster Size
  – Depopulate each IP block according to maximal cluster size
  – E.g. clma NC=10, tseng NC=16
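The difference between the two policies shows up directly in the CLB count. In this sketch the NC values (clma 10, tseng 16) come from the slides, while the BLE counts and the estimate of CLBs as BLEs divided by NC, rounded up, are assumptions. The estimate ignores CLB input limits.

```python
def clb_count(num_bles, nc):
    """Estimated CLBs when each CLB holds at most nc BLEs (integer ceiling)."""
    return -(-num_bles // nc)


bles = {"clma": 8000, "tseng": 1000}  # hypothetical BLE counts
max_nc = {"clma": 10, "tseng": 16}    # per-block maximal NC under the constraint

# Uniform: every block uses the smallest NC that any block requires.
uniform_nc = min(max_nc.values())
uniform_clbs = sum(clb_count(n, uniform_nc) for n in bles.values())

# Non-uniform: each block keeps its own maximal NC.
non_uniform_clbs = sum(clb_count(bles[b], max_nc[b]) for b in bles)

print(uniform_clbs, non_uniform_clbs)
```

With these example numbers only tseng benefits, going from 100 CLBs at NC=10 to 63 CLBs at NC=16, while clma needs 800 CLBs under either policy.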

Uniform vs. Non-Uniform
• Non-Uniform depopulation better than Uniform
  – Lower CLB count
  – Higher LUT utilization
[Figure: total CLBs needed (x1,000) and LUT utilization vs. channel width constraint, for Non-Uniform and Uniform]

MetaCircuit Clustering Results
• Depopulate the most congested IP blocks
  – (BLE-Limit) of each IP block shown (max=16)
  – Some IP blocks are depopulated more than others

MetaCircuit P&R Results
• Clique MetaCircuit
  – P&R channel width results closely match “constraints”
[Figure: routed channel width and normalized area vs. channel width constraint]
• Shrink Channel Width
  – by ~20% (from 95 to 75), NO AREA INCREASE
  – by ~50% (from 95 to 50), 1.7x area increase
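The approximate percentages follow from the track counts on the slide. A small worked check, where only the helper name is introduced for illustration:

```python
def reduction_pct(before, after):
    """Percentage reduction in channel width going from `before` to `after` tracks."""
    return 100.0 * (before - after) / before


small_step = reduction_pct(95, 75)  # the "~20%" case
large_step = reduction_pct(95, 50)  # the "~50%" case
print(round(small_step, 1), round(large_step, 1))
```

The exact values are about 21% and about 47%, which the slide rounds to ~20% and ~50%.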

Other MetaCircuit Results

                               Channel Width Decreases
Circuit        Clustering Tool   ( < 1.05x Area )   ( 1.7x – 3.5x Area )
Clique         T-VPack           20%                50%
               iRAC Rep.         7%                 29%
Independent*   T-VPack           24%                42%
               iRAC Rep.         27%                30%
Pipeline*      T-VPack           25%                55%
               iRAC Rep.         11%                27%

* These latest results are better than those given in paper

Critical Path Delay and Average Wirelength
• Expect critical path delay to increase under tighter constraints
  – Delay “noise” due to instability of floorplan locations
• Average wirelength / net increases under tighter constraints

Conclusion
• System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources
• Depopulating CLBs effective at reducing channel width
• Non-uniform depopulation important to limit area inflation
• Channel width reduced
  – by 0-20% with < 5% area increase
  – by up to 50% with 3.3X area increase
• Effective solution to trade off CLBs for Interconnect!!!
  – UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!!!

End of Talk

Future Work
• Real-Life SoC Benchmark
  – Licensed IP: Bluetooth baseband processor
  – 325,000 ASIC gates
  – Numerous IP blocks of varying complexity
  – Needed to authenticate “Synthetic” results
• Automated technique to find “hard” IP blocks
  – Granularity is based on design hierarchy (?)
  – Replaces time-consuming Step 1 of process