Clustering of Large Designs for Channel-Width Constrained FPGAs

Marvin Tom, Guy Lemieux
University of British Columbia
Department of Electrical and Computer Engineering
Vancouver, BC, Canada

Overview
• Introduction, Goals and Motivation
  – Reduce channel width, lower cost, make circuits “routable”
• Reducing Channel Width by Depopulation
• Large Benchmark Circuits
• New Clustering Technique
  – Selective Depopulation
• Conclusions and Future Work

Mesh-Based FPGA Architecture
• Channel width
  – Number of routing tracks per channel
• Larger FPGA devices: more tiles
  – Channel width is fixed
[Figure: mesh of logic tiles (L)]

Motivation: Area of FPGA Devices
[Plot: MCNC circuits mapped onto an FPGA; SIZE of layout tile]
• Total Layout AREA = SIZE * Number of Layout Tiles

Motivation: Channel Width Demand
[Plot: MCNC circuits mapped onto an FPGA]
• Interconnect range: user has no choice!
  – Devices built for worst-case channel width (fixed width)
  – Interconnect cost dominates (>70%)
• Logic range: user buys bigger device

Goal: Reduce Channel Width
Altera Cyclone
• Channel width constraint of 80 routing tracks
Constrained FPGA
• Channel width constraint of 60 routing tracks
• Smaller area, lower cost for low-channel-width circuits
But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable…
Can we make them routable in a Constrained FPGA?

Possible Solution
• Trade-off logic utilization for channel width
  – User can always buy more logic… (not more wires)
[Figure: FPGA 1 vs. FPGA 2, trading CLB count for channel width]
But… can we achieve lower Total Area? ( = SIZE * CLB Count)
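The Total Area question above can be phrased as a break-even check. The following is a minimal sketch, assuming a hypothetical linear tile-size model (a fixed logic area plus a per-track routing area). The function names, the model, and all numbers are illustrative assumptions and do not come from the talk.

```python
def tile_size(channel_width, logic_area=30.0, area_per_track=1.0):
    """Hypothetical tile SIZE model (arbitrary units): fixed logic area
    plus routing area that grows linearly with channel width."""
    return logic_area + area_per_track * channel_width


def total_area(channel_width, clb_count):
    """Total Area = SIZE * CLB Count, as stated on the slide."""
    return tile_size(channel_width) * clb_count


def max_clb_inflation(width_before, width_after):
    """Largest CLB-count ratio for which the narrower device is no larger
    in total area than the wider one (under the assumed tile model)."""
    return tile_size(width_before) / tile_size(width_after)


# Illustrative numbers only: 1000 CLBs at 80 tracks vs. 1100 CLBs at 60 tracks.
area_wide = total_area(80, 1000)
area_narrow = total_area(60, 1100)
break_even = max_clb_inflation(80, 60)
```

Under this assumed model, the narrower device wins as long as the CLB count grows by less than the `break_even` ratio; a real comparison would need measured tile sizes.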

Logic Element: BLE and CLB
• Basic Logic Element (BLE)
  – ‘k’-input LUT + FF
• Clustered Logic Block (CLB)
  – ‘N’ BLEs, ‘N’ outputs
  – ‘I’ shared inputs
Note: I < k*N
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]

CLB Depopulation
• Normally: CLBs fully packed
  – Reduces total # of CLBs needed for circuit
• CLB Depopulation: Tessier, DeHon
  – Do not use all BLEs
  – Increase # CLBs used
  – Decrease channel width
  – Decrease overall area
• Problem
  – Increase in # CLBs high for large circuits
  – Our work: limits # CLB increase
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]

Uniform Depopulation
• Previous work
  – Depopulate each CLB by equal amount
• But… circuit observations
  – Regions of high routing demand
  – Regions of low routing demand
• Depopulate in low-congestion areas??
  – Unnecessary increase in area

Non-Uniform Depopulation
• Our depopulation method:
  – Assume congestion is localized
  – Depopulate only congested areas
• We show non-uniform depopulation
  – Effective method of channel width reduction
  – Graceful tradeoff between channel width and area
  – Makes unroutable circuits routable

Depopulation Methods to Reduce Channel Width

CLB Depopulation
• General Approach
  – Use existing clustering tools
  – Do not fill CLB while clustering
1. Input-Limited
  • E.g., maximum 67% input utilization per CLB
  • Might use all BLEs
2. BLE-Limited
  • E.g., maximum 60% BLE utilization per CLB
  • Might use all inputs
[Figure: CLB with ‘I’ inputs, BLE #1 to BLE #5, and ‘N’ outputs]
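The two limits can be illustrated with a toy packer. This is a minimal sketch, assuming a simple greedy first-fit strategy; it is not the algorithm used by T-VPack or any other existing clustering tool. The function name `pack_with_limits` and the example netlist are hypothetical. For a CLB with N = 5 BLEs, a 60% BLE limit corresponds to `max_bles=3`.

```python
def pack_with_limits(bles, max_bles, max_inputs):
    """Greedy first-fit packing of BLEs into CLBs under depopulation limits.

    bles:       list of (output_net, set_of_input_nets), in packing order.
    max_bles:   BLE limit per CLB (BLE-Limited depopulation).
    max_inputs: limit on distinct external inputs per CLB (Input-Limited).
    A BLE that fits no existing CLB always opens a new one.
    """
    clusters = []  # each cluster is a list of (output_net, input_nets)

    def external_inputs(cluster):
        # Nets consumed inside the cluster but not produced inside it.
        produced = {out for out, _ in cluster}
        used = set().union(*(ins for _, ins in cluster))
        return used - produced

    for ble in bles:
        for cluster in clusters:
            candidate = cluster + [ble]
            if (len(candidate) <= max_bles
                    and len(external_inputs(candidate)) <= max_inputs):
                cluster.append(ble)
                break
        else:
            clusters.append([ble])
    return clusters


# Hypothetical netlist: 10 BLEs, each with 2 private input nets.
netlist = [(f"o{j}", {f"a{j}", f"b{j}"}) for j in range(10)]

full = pack_with_limits(netlist, max_bles=5, max_inputs=12)
ble_limited = pack_with_limits(netlist, max_bles=3, max_inputs=12)
input_limited = pack_with_limits(netlist, max_bles=5, max_inputs=8)
```

In this toy case both limits raise the CLB count relative to full packing, which is the cost that depopulation trades against channel width.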

Reducing Channel Width Results
• Input-Limited (max cluster size 16)
  – No channel width control
• BLE-Limited
  – (Almost) monotonically increasing
  – Good channel width control

Benchmark Circuit Creation
(We want BIG circuits!)
(What do REALLY BIG circuits look like?)

Benchmarking Circuits: Some Observations

                      20 Largest MCNC Benchmarks    Altera Cyclone Benchmarks [CICC 2003]
LUT Range             10:1 (1,000..10,000 LUTs)     10:1 (2,500..25,000 LUTs)
Channel Width Range   4:1 (20..80 tracks)           3:1 (40..120 tracks)

• Altera has bigger benchmarks than academics
  – We noted similar characteristics:
    • Some LARGE circuits routable with NARROW routing channels
    • Some SMALL circuits need WIDE routing channels
• What if each circuit is IP Block in larger system…??

Benchmark Creation – IP Blocks
• Mimic process of creating large designs
  – “IP Blocks” <==> MCNC Circuits
  – SoC <==> Randomly integrate/stitch together “IP Blocks”
  – IP Blocks have varied interconnect needs
• Real-life large designs: System-on-Chip Methodology
  – IP blocks (own, 3rd party)
    • Re-use improves productivity
  – Primarily integration and verification effort

Benchmark Creation – Large Designs
• Considered 3 stitching schemes…
  – Independent
    • IP Blocks are not connected to each other
  – Pipeline
    • Outputs of one IP block connected to inputs of next IP block
  – Clique
    • Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
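The three stitching schemes can be sketched as connection generators. This is a minimal illustration only: the function `stitch`, the tuple layout, the one-to-one pin pairing in the pipeline case, and the round-robin spreading in the clique case are all assumptions, since the slides do not say how pin-count mismatches are handled.

```python
def stitch(blocks, scheme):
    """Return connections as (src_block, src_output, dst_block, dst_input).

    blocks: list of (name, num_inputs, num_outputs).
    scheme: "independent", "pipeline", or "clique".
    """
    connections = []
    if scheme == "independent":
        return connections  # IP blocks are not connected to each other
    if scheme == "pipeline":
        for (src, _, n_out), (dst, n_in, _) in zip(blocks, blocks[1:]):
            # Assumption: pair pins one-to-one; leftover pins stay unconnected here.
            for pin in range(min(n_out, n_in)):
                connections.append((src, pin, dst, pin))
        return connections
    if scheme == "clique":
        next_free = {name: 0 for name, _, _ in blocks}
        capacity = {name: n_in for name, n_in, _ in blocks}
        for idx, (src, _, n_out) in enumerate(blocks):
            others = [name for j, (name, _, _) in enumerate(blocks) if j != idx]
            if not others:
                continue
            for pin in range(n_out):
                dst = others[pin % len(others)]  # round-robin as a uniform spread
                if next_free[dst] < capacity[dst]:
                    connections.append((src, pin, dst, next_free[dst]))
                    next_free[dst] += 1
        return connections
    raise ValueError(f"unknown stitching scheme: {scheme}")


# Hypothetical IP blocks: (name, inputs, outputs).
ip_blocks = [("a", 4, 2), ("b", 4, 2), ("c", 4, 2)]
independent = stitch(ip_blocks, "independent")
pipeline = stitch(ip_blocks, "pipeline")
clique = stitch(ip_blocks, "clique")
```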

MetaCircuit: Reducing Routed Channel Width?
• Observations
  – IP blocks are tightly connected internally
  – IP blocks have varied channel width needs
• Hypotheses
  1. Placement keeps each “IP block” together
  2. An IP block has large routed channel width ==> MetaCircuit has large routed channel width

Hypothesis Testing: MetaCircuit P&R Results
• Use VPR FPGA tools from University of Toronto
• Hypothesis 1
  – VPR placer successfully groups IP blocks from random initial placement
• Hypothesis 2
  – VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }

Consequences of Hypothesis 2
• Question
  – Shrink channel width of a few IP blocks ==> shrink channel width of MetaCircuit?
• How to shrink channel widths?
  – Selective CLB Depopulation!!
  – Depopulate hard-to-route IP blocks the most
• How much to depopulate?
  – Channel width profiling of IP block…

Meeting Channel Width Constraints: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)

IP Block Properties
• Cluster IP Blocks into N=16, k=6
• VPR: determine minimum channel width for each IP Block
• Sort IP Blocks based on channel width
[Plot: IP blocks sorted from easy-to-route circuits to hard-to-route circuits]

Channel Width Profiling of IP Block
• Cluster sizes
  – NA = FPGA Architecture Cluster Size (fixed)
  – NC = BLE-Limit Size (variable)
• Sweep NC for each IP block

Analysis with Constraint
• Given channel-width constraint of 60 tracks
  – tseng routable (easy)
  – clma routable for NC <= 10
  – clma not routable for NC > 10
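Reading the maximal NC off a profile can be sketched as follows. This is a minimal illustration: the function `max_routable_nc` is a hypothetical helper, and the profile numbers are invented so that they agree with the slide (clma routable only for NC <= 10 at 60 tracks, tseng routable throughout). They are not measured data.

```python
def max_routable_nc(profile, constraint):
    """Largest BLE-limit NC whose routed channel width meets the constraint.

    profile: dict mapping NC -> minimum routed channel width at that NC.
    Returns None if no swept NC meets the constraint.
    All entries are scanned because the curve is only almost monotonic.
    """
    feasible = [nc for nc, width in profile.items() if width <= constraint]
    return max(feasible) if feasible else None


# Invented profiles (NC -> channel width), for illustration only.
tseng_profile = {8: 30, 10: 33, 12: 35, 14: 38, 16: 40}
clma_profile = {8: 52, 10: 58, 12: 64, 14: 70, 16: 76}

tseng_nc = max_routable_nc(tseng_profile, constraint=60)
clma_nc = max_routable_nc(clma_profile, constraint=60)
```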

Our Technique: Selective Depopulation
• Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
• Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)

Uniform Depopulation
• Minimum NC Cluster Size
  – Depopulate all clusters equally
  – E.g., use NC=10 for both IP Blocks

Non-Uniform Depopulation
• Maximal NC Cluster Size
  – Depopulate each IP block according to maximal cluster size
  – E.g., clma NC=10, tseng NC=16
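The difference in CLB count between the two policies can be worked through with a small calculation. This is a sketch under stated assumptions: the BLE counts below are round illustrative numbers rather than the real sizes of clma and tseng, and the helper names are hypothetical. Only the NC values (10 and 16) come from the slides.

```python
import math


def clbs_needed(num_bles, nc):
    """CLBs required when each CLB holds at most nc BLEs."""
    return math.ceil(num_bles / nc)


def uniform_clbs(blocks):
    """Uniform depopulation: every block uses the minimum NC over all blocks.

    blocks: dict mapping name -> (num_bles, maximal routable NC).
    """
    nc = min(limit for _, limit in blocks.values())
    return sum(clbs_needed(bles, nc) for bles, _ in blocks.values())


def non_uniform_clbs(blocks):
    """Non-uniform depopulation: each block uses its own maximal NC."""
    return sum(clbs_needed(bles, limit) for bles, limit in blocks.values())


# Illustrative BLE counts (assumed); NC limits follow the slide example.
blocks = {"clma": (8000, 10), "tseng": (1000, 16)}
uniform_total = uniform_clbs(blocks)
non_uniform_total = non_uniform_clbs(blocks)
```

With these assumed sizes, only the easy block (tseng) benefits from keeping NC=16, so the saving is modest; the gap grows as more of the design consists of easy-to-route blocks.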

Uniform vs. Non-Uniform
• Non-Uniform depopulation better than Uniform
  – Lower CLB count
  – Higher LUT utilization
[Plots: total CLBs needed (x 1,000) and LUT utilization vs. channel width constraint, for Uniform and Non-Uniform]

MetaCircuit Clustering Results
• Depopulate the most-congested IP blocks
  – (BLE-Limit) of each IP block shown (max=16)
  – Some IP blocks are depopulated more than others

MetaCircuit P&R Results
• Clique MetaCircuit
  – P&R channel width results closely match “constraints”
[Plots: routed channel width and normalized area vs. channel width constraint]
• Shrink channel width
  – by ~20% (from 95 to 75), NO AREA INCREASE
  – by ~50% (from 95 to 50), 1.7x area increase

Other MetaCircuit Results

Channel Width Decreases
Circuit        Clustering Tool   ( < 1.05x Area )   ( 1.7x – 3.5x Area )
Clique         T-VPack           20%                50%
               iRAC Rep.         7%                 29%
Independent*   T-VPack           24%                42%
               iRAC Rep.         27%                30%
Pipeline*      T-VPack           25%                55%
               iRAC Rep.         11%                27%

* These latest results are better than those given in paper

Critical Path Delay and Average Wirelength
• Expect critical path delay to increase under tighter constraints
  – Delay “noise” due to instability of floorplan locations
• Average wirelength / net increases under tighter constraints

Conclusion
• System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources
• Depopulating CLBs effective at reducing channel width
• Non-uniform depopulation important to limit area inflation
• Channel width reduced
  – by 0-20% with < 5% area increase
  – by up to 50% with 3.3x area increase
• Effective solution to trade-off CLBs for Interconnect!!!
  – UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!!!

End of Talk

Future Work
• Real-Life SoC Benchmark
  – Licensed IP: Bluetooth baseband processor
  – 325,000 ASIC gates
  – Numerous IP blocks of varying complexity
  – Needed to authenticate “Synthetic” results
• Automated technique to find “hard” IP blocks
  – Granularity is based on design hierarchy (?)
  – Replaces time-consuming Step 1 of process