F4-09: Virtual Architecture and Design Automation



Number of slides: 32

F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration
2009 Midyear Workshop All Hands Meeting, November 10th, 2009
Dr. Ann Gordon-Ross, Assistant Professor of ECE, University of Florida
Dr. Alan D. George, Professor of ECE, University of Florida
Research Students: Abelardo Jara, Terence Frederick, Rohit Kumar, Shaon Yousuf, University of Florida

Outline
Goals, Motivation and Challenges
Virtual Architecture for Partially Reconfigurable Embedded System (VAPRES)
MACS Inter-module Communication Architecture
Case Study Application: Embedded Target Tracking System on Virtex-4 FPGA board
Design methodology
Multiple clock domains support
Bitstream relocation
Preliminary non-PR version using Kalman filters
Design Automation for Partial Reconfiguration (DAPR)
  DAPR design flow
  VHDL annotations
  Connectivity file and graph
  Device library file
  Overlay generation

Goals, Motivations, and Challenges
GOAL: Leverage partial reconfiguration (PR) for application designers
  Architect and implement a Virtual Architecture (VA) for Partially Reconfigurable Embedded Systems
  Ease PR design via design automation
MOTIVATIONS: Increase productivity and reduce design complexity for PR designs
  VA reduces development time
  Automated design flow makes PR more amenable to system designers
  Dynamically load and unload hardware processing modules
  Processing hardware adapts to external environmental conditions
  Current PR design flow requires a very high level of specialization
  Simplifies design of systems that time-multiplex FPGA resources → smaller devices
CHALLENGES
  Provide sufficient VA flexibility with architectural parameterization
  Balance application specialization against exploration complexity
  Create new exploration algorithms/heuristics to automate PR design flow steps with respect to available PR tools
Figure labels: Sensor Coverage Area, External Trigger, Sensor Interface, Central Controlling Agent, Filter repository, Filter A, Filter B, ICAP, PRR, Processed output

F4-09 Approach
Expand the prototype FPGA-based architecture for rapid development of PR embedded systems
  VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
  MACS: Minimal Adaptive Circuit Switching mesh inter-module communication architecture for VAPRES
  Architectural support for hardware module context save and restore
  Tools to develop both reconfigurable modules and application software
DAPR: Design Automation for Partial Reconfiguration Tool
Study Virtex-4 and Virtex-5 bitstreams to leverage additional functionalities
Not optimized for a specific application
Improvement over F4-08 SCORES communication architecture
Flexible and reusable base architecture
Formulate and implement an automated PR design flow
Extend bitstream relocation and context save and restore for Virtex-5
Figure labels: VAPRES Base Architecture; Design Methodology + VAPRES Builder Tool; Highly specialized PR system design; Reconfiguration behavior known at design time; Highly optimized system floorplan based on known application

VAPRES: Architecture Design
Base PR embedded system
  Multiple clock domains: PRMs can operate at independent clock frequencies
  PRMs use FIFO-based I/O ports
  Streaming channels
Flexible scalable architecture
  Multiple architectural parameters enable base system specialization
  High speed inter-module communication architecture (MACS)
Architectural parameters
  N = number of PRRs (N = 3 in the figure)
  kr = number of streaming channels going right
  kl = number of streaming channels going left (kl = 2 in the figure)
  Some additional parameters presented next
Figure labels: Control Region (MicroBlaze, PLB Bus, SDRAM, Flash controller, UART, DCR Bridge, ICAP); Data Processing Region (PRR 1, PRR 2, PRR 3; PR Socket 1 on clk 0, PR Socket 2 on clk 1, PR Socket 3 on clk 2; Module Interfaces; FSL Interface; Slice macros; MACS switch; Network; I/O Modules to external I/O pins)

VAPRES: Design Methodology
Application designers work separately from the system designer
System designer chooses VAPRES parameters
PRM implementation is separate from base system implementation
Parametric models for VAPRES and MACS enable customization
System floorplan defines PRR sizes and shapes
VAPRES API (vapres.h): C/C++ libraries for application software development
Figure labels, Base System Flow (base system designer): Parametric VHDL models; Base system specifications; Base system design; VAPRES VHDL, MHS, MSS, and UCF; Synthesis; Floorplan; Implementation; Static bitstream
Figure labels, Application Flow (application designers): Application decomposition; Application software; PRMs; PRM design; Software implementation design; System definition files; Synthesis; Implementation; Software implementation; Partial bitstreams; Executable file; FPGA board

VAPRES: Builder Tool Overview
Automates process of building VAPRES base system and applications
  Increases designers' productivity
Builder Tool Features
  Automatic creation of VAPRES base system from parameters
    Architectural parameters
    Some additional parameters used: PRR height and width
    Base system floorplanning
    Slice macro instantiation and placement
  Automatic implementation of static and partial bitstreams
  Assisted framework for application designers
    Generates VAPRES SW libraries
    Templates for PRMs and software
Figure labels: Builder Tool; System floorplan (.ucf); Top VHDL entity (.vhd); Hardware specifications (.mhs); Software specifications (.mss); Static base system; PR modules (PRMs); Application software

VAPRES Builder - Results
Post-place-and-route implementation of the base static system for Design 1, Design 2, Design 3, and Design 4
Number of PRRs: 1, 1, 2, 2, 3
PRR height: 1 row (16 CLBs), 2 rows (32 CLBs), 2 rows (32 CLBs), 1 row (16 CLBs)
PRR width: 10 CLBs
MACS parameters: N=1, kr=1, kl=1; N=2, kr=2, kl=2; N=3, kr=2, kl=2
Maximum clock: 120.3 MHz, 117.6 MHz, 116.1 MHz, 119.3 MHz
Static region slices (without MACS): 6927, 7211, 7474 (annotations: +263 slices, +0 slices, +284 slices)
MACS slices: N/A, 928, 2745, N/A
Set of slice macros at the PRR boundary (1 set for each PRR)
≈ 280 slices more when adding an extra PRR
100 MHz constraint met for all placed-and-routed designs
N = number of PRRs = number of MACS switches, kr = number of channels between switches going in the right direction, kl = number of channels between switches going in the left direction

VAPRES - Bitstream Relocation
In-situ bitstream relocation: alters a partial bitstream (with no external inputs) to run in any PRR
Advantages
  Reduces bitstream storage requirements (only one partial bitstream per module)
  Saves the step of reading a partial bitstream from external Flash memory if a similar partial bitstream was already loaded into memory
  Enables VAPRES to dynamically place and migrate modules
Restriction: PRRs must be homogeneous (ensures sufficient resources)
Callouts
  Only one partial bitstream is necessary for each PRM
  Partial bitstreams are stored in non-volatile compact flash
  Program runs in external memory: bitstream relocator is stored in compact flash; System ACE controller loads the relocator from flash and stores it in SDRAM
  When a PRM is needed, its partial bitstream is loaded into the MicroBlaze and the relocator is called
  New partial bitstream is loaded into the correct PRR
Figure labels: System Control Region (MicroBlaze, PLB Bus, SDRAM, SystemACE, Flash, UART, ICAP); Data Processing Region, which includes one or more RSBs (Reconfigurable Streaming Blocks): PRR 1 (clk 0), PRR 2 (clk 1), SCORES Interface, FSL Interface, Switch, Network Interface, I/O Modules to external I/O pins

Overview - MACS Communication Architecture
VAPRES requires high-bandwidth, low-latency communication channels inside reconfigurable streaming blocks (RSBs)
  Novel communication architecture named SCORES was implemented in 2008
  MACS extends SCORES from linear array topology to mesh topology with a few other new features
MACS: Minimal adaptive circuit switching mesh communication architecture
Features of MACS
  Minimal-adaptive routing to explore all possible shortest paths
    Selects lowest cost path that best achieves network load distribution
  Similar interface ports for nodes and neighboring switch
    Any number (<= 6) of nodes can be put on a single switch
    Unused interface ports of switches around the edges of the NoC can be utilized
    Number of node interface ports available in an MxN NoC is <= 2(M*N + M + N)
    Reduces area overhead of communication architecture per node
    Provides low-latency path(s) between frequently communicating node pairs (if attached to same switch)
Figure: mesh of MACS switches (S) with attached nodes (N)
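The port bound quoted on this slide can be checked by counting. The sketch below assumes each switch exposes 6 interface ports (4 mesh directions plus 2 local ports), which is an assumption inferred from the "<= 6 nodes per switch" statement and not a number stated on the slide; the function names are illustrative only.

```python
def free_node_ports(m, n, ports_per_switch=6):
    """Count switch ports left for nodes in an m x n mesh.

    Assumption: every switch has `ports_per_switch` identical interface
    ports and each inter-switch link consumes one port on each end.
    """
    total_ports = ports_per_switch * m * n
    horizontal_links = m * (n - 1)   # links between left/right neighbours
    vertical_links = (m - 1) * n     # links between up/down neighbours
    return total_ports - 2 * (horizontal_links + vertical_links)


def slide_bound(m, n):
    """Upper bound quoted on the slide: 2(M*N + M + N)."""
    return 2 * (m * n + m + n)


if __name__ == "__main__":
    for m, n in [(1, 1), (2, 2), (3, 3), (2, 5)]:
        print(m, n, free_node_ports(m, n), slide_bound(m, n))
```

Under this assumption the count equals the bound exactly, because the 2MN local ports are joined by the 2M + 2N unused ports on the mesh edges.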

MACS implementation results (1/2)
9 architectural parameters can be varied
  Plotting all combinations is not feasible
  Assuming two values of each parameter requires 2^9 (512) "area usage" plots and 2^9 (512) "achievable frequency" plots
Figure 1: Area usage in number of slices per module for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.
Figure 2: Maximum operating frequency for data widths W = 8, 16, and 32 bits for a varying number of lanes per switch and local port. The x-axis in each graph varies the Kl, Kr, Kd, and Ku parameters from 1 to 3 lanes per switch port. From left to right, the graphs vary the Kll and Krl parameters from 1 to 3 lanes per local port.

MACS implementation results (2/2)
Comparison of NoCs
  Difficult due to lack of published implementation results from other authors
  Representative packet-switching NoC [1]
    Designed and realized by Bartic et al.
    8 modules attached in 2D-mesh topology
    16-bit wide data
  Similar circuit-switched NoC, i.e., PNoC [2]
    Programmable Network on Chip, designed and realized by Hilton et al.
    Single switch with 8 modules attached to it
    16-bit wide data
  Comparable configuration of MACS
    2x2 mesh of MACS switches
    W=16, Ku=Kd=Kl=Kr=Kil=Kir=1

Network Architecture    Slices   BRAMs   Frequency
MACS                    1478     0       251 MHz
Packet switching [1]    2400     8       50 MHz
PNoC [2]                1223     1       134 MHz

Comparison Results
  5x faster and 1.5x less area overhead than packet-switching NoC
  2x faster (with slight area overhead) than PNoC

1. Bartic, A., Mignolet, J. Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., and Lauwereins, R. "Highly scalable network on chip for reconfigurable systems". In Proceedings of International Symposium on System-on-Chip, 2003, pages 79-82.
2. Hilton, C. and Nelson, B. "PNoC: a flexible circuit-switched NoC for FPGA-based systems". In Proceedings of Computers and Digital Techniques, 2006, pages 181-188.

Analytical Modeling
Analytical model of SCORES/MACS streaming network
  FIFO at both ends: producer FIFO (of size D), consumer FIFO (of size C)
  Pipelined channel/medium: n-stage pipeline
Phases
  Phase I: Analysis of producer-medium and medium-consumer pairs
  Phase II: Analysis of medium-consumer with feedback
Figure labels: producer FIFO (Size D, λp), n-stage pipeline (λm, µm), consumer FIFO (Size C, λm, µc), Control Feedback Path

Phase-I: Producer-Medium Pair (1/2)
Markov-chain model
  Pk = probability associated with the queue being in state k, i.e., the queue having k packets in it
  λp = arrival rate
  μm = service rate
  D = system capacity
  Flow = sum of products of λ's, μ's and P's
Solving for steady state gives: [equation not preserved in the extracted text]
Figure: Markov chain over states 0, 1, 2, ..., k, k+1, ..., D (Size D) with state probabilities P0 ... PD, forward transition rates λp,1 ... λp,D-1 and backward transition rates μm,1 ... μm,D

Phase-I: Producer-Medium Pair (2/2)
Total probability of the system should be 1
Figure: plot of PD versus D (line size), with axis marks at 1 and 1/(D+1)
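The slide equations did not survive extraction, so the following is only a sketch of the textbook finite-capacity single-server queue (M/M/1 with capacity D), not necessarily the authors' exact model. It assumes constant arrival rate λp and service rate μm, whereas the slide's chain labels the rates per state. It does reproduce the 1/(D+1) value that appears on the plot when ρ = λp/μm = 1.

```python
def steady_state(arrival_rate, service_rate, capacity):
    """Steady-state probabilities P_0..P_D of a finite birth-death queue.

    Assumption: constant rates, so P_k is proportional to rho**k and the
    probabilities are normalised so that they sum to 1.
    """
    rho = arrival_rate / service_rate
    if abs(rho - 1.0) < 1e-12:
        # All states equally likely when arrivals match service.
        return [1.0 / (capacity + 1)] * (capacity + 1)
    norm = (1.0 - rho) / (1.0 - rho ** (capacity + 1))
    return [norm * rho ** k for k in range(capacity + 1)]


def full_probability(arrival_rate, service_rate, capacity):
    """P_D: probability that the queue is full."""
    return steady_state(arrival_rate, service_rate, capacity)[-1]


if __name__ == "__main__":
    for d in (1, 4, 16):
        print(d, full_probability(1.0, 1.0, d), full_probability(0.5, 1.0, d))
```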

Phase II: Medium-Consumer Pair with control feedback, 2D Markov Chain Model (1/2)
State variables
  Number of packets in queue (k)
  Recently reached threshold (Q)
Potential queuing at Q = 0
  Producer is filling with rate λp
  Service rate is µm
  At k = D-1, queue switches to de-queuing state
Potential de-queuing at Q = 1
  Producer is filling with reduced rate λp,1
  Consumer is emptying with µm
  At k = 1, queue switches to queuing state, i.e., Q = 0
Total probability of state Q = 1 gives the packet drop probability
Figure: two-row Markov chain for the streaming network; row Q = 0 has states P0, P1, P2, ..., Pk, ..., PD-1 with rates λp and µm; row Q = 1 has states P1,1, P2,1, ..., Pi,1, ..., PD-1,1, PD,1 with rates λp,1 and µm

Phase II: Medium-Consumer Pair with control feedback, 2D Markov Chain Model (2/2)
Equations given on the slide (not preserved in the extracted text):
  Packet drop probability when ρ ≠ 1
  Packet drop probability when ρ = 1
  Probability of FIFO being filled with 'k' packets when ρ ≠ 1
  Probability of FIFO being filled with 'k' packets when ρ = 1

Real-time Simulation and Profiling of MACS
Setup for basic experiment
  One MACS switch with both module interfaces occupied
  Network frequency = Module frequency = 100 MHz
  Producer and consumer rates are Poisson processes
  ChipScope ILA core captures all FIFO activity
  ROM holds MATLAB-generated Poisson-distributed intervals based on different λ and µ
  Producer/consumer loads its counter with a value from ROM and generates/reads a unit of data at counter overflow
System parameters: FIFO sizes = 512 bytes, Network BW = 400 MBps, Producer rate = 40 MBps, Consumer rate = 4 MBps (both generate data at Poisson-distributed random intervals), Transfer size = 0-128 KB
Results
  Link utilization = 1/10.35 before consumer FIFO is full (at transfer size ~46 KB)
  Link utilization = 1/105.8081 after consumer FIFO is full (at transfer size > 46 KB)
  Activity of both FIFOs and the probability distribution of the consumer FIFO being 'almost' full are also plotted w.r.t. transfer size
Figure: MACS switches (S) with attached nodes (N)
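The two measured utilizations sit close to simple rate ratios. As a rough check, and assuming (this is an interpretation, not a statement from the slide) that link utilization is bounded by the bottleneck data rate divided by the network bandwidth, the producer limits the link before the consumer FIFO fills and the consumer limits it afterwards:

```python
NETWORK_BW_MBPS = 400.0     # from the slide's system parameters
PRODUCER_RATE_MBPS = 40.0
CONSUMER_RATE_MBPS = 4.0


def ideal_utilization(bottleneck_rate_mbps, network_bw_mbps=NETWORK_BW_MBPS):
    """Fraction of link bandwidth used if only the bottleneck rate limits it."""
    return bottleneck_rate_mbps / network_bw_mbps


ideal_before_full = ideal_utilization(PRODUCER_RATE_MBPS)   # producer-limited
ideal_after_full = ideal_utilization(CONSUMER_RATE_MBPS)    # consumer-limited

# Measured values reported on the slide.
measured_before_full = 1 / 10.35
measured_after_full = 1 / 105.8081

if __name__ == "__main__":
    print("before full:", ideal_before_full, measured_before_full)
    print("after full: ", ideal_after_full, measured_after_full)
```

Both measured values fall slightly below the ideal ratios of 1/10 and 1/100.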

Real-time Simulation and Profiling of MACS
Setup for advanced experiment
  3x3 MACS NoC with both module interfaces occupied for each switch
  Network frequency = Module frequency = 100 MHz
  Producer and consumer rates are linear
  ChipScope ILA core captures all activities, such as request establishment, write enables for FIFO (used in link utilization calculation), average number of retrials for establishing a channel, average channel establishment latency, etc.
  Observe the aforementioned parameters for various network traffic patterns
Network traffic generation patterns
  Uniform Random: Module chooses a random destination among all the other modules and sends a packet to that destination. The probability is equal among the other modules.
  Nearest Neighbor: Each node sends a packet to a module of its immediate neighbor switch with equal probability.
  Tornado: {X, Y} will send packets to destination {(X + k/2 − 1) mod k, Y} for the k-ary network (k = 4).
  Transpose: Router with address {X, Y} will send a packet to router {Y, X}.
  Bit Complement: Node with address {b0, b1, b2, b3} in bits will send packets to the destination address NOT{b0, b1, b2, b3} in bits.
  Hot Spot: All the nodes send packets to a certain node. The hot spot can act as receiver only or as both transmitter and receiver.
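The pattern definitions in the table translate directly into destination functions. The sketch below is illustrative only: the function names, the (x, y) tuple addressing, and the 4-bit address width are assumptions, and it is not the traffic generator used in the experiment.

```python
import random


def uniform_random(src, nodes, rng=random):
    """Pick any node other than the source with equal probability."""
    return rng.choice([n for n in nodes if n != src])


def nearest_neighbor(src, k, rng=random):
    """Pick one of the adjacent switches in a k x k mesh with equal probability."""
    x, y = src
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    candidates = [(x + dx, y + dy) for dx, dy in steps
                  if 0 <= x + dx < k and 0 <= y + dy < k]
    return rng.choice(candidates)


def tornado(src, k):
    """{X, Y} -> {(X + k/2 - 1) mod k, Y}."""
    x, y = src
    return ((x + k // 2 - 1) % k, y)


def transpose(src):
    """{X, Y} -> {Y, X}."""
    x, y = src
    return (y, x)


def bit_complement(address, width=4):
    """Invert every bit of a `width`-bit node address."""
    return ~address & ((1 << width) - 1)


def hot_spot(src, hot_node):
    """Every source sends to the same designated node."""
    return hot_node


if __name__ == "__main__":
    print(tornado((0, 0), 4), transpose((1, 3)), bin(bit_complement(0b0101)))
```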

Overview - Design Automation for Partial Reconfiguration (DAPR)
Xilinx Early Access (EA) PR Flow provides PR system design support
Existing PR flow is very specialized
  Requires target device architecture knowledge
  System designer must manually apply steps
    Hierarchical coding of HDL design description, synthesis, floorplanning, timing analysis, implementation, and merge
DAPR design flow will mitigate existing PR design flow intricacies
  Manual steps
    Hierarchical HDL design description
    Modified HDL design description via system designer annotations
    System designer annotated design constraints (optional)
  Automated steps
    DAPR inputs: modified HDL design description and design constraints (parameters include bitstream size, timing, power)
    DAPR design exploration: iteratively generates a candidate design and compares the generated design's performance parameters with the system designer annotated constraints
    DAPR output: final bitstreams if the system designer constraints are met; otherwise, the final bitstreams that most closely match the system designer annotated constraints
Figure labels: EA PR Flow; HDL Design Description; HDL Synthesis; Set Design Constraints; Timing/Placement Analysis; Manual Steps; Automated Steps; DAPR Design Flow; HDL Design Description; Modified HDL Design Description; Design Constraints (optional); Implement Base Design; Implement PR Modules; DAPR Tool; Merge; Final Generated Bitstreams

Overview - DAPR Tool Phases and Description
Information Extraction
  Extract static and PR region instantiations and corresponding HDL design description filenames from the top level HDL design description file
Information Collection
  Collect and write port connection names and widths within each instantiation to the partial reconfiguration automation information file (*.paif)
Resource Estimation and Constraint Generation
  Synthesize all HDL design description files with the Xilinx XST utility
  Read and record estimated slice requirements from the generated synthesis log file (.srp) to the .paif
  Generate connectivity information and PRR floorplan using estimated resources and device information libraries
Bitstream Generation
  Implement static region and PRMs with Xilinx's ngdbuild, MAP, and PAR utilities
  Merge top, static, and PRMs with Xilinx's PR_verifydesign and PR_assemble utilities to generate final full and partial bitstreams
Figure labels: Initial input: Modified VHDL Top File (DAPR tool starts here); Phase 1 Information Extraction (PRRs identification, static region identification); Phase 2 Information Collection; PR automation information file (.paif); Phase 3 Overlay Generation (run script to synthesize modules and estimate resource requirements; perform automated floorplanning and write to User Constraint File (UCF)); Device inf. libraries (.dilf); Phase 4 Bitstream Generation (implement and merge design); Generated full and partial bitstreams

System Designer Annotations and Connectivity Information Examples
Example system designer annotations (case insensitive) and their significance
  --PRR_Start :: filename, filename…     Identifies beginning of PRR instantiation and PRM filenames (use commas to specify multiple filenames)
  --Static_Start :: filename, filename…  Identifies static region instantiation and filenames (use commas to specify multiple filenames)
  --bm_start                             Identifies slice macro instantiation
  --PRR_clock                            Identifies system top level clock
Annotated VHDL example
  --------------------------
  --PRR_start :: prm_up, prm_down
  reconfig : rmodule Port Map( led_in => rm_in_int, led_out => rm_out_int);
  --------------------------
  --static_start :: static
  led_registers : base Port Map( clk => clk, led_unreg => rm_out, led_reg => rm_in);
  --------------------------
  --bm_start
  in0 : busmacro_xc4v_l2r_sync_narrow Port Map( input0 => bml2r(0), input1 => bml2r(1), input2 => bml2r(2), ...
  --------------------------
Connectivity Information Example
  A simple example design with two PRRs
    Two 32-bit up and down counter modules map to PRR 1
    Two 8-bit up and down counter modules map to PRR 2
  Connectivity information is gathered from the .paif file, and a connectivity graph is generated for system designer verification
Design Connectivity Information Table
  Module Name/Type    Incoming Connections    Outgoing Connections
  Base/Static         40                      40
  Counter/PR          32                      32
  Counter_sm/PR       8                       8
Design Connectivity Graph: Static Region connected to Counter (32 in, 32 out) and Counter_sm (8 in, 8 out)
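To make the annotation scheme concrete, the sketch below scans VHDL text for the four comment annotations listed on this slide and extracts the comma-separated filenames. It is a hypothetical illustration, not the DAPR tool's parser: the regular expression, function name, and return format are assumptions, and the separator is accepted as either "::" or ": :" because the extracted slide text is ambiguous on spacing.

```python
import re

# Matches e.g. "--PRR_start :: prm_up, prm_down" or "--bm_start" (case insensitive).
ANNOTATION = re.compile(
    r"^\s*--+\s*(prr_start|static_start|bm_start|prr_clock)\s*(?::\s*:\s*(.*))?$",
    re.IGNORECASE,
)


def parse_annotations(vhdl_text):
    """Return a list of (annotation_kind, [filenames]) in source order."""
    found = []
    for line in vhdl_text.splitlines():
        match = ANNOTATION.match(line)
        if match is None:
            continue  # ordinary VHDL, plain comments, or dashed separators
        kind = match.group(1).lower()
        names = [n.strip() for n in (match.group(2) or "").split(",") if n.strip()]
        found.append((kind, names))
    return found


if __name__ == "__main__":
    example = """\
--------------------------
--PRR_start :: prm_up, prm_down
reconfig : rmodule Port Map( led_in => rm_in_int, led_out => rm_out_int);
--static_start :: static
--bm_start
"""
    for kind, names in parse_annotations(example):
        print(kind, names)
```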

DAPR V4LX25 Device Library
Device divided into 3 banks
  Bank 0 (left), Bank 1 (right), Bank 2 (center)
Resource representation
  Single letter with a prefix of either 1 or 0
  Letters are S for slices, D for DSP48s, F for FIFO16s, R for RAMB16s, C for DCMs, G for BUFGs
  Prefix of 0 means resource occupied, 1 means resource vacant
  Checking individual values identifies both the resource type and the resource availability
Device library file will be shown in the demo
Figure: Bank 0, Bank 2, Bank 1 (left to right)
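The resource encoding described above can be decoded with a small lookup. The sketch below assumes each entry is a two-character token such as "1S" or "0D" and that tokens are separated by whitespace; the actual layout of the device library file is not given on the slide, so both the token format and the function names are hypothetical.

```python
RESOURCE_NAMES = {
    "S": "slice",
    "D": "DSP48",
    "F": "FIFO16",
    "R": "RAMB16",
    "C": "DCM",
    "G": "BUFG",
}


def decode_entry(entry):
    """Return (resource_name, is_vacant) for a token such as '1S'."""
    entry = entry.strip()
    if len(entry) != 2 or entry[0] not in "01" or entry[1].upper() not in RESOURCE_NAMES:
        raise ValueError("unrecognised device library entry: %r" % entry)
    # Prefix 1 means vacant, prefix 0 means occupied.
    return RESOURCE_NAMES[entry[1].upper()], entry[0] == "1"


def count_vacant(entries):
    """Tally vacant resources by type."""
    counts = {}
    for entry in entries:
        name, vacant = decode_entry(entry)
        if vacant:
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_vacant("1S 1S 0S 1R 0D".split()))
```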

DAPR Overlay Generation
Overlay generation uses the cluster growth algorithm
Cluster growth algorithm works in two steps
  Linear ordering of modules
    Choose a seed module from the initial set of modules and move it to a new set of ordered modules (initially an empty set)
    Compute gain for each remaining module (gain is the number of connecting nets)
    Move the module with the highest gain to the set of ordered modules and repeat from gain computation until no modules remain in the initial set
  Place ordered modules on floorplan space
    Two types of floorplan growth: vertical and diagonal
    Current overlay generator builds floorplans vertically
      Advantage: bitstream size will be smaller
      Disadvantage: routing is difficult and will take longer
Figure: Floorplan growth direction; floorplan growths, diagonal (left) and vertical (right); colored blocks represent PRMs; label: 1 CLB wide and 16 CLB tall
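The linear-ordering step can be sketched as a greedy loop. The slide does not say how the seed is chosen or how ties are broken, so this sketch assumes the seed is the module with the most connecting nets overall and that ties go to the first module listed; the net counts in the usage example are taken from the connectivity table earlier in the deck, and all function and variable names are illustrative.

```python
def linear_ordering(modules, nets, seed=None):
    """Order modules by cluster growth.

    `nets` maps (source, destination) pairs to a number of connecting nets.
    Gain of a candidate = nets between it and the already ordered modules.
    """
    def weight(a, b):
        return nets.get((a, b), 0) + nets.get((b, a), 0)

    remaining = list(modules)
    if seed is None:
        # Assumption: start from the most heavily connected module.
        seed = max(remaining,
                   key=lambda m: sum(weight(m, o) for o in remaining if o != m))
    ordered = [seed]
    remaining.remove(seed)
    while remaining:
        best = max(remaining, key=lambda m: sum(weight(m, o) for o in ordered))
        ordered.append(best)
        remaining.remove(best)
    return ordered


if __name__ == "__main__":
    nets = {
        ("base", "counter"): 32, ("counter", "base"): 32,
        ("base", "counter_sm"): 8, ("counter_sm", "base"): 8,
    }
    print(linear_ordering(["counter_sm", "counter", "base"], nets))
```

The second step, placing the ordered modules vertically or diagonally on the floorplan, depends on device geometry and is not sketched here.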

Results - Low-Level DAPR Design Flow Numerical Results
Case study implementation results with a 32-bit counter
More designs are under test: Cordic, FFT, Matrix Multiplier

Iteration no.   Clock (MHz)   Pwr (mW)   PRR size (CLBs)
1               269.469       422        16 x 1
2               270.783       422        16 x 1
3               271.223       422        16 x 1
4               272.109       422        16 x 1
5               266.312       422        32 x 1
6               253.357       422        32 x 1
7               275.558       422        16 x 2
8               272.109       422        16 x 2
9               289.771       422        16 x 2
10              272.109       422        16 x 2
11              253.936       422        16 x 2

Partial bitstream size (KB): 4.3, 8, 8, 7.7, 7.8, 7.4, 7.7, 7.3
Figure label: 1 CLB wide and 16 CLB tall

Kalman Filter Case Study
Data format
  For the X and Y coordinates
    16-bit fixed-point representation: 1 sign bit, 8 integral bits, and 7 fractional bits
  For the 2 FIFOs
    Implemented using one Virtex-4 BRAM
    Each one is 32 bits wide (16 for X and 16 for Y) and 512 words deep
The process of the system
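The data format above can be illustrated with a short conversion routine. This is a sketch under two assumptions that the slide does not state: the sign bit is interpreted as two's complement, and X occupies the upper 16 bits of each 32-bit FIFO word. Function names are illustrative.

```python
FRAC_BITS = 7                       # 7 fractional bits
SCALE = 1 << FRAC_BITS              # 128 counts per unit
MIN_RAW, MAX_RAW = -(1 << 15), (1 << 15) - 1


def to_fixed(value):
    """Quantise a real value to the 16-bit format, saturating on overflow."""
    raw = int(round(value * SCALE))
    return max(MIN_RAW, min(MAX_RAW, raw))


def from_fixed(raw):
    """Convert a raw 16-bit integer back to a real value."""
    return raw / SCALE


def pack_xy(x_raw, y_raw):
    """Pack two 16-bit coordinates into one 32-bit FIFO word (X in the upper half)."""
    return ((x_raw & 0xFFFF) << 16) | (y_raw & 0xFFFF)


def _sign_extend16(value):
    return value - 0x10000 if value & 0x8000 else value


def unpack_xy(word):
    """Inverse of pack_xy, restoring the signs of both halves."""
    return _sign_extend16((word >> 16) & 0xFFFF), _sign_extend16(word & 0xFFFF)


if __name__ == "__main__":
    word = pack_xy(to_fixed(-3.25), to_fixed(100.0))
    x_raw, y_raw = unpack_xy(word)
    print(hex(word), from_fixed(x_raw), from_fixed(y_raw))
```

With 7 fractional bits the resolution is 1/128, and 8 integral bits plus sign give a range of roughly -256 to +256.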

Kalman filter - Introduction
Application
  Target tracking in a linear system
  Provide accurate, continuously updated information about the position of a target given a sequence of observations about its position
  Dynamic model and measurement model are linear
  Noises are Gaussian distributed
The system model
  The dynamic system model (uniform velocity motion):
  The measurement model:

Kalman filter algorithm
Initialization
Predict
  Predicted state:
  Predicted covariance:
Update
  Innovation measurement:
  Innovation covariance:
  Optimal Kalman gain:
  Update state estimate:
  Update estimate covariance:
The simplified version: fixed-gain Kalman filter
  Difference: the optimal Kalman gain is computed before processing and kept fixed
  Application: if the system is a stationary stochastic process, the Kalman gain does not change
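The equations for each step were not preserved in the extracted slide text. For reference, the standard textbook form of the discrete Kalman filter (without a control input) is given below; the symbols F, H, Q, R, P, K, and z follow common convention and are an assumption, since the slide's own notation is unknown.

```latex
% Predict
\hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1}
P_{k|k-1} = F\,P_{k-1|k-1}\,F^{T} + Q

% Update
\tilde{y}_k = z_k - H\,\hat{x}_{k|k-1}                 % innovation
S_k = H\,P_{k|k-1}\,H^{T} + R                          % innovation covariance
K_k = P_{k|k-1}\,H^{T}\,S_k^{-1}                       % optimal Kalman gain
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,\tilde{y}_k     % updated state estimate
P_{k|k} = (I - K_k H)\,P_{k|k-1}                       % updated estimate covariance
```

In the fixed-gain variant, K_k is replaced by a constant K computed offline, so the covariance recursions and the matrix inverse drop out of the run-time loop.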

Type 1: Fixed-gain Kalman filter
8 multiplications
Read and write FIFOs for Kalman filter part
The process control
  If the FIFO TX is full, stop writing to it and stop reading data from the FIFO RX, i.e., stop processing data
The time interval guarantee
  At least 3 clock cycles
Parameters input
  Parameters (fixed Kalman gain, initial values) are supplied as inputs instead of being preprogrammed in the system
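A fixed-gain filter reduces each iteration to a prediction and a gain-weighted correction. The sketch below tracks a single axis with a uniform-velocity model and a two-element fixed gain; the state layout, parameter names, and floating-point arithmetic are assumptions for illustration and do not reproduce the slide's 8-multiplier hardware datapath.

```python
def fixed_gain_step(position, velocity, measurement, gain_pos, gain_vel, dt=1.0):
    """One predict/update iteration for a single axis.

    The gains are constants supplied by the caller, mirroring the slide's
    note that the Kalman gain is provided as an input parameter.
    """
    # Predict with the uniform-velocity model.
    predicted_position = position + dt * velocity
    predicted_velocity = velocity
    # Update with the fixed gain applied to the innovation.
    innovation = measurement - predicted_position
    new_position = predicted_position + gain_pos * innovation
    new_velocity = predicted_velocity + gain_vel * innovation
    return new_position, new_velocity


def track(measurements, gain_pos, gain_vel, position=0.0, velocity=0.0, dt=1.0):
    """Run the filter over a sequence and return the position estimates."""
    estimates = []
    for z in measurements:
        position, velocity = fixed_gain_step(position, velocity, z,
                                             gain_pos, gain_vel, dt)
        estimates.append(position)
    return estimates


if __name__ == "__main__":
    print(track([1.0, 2.1, 2.9, 4.2, 5.0], gain_pos=0.5, gain_vel=0.25))
```

For X and Y tracking, the same step would be applied independently to each coordinate.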

Results & Analysis
For flexibility of application, 8 DSP48s are used to instantiate the multipliers
Resource consumption (V4LX25)
  Number of slices: 280 (2%)
  Number of DSP48s: 8 (16%)
  Maximum frequency: 156.2 MHz
  Throughput: 52 MSPS (3 cycles)
  Dynamic power consumption (100 MHz CLK): 0.06118 W
Estimated results comparison
  Bouncing ball experiment
  Fixed-gain Kalman filter is suitable
  Results calculated by FPGA are identical to Matlab

Type 2: Basic version of Kalman filter
Assuming all noises are non-coherent, four elements in the Kalman gain matrix are zero
4 divisions and 12 multiplications

Results & Analysis
Reduce the number of dividers and multipliers by resource reuse

                               4 divs & 12 muls   2 divs & 6 muls   1 div & 3 muls
Slices (V4LX25)                1958 (18%)         1316 (12%)        1033 (9%)
DSP48s                         12 (25%)           6 (12%)           3 (6%)
Max. frequency                 71.4 MHz
Processing time                23 clock cycles    24 clock cycles   26 clock cycles
Throughput                     3.1 MSPS           2.9 MSPS          2.7 MSPS
Dynamic power (50 MHz CLK)     0.09970 W          0.07556 W         0.08092 W

Estimated results comparison
  Bouncing ball experiment
  Kalman filter gain updates in each iteration
  Results calculated by FPGA are identical to Matlab
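The throughput figures in both results slides are consistent with dividing the maximum clock frequency by the number of cycles per sample. The short check below assumes exactly that relationship (one sample completed every `cycles` clock cycles at the maximum frequency); the function name is illustrative.

```python
def throughput_msps(max_freq_mhz, cycles_per_sample):
    """Samples per second in millions, assuming one sample per `cycles_per_sample` clocks."""
    return max_freq_mhz / cycles_per_sample


if __name__ == "__main__":
    # Fixed-gain filter: 156.2 MHz, 3 cycles (reported: 52 MSPS).
    print("fixed-gain:", throughput_msps(156.2, 3))
    # Basic Kalman filter variants: 71.4 MHz (reported: 3.1, 2.9, 2.7 MSPS).
    for cycles in (23, 24, 26):
        print(cycles, "cycles:", throughput_msps(71.4, cycles))
```

The computed values agree with the reported ones to within about 0.1 MSPS, which suggests the slide figures were rounded or truncated.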