

  • Number of slides: 35

Lessons Learned with Performance Prediction and Design Patterns on Molecular Dynamics
Brian Holland, Karthik Nagarajan, Saumil Merchant, Herman Lam, Alan D. George

Outline of Algorithm Design Progression
- Design flow challenges
- Performance prediction (Feb '07)
  - RC Amenability Test (RAT)
  - Application case study
- Algorithm decomposition (Jun '07)
  - Improvements to RAT
- Design patterns and methodology (Sept '07)
  - Introduction and related research
  - Expanding pattern documentation
  - Molecular dynamics case study
- Conclusions
[Figure: Design Evolution]

Design Flow Challenges
- Original mission
  - Create scientific applications for FPGAs as case studies to investigate topics such as portability and scalability
    - Molecular dynamics is one such application
  - Maximize performance and productivity using HLLs and high-performance reconfigurable computing (HPRC)
    - Applications should have significant speedup over a SW baseline
- Challenges
  - Ensuring speedup over traditional implementations
    - Particularly when the researcher is not RC-oriented
  - Exploring the design space efficiently
    - Several designs may achieve speedup, but which should be used?

Algorithm Performance
- Premises
  - (Re)designing applications is expensive
    - We only want to design once, and even then, do it efficiently
  - Scientific applications can contain extra precision
    - Floating point may not be necessary but is a SW "standard"
  - An optimal design may overuse available FPGA resources
    - Discovering resource exhaustion mid-development is expensive
- Need
  - Performance prediction
    - Quickly and with reasonable accuracy estimate the performance of a particular algorithm on a specific FPGA platform
    - Utilize simple analytic models to make prediction accessible to novices

Introduction
- RC Amenability Test (RAT)
  - Methodology for rapidly analyzing a particular algorithm's design compatibility with a specific FPGA platform and projecting speedup
- Importance of RAT
  - The design migration process is lengthy and costly
    - Allows for detailed consideration with potential tradeoff analyses
    - Creates a formal procedure, reducing the need for "expert" knowledge
- Scope of RAT
  - RAT cannot make generalizations about applications
    - Different algorithm choices will greatly affect application performance
    - Different FPGA platform architectures will affect algorithm capabilities

RAT Methodology
- Throughput test
  - Algorithm and FPGA platform are parameterized
  - Equations are used to predict speedup
- Numerical precision test
  - The RAT user should explicitly examine the impact of reducing precision on computation
  - Interrelated with the throughput test
    - The two tests essentially proceed simultaneously
- Resource utilization test
  - FPGA resource usage is estimated to determine scalability on the FPGA platform
[Figure: Overview of RAT Methodology]

Related Work
- Performance prediction via parameterization
  - "The performance analysis in this paper is not real performance prediction; rather it targets the general concern of whether or not an algorithm will fit within the memory subsystem that is designed to feed it." [1] (Illinois)
    - Applications decomposed to determine total size and computational density
    - Computational platforms characterized by memory size, bandwidth, and latency
  - Parallel, heterogeneous shared RC resource modeling [2] (ORNL)
    - System-level modeling for multi-FPGA augmented systems
- Other performance prediction research
  - Performance Prediction Model (PPM) [3]
  - Optimizing hardware function evaluation [4]
- Comparable models for conventional parallel processing systems
  - Parallel Random Access Machine (PRAM) [5]
  - LogP [6]

Throughput
- Methodology
  - Parameterize key components of the algorithm to estimate runtime
  - Use equations to determine the execution time of the RC application
  - Compare runtime with the SW baseline to determine projected speedup
  - Explore ranges of values to examine algorithm performance bounds
- Terminology
  - Element
    - Basic unit of data for the algorithm that determines the amount of computation
    - e.g. each character (element) in a string-matching algorithm will require some number of computations to complete
  - Operation
    - Basic unit of work which helps complete a data element
    - Granularity can vary greatly depending upon formulation: e.g. 1 multiply or 16 shifts could represent 1 or 16 operations, respectively
[Figure: RAT Input Parameters]

Communication and Computation
- Communication is defined by reads and writes of the FPGA
  - Note that this equation refers to a single iteration of the algorithm
  - Read and write times are a function of: the number of elements, the size of each element, and the FPGA/CPU interconnect transfer rate
- Similarly, computation is determined by: the number of operations (a function of the number of elements), parallelism/pipelining in the algorithm (throughput), and clock frequency

RC Execution Time
- Total RC execution time
  - Function of communication time, computation time, and the number of iterations required to complete the algorithm
- Overlap of computation and communication
  - Single buffered (SB): no overlap; computation and communication times are additive
  - Double buffered (DB): complete overlap; execution time is dominated by the larger term
[Figure: Example Overlap Scenarios]

Performance Equations
- Speedup
  - Compares predicted performance versus the software baseline
  - Shows performance as a function of total execution time
- Utilization
  - Computation utilization shows the effective idle time of the FPGA
  - Communication utilization illustrates interconnect saturation

Numerical Precision
- Applications should have the minimum level of precision necessary to remain within user tolerances
  - SW applications will often have extra precision due to the coarse-grain data types of general-purpose processors
  - Extra precision can be wasteful in terms of performance and resource utilization on FPGAs
- Automated floating-point to fixed-point conversion
  - Useful for exploring reduced precision in algorithm designs
  - Often requires additional coding to explore options
- Ultimately, the user must make the final determination on precision
  - RAT exists to help explore the computational performance aspects of an application, just as it helps investigate other algorithmic tradeoffs

Resource Utilization
- Intended to prevent designs that cannot be physically realized in FPGAs
  - On-chip RAM
    - Includes memory for the application core and off-chip I/O
    - Relatively simple to examine and scale prior to hardware design
  - Hardware multipliers
    - Includes a variety of vendor-specific multipliers and/or MAC units
    - Simple to compute usage with sufficient device knowledge
  - Logic elements
    - Includes look-up tables and other basic registering logic
    - Extremely difficult to predict usage before hardware design

Probability Density Function Estimation
- Parzen-window probability density function (PDF) estimation
  - Computational complexity O(Nnd)
    - N – number of discrete probability levels (i.e. bins)
    - n – number of discrete points where probability is estimated
    - d – number of dimensions
- Intended architecture
  - Eight parallel kernels each compute the discrete points versus a subset of the bins
  - Incoming data samples are processed against 256 bins
[Figure: Chosen 1-D PDF algorithm architecture]

1-D PDF Estimation Walkthrough
- Dataset parameters
  - N_elements,input: 204800 samples / 400 iterations = 512
  - N_elements,output: for 1-D PDF, output is negligible
  - N_bytes/element: each data value is 4 bytes, the size of the Nallatech communication channel
- Communication parameters
  - Models a Nallatech H101-PCIXM card containing a Virtex-4 LX100 user FPGA connected via a 133 MHz PCI-X bus
  - α parameters were established using a read/write microbenchmark for modeling transfer times
- Computation parameters
  - N_ops/element: 256 bins * 3 ops each = 768
  - throughput_proc: 8 pipelines * 3 ops = 24 ≈ 20
  - f_clock: several values are considered
- Software parameters
  - t_soft: measured from C code, compiled using gcc, and executed on a 3.2 GHz Xeon
  - N_iterations: 204800 samples / 512 elements = 400
[Figure: RAT Input Parameters of 1-D PDF]

1-D PDF Estimation Walkthrough (cont.)
- Frequency
  - Difficult to predict a priori
  - Several possible values are explored
- Prediction accuracy
  - Communication accuracy was low
    - Despite microbenchmarking, communication was longer than expected
    - Minor inaccuracies in timing for small transfers compounded over the 400 iterations of 1-D PDF
  - Computational accuracy was high
    - Throughput was rounded from 24 ops/cycle down to 20 ops/cycle
    - Conservative parallelism was warranted due to unaccounted pipeline stalls
    - Algorithm constructed in VHDL
[Figures: Performance Parameters of 1-D PDF; Example Computations from RAT Analysis]

2-D PDF Estimation
- Dimensionality
  - PDF estimation can extend to multiple dimensions
  - Significantly increases computational complexity and volume of communication
- Algorithm
  - Same construction as 1-D PDF
    - Written in VHDL
  - Targets the Nallatech H101
    - Xilinx Virtex-4 LX100 FPGA
    - PCI-X interconnect
- Prediction accuracy
  - Communication: similar to 1-D PDF, communication times were underestimated
  - Computation: computation was smaller than expected, balancing overall execution time
[Figures: RAT Input Parameters of 2-D PDF; Performance Parameters of 2-D PDF]

Molecular Dynamics
- Simulation of the physical interaction of a set of molecules over a given time interval
  - Based upon code provided by Oak Ridge National Lab (ORNL)
- Algorithm
  - 16,384-molecule data set
  - Written in Impulse C
  - XtremeData XD1000 platform
    - Altera Stratix-II EP2S180 FPGA
    - HyperTransport interconnect
  - SW baseline on a 2.4 GHz Opteron
- Challenges for accurate prediction
  - Nondeterministic runtime
    - Molecules beyond a certain threshold are assumed to have zero impact
  - Large datasets for MD
    - Exhaust FPGA local memory
[Figures: RAT Input Parameters of MD; Performance Parameters of MD]

Conclusions
- RC Amenability Test
  - Provides a simple, fast, and effective method for investigating the performance potential of a given application design on a given target FPGA platform
  - Works with empirical knowledge of RC devices to create a more efficient and effective means of application design
  - When RAT-projected speedups are found to be disappointing, the designer can quickly reevaluate the algorithm design and/or the RC platform selected as target
- Successes
  - Allows for rapid algorithm analysis before any significant hardware coding
  - Demonstrates reasonably accurate predictions despite coarse parameterization
- Applications
  - Showcases the effectiveness of RAT for deterministic algorithms like PDF estimation
  - Provides valuable qualitative insight for nondeterministic algorithms such as MD
- Future work
  - Improve support for nondeterministic algorithms through pipelining
  - Explore performance prediction for applications on multi-FPGA systems
  - Expand the methodology for numerical precision and resource utilization

Molecular Dynamics Revisited

Molecular Dynamics
- Algorithm
  - 16,384-molecule data set
  - Written in Impulse C
  - XtremeData XD1000 platform
    - Altera Stratix II EP2S180 FPGA
    - HyperTransport interconnect
  - SW baseline on a 2.4 GHz Opteron
- Parameters
  - Dataset parameters: model the volume of data used by the FPGA
  - Communication parameters: model the HyperTransport interconnect
  - Computation parameters: model the computational requirement of the FPGA
    - N_ops/element: 164000 ≈ 16384 * 10 ops — i.e. each molecule (element) takes 10 ops per interaction, per iteration
    - throughput_proc: 50 — i.e. the operations per cycle needed for >10x speedup
  - Software parameters: software baseline runtime and the iterations required to complete the RC application
[Figures: RAT Input Parameters of MD; Performance Parameters of MD]

Parameter Alterations for Pipelining
- MD optimization
  - Each molecular computation should be pipelined
  - Focus becomes less on individual molecules and more on molecular interactions
- Parameters
  - Computation parameters
    - N_ops/element: 16400 — strictly the number of interactions per element
    - throughput_pipeline: 0.333 — i.e. 3 cycles per interaction, so the pipeline can stall for only 2 extra cycles
    - N_pipelines: 15 — a guess based on predicted area usage
[Figures: Modified RAT Input Parameters of MD; Performance Parameters of MD]

Pipelined Performance Prediction
- Molecular dynamics
  - If a pipeline is possible, certain parameters become obsolete
    - The number of operations in the pipeline (i.e. its depth) is not important
    - The number of pipeline stalls becomes critical, and is much more meaningful for nondeterministic apps
- Parameters
  - N_elements: 16384^2 — number of molecular pairs
  - N_clks/element: 3 — i.e. up to two cycles can be stalls
  - N_pipelines: 15 — the same number of kernels as before
[Figure: Pipelined RAT Input Parameters of MD]

"And now for something completely different…" – Monty Python (Or is it?)

Leveraging Algorithm Designs
- Introduction
  - Molecular dynamics provided several lessons learned
    - Best design practices for coding in Impulse C
    - Algorithm optimizations for maximum performance
    - Memory staging for minimal footprint and delay
      - Sacrificing computational efficiency for decreased memory accesses
- Motivations and challenges
  - Application design should educate the researcher
    - Designs should also train other researchers
  - Unfortunately, new design work can be expensive
    - Collecting application knowledge into design patterns provides distilled lessons learned for efficient application

Design Patterns
- Object-oriented software engineering:
  - "A design pattern names, abstracts, and identifies the key aspects of a common design structure that make it useful for creating a reusable object-oriented design" (1)
- Reconfigurable computing:
  - "Design patterns offer us organizing and structuring principles that help us understand how to put building blocks (e.g., adders, multipliers, FIRs) together." (2)

(1) Gamma, Erich, et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, 1995.
(2) DeHon, André, et al., "Design Patterns for Reconfigurable Computing", Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '04), April 20-23, 2004, Napa, California.

Classification of Design Patterns – OO Textbook (1)
- Pattern categories
  - Creational: Abstract Factory, Prototype, Singleton, etc.
  - Structural: Adapter, Bridge, Proxy, etc.
  - Behavioral: Iterator, Mediator, Interpreter, etc.
- Describing patterns
  - Pattern name
  - Intent
  - Also known as
  - Motivation
  - Applicability
  - Structure
  - Participants
  - Collaborations
  - Consequences
  - Implementation
  - Sample code
  - Known uses
  - Related patterns

Sample Design Patterns – RC Paper (2)
- 14 pattern categories:
  - Area-Time Tradeoffs
  - Expressing Parallelism
  - Implementing Parallelism
  - Processor-FPGA Integration
  - Common-Case Optimization
  - Re-using Hardware Efficiently
  - Specialization
  - Partial Reconfiguration
  - Communications
  - Synchronization
  - Efficient Layout and Communications
  - Implementing Communication
  - Value-Added Memory Patterns
  - Number Representation Patterns
- 89 patterns identified (samples):
  - Coarse-Grained Time Multiplexing
  - Synchronous Dataflow
  - Multi-threaded
  - Sequential vs. Parallel Design (hardware-software partitioning)
  - SIMD
  - Communicating FSMDs
  - Instruction Augmentation
  - Exceptions
  - Pipelining
  - Worst-Case Footprint
  - Streaming Data
  - Shared Memory
  - Synchronous Clocking
  - Asynchronous Handshaking
  - Cellular Automata
  - Etc.

Representing DPs for RC Engineering
- Description section
  - Pattern name and classification
  - Intent
  - Also known as
  - Motivation
  - Applicability
  - Participants
  - Collaborations
  - Consequences
  - Known uses
  - Related patterns
- Design section
  - Structure
    - Block-diagram representation
    - Reference to major RC building blocks (BRAM, SDRAM, compute modules, etc.)
    - Rationale: compatibility with RAT
  - Specification
    - More formal representation, such as UML
    - Possibly maps to HDL
- Implementation section
  - HDL language-specific information
  - Platform-specific information
  - Sample code

Example – Time Multiplexing Pattern
- Intent – large designs on small or fixed-capacity platforms
- Motivation – meet real-time needs or inadequate design space
- Applicability – for slow reconfiguration
  - No feedback loops (acyclic dataflow)
- Participants – subgraphs (the computational graph divided into smaller subgraphs)
- Collaborations – control algorithm directs subgraph swapping
- Consequences – slow reconfiguration time, large buffers, and imperfect device resource utilization
- Known uses – video processing, target recognition
- Implementation – conventional processor issues commands for reconfiguration and collaboration
(DeHon et al., 2004)

Example – Datapath Duplication
Replicated computational structures for parallel processing
- Intent – exploiting computational parallelism in sequential programming structures (loops)
- Motivation – achieving faster performance through replication of computational structures
- Applicability – data-independent iterations
  - No feedback loops (acyclic dataflow)
- Participants – single computational kernel
- Collaborations – control algorithm directs dataflow and synchronization
- Consequences – area-time tradeoff; higher processing speed at the cost of an increased implementation footprint in hardware
- Known uses – PDF estimation, BbNN implementation, MD, etc.
- Implementation – centralized controller orchestrates data movement and synchronization of parallel processing elements

System-Level Patterns for MD
- When designing MD, the initial goal is to decompose the algorithm into parallel kernels
  - "Datapath duplication" is a potential starting pattern
  - MD will require additional modifications, since the computational structure will not divide cleanly
- "What do customers buy after viewing this item?"
  - 67% use this pattern
  - 37% alternatively use ….
- "May we also recommend?"
  - Pipelining
  - Loop fusion
[Figures: Visualization of Datapath Duplication; "On-line shopping" for design patterns]

Kernel-Level Optimization Patterns for MD – Pattern Utilization

The slide shows the original software ComputeAccel() kernel; the listing is truncated in this transcript (identifiers de-garbled):

    void ComputeAccel() {
        double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
        int j1, j2, n, k;
        rrCut = RCUT*RCUT;
        for(n=0; n < ...   /* remainder of the listing lost in extraction */

Design Pattern Effects on MD

The slide contrasts the software ComputeAccel() kernel (shown on the previous slide, and repeated here only as a truncated fragment) with the Impulse C loop after applying the patterns. The hardware listing below is de-garbled from the transcript; gaps where the extraction dropped text are marked "…", and the head of the dr0 wrap line is restored by analogy with the intact dr1/dr2 lines:

    for (i = 0; i < num*(num-1); i++) {
        cg_count_ceil_32(1, 0, i==0, num-2, &k);
        ...                           /* index computation lost in extraction */
        if (j2 >= j1) j2++;
        if (j2 == 0) rr = 0.0;
        split_64to32_flt(AL[j1], &j1y, &j1x);
        split_64to32_flt(BL[j1], &dummy, &j1z);
        split_64to32_flt(CL[j2], &j2y, &j2x);
        split_64to32_flt(DL[j2], &dummy, &j2z);
        if (j1 < j2) {
            dr0 = j1x - j2x; dr1 = j1y - j2y; dr2 = j1z - j2z;
        } else {
            dr0 = j2x - j1x; dr1 = j2y - j1y; dr2 = j2z - j1z;
        }
        /* minimum-image wrap into the simulation region */
        dr0 = dr0 - (dr0 > REGIONH0 ? REGIONH0 : MREGIONH0)
                  - (dr0 > MREGIONH0 ? REGIONH0 : MREGIONH0);
        dr1 = dr1 - (dr1 > REGIONH1 ? REGIONH1 : MREGIONH1)
                  - (dr1 > MREGIONH1 ? REGIONH1 : MREGIONH1);
        dr2 = dr2 - (dr2 > REGIONH2 ? REGIONH2 : MREGIONH2)
                  - (dr2 > MREGIONH2 ? REGIONH2 : MREGIONH2);
        rr = dr0*dr0 + dr1*dr1 + dr2*dr2;
        ri2 = 1.0/rr;
        ri6 = ri2*ri2*ri2;            /* r^-6 term */
        r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        fx = fcVal*dr0; fy = fcVal*dr1; fz = fcVal*dr2;
        if (j2 < j1) { fx = -fx; fy = -fy; fz = -fz; }
        fp_accum_32(fx, k==(num-2), 1, k==0, &ja1x, &err);
        fp_accum_32(fy, k==(num-2), 1, k==0, &ja1y, &err);
        fp_accum_32(fz, k==(num-2), 1, k==0, &ja1z, &err);
        if (rr < ...   /* cutoff test; remainder lost in extraction */

Truncated software fragment as it appears on the slide:

    void ComputeAccel() {
        double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
        int j1, j2, n, k;
        rrCut = RCUT*RCUT;
        for(n=0; n < ...   /* remainder lost in extraction */

Conclusions
- Performance prediction is a powerful technique for improving the efficiency of RC application formulation
  - Provides reasonable accuracy for a rough estimate
  - Encourages the importance of numerical precision and resource utilization in performance prediction
- Design patterns provide lessons-learned documentation
  - Record and disseminate algorithm design knowledge
  - Allow for more effective formulation of future designs
- Future work
  - Improve the connection between design patterns and performance prediction
  - Expand the design pattern methodology for better integration with RC
  - Increase the role of numerical precision in performance prediction