- Slide count: 42
The Sisal Programming Language Pat Miller Center for Applied Scientific Computing and erstwhile USF hanger-on December 18, 2003
Abstract: Richard Feynman recounted a time when, as a graduate student at Princeton, a fellow student came down the hall beaming and shouting, "Everything I know about physics is wrong! Isn't it wonderful!" The premise of Sisal is that everything you think you know about computing is wrong, and that (when you discover the inner beauty of Sisal) it is just wonderful. The big idea advocated by Sisal is that the concept of memory is the root of all evil. The Von Neumann concept that a central processor executes instructions that update a fixed set of "memory" cells is the wrong way of looking at the world. It introduces a bottleneck (the infamous Von Neumann bottleneck) wherein the CPU tries to feed itself through a tiny pipe connected to the vast stores of attached memory. Modern computer architectures go to tremendous effort to deal with this very problem: caches, vectors, multiple instruction streams, and PIMs are all ways to get around it. In Sisal, however, the concept of memory is replaced by one of "definition." This definition is semantically enforced through single assignment. An entire program can be represented as a dataflow graph. Since all operations are functional, there are no side effects, and global analysis is trivial. The compiler can restructure data and control flow with amazing abandon: fusing and splitting loops while fracturing and restructuring records to optimize for vector or superscalar architectures. It can assemble heavyweight chunks of work or find hundreds of tiny execution threads much more easily than the compiler for an imperative language like FORTRAN or C. We'll talk about the technical core of the Sisal compiler tools, see how to program in a single-assignment language, and look at some programming examples. We'll look a bit at implementations of Sisal for the good ol' Crays, the Manchester Dataflow Machine, and a systolic array processor.
We'll even look at some of the (surmountable) weaknesses that helped doom Sisal and how they are being addressed in my home hobby language: SLAG. I will assign a couple of short homework programming assignments for those interested in trying the language out. CASC PJM 2
Tanks – An allegory: There was a Pentagon working group looking at tank defense. The group included generals, defense contractors, technical gurus, etc. The question on the floor was how the cycle of defense, attack, counter-defense, etc. could be broken (e.g. attack: self-forging munitions, ablative armor, multiple sabot, etc.). Somebody on the floor offered, "Maybe we should get rid of the tanks?"
So what does that have to do with me? If we are ever to make more than evolutionary progress, we must rethink what our objectives and targets really are. We must buy into the "conservation of complexity" argument, which states that you can never remove complexity, only sweep it under a rug where it is someone else's problem.
Overview: What is Sisal — computational structures — functions — arrays and loops — conditionals. A sane implementation of an insane semantic — simple optimization — reference count optimization — update-in-place / build-in-place — parallel partitioning. Evidence of a real success. The Death of Sisal (abbreviated version). SLAG (Sisal Lives AGain)
Some Sisal Places: LLNL, University of Santiago (Chile), University of Colorado, University of Sao Paulo, DEC, University of Manchester (UK), Adelaide University (Australia), USC, University of Puerto Rico
What is Sisal: Grew out of dataflow work at MIT on the VAL language. The acronym is for Streams and Iteration in a Single Assignment Language. Defined in 1983, revised and frozen in 1985 — Sisal 2.0 (CSU) 1986 — Sisal 90 (LLNL) 1989 — frozen in part because the original grammar tables were not under configuration control :-) Original collaborators were LLNL, Colorado State U, University of Manchester, and DEC. Large (100+) install base (pre-WWW) and still cited. Threaded SMP implementations (e.g. Cray) + weird stuff
Objectives: to define a general-purpose functional language; to define a language-independent intermediate form for dataflow graphs; to develop optimization techniques for high-performance parallel applicative computing; to develop a microtasking environment that supports dataflow on conventional computer systems; to achieve execution performance comparable to imperative languages; to validate the functional style of programming for large-scale scientific applications. Note: Citations in the form [NAME 99] are referenced at the end of the talk. Citations without a year [Miller] simply refer to the person who worked on the project. [Feo 96]
Quick overview of functional languages: all operations are functions and are side-effect free — no changes to arguments nor global state — functions do not hold persistent state. No side effects implies no race conditions in a parallel environment, which implies recreatable results. Determining whether two operations can go in parallel reduces to identifying a dataflow dependency between the operations' inputs and outputs — There is a WHOLE lot of low-level parallelism in every code we write. The trick is exploiting it!
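The dependency test above can be sketched in a few lines. This is a minimal illustration, not from the talk; the names `Op` and `independent` are made up for the example. With pure operations, "can these two run in parallel?" is just a question about dataflow edges:

```python
# Illustrative sketch: with side-effect-free operations, parallelism
# detection reduces to checking dataflow dependencies.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    output: str       # the single value this pure op defines
    inputs: tuple     # the values it reads

def independent(a: Op, b: Op) -> bool:
    """Two pure ops may execute in parallel iff neither reads the
    other's result (no dataflow edge between them)."""
    return a.output not in b.inputs and b.output not in a.inputs

# x := u + v;  y := u * w   -- no edge between them: parallel
x = Op("x", ("u", "v"))
y = Op("y", ("u", "w"))
# z := x + y                -- reads both: must wait for them
z = Op("z", ("x", "y"))
```

Sharing a read (`u` here) is harmless; only a write-to-read edge orders two operations.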
Sisal Tools: frontend — converts to IF1 intermediate — originally Pascal with M4 macros — p2c conversion to C and hacked — LL(1) grammar tables. DI (debugger interpreter for IF1). OSC — Optimizing Sisal Compiler — IF1 -> optimizer intermediates -> C. TWINE — semi-compiled debugger environment — Thesis Work Is Never Ending
Basic Sisal: separate compilation, one global namespace. Functions are listed in a define to be exported. Functions take 0 or more inputs and produce 1 or more outputs — strictly typed — like FORTRAN "pure", i.e. no side effects. Arguments are named and typed; outputs are typed. The body is an expression. A well-defined error semantic ensures orderly, predictable execution even with divide-by-zero and bounds errors (sorta)
Examples

% Hello world!
define main
function main(returns array[character])
  "hello world"
end function

% Simple arrays
define main
function main(A: array[integer] returns integer, array[integer])
  for element in A
    sqr := element*element;
  returns value of sum sqr
          array of sqr
  end for
end function
Arrays and loops: Sisal was targeted at scientific computation. Arrays — vector-of-vectors syntax — size is implicit to an instance — arbitrary lower bound (default 1). Streams — non-strict containers. Loops — iterative and parallel (do-across) for loops — "dot" product indices to describe run-together indices — "cross" product loops to describe nesting
Examples

for i in 1, 10           % basic integer range
returns value of sum i   % simple reduction
end for

x, max_x := for v in velocity dot t in time
  position := v*t;
returns array of position
        value of greatest position
end for;
Examples

for i in 0, 9 cross j in 0, 9   % implied nested loop ranges
  ij := i + 10*j;
returns array of ij             % returns a 10 x 10 array[integer]
end for

for x in A                      % compute across outer dimension
  x2 := x*x;
returns array of x2             % result is the same size as A
end for;

for x in B at i, j, k           % compute across 3 nested dimensions
  x2 := x*x;
returns array of x2             % has same size and shape as B
end for;
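For readers who don't think in Sisal, the two loop products map onto familiar idioms. This is an illustrative sketch, not Sisal: "dot" pairs indices like `zip`, while "cross" nests ranges like a nested comprehension:

```python
# "dot" product loop: indices run together, pairwise.
# x, max_x := for v in velocity dot t in time ... returns array of v*t
velocity = [1.0, 2.0, 3.0]
time     = [0.5, 0.5, 0.5]
position = [v * t for v, t in zip(velocity, time)]

# "cross" product loop: implied nesting over both ranges.
# for i in 0, 9 cross j in 0, 9 ... returns array of i + 10*j
ij = [[i + 10 * j for j in range(10)] for i in range(10)]
```

The cross loop produces a 10 x 10 result; the dot loop's result has the common length of its paired ranges.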
Reductions/Filters

array of x    stream of x   % collectives

value of x
value of sum x
value of product x
value of greatest x
value of least x            % scalar reductions
value of catenate           % join arrays

value of left RRRR x        % left-associative reduction
value of right RRRR x       % right-associative
value of tree RRRR x        % tree-associative

XXXX of x when flag
XXXX of x unless flag       % filtered
Iterative form: Not all loops are implicitly data parallel. Sisal supports an iterative form that supports the idea of "loop-carried values". The loop body is allowed to reference the "old" value of a definition (variable). A separate body defines the initial values. The loop exits on a boolean control expression (no break!)
For Initial

for initial
  A := some_value();      % this is the first "iteration"
  err := 1.0;             % it sets initial values for variants...
  epsilon := 1e-6;        % and loop-constant info
while err > epsilon repeat
  A := advance( old A );  % note the use of "old" to denote the last value
  err := for element in A dot element1 in old A   % "old A" is still available
           diff := element - element1;
         returns value of sum diff*diff   % this loop will get fused with advance
         end for;
returns value of A
end for
For Initial: 3-point stencil (Array Replace)

for initial
  A := some_value();   % this is the first "iteration"
  i := array_liml(A);
while i < array_limh(A) repeat
  i := old i + 1;
  A := old A[i: ( old A[i-1] + old A[i+1] ) / 3.0 ];
returns value of A
end for

The semantic for the red line says to:
* make a copy of old A
* replace the value of old A[i] with the new value
* bind that value to the name "A"
* i.e. there are 3 full copies of the array hanging around on each iteration
* (But we didn't implement it this way)
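The copy-on-replace semantics above can be spelled out directly. This is a sketch of the naive meaning, not of what OSC actually generates; `stencil_pass` is a made-up name, and the `/ 3.0` follows the slide:

```python
# Naive semantics of A := old A[i: (old A[i-1] + old A[i+1]) / 3.0]:
# each iteration conceptually copies the old array, replaces one slot,
# and binds the copy to the new name A.
def stencil_pass(a):
    """One sweep of the 3-point stencil with copy-on-replace semantics."""
    for i in range(1, len(a) - 1):
        old_a = a
        a = old_a.copy()                          # make a copy of old A
        a[i] = (old_a[i-1] + old_a[i+1]) / 3.0    # replace old A[i]
    return a

result = stencil_pass([3.0, 0.0, 3.0])            # -> [3.0, 2.0, 3.0]
```

One full array copy per element is exactly the cost the update-in-place analysis (a few slides on) exists to remove.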
Conditionals: if-then-[elseif]-else-end if form — it's an expression — must be something in each branch [why?]. tag-case for union (variant records)

if x < 10 then x+5
elseif y > 10 then x-3
else 10
end if
Fibre I/O: structured description of data — integers, reals, double_precision, char, string

array [1: 11 22 33 ...]
presized array [1, 4: 11 22 33 44]
stream {1: 11 22 33 44 ...}
record < 3 "hello" [1: 11 22 33] >
union (2: 33)
Sisal compiles to an intermediate form: IF1 -- a dataflow graph. Consider the following:

function f(a, b: integer returns integer)
  a+b*3
end function

(figure: dataflow graph — inputs b and literal 3 feed a * node, whose result joins input a at a + node)
More complex things work with compound nodes: IF/FORALL/ITER are higher-order functions. Consider the following:

function f(a, b: integer returns integer)
  5 * if a < 10 then b+3 else a+b end if
end function

(figure: graph with a compound IF node — the test a < 10 selects between b+3 and a+b — whose result feeds a * node with literal 5)
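Dataflow execution itself is easy to sketch: a node fires once all of its input tokens have arrived, and its result flows to downstream nodes. The tiny interpreter below (illustrative only; `run_dataflow` and the graph encoding are made up, not IF1) evaluates f(a, b) = a + b*3 from the earlier slide:

```python
# Minimal dataflow interpreter: fire any node whose inputs are all ready.
import operator

def run_dataflow(nodes, inputs):
    """nodes: {name: (fn, [input names])}. Returns all computed values."""
    values = dict(inputs)
    pending = dict(nodes)
    while pending:
        for name, (fn, deps) in list(pending.items()):
            if all(d in values for d in deps):          # all tokens arrived
                values[name] = fn(*(values[d] for d in deps))  # fire
                del pending[name]
    return values

# graph for a + b*3
graph = {
    "t":   (operator.mul, ["b", "three"]),
    "out": (operator.add, ["a", "t"]),
}
result = run_dataflow(graph, {"a": 2, "b": 4, "three": 3})
```

Note there is no program counter: any node whose inputs are ready may fire, which is where the fine-grained parallelism comes from.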
Common Subexpression Elimination: If two nodes have the same opcode and the same inputs, they compute the same value.

function f(a, b: integer returns integer)
  (a+b)*(a+b)
end function

==> the two a+b nodes collapse into one. Similarly we get inlining, loop fusion, record fission, ...
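The CSE rule on this slide is a hash-consing table keyed on (opcode, input ids): asking for a node that already exists hands back the existing one. A minimal sketch (the name `make_node` is illustrative, not from OSC):

```python
# Hash-consing CSE: nodes with the same opcode and the same inputs are
# the same node, so the table returns the existing id instead of a new one.
def make_node(table, opcode, *inputs):
    """Return the node id for (opcode, inputs), creating it only if new."""
    key = (opcode, inputs)
    if key not in table:
        table[key] = len(table)   # fresh node id
    return table[key]

table = {}
t1 = make_node(table, "+", "a", "b")   # a+b
t2 = make_node(table, "+", "a", "b")   # second a+b folds onto the first
prod = make_node(table, "*", t1, t2)   # (a+b)*(a+b): only 2 nodes total
```

This only works because the graph is purely functional: with no side effects, structural equality really does imply value equality.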
Reference Count optimizations: Implicit memory allocation --> implicit deallocation — OSC used compiler-generated reference counting — GC was not appropriate because of re-use issues — the first cut spent 40% of its time reference counting — reference counting is a critical section. We can eliminate reference counts if a data dependency ensures the object stays alive. Deeper analysis can statically eliminate almost all reference counting ops and replace them with statically defined lifetimes — see [Skedzielewski Simpson 88]
Update-in-Place: Array replacement (and record replacement) are prohibitively expensive if implemented naively:

A := old A[i: new_element];
R := rec replace [field: val];

Introducing artificial dependencies can push "readers" in front of "writers" (at the cost of a small amount of fine-grain parallelism). Reference count analysis may be able to statically guarantee that the final writer may overwrite A. A dynamic reference count check (non-blocking) can allow overwrite if the reference count is one
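The dynamic check is simple at heart: if the runtime reference count on an array is one, nobody else can observe it, so "copy then replace" may become an in-place write. A sketch with an explicit `rc` field standing in for OSC's counters (`RCArray` and `replace` are illustrative names, not the real runtime):

```python
# Update-in-place via a dynamic reference count check.
class RCArray:
    def __init__(self, data):
        self.data = list(data)
        self.rc = 1              # one live reference

def replace(arr, i, value):
    """A := old A[i: value], with the update-in-place optimization."""
    if arr.rc == 1:              # sole owner: overwrite in place, no copy
        arr.data[i] = value
        return arr
    arr.rc -= 1                  # shared: fall back to copy semantics
    fresh = RCArray(arr.data)
    fresh.data[i] = value
    return fresh

a = RCArray([1, 2, 3])
b = replace(a, 1, 9)             # rc == 1: b is a itself, updated in place
a.rc = 2                         # simulate a second live reference
c = replace(a, 0, 7)             # shared: c is a fresh copy, a is untouched
```

A real runtime must make the count check and decrement atomic (the critical section from the previous slide); the sketch ignores that.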
Build-in-Place: In order to minimize data motion (and the associated copying), the "build-in-place" optimizations attempt to compute a value into its final resting place:

first_row := for ... end for;
middle    := for row in 2, array_size(A)-1 ... end for;
last_row  := for ... end for;
solution  := first_row || middle || last_row;

You want to allocate the full array and then have each loop fill its portion. You CANNOT express this in functional dataflow, since you are calling for a side effect. IF2 re-introduces state (in the form of memory buffers) to perform these optimizations
Cost estimates and parallel partitioning: The problem isn't the dearth of parallelism, it's the glut of fine-grained parallelism — early implementations eagerly forked every function and split every parallel loop, and experienced perfect inverse scaling (parallel slowdown). We switched to a system that tried to agglomerate big chunks of work — abandoned function parallelism for loop parallelism (OK on 4-processor Crays). We also needed a static performance estimate to decide which loops to split. We also made dynamic estimates to throttle small loops. Finally, there was a feedback loop that improved the static guesses.
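The agglomeration idea is just coarse chunking: rather than forking one task per iteration (and paying fork overhead that swamps the work), split a parallel loop into a few fat contiguous chunks, one per worker. A sketch of the chunking arithmetic only (the name `chunk` is illustrative; this is not OSC's actual partitioner):

```python
# Agglomerate a parallel loop into at most n_workers contiguous chunks.
def chunk(n_iters, n_workers):
    """Split range(n_iters) into big contiguous pieces, one per worker."""
    size = (n_iters + n_workers - 1) // n_workers   # ceiling division
    return [range(lo, min(lo + size, n_iters))
            for lo in range(0, n_iters, size)]

chunks = chunk(10, 4)   # 4-processor Cray style: a few fat chunks
```

Whether a loop is worth splitting at all is then a cost question, which is where the static estimates and dynamic throttling above come in.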
Implementations: The main (LLNL) implementation used the Sisal->IF1 frontend and OSC for conventional shared memory machines [Cann core, extended Miller/Denton/et al] — Alliant — Cray (1, 2, XMP, YMP) — DEC/ULTRIX — Encore — Sequent — SGI — Solaris. Did not implement error semantics and used strict streams (i.e. arrays). The current version supports generic PThreads & T3[DE]
Debugger implementations: SDB [Cann] — OSC add-on for simple breakpoint and variable inspection. DI [Skedz Yates] — was used to predict maximum available parallelism — implemented almost the full language specification — provided a command line debugger/inspector. TWINE [Miller Cedeno] — semi-compiled environment with more complete debugger information and better speed/memory usage
Early and strange implementations: Burton Smith's HEP [Allan 85]. Multi-headed VAX at LLNL (Circus and Dumbo). Manchester Dataflow Machine [Foley 87]. CMU Warp systolic array machine [no citation!]. Distributed array implementation [M Miller]. IBM Power 4 — shared memory — no hardware cache consistency — compiler-inserted cache directives [Wolski 95]
Programming Apocrypha: Quadtree matrix [McGraw Skedz]. Livermore Loops [Feo]. Salishan Problems [Feo 92]. Cray FFT [Feo]. "Simple" Hydro — 2D KL Lagrange — heavy loop fusion
The death of Sisal: The project lost LLNL funding in 1996 (except for a small slice of NSF funding through USC). Work was suspended and most of the group drifted away. Was DOE's first "Open Source" release. The source code was stashed in various parts of the globe and now lives on at SourceForge. Was referenced at http://www.llnl.gov/sisal until 6 months ago. The remnants of Sisal include — Sisal mugs (SC '91) and a banner — a video tutorial series — some dubious backup tapes holding the mortal remains of the code
Some of the reasons: Failure to penetrate the programs, stymied by the beliefs that: — There is but one language, FORTRAN with Cray pointers — We will never recode our algorithms — We will never live in a mixed-language world (see point 1) — The CRAYs will live forever and we already know how to vectorize — If I really need parallelism (a dubious thought), an automatic parallel compiler will give me enough. Short-term, programmatic focus of about 2 years. Early focus on distributed memory machines was pushed over to SMP, but... — SMP/vector work finished and polished right about the time the Meiko arrived, and the project was killed as irrelevant
Still more reasons... vector-of-vectors syntax. FIBRE I/O was universally hated. Clunky interface to/from FORTRAN — trying to avoid copies at the function call interface. Final runtime was no better than hand-tuned FORTRAN — albeit arrived at in a much shorter time! No complex type. Difficult to add reductions. Poor library support (out of the box). Wordy syntax and many keywords. A non-zero learning curve for the language. Will you be here in seven years?
SLAG: Sisal Lives AGain [Miller]. C/C++-like syntax which compiles to an XML version of IF1. Very few keywords. Modules and simplified extension interfaces (via Babel?). Distributed implementation (hopefully) — the idea is to capture NUMA — threads on SMP — MPI-2 Get or SHMEM cross-box. Will have objects, but objects cannot update state — objects birth new objects with modified state. Maybe a Mentat-style framework of cooperating functional objects. My personal target is the Sony PS* since it has vector units and will eventually go parallel (+ streams)
Slag program

module foo {
  using math, runtime;
  pi := 3.1415926;
  radians_per_proc := pi / runtime::number_of_processors;

  double daxpy(...) { ... } in "FORTRAN"
  double compute_dt(double* A) { ... } for "Python"

  int main(returns string) {   // when main() returns string, it is stdout
    n := int( argc[0] );
    if ( n < 0 ) {
      string(n) << " is negative"
    } else {
      "it is non-negative"
    }
  }
}
Terms

functional language – the only real computational element is the function call
imperative language – a language in which statements modify the contents of memory cells and set control flow
dataflow – computation is accomplished by input which flows along edges to nodes, which fire when all inputs arrive
single assignment – a name is associated with exactly one value in an execution context
referential transparency – a name and its value are always interchangeable
Terms

lazy evaluation – values are computed as needed
eager evaluation – values are computed as soon as possible
More stuff...

W Ackerman and JB Dennis. VAL – A Value-Oriented Algorithmic Language. MIT/LCS/TR-218, June 1979.
D Raymond. SISAL: A Safe and Efficient Language for Numerical Calculations. Linux Journal #80, December 2000. See http://www.linuxjournal.com/article.php?sid=4383
J Feo. Sisal. In: A Comparative Study of Parallel Programming Languages: The Salishan Problems. June 1992. UCRL-JC-110915.
S Skedzielewski and R Simpson. A Simple Method to Remove Reference Counting in Applicative Programs. UCRL-100156. 1988.
J Foley. Manchester Dataflow Machine: Preliminary Benchmark Test Evaluation. UMCS-87-11-2. 1987.
Still more stuff...

S Allan and R Oldehoeft. HEP SISAL: Parallel Functional Programming. In: Parallel MIMD Computation: HEP Supercomputer and Its Applications, pp. 123-150, June 1985.
R Wolski and D Cann. Compiler-Enforced Cache Coherence Using a Functional Language. Journal of Scientific Programming, December 1995.
J Feo. Livermore Loops in Sisal. UCID-21159. 1987.
UCRL-PRES-210560. Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48