- Number of slides: 30
Pattern Programming
ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2013. August 29, 2013. PatternProg-1
Problem Addressed
• To make parallel programming more usable and scalable.
• Parallel programming -- writing programs that use multiple computers and processors collectively to solve problems -- has a very long history but remains a challenge.
Traditional approach
• Explicitly specifying message passing (MPI), and
• Explicitly using low-level thread APIs (Pthreads, Java threads, OpenMP, …).
Need a better structured approach.
Pattern Programming Concept
The programmer begins by constructing the program using established computational or algorithmic “patterns” that provide a structure. “Design patterns” have been part of software engineering for many years:
• Reusable solutions to commonly occurring problems*
• Provide a guide to “best practices”, not a final implementation
• Provide a good, scalable design structure
• Make it easier to reason about programs
• Potential for automatic conversion into executable code, avoiding low-level programming -- we do that here.
• Particularly useful for the complexities of parallel/distributed computing
* http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
In Parallel/Distributed Computing
What patterns are we talking about?
• Low-level algorithmic patterns that might be embedded into a program, such as fork-join and broadcast/scatter/gather.
• Higher-level algorithmic patterns forming a complete program, such as workpool, pipeline, stencil, and map-reduce.
We concentrate on the higher-level “computational/algorithmic” patterns rather than the lower-level patterns.
Some Patterns
Workpool: a master distributes pieces of work to workers and collects their results.
[Diagram: master and workers as compute nodes, two-way connections, source/sink]
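As a concrete illustration of the workpool idea -- plain Java threads, not the Seeds API; all names here are ours -- a master can enqueue task units on a shared queue and let a fixed set of workers pull from it until the queue is empty:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Minimal workpool sketch: the master enqueues task ids, workers
// repeatedly take tasks and accumulate partial results.
public class WorkpoolSketch {
    // Sum of squares 0..n-1, computed by 'workers' threads pulling from a queue.
    public static long sumOfSquares(int n, int workers) throws InterruptedException {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < n; i++) tasks.add(i);      // master: diffuse the work
        AtomicLong total = new AtomicLong();           // gathered result
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                Integer t;
                while ((t = tasks.poll()) != null)     // worker: take next task
                    total.addAndGet((long) t * t);     // compute and gather
            });
            pool[w].start();
        }
        for (Thread th : pool) th.join();
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumOfSquares(1000, 4));     // prints 332833500
    }
}
```

Seeds hides exactly this kind of plumbing (queue, threads, result aggregation) behind the Diffuse/Compute/Gather methods shown later.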
Pipeline: data passes through successive stages (Stage 1 → Stage 2 → Stage 3), each handled by a worker under a master.
[Diagram: one-way connections between stages, two-way connections to master, compute nodes, source/sink]
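The pipeline pattern can likewise be sketched in plain Java (not the Seeds API; class and method names are illustrative): each stage runs in its own thread and passes items one-way to the next stage through a bounded queue, with a sentinel marking the end of the stream:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.IntUnaryOperator;

// Minimal three-stage pipeline sketch with one-way connections (queues).
public class PipelineSketch {
    static final int END = Integer.MIN_VALUE;   // sentinel: end of stream

    // A stage: reads from 'in', applies 'op', writes to 'out'.
    static Thread stage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                        IntUnaryOperator op) {
        Thread t = new Thread(() -> {
            try {
                int v;
                while ((v = in.take()) != END) out.put(op.applyAsInt(v));
                out.put(END);                   // forward the sentinel downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        t.start();
        return t;
    }

    // Pushes 1..n through stages (+1) then (*2); returns the sum of the outputs.
    public static int run(int n) throws InterruptedException {
        BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(8);
        BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(8);
        BlockingQueue<Integer> q3 = new ArrayBlockingQueue<>(8);
        stage(q1, q2, x -> x + 1);              // stage 1
        stage(q2, q3, x -> x * 2);              // stage 2
        for (int i = 1; i <= n; i++) q1.put(i); // source feeds the first stage
        q1.put(END);
        int sum = 0, v;
        while ((v = q3.take()) != END) sum += v; // sink collects the results
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(5));             // (2+3+4+5+6)*2 = 40
    }
}
```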
Divide and Conquer: the problem is recursively divided into subproblems, which are solved and then merged.
[Diagram: divide phase and merge phase, two-way connections, compute nodes, source/sink]
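Java's standard fork/join framework is a direct, shared-memory realization of divide and conquer (this sketch is ours, not part of Seeds): a task divides its range until it is small enough, computes the leaves directly, and merges the partial results on the way back up:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer sketch: sum the range [lo, hi) by recursive splitting.
public class DivideConquerSketch extends RecursiveTask<Long> {
    final long lo, hi;
    DivideConquerSketch(long lo, long hi) { this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo <= 1000) {                 // small enough: compute directly
            long s = 0;
            for (long i = lo; i < hi; i++) s += i;
            return s;
        }
        long mid = (lo + hi) / 2;              // divide
        DivideConquerSketch left = new DivideConquerSketch(lo, mid);
        DivideConquerSketch right = new DivideConquerSketch(mid, hi);
        left.fork();                           // run the left half asynchronously
        return right.compute() + left.join();  // merge the partial results
    }

    public static long sum(long n) {
        return new ForkJoinPool().invoke(new DivideConquerSketch(0, n));
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000));    // prints 499999500000
    }
}
```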
All-to-All: all compute nodes can communicate with all the other nodes.
Usually a synchronous computation -- performs a number of iterations to obtain a solution, e.g. the N-body problem.
[Diagram: fully connected compute nodes with two-way connections, master, source/sink]
Stencil: all compute nodes can communicate with only neighboring nodes.
On each iteration, each node communicates with its neighbors to get their stored computed values.
Usually a synchronous computation -- performs a number of iterations to converge on a solution, e.g. solving Laplace’s/heat equation.
[Diagram: grid of compute nodes with two-way connections to neighbors, source/sink]
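A sequential sketch of the stencil computation for Laplace's equation (ours, for illustration): each interior grid point is repeatedly replaced by the average of its four neighbors. In the parallel pattern, each compute node would hold a block of this grid and exchange edge values with its neighbors on every iteration:

```java
// Jacobi iteration for Laplace's equation on an n x n grid whose top edge
// is held at 100.0 and whose other edges are held at 0.0.
public class StencilSketch {
    public static double[][] jacobi(int n, int iters) {
        double[][] g = new double[n][n];
        for (int j = 0; j < n; j++) g[0][j] = 100.0;       // fixed hot boundary
        for (int it = 0; it < iters; it++) {
            double[][] next = new double[n][n];            // other edges stay 0.0
            for (int j = 0; j < n; j++) next[0][j] = 100.0;
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)            // 4-point stencil
                    next[i][j] = 0.25 * (g[i - 1][j] + g[i + 1][j]
                                       + g[i][j - 1] + g[i][j + 1]);
            g = next;                                      // synchronous update
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] g = jacobi(8, 500);
        System.out.println(g[1][4]);   // a point near the hot edge: large value
        System.out.println(g[4][4]);   // a point farther away: smaller value
    }
}
```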
Parallel Patterns -- Advantages
• Abstracts/hides the underlying computing environment
• Generally avoids deadlocks and race conditions
• Reduces source code size (lines of code)
• Leads to automated conversion into parallel programs without the need to write low-level message-passing routines such as MPI
• Allows hierarchical designs, with patterns embedded into patterns and pattern operators to combine patterns
Disadvantages
• A new approach to learn
• Takes away some freedom from the programmer
• Performance reduced (cf. using high-level languages instead of assembly language)
Previous/Existing Work
Patterns have been explored in several projects.
• Industrial efforts
– Intel Threading Building Blocks (TBB), Intel Cilk Plus, Intel Array Building Blocks (ArBB). Focus on very low-level patterns such as fork-join.
• Universities:
– University of Illinois at Urbana-Champaign and University of California, Berkeley
– University of Torino / Università di Pisa, Italy
Book by Intel authors
“Structured Parallel Programming: Patterns for Efficient Computation,” Michael McCool, James Reinders, Arch Robison, Morgan Kaufmann, 2012. Focuses on Intel tools.
Note on Terminology: “Skeletons”
The term “skeleton” is sometimes used to describe “patterns”, especially directed acyclic graphs with a source, a computation, and a sink. We do not make that distinction and use the term “pattern” whether the structure is directed or undirected, acyclic or cyclic. The distinction is made elsewhere.
Our approach (Jeremy Villalobos’ UNC-C Ph.D. thesis)
Focuses on a few patterns of wide applicability (e.g. workpool, synchronous all-to-all, pipeline, stencil), but Jeremy took it much further than UPCRC and Intel. He developed a higher-level framework called “Seeds” that uses the pattern approach to automatically distribute code across processor cores, computers, or geographically distributed computers and execute the parallel code.
Pattern Programming with the Seeds Framework
“Seeds” Parallel Grid Application Framework
Some key features:
• Pattern-programming Java user interface (C++ version in development)
• Self-deploys on computers, clusters, and geographically distributed computers
http://coit-grid01.uncc.edu/seeds/
Seeds Development Layers
• Basic: intended for programmers who have a basic parallel computing background. Based on skeletons and patterns.
• Advanced: used to add or extend functionality, such as: create new patterns; optimize existing patterns; or adapt an existing pattern to non-functional requirements specific to the application.
• Expert: used to provide basic services: deployment, security, communication/connectivity, changes in the environment.
Derived from Jeremy Villalobos’s Ph.D. thesis defense
Basic User Programmer Interface
The programmer selects a pattern and implements a “Module” class with three principal Java methods:
• Diffuse method – to distribute pieces of data
• Compute method – the actual computation
• Gather method – used to gather the results
The programmer also fills in details in a “run module” bootstrap class that creates an instance of the module class and starts the framework. The framework then self-deploys on a specified parallel/distributed computing platform and executes the pattern.
Example module class
Complete code (Monte Carlo pi in Assignment 1; see later for more details):

```java
package edu.uncc.grid.example.workpool;

import java.util.Random;
import java.util.logging.Level;
import edu.uncc.grid.pgaf.datamodules.Data;
import edu.uncc.grid.pgaf.datamodules.DataMap;
import edu.uncc.grid.pgaf.interfaces.basic.Workpool;
import edu.uncc.grid.pgaf.p2p.Node;

public class MonteCarloPiModule extends Workpool {
    private static final long serialVersionUID = 1L;
    private static final int DoubleDataSize = 1000;
    double total;
    int random_samples;
    Random R;

    public MonteCarloPiModule() {
        R = new Random();
    }

    public void initializeModule(String[] args) {
        total = 0;
        Node.getLog().setLevel(Level.WARNING);  // reduce verbosity for logging
        random_samples = 3000;                  // set number of random samples
    }

    public Data Compute(Data data) {  // input gets the data produced by DiffuseData()
        DataMap<String, Object> input = (DataMap<String, Object>) data;
        DataMap<String, Object> output = new DataMap<String, Object>();
        Long seed = (Long) input.get("seed");   // get random seed
        Random r = new Random();
        r.setSeed(seed);
        Long inside = 0L;
        for (int i = 0; i < DoubleDataSize; i++) {
            double x = r.nextDouble();
            double y = r.nextDouble();
            double dist = x * x + y * y;
            if (dist <= 1.0) {
                ++inside;
            }
        }
        output.put("inside", inside);  // store partial answer to return to GatherData()
        return output;                 // output will emit the partial answers done by this method
    }

    public Data DiffuseData(int segment) {
        DataMap<String, Object> d = new DataMap<String, Object>();
        d.put("seed", R.nextLong());
        return d;  // returns a random seed for each job unit
    }

    public void GatherData(int segment, Data dat) {
        DataMap<String, Object> out = (DataMap<String, Object>) dat;
        Long inside = (Long) out.get("inside");
        total += inside;  // aggregate answers from all the worker nodes
    }

    public double getPi() {  // returns value of pi based on the job done by all the workers
        double pi = (total / (random_samples * DoubleDataSize)) * 4;
        return pi;
    }

    public int getDataCount() {
        return random_samples;
    }
}
```

Note: no explicit message passing.
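The Monte Carlo logic inside the Compute method can be checked on its own in plain Java, independent of the framework (class and method names here are ours, not part of Seeds): count random points in the unit square that fall inside the unit circle; the fraction inside approaches π/4.

```java
import java.util.Random;

// Standalone check of the Monte Carlo pi estimate used by the module.
public class MonteCarloPiCheck {
    public static double estimatePi(int samples, long seed) {
        Random r = new Random(seed);
        long inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = r.nextDouble();
            double y = r.nextDouble();
            if (x * x + y * y <= 1.0) inside++;   // point lies inside the circle
        }
        return 4.0 * inside / samples;            // same formula as getPi()
    }

    public static void main(String[] args) {
        // With 3,000,000 samples the estimate is close to 3.14159…
        System.out.println(estimatePi(3_000_000, 42L));
    }
}
```

In the Seeds version, each of the 3000 job units computes 1000 such samples in Compute(), and GatherData() accumulates the per-unit counts before getPi() applies the same final formula.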
Seeds Implementations
Three Java versions available (2013):
• Full JXTA P2P version requiring an Internet connection
• JXTA P2P version not needing an external network, suitable for a single computer
• Multicore (thread-based) version for operation on a single computer
The multicore version gives much faster execution on a single computer. The only difference is a minor change in the bootstrap class.
Bootstrap class: JXTA P2P version
This code deploys the framework and starts execution of the pattern. Different patterns have similar code.

```java
package edu.uncc.grid.example.workpool;

import java.io.IOException;
import net.jxta.pipe.PipeID;
import edu.uncc.grid.pgaf.Anchor;
import edu.uncc.grid.pgaf.Operand;
import edu.uncc.grid.pgaf.Seeds;
import edu.uncc.grid.pgaf.p2p.Types;

public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Seeds.start("/path/to/seeds/seed/folder", false);
            PipeID id = Seeds.startPattern(new Operand(
                    (String[]) null,
                    new Anchor("hostname", Types.DataFlowRoll.SINK_SOURCE),
                    pi));
            System.out.println(id.toString());
            Seeds.waitOnPattern(id);
            Seeds.stop();
            System.out.println("The result is: " + pi.getPi());
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
Bootstrap class: Multicore version
• Much faster on a multicore platform
• Thread based
• The bootstrap class does not need to start and stop JXTA P2P: Seeds.start() and Seeds.stop() are not needed. Otherwise the user code is similar.

```java
public class RunMonteCarloPiModule {
    public static void main(String[] args) {
        try {
            MonteCarloPiModule pi = new MonteCarloPiModule();
            Thread id = Seeds.startPatternMulticore(new Operand(
                    (String[]) null,
                    new Anchor(args[0], Types.DataFlowRoll.SINK_SOURCE),
                    pi), 4);
            id.join();
            System.out.println("The result is: " + pi.getPi());
        } catch (SecurityException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
Measuring Time
Execution time can be measured by instrumenting the code in the bootstrap class:

```java
public class RunMyModule {
    public static void main(String[] args) {
        try {
            long start = System.currentTimeMillis();
            MyModule m = new MyModule();
            Seeds.start( … );
            PipeID id = ( … );
            Seeds.waitOnPattern(id);
            Seeds.stop();
            long stop = System.currentTimeMillis();
            double time = (double) (stop - start) / 1000.0;
            System.out.println("Execution time = " + time);
        } catch (SecurityException e) { … }
        …
```
Compiling/executing
• Can be done on the command line (an ant script is provided) or through an IDE (Eclipse)
Tutorial page: http://coit-grid01.uncc.edu/seeds/
Acknowledgements
Work initiated by Jeremy Villalobos in his Ph.D. thesis, “Running Parallel Applications on a Heterogeneous Environment with Accessible Development Practices and Automatic Scalability,” UNC-Charlotte, 2011. Jeremy developed the “Seeds” pattern programming software.
Extending the work to a teaching environment is supported by the National Science Foundation under the grant “Collaborative Research: Teaching Multicore and Many-Core Programming at a Higher Level of Abstraction,” #1141005/1141006 (2012–2015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
UNC-Charlotte Pattern Programming Research Group
http://coitweb.uncc.edu/~abw/PatternProgGroup/
Fall 2013
• Jeremy Villalobos (Ph.D. awarded, continuing involvement)
Ph.D. student
• Yasaman Kamyab Hessary (Course TA)
CS MS students
• Haoqi Zhao (MS thesis)
• Yawo Adibolo (developed the C++ version of the framework software)
CS BS students
• Matthew Edge (Senior project)
• Kevin Silliman (Senior project evaluating Yawo’s C++ framework)
Please contact B. Wilkinson if you would like to be involved in this work for academic credit.
Questions
Next step
• Assignment 1 – using the Seeds framework