
Introduction to Parallel Processing, Lecture No. 10, 24/12/2001

Homework Assignment No. 3 • May be submitted until Thursday, 27/12/2001

Final Projects • Groups 1-10 are asked to prepare their presentations for the class in two weeks. • Please send the presentation files in PowerPoint format before the lecture, or bring a burned CD-ROM to class.

Quizzes • Grading of the quizzes will be completed by Friday. • The results will be announced in the next class.

Lecture Topics • Today's topics: – Shared Memory – Cilk, OpenMP – MPI: Derived Data Types – How to Build a Beowulf

Shared Memory • Go to the PDF presentation: Chapter 8 of Wilkinson & Allen's book, “Programming with Shared Memory”

Summary • Process creation • The thread concept • Pthread routines • How data can be created as shared • Condition variables • Dependency analysis: Bernstein's conditions
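The Pthreads material itself lives in the linked chapter; as a quick refresher, here is a minimal sketch (not from the slides) of two of the items above, thread creation and joining. Compile with -lpthread.

#include <pthread.h>
#include <stdio.h>

/* Each thread prints its own id */
void *worker(void *arg) {
    int id = *(int *)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    int ids[4];
    for (int i = 0; i < 4; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], NULL);   /* wait for all workers */
    return 0;
}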

Cilk http://supertech.lcs.mit.edu/cilk

Cilk • A language for multithreaded parallel programming based on ANSI C. • Cilk is designed as a general-purpose parallel programming language. • Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.

A serial C program to compute the nth Fibonacci number.

A parallel Cilk program to compute the nth Fibonacci number.
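The two fib programs referenced on the slides above did not survive the scrape; what follows is the classic example from the Cilk documentation, assuming MIT Cilk 5 syntax (the spawn/sync keywords).

/* Serial C version */
int fib(int n) {
    if (n < 2) return n;
    else return fib(n - 1) + fib(n - 2);
}

/* Parallel Cilk version: spawned calls may run in parallel;
   sync waits for both children before x and y are used */
cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}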

Cilk (continued) • Compiling: $ cilk -O2 fib.cilk -o fib • Executing: $ fib --nproc 4 30

OpenMP • The next 5 slides are taken from the SC99 tutorial given by Tim Mattson, Intel Corporation, and Rudolf Eigenmann, Purdue University

Further Reading • High-Performance Computing, Part III: Shared Memory Parallel Processors

Back to MPI

Collective Communication: Broadcast

Collective Communication: Reduce

Collective Communication: Gather

Collective Communication: Allgather

Collective Communication: Scatter

Collective Communication: There are more collective communication commands…
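The figures illustrating these operations were lost in the scrape; the following minimal sketch (not from the slides, assuming the standard MPI C bindings) shows two of the operations named above, MPI_Bcast and MPI_Reduce.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast: the root sends the same value to every process */
    if (rank == 0) value = 100;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduce: sum one contribution per process back at the root */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("value=%d, sum of ranks=%d\n", value, sum);

    MPI_Finalize();
    return 0;
}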

Advanced Topics in MPI • MPI – Derived Data Types • MPI-2 – Parallel I/O

User Defined Types • Besides the predefined types, the user can create new types. • Compact pack/unpack

Predefined Types
MPI_DOUBLE          double
MPI_FLOAT           float
MPI_INT             signed int
MPI_LONG            signed long int
MPI_LONG_DOUBLE     long double
MPI_LONG_INT        signed long int
MPI_SHORT           signed short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_LONG   unsigned long int
MPI_UNSIGNED_SHORT  unsigned short int
MPI_BYTE

Motivation • What if you want to specify: – non-contiguous data of a single type? – contiguous data of mixed types? – non-contiguous data of mixed types? • Derived datatypes save memory, are faster, are more portable, and are more elegant.

3 Steps • 1. Construct the new datatype using appropriate MPI routines: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed • 2. Commit the new datatype: MPI_Type_commit • 3. Use the new datatype in sends/receives, etc.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Status status;
    struct { int x; int y; int z; } point;
    MPI_Datatype ptype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Three consecutive ints form one "point" value */
    MPI_Type_contiguous(3, MPI_INT, &ptype);
    MPI_Type_commit(&ptype);
    if (rank == 3) {                  /* needs at least 4 processes */
        point.x = 15; point.y = 23; point.z = 6;
        MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
        printf("P: %d received coords are (%d,%d,%d)\n",
               rank, point.x, point.y, point.z);
    }
    MPI_Finalize();
    return 0;
}

User Defined Types • MPI_TYPE_STRUCT • MPI_TYPE_CONTIGUOUS • MPI_TYPE_VECTOR • MPI_TYPE_HVECTOR • MPI_TYPE_INDEXED • MPI_TYPE_HINDEXED

MPI_TYPE_STRUCT is the most general way to construct an MPI derived type because it allows the length, location, and type of each component to be specified independently.
int MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)

Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1    array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3    array_of_types[1] = MPI_DOUBLE
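A sketch of how the example above could be realized in code; the struct name particle and its field layout are illustrative assumptions, not from the slides.

#include <mpi.h>
#include <stddef.h>

/* Matches the example above: one int followed by three doubles */
struct particle { int id; double coords[3]; };

int main(int argc, char *argv[]) {
    MPI_Datatype ptype;
    int blocklengths[2] = {1, 3};
    MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
    MPI_Aint displacements[2];

    MPI_Init(&argc, &argv);
    displacements[0] = offsetof(struct particle, id);
    displacements[1] = offsetof(struct particle, coords);
    MPI_Type_struct(2, blocklengths, displacements, types, &ptype);
    MPI_Type_commit(&ptype);
    /* ...use ptype in MPI_Send/MPI_Recv... */
    MPI_Type_free(&ptype);
    MPI_Finalize();
    return 0;
}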

MPI_TYPE_CONTIGUOUS is the simplest of these, describing a contiguous sequence of values in memory. For example,
MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_2D_POINT);
MPI_Type_contiguous(3, MPI_DOUBLE, &MPI_3D_POINT);
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_TYPE_CONTIGUOUS creates new type indicators MPI_2D_POINT and MPI_3D_POINT. These type indicators allow you to treat consecutive pairs of doubles as point coordinates in a 2-dimensional space and sequences of three doubles as point coordinates in a 3-dimensional space.

MPI_TYPE_VECTOR describes several such sequences evenly spaced but not consecutive in memory. MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR except that the distance between successive blocks is specified in bytes rather than elements. MPI_TYPE_INDEXED describes sequences that may vary both in length and in spacing.

MPI_TYPE_VECTOR
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
Example: count = 2, blocklength = 3, stride = 5

Example program:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int rank, i, j;
    MPI_Status status;
    double x[4][8];
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* 4 blocks of 1 double, stride 8: one column of the 4x8 array */
    MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 3) {
        for (i = 0; i < 4; ++i)
            for (j = 0; j < 8; ++j)
                x[i][j] = pow(10.0, i + 1) + j;
        MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
        for (i = 0; i < 4; ++i)
            printf("P: %d my x[%d][2]=%f\n", rank, i, x[i][2]);
    }
    MPI_Finalize();
    return 0;
}

The output:
P: 1 my x[0][2]=17.000000
P: 1 my x[1][2]=107.000000
P: 1 my x[2][2]=1007.000000
P: 1 my x[3][2]=10007.000000

Committing a datatype: int MPI_Type_commit(MPI_Datatype *datatype)

Obtaining Information About Derived Types • MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type. • MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy. • MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.

MPI_TYPE_EXTENT
MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)
Correction: Deprecated. Use MPI_Type_get_extent instead!
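A minimal sketch (not from the slides) of the non-deprecated MPI-2 replacement, which returns the lower bound and the extent together, shown alongside MPI_Type_size for contrast.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Aint lb, extent;
    int size;

    MPI_Init(&argc, &argv);
    /* Replacement for the deprecated MPI_Type_extent */
    MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);
    MPI_Type_size(MPI_DOUBLE, &size);
    printf("MPI_DOUBLE: lb=%ld extent=%ld size=%d\n",
           (long)lb, (long)extent, size);
    MPI_Finalize();
    return 0;
}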

Ref: Ian Foster's book: “DBPP”

MPI-2 is a set of extensions to the MPI standard. It was finalized by the MPI Forum in June 1997.

MPI-2 • New Datatype Manipulation Functions • Info Object • New Error Handlers • Establishing/Releasing Communications • Extended Collective Operations • Thread Support • Fault Tolerance

MPI-2 Parallel I/O • Motivation: – The ability to parallelize I/O can offer significant performance improvements. – User-level checkpointing is contained within the program itself.

Parallel I/O • MPI-2 supports both blocking and nonblocking I/O • MPI-2 supports both collective and non-collective I/O
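As an illustration only (the file-I/O API itself is among the topics the course skips, see below), here is a minimal sketch of MPI-2 file I/O in which each process writes one value at its own offset in parallel; the file name out.dat is an assumption.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = rank * rank;

    /* Every process writes one int at its own offset, in parallel */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, rank * (MPI_Offset)sizeof(int), &buf, 1,
                      MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}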

Complementary Filetypes

Simple File Scatter/Gather Problem

MPI-2 Parallel I/O • Related topics that will not be taught in the current course: • MPI-2 file structure • Initializing MPI-2 File I/O • Defining a View • Data Access - Reading Data • Data Access - Writing Data • Closing MPI-2 File I/O

How to Build a Beowulf

What is a Beowulf? • A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.

What is a Beowulf? A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.

Price/Performance • COTS – commodity-off-the-shelf components • Interconnection networks: LAN/SAN

How to Run Applications Faster • There are 3 ways to improve performance: – 1. Work harder – 2. Work smarter – 3. Get help • Computer analogy: – 1. Use faster hardware, e.g., reduce the time per instruction (clock cycle). – 2. Use optimized algorithms and techniques. – 3. Use multiple computers to solve the problem, that is, increase the number of instructions executed per clock cycle.

Motivation for using Clusters • The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs. • Workstation clusters are easier to integrate into existing networks than special parallel computers.

Beowulf-class Systems: A New Paradigm for the Business of Computing • Brings high-end computing to broad-ranged problems – new markets • Order-of-magnitude price/performance advantage • Commodity enabled – no long development lead times • Low vulnerability to vendor-specific decisions – companies are ephemeral; Beowulfs are forever • Rapid-response technology tracking • Just-in-place user-driven configuration – requirement responsive • Industry-wide, non-proprietary software environment

Beowulf Project - A Brief History • Started in late 1993 • NASA Goddard Space Flight Center – NASA JPL, Caltech, academic and industrial collaborators • Sponsored by the NASA HPCC Program • Applications: single-user science station – data intensive – low cost • General focus: – single-user (dedicated) science and engineering applications – system scalability – Ethernet drivers for Linux

Beowulf System at JPL (Hyglac) • 16 Pentium Pro PCs, each with a 2.5-Gbyte disk, 128-Mbyte memory, and a Fast Ethernet card. • Connected using a 100Base-T network, through a 16-way crossbar switch. • Theoretical peak performance: 3.2 GFlop/s. • Achieved sustained performance: 1.26 GFlop/s.

Cluster Computing - Research Projects (partial list) • Beowulf (Caltech and NASA) - USA • Condor - University of Wisconsin, USA • HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA • MOSIX - Hebrew University of Jerusalem, Israel • MPI (MPI Forum; MPICH is one of the popular implementations) • NOW (Network of Workstations) - Berkeley, USA • NIMROD - Monash University, Australia • NetSolve - University of Tennessee, USA • PBS (Portable Batch System) - NASA Ames and LLNL, USA • PVM - Oak Ridge National Lab/UTK/Emory, USA

Motivation for using Clusters • Surveys show utilisation of CPU cycles of desktop workstations is typically <10%. • Performance of workstations and PCs is rapidly improving. • As performance grows, percent utilisation will decrease even further! • Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

Motivation for using Clusters • The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the nonstandard nature of many parallel systems. • Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms. • Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!

Original Food Chain Picture

1984 Computer Food Chain: Mainframe, Mini Computer, Vector Supercomputer, Workstation, PC

1994 Computer Food Chain: Mini Computer (hitting wall soon), Mainframe (future is bleak), Vector Supercomputer, MPP, Workstation, PC

Computer Food Chain (Now and Future)

Taxonomy (figure): Parallel Computing, Cluster Computing, MetaComputing, Pile of PCs, NOW/COW, Beowulf, NT-PC Cluster, Tightly Coupled Vector, WS Farms/cycle harvesting, DASHMEM-NUMA

PC Clusters: small, medium, large

Computing Elements (figure): Applications; Threads Interface; Operating System; Microkernel; Multi-Processor Computing System (processors, threads, processes); Hardware

Networking • Topology • Hardware • Cost • Performance

Cluster Building Blocks

Channel Bonding

Myrinet 2000 switch; Myrinet 2000 NIC

Example: 320-host Clos topology of 16-port switches (figure: groups of 64 hosts; from Myricom)

Myrinet • Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports. • Flow control, error control, and "heartbeat" continuity monitoring on every link. • Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications. • Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts. • Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.

Myrinet • Sustained one-way data rate for large messages: 1.92 Gbit/s • Latency for short messages: 9 µsec

Gigabit Ethernet • Switches by 3Com and Avaya: Cajun 550, Cajun P882, Cajun M770

Network Topology

Network Topology

Network Topology

Topology of the Velocity+ Cluster at CTC

Software: all this list for free! • Compilers: FORTRAN, C/C++ • Java: JDK from Sun, IBM and others • Scripting: Perl, Python, awk… • Editors: vi, (x)emacs, kedit, gedit… • Scientific writing: LaTeX, Ghostview… • Plotting: gnuplot • Image processing: xview …and much more!!!

Building a Parallel Cluster • 32 top-of-the-line processors • A fast communication network

Hardware: Dual P4, 2 GHz

How much does it cost? • Dual Pentium 4 machine with 2 GB of fast RDRAM memory (1 GB memory/CPU): $3,000 • Operating system: $0 (Linux)

How much does it cost? • Myrinet 2000 NIC (PCI, 64-bit @ 133 MHz) with 2 MB memory: $1,195 • Myrinet-2000 fiber cables, 3 m long: $110 • 16-port switch with fiber ports: $5,625

How much does it cost? • KVM: 16-port, ~$1,000 • Avocent (Cybex), using Cat5, IP over Ethernet

How much does it cost? • Computers: $3,000 × 16 = $48,000 • Network cards: ($1,195 + $110) × 16 = $20,880 • Switch: $5,625 • KVM: $1,000 • Monitor + miscellaneous: $500 • Total: $76,005

Theoretical peak computing power: • 2 × 32 = 64 GFLOPS • $76,000 / 64 = $1,187/GFLOP • Less than $1.2/MFLOP!!!

What else is needed? • Space!, air conditioning (cooling), backup power (UPS). • It is convenient to have one of the stations serve as a file server (NFS or another file-sharing system). • User management, with a tool such as NIS. • Connection to an external network: one of the stations routes between the internal IP address space and the external one. • A monitoring tool such as bWatch.

Installing the System • First, a single computer can be installed. • The rest of the computers can then be installed by cloning the first computer's hard disk (e.g., with software such as Ghost).

Installing Software XXX (e.g., MPI) • Download xxx.tar.gz • Uncompress: gzip -d xxx.tar.gz • Untar: tar xvf xxx.tar • Prepare the makefile: ./configure • Build: make (Makefile)

Parallel programming requires… • "rlogin" must be allowed (xinetd: disable=no) • Create an ".rhosts" file • Parallel administration tools: "brsh", "prsh", and self-made scripts.

References • Beowulf: http://www.beowulf.org • Computer Architecture: http://www.cs.wisc.edu/~arch/www/

Next Week • More MPI topics • Grid Computing • Parallel computing in scientific problems • Summary • Please start working on the projects! The presentations begin in two weeks!