Introduction to Parallel Processing, Lecture No. 10, 24/12/2001
Homework Assignment No. 3 • Can be submitted until Thursday, 27/12/2001.
Final Projects • Groups 1-10 are asked to prepare their presentations for the class in two weeks. • Please send the presentation files in PowerPoint format before the lecture, or bring a burned CD-ROM to class.
Quizzes • Grading of the quizzes will be finished by Friday. • The results will be published in the next class.
Lecture Topics • Today's topics: – Shared Memory – Cilk, OpenMP – MPI – Derived Data Types – How to Build a Beowulf
Shared Memory • See the PDF presentation: Chapter 8, "Programming with Shared Memory", from Wilkinson & Allen's book.
Summary • Process creation • The thread concept • Pthread routines • How data can be created as shared • Condition variables • Dependency analysis: Bernstein's conditions
Cilk http://supertech.lcs.mit.edu/cilk
Cilk • A language for multithreaded parallel programming based on ANSI C. • Cilk is designed as a general-purpose parallel programming language. • Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.
A serial C program to compute the nth Fibonacci number.
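A minimal sketch of such a program (the classic doubly recursive formulation commonly used in the Cilk documentation; the command-line handling here is illustrative):

#include <stdio.h>
#include <stdlib.h>

/* fib(n) = fib(n-1) + fib(n-2), computed serially */
int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main(int argc, char *argv[])
{
    int n = (argc > 1) ? atoi(argv[1]) : 30;
    printf("fib(%d) = %d\n", n, fib(n));
    return 0;
}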
A parallel Cilk program to compute the nth Fibonacci number.
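A sketch of the Cilk version in Cilk-5 syntax: the cilk keyword marks a parallel procedure, spawn starts a child that may run in parallel, and sync waits for all spawned children before their results are combined:

#include <stdio.h>
#include <stdlib.h>
#include <cilk.h>

cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;                     /* wait for both children */
        return x + y;
    }
}

cilk int main(int argc, char *argv[])
{
    int n, result;
    n = atoi(argv[1]);
    result = spawn fib(n);
    sync;
    printf("fib(%d) = %d\n", n, result);
    return 0;
}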
Cilk (continued) • Compiling: $ cilk -O2 fib.cilk -o fib • Executing: $ fib --nproc 4 30
OpenMP • The next 5 slides are taken from the SC'99 tutorial given by Tim Mattson (Intel Corporation) and Rudolf Eigenmann (Purdue University).
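As a minimal, hedged illustration of the OpenMP style (compiler directives for shared-memory parallelism in C; the array size and loop body are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    int i;

    /* the iterations of the loop are divided among the threads of the team */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("maximum number of threads: %d\n", omp_get_max_threads());
    return 0;
}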
Further reading: High-Performance Computing, Part III: Shared Memory Parallel Processors
Back to MPI
Collective Communication Broadcast
Collective Communication Reduce
Collective Communication Gather
Collective Communication Allgather
Collective Communication Scatter
Collective Communication There are more collective communication commands…
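A hedged sketch of how several of these collectives are combined in practice: the root broadcasts a parameter, scatters a data array, every process computes a partial sum, and a reduction collects the total (the array size and values are illustrative, and the number of processes is assumed to divide it evenly):

#include <stdio.h>
#include <mpi.h>

#define N 64

int main(int argc, char *argv[])
{
    int rank, size, i, per_proc, local = 0, total = 0;
    int data[N], chunk[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    per_proc = N / size;                       /* assume size divides N */
    if (rank == 0)
        for (i = 0; i < N; i++) data[i] = i;

    MPI_Bcast(&per_proc, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(data, per_proc, MPI_INT, chunk, per_proc, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < per_proc; i++) local += chunk[i];
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %d\n", total);
    MPI_Finalize();
    return 0;
}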
Advanced Topics in MPI • MPI: Derived Data Types • MPI-2: Parallel I/O
User Defined Types • Besides the predefined types, the user can create new datatypes. • Compact pack/unpack
Predefined Types
MPI_DOUBLE           double
MPI_FLOAT            float
MPI_INT              signed int
MPI_LONG             signed long int
MPI_LONG_DOUBLE      long double
MPI_LONG_INT         (long, int) pair (used with MPI_MINLOC/MPI_MAXLOC)
MPI_SHORT            signed short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_LONG    unsigned long int
MPI_UNSIGNED_SHORT   unsigned short int
MPI_BYTE             (no corresponding C type)
Motivation • What if you want to specify: – non-contiguous data of a single type? – contiguous data of mixed types? – non-contiguous data of mixed types? • Derived datatypes save memory, are faster and more portable, and make code more elegant.
3 Steps 1. Construct the new datatype using the appropriate MPI routine: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed 2. Commit the new datatype: MPI_Type_commit 3. Use the new datatype in sends/receives, etc.
User Defined Types • MPI_TYPE_STRUCT • MPI_TYPE_CONTIGUOUS • MPI_TYPE_VECTOR • MPI_TYPE_HVECTOR • MPI_TYPE_INDEXED • MPI_TYPE_HINDEXED
MPI_TYPE_STRUCT is the most general way to construct an MPI derived type because it allows the length, location, and type of each component to be specified independently. int MPI_Type_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1   array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3   array_of_types[1] = MPI_DOUBLE
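A hedged sketch of how these parameters map onto a C struct with one int followed by three doubles (the struct and function names are illustrative; the displacements are taken from the actual layout with offsetof):

#include <stddef.h>
#include <mpi.h>

struct particle { int id; double coord[3]; };   /* illustrative layout */

void build_particle_type(MPI_Datatype *newtype)
{
    int          blocklens[2] = { 1, 3 };
    MPI_Aint     displs[2]    = { offsetof(struct particle, id),
                                  offsetof(struct particle, coord) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

    MPI_Type_struct(2, blocklens, displs, types, newtype);   /* step 1: construct */
    MPI_Type_commit(newtype);                                 /* step 2: commit    */
    /* step 3: use *newtype in MPI_Send / MPI_Recv, etc.                            */
}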
MPI_TYPE_CONTIGUOUS is the simplest of these, describing a contiguous sequence of values in memory. For example: MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_2D_POINT); MPI_Type_contiguous(3, MPI_DOUBLE, &MPI_3D_POINT); Prototype: int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS creates the new type indicators MPI_2D_POINT and MPI_3D_POINT. These type indicators allow you to treat consecutive pairs of doubles as point coordinates in 2-dimensional space and sequences of three doubles as point coordinates in 3-dimensional space.
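A hedged snippet assuming the MPI_Type_contiguous calls above: once MPI_2D_POINT is committed, ten (x, y) pairs can travel in a single message:

double points[10][2];                    /* ten 2-D points          */
MPI_Type_commit(&MPI_2D_POINT);          /* commit before first use */
MPI_Send(points, 10, MPI_2D_POINT, 1, 0, MPI_COMM_WORLD);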
MPI_TYPE_VECTOR describes several such sequences evenly spaced but not consecutive in memory. MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR except that the distance between successive blocks is specified in bytes rather than elements. MPI_TYPE_INDEXED describes sequences that may vary both in length and in spacing.
MPI_TYPE_VECTOR
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
Example parameters: count = 2, blocklength = 3, stride = 5
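A hedged snippet for the parameters above (the variable name is illustrative):

MPI_Datatype twoblocks;
MPI_Type_vector(2, 3, 5, MPI_INT, &twoblocks);
MPI_Type_commit(&twoblocks);
/* one item of 'twoblocks' covers elements 0,1,2 and 5,6,7 of an int array */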
Example program:
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i, j;
    double x[4][8];
    MPI_Status status;
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one column of the 4x8 array: 4 blocks of 1 double, stride 8 */
    MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 3) {
        for (i = 0; i < 4; ++i)
            for (j = 0; j < 8; ++j)
                x[i][j] = pow(10.0, i + 1) + j;
        MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
        for (i = 0; i < 4; ++i)
            printf("P:%d my x[%d][2]=%1f\n", rank, i, x[i][2]);
    }
    MPI_Finalize();
    return 0;
}
The output:
P:1 my x[0][2]=17.000000
P:1 my x[1][2]=107.000000
P:1 my x[2][2]=1007.000000
P:1 my x[3][2]=10007.000000
Committing a datatype int MPI_Type_commit (MPI_Datatype *datatype)
Obtaining Information About Derived Types • MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type. • MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy. • MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.
MPI_TYPE_EXTENT MPI_Type_extent (MPI_Datatype datatype, MPI_Aint *extent) Correction: Deprecated. Use MPI_Type_get_extent instead!
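A hedged snippet using the coltype datatype from the example program above: the extent spans the whole memory region one column pattern occupies, while the size counts only the bytes that actually travel in a message:

MPI_Aint lb, extent;
int      size;

MPI_Type_get_extent(coltype, &lb, &extent);  /* 200 bytes: (3*8 + 1) doubles */
MPI_Type_size(coltype, &size);               /* 32 bytes: only 4 doubles sent */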
Ref: Ian Foster's book "Designing and Building Parallel Programs" (DBPP)
MPI-2 is a set of extensions to the MPI standard. It was finalized by the MPI Forum in June, 1997.
MPI-2 • New Datatype Manipulation Functions • Info Object • New Error Handlers • Establishing/Releasing Communications • Extended Collective Operations • Thread Support • Fault Tolerance
MPI-2 Parallel I/O • Motivation: – The ability to parallelize I/O can offer significant performance improvements. – User-level checkpointing is contained within the program itself.
Parallel I/O • MPI-2 supports both blocking and nonblocking I/O • MPI-2 supports both collective and non-collective I/O
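A hedged sketch of the blocking, non-collective flavor: every process opens the same file and writes its own block at a rank-dependent offset (the file name and buffer size are illustrative):

#include <mpi.h>

#define N 1024

int main(int argc, char *argv[])
{
    int      rank, i, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) buf[i] = rank;

    /* every process opens the shared file and writes at its own offset */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * N * sizeof(int),
                      buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}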
Complementary Filetypes
Simple File Scatter/Gather Problem
MPI-2 Parallel I/O • Related topics that will not be covered in this course: • MPI-2 file structure • Initializing MPI-2 File I/O • Defining a View • Data Access - Reading Data • Data Access - Writing Data • Closing MPI-2 file I/O
How to Build a Beowulf
What is a Beowulf? • A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.
What is a Beowulf? • A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.
Price/Performance • COTS: commodity off-the-shelf components • Interconnection networks: LAN/SAN
How to Run Applications Faster • There are 3 ways to improve performance: 1. Work harder 2. Work smarter 3. Get help • Computer analogy: 1. Use faster hardware, e.g. reduce the time per instruction (clock cycle). 2. Use optimized algorithms and techniques. 3. Use multiple computers to solve the problem, i.e. increase the number of instructions executed per clock cycle.
Motivation for using Clusters • The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs. • Workstation clusters are easier to integrate into existing networks than special parallel computers.
Beowulf-class Systems A New Paradigm for the Business of Computing • Brings high end computing to broad ranged problems – new markets • Order of magnitude Price-Performance advantage • Commodity enabled – no long development lead times • Low vulnerability to vendor-specific decisions – companies are ephemeral; Beowulfs are forever • Rapid response technology tracking • Just-in-place user-driven configuration – requirement responsive • Industry-wide, non-proprietary software environment
Beowulf Project - A Brief History • Started in late 1993 • NASA Goddard Space Flight Center – NASA JPL, Caltech, academic and industrial collaborators • Sponsored by NASA HPCC Program • Applications: single user science station – data intensive – low cost • General focus: – single user (dedicated) science and engineering applications – system scalability – Ethernet drivers for Linux
Beowulf System at JPL (Hyglac) • 16 Pentium Pro PCs, each with 2.5 GByte disk, 128 MByte memory, Fast Ethernet card. • Connected using a 100Base-T network, through a 16-way crossbar switch. • Theoretical peak performance: 3.2 GFlop/s. • Achieved sustained performance: 1.26 GFlop/s.
Cluster Computing - Research Projects (partial list) • Beowulf (Caltech and NASA) - USA • Condor - University of Wisconsin-Madison, USA • HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA • MOSIX - Hebrew University of Jerusalem, Israel • MPI (MPI Forum; MPICH is one of the popular implementations) • NOW (Network of Workstations) - Berkeley, USA • NIMROD - Monash University, Australia • NetSolve - University of Tennessee, USA • PBS (Portable Batch System) - NASA Ames and LLNL, USA • PVM - Oak Ridge National Lab / UTK / Emory, USA
Motivation for using Clusters • Surveys show utilisation of CPU cycles of desktop workstations is typically <10%. • Performance of workstations and PCs is rapidly improving • As performance grows, percent utilisation will decrease even further! • Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
Motivation for using Clusters • The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the nonstandard nature of many parallel systems. • Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms. • Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
Original Food Chain Picture
1984 Computer Food Chain Mainframe Mini Computer Vector Supercomputer Workstation PC
1994 Computer Food Chain (hitting wall soon) Mini Computer Workstation (future is bleak) Mainframe Vector Supercomputer MPP PC
Computer Food Chain (Now and Future)
Parallel Computing / Cluster Computing / MetaComputing • Pile of PCs • NOW/COW • Beowulf • NT-PC Cluster • Tightly Coupled • Vector • WS Farms / cycle harvesting • DASHMEM-NUMA
PC Clusters: …small, medium, large
Computing Elements • Applications • Threads interface • Operating system / microkernel • Multi-processor computing system (hardware): processors, processes, threads
Networking • Topology • Hardware • Cost • Performance
Cluster Building Blocks
Channel Bonding
Myrinet 2000 switch Myrinet 2000 NIC
Example: 320-host Clos topology of 16-port switches (from Myricom)
Myrinet • Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports. • Flow control, error control, and "heartbeat" continuity monitoring on every link. • Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications. • Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts. • Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.
Myrinet • Sustained one-way data rate for large messages: 1.92 Gbit/s • Latency for short messages: 9 µs
Gigabit Ethernet • Switches by 3Com and Avaya: Cajun 550, Cajun P882, Cajun M770
Network Topology
Network Topology
Network Topology
Topology of the Velocity+ Cluster at CTC
Software: all of this list for free! • Compilers: FORTRAN, C/C++ • Java: JDK from Sun, IBM and others • Scripting: Perl, Python, awk… • Editors: vi, (x)emacs, kedit, gedit… • Scientific writing: LaTeX, Ghostview… • Plotting: gnuplot • Image processing: xview • …and much more!!!
Building a Parallel Cluster • 32 top-of-the-line processors • A fast interconnection network
Hardware: Dual P4, 2 GHz
How much does it cost? • Dual Pentium 4 machine with 2 GB of fast RDRAM memory (1 GB memory/CPU): $3,000 • Operating system (Linux): $0
How much does it cost? • Myrinet 2000 NIC (64-bit PCI @ 133 MHz) with 2 MB memory: $1,195 • Myrinet-2000 fiber cables, 3 m long: $110 • 16-port switch with fiber ports: $5,625
How much does it cost? • KVM: 16-port, ~$1,000 • Avocent (Cybex), using Cat5 / IP over Ethernet
How much does it cost? • Computers: $3,000 × 16 = $48,000 • Network cards and cables: ($1,195 + $110) × 16 = $20,880 • Switch: $5,625 • KVM: $1,000 • Monitor + miscellaneous: $500 • Total: $76,005
Theoretical peak computing power: • 2 × 32 = 64 GFLOPS • $76,000 / 64 = $1,187 per GFLOP • Less than $1.2 per MFLOP!!!
What else is needed? • Space!, air conditioning (cooling), backup power supply (UPS). • It is convenient to have one of the nodes serve as a file server (NFS or another file-sharing system). • User management with a tool such as NIS. • Connection to an external network: one of the nodes performs routing between the internal and external IP address spaces. • Monitoring tools such as bWatch.
Installing the System • First, install a single machine. • The remaining machines can then be installed by cloning the first machine's hard disk (for example with software such as Ghost).
Installing a software package XXX (e.g. MPI) • Download xxx.tar.gz • Uncompress: gzip -d xxx.tar.gz • Untar: tar xvf xxx.tar • Prepare the Makefile: ./configure • make
Parallel programming requirements… • "rlogin" must be allowed (xinetd: disable=no) • Create a ".rhosts" file • Parallel administration tools: "brsh", "prsh" and self-made scripts.
References • Beowulf: http://www.beowulf.org • Computer Architecture: http://www.cs.wisc.edu/~arch/www/
Next Week • Additional topics in MPI • Grid Computing • Parallel computing in scientific problems • Summary • Please start working on the projects! The presentations begin in two weeks!