Скачать презентацию GPGPUs Data Parallel Accelerators Dezső Sima 2010 November Скачать презентацию GPGPUs Data Parallel Accelerators Dezső Sima 2010 November

83ee3e0b2ff04d2204cf146f51ccc8a3.ppt

  • Количество слайдов: 133

GPGPUs Data Parallel Accelerators Dezső Sima 2010 November Ver. 1. 0 (Updated 30/11/2010) © GPGPUs Data Parallel Accelerators Dezső Sima 2010 November Ver. 1. 0 (Updated 30/11/2010) © Dezső Sima 2010

1. Introduction (1) Aim Introduction and overview in a very limited time 1. Introduction (1) Aim Introduction and overview in a very limited time

Contents 1. Introduction 2. Basics of the SIMT execution 3. Overview of GPGPUs 4. Contents 1. Introduction 2. Basics of the SIMT execution 3. Overview of GPGPUs 4. Overview of data parallel accelerators 5. Microarchitecture of GPGPUs (examples) 5. 1 AMD/ATI RV 870 (Cypress) 5. 2 Nvidia Fermi 6. Integrated CPU/GPUs 6. 1 AMD Fusion APU line 6. 2 Intel Sandy Bridge 7. References

1. The emergence of GPGPUs 1. The emergence of GPGPUs

1. Introduction (2) Rrepresentation of objects by triangels Vertex Edge Surface Vertices • have 1. Introduction (2) Rrepresentation of objects by triangels Vertex Edge Surface Vertices • have three spatial coordinates • supplementary information necessary to render the object, such as • color • texture • reflectance properties • etc.

1. Introduction (3) Main types of shaders in GPUs Shaders Vertex shaders Transform each 1. Introduction (3) Main types of shaders in GPUs Shaders Vertex shaders Transform each vertex’s 3 D-position in the virtual space to the 2 D coordinate, at which it appears on the screen Pixel shaders (Fragment shaders) Geometry shaders Calculate the color of the pixels Can add or remove vertices from a mesh

1. Introduction (4) Direct. X version Pixel SM Vertex SM Supporting OS 8. 0 1. Introduction (4) Direct. X version Pixel SM Vertex SM Supporting OS 8. 0 (11/2000) 1. 0, 1. 1 Windows 2000 8. 1 (10/2001) 1. 2, 1. 3, 1. 4 1. 0, 1. 1 Windows XP/ Windows Server 2003 9. 0 (12/2002) 2. 0 9. 0 a (3/2003) 2_A, 2_B 2. x 9. 0 c (8/2004) 3. 0 Windows XP SP 2 10. 0 (11/2006) 4. 0 Windows Vista 10. 1 (2/2008) 4. 1 Windows Vista SP 1/ Windows Server 2008 11 (in development) 5. 0 Table: Pixel/vertex shader models (SM) supported by subsequent versions of Direct. X and MS’s OSs [18], [21]

1. Introduction (3) Convergence of important features of the vertex and pixel shader models 1. Introduction (3) Convergence of important features of the vertex and pixel shader models Subsequent shader models introduce typically, a number of new/enhanced features. Differences between the vertex and pixel shader models in subsequent shader models concerning precision requirements, instruction sets and programming resources. Shader model 2 [19] • Different precision requirements Vertex shader: FP 32 (coordinates) Pixel shader: FX 24 (3 colors x 8) • Different instructions • Different resources (e. g. registers) Shader model 3 [19] • Unified precision requirements for both shaders (FP 32) with the option to specify partial precision (FP 16 or FP 24) by adding a modifier to the shader code • Different instructions • Different resources (e. g. registers)

1. Introduction (3) Shader model 4 (introduced with Direct. X 10) [20] • Unified 1. Introduction (3) Shader model 4 (introduced with Direct. X 10) [20] • Unified precision requirements for both shaders (FP 32) with the possibility to use new data formats. • Unified instruction set • Unified resources (e. g. temporary and constant registers) Shader architectures of GPUs prior to SM 4 (Direct. X 10): have separate vertex and pixel units with different features. Drawback of having separate units for vertex and pixel shading Inefficiency of the hardware implementation (Vertex shaders and pixel shaders often have complementary load patterns [21]).

1. Introduction (5) Unified shader model (introduced in the SM 4. 0 of Direct. 1. Introduction (5) Unified shader model (introduced in the SM 4. 0 of Direct. X 10. 0) Unified, programable shader architecture The same (programmable) processor can be used to implement all shaders; • the vertex shader • the pixel shader and • the geometry shader (new feature of the SMl 4)

1. Introduction (6) Figure: Principle of the unified shader architecture [22] 1. Introduction (6) Figure: Principle of the unified shader architecture [22]

1. Introduction (7) Based on its FP 32 computing capability and the large number 1. Introduction (7) Based on its FP 32 computing capability and the large number of FP-units available the unified shader is a prospective candidate for speeding up HPC! GPUs with unified shader architectures also termed as GPGPUs (General Purpose GPUs) or c. GPUs (computational GPUs)

1. Introduction (8) Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’ P 1. Introduction (8) Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’ P 4 and Core 2 processors [11]

1. Introduction (9) Figure: Bandwidth values of Nvidia’s GPU’s vs Intel’s P 4 and 1. Introduction (9) Figure: Bandwidth values of Nvidia’s GPU’s vs Intel’s P 4 and Core 2 processors [11]

1. Introduction (10) Figure: Contrasting the utilization of the silicon area in CPUs and 1. Introduction (10) Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]

2. Basics of the SIMT execution 2. Basics of the SIMT execution

2. Basics of the SIMT execution (1) Main alternatives of data parallel execution Data 2. Basics of the SIMT execution (1) Main alternatives of data parallel execution Data parallel execution SIMD execution SIMT execution • One dimensional data parallel execution, • Two dimensional data parallel execution, i. e. it performs the same operation on all elements of given FX/FP input vectors FX/FP input arrays (matrices) • is massively multithreaded, and provides • data dependent flow control as well as • barrier synchronization Needs an FX/FP SIMD extension of the ISA E. g. 2. and 3. generation superscalars Needs an FX/FP SIMT extension of the ISA and the API GPGPUs, data parallel accelerators Figure: Main alternatives of data parallel execution

2. Basics of the SIMT execution (2) Scalar, SIMD and SIMT execution Scalar execution 2. Basics of the SIMT execution (2) Scalar, SIMD and SIMT execution Scalar execution SIMD execution SIMT execution Domain of execution: single data elements Domain of execution: elements of vectors Domain of execution: elements of matrices (at the programming level) Figure: Domains of execution in case of scalar, SIMD and SIMT execution Remark SIMT execution is also termed as SPMD (Single_Program Multiple_Data) execution (Nvidia)

2. Basics of the SIMT execution (3) Key components of the implementation of SIMT 2. Basics of the SIMT execution (3) Key components of the implementation of SIMT execution • Data parallel execution • Massive multithreading • Data dependent flow control • Barrier synchronization

2. Basics of the SIMT execution (4) Data parallel execution Performed by SIMT cores 2. Basics of the SIMT execution (4) Data parallel execution Performed by SIMT cores execute the same instruction stream on a number of ALUs (i. e. all ALUs of a SIMT core perform typically the same operation). SIMT core Fetch/Decode ALU ALU Figure: Basic layout of a SIMT cores are the basic building blocks of GPGPU or data parallel accelerators. During SIMT execution 2 -dimensional matrices will be mapped to blocks of SIMT cores.

2. Basics of the SIMT execution (5) Remark 1 Different manufacturers designate SIMT cores 2. Basics of the SIMT execution (5) Remark 1 Different manufacturers designate SIMT cores differently, such as • streaming multiprocessor (Nvidia), • superscalar shader processor (AMD), • wide SIMD processor, CPU core (Intel).

2. Basics of the SIMT execution (6) Each ALU is allocated a working register 2. Basics of the SIMT execution (6) Each ALU is allocated a working register set (RF) Fetch/Decode ALU ALU RF RF Figure: Main functional blocks of a SIMT core

2. Basics of the SIMT execution (7) SIMT ALUs perform typically, RRR operations, that 2. Basics of the SIMT execution (7) SIMT ALUs perform typically, RRR operations, that is ALUs take their operands from and write the calculated results to the register set (RF) allocated to them. RF ALU Figure: Principle of operation of the SIMD ALUs

2. Basics of the SIMT execution (8) Remark 2 Actually, the register sets (RF) 2. Basics of the SIMT execution (8) Remark 2 Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file. RF RF ALU ALU ALU Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs

2. Basics of the SIMT execution (9) Basic operation of recent SIMT ALUs • 2. Basics of the SIMT execution (9) Basic operation of recent SIMT ALUs • execute basically SP FP-MADD (simple precision i. e. 32 -bit. Multiply-Add) instructions of the form axb+c , RF • are pipelined, capable of starting a new operation every new clock cycle, (more precisely, every shader clock cycle), That is, without further enhancements their peak performance is 2 SP FP operations/cycle ALU • need a few number of clock cycles, e. g. 2 or 4 shader cycles, to present the results of the SP FMADD operations to the RF,

2. Basics of the SIMT execution (10) Additional operations provided by SIMT ALUs • 2. Basics of the SIMT execution (10) Additional operations provided by SIMT ALUs • • • FX operations and FX/FP conversions, DP FP operations, trigonometric functions (usually supported by special functional units).

2. Basics of the SIMT execution (11) Massive multithreading Aim of massive multithreading to 2. Basics of the SIMT execution (11) Massive multithreading Aim of massive multithreading to speed up computations by increasing the utilization of available computing resources in case of stalls (e. g. due to cache misses). Principle • Suspend stalled threads from execution and allocate ready to run threads for execution. • When a large enough number of threads are available long stalls can be hidden.

2. Basics of the SIMT execution (12) Multithreading is implemented by creating and managing 2. Basics of the SIMT execution (12) Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain. Same instructions for all data elements Figure: Parallel executable threads for each element of the execution domain

2. Basics of the SIMT execution (13) Effective implementation of multithreading if thread switches, 2. Basics of the SIMT execution (13) Effective implementation of multithreading if thread switches, called context switches, do not cause cycle penalties. Achieved by • providing separate contexts (register space) for each thread, and • implementing a zero-cycle context switch mechanism.

2. Basics of the SIMT execution (14) SIMT core Fetch/Decode CTX CTX CTX CTX 2. Basics of the SIMT execution (14) SIMT core Fetch/Decode CTX CTX CTX CTX CTX CTX Actual context CTX CTX Context switch CTX CTX CTX CTX ALU ALU ALU Register file (RF) Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU

2. Basics of the SIMT execution (15) Data dependent flow control Implemented by SIMT 2. Basics of the SIMT execution (15) Data dependent flow control Implemented by SIMT branch processing In SIMT processing both paths of a branch are executed subsequently such that for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e. g. xi > 0). Example

2. Basics of the SIMT execution (16) Figure: Execution of branches [24] The given 2. Basics of the SIMT execution (16) Figure: Execution of branches [24] The given condition will be checked separately for each thread

2. Basics of the SIMT execution (17) First all ALUs meeting the condition execute 2. Basics of the SIMT execution (17) First all ALUs meeting the condition execute the prescibed three operations, then all ALUs missing the condition execute the next two operatons Figure: Execution of branches [24]

2. Basics of the SIMT execution (18) Figure: Resuming instruction stream processing after executing 2. Basics of the SIMT execution (18) Figure: Resuming instruction stream processing after executing a branch [24]

2. Basics of the SIMT execution (19) Barrier synchronization Lets wait all threads for 2. Basics of the SIMT execution (19) Barrier synchronization Lets wait all threads for completing all prior instructions before executing the next instruction. Implemented e. g. in AMD’s Intermediate Language (IL) by the fence threads instruction [10]. Remark In the R 600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7].

2. Basics of the SIMT execution (20) Principle of SIMT execution Host kernel 0<<<>>>() 2. Basics of the SIMT execution (20) Principle of SIMT execution Host kernel 0<<<>>>() Device Each kernel invocation lets execute all thread blocks (Block(i, j)) kernel 1<<<>>>() Figure: Hierarchy of threads [25]

3. Overview of GPGPUs 3. Overview of GPGPUs

3. Overview of GPGPUs (1) Basic implementation alternatives of the SIMT execution GPGPUs Data 3. Overview of GPGPUs (1) Basic implementation alternatives of the SIMT execution GPGPUs Data parallel accelerators Dedicated units Programmable GPUs supporting data parallel execution with appropriate programming environments programming environment Have display outputs E. g. Nvidia’s 8800 and GTX lines AMD’s HD 38 xx, HD 48 xx lines No display outputs Have larger memories than GPGPUs Nvidia’s Tesla lines AMD’s Fire. Stream lines Figure: Basic implementation alternatives of the SIMT execution

3. Overview of GPGPUs (2) GPGPUs AMD/ATI’s line Nvidia’s line 90 nm 80 nm 3. Overview of GPGPUs (2) GPGPUs AMD/ATI’s line Nvidia’s line 90 nm 80 nm G 80 Shrink 65 nm Shrink Enhanced arch. G 92 55 nm G 200 Shrink 40 nm R 600 RV 670 Enhanced arch. Fermi 40 nm Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines RV 770 Shrink Enhanced arch. RV 870

3. Overview of GPGPUs (3) NVidia 10/07 11/06 6/08 9/09 G 80 G 92 3. Overview of GPGPUs (3) NVidia 10/07 11/06 6/08 9/09 G 80 G 92 GT 200 Fermi 90 nm/681 mtrs 65 nm/754 mtrs 65 nm/1400 mtrs 40 nm/3000 mtrs Cores Cards 8800 GTS 96 ALUs 320 -bit 8800 GTX 8800 GT GTX 260 GTX 280 128 ALUs 384 -bit 112 ALUs 256 -bit 192 ALUs 448 -bit 240 ALUs 512 -bit 11/07 6/07 CUDA 6/08 Version 1. 1 Version 1. 0 512 ALUs 384 -bit Version 2. 0 AMD/ATI 9/09 11/05 11/07 5/08 R 500 Cores 5/07 R 600 R 670 RV 770 RV 870 55 nm/956 mtrs 40 nm/2100 mtrs 80 nm/681 mtrs Cards 55 nm/666 mtrs (Xbox) HD 2900 XT HD 3850 HD 3870 HD 4850 HD 4870 HD 5870 48 ALUs 320 ALUs 512 -bit 320 ALUs 256 -bit 800 ALUs 256 -bit 1600 ALUs 256 -bit 12/08 Open. CL+ Open. CL 11/07 Brooks+ Brook+ 6/08 Rapid. Mind 3870 support 2005 2006 2007 Figure: Overview of GPGPUs 2008 2009

3. Overview of GPGPUs (4) 8800 GTS 8800 GTX 8800 GT GTX 260 GTX 3. Overview of GPGPUs (4) 8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280 Core G 80 G 92 GT 200 Introduction 11/06 10/07 6/08 IC technology 90 nm 65 nm Nr. of transistors 681 mtrs 754 mtrs 1400 mtrs Die are 480 mm 2 324 mm 2 576 mm 2 Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz 96 128 112 192 240 1. 2 GHz 1. 35 GHz 1. 512 GHz 1. 242 GHz 1. 296 GHz 3 3 Computation No. of ALUs Shader frequency No. FP 32 inst. /cycle 3* (but only in a few issue cases) Peak FP 32 performance 346 GLOPS 512 GLOPS 508 GLOPS 715 GLOPS 933 GLOPS Peak FP 64 performance – – 77/76 GLOPS 1600 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s Mem. interface 320 -bit 384 -bit 256 -bit 448 -bit 512 -bit Mem. bandwidth 64 GB/s 86. 4 GB/s 57. 6 GB/s 111. 9 GB/s 141. 7 GB/s Mem. size 320 MB 768 MB 512 MB 896 MB 1. 0 GB Mem. type GDDR 3 GDDR 3 Mem. channel 6*64 -bit 4*64 -bit 8*64 -bit Mem. contr. Crossbar Crossbar SLI SLI SLI PCIe x 16 PCIe 2. 0 x 16 10 10. 1 subset Memory Mem. transfer rate (eff) System Multi. CPU techn. Interface MS Direct X Table: Main features of Nvidia’s GPGPUs

3. Overview of GPGPUs (5) HD 2900 XT HD 3850 HD 3870 HD 4850 3. Overview of GPGPUs (5) HD 2900 XT HD 3850 HD 3870 HD 4850 HD 4870 Core R 600 R 670 RV 770 Introduction 5/07 11/07 5/08 80 nm 55 nm Nr. of transistors 700 mtrs 666 mtrs 956 mtrs Die are 408 mm 2 192 mm 2 260 mm 2 Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz 320 320 800 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz 2 2 2 Peak FP 32 performance 471. 6 GLOPS 429 GLOPS 496 GLOPS 1000 GLOPS 1200 GLOPS Peak FP 64 performance – – – 200 GLOPS 240 GLOPS 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR 5) 512 -bit 256 -bit 265 -bit 105. 6 GB/s 53. 1 GB/s 720 GB/s 64 GB/s 118 GB/s Mem. size 512 MB 256 MB 512 MB Mem. type GDDR 3 GDDR 4 GDDR 3/GDDR 5 Mem. channel 8*64 -bit 8*32 -bit 4*64 -bit Mem. contr. Ring bus Crossbar Multi. CPU techn. Cross. Fire X Interface PCIe x 16 PCIe 2. 0 x 16 10 10. 1 IC technology Computation No. of ALUs Shader frequency No. FP 32 inst. /cycle Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System MS Direct X Table: Main features of AMD/ATIs GPGPUs

3. Overview of GPGPUs (6) Price relations (as of 10/2008) Nvidia GTX 260 GTX 3. Overview of GPGPUs (6) Price relations (as of 10/2008) Nvidia GTX 260 GTX 280 ~ 300 $ ~ 600 $ AMD/ATI HD 4850 HD 4870 ~ 200 $ na

4. Overview of data parallel accelerators 4. Overview of data parallel accelerators

4. Overview of data parallel accelerators (1) Data parallel accelerators Implementation alternatives of data 4. Overview of data parallel accelerators (1) Data parallel accelerators Implementation alternatives of data parallel accelerators On card implementation Recent implementations E. g. On-die integration Future implementations GPU cards Intel’s Heavendahl Data-parallel accelerator cards AMD’s Torrenza integration technology AMD’s Fusion integration technology Trend Figure: Implementation alternatives of dedicated data parallel accelerators

4. Overview of data parallel accelerators (2) On-card accelerators Card implementations Single cards fitting 4. Overview of data parallel accelerators (2) On-card accelerators Card implementations Single cards fitting into a free PCI Ex 16 slot of the host computer. E. g. Nvidia Tesla C 870 Nvidia Tesla C 1060 AMD Fire. Stream 9170 Desktop implementations 1 U server implementations Usually 4 cards Usually dual cards mounted into a 1 U server rack, mounted into a box, connected two adapter cards connected to an that are inserted into adapter card that is inserted into a two free PCIEx 16 slots of a server through two switches free PCI-E x 16 slot of the and two cables. host PC through a cable. Nvidia Tesla D 870 Nvidia Tesla S 1070 AMD Fire. Stream 9250 Figure: Implementation alternatives of on-card accelerators

4. Overview of data parellel accererators (2) FB: Frame Buffer Figure: Main functional units 4. Overview of data parellel accererators (2) FB: Frame Buffer Figure: Main functional units of Nvidia’s Tesla C 870 card [2]

4. Overview of data parellel accererators (3) Figure: Nvida’s Tesla C 870 and AMD’s 4. Overview of data parellel accererators (3) Figure: Nvida’s Tesla C 870 and AMD’s Fire. Stream 9170 cards [2], [3]

4. Overview of data parellel accererators (4) Figure: Tesla D 870 desktop implementation [4] 4. Overview of data parellel accererators (4) Figure: Tesla D 870 desktop implementation [4]

4. Overview of data parellel accererators (5) Figure: Nvidia’s Tesla D 870 desktop implementation 4. Overview of data parellel accererators (5) Figure: Nvidia’s Tesla D 870 desktop implementation [4]

4. Overview of data parellel accererators (6) Figure: PCI-E x 16 host adapter card 4. Overview of data parellel accererators (6) Figure: PCI-E x 16 host adapter card of Nvidia’s Tesla D 870 desktop [4]

4. Overview of data parellel accererators (7) Figure: Concept of Nvidia’s Tesla S 870 4. Overview of data parellel accererators (7) Figure: Concept of Nvidia’s Tesla S 870 1 U rack server [5]

4. Overview of data parellel accererators (8) Figure: Internal layout of Nvidia’s Tesla S 4. Overview of data parellel accererators (8) Figure: Internal layout of Nvidia’s Tesla S 870 1 U rack [6]

4. Overview of data parellel accererators (9) Figure: Connection cable between Nvidia’s Tesla S 4. Overview of data parellel accererators (9) Figure: Connection cable between Nvidia’s Tesla S 870 1 U rack and the adapter cards inserted into PCI-E x 16 slots of the host server [6]

4. Overview of data parellel accererators (10) NVidia Tesla 6/07 C 870 G 80 4. Overview of data parellel accererators (10) NVidia Tesla 6/07 C 870 G 80 -based 1. 5 GB GDDR 3 0. 519 GLOPS Card 6/08 C 1060 GT 200 -based 4 GB GDDR 3 0. 936 GLOPS 6/07 Desktop D 870 G 80 -based 2*C 870 incl. 3 GB GDDR 3 1. 037 GLOPS 6/07 IU Server 6/08 S 870 S 1070 G 80 -based 4*C 870 incl. 6 GB GDDR 3 2. 074 GLOPS GT 200 -based 4*C 1060 16 GB GDDR 3 3. 744 GLOPS 6/07 CUDA Version 1. 0 11/07 6/08 Version 1. 01 Version 2. 0 2007 Figure: Overview of Nvidia’s Tesla family 2008

4. Overview of data parellel accererators (11) AMD Fire. Stream 11/07 6/08 9170 RV 4. Overview of data parellel accererators (11) AMD Fire. Stream 11/07 6/08 9170 RV 670 -based 2 GB GDDR 3 500 GLOPS FP 32 ~200 GLOPS FP 64 Card Shipped 6/08 10/08 9250 RV 770 -based 1 GB GDDR 3 1 TLOPS FP 32 ~300 GFLOPS FP 64 Shipped 12/07 Stream Computing SDK Version 1. 0 Brook+ ACM/AMD Core Math Library CAL (Computer Abstor Layer) Rapid Mind 2007 2008 Figure: Overview of AMD/ATI’s Fire. Stream family

4. Overview of data parellel accererators (12) Nvidia Tesla cards AMD Fire. Stream cards 4. Overview of data parellel accererators (12) Nvidia Tesla cards AMD Fire. Stream cards Core type C 870 C 1060 9170 9250 Based on G 80 GT 200 RV 670 RV 770 Introduction 6/07 6/08 11/07 6/08 Core frequency 600 MHz 602 MHz 800 MHz 625 MHz ALU frequency 1350 MHz 1296 GHz 800 MHz 325 MHZ 128 240 320 800 Peak FP 32 performance 518 GLOPS 933 GLOPS 512 GLOPS 1 TLOPS Peak FP 64 performance – – ~200 GLOPS ~250 GLOPS 1600 Gb/s 1986 Gb/s 384 -bit 512 -bit 256 -bit 768 GB/s 102 GB/s 51. 2 GB/s 63. 5 GB/s Mem. size 1. 5 GB 4 GB 2 GB 1 GB Mem. type GDDR 3 PCI-E x 16 PCI-E 2. 0 x 16 171 W 200 W 150 W Core No. of ALUs Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System Interface Power (max) Table: Main features of Nvidia’s and AMD/ATI’s data parallel accelerator cards

4. Overview of data parellel accererators (13) Price relations (as of 10/2008) Nvidia Tesla 4. Overview of data parellel accererators (13) Price relations (as of 10/2008) Nvidia Tesla C 870 D 870 S 870 ~ 1500 $ ~ 5000 $ ~ 7500 $ C 1060 ~ 1600 $ S 1070 ~ 8000 $ AMD/ATI Fire. Stream 9170 ~ 800 $ 9250 ~ 800 $

5. Microarchitecture of GPGPUs (examples) 5. 1 AMD/ATI RV 870 (Cypress) 5. 2 Nvidia 5. Microarchitecture of GPGPUs (examples) 5. 1 AMD/ATI RV 870 (Cypress) 5. 2 Nvidia Fermi

5. 1 AMD/ATI RV 870 5. 1 AMD/ATI RV 870

5. 1 AMD/ATI RV 870 (1) AMD/ATI RV 870 (Cypress) Radeon 5870 graphics card 5. 1 AMD/ATI RV 870 (1) AMD/ATI RV 870 (Cypress) Radeon 5870 graphics card Introduction: Sept. 22 2009 Availability: now Performance figures: SP FP performance: 2. 72 TFLOPS DP FP performance: 544 GFLOPS (1/5 of SP FP performance) Open. CL 1. 0 compliant

5. 1 AMD/ATI RV 870 (4) Architecture overview RV 870 20 cores 16 ALUs/core 5. 1 AMD/ATI RV 870 (4) Architecture overview RV 870 20 cores 16 ALUs/core 5 EUs/ALU 1600 EUs (Stream processing units) Figure: Architecture overview [42] 8 x 32 = 256 bit GDDR 5 153. 6 GB/s

5. 1 AMD/ATI RV 870 (2) Radeon series/5800 ATI Radeon HD 4870 ATI Radeon 5. 1 AMD/ATI RV 870 (2) Radeon series/5800 ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870 Manufacturing Process 55 -nm 40 -nm # of Transistors 956 million 2. 15 billion Core Clock Speed 750 MHz 725 MHz 850 MHz # of Stream Processors 800 1440 1600 Compute Performance 1. 2 TFLOPS 2. 09 TFLOPS 2. 72 TFLOPS Memory Type GDDR 5 Memory Clock 900 MHz 1000 MHz 1200 MHz Memory Data Rate 3. 6 Gbps 4. 0 Gbps 4. 8 Gbps Memory Bandwidth 115. 2 GB/sec 128 GB/sec 153. 6 GB/sec Max Board Power 160 W 170 W 188 W Idle Board Power 90 W 27 W Figure: Radeon Series/5800 [42]

5. 1 AMD/ATI RV 870 (3) Radeon 4800 series/5800 series comparison ATI Radeon HD 5. 1 AMD/ATI RV 870 (3) Radeon 4800 series/5800 series comparison ATI Radeon HD 4870 ATI Radeon HD 5850 ATI Radeon HD 5870 Manufacturing Process 55 -nm 40 -nm # of Transistors 956 million 2. 15 billion Core Clock Speed 750 MHz 725 MHz 850 MHz # of Stream Processors 800 1440 1600 Compute Performance 1. 2 TFLOPS 2. 09 TFLOPS 2. 72 TFLOPS Memory Type GDDR 5 Memory Clock 900 MHz 1000 MHz 1200 MHz Memory Data Rate 3. 6 Gbps 4. 0 Gbps 4. 8 Gbps Memory Bandwidth 115. 2 GB/sec 128 GB/sec 153. 6 GB/sec Max Board Power 160 W 170 W 188 W Idle Board Power 90 W 27 W Figure: Radeon Series/5800 [42]

5. 1 AMD/ATI RV 870 (6) The HD 5870 card Figure: The 5870 card 5. 1 AMD/ATI RV 870 (6) The HD 5870 card Figure: The 5870 card [41]

5. 1 AMD/ATI RV 870 (7) http: //www. tomshardware. com/gallery/two-cypress-gpus, 0101 -230369 -7179 -0 5. 1 AMD/ATI RV 870 (7) http: //www. tomshardware. com/gallery/two-cypress-gpus, 0101 -230369 -7179 -0 -0 -0 -jpg-. html

5. 1 AMD/ATI RV 870 (8) ATI 5970 http: //www. tomshardware. com/gallery/two-cypress-gpus, 0101 -230369 5. 1 AMD/ATI RV 870 (8) ATI 5970 http: //www. tomshardware. com/gallery/two-cypress-gpus, 0101 -230369 -7179 -0 -0 -0 -jpg-. html

5. 1 AMD/ATI RV 870 (9) ATI 5970 P r e v i o 5. 1 AMD/ATI RV 870 (9) ATI 5970 P r e v i o u s P h o t o N e x t 4 / 3 3 http: //www. tomshardware. com/gallery/two-cypress-gpus, 0101 -230369 -7179 -0 -0 -0 -jpg-. html

5. 1 AMD/ATI RV 870 (10) HD 5970 HD 5870 P h o t 5. 1 AMD/ATI RV 870 (10) HD 5970 HD 5870 P h o t o N e x t 1 / 3 3 HD 5850 Source: Tomshardware

h i p s P C G r a p h i c s h i p s P C G r a p h i c s M a r k e t S l u m p s , I n t e l S o a r s , N 5. 1 AMD/ATI RV 870 (11)

5. 1 AMD/ATI RV 870 (12) http: //www. rumorpedia. net/wp-content/uploads/2010/11/amdmobilityroadmap. png 5. 1 AMD/ATI RV 870 (12) http: //www. rumorpedia. net/wp-content/uploads/2010/11/amdmobilityroadmap. png

5. 2 Nvidia Fermi 5. 2 Nvidia Fermi

5. 2 Nvidia Fermi (1) NVidia’s Fermi Introduced: 30. Sept. 2009 at NVidia’s GPU 5. 2 Nvidia Fermi (1) NVidia’s Fermi Introduced: 30. Sept. 2009 at NVidia’s GPU Technology Conference Available: 1 Q 2010

5. 2 Nvidia Fermi (2) Fermi’s overall structure NVidia: 16 cores (Streaming Multiprocessors) Each 5. 2 Nvidia Fermi (2) Fermi’s overall structure NVidia: 16 cores (Streaming Multiprocessors) Each core: 32 ALUs 6 x Dual Channel GDDR 5 (384 bit) Figure: Fermi’s overall structure [40]

5. 2 Nvidia Fermi (3) Layout of a core (SM) 1 SM includes 32 5. 2 Nvidia Fermi (3) Layout of a core (SM) 1 SM includes 32 ALUs called “Cuda cores” by NVidia) Cuda core (ALU) Figure: Layout of a core [40]

5. 2 Nvidia Fermi (4) A single ALU (“Cuda core”) SP FP: 32 -bit 5. 2 Nvidia Fermi (4) A single ALU (“Cuda core”) SP FP: 32 -bit FX: 32 -bit DP FP • IEEE 754 -2008 -compliant • Needs 2 clock cycles DP FP performance: ½ of SP FP performance!! Figure: A single ALU [40]

5. 2 Nvidia Fermi (5) Fermi’s system architecture Figure: Fermi’s system architecture [39] 5. 2 Nvidia Fermi (5) Fermi’s system architecture Figure: Fermi’s system architecture [39]

5. 2 Nvidia Fermi (6) Contrasting Fermi and GT 200 Figure: Contrasting Fermi and 5. 2 Nvidia Fermi (6) Contrasting Fermi and GT 200 Figure: Contrasting Fermi and GT 200 [39]

5. 2 Nvidia Fermi (7) The execution of programs utilizing GP/GPUs Host kernel 0<<<>>>() 5. 2 Nvidia Fermi (7) The execution of programs utilizing GP/GPUs Host kernel 0<<<>>>() Device Each kernel invocation executes a grid of thread blocks (Block(i, j)) kernel 1<<<>>>() Figure: Hierarchy of threads [25]

5. 2 Nvidia Fermi (8) Global scheduling in Fermi Figure: Global scheduling in Fermi 5. 2 Nvidia Fermi (8) Global scheduling in Fermi Figure: Global scheduling in Fermi [39]

5. 2 Nvidia Fermi (9) Fermi chip alternatives Shipped about 4/2010 1/2001 11/2010 http: 5. 2 Nvidia Fermi (9) Fermi chip alternatives Shipped about 4/2010 1/2001 11/2010 http: //www. techreport. com/articles. x/19934

5. 2 Nvidia Fermi (10) Ge. Force GTX 460/480/580 graphics cards GTX 460 is 5. 2 Nvidia Fermi (10) Ge. Force GTX 460/480/580 graphics cards GTX 460 is based on the GF 104 GTX 480 is based on the GF 100 GTX 580 is based on the GF 110 http: //www. techreport. com/articles. x/19934

5. 2 Nvidia Fermi (11) A pair of Ge. Force 480 cards http: //www. 5. 2 Nvidia Fermi (11) A pair of Ge. Force 480 cards http: //www. techreport. com/articles. x/18682

5. 2 Nvidia Fermi (12) NVidia Geforce 480 card (about March 2010) 5. 2 Nvidia Fermi (12) NVidia Geforce 480 card (about March 2010)

6. Integrated CPU/GPUs 6. 1 AMD Fusion APU line 6. 2 Intel Sandy Bridge 6. Integrated CPU/GPUs 6. 1 AMD Fusion APU line 6. 2 Intel Sandy Bridge

6. 1 AMD Fusion APU line (Accelerated Processing Unit) 6. 1 AMD Fusion APU line (Accelerated Processing Unit)

6. 1 AMD Fusion APU line (1) bp. blogspot. com/_q. NFgs. Hp 2 d. 6. 1 AMD Fusion APU line (1) bp. blogspot. com/_q. NFgs. Hp 2 d. Ck/Rjxipy. C-dc. I/AAAAATQ/5 Kz. O 63 ud. KB 4/s 1600 -h/4622_large_stars http: //blogs. amd. com/fusion/2010/11/09/simply-put-it%E 2%80%99 s-all-about-velocity/ph 4/

6. 1 AMD Fusion APU line (2) 6. 1 AMD Fusion APU line (2)

6. 1 AMD Fusion APU line (3) http: //blogs. amd. com/fusion/2010/11/09/simply-put-it%E 2%80%99 s-all-about-velocity/ph 5/ 6. 1 AMD Fusion APU line (3) http: //blogs. amd. com/fusion/2010/11/09/simply-put-it%E 2%80%99 s-all-about-velocity/ph 5/

6. 1 AMD Fusion APU line (4) LLano die 4 Bulldozer core GPU with 6. 1 AMD Fusion APU line (4) LLano die 4 Bulldozer core GPU with DX 11 http: //www. pcgameshardware. com/? menu=browser&article_id=699300&image_id=1220883

6. 1 AMD Fusion APU line (5) http: //blogs. amd. com/fusion/2010/11/09/simply-put-it%E 2%80%99 s-all-about-velocity/ph_server/ 6. 1 AMD Fusion APU line (5) http: //blogs. amd. com/fusion/2010/11/09/simply-put-it%E 2%80%99 s-all-about-velocity/ph_server/

6. 1 AMD Fusion APU line (6) http: //www. itextreme. hu/hirek/2010, 08, 24/az_amd_legujabb_processzor_tervei_bobcat_es_bulldozer 6. 1 AMD Fusion APU line (6) http: //www. itextreme. hu/hirek/2010, 08, 24/az_amd_legujabb_processzor_tervei_bobcat_es_bulldozer

AMD Bulldozer 6. 1 AMD Fusion APU line (7) Xbit labs AMD Bulldozer 6. 1 AMD Fusion APU line (7) Xbit labs

6. 1 AMD Fusion APU line (8) AMD Bulldozer Xbit labs 6. 1 AMD Fusion APU line (8) AMD Bulldozer Xbit labs

6. 1 AMD Fusion APU line (9) 6. 1 AMD Fusion APU line (9)

6. 1 AMD Fusion APU line (10) 6. 1 AMD Fusion APU line (10)

6. 1 AMD Fusion APU line (11) http: //www. cdrinfo. com/sections/news/Details. aspx? News. Id=28748 6. 1 AMD Fusion APU line (11) http: //www. cdrinfo. com/sections/news/Details. aspx? News. Id=28748

6. 2 Intel’s integrated CPU/GPU lines 6. 2 Intel’s integrated CPU/GPU lines

6. 2 Intel’s integrated CPU/GPU lines (1) Intel’s i 3/i 5/i 7 mobile processors 6. 2 Intel’s integrated CPU/GPU lines (1) Intel’s i 3/i 5/i 7 mobile processors (announced 1/2010) (32 nm CPU/45 nm discrete GPU)

6. 2 Intel’s integrated CPU/GPU lines (2) Intel’s i 3/i 5/i 7 mobile processors 6. 2 Intel’s integrated CPU/GPU lines (2) Intel’s i 3/i 5/i 7 mobile processors 32 nm CPU 45 nm GPU http: //www. anandtech. com/show/2921

6. 2 Intel’s integrated CPU/GPU lines (3) Source: Intel 6. 2 Intel’s integrated CPU/GPU lines (3) Source: Intel

6. 2 Intel’s integrated CPU/GPU lines (4) 6. 2 Intel’s integrated CPU/GPU lines (4)

6. 2 Intel’s integrated CPU/GPU lines (5) 6. 2 Intel’s integrated CPU/GPU lines (5)

7. References (1) 6. References [1]: Torricelli F. , AMD in HPC, HPC 07, 7. References (1) 6. References [1]: Torricelli F. , AMD in HPC, HPC 07, http: //www. altairhyperworks. co. uk/html/en-GB/keynote 2/Torricelli_AMD. pdf [2]: NVIDIA Tesla C 870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia [3] AMD Fire. Stream 9170, http: //ati. amd. com/technology/streamcomputing/product_firestream_9170. html [4]: NVIDIA Tesla D 870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http: //www. nvidia. com/docs/IO/43395/D 870 -System. Spec-SP-03718 -001_v 01. pdf [5]: Tesla S 870 GPU Computing System, Specification, Nvida, http: //jp. nvidia. com/docs/IO/43395/S 870 -Board. Spec_SP-03685 -001_v 00 b. pdf [6]: Torres G. , Nvidia Tesla Technology, Nov. 2007, http: //www. hardwaresecrets. com/article/495 [7]: R 600 -Family Instruction Set Architecture, Revision 0. 31, May 2007, AMD [8]: Zheng B. , Gladding D. , Villmow M. , Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008 [9]: Huddy R. , ATI Radeon HD 2000 Series Technology Overview, AMD Technology Day, 2007 http: //ati. amd. com/developer/techpapers. html [10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL), Version 2. 0, Oct. 2008, AMD

6. References (2) [11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2. 6. References (2) [11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2. 0, June 2008, Nvidia [12]: Kirk D. & Hwu W. W. , ECE 498 AL Lectures 7: Threading Hardware in G 80, 2007, University of Illinois, Urbana-Champaign, http: //courses. ece. uiuc. edu/ece 498/al 1/ lectures/lecture 7 -threading%20 hardware. ppt#256, 1, ECE 498 AL Lectures 7: Threading Hardware in G 80 [13]: Kogo H. , R 600 (Radeon HD 2900 XT), PC Watch, June 26 2008, http: //pc. watch. impress. co. jp/docs/2008/0626/kaigai_3. pdf [14]: Nvidia G 80, Pc Watch, April 16 2007, http: //pc. watch. impress. co. jp/docs/2007/0416/kaigai 350. htm [15]: Ge. Force 8800 GT (G 92), PC Watch, Oct. 31 2007, http: //pc. watch. impress. co. jp/docs/2007/1031/kaigai 398_07. pdf [16]: NVIDIA GT 200 and AMD RV 770, PC Watch, July 2 2008, http: //pc. watch. impress. co. jp/docs/2008/0702/kaigai 451. htm [17]: Shrout R. , Nvidia GT 200 Revealed – Ge. Force GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http: //www. pcper. com/article. php? aid=577&type=expert&pid=3 [18]: http: //en. wikipedia. org/wiki/Direct. X [19]: Dietrich S. , “Shader Model 3. 0, April 2004, Nvidia, http: //www. cs. umbc. edu/~olano/s 2004 c 01/ch 15. pdf [20]: Microsoft Direct. X 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http: //www. nvidia. com/page/8800_tech_briefs. html

6. References (3) [21]: Patidar S. & al. , “Exploiting the Shader Model 4. 6. References (3) [21]: Patidar S. & al. , “Exploiting the Shader Model 4. 0 Architecture, Center for Visual Information Technology, IIIT Hyderabad, http: //research. iiit. ac. in/~shiben/docs/SM 4_Skp-Shiben-Jag-PJN_draft. pdf [22]: Nvidia Ge. Force 8800 GPU Architecture Overview, Vers. 0. 1, Nov. 2006, Nvidia, http: //www. nvidia. com/page/8800_tech_briefs. html [23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http: //pc. watch. impress. co. jp/docs/2008/0822/kaigai_06. pdf [24]: Fatahalian K. , “From Shader Code to a Teraflop: How Shader Cores Work, ” Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008, [25]: Kanter D. , “NVIDIA’s GT 200: Inside a Parallel Processor, ” 09 -08 -2008 [26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1. 1, Nov. 2007, Nvidia [27]: Seiler L. & al. , “Larrabee: A Many-Core x 86 Architecture for Visual Computing, ” ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008 [28]: Kogo H. , “Larrabee”, PC Watch, Oct. 17, 2008, http: //pc. watch. impress. co. jp/docs/2008/1017/kaigai 472. htm [29]: Shrout R. , IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective, http: //www. pcper. com/article. php? aid=453

6. References (4) [30]: Stokes J. , Larrabee: Intel’s biggest leap ahead since the 6. References (4) [30]: Stokes J. , Larrabee: Intel’s biggest leap ahead since the Pentium Pro, ” Aug. 04. 2008, http: //arstechnica. com/news. ars/post/20080804 -larrabeeintels-biggest-leap-ahead-since-the-pentium-pro. html [31]: Shimpi A. L. C Wilson D. , “Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http: //www. anandtech. com/showdoc. aspx? i=3367&p=2 [32]: Hester P. , “Multi_Core and Beyond: Evolving the x 86 Architecture, ” Hot Chips 19, Aug. 2007, http: //www. hotchips. org/hc 19/docs/keynote 2. pdf [33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1. 2. 1 http: //ati. amd. com/technology/streamcomputing/ Stream_Computing_User_Guide. pdf [34]: Doggett M. , Radeon HD 2900, Graphics Hardware Conf. Aug. 2007, http: //www. graphicshardware. org/previous/www_2007/presentations/ doggett-radeon 2900 -gh 07. pdf [35]: Mantor M. , “AMD’s Radeon Hd 2900, ” Hot Chips 19, Aug. 2007, http: //www. hotchips. org/archives/hc 19/2_Mon/HC 19. 03. 01. pdf [36]: Houston M. , “Anatomy if AMD’s Tera. Scale Graphics Engine, ”, SIGGRAPH 2008, http: //s 08. idav. ucdavis. edu/houston-amd-terascale. pdf [37]: Mantor M. , “Entering the Golden Age of Heterogeneous Computing, ” PEEP 2008, http: //ati. amd. com/technology/streamcomputing/IUCAA_Pune_PEEP_2008. pdf

6. References (5) [38]: Kogo H. , RV 770 Overview, PC Watch, July 02 6. References (5) [38]: Kogo H. , RV 770 Overview, PC Watch, July 02 2008, http: //pc. watch. impress. co. jp/docs/2008/0702/kaigai_09. pdf [39]: Kanter D. , Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009, http: //www. realworldtech. com/includes/templates/articles. cfm? Article. ID=RWT 093009110932&mode=print [40]: Wasson S. , Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed, Tech Report, Sept 30 2009, http: //techreport. com/articles. x/17670/1 [41]: Wasson S. , AMD's Radeon HD 5870 graphics processor, Tech Report, Sept 23 2009, http: //techreport. com/articles. x/17618/1 [42]: Bell B. , ATI Radeon HD 5870 Performance Preview , Firing Squad, Sept 22 2009, http: //www. firingsquad. com/hardware/ ati_radeon_hd_5870_performance_preview/default. asp

Principle of operation of the G 80/G 92/Fermi GPGPUs Principle of operation of the G 80/G 92/Fermi GPGPUs

Akrout (AMD) CPU/GPU industry dynamics and technologies 2010 Sept +cpu/gpu+industry+dynamics&hl=hu&gl=hu&pid=bl&srcid=ADGEESj. U 5 ox 06 Akrout (AMD) CPU/GPU industry dynamics and technologies 2010 Sept +cpu/gpu+industry+dynamics&hl=hu&gl=hu&pid=bl&srcid=ADGEESj. U 5 ox 06 YHjh. Ip. K_U 05 Vn 52 vs-Ef. H

http: //www. cdrinfo. com/sections/news/Details. aspx? News. Id=28748 http: //www. cdrinfo. com/sections/news/Details. aspx? News. Id=28748

http: //blogs. pcmag. com/miller/2010/11/amds_graphics_roadmap_for_2011. php http: //blogs. pcmag. com/miller/2010/11/amds_graphics_roadmap_for_2011. php

6. References (5) Microarchitecture of a Fermi core 6. References (5) Microarchitecture of a Fermi core

6. References (5) Principle of operation of the G 80/G 92 GPGPUs The key 6. References (5) Principle of operation of the G 80/G 92 GPGPUs The key point of operation is work scheduling Work scheduling • Scheduling thread blocks for execution • Segmenting thread blocks into warps • Scheduling warps for execution

6. References (5) CUDA Thread Block Thread scheduling in NVidia’s GPGPUs p p All 6. References (5) CUDA Thread Block Thread scheduling in NVidia’s GPGPUs p p All threads in a block execute the same kernel program (SPMD) Programmer declares block: n n n p Threads have thread id numbers within block n p p Block size 1 to 512 concurrent threads Block shape 1 D, 2 D, or 3 D Block dimensions in threads Thread program uses thread id to select work and address shared data Threads in the same block share data and synchronize while doing their share of the work Threads in different blocks cannot cooperate n CUDA Thread Block Thread Id #: 0 1 2 3 … m Thread program Courtesy: John Nickolls, NVIDIA Each block can execute in any order relative to other blocs! llinois. edu/ece 498/al/lectures/lecture 4%20 cuda%20 threads%20 part 2%20 spring%202009. ppt#316, 2,

6. References (5) Scheduling thread blocks for execution TPC t 0 t 1 t 6. References (5) Scheduling thread blocks for execution TPC t 0 t 1 t 2 … tm SM 0 MT IU SP TPC: Thread Processing Cluster (Texture Processing Cluster) SM 1 t 0 t 1 t 2 … tm Up to 8 blocks can be assigned to an SM for execution MT IU SP Blocks Shared Memory TF A device may run thread blocks sequentially or even in parallel, if it has enough resources for this, or usually by a combination of both. A TPC has Texture L 1 2 SMs in the G 80/G 92 3 SMs in the G 200 L 2 Memory Figure: Assigning thread blocks to streaming multiprocessors (SM) for execution [12]

6. References (5) Segmenting thread blocks into warps • Threads are scheduled for execution 6. References (5) Segmenting thread blocks into warps • Threads are scheduled for execution in groups of 32 threads, called the warps. • For scheduling each thread block is subdivided into warps. • At any point of time up to 24 warps can be maintained by the scheduler. Block … 1 Warps t 0 t 1 t 2 … t 31 … Block … 2 Warps t 0 t 1 t 2 … t 31 … Streaming Multiprocessor Remark Instruction L 1 The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model. Data L 1 Instruction Fetch/Dispatch Shared Memory SP SP SFU SP SP Figure: Segmenting thread blocks in warps [12]

6. References (5) Scheduling warps for execution • The warp scheduler is a zero-overhead 6. References (5) Scheduling warps for execution • The warp scheduler is a zero-overhead scheduler • Only those warps are eligible for execution whose next instruction has all operands available. • Eligible warps are scheduled SM multithreaded Warp scheduler • coarse grained (not indicated in the figure) • priority based. time warp 8 instruction 11 warp 1 instruction 42 All threads in a warp execute the same instruction when selected. warp 3 instruction 95. . . warp 8 instruction 12 4 clock cycles are needed to dispatch the same instruction to all threads in the warp (G 80) warp 3 instruction 96 Figure: Scheduling warps for execution [12]

5. 3 Intel’s Larrabee 5. 3 Intel’s Larrabee

5. 2 Intel’s Larrabee (1) Larrabee Part of Intel’s Tera-Scale Initiative. • Objectives: High 5. 2 Intel’s Larrabee (1) Larrabee Part of Intel’s Tera-Scale Initiative. • Objectives: High end graphics processing, HPC Not a single product but a base architecture for a number of different products. • Brief history: Project started ~ 2005 First unofficial public presentation: 03/2006 (withdrawn) First brief public presentation 09/07 (Otellini) [29] First official public presentations: in 2008 (e. g. at SIGGRAPH [27]) Due in ~ 2009 • Performance (targeted): 2 TFlops

5. 2 Intel’s Larrabee (2) NI: New Instructions Figure: Positioning of Larrabee in Intel’s 5. 2 Intel’s Larrabee (2) NI: New Instructions Figure: Positioning of Larrabee in Intel’s product portfolio [28]

5. 2 Intel’s Larrabee (3) Figure: First public presentation of Larrabee at IDF Fall 5. 2 Intel’s Larrabee (3) Figure: First public presentation of Larrabee at IDF Fall 2007 [29]

5. 2 Intel’s Larrabee (4) Basic architecture Figure: Block diagram of the Larrabee [30] 5. 2 Intel’s Larrabee (4) Basic architecture Figure: Block diagram of the Larrabee [30] • Cores: In order x 86 IA cores augmented with new instructions • L 2 cache: fully coherent • Ring bus: 1024 bits wide

5. 2 Intel’s Larrabee (5) Figure: Block diagram of Larrabee’s cores [31] 5. 2 Intel’s Larrabee (5) Figure: Block diagram of Larrabee’s cores [31]

5. 2 Intel’s Larrabee (6) Larrabee’ microarchitecture [27] Derived from that of the Pentium’s 5. 2 Intel’s Larrabee (6) Larrabee’ microarchitecture [27] Derived from that of the Pentium’s in order design

5. 2 Intel’s Larrabee (7) Main extensions • 64 -bit instructions • 4 -way 5. 2 Intel’s Larrabee (7) Main extensions • 64 -bit instructions • 4 -way multithreaded (with 4 register sets) • addition of a 16 -wide (16 x 32 -bit) VU • increased L 1 caches (32 KB vs 8 KB) • access to its 256 KB local subset of a coherent L 2 cache • ring network to access the coherent L 2 $ and allow interproc. communication. Figure: The anchestor of Larrabee’s cores [28]

5. 2 Intel’s Larrabee (8) New instructions allow explicit cache control • to prefetch 5. 2 Intel’s Larrabee (8) New instructions allow explicit cache control • to prefetch data into the L 1 and L 2 caches • to control the eviction of cache lines by reducing their priority. the L 2 cache can be used as a scratchpad memory while remaining fully coherent.

5. 2 Intel’s Larrabee (9) The Scalar Unit • supports the full ISA of 5. 2 Intel’s Larrabee (9) The Scalar Unit • supports the full ISA of the Pentium (it can run existing code including OS kernels and applications) • provides new instructions, e. g. for • bit count • bit scan (it finds the next bit set within a register).

5. 2 Intel’s Larrabee (10) The Vector Unit Mask registers have one bit per 5. 2 Intel’s Larrabee (10) The Vector Unit Mask registers have one bit per bit lane, to control which bits of a vector reg. or memory data are read or written and which remain untouched. VU scatter-gather instructions (load a VU vector register from 16 non-contiguous data locations from anywhere from the on die L 1 cache without penalty, or store a VU register similarly). Numeric conversions 8 -bit, 16 -bit integer and 16 bit FP data can be read from the L 1 $ or written into the L 1 $, with conversion to 32 -bit integers without penalty. L 1 D$ becomes as an extension of the register file Figure: Block diagram of the Vector Unit [31]

5. 2 Intel’s Larrabee (11) ALUs • ALUs execute integer, SP and DP FP 5. 2 Intel’s Larrabee (11) ALUs • ALUs execute integer, SP and DP FP instructions • Multiply-add instructions are available. Figure: Layout of the 16 -wide vector ALU [31]

5. 2 Intel’s Larrabee (12) Task scheduling performed entirely by software rather than by 5. 2 Intel’s Larrabee (12) Task scheduling performed entirely by software rather than by hardware, like in Nvidia’s or AMD/ATI’s GPGPUs.

5. 2 Intel’s Larrabee (13) SP FP performance 2 operations/cycle 16 ALUs 32 operations/core 5. 2 Intel’s Larrabee (13) SP FP performance 2 operations/cycle 16 ALUs 32 operations/core At present no data available for the clock frequency or the number of cores in Larrabee. Assuming a clock frequency of 2 GHz and 32 cores SP FP performance: 2 TFLOPS

5. 2 Intel’s Larrabee (14) Figure: Larrabee’s software stack (Source Intel) Larrabee’s Native C/C++ 5. 2 Intel’s Larrabee (14) Figure: Larrabee’s software stack (Source Intel) Larrabee’s Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.