Aerospace & Electronic Systems Society
The Cell Broadband Engine Processor: Hardware, Software, Performance and Applications
John Brickman, Director, Business Manager, Performance Computing Group
© 2006 Mercury Computer Systems, Inc.

Cell Chip Lives in Two Worlds
• Game console chip market
  § Driven by “game physics” requirements, not just graphics
    • Compute intensive, vector processing, floating and fixed point
  § New consoles introduced every 5+ years, last about 10 years
    • PS3 unveiled May 2005, will launch November 2006, about 6 years after PS2
  § New chip architectures linked to console designs
    • Chip architecture unchanged during lifetime
    • Process shrinks targeted at lower cost and lower power
• High performance processor market
  § Evolving architecture with backwards compatibility
  § Piggy-back off largest volume processor platform that is leading in performance
    • With affordable architecture increments to address high performance needs
  § Previously desktop PC, now game console
• Cell roadmap addresses both game console and high performance markets

Mercury’s Relationship with IBM
In June 2005, Mercury announced a strategic alliance agreement with IBM offering Mercury special access to IBM expertise including the broadly publicized Cell technology.
Multicomputer-on-a-chip

Cell BE Processor Block Diagram
• Cell BE processor boasts nine processors on a single die
  § 1 Power® processor
  § 8 vector processors
• Computational performance
  § 205 GFLOPS @ 3.2 GHz
  § 410 GOPS @ 3.2 GHz
• A high-speed data ring connects everything
  § 205 GB/s maximum sustained bandwidth
• High performance chip interfaces
  § 25.6 GB/s XDR main memory bandwidth

Synergistic Processing Element
• Standalone vector processor
  § 128 bit SIMD model
  § 128 registers each 128 bits wide
    • AltiVec/VMX has only 32 registers, SSE3 only eight
• 256 KB local store
  § Load/store instructions can access only local store
• Memory flow controller
  § DMA engine built into each SPE
  § SPE includes DMA instructions for explicitly moving data between local store and main memory
• Performance
  § Dual issue
  § Two- to sixteen-way SIMD
  § 25.6 GFLOPS (single precision), 51 GOPS (8 bit)

SPE 128 Bit SIMD Engine
• Operates on 128 bit vector registers
  § 2 x 64 bits (DP float)
  § 4 x 32 bits (SP float or integer)
  § 8 x 16 bits (integer)
  § 16 x 8 bits (integer)
• Example: Floating point multiply add
    fma vr, v1, v2, v3
  § 4 x 32 bit fma instruction can complete eight floating point operations (FLOPS) every cycle
[Diagram: 128-bit registers v1 and v2 are multiplied lane by lane, v3 is added, and the result is written to vr]
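The flop count on this slide can be checked with a small scalar sketch. This is plain portable C, not SPE intrinsics or assembly; the type `vec4f` and the functions `fma4` and `peak_gflops` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical 4-lane single-precision vector; stands in for one 128-bit register. */
typedef struct { float lane[4]; } vec4f;

/* Scalar emulation of a 4 x 32-bit multiply-add: r = a * b + c in every lane.
 * Each lane performs one multiply and one add, so one call counts as 8 flops. */
vec4f fma4(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* Peak GFLOPS = number of cores * flops per cycle per core * clock in GHz. */
double peak_gflops(int cores, int flops_per_cycle, double ghz)
{
    return cores * flops_per_cycle * ghz;
}
```

Under these assumptions, `peak_gflops(1, 8, 3.2)` gives 25.6, the single-precision figure quoted for one SPE, and `peak_gflops(8, 8, 3.2)` gives 204.8, which the deck rounds to 205 GFLOPS for the eight SPEs.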

Power® Processing Element
• 64-bit Power® core with complete AltiVec™/VMX
• High frequency
• Low power consumption
• Hardware multi-threading
• L2 is 512 KB
• Can use any SPE’s DMA engine
AltiVec is a registered trademark of Freescale Semiconductor Corp.

Why is Cell So Fast?
• The SPE is a very fast, very lean core
  § SPE (3.2 GHz) is up to 3 times faster than the fastest Pentium core (3.6 GHz) when computing FFTs
  § That’s 24X better performance chip to chip
• Huge internal chip bandwidth
  § 205 GB/s sustained ring bandwidth
  § 25.6 GB/s main memory bandwidth
• High performance DMA
  § DMA can be fully overlapped with SPE computation
  § Software controlled DMAs can bring exactly the right data into local store at the right time
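The overlap of DMA with computation is usually organized as double buffering. The sketch below shows only the buffer rotation, in portable C: `fetch_tile` is a hypothetical stand-in that uses a synchronous `memcpy`, whereas a real SPE program would issue an asynchronous DMA and wait on it before touching the buffer. `TILE` and `sum_double_buffered` are likewise illustrative names, not part of any Cell API.

```c
#include <stddef.h>
#include <string.h>

#define TILE 4  /* elements per tile; kept tiny for illustration */

/* Stand-in for an asynchronous DMA "get" from main memory into local store.
 * Here it is a plain synchronous copy. */
void fetch_tile(float *local, const float *main_mem, size_t tile_index)
{
    memcpy(local, main_mem + tile_index * TILE, TILE * sizeof(float));
}

/* Sum all tiles using two local buffers: before tile t is processed,
 * tile t + 1 has already been requested into the other buffer. */
float sum_double_buffered(const float *main_mem, size_t n_tiles)
{
    float buf[2][TILE];
    float total = 0.0f;

    if (n_tiles == 0)
        return total;

    fetch_tile(buf[0], main_mem, 0);                    /* prime the first buffer */
    for (size_t t = 0; t < n_tiles; ++t) {
        size_t cur = t % 2;
        if (t + 1 < n_tiles)
            fetch_tile(buf[1 - cur], main_mem, t + 1);  /* request the next tile */
        for (size_t i = 0; i < TILE; ++i)               /* compute on the current tile */
            total += buf[cur][i];
    }
    return total;
}
```

With an asynchronous transfer in place of `memcpy`, the request for tile t + 1 would proceed while the inner loop works on tile t, which is the overlap the slide describes.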

Mercury Cell Hardware Products

Mercury Cell Related Roadmap
[Roadmap timeline, 3Q 2006 through 2008. The chart layout did not survive extraction; the recoverable labels are listed below.]
• Row labels: Blades, 1U Servers, Embedded
• Products named: Dual Cell Based Blade, Dual Cell Based Blade 2, Dual Cell Based Blade 3, Dual Cell Based Server 3, CAB PCIe Add-In Card, Turismo Chassis Concept, ATCA Blade Concept, Rugged PowerBlock™ 200, VITA 46/48 ½ ATR Concept, PowerStream Concept
• Configurations shown:
  § 2 BE, 2 Southbridges, 1 GB XDR
  § Single slot, 2 BE, 2 Companion Chips, 4 GB XDR+DDR2
  § Single slot, 2 BE, 2 Companion Chips, up to 32 GB DDR2
  § 1 BE, 1 Companion Chip, 4 GB DDR2, 1 GB XDR

Dual Cell Based Blade
• Flexible blade solution based on the Cell BE processor
  § Outstanding performance for HPC applications
  § Designed for distributed processing
  § Cell-optimized software available
  § About 11 TFLOPS in 5 feet of rack height
• Dual-width BladeCenter™ blade
• Two PCI Express x4 expansion slots
  § Initially supports only InfiniBand cards
• Evaluation units available since December 2005
• Production October 2006

Dual Cell Based Blade Block Diagram
[Block diagram: two 3.2 GHz Cell processors linked at 20 GB/s each way. Each Cell processor has 512 MB XDR DRAM at 25.6 GB/s and a Southbridge with GbE and PCI Express x4 to an InfiniBand daughtercard; a 2.5 GB/s each way link is also labeled. Other labels: BladeCenter midplane connector, power, serial port.]

Cell Blade Systems
Complete 19” rack-based systems
• 25U (42.75” high)
  § Up to 14 blades, 5.7 TFLOPS
• 42U (73.5”) chassis
  § Up to 28 blades, 11.5 TFLOPS
• Multi-rack systems scalable using InfiniBand and GbE
Cell Technology Evaluation System
• Complete turn-key Cell HW & SW system
• 25U rack
• One Dual Cell-Based Blade
  § All components included to support expansion to 7 blade system
• MultiCore Plus SDK
  § One year subscription to production SW
[Photo: 25U 14-blade system, front and rear. Labels: monitor and keyboard, serial line concentrator, Xeon based Linux server, external GbE switch, BladeCenter chassis, power distribution.]
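The rack figures follow from the per-chip peak. The sketch below assumes two Cell processors per blade and 204.8 GFLOPS per chip (8 SPEs x 8 flops per cycle x 3.2 GHz, which the deck rounds to 205); `rack_tflops` is a hypothetical helper name.

```c
/* Peak TFLOPS for a rack: blades * chips per blade * GFLOPS per chip / 1000. */
double rack_tflops(int blades, int chips_per_blade, double gflops_per_chip)
{
    return blades * chips_per_blade * gflops_per_chip / 1000.0;
}
```

Under those assumptions, `rack_tflops(14, 2, 204.8)` is about 5.73 and `rack_tflops(28, 2, 204.8)` is about 11.47, consistent with the 5.7 and 11.5 TFLOPS quoted for the 25U and 42U systems.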

1U Dual-Cell Based Server
• Hardware
  § Dual Cell processors at 3.2 GHz
  § 1 GB of XDR DRAM
  § Integrated dual Gigabit Ethernet
  § Serial port
  § Dual full size PCI Express x4 slots
    • Initially supports only InfiniBand cards
• Software
  § Toolchain
    • Native (PPE hosted)
    • Cross (x86 hosted)
  § GUI via X-Windows over GbE
    • No direct keyboard / video / mouse support
• Production Q1 2007

Cell Companion Chip
• Under design by IBM since May 2005
  § With significant design input from Mercury
• First parts began preliminary testing June 2006
• Second spin for production in December 2006
[Block diagram labels: DMA, 405 PPC, DDR2 667 MHz, GbE, GPIO, UART, PCI-X, Mailbox, PCIe 16x, Cell BE Interface 5 GB/s]
• DDR2 controllers
  § 5 GB/s each
  § Up to 4 GB each
• PCIe 16x interfaces, each configurable:
  § 8x, 4x, 2x and 1x
  § Endpoint or root complex
• Cell BE Interface
  § 5 GB/s each way
  § Extends Cell global address space to PCIe, DDR2 etc.
  § Non-coherent (non-cached)
• Low latency, high capacity mailbox
• Multichannel, striding DMA engine

Dual Cell Based Blade 2
• Single slot blade
  § Up to twice the density
• Uses new companion chip
  § Up to 10x I/O bandwidth
• DDR2 I/O buffer memory
• Production available Q3 2007
[Block diagram: One-Slot Processor Blade and One-Slot I/O Expansion Blade. Labels: two 3.2 GHz Cell processors linked at 20 GB/s each way, each with 1 GB XDR DRAM at 25.6 GB/s and a Companion Chip at 5 GB/s each way; 2-8 GB DDR2 per Companion Chip; 2 PCIe 8x; PCIe 16x; IB 4x; GbE; power; BladeCenter H High Speed Daughtercard; PCIe x16 / PCI-X daughtercards.]

Dual Cell Based Blade 3 Concept
• Improved SPE double precision performance
• Expanded memory
  § DDR2 replaces XDR
• Production available Q1 2008
[Block diagram: One-Slot Processor Blade and One-Slot I/O Expansion Blade. Labels: two 3.2 GHz Cell processors linked at 20 GB/s each way, each with 8-16 GB DDR2 at 25.6 GB/s and a Companion Chip at 5 GB/s each way; 1-2 GB DDR2 per Companion Chip; 2 PCIe 8x; PCIe 16x; 2 IB 4x; GbE; power; BladeCenter H High Speed Daughtercard; PCIe / PCI-X x16 daughtercard.]

1U Dual-Cell Based Server 2
• 1U solution based on the companion chip
• Dual 3.2 GHz Cell processors
• Memory
  § 2 GB of XDR
  § 4-16 GB of DDR2
• I/O
  § Daughtercard site options under consideration
    • PCI-E and PCI-X customer options
  § Dual GigE
  § Dual IB 4x
• Production available Q3 2007

1U Dual-Cell Based Server 3 Concept
• 1U solution with enhanced memory capacity
• Dual 3.2 GHz Cell processors
• Memory
  § 16-32 GB of DDR2
  § Main memory is now DDR2 DIMMs
  § 1-2 GB of DDR2 per companion chip for I/O buffering
• I/O
  § PCIe / PCI-X daughtercards
  § Dual GigE
  § Dual IB 4x
• Production available Q1 2008

Cell Accelerator Board
• PCI Express™ accelerator card compatible with high-end workstations
• More than 180 GFLOPS on a desktop
• 1 GB of XDR and 4 GB of DDR2
• Gigabit Ethernet on end bracket
• Internal prototype boards with FPGA bridge received July 2006
• Boards with the prototype bridge silicon received September 2006
• Volume production of boards Q1 2007

Cell Accelerator Board Block Diagram
[Block diagram labels: 2.8 GHz Cell processor; 1 GB XDR DRAM at 22 GB/s; Companion Chip at 8 GB/s; 4 GB DDR2]

Software is the Key to Harnessing Cell Performance!
• Mercury’s MultiCore Plus SDK

Cell BE Processor Architecture
• Resembles distributed memory multiprocessor with explicit DMA over a fabric

Mercury Multi-DSP Board (1996)

Programming Cell: What’s Good and What’s Hard
• SPE
  § Good
    • No second guessing about cache replacement algorithm
    • Very deterministic pipeline
    • 128 registers mask pipeline latency very well
  § Hard
    • Burden on software to get code and data into local store
    • Local store is small compared to ring latency
    • Branch prediction is manual and very restricted
• Ring and XDR
  § Good
    • DMA has negligible impact on SPE local store bandwidth
    • Generous ring bandwidth means topology is seldom an issue
  § Hard
    • 128 byte alignment necessary for best performance
    • XDR bandwidth is a bottleneck
    • Linking Cell chips in coherent mode increases latency
• PPE
  § Good
    • Standard Power® core
  § Hard
    • Performance is modest
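The 128 byte alignment requirement can be met on the host side with standard C11 allocation. This is a minimal sketch and not the MCF allocator; `DMA_ALIGN`, `round_up_dma`, `alloc_dma_buffer` and `is_dma_aligned` are hypothetical names. It relies on `aligned_alloc` from C11, which expects the size to be a multiple of the alignment.

```c
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 128u  /* alignment the slide names for best performance */

/* Round a byte count up to the next multiple of DMA_ALIGN. */
size_t round_up_dma(size_t n_bytes)
{
    return (n_bytes + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
}

/* Allocate a buffer whose address and length are both multiples of 128 bytes.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_dma_buffer(size_t n_bytes)
{
    return aligned_alloc(DMA_ALIGN, round_up_dma(n_bytes));
}

/* Return nonzero if the pointer is 128-byte aligned. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}
```

Padding the length as well as aligning the address keeps every transfer a whole number of 128-byte blocks, at the cost of a little wasted space per buffer.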

How Much Faster Is Cell?
Relative performance of Cell and leading general purpose processors
• Performance relative to 1 GHz Freescale 744x (i.e. Freescale = 1)
• In all cases, we are comparing Mercury optimized Cell algorithm implementations with the best available (Mercury or 3rd party) implementations on other processors
• Did not compare with dual core x86 processors
[Charts: Single precision complex FFTs; Symmetric image filters]

Goals for Programming Cell
• Achieve high performance:
  § The only reason for choosing Cell
• Ease of programming:
  § An important aspect of this is programmer portability
• Code portability
  § Important for large legacy code bases written in C/C++, Fortran
  § And new code developed for Cell should be portable to current and anticipated multiprocessor architectures

Linux OS
• Linux on Cell patches released by IBM Linux Technology Center
  § Kernel version 2.6.17
  § libspe version 1.1
  § Built and tested with Fedora Core 5 distribution
  § IBM LTC releases packages through Barcelona Supercomputing Center to official kernel website www.bsc.es/projects/deepcomputing/linuxoncell/
• Mercury works closely with IBM Linux team on performance optimization
  § Linux now able to achieve maximum hardware performance possible on Dual Cell-Based Blade
  § NUMA support, PPE affinity, SPE affinity, 64 KB and 16 MB page support
• Mercury uses Terra Soft Solutions Y-HPC Distribution
  § Mercury contracted TSS to port Y-HPC to the Dual Cell Based Blade
  § Distributions are tested and supported on Mercury hardware
  § Mercury assists TSS with driver development
    • GbE, uDAPL, InfiniBand

The MultiCore Plus SDK
• MultiCore Framework (MCF)
• Scientific Algorithm Library (SAL)
• MultiCore Plus IDE
• TATL
• SPEAK

Mercury Approach to Programming Cell
• Very pragmatic
  § Can’t wait for tools to mature
  § Develop our own tools when it makes sense
• Emphasis on explicitly programming the architecture rather than trying to hide it
  § When the tools are immature, this allows us to get maximum performance
• Achieve ease-of-use and portability through function offload model
  § Run legacy code on PPE
  § Offload compute intensive workload to SPEs

MultiCore Framework
• An API for programming heterogeneous multicores that contain explicit non-cached memory hierarchies
• Provides an abstract view of the hardware oriented toward computation of multidimensional data sets
• First implementation is for the Cell BE processor

MCF Abstractions
• Function offload model
  § Worker Teams: Allocate tasks to SPEs
  § Plug-ins: Dynamically load and unload functions from within worker programs
• Data movement
  § Distribution Objects: Defining how n-dimensional data is organized in memory
  § Tile Channels: Move data between SPEs and main memory
  § Re-org Channels: Move data among SPEs
  § Multibuffering: Overlap data movement and computation
• Miscellaneous
  § Barrier and semaphore synchronization
  § DMA-friendly memory allocator
  § DMA convenience functions
  § Performance profiling

MCF Distribution Objects
[Diagram: a Frame is one complete data set in main memory; a Tile is the unit of work for an SPE]
• Distribution Object parameters:
  § Number of dimensions
  § Frame size
  § Tile size and tile overlap
  § Array indexing order
  § Compound data type organization (e.g. split / interleaved)
  § Partitioning policy across workers, including partition overlap
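Tile overlap matters for operations such as filters, where each output element needs a few neighbors from the adjacent tile. The one-dimensional sketch below computes which elements would be fetched for a given tile. It illustrates the idea only and is not how MCF represents overlap; `span` and `tile_fetch_span` are hypothetical names, and a symmetric overlap on both sides is an assumption.

```c
#include <stddef.h>

/* Half-open element range [begin, end) within a frame. */
typedef struct { size_t begin, end; } span;

/* Elements to fetch for tile t: the tile itself plus `overlap` extra
 * elements on each side, clamped to the frame boundaries. */
span tile_fetch_span(size_t frame_len, size_t tile_len, size_t overlap, size_t t)
{
    size_t begin = t * tile_len;
    size_t end = begin + tile_len;
    span s;
    s.begin = begin > overlap ? begin - overlap : 0;
    s.end = end + overlap < frame_len ? end + overlap : frame_len;
    return s;
}
```

For a 1000-element frame with 128-element tiles and an overlap of 4, tile 1 covers elements 128 to 255 but would fetch 124 to 259, so a filter reaching 4 elements either way never reads outside its local buffer.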

MCF Partition Assignment
[Diagram: the frame is divided into partitions, each assigned to one worker: SPE 0, SPE 1, SPE 2]
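One simple partitioning policy is to hand each worker a contiguous block of tiles. The sketch below is an illustration under that assumption; the deck does not say which policies MCF provides, and `tile_count`, `tile_range` and `worker_tiles` are hypothetical names.

```c
#include <stddef.h>

/* Number of tiles needed to cover frame_len elements with tiles of tile_len. */
size_t tile_count(size_t frame_len, size_t tile_len)
{
    return (frame_len + tile_len - 1) / tile_len;
}

/* Half-open range of tile indices [first, last) owned by one worker. */
typedef struct { size_t first, last; } tile_range;

/* Block partitioning: split n_tiles as evenly as possible over n_workers.
 * The first (n_tiles % n_workers) workers each receive one extra tile. */
tile_range worker_tiles(size_t n_tiles, size_t n_workers, size_t worker)
{
    size_t base = n_tiles / n_workers;
    size_t extra = n_tiles % n_workers;
    tile_range r;
    r.first = worker * base + (worker < extra ? worker : extra);
    r.last = r.first + base + (worker < extra ? 1 : 0);
    return r;
}
```

With 8 tiles and 3 workers, as in the three-SPE diagram, the workers would own tiles 0 to 2, 3 to 5 and 6 to 7 respectively.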

MCF Tile Channels
[Diagram: a tile channel carries tiles from the frame's partitions to the workers: SPE 0, SPE 1, SPE 2]

MCF Tile Channels
• Manager (PPE) generates data set and injects it into input tile channel
• Input tile channel subdivides data set into tiles
• Each worker (SPE) extracts tiles out of input tile channel...
• ...computes on input tiles to produce output tiles...
• ...and inserts them into output tile channel
• Output tile channel automatically puts tiles into correct location in output data set
• When output data set is complete, manager is notified and extracts data set
[Diagram labels: manager, input tile channel, worker 1, worker 2, worker 3, output tile channel]

MCF Manager Program

    int main(int argc, char **argv)
    {
        mcf_m_net_create();
        mcf_m_net_initialize();

        /* Add worker tasks */
        mcf_m_net_add_task();
        mcf_m_team_run_task();

        /* Specify data organization */
        mcf_m_tile_distribution_create_3d("in");
        mcf_m_tile_distribution_set_partition_overlap("in");
        mcf_m_tile_distribution_create_3d("out");

        /* Create and connect to tile channels */
        mcf_m_tile_channel_create("in");
        mcf_m_tile_channel_create("out");
        mcf_m_tile_channel_connect("in");
        mcf_m_tile_channel_connect("out");

        /* Get empty source buffer */
        mcf_m_tile_channel_get_buffer("in");

        /* Fill it with data */
        // fill input data here

        /* Send it to workers */
        mcf_m_tile_channel_put_buffer("in");

        /* Wait for results from workers */
        mcf_m_tile_channel_get_buffer("out");

        // process output data here
    }

MCF Worker Program

    mcf_w_main(int n_bytes, void *p_arg_ls)
    {
        /* Create and connect to tile channels */
        mcf_w_tile_channel_create("in");
        mcf_w_tile_channel_create("out");
        mcf_w_tile_channel_connect("in");
        mcf_w_tile_channel_connect("out");

        while (!mcf_w_tile_channel_is_end_of_channel("in"))
        {
            /* Get full source buffer */
            mcf_w_tile_channel_get_buffer("in");

            /* Get empty destination buffer */
            mcf_w_tile_channel_get_buffer("out");

            /* Do math and fill destination buffer */
            // Do math here

            /* Put back empty source buffer */
            mcf_w_tile_channel_put_buffer("in");

            /* Put back full destination buffer */
            mcf_w_tile_channel_put_buffer("out");
        }
    }

MCF Implementation
• Consists of
  § PPE library
  § SPE library and tiny executive (12 KB)
• Utilizes Cell Linux “libspe” support
  § But amortizes expensive system calls
  § Reduces overhead from milliseconds to microseconds
  § Provides faster and smaller footprint memory allocation library
• Based on Data Reorg standard
  § http://www.data-re.org
• Derived from existing Mercury technologies
  § Other Mercury RDMA-based middleware
  § DSP product experience with small footprint, non-cached architectures

SAL Primary Markets
• Medical Imaging
• Signals Intelligence
• Radar
• Semiconductor Inspection
• Sonar
• Defense Imaging

Scientific Algorithm Library
• SAL is a collection of optimized functions
  § Baseline
    • Arithmetic, data type conversions, data moves
  § DSP
    • FFTs, convolutions, correlation, filters, etc.
  § Linear Algebra
    • Linear systems, matrix decomposition, etc.
  § Parallel Algorithms (future)
    • High level algorithms on multiple cores
    • Invoked from application running on PPE
    • Automatically use one or more SPEs
    • Initial work done for 1D and 2D FFTs and fast convolutions
• PIXL – Image Processing Library
  • Edge detection, fixed point operations and analysis, filtering, manipulation, erosion, dilation, histogram, lookup tables, etc.
  • Work in this area depends on customer demand.
• PPE SAL based on AltiVec optimizations for G4 and G4A2
  § SAL C source code version also available
• SPE SAL is new implementation optimized for SPE architecture
  § Backwards compatibility with existing SAL API except in very rare cases
  § Some new APIs needed in order to extract best performance from SPE
  § Static and plug-in component versions for each function

Eclipse Framework
• Provides an open platform for creating an Integrated Development Environment (IDE)
• Eclipse Consortium manages continuous development of the tool
• Eclipse plug-ins extend the functionality of the framework
• Written in Java
• Compilers, debuggers, TATL, help files, etc. are all Eclipse plug-ins.

Mercury MultiCore Plus IDE
• PPE and SPE cross build support for
  § gcc/g++
  § XLC/C++
• Eclipse CDT (C/C++ Development Toolkit)
  § Syntax highlighting
  § Code completion
  § Content assistance
  § Makefile generation
  § Remote debugging of PPE and SPE applications
  § TATL plug-in

TATL™ Trace Analysis Tool
• Log events from PPE & SPE threads across multiple Cell chips
• Synchronized global timestamps
• Minimally intrusive in space and time
• Timeline trace and histogram viewers
• Structured log file for use in other tools

SPE Assembly Development Kit (SPE-ADK)
• The SPE architecture encourages “bare metal programmers”
  § Very deterministic architecture
  § Performance benefits from hand tuning the pipelines
• SPE-ADK dramatically improves bare metal productivity
• SPE-ADK consists of
  § Assembler preprocessor, optimizer and macro library
• Using SPE-ADK is similar to programming with SPE C extensions
  § But with more deterministic control of instruction scheduling and hardware resources
• SPE-ADK is a productized version of the internal development tool used by all Mercury SAL developers

SPE-ADK Features
• Alignment of instructions for the even and odd pipelines of the SPU
• Automatic insertion of nop's and lnop's or instruction swapping to maintain dual dispatch
• Alignment of loops to minimize instruction fetching overhead
• Register assignment. It automatically:
  § Finds symbolic register operands,
  § Assigns registers to symbols to minimize register usage,
  § Eliminates bugs from inconsistent register assignment.
• Mapping of register usage, both active line number extents per symbol, and active hardware registers per line
• Analysis of stall cycles due to register dependencies
• Optional C emulation for assembly development allows C-like debugging facilities
  § Hardware independence for assembly code,
  § Setting breakpoints at source line numbers,
  § Displaying source code rather than disassembling the object code,
  § Displaying register contents by symbol.
• Detection of errors to preclude bugs:
  § Inconsistent manual register assignment,
  § Write-only variables,
  § Uninitialized variables,
  § Updated but unused variables.

Software Summary
• The Cell BE processor can achieve one to two orders of magnitude performance improvement over current general purpose processors
  § Lean SPE core saves space and power
  § And makes it easier for software to approach peak performance
• Cell is a distributed memory multiprocessor on a chip
  § Prior experience on these architectures translates easily to Cell
• But for most programmers, Cell is a new architecture
  § Successful adoption by programmers is Cell’s biggest challenge
  § And the history of other new processor architectures is not encouraging
• We need a range of tools that span the continuum from ease-of-use to high performance

Markets for Cell
• Aerospace and Defense
• Semiconductor
• Medical Imaging
• Oil and Gas
• Visualization

Sales & Marketing Progress for Cell
Very Active
• Semiconductor inspection – active sales engagements; prototypes sold
• Medical imaging – active sales engagements; prototypes sold
• Semiconductor lithography – active sales engagements; prototypes sold
• Defense signal & image processing – active sales engagements; prototypes sold
• Oil & Gas exploration – active sales engagements; prototypes sold
• Video transcoding – active sales engagements
Less Active for Mercury
• Financial modeling (IBM)
• Gaming
• Animation & rendering
• Defense simulation for training (specialized gaming)

Summary
• Mercury has been developing computing solutions for applications well suited for Cell technology for many years.
• Cell technology represents a significant performance breakthrough similar to historical programming models.
• Customers can leverage Cell technology through Mercury to achieve:
  § Unbiased assessment of risks and applicability of deploying Cell-based solutions.
  § Significant improvements in performance and bandwidth for certain applications compared to conventional processors

For More Information
(866) 627-6951 (US)
(978) 967-1401 (International)
E-mail: webinfo@mc.com
Web: www.mc.com/cell

Backup Slides

Semiconductor DFM Requirements

Moore’s Law Irrelevant
• Processing requirements of semiconductor industry are increasing at an even faster rate
• Driven by:
  § Increased feature density
  § Increased complexity of processing due to subwavelength physics
  § Tool specific features
• Processing needs outpace mainstream computing as data rates and algorithm complexity increase
[Chart: Processing Requirements (12X) versus Moore’s Law (4X), Year 1 through Year 4]

OPC/RET/DFM – The need for speed • Resolution Enhancement Technologies (RET) • Optical Proximity Correction (OPC) • Phase Shift Masks (PSM) • Off-axis Illumination (OAI) • Design for Manufacturing (DFM) • WYSIWYG no more. CHALLENGES • Reduce OPC cycle times from days/weeks to hours • Simulation models that ensure a mask will work when printed • Computing goes up by an order of magnitude at every design node (e.g. 65 nm to 45 nm). Quotes from top chip designers: “It takes 8 days with 500 nodes to do OPC on a single chip layer … and we need it to be 10 to 100 times faster” “We have 10,000 blades to do RET” Source: AMD

Cost of Ownership • System sizes to do RET and lithography simulation are expanding to the 1000s of 1U servers • Dense racks of servers are expensive to maintain § Cost of electricity to power computers § Cost of capital infrastructure for electricity delivery § Cost of electricity to power HVAC systems § Cost of capital infrastructure for HVAC § Challenge of managing air flow

Cost of Ownership • A single dual-processor server § Consumes 250-400 Watts § Costs $100-200/year just to power (at $0.05/kWh) § Comparable amounts for HVAC and capital costs • A rack of 84 such servers § Costs $10K+ per year to power § Comparable amounts for HVAC and capital costs • Operators of data centers now see power and cooling costs as more significant than cost of computing hardware
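
The power figures on this slide can be checked with a short calculation. The sketch below is only an illustration: it assumes each server draws its rated power continuously for a full year, and the function name `annual_power_cost` is made up for this example.

```python
# Assumption: the server runs 24 hours a day, 365 days a year.
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(watts, dollars_per_kwh=0.05):
    """Yearly electricity cost in dollars for a constant load in watts."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * dollars_per_kwh


# One dual-processor server at the low and high ends of the quoted range.
server_low = annual_power_cost(250)
server_high = annual_power_cost(400)

# A rack of 84 such servers (power only, no HVAC or capital costs).
rack_low = 84 * server_low
rack_high = 84 * server_high

print(f"server: ${server_low:.0f} to ${server_high:.0f} per year")
print(f"rack:   ${rack_low:.0f} to ${rack_high:.0f} per year")
```

Under these assumptions a server costs roughly $110 to $175 per year to power and a rack roughly $9,200 to $14,700, which is consistent with the slide's "$100-200/year" and "$10K+ per year".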

Processing Efficiency • The metric of performance per dollar must be expanded to include not just the cost of the hardware but also the lifetime cost of operating the computer system • Performance/Watt, which used to just be a metric for the embedded and defense industry, is now important for commercial customers as well

Summary • Cell processor technology provides: § Order-of-magnitude improvement in computing performance per processor for OPC/RET applications § Significant improvement in performance per Watt § Significant performance breakthrough for other critical computationally intensive applications • The right software infrastructure is critical for: § Taking full advantage of specialized processing units § Partitioning the application among a heterogeneous group of processing cores § Parallelizing the application among multiple processing nodes • Cell can significantly improve OPC/RET turnaround time

Ray Tracing • Mercury Computer Systems • Visualization and Sciences Group

What is Ray Tracing? § Computer graphics rendering technique which mathematically simulates rays of light § Capable of producing photorealistic images § Used in a variety of markets Ø Automotive, aerospace and marine virtual prototyping Ø Architecture Ø Industrial Design Ø Digital Content Creation in film and video

Basic Technique § For each pixel in the screen, send out a ray of light from the viewpoint. § Test the ray for intersection against every object in the scene. § If the ray does not intersect an object, set the pixel to the background color. § If the ray does intersect an object, set the pixel to the color of the first object it intersects.
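
The four steps above can be sketched as a minimal ray caster. This is an illustration only, not Mercury's implementation: it assumes a scene made purely of spheres, a viewpoint at the origin looking down the negative z axis, and single characters standing in for colors. The names `hit_sphere`, `render`, and `scene` are hypothetical.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Return the ray parameter t of the nearest hit in front of the origin, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    a = sum(d * d for d in direction)
    b = 2.0 * sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t > 1e-9:
            return t
    return None  # the sphere is behind the viewpoint


def render(width, height, spheres, background="."):
    """Cast one ray per pixel and return the image as a list of strings."""
    eye = (0.0, 0.0, 0.0)
    rows = []
    for j in range(height):
        row = []
        for i in range(width):
            # Map the pixel centre onto the image plane z = -1.
            x = (i + 0.5) / width * 2.0 - 1.0
            y = 1.0 - (j + 0.5) / height * 2.0
            direction = (x, y, -1.0)
            # Start with the background; keep the nearest hit, if any.
            color, nearest = background, math.inf
            for center, radius, sphere_color in spheres:
                t = hit_sphere(eye, direction, center, radius)
                if t is not None and t < nearest:
                    color, nearest = sphere_color, t
            row.append(color)
        rows.append("".join(row))
    return rows


# Hypothetical scene: a large sphere with a small one partly in front of it.
scene = [
    ((0.0, 0.0, -3.0), 1.0, "#"),
    ((0.5, 0.0, -2.0), 0.3, "o"),
]
image = render(9, 9, scene)
print("\n".join(image))
```

The inner loop over `spheres` is the "test every object" step, and it is the reason the cost of this naive approach grows with the product of pixel count and object count.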

More Advanced Technique

Characteristics of Ray Tracing • Simulating the Physics of Light • Simulates light transport by following “photons” • Fully parallel: just as in nature • Demand-driven: start from the camera • Correctly orders rendering effects (per pixel!) • Can account for all global effects • All effects are orthogonal to each other • Makes content design easy and fast • Requires very large amount of CPU in order to be interactive • Driven by intersection calculations • Every ray checked against all objects • Each secondary ray becomes a primary ray in a recursive algorithm • 800 x 600 screen, 3 light sources, 50 opaque objects requires 600 billion intersection tests!

Challenges Implementing on Cell • In-order instruction access and SIMD § Must carefully optimize instructions to avoid stalls § Must parallelize code to take advantage of SIMD instructions • Memory Access § DMA engines must move data into LS from XDR § Hiding latency requires overlapped I/O and processing (DMA read latency is a few hundred clock cycles) § Even more challenging for irregular data access • Mapping to 8 SPEs § Mapping algorithm very important with Cell architecture
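
Overlapping I/O with processing is commonly done with double buffering: while the processor works on one local buffer, the DMA engine fills the other. The sketch below is a plain Python model of that loop, not Cell SDK code. The function names and the cycle counts (300 cycles per transfer, 500 per compute step) are hypothetical values chosen for illustration, and the transfer is modelled as running concurrently rather than actually doing so.

```python
def serial_cycles(n_blocks, dma_cycles, compute_cycles):
    """Cost model with no overlap: fetch a block, then process it."""
    return n_blocks * (dma_cycles + compute_cycles)


def process_double_buffered(main_memory, block_size, kernel,
                            dma_cycles, compute_cycles):
    """Process main_memory block by block using two local buffers.

    Returns (results, modelled_cycles). The fetch of block i+1 is modelled
    as running at the same time as the computation on block i.
    """
    n_blocks = (len(main_memory) + block_size - 1) // block_size
    results, cycles = [], 0
    if n_blocks == 0:
        return results, cycles

    def dma_get(index):
        # Stand-in for a DMA transfer from main memory into local store.
        return main_memory[index * block_size:(index + 1) * block_size]

    buffers = [None, None]
    buffers[0] = dma_get(0)
    cycles += dma_cycles  # the first transfer cannot be hidden

    for i in range(n_blocks):
        current = i % 2
        if i + 1 < n_blocks:
            # Start filling the other buffer, then compute on this one.
            buffers[1 - current] = dma_get(i + 1)
            cycles += max(compute_cycles, dma_cycles)
        else:
            cycles += compute_cycles
        results.extend(kernel(x) for x in buffers[current])
    return results, cycles


data = list(range(1000))
squares, overlapped = process_double_buffered(
    data, block_size=10, kernel=lambda x: x * x,
    dma_cycles=300, compute_cycles=500)
print("serial cycles:    ", serial_cycles(100, 300, 500))
print("overlapped cycles:", overlapped)
```

In this model the transfer latency disappears from the total whenever the compute step per block is at least as long as the transfer. Irregular access patterns make that harder because the next block to fetch is not known far enough in advance.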

Linear Speed-up Across SPEs

Results: Frames per Second (Normalized to 2.4 GHz Opteron) • 2.4 GHz x86: 7.2, 3.0, 2.5 • 2.4 GHz SPE: 7.4 (+3%), 2.6 (-13%), 1.9 (-24%) • 2.4 GHz Cell: 58.1 (8x), 20 (6.6x), 16.2 (6.4x) • 2.4 GHz Dual Cell: 110.9 (15.4x), 37.3 (12.4x), 30.6 (12.2x) • 3.2 GHz Cell: 67.8 (9.4x), 23.2 (7.7x), 18.9 (7.5x)
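
The ratios in parentheses can be reproduced from the frame rates. The sketch below assumes the stray value 7.2 on this slide is the first entry of the 2.4 GHz x86 row, which is the reading that makes every quoted ratio consistent. The three columns are unlabeled in the source, so they are kept as positions only.

```python
# Frames per second as listed on the slide, three unlabeled columns per row.
fps = {
    "2.4 GHz x86": (7.2, 3.0, 2.5),  # 7.2 assigned to this row by assumption
    "2.4 GHz SPE": (7.4, 2.6, 1.9),
    "2.4 GHz Cell": (58.1, 20.0, 16.2),
    "2.4 GHz Dual Cell": (110.9, 37.3, 30.6),
    "3.2 GHz Cell": (67.8, 23.2, 18.9),
}

baseline = fps["2.4 GHz x86"]

# Speed-up of each configuration relative to the x86 baseline.
speedup = {
    name: tuple(value / base for value, base in zip(values, baseline))
    for name, values in fps.items()
}

# Scaling from a single SPE to the full Cell chip at the same clock rate.
spe_scaling = tuple(
    cell / spe
    for cell, spe in zip(fps["2.4 GHz Cell"], fps["2.4 GHz SPE"])
)

for name, ratios in speedup.items():
    print(name, " ".join(f"{r:.2f}x" for r in ratios))
print("Cell / single SPE:", " ".join(f"{r:.2f}x" for r in spe_scaling))
```

The Cell to single-SPE ratios come out between about 7.7 and 8.5, close to the 8 SPEs on the chip, which matches the near-linear scaling named on the preceding slide. The spread is at least partly due to the frame rates being rounded to one decimal place.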

What is OpenRTRT from Mercury? • Highly optimized ray tracing rendering engine • Enabling high-quality rendering at interactive frame rates • Supports large model visualization • Complements GPU OpenGL-based rendering § Realism and rendering effects § Quality and accuracy § Capacity for large models § Performance scalability with multiple CPUs and clusters

OpenRTRT: Real-Time Ray Tracing • Recognized as outstanding, breakthrough technology • Cutting-edge research and dramatic optimizations achieved by U. Saarland and inTrace: § Cache & data layout optimization, parallelization (SIMD/SSE, multi-threading, distribution…) § Interactive even on a PC, enough for preparation work for instance • Scalable performance with multiple CPUs § Allows fully interactive visualization § Performance depends linearly on the number of pixels, rays and processors § Logarithmic in scene size (20 million triangles guaranteed) • Available for Linux on x86, x86-64, and IA64 and Windows 32

Background • 2000: Start of research at the University of Saarland • 2001: Presentation of the first scientific results • 2002: Initial projects with the automotive industry; simulation of ray tracing hardware • 2003: Foundation of inTrace GmbH; Volkswagen AG as first customer (VR-Lab) • 2004: New project: visualization center at Wolfsburg based on ray tracing; first ray tracing hardware prototype • 2005: Projects with basically all German car manufacturers (VW, Audi, BMW, DaimlerChrysler) plus Airbus, Boeing, …; first design of fully programmable chip for ray tracing; exclusive agreement for worldwide distribution with Mercury Computer Systems