
Micro-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
Michael K. Chen, Erik Johnson, Roy Ju
{michael.k.chen, erik.j.johnson, roy.ju}@intel.com
Corporate Technology Group, Intel Corp.
December 5, 2004

Agenda
Project Overview; Domain-specific Language; High-level Optimizations; Code Generation and Optimizations; Performance Characterization; Runtime Adaptation; Summary

Project Overview
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Outline
Problem Statement; Overview of Shangri-la; System Status and Teams

The Problem
Packet processing applications run on chip multiprocessors such as the Intel IXP2800 (a multi-threaded x8 MicroEngine array with RDRAM and QDR SRAM controllers, scratch memory, a hash unit, PCI, a media switch fabric interface, and an Intel XScale core) and future CMPs (e.g., a Tanglewood socket with 16 cores and 16 MB of cache).
State of the art: hand-tuned code for maximal performance, but often error-prone and not scalable.
Static resource allocation is often tailored to one particular workload and is not flexible to varying workloads and hardware.

Shangri-La Overview
Mission: research an industry-leading programming environment for packet processing on Intel chip multiprocessor (CMP) silicon.
Challenges: hide architectural details from programmers; automate allocation of system resources; adapt resource allocation to match dynamic traffic conditions; achieve performance comparable to hand-tuned systems.
Technology:
  Language: enable portable packet processing applications.
  Compiler: automate code partitioning and optimizations.
  Run-time system: adapt to dynamic workloads.

Architectural Features of the Intel IXP Processor
Heterogeneous multi-core: Intel XScale processor (control) and MicroEngines (data).
Memory hierarchy: local memory (LM) distributed on the MEs; no hardware cache; Scratch, SRAM, and DRAM shared; long memory latency.
MicroEngine: single issue with deferred slots; lightweight hardware multi-threading; event signals to synchronize threads; multiple register banks with constraints on instruction operands; limited code store.

Packet Processing Applications
Types of apps: IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol Label Switching), NAT (Network Address Translation), Firewall, QoS (Quality of Service).
Characteristics of packet processing apps: the performance metric is throughput (vs. latency); mostly memory bound; large numbers of packets without locality; smaller instruction footprint; execution paths tend to be predictable.

Anatomy of Shangri-La
Baker programming language: a modular language (with C-like syntax) to express applications as a dataflow graph (the language front-end in a general-purpose compiler).
Profiler: extracts run-time characteristics by executing the application (profiling).
Pi compiler: optimizations for pipeline construction and data structure mapping/caching (inter-procedural, loop/memory, and global optimizations).
Aggregate compiler: code generation and optimization for heterogeneous cores (code generation).
Run-time system (execution environment): dynamically adapts the mapping to match traffic fluctuations.

Baker Language
Familiar to embedded systems programmers: syntactically "feels like" C.
Simplifies the development of packet-processing applications by hiding architectural details: a single level of memory; an implicit threading model; modular programming and encapsulation.
Domain-specific: a dataflow model with actors and interconnects (PPFs and channels); built-in types, e.g., packet.
Enables the compiler to generate efficient code on the target CMP hardware.

Shangri-la Example: Profiler
Baker gives a modular, C-like description of the L3 Switch: RX feeds an L2 classifier, which feeds the L3 forwarder or the L2 bridge, and both feed Ethernet encapsulation and TX. The IR is run on an IR simulator, stimulated by a packet trace, and the resulting statistics are stored in the IR.

    module l3_switch {
        module eth_rx, eth_tx;               // built-in
        ppf l2_clsfr;
        module eth_encap_mod, l3_fwdr, l2_bridge;
        wiring {
            eth_rx.eth0 -> l2_clsfr.input_chnl;
            l3_fwdr.input_chnl <- l2_clsfr.l3_fwrd_chnl;
            ...
        }
    };

    int l3_switch.l2_clsfr.process(ether_packet_t *in_pkt)
    {
        ...
        if (fwd) {
            p = packet_decap(in_pkt);
            channel_put(l3_forward_chnl, p);
            ...
        } else {
            channel_put(l2_bridge_chnl, in_pkt);
            ...
        }
    }

Compiler and Optimizations
Perform program and data partitioning: cluster multiple finer-grained components into larger aggregates; balance between replication and pipelining; automatically map data onto the memory hierarchy.
Optimizations and code generation: code generation for heterogeneous processing cores; global machine-independent optimizations; optimizations for the memory hierarchy; machine-dependent code generation and optimizations.

Shangri-la Example: Aggregate Compiler (Pipeline Compiler)
PPFs are grouped into aggregates, with critical-path PPFs placed in the same aggregate. Internal channels within an aggregate are converted to function calls, each aggregate is given a main function with a while(1) loop, and executable binaries are produced for the Intel XScale core and the MEs (for the L3 Switch: RX, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, TX).

Run-time Adaptation
Workloads fluctuate over time; systems are usually over-provisioned to handle the worst case.
Adapt to the workload: change the mapping to increase performance when needed; power down unneeded processors.
Adaptation requirements: a hardware-independent abstraction; querying of resource utilization.

Shangri-la Example: Run-time System
The run-time system automatically maps aggregates (here, the L3 Switch binaries) onto processing units, the Intel XScale core and the MEs, and automatically remaps them at run time.

Project Status
Project started in Q1 2003; a collaboration among Intel, the Chinese Academy of Sciences, and UT-Austin.
Compiler based on the Open Research Compiler (ORC).
A completed prototype system achieves the maximal packet forwarding rate on a number of applications.
Research project to transfer technology to product groups.

Acknowledgements
Communication Technology Lab, Intel: Erik Johnson, Jamie Jason, Aaron Kunze, Steve Goglin, Arun Raghunath, Vinod Balakrishnan, Robert Odell
Microprocessor Technology Lab, Intel: Xiao Feng Li, Lixia Liu, Jason Lin, Mike Chen, Roy Ju, Astrid Wang, Kaiyu Chen, Subramanian Ramaswamy
Institute of Computing Technology, Chinese Academy of Sciences: Zhaoqing Zhang, Ruiqi Lian, Chengyong Wu, Junchao Zhang, Jiajun Wu, HanDong Ye, Tao Liu, Bin Bao, Wei Tang, Feng Zhou
University of Texas at Austin: Harrick Vin, Jayaram Mudigonda, Taylor Riche, Ravi Kokku

The Baker Language
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Baker Overview and Goals
Baker is C with data-flow and packet processing extensions.
Goal #1: enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor). Encourage interesting, complex application development; be familiar to embedded systems programmers ("should start with C").
Goal #2: enable good execution performance. Scalable performance across new versions of large-scale CMPs; expose optimization opportunities to the compiler and run-time system; don't constrain the compiler's ability to place code or data; don't preclude run-time adaptation.

Outline
The Baker Approach: hardware abstractions and models; domain-specific constructs; standard C language feature reductions.
Results; Future Research; Summary

Hardware Model (1/5): Baker's Hardware Models
Memory
Concurrency, i.e., cores and threads
I/O, e.g., receive and transmit

Hardware Model (2/5): A Single-level Memory Model
Baker exposes a single-level, shared memory model, like C.
Makes programming easier: variable declaration and use, malloc/free work just like C.
Enables compiler freedom in optimizing data placement: move the most-accessed data structures (or parts of structures) to the fastest memory.
Enables the compiler to move code to any core: code is not tied to a particular core's physical memory model.

Hardware Model (3/5): Implicitly Threaded Concurrency
Baker exposes a multithreaded concurrency model.
The programmer knows code may execute concurrently, but does not know the number of cores and does not explicitly create or destroy threads.
A consequence: programmers must protect shared memory with locks.
Enables the compiler and run-time system to optimize execution: they can create an application pipeline and balance it, and can optimize locks based on which processors access a lock.

Hardware Model (4/5): Example of Implicit Threading
Code shown for illustrative purposes only and should not be considered valid.
The slide contrasts three versions of the same packet loop: a single-threaded version, an explicitly threaded version (an init() that calls init_lock(g_lock) and spawns worker threads, each running a for(;;) loop over get_packet()), and Baker's implicitly threaded version, in which another entity calls foo() repeatedly and the shared table is protected by a lock:

    /* Single Threaded */
    void foo() {
        packet* p;
        while (p = get_packet()) {
            g_table[p->key]++;
            do_some_work(p);
            drop(p);
        }
    }

    /* Implicitly Threaded */
    /* Another entity calls foo() repeatedly */
    lock g_lock;
    void foo(void* arg) {
        packet* p;
        if (p = get_packet()) {
            acquire_lock(&g_lock);
            g_table[p->key]++;
            release_lock(&g_lock);
            do_some_work(p);
            drop(p);
        }
    }

Hardware Model (5/5): I/O As A Driver Model
RX and TX require hardware knowledge, e.g., PHYs and MACs, RBUFs, TBUFs, and flow-control hardware; it is difficult to abstract this hardware using common C constructs.
Solution: don't write these in Baker. They are written once in assembly by the system vendors for each board, and Baker developers use the receive and transmit code like a device driver.

Domain Features (1/5): Exposing Domain Features
Hiding hardware features can drastically decrease performance; Baker exposes application domain features to compensate.
Tailor the compiler and run-time system optimizations to the domain.
The programmer is forced to help the compiler and run-time system find parallelism, but in a "natural" way.
Two types of domain features: data-flow abstractions and packet processing abstractions.

Domain Features (2/5): Data Flow Overview
A data flow is a directed graph.
Graph nodes are called actors (or packet-processing functions, PPFs, in Baker) and represent the computation.
Graph edges are called channels and move data between actors.
Data flow is a natural fit for the packet processing domain.

Domain Features (3/5): Data Flow: PPFs and Channels
PPFs (or actors): implicitly concurrent; stateful; support multiple inputs and outputs; no assumptions about a steady rate of packet consumption.
Channels: queue-like properties; asynchronous, unidirectional, typed, reliable; active and passive varieties; can be replaced with function calls; the run-time system can choose an optimal implementation, e.g., scratch rings vs. next-neighbor rings. A sketch of the channel semantics follows.
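Baker's channel API is not spelled out on this slide, so the following is a minimal C sketch of the queue-like semantics described above (bounded, unidirectional, FIFO). The names channel_t, channel_put, and channel_get are illustrative, not Baker's actual primitives, and the real implementations are hardware rings chosen by the run-time system.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal sketch of a bounded, unidirectional, typed channel.
 * Real Baker channels are backed by hardware (e.g. scratch or
 * next-neighbor rings); this version only shows the semantics
 * for a single producer and a single consumer. */
typedef struct {
    void  *slots[64];      /* fixed-capacity ring buffer  */
    size_t head, tail;     /* consumer / producer indices */
} channel_t;

static bool channel_put(channel_t *c, void *pkt)
{
    size_t next = (c->tail + 1) % 64;
    if (next == c->head)          /* full: caller drops or retries */
        return false;
    c->slots[c->tail] = pkt;
    c->tail = next;
    return true;
}

static void *channel_get(channel_t *c)
{
    if (c->head == c->tail)       /* empty */
        return NULL;
    void *pkt = c->slots[c->head];
    c->head = (c->head + 1) % 64;
    return pkt;
}
```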

Domain Features (4/5): Packet Processing Features
Packets and meta-data as first-class objects.
Packets: the programmer accesses packet data through a special pointer type, and all packet accesses go through these pointers. This allows the compiler to coalesce reads/writes, avoid head and tail manipulation, etc.
Meta-data: storage associated with and carried along with a packet, e.g., input port, output port; accessed via the packet's pointer; useful for carrying per-packet state passed between actors. The language ensures that meta-data is created before it is used.
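To make the packet/meta-data abstraction concrete, here is a plain-C mock-up of the idea; Baker's packet and metadata are built-in types accessed through a special packet pointer, so the struct layout and field names below are hypothetical.

```c
#include <stdint.h>

/* Illustrative only: per-packet metadata that travels with the
 * packet between actors. Field names are invented for the sketch. */
typedef struct {
    uint32_t input_port;
    uint32_t output_port;
    uint32_t nexthop_id;      /* written by one PPF, read by a later one */
} metadata_t;

typedef struct {
    metadata_t meta;          /* carried with the packet            */
    uint16_t   length;        /* packet length in bytes             */
    uint8_t    data[1500];    /* packet bytes (headers + payload)   */
} packet_t;

/* A field accessor: in Baker, reads like pkt->ttl go through the
 * packet pointer so the compiler can coalesce them into one wide
 * memory access (see the packet-access-combining slides later). */
static uint8_t pkt_ipv4_ttl(const packet_t *pkt, uint32_t l3_offset)
{
    return pkt->data[l3_offset + 8];   /* TTL is byte 8 of the IPv4 header */
}
```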

Domain Features (5/5): Example Application: MPLS
Code shown for illustrative purposes only and should not be considered valid.
The MPLS application module (mpls_app) wires RX into an L2 classifier, which feeds the L2 bridge, the MPLS module, and the L3 forwarder; the MPLS module, whose PPFs are ftn, ilm, and ops connected by typed channels (enter_chnl, from_chnl, exit_chnl, output_chnl) with metadata requirements such as "input_chnl requires mpls_nh" and "enter_chnl generates mpls_nhlfe", feeds Ethernet encapsulation and TX. It also exports a configuration entry point add_nhlfe_entry(MPLS_OP op, uint32_t label, ..., uint32_t next_hop_id). The FTN PPF's process function is representative:

    result_t mpls.ftn.process(ipv4_packet_t* in_pkt)
    {
        uint32_t nhid = in_pkt->metadata.nexthop_id;
        if (nhid > num_ftn_entries) {
            packet_drop(in_pkt);
            return RTN_FAILURE;
        }
        if (ftn_table[nhid].nhlfe) {
            in_pkt->metadata.nhlfe_id = ftn_table[nhid].nhlfe;
            return channel_put(enter_chnl, in_pkt);
        } else {
            in_pkt->metadata.nexthop_id = ftn_table[nhid].nhid;
            return channel_put(exit_chnl, in_pkt);
        }
    }

C Restrictions (1/1): Reduced Language Features
By removing some features of C, the compiler is able to perform more optimizations.
Typesafe pointers: the compiler can do much better alias analysis; networking code typically does not use tricky pointer manipulations.
Some features had to be removed to avoid large overheads on the MicroEngines:
Recursion: there is no natural stack on the MicroEngine, so the compiler has to implement one; eliminating recursion simplifies stack analysis.
Function pointers: removed for similar reasons as recursion; unfortunately, network programmers actually use them a great deal.

Results & Summary (1/3): Results
Source lines of code measured using sloccount (no complexity analysis; assembly code not handled).

                            L3 Switch (sloc)   NAT (sloc)   MPLS (sloc)
    Baker                   3126               1205         4145
    IXA SDK MicroEngine-C   6174               8091         6934

Per-cell notes from the slide: "simplistic route table management", "simplistic connection setup code", "plus unmodified source reuse of L3 switch" (Baker versions); "no ARP, no L2 bridging", "includes many more error conditions, assembly not counted" (SDK versions).

Results & Summary (2/3): Future Research
Existing languages expose packets as completely independent; however, flows are a more appropriate independence class for data in this domain.
Open questions: how should flows of packets be represented in a language, and how should we optimize around them? Examples: automated ordering, flow-data locality improvements, flow-lock elision.

Results & Summary (3/3): Summary
Goals: enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel IXP2400 processor); enable good execution performance.
Approach:
  Hide hardware details: single memory, implicit threading, RX/TX as drivers.
  Expose domain-specific constructs: data flow, packets, meta-data.
  Reduce C features: typesafe pointers, no recursion, no function pointers.

High-Level Optimizations
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Shangri-La Compiler Overview
Baker parser: converts the Baker program into the compiler intermediate representation (IR). (Language front-end.)
Profiler: derives run-time characteristics by simulating the application. (Profiling.)
Pi compiler: optimizations for pipeline construction and data structure mapping/caching. (Inter-procedural, loop/memory, and global optimizations.)
Aggregate compiler: code generation and optimization for heterogeneous cores. (Code generation.)
Run-time system: loads the application and performs dynamic resource linking. (Execution environment.)

Profiling Overview
Simulation of the high-level IR: a custom IR interpreter was developed; different from traditional two-pass profiling; profiling information guides optimizations in later phases; stimulated using user-supplied packet traces.
Information collected: execution frequency, communication, memory access statistics.

Pi Compiler Details
The Pi compiler performs most high-level optimizations. Its stages: intra-PPF inter-procedural analysis (IPA), memory mapper, code size and execution time estimation, aggregate formation, intra-aggregate IPA, and aggregate dump.
It maps PPFs to heterogeneous cores and assigns memory levels to global data structures (memory mapper).
It performs inter-procedural analysis for optimizations that need its support.
Aggregate formation is guided by profiling results.

Supporting Language Features with Compiler Optimizations
Modular, dataflow language: automatic program partitioning.
Packet abstraction model: packet handling optimizations.
Flat memory hierarchy: automatic memory mapping.

Key Compiler Technologies
Automatic program partitioning to heterogeneous cores.
Packet handling optimizations: packet access combining; static offset and alignment resolution; packet primitive removal.
Partitioned memory hierarchy optimizations: memory mapping; delayed-update software-controlled caches; program stack layout optimization.

Automatic Program Partitioning (1/3): Partitioning Across Heterogeneous Cores
Partition across the Intel XScale core and multiple MEs.
Partitioning considerations: identifying control and data planes; minimizing inter-processor communication costs; accounting for dynamic characteristics using profiling results; satisfying the code size constraint.
Different memory addresses are seen by different cores: insert address translations, and minimize the number of insertions and their impact on performance.

Automatic Program Partitioning (2/3): Inputs Into the Partitioning Algorithm
Throughput-driven cost model: eliminates latency from consideration; expresses the goal appropriately for the domain.
Relevant profiling statistics: PPF execution time, global data access frequency, channel utilization.
Possible partitioning strategies: pipelining the application across cores; replicating the application across cores.

Automatic Program Partitioning (3/3): Partitioning Algorithm
Illustrated on the L3 Switch pipeline (Rx, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, Tx), using the Pi compiler's code size and execution time estimates:
Aggregate formation merges the PPFs with the highest communication cost.
The aggregate with the lowest throughput is duplicated.
Finally, the entire pipeline is duplicated on the available MEs.
A sketch of the merge step follows.
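The slides describe the merge step only at a high level, so the following is a hedged C sketch of one plausible greedy formulation: repeatedly merge the two PPFs joined by the channel with the highest profiled communication cost until the pipeline fits the aggregate budget. The real Shangri-La cost model (throughput-driven, profile-guided, code-size-aware) is more elaborate; the data structures here are invented for illustration.

```c
#define NPPF 6

/* Profiled communication cost between PPF i and PPF j (e.g. packets
 * per second on the channel); values would come from the profiler. */
static double comm[NPPF][NPPF];
static int    agg[NPPF];          /* aggregate id assigned to each PPF */

static void merge_hottest_channels(int max_aggregates)
{
    for (int i = 0; i < NPPF; i++) agg[i] = i;   /* one aggregate per PPF */
    int naggs = NPPF;

    while (naggs > max_aggregates) {
        int best_a = -1, best_b = -1;
        double best = -1.0;
        for (int a = 0; a < NPPF; a++)            /* find the hottest     */
            for (int b = 0; b < NPPF; b++)        /* cross-aggregate edge */
                if (agg[a] != agg[b] && comm[a][b] > best) {
                    best = comm[a][b]; best_a = a; best_b = b;
                }
        int from = agg[best_b], to = agg[best_a];
        for (int i = 0; i < NPPF; i++)            /* merge the two sides  */
            if (agg[i] == from) agg[i] = to;
        naggs--;
    }
}
```

After this step, the replication phase described on the slide would duplicate the aggregate with the lowest estimated throughput, and finally the whole pipeline, on the remaining MEs.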

Packet Handling Optimizations (1/5): Packet Access Combining
Basic packet accesses are powerful and support the language features, but a naive mapping results in at least one memory access per packet access.
Combine multiple packet accesses and metadata accesses: L3-Switch has 24 packet accesses per packet on the critical path.
Take advantage of the IXP's wide DRAM access instruction; buffer values in local memory or transfer registers.

Packet Handling Optimizations (2/5): Packet Access Combining Example
Before combining, each field access is a separate memory access:
    t1 = pkt->ttl   (off=64b, sz=8b)
    t2 = pkt->prot  (off=72b, sz=8b)
After combining, one wide read serves both fields:
    b  = read pkt (off=64b, sz=16b)
    t1 = (b >> 8) & 0xff
    t2 = b & 0xff
Analysis overview: isolate packet accesses; perform checks to guarantee that packet accesses can be combined safely; validate the range and size of the combined memory access; replace the combined accesses with accesses to/from local memory or transfer registers.
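For readers who want the TTL/protocol example spelled out, here is a hedged plain-C rendering of the transformation. Assume pkt points at the start of the IPv4 header, so TTL and protocol are the adjacent bytes at bit offsets 64 and 72; read_pkt16() stands in for the wide access the compiler would emit and is not a real IXP or Baker primitive.

```c
#include <stdint.h>

/* Stand-in for the wide packet read emitted after combining. */
static uint16_t read_pkt16(const uint8_t *pkt, unsigned bit_off)
{
    const uint8_t *p = pkt + bit_off / 8;
    return (uint16_t)((p[0] << 8) | p[1]);     /* big-endian wire order */
}

static void naive(const uint8_t *pkt, uint8_t *ttl, uint8_t *prot)
{
    *ttl  = pkt[8];     /* first memory access: offset 64 bits, size 8 */
    *prot = pkt[9];     /* second memory access: offset 72 bits        */
}

static void combined(const uint8_t *pkt, uint8_t *ttl, uint8_t *prot)
{
    uint16_t b = read_pkt16(pkt, 64);   /* single 16-bit access        */
    *ttl  = (uint8_t)((b >> 8) & 0xff);
    *prot = (uint8_t)(b & 0xff);
}
```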

Packet Handling Optimizations (3/5): Static Offset and Alignment Resolution (SOAR)
For an IPv4-over-Ethernet packet (14 B Ethernet header, 20 B IPv4 header, payload), offset(src_ip) = 26 B. After packet_encap inserts a 4 B MPLS header, the offset of src_ip is no longer statically obvious; packet_decap removes a header again.
Generic packet accesses can handle arbitrary layering of protocols and arbitrary field offsets, which clearly simplifies the programmer's task, but dynamic offset and alignment determination adds significant overhead: dynamic offset handling adds 20+ instructions per packet access, and dynamic alignment adds several more instructions per packet access.
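A hedged C sketch of why static resolution matters, using the src_ip example from the slide. The layout constants follow the figure (14 B Ethernet, optional 4 B MPLS, src_ip at byte 12 of the IPv4 header); the has_mpls flag and helper names are illustrative, not Shangri-La's IR.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ETH_HDR = 14, MPLS_HDR = 4, SRC_IP_IN_IP = 12 };

/* Dynamic resolution: the offset of src_ip depends on which headers
 * were pushed at run time, so every access recomputes it (on the IXP
 * this costs tens of extra instructions per access). */
static uint32_t src_ip_dynamic(const uint8_t *pkt, bool has_mpls)
{
    unsigned off = ETH_HDR + (has_mpls ? MPLS_HDR : 0) + SRC_IP_IN_IP;
    return (uint32_t)pkt[off] << 24 | (uint32_t)pkt[off + 1] << 16 |
           (uint32_t)pkt[off + 2] << 8 | pkt[off + 3];
}

/* Static resolution: when the dataflow analysis proves which
 * encap/decap operations precede this access, the offset folds to a
 * constant (26 for plain IPv4-over-Ethernet) and the access takes
 * only a few instructions. */
static uint32_t src_ip_static(const uint8_t *pkt)
{
    return (uint32_t)pkt[26] << 24 | (uint32_t)pkt[27] << 16 |
           (uint32_t)pkt[28] << 8 | pkt[29];
}
```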

Packet Handling Optimizations (4/5): Static Offset and Alignment Resolution (SOAR)
Statically resolving packet field alignment eliminates a few instructions; a statically resolved packet field offset and alignment can be accessed with only a few instructions.
Implemented using a custom dataflow analysis.
The figure shows the L3 Switch PPF graph (rx, l2_cls.p, l3_cls.p, lpm_lookup.p, options_processor.p, icmp_processor.p, arp.p, bridge.p, encap.p, tx) annotated with the encap/decap operations on each path (e.g., Eth ◄ IP, IP ► Eth, ICMP ► IP, Eth ◄ Arp, Copy Eth) and with resolution counts of 18/18, 3/3, 1/1, and 2/2 accesses resolved.

Packet Handling Optimizations (5/5): Eliminate Unnecessary Packet Primitives
Eliminate unnecessary packet_encap and packet_decap primitives: balanced packet_encap and packet_decap calls in the same aggregate can be eliminated because they have no external effect. This works in conjunction with the SOAR analysis results.
Convert metadata accesses into local memory accesses when all uses are within the same aggregate: private uses of metadata have no external effect; a metadata access costs one or more SRAM accesses and 20+ instructions; candidate accesses can be identified with def-use analysis.

Memory Hierarchy Optimizations (1/6): Global Data Memory Mapping
Collect dynamic access frequencies to shared global data structures and map the structures to appropriate memory levels: map small, frequently accessed data structures to scratch memory; otherwise, place them in SRAM.
Pointers may point to objects in different levels of memory; perform congruence analysis to allocate such objects to a common memory level.
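A hedged C sketch of the mapping heuristic just described: small, hot structures go to scratch memory, everything else to SRAM. The real compiler also applies the congruence analysis mentioned above; the threshold constants here are invented for illustration.

```c
#include <stddef.h>

typedef enum { MEM_SCRATCH, MEM_SRAM } mem_level_t;

typedef struct {
    const char *name;
    size_t      bytes;            /* static size of the structure   */
    double      accesses_per_pkt; /* from profiling                 */
} global_obj_t;

static mem_level_t map_object(const global_obj_t *obj)
{
    const size_t SCRATCH_BUDGET = 4 * 1024;   /* illustrative          */
    const double HOT_THRESHOLD  = 1.0;        /* >=1 access per packet */

    if (obj->bytes <= SCRATCH_BUDGET &&
        obj->accesses_per_pkt >= HOT_THRESHOLD)
        return MEM_SCRATCH;      /* small and frequently accessed */
    return MEM_SRAM;             /* everything else               */
}
```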

Memory Hierarchy Optimizations (2/6): Delayed-Update Software-Controlled Caches
Cache unprotected global data structures: since these structures are not protected by locks, assume that they can tolerate delayed updates. Delayed updates result in some mishandled packets, which is tolerable for network applications.
Identify caching candidates automatically from profiling statistics: frequently read (packet processing core); infrequently written (control and initialization routines); high predicted hit rate (derived from profiling).
Good candidates: configuration globals (MAC table, classification table), lookup tables, etc.

Memory Hierarchy Optimizations (3/6): Caching Route Lookups
Packet forwarding routes are stored in trie tables.
Route lookups are the frequently executed path; route updates are the infrequently executed path and are applied with an atomic write (the figure steps through a small trie whose leaf entries a, b, c change across an update).
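A hedged C sketch of a route lookup over a 2-bit-stride trie like the one in the figure. The node layout, field names, and the claim that the update is a single word store are illustrative assumptions, not the Shangri-La runtime's actual data structure.

```c
#include <stdint.h>

typedef struct trie_node {
    struct trie_node *child[4];   /* indexed by 2 address bits, or NULL  */
    int               next_hop;   /* >=0 when this prefix has a route    */
} trie_node;

/* Frequent path: longest-prefix match by walking 2 bits at a time. */
static int route_lookup(const trie_node *root, uint32_t dst_ip)
{
    int best = -1;                               /* -1 = no route        */
    const trie_node *n = root;
    for (int shift = 30; shift >= 0 && n; shift -= 2) {
        if (n->next_hop >= 0)
            best = n->next_hop;                  /* longest match so far */
        n = n->child[(dst_ip >> shift) & 0x3];
    }
    return best;
}

/* Infrequent path: a new subtree is built off to the side and swapped
 * in with a single pointer write (assumed atomic here), so concurrent
 * lookups always see a consistent table. */
static void route_update(trie_node *parent, int idx, trie_node *new_subtree)
{
    parent->child[idx] = new_subtree;
}
```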

Memory Hierarchy Optimizations (4/6): Delayed-Update Software-Controlled Caches
Infrequent write path: write global_data in its home location in shared memory and set update_flag.
Frequent read path: increment a counter; only when count exceeds check_limit is update_flag checked. If the flag is set, clear the cache, clear update_flag, and reset the counter; otherwise read global_data from the cache (or from main memory on a miss).
Delayed-update coherency checks the home location only occasionally; update_flag is set on any change to the cached variable; the update check rate is set as a function of the tolerable error rate and the variable's expected load and store rates.
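A hedged C sketch of this scheme for a single cached word. Variable names mirror the slide (update_flag, check_limit); the code is illustrative only and ignores how the per-ME copy is actually placed in local memory.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHECK_LIMIT 256              /* derived from tolerable error rate */

static uint32_t home_value;          /* "main memory" copy (e.g. SRAM)    */
static bool     update_flag;         /* set by the infrequent write path  */

static uint32_t cached_value;        /* per-ME local copy                 */
static bool     cache_valid;
static unsigned count;               /* reads since last coherency check  */

static void write_path(uint32_t v)   /* control plane, infrequent         */
{
    home_value  = v;
    update_flag = true;
}

static uint32_t read_path(void)      /* data plane, once per packet       */
{
    if (++count > CHECK_LIMIT) {
        count = 0;
        if (update_flag) {           /* delayed coherency check           */
            cache_valid = false;
            update_flag = false;
        }
    }
    if (!cache_valid) {
        cached_value = home_value;   /* miss: refill from home location   */
        cache_valid  = true;
    }
    return cached_value;
}
```

The trade-off is explicit: reads between a write and the next coherency check may return stale data, which the slide argues is acceptable for unprotected configuration tables in this domain.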

Memory Hierarchy Optimizations (5/6): Program Stack Layout Optimization
Shangri-la's runtime model supports a calling convention; the stack holds a PPF's local variables and temporary spill locations.
Baker does not support recursion, so the stack can be assigned statically to fixed locations; we want to assign disjoint stack frames to the limited local memory.
The stack is mapped to local memory and SRAM; only 48 words per thread are available for the stack in local memory.

Memory Hierarchy Optimizations (6/6): Program Stack Layout Optimization
Example frame sizes: main() 16 words, PPF1 16 words, PPF2 32 words, PPF3 16 words; the local-memory stack is 48 words, with the remainder spilling to SRAM.
PPFs higher in the call graph are assigned to local memory first; the dispatch model ensures a relatively flat call graph.
If a PPF is called from two places, it is assigned the minimum stack location that will not collide with live stack frames. A simplified sketch follows.
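A hedged C sketch of static frame placement under these assumptions. Because there is no recursion, every frame gets a fixed offset; the sizes match the slide's example, but the placement policy here is simplified (it lays frames out sequentially and does not model the "minimum non-colliding location" rule for shared callees).

```c
#include <stdio.h>

#define LM_WORDS 48      /* local-memory stack budget per thread */

typedef struct { const char *name; int words; int base; int in_lm; } frame_t;

/* f[] is ordered by call-graph depth: frames higher in the call graph
 * claim local memory first; once the budget is exhausted, the rest
 * spill to SRAM. */
static void assign_frames(frame_t *f, int n)
{
    int next = 0;                      /* next free word offset */
    for (int i = 0; i < n; i++) {
        f[i].base  = next;
        f[i].in_lm = (next + f[i].words <= LM_WORDS);
        next      += f[i].words;
    }
}

int main(void)
{
    frame_t frames[] = {
        { "main", 16, 0, 0 }, { "PPF1", 16, 0, 0 },
        { "PPF2", 32, 0, 0 }, { "PPF3", 16, 0, 0 },
    };
    assign_frames(frames, 4);
    for (int i = 0; i < 4; i++)
        printf("%-5s offset %2d -> %s\n", frames[i].name, frames[i].base,
               frames[i].in_lm ? "Local Memory" : "SRAM");
    return 0;
}
```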

Conclusions
Proposed optimizations for generating code competitive with hand-tuned code from a high-level language: memory-level optimizations; program partitioning to heterogeneous cores; optimizations to support packet abstractions.
Total system performance will be shown after we describe the code generation optimizations and the run-time system.

Code Generation and Optimizations
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Outline
Compiler Flow; Intel XScale Processor Code Generation; MicroEngine Code Generation

Shangri-la Compiler Flow
(Figure: overall compiler flow.)

Intel XScale Processor Code Generation
The Intel XScale processor runs configuration, management, control plane, and cold code; it has an OS and a virtually unlimited code store, and is less performance critical.
Code generation: shares the compilation path with the ME path up to WOPT; regenerates C source code with a proper naming convention; leverages an existing GCC compiler for the Intel XScale processor.
Address translation issue: the Intel XScale processor uses virtual addresses while the ME uses physical addresses plus a memory type; address translation is performed only on the Intel XScale processor, for addresses exposed between the two types of cores.

ME Code Generator
Input: aggregates from the Pi compiler, represented as WHIRL.
Phases: lowering and code selection, memory optimization, loop optimization, region formation, global instruction scheduling, register allocation, local instruction scheduling, and code emission, governed by a code size guard and a system model.
Output: assembly files.

Register Allocation
ME architectural constraints on assigning registers: multiple register banks are used by specific types of instructions (GPR banks, SRAM/DRAM transfer in/out banks, next-neighbor bank); certain banks cannot be used for both the A and B operands (e.g., the GPR A and GPR B banks).
ME register allocation framework, step 1: identify candidate banks. For each TN (virtual register), identify all possible register banks at each occurrence according to the ME ISA; if there is at least one common register bank, follow conventional register allocation.

Register Allocation (cont.)
Step 2: resolve bank conflicts if no common bank exists. Locate conflicting edges, partition the def-use graph, and add moves between sub-graphs.
Step 3: allocate intra-set registers. Perform conventional register allocation while observing the constraints on A and B operands; add an edge between two source operands of the same instruction in the symbolic register conflict graph; use heuristics to balance the usage of the GPR A and B banks.

Calling Convention and Stack
Support a calling convention: caller/callee-saved registers, parameter passing, etc. Code generation (e.g., register allocation) is performed within a function scope, which eases debugging and performance tuning by focusing changes on the affected scope.
Support a calling stack despite the absence of recursion: the stack frame holds local variables, spilled parameters, and register spills; the calling stack grows from LM into SRAM; disjoint stack frames are allocated in the scarce LM; the memory level for each frame is decided statically, for both performance and code size reasons.

More Features in Code Generation and Optimizations
Inter-procedural analysis and function inlining; global scalar optimization for register promotion; a parameterized machine model for ease of porting; a code size guard to throttle the aggressiveness of optimizations that increase code size; global instruction scheduling and latency hiding; bitwise optimizations; loop unrolling.

The Run-time System
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Motivation (1/2): RTS Goals
Adapt execution of the application to match the current workload.
Isolate the RTS user from the hardware-specific features commonly needed for packet processing.

Motivation (2/2): Adaptation Opportunities
The figure shows three examples of remapping services (IPv6 compression and forwarding, IPv4 compression and forwarding, VPN encrypt/decrypt) across the MEs and the Intel XScale core:
Example 1: change the allocation to increase an individual service's performance.
Example 2: support a large set of services in the "fast path", according to use.
Example 3: power down unneeded processors.

Outline
RTS Design Overview
Run-time adaptation mechanisms: binding, checkpointing, state migration.
Run-time adaptation results: overheads and costs; benefits.
Future research; summary.

RTS Design (1/4): RTS Theory of Operations
The system monitor collects runtime statistics (queue depths) and triggers adaptation.
The resource planner and allocator computes a new processor mapping based on global knowledge, taking the executable binaries, the topology, and the traffic mix as inputs and producing a resource mapping of aggregates (A, B, C) onto the Intel XScale core and the MEs.
The Resource Abstraction Layer (RAL) hides the implementation of processor resources.

RTS Design (2/4): The Resource Abstraction Layer
Three goals:
  Support adaptation: packet channels and locks.
  Allow common abstractions for the rest of the RTS code: processing units, network interfaces.
  Allow portability of the compiler's code generator: data memory, packet memory, timers, hash, random.
Key lesson: the last goal is noble, but its performance cost can be large; focus on supporting adaptation.

RTS Design (3/4): How the RAL Supports Adaptation
A MicroEngine-based example: RAL calls are initially undefined in the application .o file. At run time, the RTS has the application .o file and the RAL .o file, which contains several RAL implementations; the linker adjusts jump targets using the import-variable mechanism to produce the final .o file. The process is repeated after each adaptation.

RTS Design (4/4): System Monitor and Resource Planner
System monitor: triggering policies, e.g., queue thresholds (a sketch follows).
Resource planner and allocator: mapping policies, e.g., move code into or out of the fast path, or duplicate code within the fast path.
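The slides name queue thresholds as one triggering policy without giving details, so the following is a hedged C sketch of one such policy: trigger a remap when any channel queue stays above a high-water mark for several consecutive samples. The thresholds and the sampling scheme are invented for illustration.

```c
#include <stdbool.h>

#define NQUEUES          8
#define HIGH_WATER     192      /* queue entries                         */
#define SUSTAIN_SAMPLES  4      /* consecutive samples before triggering */

static unsigned depth[NQUEUES];        /* filled in by the RTS each tick */
static unsigned over_count[NQUEUES];

/* Called periodically by the system monitor; returns true when the
 * resource planner should compute a new mapping. */
static bool should_adapt(void)
{
    for (int q = 0; q < NQUEUES; q++) {
        if (depth[q] > HIGH_WATER) {
            if (++over_count[q] >= SUSTAIN_SAMPLES)
                return true;           /* sustained congestion: remap    */
        } else {
            over_count[q] = 0;
        }
    }
    return false;
}
```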

Adaptation Mechanisms (1/7)
Binding, checkpointing, and state migration.

Adaptation Mechanisms (2/7): Why Have Binding?
In the initial mapping, aggregates A and B sit on non-adjacent MEs; after remapping them onto adjacent MEs, the channel between them can use next-neighbor (NN) rings and local locks.
We want to be able to use the fastest implementations of the resources available.

Adaptation Mechanisms (3/7): Binding: The Value of Choosing the Right Resource Implementation on the Intel IXP2400 Processor

    Channel implementation   # S-push/S-pull bytes   % S-push/S-pull bandwidth
    Next-neighbor            0                       0%
    Scratchpad ring          4                       0.47%
    SRAM ring w/ stats       68                      7.9%

Adaptation Mechanisms (4/7): Binding: Compile-time or Not?
The slide compares binding approaches by code-store footprint, execution overhead, and rebinding time and flexibility:
Compile-time binding: best code-store footprint, worst rebinding time and flexibility.
Load-time binding: average code-store footprint, best rebinding time and flexibility.
Run-time binding: worst code-store footprint, best rebinding time and flexibility.

Adaptation Mechanisms (5/7): Checkpointing
When migrating, the RTS follows a simple algorithm: tell the affected processing units to stop at the checkpoint location; wait for each processing unit to reach the checkpoint location; reload and run the processing units.

Adaptation Mechanisms (6/7): Checkpointing (cont'd)
Finding the best checkpoint is easier in packet processing than in general domains; we leverage characteristics of data-flow applications: PPFs are typically implemented as a dispatch loop; the dispatch loop is executed at high frequency; the top of the dispatch loop has no stack information.
Since the compiler creates the dispatch loop, the compiler inserts the checkpoints in the code. A sketch follows.
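A hedged C sketch of a compiler-inserted checkpoint at the top of a dispatch loop. The flag names and the channel/processing primitives are assumptions for illustration, not the Shangri-La RTS interface; the point is only that the thread parks at a location where it holds no stack state and no packet, so it can be safely reloaded.

```c
#include <stdbool.h>

extern volatile bool stop_requested;      /* written by the run-time system */
extern volatile bool stopped;             /* acknowledged by this thread    */

extern void *channel_get_blocking(void);  /* assumed RTS/RAL primitives     */
extern void  ppf_process(void *pkt);

void dispatch_loop(void)
{
    for (;;) {
        if (stop_requested) {             /* checkpoint location            */
            stopped = true;
            while (stop_requested)        /* wait to be reloaded or resumed */
                ;
            stopped = false;
        }
        void *pkt = channel_get_blocking();
        ppf_process(pkt);                 /* no live state across iterations */
    }
}
```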

Adaptation Mechanisms (7/7): State Migration
Once a processor has been checkpointed, state from the old resources must be moved to the new resources, e.g., packets sitting in the previous packet channel implementations, and cached data.
Solution: copy packets in the old channels to the new channels, and flush any caches.

Adaptation Results (1/7)
Adaptation costs (i.e., overheads): checkpointing, loading, binding, state migration (not covered), and their cumulative effects; then adaptation benefits.
Experimental setup: Radisys, Inc. ENP 2611 board; one 600 MHz Intel IXP2400 processor; MontaVista Linux; timer measurement accuracy 0.53 us.

Adaptation Results (2/7): Checkpointing Overhead Factors
Time to inform a processing unit to stop at the checkpoint: ME 60 us; Intel XScale core 34 us.
Time to check whether all threads have stopped: ME 3 us; Intel XScale core (Linux kernel thread) 3 us.
Time to start a processing unit: ME 0.036 ms; Intel XScale core (Linux kernel thread) 0.097 ms.

Adaptation Results (3/7): Loading Overhead
Intel XScale core thread start time: 0.054 ms.
The slide's graph shows ME load times.

Adaptation Results (4/7): Binding Overhead
The slide's graphs show ME binding and Intel XScale core binding times.

Adaptation Results (5/7): Cumulative Effects of Adaptation Overhead
Not all adaptation time represents an inoperable system: some processors can be left running while others are checkpointed.
Experiment: the initial mapping gives L3 fwdr 1 ME and L2 bridge 3 MEs; the final mapping gives L3 fwdr 3 MEs and L2 bridge 1 ME. The channel into L3 fwdr changes from SRAM-based to scratchpad-based; the channel into L2 bridge is scratchpad-based.
Measured times: total time to adapt 99.5 ms; time ME1 was down 96.41 ms; ME2 76.48 ms; ME3 47.30 ms; ME4 25.47 ms.

Adaptation Results (6/7): Adaptation Overhead Learnings
Overall adaptation time is: linking time + (checkpointing and loading time × number of cores).
Packet loss occurs during checkpointing and loading, but not during binding.
So, focus optimizations on starting, stopping, and loading; exchange time in loading for more time in linking.

Adaptation Results (7/7): Theoretical Benefits of Adaptation
For more details, see the paper in HotNets-II: http://nms.lcs.mit.edu/HotNets-II/papers/adaptation-case.pdf

Summary (1/2): Future Research
Gather experimental benefits of adaptation.
Define and develop performance determinism in the face of adaptation.
Apply power scaling to the adaptation mechanisms.
Make adaptation coexist with commercial operating systems.

Summary (2/2): Summary
An adaptive run-time system provides benefits in performance, supported services, and power consumption.
Such a system can be built with a truly programmable large-scale chip multiprocessor; it requires checkpointing, binding, and state migration.
Adaptation costs come primarily from loading and checkpointing times, so optimize these.

Shangri-la Performance Evaluation
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Setup (1/3): Evaluation Setup
Hardware: Radisys ENP 2611 evaluation board (3 x 1 Gbps optical ports); IXIA packet generator (2 x 1 Gbps optical ports), currently only capable of generating 2 Gbps of traffic.
Benchmarks: L3-Switch (3126 lines), L2 bridging and L3 forwarding; Firewall (2784 lines), a simple firewall using ordered rule-based classification; MPLS (4331 lines), multi-protocol label switching (transit node).
Packet traces: L3-Switch and MPLS were evaluated using NPF packet traces; Firewall used a custom packet trace.

Setup (2/3): Mt. Hood Board
One Intel IXP2400: ME array (MEv2 1-8), Intel XScale core, media switch fabric interface, scratch memory, hash unit, RDRAM and QDR SRAM controllers, PCI.
Three 1 Gbps optical ports, 64 MB DRAM, 8 MB SRAM.

Setup (3/3): Test and Development Environment
A Linux host machine compiles code for the MEs and the Intel XScale core and runs an NFS server; it is connected to the Radisys ENP 2611 board by Ethernet and a serial cable, and provides power to the board via the PCI bus.
The Intel XScale core runs Linux and the Shangri-la RTS; it reads the generated binaries from the host machine's NFS server and loads them onto the MEs.
The board connects to the IXIA packet generator over 2 x 1 Gbps optical links.

Resource Budgets (1/5)
Instruction and memory budgets at 2.5 Gb/s.
Assumed memory access latency: 100 cycles (scratch memory 60 cycles, SRAM 90 cycles, DRAM 120 cycles).
The memory access budget refers only to the number of memory accesses that can be overlapped with computation; it does not account for SRAM/DRAM bandwidth.

Resource Budgets (2/5): Evaluating Intel IXP2400 Memory Bandwidth
Methodology: a modified empty PPF connected to Rx and Tx, with a loop added to access the chosen memory level n times per packet; throughput is graphed for various configurations, n = 1, 2, 4, ..., 1024, over SCRATCH, SRAM, and DRAM.
Results use minimum-sized 64 B packets. A sketch of the micro-benchmark follows.
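A hedged C sketch of the micro-benchmark just described: an otherwise empty PPF performing n reads of a chosen memory level per packet before forwarding it. mem_read32(), rx_get_packet(), and tx_send_packet() stand in for the board's primitives and are not real IXP intrinsics.

```c
#include <stdint.h>

typedef enum { MEM_SCRATCH, MEM_SRAM, MEM_DRAM } level_t;

extern uint32_t mem_read32(level_t level, uint32_t addr);  /* assumed */
extern void    *rx_get_packet(void);                       /* assumed */
extern void     tx_send_packet(void *pkt);                 /* assumed */

void bench_ppf(level_t level, unsigned n)
{
    volatile uint32_t sink = 0;            /* keep the reads alive        */
    for (;;) {
        void *pkt = rx_get_packet();       /* minimum-sized 64 B packets  */
        for (unsigned i = 0; i < n; i++)   /* n = 1, 2, 4, ..., 1024      */
            sink += mem_read32(level, i * 4);
        tx_send_packet(pkt);
    }
}
```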

Resource Budgets (3/5): Scratch Memory Bandwidth
There is a significant difference in the memory bandwidth consumed according to access size (graph).

Resource Budgets (4/5): SRAM Memory Bandwidth
Behavior is similar to scratch memory (graph).

Resource Budgets (5/5): DRAM Memory Bandwidth
DRAM accesses significantly constrain the forwarding rate (graph).

Benchmarks (1/4) – L3-Switch
- Performs core router functionality (see the sketch below)
  - Bridges packets not destined for this router
  - Handles ARP packets for resolving Ethernet addresses
  - Routes IP packets targeting this router
(Pipeline diagram – PPFs: l2_cls.p, bridge.p, arp.p, l3_cls.p, icmp_processor.p, options_processor.p, lpm_lookup.p, encap.p; modules: l2_bridge.m, l3_fwdr.m, eth_encap.m, l3_switch.m; connected between Rx and Tx.)
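To make the per-packet decision concrete, here is a host-runnable C sketch of the classification logic described above. The types, example addresses, and constant names are illustrative only; they are not the benchmark's actual Baker code.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint8_t dst_mac[6]; uint16_t ethertype; uint32_t dst_ip; uint8_t ttl; } pkt_t;

    enum verdict { BRIDGE, HANDLE_ARP, LOCAL_DELIVERY, ROUTE, DROP };

    /* Stand-ins for the router's own addresses. */
    static const uint8_t  ROUTER_MAC[6] = {0x00,0x0a,0x0b,0x0c,0x0d,0x0e};
    static const uint32_t ROUTER_IP     = 0x0a000001u;            /* 10.0.0.1, illustrative */

    static bool is_router_mac(const uint8_t m[6]) {
        for (int i = 0; i < 6; i++) if (m[i] != ROUTER_MAC[i]) return false;
        return true;
    }

    enum verdict l3_switch_classify(const pkt_t *p)
    {
        if (!is_router_mac(p->dst_mac))  return BRIDGE;          /* not for this router: L2 bridge path */
        if (p->ethertype == 0x0806)      return HANDLE_ARP;      /* ARP request/reply */
        if (p->ethertype != 0x0800)      return DROP;            /* only IPv4 handled in this sketch */
        if (p->dst_ip == ROUTER_IP)      return LOCAL_DELIVERY;  /* ICMP / IP-options processing */
        if (p->ttl <= 1)                 return DROP;            /* would expire; a real router sends ICMP time-exceeded */
        return ROUTE;                                            /* LPM lookup, Ethernet re-encapsulation, then Tx */
    }

The ROUTE verdict corresponds to the lpm_lookup.p / encap.p path in the pipeline diagram; the memory accesses it implies (trie or table lookups in SRAM/DRAM) are what the budgets on the previous slides constrain.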

Benchmarks (2/4) – Firewall
- Filters out unwanted packets arriving from a WAN (e.g. the Internet)
- Assigns flow IDs to packets according to a user-specified rule list using src IP, dst IP, src port, dst port, TOS and protocol
- Drops packets for specified flow IDs
- Optimizes the assignment of flow IDs (see the sketch below):
  - First tries to find the flow ID in a hash table, placed there by a previous packet with the same fields
  - Otherwise, does a long search by testing the rules in order
(Pipeline diagram – PPFs: hash_lookup.p, long_search.p, firewall.p; modules: classifier.m, firewall.m; connected between Rx and Tx.)
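The two-level flow-ID assignment can be sketched as follows. The hash function, key layout, and table size are illustrative choices, not the benchmark's implementation; keys should be zero-initialized so struct padding hashes deterministically.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; uint8_t tos, proto; } flow_key_t;
    typedef struct { flow_key_t match; flow_key_t mask; int flow_id; } rule_t;

    #define HASH_BUCKETS 1024
    typedef struct { int valid; flow_key_t key; int flow_id; } hash_entry_t;
    static hash_entry_t flow_cache[HASH_BUCKETS];

    static unsigned hash_key(const flow_key_t *k) {
        const unsigned char *b = (const unsigned char *)k;
        unsigned h = 2166136261u;                        /* FNV-1a over the key bytes */
        for (size_t i = 0; i < sizeof *k; i++) { h ^= b[i]; h *= 16777619u; }
        return h % HASH_BUCKETS;
    }

    static int rule_matches(const rule_t *r, const flow_key_t *k) {
        return ((k->src_ip   & r->mask.src_ip)   == r->match.src_ip)   &&
               ((k->dst_ip   & r->mask.dst_ip)   == r->match.dst_ip)   &&
               ((k->src_port & r->mask.src_port) == r->match.src_port) &&
               ((k->dst_port & r->mask.dst_port) == r->match.dst_port) &&
               ((k->tos      & r->mask.tos)      == r->match.tos)      &&
               ((k->proto    & r->mask.proto)    == r->match.proto);
    }

    /* Fast path: hash lookup (hash_lookup.p). Slow path: ordered rule scan (long_search.p). */
    int classify(const flow_key_t *k, const rule_t *rules, int nrules)
    {
        unsigned h = hash_key(k);
        if (flow_cache[h].valid && memcmp(&flow_cache[h].key, k, sizeof *k) == 0)
            return flow_cache[h].flow_id;                /* hit: reuse the flow ID left by an earlier packet */

        int flow_id = -1;                                /* -1: no rule matched */
        for (int i = 0; i < nrules; i++)
            if (rule_matches(&rules[i], k)) { flow_id = rules[i].flow_id; break; }

        flow_cache[h] = (hash_entry_t){ .valid = 1, .key = *k, .flow_id = flow_id };
        return flow_id;                                  /* firewall.p then drops flows on its drop list */
    }

The point of the design is that most packets in an established flow take the cheap hash path, so the expensive ordered rule scan is paid roughly once per flow rather than once per packet.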

Benchmarks (3/4) – Multi-Protocol Label Switching (MPLS)
- Routes packets using attached labels instead of IP addresses (see the sketch below)
  - Reduces routing hardware requirements
  - Facilitates high-level traffic management on user-defined packet streams
(Pipeline diagram – PPFs: ilm.p, ops.p, ftn.p, l2_cls.p, bridge.p, arp.p, encap.p; modules: mpls.m, l3_fwdr.m, l2_bridge.m, eth_encap.m, mpls_app.m; "MPLS?" decision on the MPLS path; connected between Rx and Tx.)
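A simplified C sketch of the transit-path label handling (the ilm.p / ops.p stages) described above; the incoming-label-map entry layout and operation names are assumptions made for illustration.

    #include <stdint.h>

    typedef enum { OP_SWAP, OP_PUSH, OP_POP } mpls_op_t;
    typedef struct { uint32_t in_label; mpls_op_t op; uint32_t out_label; int next_hop; } ilm_entry_t;

    typedef struct {
        uint32_t label_stack[8];   /* label stack, top at index 0; 20-bit label in bits 31..12 */
        int      depth;
        int      next_hop;         /* filled in for encapsulation */
    } mpls_pkt_t;

    /* Transit behaviour: look up the top label in the ILM and apply its operation. */
    int mpls_transit(mpls_pkt_t *p, const ilm_entry_t *ilm, int n)
    {
        uint32_t top = p->label_stack[0] >> 12;             /* extract the 20-bit label */
        for (int i = 0; i < n; i++) {
            if (ilm[i].in_label != top) continue;
            switch (ilm[i].op) {
            case OP_SWAP:                                   /* replace the top label, keep stack depth */
                p->label_stack[0] = (ilm[i].out_label << 12) | (p->label_stack[0] & 0xfffu);
                break;
            case OP_PUSH:                                   /* tunnel entry: push a new label */
                if (p->depth >= 8) return -1;               /* stack full in this sketch: drop */
                for (int j = p->depth; j > 0; j--) p->label_stack[j] = p->label_stack[j - 1];
                p->label_stack[0] = ilm[i].out_label << 12;
                p->depth++;
                break;
            case OP_POP:                                    /* tunnel exit: pop; an empty stack goes to the IP forwarder */
                for (int j = 0; j < p->depth - 1; j++) p->label_stack[j] = p->label_stack[j + 1];
                p->depth--;
                break;
            }
            p->next_hop = ilm[i].next_hop;
            return 0;                                       /* forward via encapsulation and Tx */
        }
        return -1;                                          /* no ILM entry: drop */
    }

Ingress routers instead use the FEC-to-NHLFE path (ftn.p) to attach the first label; the transit case above is what the performance results later in this section measure.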

Benchmarks (4/4) – Other Network Benchmarks
- Network address translation (NAT)
  - Allows multiple LAN hosts to connect to a WAN (e.g. the Internet) through one IP address
  - Achieved by remapping LAN IPs and ports; WAN hosts only see the NAT router
  - Manages a table mapping active connections between the LAN and WAN (see the sketch after this list)
- Quality of Service (QoS)
  - Allows partitioning of the available bandwidth among user-specified traffic streams
  - Packet streams are throttled by intentionally dropping packets
- Header compression
  - Reduces the size of transmitted packet headers
  - Since many fields are similar for packets in the same flow, compression is achieved by transmitting only the differences
- Various security features
  - Encryption / decryption
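A minimal sketch of the NAT connection-table remapping described above. The table layout, its size, the linear search, and the example WAN address are all illustrative; a real data-plane implementation would use a hashed table in SRAM.

    #include <stdint.h>

    typedef struct {
        int      in_use;
        uint32_t lan_ip;   uint16_t lan_port;   /* private side of the connection */
        uint16_t ext_port;                      /* port used on the router's public IP */
    } nat_entry_t;

    #define NAT_SIZE 4096
    static nat_entry_t nat_table[NAT_SIZE];
    static uint16_t    next_ext_port = 10000;
    static const uint32_t WAN_IP = 0xc0a80001u;  /* example public address of the NAT router */

    /* Outbound (LAN -> WAN): rewrite (lan_ip, lan_port) to (WAN_IP, ext_port), allocating a mapping if needed. */
    void nat_outbound(uint32_t *src_ip, uint16_t *src_port)
    {
        for (int i = 0; i < NAT_SIZE; i++) {
            if (nat_table[i].in_use && nat_table[i].lan_ip == *src_ip && nat_table[i].lan_port == *src_port) {
                *src_ip = WAN_IP; *src_port = nat_table[i].ext_port;   /* existing connection */
                return;
            }
        }
        for (int i = 0; i < NAT_SIZE; i++) {
            if (!nat_table[i].in_use) {                                /* new connection */
                nat_table[i] = (nat_entry_t){ 1, *src_ip, *src_port, next_ext_port };
                *src_ip = WAN_IP; *src_port = next_ext_port++;
                return;
            }
        }
        /* table full: a real implementation would evict an idle entry or drop the packet */
    }

    /* Inbound (WAN -> LAN): map the destination port back to the LAN host that owns the connection. */
    int nat_inbound(uint32_t *dst_ip, uint16_t *dst_port)
    {
        for (int i = 0; i < NAT_SIZE; i++) {
            if (nat_table[i].in_use && nat_table[i].ext_port == *dst_port) {
                *dst_ip = nat_table[i].lan_ip; *dst_port = nat_table[i].lan_port;
                return 0;
            }
        }
        return -1;   /* no mapping: drop, since WAN hosts only see the NAT router */
    }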

Results (1/4) – Dynamic Memory Accesses
Optimization levels (legend):
- + SWC – software cache
- + PHR – pkt handling removal
- + SOAR – static offset & align
- + PAC – pkt access combining
- + -O2 – inline pkt handling
- + -O1 – typical scalar opts
- + BASE – no opts
Observations:
- Table shows the average per-packet memory access count
- PAC significantly reduces packet memory accesses (illustrated in the sketch below)
- -O1 enables the pipeline to fit on one ME
- SWC and PHR also contribute to reduced memory accesses
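The sketch below illustrates the idea behind PAC referenced above: adjacent packet-header reads that would each cost a long-latency DRAM access are combined into one wide read into a local buffer, from which the fields are then extracted cheaply. The accessor names are stand-ins, not the compiler's actual runtime interface.

    #include <stdint.h>
    #include <string.h>

    extern uint32_t pkt_read32(const void *pkt, unsigned offset);           /* stand-in: one DRAM access per call */
    extern void     pkt_read_block(const void *pkt, unsigned offset,
                                   void *dst, unsigned len);                /* stand-in: one wide DRAM access */

    /* Without PAC: each header field read becomes a separate DRAM access. */
    void parse_ip_header_naive(const void *pkt, uint32_t *src, uint32_t *dst, uint32_t *ttl_proto)
    {
        *ttl_proto = pkt_read32(pkt, 8);     /* TTL / protocol / checksum word */
        *src       = pkt_read32(pkt, 12);    /* source address */
        *dst       = pkt_read32(pkt, 16);    /* destination address */
    }

    /* With PAC: the accesses are combined into one wide read into a local buffer. */
    void parse_ip_header_combined(const void *pkt, uint32_t *src, uint32_t *dst, uint32_t *ttl_proto)
    {
        uint8_t hdr[20];
        pkt_read_block(pkt, 0, hdr, sizeof hdr);   /* single DRAM access covers the whole IPv4 header */
        memcpy(ttl_proto, hdr + 8,  4);
        memcpy(src,       hdr + 12, 4);
        memcpy(dst,       hdr + 16, 4);
    }

On the IXP, a combined read of this kind maps naturally onto a multi-word DRAM burst, which is where the reduction in per-packet access count in the table comes from.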

Results (2/4) – L3-Switch: Forwarding Rate
(Optimization levels as in the previous slide's legend.)
- Forwarding rate of minimum-sized packets (64 B)
- Top-end performance is still constrained by memory bandwidth; PHR and SWC alleviate this somewhat
- PAC and SOAR have the most impact on L3-Switch forwarding rate
- Reduced memory access and instruction counts both improve the forwarding rate

Results (3/4) – Firewall: Forwarding Rate
(Optimization levels as in the earlier legend.)
- Per-ME performance improvement is dominated by PAC

Results (4/4) – MPLS: Forwarding Rate
(Optimization levels as in the earlier legend.)
- Results are for MPLS transit only (interior routers of an MPLS domain)
- Similar performance characteristics to L3-Switch
- SOAR does not help this application due to the stacking of MPLS labels

Conclusions
- Demonstrated performance from a high-level language comparable to hand-tuned code
  - Memory-level optimizations
  - Program partitioning to heterogeneous cores
  - Optimizations to support packet abstractions
- Language features are more attractive when users can enjoy ease of programming without sacrificing performance
  - Modular program design
  - Packet model supporting encapsulation, metadata and bit-level accesses
  - Flat memory model
- Able to achieve 2 Gbps on L3-Switch, Firewall and MPLS Transit

Summary – Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Summary – Baker
- Goals
  - Enable efficient expression of packet processing applications on large-scale chip multiprocessors (e.g., the Intel® IXP2400 processor)
  - Enable good execution performance
- Approach
  - Hide hardware details
  - Expose domain-specific constructs
  - Reduce C features

Summary – Compiler Optimizations
- Demonstrated performance from a high-level language comparable to hand-tuned code
  - Memory-level optimizations
  - Program partitioning to heterogeneous cores
  - Optimizations to support packet abstractions
  - Machine-specific optimizations
- Able to achieve maximal packet forwarding rates (2 Gbps) on L3-Switch, Firewall and MPLS Transit

Summary – Runtime Adaptation
- An adaptive system is important for packet processing so it can adapt to varying workloads dynamically
  - Benefits in performance, services, and power consumption
- The system can be built with a truly programmable large-scale chip multiprocessor; it requires:
  - Checkpointing
  - Binding
  - State migration
- Adaptation costs come primarily from loading and checkpointing times, so optimize these

Key Learning
- High-level language features ease programming for complex multi-core network processors
- Effective compiler optimizations are able to achieve performance comparable to hand-tuned systems
  - Architecture-specific, domain-specific, and general optimizations are all critical to obtaining high performance
- Ease of programming and performance can co-exist
- Runtime adaptation is a key feature for future network systems
  - The system can be built with a large-scale CMP
- Many of these learnings are applicable to general CMP systems