8c7cef18d657617564ccace3e59512c5.ppt
- Number of slides: 33
Open Fabrics BOF, Supercomputing 2009
Tziporet Koren, Gilad Shainer, Yiftah Shahar, Stan Smith, Hal Rosenstock, Jeff Squyres, DK Panda, Bob Woodruff, Betsy Zeller
Rev. 1.0
www.openfabrics.org
Agenda
- Open Fabrics Linux Update (15 minutes)
  - OFED 1.4.1, 1.4.2, and OFED 1.5 releases
  - OFED 1.6 plans and roadmap
- Open Fabrics Windows Update (15 minutes)
  - WinOF 2.1 release
  - WinOF 2.2 plans and roadmap
- Open Discussion (60 minutes)
  - OFED scalability: should we have a scalability roadmap?
  - Should we be including MPIs in the OFED releases?
  - OFA solutions for Ethernet clusters for HPC
  - Questions and feedback from the community
Linux: OFED Components
OFA Development
- HCA/NIC drivers
  - IB: IBM, Mellanox, QLogic
  - iWARP: Chelsio, Intel
- Core: Verbs, MAD, SMA, CMA, SA cache
- IPoIB
- SDP
- SRP and SRP Target
- iSER
- RDS
- Qlogic_VNIC
- uDAPL
- OSM
- Diagnostic tools
- iSER Target
- NFS-RDMA
Add-on
- Bonding module
- Open iSCSI
- MPI components
  - MVAPICH
  - Open MPI
  - MVAPICH2
  - Benchmark tests
Tested with
- Proprietary MPIs: Intel, HP, Platform MPI
- Proprietary SMs: Sun, Voltaire, QLogic, Mellanox
Update from Sonoma '09 Session
Progress:
- Provide user space components in tarballs, as requested by the distros
OFED 1.4.1 - Released May 2009
- New features
  - Added support for RHEL 5.3 and SLES 11
  - NFS/RDMA: beta quality, with support for RHEL 5.2, 5.3 and SLES 10 SP2
  - Updated MPI packages: MVAPICH 1.1.0-3355, Open MPI 1.3.2
  - Updated bonding package: ib-bonding-0.9.0-40
  - Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
  - Updated OpenSM version to include critical bug fixes
  - Fixed RDS iWARP support
  - Low-level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
  - Added a module parameter to control the number of MTTs per segment in Mellanox HCAs (mlx4 & mthca)
  - mstflint update
  - Enhanced OpenSM and management tools: user interface, HA, routing enhancements, and more (details in the backup slides)
OFED 1.4.2 - Released August 2009
- New features
  - Critical bug fixes only
  - Fixes to NES (Intel iWARP) driver
  - Fixes to support running with Lustre installed
  - NFS/RDMA critical bug fixes
- Minimal QA
  - Thus, recommended only for people hitting these critical bugs
OFED 1.5 - Release December 2009
- New features
  - Added support for RedHat EL 5.4 and EL 4.8 and SLES 10 SP3
  - Added support for kernel.org 2.6.29 and 2.6.30
  - uDAPL scalability enhancements, new UCM provider
  - Hardware driver for new QLogic QDR HCA
  - All user space packages released as tarballs for easier distro integration
  - MVAPICH2 1.4
  - Open MPI 1.3.3
  - Several new enhancements to OpenSM and management tools for improved scalability, performance, QoS, routing, etc. (see backup slides for details)
  - Bug fixes
  - SDP Zero Copy, and other performance improvements
- OFED 1.5-RDMAoE branch
  - Experimental branch of OFED-1.5 that also includes support for Mellanox RDMAoE
  - For those who want to try out this new technology
  - Open Fabrics board has voted to include this code in OFED/WinOF
    - Which release should it go into? OFED-1.5? Or wait until the code is accepted upstream and there is a standard spec?
OFED 1.5 OS Matrix
- List of supported kernels for OFED 1.5
  - RHEL 4: up6, up7, up8
  - RHEL 5: up2, up3, up4
  - SLES 10: SP2, SP3
  - SLES 11
  - Fedora Core 11*
  - OpenSuSE 11*
  - Kernel.org: 2.6.18 - 2.6.30
* Minimal QA for these versions.
OFED 1.6 Plans
- Preliminary schedule
  - Release in Nov 2010
  - Detailed schedule will be derived from the above
- Preliminary feature list
  - Kernel.org: 2.6.33 and 2.6.34
  - SRIOV support
  - Mellanox Vnic for BridgeX
  - MMU notification for MPI (if accepted by the kernel)
  - New HW from vendors (if any)
  - RDMAoE (if not already in an earlier release)
OFED 1.6 OS Matrix
- kernel.org: kernel 2.6.33 and 2.6.34
- RHEL 4: up6, up7, up8 (may be dropped entirely if RHEL 6 is out; let's discuss in the meeting)
- RHEL 5: up2, up3, up4, up5
- RHEL 6
- SLES 10: SP2, SP3, SP4
- SLES 11: SP1
- Fedora Core: latest
- OpenSuSE: latest
Legend:
- New for OFED 1.6 in bold
- Drop support for items in blue
Windows OpenFabrics (WinOF)
- WinOF 2.1 released 9/30/2009
  - Winverbs fully integrated into IB core feature set.
  - OFED compatibility layers
    - libibverbs, libmad, libumad, librdmacm
  - OFED diagnostics on OFED compat layers
    - ibaddr, ibnetdiscover, ibroute, ibstat, saquery, sminfo…
  - Installer fully integrated with DriverStore + PNP.
  - OFED uDAT/uDAPL code base on Windows.
  - Server 2008 HPC integration
  - Numerous bug fixes.
WinOF Roadmap
- WinOF 2.2
  - Release target Q1'2010, freeze in Q4'09
  - Features:
    - Windows 7 & Server 2008 R2 fully supported.
    - NDIS 6.0 IPoIB driver based on WHQL'ed source.
    - OpenSM 3.3.3 (WinOF 2.1 @ 3.0.0, ~OFED 1.2+).
    - SRP multi-path fixes.
- WinOF 2.3
  - Release target Q4'2010, freeze early Q4'2010
  - Connected Mode IPoIB.
OFA Scalability
- Challenges and Goals
- Infrastructure Scalability
- ULPs/Apps Scalability
- Possible Improvements
Challenges and Goals
- Scale out to 10K-20K or more nodes
  - Performance
  - Reliability
  - Sometimes hard to differentiate feature from scalability
- Focus additional attention/resources on issues
  - Get ready for more detailed discussion at Sonoma
Infrastructure Scalability/Features
- Improved multicore affinity/awareness/support
  - Binding to specific hw threads in a core
    - e.g. http://arstechnica.com/hardware/news/2009/09/ibms-8-core-power7-twice-the-muscle-half-the-transistors.ars
  - Interrupt distribution
  - Binding HCAs/RNICs to NUMA nodes
- Multicast
  - Reliable multicast
    - New IBTA optional feature
  - Better UD multicast performance
    - Small-message mcast latency, with just two members in the mcast group, is 2x to 3x the unicast latency between the same pair
- Flow control for SRQ
  - New IBTA optional feature
  - CM extension
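As an illustration of the "Binding HCAs/RNICs to NUMA nodes" item, here is a minimal Python sketch that reads the NUMA node Linux reports for each RDMA device through sysfs. It assumes the usual `/sys/class/infiniband/<device>/device/numa_node` layout; the function names `hca_numa_nodes` and `parse_cpulist` are hypothetical helpers for this example and are not part of OFED.

```python
from pathlib import Path


def hca_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to its NUMA node, or None if unknown."""
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result  # no RDMA devices visible on this host
    for dev in sorted(root.iterdir()):
        try:
            node = int((dev / "device" / "numa_node").read_text().strip())
        except (OSError, ValueError):
            node = -1  # file missing or unreadable: treat as unknown
        # The kernel reports -1 when it has no NUMA affinity for the device.
        result[dev.name] = node if node >= 0 else None
    return result


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


if __name__ == "__main__":
    for name, node in hca_numa_nodes().items():
        print(name, "-> NUMA node", node)
```

On Linux, one possible next step is to parse `/sys/devices/system/node/node<N>/cpulist` with `parse_cpulist` and pass the resulting set to `os.sched_setaffinity(0, cpus)` so the process runs on cores local to the adapter.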
Infrastructure Scalability/Features
- Fault tolerance
  - Application-transparent fault detection, isolation, recovery
  - Multiple HCAs/NICs with transparent failover
- IB monitoring
  - Performance counters, throughput, hotspots, degraded links
  - This is IB's Achilles' heel...
    - Need much better monitoring tools to discover congestion and bottlenecks
- Adaptive routing
  - HCA out-of-order delivery
  - Switch logic for state info & adaptive algorithm, etc.
Infrastructure Scalability
- SA aspects
  - Primarily PathRecord
    - OpenSM
    - SA client
- RDMA CM
  - Resolve address
    - ARP query scalability
  - Resolve route
    - SA PathRecord query scalability
Infrastructure Scalability
- CM
  - Higher abstraction model
    - Current APIs are cumbersome & difficult to use
- OpenSM
  - Stateful failover
    - Replication
    - Eliminate client re-registration
  - Congestion manager
Possible Infrastructure Improvements
- Adaptive MAD retransmission
- Better duplicate transaction handling by SA (and MAD?)
- SA scalability in terms of PathRecord responses
  - More parallelization
    - Shadow DB?
    - SA distribution beyond node
- Tunable retry mechanism for various components
- RDMA CM API addition and ACM (Assistant to the IB CM)
  - Does this address the higher abstraction model requested?
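To make "adaptive MAD retransmission" and "tunable retry" concrete, the sketch below adapts a request timeout to observed response times. It borrows the smoothed round-trip estimator that TCP uses (RFC 6298) purely as an illustration; it is not how OpenSM or the kernel MAD layer behaves, and the class name `AdaptiveTimeout` and its default values are assumptions made for this example.

```python
class AdaptiveTimeout:
    """RFC 6298-style timeout estimator, applied here to request/response MADs."""

    def __init__(self, initial=1.0, min_timeout=0.05, max_timeout=30.0):
        self.srtt = None      # smoothed round-trip time, in seconds
        self.rttvar = None    # smoothed round-trip time variation
        self.timeout = initial
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout

    def _clamp(self, value):
        return max(self.min_timeout, min(self.max_timeout, value))

    def on_response(self, rtt):
        """Update the estimate with the measured RTT of a non-retried request."""
        if self.srtt is None:
            # First sample: seed the estimator.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Gains of 1/4 and 1/8 as in RFC 6298; variance is updated first.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        self.timeout = self._clamp(self.srtt + 4 * self.rttvar)

    def on_timeout(self):
        """Back off exponentially when a request goes unanswered."""
        self.timeout = self._clamp(self.timeout * 2)


if __name__ == "__main__":
    t = AdaptiveTimeout()
    for sample in (0.10, 0.12, 0.11):
        t.on_response(sample)
    print("timeout after responses:", round(t.timeout, 3))
    t.on_timeout()
    print("timeout after one loss: ", round(t.timeout, 3))
```

The minimum, maximum, and initial values are the "tunable" part: a fabric with a heavily loaded SA could raise `max_timeout` so that retries back off instead of adding to the load.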
Possible ULPs/Apps Improvements
- MPI
  - Don't query PathRecords per core
  - Hardware collective support
    - Common API
  - Reliable multicast
  - ummunotify
- BoIB (Boot over IB)
  - SM improvements for handling non-responsive SMAs as the node transitions from boot ROM to kernel
  - InfiniBand as boot interface without Ethernet suspenders
- Bonding
  - Load balancing
    - Not just active/standby (failover)
- DHCP
  - Use raw (mmap) rather than BSD socket interface due to inadequate performance
- Others?
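One way to read "Don't query PathRecords per core" is that lookups for the same source/destination pair should be answered from a shared cache rather than each triggering a new SA query. The sketch below shows that idea as simple in-process memoization. The class `PathRecordCache` and the `query_fn` callback are hypothetical, and sharing the cache between the separate MPI processes on a node (for example through a local daemon) is left out of this example.

```python
import threading


class PathRecordCache:
    """Memoize path lookups so each (source, destination) pair is queried once."""

    def __init__(self, query_fn):
        # query_fn(src_gid, dst_gid) stands in for a real SA PathRecord query.
        self._query_fn = query_fn
        self._lock = threading.Lock()
        self._cache = {}
        self.queries_sent = 0

    def lookup(self, src_gid, dst_gid):
        key = (src_gid, dst_gid)
        # The lock is held across the query for simplicity; this serializes
        # lookups but guarantees at most one query per key.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._query_fn(src_gid, dst_gid)
                self.queries_sent += 1
            return self._cache[key]


if __name__ == "__main__":
    def fake_sa_query(src, dst):
        # Placeholder result; a real query would return path attributes.
        return {"src": src, "dst": dst}

    cache = PathRecordCache(fake_sa_query)
    for _core in range(8):  # eight cores asking for the same destination
        cache.lookup("gid-a", "gid-b")
    print("queries sent:", cache.queries_sent)
```

A production design would also need invalidation when the fabric changes, which this sketch does not attempt.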
MPI Distribution in OFED: Rationale
- Open source MPIs were initially included to "bootstrap" the OFED project
- MPI was the main user of OFED, so this seemed like a natural pairing
  - Made it (significantly) easier for customers to get their MPI jobs running on InfiniBand
  - Also necessary for political buy-in: unify under one standard verbs API (vs. different MVAPI stacks)
- QA testing of MPI + OFED is still extremely valuable
  - This is not a discussion of removing MPI + OFED QA
MPI Distribution in OFED: Pros
- MPI is still the most common OFED "customer"
- HPC customers get network stack + MPI in one package
  - Helps rapid MPI deployment on new clusters (out of the box)
  - The MPI-selector function lets users select the MPI stack of their choice during installation
- Customers get QA assurance of specific MPI + OFED version tuples
- Helps to test multiple functions of the OFED stack and IB/iWARP fabric with a comprehensive suite of MPI-level benchmarks
MPI Distribution in OFED: Cons
- MPIs have their own QA cycles
  - MPI + OFED QA testing is more for OFED, not MPI
- Bundling induces project scheduling difficulties between OFED and the various MPI packages
- RedHat and SuSE both say "Don't do this!"
  - They both already include the open source MPIs
  - Makes it more difficult for them to take OFED drops
- Many users will download the latest-n-greatest MPIs anyway, not the ones included in OFED
OFA Solutions for Ethernet Clusters for HPC
- Would like to get some community feedback on success stories for building HPC clusters using Ethernet.
  - What works well?
  - What needs improvement to make it easier?
  - Other?
Backup Slides
OpenSM - OFED 1.4.1 and 1.4.2
- Versions in OFED 1.4.2:
  - libibumad-1.3.1, libibmad-1.3.1, opensm-3.3.1, infiniband-diags-1.5.1
- User interface:
  - Unified configuration file
  - Configuration reloading on the fly
  - Improved plugin interface: multiple plugins are supported
  - Multicast: IPv6 Solicited Node consolidation
  - Better diagnostic tools (new: ibsendtrap)
  - HA:
    - OpenSM will query standby SMs periodically
    - Standby OpenSM notifies master SM about priority change (Trap 144)
OpenSM - OFED 1.4.2 Routing
- Cached routing (-R ftree, updn, minhop)
- LMC improvements:
  - Preserve base LID routes
  - Ensure LMC path balancing over different switches/chassis
- Ordered path balancing
  - Ports are sorted by switch load
  - Port order file option (--guid_routing_order_file)
- Better LASH support: mesh geometry analysis, path balancing over multiple links
- General
  - Port IDs for Up/Down
  - Min hop weights
  - Connecting root nodes with Up/Down
  - Connecting IO nodes with FatTree
OpenSM - New Features in OFED 1.5
- Scalability & performance
  - Optimized SL2VL setup
  - Parallel LFTs setup
  - Parallel MFTs setup
- QoS improvements
  - SL2VL setup optimization
  - QoS/LASH co-exist
- Routing & multicast
  - FTree improvements
  - Routing engine reloading
  - Mesh switch reordering optimizations
  - MGID to MLID compression
- Major bug fixes
  - MCG join/leave fixes
  - Clean delayed MCG deletion