Linux in High-Availability Environments
Alan Robertson
IBM Linux Technology Center
alanr@unix.sh
OSS for High-Availability, April 2005

OSS in HA Environments
- Why OSS for High-Availability Environments?
- What is High-Availability (HA) Clustering?
- What can HA do for me?
- DRBD Data Replication
- The Linux Virtual Server Load Balancer
- The Linux-HA project
- Linux-HA applications and customers
- Thoughts about cluster security

Why OSS in High-Availability Environments?
- Openness
- Broad Range of Environments
- Breadth of Support Options
- Lack of Vendor Lock-In

Openness
- Extensive peer review system
  - Source code freely available
  - Source code reviewed by outside parties
  - Changes discussed openly, often in great detail
- Ability to obtain uncensored product information
  - Mailing list archives contain uncensored comments from:
    - Users with deep expertise
    - Users with little expertise
    - Users who are very happy
    - Users with problems

Broad Range of Environments
- OSS typically runs on many platforms, often on different OSes too
- Users often find very creative uses for the software
  - Freedom to try something at low cost decreases perceived risks and encourages this behavior
  - Creative uses find their way into mailing list archives and sometimes into the OSS product
- Users help with testing, providing more breadth in the test environment than might otherwise occur

Support for OSS Systems
- Mailing lists consist of hundreds to thousands of users who are very knowledgeable and helpful, usually regarded as very responsive, and typically located in most time zones across the world
- Can choose support vendor freely:
  - Hardware, OS or OSS supplier
  - Independent consulting/support organizations
  - In-house expertise (most motivated)
  - OSS mailing lists
  - Any combination of the above

No Vendor Lock-In
- Does not rely on a vendor's future plans being compatible with yours (risk mitigation)
- Obsolescence more readily manageable
- Does not rely on a single vendor in another company or country
- Contributing to the product (or paying someone else to) provides you a voice in future direction
- Compatibility with other systems typically better

What Is HA Clustering?
- A group of computers which cooperate and trust each other to provide a service even when cluster components fail
- When one machine goes down, others take over its work
- This involves IP address takeover, service takeover, etc.
- New work comes to the “takeover” machine
- Not primarily designed for high performance

What Can HA Clustering Do For You?
- It cannot achieve 100% availability; nothing can.
  - HA clustering is designed to recover from single faults
- It can make your outages very short
  - From about a second to a few minutes
- It is like a magician's (illusionist's) trick:
  - When it goes well, the hand is faster than the eye
  - When it goes not-so-well, it can be reasonably visible
- A good HA clustering system adds a “9” or two to your availability
  - 99% -> 99.9%, 99.9% -> 99.99%, 99.99% -> 99.999%, etc.
- Complexity is the enemy of reliability!
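
To make the “adding a 9” arithmetic concrete, the sketch below converts an availability percentage into the downtime it allows per year. It assumes a 365-day year and counts all downtime equally; the function name is chosen for illustration only.

```python
# Downtime budget implied by an availability percentage.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available -> {minutes:.1f} minutes of downtime per year")
```

Each added 9 cuts the yearly downtime budget by a factor of ten: roughly 3.65 days at 99%, and a little over 5 minutes at 99.999%.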

The Desire for HA Systems
- Who wants low-availability systems?
- Why are so few systems high-availability?

Why isn't everything HA?
- Cost
- Complexity

Single Points of Failure (SPOFs)
- A single point of failure is a component whose failure will cause near-immediate failure of an entire system or service
- Good HA design eliminates single points of failure
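
A rough way to see why eliminating a SPOF matters is to compare the availability of components chained in series (each one is a SPOF) with that of a redundant pair. The sketch below assumes failures are independent and that failover is instantaneous and perfect; neither holds exactly in practice, so the results are an upper bound, not a prediction.

```python
# Availability of components in series vs. a redundant (parallel) set.
# Assumptions: independent failures and perfect, instantaneous failover.

def series(*availabilities: float) -> float:
    """All components are required, so every one is a single point of failure."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Any one component suffices; the service fails only if all of them fail."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail


if __name__ == "__main__":
    server = 0.99
    print("two required servers :", series(server, server))
    print("one redundant pair   :", parallel(server, server))
```

Under these assumptions, two 99% servers that are both required give about 98% availability, while the same two servers as a redundant pair give 99.99%.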

How Does HA Work?
- Manage redundancy to improve service availability
- Like a cluster-wide super-init on steroids
- Even complex services are now “respawn”
  - on node (computer) death
  - on “impairment” of nodes
  - on loss of connectivity
  - for services that aren't working (not necessarily stopped)
  - managing very complex dependency relationships
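
The “respawn on node death” idea can be sketched as a small simulation: nodes report heartbeats, a node that stays silent longer than a timeout is declared dead, and its resources are restarted on a surviving node. This is an illustrative toy, not the Linux-HA implementation; the class, the method names and the `deadtime` parameter are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    deadtime: float  # seconds of silence after which a node is declared dead
    last_seen: dict[str, float] = field(default_factory=dict)  # node -> time
    placement: dict[str, str] = field(default_factory=dict)    # resource -> node

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def live_nodes(self, now: float) -> list[str]:
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.deadtime)

    def recover(self, now: float) -> dict[str, str]:
        """Move resources off dead nodes; return {resource: new_node}."""
        live = self.live_nodes(now)
        moved: dict[str, str] = {}
        if not live:
            return moved  # nowhere to fail over to
        for resource, node in list(self.placement.items()):
            if node in live:
                continue
            # Pick the live node currently running the fewest resources.
            target = min(live, key=lambda n: sum(
                1 for owner in self.placement.values() if owner == n))
            self.placement[resource] = target
            moved[resource] = target
        return moved


if __name__ == "__main__":
    cluster = Cluster(deadtime=10.0)
    cluster.heartbeat("node1", 0.0)
    cluster.heartbeat("node2", 0.0)
    cluster.placement = {"ip": "node1", "web": "node1", "db": "node2"}
    cluster.heartbeat("node2", 20.0)   # node1 has gone silent
    print(cluster.recover(now=20.0))
```

A real cluster manager also has to deal with the cases this toy ignores, such as a node that is alive but unreachable, and the dependency relationships between resources.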

DRBD - RAID over the LAN
- Block-device (filesystem) level replication
- Clever synchronization methods make resyncs faster, decrease latency, preserve integrity
- Useful for both HA and Disaster Recovery
- NO single point of failure
- Extremely cost-effective
  - $200 (max) instead of $20,000 (min) ($USD)
- Probably not suitable for some high-end write-intensive applications
- Supportable by IBM Support Line
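
One general way a replicated block device can keep resyncs fast is to remember which blocks changed while the peer was unreachable and copy only those afterwards. The sketch below illustrates that idea with in-memory lists; it is a conceptual model under that assumption, not DRBD's actual code, wire protocol or on-disk format, and all names are hypothetical.

```python
class ReplicatedDevice:
    """Toy model of a mirrored block device with dirty-block tracking."""

    def __init__(self, num_blocks: int) -> None:
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks
        self.connected = True
        self.dirty: set[int] = set()  # blocks written while disconnected

    def write(self, block: int, data: bytes) -> None:
        """Write locally; mirror immediately if the peer is reachable."""
        self.primary[block] = data
        if self.connected:
            self.secondary[block] = data
        else:
            self.dirty.add(block)

    def resync(self) -> int:
        """Copy only the dirty blocks to the peer; return how many were sent."""
        copied = 0
        for block in sorted(self.dirty):
            self.secondary[block] = self.primary[block]
            copied += 1
        self.dirty.clear()
        self.connected = True
        return copied


if __name__ == "__main__":
    dev = ReplicatedDevice(num_blocks=1000)
    dev.write(1, b"a")
    dev.connected = False          # simulate losing the replication link
    dev.write(2, b"b")
    dev.write(3, b"c")
    dev.write(2, b"d")             # rewriting a block does not add work
    print("blocks resynced:", dev.resync())
```

The point of the sketch is the cost model: after an outage, the resync work is proportional to the number of distinct blocks changed, not to the size of the device.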

LVS - The Linux Virtual Server Project
- LVS is the standard Linux load balancer
- Called "ipvs" in the standard Linux kernel
- Stable, fast, flexible
- Especially suitable for large "server farms"
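
A load balancer in front of a server farm has to decide which real server receives each new connection. The sketch below shows one common policy, least-connection scheduling, in user-space Python. It only illustrates the scheduling idea; LVS itself runs inside the kernel, and the class and server names here are hypothetical.

```python
class LeastConnectionBalancer:
    """Send each new connection to the server with the fewest active ones."""

    def __init__(self, servers: list[str]) -> None:
        self.active = {server: 0 for server in servers}

    def assign(self) -> str:
        """Pick a server for a new connection and count it as active."""
        # Ties go to the server listed first (dicts keep insertion order).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Record that a connection to `server` has finished."""
        if self.active[server] > 0:
            self.active[server] -= 1

    def remove(self, server: str) -> None:
        """Take a failed server out of rotation."""
        self.active.pop(server, None)


if __name__ == "__main__":
    lb = LeastConnectionBalancer(["web1", "web2", "web3"])
    print([lb.assign() for _ in range(3)])
    lb.release("web2")
    print(lb.assign())
```

Removing a failed server from rotation is where load balancing and HA meet: something has to notice the failure and call the equivalent of `remove`.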

[Figure: LVS in Action]

“Plays Well With Others”
- Each of these independent services can work together to scale to large systems
- All single points of failure can be eliminated
- High-Availability and Load Balancing work together nicely

[Figure: Linux Virtual Server, Linux-HA and DRBD]

The Linux-HA Project
- Linux-HA is the oldest high-availability project for Linux, with the largest associated community
- The core piece of Linux-HA is called “heartbeat” (though it does much more than heartbeat)
- Linux-HA has been in production since 1999, and is currently in use on about ten thousand sites
- Linux-HA also runs on FreeBSD and Solaris, and is being ported to OpenBSD and others
- Linux-HA is shipped with every major Linux distribution except one

Linux-HA Release 1 Applications
- Database Servers
- Load Balancers
- Web Servers
- Custom Applications
- Firewalls, routers, DNS, DHCP
- Retail Point of Sale Solutions
- Authentication
- File Servers
- Proxy Servers
- Medical Imaging
- Almost any type of server application you can think of, except SAP

Selected Linux-HA Customers
- Los Alamos (US) National Labs: linear accelerator badge reader
- Emageon: medical imaging for hospitals and clinics
- ISO New England manages power grid using ≈ 20 Linux-HA clusters
- Various firewall, DNS, DHCP products use Linux-HA basically embedded
- Karstadt, Circuit City, Autozone use Linux-HA in each of several hundred stores
- MAN Nutzfahrzeuge AG: truck manufacturing division of MAN AG
- Autostrada: 230 clusters across Italy
- BBC: Internet infrastructure
- Citysavings Bank in Munich (infrastructure)
- Bavarian Radio Station (Munich): coverage of 2002 Olympics in Salt Lake City
- The Weather Channel (weather.com)
- Sony (manufacturing)
- Incredimail bases their mail service on Linux-HA on IBM hardware
- University of Toledo (US): 20k-student Computer Aided Instruction system

Linux-HA Release 1 Capabilities
- Supports 2-node clusters
- Can use serial, UDP bcast, mcast, ucast communication
- Fails over on node failure
- Fails over on loss of IP connectivity
- Capability for failing over on loss of SAN connectivity
- Limited command line administrative tools to fail over, query current status, etc.
- Active/Active or Active/Passive
- Simple resource group dependency model
- Requires external tool for resource monitoring
- SNMP monitoring

Linux-HA Release 2 Capabilities
- Built-in resource monitoring
- Support for the OCF resource standard
- Much larger clusters supported (>= 8 nodes)
- Sophisticated dependency model with rich constraint support (resources, groups, incarnations, master/slave) (needed for SAP)
- XML-based resource configuration
- Configuration and monitoring GUI
- Support for GFS cluster filesystem
- Multi-state (master/slave) resource support
- Initially: no IP, SAN monitoring

Resource Objects in Release 2
- Release 2 supports “resource objects”, which can be any of the following:
  - Primitive resources
    - OCF, heartbeat-style, or LSB resource agent scripts
  - Resource incarnations: need “n” resource objects somewhere
  - Resource groups: a group of resources with implied colocation and linear ordering constraints
  - Multi-state resources (master/slave)
    - Designed to model master/slave (replication) resources (DRBD, et al.)

Basic Dependencies in Release 2
- Ordering dependencies
  - start before (implies stop after)
  - start after (implies stop before)
- Mandatory co-location dependencies
  - must be co-located with
  - cannot be co-located with
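
The two kinds of dependency above can be sketched in a few lines: ordering constraints determine a start sequence (and, reversed, a stop sequence), while co-location constraints restrict which placements are acceptable. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not the Release 2 policy engine; the resource names and function names are hypothetical.

```python
from graphlib import TopologicalSorter


def start_order(start_before: list[tuple[str, str]]) -> list[str]:
    """Each (first, then) pair means `first` must start before `then`."""
    sorter = TopologicalSorter()
    for first, then in start_before:
        sorter.add(then, first)  # `then` depends on `first`
    return list(sorter.static_order())


def stop_order(start_before: list[tuple[str, str]]) -> list[str]:
    """'Start before' implies 'stop after': stop in reverse start order."""
    return list(reversed(start_order(start_before)))


def placement_ok(placement: dict[str, str],
                 must_colocate: list[tuple[str, str]],
                 cannot_colocate: list[tuple[str, str]]) -> bool:
    """Check a resource -> node placement against co-location constraints."""
    together = all(placement[a] == placement[b] for a, b in must_colocate)
    apart = all(placement[a] != placement[b] for a, b in cannot_colocate)
    return together and apart


if __name__ == "__main__":
    ordering = [("filesystem", "database"), ("database", "ip")]
    print("start:", start_order(ordering))
    print("stop: ", stop_order(ordering))
    print(placement_ok({"filesystem": "node1", "database": "node1", "ip": "node1"},
                       must_colocate=[("filesystem", "database")],
                       cannot_colocate=[]))
```

`TopologicalSorter` raises `CycleError` if the ordering constraints contradict each other, which is a useful check before attempting to start anything.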

Resource Incarnations
- Resource incarnations allow one to have a resource which runs multiple (“n”) times on the cluster
- This is useful for managing:
  - Load balancing clusters where you want “n” of them to be slave servers
  - Cluster filesystems
  - Cluster alias IP addresses
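
As a rough illustration of what running a resource “n” times involves, the sketch below spreads n instances of one resource across the available nodes round-robin, with an optional cap per node. The function, the `max_per_node` parameter and the `name:index` instance labels are assumptions made for this example, not Linux-HA syntax.

```python
def place_incarnations(resource: str, n: int, nodes: list[str],
                       max_per_node: int = 1) -> dict[str, str]:
    """Return {instance_label: node} for n instances of `resource`."""
    if n > len(nodes) * max_per_node:
        raise ValueError("not enough node capacity for the requested instances")
    placement = {}
    for i in range(n):
        # Round-robin keeps the per-node count within max_per_node
        # because of the capacity check above.
        placement[f"{resource}:{i}"] = nodes[i % len(nodes)]
    return placement


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    print(place_incarnations("lvs-slave", 3, nodes))
```

When a node fails, the same function can be called again with the surviving nodes to work out where the missing instances should run.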

Security Considerations
- Cluster: a computer whose backplane is the Internet
  - If this isn't scary, you don't understand...
- You may think you have a secure cluster network
  - You're probably mistaken now
  - You will be in the future

Secure Networks Are Difficult Because...
- Security is not often well-understood by admins
- Security is well-understood by “black hats”
- Network security is easy to breach accidentally
  - Users bypass it
  - Hardware installers don't fully understand it
- Most security breaches come from “trusted” staff
- Staff turnover is often a big issue
- Virus/Worm/P2P technologies will create new holes, especially for Windows machines

Security Advice
- Good HA software should be designed to assume insecure networks
- Not all HA software assumes insecure networks
- Good HA installation architects use dedicated (secure?) networks for intra-cluster HA communication
- Crossover cables are reasonably secure; all else is suspect ;-)

References
- http://linux-ha.org/download/
- http://wiki.linux-ha.org/NewHeartbeatDesign
- New web site content (a work in progress):
  - http://wwnew.linux-ha.org/ (prettier)
  - http://wiki.linux-ha.org/ (editable)
  - http://wwnew.linux-ha.org/SuccessStories
- www.linux-mag.com/2003-11/availability_01.html
- http://www.linuxvirtualserver.org/
- http://drbd.org/

Legal Statements
- IBM is a trademark of International Business Machines Corporation.
- Linux is a registered trademark of Linus Torvalds.
- Other company, product, and service names may be trademarks or service marks of others.
- This work represents the views of the author and does not necessarily reflect the views of the IBM Corporation.