
QuickSilver: Middleware for Scalable Self-Regenerative Systems
Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
Raytheon Corporation: Lou DiPalma, Paul Work
AMS 2003 Autonomic Computing, June 24, 2003
Approved for Public Release, Distribution Unlimited
Our topic
- Computing systems are growing larger and more complex, and we are hoping to use them in an increasingly "unattended" manner
- But the technology for managing growth and complexity is lagging
Our goal
- Build a new platform in support of massively scalable, self-regenerative applications
- Demonstrate it by offering a specific military application interface
- Work with Raytheon to apply it in other military settings
Representative scenarios
- Massive data centers maintained by the military (or by companies like Amazon)
- Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
- Deployments of large numbers of lightweight sensors
- New network architectures to control autonomous vehicles over media shared with other "mundane" applications
How to approach the problem?
- The Web Services architecture has emerged as a likely standard for large systems
- But WS is "document oriented" and lacks:
  - High availability (or any kind of quick-response guarantees)
  - A convincing scalability story
  - Self-monitoring/adaptation features
Signs of trouble?
- Most technologies are way beyond their normal scalability limits in this kind of center: we are "good" at small clusters but not huge ones
- Pub-sub was a big hit. No longer…
  - Curious side-bar: it is used heavily for point-to-point communication! (Why?)
  - Extremely hard to diagnose problems
We lack the right tools!
- Today, our applications navigate in the dark
  - They lack a way to find things
  - They lack a way to sense system state
  - There are no rules for adaptation, if/when needed
- In effect: we are starting to build very big systems, yet doing so in the usual client-server manner
- This denies applications any information about system state, configuration, loads, etc.
QuickSilver
- QuickSilver: a platform to help developers build these massive new systems
- It has four major components:
  - Astrolabe: a novel kind of "virtual database"
  - Bimodal Multicast: for faster "few to many" data transfer patterns
  - Kelips: a fast "lookup" mechanism
  - Group replication technologies based on virtual synchrony or other similar models
QuickSilver Architecture
(Diagram of the architecture. Components shown: Pub-sub (JMS, JBI); Native API; Monitoring; Indexing; Distributed query, event detection; Message Repository; Massively Scalable Group Communication; Composable Microprotocol Stacks; Overlay Networks.)
ASTROLABE
Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
What does Astrolabe do?
- Astrolabe's role is to track information residing at a vast number of sources
- Structured to look like a database
- Approach: "peer-to-peer gossip". Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
Astrolabe in a single domain

Name      Load  Weblogic?  SMTP?  Word Version
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

(The Load values are animated on the original slide, cycling through several readings.)
- A row can have many columns
- Total size should be kilobytes, not megabytes
- A configuration certificate determines what data is pulled into the table (and can change)
So how does it work?
- Each computer has:
  - Its own row
  - Replicas of some objects (configuration certificate, other rows, etc.)
- Periodically, at a fixed rate, each machine picks a friend "pseudo-randomly" and exchanges state with it efficiently (the size of the data exchanged is bounded), as in the sketch below
- States converge exponentially rapidly. Loads are low and constant, and the protocol is robust against all sorts of disruptions!
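To make the loop concrete, here is a minimal Python sketch of the gossip cycle described above. The peer list, the send/receive helpers, and the merge callback are hypothetical placeholders rather than Astrolabe's actual API (a merge sketch follows the state-merge example below).

```python
import random
import time

GOSSIP_INTERVAL = 5.0   # seconds; the deck mentions roughly five-second rounds

def gossip_loop(my_state, peers, send_state, receive_state, merge):
    """Forever: pick a peer pseudo-randomly and exchange (bounded) state."""
    while True:
        peer = random.choice(peers)          # the pseudo-randomly chosen "friend"
        send_state(peer, my_state)           # size of exchanged data is bounded
        their_state = receive_state(peer)
        merge(my_state, their_state)         # keep the freshest version of each row
        time.sleep(GOSSIP_INTERVAL)
```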
State Merge: Core of Astrolabe epidemic (two replicas before the exchange)

swift.cs.cornell.edu's copy:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2004  4.5   1          0      6.0

cardinal.cs.cornell.edu's copy:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2003  0.67  0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
State Merge: Core of Astrolabe epidemic (the exchange)
(Same two tables as the previous slide; the animation shows the gossip exchange in which swift.cs.cornell.edu's newer row for swift (time 2011, load 2.0) and cardinal.cs.cornell.edu's newer row for cardinal (time 2201, load 3.5) are sent across.)
State Merge: Core of Astrolabe epidemic (after the merge)

swift.cs.cornell.edu's copy:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          0      6.0

cardinal.cs.cornell.edu's copy:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
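A minimal Python sketch of the state merge illustrated above, assuming each table maps a name to a row carrying a logical timestamp: for every name, the copy with the newer timestamp wins. The dict layout is an assumption for illustration, not Astrolabe's data structures.

```python
def merge(local, remote):
    """Merge two copies of a zone's table, as in the "State Merge" slides."""
    for name, remote_row in remote.items():
        local_row = local.get(name)
        if local_row is None or remote_row["time"] > local_row["time"]:
            local[name] = remote_row   # newer timestamp wins

# Worked example matching the slides: swift's copy has the newer row for swift,
# cardinal's copy has the newer row for cardinal; after merging, both rows are fresh.
swift_view = {
    "swift":    {"time": 2011, "load": 2.0},
    "cardinal": {"time": 2004, "load": 4.5},
}
cardinal_view = {
    "swift":    {"time": 2003, "load": 0.67},
    "cardinal": {"time": 2201, "load": 3.5},
}
merge(swift_view, cardinal_view)
assert swift_view["cardinal"]["time"] == 2201   # took cardinal's newer row
assert swift_view["swift"]["time"] == 2011      # kept its own newer row
```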
Observations
- The merge protocol has constant cost
  - One message sent and one received (on average) per unit time
  - The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so
- Information spreads in O(log N) time
- But this assumes bounded region size
  - In Astrolabe, we limit regions to 50-100 rows
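A back-of-the-envelope justification of the O(log N) claim, assuming a simple push-gossip model in which every informed member contacts one random peer per round:

\[
I_{t+1} \;\approx\; \min\bigl(N,\; 2\,I_t\bigr) \quad\Longrightarrow\quad I_t \;\approx\; \min\bigl(N,\; 2^{\,t}\bigr),
\]

so essentially all N members have heard an update after roughly log2 N rounds, plus a few extra rounds for the last stragglers.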
Scaling up… and up…
- With a stack of domains, we don't want every system to "see" every domain
  - The cost would be huge
- So instead, we'll see a summary
(Diagram: the single-domain table from earlier, repeated and stacked, one copy per machine.)
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers
- An SQL query "summarizes" the data
- Dynamically changing query output is visible system-wide

Root (summary) table:
Name   Avg Load  WL contact    SMTP contact
SF     2.6       123.45.61.3   123.45.61.17
NJ     1.8       127.16.77.6   127.16.77.11
Paris  3.1       14.66.71.8    14.66.71.12

San Francisco leaf table:
Name      Load  Weblogic?  SMTP?  Word Version
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

New Jersey leaf table:
Name     Load  Weblogic?  SMTP?  Word Version
gazelle  1.7   0          0      4.5
zebra    3.2   0          1      6.2
gnu      0.5   1          0      6.2
(1) The query goes out… (2) each leaf domain computes locally… (3) the results flow to the top level of the hierarchy
(The slide repeats the root and leaf tables from the previous slide, annotating the leaf tables with the local computation and the root table with the upward flow of results.)
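A minimal Python sketch of the three-step flow above: each leaf zone computes its summary locally (playing the role of the SQL query), and the one-row summaries become the parent table. The row layout and the avg_load/smtp_contact aggregates are illustrative assumptions, not Astrolabe's query interface.

```python
def summarize(zone_name, rows):
    """Step 2: compute the leaf zone's one-row summary locally.

    Stands in for the SQL summary query, e.g. roughly
    SELECT AVG(load), ... FROM zone.
    """
    return {
        "name": zone_name,
        "avg_load": sum(r["load"] for r in rows) / len(rows),
        # a "contact" aggregate: pick some machine running SMTP in this zone
        "smtp_contact": next((r["name"] for r in rows if r["smtp"]), None),
    }

# Steps 1 and 3: the same query runs in every leaf zone; the summaries form
# the parent ("root") table that every participant can see.
san_francisco = [
    {"name": "swift",    "load": 2.0, "smtp": True},
    {"name": "falcon",   "load": 1.5, "smtp": False},
    {"name": "cardinal", "load": 4.5, "smtp": False},
]
new_jersey = [
    {"name": "gazelle", "load": 1.7, "smtp": False},
    {"name": "zebra",   "load": 3.2, "smtp": True},
    {"name": "gnu",     "load": 0.5, "smtp": False},
]
root_table = [summarize("SF", san_francisco), summarize("NJ", new_jersey)]
# root_table now holds one summary row per zone, as on the slide.
```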
Hierarchy is virtual… data is replicated
(The slide shows the same root and leaf tables as before; the summary rows live on no server, but are replicated among the participants themselves.)
Hierarchy is virtual… data is replicated (repeated for emphasis; same tables as the previous slide)
The key to self-* properties!
- A flexible, reprogrammable mechanism:
  - Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
  - Find 12 idle machines with the NMR-3D package that can download a 20 MB dataset rapidly
  - Which machines have inventory for warehouse 9?
  - Where's the cheapest gasoline in the area?
- Think of aggregation functions as small agents that look for information
What about security?
- Astrolabe requires:
  - Read permissions to see the database
  - Write permissions to contribute data
  - Administrative permission to change aggregation or configuration certificates
- Users decide what data Astrolabe can see
- A VPN setup can be used to hide Astrolabe's internal messages from intruders
- Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
Data Mining
- Quite a hot area, usually done by collecting information at a centralized node, then "querying" within that node
- Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
  - This is incredibly parallel, hence faster
  - And more robust against disruption too!
Cool Astrolabe Properties
- Parallel: everyone does a tiny bit of work, so we accomplish huge tasks in seconds
- Flexible: decentralized query evaluation, in seconds
- One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?" The hierarchy encodes many answers in one tree!
Aggregation and Hierarchy
- Nearby information:
  - Maintained in more detail; can query it directly
  - Changes are seen sooner
- Remote information is summarized:
  - High-quality aggregated data
  - This also changes as information evolves
Astrolabe summary
- Scalable: could support millions of machines
- Flexible: can easily extend the domain hierarchy, define new columns or eliminate old ones. Adapts as conditions evolve.
- Secure:
  - Uses keys for authentication and can even encrypt
  - Handles firewalls gracefully, including issues of IP address re-use behind firewalls
- Performs well: updates propagate in seconds
- Cheap to run: tiny load, small memory impact
Bimodal Multicast
- A quick glimpse of scalable multicast
- Think about really large Internet configurations:
  - A data center as the data source
  - Typical "publication" might be going to thousands of client systems
Swiss Stock Exchange problem: virtually synchronous multicast is "fragile"
Most members are healthy… but one is slow.
Performance degrades as the system scales up
(Figure: average throughput on non-perturbed members of virtually synchronous Ensemble multicast protocols, plotted against perturb rate for group sizes 32, 64, and 96; throughput drops steeply as the perturb rate grows, and the larger the group, the worse the collapse.)
Why doesn't multicast scale?
- With weak semantics:
  - Faulty behavior may occur more often as system size increases (think "the Internet")
- With stronger reliability semantics:
  - We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
  - That can be triggered more often as a function of scale (more failures, or more network "events", or bigger latencies)
- A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
But none of this is inevitable
- Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
- Also gives very steady throughput
- And can take advantage of hardware support for multicast, if available
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
Periodically (e.g. every 100 ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
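Putting the last four slides together, here is a minimal Python sketch of one pbcast repair round: lossy initial delivery, periodic digests, solicitation of missing messages, and retransmission. The class, the in-process "transport", and the message ids are simplifications for illustration, not the pbcast implementation.

```python
import random

class PbcastNode:
    """Toy model of one bimodal-multicast (pbcast) participant."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers          # other PbcastNode objects
        self.messages = {}          # msg_id -> payload received so far

    def deliver(self, msg_id, payload):
        """Initial, possibly lossy, unreliable multicast delivery."""
        self.messages[msg_id] = payload

    def gossip_round(self):
        """Periodically (e.g. every 100 ms) send a digest to a random peer."""
        peer = random.choice(self.peers)
        digest = set(self.messages)     # identifies messages, doesn't include them
        peer.on_digest(self, digest)

    def on_digest(self, sender, digest):
        """Check the digest against our history; solicit anything we lack."""
        missing = digest - set(self.messages)
        if missing:
            sender.on_solicit(self, missing)

    def on_solicit(self, requester, msg_ids):
        """Respond to a solicitation by retransmitting the requested messages."""
        for msg_id in msg_ids:
            if msg_id in self.messages:
                requester.deliver(msg_id, self.messages[msg_id])

# Example wiring: only "a" received the initial unreliable multicast; gossip
# rounds repair the gap at "b" and "c".
a, b, c = PbcastNode("a", []), PbcastNode("b", []), PbcastNode("c", [])
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.deliver("m1", "some payload")
while "m1" not in b.messages or "m1" not in c.messages:
    for node in (a, b, c):
        node.gossip_round()
```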
This solves our problem!
(Two figures: low-bandwidth and high-bandwidth comparisons of pbcast versus traditional multicast throughput, measured at perturbed and unperturbed hosts as the perturb rate grows. Traditional multicast throughput falls off at the perturbed host, while pbcast stays steady: Bimodal Multicast rides out disturbances!)
Bimodal Multicast Summary
- An extremely scalable technology
- Remains steady and reliable:
  - Even with high rates of message loss (in our tests, as high as 20%)
  - Even with large numbers of perturbed processes (we tested with up to 25%)
  - Even with router failures
  - Even when IP multicast fails
- And we've secured it using digital signatures
Kelips
- Third in our set of tools
- A P2P "index":
  - Put("name", value)
  - Get("name")
- Kelips can do lookups with one RPC, and is self-stabilizing after disruption
- Unlike Astrolabe, nodes can put varying amounts of data out there
Kelips: take a collection of "nodes" (e.g. nodes 110, 230, 202, and 30 in the figure)
Kelips: map nodes to affinity groups
(Figure: affinity groups 0, 1, 2, …, √N − 1; peer membership is assigned via a consistent hash; roughly √N members per affinity group. Nodes 110, 230, 202, and 30 from the previous slide are shown hashed into their groups.)
Kelips: node 110 knows about the other members of its own affinity group (230, 30, …)

110's affinity group view:
id   hbeat  rtt
30   234    90 ms
230  322    30 ms

(Figure: the same affinity-group diagram as before, with pointers from 110 to its group-mates.)
Kelips: node 202 is a "contact" for node 110 in affinity group 2

110's affinity group view:
id   hbeat  rtt
30   234    90 ms
230  322    30 ms

110's contacts table:
group  contactNode
…      …
2      202

(Figure: the same affinity-group diagram, now with a contact pointer from 110 to 202 in group 2.)
Kelips: "dot.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about dot.com to it

110's affinity group view:
id   hbeat  rtt
30   234    90 ms
230  322    30 ms

110's contacts table:
group  contactNode
…      …
2      202

Resource tuples (replicated within group 2):
resource  info
…         …
dot.com   110

A gossip protocol replicates this data cheaply.
Kelips: to look up "dot.com", just ask some contact in group 2; it returns "110" (or forwards your request).
(Figure: the same affinity-group diagram as before.)
Kelips summary
- Split the system into √N subgroups
  - Map (key, value) pairs to some subgroup by hashing the key
  - Replicate within that subgroup
- Each node tracks:
  - Its own group membership
  - k members of each of the other groups
- To look up a key, hash it and ask one or more of your contacts whether they know the value (see the sketch below)
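A minimal Python sketch of the scheme just summarized: keys hash to one of roughly √N affinity groups, tuples are replicated within the owning group, and a lookup asks a contact in that group. The hashing, the even spreading of nodes across groups, and the group_members/contact_in helpers are simplifying assumptions, not the Kelips implementation.

```python
import hashlib
import math
import random

def group_of(key, num_groups):
    """Hash a key to one of the affinity groups (stand-in for consistent hashing)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_groups

class KelipsNode:
    def __init__(self, node_id, group, num_groups):
        self.node_id = node_id
        self.group = group              # this node's own affinity group
        self.num_groups = num_groups
        self.tuples = {}                # key -> value replicas held by this group

    def put(self, key, value, group_members):
        """Store a tuple in the key's affinity group; within that group it is
        copied to every member (standing in for Kelips' cheap gossip replication)."""
        for member in group_members(group_of(key, self.num_groups)):
            member.tuples[key] = value

    def get(self, key, contact_in):
        """Look up a key: hash it and ask one contact in that group, one hop."""
        owner = group_of(key, self.num_groups)
        if owner == self.group:
            return self.tuples.get(key)
        return contact_in(owner).tuples.get(key)

# Hypothetical wiring: with N nodes, use about sqrt(N) affinity groups. For the
# sketch we spread nodes evenly across groups; real Kelips hashes node ids too.
N = 16
num_groups = int(math.isqrt(N))
nodes = [KelipsNode(f"node{i}", i % num_groups, num_groups) for i in range(N)]

def group_members(g):                   # everyone the gossip layer knows in group g
    return [n for n in nodes if n.group == g]

def contact_in(g):                      # "one of my k contacts in group g"
    return random.choice(group_members(g))

nodes[0].put("dot.com", "110", group_members)
print(nodes[5].get("dot.com", contact_in))   # -> "110", resolved in one hop
```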
Kelips summary
- O(√N) storage overhead, which is higher than for other DHTs
  - Same space overhead for the member list, the contact list, and the replicated data itself
  - A heuristic is used to keep contacts fresh and avoid contacts that seem to churn
- This buys us O(1) lookup cost
- And background overhead is constant
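A short note on where the O(√N) figure comes from, assuming k affinity groups over N nodes, c contacts kept per foreign group, and T resource tuples in total (the standard balancing argument; the constants are illustrative):

\[
S(k) \;\approx\; \underbrace{\frac{N}{k}}_{\text{own group's members}} \;+\; \underbrace{c\,(k-1)}_{\text{contacts}} \;+\; \underbrace{\frac{T}{k}}_{\text{replicated tuples}},
\qquad
\frac{dS}{dk} = 0 \;\Rightarrow\; k \approx \sqrt{\tfrac{N+T}{c}},
\]

so choosing k on the order of √N balances the terms and gives O(√N) state per node, while a lookup still needs only one hop to a contact in the key's group.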
Virtual Synchrony
- Last piece of the puzzle
- Outcome of a decade of DARPA-funded work; the technology core of:
  - AEGIS "integrated" console
  - New York and Swiss Stock Exchanges
  - French Air Traffic Control System
  - Florida Electric Power and Light System
Virtual Synchrony Model
(Diagram of the virtual synchrony model.)
Roles in QuickSilver?
- Provides a way for groups of components to:
  - Replicate data and synchronize
  - Perform tasks in parallel (like parallel database lookups, for improved speed)
  - Detect failures and reconfigure to compensate by regenerating lost functionality
(A sketch of group replication in this style follows below.)
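A heavily simplified Python sketch of what a virtually synchronous group gives such components: state replicated at every member, updates applied in the same order by everyone in the current view, and a view change on failure after which the survivors carry on. The class names and the single-process simulation are illustrative assumptions, not the actual group-communication API.

```python
class GroupView:
    """One membership view: a numbered epoch plus the list of live members."""
    def __init__(self, view_id, members):
        self.view_id = view_id
        self.members = list(members)

class ReplicatedCounter:
    """State replicated at every member; updates arrive via ordered multicast."""
    def __init__(self):
        self.value = 0
    def apply(self, delta):
        self.value += delta

class ProcessGroup:
    """Toy single-process simulation of a virtually synchronous group."""
    def __init__(self, members):
        self.view = GroupView(0, members)
        self.replicas = {m: ReplicatedCounter() for m in members}

    def multicast(self, delta):
        # Totally ordered delivery within the current view: every member
        # applies the same updates in the same order.
        for member in self.view.members:
            self.replicas[member].apply(delta)

    def member_failed(self, member):
        # View change: drop the failed member; the survivors keep consistent
        # copies of the state and can regenerate lost functionality from them.
        survivors = [m for m in self.view.members if m != member]
        self.view = GroupView(self.view.view_id + 1, survivors)

group = ProcessGroup(["p", "q", "r"])
group.multicast(+5)
group.member_failed("q")        # new view {p, r}; both survivors still hold 5
group.multicast(+2)
assert group.replicas["p"].value == group.replicas["r"].value == 7
```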
Replication: Key to understanding QuickSilver
(Diagram: Astrolabe, Bimodal Multicast, Kelips, and Virtual Synchrony arranged around a shared replication picture: a gossip protocol tracks membership, each member is hashed to an "affinity group" of roughly √N members, a query is routed to the appropriate group, and a gossip protocol replicates data cheaply.)
Metrics
- We plan to look at several:
  - Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
  - Scalability: graph performance and overheads as a function of scale, load, etc.
  - End-user power: implement JBI, sensor networks, and a data-center management platform
  - Total cost: with Raytheon, explore the impact on real military applications
- Under DURIP funding we have acquired a clustered evaluation platform
Our plan
- Integrate these core components
- Then:
  - Build a JBI layer over the system
  - Integrate Johannes Gehrke's data mining technology into the platform
  - Support scalable overlay multicast (Francis)
- Raytheon: teaming with us to tackle military applications, notably Navy
More information? www.cs.cornell.edu/Info/Projects/QuickSilver