Toward Interactive Debugging for ISP Networks Chia-Chi Lin

Toward Interactive Debugging for ISP Networks
Chia-Chi Lin†, Matthew Caesar†, Jacobus Van der Merwe§
†University of Illinois at Urbana-Champaign
§AT&T Labs – Research

Debugging in ISP Networks
• Internet: most complex distributed system ever created
  – Leads to complex failure modes
  – Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging in ISP networks
  – Lack of visibility
  – High rates of change of protocols
  – Complex interdependencies
• These could cause devastating effects
  – Long-term outages, slow repair
  – February 2009 BGP outage

Interactive Debugging is Necessary
• Problems exist with fully automated techniques
  – Focus on detection rather than diagnosis
  – Modeling could be inexact
  – Logical and semantic errors seem to require human knowledge to solve
• Our position:
  – Humans must be “in-the-loop”
  – Tools are required to facilitate the process

A Scenario
[Figure: an ISP network serving a customer, shown alongside a cloned network. Callout: "Pause when the outage occurs"]

Our Vision
• Isolation of the operational network
  – Prevent diagnostic procedure from interfering with live network operation
  – Solution: virtualization technologies
• Reproducibility of network execution
  – Enable operator to replay execution, narrow in on rare events
  – Solution: instill a pseudorandom ordering over events and messages
• Interactive stepping through execution
  – Operator can slowly step through operation, trace messages
  – Solution: protocols providing tight control over distributed execution

The Architecture
[Figure: architecture diagram. Labels: Physical Network Infrastructure; Virtual Service Platforms; Virtual Service Nodes; Virtual Service Coordinator; Application 1: e.g., BGP; Application 2: e.g., OSPF; Physical Network; Debugging Node Coordinator; User (human troubleshooter)]

Key Challenge: Reproducibility
• Reproducibility simplifies interactive debugging
  – Can run multiple times, varying inputs to narrow down cause
  – When rare bug occurs, don’t need to wait for it to reoccur
• One option: generate comprehensive logs of all events
  – e.g., log all packet sends/receives, all data
  – Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
  – Starting with the same initial state will produce same execution
  – Make execution “pseudorandom” to explore different execution paths
  – Key challenge: how to eliminate randomness in large-scale software execution?
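The idea that a fixed starting state plus a pseudorandom (seeded) ordering yields the same execution every time can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `schedule` and `run` helpers, the event strings, and the toy state update are hypothetical and are not the scheme's actual implementation.

```python
import random

def schedule(events, seed):
    """Return events in an order fixed entirely by `seed` (hypothetical helper)."""
    canonical = sorted(events)      # discard the nondeterministic arrival order
    rng = random.Random(seed)       # private generator, independent of global state
    rng.shuffle(canonical)
    return canonical

def run(initial_state, events, seed):
    """Toy order-sensitive execution; returns the trace of (event, state) pairs."""
    state, trace = initial_state, []
    for event in schedule(events, seed):
        # Arbitrary update whose result depends on processing order.
        state = (state * 31 + sum(event.encode())) % 1_000_003
        trace.append((event, state))
    return trace

# Hypothetical events; the second run sees them arrive in a different order.
events = ["link A-B down", "LSA from C", "hello from D", "timer expiry"]
first = run(7, events, seed=42)
second = run(7, list(reversed(events)), seed=42)
```

With the same initial state and seed, `first` and `second` are identical traces even though the events arrived in a different order. Choosing a different seed selects a different, but equally repeatable, execution path.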

An Algorithm for Distributed, Reproducible Execution
• Approach:
  – Encapsulate software in virtual environment
  – Intercept software’s inputs/outputs, instill an ordering over them
  – Make sure that ordering is the same every time the software is run
• How this is done:
  – Network is run in lockstep fashion
  – On every cycle: messages from neighbors are buffered
  – Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet’s contents
  – Human sends “step” commands to move to next lockstep cycle
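The lockstep cycle described above can be sketched roughly as follows. This is a simplified, single-process illustration under stated assumptions: the `LockstepNetwork` class, its method names, the use of SHA-256 as the hash, and the tie-break on sender name are all hypothetical choices, not details taken from the slides.

```python
import hashlib
from collections import defaultdict

class LockstepNetwork:
    """Toy lockstep network: buffer messages, deliver them in hash order on step()."""

    def __init__(self, links):
        self.links = links                        # node -> list of neighbor nodes
        self.receive_buffer = defaultdict(list)   # node -> [(src, payload), ...]
        self.delivered = defaultdict(list)        # node -> messages handed to the app
        self.cycle = 0

    def send(self, src, dst, payload):
        if dst not in self.links[src]:
            raise ValueError(f"{src} and {dst} are not neighbors")
        self.receive_buffer[dst].append((src, payload))   # buffered, not delivered

    @staticmethod
    def _order_key(message):
        src, payload = message
        digest = hashlib.sha256(payload.encode()).hexdigest()
        return (digest, src)   # assumption: sender name breaks ties between equal payloads

    def step(self):
        """One operator 'step' command: deliver this cycle's buffered messages."""
        self.cycle += 1
        for node in sorted(self.receive_buffer):
            batch = sorted(self.receive_buffer[node], key=self._order_key)
            self.delivered[node].extend(batch)
        self.receive_buffer.clear()

# Hypothetical three-node topology; the two runs differ only in arrival order.
links = {"S": ["L", "K"], "L": ["S", "K"], "K": ["S", "L"]}
run_a, run_b = LockstepNetwork(links), LockstepNetwork(links)
run_a.send("S", "K", "update-1")
run_a.send("L", "K", "update-2")
run_b.send("L", "K", "update-2")
run_b.send("S", "K", "update-1")
run_a.step()
run_b.step()
```

After one step, node `K` has received the same messages in the same order in both runs, because the delivery order depends only on message contents and not on arrival timing.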

Improving Performance for the Production Network
• Problem: running application in lockstep fashion slows operation
  – Might be okay for some protocols (e.g., BGP)
  – Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events
  – Choose pseudorandom ordering in advance that is likely to happen anyway
  – Don’t buffer packets, deliver them immediately
  – If we guess wrong, roll back application to earlier state
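A minimal sketch of optimistic delivery with rollback follows. It assumes each message carries a precomputed ordering key and that the application state can be snapshotted cheaply; the `OptimisticNode` class and its fields are hypothetical, and the sketch ignores the harder problem of undoing messages the application has already sent to its neighbors.

```python
import copy

class OptimisticNode:
    """Deliver messages immediately; roll back when the chosen ordering is violated."""

    def __init__(self, apply_fn, initial_state):
        self.apply_fn = apply_fn     # (state, message) -> new state
        self.state = initial_state
        self.log = []                # (order_key, message, snapshot of state before it)
        self.rollbacks = 0

    def _apply(self, order_key, message):
        snapshot = copy.deepcopy(self.state)
        self.state = self.apply_fn(self.state, message)
        self.log.append((order_key, message, snapshot))

    def deliver(self, order_key, message):
        if not self.log or order_key >= self.log[-1][0]:
            self._apply(order_key, message)      # guess was right: no buffering needed
            return
        # Guess was wrong: find the first already-applied message that should come later.
        self.rollbacks += 1
        idx = next(i for i, entry in enumerate(self.log) if entry[0] > order_key)
        undone = self.log[idx:]
        self.state = undone[0][2]                # restore the state from before that message
        del self.log[idx:]
        self._apply(order_key, message)
        for key, msg, _ in undone:               # re-apply the undone messages in order
            self._apply(key, msg)

def append_message(state, message):
    """Toy application: the state is the list of messages processed so far."""
    return state + [message]

in_order = OptimisticNode(append_message, [])
for key, msg in [(1, "a"), (2, "b"), (3, "c")]:
    in_order.deliver(key, msg)

out_of_order = OptimisticNode(append_message, [])
for key, msg in [(1, "a"), (3, "c"), (2, "b")]:
    out_of_order.deliver(key, msg)
```

Both nodes end in the same state, but the second one pays for one rollback. This is why choosing an ordering that is likely to occur anyway matters for overhead.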

Example: Running the Lockstep Algorithm in a Cloned Network
[Figure: animation of nodes S, L, K, and A, each running an App with a Sending Buffer and a Receiving Buffer, alternating between a Transmission Phase and a Processing Phase. Callouts: "I finished processing."; "I am ready to transmit."]

Example: Live Algorithm in Production Network
The live algorithm does two things:
• Determine the ordering of events
• Roll back events violating the ordering
[Figure: backbone map with nodes Seattle, Salt Lake City, Los Angeles, Kansas City, Chicago, New York, Washington, Atlanta, and Houston. Callouts: "Packets from Seattle should come before those from Los Angeles"; "Pseudorandom ordering is violated!"]

Connecting the Two Algorithms
• We can run the production network using the live algorithm
  – Achieves a fixed ordering over messages
  – But how to actually debug it?
• Solution: replay using the lockstep algorithm
  – First let the production network run, checkpoint starting state
  – To debug, start lockstep algorithm with same starting state
  – Lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on problem, experiment by changing inputs, etc.
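The checkpoint-then-replay workflow can be sketched as below. This is an illustration under simplifying assumptions: `live_run`, `replay`, `order_key`, and the dictionary state are hypothetical, and the live run here simply sorts its inputs, whereas the live algorithm described earlier reaches the same ordering through optimistic delivery and rollback.

```python
import copy
import hashlib

def order_key(message):
    """Fixed ordering shared by the live run and the replay (assumed: SHA-256)."""
    return hashlib.sha256(message.encode()).hexdigest()

def apply(state, message):
    """Toy application step: record each processed message."""
    return {"seen": state["seen"] + [message]}

def live_run(state, events):
    """Production side: checkpoint the starting state, then run to completion."""
    checkpoint = copy.deepcopy(state)
    for message in sorted(events, key=order_key):
        state = apply(state, message)
    return checkpoint, state

def replay(checkpoint, events):
    """Debugging side: yield one (message, state) pair per operator step."""
    state = copy.deepcopy(checkpoint)
    for message in sorted(events, key=order_key):
        state = apply(state, message)
        yield message, state

# Hypothetical events observed in production.
events = ["withdraw prefix P", "link X-Y down", "announce prefix Q"]
checkpoint, final_state = live_run({"seen": []}, events)
steps = list(replay(checkpoint, events))

# Experiment: replay again from the same checkpoint with one input removed.
what_if = list(replay(checkpoint, events[:2]))
```

The last state in `steps` matches `final_state`, and because `replay` is a generator, an operator could also advance it one event at a time with `next()`.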

Simulation Settings
• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms plus an exponentially distributed random delay in (0, 0.5]
• Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events)
• Measure performance overheads of our approach
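One way to sample from a delay model of this shape is sketched below. The slide gives neither the rate of the exponential distribution nor the unit of the random component, so the mean of 0.25 ms and the use of rejection sampling to stay within (0, 0.5] are assumptions made for illustration only.

```python
import random

BASE_DELAY_MS = 1.0
MAX_JITTER_MS = 0.5
ASSUMED_MEAN_JITTER_MS = 0.25   # hypothetical: the slide does not state the rate

def sample_link_delay(rng):
    """Return 1 ms plus an exponential jitter, redrawn until it lies in (0, 0.5]."""
    while True:
        jitter = rng.expovariate(1.0 / ASSUMED_MEAN_JITTER_MS)
        if 0.0 < jitter <= MAX_JITTER_MS:
            return BASE_DELAY_MS + jitter

rng = random.Random(2009)        # seeded so the simulation itself is repeatable
delays = [sample_link_delay(rng) for _ in range(1000)]
```

Seeding the generator keeps the simulated delays identical across runs, which fits the reproducibility goal of the rest of the talk.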

Results – Overhead in Production Networks
• Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead
• Using delay-estimation optimization reduces overhead to 0.02x traffic inflation

Results – Response Time in Cloned Networks
• Low response time is beneficial to interactive debugging
• Response time is low for a variety of network sizes

Conclusion
• Humans are required to be “in-the-loop” to diagnose problems
• Our architecture is a first step towards interactive debugging
  – Builds on known techniques, e.g., virtualization technologies and distributed semaphores
  – Develops techniques to reproduce distributed executions
• Simulations on real-world events show that the scheme incurs low overhead


The State of the Art: Automated Techniques
• Logging observations
  – X-Trace, Friday, etc.
• Model checking
  – rcc, OD flow, etc.
• Debugging standalone programs
  – Coverity, AVIO, etc.

Optimized Ordering in the Production Network
• Goal: avoid rollbacks by selecting an ordering likely to happen anyway
  – Events separated by a long period fall into different groups, which makes ordering easy
  – Problem: some failure events are correlated
    • E.g., multiple overlay links sharing the same physical link
  – How to order events in the same group?
• Solution: if we know link delays, we can reliably estimate expected arrival of events
  – In practice we don’t know exact link delays
  – But we can estimate them
  – Can improve estimation by giving protocol messages high priority
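The grouping and delay-estimation steps above might look roughly like this. The sketch assumes each event carries a timestamp and an origin, that a fixed time gap separates groups, and that a per-origin delay estimate is available; `group_events`, `order_group`, the gap value, and all the numbers are hypothetical.

```python
def group_events(events, gap):
    """Split events into groups wherever consecutive timestamps differ by more than gap."""
    groups = []
    for event in sorted(events, key=lambda e: e["time"]):
        if groups and event["time"] - groups[-1][-1]["time"] <= gap:
            groups[-1].append(event)     # close in time: possibly correlated
        else:
            groups.append([event])       # long quiet period: start a new group
    return groups

def order_group(group, est_delay):
    """Order events within a group by estimated arrival time (event id breaks ties)."""
    return sorted(group, key=lambda e: (e["time"] + est_delay[e["origin"]], e["id"]))

# Hypothetical events (times in ms) and estimated delays from each origin to this node.
events = [
    {"id": "e1", "time": 0.0, "origin": "Seattle"},
    {"id": "e2", "time": 0.2, "origin": "Chicago"},
    {"id": "e3", "time": 50.0, "origin": "Houston"},
]
est_delay = {"Seattle": 3.0, "Chicago": 1.0, "Houston": 2.0}

groups = group_events(events, gap=5.0)
first_group_order = [e["id"] for e in order_group(groups[0], est_delay)]
```

Here `e1` and `e2` fall into the same group, and `e2` is ordered first because its estimated arrival (1.2 ms) precedes that of `e1` (3.0 ms). If the estimates are good, this is the order in which packets tend to arrive anyway, so few rollbacks are needed.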

Results – Storage in Production Network
• State required for rolling back packets is small and increases slowly with network size