Rethinking the Internet Architecture: Process, Architecture, and Troubleshooting

Scott Shenker
(joint work with many people, including Katerina Argyraki, Hari Balakrishnan, David Cheriton, Petros Maniatis, Ion Stoica, Mike Walfish)

Process
Why are we doing this, anyway?

Why the Clean Slate Mania?
• Internet in crisis?
  - lack of functionality not a crucial problem
  - lack of reliability is most important problem
• Research community in crisis?
  - little practical impact on architecture
  - narrowed focus, stopped asking the big questions
• NSF’s response: FIND and GENI
  - but not enough by itself...

You Can Lead an Academic to Architecture, but...
• Normal academic behavior won’t produce architecture
  - Publication requires differentiation and/or indifference
  - Architecture comes from critique and synthesis
    • work on ideas other than your own...
• Can’t just design, simulate and abandon
  - must also experiment and deploy...
  - ...then discuss and synthesize
• Process change harder than technical issues
  - adoption is much harder than both!

Some Thoughts on Architecture
material covered in several papers (apologies to those who have heard all this before)
not comprehensive architecture, many issues ignored

What’s Wrong with the Internet?
• Internet is everywhere, used for (almost) everything
• Main limiting factor seems to be lack of reliability
  - can’t do telesurgery, air traffic control, etc.
• Hard to improve reliability of packet delivery within current architecture
• Vulnerable to attacks, misconfigurations and failures

Packet Delivery Problems
• Access link failures
  - multihome
• Routing failures
  - security, policy, configuration, convergence, multipath, ...
• Congestion control failures
  - FQ, XCP, RCP, ...
• DoS
  - default-off, capabilities, filters, ...

Packet Delivery Problems
• Technical solutions are largely at hand
  - not perfect, but huge improvement over status quo
• No overarching synthetic architecture has emerged
  - symptom of process failure, or just too early?
• But packet delivery won’t be the focus of this talk...
  - because only experts see it as the major problem

Normal User’s Perspective
Other forms of failure dominate:
• out-of-date email addresses
• broken links
• misleading URLs and/or inauthentic data
• applications blocked by NATs, etc.
• email unusable or unreliable due to spam
• ...

Why? Three Important Changes...
1. Host-to-host → accessing data and services
2. End-to-end → middleboxes
3. Appropriate communication → spam

Three Important Changes
1. Host-to-host → accessing data and services
2. End-to-end → middleboxes
3. Appropriate communication → spam

Not just host-oriented apps...
• Of course, packets always flow from host to host
  - modulo middleboxes...
• But which host are the packets sent to?
• This is controlled by what hostname is used
• So adjusting to data-oriented apps involves reevaluating the Internet naming system
  - data, service specified by host/path pair

Problems with host/path names
• Data movement causes broken links
  - names should be persistent
• Replication unnecessarily difficult
  - Akamai expensive, and can’t replicate at object granularity
  - Google, P2P, etc. do this now...
• DNS names lead to legal/political battles
  - increasingly important, witness ICANN debacle
• Names don’t facilitate authentication
  - can’t easily verify that data originated with intended source

Fix #1: Name Data/Services Directly
• Network locations: IP addresses
• Hosts: endpoint identifiers (EIDs)
• Data/Services: service identifiers (SIDs)
  - direct naming supports fine-grained migration/replication
• User-level descriptors:
  - search terms
  - canonical names (AOL keywords)
  - ...

Fix #2: Use Names in Appropriate Layer
[Figure: protocol stack showing which name each layer uses]
• Application: user-level descriptors (e.g., search); app-specific search/lookup returns SIDs
• App session: resolves SID to EID, opens transport conns
• Transport: binds to EID (HIP), resolves EID to IP
• IP
Packet structure: IP hdr | EID | TCP | SID | …
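
The layering on this slide can be sketched as two lookups done at two different layers. This is a minimal illustration, not the authors' implementation: the class `LayeredResolver`, its dictionary-backed tables, and the example identifiers are all hypothetical stand-ins for real resolution infrastructure.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredResolver:
    """Toy lookup tables standing in for the resolution infrastructure (assumed)."""
    sid_to_eid: dict = field(default_factory=dict)  # service -> endpoint hosting it
    eid_to_ip: dict = field(default_factory=dict)   # endpoint -> current location

    def resolve_sid(self, sid: str) -> str:
        # App session layer: which endpoint currently hosts this service?
        return self.sid_to_eid[sid]

    def resolve_eid(self, eid: str) -> str:
        # Transport layer: where is that endpoint attached right now?
        return self.eid_to_ip[eid]

    def connect(self, sid: str) -> dict:
        # Each layer binds to its own kind of name; all three travel in the packet.
        eid = self.resolve_sid(sid)
        ip = self.resolve_eid(eid)
        return {"ip": ip, "eid": eid, "sid": sid}


resolver = LayeredResolver(
    sid_to_eid={"sid:report": "eid:hostA"},
    eid_to_ip={"eid:hostA": "192.0.2.10", "eid:hostB": "198.51.100.20"},
)
before = resolver.connect("sid:report")

# Migrating the service to another host changes one mapping; the SID is unchanged.
resolver.sid_to_eid["sid:report"] = "eid:hostB"
after = resolver.connect("sid:report")
```

Because the application holds only the SID, the migration is invisible to it: the same name resolves to a new endpoint and then to a new address.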

Fix #3: Names Should be Flat!
0xf436f0ab527bac9e8b100afeff394300
• A name can be persistent if and only if it doesn’t embed any mutable information about its referent
• Flat names embed no information, so they can be used to persistently name anything
  - Enables inter-domain migration, etc.
• Once you have a large flat namespace, you never need other global handles
  - no distinction between EIDs, SIDs, etc.

Disadvantages of Flat Names
• Hard to resolve
• No local control
• No locality
• Not human friendly
All can be handled, but flat names do require new resolution infrastructure

Fix #4: Make Names Self-certifying
• Name = Hash(pubkey, salt)
• Value = <pubkey, salt, data, signature>
  - can verify name related to pubkey and pubkey signed data
• Can receive data from caches or other 3rd parties without worry
  - much more opportunistic data transfer
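
The two checks on this slide (name is bound to pubkey, pubkey signed the data) can be sketched as follows. The slide says only "Hash", so SHA-256 and the length-prefixed encoding are assumptions. Python's standard library has no public-key signatures, so the signature check is passed in as a caller-supplied function; `check_signature` is a hypothetical hook, not a real API.

```python
import hashlib
from typing import Callable, NamedTuple


class Value(NamedTuple):
    pubkey: bytes
    salt: bytes
    data: bytes
    signature: bytes


def make_name(pubkey: bytes, salt: bytes) -> str:
    # Name = Hash(pubkey, salt). SHA-256 is an assumed choice of hash.
    h = hashlib.sha256()
    # Length-prefix the key so different (pubkey, salt) splits cannot collide.
    h.update(len(pubkey).to_bytes(4, "big"))
    h.update(pubkey)
    h.update(salt)
    return h.hexdigest()


def verify(name: str, value: Value,
           check_signature: Callable[[bytes, bytes, bytes], bool]) -> bool:
    # Check 1: the name really is derived from the key shipped with the value.
    if make_name(value.pubkey, value.salt) != name:
        return False
    # Check 2: that key signed the data (delegated to a real signature scheme).
    return check_signature(value.pubkey, value.data, value.signature)
```

Since both checks use only the name and the value itself, the receiver needs no trust in whoever delivered the value, which is what makes data from caches and other third parties safe to accept.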

Proposed Naming System
• Flat, self-certifying identifiers for all entities
• Used in “layered” fashion so that each protocol binds to the correct level of abstraction
• Names are persistent, verifiable, and support easy replication and migration
• Requirement: industrial-strength flat name resolver
  - names, key revocation (later, another use)

Three Important Changes
1. Host-to-host → accessing data and services
2. End-to-end → middleboxes
3. Appropriate communication → spam

Not just end-to-end...
• Middleboxes provide important functionality
  - NATs, firewalls, proxies, caches, app accelerators, etc.
• But processing between endpoints violates pure end-to-end religion, and causes many practical problems
  - e.g., NATs interfere with many applications
• How can architecture support middleboxes better?
  - eliminate problems and make them architecturally sound

Delegation via Resolution
• Names usually resolve to “location” of entity
• Delegation principle: A network entity should be able to direct resolutions of its name not only to its own location, but also to chosen delegates
• Semantics:
  - “where am I” → “where should packets be sent to reach me”
• This allows packets to be directed towards middleboxes in a clean and coherent manner

Architecturally-Sound Middleboxes
[Figure: current (bad) middleboxes contrasted with a delegation example using a firewall. Recoverable labels: dest EID d; mapping ipd, ipf; packet structure IP hdr (ipf), EID hdr (d), TCP hdr.]
• Delegate can be anywhere, not necessarily on path
• Can apply to app-layer middleboxes
• Including SID, EID in packet is crucial
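
One reading of the firewall example is sketched below: the destination directs resolution of its EID to the firewall's address, and the firewall uses the EID carried in the packet to find the real destination. The `Packet`, `Firewall`, and `send` names, the addresses, and the allow policy are all hypothetical; the slides do not specify an implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Packet:
    dst_ip: str    # where the network delivers the packet
    dst_eid: str   # who the packet is actually for
    payload: bytes


class Firewall:
    """A delegate: off-path is fine, since packets are addressed to it explicitly."""

    def __init__(self, ip: str, allow: Callable[[Packet], bool]):
        self.ip = ip
        self.allow = allow
        self.real_location: Dict[str, str] = {}  # EID -> address behind the firewall

    def protect(self, eid: str, real_ip: str) -> None:
        self.real_location[eid] = real_ip

    def forward(self, pkt: Packet) -> Optional[Packet]:
        # The EID in the packet tells the delegate whom it is acting for.
        if pkt.dst_eid not in self.real_location or not self.allow(pkt):
            return None
        return replace(pkt, dst_ip=self.real_location[pkt.dst_eid])


def send(resolution: Dict[str, str], eid: str, payload: bytes) -> Packet:
    # The sender resolves the EID and addresses the packet to whatever comes back.
    return Packet(dst_ip=resolution[eid], dst_eid=eid, payload=payload)


fw = Firewall("203.0.113.1", allow=lambda p: b"attack" not in p.payload)
fw.protect("eid:d", "192.0.2.7")
resolution = {"eid:d": fw.ip}  # d delegates: "send packets for me to the firewall"

delivered = fw.forward(send(resolution, "eid:d", b"hello"))
blocked = fw.forward(send(resolution, "eid:d", b"attack"))
```

Without the EID in the packet the firewall would see only its own address as the destination and could not tell which protected host the sender meant.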

Possible Impacts
• More general services: more complex services (like Riverbed, transcoding, etc.) can fit within framework
• Remote services, not boxes: since middleboxes need not be on-path, services like firewalls, virus-scanners, etc. can be provided as remote services
• Rethinking transport: with intermediaries between endpoints, basic notion of the transport layer should be rethought, combining ideas from DTN, DOT, etc.

Three Important Changes
1. Host-to-host → accessing data and services
2. End-to-end → middleboxes
3. Appropriate communication → spam

Restraining Usage
• Can’t be at packet level, must be app-dependent
• But don’t want separate mechanism for each app
  - Email, IM, wiki, etc.
• Proposal: quota system
  - quotas allocated in application-dependent manner
  - quotas enforced through single mechanism
    • stamp for each usage, canceled through mechanism
    • see NSDI 06 paper for details...
• Uses flat name resolution
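
The split between application-dependent allocation and a single enforcement mechanism can be sketched as below. This is an illustration under assumptions, not the design from the NSDI paper: the `Allocator` and `Enforcer` classes are hypothetical, stamps here are plain tuples, and a real system would need unforgeable stamps and a distributed store.

```python
class QuotaExceeded(Exception):
    pass


class Allocator:
    """Application-dependent side: decides how many stamps each sender gets."""

    def __init__(self, quota_per_sender: int):
        self.quota = quota_per_sender
        self._issued = {}

    def issue(self, sender: str) -> tuple:
        n = self._issued.get(sender, 0)
        if n >= self.quota:
            raise QuotaExceeded(sender)
        self._issued[sender] = n + 1
        return (sender, n)  # assumed stamp format; real stamps would be signed


class Enforcer:
    """Application-independent side: one mechanism that cancels stamps."""

    def __init__(self):
        self._canceled = set()

    def cancel(self, stamp: tuple) -> bool:
        # True if the stamp was fresh (and is now spent), False if it was reused.
        if stamp in self._canceled:
            return False
        self._canceled.add(stamp)
        return True
```

The enforcer never looks inside a stamp, so the same instance could serve email, IM, and wiki traffic; only the allocator encodes per-application policy.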

Summary: Other Forms of Failure...
• broken links and pointers: persistent names
• inauthentic data: self-certifying names
• applications blocked by NATs, etc.: delegation
• spam and other clutter: quota enforcement
No change to IP or routers!

Troubleshooting and Debugging
because things inevitably fail...

User’s Perspective
• Want to know who to yell at
  - identify responsible entity (at appropriate granularity)
• Want their complaints to be taken seriously
  - provide credible and actionable reports
• Want the problem fixed, now
  - detailed diagnostic tools
  - this is traditional focus of troubleshooting

User’s Perspective
• Want to know who to yell at
  - identify responsible entity (at appropriate granularity)
• Want their complaints to be taken seriously
  - provide credible and actionable reports
• Want the problem fixed
  - detailed debugging tools
  - this is traditional focus of work in this area

Vision
• Incorporate coherent set of monitoring tools into architecture that:
  - record necessary information
  - process information to answer relevant questions
• Key points:
  - not just statistics (e.g., Netflow), but answers
  - focus broader than just detailed diagnostics
• Three examples

Ex. #1: Monitoring ISPs
• Monitor boxes on peering links record packet digests
  - no internal information revealed
• Boxes exchange information to determine where packets are dropped and/or delayed
• Information ends up at source ISP or end user
• Overhead: ~2-4% of packet bandwidth
• Can be applied within enterprises, etc.
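
The idea of comparing digests between monitor boxes can be sketched as follows. Everything here is an assumption made for illustration: the `Monitor` class, the truncated SHA-256 digest, and hashing the whole packet (a real monitor would hash only fields that stay constant in transit, and would also record timing to detect delay).

```python
import hashlib
from typing import Dict, List, Tuple


def digest(packet: bytes) -> bytes:
    # A short digest identifies the packet without revealing its contents.
    return hashlib.sha256(packet).digest()[:8]


class Monitor:
    """A box on a peering link that records digests of what it sees."""

    def __init__(self, name: str):
        self.name = name
        self.seen = set()

    def observe(self, packet: bytes) -> None:
        self.seen.add(digest(packet))


def locate_drops(path: List[Monitor],
                 sent: List[bytes]) -> Dict[bytes, Tuple[str, str]]:
    """For each lost packet, report the first segment where it disappeared."""
    drops = {}
    for pkt in sent:
        d = digest(pkt)
        for upstream, downstream in zip(path, path[1:]):
            if d in upstream.seen and d not in downstream.seen:
                drops[pkt] = (upstream.name, downstream.name)
                break
    return drops


a, b, c = Monitor("A"), Monitor("B"), Monitor("C")
for m in (a, b, c):
    m.observe(b"packet-1")
for m in (a, b):            # packet-2 is lost between B and C
    m.observe(b"packet-2")
report = locate_drops([a, b, c], [b"packet-1", b"packet-2"])
```

The monitors exchange only digests, so the segment between two boxes is blamed without either side exposing what happens inside its own network.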

Ex. #2: Multilayer Tracing
• Traceroute is useful, but limited to IP
• XTrace (just started) is a generalized version:
  - operates at multiple layers
  - follows recursive packet generation (DNS queries, etc.)
  - can implement policies about when to respond
• Requirements:
  - layer must be able to handle and propagate metadata
  - module on box to intercept and report on packets
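
The two requirements (propagate metadata, report on operations) can be illustrated with a toy stack in which every layer passes a parent operation ID down and sideways. This is not XTrace's API; the `Trace` class and the layer functions are hypothetical and only show how the reports from different layers can be stitched into one tree.

```python
import itertools
from typing import List, Optional, Tuple


class Trace:
    """Collects reports for one task; each report names its parent operation."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.reports: List[Tuple[int, Optional[int], str, str]] = []
        self._next_id = itertools.count(1)

    def report(self, parent: Optional[int], layer: str, event: str) -> int:
        op_id = next(self._next_id)
        self.reports.append((op_id, parent, layer, event))
        return op_id


def http_get(trace: Trace, host: str) -> None:
    op = trace.report(None, "HTTP", "GET " + host)
    dns_query(trace, op, host)      # recursive work caused by the request
    tcp_connect(trace, op, host)    # the next layer down


def dns_query(trace: Trace, parent: int, host: str) -> None:
    op = trace.report(parent, "DNS", "query " + host)
    ip_send(trace, op, "resolver")


def tcp_connect(trace: Trace, parent: int, host: str) -> None:
    op = trace.report(parent, "TCP", "connect " + host)
    ip_send(trace, op, host)


def ip_send(trace: Trace, parent: int, dest: str) -> None:
    trace.report(parent, "IP", "packet to " + dest)


trace = Trace("task-1")
http_get(trace, "example.org")
```

Following the parent links from any IP-level report leads back to the HTTP request that caused it, including the packets generated by the DNS query.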

Ex. #3: Distributed Debugging
• When bugs occur in operation, it can be extremely difficult to locate and reproduce
• We are developing liblog, a log-and-replay debugging tool (early) that is always turned on
• Lots of log-and-replay debuggers, ours meets a special set of requirements... (not described here)

Logging and Replay
1. Each process logs its execution to a local file
2. Logs are collected at central location and replayed
[Figure: app/liblog instances on Node 1, Node 2, and Node 3 write Log 1, Log 2, and Log 3; a Replay Node runs a GDB console with GDB instances attached to the replayed app/liblog processes.]
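
The log-and-replay idea can be shown in miniature: record the result of every nondeterministic call during the live run, then feed the recorded results back during replay so the application follows the same path. This sketch does not reflect liblog's internals; the `Recorder` and `Replayer` classes and the toy `app` are hypothetical.

```python
import random


class Recorder:
    """Live run: perform each nondeterministic call and log its result."""

    def __init__(self):
        self.log = []

    def call(self, fn, *args):
        result = fn(*args)
        self.log.append(result)
        return result


class Replayer:
    """Replay run: skip the real call and return the logged result instead."""

    def __init__(self, log):
        self._results = iter(log)

    def call(self, fn, *args):
        return next(self._results)


def app(env) -> int:
    # All nondeterminism goes through env, so the run can be reproduced.
    a = env.call(random.randint, 1, 6)
    b = env.call(random.randint, 1, 6)
    return a * 10 + b


recorder = Recorder()
live_result = app(recorder)
replayed_result = app(Replayer(recorder.log))
```

Because the replayed run is deterministic, it can be stepped through in a debugger as many times as needed, which is what makes a bug seen once in operation reproducible.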

Extensions
• liblog generates too much data
  - hard to sift through for large systems
• Next step: setting global watchpoints and breakpoints
• Can specify in terms of general expressions (Python)
  - routing loops, state inconsistencies, etc.
• No operational experience yet
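
A global watchpoint for routing loops might be written as a predicate over the combined state of all nodes. The sketch below assumes the replay tool can expose each node's next-hop table as a dictionary; that interface, the function name `find_routing_loop`, and the table layout are all hypothetical.

```python
from typing import Dict, List, Optional


def find_routing_loop(next_hop: Dict[str, Dict[str, str]],
                      src: str, dst: str) -> Optional[List[str]]:
    """Follow next hops from src toward dst; return the looping nodes, if any.

    next_hop maps node -> {destination -> neighbor}.
    """
    visited: List[str] = []
    node: Optional[str] = src
    while node != dst:
        if node in visited:
            # Revisiting a node means the tables form a cycle for this destination.
            return visited[visited.index(node):]
        visited.append(node)
        node = next_hop.get(node, {}).get(dst)
        if node is None:
            return None  # no route at all: a black hole, not a loop
    return None


tables = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "B"},   # inconsistent state: C sends traffic for D back to B
}
loop = find_routing_loop(tables, "A", "D")
```

No single node's state is wrong in a way it can detect locally; the loop only shows up when the tables are examined together, which is why the watchpoint has to be global.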

Troubleshooting and Debugging
• Automated end-user reporting tools would be useful to both users and ISPs
  - lots of low-hanging fruit
• Not clear ISPs will take the lead on troubleshooting
  - ISPs may not be eager to admit fault
  - but they should be eager to reduce phonebank expenses
• Experience needed with distributed debugger in networking context

Summary
• Biggest challenge is to get community talking to each other rather than past each other
• Reliability more pressing than functionality
  - have tools to provide better packet delivery
  - then considered wider set of failure modes
  - can handle without IP/router involvement
• Troubleshooting should be part of “architecture”
  - nowhere near coherent yet
  - looking for basic building blocks