Studying Black Holes on the Internet with Hubble
Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson
University of Washington
NSDI, April 2008
This work partially supported by

Global Reachability
• When an address is reachable from every other address
• Most basic goal of Internet, especially BGP
  • “There is only one failure, and it is complete partition” (Clark, Design Philosophy of the DARPA Internet Protocols)
• Physical path → BGP path → traffic reaches
• Black hole: BGP path, but traffic persistently does not reach

Does Internet give global reachability?
• From use, seems to usually work
• Can we assume the protocols just make it work?
• “Please try to reach my network 194.9.82.0/24 from your networks…. Kindly anyone assist.” (Operator on NANOG mailing list, March 2008)

Does Internet give global reachability?

Hubble System Goal
In real-time on a global scale, automatically monitor long-lasting reachability problems and classify causes

Problem Seen by Hubble on Oct. 8, 2007
[Figure, 5:09 a.m. to 5:11 a.m.: probe packets labeled Fr: X To: D “Ping?”, Fr: D To: X “Ping!”, Fr: Z To: D “Ping?”]
1. Target Identification – distributed ping monitors detect when the destination becomes unreachable

Problem Seen by Hubble on Oct. 8, 2007
[Figure, 5:13 a.m.]
1. Target Identification – distributed ping monitors
2. Reachability analysis – distributed traceroutes determine the extent of unreachability

Problem Seen by Hubble on Oct. 8, 2007
1. Target Identification – distributed ping monitors
2. Reachability analysis – distributed traceroutes
3. Problem Classification
   a) group failed traceroutes

Problem Seen by Hubble on Oct. 8, 2007
[Figure: ping exchanges among X, Y, Z, and D, annotated “D to Y works! Y to D fails!” and “D to Z works! Z to D fails!”]
1. Target Identification – distributed ping monitors
2. Reachability analysis – distributed traceroutes
3. Problem Classification
   a) group failed traceroutes
   b) spoofed probes to isolate direction of failure

Architecture: Detect Problem
• Ping prefix to check if still reachable
  • Every 2 minutes from PlanetLab
  • Report target after series of failed pings
• Maintain BGP tables from RouteViews feeds
  • Allows IP → AS mapping
  • Identify prefixes undergoing BGP changes as targets
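The "report target after series of failed pings" step can be pictured as a per-prefix counter of consecutive failures. The following is only a minimal sketch: the class name `PingMonitor`, the method `record`, and the threshold of 3 are hypothetical, since the slides do not say how long the series is.

```python
from collections import defaultdict

# Assumed value: the slides say only "series of failed pings".
FAILED_PINGS_BEFORE_REPORT = 3


class PingMonitor:
    """Tracks consecutive failed ping rounds per prefix (hypothetical helper)."""

    def __init__(self, threshold=FAILED_PINGS_BEFORE_REPORT):
        self.threshold = threshold
        self.consecutive_failures = defaultdict(int)

    def record(self, prefix, reachable):
        """Record one ping round; return True when the prefix becomes a target."""
        if reachable:
            # Any successful ping resets the failure streak.
            self.consecutive_failures[prefix] = 0
            return False
        self.consecutive_failures[prefix] += 1
        # Report exactly once, at the moment the streak hits the threshold.
        return self.consecutive_failures[prefix] == self.threshold
```

A scheduler would call `record` once per prefix per ping round (every 2 minutes in the deployment described above) and hand reported prefixes to the traceroute stage.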

Architecture: Assess Extent of Problem
• Traceroutes to gather topological data
  • Keep probing while problem persists
  • Every 15 minutes from 35 PlanetLab sites
• Analyze which traceroutes reach
  • BGP table to map addresses to ASes
  • Alias information to map interfaces to routers

Architecture: Classify Problem
To aid operators in diagnosis and repair:
• Which ISP contains problem?
• Which routers?
• Which destinations?

Architecture: Classify Problem
• Real-time, automated classification
• Find common entity that explains substantial number of failed traceroutes to a prefix
• Does not have to explain all failed traceroutes
• Not necessarily pinpointing exact problem

Classifying with Current Topology
Group failed/successful traceroutes by last AS, router
Example: Router problem
• No probes reach P through router R
• Some reach through R’s AS
• 28% of classified problems
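The router-problem rule above can be sketched as a set computation over traceroutes. This is an illustration under stated assumptions, not Hubble's actual classifier: the `Trace` record and the function `router_problems` are hypothetical names, and each hop is assumed to be already mapped to a (router, AS) pair.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    hops: list      # [(router_id, as_number), ...] in forward order
    reached: bool   # did the traceroute reach prefix P?


def router_problems(traces):
    """Return routers R such that no probe reaches P through R,
    yet some probe reaches P through R's AS."""
    ok_routers = {r for t in traces if t.reached for r, _ in t.hops}
    ok_ases = {a for t in traces if t.reached for _, a in t.hops}
    suspects = set()
    for t in traces:
        if t.reached or not t.hops:
            continue
        router, asn = t.hops[-1]  # last hop of a failed traceroute
        if router not in ok_routers and asn in ok_ases:
            suspects.add(router)
    return suspects
```

For instance, a failed trace ending at router R in AS 2, alongside a successful trace that crosses AS 2 through a different router, yields `{"R"}`.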

Classifying with Historical Topology
• Daily probes from PlanetLab to all prefixes
• Gives baseline view of paths before problems
Example: “Next hop” problem
• Paths previously converged on router R
• Now terminate just before R
• 14% of classified problems
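One way to picture the “next hop” rule: for each failed traceroute, look up where it stopped in that vantage point's baseline path and note which router used to come next. The sketch below assumes paths are plain lists of router identifiers keyed by vantage point; `next_hop_suspects` and `min_paths` are hypothetical names and parameters, not taken from the talk.

```python
from collections import Counter


def next_hop_suspects(baseline_paths, failed_paths, min_paths=2):
    """Return routers that several failed paths now stop just before.

    baseline_paths: {vantage: [router, ...]} from historical probes
    failed_paths:   {vantage: [router, ...]} from current failed traceroutes
    min_paths:      assumed minimum number of failed paths that must agree
    """
    counts = Counter()
    for vantage, failed in failed_paths.items():
        base = baseline_paths.get(vantage)
        if not failed or not base:
            continue
        last = failed[-1]
        if last in base:
            i = base.index(last)
            if i + 1 < len(base):
                # The router that used to follow the last responsive hop.
                counts[base[i + 1]] += 1
    return {router for router, n in counts.items() if n >= min_paths}
```

Requiring agreement across several failed paths mirrors the "paths previously converged on router R" condition; a single stalled path is not enough evidence on its own.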

Classifying with Direction Isolation
• Internet paths can be asymmetric
• Traceroutes only return routers on forward path
  • Might assume last hop is problem
  • Even so, require working reverse path
  • Hard to determine reverse path
• Isolate forward from reverse to test individually
• Without node behind problem, use spoofed probes
  • Spoof from S to check forward path from S
  • Spoof as S to check reverse path back to S
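The reasoning behind the two spoofed probes can be written as a small decision function. This sketch assumes a source S that cannot ping destination D and a helper vantage point V that can; the function name and its return labels are hypothetical, and no packets are actually sent here.

```python
def isolate_direction(forward_reply_seen, reverse_reply_seen):
    """Interpret the outcome of two spoofed probes.

    forward_reply_seen: S sent a probe to D with V's address as source
                        ("spoof from S"); did V receive D's reply?
    reverse_reply_seen: V sent a probe to D with S's address as source
                        ("spoof as S"); did S receive D's reply?
    """
    if forward_reply_seen and not reverse_reply_seen:
        # S -> D delivered the probe, so the failure is on the way back to S.
        return "reverse"
    if reverse_reply_seen and not forward_reply_seen:
        # D -> S delivered the reply, so the failure is on the way out of S.
        return "forward"
    if not forward_reply_seen and not reverse_reply_seen:
        # Neither direction carried traffic in these tests.
        return "both"
    # Both probes worked: the problem was not reproduced.
    return "none"
```

The key assumption is that V's own paths to and from D work, so a missing reply can be attributed to the S side of the exchange.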

Classifying with Direction Isolation
• Hubble deployment on RON employs spoofed probes
  • 6 of 13 RON nodes permit source spoofing
  • PlanetLab does not support source spoofing
Example: Multi-homed provider problem
• Probes through Provider B fail
• Some reach through Provider A
• Like Cox/USC
• 6% of classified problems

Architecture: Summary of Approach
• Synthesis of multiple information sources
  • Passive monitoring of route advertisements
  • Active monitoring from distributed vantage points
• Historical monitoring data to enable troubleshooting
• Topological classification and spoofing point at problem

Evaluation
Target Identification
• How much of the Internet does Hubble monitor?
Reachability Analysis
• What percentage of the various paths to a prefix does Hubble analyze?
Problem Classification
• How often can Hubble identify a common entity that explains the failed paths to a prefix?
• How often does spoofing isolate the failure direction?
For further evaluation, please see the paper.

How much does Hubble monitor?
Every 2 minutes:
• 110,000 prefixes
• 89% of Internet’s edge address space
• 92% of edge ASes
• Origin ASes for 99% of 14M BitTorrent users

What % of paths does Hubble monitor?
Compare with BGP paths of 447 RIPE peers (downhill ASes)
[Figure: AS hierarchy. Tier 1: AT&T, Sprint. Transit: Abilene, Cenic, Gigapop. Stub: UW, WSU, Intel, UT, UM, MIT]
• PlanetLab’s restricted size and homogeneity limit uphill coverage
• 90% of our failed traceroutes terminate within 2 AS hops of prefix’s origin

What % of paths does Hubble monitor?
Compare with BGP paths of 447 RIPE peers (downhill ASes)
[Figure: AS hierarchy. Tier 1: AT&T, Sprint. Transit: Abilene, Cenic, Gigapop. Stub: UW, WSU, Intel, UT, UM, MIT]
BGP ASes: { AT&T, Sprint, Gigapop, Cenic, Intel }
Also on Traceroutes: { Sprint, Gigapop, Cenic, Intel }
Coverage for Intel prefix: 4 of 5 downhill ASes = 80%
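The coverage figure on this slide is a set intersection divided by the size of the BGP-derived set. A minimal sketch using the slide's own AS names follows; the function name `downhill_coverage` is hypothetical.

```python
def downhill_coverage(bgp_ases, traceroute_ases):
    """Fraction of downhill ASes seen in BGP paths that also appear on traceroutes."""
    if not bgp_ases:
        raise ValueError("no BGP ASes to compare against")
    return len(bgp_ases & traceroute_ases) / len(bgp_ases)


# Sets taken from the Intel-prefix example on the slide.
bgp_ases = {"AT&T", "Sprint", "Gigapop", "Cenic", "Intel"}
traceroute_ases = {"Sprint", "Gigapop", "Cenic", "Intel"}
coverage = downhill_coverage(bgp_ases, traceroute_ases)
```

Here AT&T is the one downhill AS that the traceroutes miss, which gives 4 of 5.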

What % of paths does Hubble monitor?
Compare with BGP paths of 447 RIPE peers (downhill ASes)
[Figure: AS hierarchy. Tier 1: AT&T, Sprint. Transit: Abilene, Cenic, Gigapop. Stub: UW, WSU, Intel, UT, UM, MIT]
Overall for prefixes monitored by Hubble
• For >60% of prefixes, traverse ALL downhill RIPE ASes
• For 90% of prefixes, traverse more than half the ASes

How often can Hubble classify?
• 9 classes currently
  • Based on topology
  • Point to an AS and/or router
• Results from first week of February 2008
• Automatically classified 375,775/457,960 (82%) of problems as they occurred

How often does spoofing work?
When a RON path works and another does not:
• Isolate 68% of failures from spoofing sources
• 47% forward, 21% reverse

How long do black holes last?
• 3 week study starting September 17, 2007
• 31,000 black holes involving 10,000 prefixes
• 20% lasted at least 10 hours!
• 68% were cases of partial reachability

How long do black holes last?
• 3 week study starting September 17, 2007
• 31,000 black holes involving 10,000 prefixes
• 20% lasted at least 10 hours!
• 68% were cases of partial reachability
Partial reachability:
• Can’t be just hardware failure
• Configuration/policy

Other Measurement Results
• Can’t find problems using only BGP updates
  • Only 38% of problems correlate with RouteViews updates
• Multi-homing may not give resilience against failure
  • 100s of multi-homed prefixes had provider problems like Cox/USC, and ALL occurred on path TO prefix
• Inconsistencies across an AS
  • For an AS responsible for partial reachability, usually some paths work and some do not
• Path changes accompany failures
  • 3/4 router problems are with routers NOT on baseline path

Conclusions and Future Work
• Hubble: working real-time system
• Lots of reachability problems, some long lasting
• Baseline/fine-grained data enable problem classification
• Spoofing to isolate direction of path failures
http://hubble.cs.washington.edu
Uses iPlane, MaxMind, Google Maps

Thanks!
http://hubble.cs.washington.edu

Long term prospects for spoofing?
Support for spoofing:
• No complaints about our spoofed probes
• Can receive spoofed probes at PlanetLab
• PlanetLab support in future kernels?
• Router vendor talking to us about router support for measurements
Alternatives to spoofing:
• Traceroute servers behind problems
• End-hosts behind problems

Comparison to PlanetSeer [OSDI ’04]
• Most similar system
• Passively monitors CoDeeN clients, probes on anomalies
• Different and complementary analysis
PlanetSeer
• Clients that connected within 15 minutes
• 43% edge ASes (sum over 3 months)
• Not problems that prevent access to CDN
Hubble
• All prefixes every 2 minutes
• 92% edge ASes (every 2 minutes)
• All partial or complete reachability problems

Characteristics of Problems of Interest
• Routable prefix present in BGP tables
• Persistent through 2 rounds of probes
• Routing infrastructure failures
  • Not simply end-system/end-network failure
  • Judgments based on connectivity to origin AS
• Not simply source problem
  • Monitor if less than 90% of vantages reach
  • Based on 4 months of probes to 110K prefixes

How well does Hubble work?
Scale:
• 89% of the Internet’s edge address space
• 92% of edge ASes
• Origin ASes for 99% of 14M BitTorrent users
Effectiveness:
• Finds 85% of black holes, 95% of those that last at least 1 hr [compared to pervasive approach]
Cost:
• 5.5% of the probes required by pervasive approach

Does spoofing work?
When 3+ spoofing RON nodes fail to reach:
• Isolate all failed paths in 61% of cases
• 42% forward, 16% reverse, 3% mixed
• For 95% of cases, all paths isolate to same direction

Provider(s) Unreachable
• No probes reach even the provider(s) of Origin AS
• Probes fail in AS upstream
3% of classified problems (1-13% at any point in time)

Single-homed Origin AS Down
• No probes reach single-homed Origin AS
• Some reach its provider
17% of classified problems (4-37% at any point in time)

Multi-homed Origin AS Down
• No probes reach multi-homed Origin AS
• Some reach its provider(s)
9% of classified problems (2-30% at any point in time)

Provider AS Problem for Multi-Homed
• Probes through Provider B fail to reach P
• Some reach through Provider A
6% of classified problems (1-17% at any point in time)

Non-Provider AS Problem
• Probes through Non-Provider C fail
• Some reach through other ASes
17% of classified problems (1-37% at any point in time)

Router Problem on Known Path
• Last hop router R was seen on recent paths reaching P
• No probes reach P through R
• Some reach through R’s AS
[Figure label: Historical Traceroutes]
7% of classified problems (1-40% at any point in time)

Router Problem on New Path
• Last hop router R not seen on recent paths reaching P
• No probes reach P through R
• Some reach through R’s AS
21% of classified problems (1-40% at any point in time)

Next Hop Problem on Known Paths
• No last hop router or AS explains problem
• Paths previously converged on router R
• Now terminate just before R
14% of classified problems (1-39% at any point in time)

Topological classification results
Of ones we classify, overall share (range over time):
1. Provider(s) unreachable: 3% (1-13%)
2. Single-homed origin AS down: 17% (4-37%)
3. Multi-homed origin AS down: 9% (2-30%)
4. Provider AS problem for multi-homed origin AS: 6% (1-17%)
5. Non-provider AS problem: 17% (1-37%)
6. Router problem on old path: 7% (1-40%)
7. Router problem on new path: 21% (1-40%)
8. Next hop problem on known paths: 14% (1-39%)
9. Prefix unreachable: 22% (7-79%)