
Unlocking Systems and Data: The Key to Network Management Innovation
Charles Kalmanek, Internet & Network Systems Research V.P., AT&T Labs-Research
2006 IEEE/IFIP Network Operations and Management Symposium

Vision for IP Network Management
Goal: A robust, global, multi-service IP/MPLS network
[Figure: control loop. Design goals and policies feed a network-wide model (auditing, “what-if,” etc.). The network is measured (topology, configuration, workflow, offered traffic, routing, fault) and controlled (provisioning, changes to the network).]
Approach
• Manage the entire network, not network elements
• Instrument the network, rely on direct correlation of real data
• Model interactions to predict the effects of actions in advance
• Automate as much as possible, audit results
CRK 12/6/2005

Why It’s Hard
Scale & Diversity Challenges
• Large, distributed networks (100,000s of NEs)
• Complex, diverse building blocks
• Ongoing maintenance, spanning multiple time zones
• Fragile IP network control planes
• Complex software systems on top
Constant change
• Architectural change, new features & services, new protocols…
• Customers join, leave, change/upgrade service
• Network “events”: failures, migrations, upgrades, etc.
Measurement and data challenges
• Inadequate implementation of the basics
• Data often locked up in NM systems “smokestacks”
• Diverse data sources, with highly variable data quality
• Limited direct measurements of causality
• Inadequate ability to trace events across the network

Tier-1 Service Provider Network
[Figure: PoPs containing P and PE routers, interconnected by intercity DWDM systems (OC-48 or OC-192). Customer-facing PE interfaces reach CE/CPE routers in customer networks through metro LEC access networks.]
Rough stats:
• 100s of offices
• 100s of Ps, 1000s of PEs, 10000s of CEs
• 100,000s of transport facilities
Legend:
• PoP: Point-of-Presence
• P: Backbone (core) Router
• PE: Provider Edge Router
• CE: Customer Edge Router
(Enterprise customer networks rival ISPs in size & complexity!)

Unlocking Network Data
Measurement data is essential to running the network
• Marketing and customer acquisition
• Network and customer care
• Network engineering and capacity management
• Research to improve / evolve the network
If you don’t have the data, you can’t design, manage, secure, or improve the network
If you can’t evolve systems, you can’t evolve the network
Example 1: Fault/performance management
Example 2: Router Provisioning

Network Troubleshooting
Goals
• Automate the entire life cycle of event detection and repair for every performance impacting event
  – Detect, Localize, Diagnose, Fix, Verify
• Drive short and long term network, operations & systems improvements
  – Use forensics to reveal chronic events
Systems and Tools
• Active and passive performance monitoring
  – Each data source has its unique value and limitations
• Maintenance and troubleshooting require correlation across multiple data sets
  – Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …

Example: Cross-Layer Troubleshooting
IP composite link: multiple SONET links combined together
• Example: 5 OC-192s
• IP routing does not take bandwidth into account
  – On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?
[Figure: logical IP link between LA and NY offered 3 units of traffic; with 1 unit of capacity remaining, the link is congested.]
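The question on this slide, when to take traffic off a composite link as a function of remaining capacity, can be sketched as a simple threshold check. This is only an illustration, not the mechanism described in the talk: the class `CompositeLink`, the function `should_cost_out`, the capacity "units," and the 80% utilization ceiling are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """A logical IP link built from equal-capacity members (hypothetical model)."""
    name: str
    member_capacity: float  # capacity of one member, in arbitrary units
    members_total: int
    members_up: int

    @property
    def remaining_capacity(self) -> float:
        # Only members that are currently up contribute capacity.
        return self.member_capacity * self.members_up


def should_cost_out(link: CompositeLink, offered: float,
                    max_utilization: float = 0.8) -> bool:
    """Return True when offered traffic exceeds the usable remaining capacity.

    max_utilization is an assumed engineering ceiling, not a value from the talk.
    """
    usable = link.remaining_capacity * max_utilization
    return offered > usable


# Mirrors the slide's picture: 3 units of traffic, 5 members of 1 unit each.
healthy = CompositeLink("LA-NY", member_capacity=1.0, members_total=5, members_up=5)
degraded = CompositeLink("LA-NY", member_capacity=1.0, members_total=5, members_up=1)

print(should_cost_out(healthy, offered=3.0))   # False: 4.0 usable units
print(should_cost_out(degraded, offered=3.0))  # True: 0.8 usable units
```

Because IP routing itself ignores bandwidth, a check like this would have to live in the management system, fed by both transport-layer fault data and traffic measurements.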

Example: Cross-Layer Troubleshooting (cont.)
Detect:
• Packet loss from active measurements for a set of PE pairs
Localize/Diagnose:
• Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
• Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)
Diagnose:
• Congestion due to composite link component flapping
Fix:
• Short term: “cost out” the link
• Permanent: repair failing components
Verify:
• Packet loss alerts disappear
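A minimal sketch of the localize step, combining the temporal and spatial correlation described on this slide. The record types `LossAlert` and `FlapEvent`, the `paths` mapping (standing in for route-monitoring output), and the 60-second window are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossAlert:
    pe_pair: tuple      # (source PE, destination PE)
    timestamp: float    # seconds


@dataclass(frozen=True)
class FlapEvent:
    link: str           # identifier of a composite link member
    timestamp: float    # seconds


def correlate(alerts, flaps, paths, window_s=60.0):
    """Map each flapping link to the loss alerts it plausibly explains.

    An alert is attributed to a flap only if both hold:
      temporal: the two events lie within window_s of each other
      spatial:  the flapping link is on the PE pair's current path
    """
    suspects = {}
    for alert in alerts:
        route = paths.get(alert.pe_pair, [])
        for flap in flaps:
            close_in_time = abs(alert.timestamp - flap.timestamp) <= window_s
            on_path = flap.link in route
            if close_in_time and on_path:
                matched = suspects.setdefault(flap.link, [])
                if alert not in matched:
                    matched.append(alert)
    return suspects


paths = {
    ("PE1", "PE2"): ["LA-NY:member3", "NY-DC"],
    ("PE3", "PE4"): ["CHI-DAL"],
}
alerts = [LossAlert(("PE1", "PE2"), 100.0), LossAlert(("PE3", "PE4"), 105.0)]
flaps = [FlapEvent("LA-NY:member3", 90.0), FlapEvent("LA-NY:member3", 110.0)]

result = correlate(alerts, flaps, paths)
print(sorted(result))  # ['LA-NY:member3']
```

The PE3-PE4 alert falls inside the time window but is not attributed to the flap, because its path does not traverse the flapping member. This is why the slide pairs temporal with spatial correlation.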

Example: Chronic Control Plane Outage
Detect
• Active performance monitoring shows high loss at a PE
Localize/Diagnose
• Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
  – OSPF sessions flap during customer provisioning on some PE platforms
• Diagnosis: BGP starves OSPF processing on this class of PEs
Fix
• Short-term: process changes to control provisioning on this class of PE
• Long-term: better OSPF and BGP process scheduler for PE
Verify
• High loss disappears at the PE

Data Distribution Problem
• Many, diverse data feeds required
• Labor-intensive and error-prone to create and maintain each feed
• Ad-hoc development to convert, copy, encrypt, & ingest the data
• Several groups with business critical functions need network data
• Stringent delivery requirements (security, timeliness, reliability)
Network data
• Network inventory
• Route monitors, BGP tables
• SNMP link utilization & faults
• Syslog info (status, health, events)
• Active path monitoring
• Netflow
• Other: workflow, VoIP, transport
Customer data
• Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line, …), router info (hardware, software version)
• Trouble tickets
• Performance and SLA reports
• Service orders

Data Correlation Framework
Flexible data/systems architecture
• Pluggable data-source specific collectors
• Data distribution bus
• Common real time and archival data store
• Variety of network management applications on top
Evolving domain knowledge
• It’s an iterative process: exploratory data mining (EDM)
  – Apply statistical tools, visualization, “hunches,” …
  – Export results to “case manager” for analysis
Diagnosis engines
• Near real-time drill down, forensics
• Temporal and spatial event clustering
• Scalable statistical mechanisms to uncover correlations

Data/Systems Architecture
[Figure: layered architecture. Collectors at the bottom (Syslog Collector, Active L3 Probe, Control Plane Collector, CDR, Netflow, SNMP Collector) read from the network and feed the Data Distribution Bus (DDB). The bus feeds the Data Store Component (DSC) and the applications (real-time surveillance, end-to-end reporting, network management planning), which are reached through a Customer Portal, an Internal Portal, a Topology I/F, and GUIs.]
Data Distribution Bus
• Publish/subscribe system handling all incoming data feeds
• Supports multiple transport options, normalizes data to “standard” formats
• Reliably delivers data to consumers
Data Store Component
• Efficient long-term storage of operational OA&M data
• Automatic generation of schema, loading scripts, access scripts, data aging, allowing non-DBAs to manage warehouse
Network data is available to multiple applications, allowing auditing, correlation, reporting, EDM, …
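The publish/subscribe and normalization roles of the bus can be sketched in a few lines. This is a toy in-process illustration under stated assumptions, not the DDB itself: the class and method names, the record fields, and the "host severity message" syslog layout are all invented for the example, and it makes no attempt at the reliable delivery or transport options the slide mentions.

```python
from collections import defaultdict


class DataDistributionBus:
    """Toy in-process publish/subscribe bus (hypothetical API)."""

    def __init__(self):
        self._normalizers = {}                 # feed name -> normalizer function
        self._subscribers = defaultdict(list)  # feed name -> consumer callbacks

    def register_feed(self, feed, normalizer):
        """Attach the function that converts raw input to a standard record."""
        self._normalizers[feed] = normalizer

    def subscribe(self, feed, callback):
        self._subscribers[feed].append(callback)

    def publish(self, feed, raw):
        """Normalize once, then hand the same record to every subscriber."""
        record = self._normalizers[feed](raw)
        for callback in self._subscribers[feed]:
            callback(record)
        return len(self._subscribers[feed])


def normalize_syslog(raw):
    # Assumed input layout: "<host> <severity> <free-text message>".
    host, severity, message = raw.split(" ", 2)
    return {"source": "syslog", "host": host,
            "severity": severity, "message": message}


bus = DataDistributionBus()
bus.register_feed("syslog", normalize_syslog)

data_store = []    # stands in for the Data Store Component
surveillance = []  # stands in for a real-time application
bus.subscribe("syslog", data_store.append)
bus.subscribe("syslog", surveillance.append)

delivered = bus.publish("syslog", "pe1.nyc WARN OSPF neighbor down")
print(delivered, data_store[0]["host"])  # 2 pe1.nyc
```

The point of the design is that a new application subscribes to an existing feed instead of requiring another hand-built, point-to-point data feed.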

Router Provisioning
Goal: translate service intent to network reality
• Get hardware & circuits to the right place at the right time
• Access & update network inventory databases
• Configure routers to establish and verify the service
Challenges
• Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
• Low level configuration languages, no abstraction layer, multiple ways of achieving the same thing
• Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
• Commercial tools offer limited customizability, only solve pieces of the problem
• Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)

Configuration File Analysis
[Figure: polled router configurations pass through parsers into a low level standard form (tables). Algorithms, rules and queries encoding domain expertise (e.g., ACL analysis) run over the tables together with the customer/network database to produce discords, which are fixed. The tables also serve auditing and provisioning queries.]
Detect/Fix Discords
• Non-compliance to architectural intent
  – e.g., errors in route-maps for VPNs crossing routing domains
• Config time-bombs
  – e.g., gaps in the ACL perimeter defense
Additional Benefits
• Technology Assessment, Bootstrapping automation, Decision Support
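The parse-to-tables-then-query idea can be illustrated with a small audit for one kind of discord, a gap in the ACL perimeter. This is a sketch under assumptions, not the tool described in the talk: the sample configuration, the `interface` table layout, the convention of marking perimeter interfaces with a `customer-facing` description, and the function names are all invented here. Only the IOS-style `interface`, `description`, and `ip access-group ... in` lines are real command forms.

```python
import sqlite3

# Invented sample configuration in IOS-style syntax.
CONFIG = """\
interface Serial0/0
 description customer-facing
 ip access-group 101 in
interface Serial0/1
 description customer-facing
interface Loopback0
 description management
"""


def parse_interfaces(config_text):
    """Flatten interface stanzas into rows of a low-level standard form."""
    rows, current = [], None
    for line in config_text.splitlines():
        if line.startswith("interface "):
            current = {"name": line.split()[1], "description": "", "acl_in": None}
            rows.append(current)
        elif current is not None and line.startswith(" "):
            words = line.split()
            if not words:
                continue
            if words[0] == "description":
                current["description"] = " ".join(words[1:])
            elif words[:2] == ["ip", "access-group"] and words[-1] == "in":
                current["acl_in"] = words[2]
    return rows


def find_acl_gaps(config_text):
    """Return customer-facing interfaces that have no inbound ACL applied."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE interface (name TEXT, description TEXT, acl_in TEXT)")
    db.executemany(
        "INSERT INTO interface VALUES (:name, :description, :acl_in)",
        parse_interfaces(config_text),
    )
    # The domain rule is expressed as a query over the tables.
    query = (
        "SELECT name FROM interface "
        "WHERE description = 'customer-facing' AND acl_in IS NULL "
        "ORDER BY name"
    )
    return [name for (name,) in db.execute(query)]


print(find_acl_gaps(CONFIG))  # ['Serial0/1']
```

Keeping the parser separate from the rules means new audits become new queries, without touching vendor-specific parsing code.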

Automated CPE Router Provisioning
• Technical Questionnaire (Service Level)
  – E.g., Web form
• Logic: allocations of ports, IP addresses, VRFs, …
• Device/service specific templates, with embedded variables and callouts to computations and databases
  – E.g., callouts for ports, IP addresses, ACL clauses, …
• Detailed Device Configuration commands – bundled as a “configlet” (Network Element Level)

Template-driven Config Generation
Executing templates in a given context (stored in a database) produces configs, similar to code generation
– Evolves easily to integrate new features, router models, access types, resiliency options
– Eliminates errors, reduces holds
– Ensures conformance to engineering guidelines
Example: BGP configuration, using context substitution and functional substitution
router bgp
no synchronization
bgp log-neighbor-changes
network , 255.252)> mask 255.252
network , 255.252)> mask 255.252
network mask 255
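A minimal sketch of the two substitution styles named on this slide, using Python's standard `string.Template`. It is not the template language from the talk, whose syntax is not recoverable from the slide. The variable names (`asn`, `wan_ip`, `wan_mask`, `wan_network`), the helper `network_of`, and the sample values are assumptions. Context substitution fills a variable directly from the stored context, while functional substitution calls out to a computation, here deriving the network address from an interface address and mask.

```python
import ipaddress
from string import Template

# Hypothetical template; $asn and $wan_mask come straight from the context,
# $wan_network is computed by a callout.
BGP_TEMPLATE = Template("""\
router bgp $asn
 no synchronization
 bgp log-neighbor-changes
 network $wan_network mask $wan_mask
""")


def network_of(address, mask):
    """Callout: network address for an interface address and dotted mask."""
    interface = ipaddress.ip_interface(f"{address}/{mask}")
    return str(interface.network.network_address)


def render_bgp(context):
    """Execute the template in the given context and return a configlet."""
    values = dict(context)
    values["wan_network"] = network_of(context["wan_ip"], context["wan_mask"])
    return BGP_TEMPLATE.substitute(values)


# Example context; addresses are from the documentation range 192.0.2.0/24.
context = {"asn": 65001, "wan_ip": "192.0.2.6", "wan_mask": "255.255.255.252"}
configlet = render_bgp(context)
print(configlet)
```

`Template.substitute` raises `KeyError` when a variable is missing from the context, which is a useful property for a generator: an incomplete order fails loudly instead of producing a partial config.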

Conclusions
Unlocking data and fault/performance management systems enables innovation
• Exploratory data mining and data correlation are essential to forensics and network maintenance automation
• Approach: Flexible data distribution and data storage architecture
Unlocking provisioning systems enables innovation
• Bottom-up analysis is a useful tool for discord-detection, etc.
• Template driven approach allows network engineering to add new network features without new systems development
Challenges are legion…
• How to overcome proprietary data models and systems that thwart forensics?
• How to efficiently find needles in (massive) data haystacks?
• How to raise the level of provisioning abstraction?
• How to reduce the systems drag on network feature and architecture change?