33a420ebbc2ee7ac49972cedf75cb68d.ppt
- Number of slides: 10
GridKa - DE-KIT procedures
Bruno Hoeft
LHC-OPN Meeting, 10.-11.03.08
Bruno Hoeft, Aurelie Reymund. LHC-OPN 2008, Madrid, 10-11th March.
LHC-OPN Hardware at DE-KIT (GridKa)
- A fully redundant border router setup is in place (resilience).
- Two border routers, Cisco Catalyst 6509:
  - 2 sup engines WS-SUP720-3B (IOS s72033_rp-IPSERVICESK9_WAN-VM, Version 12.2(33)SXF9)
  - line cards WS-X6704-10GE, equipped with single mode transceivers XENPAK-10GB-SR
- DFN: 2 Huawei DWDM
  - one DWDM provides the light colour from DE-KIT (GridKa) to CERN and SARA (direction north from Karlsruhe)
  - the second DWDM provides the light colour from DE-KIT (GridKa) to IN2P3 and CNAF (direction south from Karlsruhe)
- The direction from Karlsruhe to CERN is north because the DANTE peering to DFN for the DFN/Dante link DE-KIT (GridKa) - CERN is located in Frankfurt.
DE-KIT LHC-OPN links

R-inet-gis-I
Interface (Layer-2) | VLAN | IP (Layer-3)   | Link Name (DFN)       | Description
Te7/2               | 10   | 192.166.34/30  | GE10/HUA0674_FRA_FZK  | (Frankfurt/Dante -> Geneva) CERN (fra-gen_LHC_CERN-DFN_06006)
Te1/1               | 751  | 192.166.105/30 | GE10/HUA0778_FZK_MUE  | Muenster/Surfnet -> Amsterdam/SARA (DFN/Surfnet CBF)

R-inet-gis-II
Interface (Layer-2) | VLAN | IP (Layer-3)   | Link Name (DFN)       | Description
Te3/2               | 752  | 192.166.109/30 | GE10/HUA1106_FZK_KEH  | (Kehl) IN2P3 (DFN/RENATER CBF)
Te2/2               | 750  | 192.166.101/30 | GE10/HUA0673_BAS_FZK  | (Milano) Bologna INFN (CNAF) (DFN/Switch/GARR CBF)
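The link table can also be kept as a small machine-readable inventory. The following is only a sketch: the class name `Link`, the field names and the helper functions are hypothetical, the short site labels are taken from the Description column, and the IP column is left out. It ignores any backup routing via other sites and only answers which direct links remain if one border router fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    router: str         # border router carrying the link
    interface: str      # layer-2 interface on that router
    vlan: int           # VLAN id from the table
    dfn_link_name: str  # link name as assigned by DFN
    remote_site: str    # short label for the far end (from the Description column)


LINKS = [
    Link("R-inet-gis-I", "Te7/2", 10, "GE10/HUA0674_FRA_FZK", "CERN"),
    Link("R-inet-gis-I", "Te1/1", 751, "GE10/HUA0778_FZK_MUE", "SARA"),
    Link("R-inet-gis-II", "Te3/2", 752, "GE10/HUA1106_FZK_KEH", "IN2P3"),
    Link("R-inet-gis-II", "Te2/2", 750, "GE10/HUA0673_BAS_FZK", "CNAF"),
]


def links_on_router(router: str) -> list[Link]:
    """Return all links terminated on the given border router."""
    return [link for link in LINKS if link.router == router]


def surviving_sites(failed_router: str) -> list[str]:
    """Remote sites whose direct link does not terminate on the failed router."""
    return sorted(link.remote_site for link in LINKS if link.router != failed_router)
```

For example, `surviving_sites("R-inet-gis-I")` lists the sites still directly connected through R-inet-gis-II.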
Operative service levels
Three service level entities:
- First level support: GGUS (5*8)
- General FZK network support: 5*8, plus an automated incident broadcast (SMS) 24*7
  - Telematis, an external company, covers the on-call support for the incident broadcast outside working hours
- Expert support: 5*8, plus experts on call
The combination of the three operative service levels provides 24*7 LHC-OPN support. This will match the requirements specified by the LHC experiments in their CDR.
All operators will be granted fully transparent access to the DE-KIT (GridKa) wiki knowledge base, the DE-KIT (GridKa) log analyser facility and monitoring system, as well as the LHC-OPN monitoring systems:
- DE-KIT (GridKa) local
  - DE-KIT (GridKa) general monitoring site [http://www.gridka.de/monitoring/main.html]
  - cacti, netflow, ganglia, nagios, log analyser
  - iepm [http://192.108.45.161/iepm-bw.fzk.de/LHC-ATLAS.slac_wan_bw_tests.html#node1.uchicago.edu]
- LHC-OPN central monitoring pages
  - BGP: ENOC monitoring page
  - Dante E2ECU monitoring page
- Several DE-KIT (GridKa) local information sites are restricted to local access only.
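How the three levels add up to 24*7 coverage can be illustrated with a small lookup. This is a sketch under assumptions: the slides only say "5*8", so the working-hours window (Monday to Friday, 08:00 to 16:00) and the function names are hypothetical.

```python
from datetime import datetime, time

# Assumed working hours for the "5*8" coverage; the slides do not state them.
WORK_START = time(8, 0)
WORK_END = time(16, 0)


def in_working_hours(ts: datetime) -> bool:
    """True if ts falls into the assumed Monday-Friday 5*8 window."""
    return ts.weekday() < 5 and WORK_START <= ts.time() < WORK_END


def responders(ts: datetime) -> list[str]:
    """Return who can be reached for an LHC-OPN incident at time ts."""
    if in_working_hours(ts):
        return ["GGUS", "General FZK network support", "Expert support"]
    # Outside working hours: automated SMS broadcast, the external
    # on-call company, and the experts on call.
    return ["Automated SMS broadcast", "Telematis on-call", "Experts on call"]
```

Whatever the timestamp, the returned list is never empty, which is the point of combining the three levels.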
Incident origination:
- DE-KIT (GridKa) Monitoring (LogMonitoring/PortMonitoring):
  - DE-KIT (GridKa) monitoring tools trigger an incident via automated email/SMS (e.g. router port up/down, flapping, BGP changes...), or the incident is raised by the router operators
  - operation at DE-KIT (GridKa) will open a GGUS (or LCU) ticket
  - GGUS (or LCU) will control the ticket
  - the mainly involved tier-1 site (DE-KIT (GridKa)) will operate the ticket until it is solved or closed
  - the appropriate partner(s) affected by the incident will be included in the ticket
- GGUS/LCU:
  - GGUS/LCU ticket initiated by a HEP user, a distant NOC/Tier-0/1 or an NREN
  - GGUS/LCU submits the ticket to the appropriate site (DE-KIT (GridKa))
  - the ticket will still be controlled by GGUS (/LCU), and DE-KIT (GridKa) will take over the operative part
- LIPCU (LCU)/E2ECU:
  - no difference to a GGUS/LCU ticket
- Information by a site:
  - request to open a GGUS/LCU ticket; however, appropriate actions will be taken immediately to solve the issue
- Maintenance/changes at DE-KIT (GridKa) / EGEE Broadcast:
  - a GGUS (and/or LCU) ticket will be opened and announced in GOC; this should inform all LHC-OPN sites via EGEE Broadcast as well as through GOC (for each EGEE broadcast there should be a corresponding ticket)
Incident and ticket handling:
- the ticket of an incident is handled and controlled by either GGUS, LCU, or E2ECU
- the operation of certain actions is transferred to the affected/corresponding location, such as a tier-1 centre (DE-KIT (GridKa)) or an NREN
- the management still resides with the ticket owner (GGUS, LCU/LIPCU, E2ECU)
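The split between ticket control and ticket operation can be written down as a lookup table. This is a minimal sketch, not a description of any GGUS interface: the origin keys, the `TicketPlan` class and `plan_ticket` are hypothetical names, and the controller listed for LIPCU/E2ECU tickets assumes that "no difference to a GGUS/LCU ticket" means the ticket owner keeps control.

```python
from dataclasses import dataclass

SITE = "DE-KIT (GridKa)"


@dataclass(frozen=True)
class TicketPlan:
    controlled_by: str             # who owns and manages the ticket
    operated_by: str               # who carries out the operative part
    announce_in_goc: bool = False  # announced in GOC / EGEE Broadcast


# Hypothetical origin keys, one per incident origin on the slide.
PLANS = {
    "local_monitoring": TicketPlan("GGUS/LCU", SITE),
    "ggus_lcu": TicketPlan("GGUS/LCU", SITE),
    "lipcu_e2ecu": TicketPlan("LIPCU/E2ECU", SITE),
    "site_information": TicketPlan("GGUS/LCU", SITE),
    "maintenance": TicketPlan("GGUS/LCU", SITE, announce_in_goc=True),
}


def plan_ticket(origin: str) -> TicketPlan:
    """Look up who controls and who operates a ticket for a given origin."""
    try:
        return PLANS[origin]
    except KeyError:
        raise ValueError(f"unknown incident origin: {origin!r}") from None
```

In every case the operative part stays with the site while control stays with the ticket owner, which mirrors the handling rule on the slide.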
Operation of an Incident (1)
Layer-1 incident (an issue on layer-1 has the consequence that there is no light on the path)
- No light (Descr.: there is a light cut somewhere on the path)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - contact DFN and Di-Data as well as T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated length and impact) and escalate to Experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG (Network Group)
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante
  - Monitoring, e.g.: http://stats.geant2.net/e2emon/G2_E2E_index_PROD.html
- Local hardware failure (Descr.: a hardware element seems to be deficient on the local network)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated length and impact) and escalate to Experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante
- Remote hardware failure (Descr.: a hardware element seems to be deficient on the remote network)
  Actions:
  - check the router / transceiver / hardware / cable / logs
  - evaluate the impact (backup path available)
  - if nothing suspicious is detected, contact T0/T1
  - send an EGEE broadcast if there is no backup path (depending on estimated length and impact) and escalate to Experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: DFN, Di-Data, T0/T1 network responsible, NREN / Dante
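The three layer-1 procedures differ only in whom to contact, so they can be generated from one checklist. The sketch below uses hypothetical names (`CONTACTS`, `layer1_actions` and the incident keys) and makes one interpretation explicit: the broadcast and the escalation to Experts are both treated as conditional on a missing backup path. The further judgement on estimated length and impact is left to the operator.

```python
# Contact step per layer-1 incident type, as listed on the slide.
CONTACTS = {
    "no_light": "contact DFN and Di-Data as well as T0/T1",
    "local_hardware": "contact T0/T1",
    "remote_hardware": "if nothing suspicious is detected, contact T0/T1",
}


def layer1_actions(kind: str, backup_path_available: bool) -> list[str]:
    """Build the ordered action checklist for a layer-1 incident."""
    if kind not in CONTACTS:
        raise ValueError(f"unknown layer-1 incident type: {kind!r}")
    actions = [
        "check the router / transceiver / hardware / cable / logs",
        "evaluate the impact (backup path available?)",
        CONTACTS[kind],
    ]
    # Assumption: broadcast and escalation only happen without a backup path.
    if not backup_path_available:
        actions.append("send an EGEE broadcast and escalate to Experts")
    actions.append("report the incident and its solution in the documentation")
    return actions
```

Calling `layer1_actions("no_light", backup_path_available=False)` yields the full five-step list, while an available backup path drops the broadcast step.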
Operation of an Incident (2)
Layer-2 (the light on the path is maintained, but there is no connectivity to the neighbour)
- No MAC (Descr.: missing MAC entry from the neighbour's network)
  Actions:
  - check router configuration
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (estimated length and impact), escalate to Experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible
Operation of an Incident (3)
Layer-3 (with a routing issue on layer-3, the light on the path is maintained, but the neighbour is not reachable)
- Routing issue: no route to neighbour (Descr.: the T1 centre cannot reach the neighbour)
- BGP issue: no announcement from neighbour (Descr.: the bgp table shows)
- BGP issue: no routes advertised to neighbour (Descr.: local BGP does not advertise the network(s) correctly to the neighbour)
The slide lists the same actions and groups for each of the three issues:
  Actions:
  - check router configuration / routing / ACLs
  - evaluate the impact
  - contact T0/T1
  - send an EGEE broadcast if there is no backup path (estimated length and impact), escalate to Experts
  - report the incident and its solution in the documentation
  Involved groups:
  - Internal: GIS / NG
  - External: T0/T1 network responsible
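The layer definitions used across the three incident slides form a simple top-down check: light first, then the neighbour's MAC entry, then reachability. A minimal sketch, with a hypothetical function name and boolean inputs that an operator or a monitoring tool would have to supply:

```python
def classify_incident(light_on_path: bool,
                      neighbour_mac_seen: bool,
                      neighbour_reachable: bool) -> str:
    """Map observed symptoms to the incident layer used in the procedures."""
    if not light_on_path:
        return "layer-1"      # no light on the path
    if not neighbour_mac_seen:
        return "layer-2"      # light is there, but no MAC entry from the neighbour
    if not neighbour_reachable:
        return "layer-3"      # routing or BGP issue
    return "no incident"
```

Checking the lower layer first matters: without light on the path, the layer-2 and layer-3 symptoms are consequences rather than separate incidents.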
Maintenance window
- The light path and/or the connectivity / reachability can be affected (Descr.: the T1 centre plans maintenance on the network infrastructure)
  Actions:
  - send an EGEE broadcast
  - contact T0/T1, NREN, Dante
  Involved groups:
  - Internal: GIS / NG / Security
  - External: T0/T1 network responsible, NREN (DFN) / Dante
Configuration / Infrastructure change
- Configuration change (the light path and/or the connectivity / reachability can be affected. Descr.: the T1 centre makes a change to the network configuration)
  Actions:
  - send an EGEE broadcast
  - contact T0/T1, NREN, Dante
  Involved groups:
  - Internal: GIS / NG / Security
  - External: T0/T1 network responsible, NREN (DFN) / Dante
- Infrastructure change (the light path and/or the connectivity / reachability can be affected. Descr.: the T1 centre plans a change in the network infrastructure/topology)
  Actions:
  - send an EGEE broadcast
  - contact T0/T1, NREN, Dante
  Involved groups:
  - Internal: GIS / NG / Security
  - External: T0/T1 network responsible, NREN (DFN) / Dante
- General remarks:
  - All actions involving the LHC-OPN shall, as long as they can be planned, be announced 3 days in advance if possible (ticket, GOC, EGEE Broadcast).
  - Changes of the infrastructure (e.g. routing, reorganisation of router ports) shall be discussed with the affected site, CERN and the coordination unit (LCU/LIPCU).
  - The configuration of the DE-KIT (GridKa) installation will be documented, and all changes will be included in the documentation.
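The announcement rule from the general remarks (3 days in advance, via ticket, GOC and EGEE Broadcast) can be checked mechanically before a planned change. This is a sketch with hypothetical names; it treats the 3 days as a hard threshold, although the slide words it as "if possible".

```python
from datetime import datetime, timedelta

MIN_NOTICE = timedelta(days=3)
REQUIRED_CHANNELS = frozenset({"ticket", "GOC", "EGEE Broadcast"})


def announced_in_time(announced_at: datetime, starts_at: datetime) -> bool:
    """True if the announcement precedes the change by at least 3 days."""
    return starts_at - announced_at >= MIN_NOTICE


def missing_channels(used: set[str]) -> set[str]:
    """Announcement channels from the slide that have not been used yet."""
    return set(REQUIRED_CHANNELS - set(used))
```

A change is ready to go ahead when `announced_in_time(...)` is true and `missing_channels(...)` returns an empty set.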