
EGI-InSPIRE
Mario Reale – IGI / GARR
mario.reale@garr.it
Lyon, Sept 19, 2011
Network Support Operations Workshop @ EGI TF
EGI-InSPIRE RI-261323 www.egi.eu

Outline
• Basic idea
• Architecture
• GUI
• Potentials

Basic idea: “Instead of installing a probe at each site, run a grid job”

Features of NetJobs
• Lightweight deployment
• High scalability
• Security
• Reliability
• Cost-effectiveness

Pros
• No installation/deployment needed at the sites
  Ø Monitoring 10 or 300 sites is just a matter of configuration
• Running on a proven architecture (the grid)
• Possibility to use grid services (e.g. AuthN and AuthZ)

Cons
• Some low-level metrics can’t be implemented in the job
  Ø Because we have no control of the “Worker Node” environment (hardware, software) where the job is running
• Some sites will have to slightly update their middleware configuration
  Ø The maximum lifetime of jobs should be increased if it is too low (at least for the DN of the certificate that the system uses)

System architecture: Global view

System architecture: the components
[Diagram labels: www request, Front-end, DB 1, DB 2, Possible DB ROC 1, Monitoring server, Monitoring server @ ROC 1 – Server A, Monitoring server @ ROC 1 – Server B, new configuration, Grid network monitoring jobs]
• Frontend: Apache Tomcat, Ajax, Google Web Toolkit (GWT)
• Monitoring server & jobs: Python, bash script (portability is a major aspect for jobs)
• Database: PostgreSQL

Choice of network paths
• The end-to-end (e2e) paths and the scheduling of measurements are fully configurable
  – The admin specifies a list of scheduled tests, giving for each one
    » The source and the remote site
    » The type of test
    » The time and frequency of the test
  – Users can ask the administrator to have a given path monitored (a form is available on the UI). The administrator then validates the request.
• If there are still many paths, several server instances can be started to achieve the required performance

Example of scheduling
• Latency test
  – TCP RTT
  – Every 10 minutes
• Hop count
  – Iterative connect() test
  – Every 10 minutes
• MTU size
  – Socket (IP_MTU socket option)
  – Every 10 minutes
Note: to avoid too many connections, the three measurements above are done in the same test
• Achievable bandwidth
  – TCP throughput, measured via a GridFTP transfer between 2 Storage Elements
  – Every 8 h
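
The schedule above can be written down as a small data structure. The following is a minimal sketch, not the actual NetJobs configuration format: the class `ScheduledTest`, the function `due_tests`, the test names and the site names are all hypothetical, and only the periods (10 minutes, 8 h) come from the slide. It models the frequency of each test but not a fixed start time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledTest:
    source: str    # site where the job runs (hypothetical name)
    remote: str    # site being probed (hypothetical name)
    kind: str      # type of test (hypothetical label)
    period_s: int  # interval between two runs, in seconds


# RTT, hop count and MTU share one entry because they are done in the same test.
SCHEDULE = [
    ScheduledTest("site-A", "site-B", "rtt_hops_mtu", 10 * 60),
    ScheduledTest("site-A", "site-B", "gridftp_bw", 8 * 3600),
]


def due_tests(schedule, now_s, last_run):
    """Return the tests whose period has elapsed since their last run.

    `last_run` maps a ScheduledTest to the time (in seconds) it last ran;
    a test that never ran is always due.
    """
    due = []
    for test in schedule:
        last = last_run.get(test)
        if last is None or now_s - last >= test.period_s:
            due.append(test)
    return due
```

With both tests last run at time 0, only the combined RTT/hop/MTU test is due after 600 seconds; the bandwidth test becomes due after 8 hours.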

System architecture: The Server, the Jobs, and the Grid

Technical constraints
• When running a job, the grid user is mapped to a Linux user of the Worker Node (WN):
  – This means the job is not running as root on the WN
    Ø Some low-level operations are not possible (for example, opening an ICMP listening socket is not allowed)
• Heterogeneity of the WN environments (various OS, 32/64 bits…)
  – E.g. making the job download and run an external tool may be tricky (unless it is written in an OS-independent programming language)
• The system has to deal with the grid mechanism overhead (delays, job lifetime limit…)

Initialization of grid jobs
[Diagram labels: Site paris-urec-ipv6, UI, Central monitoring server program (CMSP), Site X, WMS, Site A, Site B, Site C, CE, WN, Job, “Ready!”, Request: RTT test to site A, Request: BW test to site B. Legend: Job submission, Socket connection, Probe Request]

Remarks
• The chosen design (1 job <-> many probes) is much more efficient than starting a job for each probe
  – Considering (grid-related) delays
  – Considering the handling of middleware failures (nearly 100% of failures occur at job submission, not once the job is running)
• The TCP connection is initiated by the job
  Ø No open port is needed on the WN, which is better for the security of sites
• An authentication mechanism is implemented between the job and the server
• A job cannot last forever (GlueCEPolicyMaxWallClockTime), so there are actually 2 jobs running at each site
  – A ‘main’ one, and
  – A ‘redundant’ one, which waits and becomes ‘main’ when the other one ends
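
To illustrate the "1 job <-> many probes" design with a job-initiated connection, here is a minimal sketch of a job loop. It is an assumption-laden illustration, not the NetJobs protocol: the function `run_job`, the JSON-lines message format and its field names (`id`, `test`, `target`) are hypothetical, and the authentication step mentioned above is left out.

```python
import json
import socket


def run_job(server_host, server_port, probes):
    """Connect out to the monitoring server and answer probe requests.

    `probes` maps a test name to a function that takes the target and
    returns a JSON-serializable result. The job opens the connection, so
    no listening port is needed on the worker node. The loop ends when
    the server closes the connection.
    """
    with socket.create_connection((server_host, server_port)) as sock, \
            sock.makefile("r", encoding="utf-8") as rfile, \
            sock.makefile("w", encoding="utf-8") as wfile:
        for line in rfile:  # one JSON request per line (assumed format)
            request = json.loads(line)
            probe = probes.get(request["test"])
            if probe is None:
                reply = {"id": request["id"], "error": "unknown test"}
            else:
                reply = {"id": request["id"],
                         "result": probe(request["target"])}
            wfile.write(json.dumps(reply) + "\n")
            wfile.flush()
```

Because one long-lived job serves many requests over a single connection, the job submission cost is paid once rather than once per probe.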

RTT, MTU and hop count
[Diagram labels: Site paris-urec-ipv6, UI, Central monitoring server program (CMSP), Site B, Site C, CE, WN, Job, Request: RTT test to site C. Legend: Socket connection, Probe Request, Probe Result]

RTT, MTU and hop test
• The ‘RTT’ measure is the time a TCP ‘connect()’ call takes:
  – Because a connect() call involves one round trip of packets:
    • SYN and SYN-ACK: round trip
    • ACK: just sending => no network delay
  – Results are very similar to those of ‘ping’
• The MTU is given by the IP_MTU socket option
• The number of hops is calculated iteratively
• These measures require:
  – Connecting to an accessible port (1) on a machine of the remote site
  – Closing the connection (no data is sent)
  – Note: this connect/disconnect is detected in the application log
(1): We use the port of the gatekeeper of the CE, since it is known to be accessible (it is used by the grid middleware gLite)
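
The three measures above can be sketched with plain unprivileged sockets. This is an illustration under stated assumptions, not the NetJobs code: the function names `tcp_probe` and `hop_count` are hypothetical, IP_MTU is Linux-specific (the fallback value 14 is the Linux constant), and the hop count assumes that a connect() whose TTL is too small fails, so the smallest TTL that reaches the host is taken as the result.

```python
import socket
import time

# Linux value of the IP_MTU option; not every Python build exports the name.
IP_MTU = getattr(socket, "IP_MTU", 14)


def tcp_probe(host, port, timeout=5.0):
    """Time one TCP connect() and read the path MTU known to the kernel.

    Returns (rtt_seconds, mtu_or_None). No application data is sent.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.connect((host, port))  # SYN / SYN-ACK round trip
        rtt = time.perf_counter() - start
        try:
            mtu = sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
        except OSError:
            mtu = None  # option not available on this platform
        return rtt, mtu


def hop_count(host, port, max_hops=30, timeout=2.0):
    """Raise the IP TTL until the remote host is reached.

    Returns the smallest TTL that reaches the host, or None if no TTL
    up to max_hops does.
    """
    for ttl in range(1, max_hops + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            try:
                sock.connect((host, port))
                return ttl
            except ConnectionRefusedError:
                return ttl  # the host answered, so it was reached
            except OSError:
                continue  # TTL expired on the way, or timeout: try a larger TTL
    return None
```

Both functions only need an accessible TCP port on the remote side, which matches the constraint that the job does not run as root.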

Active GridFTP BW test
[Diagram labels: Site paris-urec-ipv6, UI, Central monitoring server program (CMSP), Site A, SE, Site C, SE, WN, Job, Request: GridFTP BW test to site C, Replication of a large grid file, Read the GridFTP log file. Legend: Socket connection, Probe Request, Probe Result]

GridFTP BW test
• If the GridFTP log file is not accessible (cf. dCache?)
  – In this case we just do the transfer via globus-url-copy in verbose mode in order to get the transfer rate
• A passive version of this BW test is conceivable
  – The job just reads the GridFTP log file periodically (the system does not request additional transfers)
  – This is only possible if the log file is available on the Storage Element (i.e. it is a DPM)

User Interface

Changes/Progress since January
• Set up server/DB and GUI at GARR at http://netjobs.dir.garr.it
• No new sites enrolled
• Some minor code updates to cope with minor changes in gLite
• Updated GUI (GWT ext 2.0.0 to 2.0.2) to fix issues with the latest Firefox version

Potentials & Future
• Highly scalable tool
• Robust w.r.t. the Grid worker node environment
• Easy to configure
• Currently focusing on low-level network metrics
• Scheduled bandwidth measurements can be integrated easily
  – Any other network measurement (that can run as an unprivileged user on EGI Worker Nodes) can be integrated
• Any site wishing to be included in the current system server can just tell us
• We can provide support for setting up new servers & GUIs
• Documentation (Install & Admin Guide) will be provided, pending interest from potential users

References / Contacts
Current instance: http://netjobs.dir.garr.it/
Wiki: https://twiki.cern.ch/twiki/bin/view/EGI/GridNetworkMonitoring
Contacts: etienne.duble@urec.cnrs.fr, mario.reale@garr.it
Acknowledgements:
• Stefano Gargiulo / GARR Roma
• Etienne Duble / LIG IMAG Grenoble