Workflow management in large distributed systems CHEP 2010

21 October 2010
I. Legrand, H. Newman, R. Voicu, C. Grigoras, C. Dobre, A. Costan

Monitoring Information is necessary for System Design, Control, Optimization, Debugging and Accounting
• An essential part of managing large scale, distributed data processing facilities is a monitoring system that is able to monitor computing facilities, storage systems, networks and a very large number of applications running on these systems in near-real time.
• The monitoring information gathered for all subsystems is essential for modelling, design, debugging, accounting and the development of “higher level services” that provide decision support and some degree of automated decisions, and for maintaining and optimizing workflow in large scale distributed systems.
[Figure labels: MONITORING, ~REAL TIME Information; ACCOUNTING; Computing Models; Modeling & Simulations; Optimization Algorithms; Create resilient Distributed Systems; Control and Operational support; DEBUGGING; ALARMS]

The MonALISA Framework
• MonALISA is a Dynamic, Distributed Service System capable of collecting any type of information from different systems, analyzing it in near real time and providing support for automated control decisions and global optimization of workflows in complex grid systems.
• The MonALISA system is designed as an ensemble of autonomous, multithreaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large scale distributed applications.
Iosif Legrand October 2010

The MonALISA Architecture
• HL services: Regional or Global High Level Services, Repositories & Clients
• Proxies: Secure and reliable communication, Dynamic load balancing, Scalability & Replication, AAA for Clients
• Agents, MonALISA services: Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions
• Network of JINI-Lookup Services (Secure & Public): Distributed Dynamic Registration and Discovery based on a lease mechanism and remote events
Fully Distributed System with no Single Point of Failure

MonALISA Service & Data Handling
Collects any type of information
[Figure labels: Monitoring Modules (Dynamic Loading, Push and Pull); Data Cache Service & DB; Postgres Data Store; FILTERS / TRIGGERS; AGENTS; Predicates & Agents; Web Service (WSDL, SOAP) for WS Clients and services; Lookup Service (Registration, Discovery); Data (via ML Proxy) to Clients or Higher Level Services; Applications; Configuration Control (SSL)]

Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity
[Figure panels: Running Jobs; JOBS; TOPOLOGY; ACCOUNTING]

Monitoring CMS Jobs Worldwide
CMS is using MonALISA and ApMon to monitor all the production and analysis jobs. This information is then used in the CMS dashboard frontend.
• Rate of collected monitoring values: rates up to more than 6000 values per second
• Lost in UDP: < 5*10^-6
• Total collected values: ~5*10^10 monitoring values in the last 12 months
• Organize and structure Monitoring Information
• More than 3 years continuous operation without any problems
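As a quick consistency check of the figures on this slide, the yearly total can be converted into an average rate and compared with the quoted peak. This is only a back-of-the-envelope sketch: it assumes a 365-day year and reads "Lost in UDP" as a loss fraction that applies uniformly to the whole year of data; the variable names are illustrative.

```python
# Back-of-the-envelope check of the CMS monitoring figures quoted on the slide.
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year

total_values = 5e10        # ~5*10^10 values collected in 12 months (from the slide)
peak_rate = 6000           # peak values per second (from the slide)
udp_loss_fraction = 5e-6   # upper bound on UDP loss (from the slide)

average_rate = total_values / SECONDS_PER_YEAR
# Assumption: the loss bound applies uniformly to all collected values.
max_lost_values = total_values * udp_loss_fraction


def summary():
    """Return the derived figures, rounded for readability."""
    return {
        "average_rate_per_s": round(average_rate),
        "peak_to_average": round(peak_rate / average_rate, 1),
        "max_lost_values": round(max_lost_values),
    }


print(summary())
```

Under these assumptions the average rate is roughly 1600 values per second, so the quoted peak of more than 6000 values per second is a few times the yearly average.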

Monitoring architecture in ALICE
[Figure: ApMon instances embedded in AliEn Job Agents, AliEn CE, AliEn SE, AliEn Brokers, AliEn TQ, AliEn IS, AliEn Optimizers, Cluster Monitors, MySQL Servers, CastorGrid Scripts, API Services, MyProxy status and LCG Tools send data to MonALISA @Site and MonALISA LCG Site services, which feed MonALISA @CERN and the MonaLisa Repository (Alerts, Actions, Long History DB).
Monitored parameters shown: run time, open files, queued Job Agents, migrated mbytes, sockets, disk used, nr. of files, job status, jobs status, cpu ksi2k, cpu time, active sessions, rss, vsz, processes, job slots, free space, net In/out]

ALICE: Global Views, Status & Jobs
http://pcalimonitor.cern.ch

Monitoring in ALICE: jobs, resources, services

Local and Global Decision Framework
Two levels of decisions:
• local (autonomous),
• global (correlations).
Actions triggered by:
• values above/below given thresholds,
• absence/presence of values,
• correlations between any values.
Action types:
• alerts (emails/instant msg/atom feeds),
• automatic charts annotations in the repository,
• running custom code, like securely ordering an ML service to change connectivity, optimize traffic, submit jobs, (re)start a global service.
[Figure labels: Sensors (Traffic, Jobs, Hosts, Apps, Temperature, Humidity, A/C Power, …); ML Service: Local decisions, Actions based on local information; Global ML Services: Global decisions, Actions based on global information]
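The trigger types on this slide (values above/below a threshold, absence/presence of values) can be illustrated with a small sketch. This is not MonALISA code: the class names, the parameter names and the action names are hypothetical, and correlation triggers are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Sample:
    value: float
    timestamp: float  # seconds, on the same clock as `now` below


@dataclass
class ThresholdTrigger:
    """Fires when the latest value is above (or below) a limit."""
    parameter: str
    limit: float
    above: bool = True

    def fires(self, latest: Dict[str, Sample], now: float) -> bool:
        sample = latest.get(self.parameter)
        if sample is None:
            return False  # nothing to compare; absence is a separate trigger
        return sample.value > self.limit if self.above else sample.value < self.limit


@dataclass
class AbsenceTrigger:
    """Fires when no value has been received for max_age_s seconds."""
    parameter: str
    max_age_s: float

    def fires(self, latest: Dict[str, Sample], now: float) -> bool:
        sample = latest.get(self.parameter)
        return sample is None or now - sample.timestamp > self.max_age_s


Trigger = Union[ThresholdTrigger, AbsenceTrigger]


def evaluate(rules: List[Tuple[Trigger, str]],
             latest: Dict[str, Sample], now: float) -> List[str]:
    """Return the names of the actions whose trigger fires."""
    return [action for trigger, action in rules if trigger.fires(latest, now)]


# Hypothetical rules: parameter and action names are illustrative only.
rules = [
    (ThresholdTrigger("mysqld_vsz_mb", limit=2048.0), "restart_mysqld"),
    (AbsenceTrigger("site_heartbeat", max_age_s=300.0), "notify_admins"),
]
latest = {"mysqld_vsz_mb": Sample(value=4096.0, timestamp=990.0)}
print(evaluate(rules, latest, now=1000.0))
```

A local service would run such rules on its own data only, while a global service would evaluate them on values gathered from many services.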

ALICE: Automatic job submission, Restarting Services
• MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage
• ALICE Production jobs queue is kept full by the automatic submission. Trigger: threshold on the number of aliprod waiting jobs
• Administrators are kept up-to-date on the services’ status. Trigger: presence/absence of monitored information

Automatic actions in ALICE
ALICE is using the monitoring information to automatically:
• resubmit error jobs until a target completion percentage is reached,
• submit new jobs when necessary (watching the task queue size for each service account):
  - production jobs,
  - RAW data reconstruction jobs, for each pass,
• restart site services, whenever tests of VoBox services fail but the central services are OK,
• send email notifications / add chart annotations when a problem was not solved by a restart,
• dynamically modify the DNS aliases of central services for an efficient load-balancing.
Most of the actions are defined by configuration files of only a few lines.
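The first two actions on this slide amount to simple decision rules. The sketch below shows one possible reading of them; the function names, the batch size and the example numbers are assumptions made for illustration, not the ALICE configuration.

```python
import math


def jobs_to_resubmit(done: int, failed: int, total: int, target_pct: float) -> int:
    """How many failed jobs to resubmit so that the task can still reach
    the target completion percentage (0 once the target is met)."""
    needed = math.ceil(total * target_pct / 100) - done
    return max(0, min(failed, needed))


def jobs_to_submit(waiting: int, threshold: int, batch: int) -> int:
    """Submit a new batch when the number of waiting jobs drops below
    the threshold, so that the queue is kept full."""
    return batch if waiting < threshold else 0


# Hypothetical task: 1000 jobs, 900 done, 80 in error, target 95 %.
print(jobs_to_resubmit(done=900, failed=80, total=1000, target_pct=95))
# Hypothetical queue: 120 waiting jobs, threshold 500, batch of 1000.
print(jobs_to_submit(waiting=120, threshold=500, batch=1000))
```

Keeping each rule as a small pure function mirrors the remark on the slide that most actions are defined by short configuration files: only the thresholds and targets change between services.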

Monitoring USLHCNet Topology & Status & Peering
Real Time Topology for L2 Circuits

Monitoring Links Availability
Very Reliable Information
[Figure: availability per link, monitored with Ref @ CERN. Links shown: AMS-GVA (SURFnet), AMS-GVA (Geant), AMS-NYC (Level 3), AMS-NYC (TSystems), CHI-NYC (Level 3), GVA-NYC (TSystems), CHI-NYC (NLR 1), CHI-GVA (TSystems), CHI-GVA (Level 3), CHI-NYC (NLR 2). Availability values shown: 99.7%, 99.6%, 99.7%, 97.2%, 99.4%, 95.4%, 99.9%, 99.5%, 96.9%, 99.2%]
P1 Network LINK: 100% monitoring availability

USLHCnet: Accounting for Integrated Traffic

ALARMS and Automatic notifications for USLHCnet

Monitoring Network Topology (L3), Latency, Routers
Real Time Topology Discovery & Display
[Figure labels: NETWORKS; ROUTERS; AS]
Iosif Legrand August 2009

EVO: Real-Time monitoring for Reflectors and the quality of all possible connections

EVO: Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivity
• A weighted connected graph G = (V, E) with n vertices and m edges.
• The quality of connectivity between any two reflectors is measured every second.
• Building in near real time a minimum spanning tree with additional constraints.
• Resilient Overlay Network that optimizes real-time communication.
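For readers unfamiliar with minimum spanning trees, the sketch below builds one over a weighted graph G = (V, E) with Kruskal's algorithm. The slide does not say which algorithm EVO uses, and the "additional constraints" it mentions are not modelled here; the reflector names and link weights are made up, with a lower weight standing for a better connection.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[float, str, str]  # (weight, u, v); lower weight = better link


def minimum_spanning_tree(nodes: Iterable[str], edges: Iterable[Edge]) -> List[Edge]:
    """Kruskal's algorithm with a union-find structure.

    Returns the edges of a minimum spanning tree (or of a spanning
    forest if the graph is not connected)."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:        # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical reflectors and measured link costs.
reflectors = ["A", "B", "C", "D"]
links = [(1.0, "A", "B"), (4.0, "A", "C"), (3.0, "B", "C"),
         (2.0, "C", "D"), (5.0, "B", "D")]
tree = minimum_spanning_tree(reflectors, links)
print(tree, sum(w for w, _, _ in tree))
```

With n vertices the resulting tree has n - 1 edges; when measurements change, the tree can simply be rebuilt from the updated weights.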

Dynamic MST to optimize the Connectivity for Reflectors
• Frequent measurements of RTT, jitter, traffic and lost packets.
• The MST is recreated in ~1 s in case of communication problems.

EVO: Optimize how clients connect to the system for best performance and load balancing

Monitoring the Topology and Optical Power on Fibers for Optical Circuits
[Figure labels: Controlling; Port power monitoring; Glimmerglass Switch Example]

“On-Demand”, End to End Optical Path Allocation
>FDT A/fileX B/path/
CREATES AN END TO END PATH < 1 s
• OS path available
• Configuring interfaces
• Starting Data Transfer
LISA sets up:
- Network Interfaces
- TCP stack
- Kernel parameters
- Routes
LISA to APPLICATION: “use eth1.2, …”
Detects errors and automatically recreates the path in less than the TCP timeout
[Figure labels: hosts A and B, each with APPLICATION, OS and LISA Agent; DATA over the regular IP path (Internet) and over the active light path (Optical Switch, controlled via TL1); MonALISA Service (Monitor, Control); MonALISA Distributed Service System; Real time monitoring]

Controlling Optical Planes, Automatic Path Recovery
FDT Transfer: 200+ MBytes/sec from a 1U node
“Fiber cut” simulations:
• The traffic moves from one transatlantic line to the other one
• FDT transfer (CERN – CALTECH) continues uninterrupted
• TCP fully recovers in ~20 s
[Figure labels: CERN Geneva; USLHCnet; Internet2; Starlight; Manlan; CALTECH Pasadena; 4 fiber cut emulations, marked 1 to 4]

MonALISA collects any type of monitoring information in distributed systems
The MonALISA package includes:
• Local host monitoring (CPU, memory, network traffic, Disk I/O, processes and sockets in each state, LM sensors, APC UPSs), log files tailing
• SNMP generic & specific modules
• Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia
• Ping, tracepath, traceroute, pathload and other network-related measurements
• TL1, Network devices, Ciena, Optical switches
• Calling external applications/scripts that return the values as output
• XDR-formatted UDP messages (such as ApMon)
New modules can be easily added by implementing a simple Java interface. Filters can be used to generate new aggregate data. The Service can also react to the monitoring data it receives (actions, alarms).
MonALISA can run code as distributed agents for global optimization:
• Used by EVO to maintain the tree of connections between reflectors
• On demand end to end optical paths
• Controls distributed data transfers
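To make "XDR-formatted UDP messages" concrete, the sketch below encodes one monitoring value using the basic XDR rules (big-endian, strings prefixed by a 4-byte length and padded to a multiple of 4 bytes, 8-byte IEEE doubles) and sends it as a UDP datagram. The field order and the names used here are hypothetical; this is not the actual ApMon wire format, for which the ApMon library itself should be used.

```python
import socket
import struct


def xdr_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, the bytes, zero padding to 4."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_double(value: float) -> bytes:
    """XDR double: 8-byte IEEE 754 value in big-endian order."""
    return struct.pack(">d", value)


def encode_sample(cluster: str, node: str, name: str, value: float) -> bytes:
    # Hypothetical layout (cluster, node, parameter name, value);
    # NOT the real ApMon message layout.
    return xdr_string(cluster) + xdr_string(node) + xdr_string(name) + xdr_double(value)


def send_sample(host: str, port: int, payload: bytes) -> None:
    """Send one datagram; UDP gives no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


payload = encode_sample("TestCluster", "node01", "cpu_usage", 42.5)
print(len(payload), payload.hex())
```

Because UDP is connectionless, the sender is never blocked by the monitoring service, at the price of occasional lost datagrams.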

MonALISA Summary
MonALISA Today: Running 24x7 at ~360 Sites
• Collecting ~2 million “persistent” parameters in real-time
• 80 million “volatile” parameters per day
• Update rate of ~25,000 parameter updates/sec
• Monitoring:
  - 40,000 computers
  - > 100 WAN Links
  - > 8,000 complete end-to-end network path measurements
  - Tens of Thousands of Grid jobs running concurrently
• Controls job submission, different central services for the Grid, EVO topology, FDT …
• The MonALISA repository system serves ~8 million user requests per year.
Major Communities: ALICE, CMS, ATLAS, PANDA, EVO, LCG RUSSIA, OSG, MXG, RoEduNet, USLHCNET, ULTRALIGHT, Enlightened
[Figure labels: USLHCnet; VRVS; ALICE; OSG; EVO]
http://monalisa.caltech.edu

Back-up slides

Registration / Discovery, Admin Access and AAA for Clients
[Figure labels: Application; MonALISA Service: Registration (signed certificate) with the Lookup Services; Trust keystore; Client (other service): Discovery through the Lookup Services; Proxy Multiplexer between MonALISA Services and Applications; Admin: SSL connection to the MonALISA Service; Data, Filters & Agents; Client authentication; AAA services]

Active Available Bandwidth measurements between all the ALICE grid sites

Active Available Bandwidth measurements between all the ALICE grid sites (2)