- Number of slides: 24
Online Monitoring with MonALISA
Dan Protopopescu, Glasgow, UK
MonALISA
Is a distributed service able to:
- collect any type of information from different systems
- analyze this information in real time
- take automated decisions and perform actions based on it
- optimize work flows in complex environments
Read more at http://monalisa.caltech.edu
Uses
- Monitoring distributed computing, i.e. Grids
- Optimizing flow in complex systems (VRVS, optical cable networks)
- ALICE also uses MonALISA (ML) for monitoring online reconstruction
- Some benchmark figures for the service:
  - ~800k monitored parameters at 50k updates/second
  - >10k running (AliEn) jobs monitored simultaneously
  - >100 WAN links
- We are proposing ML as a high-level monitoring and possible control system along with (or on top of) existing slow controls systems such as EPICS, PVSS etc.
Advantages
- MonALISA is simple to install, configure and use
- ApMon APIs are available in C, C++, Java, Python and Perl
- A ROOT plugin allows macros to send data directly to MonALISA
- Can easily interface with (or sit on top of) any existing or future slow controls subsystem (EPICS, PVSS)
- Data is stored in a standard PgSQL (or MySQL) database that can be accessed by other applications, independently of ML
- Automatic data summarizing
- Several data repositories (and hence DBs) can exist (local and remote)
- Easy access via WebService (WS) from service and/or repository
- Fully supported by the development team; work is being done in this direction
Capabilities
Based on monitored information, actions can be taken in:
- ML Service
- ML Repository
Actions can be triggered by:
- Values above/below given thresholds
- Absence/presence of values
- Correlations between several values
Possible action types:
- External command
- Plain event logging
- Annotation of repository charts; RSS feeds
- Email
- Instant messaging
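The three trigger kinds listed on this slide can be illustrated with a minimal Python sketch. This is not MonALISA code: the function names, the sample format (a dict mapping parameter name to latest value) and the reading of "correlation" as "all sub-conditions hold at once" are assumptions made purely for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical sample format: parameter name -> latest reported value.
Samples = Dict[str, float]
Trigger = Callable[[Samples], bool]


def threshold_trigger(name: str, low: Optional[float] = None,
                      high: Optional[float] = None) -> Trigger:
    """Fire when a reported value is below `low` or above `high`."""
    def check(samples: Samples) -> bool:
        if name not in samples:
            return False  # missing values are left to absence_trigger
        value = samples[name]
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        return too_low or too_high
    return check


def absence_trigger(name: str) -> Trigger:
    """Fire when no value was reported for `name`."""
    return lambda samples: name not in samples


def correlation_trigger(*checks: Trigger) -> Trigger:
    """Fire only when every sub-condition fires on the same samples."""
    return lambda samples: all(check(samples) for check in checks)


if __name__ == "__main__":
    rate_high = threshold_trigger("pspec_logic_ai_0", high=1000.0)
    print(rate_high({"pspec_logic_ai_0": 1500.0}))
```

Each trigger is a plain function of the current samples, so an action (external command, log entry, email) could be attached to any of them in the same way.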
Components
[Architecture diagram. Labels: GUI; LUS/Proxies; Web Server; ApMon (shown three times); Service (actions based on local information); Repository (actions based on aggregated information); Quick actions]
Service setup
ML Service setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MonaLisa.tar.gz
    tar -zxvf MonaLisa.tar.gz
    cd MonaLisa/
    ./install.sh
    cd ../MonaLisa/Service/CMD/
    ./MLD start
Repository setup
ML Repository setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MLrepository.tgz
    tar -zxvf MLrepository.tgz
    [configure it]
    cd MLrepository
    ./start.sh
ApMon setup
ApMon setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/ApMon_perl.tar.gz
    tar -xzvf ApMon_perl.tar.gz
    cd ApMon_perl
    [create your script, say mysend.pl]
    perl mysend.pl
Simple monitoring script
[monalisa@glasgow]$ cat mysend.pl
    use ApMon;
    # Destination: glasgow.jlab.org, port 8884, with sys_monitoring and general_info set to 0
    my $apm = new ApMon({"glasgow.jlab.org:8884" => {"sys_monitoring" => 0, "general_info" => 0}});
    my @pair;
    while (1) {  # loop forever
        # get values from somewhere (getmypar is not defined in this listing)
        @pair = getmypar("pspec_logic_ai_0");
        # cluster "Detector", node "MOR"
        $apm->sendParameters("Detector", "MOR", @pair);
        sleep(20);
    }
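The read, send, sleep structure of this script can be tried out without a running ML service. The Python sketch below is an illustration under stated assumptions, not the ApMon API: `monitor_loop`, `read_value` and `send` are hypothetical names, with `read_value` standing in for `getmypar` and `send` standing in for the ApMon send call.

```python
import time
from typing import Callable, List, Optional, Tuple


def monitor_loop(read_value: Callable[[str], float],
                 send: Callable[[str, str, str, float], None],
                 cluster: str, node: str, name: str,
                 interval_s: float = 20.0,
                 iterations: Optional[int] = None) -> int:
    """Read `name` and pass it to `send` every `interval_s` seconds.

    iterations=None loops forever, as the Perl script does; a number
    makes the loop finite. Returns the number of values sent.
    """
    sent = 0
    while iterations is None or sent < iterations:
        value = read_value(name)          # stand-in for getmypar()
        send(cluster, node, name, value)  # stand-in for the ApMon send call
        sent += 1
        if iterations is None or sent < iterations:
            time.sleep(interval_s)        # no sleep after the final send
    return sent


if __name__ == "__main__":
    log: List[Tuple[str, str, str, float]] = []
    # Dummy reader and sender; a real setup would plug in a data source and ApMon.
    monitor_loop(lambda n: 42.0, lambda *args: log.append(args),
                 "Detector", "MOR", "pspec_logic_ai_0",
                 interval_s=0.0, iterations=3)
    print(log)
```

Passing the reader and sender in as arguments keeps the loop independent of where the values come from and of which ApMon binding does the sending.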
Time history
Time history example:
[monalisa@glasgow]$ cat mor.properties
    page=hist
    Farms=JlabML
    Clusters=Detector
    Nodes=MOR
    Functions=pspec_logic_ai_0
    ylabel=Tagger rate
    title=MOR
    annotation.groups=2
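The properties file is a flat list of key=value pairs, so it is easy to inspect from another tool. The sketch below is a minimal, hypothetical reader (`parse_properties` is not part of MonALISA) and it deliberately ignores the richer Java properties syntax such as line continuations and escapes.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # lines without '=' are ignored
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    sample = "page=hist\nClusters=Detector\nNodes=MOR\nFunctions=pspec_logic_ai_0\n"
    props = parse_properties(sample)
    # The cluster/node/function triple names the plotted series.
    print(props["Clusters"], props["Nodes"], props["Functions"])
```

The Clusters, Nodes and Functions values match the cluster, node and parameter names used when the values were sent.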
Web interface
Java GUI
Application control
[Diagram; recoverable labels:]
- ML Clients: your custom Java client, GUI client, ML Repository, your custom view; keystore (key)
- ML Proxies and LUS: TCP-based subscribe mechanism; serialized, compressed objects with optional encryption
- ML Services: ML Service, your mon module, AppMon; standard and/or user's sensors; keystore
- Your application and/or application modules: C ApMon, your application, bash, your app module
- Application commands are encrypted
Alert-based Actions
- The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage
- The ALICE production job queue is automatically kept full by automatic resubmission. Trigger: threshold on the number of aliprod waiting jobs
- Administrators are kept up to date on the services' status via instant messaging, RSS feeds, toolbar alerts etc. Trigger: presence/absence of monitored information
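The queue-refill action can be pictured as a small decision function. The slide only states that the trigger is a threshold on the number of waiting jobs; the `target` level and the function name `jobs_to_submit` in this Python sketch are assumptions added for illustration.

```python
def jobs_to_submit(waiting: int, threshold: int, target: int) -> int:
    """Return how many jobs to resubmit to keep the queue full.

    Nothing is submitted while `waiting` is at or above `threshold`;
    once it drops below, the queue is topped up to `target`.
    """
    if waiting < 0:
        raise ValueError("waiting must not be negative")
    if threshold > target:
        raise ValueError("threshold must not exceed target")
    if waiting >= threshold:
        return 0
    return target - waiting


if __name__ == "__main__":
    # Hypothetical numbers: refill to 500 once fewer than 100 jobs are waiting.
    print(jobs_to_submit(waiting=50, threshold=100, target=500))
```

Keeping the threshold below the target avoids resubmitting on every small dip in the number of waiting jobs.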
Summary
- MonALISA is a very promising tool for online experiment monitoring and interfacing with a variety of slow control subsystems; GlueX are seriously considering ML for this task
- Easy to configure, understand and use
- Experience from Grid monitoring and more
- Support from the developers group for implementation of new modules/features
- Online experiment monitoring tests with CLAS@Jlab were recently carried out; a demo repository is at http://mlr1.gla.ac.uk:7002
More examples / Extras
Integrated Pie Charts
History Plots, Annotations
AliEn Services Monitoring
- AliEn services: periodically checked (PID check + SOAP call)
- Simple functional tests
- SE space usage
- Efficiency
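The PID half of such a service check can be sketched in a few lines of Python. This is an illustration only, not the AliEn implementation: `pid_alive` is a hypothetical helper, it relies on POSIX signal semantics, and the SOAP call that confirms the service actually responds is not covered.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


if __name__ == "__main__":
    print(pid_alive(os.getpid()))
```

A live PID only shows that the process exists, which is why the slide pairs it with a functional call to the service.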
Job Network Traffic Monitoring
- Based on the xrootd transfer from every job
- Aggregated statistics for:
  - Sites (incoming, outgoing, site to site, internal)
  - Storage Elements (incoming, outgoing)
- Of:
  - Read and written files
  - Transferred MB/s
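The per-site aggregation can be sketched as follows. The record format (source site, destination site, megabytes) and both function names are assumptions for illustration; the real system derives its figures from xrootd transfer reports, and turning totals into MB/s would additionally need the length of the reporting window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical transfer record: (source site, destination site, MB moved).
Transfer = Tuple[str, str, float]


def aggregate_by_site(transfers: Iterable[Transfer]) -> Dict[str, Dict[str, float]]:
    """Sum transferred MB per site into incoming, outgoing and internal totals."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0})
    for src, dst, mb in transfers:
        if src == dst:
            totals[src]["internal"] += mb
        else:
            totals[src]["outgoing"] += mb
            totals[dst]["incoming"] += mb
    return dict(totals)


def site_to_site(transfers: Iterable[Transfer]) -> Dict[Tuple[str, str], float]:
    """Sum transferred MB for every ordered (source, destination) pair of distinct sites."""
    pairs: Dict[Tuple[str, str], float] = defaultdict(float)
    for src, dst, mb in transfers:
        if src != dst:
            pairs[(src, dst)] += mb
    return dict(pairs)


if __name__ == "__main__":
    sample = [("A", "B", 10.0), ("B", "A", 4.0), ("A", "A", 1.5)]
    print(aggregate_by_site(sample))
    print(site_to_site(sample))
```

The same pattern applies to Storage Elements by keying the totals on the storage element name instead of the site.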
Individual Job Tracking
- Based on AliEn shell commands: top, ps, spy, jobinfo, masterjob
- Using the GUI ML Client
- Status, resource usage, per job
Head Node Monitoring
- Machine parameters, real-time & history: load, memory & swap usage, processes, sockets
MonALISA in AliEn
- The MonALISA framework has been used as a primary monitoring tool for the ALICE Grid since 2004
- Presently the system is used for monitoring all (identified) services, jobs and network parameters necessary for Grid operation and debugging
- The number of concurrently monitored and stored parameters today is ~300,000 in 75 ML Services
- The add-on tools for automatic event notification allow for more efficient reaction to problems
- The framework's design and flexibility meet all requirements for a monitoring system
- The accumulated information makes it possible to construct and implement automated decision-making algorithms, further increasing the efficiency of Grid operations


