Grid Resource Center Administration
Jason Shih, ASGC
Grid Administrator Tutorial, March 15-16, Academia Sinica

Outline
• From the site manager's point of view
  • Problem tracking
    • SFT & GStat
    • Other monitoring systems: certificate lifetime, RT monitoring
  • Using the trouble request system
    • ASGC-TRS, GGUS, etc.
  • Searching the knowledge base
    • GSearch
• From the CIC/ROC manager's point of view
  • CIC (Core Infrastructure Center)
  • ROC (Regional Operations Center)
  • User support
    • Infrastructure and interfaces

Monitoring System: SFT & GStat

Site Functional Tests: monitoring
https://lcg-sft.cern.ch/sft/lastreport.cgi?sortby=GOC-region

SFT: Configure your own view
• Select a specific VO
• Select a specific region
• Select the functional tests you are interested in

SFT: which tests are critical and which are not?
• Critical tests:
  • Job submission
  • Replication management (lcg-cr, lcg-cp, lcg-rep, lcg-del, lcg-lp, etc.)
  • CA RPMs
  • /etc/cron.d/edg-fetch-crl on WNs/server nodes
  • CSH test
    • Make sure the environment settings are correct (e.g. GFAL_INFOSYS, VO_xx_DEFAULT_SE, EDG/LCG_LOCATION, PROXY_SERVER, etc.)
  • S/W version: the LCG software version, checked with the lcg-version command
  • S/W directory
    • Make sure exp_soft is mounted on the WN
  • Broker info: execute ‘edg-brokerinfo -v getCE’ on the WN
• Non-critical tests:
  • RGMA
  • APEL accounting

SFT: replication management
• Check the GFAL infosys setting (env | grep LCG_GFAL_INFOSYS)
• Generic test:
  • Register a file into the default SE (lcg-cr) and create a replica
  • Copy the file from the default SE to the WN based on its LFN
  • Replicate the file from the default SE to the central SE (CERN) based on its LFN
  • Delete all replicas from both SEs
• Third-party replication:
  • Register a file to the central SE
  • Copy the file from the central SE to the WN
  • Replicate the file from the central SE to the default SE
  • Delete the replicas
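The generic replication test can be written down as an ordered command plan. The sketch below only builds the argument lists and runs nothing. The function name, host names and LFN are hypothetical, and the option spellings (--vo, -d, -l, -a) are the ones commonly documented for lcg-utils, so they should be checked against the installed version.

```python
from typing import List


def replication_test_plan(vo: str, default_se: str, central_se: str,
                          lfn: str, local_file: str) -> List[List[str]]:
    """Return the four steps of the generic test as argument lists (dry run only)."""
    return [
        # 1. Register the local file on the default SE under the given LFN.
        ["lcg-cr", "--vo", vo, "-d", default_se, "-l", lfn, f"file:{local_file}"],
        # 2. Copy the file back to the worker node, addressed by LFN.
        ["lcg-cp", "--vo", vo, lfn, f"file:{local_file}.copy"],
        # 3. Replicate the file from the default SE to the central SE.
        ["lcg-rep", "--vo", vo, "-d", central_se, lfn],
        # 4. Delete all replicas of the LFN.
        ["lcg-del", "--vo", vo, "-a", lfn],
    ]


if __name__ == "__main__":
    # Hypothetical hosts and LFN, for illustration only.
    for cmd in replication_test_plan("dteam", "se.example.org", "central-se.example.org",
                                     "lfn:/grid/dteam/sft-test", "/tmp/sft-test.txt"):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to run the same steps by hand on a WN and see which one fails.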

Gstat (site GIIS monitoring), case 1: missing DN …

Gstat, case 2: incorrect information published

Gstat: GIIS Usage Analyzer
Conditions:
• No problems
• se percent usage > 80%
• se percent usage > 90%
• seAvail < 1 GB
• waitJob > 50*totalCPU
• waitJob > 150*totalCPU
• no cpu info found
• no job info found
Alert levels, in the order listed: OK, INFO, WARN, NOTE, WARN, ERROR
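The threshold conditions above are simple comparisons. The following is a minimal sketch that reports which of them hold for a site. The function and parameter names are assumptions, and reporting only the stricter of two overlapping thresholds is a choice made here, not necessarily what Gstat does. Alert levels are left out on purpose.

```python
from typing import List


def triggered_conditions(se_used_percent: float, se_avail_gb: float,
                         wait_jobs: int, total_cpu: int) -> List[str]:
    """Return the usage conditions that hold, or ["No problems"] if none do."""
    found: List[str] = []
    # Storage usage: report only the stricter threshold when both are exceeded.
    if se_used_percent > 90:
        found.append("se percent usage > 90%")
    elif se_used_percent > 80:
        found.append("se percent usage > 80%")
    if se_avail_gb < 1:
        found.append("seAvail < 1 GB")
    # Queue pressure relative to the number of CPUs.
    if wait_jobs > 150 * total_cpu:
        found.append("waitJob > 150*totalCPU")
    elif wait_jobs > 50 * total_cpu:
        found.append("waitJob > 50*totalCPU")
    return found or ["No problems"]
```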

Other Monitoring Systems: Certificate Lifetime & RT Monitoring

Real Time Monitoring
http://gridportal.hep.ph.ic.ac.uk/rtm/
Java applet, developed by Imperial College, UK

Certificate lifetime monitoring
http://goc.grid-support.ac.uk/gridsite/monitoring/certificates/HOSTView.php
• Check the validity of the host certificate:
  • openssl x509 -noout -fingerprint -text < hostcert.pem
• Generate a host certificate request:
  • openssl x509 -signkey hostkey.pem -in hostcert.pem -x509toreq -out host_new.csr
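A site can also watch its own host certificate locally. The sketch below assumes the expiry date is read with `openssl x509 -noout -enddate`, which prints a line of the form `notAfter=Mar 15 12:00:00 2007 GMT`. The function names and the 30-day warning threshold are illustrative choices.

```python
import subprocess
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse an openssl '-enddate' line such as 'notAfter=Mar 15 12:00:00 2007 GMT'."""
    value = line.strip().split("=", 1)[1]
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days left until the certificate expires (negative once expired)."""
    return (not_after - now).days


def read_enddate(cert_path: str) -> str:
    """Ask openssl for the expiry line of the certificate at cert_path."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", cert_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Path is an assumption; adjust to where the host certificate lives.
    expiry = parse_not_after(read_enddate("/etc/grid-security/hostcert.pem"))
    left = days_remaining(expiry, datetime.now(timezone.utc))
    print("WARNING" if left < 30 else "OK", f"{left} days left")
```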

Troubleshooting Guide

Problem categories
• Authentication
  • Pool accounts, VOMS, grid-mapfile, LCMAPS/LCAS
• Information system
  • Site GIIS (BDII), local GRIS, and global BDII
  • RGMA
• Job submission
  • Job wrapper, GASS cache, no compatible resources, etc.
• Data management
  • "Can't open data connection", "lcg_cr invalid argument", etc.
• Batch system

Troubleshooting procedures (I)
• Always check the log files before tracking down the problem
• OS-level issue? /var/log/messages, /var/log/secure
• Batch system?
  • PBS server: /var/spool/pbs/server_log
  • Maui scheduler: /var/log/maui.log
• Grid middleware:
  • Globus GK: /var/log/globus-gatekeeper.log
  • GridFTP: /var/log/globus-gridftp.log
  • Several services on the RB: /var/edgwl/
  • edg/lcg log files: /opt/edg/var/log
• Experiment software (ask the submitter to provide a detailed log)
  • Mostly related to missing dependent libraries, sometimes to a bug in the experiment software
• Check Savannah: http://savannah.cern.ch/
• Use tcpdump if you need to profile a network problem
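The log locations above can be kept in one table and scanned for a keyword. This is a minimal sketch: the dictionary keys and the function name are hypothetical, and the paths are the ones listed on this slide.

```python
from pathlib import Path
from typing import Dict, List

# Log locations by component, as listed on the slide.
LOG_LOCATIONS: Dict[str, List[str]] = {
    "os": ["/var/log/messages", "/var/log/secure"],
    "pbs": ["/var/spool/pbs/server_log"],
    "maui": ["/var/log/maui.log"],
    "gatekeeper": ["/var/log/globus-gatekeeper.log"],
    "gridftp": ["/var/log/globus-gridftp.log"],
    "rb": ["/var/edgwl/"],
    "edg": ["/opt/edg/var/log"],
}


def matching_lines(path: str, keyword: str, limit: int = 20) -> List[str]:
    """Return the last `limit` lines of a log file that contain keyword (case-insensitive)."""
    log = Path(path)
    if not log.is_file():
        # Directories and missing files are skipped rather than treated as errors.
        return []
    needle = keyword.lower()
    with log.open(errors="replace") as handle:
        hits = [line.rstrip("\n") for line in handle if needle in line.lower()]
    return hits[-limit:]


if __name__ == "__main__":
    for component in ("os", "gatekeeper"):
        for path in LOG_LOCATIONS[component]:
            for line in matching_lines(path, "error"):
                print(f"{path}: {line}")
```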

Troubleshooting procedures (II)
• Test the problem one layer at a time
  • ping, telnet <port>
  • qsub, ‘pbsnodes -a’ (batch)
  • Exclude the impact of the IS:
    • globus-job-run (direct GK contact)
    • globus-url-copy
  • ldapsearch (IS sanity check)
  • edg-job-list-match (check whether available resources are published by the IS)
  • edg-job-submit (RB/CE interactions)
  • lcg-utils (lcg-cr, lcg-cp, lcg-rep, lcg-del, etc.)
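The layer-by-layer idea can be sketched as a runner that walks the checks from the lowest layer upwards and stops at the first failure. The function name is hypothetical, and each check is any callable returning True or False, for example a wrapper around one of the commands above.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Higher layers are not run once a lower layer has failed, since their
    result would not say anything new. Returns None if every check passes.
    """
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failure of that layer.
            ok = False
        if not ok:
            return name
    return None


if __name__ == "__main__":
    # Stand-in checks for illustration; real ones would call the tools themselves.
    layers = [
        ("network (ping/telnet)", lambda: True),
        ("batch (qsub)", lambda: True),
        ("gatekeeper (globus-job-run)", lambda: False),
        ("information system (ldapsearch)", lambda: True),
    ]
    print(first_failing_layer(layers))
```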

Troubleshooting: Authentication
• 7 authentication failed
  • Error message from the job logging info, via ‘edg-job-get-logging-info - …’:
    reason = 7 authentication failed: GSS Major Status: Authentication Failed
    GSS Minor Status Error Chain:
    init.c:497: globus_gss_assist_init_sec_context_async: Error during context initialization
    init_sec_context
• 530 LCMAPS credential mapping NOT successful
  • Missing pool account or VO-specific group: rerun the yaim function (run_function config_users)
  • Check that user.conf contains pool account information for every supported VO

Troubleshooting: Authentication (cont'd)
• 530 No local mapping for Globus ID
  • Force an update of the grid-mapfile by rerunning the edg-mkgridmap command
• Proxy expired
  • Check the time difference between the two time zones
• 501-FTPD GSSAPI error: GSS Major Status: General failure
  • Invalid CRL: The available CRL has expired
  • Check that the cron job is running correctly and that a valid CRL has been downloaded
• GRAM authentication test failure
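The error signatures and remedies on the authentication slides can be kept in a small lookup table. This is only a sketch: the function name is hypothetical, and the hint texts paraphrase the slides and make no claim to be complete.

```python
from typing import List, Tuple

# (substring of the error message, suggested first action), taken from the slides.
AUTH_HINTS: List[Tuple[str, str]] = [
    ("LCMAPS credential mapping NOT successful",
     "Check pool accounts and VO groups; rerun the yaim function config_users."),
    ("No local mapping for Globus ID",
     "Force a grid-mapfile update by rerunning edg-mkgridmap."),
    ("CRL has expired",
     "Check that the CRL cron job runs and that a valid CRL was downloaded."),
    ("proxy expired",
     "Check the proxy lifetime and the clock/time-zone difference between hosts."),
]


def suggest(message: str) -> List[str]:
    """Return the hints whose signature occurs in the message (case-insensitive)."""
    lowered = message.lower()
    return [hint for signature, hint in AUTH_HINTS if signature.lower() in lowered]


if __name__ == "__main__":
    for hint in suggest("530 No local mapping for Globus ID"):
        print(hint)
```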

Troubleshooting: Authentication (cont'd)
• VOMS authentication
  • Check that the VOMS server is alive (initial proxy init)
  • Check that the user's DN is already registered in the specific VO provided by VOMS: check the registered users
  • Check that the grid-mapfile contains valid DNs fetched from VOMS
    • Cron job ‘edg-mkgridmap’ in /etc/cron.d: check that the target VOMS server exists in the config file
    • Force an update by issuing the command:
      /opt/edg/sbin/edg-mkgridmap --output /etc/grid-security/grid-mapfile --safe

edg-mkgridmap config
• /opt/edg/etc/edg-mkgridmap.conf:

    # ATLAS
    # Map VO members (Role) atlassgm
    group vomss://lcg-voms.cern.ch:8443/voms/atlas?/atlas/Role=lcgadmin atlassgm
    # Map VO members (root Group) atlas
    group vomss://lcg-voms.cern.ch:8443/voms/atlas?/atlas/lcg1 .atlas
    # LDAP lines for ATLAS
    group ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=atlas,dc=eu-datagrid,dc=org atlassgm
    group ldap://grid-vo.nikhef.nl/ou=lcg1,o=atlas,dc=eu-datagrid,dc=org .atlas

    # CMS
    # Map VO members (Role) cmssgm
    group vomss://lcg-voms.cern.ch:8443/voms/cms?/cms/Role=lcgadmin cmssgm
    # Map VO members (root Group) cms
    group vomss://lcg-voms.cern.ch:8443/voms/cms?/cms .cms
    # LDAP lines for CMS
    group ldap://grid-vo.nikhef.nl/ou=lcgadmin,o=cms,dc=eu-datagrid,dc=org cmssgm
    group ldap://grid-vo.nikhef.nl/ou=lcg1,o=cms,dc=eu-datagrid,dc=org .cms
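To check which VOMS or LDAP sources feed which local account, the `group <URI> <account>` lines of this file can be parsed. The sketch below uses a hypothetical function name and handles only `group` lines, ignoring any other directive. It assumes that an account written with a leading dot (for example `.atlas`) denotes a pool account.

```python
from typing import List, NamedTuple


class GroupRule(NamedTuple):
    source: str    # vomss:// or ldap:// URI the DNs are fetched from
    account: str   # local account name, without the leading dot
    pool: bool     # True if the account was written as ".name" (pool account)


def parse_mkgridmap_conf(text: str) -> List[GroupRule]:
    """Extract 'group <URI> <account>' rules, skipping comments and blank lines."""
    rules: List[GroupRule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 3 and fields[0] == "group":
            account = fields[2]
            rules.append(GroupRule(fields[1], account.lstrip("."), account.startswith(".")))
    return rules


if __name__ == "__main__":
    with open("/opt/edg/etc/edg-mkgridmap.conf") as handle:
        for rule in parse_mkgridmap_conf(handle.read()):
            kind = "pool" if rule.pool else "single"
            print(f"{rule.account:10s} {kind:6s} {rule.source}")
```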

LCMAPS template
• /opt/edg/etc/lcmaps/{gridmapfile, groupmapfile}

gridmapfile:
    "/VO=atlas/GROUP=/atlas/ROLE=lcgadmin" atlassgm
    "/VO=atlas/GROUP=/atlas" .atlas
    "/VO=alice/GROUP=/alice/ROLE=lcgadmin" alicesgm
    "/VO=alice/GROUP=/alice" .alice
    "/VO=cms/GROUP=/cms/ROLE=lcgadmin" cmssgm
    "/VO=cms/GROUP=/cms" .cms
    "/VO=dteam/GROUP=/dteam/ROLE=lcgadmin" dteamsgm
    "/VO=dteam/GROUP=/dteam" .dteam
    "/VO=biomed/GROUP=/biomed" .biomed
    "/VO=twgrid/GROUP=/twgrid" .twgrid
    "/VO=apesci/GROUP=/apesci" .asiagrid

groupmapfile:
    "/VO=atlas/GROUP=/atlas/ROLE=lcgadmin" atlas
    "/VO=atlas/GROUP=/atlas" atlas
    "/VO=alice/GROUP=/alice/ROLE=lcgadmin" alice
    "/VO=alice/GROUP=/alice" alice
    "/VO=cms/GROUP=/cms/ROLE=lcgadmin" cms
    "/VO=cms/GROUP=/cms" cms
    "/VO=dteam/GROUP=/dteam/ROLE=lcgadmin" dteam
    "/VO=dteam/GROUP=/dteam" dteam
    "/VO=biomed/GROUP=/biomed/ROLE=lcgadmin" biomed
    "/VO=biomed/GROUP=/biomed" biomed
    "/VO=twgrid/GROUP=/twgrid/ROLE=lcgadmin" twgrid
    "/VO=twgrid/GROUP=/twgrid" twgrid
    "/VO=apesci/GROUP=/apesci/ROLE=lcgadmin" asiagrid
    "/VO=apesci/GROUP=/apesci" asiagrid

Troubleshooting: IS
• Check that the local GRIS on the CE/SE publishes:
  • Site info entry
  • GlueCluster, GlueSubCluster
  • GlueCE, GlueCESEBindingGroup
  • GlueSE, GlueSA
• Check that the site/global BDII is responding
  • Check the BDII configuration files
  • Make sure the resource information is available from the update list (RLS, LFC, etc.)
  • LDAP queries tend to take long: check the cache size in slapd.conf (/opt/lcg/etc/lcg-bdii-read-slapd.conf)
• Local GRIS
  • Not responding: restart MDS
  • GRIS can't be restarted: killall -9 slapd
  • Make sure lcg-info-wrapper exists in grid-info-resource-ldif.conf
  • No update with the latest dynamic information: probably a stale slapd process
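Given the LDIF text returned by an ldapsearch against the local GRIS, the first check above reduces to looking for the expected objectClass values. The sketch below uses a hypothetical function name and checks only a subset of the classes named on the slide. Exact class names should be confirmed against the Glue schema in use.

```python
from typing import Iterable, Set

# Subset of the object classes named on the slide; extend as needed.
REQUIRED_CLASSES = ("GlueCluster", "GlueSubCluster", "GlueCE", "GlueSE", "GlueSA")


def missing_object_classes(ldif_text: str,
                           required: Iterable[str] = REQUIRED_CLASSES) -> Set[str]:
    """Return the required object classes that never appear in the LDIF output."""
    present: Set[str] = set()
    for line in ldif_text.splitlines():
        key, sep, value = line.partition(":")
        # LDAP attribute names are case-insensitive, so compare in lower case.
        if sep and key.strip().lower() == "objectclass":
            present.add(value.strip().lower())
    return {name for name in required if name.lower() not in present}


if __name__ == "__main__":
    sample = (
        "dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-dteam\n"
        "objectClass: GlueCE\n"
        "objectClass: GlueTop\n"
    )
    print(sorted(missing_object_classes(sample)))
```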

Troubleshooting: IS (cont'd)
• GIP is not producing information
  • If the generic info configuration is empty:
    • Rerun yaim: ./run_function config_gip
  • If the static ldif file is not created:
    • /opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/lcg-info-generic.conf
    • Still empty: check whether the Glue templates have been specified in the info config
• Incorrect dynamic information
  • Possible impact: incorrect free CPUs, available space, etc.
  • Check that the dynamic script exists in the GIP configuration
  • If the dynamic ldif file (/opt/lcg/var/gip/tmp) is empty:
    • Check that the ‘edginfo’ and ‘rgma’ users are in the ADMIN3 group, so that they have the privilege to retrieve dynamic status from the scheduler

Troubleshooting: Job submission

Troubleshooting: Data management
• Generic error:
  • Failure to contact the local GRIS, so there is no way to tell where the file is being registered

Troubleshooting: Batch system
• Generic problems
  • Firewall: check that ports 15001-15004 are open on the CE to all WNs (iptables -L)
  • CE name mismatch (check the site definition)
  • MOM config file missing or incorrect
  • Host-based authentication (make sure the public keys of all WNs have been appended to the ‘ssh_known_hosts’ file)
  • Daemon dead
    • Check the service status
  • Missing pool accounts on WNs
    • Try with any available pool account
    • echo "sleep 1" | qsub, and see whether stdout/stderr are returned
  • Time synchronization between CE and WNs (add an NTP ticker and configuration)
• Other possibilities:
  • Disk full?
  • Stale NFS mount?
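The firewall item can be tested from a WN with a plain TCP connection attempt to each port. This is a minimal sketch with a hypothetical function name, and the CE host name in the usage example is a placeholder. A refused or timed-out connection is reported as closed, which does not by itself distinguish a firewall from a daemon that is not running.

```python
import socket
from typing import Dict, Iterable


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Try a TCP connection to each port; map port -> True if the connection succeeded."""
    results: Dict[int, bool] = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results[port] = False
    return results


if __name__ == "__main__":
    # Placeholder CE name; the port range is the one given on the slide.
    for port, is_open in check_ports("ce.example.org", range(15001, 15005)).items():
        print(port, "open" if is_open else "closed/filtered")
```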

Template of MOM config & maui.cfg
• A resource reservation of at least 1 CPU for the dteam class (queue) is recommended, to help SFT jobs get accepted and run at your site (average 3 hours)

/var/spool/pbs/mom_priv/config:
    $clienthost ws45.twgrid.org
    $clienthost localhost
    $restricted ws45.twgrid.org
    $logevent 255
    $ideal_load 1.6
    $max_load 2.1

/var/spool/maui.cfg:
    SERVERHOST      ws45.twgrid.org
    ADMIN1          root
    ADMIN3          edginfo rgma
    ADMINHOST       ws45.twgrid.org
    RMCFG[base]     TYPE=PBS
    SERVERPORT      40559
    SERVERMODE      NORMAL
    # Set the PBS server polling interval. If you have short
    # queues and/or jobs it is worth setting a short interval (10 seconds).
    RMPOLLINTERVAL  00:10
    # A max. 10 MByte log file in a logical location
    LOGFILE         /var/log/maui.log
    LOGFILEMAXSIZE  10000000
    LOGLEVEL        1
    # Set the delay to 1 minute before Maui tries to run a job again,
    # in case it failed to run the first time.
    # The default value is 1 hour.
    DEFERTIME       00:01:00
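Since the information providers need the ‘edginfo’ and ‘rgma’ users to be listed under ADMIN3, a quick check of maui.cfg can catch a common misconfiguration. The sketch below uses hypothetical function names and treats every non-comment line as `KEY value ...`, which is a simplification of the real Maui syntax.

```python
from typing import Dict, List, Sequence


def parse_maui_cfg(text: str) -> Dict[str, List[str]]:
    """Map each parameter name to its list of values, ignoring '#' comments."""
    settings: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, *values = line.split()
        settings[key] = values
    return settings


def missing_admin3_users(settings: Dict[str, List[str]],
                         required: Sequence[str] = ("edginfo", "rgma")) -> List[str]:
    """Return the required users that are not listed under ADMIN3."""
    admins = settings.get("ADMIN3", [])
    return [user for user in required if user not in admins]


if __name__ == "__main__":
    with open("/var/spool/maui.cfg") as handle:
        missing = missing_admin3_users(parse_maui_cfg(handle.read()))
    print("ADMIN3 ok" if not missing else f"missing from ADMIN3: {', '.join(missing)}")
```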

User Support: Infrastructure & Interfaces

EGEE User Support: Infrastructure
• The central helpdesk keeps track of all service requests and assigns them to the appropriate support groups. In this way, formal communication between all support groups is possible. To enable this, each group has to build only one interface between its internal support structure and the central GGUS application.

EGEE User Support: interfaces
[Diagram. Labels: The User; Report Problem; Use the Web view; Resource Center 1 (RC) ... Resource Center N (RC); Local User Support Application; Regional Operations Center (ROC); VO support; CIC; Third-level support (generic deployment, grid middleware); Interface; Central GGUS Application]

EGEE User Support: Responsible Units
• First-level support
  • GGUS team
  • SOD (ROC experts rotation)
• Second-level support
  • CIC-on-duty
  • ROC_Asia/Pacific
  • ROC_CERN
  • ROC_France
  • ROC_GER/CH
  • ROC_Italy
  • ROC_North
  • ROC_Russia
  • ROC_SE
  • ROC_SW
  • ROC_UK/Ireland
  • VOSupport (atlas, magic, biomed, compass, babar, cdf, alice, lhcb, cms, d0)
• Third-level support (filled with experts provided by ROCs)
  • Grid Deployment
    • Castor
    • Generic Deployment
    • Manual Installation
    • Pre-production system
    • VO management/VOMS
  • Grid Middleware
    • d-Cache
    • Data Management
    • GLUE
    • GridICE
    • Information System/GIP/BDII
    • R-GMA
    • Security Management
    • Workload Management

Problem Detection and Tracking
• Operations escalation procedure
  • Detect problems and perform diagnosis
  • 1. Open a ticket for problem tracking in the CIC portal
    • Sends an email notification to the site and the ROC
    • The escalation period is 1 to 3 days, depending on severity
  • 2. Send a second email if there is no response
    • Sends an email notification to the ROC
    • 1-3 days escalation
  • 3. Phone call to the ROC
  • 4. If there is still no response, the CIC suggests that the site be suspended
    • The site is removed from the top-level BDII configuration
    • It is essentially removed from the Grid
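The escalation procedure is an ordered list of steps, each taken only if the previous one got no response. The sketch below models just that ordering. The names are hypothetical, and the severity-dependent waiting periods are left out.

```python
from typing import Optional

# The four escalation steps, in the order given on the slide.
ESCALATION_STEPS = [
    "Open ticket in the CIC portal (email to site and ROC)",
    "Send second email (notification to ROC)",
    "Phone call to ROC",
    "CIC suggests site suspension (removal from top-level BDII)",
]


def next_escalation(step_index: int) -> Optional[str]:
    """Return the step following step_index (0-based), or None after the last step.

    Passing -1 returns the first step, i.e. a problem that has just been detected.
    """
    following = step_index + 1
    if 0 <= following < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[following]
    return None


if __name__ == "__main__":
    step = -1
    while (action := next_escalation(step)) is not None:
        step += 1
        print(f"{step + 1}. {action}")
```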

CIC dashboard

GGUS (Global Grid User Support)
https://gus.fzk.de/pages/support.php

APROC Helpdesk
• Currently supports the following services (queues):
  • CIC/ROC
  • PRAGMA
  • HPC
  • SRB
• Sub-queues of CIC/ROC:
  • T1
  • CASTOR
  • SC
  • SSC

APROC TRS: Customer
http://roc.grid.sinica.edu.tw/otrs/customer.pl

APROC TRS: Supporters
http://roc.grid.sinica.edu.tw/otrs/index.pl

APROC TRS: Accounting statistics (total/average)
• Open tickets: 9/34
• Closed tickets: 325/33
• Total tickets: 334/37

Knowledge base & search engine
• http://listserv.rl.ac.uk/cgi-bin/webadmin?S1=lcg-rollout
• http://roc.grid.sinica.edu.tw/gsearch/

More information
http://goc.grid.sinica.edu.tw/gocwiki

References
• SFT: https://lcg-sft.cern.ch/sft/lastreport.cgi?sortby=GOC-region
• Gstat: http://goc.grid.sinica.edu.tw/gstat
• GSearch: http://roc.grid.sinica.edu.tw/gsearch
• Rollout: http://listserv.rl.ac.uk/archives/lcg-rollout.html
• GocWiki:
  • Troubleshooting guide: http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
  • Administrator's guide: http://goc.grid.sinica.edu.tw/gocwiki/AdministrationFaq