4da1fab1db7d7512631f23e463e092c0.ppt
- Number of slides: 12
Enabling Grids for E-sciencE
LCG and gLite open issues
Massimo Sgaravatto
INFN Padova
www.eu-egee.org
INFSO-RI-508833
LCG problems hopefully addressed
• The bugs below are still open in the LCG Savannah, but they have already been addressed
  – Patches provided (by us or by LCG)
• Still open because the patches are under test or still to be tested
• #3546, #3848, #4144, #6134, #7582
LCG issues not addressed yet
• #3671: To drain an RB
  – They would like to be able to disallow new submissions while still allowing the other commands
  – Asked LCG about the idea discussed here last time: look for a given file, created by the admin on the Broker (if the file exists, the NS drains, i.e. refuses new submissions)
    § No feedback so far
• #3724: LogMonitor should be resilient to a full file system
  – Still to be understood why irepository.dat could not be recovered
  – Priority lowered
• #3808: NetworkServer must log from which UI the job was submitted
  – A patch was provided, but it logs the UI address and the user DN in *separate* messages (and it is not possible to connect them unambiguously)
  – Asked whether they could use the LB info instead: no answer
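The drain-file idea for #3671 can be sketched as follows. This is only an illustration in Python, not the NS code: the file path and the command names are hypothetical, since the slides do not specify them.

```python
import os

# Hypothetical location of the admin-created drain file (not given in the slides).
DRAIN_FILE = "/tmp/ns_drain_example"

# Commands counted as new submissions (illustrative names only).
SUBMISSION_COMMANDS = {"submit"}


def is_allowed(command, drain_file=DRAIN_FILE):
    """Refuse submission commands while the drain file exists; allow the rest."""
    if command in SUBMISSION_COMMANDS and os.path.exists(drain_file):
        return False
    return True
```

With this scheme the admin drains the RB by creating the file and resumes normal service by deleting it; no daemon restart is needed.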
LCG issues not addressed yet
• #4319: Suggestion for change of policy for resubmitted jobs
  – The famous “shallow resubmission” …
• #4570: Multiple cancel requests can crash WM (and possibly PR)
  – Addressed for PR
  – For WM already discussed (it would require major modifications)
• #5404: JC/LM id repository
  – Inconsistency between the JC (memory-resident) id repository and the LM (disk-resident) version
  – This happened when a daemon was down for a while
  – Each daemon needs to know whether its partner is alive or dead
  – Proposal submitted to LCG for feedback: each daemon writes a file with an epoch and updates it every m seconds; if the date in the partner's file is older than a threshold, the partner is considered dead and a more or less drastic action can be taken
    § No feedback
• #6295: RB problem if the OutputData attribute is too big
  – Job submission hangs with a long OutputData (JDL?) attribute
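The heartbeat-file proposal for #5404 could look roughly like the sketch below. It is written in Python for brevity and is not the JC/LM code; the function names, the file layout (a single epoch value per file) and the atomic-rename detail are assumptions added for illustration.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Write the current epoch to path, replacing the old file atomically."""
    now = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(f"{now:.0f}\n")
    os.replace(tmp, path)  # readers never see a half-written file


def partner_is_alive(path, threshold, now=None):
    """Return True if the partner's epoch is at most threshold seconds old."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unreadable file counts as a dead partner
    return (now - last) <= threshold
```

Each daemon would call write_heartbeat() every m seconds on its own file and partner_is_alive() on the other daemon's file; the threshold should be comfortably larger than m so that one delayed update is not mistaken for a crash.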
LCG issues not addressed yet
• #6653: LogMonitor can abort on removing job directory
  – “Got an unhandled standard exception”
  – Directory was not empty because of a “nfs.XYZ” file
  – Directory not properly cleaned by the purger?
    § To be investigated by Alessio and Salvo
• #7372: lbserver and locallogger must not share host proxy
  – When either service discovers that the proxy file has been renewed, it reads the new file twice: once for the certificate and once for the key
  – Between the two reads the file can be renewed again by the other service, leading to a mismatch between the certificate and the key
  – Under investigation by D. Kouril, AFAIK
• #7875: RB hardcodes PBS/LSF batch system for MPI jobs
  – Fixed in gLite CVS
  – Just needed to send the patch to LCG
• #8034: Wrong exit status for jobs requiring automatic upload & registration of output files
  – The status is "Done (OK)" even when the stage-out of some of the output files has failed
  – It should instead be set to "Done (Exit code !=0)"
• #9268: LogMonitor can abort b/c sandbox directory not empty
  – Possible duplicate of #6653
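One way to avoid the double read described in #7372 is to read the proxy file once and extract both the certificate and the key from that single snapshot. The Python sketch below only illustrates the idea; it is not the lbserver or locallogger code, the function name is hypothetical, and it assumes the proxy is a PEM file whose first CERTIFICATE block belongs to the key stored in the same file.

```python
import re

# Matches one PEM block, e.g. CERTIFICATE or RSA PRIVATE KEY.
PEM_BLOCK = re.compile(
    r"-----BEGIN (?P<label>[A-Z ]+)-----.*?-----END (?P=label)-----",
    re.DOTALL,
)


def load_proxy(path):
    """Read the proxy file once and return (certificate_pem, key_pem)."""
    with open(path) as f:
        data = f.read()  # single snapshot: both parts come from the same bytes
    cert = key = None
    for match in PEM_BLOCK.finditer(data):
        label = match.group("label")
        if label == "CERTIFICATE" and cert is None:
            cert = match.group(0)
        elif label.endswith("PRIVATE KEY") and key is None:
            key = match.group(0)
    if cert is None or key is None:
        raise ValueError("proxy file lacks a certificate or a private key")
    return cert, key
```

This removes the window between the two reads, but it only helps if the renewing service replaces the file atomically (for example by writing a temporary file and renaming it); an in-place rewrite could still be observed half-done.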
Issues addressed by LCG that we haven’t integrated yet
• #4318: Matchmaking policy for resubmitted jobs
  – Remove previously matched sites in resubmission
  – Now we remove only previously matched CEs
• #5109: WMS daemon memory leaks
  – Memory leaks in JC, ldif2classad, LM, LB, NS
  – Fixes integrated only for JC, LM, NS, LB (as far as I know)
• #7611: NS can get into state where all connections fail
  – If there are two or more approximately concurrent connections to the NS (as the first connections after a restart), the NS refuses every connection
• #7902: memory leak in edg-wl-ns_daemon
  – One contributing leak is a missing gss_release_cred() in GSISocketServer.cpp
• #7684: Inefficient queries for 'subcluster' information
  – Should be applicable to the ISM BDII purchaser
• #7965: WL startup scripts must not implicitly refer to /opt/edg
  – To allow the deployment of parallel LCG releases on the same WN, which as of LCG-2_4_0 lets the user select a different, relocated release instead of the default release corresponding to /etc/sysconfig/edg
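The difference between the two policies in #4318 (excluding previously matched CEs versus excluding every CE at a previously matched site) can be illustrated with a small Python sketch. The function, the CE names and the CE-to-site mapping are hypothetical; the real matchmaker works on ClassAds and information-system data.

```python
def filter_candidates(candidates, previously_matched, ce_to_site, by_site=True):
    """Drop CEs already tried; with by_site, drop every CE at a tried site."""
    if by_site:
        tried_sites = {ce_to_site[ce] for ce in previously_matched}
        return [ce for ce in candidates if ce_to_site[ce] not in tried_sites]
    tried = set(previously_matched)
    return [ce for ce in candidates if ce not in tried]


# Hypothetical example: two CEs at site A, one at site B.
ce_to_site = {"ce1.a": "A", "ce2.a": "A", "ce1.b": "B"}
candidates = ["ce1.a", "ce2.a", "ce1.b"]

per_ce = filter_candidates(candidates, ["ce1.a"], ce_to_site, by_site=False)
per_site = filter_candidates(candidates, ["ce1.a"], ce_to_site, by_site=True)
```

Here per_ce still contains ce2.a, which sits at the same site as the CE where the job already failed, while per_site leaves only ce1.b.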
gLite problems hopefully already addressed
• The bugs below are still open in the gLite Savannah, but they have already been addressed
• Still open because the patches are under test or still to be tested
• #5115, #5869, #5977, #6081, #6083, #6362, #6417, #6439, #6578, #6655, #6665, #6682, #6700, #6722, #6760, #6768, #6778, #7097, #7140, #7180, #7203, #7227, #7237, #7244, #7254, #7311, #7461, #7490, #7493, #7580, #7808, #7884, #7910, #8003, #8005, #8499, #8500, #8531, #8630, #8637, #8663, #8681, #8852, #8899, #8942, #8946, #8970, #8998, #9030, #9040, #9087, #9135, #9136, #9137, #9139, #9140, #9183
• Some bugs in “Remind” (waiting for more info)
  – #7231, #7312, #7324, #7718, #8026, #8540, #8600, #9194
gLite issues not addressed yet
• #4588: all but two wms rpms have 0.0 version numbers and no summary
  – Version numbers are not zero anymore, but the summary and description are still "Change me"
• #5278: lack of logging information for the workload_manager daemon
  – Discussed between Mario and Francesco G.
• #7512: gLite Python modules can be overridden by the user
  – gLite commands use the Python modules that are found first on PYTHONPATH, and this value can be modified by users
  – Using a namespace for the Python modules could solve this problem
• #7930: globus lib missing when running blah job submit
  – Necessary to set LD_LIBRARY_PATH
• #7977: count not correctly supported for a DAG node
  – If planning fails for a DAG node (i.e. the pre script fails, in DAGMan terms), the job is aborted, without considering that the retry count could allow further attempts
• #8327: Final job status error when job fails
  – “Cancelled” even if the user didn’t issue a job-cancel
  – Cancel was triggered by JC because of a Condor problem
  – Under investigation
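The shadowing behind #7512 follows from how Python resolves a module name: the first directory on the search path that contains the module wins. The sketch below uses the standard importlib.machinery.PathFinder to show which file would be picked for a given search order; the helper name and the module and directory names in the usage note are hypothetical.

```python
import importlib.machinery


def resolve(module_name, search_path):
    """Return the file that would be imported for module_name, or None.

    search_path plays the role of sys.path: directories listed earlier
    (for example those prepended through PYTHONPATH) take precedence.
    """
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return spec.origin if spec is not None else None
```

For instance, resolve("some_module", [user_dir, install_dir]) returns the copy under user_dir whenever both directories contain some_module.py, which is how a user-controlled PYTHONPATH entry can replace a module the commands expect. Placing the modules in a dedicated package (the namespace suggestion on the slide) makes accidental name clashes less likely, although a path entry that precedes the installation directory can still shadow the whole package.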
gLite issues not addressed yet
• #8536: jobs remain in a ready state forever, until the proxy expires and they abort
  – We forgot to reply promptly
  – Asked whether they still see the problem: no answer so far
• #8759: Jobs submitted with RetryCount set to 0 are resubmitted multiple times without visible reason
  – WM kept crashing (and being restarted by the cron) because of a problem when interacting with StorageIndex
    § This specific problem (interaction with StorageIndex) was fixed
  – RetryCount not properly “read”: probably a race condition
• #8786: Support for shallow resubmission
• #8997: IOException when submitting jobs in a loop
  – Not reproducible
  – User will come back if the problem reappears
gLite issues not addressed yet
• #9016: Job remains in Running state for long time after termination
  – Not reproducible
  – Asked to be notified if the problem reappears
• #9148: Job stays 'Submitted' forever
  – Pending LB events in /tmp/dglogd.log.*
• #9256: Incompatibility between gLite jobwrapper and LCG WNs - wrong env. variable
  – Necessary to modify the job wrapper to deal with both types of WNs
  – E.g. setting the EDG* and GLITE* variables
gLite issues not addressed yet
• #4637: VOMS API should offer a simpler way of processing the VOMS attribute certs
  – The current VOMS API provides the function VOMS_Retrieve()
  – This call is sufficient if one uses the OpenSSL API, but is useless when an application uses GSSAPI and GSSAPI extensions
  – Will be fixed in VOMS v.1.6
• #6943: If the user issuing voms-proxy-init is not a member of the specified VO, a confusing error is returned
  – Will be fixed in v.1.6
• #7048: VOMS: there is no easy way for an application to retrieve the info from the proxy cert
  – E.g. no easy way to get the VO(s) using the current API
  – Will be fixed in v.1.6
• #7395: Unhelpful error message with missing certificate info
  – Just an “Unable to verify signature!” message
  – Will be addressed in v.1.6
gLite issues not addressed yet
• #7660: hard-wired defaults in voms_install_db
  – The script voms_install_db does not check whether all the necessary parameters are set, and also sets default values which are not VO-specific
  – Partly fixed in gLite 1.2
• #7662: references to EDG license in voms
  – Instead of the EGEE one
• #8021: VOMS test script doesn't tell you the progress when in the mass voms-proxy-* phase
  – To be fixed
• #8603: VOMS (core) service can't restart after crash when log file is at 2 GB
  – Will be fixed in VOMS 1.6
• #9114: Bizarre error messages with a bad role/group request
  – Depends on bug #6943, and will also be fixed in VOMS 1.6