- Number of slides: 10
D0 Grid Data Production Initiative: Coordination Mtg 11
Version 1.0 (meeting edition)
20 November 2008
Rob Kennedy and Adam Lyon
Attending: …
Outline
• Summary and News
• Deployment “Feature List”
  – Details filling in on December Deployment
• Task Status (4 slides)
  – Focus on individual task status, what is needed
• Deployment 1 Plan
  – Focus on overall schedule, task order
Summary and News
• Summary
  – Initiative Deployment 1 Planning Mtg 2 held Monday
  – New Station move completed successfully
  – FWD1-3 Upgrade (with FWD4-5 in prd) done Wed, but full service not yet restored (details to come)
  – QUE1 Upgrade planned for Thu (today) if all agreed
• Focus Today: Resolve FWD1-3, Proceed or not with QUE1
• News and Notes:
  – ITSM all-day Workshops this Tue-Fri
    • Running ahead, so Rob K. here today after all
Current Deployment “Feature” Lists
• Deployment 1: Split Data/MC Production Services (NO CHANGE)
  – Time frame: November 13-17, with 1 week+ observation before holidays
  – 1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
  – 2. Fwd4 deployed (w/o virtualization)
  – 3. Fwd5 deployed
  – 4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
  – 5. New SAM Station (moved off of FWD1)
  – 6. Condor 7 via “new” 1.10.1m official release from UWisc
  – 7. FileMax increase on all Fwd nodes to handle large nJob actions
  – 8. D0 Runjob Upgrade for Data Production: Prerequisite for deploying new SAM-Grid release
• Deployment 2: Optimize Data and MC Production Configurations
  – Time frame: December 8-10, with 1 week+ observation before holidays
  – 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  – 2. New SAM-Grid Release with support for new Job status value at Queuing node
  – 3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  – 4. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS
Deployment 1 Schedule B
• Mon 17 Nov 2008
  – Mon 17 Nov: Depl Plan Mtg 2, 9 am
  – Test FWD4, 5, QUE2 w/ new OpenFileMax (MD, …)
  – Backup samgrid products area on FWD1-3. Also /etc/grid-security, globus-gatekeeper (AL)
  – Request OpenFileMax change on FWD1-3 (AL)
  – Plan QUE1 Upgrade in detail (AL, PM)
• Tues 18 Nov 2008
  – FWD2 certs expire.
  – Test FWD4, 5, QUE2 w/ new OpenFileMax
  – Backup samgrid products area on QUE1. Also /etc/grid-security, globus-gatekeeper; job_queue, job_history (AL, PM)
  – Automated administration/monitoring on QUE1, 2: put into a product (AL)
• Wed 20 Nov 2008 Coordination Mtg 9 am led by Adam
  – FWD1-3 wipe/re-install via umbrella package (JB); Increase OpenFileMax (FEF/JB)
    • MC Prod uses FWD4 while this happens
    • Data Prod uses FWD5 while this happens
    • Reboot FWD1-3 to pick up OpenFileMax change
    • Any order of FWD work is OK: All at once or seq.
    • If all goes well, then stop, announce, observe. QUE1 work starts next day.
    • Fall-back: restore samgrid products from backup
  – SAM Station: Move context server (RI) to new sam station host and observe.
    • If all goes well, then stop, announce. Observe.
    • Fall-back: restore samgrid products from backup
• Thu 20 Nov 2008
  – Sign off on FWD work, proceed with QUE1 work
  – QUE1 upgrade install via umbrella package (JB)
    • QUE1 has brokering, web page not on QUE2
    • AL: Be careful NOT to wipe state of old jobs… Brokering, Web page should not be touched. We have not fully tested the new deployment of these.
    • Production can use QUE2 while this happens. This has modest complication of using this queuing node for recovery jobs (impacts Data Prod).
• Fri 21 – Mon 24 Nov 2008
  – Validate the overall configuration matches plan
  – Check all monitoring, automated tasks. Observe system in production
• Tues 25 Nov 2008
  – Sign-off on D0 Grid Production System
• Clean-up: Deferred to December Deployment
  – SRM client cert with correct host address
  – OS upgrades (old nodes on SLF 4.5 to SLF 4.7)
Deployment 1 Configuration
• Reco (adapted from Oct 6 proposal, tweaked in meeting)
  – FWD1: 1250 (now 750)
  – FWD5: 1250
• MC, MC Merge
  – FWD2: 1250 (now 750)
  – FWD4: 1250
• Reco Merge
  – FWD3: 750/300 grid each
• QUE1: Reco, Reco Merge – keep here to maintain history
• QUE2: MC, MC Merge
• SAM Station: All
• Jim Client: can submit to QUE1 or QUE2 depending on qualifier
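The node assignments on the Deployment 1 Configuration slide can be written down as a small lookup table. The Python sketch below is illustrative only: the names `FORWARDING_NODES`, `QUEUING_NODE`, and `route` are hypothetical, and it assumes the "qualifier" used by the Jim Client corresponds to the job type, which the slide does not state. The per-node job limits (1250, 750, 750/300) are left out because the slide does not define their units.

```python
# Hypothetical encoding of the Deployment 1 node assignments from the slide.
# Assumption: the client "qualifier" maps one-to-one onto these job types.
FORWARDING_NODES = {
    "Reco": ["FWD1", "FWD5"],
    "MC": ["FWD2", "FWD4"],
    "MC Merge": ["FWD2", "FWD4"],
    "Reco Merge": ["FWD3"],
}

QUEUING_NODE = {
    "Reco": "QUE1",        # kept on QUE1 to maintain history
    "Reco Merge": "QUE1",
    "MC": "QUE2",
    "MC Merge": "QUE2",
}


def route(job_type):
    """Return (queuing node, list of forwarding nodes) for a job type."""
    if job_type not in QUEUING_NODE:
        raise ValueError(f"unknown job type: {job_type!r}")
    return QUEUING_NODE[job_type], list(FORWARDING_NODES[job_type])
```

For example, `route("Reco")` gives `("QUE1", ["FWD1", "FWD5"])`, matching the slide's split of data production onto QUE1 and MC production onto QUE2.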
Task Status (1 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.1 Forwarding Node 4 (Fwd4)
  – "JS, MD, JB" Fri 11/14/08 Tue 11/18/08 3d
  – JB Wed 11/12/08 Fri 11/14/08 3d
  – Tue 11/18/08 0d
Deployment 1 Tasks (2 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.6 Deployment Stage 1
Task Status (3 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
  – 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops AL JB Mon 11/17/08 0d
  – 1.1.8.6 Umbrella Product: Update FWD Installation Procedure. AL JB Mon 11/24/08 Tue 11/25/08 2d
  – 1.1.8.14 Add OpenFileMax setting to FWD Installation Procedure. AL REX Wed 11/19/08 1d
  – 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops AL JB Mon 11/17/08 0d
  – 1.1.8.10 Umbrella Product: Update QUE Installation Procedure AL JB Mon 11/24/08 Tue 11/25/08 2d
  – 1.1.8.13 Umbrella Product: FWD and QUE Install. Proc. archived. AL REX Wed 11/26/08 1d
  – 1.1.8.11 Milestone: FWD and QUE Packaging with Version-Based Umbrella Product done "GG, AL" Wed 11/26/08
  – Notes: …
Task Status (4 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info
  – 1.3.1.7 New Job Status Value at QUE Node: Later Work GG PM Tue 11/18/08 Mon 11/24/08 5d
  – 1.3.1.1 Use "Same" Proxy for Gridftps GG PM Thu 11/20/08 Mon 11/24/08 3d
  – 1.3 SAM-Grid Release with Job Status Info feature GG PM Tue 11/25/08 Wed 11/26/08 2d
  – 1.3.1.6 Pre-deployment test of new SAM-Grid Release AL REX Mon 12/1/08 Fri 12/5/08 5d
  – 1.3.1.4 Upgrade D0 Runjob version used by Data Production AL "MD, AL" Thu 10/30/08 Fri 10/31/08 2d
  – 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production AL REX Fri 12/5/08 0d
• 1.3.2 Slow Fwd-CAB Job Transition
  – Note: FileMax change requires a schedd restart (ST). Work into deployment plans.
• 1.3.3 Improved H/w Uptime
• 1.4 Metrics
  – nSubmissions plot for Sep ’08 Mike?
• Post-Deployment topics and tasks covered in the “Deployment 1 Review”
  – Archiving Installation Instructions with all note-worthy comments in JIRA integrated
  – Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
  – Cost-benefit: push FWD, QUE nodes to be appliances
    • spec’d from OS (including OpenFileMax) to applications to grid system configuration
    • rapid wipe and re-install
• Past Notes: At mercy of off-site gridmap updates… need to use the existing automated system to keep all in sync
  – Also: no remote site has new VDT (which has new VOMS)
  – No installation instructions for durable locations server. Considering for Phase 2 of Initiative.
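The durations in the Task Status slides (for example Tue 11/18/08 to Mon 11/24/08 listed as 5 d, and Thu 11/20/08 to Mon 11/24/08 listed as 3 d) look like inclusive working-day counts. The Python sketch below reproduces that arithmetic under stated assumptions: a Monday-to-Friday working week and no holiday calendar. The function name `working_days` is hypothetical, and the slides do not say which scheduling tool produced the numbers.

```python
from datetime import date, timedelta


def working_days(start, finish):
    """Count Monday-Friday days from start to finish, both inclusive.

    Assumption: holidays are ignored; only weekends are skipped.
    """
    if finish < start:
        raise ValueError("finish must not be before start")
    count = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        current += timedelta(days=1)
    return count
```

With this definition, `working_days(date(2008, 11, 18), date(2008, 11, 24))` gives 5 and `working_days(date(2008, 11, 20), date(2008, 11, 24))` gives 3, in line with the durations shown for tasks 1.3.1.7 and 1.3.1.1.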


