  • Number of slides: 10

D0 Grid Data Production Initiative: Coordination Mtg 11
Version 1.0 (meeting edition)
20 November 2008
Rob Kennedy and Adam Lyon
Attending: …

Outline
• Summary and News
• Deployment “Feature List”
  – Details filling in on December Deployment
• Task Status (4 slides)
  – Focus on individual task status, what is needed
• Deployment 1 Plan
  – Focus on overall schedule, task order

Summary and News
• Summary
  – Initiative Deployment 1 Planning Mtg 2 held Monday
  – New Station move completed successfully
  – FWD1-3 Upgrade (with FWD4-5 in prd) done Wed, but full service not yet restored (details to come)
  – QUE1 Upgrade planned for Thu (today) if all agreed
• Focus Today: Resolve FWD1-3, Proceed or not with QUE1
• News and Notes:
  – ITSM all-day Workshops this Tue-Fri
    • Running ahead, so Rob K. here today after all

Current Deployment “Feature” Lists
• Deployment 1: Split Data/MC Production Services (NO CHANGE)
  – Time frame: November 13-17, with 1 week+ observation before holidays
  – 1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
  – 2. Fwd4 deployed (w/o virtualization)
  – 3. Fwd5 deployed
  – 4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
  – 5. New SAM Station (moved off of FWD1)
  – 6. Condor 7 via “new” 1.10.1m official release from UWisc
  – 7. FileMax increase on all Fwd nodes to handle large nJob actions
  – 8. D0 Runjob Upgrade for Data Production: Prerequisite for deploying new SAM-Grid release
• Deployment 2: Optimize Data and MC Production Configurations
  – Time frame: December 8-10, with 1 week+ observation before holidays
  – 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  – 2. New SAM-Grid Release with support for new Job status value at Queuing node
  – 3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  – 4. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS

Deployment 1 Schedule B
• Mon 17 Nov 2008
  – Depl Plan Mtg 2, 9 am
  – Test FWD4, 5, QUE2 w/ new OpenFileMax (MD, …)
  – Backup samgrid products area on FWD1-3. Also /etc/grid-security, globus-gatekeeper (AL)
  – Request OpenFileMax change on FWD1-3 (AL)
  – Plan QUE1 Upgrade in detail (AL, PM)
• Tues 18 Nov 2008
  – FWD2 certs expire.
  – Test FWD4, 5, QUE2 w/ new OpenFileMax
  – Backup samgrid products area on QUE1. Also /etc/grid-security, globus-gatekeeper; job_queue, job_history (AL, PM)
  – Automated administration/monitoring on QUE1, 2: put into a product (AL)
• Wed 19 Nov 2008
  – FWD1-3 wipe/re-install via umbrella package (JB); Increase OpenFileMax (FEF/JB)
    • MC Prod uses FWD4 while this happens
    • Data Prod uses FWD5 while this happens
    • Reboot FWD1-3 to pick up OpenFileMax change
    • Any order of FWD work is OK: all at once or seq.
    • If all goes well, then stop, announce, observe. QUE1 work starts next day.
    • Fall-back: restore samgrid products from backup
  – SAM Station: Move context server (RI) to new sam station host and observe.
    • If all goes well, then stop, announce. Observe.
    • Fall-back: restore samgrid products from backup
• Thu 20 Nov 2008
  – Coordination Mtg 9 am led by Adam
  – Sign off on FWD work, proceed with QUE1 work
  – QUE1 upgrade install via umbrella package (JB)
    • QUE1 has brokering, web page not on QUE2
    • AL: Be careful NOT to wipe state of old jobs…
    • Brokering, Web page should not be touched. We have not fully tested the new deployment of these.
    • Production can use QUE2 while this happens. This has the modest complication of using this queuing node for recovery jobs (impacts Data Prod).
• Fri 21 – Mon 24 Nov 2008
  – Validate the overall configuration matches plan
  – Check all monitoring, automated tasks.
  – Observe system in production
• Tues 25 Nov 2008
  – Sign-off on D0 Grid Production System
• Clean-up: Deferred to December Deployment
  – SRM client cert with correct host address
  – OS upgrades (old nodes on SLF 4.5 to SLF 4.7)
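The backup and fall-back steps in this schedule (archive the samgrid products area and /etc/grid-security before the wipe, restore from backup if the upgrade fails) could be scripted roughly as follows. This is only a sketch under assumptions: the slides do not say how the backups were actually taken, the helper names `backup_paths` and `list_backup` are hypothetical, and the location of the samgrid products area is not given in the slides, so it appears only as a placeholder in the usage comment.

```python
import tarfile
import time
from pathlib import Path


def backup_paths(paths, dest_dir, label):
    """Archive every existing path into one timestamped .tar.gz and return its location.

    Hypothetical helper: the slides only say that a backup is taken, not how.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{label}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                # Store the path without its leading "/" so a restore can be
                # unpacked under any directory chosen by the operator.
                tar.add(path, arcname=str(path).lstrip("/"))
            else:
                print(f"skipping missing path: {path}")
    return archive


def list_backup(archive):
    """Return the sorted member names of an archive, to verify it before a wipe."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())


# Example call (placeholder paths, not taken from the slides except /etc/grid-security):
#   archive = backup_paths(["/path/to/samgrid/products", "/etc/grid-security"],
#                          "/path/to/backups", "fwd1")
#   print(list_backup(archive))
```

Listing the archive before the wipe gives a cheap check that the fall-back is actually usable.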

Deployment 1 Configuration
• Reco (adapted from Oct 6 proposal, tweaked in meeting)
  – FWD1: 1250 (now 750)
  – FWD5: 1250
• MC, MC Merge
  – FWD2: 1250 (now 750)
  – FWD4: 1250
• Reco Merge
  – FWD3: 750/300 grid each
• QUE1: Reco, Reco Merge – keep here to maintain history
• QUE2: MC, MC Merge
• SAM Station: All
• Jim Client: can submit to QUE1 or QUE2 depending on qualifier
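The split on this slide amounts to a lookup from job type to queuing node and forwarding nodes. The sketch below illustrates that mapping only; it is not the Jim Client's actual qualifier logic, which the slides do not describe. The dictionary names and the `route` function are hypothetical, and the per-node numbers (1250, 750, 750/300) are left out because the slide does not say what they measure.

```python
# Job type -> queuing node, as listed on the slide.
QUEUE_NODE = {
    "reco": "QUE1",
    "reco_merge": "QUE1",
    "mc": "QUE2",
    "mc_merge": "QUE2",
}

# Job type -> forwarding nodes, as listed on the slide.
FORWARD_NODES = {
    "reco": ["FWD1", "FWD5"],
    "mc": ["FWD2", "FWD4"],
    "mc_merge": ["FWD2", "FWD4"],
    "reco_merge": ["FWD3"],
}


def route(job_type):
    """Return (queuing node, forwarding nodes) for a job type such as "Reco Merge"."""
    key = job_type.strip().lower().replace(" ", "_")
    if key not in QUEUE_NODE:
        raise ValueError(f"unknown job type: {job_type!r}")
    return QUEUE_NODE[key], FORWARD_NODES[key]


print(route("Reco"))
print(route("MC Merge"))
```

Keeping the two tables separate mirrors the slide, where the QUE assignment and the FWD assignment are stated independently.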

Task Status (1 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.1 Forwarding Node 4 (Fwd4)
  – 1.10 Fwd4: Pre-Deployment OpenFileMax=16k Large-Scale Test; AL; JS, MD, JB; Fri 11/14/08 to Tue 11/18/08; 3d
  – 1.14 Fwd4: Setup Automated Maintenance, Monitoring; AL; JB; Wed 11/12/08 to Fri 11/14/08; 3d
  – 1.11 Milestone: Fwd4 Ready to Deploy; AL; Tue 11/18/08; 0d
• 1.1.2 Forwarding Node 5 (Fwd5)
  – 1.1.2.8 Fwd5: Pre-Deployment OpenFileMax=16k Large-Scale Test; AL; JS, MD, JB; Fri 11/14/08 to Tue 11/18/08; 3d
  – 1.1.2.13 Fwd5: Setup Automated Maintenance, Monitoring; AL; JB; Wed 11/12/08 to Fri 11/14/08; 3d
  – 1.1.2.9 Milestone: Fwd5 Ready to Deploy; AL; Tue 11/18/08; 0d
• 1.1.3 Queuing Node 2 (Que2)
  – 1.1.3.14 Que2: Setup Automated Maintenance, Monitoring; AL; REX; Thu 11/13/08 to Fri 11/14/08; 2d
  – 1.1.3.9 Que2: Integration Test w/2-QUE Client; AL; JB; Thu 11/13/08 to Fri 11/14/08; 2d
  – 1.1.3.11 Milestone: Que2 Ready to Deploy; AL; Fri 11/14/08; 0d
• 1.1.5 New Distinct Sam Station
  – 1.1.5.7 Milestone: SAM Station Ready to Deploy; AL; Fri 11/14/08; 0d
  – JIRA “Figure out what to do with SRMs” contains “Request and Install SRM-related certs”
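The pre-deployment tests above target OpenFileMax=16k. A node could be checked against that target along the following lines. This is a sketch under loud assumptions: the slides do not say which setting "OpenFileMax" refers to, so the Linux system-wide limit in /proc/sys/fs/file-max is used as a stand-in, "16k" is read as 16 × 1024, and all function names are hypothetical.

```python
from pathlib import Path


def parse_limit(text):
    """Convert a value such as "16k" or "16384" to an integer (assumes k = 1024)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def read_file_max(path="/proc/sys/fs/file-max"):
    """Read the system-wide open-file limit on Linux, or return None if unavailable.

    Assumption: this kernel setting stands in for the slide's "OpenFileMax".
    """
    p = Path(path)
    if not p.is_file():
        return None
    return int(p.read_text().split()[0])


def meets_target(current, target="16k"):
    """True when the measured limit is known and at least the target value."""
    return current is not None and current >= parse_limit(target)


current = read_file_max()
print("file-max:", current, "meets 16k target:", meets_target(current))
```

Running the same check after the reboot step would confirm that the change was actually picked up.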

Deployment 1 Tasks (2 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.6 Deployment Stage 1
  – 1.1.6.2 Deployment 1: Execute; AL; REX; Fri 11/14/08 to Thu 11/20/08; 5d
    • 1.1.6.2.1 SAM Station: Deactivate old station, Activate new station; AL
    • 1.1.6.2.2 FWD1 Upgrade (App, Config, OpenFileMax); AL
    • 1.1.6.2.3 FWD2 Upgrade (App, Config, OpenFileMax); AL
    • 1.1.6.2.4 FWD3 Upgrade (App, Config, OpenFileMax); AL
    • 1.1.6.2.5 QUE1 Upgrade (App, Config); AL
    • 1.1.6.2.6 Establish Grid Production Configuration; AL
    • 1.1.6.2.7 SAM Station: Setup Context Server; AL
    • 1.1.6.2.8 Milestone: Deployment 1 Execution done; AL
    • Resource, date and duration columns for 1.1.6.2.1 to 1.1.6.2.8 (row alignment lost in extraction): RI, JB, JB, REX, RI; Fri 11/14/08, Wed 11/19/08, Thu 11/20/08, Wed 11/19/08, Thu 11/20/08, Fri 11/14/08, Thu 11/20/08, Thu 11/20/08, Thu 11/19/08, Thu 11/20/08; 1d, 2d, 2d, 1d, 2d, 0d
  – 1.1.6.3 Deployment 1: Monitor; AL; REX; Fri 11/21/08 to Mon 11/24/08; 2d
  – 1.1.6.4 Deployment 1: Sign-off; AL; REX; Tue 11/25/08 to Tue 11/25/08; 1d
  – 1.1.6.5 MILE 1: Deployment 1 Completed; AL; 0d
• 1.1.11 Deployment 1 Review; AL; Mon 12/1/08; 1d
• Not all starts/durations above are sync’d to the latest Monday plan
• Meeting on Monday 17 November produced the authoritative schedule (Sched B)
• We cannot deploy later than 20 Nov. (Thursday)… no deploy on Friday or holiday week.
• New Condor is in this deployment too, all FWD, QUE nodes. THIS is a major risk.
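The durations in this task list read as working-day counts, inclusive of both the start and end date: Fri 11/14/08 to Thu 11/20/08 is 5d, Fri 11/21/08 to Mon 11/24/08 is 2d. The sketch below reproduces that arithmetic. The function name `working_days` is hypothetical, and the counting rule (Monday to Friday, no holiday calendar) is an assumption inferred from the numbers on the slide, not something the slide states.

```python
from datetime import date, timedelta


def working_days(start, end):
    """Count Monday-Friday days from start to end, inclusive of both ends.

    Assumption: no holiday calendar is applied; only weekends are skipped.
    """
    if end < start:
        raise ValueError("end date is before start date")
    count = 0
    day = start
    while day <= end:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count


# Deployment 1: Execute, Fri 11/14/08 to Thu 11/20/08
print(working_days(date(2008, 11, 14), date(2008, 11, 20)))  # 5
# Deployment 1: Monitor, Fri 11/21/08 to Mon 11/24/08
print(working_days(date(2008, 11, 21), date(2008, 11, 24)))  # 2
```

A check like this is a quick way to spot rows whose start, end, and duration are out of sync with the Monday plan.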

Task Status (3 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
  – 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops; AL; JB; Mon 11/17/08; 0d
  – 1.1.8.6 Umbrella Product: Update FWD Installation Procedure; AL; JB; Mon 11/24/08 to Tue 11/25/08; 2d
  – 1.1.8.14 Add OpenFileMax setting to FWD Installation Procedure; AL; REX; Wed 11/19/08; 1d
  – 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops; AL; JB; Mon 11/17/08; 0d
  – 1.1.8.10 Umbrella Product: Update QUE Installation Procedure; AL; JB; Mon 11/24/08 to Tue 11/25/08; 2d
  – 1.1.8.13 Umbrella Product: FWD and QUE Install. Proc. archived; AL; REX; Wed 11/26/08; 1d
  – 1.1.8.11 Milestone: FWD and QUE Packaging with Version-Based Umbrella Product done; GG, AL; Wed 11/26/08
  – Notes: …

Task Status (4 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info
  – 1.3.1.7 New Job Status Value at QUE Node: Later Work; GG; PM; Tue 11/18/08 to Mon 11/24/08; 5d
  – 1.3.1.1 Use "Same" Proxy for Gridftps; GG; PM; Thu 11/20/08 to Mon 11/24/08; 3d
  – 1.3 SAM-Grid Release with Job Status Info feature; GG; PM; Tue 11/25/08 to Wed 11/26/08; 2d
  – 1.3.1.6 Pre-deployment test of new SAM-Grid Release; AL; REX; Mon 12/1/08 to Fri 12/5/08; 5d
  – 1.3.1.4 Upgrade D0 Runjob version used by Data Production; AL; MD, AL; Thu 10/30/08 to Fri 10/31/08; 2d
  – 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production; AL; REX; Fri 12/5/08; 0d
• 1.3.2 Slow Fwd-CAB Job Transition
  – Note: FileMax change requires a schedd restart (ST). Work into deployment plans.
• 1.3.3 1.4 Improved H/w Uptime Metrics
  – nSubmissions plot for Sep ’08 Mike?
• Post-Deployment topics and tasks covered in the “Deployment 1 Review”
  – Archiving Installation Instructions with all note-worthy comments in JIRA integrated
  – Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
  – Cost-benefit: push FWD, QUE nodes to be appliances
    • spec’d from OS (including OpenFileMax) to applications to grid system configuration
    • rapid wipe and re-install
• Past Notes: At mercy of off-site gridmap updates… need to use the existing automated system to keep all in sync
  – Also: no remote site has new VDT (which has new VOMS)
  – No installation instructions for durable locations server. Considering for Phase 2 of Initiative.