

  • Number of slides: 13

Environmental Monitoring and Alerting for Computing Room Facilities
Wednesday, November 17, 2004, 9:00 am – 10:00 am
Gerry Bellendir, Jack MacNerland, David Ritchie, and Mark Thomas

Agenda
• FCC
• New Muon -> LCC
• HDCF -> GCC
• Futures
• Vulnerabilities
• Discussion, Questions, etc.

FCC
Presented by Jack MacNerland
• Smoke detection
• Sprinklers
• Under Floor Fire Suppression
• Tape robot fire suppression
• Power Logic Electrical Panel Monitoring
• Security at FCC

FCC (cont’d)
• Presented by Mark Thomas
• FIRUS
  – New developments
    – Installed a FIRUS terminal in the OPS Office so OPS can monitor the chillers at New Muon.
    – Set up a page showing critical information for FCC, New Muon, HDCF, and Casey's Pond.
    – The Com Center monitors at night; FESS monitors during the day.
    – CD/OPS also monitors.

FCC (cont’d)
Presented by David Ritchie
• Metasys
  – UPS and Generator Monitoring and Alerting via Metasys
  – Current and future status (see Appendix A)
• Other Monitoring
  – CSS (Stan Naymola): two types…
    • lm_sensors
      – Can shut down systems that are hot.
      – Self-contained; works independently of any other system.
      – If >50% of the nodes are down, it notifies.
      – Single nodes that turn themselves off are recorded in the logs for investigation.
    • Independent temperature monitor located in the top of a rack
      – Recorded in ganglia as a record of room temperature.
      – Emails when the temperature crosses the high and low thresholds.
      – Does not page.
  – CDF (Glenn Cooper):
    • CDF nodes just have straight lm_sensors, using the RPM put together by the Farms group.
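The CSS monitoring described above combines three decisions: a node shuts itself down when it runs hot, people are notified only when more than half the nodes are down, and the rack-top monitor mails on threshold crossings rather than on every sample. The sketch below illustrates that logic under stated assumptions: the threshold values are hypothetical (the slides give none), and `read_hwmon_temps` reads the hwmon sysfs interface of a current Linux kernel, which is not necessarily how the 2004 lm_sensors RPM obtained its readings. It makes decisions only; it does not actually power anything off or send mail.

```python
import glob

# Hypothetical thresholds: the slides do not give the values CSS actually uses.
NODE_LIMIT_C = 70.0   # per-node sensor reading that triggers a self-shutdown
ROOM_HIGH_C = 27.0    # room-temperature e-mail threshold (high)
ROOM_LOW_C = 15.0     # room-temperature e-mail threshold (low)


def read_hwmon_temps():
    """Read all temperature inputs exposed by the Linux hwmon sysfs interface
    (the kernel drivers that lm_sensors reports on). Values are millidegrees C."""
    temps = []
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
    return temps


def node_should_shut_down(temps_c, limit_c=NODE_LIMIT_C):
    """Self-contained per-node decision: any sensor over the limit means shut down."""
    return any(t > limit_c for t in temps_c)


def farm_needs_notification(nodes_down, nodes_total):
    """Notify people only when more than half of the nodes are down."""
    return nodes_total > 0 and nodes_down * 2 > nodes_total


def room_crossings(readings_c, high=ROOM_HIGH_C, low=ROOM_LOW_C):
    """Yield (old_state, new_state, reading) each time the room temperature moves
    between the low/normal/high bands, so mail goes out on transitions only."""
    state = "normal"
    for t in readings_c:
        new_state = "high" if t > high else "low" if t < low else "normal"
        if new_state != state:
            yield (state, new_state, t)
            state = new_state
```

Making the shutdown decision purely local mirrors the "works independently of any other system" point: a node does not need the network or a central server to protect itself.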

New Muon -> LCC
Presented by Jack MacNerland
• Smoke detection
• Sprinklers
• Under Floor Fire Suppression
• Security (Pegasys)

New Muon (cont’d)
Presented by Mark Thomas
• FIRUS
  – The usual fire protection system
  – Chillers

New Muon (cont’d)
Presented by David Ritchie
• Metasys
  – See Appendix A.
• Other – CDF
  – See the lm_sensors discussion above.
• Other – Lattice QCD (Don Holmgren)…
  – Omega temperature box
    • In use for a couple of years.
    • Alarms on high/low temperature; dry-contact input.
  – Only notification: dials out to a 4-phone-number rotation until acknowledged. Currently: Call Center, Amitoj's office number, DH's office number.
• Other – Lattice QCD (cont’d)
  – IPMI
    • Reads out CPU and system temperatures and fans.
    • Includes vendor-specified thresholds.
  – Discussion…
    • Vulnerability: the Omega box may not be able to reach anyone (pre-call-center, post-operator-exit).
    • Addressed with a Netbotz unit, which:
      – connects to the network, can send e-mail, push files via FTP, and serve data via HTTP;
      – has a "last call" pager on power loss.
    • Have not switched to the Netbotz for notifying the call center; still use the Omega box.
    • Have the Netbotz unit set to send e-mail to the LQCD principals on various alarms.
    • Also have trend plots and a live web page:
      http://lqcd.fnal.gov/cgi-bin/netbotz
      http://netbotz.fnal.gov/
  • When a sufficient number of nodes are over temperature, we automatically declare an alarm and shut down…
    » batch queues,
    » operating systems, and
    » power off the nodes via IPMI.
  Independently, the Netbotz and Omega boxes can trigger an alarm that causes the LQCD and/or ISA groups to initiate shutdowns manually if necessary. We maintain trend plots for all measured quantities and send automated mailings listing nodes with bad fans and/or high temperatures. The trend plots are available by clicking on the vertical bars on:
    http://lqcd.fnal.gov/cgi-bin/stat?health=all
  or via individual nodes:
    http://lqcd.fnal.gov/cgi-bin/stat?health=qcd0102
    http://lqcd.fnal.gov/cgibin/stat?health=MRTG=qcd0102
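Two LQCD mechanisms above lend themselves to a short sketch: the Omega box's dial-out rotation that continues until someone acknowledges, and the automatic alarm that shuts down batch queues, then operating systems, then node power via IPMI once enough nodes are hot. This is a minimal illustration under assumptions, not the LQCD group's actual tooling: `place_call` is a hypothetical stand-in for the Omega dialer, the contact labels replace numbers the slides do not give (the fourth slot is assumed), the `<node>-ipmi` BMC host naming and the batch-queue step are assumptions, and the plan is assumed to cover the whole cluster once the alarm fires. The `ipmitool ... -E chassis power off` form is standard ipmitool syntax (`-E` reads the password from the `IPMI_PASSWORD` environment variable). The functions only build the command list; nothing is executed.

```python
import itertools

# Placeholder labels: the slides name the Call Center, Amitoj's office and DH's
# office as rotation entries, but give no numbers; the fourth slot is assumed.
ROTATION = ["call-center", "amitoj-office", "dh-office", "spare"]


def dial_until_acknowledged(contacts, place_call, max_calls=20):
    """Ring each contact in turn, wrapping around, until one acknowledges.
    place_call(contact) -> bool is a hypothetical stand-in for the Omega dialer.
    Returns (contact, attempts), or (None, attempts) if nobody answered."""
    attempts = 0
    for contact in itertools.cycle(contacts):
        attempts += 1
        if place_call(contact):
            return contact, attempts
        if attempts >= max_calls:
            break
    return None, attempts


def hot_nodes(node_temps_c, limit_c):
    """Names of nodes whose reported temperature exceeds the limit."""
    return sorted(name for name, t in node_temps_c.items() if t > limit_c)


def shutdown_plan(nodes, bmc_user="admin"):
    """Ordered shutdown: batch queues first, then operating systems, then power.
    Host naming (<node>-ipmi) and the batch step are assumptions."""
    plan = [("batch", "stop the batch queues (site-specific command)")]
    plan += [("os", f"ssh {n} shutdown -h now") for n in nodes]
    plan += [
        ("power", f"ipmitool -I lan -H {n}-ipmi -U {bmc_user} -E chassis power off")
        for n in nodes
    ]
    return plan


def check_cluster(node_temps_c, limit_c=70.0, min_hot=4):
    """Declare an alarm when at least min_hot nodes are over temperature and,
    in that case, return the shutdown plan for the whole cluster."""
    hot = hot_nodes(node_temps_c, limit_c)
    if len(hot) < min_hot:
        return hot, []
    return hot, shutdown_plan(sorted(node_temps_c))
```

Ordering the plan as queues, then operating systems, then power keeps new jobs from starting on nodes that are about to go down and lets file systems unmount cleanly before IPMI cuts power.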

HDCF -> GCC
Presented by Jack MacNerland
• Smoke detection
• Sprinklers
• Under Floor Fire Suppression
• Security at GCC (Pegasys)
Presented by Mark Thomas
• FIRUS
• UPS Monitoring and Alerting via Metasys
  – Connection under development
  – See Appendix A.

HDCF -> GCC (cont’d)
Presented by David Ritchie
• Other – lm_sensors (see above)
• Other – auto-shutdown when the UPS goes to batteries
  – Zonatherm/Liebert have an automatic shutdown capability.
    • It may be acceptable to shut down the PCs in GCC when the UPS goes to batteries.
    • Involves:
      – an agent PC running Liebert-provided software that senses the UPS dry-contact status;
      – software (SNMP) notifying each PC, IP address by IP address, that it should shut down.
  – Cost: ~$5,000.
  – Outstanding issues:
    • must hand-install the software on all ~1400 PCs, and
    • must manually enter 1400 IP addresses.
  – Liebert seems interested in a joint effort.
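The Liebert scheme above fans a single dry-contact signal out to every PC by IP address, and one outstanding issue is typing in 1400 addresses by hand. The sketch below shows, under assumptions, how the target list could instead be generated from subnets with Python's standard `ipaddress` module, plus the fan-out step. The subnets are hypothetical (the slides do not give GCC's addressing), and `notify` is a hypothetical hook standing in for the vendor's SNMP shutdown message; this is not Liebert's software or API.

```python
import ipaddress


def target_hosts(cidr, exclude=()):
    """List every host address in a subnet, minus exclusions, instead of
    typing ~1400 IP addresses into the agent by hand."""
    skip = {ipaddress.ip_address(a) for a in exclude}
    return [str(h) for h in ipaddress.ip_network(cidr).hosts() if h not in skip]


def on_ups_status(on_battery, hosts, notify):
    """When the UPS dry contact reports 'on battery', tell each PC to shut down,
    one IP at a time. notify(ip) -> bool is a hypothetical hook standing in for
    the vendor's SNMP message. Returns the IPs that could not be notified."""
    if not on_battery:
        return []
    return [ip for ip in hosts if not notify(ip)]
```

For scale, a single /21 subnet yields 2046 host addresses, enough for ~1400 PCs. Generating the list this way only helps if the GCC PCs sit in known address ranges, and it does nothing for the other outstanding issue, the per-PC software install.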

Other Matters
• Futures
  – Facilities Environmental Event Notification Scheme
  – Next Generation Metasys
• Vulnerabilities
  – FCC: loss of Casey's Pond water, or of anything in that causality chain, is the main vulnerability (JM).
  – New Muon: loss of electrical power and/or loss of water is the primary vulnerability (age? ownership?) (JM/DR).
  – HDCF: loss of cooling without a consequent loss of power is the main vulnerability (JM/DR).
• Discussion, Questions, etc.

Metasys – Current
• FESS (Mike Michalak): status as of 11/12:
  – FCC is operational. (Power Logic panel monitoring work still required?)
  – HDCF network connected to the Metasys panel. Mike: HDCF should be up on the Metasys System Extended Architecture (Next Generation) next week (week of 11/15?).
    • A power outage is required to tie in the power meters.
  – New Muon has no Metasys.
    • NAE purchased for New Muon.
    • Ready to plan connections at New Muon. A network connection will be required.
    • The Metasys System Extended Architecture (MSEA) will be installed with the new CRAC units as part of the New Muon project, which started on 11/15.
    • Monitoring of chilled water temperature, chiller status, and pump status on MSEA will then begin.

Metasys - Future
• FESS (Ted Thorson), Technology: status is:
  – The new Metasys system is ready for deployment,
    • awaiting approval of the Critical System Plan,
    • a prerequisite to buying the PIX firewall and VPN concentrator.
  – All existing equipment is on Ethernet, and
  – all existing equipment has been migrated to the new system.
  – However, no one will be able to see the equipment at HDCF or New Muon until the new system can be deployed.
• FESS (Roger Slisz), Critical System Coordinator: to-do list is:
  – Procure a VPN concentrator and a PIX firewall device.
  – Secure VPN accounts for the initial round of named users.
  – Complete the third draft of the CSP.
  – Train the initial round of named users on how MSEA works and what they can and cannot do with it.
This has been a long, complex project, begun in February 2001. It is now perhaps close to first deployment.