
Fermilab Site Report
Mark O. Kaletka
Head, Core Support Services Department, Computing Division
CD mission statement
• The Computing Division’s mission is to play a full part in the mission of the laboratory and in particular:
• To proudly develop, innovate, and support excellent and forefront computing solutions and services, recognizing the essential role of cooperation and respect in all interactions between ourselves and with the people and organizations that we work with and serve.
How we are organized
We participate in all areas
Production system capacities
Growth in farms usage
Growth in farms density
Projected growth of computers
Projected power growth
Computer rooms
• Provide space, power & cooling for central computers
• Problem: increasing luminosity
  – ~2,600 computers in FCC
  – Expect to add ~1,000 systems/year
  – FCC has run out of power & cooling, cannot add utility capacity (rough load estimate sketched below)
• New Muon Lab
  – 256 systems for Lattice Gauge theory
  – CDF early buys of 160 systems + 160 existing CDF systems from FCC
  – Developing plan for another room
• Wide Band
  – Long term phased plan FY 04–08
  – FY 04/05 build: 2,880 computers (~$3 M)
  – Tape robot room in FY 05
  – FY 06/07: ~3,000 computers
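A minimal back-of-the-envelope sketch of the power and cooling problem above. Only the node counts (~2,600 in FCC, ~1,000 added per year) come from the slide; the per-node wattage and cooling overhead are assumed values for illustration, not measured FCC figures.

# Rough facility-load estimate for a computer room.
# Node counts are from the slide; WATTS_PER_NODE and COOLING_OVERHEAD
# are assumptions, not measured FCC figures.

NODES_IN_FCC = 2600          # current systems in FCC
NODES_ADDED_PER_YEAR = 1000  # expected additions per year
WATTS_PER_NODE = 300         # assumed average draw of a 1U dual-CPU node
COOLING_OVERHEAD = 1.5       # assumed ratio of total facility load to IT load

def room_load_kw(nodes: int) -> float:
    """Total facility load in kW, including cooling overhead."""
    return nodes * WATTS_PER_NODE * COOLING_OVERHEAD / 1000.0

for year in range(4):
    nodes = NODES_IN_FCC + year * NODES_ADDED_PER_YEAR
    print(f"year +{year}: {nodes} nodes -> ~{room_load_kw(nodes):.0f} kW")

Under these assumptions the FCC load is already above a megawatt and grows by several hundred kW per year, which is the kind of arithmetic behind the New Muon Lab and Wide Band build-outs.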
Computer rooms
Computer rooms
Storage and data movement
• 1.72 PB of data in ATL
  – Ingest of ~100 TB/mo
• Many 10’s of TB fed to analysis programs each day
• Recent work:
  – Parameterizing storage systems for SRM
    • Apply to SAM
    • Apply more generally
  – VO notions in storage systems
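For a rough sense of scale, a small sketch converting the archive numbers above into sustained rates; the 1.72 PB total and ~100 TB/month ingest are from the slide, while the 30 TB/day read figure is an assumed stand-in for "many 10's of TB".

# Sustained-rate view of the archive numbers (decimal units).
TB = 1e12
archive_bytes    = 1.72e3 * TB   # 1.72 PB in the ATL (from the slide)
ingest_per_month = 100 * TB      # ~100 TB/month ingest (from the slide)
read_per_day     = 30 * TB       # assumed value for "many 10's of TB" daily

print(f"sustained ingest : {ingest_per_month / (30 * 24 * 3600) / 1e6:.0f} MB/s")
print(f"sustained reads  : {read_per_day / (24 * 3600) / 1e6:.0f} MB/s")
print(f"months of ingest held in the archive: {archive_bytes / ingest_per_month:.0f}")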
FNAL Starlight dark fiber project
• FNAL dark fiber to Starlight
  – Completion: Mid-June, 2004
  – Initial DWDM configuration:
    • One 10 Gb/s (LAN_PHY) channel
    • Two 1 Gb/s (OC-48) channels
• Intended uses of link
  – WAN network R&D projects
  – Overflow for production traffic:
    • ESnet link to remain production network link
  – Redundant offsite path
General network improvements
• Core network upgrades
  – Switch/router (Catalyst 6500) supervisors upgraded:
    • 720 Gb/s switching fabric (Sup720s); provides 40 Gb/s per slot
  – Initial deployment of 10 Gb/s backbone links
• 1000B-T support expanded
  – Ubiquitous on computer room floors:
    • New farms acquisitions supported on gigabit ethernet ports
  – Initial deployment in a few office areas
Network security improvements
• Mandatory node registration for network access
  – “Hotel-like” temporary registration utility for visitors
  – System vulnerability scan is part of the process
• Automated network scan blocker deployed (see the sketch below)
  – Based on quasi-real time network flow data analysis
  – Blocks outbound & inbound scans
• VPN service deployed
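A minimal sketch of the flow-based scan detection idea, assuming flow records reduced to (source, destination, destination port) tuples; the fan-out threshold is illustrative, not the production blocker's actual policy.

from collections import defaultdict

FANOUT_THRESHOLD = 100  # assumed number of distinct targets that counts as a scan

def find_scanners(flows, threshold=FANOUT_THRESHOLD):
    """Return source addresses whose fan-out in one analysis window
    exceeds the threshold; `flows` is an iterable of
    (src_ip, dst_ip, dst_port) tuples from quasi-real-time flow data."""
    fanout = defaultdict(set)
    for src, dst, dport in flows:
        fanout[src].add((dst, dport))
    return [src for src, targets in fanout.items() if len(targets) > threshold]

# A host probing one port across a /16, inbound or outbound, would be
# flagged here and its address handed to the blocking mechanism.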
Central services
• Email
  – Spam tagging in place (see the filtering sketch below)
    • X-Spam-Flag: YES
  – Capacity upgrades for gateways, IMAP servers, virus scanning
  – Redundant load sharing
• AFS
  – Completely on OpenAFS
  – SAN for backend storage
  – TiBS backup system
  – DOE-funded SBIR for performance investigations
• Windows
  – Two-tier patching system for Windows
    • 1st tier under control of OU (PatchLink)
    • 2nd tier domain-wide (SUS)
  – 0 Sasser infections post-implementation
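Since the gateways only tag spam (X-Spam-Flag: YES), filing it is left to the recipient's mail client or delivery filter. A minimal client-side sketch, assuming mbox folders at hypothetical paths:

import mailbox

INBOX = "/var/mail/user"         # assumed mbox locations, for illustration only
JUNK = "/home/user/mail/junk"

def file_spam(inbox_path=INBOX, junk_path=JUNK):
    """Move messages the gateway tagged as spam into a junk folder."""
    inbox = mailbox.mbox(inbox_path)
    junk = mailbox.mbox(junk_path)
    inbox.lock()
    try:
        for key, msg in list(inbox.items()):
            if msg.get("X-Spam-Flag", "").strip().upper() == "YES":
                junk.add(msg)
                inbox.remove(key)
        inbox.flush()
        junk.flush()
    finally:
        inbox.unlock()

if __name__ == "__main__":
    file_spam()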
Central services -- backups
• Site-wide backup plan is moving forward
  – SpectraLogic T950-5
  – 8 SAIT-1 drives
  – Initial 450 tape capacity for 7 TB pilot project
• Plan for modular expansion to over 200 TB
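A quick capacity cross-check, assuming ~500 GB native per SAIT-1 cartridge (a vendor figure, not stated on the slide):

TAPE_NATIVE_TB = 0.5   # assumed SAIT-1 native capacity per cartridge
SLOTS_INITIAL  = 450   # initial tape capacity (from the slide)
PILOT_TB       = 7     # pilot project size (from the slide)

print(f"tapes needed for the pilot : {PILOT_TB / TAPE_NATIVE_TB:.0f}")
print(f"a full 450-slot load holds : {SLOTS_INITIAL * TAPE_NATIVE_TB:.0f} TB")
# ~14 cartridges cover the 7 TB pilot; a fully populated library is
# roughly 225 TB native, consistent with the "over 200 TB" target.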
Computer security
• Missed by Linux rootkit epidemic
  – but no theoretical reason for immunity
• Experimenting w/ AFS cross-cell authentication
  – w/ Kerberos 5 authentication
  – subtle ramifications
• DHCP registration process
  – includes security scan, does not (yet) deny access
  – a few VIP’s have been tapped during meetings
• Vigorous self-scanning program
  – based on Nessus
  – maintain database of results
  – look especially for “critical vulnerabilities” (& deny access)
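A sketch of the "deny access on critical vulnerabilities" step, assuming the scan results are kept in a SQLite table with host, severity, and resolved columns; the schema and severity label are illustrative, not the site's actual database.

import sqlite3

def hosts_to_block(db_path="scan_results.db"):
    """Return hosts whose latest self-scan shows an unresolved critical finding."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT host FROM findings "
            "WHERE severity = 'critical' AND resolved = 0"
        )
        return [host for (host,) in rows]
    finally:
        conn.close()

# The resulting list would be handed to the network blocking mechanism
# described on the previous slides.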
Run II – D0
• D0 reprocessed 600M events in fall 2003
  – using grid style tools, 100M of those events processed offsite at 5 other facilities
  – Farm production capacity is roughly 25M events per week
  – MC production capacity is 1M events per week
  – about 1B events/week on the analysis systems
• Linux SAM station on a 2 TB fileserver to serve the new analysis nodes
  – next step in the plan to reduce load on d0mino
  – station has been extremely performant, expanding the Linux SAM cache
  – station typically delivers about 15 TB of data and 550M events per week (implied averages sketched below)
• Rolled out a MC production system that has grid-style job submission
  – JIM component of SAM-Grid
• Torque (sPBS) is in use on the most recent analysis nodes
  – has been much more robust than PBS
• Linux fileservers are being used as "project" space
  – physics group managed storage with high access patterns
  – good results
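For scale, the averages implied by the SAM station numbers above (both figures from the slide):

TB = 1e12
bytes_per_week  = 15 * TB     # data delivered per week (from the slide)
events_per_week = 550e6       # events delivered per week (from the slide)

print(f"average event size   : {bytes_per_week / events_per_week / 1e3:.0f} kB")
print(f"average delivery rate: {bytes_per_week / (7 * 24 * 3600) / 1e6:.0f} MB/s")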
MINOS & BTeV status
• MINOS
  – data taking in early 2005
  – using “standard” tools
    • Fermi Linux
    • General-purpose farms
    • AFS
    • Oracle
    • enstore & dcache
    • ROOT
• BTeV
  – preparations for CD-1 review by DOE
    • included review of online (but not offline) computing
    • novel feature is that much of the Level 2/3 trigger software will be part of the offline reconstruction software
US-CMS computing
• DC04 Data Challenge and the preparation for the computing TDR
  – preparation for the Physics TDR (P-TDR)
  – roll out of the LCG Grid service and federating it with the U.S. facilities
• Develop the required Grid and Facilities infrastructure
  – increase the facility capacity through equipment upgrades
  – commission Grid capabilities through Grid2003 and LCG-1 efforts
  – develop and integrate required functionalities and services
• Increase the capability of the User Analysis Facility
  – improve how a physicist would use facilities and software
  – facilities and environment improvements
  – software releases, documentation, web presence, etc.
US-CMS computing – Tier 1
• 136 Worker Nodes (Dual 1U Xeon Servers and Dual 1U Athlon)
  – 240 CPUs for Production (174 kSI2000)
  – 32 CPUs for Analysis (26 kSI2000)
• All systems purchased in 2003 are connected over gigabit
• 37 TB of Disk Storage
  – 24 TB in Production for Mass Storage Disk Cache
    • In 2003 we switched to SATA disks in external enclosures connected over fibre channel
    • Only marginally more expensive than 3ware-based systems, and much easier to administer
  – 5 TB of User Analysis Space
    • Highly available, high performance, backed-up space
  – 8 TB Production Space
• 70 TB of Mass Storage Space
  – Limited by tape purchases and not silo space
US-CMS computing
US-CMS computing – DC03 & Grid2003
• Over 72K CPU-hours used in a week
• 100 TB of data transferred across Grid3 sites
• Peak numbers of jobs approaching 900
• Average numbers during the daytime over 500
US-CMS computing – DC04
1st LHC magnet leaving FNAL for CERN
And our science has shown up in some unusual journals… “Her sneakers squeaked as she walked down the halls where Lederman had walked. The 7th floor of the high-rise was where she did her work, and she found her way to the small, functional desk in the back of the pen.”