Ever alert and always on the job, the NCCS’ control room operators keep its systems up and running
With more than 1,000 researchers around the world using the Oak Ridge Leadership Computing Facility’s (OLCF’s) IBM AC922 Summit this year to pursue some of the biggest mysteries in science, the nation’s most powerful supercomputer gets little downtime. At any given hour, it’s typically running complex codes to investigate everything from molecular dynamics to explosive stellar astrophysics. And that means somebody has to make sure Summit is running every minute of every day, without fail.
That duty falls upon the shoulders of the control room operators in the HPC Infrastructure Operations group at the National Center for Computational Sciences (NCCS), which is located (along with the OLCF) at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL). But keeping OLCF supercomputers running is only one task among a myriad of responsibilities entrusted to ORNL’s five operators: Jack Breazeale, Cindy Leach, George Phipps, Clifford Richards, and Teresa Wilson.
Together, the control room operators serve as a troubleshooting SWAT team that must routinely work outside the control room altogether, especially over the last year, as ORNL has operated with a sparse on-campus staff under COVID-19 protocols. Although most of ORNL’s computational scientists and administrators currently work from home, the operators are always on campus and on the job.
“Operators work 24 hours a day, 7 days a week, 365 days a year, so they’re a continual asset that we have on site,” said Paul Abston, HPC Infrastructure Operations group leader. “They’re my extra 10 arms and my extra 10 eyes. Most of them have had a presence in the OLCF since its conception.”
As their job title indicates, operators spend a lot of time in the Operations Control Room (OCR). It’s a small space packed with large monitors, each one reporting the current status of different systems: supercomputers, cluster machines, networks, and the High-Performance Storage System, among others. Two operators staff the OCR on each day shift and one on the night shift, watching for any sign of the dreaded color red, which indicates an issue. Work-slowing problems can involve anything from a login node going down to a file system hitting capacity. Once alerted, no matter the hour, operators first call whichever manager is responsible for the machine in question.
“We’re the guys that call the systems administrators at two in the morning, wake them up, and say, ‘Hey, you’ve got a problem on one of your systems,’” said Breazeale, who’s worked at ORNL for the last 31 years.
During the pandemic—even if those calls are made at 2 p.m. rather than 2 a.m.—the operators are now most likely the ones getting machines back in running order. With much of the ORNL staff working remotely, control room personnel often find themselves making quick fixes.
“Yesterday, for instance, I had to go up and reboot five different machines for people. They’re not onsite now—they can’t just walk across the hall from their office when there’s an issue and power-cycle the machine,” Breazeale said. “But if they’re trying to do a reboot and it doesn’t go right, we’re here for the administrators to call us and say, ‘Hey, can you go over and power-cycle a machine for me?’”
Operators do get some exercise by walking rounds each night through all the different facilities, looking for anomalies that might indicate a problem. Monitoring the electrical and cooling systems that support Summit and the other supercomputers is also on their to-do list, so they watch out for water leaks, smoke, or excess heat.
“In the event of an upset condition, whether it’s a power-quality event, a power outage, or the unexpected, they’re our first resource to bring things back to a stable state,” Abston said.
One such power event occurred in January when a pickup truck crashed into a high-voltage transmission line outside the ORNL campus, knocking the tower down and causing power outages throughout the area. Richards was working the day shift and had just finished an inspection walk-through when he started receiving automatic emails from the OLCF’s uninterruptible power supply (UPS) units reporting that the systems had switched to battery power. Richards referred to the Computational Sciences Building emergency manual and worked his way down the call list to notify administrators.
“I actually found out the cause of the outage by reaching out to all the points of contact that are in our response plan. A few of them were in traffic stuck behind the area where the accident occurred,” Richards said.
Richards and Facilities Engineer John Gutman conducted a complete walk-through of the data centers during the event, confirming that no systems were affected—the facility’s UPS units and backup generators had worked exactly as intended.
Some of the operators’ other duties are less dramatic: double-checking inventory to ensure no changes have been made inside the computer cabinets, taking in shipments, even delivering office mail.
Then there are the phone calls. During the day shift, operators typically receive calls from Summit users who are having trouble logging in and need help resolving it. But they also get the occasional wrong number. Phipps, who’s worked at ORNL for 37 years, has answered calls intended for the ORNL Federal Credit Union.
“We get the calls for the credit union—their phone number is similar to ours, so people will call and ask for their account numbers or try to transfer some money into their account,” Phipps said. “We just take these wrong-number calls in stride.”
But it’s the night shift that gets the more unusual calls.
“On off-hours you’ll get people who used to work here on a project years ago—the atomic bomb or something—and they’ll call wanting to know if that research is still going on,” Breazeale said. “But we get some wild phone calls, too. I had one guy call who wanted to talk about some type of experience he had with aliens.”
Being an expert in providing assistance no matter the situation, Breazeale did not hesitate to offer the caller a helpful resource: “I told him I wasn’t sure if I had the information that he was requesting, but that he might be better off to call back on the day shift—perhaps he could talk with someone who had more knowledge than I did on that situation.”
UT-Battelle LLC manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.