When the US Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL) committed its IBM Power System AC922 Summit supercomputer to conducting open research in the fight against COVID-19, it initiated an unprecedented all-hands effort. Summit is currently running simulations for at least ten individual research projects by other national labs and medical institutions around the country, from analyzing potential drug treatments to creating structural models of the virus. But amid this urgent undertaking, ORNL has another duty: to protect its staff. On March 16, lab leadership initiated a work-from-home policy.
So how do you run the world’s most powerful and smartest scientific supercomputer from your house, especially when you’re also conducting vital research to combat a global pandemic?
As it turns out, it’s not so difficult. In fact, working remotely via laptops and cell phones was already part of the daily routine for many staffers at the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility at ORNL. When it comes to operating Summit, OLCF support staff and researchers tend to work odd hours from diverse locations, even in the most normal of times. That habit has served as useful preparation for the center’s current mode of full-time remote operations.
The OLCF’s Acting Director of Science, Bronson Messer, has personally logged onto Summit and its supercomputer predecessor, Titan, from 17 different countries. As a computational astrophysicist, Messer uses Summit to solve scientific problems, conduct data analysis, work on codes, or perform maintenance such as moving data from one place to another. He does it all by logging into Summit via a personal computer.
“The distinguishing characteristic among all those tasks is I’m never sitting near Summit—I may be in the building coincidentally, but it doesn’t matter if I’m in the building or not,” Messer said. “I typically do this at all hours. I’m a night owl, so I tend to do it late at night, which means I’m always at home when I’m doing large tranches of work.”
One advantage of working from home is a readily available high-speed Internet connection, which can be difficult to locate on the road. A frequent business traveler, Messer can recall many desperate searches for a Hardee’s with Wi-Fi over the years, back when public hotspots were not so common. Today, broadband is a commodity that’s much easier to find, even in exotic locales.
Reuben Budiardja, a computational scientist in the National Center for Computational Sciences’ Scientific Computing group, managed to complete a particularly critical task while immersed in the splendors of Yosemite National Park: testing the brand-new Summit supercomputer before it went into production in 2018.
“I was already on a planned family vacation at Yosemite, but because of scheduling changes it coincided with when we needed to do the acceptance testing,” Budiardja said. “I ended up checking out hotspots in Yosemite, and I was actually able to do some acceptance testing while sitting in our rented redwood log cabin at Yosemite. The latency was terrible, but I was able to do what I needed to do on Summit.”
Of course, any group effort—especially running the fastest supercomputer in the world—means attending lots of meetings. Fortunately, as other professions have discovered, teleconferencing does get the job done, for the most part.
Antigoni Georgiadou, a postdoctoral research associate in the Scientific Computing group, is a mathematician by training, but for the past five years she has devoted herself to applying computational methods to astrophysics “to reveal the mysteries of the universe and stars.” And now she’s doing it from home with the aid of virtual meetings.
“My work is mostly scientific computing where I collaborate with groups outside of the lab, like Argonne National Laboratory, but it is essential to the progress of my work to participate in project meetings,” Georgiadou said. “But I can easily reach out to people through phone, emails, Slack, and Microsoft Teams. Despite working from home, all meetings are running as usual and if anything, everyone is on time—including myself—because there is no commute.”
Long-distance communication is a matter of course for the User Assistance and Outreach group, which is tasked with helping scientists around the world who use OLCF systems overcome technical issues ranging from resetting passwords to figuring out why jobs won’t run.
“Many of our users on Summit and the other supercomputers are remote, so we have facilities in place for remote access to the machines,” said Brian Smith, a senior high-performance computing (HPC) engineer who investigates tickets filed by users. “Our group generally doesn’t touch hardware, so we’re used to just having remote access to the big resources. Things like mail and chat programs are a daily part of our work routine as well, so we’re pretty fortunate in being able to work remotely pretty easily, at least from a technical perspective.”
Working from home can have its drawbacks, however, especially when it comes to unexpected distractions. Gustav Jansen, a scientific liaison in the Scientific Computing group for multiple INCITE and CAAR projects, finds his workdays growing longer as he tackles unforeseen household circumstances.
“Our dogs actually just had puppies—so we have 11 dogs in the house right now, in addition to three kids that are being homeschooled at the moment,” Jansen said. “So there’s a lot of stuff that needs to be done around the house, and I try to contribute a little bit, but that means my workday is spread out over more hours.”
Despite the easy transition to remote work for most of the OLCF staff, some people still have to stay on the ORNL campus to make sure everything is in working order. That includes craft support: electricians for electrical systems, HVAC mechanics, instrument technicians, and utility operators who make sure water keeps flowing to cool the supercomputer rooms. There is also always at least one operator per shift to monitor the operations center for anomalies.
Meanwhile, technicians need to be on hand in case any machine malfunctions occur, and that job falls primarily to the vendors who built the supercomputers and their associated equipment. IBM, for example, has a hardware technician on site every day (and two on Monday) to fix any issues that may arise on Summit. These technicians are part of a team managed by the OLCF’s Don Maxwell, an HPC systems engineer who serves as the task lead for the HPC Data and Operations group. He is already used to working from home, since calls about system failures can come at any time, often at odd hours.
“We’re on call, so we get calls in the middle of the night when something goes down,” Maxwell said. “So we are used to working from home, though not on the extended basis that we’re doing it now.”
Even as he successfully tackles technical issues from home, Maxwell also finds himself attempting to engineer domestic bliss.
“I have set up camp in our formal dining room, and my wife has set up camp in our bonus room,” Maxwell said. “And then my son, who is a sophomore at Purdue, has a desk set up in his bedroom. So all three of us are now working in our ‘virtual office’ Monday through Friday. But my daughter, who is a junior in high school, is basically about to lose her mind.”
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.