The “Pioneering Frontier” series features stories profiling the many talented ORNL employees behind the construction and operation of the OLCF’s incoming exascale supercomputer, Frontier. The HPE Cray system is scheduled for delivery in 2021, with full user operations in 2022.
Electrical Engineering Specialist Rick Griffin had just powered up his first supercomputer installation at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) in early 2000 and was testing its load capacity when the fire alarms started ringing. That was not a good sign.
Even though Griffin had over 20 years of experience in designing electrical distribution systems for facilities, this was his first design for a building that would contain a supercomputer. ORNL’s newly constructed Computational Sciences Building (CSB) included a medium-voltage and low-voltage distribution system. According to Griffin, the first supercomputer installed in CSB was also the first of its kind, and ORNL had very little operating experience with machines like it.
“That has been the case with most of the supercomputers we have installed since then, but we have gotten better at understanding their inner workings and can project with reasonable accuracy how they will operate,” Griffin said. “At the start of testing of this first computer and before it was heavily loaded, we had a momentary interruption of power and the computer continued operating. I thought, ‘Well, this is great. I love these big machines!’”
But during acceptance testing, as the computer load was being increased, smoke began coming from one of the power distribution units. This set off the fire alarms and brought the ORNL Fire Department to the scene.
“The smoke wasn’t visible, but we could smell it, and it was detected by a very sensitive smoke detection system we installed in the data center. After a lot of searching for the source, we determined that it was a piece of fiberglass-reinforced plastic tape that was wrapped around an electrical connection that got a little hot and started smoking,” Griffin said.
Most enterprise computer equipment never reaches its peak power draw, but when this machine was fully exercised, it reached peak power. “Except for the tape overheating, the system was able to accommodate the load, but from this experience we learned that when vendors provide projected load numbers for their supercomputers, they’ll probably hit that number,” Griffin said.
Servers, network equipment, and similar small-scale IT equipment typically have a diversity factor: because the individual devices rarely peak at the same time, the combined power actually required is less than the sum of their peak loads. Supercomputers, which are typically parallel machines, will be run at maximum speed to verify performance. During this time, they will hit their maximum electrical load, and that load must be supplied by the electrical system.
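The difference can be illustrated with a little arithmetic. The sketch below is only an illustration: it uses the common textbook definition of diversity factor (the sum of the individual peak loads divided by the group's actual maximum demand), and the rack count, per-rack peak, and factor of 1.5 are hypothetical numbers, not ORNL figures.

```python
def diversified_demand(peak_loads_kw, diversity_factor):
    """Estimate the maximum demand of a group of loads.

    diversity_factor = (sum of individual peaks) / (group maximum demand),
    so it is always >= 1. A factor of 1.0 means every load peaks together.
    """
    if diversity_factor < 1.0:
        raise ValueError("diversity factor cannot be less than 1")
    return sum(peak_loads_kw) / diversity_factor


# Hypothetical example: 100 racks, each with a 10 kW peak rating.
rack_peaks_kw = [10.0] * 100

# Assumed enterprise-style load: racks rarely peak at the same moment.
enterprise_demand_kw = diversified_demand(rack_peaks_kw, diversity_factor=1.5)

# Supercomputer under a full-machine performance run: all racks peak together.
supercomputer_demand_kw = diversified_demand(rack_peaks_kw, diversity_factor=1.0)

print(f"Enterprise-style demand: {enterprise_demand_kw:.0f} kW")
print(f"Supercomputer demand:    {supercomputer_demand_kw:.0f} kW")
```

With a factor of 1.0, the electrical system has to be sized for the full sum of the rated peaks, which is the lesson Griffin drew from the first installation.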
Now, after helping install over 15 systems at ORNL, the 67-year-old Griffin is well versed in getting power safely and reliably to the world’s most advanced supercomputers. Currently, he’s putting the finishing touches on his 30-megawatt masterpiece of electrical design, the Oak Ridge Leadership Computing Facility’s (OLCF’s) Frontier, which will be the nation’s first exascale-class supercomputer, capable of a billion billion (10^18) floating-point operations per second.
“Frontier’s going to be the culmination of 20 years of experience with supercomputer electrical infrastructure,” Griffin said. “With every new system, we learn something we can apply to the next system. CSB initially had two unit substations to provide 3 megawatts of power for computers. When Frontier is completed, we will have 21 unit substations that can provide 56.5 megawatts of electrical capacity for computers in this building. Consequently, we have had to make a lot of changes to our facilities to accommodate that much of an increase in capacity.”
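The scale of that increase is easy to work out from the figures quoted in this story. The short calculation below is a rough sketch: the per-substation averages assume capacity is spread evenly across substations, which the story does not state, and the 30-megawatt figure is the Frontier design number mentioned earlier.

```python
# Figures quoted in the story.
initial_capacity_mw = 3.0     # CSB computer power at the start
initial_substations = 2
final_capacity_mw = 56.5      # CSB computer power once Frontier is complete
final_substations = 21
frontier_design_mw = 30.0     # Frontier's electrical design figure

# How many times larger the building's computer power capacity becomes.
growth_factor = final_capacity_mw / initial_capacity_mw

# Average capacity per unit substation (assumes an even split, which is
# an illustrative simplification rather than a documented design detail).
avg_initial_mw = initial_capacity_mw / initial_substations
avg_final_mw = final_capacity_mw / final_substations

# Share of the building's computer capacity that Frontier's design represents.
frontier_share = frontier_design_mw / final_capacity_mw

print(f"Capacity growth: {growth_factor:.1f}x")
print(f"Average per substation: {avg_initial_mw:.2f} MW -> {avg_final_mw:.2f} MW")
print(f"Frontier share of capacity: {frontier_share:.0%}")
```

By this estimate, the building's computer power capacity grows almost nineteenfold, and Frontier's design load alone accounts for a little over half of it.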
Frontier’s unprecedented power requirements inspired the engineering team at ORNL’s Laboratory Modernization Division to think big. Two 2.5-mile-long power lines running along 85 new power poles had to be installed to bring the additional power needed for Frontier to CSB. Also, the OLCF leadership’s offices had to be removed to make space for a new transformer room around the perimeter of the revamped data center. The electrical design was guided by Griffin’s long history of learning from previous projects and the difficulties that he encountered while operating these systems.
During the construction, testing, and operation of any new computer infrastructure, an evaluation system is triggered whenever an “off-normal event” occurs, such as smoking fiberglass-reinforced tape. The design team, facilities support personnel, and others assemble to analyze what happened, what was affected, why it was affected, and what can be done to prevent it from happening again. This practice has paid great dividends for ORNL’s supercomputer operations and design, and the lessons are shared among computing facilities across the DOE complex.
“Over the many years we’ve been doing this, we’ve discovered all kinds of nuances and little things that in the normal course of designing a support infrastructure for a supercomputer would normally get overlooked. But when something happens, these things really stand out and say, ‘Hey, here I am, you missed me,’” Griffin said.
One such lesson was applied during the design of the OLCF’s Summit supercomputer in 2018, when Griffin changed the overcurrent protection scheme for the circuits supplying the computer racks. The individual 20-amp circuit breakers in the Summit rack power distribution units weren’t rated to interrupt the fault current available from the 480-volt, 50,000-amp switchboards that would supply the racks. To remedy this, he worked with breaker manufacturers to get a fuse series-rated with the 20-amp breakers that could be installed upstream of the computer racks and limit current to levels compatible with the 20-amp breakers. The scheme was inadvertently tested when a dead short developed in a 277-volt receptacle inside one of the Summit cabinets. When a technician attempted to plug into the receptacle, a minor arc occurred at the plug.
“The receptacle had a short in it, and when he plugged it in, a fault occurred that resulted in the 20-amp breaker tripping and the upstream fuse opening. The energy of the arc was limited by the fuse, and the technician wasn’t injured. If the fuse hadn’t been in the circuit, the upstream breaker would have operated much slower than the 20-amp breaker protecting the receptacle circuit,” Griffin said. “This delay would have resulted in the 20-amp breaker trying to interrupt the fault and possibly failing catastrophically. That could have resulted in an injury to the technician. I feel very pleased that it performed as intended.”
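The underlying check is a simple comparison between the fault current available at a breaker and the current that breaker is rated to interrupt. The sketch below illustrates the idea only: the 50,000-amp figure comes from the story, but the 10,000-amp interrupting rating is a hypothetical value chosen for illustration, and a real series rating has to come from the manufacturers' tested combinations, not from a calculation like this one.

```python
def needs_upstream_current_limiting(available_fault_current_a, interrupting_rating_a):
    """Return True if a breaker cannot safely interrupt the available fault
    current by itself and therefore needs current-limiting protection upstream."""
    return available_fault_current_a > interrupting_rating_a


# Available fault current at the switchboards, as quoted in the story.
SWITCHBOARD_FAULT_CURRENT_A = 50_000

# Hypothetical interrupting rating for a 20-amp rack breaker (assumption).
ASSUMED_BREAKER_INTERRUPTING_RATING_A = 10_000

if needs_upstream_current_limiting(
    SWITCHBOARD_FAULT_CURRENT_A, ASSUMED_BREAKER_INTERRUPTING_RATING_A
):
    print("Breaker alone is inadequate: add a series-rated, current-limiting fuse upstream.")
else:
    print("Breaker can interrupt the available fault current by itself.")
```

When the available fault current exceeds what the small breaker can interrupt, the upstream fuse has to clear or limit the fault first, which is what happened in the receptacle incident.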
Griffin describes himself as a systems engineer since his job covers so many different responsibilities. He interfaces with customers to get requirements, creates layouts for the new computers, works with vendors to get information on their equipment and recommends necessary changes, conveys requirements to contractors, follows construction work, then keeps tabs on how the systems operate and what can be done to make the next design better. He feels that his most important contribution is to think ahead and anticipate future directions and requirements to make sure ORNL data centers can continue to expand, upgrade, and operate efficiently and reliably.
The systems Griffin has helped build include people: a staff whose members trust one another to conduct “real-time evaluations” of their work and make adjustments to improve efficiency, lower costs, and make things safer. He opts for the best ideas, no matter where they come from. “I always say, if you’ve got a better idea than me, then that’s the idea we’re going with,” Griffin said. He believes his organization has been successful over the years because of this attitude and approach to work.
“I’ve worked with many people in a lot of different situations in my life and have to say this is the best group of people I’ve ever worked with by far—and the results validate that,” Griffin said. “You get things done through people. It’s not money or projects or procedures or requirements or any of that. Those things are part of it, but if you’ve got the wrong people, you’re not going to succeed.”
To relieve stress, Griffin prefers the solitary activity of gardening. “That’s my therapy,” he said. “If I ever get to where I can’t walk, I’ve told my family to just take me to the edge of the garden, give me a little hoe, and I’ll get to work.”
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.