A workhorse supercomputer for the scientific community retires August 1
The Cray XK7 Titan supercomputer operated by the Oak Ridge Leadership Computing Facility (OLCF) at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) will be decommissioned on August 1 and disassembled for recycling. Performing up to 27 quadrillion calculations per second, Titan ranked as one of the world’s top 10 fastest supercomputers from its debut as No. 1 in 2012 until June 2019. Through more than 26 billion core hours of computing time, Titan has served hundreds of research teams around the world working on today’s most urgent scientific challenges.
A new generation of supercomputing
In 2009, the high-performance computing community was merely a year into surpassing the petascale barrier, achieving more than one quadrillion calculations per second on two DOE supercomputers: Roadrunner at Los Alamos National Laboratory and Jaguar at ORNL. CPU upgrades to the 2.3-petaflop Cray XT5 Jaguar, managed and operated by the OLCF, made it the world’s fastest petascale system on the November TOP500 list that year.
However, science never sleeps, and the OLCF was already planning its next supercomputer. This second-generation petascale system would need to be about 10 times more powerful than Jaguar to meet the growing computational needs of researchers working on complex problems in materials science, biology, physics, and other research domains. Even more challenging than increasing speed, the OLCF’s next supercomputer would need to meet DOE goals for cost and energy efficiency, meaning it would need to do 10 times the work while consuming roughly the same amount of energy.
Enter Titan, a new generation of supercomputer with a revolutionary architecture that combined 16-core AMD Opteron CPUs with NVIDIA Kepler GPUs, accelerators that tackled computationally intensive math problems while the CPUs efficiently directed tasks. When the system debuted at No. 1 in 2012, Titan delivered 10 times the performance of Jaguar, reaching a peak performance of 27 petaflops.
“Choosing a GPU-accelerated system was considered a risky choice,” said OLCF Program Director Buddy Bland. “A DOE independent project review committee insisted that we demonstrate that our users would be able to effectively use Titan for the broad range of modeling and simulation applications we support. We spent 6 months working with Cray, NVIDIA, and our users to convince the reviewers, DOE, and ourselves that GPUs would deliver what we needed. Yes, there was risk, but we developed effective ways to manage the risks and educate both our staff and users in how to use the system. The result has been a remarkably productive system that has led the way for many GPU-accelerated systems.”
Now in its seventh year of operation, Titan will be decommissioned to make room for a new scale of supercomputer: OLCF’s 2021 exascale system, Frontier.
OLCF is retrofitting 20,000 square feet of data center space that houses Titan, its Atlas file system, and the Cray XC30 Eos cluster, all of which will be decommissioned in August. Many OLCF system users are moving from Titan to the facility’s IBM AC922 Summit supercomputer, which was launched in 2018 and has its own data center space.
While researchers have continued to run large projects on Titan during the first half of 2019, June 30 will be the last day users can submit jobs to Titan or Eos (which is also 7 years old) in preparation for the August 1 decommissioning of both systems. Atlas will be decommissioned on August 15. The Rhea cluster will still be available to users but will transition from mounting the Atlas Lustre file system to mounting the Alpine GPFS file system that also supports Summit.
“Titan has run its course,” said Operations Manager Stephen McNally. “The components of Titan are now 7 years old, and it’s really impressive that users have been successfully producing high-impact science results since the system became available to them. But the reality is, in electronic years, Titan is ancient. Think of what a cell phone was like 7 years ago compared to the cell phones available today. Technology advances rapidly, including supercomputers.”
Decommissioning a computer of Titan’s size requires collaboration among onsite staff, facility vendors, and users. OLCF staff are supporting users who need to complete runs, save data, or transition their projects to Summit and other resources.
“We’ve communicated shutdown deadlines to users so they can be prepared while still getting high-quality research done,” McNally said. “One big task for users has been cleaning up 32 petabytes of data and moving data from Atlas to other storage systems.”
Electricians will safely shut down the 9-megawatt-capacity system, and Cray staff will disassemble and recycle Titan’s electronics and its metal components and cabinets (which predated the system as Jaguar’s cabinets).
“People ask why we can’t split up Titan and donate sets of cabinets to different research groups, but the answer is that it’s simply not worth the cost to a data center or university of powering and cooling even fragments of Titan,” McNally said. “Titan’s value lies in the system as a whole.”
OLCF users are encouraged to review detailed system deadlines here.
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://science.energy.gov.