Titan Stability Helps Improve User Experience
Team effort produces stable outcome
The Oak Ridge Leadership Computing Facility’s (OLCF’s) Titan supercomputer has overcome a challenging launch and is now showing impressive stability.
“The machine is running very well,” said Don Maxwell, task lead in the OLCF’s HPC Operations Group. “Node failures are on par with what we’d expect. Things are going very well. We’ve only experienced one unscheduled outage in over five months and no unscheduled outages in 2014.”
Chris Fuson, of the User Assistance and Outreach Group, said his communication with users confirms this stability. “That just shows the maturity of the machine.” According to Fuson, jobs are interrupted less. “There is more uptime,” he said, “so more jobs can get through the queue and run to completion.”
“We don’t necessarily advertise that we are up more than normal,” he added. “The direct result of increased stability is that the users have an improved experience.”
After Titan was delivered, two rounds of rolling repairs were undertaken. The second was completed on December 17. For that phase, about 20 percent of the machine was taken off-line at a time. Since repairs were completed, the machine has been very stable and heavily utilized.
The fact that these challenges presented themselves is not unexpected, said OLCF Project Director Buddy Bland. “As we’ve seen many times with very large, first-of-a-kind systems, you’re likely to find abnormalities and manufacturing defects that might never be found anywhere else, just because there are so many different parts from so many different places.”
With this in mind, Maxwell said, an integrated team effort helped achieve Titan’s stability.
Maxwell leads a team of four Cray employees and three ORNL staffers who keep Titan functioning. Their work includes planning scheduled downtimes for software upgrades, troubleshooting system problems, and doing 24-7 duty to keep the machine running, with each team member taking a turn being on call. The beauty of this system, Maxwell said, is that every member of the team has a hand in Titan’s successful operation.
Since December 17, he noted, that work has not included anything unscheduled or unanticipated other than the aforementioned outage.
As Fuson said, that translates into an improved experience for Titan users. Fewer unscheduled outages and node failures reduce job interruptions, allowing jobs to run to completion with fewer restarts.
This increased stability has laid the foundation for higher-than-normal availability and utilization. Since January 1, 2014, OLCF users have completed 110,587 jobs on Titan and have used 1,611,330,832 core hours. The 2014 INCITE projects are off to a fast start and have collectively used a higher percentage of their allocation than ever before at this point in the allocation cycle. In addition, capability usage is at 62 percent for the year, which means 62 percent of the time utilized on Titan thus far this year across all allocation programs has come from jobs using more than 20 percent of the resource.
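For readers who want to see how a capability-usage figure of this kind is calculated, the following is a minimal sketch in Python. It is an illustration only, not the OLCF's accounting code. The function name, the (cores, hours) job records, and the 1,000-core machine are all hypothetical, and the sketch assumes that "time utilized" is measured in core hours.

```python
def capability_usage(jobs, total_cores, threshold=0.20):
    """Return the fraction of used core hours that came from capability jobs.

    jobs        -- iterable of (cores, wall_hours) pairs (hypothetical records)
    total_cores -- size of the machine in cores (assumed input)
    threshold   -- a job counts as "capability" if it uses MORE than this
                   fraction of the machine (20 percent, per the article)
    """
    total_core_hours = 0.0
    capability_core_hours = 0.0
    for cores, wall_hours in jobs:
        core_hours = cores * wall_hours
        total_core_hours += core_hours
        if cores > threshold * total_cores:
            capability_core_hours += core_hours
    # Avoid dividing by zero when no jobs have run.
    if total_core_hours == 0:
        return 0.0
    return capability_core_hours / total_core_hours


# Made-up jobs on a made-up 1,000-core machine.
example_jobs = [(100, 10), (300, 10), (500, 4)]
share = capability_usage(example_jobs, total_cores=1000)
print(f"Capability usage: {share:.0%}")
```

In this invented example, the 300-core and 500-core jobs exceed the 20 percent threshold and account for 5,000 of the 6,000 core hours used, so the capability share is about 83 percent.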