Development and Operations working group minimizes time to introduce new application features on ORNL resources
As computer complexity has grown, so too have the challenges associated with writing, testing, reviewing, and deploying code. In October 2014, a working group at the US Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL) received a significant event award for streamlining that process. ORNL employees receive significant event awards for making notable impacts or taking part in noteworthy events. The awards are disbursed biannually from ORNL’s variable pay budget.
The Development and Operations (DevOps) working group at the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility, began meeting in 2014 to address challenges associated with maintaining and deploying support applications on OLCF computational resources. The group developed a continuous integration and deployment (CI/D) system to drastically improve the efficiency, security, and flexibility of these applications on OLCF systems.
Staff members at the OLCF must develop customized applications to maintain security and efficiency on OLCF computing resources. The process can be labor-intensive. HPC Security Administrator Ryan Adamson explained the need for a more streamlined approach. “We’re a very big organization, so security is important,” Adamson said. “We have a lot of responsibilities for securing our systems, and everything comes back to ‘how do you know who has access to which systems?’”
Many OLCF staff members use an internally developed application, the Resource Allocation Tracking System (RATS), to manage end-user access to computing resources. Without regression testing, which reruns existing tests to uncover bugs introduced by a change, programmers can inadvertently break features they were not working on. Developer Lead Adam Carlyle noted that making significant changes to RATS or similar applications can have unintended consequences. “It’s critical to have this regression test suite as part of CI/D, because as a programmer, you can inadvertently break a feature that you’re not working on without knowing it,” Carlyle said.
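To make the idea of a regression test suite concrete, the following is a minimal Python sketch. It is not taken from RATS, whose code is internal to the OLCF. The allocation data, the function names has_access and projects_for, and the test cases are all hypothetical and exist only for illustration. The point is that each test pins down behavior that already works, so a later change that breaks it fails the suite before the change is deployed.

```python
import unittest

# Hypothetical allocation records (project -> set of usernames).
# This is an illustrative stand-in, not the real RATS data model.
ALLOCATIONS = {
    "proj-a": {"alice", "bob"},
    "proj-b": {"carol"},
}


def has_access(user, project, allocations=ALLOCATIONS):
    """Return True if the user is listed on the project's allocation."""
    return user in allocations.get(project, set())


def projects_for(user, allocations=ALLOCATIONS):
    """Return the sorted list of projects the user belongs to."""
    return sorted(p for p, users in allocations.items() if user in users)


class AccessRegressionTests(unittest.TestCase):
    """Each test records behavior that must keep working after any change."""

    def test_member_has_access(self):
        self.assertTrue(has_access("alice", "proj-a"))

    def test_non_member_is_denied(self):
        self.assertFalse(has_access("carol", "proj-a"))

    def test_unknown_project_is_denied(self):
        self.assertFalse(has_access("alice", "no-such-project"))

    def test_projects_for_user(self):
        self.assertEqual(projects_for("bob"), ["proj-a"])


def run_suite():
    """Run the regression tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccessRegressionTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

In a CI/D setup of the kind described here, a build server such as Jenkins would typically run a suite like this on every proposed change and block deployment if any test fails. How the OLCF pipeline is actually configured is not described in this article.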
DevOps designed the CI/D system using popular open-source software development tools like Git, Gerrit, and Jenkins. Git, a free, open-source version-control tool, allows support staff to take “snapshots” of an application codebase at any time, use metadata to track who made changes to a codebase, and revert to older versions should something go wrong. In fact, most of the tools the DevOps group used for the CI/D project are open-source and readily available to the public.
These large-scale changes made a big difference. Before the new CI/D system, it could take staff many days of downtime to deploy new application features, but the CI/D system has cut that time to 45 minutes. In addition to a massive efficiency increase for changing code, programmers are benefitting from a more collaborative environment. “Peer review allows you to collaborate and share ideas and also catch mistakes that other programmers have made,” Adamson said. “It’s like a second pair of eyes.” —Eric Gedenk
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.