OLCF staff set up technology to run deep learning software packages on new systems

Adam Simpson (front) and Matt Belhorn (back), high-performance computing user support specialists at the OLCF, use the Singularity application to develop containers that will allow newer systems to run deep learning packages.

Deep learning is emerging as a powerful branch of machine learning that uses artificial neural networks to help scientists find connections hidden within big data. But deep learning software packages—such as the ones used to recognize images, generate data for self-driving cars, translate speech, and identify drug candidates for cancer patients—are complex and require specific environments to run.

User support specialists at the Oak Ridge Leadership Computing Facility (OLCF) are using a technology called containers to bundle an operating system and software into a single file, which will make it easier for researchers to run deep learning software on the OLCF’s supercomputers. The OLCF is a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).

“The supercomputers run a relatively old but extremely stable operating system,” said Adam Simpson, high-performance computing user support specialist. “These newer deep learning software packages assume you’re running on one that’s up-to-date, so it’s difficult to get some of them installed on an enterprise-level system. Containers allow us to run on top of a supercomputer’s operating system.”

In February, staff members in the OLCF’s User Assistance and Outreach Group successfully ran multiple containers on Summitdev, the OLCF’s early-access version of its next-generation supercomputer, Summit.

An application called Singularity is allowing OLCF staff members to create customized software stacks that include newer operating systems as well as various deep learning frameworks. Users can then run their own applications using the software inside the container, which makes it possible to use packages that previously could not be run on these systems.

“When users run a container, they can interact with an operating system they are familiar with rather than the enterprise host operating system,” Simpson said.
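For readers unfamiliar with how a container is invoked, the following is a minimal sketch, not the OLCF's actual workflow. It assembles a `singularity exec` command line, which runs a program inside a container image instead of on the host operating system. The image name `deep_learning.img`, the script name `train.py`, and the helper function are hypothetical; the optional `--nv` flag exists only in Singularity releases with NVIDIA GPU support, and whether it was used on Summitdev is an assumption.

```python
import shlex


def singularity_exec_argv(image, command, use_gpu=False):
    """Build the argument list for running `command` inside a Singularity image.

    Hypothetical helper for illustration; it only assembles the command and
    does not launch anything.
    """
    argv = ["singularity", "exec"]
    if use_gpu:
        # --nv asks Singularity to make the host's NVIDIA driver libraries
        # visible inside the container (assumed to be available here).
        argv.append("--nv")
    argv.append(image)    # the single file that bundles the OS and software
    argv.extend(command)  # the program to run inside the container
    return argv


# Hypothetical image and script names.
argv = singularity_exec_argv("deep_learning.img", ["python", "train.py"], use_gpu=True)
print(shlex.join(argv))
```

The program named after the image runs against the libraries and operating system packaged in the image, which is why users see a familiar environment rather than the host's.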

Containers will ultimately allow newer systems like Summitdev and the DGX-1 deep learning system to run packages like TensorFlow, an open source library for machine learning. TensorFlow is one of six software packages being evaluated by team members working on the CANcer Distributed Learning Environment (CANDLE), a project that aims to build a deep neural network for studying cancer behavior and treatment outcomes. CANDLE is funded by the Exascale Computing Project—a joint effort between the Office of Science and the National Nuclear Security Administration to develop an exascale computing system and environment—and includes collaborators from NVIDIA, the National Cancer Institute, and other DOE laboratories.

Arvind Ramanathan, ORNL’s technical lead for CANDLE, said that scaling existing deep learning platforms on newer machines is one of the project’s main goals but that many of the platforms have dependencies, which are additional programs that a software package relies on to do its work.

“Deep learning packages have quite a few dependencies,” he said. “We can’t expect every user to come up with all the necessary software packages and files needed to run these platforms, and we can’t track everyone and see how each person is individually compiling software. With containers, we create a common environment up front so that the user can run their code efficiently.”

Ramanathan said that containerizing software will eliminate the hassle of optimizing individual compute environments, allowing users to easily run deep learning packages on new resources. Simpson led the effort to build and configure the containers in February, receiving assistance from other OLCF staff members throughout the project. HPC Systems Administrators Don Maxwell and Matt Ezell, HPC User Support Specialist Matt Belhorn, and HPC Cyber Security Administrator Ryan Adamson contributed to the collaborative effort, which succeeded in running the deep learning packages MXNet, Caffe, TensorFlow, Theano, PyTorch, and Neon on Summitdev via containers. Altogether, the team spent more than 100 hours on the project.
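One simple way to confirm that a common environment provides the expected dependencies is to check, from inside the container, which frameworks Python can locate. The sketch below is illustrative only and is not the team's test procedure. The module names are the usual import names for each package, and it is an assumption that a given container installs them under those names.

```python
import importlib.util

# Usual Python import names for the packages named in the article
# (assumed; a particular container may install them differently).
FRAMEWORK_MODULES = {
    "MXNet": "mxnet",
    "Caffe": "caffe",
    "TensorFlow": "tensorflow",
    "Theano": "theano",
    "PyTorch": "torch",
    "Neon": "neon",
}


def available_frameworks(modules=FRAMEWORK_MODULES):
    """Map each package name to True if its module can be found, else False.

    find_spec only locates the module; it does not import it, so the check
    is quick and does not initialize any GPU libraries.
    """
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


for name, found in available_frameworks().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Run inside a container, a report like this shows at a glance whether the environment matches what a user's code expects before any job is submitted.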

“We are trying to support as many popular platforms as possible,” Ramanathan said. “Right now, we are trying out different combinations of software and systems to see which ones work and don’t work. But now that we’ve scaled containers on Summitdev, users can begin testing them to give us an idea of what improvements we need to make so that we’ll never have to worry about the back end.”

This effort was a pilot project to evaluate containers in the context of the OLCF’s leadership-class computing resources. Staff members will continue testing and developing the containers with an eye toward making them production-ready in the future.

Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.