With new offerings of data analytics tools, OLCF supports the increasing role of data analytics in large-scale science

This story is the second in a series called “Data Matters,” a collection of features dedicated to exploring the relevance of data for scientific advancement and exploration at the Oak Ridge Leadership Computing Facility, across Oak Ridge National Laboratory, and beyond.

Taking advantage of a computer’s full memory capacity and network interconnections is no small task, particularly when it comes to using Summit, the world’s most powerful supercomputer for open science, located at Oak Ridge National Laboratory (ORNL).

With a computing power of 200 petaflops, the Oak Ridge Leadership Computing Facility (OLCF) Summit supercomputer is capable of 200 quadrillion calculations per second, allowing scientists to study—and sometimes produce—very large sets of data.

“Scientific simulations and experiments can involve enormous datasets. Working with such large amounts of information requires adequate software for scaling methods of analysis, so data can be distributed efficiently across all nodes of Summit,” explained ORNL biophysicist Ada Sedova.

To propel more powerful scientific research, the OLCF—a US Department of Energy (DOE) Office of Science user facility located at ORNL—is now working to offer its users a robust artificial intelligence and advanced analytics software stack that would allow for data processing, deep learning, and analysis at a leadership scale. Sedova, along with other OLCF researchers, is part of a team taking an active look at how different types of software can optimize scientific research on Summit.

The stack offered by OLCF includes software like IBM Watson ML, a new platform for massively parallel distributed deep learning (DDL) using GPUs on Summit, and emerging capabilities including NVIDIA RAPIDS and DASK. “These frameworks provide a promising solution to accelerate and distribute analytics workloads on POWER9 architecture. However, Summit’s unique configuration requires careful investigation to deploy, test, and benchmark NVIDIA RAPIDS and DASK before they become part of Summit’s analytics software stack,” said Benjamín Hernández, a computer scientist in OLCF’s Advanced Data and Workflow Group. Hernández is part of the OLCF team working on Summit’s software stack.

During Summit’s deployment and performance evaluation process, the team worked closely with the NVIDIA RAPIDS and DASK teams. They recently shared their preliminary findings with the community during the NVIDIA GTC 2020 conference, which took place virtually in March.

“Other software, such as pbdR, can help users achieve high-performance big-data analysis using the popular R programming language, which is the most commonly used statistical data analysis software among academic researchers,” explained Junqi Yin, a computational scientist in OLCF’s Advanced Data and Workflow Group.

Until now, scientists have not typically used hundreds or even thousands of nodes to crunch data with R. pbdR is changing that by helping users break up very large datasets, of a size not normally handled with R codes, into smaller pieces. Scientists can then use multiple nodes of Summit to crunch the data with the help of the Message Passing Interface (MPI), a standardized interface that allows individual processes on distributed nodes of a supercomputer to communicate with one another.

“This can outperform another software called Spark, in many analytic tasks that are commonly used in industry. Currently, pbdR is deployed within the R module available on Summit,” added Yin.
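The split-and-combine pattern described above can be illustrated with a minimal sketch. It uses plain Python rather than R, pbdR, or MPI, and it simulates the "ranks" with an ordinary loop in a single process. The function names (`split_into_chunks`, `local_summary`, `distributed_mean`) are hypothetical and chosen only for illustration; they are not part of pbdR or any MPI library.

```python
def split_into_chunks(values, n_ranks):
    """Split values into n_ranks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks each take one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


def local_summary(chunk):
    """Work each simulated rank does on its own piece: a partial sum and a count."""
    return sum(chunk), len(chunk)


def distributed_mean(values, n_ranks):
    """Combine the per-rank partial results, much as a sum reduction would."""
    partials = [local_summary(chunk) for chunk in split_into_chunks(values, n_ranks)]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(1, 101))  # stand-in for a dataset too large for one node
    print([len(c) for c in split_into_chunks(data, 6)])
    print(distributed_mean(data, n_ranks=6))
```

In a real distributed run, each rank would hold only its own chunk in memory, and the final combination step would be carried out by communication between processes rather than by a loop.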

Accelerating AI

The introduction of these tools has allowed computational scientists to take a fresh look at the possibilities offered by artificial intelligence (AI).

Until recently, prohibitive development costs had kept high-performance computing (HPC) almost exclusively reserved for scientists looking to model large complex systems. But in the last few years, the rise of AI and machine learning (ML) tools has brought a new dimension to the use of HPC, with researchers turning to these new tools to enable more facile and deeper analysis of the data from simulations and experiments.

This new convergence of AI and HPC calls for novel programming models and environments that can run on computers like Summit, where the main source of acceleration comes from GPUs, and that can efficiently use specialized hardware like tensor cores.

Popular interpreted high-level programming languages like Python, typically used to develop new algorithms and for data analysis, have not been the preferred choice for HPC programming because they are slower at running large calculations. Software like RAPIDS provides the best of both worlds, allowing the high-level Python interface to be used with GPU-accelerated libraries as a backend, which lets data scientists develop new algorithms and rapidly deploy them at scale on HPC resources.

“Many advanced analytics algorithms that are vital to AI and ML are currently being developed in Python due to its ease of use and large set of dedicated open source libraries. More and more frequently we are seeing Python being used as a binding element for compiled libraries and to develop code capable of analyzing data coming from ML frameworks. From an HPC perspective, we need to rethink the way we have and haven’t been using these tools and make space for them. This new software stack will allow us to look at these tools with a fresh set of eyes,” said Matthew Baker, a computer scientist at ORNL’s Computer Science Research Group in the Computer Science and Mathematics Division (CSMD), as well as the National Center for Computational Sciences (NCCS).

Distributed deep learning is a challenge for science due to its algorithmic complexity. Scientists need to consider many factors when porting existing deep learning architectures from single-node to distributed versions, such as the ones used on Summit.

Yin and Sedova agree that there’s a need to consider many changing aspects of the convergence of AI and HPC, such as certain parameters and learning rates. Best practices are not completely clear, and most research has been done by industry, so academics are not familiar with it.

“What moves us to try to understand all of these factors is a desire to be at the forefront of enabling new convergence for science,” said Yin.

Full circle of data

Moving forward, OLCF will make available one of the world’s first exascale computer systems, Frontier. That means the possibility of larger simulations and data volumes, with even larger outputs of data.

But unlike Summit’s architecture, which was developed by IBM and NVIDIA, Frontier’s will be developed by AMD. This presents new challenges for continuous efficient data analysis, a factor of utmost importance for users of any supercomputer. “This is part of the work we have ahead of us,” said Oscar Hernández, a computer scientist who specializes in HPC programming models at ORNL’s Computer Science Research Group in CSMD and NCCS.

The team hopes that by showcasing this effort, OLCF will also be able to attract the right projects to apply for time on Summit.

By using efficient and performant data analysis software, scientists are able to improve distributed deep learning, which, in turn, creates a call for even larger datasets to be analyzed with Summit and, in the near future, Frontier.

“The speed of training and number of nodes on Summit makes finding an appropriate dataset very difficult: millions to billions of training data samples are required to make this search efficient. The number of datasets that fit this description in open science are limited but can be found. They are definitely out there, and we want them to know that we are ready to host them,” said Sedova, who is currently working as a sub-lead in supercomputing research related to COVID-19. She also has a specific interest in the use of software for protein structure prediction and work in complex biosystems.

To foster a better understanding of the convergence of HPC and AI at exascale and beyond, ORNL will organize the 2020 Smoky Mountains Conference for Computational Sciences and Engineering (also called SMC2020) in August in Kingsport, Tennessee. This year the main topic of the conference is how the convergence of HPC and AI, connected federated instruments, data life cycles, and the deployment of leadership-class systems can help solve grand scientific data challenges. For more information about the conference, visit http://smc.ornl.gov

Related Publication: H. Lu et al. “S21517: NVIDIA Rapids on Summit Supercomputer: Early Experiences.” GPU Technology Conference 2020, San Jose, California (March 22–26, 2020).

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.