The “Pioneering Frontier” series features stories profiling the many talented ORNL employees behind the construction and operation of the OLCF’s incoming exascale supercomputer, Frontier. The HPE Cray system was delivered in 2021, with full user operations expected in 2022.

Rafael Ferreira da Silva’s job is to make life simple—at least for the thousands of users who access the Oak Ridge Leadership Computing Facility’s (OLCF’s) computing resources each year.

That’s because Ferreira da Silva, a senior R&D scientist in the National Center for Computational Sciences, works on the design and deployment of scientific workflow applications and tools used on supercomputers at the OLCF. These tools are an important part of the user experience and success on machines such as the IBM AC922 Summit and the HPE Cray EX Frontier, which recently debuted as the first exascale machine and the fastest supercomputer in the world according to the TOP500 list.

At its most basic, a computational workflow is simply a series of steps taken to achieve a specific goal or objective, such as a simulation or data analysis. For example, before scientists can use data for research, it may need to be cleaned, organized, and corrected in a process that often requires managing the data across multiple systems. Effective scientific workflows automate these steps and allow users to delegate the process to a computer, which is especially good at handling repetitive tasks. As a result, scientists are able to carry out their research more effectively.
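To make the idea concrete, here is a minimal sketch of such a workflow in Python. It is an illustration only, not one of the OLCF-supported tools: the step names (`clean`, `organize`, `correct`), the record layout, and the calibration offset are all hypothetical stand-ins for the cleaning, organizing, and correcting described above.

```python
from typing import Callable

Record = dict
Step = Callable[[list], list]


def clean(records: list) -> list:
    """Drop records whose reading is missing (hypothetical cleaning rule)."""
    return [r for r in records if r.get("value") is not None]


def organize(records: list) -> list:
    """Sort records by timestamp."""
    return sorted(records, key=lambda r: r["time"])


def correct(records: list, offset: float = 0.5) -> list:
    """Subtract an assumed calibration offset from every reading."""
    return [{**r, "value": r["value"] - offset} for r in records]


def run_workflow(data: list, steps: list) -> list:
    """Run each step in order, feeding the output of one into the next."""
    for step in steps:
        data = step(data)
    return data


raw = [
    {"time": 2, "value": 10.5},
    {"time": 1, "value": None},
    {"time": 0, "value": 8.5},
]
result = run_workflow(raw, [clean, organize, correct])
print(result)
```

Once the steps are written down as a list, the computer can repeat the whole sequence on every new dataset without a person running each stage by hand.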

Ferreira da Silva and his team currently support scientific workflows on the IBM AC922 Summit supercomputer. Recently, they have been working to scale these workflows to HPE Cray EX Frontier, the world’s first exascale machine. Image: Carlos Jones/ORNL

Ferreira da Silva is one member of the scientific workflows team in the Data Lifecycle and Scalable Workflows Group at Oak Ridge National Laboratory (ORNL). Working alongside him are Research Scientist Sean Wilkinson, Linux Systems Engineer Ketan Maheshwari, and Research Scientist Tyler Skluzacek. With Frontier user access on the horizon, the team, which is well practiced in supporting workflows on Summit, has shifted its focus to preparing for the new system, which will be open to users for full science at the beginning of 2023. The team has been using the Crusher test system to evaluate and scale existing workflows for a seamless transition.

“We are part of the Advanced Technologies Section at ORNL, so we are thinking about the future, and the future is Frontier. The goal is trying to think ahead and not be deploying software as the users come on but to have the software available already,” said Ferreira da Silva.

For Ferreira da Silva and the workflows team, the user is the focus. Much of their work involves sifting through the hundreds of existing workflow systems to evaluate their capabilities and compatibility with OLCF systems before installing and configuring them and, finally, training users. Scaling these workflows to Frontier, which will be capable of at least 1.5 exaflops (more than 10^18 operations per second), presents a new challenge.

“On Frontier, the situation becomes even more complex because we are talking about thousands and thousands of cores and much more data—so, larger and more complex computations. With workflows, we want to make the life of the user simpler. We want to help users realize the full power of Frontier as a tool for scientific discovery.”

Today, the team officially supports five workflow systems: Parsl, Swift/T, Ensemble Toolkit, MLflow, and FireWorks. Each workflow system was developed to solve problems that fit a particular paradigm, and each supports different types of workloads. For example, MLflow provides sets of tools tailored for AI and machine learning workflows; Parsl provides capabilities for running distributed computing jobs across different sets of resources; and Swift/T is designed to leverage systems that use the Message Passing Interface (a message-passing standard for parallel computing) to improve the efficiency of large-scale workflows. The library of supported systems is constantly growing as new systems and user requirements are evaluated, tested, and integrated.
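As a rough illustration of the task-parallel paradigm that several of these systems target, the sketch below runs a set of independent tasks concurrently and then combines their results. It deliberately uses only Python's standard library `concurrent.futures` module as a stand-in; it is not the API of Parsl or of any other system named above, and the `simulate` and `summarize` functions are hypothetical placeholders for real scientific tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed: int) -> int:
    """Hypothetical independent task; a real one might be a simulation run."""
    return seed * seed


def summarize(values: list) -> int:
    """Hypothetical final step that depends on every task's output."""
    return sum(values)


def run_ensemble(seeds: list, max_workers: int = 4) -> int:
    """Run the independent tasks concurrently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(simulate, seeds))
    return summarize(results)


total = run_ensemble([1, 2, 3, 4])
print(total)
```

A workflow system applies the same pattern at a much larger scale, scheduling the tasks across many compute nodes rather than across threads on a single machine.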

Beyond the ins and outs of scientific workflows, the team must also consider how its work at ORNL fits into the broader workflows community and supports OLCF users who come from around the world.

“We want to be sure that ORNL is well positioned in the workflows community and ensure that any decisions we make here about how we support workflows, and how we enable workflows for the users on Summit and Frontier, are aligned with the larger community.”

Earlier this year, Ferreira da Silva helped launch the Workflows Community Initiative, a free online platform dedicated to training, resources, workshops, and news about scientific workflows. The initiative's emphasis on centrality and connectedness is a theme reflected in larger projects that the workflows team focuses on at ORNL, including efforts to create a framework that connects systems and resources (i.e., compute, storage, networking) across US Department of Energy (DOE) facilities. This project is aligned with the needs identified by the DOE Integrated Research Infrastructure effort. Its goal is to create an ecosystem of systems that simplifies the learning process for users and streamlines the design, development, and troubleshooting of supporting systems such as scientific workflows and data management frameworks.

“The idea is to connect current popular software and systems that are available on Summit, Frontier, and other computational systems and build this common framework that would integrate all these different research infrastructures,” Ferreira da Silva said.

Navigating the technical intricacies of projects at this scale is no small feat, but the challenge is worth it for Ferreira da Silva.

“The challenge and reward for me is enabling these solutions to run in a seamless way. Because it’s not only computing but also networking, power, and cooling systems—there are so many aspects involved that it’s at another magnitude. And we’re solving large-scale problems that cannot be solved on any other kind of system.”

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit