New service enables users to conduct in situ analysis of big data on HPC systems

Apache Spark, developed at the AMPLab at the University of California, Berkeley, is a software framework for big data analysis. The Advanced Data and Workflow Group will now support a new service, Spark On-Demand, that enables users to instantiate Spark to conduct in situ data analysis on OLCF compute resources.
The ability to conduct massive-scale simulations on supercomputers and data-intensive experiments at different scientific user facilities is driving the convergence of high-performance computing (HPC) and data analytics. Addressing such emerging user needs at the Oak Ridge Leadership Computing Facility (OLCF), the Advanced Data and Workflow Group will support a new service, Spark On-Demand, that enables users to instantiate Apache Spark to conduct in situ data analysis on OLCF compute resources.
Spark—developed at the AMPLab at the University of California, Berkeley—is a popular open-source in-memory analytics framework for big data analysis that was designed to scale on commodity cloud computing hardware. Spark On-Demand users can leverage the functionality of the Spark framework and the plethora of community tools for scalable data analysis and machine learning on resources such as Rhea at the OLCF, a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).
Rhea is a data analysis cluster that allows scientists to analyze and visualize data produced on the OLCF’s Titan supercomputer after simulation runs. Rhea shares the Lustre file system, Atlas, with Titan to manage massive datasets. The Spark On-Demand service allows OLCF users (based on node-hour allocation and priority) to launch the Apache Spark framework as a user job, load data from Atlas onto the allocated compute nodes within that job, and execute the analysis routines. Once the analysis is complete and the results are saved to the file system, the compute resources are released automatically.
“This service is a win-win for both users and the facility,” said Sreenivas Rangan Sukumar, the Advanced Data and Workflow Group leader. “Our users no longer have to move petabytes of data to a different location in order to perform advanced analytics. They no longer need to invest in new analytics hardware dedicated for post-simulation processing. Furthermore, they can leverage scalable open-source software libraries for analytics being developed within the Spark framework to be scientifically productive.”
The idea for this service began as a Laboratory Directed Research and Development (LDRD) project in the Computational Science and Engineering Division at ORNL more than 3 years ago with an OLCF allocation through the Director’s Discretionary program. The aim of the LDRD project was to offer Apache Hadoop, Spark’s predecessor, on OLCF resources and study the performance of then-emerging big data technologies with leadership-class high-performance computers. Seung-Hwan Lim, a computer science researcher in the Computational Data Analytics Group, was the primary prototype developer. Dale Stansberry, a systems software engineer in the group, hardened the prototype into a supported service that users can launch.
Initial experiments with Spark On-Demand are revealing execution performance characteristics that differ from those of Apache Spark installations on dedicated cloud computing infrastructures, because of the differences in architecture, file systems, and software in HPC environments. Even so, users are able to conduct large-scale data analysis in a fraction of the time and at a fraction of the cost required to move massive datasets around.
The first users of this service were Sergei Kalinin and his colleagues at the Institute for Functional Imaging of Materials, who applied it to an image analysis problem at ORNL’s Center for Nanophase Materials Sciences (CNMS). The materials science team needed to perform principal component analysis (PCA) on image sequences, each consisting of 16,000 high-resolution microscopic images. The analysis was running out of memory in workstation tools such as MATLAB, and writing scalable analysis code from scratch to solve the memory issue was expected to cause a delay of several months. Because the OLCF group offers Spark On-Demand, and algorithms such as PCA are offered as part of a library within the Spark framework, CNMS was able to perform the analysis within a week. Furthermore, the execution time for the analysis dropped from what would have been several hours to a few minutes. Climate and fusion science users at the OLCF are now exploring the service.
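For readers unfamiliar with PCA, the following is a minimal sketch of the idea on toy data. It is not the CNMS team's code and does not use Spark: it is plain Python, the function name `first_principal_component` and the tiny `frames` dataset are hypothetical, and it finds only the first component by power iteration. Each row stands in for one flattened image, and the covariance matrix is never formed explicitly, which is the property that lets this kind of computation be distributed across nodes.

```python
import math
import random


def first_principal_component(frames, iterations=200, seed=0):
    """Return (component, variance) for the rows of `frames`.

    Hypothetical helper for illustration: each row is one flattened image.
    Uses power iteration on X^T X / (n - 1) without building that matrix.
    """
    n = len(frames)
    d = len(frames[0])

    # Center every pixel column so the component describes variation
    # around the mean image.
    means = [sum(row[j] for row in frames) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in frames]

    # Start from a random unit vector.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # proj = X v, then w = X^T proj / (n - 1).
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [
            sum(centered[i][j] * proj[i] for i in range(n)) / (n - 1)
            for j in range(d)
        ]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: no variance along v
        v = [x / norm for x in w]

    # Variance of the data projected onto the component.
    proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    variance = sum(p * p for p in proj) / (n - 1)
    return v, variance


# Toy "image sequence": four frames of two pixels each, all varying
# along the direction (1, 2).
frames = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, variance = first_principal_component(frames)
```

On a cluster, the per-row projections and the column sums in the loop are the parts that a framework such as Spark would spread across compute nodes, so no single machine needs to hold the full image sequence in memory.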
The software and service have both been tested on Rhea. John Harney, a data management scientist, is quantifying how this service could affect other OLCF operations. For example, he is testing what behaviors and performance implications to anticipate when 100 users ask simultaneously for five-node sessions versus one user asking for a 500-node job. The group members are trying to understand how the service will affect the supercomputer, the file system, and other subsystems. The service also will be offered on infrastructures made available through the Compute and Data Environment for Science (CADES), and it will become publicly available at the OLCF User Meeting in May.
The Advanced Data and Workflow Group will offer a demo of Spark On-Demand as part of the Data Analytics and Visualization Tools tutorial on May 23, before the official start of the OLCF User Meeting. The tutorial will help those attending the meeting to understand how to launch jobs, how to request a particular set of nodes, where the bottlenecks are, and how to use the service appropriately as a user resource. Users must indicate interest in the tutorial when registering for the meeting. For more information, please visit https://www.olcf.ornl.gov/training-event/2016-olcf-user-meeting/.
“This is an excellent outcome where an LDRD project has produced a service that our facility users can leverage to accelerate scientific discovery,” Sukumar said. “We are excited to offer our first service as the Advanced Data and Workflow Group.” – Miki Nolin
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.