With big science comes big data. Large-scale simulations that run on leadership-class supercomputers operate at such high speeds and resolutions that they generate unprecedented amounts of data. The size of these datasets—ranging from a few gigabytes to hundreds of terabytes—makes managing and analyzing the resulting information a challenge in its own right.
Fortunately for users at the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL), an updated tool is helping them cut large datasets down to size.
The OLCF’s Advanced Data and Workflow (ADW) Group and the Computer Science and Mathematics Division’s Scientific Data Group (SDG) have worked together to scale R—the most commonly used data analytics software in academia and a rising programming language in high-performance computing (HPC)—to the OLCF’s Rhea, Eos, and Titan systems. Though R has typically been used to analyze smaller datasets on ordinary workstations, this development allows users to deploy the tool for big data analysis that scales to thousands of processors and speeds up analysis by at least an order of magnitude.
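In practice, an R script written for pbdR runs in the single-program, multiple-data style familiar from MPI codes: the same script executes on every processor, and each copy learns its own rank. The following is a minimal sketch of that model, assuming the pbdMPI package is installed; the launcher command and process count in the comment are illustrative only, since job-launch details vary by system.

# hello.R -- a minimal pbdMPI sketch; launched with something like
#   mpirun -np 1000 Rscript hello.R
# (the launcher and process count here are illustrative assumptions)
library(pbdMPI)

init()                              # initialize MPI; every rank runs this script

# report in from every rank rather than just rank 0
comm.print(paste("Hello from rank", comm.rank(), "of", comm.size()),
           all.rank = TRUE)

finalize()                          # shut down MPI cleanly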
“Simulation researchers are generating all of this data, but they’re not doing much analysis with it,” said Drew Schmidt, a graduate research assistant at the University of Tennessee, Knoxville, and codeveloper of the Programming with Big Data in R (pbdR) package. “With our infrastructure, there’s no reason not to.”
The open source pbdR packages emerged from two data analysis projects led by George Ostrouchov, a senior scientist in SDG, both of which used R: a DOE-funded project for the analysis of extreme-scale climate data at ORNL, where Wei-Chen Chen was a postdoc, and a National Science Foundation (NSF)–funded project for a Remote Data Analysis and Visualization Center at the National Institute for Computational Sciences, where Pragneshkumar Patel and Schmidt were computational scientists. The current pbdR project is funded by the NSF Division of Mathematical Sciences. The original pbdR team included Chen, Ostrouchov, Patel, and Schmidt.
“The main idea is to use some of the same scalable libraries that are already used by simulation science and supercomputers,” Ostrouchov said. “We not only made them easily accessible from R, but we also built infrastructure inside and outside R that makes it easier to implement statistical matrix methods in a highly scalable way.”
More specifically, Schmidt was responsible for the matrix methods; he wrote most of the code that makes it possible to conduct deep data analysis from the R language. Chen developed the Message Passing Interface (MPI)–based, high-level infrastructure, which allows for easier implementation of complex statistical computations on highly parallel supercomputers. Ostrouchov and his team worked on scaling R, while Mike Matheson of the ADW Group optimized the library and data input choices for best results on the thousands of cores on Rhea, Eos, and Titan. Matheson also ran benchmarks on all of the machines for thorough optimization.
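That high-level infrastructure hides most of the MPI bookkeeping behind ordinary-looking R calls. As a rough illustration of the style—the workload below is invented for the example—each rank might compute a statistic on its own chunk of data and then combine results with a single collective call:

library(pbdMPI)

init()

# hypothetical workload: each rank holds its own share of a larger dataset
local.x   <- rnorm(1e6)            # stand-in for this rank's chunk
local.sum <- sum(local.x)          # a purely local computation

# one collective call combines the per-rank sums across all ranks
total <- allreduce(local.sum, op = "sum")

comm.print(total)                  # rank 0 prints the global result

finalize()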
“It’s currently the most scalable infrastructure out there for data analytics,” Ostrouchov said. “Thirty years of research put together these libraries, but we only recently enabled supercomputing results to be used for data analytics, for which these libraries were rarely used in the past.”
At the 2016 OLCF User Meeting, Ostrouchov and Matheson presented a well-attended tutorial on how to use pbdR on OLCF systems, where R is now available as a module for users. Sreenivas Rangan Sukumar, the ADW Group leader, also presented a use case at the meeting that exemplified the pbdR project’s significance to the emerging area of high-performance data analytics. The group took a problem—conducting a principal components analysis (PCA) on a huge matrix—that takes several hours on popular cloud computing frameworks such as Apache Spark and solved it in less than a minute using R on OLCF high-performance hardware. Such near-real-time analysis of data will become more important in the future as additional scientific instruments generating large quantities of data are brought online.
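In pbdR, that kind of computation reads almost like serial R code. The sketch below, assuming the pbdDMAT package, runs a PCA on a randomly generated distributed matrix; the matrix dimensions are illustrative rather than the benchmark’s actual configuration, and a real analysis would read its input from disk instead of generating it:

library(pbdDMAT)

init.grid()                        # set up the process grid for the
                                   # ScaLAPACK-style distributed backend

# generate a random distributed matrix across all ranks; a real analysis
# would read and distribute data from disk instead
dx <- ddmatrix("rnorm", nrow = 100000, ncol = 500)

# prcomp() dispatches on the distributed-matrix class, so the call looks
# like ordinary R while the work runs across every rank
pca <- prcomp(dx, retx = FALSE)

comm.print(head(pca$sdev))         # leading standard deviations (assumed here
                                   # to come back as an ordinary R vector)

finalize()

Launched with, say, mpirun -np 64 Rscript pca.R (again, the launcher and process count are examples), the same script can scale from a workstation to thousands of cores without changes to the analysis code itself.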
“Our tests are revealing the trade-offs and benefits between performance and productivity for data analytics,” Sukumar said. “If a user wants to explore 10 different analysis methods but has limited access to HPC resources, Apache Spark–like frameworks are great. However, for situations where one needs interactive near-real-time analysis, the pbdR approach is much better.”
The pbdR project aims to close the wide gap between scientific computing and data analytics. More importantly, the team hopes its work will encourage new communities to embrace supercomputing resources. In particular, the OLCF seeks to help its own users, as well as users at several experimental facilities at ORNL—including the Spallation Neutron Source, the Manufacturing Demonstration Facility, and the Center for Nanophase Materials Sciences—with bigger and better data analysis.
“OLCF data liaisons are eager to help users with use cases that have analysis needs that require speed-up and scale-up over thousands of nodes,” Sukumar said.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.