Preaching pbdR

Since bursting onto the scene in the early ’90s, high-performance computing (HPC) has become the most productive method for exploring ambitious problems in science that require substantial computational power. Yet in 2018, researchers in the statistical sciences, in fields ranging from biology to economics to sociology, have not fully embraced the power of HPC.

George Ostrouchov, an ORNL senior data scientist and member of the OLCF’s Advanced Data and Workflow Group, spoke about the parallel computing software pbdR at the largest annual meeting of statisticians and data scientists in the world.

New tools, however, are making supercomputing more palatable to reluctant data scientists. Speaking at the largest annual conference for statisticians and data scientists in the world, the Oak Ridge Leadership Computing Facility’s (OLCF’s) George Ostrouchov told his audience that there has never been a better time to get on the HPC bandwagon. Ostrouchov is a senior data scientist and member of the Advanced Data and Workflow Group at the OLCF, a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory.

“With HPC, researchers can work with larger datasets and get a quicker turnaround,” Ostrouchov said during the session he organized at the 2018 Joint Statistical Meetings, July 28–August 2 in Vancouver, to address parallel architectures in statistical computing. “Data analysis is a very interactive process, and a speedup in the iteration process can really aid discovery. We’re trying to get a community that we think should be participating in HPC to come and participate.”

Ostrouchov isn’t telling statisticians to take the plunge blindly. In fact, he’s created a bridge to meet data scientists where they are.

Programming with Big Data in R (pbdR)—statistical software Ostrouchov and his colleagues developed—takes the preferred programming language of the statistics community and makes it compatible with parallel computing in HPC environments.

For more than two decades, statisticians and data scientists have turned to the R programming language to develop new methods to sort, sift, and analyze datasets. On individual workstations, however, large datasets can take hours or days to process. For this reason, much of the community is focused on reducing datasets and making problems smaller, Ostrouchov said.

“But if they have a truly large dataset, software like pbdR is going to speed up their turnaround time by an order of magnitude,” he said. “In today’s big data environment, where a lot more data is often touted to be a substitute for judiciously collected small data, it is especially important for statisticians to be present in HPC and demonstrate, at scale, that biases can persist.”

In Vancouver, Ostrouchov organized a session that brought together representatives from the private sector, academia, and government who have been experimenting with parallel computing using multiple nodes or single nodes with multiple cores. For many researchers, the transition can feel a little disorienting. In a serial program, instructions are executed one after the other. Batch parallel computing instead requires many copies of a program to run simultaneously and work together to solve a problem.

“It’s sort of upside down when compared to traditional approaches,” Ostrouchov said. The payoff, however, is worth the pain for users who want a timely solution.
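For readers who have not seen this style of programming, the idea of many copies of one program cooperating can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not pbdR code. The function names `rank_program`, `run_spmd`, and `serial_mean` are hypothetical, and threads stand in here for the separate processes that a real HPC launcher would start across cores or nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_program(rank, size, data):
    """The single program every copy runs; only `rank` differs between copies."""
    # Each copy selects its own strided share of the data based on its rank.
    local = data[rank::size]
    # Partial result for this copy: the sum and count of its share.
    return sum(local), len(local)


def run_spmd(data, size=4):
    """Start `size` copies of rank_program and combine their partial results.

    Assumption: threads are used only to keep the sketch self-contained;
    on a cluster each copy would be a separate process.
    """
    with ThreadPoolExecutor(max_workers=size) as pool:
        futures = [pool.submit(rank_program, r, size, data) for r in range(size)]
        partials = [f.result() for f in futures]
    # Reduction step: merge the partial sums and counts into one global mean.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def serial_mean(data):
    """The traditional serial version, shown for comparison."""
    return sum(data) / len(data)
```

Calling `run_spmd(list(range(1, 101)), size=4)` gives the same mean as `serial_mean` on the same list. The difference is that no single copy ever touches the whole dataset, which is the "upside down" part: the programmer writes what one copy does with its share, then describes how the shares are combined.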

Thus far, pbdR has gained traction with a range of users, particularly in the biological sciences, where data growth in areas like genomics has created a need for more computing power. The software has also been deployed internally at ORNL to assist materials science users of BEAM (Bellerophon Environment for Analysis of Materials), an advanced workflow infrastructure that connects experimental data from advanced microscopes directly to HPC. Beyond ORNL, pbdR has been deployed at a number of other HPC centers, and supercomputing company Cray plans to include the software in the next version of its Cray Urika-XC big data analytics software suite.

“In a way, everyone will eventually have to move to HPC because you have to use parallel algorithms to take advantage of the newest hardware,” Ostrouchov said. “We hope if we make it easy enough, more of the statistics community will get engaged.”

ORNL is managed by UT-Battelle for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov/.