The COVID-19 pandemic pushed computational scientists to explore new ways of operating on and analyzing large data sets with NVIDIA GPU computing.
Computational scientists at the Oak Ridge Leadership Computing Facility (OLCF) are using the power of NVIDIA GPU computing and BlazingSQL, a new engine for GPU-accelerated queries using the SQL language, to substantially speed up big data analysis of the kind needed for their work on drug discovery for COVID-19.
OLCF has contracted with BlazingSQL, Inc. for work on deploying, scaling, and supporting the BlazingSQL platform on OLCF Summit. This will help OLCF users who need to query massive amounts of data in seconds to minutes. The work with OLCF will improve BlazingSQL’s Power9 and GPU support on Summit as well as provide better integration with NVIDIA Mellanox InfiniBand and next-generation NVIDIA NVLink using the UCF Consortium’s Unified Communication X (UCX) open-source communication framework.
“When you have several terabytes of data, using queries on CPUs to sort information can take hours, if not days,” said Jens Glaser, computational scientist at the National Center for Computational Sciences (NCCS). “Ideally, we shouldn’t have to wait that long. We needed to find a solution that would allow scientists to analyze large amounts of information in a timely manner and make use of Summit’s GPUs.”
The team had been working over the past year to deploy the new NVIDIA RAPIDS and Dask ecosystem on Summit and understand how it could be leveraged to help OLCF users with applications. Now their COVID simulations required exactly such a solution. The only thing missing was a way to use a query language to easily access, manipulate, and sort the datasets they were working with.
To solve this, the team looked at how commercial fields such as finance and marketing approach the analysis of large structured datasets. They found that a new solution had recently emerged that would allow them to leverage the power of Summit and the speed of the RAPIDS software stack: the integration of BlazingSQL into the RAPIDS/Dask ecosystem, which provides a GPU-accelerated, open-source platform for running fast, scalable SQL queries alongside other data analytics.
“Scientific simulations and experiments can involve enormous datasets. Working with such large amounts of information requires adequate software for scaling methods of analysis, so data can be distributed efficiently across multiple nodes of Summit, letting us query the output data repeatedly, for instance when we apply a new machine learning model,” said Ada Sedova, a biophysicist in the Biosciences Division and co-lead on COVID-19 high-performance computing research. As a result of the team’s work trying to find therapeutics against the novel coronavirus, this problem found a solution as well.
The perfect data storm
As SARS-CoV-2 tore through the world this year, scientists everywhere raced to understand the virus in an attempt to stop it.
Almost immediately, researchers turned to OLCF—a US Department of Energy (DOE) Office of Science User Facility located at the Oak Ridge National Laboratory (ORNL) and home to the nation’s fastest supercomputer, Summit—to try to help find the virus’s weak spots and information that could be useful in the development of therapeutics.
Simulations can be key to this type of research, but computational scientists must often combine large amounts of information from several different sources, such as databases, together with their simulation output to arrive at a final solution.
Together, these sources of information produce a mountain of data, which means that even basic operations, like finding a given column’s top value, require scientists to sort through a massive set of records.
“We had done these large simulation runs, using Summit to dock molecules to SARS-CoV-2 viral protein structures, which yielded an enormous amount of data that we were not prepared to look at, sort, manipulate, and join with other sources of information that we also needed, such as data on chemical structures, to finish the analysis,” explained Glaser.
Although the computational part was fast, they were looking at days to weeks to finalize the analysis portions using CPU-based database software, as each of these large-scale calculations produced 1.3 terabytes of tabular data.
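To make the kind of query described above concrete, here is a minimal sketch of the two operations mentioned: finding a column's top value and joining docking output with chemical structure data. It uses Python's built-in sqlite3 module on the CPU purely as a stand-in, not BlazingSQL or the team's actual pipeline. The table names (docking_scores, compounds), the column names, the sample rows, and the assumption that a higher score is better are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical docking data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docking_scores (ligand_id INTEGER, score REAL)")
    conn.execute("CREATE TABLE compounds (ligand_id INTEGER, smiles TEXT)")
    conn.executemany(
        "INSERT INTO docking_scores VALUES (?, ?)",
        [(1, 7.2), (2, 9.1), (3, 5.4), (4, 8.3)],
    )
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?)",
        [(1, "CCO"), (2, "c1ccccc1"), (3, "CC(=O)O"), (4, "CCN")],
    )
    return conn


def top_value(conn):
    """Return the largest value in the (hypothetical) score column."""
    return conn.execute("SELECT MAX(score) FROM docking_scores").fetchone()[0]


def top_scoring_ligands(conn, limit=3):
    """Join scores with structures and return the highest-scoring rows."""
    query = """
        SELECT d.ligand_id, c.smiles, d.score
        FROM docking_scores AS d
        JOIN compounds AS c ON c.ligand_id = d.ligand_id
        ORDER BY d.score DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


conn = build_demo_db()
best_score = top_value(conn)
top_two = top_scoring_ligands(conn, limit=2)
print(best_score, top_two)
```

On four rows this runs instantly. The same SQL over terabytes of records is where a CPU-based database can take days, which is the gap a GPU-accelerated engine is meant to close.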
Part of the research was run in the new Summit cabinets furnished with high-memory CPUs and NVIDIA V100 GPU nodes with 32 GB of HBM2, bought a few months earlier with supplemental federal funding for COVID-19-related research provided by the DOE through the CARES Act.
Breaking records and setting new benchmarks
“COVID-19 came to be the perfect case study, not only because of the amount of data but also because of the urgency with which we needed to process it,” said Oscar Hernandez, senior researcher in ORNL’s Computer Science and Mathematics Division.
The technology could also prove particularly helpful in other domains, such as cheminformatics, radiation physics, and even monitoring Summit’s own energy efficiency and system performance.
“Some of these fields generate a lot of data that is more powerful and useful when it can be navigated in real time,” said Hernandez.
By using BlazingSQL on 27 of Summit’s new high-memory nodes, the team demonstrated a ten-fold improvement on the TPCx-BB benchmark, which is built on a synthetic retail dataset, over the fastest commercial solution on CPUs. Almost simultaneously and independently, NVIDIA achieved a nearly 20-fold speedup on a similar benchmark using their latest generation of NVIDIA A100 Tensor Core GPUs, showing that this technology has only just begun to break records.
The team is already inviting others to the table. They held an invitation-only workshop on October 15 to demonstrate “a virtual cheminformatics lab that uses the Summit supercomputer to analyze massive amounts of data using GPUs.” The lab incorporates interactive analysis in Jupyter notebooks, which lets users run data science workflows while taking advantage of Summit’s GPU computing platform and interconnects. The ability to launch a Jupyter notebook in a browser and connect to Summit through it was made possible by OLCF’s Slate resource.
“Jupyter notebooks and scalable GPU-accelerated Python frameworks are two new offerings in OLCF’s AI, ML, and Data Analytics software stack that our users have requested enthusiastically in the past,” said Benjamín Hernández, computer scientist in OLCF’s AI Analytics Scalable Methods group.
“With this workshop, we had the opportunity to introduce these frameworks to the teams working in the Innovative and Novel Computational Impact on Theory and Experiment (INCITE); the Advanced Scientific Computing Research’s (ASCR) Leadership Computing Challenge (ALCC); the Exascale Computing Project (ECP); and the Center for Accelerated Application Readiness (CAAR). We expect to be able to promote these offerings more broadly among our user community in the near future,” he added.
The organizers said that “participants were thrilled to hear about the strong commitment to the GPU-accelerated Python data ecosystem that was demonstrated by our vendors and the OLCF at this workshop.”
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.