Data-Driven Science for HPC Through CADES
NCCS welcomes CADES staff, plans to expand data services for ORNL HPC users
Many of today’s biggest scientific discoveries come from extracting insights from massive collections of data. Scientists are compiling more information than ever before, and advanced data analysis backed by powerful computing is needed across research domains.
In response to a growing interest in data services that are integrated with high-performance computing (HPC), the National Center for Computational Sciences (NCCS) is expanding its data analysis group, the Advanced Data and Workflow (ADW) group.
Located at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL), NCCS oversees large-scale computing initiatives funded by the laboratory and federal agencies. The center’s largest initiative is a DOE Office of Science user facility named the Oak Ridge Leadership Computing Facility (OLCF), which operates the 27-petaflop Titan supercomputer.
In December 2016, ORNL’s data-centric facility, the Compute and Data Environment for Science (CADES), moved with its eight staff members from the laboratory’s Information Technology and Services Division (ITSD) into the ADW group. At its new home in NCCS, CADES is continuing to serve its primary purpose: to make scalable computing, data analytics, and information management available to ORNL scientific staff.
Three years ago, ORNL leadership recognized a growing need for easy-to-use HPC resources that included a range of data capabilities for researchers across the laboratory. Scientific instruments had become more sophisticated, producing large amounts of data that could take researchers months to analyze and that sent them searching for more storage space and analysis capabilities.
It was clear that, in addition to extreme-scale computing, researchers needed smarter ways to compute: advanced workflows and analytics that could comprehend and reduce large datasets faster than researchers could with traditional methods. The available approaches ranged from cluster computing to supercomputing and from machine learning to graph analytics, and no single method could answer the range of questions scientists wanted to explore through data science. To address this need, ORNL created CADES with Laboratory Directed Research and Development funding and built it within ITSD.
“With the creation of the Advanced Data and Workflow group in 2015, we were looking to facilitate new data-focused opportunities for our users,” said NCCS Director James Hack. “We also wanted to standardize a range of data services and capabilities so that they were more readily available and easier to use. Simultaneously, CADES was beginning to build a similar environment for ORNL staff. Bringing ADW and CADES together now optimizes our expertise and resources.”
A smarter way
As part of NCCS, CADES users will have ready access to resources like OLCF’s Titan, ORNL clusters, and other recent additions to the growing data-intensive computing infrastructure. These additions include test bed systems optimized for machine learning and graph analytics, which help researchers find connections and make predictions across data patterns and structures from simulations or empirical observations.
“CADES is already providing data and compute services to facilities that are part of larger DOE missions and programs, so it’s a strategic decision to align it with our leadership computing facility and expand the reach of its services to ORNL and the broader user community,” said Arjun Shankar, CADES director and ADW group leader.
Since 2014, CADES staff have worked with ORNL researchers to design advanced workflows that enable faster and more accurate scientific analysis. Using machine learning, researchers have rapidly analyzed data generated by powerful atom-scale microscopes and neutron scattering instruments to study the structures of materials such as those used in fuel cells, photovoltaics, and semiconductors.
The CADES initiative has also helped advance data-driven projects in the health sciences. CADES staff developed a graph analytics program called ORiGAMi that analyzes the semantic, statistical, and logical construction of written language across millions of medical publications to mine for connections that might otherwise never be discovered.
A combination of graph analytics programs and deep learning is also supporting cancer research as part of the CANcer Distributed Learning Environment, or CANDLE, a DOE and National Cancer Institute collaboration that is developing computational methods for integrating and analyzing many types of cancer data to better understand cancer biology and identify new treatments.
“Once CADES began demonstrating its range of capabilities, we realized that it could be applied on a larger scale,” Hack said. “Now we want to offer CADES as an ‘on/off ramp’ to the OLCF through which our users can determine and configure their data analysis and workflow needs before running on leadership systems.”
New possibilities in data analysis complement traditional HPC. Computational scientists often validate theory against experiment, which requires bridging large-scale simulation and observational data. They also conduct predictive modeling and simulation that digs into data by searching and recalibrating many designs or scenarios to optimize results. Whatever the approach, big computing and big data are inextricably linked.
With the OLCF’s newest supercomputer, Summit, coming online in 2018, Hack and Shankar said it is the right time to align CADES and OLCF resources.
“We see a growing need for scalable data environments across science domains, and we want to offer an infrastructure that deploys all of our resources to not only address current and future data challenges but, most importantly, accelerate scientific discovery,” Shankar said.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.