Users Defeat ‘Data Deluge’ at OLCF’s First Large Data Sets Workshop

With more computing capability, Titan’s colossal data sets present new challenges

With big science comes big data. Large-scale simulations that drive molecular dynamics, climate, plasma physics, and other research on Titan, the OLCF’s Cray XK7 supercomputer, generate enormous data sets. Users then turn this outpouring of data into analyses and visualizations that help them understand how atoms behave, stars collapse, or sea levels rise.

Applications running on Titan’s hybrid CPU/GPU architecture, which went online in November 2012, are already running up to 7.4 times faster than on comparable all-CPU machines, delivering high volumes of data at breakneck speed. But despite Titan’s approximately 700 terabytes of memory, the data from even one user project can be a challenge to transport, store, and analyze.

To assist users planning for data management, including modifying input/output (I/O) operations for efficiency, selecting short- and long-term data storage, and building visualizations, OLCF HPC user assistance specialist Fernanda Foertter organized the August 6–8 workshop “Processing and Analysis of Very Large Data Sets,” held in Knoxville, Tennessee.

The workshop included participants from NERSC and LBL, which face similar challenges with very large data sets; plans are underway to develop a workshop next year with participation from all three Office of Science supercomputing facilities.

Drawing 91 participants on the first day, the workshop came at an opportune time for many users, who will be required to submit a Data Management Plan with Department of Energy (DOE) Office of Science research proposals beginning October 1 of this year.

“With Titan going online coupled with the new requirement from Office of Science for data management plans, we decided we needed a workshop that teaches users how to deal with large data sets,” Foertter said. “And it grew from a hands-on teaching workshop to a bigger conversation about how we prepare ourselves as scientific computing users for the data deluge ahead.”

Why data is packing on the petabytes

Although many popular conversations about “big data” concern companies that amass data on billions of social and economic interactions each day, scientists face similar challenges as the technologies they use—from particle accelerators to satellites to high-performance computers—become more advanced.

“As technologies like Titan get better, data sets get bigger,” said Jack Wells, director of science at the National Center for Computational Sciences.

World-class particle accelerators are one example. The accelerators at the European Organization for Nuclear Research (CERN) have collected large amounts of data for decades. Because their detectors measure millions of particle interactions per second, information erupts from these instruments faster than scientists can analyze it. The ATLAS detector at the Large Hadron Collider (LHC) alone produces 30 petabytes of data, or 30 million gigabytes, each year.

Alexei Klimentov, Brookhaven National Laboratory Omega group leader and collaborator for the ATLAS detector, spoke at the OLCF workshop on the legacy of big data in high-energy physics and how computing systems like Titan are shifting the trend. “This is the first time that computing is one of the major players,” Klimentov said. “And if you can process data, analyze data, and simulate data on computers like Titan then you can achieve more refined results in physics in a more timely manner.”

At the forefront of big data applications in high-performance computing have been the weather and climate science communities. If anyone knows what characterizes large data sets, it’s OLCF computational climate scientist Valentine Anantharaj, who spoke at the workshop about data management from a climate science perspective.

To predict short-term weather or long-term climate change, scientists use a variety of data sets for developing and validating models, such as the Weather Research and Forecasting Model or the Community Earth System Model (CESM). For climate studies, researchers using CESM simulate past and present climate information over decades, dealing with large volumes of data on temperature, precipitation, and a score of other climate indicators.

“The complexity of climate science is partly due to all the interactions among the different components of the Earth system. It’s not just the atmosphere in isolation but how the atmosphere interacts with the land, ocean, and ice,” Anantharaj said. “Looking more closely, there are supporting processes—the water cycle, the nitrogen cycle, the carbon cycle—each with their own complexities; and then we add human influence on these natural systems.”

Weather models likewise depend on a variety of data, primarily from satellites and radar systems that observe and stream information about the current state of the atmosphere. Thousands of measurement stations on the ground, in the ocean, and in the air may differ in the characteristics and accuracy of their data, complicating model building and analysis. Weather data is also time-sensitive and must be analyzed rapidly to inform the public of potential hazards like hurricanes. If files sit unanalyzed in storage too long, they can lose much of their benefit.

Storing raw data—in all fields of scientific research—assumes that the data will be repurposed in the future. Although storing weather data for many years may not be reasonable, many science communities say they can benefit from the provenance of data.

“Ultimately, making data available leads to transparency, but it also saves resources through data reuse,” Foertter said.

If raw data can be accessed and reused, the community as a whole could cut costs and time-to-solution, advancing the field more quickly.

“But we also have to ask if it is always cheaper to keep data because storage is expensive,” Wells said. “If the data will be too dated in five years to use again, or if it is rarely accessed, then maybe it is cheaper to re-run the simulation.”
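
The trade-off Wells describes can be put in rough numbers. The following is a minimal sketch, not an OLCF tool: it assumes storage cost grows linearly with data size and retention time, and that each future reuse of the data would otherwise require one full re-run of the simulation. The function names and every dollar figure are hypothetical and chosen only for illustration.

```python
def storage_cost(size_tb: float, years: float, cost_per_tb_year: float) -> float:
    """Cost of keeping a data set on disk or tape for a given number of years."""
    return size_tb * years * cost_per_tb_year


def rerun_cost(expected_reuses: float, cost_per_run: float) -> float:
    """Cost of regenerating the data once for each time it is needed again."""
    return expected_reuses * cost_per_run


def cheaper_option(size_tb: float, years: float, cost_per_tb_year: float,
                   expected_reuses: float, cost_per_run: float) -> str:
    """Return "store" if keeping the data costs no more than re-running, else "re-run"."""
    keep = storage_cost(size_tb, years, cost_per_tb_year)
    rerun = rerun_cost(expected_reuses, cost_per_run)
    return "store" if keep <= rerun else "re-run"


if __name__ == "__main__":
    # Hypothetical figures: 500 TB kept for 5 years at $20 per TB per year,
    # versus a simulation that costs $30,000 in compute time to repeat.
    rarely_used = cheaper_option(500, 5, 20.0, expected_reuses=1, cost_per_run=30_000.0)
    often_used = cheaper_option(500, 5, 20.0, expected_reuses=4, cost_per_run=30_000.0)
    print("Reused once:", rarely_used)
    print("Reused four times:", often_used)
```

Under these assumed figures, storage costs $50,000 over five years, so data that is reused only once is cheaper to regenerate, while data reused four times is cheaper to keep. A real decision would also weigh factors this sketch ignores, such as whether the simulation can be reproduced at all and whether the stored data will still be scientifically current.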

What OLCF can do for users

In addition to providing case studies in large data set management, the workshop offered instruction in specific data-related tools and software available to OLCF users.

“On day one, we focused on I/O, day two on visualization, and day three on storage and analysis tools,” Foertter said.

Users learned about the features of ADIOS, the Adaptable I/O System for Big Data, a software framework that helps researchers manage their I/O workflows. The software recently earned OLCF computational scientists and partnering institutions an R&D 100 award for its ability to reduce the complexity of data management and increase performance by orders of magnitude, enabling users to generate tens of petabytes of data from a single simulation.

“The software allows scientists to write and read data very efficiently on all of the machines DOE has to offer in its portfolio,” said Scott Klasky, an ADIOS developer.

Scientific visualization staff presented on the Data Visualization Laboratory (which houses the 30-foot EVEREST powerwall), as well as the ParaView application that can build visualizations from very large data sets.

“Moving data onto a visualization cluster can be difficult and expensive,” Wells said. “So it’s important that through I/O and analysis users are effectively moving the data they need to build the right visualizations.”

Ultimately, after visualizations are built and analyses completed, data must live somewhere. During a user’s time on Titan, data can be temporarily saved to the Lustre file system SPIDER, which is being upgraded from 10.7 petabytes to 32 petabytes this fall because of escalating demand.

But what happens to data—how it will be managed and curated after the initial user project is deactivated—is a subject that leadership computing facilities and researchers will continue to work out in the coming years as the scientific community decides how data is best reused, catalogued, and stored long-term.

“We still need to learn from users what their patterns are like,” Foertter said. “Different communities—from chemistry to bioinformatics to materials science—use data differently, and they will come to define what their needs are. The future of our data is something the workshop got us all to think about.”

—Katie Elyce Jones