Talks cover Progressive File Layouts, advanced monitoring tools

OLCF staff delivered talks and tutorials at the 2nd International Workshop on The Lustre Ecosystem: Enhancing Lustre Support for Diverse Workloads, held March 8-9 in Baltimore. ORNL workshop organizers include (from left) Rick Mohr, Sarp Oral, Neena Imam, and Michael Brim.

Because of its ability to handle large-scale data storage and retrieval, Lustre has become the parallel file system of choice for many supercomputing centers around the world, including the US Department of Energy’s (DOE’s) Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility.

Scientific workloads are Lustre’s bread and butter, but to continue to meet the needs of high-performance computing (HPC) users, a coordinated effort is underway to expand the parallel file system’s performance and flexibility, with an emphasis on adapting Lustre to handle diverse, big data–oriented workloads.

In March, OLCF staff delivered talks and tutorials to Lustre users from academia, industry, and government on work being done to update the file system. The presentations took place in Baltimore at the Second International Workshop on The Lustre Ecosystem: Enhancing Lustre Support for Diverse Workloads. The workshop was organized by Oak Ridge National Laboratory’s (ORNL’s) Computational Research and Development Programs, sponsored by the US Department of Defense. Neena Imam, Michael Brim, and Sarp Oral of ORNL’s Computing and Computational Sciences Directorate served as workshop cochairs.

“Today, Lustre is a square peg that fits into a square hole,” said Oral, file and storage systems team lead at the OLCF. “We need to reshape it as a circle to make it suitable for other use cases.”

The OLCF is contributing to this effort in two major ways: by informing work on the new Progressive File Layout (PFL) feature and by deploying advanced file system monitoring tools. OLCF staff delivered presentations on both topics at the workshop.

“PFL is a project that makes it so users don’t have to be Lustre experts to operate and get effective use out of the parallel file system,” said Brim, a research associate in ORNL’s Computer Science and Mathematics Division. “It basically increases the number of storage servers used to hold file data as the file grows larger.”

Though PFL is intended to be transparent to most users, advanced users will be able to exploit the feature to customize how their data is divided and stored across servers, a technique known as data striping. The approach gives users more opportunities to optimize I/O performance, especially for big data–type workloads.
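To make the idea concrete, here is a minimal Python sketch of a layout whose stripe count grows with file size. It is a toy model for illustration, not Lustre code: the `Component` class, the extent boundaries (4 MiB and 256 MiB), the stripe counts (1, 4, 16), and the 1 MiB stripe size are all hypothetical values chosen for the example, and the round-robin mapping is a simplification of how striping distributes data.

```python
from dataclasses import dataclass
from typing import List, Optional

MiB = 1024 * 1024


@dataclass(frozen=True)
class Component:
    """One extent of a hypothetical progressive layout."""
    end: Optional[int]       # exclusive end offset in bytes; None = to end of file
    stripe_count: int        # number of storage targets used within this extent
    stripe_size: int = MiB   # bytes written to one target before moving to the next


# Hypothetical layout: small files stay on one target, large files spread out.
LAYOUT: List[Component] = [
    Component(end=4 * MiB, stripe_count=1),
    Component(end=256 * MiB, stripe_count=4),
    Component(end=None, stripe_count=16),
]


def component_for_offset(layout: List[Component], offset: int) -> Component:
    """Return the layout component that covers the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for comp in layout:
        if comp.end is None or offset < comp.end:
            return comp
    raise ValueError("layout does not cover this offset")


def stripe_index(layout: List[Component], offset: int) -> int:
    """Return which stripe (0-based, within its component) holds the offset.

    Stripes are assigned round-robin, counted from the start of the component.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    start = 0
    for comp in layout:
        if comp.end is None or offset < comp.end:
            return ((offset - start) // comp.stripe_size) % comp.stripe_count
        start = comp.end
    raise ValueError("layout does not cover this offset")


if __name__ == "__main__":
    for off in (0, 5 * MiB, 300 * MiB):
        comp = component_for_offset(LAYOUT, off)
        print(off // MiB, "MiB ->", comp.stripe_count, "stripes,",
              "stripe index", stripe_index(LAYOUT, off))
```

In this model a user who never touches the layout still gets wider striping as a file grows, while an advanced user could swap in a different list of components to match an unusual access pattern.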

“If you have particular workloads that organize files in atypical ways, you can use PFL to reflect that and improve your performance,” Brim said.

Currently, Intel is carrying out the PFL development effort under an ORNL subcontract. When the feature is complete, it will enable additional capabilities that could help broaden Lustre’s appeal.

“Down the road, PFL will allow greater flexibility at places like DOE and the OLCF to do more data analytics workloads and build more flexible systems at less cost,” Oral said.

Another Lustre element being explored by the OLCF is JobStats, a feature that captures I/O performance at the job level. Administrators can use this information to identify bottlenecks and work with users to alleviate them. Previous monitoring tools displayed only general performance indicators across the system.
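As a rough illustration of what job-level monitoring makes possible, the Python sketch below totals I/O counters per job and flags jobs whose average request size is small. Everything in it is an assumption made for the example: the record format `(job_id, bytes, operations)`, the sample values, and the 64 KiB threshold are hypothetical, and small request size is only one heuristic an administrator might use. It does not reproduce the actual JobStats output format or the OLCF's tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

KiB = 1024
MiB = 1024 * KiB

# Hypothetical record: (job_id, bytes_moved, operation_count), one per server sample.
Record = Tuple[str, int, int]


def summarize(records: Iterable[Record]) -> Dict[str, Dict[str, int]]:
    """Sum bytes and operation counts for each job across all samples."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"bytes": 0, "ops": 0})
    for job_id, nbytes, nops in records:
        totals[job_id]["bytes"] += nbytes
        totals[job_id]["ops"] += nops
    return dict(totals)


def flag_small_io(totals: Dict[str, Dict[str, int]],
                  min_bytes_per_op: int = 64 * KiB) -> List[str]:
    """Return job IDs whose average request size falls below the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    flagged = []
    for job_id, t in totals.items():
        if t["ops"] > 0 and t["bytes"] / t["ops"] < min_bytes_per_op:
            flagged.append(job_id)
    return sorted(flagged)


# Made-up sample data: two jobs, each reported by two servers.
SAMPLE: List[Record] = [
    ("job-101", 8 * MiB, 8),
    ("job-101", 8 * MiB, 8),
    ("job-202", 1 * MiB, 4096),
    ("job-202", 2 * MiB, 8192),
]

if __name__ == "__main__":
    totals = summarize(SAMPLE)
    print("jobs with small average I/O size:", flag_small_io(totals))
```

A system-wide view would show only the combined traffic of both jobs; grouping by job is what lets the second one stand out.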

“Users typically don’t say anything about their applications’ I/O performance until it’s slow,” said Jesse Hanley, OLCF HPC Unix/storage system administrator. “We want to be more proactive and have suggestions for how they might get better performance.”

The OLCF has been working with JobStats for less than a year but is already seeing benefits. After deploying the tool on the center-wide Atlas file system, the OLCF was able to identify several inefficient jobs and work with users to fix the issues.

“The result is improved I/O performance, and therefore users can spend less time in I/O and have more time for compute,” Hanley said.

Currently, Hanley is focused on categorizing jobs to study I/O patterns in other ways. The task may result in application-specific recommendations for optimizing I/O performance.

“One of my long-term goals is to get I/O profiles of applications into the hands of users so they can see how their jobs perform over time,” Hanley said.

Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.

Jonathan Hines

Jonathan Hines is a science writer for the Oak Ridge Leadership Computing Facility.