The OLCF-6 system will provide innovative hardware solutions with a demonstrable path toward performance portability, using a software stack and tools that ease the transition without sacrificing DOE goals for continued delivery of science across modeling and simulation, artificial intelligence and machine learning, and large-scale data analysis.

The past 15-20 years of computing have provided almost unprecedented stability in high-level system architectures and parallel programming models, with the MPI, OpenMP, C++, and Fortran standards paving the way for performance-portable code. Combined with application trends toward more coupled physics, predictive capabilities, sophisticated data management, object-oriented programming, and massive scalability, this means that each application in the typical OLCF workload represents tens or hundreds of person-years of effort, and thousands of person-years in aggregate. Thus there is a keen interest in protecting the investment in the DOE application base by procuring systems that allow today's workhorse application codes to continue to run without radical overhauls. OLCF seeks solutions that minimize disruptive changes to software that are not part of a standard programming model likely to be available on multiple future acquisitions, while recognizing that the existing software base must continue to evolve.

The OLCF-6 benchmark suite has been developed to capture the programming models, programming languages, numerical motifs, fields of science, and other modalities of investigation expected to make up the bulk (e.g., more than 80% of all the consumed time on the platform) of the usage upon deployment.

Questions related to the benchmarks should be sent to OLCF6benchmarks@ornl.gov.

OLCF-6 Benchmark Run Rules

Note: The following run rules are intentionally similar to, and adapted from, the NERSC-10 Benchmarks Run Rules. There are, however, some differences that should be noted.

Application Benchmark Descriptions

The application benchmarks are representative of the OLCF workload and were chosen to span a variety of algorithmic and scientific spaces. Each benchmark distribution contains a README file that provides links to the benchmark source code distribution for reference, along with instructions for compiling, executing, verifying numerical correctness, and reporting results. Multiple input problems and sample output from existing OLCF systems (i.e., Summit and/or Frontier) are included to facilitate profiling at reduced scale. Each README file specifies a target problem size that must be used to report benchmark performance.

Allowed Modifications

Two tiers of modification—ported and optimized—are permitted for the benchmarks. The purpose of the tiering is to understand the level of developer efforts needed to achieve high performance on the target architecture(s). Besides the rules for each tier, each benchmark may provide additional benchmark-specific rules or amendments that supersede the rules described here. In all cases, benchmark performance will be accepted only from runs that exhibit correct execution.

Ported results are intended to reflect out-of-the-box performance with minimal developer effort needed to run the benchmark on the target system. Limited source code modification is permitted (elaborated below) and must be described. Compiler options, library substitutions, and concurrency may also be modified as follows.

  • Only publicly available and documented compiler flags shall be used.
  • Library substitutions are permitted. Proprietary library routines may be used as long as they currently exist in a supported set of general or scientific libraries, and must be in such a set when the system is delivered. Publicly available or open source libraries may be used if they can be built and used by the installed compilation system. The libraries must not specialize or limit the applicability of the benchmark, nor violate the measurement goals of the particular benchmark.
  • Parallel-construct substitutions (e.g., replacing a for loop with Kokkos parallel_for or with other library calls) are not considered library substitutions and are permitted only in the optimized tier.
  • Concurrency (e.g., node type, node count, process count, and accelerator count) may be modified to produce the best results on the target system. The rationale for these choices must be explained. Note that the number of MPI tasks that can be used for a particular benchmark may be constrained by any domain decomposition rules inherent in the code, as described in the benchmark's README file.
  • Source code modifications required for correct execution are permitted after a discussion with OLCF. A change will only be accepted if the Offeror shows that the original source has a software bug.
  • Batch scripts may be modified in order to execute on the target system. The script may include optimizations appropriate for the original executable, e.g. setting environment variables and specifying storage system file placement.
  • Replacement of existing architecture-specific language constructs (examples: CUDA, HIP, DPC++) with another well documented language or interface is permitted. This may also include API and library substitutions that necessitate limited, well-scoped changes in the source code.
  • Addition or modification of directives or pragmas is permitted.

Optimized results are intended to showcase system capabilities that are not realized by the baseline code. Aggressive code changes that enhance performance are permitted as long as the full capabilities of the code are maintained, the code can still pass validation tests, and the underlying purpose of the benchmark is not compromised. Changes to the source code may be made so long as the following conditions are met:

  • The rationale for and relative effect on performance of any optimization are described;
  • Algorithms fundamental to the program are not replaced, since replacing algorithms may violate correctness, program requirements, or other deliberate software design decisions;
  • Simulation parameters such as grid size, number of particles, etc., must not be changed;
  • The optimized code execution must still produce correct numerical results;
  • Any code optimizations must be made available to the general user community, either through a system library or a well-documented explanation of the code improvements.

For the optimized tier, the Offeror is strongly encouraged to optimize the source code in a variety of ways, including, but not limited to:

  • Aggressive code changes that enhance performance are permitted. Performance improvements from pragma-style guidance in C, C++, and Fortran source files are preferred. Wholesale algorithm changes or manual rewriting of loops that become strongly architecture specific are of less value.
  • Newer versions of the benchmark source code, obtained from the upstream repository and branch, may be used without providing additional rationale or analysis. The source code revision ID (commit hash) must be provided.
  • Source preprocessors, execution profile feedback optimizers, etc., are allowed as long as they are, or will be, available and supported as part of the compilation system for the delivered systems.
  • Optimizations that accelerate data-movement between stages of a workflow are permitted.
  • Specialized code to optimize specific hardware accelerators is permitted.

If multiple code optimizations are implemented, the Offeror is encouraged to provide the performance improvement from each optimization at a granularity that enables OLCF reviewers to understand the relative importance of each optimization as well as potential transferability to other codes in the OLCF workload.

Benchmarks

LAMMPS

LAMMPS is a code for molecular dynamics (MD) simulations. This benchmark includes two sub-problems. The first, liquid water at room temperature, stresses the balance of short- and long-range interactions. The second, a ReaxFF model of pentaerythritol tetranitrate (PETN), stresses short-range interactions involving changes that occur on a fast time scale.

M-PSDNS

The Pseudo-Spectral DNS code is designed to investigate the fundamental behavior of turbulence at very high resolution through numerical integration of the Navier-Stokes equations. This benchmark code is a “Minimalist” version of the PSDNS code that exercises the essential feature of the PSDNS code: large 3D FFTs with intense, large-scale MPI communications.
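The communication motif described above can be made concrete with a back-of-the-envelope sketch. In the following Python snippet (not part of the benchmark; the grid size, rank count, and precision are hypothetical values chosen purely for illustration), we estimate how much data each rank exchanges in the global transposes that a pencil-decomposed 3D FFT requires:

```python
# Illustrative sketch only -- not part of the benchmark. Grid size, rank
# count, and precision below are hypothetical illustration values.

def transpose_volume_per_rank(n, p, bytes_per_point=16):
    """Bytes each rank exchanges in one global transpose of an n^3 grid
    of complex double-precision points decomposed over p ranks.

    A 1D FFT along a non-local axis requires an all-to-all that moves
    essentially the whole local portion, i.e. n**3 / p points per rank.
    """
    local_points = n ** 3 // p
    return local_points * bytes_per_point

# A 3D FFT applies 1D FFTs along all three axes, so a pencil decomposition
# needs two global transposes per forward transform.
n, p = 4096, 8192  # hypothetical grid and rank count
per_transpose = transpose_volume_per_rank(n, p)
print(f"per-rank data per transpose: {per_transpose / 2**20:.0f} MiB")  # 128 MiB
```

Even this toy estimate shows why the benchmark is dominated by intense, large-scale MPI communication: every transform moves the entire grid through the network.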

MILC

The MIMD Lattice Collaboration (MILC) code has been a workhorse of LQCD calculations using Highly Improved Staggered Quarks (HISQ). For this benchmark we use MILC's “su3_rhmd_hisq” gauge-generation problem. In general, gauge generation tends to be a strong-scaling problem, spread over as many nodes as is practicable. As the node count increases, the surface-to-volume ratio of the local problem increases, and the problem becomes increasingly communication bound. Factors affecting the communication include latency and bandwidth.
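The strong-scaling behavior described above can be illustrated with a small sketch (the local lattice extents below are made-up values, not the benchmark's actual problem size): halving each local extent doubles the surface-to-volume ratio, and with it the relative communication load:

```python
# Illustrative sketch only -- not from the MILC distribution. Local lattice
# extents are hypothetical values chosen for illustration.

def surface_to_volume(local_dims):
    """Ratio of halo (surface) sites to volume for a local 4D lattice block."""
    volume = 1
    for d in local_dims:
        volume *= d
    # Each dimension contributes two faces, each of size volume / d.
    surface = sum(2 * volume // d for d in local_dims)
    return surface / volume

# Using 16x more nodes halves each local extent and doubles the ratio,
# shifting the balance from local compute toward halo communication.
print(surface_to_volume((16, 16, 16, 16)))  # 0.5
print(surface_to_volume((8, 8, 8, 8)))      # 1.0
```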

QMCPACK

QMCPACK is an open-source many-body ab initio Diffusion Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids. Ab initio quantum Monte Carlo is one of the leading methods for computing many-electron interactions in solids; it goes beyond the most widely used density functional approaches by capturing many-body effects that are inaccessible to less computationally demanding methods. This benchmark is particularly sensitive to floating-point, memory-bandwidth, and memory-latency performance. To obtain high performance, the compiler's ability to optimize and vectorize the application is critical. Strategies that place more of the “walker” data higher in the memory hierarchy are likely to increase performance.
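As a rough illustration of the walker-centric structure referred to above, here is a minimal pure-Python sketch of a diffusion step over a walker population. The class name, population size, and time step are all hypothetical; QMCPACK's actual implementation is far more sophisticated, but it is this per-walker data that benefits from sitting high in the memory hierarchy:

```python
# Illustrative sketch only -- not QMCPACK code. Class name, sizes, and the
# time step are hypothetical. It shows the walker population that diffusion
# Monte Carlo repeatedly propagates.
import random

class WalkerPopulation:
    """A population of walkers, each carrying electron coordinates and a weight."""

    def __init__(self, n_walkers, n_electrons):
        # 3 coordinates per electron, stored contiguously per walker.
        self.positions = [[random.random() for _ in range(3 * n_electrons)]
                          for _ in range(n_walkers)]
        self.weights = [1.0] * n_walkers

    def diffuse(self, tau):
        """One diffusion step: add a Gaussian move of variance tau to every coordinate."""
        sigma = tau ** 0.5
        for pos in self.positions:
            for i in range(len(pos)):
                pos[i] += random.gauss(0.0, sigma)

pop = WalkerPopulation(n_walkers=128, n_electrons=4)
pop.diffuse(tau=0.01)
print(len(pop.positions), len(pop.weights))  # 128 128
```

Because the same operation is applied uniformly across all walkers, the loop body is exactly the kind of code a vectorizing compiler can exploit, which is why compiler quality matters so much for this benchmark.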

SPATTER

Spatter is a pure-memory microbenchmark for timing gather/scatter memory access patterns on CPUs and GPUs. For this benchmark, we provide a memory access pattern traced from the xRAGE application to be used as input to Spatter.
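The gather/scatter motif that Spatter times can be sketched in a few lines. The pattern below is a made-up stand-in, not the xRAGE-derived trace shipped with the benchmark:

```python
# Illustrative sketch only -- not Spatter itself. The pattern is a
# hypothetical stand-in for a traced application access pattern.

def gather(src, pattern):
    """Indexed loads: result[i] = src[pattern[i]]."""
    return [src[j] for j in pattern]

def scatter(dst, src, pattern):
    """Indexed stores: dst[pattern[i]] = src[i]."""
    for i, j in enumerate(pattern):
        dst[j] = src[i]

src = list(range(8))
pattern = [0, 4, 1, 5]          # hypothetical access pattern
print(gather(src, pattern))     # [0, 4, 1, 5]

dst = [0] * 8
scatter(dst, [9, 9, 9, 9], pattern)
print(dst)                      # [9, 9, 0, 0, 9, 9, 0, 0]
```

Because every load or store is driven by an index array rather than a fixed stride, performance is dominated by the memory system's handling of irregular accesses, which is precisely what the microbenchmark isolates.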

FORGE

FORGE is an application that pre-trains large language models (LLMs) on 200M scientific papers. It is based on the Generative Pre-trained Transformer (GPT) architecture (the same as GPT-NeoX) and runs distributed with data, tensor, and pipeline parallelism. The resulting foundation models demonstrate good zero-shot performance and are fine-tuned for downstream scientific tasks such as domain-subject classification.
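As a toy illustration of the pipeline-parallel dataflow mentioned above (not FORGE code; the stage functions and micro-batches are hypothetical), micro-batches are pushed through an ordered list of stages, each standing in for a slice of the transformer's layer stack:

```python
# Illustrative toy only -- not FORGE code. Stage functions and micro-batches
# are hypothetical stand-ins for slices of a transformer's layers.

def run_pipeline(stages, micro_batches):
    """Push each micro-batch through every stage in order.

    Real pipeline parallelism overlaps different micro-batches on different
    devices at the same time; this sequential toy shows only the dataflow.
    """
    outputs = []
    for mb in micro_batches:
        x = mb
        for stage in stages:
            x = stage(x)
        outputs.append(x)
    return outputs

# Two toy "stages" standing in for the first and second halves of the model.
stages = [lambda x: [v + 1 for v in x], lambda x: [v * 2 for v in x]]
print(run_pipeline(stages, [[1, 2], [3, 4]]))  # [[4, 6], [8, 10]]
```

In production, data parallelism replicates this pipeline across groups of nodes and tensor parallelism further splits each stage's matrices across devices; the three are composed to reach the scale required for pre-training.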

Workflow

Information coming soon.

Changelog

Updates on November 7, 2023:
  • QMCPACK benchmark is available
  • Minor correction for FORGE model size in its description
  • FORGE dataset is available
Updates on September 22, 2023:
  • Initial release