Using Balanced Injection

What is balanced injection?

Balanced Injection is a mechanism available on Cray systems that helps match the injection bandwidth of the compute nodes to the capabilities of the high-speed network (HSN). In some cases, using this mechanism improves application performance and reduces network congestion.

By default, Balanced Injection is used only for MPI all-to-all collectives. Users can enable it for their applications through the APRUN_BALANCED_INJECTION environment variable. To do so, set the variable to a value from 1 to 100 before launching the application:

setenv APRUN_BALANCED_INJECTION 64  (csh/tcsh)
export APRUN_BALANCED_INJECTION=64  (bash)
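As a minimal sketch for bash users, a small helper can check a value against the 1 to 100 range described above before exporting it, so a typo does not silently reach the job launcher. The function name set_balanced_injection is hypothetical and is not part of the Cray or OLCF tools.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the Cray or OLCF tooling: check a value
# against the documented 1-100 range, then export it for the launcher.
set_balanced_injection() {
    local value="$1"

    # Accept only plain decimal digits (no sign, no decimal point).
    if ! [[ "$value" =~ ^[0-9]+$ ]]; then
        echo "error: '$value' is not a whole number" >&2
        return 1
    fi

    # Force base 10 so an input such as "064" is not read as octal.
    value=$((10#$value))
    if (( value < 1 || value > 100 )); then
        echo "error: $value is outside the range 1-100" >&2
        return 1
    fi

    export APRUN_BALANCED_INJECTION="$value"
}

# Usage: set_balanced_injection 64
```

On failure the helper leaves any previously exported value untouched and returns a nonzero status, so a job script can stop before launching with a bad setting.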

Because this is a tunable setting, no single recommended value applies to all applications. Users are advised to try several values and measure the effect of each on their application’s performance.
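One way to run this comparison is a small sweep that launches the same command once per candidate value and records the elapsed wall-clock time. The sketch below assumes bash. The function name, the candidate values, and the example aprun line are hypothetical, and the one-second timing resolution of `date +%s` is only adequate for runs that take much longer than a few seconds.

```bash
#!/usr/bin/env bash
# Hypothetical sweep helper: runs the given command once per candidate
# APRUN_BALANCED_INJECTION value and prints elapsed time and exit status.
sweep_balanced_injection() {
    local values="$1"   # space-separated candidates, e.g. "100 80 64 50"
    shift               # the remaining arguments form the launch command
    local v start end status

    for v in $values; do
        status=0
        start=$(date +%s)
        # The prefix assignment sets the variable for this one command only,
        # so each run sees exactly one value.
        APRUN_BALANCED_INJECTION="$v" "$@" || status=$?
        end=$(date +%s)
        echo "APRUN_BALANCED_INJECTION=$v elapsed=$((end - start))s status=$status"
    done
}

# Example inside a batch job (rank count and executable are placeholders):
#   sweep_balanced_injection "100 80 64 50 32" aprun -n 256 ./my_app
```

Running every value inside the same allocation keeps node placement constant between runs, although traffic from other jobs on the network can still vary, so repeating each value more than once gives a fairer comparison.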

Not all communication patterns benefit from this mechanism. Examples of algorithms that can potentially benefit from Balanced Injection include higher-dimensional nearest-neighbor algorithms, unstructured mesh computations, hand-constructed all-to-all operations, and matrix transposes. For more details and additional examples, see Using Balanced Injection in Cray Systems.

Recommended Use Cases at the OLCF

NWCHEM

On Titan, NWCHEM can generate heavy traffic on the HSN, which results in network throttling. When this occurs, all jobs on the system, and in particular jobs allocated near the NWCHEM job, can experience significant performance degradation.

To avoid network congestion, OLCF recommends using the following settings in your NWCHEM job submission script:

export ARMCI_DMAPP_LOCK_ON_GET=1
export ARMCI_DMAPP_LOCK_ON_PUT=1
export APRUN_BALANCED_INJECTION=64

The most effective APRUN_BALANCED_INJECTION value varies with the problem and the job size.
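To compare values across separate batch jobs, the settings above can be placed in a job script whose Balanced Injection value is chosen at submission time. This is a sketch assuming Titan's PBS batch system; the project ID, node count, walltime, rank layout, input file name, and the BI_VALUE variable are placeholders, not OLCF-prescribed values.

```bash
#!/bin/bash
# Sketch of an NWCHEM batch script for Titan (PBS). Project ID, node count,
# walltime, rank layout, and input file are hypothetical placeholders.
#PBS -A ABC123
#PBS -l nodes=2,walltime=01:00:00
#PBS -N nwchem_bi

cd "$PBS_O_WORKDIR"

# OLCF-recommended settings to reduce NWCHEM's load on the HSN.
export ARMCI_DMAPP_LOCK_ON_GET=1
export ARMCI_DMAPP_LOCK_ON_PUT=1

# Use 64 unless a different value is passed in at submission time.
# BI_VALUE is a hypothetical variable name chosen for this example.
export APRUN_BALANCED_INJECTION="${BI_VALUE:-64}"

# Assumed layout: 2 nodes x 16 cores per Titan node = 32 MPI ranks.
aprun -n 32 -N 16 nwchem input.nw
```

For example, `qsub -v BI_VALUE=50 nwchem.pbs` would submit the same script (here assumed to be saved as nwchem.pbs) with a value of 50, while a plain `qsub nwchem.pbs` keeps the recommended 64.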
