Optimization Guide for AMD64 Processors

AMD offers guidelines specifically for serial code optimization on AMD Opteron processors. Please see AMD’s Developer Documentation site for white papers and information on the latest generation of AMD processors.

File I/O Tips

Spider, the OLCF’s center-wide Lustre® file system, is configured for efficient, fast I/O across OLCF computational resources. You can find information about how to optimize your application’s I/O on the Spider page.


CrayPAT

CrayPat is a performance analysis tool for evaluating program execution on Cray systems. CrayPat consists of three main components:

  • pat_build – used to instrument the program for analysis
  • pat_report – a text report generator that can be used to explore data gathered by instrumented program execution
  • Apprentice2 – a graphical analysis tool that can be used in addition to pat_report to explore and visualize the data gathered by instrumented program execution
Note: Details of these components can be found in the pat_build, pat_report, and app2 man pages made available by loading the perftools-base module.

The standard workflow for program profiling with CrayPat is as follows:

  1. Load the perftools-base and perftools modules
  2. Build your application as normal
  3. Instrument the application with pat_build
  4. Execute the instrumented application
  5. View the performance data generated in Step 4 with pat_report and Apprentice2

The following steps will guide you through performing a basic analysis with CrayPat and Apprentice2.

Begin with a fully debugged and executable program. Since CrayPat is a performance analysis tool, not a debugger, the targeted program must be capable of running to completion, or intentional termination.

Step 1: Load the appropriate modules

To make CrayPat available, load the following modules:

module load perftools-base 
module load perftools

The perftools-base module must be loaded before the perftools module. Attempting to load perftools first will result in the following message:

Error: The Perftools module is available only after the perftools-base module
is loaded.
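
The load-order requirement above can be guarded in a job script. The following is a minimal sketch, assuming the Environment Modules package exports the colon-separated LOADEDMODULES variable (its conventional behavior); the helper name module_is_loaded is hypothetical and not part of CrayPat.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a module named $1 (with or without a version) is loaded.
# Assumes Environment Modules exports LOADEDMODULES as "name/version:name/version:...".
module_is_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *":$1:"* | *":$1/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether perftools can be loaded yet.
if module_is_loaded perftools-base; then
    echo "perftools-base is loaded; perftools can be loaded next"
else
    echo "load perftools-base first, then perftools"
fi
```

Matching on ":name/" and ":name:" keeps perftools from being confused with perftools-base or perftools-lite.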

The Perftools-base module:
    - Provides access to Perftools man pages, Reveal and Cray Apprentice2
    - Does not alter compiling or program behavior 
    - Makes the following instrumentation modules available: 

perftools                - full support, including pat_build and pat_report
perftools-lite           - default CrayPat-lite profile
perftools-lite-events    - CrayPat-lite event profile
perftools-lite-gpu       - CrayPat-lite gpu kernel and data movement
perftools-lite-loops     - CrayPat-lite loop estimates (for Reveal)
perftools-lite-hbm       - CrayPat-lite memory bandwidth estimates (for Reveal)

Step 2: Build your application

Now that CrayPat is loaded, compile the program using the recommended Cray compiler wrappers (cc, CC, ftn) and appropriate flags to preserve all .o (and .a, if any) files created at compile time. CrayPat requires access to these object files (and archive files, if any exist).

For example, if working with a Fortran program, use commands similar to

ftn -c my_program.f
ftn -o my_program my_program.o

in order to retain object files.

Similarly, if you are using a makefile, run

make clean

and then rebuild, so that fresh object files are created with CrayPat loaded.

Note: By default, a copy of the build’s .o files will be placed in /ccs/home/$USER/.craypat. This may push your home directory usage up to the quota limit. To change the location of the .craypat directory, set the PAT_LD_OBJECT_TMPDIR environment variable. For example:

export PAT_LD_OBJECT_TMPDIR=/tmp/work/$USER

Step 3: Instrument the application with pat_build

Instrument the program with

pat_build -O apa my_program

where -O apa is a special argument for Automatic Program Analysis. Data collected from apa-instrumented programs yields a .apa file (generated by pat_report in Step 5), which includes recommended parameters for improving the analysis.

This will produce an instrumented executable, my_program+pat.

Step 4: Execute the instrumented program

Executing the instrumented program will generate and collect performance analysis data, written to one or more data files. On a Cray XT/XK Series CNL system, programs are executed using the aprun command.

aprun -n <numproc> my_program+pat

This will produce on completion (or termination) the data file my_program+pat+PID-nodesdt.xf, which contains basic asynchronously derived program profiling data.

Step 5: View the performance data with pat_report or Apprentice2

pat_report is the text report generator used to explore data in .xf files. It also outputs .ap2 files used for graphically viewing data with Apprentice2.

pat_report -T -o report1.txt my_program+pat+PID-nodesdt.xf

will produce:

  • a sampling-based text report, report1.txt
  • a .ap2 file, my_program+pat+PID-nodesdt.ap2, which contains both the report data and the associated mapping from addresses to functions and source code line numbers. This file can be opened for viewing with Apprentice2.
  • a .apa file, my_program+pat+PID-nodesdt.apa, which contains the pat_build arguments for further analysis.
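
The file naming described above follows mechanically from the name of the .xf data file. As a small sketch (the helper name describe_pat_report is hypothetical, and it only prints text rather than running pat_report):

```shell
#!/bin/sh
# Hypothetical helper: print the pat_report command for data file $1, followed by
# the .ap2 and .apa file names expected next to it. $2 optionally names the report.
# Nothing is executed; this only illustrates the naming scheme.
describe_pat_report() {
    xf=$1
    base=${xf%.xf}                          # strip the .xf suffix
    echo "pat_report -T -o ${2:-report1.txt} $xf"
    echo "$base.ap2"
    echo "$base.apa"
}

describe_pat_report my_program+pat+PID-nodesdt.xf
```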

To use the Automatic Program Analysis parameters, open the .apa file in a text editor and make changes as appropriate. Any pat_build options may be added to this file (most commonly used: -g mpi, -g blacs, -g blas, -g io, -g lapack, -g lustre, -g math, -g scalapack, -g stdio, -g sysio, -g system). Re-instrument the program with pat_build, run the new executable, and generate another set of data files.

Note: When re-building the program for Automatic Program Analysis, it is not necessary to specify the program name.

pat_build -O my_program+pat+PID-nodesdt.apa

is sufficient, since the .apa file contains the program name.

To view results graphically, first ensure X-forwarding is enabled, then use

app2 my_program+apa+PID2-nodesdt.ap2

A GUI should appear to interact with the collected performance data.
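
The first pass of the walkthrough above (Steps 1 through 5, without the .apa re-instrumentation) can be collected into a single script. This is a dry-run sketch only: the run wrapper, the DRY_RUN switch, and the PROG and NPROC variables are illustrative names, and by default every command is printed rather than executed. Because the PID and node count in the data file name are not known in advance, a shell wildcard stands in for them.

```shell
#!/bin/sh
# Dry-run sketch of the basic CrayPat workflow.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it, which only
# makes sense on a system where the modules and Cray tools are available.
DRY_RUN=${DRY_RUN:-1}
PROG=${PROG:-my_program}      # illustrative program name used in the text
NPROC=${NPROC:-4}             # illustrative process count

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

craypat_workflow() {
    run module load perftools-base          # Step 1: modules, in this order
    run module load perftools
    run ftn -c "$PROG.f"                    # Step 2: build, keeping the .o file
    run ftn -o "$PROG" "$PROG.o"
    run pat_build -O apa "$PROG"            # Step 3: instrument
    run aprun -n "$NPROC" "$PROG+pat"       # Step 4: execute
    run pat_report -T -o report1.txt "$PROG"+pat+*.xf   # Step 5: report
}

craypat_workflow
```

Reviewing the printed commands before setting DRY_RUN=0 is a cheap way to catch a wrong program name or process count before spending an allocation on the run.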

Simple GPU Profiling Example

A simple GPU profiling example could be performed as follows:

With PrgEnv-Cray:

$ module load craype-accel-nvidia35
$ module load perftools-base perftools

With PrgEnv other than Cray:

$ module load cudatoolkit
$ module load perftools-base perftools


Then build, instrument, run, and generate the report:

$ nvcc -g -c cuda.cu
$ cc cuda.o launcher.c -o gpu.out
$ pat_build -u gpu.out
$ export PAT_RT_ACC_STATS=all
$ aprun gpu.out+pat
$ pat_report gpu.out+*.xf

CrayPAT Temporary Files

When building code with CrayPat in your $MEMBERWORK/[projid] or $PROJWORK/[projid] area, a copy of the build’s .o files will, by default, be placed in /ccs/home/$USER/.craypat.

This may increase your User Home directory usage above quota. The PAT_LD_OBJECT_TMPDIR environment variable can be used to control the location of the .craypat directory. For example:

export PAT_LD_OBJECT_TMPDIR=/tmp/work/$USER
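
In a build script, the redirection can be wrapped so that the target directory is guaranteed to exist before the variable is exported. A minimal sketch, assuming only the PAT_LD_OBJECT_TMPDIR variable named above; the helper use_craypat_tmpdir and the CRAYPAT_WORK variable are hypothetical names introduced for illustration:

```shell
#!/bin/sh
# Hypothetical helper: point CrayPat's object-file copies at directory $1,
# creating the directory first if it does not exist.
use_craypat_tmpdir() {
    mkdir -p "$1" || return 1
    PAT_LD_OBJECT_TMPDIR=$1
    export PAT_LD_OBJECT_TMPDIR
}

# CRAYPAT_WORK is an illustrative variable, not defined by CrayPat or OLCF.
# When it is unset, fall back to a per-process directory under TMPDIR or /tmp.
use_craypat_tmpdir "${CRAYPAT_WORK:-${TMPDIR:-/tmp}/craypat-$$}"
echo "PAT_LD_OBJECT_TMPDIR=$PAT_LD_OBJECT_TMPDIR"
```

On a real system, CRAYPAT_WORK would typically be set to a directory in a work area with a larger quota than the home directory.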


Additional CrayPAT Resources

For more details on linking nvcc and compiler wrapper compiled code please see our tutorial on Compiling Mixed GPU and CPU Code.


NVPROF

NVIDIA’s command-line profiler, nvprof, provides profiling for CUDA codes. No extra compiling steps are required to use nvprof. The profiler includes tracing capability as well as the ability to provide many performance metrics, including FLOPS. The profiler data can be saved and imported into the NVIDIA Visual Profiler for easier, graphical analysis.

To use NVPROF, the cudatoolkit module must be loaded and PMI daemon forking disabled. To view the output in the NVIDIA Visual Profiler, X11 forwarding must be enabled.

$ module load cudatoolkit
$ export PMI_NO_FORK=1

The aprun -b flag is currently required to use NVPROF; this requires that your executable reside on a filesystem visible to the compute nodes.

Although NVPROF doesn’t provide MPI aggregated data, the %h and %p output file modifiers can be used to create separate output files for each host and process.

$ aprun -b -n16 nvprof -o output.%h.%p ./gpu.out 
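
Since each process writes its own file, a quick sanity check after the run is to count the output files and compare the result against the number of processes launched. The sketch below assumes that %h and %p expand to strings without embedded path separators, so each file is named output.<host>.<pid>; the helper name count_nvprof_outputs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count files named <prefix>.<host>.<pid> in directory $2
# (default: current directory). <prefix> is the part given to nvprof -o before %h.
count_nvprof_outputs() {
    prefix=$1
    dir=${2:-.}
    n=0
    for f in "$dir/$prefix".*.*; do
        # An unmatched glob stays literal, so test that the file really exists.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

found=$(count_nvprof_outputs output)
echo "found $found nvprof output file(s); the example above launches 16 processes"
```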

A variety of metrics and events can be captured by the profiler. For example, to output the number of double precision flops you may use the following:

$ aprun -b -n16 nvprof --metrics flops_dp -o output.%h.%p ./gpu.out 

To see a list of all available metrics and events the following can be used:

$ aprun -b nvprof --query-metrics
$ aprun -b nvprof --query-events 

For information on how to view the output in the NVIDIA visual profiler, please see the NVIDIA documentation.

Additional NVPROF Resources

The nvprof user guide is available on the NVIDIA Developer Documentation Site and provides comprehensive coverage of the profiler’s usage and features.


Score-P

Score-P is a performance evaluation tool for large scale parallel applications. It provides a measurement infrastructure for profiling, event trace recording, and online analysis of High Performance Computing applications. Score-P allows users to instrument and record the behavior of sequential, multi-process (MPI, SHMEM), thread-parallel (OpenMP, Pthreads) and accelerator-based (CUDA, OpenCL) applications as well as hybrid parallel applications. Profile data, in CUBE4 format, can be viewed with CUBE or cube_stat. The Score-P trace files, in OTF2 format, can be visualized using Vampir.

For detailed information about using Score-P on Titan and the builds available, please see the Score-P Software Page.


Vampir

Vampir is a software performance analysis tool focused on highly parallel applications. It presents a unified view of an application run, including information on the programming paradigms used, such as MPI, OpenMP, PThreads, CUDA, OpenCL and OpenACC. It also incorporates performance data from hardware performance counters and other sources. Its many interactive displays offer insight into the performance behavior and reveal bottlenecks of applications. Vampir’s highly scalable analysis server and visualization engine enable interactive navigation through large amounts of detailed performance data.

Use Score-P to generate performance recordings for Vampir.

For detailed information about using Vampir on Titan and the builds available, please see the Vampir Software Page.


TAU

TAU Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. Generated traces can be viewed in the included ParaProf GUI or displayed in Vampir.

Simple GPU Profiling

A simple GPU profiling example could be performed as follows:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load tau cudatoolkit
$ nvcc source.cu -o gpu.out

Once the CUDA code has been compiled, tau_exec -cuda can be used to profile the code at runtime:

$ aprun tau_exec -cuda ./gpu.out

The resulting profile data can then be viewed using paraprof:

$ paraprof

Other TAU uses

module load tau

This command sets the TAUROOT environment variable on OLCF platforms and puts the TAU compiler wrappers in your PATH.

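Automatic instrumentation when compiling with the TAU Fortran wrapper (set TAU_MAKEFILE to one of the following; if both export lines are run, the second overrides the first):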

>  export TAU_MAKEFILE=${TAU_LIB}/Makefile.tau-papi-mpi-pdt-openmp-opari-pgi
>  export TAU_MAKEFILE=${TAU_LIB}/Makefile.tau-papi-mpi-pthread-pdt-pgi
>  tau_f90.sh test.f

Debug: Parsing with PDT Parser

> /sw/xt/tau/2.17/cnl2.0+pgi7.0.7/pdtoolkit-3.12//craycnl/bin/f95parse mpi_example8.f
-I/sw/xt/tau/2.17/cnl2.0+pgi7.0.7/tau-2.17/include -I/opt/xt-mpt/default/mpich2-64/P/include

Debug: Instrumenting with TAU

> /sw/xt/tau/2.17/cnl2.0+pgi7.0.7/tau-2.17/craycnl/bin/tau_instrumentor mpi_example8.pdb mpi_example8.f -o

Debug: Compiling (Individually) with Instrumented Code

> ftn -I. -c mpi_example8.inst.f -I/sw/xt/tau/2.17/cnl2.0+pgi7.0.7/tau-2.17/include
-I/opt/xt-mpt/default/mpich2-64/P/include -o mpi_example8.o
/opt/xt-pe/2.0.33/bin/snos64/ftn: INFO: linux target is being used

Debug: Linking (Together) object files

> ftn mpi_example8.o -L/opt/xt-mpt/default/mpich2-64/P/lib -L/sw/xt/tau/2.17/cnl2.0+pgi7.0.7/tau-2.17/craycnl/lib
-lTauMpi-mpi-pdt -lrt -lmpichcxx -lmpich -lrt -L/sw/xt/tau/2.17/cnl2.0+pgi7.0.7/tau-2.17/craycnl/lib -ltau-mpi-pdt
-L/opt/pgi/7.0.7/linux86-64/7.0/bin/../lib -lstd -lC -lpgc -o a.out
/opt/xt-pe/2.0.33/bin/snos64/ftn: INFO: linux target is being used

Debug: cleaning inst file

> /bin/rm -f mpi_example8.inst.f

Debug: cleaning PDB file

> /bin/rm -f mpi_example8.pdb

Running the instrumented executable then writes one profile file per process:

> aprun -n 4 ./a.out
> ls prof*
profile.0.0.0  profile.1.0.0  profile.2.0.0  profile.3.0.0
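
The profile file names can be used to confirm that every process wrote its data. The sketch below assumes the files follow the profile.<node>.<context>.<thread> pattern seen in the listing above, where the first number identifies the process; the helper name tau_profile_nodes is hypothetical and not part of TAU.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct <node> numbers found in
# profile.<node>.<context>.<thread> files under directory $1 (default: .).
tau_profile_nodes() {
    dir=${1:-.}
    for f in "$dir"/profile.*.*.*; do
        [ -e "$f" ] || continue     # skip the literal pattern when nothing matches
        name=${f##*/}               # profile.N.C.T
        rest=${name#profile.}       # N.C.T
        echo "${rest%%.*}"          # N
    done | sort -n | uniq
}

tau_profile_nodes .
```

For the four-process run above, this would list the numbers 0 through 3, one per line; a missing number points to a process that did not write a profile.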

To visualize the profile with the ParaProf tool:

> module load java-jre
> module load tau   #if not loaded
> paraprof

Additional TAU Resources

The TAU documentation website contains a complete User Guide, Reference Guide, and even video tutorials.


Arm MAP

Arm MAP (part of the Arm Forge suite, with DDT) is a profiler for parallel, multithreaded or single threaded C, C++, Fortran and F90 codes. It provides in-depth analysis and bottleneck pinpointing to the source line. Unlike most profilers, it’s designed to be able to profile pthreads, OpenMP or MPI for parallel and threaded code. MAP aims to be simple to use: there’s no need to instrument each source file or perform any configuration.

Linking your program with the MAP Sampler (for Cray systems)

In order to collect information about your program, you must link your program with the MAP sampling libraries.

When using shared libraries on non-Cray systems, MAP can do this automatically at runtime.

On Cray systems, this process must be performed manually. The map-link-static and map-link-dynamic modules can help with this.

  1. module load forge
  2. module load map-link-static # or map-link-dynamic
  3. Re-compile or re-link your program.

Do I need to recompile?

There’s no need to instrument your code with Arm MAP, so there’s no strict requirement to recompile. However, if your binary wasn’t compiled with the -g compiler flag, MAP won’t be able to show you source-line information, so recompiling would be beneficial.

Note: If using the Cray compiler, you may wish to use -G2 instead of -g. This will prevent the compiler from disabling most optimizations, which could affect runtime performance.

Generating a MAP output file

Arm MAP can generate a profile using the GUI or the command line. The GUI option should look familiar to existing users of DDT, whereas the command line option may offer the smoothest transition when moving from an existing launch configuration.

MAP profiles are small in size, and there’s generally no configuration required other than your existing aprun command line.

To generate a profile using MAP, take an existing queue submission script and modify to include the following:

source $MODULESHOME/init/bash # May already be included if using modules
module load forge

Then prefix your aprun command with map --profile, so that:

aprun -n 128 -N 8 ./myapp a b c

would become:

map --profile aprun -n 128 -N 8 ./myapp a b c
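
To keep a single submission script for both profiled and unprofiled runs, the prefix can be made conditional. This is a sketch only: USE_MAP is an illustrative switch rather than an Arm Forge variable, the helper name with_map is hypothetical, and the function prints the launch line instead of executing it.

```shell
#!/bin/sh
# Hypothetical helper: print the launch line, prefixed with "map --profile"
# when USE_MAP=1. Only the prefix itself comes from the text above.
with_map() {
    if [ "${USE_MAP:-0}" = "1" ]; then
        set -- map --profile "$@"   # prepend the prefix to the argument list
    fi
    printf '%s\n' "$*"
}

with_map aprun -n 128 -N 8 ./myapp a b c
```

In an actual job script, the printf line would be replaced with "$@" so that the assembled command is executed rather than printed.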

Once your job has completed running, the program’s working directory should contain a timestamped .map file such as myapp_1p_1t_2016-01-01_12-00.map.

Profiling a subset of your application

To profile only a subset of your application, you can either use --start-after=TIME and the related command-line options (see map --help for more information), or use the API to have your code tell MAP when to start and stop sampling, as described in the Arm Forge user guide.

Viewing a MAP profile

Once you have collected a profile, you can view the information using the map command, either by launching it and choosing “Load Profile Data File”, or by specifying the file on the command line, e.g.

map ./myapp_1p_1t_2016-01-01_12-00.map

(The above will require an SSH connection with X11 forwarding, or other remote graphics setup.)

An alternative that provides a local, native GUI (for Linux, OS X, or Windows) is to install the Arm Forge Remote Client on your local machine. This client is able to load and view profiles locally (useful when working offline), or remotely (which avoids the need to copy the profile data and corresponding source code to your local machine).

The remote client can be used for both Arm DDT and Arm MAP. For more information on how to install and configure the remote client, see the remote client setup page.

For more information see the Arm Forge user guide (also available via the “Help” menu in the MAP GUI).

Additional Arm MAP resources