scorep Overview

The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multi-processing (MPI, SHMEM), thread parallelism (OpenMP, Pthreads), accelerators (CUDA, OpenCL, OpenACC), and combinations thereof.

Usage


Score-P

The steps in a typical Score-P workflow are explained in detail below:

  1. Instrument your application with Score-P
  2. Perform a measurement run with profiling enabled
  3. Perform profile analysis with CUBE or cube_stat
  4. Use scorep-score to define an appropriate filter
  5. Perform a measurement run with tracing enabled and the filter applied
  6. Perform in-depth analysis on the trace data with Vampir

Instrument your Application

In order to instrument an application, you need to recompile the application using the Score-P instrumentation command, which is added as a prefix to the original compile and link command lines.
Example:

$ module load scorep
$ scorep cc -o test test.c

The Score-P instrumentation command will use the compilers of your loaded programming environment by default. If you switch the PrgEnv, reload the scorep module:

$ module unload scorep
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load scorep

Usually the Score-P instrumenter scorep is able to automatically detect the programming paradigm from the set of compile and link options given to the compiler. In some cases however, e.g., for CUDA applications, scorep needs to be made aware of the programming paradigm in order to do the correct instrumentation.

The cudatoolkit module must be loaded for GPU traces:

$ module load cudatoolkit
$ module load scorep

Once loaded, the program must be recompiled by prefixing scorep to the original compiler:

$ scorep cc source.c
$ scorep --cuda nvcc source.cu

To see all available options for instrumentation:

$ scorep --help
 This is the Score-P instrumentation tool. The usage is:
 scorep <options> <original command>
 
Common options are:
  ...
  --compiler    Enables compiler instrumentation.
                By default, it disables pdt instrumentation.
  --nocompiler  Disables compiler instrumentation.
  --user        Enables user instrumentation.
  --cuda        Enables cuda instrumentation.
  ...

Example for instrumenting the C++ part of a CUDA application with user instrumentation enabled:

$ scorep --cuda --user CC

Example for generating trace files:

$ export SCOREP_ENABLE_PROFILING=false 
$ export SCOREP_ENABLE_TRACING=true
$ aprun instrumented_binary.out

For CMake and autotools based build systems, it is recommended to use the scorep-wrapper script instances. The intended usage of the wrapper instances is to replace the application’s compiler and linker with the corresponding wrapper at configuration time so that they will be used at build time. As Score-P instrumentation during the cmake or configure steps is likely to fail, the wrapper script allows you to disable instrumentation by setting the variable SCOREP_WRAPPER to OFF.

For CMake based build systems it is recommended to configure in the following way:

$ SCOREP_WRAPPER=OFF cmake .. \
     -DCMAKE_C_COMPILER=scorep-cc \
     -DCMAKE_CXX_COMPILER=scorep-CC \
     -DCMAKE_Fortran_COMPILER=scorep-ftn

For autotools based build systems it is recommended to configure as follows:

$ SCOREP_WRAPPER=OFF ../configure \
     CC=scorep-cc \
     CXX=scorep-CC \
     FC=scorep-ftn \
     --disable-dependency-tracking

Note: SCOREP_WRAPPER=OFF disables the instrumentation only in the environment of the configure or cmake command. Subsequent calls to “make” are not affected and will instrument the application as expected.

To pass options to the scorep command in order to diverge from the default instrumentation or to activate CUDA instrumentation, use the variable SCOREP_WRAPPER_INSTRUMENTER_FLAGS at make time:

$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda"

The wrapper also allows for passing of flags to the wrapped compiler call by using the variable SCOREP_WRAPPER_COMPILER_FLAGS.
Example for a “make” command using both Score-P instrumentation and additional compiler flags:

$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda" SCOREP_WRAPPER_COMPILER_FLAGS="-g -O2"

This will result in the execution of:

scorep --cuda <your-compiler> -g -O2

Note: If “make install” re-links the application and you are not using the default instrumentation, you need to pass the Score-P instrumentation and compiler flags again, as in the “make” example above.

Perform a measurement run with profiling enabled

Measurements are configured via environment variables:

$ scorep-info config-vars --full
SCOREP_ENABLE_PROFILING
[...]
SCOREP_ENABLE_TRACING
[...]
SCOREP_TOTAL_MEMORY
Description: Total memory in bytes for the measurement system
[...]
SCOREP_EXPERIMENT_DIRECTORY
Description: Name of the experiment directory
[...]

On Titan, by default, profiling is enabled and tracing is disabled.

Here is an example for generating a profile using the environment variables and aprun:

$ export SCOREP_ENABLE_PROFILING=true 
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ aprun instrumented_binary.out

Perform analysis on profile data

Profile performance analysis can be done with CUBE or cube_stat.

CUBE is a profile analysis tool for displaying performance data of parallel programs. It can be run on the login nodes if you are using X forwarding. If the GUI is too slow to respond, you may need to connect with a remote client, such as NoMachine. Score-P generates files of the form profile.cubex for this application.

Call-path profile analysis with CUBE:

$ cube profile/profile.cubex

Please see CUBE User Guide for a more detailed description.

For a quick top-n text-based flat profile analysis you can also use the tool cube_stat.

Flat profile analysis with cube_stat for the top 3 most time consuming functions:

$ cube_stat -t 3 -p profile/profile.cubex 

Output:

cube::Region            NumberOfCalls       ExclusiveTime  InclusiveTime
binvcrhs               522844416.000000       200.939958    200.939958
!$omp do @z_solve.f:52   51456.000000         159.719801    321.887996
!$omp do @y_solve.f:52   51456.000000         147.645683    302.313644

Define an appropriate filter with scorep-score

scorep-score is a tool that estimates the size of an OTF2 trace from a CUBE4 profile. It also estimates the effect of filters. Its main purpose is to help define appropriate filters for a tracing run based on a profile.

To invoke scorep-score with a detailed output for every recorded function you must provide the -r option and the filename of a CUBE4 profile as arguments:

$ scorep-score -r profile/profile.cubex

Output:

Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 10GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       10GB
(warning: The memory requirements can not be satisfied by Score-P to avoid
 intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
 maximum supported memory or reduce requirements using USR regions filters.)

Flt type     max_buf[B]        visits time[s] time[%] time/visit[us]  region
     ALL 10,690,196,070 1,634,070,493 1081.30   100.0           0.66  ALL
     USR 10,666,890,182 1,631,138,069  470.23    43.5           0.29  USR
     OMP     22,025,152     2,743,808  606.80    56.1         221.15  OMP
     COM      1,178,450       181,300    2.36     0.2          13.04  COM
     MPI        102,286         7,316    1.90     0.2         260.07  MPI

     USR  3,421,305,420   522,844,416  144.46    13.4           0.28  matmul_sub
     USR  3,421,305,420   522,844,416  102.40     9.5           0.20  matvec_sub
     ...

The first line of the output gives an estimation of the total size of the trace, aggregated over all processes. This information is useful for estimating the space required on disk. In the given example, the estimated total size of the event trace is 40GB.

The second line prints an estimation of the memory space required by a single process for the trace. Since flushes heavily disturb measurements, the memory space that Score-P reserves on each process at application start must be large enough to hold the process’ trace in memory in order to avoid flushes during runtime. In addition to the trace, Score-P requires some additional memory to maintain internal data structures. Thus, it also provides an estimation of the total amount of memory required on each process. The memory size per process that Score-P reserves is set via the environment variable SCOREP_TOTAL_MEMORY. In the given example, the per-process memory is about 10GB. A Titan node provides 32GB of RAM. If using 16 processes on a node, you have to reduce the per-process memory to 2GB by defining a filter.
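
The 2GB figure follows from dividing the node memory by the number of processes per node. A minimal sketch of that arithmetic, assuming the Titan values quoted above (the variable names are illustrative only):

```shell
#!/bin/sh
# Assumed values, taken from the Titan example in the text.
node_mem_gb=32
procs_per_node=16

# Ceiling per process if Score-P could use all of the node's RAM.
# The application needs memory as well, so the practical limit is lower.
per_proc_gb=$((node_mem_gb / procs_per_node))

echo "Per-process ceiling for SCOREP_TOTAL_MEMORY: ${per_proc_gb}GB"
```

If the estimate reported by scorep-score is above this ceiling, a filter is needed before the tracing run.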

When defining a filter, it is recommended to exclude short, frequently called functions from measurement, since they require a lot of buffer space (represented by a high value under max_buf) and incur a high measurement overhead. MPI functions and OpenMP constructs cannot be filtered. Thus, it is usually a good approach to exclude regions of type USR, starting at the top of the list, until you have reduced the trace to your needs.

The example below excludes the functions matmul_sub and matvec_sub from the trace:

$ cat scorep.filter
SCOREP_REGION_NAMES_BEGIN
 EXCLUDE
   matmul_sub
   matvec_sub
SCOREP_REGION_NAMES_END

You can use scorep-score to test the effect of your filter on the trace file. To do so, pass a -f followed by the file name of your filter:

$ scorep-score profile/profile.cubex -f scorep.filter
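
Picking the largest USR regions by hand can be tedious for long listings. The sketch below shows one way the selection might be automated: it reads scorep-score -r style rows on standard input and prints a filter that excludes every USR region whose max_buf exceeds a threshold. The helper name make_filter and the 1,000,000,000-byte threshold are hypothetical, and the parsing assumes region names contain no whitespace, which does not hold for every application.

```shell
#!/bin/sh
# Hypothetical helper (not part of Score-P): build a filter from
# `scorep-score -r` rows read on stdin. $1 is the max_buf limit in bytes.
make_filter() {
    threshold_bytes=$1
    echo "SCOREP_REGION_NAMES_BEGIN"
    echo " EXCLUDE"
    awk -v limit="$threshold_bytes" '
        # Per-region rows have 7 fields. Group summary rows repeat the
        # type as the region name (e.g. "USR ... USR") and are skipped.
        NF == 7 && $1 == "USR" && $7 != "USR" {
            buf = $2
            gsub(/,/, "", buf)      # strip thousands separators
            if (buf + 0 > limit + 0) print "   " $7
        }'
    echo "SCOREP_REGION_NAMES_END"
}

# Rows copied from the example output above; in practice the input
# would come from the real scorep-score command.
sample='     USR 10,666,890,182 1,631,138,069  470.23    43.5           0.29  USR
     OMP     22,025,152     2,743,808  606.80    56.1         221.15  OMP
     USR  3,421,305,420   522,844,416  144.46    13.4           0.28  matmul_sub
     USR  3,421,305,420   522,844,416  102.40     9.5           0.20  matvec_sub'

filter_text=$(printf '%s\n' "$sample" | make_filter 1000000000)
printf '%s\n' "$filter_text"
```

The generated text can be redirected into a filter file and then checked with scorep-score -f as described above before it is used in a tracing run.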

Perform a measurement run with tracing enabled and the filter applied

The filter given above is applied by exporting the SCOREP_FILTERING_FILE variable. Furthermore, SCOREP_TOTAL_MEMORY is set according to the per-process memory size estimated by scorep-score. This run generates files of the form traces.otf2, which can be analyzed with Vampir.

$ export SCOREP_ENABLE_PROFILING=false
$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_EXPERIMENT_DIRECTORY=trace
$ export SCOREP_TOTAL_MEMORY=2GB
$ export SCOREP_FILTERING_FILE=scorep.filter

$ aprun instrumented_binary.out
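
The profiling and tracing runs differ only in a few environment variables, so a small shell function can make switching between them less error-prone. The following is a sketch under that assumption; scorep_mode is a hypothetical helper, not a Score-P command, and it reuses the mode name as the experiment directory, as in the examples in this document.

```shell
#!/bin/sh
# Hypothetical helper (not part of Score-P): export the measurement
# variables for either a profiling run or a tracing run.
scorep_mode() {
    case "$1" in
        profile)
            export SCOREP_ENABLE_PROFILING=true
            export SCOREP_ENABLE_TRACING=false
            ;;
        trace)
            export SCOREP_ENABLE_PROFILING=false
            export SCOREP_ENABLE_TRACING=true
            ;;
        *)
            echo "usage: scorep_mode profile|trace" >&2
            return 1
            ;;
    esac
    # Keep the results of each mode in a separate experiment directory.
    export SCOREP_EXPERIMENT_DIRECTORY="$1"
}

scorep_mode trace
echo "profiling=$SCOREP_ENABLE_PROFILING tracing=$SCOREP_ENABLE_TRACING dir=$SCOREP_EXPERIMENT_DIRECTORY"
```

SCOREP_TOTAL_MEMORY and SCOREP_FILTERING_FILE are left out on purpose, since their values depend on the scorep-score estimate for the specific application.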

Perform analysis on the trace data with Vampir

Vampir provides a graphical interface for analyzing the traces.otf2 files generated with Score-P.

To open very small trace files in the trace directory, log in with X forwarding enabled and issue the following command:

$ vampir trace/traces.otf2 

Note: This is not recommended for most trace files. Instead, for better performance, please use the server/client version; see the Vampir documentation for further information.

Supported features based on scorep/3.0

Titan PGI GNU INTEL CRAY
MPI instrumentation x x x x
OpenMP instrumentation x x x x
Pthreads instrumentation x x x x
CUDA instrumentation x x x x
OpenCL instrumentation x x x
OpenACC instrumentation x
Cray-SHMEM instrumentation x x x x
TAU instrumentation x x x x
PAPI counter x x x x
Sampling x x x x
Memory Recording x x x x

Additional Score-P Resources

The Score-P User Manual provides comprehensive coverage of Score-P usage.

Builds

TITAN

  • 3.0
  • 3.1

EOS

  • 3.0
  • 3.1

RHEA

  • 3.0
  • 3.1