scorep
Overview
The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multi-processing (MPI, SHMEM), thread parallelism (OpenMP, Pthreads), accelerators (CUDA, OpenCL, OpenACC), and combinations of these.
The steps in a typical Score-P workflow are explained in detail below:
- Instrument your application with Score-P
- Perform a measurement run with profiling enabled
- Perform profile analysis with CUBE or cube_stat
- Use scorep-score to define an appropriate filter
- Perform a measurement run with tracing enabled and the filter applied
- Perform in-depth analysis on the trace data with Vampir
Instrument your Application
In order to instrument an application, you need to recompile the application using the Score-P instrumentation command, which is added as a prefix to the original compile and link command lines.
$ module load scorep
$ scorep cc -o test test.c
The Score-P instrumentation command will use the compilers of your loaded programming environment by default. If you switch the PrgEnv, reload the scorep module:
$ module unload scorep
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load scorep
Usually the Score-P instrumenter scorep is able to automatically detect the programming paradigm from the set of compile and link options given to the compiler. In some cases however, e.g., for CUDA applications, scorep needs to be made aware of the programming paradigm in order to do the correct instrumentation.
For GPU traces, the cudatoolkit module must be loaded together with scorep:
$ module load cudatoolkit
$ module load scorep
Once loaded, the program must be recompiled by prefixing scorep to the original compiler command:
$ scorep cc source.c
$ scorep --cuda nvcc source.cu
To see all available options for instrumentation:
$ scorep --help
This is the Score-P instrumentation tool. The usage is:
scorep <options> <original command>

Common options are:
...
  --compiler      Enables compiler instrumentation.
                  By default, it disables pdt instrumentation.
  --nocompiler    Disables compiler instrumentation.
  --user          Enables user instrumentation.
  --cuda          Enables cuda instrumentation.
...
Example for instrumenting the C++ part of a CUDA application with user instrumentation enabled:
$ scorep --cuda --user CC
Example for generating trace files:
$ export SCOREP_ENABLE_PROFILING=false
$ export SCOREP_ENABLE_TRACING=true
$ aprun instrumented_binary.out
For CMake and autotools based build systems, it is recommended to use the scorep-wrapper script instances. The intended usage is to replace the application's compiler and linker with the corresponding wrapper at configuration time, so that the wrappers are used at build time. Because Score-P instrumentation during the cmake or configure step is likely to fail, the wrapper script allows instrumentation to be disabled by setting the variable SCOREP_WRAPPER to OFF.
For CMake based build systems it is recommended to configure in the following way:
$ SCOREP_WRAPPER=OFF cmake .. \
    -DCMAKE_C_COMPILER=scorep-cc \
    -DCMAKE_CXX_COMPILER=scorep-CC \
    -DCMAKE_Fortran_COMPILER=scorep-ftn
For autotools based build systems it is recommended to configure as follows:
$ SCOREP_WRAPPER=OFF ../configure \
    CC=scorep-cc \
    CXX=scorep-CC \
    FC=scorep-ftn \
    --disable-dependency-tracking
To pass options to the scorep command in order to diverge from the default instrumentation or to activate CUDA instrumentation, use the variable SCOREP_WRAPPER_INSTRUMENTER_FLAGS at make time:
$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda"
Flags can also be passed to the wrapped compiler call through the variable SCOREP_WRAPPER_COMPILER_FLAGS.
Example for a “make” command using both Score-P instrumentation and additional compiler flags:
$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda" SCOREP_WRAPPER_COMPILER_FLAGS="-g -O2"
This will result in the execution of:
scorep --cuda <your-compiler> -g -O2
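How the two variables combine can be pictured with a small shell model. This is only an illustrative sketch of the behavior described above, not the actual scorep-wrapper implementation; the function name wrapper_command is hypothetical, and the handling of SCOREP_WRAPPER=OFF (run the plain compiler) is an assumption based on the description of that variable.

```shell
#!/bin/sh
# Illustrative model only: prints the command line that a wrapper call
# would expand to, following the description in this section.
# wrapper_command is a hypothetical helper, not part of Score-P.
wrapper_command() {
    compiler=$1
    shift
    if [ "${SCOREP_WRAPPER:-ON}" = "OFF" ]; then
        # Instrumentation disabled (e.g. during cmake or configure):
        # assumed here to run only the plain compiler.
        echo "$compiler $*"
    else
        echo "scorep ${SCOREP_WRAPPER_INSTRUMENTER_FLAGS} $compiler ${SCOREP_WRAPPER_COMPILER_FLAGS} $*"
    fi
}

SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda"
SCOREP_WRAPPER_COMPILER_FLAGS="-g -O2"
wrapper_command cc -c test.c
# prints: scorep --cuda cc -g -O2 -c test.c
```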
Perform a measurement run with profiling enabled
Measurements are configured via environment variables:
$ scorep-info config-vars --full
SCOREP_ENABLE_PROFILING
[...]
SCOREP_ENABLE_TRACING
[...]
SCOREP_TOTAL_MEMORY
  Description: Total memory in bytes for the measurement system
[...]
SCOREP_EXPERIMENT_DIRECTORY
  Description: Name of the experiment directory
[...]
On Titan, by default, profiling is enabled and tracing is disabled.
Here is an example for generating a profile using the environment variables and aprun:
$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ aprun instrumented_binary.out
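Profiling and tracing runs differ only in a few environment variables, so the settings can be collected in a small helper. The following is a sketch: the function name scorep_mode is hypothetical, and the aprun launch is left as a comment so that the snippet runs outside a job allocation.

```shell
#!/bin/sh
# scorep_mode is a hypothetical helper; the variables it exports are the
# Score-P measurement variables described in this section.
scorep_mode() {
    mode=$1      # "profile" or "trace"
    dir=$2       # experiment directory name
    case "$mode" in
        profile)
            export SCOREP_ENABLE_PROFILING=true
            export SCOREP_ENABLE_TRACING=false
            ;;
        trace)
            export SCOREP_ENABLE_PROFILING=false
            export SCOREP_ENABLE_TRACING=true
            ;;
        *)
            echo "usage: scorep_mode profile|trace <directory>" >&2
            return 1
            ;;
    esac
    export SCOREP_EXPERIMENT_DIRECTORY=$dir
}

scorep_mode profile profile
echo "profiling=$SCOREP_ENABLE_PROFILING tracing=$SCOREP_ENABLE_TRACING dir=$SCOREP_EXPERIMENT_DIRECTORY"
# aprun instrumented_binary.out   # launch the instrumented binary here
```

Calling scorep_mode trace trace before the launch would switch the same job script to a tracing run.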
Perform analysis on profile data
Profile performance analysis can be done with CUBE or cube_stat.
CUBE is a profile analysis tool for displaying performance data of parallel programs. It can be run on the login nodes if you are using X forwarding. If the GUI is too slow to respond, you may need to connect with a remote desktop client, such as NoMachine. Score-P generates files of the form profile.cubex for this application.
Call-path profile analysis with CUBE:
$ cube profile/profile.cubex
Please see the CUBE User Guide for a more detailed description.
For a quick top-n text-based flat profile analysis you can also use the tool cube_stat.
Flat profile analysis with cube_stat for the three most time-consuming functions:
$ cube_stat -t 3 -p profile/profile.cubex
cube::Region               NumberOfCalls      ExclusiveTime   InclusiveTime
binvcrhs                   522844416.000000   200.939958      200.939958
!$omp do @z_solve.f:52     51456.000000       159.719801      321.887996
!$omp do @y_solve.f:52     51456.000000       147.645683      302.313644
Define an appropriate filter with scorep-score
scorep-score is a tool that estimates the size of an OTF2 trace from a CUBE4 profile. It also estimates the effect of filters. Its main purpose is to help define appropriate filters for a tracing run based on a profile.
To invoke scorep-score with a detailed output for every recorded function you must provide the -r option and the filename of a CUBE4 profile as arguments:
$ scorep-score -r profile/profile.cubex
Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 10GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       10GB
(warning: The memory requirements can not be satisfied by Score-P to avoid
 intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
 maximum supported memory or reduce requirements using USR regions filters.)

Flt type      max_buf[B]         visits  time[s]  time[%]  time/visit[us]  region
    ALL   10,690,196,070  1,634,070,493  1081.30    100.0            0.66  ALL
    USR   10,666,890,182  1,631,138,069   470.23     43.5            0.29  USR
    OMP       22,025,152      2,743,808   606.80     56.1          221.15  OMP
    COM        1,178,450        181,300     2.36      0.2           13.04  COM
    MPI          102,286          7,316     1.90      0.2          260.07  MPI

    USR    3,421,305,420    522,844,416   144.46     13.4            0.28  matmul_sub
    USR    3,421,305,420    522,844,416   102.40      9.5            0.20  matvec_sub
    ...
The first line of the output gives an estimation of the total size of the trace, aggregated over all processes. This information is useful for estimating the space required on disk. In the given example, the estimated total size of the event trace is 40GB.
The second line prints an estimate of the memory required by a single process for the trace. Since flushes heavily disturb measurements, the memory that Score-P reserves on each process at application start must be large enough to hold that process's trace, so that no flushes occur at runtime. In addition to the trace, Score-P needs some memory to maintain internal data structures, so the third line gives an estimate of the total memory required on each process. The amount of memory that Score-P reserves per process is set via the environment variable SCOREP_TOTAL_MEMORY. In the given example, the per-process memory is about 10GB. A Titan node provides 32GB of RAM. If using 16 processes on a node, you have to reduce the per-process memory to 2GB by defining a filter.
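The 2GB figure follows from dividing the node memory evenly among the processes. A minimal sketch of that arithmetic, using the node size and process count quoted above; the helper name is hypothetical, and the calculation does not subtract the memory the application itself needs.

```shell
#!/bin/sh
# Hypothetical helper: upper bound for SCOREP_TOTAL_MEMORY per process,
# in whole GB, obtained by dividing node memory evenly among processes.
# It ignores the memory the application itself requires.
per_process_memory_gb() {
    node_gb=$1
    procs_per_node=$2
    echo $(( node_gb / procs_per_node ))
}

limit=$(per_process_memory_gb 32 16)
echo "At most ${limit}GB per process; scorep-score estimated 10GB"
# prints: At most 2GB per process; scorep-score estimated 10GB
```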
When defining a filter, it is recommended to exclude short, frequently called functions from measurement, since they require a lot of buffer space (represented by a high value under max_buf) and incur a high measurement overhead. MPI functions and OpenMP constructs cannot be filtered. Thus, it is usually a good approach to exclude regions of type USR, starting at the top of the list, until you have reduced the trace to your needs.
The example below excludes the functions matmul_sub and matvec_sub from the trace:
$ cat scorep.filter
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    matmul_sub
    matvec_sub
SCOREP_REGION_NAMES_END
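Picking USR regions from the top of the list can also be sketched with awk. The input below copies rows from the example scorep-score output into a here-document; in practice the rows would come from scorep-score -r. The assumed column layout (no filter marker in the first column, region names without spaces), the 1,000,000,000-byte threshold, and the output file name candidate.filter are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: build a filter file from scorep-score -r style rows.
# Assumed columns: type, max_buf[B], visits, time[s], time[%],
# time/visit[us], region.
threshold=1000000000   # assumed cut-off in bytes, chosen for illustration

{
    echo "SCOREP_REGION_NAMES_BEGIN"
    echo "  EXCLUDE"
    awk -v limit="$threshold" '
        # Keep individual USR regions; skip the per-type summary row,
        # whose region column repeats the type name.
        $1 == "USR" && NF == 7 && $7 != "USR" {
            bytes = $2
            gsub(/,/, "", bytes)          # strip thousands separators
            if (bytes + 0 > limit + 0) print "    " $7
        }
    ' <<'EOF'
USR 10,666,890,182 1,631,138,069 470.23 43.5 0.29 USR
COM 1,178,450 181,300 2.36 0.2 13.04 COM
USR 3,421,305,420 522,844,416 144.46 13.4 0.28 matmul_sub
USR 3,421,305,420 522,844,416 102.40 9.5 0.20 matvec_sub
EOF
    echo "SCOREP_REGION_NAMES_END"
} > candidate.filter

cat candidate.filter
```

A generated file like this is only a starting point; its effect should still be checked with scorep-score and the -f option before it is used for a tracing run.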
You can use scorep-score to test the effect of your filter on the trace file. To do so, pass the -f option followed by the file name of your filter:
$ scorep-score profile/profile.cubex -f scorep.filter
Perform a measurement run with tracing enabled and the filter applied
The filter given above is applied by exporting the SCOREP_FILTERING_FILE variable. Furthermore, SCOREP_TOTAL_MEMORY is set according to the per-process memory size estimated by scorep-score. This run generates files of the form traces.otf2, which can be analyzed with Vampir.
$ export SCOREP_ENABLE_PROFILING=false
$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_EXPERIMENT_DIRECTORY=trace
$ export SCOREP_TOTAL_MEMORY=2GB
$ export SCOREP_FILTERING_FILE=scorep.filter
$ aprun instrumented_binary.out
Perform analysis on the trace data with Vampir
Vampir provides a graphical interface for analyzing the traces.otf2 files generated with Score-P.
To open very small trace files in the trace directory, log in with X forwarding enabled and issue the following command:
$ vampir trace/traces.otf2
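Before starting the GUI it can help to confirm that the tracing run actually produced a trace. A small pre-flight check, sketched under the assumption that the experiment directory is named trace as in the run above; the function name trace_file is hypothetical, and the Vampir launch is only echoed so that the snippet runs without Vampir installed.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print the path of the trace file if it
# exists in the given experiment directory, otherwise fail with a hint.
trace_file() {
    dir=${1:-trace}
    if [ -f "$dir/traces.otf2" ]; then
        echo "$dir/traces.otf2"
    else
        echo "no traces.otf2 in '$dir'; was tracing enabled for the run?" >&2
        return 1
    fi
}

if file=$(trace_file trace); then
    echo "would run: vampir $file"
fi
```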
Supported features based on scorep/3.0
Additional Score-P Resources
The Score-P User Manual provides comprehensive coverage of Score-P usage.