Score-P is a performance evaluation tool for large scale parallel applications. It provides a measurement infrastructure for profiling, event trace recording, and online analysis of High Performance Computing applications. Score-P allows users to instrument and record the behavior of sequential, multi-process (MPI, SHMEM), thread-parallel (OpenMP, Pthreads) and accelerator-based (CUDA, OpenCL) applications as well as hybrid parallel applications. Profile data, in CUBE4 format, can be viewed with CUBE or cube_stat. The Score-P trace files, in OTF2 format, can be visualized using Vampir.
Here is an overview of the workflow for Score-P. In the sections below each of these steps will be explained in greater detail.
- Instrument your application with Score-P
- Perform a measurement run with profiling enabled
- Perform profile analysis with CUBE or cube_stat
- Use scorep-score to define an appropriate filter
- Perform a measurement run with tracing enabled and the filter applied
- Perform in-depth analysis on the trace data with Vampir
Instrument your Application
To instrument an application, recompile it using the Score-P instrumentation command, which is added as a prefix to the original compile and link command lines. This applies to each of the compiler wrappers (cc, CC, ftn), for example:
$ module load scorep
$ scorep cc -o test test.c
The Score-P instrumentation command will use the compilers of your loaded programming environment by default. If you switch the PrgEnv, reload the scorep module.
$ module unload scorep
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load scorep
Usually the Score-P instrumenter scorep is able to automatically detect the programming paradigm from the set of compile and link options given to the compiler. In some cases however, e.g., for CUDA applications, scorep needs to be made aware of the programming paradigm in order to do the correct instrumentation.
To see all available options for instrumentation:
$ scorep --help
This is the Score-P instrumentation tool. The usage is:
scorep <options> <original command>

Common options are:
...
  --compiler      Enables compiler instrumentation.
                  By default, it disables pdt instrumentation.
  --nocompiler    Disables compiler instrumentation.
  --user          Enables user instrumentation.
  --cuda          Enables cuda instrumentation.
...
Example for instrumenting the C++ part of a CUDA application with user instrumentation enabled:
$ scorep --cuda --user CC
For CMake and autotools based build systems it is recommended to use the scorep-wrapper script instances, see below. The intended usage of the wrapper instances is to replace the application’s compiler and linker with the corresponding wrapper at configuration time so that they will be used at build time. As Score-P instrumentation during the cmake or configure steps is likely to fail, the wrapper script allows you to disable instrumentation by setting the variable SCOREP_WRAPPER to OFF.
For CMake based build systems it is recommended to configure in the following way:
$ SCOREP_WRAPPER=OFF cmake .. \
    -DCMAKE_C_COMPILER=scorep-cc \
    -DCMAKE_CXX_COMPILER=scorep-CC \
    -DCMAKE_Fortran_COMPILER=scorep-ftn
For autotools based build systems it is recommended to configure as follows:
$ SCOREP_WRAPPER=OFF ../configure \
    CC=scorep-cc \
    CXX=scorep-CC \
    FC=scorep-ftn \
    --disable-dependency-tracking
Note: SCOREP_WRAPPER=OFF disables the instrumentation only in the environment of the configure or cmake command. Subsequent calls to “make” are not affected and will instrument the application as expected.
To pass options to the scorep command in order to diverge from the default instrumentation or to activate CUDA instrumentation, use the variable SCOREP_WRAPPER_INSTRUMENTER_FLAGS at make time:
$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda"
The wrapper also allows you to pass flags to the wrapped compiler call by using the variable SCOREP_WRAPPER_COMPILER_FLAGS.
Example for a “make” command using both Score-P instrumentation and additional compiler flags:
$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--cuda" SCOREP_WRAPPER_COMPILER_FLAGS="-g -O2"
This will result in the execution of:
scorep --cuda <your-compiler> -g -O2
Note: If “make install” does a re-linking and you are not using the default instrumentation, you need to pass Score-P instrumentation and compiler flags again, see “make”.
Perform a measurement run with profiling enabled
Measurements are configured via environment variables:
$ scorep-info config-vars --full
SCOREP_ENABLE_PROFILING
[...]
SCOREP_ENABLE_TRACING
[...]
SCOREP_TOTAL_MEMORY
  Description: Total memory in bytes for the measurement system
[...]
SCOREP_EXPERIMENT_DIRECTORY
  Description: Name of the experiment directory
[...]
On Titan, by default, profiling is enabled and tracing is disabled.
Here is an example for generating a profile using the environment variables and aprun:
$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ aprun instrumented_binary.out
Perform analysis on profile data
Profile performance analysis can be done with CUBE or cube_stat.
CUBE is a profile analysis tool for displaying performance data of parallel programs. It can be run on the login nodes if you are using X forwarding. You may need a remote client, such as NoMachine, if the GUI responds too slowly. Score-P generates a file of the form profile.cubex for this application.
Call-path profile analysis with CUBE:
$ cube profile/profile.cubex
Please see CUBE User Guide for a more detailed description.
For a quick topN text-based flat profile analysis you can also use the tool cube_stat.
Flat profile analysis with cube_stat for the top 3 most time consuming functions:
$ cube_stat -t 3 -p profile/profile.cubex
cube::Region                 NumberOfCalls  ExclusiveTime  InclusiveTime
binvcrhs                  522844416.000000     200.939958     200.939958
!$omp do @z_solve.f:52        51456.000000     159.719801     321.887996
!$omp do @y_solve.f:52        51456.000000     147.645683     302.313644
Define an appropriate filter with scorep-score
scorep-score is a tool that estimates the size of an OTF2 trace from a CUBE4 profile. It also estimates the effects of filters. The main goal is to define appropriate filters for a tracing run from a profile.
To invoke scorep-score with a detailed output for every recorded function you must provide the -r option and the filename of a CUBE4 profile as argument:
$ scorep-score -r profile/profile.cubex
Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 10GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       10GB
(warning: The memory requirements can not be satisfied by Score-P to avoid
 intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
 maximum supported memory or reduce requirements using USR regions filters.)

Flt type     max_buf[B]        visits  time[s] time[%] time/visit[us]  region
    ALL 10,690,196,070 1,634,070,493  1081.30   100.0           0.66  ALL
    USR 10,666,890,182 1,631,138,069   470.23    43.5           0.29  USR
    OMP     22,025,152     2,743,808   606.80    56.1         221.15  OMP
    COM      1,178,450       181,300     2.36     0.2          13.04  COM
    MPI        102,286         7,316     1.90     0.2         260.07  MPI

    USR  3,421,305,420   522,844,416   144.46    13.4           0.28  matmul_sub
    USR  3,421,305,420   522,844,416   102.40     9.5           0.20  matvec_sub
    ...
The first line of the output gives an estimation of the total size of the trace, aggregated over all processes. This information is useful for estimating the space required on disk. In the given example, the estimated total size of the event trace is 40GB.
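Before a tracing run, the aggregate estimate can be compared with the free space at the target location. The following is a minimal sketch: the value 40 comes from the example output, while the variable names and the target directory are illustrative assumptions. It uses POSIX df only and does not account for quotas or file-system-specific limits.

```shell
#!/bin/sh
# required_gb comes from the scorep-score estimate in the example above;
# target_dir is an illustrative assumption for where the trace will be written.
required_gb=40
target_dir=.

# POSIX df: -P prints one line per file system, -k reports 1024-byte blocks;
# field 4 of the data line is the available space.
avail_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -ge "$required_gb" ]; then
  echo "OK: ${avail_gb}GB available, ${required_gb}GB estimated"
else
  echo "Warning: only ${avail_gb}GB available, ${required_gb}GB estimated"
fi
```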
The second line prints an estimation of the memory space required by a single process for the trace. The memory space that Score-P reserves on each process at application start must be large enough to hold the process’ trace in memory in order to avoid flushes during runtime, because flushes heavily disturb measurements. In addition to the trace, Score-P requires some additional memory to maintain internal data structures. Thus, scorep-score also provides an estimation of the total amount of memory required on each process. The memory size per process that Score-P reserves is set via the environment variable SCOREP_TOTAL_MEMORY. In the given example the per-process memory is about 10GB. A Titan node provides 32GB of RAM. If using 16 processes on a node, you have to reduce the per-process memory to 2GB by defining a filter.
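The 2GB figure follows from dividing the node memory by the number of processes per node. A minimal sketch of that arithmetic is shown here, using the values quoted above; the variable names are illustrative, and the result is an upper bound that leaves no room for the application's own memory.

```shell
#!/bin/sh
# Values taken from the example above; variable names are illustrative.
node_mem_gb=32        # RAM per Titan node
procs_per_node=16     # processes placed on one node

# Upper bound on SCOREP_TOTAL_MEMORY per process, in GB (integer division).
per_proc_gb=$((node_mem_gb / procs_per_node))
echo "SCOREP_TOTAL_MEMORY upper bound: ${per_proc_gb}GB per process"
```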
For defining a filter, it is recommended to exclude short, frequently called functions from measurement, because they require a lot of buffer space (represented by a high value under max_buf) and incur a high measurement overhead. MPI functions and OpenMP constructs cannot be filtered. Thus, it is usually a good approach to exclude regions of type USR, starting at the top of the list, until you have reduced the trace to your needs.
The example below excludes the functions matmul_sub and matvec_sub from the trace:
$ cat scorep.filter
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    matmul_sub
    matvec_sub
SCOREP_REGION_NAMES_END
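A filter like the one above could be drafted from saved scorep-score output. The following is a minimal sketch and not a Score-P feature. It assumes the per-region rows were saved to a text file from an unfiltered run, so that the type is the first column. The file names score.txt and draft.filter and the threshold of roughly 1 GB are illustrative assumptions; the sample rows are copied from the example output.

```shell
#!/bin/sh
# Sample rows copied from the scorep-score output above; in practice this file
# would hold the saved output of `scorep-score -r profile/profile.cubex`.
# The file names and the threshold are illustrative assumptions.
cat > score.txt <<'EOF'
ALL 10,690,196,070 1,634,070,493 1081.30 100.0 0.66 ALL
USR 10,666,890,182 1,631,138,069 470.23 43.5 0.29 USR
OMP 22,025,152 2,743,808 606.80 56.1 221.15 OMP
COM 1,178,450 181,300 2.36 0.2 13.04 COM
MPI 102,286 7,316 1.90 0.2 260.07 MPI
USR 3,421,305,420 522,844,416 144.46 13.4 0.28 matmul_sub
USR 3,421,305,420 522,844,416 102.40 9.5 0.20 matvec_sub
EOF

threshold=1000000000   # flag USR regions needing at least this many bytes

{
  echo "SCOREP_REGION_NAMES_BEGIN"
  echo "  EXCLUDE"
  # Keep per-region USR rows (region name differs from the type column)
  # whose max_buf, with thousands separators removed, meets the threshold.
  awk -v min="$threshold" '
    $1 == "USR" && $NF != "USR" {
      buf = $2; gsub(/,/, "", buf)
      if (buf + 0 >= min) print "    " $NF
    }' score.txt
  echo "SCOREP_REGION_NAMES_END"
} > draft.filter

cat draft.filter
```

The draft is only a starting point and should be reviewed by hand before it is used for a measurement run.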
You can use scorep-score to test the effect of your filter on the trace file. To do so, pass the -f option followed by the file name of your filter:
$ scorep-score profile/profile.cubex -f scorep.filter
Perform a measurement run with tracing enabled and the filter applied
In the example below, the filter defined above is applied by exporting the SCOREP_FILTERING_FILE variable. Furthermore, SCOREP_TOTAL_MEMORY is set according to the per-process memory size estimated by scorep-score. This run generates files of the form traces.otf2, which can be analyzed with Vampir.
$ export SCOREP_ENABLE_PROFILING=false
$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_EXPERIMENT_DIRECTORY=trace
$ export SCOREP_TOTAL_MEMORY=2GB
$ export SCOREP_FILTERING_FILE=scorep.filter
$ aprun instrumented_binary.out
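The profiling and tracing runs in this workflow differ only in a handful of environment variables, so switching between them can be wrapped in a small shell function. The sketch below assumes a hypothetical helper named scorep_mode, which is not part of Score-P; it sets only the variables and values used in the examples on this page.

```shell
#!/bin/sh
# scorep_mode is a hypothetical helper, not part of Score-P. It sets only the
# environment variables and values used in the examples on this page.
scorep_mode() {
  case "$1" in
    profile)
      export SCOREP_ENABLE_PROFILING=true
      export SCOREP_ENABLE_TRACING=false
      export SCOREP_EXPERIMENT_DIRECTORY=profile
      # Drop trace-only settings left over from an earlier trace setup.
      unset SCOREP_TOTAL_MEMORY SCOREP_FILTERING_FILE
      ;;
    trace)
      export SCOREP_ENABLE_PROFILING=false
      export SCOREP_ENABLE_TRACING=true
      export SCOREP_EXPERIMENT_DIRECTORY=trace
      export SCOREP_TOTAL_MEMORY=2GB
      export SCOREP_FILTERING_FILE=scorep.filter
      ;;
    *)
      echo "usage: scorep_mode profile|trace" >&2
      return 1
      ;;
  esac
}

# Example: select the tracing configuration and show what was set.
scorep_mode trace
env | grep '^SCOREP_' | sort
```

After calling the function, the application would be launched as before, for example with aprun instrumented_binary.out.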
Perform analysis on the trace data with Vampir
Vampir provides a graphical interface for analysing the traces.otf2 files generated with Score-P.
To open very small trace files from the trace directory, log in with X forwarding enabled and issue the following command:
$ vampir trace/traces.otf2
However, this is not recommended for most trace files. Instead, for better performance, please use the Server/Client version: see the Vampir documentation for further information.
For a more detailed description see Score-P.
Supported features based on scorep/3.0
Please contact Ronny Brendel (on-site contact person) with any questions.
Furthermore, you may contact Score-P Support or OLCF User Support.