Profiling CUDA Code with NVPROF

NVIDIA’s command-line profiler, nvprof, provides profiling for CUDA codes. No extra compiling steps are required to use nvprof. The profiler includes tracing capability as well as the ability to gather many performance metrics, including FLOPS. The profiler data output can be saved and imported into the NVIDIA Visual Profiler for additional graphical analysis.

To use nvprof, the cuda module must be loaded.

summit> module load cuda

A simple “Hello, World!” run using nvprof can be done by adding “nvprof” to the jsrun line in your batch script.

jsrun -n1 -a1 -g1 nvprof ./hello_world_gpu
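
For context, the jsrun line above might sit in a batch script like the one sketched below. This is only an illustration: the project ID ABC123, the job name, the wall time, and the file name nvprof_hello.lsf are placeholders, not values taken from this guide.

```shell
# Write a minimal LSF batch script to nvprof_hello.lsf.
# ABC123, the job name, and the wall time are placeholders.
cat > nvprof_hello.lsf <<'EOF'
#!/bin/bash
#BSUB -P ABC123
#BSUB -W 0:10
#BSUB -nnodes 1
#BSUB -J nvprof_hello
#BSUB -o nvprof_hello.%J

# nvprof is provided by the cuda module
module load cuda

# One resource set, one task, one GPU, run under the profiler
jsrun -n1 -a1 -g1 nvprof ./hello_world_gpu
EOF

# Submit from a login node with: bsub nvprof_hello.lsf
```

Without the -o option, the profiler summary should appear in the job's output file alongside the program's own output.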

Although nvprof doesn’t provide aggregated MPI data, the %h and %p output file modifiers can be used to create separate output files for each host and process.

jsrun -n1 -a1 -g1 nvprof -o output.%h.%p ./hello_world_gpu
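
Once the job finishes, there is one output file per process. A rough sketch of how those files could be summarized from the command line is shown below. The helper names list_profiles and summarize_profiles are made up for this example, and it assumes the files were written to the current directory with the output.%h.%p pattern shown above.

```shell
# list_profiles PREFIX: print each file named PREFIX.<host>.<pid>, one per line.
# (Hypothetical helper; prints nothing if no files match.)
list_profiles() {
    for f in "$1".*.*; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# summarize_profiles PREFIX: print the nvprof summary for each saved profile.
# Requires nvprof on PATH (module load cuda).
summarize_profiles() {
    list_profiles "$1" | while IFS= read -r f; do
        echo "== $f =="
        nvprof --import-profile "$f"
    done
}

# Usage, after the job completes:
#   summarize_profiles output
```

The same files can also be imported into the NVIDIA Visual Profiler instead of being summarized in the terminal.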

The profiler can capture many different metrics and events. For example, to report the number of double-precision floating-point operations, use the flops_dp metric:

jsrun -n1 -a1 -g1 nvprof --metrics flops_dp -o output.%h.%p ./hello_world_gpu
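
The --metrics option also accepts several names separated by commas. The sketch below builds such a list with a small helper, join_metrics, which is a name invented for this example. The second metric, achieved_occupancy, is only an illustration; confirm the names available on your system with the query commands before relying on them.

```shell
# join_metrics NAME...: join metric names with commas, the form --metrics expects.
# The body runs in a subshell so the IFS change does not leak out.
join_metrics() (
    IFS=,
    printf '%s\n' "$*"
)

metrics=$(join_metrics flops_dp achieved_occupancy)

# Print the resulting jsrun line so it can be pasted into a batch script.
echo "jsrun -n1 -a1 -g1 nvprof --metrics $metrics -o output.%h.%p ./hello_world_gpu"
```

Collecting metrics can require replaying kernels, so runs with many metrics may take noticeably longer than an unprofiled run.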

To see a list of all available metrics and events, use the following:

summit> nvprof --query-metrics
summit> nvprof --query-events

While using nvprof on the command line is a quick way to gain insight into your CUDA application, a full visual profile is often even more useful. For information on how to view the output of nvprof in the NVIDIA Visual Profiler, see the NVIDIA Documentation.


Score-P

Note: Score-P and Vampir are not currently available on Summit.

Score-P is a performance evaluation tool for large scale parallel applications. It provides a measurement infrastructure for profiling, event trace recording, and online analysis of High Performance Computing applications. Score-P allows users to instrument and record the behavior of sequential, multi-process (MPI, SHMEM), thread-parallel (OpenMP, Pthreads) and accelerator-based (CUDA, OpenCL) applications as well as hybrid parallel applications. Profile data, in CUBE4 format, can be viewed with CUBE or cube_stat. The Score-P trace files, in OTF2 format, can be visualized using Vampir.

For detailed information about using Score-P on Summit and the builds available, please see the Score-P Software Page.


Vampir

Note: Score-P and Vampir are not currently available on Summit.

Vampir is a software performance analysis tool focused on highly parallel applications. It presents a unified view of an application run, including information on the programming paradigms used, such as MPI, OpenMP, Pthreads, CUDA, OpenCL and OpenACC. It also incorporates performance data from hardware performance counters and other sources. Its many interactive displays offer insight into the performance behavior of applications and reveal their bottlenecks. Vampir’s highly scalable analysis server and visualization engine enable interactive navigation through large amounts of detailed performance data.

Use Score-P to generate performance recordings for Vampir.

For detailed information about using Vampir on Summit and the builds available, please see the Vampir Software Page.