Profiling CUDA Code with NVPROF

NVIDIA’s command-line profiler, nvprof, provides profiling for CUDA codes. No extra compiling steps are required to use nvprof. The profiler includes tracing capability as well as the ability to gather many performance metrics, including FLOPS. The profiler data output can be saved and imported into the NVIDIA Visual Profiler for additional graphical analysis.

To use nvprof, the cuda module must be loaded.

summit> module load cuda

A simple “Hello, World!” run using nvprof can be done by adding “nvprof” to the jsrun line in your batch script.

jsrun -n1 -a1 -g1 nvprof ./hello_world_gpu

Although nvprof doesn’t provide aggregated MPI data, the %h and %p output file modifiers can be used to create separate output files for each host and process.

jsrun -n1 -a1 -g1 nvprof -o output.%h.%p ./hello_world_gpu

There are many various metrics and events that the profiler can capture. For example, to output the number of double-precision FLOPS, you may use the following:

jsrun -n1 -a1 -g1 nvprof --metrics flops_dp -o output.%h.%p ./hello_world_gpu

To see a list of all available metrics and events, use the following:

summit> nvprof --query-metrics
summit> nvprof --query-events

While using nvprof on the command-line is a quick way to gain insight into your CUDA application, a full visual profile is often even more useful. For information on how to view the output of nvprof in the NVIDIA Visual Profiler, see the NVIDIA Documentation.


The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multi-processing (MPI, SHMEM), thread parallelism (OpenMP, PThreads) and accelerators (CUDA, OpenCL, OpenACC) and combinations.

For detailed information about using Score-P on Summit and the builds available, please see the Score-P Software Page.


Vampir is a software performance visualizer focused on highly parallel applications. It presents a unified view on an application run including the use of programming paradigms like MPI, OpenMP, PThreads, CUDA, OpenCL and OpenACC. It also incorporates file I/O, hardware performance counters and other performance data sources. Various interactive displays offer detailed insight into the performance behavior of the analyzed application. Vampir’s scalable analysis server and visualization engine enable interactive navigation of large amounts of performance data. Score-P and TAU generate OTF2 trace files for Vampir to visualize.

For detailed information about using Vampir on Summit and the builds available, please see the Vampir Software Page.