How to profile a CUDA application using only nvprof

Hi, I want to write a script that profiles my CUDA application using only the command-line tool nvprof. For now, I am focusing on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of time during which the GPU is active. The GPU's active time is easy to obtain with “nvprof --print-gpu-trace”, but the elapsed time (without overhead) of the application confuses me. I used nvvp to visualize the profiling results. It seems that the elapsed time is the time between the first and last API calls minus the overhead time, and that overhead is not clearly reported in the nvprof results.
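The utilization definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (start, duration) pairs are hypothetical values standing in for the Start and Duration columns of “nvprof --print-gpu-trace”, the helper names are made up, and the elapsed time is passed in by hand because how to derive it from nvprof is exactly the open question here. Overlapping kernels are merged so that concurrent work is not counted twice.

```python
def gpu_active_time(intervals):
    """Return the total length of the union of (start, duration) intervals.

    All values must use the same time unit. Overlapping intervals
    (e.g. kernels running concurrently) are counted only once.
    """
    total = 0.0
    cur_start = cur_end = None
    for start, duration in sorted(intervals):
        end = start + duration
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def gpu_utilization(intervals, elapsed):
    """Fraction of the elapsed time during which the GPU was active."""
    return gpu_active_time(intervals) / elapsed


# Hypothetical trace: two overlapping kernels, then a third one later.
trace = [(0.0, 2.0), (1.0, 2.0), (5.0, 1.0)]
# Assumed elapsed time; not something nvprof reports directly.
utilization = gpu_utilization(trace, elapsed=8.0)
print(utilization)
```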
GPU flops32 (FP32) is the number of FP32 instructions the GPU executes per second while it is active. I followed Greg Smith’s suggestion and found that nvprof is very slow when I profile this metric.
So I have two questions:

  1. How do I calculate the elapsed time (without overhead) of a CUDA application using nvprof?
  2. Is there a faster way to obtain the GPU flops32?

Any suggestions would be appreciated.

Hi, lucienwang

  1. To find GPU utilization, you can use the issue_slot_utilization metric. Its description is “Percentage of issue slots that issued at least one instruction, averaged across all cycles”.

  2. The flops_count* metrics are slow because they use instrumented counters underneath, and the overhead of the instrumentation increases with the number of instructions patched in the kernel. To find flops32 faster, if you have access to the source code, you have to manually calculate the number of floating-point operations performed by all threads in your kernel, then divide that by the kernel duration, which you can get from “nvprof app_name”.
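As a rough illustration of the manual calculation in point 2, here is a minimal Python sketch. Everything specific in it is an assumption for the example: a hypothetical SAXPY-style kernel (y = a*x + y) in which each thread does one multiply and one add, a hypothetical problem size, and a made-up kernel duration standing in for the value nvprof would report. The helper name achieved_flops is also made up.

```python
def achieved_flops(flop_per_thread, active_threads, kernel_duration_s):
    """FLOP/s = (operations per thread * number of threads) / kernel duration."""
    total_flop = flop_per_thread * active_threads
    return total_flop / kernel_duration_s


# Hypothetical SAXPY kernel: one multiply + one add per element (2 FLOP),
# with one thread per element.
n = 1 << 20            # assumed number of elements/threads
duration_s = 0.0001    # assumed kernel duration (100 us) read from nvprof output

gflops = achieved_flops(2, n, duration_s) / 1e9
print(f"{gflops:.2f} GFLOP/s")
```

Counting by hand only works when the per-thread operation count is known from the source, and threads that exit early or take different branches have to be counted separately.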