Hi, I want to write a script that profiles my CUDA application using only the command-line tool nvprof. At present, I am focusing on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of time that the GPU is active. The active time of the GPU can easily be obtained with “nvprof --print-gpu-trace”, but the elapsed time (without overhead) of the application confuses me. I used nvvp to visualize the profiling results. It seems that the elapsed time is the time between the first and last API call, minus the overhead time, and that overhead is not clearly reported in the nvprof results.
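To make the definition concrete, here is a rough Python sketch of the calculation I have in mind. It assumes the (start, duration) pairs have already been extracted from the Start and Duration columns of the GPU trace and converted to seconds, and that the elapsed time is supplied separately, since how to obtain it is exactly what I am unsure about. The function names are my own and not part of any NVIDIA tool.

```python
def gpu_active_time(intervals):
    """Total length of the union of (start, duration) intervals, in seconds.

    Overlapping intervals (e.g. concurrent kernels or copies) are merged
    so that overlapping time is counted only once.
    """
    spans = sorted((start, start + duration) for start, duration in intervals)
    active = 0.0
    cur_start = cur_end = None
    for begin, end in spans:
        if cur_end is None or begin > cur_end:
            # Gap before this span: close the previous merged span.
            if cur_end is not None:
                active += cur_end - cur_start
            cur_start, cur_end = begin, end
        else:
            # Overlap: extend the current merged span.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        active += cur_end - cur_start
    return active


def gpu_utilization(intervals, elapsed):
    """Fraction of the elapsed time during which the GPU was active.

    `elapsed` is an assumed input: the application's elapsed time
    without profiling overhead, in seconds.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return gpu_active_time(intervals) / elapsed
```

For example, `gpu_utilization([(0.0, 2.0), (1.0, 2.0), (5.0, 1.0)], 8.0)` merges the first two overlapping intervals into one 3-second span, adds the 1-second span, and divides the 4 seconds of active time by the 8 seconds elapsed.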
GPU flops32 (FP32) is the number of FP32 operations the GPU executes per second while it is active. I followed Greg Smith’s suggestion (profiling - How to calculate Gflops of a kernel - Stack Overflow) and found that nvprof runs very slowly when I collect this metric.
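For reference, this is roughly how I combine the numbers once they are collected. It is a minimal sketch, assuming each kernel launch gives me a single-precision operation count (for example from the nvprof metric flop_count_sp) together with its duration in seconds, and that the kernels do not overlap, so the sum of the durations equals the active time. The function name and the input layout are my own.

```python
def fp32_flops_per_second(kernels):
    """Average FP32 operations per second over the GPU's active time.

    `kernels` is a list of (flop_count, duration_seconds) pairs, one per
    kernel launch. Assumes the launches do not overlap in time, so the
    summed durations equal the active time.
    """
    total_flops = sum(flops for flops, _ in kernels)
    active_seconds = sum(duration for _, duration in kernels)
    if active_seconds <= 0:
        raise ValueError("total kernel duration must be positive")
    return total_flops / active_seconds
```

With two launches of 2e9 and 1e9 operations that each take 0.5 s, this gives 3e9 operations per second, i.e. 3 GFLOPS.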
So I have two questions:
- How can I calculate the elapsed time (without overhead) of a CUDA application using nvprof?
- Is there a faster way to obtain the GPU flops32?
Any suggestions would be appreciated.