I am running nsys, and part of the profiling output on the terminal is as follows. The kernel of interest is kernelV5.
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------------------------------------------------
    51.5          417,893          2  208,946.5  208,946.5     3,264   414,629    290,879.0  void asum_kernel<int, float, float>(cublasAsumParams<T2, T3>)
    48.5          393,604          1  393,604.0  393,604.0   393,604   393,604          0.0  kernelV5(float *, float *)
whereas ncu reports a different duration for the same kernelV5 (541.15 microseconds, compared to about 393 microseconds from nsys):
Also, the cycle count reported by ncu seems to be close to the nsys number. What am I getting wrong? In other words, which one is the true measurement of the kernel's execution time?
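
For reference, here is a quick sanity check of the two durations quoted above, as a minimal Python sketch. The helper name implied_clock_mhz is hypothetical (my own, not part of nsys or ncu), and it assumes the kernel runs at a constant clock, so that cycles = frequency × duration. It only quantifies the gap between the two tools; it does not by itself say which number is right.

```python
# Durations quoted above, both expressed in nanoseconds.
nsys_ns = 393_604         # kernelV5 total time from the nsys cuda_gpu_kern_sum report
ncu_ns = 541.15 * 1_000   # 541.15 microseconds reported by ncu

# How much longer the ncu duration is than the nsys one.
ratio = ncu_ns / nsys_ns
print(f"ncu / nsys duration ratio: {ratio:.3f}")


def implied_clock_mhz(cycles: float, duration_ns: float) -> float:
    """Clock frequency in MHz implied by a cycle count and a duration.

    Assumes a constant clock over the whole kernel (an assumption, not
    something either profiler guarantees): cycles = frequency * duration.
    Cycles per nanosecond is GHz, so multiply by 1000 to get MHz.
    """
    return cycles / duration_ns * 1_000.0
```

Calling implied_clock_mhz with the cycle count from the ncu report, once with nsys_ns and once with ncu_ns, gives the clock frequency each duration would imply, which can then be compared against the clock the GPU was actually running at under each tool.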