Kernel execution measurement - profiling

I am running nsys and part of the profiling info on the terminal is as follows. The kernel of interest is kernelV5.

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                              Name                             
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------------------------------------------------
     51.5          417,893          2  208,946.5  208,946.5     3,264   414,629    290,879.0  void asum_kernel<int, float, float>(cublasAsumParams<T2, T3>)
     48.5          393,604          1  393,604.0  393,604.0   393,604   393,604          0.0  kernelV5(float *, float *)

whereas ncu reports a different duration for the same kernelV5: 541.15 microseconds, compared to about 393 microseconds from nsys.

Also, the cycle count reported by ncu seems close to what the nsys numbers imply. What am I getting wrong? In other words, which one is the true kernel execution measurement?

One possibility for the timing discrepancy is that, by default, ncu runs with the GPU clocks locked at the card's base frequencies - see the table here, under "clock-control".

I haven't found a similar setting for nsys, so under nsys the clocks may vary with temperature, load, etc.

You could try setting ncu's clock control to "none" and see whether the durations are closer.
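For reference, a sketch of the CLI invocation (the report name and ./myapp are placeholders for your own output file and application):

ncu --clock-control none -o kernelV5_report ./myapp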

Clock control in the Nsight Compute GUI is set here.


Thank you. Setting clock-control to none for ncu gave results consistent with nsys. What would be a good way to measure the performance of a single kernel in a real-world scenario where such kernels are executed several thousand times in a loop? With clocks at base, none, or something else? I would also like to know if there is anything else I should keep in mind.

It’s not a situation I’m familiar with, but someone may offer a better solution.

I'd either use CUDA events to time it, or the clock() mechanism as outlined in the CUDA Samples.
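As a minimal sketch of the CUDA-event approach (assuming kernelV5 is the kernel from the nsys report above and is defined elsewhere in your project; the grid/block configuration, iteration count, and error checking are left to you):

#include <cstdio>
#include <cuda_runtime.h>

// Assumed to be defined elsewhere; signature taken from the nsys report above.
__global__ void kernelV5(float *, float *);

// Time `iterations` back-to-back launches of kernelV5 and return the average in ms.
float timeKernelV5(float *d_in, float *d_out, dim3 grid, dim3 block, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up launch so first-launch overhead doesn't skew the average.
    kernelV5<<<grid, block>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        kernelV5<<<grid, block>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for all launches to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernelV5 time: %.3f us over %d launches\n",
           1000.0f * ms / iterations, iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iterations;
}

Averaging over many launches this way also reflects whatever clock behavior your real workload sees (boost, thermals, etc.), which is closer to the "several thousand launches in a loop" scenario than a single locked-clock profile.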