Kernel execution measurement - profiling

I am running nsys and part of the profiling info on the terminal is as follows. The kernel of interest is kernelV5.

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                              Name                             
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------------------------------------------------
     51.5          417,893          2  208,946.5  208,946.5     3,264   414,629    290,879.0  void asum_kernel<int, float, float>(cublasAsumParams<T2, T3>)
     48.5          393,604          1  393,604.0  393,604.0   393,604   393,604          0.0  kernelV5(float *, float *)

whereas ncu reports a different duration for the same kernelV5: 541.15 microseconds, compared to about 393 microseconds from nsys.

Also, the cycle count reported by ncu seems close to what the nsys numbers imply. What am I getting wrong? In other words, which one is the true kernel execution measurement?

One possibility for the timing discrepancy is that, by default, ncu runs with the GPU clocks locked at the card's base frequencies - see the table here, under "clock-control".

I haven't found a similar setting for nsys, so under nsys the clocks may vary with temperature, load, etc.

You could try setting ncu's clock control to "none" and see whether the durations are closer.
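For reference, a sketch of the CLI invocation (the report name and ./myapp are placeholders for your own output file and application):

ncu --clock-control none -o kernelV5_report ./myapp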

Clock control in the Nsight Compute GUI is set here.


Thank you. Setting clock-control to none for ncu gave results consistent with nsys. What would be a good way to measure the performance of a single kernel in a real-world scenario where such kernels are executed several thousand times in a loop? With clocks at base, none, or something else? I would also like to know if there is anything else I should keep in mind.

It’s not a situation I’m familiar with, but someone may offer a better solution.

I'd either use CUDA events to time it, or the clock() mechanism as outlined in the CUDA Samples.
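As a minimal sketch of the CUDA-event approach (assuming kernelV5 is the kernel from the nsys report above and is defined elsewhere in your project; the grid/block configuration, iteration count, and error checking are left to you):

#include <cstdio>
#include <cuda_runtime.h>

// Assumed to be defined elsewhere; signature taken from the nsys report above.
__global__ void kernelV5(float *, float *);

// Time `iterations` back-to-back launches of kernelV5 and return the average in ms.
float timeKernelV5(float *d_in, float *d_out, dim3 grid, dim3 block, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up launch so first-launch overhead doesn't skew the average.
    kernelV5<<<grid, block>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        kernelV5<<<grid, block>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for all launches to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernelV5 time: %.3f us over %d launches\n",
           1000.0f * ms / iterations, iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iterations;
}

Averaging over many launches this way also reflects whatever clock behavior your real workload sees (boost, thermals, etc.), which is closer to the "several thousand launches in a loop" scenario than a single locked-clock profile.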