Is the kernel time shown in ncu exactly the execution time of the kernel? Or is it slower because ncu adds overhead during kernel execution?
The kernel duration measured by ncu is accurate. ncu collects different types of metrics in different passes, replaying the workload for each pass. The duration is measured in a pass where metric data is collected without impacting the kernel’s performance. Metrics that count entities but have a significant impact on kernel duration are measured in a separate pass.
What exactly counts as the kernel duration, however, depends on how it is measured. For example, Nsight Systems and Nsight Compute use slightly different mechanisms to measure it, and using CUDA events would be yet another measurement method. You should not expect these to be completely consistent.
See also "Average of all kernels L1, L2 Cache Hit Rate" (post #8 by Greg).
Thanks!
As a further note, remember that by default, ncu locks clocks to the GPU’s base clocks. This is to ensure deterministic behavior across passes. Especially for data-center class chips, the base clock can be much lower than the typical clock the GPU runs at. If needed, you can disable this with --clock-control none and lock clocks externally using nvidia-smi to a clock level you require.
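As a rough sketch of that workflow, the small Python helper below only builds the command lines and prints them; it does not run anything. The function names, the 1410 MHz value, the GPU index, and `./my_app` are placeholders I made up for illustration. The flags used are `--clock-control none` for ncu and `--lock-gpu-clocks` / `--reset-gpu-clocks` for nvidia-smi; changing clocks with nvidia-smi typically needs elevated privileges, and the supported clock values depend on your GPU.

```python
import shlex


def lock_clocks_cmd(mhz, gpu_index=0):
    """nvidia-smi command that pins the GPU clock to a single frequency (MHz).

    Passing the same value as min and max requests a fixed clock.
    """
    return ["nvidia-smi", "-i", str(gpu_index), f"--lock-gpu-clocks={mhz},{mhz}"]


def ncu_unlocked_cmd(app_args):
    """ncu command that leaves clock control to the user instead of locking to base."""
    return ["ncu", "--clock-control", "none", *app_args]


def reset_clocks_cmd(gpu_index=0):
    """nvidia-smi command that restores the default clock behavior afterwards."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


if __name__ == "__main__":
    # Placeholder values: pick a clock your GPU actually supports.
    steps = [
        lock_clocks_cmd(1410, gpu_index=0),
        ncu_unlocked_cmd(["./my_app"]),
        reset_clocks_cmd(gpu_index=0),
    ]
    for cmd in steps:
        # Print shell-ready lines; run them manually or via subprocess.run(cmd).
        print(shlex.join(cmd))
```

Resetting the clocks at the end matters, since a locked clock otherwise stays in effect for every later workload on that GPU.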