If you are using Nsight Compute, then you are serializing kernels, which changes execution on the GPU. Nsight Compute targets single-kernel profiling, so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is simply end timestamp minus start timestamp: if the kernel runs long enough, it is likely to be context switched, and the duration will then include time the GPU spent executing another context.
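For reference, a minimal way to collect this metric from the command line; `./my_app` is a placeholder for your application:

```
# Collect only the kernel duration metric; kernel serialization still applies.
ncu --metrics gpu__time_duration.sum ./my_app
```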
Nsight Systems/CUPTI provide the most accurate start and end timestamps that we can collect on the GPU. It is recommended that you use Nsight Systems before using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the cost of concurrency between kernels, resulting in an overall performance regression. A sketch of that workflow follows.
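A rough sketch of the iteration, again with `./my_app` as a placeholder and the kernel-name filter purely illustrative:

```
# 1. Capture a whole-application timeline first to see kernel overlap,
#    gaps, and which kernels dominate.
nsys profile --stats=true -o timeline ./my_app

# 2. Then drill into a specific hot kernel with Nsight Compute
#    (-k filters by kernel name).
ncu -k my_hot_kernel --metrics gpu__time_duration.sum ./my_app

# 3. After optimizing, re-run step 1 to confirm that concurrency
#    between kernels did not regress.
```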