How can I measure kernel launch overhead using ncu?


Is it possible to measure kernel launch overhead accurately in Nsight Compute?

I saw other threads where people measured it in ways other than Nsight Compute, and I was wondering whether Nsight Compute is capable of capturing it by looking at elapsed cycles and SM active cycles.

It would also be helpful if there's a place where I can learn more about kernel launch overhead in CUDA.


You can use Nsight Systems to measure kernel launch overhead. Please refer to the Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems blog.

Thanks Sanjiv for the prompt response!

So Nsight Compute doesn't include such latency? The kernel duration in Nsight Compute is only the time spent doing work on the GPU?

Please refer to the reply in this forum post for more details.

Thanks for the link. That post says the Nsight Compute kernel duration doesn't include kernel launch overhead, but this answer (Question for sm__elapsed_cycles_sum - #2 by Greg) says otherwise. I'm confused.

Nsight Compute is not designed to measure launch overhead; however, its measurements do include some of the launch overhead.

The ASCII diagram below is a GPU timeline. This timeline does not include the overhead of CUDA driver command buffer creation, command buffer submission, or the GPU switching to the command buffer.


            time -->
FE          [1][2][3]                           [8]
SCHED                [4]
CWD                     [5]                  [7]
SM                          [6--------------]


  • FE - Front End - Process command buffers.
  • SCHED - Compute Work Scheduler - Manages and prioritizes grid launches.
  • CWD - Compute Work Distributor - Rasterizes grids, distributes thread blocks to SMs, tracks thread block and grid completion, performs post-grid complete tasks.

[1] Driver-specific updates before the launch, including but not limited to:
- resizing local memory, the printf buffer, or the device heap
- uploading device code
[2] Commands to set up the launch, including copying kernel parameters, texture bindings, etc. to device memory.
[3] Commands to launch the grid.
[4] The scheduler prioritizes grids and passes the highest-priority grids to the CWD.
[5] CWD queries SMs to see how many thread blocks can fit on each SM. CWD rasterizes the grid and launches thread blocks.
[6] SMs execute the kernel.
[7] CWD receives the last thread block completion and executes grid-complete tasks.
[8] The front end continues to process commands for the stream.


  • Nsight Compute measures from [3] to the start of [8].
  • Nsight Systems measures from the start of [6] to the middle of [7].

If you launch a 1-warp grid, then
sm__cycles_elapsed.max - sm__cycles_active.max is approximately the overhead of [4] through [7].
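As a concrete sketch, the difference above can be computed from ncu's CSV export, e.g. after collecting the two metrics with something like `ncu --csv --metrics sm__cycles_elapsed.max,sm__cycles_active.max ./app`. The exact CSV column names can vary between ncu versions, and the sample data below is made up for illustration:

```python
import csv
import io

def overhead_cycles(ncu_csv_text: str) -> float:
    """Return sm__cycles_elapsed.max - sm__cycles_active.max for one kernel.

    Assumes ncu's long CSV format with "Metric Name" / "Metric Value"
    columns (one metric per row); verify against your ncu version.
    """
    metrics = {}
    for row in csv.DictReader(io.StringIO(ncu_csv_text)):
        # Metric values may contain thousands separators, e.g. "12,345".
        metrics[row["Metric Name"]] = float(row["Metric Value"].replace(",", ""))
    return metrics["sm__cycles_elapsed.max"] - metrics["sm__cycles_active.max"]

# Hypothetical ncu --csv output for a single 1-warp kernel launch.
sample = '''"Kernel Name","Metric Name","Metric Unit","Metric Value"
"empty_kernel","sm__cycles_elapsed.max","cycle","4100"
"empty_kernel","sm__cycles_active.max","cycle","1250"
'''

print(overhead_cycles(sample))  # prints 2850.0
```

Note this difference is in SM cycles, so converting it to time requires dividing by the SM clock rate during the measurement.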


Thanks a lot Greg and Sanjiv! This is clear now.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.