How can I measure kernel launch overhead using ncu

m_ali102 · April 20, 2023, 5:39am

Hi,

Is it possible to measure the kernel launch overhead accurately in Nsight Compute?

I saw other threads where people measured it with tools other than Nsight Compute. I was wondering whether Nsight Compute can capture it by looking at elapsed cycles and SM active cycles.

It would also be helpful if there’s a place where I can learn more about kernel launch overhead in CUDA.

Thanks.

Sanjiv.Satoor · April 20, 2023, 6:13am

You can use Nsight Systems to measure kernel launch overhead. Please refer to the Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems blog.

m_ali102 · April 20, 2023, 6:17am

Thanks Sanjiv for the prompt response!

So Nsight Compute doesn’t include such latency? Is the kernel duration in Nsight Compute only the time spent doing work on the GPU?

Sanjiv.Satoor · April 20, 2023, 6:43am

Please refer to the reply in this forum post for more details.

m_ali102 · April 20, 2023, 7:01am

Thanks for the link. It says that the Nsight Compute kernel duration doesn’t include kernel launch overhead; however, this answer (Question for sm__elapsed_cycles_sum - #2 by Greg) says otherwise. I’m confused.

Greg · April 20, 2023, 2:07pm

Nsight Compute is not designed to measure launch overhead. Nsight Compute measurements do contain some of the launch overhead.

The ASCII diagram below is a GPU timeline. This timeline does not include the overhead of CUDA driver command buffer creation, command buffer submission, or the time for the GPU to switch to the command buffer.

GPU TIMELINE

            time -->
FE          [1][2][3]                           [8]
SCHED                [4]
CWD                     [5]                  [7]
SM                          [6--------------]

UNITS

FE - Front End - Processes command buffers.
SCHED - Compute Work Scheduler - Manages and prioritizes grid launches.
CWD - Compute Work Distributor - Rasterizes grids, distributes thread blocks to SMs, tracks thread block and grid completion, performs post-grid complete tasks.

GPU RANGES
[1] Driver specific updates before the launch including but not limited to:
- resize local memory, printf, or device heap
- upload device code
[2] Commands to set up the launch, including copying kernel parameters, texture bindings, etc. to device memory.
[3] Commands to launch the grid.
[4] Scheduler prioritizes grids and passes the highest-priority grids to CWD.
[5] CWD queries SMs to see how many thread blocks can fit on each SM. CWD rasterizes the grid and launches thread blocks.
[6] SMs execute the kernel.
[7] CWD receives the last thread block completion and executes grid completion tasks.
[8] Front end continues to process commands for the stream.

TOOLS

Nsight Compute measures from [3] to the start of [8].
Nsight Systems measures from the start of [6] to the middle of [7].

If you launch a 1 warp grid then
sm__cycles_elapsed.max - sm__cycles_active.max is approximately the overhead of [4] to [7].
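
As a rough illustration of the 1 warp experiment above, a minimal sketch might look like the following. The kernel name, file name, and binary name are hypothetical and not from this thread; the only parts taken from the discussion are the 1 warp launch configuration and the two metric names. The result is an approximation of [4] to [7], not an exact launch overhead figure.

```cuda
// Minimal sketch, assuming a file named one_warp.cu (hypothetical name).
//
// Build:   nvcc -o one_warp one_warp.cu
// Profile: ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max ./one_warp
//
// Subtract sm__cycles_active.max from sm__cycles_elapsed.max in the ncu
// output to get the approximate cycle cost of ranges [4] to [7].
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a single warp doing a trivial amount of work.
__global__ void one_warp_kernel(int *out)
{
    if (threadIdx.x == 0) {
        *out = 1;
    }
}

int main()
{
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // One block of 32 threads is a 1 warp grid.
    one_warp_kernel<<<1, 32>>>(d_out);

    // Wait for the kernel and report any launch or execution error.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Copy the result back so the kernel's work is observable on the host.
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel wrote %d\n", h_out);

    cudaFree(d_out);
    return 0;
}
```

The kernel is kept as small as possible so that sm__cycles_active.max stays close to zero and the difference is dominated by the scheduling and completion ranges rather than by the kernel's own work.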

m_ali102 · April 20, 2023, 6:42pm

Thanks a lot Greg and Sanjiv! This is clear now.

system · May 4, 2023, 6:42pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
