Is it possible to measure kernel launch overhead accurately in Nsight Compute?
I saw other threads where people measured it with tools other than Nsight Compute. I was wondering whether Nsight Compute can capture it by comparing elapsed cycles with SM active cycles.
It would also be helpful if there is a place where I can learn more about kernel launch overhead in CUDA.
Thanks for the link. It says that the Nsight Compute kernel duration does not include kernel launch overhead; however, this answer (Question for sm__elapsed_cycles_sum - #2 by Greg) says otherwise. I'm confused.
Nsight Compute is not designed to measure launch overhead. However, Nsight Compute measurements do contain some of the launch overhead.
The ASCII diagram below is a GPU timeline. This timeline does not include the overhead of CUDA driver command buffer creation, command buffer submission, or the GPU switching to the command buffer.
GPU TIMELINE
time -->
FE     [1][2][3]                               [8]
SCHED           [4]
CWD                [5]                    [7]
SM                    [6--------------]
UNITS
FE - Front End - Processes command buffers.
SCHED - Compute Work Scheduler - Manages and prioritizes grid launches.
CWD - Compute Work Distributor - Rasterizes grids, distributes thread blocks to SMs, tracks thread block and grid completion, performs post-grid complete tasks.
GPU RANGES
[1] Driver specific updates before the launch including but not limited to:
- resize local memory, printf, or device heap
- upload device code
[2] Commands to setup launch including copying kernel parameters, texture bindings, etc. to device memory.
[3] Commands to launch the grid.
[4] Scheduler prioritizes grids and passes the highest-priority grids to CWD.
[5] CWD queries SMs to see how many thread blocks can fit on each SM. CWD rasterizes the grid and launches thread blocks.
[6] SMs execute the kernel.
[7] CWD receives the last thread block completion and executes grid-complete tasks.
[8] Front end continues to process commands for the stream.
TOOLS
Nsight Compute measures from [3] to the start of [8].
Nsight Systems measures from the start of [6] to the middle of [7].
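To make the two measurement windows concrete, here is a minimal Python sketch that places ranges [3] through [8] on a timeline and computes what each tool would cover. Every timestamp is an invented value for illustration, not a measurement, and the names `ranges`, `nsight_compute_window`, and `nsight_systems_window` are hypothetical. The sketch also assumes that "from [3]" means the start of [3].

```python
# Hypothetical (start, end) times in microseconds for GPU ranges [3]..[8].
# These numbers are invented for illustration only; they are not measurements.
ranges = {
    3: (0.0, 1.0),    # FE: commands to launch the grid
    4: (1.0, 1.5),    # SCHED: prioritize the grid
    5: (1.5, 2.5),    # CWD: rasterize the grid, launch thread blocks
    6: (2.5, 12.5),   # SM: kernel execution
    7: (12.5, 13.5),  # CWD: grid-complete tasks
    8: (14.0, 15.0),  # FE: continues processing the stream
}


def nsight_compute_window(r):
    """Span from [3] (assumed: its start) to the start of [8]."""
    return r[8][0] - r[3][0]


def nsight_systems_window(r):
    """Span from the start of [6] to the middle of [7]."""
    middle_of_7 = (r[7][0] + r[7][1]) / 2
    return middle_of_7 - r[6][0]


def sm_execution_time(r):
    """Time the SMs spend in range [6] alone."""
    return r[6][1] - r[6][0]


if __name__ == "__main__":
    print("Nsight Compute window:", nsight_compute_window(ranges), "us")
    print("Nsight Systems window:", nsight_systems_window(ranges), "us")
    print("SM execution only:    ", sm_execution_time(ranges), "us")
```

With these made-up numbers, the Nsight Compute window is wider than the Nsight Systems window because it also spans [3], [4], [5], and the gap before [8], which is why some launch overhead shows up in its measurements.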
If you launch a 1-warp grid, then
sm__cycles_elapsed.max - sm__cycles_active.max is approximately the overhead of [4] to [7].
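As a worked example of that subtraction, the Python sketch below takes the two metric values and returns the approximate overhead in cycles, then converts it to microseconds. The input numbers and the 1500 MHz SM clock are hypothetical placeholders, not real profiler output, and the helper names are my own. I am assuming the two values were collected for a 1-warp grid, for example with `ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max`, and that the SM clock stayed constant during the run.

```python
def launch_overhead_cycles(elapsed_max, active_max):
    """Approximate overhead of [4] to [7] in SM cycles for a 1-warp grid.

    elapsed_max: value of sm__cycles_elapsed.max
    active_max:  value of sm__cycles_active.max
    """
    if active_max > elapsed_max:
        raise ValueError("active cycles should not exceed elapsed cycles")
    return elapsed_max - active_max


def cycles_to_microseconds(cycles, sm_clock_mhz):
    """Convert SM cycles to microseconds (1 MHz = 1 cycle per microsecond)."""
    if sm_clock_mhz <= 0:
        raise ValueError("SM clock must be positive")
    return cycles / sm_clock_mhz


if __name__ == "__main__":
    # Hypothetical metric values; replace with numbers from your own profile.
    elapsed = 6000
    active = 1500
    overhead = launch_overhead_cycles(elapsed, active)
    print("overhead cycles:", overhead)
    print("overhead us at 1500 MHz:", cycles_to_microseconds(overhead, 1500))
```

The estimate only holds for a grid small enough that [6] is short and confined to one SM; for larger grids the difference between the two metrics also reflects how work is distributed across SMs.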