Kernel launch time in Nsight Systems

Chad.Ding · June 28, 2020, 6:48am

I am using Nsight Systems to profile a program running on Xavier, and I have collected some CUDA runtime data with it. I am now trying to analyze that data.

As the screenshot above shows, the launch of the nchwToNhwckernel in one thread takes 139.648us from begin to end, and the latency is 760.512us. What is the difference between the 139.648us and the 760.512us, and what do these values mean?
I think the begin-to-end time is the time spent launching the kernel, and the latency is the time from the start of the launch until the kernel finishes running, which includes the launch time, the wait time, and the kernel run time. Is my understanding right?

My next question is about kernel runtime. In the screenshot above, the same nchwToNhwckernel takes about 7.328us from begin to end. That seems very quick for a kernel. Is this the real kernel runtime? And does the time we usually think of as normal also include the kernel launch time? Is that right?

My last question is that some kernels take too long to launch, as shown above. Judging from the blue line below, I think CUDA is not busy, but the launch time of the kernel marked with the red rectangle is still very long. What conditions can cause such a long kernel launch time, and what can I do to decrease it?

Chad.Ding · July 2, 2020, 3:59am

Can anyone help me answer this question?

hwilper · July 22, 2020, 2:51pm

Sorry for the delay in responding.

In general your concept of launch time and latency is correct. This means that if you launch multiple kernels, the latency may go up for the ones launched later, because the GPU is busy, so they have extra wait before they start.
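
To see this effect outside the profiler, here is a minimal sketch, not taken from the original program. It queues several kernels back to back on the default stream and records, on the host, how long each launch call takes and how long it takes from issuing the launch until that kernel has finished. The names busyKernel, LaunchTiming and measureQueuedLaunches are hypothetical, the spin length is an arbitrary assumption, and error checking is left out for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: spins for about `cycles` GPU clock ticks
// so that later launches have to wait behind earlier ones.
__global__ void busyKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

struct LaunchTiming {
    double apiCallUs;      // host time spent inside the launch call
    double issueToDoneUs;  // host time from issuing the launch until the kernel finished
};

std::vector<LaunchTiming> measureQueuedLaunches(int count, long long cycles) {
    using Clock = std::chrono::steady_clock;
    using Us = std::chrono::duration<double, std::micro>;

    // Warm-up launch so context creation and module loading are not timed.
    busyKernel<<<1, 1>>>(1);
    cudaDeviceSynchronize();

    std::vector<cudaEvent_t> done(count);
    std::vector<Clock::time_point> issued(count);
    std::vector<LaunchTiming> result(count);
    for (int i = 0; i < count; ++i) {
        cudaEventCreate(&done[i]);
    }

    // Queue all kernels back to back on the default stream.
    for (int i = 0; i < count; ++i) {
        issued[i] = Clock::now();
        busyKernel<<<1, 1>>>(cycles);
        result[i].apiCallUs = Us(Clock::now() - issued[i]).count();
        cudaEventRecord(done[i]);
    }

    // Kernels on one stream finish in order, so wait for them in order.
    for (int i = 0; i < count; ++i) {
        cudaEventSynchronize(done[i]);
        result[i].issueToDoneUs = Us(Clock::now() - issued[i]).count();
        cudaEventDestroy(done[i]);
    }
    return result;
}

int main() {
    for (const LaunchTiming& t : measureQueuedLaunches(4, 1000000)) {
        std::printf("launch call: %8.3f us, issue to completion: %10.3f us\n",
                    t.apiCallUs, t.issueToDoneUs);
    }
    return 0;
}
```

The launch call itself should stay short for every kernel, while the issue-to-completion time should grow for the later launches because they queue behind the earlier ones. The exact numbers depend on the device and its clocks.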

For your second question, that is a completely possible kernel runtime, depending on what is being executed.
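
If you want to cross-check a short kernel duration without the profiler, one option is to bracket a single launch with CUDA events. This is only a sketch: tinyKernel and timeTinyKernelUs are hypothetical names, the kernel is a placeholder rather than the real nchwToNhwckernel, and error handling is minimal. Event timing has a resolution of roughly half a microsecond and includes some overhead around the kernel, so it tends to read somewhat higher than the kernel bar in the timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tiny kernel standing in for a short layout-conversion kernel.
__global__ void tinyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f;
    }
}

// Returns the GPU time in microseconds between two events recorded
// around one launch, or -1 if the allocation fails.
float timeTinyKernelUs(int n) {
    float* data = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time setup cost is not included.
    tinyKernel<<<blocks, threads>>>(data, n);
    cudaDeviceSynchronize();

    // Timed launch: the events are timestamped on the GPU, in stream order.
    cudaEventRecord(start);
    tinyKernel<<<blocks, threads>>>(data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return ms * 1000.0f;
}

int main() {
    std::printf("GPU time between events: %.3f us\n", timeTinyKernelUs(1 << 16));
    return 0;
}
```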

For your third question, I can’t really determine from that screenshot what might be going on. I would zoom in on the GPU section in question and see what kernels are running and how they interact, and if there are any issues with memory transfers. I would also look to the OS runtime trace up on the CPU thread and make sure the CPU thread was active. It looks like you have a massive kernel that is not keeping the GPU busy.
