Kernel launch time in Nsight Systems

I use Nsight Systems to profile a program running on Xavier, and I have collected some CUDA runtime data with it. Now I am trying to analyze that data.


As the screenshot above shows, the nchwToNhwckernel launch in one thread takes 139.648 us from begin to end, and its latency is 760.512 us. What is the difference between the 139.648 us and the 760.512 us, and what do these two values mean?
I think the begin-to-end time is the time taken to launch the kernel, and the latency is the time from the start of the launch to the end of the kernel run, which includes the launch time, the wait time, and the kernel run time. Is my understanding right?

My next question is about kernel runtime. In the screenshot above, the same nchwToNhwckernel takes about 7.328 us from begin to end. That seems very fast for a kernel. Is this the real kernel execution time, and does the time we usually consider normal also include the kernel launch time? Is that right?

My last question is that some kernels take too long to launch, as shown above. Judging from the blue line below, the GPU does not look busy, but the launch time of the kernel marked with the red rectangle is still very long. What conditions cause a long kernel launch time, and what can I do to decrease it?

Can anyone help me answer these questions?

Sorry for the delay in responding.

In general your concept of launch time and latency is correct. This means that if you launch multiple kernels, the latency may go up for the ones launched later: the GPU is still busy with the earlier kernels, so the later ones wait longer before they start.
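
If you want to see this effect outside the profiler, here is a minimal sketch, not taken from your application. It assumes a hypothetical busy-wait kernel (`spinKernel`) and arbitrary values for the kernel count and spin length. It queues several kernels back to back on the default stream, times each launch call on the host, and uses CUDA events to report when each kernel finished on the GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

int main() {
    const int kNumKernels = 4;          // arbitrary number of queued kernels
    const long long kCycles = 2000000;  // arbitrary amount of work per kernel

    cudaEvent_t begin, done[kNumKernels];
    cudaEventCreate(&begin);
    for (int i = 0; i < kNumKernels; ++i) cudaEventCreate(&done[i]);

    // Warm-up launch so one-time initialization does not distort the numbers.
    spinKernel<<<1, 1>>>(1);
    cudaDeviceSynchronize();

    double launchUs[kNumKernels];
    cudaEventRecord(begin);
    for (int i = 0; i < kNumKernels; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        spinKernel<<<1, 1>>>(kCycles);  // asynchronous: returns before the kernel finishes
        auto t1 = std::chrono::steady_clock::now();
        launchUs[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
        cudaEventRecord(done[i]);       // marks the point where kernel i has completed
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < kNumKernels; ++i) {
        float msSinceBegin = 0.0f;
        cudaEventElapsedTime(&msSinceBegin, begin, done[i]);
        std::printf("kernel %d: launch call %.1f us on host, finished %.1f us after begin on GPU\n",
                    i, launchUs[i], msSinceBegin * 1000.0f);
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(begin);
    for (int i = 0; i < kNumKernels; ++i) cudaEventDestroy(done[i]);
    return err == cudaSuccess ? 0 : 1;
}
```

Kernels on the same stream run one after another, so the finish time should grow for each later kernel, whatever the host-side launch calls cost. The exact numbers depend on the device and driver.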

For your second question, that is a completely possible kernel runtime, depending on what is being executed.
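
To cross-check a number like this, you can time a kernel on the GPU with CUDA events and compare it with the host wall-clock time for launch plus synchronize. The sketch below is only an illustration: `nchwToNhwcSketch` is a hypothetical layout-conversion kernel written for this example, not the kernel in your trace, and the tensor shape is an arbitrary assumption.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical layout conversion: one thread copies one element from NCHW to NHWC.
__global__ void nchwToNhwcSketch(const float* in, float* out, int N, int C, int H, int W) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N * C * H * W) return;
    // Decompose the flat NCHW index into (n, c, h, w).
    int w = idx % W;
    int h = (idx / W) % H;
    int c = (idx / (W * H)) % C;
    int n = idx / (W * H * C);
    out[((n * H + h) * W + w) * C + c] = in[idx];
}

int main() {
    const int N = 1, C = 3, H = 224, W = 224;  // arbitrary example shape
    const int total = N * C * H * W;
    const int threads = 256;
    const int blocks = (total + threads - 1) / threads;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, total * sizeof(float));
    cudaMalloc(&out, total * sizeof(float));
    cudaMemset(in, 0, total * sizeof(float));

    // Warm-up launch so one-time initialization is not counted.
    nchwToNhwcSketch<<<blocks, threads>>>(in, out, N, C, H, W);
    cudaDeviceSynchronize();

    // Host view: launch call plus waiting for completion.
    auto t0 = std::chrono::steady_clock::now();
    nchwToNhwcSketch<<<blocks, threads>>>(in, out, N, C, H, W);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    double hostUs = std::chrono::duration<double, std::micro>(t1 - t0).count();

    // GPU view: events recorded on the stream immediately before and after the kernel.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    nchwToNhwcSketch<<<blocks, threads>>>(in, out, N, C, H, W);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    std::printf("host launch + sync: %.1f us, GPU time between events: %.1f us\n",
                hostUs, gpuMs * 1000.0f);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Event timing has limited resolution and adds a little overhead of its own, so treat the GPU figure as approximate for kernels this short. The host figure is normally the larger one, because it also contains the launch and synchronization cost.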

For your third question, I can’t really determine from that screenshot what might be going on. I would zoom in on the GPU section in question and see which kernels are running, how they interact, and whether there are any issues with memory transfers. I would also look at the OS runtime trace on the CPU thread and make sure the CPU thread was active. It looks like you have a massive kernel that is not keeping the GPU busy.