Kernel launch time in Nsight Systems

I am using Nsight Systems to profile a program running on Xavier, and I have collected some CUDA runtime traces with it. I am now trying to analyze that data.


As the picture above shows, the nchwToNhwckernel operation in one thread takes 139.648 us from begin to end, and its latency is 760.512 us. What is the difference between the 139.648 us and the 760.512 us, and what do these two parameters mean?
My guess is that the begin-to-end time is the time spent launching the kernel, and that the latency is the time from the start of the launch until the kernel finishes running, which would include the launch time, the wait time, and the kernel run time. Is that right?

The next question is about kernel run time. In the picture above, the same nchwToNhwckernel takes about 7.328 us from begin to end, which seems very fast for a kernel. Is that the kernel's real execution time? And does the time we normally think of as the kernel's run time actually include the kernel launch time as well?
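To make the three times I am asking about concrete, here is a minimal sketch of how they could be measured by hand. It is only an illustration under my own assumptions: `dummyKernel` is a hypothetical stand-in (not the real nchwToNhwckernel), the buffer size is arbitrary, and I am not claiming this is how Nsight Systems computes its numbers. The host-side timer around the launch call approximates the launch API time, the CUDA events approximate the GPU-side execution time, and the host-side timer up to the synchronization point approximates launch plus wait plus execution.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel, used only to have something to time.
__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for the sketch
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float* d = nullptr;
    check(cudaMalloc(&d, n * sizeof(float)), "cudaMalloc");
    check(cudaMemset(d, 0, n * sizeof(float)), "cudaMemset");

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate(start)");
    check(cudaEventCreate(&stop), "cudaEventCreate(stop)");

    // Warm-up launch so one-time initialization is not included in the timing.
    dummyKernel<<<blocks, threads>>>(d, n);
    check(cudaGetLastError(), "warm-up launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    using clock = std::chrono::steady_clock;
    using us = std::chrono::duration<double, std::micro>;

    check(cudaEventRecord(start), "cudaEventRecord(start)");
    auto t0 = clock::now();
    dummyKernel<<<blocks, threads>>>(d, n);  // asynchronous: returns before the kernel finishes
    auto t1 = clock::now();
    check(cudaGetLastError(), "kernel launch");
    check(cudaEventRecord(stop), "cudaEventRecord(stop)");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize");
    auto t2 = clock::now();

    float gpu_ms = 0.0f;
    check(cudaEventElapsedTime(&gpu_ms, start, stop), "cudaEventElapsedTime");

    // Host time spent inside the launch call itself.
    std::printf("host launch call:          %.3f us\n", us(t1 - t0).count());
    // Host time from launch until the stop event completed (launch + wait + execution).
    std::printf("host launch to completion: %.3f us\n", us(t2 - t0).count());
    // GPU time between the two events, which brackets the kernel execution.
    std::printf("GPU time between events:   %.3f us\n", gpu_ms * 1000.0f);

    check(cudaEventDestroy(start), "cudaEventDestroy(start)");
    check(cudaEventDestroy(stop), "cudaEventDestroy(stop)");
    check(cudaFree(d), "cudaFree");
    return 0;
}
```

Compiled with `nvcc`, this prints the three values separately, so they can be compared against what the profiler reports for the same launch. The event-based time also includes a small amount of overhead around the kernel, so I would treat it as an upper bound on the pure execution time.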

The last question is that some kernels take a very long time to launch, as shown above. Judging from the blue line below, the GPU does not appear to be busy, yet the launch time of the kernel marked with the red rectangle is still very long. What conditions can cause such a long kernel launch time, and what can I do to decrease it?

Can anyone help me answer these questions?