The gap between the CPU kernel launch and the GPU kernel execution is called kernel launch latency. In your screenshot it’s about 175 us (17.65 ms - 17.475 ms), which is not super bad but does look higher than optimal.
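For reference, the figure quoted here is just the difference between two timestamps read off the timeline. A minimal Python sketch of that arithmetic, assuming the values are read by hand from the screenshot; the helper name `launch_latency_us` is made up for illustration and is not part of any profiler API:

```python
def launch_latency_us(cpu_launch_ms: float, gpu_start_ms: float) -> float:
    """Gap between the CPU-side launch and the GPU-side kernel start, in microseconds.

    Both arguments are timeline timestamps in milliseconds (hypothetical inputs,
    read manually from the profiler timeline).
    """
    if gpu_start_ms < cpu_launch_ms:
        raise ValueError("GPU start timestamp is earlier than the CPU launch timestamp")
    # Convert the millisecond difference to microseconds; round away float noise.
    return round((gpu_start_ms - cpu_launch_ms) * 1000.0, 1)


# Timestamps taken from the numbers quoted in this reply.
latency = launch_latency_us(17.475, 17.65)
print(f"launch latency: {latency} us")
```

Running the same calculation over several launches in the trace would show whether 175 us is typical or an outlier for this application.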
I do see many posts that discuss launch latency, not just launch cost/overhead. For example, in this post someone suggested that launch latency can be higher when a kernel takes a lot of parameters and/or when UVM usage causes page faults. Does any of that apply to your application?
I also see that you are using NCCL in the application. Is this a multi-GPU system? Is there any chance the process is waiting for data from other GPUs before the workload is actually scheduled to run on the GPU?
BTW, you may also want to raise a question in the CUDA forum: CUDA Programming and Performance - NVIDIA Developer Forums. While our team develops the profiling tool that lets users observe this kind of performance issue, we don’t always have the best expertise to explain or resolve it. For this specific issue of high CUDA kernel launch latency, the CUDA team might be able to provide more insight. I can see that others have posted similar questions there in the past, e.g. Too much time for kernel launch latency.