Kernel operation delays even when the GPU is idle

I am running a program on multiple GPUs. I found that on one of the GPUs, operations are delayed even though the GPU is idle. As we can see from the picture, there is an obvious latency between the time the CUDA API call finishes and the time the kernel actually runs.

Unfortunately there is not enough information in a simple screenshot to tell you exactly what is going on. Take a look at https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/ and see if any of that helps you determine what is going on.

I’ve already read the blog and could not find anything helpful. :(
What information do you need? I’ll see whether I can provide it. Or is there some possible cause I can dig into?

@liuyis can you help this person?

@scse-l Could you share your report file and the timestamp of the API call where you observed the latency?

Sorry, I cannot share the report file.

Without the report file, it’s not easy for us to look into your specific case and provide specific suggestions.

Can you provide more screenshots from your report? For example, capture the timelines for all the processes and threads that are invoking CUDA activities, and for all the CUDA GPUs/Contexts/Streams that are running workloads.

If you search “cuda kernel launch high latency” on Google, there are lots of similar questions and answers. You may want to dig into them and see if any are applicable.


Here is a screenshot of timelines for all the processes.
I’ve searched and read most of them. All of them are talking about the cost of the kernel launch. However, what I’m asking about is the gap (while the GPU is idle) between the time the CUDA API call finishes and the time the CUDA kernel actually runs.

The gap between the CPU-side kernel launch and the GPU kernel execution is called kernel launch latency. In your screenshot it’s about 175 us (17.65 ms - 17.475 ms), which is not super bad but does look higher than optimal.
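If you want a rough number outside of Nsight Systems, a quick sanity check is to time an empty kernel from the host and compare that against the GPU-side duration from CUDA events. This is just a minimal sketch (not taken from your application); the difference between the two numbers roughly covers launch overhead + launch latency + synchronization cost:

```cpp
// Minimal sketch (not from your application): compare host wall-clock time
// for launch + sync of an empty kernel against the GPU-side duration from
// CUDA events. The difference is roughly launch overhead + launch latency
// + synchronization cost.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    cudaFree(0);  // force context creation so it is not counted below

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up so first-launch/module-load costs are excluded.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    double host_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("host round-trip: %.1f us, GPU-side time: %.1f us\n",
           host_us, gpu_ms * 1000.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```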

I do see many posts talking about launch latency as well, not just launch cost/overhead. For example, in this post someone suggested that if there are a lot of kernel parameters and/or UVM usage that causes page faults, the launch latency could be higher. Is any of that applicable to your application?
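If UVM page faults do turn out to be the cause, prefetching the managed buffers to the GPU before the launch usually takes the fault servicing off the critical path. A minimal sketch, assuming your application uses cudaMallocManaged on a system that supports prefetching (the kernel and buffer names are made up for illustration):

```cpp
// Minimal sketch, assuming managed (UVM) memory on a system that supports
// prefetching (e.g. Linux, Pascal or newer). The kernel and buffer names
// are made up for illustration.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const int device = 0;
    cudaSetDevice(device);

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // pages now resident on the host

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Move the pages to the GPU ahead of time instead of letting the kernel
    // fault them in when it first touches the buffer.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, stream);

    scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```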

I’m also seeing that you are using NCCL in the application; is this a multi-GPU system? Is there any chance the process is waiting for data from other GPUs before actually scheduling the workload to run on the GPU?
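To illustrate what that would look like on the timeline, here is a single-GPU sketch (not your code; the producer kernel and event stand in for a cross-GPU dependency such as an NCCL collective). The launch API returns immediately, but the kernel’s GPU start is deferred until the dependency completes, which shows up as exactly this kind of gap:

```cpp
// Single-GPU sketch (not your code): the event/producer here stands in for a
// cross-GPU dependency such as an NCCL collective. The consumer launch
// returns immediately on the CPU, but its GPU start is deferred until the
// dependency completes, which appears as a gap between the API call and the
// kernel run on the timeline.
#include <cuda_runtime.h>

__global__ void producer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = static_cast<float>(i);  // "data arriving from elsewhere"
}

__global__ void consumer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    // "Communication" work on one stream.
    producer<<<(n + 255) / 256, 256, 0, comm_stream>>>(buf, n);
    cudaEventRecord(ready, comm_stream);

    // The compute stream waits for it; the launch below is asynchronous, so
    // the API call finishes long before the kernel is allowed to run.
    cudaStreamWaitEvent(compute_stream, ready, 0);
    consumer<<<(n + 255) / 256, 256, 0, compute_stream>>>(buf, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(ready);
    cudaStreamDestroy(comm_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(buf);
    return 0;
}
```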

BTW, you may also want to raise a question in the CUDA forum: CUDA Programming and Performance - NVIDIA Developer Forums. While our team develops the profiling tool that lets users observe this kind of performance issue, we don’t always have the best expertise to explain or resolve it. For this specific issue of high CUDA kernel launch latency, the CUDA team might be able to provide more insight. I can see that others have posted similar questions there in the past, e.g. Too much time for kernel launch latency.

Thanks a lot for helping me distinguish between kernel launch latency and launch overhead. I’ll see if the posts are helpful.
“Any chance the process is waiting for data from other GPUs before actually scheduling the workload to be run on GPU?”
I guess the answer is no. All the other GPUs run ahead of the “delayed” GPU; in fact, all the other GPUs are waiting for this “delayed” one.
