Kernel operation delays even when the GPU is idle

I am running a program on multiple GPUs. I found that on one of the GPUs, operations are delayed even though the GPU is idle. As we can see from the picture, there is an obvious latency between the time the CUDA API call finishes and the time the kernel actually runs.

Unfortunately there is not enough information in a simple screenshot to tell you exactly what is going on. Take a look at https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/ and see if any of that helps you determine what is going on.

I’ve already read the blog and could not find anything helpful. :(
What information do you need? I’ll see whether I can provide it. Or is there some possible cause I can dig into?

@liuyis can you help this person?

@scse-l Could you share your report file and the timestamp of the API call where you observed the latency?

Sorry, I cannot share the report file.

Without the report file, it’s not easy for us to look into your specific case and provide specific suggestions.

Can you provide more screenshots from your report? For example, capture the timelines for all the processes and threads that are invoking CUDA activities, and for all the CUDA GPUs/Contexts/Streams that are running workloads.

If you search “cuda kernel launch high latency” on Google, there are lots of similar questions and answers. You may want to dig into them and see if any are applicable.


Here is a screenshot of timelines for all the processes.
I’ve searched and read most of them. All of them are talking about the cost of the kernel launch. However, what I’m asking about is the gap (while the GPU is idle) between the time the CUDA API call finishes and the time the CUDA kernel actually runs.

The gap between the CPU-side kernel launch and the GPU kernel execution is called kernel launch latency. In your screenshot it’s about 175 us (17.65 ms - 17.475 ms), which is not super bad but does look higher than optimal.
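If you want a rough number outside of Nsight Systems, a quick sanity check is to time an empty kernel from the host and compare that against the GPU-side duration from CUDA events. This is just a minimal sketch (not taken from your application); the difference between the two numbers roughly covers launch overhead + launch latency + synchronization cost:

```cpp
// Minimal sketch (not from your application): compare host wall-clock time
// for launch + sync of an empty kernel against the GPU-side duration from
// CUDA events. The difference is roughly launch overhead + launch latency
// + synchronization cost.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    cudaFree(0);  // force context creation so it is not counted below

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up so first-launch/module-load costs are excluded.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    double host_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("host round-trip: %.1f us, GPU-side time: %.1f us\n",
           host_us, gpu_ms * 1000.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```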

I do see many posts talking about launch latency as well, not just launch cost/overhead. For example, in this post someone suggested that if there are a lot of kernel parameters and/or UVM usage that causes page faults, the launch latency could be higher. Is any of that applicable to your application?
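If UVM page faults do turn out to be the cause, prefetching the managed buffers to the GPU before the launch usually takes the fault servicing off the critical path. A minimal sketch, assuming your application uses cudaMallocManaged on a system that supports prefetching (the kernel and buffer names are made up for illustration):

```cpp
// Minimal sketch, assuming managed (UVM) memory on a system that supports
// prefetching (e.g. Linux, Pascal or newer). The kernel and buffer names
// are made up for illustration.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const int device = 0;
    cudaSetDevice(device);

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // pages now resident on the host

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Move the pages to the GPU ahead of time instead of letting the kernel
    // fault them in when it first touches the buffer.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, stream);

    scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```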

I’m also seeing that you are using NCCL in the application; is this a multi-GPU system? Is there any chance the process is waiting for data from other GPUs before actually scheduling the workload to run on the GPU?
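To illustrate what that would look like on the timeline, here is a single-GPU sketch (not your code; the producer kernel and event stand in for a cross-GPU dependency such as an NCCL collective). The launch API returns immediately, but the kernel’s GPU start is deferred until the dependency completes, which shows up as exactly this kind of gap:

```cpp
// Single-GPU sketch (not your code): the event/producer here stands in for a
// cross-GPU dependency such as an NCCL collective. The consumer launch
// returns immediately on the CPU, but its GPU start is deferred until the
// dependency completes, which appears as a gap between the API call and the
// kernel run on the timeline.
#include <cuda_runtime.h>

__global__ void producer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = static_cast<float>(i);  // "data arriving from elsewhere"
}

__global__ void consumer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    // "Communication" work on one stream.
    producer<<<(n + 255) / 256, 256, 0, comm_stream>>>(buf, n);
    cudaEventRecord(ready, comm_stream);

    // The compute stream waits for it; the launch below is asynchronous, so
    // the API call finishes long before the kernel is allowed to run.
    cudaStreamWaitEvent(compute_stream, ready, 0);
    consumer<<<(n + 255) / 256, 256, 0, compute_stream>>>(buf, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(ready);
    cudaStreamDestroy(comm_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(buf);
    return 0;
}
```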

BTW, you may also want to raise a question in the CUDA forum: CUDA Programming and Performance - NVIDIA Developer Forums. While our team develops the profiling tool that lets users observe this kind of performance issue, we don’t always have the best expertise to explain or resolve it. For this specific issue of high CUDA kernel launch latency, the CUDA team might be able to provide more insight. I can see that others have posted similar questions there in the past, e.g. Too much time for kernel launch latency.

Thanks a lot for helping me distinguish between kernel launch latency and launch overhead. I’ll see if the posts are helpful.
“Any chance the process is waiting for data from other GPUs before actually scheduling the workload to be run on GPU?”
I guess the answer is no. All the other GPUs run ahead of the “delayed” GPU; in fact, all the other GPUs are waiting for this “delayed” one.
