There is an obvious latency (~175 µs) between the time the CUDA API call finishes and the time when the kernel runs. This happens on one GPU, and all the other GPUs end up waiting for the delayed one when it comes to NCCL APIs.
How do we know the GPU is idle? Is it possibly being used by the operating system GUI for display tasks? Or in use by another process belonging to a different user? What kind of GPUs are these, and what kind of system are they in?
So one of these GPUs behaves differently from the others, and it does not immediately appear to be software related.
What about hardware topology? Are all GPUs linked by a dedicated PCIe gen4 x16 interconnect directly to the CPU? Are any PCIe switches in use? If this is a multi-socket server, are the GPUs evenly distributed across the CPUs, e.g. two GPUs per CPU socket? Also, if this is a multi-socket system, is processor and memory affinity configured such that each GPU “talks” to the near CPU and the near memory controlled by that CPU?
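As a quick cross-check for those topology questions, something along these lines (an untested sketch, not specific to your system) prints the PCI bus ID of every visible GPU, which can then be compared against nvidia-smi topo -m, lspci, and the NUMA node layout:

```
// Sketch: print each GPU's PCI bus ID so it can be cross-referenced
// with `nvidia-smi topo -m` and the system's NUMA layout.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        char busId[32];
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        printf("GPU %d: %s  PCI %s\n", dev, prop.name, busId);
    }
    return 0;
}
```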
Basically I am just asking some preliminary questions for the purpose of excluding various scenarios. If one GPU in a multiple-GPU configuration is behaving differently from all the others there has to be a rational explanation for it.
@Robert_Crovella has experience with large multi-GPU servers and likely has some good ideas as to what to check or some hypotheses about possible root causes.
Thanks a lot! I’ll check all of them.
Topology: all 4 GPUs are linked by PCIe, shown as “PXB” in nvidia-smi topo -m.
In addition, this happens to some other GPU later (at ~499 ms in the same report file):
FWIW, except for the timeline at the top the font in these screen captures is so small that I can only discern colored horizontal bars. Others may have more luck on higher resolution monitors.
So from what I gather so far it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly one of the GPUs in the system experiences an unexpected delay in launching a kernel although it is idle.
Is this application running on bare metal or a virtualized system? Does it use Multi-Process Service (MPS)?
Honestly, at the moment I have no idea what the issue could be. I have vague recollections of similar issues being reported in this forum before, but try as I might, I cannot recall any of the details or any of these threads. I hope Robert can spot something.
it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly some of the GPUs in the system experience an unexpected delay in launching a kernel although they are idle.
Exactly what I was trying to say!
Sorry about the font issue. The program uses 4 GPUs, and to show the timeline view of all of them I kept the zoom at 1x. Furthermore, what the ops are does not really matter, imho. Just let me know if you need a more detailed view and I can zoom in and re-capture it.
The program runs in a container and it does NOT use MPS.
I’ve searched the forum and read some posts about similar issues, but I couldn’t find anything helpful. I’ve also raised a question in the Nsight Systems forum. (FYI: Kernel operation delays when gpu is idle)
GPU kernel launch latency (the time from when the CPU code encountered the kernel launch in your source code, until the time when the kernel was actually processing) could be impacted if:
the GPU is busy with other work &
the launch queue is full
the CPU is busy or heavily loaded
a synchronizing operation is needed, for example with lazy loading &
there is a varying or large parameter pack (data size of the arguments passed to the kernel) &
the GPU is in default compute mode, and there are other users of the GPU (which also includes other containers) &
if we focus our attention only on the latency between when the launch was actually requested (roughly, the completion of the bar in the API section) and when the kernel actually began processing (the start of the bar in the device section), then the list is shorter. I have marked those above with &. And of course there are probably others that I don’t know or haven’t remembered.
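If you want to put rough numbers on this outside the profiler, one crude approach (my own sketch, not a substitute for the Nsight Systems timeline) is to time an empty kernel: the time spent inside the launch call shows whether the launch itself blocked, and the launch-to-finish time for an empty kernel is approximately the launch-to-start gap plus a couple of microseconds of kernel runtime:

```
// Sketch: approximate kernel launch latency with host timers.
// For an empty kernel, launch-to-finish is roughly the
// launch-to-start gap plus a tiny kernel runtime.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    // Warm up: the first launch pays one-time costs (context/module loading).
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    using clk = std::chrono::steady_clock;
    for (int i = 0; i < 10; ++i) {
        auto t0 = clk::now();
        empty_kernel<<<1, 1>>>();     // asynchronous launch
        auto t1 = clk::now();         // launch call has returned
        cudaDeviceSynchronize();      // kernel has finished
        auto t2 = clk::now();
        printf("launch call: %6.1f us   launch-to-finish: %6.1f us\n",
               std::chrono::duration<double, std::micro>(t1 - t0).count(),
               std::chrono::duration<double, std::micro>(t2 - t1).count());
    }
    return 0;
}
```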
Without the code and access to the profiler interactively, I don’t think I can offer any further advice.
Thanks a lot for the amazing check list!
I’ve walked through the list and here are some of my guesses (please let me know if there is something wrong):
I’m the only user of the server and I run only one program with 4 GPUs in a container. This removes the first and the last items from the list.
This occasionally happens to some of the GPUs while the other GPUs behave as expected, which removes the second-to-last item. Since all GPUs are expected to execute the same ops with the same parameter pack, it would happen to all GPUs if the parameter pack were the root cause.
That leaves “lazy loading” on the list, so I’ll dig into it first.
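If lazy loading does turn out to be involved, one mitigation (a sketch under the assumption of CUDA 11.7+, where the CUDA_MODULE_LOADING environment variable applies; my_kernel here is just a placeholder for the application's real kernels) is to either set CUDA_MODULE_LOADING=EAGER before starting the process, or to issue a warm-up launch of each kernel on each GPU during initialization so the one-time load cost is not paid in the timed region:

```
// Sketch: pay the lazy-loading cost up front with a warm-up launch per GPU.
// Alternatively, export CUDA_MODULE_LOADING=EAGER before launching the
// process (CUDA 11.7+) to disable lazy loading entirely.
#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder for a real kernel */ }

void warm_up_all_gpus() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        my_kernel<<<1, 1>>>();     // first launch triggers the module load
        cudaDeviceSynchronize();   // make sure the load has completed
    }
}
```

Note that lazy loading is per kernel/module, so each kernel the application launches would need its own warm-up.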
There is no explicit method to find out. It requires inferences based on API behavior. In a nutshell, the kernel launch switches from async to sync, blocking the host CPU thread until a queue slot opens up. This is what you can observe. I’m not suggesting it’s easy or sensible to try to observe it, I’m just saying I know of no other obvious “real-time” indicator. There are various questions on various forums about this queue; here is one example. Here is another.
If you just want to know how long the queue is on your particular system, you can just launch kernels with non-trivial execution time (for example, around 100 milliseconds) and observe when their issue rate drops from one every few microseconds to one per 100 milliseconds. Typical queue depths observed in the past have been on the order of a thousand kernel launches.
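A sketch of that probe (my own phrasing of the experiment described above): launch a long-running spin kernel in a loop and note the iteration at which the launch call stops returning in a few microseconds and instead takes roughly one kernel duration, i.e. the queue is full and the launch has gone synchronous:

```
// Sketch: estimate the launch queue depth. Launch a long-running kernel
// repeatedly and record the iteration at which the launch call itself
// starts blocking (jumps from a few microseconds to ~kernel duration).
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait on the GPU */ }
}

int main() {
    const long long cycles = 100000000LL;   // ~70 ms at ~1.4 GHz; tune for your GPU
    spin_kernel<<<1, 1>>>(1);               // warm-up launch
    cudaDeviceSynchronize();

    using clk = std::chrono::steady_clock;
    for (int i = 0; i < 4096; ++i) {
        auto t0 = clk::now();
        spin_kernel<<<1, 1>>>(cycles);
        double us = std::chrono::duration<double, std::micro>(clk::now() - t0).count();
        if (us > 1000.0) {                  // launch call blocked: queue was full
            printf("launch started blocking at iteration %d (%.0f us)\n", i, us);
            break;
        }
    }
    // Draining the queued kernels here takes roughly queue depth x kernel duration.
    cudaDeviceSynchronize();
    return 0;
}
```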
The GPU launch queue becoming full is an indication of too much work being issued to the GPU. The queue serves as a buffer that can absorb short-term activity bursts, but does not help in cases of permanent overload. Sometimes this can be mitigated by splitting the GPU work across fewer, but more time-consuming kernels, thus reducing accumulating launch overhead. At other times a faster GPU is simply what is needed. If that is not possible, maybe some work such as pre-computation can remain on the host CPU for more advantageous load balancing between host and device, or work can be split across multiple GPUs.
One could also put the CUDA calls into a dedicated background thread, which manages its own CPU-side queue, which could be any size you need. The kernels would not run any faster, but the launches would no longer block the issuing thread.
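A minimal sketch of such a launcher thread (the names are my own, not from any particular library): the producer enqueues work items and never blocks, while the single worker thread issues the actual CUDA launches and absorbs any blocking on a full GPU launch queue:

```
// Sketch: a dedicated launcher thread with an unbounded CPU-side queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class GpuLauncher {
public:
    GpuLauncher() : worker_([this] { run(); }) {}
    ~GpuLauncher() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }
    // Enqueue a callable that performs the kernel launch(es).
    void enqueue(std::function<void()> work) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(work));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;     // shutting down, nothing left to do
                work = std::move(q_.front());
                q_.pop();
            }
            work();                         // may block if the GPU queue is full
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;    // declared last so the other members exist before it starts
};
```

In practice the worker would call cudaSetDevice() for the intended GPU and launch on the appropriate streams; error handling is omitted here.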