What are possible reasons for heavy kernel launch latency?

I run a program on multiple GPUs. I found that on one of the GPUs, an op is delayed even though the GPU is idle. Here is a screenshot of the timelines for all the processes.


There is an obvious latency (~175 us) between the time the CUDA API call finishes and the time the kernel runs. This happens on one GPU, and all the other GPUs end up waiting for the delayed one when it comes to NCCL APIs.

How do we know the GPU is idle? Is it possibly being used by the operating system GUI for display tasks? Or is it in use by another process belonging to a different user? What kind of GPUs are these, and what kind of system are they in?

  1. It’s a Linux GPU server and there’s no GUI. From the nsys report (the screenshot), the CUDA HW row is idle.
  2. OS is Ubuntu 22.04.

So one of these GPUs behaves differently from the others, and it does not immediately appear to be software related.

What about hardware topology? Are all GPUs linked by a dedicated PCIe gen4 x16 interconnect directly to the CPU? Are any PCIe switches in use? If this is a multi-socket server, are the GPUs evenly distributed across the CPUs, e.g. two GPUs per CPU socket? Also, if this is a multi-socket system, is processor and memory affinity configured such that each GPU “talks” to the near CPU and the near memory controlled by that CPU?
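
A minimal sketch of one way to check this, assuming a Linux sysfs layout (the /sys/bus/pci/... path and the -1 “no NUMA information” fallback are Linux conventions, not anything CUDA-specific): it prints each GPU’s PCI address together with the NUMA node the kernel associates with it, which can be compared against the output of nvidia-smi topo -m.

```cpp
// Sketch: list each GPU's PCI address and the NUMA node Linux reports for it.
#include <cstdio>
#include <cctype>
#include <string>
#include <fstream>
#include <algorithm>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);

        // sysfs uses lowercase hex in PCI addresses
        std::string id(busId);
        std::transform(id.begin(), id.end(), id.begin(),
                       [](unsigned char c) { return std::tolower(c); });

        int numaNode = -1;  // -1 means "no NUMA information available"
        std::ifstream f("/sys/bus/pci/devices/" + id + "/numa_node");
        if (f) f >> numaNode;

        printf("GPU %d  PCI %s  NUMA node %d\n", dev, busId, numaNode);
    }
    return 0;
}
```

Compile with nvcc; on a multi-socket box, check that the devices are spread across NUMA nodes and that each launching process is pinned (e.g. with numactl) to the matching node.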

Basically I am just asking some preliminary questions for the purpose of excluding various scenarios. If one GPU in a multiple-GPU configuration is behaving differently from all the others there has to be a rational explanation for it.

@Robert_Crovella has experience with large multi-GPU servers and likely has some good ideas as to what to check or some hypotheses about possible root causes.

Thanks a lot! I’ll check all of them.
Topology: all 4 GPUs are linked by PCIe, shown as “PXB” in nvidia-smi topo -m.
In addition, this happens to another GPU later (at ~499 ms in the same report file):

FWIW, except for the timeline at the top, the font in these screen captures is so small that I can only discern colored horizontal bars. Others may have more luck on higher-resolution monitors.

So from what I gather so far it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly one of the GPUs in the system experiences an unexpected delay in launching a kernel although it is idle.

Is this application running on bare metal or a virtualized system? Does it use Multi-Process Service (MPS)?

Honestly, at the moment I have no idea what the issue could be. I have vague recollections of similar issues being reported in this forum before, but try as I might, I cannot recall any of the details or any of these threads. I hope Robert can spot something.

it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly some of the GPUs in the system experience an unexpected delay in launching a kernel although they are idle.

Exactly what I’m trying to ask!
Sorry about the font. The program uses 4 GPUs and, to show the timeline view of all of them, I kept the zoom at 1x. What the ops actually are does not really matter IMHO. Just let me know if you need a more detailed view and I can zoom in and re-capture it.
The program runs in a container and it does NOT use MPS.

I’ve searched the forum and read some posts about similar issues, but I couldn’t find anything helpful. I’ve also raised a question on the Nsight Systems forum. (FYI, Kernel operation delays when gpu is idle)

GPU kernel launch latency (the time from when the CPU code encountered the kernel launch in your source code, until the time when the kernel was actually processing) could be impacted if:

  • the GPU is busy with other work &
  • the launch queue is full
  • the CPU is busy or heavily loaded
  • a synchronizing operation is needed, for example with lazy loading &
  • in a multi-threaded application, due to competition for internal resource locks
  • there is a varying or large parameter pack (data size of the arguments passed to the kernel) &
  • the GPU is in default compute mode, and there are other users of the GPU (which also includes other containers) &

If we focus our attention only on the latency between when the launch was actually requested (roughly, the completion of the bar in the API section) and when the kernel actually began processing (the start of the bar in the device section), then the list is shorter. I have marked those above with &. And of course there are probably others that I don’t know or haven’t remembered.
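
As a rough sanity check (a sketch only, not something referenced above): timing a burst of empty-kernel launches per device gives a baseline for launch overhead on an otherwise idle GPU, which is typically in the single-digit microsecond range and far below the ~175 us gap observed here.

```cpp
// Sketch: average time per empty-kernel launch on each GPU (launch + execution,
// measured over a burst of back-to-back launches followed by a single sync).
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const int iters = 1000;
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        empty_kernel<<<1, 1>>>();      // warm-up: forces module load on this device
        cudaDeviceSynchronize();

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();       // wait until all launches have executed
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("GPU %d: ~%.1f us per empty launch (averaged)\n", dev, us / iters);
    }
    return 0;
}
```

If one device consistently reported a much larger number even for empty launches, that would point away from the application itself and toward the system (topology, clocks, driver).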

Without the code and access to the profiler interactively, I don’t think I can offer any further advice.


Thanks a lot for the amazing checklist!
I’ve walked through the list and here are some of my guesses (please let me know if something is wrong):

  • I’m the only user of the server and I run only one program with 4 GPUs in a container. This removes the first and the last items from the list (a quick compute-mode check is sketched after this list).
  • This occasionally happens to some of the GPUs while the other GPUs behave as expected. This removes the second-to-last item: since all GPUs are expected to execute the same ops with the same parameter pack, it should happen on all GPUs if the parameter pack were the root cause.
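
For completeness, a small sketch (my own addition) that confirms the compute mode of each device programmatically; the same information is reported by nvidia-smi -q under “Compute Mode”.

```cpp
// Sketch: print the compute mode of each GPU. In "Default" mode other
// processes (including other containers) could share the device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        int mode = 0;
        cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, dev);
        const char* name =
            mode == cudaComputeModeDefault          ? "Default" :
            mode == cudaComputeModeExclusiveProcess ? "Exclusive Process" :
            mode == cudaComputeModeProhibited       ? "Prohibited" : "Other";
        printf("GPU %d compute mode: %s\n", dev, name);
    }
    return 0;
}
```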

That leaves only “lazy loading” on the list, so I’ll dig into it first.
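
A minimal sketch of how lazy loading could be ruled in or out, assuming CUDA 11.7 or newer (where the CUDA_MODULE_LOADING environment variable selects the loading mode) and using a hypothetical warmup_kernel as a stand-in for the real kernels in the workload:

```cpp
// Sketch: (a) report the module loading mode requested via the environment,
// (b) warm every device up so any (possibly lazy) module load happens before
// the profiled region rather than inside it.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// hypothetical stand-in for one of the real kernels in the workload
__global__ void warmup_kernel() {}

int main() {
    // With CUDA 11.7+ this variable controls lazy vs. eager module loading;
    // running with CUDA_MODULE_LOADING=EAGER disables lazy loading entirely.
    const char* mode = std::getenv("CUDA_MODULE_LOADING");
    printf("CUDA_MODULE_LOADING = %s\n", mode ? mode : "(unset, driver default)");

    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        // The first launch of a kernel on a device triggers the module load;
        // doing it here keeps that cost out of the timed/profiled region.
        warmup_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    printf("warm-up done on %d device(s)\n", n);
    return 0;
}
```

Re-running the workload with CUDA_MODULE_LOADING=EAGER (or with an explicit warm-up pass like the one above) and re-profiling should show whether the ~175 us gaps disappear; if they persist, lazy loading can be crossed off the list too.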