There is an obvious latency (~175 µs) between the time the CUDA API call finishes and the time when the kernel runs. This happens on one GPU, and all the other GPUs end up waiting for the delayed one when it comes to NCCL APIs.
How do we know the GPU is idle? Is it possibly being used by the operating system GUI for display tasks? Or in use by another process belonging to a different user? What kind of GPUs are these, and what kind of system are they in?
So one of these GPUs behaves differently from the others, and it does not immediately appear to be software related.
What about hardware topology? Are all GPUs linked by a dedicated PCIe gen4 x16 interconnect directly to the CPU? Are any PCIe switches in use? If this is a multi-socket server, are the GPUs evenly distributed across the CPUs, e.g. two GPUs per CPU socket? Also, if this is a multi-socket system, is processor and memory affinity configured such that each GPU “talks” to the near CPU and the near memory controlled by that CPU?
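As a quick cross-check for those topology questions, something along these lines (an untested sketch, not specific to your system) prints the PCI bus ID of every visible GPU, which can then be compared against nvidia-smi topo -m, lspci, and the NUMA node layout:

```
// Sketch: print each GPU's PCI bus ID so it can be cross-referenced
// with `nvidia-smi topo -m` and the system's NUMA layout.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        char busId[32];
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        printf("GPU %d: %s  PCI %s\n", dev, prop.name, busId);
    }
    return 0;
}
```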
Basically I am just asking some preliminary questions for the purpose of excluding various scenarios. If one GPU in a multiple-GPU configuration is behaving differently from all the others there has to be a rational explanation for it.
@Robert_Crovella has experience with large multi-GPU servers and likely has some good ideas as to what to check or some hypotheses about possible root causes.
Thanks a lot! I’ll check all of them.
Topology: all 4 GPUs are linked by PCIe, shown as “PXB” in nvidia-smi topo -m.
In addition, this happens to some other GPU later (at ~499 ms in the same report file):
FWIW, except for the timeline at the top the font in these screen captures is so small that I can only discern colored horizontal bars. Others may have more luck on higher resolution monitors.
So from what I gather so far it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly one of the GPUs in the system experiences an unexpected delay in launching a kernel although it is idle.
Is this application running on bare metal or a virtualized system? Does it use Multi-Process Service (MPS)?
Honestly, at the moment I have no idea what the issue could be. I have vague recollections of similar issues being reported in this forum before, but try as I might, I cannot recall any of the details or any of these threads. I hope Robert can spot something.
it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly some of the GPUs in the system experience an unexpected delay in launching a kernel although they are idle.
Exactly what I was trying to say!
Sorry about the font issue. The program uses 4 GPUs, and to show the timeline view of all of them I kept the zoom at 1x. Furthermore, what the ops are does not really matter, imho. Just let me know if you need a more detailed view and I can zoom in and re-capture it.
The program runs in a container and it does NOT use MPS.
I’ve searched the forum and read some posts about similar issues, but I couldn’t find anything helpful. I’ve also raised a question in the Nsight Systems forum. (FYI: Kernel operation delays when gpu is idle)
GPU kernel launch latency (the time from when the CPU code encountered the kernel launch in your source code, until the time when the kernel was actually processing) could be impacted if:
the GPU is busy with other work &
the launch queue is full
the CPU is busy or heavily loaded
a synchronizing operation is needed, for example with lazy loading &
there is a varying or large parameter pack (data size of the arguments passed to the kernel) &
the GPU is in default compute mode, and there are other users of the GPU (which also includes other containers) &
if we focus our attention only on the latency between when the launch was actually requested (roughly, the completion of the bar in the API section) and when the kernel actually began processing (the start of the bar in the device section), then the list is shorter. I have marked those above with &. And of course there are probably others that I don’t know or haven’t remembered.
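If you want to put rough numbers on this outside the profiler, one crude approach (my own sketch, not a substitute for the Nsight Systems timeline) is to time an empty kernel: the time spent inside the launch call shows whether the launch itself blocked, and the launch-to-finish time for an empty kernel is approximately the launch-to-start gap plus a couple of microseconds of kernel runtime:

```
// Sketch: approximate kernel launch latency with host timers.
// For an empty kernel, launch-to-finish is roughly the
// launch-to-start gap plus a tiny kernel runtime.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    // Warm up: the first launch pays one-time costs (context/module loading).
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    using clk = std::chrono::steady_clock;
    for (int i = 0; i < 10; ++i) {
        auto t0 = clk::now();
        empty_kernel<<<1, 1>>>();     // asynchronous launch
        auto t1 = clk::now();         // launch call has returned
        cudaDeviceSynchronize();      // kernel has finished
        auto t2 = clk::now();
        printf("launch call: %6.1f us   launch-to-finish: %6.1f us\n",
               std::chrono::duration<double, std::micro>(t1 - t0).count(),
               std::chrono::duration<double, std::micro>(t2 - t1).count());
    }
    return 0;
}
```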
Without the code and access to the profiler interactively, I don’t think I can offer any further advice.
Thanks a lot for the amazing check list!
I’ve walked through the list and here are some of my guesses (please let me know if there is something wrong):
I’m the only user of the server and I run only one program with 4 GPUs in a container. This removes the first and the last items from the list.
This occasionally happens to some of the GPUs while the other GPUs behave as expected, which removes the second-to-last item. Since all GPUs are expected to execute the same ops with the same parameter pack, it would happen to all GPUs if the parameter pack were the root cause.
That leaves “lazy loading” on the list, so I’ll dig into it first.
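If lazy loading does turn out to be involved, one mitigation (a sketch under the assumption of CUDA 11.7+, where the CUDA_MODULE_LOADING environment variable applies; my_kernel here is just a placeholder for the application's real kernels) is to either set CUDA_MODULE_LOADING=EAGER before starting the process, or to issue a warm-up launch of each kernel on each GPU during initialization so the one-time load cost is not paid in the timed region:

```
// Sketch: pay the lazy-loading cost up front with a warm-up launch per GPU.
// Alternatively, export CUDA_MODULE_LOADING=EAGER before launching the
// process (CUDA 11.7+) to disable lazy loading entirely.
#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder for a real kernel */ }

void warm_up_all_gpus() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        my_kernel<<<1, 1>>>();     // first launch triggers the module load
        cudaDeviceSynchronize();   // make sure the load has completed
    }
}
```

Note that lazy loading is per kernel/module, so each kernel the application launches would need its own warm-up.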
There is no explicit method to find out. It requires inferences based on API behavior. In a nutshell, the kernel launch switches from async to sync, blocking the host CPU thread until a queue slot opens up. This is what you can observe. I’m not suggesting it’s easy or sensible to try to observe it, I’m just saying I know of no other obvious “real-time” indicator. There are various questions on various forums about this queue; here is one example. Here is another.
If you just want to know how long the queue is on your particular system, you can just launch kernels with non-trivial execution time (for example, around 100 milliseconds) and observe when their issue rate drops from one every few microseconds to one per 100 milliseconds. Typical queue depths observed in the past have been on the order of a thousand kernel launches.
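A sketch of that probe (my own phrasing of the experiment described above): launch a long-running spin kernel in a loop and note the iteration at which the launch call stops returning in a few microseconds and instead takes roughly one kernel duration, i.e. the queue is full and the launch has gone synchronous:

```
// Sketch: estimate the launch queue depth. Launch a long-running kernel
// repeatedly and record the iteration at which the launch call itself
// starts blocking (jumps from a few microseconds to ~kernel duration).
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait on the GPU */ }
}

int main() {
    const long long cycles = 100000000LL;   // ~70 ms at ~1.4 GHz; tune for your GPU
    spin_kernel<<<1, 1>>>(1);               // warm-up launch
    cudaDeviceSynchronize();

    using clk = std::chrono::steady_clock;
    for (int i = 0; i < 4096; ++i) {
        auto t0 = clk::now();
        spin_kernel<<<1, 1>>>(cycles);
        double us = std::chrono::duration<double, std::micro>(clk::now() - t0).count();
        if (us > 1000.0) {                  // launch call blocked: queue was full
            printf("launch started blocking at iteration %d (%.0f us)\n", i, us);
            break;
        }
    }
    // Draining the queued kernels here takes roughly queue depth x kernel duration.
    cudaDeviceSynchronize();
    return 0;
}
```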
The GPU launch queue becoming full is an indication of too much work being issued to the GPU. The queue serves as a buffer that can absorb short-term activity bursts, but does not help in cases of permanent overload. Sometimes this can be mitigated by splitting the GPU work across fewer, but more time-consuming kernels, thus reducing accumulating launch overhead. At other times a faster GPU is simply what is needed. If that is not possible, maybe some work such as pre-computation can remain on the host CPU for more advantageous load balancing between host and device, or work can be split across multiple GPUs.
One could also put the CUDA calls into a dedicated background thread, which manages its own CPU-side queue, which could be any size you need. The kernels would not run any faster, but the launches would no longer block the issuing thread.
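A minimal sketch of such a launcher thread (the names are my own, not from any particular library): the producer enqueues work items and never blocks, while the single worker thread issues the actual CUDA launches and absorbs any blocking on a full GPU launch queue:

```
// Sketch: a dedicated launcher thread with an unbounded CPU-side queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class GpuLauncher {
public:
    GpuLauncher() : worker_([this] { run(); }) {}
    ~GpuLauncher() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }
    // Enqueue a callable that performs the kernel launch(es).
    void enqueue(std::function<void()> work) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(work));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;     // shutting down, nothing left to do
                work = std::move(q_.front());
                q_.pop();
            }
            work();                         // may block if the GPU queue is full
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;    // declared last so the other members exist before it starts
};
```

In practice the worker would call cudaSetDevice() for the intended GPU and launch on the appropriate streams; error handling is omitted here.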