it is not a case of one particular GPU behaving badly, but the scenario is rather that occasionally and seemingly randomly
one (or some) of the GPUs in the system experiences an unexpected delay in launching a kernel even though it is idle.
That is exactly what I am trying to ask!
Sorry about the font size. The program uses 4 GPUs, and to show the timeline view of all of them I kept the zoom at 1x. Furthermore, what the ops are does not really matter, IMHO. Just let me know if you need a more detailed view; I can zoom in and re-capture it.
The program runs in a container and it does NOT use MPS.
I’ve searched the forum and read some posts about similar issues, but I could not find anything helpful. I’ve also raised a question in the Nsys forum (FYI: Kernel operation delays when gpu is idle).