Unexplained gaps in CUDA stream execution

I don’t understand a profile I captured with Nsight Systems.
I am using a single stream to submit a queue of CUDA operations; no synchronization is required or performed, yet I can see that, at regular intervals, kernel execution does nothing for a fixed period of time (see screenshot).
Nothing in the trace seems related to any wait or blocking state. I just don’t understand.
What could be the explanation?
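
For reference, the submission pattern is essentially the following (a simplified sketch; the kernel, sizes and iteration count are placeholders, not my actual code):

```cpp
#include <cuda_runtime.h>

// Stand-in kernel; the real workload is different.
__global__ void dummyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue a long run of asynchronous work on one stream;
    // nothing in this loop blocks the CPU or synchronizes.
    for (int iter = 0; iter < 10000; ++iter)
        dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Only synchronize once at the very end, before cleanup.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```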

Could you attach the .nsys-rep file? I have some thoughts, but I cannot tell enough from a screenshot to really say.

gaps.7z (7.9 MB)

Here is a trace.
You can see the gaps, for instance, at 655.2841 ms.

@jcohen, can you please take a look at this?

So, do you confirm that it is not something obvious that I missed, and that something unexpected is going on?

@jcohen, @hwilper: so, no news?

@Chacha21 What you’re seeing may be the GPU context switching and doing work for a different process. Can you try enabling the trace feature “GPU Context Switch”? You’ll get extra timeline rows that show which context the GPU switched to. We don’t yet have direct correlation between these GPU contexts and CUDA kernel executions, but it should be visually obvious which context is the one your kernels are executing in.

If you see it switch from that context to some other one at the same time these gaps occur, then you’ll know your workloads were not experiencing the gaps due to any problem with your code – it’s just that something else needed to use the GPU. If that’s indeed the case, then you can get rid of those gaps by ensuring nothing else uses the GPU while your app runs – i.e. don’t run your CUDA app on the same GPU that’s rendering your desktop.

It looks like you’re on Windows, so be sure to run Nsys elevated (i.e. “run as administrator”) so it has permission to see the PIDs of other users’ GPU contexts. If you run as a normal user, the PIDs of other users’ GPU contexts will all be zero. But if you run elevated you’ll be able to see for example if the PID of the GPU context interrupting your kernels is the PID of dwm.exe, the desktop window manager (which normally runs as a different user).
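
If it does turn out that another context is taking the GPU and you have more than one GPU in the machine, steering your CUDA work onto a specific device is just a matter of selecting it explicitly. Here is a minimal sketch (the selection heuristic is only illustrative; `kernelExecTimeoutEnabled` is usually set on a GPU that has a display watchdog attached):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    int chosen = 0;
    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("Device %d: %s  kernelExecTimeout=%d  tccDriver=%d\n",
                    dev, prop.name, prop.kernelExecTimeoutEnabled, prop.tccDriver);

        // Prefer a device without a kernel execution timeout, i.e. one that is
        // probably not driving a display (illustrative heuristic only).
        if (!prop.kernelExecTimeoutEnabled)
            chosen = dev;
    }

    cudaSetDevice(chosen);  // subsequent CUDA work in this thread targets this device
    std::printf("Using device %d\n", chosen);
    return 0;
}
```

You can achieve the same thing without code changes by restricting which devices your app sees with the CUDA_VISIBLE_DEVICES environment variable.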

Also, another problem on Windows is the WDDM driver model. It may be that you’re not getting interrupted by another GPU context, but just that you’re seeing the driver batching up CUDA work and submitting it to the WDDM kernel mode driver in chunks. To get the best performance out of WDDM, make sure you’ve turned on GPU Hardware Scheduling in your Windows advanced display settings, and make sure you’re running the newest release of Windows 11 (the WDDM version of Windows 10 is not going to keep getting updated, unfortunately) and the newest NVIDIA display drivers.

And there is an alternative to using WDDM – there is a different driver model for NVIDIA GPUs called TCC that completely bypasses the Windows display stack, and allows you to use the GPU entirely for CUDA. The CUDA performance is better in TCC mode, but note that you can’t attach a monitor to a TCC-mode GPU and get a Windows desktop on it.

You’ll need to have one GPU in WDDM mode to run your desktop and use a different GPU in TCC mode for doing CUDA with the best performance. That WDDM-mode GPU can either be an NVIDIA GPU if you have more than one plugged in (it doesn’t have to be a fast one just to run the desktop), or if your CPU has integrated graphics, you can use the CPU’s graphics to drive your Windows desktop and put your NVIDIA GPU(s) in TCC mode to be dedicated to CUDA computing.
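
If you suspect WDDM batching rather than another context, one commonly cited workaround is to nudge the driver into submitting its queued commands by querying the stream from time to time; if the gaps shrink when you do this, batching was the culprit. A rough sketch, assuming the buffer and stream are set up as in the snippet from your first post (the kernel and the flush interval here are just placeholders):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void step(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// d_data is assumed to be a device allocation and stream an already-created stream.
void enqueueBatches(float* d_data, int n, cudaStream_t stream)
{
    for (int iter = 0; iter < 10000; ++iter)
    {
        step<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

        // cudaStreamQuery() is non-blocking (it returns cudaSuccess or
        // cudaErrorNotReady), and as a side effect it encourages the WDDM
        // command buffer to be flushed to the GPU.
        if ((iter & 63) == 63)
            cudaStreamQuery(stream);
    }
}
```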

Hope that helps! Sorry for the late response.

Thank you for the answer, even if late (GTC 2023 might be a good reason).

Indeed, the test was done on Windows 10, and the GPU is used for display rendering (but only for that; no other GPU-enabled software is running). I didn’t think it could be a GPU access conflict because of the very high repeatability of the timings.

So I will try:

  • to trace the context switches as mentioned
  • to explore GPU Hardware Scheduling (if available on Windows 10)
  • to run under Windows 11
  • to use a secondary GPU for the display

It will take me some time, but I will post any interesting results (even “solved”) here.