Same kernel called multiple times in a loop has different runtimes

Hello,

I have a CUDA kernel performing a simulation that is called multiple times (~300x) in a loop. Its output is written to global memory and processed by subsequent kernels; preceding kernels create the simulation models. From time to time, however, a loop iteration slows down, and later iterations speed up again. This happens even if I use just one block with a couple of threads.

If I analyze my program with nvvp, the kernel shows up under “Performance-Critical Kernels” 2-3 times, and each entry has a different runtime (varying from 4 ms to 400 ms). I attached a small screenshot of nvvp.

Can anybody give me hints on how to debug/analyze this behavior? The GPU is a 1050 Ti under Linux that is used only for CUDA; graphics output is handled by a different GPU.

Thanks in advance
nvvp.png

I’d certainly be wondering about the page faults.

Thanks! I realized that I use managed memory for one kernel parameter. After prefetching it, everything works as expected.
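
For anyone who runs into the same thing, here is a minimal sketch of what the fix might look like. It is not my actual code: the kernel `simulate`, the helper `run_prefetched_loop`, and the buffer names and sizes are made up for illustration. It assumes the `cudaMemPrefetchAsync(ptr, bytes, device, stream)` signature from CUDA toolkits 8 through 12 (newer toolkits changed the arguments) and a device that reports `cudaDevAttrConcurrentManagedAccess`, such as a Pascal GPU under Linux.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with 1 if a CUDA call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the simulation kernel: it only reads the
// managed buffer and writes a result to ordinary device memory.
__global__ void simulate(const float* model, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * model[i];
}

// Returns 0 on success, 1 on any CUDA error or wrong result.
int run_prefetched_loop(int iterations)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    int device = 0;
    CHECK(cudaGetDevice(&device));

    float* model = nullptr;  // managed memory, passed as a kernel parameter
    float* out = nullptr;    // plain device memory
    CHECK(cudaMallocManaged(&model, bytes));
    CHECK(cudaMalloc(&out, bytes));

    // Touching the buffer on the host leaves its pages resident on the CPU.
    for (int i = 0; i < n; ++i) model[i] = 1.0f;

    // Prefetch only where the device supports on-demand page migration.
    int concurrent = 0;
    CHECK(cudaDeviceGetAttribute(&concurrent,
                                 cudaDevAttrConcurrentManagedAccess, device));
    if (concurrent) {
        // Move the pages to the GPU up front, so the first kernel launch
        // does not have to service page faults while it runs.
        CHECK(cudaMemPrefetchAsync(model, bytes, device, 0));
    }

    for (int it = 0; it < iterations; ++it) {
        simulate<<<(n + 255) / 256, 256>>>(model, out, n);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, out, sizeof(float), cudaMemcpyDeviceToHost));

    CHECK(cudaFree(out));
    CHECK(cudaFree(model));
    return first == 2.0f ? 0 : 1;
}

int main()
{
    return run_prefetched_loop(300);
}
```

If the host writes to the managed buffer again between iterations, its pages migrate back to the CPU, so the prefetch would presumably have to be repeated before the next launch.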