There is already a sticky thread on this, but it appears to be closed, so I’m starting a new one.
I’ve been developing with CUDA for many years and am well aware of the kernel execution timeout. Until now it hasn’t been a problem, because my kernels have naturally had very short runtimes and I have always needed to synchronise between kernel executions anyway.
However, something new I’m working on seems to be triggering the timeout. This surprised me at first, because my individual kernel calls still have quite short runtimes, but I noticed a comment in the other thread saying that kernel calls can sometimes be batched by the driver. Is it possible that this is happening to me? I have a loop, and each iteration issues several cudaMemcpy3DAsync() calls (device-to-device) and one kernel launch. Each iteration takes ~150 ms, but there are about 20 iterations, so the whole loop adds up to roughly 3 seconds if it is submitted as one batch.
Is there a recommended method for avoiding a timeout? Is it as simple as putting a cudaThreadSynchronize() in my loop?
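To make the question concrete, here is a minimal sketch of the loop structure I mean, with the synchronisation I’m asking about as an option. It is not my real code: step_kernel, run_iterations, and the 64x64x64 volume are hypothetical stand-ins, and I use one copy per iteration where my real loop has several. I’ve written cudaDeviceSynchronize(), which is the current name for the deprecated cudaThreadSynchronize(). I haven’t confirmed that this actually avoids the timeout; it only shows where the call would go.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for my real kernel: adds 1 to every element.
__global__ void step_kernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Runs `iterations` rounds of one device-to-device 3D copy plus one kernel.
// If sync_each_iteration is true, the host blocks after every round, so each
// round has finished before the next one is issued.
// Returns the first CUDA error seen, or cudaSuccess.
cudaError_t run_iterations(int iterations, bool sync_each_iteration)
{
    // Illustrative volume size (assumption, not my real data).
    // For pitched linear memory the extent width is given in bytes.
    cudaExtent extent = make_cudaExtent(64 * sizeof(float), 64, 64);
    cudaPitchedPtr src = {}, dst = {};
    cudaError_t err = cudaMalloc3D(&src, extent);
    if (err == cudaSuccess) err = cudaMalloc3D(&dst, extent);
    if (err == cudaSuccess) err = cudaMemset3D(src, 0, extent);

    // Describe the copy once; it is reissued every iteration.
    cudaMemcpy3DParms copy = {};
    copy.srcPtr = src;
    copy.dstPtr = dst;
    copy.extent = extent;
    copy.kind   = cudaMemcpyDeviceToDevice;

    // Element count including row padding, so the kernel covers the
    // whole pitched allocation.
    size_t n = dst.pitch / sizeof(float) * extent.height * extent.depth;
    const unsigned threads = 256;
    const unsigned blocks  = (unsigned)((n + threads - 1) / threads);

    for (int i = 0; i < iterations && err == cudaSuccess; ++i) {
        err = cudaMemcpy3DAsync(&copy, 0);          // default stream
        if (err != cudaSuccess) break;
        step_kernel<<<blocks, threads>>>(static_cast<float *>(dst.ptr), n);
        err = cudaGetLastError();                   // catches launch errors
        if (err == cudaSuccess && sync_each_iteration)
            err = cudaDeviceSynchronize();          // the call I'm asking about
    }

    // Wait for any remaining work and pick up asynchronous errors.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaFree(src.ptr);
    cudaFree(dst.ptr);
    return err;
}
```

Calling run_iterations(20, true) matches what I’m proposing, and run_iterations(20, false) matches what I have now, where nothing forces the host to wait until the end of the loop.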