I’ve written a CUDA function that I compile and run from MATLAB using mexcuda. The function uses dynamic parallelism (kernel launches from within kernels), and when I run it manually, with a few seconds between calls, it works exactly as intended.
However, when I call it frequently in a loop with identical input, for example:
dev = gpuDevice(1);
for i = 1:100
    C = myCUDAfunction(A,B);
    wait(dev)
end
it breaks down after a few iterations and returns NaN as the result. It then starts working again an iteration or so later; in other words, it’s not 100% reliable. If, however, I add “cudaDeviceReset();” at the beginning of my CUDA code, it always seems to work, but the reset command slows everything down substantially.
What is this unreliable behavior a sign of? Could it be that I need to add some sort of wait/synchronization command to my code, and in that case, which command and in what part of the code (start/end)?
In desperation I have tried adding “__syncthreads();” pretty much all over the place, but it does not seem to help at all.
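For context, this is roughly the shape of my mex file (heavily simplified; the kernel and variable names here are placeholders, not my real code), with the host-side synchronization and error check I’m wondering about marked:

```cuda
#include "mex.h"
#include "gpu/mxGPUArray.h"

// Placeholder child kernel (my real kernels do more work than this)
__global__ void childKernel(double *C, const double *A, const double *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] * B[i];
}

// Parent kernel launching the child (dynamic parallelism)
__global__ void parentKernel(double *C, const double *A, const double *B, int n)
{
    childKernel<<<(n + 255) / 256, 256>>>(C, A, B, n);
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxInitGPU();
    // ... create mxGPUArrays from prhs, get device pointers A, B, C and size n ...

    parentKernel<<<1, 1>>>(C, A, B, n);

    // Is a host-side sync plus error check like this what I'm missing?
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        mexErrMsgIdAndTxt("myCUDAfunction:kernel", cudaGetErrorString(err));
    }

    // ... wrap C as the output mxArray in plhs[0] ...
}
```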