CUDA function breaks down when called repeatedly in a loop. Do I need to add wait/synchronize commands?

I’ve written a CUDA function which I compile and run from Matlab using mexcuda. The function uses dynamic parallelism (kernel calls from within kernels), and when I run it manually with a few seconds in between calls it works just as intended.

However, when I call it frequently in a loop with identical input, for example:

dev = gpuDevice(1);
for i = 1 : 100
    C = myCUDAfunction(A,B);
    wait(dev)
end

it breaks down after a few iterations and returns NaN as the result. It then starts working again an iteration or so later. In other words, it's not 100% reliable. However, if I add “cudaDeviceReset();” at the beginning of my CUDA code, it always seems to work, but the reset command slows everything down substantially.

What is this unreliable behavior a sign of? Could it be that I need to add some sort of wait/synchronization command to my code, and in that case, which command and in what part of the code (start/end)?

In desperation I have tried adding “__syncthreads();” pretty much everywhere inside the kernels, but it does not seem to help at all.
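To make the question concrete, here is a minimal sketch of the host-side pattern I am wondering about. It is not my actual code: the kernel `accumulateKernel` and the helpers `check` and `runOnce` are made-up placeholders, and there is no dynamic parallelism in it. As far as I understand, `__syncthreads()` is only a barrier between the threads of one block, so it cannot make the host wait for a kernel, whereas `cudaDeviceSynchronize()` called from the host blocks until all queued device work has finished. The sketch also zeroes the output buffer, since `cudaMalloc` does not initialize memory; whether uninitialized memory has anything to do with my NaNs is only a guess.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: accumulates a[i] * b[i] into c[i].
// Because it uses +=, c must hold defined values before the launch.
__global__ void accumulateKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] += a[i] * b[i];
    }
}

// Print a message and return 1 if a CUDA runtime call failed, else return 0.
static int check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

// One complete call, comparable to one loop iteration. Returns 0 on success.
int runOnce(int n)
{
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), hC(n, 0.0f);
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = n * sizeof(float);
    int failed = 0;

    failed |= check(cudaMalloc(&a, bytes), "cudaMalloc a");
    failed |= check(cudaMalloc(&b, bytes), "cudaMalloc b");
    failed |= check(cudaMalloc(&c, bytes), "cudaMalloc c");

    if (!failed) {
        failed |= check(cudaMemcpy(a, hA.data(), bytes, cudaMemcpyHostToDevice), "copy a");
        failed |= check(cudaMemcpy(b, hB.data(), bytes, cudaMemcpyHostToDevice), "copy b");
        // cudaMalloc does not clear memory, so zero the accumulator explicitly.
        failed |= check(cudaMemset(c, 0, bytes), "cudaMemset c");
    }

    if (!failed) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        accumulateKernel<<<blocks, threads>>>(a, b, c, n);
        // Launch errors (bad configuration etc.) show up here ...
        failed |= check(cudaGetLastError(), "kernel launch");
        // ... while errors during execution show up once the host has waited.
        failed |= check(cudaDeviceSynchronize(), "kernel execution");
    }

    if (!failed) {
        failed |= check(cudaMemcpy(hC.data(), c, bytes, cudaMemcpyDeviceToHost), "copy c");
    }

    // Every element should be exactly 1.0f * 2.0f.
    for (int i = 0; !failed && i < n; ++i) {
        if (hC[i] != 2.0f) {
            failed = 1;
        }
    }

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return failed;
}

int main()
{
    // Call repeatedly, as in the MATLAB loop above.
    for (int iter = 0; iter < 100; ++iter) {
        if (runOnce(1 << 16) != 0) {
            fprintf(stderr, "iteration %d failed\n", iter);
            return 1;
        }
    }
    printf("all iterations ok\n");
    return 0;
}
```

Is checking the return values of `cudaGetLastError()` and `cudaDeviceSynchronize()` after each launch like this the right way to find out what goes wrong in the failing iterations?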

Thanks.