Iterative kernel execution without device reset: performance degrades, then cudaErrorUnknown

Hello everybody,

I’m currently experimenting with a function implemented as a CUDA kernel. The execution itself works, and the results are correct.
The results are used in an application, and I measure the time it takes to execute the function (including the memory copies etc.).

The problem I experience is that after a few iterations with different parameters, performance degrades steadily until
I run into a cudaErrorUnknown error when trying to copy the result array back to host memory.

This error can be eliminated by calling cudaDeviceReset after every complete calculation. But that call increases the total runtime
by roughly 2.5x, so I would like to avoid it. Since the error (without the reset) occurs while copying memory, there seems to be
a general problem with memory usage. As far as I can see, I’m using the cudaMalloc, cudaMemcpy and cudaFree commands correctly.
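For anyone debugging a similar issue: wrapping every runtime call in an error check helps pinpoint the first call that actually fails, rather than a later one that merely inherits the sticky error state. A minimal sketch (the CUDA_CHECK macro name is my own, not from the original post):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line info on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: wrap each runtime call, e.g.
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   CUDA_CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
//   CUDA_CHECK(cudaFree(d_buf));
```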

What kind of misuse would cause such behavior? First running smoothly, then slowing down, and finally failing…

I have now found that the error is probably produced by the kernel run itself. The cudaErrorUnknown is already reported
by the cudaThreadSynchronize call after the kernel launch - even before the results are copied back.
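To narrow down whether the kernel itself is failing, it helps to separate launch errors from asynchronous execution errors. A sketch, where myKernel and its arguments are placeholders for the actual kernel:

```cuda
#include <cuda_runtime.h>

// myKernel, grid, block, d_in, d_out and n stand in for the real code.
myKernel<<<grid, block>>>(d_in, d_out, n);

// Catches launch/configuration errors (invalid grid size, etc.).
cudaError_t launchErr = cudaGetLastError();

// Blocks until the kernel finishes; asynchronous execution errors such
// as cudaErrorUnknown surface here, before any cudaMemcpy is attempted.
// (cudaThreadSynchronize is deprecated; cudaDeviceSynchronize is the
// current equivalent.)
cudaError_t execErr = cudaDeviceSynchronize();
```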

The question of what causes the performance drop and the failure still stands, though. It is odd that the execution runs
correctly and then suddenly breaks, even with the same input parameters.

And finally, the error is found. It turns out that one pointer, used for a temporary array allocated inside the kernel,
was never freed at the end of the kernel execution. Every launch therefore leaked a piece of the device heap that
in-kernel malloc draws from. That degraded performance during the iterations that still had enough free heap space;
once the heap was used up and new space was needed, the kernel failed with the cudaErrorUnknown error.
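A minimal sketch of the bug pattern (kernel and variable names are illustrative, not from the original code): each launch grabs a scratch buffer from the device heap with in-kernel malloc, and only the fixed version releases it.

```cuda
__global__ void compute(float *out, int n)
{
    // Temporary scratch array from the device heap (in-kernel malloc).
    float *tmp = (float *)malloc(n * sizeof(float));
    if (tmp == NULL)
        return;  // heap exhausted: the failure mode described above

    // ... use tmp to produce out ...

    free(tmp);  // the missing call: without it, every launch leaks heap
                // space until allocations start to fail
}
```

If the temporaries are legitimately large, the heap can also be enlarged before the first launch with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) - but that only postpones a leak, it does not fix one.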

I hope this helps people running into the same kind of problem - just remember to free/delete your resources! :w00twave: