I’m currently experimenting with a function written as a CUDA kernel. The kernel itself executes and the results are correct.
The results are used in an application, and I measure the time it takes to execute the function (including copying memory etc.).
The problem I experience is that after a few iterations with different parameters, the performance drops steadily until
I run into a cudaErrorUnknown error when trying to copy the result array back to host memory.
This error can be eliminated by calling cudaDeviceReset after every complete calculation, but that call increases the total runtime
by ~2.5x, so I would like to find a way to avoid it. Since the error (without the reset) occurs when copying memory, there seems to be
a general problem with memory usage. As far as I can see, I am using the cudaMalloc, cudaMemcpy and cudaFree calls correctly.
What kind of misuse would cause such behavior: first running smoothly, then slowing down, and then failing?
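To rule out a silently failing call earlier in the sequence, every CUDA runtime call can be wrapped so the first failure is reported at its exact site rather than propagating into a later cudaMemcpy. This is a minimal sketch of that pattern; the macro name CUDA_CHECK and the buffer size are illustrative, not taken from my actual code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the first failing runtime call with file and line, then abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t n = 1 << 20;          // illustrative size
    float *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    // ... kernel launch and cudaMemcpy calls, each wrapped the same way ...
    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```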
I have now found that the error is probably produced by the kernel run itself: the cudaErrorUnknown is already reported
by the cudaThreadSynchronize after the kernel execution, even before the results are copied back.
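To narrow this down further, launch-time errors can be separated from execution-time errors. The sketch below assumes a trivial placeholder kernel (my real kernel is not shown here) and uses cudaDeviceSynchronize, the non-deprecated replacement for cudaThreadSynchronize:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel purely for illustration.
__global__ void myKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int n = 1024;
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    myKernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // Launch-time errors: invalid configuration, too many resources, ...
    cudaError_t launchErr = cudaGetLastError();

    // Execution-time errors: out-of-bounds accesses, watchdog timeout, ...
    cudaError_t execErr = cudaDeviceSynchronize();

    printf("launch: %s, exec: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(execErr));

    cudaFree(d_out);
    return 0;
}
```

If the error only shows up from the synchronize call, it happened while the kernel was running, which points toward an illegal memory access or a timeout rather than a misused cudaMalloc/cudaMemcpy/cudaFree.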
The question of what causes the performance drop and the failure still stands, though. The odd part is that the execution runs
correctly and then suddenly breaks, even with the same input parameters.