Intermittent Freeze/Hang in CUDA Operations

Hello,

Can anyone suggest why I might be getting intermittent freezes in my CUDA application? It runs the same functions (a custom kernel and Thrust calls) over and over on different input data, and every so often it hangs on one of the calls for a period of time. Strangely, the hang often lasts almost exactly one second. The whole UI freezes at the same time.

nvvp shows that it happens in various function calls (e.g. cudaDeviceSynchronize, cudaFree, cudaMemcpyAsync_ptsz).

Even if the input to the functions is identical, it still hangs randomly, i.e. not on a given input.

Thanks for your suggestions

Try using cuda-memcheck with the synccheck tool. It may uncover code locations where not all threads participate in a __syncthreads() operation, which can lead to unpredictably hanging kernels.

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#cuda-memcheck-tools

Hi, thanks for your suggestion.

I tried cuda-memcheck (the default memcheck tool as well as racecheck, synccheck and initcheck) and did not discover any issues with the code.

After coming back to this problem recently, I was able to create a reproducible test project, and it appears that one cause of the issue is that the CPU thread invoking CUDA is a real-time SCHED_FIFO thread with priority 85. Whenever I run without this, I see no hangs.

I need to do some more digging to see if I can isolate the specific issue and whether I can maintain a real-time thread priority level.
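
One way to narrow this down, as a rough sketch assuming Linux with pthreads: time each suspect call on the host and, when it stalls, log the calling thread's scheduling policy and priority. The helper names current_policy and timed_call and the threshold value are hypothetical and are not part of CUDA or of the original project.

```cpp
#include <pthread.h>
#include <sched.h>

#include <cassert>
#include <chrono>
#include <cstdio>
#include <utility>

// Hypothetical helper: returns the calling thread's scheduling policy
// (SCHED_OTHER, SCHED_FIFO, SCHED_RR, ...) or -1 on error, and writes the
// thread's priority to *priority_out when that pointer is non-null.
inline int current_policy(int* priority_out) {
    int policy = 0;
    sched_param param{};
    if (pthread_getschedparam(pthread_self(), &policy, &param) != 0) {
        return -1;
    }
    if (priority_out != nullptr) {
        *priority_out = param.sched_priority;
    }
    return policy;
}

// Hypothetical helper: runs fn(), returns the elapsed wall-clock time in
// milliseconds, and prints a warning to stderr if it exceeds threshold_ms.
template <typename Fn>
double timed_call(const char* label, double threshold_ms, Fn&& fn) {
    assert(threshold_ms >= 0.0);
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    if (ms > threshold_ms) {
        int priority = 0;
        const int policy = current_policy(&priority);
        std::fprintf(stderr,
                     "[stall] %s took %.1f ms (policy=%d, priority=%d)\n",
                     label, ms, policy, priority);
    }
    return ms;
}
```

In the CUDA application, the lambda passed to timed_call would wrap a single call such as cudaDeviceSynchronize() or cudaFree(ptr). Running the same test once as SCHED_FIFO and once as SCHED_OTHER should then show whether the stalls follow the thread priority.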

I have done some searching but can’t find any reference to this. Is it a known feature with CUDA that it does not like RT threads?

Thanks!

cudaFree calls are blocking, even in stream-based code (they wait for all other streams to complete). Even CPU code is blocked.
There are hidden cudaFree calls in Thrust, as in other libraries (e.g. cuSPARSE).

See https://cs.unc.edu/~anderson/papers/ecrts18c.pdf
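
If the blocking cudaFree calls are the problem, one common mitigation is to reuse buffers instead of freeing them on every iteration. The sketch below is a minimal host-side caching pool. The class name CachingPool and its interface are hypothetical, and the allocate/free functions are passed in as callables so that the same logic can be exercised with malloc/free. In a CUDA build they would wrap cudaMalloc and cudaFree. Making Thrust's internal temporaries use such a pool would additionally need a custom allocator, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <utility>

// Hypothetical caching pool: released blocks are kept and handed out again,
// so the underlying free function runs only in trim() or the destructor.
class CachingPool {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn = std::function<void(void*)>;

    CachingPool(AllocFn alloc, FreeFn free_fn)
        : alloc_(std::move(alloc)), free_(std::move(free_fn)) {}
    ~CachingPool() { trim(); }  // blocks still in use are left to their owners

    CachingPool(const CachingPool&) = delete;
    CachingPool& operator=(const CachingPool&) = delete;

    // Returns a block of at least `bytes`, reusing an idle one when possible.
    void* acquire(std::size_t bytes) {
        auto it = idle_.lower_bound(bytes);  // smallest idle block that fits
        if (it != idle_.end()) {
            void* p = it->second;
            in_use_[p] = it->first;
            idle_.erase(it);
            return p;
        }
        void* p = alloc_(bytes);
        assert(p != nullptr);
        in_use_[p] = bytes;
        return p;
    }

    // Marks a block as idle without calling the underlying free function.
    void release(void* p) {
        auto it = in_use_.find(p);
        assert(it != in_use_.end());
        idle_.emplace(it->second, p);
        in_use_.erase(it);
    }

    // Actually frees all idle blocks; call this outside time-critical code.
    void trim() {
        for (auto& entry : idle_) {
            free_(entry.second);
        }
        idle_.clear();
    }

private:
    AllocFn alloc_;
    FreeFn free_;
    std::multimap<std::size_t, void*> idle_;  // size -> idle block
    std::map<void*, std::size_t> in_use_;     // block -> size
};
```

With this pattern the processing loop calls acquire and release on each iteration, and the expensive free happens only when trim runs, for example at shutdown.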