I’m developing a fairly large application with dozens of compute threads, each with its own asynchronous CUDA stream. Across these, a couple dozen distinct kernels execute at a fairly high rate.
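To give a sense of the structure, here is a minimal, self-contained sketch of the per-thread pattern (the kernel, sizes, and loop counts are placeholders, not my real code):

```cpp
// Simplified sketch: each compute thread owns its own non-blocking stream
// and launches kernels into it rapidly. dummyKernel is illustrative only.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void computeThreadBody() {
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Kernels launch asynchronously and fast; a fault in any one of them
    // can surface long after the launch returned.
    for (int iter = 0; iter < 10000; ++iter)
        dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    cudaStreamSynchronize(stream);
    cudaFree(d_data);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 24; ++t)            // dozens of compute threads
        threads.emplace_back(computeThreadBody);
    for (auto &t : threads) t.join();
}
```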
I seem to have a data race that surfaces roughly once per thousand kernel executions on average (the interval is random, not fixed), and it crashes my program.
I’ve attempted to debug this with several tools (cuda-gdb, nvvp, NVIDIA Nsight Compute), but the best I can determine is that one of my many kernels, in one of my many threads, triggers a device reset, which terminates all running kernels. The only place execution actually breaks is later, when a new thread is created and cuCtxGetDevice fails because the device has already been reset. That failure point is completely removed from the code that actually caused the problem.
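To make that concrete, this is roughly where the crash surfaces (a simplified sketch; the logging and function name are just for illustration):

```cpp
// The first driver-API call made by a newly created thread fails because
// the device was already reset by an earlier, unrelated kernel fault.
#include <cuda.h>
#include <cstdio>

void onThreadStart() {
    CUdevice dev;
    CUresult res = cuCtxGetDevice(&dev);   // this is the call that fails
    if (res != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(res, &msg);
        fprintf(stderr, "cuCtxGetDevice failed: %s\n", msg ? msg : "unknown");
        // ...but by this point we are far from the kernel that faulted.
    }
}

int main() {
    cuInit(0);        // in the real app the context already exists
    onThreadStart();
}
```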
So, my questions are:
- Is there anything like GCC’s ThreadSanitizer for detecting data races in CUDA code?
- Is there any tool that allows setting a breakpoint at the point where CUDA internally performs a device reset on an error condition?
- Is there anything else I’m missing in trying to debug a failure like this? (For context, a sketch of the error checking I already do around launches is below.)
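On that last question: this is roughly the kind of per-launch error checking I have in place (a simplified sketch; `CUDA_CHECK` and `someKernel` are stand-ins, not my exact code). Since kernel faults are reported asynchronously, a check right after the launch only catches launch-configuration errors, so nothing fires near the offending kernel.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Stand-in for my actual error-checking macro.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            std::abort();                                             \
        }                                                             \
    } while (0)

__global__ void someKernel(int *x) { *x = 42; }   // illustrative kernel

int main() {
    int *d_x;
    CUDA_CHECK(cudaMalloc(&d_x, sizeof(int)));

    someKernel<<<1, 1>>>(d_x);
    CUDA_CHECK(cudaGetLastError());   // catches launch errors only
    // Synchronizing here would surface the asynchronous fault, but doing
    // that after every launch would serialize all of my streams.
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d_x));
}
```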
I’m using CUDA 11 on Ubuntu 20. There’s also significant use of OpenCV 4.3’s CUDA module.