Debugging Kernel Errors that cause Device Reset

I’m developing a fairly large application with dozens of compute threads and associated asynchronous CUDA streams. Within these, there are a couple dozen distinct kernels running fairly rapidly.

I seem to have a data race that shows up roughly once every 1,000 kernel executions on average (the interval is random, not fixed). This race causes a crash in my program.

I’ve attempted to debug this with several tools (cuda-gdb, nvvp, NVIDIA Nsight Compute), but the best I can determine is that one of my many kernels, in one of my many threads, triggers a device reset, which terminates all kernels. The only place where execution actually breaks is later, in the next sequence of calls: when a new thread is created, cuCtxGetDevice fails because the device has been reset. However, that point is completely removed from the code that actually caused the problem.
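For context, the only way I’ve found so far to narrow things down is brute-force per-launch checking along these lines (just a sketch using the standard CUDA runtime API; `CHECK_CUDA` and `myKernel` are placeholder names of my own, and synchronizing after every launch obviously serializes the streams, which can make the race disappear):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder macro: check the result of the preceding kernel launch.
// cudaGetLastError() catches launch-configuration errors; the extra
// cudaDeviceSynchronize() forces asynchronous execution errors to
// surface at this call site instead of at some unrelated later API call.
#define CHECK_CUDA(msg)                                                   \
    do {                                                                  \
        cudaError_t err_ = cudaGetLastError();                            \
        if (err_ == cudaSuccess) err_ = cudaDeviceSynchronize();          \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", (msg), cudaGetErrorString(err_)); \
            abort();                                                      \
        }                                                                 \
    } while (0)

// Usage (myKernel stands in for one of my real kernels):
//   myKernel<<<grid, block, 0, stream>>>(/* args */);
//   CHECK_CUDA("myKernel");
```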

So, my questions are:

  1. Is there anything like GCC’s ThreadSanitizer for determining data races on the device?
  2. Is there any tool that allows setting a breakpoint where a device reset is internally called by CUDA on an error condition?
  3. Is there anything else I’m missing in trying to debug execution like this?
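For what it’s worth, on question 2 the closest I’ve gotten is running under cuda-gdb with API failures set to stop (`./myapp` is a placeholder for my binary):

```text
$ cuda-gdb --args ./myapp
(cuda-gdb) set cuda api_failures stop
(cuda-gdb) run
```

This only stops where an API call returns an error, though, which in my case is still the downstream cuCtxGetDevice failure, not the reset itself.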

I’m using CUDA 11 on Ubuntu 20. There’s also significant use of OpenCV 4.3’s CUDA module.