Advice on finding elusive CUDA bugs (e.g., segfault on cudaGetDeviceCount)?

I’m somewhat lost on how to approach this bug. I’ve been going through my diffs and see no obvious culprit, but it is clearly caused by the new structure of the code I refactored. The facts:

  1. It’s a CUDA / OpenGL interop application. OpenGL is definitely being initialized first. In a previous version, a refactor once moved the CUDA initialization ahead of the OpenGL initialization, but that caused a segfault in cudaMalloc instead.

  2. I cannot find any cudaMalloc or other CUDA API calls that happen before cudaGetDeviceCount.

  3. I tried blindly calling cudaDeviceReset first, but that didn’t help, and I suspect it is a bad idea in general?

I’m just asking for high-level advice; I’ve been hunting this for quite some time and am rather perplexed. What makes it challenging is that when I step through the program with GDB, no segfault occurs and everything works.

Although this is a multithreaded application, that shouldn’t be relevant: the threads have not been created yet (GDB confirms this as well). They are created after the init methods that link CUDA and OpenGL.

Basically, I don’t know how to track this down and would appreciate any high-level advice. Thanks for any consideration x0

Does it make a difference whether you make a release or debug build?

Presumably this is on Linux? Which distribution and which kernel? Which NVIDIA driver version?

Have you tried running your application under Valgrind to check for possible heap corruption?

Linux (kernel 4.8.13) and OS X (10.12). I’m confident it is the application, not the installation: the master branch works just fine. It happens with both Release and Debug builds (Debug has all optimizations disabled). My GDB run-throughs used Debug builds, yet under the debugger the segfault somehow does not occur.

I can’t believe I forgot about Valgrind; I’ll have to comb through its output! I had been looking for potential stack corruption, but AddressSanitizer (-fsanitize=address) doesn’t seem to be usable when compiling with CUDA. I found a blog post that seems to explain why it never will be, but the details were a little over my head.

Hmmm. I guess the OpenGL context wasn’t fully initialized (yet again)? The way this framework requires things to be created means that the window actually used for display cannot be created at program startup. I ended up creating an invisible window that is never used, and I no longer get the segfault.

My de facto test case for CUDA / GL interop is to initialize OpenGL, then:

  1. cudaGLSetGLDevice
  2. cudaSetDevice
  3. Perform a dummy cudaMalloc and cudaFree

If you are reading this and you get a segfault on step (3), the OpenGL context is not fully initialized. Note that most OpenGL windowing libraries have an init call such as glfwInit(); calling it is not sufficient. A full-blown window must be created and its context made current.

Hopefully someone reading this one day will save a lot of time thanks to my mistakes. It would certainly be swell if the driver could detect this and report an error rather than segfault, but I can’t imagine what implementing that would even look like.