cudaGLSetGLDevice bug? CUDA 4.0

Hi, all.

System spec:
Windows 7, 32- and 64-bit. GTX 460. Latest developer drivers.

I was using CUDA 3.2 in my project. Then I updated to CUDA 4.0 and ran into problems.

The problematic code scheme is as follows.

  1. Thread A starts thread B
  2. Thread B calls cudaSetDevice
  3. Thread B does some work using CUDA, while thread A does some other, non-CUDA-related work (a message loop). Everything is ok here.
  4. When thread B finishes, thread A calls cudaGLSetGLDevice to initialize CUDA with OpenGL interop.

The code above was working in CUDA 3.2.
But once I upgraded to CUDA 4.0, in step 4 cudaGLSetGLDevice started to return cudaErrorSetOnActiveProcess. Note that thread A made no CUDA calls before step 4. And although further CUDA calls in thread A seem to work, the OpenGL interop of course does not.
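
For reference, here is a minimal sketch of that failing sequence. This is a reconstruction, not my real project code: std::thread stands in for the Win32 worker thread, device 0 is assumed, error checking is stripped, and the OpenGL context setup in thread A is omitted.

  #include <cstdio>
  #include <thread>
  #include <cuda_runtime.h>
  #include <cuda_gl_interop.h>  // on Windows, include the GL headers before this

  int main()
  {
      // Steps 1-3: thread B selects the device and does some runtime work.
      std::thread b([] {
          cudaSetDevice(0);
          void* p = 0;
          cudaMalloc(&p, 1 << 20);  // first real runtime call creates the context
          cudaFree(p);
      });
      b.join();

      // Step 4: thread A (OpenGL context current) tries to bind CUDA to GL.
      // Under CUDA 4.0 this returns cudaErrorSetOnActiveProcess, because the
      // process-wide context was already created by thread B.
      cudaError_t err = cudaGLSetGLDevice(0);
      printf("cudaGLSetGLDevice: %s\n", cudaGetErrorString(err));
      return 0;
  }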

Ok, going further, I moved step 4 to the front.

  1. Thread A calls cudaGLSetGLDevice. Ok.
  2. Thread A starts thread B
  3. Thread B calls cudaSetDevice. Ok. (I also tried not calling cudaSetDevice at all; nothing changes.)
  4. Thread B does the CUDA work. The very first memory allocation returns cudaErrorDeviceUnavailable!
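
The same sketch with the interop call moved first reproduces this (again a reconstruction, same assumptions as above):

  #include <cstdio>
  #include <thread>
  #include <cuda_runtime.h>
  #include <cuda_gl_interop.h>

  int main()
  {
      // 1: thread A (OpenGL context current) binds CUDA to GL. Works.
      cudaGLSetGLDevice(0);

      // 2-4: thread B's very first allocation fails with
      // cudaErrorDeviceUnavailable under CUDA 4.0.
      std::thread b([] {
          cudaSetDevice(0);  // with or without this, the result is the same
          void* p = 0;
          cudaError_t e = cudaMalloc(&p, 1 << 20);
          printf("cudaMalloc in thread B: %s\n", cudaGetErrorString(e));
      });
      b.join();
      return 0;
  }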

Next I removed thread B entirely (just as a test). That leaves:

  1. Thread A calls cudaGLSetGLDevice
  2. Thread A does some CUDA work.
    Works ok.

Another scheme.

  1. Thread A starts thread B
  2. Thread B calls cudaSetDevice
  3. Thread B does some work using CUDA, while thread A does some other, non-CUDA-related work (a message loop). Everything is ok here.
  4. When thread B finishes, thread A calls cudaSetDevice.

No errors, but of course the OpenGL interop doesn't work.

Any suggestions? Is it a bug?

Thanks.

This is caused by the fact that, as of CUDA 4.0, all host threads in a process that access the same GPU via the CUDA Runtime share a context by default. cudaGLSetGLDevice() can only be called on a device before the context for that device is created, and, as you’ve seen, that creation may now have been done by some other host thread. To get back to the old behavior, where host threads have separate contexts on the GPU (sacrificing the benefits that sharing provides), you could have each host thread call cuCtxCreate() on the device prior to doing any CUDA Runtime API work.
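
In code, that workaround is roughly the following (a sketch only; error checking omitted, device 0 assumed):

  #include <cuda.h>          // Driver API
  #include <cuda_runtime.h>  // Runtime API

  // Call at the start of each host thread, before any CUDA Runtime work.
  void attachPrivateContext()
  {
      CUdevice dev;
      CUcontext ctx;
      cuInit(0);                 // safe to call from every thread
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev); // creates a per-thread context and makes it current
      // The Runtime will now piggyback on this thread's context instead of
      // the shared one.
  }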

Hope this helps,
Cliff

Having thread B call cudaGLSetGLDevice should solve this problem without the need for multiple contexts. (Multiple contexts on the same GPU within the same process are bad.)

Thanks for the advice. This partially works. First I made thread B (the one that doesn’t use OpenGL interop) call cuCtxCreate, while thread A simply calls cudaGLSetGLDevice. The system started to work just fine.

Then I thought, “ok, I need to destroy the context I created in thread B”. So at the end of thread B I called cuCtxDestroy. But if I do this, thread A stops working: either cudaGLSetGLDevice fails (if thread A initializes after thread B), or I get an “unspecified driver error” from the first non-device-management call (if thread B initializes after thread A).

You said that both threads must use cuCtxCreate, so I tried that too. But it didn’t resolve the problem.

So my current solution is:

Thread A simply calls cudaGLSetGLDevice.

Thread B calls cudaSetDevice + cuCtxCreate. Thread B doesn’t destroy this new context on exit.
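
In code, this arrangement looks roughly like the following (a sketch with error checking stripped; device 0 assumed):

  #include <cuda.h>
  #include <cuda_runtime.h>
  #include <cuda_gl_interop.h>

  // Thread A: called with A's OpenGL context current.
  void initThreadA()
  {
      cudaGLSetGLDevice(0);
  }

  // Thread B: hides behind its own driver-API context.
  void workThreadB()
  {
      CUdevice dev;
      CUcontext ctx;
      cudaSetDevice(0);
      cuInit(0);
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev);

      void* p = 0;
      cudaMalloc(&p, 1 << 20);  // runs in this thread's private context
      cudaFree(p);
      // Deliberately no cuCtxDestroy(ctx) here: destroying it breaks
      // thread A, so the context is leaked for the process lifetime.
  }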

So the question is: is it ok that thread B can’t clean up after itself? What are the possible side effects? All this mixing of the CUDA Runtime and Driver APIs seems like a hack…

Btw, simply putting cudaGLSetGLDevice in both threads didn’t help.

I think I had the same setup and found a solution. The CUDA 4.0 Programming Guide, section 3.2.1, says: “There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically, any function other than functions from the device and version management sections of the reference manual).”

Therefore what I think is happening is the following:

Neither cudaGLSetGLDevice nor cudaSetDevice initializes the runtime (both are device management functions). So the very first runtime call, and hence the one that initializes the runtime, is the cudaMalloc in step 4, and it comes from thread B. However, there is no OpenGL context current in thread B, so initializing the runtime with OpenGL interoperability fails.

Solution:

  1. Thread A calls cudaGLSetGLDevice. Ok.

  2. Thread A does a dummy runtime call, for example a cudaMalloc

  3. Thread A starts thread B

  4. Thread B does the CUDA work. Ok. (No need to call cudaGLSetGLDevice or cudaSetDevice.)
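
A sketch of the whole sequence (std::thread standing in for the real worker thread; device 0 and a current OpenGL context in thread A are assumed):

  #include <thread>
  #include <cuda_runtime.h>
  #include <cuda_gl_interop.h>

  int main()
  {
      // 1) Bind the device to the current OpenGL context.
      cudaGLSetGLDevice(0);

      // 2) Dummy runtime call: this is what actually initializes the
      //    runtime, so it happens in the thread that owns the GL context.
      void* dummy = 0;
      cudaMalloc(&dummy, 4);
      cudaFree(dummy);

      // 3-4) Thread B can now use the runtime directly; under CUDA 4.0 it
      //      shares the interop-enabled context created above.
      std::thread b([] {
          void* p = 0;
          cudaMalloc(&p, 1 << 20);
          cudaFree(p);
      });
      b.join();
      return 0;
  }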

If somebody from nVidia sees this: It would be great to mention this in the OpenGL Interop section of the documentation.