'Invalid argument' error from cudaMemset on a 2-GPU configuration in a multithreaded application

Hey everyone,

I have a multithreaded application in which thread 1 allocates memory with a simple cudaMalloc. Later in the application, thread 2 tries to memset this memory.

In my understanding this has been possible since CUDA 4.0 (CUDA contexts can be shared between host threads). I'm running Ubuntu 12.04, CUDA 5.0 and the latest graphics driver (310.19). I have two graphics cards: a GeForce 680 driving the display and a GeForce 580 running my CUDA kernels (my kernels execute faster on the 580). In that exact configuration, the memset fails with an 'invalid argument' error. Of course, I checked the size and the address, and both seem fine.
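The structure of the failing code looks roughly like this (a minimal sketch, not my actual application; the names are made up and error checking is stripped):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

// Illustrative names; the real application is much larger.
float* d_buf = nullptr;
const size_t kBytes = 1024 * sizeof(float);

void allocThread() {
    cudaSetDevice(1);            // select the GeForce 580 (device 1)
    cudaMalloc(&d_buf, kBytes);  // allocation lives on device 1
}

void memsetThread() {
    // Later, from a different host thread:
    cudaError_t err = cudaMemset(d_buf, 0, kBytes);
    std::printf("cudaMemset: %s\n", cudaGetErrorString(err));
    // reports "invalid argument" on the 2-GPU configuration
}

int main() {
    std::thread t1(allocThread);
    t1.join();
    std::thread t2(memsetThread);
    t2.join();
    return 0;
}
```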

If I execute the CUDA computations on the GeForce 680 as well (I set an environment variable so that both graphics and CUDA use the same card), everything works fine. No error.

The 'invalid argument' error is the same type of error we used to get when sharing contexts between host threads was not yet possible, before CUDA 4.0 (http://stackoverflow.com/questions/5616538/cudamemcpy-invalid-argument).

Could it be a bug in CUDA 5.0? I don't have a small executable that reproduces the bug yet, but I will try to make one. Does anyone have an idea?

Thanks for your help.

I found the problem. In case someone is interested: cudaSetDevice applies only to the calling host thread. I was selecting the second GPU (device 1) and allocating my memory on it. But my main loop ran on another thread, and because I wasn't calling cudaSetDevice again there, CUDA used the default device (device 0), which of course triggered an error when device 0 tried to access an address allocated on device 1.
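In code, the fix is simply to re-select the device in every host thread that issues CUDA calls (a minimal self-contained sketch with made-up names):

```cpp
#include <cuda_runtime.h>
#include <thread>

float* d_buf = nullptr;
const size_t kBytes = 1024 * sizeof(float);

void allocThread() {
    cudaSetDevice(1);            // device selection is per host thread
    cudaMalloc(&d_buf, kBytes);
}

void memsetThread() {
    cudaSetDevice(1);              // must be repeated in THIS thread too,
                                   // otherwise it defaults to device 0
    cudaMemset(d_buf, 0, kBytes);  // now succeeds
}

int main() {
    std::thread t1(allocThread);
    t1.join();
    std::thread t2(memsetThread);
    t2.join();
    return 0;
}
```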

I could see the use of a cudaApplicationSetDevice call that would set the CUDA device for the whole application, rather than per thread as currently. I am using 2 GPUs, one dedicated to rendering and one to CUDA kernels, so it would be nice to be able to select the device once for the entire application. Since I have to call cudaSetDevice from my main loop, the main loop now depends on CUDA even though only a very small subset of it actually uses CUDA, which is not ideal from a design point of view.
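One way to approximate that today is to hide the device choice behind a tiny helper, so that only CUDA-facing code depends on CUDA (just a sketch; appSetDevice and ensureDevice are hypothetical helpers, not part of the CUDA API):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helpers: record the chosen device once, and have every
// CUDA-using entry point call ensureDevice() before issuing CUDA calls.
static int g_appDevice = 0;

void appSetDevice(int dev) { g_appDevice = dev; }

void ensureDevice() {
    // Safe to call repeatedly from any thread; effectively a no-op
    // if the device is already current in the calling thread.
    cudaSetDevice(g_appDevice);
}

// Example: a CUDA-facing function called from the CUDA-agnostic main loop.
void clearBuffer(float* d_buf, size_t bytes) {
    ensureDevice();
    cudaMemset(d_buf, 0, bytes);
}
```

That way the main loop calls clearBuffer (or whatever the real entry points are) without ever touching cudaSetDevice itself.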

Anyway, hope it will help someone in the same situation one day.

Thanks, this just helped me too.

Thanks a lot!!! That fixed my bug.