CUDA 4.1 Thread Problem

Hi.

We use several GPUs for a direct correlation, with one host thread per GPU. In very condensed form, each thread does the following:

cudaSetDevice( nDevice );

// do something useful ;-)

cudaDeviceReset();

This is done for many independent inputs (image pairs). When a thread / device has finished its work, the thread continues with the next input. Our problem is that, since we switched to the new 4.1 release (Toolkit / SDK / driver), the first call to “cudaMalloc” never returns. I have attached a simple example that shows the same behavior (using OpenMP for the threads).
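For readers who cannot open the attachment, the pattern described above can be sketched roughly as follows. This is not the attached CUDA_ThreadTest code, only a minimal illustration under stated assumptions: the item count, the buffer size, the `check` helper, and the choice to call cudaDeviceReset after every item (rather than once per thread) are all hypothetical.

```cuda
// Hypothetical sketch of the described pattern: one OpenMP host thread per GPU,
// each thread looping over work items with cudaSetDevice / cudaMalloc / cudaDeviceReset.
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

// Helper (not from the original post): report a failing runtime call.
// Returns true when the call succeeded.
static bool check(cudaError_t err, const char* what, int device)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "device %d: %s failed: %s\n",
                     device, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    int deviceCount = 0;
    if (!check(cudaGetDeviceCount(&deviceCount), "cudaGetDeviceCount", -1) ||
        deviceCount == 0) {
        return 1;
    }

    const int         kItemsPerDevice = 4;        // assumed number of image pairs per GPU
    const std::size_t kBytes          = 1 << 20;  // assumed buffer size (1 MiB)

    // Request one host thread per device; each thread uses its thread number
    // as the device index.
    #pragma omp parallel num_threads(deviceCount)
    {
        const int nDevice = omp_get_thread_num();

        for (int item = 0; item < kItemsPerDevice; ++item) {
            if (!check(cudaSetDevice(nDevice), "cudaSetDevice", nDevice))
                break;

            void* buffer = NULL;
            std::printf("device %d, item %d: calling cudaMalloc\n", nDevice, item);
            std::fflush(stdout);  // make the message visible even if the call hangs

            if (!check(cudaMalloc(&buffer, kBytes), "cudaMalloc", nDevice))
                break;
            std::printf("device %d, item %d: cudaMalloc returned\n", nDevice, item);

            // The real correlation work would run here.

            check(cudaFree(buffer), "cudaFree", nDevice);
            check(cudaDeviceReset(), "cudaDeviceReset", nDevice);
        }
    }
    return 0;
}
```

With nvcc, OpenMP has to be enabled in the host compiler, for example `-Xcompiler /openmp` with MSVC or `-Xcompiler -fopenmp` with GCC. If the problem reproduces with this sketch, the "calling cudaMalloc" line would be the last output for the affected thread.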

Another topic with possibly the same problem can be found here: Simple program won’t exit if cudaMalloc is called.

Is this a bug, or are we doing something wrong? I have no idea, and I don’t want to go back to CUDA 4.0 because 4.1 is ~8% faster (with one thread).

Edit:

System configuration:

OS: Win7 64bit

GPU0: GTX 480

GPU1: GTX 480

Driver: 286.19 (dev driver)

CPU: 2x Xeon X5550 @ 2.67 GHz

RAM: 24 GB

Best regards

Thomas R.
CUDA_ThreadTest.zip (1.37 KB)

Please file a bug against the CUDA driver, attaching your repro code. Thank you for your help.

If you have not filed a bug before: when you log in to the registered developer website (partners.nvidia.com), there is a link to the bug reporting form in the menu on the left side of the screen. For some reason the CUDA version selector in that form only goes up to 4.0, so I would suggest prefixing the synopsis with "CUDA 4.1: " to make sure it is handled as a CUDA 4.1 bug.