We use several GPUs for a direct correlation. Each GPU is driven by one host thread, and each thread does the following (very short version):
cudaSetDevice( nDevice );
// do something useful ;-)
cudaDeviceReset();
This is done for many independent input data sets (image pairs). When a thread / device has finished its work, the thread continues with the next data set. Our problem is that, since we upgraded to the new 4.1 release (Toolkit / SDK / driver), the first call to “cudaMalloc” never returns. I have attached a simple example which shows the same behavior (using OpenMP for the threads).
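For readers without the attachment, here is a minimal sketch of the pattern described above. It is not the attached file. The job count, the buffer size, the helper name check, and the choice to call cudaDeviceReset after every job are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical helper (not from the attachment): report a failed runtime call.
static bool check(cudaError_t err, const char* what, int device)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "device %d: %s failed: %s\n",
                     device, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    const int    numJobs = 8;        // assumed number of image pairs
    const size_t bytes   = 1 << 20;  // assumed buffer size (1 MiB)

    // One host thread per GPU; a thread that finishes a job takes the next one.
    #pragma omp parallel for num_threads(deviceCount) schedule(dynamic, 1)
    for (int job = 0; job < numJobs; ++job) {
        const int device = omp_get_thread_num();  // thread i drives device i

        if (!check(cudaSetDevice(device), "cudaSetDevice", device))
            continue;

        void* buffer = nullptr;
        // This is the call that reportedly never returns under CUDA 4.1.
        if (check(cudaMalloc(&buffer, bytes), "cudaMalloc", device)) {
            // do something useful ;-)
            check(cudaFree(buffer), "cudaFree", device);
            std::printf("job %d finished on device %d\n", job, device);
        }

        // Destroys this device's context; assumed to happen once per job.
        check(cudaDeviceReset(), "cudaDeviceReset", device);
    }
    return 0;
}
```

Building this needs OpenMP enabled in the host compiler, for example nvcc -Xcompiler /openmp with MSVC or nvcc -Xcompiler -fopenmp with GCC. If the hang depends on the repeated cudaDeviceReset, moving that call out of the loop so it runs once per thread would be one way to narrow it down.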
Another topic with possibly the same problem can be found here: Simple program won’t exit if cudaMalloc is called.
Is it a bug, or are we doing something wrong? I have no idea, and I don’t want to change back to CUDA 4.0 because 4.1 is ~8% faster (with one thread).
OS: Win7 64bit
GPU0: GTX 480
GPU1: GTX 480
Driver: 286.19 (dev driver)
CPU: 2x Xeon X5550 @ 2.67 GHz
RAM: 24 GB
CUDA_ThreadTest.zip (1.37 KB)