I am experimenting with my first multi-GPU code. My problem is relatively simple: there are thousands of computationally intensive, completely independent calculations (over 65K in the test case) that must be performed on a relatively small amount of input data (~10 MB in the test case). I get a great speedup on a single GPU: 30x on an FX 5600 and 100x on an FX 5800.
I am trying to split the computation among four Quadro FX 5600s. Since the input data is small, I simply load all of the input onto each card and split the output between them. I use Win32 threads to launch four separate host threads, each of which calls cudaSetDevice with its own device number (a simplified sketch of the overall structure follows the snippet below). I use cudaGetDevice and cudaGetLastError to verify that everything went according to plan:
cudaSetDevice(nDeviceToUse);              // bind this host thread to one GPU
cudaGetDevice(&nActualDevice);            // read back which device is actually active
cudaError_t err = cudaGetLastError();     // check that both calls succeeded
if (cudaSuccess != err)
{
    std::cout << "Cuda error: " << cudaGetErrorString(err) << std::endl;
}
std::cout << "Device #" << nDeviceToUse << " requested, Device #"
          << nActualDevice << " initialized." << std::endl;
Everything prints out as expected and appears to be going well. However, three of the threads complete in approximately one quarter of the single-GPU time, while the fourth takes twice as long as the others: it alone runs for half of the original run time. The answers are correct (at least on a quick visual inspection), and I have also verified that the work is being split evenly across the threads.
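In case the measurement itself matters, this is roughly how I time each thread (simplified; std::chrono here stands in for the timer I actually use):

#include <chrono>
#include <iostream>

void workerThread(int device, int firstItem, int numItems);  // defined in the sketch above

// Wrap each worker in a wall-clock timer so per-thread runtimes can be
// compared directly.
void timedWorker(int device, int firstItem, int numItems)
{
    auto start = std::chrono::steady_clock::now();
    workerThread(device, firstItem, numItems);
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << "Device " << device << " took " << seconds << " s" << std::endl;
}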
Is there something I might be doing wrong?