Sounds like a CPU bottleneck. Specifically, it appears that you’re trying to run 5 threads on a 4-core CPU. The problem is that Windows thread scheduling handles oversubscription like this poorly, leading to such horrors as a thread being switched out and not scheduled back in for as long as 30 ms. That’s the time it takes to render an entire frame!
Thankfully, you can work around all this, at least to an extent.
First, make sure the thread driving each of your devices calls
cudaSetDeviceFlags(cudaDeviceScheduleYield);
before its first real CUDA call (the flags have to be set before that device’s context is created). This makes the runtime yield the CPU when it blocks in a sync, instead of spin-waiting, giving the other threads a chance of running sooner rather than later.
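As a sketch (the function name and device index are hypothetical, not flam4’s actual code), the per-device thread setup might look like:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-worker-thread initialization. cudaSetDeviceFlags only
// takes effect if it runs before the CUDA context for this device exists,
// so it has to be one of the first runtime calls the thread makes.
void initWorker(int device)
{
    cudaSetDevice(device);                        // pick this thread's GPU
    cudaSetDeviceFlags(cudaDeviceScheduleYield);  // yield CPU in sync waits
    // ... allocate buffers, start processing commands, etc. ...
}
```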
Second, whenever one of the threads reaches a point where it can’t continue (worker thread runs out of commands, master thread has filled up the command queue), call
SwitchToThread();
This will yield the thread, again giving the other threads a chance to run. In particular, if a worker thread has nothing to do, then the master thread needs to get scheduled ASAP so it can hand out more work!
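To make the loop structure concrete, here’s a minimal sketch of a worker draining a command queue. It uses std::this_thread::yield() as a portable stand-in for SwitchToThread(); all the names are illustrative, not flam4’s actual code.

```cpp
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>

// Toy command queue shared between a master and one worker.
struct CommandQueue {
    std::mutex m;
    std::queue<int> q;            // "commands" are just ints here
    std::atomic<bool> done{false};
};

// Worker: drain the queue, yielding whenever it runs dry so the master
// gets scheduled and can refill it.
void worker(CommandQueue& cq, long long& sum)
{
    for (;;) {
        int cmd = 0;
        bool have = false;
        {
            std::lock_guard<std::mutex> lock(cq.m);
            if (!cq.q.empty()) {
                cmd = cq.q.front();
                cq.q.pop();
                have = true;
            } else if (cq.done) {
                return;               // queue drained and master finished
            }
        }
        if (have)
            sum += cmd;               // stand-in for real GPU work
        else
            std::this_thread::yield(); // out of commands: let the master run
    }
}

// Master side: spawn one worker, feed it 1..n, and return the sum it computed.
long long runDemo(int n)
{
    CommandQueue cq;
    long long sum = 0;
    std::thread t(worker, std::ref(cq), std::ref(sum));
    for (int i = 1; i <= n; ++i) {
        std::lock_guard<std::mutex> lock(cq.m);
        cq.q.push(i);
    }
    cq.done = true;                   // set only after all pushes are visible
    t.join();
    return sum;
}
```

The same shape applies on Windows: wherever the sketch yields, the real code calls SwitchToThread().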
Another thing to do is some amount of pipelining. That is, once a worker thread has sent all its data to the GPU and launched a kernel, it’s safe for the master thread to come in and start writing the data for the next frame. That way, the worker thread might not have to wait for data when it finishes its current job, but can go straight on to the next one.
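One way to structure that is a double-buffered loop. Everything named here (the kernel, fillFrameData, buffer sizes) is hypothetical, not flam4’s actual code; the point is just that the kernel launch returns immediately, so the CPU can fill the next frame’s staging buffer while the GPU renders the current one:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel and CPU-side frame generator -- stand-ins only.
__global__ void renderKernel(float* data) { /* ... render ... */ }
void fillFrameData(float* buf, int frame);

void renderLoop(int nFrames, size_t bytes, cudaStream_t stream,
                float* hostBuf[2], float* devBuf)
{
    dim3 grid(64), block(256);
    fillFrameData(hostBuf[0], 0);                    // prime the pipeline
    for (int frame = 0; frame < nFrames; ++frame) {
        int cur = frame & 1;
        cudaMemcpyAsync(devBuf, hostBuf[cur], bytes,
                        cudaMemcpyHostToDevice, stream);
        renderKernel<<<grid, block, 0, stream>>>(devBuf);
        // GPU is busy with this frame; overlap the CPU work for the next.
        if (frame + 1 < nFrames)
            fillFrameData(hostBuf[cur ^ 1], frame + 1);
        cudaStreamSynchronize(stream);               // safe to reuse hostBuf[cur]
    }
}
```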
One last thing you can do is juggle thread priorities. For instance, if a worker thread runs out of jobs, it can raise the priority of the master thread, so that it’s more likely to be scheduled. Then, the master thread can reset its priority once it fills all available queues, and even raise the priorities of the worker threads. This way, you can have at least some control in getting the right threads at the right time. The call for this is
SetThreadPriority(ThreadHandle, THREAD_PRIORITY_NORMAL);
I’d recommend not raising the priority of any thread higher than THREAD_PRIORITY_NORMAL, since that can slow down the whole system. Instead, use THREAD_PRIORITY_BELOW_NORMAL for any thread that runs out of things to do, and set it back to normal when another thread finds something for it to do - either the master thread now has an empty queue to fill, or a worker thread has just had its queue filled and is free to run again.
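A sketch of that protocol (the thread handles and the points where these get called are hypothetical - in practice the handles come from CreateThread or _beginthreadex):

```cpp
#include <windows.h>

extern HANDLE masterThread;
extern HANDLE workerThread[3];

// Worker w's queue ran dry: drop its priority and make sure the master
// is back at normal so it gets scheduled and can refill the queue.
void onWorkerIdle(int w)
{
    SetThreadPriority(workerThread[w], THREAD_PRIORITY_BELOW_NORMAL);
    SetThreadPriority(masterThread, THREAD_PRIORITY_NORMAL);
}

// Worker w's queue is full again: restore it, and have the master get
// out of the way until another queue empties.
void onQueueFilled(int w)
{
    SetThreadPriority(workerThread[w], THREAD_PRIORITY_NORMAL);
    SetThreadPriority(masterThread, THREAD_PRIORITY_BELOW_NORMAL);
}
```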
Using all these techniques, I managed to get flam4 from keeping only about 1 GPU busy at a time to using all 3 fairly efficiently. This is on a 2-core system, so we’re talking about 4 threads (1 master + 3 workers) here.