4x GPU - Performance Problem: Using 4 devices == Using 3 devices

Hi,

I’m using two GTX 295 cards, so all four devices are active. My code is part of a render app, meaning the CUDA code runs inside a per-frame loop.
Using 1 device: ~33 fps
Using 2 devices: ~55 fps
Using 3 devices: ~73 fps
Using 4 devices: ~73 fps

I’m sure that I’m using all four devices: the Visual Profiler shows 4 contexts on 4 different devices, and the kernel CPU time for two of those devices is very high (equivalent to 73 fps). I suppose it is a thread synchronization problem, but I don’t know what it is. I used a master/slave model (1 master thread + 4 slave threads) built on the Windows API. I also tried GPUWorker, and the result was the same…

Does anybody know what the problem could be?

Sounds like a CPU bottleneck. Specifically, it appears that you’re trying to run 5 threads on a 4-core CPU. The problem is that Windows thread scheduling is woefully inadequate for this, leading to such horrors as a thread being switched out and not scheduled again for as long as 30 ms. That’s the time it takes to render a frame!

Thankfully, you can work around all this, at least to an extent.

First, make sure each of your device threads calls

cudaSetDeviceFlags(cudaDeviceScheduleYield);

That will get them to release the CPU when they hit a sync, thus allowing other threads a chance of running sooner rather than later.
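
For reference, here’s a minimal sketch of that per-thread setup (workerInit and dev are illustrative names, not from your code). On the runtimes of this era the flags are recorded per host thread, so each slave thread must set them before its context is created, i.e. before its first allocation, copy, or launch:

#include <cuda_runtime.h>

void workerInit(int dev)
{
    cudaSetDeviceFlags(cudaDeviceScheduleYield); // yield the CPU while spin-waiting on syncs
    cudaSetDevice(dev);                          // bind this host thread to its GPU
    // The context is created lazily by the first real runtime call and
    // picks up the yield scheduling policy set above.
}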

Second, whenever one of the threads reaches a point where it can’t continue (worker thread runs out of commands, master thread has filled up the command queue), call

SwitchToThread();

This will yield the thread, again giving the other threads a chance to run. In particular, if a worker thread has nothing to do, the master thread needs to get scheduled ASAP so it can hand out more work!
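
Sketched out, a worker loop might look like this (the Command type and the queue functions are placeholders for whatever your slave threads actually consume):

#include <windows.h>

struct Command { int id; };            // placeholder payload

bool TryPopCommand(Command* out);      // your real queue goes here
void Execute(const Command& cmd);      // the CUDA work for one command

DWORD WINAPI WorkerLoop(LPVOID)
{
    for (;;)
    {
        Command cmd;
        while (!TryPopCommand(&cmd))   // nothing queued for this GPU yet
            SwitchToThread();          // yield so the master can refill it
        Execute(cmd);
    }
}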

Another thing to do is some amount of pipelining. That is, once a worker thread has sent all its data to the GPU and started a kernel, it’s safe for the master thread to come in and start writing the data for the next frame. That way the worker thread may not have to wait for data when it finishes its current job, but can go straight on to the next.
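
Here’s a minimal double-buffering sketch of that idea using Win32 events; FrameData, FillCommands and UploadAndLaunch are made-up names, and there’d be one such pair of loops per worker:

#include <windows.h>

struct FrameData { /* per-frame commands and input data */ };

FrameData buffer[2];   // master fills one slot while the worker drains the other
HANDLE slotFree[2];    // auto-reset events, initially signaled
HANDLE slotReady[2];   // auto-reset events, initially unsignaled

void FillCommands(FrameData* out, int frame);  // master-side work
void UploadAndLaunch(const FrameData& in);     // cudaMemcpy + kernel launch

void MasterLoop()
{
    for (int frame = 0; ; ++frame)
    {
        int slot = frame % 2;
        WaitForSingleObject(slotFree[slot], INFINITE); // worker is done reading it
        FillCommands(&buffer[slot], frame);            // write the next frame's data
        SetEvent(slotReady[slot]);                     // hand it to the worker
    }
}

void WorkerLoop()
{
    for (int frame = 0; ; ++frame)
    {
        int slot = frame % 2;
        WaitForSingleObject(slotReady[slot], INFINITE);
        UploadAndLaunch(buffer[slot]); // copy to the GPU and start the kernel
        SetEvent(slotFree[slot]);      // slot may be overwritten while the kernel runs
    }
}

The key point is that slotFree is signaled right after the launch, not after a cudaThreadSynchronize(), so the master can prepare frame N+1 while frame N is still running on the GPU.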

One last thing you can do is juggle thread priorities. For instance, if a worker thread runs out of jobs, it can raise the priority of the master thread so that it’s more likely to be scheduled. Then the master thread can reset its priority once it fills all available queues, and even raise the priorities of the worker threads. This way, you have at least some control over getting the right threads running at the right time. The call for this is

SetThreadPriority(ThreadHandle, THREAD_PRIORITY_NORMAL);

I’d recommend not raising the priority of any thread higher than THREAD_PRIORITY_NORMAL, since this can slow down the whole system. Instead, use THREAD_PRIORITY_BELOW_NORMAL for any thread that runs out of things to do. Set it back to normal when another thread finds something for it to do - either the master thread now has an empty queue to fill, or a worker thread has just had its queue filled and is free to run again.
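
Put together, the priority dance might look something like this (hMaster and hWorker are this sketch’s thread handles, as returned by CreateThread):

#include <windows.h>

void OnWorkerQueueEmpty(HANDLE hWorker, HANDLE hMaster)
{
    // Starving worker steps back and boosts the master so it can refill.
    SetThreadPriority(hWorker, THREAD_PRIORITY_BELOW_NORMAL);
    SetThreadPriority(hMaster, THREAD_PRIORITY_NORMAL);
}

void OnWorkerQueueFilled(HANDLE hWorker)
{
    // This worker has work again; restore it to normal.
    SetThreadPriority(hWorker, THREAD_PRIORITY_NORMAL);
}

void OnAllQueuesFilled(HANDLE hMaster)
{
    // Master has nothing left to hand out; step back so the workers run.
    SetThreadPriority(hMaster, THREAD_PRIORITY_BELOW_NORMAL);
}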

Using all these techniques, I managed to get flam4 from effectively driving about 1 GPU at a time to using all 3 fairly efficiently. That was on a 2-core system, so we’re talking about 4 threads (1 master + 3 slaves) there.