4x GPU - Performance Problem: Using 4 devices == Using 3 devices

Hi,

I’m using two GTX 295 cards, so all four devices are active. My code is part of a render app, meaning the CUDA code runs inside a per-frame loop.
Using 1 device: ~33 fps
Using 2 devices: ~55 fps
Using 3 devices: ~73 fps
Using 4 devices: ~73 fps

I’m sure that I’m using all four devices: the Visual Profiler shows 4 contexts on 4 different devices, and the kernel CPU time for two of those devices is very high (equivalent to 73 fps). I suppose it is a thread synchronization problem, but I don’t know what it is. I used a master/slave model (1 master thread + 4 slave threads) built on the Windows API. I also tried GPUWorker, and the result was the same…

Does anybody know what the problem could be?

Sounds like a CPU bottleneck. Specifically, it appears that you’re trying to run 5 threads on a 4-core CPU. The problem is that Windows thread scheduling is woefully inadequate for this, leading to such horrors as a thread being switched out and not scheduled again for as long as 30 ms. That’s the time it takes to render a frame!

Thankfully, you can work around all this, at least to an extent.

First, make sure each of your device threads calls

cudaSetDeviceFlags(cudaDeviceScheduleYield);

That will get them to release the CPU when they hit a sync, thus allowing other threads a chance of running sooner rather than later.
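
For reference, here’s a minimal sketch of that per-thread setup (workerInit and dev are illustrative names, not from your code). On the runtimes of this era the flags are recorded per host thread, so each slave thread must set them before its context is created, i.e. before its first allocation, copy, or launch:

#include <cuda_runtime.h>

void workerInit(int dev)
{
    cudaSetDeviceFlags(cudaDeviceScheduleYield); // yield the CPU while spin-waiting on syncs
    cudaSetDevice(dev);                          // bind this host thread to its GPU
    // The context is created lazily by the first real runtime call and
    // picks up the yield scheduling policy set above.
}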

Second, whenever one of the threads reaches a point where it can’t continue (worker thread runs out of commands, master thread has filled up the command queue), call

SwitchToThread();

This will yield the thread, again giving the other threads a chance to run. In particular, if a worker thread has nothing to do, the master thread needs to get scheduled ASAP so it can hand out more work!
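
Sketched out, a worker loop might look like this (the Command type and the queue functions are placeholders for whatever your slave threads actually consume):

#include <windows.h>

struct Command { int id; };            // placeholder payload

bool TryPopCommand(Command* out);      // your real queue goes here
void Execute(const Command& cmd);      // the CUDA work for one command

DWORD WINAPI WorkerLoop(LPVOID)
{
    for (;;)
    {
        Command cmd;
        while (!TryPopCommand(&cmd))   // nothing queued for this GPU yet
            SwitchToThread();          // yield so the master can refill it
        Execute(cmd);
    }
}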

Another thing to do is some amount of pipelining. That is, once a worker thread has sent all its data to the GPU and started a kernel, it’s safe for the master thread to come in and start writing the data for the next frame. That way the worker thread may not have to wait for data when it finishes its current job, but can go straight on to the next.
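
Here’s a minimal double-buffering sketch of that idea using Win32 events; FrameData, FillCommands and UploadAndLaunch are made-up names, and there’d be one such pair of loops per worker:

#include <windows.h>

struct FrameData { /* per-frame commands and input data */ };

FrameData buffer[2];   // master fills one slot while the worker drains the other
HANDLE slotFree[2];    // auto-reset events, initially signaled
HANDLE slotReady[2];   // auto-reset events, initially unsignaled

void FillCommands(FrameData* out, int frame);  // master-side work
void UploadAndLaunch(const FrameData& in);     // cudaMemcpy + kernel launch

void MasterLoop()
{
    for (int frame = 0; ; ++frame)
    {
        int slot = frame % 2;
        WaitForSingleObject(slotFree[slot], INFINITE); // worker is done reading it
        FillCommands(&buffer[slot], frame);            // write the next frame's data
        SetEvent(slotReady[slot]);                     // hand it to the worker
    }
}

void WorkerLoop()
{
    for (int frame = 0; ; ++frame)
    {
        int slot = frame % 2;
        WaitForSingleObject(slotReady[slot], INFINITE);
        UploadAndLaunch(buffer[slot]); // copy to the GPU and start the kernel
        SetEvent(slotFree[slot]);      // slot may be overwritten while the kernel runs
    }
}

The key point is that slotFree is signaled right after the launch, not after a cudaThreadSynchronize(), so the master can prepare frame N+1 while frame N is still running on the GPU.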

One last thing you can do is juggle thread priorities. For instance, if a worker thread runs out of jobs, it can raise the priority of the master thread so that it’s more likely to be scheduled. Then the master thread can reset its priority once it fills all available queues, and even raise the priorities of the worker threads. This way, you have at least some control over getting the right threads running at the right time. The call for this is

SetThreadPriority(ThreadHandle, THREAD_PRIORITY_NORMAL);

I’d recommend not raising the priority of any thread higher than THREAD_PRIORITY_NORMAL, since this can slow down the whole system. Instead, use THREAD_PRIORITY_BELOW_NORMAL for any thread that runs out of things to do. Set it back to normal when another thread finds something for it to do - either the master thread now has an empty queue to fill, or a worker thread has just had its queue filled and is free to run again.
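
Put together, the priority dance might look something like this (hMaster and hWorker are this sketch’s thread handles, as returned by CreateThread):

#include <windows.h>

void OnWorkerQueueEmpty(HANDLE hWorker, HANDLE hMaster)
{
    // Starving worker steps back and boosts the master so it can refill.
    SetThreadPriority(hWorker, THREAD_PRIORITY_BELOW_NORMAL);
    SetThreadPriority(hMaster, THREAD_PRIORITY_NORMAL);
}

void OnWorkerQueueFilled(HANDLE hWorker)
{
    // This worker has work again; restore it to normal.
    SetThreadPriority(hWorker, THREAD_PRIORITY_NORMAL);
}

void OnAllQueuesFilled(HANDLE hMaster)
{
    // Master has nothing left to hand out; step back so the workers run.
    SetThreadPriority(hMaster, THREAD_PRIORITY_BELOW_NORMAL);
}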

Using all these techniques, I managed to get flam4 from effectively driving about 1 GPU at a time to using all 3 fairly efficiently. That was on a 2-core system, so we’re talking about 4 threads (1 master + 3 slaves) there.