GPU Affinity Performance: One Man’s Battle to Get Two Operating Systems to Run Three Cards

I’ve spent the last few days trying to convince myself that my code is making full use of the hardware available in my box. I’ve got an X58/Core i7 940 system (8 logical cores) with a Quadro FX 550 as the main adapter (monitor attached) and two Quadro FX 5800s in the second (gpu1) and third (gpu2) slots (no SLI, no monitors). I’m always targeting gpu1 and/or gpu2 - gpu0 is just for the display.

I’ve tested using 191.78, 191.66, and 186.18 drivers on both Windows XP 64 and Windows 7 64. My app is usually a 32 bit build, except for observation 5, where a 64 bit build fixed a problem.

My application has two image rendering threads that render completely independently (no shared context, etc.). Each thread renders into its own offscreen pbuffer context, which is created from a WGL_NV_gpu_affinity context (itself created from a temp window). I think I could drop the pbuffer context and render directly into the affinity context, but it’s left in so that I can also test with just a pbuffer context created from the temp window. I can also run my application with only one thread, so that I can compare two processes with one thread each against one process with two threads.
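For reference, the per-GPU context setup looks roughly like the sketch below. This is not my actual code: it assumes the WGL_NV_gpu_affinity entry points (wglEnumGpusNV, wglCreateAffinityDCNV) have already been loaded through wglGetProcAddress on the temp window’s context (e.g. via GLEW’s wglew.h), and it skips error handling and the pbuffer step.

```cpp
#include <windows.h>
#include <GL/glew.h>
#include <GL/wglew.h>

// Minimal sketch: create a GL context tied to one specific GPU with
// WGL_NV_gpu_affinity. gpuIndex follows the enumeration order
// (0 = display GPU, 1 = gpu1, 2 = gpu2 in my box).
HGLRC CreateContextOnGpu(unsigned int gpuIndex)
{
    // Pick the requested GPU out of the enumeration.
    HGPUNV gpu = 0;
    if (!wglEnumGpusNV(gpuIndex, &gpu))
        return 0;                                  // no such GPU

    // The affinity DC is restricted to the GPUs in this NULL-terminated list.
    HGPUNV gpuList[2] = { gpu, 0 };
    HDC affinityDC = wglCreateAffinityDCNV(gpuList);
    if (!affinityDC)
        return 0;

    // The affinity DC still needs a pixel format before a context can be made.
    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd) };
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_SUPPORT_OPENGL;
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;
    SetPixelFormat(affinityDC, ChoosePixelFormat(affinityDC, &pfd), &pfd);

    // Everything rendered on this context stays on the selected GPU. A pbuffer
    // created from affinityDC (wglCreatePbufferARB) inherits the same affinity.
    return wglCreateContext(affinityDC);
}
```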

I’ve got two test scenes - one has 8M triangles in the view and one has 50K triangles in the view. I’ve tested viewports ranging in size from 2x2 pixels up to 2500x2500 pixels. I varied the scene complexity and viewport size to try to stress fill rate, etc., but none of that really matters for the observations below.

Observations:

  1. On Windows XP 64, GPU affinity works as expected. If I render only one image, the frame rate is about the same with or without affinity. If I render two images without affinity, the frame rate is more than halved. If I render two images with affinity to different cards, the frame rate returns to normal (this is true whether I use one process with two threads or two processes with one thread each). If I render two images with affinity to one card, the frame rate is more than halved.

  2. On Windows XP 64, if I try to start two processes at the same time, the second one fails to create the affinity context (the call to wglEnumGpusNV fails). If I add a short delay before starting the second process, everything works fine. I guess there’s a mutex in the driver that makes the call from the second process fail, rather than block.
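The workaround I use is just a short delay before launching the second process, but the same idea can live inside the app: retry the enumeration a few times before giving up. A rough sketch (the retry count and delay are arbitrary, and it again assumes wglEnumGpusNV has already been loaded):

```cpp
#include <windows.h>
#include <GL/glew.h>
#include <GL/wglew.h>

// If another process is in the middle of creating its affinity context,
// wglEnumGpusNV can fail; back off briefly and try again.
bool EnumGpuWithRetry(unsigned int gpuIndex, HGPUNV* gpu)
{
    for (int attempt = 0; attempt < 10; ++attempt)
    {
        if (wglEnumGpusNV(gpuIndex, gpu))
            return true;
        Sleep(100);   // give the other process time to finish its setup
    }
    return false;
}
```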

  3. On Windows XP 64, only one CPU core is used, whether I run one process with two threads or two processes with one thread each. This is the case regardless of triangle count or viewport size.

Actually, more than one core is used, but the CPU usage for the process never goes over ~13% (with two processes, one is usually 10-12% and the other is 1-3%). If I set the CPU affinity to one core, I actually get a few extra FPS. I also disabled hyperthreading in the BIOS and tested with four logical cores, and CPU usage stayed around 25% (one core).

At first, when I was only testing the multithreaded app, I thought there was some problem with my scene graph that was causing the threads to serialize (even though there are two instances of the scene graph, it’s got a few static variables). So I made a test app that just loads the scene into a display list and calls the list in a tight loop - no change in CPU usage behavior. I then went on to test two instances of the app with a single thread each and saw that the CPU usage was still limited to one core - that really blew my mind. The only way I could get the app/apps to use two cores was to add a glFinish call to the end of my loop (see the sketch below).

All of this makes me think that the driver efficiently coordinates waiting on the hardware across threads and processes. That’s really cool, except that I spent the better part of a day trying to figure out why my multithreaded app was only using one core’s worth of CPU.
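For context, the stripped-down test loop looks roughly like this (the names are placeholders; the DC, context, and display list come from the setup sketched earlier):

```cpp
#include <windows.h>
#include <GL/gl.h>

// Stripped-down per-thread test loop. dc/rc are the affinity DC and context,
// sceneList is the display list holding the 8M- or 50K-triangle test scene.
void RenderLoop(HDC dc, HGLRC rc, GLuint sceneList)
{
    wglMakeCurrent(dc, rc);
    for (;;)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glCallList(sceneList);

        // Without this call, CPU usage never goes above one core's worth on XP;
        // with it, each thread blocks until its own GPU finishes the frame.
        glFinish();
    }
}
```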

  4. On Windows 7 64, CPU usage is nearly 0%. This is the case with one process or two. Apparently, the driver guys found an even more efficient way of waiting.

However, when I use one process and two threads, the second GPU has really poor performance (e.g. gpu1 = 45 FPS, gpu2 = 7 FPS). If I use two processes, the performance is the same for each GPU (45 FPS).

I did find a way to get the one process/two threads performance where it should be - the same glFinish at the end of the loop. Now both GPUs give me 45 FPS, but I’m using 13% of the CPU (not 25% like on XP, but I’d rather have the ~0% I get with two processes).

  5. On Windows 7 64, once I add the glFinish call, the multithreaded app crashes on exit (access violation in nvoglv32.dll). This only happens with a 32 bit build running two threads - two 32 bit processes with one thread each are fine, and so are two threads in a 64 bit process.

  6. On Windows 7 64, the “DisplayLessPolicy”/“LimitVideoPresentSources” registry mod that enables CUDA on headless cards does not enable GPU affinity. I was able to get it to work by forcing Windows to add a display to each GPU in the “Change display settings” control panel (I posted instructions in reply to a question in the NVSG thread). I’m not sure if this is the desired behavior, but it would be nice if the CUDA trick worked for both. Or better yet, add something to the NVIDIA Control Panel? (I’ve noticed that Enable/Disable PhysX tweaks the LimitVideoPresentSources value.)