GPU Affinity Performance: One Man’s Battle to Get Two Operating Systems to Run Three Cards

I’ve spent the last few days trying to convince myself that my code is making full use of the hardware available in my box. I’ve got an X58/Core i7 940 system (8 logical cores) with a Quadro FX 550 as the main adapter (monitor attached) and two Quadro FX 5800s in the second (gpu1) and third (gpu2) slots (no SLI, no monitors). I’m always targeting gpu1 and/or gpu2 - gpu0 is just for the display.

I’ve tested using 191.78, 191.66, and 186.18 drivers on both Windows XP 64 and Windows 7 64. My app is usually a 32 bit build, except for observation 5, where a 64 bit build fixed a problem.

My application has two image rendering threads that render completely independently (no shared context, etc.). Each thread renders into its own offscreen pbuffer context, which is created from a WGL_NV_gpu_affinity context (itself created from a temp window). I think I could drop the pbuffer context and render directly into the affinity context, but it’s left in so that I can also test with just a pbuffer context created from the temp window. I can also run my application with only one thread, so that I can compare two processes with one thread each against one process with two threads.
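For reference, the per-GPU context setup looks roughly like the sketch below. This is not my actual code: it assumes the WGL_NV_gpu_affinity entry points (wglEnumGpusNV, wglCreateAffinityDCNV) have already been loaded through wglGetProcAddress on the temp window’s context (e.g. via GLEW’s wglew.h), and it skips error handling and the pbuffer step.

```cpp
#include <windows.h>
#include <GL/glew.h>
#include <GL/wglew.h>

// Minimal sketch: create a GL context tied to one specific GPU with
// WGL_NV_gpu_affinity. gpuIndex follows the enumeration order
// (0 = display GPU, 1 = gpu1, 2 = gpu2 in my box).
HGLRC CreateContextOnGpu(unsigned int gpuIndex)
{
    // Pick the requested GPU out of the enumeration.
    HGPUNV gpu = 0;
    if (!wglEnumGpusNV(gpuIndex, &gpu))
        return 0;                                  // no such GPU

    // The affinity DC is restricted to the GPUs in this NULL-terminated list.
    HGPUNV gpuList[2] = { gpu, 0 };
    HDC affinityDC = wglCreateAffinityDCNV(gpuList);
    if (!affinityDC)
        return 0;

    // The affinity DC still needs a pixel format before a context can be made.
    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd) };
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_SUPPORT_OPENGL;
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;
    SetPixelFormat(affinityDC, ChoosePixelFormat(affinityDC, &pfd), &pfd);

    // Everything rendered on this context stays on the selected GPU. A pbuffer
    // created from affinityDC (wglCreatePbufferARB) inherits the same affinity.
    return wglCreateContext(affinityDC);
}
```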

I’ve got two test scenes - one has 8M triangles in the view and one has 50K triangles in the view. I’ve tested viewports ranging in size from 2x2 pixels up to 2500x2500 pixels. I varied the scene complexity and viewport size to try to stress fill rate, etc., but none of that really matters for the observations below.

Observations:

  1. On Windows XP 64, GPU affinity works as expected. If I render only one image, the frame rate is about the same with or without affinity. If I render two images without affinity, the frame rate is more than halved. If I render two images with affinity to different cards, the frame rate returns to normal (this is true whether I use one process with two threads or two processes with one thread each). If I render two images with affinity to one card, the frame rate is more than halved.

  2. On Windows XP 64, if I try to start two processes at the same time, the second one fails to create the affinity context (the call to wglEnumGpusNV fails). If I add a short delay before starting the second process, everything works fine. I guess there’s a mutex in the driver that makes the call from the second process fail, rather than block.
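The workaround I use is just a short delay before launching the second process, but the same idea can live inside the app: retry the enumeration a few times before giving up. A rough sketch (the retry count and delay are arbitrary, and it again assumes wglEnumGpusNV has already been loaded):

```cpp
#include <windows.h>
#include <GL/glew.h>
#include <GL/wglew.h>

// If another process is in the middle of creating its affinity context,
// wglEnumGpusNV can fail; back off briefly and try again.
bool EnumGpuWithRetry(unsigned int gpuIndex, HGPUNV* gpu)
{
    for (int attempt = 0; attempt < 10; ++attempt)
    {
        if (wglEnumGpusNV(gpuIndex, gpu))
            return true;
        Sleep(100);   // give the other process time to finish its setup
    }
    return false;
}
```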

  3. On Windows XP 64, only one CPU core is used, whether I run one process with two threads or two processes with one thread each. This is the case regardless of triangle count or viewport size.

Actually, more than one core is used, but the CPU usage for the process never goes over ~13% (with two processes, one is usually 10-12% and the other is 1-3%). If I set the CPU affinity to one core, I actually get a few extra FPS. I also disabled hyperthreading in the BIOS and tested with four logical cores, and CPU usage stayed around 25% (one core).

At first, when I was only testing the multithreaded app, I thought there was some problem with my scene graph that was causing the threads to serialize (even though there are two instances of the scene graph, it’s got a few static variables). So I made a test app that just loads the scene into a display list and calls the list in a tight loop - no change in CPU usage behavior. I then went on to test two instances of the app with a single thread each and saw that the CPU usage was still limited to one core - that really blew my mind. The only way I could get the app/apps to use two cores was to add a glFinish call to the end of my loop (see the sketch below).

All of this makes me think that the driver efficiently coordinates waiting on the hardware across threads and processes. That’s really cool, except that I spent the better part of a day trying to figure out why my multithreaded app was only using one core’s worth of CPU.
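For context, the stripped-down test loop looks roughly like this (the names are placeholders; the DC, context, and display list come from the setup sketched earlier):

```cpp
#include <windows.h>
#include <GL/gl.h>

// Stripped-down per-thread test loop. dc/rc are the affinity DC and context,
// sceneList is the display list holding the 8M- or 50K-triangle test scene.
void RenderLoop(HDC dc, HGLRC rc, GLuint sceneList)
{
    wglMakeCurrent(dc, rc);
    for (;;)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glCallList(sceneList);

        // Without this call, CPU usage never goes above one core's worth on XP;
        // with it, each thread blocks until its own GPU finishes the frame.
        glFinish();
    }
}
```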

  4. On Windows 7 64, CPU usage is nearly 0%. This is the case with one process or two. Apparently, the driver guys found an even more efficient way of waiting.

However, when I use one process and two threads, the second GPU has really poor performance (e.g. gpu1 = 45 FPS, gpu2 = 7 FPS). If I use two processes, the performance is the same for each GPU (45 FPS).

I did find a way to get the one process/two threads performance where it should be - the same glFinish at the end of the loop. Now both GPUs give me 45 FPS, but I’m using 13% of the CPU (not 25% like on XP, but I’d rather have the ~0% I get with two processes).

  5. On Windows 7 64, once I add the glFinish call, the multithreaded app crashes on exit (access violation in nvoglv32.dll). This only happens with a 32 bit build running two threads - two 32 bit processes with one thread each are fine, and so are two threads in a 64 bit process.

  6. On Windows 7 64, the “DisplayLessPolicy”/“LimitVideoPresentSources” registry mod that enables CUDA on headless cards does not enable GPU affinity. I was able to get it to work by forcing Windows to add a display to each GPU in the “Change display settings” control panel (I posted instructions in reply to a question in the NVSG thread). I’m not sure if this is the desired behavior, but it would be nice if the CUDA trick worked for both. Or better yet, add something to the NVIDIA Control Panel? (I’ve noticed that Enable/Disable PhysX tweaks the LimitVideoPresentSources value.)