Speed problems with multi-gpu on GTX295

Dear All,

I’m writing a CFD simulation program for GPUs. The program schedules the simulations and can run 2 different simulations in parallel on the GTX295 I have. The problem is that it’s not as fast as it should be (at least I think).

CPU: Intel Extreme 975
Motherboard: Asus P6T SE
Memory: 12 GB DDR3 (currently at 1066 MHz)
GPU: GTX295 (Asus)
OS: Windows Vista x64 Business
NVIDIA Control Panel Settings: PhysX OFF, Multi-GPU Support: ON

If I run only one simulation with specific settings, it finishes in 21m:40s; if I run 2 of them, they finish in 33m:22s and 29m:20s, so the total throughput is only about 33% higher. The program transfers no more than 200 MB between the GPU and the CPU during the entire run, so PCI Express cannot be the cause of the difference.
I also tested the n-body sample from the SDK. If it runs on only one GPU, it measures 304 GFLOPS; if it runs on 2 (I forced it to run on two different devices), then 303 and 120.

Is this normal? Does anybody know what causes it?

       Thanks in advance!

                                                                                  Yours sincerely:
                                                                                         Laszlo Daroczy

I don’t have any experience with multi-GPU, but it seems to come down to one of these possibilities:

  1. shared resource bottleneck
  2. limited parallelism - no, Monte Carlo is embarrassingly parallel
  3. inefficient algorithm (not likely, because Mark Harris & friends @ NVIDIA have done very good, albeit sometimes arcane, optimizations)

#1 is most likely, because I assume the GPUs are also being used for display. I’m using a Tesla 1060 and have a Quadro 295 for the display. I noticed that using the Quadro results in lower than expected speed, which is probably due to driving the display (1024 × 768 × 4 bytes/pixel × 60 Hz = 180 MiB/s of bandwidth).

The Tesla card also has to be mapped to an extended part of the desktop for CUDA to recognize it. This is a requirement of the regular display driver. The specialized Tesla driver doesn’t require you to map it to the extended desktop, but due to Windows Display Driver Model 1.0 you can only use one driver, and you need the regular driver along with another NVIDIA board if you want to have a display. Windows 7’s WDDM 1.1 does support multiple drivers. Mapping the Tesla to the desktop doesn’t affect performance, because the Tesla has no display circuitry to begin with.

I suggest you try the following to reduce the sharing between CUDA and the display driver functions:

  1. choose the lowest display resolution and bit depth
  2. disable compositing (Aero on Vista, Compiz or anything else that uses OpenGL)
  3. try mapping the 2nd GPU to a different part of the screen (maybe you can find a way to disable the display altogether)

Usually Aero is disabled for me, and the second GPU is mapped as a PhysX device. The bigger problem is that in this case random errors sometimes appear in the data, and that causes problems for my program. :verymad:

I simply have no idea why this is happening. Last time, for example, after a few seconds element 66,273 of my mesh was simply modified, which caused divergence in the algorithm… :confused:

The 295 consumes <= 290 W according to Wikipedia and 200 W according to the eXtreme Power Supply Calculator. Maybe the cards aren’t getting enough power. Maybe you can find a tool that displays the voltage of the 12 V rail? I know some BIOSes show it, but I don’t know of any other programs - maybe GPU-Z?

You can also try underclocking with RivaTuner and see if that fixes the error. I’m using a Tesla 1060 and have never had a stability problem. I’ve asked what’s so good about the Tesla in terms of reliability and didn’t get a convincing answer, but I did notice that the Tesla significantly underclocks its memory.

I don’t think there should be any problem with the power, as there is a 750 W supply in the computer, but I will check with RivaTuner.

Are you managing each GPU in the system directly, or just relying on the CUDA driver? Does your power supply meet the requirements for 2 GPUs?

I’m only using the CUDA driver.