Speed problems with multi-GPU on GTX 295

Dear All,

I’m writing a CFD simulation program for GPUs. The program schedules the simulations and can run two different simulations in parallel on the GTX 295 I have. The problem is that it is not as fast as it should be (at least I think so).

Computer:
CPU: Intel Extreme 975
Motherboard: Asus P6T SE
Memory: 12 GB DDR3 (currently running at 1066 MHz)
GPU: GTX295 (Asus)
OS: Windows Vista x64 Business
NVIDIA Control Panel Settings: PhysX OFF, Multi-GPU Support: ON

If I run only one simulation with specific settings, it finishes in 21m 40s; if I run two of them, they finish in 33m 22s and 29m 20s, so the total throughput is only about 33% higher. The program transfers no more than 200 MB between the CPU and the GPU during the entire run, so PCI Express bandwidth cannot cause the difference.
I also tested the n-body sample from the SDK. Run on a single GPU it measures 304 GFLOPS; when I run two instances (forced onto the two different devices), they measure 303 and 120.
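
For illustration, a minimal sketch of how the two simulations can each be pinned to one GPU of the GTX 295 with the runtime API (this is not my exact scheduler code; runSimulation is just a placeholder):

```cpp
// Minimal sketch: pin each simulation to one GPU of the GTX 295.
// runSimulation stands in for the real CFD work.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

static void runSimulation(int device)
{
    cudaSetDevice(device);            // bind this host thread to one GPU
    // ... allocate device memory, launch kernels, copy results back ...
    printf("simulation on device %d finished\n", device);
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);       // the GTX 295 shows up as two CUDA devices
    std::thread a(runSimulation, 0);
    std::thread b(runSimulation, count > 1 ? 1 : 0);
    a.join();
    b.join();
    return 0;
}
```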

Is this normal? Does anybody know what causes it?

       Thanks in advance!

                                                                                  Yours sincerely:
                                                                                         Laszlo Daroczy

I don’t have any experience with multi-GPU, but it seems to come down to one of these possibilities:

  1. shared resource bottleneck
  2. limited parallelism - no, Monte Carlo is embarrassingly parallel
  3. inefficient algorithm (not likely, because Mark Harris & friends @NVIDIA have done very good, albeit sometimes arcane, optimizations)

#1 is most likely, because I assume the GPUs are also being used for the display. I’m using a Tesla 1060 and have a Quadro 295 for the display. I noticed that using the Quadro results in lower than expected speed, which is probably due to it driving the display (1024 * 768 * 4 bytes/pixel * 60 Hz = 180 MiB/s of bandwidth).

The Tesla card also has to be mapped to an extended part of the desktop for CUDA to recognize it; this is a requirement of the regular display driver. The dedicated Tesla driver doesn’t require mapping to the extended desktop, but because of Windows Display Driver Model 1.0 you can only use one driver, and you have to use the regular driver (along with another NVIDIA board) if you want a display at all. Windows 7’s WDDM 1.1 does support multiple drivers. Mapping the Tesla to the desktop doesn’t affect performance, because the Tesla has no display circuitry to begin with.

I suggest you try the following to reduce the sharing between CUDA and the display driver (see the device-query sketch after the list):

  1. choose the lowest display resolution and bit depth
  2. disable compositing (Aero on Vista, Compiz or anything else that uses OpenGL)
  3. try mapping the 2nd GPU to a different part of the screen (maybe you can find a way to disable the display altogether)
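
To see which of the two devices is actually burdened with the display, you can query the watchdog flag: on Windows the GPU that drives a desktop normally reports a kernel execution timeout. A minimal sketch (nothing here is specific to your program):

```cpp
// Rough check: the GPU driving the display usually has the kernel watchdog enabled.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d (%s): kernel timeout %s\n",
               i, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled (likely drives a display)"
                                             : "disabled");
    }
    return 0;
}
```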

Usually Aero is disabled for me, and the second GPU is mapped as a PhysX device. The bigger problem is that in this configuration random errors sometimes appear in the data, and that causes problems for my program.

I simply have no idea why this is happening. Last time, for example, after a few seconds element 66,273 of my mesh was simply modified, which caused divergence in the algorithm…
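
Right now I only notice it when the solver diverges. A periodic checksum of the static mesh buffer would at least show when the corruption happens; a debugging sketch (d_mesh and nElements are placeholders for my actual buffers):

```cpp
// Debugging sketch: checksum a device buffer that should never change during
// the run (e.g. the mesh), so a silent modification shows up immediately.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Sums the raw 32-bit words of the array; any bit flip changes the result.
unsigned long long meshChecksum(const float* d_mesh, size_t nElements)
{
    std::vector<unsigned int> host(nElements);
    cudaMemcpy(host.data(), d_mesh, nElements * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    unsigned long long sum = 0;
    for (unsigned int w : host) sum += w;
    return sum;
}

// Usage (only while debugging - the copy is slow):
//   unsigned long long reference = meshChecksum(d_mesh, nElements); // after upload
//   // ... after each solver step:
//   if (meshChecksum(d_mesh, nElements) != reference)
//       printf("mesh buffer was modified!\n");
```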

The GTX 295 consumes up to 290 W according to Wikipedia and 200 W according to the eXtreme Power Supply Calculator. Maybe the cards aren’t getting enough power. Maybe you can find a tool that displays the voltage of the 12 V rail? I know some BIOSes show it, but I don’t know of any other programs - maybe GPU-Z?

You can also try underclocking with RivaTuner and see if that fixes the errors. I’m using a Tesla 1060 and have never had a stability problem. I’ve asked what’s so good about Tesla in terms of reliability and didn’t get a convincing answer, but I did notice that Tesla significantly underclocks the memory.

I don’t think there should be any problem with the power, as there is a 750 W supply in the computer, but I will check with RivaTuner.

Are you managing each GPU in the system directly, or just relying on the CUDA driver? Does your power supply meet the requirement for two GPUs?
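
By direct management I mean creating a context per GPU yourself through the driver API, along these lines (a rough sketch, not tied to your program):

```cpp
// Sketch of explicit per-GPU management with the CUDA driver API.
#include <cuda.h>
#include <cstdio>

int main()
{
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx, 0, dev);    // explicit context on device i
        // ... cuModuleLoad / kernel launches would go here ...
        cuCtxDestroy(ctx);
    }
    printf("created and destroyed %d contexts\n", count);
    return 0;
}
```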

I’m just relying on the CUDA driver; I don’t manage the GPUs directly.