multiGPU poor performance: up to 10x slower in multiGPU

I found an issue with performance on multiple GPUs (there are other posts on
the forum about this issue, but I haven't found a solution yet). If I start threads to work
with different GPUs, the performance is up to 10x slower.
I ran 3 exe programs at the same time, each set to a different device, and the
performance was OK, so I don't think it is a hardware problem.
Are there changes in CUDA 1.1 for multiGPU and multi-threaded processing?
many thanks,

It’s pretty hard for anyone to give advice without a lot more information on your system, what you’re doing in the multiGPU kernels, etc. Is your code I/O bound? Is your motherboard able to push more than one PCIe slot effectively? Some simple tests and back-of-the-envelope calculations should help you determine the source of your problem.

John Stone

Hi John. I can’t comment on the code itself, but I can comment on the hardware. The system we are using has 8x PCIe running off of the host to a multi PCIe backplane. Some slots are 4x electrical and some are 8x electrical, and I believe with the 6 GPUs we are using, half are on 4x and the other half are on 8x.

Alejo, it may be worth looking at the I/O parts of your kernel, since we are limited to 4 GB/s off of the host board split across GPUs. This may not bode well for trying to do real time work to any major degree, but I’m guessing you could look at processing larger chunks of memory at a time, rather than equally splitting information across GPUs, since memory use can vary, and smaller problems will be bottlenecked by I/O.

Hi John

Thanks for your reply. I'm working on a ray tracer application; the idea is to divide the frame into areas to render on different GPUs in a multiGPU box. Anyway, this performance issue has been reported for the multiGPU SDK example as well (see…869&hl=multigpu ). If you run this example with more than one GPU, the kernel takes approx. 600 ms, but if you run it with one GPU the process takes only 40 ms. I have had similar results with my application.

Can you give me some examples of these simple tests and back-of-the-envelope calculations?

many thanks, Alejo

Is there a profiler you can use to find out what is happening, and where, when using multiple threads? I think last time we looked at it, there was a thread syncing issue. I think there is a function for synchronizing threads which ended up increasing the time to completion of the software.


If you’re using some cards in 4x slots, you’re probably outside of the realm of testing that the NVIDIA people do with their drivers. There could be all kinds of issues with the hardware configuration you’re describing. I’d suggest running NVIDIA's bug report script, which collects hardware info, and providing them the output. Just hearing about the highly unusual bus topology you’re using scares me a bit :-)

Maybe one of the NVIDIA people can make a comment on whether a system setup like this has much chance of working.



I think that the people experiencing these 10x slowdowns probably have an unusual hardware combination. We’ve had no problems like this with any of our CUDA test machines, and I’ve been doing MultiGPU codes on several different host systems, motherboards, and GPU combinations. I’d suggest that people having huge slowdowns like this post their hardware info and/or report a bug. If you post the output of the NVIDIA bug reporting script someone at NVIDIA could presumably look into it. Without that info I don’t think there’s much anyone can do to help since most other hardware combos are working fine.



It is definitely a more exotic hardware setup. Although I don’t believe the fact that we are running on 4x and 8x PCIe is a major cause for concern, since that primarily affects bandwidth capacity, not overall GPU performance. I could be wrong, though. Where is this bug reporting script?

Also do you have any software I can use to test these GPUs in the multi GPU environment? I was hoping to try and see if I can actually benchmark time of I/O as well as tax all GPUs while data is in GPU memory.

Just a little more info on our setup: it isn’t a traditional motherboard setup. We are using a host board/backplane setup with a Trenton MCXT host board on a Cyclone PCIe backplane. This gives us 14 PCIe slots to load up. Some are x4 electrical and some are x8 electrical, although mechanically all of them are x8, so we had to use adapters, since there are currently no backplanes available that natively support x16 mechanical slots.

With this, we have 6 GeForce 8800 GTS 640MB GPUs. This is all powered by a 1200W PSU, although I’m getting the feeling that I may need to look at a dedicated PSU for a few of the GPUs, given some of the issues we have been having.

EDIT: BTW, I’ve been trying to get a hold of a number of people at NVIDIA directly by email but haven’t gotten much response from them yet.

Try running the script, which ought to be on your system as part of the video driver installation, e.g.: /usr/bin/


John Stone

Hi John

I can’t find anything like this nvidia-bug-report script on a Windows system.

Do you know the name of the script for Windows?

thanks, Alejo

Ouch, hmm, I don’t know what the equivalent tool for Windows is. Perhaps one of the NVIDIA staff can chime in on this.


Windows doesn’t ship a bug reporting tool.

I don’t think I’d expect good performance if different cards are in PCI-E slots running at different speeds. Can you elaborate why you believe this to be a CUDA problem rather than a motherboard limitation?

We’re trying to narrow down what the potential problem might be and find potential solutions. We had put together an application that measured bandwidth across all graphics cards. What I’ve found is that even though the PCIe slots have different speeds here and there, the bandwidth testing showed that when I go multi-GPU, I’m limited to x8 in aggregate off the host, and the numbers bear that out.

So while I can send data at 4x speeds (with 85% efficiency) to a GPU on a 4x slot (working on testing 8x speeds to only an 8x slot), I can achieve 8x (with 85% efficiency) to all 6 graphics cards at once.

So thus far, from what I’ve seen, I’ll get somewhere between 300 MB/s and 400 MB/s per GPU when sending data to all of them at once. The default device used when sending to a single GPU is currently in a 4x slot, and that gets around 850 MB/s.

So once I had these numbers, it seemed to me that there was an issue somewhere. I don’t know exactly where or how it is happening, but consistently seeing massive slowdowns when going multi-GPU is worrisome, since the bandwidth slowdown from going multi-GPU isn’t anywhere close to the overall slowdown factor of the application.


I’ve got similar problems with the multiGPU-Demo.

(2* NVIDIA 8800GTX, AMD X2 Dual Core 5200+, 4GB RAM, Linux, Cuda 1.1)

Here are my bandwidth values for 1 GPU:

Here are my bandwidth values for 2 GPUs (the bandwidth demo was forked at startup and both processes were set to different GPUs):

Further values for the multiGPU demo (forked as above):

Bandwidth doesn’t seem to be so bad when using 2 GPUs concurrently, but running CUDA kernels concurrently gives different values for the two GPUs. The faster of the two is always the 1st GPU, independently of whether the child or the parent process uses GPU 1.

Does anyone have an idea?




I think I found the problem in NVIDIA's multiGPU demo …

cudaSetDevice seems to need about 400 ms to set the 2nd device. The solution is to exclude the cudaSetDevice call from the time measurement in the demo. This worked for me :-)