Poor multiGPU performance: up to 10x slower when using multiple GPUs

Hi
I found an issue with performance on multiGPU systems (there are other posts in
the forum about this issue, but I haven’t found a solution yet). If I start threads to work
with different GPUs, the performance is up to 10x lower.
I ran 3 exe programs at the same time, each set to a different device, and the
performance was OK, so I don’t think it is a hardware problem.
Are there changes in CUDA 1.1 that affect multiGPU and multithreaded processing?
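
The structure is roughly like this (a simplified sketch of the pattern, not my actual code; the kernel body and the thread creation are placeholders):

#include <cuda_runtime.h>

__global__ void renderKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.0f;   // placeholder for the real per-pixel ray tracing work
}

// Each host thread binds itself to one GPU and renders its part of the frame.
void renderOnDevice(int device, float *h_out, int n)
{
    cudaSetDevice(device);                       // bind this host thread to one GPU

    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    renderKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaThreadSynchronize();                     // wait for the kernel to finish

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

// One host thread per GPU: thread 0 calls renderOnDevice(0, ...),
// thread 1 calls renderOnDevice(1, ...), and so on.
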
many thanks,
Alejo

It’s pretty hard for anyone to give advice without a lot more information on your system, what you’re doing in the multiGPU kernels, etc. Is your code I/O bound? Is your motherboard able to push more than one PCIe slot effectively? Some simple tests and back-of-the-envelope calculations should help you determine the source of your problem.

John Stone

Hi John. I can’t comment on the code itself, but I can comment on the hardware. The system we are using has an x8 PCIe link running from the host board to a multi-slot PCIe backplane. Some slots are x4 electrical and some are x8 electrical, and I believe that of the 6 GPUs we are using, half are on x4 and the other half are on x8.

Alejo, it may be worth looking at the I/O parts of your kernel, since we are limited to 4 GB/s off the host board, split across the GPUs. This may not bode well for real-time work to any major degree, but I’m guessing you could look at processing larger chunks of memory at a time rather than splitting the data equally across GPUs, since memory use can vary and smaller problems will be bottlenecked by I/O.

Hi John

Thanks for your reply. I’m working on a ray tracer application; the idea is to divide the frame into areas to render on different GPUs in a multiGPU box. Anyway, this performance issue has been reported for the multiGPU example as well (see http://forums.nvidia.com/index.php?showtop…869&hl=multigpu ). If you run this example with more than one GPU, the kernel takes approximately 600 ms, but if you run it with one GPU, the process takes only 40 ms. I have had similar results with my application.

Can you give me some examples of these simple tests and back-of-the-envelope calculations?

many thanks, Alejo

Is there a profiler you can use to find out what is happening, and where, when using multiple threads? I think the last time we looked at it there was a thread-syncing issue. I think there is a function for synchronizing threads, and it ended up increasing the time to completion of the software.
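
If it is the call I’m thinking of, it is cudaThreadSynchronize(): kernel launches return immediately, so the kernel’s runtime only shows up at the point where you synchronize. A minimal sketch of what I mean (placeholder kernel):

#include <cuda_runtime.h>

__global__ void busyKernel(float *x)
{
    x[threadIdx.x] += 1.0f;
}

int main()
{
    float *d_x;
    cudaMalloc((void **)&d_x, 256 * sizeof(float));
    cudaMemset(d_x, 0, 256 * sizeof(float));

    busyKernel<<<1, 256>>>(d_x);   // returns immediately; the kernel runs asynchronously
    cudaThreadSynchronize();       // the kernel's execution time is "charged" here

    cudaFree(d_x);
    return 0;
}
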

Hi,

If you’re using some cards in x4 slots, you’re probably outside of the realm of testing that the NVIDIA people do with their drivers. There could be all kinds of issues with the hardware configuration you’re describing. I’d suggest running NVIDIA’s bug report script, which collects hardware info, and providing them with the output. Just hearing about the highly unusual bus topology you’re using scares me a bit :-)

Maybe one of the NVIDIA people can make a comment on whether a system setup like this has much chance of working.

Cheers,

John

I think that the people experiencing these 10x slowdowns probably have an unusual hardware combination. We’ve had no problems like this with any of our CUDA test machines, and I’ve been doing MultiGPU codes on several different host systems, motherboards, and GPU combinations. I’d suggest that people having huge slowdowns like this post their hardware info and/or report a bug. If you post the output of the NVIDIA bug reporting script someone at NVIDIA could presumably look into it. Without that info I don’t think there’s much anyone can do to help since most other hardware combos are working fine.

Cheers,

John

It is definitely a more exotic hardware setup. I don’t believe that running on x4 and x8 PCIe is the cause for major concern, though, since that primarily affects bandwidth capacity, not overall GPU performance. I could be wrong, though. Where is this bug reporting script?

Also, do you have any software I can use to test these GPUs in a multi-GPU environment? I was hoping to benchmark I/O time as well as tax all the GPUs while the data is in GPU memory.

Just a little more info on our setup: it isn’t a traditional motherboard setup. We are using a host board/backplane arrangement with a Trenton MCXT host board on a Cyclone PCIe backplane. This gives us 14 PCIe slots to load up. Some are x4 electrical and some are x8 electrical, although mechanically all of them are x8, so we had to use adapters, since there are currently no backplanes available that natively support x16 mechanical slots.

With this, we have 6 GeForce 8800 GTS 640MB GPUs. This is all powered by a 1200W PSU, although I’m getting the feeling that I may need to look at a dedicated PSU for a few of the GPUs, given some of the issues we have been having.

EDIT: BTW, I’ve been trying to get a hold of a number of people at NVIDIA directly by email but haven’t gotten much response from them yet.

Try running the nvidia-bug-report.sh script, which ought to be on your system as part of the video driver installation, e.g.: /usr/bin/nvidia-bug-report.sh

Cheers,

John Stone

Hi John

I can’t find anything like this nvidia-bug-report script on my Windows system.

Do you know the name of the equivalent script for Windows?

thanks, Alejo

Ouch, hmm, I don’t know what the equivalent tool for Windows is. Perhaps one of the NVIDIA staff can chime in on this.

John

Windows doesn’t ship a bug reporting tool.

I don’t think I’d expect good performance if different cards are in PCI-E slots running at different speeds. Can you elaborate why you believe this to be a CUDA problem rather than a motherboard limitation?

We’re trying to narrow down what the potential problem might be and what the potential solutions are. We put together an application that measures bandwidth to all of the graphics cards. What I’ve found is that even though the PCIe slots run at different speeds, the bandwidth testing showed that when I go multi-GPU I’m only limited by the x8 link off the host, and the numbers bore that out.

So while I can send data at x4 speed (with 85% efficiency) to a GPU in an x4 slot (I’m still working on testing x8 speed to a GPU in an x8 slot), I can achieve x8 speed in aggregate (with 85% efficiency) to all 6 graphics cards at once.

So far I’ve seen somewhere between 300 MB/s and 400 MB/s per GPU when sending data to all of them at once. The default device for single-GPU transfers is currently in an x4 slot, and that gets around 850 MB/s.

Once I had these numbers, it seemed to me that something else was going on. I don’t know exactly where or how the issue is happening, but consistently seeing massive slowdowns when going multi-GPU is worrisome when the bandwidth reduction from going multi-GPU isn’t anywhere close to the overall slowdown factor of the application.
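
For what it’s worth, the per-GPU measurement is basically a timed series of cudaMemcpy calls, roughly like this (a simplified sketch, not the exact test code; buffer size and iteration count are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Rough host-to-device bandwidth test for one GPU; run one copy per GPU
// (with different device numbers) to load them all at once.
int main(int argc, char **argv)
{
    int device = (argc > 1) ? atoi(argv[1]) : 0;
    const size_t bytes = 32 * 1024 * 1024;   // 32 MB per transfer (arbitrary)
    const int iters = 50;

    cudaSetDevice(device);

    void *h_buf = malloc(bytes);
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d: %.1f MB/s host->device\n",
           device, (iters * (bytes / (1024.0 * 1024.0))) / (ms / 1000.0));

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}

Pinned host memory (cudaMallocHost) would give higher absolute numbers, but for comparing slots and single vs. multi-GPU, pageable memory as above is enough.
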

Hi,

I’ve got similar problems with the multiGPU-Demo.

(2x NVIDIA 8800 GTX, AMD X2 Dual Core 5200+, 4 GB RAM, Linux, CUDA 1.1)

Here are my bandwidth values for 1 GPU:

Here are my bandwidth values for 2 GPUs (the bandwidth demo was forked at startup and the two processes were set to different GPUs):

Further values for the multiGPU demo (forked as above):

Bandwidth does not seem to be too bad when using 2 GPUs concurrently, but running CUDA kernels concurrently produces different values for the two GPUs. The faster of the two is always the 1st GPU, independent of whether the child or the parent process uses GPU 1.
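
For reference, the forking pattern I used looks roughly like this (a simplified sketch, not the exact code): fork before touching CUDA, then each process picks its own device.

#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    pid_t pid = fork();                 // fork before any CUDA call
    int device = (pid == 0) ? 1 : 0;    // child uses GPU 1, parent uses GPU 0

    cudaSetDevice(device);

    // ... run the bandwidth test / kernels on this device ...

    return 0;
}
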

Does anyone have an idea?

regards

pototschnig

Hi,

I think I found the problem in NVIDIA’s multiGPU demo…

cudaSetDevice seems to need about 400 ms when setting the 2nd device. The solution is to exclude the cudaSetDevice call from the time measurement in the demo. This worked for me :-)
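
In other words, move the device selection out of the timed region, roughly like this (a sketch of the idea, not the exact demo code; the kernel is a placeholder):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *x)
{
    x[threadIdx.x] = 1.0f;
}

int main()
{
    cudaSetDevice(1);       // ~400 ms one-time cost for the 2nd device:
                            // keep it OUTSIDE the timed region

    float *d_x;
    cudaMalloc((void **)&d_x, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);           // timing starts after device selection
    dummyKernel<<<1, 256>>>(d_x);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_x);
    return 0;
}
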

regards
pototschnig