I wrote a simple multi-device CUDA program using OpenMP and noticed that the kernel and cudaMemcpy execution times became much worse than on a single device, even though the kernel and the cudaMemcpy calls are exactly the same in both versions. I call cudaSetDevice and launch the kernel from separate threads, and I verified with cudaGetDevice that cudaSetDevice is working fine.
Do you know what is happening? It looks as if something runs sequentially (like time sharing on a CPU).
I work on a machine with three GTX 480s, CUDA 3.1, and Debian sid (kernel 2.6.32-5-686-bigmem).
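For reference, a minimal sketch (not the original program) of the one-OpenMP-thread-per-GPU pattern described above; the kernel scale, the size n, and the launch configuration are placeholders I made up. The key point is that each thread calls cudaSetDevice before any other CUDA call it makes.

#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

// placeholder kernel: scale each element by f
__global__ void scale(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const int n = 1 << 20;                    // elements per device (placeholder)

    #pragma omp parallel num_threads(ndev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                   // bind this thread to its own GPU
                                              // before any other CUDA call in it
        float *d = NULL;
        cudaMalloc((void **)&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
        cudaThreadSynchronize();              // CUDA 3.x name; cudaDeviceSynchronize() on 4.0+
        cudaFree(d);
    }
    printf("ran kernels on %d device(s)\n", ndev);
    return 0;
}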
I observed similar behavior on my GTX 295. I divide the input data equally between the two devices, and I checked that each kernel does exactly half the work of the single-device kernel; all results are correct. Yet the two devices together run slower than one. I measured the pure computation time of each of the parallel kernels and it is about 50% of a single kernel's execution time, so I believe the problem is in the memory copies. My task is highly data dependent and requires a lot of copying.
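Since the copies seem to dominate, one thing worth checking is whether the host buffers are page-locked. A rough sketch below (buffer names and sizes are my own, not from the post) uses cudaHostAlloc with the portable flag plus cudaMemcpyAsync in a per-device stream, so the two host-to-device transfers can run as real DMA and overlap instead of serializing through staged pageable copies.

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    const int ndev = 2;
    const size_t half = (size_t)64 << 20;     // 64 MiB per device (placeholder)

    char *h_in = NULL;
    // portable pinned memory: seen as page-locked from every device/context
    cudaHostAlloc((void **)&h_in, ndev * half, cudaHostAllocPortable);

    #pragma omp parallel num_threads(ndev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        char *d_in = NULL;
        cudaMalloc((void **)&d_in, half);

        cudaStream_t s;
        cudaStreamCreate(&s);

        // each thread copies only its half; with pinned memory the two
        // host-to-device transfers can proceed concurrently
        cudaMemcpyAsync(d_in, h_in + (size_t)dev * half, half,
                        cudaMemcpyHostToDevice, s);
        cudaStreamSynchronize(s);

        cudaStreamDestroy(s);
        cudaFree(d_in);
    }
    cudaFreeHost(h_in);
    return 0;
}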
Check the number of PCIe lanes each device actually gets. Some boards have multiple PCIe x16 slots but only one PCIe hub, so with two cards installed you end up with two x8 links instead of one x16.
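A quick way to check this from software (a sketch only; the 256 MiB transfer size is arbitrary) is to time a pinned host-to-device copy on each card and compare the bandwidth. A card behind an x8 link will report roughly half the throughput of one on a full x16 link.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = (size_t)256 << 20;   // 256 MiB test transfer
    char *h = NULL;
    // portable pinned buffer so every device sees true page-locked memory
    cudaHostAlloc((void **)&h, bytes, cudaHostAllocPortable);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        char *d = NULL;
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        printf("device %d: %.2f GiB/s host->device\n", dev, gib / (ms / 1000.0));

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        cudaFree(d);
    }
    cudaFreeHost(h);
    return 0;
}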
Charley, do I understand that discussion correctly? The slow host-to-device transfers with multiple devices are partly caused by the motherboard, and this should improve with the Core i7 (Nehalem) architecture? I am just trying to understand what is going on, whether we can expect an improvement, and how expensive it would be.
You can easily compute a theoretical performance level for your application. With PCIe gen2 x16 you can expect about 6.5 GiB/s at best per x16 v2.0 slot (so if you copy to both GPUs of a 295 at the same time, expect half that). Nehalem is faster than previous architectures simply because it has much higher main-memory bandwidth.
Multi-GPU scaling works extremely well if you are not limited by the memory copies.
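As a worked example of that back-of-the-envelope bound (the 1 GiB payload is just an illustration): a 1 GiB copy comes out to roughly 154 ms on a dedicated x16 gen2 slot versus about 308 ms when the two GPUs of a 295 share the slot. If your measured copy times are far above that, the copies themselves (pageable memory, serialization) are the problem rather than the bus.

#include <stdio.h>

int main(void)
{
    const double gib      = 1024.0 * 1024.0 * 1024.0;
    const double payload  = 1.0 * gib;          // 1 GiB per GPU (example value)
    const double bw_x16   = 6.5 * gib;          // dedicated PCIe gen2 x16 slot
    const double bw_share = bw_x16 / 2.0;       // two GPUs of a 295 behind one slot

    printf("dedicated x16 slot: %.0f ms per copy\n", 1000.0 * payload / bw_x16);
    printf("shared slot       : %.0f ms per copy\n", 1000.0 * payload / bw_share);
    return 0;
}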
If your problem is PCIe bound (CPU ↔ GPU communication), then yes, two independent cards will be faster than a GTX 295 (assuming your motherboard actually drives both slots at x16 and doesn't drop them to x8).