Multi-device run is slower than a single GPU

Hi,

I wrote a simple multi-device CUDA program using OpenMP. I observed that the execution time of
the kernel and of cudaMemcpy became much worse than on a single device. The kernel and the cudaMemcpy
calls are exactly the same in both cases. I call cudaSetDevice and launch the kernel in separate threads. I also
checked cudaGetDevice in each thread and made sure that cudaSetDevice works correctly.
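
For readers following along, the setup described above might look roughly like the sketch below. It is not the original poster's code: the kernel `scale`, the buffer size, and the one-thread-per-device mapping are assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: multiplies each element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 1) {
        printf("no CUDA device found\n");
        return 1;
    }

    const int nPerDevice = 1 << 20;              // illustrative size only
    const size_t bytes = nPerDevice * sizeof(float);

    // One host thread per device (assumed mapping: thread id == device id).
    #pragma omp parallel num_threads(deviceCount)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);

        // Confirm which device this thread is actually bound to.
        int current = -1;
        cudaGetDevice(&current);

        float *hostBuf = (float *)malloc(bytes);
        for (int i = 0; i < nPerDevice; ++i) hostBuf[i] = 1.0f;

        float *devBuf = NULL;
        cudaMalloc((void **)&devBuf, bytes);

        // Each thread copies its own chunk, runs the kernel, and copies back.
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
        scale<<<(nPerDevice + 255) / 256, 256>>>(devBuf, nPerDevice, 2.0f);
        cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);

        #pragma omp critical
        printf("thread %d -> device %d, hostBuf[0] = %f\n",
               tid, current, hostBuf[0]);

        cudaFree(devBuf);
        free(hostBuf);
    }
    return 0;
}
```

Something like `nvcc -Xcompiler -fopenmp` should build it with a GCC host compiler. Note that all threads copy over the same host memory and PCIe path at the same time, so the copies can slow each other down even though each thread talks to a different device.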

Do you know what is happening? I think something is running sequentially (it looks like time sharing on a CPU).

I am working on a machine with three GTX 480s and CUDA 3.1; the OS is Debian sid (kernel 2.6.32-5-686-bigmem).

Thanks
M

Hi,

I observed similar behavior with my GTX 295. I divide the input data equally between the two devices, and I checked that each kernel does exactly half the work of the single-device kernel. All results are correct. However, the two devices together are slower than one device. I measured the pure computation time in each of the parallel kernels, and it is about 50% of the single-kernel execution time. So I believe the problem is in the memory copying. My task is highly data dependent and requires a lot of copying.
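
One way to separate copy time from compute time, as described above, is to bracket each phase with CUDA events. The following is only a sketch under assumptions: the kernel `work` and the buffer size are hypothetical, it runs on a single device, and error checking is left out. In a multi-device program each host thread would create its own events after calling `cudaSetDevice`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real computation.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;                       // illustrative size only
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    // Record an event before and after each phase on the default stream.
    cudaEventRecord(t0, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t2, 0);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);                    // wait until everything is done

    float h2dMs = 0.0f, kernelMs = 0.0f, d2hMs = 0.0f;
    cudaEventElapsedTime(&h2dMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    cudaEventElapsedTime(&d2hMs, t2, t3);
    printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n",
           h2dMs, kernelMs, d2hMs);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cudaFree(d);
    free(h);
    return 0;
}
```

If the kernel time halves with two devices but the copy times do not, the copies are the likely bottleneck.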

Cheers,
Krzysztof

When you are using two GPUs, the cudaMemcpy time between host and device can be longer than when you are using only one. You may want to take a look at this thread (concurrent bandwidthTest) if you haven’t already:
http://forums.nvidia.com/index.php?showtopic=86536

Check the number of PCIe lanes available to each device. Some boards have multiple PCIe x16 slots but only one PCIe hub, so with two cards you would get two PCIe x8 links instead of one PCIe x16.

Does this mean that two completely independent devices will work together better than a single “dual” device like the GTX 295?

I was highly disappointed with the performance of my GTX 295. Good scalability of tasks across multiple devices is crucial for serious computations.

Charley, do I understand that discussion correctly? The slow host-to-device transfer with multiple devices is partially caused by the motherboard, and it is going to improve with the i7 architecture? I am just trying to understand what is going on, whether we can expect some improvement, and how expensive it could be.

You can easily compute a theoretical performance level for your application. With PCIe gen2 x16 you can expect 6.5 GiB/s at most per slot (so if copying to the two GPUs of a 295 at the same time, expect half that for each). Nehalem is faster than previous architectures simply because it has massive main memory bandwidth.

Multi-GPU scaling works extremely well if you are not limited by the memory copies.
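
To make that estimate concrete, here is a rough model. The symbols are assumptions introduced for illustration and do not come from the posts above: $D$ is the total amount of data copied, $B$ is the bandwidth of one slot, and $K$ is the kernel time on a single GPU. It assumes copies and kernels do not overlap, both GPUs of the card copy at the same time and split the slot bandwidth evenly, and the kernel splits perfectly in two.

```latex
\begin{align*}
T_1 &= \frac{D}{B} + K
  && \text{(one GPU, full slot bandwidth)} \\
T_2 &= \frac{D/2}{B/2} + \frac{K}{2} = \frac{D}{B} + \frac{K}{2}
  && \text{(two GPUs sharing one slot)} \\
\frac{T_1}{T_2} &= \frac{D/B + K}{D/B + K/2}
\end{align*}
```

With $B = 6.5$ GiB/s and a hypothetical $D = 1$ GiB, the copy term $D/B$ is about 0.154 s in both cases. If $K$ is much smaller than that, the speedup approaches 1; if the copies are negligible, it approaches 2. Any extra per-device overhead on top of this model can push a copy-bound job below 1, which would match the slowdowns reported above.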

If your problem is PCIe bound (CPU <-> GPU communication), then yes, two independent cards will be faster than a GTX 295 (assuming your motherboard runs two x16 slots as such and doesn’t turn them into x8 slots).
