I wrote a simple multi-device CUDA program using OpenMP and noticed that the kernel and cudaMemcpy execution times became much worse than on a single device, even though the kernel and the cudaMemcpy calls are exactly the same in both versions. I call cudaSetDevice and launch the kernel from separate threads, and I verified with cudaGetDevice that cudaSetDevice is working fine.
Do you know what is happening? It looks as if something runs sequentially (like time sharing on a CPU).
I work on a machine with three GTX 480s, CUDA 3.1, and Debian sid (kernel 2.6.32-5-686-bigmem).
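For reference, a minimal sketch (not the original program) of the one-OpenMP-thread-per-GPU pattern described above; the kernel scale, the size n, and the launch configuration are placeholders I made up. The key point is that each thread calls cudaSetDevice before any other CUDA call it makes.

#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

// placeholder kernel: scale each element by f
__global__ void scale(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const int n = 1 << 20;                    // elements per device (placeholder)

    #pragma omp parallel num_threads(ndev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                   // bind this thread to its own GPU
                                              // before any other CUDA call in it
        float *d = NULL;
        cudaMalloc((void **)&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
        cudaThreadSynchronize();              // CUDA 3.x name; cudaDeviceSynchronize() on 4.0+
        cudaFree(d);
    }
    printf("ran kernels on %d device(s)\n", ndev);
    return 0;
}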
I observed similar behavior on my GTX 295. I divide the input data equally between the two devices, and I checked that each kernel does exactly half the work of the single-device kernel; all results are correct. Yet the two devices together run slower than one. I measured the pure computation time of each of the parallel kernels and it is about 50% of a single kernel's execution time, so I believe the problem is in the memory copies. My task is highly data dependent and requires a lot of copying.
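Since the copies seem to dominate, one thing worth checking is whether the host buffers are page-locked. A rough sketch below (buffer names and sizes are my own, not from the post) uses cudaHostAlloc with the portable flag plus cudaMemcpyAsync in a per-device stream, so the two host-to-device transfers can run as real DMA and overlap instead of serializing through staged pageable copies.

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    const int ndev = 2;
    const size_t half = (size_t)64 << 20;     // 64 MiB per device (placeholder)

    char *h_in = NULL;
    // portable pinned memory: seen as page-locked from every device/context
    cudaHostAlloc((void **)&h_in, ndev * half, cudaHostAllocPortable);

    #pragma omp parallel num_threads(ndev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        char *d_in = NULL;
        cudaMalloc((void **)&d_in, half);

        cudaStream_t s;
        cudaStreamCreate(&s);

        // each thread copies only its half; with pinned memory the two
        // host-to-device transfers can proceed concurrently
        cudaMemcpyAsync(d_in, h_in + (size_t)dev * half, half,
                        cudaMemcpyHostToDevice, s);
        cudaStreamSynchronize(s);

        cudaStreamDestroy(s);
        cudaFree(d_in);
    }
    cudaFreeHost(h_in);
    return 0;
}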
Check the number of PCIe lanes each device actually gets. Some boards have multiple PCIe x16 slots but only one PCIe hub, so with two cards installed you end up with two x8 links instead of one x16.
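A quick way to check this from software (a sketch only; the 256 MiB transfer size is arbitrary) is to time a pinned host-to-device copy on each card and compare the bandwidth. A card behind an x8 link will report roughly half the throughput of one on a full x16 link.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = (size_t)256 << 20;   // 256 MiB test transfer
    char *h = NULL;
    // portable pinned buffer so every device sees true page-locked memory
    cudaHostAlloc((void **)&h, bytes, cudaHostAllocPortable);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        char *d = NULL;
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        printf("device %d: %.2f GiB/s host->device\n", dev, gib / (ms / 1000.0));

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        cudaFree(d);
    }
    cudaFreeHost(h);
    return 0;
}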
Charley, do I understand that discussion correctly? The slow host-to-device transfers with multiple devices are partly caused by the motherboard, and this should improve with the Core i7 (Nehalem) architecture? I am just trying to understand what is going on, whether we can expect an improvement, and how expensive it would be.
You can easily compute a theoretical performance level for your application. With PCIe gen2 x16 you can expect about 6.5 GiB/s at best per x16 v2.0 slot (so if you copy to both GPUs of a 295 at the same time, expect half that). Nehalem is faster than previous architectures simply because it has much higher main-memory bandwidth.
Multi-GPU scaling works extremely well if you are not limited by the memory copies.
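As a worked example of that back-of-the-envelope bound (the 1 GiB payload is just an illustration): a 1 GiB copy comes out to roughly 154 ms on a dedicated x16 gen2 slot versus about 308 ms when the two GPUs of a 295 share the slot. If your measured copy times are far above that, the copies themselves (pageable memory, serialization) are the problem rather than the bus.

#include <stdio.h>

int main(void)
{
    const double gib      = 1024.0 * 1024.0 * 1024.0;
    const double payload  = 1.0 * gib;          // 1 GiB per GPU (example value)
    const double bw_x16   = 6.5 * gib;          // dedicated PCIe gen2 x16 slot
    const double bw_share = bw_x16 / 2.0;       // two GPUs of a 295 behind one slot

    printf("dedicated x16 slot: %.0f ms per copy\n", 1000.0 * payload / bw_x16);
    printf("shared slot       : %.0f ms per copy\n", 1000.0 * payload / bw_share);
    return 0;
}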
If your problem is PCIe bound (CPU ↔ GPU communication), then yes, two independent cards will be faster than a GTX 295 (assuming your motherboard actually drives both slots at x16 and doesn't drop them to x8).