Multi-GPU copy performance: any experiences to share?

I am in benchmarking/tuning hell at the moment, and I am wondering whether anyone has done much multi-GPU PCI-e bandwidth testing?

Right now I am finding that simultaneous copies to a pair of GPUs are a lot slower than copying to just one. In my app I am using cudaMemcpy2D to copy fairly large double-precision arrays to the device from pageable memory.

For one device (the first number on each line is a timestamp for the end of the operation, the value in braces is the thread ID):

1265145595.096542 {8285c950} gpuUpload m=8192 n=8192 lda=8256 time=156.331940 in gputhread_support.c, line 207

1265145595.147366 {8285c950} gpuUpload m=8192 n=2624 lda=8256 time=50.780289 in gputhread_support.c, line 207

1265145595.198151 {8285c950} gpuUpload m=8192 n=2624 lda=8256 time=50.744095 in gputhread_support.c, line 207

For the other device (this one has an active display on it, so a little less free memory, hence the smaller transfer size):

1265145860.849084 {619c8950} gpuUpload m=8192 n=8192 lda=8256 time=161.112320 in gputhread_support.c, line 207

1265145860.896129 {619c8950} gpuUpload m=8192 n=2112 lda=8256 time=46.999073 in gputhread_support.c, line 207

1265145860.942402 {619c8950} gpuUpload m=8192 n=2112 lda=8256 time=46.229313 in gputhread_support.c, line 207

For both simultaneously:

1265145999.716060 {16d08950} gpuUpload m=8192 n=8192 lda=8256 time=293.442627 in gputhread_support.c, line 207

1265145999.732586 {16507950} gpuUpload m=8192 n=8192 lda=8256 time=310.005157 in gputhread_support.c, line 207

1265145999.792686 {16d08950} gpuUpload m=8192 n=2112 lda=8256 time=76.552162 in gputhread_support.c, line 207

1265145999.813305 {16507950} gpuUpload m=8192 n=2624 lda=8256 time=80.668671 in gputhread_support.c, line 207

1265145999.868906 {16d08950} gpuUpload m=8192 n=2112 lda=8256 time=76.159904 in gputhread_support.c, line 207

1265145999.884756 {16507950} gpuUpload m=8192 n=2624 lda=8256 time=71.402046 in gputhread_support.c, line 207

All the timings are done with CUDA events. This is on an AMD 790FX chipset board with a 3 GHz Phenom II X4, so I wouldn't expect the PCI-e performance to be bad - a single GPU hits over 5 GB/s. But the upload times with both GPUs copying together are approaching double. My first thought was that I had accidentally messed up the threading code and wasn't releasing a mutex or something, but the timestamps show the operations are pretty much fully overlapped. Anyone have an opinion on what might be going on here? Does this chipset just suck?
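For reference, the timing pattern per upload is roughly the following. This is a simplified sketch, not the actual gputhread_support.c code - the buffer names are made up and error handling is trimmed - but the event bracketing around cudaMemcpy2D is the same:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: time a pageable-memory 2D upload with CUDA events.
   Sizes mirror the logs above (m x n doubles, host pitch of lda doubles). */
int main(void)
{
    const size_t m = 8192, n = 8192, lda = 8256;
    double *host = (double *)malloc(lda * n * sizeof(double)); /* pageable */
    double *dev;
    size_t dpitch;
    cudaMallocPitch((void **)&dev, &dpitch, m * sizeof(double), n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy2D(dev, dpitch, host, lda * sizeof(double),
                 m * sizeof(double), n, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in ms */
    printf("gpuUpload m=%zu n=%zu lda=%zu time=%f\n", m, n, lda, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    free(host);
    return 0;
}
```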

I wrote an app to test this a while ago:

http://forums.nvidia.com/index.php?showtopic=86536

Thanks for the link Tim. I’ll give it a go straight away.

EDIT: And I guess I have my explanation:

avid@cuda:~$ ./concBandwidthTest 0

Device 0 took 1217.797729 ms

Average HtoD bandwidth in MB/s: 5255.388184

Device 0 took 1143.444458 ms

Average DtoH bandwidth in MB/s: 5597.123535
avid@cuda:~$ ./concBandwidthTest 1

Device 1 took 1202.025146 ms

Average HtoD bandwidth in MB/s: 5324.347656

Device 1 took 1141.787231 ms

Average DtoH bandwidth in MB/s: 5605.247559
avid@cuda:~$ ./concBandwidthTest 0 1

Device 0 took 2118.291504 ms

Device 1 took 2114.531494 ms

Average HtoD bandwidth in MB/s: 6047.978027

Device 0 took 2011.028198 ms

Device 1 took 2007.709839 ms

Average DtoH bandwidth in MB/s: 6370.163330

What is the HT bandwidth limit? The 790FX north bridge has a 2 GHz, 16-bit HT 3.0 link. How much bandwidth should that give?

Scaling from a table on the Wikipedia page on HyperTransport, it looks like the theoretical max for a 2 GHz/16-bit link is 8 GB/sec per direction. (Seriously?? They put 42 lanes of PCI-Express 2.0 on the other side of a link that can only do 8 GB/sec?!)

I guess I just finished reading the same page and came to the same conclusion. That is a slightly more polite version of my reaction :)

I might be able to live with it. Thankfully, even the dual card transfer times are small compared to what the GPU code does with the arrays, so the overall effect isn’t enormous. It was only when I started comparing model predictions of the code performance with actual measurements that I noticed a discrepancy in the transfer time predictions. Of course I had assiduously fitted the model with measured data from single GPU transfer measurements…

Even the 36-lane X58 boards get no more than 10 GB/s of concurrent bandwidth.

@ avidday

What was the take-away/explanation of your results? As I am about to venture into multi-GPU, I would like to understand this better.

Thanks

MMB

Basically it looks like there is about 6.5 GB/s of usable bandwidth on the HT link between the 790FX IO hub (where the PCI-e controller is) and the CPU/memory controller. So for a single GT200 GPU, large, sustained transfers are PCI-e bus limited (something like 5.2 GB/s). For two GT200 GPUs performing simultaneous large, sustained transfers, the bottleneck appears to move to the HT link. The total transfer bandwidth is certainly higher than for the single GPU (6.5 GB/s versus 5.5 GB/s), but the per-GPU performance is lower.

You should keep in mind that my application is performing really large transfers simultaneously - pretty much filling both GPUs' memory in 3 transactions. This should be the absolute worst case. For smaller transfers, latency rather than bandwidth should dominate the transfer performance, and I don't see any evidence that latency is worse in the dual-GPU case. If the GPUs aren't transferring simultaneously, the per-GPU performance is still excellent. I think I would still prefer this arrangement to using NF200 switch-based boards (which have both inferior per-GPU bandwidth and higher latency during simultaneous GPU operations). Tim indicated that the Intel X58 might be somewhat better in this regard, because the QPI link has higher theoretical bandwidth than the HT link my board has. I haven't tried my code on a multi-GPU X58 system, so I can't say. AMD should be on the verge of rolling out the new 890FX chipset, which may also be an improvement over what I am using.