I am in benchmarking/tuning hell at the moment and I am wondering whether anyone has done much multi-GPU PCI-e bandwidth testing?
Right now I am finding that copying to a pair of GPUs simultaneously is a lot slower than copying to just one. In my app I am using cudaMemcpy2D to upload reasonably large double-precision arrays to the device from pageable memory.
For one device (the first number is the timestamp for the end of each operation, the second is the thread ID):
1265145595.096542 {8285c950} gpuUpload m=8192 n=8192 lda=8256 time=156.331940 in gputhread_support.c, line 207
1265145595.147366 {8285c950} gpuUpload m=8192 n=2624 lda=8256 time=50.780289 in gputhread_support.c, line 207
1265145595.198151 {8285c950} gpuUpload m=8192 n=2624 lda=8256 time=50.744095 in gputhread_support.c, line 207
For the other device (this one has an active display on it, so a little less free memory, hence the smaller transfer size):
1265145860.849084 {619c8950} gpuUpload m=8192 n=8192 lda=8256 time=161.112320 in gputhread_support.c, line 207
1265145860.896129 {619c8950} gpuUpload m=8192 n=2112 lda=8256 time=46.999073 in gputhread_support.c, line 207
1265145860.942402 {619c8950} gpuUpload m=8192 n=2112 lda=8256 time=46.229313 in gputhread_support.c, line 207
For both simultaneously:
1265145999.716060 {16d08950} gpuUpload m=8192 n=8192 lda=8256 time=293.442627 in gputhread_support.c, line 207
1265145999.732586 {16507950} gpuUpload m=8192 n=8192 lda=8256 time=310.005157 in gputhread_support.c, line 207
1265145999.792686 {16d08950} gpuUpload m=8192 n=2112 lda=8256 time=76.552162 in gputhread_support.c, line 207
1265145999.813305 {16507950} gpuUpload m=8192 n=2624 lda=8256 time=80.668671 in gputhread_support.c, line 207
1265145999.868906 {16d08950} gpuUpload m=8192 n=2112 lda=8256 time=76.159904 in gputhread_support.c, line 207
1265145999.884756 {16507950} gpuUpload m=8192 n=2624 lda=8256 time=71.402046 in gputhread_support.c, line 207
All the timings are done with CUDA events. This is on an AMD 790FX chipset board with a 3 GHz Phenom II X4, so I wouldn't expect the PCI-e performance to be bad - a single GPU hits over 5 GB/s. But the upload times with both GPUs copying together are approaching double. My first thought was that I had accidentally messed up the threading code and wasn't releasing a mutex or something, but the timestamps show the operations are pretty much fully overlapped. Anyone have an opinion on what might be going on here? Does this chipset just suck?