Erratic multi-GPU bandwidth

I am running an 8-GPU system (7× GeForce GTX Titan, 1× Tesla K20Xm) on a Tyan B7059F77AV6R-N and have noticed several inconsistencies in my P2P bandwidth:
Multiple concurrent independent transfers (e.g. 0>1, 2>3, 4>5, 6>7) seem to slow one another down by a factor of 4; in fact, running them sequentially is faster.

Theoretically identical transfers executed one at a time between identical GPUs on the same I/O hub show vastly different bandwidths. For example, 4>5 gets 7 GB/s while 7>5 (in fact a more complicated transfer in terms of the PCIe topology) yields 9 GB/s.

Further still, some transfers do not benefit from P2P at all (even though they should) and are in fact slowed down: compare 7>5 without P2P (9 GB/s) to 7>5 with P2P (7 GB/s).

This is not just one execution of the benchmark, by the way; it is 100% repeatable.

I already filed a bug report several months ago (particularly with respect to the first problem, independent transfers slowing one another down), but I just wanted to get opinions from here :)

Disregard any transfers to GPU 6, as it’s a K20Xm instead of a Titan.

Here’s a sample output:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX TITAN, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX TITAN, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX TITAN, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX TITAN, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 4, GeForce GTX TITAN, pciBusID: 85, pciDeviceID: 0, pciDomainID:0
Device: 5, GeForce GTX TITAN, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device: 6, Tesla K20Xm, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device: 7, GeForce GTX TITAN, pciBusID: 8a, pciDeviceID: 0, pciDomainID:0
P2P Cliques: 
[0 1 2 3]
[4 5 6 7]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 114.02   7.49   7.86   7.81   9.50   9.30   6.29   9.42 
     1   8.31 112.87   9.00   8.99   9.48   9.40   6.38   9.39 
     2   9.32   9.38 112.47   8.48   9.39   9.44   6.38   9.43 
     3   8.65   8.85   8.34 113.15   9.44   9.50   6.38   9.41 
     4   9.48  10.05   9.80   9.95 113.26   7.55   6.41   7.91 
     5   9.62  10.13   9.78   9.88   8.59 113.01   6.08   9.43 
     6   6.12   6.12   6.11   6.12   6.23   6.20  97.32   6.22 
     7   9.50  10.05   9.79   9.91   9.41   9.54   6.40 113.34 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 113.87  13.08   7.81   7.70   9.50   9.45   6.37   9.52 
     1  11.54 114.60   7.70   7.70   9.61   9.58   6.05   9.38 
     2   7.70   7.70 113.99  12.60   9.34   9.60   6.37   9.56 
     3   7.72   7.70  12.24 114.20   9.46   9.49   6.38   9.45 
     4   9.60  10.06   9.79   9.97 113.02  11.51   6.53   7.69 
     5   9.61  10.13   9.79   9.88  11.29 113.45   6.60   7.69 
     6   6.12   6.12   6.13   6.13   5.69   5.69  97.80   6.58 
     7   9.51  10.00   9.80   9.88   7.70   7.71   6.61 113.56 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 114.11   6.96   8.21   7.29   9.44   9.48   6.55   9.33 
     1   7.26 114.09   8.21   7.62   9.12   9.14   6.28   9.37 
     2   8.01   8.62 114.06   7.45   9.16   9.13   6.58   9.42 
     3   7.04   7.40   7.64 113.86   8.95   9.38   6.58   9.31 
     4   9.39   9.35   9.34   9.34 114.43   7.41   6.60   7.86 
     5   9.43   9.47   9.41   9.36   7.41 114.44   6.59   9.20 
     6   6.60   6.61   6.61   6.60   6.62   6.65  99.13   6.45 
     7   9.35   9.40   9.37   9.26   7.75   9.15   6.45 113.98 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 114.98  24.58  15.34  15.37   9.34   9.05   6.58   9.34 
     1  24.55 114.04  15.37  15.32   9.35   9.38   6.49   9.38 
     2  15.38  15.39 114.41  24.60   9.34   9.43   6.59   9.19 
     3  15.38  15.39  24.60 115.11   9.42   9.34   6.58   9.32 
     4   9.28   9.47   9.43   9.38 113.41  22.81  11.32  15.38 
     5   9.44   9.54   9.53   9.36  22.83 113.36  11.32  15.38 
     6   6.59   6.62   6.61   6.59  13.05  13.04  98.76  13.08 
     7   9.36   9.43   9.41   9.31  15.38  15.38  13.05 113.86 
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7 
     0   6.05  33.97  33.61  37.46  36.66  39.81  26.55  39.92 
     1  41.15   6.05  33.49  37.34  37.07  39.81  26.41  40.02 
     2  41.08  34.17   5.98  37.38  37.20  39.85  26.37  39.96 
     3  41.03  34.13  32.98   5.70  37.85  41.10  26.70  41.07 
     4  40.50  33.90  33.25  37.01   6.62  41.36  27.22  41.22 
     5  40.36  33.18  32.78  36.70  38.04   6.13  27.24  41.20 
     6  39.18  33.28  33.03  37.08  35.63  40.35   6.20  40.24 
     7  40.51  33.73  33.60  36.72  37.90  41.13  27.11   6.40 
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7 
     0   6.18  21.22  21.36  25.07  37.26  41.03  26.83  41.13 
     1  27.48   6.09  21.28  24.82  37.67  40.89  26.81  40.91 
     2  27.53  21.44   6.13  24.95  37.73  40.88  26.53  40.92 
     3  27.51  21.38  21.16   5.77  38.09  41.10  27.16  41.10 
     4  40.41  33.29  33.10  36.86   6.45  24.77   9.97  28.11 
     5  40.42  33.15  32.84  36.62  24.83   6.07   9.93  28.17 
     6  39.25  33.27  32.99  37.13  24.93  24.93   6.07  28.08 
     7  40.48  33.66  33.53  36.75  24.85  24.90  10.05   6.37

Some random ideas, although I’m not sure any of them will help:

  • Have you tried enabling Advanced Error Reporting for the Linux PCIe driver (if supported by your chipset) and checking dmesg for any AER messages?
  • Maybe you can learn something from replacing the cudaMemcpyPeerAsync() calls with custom kernels that write to (or read from, just as a test) the peer cards using float4s - see the sketch after this list.
  • For the ring transfer, can you do a table of bandwidths or total transfer times for all 255 combinations of enabling/disabling some of the 8 overlapping copies? That should give a hint as to which transfers are colliding.
  • Not sure this is a valid metric in this case, as the latency has additional overhead and thus is not the same as the delay. However, I notice the bandwidth-latency product gets as high as ~300 kB for transfers on the same root complex. That would be an awful lot of data to buffer.
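
To illustrate the kernel idea from the second bullet, here is roughly what I have in mind (untested sketch - the names are placeholders, src is a buffer on the current device, peerDst a buffer on the peer device, and peer access is assumed to be enabled):

__global__ void peerWriteFloat4(const float4 *src, float4 *peerDst, size_t n4)
{
    // one 16-byte element per thread; with UVA the peer pointer can be dereferenced directly
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4)
        peerDst[idx] = src[idx];
}

// launched on the device that owns src, e.g.
// peerWriteFloat4<<<(n4 + 255) / 256, 256>>>(src, peerDst, n4);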

Hi tera,
I had to edit my entire response after learning more about the situation.

If I time
(0>1 2>3 4>5 6>7)
SYNC ALL DEVICES
(1>2 3>4 5>6 7>0)
SYNC ALL DEVICES

This runs about 2× as fast as issuing the entire transfer in parallel.
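
One way to write that schedule down (this is not my actual harness, which launches one host thread per copy; sendBuf[d]/recvBuf[d] are placeholder buffers on device d, and s[d] is a non-null stream created on device d):

// phase 1: 0>1  2>3  4>5  6>7
for (int d = 0; d < 8; d += 2)
    cudaMemcpyPeerAsync(recvBuf[d + 1], d + 1, sendBuf[d], d, nBytes, s[d]);
// SYNC ALL DEVICES
for (int d = 0; d < 8; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }

// phase 2: 1>2  3>4  5>6  7>0
for (int d = 1; d < 8; d += 2)
    cudaMemcpyPeerAsync(recvBuf[(d + 1) % 8], (d + 1) % 8, sendBuf[d], d, nBytes, s[d]);
// SYNC ALL DEVICES
for (int d = 0; d < 8; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }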

It appears cudaMemcpy suffers terribly when two transfers use the same card (one reading from it and the other writing to it).

Interestingly, a basic kernel of the form y[idx]=x[idx]; does not suffer from the same problems…

I have found the “culprit” of the situation, in two parts:

  1. For some reason you need to enable P2P in both directions to get the full speed bonus, i.e.

for a transfer 0>1 you must call
cudaError_t e_ = cudaDeviceEnablePeerAccess(1, 0); from device 0
as well as
cudaError_t e_ = cudaDeviceEnablePeerAccess(0, 0); from device 1 (I wasn’t doing this part because I didn’t think it was necessary)

  2. You must use cudaMemcpyAsync() with an arbitrary stream (any non-null stream) instead of synchronous cudaMemcpy(). A sketch combining both fixes follows below.
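
For the pair 0>1 the full-speed path looks roughly like this (sketch only; error checking is omitted, and dst1/src0/nBytes are placeholders, with dst1 allocated on device 1 and src0 on device 0):

// 1. enable peer access in BOTH directions
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);   // device 0 may access device 1's memory
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, 0);   // device 1 may access device 0's memory

// 2. issue the copy on a non-null stream so it can overlap with other transfers
cudaStream_t stream;
cudaSetDevice(0);
cudaStreamCreate(&stream);
cudaMemcpyAsync(dst1, src0, nBytes, cudaMemcpyDefault, stream);   // UVA pointers, P2P path
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);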

It seems to me that if you call multiple synchronous cudaMemcpy()s (not the async variant) from different threads and on different cards, they interfere with each other.

By that I mean if you execute
0>1 1>2 2>3 3>0, then, at any one time, only
0>1 and 2>3 can execute, and only after they finish can
1>2 and 3>0 execute, doubling the effective runtime.

These things sound like bugs to me.

If I understand the situation correctly, the second observation sounds like things are working as designed and documented. The use of overlapping DMA transfers of any kind has always required asynchronous copies in different non-null streams (the null stream has synchronizing properties, for legacy reasons as I recall). For GPUs prior to Kepler there were additional complications, since independent operations could be mapped to the same hardware queue, which could cause false dependencies.

I cannot speak to the first observation, the required idiom for turning on peer-to-peer access in a bidirectional fashion. What does the documentation say on this subject? I found an older presentation that says cudaDeviceEnablePeerAccess() enables the “current GPU to access memory on peer GPU”. So for each of two GPUs to see each other’s memory, one would have to call cudaDeviceEnablePeerAccess() for each of them, with the appropriate arguments.

The presentation on multi-GPU programming is here:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0515-GTC2012-Multi-GPU-Programming.pdf

Interesting. I would not have thought it synchronizes across ALL threads, though! (Each transfer is initiated by a different thread.)

Furthermore, I’m noticing strange things when synchronizing on the arbitrary stream.

Say I want to do

cudaMemcpyAsync(dst,src,bytes,cudaMemcpyDefault,stream)

Wait for the transfer, and then

cublasSaxpy(…,dst,…)

(This applies only to devices I can’t use peer-to-peer with, e.g. those on a different IOH.)

What I’ve tried is cudaDeviceSynchronize() as well as cublasSetStream(handle, stream) to wait for the relevant transfer, but neither works. Either way, the transfers once again take a big performance hit, probably executing sequentially again…

Operations within the same stream always execute sequentially, so no additional synchronization is necessary. A cudaMemcpyAsync() call is asynchronous with respect to the host and to other streams, but synchronous with respect to a dependent operation in the same stream, such as a CUBLAS call.
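
In other words, something along these lines should be sufficient (just a sketch; handle, stream, the device buffers dst/src/y, n, and alpha are assumed to be set up already):

cublasSetStream(handle, stream);                  // route subsequent CUBLAS calls into 'stream'
cudaMemcpyAsync(dst, src, n * sizeof(float),
                cudaMemcpyDefault, stream);       // copy is queued into 'stream'
cublasSaxpy(handle, n, &alpha, dst, 1, y, 1);     // starts only after the copy in 'stream' has completed
// no cudaDeviceSynchronize() needed between the copy and the SAXPY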

Sorry I didn’t explain myself very well.

If I use cublasSetStream() to make the SAXPY run after the memcpy, the memcopies on the other GPUs (NOT tied to that stream) once again serialise.

Something funny is going on here. It may be hardware related.

This means that the Titan (ehm, which one?) supports P2P? How come? Isn’t that just a Quadro/Tesla thing?