P2P: How do I know if cudaMemcpy falls back to non-P2P?

I have a setup with 5x Tesla V100 in a PCIe (gen3) system, and I’m seeing really low bandwidth numbers with p2pBandwidthLatencyTest. I’ve also written my own CUDA app to experiment with various uses of cudaMemcpy()/cudaMemcpyAsync() with streams, etc.

cudaDeviceCanAccessPeer() reports that the GPUs can access each other as peers, and cudaDeviceEnablePeerAccess() doesn’t fail.

Is there any way to see through the API whether a transfer actually went peer-to-peer?

Note: I have 4 GPUs in a PCIe expansion chassis, and GPU4 sits on the SuperMicro motherboard itself. What’s bizarre is that GPU4 gets better rates to the other GPUs than they get among themselves (which leads me to suspect it’s not P2P), even though P2P claims to be enabled.
And when P2P is DISABLED, they seem to pull proper rates???
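
For reference, the peer-access checks I’m describing look roughly like this minimal sketch (not my actual app; the device pair 0/4 and the use of cudaDeviceGetP2PAttribute are just illustrative):

// Minimal sketch: query the runtime's view of P2P support before enabling peer access.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int src = 0, dst = 4;   // example pair, adjust to your topology

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    printf("Device %d can access device %d: %d\n", src, dst, canAccess);

    // cudaDeviceGetP2PAttribute gives slightly more detail than the boolean.
    int perfRank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank, src, dst);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst);
    printf("performance rank: %d, native atomics: %d\n", perfRank, atomics);

    // Enabling peer access only succeeds if the driver believes P2P works;
    // it does not prove the fabric actually routes traffic peer-to-peer.
    cudaSetDevice(src);
    cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
    printf("cudaDeviceEnablePeerAccess: %s\n", cudaGetErrorString(err));
    return 0;
}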

Relevant output from nvidia-smi topo -m and p2pBandwidthLatencyTest below.

[root@localhost RDMADriver]# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    CPU Affinity    NUMA Affinity
GPU0     X      PIX     PXB     PXB     SYS     0-7,16-23       0
GPU1    PIX      X      PXB     PXB     SYS     0-7,16-23       0
GPU2    PXB     PXB      X      PIX     SYS     0-7,16-23       0
GPU3    PXB     PXB     PIX      X      SYS     0-7,16-23       0
GPU4    SYS     SYS     SYS     SYS      X      8-15,24-31      1
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

[root@localhost p2pBandwidthLatencyTest]# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-PCIE-32GB, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-PCIE-32GB, pciBusID: 26, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-PCIE-32GB, pciBusID: 2f, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-PCIE-32GB, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla V100-PCIE-32GB, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4
     0       1     1     1     1     1
     1       1     1     1     1     1
     2       1     1     1     1     1
     3       1     1     1     1     1
     4       1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 718.06   9.02   9.36   9.10   5.85
     1   8.83 727.46   9.39   9.07   5.85
     2   8.98   9.15 726.07   9.14   5.87
     3   9.02   9.17   9.30 730.14   5.87
     4   5.76   5.91   5.90   5.89 735.64
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4
     0 720.71   0.72   0.72   0.72   8.65
     1   0.62 732.88   0.72   0.72   8.29
     2   0.62   0.72 730.14   0.72   8.62
     3   0.55   0.62   0.72 737.03   8.21
     4   8.67   8.66   8.67   7.67 737.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 721.38   9.98   9.95   9.90   9.83
     1  10.03 727.42   9.80   9.81   9.75
     2   9.97   9.75 730.82   9.40   9.74
     3   9.94   9.78   9.21 732.88   9.80
     4   9.88  10.09  10.13  10.18 751.20
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 725.39   1.02   1.30   1.30  17.17
     1   1.30 735.64   1.30   1.30  17.15
     2   1.30   1.30 734.95   1.30  17.17
     3   1.30   1.30   1.30 735.64  17.19
     4  17.18  17.22  17.20  17.20 748.32
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4
     0   2.66  25.82  23.53  24.35  25.51
     1  28.52   2.53  24.22  24.10  24.81
     2  24.41  24.10   2.28  22.31  22.98
     3  24.55  24.82  22.57   2.29  23.65
     4  24.64  24.55  22.55  22.31   1.76

   CPU     0      1      2      3      4
     0  10.15  27.46  25.70  25.35  22.43
     1  24.16   8.28  23.79  23.62  20.72
     2  21.61  22.14   7.28  22.07  19.50
     3  20.38  21.91  21.00   6.77  19.14
     4  17.93  19.39  18.91  18.39   5.51
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4
     0   2.66 49252.33 49252.37 49252.36   2.71
     1 49256.18   2.52 49252.68 49252.67   2.71
     2 49252.71 49252.60   2.28 49252.59   2.46
     3 49256.04 49252.62 49252.63   2.28   2.46
     4   2.82   2.80   2.53   2.53   2.10

   CPU     0      1      2      3      4
     0   5.68   4.71   3.81   3.74   3.35
     1   3.73   4.56   3.65   4.12   4.19
     2   4.00   3.67   4.23   3.59   3.91
     3   4.36   4.14   3.81   4.38   3.63
     4   3.18   3.16   3.34   3.25   3.98

What are the specifications of the PCIe slots involved? Full transfer speed (~12 GB/sec per direction) requires PCIe gen3 x16 links. What kind of PCIe expansion chassis is this? Is it based on some sort of PCIe bridge chip? What are its specifications? What kind of link connects the PCIe expansion chassis to the host system? Is it a PCIe gen3 x16 link?

Thanks for your reply, njuffa. This is the expansion chassis in question: https://www.onestopsystems.com/product/expressbox-3600-10

I’m seeing some people reporting issues with PLX switches and ACS; I’ll look closer at that…

Considering the various datapoints:

  1. The same P2P test reports some better results and some worse results
  2. The worse datapoints are localized to traffic on the expansion chassis
  3. The latency measurements are also bad (on the order of 50 milliseconds)

It seems evident that the expansion chassis has an issue with P2P traffic. I’m not able to diagnose it here. My suggestion would be to contact the manufacturer of the expansion chassis and report your observations.

You have the source code for the tests you’ve run, and it’s not reasonable to conclude that the test is somehow doing something different at the API level when it is setting up a test between GPU 4 and GPU 0 vs. when it is setting up a test between GPU 3 and GPU 0. That last part is just my opinion.

Supermicro, for example, has many system boards that use PCIe switches, probably similar to what your expansion chassis uses to connect slots together, and they work fine with these tests.

I’m not aware of any further way to confirm P2P beyond using the appropriate API (e.g. cudaMemcpyPeerAsync), and even if there were a fallback from P2P to non-P2P, it wouldn’t explain any of the datapoints I see here. A fallback would not cause the transfer bandwidth to drop from ~8 GB/s to ~0.7 GB/s, nor would it cause latency to increase by a factor of 1000 or more.

Thanks for the pointer to the product description. Unfortunately it is basically a black box, the only thing that seems to be clear from the vendor description is that it uses a PCIe gen3 x16 link back to the host system. Since the host system already has a GPU installed, that would require a host system that provides at least 32 PCIe gen 3 lanes. Is that the case? Is the host a dual socket system with two CPUs?

I have basic knowledge about PCIe. This is a situation that seems to require a person with more in-depth PCIe knowledge to diagnose what is going on.


Thanks for the reply @Robert_Crovella. I looked over p2pBandwidthLatencyTest.cu, and it seems the difference between P2P=Enabled / P2P=Disabled as seen in my output above is:

When P2P=Enabled, it simply calls cudaDeviceEnablePeerAccess() before cudaMemcpyPeerAsync().

When P2P=Disabled, it calls cudaDeviceDisablePeerAccess() before cudaMemcpyPeerAsync().

There doesn’t seem to be much more to it.
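
To make that concrete, the pattern boils down to something like this condensed sketch (not the sample verbatim; the device pair 0/1, the 64 MiB buffer size, and the timedPeerCopy helper are my own illustration):

// Condensed sketch of the pattern described above: the only difference between
// the two measurements is whether peer access is enabled before the copy.
#include <cstdio>
#include <cuda_runtime.h>

// Helper (illustrative): time one cudaMemcpyPeerAsync of `bytes` on `stream`.
static float timedPeerCopy(void *dst, int dstDev, const void *src, int srcDev,
                           size_t bytes, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, stream);
    cudaEventRecord(stop, stream);
    cudaStreamSynchronize(stream);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int srcDev = 0, dstDev = 1;   // example pair
    const size_t bytes = 64 << 20;      // 64 MiB test buffer

    cudaSetDevice(srcDev);
    void *d_src; cudaMalloc(&d_src, bytes);
    cudaStream_t stream; cudaStreamCreate(&stream);
    cudaSetDevice(dstDev);
    void *d_dst; cudaMalloc(&d_dst, bytes);

    // "P2P=Enabled" path: enable peer access in both directions, then copy.
    cudaSetDevice(srcDev); cudaDeviceEnablePeerAccess(dstDev, 0);
    cudaSetDevice(dstDev); cudaDeviceEnablePeerAccess(srcDev, 0);
    cudaSetDevice(srcDev);
    float msOn = timedPeerCopy(d_dst, dstDev, d_src, srcDev, bytes, stream);

    // "P2P=Disabled" path: tear peer access back down and copy again.
    cudaSetDevice(srcDev); cudaDeviceDisablePeerAccess(dstDev);
    cudaSetDevice(dstDev); cudaDeviceDisablePeerAccess(srcDev);
    cudaSetDevice(srcDev);
    float msOff = timedPeerCopy(d_dst, dstDev, d_src, srcDev, bytes, stream);

    printf("P2P enabled : %.2f GB/s\n", bytes / (msOn  * 1e6));
    printf("P2P disabled: %.2f GB/s\n", bytes / (msOff * 1e6));
    return 0;   // (cleanup omitted for brevity)
}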

I modified my own code to mimic the behavior of the NVIDIA sample, and I can see that when I enable P2P, the call to cudaMemcpyPeerAsync() does not block but my call to cudaStreamSynchronize() does. When I disable P2P, the call to cudaMemcpyPeerAsync() effectively DOES block: according to top, the biggest consumer of CPU is my main thread, which I assume comes from cudaMemcpyPeerAsync() spinning. I also see the same throughput numbers as the sample…
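
The blocking check itself is just wall-clock timing around the two calls, roughly like this (a fragment meant to drop into the sketch above; buffer/stream setup omitted):

// Rough sketch of the "does the async call return immediately?" check:
// compare how long cudaMemcpyPeerAsync() itself takes vs. the synchronize.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double msSince(std::chrono::steady_clock::time_point t0)
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now() - t0).count();
}

void probeCopyBlocking(void *d_dst, int dstDev, const void *d_src, int srcDev,
                       size_t bytes, cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyPeerAsync(d_dst, dstDev, d_src, srcDev, bytes, stream);
    double issueMs = msSince(t0);    // time spent inside the API call itself

    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);
    double syncMs = msSince(t1);     // time spent waiting for the copy to finish

    // If issueMs is close to the full transfer time, the "async" call is
    // effectively blocking (and burning CPU in the calling thread).
    printf("issue: %.2f ms, sync: %.2f ms\n", issueMs, syncMs);
}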

I came across another thread with @Robert_Crovella and the solution was to disable ACS in my SuperMicro BIOS.