P2P: How do I know if cudaMemcpy falls back to non-P2P?

I have a setup with 5x Tesla V100 in a PCIe (gen3) system, and I’m seeing really low bandwidth numbers with p2pBandwidthLatencyTest. I’ve also written my own CUDA app to experiment with various uses of cudaMemcpy()/cudaMemcpyAsync() with streams, etc.

cudaDeviceCanAccessPeer() reports that the GPUs can access each other as peers, and cudaDeviceEnablePeerAccess() doesn’t fail.

Is there any way to see through the API whether a transfer actually went peer-to-peer?

Note: I have 4 GPUs in a PCIe expansion chassis, and GPU4 sits on the SuperMicro motherboard itself. What’s bizarre is that GPU4 gets better rates to the other GPUs than they get among themselves (which leads me to suspect it’s not P2P), even though P2P claims to be enabled.
And when P2P is DISABLED, they seem to pull proper rates???
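
For reference, the peer-access checks I’m describing look roughly like this minimal sketch (not my actual app; the device pair 0/4 and the use of cudaDeviceGetP2PAttribute are just illustrative):

// Minimal sketch: query the runtime's view of P2P support before enabling peer access.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int src = 0, dst = 4;   // example pair, adjust to your topology

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    printf("Device %d can access device %d: %d\n", src, dst, canAccess);

    // cudaDeviceGetP2PAttribute gives slightly more detail than the boolean.
    int perfRank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank, src, dst);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst);
    printf("performance rank: %d, native atomics: %d\n", perfRank, atomics);

    // Enabling peer access only succeeds if the driver believes P2P works;
    // it does not prove the fabric actually routes traffic peer-to-peer.
    cudaSetDevice(src);
    cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
    printf("cudaDeviceEnablePeerAccess: %s\n", cudaGetErrorString(err));
    return 0;
}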

Relevant output from nvidia-smi topo -m and p2pBandwidthLatencyTest below.

[root@localhost RDMADriver]# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    CPU Affinity    NUMA Affinity
GPU0     X      PIX     PXB     PXB     SYS     0-7,16-23       0
GPU1    PIX      X      PXB     PXB     SYS     0-7,16-23       0
GPU2    PXB     PXB      X      PIX     SYS     0-7,16-23       0
GPU3    PXB     PXB     PIX      X      SYS     0-7,16-23       0
GPU4    SYS     SYS     SYS     SYS      X      8-15,24-31      1
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

[root@localhost p2pBandwidthLatencyTest]# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-PCIE-32GB, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-PCIE-32GB, pciBusID: 26, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-PCIE-32GB, pciBusID: 2f, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-PCIE-32GB, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla V100-PCIE-32GB, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4
     0       1     1     1     1     1
     1       1     1     1     1     1
     2       1     1     1     1     1
     3       1     1     1     1     1
     4       1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 718.06   9.02   9.36   9.10   5.85
     1   8.83 727.46   9.39   9.07   5.85
     2   8.98   9.15 726.07   9.14   5.87
     3   9.02   9.17   9.30 730.14   5.87
     4   5.76   5.91   5.90   5.89 735.64
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4
     0 720.71   0.72   0.72   0.72   8.65
     1   0.62 732.88   0.72   0.72   8.29
     2   0.62   0.72 730.14   0.72   8.62
     3   0.55   0.62   0.72 737.03   8.21
     4   8.67   8.66   8.67   7.67 737.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 721.38   9.98   9.95   9.90   9.83
     1  10.03 727.42   9.80   9.81   9.75
     2   9.97   9.75 730.82   9.40   9.74
     3   9.94   9.78   9.21 732.88   9.80
     4   9.88  10.09  10.13  10.18 751.20
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4
     0 725.39   1.02   1.30   1.30  17.17
     1   1.30 735.64   1.30   1.30  17.15
     2   1.30   1.30 734.95   1.30  17.17
     3   1.30   1.30   1.30 735.64  17.19
     4  17.18  17.22  17.20  17.20 748.32
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4
     0   2.66  25.82  23.53  24.35  25.51
     1  28.52   2.53  24.22  24.10  24.81
     2  24.41  24.10   2.28  22.31  22.98
     3  24.55  24.82  22.57   2.29  23.65
     4  24.64  24.55  22.55  22.31   1.76

   CPU     0      1      2      3      4
     0  10.15  27.46  25.70  25.35  22.43
     1  24.16   8.28  23.79  23.62  20.72
     2  21.61  22.14   7.28  22.07  19.50
     3  20.38  21.91  21.00   6.77  19.14
     4  17.93  19.39  18.91  18.39   5.51
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4
     0   2.66 49252.33 49252.37 49252.36   2.71
     1 49256.18   2.52 49252.68 49252.67   2.71
     2 49252.71 49252.60   2.28 49252.59   2.46
     3 49256.04 49252.62 49252.63   2.28   2.46
     4   2.82   2.80   2.53   2.53   2.10

   CPU     0      1      2      3      4
     0   5.68   4.71   3.81   3.74   3.35
     1   3.73   4.56   3.65   4.12   4.19
     2   4.00   3.67   4.23   3.59   3.91
     3   4.36   4.14   3.81   4.38   3.63
     4   3.18   3.16   3.34   3.25   3.98

What are the specifications of the PCIe slots involved? Full transfer speed (~12 GB/sec per direction) requires PCIe gen3 x16 links. What kind of PCIe expansion chassis is this? Is it based on some sort of PCIe bridge chip? What are its specifications? What kind of link connects the PCIe expansion chassis to the host system? Is it a PCIe gen3 x16 link?

Thanks for your reply, njuffa. This is the expansion chassis in question: https://www.onestopsystems.com/product/expressbox-3600-10

I’m seeing some people reporting issues with PLX switches and ACS; I’ll look closer at that…

Considering the various datapoints:

  1. The same P2P test reports some better results and some worse results
  2. The worse datapoints are localized to traffic on the expansion chassis
  3. The latency measurements are also bad (on the order of 50 milliseconds)

It seems evident that the expansion chassis has an issue with P2P traffic. I’m not able to diagnose it here. My suggestion would be to contact the manufacturer of the expansion chassis and report your observations.

You have the source code for the tests you’ve run, and it’s not reasonable to conclude that the test is somehow doing something different at the API level when it is setting up a test between GPU 4 and GPU 0 vs. when it is setting up a test between GPU 3 and GPU 0. That last part is just my opinion.

Supermicro, for example, has many system boards that use PCIe switches, probably similar to what your expansion chassis uses to connect slots together, and they work fine with these tests.

I’m not aware of any further way to confirm P2P beyond using the appropriate API (e.g. cudaMemcpyPeerAsync), and even if there were a fallback from P2P to non-P2P, it wouldn’t explain any of the datapoints I see here. A fallback would not cause the transfer bandwidth to drop from ~8 GB/s to ~0.7 GB/s, nor would it cause latency to increase by a factor of 1000 or more.

Thanks for the pointer to the product description. Unfortunately it is basically a black box, the only thing that seems to be clear from the vendor description is that it uses a PCIe gen3 x16 link back to the host system. Since the host system already has a GPU installed, that would require a host system that provides at least 32 PCIe gen 3 lanes. Is that the case? Is the host a dual socket system with two CPUs?

I have basic knowledge about PCIe. This is a situation that seems to require a person with more in-depth PCIe knowledge to diagnose what is going on.


Thanks for the reply @Robert_Crovella. I looked over p2pBandwidthLatencyTest.cu, and it seems the difference between P2P=Enabled / P2P=Disabled as seen in my output above is:

When P2P=Enabled, it simply calls cudaDeviceEnablePeerAccess() before cudaMemcpyPeerAsync().

When P2P=Disabled, it calls cudaDeviceDisablePeerAccess() before cudaMemcpyPeerAsync().

There doesn’t seem to be much more to it.
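
To make that concrete, the pattern boils down to something like this condensed sketch (not the sample verbatim; the device pair 0/1, the 64 MiB buffer size, and the timedPeerCopy helper are my own illustration):

// Condensed sketch of the pattern described above: the only difference between
// the two measurements is whether peer access is enabled before the copy.
#include <cstdio>
#include <cuda_runtime.h>

// Helper (illustrative): time one cudaMemcpyPeerAsync of `bytes` on `stream`.
static float timedPeerCopy(void *dst, int dstDev, const void *src, int srcDev,
                           size_t bytes, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, stream);
    cudaEventRecord(stop, stream);
    cudaStreamSynchronize(stream);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int srcDev = 0, dstDev = 1;   // example pair
    const size_t bytes = 64 << 20;      // 64 MiB test buffer

    cudaSetDevice(srcDev);
    void *d_src; cudaMalloc(&d_src, bytes);
    cudaStream_t stream; cudaStreamCreate(&stream);
    cudaSetDevice(dstDev);
    void *d_dst; cudaMalloc(&d_dst, bytes);

    // "P2P=Enabled" path: enable peer access in both directions, then copy.
    cudaSetDevice(srcDev); cudaDeviceEnablePeerAccess(dstDev, 0);
    cudaSetDevice(dstDev); cudaDeviceEnablePeerAccess(srcDev, 0);
    cudaSetDevice(srcDev);
    float msOn = timedPeerCopy(d_dst, dstDev, d_src, srcDev, bytes, stream);

    // "P2P=Disabled" path: tear peer access back down and copy again.
    cudaSetDevice(srcDev); cudaDeviceDisablePeerAccess(dstDev);
    cudaSetDevice(dstDev); cudaDeviceDisablePeerAccess(srcDev);
    cudaSetDevice(srcDev);
    float msOff = timedPeerCopy(d_dst, dstDev, d_src, srcDev, bytes, stream);

    printf("P2P enabled : %.2f GB/s\n", bytes / (msOn  * 1e6));
    printf("P2P disabled: %.2f GB/s\n", bytes / (msOff * 1e6));
    return 0;   // (cleanup omitted for brevity)
}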

I modified my own code to mimic the behavior of the NVIDIA sample, and I can see that when I enable P2P, the call to cudaMemcpyPeerAsync() does not block but my call to cudaStreamSynchronize() does. When I disable P2P, the call to cudaMemcpyPeerAsync() effectively DOES block: according to top, the biggest consumer of CPU is my main thread, which I assume comes from cudaMemcpyPeerAsync() spinning. I also see the same throughput numbers as the sample…
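
The blocking check itself is just wall-clock timing around the two calls, roughly like this (a fragment meant to drop into the sketch above; buffer/stream setup omitted):

// Rough sketch of the "does the async call return immediately?" check:
// compare how long cudaMemcpyPeerAsync() itself takes vs. the synchronize.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double msSince(std::chrono::steady_clock::time_point t0)
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now() - t0).count();
}

void probeCopyBlocking(void *d_dst, int dstDev, const void *d_src, int srcDev,
                       size_t bytes, cudaStream_t stream)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyPeerAsync(d_dst, dstDev, d_src, srcDev, bytes, stream);
    double issueMs = msSince(t0);    // time spent inside the API call itself

    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);
    double syncMs = msSince(t1);     // time spent waiting for the copy to finish

    // If issueMs is close to the full transfer time, the "async" call is
    // effectively blocking (and burning CPU in the calling thread).
    printf("issue: %.2f ms, sync: %.2f ms\n", issueMs, syncMs);
}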

I came across another thread with @Robert_Crovella and the solution was to disable ACS in my SuperMicro BIOS.