I have a setup with 5x Tesla V100 in a PCIe (gen3) system, and I'm seeing really low bandwidth numbers from p2pBandwidthLatencyTest. I've also written my own CUDA app to experiment with various usages of cudaMemcpy()/cudaMemcpyAsync() with streams, etc.
cudaDeviceCanAccessPeer() reports that the GPUs can access each other, and cudaDeviceEnablePeerAccess() succeeds without error.
Is there any way to verify through the API that a transfer actually went peer-to-peer?
Note: four of the GPUs sit in a PCIe expansion chassis, while GPU4 is on the Supermicro motherboard itself. What's bizarre is that GPU4 gets better rates to the other four GPUs than they get between themselves when P2P is reportedly enabled, which leads me to suspect those transfers aren't actually going P2P.
And when P2P is DISABLED, all pairs seem to pull proper rates???
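For what it's worth, here is the kind of check I've tried so far: querying the runtime's own view of each pair with cudaDeviceGetP2PAttribute() rather than relying on cudaDeviceCanAccessPeer() alone. This is just a sketch; the attributes used (cudaDevP2PAttrAccessSupported, cudaDevP2PAttrPerformanceRank) are standard runtime attributes, but as far as I can tell none of them report whether a given copy actually took the P2P path.

```cpp
// Sketch: dump the runtime's P2P view for every GPU pair.
// Note this only reports *capability*, not whether a specific
// cudaMemcpy actually traveled peer-to-peer.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int access = 0, rank = 0;
            cudaDeviceGetP2PAttribute(&access, cudaDevP2PAttrAccessSupported,
                                      src, dst);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank,
                                      src, dst);
            printf("GPU%d -> GPU%d: accessSupported=%d perfRank=%d\n",
                   src, dst, access, rank);
        }
    }
    return 0;
}
```

The only place I've found that labels the actual transfer type is the profiler: nvprof / Nsight Systems show P2P copies as "[CUDA memcpy PtoP]" versus plain DtoD, which may be the closest thing to an answer to my question above.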
Relevant output from nvidia-smi topo -m and p2pBandwidthLatencyTest below.
[root@localhost RDMADriver]# nvidia-smi topo -m
       GPU0  GPU1  GPU2  GPU3  GPU4  CPU Affinity  NUMA Affinity
GPU0    X    PIX   PXB   PXB   SYS   0-7,16-23     0
GPU1   PIX    X    PXB   PXB   SYS   0-7,16-23     0
GPU2   PXB   PXB    X    PIX   SYS   0-7,16-23     0
GPU3   PXB   PXB   PIX    X    SYS   0-7,16-23     0
GPU4   SYS   SYS   SYS   SYS    X    8-15,24-31    1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
[root@localhost p2pBandwidthLatencyTest]# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-PCIE-32GB, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-PCIE-32GB, pciBusID: 26, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-PCIE-32GB, pciBusID: 2f, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-PCIE-32GB, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla V100-PCIE-32GB, pciBusID: 86, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4
0 718.06 9.02 9.36 9.10 5.85
1 8.83 727.46 9.39 9.07 5.85
2 8.98 9.15 726.07 9.14 5.87
3 9.02 9.17 9.30 730.14 5.87
4 5.76 5.91 5.90 5.89 735.64
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4
0 720.71 0.72 0.72 0.72 8.65
1 0.62 732.88 0.72 0.72 8.29
2 0.62 0.72 730.14 0.72 8.62
3 0.55 0.62 0.72 737.03 8.21
4 8.67 8.66 8.67 7.67 737.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4
0 721.38 9.98 9.95 9.90 9.83
1 10.03 727.42 9.80 9.81 9.75
2 9.97 9.75 730.82 9.40 9.74
3 9.94 9.78 9.21 732.88 9.80
4 9.88 10.09 10.13 10.18 751.20
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4
0 725.39 1.02 1.30 1.30 17.17
1 1.30 735.64 1.30 1.30 17.15
2 1.30 1.30 734.95 1.30 17.17
3 1.30 1.30 1.30 735.64 17.19
4 17.18 17.22 17.20 17.20 748.32
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4
0 2.66 25.82 23.53 24.35 25.51
1 28.52 2.53 24.22 24.10 24.81
2 24.41 24.10 2.28 22.31 22.98
3 24.55 24.82 22.57 2.29 23.65
4 24.64 24.55 22.55 22.31 1.76
CPU 0 1 2 3 4
0 10.15 27.46 25.70 25.35 22.43
1 24.16 8.28 23.79 23.62 20.72
2 21.61 22.14 7.28 22.07 19.50
3 20.38 21.91 21.00 6.77 19.14
4 17.93 19.39 18.91 18.39 5.51
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4
0 2.66 49252.33 49252.37 49252.36 2.71
1 49256.18 2.52 49252.68 49252.67 2.71
2 49252.71 49252.60 2.28 49252.59 2.46
3 49256.04 49252.62 49252.63 2.28 2.46
4 2.82 2.80 2.53 2.53 2.10
CPU 0 1 2 3 4
0 5.68 4.71 3.81 3.74 3.35
1 3.73 4.56 3.65 4.12 4.19
2 4.00 3.67 4.23 3.59 3.91
3 4.36 4.14 3.81 4.38 3.63
4 3.18 3.16 3.34 3.25 3.98
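In case it helps anyone reproduce this outside the sample, the core of my hand-rolled test boils down to something like the following single-pair timing loop (device indices, buffer size, and repeat count are arbitrary choices; error checking omitted for brevity):

```cpp
// Minimal single-pair P2P bandwidth check, assuming the two devices
// support peer access. Buffer size (256 MiB) and repeat count (20)
// are arbitrary test parameters, not anything mandated by CUDA.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;
    const size_t bytes = 256ull << 20;   // 256 MiB per copy
    const int reps = 20;

    // Enable peer access in both directions (flags must be 0).
    cudaSetDevice(src);
    cudaDeviceEnablePeerAccess(dst, 0);
    void *sbuf = nullptr, *dbuf = nullptr;
    cudaMalloc(&sbuf, bytes);
    cudaSetDevice(dst);
    cudaDeviceEnablePeerAccess(src, 0);
    cudaMalloc(&dbuf, bytes);

    // Time a burst of peer copies with CUDA events on the source device.
    cudaSetDevice(src);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dbuf, dst, sbuf, src, bytes);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU%d -> GPU%d: %.2f GB/s\n", src, dst,
           (double)bytes * reps / (ms / 1e3) / 1e9);

    cudaFree(sbuf);
    cudaFree(dbuf);
    return 0;
}
```

With P2P enabled this reproduces the ~0.7 GB/s numbers in the matrix above for chassis-to-chassis pairs, and disabling peer access (i.e., skipping cudaDeviceEnablePeerAccess and letting the copy stage through the host) brings them back to ~9 GB/s.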