Peer-to-Peer Access Fails between 2 GPUs

I have installed two P100s in my machine, but they cannot access each other's memory directly.

My GPUs are supposedly attached to the same CPU socket:

nvidia-smi topo -m
GPU0 GPU1 CPU Affinity
GPU0 X SOC 0-7,16-23
GPU1 SOC X 8-15,24-31

Legend:

X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

But they have no peer access to each other:

/usr/local/cuda/samples/0_Simple/simpleP2P/simpleP2P
[/usr/local/cuda/samples/0_Simple/simpleP2P/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = “Tesla P100-PCIE-16GB” IS capable of Peer-to-Peer (P2P)
GPU1 = “Tesla P100-PCIE-16GB” IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : No
Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for /usr/local/cuda/samples/0_Simple/simpleP2P/simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

When you see SOC in the topology matrix, it means the path between those two GPUs traverses the socket-level link between CPUs (e.g. QPI) — in other words, the GPUs are attached to different CPU sockets.

P2P transfers cannot traverse QPI. So the problem is not in the GPUs or the software, but in the system you have them plugged into: one GPU is connected to one CPU socket and the other GPU to the other socket, and that topology does not support P2P.

You cannot fix this except by changing the hardware configuration, e.g. moving both GPUs onto PCIe slots attached to the same CPU socket.
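If you want to check this programmatically rather than rely on the simpleP2P sample, a minimal sketch using the standard CUDA runtime call `cudaDeviceCanAccessPeer` (essentially what simpleP2P does internally) would look like this. Compile with `nvcc`; the results naturally depend on the machine it runs on:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // Reports 1 only if device i can map device j's memory;
            // on a SOC (cross-socket) topology this comes back 0.
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j, canAccess ? "Yes" : "No");
        }
    }
    return 0;
}
```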

Thanks for your post, txbob. According to my machine maker, my GPUs were installed on the same CPU socket, but they were not. I will check them again.

I solved the problem. My server is a Supermicro 1028GR-TR, and its manual (MNL-1625.pdf) has a wrong CPU-socket description in Figure 6-5 on manual page 6-5 (PDF page 77). The figure shows GPU slots 1 and 2 connected to CPU 1, but this is wrong: GPU slots 1 and 4 are actually connected to CPU 2, and GPU slot 2 is connected to CPU 1. The manufacturer confirmed this. I had initially installed my GPUs in slots 1 and 2 as the manual describes, and they did not get P2P access. I moved one GPU from slot 2 to slot 4, so my GPUs are now in slots 1 and 4, and they have P2P access.

I reported the manual error to the manufacturer, but I am posting this here in case someone else runs into the same problem.

My P2P works now!

nvidia-smi topo -m
GPU0 GPU1 CPU Affinity
GPU0 X PHB 8-15,24-31
GPU1 PHB X 8-15,24-31

Legend:

X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

[/usr/local/cuda/samples/0_Simple/simpleP2P/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = “Tesla P100-PCIE-16GB” IS capable of Peer-to-Peer (P2P)
GPU1 = “Tesla P100-PCIE-16GB” IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : Yes
Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
Tesla P100-PCIE-16GB (GPU0) supports UVA: Yes
Tesla P100-PCIE-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 9.30GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Disabling peer access…
Shutting down…
Test passed
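For anyone who wants to use P2P in their own code once the topology is right, a minimal sketch of the pattern the sample exercises — enable peer access in both directions, then copy directly between the devices with `cudaMemcpyPeer` (buffer size chosen to match the sample's 64 MB; compile with `nvcc`, requires two P2P-capable GPUs):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;  // 64 MB, as in simpleP2P
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("No P2P between GPU0 and GPU1\n");
        return 1;
    }

    // Peer access is per-direction and applies to the current device.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // With peer access enabled, this copy goes directly over PCIe
    // between the GPUs instead of staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    printf("Peer copy done\n");
    return 0;
}
```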