Multi-GPU P2P access: weird result

Hello Forum,

I want to use multi-GPU P2P access across a set of four Tesla K80 cards, together with UVA. My program needs P2P access between GPU0 and all the other GPUs. Unfortunately, this can’t be done and I don’t know why… If someone could explain what’s happening here, that would be helpful.

Here is my execution of the simpleP2P example.

This shows that the first two K80 cards (GPUs 0, 1, 2, 3) have P2P access among themselves, and that the other two K80 cards (GPUs 4, 5, 6, 7) have P2P access among themselves too. But there is no P2P access across the two groups! That is strange, considering that all four cards are installed in the same server…

$ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
> GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU2 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU3 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU4 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU5 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU6 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU7 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : Yes
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : Yes
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU4) : No
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU5) : No
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU6) : No
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU7) : No
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU4) : No
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU5) : No
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU6) : No
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU7) : No
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : Yes
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : Yes
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU4) : No
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU5) : No
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU6) : No
> Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU7) : No
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : Yes
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : Yes
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU4) : No
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU5) : No
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU6) : No
> Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU7) : No
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU0) : No
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU1) : No
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU2) : No
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU3) : No
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU5) : Yes
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU6) : Yes
> Peer access from Tesla K80 (GPU4) -> Tesla K80 (GPU7) : Yes
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU0) : No
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU1) : No
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU2) : No
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU3) : No
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU4) : Yes
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU6) : Yes
> Peer access from Tesla K80 (GPU5) -> Tesla K80 (GPU7) : Yes
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU0) : No
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU1) : No
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU2) : No
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU3) : No
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU4) : Yes
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU5) : Yes
> Peer access from Tesla K80 (GPU6) -> Tesla K80 (GPU7) : Yes
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU0) : No
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU1) : No
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU2) : No
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU3) : No
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU4) : Yes
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU5) : Yes
> Peer access from Tesla K80 (GPU7) -> Tesla K80 (GPU6) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K80 (GPU0) supports UVA: Yes
> Tesla K80 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 7.42GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

Peer-to-peer requires that the participating GPUs are on the same PCIe root complex. Each x86 CPU provides its own PCIe root complex, so I would hypothesize that this is a dual-CPU machine, where GPUs 0-3 are coupled to one CPU, and GPUs 4-7 are coupled to the other CPU.

I will ask about that. In that case, would the only way to work with all the GPUs at the same time be to use MPI?

I have no experience with your kind of hardware setup, but I believe that if direct communication between GPUs across PCIe is not possible, there is a fallback path that moves the data through a host buffer (so GPUx -> CPU -> GPUy). Obviously that results in lower performance.

Some of the forum participants here have experience with high-end systems such as yours; my recommendation would be to wait for knowledgeable comments from them.

No, it’s not correct to say “the only way to work with all the GPUs would be using MPI”.

There are no significant issues with working with 2, 4, or 8 GPUs in your setup.

Yes, P2P will not work between every arbitrary pair of GPUs. But if you equate that with an inability to work with all the GPUs, then you simply don’t understand one or both of the following:

  1. How to work with multiple GPUs (e.g. see the simpleMultiGPU or cudaOpenMP sample codes, neither of which depends on P2P)
  2. How P2P works, and what it means.

Since these topics have been covered extensively elsewhere, I’m not going to cover that ground. Feel free to use your google-fu.

Yes, sorry I made a mistake with that statement.

So P2P won’t work between some GPUs, and between others it will…

I know how to work with multiple GPUs; in fact I’m using P2P+UVA and OpenMP to do so. And yes, there is a significant difference between working with 2 and with 8, because in my program the speedup gets better with more GPUs working. Now, P2P makes the coding easier, and also the communication between GPUs. That said, it would be easier if all the GPUs were on the same PCIe root complex, so they could all work with P2P and UVA.

I don’t really know what you’re trying to say there, but I would agree that many programs will benefit by using more GPUs, and there shouldn’t be much preventing you (no significant issues) from doing that in your setup.

That particular issue is not something that you’re going to be able to solve with software. As njuffa said already, it’s a hardware (topology) issue associated with the platform that you have these GPUs plugged into.

Yes I know, but my question here would be:

How can I know if the participating GPUs are on the same PCIe root complex or not?

I’ve got the topology tree with lspci -t -v. Are the GPUs on the same PCIe root complex?

$ lspci -t -v
-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface
 |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface
 |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 ERROR Registers
 |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 ERROR Registers
 |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 ERROR Registers
 |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 ERROR Registers
 |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 +-[0000:80]-+-02.0-[81-84]----00.0-[82-84]--+-08.0-[83]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[84]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           +-03.0-[85-88]----00.0-[86-88]--+-08.0-[87]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[88]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management
 |           +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
 |           +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors
 |           \-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
 +-[0000:7f]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring
 |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers
 |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent
 |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers
 |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface
 |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface
 |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers
 |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 ERROR Registers
 |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 ERROR Registers
 |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1
 |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 ERROR Registers
 |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 ERROR Registers
 |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3
 |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
             +-01.0-[01]--
             +-01.1-[02]----00.0  ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
             +-02.0-[03-06]----00.0-[04-06]--+-08.0-[05]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             |                               \-10.0-[06]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             +-03.0-[07-0a]----00.0-[08-0a]--+-08.0-[09]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             |                               \-10.0-[0a]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management
             +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
             +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors
             +-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
             +-11.0  Intel Corporation C610/X99 series chipset SPSR
             +-11.4  Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode]
             +-14.0  Intel Corporation C610/X99 series chipset USB xHCI Host Controller
             +-16.0  Intel Corporation C610/X99 series chipset MEI Controller #1
             +-16.1  Intel Corporation C610/X99 series chipset MEI Controller #2
             +-1a.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2
             +-1b.0  Intel Corporation C610/X99 series chipset HD Audio Controller
             +-1c.0-[0b]--
             +-1c.2-[0c]----00.0  Intel Corporation I210 Gigabit Network Connection
             +-1c.3-[0d]----00.0  Intel Corporation I210 Gigabit Network Connection
             +-1c.4-[0e]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
             +-1c.7-[0f-10]----00.0-[10]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
             +-1d.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1
             +-1f.0  Intel Corporation C610/X99 series chipset LPC Controller
             +-1f.2  Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation C610/X99 series chipset SMBus Controller

The test you’ve run is already a good one for demonstrating that. If they are on the same root complex, they will be able to establish P2P access with each other.

You could also study your motherboard documentation. It may provide such topology information.

There are also other tools you can use to discover it, such as lspci, lstopo (part of hwloc), and nvidia-smi.
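For the lspci route, here is a rough sketch of how the tree posted above could be read mechanically. It assumes that each top-level [domain:bus] entry in the lspci -t -v output is a separate root bus, and that on this dual-socket Intel platform each socket exposes its own root bus. That is a platform-specific assumption, not something lspci guarantees. The function name gpus_per_root_bus and the abridged sample are made up for illustration; the sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# A root bus shows up in `lspci -t` as "[domain:bus]", e.g. "[0000:80]".
# Nested bridge ranges such as "[81-84]" or "[83]" contain no colon,
# so they do not match this pattern.
ROOT_BUS = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{2})\]")


def gpus_per_root_bus(lspci_tree, vendor="NVIDIA"):
    """Count devices whose description mentions `vendor` under each root bus."""
    counts = defaultdict(int)
    current = None
    for line in lspci_tree.splitlines():
        match = ROOT_BUS.search(line)
        if match:
            current = match.group(1)  # a new root bus starts on this line
        if current is not None and vendor in line:
            counts[current] += 1
    return dict(counts)


# Abridged copy of the tree posted above (most non-GPU lines dropped).
LSPCI_SAMPLE = r"""
-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 +-[0000:80]-+-02.0-[81-84]----00.0-[82-84]--+-08.0-[83]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[84]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           +-03.0-[85-88]----00.0-[86-88]--+-08.0-[87]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[88]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
             +-02.0-[03-06]----00.0-[04-06]--+-08.0-[05]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             |                               \-10.0-[06]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             +-03.0-[07-0a]----00.0-[08-0a]--+-08.0-[09]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             |                               \-10.0-[0a]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
"""

if __name__ == "__main__":
    print(gpus_per_root_bus(LSPCI_SAMPLE))
```

Four GPUs under 0000:80 and four under 0000:00 is consistent with the two P2P groups that simpleP2P reported. Note that this only counts GPUs per root bus; it does not tell you which CUDA device index maps to which bus.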

For nvidia-smi try:

nvidia-smi topo -h

to get started. A possible command option might be:

nvidia-smi topo -m

Connections labelled SOC indicate that the path between those GPUs involves a socket-level link, which means those GPUs are on separate PCIe root complexes. The other connection types (PHB, PXB, PIX) all indicate connections that should support P2P.

Yes, you are right…

The results are:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity
GPU0	 X 	PIX	PHB	PHB	SOC	SOC	SOC	SOC	0-9
GPU1	PIX	 X 	PHB	PHB	SOC	SOC	SOC	SOC	0-9
GPU2	PHB	PHB	 X 	PIX	SOC	SOC	SOC	SOC	0-9
GPU3	PHB	PHB	PIX	 X 	SOC	SOC	SOC	SOC	0-9
GPU4	SOC	SOC	SOC	SOC	 X 	PIX	PHB	PHB	10-19
GPU5	SOC	SOC	SOC	SOC	PIX	 X 	PHB	PHB	10-19
GPU6	SOC	SOC	SOC	SOC	PHB	PHB	 X 	PIX	10-19
GPU7	SOC	SOC	SOC	SOC	PHB	PHB	PIX	 X 	10-19

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
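
Applying the rule stated earlier in the thread (SOC means separate root complexes, while PHB, PXB and PIX should support P2P) to this matrix can be sketched as a small script. This is only an illustration: the function name p2p_groups is made up, the label names are taken from the legend above (other driver versions may print different labels), and it treats P2P reachability as transitive, which holds for this matrix but is an assumption in general.

```python
# Link labels that do NOT allow P2P in this sketch: "X" is the GPU itself,
# "SOC" crosses a socket-level link (per the legend above).
NON_P2P = {"X", "SOC"}


def p2p_groups(topo_matrix):
    """Split the GPUs of an `nvidia-smi topo -m` matrix into P2P groups."""
    lines = [ln for ln in topo_matrix.strip().splitlines() if ln.strip()]
    # Header row: keep only the GPU column names, drop "CPU Affinity".
    names = [field for field in lines[0].split() if field.startswith("GPU")]

    peers = {}
    for line in lines[1:]:
        fields = line.split()
        links = fields[1:1 + len(names)]  # one link label per GPU column
        peers[fields[0]] = {
            name for name, link in zip(names, links) if link not in NON_P2P
        }

    # Connected components over the "should support P2P" relation.
    groups, seen = [], set()
    for name in names:
        if name in seen:
            continue
        stack, group = [name], set()
        while stack:
            gpu = stack.pop()
            if gpu in group:
                continue
            group.add(gpu)
            stack.extend(peers.get(gpu, ()))
        seen |= group
        groups.append(sorted(group))
    return groups


# The matrix posted above, with whitespace-separated columns.
TOPO_SAMPLE = """
     GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X    PIX  PHB  PHB  SOC  SOC  SOC  SOC  0-9
GPU1 PIX  X    PHB  PHB  SOC  SOC  SOC  SOC  0-9
GPU2 PHB  PHB  X    PIX  SOC  SOC  SOC  SOC  0-9
GPU3 PHB  PHB  PIX  X    SOC  SOC  SOC  SOC  0-9
GPU4 SOC  SOC  SOC  SOC  X    PIX  PHB  PHB  10-19
GPU5 SOC  SOC  SOC  SOC  PIX  X    PHB  PHB  10-19
GPU6 SOC  SOC  SOC  SOC  PHB  PHB  X    PIX  10-19
GPU7 SOC  SOC  SOC  SOC  PHB  PHB  PIX  X    10-19
"""

if __name__ == "__main__":
    for group in p2p_groups(TOPO_SAMPLE):
        print(group)
```

For this matrix the sketch yields two groups, GPU0 to GPU3 and GPU4 to GPU7, which matches both the simpleP2P output and the CPU affinity column (cores 0-9 versus 10-19).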