P2P Transfers Across Single PCIe Switch Fail

I currently have a setup with 2 GPUs on the same PCIe switch. I confirmed this by running nvidia-smi topo -m, which reports “PIX” as the connection between the two GPUs.

GPU0    GPU1    CPU Affinity                                           
GPU0     X      PIX     0-17                                                   
GPU1    PIX      X      0-17  
Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                       
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)             
  PIX  = Connection traversing a single PCIe switch                                                         
  NV#  = Connection traversing a bonded set of # NVLinks

I then built and ran simpleP2P from the CUDA samples.

[./simpleP2P] - Starting...               
Checking for multiple GPUs...             
CUDA-capable device count: 2              
> GPU0 = "Tesla V100-PCIE-32GB" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla V100-PCIE-32GB" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...                                      
Checking GPU0 and GPU1 for UVA capabilities...                                     
> Tesla V100-PCIE-32GB (GPU0) supports UVA: Yes                                    
> Tesla V100-PCIE-32GB (GPU1) supports UVA: Yes                                    
Both GPUs can support UVA, enabling...                                             
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...                            
Creating event handles...                                                          
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s                        
Preparing host buffer and memcpy to GPU0...                                        
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...            
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...            
Copy data back to host from GPU0 and verify results...                             
Verification error @ element 0: val = nan, ref = 0.000000                          
Verification error @ element 1: val = nan, ref = 4.000000                          
Verification error @ element 2: val = nan, ref = 8.000000                          
Verification error @ element 3: val = nan, ref = 12.000000                         
Verification error @ element 4: val = nan, ref = 16.000000                         
Verification error @ element 5: val = nan, ref = 20.000000                         
Verification error @ element 6: val = nan, ref = 24.000000                         
Verification error @ element 7: val = nan, ref = 28.000000                         
Verification error @ element 8: val = nan, ref = 32.000000                         
Verification error @ element 9: val = nan, ref = 36.000000                         
Verification error @ element 10: val = nan, ref = 40.000000                        
Verification error @ element 11: val = nan, ref = 44.000000                        
Disabling peer access...                                                           
Shutting down...                                                                   
Test failed!

So everything looks fine and no CUDA errors are reported, but in the end no data is actually transferred and the verification fails. What could be causing this? When I move a GPU to a different slot, so that nvidia-smi topo reports a “NODE” connection instead, the transfer works.
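For reference, the failing flow boils down to something like the following minimal sketch (my own simplification, not the actual simpleP2P source): check peer capability, enable peer access in both directions, copy from GPU0 to GPU1 with cudaMemcpyPeer, then copy the result back to the host and verify it. As with the sample, a failure like the one above would only show up in the final host-side check, not as a CUDA error code.

// p2p_check.cu -- minimal P2P copy-and-verify sketch (not the simpleP2P source).
// Build with: nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "%s failed: %s\n", #call,                       \
                    cudaGetErrorString(err_));                              \
            return 1;                                                       \
        }                                                                   \
    } while (0)

int main() {
    const int src = 0, dst = 1;           // GPU0 -> GPU1
    const size_t n = 1 << 20;             // 1M floats (~4 MB)
    const size_t bytes = n * sizeof(float);

    // Check whether each device can access the other's memory.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, src, dst));
    CHECK(cudaDeviceCanAccessPeer(&can10, dst, src));
    printf("P2P %d->%d: %s, %d->%d: %s\n",
           src, dst, can01 ? "yes" : "no", dst, src, can10 ? "yes" : "no");

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(src));
    CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaDeviceEnablePeerAccess(src, 0));

    // Fill a buffer on GPU0 with a known pattern.
    std::vector<float> host(n);
    for (size_t i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&d_src, bytes));
    CHECK(cudaMemcpy(d_src, host.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&d_dst, bytes));

    // Device-to-device copy across the PCIe switch.
    CHECK(cudaMemcpyPeer(d_dst, dst, d_src, src, bytes));

    // Copy back from GPU1 and verify; a "silent" P2P failure shows up here,
    // not as a CUDA error code.
    std::vector<float> check(n, -1.0f);
    CHECK(cudaMemcpy(check.data(), d_dst, bytes, cudaMemcpyDeviceToHost));
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (check[i] != host[i]) ++bad;
    printf("%zu mismatching elements\n", bad);

    CHECK(cudaFree(d_dst));
    CHECK(cudaSetDevice(src));
    CHECK(cudaFree(d_src));
    return bad == 0 ? 0 : 1;
}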

If the GPUs are installed in a system that the OEM has certified for them, you should probably just bring this to the OEM's attention and ask for help.

It’s a system issue. PCIe switches can be misconfigured by the system BIOS, preventing proper P2P activity. Without knowing the system or operating system, not much else can be said. Even if you provide that info, there probably wouldn’t be much I could say. Recommendations:

  1. Make sure you are using a system that is certified by the OEM for the GPUs you have installed.
  2. Make sure you are using a proper CUDA system configuration. The supported configurations are listed in the Linux install guide.
  3. Use the latest CUDA version and GPU driver (a quick way to confirm the versions in use is sketched below).
  4. Make sure your system is updated to the latest BIOS image offered by the OEM for that system.

If none of that helps, consult your system OEM. No CUDA configuration or system settings should be needed to make this work in a properly certified OEM system.
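Regarding item 3, a quick way to confirm which CUDA driver and runtime versions are actually in use is to query them programmatically (a minimal sketch; nvidia-smi reports the driver version as well):

// cuda_versions.cu -- print the CUDA driver and runtime versions in use.
// Build with: nvcc -o cuda_versions cuda_versions.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime this binary was built against
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10,
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    return 0;
}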

Thank you for the help. The problem was an outdated BIOS; installing the latest BIOS solved it.

Hi, sorry for the naive question. I was studying NVLink’s advantages over PCIe. Given that data can be transferred directly between two GPUs over PCIe, as we see above, what is the advantage of NVLink beyond higher bandwidth?

I think the primary motivation for NVLink is higher bandwidth. That is normally what people using GPUs for compute purposes are interested in: the primary purpose of a compute/CUDA GPU is to make code run faster. I know of no other purpose.


I have seen claims (e.g. here) that NVLink offers reduced latency compared to PCIe in addition to improved bandwidth. I am not an expert in interconnects, but I suspect that for latency comparisons it matters which versions of PCIe and NVLink are being compared, as this is still a fairly rapidly evolving field (see the recent release of the first draft of PCIe Gen 7). A quick internet search shows some relevant publications, e.g.

Ang Li, et al., “Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect.” IEEE Transactions on Parallel and Distributed Systems, Vol. 31, No. 1, Jul. 2019, pp. 94-110 (preprint on arXiv).

The latency advantage of NVLink may be fairly small, as suggested by this data:

Christopher M. Siefert, et al., “Latency and Bandwidth Microbenchmarks of US Department of Energy Systems in the June 2023 Top500 List.” In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Nov. 2023. pp. 1298–1305 (online):

Device to device transfer latency is roughly 25 µs via the NVLink connections on the V100 and about 2 µs slower on the non-NVLink connections.

Generally speaking, reducing latency is hard. An age-old engineering adage states: “You can pay for higher bandwidth, but latency is forever”. Again, not my area of expertise, but my understanding is that in the context of supercomputers, even small reductions in the latency of both inter-node interconnects (e.g. InfiniBand) and intra-node interconnects (e.g. NVLink) can lead to meaningful improvements in overall system performance.
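To get a rough feel for the numbers on a given system, one can time many tiny peer copies with CUDA events and take the average. This is only a crude estimate that includes per-call API overhead, unlike the dedicated microbenchmarks in the papers cited above, and it assumes two P2P-capable GPUs installed as devices 0 and 1 (a minimal sketch):

// p2p_latency.cu -- crude device-to-device small-copy latency estimate.
// Build with: nvcc -o p2p_latency p2p_latency.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int iters = 1000;
    const size_t bytes = 4;   // tiny transfer so latency, not bandwidth, dominates

    // Enable peer access in both directions and allocate one word per GPU.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    float *src = nullptr;
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    float *dst = nullptr;
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeer(dst, 1, src, 0, bytes);  // warm-up copy
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("~%.2f us per 4-byte peer copy (includes API overhead)\n",
           ms * 1000.0f / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}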
