Hello,
I am debugging P2P data access between two NVIDIA RTX 5000 Ada Generation Embedded GPUs that sit in a single PCIe domain behind a Microchip PFX Gen4 PCIe switch. The setup is a custom board plugged into a PCIe slot of an Intel-based motherboard.
With the NVIDIA 535.230 Linux graphics driver, the simpleP2P CUDA sample produces the output below. It reports that P2P access is available, but data verification fails.
$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU0) -> NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU1) : Yes
> Peer access from NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU1) -> NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 3.14GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!
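A note on reading the verification log: the reference values are 4x the element index, which is consistent with an input buffer holding the index and two kernel passes that each double the data. This is an assumption about the sample's logic, inferred from the log rather than quoted from its source. Under that assumption, an all-zero result buffer fails at every element except element 0, whose reference is also 0. So the log suggests the peer data never arrived, not that element 0 was transferred correctly. A minimal Python sketch of that reasoning, with hypothetical helper names:

```python
# Hypothetical model of the check behind the log above.
# Assumptions (not taken from the sample's source): input[i] == i for the
# indices shown in the log, and each of the two kernel passes doubles the data.

def expected_ref(i: int, passes: int = 2) -> float:
    """Reference value for element i after `passes` doubling kernels."""
    return float(i) * (2.0 ** passes)


def failing_elements(values, passes: int = 2):
    """Indices whose observed value differs from the modeled reference."""
    return [i for i, v in enumerate(values) if v != expected_ref(i, passes)]


# An all-zero buffer, matching the "val = 0.000000" lines in the log.
observed = [0.0] * 13
print(failing_elements(observed))
```

With 13 zeros as input, the sketch reports elements 1 through 12, the same indices as the log. Element 0 is absent only because its reference value is 0 as well.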
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       PIX     0-11            0               N/A
GPU1    PIX     X       0-11            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
$ nvidia-smi topo -p2p w
        GPU0    GPU1
GPU0    X       OK
GPU1    OK      X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
However, after I updated the NVIDIA driver to version 570.124, the output of simpleP2P is different:
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU0) -> NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU1) : No
> Peer access from NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU1) -> NVIDIA RTX 5000 Ada Generation Embedded GPU (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./p2p_test.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       PIX     0-11            0               N/A
GPU1    PIX     X       0-11            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
$ nvidia-smi topo -p2p w
        GPU0    GPU1
GPU0    X       CNS
GPU1    CNS     X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
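Apparently, driver v570 is better at detecting whether P2P access actually works: the topology is unchanged (PIX), but the pair is now reported as CNS (chipset not supported), which is consistent with the verification failure seen under v535.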
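To make the comparison between the two drivers explicit, the two `nvidia-smi topo -p2p w` matrices can be diffed mechanically. The following is a minimal Python sketch, not a tool that ships with the driver. The function names `parse_p2p_matrix` and `changed_pairs` are hypothetical, and the input strings are the matrices pasted above.

```python
# Minimal sketch: diff two `nvidia-smi topo -p2p w` matrices.
# Function names are hypothetical; the input is the text pasted in this post.

def parse_p2p_matrix(text: str) -> dict:
    """Parse the matrix part of the output into {(row_gpu, col_gpu): status}."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = rows[0]  # column labels, e.g. ["GPU0", "GPU1"]
    matrix = {}
    for row in rows[1:]:
        if row[0] == "Legend:":  # stop before the legend block
            break
        name, cells = row[0], row[1:]
        for col, status in zip(header, cells):
            matrix[(name, col)] = status
    return matrix


def changed_pairs(before: dict, after: dict) -> dict:
    """Return {(src, dst): (old, new)} for every pair whose status changed."""
    return {
        pair: (before[pair], after[pair])
        for pair in before
        if pair in after and before[pair] != after[pair]
    }


V535 = """
GPU0 GPU1
GPU0 X OK
GPU1 OK X
"""

V570 = """
GPU0 GPU1
GPU0 X CNS
GPU1 CNS X
"""

diff = changed_pairs(parse_p2p_matrix(V535), parse_p2p_matrix(V570))
for (src, dst), (old, new) in sorted(diff.items()):
    print(f"{src} -> {dst}: {old} -> {new}")
```

For the two matrices in this post, the only entries that differ are the GPU0/GPU1 pair in both directions, which goes from OK to CNS.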
The question that would help me direct my debugging effort is: what changed in driver v570, compared with v535, that alters the result of cudaDeviceCanAccessPeer()?
I would appreciate any information about this CUDA API.
Please note that P2P access works well on this same motherboard when two P2000 GPUs are installed in its PCIe slots, so I suspect the problem is related to either the RTX 5000 Ada Generation Embedded GPUs or the PCIe switch.