System Information:
- GPU: NVIDIA RTX 4090 x4
- CUDA Version: 12.1
- Driver Version: 530.30.02
- Operating System: CentOS 8 (Kernel: 4.18.0-348.el8.x86_64)
- Motherboard:
  - Manufacturer: Nettrix
  - Product Name: 60EA32X
  - Version: 24003523
  - Serial Number: 2400352330001979
- BIOS Settings:
  - Above 4G Decoding: Enabled
  - IOMMU: Enabled
- NVIDIA-SMI Topology Output:

      GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU1  NODE     X      SYS     SYS     0-31            0
GPU2  SYS     SYS      X      NODE    32-63           1
GPU3  SYS     SYS     NODE     X      32-63           1
I am running the `simpleP2P` example from the CUDA Samples to test Peer-to-Peer (P2P) memory access on my multi-GPU system. While the test recognizes P2P support between GPUs, the verification step fails. Below are the details:
- P2P connectivity is confirmed via `nvidia-smi topo -m`.
- GPUs in the same NUMA node (e.g., GPU0 and GPU1) are connected via `NODE`, indicating potential P2P support.
- However, when running `simpleP2P`, I encounter verification errors.
Error Message:
Copy data back to host from GPU0 and verify results…
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access…
Shutting down…
Test failed!
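In the log above every reported `val` is 0 while `ref` is 4 times the element index, which would be consistent with the peer transfer never landing in the destination buffer. A minimal sketch for isolating this on one GPU pair is shown below. It is not part of `simpleP2P`: the pair `src = 0`, `dst = 1`, the buffer size, and the helper names `count_mismatches` and `copy_and_verify` are all assumptions made for illustration. It runs the same `cudaMemcpyPeer` twice, first without peer access enabled (where, as I understand it, the runtime stages the copy through host memory) and then with peer access enabled, so the two paths can be compared.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Return 1 from the enclosing function with a readable message on CUDA errors.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical helper: count elements that differ from the pattern i % 4096.
static size_t count_mismatches(const std::vector<float>& got) {
    size_t bad = 0;
    for (size_t i = 0; i < got.size(); ++i)
        if (got[i] != static_cast<float>(i % 4096)) ++bad;
    return bad;
}

// Hypothetical helper: clear dst, copy src -> dst, read back, report mismatches.
static int copy_and_verify(const char* label, float* d_dst, int dst,
                           const float* d_src, int src, size_t n) {
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_out(n, -1.0f);
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMemset(d_dst, 0, bytes));  // clear data left by a previous pass
    CHECK(cudaMemcpyPeer(d_dst, dst, d_src, src, bytes));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(h_out.data(), d_dst, bytes, cudaMemcpyDeviceToHost));
    std::printf("%s: %zu of %zu elements wrong\n", label,
                count_mismatches(h_out), n);
    return 0;
}

int main() {
    const int src = 0, dst = 1;  // assumption: the GPU pair under test
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::fprintf(stderr, "need at least two GPUs\n");
        return 1;
    }

    std::vector<float> h_in(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 4096);

    float *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&d_src, bytes));
    CHECK(cudaMemcpy(d_src, h_in.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&d_dst, bytes));

    // Pass 1: peer access is not enabled yet.
    if (copy_and_verify("copy without peer access", d_dst, dst, d_src, src, n))
        return 1;

    // Pass 2: enable peer access in both directions and repeat the copy.
    CHECK(cudaSetDevice(src));
    CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaDeviceEnablePeerAccess(src, 0));
    if (copy_and_verify("copy with peer access", d_dst, dst, d_src, src, n))
        return 1;

    CHECK(cudaFree(d_dst));
    CHECK(cudaSetDevice(src));
    CHECK(cudaFree(d_src));
    return 0;
}
```

Built with something like `nvcc p2p_copy_check.cu -o p2p_copy_check` (the file name is arbitrary). If the first pass is clean and only the second reports mismatches, that would point at the direct GPU-to-GPU path rather than at the sample code or the buffers themselves.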
Steps taken:
- Verified that P2P is supported between GPUs using `nvidia-smi topo -m`.
- Checked IOMMU and Above 4G Decoding in BIOS.
[root@node4 simpleP2P]# dmesg | grep -i iommu
[ 0.001137] DMAR-IR: IOAPIC id 8 under DRHD base 0x967fc000 IOMMU 19
[ 1.690072] iommu: Default domain type: Passthrough
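Since `nvidia-smi topo -m` describes how the GPUs are wired (PCIe and NUMA topology) rather than what the CUDA runtime reports for each pair, a direct query of the runtime may be a useful cross-check. The sketch below is an assumption about how that could be done and is not taken from `simpleP2P`; `yes_no` is a hypothetical helper. It prints `cudaDeviceCanAccessPeer` and two `cudaDeviceGetP2PAttribute` values for every ordered GPU pair.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return 1 from main() with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical helper: render a 0/1 attribute value as text.
static const char* yes_no(int v) { return v ? "yes" : "no"; }

int main() {
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));

    // Query every ordered pair, since peer capability is reported per direction.
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int can_access = 0, supported = 0, atomics = 0;
            CHECK(cudaDeviceCanAccessPeer(&can_access, src, dst));
            CHECK(cudaDeviceGetP2PAttribute(
                &supported, cudaDevP2PAttrAccessSupported, src, dst));
            CHECK(cudaDeviceGetP2PAttribute(
                &atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst));
            std::printf("GPU%d -> GPU%d: canAccessPeer=%s "
                        "accessSupported=%s nativeAtomics=%s\n",
                        src, dst, yes_no(can_access), yes_no(supported),
                        yes_no(atomics));
        }
    }
    return 0;
}
```

Comparing this per-pair output with the topology matrix would show whether the runtime claims peer access for pairs that then fail verification.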
Questions:
- Are there additional system or BIOS settings required to ensure proper P2P functionality for RTX 4090 GPUs?
- Does the `simpleP2P` example need specific modifications for the Ada Lovelace architecture?
- What further steps can I take to debug and resolve this issue?