CUDA 12.1 SimpleP2P Verification Errors

System Information:

  • GPU: NVIDIA RTX 4090 x4
  • CUDA Version: 12.1
  • Driver Version: 530.30.02
  • Operating System: CentOS 8 (Kernel: 4.18.0-348.el8.x86_64)
  • Motherboard:
    • Manufacturer: Nettrix
    • Product Name: 60EA32X
    • Version: 24003523
    • Serial Number: 2400352330001979
  • BIOS Settings:
    • Above 4G Decoding: Enabled
    • IOMMU: Enabled
  • NVIDIA-SMI Topology Output:

            GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
    GPU0    X       NODE    SYS     SYS     0-31            0
    GPU1    NODE    X       SYS     SYS     0-31            0
    GPU2    SYS     SYS     X       NODE    32-63           1
    GPU3    SYS     SYS     NODE    X       32-63           1

I am running the simpleP2P example from the CUDA Samples to test Peer-to-Peer (P2P) memory access on my multi-GPU system. While the test recognizes P2P support between GPUs, the verification step fails. Below are the details:
  1. P2P connectivity is confirmed via nvidia-smi topo -m.
  2. GPUs in the same NUMA node (e.g., GPU0 and GPU1) are connected via NODE, indicating potential P2P support.
  3. However, when running simpleP2P, I encounter verification errors.
    Error Message:
    Copy data back to host from GPU0 and verify results…
    Verification error @ element 1: val = 0.000000, ref = 4.000000
    Verification error @ element 2: val = 0.000000, ref = 8.000000
    Verification error @ element 3: val = 0.000000, ref = 12.000000
    Verification error @ element 4: val = 0.000000, ref = 16.000000
    Verification error @ element 5: val = 0.000000, ref = 20.000000
    Verification error @ element 6: val = 0.000000, ref = 24.000000
    Verification error @ element 7: val = 0.000000, ref = 28.000000
    Verification error @ element 8: val = 0.000000, ref = 32.000000
    Verification error @ element 9: val = 0.000000, ref = 36.000000
    Verification error @ element 10: val = 0.000000, ref = 40.000000
    Verification error @ element 11: val = 0.000000, ref = 44.000000
    Verification error @ element 12: val = 0.000000, ref = 48.000000
    Disabling peer access…
    Shutting down…
    Test failed!
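
To make the log above easier to read, here is a minimal host-only sketch of the kind of check that produces it. It is modelled on my recollection of the verification loop in the simpleP2P sample, so treat the exact form as an assumption: the reference value for element i is taken to be (i % 4096) * 2 * 2, which matches the ref values printed above, and the loop is assumed to stop shortly after the tenth mismatch. The names count_reported_errors and demo are hypothetical and do not come from the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper modelled on the sample's host-side check (assumption).
// Returns how many mismatches were reported before the loop gave up.
int count_reported_errors(const std::vector<float>& buf) {
    int error_count = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        // Assumed reference: the input (i % 4096) doubled by two kernel runs.
        const float ref = static_cast<float>(i % 4096) * 2.0f * 2.0f;
        if (buf[i] != ref) {
            std::printf("Verification error @ element %zu: val = %f, ref = %f\n",
                        i, buf[i], ref);
            // Stop once more than ten mismatches have already been counted.
            if (error_count++ > 10) break;
        }
    }
    return error_count;
}

// An all-zero buffer passes at element 0 (its reference is also 0) and then
// fails at elements 1 through 12, which is the pattern in the log above.
void demo() {
    const std::vector<float> zeros(8192, 0.0f);
    assert(count_reported_errors(zeros) == 12);
}
```

If that reading is right, the log says that the buffer copied back from GPU0 contained only zeros, meaning none of the data written through peer access arrived. Element 0 is missing from the error list only because its reference value happens to be zero as well.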
  • Verified that P2P is supported between GPUs using nvidia-smi topo -m.
  • Checked IOMMU and Above 4G Decoding in BIOS.
    [root@node4 simpleP2P]# dmesg | grep -i iommu
    [ 0.001137] DMAR-IR: IOAPIC id 8 under DRHD base 0x967fc000 IOMMU 19
    [ 1.690072] iommu: Default domain type: Passthrough

Questions:
  1. Are there additional system or BIOS settings required to ensure proper P2P functionality for RTX 4090 GPUs?
  2. Does the simpleP2P example need specific modifications for the Ada Lovelace architecture?
  3. What further steps can I take to debug and resolve this issue?

I recently noticed that peer-to-peer is not supported on the RTX 4090. Thanks.

That is the case in the official driver. There is a version of NVIDIA's open driver here, which adds P2P.