simpleP2P fails on 8*L40S server

I was trying PyTorch DDP and the program hung, so I ran the simpleP2P sample from cuda-samples and found that the GPUs cannot communicate with each other correctly.

Output of simpleP2P:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU6) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!

Output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    Off | 00000000:18:00.0 Off |                    0 |
| N/A   36C    P0              80W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L40S                    Off | 00000000:19:00.0 Off |                    0 |
| N/A   36C    P0              84W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA L40S                    Off | 00000000:1B:00.0 Off |                    0 |
| N/A   35C    P0              78W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA L40S                    Off | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0              80W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA L40S                    Off | 00000000:28:00.0 Off |                    0 |
| N/A   35C    P0              79W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA L40S                    Off | 00000000:29:00.0 Off |                    0 |
| N/A   34C    P0              80W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA L40S                    Off | 00000000:2A:00.0 Off |                    0 |
| N/A   36C    P0              79W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA L40S                    Off | 00000000:2B:00.0 Off |                    0 |
| N/A   37C    P0              81W / 350W |      3MiB / 46068MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     0-43,88-131     0               N/A
GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     0-43,88-131     0               N/A
GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     0-43,88-131     0               N/A
GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     0-43,88-131     0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     SYS     PIX     PIX     SYS     SYS     0-43,88-131     0               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     SYS     PIX     PIX     SYS     SYS     0-43,88-131     0               N/A
GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     SYS     PIX     PIX     SYS     SYS     0-43,88-131     0               N/A
GPU7    SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      SYS     PIX     PIX     SYS     SYS     0-43,88-131     0               N/A
NIC0    PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     SYS      X      PIX     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     SYS     PIX      X      SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
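
To make the matrix above easier to read, here is a small Python sketch (not part of the original post) that groups GPU pairs by link type. The `TOPO` string is copied from the GPU-to-GPU part of the matrix; the function name `gpu_pairs_by_link` is only an illustrative choice.

```python
import re
from collections import defaultdict

# GPU-to-GPU part of the `nvidia-smi topo -m` matrix from this post.
TOPO = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     SYS
GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     SYS
GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     SYS
GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     SYS
GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     PIX
GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     PIX
GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      PIX
GPU7    SYS     SYS     SYS     SYS     PIX     PIX     PIX      X
"""


def gpu_pairs_by_link(topo_text):
    """Map each link type (PIX, SYS, ...) to the unordered GPU pairs that use it."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # Device columns are the header tokens that look like GPU<n> or NIC<n>.
    columns = [t for t in lines[0].split() if re.fullmatch(r"(GPU|NIC)\d+", t)]
    pairs = defaultdict(list)
    for line in lines[1:]:
        tokens = line.split()
        row = tokens[0]
        if not re.fullmatch(r"GPU\d+", row):
            continue  # skip NIC rows if the full matrix is pasted in
        for col, link in zip(columns, tokens[1:1 + len(columns)]):
            # Keep each GPU pair once (row index < column index), skip NIC columns.
            if col.startswith("GPU") and int(row[3:]) < int(col[3:]):
                pairs[link].append((row, col))
    return dict(pairs)


pairs = gpu_pairs_by_link(TOPO)
for link in sorted(pairs):
    print(f"{link}: {len(pairs[link])} pairs")
```

For this matrix it reports 12 PIX pairs and 16 SYS pairs. GPU0 and GPU1, the pair simpleP2P tested, are in the PIX group, so the failing transfer is between two GPUs that, per the legend, are separated by at most a single PCIe bridge.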

Output of nvidia-smi topo -p2p w:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

After searching the forum, I disabled Intel VT-d, but that did not fix the problem. Does anyone know how to fix this? Thanks very much!
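
One thing that may be worth confirming is whether the VT-d change actually took effect in the running OS. The following is a rough Python sketch, assuming the server runs Linux: it lists kernel boot parameters that mention `iommu` and any IOMMU units exposed under `/sys/class/iommu` (on Intel platforms these typically show up as `dmar0`, `dmar1`, and so on when the IOMMU is active). The function name `iommu_status` and its parameters are hypothetical, chosen for illustration.

```python
from pathlib import Path


def iommu_status(cmdline_path="/proc/cmdline", iommu_class_dir="/sys/class/iommu"):
    """Return (boot parameters mentioning 'iommu', names of IOMMU units found).

    Assumes a Linux system; missing paths are reported as empty lists.
    """
    cmdline = Path(cmdline_path)
    params = []
    if cmdline.is_file():
        # e.g. intel_iommu=on, iommu=pt
        params = [p for p in cmdline.read_text().split() if "iommu" in p]

    class_dir = Path(iommu_class_dir)
    units = []
    if class_dir.is_dir():
        units = sorted(entry.name for entry in class_dir.iterdir())
    return params, units


if __name__ == "__main__":
    params, units = iommu_status()
    print("iommu-related boot parameters:", params or "none")
    print("IOMMU units visible to the kernel:", units or "none")
```

If IOMMU units are still listed after VT-d was disabled in the firmware setup, the setting may not have been applied, which would be useful information to pass on to the server vendor.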

If I had purchased a server configured that way from a server vendor, I would contact that vendor for assistance. This has some relevant info as well, but you have already changed the setting it suggests. You may also need to update your server's SBIOS to the latest available version.