simpleP2P verification failed on a VM with 2 L40S GPUs with P2P enabled

I am getting simpleP2P verification errors on a VM with 2 L40S GPUs with P2P enabled.
Driver Version: 560.28.03
CUDA Version: 12.6

[opc@akakshah-multi-gpu-vm simpleP2P]$ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.06GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 6144: val = 15872.000000, ref = 8192.000000
Verification error @ element 6145: val = 15876.000000, ref = 8196.000000
Verification error @ element 6146: val = 15880.000000, ref = 8200.000000
Verification error @ element 6147: val = 15884.000000, ref = 8204.000000
Verification error @ element 6148: val = 15888.000000, ref = 8208.000000
Verification error @ element 6149: val = 15892.000000, ref = 8212.000000
Verification error @ element 6150: val = 15896.000000, ref = 8216.000000
Verification error @ element 6151: val = 15900.000000, ref = 8220.000000
Verification error @ element 6152: val = 15904.000000, ref = 8224.000000
Verification error @ element 6153: val = 15908.000000, ref = 8228.000000
Verification error @ element 6154: val = 15912.000000, ref = 8232.000000
Verification error @ element 6155: val = 15916.000000, ref = 8236.000000
Disabling peer access...
Shutting down...
Test failed!

For debugging, I added logging that prints the buffer values before and after the copy, and I have been observing some inconsistencies.
Modified SimpleKernel code for reference:

__global__ void SimpleKernel(float *src, float *dst, bool g0tog1) {
  // Just a dummy kernel, doing enough for us to verify that everything
  // worked.
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if (g0tog1) {
    printf("Before GPU0 -> GPU1 copy: gpu0[%d]: addr=%p, val=%f; gpu1[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
  } else {
    printf("Before GPU1 -> GPU0 copy: gpu1[%d]: addr=%p, val=%f; gpu0[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
  }

  dst[idx] = src[idx];

  if (g0tog1) {
    printf("After GPU0 -> GPU1 copy: gpu0[%d]: addr=%p, val=%f; gpu1[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
  } else {
    printf("After GPU1 -> GPU0 copy: gpu1[%d]: addr=%p, val=%f; gpu0[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
  }
}

One strange observation was that the value of the GPU1 device buffer changed abruptly.

Before GPU1 -> GPU0 copy: gpu1[6144]: addr=0x7f81b1006000, val=2048.000000; gpu0[6144]: addr=0x7f81b0e06000, val=2048.000000
After GPU1 -> GPU0 copy: gpu1[6144]: addr=0x7f81b1006000, val=2432.000000; gpu0[6144]: addr=0x7f81b0e06000, val=2432.000000

The value of gpu1[6144] changed from 2048 to 2432 even though nothing should write to the GPU1 buffer here; the only operation between the two prints is "dst[idx] = src[idx];", which writes to GPU0.
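
One way to rule out the device-side printf itself (its output is only flushed at synchronization points, and ordering across two GPUs is not guaranteed) is to read the element back from the host around the kernel launch. A minimal sketch, where readElem is a helper I made up and g0/g1 stand for the 64MB buffers the sample allocates on GPU0/GPU1, with peer access already enabled:

#include <cuda_runtime.h>

// Read one element of a device buffer back to the host after draining
// all pending work on the buffer's device.
static float readElem(int device, const float *devBuf, int idx) {
  float v = 0.0f;
  cudaSetDevice(device);
  cudaDeviceSynchronize();
  cudaMemcpy(&v, devBuf + idx, sizeof(float), cudaMemcpyDeviceToHost);
  return v;
}

// Usage around the GPU1 -> GPU0 step of the sample:
//   float before = readElem(1, g1, 6144);
//   cudaSetDevice(0);
//   SimpleKernel<<<blocks, threads>>>(g1, g0, false);  // writes g0 only
//   cudaDeviceSynchronize();
//   float after = readElem(1, g1, 6144);
// If after != before, the source buffer really did change across the
// P2P mapping and the device-side printf is not the culprit.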

Another strange observation is that the issue only appears with specific GPU pairs. I extended the test to run the verification on every GPU pair on a VM with 4 L40S GPUs, and the failure was observed only for the pairs (GPU0, GPU3) and (GPU1, GPU2).
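
simpleP2PExtend is essentially the original test body wrapped in a loop over all ordered GPU pairs. A simplified, self-contained sketch of that structure (runP2PTest here is my own reduced stand-in that only exercises the cudaMemcpyPeer path; the real extension also runs the copy kernel, and error checking is omitted for brevity):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Reduced per-pair check: push a known pattern to devA, copy it to devB
// over the peer mapping, read it back on the host, count mismatches.
static int runP2PTest(int devA, int devB) {
  const size_t N = 4 * 1024 * 1024;          // elements (16MB of floats)
  const size_t BYTES = N * sizeof(float);

  // Enable bidirectional peer access, as the sample does.
  cudaSetDevice(devA); cudaDeviceEnablePeerAccess(devB, 0);
  cudaSetDevice(devB); cudaDeviceEnablePeerAccess(devA, 0);

  float *bufB = NULL; cudaMalloc(&bufB, BYTES);   // on devB (current)
  cudaSetDevice(devA);
  float *bufA = NULL; cudaMalloc(&bufA, BYTES);   // on devA

  float *h = (float *)malloc(BYTES);
  for (size_t i = 0; i < N; ++i) h[i] = (float)i;
  cudaMemcpy(bufA, h, BYTES, cudaMemcpyHostToDevice);

  cudaMemcpyPeer(bufB, devB, bufA, devA, BYTES);  // the P2P transfer

  cudaSetDevice(devB);
  memset(h, 0, BYTES);
  cudaMemcpy(h, bufB, BYTES, cudaMemcpyDeviceToHost);

  int errors = 0;
  for (size_t i = 0; i < N; ++i)
    if (h[i] != (float)i) ++errors;

  free(h);
  cudaFree(bufB);
  cudaDeviceDisablePeerAccess(devA);
  cudaSetDevice(devA);
  cudaFree(bufA);
  cudaDeviceDisablePeerAccess(devB);
  return errors;
}

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  int failures = 0;
  for (int a = 0; a < n; ++a)
    for (int b = 0; b < n; ++b) {
      if (a == b) continue;
      int canAB = 0, canBA = 0;
      cudaDeviceCanAccessPeer(&canAB, a, b);
      cudaDeviceCanAccessPeer(&canBA, b, a);
      if (canAB && canBA) failures += runP2PTest(a, b);
    }
  printf("Test %s!\nFailures: %d\n", failures ? "failed" : "passed", failures);
  return failures ? EXIT_FAILURE : EXIT_SUCCESS;
}

Topology and the full run from the 4-GPU VM: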

[opc@gpu-ui-4x-2 ~]$ nvidia-smi 
Fri Nov 22 23:37:39 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:00:05.0 Off |                    0 |
| N/A   25C    P8             23W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:00:06.0 Off |                    0 |
| N/A   27C    P8             22W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:00:07.0 Off |                    0 |
| N/A   25C    P8             23W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:00:08.0 Off |                    0 |
| N/A   26C    P8             24W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[opc@gpu-ui-4x-2 ~]$ 
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	PHB	PHB	0-215	0		N/A
GPU1	PHB	 X 	PHB	PHB	0-215	0		N/A
GPU2	PHB	PHB	 X 	PHB	0-215	0		N/A
GPU3	PHB	PHB	PHB	 X 	0-215	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
[opc@gpu-ui-4x-2 ~]$ 
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -p2p r
 	GPU0	GPU1	GPU2	GPU3	
 GPU0	X	OK	OK	OK	
 GPU1	OK	X	OK	OK	
 GPU2	OK	OK	X	OK	
 GPU3	OK	OK	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -p2p p
 	GPU0	GPU1	GPU2	GPU3	
 GPU0	X	OK	OK	OK	
 GPU1	OK	X	OK	OK	
 GPU2	OK	OK	X	OK	
 GPU3	OK	OK	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
[opc@gpu-ui-4x-2 ~]$  ./simpleP2PExtend 
[./simpleP2PExtend] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Enabling peer access between GPU0 and GPU2...
Allocating buffers (64MB on GPU0, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU2, taking source data from GPU0 and writing to GPU2...
Run kernel on GPU0, taking source data from GPU2 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Enabling peer access between GPU0 and GPU3...
Allocating buffers (64MB on GPU0, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU3: 21.06GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU3, taking source data from GPU0 and writing to GPU3...
Run kernel on GPU0, taking source data from GPU3 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 3040: val = 3584.000000, ref = 12160.000000
Verification error @ element 3041: val = 3588.000000, ref = 12164.000000
Verification error @ element 3042: val = 3592.000000, ref = 12168.000000
Verification error @ element 3043: val = 3596.000000, ref = 12172.000000
Verification error @ element 3044: val = 3600.000000, ref = 12176.000000
Verification error @ element 3045: val = 3604.000000, ref = 12180.000000
Verification error @ element 3046: val = 3608.000000, ref = 12184.000000
Verification error @ element 3047: val = 3612.000000, ref = 12188.000000
Verification error @ element 3048: val = 3616.000000, ref = 12192.000000
Verification error @ element 3049: val = 3620.000000, ref = 12196.000000
Verification error @ element 3050: val = 3624.000000, ref = 12200.000000
Verification error @ element 3051: val = 3628.000000, ref = 12204.000000
Disabling peer access...
Enabling peer access between GPU1 and GPU0...
Allocating buffers (64MB on GPU1, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU0: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Enabling peer access between GPU1 and GPU2...
Allocating buffers (64MB on GPU1, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Verification error @ element 352: val = 16128.000000, ref = 1408.000000
Verification error @ element 353: val = 16132.000000, ref = 1412.000000
Verification error @ element 354: val = 16136.000000, ref = 1416.000000
Verification error @ element 355: val = 16140.000000, ref = 1420.000000
Verification error @ element 356: val = 16144.000000, ref = 1424.000000
Verification error @ element 357: val = 16148.000000, ref = 1428.000000
Verification error @ element 358: val = 16152.000000, ref = 1432.000000
Verification error @ element 359: val = 16156.000000, ref = 1436.000000
Verification error @ element 360: val = 16160.000000, ref = 1440.000000
Verification error @ element 361: val = 16164.000000, ref = 1444.000000
Verification error @ element 362: val = 16168.000000, ref = 1448.000000
Verification error @ element 363: val = 16172.000000, ref = 1452.000000
Disabling peer access...
Enabling peer access between GPU1 and GPU3...
Allocating buffers (64MB on GPU1, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU3: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU3, taking source data from GPU1 and writing to GPU3...
Run kernel on GPU1, taking source data from GPU3 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Enabling peer access between GPU2 and GPU0...
Allocating buffers (64MB on GPU2, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU0: 21.05GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU0, taking source data from GPU2 and writing to GPU0...
Run kernel on GPU2, taking source data from GPU0 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Disabling peer access...
Enabling peer access between GPU2 and GPU1...
Allocating buffers (64MB on GPU2, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU1: 21.06GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Verification error @ element 3552: val = 15488.000000, ref = 14208.000000
Verification error @ element 3553: val = 15492.000000, ref = 14212.000000
Verification error @ element 3554: val = 15496.000000, ref = 14216.000000
Verification error @ element 3555: val = 15500.000000, ref = 14220.000000
Verification error @ element 3556: val = 15504.000000, ref = 14224.000000
Verification error @ element 3557: val = 15508.000000, ref = 14228.000000
Verification error @ element 3558: val = 15512.000000, ref = 14232.000000
Verification error @ element 3559: val = 15516.000000, ref = 14236.000000
Verification error @ element 3560: val = 15520.000000, ref = 14240.000000
Verification error @ element 3561: val = 15524.000000, ref = 14244.000000
Verification error @ element 3562: val = 15528.000000, ref = 14248.000000
Verification error @ element 3563: val = 15532.000000, ref = 14252.000000
Disabling peer access...
Enabling peer access between GPU2 and GPU3...
Allocating buffers (64MB on GPU2, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU3: 21.06GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU3, taking source data from GPU2 and writing to GPU3...
Run kernel on GPU2, taking source data from GPU3 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Disabling peer access...
Enabling peer access between GPU3 and GPU0...
Allocating buffers (64MB on GPU3, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU0: 21.06GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU0, taking source data from GPU3 and writing to GPU0...
Run kernel on GPU3, taking source data from GPU0 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Verification error @ element 3840: val = 6784.000000, ref = 15360.000000
Verification error @ element 3841: val = 6788.000000, ref = 15364.000000
Verification error @ element 3842: val = 6792.000000, ref = 15368.000000
Verification error @ element 3843: val = 6796.000000, ref = 15372.000000
Verification error @ element 3844: val = 6800.000000, ref = 15376.000000
Verification error @ element 3845: val = 6804.000000, ref = 15380.000000
Verification error @ element 3846: val = 6808.000000, ref = 15384.000000
Verification error @ element 3847: val = 6812.000000, ref = 15388.000000
Verification error @ element 3848: val = 6816.000000, ref = 15392.000000
Verification error @ element 3849: val = 6820.000000, ref = 15396.000000
Verification error @ element 3850: val = 6824.000000, ref = 15400.000000
Verification error @ element 3851: val = 6828.000000, ref = 15404.000000
Disabling peer access...
Enabling peer access between GPU3 and GPU1...
Allocating buffers (64MB on GPU3, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU1: 21.05GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU1, taking source data from GPU3 and writing to GPU1...
Run kernel on GPU3, taking source data from GPU1 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Disabling peer access...
Enabling peer access between GPU3 and GPU2...
Allocating buffers (64MB on GPU3, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU2, taking source data from GPU3 and writing to GPU2...
Run kernel on GPU3, taking source data from GPU2 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Disabling peer access...
Test failed!
Failures: 264

I'd appreciate any help resolving this issue.

Is the IOMMU set to pass-through for the GPUs?

On Linux, CUDA and the display driver do not support IOMMU-enabled bare-metal PCIe peer-to-peer memory copy. However, CUDA and the display driver do support IOMMU via VM pass-through. As a consequence, users on Linux running on a native bare-metal system should disable the IOMMU. For virtual machines, the IOMMU should be enabled and the VFIO driver used for PCIe pass-through.

IOMMU is enabled, and the test was run in a VM with PCIe pass-through.
The VM has 4 GPUs, and the test fails with verification errors only for the pairs (GPU0, GPU3) and (GPU1, GPU2); it succeeds for all other pairs.

@vramesh1 Any further guidance on this?
The peer-to-peer test always fails between GPUs on the same NUMA node and succeeds between GPUs on different NUMA nodes.
i.e. when I create a VM with 2 GPUs associated with different NUMA nodes, simpleP2P succeeds, while on a VM with both GPUs on the same NUMA node, simpleP2P fails with verification errors.
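
To check which NUMA node each GPU sits on from inside the guest, something like the following maps each CUDA device to the numa_node value sysfs reports (a sketch; it shows the guest-visible placement, which can differ from the host's):

#include <cuda_runtime.h>
#include <cstdio>
#include <cctype>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int dev = 0; dev < n; ++dev) {
    char busId[32] = {0};
    cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
    // sysfs paths use lowercase hex in PCI addresses
    for (char *p = busId; *p; ++p) *p = (char)tolower(*p);

    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busId);
    int node = -1;                                // -1 = not reported
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &node); fclose(f); }

    printf("GPU%d  %s  numa_node=%d\n", dev, busId, node);
  }
  return 0;
}

Note that inside a VM, numa_node often reads -1 or 0 for every device when the hypervisor does not expose the host topology (as in the nvidia-smi topo output above, where all four GPUs report NUMA affinity 0), so the host-side placement may need to be confirmed from the hypervisor.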