I am getting simpleP2P verification errors on a VM with two L40S GPUs and P2P enabled.
Driver Version: 560.28.03
CUDA Version: 12.6
[opc@akakshah-multi-gpu-vm simpleP2P]$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.06GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 6144: val = 15872.000000, ref = 8192.000000
Verification error @ element 6145: val = 15876.000000, ref = 8196.000000
Verification error @ element 6146: val = 15880.000000, ref = 8200.000000
Verification error @ element 6147: val = 15884.000000, ref = 8204.000000
Verification error @ element 6148: val = 15888.000000, ref = 8208.000000
Verification error @ element 6149: val = 15892.000000, ref = 8212.000000
Verification error @ element 6150: val = 15896.000000, ref = 8216.000000
Verification error @ element 6151: val = 15900.000000, ref = 8220.000000
Verification error @ element 6152: val = 15904.000000, ref = 8224.000000
Verification error @ element 6153: val = 15908.000000, ref = 8228.000000
Verification error @ element 6154: val = 15912.000000, ref = 8232.000000
Verification error @ element 6155: val = 15916.000000, ref = 8236.000000
Disabling peer access...
Shutting down...
Test failed!
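For anyone not familiar with the sample, the path it exercises boils down to the following minimal sketch (simplified, with most error checking omitted; this is just the shape of it, not the exact sample source):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal sketch of the P2P path: enable peer access between GPU0 and
// GPU1, fill a 64MB buffer on GPU0, peer-copy it to GPU1, read it back
// on the host, and verify.
int main() {
    const size_t n = 16 * 1024 * 1024;        // 16M floats = 64MB, as in the sample
    const size_t bytes = n * sizeof(float);

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P access 0->1: %d, 1->0: %d\n", can01, can10);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    float *d0 = nullptr;
    cudaMalloc(&d0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    float *d1 = nullptr;
    cudaMalloc(&d1, bytes);

    float *h = (float *)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h[i] = (float)i;

    cudaSetDevice(0);
    cudaMemcpy(d0, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);      // GPU0 -> GPU1 (P2P when enabled)
    cudaDeviceSynchronize();

    cudaMemcpy(h, d1, bytes, cudaMemcpyDeviceToHost);  // UVA resolves d1 to GPU1

    size_t errors = 0;
    for (size_t i = 0; i < n; ++i) {
        if (h[i] != (float)i) {
            if (errors < 8)
                printf("mismatch @ %zu: val=%f ref=%f\n", i, h[i], (float)i);
            ++errors;
        }
    }
    printf("%s (%zu mismatches)\n", errors ? "FAILED" : "OK", errors);
    free(h);
    return errors ? 1 : 0;
}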
For debugging, I added some logging that prints the buffer values before and after the copy, and I have been observing some inconsistencies. Here is the modified SimpleKernel for reference:
__global__ void SimpleKernel(float *src, float *dst, bool g0tog1) {
    // Just a dummy kernel, doing enough for us to verify that everything
    // worked
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (g0tog1) {
        printf("Before GPU0 -> GPU1 copy: gpu0[%d]: addr=%p, val=%f; gpu1[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
    } else {
        printf("Before GPU1 -> GPU0 copy: gpu1[%d]: addr=%p, val=%f; gpu0[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
    }
    dst[idx] = src[idx];
    if (g0tog1) {
        printf("After GPU0 -> GPU1 copy: gpu0[%d]: addr=%p, val=%f; gpu1[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
    } else {
        printf("After GPU1 -> GPU0 copy: gpu1[%d]: addr=%p, val=%f; gpu0[%d]: addr=%p, val=%f\n", idx, &src[idx], src[idx], idx, &dst[idx], dst[idx]);
    }
}
One strange observation is that the value in the GPU1 device buffer changed abruptly:
Before GPU1 -> GPU0 copy: gpu1[6144]: addr=0x7f81b1006000, val=2048.000000; gpu0[6144]: addr=0x7f81b0e06000, val=2048.000000
After GPU1 -> GPU0 copy: gpu1[6144]: addr=0x7f81b1006000, val=2432.000000; gpu0[6144]: addr=0x7f81b0e06000, val=2432.000000
The value of gpu1[6144] changed from 2048 to 2432 without any explicit update to the GPU1 buffer. The only operation between the two prints is dst[idx] = src[idx], which in this direction writes to the GPU0 buffer, not to gpu1.
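To rule out in-kernel printf ordering as an explanation, the same check can be done entirely from the host: snapshot the element before the kernel, launch the peer-read kernel, synchronize, then read the element again. A minimal sketch of that check (assumptions: 64MB buffers as in the sample, probe element 6144 taken from my logs, and a stripped-down copy kernel in place of the instrumented one):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stripped-down copy kernel: runs on GPU0, peer-reads src from GPU1,
// writes dst locally on GPU0. The src buffer is never written.
__global__ void CopyKernel(const float *src, float *dst) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx];
}

int main() {
    const int n = 16 * 1024 * 1024;            // 64MB of floats, as in the sample
    const size_t bytes = (size_t)n * sizeof(float);
    const int probe = 6144;                    // element from my failing logs

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    float *d1 = nullptr;
    cudaMalloc(&d1, bytes);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    float *d0 = nullptr;
    cudaMalloc(&d0, bytes);

    // Fill GPU1's buffer with a known pattern from the host.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);

    // Snapshot the probe element before the kernel runs.
    float before = 0.f, after = 0.f;
    cudaMemcpy(&before, d1 + probe, sizeof(float), cudaMemcpyDeviceToHost);

    // Kernel on GPU0: peer-reads GPU1's buffer, writes only to GPU0's.
    cudaSetDevice(0);
    CopyKernel<<<n / 512, 512>>>(d1, d0);
    cudaDeviceSynchronize();

    // Re-read the probe element; it should be unchanged.
    cudaMemcpy(&after, d1 + probe, sizeof(float), cudaMemcpyDeviceToHost);
    printf("gpu1[%d]: before=%f after=%f (%s)\n", probe, before, after,
           before == after ? "stable" : "CHANGED");
    free(h);
    return before == after ? 0 : 1;
}

If before and after differ here as well, the source buffer really is being modified underneath the kernel, and the in-kernel prints are not just a printf ordering artifact.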
Another strange observation is that the issue only appears with specific GPU pairs. I extended the test to perform the verification on all GPU pairs on a VM with 4 L40S GPUs, and the failure was observed only when verification ran between (GPU0, GPU3) and (GPU1, GPU2).
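For reference, the extension is just a loop over all ordered device pairs around the original test body. A rough sketch of my simpleP2PExtend harness (runP2PTest here is a hypothetical stand-in for the original simpleP2P body, which enables peer access, allocates, copies, runs the kernels, and counts verification errors):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one run of the original simpleP2P test body
// between devices a and b; returns the number of verification errors.
static int runP2PTest(int a, int b) {
    (void)a; (void)b;
    return 0;  // placeholder; the real body is the original sample code
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int failures = 0;
    for (int a = 0; a < deviceCount; ++a) {
        for (int b = 0; b < deviceCount; ++b) {
            if (a == b) continue;
            int canAB = 0, canBA = 0;
            cudaDeviceCanAccessPeer(&canAB, a, b);
            cudaDeviceCanAccessPeer(&canBA, b, a);
            if (canAB && canBA) failures += runP2PTest(a, b);
        }
    }
    printf("Failures: %d\n", failures);
    return failures ? 1 : 0;
}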
[opc@gpu-ui-4x-2 ~]$ nvidia-smi
Fri Nov 22 23:37:39 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:00:05.0 Off | 0 |
| N/A 25C P8 23W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L40S Off | 00000000:00:06.0 Off | 0 |
| N/A 27C P8 22W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L40S Off | 00000000:00:07.0 Off | 0 |
| N/A 25C P8 23W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L40S Off | 00000000:00:08.0 Off | 0 |
| N/A 26C P8 24W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
[opc@gpu-ui-4x-2 ~]$
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-215         0              N/A
GPU1    PHB      X      PHB     PHB     0-215         0              N/A
GPU2    PHB     PHB      X      PHB     0-215         0              N/A
GPU3    PHB     PHB     PHB      X      0-215         0              N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
[opc@gpu-ui-4x-2 ~]$
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3
GPU0     X      OK      OK      OK
GPU1    OK       X      OK      OK
GPU2    OK      OK       X      OK
GPU3    OK      OK      OK       X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
[opc@gpu-ui-4x-2 ~]$ nvidia-smi topo -p2p p
        GPU0    GPU1    GPU2    GPU3
GPU0     X      OK      OK      OK
GPU1    OK       X      OK      OK
GPU2    OK      OK       X      OK
GPU3    OK      OK      OK       X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
[opc@gpu-ui-4x-2 ~]$ ./simpleP2PExtend
[./simpleP2PExtend] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Enabling peer access between GPU0 and GPU2...
Allocating buffers (64MB on GPU0, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU2, taking source data from GPU0 and writing to GPU2...
Run kernel on GPU0, taking source data from GPU2 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Enabling peer access between GPU0 and GPU3...
Allocating buffers (64MB on GPU0, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU3: 21.06GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU3, taking source data from GPU0 and writing to GPU3...
Run kernel on GPU0, taking source data from GPU3 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 3040: val = 3584.000000, ref = 12160.000000
Verification error @ element 3041: val = 3588.000000, ref = 12164.000000
Verification error @ element 3042: val = 3592.000000, ref = 12168.000000
Verification error @ element 3043: val = 3596.000000, ref = 12172.000000
Verification error @ element 3044: val = 3600.000000, ref = 12176.000000
Verification error @ element 3045: val = 3604.000000, ref = 12180.000000
Verification error @ element 3046: val = 3608.000000, ref = 12184.000000
Verification error @ element 3047: val = 3612.000000, ref = 12188.000000
Verification error @ element 3048: val = 3616.000000, ref = 12192.000000
Verification error @ element 3049: val = 3620.000000, ref = 12196.000000
Verification error @ element 3050: val = 3624.000000, ref = 12200.000000
Verification error @ element 3051: val = 3628.000000, ref = 12204.000000
Disabling peer access...
Enabling peer access between GPU1 and GPU0...
Allocating buffers (64MB on GPU1, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU0: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Enabling peer access between GPU1 and GPU2...
Allocating buffers (64MB on GPU1, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Verification error @ element 352: val = 16128.000000, ref = 1408.000000
Verification error @ element 353: val = 16132.000000, ref = 1412.000000
Verification error @ element 354: val = 16136.000000, ref = 1416.000000
Verification error @ element 355: val = 16140.000000, ref = 1420.000000
Verification error @ element 356: val = 16144.000000, ref = 1424.000000
Verification error @ element 357: val = 16148.000000, ref = 1428.000000
Verification error @ element 358: val = 16152.000000, ref = 1432.000000
Verification error @ element 359: val = 16156.000000, ref = 1436.000000
Verification error @ element 360: val = 16160.000000, ref = 1440.000000
Verification error @ element 361: val = 16164.000000, ref = 1444.000000
Verification error @ element 362: val = 16168.000000, ref = 1448.000000
Verification error @ element 363: val = 16172.000000, ref = 1452.000000
Disabling peer access...
Enabling peer access between GPU1 and GPU3...
Allocating buffers (64MB on GPU1, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU3: 21.05GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU3, taking source data from GPU1 and writing to GPU3...
Run kernel on GPU1, taking source data from GPU3 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Enabling peer access between GPU2 and GPU0...
Allocating buffers (64MB on GPU2, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU0: 21.05GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU0, taking source data from GPU2 and writing to GPU0...
Run kernel on GPU2, taking source data from GPU0 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Disabling peer access...
Enabling peer access between GPU2 and GPU1...
Allocating buffers (64MB on GPU2, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU1: 21.06GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Verification error @ element 3552: val = 15488.000000, ref = 14208.000000
Verification error @ element 3553: val = 15492.000000, ref = 14212.000000
Verification error @ element 3554: val = 15496.000000, ref = 14216.000000
Verification error @ element 3555: val = 15500.000000, ref = 14220.000000
Verification error @ element 3556: val = 15504.000000, ref = 14224.000000
Verification error @ element 3557: val = 15508.000000, ref = 14228.000000
Verification error @ element 3558: val = 15512.000000, ref = 14232.000000
Verification error @ element 3559: val = 15516.000000, ref = 14236.000000
Verification error @ element 3560: val = 15520.000000, ref = 14240.000000
Verification error @ element 3561: val = 15524.000000, ref = 14244.000000
Verification error @ element 3562: val = 15528.000000, ref = 14248.000000
Verification error @ element 3563: val = 15532.000000, ref = 14252.000000
Disabling peer access...
Enabling peer access between GPU2 and GPU3...
Allocating buffers (64MB on GPU2, GPU3 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU3: 21.06GB/s
Preparing host buffer and memcpy to GPU2...
Run kernel on GPU3, taking source data from GPU2 and writing to GPU3...
Run kernel on GPU2, taking source data from GPU3 and writing to GPU2...
Copy data back to host from GPU2 and verify results...
Disabling peer access...
Enabling peer access between GPU3 and GPU0...
Allocating buffers (64MB on GPU3, GPU0 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU0: 21.06GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU0, taking source data from GPU3 and writing to GPU0...
Run kernel on GPU3, taking source data from GPU0 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Verification error @ element 3840: val = 6784.000000, ref = 15360.000000
Verification error @ element 3841: val = 6788.000000, ref = 15364.000000
Verification error @ element 3842: val = 6792.000000, ref = 15368.000000
Verification error @ element 3843: val = 6796.000000, ref = 15372.000000
Verification error @ element 3844: val = 6800.000000, ref = 15376.000000
Verification error @ element 3845: val = 6804.000000, ref = 15380.000000
Verification error @ element 3846: val = 6808.000000, ref = 15384.000000
Verification error @ element 3847: val = 6812.000000, ref = 15388.000000
Verification error @ element 3848: val = 6816.000000, ref = 15392.000000
Verification error @ element 3849: val = 6820.000000, ref = 15396.000000
Verification error @ element 3850: val = 6824.000000, ref = 15400.000000
Verification error @ element 3851: val = 6828.000000, ref = 15404.000000
Disabling peer access...
Enabling peer access between GPU3 and GPU1...
Allocating buffers (64MB on GPU3, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU1: 21.05GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU1, taking source data from GPU3 and writing to GPU1...
Run kernel on GPU3, taking source data from GPU1 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Disabling peer access...
Enabling peer access between GPU3 and GPU2...
Allocating buffers (64MB on GPU3, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU2: 21.05GB/s
Preparing host buffer and memcpy to GPU3...
Run kernel on GPU2, taking source data from GPU3 and writing to GPU2...
Run kernel on GPU3, taking source data from GPU2 and writing to GPU3...
Copy data back to host from GPU3 and verify results...
Disabling peer access...
Test failed!
Failures: 264
I'd appreciate any help resolving this issue.