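I was trying PyTorch DDP and the program hung. I then ran the simpleP2P
example from cuda-samples and found that the GPUs cannot communicate with
each other correctly: peer access is reported as supported for every pair,
but the GPU0/GPU1 copy only reaches 1.05GB/s and verification fails with NaN values.
Output of simpleP2P: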
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU0) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU1) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU2) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU3) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU4) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU6) : Yes
> Peer access from NVIDIA L40S (GPU5) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU6) -> NVIDIA L40S (GPU7) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU0) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU1) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU2) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU3) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU4) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU5) : Yes
> Peer access from NVIDIA L40S (GPU7) -> NVIDIA L40S (GPU6) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!
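
For anyone who wants to reproduce this kind of check from PyTorch rather than from cuda-samples, here is a minimal sketch. It is not something that was run for the output above. It assumes PyTorch with CUDA support is installed; the function name `check_p2p_copies` and the element count are made up for illustration. It copies a known pattern between every ordered GPU pair and compares the result on the host. PyTorch decides internally how a device-to-device copy is carried out, so a failure here only hints at a P2P problem and does not prove one.

```python
# Minimal sketch: copy a known tensor between every ordered GPU pair and
# verify it on the host. Assumes PyTorch with CUDA support is installed.
try:
    import torch
except ImportError:  # lets the script exit cleanly on machines without PyTorch
    torch = None


def check_p2p_copies(num_elements=1024):
    """Return (src, dst, peer_access, copy_ok) for each ordered GPU pair.

    Hypothetical helper for illustration; returns an empty list when
    PyTorch or CUDA is unavailable.
    """
    results = []
    if torch is None or not torch.cuda.is_available():
        return results

    count = torch.cuda.device_count()
    # Reference pattern kept on the host for the final comparison.
    reference = torch.arange(num_elements, dtype=torch.float32)

    for src in range(count):
        src_tensor = reference.to(f"cuda:{src}")
        for dst in range(count):
            if src == dst:
                continue
            # What the driver reports about peer access for this pair.
            peer_access = torch.cuda.can_device_access_peer(src, dst)
            # Device-to-device copy, then wait for it to finish.
            dst_tensor = src_tensor.to(f"cuda:{dst}")
            torch.cuda.synchronize(dst)
            # NaN or otherwise corrupted data makes this comparison False.
            copy_ok = torch.equal(dst_tensor.cpu(), reference)
            results.append((src, dst, peer_access, copy_ok))
    return results


if __name__ == "__main__":
    for src, dst, peer_access, copy_ok in check_p2p_copies():
        print(f"GPU{src} -> GPU{dst}: peer_access={peer_access}, copy_ok={copy_ok}")
```

A pair that shows `peer_access=True` together with `copy_ok=False` would match the simpleP2P result above, where peer access is reported as available but the data comes back wrong.
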
Output of nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L40S Off | 00000000:18:00.0 Off | 0 |
| N/A 36C P0 80W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L40S Off | 00000000:19:00.0 Off | 0 |
| N/A 36C P0 84W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L40S Off | 00000000:1B:00.0 Off | 0 |
| N/A 35C P0 78W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L40S Off | 00000000:1C:00.0 Off | 0 |
| N/A 33C P0 80W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA L40S Off | 00000000:28:00.0 Off | 0 |
| N/A 35C P0 79W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA L40S Off | 00000000:29:00.0 Off | 0 |
| N/A 34C P0 80W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA L40S Off | 00000000:2A:00.0 Off | 0 |
| N/A 36C P0 79W / 350W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA L40S Off | 00000000:2B:00.0 Off | 0 |
| N/A 37C P0 81W / 350W | 3MiB / 46068MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Output of nvidia-smi topo -m:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX PIX PIX SYS SYS SYS SYS PIX SYS SYS SYS SYS 0-43,88-131 0 N/A
GPU1 PIX X PIX PIX SYS SYS SYS SYS PIX SYS SYS SYS SYS 0-43,88-131 0 N/A
GPU2 PIX PIX X PIX SYS SYS SYS SYS PIX SYS SYS SYS SYS 0-43,88-131 0 N/A
GPU3 PIX PIX PIX X SYS SYS SYS SYS PIX SYS SYS SYS SYS 0-43,88-131 0 N/A
GPU4 SYS SYS SYS SYS X PIX PIX PIX SYS PIX PIX SYS SYS 0-43,88-131 0 N/A
GPU5 SYS SYS SYS SYS PIX X PIX PIX SYS PIX PIX SYS SYS 0-43,88-131 0 N/A
GPU6 SYS SYS SYS SYS PIX PIX X PIX SYS PIX PIX SYS SYS 0-43,88-131 0 N/A
GPU7 SYS SYS SYS SYS PIX PIX PIX X SYS PIX PIX SYS SYS 0-43,88-131 0 N/A
NIC0 PIX PIX PIX PIX SYS SYS SYS SYS X SYS SYS SYS SYS
NIC1 SYS SYS SYS SYS PIX PIX PIX PIX SYS X PIX SYS SYS
NIC2 SYS SYS SYS SYS PIX PIX PIX PIX SYS PIX X SYS SYS
NIC3 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
Output of nvidia-smi topo -p2p w:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
After searching the forum, I disabled Intel VT-d, but that did not fix the problem. Does anyone know how to fix this? Thanks very much!