I’m trying to connect 2 GPU nodes over InfiniBand with GPUDirect RDMA, but it only works when GPUDirect RDMA is disabled (NCCL_NET_GDR_LEVEL=0). With it enabled, the script hangs right after NCCL initialization:
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"
print("PyTorch NCCL is successful!")
Here are the logs from rank 0: first with GPUDirect RDMA disabled via NCCL_NET_GDR_LEVEL=0 (this run completes), then with it enabled (the default), where the script hangs right after Init COMPLETE:
root@gh-3714u06:/sgl-workspace# NCCL_NET_GDR_LEVEL=0 NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc-per-node=1 --node_rank=0 --rdzv_backend=static --rdzv_endpoint=10.1.32.3:29400 test.py
gh-3714u06:4052:4052 [0] NCCL INFO Bootstrap : Using ib0.0065:10.1.32.3<0>
gh-3714u06:4052:4052 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
gh-3714u06:4052:4052 [0] NCCL INFO Comm config Blocking set to 1
gh-3714u06:4052:4122 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gh-3714u06:4052:4122 [0] NCCL INFO P2P plugin IBext_v8
gh-3714u06:4052:4122 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [4]mlx5_2:1/RoCE [RO]; OOB ib0.0065:10.1.32.3<0>
gh-3714u06:4052:4122 [0] NCCL INFO Using non-device net plugin version 0
gh-3714u06:4052:4122 [0] NCCL INFO Using network IBext_v8
gh-3714u06:4052:4122 [0] NCCL INFO ncclCommInitRank comm 0x55de8ca60500 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0x8056544fe7821b76 - Init START
gh-3714u06:4052:4122 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to LOC
gh-3714u06:4052:4122 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
gh-3714u06:4052:4122 [0] NCCL INFO comm 0x55de8ca60500 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
gh-3714u06:4052:4122 [0] NCCL INFO Channel 00/04 : 0 1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 01/04 : 0 1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 02/04 : 0 1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 03/04 : 0 1
gh-3714u06:4052:4122 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
gh-3714u06:4052:4122 [0] NCCL INFO P2P Chunksize set to 131072
gh-3714u06:4052:4122 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/1
gh-3714u06:4052:4122 [0] NCCL INFO Connected all rings
gh-3714u06:4052:4122 [0] NCCL INFO Connected all trees
gh-3714u06:4052:4122 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gh-3714u06:4052:4122 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
gh-3714u06:4052:4122 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
gh-3714u06:4052:4122 [0] NCCL INFO ncclCommInitRank comm 0x55de8ca60500 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0x8056544fe7821b76 - Init COMPLETE
PyTorch NCCL is successful!
[rank0]:[W605 14:45:35.666511672 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
gh-3714u06:4052:4128 [0] NCCL INFO [Service thread] Connection closed by localRank 0
gh-3714u06:4052:4131 [0] NCCL INFO comm 0x55de8ca60500 rank 0 nranks 2 cudaDev 0 busId 19000 - Abort COMPLETE
root@gh-3714u06:/sgl-workspace# NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc-per-node=1 --node_rank=0 --rdzv_backend=static --rdzv_endpoint=10.1.32.3:29400 test.py
gh-3714u06:4197:4197 [0] NCCL INFO Bootstrap : Using ib0.0065:10.1.32.3<0>
gh-3714u06:4197:4197 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
gh-3714u06:4197:4197 [0] NCCL INFO Comm config Blocking set to 1
gh-3714u06:4197:4267 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gh-3714u06:4197:4267 [0] NCCL INFO P2P plugin IBext_v8
gh-3714u06:4197:4267 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [4]mlx5_2:1/RoCE [RO]; OOB ib0.0065:10.1.32.3<0>
gh-3714u06:4197:4267 [0] NCCL INFO Using non-device net plugin version 0
gh-3714u06:4197:4267 [0] NCCL INFO Using network IBext_v8
gh-3714u06:4197:4267 [0] NCCL INFO ncclCommInitRank comm 0x5594b5eb16a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0xb458bb774b113b3a - Init START
gh-3714u06:4197:4267 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
gh-3714u06:4197:4267 [0] NCCL INFO comm 0x5594b5eb16a0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
gh-3714u06:4197:4267 [0] NCCL INFO Channel 00/04 : 0 1
gh-3714u06:4197:4267 [0] NCCL INFO Channel 01/04 : 0 1
gh-3714u06:4197:4267 [0] NCCL INFO Channel 02/04 : 0 1
gh-3714u06:4197:4267 [0] NCCL INFO Channel 03/04 : 0 1
gh-3714u06:4197:4267 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
gh-3714u06:4197:4267 [0] NCCL INFO P2P Chunksize set to 131072
gh-3714u06:4197:4267 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [send] via NET/IBext_v8/0/GDRDMA
gh-3714u06:4197:4267 [0] NCCL INFO Connected all rings
gh-3714u06:4197:4267 [0] NCCL INFO Connected all trees
gh-3714u06:4197:4267 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gh-3714u06:4197:4267 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
gh-3714u06:4197:4267 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
gh-3714u06:4197:4267 [0] NCCL INFO ncclCommInitRank comm 0x5594b5eb16a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0xb458bb774b113b3a - Init COMPLETE
Some additional info about both nodes:
Node 0:
(base) [szymon.ozog@gh-3714u06 ~]$ lsmod | grep nvidia
nvidia_uvm 1523712 0
nvidia_drm 73728 0
nvidia_modeset 1306624 1 nvidia_drm
nvidia_peermem 16384 0
nvidia 56598528 147 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
drm_kms_helper 167936 7 drm_vram_helper,ast,drm_display_helper,nvidia_drm,nouveau
ib_core 442368 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm 577536 10 drm_kms_helper,drm_vram_helper,ast,drm_display_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm,nouveau
(base) [szymon.ozog@gh-3714u06 ~]$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE 32-63,96-127 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE NODE 32-63,96-127 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE PIX 32-63,96-127 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE 32-63,96-127 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS
NIC1 NODE NODE PIX NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X PIX SYS SYS
NIC3 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PIX X SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE
NIC5 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
(base) [szymon.ozog@gh-3714u06 ~]$ ibstat
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005548fe
System image GUID: 0x9c63c003005548fe
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 975
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005548fe
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005548b6
System image GUID: 0x9c63c003005548b6
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 976
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005548b6
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xb83fd2030091eb2a
System image GUID: 0xb83fd2030091eb2a
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xba3fd2fffe91eb2a
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xb83fd2030091eb2b
System image GUID: 0xb83fd2030091eb2a
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xba3fd2fffe91eb2b
Link layer: Ethernet
CA 'mlx5_4'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005546fe
System image GUID: 0x9c63c003005546fe
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 981
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005546fe
Link layer: InfiniBand
CA 'mlx5_5'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005b1e14
System image GUID: 0x9c63c003005b1e14
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 984
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005b1e14
Link layer: InfiniBand
Node 1:
(base) [szymon.ozog@gh-3714u15 ~]$ lsmod | grep nvidia
nvidia_uvm 1523712 0
nvidia_drm 73728 0
nvidia_modeset 1306624 1 nvidia_drm
nvidia_peermem 16384 0
nvidia 56598528 147 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
drm_kms_helper 167936 7 drm_vram_helper,ast,drm_display_helper,nvidia_drm,nouveau
ib_core 442368 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm 577536 10 drm_kms_helper,drm_vram_helper,ast,drm_display_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm,nouveau
(base) [szymon.ozog@gh-3714u15 ~]$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS 0-31,64-95 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE 32-63,96-127 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE NODE 32-63,96-127 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE PIX 32-63,96-127 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE 32-63,96-127 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS
NIC1 NODE NODE PIX NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X PIX SYS SYS
NIC3 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PIX X SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE
NIC5 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
(base) [szymon.ozog@gh-3714u15 ~]$ ibstat
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c00300554706
System image GUID: 0x9c63c00300554706
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 977
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c00300554706
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c0030055487e
System image GUID: 0x9c63c0030055487e
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 978
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c0030055487e
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0x58a2e103008dd3ec
System image GUID: 0x58a2e103008dd3ec
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x5aa2e1fffe8dd3ec
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0x58a2e103008dd3ed
System image GUID: 0x58a2e103008dd3ec
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x5aa2e1fffe8dd3ed
Link layer: Ethernet
CA 'mlx5_4'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005546b6
System image GUID: 0x9c63c003005546b6
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 985
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005546b6
Link layer: InfiniBand
CA 'mlx5_5'
CA type: MT4129
Number of ports: 1
Firmware version: 28.41.1000
Hardware version: 0
Node GUID: 0x9c63c003005548ae
System image GUID: 0x9c63c003005548ae
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 987
LMC: 0
SM lid: 359
Capability mask: 0xa751e848
Port GUID: 0x9c63c003005548ae
Link layer: InfiniBand
What could be the issue here?