Hi, I’m trying to set up communication between two nodes. I’m using PyTorch’s DDP and running the following script on two nodes connected via InfiniBand:
import sys
from datetime import timedelta

import torch

master_addr = sys.argv[1]
master_port = sys.argv[2]
rank = int(sys.argv[3])

# Rendezvous over TCP on the master node, then use NCCL for collectives
torch.distributed.init_process_group(
    backend="nccl",
    world_size=2,
    rank=rank,
    init_method=f"tcp://{master_addr}:{master_port}",
    timeout=timedelta(minutes=20),
)

if rank == 0:
    x = torch.ones(1024, device=f"cuda:{rank}")
else:
    x = torch.ones(1024, device=f"cuda:{rank}") * 100

torch.distributed.all_reduce(x)
print(x)
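For reference, the only NCCL-related settings I change between runs are environment variables. I actually export them in the shell before launching on each node, but it’s roughly equivalent to doing this at the top of the script (the variable names and values are the ones that show up in the log below):

import os

# Same in both runs
os.environ["NCCL_SOCKET_IFNAME"] = "bond0"  # bonded interface used for bootstrap/OOB
os.environ["NCCL_DEBUG"] = "INFO"           # produces the NCCL log shown below

# Working run: force NCCL's plain socket transport
os.environ["NCCL_IB_DISABLE"] = "1"
# Failing run: allow the InfiniBand transport
# os.environ["NCCL_IB_DISABLE"] = "0"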
Running with NCCL_IB_DISABLE=1 works and the all_reduce produces the correct result, but when NCCL uses InfiniBand I get the following crash:
gh-3714u06:2683693:2683693 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
gh-3714u06:2683693:2683693 [0] NCCL INFO Bootstrap : Using bond0:IP_ADDR<0>
gh-3714u06:2683693:2683693 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
gh-3714u06:2683693:2683693 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
gh-3714u06:2683693:2683877 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
gh-3714u06:2683693:2683877 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
gh-3714u06:2683693:2683877 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [3]mlx5_4:1/IB [4]mlx5_5:1/IB [RO]; OOB bond0:IP_ADDR<0>
gh-3714u06:2683693:2683877 [0] NCCL INFO Using non-device net plugin version 0
gh-3714u06:2683693:2683877 [0] NCCL INFO Using network IB
gh-3714u06:2683693:2683877 [0] NCCL INFO comm 0x55db965a1a10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 19000 commId 0x4272235706597e16 - Init START
gh-3714u06:2683693:2683877 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
gh-3714u06:2683693:2683877 [0] NCCL INFO comm 0x55db965a1a10 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 00/04 : 0 1
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 01/04 : 0 1
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 02/04 : 0 1
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 03/04 : 0 1
gh-3714u06:2683693:2683877 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
gh-3714u06:2683693:2683877 [0] NCCL INFO P2P Chunksize set to 131072
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683877 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
gh-3714u06:2683693:2683884 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net_ib.cc:659 -> 2
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net_ib.cc:795 -> 2
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net.cc:683 -> 2
gh-3714u06:2683693:2683877 [0] NCCL INFO transport/net.cc:304 -> 2
gh-3714u06:2683693:2683877 [0] NCCL INFO transport.cc:165 -> 2
gh-3714u06:2683693:2683877 [0] NCCL INFO init.cc:1222 -> 2
gh-3714u06:2683693:2683877 [0] NCCL INFO init.cc:1501 -> 2
gh-3714u06:2683693:2683877 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
gh-3714u06:2683693:2683693 [0] NCCL INFO group.cc:418 -> 2
gh-3714u06:2683693:2683693 [0] NCCL INFO init.cc:1876 -> 2
gh-3714u06:2683693:2683884 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net_ib.cc:659 -> 2
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net_ib.cc:795 -> 2
gh-3714u06:2683693:2683884 [0] NCCL INFO transport/net.cc:683 -> 2
gh-3714u06:2683693:2683884 [0] NCCL INFO misc/socket.cc:47 -> 3
gh-3714u06:2683693:2683884 [0] NCCL INFO misc/socket.cc:58 -> 3
gh-3714u06:2683693:2683884 [0] NCCL INFO misc/socket.cc:775 -> 3
gh-3714u06:2683693:2683884 [0] NCCL INFO proxy.cc:1384 -> 3
gh-3714u06:2683693:2683884 [0] NCCL INFO proxy.cc:1425 -> 3
gh-3714u06:2683693:2683884 [0] proxy.cc:1567 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
gh-3714u06:2683693:2683693 [0] NCCL INFO comm 0x55db965a1a10 rank 0 nranks 2 cudaDev 0 busId 19000 - Abort COMPLETE
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/test.py", line 65, in <module>
[rank0]: torch.distributed.all_reduce(x)
[rank0]: File "/home/miniconda3/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/miniconda3/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: Call to ibv_create_cq failed with error Cannot allocate memory
[rank0]:[W123 09:17:08.055163379 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())