While trying to run physics simulations on an HPC cluster, I am running into an issue with NCCL.
I am running the code on several nodes with H100 GPUs in order to benchmark the application.
Each node has 4 GPUs. I successfully ran the code on 1, 2 and 3 nodes, but on 4 nodes (16 GPUs in total) I get the following error:
connecting to address with family 0 is neither AF_INET(2) nor AF_INET6(10)
jzxh119:938113:938113 [2] NCCL INFO bootstrap.cc:285 -> 3
jzxh119:938113:938113 [2] NCCL INFO init.cc:1393 -> 3
jzxh116:955934:955934 [1] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1667 -> 3
jzxh116:955936:955936 [3] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1706 -> 3
Failed, NCCL error /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/team/team_internal.cpp:421 'internal error - please report this issue to the NCCL developers'
Since you are on a cluster, you probably want NCCL to use InfiniBand; from your log it seems to be using the socket transport. Are you running from a container? If so, you may need to expose the host's InfiniBand paths to Singularity, for example with -B /etc/libibverbs.d:/etc/libibverbs.d, as in the sketch below.
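A rough sketch of what the launch could look like (the image name, the binary, and the exact host paths are assumptions, so adapt them to your site; /dev/infiniband is where the verbs device nodes normally live on the host):

# Hypothetical Singularity launch; app.sif and ./my_simulation are placeholders.
# Binding the verbs provider configuration and the IB device nodes lets the
# libibverbs inside the container open the host's mlx5 devices, so NCCL can
# pick the IB transport instead of falling back to sockets.
singularity exec --nv \
    -B /etc/libibverbs.d:/etc/libibverbs.d \
    -B /dev/infiniband:/dev/infiniband \
    app.sif ./my_simulation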
With the IB-related variables set, NCCL still seems to pick the socket transport:
[1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
[0] NCCL INFO NCCL_IB_HCA set to mlx5_1
[1] NCCL INFO NCCL_IB_HCA set to mlx5_1
[3] NCCL INFO NCCL_IB_HCA set to mlx5_1
[2] NCCL INFO NET/Socket : Using [0]ibp…
[2] NCCL INFO Using network Socket
I don’t know how to make it detect my InfiniBand devices… I sent a message to the cluster administrators, but I’m still waiting for an answer.
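In the meantime, here is roughly what I plan to check from an interactive job, assuming the rdma-core tools are available on the nodes (the device name is just what my log suggests):

# Check that the mlx5 devices are visible from where the job actually runs.
ibv_devices              # should list mlx5_0, mlx5_1, ...
ibv_devinfo -d mlx5_1    # port state should be PORT_ACTIVE

# Re-run with verbose NCCL logging and force the IB transport, so that a
# silent fallback to sockets turns into an explicit error instead.
export NCCL_DEBUG=INFO
export NCCL_NET=IB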