NCCL error

Hello,

While trying to run physical simulations on an HPC cluster, I ran into an issue with NCCL.
I am running the code on several nodes with H100 GPUs in order to benchmark the app.
Each node has 4 GPUs. I successfully ran the code on 1, 2 and 3 nodes, but on 4 nodes (16 GPUs in total) I get the following error:

jzxh119:938113:938113 [2] misc/socket.cc:716 NCCL WARN ncclSocketInit: connecting to address with family 0 is neither AF_INET(2) nor AF_INET6(10)
jzxh119:938113:938113 [2] NCCL INFO bootstrap.cc:285 -> 3
jzxh119:938113:938113 [2] NCCL INFO init.cc:1393 -> 3
jzxh116:955934:955934 [1] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1667 -> 3
jzxh116:955936:955936 [3] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1706 -> 3
Failed, NCCL error /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/team/team_internal.cpp:421 'internal error - please report this issue to the NCCL developers

Do you have any idea how to solve this issue?

Thanks in advance.

Hello,

Please let me know if more details are needed to investigate. I still have not managed to solve this issue.

Regards

Hello,

Since you are on a cluster, you probably want to use InfiniBand, but from your log it looks like NCCL is using sockets. Are you running from a container? If so, you could expose this path to Singularity with -B /etc/libibverbs.d:/etc/libibverbs.d
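
To check whether the node (or the container) can see any InfiniBand device at all, something like the following might help. It is only a sketch, assuming a Linux system where the RDMA stack exposes its devices under /sys/class/infiniband; the function name list_ib_devices is made up for this example.

```shell
#!/usr/bin/env bash
# List the RDMA devices visible on this node by reading sysfs.
# The optional directory argument exists only so the function can be
# tried against a fake directory tree; by default it reads the real sysfs path.
list_ib_devices() {
    local sysfs_dir="${1:-/sys/class/infiniband}"
    # No such directory: the RDMA stack is not loaded or not exposed here.
    [ -d "$sysfs_dir" ] || return 1
    local dev
    for dev in "$sysfs_dir"/*; do
        # Skip the literal pattern when the directory is empty.
        [ -e "$dev" ] && basename "$dev"
    done
    return 0
}

if devices="$(list_ib_devices)" && [ -n "$devices" ]; then
    echo "RDMA devices found:"
    echo "$devices"
else
    echo "No RDMA devices visible under /sys/class/infiniband"
fi
```

If nothing is listed on a compute node, NCCL has no IB device to pick up there, whatever the environment variables say. The names printed (for example mlx5_0, mlx5_1) are the ones NCCL_IB_HCA is matched against.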

Hello,

Many thanks for your answer.
I think you are right: I should be using InfiniBand, but NCCL is using sockets. I am not using Singularity.

I tried exporting the following environment variables to force NCCL to use IB:

export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_1

But I still get the following errors:

[1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
[0] NCCL INFO NCCL_IB_HCA set to mlx5_1
[1] NCCL INFO NCCL_IB_HCA set to mlx5_1
[3] NCCL INFO NCCL_IB_HCA set to mlx5_1
[2] NCCL INFO NET/Socket : Using [0]ibp…
[2] NCCL INFO Using network Socket

I don’t know how to make it detect my IB devices… I sent a message to the cluster administrators, but I’m still waiting for an answer.
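
For anyone comparing runs, here is a small sketch that pulls the selected network out of a saved NCCL log by looking for the "Using network" lines shown above. The function name nccl_networks_in_log and the file name nccl.log are placeholders, and it assumes the job was run with NCCL_DEBUG=INFO.

```shell
#!/usr/bin/env bash
# Read an NCCL log on stdin and print each distinct network name that
# appears in a "NCCL INFO Using network <name>" line.
nccl_networks_in_log() {
    grep -o 'NCCL INFO Using network [A-Za-z0-9_]*' \
        | awk '{print $NF}' \
        | sort -u
}

# Example use (commented out because it needs a real job and log file):
#   NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET <your launch command> 2>&1 | tee nccl.log
#   nccl_networks_in_log < nccl.log
```

On the log above this prints Socket, which confirms that the IB transport was not selected on those ranks.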

Hello,

The following did the job on my cluster, together with a few other settings that were specific to running in a container:

export NCCL_IB_ENABLE=1
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0

This should force NCCL to look for IB; if no IB device is found, it should crash.
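
As a hedged variant: as far as I can tell, the variables listed in the NCCL documentation for this are NCCL_IB_DISABLE and NCCL_NET, so a version relying only on documented names might look like the sketch below. It assumes NCCL 2.10 or newer (where NCCL_NET is documented), and show_nccl_env is just a helper made up for this example. The interface prefix ib is also an assumption, chosen because the log earlier in the thread shows an interface whose name starts with ibp rather than ib0.

```shell
#!/usr/bin/env bash
# Assumption: NCCL >= 2.10, where NCCL_NET is documented.
export NCCL_IB_DISABLE=0      # keep the IB transport enabled (0 is already the default)
export NCCL_NET=IB            # ask for the internal IB network rather than Socket
export NCCL_IB_HCA=mlx5       # prefix match: any HCA whose name starts with "mlx5"
export NCCL_SOCKET_IFNAME=ib  # prefix match: interfaces whose name starts with "ib"
export NCCL_DEBUG=INFO        # log which network NCCL actually selects

# Print the NCCL-related settings that the launched job will inherit.
show_nccl_env() {
    env | grep '^NCCL_' | sort
}

show_nccl_env
```

With NCCL_NET=IB set, my understanding is that initialization should fail instead of silently falling back to sockets when no usable IB device is found, which makes the problem easier to spot.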

Laura