NCCL error when scaling to 4 nodes (16 GPUs)

Hello,

While trying to run physical simulations on an HPC cluster, I ran into an issue with NCCL.
I am running the code on several nodes with H100 GPUs in order to benchmark the application.
Each node has 4 GPUs. I successfully ran the code on 1, 2, and 3 nodes, but on 4 nodes (16 GPUs in total) I get the following error:

jzxh119:938113:938113 [2] misc/socket.cc:716 NCCL WARN ncclSocketInit: connecting to address with family 0 is neither AF_INET(2) nor AF_INET6(10)
jzxh119:938113:938113 [2] NCCL INFO bootstrap.cc:285 -> 3
jzxh119:938113:938113 [2] NCCL INFO init.cc:1393 -> 3
jzxh116:955934:955934 [1] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1667 -> 3
jzxh116:955936:955936 [3] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1706 -> 3
Failed, NCCL error /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/team/team_internal.cpp:421 'internal error - please report this issue to the NCCL developers'
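
The warning says the address NCCL tried to connect to has family 0, which is AF_UNSPEC rather than AF_INET (2) or AF_INET6 (10). One check that might help narrow this down is to see which address families each node's hostname resolves to. Below is a minimal sketch, assuming Python 3 is available on the compute nodes. The helper name resolved_families is hypothetical, and the script does not call NCCL at all; it only reports what the system resolver returns, so it may or may not reflect what NCCL sees during bootstrap.

```python
import socket


def resolved_families(host):
    """Return the sorted address-family names that `host` resolves to.

    An empty list means the name did not resolve at all.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # The first field of each getaddrinfo() entry is the address family.
    return sorted({socket.AddressFamily(info[0]).name for info in infos})


if __name__ == "__main__":
    # Run once per node (for example, one task per node through the job launcher).
    host = socket.gethostname()
    print(host, resolved_families(host))
```

Running it on each of the 4 nodes and comparing the output could show whether one node (here possibly jzxh119) resolves differently from the others.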

Do you have any idea how to solve this issue?

Thanks in advance.

Hello,

Please let me know if more details are needed to investigate. I still have not managed to solve this issue.

Regards