Hello,
While trying to run physical simulations on an HPC cluster, I encountered an issue with NCCL.
I am running the code on several nodes with H100 GPUs in order to benchmark the application.
Each node has 4 GPUs. I successfully ran the code on 1, 2 and 3 nodes, but on 4 nodes (16 GPUs in total) I get the following error:
jzxh119:938113:938113 [2] misc/socket.cc:716 NCCL WARN ncclSocketInit: connecting to address with family 0 is neither AF_INET(2) nor AF_INET6(10)
jzxh119:938113:938113 [2] NCCL INFO bootstrap.cc:285 → 3
jzxh119:938113:938113 [2] NCCL INFO init.cc:1393 → 3
jzxh116:955934:955934 [1] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1667 → 3
jzxh116:955936:955936 [3] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1706 → 3
Failed, NCCL error
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/team/team_internal.cpp:421
'internal error - please report this issue to the NCCL developers
Do you have any idea how to solve this issue?
Thanks in advance.