While trying to run physics simulations on an HPC cluster, I am running into an issue with NCCL.
I am running the code on several nodes with H100 GPUs in order to benchmark the application.
Each node has 4 GPUs. I successfully ran the code on 1, 2 and 3 nodes, but on 4 nodes (16 GPUs in total) I get the following error:
connecting to address with family 0 is neither AF_INET(2) nor AF_INET6(10)
jzxh119:938113:938113 [2] NCCL INFO bootstrap.cc:285 -> 3
jzxh119:938113:938113 [2] NCCL INFO init.cc:1393 -> 3
jzxh116:955934:955934 [1] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1667 -> 3
jzxh116:955936:955936 [3] NCCL INFO Using network Socket
jzxh119:938113:938113 [2] NCCL INFO init.cc:1706 -> 3
Failed, NCCL error /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/team/team_internal.cpp:421 'internal error - please report this issue to the NCCL developers'
Since you are on a cluster, you probably want NCCL to use InfiniBand; from your log it seems to be using the socket transport. Are you running from a container? If so, you may need to expose the host's InfiniBand paths to Singularity, for example with -B /etc/libibverbs.d:/etc/libibverbs.d, as in the sketch below.
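A rough sketch of what the launch could look like (the image name, the binary, and the exact host paths are assumptions, so adapt them to your site; /dev/infiniband is where the verbs device nodes normally live on the host):

# Hypothetical Singularity launch; app.sif and ./my_simulation are placeholders.
# Binding the verbs provider configuration and the IB device nodes lets the
# libibverbs inside the container open the host's mlx5 devices, so NCCL can
# pick the IB transport instead of falling back to sockets.
singularity exec --nv \
    -B /etc/libibverbs.d:/etc/libibverbs.d \
    -B /dev/infiniband:/dev/infiniband \
    app.sif ./my_simulation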
With the IB-related variables set, NCCL still seems to pick the socket transport:
[1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
[0] NCCL INFO NCCL_IB_HCA set to mlx5_1
[1] NCCL INFO NCCL_IB_HCA set to mlx5_1
[3] NCCL INFO NCCL_IB_HCA set to mlx5_1
[2] NCCL INFO NET/Socket : Using [0]ibp…
[2] NCCL INFO Using network Socket
I don’t know how to make it detect my InfiniBand devices… I sent a message to the cluster administrators, but I’m still waiting for an answer.
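In the meantime, here is roughly what I plan to check from an interactive job, assuming the rdma-core tools are available on the nodes (the device name is just what my log suggests):

# Check that the mlx5 devices are visible from where the job actually runs.
ibv_devices              # should list mlx5_0, mlx5_1, ...
ibv_devinfo -d mlx5_1    # port state should be PORT_ACTIVE

# Re-run with verbose NCCL logging and force the IB transport, so that a
# silent fallback to sockets turns into an explicit error instead.
export NCCL_DEBUG=INFO
export NCCL_NET=IB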