NCCL randomly crashes on Leonardo

Hello,

I am running an LLM training on the Leonardo cluster using a Singularity container. The training is implemented with Colossal AI and uses hybrid parallelism (pipeline parallelism + data parallelism). NCCL crashes from time to time with the same error:

5: [default0]:[rank20]: work = group.allreduce([tensor], opts)
5: [default0]:[rank20]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5: [default0]:[rank20]: torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
5: [default0]:[rank20]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
5: [default0]:[rank20]: Last error:
5: [default0]:[rank20]: socketPollConnect: Connect to 10.128.9.129<34485> returned 113(No route to host) errno 115(Operation now in progress)

I have already enabled InfiniBand (it was not detected until I bound some host paths into the container) and checked the output with NCCL_DEBUG=INFO. Could you please provide more information about the error, or suggestions on how to investigate the issue further?
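
For completeness, this is roughly the kind of bind I mean; the paths and names below are only an illustration of the idea (exposing the host InfiniBand userspace libraries and their configuration to the container), not necessarily the exact paths used on Leonardo:

# Sketch only: expose the host rdma-core/InfiniBand userspace stack to the container.
# The bind paths, image name and script name are examples and may differ on Leonardo.
singularity exec --nv \
  --bind /etc/libibverbs.d \
  --bind /usr/lib64/libibverbs \
  my_training_image.sif \
  python train.py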

I would also like to use this thread for a question about the communication paths reported in the log file:

10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 39[3] → 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 39[3] → 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 40[0] → 41[1] via P2P/CUMEM/read
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 40[0] → 41[1] via P2P/CUMEM/read

Leonardo has InfiniBand between nodes and NVLink within each node. Are the communication paths above appropriate for this platform? I am not familiar with CUMEM.
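
In case it helps with an answer, I suppose the intra-node topology and NCCL's transport choices can be cross-checked with standard tooling, something like:

# Show the GPU/NIC interconnect matrix on a compute node (NVLink vs PCIe, NUMA affinity).
nvidia-smi topo -m
# Ask NCCL to log how it builds its rings/trees and which transports it selects.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET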

Thank you for your time,

Laura

I have gathered more information with NCCL_DEBUG=INFO. I see a number of messages like these:

7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying
14: [default0]:lrdn1623:3732664:3733794 [0] NCCL INFO Call to connect returned Connection refused, retrying
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying

and then this:

7: [default0]:lrdn1176:2122410:2123520 [0] misc/socket.cc:467 NCCL WARN socketStartConnect: exceeded retries (20000)
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:567 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:621 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO bootstrap.cc:425 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO transport.cc:131 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1232 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1501 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO group.cc:64 → 6 [Async thread]
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO group.cc:418 → 6
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO init.cc:1876 → 6
7: [default0]:lrdn1176:2122410:2123521 [0] NCCL INFO [Service thread] Connection closed by localRank 0
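
For anyone hitting the same messages, this is the kind of thing one could try in order to narrow it down (only a sketch; the interface name is an example, not a verified Leonardo setting):

# Sketch only: force NCCL's bootstrap TCP traffic onto one known interface
# (ib0 is just an example; check the actual names with `ip addr` on a compute node).
export NCCL_SOCKET_IFNAME=ib0
# Keep verbose logging to see which address NCCL actually binds to.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=NET,ENV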

Hello, have you had any success with your LLM training runs on Leonardo?
We are struggling to launch jobs on 90 or more nodes on Leonardo. At the beginning of training, NCCL errors pop up, like this one for example:

[rank562]:[E605 04:36:53.526632995 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 5(DATA_PARALLEL_GROUP) Rank 140] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank545]:[E605 04:36:53.129921218 ProcessGroupNCCL.cpp:542] [Rank 136] Collective WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=601000) raised the following async exception: NCCL error: remote process exited or there was a network error, NCCL version 2.21.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

We are using Megatron.

Hello,

yes, it was a temporary issue that became more critical when running on a large number of nodes. I would try increasing something like the NCCL timeout parameter…
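
I use Colossal AI, so I cannot confirm the exact option on the Megatron side, but the Timeout(ms) value in your log is the PyTorch process-group watchdog timeout rather than an InfiniBand-level setting. If your Megatron version exposes the --distributed-timeout-minutes argument, what I mean is roughly something like this in the launch script:

# Sketch only: raise the torch.distributed watchdog timeout from the roughly
# 10 minutes shown in the log, assuming the Megatron-LM version in use supports
# --distributed-timeout-minutes. TRAINING_ARGS stands for whatever variable the
# sbatch script uses to collect the Megatron arguments.
TRAINING_ARGS="${TRAINING_ARGS} --distributed-timeout-minutes 60"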

Laura

Thank you for your fast (and valuable) answer.
I will try adjusting the NCCL variables to see if that helps in our case.
Did you have to do anything on your end to make it work, or was it only the Leonardo team that had to make repairs?

Hello,
I have had no success tinkering with the NCCL environment variables.
For example, I tried setting these:

export NCCL_IB_ENABLE=1
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0

or this to increase the timeout limit:
export NCCL_IB_TIMEOUT=25
or even this:
export NCCL_IB_DISABLE=1
but I still get errors. For example, with export NCCL_IB_DISABLE=1 I get this error:

[rank224]:[E605 18:31:55.516028980 ProcessGroupNCCL.cpp:542] [Rank 56] Collective WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=600000) raised the following async exception: NCCL error: remote process exited or there was a network error, NCCL version 2.21.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

Or, with the increased timeout, I get this:
[rank40]:[E606 09:46:50.419909226 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8, OpType=ALLREDUCE, NumelIn=8388608, NumelOut=8388608, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

For full reference, here is my sbatch script (adapted from the default GPT training script provided in the latest Megatron release): script_training.sh · GitHub

Many thanks