NCCL randomly crashes on Leonardo

Hello,

I am running an LLM training on the Leonardo cluster using a Singularity container. The training is implemented with Colossal AI and uses hybrid parallelism (pipeline parallelism + data parallelism). NCCL crashes from time to time with the same error:

5: [default0]:[rank20]: work = group.allreduce([tensor], opts)
5: [default0]:[rank20]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5: [default0]:[rank20]: torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
5: [default0]:[rank20]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
5: [default0]:[rank20]: Last error:
5: [default0]:[rank20]: socketPollConnect: Connect to 10.128.9.129<34485> returned 113(No route to host) errno 115(Operation now in progress)

I have already enabled InfiniBand (it was not detected until I bound some host paths into the container) and checked the output with NCCL_DEBUG=INFO. Could you please provide more information about the error, or suggestions on how to investigate the issue further?
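
For completeness, this is roughly the kind of bind I mean; the paths and names below are only an illustration of the idea (exposing the host InfiniBand userspace libraries and their configuration to the container), not necessarily the exact paths used on Leonardo:

# Sketch only: expose the host rdma-core/InfiniBand userspace stack to the container.
# The bind paths, image name and script name are examples and may differ on Leonardo.
singularity exec --nv \
  --bind /etc/libibverbs.d \
  --bind /usr/lib64/libibverbs \
  my_training_image.sif \
  python train.py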

I would also like to use this thread for a question about the communication paths reported in the log file:

10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 39[3] → 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 39[3] → 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 40[0] → 41[1] via P2P/CUMEM/read
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 40[0] → 41[1] via P2P/CUMEM/read

Leonardo has InfiniBand between nodes and NVLink within each node. Are the communication paths above appropriate for this platform? I am not familiar with CUMEM.
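
In case it helps with an answer, I suppose the intra-node topology and NCCL's transport choices can be cross-checked with standard tooling, something like:

# Show the GPU/NIC interconnect matrix on a compute node (NVLink vs PCIe, NUMA affinity).
nvidia-smi topo -m
# Ask NCCL to log how it builds its rings/trees and which transports it selects.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET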

Thank you for your time,

Laura

I have gathered more information with NCCL_DEBUG=INFO. I see a number of messages like these:

7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying
14: [default0]:lrdn1623:3732664:3733794 [0] NCCL INFO Call to connect returned Connection refused, retrying
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying

and then this:

7: [default0]:lrdn1176:2122410:2123520 [0] misc/socket.cc:467 NCCL WARN socketStartConnect: exceeded retries (20000)
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:567 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:621 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO bootstrap.cc:425 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO transport.cc:131 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1232 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1501 → 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO group.cc:64 → 6 [Async thread]
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO group.cc:418 → 6
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO init.cc:1876 → 6
7: [default0]:lrdn1176:2122410:2123521 [0] NCCL INFO [Service thread] Connection closed by localRank 0
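
For anyone hitting the same messages, this is the kind of thing one could try in order to narrow it down (only a sketch; the interface name is an example, not a verified Leonardo setting):

# Sketch only: force NCCL's bootstrap TCP traffic onto one known interface
# (ib0 is just an example; check the actual names with `ip addr` on a compute node).
export NCCL_SOCKET_IFNAME=ib0
# Keep verbose logging to see which address NCCL actually binds to.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=NET,ENV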

Hello, have you had any success with your LLM training runs on Leonardo?
We are struggling to launch jobs on 90 or more nodes on Leonardo. At the beginning of training, NCCL errors pop up, like this one for example:

[rank562]:[E605 04:36:53.526632995 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 5(DATA_PARALLEL_GROUP) Rank 140] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank545]:[E605 04:36:53.129921218 ProcessGroupNCCL.cpp:542] [Rank 136] Collective WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=601000) raised the following async exception: NCCL error: remote process exited or there was a network error, NCCL version 2.21.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

We are using Megatron.

Hello,

yes, it was a temporary issue that became more critical when running on a large number of nodes. I would try increasing something like the NCCL timeout parameter…
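
I use Colossal AI, so I cannot confirm the exact option on the Megatron side, but the Timeout(ms) value in your log is the PyTorch process-group watchdog timeout rather than an InfiniBand-level setting. If your Megatron version exposes the --distributed-timeout-minutes argument, what I mean is roughly something like this in the launch script:

# Sketch only: raise the torch.distributed watchdog timeout from the roughly
# 10 minutes shown in the log, assuming the Megatron-LM version in use supports
# --distributed-timeout-minutes. TRAINING_ARGS stands for whatever variable the
# sbatch script uses to collect the Megatron arguments.
TRAINING_ARGS="${TRAINING_ARGS} --distributed-timeout-minutes 60"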

Laura

Thank you for your fast (and valuable) answer.
I will try adjusting the NCCL variables to see if that helps in our case.
Did you have to do anything on your end to make it work, or was it only the Leonardo team that had to make repairs?

Hello,
I have had no success tinkering with the NCCL environment variables.
For example, I tried setting these:

export NCCL_IB_ENABLE=1
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0

or this to increase the timeout limit:
export NCCL_IB_TIMEOUT=25
or even this:
export NCCL_IB_DISABLE=1
but I still get errors. For example, with export NCCL_IB_DISABLE=1 I get this error:

[rank224]:[E605 18:31:55.516028980 ProcessGroupNCCL.cpp:542] [Rank 56] Collective WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=600000) raised the following async exception: NCCL error: remote process exited or there was a network error, NCCL version 2.21.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

Or, with the increased timeout, I get this:
[rank40]:[E606 09:46:50.419909226 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8, OpType=ALLREDUCE, NumelIn=8388608, NumelOut=8388608, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

For full reference, here is my sbatch script (adapted from the default GPT training script provided in the latest Megatron release): script_training.sh · GitHub

Many thanks