NCCL randomly crashes on Leonardo

Hello,

I am running LLM training on the Leonardo cluster inside a Singularity container. The training is implemented with Colossal-AI and uses hybrid parallelism (pipeline parallelism + data parallelism). NCCL crashes intermittently, always with the same error:

5: [default0]:[rank20]: work = group.allreduce([tensor], opts)
5: [default0]:[rank20]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5: [default0]:[rank20]: torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
5: [default0]:[rank20]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
5: [default0]:[rank20]: Last error:
5: [default0]:[rank20]: socketPollConnect: Connect to 10.128.9.129<34485> returned 113(No route to host) errno 115(Operation now in progress)

I have already enabled InfiniBand (it was not detected until I bound some paths into the container) and checked the output with NCCL_DEBUG=INFO. Could you please explain the error in more detail, or suggest how to investigate it further?
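One way to investigate the "No route to host" part outside of NCCL is a plain TCP probe from the failing node to the peer. The following is only a minimal sketch, not part of the training code: `probe_tcp` is a hypothetical helper, and the address and port in the usage comment are simply the ones from the error above (NCCL picks the port dynamically, so it will differ on each run).

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a plain TCP connection and return a short status string.

    Returns "ok" on success, "timeout" if the attempt timed out, or the
    symbolic errno name (e.g. "ECONNREFUSED", "EHOSTUNREACH") otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Fall back to the exception text when no errno is attached.
        return errno.errorcode.get(exc.errno, str(exc))


# Hypothetical usage, run from the node that reported the failure:
#   print(probe_tcp("10.128.9.129", 34485))
```

On Linux, a result of `EHOSTUNREACH` corresponds to the errno 113 (No route to host) shown in the error, while `ECONNREFUSED` means the peer is reachable but nothing is listening on that port.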

I would also like to use this thread to ask about the communication paths reported in the log file:

10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 39[3] -> 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 39[3] -> 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 40[0] -> 41[1] via P2P/CUMEM/read
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 40[0] -> 41[1] via P2P/CUMEM/read

Leonardo has InfiniBand and, within each node, NVLink. Are the transports shown above the expected ones for this platform? I am not familiar with CUMEM.
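To see which transport each channel uses across the whole log, rather than reading individual lines, something like the following could be used. This is only a sketch: the regular expression is an assumption about the line format based on the four lines above (it accepts both `->` and `→` as the arrow), and `summarize_transports` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed format, taken from the log lines above, e.g.
#   "... Channel 00/0 : 39[3] -> 40[0] [receive] via NET/IB/0/GDRDMA"
CHANNEL_RE = re.compile(
    r"Channel \d+/\d+ : \d+\[\d+\] (?:->|→) \d+\[\d+\]"
    r"(?: \[(?:receive|send)\])? via (\S+)"
)


def summarize_transports(lines):
    """Count how many channel lines report each transport string."""
    counts = Counter()
    for line in lines:
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical usage on a saved log file:
#   with open("train.log") as fh:
#       print(summarize_transports(fh))
```

A summary in which every inter-node channel shows `NET/IB` and every intra-node channel shows `P2P` would at least confirm that no channel has fallen back to plain sockets.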

Thank you for your time,

Laura

I now have more information from running with NCCL_DEBUG=INFO. I see many messages like these:

7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying
14: [default0]:lrdn1623:3732664:3733794 [0] NCCL INFO Call to connect returned Connection refused, retrying
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO Call to connect returned Connection refused, retrying

followed by this:

7: [default0]:lrdn1176:2122410:2123520 [0] misc/socket.cc:467 NCCL WARN socketStartConnect: exceeded retries (20000)
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:567 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO misc/socket.cc:621 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO bootstrap.cc:425 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO transport.cc:131 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1232 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO init.cc:1501 -> 6
7: [default0]:lrdn1176:2122410:2123520 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO group.cc:418 -> 6
7: [default0]:lrdn1176:2122410:2122410 [0] NCCL INFO init.cc:1876 -> 6
7: [default0]:lrdn1176:2122410:2123521 [0] NCCL INFO [Service thread] Connection closed by localRank 0
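To find out which nodes are affected, the retry and failure messages can be grouped by hostname. The sketch below assumes the line prefix format seen in the logs above (`<task>: [default0]:<host>:<pid>:<tid> [<dev>] <message>`); `connect_failures` is a hypothetical helper, not an NCCL tool.

```python
import re
from collections import Counter

# Prefix format assumed from the log lines above.
PREFIX_RE = re.compile(
    r"^\d+: \[\w+\]:(?P<host>[^:\s]+):\d+:\d+ \[\d+\] (?P<msg>.*)$"
)


def connect_failures(lines):
    """Return (retries per host, hosts that exceeded the retry limit)."""
    refused = Counter()
    gave_up = set()
    for line in lines:
        match = PREFIX_RE.match(line.strip())
        if not match:
            continue
        if "Connection refused, retrying" in match["msg"]:
            refused[match["host"]] += 1
        elif "exceeded retries" in match["msg"]:
            gave_up.add(match["host"])
    return refused, gave_up


# Hypothetical usage on a saved log file:
#   with open("train.log") as fh:
#       refused, gave_up = connect_failures(fh)
```

If the same one or two hosts appear in every failing run, that would point to a node-specific network problem rather than a problem in the training code.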