Hello,
I am running LLM training on the Leonardo cluster inside a Singularity container. The training is implemented with Colossal AI and uses hybrid parallelization, i.e. pipeline parallelism plus data parallelism (a rough sketch of the configuration is included after the traceback). NCCL crashes from time to time with the same error:
5: [default0]:[rank20]: work = group.allreduce([tensor], opts)
5: [default0]:[rank20]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5: [default0]:[rank20]: torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
5: [default0]:[rank20]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
5: [default0]:[rank20]: Last error:
5: [default0]:[rank20]: socketPollConnect: Connect to 10.128.9.129<34485> returned 113(No route to host) errno 115(Operation now in progress)
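For context, the parallel setup is configured roughly along these lines (a simplified sketch with illustrative sizes, not my exact script; the launch call may differ depending on the Colossal AI version):

    # Simplified sketch of the hybrid PP + DP configuration (illustrative only).
    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin

    colossalai.launch_from_torch()  # initializes torch.distributed with the NCCL backend

    plugin = HybridParallelPlugin(
        tp_size=1,           # no tensor parallelism
        pp_size=4,           # pipeline parallelism; remaining ranks form the data-parallel groups
        num_microbatches=8,
        precision="bf16",
    )
    booster = Booster(plugin=plugin)
    # model, optimizer, criterion and dataloader are then wrapped with booster.boost(...)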
I have already enabled InfiniBand (it was not detected until I bound some paths into the container) and re-ran with NCCL_DEBUG=INFO. Could you please provide more information about this error, or suggest how to investigate it further?
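To narrow things down, I was also thinking of isolating NCCL from the training loop with a minimal allreduce test run under the same container and job setup, roughly along these lines (a sketch only; the script name is hypothetical):

    # minimal_allreduce_test.py -- hypothetical standalone test, not my training code.
    # Launch with torchrun (or srun + torchrun) so that RANK, WORLD_SIZE, LOCAL_RANK,
    # MASTER_ADDR and MASTER_PORT are set.
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        rank = dist.get_rank()

        # Repeat a simple allreduce to see whether the socketPollConnect error
        # can be reproduced outside the Colossal AI training loop.
        for step in range(1000):
            t = torch.ones(1024 * 1024, device="cuda")
            dist.all_reduce(t)  # default op is SUM
            torch.cuda.synchronize()
            if rank == 0 and step % 100 == 0:
                print(f"step {step}: allreduce ok, first element = {t[0].item()}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Would this be a reasonable way to check whether the problem is in the fabric/configuration rather than in the training code?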
I would also like to use this channel to ask a question about the communications reported in the log file:
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 39[3] -> 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 39[3] -> 40[0] [receive] via NET/IB/0/GDRDMA
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 00/0 : 40[0] -> 41[1] via P2P/CUMEM/read
10: [default0]:lrdn1235:3487109:3488633 [0] NCCL INFO Channel 04/0 : 40[0] -> 41[1] via P2P/CUMEM/read
Leonardo has InfiniBand between nodes and NVLink within each node. Are the communication paths shown above the appropriate ones for this platform? I am not familiar with CUMEM.
Thank you for your time,
Laura