I'm running multi-node, multi-GPU LLM training with DeepSpeed + Megatron. The job runs on a Slurm cluster with more than 10 nodes of RTX 3090 GPUs.
Driver Version: 530.30.02 CUDA Version: 12.1
NCCL Version: 2.14.3
Other relevant versions:
import torch
print(torch.__version__)
2.0.1
print(torch.version.cuda)
11.8
print(torch.backends.cudnn.version())
8700
attached slurm log:
IB error slurm.log (68.8 KB)
The log contains several NCCL warnings that indicate failures:
NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
NCCL WARN Error: trying to connect already connected sendComm
NCCL WARN Net : Connection closed by remote peer 10.19.160.30<50955>
NCCL WARN Proxy Call to rank 1 failed (Connect)
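The first warning comes from registering memory with the IB verbs layer, and a low locked-memory limit (`ulimit -l`) is one commonly reported cause of that kind of failure. Whether that applies here is only an assumption, not something the log confirms. A minimal sketch for checking the limit from inside a job step on each node, using Python's standard `resource` module (the helper names `memlock_limit` and `describe` are made up for illustration):

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_bytes(value):
        # RLIM_INFINITY is the sentinel the OS uses for "no limit".
        return None if value == resource.RLIM_INFINITY else value

    return to_bytes(soft), to_bytes(hard)


def describe(limit):
    """Format a limit for printing."""
    return "unlimited" if limit is None else f"{limit / 2**20:.1f} MiB"


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"RLIMIT_MEMLOCK soft={describe(soft)} hard={describe(hard)}")
```

Running this through `srun` rather than in a login shell matters, because the limit seen by job steps can differ from the one seen interactively.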
Any help would be appreciated.
(Note: if I disable IB, the job runs fine and finishes with the expected results.)
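For anyone trying to reproduce the comparison: one common way to turn IB off in NCCL is the `NCCL_IB_DISABLE` environment variable, and `NCCL_DEBUG=INFO` makes NCCL log which transport it picks. The sketch below assumes that mechanism; the post does not say how IB was actually disabled, and `nccl_env` is a hypothetical helper name. The variables have to be set before the process group is initialized.

```python
import os


def nccl_env(disable_ib: bool) -> dict:
    """Build NCCL environment settings for an IB vs. non-IB comparison run."""
    env = {"NCCL_DEBUG": "INFO"}  # verbose NCCL logging, including transport selection
    if disable_ib:
        env["NCCL_IB_DISABLE"] = "1"  # make NCCL skip the IB transport and fall back to sockets
    return env


# Apply before torch.distributed / DeepSpeed initializes NCCL.
os.environ.update(nccl_env(disable_ib=True))
```

In a Slurm job the same variables can instead be exported in the batch script so that every rank inherits them.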