NCCL can't use the IB (InfiniBand) network

I'm running multi-node, multi-GPU LLM training with DeepSpeed + Megatron. The job runs on a Slurm cluster with more than 10 nodes of RTX 3090 GPUs.

Driver Version: 530.30.02 CUDA Version: 12.1
NCCL Version: 2.14.3

Other relevant versions:

import torch
print(torch.__version__)
2.0.1
print(torch.version.cuda)
11.8
print(torch.backends.cudnn.version())
8700

Attached Slurm log:
IB error slurm.log (68.8 KB)

Some NCCL warnings that indicate failures or errors:
NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
NCCL WARN Error: trying to connect already connected sendComm
NCCL WARN Net : Connection closed by remote peer 10.19.160.30<50955>
NCCL WARN Proxy Call to rank 1 failed (Connect)
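
The `ibv_reg_mr_iova2 ... Cannot allocate memory` warning above is often associated with a low locked-memory (memlock) limit on the compute nodes, although that is not confirmed for this cluster. A minimal sketch for checking the limit from inside a job step, using only Python's standard `resource` module (the helper names `memlock_limit` and `describe` are made up for illustration):

```python
import resource


def memlock_limit():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_bytes(value):
        # RLIM_INFINITY is the sentinel the OS uses for "no limit".
        return None if value == resource.RLIM_INFINITY else value

    return to_bytes(soft), to_bytes(hard)


def describe(limit):
    """Format a limit in MiB, or 'unlimited' when there is no cap."""
    return "unlimited" if limit is None else f"{limit / 2**20:.1f} MiB"


soft, hard = memlock_limit()
print(f"memlock soft={describe(soft)} hard={describe(hard)}")
```

Running this through `srun` on each node shows the limit the job actually inherits, which may differ from what an interactive login shell reports.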

Please help me out

(NOTE: If I disable IB, the job runs fine and finishes with the expected results.)
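
For anyone reproducing this, here is a rough sketch of how the IB and non-IB runs can be switched while capturing more NCCL logging. It assumes IB is toggled with the `NCCL_IB_DISABLE` environment variable (the post does not say how IB was disabled), and the helper name `nccl_env` is hypothetical. The variables need to be set before the process group is initialized.

```python
import os


def nccl_env(disable_ib):
    """Build NCCL environment settings for a debug run.

    disable_ib=True makes NCCL skip the InfiniBand transport and fall
    back to sockets; False leaves IB enabled.
    """
    return {
        "NCCL_DEBUG": "INFO",             # verbose NCCL logging
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # focus on init and network setup
        "NCCL_IB_DISABLE": "1" if disable_ib else "0",
    }


# Apply before torch.distributed / DeepSpeed initialization.
os.environ.update(nccl_env(disable_ib=True))
print({name: os.environ[name] for name in sorted(nccl_env(True))})
```

Comparing the `NCCL INFO` lines from a run with `disable_ib=False` against one with `disable_ib=True` should show which transport each rank picked and where the IB connection setup fails.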

Let me know if more detailed information is needed.

Same warning encountered! Have you fixed this?