NCCL can't use IB network

ronleonearth · October 1, 2023, 4:18pm

I’m running multi-node, multi-GPU LLM training with DeepSpeed + Megatron. The job runs on a Slurm cluster of more than 10 nodes with RTX 3090 GPUs.

Driver Version: 530.30.02 CUDA Version: 12.1
NCCL Version: 2.14.3

Other relevant versions:

import torch
print(torch.__version__)
2.0.1
print(torch.version.cuda)
11.8
print(torch.backends.cudnn.version())
8700

Attached Slurm log:
IB error slurm.log (68.8 KB)

Some NCCL warnings that indicate failures or errors:
NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
NCCL WARN Error: trying to connect already connected sendComm
NCCL WARN Net : Connection closed by remote peer 10.19.160.30<50955>
NCCL WARN Proxy Call to rank 1 failed (Connect)
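
The first warning comes from InfiniBand memory registration (`ibv_reg_mr_iova2`), which pins memory. One commonly reported cause of "Cannot allocate memory" at that step is a low locked-memory limit for the job's processes. Nothing in this thread confirms that this is the cause here, so the following is only a diagnostic sketch to run inside a Slurm job step on each node. The helper names `format_limit` and `memlock_report` are made up for illustration.

```python
import resource


def format_limit(value):
    """Render an rlimit value: either a byte count or RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / (1024 ** 2):.1f} MiB"


def memlock_report():
    """Return the soft and hard locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"RLIMIT_MEMLOCK soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    # Run this through the same launcher as the training job (for example
    # via srun), since limits inside a job step can differ from a login shell.
    print(memlock_report())
```

If the reported limit is small rather than "unlimited", that would be worth checking with the cluster administrators before looking further into NCCL itself.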

Please help me out

(NOTE: if I disable IB, the job runs fine and finishes with the expected results.)
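
For anyone trying to reproduce the comparison, here is a minimal sketch of how the two runs can be set up. It assumes IB is disabled through the NCCL environment variable `NCCL_IB_DISABLE`, which is an assumption about the setup, and it turns on `NCCL_DEBUG=INFO` so the log shows which network transport NCCL selects. The helper name `nccl_env` is hypothetical.

```python
import os


def nccl_env(disable_ib, base=None):
    """Build an environment dict for a training run.

    disable_ib: True forces NCCL off InfiniBand (falls back to sockets).
    base: environment to copy; defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # verbose NCCL logging, including transport choice
    env["NCCL_IB_DISABLE"] = "1" if disable_ib else "0"
    return env


if __name__ == "__main__":
    # Print the NCCL-related settings for both variants of the run.
    for disable_ib in (False, True):
        env = nccl_env(disable_ib, base={})
        print(sorted(env.items()))
```

These variables need to be in the environment of every rank before NCCL is initialized, so in practice they are usually exported in the sbatch script rather than set from inside the training code.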

ronleonearth · October 7, 2023, 3:47am

Let me know if more detailed information is needed.

yyt2021614317 · October 11, 2023, 10:49am

Same warning encountered! Have you fixed this?
