I’m running a distributed TensorFlow job using NCCL AllGather and AllReduce.
My machines are connected via Mellanox ConnectX-4 adapters (InfiniBand), and each machine in the cluster is equipped with 6 Titan Xp GPUs.
I ran my job on 2 machines (2 * 6 = 12 GPUs) without any problem.
However, once I used 4 machines, it failed with an error and did not proceed.
The error occurred in the second (not first) iteration.
I can run other jobs which use only NCCL AllReduce on 4 (and 8) machines.
This problem only occurs when I try to use both NCCL AllGather and AllReduce with 4 or more machines.
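In case it helps, here is a minimal sketch of the collective pattern I am using: AllReduce followed by AllGather on the same communicators. This is not my actual TensorFlow code, just an approximation written against the plain NCCL 2 C API for a single machine; the buffer sizes and iteration count are arbitrary placeholders.

/* Sketch of the AllReduce + AllGather pattern (single machine, NCCL 2 C API).
 * Not my real TensorFlow job; sizes and loop count are illustration values. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NGPUS 6            /* each of my machines has 6 Titan Xp GPUs */
#define COUNT (1 << 20)    /* arbitrary element count per GPU */

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); \
  if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { printf("NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
  ncclComm_t comms[NGPUS];
  int devs[NGPUS];
  float *send[NGPUS], *recv_ar[NGPUS], *recv_ag[NGPUS];
  cudaStream_t streams[NGPUS];

  for (int i = 0; i < NGPUS; ++i) {
    devs[i] = i;
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaMalloc((void**)&send[i],    COUNT * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&recv_ar[i], COUNT * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&recv_ag[i], (size_t)NGPUS * COUNT * sizeof(float)));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
  }

  /* One communicator per local GPU; the real multi-machine run uses
   * ncclCommInitRank with a shared ncclUniqueId instead. */
  CHECK_NCCL(ncclCommInitAll(comms, NGPUS, devs));

  for (int iter = 0; iter < 2; ++iter) {   /* my failure shows up in the second iteration */
    /* AllReduce on all local GPUs, issued as one group. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < NGPUS; ++i)
      CHECK_NCCL(ncclAllReduce(send[i], recv_ar[i], COUNT, ncclFloat, ncclSum,
                               comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    /* AllGather on the same communicators. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < NGPUS; ++i)
      CHECK_NCCL(ncclAllGather(send[i], recv_ag[i], COUNT, ncclFloat,
                               comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    for (int i = 0; i < NGPUS; ++i) {
      CHECK_CUDA(cudaSetDevice(i));
      CHECK_CUDA(cudaStreamSynchronize(streams[i]));
    }
  }

  for (int i = 0; i < NGPUS; ++i) ncclCommDestroy(comms[i]);
  return 0;
}

The error below appears when this same mix of collectives runs across 4 or more machines: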
mlx5: medici-03: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 93005204 090006d0 0b8035d3
medici-03:21077:22641 [5] transport/net_ib.cu:775 WARN NET/IB : Got completion with error 4, opcode 1, vendor err 82
I have no idea what this error code means. Could you kindly give me some help?