I’m running a distributed TensorFlow job using NCCL AllGather and AllReduce.
My machines are connected via Mellanox ConnectX-4 adapters (InfiniBand), and each machine in the cluster is equipped with 6 Titan Xp GPUs.
I ran my job on 2 machines (2 * 6 = 12 GPUs) without any problem.
However, once I used 4 machines, it failed with an error and did not proceed.
The error occurred in the second (not first) iteration.
I can run other jobs which use only NCCL AllReduce on 4 (and 8) machines.
This problem only occurs when I try to use both NCCL AllGather and AllReduce with 4 or more machines.
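In case it helps, here is a minimal sketch of the collective pattern I am using: AllReduce followed by AllGather on the same communicators. This is not my actual TensorFlow code, just an approximation written against the plain NCCL 2 C API for a single machine; the buffer sizes and iteration count are arbitrary placeholders.

/* Sketch of the AllReduce + AllGather pattern (single machine, NCCL 2 C API).
 * Not my real TensorFlow job; sizes and loop count are illustration values. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NGPUS 6            /* each of my machines has 6 Titan Xp GPUs */
#define COUNT (1 << 20)    /* arbitrary element count per GPU */

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); \
  if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { printf("NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
  ncclComm_t comms[NGPUS];
  int devs[NGPUS];
  float *send[NGPUS], *recv_ar[NGPUS], *recv_ag[NGPUS];
  cudaStream_t streams[NGPUS];

  for (int i = 0; i < NGPUS; ++i) {
    devs[i] = i;
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaMalloc((void**)&send[i],    COUNT * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&recv_ar[i], COUNT * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&recv_ag[i], (size_t)NGPUS * COUNT * sizeof(float)));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
  }

  /* One communicator per local GPU; the real multi-machine run uses
   * ncclCommInitRank with a shared ncclUniqueId instead. */
  CHECK_NCCL(ncclCommInitAll(comms, NGPUS, devs));

  for (int iter = 0; iter < 2; ++iter) {   /* my failure shows up in the second iteration */
    /* AllReduce on all local GPUs, issued as one group. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < NGPUS; ++i)
      CHECK_NCCL(ncclAllReduce(send[i], recv_ar[i], COUNT, ncclFloat, ncclSum,
                               comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    /* AllGather on the same communicators. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < NGPUS; ++i)
      CHECK_NCCL(ncclAllGather(send[i], recv_ag[i], COUNT, ncclFloat,
                               comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    for (int i = 0; i < NGPUS; ++i) {
      CHECK_CUDA(cudaSetDevice(i));
      CHECK_CUDA(cudaStreamSynchronize(streams[i]));
    }
  }

  for (int i = 0; i < NGPUS; ++i) ncclCommDestroy(comms[i]);
  return 0;
}

The error below appears when this same mix of collectives runs across 4 or more machines: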
mlx5: medici-03: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 93005204 090006d0 0b8035d3
medici-03:21077:22641 [5] transport/net_ib.cu:775 WARN NET/IB : Got completion with error 4, opcode 1, vendor err 82
I have no idea what this error code means. Could you kindly give me some help?