NCCL AllGather & AllReduce error

I’m running a distributed TensorFlow job that uses NCCL AllGather and AllReduce.
My machines are connected via Mellanox ConnectX-4 adapters (InfiniBand), and each machine in the cluster has 6 Titan Xp GPUs.

I ran my job on 2 machines (2 × 6 = 12 GPUs) without any problem.
However, once I used 4 machines, it failed with an error and did not proceed.
The error occurs in the second (not the first) iteration.
Other jobs that use only NCCL AllReduce run fine on 4 (and 8) machines.
The problem occurs only when I use both NCCL AllGather and AllReduce on 4 or more machines.

mlx5: medici-03: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 93005204 090006d0 0b8035d3

medici-03:21077:22641 [5] transport/net_ib.cu:775 WARN NET/IB : Got completion with error 4, opcode 1, vendor err 82

I have no idea what this error code means; could you kindly give me some help?

More specifically, the job hangs after printing the above message to the console (the NCCL operation never returns).
The problem does not occur when I set “NCCL_IB_DISABLE=1” (falling back to sockets instead of InfiniBand).
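For reference, this is how I apply the workaround when launching the job (a minimal sketch; “train.py” stands in for my actual training script):

```shell
# Force NCCL to use TCP sockets instead of InfiniBand verbs.
# This avoids the mlx5 completion error above, at the cost of
# lower inter-node bandwidth. "train.py" is a placeholder for
# the real distributed TensorFlow entry point.
NCCL_IB_DISABLE=1 python train.py
```

With this variable set, the 4-machine run completes all iterations, so the hang seems specific to the IB transport path.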