Hi!
Here is the output with NCCL_DEBUG=info:
[1,0]:660382a73181:151:639 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,0]:660382a73181:151:639 [0] NCCL INFO P2P plugin IBext
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.9.6+cuda11.3
[1,2]:660382a73181:153:642 [2] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,2]:660382a73181:153:642 [2] NCCL INFO P2P plugin IBext
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Using network Socket
[1,3]:660382a73181:154:641 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,3]:660382a73181:154:641 [3] NCCL INFO P2P plugin IBext
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,1]:660382a73181:152:640 [1] NCCL INFO P2P plugin IBext
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,3]:660382a73181:154:641 [3] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
[1,1]:660382a73181:152:640 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
[1,2]:660382a73181:153:642 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
[1,2]:660382a73181:153:642 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2
[1,3]:660382a73181:154:641 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
[1,0]:660382a73181:151:639 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 00 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 02 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 04 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 06 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 1[7000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 1[7000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all rings
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all rings
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all trees
[1,0]:660382a73181:151:639 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,0]:660382a73181:151:639 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all trees
[1,1]:660382a73181:152:640 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,1]:660382a73181:152:640 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO comm 0x7fbdbb444480 rank 1 nranks 4 cudaDev 1 busId 7000 - Init COMPLETE
[1,0]:660382a73181:151:639 [0] NCCL INFO comm 0x7fbf734ca7c0 rank 0 nranks 4 cudaDev 0 busId 6000 - Init COMPLETE
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all trees
[1,2]:660382a73181:153:642 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,2]:660382a73181:153:642 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all trees
[1,3]:660382a73181:154:641 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,3]:660382a73181:154:641 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,2]:660382a73181:153:642 [2] NCCL INFO comm 0x7efd834445d0 rank 2 nranks 4 cudaDev 2 busId a000 - Init COMPLETE
[1,3]:660382a73181:154:641 [3] NCCL INFO comm 0x7fe8ab43c160 rank 3 nranks 4 cudaDev 3 busId b000 - Init COMPLETE
[1,1]:
[1,1]:660382a73181:152:640 [1] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:884 -> 1
[1,2]:
[1,2]:660382a73181:153:642 [2] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:884 -> 1
[1,3]:
[1,3]:660382a73181:154:641 [3] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:874 -> 4
[1,0]:
[1,0]:660382a73181:151:639 [0] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,1]:
[1,1]:660382a73181:152:640 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:874 -> 4
[1,1]:
[1,1]:660382a73181:152:640 [1] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,3]:
[1,3]:660382a73181:154:641 [3] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:874 -> 4
[1,3]:
[1,3]:660382a73181:154:641 [3] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,2]:
[1,2]:660382a73181:153:642 [2] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:874 -> 4
[1,2]:
[1,2]:660382a73181:153:642 [2] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'