Multinode NCCL test hangs after Init COMPLETE

Team,

I’m seeing a multinode NCCL test hang after Init COMPLETE.

I have verified the following:

  • Network reachability between the nodes is fine.
  • multi-device-perf-test also runs fine.
  • A small training job also runs fine.

**Log Snippet:**

mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO cudaDriverVersion 12020

NCCL version 2.18.1+cuda12.1

mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO init.cc:1535 Cuda Host Alloc Size 4 pointer 0x7f5b21e00000

mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO cudaDriverVersion 12020
|
|
|
|
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:379 [0] NCCL INFO init.cc:411 Cuda Host Alloc Size 128 pointer 0x7f0699fae200
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:374 [0] NCCL INFO comm 0x564cfc17fbc0 rank 0 nranks 16 cudaDev 0 busId 1a000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:377 [2] NCCL INFO comm 0x559096a056d0 rank 2 nranks 16 cudaDev 2 busId 5e000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:375 [6] NCCL INFO comm 0x556afbfefa20 rank 14 nranks 16 cudaDev 6 busId dc000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE

out-of-place in-place

size count type redop root time algbw busbw #wrong time algbw busbw #wrong

(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)

mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:379 [0] NCCL INFO comm 0x555be67206a0 rank 8 nranks 16 cudaDev 0 busId 1a000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:376 [4] NCCL INFO comm 0x5579dd5d8710 rank 12 nranks 16 cudaDev 4 busId 9c000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:373 [2] NCCL INFO comm 0x55d523753200 rank 10 nranks 16 cudaDev 2 busId 5e000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:372 [5] NCCL INFO comm 0x557164e0c4e0 rank 13 nranks 16 cudaDev 5 busId 9e000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:377 [3] NCCL INFO comm 0x555b84efa930 rank 11 nranks 16 cudaDev 3 busId 60000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:378 [1] NCCL INFO comm 0x55c5a394be80 rank 9 nranks 16 cudaDev 1 busId 1c000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:374 [7] NCCL INFO comm 0x55f5e619e800 rank 15 nranks 16 cudaDev 7 busId de000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:380 [7] NCCL INFO comm 0x55a8cd1c6f30 rank 7 nranks 16 cudaDev 7 busId de000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:376 [3] NCCL INFO comm 0x5598617918a0 rank 3 nranks 16 cudaDev 3 busId 60000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:375 [1] NCCL INFO comm 0x55f77417be60 rank 1 nranks 16 cudaDev 1 busId 1c000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:381 [5] NCCL INFO comm 0x55c5098ad9f0 rank 5 nranks 16 cudaDev 5 busId 9e000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:379 [4] NCCL INFO comm 0x561ac3e6d4a0 rank 4 nranks 16 cudaDev 4 busId 9c000 commId 0x205dd8d4a2ab20b9 - Init COMPLETE
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3a00000000 recvbuff 0x7f3800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f77417be60 [nranks=16] stream 0x55f773ea7bc0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f58e0000000 recvbuff 0x7f56e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c055fde260 [nranks=16] stream 0x55c055d09f30
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe500000000 recvbuff 0x7fe300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x559096a056d0 [nranks=16] stream 0x559096731440
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f1940000000 recvbuff 0x7f1740000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555b84efa930 [nranks=16] stream 0x555b84c26c10
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f7120000000 recvbuff 0x7f6f20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55d523753200 [nranks=16] stream 0x55d52347f140
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fd9a0000000 recvbuff 0x7fd7a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f5e619e800 [nranks=16] stream 0x55f5e5ecaaf0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0x7ff500000000 recvbuff 0x7ff300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5598617918a0 [nranks=16] stream 0x5598614bdc40
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f49a0000000 recvbuff 0x7f47a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x561ac3e6d4a0 [nranks=16] stream 0x561ac3b993f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe820000000 recvbuff 0x7fe620000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55a8cd1c6f30 [nranks=16] stream 0x55a8ccef2ae0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fbd20000000 recvbuff 0x7fbb20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5098ad9f0 [nranks=16] stream 0x55c5095d99d0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fbfc0000000 recvbuff 0x7fbdc0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x557164e0c4e0 [nranks=16] stream 0x557164b38980
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f3a00000000 recvbuff 0x7f3800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f77417be60 [nranks=16] stream 0x55f773ea7bc0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f3a00000000 recvbuff 0x7f3800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f77417be60 [nranks=16] stream 0x55f773ea7bc0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f58e0000000 recvbuff 0x7f56e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c055fde260 [nranks=16] stream 0x55c055d09f30
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f58e0000000 recvbuff 0x7f56e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c055fde260 [nranks=16] stream 0x55c055d09f30
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f58e0000000 recvbuff 0x7f56e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c055fde260 [nranks=16] stream 0x55c055d09f30
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:259:259 [6] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f58e0000000 recvbuff 0x7f56e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c055fde260 [nranks=16] stream 0x55c055d09f30
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f3a00000000 recvbuff 0x7f3800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f77417be60 [nranks=16] stream 0x55f773ea7bc0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:254:254 [1] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f3a00000000 recvbuff 0x7f3800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f77417be60 [nranks=16] stream 0x55f773ea7bc0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fe820000000 recvbuff 0x7fe620000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55a8cd1c6f30 [nranks=16] stream 0x55a8ccef2ae0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fe820000000 recvbuff 0x7fe620000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55a8cd1c6f30 [nranks=16] stream 0x55a8ccef2ae0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fe820000000 recvbuff 0x7fe620000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55a8cd1c6f30 [nranks=16] stream 0x55a8ccef2ae0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:261:261 [7] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fe820000000 recvbuff 0x7fe620000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55a8cd1c6f30 [nranks=16] stream 0x55a8ccef2ae0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fcea0000000 recvbuff 0x7fcca0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5a394be80 [nranks=16] stream 0x55c5a3678470
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fe500000000 recvbuff 0x7fe300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x559096a056d0 [nranks=16] stream 0x559096731440
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fe500000000 recvbuff 0x7fe300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x559096a056d0 [nranks=16] stream 0x559096731440
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fe500000000 recvbuff 0x7fe300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x559096a056d0 [nranks=16] stream 0x559096731440
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:255:255 [2] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fe500000000 recvbuff 0x7fe300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x559096a056d0 [nranks=16] stream 0x559096731440
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f95e0000000 recvbuff 0x7f93e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x556afbfefa20 [nranks=16] stream 0x556afbd1bf20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO AllReduce: opCount 1 sendbuff 0x7ff500000000 recvbuff 0x7ff300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5598617918a0 [nranks=16] stream 0x5598614bdc40
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO AllReduce: opCount 2 sendbuff 0x7ff500000000 recvbuff 0x7ff300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5598617918a0 [nranks=16] stream 0x5598614bdc40
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO AllReduce: opCount 3 sendbuff 0x7ff500000000 recvbuff 0x7ff300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5598617918a0 [nranks=16] stream 0x5598614bdc40
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:256:256 [3] NCCL INFO AllReduce: opCount 4 sendbuff 0x7ff500000000 recvbuff 0x7ff300000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5598617918a0 [nranks=16] stream 0x5598614bdc40
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9a00000000 recvbuff 0x7f9800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5579dd5d8710 [nranks=16] stream 0x5579dd304b20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fbfc0000000 recvbuff 0x7fbdc0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x557164e0c4e0 [nranks=16] stream 0x557164b38980
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fbfc0000000 recvbuff 0x7fbdc0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x557164e0c4e0 [nranks=16] stream 0x557164b38980
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fbfc0000000 recvbuff 0x7fbdc0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x557164e0c4e0 [nranks=16] stream 0x557164b38980
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fbd20000000 recvbuff 0x7fbb20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5098ad9f0 [nranks=16] stream 0x55c5095d99d0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fbd20000000 recvbuff 0x7fbb20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5098ad9f0 [nranks=16] stream 0x55c5095d99d0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fbd20000000 recvbuff 0x7fbb20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5098ad9f0 [nranks=16] stream 0x55c5095d99d0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:258:258 [5] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fbd20000000 recvbuff 0x7fbb20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5098ad9f0 [nranks=16] stream 0x55c5095d99d0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f49a0000000 recvbuff 0x7f47a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x561ac3e6d4a0 [nranks=16] stream 0x561ac3b993f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f49a0000000 recvbuff 0x7f47a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x561ac3e6d4a0 [nranks=16] stream 0x561ac3b993f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f49a0000000 recvbuff 0x7f47a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x561ac3e6d4a0 [nranks=16] stream 0x561ac3b993f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-0:257:257 [4] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f49a0000000 recvbuff 0x7f47a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x561ac3e6d4a0 [nranks=16] stream 0x561ac3b993f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fcea0000000 recvbuff 0x7fcca0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5a394be80 [nranks=16] stream 0x55c5a3678470
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fcea0000000 recvbuff 0x7fcca0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5a394be80 [nranks=16] stream 0x55c5a3678470
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fcea0000000 recvbuff 0x7fcca0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5a394be80 [nranks=16] stream 0x55c5a3678470
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:253:253 [1] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fcea0000000 recvbuff 0x7fcca0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55c5a394be80 [nranks=16] stream 0x55c5a3678470
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f1940000000 recvbuff 0x7f1740000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555b84efa930 [nranks=16] stream 0x555b84c26c10
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f1940000000 recvbuff 0x7f1740000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555b84efa930 [nranks=16] stream 0x555b84c26c10
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f1940000000 recvbuff 0x7f1740000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555b84efa930 [nranks=16] stream 0x555b84c26c10
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:255:255 [3] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f1940000000 recvbuff 0x7f1740000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555b84efa930 [nranks=16] stream 0x555b84c26c10
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fd9a0000000 recvbuff 0x7fd7a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f5e619e800 [nranks=16] stream 0x55f5e5ecaaf0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fd9a0000000 recvbuff 0x7fd7a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f5e619e800 [nranks=16] stream 0x55f5e5ecaaf0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fd9a0000000 recvbuff 0x7fd7a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f5e619e800 [nranks=16] stream 0x55f5e5ecaaf0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:262:262 [7] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fd9a0000000 recvbuff 0x7fd7a0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55f5e619e800 [nranks=16] stream 0x55f5e5ecaaf0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f7120000000 recvbuff 0x7f6f20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55d523753200 [nranks=16] stream 0x55d52347f140
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f7120000000 recvbuff 0x7f6f20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55d523753200 [nranks=16] stream 0x55d52347f140
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f7120000000 recvbuff 0x7f6f20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55d523753200 [nranks=16] stream 0x55d52347f140
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:254:254 [2] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f7120000000 recvbuff 0x7f6f20000000 count 2147483648 datatype 7 op 0 root 0 comm 0x55d523753200 [nranks=16] stream 0x55d52347f140
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:259:259 [5] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fbfc0000000 recvbuff 0x7fbdc0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x557164e0c4e0 [nranks=16] stream 0x557164b38980
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f95e0000000 recvbuff 0x7f93e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x556afbfefa20 [nranks=16] stream 0x556afbd1bf20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f95e0000000 recvbuff 0x7f93e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x556afbfefa20 [nranks=16] stream 0x556afbd1bf20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f95e0000000 recvbuff 0x7f93e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x556afbfefa20 [nranks=16] stream 0x556afbd1bf20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:260:260 [6] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f95e0000000 recvbuff 0x7f93e0000000 count 2147483648 datatype 7 op 0 root 0 comm 0x556afbfefa20 [nranks=16] stream 0x556afbd1bf20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9a00000000 recvbuff 0x7f9800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5579dd5d8710 [nranks=16] stream 0x5579dd304b20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9a00000000 recvbuff 0x7f9800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5579dd5d8710 [nranks=16] stream 0x5579dd304b20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9a00000000 recvbuff 0x7f9800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5579dd5d8710 [nranks=16] stream 0x5579dd304b20
mpijob-pynn-nccl-test-h100-16nics-2-worker-1:257:257 [4] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9a00000000 recvbuff 0x7f9800000000 count 2147483648 datatype 7 op 0 root 0 comm 0x5579dd5d8710 [nranks=16] stream 0x5579dd304b20
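
To see where each process stops, it can help to tally the highest opCount logged per process instead of reading the interleaved lines by eye. Below is a minimal sketch of that idea, not an official NCCL tool. The function name `last_op_per_process` is hypothetical, and the regex assumes the line layout shown in the snippet above and that opCount is printed in hexadecimal. In this snippet every process that logs AllReduce calls appears to reach opCount 4, so the calls are enqueued on all of them and the hang comes afterwards.

```python
import re

# Matches collective-call lines such as:
#   <host>:<pid>:<tid> [<cudaDev>] NCCL INFO AllReduce: opCount 4 sendbuff ...
# Assumption: opCount is printed in hex, so it is parsed with base 16.
LINE_RE = re.compile(
    r"^(?P<host>\S+?):(?P<pid>\d+):\d+ \[(?P<dev>\d+)\] NCCL INFO "
    r"(?P<coll>\w+): opCount (?P<op>[0-9a-fA-F]+) "
)


def last_op_per_process(lines):
    """Return {(host, pid, cuda_dev): highest opCount seen} for collective-call lines."""
    latest = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip Init COMPLETE, malloc, and other non-collective lines
        key = (m["host"], int(m["pid"]), int(m["dev"]))
        op = int(m["op"], 16)
        latest[key] = max(latest.get(key, -1), op)
    return latest


# Small demo on two lines copied from the log above.
demo = [
    "mpijob-pynn-nccl-test-h100-16nics-2-worker-0:253:253 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f5900000000 recvbuff 0x7f5700000000 count 2147483648 datatype 7 op 0 root 0 comm 0x564cfc17fbc0 [nranks=16] stream 0x564cfbec08b0",
    "mpijob-pynn-nccl-test-h100-16nics-2-worker-1:252:252 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f0480000000 recvbuff 0x7f0280000000 count 2147483648 datatype 7 op 0 root 0 comm 0x555be67206a0 [nranks=16] stream 0x555be644d0f0",
]
for (host, pid, dev), op in sorted(last_op_per_process(demo).items()):
    print(f"{host} pid={pid} cudaDev={dev} last opCount={op}")
```

To run it on a full log, pass the log file's lines to `last_op_per_process`. A process whose last opCount is lower than the others would be the first place to look.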


I'm running into the same problem.