Hi!
I'm running the NCCL all-reduce performance test:
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
and I get the following output, which includes a warning:
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 12877 on h0913n-ubuntu device 0 [0x0e] GeForce GTX 1080 Ti
# Rank 1 Pid 12877 on h0913n-ubuntu device 1 [0x0f] GeForce GTX 1080 Ti
# Rank 2 Pid 12877 on h0913n-ubuntu device 2 [0x01] GeForce GTX 1070
# Rank 3 Pid 12877 on h0913n-ubuntu device 3 [0x02] GeForce GTX 1070
# Rank 4 Pid 12877 on h0913n-ubuntu device 4 [0x03] GeForce GTX 1070
# Rank 5 Pid 12877 on h0913n-ubuntu device 5 [0x04] GeForce GTX 1070
# Rank 6 Pid 12877 on h0913n-ubuntu device 6 [0x05] GeForce GTX 1070
# Rank 7 Pid 12877 on h0913n-ubuntu device 7 [0x06] GeForce GTX 1070
h0913n-ubuntu:12877:12877 [0] NCCL INFO Bootstrap : Using [0]enp0s31f6:192.168.97.149<0>
h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h0913n-ubuntu:12877:12877 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.97.149<0>
NCCL version 2.4.8+cuda10.1
h0913n-ubuntu:12877:12877 [7] NCCL INFO nranks 8
h0913n-ubuntu:12877:12877 [0] NCCL INFO Setting affinity for GPU 0 to 03
h0913n-ubuntu:12877:12877 [1] NCCL INFO Setting affinity for GPU 1 to 03
h0913n-ubuntu:12877:12877 [2] NCCL INFO Setting affinity for GPU 2 to 03
h0913n-ubuntu:12877:12877 [3] NCCL INFO Setting affinity for GPU 3 to 03
h0913n-ubuntu:12877:12877 [4] NCCL INFO Setting affinity for GPU 4 to 03
h0913n-ubuntu:12877:12877 [5] NCCL INFO Setting affinity for GPU 5 to 03
h0913n-ubuntu:12877:12877 [6] NCCL INFO Setting affinity for GPU 6 to 03
h0913n-ubuntu:12877:12877 [7] NCCL INFO Setting affinity for GPU 7 to 03
h0913n-ubuntu:12877:12877 [7] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
h0913n-ubuntu:12877:12877 [7] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7
h0913n-ubuntu:12877:12877 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
h0913n-ubuntu:12877:12877 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
h0913n-ubuntu:12877:12877 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
h0913n-ubuntu:12877:12877 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via direct shared memory
h0913n-ubuntu:12877:12877 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
h0913n-ubuntu:12877:12877 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via direct shared memory
h0913n-ubuntu:12877:12877 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via direct shared memory
h0913n-ubuntu:12877:12877 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via direct shared memory
#
#                                                    out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
h0913n-ubuntu:12877:12877 [0] NCCL INFO Launch mode Group/CGMD
           8             2   float     sum    36.45    0.00    0.00  1e-07    36.47    0.00    0.00  1e-07
          16             4   float     sum    36.89    0.00    0.00  1e-07    36.80    0.00    0.00  1e-07
          32             8   float     sum    37.64    0.00    0.00  6e-08    36.47    0.00    0.00  6e-08
          64            16   float     sum    42.00    0.00    0.00  6e-08    41.92    0.00    0.00  6e-08
         128            32   float     sum    49.08    0.00    0.00  6e-08    48.88    0.00    0.00  6e-08
         256            64   float     sum    59.33    0.00    0.01  3e-08    59.96    0.00    0.01  3e-08
         512           128   float     sum    74.01    0.01    0.01  3e-08    81.47    0.01    0.01  3e-08
        1024           256   float     sum    88.78    0.01    0.02  1e-07    92.33    0.01    0.02  1e-07
        2048           512   float     sum    108.0    0.02    0.03  2e-07    115.2    0.02    0.03  2e-07
        4096          1024   float     sum    132.2    0.03    0.05  2e-07    139.4    0.03    0.05  2e-07
        8192          2048   float     sum    222.9    0.04    0.06  2e-07    228.7    0.04    0.06  2e-07
       16384          4096   float     sum    420.8    0.04    0.07  2e-07    418.9    0.04    0.07  2e-07
       32768          8192   float     sum    865.2    0.04    0.07  2e-07    872.7    0.04    0.07  2e-07
       65536         16384   float     sum   1846.6    0.04    0.06  2e-07   1842.7    0.04    0.06  2e-07
      131072         32768   float     sum   1796.1    0.07    0.13  2e-07   1797.3    0.07    0.13  2e-07
      262144         65536   float     sum   3145.5    0.08    0.15  2e-07   3142.8    0.08    0.15  2e-07
      524288        131072   float     sum   5870.5    0.09    0.16  2e-07   5870.8    0.09    0.16  2e-07
     1048576        262144   float     sum    11424    0.09    0.16  2e-07    11424    0.09    0.16  2e-07
     2097152        524288   float     sum    22585    0.09    0.16  2e-07    22585    0.09    0.16  2e-07
     4194304       1048576   float     sum    44879    0.09    0.16  2e-07    44884    0.09    0.16  2e-07
     8388608       2097152   float     sum    89462    0.09    0.16  2e-07    89471    0.09    0.16  2e-07
    16777216       4194304   float     sum   181043    0.09    0.16  2e-07   181040    0.09    0.16  2e-07
    33554432       8388608   float     sum   361538    0.09    0.16  2e-07   361522    0.09    0.16  2e-07
    67108864      16777216   float     sum   722343    0.09    0.16  2e-07   722392    0.09    0.16  2e-07
   134217728      33554432   float     sum  1444196    0.09    0.16  2e-07  1444096    0.09    0.16  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.0849651
#
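As a sanity check on the bandwidth columns, here is a small Python sketch that recomputes algbw and busbw from the size and time values in the output. It is my own helper, not part of nccl-tests, and the name `allreduce_bandwidth` is hypothetical. It assumes the all-reduce relation described in the nccl-tests performance notes, busbw = algbw * 2 * (n - 1) / n, with n = 8 ranks as in this run.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Hypothetical helper for cross-checking the table; assumes
    busbw = algbw * 2 * (n - 1) / n for all-reduce.
    """
    # bytes per microsecond equals MB/s, so divide by 1e3 to get GB/s
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Last out-of-place row of the run above: 134217728 bytes in 1444196 us on 8 GPUs
algbw, busbw = allreduce_bandwidth(134217728, 1444196, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

For the last row this gives about 0.09 GB/s and 0.16 GB/s, which matches the table, so the reported numbers are internally consistent and the bandwidth really is that low.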
Could you help me solve this issue?