NCCL testing: Error: no plugin found (libnccl-net.so)

Hi!
I’m running the NCCL all_reduce_perf test:

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

and I get an error:

# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid  12877 on h0913n-ubuntu device  0 [0x0e] GeForce GTX 1080 Ti
#   Rank  1 Pid  12877 on h0913n-ubuntu device  1 [0x0f] GeForce GTX 1080 Ti
#   Rank  2 Pid  12877 on h0913n-ubuntu device  2 [0x01] GeForce GTX 1070
#   Rank  3 Pid  12877 on h0913n-ubuntu device  3 [0x02] GeForce GTX 1070
#   Rank  4 Pid  12877 on h0913n-ubuntu device  4 [0x03] GeForce GTX 1070
#   Rank  5 Pid  12877 on h0913n-ubuntu device  5 [0x04] GeForce GTX 1070
#   Rank  6 Pid  12877 on h0913n-ubuntu device  6 [0x05] GeForce GTX 1070
#   Rank  7 Pid  12877 on h0913n-ubuntu device  7 [0x06] GeForce GTX 1070
h0913n-ubuntu:12877:12877 [0] NCCL INFO Bootstrap : Using [0]enp0s31f6:192.168.97.149<0>
h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

h0913n-ubuntu:12877:12877 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.97.149<0>
NCCL version 2.4.8+cuda10.1
h0913n-ubuntu:12877:12877 [7] NCCL INFO nranks 8
h0913n-ubuntu:12877:12877 [0] NCCL INFO Setting affinity for GPU 0 to 03
h0913n-ubuntu:12877:12877 [1] NCCL INFO Setting affinity for GPU 1 to 03
h0913n-ubuntu:12877:12877 [2] NCCL INFO Setting affinity for GPU 2 to 03
h0913n-ubuntu:12877:12877 [3] NCCL INFO Setting affinity for GPU 3 to 03
h0913n-ubuntu:12877:12877 [4] NCCL INFO Setting affinity for GPU 4 to 03
h0913n-ubuntu:12877:12877 [5] NCCL INFO Setting affinity for GPU 5 to 03
h0913n-ubuntu:12877:12877 [6] NCCL INFO Setting affinity for GPU 6 to 03
h0913n-ubuntu:12877:12877 [7] NCCL INFO Setting affinity for GPU 7 to 03
h0913n-ubuntu:12877:12877 [7] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
h0913n-ubuntu:12877:12877 [7] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7
h0913n-ubuntu:12877:12877 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
h0913n-ubuntu:12877:12877 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
h0913n-ubuntu:12877:12877 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
h0913n-ubuntu:12877:12877 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via direct shared memory
h0913n-ubuntu:12877:12877 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
h0913n-ubuntu:12877:12877 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via direct shared memory
h0913n-ubuntu:12877:12877 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via direct shared memory
h0913n-ubuntu:12877:12877 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via direct shared memory
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
h0913n-ubuntu:12877:12877 [0] NCCL INFO Launch mode Group/CGMD
           8             2   float     sum    36.45    0.00    0.00  1e-07    36.47    0.00    0.00  1e-07
          16             4   float     sum    36.89    0.00    0.00  1e-07    36.80    0.00    0.00  1e-07
          32             8   float     sum    37.64    0.00    0.00  6e-08    36.47    0.00    0.00  6e-08
          64            16   float     sum    42.00    0.00    0.00  6e-08    41.92    0.00    0.00  6e-08
         128            32   float     sum    49.08    0.00    0.00  6e-08    48.88    0.00    0.00  6e-08
         256            64   float     sum    59.33    0.00    0.01  3e-08    59.96    0.00    0.01  3e-08
         512           128   float     sum    74.01    0.01    0.01  3e-08    81.47    0.01    0.01  3e-08
        1024           256   float     sum    88.78    0.01    0.02  1e-07    92.33    0.01    0.02  1e-07
        2048           512   float     sum    108.0    0.02    0.03  2e-07    115.2    0.02    0.03  2e-07
        4096          1024   float     sum    132.2    0.03    0.05  2e-07    139.4    0.03    0.05  2e-07
        8192          2048   float     sum    222.9    0.04    0.06  2e-07    228.7    0.04    0.06  2e-07
       16384          4096   float     sum    420.8    0.04    0.07  2e-07    418.9    0.04    0.07  2e-07
       32768          8192   float     sum    865.2    0.04    0.07  2e-07    872.7    0.04    0.07  2e-07
       65536         16384   float     sum   1846.6    0.04    0.06  2e-07   1842.7    0.04    0.06  2e-07
      131072         32768   float     sum   1796.1    0.07    0.13  2e-07   1797.3    0.07    0.13  2e-07
      262144         65536   float     sum   3145.5    0.08    0.15  2e-07   3142.8    0.08    0.15  2e-07
      524288        131072   float     sum   5870.5    0.09    0.16  2e-07   5870.8    0.09    0.16  2e-07
     1048576        262144   float     sum    11424    0.09    0.16  2e-07    11424    0.09    0.16  2e-07
     2097152        524288   float     sum    22585    0.09    0.16  2e-07    22585    0.09    0.16  2e-07
     4194304       1048576   float     sum    44879    0.09    0.16  2e-07    44884    0.09    0.16  2e-07
     8388608       2097152   float     sum    89462    0.09    0.16  2e-07    89471    0.09    0.16  2e-07
    16777216       4194304   float     sum   181043    0.09    0.16  2e-07   181040    0.09    0.16  2e-07
    33554432       8388608   float     sum   361538    0.09    0.16  2e-07   361522    0.09    0.16  2e-07
    67108864      16777216   float     sum   722343    0.09    0.16  2e-07   722392    0.09    0.16  2e-07
   134217728      33554432   float     sum  1444196    0.09    0.16  2e-07  1444096    0.09    0.16  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.0849651 
#

Could you help me solve this issue?

https://github.com/NVIDIA/nccl/issues/162

@Robert thanks for the reply!
Am I right in understanding that the libnccl-net.so messages are informational? In that case, what does the hex error code in the error column of the table mean?

When NCCL outputs a message with INFO in it, it is not considered an error (by NCCL).
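
To make the INFO/WARN distinction concrete, here is a minimal Python sketch that sorts NCCL_DEBUG lines by level. The regular expression and the helper name split_by_level are my own assumptions based only on the log lines pasted above; this is not an NCCL tool or API.

```python
import re

# Assumed line shape, taken from the log above: "... NCCL INFO <message>" or "... NCCL WARN <message>".
LEVEL_RE = re.compile(r"\bNCCL (INFO|WARN) (.*)")

def split_by_level(lines):
    """Group NCCL debug messages by their level tag (hypothetical helper)."""
    by_level = {"INFO": [], "WARN": []}
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:  # lines without a level tag (e.g. the version banner) are skipped
            by_level[match.group(1)].append(match.group(2).strip())
    return by_level

# Three lines copied from the log in the question, plus the version banner.
log = [
    "h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).",
    "h0913n-ubuntu:12877:12877 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]",
    "h0913n-ubuntu:12877:12877 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.97.149<0>",
    "NCCL version 2.4.8+cuda10.1",
]

levels = split_by_level(log)
for level, messages in levels.items():
    print(level, len(messages))
    for message in messages:
        print("   ", message)
```

The "No plugin found" message lands in the INFO group. In the log above the test also carries on past the WARN line and runs to completion.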

I'm not sure what you are referring to by "hex error code". If you mean this:

... error
...
... 1e-07
... 1e-07
... 6e-08

Those are not hex error codes. They are decimal numbers in scientific notation (1e-07 means 1 x 10^-7), and each one is the numerical error measured for the all-reduce, or whichever operation NCCL is running as a test. If the numerical error across all tests is small enough, then you see output like this:

# Out of bounds values : 0 OK
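
As a rough illustration of where numbers like 1e-07 come from, here is a small Python sketch that sums one buffer per rank in single precision and compares the result with a double-precision reference. It only shows the general idea; it is not the checking code from nccl-tests, and the buffer contents, the helper names, and the TOLERANCE value are assumptions made up for the example.

```python
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def all_reduce_sum_f32(rank_buffers):
    """Element-wise sum across ranks, rounding to float32 after every addition."""
    result = []
    for values in zip(*rank_buffers):
        acc = 0.0
        for v in values:
            acc = to_f32(acc + to_f32(v))
        result.append(acc)
    return result

def max_abs_error(result, expected):
    """Largest absolute difference between the reduced and the reference values."""
    return max(abs(r - e) for r, e in zip(result, expected))

n_ranks, count = 8, 4  # 8 GPUs as in the question; 4 elements per buffer (arbitrary)
rank_buffers = [
    [(rank + 1) * 0.1 + i * 0.01 for i in range(count)]  # made-up input data
    for rank in range(n_ranks)
]

expected = [sum(values) for values in zip(*rank_buffers)]  # double-precision reference
result = all_reduce_sum_f32(rank_buffers)

TOLERANCE = 1e-4  # hypothetical threshold, not the one nccl-tests uses
err = max_abs_error(result, expected)
out_of_bounds = sum(1 for r, e in zip(result, expected) if abs(r - e) > TOLERANCE)

print(f"max error: {err:.0e}")
print(f"out of bounds values: {out_of_bounds}")
```

A tiny nonzero error of this kind is normal for floating-point sums. It only signals a problem when it exceeds the tolerance, which is what a nonzero "Out of bounds values" count would report.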

NCCL is considered a deep learning library, so you may wish to ask NCCL questions here:

https://devtalk.nvidia.com/default/board/307/other-libraries/

Thank you!