More than 1 GPU not working using TAO train

Okay, I see nvcc in that folder now after downloading CUDA.

I get this error when running make:

./verifiable/verifiable.cu:4:10: fatal error: nccl.h: No such file or directory
    4 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
make[1]: *** [../verifiable/verifiable.mk:11: /home/amrc_cymru/nccl-tests/build/verifiable/verifiable.o] Error 1
make[1]: Leaving directory '/home/amrc_cymru/nccl-tests/src'
make: *** [Makefile:20: src.build] Error 2

I ran export PATH=/usr/local/cuda-12/bin${PATH:+:${PATH}}, but that didn't help.
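
Note that nccl.h comes from the NCCL development package, not from the CUDA toolkit, so fixing PATH only helps nvcc be found. A likely fix, assuming a Debian/Ubuntu host with NVIDIA's repositories configured, is to install the NCCL headers and point the build at them via the NCCL_HOME variable that the nccl-tests Makefile documents:

$ sudo apt install libnccl2 libnccl-dev        # NCCL runtime + development headers
$ make CUDA_HOME=/usr/local/cuda-12 NCCL_HOME=/usr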

To avoid unexpected issues, can you pull the TensorRT docker image below and run the NCCL test inside it? Thanks.
$ docker pull nvcr.io/nvidia/tensorrt:22.11-py3
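
For example, one way to run the test inside it (assuming the NVIDIA Container Toolkit is installed; the repository URL is the public nccl-tests GitHub project):

$ docker run --gpus all -it --rm nvcr.io/nvidia/tensorrt:22.11-py3
# then, inside the container:
$ git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4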

Hello, I pulled the TensorRT docker image and ran the NCCL tests inside it with NVIDIA driver 515.

Still the same issue:

root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1017 on b9ce305d96cb device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid   1017 on b9ce305d96cb device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid   1017 on b9ce305d96cb device  2 [0x86] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid   1017 on b9ce305d96cb device  3 [0xaf] NVIDIA RTX A6000
[1680170342.329956] [b9ce305d96cb:1017 :0]           debug.c:1289 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1680170342.329978] [b9ce305d96cb:1017 :0]           debug.c:1289 UCX  WARN  ucs_debug_disable_signal: signal 1 was not set in ucs
[1680170342.329983] [b9ce305d96cb:1017 :1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[b9ce305d96cb:1017 :0:1030] Caught signal 7 (Bus error: nonexistent physical address)
[b9ce305d96cb:1017 :1:1032] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   1030) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000007587d ncclGroupEnd()  ???:0
 3 0x000000000007b0ef ncclGroupEnd()  ???:0
 4 0x0000000000059e97 ncclGetUniqueId()  ???:0
 5 0x00000000000489b1 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???()  /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000063dcc ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1038 on b9ce305d96cb device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid   1038 on b9ce305d96cb device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid   1038 on b9ce305d96cb device  2 [0x86] NVIDIA RTX A6000
[b9ce305d96cb:1038 :0:1049] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1055 on b9ce305d96cb device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid   1055 on b9ce305d96cb device  1 [0x5e] NVIDIA RTX A6000
[b9ce305d96cb:1055 :0:1067] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1068 on b9ce305d96cb device  0 [0x3b] NVIDIA RTX A6000
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     3.27    0.00    0.00      0     0.16    0.05    0.00      0
          16             4     float     sum      -1     3.78    0.00    0.00      0     0.18    0.09    0.00      0
          32             8     float     sum      -1     3.37    0.01    0.00      0     0.18    0.18    0.00      0
          64            16     float     sum      -1     3.39    0.02    0.00      0     0.18    0.36    0.00      0
         128            32     float     sum      -1     3.36    0.04    0.00      0     0.18    0.72    0.00      0
         256            64     float     sum      -1     3.35    0.08    0.00      0     0.18    1.44    0.00      0
         512           128     float     sum      -1     3.33    0.15    0.00      0     0.18    2.87    0.00      0
        1024           256     float     sum      -1     3.31    0.31    0.00      0     0.18    5.74    0.00      0
        2048           512     float     sum      -1     3.36    0.61    0.00      0     0.18   11.56    0.00      0
        4096          1024     float     sum      -1     3.30    1.24    0.00      0     0.18   23.08    0.00      0
        8192          2048     float     sum      -1     3.29    2.49    0.00      0     0.18   45.55    0.00      0
       16384          4096     float     sum      -1     3.32    4.93    0.00      0     0.18   92.17    0.00      0
       32768          8192     float     sum      -1     3.96    8.27    0.00      0     0.18  185.13    0.00      0
       65536         16384     float     sum      -1     3.10   21.12    0.00      0     0.17  387.90    0.00      0
      131072         32768     float     sum      -1     3.28   39.95    0.00      0     0.16  844.54    0.00      0
      262144         65536     float     sum      -1     4.14   63.27    0.00      0     0.16  1600.88    0.00      0
      524288        131072     float     sum      -1     4.29  122.19    0.00      0     0.15  3398.95    0.00      0
     1048576        262144     float     sum      -1     5.35  195.99    0.00      0     0.15  6780.32    0.00      0
     2097152        524288     float     sum      -1     9.06  231.55    0.00      0     0.17  12446.01    0.00      0
     4194304       1048576     float     sum      -1    15.86  264.49    0.00      0     0.16  26903.81    0.00      0
     8388608       2097152     float     sum      -1    28.91  290.14    0.00      0     0.16  53687.09    0.00      0
    16777216       4194304     float     sum      -1    52.99  316.64    0.00      0     0.16  107892.06    0.00      0
    33554432       8388608     float     sum      -1    102.1  328.77    0.00      0     0.15  217532.78    0.00      0
    67108864      16777216     float     sum      -1    200.2  335.25    0.00      0     0.16  412216.61    0.00      0
   134217728      33554432     float     sum      -1    396.6  338.45    0.00      0     0.15  873244.81    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
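
One way to narrow a crash like this down, using NCCL's documented environment variables, would be to rerun the failing case with debug logging enabled and with individual transports disabled one at a time (a triage sketch, not a fix):

$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4        # verbose NCCL logs
$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4     # rule out GPU peer-to-peer
$ NCCL_SHM_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4     # rule out shared-memory transport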

So it does not seem to be an issue with the TAO container; it looks like an NCCL issue.

Can you run with an older version of the TAO docker image as well?
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

Or the 22.05 version:
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3

So I've tried the older docker image (the 22.05 version) and ran the NCCL tests inside it. The NVIDIA driver version was 515.

For 4 GPUs I get this (it gets stuck on the last message, so I have to Ctrl+C to exit):

root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     44 on 014ebf08f84a device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid     44 on 014ebf08f84a device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid     44 on 014ebf08f84a device  2 [0x86] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid     44 on 014ebf08f84a device  3 [0xaf] NVIDIA RTX A6000
014ebf08f84a:44:44 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:44:44 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:44:44 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:44:44 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:44:44 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:44:44 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:44:44 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:44:56 [0] NCCL INFO Channel 00/04 :    0   1   2   3
014ebf08f84a:44:56 [0] NCCL INFO Channel 01/04 :    0   3   2   1
014ebf08f84a:44:56 [0] NCCL INFO Channel 02/04 :    0   1   2   3
014ebf08f84a:44:56 [0] NCCL INFO Channel 03/04 :    0   3   2   1
014ebf08f84a:44:57 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
014ebf08f84a:44:56 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
014ebf08f84a:44:57 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:44:56 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:44:59 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
014ebf08f84a:44:59 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
014ebf08f84a:44:58 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
014ebf08f84a:44:58 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
014ebf08f84a:44:58 [2] NCCL INFO Channel 00 : 2[86000] -> 3[af000] via direct shared memory
014ebf08f84a:44:58 [2] NCCL INFO Channel 02 : 2[86000] -> 3[af000] via direct shared memory
014ebf08f84a:44:56 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:44:56 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:44:59 [3] NCCL INFO Channel 00 : 3[af000] -> 0[3b000] via P2P/direct pointer
014ebf08f84a:44:59 [3] NCCL INFO Channel 02 : 3[af000] -> 0[3b000] via P2P/direct pointer
014ebf08f84a:44:57 [1] NCCL INFO Channel 00 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:44:57 [1] NCCL INFO Channel 02 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:44:59 [3] NCCL INFO Channel 01 : 3[af000] -> 2[86000] via direct shared memory
014ebf08f84a:44:59 [3] NCCL INFO Channel 03 : 3[af000] -> 2[86000] via direct shared memory
014ebf08f84a:44:57 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:44:57 [1] NCCL INFO Channel 03 : 1[5e000] -> 0[3b000] via direct shared memory

014ebf08f84a:44:56 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:44:56 [0] NCCL INFO include/shm.h:41 -> 2

014ebf08f84a:44:56 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ed257aa4e8c671bb-3-1-0 (size 9637888)
014ebf08f84a:44:56 [0] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:44:56 [0] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:44:56 [0] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:44:56 [0] NCCL INFO init.cc:804 -> 2
014ebf08f84a:44:56 [0] NCCL INFO init.cc:941 -> 2

014ebf08f84a:44:58 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:44:58 [2] NCCL INFO include/shm.h:41 -> 2

014ebf08f84a:44:58 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ed257aa4e8c671bb-3-3-2 (size 9637888)
014ebf08f84a:44:56 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
014ebf08f84a:44:58 [2] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:44:58 [2] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:44:58 [2] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:44:58 [2] NCCL INFO init.cc:804 -> 2
014ebf08f84a:44:58 [2] NCCL INFO init.cc:941 -> 2
014ebf08f84a:44:58 [2] NCCL INFO group.cc:72 -> 2 [Async thread]

For 3 GPUs I get this:

root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     64 on 014ebf08f84a device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid     64 on 014ebf08f84a device  1 [0x5e] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid     64 on 014ebf08f84a device  2 [0x86] NVIDIA RTX A6000
014ebf08f84a:64:64 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:64:64 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:64:64 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:64:64 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:64:64 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:64:64 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:64:64 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:64:74 [0] NCCL INFO Channel 00/02 :    0   1   2
014ebf08f84a:64:74 [0] NCCL INFO Channel 01/02 :    0   1   2
014ebf08f84a:64:74 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
014ebf08f84a:64:74 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:64:75 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
014ebf08f84a:64:75 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:64:76 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
014ebf08f84a:64:76 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
014ebf08f84a:64:76 [2] NCCL INFO Channel 00 : 2[86000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Channel 01 : 2[86000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:74 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:64:75 [1] NCCL INFO Channel 00 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:64:74 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:64:75 [1] NCCL INFO Channel 01 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Connected all rings
014ebf08f84a:64:74 [0] NCCL INFO Connected all rings
014ebf08f84a:64:76 [2] NCCL INFO Connected all rings
014ebf08f84a:64:76 [2] NCCL INFO Channel 00 : 2[86000] -> 1[5e000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Channel 00 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Channel 01 : 2[86000] -> 1[5e000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Connected all trees
014ebf08f84a:64:76 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
014ebf08f84a:64:76 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer

014ebf08f84a:64:74 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:64:74 [0] NCCL INFO include/shm.h:41 -> 2

014ebf08f84a:64:74 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-c477d5051b4c00c3-0-1-0 (size 9637888)
014ebf08f84a:64:74 [0] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:64:74 [0] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:64:74 [0] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:64:74 [0] NCCL INFO init.cc:815 -> 2
014ebf08f84a:64:74 [0] NCCL INFO init.cc:941 -> 2
014ebf08f84a:64:74 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
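
The repeated "posix_fallocate failed : No space left on device" warnings in the 4- and 3-GPU runs point at the container's /dev/shm (64 MB by default in Docker) being too small for NCCL's shared-memory segments. A likely workaround, using the flags NVIDIA's container documentation recommends, is to relaunch the container with a larger shared-memory size:

$ docker run --gpus all -it --rm --shm-size=1g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3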

For 2 GPUs (this was okay):

root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     80 on 014ebf08f84a device  0 [0x3b] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid     80 on 014ebf08f84a device  1 [0x5e] NVIDIA RTX A6000
014ebf08f84a:80:80 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:80:80 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:80:80 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:80:80 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:80:80 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:80:80 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:80:80 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:80:89 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
014ebf08f84a:80:88 [0] NCCL INFO Channel 00/02 :    0   1
014ebf08f84a:80:88 [0] NCCL INFO Channel 01/02 :    0   1
014ebf08f84a:80:89 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:80:88 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
014ebf08f84a:80:88 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:80:89 [1] NCCL INFO Channel 00 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:80:89 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Connected all rings
014ebf08f84a:80:89 [1] NCCL INFO Connected all rings
014ebf08f84a:80:88 [0] NCCL INFO Connected all trees
014ebf08f84a:80:89 [1] NCCL INFO Connected all trees
014ebf08f84a:80:89 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
014ebf08f84a:80:89 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
014ebf08f84a:80:88 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
014ebf08f84a:80:88 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
014ebf08f84a:80:89 [1] NCCL INFO comm 0x7fc1bc0010c0 rank 1 nranks 2 cudaDev 1 busId 5e000 - Init COMPLETE
014ebf08f84a:80:88 [0] NCCL INFO comm 0x7fc1c40010c0 rank 0 nranks 2 cudaDev 0 busId 3b000 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
014ebf08f84a:80:80 [0] NCCL INFO Launch mode Parallel
           8             2     float     sum      -1     7.73    0.00    0.00      0     7.85    0.00    0.00      0
          16             4     float     sum      -1    10.45    0.00    0.00      0     8.32    0.00    0.00      0
          32             8     float     sum      -1     8.85    0.00    0.00      0     8.27    0.00    0.00      0
          64            16     float     sum      -1     8.79    0.01    0.01      0     8.26    0.01    0.01      0
         128            32     float     sum      -1     8.72    0.01    0.01      0     8.19    0.02    0.02      0
         256            64     float     sum      -1     8.88    0.03    0.03      0     8.18    0.03    0.03      0
         512           128     float     sum      -1     9.07    0.06    0.06      0     8.40    0.06    0.06      0
        1024           256     float     sum      -1     9.11    0.11    0.11      0     8.91    0.11    0.11      0
        2048           512     float     sum      -1     9.04    0.23    0.23      0     8.10    0.25    0.25      0
        4096          1024     float     sum      -1     9.06    0.45    0.45      0     8.47    0.48    0.48      0
        8192          2048     float     sum      -1    10.16    0.81    0.81      0     9.69    0.85    0.85      0
       16384          4096     float     sum      -1    13.47    1.22    1.22      0    12.88    1.27    1.27      0
       32768          8192     float     sum      -1    16.44    1.99    1.99      0    16.15    2.03    2.03      0
       65536         16384     float     sum      -1    28.86    2.27    2.27      0    28.81    2.27    2.27      0
      131072         32768     float     sum      -1    36.91    3.55    3.55      0    37.32    3.51    3.51      0
      262144         65536     float     sum      -1    56.79    4.62    4.62      0    55.95    4.69    4.69      0
      524288        131072     float     sum      -1    90.47    5.80    5.80      0    89.67    5.85    5.85      0
     1048576        262144     float     sum      -1    159.8    6.56    6.56      0    159.5    6.57    6.57      0
     2097152        524288     float     sum      -1    295.8    7.09    7.09      0    295.8    7.09    7.09      0
     4194304       1048576     float     sum      -1    562.9    7.45    7.45      0    565.9    7.41    7.41      0
     8388608       2097152     float     sum      -1   1094.0    7.67    7.67      0   1093.5    7.67    7.67      0
    16777216       4194304     float     sum      -1   2172.1    7.72    7.72      0   2164.8    7.75    7.75      0
    33554432       8388608     float     sum      -1   4329.9    7.75    7.75      0   4330.9    7.75    7.75      0
    67108864      16777216     float     sum      -1   8643.6    7.76    7.76      0   8640.2    7.77    7.77      0
   134217728      33554432     float     sum      -1    17231    7.79    7.79      0    17204    7.80    7.80      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.24406

And 1 GPU was okay as well.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

From the topic Error when training with multiple GPUs in TAO, that user can run 8 GPUs well with the 22.05 docker image.

And also, as you mentioned above that with the TensorRT docker image, running NCCL inside it on NVIDIA driver 515 still hits the same issue, I am afraid the issue is related to the topology.
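
A quick way to inspect that GPU topology from the host is nvidia-smi's topo subcommand, which prints the interconnect matrix (PIX/PXB/NODE/SYS) between every pair of GPUs:

$ nvidia-smi topo -m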
