Okay, I can see nvcc in that folder now after downloading CUDA.
I get this error when running make:
./verifiable/verifiable.cu:4:10: fatal error: nccl.h: No such file or directory
    4 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
make[1]: *** [../verifiable/verifiable.mk:11: /home/amrc_cymru/nccl-tests/build/verifiable/verifiable.o] Error 1
make[1]: Leaving directory '/home/amrc_cymru/nccl-tests/src'
make: *** [Makefile:20: src.build] Error 2
I ran export PATH=/usr/local/cuda-12/bin${PATH:+:${PATH}}, but that didn't help.
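(For reference, this error means the compiler cannot find the NCCL headers, so adding CUDA to PATH does not address it. A possible fix, assuming NCCL is not yet installed or lives in a non-default location, is to install the NCCL packages and point the nccl-tests Makefile at the right directories; the paths below are assumptions:)

$ sudo apt-get install libnccl2 libnccl-dev        # assumes NVIDIA's apt repository is configured
$ make CUDA_HOME=/usr/local/cuda-12                # or: make NCCL_HOME=/path/to/nccl for a source build

The nccl-tests Makefile accepts CUDA_HOME and NCCL_HOME, where NCCL_HOME should be the prefix containing include/nccl.h and the NCCL library.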
To avoid unexpected issues, can you pull the TensorRT docker image below and run the nccl test inside it? Thanks.
$ docker pull nvcr.io/nvidia/tensorrt:22.11-py3
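(A minimal sketch of running the tests inside that image, assuming the NVIDIA Container Toolkit is installed on the host; the clone location is an assumption:)

$ docker run --gpus all -it --rm nvcr.io/nvidia/tensorrt:22.11-py3
(then, inside the container)
$ git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests && make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

The image ships CUDA and NCCL, so a plain make should find nccl.h without extra variables.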
Hello, I pulled the TensorRT docker image and ran the nccl tests inside it with NVIDIA driver 515.
Still the same issue:
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1017 on b9ce305d96cb device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 1017 on b9ce305d96cb device 1 [0x5e] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 1017 on b9ce305d96cb device 2 [0x86] NVIDIA RTX A6000
# Rank 3 Group 0 Pid 1017 on b9ce305d96cb device 3 [0xaf] NVIDIA RTX A6000
[1680170342.329956] [b9ce305d96cb:1017 :0] debug.c:1289 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1680170342.329978] [b9ce305d96cb:1017 :0] debug.c:1289 UCX WARN ucs_debug_disable_signal: signal 1 was not set in ucs
[1680170342.329983] [b9ce305d96cb:1017 :1] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[b9ce305d96cb:1017 :0:1030] Caught signal 7 (Bus error: nonexistent physical address)
[b9ce305d96cb:1017 :1:1032] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 1030) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000007b0ef ncclGroupEnd() ???:0
4 0x0000000000059e97 ncclGetUniqueId() ???:0
5 0x00000000000489b1 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
6 0x000000000004a655 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
7 0x0000000000063dcc ncclRedOpDestroy() ???:0
8 0x0000000000008609 start_thread() ???:0
9 0x000000000011f133 clone() ???:0
=================================
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1038 on b9ce305d96cb device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 1038 on b9ce305d96cb device 1 [0x5e] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 1038 on b9ce305d96cb device 2 [0x86] NVIDIA RTX A6000
[b9ce305d96cb:1038 :0:1049] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1055 on b9ce305d96cb device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 1055 on b9ce305d96cb device 1 [0x5e] NVIDIA RTX A6000
[b9ce305d96cb:1055 :0:1067] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@b9ce305d96cb:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1068 on b9ce305d96cb device 0 [0x3b] NVIDIA RTX A6000
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 3.27 0.00 0.00 0 0.16 0.05 0.00 0
16 4 float sum -1 3.78 0.00 0.00 0 0.18 0.09 0.00 0
32 8 float sum -1 3.37 0.01 0.00 0 0.18 0.18 0.00 0
64 16 float sum -1 3.39 0.02 0.00 0 0.18 0.36 0.00 0
128 32 float sum -1 3.36 0.04 0.00 0 0.18 0.72 0.00 0
256 64 float sum -1 3.35 0.08 0.00 0 0.18 1.44 0.00 0
512 128 float sum -1 3.33 0.15 0.00 0 0.18 2.87 0.00 0
1024 256 float sum -1 3.31 0.31 0.00 0 0.18 5.74 0.00 0
2048 512 float sum -1 3.36 0.61 0.00 0 0.18 11.56 0.00 0
4096 1024 float sum -1 3.30 1.24 0.00 0 0.18 23.08 0.00 0
8192 2048 float sum -1 3.29 2.49 0.00 0 0.18 45.55 0.00 0
16384 4096 float sum -1 3.32 4.93 0.00 0 0.18 92.17 0.00 0
32768 8192 float sum -1 3.96 8.27 0.00 0 0.18 185.13 0.00 0
65536 16384 float sum -1 3.10 21.12 0.00 0 0.17 387.90 0.00 0
131072 32768 float sum -1 3.28 39.95 0.00 0 0.16 844.54 0.00 0
262144 65536 float sum -1 4.14 63.27 0.00 0 0.16 1600.88 0.00 0
524288 131072 float sum -1 4.29 122.19 0.00 0 0.15 3398.95 0.00 0
1048576 262144 float sum -1 5.35 195.99 0.00 0 0.15 6780.32 0.00 0
2097152 524288 float sum -1 9.06 231.55 0.00 0 0.17 12446.01 0.00 0
4194304 1048576 float sum -1 15.86 264.49 0.00 0 0.16 26903.81 0.00 0
8388608 2097152 float sum -1 28.91 290.14 0.00 0 0.16 53687.09 0.00 0
16777216 4194304 float sum -1 52.99 316.64 0.00 0 0.16 107892.06 0.00 0
33554432 8388608 float sum -1 102.1 328.77 0.00 0 0.15 217532.78 0.00 0
67108864 16777216 float sum -1 200.2 335.25 0.00 0 0.16 412216.61 0.00 0
134217728 33554432 float sum -1 396.6 338.45 0.00 0 0.15 873244.81 0.00 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
So it seems not to be an issue with the TAO container.
It looks like an NCCL issue.
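(A hedged way to get more detail out of the failing runs in this container, assuming they can be repeated, is to enable NCCL's debug logging so initialization prints how far it gets before the bus error:)

$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

NCCL_DEBUG=WARN is a quieter alternative that still surfaces errors.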
So I tried the older docker image (the 22.05 version) and ran the nccl tests inside it. The NVIDIA driver version was 515.
For 4 GPUs I get this (it gets stuck on the last message, so I have to Ctrl+C to exit):
root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 44 on 014ebf08f84a device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 44 on 014ebf08f84a device 1 [0x5e] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 44 on 014ebf08f84a device 2 [0x86] NVIDIA RTX A6000
# Rank 3 Group 0 Pid 44 on 014ebf08f84a device 3 [0xaf] NVIDIA RTX A6000
014ebf08f84a:44:44 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:44:44 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:44:44 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:44:44 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:44:44 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:44:44 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:44:44 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:44:56 [0] NCCL INFO Channel 00/04 : 0 1 2 3
014ebf08f84a:44:56 [0] NCCL INFO Channel 01/04 : 0 3 2 1
014ebf08f84a:44:56 [0] NCCL INFO Channel 02/04 : 0 1 2 3
014ebf08f84a:44:56 [0] NCCL INFO Channel 03/04 : 0 3 2 1
014ebf08f84a:44:57 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
014ebf08f84a:44:56 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
014ebf08f84a:44:57 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:44:56 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:44:59 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
014ebf08f84a:44:59 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
014ebf08f84a:44:58 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
014ebf08f84a:44:58 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
014ebf08f84a:44:58 [2] NCCL INFO Channel 00 : 2[86000] -> 3[af000] via direct shared memory
014ebf08f84a:44:58 [2] NCCL INFO Channel 02 : 2[86000] -> 3[af000] via direct shared memory
014ebf08f84a:44:56 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:44:56 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:44:59 [3] NCCL INFO Channel 00 : 3[af000] -> 0[3b000] via P2P/direct pointer
014ebf08f84a:44:59 [3] NCCL INFO Channel 02 : 3[af000] -> 0[3b000] via P2P/direct pointer
014ebf08f84a:44:57 [1] NCCL INFO Channel 00 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:44:57 [1] NCCL INFO Channel 02 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:44:59 [3] NCCL INFO Channel 01 : 3[af000] -> 2[86000] via direct shared memory
014ebf08f84a:44:59 [3] NCCL INFO Channel 03 : 3[af000] -> 2[86000] via direct shared memory
014ebf08f84a:44:57 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:44:57 [1] NCCL INFO Channel 03 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:44:56 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:44:56 [0] NCCL INFO include/shm.h:41 -> 2
014ebf08f84a:44:56 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ed257aa4e8c671bb-3-1-0 (size 9637888)
014ebf08f84a:44:56 [0] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:44:56 [0] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:44:56 [0] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:44:56 [0] NCCL INFO init.cc:804 -> 2
014ebf08f84a:44:56 [0] NCCL INFO init.cc:941 -> 2
014ebf08f84a:44:58 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:44:58 [2] NCCL INFO include/shm.h:41 -> 2
014ebf08f84a:44:58 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ed257aa4e8c671bb-3-3-2 (size 9637888)
014ebf08f84a:44:56 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
014ebf08f84a:44:58 [2] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:44:58 [2] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:44:58 [2] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:44:58 [2] NCCL INFO init.cc:804 -> 2
014ebf08f84a:44:58 [2] NCCL INFO init.cc:941 -> 2
014ebf08f84a:44:58 [2] NCCL INFO group.cc:72 -> 2 [Async thread]
For 3 GPUs I get this:
root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 64 on 014ebf08f84a device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 64 on 014ebf08f84a device 1 [0x5e] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 64 on 014ebf08f84a device 2 [0x86] NVIDIA RTX A6000
014ebf08f84a:64:64 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:64:64 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:64:64 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:64:64 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:64:64 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:64:64 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:64:64 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:64:74 [0] NCCL INFO Channel 00/02 : 0 1 2
014ebf08f84a:64:74 [0] NCCL INFO Channel 01/02 : 0 1 2
014ebf08f84a:64:74 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
014ebf08f84a:64:74 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:64:75 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
014ebf08f84a:64:75 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:64:76 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
014ebf08f84a:64:76 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
014ebf08f84a:64:76 [2] NCCL INFO Channel 00 : 2[86000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Channel 01 : 2[86000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:74 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:64:75 [1] NCCL INFO Channel 00 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:64:74 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:64:75 [1] NCCL INFO Channel 01 : 1[5e000] -> 2[86000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Connected all rings
014ebf08f84a:64:74 [0] NCCL INFO Connected all rings
014ebf08f84a:64:76 [2] NCCL INFO Connected all rings
014ebf08f84a:64:76 [2] NCCL INFO Channel 00 : 2[86000] -> 1[5e000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Channel 00 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Channel 01 : 2[86000] -> 1[5e000] via P2P/direct pointer
014ebf08f84a:64:75 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:64:76 [2] NCCL INFO Connected all trees
014ebf08f84a:64:76 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
014ebf08f84a:64:76 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
014ebf08f84a:64:74 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
014ebf08f84a:64:74 [0] NCCL INFO include/shm.h:41 -> 2
014ebf08f84a:64:74 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-c477d5051b4c00c3-0-1-0 (size 9637888)
014ebf08f84a:64:74 [0] NCCL INFO transport/shm.cc:100 -> 2
014ebf08f84a:64:74 [0] NCCL INFO transport.cc:34 -> 2
014ebf08f84a:64:74 [0] NCCL INFO transport.cc:87 -> 2
014ebf08f84a:64:74 [0] NCCL INFO init.cc:815 -> 2
014ebf08f84a:64:74 [0] NCCL INFO init.cc:941 -> 2
014ebf08f84a:64:74 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
For 2 GPUs (this was okay):
root@014ebf08f84a:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 80 on 014ebf08f84a device 0 [0x3b] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 80 on 014ebf08f84a device 1 [0x5e] NVIDIA RTX A6000
014ebf08f84a:80:80 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
014ebf08f84a:80:80 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
014ebf08f84a:80:80 [0] NCCL INFO P2P plugin IBext
014ebf08f84a:80:80 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:80:80 [0] NCCL INFO NET/IB : No device found.
014ebf08f84a:80:80 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
014ebf08f84a:80:80 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
014ebf08f84a:80:89 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
014ebf08f84a:80:88 [0] NCCL INFO Channel 00/02 : 0 1
014ebf08f84a:80:88 [0] NCCL INFO Channel 01/02 : 0 1
014ebf08f84a:80:89 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
014ebf08f84a:80:88 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
014ebf08f84a:80:88 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
014ebf08f84a:80:89 [1] NCCL INFO Channel 00 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:80:89 [1] NCCL INFO Channel 01 : 1[5e000] -> 0[3b000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via direct shared memory
014ebf08f84a:80:88 [0] NCCL INFO Connected all rings
014ebf08f84a:80:89 [1] NCCL INFO Connected all rings
014ebf08f84a:80:88 [0] NCCL INFO Connected all trees
014ebf08f84a:80:89 [1] NCCL INFO Connected all trees
014ebf08f84a:80:89 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
014ebf08f84a:80:89 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
014ebf08f84a:80:88 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
014ebf08f84a:80:88 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
014ebf08f84a:80:89 [1] NCCL INFO comm 0x7fc1bc0010c0 rank 1 nranks 2 cudaDev 1 busId 5e000 - Init COMPLETE
014ebf08f84a:80:88 [0] NCCL INFO comm 0x7fc1c40010c0 rank 0 nranks 2 cudaDev 0 busId 3b000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
014ebf08f84a:80:80 [0] NCCL INFO Launch mode Parallel
8 2 float sum -1 7.73 0.00 0.00 0 7.85 0.00 0.00 0
16 4 float sum -1 10.45 0.00 0.00 0 8.32 0.00 0.00 0
32 8 float sum -1 8.85 0.00 0.00 0 8.27 0.00 0.00 0
64 16 float sum -1 8.79 0.01 0.01 0 8.26 0.01 0.01 0
128 32 float sum -1 8.72 0.01 0.01 0 8.19 0.02 0.02 0
256 64 float sum -1 8.88 0.03 0.03 0 8.18 0.03 0.03 0
512 128 float sum -1 9.07 0.06 0.06 0 8.40 0.06 0.06 0
1024 256 float sum -1 9.11 0.11 0.11 0 8.91 0.11 0.11 0
2048 512 float sum -1 9.04 0.23 0.23 0 8.10 0.25 0.25 0
4096 1024 float sum -1 9.06 0.45 0.45 0 8.47 0.48 0.48 0
8192 2048 float sum -1 10.16 0.81 0.81 0 9.69 0.85 0.85 0
16384 4096 float sum -1 13.47 1.22 1.22 0 12.88 1.27 1.27 0
32768 8192 float sum -1 16.44 1.99 1.99 0 16.15 2.03 2.03 0
65536 16384 float sum -1 28.86 2.27 2.27 0 28.81 2.27 2.27 0
131072 32768 float sum -1 36.91 3.55 3.55 0 37.32 3.51 3.51 0
262144 65536 float sum -1 56.79 4.62 4.62 0 55.95 4.69 4.69 0
524288 131072 float sum -1 90.47 5.80 5.80 0 89.67 5.85 5.85 0
1048576 262144 float sum -1 159.8 6.56 6.56 0 159.5 6.57 6.57 0
2097152 524288 float sum -1 295.8 7.09 7.09 0 295.8 7.09 7.09 0
4194304 1048576 float sum -1 562.9 7.45 7.45 0 565.9 7.41 7.41 0
8388608 2097152 float sum -1 1094.0 7.67 7.67 0 1093.5 7.67 7.67 0
16777216 4194304 float sum -1 2172.1 7.72 7.72 0 2164.8 7.75 7.75 0
33554432 8388608 float sum -1 4329.9 7.75 7.75 0 4330.9 7.75 7.75 0
67108864 16777216 float sum -1 8643.6 7.76 7.76 0 8640.2 7.77 7.77 0
134217728 33554432 float sum -1 17231 7.79 7.79 0 17204 7.80 7.80 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.24406
And 1 GPU was okay as well.
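(The "posix_fallocate failed : No space left on device" warnings in the 4-GPU and 3-GPU runs point at the container's /dev/shm being too small for the shared-memory segments NCCL tries to create, roughly 9.6 MB per connection here against Docker's 64 MB default. A possible fix, assuming the container can be relaunched, is to start it with a larger shared-memory size and the ulimits NVIDIA recommends for its NGC images:)

$ docker run --gpus all -it --rm \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorrt:22.11-py3

Alternatively, --ipc=host shares the host's /dev/shm with the container, which also avoids the limit.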
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
From the topic "Error when training with multiple GPUs in TAO", that user can run 8 GPUs fine with the 22.05 docker image.
And also, since you mentioned above that with the TensorRT docker and the nccl test run inside it on NVIDIA driver 515 you still hit the same issue, I am afraid the issue is related to the topology.
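(One hedged way to inspect that topology, assuming nvidia-smi is available on the host, is to print the GPU connection matrix:)

$ nvidia-smi topo -m

The matrix entries (NV#, PIX, PXB, PHB, NODE, SYS) show how each pair of A6000s is connected across the PCIe/NUMA hierarchy, which can be compared against the transports NCCL picked in the logs above (P2P/direct pointer versus direct shared memory).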
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.