Hello,
I can run the NCCL single-node test successfully on both nodes with the upper limit set to 16G:
user@h100:~/Downloads/NVIDIA/NCCL$ nccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 8
…
Out of bounds values : 0 OK
Avg bus bandwidth : 157.721
However, when I use mpirun to run a 2-node test, it completes with 2g as the upper limit, and also with 2560m (i.e. 2.5g):
time mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 2g -f 2 -g 8
…
Out of bounds values : 0 OK
Avg bus bandwidth : 3.28946
time mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 2560m -f 2 -g 8
Out of bounds values : 0 OK
Avg bus bandwidth : 2.71165
But it consistently fails at 3g and above with "out of memory" on the slave node H100.
I'm wondering why that is, given that the slave node can run the same test on its own with 16g as the upper limit, but in the 2-node run it hits out of memory at 3g.
$ NCCL_DEBUG=INFO mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 -x NCCL_DEBUG /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 3g -f 2 -g 8
h100: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
… h100 pid 138262: Test failure common.cu:891
h100:138259:138465 [4] include/alloc.h:229 NCCL WARN Cuda failure 2 'out of memory'
h100:138259:138465 [4] NCCL INFO include/alloc.h:339 -> 1
h100:138259:138465 [4] include/alloc.h:347 NCCL WARN Failed to CUDA calloc async 32 bytes
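For reference, here is a quick sanity check of how many ranks the 2-node launch above asks for. It is only a sketch and rests on assumptions: that the rank count follows the nccl-tests README rule (processes x threads per process x GPUs per thread), that -t is left at its default of 1, that the extra ranks end up sharing the 16 physical GPUs evenly, and that each rank allocates roughly three buffers (send, receive, expected) of the maximum size. None of this is confirmed by the log. The variable names are made up for illustration.

```shell
#!/bin/sh
# Launch parameters taken from the failing mpirun command above.
np=16              # MPI processes (-np 16)
threads=1          # nccl-tests -t, ASSUMED to be the default of 1
gpus_per_thread=8  # -g 8
physical_gpus=16   # 2 nodes x 8 GPUs

# Rank count per the nccl-tests README rule.
ranks=$((np * threads * gpus_per_thread))
# ASSUMPTION: ranks are spread evenly over the physical GPUs.
ranks_per_gpu=$((ranks / physical_gpus))
echo "ranks=$ranks ranks_per_gpu=$ranks_per_gpu"

# Rough buffer footprint per GPU in MiB for the two upper limits tried.
# ASSUMPTION: 3 buffers per rank (send, recv, expected), each of max size.
buffers_per_rank=3
for max_mib in 2560 3072; do
  per_gpu_mib=$((max_mib * buffers_per_rank * ranks_per_gpu))
  echo "max=${max_mib}MiB per_gpu=${per_gpu_mib}MiB"
done
```

If that reading is right, the launch requests far more ranks than there are GPUs, which might explain why the per-GPU memory limit is reached much earlier than in the single-node run. Under the same assumptions, running with -np 16 and -g 1 would request 16 ranks, one per GPU; this is untested.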
My system environment is below:
user@q-h100:~/Downloads/NVIDIA/NCCL$ env | grep NCCL
PWD=/home/user/Downloads/NVIDIA/NCCL
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib64:/home/user/Downloads/NVIDIA/NCCL/ompi/ompi_install/lib:/home/user/Downloads/NVIDIA/NCCL/ucx/ucx_install/lib:
PATH=/home/user/.local/bin:/usr/local/cuda/bin:/home/user/Downloads/NVIDIA/NCCL/ompi/ompi_install/bin:/home/user/Downloads/NVIDIA/NCCL/ucx/ucx_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
user@q-h100:~/Downloads/NVIDIA/NCCL$ apt list --installed | grep nccl
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnccl-dev/unknown,now 2.23.4-1+cuda12.4 amd64 [installed,upgradable to: 2.23.4-1+cuda12.6]
libnccl2/unknown,now 2.23.4-1+cuda12.4 amd64 [installed,upgradable to: 2.23.4-1+cuda12.6]
nccl-local-repo-ubuntu2204-2.23.4-cuda12.4/now 1.0-1 amd64 [installed,local]
user@q-h100:~/Downloads/NVIDIA/NCCL$ uname -a
Linux q-h100 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux