NCCL test on 2x HGX failed with 3G as the upper limit

playwithai · October 16, 2024, 5:37pm

Hello,

I can run the single-node NCCL test successfully on both nodes with the upper limit set to 16G:

user@h100:~/Downloads/NVIDIA/NCCL$ nccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 8
…

Out of bounds values : 0 OK

Avg bus bandwidth : 157.721

However, when I use mpirun to run a 2-node test, the test completes with 2g as the upper limit, and also with 2560m (i.e. 2.5g).

time mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 2g -f 2 -g 8
…

Out of bounds values : 0 OK

Avg bus bandwidth : 3.28946

time mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 2560m -f 2 -g 8

Out of bounds values : 0 OK

Avg bus bandwidth : 2.71165

But it fails consistently at 3g and above because the H100 slave node reports "out of memory".

Why is that? The slave node can run the same test on its own with 16g as the upper limit, yet in the 2-node run it hits out of memory at 3g.

$ NCCL_DEBUG=INFO mpirun -np 16 -H 172.30.1.74:8,172.30.1.75:8 -x NCCL_DEBUG /home/user/Downloads/NVIDIA/NCCL/nccl-tests/build/all_reduce_perf -b 8 -e 3g -f 2 -g 8

h100: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
… h100 pid 138262: Test failure common.cu:891

h100:138259:138465 [4] include/alloc.h:229 NCCL WARN Cuda failure 2 'out of memory'
h100:138259:138465 [4] NCCL INFO include/alloc.h:339 -> 1

h100:138259:138465 [4] include/alloc.h:347 NCCL WARN Failed to CUDA calloc async 32 bytes
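
A rough back-of-the-envelope check of the launch shape may help frame the question. The mpirun command starts 8 processes on each node (`-H <node>:8`), and each process asks for 8 GPUs (`-g 8`). The sketch below estimates how much test-buffer memory that could place on one GPU. Two points are assumptions that this post does not confirm: that all processes on a node end up on the same 8 physical GPUs, and that each process keeps roughly three maximum-size buffers per GPU (send, receive, and the expected result used by the default data check). The variable and function names are illustrative only.

```shell
#!/usr/bin/env bash
# Launch shape read from the failing command: -H <node>:8 and -g 8.
procs_per_node=8   # MPI slots per host
gpus_per_proc=8    # all_reduce_perf -g
gpus_per_node=8    # physical GPUs in one HGX H100 node

# ASSUMPTION: every process on a node picks its GPUs from the same 8 devices.
sharers_per_gpu=$(( procs_per_node * gpus_per_proc / gpus_per_node ))

# ASSUMPTION: with the default data check, each process holds about three
# buffers of the maximum size per GPU (send, receive, expected result).
buffers_per_gpu=3

# Print the estimated test-buffer footprint on one GPU for a given -e in MiB.
estimate_mib_per_gpu() {
  echo $(( sharers_per_gpu * buffers_per_gpu * $1 ))
}

echo "processes sharing each GPU: $sharers_per_gpu"
for max_mib in 2048 2560 3072; do
  echo "-e ${max_mib}m: about $(estimate_mib_per_gpu "$max_mib") MiB per GPU"
done
```

Under these assumptions the 3g case needs about 73728 MiB (72 GiB) per GPU before counting CUDA contexts and NCCL's own buffers, which is close to the capacity of an 80 GB H100, while 2560m needs about 60 GiB. If one rank per GPU is what was intended, the usual launch shapes are `-np 16` with `-g 1`, or `-np 2 -H 172.30.1.74:1,172.30.1.75:1` with `-g 8` (the latter forms a single two-node job only if nccl-tests was built with `MPI=1`). Whether either resolves the failure here is untested.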

My system environment is below:

user@q-h100:~/Downloads/NVIDIA/NCCL$ env | grep NCCL
PWD=/home/user/Downloads/NVIDIA/NCCL
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib64:/home/user/Downloads/NVIDIA/NCCL/ompi/ompi_install/lib:/home/user/Downloads/NVIDIA/NCCL/ucx/ucx_install/lib:
PATH=/home/user/.local/bin:/usr/local/cuda/bin:/home/user/Downloads/NVIDIA/NCCL/ompi/ompi_install/bin:/home/user/Downloads/NVIDIA/NCCL/ucx/ucx_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

user@q-h100:~/Downloads/NVIDIA/NCCL$ apt list --installed | grep nccl

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnccl-dev/unknown,now 2.23.4-1+cuda12.4 amd64 [installed,upgradable to: 2.23.4-1+cuda12.6]
libnccl2/unknown,now 2.23.4-1+cuda12.4 amd64 [installed,upgradable to: 2.23.4-1+cuda12.6]
nccl-local-repo-ubuntu2204-2.23.4-cuda12.4/now 1.0-1 amd64 [installed,local]

user@q-h100:~/Downloads/NVIDIA/NCCL$ uname -a
Linux q-h100 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
