How about using an old version of the TAO docker?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash
Then, inside the container:
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
Also, can you add --shm-size=16g and --ulimit memlock=-1 to the docker command as well?
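For reference, here is a rough sketch of what the combined command could look like with both flags added. The IMAGE and DOCKER_CMD variable names are only for illustration; the image tag is the one from the command above.

```shell
# Sketch only: the original docker command plus the two extra flags.
# IMAGE and DOCKER_CMD are illustrative variable names, not anything docker requires.
IMAGE="nvcr.io/nvidia/tensorrt:22.11-py3"

# --shm-size=16g       gives the container a larger /dev/shm
# --ulimit memlock=-1  removes the locked-memory limit inside the container
DOCKER_CMD="docker run --runtime=nvidia -it --rm --shm-size=16g --ulimit memlock=-1 $IMAGE /bin/bash"

# Print the command so it can be checked before running it.
echo "$DOCKER_CMD"
```

If the printed command looks right, paste it into the terminal to start the container.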
Also, before you run the NCCL tests, please set export NCCL_DEBUG=INFO or export NCCL_DEBUG=WARN. Refer to https://github.com/NVIDIA/nccl/issues/411 and https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
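Something like the following, run from the nccl-tests directory inside the container, would capture the debug output so it can be shared here. This is only a sketch: the log file name is an arbitrary choice, and the check for the binary is just a guard in case the build step has not been run yet.

```shell
# Sketch only: enable NCCL debug logging, then run the 4-GPU test and keep the output.
export NCCL_DEBUG=INFO            # or WARN for less verbose output

LOG_FILE="nccl_all_reduce.log"    # arbitrary file name, chosen for illustration

if [ -x ./build/all_reduce_perf ]; then
    # Capture both stdout and stderr so the NCCL messages end up in the log.
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 2>&1 | tee "$LOG_FILE"
else
    echo "all_reduce_perf not found; run make in nccl-tests first"
fi
```

Please attach the resulting log when you reply.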