TAO API - Detectnet_v2 - Multi GPU Stuck

How about using an old version of the TAO docker?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorrt:22.11-py3 /bin/bash

Then, inside the container, build nccl-tests and run the all-reduce test with 4 GPUs and then with 3:
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests/
$ make
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

Also, can you add --shm-size=16g and --ulimit memlock=-1 to the docker command?
In addition, before you run the NCCL tests, please set export NCCL_DEBUG=INFO or export NCCL_DEBUG=WARN so that NCCL logs what it is doing. Refer to https://github.com/NVIDIA/nccl/issues/411 and https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
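Putting the suggestions above together, the docker command might look roughly like the sketch below. The DRY_RUN variable is hypothetical and exists only for this example, so the command is printed by default instead of executed. Passing -e NCCL_DEBUG=INFO is assumed here as an alternative to running export NCCL_DEBUG=INFO inside the container; either approach should work.

```shell
# Sketch only: same image as above, plus the extra flags suggested in this reply.
IMAGE="nvcr.io/nvidia/tensorrt:22.11-py3"

# --shm-size and --ulimit memlock raise the shared-memory and locked-memory limits.
# -e sets NCCL_DEBUG inside the container (assumption: used instead of `export`).
DOCKER_ARGS="--runtime=nvidia -it --rm --shm-size=16g --ulimit memlock=-1 -e NCCL_DEBUG=INFO"

# DRY_RUN is a hypothetical switch for this example; by default, only print the command.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "docker run $DOCKER_ARGS $IMAGE /bin/bash"
else
  # DOCKER_ARGS is intentionally unquoted so it splits into separate arguments.
  docker run $DOCKER_ARGS "$IMAGE" /bin/bash
fi
```

With DRY_RUN=0 this starts the container; then build nccl-tests and rerun the two all_reduce_perf commands as before, and please share the full log.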