After checking your logs, I found this error: [cc903a1bb6fb:00924] Read -1, expected 2359296, errno = 1. On Linux, errno 1 is EPERM (operation not permitted). This looks similar to the Open MPI issue "Vader in a Docker Container" (Issue #4948, open-mpi/ompi on GitHub).
Could you export the variable below first and then run training again? Thanks.
export OMPI_MCA_btl_vader_single_copy_mechanism=none
Or set it inline on the training command, i.e.,
OMPI_MCA_btl_vader_single_copy_mechanism=none NCCL_P2P_LEVEL=NVL detectnet_v2 train -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode --gpus 2
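If it helps, here is a quick way to confirm the variable is really set in the shell (inside the container) before you launch training. This is only a sketch: it uses nothing beyond the variable name above and the standard printenv utility, and it does not start the training job itself.

```shell
# Set the workaround for the current shell session.
export OMPI_MCA_btl_vader_single_copy_mechanism=none

# Print the value that child processes (including the training job) will inherit.
# An empty result would mean the export did not take effect in this shell.
printenv OMPI_MCA_btl_vader_single_copy_mechanism
```

Note that the export only applies to the shell where you ran it, so run the training command from that same shell.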