TAO API - Detectnet_v2 - Multi GPU Stuck

With the 4.0.1 docker, this is the same error as the one reported in https://forums.developer.nvidia.com/t/error-during-multi-gpu-training-of-classification-tf1-cma-ep-c-process-vm-readv-operation-not-permitted/
According to that topic, there are two options:

  1. Use the 22.05 docker, which ships a working MPI version (openmpi-4.1.2).
  2. Stay on the 4.0.1 docker and replace its MPI version by building Open MPI from source:

# from https://edu.itp.phys.ethz.ch/hs12/programming_techniques/openmpi.pdf and https://www.open-mpi.org/software/ompi/v4.1/
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
mkdir src
mv openmpi-4.1.5.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure --prefix=$HOME/opt/openmpi
# Use as many parallel jobs as there are CPU cores (the original used -j128)
make -j"$(nproc)" all
make install
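# Put the new build first on PATH so it takes precedence over the bundled mpirun.
# Single quotes keep $PATH and $HOME unexpanded until .bashrc is sourced.
echo 'export PATH=$HOME/opt/openmpi/bin:$PATH' >> $HOME/.bashrc
echo 'export LD_LIBRARY_PATH=$HOME/opt/openmpi/lib:$LD_LIBRARY_PATH' >> $HOME/.bashrc
. ~/.bashrc
export OPAL_PREFIX=$HOME/opt/openmpi/
# Check the version only after PATH is updated; it should now report the new build
mpirun --version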
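
Before launching training, it may help to confirm which mpirun the shell actually picks up, since the docker already contains another Open MPI install. The following is only a minimal sketch: parse_ompi_version is a hypothetical helper name (not part of TAO or Open MPI), and it assumes the first line of `mpirun --version` has the form "mpirun (Open MPI) 4.1.5".

```shell
# Hypothetical helper: extract the version number from the first line of
# `mpirun --version` output, e.g. "mpirun (Open MPI) 4.1.5" -> "4.1.5".
parse_ompi_version() {
    printf '%s\n' "$1" | head -n 1 | awk '{print $NF}'
}

if command -v mpirun >/dev/null 2>&1; then
    # Show the binary that is first on PATH and the version it reports.
    echo "mpirun found at: $(command -v mpirun)"
    echo "Open MPI version: $(parse_ompi_version "$(mpirun --version 2>/dev/null)")"
else
    echo "mpirun not found on PATH"
fi
```

If the reported path is not under $HOME/opt/openmpi/bin, the shell is still using the old install and the PATH order in ~/.bashrc is worth checking.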

Then run the multi-GPU training:

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r result -k tlt_encode