TAO API - Detectnet_v2 - Multi GPU Stuck

Morganh · May 29, 2023, 9:43am

The error in the 4.0.1 docker is the same one reported in https://forums.developer.nvidia.com/t/error-during-multi-gpu-training-of-classification-tf1-cma-ep-c-process-vm-readv-operation-not-permitted/ and that topic gives two options:

1. Use the 22.05 docker, which has a working MPI version (openmpi-4.1.2).
2. In the 4.0.1 docker, change the MPI version:

# from https://edu.itp.phys.ethz.ch/hs12/programming_techniques/openmpi.pdf and https://www.open-mpi.org/software/ompi/v4.1/
# Download and unpack the Open MPI 4.1.5 sources
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
mkdir src
mv openmpi-4.1.5.tar.bz2 src/
cd src/
tar -jxf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
# Build and install under $HOME/opt/openmpi
./configure --prefix=$HOME/opt/openmpi
make -j128 all
make install
# Check the freshly installed binary
$HOME/opt/openmpi/bin/mpirun --version
# Persist the new paths (single quotes, so the variables expand when .bashrc is sourced)
echo 'export PATH=$PATH:$HOME/opt/openmpi/bin' >> $HOME/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/opt/openmpi/lib' >> $HOME/.bashrc
. ~/.bashrc
export OPAL_PREFIX=$HOME/opt/openmpi/
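
Since the PATH line above appends the new bin directory, an mpirun that already sits earlier on PATH would still win. A minimal sketch for checking which mpirun the shell resolves and which version it reports is below. The helper name ompi_version is hypothetical, and it assumes the first line of the version output has the usual Open MPI form "mpirun (Open MPI) 4.1.5".

```shell
# Hypothetical helper (not part of Open MPI or TAO): extract the version
# number from `mpirun --version` output, assuming the first line looks
# like "mpirun (Open MPI) 4.1.5".
ompi_version() {
    printf '%s\n' "$1" | sed -n '1s/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p'
}

if command -v mpirun >/dev/null 2>&1; then
    # Show which binary wins on PATH and what version it reports.
    echo "mpirun resolves to: $(command -v mpirun)"
    echo "Open MPI version:   $(ompi_version "$(mpirun --version 2>/dev/null)")"
else
    echo "mpirun not found on PATH"
fi
```

If the resolved path is not under $HOME/opt/openmpi/bin, the older MPI is probably still the one being picked up.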

Then run the training with mpirun:

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 2 python /usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r result -k tlt_encode
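
Before launching, it may help to confirm that the container sees at least as many GPUs as the -np value. The sketch below is only a pre-flight check under the assumption of one MPI process per GPU. The NP variable and the count_gpus helper are hypothetical names, and the check relies on `nvidia-smi -L` printing one "GPU <n>: ..." line per device.

```shell
# Hypothetical pre-flight check; NP mirrors the "-np 2" in the command above.
NP="${NP:-2}"

# Count devices in `nvidia-smi -L` style output (one "GPU <n>: ..." line each).
count_gpus() {
    printf '%s\n' "$1" | grep -c '^GPU ' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
    visible=$(count_gpus "$(nvidia-smi -L 2>/dev/null)")
    if [ "$visible" -lt "$NP" ]; then
        # Assumption: one process per GPU, so fewer GPUs than NP is a problem.
        echo "Only $visible GPU(s) visible, but -np $NP requested" >&2
    else
        echo "$visible GPU(s) visible, enough for -np $NP"
    fi
else
    echo "nvidia-smi not found; skipping GPU count check"
fi
```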
