Error during multi-GPU training of classification_tf1: cma_ep.c process_vm_readv Operation not permitted

Morganh · May 25, 2023, 3:42am

Thanks a lot for the info. Glad to know it is working.
We will look further into why different MPI versions give different results.

Morganh · May 25, 2023, 4:49am

Hi,
Could you check whether the stable release openmpi-4.1.5.tar.bz2 also works in the 4.0.1 docker?
We really appreciate your time and help. Thanks in advance.

veritable · May 25, 2023, 2:19pm

After installing MPI 4.1.5 in a new 4.0.1 Docker container, I am able to run the training.

Just before the first batch started, I noticed that there were many lines of [f3363b50bf60:207940] Read -1, expected 6449, errno = 1, with a variety of seemingly uncorrelated random values in place of the 207940 and 6449. It doesn’t seem to affect anything as the training occurs despite this log message. It also happened in yesterday’s logs. Probably not relevant.

Morganh · May 25, 2023, 3:12pm

Thanks a lot for the info.
Regarding the many lines you mentioned, please double-check whether they affect the training. You can also share the log.

veritable · May 25, 2023, 8:28pm

Here’s the log.

log_20230525.txt (57.1 KB)

The training and validation losses are trending down while the accuracy is trending up, which leads me to believe the training is working. Is there something else I should check?

Looking more closely at these logs, the Read -1, expected 11865, errno = 1 messages occur immediately after the first batch of the first epoch, just like my original error, but in this case they are not fatal.

Morganh · May 26, 2023, 3:28am

Based on the hint in https://github.com/open-mpi/ompi/issues/4948 , could you add --mca btl_vader_single_copy_mechanism none as shown below?

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 4 python /usr/local/lib/python3.6/dist-packages/iva/makenet/scripts/train.py -e spec.txt -r result -k key
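
If editing the mpirun command line is inconvenient, the same setting can probably be supplied through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>. This is only a sketch of that alternative and was not tested in this thread:

```shell
# Environment-variable form of "--mca btl_vader_single_copy_mechanism none".
# Open MPI picks up MCA parameters from variables prefixed with OMPI_MCA_.
export OMPI_MCA_btl_vader_single_copy_mechanism=none

# Show the value that child processes (such as a later mpirun) will inherit.
echo "btl_vader_single_copy_mechanism=${OMPI_MCA_btl_vader_single_copy_mechanism}"
```

With the variable exported in the container shell, the mpirun command should behave as if the --mca option had been passed.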

veritable · May 29, 2023, 3:09pm

Yes, adding --mca btl_vader_single_copy_mechanism none to mpirun removed all of the Read -1, expected 11865, errno = 1 messages.
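
For anyone who wants to confirm this on their own logs, a quick count of the messages before and after the change might look like the sketch below. The helper name count_cma_errors is made up for illustration, and the default file name is just the log attached earlier in this thread:

```shell
# Hypothetical helper: count "Read -1, expected N, errno = 1" lines in a log.
count_cma_errors() {
  # grep -c prints the number of matching lines; "|| true" keeps the exit
  # status at zero when there are no matches.
  grep -c 'Read -1, expected [0-9][0-9]*, errno = 1' "$1" || true
}

# Default to the log name used in this thread; pass another path as $1.
LOG="${1:-log_20230525.txt}"
if [ -f "$LOG" ]; then
  echo "$LOG: $(count_cma_errors "$LOG") matching lines"
fi
```

A count of 0 on a log produced with the --mca option would match what is described above.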

Will future TAO containers use MPI 4.1.5 instead of 4.1.5a1?

Morganh · May 29, 2023, 3:26pm

I will sync internally.

Would you please share the latest training log? Thanks a lot!

veritable · May 31, 2023, 7:29pm

Here’s the latest log:
log_20230531.txt (53.5 KB)

Morganh · June 1, 2023, 8:54am

Thanks. Glad to know it works.
