Error during multi-GPU training of classification_tf1: cma_ep.c process_vm_readv Operation not permitted

Thanks a lot for the info. Glad to know it is working.
Regarding the different results with different MPI versions, we will check further.

Could you help check whether the stable release openmpi-4.1.5.tar.bz2 also works in the 4.0.1 docker?
We appreciate your time and help. Thanks in advance.

After installing MPI 4.1.5 in a new 4.0.1 Docker container, I am able to run the training.
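For anyone following along, here is a minimal sketch of building the stable Open MPI 4.1.5 release from the tarball inside the container. The download URL and install prefix are assumptions; adjust them to your environment:

```shell
# Sketch: build Open MPI 4.1.5 from the stable tarball inside the container.
# The download URL and install prefix below are assumptions; adjust as needed.
OMPI_VERSION=4.1.5
OMPI_TARBALL="openmpi-${OMPI_VERSION}.tar.bz2"
OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.1/${OMPI_TARBALL}"
echo "Fetching ${OMPI_URL}"
wget -q "$OMPI_URL" &&
  tar -xjf "$OMPI_TARBALL" &&
  cd "openmpi-${OMPI_VERSION}" &&
  ./configure --prefix=/usr/local &&
  make -j"$(nproc)" && make install && ldconfig ||
  echo "Build skipped or failed; check network access and build tools."
```

After a successful install, `mpirun --version` should report 4.1.5.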

Just before the first batch started, I noticed many lines of [f3363b50bf60:207940] Read -1, expected 6449, errno = 1, with seemingly random, uncorrelated values in place of 207940 and 6449. It doesn’t seem to affect anything, since training proceeds despite the message. It also appeared in yesterday’s logs. Probably not relevant.

Thanks a lot for the info.
For the messages you mentioned, please double-check whether they affect the training. Please also share the log.

Here’s the log.

log_20230525.txt (57.1 KB)

The train and validation loss are trending down, while the accuracy is trending up, which leads me to believe the training is working. Is there something else I should check?

Looking more closely at these logs, the Read -1, expected 11865, errno = 1 messages occur immediately after the first batch of the first epoch, just like my original error, but in this case they are not fatal.
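For reference (my note, not stated in the thread): errno = 1 is EPERM, "Operation not permitted" — the same error class as the original process_vm_readv failure in the thread title. A quick way to confirm the mapping:

```shell
# Map errno 1 to its symbolic name and message; on Linux this prints
# "EPERM - Operation not permitted".
python3 -c 'import errno, os; print(errno.errorcode[1], "-", os.strerror(1))'
```

Inside Docker, the process_vm_readv call used by the vader BTL's CMA single-copy path is commonly blocked unless the container is started with --cap-add=SYS_PTRACE, which is why these reads fail with EPERM; disabling the single-copy mechanism is the other common workaround.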

Following that hint, could you add --mca btl_vader_single_copy_mechanism none as below?

mpirun --allow-run-as-root --mca btl_vader_single_copy_mechanism none -np 4 python /usr/local/lib/python3.6/dist-packages/iva/makenet/scripts/ -e spec.txt -r result -k key

Yes, adding --mca btl_vader_single_copy_mechanism none to mpirun removed all of the Read -1, expected 11865, errno = 1 messages.
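As a side note, the same setting can also be applied through an environment variable instead of the command-line flag, since Open MPI reads any MCA parameter from an OMPI_MCA_<name> variable (the commented mpirun line below uses a placeholder script path, not the real one):

```shell
# Equivalent to "--mca btl_vader_single_copy_mechanism none" on the mpirun
# command line: Open MPI picks up MCA parameters from OMPI_MCA_* variables.
export OMPI_MCA_btl_vader_single_copy_mechanism=none
# mpirun --allow-run-as-root -np 4 python <train_script> -e spec.txt -r result -k key
echo "$OMPI_MCA_btl_vader_single_copy_mechanism"   # prints: none
```

This is convenient when the mpirun invocation is buried inside a wrapper script you cannot easily edit.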

Is MPI 4.1.5 going to be used in future TAO containers? (instead of 4.1.5a1)

I will sync internally.

Would you please share the latest training log? Thanks a lot!