Training with multiple GPUs fails using the TAO Toolkit

I am using the following command to train MaskRCNN:

!mask_rcnn train -e $LOCAL_SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $LOCAL_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 4

If I set --gpus 1, it is fine.
If I set --gpus 4, I get the following errors.

[MaskRCNN] INFO    : # ============================================= #
[MaskRCNN] INFO    :                  Start Training                  
[MaskRCNN] INFO    : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/Nyan/cv_samples_v1.3.0/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real noticed that process rank 3 with PID 0 on node 47754d4a4716 exited on signal 9 (Killed).
--------------------------------------------------------------------------
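Signal 9 means the process was killed externally, which in multi-GPU runs is often the kernel OOM killer hitting a host-RAM limit (each worker keeps its own data pipeline, so 4 GPUs roughly quadruple host memory use). A quick diagnostic sketch, plain Linux commands and not TAO-specific, to check whether that is what happened here:

# look for OOM-killer entries around the time the job died
dmesg -T | grep -i -E "out of memory|killed process" | tail

# watch host RAM and swap while the 4-GPU job is running
free -h

# if training runs inside a container, check its memory usage/limit too
docker stats --no-stream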

Please try pointing -d to a new, empty folder.

I always clear the folder before training starts.

Could you check $ nvidia-smi?
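For reference, a couple of standard nvidia-smi invocations to watch GPU utilization and memory while the job runs (nothing TAO-specific, just the stock driver tools):

# refresh the usual summary every second
watch -n 1 nvidia-smi

# or print per-GPU utilization and memory every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5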

Looks normal. Now I am training with one GPU.

Training with multiple GPUs was working fine before I switched to the workaround approach of opening a bash shell in the tao docker and running jupyter inside it.

Hi,
May I know the latest status of running with 4 GPUs?

Still cannot. I think it is because I am running directly from inside the docker.
Previously it was fine.

OK, please use
! tao mask_rcnn train xxx

tao still doesn’t work; the docker still terminates itself, so I am using your workaround approach. But with the workaround approach, I can’t run 4 GPUs.

Do you mean Chmod: cannot access '/opt/ngccli/ngc': No such file or directory - #2 by Morganh

I meant I still have this error.

I can run the mask_rcnn command using the workaround approach.

If I do the following,

docker login nvcr.io
workon launcher
jupyter notebook --ip 0.0.0.0 --allow-root --port 8888
!tao mask_rcnn train xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

the container still stops.
Multi-GPU training doesn’t work with the workaround approach.
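One thing that may be worth checking in the workaround path (an assumption on my side, not confirmed in this thread) is how the container itself is started: mpirun-based multi-GPU training generally needs all GPUs exposed plus generous shared-memory and memlock limits, which the tao launcher normally sets for you. A minimal sketch of starting the TAO TF container by hand, with the image name/tag left as placeholders:

docker run --rm -it --gpus all \
    --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /local/workspace:/workspace \
    nvcr.io/nvidia/tao/tao-toolkit-tf:<tag> /bin/bash
# <tag> is a placeholder – use whichever image the launcher normally pulls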

The new version of the wheel has already been released to PyPI.
nvidia-tao==0.1.24

Please update it.
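For reference, a minimal sketch of updating the launcher wheel inside the launcher virtualenv and confirming the installed version (assuming pip3 is the interpreter that owns the tao launcher):

workon launcher
pip3 install --upgrade nvidia-tao==0.1.24
pip3 show nvidia-tao    # should report Version: 0.1.24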

I updated to the new version, but multi-GPU training still gives the same error.

@edit_or,
Do you still get the error when running with multiple GPUs?

I’ll change the image sizes and discuss in a new post if necessary. Thanks for the support.

Let’s close this topic. Feel free to submit a new post if necessary, thanks.
