Hanging issue of Modulus v22.07 running on multi-node GPUs

Shen666 · September 2, 2022, 5:45pm

I want to run Modulus on two machines, each with 4 GPUs.
For now, I can run Modulus on each machine with 4 GPUs without any issues, as shown in the attached figure.

However, when I use both machines together, the run hangs, as shown in the screenshot. It cannot initialize the GPUs using OpenMPI the way the working single-machine run above does. This works with v21.06 but fails with v22.07.

There are no MPI issues across the two machines and no MPI incompatibility issues.
Has anyone run into a similar issue, or does anyone have comments on it?
Thanks

ngeneva · September 6, 2022, 7:56pm

Hi @Shen666 ,

We don’t use Singularity for running our image. I suspect this may be an issue between Singularity and the Modulus Docker image (I am assuming the Singularity container is built from the Dockerfile?). I have not seen this issue before.

After some Googling, there seems to be some potentially relevant information here about using Singularity with an nvidia-docker container (it mentions the CUDA path you have here, about midway down the page).

https://www.nmr.mgh.harvard.edu/martinos/userInfo/computer/docker.php

If this still doesn’t work, you may want to try a bare-metal install (Python install), which still has the majority of Modulus’ features functional.
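One way to narrow down where the hang starts is to run a tiny script under the same mpirun and container setup before launching Modulus. The sketch below is not part of Modulus. It only reads the environment variables that Open MPI sets for each launched process (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK`) plus `CUDA_VISIBLE_DEVICES`, and the function name `describe_rank` is just an illustrative choice. If all 8 ranks print from both hosts, the launcher and container are at least starting processes correctly, and the problem is more likely in the later GPU or communication setup.

```python
import os
import socket


def describe_rank(environ=None):
    """Collect the launcher-provided rank info visible to this process.

    Assumes the process was started by Open MPI's mpirun, which exports
    the OMPI_COMM_WORLD_* variables. Falls back to single-process values
    when they are missing.
    """
    env = os.environ if environ is None else environ
    return {
        "host": socket.gethostname(),
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }


if __name__ == "__main__":
    # flush=True so output is not lost if a later step hangs
    print(describe_rank(), flush=True)
```

Launched the same way as the failing job, each process should report a distinct `rank`, and `local_rank` should run from 0 to 3 on each machine.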

prakhar_sharma · September 8, 2022, 8:47am

I am able to train on multiple nodes using an Apptainer image.

Shen666 · September 8, 2022, 1:15pm

Thanks @ngeneva

Shen666 · September 8, 2022, 1:18pm

Hi @prakhar_sharma, is the Apptainer image an image on AWS or another cloud provider? Where could I try it?
Did you convert the Modulus image to a Singularity container image (.sif) or a sandbox?
Thanks

prakhar_sharma · September 8, 2022, 1:36pm

I didn’t convert the image. I was learning Apptainer, so out of interest I created my own Apptainer definition file to build an image from the base image.

It works perfectly fine with multiple GPUs, mpirun, and pysdf.

Shen666 · September 8, 2022, 1:49pm

Interesting. Since Modulus is only released as a Docker image, I am curious how you run it with Apptainer without converting it to a .sif file. Would you mind sharing the command you run to launch Modulus with Apptainer? That would be very helpful for me to understand. Many thanks @prakhar_sharma

prakhar_sharma · September 8, 2022, 2:08pm

Sorry, I can’t share it. I am looking to create an MR on their GitLab repo if they allow me. But it is not too difficult. You just need the base image (find the link in the previous comment) and then apply everything that is required for the bare-metal installation.

Shen666 · September 8, 2022, 2:16pm

I see. Thanks. I will try with Apptainer.
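For anyone trying the same route, here is a minimal sketch of how a two-node launch line could be assembled. The actual command was never shared in this thread, so everything here is an assumption: the host names `node1` and `node2`, the image name `modulus.sif`, the script name `train.py`, and the helper `build_launch_command` are all hypothetical. It combines Open MPI's `mpirun` on the host with `apptainer exec --nv` (the flag that exposes the NVIDIA GPUs inside the container), which is one common pattern but may not match the setup described above.

```python
import shlex


def build_launch_command(hosts, gpus_per_host, image, script):
    """Build an mpirun command that starts one process per GPU.

    All arguments are placeholders; adjust them to your cluster.
    """
    # Open MPI accepts "host:slots" entries in a comma-separated list
    host_spec = ",".join(f"{h}:{gpus_per_host}" for h in hosts)
    total_procs = len(hosts) * gpus_per_host
    return [
        "mpirun", "-np", str(total_procs), "--host", host_spec,
        # --nv makes the host NVIDIA driver and GPUs visible in the container
        "apptainer", "exec", "--nv", image,
        "python", script,
    ]


# Hypothetical names: two machines with 4 GPUs each, as in the first post
cmd = build_launch_command(["node1", "node2"], 4, "modulus.sif", "train.py")
print(shlex.join(cmd))
```

Printing the command first, rather than running it directly, makes it easy to compare against a launch line that works on a single machine.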
