Hanging issue of Modulus v22.07 running on multi-node GPUs

I want to run Modulus on two machines, each with 4 GPUs.
For now, I can run Modulus on each machine separately with its 4 GPUs without any issues, as shown in the attached figure.


However, when I use the two machines together, the run hangs, as shown in the screenshot. It cannot initialize the GPUs using OpenMPI the way it does in the working single-machine run above. This works with v21.06 but fails with v22.07.

There are no MPI issues across the two machines, and no MPI incompatibility issue either.
Has anyone run into a similar issue, or does anyone have comments on it?
Thanks

Hi @Shen666 ,

We don’t use Singularity for running our image. I suspect this may be an issue with Singularity and the Modulus Docker image (I am assuming the Singularity container is built off the Docker file?). I have not seen this issue before.

After some Googling, there does seem to be some potentially relevant information here about using Singularity with an nvidia-docker container (it mentions the CUDA path you have here about midway down the page).

https://www.nmr.mgh.harvard.edu/martinos/userInfo/computer/docker.php

If this still doesn’t work, you may want to try a bare-metal install (Python install), which still has the majority of Modulus’ features functional.
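
To narrow down whether the hang is in the MPI launch itself or in the later GPU initialization, a small per-rank check may help. The following is only a minimal sketch, assuming the job is started with Open MPI's `mpirun`, which exports `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE` and `OMPI_COMM_WORLD_LOCAL_RANK` to each process. The function name `describe_rank` is hypothetical and not part of Modulus.

```python
import os
import socket


def describe_rank(env=None):
    """Summarize what this process sees from the Open MPI launcher.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    The fallbacks (rank 0, world size 1) apply when not launched via mpirun.
    """
    env = os.environ if env is None else env
    return {
        "host": socket.gethostname(),
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        # Which GPUs the container exposes to this process, if set at all
        "visible_gpus": env.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }


if __name__ == "__main__":
    # One line per rank; flush so output is not lost if a later step hangs
    print(describe_rank(), flush=True)
```

Saved as, say, `check_ranks.py` and launched inside the container with the same `mpirun` options and hostfile as the Modulus run, this should print one line per process. If all 8 ranks show up with both hostnames and local ranks 0 to 3 on each machine, the MPI launch through the container is probably fine, and the problem more likely sits in the GPU or communication setup that happens afterwards.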

I am able to train on multiple nodes using an Apptainer image.

Thanks @ngeneva

Hi @prakhar_sharma , regarding the Apptainer image: is it an image on AWS or another cloud provider? Where could I try it?
Did you convert the Modulus image to a Singularity container image (.sif) or to a sandbox?
Thanks

I didn’t convert the image. I was learning Apptainer, so out of interest I created my own Apptainer definition file to build an image from the base image.

It works perfectly fine with multiple GPUs, mpirun, and pysdf.

Interesting. Since Modulus is only released as a Docker image, I am curious how you run it with Apptainer without converting it to a .sif file. Would you mind sharing the command you run to launch Modulus with Apptainer? That would be very helpful for my understanding. Many thanks @prakhar_sharma

Sorry, I can’t share it. I am looking to create an MR on their GitLab repo if they allow me. But it is not too difficult. You just need the base image (find the link in the previous comment) and then apply everything that is required for the bare-metal installation.

I see. Thanks. I will try with Apptainer.
