Modulus v22.03 docker container mpirun issue

Hi, thanks for your attention!
When I load the Modulus v22.03 Docker container and try to run on multiple GPUs with mpirun -np 4 python ldc_2d.py, I hit the issue shown in the figure: multiple processes appear on rank 0.


Do you know a possible reason? The environment works well with Modulus v21.06.
Has anyone successfully run on multiple GPUs in the Modulus v22.03 Docker container with mpirun?
Thanks

This makes the simulation hang there. Is this expected?
Any comments on how to solve this issue are appreciated.
@TomK @ramc
Thanks

Hi @Shen666,

I am hoping @ramc can help, as I am not a technical resource for Modulus.

Thanks @TomK

Hello, so the multiple processes on GPU 0 are expected. Can you post the output showing where it hangs? We have not encountered this before. Does it run fine on a single GPU?

Thanks @ohennigh. It runs fine on a single GPU.
For multiple GPUs, when I run mpirun -np 4 python ldc_2d.py (or any other example), it hangs after printing the [step 0] loss.
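For reference, a small standalone check like the following (a hypothetical script, not part of Modulus; the file name check_ranks.py is just an example) can confirm whether each mpirun rank is actually being pinned to its own GPU, which is useful to rule out before digging into the hang itself:

import os

import torch

# Open MPI exposes the per-node rank and the global rank via these
# environment variables when a program is launched with mpirun.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
world_rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))

# Bind this process to the GPU matching its local rank; if every rank
# reports cuda:0 below, all four processes are piling onto GPU 0.
torch.cuda.set_device(local_rank)
print(f"world rank {world_rank}: local rank {local_rank} "
      f"-> cuda:{torch.cuda.current_device()}")

Running it as mpirun -np 4 python check_ranks.py should print four lines, each with a different device index.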

I’m suffering from the same issue. When I tried to run fpga_flow via mpirun, it hung right after the first step.

Furthermore, after training it on a single GPU, I tried to run fpga_heat via mpirun and got an error. I suspected it was related to the find_unused_parameters option in DDP, so I set it to False in continuous/constraints/constraint.py. At first glance this seemed to work, as I could see all GPUs being used in nvidia-smi, but the problem is that training is not accelerated at all: it runs at the same speed as a single GPU.
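For context, here is a minimal sketch in plain PyTorch of what that flag controls; the wrapper function and arguments below are placeholders, not the actual code in Modulus's continuous/constraints/constraint.py:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: nn.Module, device_id: int) -> nn.Module:
    # find_unused_parameters=True makes DDP scan every step for parameters
    # that received no gradient (needed when parts of the graph are skipped).
    # Setting it to False avoids that overhead, but DDP will raise an error
    # if some parameters genuinely end up without gradients.
    return DDP(
        model,
        device_ids=[device_id],
        output_device=device_id,
        find_unused_parameters=False,
    )

So flipping the flag changes how DDP handles gradient synchronization, but it would not by itself explain why training runs at single-GPU speed.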


Thanks @sy0319.kim for sharing your experience.
@ramc @ohennigh This seems to be a common issue. I suggest you also test using NGC. Please let us know if you have any suggestions on the MPI hanging issue.