Modulus v22.03 docker container mpirun issue

Hi, thanks for your attention!
When I load the Modulus v22.03 Docker container and try to run on multiple GPUs with mpirun -np 4 python ldc_2d.py, I hit the issue shown in the figure: multiple processes show up on rank 0.


Do you know the possible reason? The environment works well with Modulus v21.06.
Has anyone successfully run on multiple GPUs in the Modulus v22.03 Docker container with mpirun?
Thanks

This makes the simulation hang. Is this expected?
Any comments on how to solve this issue are appreciated.
@TomK @ramc
Thanks

Hi @Shen666,

I am hoping @ramc can help, as I am not a technical resource for Modulus.

Thanks @TomK

Hello, so the multiple processes on GPU 0 are expected. Can you post an output of where it hangs? We have not encountered this before. Does it run fine on a single GPU?

Thanks @ohennigh. It runs fine on a single GPU.
For multiple GPUs, when I run mpirun -np 4 python ldc_2d.py or any of the other examples, it hangs after printing [step 0] loss.
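To check how mpirun places the processes, each rank can print its own placement before training starts. The sketch below is a minimal diagnostic, assuming the container uses Open MPI (which sets the `OMPI_COMM_WORLD_*` variables for each launched process). The helper name `describe_rank` and the file name `rank_check.py` are hypothetical and not part of Modulus.

```python
import os


def describe_rank(env=None):
    """Return this process's MPI placement as read from Open MPI's environment.

    Falls back to single-process defaults when the variables are not set,
    for example when the script is started without mpirun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "visible_gpus": env.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }


if __name__ == "__main__":
    info = describe_rank()
    print(
        "rank {rank}/{world_size} local_rank={local_rank} "
        "CUDA_VISIBLE_DEVICES={visible_gpus}".format(**info)
    )
```

Running it as mpirun -np 4 python rank_check.py should print one line per process. If every line reports a distinct rank and local rank, the launcher is doing its job and the hang is more likely inside the training loop than in process placement.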

I’m suffering from the same issue. When I try to run fpga_flow via mpirun, it hangs right after the first step.

Furthermore, after training it on a single GPU, I tried to run fpga_heat via mpirun and an error occurred. I suspected it was related to the find_unused_parameters option in DDP, so I set it to False in continuous/constraints/constraint.py. At first glance this seemed to work, since nvidia-smi showed all GPUs in use, but the training is not accelerated at all: it runs at the same speed as on a single GPU.


Thanks @sy0319.kim for sharing your experience.
@ramc @ohennigh This seems to be a common issue. I suggest you also test using NGC. Please let us know if you have any suggestions on the MPI hanging issue.

Was this issue resolved? I am stuck with the same problem running 2 Quadro RTX 5000s. Running the examples with a single GPU works fine but whenever I try to utilize both using mpirun as explained here, the program will go through the first step then stop, leaving each GPU at 100% utilization indefinitely until I kill the program.
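When a run stalls like this, it helps to see where each rank is stuck. The sketch below is one way to do that with Python's standard `faulthandler` module, which can dump the Python stack of every thread after a timeout. It assumes Open MPI (for the `OMPI_COMM_WORLD_RANK` variable); the helper name `enable_hang_dump` and the log file naming are hypothetical and not part of Modulus.

```python
import faulthandler
import os
import sys


def enable_hang_dump(timeout_s=300.0, log_dir=None):
    """Dump all Python thread stacks every `timeout_s` seconds.

    With `log_dir` set, each MPI rank writes to its own file so the
    traces of different ranks do not interleave. The returned stream
    must stay open for as long as the dump is armed.
    """
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
    if log_dir is None:
        stream = sys.stderr
    else:
        stream = open(os.path.join(log_dir, f"hang_rank{rank}.log"), "w")
    # repeat=True keeps dumping at every interval until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
    return stream


def disable_hang_dump():
    """Cancel the periodic dump, for example after training finishes."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump(timeout_s=120, log_dir=".")` near the top of the training script would leave one trace file per rank. If the ranks are parked in different places (say, one in validation and the others in a gradient all-reduce), that points to a collective call that not every rank reaches.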

I am experiencing the same issue: the fpga example froze after step 1. I am using the Modulus 22.03 container.

It turned out that if you turn off the validator, the program will continue running.

This suggests there is a bug in the validator when running on multiple GPUs.


A follow-up for anyone running 22.03.

As @yunchaoyang pointed out, 22.03 had a bug in FPGA multi-GPU runs where the validator confuses DDP. This has been corrected in 22.07. If the validator is absolutely needed in 22.03, you may want to try setting requires_grad=False to see if that fixes the hang, or shut it off during training.

