Modulus v22.03 docker container mpirun issue

Hi, thanks for your attention!
When I load the Modulus v22.03 docker container and try to run on multiple GPUs with mpirun -np 4 python ldc_2d.py, I hit the issue shown in the figure: multiple processes show up on rank 0 (GPU 0).


Do you know the possible reason? The environment works well with Modulus v21.06.
Does anyone successfully run with multiple GPUs in Modulus v22.03 docker container with mpirun?
Thanks

This makes the simulation hang there. Is this expected?
Any comments on how to solve this issue are appreciated.
@TomNVIDIA @ramc
Thanks

Hi @Shen666,

I am hoping @ramc can help, as I am not a technical resource for Modulus.

Thanks @TomNVIDIA

Hello, so the multiple processes on GPU 0 are expected. Can you post an output of where it hangs? We have not encountered this before. Does it run fine on a single GPU?
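
If you want to double-check the GPU binding on your side, here is a minimal sketch (assuming Open MPI inside the container exports OMPI_COMM_WORLD_LOCAL_RANK and that PyTorch is available; the script name is just an example) to confirm each rank is pinned to its own GPU:

```python
# check_gpus.py (name is illustrative) - verify each MPI rank binds to its own GPU.
# Assumes Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK for every launched process.
import os
import torch

local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)  # pin this process to one GPU
print(f"local rank {local_rank} -> cuda:{torch.cuda.current_device()} "
      f"({torch.cuda.get_device_name(local_rank)})")
```

Run it the same way as the examples, e.g. mpirun -np 4 python check_gpus.py, and each rank should report a different device.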

Thanks @ohennigh. It runs fine on a single GPU.
For multiple GPUs, when I run mpirun -np 4 python ldc_2d.py (or any other example), it hangs after printing the [step 0] loss.

I’m suffering from the same issue. When I try to run fpga_flow via mpirun, it hangs right after the first step.

Furthermore, after training it on a single GPU, I tried to run fpga_heat via mpirun and got an error. I suspected it was related to the find_unused_parameters option in DDP, so I set it to False in continuous/constraints/constraint.py. At first glance this seemed to work, since nvidia-smi showed all GPUs being used, but the problem is that training is not accelerated at all: it runs at the same speed as on a single GPU.
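
For context, the flag I edited is the one passed to PyTorch's DistributedDataParallel wrapper, roughly like this (a standalone sketch, not the actual Modulus source; the tiny Linear model and the MASTER_ADDR/MASTER_PORT defaults are placeholders so the snippet runs under mpirun):

```python
# Sketch of where find_unused_parameters enters the DDP wrap (illustrative, not Modulus source).
# Assumes launch via Open MPI, e.g. mpirun -np 2 python ddp_sketch.py, with one GPU per rank.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))

os.environ.setdefault("MASTER_ADDR", "localhost")  # placeholder rendezvous settings
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).cuda(local_rank)  # stand-in for the Modulus network
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=False,  # the option I changed in constraint.py
)

dist.destroy_process_group()
```

Setting it to False only skips DDP's post-forward scan for parameters that received no gradient; it does not by itself change how the work is distributed across GPUs.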


Thanks @sy0319.kim for sharing your experience.
@ramc @ohennigh This seems to be a common issue; I suggest you also test it using the NGC container. Please let us know if you have any suggestions on the MPI hanging issue.

Was this issue resolved? I am stuck with the same problem running two Quadro RTX 5000s. Running the examples with a single GPU works fine, but whenever I try to use both GPUs with mpirun as explained here, the program goes through the first step and then stops, leaving each GPU at 100% utilization indefinitely until I kill it.

I am experiencing the same issue: the FPGA example freezes after step 1. I am using the Modulus 22.03 container.

It turns out that if you turn off the validator, the program continues running.

This suggests there is a bug in the validator when running on multiple GPUs.


A follow-up for anyone running 22.03.

There was a bug in 22.03 for FPGA multi-GPU related to the validator confusing DDP, as @yunchaoyang pointed out. This has been corrected in 22.07. If the validator is absolutely needed in this version, you may want to try setting requires_grad=False to see if that fixes the hang, or shut it off during training.
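
To see the idea behind that suggestion in isolation: a validation-style forward pass done without gradients builds no autograd graph, so it does not trigger DDP's gradient-sync machinery, which is presumably why requires_grad=False can avoid the hang. A plain-PyTorch toy of that principle (not Modulus code):

```python
# Toy illustration in plain PyTorch (not Modulus): a no-grad validation pass
# records no autograd graph, so there is nothing for DDP's reducer to wait on.
import torch

model = torch.nn.Linear(4, 1)

# Training-style pass: builds an autograd graph; in a DDP-wrapped model,
# backward() is where gradient synchronization happens.
train_x = torch.randn(16, 4)
loss = model(train_x).pow(2).mean()
loss.backward()

# Validation-style pass: no graph is recorded and no backward is run.
val_x = torch.randn(16, 4)
with torch.no_grad():
    val_pred = model(val_x)

print(loss.item(), val_pred.shape)
```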


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.