Multi-GPU parallel training: program does not exit after training

Hi, I used OpenMPI and three GPUs to run the example “/modulus-sym/examples/three_fin_2d/heat_sink_inverse.py”. After training, the program did not exit automatically. What could be the reason?
Thanks!

Hi @zhangzhenthu

It’s impossible to fully tell why a multi-process job just hangs without logs. If this is still an issue, please provide some additional information: your environment, whether the job finished training / saving the outputs, whether there are any logs / errors when you kill the process, etc.

But just as a heads up, soft locks / hangs like this with MPI can be extremely difficult to fully debug. This may just be some MPI / PyTorch anomaly.
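
If it does come up again, a generic way to see where each rank is stuck (just a debugging sketch, nothing Modulus-specific) is to register Python’s built-in faulthandler so that a signal dumps every thread’s stack trace:

import faulthandler
import signal

# Near the top of the training script (e.g. heat_sink_inverse.py):
# dump all thread stack traces to stderr when the process receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

Then, when the run hangs after training, sending kill -USR1 <pid> to each hung rank prints where it is blocked (often a collective call or a worker/process join), which makes these hangs much easier to report.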

Hi, thank you for your reply!
Yes, it still happens. I am using the Docker environment. All outputs are fine, and there are no logs or errors when I kill the process.
This issue doesn’t have much impact. Considering how difficult the debugging would be, we can just let it go.
Thanks again.

Hi @ngeneva

We are encountering a similar issue with parallel GPU training of Modulus Sym via Open MPI.

We have successfully deployed the latest version of the Modulus Sym container (ver. 22.12) on our A40 server. It is able to run the Lid Driven Cavity (LDC) example with single-GPU training.

However, when attempting multi-GPU training, the program gets stuck at iteration 0: it never updates the iteration information or stores iteration results.

Regarding parallel GPU training, the official documentation recommends prefixing the original command with “mpirun --allow-run-as-root -np #” (where # is the number of processes/GPUs).
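
For illustration, the documented launch pattern for the LDC example with two GPUs looks like this (example command only, not our exact setup):

mpirun --allow-run-as-root -np 2 python ldc_2d.py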

One can find more details in the following links:
Performance - NVIDIA Docs
Turbulence Super Resolution - NVIDIA Docs

We have attached the log files related to this issue, as well as the GPU information.
ldc_2d_training_OpenMPI_1gpu (3.3 KB)
ldc_2d_training_OpenMPI_2gpu (5.4 KB)
nvidia_smi (5.5 KB)

Here are the system details:
Operating System: Ubuntu 20.04.6 LTS
Docker version: 24.0.1
GPU: NVIDIA A40*8
Driver version: 530.30.02 from NVIDIA’s public website
CUDA driver: 12.1

We would greatly appreciate any recommendations or assistance you can provide to resolve this issue.

Hi @johnlaide

I’ve tested running the LDC example with mpirun --allow-run-as-root -np 2 python ldc_2d.py on a V100 DGX box without issues (this is on older drivers: Driver Version 510.47.03, CUDA Version 11.8).

Typically a good place to start is turning off optimizations (JIT, functorch, and CUDA graphs) to see if that changes anything. These can be turned off in your config.yaml:

cuda_graphs: false
jit: false
graph:
  func_arch: false
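
If you would rather not edit config.yaml, the same settings should also be controllable via Hydra-style command-line overrides (Modulus Sym configs are Hydra based), for example:

mpirun --allow-run-as-root -np 2 python ldc_2d.py cuda_graphs=false jit=false graph.func_arch=false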

Expected output of LDC with two MPI processes:

Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
[02:46:14] - JIT using the NVFuser TorchScript backend
[02:46:14] - Disabling JIT because functorch does not work with it.
[02:46:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[02:46:14] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[02:46:14] - JIT using the NVFuser TorchScript backend
[02:46:14] - Disabling JIT because functorch does not work with it.
[02:46:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[02:46:14] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[02:46:17] - Arch Node: flow_network has been converted to a FuncArch node.
[02:46:17] - Arch Node: flow_network has been converted to a FuncArch node.
[02:46:17] - attempting to restore from: outputs/ldc_2d
[02:46:17] - optimizer checkpoint not found
[02:46:17] - model flow_network.1.pth not found
[02:46:17] - attempting to restore from: outputs/ldc_2d
[02:46:17] - Success loading optimizer: outputs/ldc_2d/optim_checkpoint.0.pth
[02:46:17] - Success loading model: outputs/ldc_2d/flow_network.0.pth
[02:46:19] - [step:          0] record constraint batch time:  8.982e-02s
[02:46:19] - [step:          0] saved checkpoint to outputs/ldc_2d
[02:46:19] - [step:          0] loss:  4.237e-02
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:21] - Attempting cuda graph building, this may take a bit...
[02:46:21] - Attempting cuda graph building, this may take a bit...
[02:46:30] - [step:        100] loss:  9.370e-03, time/iteration:  1.045e+02 ms
[02:46:40] - [step:        200] loss:  5.393e-03, time/iteration:  9.969e+01 ms