Hi, I used OpenMPI and three GPUs to run the case “/modulus-sym/examples/three_fin_2d/heat_sink_inverse.py”. After training finished, the program did not exit automatically. What could be the possible reason?
Thanks!
It's impossible to fully tell why a multi-process job just hangs without logs. If this is still an issue, please provide some additional information, such as your environment, whether the job finished training / saving the outputs, whether there are any logs / errors when you kill the process, etc.
But just as a heads up, soft locks / hangs like this with MPI can be extremely difficult to fully debug. This may just be some MPI / PyTorch anomaly.
Hi, thank you for your reply!
Yes, it still happens. I am using the Docker environment. All output is fine, and there are no logs or errors when I kill the process.
This issue doesn't have much impact. Since debugging it would be difficult, we can just let it go.
Thanks again.
Hi @ngeneva
We are encountering a similar issue with parallel GPU training with modulus-sym via Open MPI.
We have successfully deployed the latest version of the Modulus-sym container (Ver. 22.12) on our A40 server. It is able to run the Lid Driven Cavity (LDC) example with single-GPU training.
However, when attempting multi-GPU training, the program gets stuck at iteration 0 and fails to regularly update iteration information or store iteration results.
Regarding parallel GPU training, the official documentation recommends prefixing the original command with “mpirun --allow-run-as-root -np #” (see the example command after the links below).
One can find more details in the following links:
Performance - NVIDIA Docs
Turbulence Super Resolution - NVIDIA Docs
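For reference, the multi-GPU run is launched with a command of the following form (here with two processes on the LDC example):
mpirun --allow-run-as-root -np 2 python ldc_2d.py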
We have attached the log files related to this issue, as well as the GPU information.
ldc_2d_training_OpenMPI_1gpu (3.3 KB)
ldc_2d_training_OpenMPI_2gpu (5.4 KB)
nvidia_smi (5.5 KB)
Here are the system details:
Operating System: Ubuntu 20.04.6 LTS
Docker version: 24.0.1
GPU: NVIDIA A40*8
Driver version: 530.30.02 from NVIDIA’s public website
CUDA driver: 12.1
We would greatly appreciate any recommendations or assistance one can provide to resolve this issue.
Hi @johnlaide
I’ve tested running the LDC example with mpirun --allow-run-as-root -np 2 python ldc_2d.py on a V100 DGX box without issues (this is on older drivers: Driver Version 510.47.03, CUDA Version 11.8).
Typically a good place to start is shutting off optimizations (turn off JIT, functorch and CUDA graphs) to see if that changes anything. These can be shut off in your config.yaml:
cuda_graphs: false
jit: false
graph:
  func_arch: false
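Since the examples are configured through Hydra, the same settings can most likely also be passed as command-line overrides instead of editing config.yaml (a sketch, assuming standard Hydra override syntax and the key names above):
mpirun --allow-run-as-root -np 2 python ldc_2d.py cuda_graphs=false jit=false graph.func_arch=false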
Expected output of LDC with two MPI processes:
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
[02:46:14] - JIT using the NVFuser TorchScript backend
[02:46:14] - Disabling JIT because functorch does not work with it.
[02:46:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[02:46:14] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[02:46:14] - JIT using the NVFuser TorchScript backend
[02:46:14] - Disabling JIT because functorch does not work with it.
[02:46:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[02:46:14] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[02:46:17] - Arch Node: flow_network has been converted to a FuncArch node.
[02:46:17] - Arch Node: flow_network has been converted to a FuncArch node.
[02:46:17] - attempting to restore from: outputs/ldc_2d
[02:46:17] - optimizer checkpoint not found
[02:46:17] - model flow_network.1.pth not found
[02:46:17] - attempting to restore from: outputs/ldc_2d
[02:46:17] - Success loading optimizer: outputs/ldc_2d/optim_checkpoint.0.pth
[02:46:17] - Success loading model: outputs/ldc_2d/flow_network.0.pth
[02:46:19] - [step: 0] record constraint batch time: 8.982e-02s
[02:46:19] - [step: 0] saved checkpoint to outputs/ldc_2d
[02:46:19] - [step: 0] loss: 4.237e-02
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:19] - Reducer buckets have been rebuilt in this iteration.
[02:46:21] - Attempting cuda graph building, this may take a bit...
[02:46:21] - Attempting cuda graph building, this may take a bit...
[02:46:30] - [step: 100] loss: 9.370e-03, time/iteration: 1.045e+02 ms
[02:46:40] - [step: 200] loss: 5.393e-03, time/iteration: 9.969e+01 ms