Errors occurred when training FourCastNet with multiple GPUs

I tried to train the FourCastNet model with 2 or more GPUs, but some errors occurred.
Training on a single GPU works properly in my environment.
I ran my training script in the modulus:22.09 Docker image with 2 V100 GPUs, using:

mpirun -n 2 --allow-run-as-root python fcn_era5.py
The driver and CUDA versions are: Driver Version: 460.106.00, CUDA Version: 11.2
Any suggestions on how I can solve these issues?

The errors:


Hi @yswang891121

Based on when the error occurred, this seems to be an issue with the CUDA graphs getting recorded. Please try turning CUDA graphs off by adding the following to your configuration:

cuda_graphs: False

I noticed in the NCCL documentation that CUDA graph support with NCCL is only available with CUDA version 11.3+, so this could be an older driver issue. Try turning off CUDA graphs first before doing a driver update.
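
If it helps, here is a minimal sketch of that version check in plain Python. The function names and the hard-coded (11, 3) threshold are my own illustration based on the note above, not part of Modulus or NCCL, so treat the threshold as an assumption and confirm it against the NCCL documentation for your release.

```python
# Hypothetical helper, not a Modulus or NCCL API.
# Assumed threshold: CUDA 11.3+ for CUDA graph support with NCCL.
MIN_CUDA_FOR_NCCL_GRAPHS = (11, 3)


def parse_version(text):
    """Turn a version string such as '11.2' into a tuple of ints, e.g. (11, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def nccl_cuda_graphs_supported(cuda_version):
    """Return True if the reported CUDA version meets the assumed minimum."""
    # Compare only major.minor so that '11.3.1' is handled like '11.3'.
    return parse_version(cuda_version)[:2] >= MIN_CUDA_FOR_NCCL_GRAPHS


# The version reported in the original post falls below the threshold.
print(nccl_cuda_graphs_supported("11.2"))  # False
print(nccl_cuda_graphs_supported("11.3"))  # True
```

With the 11.2 reported above, the check returns False, which is consistent with keeping cuda_graphs: False for multi-GPU runs until the driver is updated.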


Thank you, it works! But how does turning off cuda_graphs affect training?

Hi @yswang891121

Turning CUDA graphs off can reduce performance for some problems (many see some speedup from CUDA graphs, but it depends on the problem). You'll need to do a driver update to use them. Check out these articles for background on CUDA graphs in PyTorch and CUDA graphs in Modulus for more information.

