I tried to train the FourCastNet model on 2 or more GPUs, but some errors occurred.
Training on a single GPU works properly in my environment.
I ran my training script in the modulus:22.09 Docker image with 2 V100 GPUs, using
mpirun -n 2 --allow-run-as-root python fcn_era5.py
The driver and CUDA versions are: Driver Version: 460.106.00 CUDA Version: 11.2
Any suggestions on how I can solve these issues?
Based on when the error occurred, this seems to be an issue with the CUDA graphs being recorded. Please try turning CUDA graphs off by adding the following to your configuration:
cuda_graphs: False
I noted that, according to the NCCL documentation, CUDA graph support in NCCL requires CUDA 11.3 or later, so this could be an older-driver issue. Try turning off CUDA graphs first, before attempting a driver update.
Turning CUDA graphs off can reduce performance for some problems (many users see some speedup from CUDA graphs, but it depends on the workload). To use them, you'll need to update the driver. For background, check out these articles on CUDA graphs in PyTorch and CUDA graphs in Modulus.
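
If it helps, here is a rough sketch of the version comparison mentioned above. It only parses the "CUDA Version" text that nvidia-smi prints (the highest CUDA version the installed driver supports) and compares it with the 11.3 minimum quoted in this reply. The helper names and the threshold constant are my own illustration, not part of Modulus or NCCL, and the 11.3 figure is taken from the reply rather than verified here.

```python
import re

# Assumption: minimum CUDA version for NCCL + CUDA graphs, as quoted in the reply.
MIN_CUDA_FOR_NCCL_GRAPHS = (11, 3)


def parse_cuda_version(text):
    """Return (major, minor) from text containing 'CUDA Version: X.Y', or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_graphs_supported(text):
    """True if the reported CUDA version meets the assumed minimum."""
    version = parse_cuda_version(text)
    return version is not None and version >= MIN_CUDA_FOR_NCCL_GRAPHS


if __name__ == "__main__":
    # The string reported in the question above.
    reported = "Driver Version: 460.106.00 CUDA Version: 11.2"
    print(parse_cuda_version(reported), cuda_graphs_supported(reported))
```

With the versions from the question, the check comes out negative, which fits the suggestion to keep CUDA graphs off until the driver is updated.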