Errors occurred when training FourCastNet with multiple GPUs

yswang891121 · June 8, 2023, 4:15am

I tried to train the FourCastNet model with 2 or more GPUs, but some errors occurred.
Training on only 1 GPU works properly in my environment.
I ran my training script in the modulus:22.09 Docker image with 2 V100 GPUs, using

mpirun -n 2 --allow-run-as-root python fcn_era5.py
The driver and CUDA versions are: Driver Version: 460.106.00 CUDA Version: 11.2
Any suggestions on how I can solve these issues?

The errors:

ngeneva · June 9, 2023, 3:09am

Hi @yswang891121

Based on when the error occurred, it seems this is an issue with the CUDA graphs getting recorded. Please try turning CUDA graphs off by adding the following to your configuration:

cuda_graphs: False

I noted that, according to the NCCL documentation, CUDA graph support with NCCL is only available with CUDA version 11.3+, so this could be an older driver issue. Try just shutting off CUDA graphs first before doing a driver update.
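
As a rough illustration of that check, here is a minimal Python sketch that compares the CUDA version string reported in the original post against the 11.3 threshold mentioned above. The function names are hypothetical and not part of Modulus or NCCL, and the threshold is taken from this reply rather than verified independently.

```python
import re

# Minimum CUDA version for NCCL + CUDA graphs, as stated in the reply above
# (an assumption taken from this thread, not checked against NCCL itself).
MIN_CUDA_FOR_NCCL_GRAPHS = (11, 3)


def parse_cuda_version(text):
    """Extract (major, minor) from text such as 'CUDA Version: 11.2'."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no CUDA version found in: {text!r}")
    return int(match.group(1)), int(match.group(2))


def cuda_graphs_recommended(text):
    """Return True if the reported CUDA version meets the 11.3+ threshold."""
    return parse_cuda_version(text) >= MIN_CUDA_FOR_NCCL_GRAPHS


# Version line as reported in the original post.
reported = "Driver Version: 460.106.00 CUDA Version: 11.2"
print(cuda_graphs_recommended(reported))
```

With the versions from the original post, the check comes out negative, which is consistent with setting cuda_graphs: False until the driver is updated.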

yswang891121 · June 12, 2023, 5:27am

Thank you, it works! But how does turning cuda_graphs off affect training?

ngeneva · June 12, 2023, 6:38pm

Hi @yswang891121

Turning them off can impact performance for some problems (many see some speedup using CUDA graphs, but it depends on the problem). You’ll need to do a driver update to use them. For background, check out these articles on CUDA graphs in PyTorch and CUDA graphs in Modulus.

system · June 26, 2023, 6:38pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
