I am having problems training models on a Windows machine with 4 RTX 3090 GPUs.
When I run multiple training sessions on multiple GPUs (one model per GPU; see the minimal launch sketch below), I get repeatable failures on one particular GPU (GPU 3). Note that this GPU is also the only one configured for video output.
a.) If the first training session runs on the affected GPU 3, the training hangs as soon as I start two or more sessions on the other GPUs. (GPU 3 is no longer available afterward; a reboot is required.)
Sometimes I get the following error: RuntimeError('CUDA error: the launch timed out and was terminated')
b.) If GPU 3 is the last one to start training, the entire system freezes (hard reboot required).
If I train only on the affected GPU 3, it runs without any problems.
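For reference, here is a minimal sketch of how the sessions are launched. The script name train.py and the loop over four GPUs are placeholders; the key point is that each session is a separate Python process pinned to a single GPU via CUDA_VISIBLE_DEVICES, so inside each process the selected card appears as cuda:0:

```python
import os
import subprocess

# Launch one independent training process per GPU (hypothetical
# script name "train.py"; arguments omitted for brevity).
processes = []
for gpu_id in range(4):  # 4 x RTX 3090
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this process to one GPU
    processes.append(subprocess.Popen(["python", "train.py"], env=env))

# Wait for all sessions to finish.
for p in processes:
    p.wait()
```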
OS: Microsoft Windows 10 Enterprise
Python version: 3.8 (64-bit runtime) in Anaconda
Deep learning framework: PyTorch 1.7.1
Is CUDA available: True
CUDA runtime version: 11.0
GPU models and configuration: 4 x RTX 3090
Nvidia driver version: 461.40
cuDNN version: 8.0.4
Versions of relevant libraries:
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.0.221 h74a9793_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] mkl-service 2.3.0 py38hb782905_0
[conda] mkl_fft 1.2.0 py38h45dec08_0
[conda] mkl_random 1.1.1 py38h47e9c7a_0 anaconda
[conda] numpy 1.19.2 py38hadc3359_0
[conda] numpy-base 1.19.2 py38ha3acd2a_0
[conda] pytorch 1.7.1 py3.8_cuda110_cudnn8_0 pytorch
[conda] torchvision 0.8.2 py38_cu110 pytorch
CPU: AMD Ryzen Threadripper 3970X
I have already tried the solutions suggested here: Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04 - GPU - Level1Techs Forums
The problem seems to occur more frequently when more GPU memory is utilized (e.g., when using larger batch sizes for training).
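To illustrate how I relate the failures to memory usage, here is a minimal sketch of the per-session memory check using PyTorch's standard queries (the batch shape is only a placeholder; in the real runs the allocation comes from the model and its activations):

```python
import torch

device = torch.device("cuda:0")  # the single GPU visible to this session

# Placeholder allocation standing in for a training batch, to show how
# usage scales with batch size relative to the card's 24 GB.
batch = torch.randn(256, 3, 224, 224, device=device)

print(f"allocated:     {torch.cuda.memory_allocated(device) / 1024**2:.0f} MiB")
print(f"max allocated: {torch.cuda.max_memory_allocated(device) / 1024**2:.0f} MiB")
```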
(I have posted the question to the PyTorch forum as well.)