I am having problems training models on a Windows machine with 4 RTX 3090 GPUs.
When I run multiple training sessions on multiple GPUs (one model per GPU; see the minimal launch sketch below), I get repeatable failures on one particular GPU (GPU 3). Note that this GPU is also the only one configured for video output.
a.) If the first training session runs on the affected GPU 3, the training hangs as soon as I start two or more sessions on the other GPUs. (GPU 3 is no longer available afterward; a reboot is required.)
Sometimes I get the following error: RuntimeError('CUDA error: the launch timed out and was terminated')
b.) If GPU 3 is the last one to start training, the entire system freezes (hard reboot required).
If I train only on the affected GPU 3, it runs without any problems.
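For reference, here is a minimal sketch of how the sessions are launched. The script name train.py and the loop over four GPUs are placeholders; the key point is that each session is a separate Python process pinned to a single GPU via CUDA_VISIBLE_DEVICES, so inside each process the selected card appears as cuda:0:

```python
import os
import subprocess

# Launch one independent training process per GPU (hypothetical
# script name "train.py"; arguments omitted for brevity).
processes = []
for gpu_id in range(4):  # 4 x RTX 3090
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this process to one GPU
    processes.append(subprocess.Popen(["python", "train.py"], env=env))

# Wait for all sessions to finish.
for p in processes:
    p.wait()
```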
OS: Microsoft Windows 10 Enterprise
Python version: 3.8 (64-bit runtime) in Anaconda
Deep learning framework: PyTorch 1.7.1
Is CUDA available: True
CUDA runtime version: 11.0
GPU models and configuration: 4 x RTX 3090
Nvidia driver version: 461.40
cuDNN version: 8.0.4
Versions of relevant libraries:
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.0.221 h74a9793_0 anaconda
[conda] mkl 2020.2 256 anaconda
[conda] mkl-service 2.3.0 py38hb782905_0
[conda] mkl_fft 1.2.0 py38h45dec08_0
[conda] mkl_random 1.1.1 py38h47e9c7a_0 anaconda
[conda] numpy 1.19.2 py38hadc3359_0
[conda] numpy-base 1.19.2 py38ha3acd2a_0
[conda] pytorch 1.7.1 py3.8_cuda110_cudnn8_0 pytorch
[conda] torchvision 0.8.2 py38_cu110 pytorch
CPU: AMD Ryzen Threadripper 3970X
I have already tried the solutions suggested here: Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04 - GPU - Level1Techs Forums
The problem seems to occur more frequently when more GPU memory is utilized (e.g., when using larger batch sizes for training).
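To illustrate how I relate the failures to memory usage, here is a minimal sketch of the per-session memory check using PyTorch's standard queries (the batch shape is only a placeholder; in the real runs the allocation comes from the model and its activations):

```python
import torch

device = torch.device("cuda:0")  # the single GPU visible to this session

# Placeholder allocation standing in for a training batch, to show how
# usage scales with batch size relative to the card's 24 GB.
batch = torch.randn(256, 3, 224, 224, device=device)

print(f"allocated:     {torch.cuda.memory_allocated(device) / 1024**2:.0f} MiB")
print(f"max allocated: {torch.cuda.max_memory_allocated(device) / 1024**2:.0f} MiB")
```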
(I have posted the question to the PyTorch forum as well.)