GPU RTX 3090 keeps going into ERR mode after a couple of training runs

I have been losing my mind over an issue with a GPU on my machine. Currently the GPU works fine for one or two training runs in PyTorch, but then it goes into ERR when I run nvidia-smi. After this happens I am usually left with a Python process that I cannot kill, not even with sudo kill -9 PID. This is always accompanied by one core whose bar is 100% red in htop; I am not sure what that means.
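For what it's worth, a process that ignores even sudo kill -9 is usually stuck in uninterruptible sleep ("D" state) inside a kernel or driver call, which would also match a solid red (kernel-time) bar in htop. A minimal sketch to check this from /proc (the PID in the comment is a placeholder for the stuck process):

```python
# Check whether a process is stuck in uninterruptible sleep ("D" state),
# which would explain why SIGKILL has no effect on it.

def proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The comm field (2nd) is parenthesised and may contain spaces,
    so split on the last closing parenthesis first.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]  # e.g. "R", "S", "D", "Z"

def check(pid: int) -> str:
    """Read the live state of a process; pid is a placeholder here."""
    with open(f"/proc/{pid}/stat") as f:
        return proc_state(f.read())

# Example (hypothetical PID): a stuck CUDA process typically shows "D".
# print(check(12345))
```

If the state really is "D", the process is blocked inside the driver and no signal will remove it; that usually points at the driver/hardware rather than at your Python code.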

If I try to reset the GPU, it tells me it cannot because the GPU is in use by some processes, which I guess are the ones I cannot kill. This happens consistently: if I reboot, the problem seems to be solved, but again after a couple of training runs I hit the same issue. The main problem is that most of the time I connect to my machine through SSH, so if I reboot I have to ask someone to turn the machine back on, or go there myself.
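To see exactly which processes are still holding a CUDA context (and therefore blocking the reset), nvidia-smi can list compute apps directly. A small sketch wrapping that query (assuming nvidia-smi is on the PATH):

```python
# List the PIDs currently holding a CUDA context, i.e. the processes
# that block an nvidia-smi GPU reset.
import subprocess

def parse_compute_apps(csv_text: str) -> list[tuple[int, str]]:
    """Parse `nvidia-smi --query-compute-apps=... --format=csv,noheader` output."""
    apps = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid, name = line.split(",", 1)
        apps.append((int(pid.strip()), name.strip()))
    return apps

def gpu_compute_apps() -> list[tuple[int, str]]:
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Cross-checking these PIDs against the unkillable Python process should confirm whether it is the one pinning the device.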

The OS on my machine is Manjaro, but I also had issues on Ubuntu 22.04, where I got “CUDA error: unspecified launch failure”. I don’t think it is hardware related, as the GPU is only a year old, and again it is able to train once the machine is restarted.
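In PyTorch, CUDA failures like “unspecified launch failure” surface as a RuntimeError, so one thing you can do is catch the first failure and log it immediately instead of discovering the ERR state later in nvidia-smi. A sketch, where `step_fn` is a hypothetical stand-in for one epoch of your training loop:

```python
# Wrap a training loop so the first CUDA RuntimeError is caught and
# logged right away. `step_fn` is a placeholder for your own code.
import logging

logging.basicConfig(level=logging.INFO)

def run_guarded(step_fn, max_epochs: int = 10) -> bool:
    """Run step_fn per epoch; return False on the first CUDA-style failure."""
    for epoch in range(max_epochs):
        try:
            step_fn(epoch)
        except RuntimeError as e:
            # PyTorch raises RuntimeError for CUDA errors such as
            # "CUDA error: unspecified launch failure".
            logging.error("epoch %d failed: %s", epoch, e)
            return False
    return True
```

This will not fix the underlying problem, but the timestamp of the first failure is useful to correlate with temperatures or dmesg/Xid messages.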

The specs of my machine are the following:

  • CPU: Intel i9-13900K/KF 5.8GHz
  • Motherboard: MSI PRO Z690-A DDR4
  • RAM: 64GB DDR4 3200MHz (2x32GB)
  • Power supply: Corsair RM1000 80+ Gold Modular

The machine also has a second GPU, an RTX 2080 Ti.

This has been seriously compromising my work, and I need a definitive fix for this problem.


I have not fully understood the forum post you reference. My problem does not seem compatible with the memory junction temperature, as currently I get ERR even when I just try to load a model. Do you maybe have any other possible reference?


Once the memory overheats, the GPU will go into an error condition that can only be fixed by rebooting, so you should monitor the VRAM/hotspot temperatures.
In general, of course, please enable nvidia-persistenced to start on boot, make sure it is continuously running, and check whether that resolves the issue.
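One caveat: on Linux, nvidia-smi exposes the GPU core temperature but, as far as I know, not the GDDR6X memory-junction temperature on a 3090, so the hotspot reading may need a third-party tool. The core temperature is still a useful proxy to log over time. A minimal polling sketch (assuming nvidia-smi is on the PATH; interval is arbitrary):

```python
# Periodically log the GPU core temperature via nvidia-smi, so you can
# see whether the card runs hot right before it drops into ERR.
# Note: this query reports temperature.gpu only; the GDDR6X junction
# temperature is not exposed through nvidia-smi on Linux.
import subprocess
import time

def parse_temp(csv_text: str) -> int:
    """Parse `--query-gpu=temperature.gpu --format=csv,noheader,nounits` output."""
    return int(csv_text.strip().splitlines()[0].strip())

def log_temps(interval_s: int = 30) -> None:
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"GPU temp: {parse_temp(out)} C")
        time.sleep(interval_s)
```

For persistence mode, the usual route on systemd distributions is enabling the nvidia-persistenced service at boot and verifying it stays active.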