RTX 3090 keeps going into ERR state after a couple of training runs

Hello,
I have been losing my mind over an issue with a GPU in my machine. The GPU works fine for one or two training runs in PyTorch, but then something goes into ERR in the nvidia-smi output. When this happens, I usually end up with a Python process that I cannot kill, not even with sudo kill -9 PID. This is always accompanied by a CPU core whose bar is 100% red in htop; I am not sure what that means.

If I try to restart the GPU, I am told it cannot be done because the GPU is in use by some processes, which I guess are the ones I cannot kill. This happens consistently: a reboot seems to solve the problem, but after a couple of training runs the issue comes back. The main problem is that most of the time I connect to the machine through ssh, so if I reboot I have to ask someone to turn the machine back on, or go there myself.
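A process that survives kill -9 is typically stuck in uninterruptible sleep (state D) inside a kernel or driver call, which would also be consistent with a red (kernel time) bar in htop. The following is a minimal sketch for checking this, assuming a Linux system with /proc mounted; the helper names `parse_state`, `process_state` and `is_uninterruptible` are made up for illustration.

```python
from pathlib import Path


def parse_state(status_text: str) -> str:
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)"; the code is the second token.
            return line.split()[1]
    raise ValueError("no 'State:' line found")


def process_state(pid: int) -> str:
    """Read the current state code of a process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/status").read_text())


def is_uninterruptible(state: str) -> bool:
    """True for state D, where signals are typically not acted on
    until the blocking kernel call returns."""
    return state == "D"
```

For example, `is_uninterruptible(process_state(12345))` (with 12345 standing in for the stuck PID) returning True would suggest the process is waiting inside the kernel or the driver rather than ignoring the signal.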

The OS on my machine is Manjaro, but I also had issues with Ubuntu 22.04, where I got “CUDA error: unspecified launch failure”. I don’t think it is hardware related, as the GPU is one year old and, again, it is able to train once the machine is restarted.

The specs of my machine are the following:

  • CPU: Intel i9-13900K/KF 5.8GHz
  • Motherboard: MSI PRO Z690-A DDR4
  • RAM: 64GB DDR4 3200MHz 2x32GB
  • Power supply: Corsair RM1000 80+ Gold Modular

The machine also has a second GPU, an RTX 2080 Ti.

This has been compromising my work a lot, and I need a definitive fix for this problem.

Best,
Luca

https://forums.developer.nvidia.com/t/request-gpu-memory-junction-temperature-via-nvidia-smi-or-nvml-api/168346/368

I have not fully understood the thread you linked. My problem does not seem to be related to the memory junction temperature, as currently I get ERR even when I just try to load a model. Do you have any other reference?

Luca

Once the memory overheats, the GPU goes into an error state that can only be cleared by rebooting, so you should monitor the VRAM and hotspot temperatures.
In general, please also enable nvidia-persistenced to start on boot, make sure it is continuously running, and check whether that resolves the issue.
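The temperature monitoring suggested above could be sketched roughly as follows, by polling nvidia-smi from Python. This is only a sketch under a few assumptions: the 85 °C limit and the function names are placeholders, and `temperature.memory` may well report N/A on GeForce cards (the linked thread is a request to expose the memory junction temperature), in which case only the core temperature and power draw are logged.

```python
import subprocess
import time

# Fields passed to nvidia-smi --query-gpu; temperature.memory may read N/A.
QUERY_FIELDS = ["index", "name", "temperature.gpu", "temperature.memory", "power.draw"]


def parse_gpu_csv(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def to_number(value):
    """Return a float, or None for any non-numeric reading (N/A, error text)."""
    try:
        return float(value)
    except ValueError:
        return None


def hot_gpus(rows, limit_c=85.0):
    """Rows whose core temperature exceeds limit_c (a placeholder threshold)."""
    result = []
    for row in rows:
        temp = to_number(row["temperature.gpu"])
        if temp is not None and temp > limit_c:
            result.append(row)
    return result


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def log_forever(interval_s=5.0):
    """Print one timestamped line per GPU every interval_s seconds."""
    while True:
        for row in query_gpus():
            print(time.strftime("%H:%M:%S"), row, flush=True)
        time.sleep(interval_s)
```

Running `log_forever()` in a separate ssh session (or under tmux) during training, with the output redirected to a file, would leave a record of the last readings before the GPU drops into ERR.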