Hello,
I have been losing my mind over an issue with a GPU in my machine. The GPU works fine for one or two training runs in PyTorch, but then something shows ERR when I run nvidia-smi. Once this happens, I usually end up with a Python process that I cannot kill, not even with sudo kill -9 PID. This is always accompanied by a CPU core whose bar is 100% red in htop; I am not sure what that means.
If I try to reset the GPU, it tells me it cannot because the GPU is being used by some processes, which I guess are the ones I cannot kill. This happens consistently: a reboot seems to solve the problem, but after a couple of training runs the issue comes back. The main problem is that most of the time I connect to my machine through SSH, so if I reboot I have to ask someone to turn the machine back on, or go there myself.
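A minimal sketch of how the state of the stuck process could be checked, assuming a Linux system where /proc/PID/stat is readable. A process that ignores kill -9 is often in uninterruptible sleep (state D), waiting inside a kernel or driver call, and in htop's default color scheme a red CPU bar indicates kernel time, so both symptoms may point at the driver rather than at the Python code. The function names here are hypothetical and only for illustration.

```python
import os
import sys


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    # The first field after the command name is the state (R, S, D, Z, ...).
    return after_comm.split()[0]


def read_state(pid: int) -> str:
    """Read the state of a running process from /proc (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


if __name__ == "__main__":
    # Use the PID given on the command line, or this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    state = read_state(pid)
    print(f"PID {pid} state: {state}")
    if state == "D":
        # Assumption: D means uninterruptible sleep, where signals
        # (including SIGKILL) stay pending until the kernel call returns.
        print("Process is in uninterruptible sleep; kill -9 will not take effect yet.")
```

Saved under a hypothetical name such as check_state.py, it could be run as python check_state.py PID with the PID of the process that refuses to die. If the state is D, the kernel log (for example via dmesg) would be the next place to look for messages from the NVIDIA driver.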
The OS on my machine is Manjaro, but I also had issues with Ubuntu 22.04, where I got “CUDA error: unspecified launch failure”. I don’t think it is hardware related, as the GPU is only one year old and it is able to train again after a reboot.
The specs of my machine are the following:
- CPU: Intel i9-13900K/KF 5.8GHz
- Motherboard: MSI PRO Z690-A DDR4
- RAM: 64GB DDR4 3200MHz 2x32GB
- Power supply: Corsair RM1000 80+ Gold Modular
The machine also has a second GPU, an RTX 2080 Ti.
This has been disrupting my work a lot, and I need a definitive fix for this problem.
Best,
Luca