This error has been bugging me for months. Now it has gotten so bad that I can’t work anymore.
The machine has 4 GPUs. I reseated the GPUs about 6 months ago; that may have helped a little, or it may not have. I don’t think the power supply is the issue here, because the machine is connected directly to the power supply of the whole university (and there is little I can do to change that either).
$ uname -a
Linux lambda-quad 4.15.0-130-generic #134-Ubuntu SMP Tue Jan 5 20:46:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU
$ lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
nvidia-bug-report.log.gz (2.6 MB)