I was training a model when I suddenly got an error and the training stopped. I tried to run the training script again, but it crashed. I then started nvtop to check the GPU usage and got:
No GPU to monitor.
I was very confused by that because everything had been working fine a minute earlier. I then tried nvidia-smi:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This baffled me even more. I then tried lspci -nnk | grep -i nvidia -A3:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4070 Ti SUPER] [10de:2705] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:413d]
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bb] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:413d]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
This struck me as very odd: the GPU is detected, but there is no kernel driver in use for it (only the audio function lists one). I tried restarting and it didn’t help. This is where I got stuck. Has anyone experienced anything similar, and if so, is there a way to fix it? I’ve read somewhere that a similar problem can be solved by disabling Secure Boot in the BIOS, but this is a remote workstation that I connect to over ssh, and I’m currently not physically near the machine, so I can’t get into the BIOS. Any suggestions would be much appreciated.
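
To make the lspci observation concrete, here is a minimal Python sketch that scans lspci -nnk output and lists the devices that have no "Kernel driver in use" line. The function name devices_without_driver and the SAMPLE constant are hypothetical names chosen for illustration, and the parsing assumes the usual lspci layout in which each device starts with its PCI address. It only reports which devices have no driver bound; it does not explain why.

```python
import re

# A device header starts with a PCI address such as "01:00.0",
# optionally preceded by a four-digit domain such as "0000:".
_HEADER = re.compile(r"^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\s")


def devices_without_driver(lspci_text):
    """Return header lines of devices lacking a 'Kernel driver in use' entry."""
    missing = []
    current = None       # header line of the device currently being read
    has_driver = False   # whether that device reported a bound driver
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "--":   # skip blanks and grep separators
            continue
        if _HEADER.match(stripped):
            # A new device begins: record the previous one if it had no driver.
            if current is not None and not has_driver:
                missing.append(current)
            current = stripped
            has_driver = False
        elif stripped.startswith("Kernel driver in use:"):
            has_driver = True
    # Do not forget the last device in the output.
    if current is not None and not has_driver:
        missing.append(current)
    return missing


# The output quoted in this post, used as sample input.
SAMPLE = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4070 Ti SUPER] [10de:2705] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:413d]
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bb] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:413d]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
"""

for device in devices_without_driver(SAMPLE):
    print("no driver bound:", device)
```

Run on the output above, this flags only the 01:00.0 VGA controller, which matches what I see by eye: the audio function is bound to snd_hda_intel, while the GPU itself has candidate modules listed but none in use.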