I’m running an RTX 4070 Ti on Ubuntu 24.04 with the 6.8.0-52 kernel and the NVIDIA 550 driver. I connect to the machine remotely over SSH, and it has no monitor attached (in case that’s relevant). The GPU keeps failing. When I run nvidia-smi I get:
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
After I restart the computer it works normally, but after a while it stops working again. Has anyone encountered anything similar, and what can I do to fix it? Here are some relevant commands that I ran and their output:
>lsmod | grep nvidia
nvidia_drm 122880 2
nvidia_modeset 1355776 3 nvidia_drm
nvidia 54386688 30 nvidia_modeset
video 73728 2 amdgpu,nvidia_modeset
>dmesg | grep -i nvidia
[ 4.613640] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input8
[ 4.613876] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input9
[ 4.614315] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input10
[ 4.615153] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input11
[ 4.655236] nvidia: loading out-of-tree module taints kernel.
[ 4.655242] nvidia: module license 'NVIDIA' taints kernel.
[ 4.655245] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4.655246] nvidia: module license taints kernel.
[ 5.623795] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 5.624974] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 5.675412] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.144.03 Mon Dec 30 17:44:08 UTC 2024
[ 5.684118] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.144.03 Mon Dec 30 17:10:10 UTC 2024
[ 5.686076] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 6.440172] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[ 6.452495] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 6.470561] nvidia-uvm: Loaded the UVM driver, major device number 508.
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[316109.267216] nvidia-uvm: Unloaded the UVM driver.
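Note that the two NVRM lines without timestamps above look like the tail of a longer multi-line NVRM message. Any lines of that message that don't contain the string "nvidia" would have been dropped by `grep -i nvidia`, including any Xid error line (Xid is the tag the NVIDIA driver uses for GPU error reports in the kernel log). A broader filter might look like the sketch below; the function name `filter_gpu_log` and the exact pattern are my own assumptions, not something from a guide:

```shell
# Hypothetical helper: keep kernel log lines that mention NVRM, Xid, or nvidia.
# It reads log text on stdin, so it works with dmesg or journalctl -k output.
filter_gpu_log() {
    # -i: case-insensitive, -E: extended regex so | means "or"
    grep -iE 'nvrm|xid|nvidia'
}
```

It would be used as `sudo dmesg | filter_gpu_log`, ideally right after the failure and before rebooting, since a reboot clears the dmesg buffer.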
>dmesg | grep -i pci
...
[89212.109319] NVRM: GPU at PCI:0000:01:00: GPU-fe5c340e-4c73-2c72-9782-5bd0fbdd56cf