nvidia-smi: No devices were found

Driver version: 525.105.17
OS kernel: 5.4.17-2136.300.7.el8uek.x86_64
From time to time the server loses the Tesla card. nvidia-smi says it sees no devices, but lspci still shows the card:

lspci -vv | grep -i tesla
07:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

strace nvidia-smi prints a lot of output, but I think the key line is:
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
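
To pull only the failing device opens out of the strace noise, a small filter along these lines might help. This is just a sketch: the function name failed_nvidia_opens is made up for illustration, and it assumes strace's usual `openat(...) = -1 ERRNO (...)` line format.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tool): reads strace output on
# stdin and keeps only openat() calls on /dev/nvidia* that returned -1.
failed_nvidia_opens() {
    grep -E 'openat\(.*"/dev/nvidia[^"]*".*\) = -1 '
}
```

Possible usage, since strace writes its trace to stderr: `strace nvidia-smi 2>&1 | failed_nvidia_opens`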

I tried removing the device and rescanning the PCI bus:

echo "1" > /sys/bus/pci/devices/0000\:07\:00.0/remove
echo "1" > /sys/bus/pci/rescan

But no luck. Only a server reboot helps. How can I avoid losing the device like this, and how can I get it back without a reboot?
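
For anyone repeating the remove-and-rescan attempt on a different slot, the two commands can be generated from the lspci address. This is a rough sketch, not a fix: in this thread the rescan did not bring the card back. The function names pci_sysfs_dir and print_rescan_cmds are hypothetical, and PCI domain 0000 is assumed when the address has no domain part.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from any tool.

# Turn a short lspci address such as 07:00.0 into its sysfs directory,
# assuming PCI domain 0000 when none is given.
pci_sysfs_dir() {
    case "$1" in
        ????:??:??.?) addr=$1 ;;
        *)            addr="0000:$1" ;;
    esac
    printf '/sys/bus/pci/devices/%s\n' "$addr"
}

# Print the remove-and-rescan commands for a device instead of running them,
# so they can be reviewed first and then run as root.
print_rescan_cmds() {
    dir=$(pci_sysfs_dir "$1")
    printf 'echo 1 > %s/remove\n' "$dir"
    printf 'echo 1 > /sys/bus/pci/rescan\n'
}
```

Calling `print_rescan_cmds 07:00.0` prints the same two commands used above. Nothing is written to sysfs until the printed lines are actually run as root.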

P.S.

ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 May 24 14:58 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 24 14:58 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 May 24 14:58 /dev/nvidia-modeset
crw-rw-rw- 1 root root 235,   0 May 24 14:58 /dev/nvidia-uvm
crw-rw-rw- 1 root root 235,   1 May 24 14:58 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 238, 1 May 24 14:58 nvidia-cap1
cr--r--r-- 1 root root 238, 2 May 24 14:58 nvidia-cap2

It seems the reason was obvious: the kernel version.
5.4.17-2136.300.7.el8uek.x86_64 is not in the list of supported kernels.
I moved to 4.18.0-425.3.1.el8.x86_64 and it works like a charm.
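
Since the fix here was moving off the UEK kernel, a quick check of which kernel flavor a host is running could be scripted roughly like this. It is a sketch only: kernel_flavor is a made-up name, and matching on the "uek" substring is an assumption based on the release strings quoted in this thread, not an official support check.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel release string as "uek" or "other".
# Assumption: UEK kernels carry "uek" in their release string, as in
# 5.4.17-2136.300.7.el8uek.x86_64 above.
kernel_flavor() {
    case "$1" in
        *uek*) echo uek ;;
        *)     echo other ;;
    esac
}
```

Possible usage on a live host: `kernel_flavor "$(uname -r)"`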
