Driver version: 525.105.17
Kernel: 5.4.17-2136.300.7.el8uek.x86_64
From time to time the server loses the Tesla card: nvidia-smi reports that it sees no devices, but lspci still shows it:
lspci -vv | grep -i tesla
07:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
strace nvidia-smi prints a lot of output, but I think this is the relevant line:
openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
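The kernel log may say more than strace does here, since the NVIDIA kernel module prefixes its messages with NVRM and reports GPU faults as Xid events. A minimal sketch for pulling those lines out of a log follows; `nvrm_lines` is a hypothetical helper name, not part of any NVIDIA tool, and nothing in this post confirms that such messages are actually present on this server.

```shell
# Print only NVIDIA kernel-driver messages from a kernel log read on stdin.
# nvrm_lines is a hypothetical helper name used for illustration only.
nvrm_lines() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # being treated as a failure by callers running under "set -e".
    grep -E 'NVRM|Xid' || true
}

# Typical use (reading the kernel log may require root):
#   dmesg -T | nvrm_lines
```

An empty result only means that no line containing NVRM or Xid was found in the log that was piped in.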
I tried removing the device and rescanning the PCI bus:
echo "1" > /sys/bus/pci/devices/0000\:07\:00.0/remove
echo "1" > /sys/bus/pci/rescan
That did not help; only a server reboot brings the card back. How can I prevent the device from being lost, and how can I get it back without a reboot?
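For anyone who wants to reproduce the remove-and-rescan attempt, here is the same sequence as a small function. This is only a sketch of what was tried, not a known fix. `pci_remove_rescan`, `SYSFS_ROOT` and `RESCAN_DELAY` are names made up for illustration; `SYSFS_ROOT` exists only so the function can be pointed at a scratch directory for a dry run.

```shell
# Remove one PCI function from the kernel's device tree, then rescan the bus.
# pci_remove_rescan, SYSFS_ROOT and RESCAN_DELAY are hypothetical names.
pci_remove_rescan() {
    addr=$1                          # full PCI address, e.g. 0000:07:00.0
    root=${SYSFS_ROOT:-/sys}         # real sysfs unless overridden
    dev="$root/bus/pci/devices/$addr"

    # Refuse to continue if the device is not present in sysfs.
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi

    echo 1 > "$dev/remove"           # detach the device
    sleep "${RESCAN_DELAY:-1}"       # short pause before rescanning (assumed value)
    echo 1 > "$root/bus/pci/rescan"  # ask the kernel to re-enumerate the bus
}

# On the real system this must run as root:
#   pci_remove_rescan 0000:07:00.0
```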
P.S. The device nodes are still present:
ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 May 24 14:58 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 24 14:58 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 May 24 14:58 /dev/nvidia-modeset
crw-rw-rw- 1 root root 235, 0 May 24 14:58 /dev/nvidia-uvm
crw-rw-rw- 1 root root 235, 1 May 24 14:58 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 238, 1 May 24 14:58 nvidia-cap1
cr--r--r-- 1 root root 238, 2 May 24 14:58 nvidia-cap2