- H/W : Dell PowerEdge R740 (server) + NVIDIA Quadro RTX 5000 (GPU)
- OS : CentOS 7.5
- Driver Version : 430.34 (Linux 64-bit)
For some time now, whenever I run the ‘nvidia-smi’ command, it hangs for about 20 seconds and then the server reboots.
The service had been running fine for 3 months, but the problem started after a reboot for maintenance.
This is what I have checked so far.
NVIDIA driver related
1] The NVIDIA driver is loaded and can be checked with lsmod | grep nvidia, lshw -class display, and cat /proc/driver/nvidia/version
2] Reinstalled the same driver version (installed again after --uninstall)
3] Upgraded the driver version (430.34 -> 440.100)
--> Same issue after each action
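For reference, the checks in 1] amount to roughly the commands below. This is only a sketch: run_check is a hypothetical wrapper I am using here for labeling the output, not part of any NVIDIA tool, and lshw may need root to show full details.

```shell
#!/bin/sh
# Driver sanity checks from step 1], collected in one place.
# run_check is a hypothetical helper, not part of any NVIDIA tool.
run_check() {
    label=$1
    shift
    echo "== $label =="
    # Report a failing or missing command instead of aborting the script.
    "$@" 2>&1 || echo "(check failed or command not available: $1)"
}

run_check "loaded nvidia kernel modules" sh -c 'lsmod | grep nvidia'
run_check "display adapters" lshw -class display
run_check "driver version" cat /proc/driver/nvidia/version
```

None of these commands call nvidia-smi, so they should be safe to run without triggering the hang.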
OS log check
1] Collected logs with nvidia-bug-report.sh
--> No related log entries before or after entering the ‘nvidia-smi’ command, and no error logs related to the server boot
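The OS log search can be sketched as below. Assumptions: the syslog file is /var/log/messages (the CentOS 7 default), reading it usually needs root, and scan_log is a hypothetical helper name. It looks for lines from the NVIDIA kernel module (tagged NVRM) and its Xid error reports.

```shell
#!/bin/sh
# scan_log is a hypothetical helper: print lines mentioning the NVIDIA
# kernel module (NVRM) or its Xid error reports from one log file.
scan_log() {
    logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile"
        return 0
    fi
    grep -E 'NVRM|Xid' "$logfile" || echo "no NVRM/Xid lines in $logfile"
}

# Assumed location: CentOS 7 writes syslog to /var/log/messages by default.
scan_log /var/log/messages
```

If the server resets before the kernel can flush its log, nothing may show up here even when the GPU is at fault.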
H/W diagnostic LEDs are normal (no LED alarm)
Nothing specific found TT_TT…
I’m planning to replace the GPU with a spare card. Are there any additional points I should check before doing that?
I’ll wait for your answers!! Thank you.