NVIDIA driver unloads right after loading (nvidia-drm, nvidia-modeset, nvidia-uvm, nvidia-nvlink)

Spec:
H/W: Dell PowerEdge R740 (server) + NVIDIA Quadro RTX 5000 (GPU)
OS: CentOS 7.5 (UEFI installation)
Driver version: 430.34 (Linux 64-bit)
Config:

  1. /etc/default/grub
    Add "rd.driver.blacklist=nouveau nouveau.modeset=0" to GRUB_CMDLINE_LINUX
  2. grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
  3. mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img_backup
  4. dracut -f -v
  5. /etc/modprobe.d/blacklist.conf
    blacklist nouveau
  6. reboot
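
After the reboot in step 6, it can help to confirm that the parameters from step 1 actually reached the kernel. The following is a minimal sketch, not part of the original setup: the function name has_nouveau_blacklist is made up for illustration, and it only does an exact match on the two parameters listed above (a comma-separated rd.driver.blacklist list would not match).

```shell
# Hypothetical helper (name is an assumption, not an NVIDIA or CentOS tool).
# Returns 0 if the given kernel command line contains both parameters
# added in step 1, and 1 otherwise.
has_nouveau_blacklist() {
    cmdline="$1"
    # Pad with spaces so each parameter is matched as a whole word.
    case " $cmdline " in
        *" rd.driver.blacklist=nouveau "*) ;;
        *) return 1 ;;
    esac
    case " $cmdline " in
        *" nouveau.modeset=0 "*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Example use on the running system (left commented out on purpose):
#   has_nouveau_blacklist "$(cat /proc/cmdline)" && echo "blacklist active"
#   lsmod | grep nouveau    # should print nothing if nouveau is not loaded
```

If the function reports failure after the reboot, the grub.cfg written in step 2 is probably not the one the system boots from.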

Hello,

For some time now, running the ‘nvidia-smi’ command causes a hang of about 20 seconds, after which the server reboots.
The service had been running fine for 3 months, but the problem started after I rebooted the server for maintenance.

cat /var/log/messages


Jul 9 19:19:04 stt02 kernel: VFIO - User Level meta-driver version: 0.3
Jul 9 19:19:04 stt02 kernel: nvidia: loading out-of-tree module taints kernel.
Jul 9 19:19:04 stt02 kernel: nvidia: module license ‘NVIDIA’ taints kernel.
Jul 9 19:19:04 stt02 kernel: Disabling lock debugging due to kernel taint
Jul 9 19:19:04 stt02 kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jul 9 19:19:04 stt02 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Jul 9 19:19:04 stt02 kernel: nvidia 0000:3b:00.0: enabling device (0000 -> 0003)
Jul 9 19:19:04 stt02 kernel: vgaarb: device changed decodes: PCI:0000:3b:00.0,olddecodes=io+mem,decodes=none:owns=none
Jul 9 19:19:04 stt02 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.34 Wed Jun 26 12:19:48 CDT 2019
Jul 9 19:19:04 stt02 kernel: nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Jul 9 19:19:04 stt02 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 430.34 Wed Jun 26 12:15:10 CDT 2019
Jul 9 19:19:04 stt02 kernel: [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
Jul 9 19:19:04 stt02 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3b:00.0 on minor 1
Jul 9 19:19:04 stt02 kernel: [drm] [nvidia-drm] [GPU ID 0x00003b00] Unloading driver
Jul 9 19:19:04 stt02 kernel: nvidia-modeset: Unloading
Jul 9 19:19:04 stt02 kernel: nvidia-uvm: Unloaded the UVM driver in 8 mode
Jul 9 19:19:04 stt02 kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 236
Jul 9 19:19:15 stt02 kernel: ipmi device interface
Jul 9 19:19:15 stt02 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Jul 9 19:19:15 stt02 kernel: vgaarb: device changed decodes: PCI:0000:3b:00.0,olddecodes=none,decodes=none:owns=none
Jul 9 19:19:15 stt02 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.34 Wed Jun 26 12:19:48 CDT 2019
Jul 9 19:19:15 stt02 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 430.34 Wed Jun 26 12:15:10 CDT 2019
Jul 9 19:19:15 stt02 kernel: [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
Jul 9 19:19:15 stt02 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3b:00.0 on minor 1
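
The log above shows nvidia-drm loading, unloading, and loading again within a few seconds. To see how often this happens across a whole log file, a small sketch like the one below could be used. The function name count_drm_events is hypothetical, and the patterns are taken from the message text shown in the excerpt above.

```shell
# Hypothetical helper (name is an assumption): print how many times
# nvidia-drm logged "Loading driver" and "Unloading driver" in a log file.
count_drm_events() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    loads=$(grep -c 'nvidia-drm.*] Loading driver' "$log" || true)
    unloads=$(grep -c 'nvidia-drm.*] Unloading driver' "$log" || true)
    echo "loads=$loads unloads=$unloads"
}

# Example use (left commented out on purpose):
#   count_drm_events /var/log/messages
```

The counts only show how often the load/unload cycle occurs; they do not explain why it happens.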


nvidia-bug-report.sh does not produce a complete report either.
Command (run in the current folder): # nvidia-bug-report.sh --output-file nvidia-bug-report


Start of NVIDIA bug report log file. Please include this file, along
with a detailed description of your problem, when reporting a graphics
driver bug via the NVIDIA Linux forum (see devtalk.nvidia.com)
or by sending email to ‘linux-bugs@nvidia.com’.

nvidia-bug-report.sh Version: 26547152

Date: Thu Jul 9 19:20:51 KST 2020
uname: Linux stt02 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
command line flags: --output-file nvidia-bug-report


Nothing is printed after this point.

However, ‘nvidia-smi’ runs successfully when I swap the card from the Quadro RTX 5000 to a Quadro P4000.

What’s the problem?

dmesg.log (156.1 KB) messages.log (874.3 KB) nvidia-bug-report.log (535 Bytes) nvidia-installer.log (26.8 KB)