I’m running Proxmox 8.3.1 (kernel 6.8.12-4) and have added a second-hand NVIDIA Tesla T4 card. I installed the latest vGPU driver (NVIDIA-GRID-Linux-KVM-550.127.06-550.127.05-553.24, eval), but after installation nvidia-smi doesn’t find any devices. Any ideas? Thanks! (I have attached nvidia-bug-report.log.)
nvidia-bug-report.log (1011.1 KB)
root@pve-asus:~# nvidia-smi
No devices were found
root@pve-asus:~# dmesg
[ 125.486548] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 125.541662] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 125.542749] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[ 125.589182] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.127.06 Wed Oct 9 12:31:27 UTC 2024
[ 125.632243] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[ 133.233612] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 133.233618] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.127.06 Wed Oct 9 12:31:27 UTC 2024
[ 143.846796] NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
[ 144.291490] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1601)
[ 144.291947] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
root@pve-asus:~# lspci -knnd 10de:
01:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation TU104GL [Tesla T4] [10de:12a2]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
root@pve-asus:~# journalctl -b | grep -i "nvidia\|NVRM\|01:00.0"
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: BAR 0 [mem 0xf4000000-0xf4ffffff]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: BAR 1 [mem 0xfca0000000-0xfcafffffff 64bit pref]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: BAR 3 [mem 0xfcd0000000-0xfcd1ffffff 64bit pref]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: Enabling HDA controller
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 0 [mem 0xf5000000-0xf503ffff]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 0 [mem 0xf5000000-0xf53fffff]: contains BAR 0 for 16 VFs
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 1 [mem 0xfba0000000-0xfbafffffff 64bit pref]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 1 [mem 0xfba0000000-0xfc9fffffff 64bit pref]: contains BAR 1 for 16 VFs
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 3 [mem 0xfcb0000000-0xfcb1ffffff 64bit pref]
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: VF BAR 3 [mem 0xfcb0000000-0xfccfffffff 64bit pref]: contains BAR 3 for 16 VFs
Dec 05 02:44:49 pve-asus kernel: pci 0000:01:00.0: Adding to iommu group 13
Dec 05 02:44:49 pve-asus kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 05 02:44:49 pve-asus kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 508
Dec 05 02:44:49 pve-asus kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0002)
Dec 05 02:44:49 pve-asus kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.127.06 Wed Oct 9 12:31:27 UTC 2024
Dec 05 02:44:55 pve-asus audit[1227]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1227 comm="apparmor_parser"
Dec 05 02:44:55 pve-asus audit[1227]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1227 comm="apparmor_parser"
Dec 05 02:44:55 pve-asus kernel: audit: type=1400 audit(1733363095.551:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1227 comm="apparmor_parser"
Dec 05 02:44:55 pve-asus kernel: audit: type=1400 audit(1733363095.551:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1227 comm="apparmor_parser"
Dec 05 02:44:55 pve-asus systemd[1]: Starting nvidia-vgpud.service - NVIDIA vGPU Daemon...
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]: Global settings:
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]: Size: 16
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]: Homogeneous vGPUs: 1
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]: vGPU types: 492
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]:
Dec 05 02:44:55 pve-asus nvidia-vgpud[1247]: pciId of gpu [0]: 0:1:0:0
Dec 05 02:44:55 pve-asus kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1601)
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
Dec 05 02:44:56 pve-asus nvidia-vgpud[1247]: error: failed to attach device: 59
Dec 05 02:44:56 pve-asus nvidia-vgpud[1247]: error: failed to read pGPU information: 9
Dec 05 02:44:56 pve-asus nvidia-vgpud[1247]: error: failed to send vGPU configuration info to RM: 9
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1601)
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Dec 05 02:44:56 pve-asus systemd[1]: nvidia-vgpud.service: Deactivated successfully.
Dec 05 02:44:56 pve-asus systemd[1]: Finished nvidia-vgpud.service - NVIDIA vGPU Daemon.
Dec 05 02:44:56 pve-asus systemd[1]: Starting nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon...
Dec 05 02:44:56 pve-asus systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
Dec 05 02:44:58 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1552)
Dec 05 02:44:58 pve-asus kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Dec 05 02:44:58 pve-asus kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
Dec 05 02:44:59 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1552)
Dec 05 02:44:59 pve-asus kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Dec 05 02:44:59 pve-asus nvidia-vgpu-mgr[1441]: error: vmiop_env_log: Failed to attach device: 0x59 (gpuId 0x100)
Dec 05 02:44:59 pve-asus systemd[1]: nvidia-vgpu-mgr.service: Main process exited, code=exited, status=1/FAILURE
Dec 05 02:44:59 pve-asus systemd[1]: nvidia-vgpu-mgr.service: Failed with result 'exit-code'.
Dec 05 02:44:59 pve-asus systemd[1]: nvidia-vgpu-mgr.service: Consumed 3.662s CPU time.
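
The journal above repeats the same failure with two different code triples, so a tally may make it easier to compare runs. Below is a minimal Python sketch that counts the RmInitAdapter failures per device and code. The function name `summarize_rm_init_failures` and the regular expression are my own assumptions, based only on the line format shown above; the sketch does not try to interpret what the codes mean.

```python
import re
from collections import Counter

# Matches NVRM failure lines in the format seen in the log above, e.g.
# "NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1601)"
RM_INIT_FAILED = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F:.]+): RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def summarize_rm_init_failures(log_text):
    """Count RmInitAdapter failures per (PCI address, error code) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RM_INIT_FAILED.search(line)
        if match:
            counts[(match.group("bdf"), match.group("code"))] += 1
    return counts


# Small sample copied from the journal output above.
SAMPLE = """\
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1601)
Dec 05 02:44:56 pve-asus kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Dec 05 02:44:58 pve-asus kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1552)
"""

for (bdf, code), n in summarize_rm_init_failures(SAMPLE).items():
    print(f"{bdf}  {code}  x{n}")
```

To run it on a full log, save the journal output to a file (the name `journal.txt` here is hypothetical) and pass its contents to the function, for example `summarize_rm_init_failures(open("journal.txt").read())`.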