nvidia-driver-runtime-n2f6s:/ # nvidia-smi
No devices were found
nvidia-driver-runtime-n2f6s:/ # lsmod | grep nvidia
nvidia_vgpu_vfio 86016 0
nvidia 8699904 1 nvidia_vgpu_vfio
mdev 28672 1 nvidia_vgpu_vfio
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
drm 634880 7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
kvm 1056768 2 kvm_amd,nvidia_vgpu_vfio
irqbypass 16384 2 nvidia_vgpu_vfio,kvm
nvidia-driver-runtime-n2f6s:/ # lspci | grep NVIDIA
41:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
41:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
nvidia-driver-runtime-n2f6s:/ # dmesg | grep NVIDIA
[ 153.856013] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 550.90.05 Release Build (dvs-builder@U16-I1-N08-05-1) Mon May 27 14:37:46 UTC 2024
[ 155.996611] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 550.90.05 Release Build (dvs-builder@U16-I1-N08-05-1) Mon May 27 14:37:46 UTC 2024
nvidia-driver-runtime-n2f6s:/ # dmesg | grep nvidia
[ 153.784296] nvidia: loading out-of-tree module taints kernel.
[ 153.787846] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 153.808814] nvidia: externally supported module, setting X kernel taint flag.
[ 153.810802] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 153.812787] nvidia 0000:41:00.0: enabling device (0000 -> 0003)
[ 153.812983] nvidia 0000:41:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 153.862748] nvidia_vgpu_vfio: externally supported module, setting X kernel taint flag.
[ 153.918226] nvidia-nvlink: Unregistered Nvlink Core, major device number 511
[ 155.942593] nvidia: externally supported module, setting X kernel taint flag.
[ 155.945345] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 155.947949] nvidia 0000:41:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 156.001285] nvidia_vgpu_vfio: externally supported module, setting X kernel taint flag.
[ 156.146216] nvidia 0000:41:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 156.146990] nvidia 0000:41:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 156.151932] nvidia 0000:41:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 156.152440] nvidia 0000:41:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 241.904348] nvidia 0000:41:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
So I managed to get nvidia-smi working by running through the commands below. This only works with NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, NOT with NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, so I am not sure what changed in 550.
sudo /usr/bin/nvidia-uninstall
sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force
sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
sudo /tmp/NVIDIA.run
nvidia-smi
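To confirm that the options written to /etc/modprobe.d/nvidia.conf were actually picked up after the reinstall, a small helper along these lines may be useful. It is only a sketch: the function name nvidia_param is made up for illustration, and it assumes the loaded driver exposes its registry keys as "Name: value" lines in /proc/driver/nvidia/params.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one driver parameter.
# Assumes the params file contains lines of the form "Name: value".
# Usage: nvidia_param NAME [PARAMS_FILE]
nvidia_param() {
    key="$1"
    params_file="${2:-/proc/driver/nvidia/params}"
    # Print the value for the matching key; exit non-zero if the key is absent.
    awk -F': *' -v k="$key" \
        '$1 == k { print $2; found = 1 } END { exit !found }' \
        "$params_file"
}

# Example (run on the host after the driver is loaded):
#   nvidia_param EnableGpuFirmware
```

If the value printed for EnableGpuFirmware is not 0, the modprobe option was probably not applied (for example because the initramfs was not rebuilt).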
I have exactly the same issue. Your workaround does work for 535 by disabling the GPU firmware, but for 550 it doesn't work.
Any idea?
Correct, it doesn't work for 550, and I still have no idea why.
This GPU has been frustrating to work with. I am still having issues.
Actually, I have just found the reason why it couldn't load the firmware, at least on our side.
In our case, the NVIDIA software gets installed by a pod (Kubernetes).
When the nvidia module is loaded, it looks for the firmware in the host OS firmware folder (/lib/firmware/nvidia), but the firmware is actually inside the container.
Maybe it is a similar issue on your side?
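A quick way to test this theory is to check whether the file named in the dmesg error exists on the host and in the container. Error -2 is ENOENT, meaning the file was not found. The sketch below is only an illustration: the function name check_gsp_firmware is made up, and the default version and file name are simply taken from the dmesg output earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report whether the GSP firmware file is visible
# under a given firmware root.
# Usage: check_gsp_firmware [FIRMWARE_ROOT] [DRIVER_VERSION] [FIRMWARE_FILE]
check_gsp_firmware() {
    fw_root="${1:-/lib/firmware}"
    version="${2:-550.90.05}"       # taken from the dmesg lines above
    fw_file="${3:-gsp_ga10x.bin}"   # taken from the dmesg lines above
    path="$fw_root/nvidia/$version/$fw_file"
    if [ -f "$path" ]; then
        echo "found: $path"
        return 0
    fi
    echo "missing: $path"
    return 1
}

# Example: run once on the host and once inside the driver pod.
#   check_gsp_firmware
```

If it reports "found" inside the pod but "missing" on the host, that matches the situation described above.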
@dallas_koat_ai I noticed your nvidia-driver-runtime pod, so I assume you are using Harvester, as I am.
Link to the issue:
Please read the latest feedback there; it describes a workaround.