Getting vGPU working on RTX 5000 ADA

I know my infrastructure may not be what others are using, but I'd appreciate any help getting my GPU working for vGPU.

It is my understanding that the RTX 5000 ADA should have SR-IOV support to unlock the vGPU capabilities of this card.
Notes on my setup: I use Harvester/Rancher (SUSE) for deploying Kubernetes clusters, and it is capable of vGPU passthrough to pools and pods. The Harvester environment has a setup that installs the NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run driver.
Log here > https://pastebin.com/raw/y6wjRJYw

lsmod | grep nvidia
nvidia_vgpu_vfio       86016  0
nvidia               8699904  1 nvidia_vgpu_vfio
mdev                   28672  1 nvidia_vgpu_vfio
vfio                   45056  4 nvidia_vgpu_vfio,vfio_iommu_type1,vfio_pci,mdev
drm                   634880  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
kvm                  1056768  2 kvm_amd,nvidia_vgpu_vfio
irqbypass              16384  3 nvidia_vgpu_vfio,vfio_pci,kvm
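
For reference, this is how I've been checking that the card is actually bound to the driver (the PCI address is from my box):

# which kernel driver is bound to the GPU (should report nvidia)
lspci -k -s 0b:00.0
# version of the loaded module, to make sure it matches the installer
cat /proc/driver/nvidia/version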

After the install, nvidia-smi still reports "No devices were found", but
lspci
lists the device:

0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
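
(Side note: the card only shows up as "Device 26b2" because my local pci.ids database is out of date; refreshing it should make lspci print the marketing name. If I'm reading the ID right, 26b2 is the RTX 5000 Ada Generation.)

# refresh the PCI ID database, then re-query with vendor/device IDs shown
sudo update-pciids
lspci -nn -s 0b:00.0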

ls -lart /sys/bus/pci/devices/0000:0b:00.0/
doesn’t show the sriov_vf_device file?
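
To rule out the basics, I also checked whether the SR-IOV capability is visible at the PCI level at all:

# look for the SR-IOV capability block in config space
sudo lspci -vvv -s 0b:00.0 | grep -i -A6 "SR-IOV"
# these attributes are only created when the kernel sees the SR-IOV capability
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_totalvfs
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_numvfs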

dmesg
shows some failures in the driver:

[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
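
That last line makes me think the GSP firmware blob isn't where the kernel expects it, so I've been checking for it like this (paths based on the 550.90.05 vgpu-kvm build I installed):

# the driver asks for nvidia/550.90.05/gsp_ga10x.bin under the firmware search path
ls -l /lib/firmware/nvidia/550.90.05/
# any custom firmware search path configured for the kernel (empty by default)
cat /sys/module/firmware_class/parameters/path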

I am seeing some notes that I may need to change the mode of the GPU into DC mode? The display is on by default on these cards.

Any help on this would be appreciated, not only by me but also by other users of Harvester/Rancher and vGPU.

So I managed to get nvidia-smi working by running through the commands below. This only works with NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, NOT NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, so I am not sure what changed in 550.

# uninstall the previous driver and clean out any leftover module files
sudo /usr/bin/nvidia-uninstall

sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force

# set module options before reinstalling; NVreg_EnableGpuFirmware=0 keeps the driver off the GSP firmware path
sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'

# rerun the vGPU KVM .run installer (copied to /tmp)
sudo /tmp/NVIDIA.run

nvidia-smi
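
To sanity-check that the modprobe options actually took effect after the reinstall, I read them back from the running module (assuming the vGPU host driver exposes them in /proc the same way the desktop driver does):

# registry parameters the loaded nvidia module picked up
grep -i -e EnableGpuFirmware -e OpenRmEnableUnsupportedGpus /proc/driver/nvidia/params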

Next I need to try and get vGPUs recognized.

The right approach would be to use the display mode selector tool and change the mode for the GPU to DC mode. Make sure you have a second GPU in the system to serve as primary adapter.
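
If you have downloaded the Display Mode Selector Tool from the licensing portal, the invocation is along these lines (the exact binary name may vary with the package version):

# switch the GPU to displayless / compute ("DC") mode, then reboot
sudo ./displaymodeselector --gpumode physical_display_disabled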

The card does have the proper mode selected based on nvidia-smi -q.
But now
sudo /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
doesn't result in an error, but it doesn't create any virtfn entries either:
ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn

Is there any logging??
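
Here is where I have been looking for clues so far, on the assumption that sriov-manage is just a shell script that pokes the sysfs SR-IOV attributes:

# trace what the script actually does
sudo bash -x /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
# kernel-side messages from the PCI core / NVIDIA driver
dmesg | grep -i -e sriov -e nvrm -e "0000:41:00"
journalctl -b | grep -i sriov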

https://pastebin.com/raw/wuVMyCCR
Display Mode                          : Disabled
Display Active                        : Disabled
vGPU Device Capability
    Fractional Multi-vGPU             : Supported
    Heterogeneous Time-Slice Profiles : Supported
GPU Virtualization Mode
    Virtualization Mode               : Host VGPU
    Host VGPU Mode                    : SR-IOV
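
(For reference, that excerpt came out of the full nvidia-smi -q dump; something like this pulls the same sections without the pastebin:)

nvidia-smi -q | grep -E -A 3 "Display Mode|Display Active|vGPU Device Capability|GPU Virtualization Mode"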

So I did as you suggested and used the display mode selector tool, and the GPU looks like it applied the change without issue.

Press 'y' to confirm or 'n' to choose adapters or any other key to abort:
y

Updating GPU Mode of all eligible adapters to "physical_display_disabled"

Apply GPU Mode <4> corresponds to "physical_display_disabled"

Reading EEPROM (this operation may take up to 30 seconds)

Reading EEPROM (this operation may take up to 30 seconds)

[==================================================] 100 %
Reading EEPROM (this operation may take up to 30 seconds)

Reading EEPROM (this operation may take up to 30 seconds)

Successfully updated GPU mode to "physical_display_disabled" ( Mode 4 ).

A reboot is required for the update to take effect.

Reading EEPROM (this operation may take up to 30 seconds)

Reading EEPROM (this operation may take up to 30 seconds)

But now, when this GPU is installed in my mobo, it doesn't even get to the BIOS; it hangs the boot process completely.

I also mentioned that you need to make sure you have a second GPU in place to boot the system, as the 5000 GPU is now in DC mode and can no longer act as the boot display. This is expected behavior.

This is a headless mobo with IPMI and a VGA out on it. I also added a video card to check what you suggested, and it still results in the same behaviour: the motherboard won't POST.
The mobo readout is "de ad df 19 04 65".
https://www.asrockrack.com/general/productdetail.asp?Model=X399D8A-2T#Manual
When I remove the card it will post and boot just fine.

That's one of the reasons why we only support certified hardware. You will need to find another system where you can boot and revert the changes with the display mode selector tool.

Got another step closer: in the BIOS I needed to enable "Above 4G Decoding", and now my mobo POSTs and I can run
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn

So I can confirm that the devices get created when using Ubuntu as the KVM host with this driver.
I am going to switch my OS back to Harvester now and try the build there again with the vgpu-kvm.run drivers.
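
For anyone following along, my understanding of the next step once the VFs exist is to pick a vGPU type on one of the VFs and create an mdev device for it. The VF path and the nvidia-XXX type name below are placeholders, I have not verified them on this card yet:

# list the vGPU types offered on the first VF (virtfn0 is a symlink under the PF)
ls /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types/
# show the human-readable name of one type (nvidia-XXX is a placeholder)
cat /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types/nvidia-XXX/name
# create a vGPU instance of that type with a fresh UUID
uuidgen | sudo tee /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types/nvidia-XXX/create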