I know my infrastructure may not be what others are using, but I'm hoping someone can help me get my GPU working for vGPU.
It is my understanding that the RTX 5000 Ada should have SR-IOV support to unlock the vGPU capabilities of this card.
Notes on my setup: I use Harvester/Rancher (SUSE) for deploying Kubernetes clusters, and it is capable of vGPU passthrough to pools and pods. The Harvester environment has a setup that installs the NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run driver.
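As a sanity check, here is one way to confirm the card advertises the SR-IOV extended capability in PCI config space (a sketch; the PCI address comes from my lspci output below, adjust as needed):
# an SR-IOV-capable function lists a "Single Root I/O Virtualization (SR-IOV)"
# block among its PCI extended capabilities
sudo lspci -vvv -s 0b:00.0 | grep -A 6 -i 'SR-IOV'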
Log here > https://pastebin.com/raw/y6wjRJYw
lsmod | grep nvidia
nvidia_vgpu_vfio 86016 0
nvidia 8699904 1 nvidia_vgpu_vfio
mdev 28672 1 nvidia_vgpu_vfio
vfio 45056 4 nvidia_vgpu_vfio,vfio_iommu_type1,vfio_pci,mdev
drm 634880 7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
kvm 1056768 2 kvm_amd,nvidia_vgpu_vfio
irqbypass 16384 3 nvidia_vgpu_vfio,vfio_pci,kvm
After the install, nvidia-smi still reports "No devices were found".
lspci
lists the device
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
ls -lart /sys/bus/pci/devices/0000:0b:00.0/
doesn’t show the sriov_vf_device file?
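For reference, these are the sriov_* attributes I would expect to see on the physical function once SR-IOV is actually exposed (a sketch; these are the standard kernel SR-IOV sysfs entries, not anything NVIDIA-specific):
# the PF should carry sriov_totalvfs / sriov_numvfs when SR-IOV is exposed
ls /sys/bus/pci/devices/0000:0b:00.0/ | grep sriov
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_totalvfs
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_numvfs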
dmesg
shows a failure in the driver:
[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
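Since the dmesg error points at a missing GSP firmware blob, one thing worth checking is whether the installer actually placed the firmware where the kernel firmware loader looks for it (a sketch, based on the path named in the error message):
# the driver requests nvidia/550.90.05/gsp_*.bin through the firmware loader,
# which resolves under /lib/firmware by default
ls -l /lib/firmware/nvidia/550.90.05/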
I am seeing some notes that I may need to switch the GPU into DC (displayless) mode, since the display is enabled by default on these cards.
Any help on this would be appreciated, not only by me but by other users of Harvester/Rancher and vGPU.
So I managed to get nvidia-smi working by running through the commands below. This only works with NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, NOT NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, so I am not sure what changed in 550.
# remove the previous driver install and any leftover module files
sudo /usr/bin/nvidia-uninstall
sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force
# write module options: disable the GSP firmware path and allow
# otherwise-unsupported GPUs
sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
# reinstall the vGPU KVM driver .run package and verify
sudo /tmp/NVIDIA.run
nvidia-smi
Next I need to try and get vGPUs recognized.
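For what it's worth, this is how I would double-check that the modprobe options actually took effect after a reboot (a sketch, assuming the parameters are exported in the usual places for the loaded nvidia module):
# parameters as seen by the loaded module
cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
grep -i firmware /proc/driver/nvidia/params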
The right approach would be to use the display mode selector tool and change the mode for the GPU to DC mode. Make sure you have a second GPU in the system to serve as primary adapter.
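Roughly, the invocation looks like this (a sketch; check the tool's documentation for the exact flag, the mode string matches what the tool reports further down this thread):
# NVIDIA Display Mode Selector Tool: put the GPU into displayless / DC mode
sudo ./displaymodeselector --gpumode physical_display_disabled
# a reboot is required afterwards for the EEPROM change to take effect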
The card does have the proper mode selected based on nvidia-smi -q.
But now running
sudo /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
doesn't produce an error, but it also doesn't create any virtfn entries:
ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
Is there any logging??
https://pastebin.com/raw/wuVMyCCR
Display Mode : Disabled
Display Active : Disabled
vGPU Device Capability
    Fractional Multi-vGPU : Supported
    Heterogeneous Time-Slice Profiles : Supported
GPU Virtualization Mode
    Virtualization Mode : Host VGPU
    Host VGPU Mode : SR-IOV
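On the logging question, these are the places I would check to see whether the VFs were actually enabled and what the host side logged (a sketch; the systemd unit name may differ depending on how the vGPU manager is packaged):
# how many VFs are currently enabled on the physical function
cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs
# kernel messages from the enable attempt
dmesg | grep -iE 'sriov|vgpu|nvidia'
# the vGPU manager daemon, if it runs as a systemd service
journalctl -u nvidia-vgpu-mgr --no-pager | tail -n 50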
So I did as you suggested and used the display mode selector tool, and the GPU looks like it applied the change without issue.
Press 'y' to confirm or 'n' to choose adapters or any other key to abort:
y
Updating GPU Mode of all eligible adapters to "physical_display_disabled"
Apply GPU Mode <4> corresponds to "physical_display_disabled"
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
[==================================================] 100 %
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
Successfully updated GPU mode to "physical_display_disabled" ( Mode 4 ).
A reboot is required for the update to take effect.
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
But now when this GPU is installed in my motherboard, the system doesn't even get to the BIOS; it hangs the boot process completely.
I also mentioned you need to make sure to have a second GPU in place to boot the system, as the 5000 GPU is now in DC mode and cannot boot the system anymore. This is expected behavior.
This is a headless motherboard with IPMI and a VGA out on it. I also added a video card to test what you suggested, and it still results in the same behaviour: the motherboard won't POST.
The motherboard's debug readout is "de ad df 19 04 65".
https://www.asrockrack.com/general/productdetail.asp?Model=X399D8A-2T#Manual
When I remove the card, it will POST and boot just fine.
That's one of the reasons why we support only certified hardware. You will need to find another system where you can boot and revert the changes with the display mode selector tool.
Got another step closer. In the BIOS I needed to enable "Above 4G Decoding", and now my motherboard POSTs and I can run
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
So I can confirm that the devices get created when using the KVM host driver on Ubuntu.
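For the record, the VFs should also show up in lspci once sriov-manage succeeds (a sketch; the vendor ID and the 3D-controller class match the lspci output later in the thread):
# the VFs enumerate as additional NVIDIA 3D controller functions
lspci -d 10de: -nn | grep -i '3d controller'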
I am going to switch my OS back to Harvester now and try the build there again with the vgpu-kvm.run drivers.
We are still having issues with the drivers on this card.
I have tried NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, but nvidia-smi still reports "No devices were found".
The only driver I got working was NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, and only with the following /etc/modprobe.d/nvidia.conf file in place:
options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
So now that I have the 535 driver installed, I can successfully get the vGPU devices created.
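For context, this is roughly the flow for creating a vGPU instance on the host via the mdev sysfs interface (a sketch, assuming the standard mdev-based vGPU host flow; the VF address 0000:0b:00.4 and the nvidia-XXX type name are placeholders):
# list the vGPU types a VF exposes (VF address is illustrative)
ls /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/
# check a type's name and how many instances it still allows
cat /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/name
cat /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/available_instances
# create an instance by writing a fresh UUID into the type's create node
uuidgen | sudo tee /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/create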
But now those devices are not working when I try to pass them in:
[ 233.525762] nvidia 0000:0b:04.3: enabling device (0000 -> 0002)
[ 233.526028] nvidia 0000:0b:04.3: Driver cannot be asked to release device
[ 233.526119] nvidia 0000:0b:04.3: MDEV: Registered
[ 368.399315] nvidia-vgpu-vfio 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: Adding to iommu group 95
[ 368.399344] nvidia-vgpu-vfio 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: MDEV: group_id = 95
[ 5118.772799] [nvidia-vgpu-vfio] 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: start failed. status: 0x1
nvidia-driver-runtime-z6lk9:/ # lsof /dev/nvidia*
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nvidia-vg 30746 root 1u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-vg 30746 root 2u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-vg 30746 root 3u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-driver-runtime-z6lk9:/ # lspci -nnk -s 0000:0b:00.5
0b:00.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b2] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
Kernel driver in use: nvidia
Kernel modules: nvidia_vgpu_vfio, nvidia