I know my infrastructure may not be what others are using, but I'm hoping someone can help me get my GPU working for vGPU.
It is my understanding that the RTX 5000 Ada should have SR-IOV support to unlock the vGPU capabilities of this card.
Notes on my setup: I use Harvester/Rancher (SUSE) for deploying Kubernetes clusters, and it is capable of vGPU passthrough to pools and pods. The Harvester environment has a setup that installs the NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run driver.
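As a sanity check, here is one way to confirm the card advertises the SR-IOV extended capability in PCI config space (a sketch; the PCI address comes from my lspci output below, adjust as needed):
# an SR-IOV-capable function lists a "Single Root I/O Virtualization (SR-IOV)"
# block among its PCI extended capabilities
sudo lspci -vvv -s 0b:00.0 | grep -A 6 -i 'SR-IOV'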
Log here > https://pastebin.com/raw/y6wjRJYw
lsmod | grep nvidia
nvidia_vgpu_vfio 86016 0
nvidia 8699904 1 nvidia_vgpu_vfio
mdev 28672 1 nvidia_vgpu_vfio
vfio 45056 4 nvidia_vgpu_vfio,vfio_iommu_type1,vfio_pci,mdev
drm 634880 7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
kvm 1056768 2 kvm_amd,nvidia_vgpu_vfio
irqbypass 16384 3 nvidia_vgpu_vfio,vfio_pci,kvm
After the install, nvidia-smi still reports "No devices were found".
lspci
lists the device
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
ls -lart /sys/bus/pci/devices/0000:0b:00.0/
doesn’t show the sriov_vf_device file?
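For reference, these are the sriov_* attributes I would expect to see on the physical function once SR-IOV is actually exposed (a sketch; these are the standard kernel SR-IOV sysfs entries, not anything NVIDIA-specific):
# the PF should carry sriov_totalvfs / sriov_numvfs when SR-IOV is exposed
ls /sys/bus/pci/devices/0000:0b:00.0/ | grep sriov
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_totalvfs
cat /sys/bus/pci/devices/0000:0b:00.0/sriov_numvfs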
dmesg
shows a failure in the driver:
[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
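Since the dmesg error points at a missing GSP firmware blob, one thing worth checking is whether the installer actually placed the firmware where the kernel firmware loader looks for it (a sketch, based on the path named in the error message):
# the driver requests nvidia/550.90.05/gsp_*.bin through the firmware loader,
# which resolves under /lib/firmware by default
ls -l /lib/firmware/nvidia/550.90.05/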
I am seeing some notes that I may need to switch the GPU into DC (displayless) mode, since the display is enabled by default on these cards.
Any help on this would be appreciated, not only by me but by other users of Harvester/Rancher and vGPU.
So I managed to get nvidia-smi working by running through the commands below. This only works with NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, NOT NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, so I am not sure what changed in 550.
# remove the previous driver install and any leftover module files
sudo /usr/bin/nvidia-uninstall
sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force
# write module options: disable the GSP firmware path and allow
# otherwise-unsupported GPUs
sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
# reinstall the vGPU KVM driver .run package and verify
sudo /tmp/NVIDIA.run
nvidia-smi
Next I need to try and get vGPUs recognized.
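For what it's worth, this is how I would double-check that the modprobe options actually took effect after a reboot (a sketch, assuming the parameters are exported in the usual places for the loaded nvidia module):
# parameters as seen by the loaded module
cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
grep -i firmware /proc/driver/nvidia/params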
The right approach would be to use the display mode selector tool and change the mode for the GPU to DC mode. Make sure you have a second GPU in the system to serve as primary adapter.
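Roughly, the invocation looks like this (a sketch; check the tool's documentation for the exact flag, the mode string matches what the tool reports further down this thread):
# NVIDIA Display Mode Selector Tool: put the GPU into displayless / DC mode
sudo ./displaymodeselector --gpumode physical_display_disabled
# a reboot is required afterwards for the EEPROM change to take effect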
The card does have the proper mode selected based on nvidia-smi -q.
But now running
sudo /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
doesn't produce an error, but it also doesn't create any virtfn entries:
ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
Is there any logging??
https://pastebin.com/raw/wuVMyCCR
Display Mode : Disabled
Display Active : Disabled
vGPU Device Capability
    Fractional Multi-vGPU : Supported
    Heterogeneous Time-Slice Profiles : Supported
GPU Virtualization Mode
    Virtualization Mode : Host VGPU
    Host VGPU Mode : SR-IOV
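On the logging question, these are the places I would check to see whether the VFs were actually enabled and what the host side logged (a sketch; the systemd unit name may differ depending on how the vGPU manager is packaged):
# how many VFs are currently enabled on the physical function
cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs
# kernel messages from the enable attempt
dmesg | grep -iE 'sriov|vgpu|nvidia'
# the vGPU manager daemon, if it runs as a systemd service
journalctl -u nvidia-vgpu-mgr --no-pager | tail -n 50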
So I did as you suggested and used the display mode selector tool, and the GPU looks like it applied the change without issue.
Press 'y' to confirm or 'n' to choose adapters or any other key to abort:
y
Updating GPU Mode of all eligible adapters to "physical_display_disabled"
Apply GPU Mode <4> corresponds to "physical_display_disabled"
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
[==================================================] 100 %
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
Successfully updated GPU mode to "physical_display_disabled" ( Mode 4 ).
A reboot is required for the update to take effect.
Reading EEPROM (this operation may take up to 30 seconds)
Reading EEPROM (this operation may take up to 30 seconds)
But now when this GPU is installed in my motherboard, the system doesn't even get to the BIOS; it hangs the boot process completely.
I also mentioned you need to make sure to have a second GPU in place to boot the system, as the 5000 GPU is now in DC mode and cannot boot the system anymore. This is expected behavior.
This is a headless motherboard with IPMI and a VGA out on it. I also added a video card to test what you suggested, and it still results in the same behaviour: the motherboard won't POST.
The motherboard's debug readout is "de ad df 19 04 65".
https://www.asrockrack.com/general/productdetail.asp?Model=X399D8A-2T#Manual
When I remove the card, it will POST and boot just fine.
That's one of the reasons why we support only certified hardware. You will need to find another system where you can boot and revert the changes with the display mode selector tool.
Got another step closer. In the BIOS I needed to enable "Above 4G Decoding", and now my motherboard POSTs and I can run
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
So I can confirm that the devices get created when using the KVM host driver on Ubuntu.
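For the record, the VFs should also show up in lspci once sriov-manage succeeds (a sketch; the vendor ID and the 3D-controller class match the lspci output later in the thread):
# the VFs enumerate as additional NVIDIA 3D controller functions
lspci -d 10de: -nn | grep -i '3d controller'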
I am going to switch my OS back to Harvester now and try the build there again with the vgpu-kvm.run drivers.
We are still having issues with the drivers on this card.
I have tried NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run, but nvidia-smi still reports "No devices were found".
The only driver I got working was NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run, and only with the following /etc/modprobe.d/nvidia.conf file in place:
options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
So now that I have the 535 driver installed, I can successfully get the vGPU devices created.
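For context, this is roughly the flow for creating a vGPU instance on the host via the mdev sysfs interface (a sketch, assuming the standard mdev-based vGPU host flow; the VF address 0000:0b:00.4 and the nvidia-XXX type name are placeholders):
# list the vGPU types a VF exposes (VF address is illustrative)
ls /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/
# check a type's name and how many instances it still allows
cat /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/name
cat /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/available_instances
# create an instance by writing a fresh UUID into the type's create node
uuidgen | sudo tee /sys/bus/pci/devices/0000:0b:00.4/mdev_supported_types/nvidia-XXX/create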
But now those devices are not working when I try to pass them in:
[ 233.525762] nvidia 0000:0b:04.3: enabling device (0000 -> 0002)
[ 233.526028] nvidia 0000:0b:04.3: Driver cannot be asked to release device
[ 233.526119] nvidia 0000:0b:04.3: MDEV: Registered
[ 368.399315] nvidia-vgpu-vfio 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: Adding to iommu group 95
[ 368.399344] nvidia-vgpu-vfio 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: MDEV: group_id = 95
[ 5118.772799] [nvidia-vgpu-vfio] 13a9d53c-56d3-4c7b-947d-1fc71c39b4dc: start failed. status: 0x1
nvidia-driver-runtime-z6lk9:/ # lsof /dev/nvidia*
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nvidia-vg 30746 root 1u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-vg 30746 root 2u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-vg 30746 root 3u CHR 195,255 0t0 249 /dev/nvidiactl
nvidia-driver-runtime-z6lk9:/ # lspci -nnk -s 0000:0b:00.5
0b:00.5 3D controller [0302]: NVIDIA Corporation Device [10de:26b2] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:0000]
Kernel driver in use: nvidia
Kernel modules: nvidia_vgpu_vfio, nvidia