A6000 ADA device not found on kvm guest system

For our AI team, we are preparing a KVM virtual machine running Rocky Linux 9.2 on a host that also runs Rocky Linux, on a SuperMicro server with two A6000 ADA (RTX 6000 Ada Generation) GPUs.
I was able to map the two GPUs to the guest machine, and I installed nvidia-driver-545.23.06-1.el9.x86_64 and its dependency packages on the guest.
At boot the guest seems to load the nvidia driver correctly, as reported in dmesg:
[ 4.428374] nvidia: loading out-of-tree module taints kernel.
[ 4.428759] nvidia: module license ‘NVIDIA’ taints kernel.
[ 4.429065] Disabling lock debugging due to kernel taint
[ 4.447134] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4.514942] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

[ 4.518018] nvidia 0000:07:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.572192] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.617205] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.23.06 Sun Oct 15 17:43:11 UTC 2023
[ 4.706172] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 4.786754] nvidia-uvm: Loaded the UVM driver, major device number 234.
[ 4.830294] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.23.06 Sun Oct 15 17:22:43 UTC 2023
[ 4.839535] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[ 4.839977] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:07:00.0 on minor 1
[ 4.841266] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[ 4.841678] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 2
If I run the command
lspci -vv | grep -i Nvidia
07:00.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 16a1
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
08:00.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 16a1
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
I can see both GPUs there, but
nvidia-smi --list-gpus
or
nvidia-smi pci -i 07:00.0

No devices were found

I mean that even though the guest is headless with no graphics running, I should still get the info for the two boards configured in the guest machine.
My final goal is to run TensorFlow on it via the CUDA toolkit,
but if nvidia-smi does not detect the GPUs, installing the CUDA toolkit seems pointless to me.

Thanks in advance
Claudio

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Dear generix,
as you requested, the nvidia-bug-report.log.gz is attached:
nvidia-bug-report.log.gz (77.0 KB)
Moreover, continuing my investigation, I found this strange content in the Video BIOS field of the driver information:
cat /proc/driver/nvidia/gpus/0000:00:07.0/information
Model: NVIDIA RTX 6000 Ada Generation
IRQ: 11
GPU UUID: GPU-1f8ea747-1513-4817-4430-2bf230b7b2f0
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:00:07.0
Device Minor: 0
GPU Excluded: No
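
To check every GPU in the guest for the same placeholder, something like the following could be used. It is only a sketch: the function name `list_unreadable_vbios` and its optional directory argument are made up for illustration, and it assumes the `/proc/driver/nvidia/gpus/<address>/information` layout shown above.

```shell
#!/usr/bin/env bash
# List the GPUs whose "Video BIOS" field is the "??.??.??.??.??" placeholder.
# The function name and the optional directory argument are illustrative;
# the default path is the one used in the post above.
list_unreadable_vbios() {
    local base="${1:-/proc/driver/nvidia/gpus}"
    local info
    for info in "$base"/*/information; do
        # Skip the unexpanded glob when no GPU directory exists.
        [ -r "$info" ] || continue
        # Match a "Video BIOS:" line whose value starts with "??."
        if grep -q '^Video BIOS:.*[?][?][.]' "$info"; then
            basename "$(dirname "$info")"
        fi
    done
}
```

Calling `list_unreadable_vbios` with no argument prints one PCI address per affected GPU and prints nothing when every card reports a version string.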

Thanks in advance for your support
Claudio

That’s rather odd: the GPUs are there and the driver loads. The only issue I can see is that you didn’t pass through the accompanying audio devices. Please change that and check if this resolves your issue.

Dear generix,
following your suggestion, I added the sound devices to the guest machine:
[root@rocky92test ~]# lspci | grep NVIDIA
00:07.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1)
00:08.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1)
00:0b.0 Audio device: NVIDIA Corporation AD102 High Definition Audio Controller (rev a1)
00:0c.0 Audio device: NVIDIA Corporation AD102 High Definition Audio Controller (rev a1)

The dmesg output shows that they were recognized too:

[root@rocky92test ~]# dmesg | grep -i nvidia
[ 2.909121] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:0b.0/sound/card0/input6
[ 2.909264] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:0b.0/sound/card0/input7
[ 2.909289] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:0b.0/sound/card0/input8
[ 2.909313] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:0b.0/sound/card0/input9
[ 2.973503] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:0c.0/sound/card1/input10
[ 2.973620] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:0c.0/sound/card1/input11
[ 2.973715] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:0c.0/sound/card1/input12
[ 2.973817] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:0c.0/sound/card1/input13
[ 4.549633] nvidia: loading out-of-tree module taints kernel.
[ 4.549665] nvidia: module license ‘NVIDIA’ taints kernel.
[ 4.566586] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4.629218] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 4.641680] nvidia 0000:00:07.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.704232] nvidia 0000:00:08.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.754876] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.23.06 Sun Oct 15 17:43:11 UTC 2023
[ 4.831141] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 4.909491] nvidia-uvm: Loaded the UVM driver, major device number 234.
[ 4.946512] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.23.06 Sun Oct 15 17:22:43 UTC 2023
[ 4.951867] [drm] [nvidia-drm] [GPU ID 0x00000007] Loading driver
[ 4.951884] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:07.0 on minor 1
[ 4.952760] [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
[ 4.952776] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:08.0 on minor 2

but again, when I run nvidia-smi:

[root@rocky92test ~]# nvidia-smi
No devices were found

and in /var/log/messages I get:

Nov 6 11:52:47 rocky92test kernel: ACPI Warning: _SB.PCI0.S38._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20211217/nsarguments-61)
Nov 6 11:52:52 rocky92test kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 6 11:52:52 rocky92test kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
Nov 6 11:52:56 rocky92test kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 6 11:52:56 rocky92test kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
Nov 6 11:52:56 rocky92test kernel: ACPI Warning: _SB.PCI0.S40._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20211217/nsarguments-61)
Nov 6 11:53:01 rocky92test kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 6 11:53:01 rocky92test kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 1
Nov 6 11:53:06 rocky92test kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 6 11:53:06 rocky92test kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 1
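
To see at a glance which GPUs are failing and how often, the log can be summarised per PCI address. This is a minimal sketch: the function name `count_rm_init_failures` is hypothetical, and it assumes the message format shown in the lines above.

```shell
#!/usr/bin/env bash
# Count "RmInitAdapter failed!" kernel messages per GPU PCI address.
# Reads log lines on stdin and prints "<pci-address> <count>", sorted.
# The function name is illustrative only.
count_rm_init_failures() {
    awk '
        /NVRM: GPU [0-9a-fA-F:.]+: RmInitAdapter failed!/ {
            for (i = 1; i < NF; i++) {
                if ($i == "GPU") {
                    addr = $(i + 1)
                    sub(/:$/, "", addr)   # drop the trailing colon
                    count[addr]++
                    break
                }
            }
        }
        END { for (a in count) print a, count[a] }
    ' | sort
}
```

It can be fed from either source, for example `count_rm_init_failures < /var/log/messages` or `dmesg | count_rm_init_failures`.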

Last but not least, looking through the forum I found another user who had a different problem in the same usage scenario as mine

where in the last message he states:

Just to update on this and close the topic. I’ve talked to Nvidia and was informed that vGPU approach is required whether it’s a pass-through mode or splitting GPU to multiple users. vGPU requires a valid license and installation of video driver on the host and on the guest. To my best knowledge there is no way to directly pass through a professional GPU without using vGPU.

Before I’ve figured out the solution with vGPU I’ve tried all kinds of Proxmox tricks like setting kernel boot parameters, changing GPU PCIe physical slot, etc. GPU was visible in the guest OS but the driver would not work with it. On the same guest VM I can pass RTX 3090 without issues.
RTX 6000 Ada works fine with the same linux driver on bare metal.

Can you confirm that this is the case?

Thanks in advance
Claudio

The audio devices are incorrectly passed through, so the driver is now bailing out. They have to be passed through as sub-devices of the GPUs, i.e. as 07.1/08.1, instead of as separate devices (0b.0/0c.0).
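
A quick way to verify the layout from inside the guest is to check that every NVIDIA VGA function has an NVIDIA audio function at `.1` of the same slot. The snippet below is only a sketch that parses `lspci`-style text on stdin; the function name `check_gpu_audio_siblings` is made up, and it looks only at the guest-visible addresses, not at the host-side configuration.

```shell
#!/usr/bin/env bash
# For each NVIDIA VGA function in lspci-style input (stdin), report whether
# an NVIDIA audio function exists at function .1 of the same bus:slot.
# The function name is illustrative only.
check_gpu_audio_siblings() {
    awk '
        /NVIDIA/ && /VGA compatible controller/ {
            split($1, p, ".")        # "00:07.0" -> slot "00:07"
            gpu[p[1]] = 1
        }
        /NVIDIA/ && /Audio device/ && $1 ~ /\.1$/ {
            split($1, p, ".")
            audio[p[1]] = 1
        }
        END {
            for (s in gpu) {
                if (s in audio)
                    print s ".0 audio function present at " s ".1"
                else
                    print s ".0 no audio function at " s ".1"
            }
        }
    ' | sort
}
```

It can be run as `lspci | check_gpu_audio_siblings`. With audio devices attached at 0b.0/0c.0 it reports them as missing for both GPUs, because neither sits at function `.1` of a GPU slot.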

The info from the other thread is simply wrong. The user just used an outdated driver, working for the Ampere gen 3090 but not for the newer Ada gen.

Dear generix,
I believe I have fixed the guest configuration as you indicated:
[root@rocky92test ~]# lspci | grep NVIDIA
00:07.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1)
00:07.1 Audio device: NVIDIA Corporation AD102 High Definition Audio Controller (rev a1)
00:08.0 VGA compatible controller: NVIDIA Corporation AD102GL [L6000 / RTX 6000 Ada Generation] (rev a1)
00:08.1 Audio device: NVIDIA Corporation AD102 High Definition Audio Controller (rev a1)

From dmesg, the driver seems to load correctly:
[ 2.732632] [drm] Initialized virtio_gpu 0.1.0 0 for virtio0 on minor 0
[ 2.737763] virtio_gpu virtio0: [drm] drm_plane_enable_fb_damage_clips() not called
[ 2.737793] Console: switching to colour frame buffer device 160x50
[ 2.751531] virtio_gpu virtio0: [drm] fb0: virtio_gpudrmfb frame buffer device
[ 2.804977] snd_hda_intel 0000:00:07.1: Disabling MSI
[ 2.849376] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:07.1/sound/card0/input6
[ 2.849528] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:07.1/sound/card0/input7
[ 2.849633] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:07.1/sound/card0/input8
[ 2.849722] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:07.1/sound/card0/input9
[ 3.036699] snd_hda_intel 0000:00:08.1: Disabling MSI
[ 3.076857] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:08.1/sound/card1/input10
[ 3.078286] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:08.1/sound/card1/input11
[ 3.081611] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:08.1/sound/card1/input12
[ 3.082026] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:08.1/sound/card1/input13
[ 4.232874] nvidia: loading out-of-tree module taints kernel.
[ 4.232912] nvidia: module license ‘NVIDIA’ taints kernel.
[ 4.232923] Disabling lock debugging due to kernel taint
[ 4.249262] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4.312948] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

[ 4.325627] nvidia 0000:00:07.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.386745] nvidia 0000:00:08.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.432767] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.23.06 Sun Oct 15 17:43:11 UTC 2023
[ 4.507578] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 4.586363] nvidia-uvm: Loaded the UVM driver, major device number 234.
[ 4.620706] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.23.06 Sun Oct 15 17:22:43 UTC 2023
[ 4.628169] [drm] [nvidia-drm] [GPU ID 0x00000007] Loading driver
[ 4.628187] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:07.0 on minor 1
[ 4.628370] [drm] [nvidia-drm] [GPU ID 0x00000008] Loading driver
[ 4.628384] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:08.0 on minor 2

Nevertheless, when I run
[root@rocky92test ~]# nvidia-smi
No devices were found
and in the syslog I get more or less the same errors:
Nov 8 18:29:33 rocky92test kernel: ACPI Warning: _SB.PCI0.S38._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20211217/nsarguments-61)
Nov 8 18:29:37 rocky92test kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 8 18:29:37 rocky92test kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
Nov 8 18:29:42 rocky92test kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 8 18:29:42 rocky92test kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
Nov 8 18:29:42 rocky92test kernel: ACPI Warning: _SB.PCI0.S40._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20211217/nsarguments-61)
Nov 8 18:29:47 rocky92test kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 8 18:29:47 rocky92test kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 1
Nov 8 18:29:52 rocky92test kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x11:0x45:2550)
Nov 8 18:29:52 rocky92test kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 1

At this point, if you don’t have any further suggestions, my plan is to restart from the basics: wipe the whole host and install the driver on a clean install of the host machine, to verify whether the GPUs are fully compatible with the SuperMicro server.

Thanks in advance
Claudio

I dug around a bit, and it seems nvidia has indeed changed the pass-through requirements again for the Ada gen, without official notice. Please see this for a solution:
https://forums.developer.nvidia.com/t/passthrough-rtx-6000-ada-to-proxmox-vm-linux-driver-crash-follow-up-better-solution/258997

Dear generix,
your last suggestion worked fine.
Using displaymodeselector is a little bit scary, given the risk of breaking the GPU, but anyway it went well.
Many thanks for your support
Claudio