RTX 6000 Ada Linux driver crash

I’m passing an RTX 6000 Ada Generation card through to a VM as a PCIe device on Proxmox. The guest VM recognizes the card as NVIDIA, and I’ve installed the latest proprietary Linux driver, 530 (I also tried 525). However, I’m getting the error messages below in dmesg.
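
For reference, the relevant Proxmox setup is along these lines (a sketch: the machine type matches the Q35 shown in the dmesg below, the PCI address matches the host lspci output further down, and the exact flags are assumptions):

```
# /etc/pve/qemu-server/<vmid>.conf (excerpt; illustrative sketch)
machine: q35
# passing 2d:00 without a function number hands both the GPU and its
# audio function to the guest
hostpci0: 2d:00,pcie=1
```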

[    3.808057] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  530.30.02  Wed Feb 22 04:11:39 UTC 2023
...
[    3.849336] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  530.30.02  Wed Feb 22 03:45:40 UTC 2023
...
[    9.202528] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x11:0x45:2529)
[    9.203943] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    9.250008] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    9.250267] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[    9.250317] BUG: kernel NULL pointer dereference, address: 0000000000000040
[    9.250320] #PF: supervisor read access in kernel mode
[    9.250324] #PF: error_code(0x0000) - not-present page
...
[    9.250336] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[    9.250337] RIP: 0010:_nv000460kms+0x11/0x50 [nvidia_modeset]
[    9.250367] Code: f8 b9 08 00 00 00 89 45 f8 be 26 00 00 00 e8 d6 06 f6 ff c9 c3 0f 1f 40 00 f3 0f 1e fa 55 b8 01 00 00 00 48 89 e5 48 83 ec 10 <8b> 57 40 48 c7 45 f8 00 00 00 00 85 d2 75 08 c9 c3 66 0f 1f 44 00
[    9.250369] RSP: 0018:ffffa7bdc4213aa8 EFLAGS: 00010282
[    9.250371] RAX: 0000000000000001 RBX: ffff8d43c4f6b000 RCX: 0000000000000000
[    9.250373] RDX: 0000000000000001 RSI: ffff8d43ca002c00 RDI: 0000000000000000
[    9.250374] RBP: ffffa7bdc4213ab8 R08: 0000000000000000 R09: 0000000000000000
[    9.250375] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d43c4f6b000
[    9.250377] R13: ffff8d440d6f2b40 R14: 0000000000000000 R15: ffff8d43ca002c10
[    9.250380] FS:  00007fa3850d2740(0000) GS:ffff8d472fcc0000(0000) knlGS:0000000000000000
[    9.250382] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.250383] CR2: 0000000000000040 CR3: 0000000112fea000 CR4: 0000000000750ee0
[    9.250386] PKRU: 55555554
[    9.250388] Call Trace:
[    9.250390]  <TASK>
[    9.250393]  nv_drm_master_set+0x25/0x50 [nvidia_drm]
[    9.250399]  drm_new_set_master+0xa9/0x130 [drm]
[    9.250425]  drm_master_open+0x93/0xc0 [drm]
[    9.250446]  drm_open+0xf8/0x270 [drm]
[    9.250468]  drm_stub_open+0xba/0x140 [drm]
[    9.250492]  chrdev_open+0xc7/0x250
[    9.250496]  ? cdev_device_add+0xa0/0xa0
[    9.250499]  do_dentry_open+0x16a/0x400
[    9.250503]  vfs_open+0x2d/0x40
[    9.250506]  do_open+0x223/0x490
[    9.250508]  path_openat+0x11d/0x2c0
[    9.250511]  do_filp_open+0xb2/0x160
[    9.250514]  ? __check_object_size+0x23/0x30
[    9.250517]  do_sys_openat2+0xb3/0x180
[    9.250520]  __x64_sys_openat+0x55/0xa0
[    9.250522]  do_syscall_64+0x5c/0x90
[    9.250525]  ? exit_to_user_mode_prepare+0x3b/0xd0
[    9.250529]  ? syscall_exit_to_user_mode+0x2a/0x50
[    9.250532]  ? do_syscall_64+0x69/0x90
[    9.250534]  ? exit_to_user_mode_prepare+0x3b/0xd0
[    9.250536]  ? syscall_exit_to_user_mode+0x2a/0x50
[    9.250538]  ? do_syscall_64+0x69/0x90
[    9.250540]  ? do_syscall_64+0x69/0x90
[    9.250542]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[    9.250545] RIP: 0033:0x7fa384f146eb

Driver versions tried:

Package: nvidia-driver-530
Version: 530.30.02-0ubuntu1

and

Package: nvidia-driver-525
Version: 525.85.12-0ubuntu1
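
(Both came from the standard Ubuntu repositories; for completeness, the install steps were along these lines, assuming the usual apt workflow:)

```
sudo apt update
sudo apt install nvidia-driver-530   # the 525 attempt: nvidia-driver-525
sudo reboot
```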

OS details:

Linux cpu-vm1 5.19.0-35-generic #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

Just for reference, an RTX 3090 is passed through to another VM on the same host system and works fine.

I’ve run nvidia-bug-report and attached it here. Please suggest a fix or let me know if you need any other information.

nvidia-bug-report.log.gz (1.8 MB)
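
(The report was generated with the stock script that ships with the driver; run as root, it writes nvidia-bug-report.log.gz into the current directory:)

```
sudo nvidia-bug-report.sh
```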

I’ve tried adding

mem_encrypt=off

to the GRUB boot parameters as suggested here, but it failed with the same result.
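
(For anyone trying the same thing, the usual way to apply this on Ubuntu is a sketch like the following, then a reboot:)

```
# in /etc/default/grub, append the parameter to the existing command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash mem_encrypt=off"

# then regenerate grub.cfg and reboot:
sudo update-grub
```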

When I run

nvidia-smi

I get:

No devices were found

and I see these messages in the system log:

[   29.326097] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x11:0x45:2529)
[   29.326457] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   44.914626] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x11:0x45:2529)
[   44.914978] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   49.964583] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x11:0x45:2529)
[   49.964971] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

The drivers are still loaded:

$ lsmod | grep nvidia
nvidia_uvm           1433600  0
nvidia_drm             77824  1
nvidia_modeset       1273856  1 nvidia_drm
nvidia              55738368  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        200704  4 qxl,nvidia_drm
drm                   581632  9 drm_kms_helper,qxl,nvidia,drm_ttm_helper,nvidia_drm,ttm


$ lsmod | grep nouveau
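
(Empty output, i.e. nouveau is not loaded, as expected; the packaged driver ships a blacklist equivalent to roughly this:)

```
# /etc/modprobe.d/blacklist-nouveau.conf (normally installed by the driver package)
blacklist nouveau
options nouveau modeset=0
```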

Seems like a driver issue. I’m just passing through the GPU as a PCIe device without using any vGPU options. The host OS has no NVIDIA driver installed, and no monitors are plugged into the card. The guest OS clearly sees the card as a PCI device.

Interestingly, on the GUEST OS I get this:

lspci -v

...

01:00.0 VGA compatible controller: NVIDIA Corporation Device 26b1 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 16a1
	Physical Slot: 0
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f0000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 1000000000 (64-bit, prefetchable) [size=64G]
	Memory at 2000000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 5000 [size=128]
	Expansion ROM at f1000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
	Subsystem: NVIDIA Corporation Device 16a1
	Physical Slot: 0
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at f1080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

while on the HOST OS I get this:

2d:00.0 VGA compatible controller: NVIDIA Corporation Device 26b1 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 16a1
	Flags: bus master, fast devsel, latency 0, IRQ 133, IOMMU group 26
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 5000000000 (64-bit, prefetchable) [size=64G]
	Memory at 6000000000 (64-bit, prefetchable) [size=32M]
	I/O ports at f000 [size=128]
	Expansion ROM at fc000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau

2d:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
	Subsystem: NVIDIA Corporation Device 16a1
	Flags: bus master, fast devsel, latency 0, IRQ 138, IOMMU group 26
	Memory at fc080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [160] Data Link Feature <?>
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
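
(Host side, both functions are claimed by vfio-pci; on Proxmox that binding usually comes from a modprobe config like the sketch below, with the vendor:device IDs taken from the lspci output above:)

```
# /etc/modprobe.d/vfio.conf -- 10de is NVIDIA's vendor ID,
# 26b1/22ba are the GPU and audio functions shown above
options vfio-pci ids=10de:26b1,10de:22ba

# rebuild the initramfs so the binding applies early in boot, then reboot:
update-initramfs -u -k all
```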

Just to update: I’ve also tried removing all NVIDIA drivers and installing from the local installer downloaded from here.

## Linux x64 (AMD64/EM64T) Display Driver

| Version: | 525.89.02 |
| --- | --- |
| Release Date: | 2023.2.8 |
| Operating System: | Linux 64-bit |
| Language: | English (US) |
| File Size: | 394.93 MB |

This is a slightly different build of the 525 release than the Ubuntu package.
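
(The .run installer is executed outside the graphical session; a sketch of the usual runfile procedure, with the filename following NVIDIA's naming for 525.89.02:)

```
# stop the graphical target first, then run the installer
sudo systemctl isolate multi-user.target
sudo sh NVIDIA-Linux-x86_64-525.89.02.run
```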

This didn’t help and resulted in the same error messages.

Hi there @aleksey.izmailov,

I don’t have a good solution for you here, just a few observations.

The RTX 6000 Ada is our newest workstation card, so you will definitely need the latest possible drivers.
You will likely have purchased this card from one of our partners with proper (enterprise?) support, which means you should talk to them first about using the device in a virtual machine environment. I am not really sure they all support this the way the GeForce GPUs do.

Finally, since I am not the expert, you might find more help over in the vGPU forums, where there are also discussions on normal pass-through usage of our GPUs, not only licensed vGPU setups.

Thanks!

Thank you Markus, I was thinking the same; I will open a support ticket.

Just to update on this and close the topic. I’ve talked to NVIDIA and was informed that the vGPU approach is required, whether it’s pass-through mode or splitting the GPU among multiple users. vGPU requires a valid license and installation of the driver on both the host and the guest. To the best of my knowledge there is no way to directly pass through a professional GPU without using vGPU.
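
(For anyone landing here later: with the licensed vGPU host driver installed, the card exposes mediated device profiles, and the Proxmox VM references one of those instead of the raw PCI device. A sketch, where the profile name is purely hypothetical; the real names show up in sysfs:)

```
# list the vGPU profiles the host driver exposes for this card
ls /sys/bus/pci/devices/0000:2d:00.0/mdev_supported_types

# /etc/pve/qemu-server/<vmid>.conf -- assign one of them
hostpci0: 2d:00.0,mdev=nvidia-<profile>
```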

Before I figured out the vGPU solution I tried all kinds of Proxmox tricks, like setting kernel boot parameters, changing the GPU’s physical PCIe slot, etc. The GPU was visible in the guest OS, but the driver would not work with it. On the same guest VM I can pass through an RTX 3090 without issues.
The RTX 6000 Ada works fine with the same Linux driver on bare metal.
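
(The kernel boot parameters I mean are the usual host-side IOMMU ones, roughly as below; the right flag depends on the host CPU, and none of this made a difference here:)

```
# host /etc/default/grub -- pick the IOMMU flag matching the host CPU:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# or: GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
```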

Kudos,
Alex
