A40 ubuntu failed to load driver "NVRM: request_mem_region failed for 0M @ 0x0"

I have tried multiple Ubuntu versions and nothing works, even though the same approach works on all my other NVIDIA builds.

Is it the A40? Any ideas?

[ 2.021946] nvidia: loading out-of-tree module taints kernel.
[ 2.021955] nvidia: module license 'NVIDIA' taints kernel.
[ 2.021956] Disabling lock debugging due to kernel taint
[ 2.022125] kvm: Nested Virtualization enabled
[ 2.022145] SVM: kvm: Nested Paging enabled
[ 2.025494] Huh? What family is it: 0x19?!
[ 2.034825] pcieport 0000:00:1c.3: pciehp: Failed to check link status
[ 2.040325] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.042761] pcieport 0000:00:1c.1: pciehp: Failed to check link status
[ 2.050764] pcieport 0000:00:1c.2: pciehp: Failed to check link status
[ 2.053052] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 2.054912] NVRM: request_mem_region failed for 0M @ 0x0. This can
NVRM: occur when a driver such as rivatv is loaded and claims
NVRM: ownership of the device's registers.
[ 2.055355] nvidia: probe of 0000:01:00.0 failed with error -1
[ 2.055370] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2.055370] NVRM: None of the NVIDIA devices were initialized.
[ 2.055964] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
[ 2.074192] audit: type=1400 audit(1624618208.032:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=596 comm="apparmor_parser"
[ 2.074196] audit: type=1400 audit(1624618208.032:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=596 comm="apparmor_parser"
[ 2.074735] audit: type=1400 audit(1624618208.032:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=598 comm="apparmor_parser"
[ 2.074738] audit: type=1400 audit(1624618208.036:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=598 comm="apparmor_parser"
[ 2.074739] audit: type=1400 audit(1624618208.036:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=598 comm="apparmor_parser"
[ 2.074785] audit: type=1400 audit(1624618208.036:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libreoffice-oopslash" pid=591 comm="apparmor_parser"
[ 2.075746] audit: type=1400 audit(1624618208.036:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libreoffice-senddoc" pid=590 comm="apparmor_parser"
[ 2.076467] audit: type=1400 audit(1624618208.036:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=601 comm="apparmor_parser"
[ 2.076663] audit: type=1400 audit(1624618208.036:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cups-browsed" pid=592 comm="apparmor_parser"
[ 2.356483] Huh? What family is it: 0x19?!
[ 2.411622] Huh? What family is it: 0x19?!
[ 2.463269] Huh? What family is it: 0x19?!
[ 2.531640] Huh? What family is it: 0x19?!
[ 2.711943] Huh? What family is it: 0x19?!
[ 2.716228] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 2.717653] NVRM: request_mem_region failed for 0M @ 0x0. This can
NVRM: occur when a driver such as rivatv is loaded and claims
NVRM: ownership of the device's registers.
[ 2.717986] nvidia: probe of 0000:01:00.0 failed with error -1
[ 2.718003] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2.718003] NVRM: None of the NVIDIA devices were initialized.
[ 2.718226] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237

k@u2104serv01:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
k@u2104serv01:~$ sudo lspci -s 01:00 -v
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A40] (rev a1)
Subsystem: NVIDIA Corporation Device 145a
Physical Slot: 0
Flags: fast devsel, IRQ 16
Memory at (32-bit, non-prefetchable) [disabled]
Memory at (64-bit, prefetchable) [disabled]
Memory at (64-bit, prefetchable) [disabled]
Capabilities: [60] Power Management version 3
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

k@u2104serv01:~$ lspci -n -s 01:00
01:00.0 0302: 10de:2235 (rev a1)
k@u2104serv01:~$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
k@u2104serv01:~$ grep nouv /etc/modprobe.d/* /lib/modprobe.d/*
/lib/modprobe.d/nvidia-graphics-drivers.conf:blacklist nouveau
/lib/modprobe.d/nvidia-graphics-drivers.conf:blacklist lbm-nouveau
/lib/modprobe.d/nvidia-graphics-drivers.conf:alias nouveau off
/lib/modprobe.d/nvidia-graphics-drivers.conf:alias lbm-nouveau off
k@u2104serv01:~$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device

nvidia-bug-report.log.gz (725.1 KB)

Hi, did you ever manage to figure this out? I have the same problem, so would be grateful for any assistance.

What are your system specs? Motherboard? CPU?

I got it working yesterday, demonix. I’m not sure if you are still interested, but my setup is identical to yours, i.e. a new EPYC Milan GB server populated with A40s and running Proxmox VE. Also, sorry for the late reply; I didn’t notice your post yesterday.

The issue can be seen within a VM using lspci -v, where I was getting:

Memory at <ignored> (64-bit, prefetchable) [disabled]

Whilst I don’t understand it fully, I believe this was indicating an issue with the BAR (base address register). Essentially, it came down to an incorrect VM and GRUB config. If you run lspci -v on the Proxmox host, you will see memory addresses where the VM shows “ignored”. The settings that I used to fix this were:

Proxmox GRUB:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on pcie_acs_override=downstream,multifunction video=efifb:off"

VM#.conf:

Machine: Default (i440fx)
BIOS: Default (SeaBIOS)
PCI: no additional parameters

Proxmox VFIO:

as standard + disable_vga=1 flag
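
If you want a quick way to confirm the host actually booted with the GRUB options listed above, something like this rough sketch might help. It only compares a kernel command line string against the parameters from this post, assuming the intended framebuffer option is `video=efifb:off`. The helper name `missing_params` and the sample string are made up for illustration; on the Proxmox host you would feed it the contents of /proc/cmdline instead.

```python
# Sketch: report which of the passthrough-related kernel parameters quoted
# in this post are absent from a kernel command line.
# EXPECTED is taken from the post (assuming video=efifb:off is intended);
# missing_params is a hypothetical helper name.

EXPECTED = [
    "amd_iommu=on",
    "pcie_acs_override=downstream,multifunction",
    "video=efifb:off",
]


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that do not appear in cmdline."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


if __name__ == "__main__":
    # Made-up sample; on the host, read /proc/cmdline instead, e.g.
    #   cmdline = open("/proc/cmdline").read()
    sample = "quiet amd_iommu=on"
    for param in missing_params(sample):
        print("missing:", param)
```

Remember that edits to /etc/default/grub only take effect after regenerating the GRUB config and rebooting, so the running command line is the thing to check.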

Once you fix it, your VM should show hex values instead of “ignored” when running lspci -v.
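
To make that check less of an eyeball job, here is a minimal sketch that scans saved `lspci -v` output for memory regions whose address is not a hex value (for example “ignored”, or missing altogether). The function name `unmapped_regions`, the sample text and the address fb000000 are assumptions for illustration, not output from my machine.

```python
# Sketch: list the "Memory at ..." lines in lspci -v output whose address
# is not a plain hex value. unmapped_regions is a hypothetical helper name.
import re

# Captures whatever sits between "Memory at" and the "(32-bit, ...)" part.
REGION = re.compile(r"^\s*Memory at\s*(?P<addr>[^\s(]*)\s*\((?P<kind>[^)]*)\)")
HEX = re.compile(r"^[0-9a-fA-F]+$")


def unmapped_regions(lspci_text):
    """Return the memory region lines that have no hex address."""
    bad = []
    for line in lspci_text.splitlines():
        match = REGION.match(line)
        if match and not HEX.match(match.group("addr")):
            bad.append(line.strip())
    return bad


if __name__ == "__main__":
    # Made-up sample; in the VM you would pipe in real `lspci -v` output.
    sample = (
        "01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A40] (rev a1)\n"
        "\tMemory at <ignored> (64-bit, prefetchable) [disabled]\n"
        "\tMemory at fb000000 (32-bit, non-prefetchable) [size=16M]\n"
    )
    for line in unmapped_regions(sample):
        print("not mapped:", line)
```

An empty result for the GPU's slot would match the “hex values” state described above. The sketch only looks at the address field and does not interpret the [disabled] flag on its own.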

Cheers

D


Wow, thanks!! I’ll have to check. I’m indeed running a GB Milan with 2 A40s.

It will take me a couple of days to test, as I had some issues with my systems, but I’ll get back to this shortly.

Can I ask about your BIOS settings? It’s my first AMD system in decades, and my Xeon boxes name things differently.

Thank you so much!

EDIT: you are also using SeaBIOS?! I usually don’t use SeaBIOS for passthrough.

I believe using the VMs with a UEFI BIOS may have been the issue or at least contributed to it. I don’t have much time to look into this at the moment, but can look again if you still have issues.

My GB BIOS is all “as supplied” other than enabling IOMMU, which I guess you must have done to get as far as you did?

Likewise, this is my first AMD system in a long while, so things are a bit new to me too! If you want anything specific, I am more than happy to look it up for you.

Good luck!


Thank you so much! Don’t worry for now. I’ll have to test your solution first, next week I think, and if I have a problem I’ll poke you here again.

This is super helpful!

Solved it! The problem was the VM BIOS, as you said!!
i440fx + SeaBIOS works as expected!! All my other Intel servers are configured with a “UEFI”-based VM BIOS, so I never really tested SeaBIOS.
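
For anyone checking their own VMs, here is a rough sketch that reads the firmware and machine type out of a Proxmox-style VM config, which is plain `key: value` text. It assumes that a missing `bios` or `machine` key means the defaults mentioned in this thread (SeaBIOS and i440fx). The function name `firmware_settings` and the sample config are invented for illustration.

```python
# Sketch: extract the BIOS and machine type from Proxmox-style VM config
# text ("key: value" lines). firmware_settings is a hypothetical helper;
# absent keys are assumed to mean the defaults (SeaBIOS, i440fx).


def firmware_settings(conf_text):
    """Return (bios, machine) for the main section of a VM config."""
    settings = {"bios": "seabios", "machine": "i440fx"}
    for line in conf_text.splitlines():
        if line.startswith("["):
            # Stop at the first bracketed section (e.g. a snapshot).
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() in settings:
            settings[key.strip()] = value.strip()
    return settings["bios"], settings["machine"]


if __name__ == "__main__":
    # Made-up config resembling a UEFI-based VM.
    sample = "bios: ovmf\nmachine: q35\ncores: 8\n"
    bios, machine = firmware_settings(sample)
    print("bios:", bios, "machine:", machine)
```

In my case the VMs that failed would have reported the UEFI firmware here, and the working one reports the defaults.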

Thanks a lot!

