NVRM: This PCI I/O region assigned to your NVIDIA device is invalid

Dear All,

We use a lot of ProLiant SL250s Gen8 servers with Tesla K40m GPUs. The servers boot over the network from a unified boot image.

We replaced the motherboard in one server, and now it cannot load the nvidia driver:

[root@hostname ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[root@hostname ~]# lspci | grep Tesla
08:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
24:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
27:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
Sep 13 10:21:54 hostname kernel: Lustre: 10287:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Sep 13 10:22:35 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:35 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:08:00.0)
Sep 13 10:22:35 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:35 hostname kernel: nvidia: probe of 0000:08:00.0 failed with error -1
Sep 13 10:22:36 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:36 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:24:00.0)
Sep 13 10:22:36 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:36 hostname kernel: nvidia: probe of 0000:24:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:37 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:27:00.0)
Sep 13 10:22:37 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:37 hostname kernel: nvidia: probe of 0000:27:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 13 10:22:38 hostname kernel: NVRM: The NVIDIA probe routine failed for 3 device(s).
Sep 13 10:22:38 hostname kernel: NVRM: None of the NVIDIA graphics adapters were initialized!
Sep 13 10:22:38 hostname kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 243
Sep 13 10:22:39 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:39 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:08:00.0)
Sep 13 10:22:39 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:39 hostname kernel: nvidia: probe of 0000:08:00.0 failed with error -1
Sep 13 10:22:40 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:40 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:24:00.0)
Sep 13 10:22:40 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:40 hostname kernel: nvidia: probe of 0000:24:00.0 failed with error -1
Sep 13 10:22:41 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:41 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:27:00.0)
Sep 13 10:22:41 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:41 hostname kernel: nvidia: probe of 0000:27:00.0 failed with error -1
Sep 13 10:22:42 hostname kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 13 10:22:42 hostname kernel: NVRM: The NVIDIA probe routine failed for 3 device(s).
Sep 13 10:22:42 hostname kernel: NVRM: None of the NVIDIA graphics adapters were initialized!
Sep 13 10:22:42 hostname kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 243
Sep 13 10:22:42 hostname modprobe: FATAL: Error inserting nvidia (/lib/modules/2.6.32-696.3.2.el6.x86_64/weak-updates/nvidia/nvidia.ko): No such device
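
Since the same three NVRM lines repeat for every device and every load attempt, it can help to collapse the log down to one line per affected BAR. The following is only a sketch: the function name is made up for illustration, and the pattern is based solely on the message format quoted above.

```shell
#!/bin/sh
# Read kernel log text on stdin and print each device/BAR pair that NVRM
# reported as "0M @ 0x0" (i.e. no address assigned), once per pair.
# nvrm_unassigned_bars is a hypothetical helper name, not an NVIDIA tool.
nvrm_unassigned_bars() {
    sed -n 's/.*NVRM: \(BAR[0-9]\) is 0M @ 0x0 (PCI:\([0-9a-fA-F:.]*\)).*/\2 \1/p' |
        sort -u
}

# Usage (the log path is an assumption; adjust to wherever syslog writes):
#   nvrm_unassigned_bars < /var/log/messages
```

For the log in this post it would list the three K40m addresses, each with BAR1.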

What do you think could be causing this issue? How can we fix it?

Please check the BIOS for an option like “above 4G decoding” or “large/64-bit BARs” and enable it.

Unfortunately, it is not available on this server model.

https://imgur.com/a/PbI9heN

Any other idea?

That model has two options to enable 64-bit BARs: a hardware switch on the mainboard (check your service manual) and a BIOS option that is not easy to find.
HPE:
Update the system ROM to version 2014.02.10 (or later).
Reboot the host. During POST, enter RBSU (F9-setup).
At the Main Screen, press Ctrl+A.
Select “Service Options.”
Select “PCI Express 64-bit BAR Support.”
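
After changing the setting and rebooting, one way to check whether the BARs actually received addresses is to look at the Region lines in the lspci output. This is a rough sketch, assuming pciutils marks a BAR without an address as `<unassigned>` or `<ignored>`; the helper name is hypothetical.

```shell
#!/bin/sh
# Read `lspci -vv` output on stdin and print the regions (BARs) that are
# shown without an address. unassigned_regions is a made-up helper name.
unassigned_regions() {
    sed -nE 's/^[[:space:]]*(Region [0-9]+): .*<(unassigned|ignored)>.*/\1/p'
}

# Usage, with a bus address taken from the lspci output in the first post:
#   lspci -vv -s 08:00.0 | unassigned_regions
```

No output suggests that every BAR on that device was assigned, at which point it is worth retrying nvidia-smi.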

You are the best, thank you! :)

Hi, @generix. I’m having nearly the same problem, except with a Tesla K80 on Ubuntu 18.04. In the BIOS, I enabled “Above 4G decoding”, but after doing so the computer stalled during boot. I’m using a ROG Z390-e motherboard. Here is the output from dmesg (which repeats):

[ 421.525273] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 421.525818] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:03:00.0)
[ 421.525820] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525825] nvidia: probe of 0000:03:00.0 failed with error -1
[ 421.525834] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 421.525835] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525838] nvidia: probe of 0000:04:00.0 failed with error -1

[ 421.525858] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 421.525858] NVRM: None of the NVIDIA devices were initialized.
[ 421.526106] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237

I tried attaching the nvidia bug report, but the interface won’t let me upload a .gz file. I added a .txt extension and tried again, but nothing happened. How can I upload the bug report? I don’t see a paperclip icon, even after creating this post.

Add .log at the end.

Thanks. It couldn’t upload with a .log extension either. After specifying the file and clicking ‘upload’, it just returns me to the editing window. Same for drag and drop.

Anyway, I put the file on github:
https://github.com/mattroos/temp_repo/blob/master/nvidia-bug-report.log.gz

You probably couldn’t upload the file because it’s too big. Unfortunately the log flood from the BAR warning pushed out any useful info. Please reboot and create a new nvidia-bug-report.log right after boot.

I did that but the file size is even larger. It just keeps growing with time, even after a reboot and fresh call to the generating script. The nvidia-bug-report.log file (even one created right after a reboot) has entries going back to yesterday evening, when I installed the K80 and started the CUDA+driver install effort, despite having rebooted many times. Is there some other log file I should delete first? Or command to run?

You can also just provide a dmesg output after boot:
sudo dmesg >dmesg.log

Frustrating. It’s a similar problem with dmesg. The buffer fills up quickly with NVRM error messages, and overwrites the messages from earlier during the bootup. I can’t figure out how to increase the buffer size or otherwise capture those early messages.
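
One possible way around the overwritten ring buffer, assuming the system runs systemd-journald (the default on Ubuntu 18.04), is to read the kernel messages of the current boot from the journal instead of dmesg and filter out the repeating driver lines. The filter below is a sketch with a hypothetical function name; its patterns are taken from the messages quoted earlier in the thread.

```shell
#!/bin/sh
# Drop the repeating NVIDIA driver messages from kernel log text on stdin
# so that the early boot messages (PCI resource assignment etc.) remain.
# drop_nvidia_flood is a made-up helper name.
drop_nvidia_flood() {
    grep -v -e 'NVRM:' -e 'nvidia: probe of' -e 'nvidia-nvlink:'
}

# Usage (journalctl -k -b 0 prints kernel messages of the current boot):
#   journalctl -k -b 0 --no-pager | drop_nvidia_flood > dmesg-filtered.log
#
# Alternatively, the ring buffer itself can be enlarged with the kernel
# parameter log_buf_len, e.g. log_buf_len=16M on the kernel command line
# (the 16M value is only an example).
```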

Something must be triggering nvidia.ko to try to load. Can you try to boot into single-user rescue mode or something? Alternatively, you could temporarily uninstall or blacklist the nvidia driver for this experiment so that it can’t load while you’re trying to check the kernel logs.
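
As a sketch of the blacklist approach: a plain `blacklist` line only stops alias-based autoloading, so an explicit `modprobe nvidia` from some service would still succeed. Adding an `install` override makes explicit loads fail as well. The function name and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Write a modprobe.d fragment that keeps nvidia.ko from loading.
# $1: destination file. write_nvidia_block is a made-up helper name.
write_nvidia_block() {
    cat > "$1" <<'EOF'
# Temporary: keep the nvidia module from loading while collecting logs
blacklist nvidia
install nvidia /bin/false
EOF
}

# Usage (file name is an assumption; any *.conf under /etc/modprobe.d works):
#   sudo sh -c '. ./block-nvidia.sh; write_nvidia_block /etc/modprobe.d/block-nvidia.conf'
```

On Ubuntu, running `sudo update-initramfs -u` afterwards and rebooting should make the initramfs pick up the change; deleting the file and repeating those steps undoes it.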

Thanks, @aplattner. For the life of me, I can’t get the messages to stop. Tried booting in recovery mode. And tried blacklisting in modprobe.d. If I ‘apt remove’ all nvidia packages, will nvidia-bug-report.sh still work, and give useful information? Will dmesg?

Also, is there a way to clear whatever file/buffer/log is drawn upon to create the nvidia-bug-report.log? The generated .log file has info that spans several days and is getting to be gigantic in size.

Might be nvidia-persistenced.
Never mind, just uninstall the driver by running (in an empty directory, so the shell cannot expand the * against local file names)
sudo apt remove nvidia*
Afterwards, just create the dmesg log.

Thanks, @generix. I removed the nvidia packages. Log file is attached: dmesg.log (70.4 KB)

Looks like a BIOS boot. Please disable CSM in the BIOS and do a clean EFI boot reinstall. Afterwards, don’t install the driver but provide a new dmesg log.

That’s pushing me past my experience level. With CSM disabled, I can’t boot from the drive (perhaps that was expected). Can you point me toward some resources/info on how to “do a clean EFI boot reinstall?”

You’ll have to format and reinstall, i.e. disable CSM, put your Ubuntu install medium back in (e.g. connect the USB thumb drive) and boot from it. Then repartition the hard disk and install.

https://itsfoss.com/install-ubuntu/
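
To confirm how the running system was booted, before or after the reinstall, a common check is whether the kernel exposes /sys/firmware/efi. A minimal sketch, with a hypothetical function name and an optional path argument that exists only to make the check easy to try out:

```shell
#!/bin/sh
# Report whether the running Linux system was booted via UEFI.
# $1: directory to test; defaults to /sys/firmware/efi, which the kernel
# only creates on an EFI boot. boot_mode is a made-up helper name.
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo "UEFI"
    else
        echo "legacy BIOS (CSM)"
    fi
}

# Usage:
#   boot_mode
```

If this still reports a legacy boot after the reinstall, the installer was most likely started in CSM mode as well.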