We run a large number of ProLiant SL250s Gen8 servers with Tesla K40m GPUs. The servers boot from the network using a unified boot image.
We replaced the motherboard in one server, and now it cannot load the NVIDIA driver:
[root@hostname ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[root@hostname ~]# lspci | grep Tesla
08:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
24:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
27:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
Sep 13 10:22:35 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:35 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:08:00.0)
Sep 13 10:22:35 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:35 hostname kernel: nvidia: probe of 0000:08:00.0 failed with error -1
Sep 13 10:22:36 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:36 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:24:00.0)
Sep 13 10:22:36 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:36 hostname kernel: nvidia: probe of 0000:24:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:37 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:27:00.0)
Sep 13 10:22:37 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:37 hostname kernel: nvidia: probe of 0000:27:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 13 10:22:38 hostname kernel: NVRM: The NVIDIA probe routine failed for 3 device(s).
Sep 13 10:22:38 hostname kernel: NVRM: None of the NVIDIA graphics adapters were initialized!
Sep 13 10:22:38 hostname kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 243
(the same probe-failure sequence repeats for all three devices, then:)
Sep 13 10:22:42 hostname modprobe: FATAL: Error inserting nvidia (/lib/modules/2.6.32-696.3.2.el6.x86_64/weak-updates/nvidia/nvidia.ko): No such device
What could be causing this issue, and how can we fix it?
That model has two options for enabling 64-bit BARs: a hardware switch on the mainboard (check your service manual) and a BIOS option that isn't easy to find.
From HPE:
1. Update the system ROM to version 2014.02.10 (or later).
2. Reboot the host. During POST, enter RBSU (F9 setup).
3. At the Main Screen, press Ctrl+A.
4. Select "Service Options."
5. Select "PCI Express 64-bit BAR Support."
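After applying the BIOS change and rebooting, a quick way to confirm the fix took is to check that the invalid-BAR messages are gone from the kernel log. A minimal sketch (the `bar_errors` helper name is ours, not part of any NVIDIA tooling):

```shell
# Print any NVRM lines that report a zero-sized BAR; no output means
# the kernel no longer sees a misconfigured BAR.
bar_errors() {
  grep -E 'NVRM: BAR[0-9] is 0M @ 0x0' "${1:-/dev/stdin}"
}
# Typical use on a live system:
#   dmesg | bar_errors
```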
Hi, @generix. I'm having nearly the same problem, except with a Tesla K80 on Ubuntu 18.04. I enabled "Above 4G decoding" in the BIOS, but after doing so the computer stalled during boot. I'm using a ROG Z390-E motherboard. Here is the output from dmesg (which repeats):
[ 421.525273] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 421.525818] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:03:00.0)
[ 421.525820] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525825] nvidia: probe of 0000:03:00.0 failed with error -1
[ 421.525834] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 421.525835] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525838] nvidia: probe of 0000:04:00.0 failed with error -1
[ 421.525858] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 421.525858] NVRM: None of the NVIDIA devices were initialized.
[ 421.526106] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
I tried attaching the NVIDIA bug report, but the interface won't let me upload a .gz file. I added a .txt extension and tried again, but nothing happened. How can I upload the bug report? I don't see a paperclip icon, even after creating this post.
Thanks. It couldn't be uploaded with a .log extension either. After specifying the file and clicking "upload," it just returns me to the editing window. Same for drag and drop.
You probably couldn't upload the file because it's too big. Unfortunately, the log flood from the BAR warnings pushed out any useful information. Please reboot and create a new nvidia-bug-report.log right after boot.
I did that, but the file is even larger. It just keeps growing over time, even after a reboot and a fresh run of the generating script. The nvidia-bug-report.log file (even one created right after a reboot) has entries going back to yesterday evening, when I installed the K80 and started the CUDA + driver installation, despite my having rebooted many times. Is there some other log file I should delete first? Or a command to run?
Frustrating. It's a similar problem with dmesg: the buffer fills up quickly with NVRM error messages and overwrites the earlier boot messages. I can't figure out how to increase the buffer size or otherwise capture those early messages.
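One possible workaround for the overwritten early messages is to enlarge the kernel ring buffer with the `log_buf_len` boot parameter. A hedged sketch for Ubuntu (the 16M size is an arbitrary example; run `sudo update-grub` and reboot after editing `/etc/default/grub`):

```shell
# In /etc/default/grub -- append log_buf_len to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash log_buf_len=16M"
```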
Something must be triggering nvidia.ko to load. Can you try booting into single-user rescue mode or something similar? Alternatively, you could temporarily uninstall or blacklist the nvidia driver for this experiment so that it can't load while you're trying to check the kernel logs.
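Blacklisting the module via modprobe.d might look like the following (the filename is only an example; whether it takes effect before the early boot messages depends on the initramfs being rebuilt, e.g. with `sudo update-initramfs -u` on Ubuntu):

```
# /etc/modprobe.d/blacklist-nvidia.conf (example filename)
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
```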
Thanks, @aplattner. For the life of me, I can't get the messages to stop. I tried booting in recovery mode, and I tried blacklisting in modprobe.d. If I "apt remove" all the NVIDIA packages, will nvidia-bug-report.sh still work and give useful information? Will dmesg?
Also, is there a way to clear whatever file/buffer/log nvidia-bug-report.log is built from? The generated .log file has information spanning several days and is getting gigantic.
Might be nvidia-persistenced.
Never mind, just uninstall the driver by running (in an empty directory, so the unquoted glob can't match local files)
sudo apt remove nvidia*
Afterwards, just create the dmesg log.
This looks like a legacy BIOS boot. Please disable CSM in the BIOS and do a clean EFI reinstall. Afterwards, don't install the driver, but provide a new dmesg log.
That's pushing me past my experience level. With CSM disabled, I can't boot from the drive (perhaps that was expected). Can you point me toward some resources/info on how to "do a clean EFI boot reinstall"?
You'll have to format and reinstall: disable CSM, put your Ubuntu install medium back in (e.g., connect the USB thumb drive), and boot from it. Then repartition the hard disk and install.