We run a large number of ProLiant SL250s Gen8 servers with Tesla K40m GPUs. The servers boot from the network using a unified boot image.
We replaced the motherboard in one server, and now it cannot load the NVIDIA driver:
[root@hostname ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[root@hostname ~]# lspci | grep Tesla
08:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
24:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
27:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
Sep 13 10:22:35 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:35 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:08:00.0)
Sep 13 10:22:35 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:35 hostname kernel: nvidia: probe of 0000:08:00.0 failed with error -1
Sep 13 10:22:36 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:36 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:24:00.0)
Sep 13 10:22:36 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:36 hostname kernel: nvidia: probe of 0000:24:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Sep 13 10:22:37 hostname kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:27:00.0)
Sep 13 10:22:37 hostname kernel: NVRM: The system BIOS may have misconfigured your GPU.
Sep 13 10:22:37 hostname kernel: nvidia: probe of 0000:27:00.0 failed with error -1
Sep 13 10:22:37 hostname kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 13 10:22:38 hostname kernel: NVRM: The NVIDIA probe routine failed for 3 device(s).
Sep 13 10:22:38 hostname kernel: NVRM: None of the NVIDIA graphics adapters were initialized!
Sep 13 10:22:38 hostname kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 243
(the same probe-failure sequence repeats for all three devices, then:)
Sep 13 10:22:42 hostname modprobe: FATAL: Error inserting nvidia (/lib/modules/2.6.32-696.3.2.el6.x86_64/weak-updates/nvidia/nvidia.ko): No such device
What could be causing this issue, and how can we fix it?
That model has two options for enabling 64-bit BARs: a hardware switch on the mainboard (check your service manual) and a BIOS option that isn't easy to find.
From HPE:
1. Update the system ROM to version 2014.02.10 (or later).
2. Reboot the host. During POST, enter RBSU (F9 setup).
3. At the Main Screen, press Ctrl+A.
4. Select "Service Options."
5. Select "PCI Express 64-bit BAR Support."
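After applying the BIOS change and rebooting, a quick way to confirm the fix took is to check that the invalid-BAR messages are gone from the kernel log. A minimal sketch (the `bar_errors` helper name is ours, not part of any NVIDIA tooling):

```shell
# Print any NVRM lines that report a zero-sized BAR; no output means
# the kernel no longer sees a misconfigured BAR.
bar_errors() {
  grep -E 'NVRM: BAR[0-9] is 0M @ 0x0' "${1:-/dev/stdin}"
}
# Typical use on a live system:
#   dmesg | bar_errors
```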
Hi, @generix. I'm having nearly the same problem, except with a Tesla K80 on Ubuntu 18.04. I enabled "Above 4G decoding" in the BIOS, but after doing so the computer stalled during boot. I'm using a ROG Z390-E motherboard. Here is the output from dmesg (which repeats):
[ 421.525273] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 421.525818] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:03:00.0)
[ 421.525820] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525825] nvidia: probe of 0000:03:00.0 failed with error -1
[ 421.525834] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:04:00.0)
[ 421.525835] NVRM: The system BIOS may have misconfigured your GPU.
[ 421.525838] nvidia: probe of 0000:04:00.0 failed with error -1
[ 421.525858] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 421.525858] NVRM: None of the NVIDIA devices were initialized.
[ 421.526106] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
I tried attaching the NVIDIA bug report, but the interface won't let me upload a .gz file. I added a .txt extension and tried again, but nothing happened. How can I upload the bug report? I don't see a paperclip icon, even after creating this post.
Thanks. It couldn't be uploaded with a .log extension either. After specifying the file and clicking "upload," it just returns me to the editing window. Same for drag and drop.
You probably couldn't upload the file because it's too big. Unfortunately, the log flood from the BAR warnings pushed out any useful information. Please reboot and create a new nvidia-bug-report.log right after boot.
I did that, but the file is even larger. It just keeps growing over time, even after a reboot and a fresh run of the generating script. The nvidia-bug-report.log file (even one created right after a reboot) has entries going back to yesterday evening, when I installed the K80 and started the CUDA + driver installation, despite my having rebooted many times. Is there some other log file I should delete first? Or a command to run?
Frustrating. It's a similar problem with dmesg: the buffer fills up quickly with NVRM error messages and overwrites the earlier boot messages. I can't figure out how to increase the buffer size or otherwise capture those early messages.
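One possible workaround for the overwritten early messages is to enlarge the kernel ring buffer with the `log_buf_len` boot parameter. A hedged sketch for Ubuntu (the 16M size is an arbitrary example; run `sudo update-grub` and reboot after editing `/etc/default/grub`):

```shell
# In /etc/default/grub -- append log_buf_len to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash log_buf_len=16M"
```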
Something must be triggering nvidia.ko to load. Can you try booting into single-user rescue mode or something similar? Alternatively, you could temporarily uninstall or blacklist the nvidia driver for this experiment so that it can't load while you're trying to check the kernel logs.
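Blacklisting the module via modprobe.d might look like the following (the filename is only an example; whether it takes effect before the early boot messages depends on the initramfs being rebuilt, e.g. with `sudo update-initramfs -u` on Ubuntu):

```
# /etc/modprobe.d/blacklist-nvidia.conf (example filename)
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
```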
Thanks, @aplattner. For the life of me, I can't get the messages to stop. I tried booting in recovery mode, and I tried blacklisting in modprobe.d. If I "apt remove" all the NVIDIA packages, will nvidia-bug-report.sh still work and give useful information? Will dmesg?
Also, is there a way to clear whatever file/buffer/log nvidia-bug-report.log is built from? The generated .log file has information spanning several days and is getting gigantic.
Might be nvidia-persistenced.
Never mind, just uninstall the driver by running (in an empty directory, so the unquoted glob can't match local files)
sudo apt remove nvidia*
Afterwards, just create the dmesg log.
This looks like a legacy BIOS boot. Please disable CSM in the BIOS and do a clean EFI reinstall. Afterwards, don't install the driver, but provide a new dmesg log.
That's pushing me past my experience level. With CSM disabled, I can't boot from the drive (perhaps that was expected). Can you point me toward some resources/info on how to "do a clean EFI boot reinstall"?
You'll have to format and reinstall: disable CSM, put your Ubuntu install medium back in (e.g., connect the USB thumb drive), and boot from it. Then repartition the hard disk and install.