Error when installing nvidia driver - Tesla K40m on Linux RHEL

I received the following error when attempting to install NVIDIA driver version 384.130 on a RHEL 7.6 server:

ERROR: Unable to load the 'nvidia-drm' kernel module.

Here is the log of the attempted install:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Wed May  8 14:47:08 2019
installer version: 384.130

PATH: /usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin

nvidia-installer command line:
    ./nvidia-installer

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 48 CPUs online; setting concurrency level to 32.
-> Installing NVIDIA driver version 384.130.
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
-> Installing both new and classic TLS OpenGL libraries.
-> Installing both new and classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.384.130"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.384.130"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.384.130"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.384.130"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Checking libGLdispatch dispatch table
   Checking call through libGLdispatch
   All OK
   libGLdispatch is OK
   Checking for libGLX
   libGLX is OK
   Checking for libEGL
   libEGL is OK
   Checking entrypoint library libOpenGL.so.0
   Checking call through libGLdispatch
   Checking call through library libOpenGL.so.0
   All OK
   Entrypoint library libOpenGL.so.0 is OK
   Checking entrypoint library libGL.so.1
   Checking call through libGLdispatch
   Checking call through library libGL.so.1
   All OK
   Entrypoint library libGL.so.1 is OK
   libglvnd appears to be installed.
Will not install libglvnd libraries.
-> Skipping GLVND file: "libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "libGLdispatch.so.0"
-> Skipping GLVND file: "libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
-> Skipping GLVND file: "./32/libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "./32/libGLdispatch.so.0"
-> Skipping GLVND file: "./32/libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "./32/libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "./32/libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "./32/libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "./32/libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (384.130):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Here are the NVIDIA cards installed:

# lspci | grep -i nvidia
09:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
0d:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
0e:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
28:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
2b:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
30:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)
33:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)

The bug_report is also attached.

nvidia-bug-report.log.gz (85.8 KB)

Kernel bug with recent kernels:

[    1.314256] pci 0000:0d:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    1.314653] pci 0000:0d:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    1.315043] pci 0000:0d:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    1.315450] pci 0000:0d:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    1.315851] pci 0000:0c:08.0: PCI bridge to [bus 0d]
[    1.316080] pci 0000:0c:08.0:   bridge window [mem 0xea000000-0xeaffffff]
[    1.316317] pci 0000:0c:08.0:   bridge window [mem 0xe0000000-0xe1ffffff 64bit pref]
[    1.316730] pci 0000:0e:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    1.317130] pci 0000:0e:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    1.317543] pci 0000:0e:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    1.317948] pci 0000:0e:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]

I don’t know whether anyone has ever reported it, and I know of no fix other than downgrading to an older kernel.
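
For anyone checking their own machine for the same symptom, here is a minimal sketch in Python that pulls the failed BAR assignments out of saved dmesg text. The sample lines are copied from the excerpt above; the function name `failed_bars` and the regular expression are my own illustration, not part of any NVIDIA or kernel tool.

```python
import re

# Sample lines copied from the dmesg excerpt in this thread.
SAMPLE = """\
[    1.314256] pci 0000:0d:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    1.314653] pci 0000:0d:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
[    1.315851] pci 0000:0c:08.0: PCI bridge to [bus 0d]
[    1.317130] pci 0000:0e:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]
"""

# Matches only the "failed to assign" lines and captures device, BAR index and size.
FAILED = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign \[mem size (?P<size>0x[0-9a-f]+)"
)


def failed_bars(dmesg_text):
    """Return a sorted list of unique (device, bar_index, size_in_bytes) tuples."""
    found = set()
    for m in FAILED.finditer(dmesg_text):
        found.add((m.group("dev"), int(m.group("bar")), int(m.group("size"), 16)))
    return sorted(found)


if __name__ == "__main__":
    for dev, bar, size in failed_bars(SAMPLE):
        print(f"{dev} BAR {bar}: {size / 2**30:.0f} GiB could not be assigned")
```

To use it on a real system, the output of dmesg could be saved to a file and its contents passed to `failed_bars` instead of `SAMPLE`.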

Thanks for the quick response generix. I have another server running the same kernel level and it is working fine with the nvidia driver version 384.125. If I tried installing that version, would it work? If so, where can I download that version?

No, the driver version doesn’t matter in this situation. The failure happens in the kernel’s early resource assignment phase; the driver gets loaded much later and finds a non-working device.
I don’t really know why this is happening. Previously, two cases could be observed, especially with Teslas, probably depending on the mainboard chipset(?):

  1. BAR1 wants 256MB -> mapped in 32bit address range, works
  2. BAR1 wants 16GB -> mapped in 64bit address range, works
  With recent kernels, the second case somehow fails:
  2b. BAR1 wants 16GB -> mapped in 32bit address range, fails
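
The arithmetic behind case 2b can be sketched in a few lines of Python. The BAR size and the prefetchable bridge window are taken from the dmesg excerpt earlier in the thread; the helper names are hypothetical, and the "smaller than 4 GiB" test is only a necessary condition for fitting, not a sufficient one.

```python
GIB = 2**30
MIB = 2**20


def window_size(start, end):
    """Size in bytes of an inclusive address range [start, end]."""
    return end - start + 1


def could_fit_below_4g(bar_size):
    """Necessary (not sufficient) condition: the BAR must be smaller than
    the entire 32-bit address space."""
    return bar_size < 4 * GIB


bar_small = 256 * MIB      # case 1 in the list above
bar_large = 0x400000000    # "mem size" reported in dmesg for BAR 1

# Prefetchable window of bridge 0000:0c:08.0 as printed in dmesg.
pref_window = window_size(0xE0000000, 0xE1FFFFFF)

if __name__ == "__main__":
    print(f"large BAR:     {bar_large // GIB} GiB")
    print(f"bridge window: {pref_window // MIB} MiB")
    print(f"256 MiB BAR could fit below 4 GiB: {could_fit_below_4g(bar_small)}")
    print(f"16 GiB BAR could fit below 4 GiB:  {could_fit_below_4g(bar_large)}")
```

In other words, a 16 GiB BAR cannot be placed anywhere in the 32-bit range, regardless of how the windows below 4 GiB are laid out.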

Thanks for the response. I have another server that is working fine with the same kernel version and driver version 384.125. I have attached the bug_report for it. Can you check the bug_report and see if there is anything I can do to get the problem server working?
nvidia-bug-report.log.gz (1.24 MB)

The two systems are virtually identical, except that the working system actually has an older BIOS from 2015, while the non-working system’s BIOS is from 2018.
The working system has 32bit and 64bit entries for MTRR and PCI root bus resources:

[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0000C0000000 mask 3FFFC0000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000]   8 disabled
[    0.000000]   9 base 038000000000 mask 3F8000000000 uncachable
[    1.225350] pci_bus 0000:00: root bus resource [mem 0xeb000000-0xf3ffffff window]
[    1.225755] pci_bus 0000:00: root bus resource [mem 0x3c3fc000000-0x3dfffffffff window]

while the non-working system only has 32bit entries for both:

[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0000C0000000 mask 3FFFC0000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000]   8 disabled
[    0.000000]   9 disabled
[    1.151612] pci_bus 0000:00: root bus resource [mem 0xda000000-0xebffffff window]
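
To make the difference between the two logs concrete, the MTRR base/mask pairs can be decoded into address ranges. This is a rough sketch: the 46-bit physical address width is an assumption inferred from the width of the masks printed in the logs, and `mtrr_range` is a hypothetical helper that only handles contiguous masks.

```python
PHYS_BITS = 46  # assumption: inferred from masks such as 0x3FFFC0000000


def mtrr_range(base, mask, phys_bits=PHYS_BITS):
    """Return the inclusive (start, end) range covered by a variable-range
    MTRR whose mask is contiguous."""
    full = (1 << phys_bits) - 1
    size = (~mask & full) + 1
    return base, base + size - 1


# Entry 0 appears in both logs; entry 9 only on the working system.
entry0 = mtrr_range(0x0000C0000000, 0x3FFFC0000000)
entry9 = mtrr_range(0x038000000000, 0x3F8000000000)

# 64-bit root bus window reported on the working system.
high_window = (0x3C3FC000000, 0x3DFFFFFFFFF)

if __name__ == "__main__":
    print(f"entry 0: {entry0[0]:#x}-{entry0[1]:#x}")
    print(f"entry 9: {entry9[0]:#x}-{entry9[1]:#x}")
    covered = entry9[0] <= high_window[0] and high_window[1] <= entry9[1]
    print(f"64-bit root bus window inside entry 9: {covered}")
```

Under that assumption, entry 0 covers the top 1 GiB below 4 GiB, and entry 9 covers a 512 GiB region that contains the working system's 64-bit root bus window.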

Under normal circumstances, this would be configurable in the BIOS through an “Above 4G decoding” option, but in previous cases this didn’t have an effect. Please check for it anyway.
I’ll have a look at the kernel to see whether there’s now some option that gets enabled depending on the BIOS date. At least that’s my best guess for now.
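
A quick way to check whether a machine exposes any PCI root bus memory window above 4 GiB is to scan dmesg for the "root bus resource" lines. The sketch below uses the lines quoted from the two systems; the function names are made up for illustration.

```python
import re

# "root bus resource" lines copied from the two logs in this thread.
WORKING = """\
[    1.225350] pci_bus 0000:00: root bus resource [mem 0xeb000000-0xf3ffffff window]
[    1.225755] pci_bus 0000:00: root bus resource [mem 0x3c3fc000000-0x3dfffffffff window]
"""
NONWORKING = """\
[    1.151612] pci_bus 0000:00: root bus resource [mem 0xda000000-0xebffffff window]
"""

ROOT_RES = re.compile(r"root bus resource \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)")
FOUR_GIB = 1 << 32


def root_mem_windows(dmesg_text):
    """Return the (start, end) pairs of all root bus memory windows found."""
    return [(int(a, 16), int(b, 16)) for a, b in ROOT_RES.findall(dmesg_text)]


def has_window_above_4g(dmesg_text):
    """True if at least one root bus memory window ends at or above 4 GiB."""
    return any(end >= FOUR_GIB for _, end in root_mem_windows(dmesg_text))


if __name__ == "__main__":
    print("working:    ", has_window_above_4g(WORKING))
    print("non-working:", has_window_above_4g(NONWORKING))
```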

Maybe something stupid, but please try adding these kernel parameters:
acpi_osi="!Windows 2017" acpi_osi="!Windows 2017.2"
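
After rebooting, it is worth confirming that the parameters actually reached the kernel by looking at /proc/cmdline. Below is a small Python sketch of such a check. The sample command line is a shortened, hypothetical one, and depending on the bootloader the quotes may wrap either the value or the whole parameter; `shlex.split` normalizes both forms.

```python
import shlex


def cmdline_params(cmdline):
    """Split a kernel command line into parameters, honouring double quotes."""
    return shlex.split(cmdline)


def has_param(cmdline, wanted):
    """True if the exact parameter (with quotes removed) is present."""
    return wanted in cmdline_params(cmdline)


# Hypothetical, shortened command line for illustration only.
SAMPLE = (
    "BOOT_IMAGE=/vmlinuz-3.10.0-957.5.1.el7.x86_64 ro "
    'modprobe.blacklist=nouveau acpi_osi="!Windows 2017" '
    'acpi_osi="!Windows 2017.2"'
)

if __name__ == "__main__":
    for wanted in ("acpi_osi=!Windows 2017", "acpi_osi=!Windows 2017.2"):
        print(f"{wanted!r} present: {has_param(SAMPLE, wanted)}")
```

On a real system, the contents of /proc/cmdline would be read and passed in place of `SAMPLE`.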

Sorry for the delay generix, the server was down the past few days.

I tried using the kernel parameters that you suggested and the NVIDIA driver still didn’t install. I also booted the server using an older kernel, and the NVIDIA driver installation failed with the same error message:

-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (384.130):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Any suggestions as to what to do next?

One more thing, I could not find an “Above 4G decoding” option in the BIOS.

Which exact kernel version are you running now? Furthermore, please attach an acpidump from the working and the non-working machine, if possible.

Here is the kernel I am running now

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.5.1.el7.x86_64 root=UUID=a8037b6b-fd9a-4b52-b147-e1104b143d52 ro crashkernel=auto rd.md.uuid=507323d5:bc32492a:6078ddcf:34e282b6 rd.md.uuid=acdd2793:21c4c87c:a7d99654:9269cf58 rd.lvm.lv=rhel_uslv-papp-gpu02/swap rd.lvm.lv=rhel_uslv-papp-gpu02/swap00 modprobe.blacklist=nouveau LANG=en_US.UTF-8 rd.driver.blacklist=nouveau nvidia-drm.modeset=1

# uname -r
3.10.0-957.5.1.el7.x86_64

Working on the acpidump now, will upload shortly.

Attached are the acpidump output files.
working.txt (361 KB)
nonworking.txt (361 KB)

generix - what do you think of the discussion below?

https://access.redhat.com/discussions/3830601

At the bottom of the discussion thread the following is stated:

"Hi Prem, it's me again ... I have tested the new drivers and checked something. You most probably
are having the latest kernel-headers version 3.10.0-957.5.1.el7 installed and that version may not
be able to build the dkms module for the old kernel 3.10.0-957.el7 - (the first kernel that shipped
with GA of RHEL 7.6) - so, you have to live with this situation, because downgrading the headers
wouldn't be a good idea. When the next kernel gets released the modules get built automatically.
Anyway, don't worry about it ... important is that the new drivers work with the current kernel. :)"

Do I need to upgrade the kernel?

The current kernel is the same as the previous one.

It looks like this is partly triggered by the BIOS changes: from the 2015 version to the 2018 version, the 64bit resources have been removed. Diff between nonworking (-) and working (+):

-            0xDA000000,         // Range Minimum
-            0xEBFFFFFF,         // Range Maximum
+            0xEB000000,         // Range Minimum
+            0xF3FFFFFF,         // Range Maximum
             0x00000000,         // Translation Offset
-            0x12000000,         // Length
+            0x09000000,         // Length
+            ,, , AddressRangeMemory, TypeStatic)
+        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
+            0x0000000000000000, // Granularity
+            0x000003C3FC000000, // Range Minimum
+            0x000003DFFFFFFFFF, // Range Maximum
+            0x0000000000000000, // Translation Offset
+            0x0000001C04000000, // Length

So the remaining odd bug is why the kernel tries to assign 64bit resources if there aren’t any.
A quick workaround would be to simply downgrade the BIOS, if possible.
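
As a sanity check on the diff, the Length fields can be recomputed from each Range Minimum and Range Maximum. This is a minimal Python sketch using only the values quoted above; the labels in the table are my own.

```python
def range_length(minimum, maximum):
    """Length of an inclusive ACPI address range."""
    return maximum - minimum + 1


# (Range Minimum, Range Maximum, Length) as quoted in the acpidump diff.
RANGES = {
    "nonworking 32-bit": (0xDA000000, 0xEBFFFFFF, 0x12000000),
    "working 32-bit": (0xEB000000, 0xF3FFFFFF, 0x09000000),
    "working 64-bit": (0x000003C3FC000000, 0x000003DFFFFFFFFF, 0x0000001C04000000),
}

if __name__ == "__main__":
    for label, (lo, hi, declared) in RANGES.items():
        computed = range_length(lo, hi)
        status = "matches" if computed == declared else "MISMATCH"
        print(f"{label}: {computed / 2**20:.0f} MiB, {status} declared Length")
```

The ranges also line up with the root bus resources that the kernel printed for each system, which suggests the kernel is simply reporting what the firmware declares.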

The Red Hat thread does not apply. As I said, the driver builds fine but cannot be loaded because the kernel fails to assign resources beforehand.

Ok thanks - do you have a link to where I can find the NVIDIA driver version 384.125?
I am working with HP regarding how to downgrade the BIOS.

http://us.download.nvidia.com/tesla/384.125/NVIDIA-Linux-x86_64-384.125.run
Shouldn’t be necessary, though.

Thanks for the link. Tried installing 384.125, received this error:

Failed to run `/sbin/dkms add -m nvidia -v 384.125 -k 3.10.0-957.5.1.el7.x86_64`: Error! DKMS tree already contains: nvidia-384.125
You cannot add the same module/version combo more than once.

Is this worth pursuing?

Try running the installer with the --uninstall option first.

Thanks, ran it with the --uninstall option, logs attached.
nvidia-installer.log (25.7 KB)
nvidia-uninstall.log (1.15 KB)