Driver doesn't get loaded on RHEL8

ilkka.tengvall_nv · January 11, 2021, 8:54am

I had an earlier topic, but I’m afraid its title doesn’t describe the situation well, and things are just getting worse, so I’m opening a new one.

Since 8.3 came out, the packaged drivers from Nvidia have not been working at all for display. Now it’s even worse, the drivers can’t even be loaded.

Steps I took:

uname -r
    4.18.0-240.10.1.el8_3.x86_64+debug
sudo dnf module install nvidia-driver:450
sudo insmod /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko 
insmod: ERROR: could not insert module /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko: Invalid module format
file /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=51a6dc8b325a5d543ed0b229ddc3fc326881a610, with debug_info, not stripped

generix · January 11, 2021, 9:09am

RH changed their 8.3 kernel in a way that the nvidia driver doesn’t work anymore (like in your first bug report), so the precompiled driver was removed for those kernels. So it’s now trying to load the driver for the -193 kernel (likely the last one working), which doesn’t work at all (wrong kernel version). The only thing you can do until nvidia has worked this out is to boot into the older, working kernel.
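A minimal sketch of the mismatch described above, assuming the module lives under the usual /lib/modules/<release>/ layout. The function names are hypothetical helpers, not part of any NVIDIA or Red Hat tooling. It compares the kernel release encoded in the module path with the running kernel, which is the situation that produces "Invalid module format" in the first post:

```shell
# Hypothetical helper: pull the kernel release out of a module path,
# e.g. /lib/modules/<release>/extra/... -> <release>
module_kernel_from_path() {
    printf '%s\n' "$1" | sed -n 's|^/lib/modules/\([^/]*\)/.*|\1|p'
}

# Hypothetical helper: compare that release with the running kernel.
# The second argument is optional and defaults to `uname -r`.
check_module_matches_kernel() {
    module_path=$1
    running=${2:-$(uname -r)}
    built_for=$(module_kernel_from_path "$module_path")
    if [ "$built_for" = "$running" ]; then
        echo "match"
    else
        echo "mismatch: module is for $built_for, running $running"
    fi
}
```

The path is only a hint about what the module was built for. Running modinfo -F vermagic on the .ko file prints the version string the kernel actually checks at load time.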

ilkka.tengvall_nv · January 11, 2021, 9:20am

Yep, I just noticed I tried the wrong version, my mistake. I got confused while booting into different versions trying to make it work. So basically, my PC is useless as a workstation with an Nvidia card for the time being.

Are there any plans to make the driver work with the current kernels? Or does it work if re-installed locally some other way, like the scripted install? Or are the same binaries behind all of the different install methods?

leigh123linux · January 11, 2021, 10:00am

Try using the standard non-debug kernel

ilkka.tengvall_nv · January 11, 2021, 10:55am

I’ve tried all released kernels since 8.3 came out, debug or not. There is something to fix in Nvidia’s build process.

generix · January 11, 2021, 11:02am

IDK if the just-released 460.32.03 driver fixes things with RHEL 8.3 kernels. It might be worth a shot, though you’d have to install it using the .run installer and have your build system, including kernel headers, set up, which is not really recommended.
But leigh123linux is right: always make sure you’re not booting into a -debug kernel, because the nvidia driver won’t work on those at all.

ilkka.tengvall_nv · January 11, 2021, 6:06pm

Finally it works. These are the steps to get it right on RHEL8.3:

sudo dnf remove nvidia-driver
sudo dnf module reset nvidia-driver
sudo dnf module install nvidia-driver:450-dkms
reboot

kmittman · January 11, 2021, 6:44pm

Hi @ilkka.tengvall_nv
Sorry just saw your other post (I’m in Pacific timezone).

Yes, as @leigh123linux pointed out, the precompiled packaging only works with official standard RHEL 8.3 kernels, not with debug or EUS ones. Basically the kernel string (e.g. 4.18.0-240.10.1.el8_3.x86_64) must match exactly, which is what the dnf plugin looks for.
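As a rough illustration of the debug-kernel point, here is a small sketch that flags a kernel string ending in +debug, as in the uname -r output from the first post. The function name is hypothetical, and it only covers the debug case: how EUS kernels can be recognized is not shown in this thread, so that is left out.

```shell
# Hypothetical helper: report whether a kernel release string
# carries the "+debug" suffix seen in the first post.
is_debug_kernel() {
    case "$1" in
        *+debug) echo yes ;;
        *)       echo no ;;
    esac
}
```

To check the running kernel, call it as is_debug_kernel "$(uname -r)".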

ilkka.tengvall_nv · January 11, 2021, 8:28pm

Thanks, but it hasn’t worked for me using standard kernels since 8.3 came out. I only use the distribution’s kernels, nothing custom here. Now this -dkms is the only way to get it going.

ilkka.tengvall_nv · January 12, 2021, 7:18am

Sorry, I was too hasty. It worked for one boot, but now it’s back at the blank screen state as it has been for months.

╰─➤ uname -r
4.18.0-240.10.1.el8_3.x86_64
╰─➤ lsmod |grep nvidia
nvidia_drm 53248 0
nvidia_modeset 1183744 1 nvidia_drm
nvidia 19718144 7 nvidia_modeset
drm_kms_helper 217088 1 nvidia_drm
drm 557056 3 drm_kms_helper,nvidia_drm
╰─➤ rpm -qa 'nvidia*'
nvidia-xconfig-450.80.02-1.el8.x86_64
nvidia-libXNVCtrl-450.80.02-1.el8.x86_64
nvidia-driver-libs-450.80.02-1.el8.x86_64
nvidia-driver-450.80.02-1.el8.x86_64
nvidia-libXNVCtrl-devel-450.80.02-1.el8.x86_64
nvidia-driver-cuda-450.80.02-1.el8.x86_64
nvidia-settings-450.80.02-1.el8.x86_64
nvidia-driver-NVML-450.80.02-1.el8.x86_64
nvidia-driver-devel-450.80.02-1.el8.x86_64
nvidia-kmod-common-450.80.02-1.el8.noarch
nvidia-persistenced-450.80.02-1.el8.x86_64
nvidia-modprobe-450.80.02-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-450.80.02-1.el8.x86_64
nvidia-driver-cuda-libs-450.80.02-1.el8.x86_64
╰─➤ dmesg |grep nvidia
[ 3.747369] nvidia: loading out-of-tree module taints kernel.
[ 3.747381] nvidia: module license 'NVIDIA' taints kernel.
[ 3.756471] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 3.766821] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 3.767973] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.055061] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 4.058667] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 4.058669] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 17.349211] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 34.247692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 51.013963] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 67.778645] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 84.535692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 101.295577] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs

And from dmesg, these are the messages around the BAR problems:

[ 17.348903] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 17.349211] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 17.606171] usb 1-4: reset high-speed USB device number 4 using ehci-pci
[ 33.604667] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 33.604699] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 34.247380] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 34.247692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 50.412435] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 50.412468] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 51.013674] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 51.013963] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 67.179299] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 67.179332] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 67.778330] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 67.778645] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 83.943277] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 83.943310] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 84.535389] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 84.535692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 100.699711] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 100.699744] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 101.295282] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 101.295577] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 117.460527] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 117.460561] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

nvidia-bug-report.log.gz (1.2 MB)
I’ve attached the full bug report here.
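To see at a glance how often the adapter initialization fails in a log like the one above, a small filter can count the matching lines. This is only a sketch: the function name is hypothetical, and it matches the literal "RmInitAdapter failed" text from the dmesg output quoted in this post.

```shell
# Hypothetical helper: count "RmInitAdapter failed" lines on stdin.
# grep -c prints 0 but exits non-zero when nothing matches,
# so "|| true" keeps the function's exit status clean.
count_rm_init_failures() {
    grep -c 'RmInitAdapter failed' || true
}
```

Typical use would be dmesg | count_rm_init_failures, possibly with sudo, depending on whether unprivileged users may read the kernel log.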

ilkka.tengvall_nv · January 12, 2021, 7:21am

Is this somehow a case of the BIOS or some chip having a different idea of PCI bus memory addressing than what the driver sees? Anyhow, it worked for a rather long time, a couple of years, until it broke like this.

@kmittman what would you need from my system to get to understand the case?

generix · January 12, 2021, 11:47am

Before you start banging your head against the wall: it’s a problem with some incompatible changes in newer RH kernels, and nvidia is very well aware of it:
https://forums.developer.nvidia.com/t/nvidia-smi-no-devices-were-found-on-centos-8-3-and-460-27-04/164697?u=generix

amrits · January 12, 2021, 7:29pm

Please have a look at RHEL Bugzilla 1904213 for more updates since this is a RHEL bug.

ilkka.tengvall_nv · January 12, 2021, 8:35pm

Thanks for the pointer. I get it. I will report back the success once the kernel is out.

Thanks for the help everyone!
