I had an earlier topic, but I’m afraid its title doesn’t describe the situation well, and things are just getting worse, so I’m opening a new one.
Since 8.3 came out, the packaged drivers from Nvidia have not been working at all for display. Now it’s even worse: the drivers can’t even be loaded.
Steps I took:
uname -r
4.18.0-240.10.1.el8_3.x86_64+debug
sudo dnf module install nvidia-driver:450
sudo insmod /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko
insmod: ERROR: could not insert module /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko: Invalid module format
file /lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=51a6dc8b325a5d543ed0b229ddc3fc326881a610, with debug_info, not stripped
RH changed their 8.3 kernel in a way that breaks the nvidia driver (as in your first bug report), so the precompiled driver was removed for those kernels. It’s now trying to load the driver built for the -193 kernel (likely the last one that worked), which can’t work at all because the kernel version is wrong. The only thing you can do until nvidia works this out is boot into the older, working kernel.
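To illustrate the mismatch described above, here is a minimal sketch that compares the kernel release embedded in a module's install path with a given release. It assumes the conventional /lib/modules/<release>/... layout; the function names module_kernel_release and module_matches_kernel are made up for this example and are not part of any NVIDIA or Red Hat tooling.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the kernel release a module was installed for, taken from its path.
# Assumes the path looks like /lib/modules/<release>/...
module_kernel_release() {
    rest="${1#/lib/modules/}"      # drop the leading /lib/modules/
    printf '%s\n' "${rest%%/*}"    # keep everything up to the next slash
}

# Succeed only if the module path belongs to the given kernel release.
module_matches_kernel() {
    [ "$(module_kernel_release "$1")" = "$2" ]
}

# Example call (not executed here), where $ko holds the path to nvidia.ko:
#   module_matches_kernel "$ko" "$(uname -r)" || echo "module was built for another kernel"
```

With the path and the running kernel from the first post, the check fails, which is consistent with the "Invalid module format" error. Running modinfo -F vermagic on the .ko file should give a more authoritative answer than the path, since it reports the release the module was actually built against.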
Yep, I just noticed I tried the wrong version, blind me. I got confused while booting around different versions trying to make it work. So basically, my PC is useless as a workstation with an Nvidia card for the time being.
Are there any plans to make the driver work with the current kernels? Or does it work if installed locally some other way, such as the scripted install? Or are the same binaries behind all of the different install methods?
Try using the standard non-debug kernel
I’ve tried all released kernels since 8.3 came out, debug or not. There is something to fix in Nvidia’s build process.
generix
January 11, 2021, 11:02am
IDK if the just-released 460.32.03 driver fixes things with RHEL 8.3 kernels. It might be worth a shot, though you’d have to install it using the .run installer and have your build system, including kernel headers, set up, which is not really recommended.
But leigh123linux is right, always make sure you’re not booting into a -debug kernel, the nvidia driver won’t work on those at all.
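A quick way to rule out the -debug case is to test the release string itself. This is only a sketch: it assumes debug kernels can be recognised by the +debug suffix, as in the uname -r output from the first post, and is_debug_kernel is a name invented for this example.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# Succeed if the given kernel release string ends in "+debug".
# Assumption: debug kernels carry that suffix, as seen earlier in this thread.
is_debug_kernel() {
    case "$1" in
        *+debug) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example call (not executed here):
#   is_debug_kernel "$(uname -r)" && echo "booted into a debug kernel"
```

If it reports a debug kernel, pick the standard kernel entry at boot before trying the driver again.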
Finally it works. These are the steps to get it right on RHEL8.3:
sudo dnf remove nvidia-driver
sudo dnf module reset nvidia-driver
sudo dnf module install nvidia-driver:450-dkms
reboot
Hi @ilkka.tengvall_nv
Sorry just saw your other post (I’m in Pacific timezone).
Yes, as @leigh123linux pointed out, the precompiled packaging only works with official standard RHEL 8.3 kernels, not debug or EUS ones. Basically, the kernel string (e.g. 4.18.0-240.10.1.el8_3.x86_64) must match exactly, which is what the dnf plugin looks for.
Thanks, but it hasn’t worked for me using standard kernels since 8.3 came out. I only use the distribution’s kernels, nothing custom here. Now this -dkms is the only way to get it going.
Sorry, I was too hasty. It worked for one boot, but now it’s back at the blank screen state as it has been for months.
╰─➤ uname -r
4.18.0-240.10.1.el8_3.x86_64
╰─➤ lsmod |grep nvidia
nvidia_drm 53248 0
nvidia_modeset 1183744 1 nvidia_drm
nvidia 19718144 7 nvidia_modeset
drm_kms_helper 217088 1 nvidia_drm
drm 557056 3 drm_kms_helper,nvidia_drm
╰─➤ rpm -qa 'nvidia*'
nvidia-xconfig-450.80.02-1.el8.x86_64
nvidia-libXNVCtrl-450.80.02-1.el8.x86_64
nvidia-driver-libs-450.80.02-1.el8.x86_64
nvidia-driver-450.80.02-1.el8.x86_64
nvidia-libXNVCtrl-devel-450.80.02-1.el8.x86_64
nvidia-driver-cuda-450.80.02-1.el8.x86_64
nvidia-settings-450.80.02-1.el8.x86_64
nvidia-driver-NVML-450.80.02-1.el8.x86_64
nvidia-driver-devel-450.80.02-1.el8.x86_64
nvidia-kmod-common-450.80.02-1.el8.noarch
nvidia-persistenced-450.80.02-1.el8.x86_64
nvidia-modprobe-450.80.02-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-450.80.02-1.el8.x86_64
nvidia-driver-cuda-libs-450.80.02-1.el8.x86_64
╰─➤ dmesg |grep nvidia
[ 3.747369] nvidia: loading out-of-tree module taints kernel.
[ 3.747381] nvidia: module license ‘NVIDIA’ taints kernel.
[ 3.756471] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 3.766821] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 3.767973] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 4.055061] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 4.058667] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 4.058669] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 17.349211] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 34.247692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 51.013963] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 67.778645] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 84.535692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 101.295577] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
And from dmesg, these are the lines surrounding the latest BAR errors:
[ 17.348903] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 17.349211] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 17.606171] usb 1-4: reset high-speed USB device number 4 using ehci-pci
[ 33.604667] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 33.604699] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 34.247380] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 34.247692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 50.412435] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 50.412468] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 51.013674] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 51.013963] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 67.179299] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 67.179332] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 67.778330] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 67.778645] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 83.943277] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 83.943310] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 84.535389] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 84.535692] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 100.699711] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 100.699744] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 101.295282] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 101.295577] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 117.460527] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[ 117.460561] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
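To see how often the adapter initialisation fails without scrolling through the whole log, the failures can be counted from dmesg output. A minimal sketch follows; count_rm_init_failures is a name made up for this example, and the pattern is simply the "RmInitAdapter failed" text from the lines above.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# Read kernel log lines on stdin and print how many of them report
# "RmInitAdapter failed". grep -c exits non-zero when the count is 0,
# so "|| true" keeps the function usable under "set -e".
count_rm_init_failures() {
    grep -c 'RmInitAdapter failed' || true
}

# Example call (not executed here; dmesg may need root on some systems):
#   dmesg | count_rm_init_failures
```

A count that keeps growing after boot, as in the log above where the failure repeats roughly every 17 seconds, suggests something is retrying the initialisation in a loop rather than failing once.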
nvidia-bug-report.log.gz (1.2 MB)
I attached here the full bug report.
Is this somehow a case of the BIOS or some chip having a different idea of PCI bus memory addressing than what the driver sees? In any case, it worked for a rather long time, a couple of years, until it broke like this.
@kmittman, what would you need from my system to understand the case?
generix
January 12, 2021, 11:47am
Before you start banging your head against the wall: it’s a problem with some incompatible changes in newer RH kernels, and nvidia is very well aware of it:
https://forums.developer.nvidia.com/t/nvidia-smi-no-devices-were-found-on-centos-8-3-and-460-27-04/164697?u=generix
amrits
January 12, 2021, 7:29pm
Please have a look at RHEL Bugzilla 1904213 for more updates since this is a RHEL bug.
Thanks for the pointer, I get it. I’ll report back on success once the kernel is out.
Thanks for the help everyone!