Driver allocating memory over PCI slot size

Hi,

I have a GTX 1050 Ti GPU and driver version 450.80.02 on RHEL 8.3. The display does not come up at the point where graphics should start. This happened repeatedly today; sometimes I got past it by rebooting into an older kernel from GRUB.

Earlier it worked with the development driver, but I had to downgrade because Flathub's Flatpak-packaged software has no support for the NVIDIA development drivers. Even with those drivers, I always had to switch the port of the LG 4K monitor at each boot to get it recognized. With the current drivers it worked at first, but now it is broken. The displays are a Samsung Full HD monitor (DVI) and an LG 4K monitor (HDMI).

dmesg shows these NVIDIA-related lines:

[  +0.000151] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input16
[  +0.000214] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input17
[  +0.000237] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input18
[  +0.000142] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input19
[  +0.000104] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input20
[  +0.000532] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[  +0.087752] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.80.02  Wed Sep 23 01:13:39 UTC 2020
[  +0.051973] nvidia-uvm: Loaded the UVM driver, major device number 239.
[  +0.030732] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.80.02  Wed Sep 23 00:48:09 UTC 2020
[  +0.006106] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  +0.000003] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[  +0.648023] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000208] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[  +0.080295] usb 1-4: reset high-speed USB device number 4 using ehci-pci
[  +0.585993] virbr0: port 1(virbr0-nic) entered disabled state
[ +15.592257] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.644779] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000233] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[Nov18 10:05] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.591821] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000209] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ +16.164278] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000033] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.589168] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000244] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ +16.165303] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000029] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.590336] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000219] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[Nov18 10:06] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.587230] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000249] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ +16.164630] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1266)
[  +0.000030] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
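To see at a glance how often the driver failed and whether the error code is always the same, the kernel log can be filtered down to the RmInitAdapter lines. A minimal shell sketch, assuming GNU grep, sed, sort and uniq; the function name `summarize_rminit` is hypothetical:

```shell
# Hypothetical helper: summarize NVRM "RmInitAdapter failed!" lines from kernel-log
# text on stdin, printing each distinct error code with its number of occurrences.
summarize_rminit() {
  # grep -o keeps only the matching fragment, e.g. "RmInitAdapter failed! (0x26:0x65:1266)";
  # sed then keeps just the code between the parentheses.
  grep -o 'RmInitAdapter failed! ([^)]*)' \
    | sed 's/.*(\(.*\))/\1/' \
    | sort | uniq -c
}

# Usage (reading dmesg may require root):
#   dmesg | summarize_rminit
```

For the excerpt above it would print a single line: a count of 6 for code 0x26:0x65:1266, i.e. every retry fails the same way.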

And the X server log (/var/log/Xorg.0.log) ends with this:

[   101.396] (II) NVIDIA GLX Module  450.80.02  Wed Sep 23 00:51:32 UTC 2020
[   101.396] (II) NVIDIA: The X server supports PRIME Render Offload.
[   101.396] (WW) NVIDIA(0): Failed to initialize Base Mosaic!  Reason: Only one GPU
[   101.396] (WW) NVIDIA(0):     detected.  Only one GPU will be used for this X screen.
[   117.809] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0.  Please
[   117.809] (EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
[   117.809] (EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
[   117.809] (EE) NVIDIA(GPU-0):     README for additional information.
[   117.809] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[   117.809] (EE) NVIDIA(0): Failing initialization of X screen
[   117.809] (II) UnloadModule: "nvidia"
[   117.809] (II) UnloadSubModule: "glxserver_nvidia"
[   117.809] (II) Unloading glxserver_nvidia
[   117.809] (II) UnloadSubModule: "wfb"
[   117.809] (II) UnloadSubModule: "fb"
[   117.809] (EE) Screen(s) found, but none have a usable configuration.
[   117.809] (EE) 
Fatal server error:
[   117.809] (EE) no screens found(EE) 
[   117.809] (EE) 
Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
[   117.809] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[   117.809] (EE) 
[   117.812] (EE) Server terminated with error (1). Closing log file.

Modinfo:

modinfo nvidia
filename:       /lib/modules/4.18.0-240.1.1.el8_3.x86_64/extra/drivers/video/nvidia/nvidia.ko
alias:          char-major-195-*
version:        450.80.02
supported:      external
license:        NVIDIA
rhelversion:    8.3
srcversion:     2132A76E28730AB295AF17B

Under Windows both monitors work fine; this is a dual-boot machine. It has worked fine for a long time.

BR,

ikke

Now I rebooted into the older kernel, and it works. This is the dmesg output from that boot:

dmesg -H | grep -i -A2 -B2 nvidia
[  +0.000071] input: HDA ATI SB Front Headphone as /devices/pci0000:00/0000:00:14.2/sound/card1/input22
[  +0.057355] usbcore: registered new interface driver snd-usb-audio
[  +0.152953] nvidia: loading out-of-tree module taints kernel.
[  +0.000011] nvidia: module license 'NVIDIA' taints kernel.
[  +0.000000] Disabling lock debugging due to kernel taint
[  +0.009578] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  +0.011320] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[  +0.013077] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[  +0.023867] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input23
[  +0.000096] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input24
[  +0.000074] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:01:00.1/sound/card2/input25
[  +0.059442] kvm: Nested Paging enabled
[  +0.004572] MCE: In-kernel MCE decoding enabled.
--
               Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
               (Note that use of the override may cause unknown side effects.)
[  +0.032696] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.80.02  Wed Sep 23 01:13:39 UTC 2020
[  +0.051806] nvidia-uvm: Loaded the UVM driver, major device number 238.
[  +0.030977] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.80.02  Wed Sep 23 00:48:09 UTC 2020
[  +0.007377] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  +0.000002] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[  +0.499297] XFS (dm-2): Mounting V5 Filesystem
[  +0.000964] XFS (sdd2): Mounting V5 Filesystem
--
[  +0.061189] virbr0: port 1(virbr0-nic) entered disabled state
[  +0.637155] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[  +0.000229] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[  +0.531917] usb 1-4: reset high-speed USB device number 4 using ehci-pci
[  +0.093284] hrtimer: interrupt took 3008553 ns

modinfo:

modinfo nvidia
filename:       /lib/modules/4.18.0-193.28.1.el8_2.x86_64/extra/drivers/video/nvidia/nvidia.ko
alias:          char-major-195-*
version:        450.80.02
supported:      external
license:        NVIDIA
rhelversion:    8.2
srcversion:     2132A76E28730AB295AF17B
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        
name:           nvidia
vermagic:       4.18.0-193.28.1.el8_2.x86_64 SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         NVIDIA
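One quick sanity check when comparing the two modinfo outputs is whether the module's vermagic matches the kernel that is actually running; a match rules out a plain build/kernel mismatch and points the investigation elsewhere. A small sketch, assuming kmod's `modinfo -F` field selector; `vermagic_matches` is a hypothetical name:

```shell
# Hypothetical helper: compare the kernel release a module was built for
# (first word of its vermagic string) with a running kernel release.
vermagic_matches() {
  local vermagic="$1" running="$2"
  local built_for="${vermagic%% *}"   # drop everything from the first space onward
  if [ "$built_for" = "$running" ]; then
    echo "match: $built_for"
  else
    echo "mismatch: module built for $built_for, running $running"
    return 1
  fi
}

# Usage on a live system:
#   vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)"
```

Run on both the working and the failing kernel, this only tells you whether the right build was picked up; it says nothing about why RmInitAdapter fails.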

I have the same issue with RHEL 8.3, an RTX 2070 and the 455.45.01 driver. Upgrading from RHEL 8.2 to 8.3 seems to have broken the driver, which was working before the dnf update.

Just FYI, the NVIDIA driver won’t work on the RHEL 8.3 4.18.0-240 kernel either. Of the kernels on my RHEL 8.3 box, the -193 kernel from 8.2 is the last one that works with NVIDIA:

kernel-core-4.18.0-193.28.1.el8_2.x86_64
kernel-core-4.18.0-240.1.1.el8_3.x86_64
kernel-core-4.18.0-240.8.1.el8_3.x86_64

So if you are on RHEL 8, avoid either kernel updates or NVIDIA. It would be nice if NVIDIA worked on a fix for this.
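For the "avoid kernel updates" route, one option on RHEL 8 is to make the last working kernel the default boot entry with `grubby` and keep dnf from pulling in new kernel packages. A sketch, assuming the 4.18.0-193.28.1.el8_2 kernel listed above is still installed; `pin_kernel` and the `APPLY` switch are hypothetical, and by default it only prints the commands:

```shell
# Hypothetical helper: keep booting a known-good kernel and skip kernel packages on upgrade.
# Dry run by default (commands are only printed); set APPLY=1 to run them as root.
pin_kernel() {
  local release="$1"                # e.g. 4.18.0-193.28.1.el8_2.x86_64
  local run=echo
  if [ "${APPLY:-0}" = 1 ]; then run=; fi
  $run grubby --set-default "/boot/vmlinuz-$release"
  $run dnf upgrade --exclude='kernel*'
}

# Usage: review first, then apply
#   pin_kernel 4.18.0-193.28.1.el8_2.x86_64
#   sudo APPLY=1 bash -c '. ./pin_kernel.sh; pin_kernel 4.18.0-193.28.1.el8_2.x86_64'
```

This also holds back kernel security fixes, so it is a stopgap until a fixed kernel is available. The `pin_kernel.sh` file name in the usage line is likewise only an assumption about where the function is saved.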

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
In the non-working case, of course.


@generix thanks for the advice; the log is attached. In the log you’ll find that some of the boots were successful. That’s because I use the old kernel to get my work done. The 8.3 kernels fail 100% of the time. I installed the driver with:

sudo dnf module install nvidia-driver:450

I use 450 because a Flatpak app I need only has the Flatpak NVIDIA userland parts for that version.

nvidia-bug-report.log.gz (519.3 KB)

Are the working and non-working kernels compiled using the same gcc version?

They are both official RHEL 8 kernels; I did not compile either of them. I’d be surprised if the GCC major version had changed between them. Within a RHEL release, updates should mainly bring in bug and security fixes, as RHEL also promises ABI compatibility within a release.

Also note that I did not compile the NVIDIA DKMS module; I use the prebuilt modules from the NVIDIA repo. I have no idea what build system they use.
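To answer the GCC question without guessing, the compiler that built the running kernel is recorded in /proc/version, and ELF objects usually carry GCC identification strings in their .comment section, which `readelf -p .comment` dumps. A sketch, assuming the `gcc version X.Y.Z` wording that 4.18-era kernels use in /proc/version; `kernel_gcc_version` is a hypothetical name:

```shell
# Hypothetical helper: print the GCC version from a /proc/version-style string.
# Assumes the "gcc version X.Y.Z" wording; prints nothing if that text is absent.
kernel_gcc_version() {
  printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Usage on a live system, once per kernel being compared:
#   kernel_gcc_version "$(cat /proc/version)"
#   readelf -p .comment "$(modinfo -n nvidia)"   # GCC strings embedded in this kernel's nvidia.ko
```

A prebuilt module may list more than one compiler string in .comment, since its parts are not necessarily built in one go.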

I’m having the same issue, with an identical driver, kernel and GPU. I’m not sure what the problem is between X and NVIDIA, but I’ve found that after X times out and leaves you with a blinking cursor, switching to VT2 (without logging in) and then back to VT1 (the X session) works around the issue, albeit poorly.

This is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1915814

Thanks for the update.

Red Hat has released an erratum for this RmInitAdapter issue: RHSA-2021:0558
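Once the kernel from RHSA-2021:0558 is installed (for example with `dnf upgrade --advisory=RHSA-2021:0558`) and the machine has been rebooted, the kernel log is the quickest place to confirm the fix. A minimal sketch; `gpu_init_ok` is a hypothetical helper that only checks for the failure message seen earlier in this thread, so a clean result is necessary but not sufficient for a working display:

```shell
# Hypothetical helper: read kernel-log text on stdin and succeed only if no
# NVRM "RmInitAdapter failed" message is present.
gpu_init_ok() {
  if grep -q 'RmInitAdapter failed'; then
    echo "NVRM still reports RmInitAdapter failures"
    return 1
  fi
  echo "no RmInitAdapter failures logged"
}

# Usage after rebooting into the updated kernel (dmesg may require root):
#   uname -r
#   dmesg | gpu_init_ok
```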