NVIDIA-Linux-x86_64-340.104.run caused a problem in RHEL 7.4

Hello,

I have a workstation that was running RHEL 7.3 and was recently updated to RHEL 7.4. After the update the system won’t boot to GNOME. On the advice of Red Hat tech support, I downloaded the latest nvidia driver, 340.104, and installed it. With the updated nvidia driver, my system still won’t boot to GNOME. It gave some error in /usr/share/X11/xorg.conf.d. Since your driver is IP protected, Red Hat tech support won’t be able to find what is wrong, so I need your help with troubleshooting. Thank you in advance!

Regards,
Haoming

Please run nvidia-bug-report.sh as root and attach the tar.gz file it creates to your post.

Thanks. I generated the report, but I don’t see anywhere to attach it.

nvidia-bug-report.log.gz (78.5 KB)

sorry. got it. HZ

That’s a bit tricky. The 340.104 driver installs fine and its kernel driver works, but on reboot an old 340.102 kernel driver gets loaded from somewhere instead of the freshly installed one. Please post the output of
modinfo nvidia
so we can see where it comes from.

The output of
dkms status
could also be of interest.
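
For reference, the version on disk and the version actually loaded can be compared with something like the sketch below. It assumes the loaded module reports itself in /proc/driver/nvidia/version with a first line of the form "NVRM version: NVIDIA UNIX x86_64 Kernel Module 340.102 ..."; the function name loaded_nvidia_version and its optional file argument are only illustrative.

```shell
#!/bin/sh
# Sketch: compare the nvidia module modprobe would pick with the one that is loaded.

# Print the version of the loaded nvidia kernel module.
# Assumption: the first line of the file looks like
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  340.102  <build date>
# The optional argument (hypothetical, handy for testing) overrides the file to read.
loaded_nvidia_version() {
    file=${1:-/proc/driver/nvidia/version}
    if [ ! -r "$file" ]; then
        echo "not loaded"
        return 1
    fi
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' "$file"
}

# Run as root on the affected machine:
#   modinfo -F filename nvidia   # path of the .ko that modprobe would load
#   modinfo -F version nvidia    # version of that file on disk
#   loaded_nvidia_version        # version of the module the kernel actually loaded
```

If the last two numbers differ, the module in memory did not come from the file modinfo points at.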

ok I’ll get both.

The output of modinfo nvidia is attached. I don’t have dkms installed, so I could not get the dkms status. If you need it, please let me know whether dkms can be installed while my system is running in text mode.

Many thanks,
Haoming
modinfo.txt (1.33 KB)

In RHEL 7.4 I have four kernels to choose from at boot. I don’t know whether the old driver could be coming from one of the other kernels. Just a thought.

No need to install dkms; I just wanted to know whether it is used or not.
Now the mystery has got bigger: modinfo says it would load the .104 driver, yet the kernel loads .102.
So I think the old nvidia driver is embedded in the initrd. Please reinstall the kernel you’re currently using; this should rebuild the initrd.
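
If reinstalling the kernel is inconvenient, inspecting and regenerating the initrd directly should have the same effect. The following is only a sketch for RHEL 7, assuming the default image name /boot/initramfs-<kernel version>.img and the stock lsinitrd and dracut tools; the function names are illustrative.

```shell
#!/bin/sh
# Sketch: find out whether an nvidia module is packed into the initramfs, then rebuild it.

# Path of the initramfs image for a kernel version (defaults to the running kernel).
# Assumption: the RHEL 7 default naming scheme /boot/initramfs-<kver>.img.
initramfs_path() {
    kver=${1:-$(uname -r)}
    echo "/boot/initramfs-${kver}.img"
}

# List any nvidia files contained in the image; prints nothing if there are none.
nvidia_in_initramfs() {
    lsinitrd "$(initramfs_path "$1")" | grep -i nvidia
}

# Keep a backup of the current image, then regenerate it for the given kernel.
rebuild_initramfs() {
    kver=${1:-$(uname -r)}
    img=$(initramfs_path "$kver")
    cp -p "$img" "${img}.bak" && dracut --force "$img" "$kver"
}

# Run as root on the affected machine:
#   nvidia_in_initramfs    # any output means an nvidia module is inside the image
#   rebuild_initramfs      # then reboot and check the loaded driver version again
```

The backup copy is there so the previous image can be restored from a rescue shell if the new one does not boot.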

Joe from Red Hat asked me to send you his diagnosis in the hope that it helps. Here are his comments:
Most recent comment: On 2017-10-20 16:15:33, Wright, Joe commented:
"Good Afternoon,

Yeah, this looks like what I originally thought. It can’t find any displays:

[ 60.325] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 60.325] (EE) NVIDIA: system's kernel log for additional error messages and
[ 60.325] (EE) NVIDIA: consult the NVIDIA README for details.
[ 60.325] (EE) No devices detected.
[ 60.325] (EE)
Fatal server error:
[ 60.325] (EE) no screens found(EE)
[ 60.326] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[ 60.326] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[ 60.326] (EE)
[ 60.326] (EE) Server terminated with error (1). Closing log file.

I don’t see any problems with initializing the module otherwise:

[ 59.659] (II) Module ABI versions:
[ 59.659] X.Org ANSI C Emulation: 0.4
[ 59.659] X.Org Video Driver: 23.0
[ 59.659] X.Org XInput driver : 24.1
[ 59.659] X.Org Server Extension : 10.0
[ 59.659] (II) xfree86: Adding drm device (/dev/dri/card0)
[ 59.664] (--) PCI:*(0:6:0:0) 10de:05fe:10de:0594 rev 161, Mem @ 0xfa000000/16777216, 0xd0000000/268435456, 0xf8000000/33554432, I/O @ 0x0000ec00/128, BIOS @ 0x???/524288
[ 59.664] (II) LoadModule: "glx"
[ 59.683] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[ 60.297] (II) Module glx: vendor="NVIDIA Corporation"
[ 60.297] compiled for 4.0.2, module version = 1.0.0
[ 60.297] Module class: X.Org Server Extension
[ 60.302] (II) NVIDIA GLX Module 340.104 Thu Sep 14 16:40:42 PDT 2017
[ 60.308] (II) LoadModule: "nvidia"
[ 60.308] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[ 60.309] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 60.309] compiled for 4.0.2, module version = 1.0.0
[ 60.309] Module class: X.Org Video Driver
[ 60.309] (II) NVIDIA dlloader X Driver 340.104 Thu Sep 14 16:18:31 PDT 2017
[ 60.309] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 60.309] (++) using VT number 1

[ 60.312] (II) Loading sub module "fb"
[ 60.312] (II) LoadModule: "fb"
[ 60.324] (II) Loading /usr/lib64/xorg/modules/libfb.so
[ 60.324] (II) Module fb: vendor="X.Org Foundation"
[ 60.324] compiled for 1.19.3, module version = 1.0.0
[ 60.324] ABI class: X.Org ANSI C Emulation, version 0.4
[ 60.324] (WW) Unresolved symbol: fbGetGCPrivateKey
[ 60.324] (II) Loading sub module "wfb"
[ 60.324] (II) LoadModule: "wfb"
[ 60.324] (II) Loading /usr/lib64/xorg/modules/libwfb.so
[ 60.324] (II) Module wfb: vendor="X.Org Foundation"
[ 60.324] compiled for 1.19.3, module version = 1.0.0
[ 60.324] ABI class: X.Org ANSI C Emulation, version 0.4
[ 60.324] (II) Loading sub module "ramdac"
[ 60.324] (II) LoadModule: "ramdac"
[ 60.324] (II) Module "ramdac" already built-in

For whatever reason, it’s not finding the GPU. Looking at the devices on the PCI bus:

$ grep -i nvidia lspci
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT200GL [Quadro FX 4800] [10de:05fe] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:0594]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia

nouveau is blacklisted correctly:

$ cat etc/modprobe.d/blacklist.conf
blacklist nouveau
$ cat proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau plymouth.ignore-udev modprobe.blacklist=nouveau

We definitely see the GPU though, so it’s not a problem with the kernel seeing the device on the PCI bus, meaning there’s something not right with the driver as it pertains to the GPU and the 7.4 kernel. I would suggest forwarding my findings to Nvidia for review as well.

Thanks and have a great day!

Best Regards,
Joe Wright, RHCE RHCVA
Senior Technical Support Engineer
Customer Experience & Engagement - North America
Red Hat, Inc"

That’s nothing new.
The kernel is loading the .102 driver, probably from the initrd. X loads the .104 driver, which then looks for the .104 kernel driver, doesn’t find it, and bails out.
So I really think that just the initrd has to be rebuilt.
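
The mismatch can be confirmed from the X log. A rough sketch follows, assuming the X driver line has the form shown in the log earlier in this thread ("NVIDIA dlloader X Driver 340.104 ...") and that the kernel driver version is supplied separately; the function names x_driver_version and check_versions are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the NVIDIA X driver version in an Xorg log with a kernel driver version.

# Extract the X driver version from a log line such as
#   [ 60.309] (II) NVIDIA dlloader X Driver 340.104 Thu Sep 14 16:18:31 PDT 2017
x_driver_version() {
    log=${1:-/var/log/Xorg.0.log}
    sed -n 's/.*NVIDIA dlloader X Driver  *\([0-9][0-9.]*\).*/\1/p' "$log" | head -n 1
}

# Report whether the user-space and kernel versions agree.
check_versions() {
    x_ver=$1
    kernel_ver=$2
    if [ "$x_ver" = "$kernel_ver" ]; then
        echo "OK: X driver and kernel module are both $x_ver"
    else
        echo "MISMATCH: X driver $x_ver, kernel module $kernel_ver"
        return 1
    fi
}

# Example on the affected machine (kernel version as seen in the bug report):
#   check_versions "$(x_driver_version)" 340.102
# The kernel log may also have more detail from the driver:
#   dmesg | grep -i NVRM
```

After the initrd is rebuilt and the machine rebooted, both versions should read 340.104.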