Not able to update Tesla P100 driver from 384 to 418

I have a Red Hat 7.4 server with a Tesla P100 running driver version 384. Recently the users updated Xorg and now I keep getting this message:

This server has a video driver ABI version 24.0 that this driver does not officially support. Please check http://www.nvidia.com for driver updates or downgrade to an X server with a supported driver ABI.

I downloaded the new driver, version 418, and installed it, but nvidia-smi still shows the old driver. Why is the new driver not being recognized, and how can I bring the server back up in GUI mode?

Thanks!
nvidia-bug-report.log.gz (254 KB)

The old driver is probably still in the initrd. Rebuild it as root with
dracut -f
Running the .run installer over a previous install that probably came from packages is not a good idea, and the driver will probably need to be reinstalled on every kernel update. You should instead remove the .run driver using the installer's --uninstall option and switch to a repo driver such as the one from rpmfusion, which requires a fully updated RHEL (currently 7.7?), or at least re-run the .run installer with the --dkms option.
If you have further problems, please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/
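
To check whether a stale module is really what nvidia-smi is talking to, it can help to compare the version of the module the kernel has loaded with the version of the module file on disk. This is only a sketch: check_versions is a made-up helper name, and it assumes the module is called nvidia and that the loaded version is exposed at /sys/module/nvidia/version.

```shell
#!/bin/sh
# check_versions is a hypothetical helper, not part of any NVIDIA tool.
# It compares two version strings: the loaded module and the on-disk module.
check_versions() {
    loaded=$1
    ondisk=$2
    if [ -z "$loaded" ]; then
        echo "no nvidia module loaded"
    elif [ "$loaded" = "$ondisk" ]; then
        echo "match: $loaded"
    else
        echo "mismatch: loaded $loaded, on disk $ondisk"
    fi
}

# On the affected server (as root), assuming the module is named "nvidia":
#   check_versions "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                  "$(modinfo -F version nvidia 2>/dev/null)"
#   lsinitrd | grep -i nvidia   # is a copy of the module packed into the initrd?
```

A mismatch would suggest that an old copy is being loaded early (for example from the initrd), while two identical old versions would suggest the new driver was never installed at all.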

Thanks for your reply. I ran dracut -f and the result is still the same. I have attached the bug report for review. How can I obtain the .run installer, or where can I find it? The 418 driver is an rpm, and when I ran rpm -ivh I got a message saying the driver is already installed. Please let me know what you find in the bug report.

It looks a bit messy. It seems different 384.x drivers were installed around 12/2017 using different methods (.run/rpm), but there’s no trace of any 418 driver being installed. Stick to the .rpm for now.
Please post the output of
dkms status

Thanks generix. Here is the output from dkms status.

nvidia, 384.81, 3.10.0-693.11.1.el7.x86_64, x86_64: installed
nvidia, 384.81, 3.10.0-693.11.6.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)
nvidia, 384.81, 3.10.0-693.5.2.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

How can I find the Diff between built and installed module?

The difference doesn’t matter; it’s just that the same driver (384.81) is installed three times, once per kernel. You can ignore it.
More noteworthy is that there’s no trace of the 418 driver. Please reinstall it and post the complete output.
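
To see at a glance which driver versions DKMS knows about, the status output can be collapsed to distinct module/version pairs. A minimal sketch, assuming the comma-separated line format shown in the output above; dkms_versions is a made-up helper name.

```shell
#!/bin/sh
# dkms_versions (hypothetical helper): read `dkms status` output on stdin and
# print each distinct "module version" pair once.
dkms_versions() {
    awk -F', ' 'NF >= 3 { print $1, $2 }' | sort -u
}

# Sample input taken from the status output posted above.
dkms_versions <<'EOF'
nvidia, 384.81, 3.10.0-693.11.1.el7.x86_64, x86_64: installed
nvidia, 384.81, 3.10.0-693.11.6.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!)
nvidia, 384.81, 3.10.0-693.5.2.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!)
EOF
# prints: nvidia 384.81
```

On a live system the same helper could be fed with `dkms status | dkms_versions`. A single 384.81 line and nothing for 418 matches the observation that the 418 driver was never installed.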

Appreciate your time and help very much.

Here is the output of reinstall.
[rajaya@SRHS /]$ sudo rpm -ivh nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm
[sudo] password for rajaya:
Preparing… ################################# [100%]
package nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64 is already installed

This is what I get every time.

That’s just adding the repo, not installing the driver. You missed the following steps:

ii) `yum clean all`
iii) `yum install cuda-drivers`
iv) `reboot`

Here is what I got…


--> Running transaction check
---> Package kmod-nvidia-3.10.0-957.el7.x86_64.x86_64 3:430.26-1.el7 will be installed
--> Processing Dependency: nvidia-kmod-common >= 3:430.26 for package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
Package xorg-x11-drv-nvidia is obsoleted by nvidia-driver, but obsoleting package does not provide for requirements
---> Package libselinux-python.x86_64 0:2.5-11.el7 will be updated
---> Package libselinux-python.x86_64 0:2.5-14.1.el7 will be an update
---> Package nvidia-driver-cuda.x86_64 3:418.67-4.el7 will be installed
--> Processing Dependency: nvidia-persistenced = 3:418.67 for package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package kmod-nvidia-3.10.0-957.el7.x86_64.x86_64 3:430.26-1.el7 will be installed
--> Processing Dependency: nvidia-kmod-common >= 3:430.26 for package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
Package xorg-x11-drv-nvidia is obsoleted by nvidia-driver, but obsoleting package does not provide for requirements
---> Package nvidia-driver-cuda.x86_64 3:418.67-4.el7 will be installed
--> Processing Dependency: nvidia-persistenced = 3:418.67 for package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
Error: Package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
Requires: nvidia-persistenced = 3:418.67
Available: 3:nvidia-persistenced-418.67-1.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
nvidia-persistenced = 3:418.67-1.el7
Installing: 3:nvidia-persistenced-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
nvidia-persistenced = 3:430.26-1.el7
Error: Package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
Requires: nvidia-kmod-common >= 3:430.26
Installing: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
nvidia-kmod-common = 3:418.67
Available: 3:xorg-x11-drv-nvidia-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
nvidia-kmod-common = 3:430.26
Error: dkms-nvidia conflicts with 3:kmod-nvidia-430.26-1.el7.x86_64
Error: dkms-nvidia conflicts with 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest


This is where yum update breaks down as well.

Thanks!

OK, it looks like the rpmfusion repo was already added at some point but the system was never updated. Though I wonder how somebody managed to upgrade the X server then.
Please post the output of
yum repolist enabled

Here is the output of ‘yum repolist enabled’
(base) [root@SRHS /]# yum repolist enabled
Loaded plugins: langpacks, product-id, rhnplugin, search-disabled-repos, subscription-manager
This system is receiving updates from RHN Classic or Red Hat Satellite.
repo id                           repo name                                                 status
cuda-10-1-local-10.1.168-418.67   cuda-10-1-local-10.1.168-418.67                           79
epel/x86_64                       Extra Packages for Enterprise Linux 7 - x86_64            13,343
nux-dextop/x86_64                 Nux.Ro RPMs for general desktop use                       2,710
nvidia-diag-driver-local-418.67   nvidia-diag-driver-local-418.67                           26
rhel-x86_64-server-7              Red Hat Enterprise Linux Server (v. 7 for 64-bit x86_64)  26,158
rpmfusion-free-updates/x86_64     RPM Fusion for EL 7 - Free - Updates                      247
rpmfusion-nonfree-updates/x86_64  RPM Fusion for EL 7 - Nonfree - Updates                   75
repolist: 42,638
(base) [root@SRHS /]#

Thanks a lot.

The problem is that you now have three repos providing the nvidia driver, so it’s unclear which one yum should take. Better to stick to rpmfusion. Try this:

yum --disablerepo=\* --enablerepo=rpmfusion-nonfree-updates install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda

This should tell yum which driver to use.
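
To see which repos are competing, the repo names that yum prints in parentheses at the end of its error lines can be pulled out of the pasted output. A rough sketch, assuming that line layout; nvidia_repos is a made-up helper name, and it only reports repos that actually appear in the text it is given.

```shell
#!/bin/sh
# nvidia_repos (hypothetical helper): read yum error output on stdin and print
# each repo name that appears in trailing parentheses on a line mentioning nvidia.
nvidia_repos() {
    sed -n 's/.*nvidia.*(\([^()]*\))[[:space:]]*$/\1/p' | sort -u
}

# Sample lines taken from the yum error output posted earlier in the thread.
nvidia_repos <<'EOF'
Error: Package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
Available: 3:nvidia-persistenced-418.67-1.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
Installing: 3:nvidia-persistenced-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
Error: Package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
EOF
```

For the sample above this lists the local cuda repo and rpmfusion-nonfree-updates, which are the two sources pulling in mismatched 418.67 and 430.26 packages.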

Thanks. I tried the command and here is the output.

---> Package nvidia-modprobe.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-persistenced.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-settings.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-xconfig.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-cuda.x86_64 3:430.40-1.el7 will be installed
--> Processing Dependency: opencl-filesystem for package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64
--> Processing Dependency: ocl-icd(x86-64) for package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64
---> Package xorg-x11-drv-nvidia-cuda-libs.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-kmodsrc.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-libs.x86_64 3:430.40-1.el7 will be installed
--> Processing Dependency: egl-wayland >= 1.0.0 for package: 3:xorg-x11-drv-nvidia-libs-430.40-1.el7.x86_64
--> Finished Dependency Resolution
Error: Package: 3:xorg-x11-drv-nvidia-libs-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
Requires: egl-wayland >= 1.0.0
Error: Package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
Requires: ocl-icd(x86-64)
Error: Package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
Requires: opencl-filesystem
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Looks like it also needs packages from the epel repo, try

yum --disablerepo=\* --enablerepo=rpmfusion-nonfree-updates --enablerepo=epel install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda

Oh man… you are a savior!!! Finally an update installed successfully.

Installed:
akmod-nvidia.x86_64 3:430.40-1.el7 xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7
xorg-x11-drv-nvidia-cuda.x86_64 3:430.40-1.el7

Dependency Installed:
egl-wayland.x86_64 0:1.1.3-1.el7 nvidia-modprobe.x86_64 3:430.40-1.el7
nvidia-persistenced.x86_64 3:430.40-1.el7 nvidia-settings.x86_64 3:430.40-1.el7
nvidia-xconfig.x86_64 3:430.40-1.el7 ocl-icd.x86_64 0:2.2.12-1.el7
opencl-filesystem.noarch 0:1.0-5.el7 xorg-x11-drv-nvidia-cuda-libs.x86_64 3:430.40-1.el7
xorg-x11-drv-nvidia-kmodsrc.x86_64 3:430.40-1.el7 xorg-x11-drv-nvidia-libs.x86_64 3:430.40-1.el7

Complete!

I will schedule a restart to see whether the OS boots into the GUI and post an update. Thanks a lot.

I restarted the server and ran yum update to bring it up to the latest patch level. Below is where the update stopped.


---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Running transaction check
---> Package cdrdao.x86_64 0:1.2.3-20.el7 will be installed
---> Package daxctl-libs.x86_64 0:64.1-2.el7 will be installed
---> Package fwupdate-efi.x86_64 0:12-5.el7 will be installed
---> Package icedax.x86_64 0:1.1.11-25.el7 will be installed
--> Processing Dependency: vorbis-tools for package: icedax-1.1.11-25.el7.x86_64
--> Processing Dependency: cdparanoia for package: icedax-1.1.11-25.el7.x86_64
---> Package libburn.x86_64 0:1.2.8-4.el7 will be installed
---> Package libisofs.x86_64 0:1.2.8-4.el7 will be installed
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Running transaction check
---> Package cdparanoia.x86_64 0:10.2-17.el7 will be installed
---> Package vorbis-tools.x86_64 1:1.4.0-13.el7 will be installed
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package kernel-devel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
Error: dkms-nvidia conflicts with 3:akmod-nvidia-430.40-1.el7.x86_64
Error: Package: 3:akmod-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
Requires: nvidia-kmod-common >= 3:430.40
Removing: 3:xorg-x11-drv-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
nvidia-kmod-common = 3:430.40
Obsoleted By: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
nvidia-kmod-common = 3:418.67
Error: Package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64 (installed)
Requires: nvidia-kmod-common >= 3:430.40
Removing: 3:xorg-x11-drv-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
nvidia-kmod-common = 3:430.40
Obsoleted By: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
nvidia-kmod-common = 3:418.67
Error: dkms-nvidia conflicts with 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Is the server again seeing more than one driver?
Thanks!

Yes, it looks like someone tried to install cuda 10, which is a metapackage consisting of cuda-toolkit and the nvidia driver. I guess you should remove all cuda/nvidia packages and repos by following the removal instructions in
Installation Guide Linux :: CUDA Toolkit Documentation
Afterwards, reinstall the driver from the rpmfusion repo, then download the cuda 10.1 rpm and add the repo to your system (the first three instruction steps on the download page). Then don’t install the nvidia driver or “cuda”, only “cuda-toolkit-10-1”.
This should give you a clean, updatable system.
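
Before reinstalling, it may be worth confirming that nothing was left behind by the cleanup. A small sketch that filters a package list for nvidia or cuda names; gpu_packages is a made-up helper name, and the sample names are packages that appear earlier in this thread.

```shell
#!/bin/sh
# gpu_packages (hypothetical helper): read package names on stdin and print,
# sorted, only those that mention nvidia or cuda.
gpu_packages() {
    grep -Ei '(nvidia|cuda)' | sort
}

# Sample input; on a live system this would be: rpm -qa | gpu_packages
gpu_packages <<'EOF'
libburn-1.2.8-4.el7.x86_64
dkms-nvidia-418.67-1.el7.x86_64
akmod-nvidia-430.40-1.el7.x86_64
EOF
```

An empty result after the cleanup would mean no nvidia or cuda packages remain; `yum repolist enabled` can then be used to confirm the local cuda and nvidia-diag repos are gone too.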

I uninstalled all the nvidia and CUDA drivers by following the instructions in the document. I looked up the rpmfusion repo at this link: Howto/NVIDIA - RPM Fusion

Are these the commands I need to run to install the NVIDIA drivers and then the CUDA 10.1 toolkit? I just want to verify before I proceed.

Thanks!

On another note… After uninstalling the drivers, yum update was able to download all the packages fine. Should I just update the server first before installing the nvidia driver and cuda 10.1 toolkit?

Just fully update the server first, then reinstall the driver with the commands from the rpmfusion website, and afterwards install the cuda toolkit.
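
The order of operations can be sketched as a dry-run script. This is an assumption-laden outline, not the rpmfusion instructions themselves: the driver package names are the ones that installed successfully earlier in this thread, the run and reinstall_plan helpers are made up, and whether a reboot is needed between steps is not stated anywhere above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# run (hypothetical helper): print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reinstall_plan() {
    # 1. Bring the whole system up to date first.
    run yum -y update
    # 2. Driver from rpmfusion (package names as used earlier in the thread).
    run yum -y install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda
    # 3. Toolkit only, after adding the cuda 10.1 repo rpm from the download page.
    #    Do not install the "cuda" metapackage, since it pulls in its own driver.
    run yum -y install cuda-toolkit-10-1
}

reinstall_plan
```

Running it unchanged prints the three commands in order, which makes it easy to review the plan before executing anything on the server.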