not able to update Tesla P100 driver 384 to 418

rajaya · August 13, 2019, 9:31pm

I have a Red Hat 7.4 server running Tesla P100 driver version 384. Recently the users updated Xorg, and now I keep getting this message:

This server has a video driver ABI version 24.0 that this driver does not officially support. Please check http://www.nvidia.com for driver updates or downgrade to an X server with a supported driver ABI.

I downloaded the new driver, version 418, and installed it, but running nvidia-smi still shows the old driver. Why is the new driver not being recognized, and how can I bring the server back up in GUI mode?

Thanks!
nvidia-bug-report.log.gz (254 KB)

generix · August 14, 2019, 8:26am

The old driver is probably just left in the initrd. Rebuild it as root with
dracut -f
Running the .run installer over a previous install that was probably packaged is not a good idea, and it will probably require reinstalling the driver after every kernel update. Instead, uninstall the .run driver with the installer's --uninstall option and switch to a repo driver such as the one from rpmfusion, which requires a fully updated RHEL (currently 7.7?). At the very least, re-run the .run installer with the --dkms option.
If you have further problems, please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/
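
To check whether an old module is still the one being loaded, it can help to compare the version the kernel reports with what is on disk and in the initrd. The following is only a minimal sketch: the helper name `loaded_nvidia_version` is made up for illustration, and it assumes the usual `NVRM version:` line format of /proc/driver/nvidia/version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA tool): reads the text of
# /proc/driver/nvidia/version on stdin and prints the driver version found
# on the "NVRM version:" line, e.g. 384.81.
loaded_nvidia_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }'
}

# Usage on a real system (as root, not executed here):
#   loaded_nvidia_version < /proc/driver/nvidia/version   # module currently loaded
#   modinfo -F version nvidia                             # module installed for the running kernel
#   lsinitrd | grep -i nvidia                             # nvidia files baked into the initrd
```

If the loaded version and the `modinfo` version differ, a stale copy in the initrd is a likely suspect, which is what `dracut -f` is meant to fix.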

rajaya · August 14, 2019, 4:09pm

Thanks for your reply. I ran dracut -f and the result is still the same. I have attached the bug report for review. How can I obtain the .run installer, or where can I find it? The 418 driver is an rpm, and when I ran rpm -ivh I got a message that the driver is already installed. Please let me know what you find in the bug report.

generix · August 14, 2019, 6:00pm

It looks a bit messy: it seems different 384.x drivers were installed around 12/2017 using different methods (.run/rpm), but there’s no trace of any 418 driver being installed. Stick to the .rpm for now.
Please post the output of
dkms status

rajaya · August 14, 2019, 7:51pm

Thanks generix. Here is the output from dkms status.

nvidia, 384.81, 3.10.0-693.11.1.el7.x86_64, x86_64: installed
nvidia, 384.81, 3.10.0-693.11.6.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)
nvidia, 384.81, 3.10.0-693.5.2.el7.x86_64, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

How can I find the Diff between built and installed module?

generix · August 14, 2019, 8:10pm

The difference doesn’t matter. It’s just that the same driver (384.81) is installed three times, so you can ignore the warnings.
More noteworthy is that there’s no trace of the 418 driver. Please reinstall it and post the complete output.
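
A quick way to confirm which driver versions dkms knows about is to reduce the `dkms status` output to its distinct version numbers. This is a rough sketch that assumes the comma-separated line format shown in the output above; the function names are invented for the example.

```shell
#!/usr/bin/env bash
# Print the distinct driver versions that dkms reports for the "nvidia" module.
# Expects `dkms status` lines on stdin in the form seen in this thread, e.g.
#   nvidia, 384.81, 3.10.0-693.11.1.el7.x86_64, x86_64: installed
dkms_nvidia_versions() {
    awk -F', ' '$1 == "nvidia" { print $2 }' | sort -u
}

# Succeed only if the given version (e.g. 418.67) is among them.
dkms_has_version() {
    dkms_nvidia_versions | grep -qxF -- "$1"
}

# Usage on a real system (not executed here):
#   dkms status | dkms_nvidia_versions
#   dkms status | dkms_has_version 418.67 || echo "418.67 is not registered with dkms"
```

With the output posted above, this would list only 384.81, which matches the observation that no 418 driver was ever built.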

rajaya · August 15, 2019, 2:56am

Appreciate your time and help very much.

Here is the output of reinstall.
[rajaya@SRHS /]$ sudo rpm -ivh nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm
[sudo] password for rajaya:
Preparing… ################################# [100%]
package nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64 is already installed

This is what I get every time.

generix · August 15, 2019, 8:10am

That’s just adding the repo, not installing the driver. You missed the following steps:

ii) `yum clean all`
iii) `yum install cuda-drivers`
iv) `reboot`

rajaya · August 15, 2019, 2:22pm

Here is what I got…

--> Running transaction check
---> Package kmod-nvidia-3.10.0-957.el7.x86_64.x86_64 3:430.26-1.el7 will be installed
--> Processing Dependency: nvidia-kmod-common >= 3:430.26 for package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
Package xorg-x11-drv-nvidia is obsoleted by nvidia-driver, but obsoleting package does not provide for requirements
---> Package libselinux-python.x86_64 0:2.5-11.el7 will be updated
---> Package libselinux-python.x86_64 0:2.5-14.1.el7 will be an update
---> Package nvidia-driver-cuda.x86_64 3:418.67-4.el7 will be installed
--> Processing Dependency: nvidia-persistenced = 3:418.67 for package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package kmod-nvidia-3.10.0-957.el7.x86_64.x86_64 3:430.26-1.el7 will be installed
--> Processing Dependency: nvidia-kmod-common >= 3:430.26 for package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
Package xorg-x11-drv-nvidia is obsoleted by nvidia-driver, but obsoleting package does not provide for requirements
---> Package nvidia-driver-cuda.x86_64 3:418.67-4.el7 will be installed
--> Processing Dependency: nvidia-persistenced = 3:418.67 for package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
Error: Package: 3:nvidia-driver-cuda-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
    Requires: nvidia-persistenced = 3:418.67
    Available: 3:nvidia-persistenced-418.67-1.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
        nvidia-persistenced = 3:418.67-1.el7
    Installing: 3:nvidia-persistenced-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
        nvidia-persistenced = 3:430.26-1.el7
Error: Package: 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
    Requires: nvidia-kmod-common >= 3:430.26
    Installing: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
        nvidia-kmod-common = 3:418.67
    Available: 3:xorg-x11-drv-nvidia-430.26-1.el7.x86_64 (rpmfusion-nonfree-updates)
        nvidia-kmod-common = 3:430.26
Error: dkms-nvidia conflicts with 3:kmod-nvidia-430.26-1.el7.x86_64
Error: dkms-nvidia conflicts with 3:kmod-nvidia-3.10.0-957.el7.x86_64-430.26-1.el7.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

This is where yum update breaks down as well.

Thanks!

generix · August 15, 2019, 4:46pm

OK, it looks like the rpmfusion repo was already added at some point, but the system never got updated. I do wonder how somebody managed to upgrade the X server in that case.
Please post the output of
yum repolist enabled

rajaya · August 15, 2019, 5:27pm

Here is the output of ‘yum repolist enabled’
(base) [root@SRHS /]# yum repolist enabled
Loaded plugins: langpacks, product-id, rhnplugin, search-disabled-repos, subscription-manager
This system is receiving updates from RHN Classic or Red Hat Satellite.
repo id                           repo name                                                 status
cuda-10-1-local-10.1.168-418.67   cuda-10-1-local-10.1.168-418.67                           79
epel/x86_64                       Extra Packages for Enterprise Linux 7 - x86_64            13,343
nux-dextop/x86_64                 Nux.Ro RPMs for general desktop use                       2,710
nvidia-diag-driver-local-418.67   nvidia-diag-driver-local-418.67                           26
rhel-x86_64-server-7              Red Hat Enterprise Linux Server (v. 7 for 64-bit x86_64)  26,158
rpmfusion-free-updates/x86_64     RPM Fusion for EL 7 - Free - Updates                      247
rpmfusion-nonfree-updates/x86_64  RPM Fusion for EL 7 - Nonfree - Updates                   75
repolist: 42,638
(base) [root@SRHS /]#

Thanks a lot.

generix · August 16, 2019, 7:44am

The problem is that you now have three repos providing the nvidia driver, so it’s unclear to yum which one to take. Better to stick to rpmfusion. Try this:

yum --disablerepo=\* --enablerepo=rpmfusion-nonfree-updates install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda

This should tell yum which driver to use.
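
To see at a glance which repos yum is mixing together, the repo names can be pulled out of the error section of a failed transaction. This is only a sketch: `repos_in_yum_output` is a made-up helper, and it assumes the "Error: Package: ... (repo)" layout that yum printed earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads saved yum output on stdin and prints the distinct
# repo names that appear in parentheses at the end of the "Error: Package:",
# "Available:", "Installing:", "Removing:" and "Obsoleted By:" lines.
repos_in_yum_output() {
    sed -nE 's/^[[:space:]]*(Error: Package|Available|Installing|Removing|Obsoleted By): .*\(([^()]+)\)[[:space:]]*$/\2/p' |
        sed 's/^@//' |          # yum marks the origin repo of installed packages with a leading @
        grep -vx 'installed' |  # "(installed)" is not a repo name
        sort -u
}

# Usage on a real system (not executed here):
#   yum install cuda-drivers 2>&1 | tee /tmp/yum.log
#   repos_in_yum_output < /tmp/yum.log
# A related check that lists every offered version with its repo:
#   yum --showduplicates list 'xorg-x11-drv-nvidia*' 'nvidia-driver*'
```

If more than one repo shows up for the same driver packages, restricting the transaction with --disablerepo/--enablerepo as above is one way to make the choice explicit.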

rajaya · August 16, 2019, 2:38pm

Thanks. I tried the command and here is the output.

---> Package nvidia-modprobe.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-persistenced.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-settings.x86_64 3:430.40-1.el7 will be installed
---> Package nvidia-xconfig.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-cuda.x86_64 3:430.40-1.el7 will be installed
--> Processing Dependency: opencl-filesystem for package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64
--> Processing Dependency: ocl-icd(x86-64) for package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64
---> Package xorg-x11-drv-nvidia-cuda-libs.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-kmodsrc.x86_64 3:430.40-1.el7 will be installed
---> Package xorg-x11-drv-nvidia-libs.x86_64 3:430.40-1.el7 will be installed
--> Processing Dependency: egl-wayland >= 1.0.0 for package: 3:xorg-x11-drv-nvidia-libs-430.40-1.el7.x86_64
--> Finished Dependency Resolution
Error: Package: 3:xorg-x11-drv-nvidia-libs-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
    Requires: egl-wayland >= 1.0.0
Error: Package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
    Requires: ocl-icd(x86-64)
Error: Package: 3:xorg-x11-drv-nvidia-cuda-430.40-1.el7.x86_64 (rpmfusion-nonfree-updates)
    Requires: opencl-filesystem
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

generix · August 19, 2019, 10:00am

Looks like it also needs packages from the epel repo, try

yum --disablerepo=\* --enablerepo=rpmfusion-nonfree-updates --enablerepo=epel install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda

rajaya · August 19, 2019, 3:31pm

Oh man… you are a savior!!! Finally an update installed successfully.

Installed:
akmod-nvidia.x86_64 3:430.40-1.el7 xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7
xorg-x11-drv-nvidia-cuda.x86_64 3:430.40-1.el7

Dependency Installed:
egl-wayland.x86_64 0:1.1.3-1.el7 nvidia-modprobe.x86_64 3:430.40-1.el7
nvidia-persistenced.x86_64 3:430.40-1.el7 nvidia-settings.x86_64 3:430.40-1.el7
nvidia-xconfig.x86_64 3:430.40-1.el7 ocl-icd.x86_64 0:2.2.12-1.el7
opencl-filesystem.noarch 0:1.0-5.el7 xorg-x11-drv-nvidia-cuda-libs.x86_64 3:430.40-1.el7
xorg-x11-drv-nvidia-kmodsrc.x86_64 3:430.40-1.el7 xorg-x11-drv-nvidia-libs.x86_64 3:430.40-1.el7

Complete!

I will schedule a restart to see the OS loads up to the GUI and post an update. Thanks a lot.

rajaya · August 20, 2019, 1:50am

I restarted the server and ran yum update to bring it up to a recent patch level. Below is where the update stopped.

---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Running transaction check
---> Package cdrdao.x86_64 0:1.2.3-20.el7 will be installed
---> Package daxctl-libs.x86_64 0:64.1-2.el7 will be installed
---> Package fwupdate-efi.x86_64 0:12-5.el7 will be installed
---> Package icedax.x86_64 0:1.1.11-25.el7 will be installed
--> Processing Dependency: vorbis-tools for package: icedax-1.1.11-25.el7.x86_64
--> Processing Dependency: cdparanoia for package: icedax-1.1.11-25.el7.x86_64
---> Package libburn.x86_64 0:1.2.8-4.el7 will be installed
---> Package libisofs.x86_64 0:1.2.8-4.el7 will be installed
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Running transaction check
---> Package cdparanoia.x86_64 0:10.2-17.el7 will be installed
---> Package vorbis-tools.x86_64 1:1.4.0-13.el7 will be installed
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package kernel-devel.x86_64 0:3.10.0-693.5.2.el7 will be erased
---> Package xorg-x11-drv-nvidia.x86_64 3:430.40-1.el7 will be obsoleted
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
--> Processing Dependency: nvidia-kmod-common >= 3:430.40 for package: 3:akmod-nvidia-430.40-1.el7.x86_64
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
Error: dkms-nvidia conflicts with 3:akmod-nvidia-430.40-1.el7.x86_64
Error: Package: 3:akmod-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
    Requires: nvidia-kmod-common >= 3:430.40
    Removing: 3:xorg-x11-drv-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
        nvidia-kmod-common = 3:430.40
    Obsoleted By: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
        nvidia-kmod-common = 3:418.67
Error: Package: 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64 (installed)
    Requires: nvidia-kmod-common >= 3:430.40
    Removing: 3:xorg-x11-drv-nvidia-430.40-1.el7.x86_64 (@rpmfusion-nonfree-updates)
        nvidia-kmod-common = 3:430.40
    Obsoleted By: 3:nvidia-driver-418.67-4.el7.x86_64 (cuda-10-1-local-10.1.168-418.67)
        nvidia-kmod-common = 3:418.67
Error: dkms-nvidia conflicts with 3:kmod-nvidia-3.10.0-693.11.6.el7.x86_64-430.40-1.el7.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Is the server again seeing more than one driver?
Thanks!

generix · August 20, 2019, 10:19am

Yes, it seems someone tried to install cuda 10, which is a metapackage consisting of cuda-toolkit and the nvidia driver. I guess you should remove all cuda/nvidia packages and repos by following the instructions in
Installation Guide Linux :: CUDA Toolkit Documentation
Afterwards, reinstall the driver from the rpmfusion repo, then download the cuda 10.1 rpm and add the repo to your system (the first three instruction steps on the download page). Then don’t install the nvidia driver or “cuda”, only “cuda-toolkit-10-1”.
This should give you a clean, updatable system.

rajaya · August 20, 2019, 1:50pm

I uninstalled all the nvidia and CUDA drivers by following the instructions in the document. I then looked up the rpmfusion repo instructions on this page: Howto/NVIDIA - RPM Fusion

Are these the commands I need to run to install the NVIDIA drivers and then the CUDA 10.1 toolkit? I just want to verify before I proceed.

Thanks!

rajaya · August 20, 2019, 1:53pm

On another note… After uninstalling the drivers, yum update was able to download all the packages fine. Should I just update the server first before installing the nvidia driver and cuda 10.1 toolkit?

generix · August 20, 2019, 2:15pm

Just fully update the server first, then reinstall the driver with the commands from the rpmfusion website and afterwards the cuda toolkit.
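
The order suggested above could be written down as a dry-run plan before touching the server. This is only a sketch: `print_plan` is a made-up helper that prints the commands without running them, and the driver line reuses the yum command that worked earlier in this thread rather than the exact commands from the rpmfusion page, which is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the planned commands in order,
# one per line, without executing any of them.
print_plan() {
    cat <<'EOF'
yum update
yum --disablerepo=\* --enablerepo=rpmfusion-nonfree-updates --enablerepo=epel install xorg-x11-drv-nvidia akmod-nvidia xorg-x11-drv-nvidia-cuda
yum install cuda-toolkit-10-1
EOF
}

# Review the plan, then run each line by hand as root:
#   print_plan
# Assumption: a reboot after a kernel update may be needed before the driver
# module is built for the new kernel; that step is not from the thread.
```

The last line installs only the toolkit package named in the advice above, not the "cuda" metapackage, so the driver stays under rpmfusion's control.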
