Issues with Nvidia Drivers on CentOS 7.6/7.7

Recently, I’ve been encountering issues setting up Nvidia drivers on CentOS 7.6 and 7.7.

As an example, on 7.6:

$ uname -r
3.10.0-957.27.2.el7.x86_64
$ rpm -qa kernel*
kernel-devel-3.10.0-957.el7.x86_64
kernel-tools-libs-3.10.0-957.27.2.el7.x86_64
kernel-tools-3.10.0-957.27.2.el7.x86_64
kernel-3.10.0-957.el7.x86_64
kernel-3.10.0-957.27.2.el7.x86_64
kernel-headers-3.10.0-957.el7.x86_64

I am installing the Nvidia drivers using the following steps:

$ wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
$ sudo rpm -i cuda-repo-rhel7-10.1.243-1.x86_64.rpm
$ sudo yum clean all
$ sudo yum -y install epel-release
$ sudo yum -y install cuda
$ sudo shutdown -r now

Upon rebooting the system, nvidia-smi shows the following:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here are the Nvidia packages/drivers that are installed:

$ rpm -qa nvidia*
nvidia-driver-latest-cuda-libs-418.87.00-2.el7.x86_64
nvidia-persistenced-latest-418.87.00-2.el7.x86_64
nvidia-driver-latest-418.87.00-2.el7.x86_64
nvidia-xconfig-latest-418.87.00-2.el7.x86_64
nvidia-modprobe-latest-418.87.00-2.el7.x86_64
nvidia-driver-latest-cuda-418.87.00-2.el7.x86_64
nvidia-driver-latest-NvFBCOpenGL-418.87.00-2.el7.x86_64
nvidia-libXNVCtrl-418.87.00-2.el7.x86_64
nvidia-libXNVCtrl-devel-418.87.00-2.el7.x86_64
nvidia-driver-latest-libs-418.87.00-2.el7.x86_64
nvidia-driver-latest-NVML-418.87.00-2.el7.x86_64
nvidia-driver-latest-devel-418.87.00-2.el7.x86_64
nvidia-settings-418.87.00-2.el7.x86_64

I am seeing identical behavior with the 3.10.0-1062.1.1.el7 kernel as well.

I regularly build CentOS images for GPU computing internally, and this has only become an issue in the last week.

Depending on the previous history of the machine, it’s certainly possible that attempting to install the way you did will result in a broken driver install, which is what is being indicated.

This is from an image on Google Cloud, out of the box.

Can you help me understand why this would result in a broken driver install?

“an image on Google Cloud”

No, I cannot help with that. I have no idea what the history of that image is (how things were installed into that image).

A typical reason for a broken driver install, when using a particular install method with no attention paid to what was done previously, is documented in section 2.7 of the Linux install guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation

I don’t know for certain that this is the issue, of course. It’s entirely possible it’s some other problem.

The image I’m starting from is a clean CentOS 7 image. What I’ve shown in the opening comment on this thread is all that has been done. There are no other installs of Nvidia* or cuda* on this system.

I’ll look through the section of documentation you’ve suggested.

After the install/reboot process, when you have gotten the message that nvidia-smi has failed, what is the output from:

dmesg | grep NVRM

?

I’ve gone through the suggested documentation carefully when setting up an instance from GCP’s CentOS 7 image. There was a mismatch between the running kernel (3.10.0-957) and the kernel-devel and kernel-headers packages available through yum. Over the last week, with the CentOS 7.7 release, kernel-devel-3.10.0-957 and kernel-headers-3.10.0-957 are no longer directly available through the yum package manager.
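This kind of mismatch is easy to confirm. Here is a minimal sketch with the version strings from earlier in this thread hard-coded for illustration; on a live system you would substitute the output of `uname -r` and `rpm -q kernel-devel` instead:

```shell
# Minimal mismatch check. The DKMS build compiles the nvidia kernel module
# against kernel-devel, so its version must match the running kernel exactly.
# Values below are hard-coded from earlier in this thread for illustration;
# on a real system use: running=$(uname -r) and the version from rpm -q kernel-devel
running="3.10.0-957.27.2.el7.x86_64"
devel="3.10.0-957.el7.x86_64"   # from kernel-devel-3.10.0-957.el7.x86_64

if [ "$running" = "$devel" ]; then
    echo "kernel-devel matches the running kernel"
else
    echo "mismatch: kernel-devel targets $devel but the system is running $running"
fi
```

With the versions above, this reports a mismatch, which is consistent with the module failing to load after reboot.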

To get a working build starting from the currently released CentOS 7 image on GCP (built 09/16/2019), I’m first executing:

yum install kernel
yum update
reboot

Upon reboot, the following actions are taken:

yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
yum install gcc dkms libvdpau wget pciutils
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.243-418.87.00-1.0-1.x86_64.rpm
rpm --install cuda-repo-rhel7-10-1-local-10.1.243-418.87.00-1.0-1.x86_64.rpm 
yum clean expire-cache
yum install nvidia-driver-latest-dkms cuda
reboot

Upon reboot, I follow the post-installation instructions ( https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions ), setting PATH and starting the persistence daemon on boot. After one more reboot, nvidia-smi returns the expected output.
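For completeness, the two post-installation actions referenced above amount to roughly the following. This is a sketch based on the linked guide, assuming a CUDA 10.1 install under /usr/local/cuda-10.1 and the systemd unit shipped with the nvidia-persistenced package:

```shell
# Make the CUDA 10.1 toolchain visible (append to ~/.bashrc or a /etc/profile.d/ script)
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}

# Start the persistence daemon now and enable it on every boot
sudo systemctl start nvidia-persistenced
sudo systemctl enable nvidia-persistenced
```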

Thanks for pushing me towards documentation - sometimes it’s easy to forget that the answer’s already out there.