Issues with Nvidia Drivers on CentOS 7.6/7.7

Recently, I’ve been encountering issues setting up Nvidia drivers on CentOS 7.6 and 7.7.

As an example, on 7.6:

$ uname -r
3.10.0-957.27.2.el7.x86_64
$ rpm -qa kernel*
kernel-devel-3.10.0-957.el7.x86_64
kernel-tools-libs-3.10.0-957.27.2.el7.x86_64
kernel-tools-3.10.0-957.27.2.el7.x86_64
kernel-3.10.0-957.el7.x86_64
kernel-3.10.0-957.27.2.el7.x86_64
kernel-headers-3.10.0-957.el7.x86_64

I am installing the Nvidia drivers using the following steps:

$ wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
$ sudo rpm -i cuda-repo-rhel7-10.1.243-1.x86_64.rpm
$ sudo yum clean all
$ sudo yum -y install epel-release
$ sudo yum -y install cuda
$ sudo shutdown -r now

Upon rebooting the system, nvidia-smi shows the following:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here are the Nvidia packages/drivers that are installed:

$ rpm -qa nvidia*
nvidia-driver-latest-cuda-libs-418.87.00-2.el7.x86_64
nvidia-persistenced-latest-418.87.00-2.el7.x86_64
nvidia-driver-latest-418.87.00-2.el7.x86_64
nvidia-xconfig-latest-418.87.00-2.el7.x86_64
nvidia-modprobe-latest-418.87.00-2.el7.x86_64
nvidia-driver-latest-cuda-418.87.00-2.el7.x86_64
nvidia-driver-latest-NvFBCOpenGL-418.87.00-2.el7.x86_64
nvidia-libXNVCtrl-418.87.00-2.el7.x86_64
nvidia-libXNVCtrl-devel-418.87.00-2.el7.x86_64
nvidia-driver-latest-libs-418.87.00-2.el7.x86_64
nvidia-driver-latest-NVML-418.87.00-2.el7.x86_64
nvidia-driver-latest-devel-418.87.00-2.el7.x86_64
nvidia-settings-418.87.00-2.el7.x86_64

I am seeing identical behavior with the 3.10.0-1062.1.1.el7 kernel as well.

I regularly build CentOS images for GPU computing internally, and this has only become an issue in the last week.

Depending on the previous history of the machine, it’s certainly possible that attempting to install the way you did will result in a broken driver install, which is what is being indicated.

This is from an image on Google Cloud, out of the box.

Can you help me understand why this would result in a broken driver install?

“an image on Google Cloud”

No, I cannot help with that. I have no idea what the history of that image is (how things were installed into that image).

A typical reason for a broken driver install, when using a particular install method with no attention paid to what was done previously, is documented in section 2.7 of the Linux install guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation

I don’t know for certain that this is the issue, of course. It’s entirely possible it’s some other problem.

The image I’m starting from is a clean CentOS 7 image. What I’ve shown in the opening comment on this thread is all that has been done. There are no other installs of Nvidia* or cuda* on this system.

I’ll look through the section of documentation you’ve suggested.

After the install/reboot process, when you have gotten the message that nvidia-smi has failed, what is the output from:

dmesg | grep NVRM

?

I’ve gone through the suggested documentation carefully when setting up an instance from GCP’s CentOS 7 image. There was a mismatch between the running kernel (3.10.0-957) and the kernel-devel and kernel-headers packages available through yum. Over the last week, with the CentOS 7.7 release, kernel-devel-3.10.0-957 and kernel-headers-3.10.0-957 are no longer directly available through the yum package manager.
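This kind of mismatch is easy to confirm. Here is a minimal sketch with the version strings from earlier in this thread hard-coded for illustration; on a live system you would substitute the output of `uname -r` and `rpm -q kernel-devel` instead:

```shell
# Minimal mismatch check. The DKMS build compiles the nvidia kernel module
# against kernel-devel, so its version must match the running kernel exactly.
# Values below are hard-coded from earlier in this thread for illustration;
# on a real system use: running=$(uname -r) and the version from rpm -q kernel-devel
running="3.10.0-957.27.2.el7.x86_64"
devel="3.10.0-957.el7.x86_64"   # from kernel-devel-3.10.0-957.el7.x86_64

if [ "$running" = "$devel" ]; then
    echo "kernel-devel matches the running kernel"
else
    echo "mismatch: kernel-devel targets $devel but the system is running $running"
fi
```

With the versions above, this reports a mismatch, which is consistent with the module failing to load after reboot.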

To get a working build starting from the currently released CentOS 7 image on GCP (built 09/16/2019), I’m first executing:

yum install kernel
yum update
reboot

Upon reboot, the following actions are taken:

yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
yum install gcc dkms libvdpau wget pciutils
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.243-418.87.00-1.0-1.x86_64.rpm
rpm --install cuda-repo-rhel7-10-1-local-10.1.243-418.87.00-1.0-1.x86_64.rpm 
yum clean expire-cache
yum install nvidia-driver-latest-dkms cuda
reboot

Upon reboot, I follow the post-installation instructions ( https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions ), setting PATH and starting the persistence daemon on boot. After one more reboot, nvidia-smi returns the expected output.
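For completeness, the two post-installation actions referenced above amount to roughly the following. This is a sketch based on the linked guide, assuming a CUDA 10.1 install under /usr/local/cuda-10.1 and the systemd unit shipped with the nvidia-persistenced package:

```shell
# Make the CUDA 10.1 toolchain visible (append to ~/.bashrc or a /etc/profile.d/ script)
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}

# Start the persistence daemon now and enable it on every boot
sudo systemctl start nvidia-persistenced
sudo systemctl enable nvidia-persistenced
```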

Thanks for pushing me towards documentation - sometimes it’s easy to forget that the answer’s already out there.