Nvidia-smi fails on CentOS 7 since a semi-recent update

Greetings
The NVIDIA driver on my CentOS 7 server has stopped working since I updated the system a couple of months ago (oh how time flies). The main symptoms are:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and, more practically:

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices())"
2022-01-21 13:16:29.988536: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-01-21 13:16:29.988661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (piip.rhi.hi.is): /proc/driver/nvidia/version does not exist
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

So far I have completely uninstalled everything related to CUDA/NVIDIA/cuBLAS and reinstalled according to the standard installation instructions, but with no luck.
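For reference, the removal step was roughly along these lines (the exact package globs may differ on other systems):

$ sudo yum remove "*nvidia*" "*cuda*" "*cublas*"

followed by a fresh install from the CUDA repository per the standard instructions.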
Searching around, I have since tried a couple of other things.
The persistence daemon also seems to fail:

$ systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since lau 2022-01-15 19:25:04 GMT; 5 days ago
  Process: 1980 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/* (code=exited, status=0/SUCCESS)
  Process: 1580 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=1/FAILURE)

jan 15 19:25:04 piip.rhi.hi.is systemd[1]: Failed to start NVIDIA Persistence Daemon.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: Unit nvidia-persistenced.service entered failed state.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: nvidia-persistenced.service failed.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: Stopped NVIDIA Persistence Daemon.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: start request repeated too quickly for nvidia-persistenced.service
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: Failed to start NVIDIA Persistence Daemon.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: Unit nvidia-persistenced.service entered failed state.
jan 15 19:25:04 piip.rhi.hi.is systemd[1]: nvidia-persistenced.service failed.

although this might just be expected, since the daemon presumably cannot start without a working driver.
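Consistent with the /proc/driver/nvidia/version message above, these quick checks can confirm whether the kernel module is loaded at all:

$ lsmod | grep nvidia
$ dmesg | grep -i nvidia
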
SELinux is disabled:

$ sestatus
SELinux status:                 disabled
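
Secure Boot is separate from SELinux and can also block unsigned kernel modules; to rule it out as well, its state can be checked with (assuming mokutil is installed):

$ mokutil --sb-state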

I did install gcc 11.2, since the instructions indicate that the default compiler shipped with CentOS 7 (gcc 4.8.5) is not sufficient. However, I hesitate to alias gcc from CentOS’ default to the new version, since that might cause instability elsewhere. Do I then need to tell the CUDA installer explicitly which gcc to use? Then again, the gcc version could be entirely irrelevant here; this is just my latest thought.
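One way to make a newer gcc available without replacing the system default is a per-shell Software Collections environment; this is a sketch of that approach, not something I have verified fixes the issue:

$ sudo yum install centos-release-scl
$ sudo yum install devtoolset-11
$ scl enable devtoolset-11 bash    # gcc 11.2 is on PATH only inside this shell
$ gcc --version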

Here’s the log generated by running nvidia-bug-report.sh as root:
nvidia-bug-report.log.gz (54.9 KB)

Update: I have fixed the error. It turns out that the latest driver branch is no longer compatible with my Tesla K80, so I had to find and install an older one. I found it by listing the available drivers:

$ yum --disablerepo="*" --enablerepo="cuda*" list available

then installing the 470 branch, which still supports the Tesla K80, along with CUDA 11.4 to go with it:

$ sudo yum --disablerepo="*" --enablerepo="cuda*" install nvidia-driver-branch-470.x86_64
$ sudo yum --disablerepo="*" --enablerepo="cuda*" install cuda-11-4.x86_64
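
After switching driver branches, a reboot is probably needed before the new kernel module loads (I did not note down whether it was strictly required in my case); after that, nvidia-smi should be able to communicate with the driver again:

$ nvidia-smi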

Then, in order to get TensorFlow working, I needed cuDNN. I couldn’t find a version packaged specifically for CUDA 11.4, but the following seems to be compatible (although I am worried this pairing might not be permanent):

$ sudo yum --disablerepo="*" --enablerepo="cuda*" install libcudnn8.x86_64
$ sudo yum --disablerepo="*" --enablerepo="cuda*" install libcudnn8-devel.x86_64

After that, a simple TensorFlow example worked well.
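For reference, the same check from the top of this post can be used to confirm that the GPU is now visible:

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"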