NVIDIA-SMI can't communicate with NVIDIA driver

Problem description

I am trying to set up a CentOS 7 GPU (NVIDIA Tesla K80) instance on Google Cloud to run CUDA workloads.

Unfortunately, I can't seem to install/configure the drivers properly.

Indeed, here is what happens when trying to interact with nvidia-smi (NVIDIA System Management Interface):

# nvidia-smi -pm 1
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The same thing happens with the more recent nvidia-persistenced method:

# nvidia-persistenced
nvidia-persistenced failed to initialize. Check syslog for more details.
I get the following error in syslog (using the journalctl command):

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

Indeed, no /dev/nvidia* device files are present:

# ll /dev/nvidia*
ls: cannot access /dev/nvidia*: No such file or directory
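
Those device nodes are normally created when the nvidia kernel module loads, so their absence strongly suggests the module was never built or loaded. A few diagnostic commands I would run as root to confirm that (just a sketch, assuming the driver package set up DKMS; not part of the Google Cloud procedure):

# Check whether the nvidia module (or the conflicting nouveau module) is loaded
lsmod | grep -iE 'nvidia|nouveau'

# Check whether a kernel module was registered/built via DKMS
dkms status

# Look for driver errors in the kernel log
dmesg | grep -iE 'nvidia|nvrm'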

However, here is proof that the GPU is correctly attached to the instance:

# lshw -numeric -C display
  *-display UNCLAIMED       
       description: 3D controller
       product: GK210GL [Tesla K80] [10DE:102D]
       vendor: NVIDIA Corporation [10DE]
       physical id: 4
       bus info: pci@0000:00:04.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm cap_list
       configuration: latency=0
       resources: iomemory:40-3f iomemory:80-7f memory:fc000000-fcffffff memory:400000000-7ffffffff memory:800000000-801ffffff ioport:c000(size=128)
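
The "UNCLAIMED" flag already hints that no kernel driver is bound to the device. A cross-check with lspci (again just a diagnostic sketch) should show whether a "Kernel driver in use:" line is present for the GPU:

# -k lists the kernel driver (if any) bound to each PCI device
lspci -nnk | grep -i -A 3 nvidia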

Installation process I followed

Creation of the CentOS 7 instance:

gcloud compute instances create test-gpu-drivers \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family centos-7 --image-project centos-cloud \
    --maintenance-policy TERMINATE

The installation process I then followed for the drivers & CUDA is inspired by the Google Cloud documentation, but with the latest versions instead:

gcloud compute ssh test-gpu-drivers
sudo su
yum -y update

# Reboot for kernel update to be taken into account
reboot

gcloud compute ssh test-gpu-drivers
sudo su

# Install nvidia drivers repository, found here: https://www.nvidia.com/Download/index.aspx?lang=en-us
curl -J -O http://us.download.nvidia.com/tesla/410.72/nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm
yum -y install ./nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm

# Install CUDA repository, found here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=rpmlocal
curl -J -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
yum -y install ./cuda-repo-rhel7-10.0.130-1.x86_64.rpm

# Install CUDA & drivers & dependencies
yum clean all
yum -y install cuda

nvidia-smi -pm 1

reboot

gcloud compute ssh test-gpu-drivers
sudo su
nvidia-smi -pm 1
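
In hindsight, one common failure mode on CentOS is the kernel module not building because kernel-devel/kernel-headers do not exactly match the running kernel after the yum update. A hedged sketch of the extra check I would add here (not part of the procedure above):

# The driver build needs devel/headers for the *running* kernel
uname -r
rpm -q kernel-devel kernel-headers

# If the versions do not match, install the exact ones and then reinstall the driver packages
yum -y install "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"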

Full logs here: https://gist.github.com/elouanKeryell-Even/983c07b4bf118524187760a67951d8d1.

(I also tried the exact GCE driver install script, without upgrading the versions, but with no luck either.)

Environment

  • Distribution release
[root@test-gpu-drivers myuser]# cat /etc/*-release | head -n 1
CentOS Linux release 7.6.1810 (Core) 
  • Kernel release
[root@test-gpu-drivers myuser]# uname -r
3.10.0-957.1.3.el7.x86_64

I can make it work on Ubuntu!

To analyze the problem, I decided to try doing the same thing on Ubuntu 18.04 (LTS). This time, I had no problem.

Instance creation:

gcloud compute instances create gpu-ubuntu-1804 \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE

Install process:

gcloud compute ssh gpu-ubuntu-1804
sudo su
apt update
apt -y upgrade
reboot

gcloud compute ssh gpu-ubuntu-1804
sudo su
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt -y install ./cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
rm cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-get update
apt-get -y install cuda
nvidia-smi -pm 1

Full installation logs available here.

Test:

# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:00:04.0.
All done.
# ll /dev/nvidia*
crw-rw-rw- 1 root root 241,   0 Dec  4 14:01 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Dec  4 14:01 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  4 14:01 /dev/nvidiactl

One thing I noticed is that on Ubuntu, installing the nvidia-dkms package triggers some extra setup (nouveau blacklisting, initramfs rebuild) that I did not see on CentOS; the CentOS counterpart is sketched after the log below:

Setting up nvidia-dkms-410 (410.79-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

Loading new nvidia-410.79 DKMS files...
Building for 4.15.0-1025-gcp
Building for architecture x86_64
Building initial module for 4.15.0-1025-gcp
Generating a 2048 bit RSA private key
.............................................................................................................+++
..........+++
writing new private key to '/var/lib/shim-signed/mok/MOK.priv'
-----
EFI variables are not supported on this system
/sys/firmware/efi/efivars not found, aborting.
Done.

nvidia:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

depmod...

DKMS: install completed.
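
On CentOS those same two steps (blacklisting nouveau and regenerating the initramfs) have to happen as well; the NVIDIA RPMs are supposed to take care of it, but it seems worth verifying. A sketch of the manual equivalent on CentOS (assumption: run as root, then reboot):

# Check whether nouveau is already blacklisted by the driver packages
grep -r nouveau /etc/modprobe.d/ /usr/lib/modprobe.d/ 2>/dev/null

# Manual equivalent of what the Ubuntu package did above
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf

# dracut is the CentOS counterpart of update-initramfs
dracut --force
reboot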

Environment

  • Distribution release
root@gpu-ubuntu-1804:/home/elouan_keryell-even# cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel release
root@gpu-ubuntu-1804:/home/elouan_keryell-even# uname -r
4.15.0-1025-gcp

Question

Does anyone understand what is going wrong with my installation of the NVIDIA drivers on CentOS 7?

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
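
For reference, running it is just the following (a sketch; by default the archive is written to the current working directory):

# as root
nvidia-bug-report.sh
ls -lh nvidia-bug-report.log.gz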