NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Tesla V100 GPU installed on HP ProLiant DL380 running RHEL 9.4

Tesla V100 GPU installed on HP ProLiant DL380 running RHEL 9.4. nvidia-smi fails to launch.

Posted on May 28, 2025 10:28 AM

I have a Tesla V100 installed in my HP ProLiant DL380 Gen8 server running RHEL 9.4.
I’m getting this error:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Attached is the bug report generated by:

# /usr/bin/nvidia-bug-report.sh
nvidia-bug-report.log (2.3 MB)


System info and the steps I followed:

uname -a

Linux hp1 5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 30 15:47:10 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

uname -r

5.14.0-427.35.1.el9_4.x86_64

modinfo nvidia | grep ^version

version: 570.124.06

downloaded driver:
-rw-r--r--. 1 root root 531810314 Feb 27 01:19 nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm

I followed the instructions for RHEL 9 in the NVIDIA Driver Installation Guide (r570 documentation):

  1. dnf install kernel-devel-matched kernel-headers

************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 1:27:16 ago on Wed 09 Apr 2025 09:56:20 AM EDT.
Package kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64 is already installed.
Package kernel-headers-5.14.0-427.35.1.el9_4.x86_64 is already installed.

Dependencies resolved.

Package Architecture Version Repository Size

Installing:
kernel-core x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 18 M
kernel-devel x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 22 M
kernel-modules-core x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 31 M
Upgrading:
kernel-devel-matched x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 2.0 M
kernel-headers x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 3.8 M

Transaction Summary

Install 3 Packages
Upgrade 2 Packages

Total download size: 76 M
Is this ok [y/N]: y
Downloading Packages:
(1/5): kernel-core-5.14.0-503.35.1.el9_5.x86_64.rpm 7.5 MB/s | 18 MB 00:02
(2/5): kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64.rpm 5.3 MB/s | 2.0 MB 00:00
(3/5): kernel-devel-5.14.0-503.35.1.el9_5.x86_64.rpm 7.3 MB/s | 22 MB 00:03
(4/5): kernel-headers-5.14.0-503.35.1.el9_5.x86_64.rpm 7.5 MB/s | 3.8 MB 00:00

(5/5): kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64.rpm 8.5 MB/s | 31 MB 00:03

Total 21 MB/s | 76 MB 00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing :

Installing : kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64

Installing : kernel-core-5.14.0-503.35.1.el9_5.x86_64

Running scriptlet: kernel-core-5.14.0-503.35.1.el9_5.x86_64

Installing : kernel-devel-5.14.0-503.35.1.el9_5.x86_64

Running scriptlet: kernel-devel-5.14.0-503.35.1.el9_5.x86_64

Upgrading : kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64

Upgrading : kernel-headers-5.14.0-503.35.1.el9_5.x86_64

Cleanup : kernel-headers-5.14.0-427.35.1.el9_4.x86_64

Cleanup : kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64
Running scriptlet: kernel-modules-core-5.14.0-503.35.1.el9_5.x
Sign command: /lib/modules/5.14.0-503.35.1.el9_5.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Autoinstall of module nvidia/570.124.06 for kernel 5.14.0-503.35.1.el9_5.x86_64 (x86_64)
Cleaning build area… done.
Building module(s)… done.
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-peermem.ko
Cleaning build area… done.
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-modeset.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-drm.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-uvm.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-peermem.ko.xz
Running depmod… done.

Autoinstall on 5.14.0-503.35.1.el9_5.x86_64 succeeded for module(s) nvidia.

Running scriptlet: kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64

Verifying : kernel-core-5.14.0-503.35.1.el9_5.x86_64

Verifying : kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64

Verifying : kernel-devel-5.14.0-503.35.1.el9_5.x86_64

Verifying : kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64

Verifying : kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64

Verifying : kernel-headers-5.14.0-503.35.1.el9_5.x86_64

Verifying : kernel-headers-5.14.0-427.35.1.el9_4.x86_64

Installed products updated.

Upgraded:
kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64 kernel-headers-5.14.0-503.35.1.el9_5.x86_64
Installed:
kernel-core-5.14.0-503.35.1.el9_5.x86_64 kernel-devel-5.14.0-503.35.1.el9_5.x86_64 kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64

Complete!
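
A point that may be worth checking after this transaction: DKMS built and installed the modules for 5.14.0-503.35.1.el9_5, while `uname -r` above reports 5.14.0-427.35.1.el9_4 as the running kernel. The sketch below checks whether `dkms status` lists an installed nvidia module for a given kernel. The helper name `dkms_built_for` is made up for illustration, and the matching assumes the `module/version, kernel, arch: installed` line format, which may differ between DKMS versions.

```shell
# Succeeds if the `dkms status` text on stdin lists an installed module
# named MODULE for kernel release KERNEL.
# Assumption: lines look like "nvidia/570.124.06, <kernel>, x86_64: installed".
dkms_built_for() {
    module="$1"
    kernel="$2"
    grep -F "$module" | grep -F ", ${kernel}," | grep -q ': installed'
}

# Usage:
#   dkms status | dkms_built_for nvidia "$(uname -r)" && echo built || echo missing
```

If this reports `missing` for the running kernel, rebooting into the kernel DKMS did build for would be the simpler of the two fixes to try first.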


  1. satisfy third party requirements

subscription-manager repos --enable=rhel-9-for-$arch-appstream-rpms

subscription-manager repos --enable=rhel-9-for-$arch-baseos-rpms

subscription-manager repos --enable=codeready-builder-for-rhel-9-$arch-rpms

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
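
Note that `$arch` in the commands above appears to be the guide's placeholder for the machine architecture. It is not set by default in a shell, so it would expand to an empty string unless it is defined first. A minimal sketch that derives it from `uname -m`; the helper name `rhel9_repo_ids` is hypothetical:

```shell
# Print the three RHEL 9 repository IDs used in this step for one architecture.
# Defaults to the output of `uname -m` when no argument is given.
rhel9_repo_ids() {
    arch="${1:-$(uname -m)}"
    printf '%s\n' \
        "rhel-9-for-${arch}-appstream-rpms" \
        "rhel-9-for-${arch}-baseos-rpms" \
        "codeready-builder-for-rhel-9-${arch}-rpms"
}

# Usage (as root): enable each repository in turn.
#   for repo in $(rhel9_repo_ids); do
#       subscription-manager repos --enable="$repo"
#   done
```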

  1. downloaded
    -rw-r--r--. 1 root root 531810314 Feb 27 01:19 nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm
  2. install repository

rpm --install nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm

************************************************** OUTPUT *************************************************
package nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64 is already installed


  1. Driver installation

dnf module install nvidia-driver:latest-dkms

************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 0:03:00 ago on Wed 09 Apr 2025 11:36:40 AM EDT.

Dependencies resolved.

Package Architecture Version Repository Size

Installing group/module packages:
nvidia-driver x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 4.0 M
nvidia-settings x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 844 k
nvidia-xconfig x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 93 k
Installing dependencies:
xorg-x11-nvidia x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 2.4 M
Installing module profiles:
nvidia-driver/default
Enabling module streams:
nvidia-driver latest-dkms

Transaction Summary

Install 4 Packages

Total size: 7.3 M
Installed size: 37 M
Is this ok [y/N]: y
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : nvidia-driver-3:570.124.06-1.el9.x86_64 1/4
Running scriptlet: nvidia-driver-3:570.124.06-1.el9.x86_64 1/4
Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-hibernate.service → /usr/lib/systemd/system/nvidia-hibernate.service.
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-powerd.service → /usr/lib/systemd/system/nvidia-powerd.service.
Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service → /usr/lib/systemd/system/nvidia-suspend.service.
Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-suspend-then-hibernate.service → /usr/lib/systemd/system/nvidia-suspend-then-hibernate.service.

Installing : xorg-x11-nvidia-3:570.124.06-1.el9.x86_64

Installing : nvidia-xconfig-3:570.124.06-1.el9.x86_64

Installing : nvidia-settings-3:570.124.06-1.el9.x86_64

Running scriptlet: nvidia-settings-3:570.124.06-1.el9.x86_64

Verifying : nvidia-driver-3:570.124.06-1.el9.x86_64

Verifying : nvidia-settings-3:570.124.06-1.el9.x86_64

Verifying : nvidia-xconfig-3:570.124.06-1.el9.x86_64

Verifying : xorg-x11-nvidia-3:570.124.06-1.el9.x86_64

Installed products updated.

Installed:
nvidia-driver-3:570.124.06-1.el9.x86_64 nvidia-settings-3:570.124.06-1.el9.x86_64 nvidia-xconfig-3:570.124.06-1.el9.x86_64
xorg-x11-nvidia-3:570.124.06-1.el9.x86_64

Complete!


  1. install compute-only system

dnf install nvidia-driver-cuda kmod-nvidia-latest-dkms

************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 0:05:53 ago on Wed 09 Apr 2025 11:36:40 AM EDT.
Package nvidia-driver-cuda-3:570.124.06-1.el9.x86_64 is already installed.
Package kmod-nvidia-latest-dkms-3:570.124.06-1.el9.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!


  1. reboot

nvidia-smi

************************************************** OUTPUT *************************************************
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


  1. debug

lsmod | grep nvidia

(reports nothing)

mokutil --sb-state

EFI variables are not supported on this system
(I assume I’m running in BIOS mode, so Secure Boot does not apply.)
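
The `mokutil` message is consistent with a legacy BIOS boot, but this can be confirmed directly: on Linux the directory `/sys/firmware/efi` is present only when the system was booted through UEFI. A small sketch (the function name `boot_mode` and its optional path argument are illustrative only):

```shell
# Print "UEFI" if the EFI firmware directory exists, otherwise "BIOS".
# The optional argument overrides the path and exists only for testing.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

boot_mode
```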

modprobe nvidia

modprobe: ERROR: could not insert 'nvidia': No such device
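
`No such device` from `modprobe` generally means the module got as far as probing but found no device it could bind to, so a reasonable next check is whether the V100 appears on the PCI bus at all. The sketch below counts NVIDIA functions in `lspci -nn` output, relying on NVIDIA's PCI vendor ID `10de`. The function name `count_nvidia_pci` is hypothetical.

```shell
# Count NVIDIA PCI functions in `lspci -nn` text read from stdin.
# 10de is NVIDIA's PCI vendor ID; `lspci -nn` prints IDs as [vendor:device].
count_nvidia_pci() {
    grep -ci '\[10de:[0-9a-f]\{4\}\]'
}

# Usage:
#   lspci -nn | count_nvidia_pci
```

A count of 0 would suggest looking at the slot, riser, power cabling, or BIOS settings rather than at the driver packages.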

dmesg | grep nvidia

[ 4933.185877] nvidia: loading out-of-tree module taints kernel.
[ 4933.185892] nvidia: module license 'NVIDIA' taints kernel.
[ 4933.218122] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4933.391589] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 4933.395237] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
[ 4942.443274] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 4942.445489] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
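
The filter used above is case-sensitive, so it can miss messages the driver logs under its `NVRM:` prefix, and those are often the lines that say why initialization stopped. A sketch of a wider filter (the function name `gpu_log_lines` is made up for illustration):

```shell
# Keep kernel log lines from the NVIDIA driver, matching case-insensitively
# on both "nvidia" and the "NVRM" prefix the driver uses for its messages.
gpu_log_lines() {
    grep -iE 'nvidia|nvrm'
}

# Usage (as root):
#   dmesg | gpu_log_lines
```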

systemctl status nvidia-powerd

○ nvidia-powerd.service - nvidia-powerd service
Loaded: loaded (/usr/lib/systemd/system/nvidia-powerd.service; enabled; preset: enabled)
Active: inactive (dead) since Wed 2025-04-09 10:27:14 EDT; 1h 24min ago
Duration: 764ms
Main PID: 1449 (code=exited, status=1/FAILURE)
CPU: 16ms

Apr 09 10:27:14 hp1 systemd[1]: Started nvidia-powerd service.
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: nvidia-powerd version:1.0(build 1)
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: Allocate client failed 89
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: Failed to initialize RM Client
Apr 09 10:27:14 hp1 systemd[1]: nvidia-powerd.service: Deactivated successfully.

rpm -qa | grep -i Nvidia

nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64
libnvidia-ml-570.124.06-1.el9.x86_64
libnvidia-cfg-570.124.06-1.el9.x86_64
nvidia-driver-cuda-libs-570.124.06-1.el9.x86_64
nvidia-libXNVCtrl-570.124.06-1.el9.x86_64
nvidia-persistenced-570.124.06-1.el9.x86_64
nvidia-modprobe-570.124.06-1.el9.x86_64
nvidia-kmod-common-570.124.06-1.el9.noarch
kmod-nvidia-latest-dkms-570.124.06-1.el9.x86_64
nvidia-driver-libs-570.124.06-1.el9.x86_64
nvidia-driver-cuda-570.124.06-1.el9.x86_64
nvidia-libXNVCtrl-devel-570.124.06-1.el9.x86_64
libnvidia-fbc-570.124.06-1.el9.x86_64
nvidia-driver-570.124.06-1.el9.x86_64
xorg-x11-nvidia-570.124.06-1.el9.x86_64
nvidia-xconfig-570.124.06-1.el9.x86_64
nvidia-settings-570.124.06-1.el9.x86_64
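
All packages in the list above appear to be at 570.124.06. For a longer list, a quick way to confirm there is no version mix is to print the distinct versions. A sketch, assuming input in `name version` form as produced by the `rpm` query format in the usage comment; the helper name `distinct_versions` is hypothetical:

```shell
# Read "name version" pairs on stdin and print each distinct version once.
distinct_versions() {
    awk '{ print $2 }' | sort -u
}

# Usage:
#   rpm -qa --qf '%{NAME} %{VERSION}\n' | grep -i nvidia | distinct_versions
```

More than one line of output would indicate packages from different driver releases.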

systemctl status nvidia-persistenced

○ nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; preset: enabled)
Active: inactive (dead)