Tesla V100 GPU installed on HP ProLiant DL380 running RHEL 9.4: nvidia-smi fails to launch
Posted on May 28, 2025 10:28 AM
I have a Tesla V100 installed in my HP ProLiant DL380 Gen8 server running RHEL 9.4.
I'm getting this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Attached is the bug report generated by:
# /usr/bin/nvidia-bug-report.sh
nvidia-bug-report.log (2.3 MB)
System info and the steps I took:
uname -a
Linux hp1 5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 30 15:47:10 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
uname -r
5.14.0-427.35.1.el9_4.x86_64
modinfo nvidia | grep ^version
version: 570.124.06
Downloaded driver:
-rw-r--r--. 1 root root 531810314 Feb 27 01:19 nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm
I followed the instructions for RHEL 9 in the NVIDIA Driver Installation Guide, r570 documentation (the page titled "1. Introduction").
************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 1:27:16 ago on Wed 09 Apr 2025 09:56:20 AM EDT.
Package kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64 is already installed.
Package kernel-headers-5.14.0-427.35.1.el9_4.x86_64 is already installed.
Dependencies resolved.
Package Architecture Version Repository Size
Installing:
kernel-core x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 18 M
kernel-devel x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 22 M
kernel-modules-core x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 31 M
Upgrading:
kernel-devel-matched x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 2.0 M
kernel-headers x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-appstream-rpms 3.8 M
Transaction Summary
Install 3 Packages
Upgrade 2 Packages
Total download size: 76 M
Is this ok [y/N]: y
Downloading Packages:
(1/5): kernel-core-5.14.0-503.35.1.el9_5.x86_64.rpm 7.5 MB/s | 18 MB 00:02
(2/5): kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64.rpm 5.3 MB/s | 2.0 MB 00:00
(3/5): kernel-devel-5.14.0-503.35.1.el9_5.x86_64.rpm 7.3 MB/s | 22 MB 00:03
(4/5): kernel-headers-5.14.0-503.35.1.el9_5.x86_64.rpm 7.5 MB/s | 3.8 MB 00:00
(5/5): kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64.rpm 8.5 MB/s | 31 MB 00:03
Total 21 MB/s | 76 MB 00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing :
Installing : kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64
Installing : kernel-core-5.14.0-503.35.1.el9_5.x86_64
Running scriptlet: kernel-core-5.14.0-503.35.1.el9_5.x86_64
Installing : kernel-devel-5.14.0-503.35.1.el9_5.x86_64
Running scriptlet: kernel-devel-5.14.0-503.35.1.el9_5.x86_64
Upgrading : kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64
Upgrading : kernel-headers-5.14.0-503.35.1.el9_5.x86_64
Cleanup : kernel-headers-5.14.0-427.35.1.el9_4.x86_64
Cleanup : kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64
Running scriptlet: kernel-modules-core-5.14.0-503.35.1.el9_5.x
Sign command: /lib/modules/5.14.0-503.35.1.el9_5.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub
Autoinstall of module nvidia/570.124.06 for kernel 5.14.0-503.35.1.el9_5.x86_64 (x86_64)
Cleaning build area... done.
Building module(s)... done.
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/570.124.06/build/nvidia-peermem.ko
Cleaning build area... done.
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-modeset.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-drm.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-uvm.ko.xz
Installing /lib/modules/5.14.0-503.35.1.el9_5.x86_64/extra/nvidia-peermem.ko.xz
Running depmod... done.
Autoinstall on 5.14.0-503.35.1.el9_5.x86_64 succeeded for module(s) nvidia.
Running scriptlet: kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64
Verifying : kernel-core-5.14.0-503.35.1.el9_5.x86_64
Verifying : kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64
Verifying : kernel-devel-5.14.0-503.35.1.el9_5.x86_64
Verifying : kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64
Verifying : kernel-devel-matched-5.14.0-427.35.1.el9_4.x86_64
Verifying : kernel-headers-5.14.0-503.35.1.el9_5.x86_64
Verifying : kernel-headers-5.14.0-427.35.1.el9_4.x86_64
Installed products updated.
Upgraded:
kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64 kernel-headers-5.14.0-503.35.1.el9_5.x86_64
Installed:
kernel-core-5.14.0-503.35.1.el9_5.x86_64 kernel-devel-5.14.0-503.35.1.el9_5.x86_64 kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64
Complete!
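The transcript above shows DKMS building nvidia/570.124.06 for kernel 5.14.0-503.35.1.el9_5, while the earlier uname -r output reports 5.14.0-427.35.1.el9_4. After a reboot it may be worth confirming that a module is installed for whichever kernel is actually running. A minimal sketch, assuming `dkms status` prints lines shaped like `<module>/<version>, <kernel>, <arch>: installed` (the exact layout differs between DKMS releases); the function name dkms_built_for is hypothetical:

```shell
# Hypothetical helper: succeeds if `dkms status` text on stdin lists the
# given module as installed for the given kernel release.
# Assumption: each status line starts with the module name and contains
# ", <kernel>," followed later by ": installed".
dkms_built_for() {
    # $1 = module name (e.g. nvidia), $2 = kernel release (e.g. from uname -r)
    awk -v mod="$1" -v kver="$2" '
        index($0, mod) == 1 && index($0, ", " kver ",") > 0 && /: installed/ { found = 1 }
        END { exit !found }
    '
}

# Usage (run as root on the affected host):
#   dkms status | dkms_built_for nvidia "$(uname -r)" && echo "module present" || echo "module missing"
```

If the check reports a missing module, the running kernel and the kernel DKMS built for are probably different, which would be one thing to rule out before looking at the hardware.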
- satisfy third party requirements
subscription-manager repos --enable=rhel-9-for-$arch-appstream-rpms
subscription-manager repos --enable=rhel-9-for-$arch-baseos-rpms
subscription-manager repos --enable=codeready-builder-for-rhel-9-$arch-rpms
dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
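Note that `$arch` in the commands above is a placeholder and is not set by the shell on its own; if it is empty, the repository IDs come out malformed. A small sketch, assuming `uname -m` returns the same architecture string the repository IDs use (x86_64 on this server); the function name rhel9_repo_ids is hypothetical:

```shell
# Hypothetical helper: prints the three repository IDs used above for a
# given architecture, defaulting to the output of `uname -m`.
rhel9_repo_ids() {
    _arch="${1:-$(uname -m)}"
    printf '%s\n' \
        "rhel-9-for-${_arch}-appstream-rpms" \
        "rhel-9-for-${_arch}-baseos-rpms" \
        "codeready-builder-for-rhel-9-${_arch}-rpms"
}

# Usage (as root):
#   for repo in $(rhel9_repo_ids); do subscription-manager repos --enable="$repo"; done
```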
- downloaded
-rw-r--r--. 1 root root 531810314 Feb 27 01:19 nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm
- install repository
rpm --install nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm
************************************************** OUTPUT *************************************************
package nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64 is already installed
- Driver installation
dnf module install nvidia-driver:latest-dkms
************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 0:03:00 ago on Wed 09 Apr 2025 11:36:40 AM EDT.
Dependencies resolved.
Package Architecture Version Repository Size
Installing group/module packages:
nvidia-driver x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 4.0 M
nvidia-settings x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 844 k
nvidia-xconfig x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 93 k
Installing dependencies:
xorg-x11-nvidia x86_64 3:570.124.06-1.el9 nvidia-driver-local-rhel9-570.124.06 2.4 M
Installing module profiles:
nvidia-driver/default
Enabling module streams:
nvidia-driver latest-dkms
Transaction Summary
Install 4 Packages
Total size: 7.3 M
Installed size: 37 M
Is this ok [y/N]: y
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : nvidia-driver-3:570.124.06-1.el9.x86_64 1/4
Running scriptlet: nvidia-driver-3:570.124.06-1.el9.x86_64 1/4
Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-hibernate.service → /usr/lib/systemd/system/nvidia-hibernate.service.
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-powerd.service → /usr/lib/systemd/system/nvidia-powerd.service.
Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service.
Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service → /usr/lib/systemd/system/nvidia-suspend.service.
Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-suspend-then-hibernate.service → /usr/lib/systemd/system/nvidia-suspend-then-hibernate.service.
Installing : xorg-x11-nvidia-3:570.124.06-1.el9.x86_64
Installing : nvidia-xconfig-3:570.124.06-1.el9.x86_64
Installing : nvidia-settings-3:570.124.06-1.el9.x86_64
Running scriptlet: nvidia-settings-3:570.124.06-1.el9.x86_64
Verifying : nvidia-driver-3:570.124.06-1.el9.x86_64
Verifying : nvidia-settings-3:570.124.06-1.el9.x86_64
Verifying : nvidia-xconfig-3:570.124.06-1.el9.x86_64
Verifying : xorg-x11-nvidia-3:570.124.06-1.el9.x86_64
Installed products updated.
Installed:
nvidia-driver-3:570.124.06-1.el9.x86_64 nvidia-settings-3:570.124.06-1.el9.x86_64 nvidia-xconfig-3:570.124.06-1.el9.x86_64
xorg-x11-nvidia-3:570.124.06-1.el9.x86_64
Complete!
- install compute-only system
dnf install nvidia-driver-cuda kmod-nvidia-latest-dkms
************************************************** OUTPUT *************************************************
Updating Subscription Management repositories.
Last metadata expiration check: 0:05:53 ago on Wed 09 Apr 2025 11:36:40 AM EDT.
Package nvidia-driver-cuda-3:570.124.06-1.el9.x86_64 is already installed.
Package kmod-nvidia-latest-dkms-3:570.124.06-1.el9.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
- reboot
nvidia-smi
************************************************** OUTPUT *************************************************
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
- debug
lsmod | grep nvidia
(reports nothing)
mokutil --sb-state
EFI variables are not supported on this system
(I assume I'm booting in legacy BIOS mode, so Secure Boot does not apply.)
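The boot-mode assumption can be checked directly: Linux exposes /sys/firmware/efi only when the system was booted through UEFI. A minimal sketch; the function name boot_mode and its optional path argument are hypothetical and exist only to make the check easy to test:

```shell
# Hypothetical helper: reports UEFI if the EFI firmware directory exists,
# otherwise BIOS. The optional argument overrides the directory to check.
boot_mode() {
    _efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$_efi_dir" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

boot_mode
```

A result of BIOS would be consistent with the mokutil message above, since Secure Boot requires UEFI.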
modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
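"No such device" suggests the module loaded far enough to look for a GPU and did not find one it could use, so it may help to confirm that the card is visible on the PCI bus at all. A minimal sketch, assuming `lspci -nn` is available (pciutils) and prints a `[vendor:device]` pair on each line; 10de is NVIDIA's PCI vendor ID, and the function name nvidia_pci_devices is hypothetical:

```shell
# Hypothetical helper: reads `lspci -nn` text on stdin and prints only the
# lines whose [vendor:device] pair carries NVIDIA's vendor ID (10de).
# Exit status is non-zero when no such line is found.
nvidia_pci_devices() {
    grep -i '\[10de:[0-9a-f]\{4\}\]'
}

# Usage:
#   lspci -nn | nvidia_pci_devices || echo "no NVIDIA device visible on the PCI bus"
```

If the card is listed, `lspci -vv -s <slot>` shows whether its memory regions were assigned. Unassigned regions on an older platform are one possible cause of this error, though nothing in the output above confirms that.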
dmesg |grep nvidia
[ 4933.185877] nvidia: loading out-of-tree module taints kernel.
[ 4933.185892] nvidia: module license 'NVIDIA' taints kernel.
[ 4933.218122] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4933.391589] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 4933.395237] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
[ 4942.443274] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 4942.445489] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
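The grep above is case-sensitive and matches only lowercase "nvidia", so it would miss kernel messages the driver logs under its NVRM: prefix, which is where load failures are usually explained. A minimal sketch of a wider filter; the function name driver_log_lines is hypothetical:

```shell
# Hypothetical helper: reads kernel log text on stdin and keeps lines that
# mention nvidia or NVRM, matched case-insensitively.
driver_log_lines() {
    grep -i -e 'nvidia' -e 'nvrm'
}

# Usage (as root):
#   dmesg | driver_log_lines
```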
systemctl status nvidia-powerd
○ nvidia-powerd.service - nvidia-powerd service
Loaded: loaded (/usr/lib/systemd/system/nvidia-powerd.service; enabled; preset: enabled)
Active: inactive (dead) since Wed 2025-04-09 10:27:14 EDT; 1h 24min ago
Duration: 764ms
Main PID: 1449 (code=exited, status=1/FAILURE)
CPU: 16ms
Apr 09 10:27:14 hp1 systemd[1]: Started nvidia-powerd service.
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: nvidia-powerd version:1.0(build 1)
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: Allocate client failed 89
Apr 09 10:27:14 hp1 /usr/bin/nvidia-powerd[1449]: Failed to initialize RM Client
Apr 09 10:27:14 hp1 systemd[1]: nvidia-powerd.service: Deactivated successfully.
rpm -qa | grep -i Nvidia
nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64
libnvidia-ml-570.124.06-1.el9.x86_64
libnvidia-cfg-570.124.06-1.el9.x86_64
nvidia-driver-cuda-libs-570.124.06-1.el9.x86_64
nvidia-libXNVCtrl-570.124.06-1.el9.x86_64
nvidia-persistenced-570.124.06-1.el9.x86_64
nvidia-modprobe-570.124.06-1.el9.x86_64
nvidia-kmod-common-570.124.06-1.el9.noarch
kmod-nvidia-latest-dkms-570.124.06-1.el9.x86_64
nvidia-driver-libs-570.124.06-1.el9.x86_64
nvidia-driver-cuda-570.124.06-1.el9.x86_64
nvidia-libXNVCtrl-devel-570.124.06-1.el9.x86_64
libnvidia-fbc-570.124.06-1.el9.x86_64
nvidia-driver-570.124.06-1.el9.x86_64
xorg-x11-nvidia-570.124.06-1.el9.x86_64
nvidia-xconfig-570.124.06-1.el9.x86_64
nvidia-settings-570.124.06-1.el9.x86_64
systemctl status nvidia-persistenced
○ nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; preset: enabled)
Active: inactive (dead)