NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Hello

I installed the NVIDIA drivers on RHEL 8.8 by following the procedure in the Installation Guide for Linux 12.3 documentation. After the installation, running nvidia-smi gives this error:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This is my config:

lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
02:01.0 VGA compatible controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/lib/modprobe.d/dist-blacklist.conf:blacklist nvidiafb
/lib/modprobe.d/nvidia.conf:# Make a soft dependency for nvidia-uvm as adding the module loading to
/lib/modprobe.d/nvidia.conf:# /usr/lib/modules-load.d/nvidia-uvm.conf for systemd consumption, makes the
/lib/modprobe.d/nvidia.conf:softdep nvidia post: nvidia-uvm
/lib/modprobe.d/nvidia.conf:options nvidia NVreg_DynamicPowerManagement=0x02
/lib/modprobe.d/nvidia.conf:# options nvidia-drm mod

dkms status
nvidia/545.23.06: added

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thanks for your answer. Here is the bug report:
nvidia-bug-report.log.gz (69.9 KB)

Please run
sudo dkms install nvidia/545.23.06
and post any errors displayed.
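
For context: the "added" state in the dkms status output above means the module source is registered with DKMS but has not been built or installed for any kernel yet, which is why nvidia-smi cannot reach the driver. A minimal sketch for reading that state from the dkms status output; dkms_state is a hypothetical helper name, and the parsing assumes the "module/version: state" line layout shown in this thread:

```shell
# Print the DKMS state (added, built, installed) for one module/version.
# Reads `dkms status` output on stdin. Assumption: the state is the word
# that follows the last ": " on the matching line.
dkms_state() {
    # $1 is a module/version string such as nvidia/545.23.06
    grep -F -- "$1" | sed -n 's/^.*: *\([a-z]*\).*$/\1/p' | head -n 1
}

# Typical use:
#   dkms status | dkms_state nvidia/545.23.06
```

Anything other than "installed" for the running kernel means the module still has to be built and installed.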

sudo dkms install nvidia/545.23.06
Error! Your kernel headers for kernel 4.18.0-477.27.1.el8_8.x86_64 cannot be found at /lib/modules/4.18.0-477.27.1.el8_8.x86_64/build or /lib/modules/4.18.0-477.27.1.el8_8.x86_64/source.
Please install the linux-headers-4.18.0-477.27.1.el8_8.x86_64 package or use the --kernelsourcedir option to tell DKMS where it's located.

Please re-run
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
and post any errors.

Repository epel is listed more than once in the configuration
Last metadata expiration check: 2:24:12 ago on Tue 14 Nov 2023 12:02:20 PM CET.
No match for argument: kernel-devel-4.18.0-477.27.1.el8_8.x86_64
No match for argument: kernel-headers-4.18.0-477.27.1.el8_8.x86_64
Error: Unable to find a match: kernel-devel-4.18.0-477.27.1.el8_8.x86_64 kernel-headers-4.18.0-477.27.1.el8_8.x86_64

This is what I have installed for kernel packages:

rpm -qa | grep kernel
kernel-4.18.0-477.27.1.el8_8.x86_64
kernel-headers-4.18.0-477.10.1.el8_8.x86_64
kernel-4.18.0-477.10.1.el8_8.x86_64
kernel-modules-4.18.0-477.27.1.el8_8.x86_64
kernel-tools-libs-4.18.0-477.10.1.el8_8.x86_64
kernel-tools-4.18.0-477.10.1.el8_8.x86_64
kernel-devel-4.18.0-477.10.1.el8_8.x86_64
kernel-core-4.18.0-477.10.1.el8_8.x86_64
kernel-core-4.18.0-477.27.1.el8_8.x86_64
kernel-modules-4.18.0-477.10.1.el8_8.x86_64

Then there’s something wrong with your RHEL repos. 4.18.0-477.27.1 should be the latest kernel for RHEL 8.8, so the -headers and -devel packages should be available. The initial 4.18.0-477.10.1 kernel is complete, though.
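
The mismatch is visible in the package list above: kernel-devel and kernel-headers exist only for 4.18.0-477.10.1, while the kernel DKMS complained about is 4.18.0-477.27.1. A minimal sketch for checking this; missing_kernel_pkgs is a hypothetical helper name, not a standard tool, and it assumes package names in the name-version-release.arch form that rpm -qa prints:

```shell
# Print the kernel-devel / kernel-headers packages that are absent for a
# given kernel release. Reads a package list (e.g. from `rpm -qa`) on stdin.
missing_kernel_pkgs() {
    release="$1"
    pkgs="$(cat)"
    for name in kernel-devel kernel-headers; do
        # -x matches whole lines only, -F treats the name as a fixed string
        if ! printf '%s\n' "$pkgs" | grep -qxF -- "${name}-${release}"; then
            printf '%s\n' "${name}-${release}"
        fi
    done
}

# Typical use on the affected machine:
#   rpm -qa | missing_kernel_pkgs "$(uname -r)"
```

Empty output means both packages are present for that kernel; any line printed is a package DKMS will need before it can build the module.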

Hello, thanks for your help. Do you know which repo I must enable to install the devel and headers packages?

I resolved the kernel issue: the BaseOS current and latest repositories were not enabled.

rpm -qa | grep kernel
kernel-devel-4.18.0-477.27.1.el8_8.x86_64
kernel-4.18.0-477.27.1.el8_8.x86_64
kernel-devel-4.18.0-477.10.1.el8_8.x86_64
kernel-4.18.0-477.10.1.el8_8.x86_64
kernel-modules-4.18.0-477.27.1.el8_8.x86_64
kernel-tools-libs-4.18.0-477.10.1.el8_8.x86_64
kernel-tools-4.18.0-477.10.1.el8_8.x86_64
kernel-core-4.18.0-477.10.1.el8_8.x86_64
kernel-headers-4.18.0-477.27.1.el8_8.x86_64
kernel-core-4.18.0-477.27.1.el8_8.x86_64
kernel-modules-4.18.0-477.10.1.el8_8.x86_64

rpm -qa | grep kernel | grep 4.18.0-477.27
kernel-devel-4.18.0-477.27.1.el8_8.x86_64
kernel-4.18.0-477.27.1.el8_8.x86_64
kernel-modules-4.18.0-477.27.1.el8_8.x86_64
kernel-headers-4.18.0-477.27.1.el8_8.x86_64
kernel-core-4.18.0-477.27.1.el8_8.x86_64

But the dkms command still fails, with a different error:

dkms install nvidia/545.23.06
Sign command: /lib/modules/4.18.0-477.27.1.el8_8.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub
Error! Could not find module source directory.
Directory: /usr/src/nvidia-545.23.06 does not exist.

Odd. Please try reinstalling the driver:
sudo dnf module install nvidia-driver:latest-dkms
Post any errors, afterwards the output of
dkms status
ls -l /usr/src

dnf module install nvidia-driver:latest-dkms
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Repository epel is listed more than once in the configuration
Last metadata expiration check: 0:10:11 ago on Wed 15 Nov 2023 10:14:28 AM CET.
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-545.23.06-4.18.0-513.5.1 for kernel version 4.18.0-513.5.1.el8_9 and NVIDIA driver 545.23.06 could be found
Error:
 Problem: problem with installed package kmod-nvidia-545.23.06-4.18.0-477.27.1-3:545.23.06-3.el8_8.x86_64
  - package kmod-nvidia-545.23.06-4.18.0-477.27.1-3:545.23.06-3.el8_8.x86_64 conflicts with kmod-nvidia-latest-dkms provided by kmod-nvidia-latest-dkms-3:545.23.06-1.el8.x86_64
  - conflicting requests
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

I added --allowerasing; after that I get this:

dkms install nvidia/545.23.06
Module nvidia/545.23.06 already installed on kernel 4.18.0-477.27.1.el8_8.x86_64 (x86_64), skip. You may override by specifying --force.
ll /usr/src/
total 4
drwxr-xr-x  2 root root   35 Nov  9 09:52 annobin
drwxr-xr-x. 2 root root    6 Jun 21  2021 debug
drwxr-xr-x. 4 root root   78 Nov 15 08:07 kernels
drwxr-xr-x  8 root root 4096 Nov 15 10:25 nvidia-545.23.06

Meanwhile you switched from the initial dkms modules to precompiled modules, now back to dkms. The message

Module nvidia/545.23.06 already installed

indicates that the driver should now work (after a reboot). Please check; if it does not, create a new nvidia-bug-report.log.

nvidia-bug-report.log.gz (76.8 KB)

Incompatible driver. Seems you’re inside a VM on a vGPU system. Please use the GRID driver for your vGPU version instead of the normal nvidia driver. Please uninstall any nvidia packages first.
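
A quick way to confirm the VM part before choosing a driver is systemd-detect-virt, which prints the hypervisor type or "none". A minimal sketch; in_virtual_machine is a hypothetical helper name, and the check only shows that the system is virtualized, not whether the GPU is a vGPU or a passed-through device:

```shell
# Succeed (exit 0) when the system reports a virtualization type.
# An optional first argument overrides the detected value, which is
# convenient for trying the function without systemd-detect-virt.
in_virtual_machine() {
    virt="${1:-$(systemd-detect-virt 2>/dev/null || true)}"
    case "$virt" in
        ""|none) return 1 ;;   # bare metal, or detection unavailable
        *)       return 0 ;;   # kvm, vmware, microsoft, xen, ...
    esac
}

# Typical use:
#   if in_virtual_machine; then
#       echo "Running in a VM: ask the host admin which vGPU/GRID driver to use"
#   fi
```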

Yes, I am in a VM. Where can I find the GRID driver?

The GRID driver can be downloaded from the vGPU customer portal, where you (or the one that set up the host) acquired the vGPU software.

Thank you for your help. The system team sent me the correct driver and it works.
