GeForce RTX 2080 Rocky Linux release 8.6 couldn't communicate with the NVIDIA driver

After installing the driver on Rocky Linux 8.6 and rebooting, the NVIDIA device does not exist. Logs are attached.

[root@node70 log]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[root@curie2 ~]# echo $CHROOTCUDA
/opt/ohpc/admin/images/rocky8.5.cuda

dnf -y --installroot=$CHROOTCUDA install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
dnf -y --installroot=$CHROOTCUDA install kernel kernel-core kernel-modules
dnf -y --installroot=$CHROOTCUDA update
dnf -y --installroot=$CHROOTCUDA install dnf-plugin-config-manager
dnf -y --installroot=$CHROOTCUDA config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf -y --installroot=$CHROOTCUDA clean all
dnf -y --installroot=$CHROOTCUDA module install nvidia-driver:latest-dkms
dnf -y --installroot=$CHROOTCUDA install cuda
dnf -y --installroot=$CHROOTCUDA install hwinfo

There is no NVIDIA entry under /dev and nothing in lsmod:

[root@node70 ~]# ls -l /dev/nv*
crw------- 1 root root 10, 144 Oct 26 16:42 /dev/nvram
[root@node70 ~]# lsmod | grep -i nv
[root@node70 ~]# rpm -qa | grep -i nvid
nvidia-kmod-common-520.61.05-1.el8.noarch
nvidia-settings-520.61.05-1.el8.x86_64
nvidia-driver-cuda-520.61.05-1.el8.x86_64
nvidia-fs-2.13.5-1.x86_64
nvidia-fs-dkms-2.13.5-1.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-libs-520.61.05-1.el8.x86_64
nvidia-driver-cuda-libs-520.61.05-1.el8.x86_64
nvidia-driver-devel-520.61.05-1.el8.x86_64
nvidia-gds-11.8.0-1.x86_64
nvidia-libXNVCtrl-520.61.05-1.el8.x86_64
nvidia-libXNVCtrl-devel-520.61.05-1.el8.x86_64
kmod-nvidia-latest-dkms-520.61.05-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-520.61.05-1.el8.x86_64
nvidia-modprobe-520.61.05-1.el8.x86_64
nvidia-driver-NVML-520.61.05-1.el8.x86_64
nvidia-persistenced-520.61.05-1.el8.x86_64
nvidia-gds-11-8-11.8.0-1.x86_64
nvidia-driver-520.61.05-1.el8.x86_64
nvidia-xconfig-520.61.05-1.el8.x86_64
[root@node70 ~]#

The lspci shows the devices:

[root@node70 log]# lspci | grep RTX
03:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080] (rev a1)
[root@node70 log]#

[root@node70 log]# rpm -qa | grep -i kernel
kernel-modules-4.18.0-372.26.1.el8_6.x86_64
kernel-4.18.0-372.26.1.el8_6.x86_64
kernel-headers-4.18.0-372.26.1.el8_6.x86_64
kernel-devel-4.18.0-372.26.1.el8_6.x86_64
kernel-devel-4.18.0-372.9.1.el8.x86_64
kernel-core-4.18.0-372.26.1.el8_6.x86_64
[root@node70 log]# uname -r
4.18.0-372.9.1.el8.x86_64

nvidia-bug-report.log.gz (57.5 KB)
Xorg.0.log (86.1 KB)

There are no kernel modules installed. Please check the output of
dkms status

[root@node70 ~]# dkms status
nvidia/520.61.05: added
nvidia-fs/2.13.5: added

[root@node70 ~]# uname -r
4.18.0-372.9.1.el8.x86_64
[root@node70 ~]# dkms status -k 4.18.0-372.9.1.el8.x86_64
nvidia/520.61.05: added
nvidia-fs/2.13.5: added
[root@node70 ~]# rpm -qa | grep -i kernel
kernel-modules-4.18.0-372.26.1.el8_6.x86_64
kernel-4.18.0-372.26.1.el8_6.x86_64
kernel-headers-4.18.0-372.26.1.el8_6.x86_64
kernel-devel-4.18.0-372.26.1.el8_6.x86_64
kernel-devel-4.18.0-372.9.1.el8.x86_64
kernel-core-4.18.0-372.26.1.el8_6.x86_64
[root@node70 ~]# rpm -qa | grep -i dkms
dkms-3.0.7-1.el8.noarch
nvidia-fs-dkms-2.13.5-1.x86_64
kmod-nvidia-latest-dkms-520.61.05-1.el8.x86_64
[root@node70 ~]# rpm -qa | grep -i nvidia
nvidia-kmod-common-520.61.05-1.el8.noarch
nvidia-settings-520.61.05-1.el8.x86_64
nvidia-driver-cuda-520.61.05-1.el8.x86_64
nvidia-fs-2.13.5-1.x86_64
nvidia-fs-dkms-2.13.5-1.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-libs-520.61.05-1.el8.x86_64
nvidia-driver-cuda-libs-520.61.05-1.el8.x86_64
nvidia-driver-devel-520.61.05-1.el8.x86_64
nvidia-gds-11.8.0-1.x86_64
nvidia-libXNVCtrl-520.61.05-1.el8.x86_64
nvidia-libXNVCtrl-devel-520.61.05-1.el8.x86_64
kmod-nvidia-latest-dkms-520.61.05-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-520.61.05-1.el8.x86_64
nvidia-modprobe-520.61.05-1.el8.x86_64
nvidia-driver-NVML-520.61.05-1.el8.x86_64
nvidia-persistenced-520.61.05-1.el8.x86_64
nvidia-gds-11-8-11.8.0-1.x86_64
nvidia-driver-520.61.05-1.el8.x86_64
nvidia-xconfig-520.61.05-1.el8.x86_64
[root@node70 ~]#
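
For anyone reading later: in the `dkms status` output above, `added` means the module source is registered with DKMS but has not been built or installed for any kernel. A minimal sketch for spotting that state, assuming the `name/version: state` line format shown above (the function name `not_installed` is made up for illustration):

```shell
# Hypothetical helper: list DKMS modules whose state is not "installed".
# Reads `dkms status`-style lines on stdin, e.g. "nvidia/520.61.05: added".
not_installed() {
    # Split on ": " so the last field is the state; skip empty lines.
    awk -F': ' 'NF && $NF !~ /^installed/ { print $1 " (" $NF ")" }'
}

# Usage on a real node (needs dkms):
#   dkms status | not_installed
```

With the output from this node it would print both `nvidia/520.61.05 (added)` and `nvidia-fs/2.13.5 (added)`.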

Please run
sudo dkms install nvidia/520.61.05
and post any errors displayed.

[root@node70 boot]# dkms install nvidia/520.61.05
Sign command: /lib/modules/4.18.0-372.9.1.el8.x86_64/build/scripts/sign-file
Binary /lib/modules/4.18.0-372.9.1.el8.x86_64/build/scripts/sign-file not found, modules won't be signed
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/520.61.05/source/dkms.conf does not exist.
[root@node70 boot]# ls /var/lib/dkms
nvidia nvidia-fs
[root@node70 boot]# ls /var/lib/dkms/nvidia
520.61.05
[root@node70 boot]# ls /var/lib/dkms/nvidia/520.61.05
build source
[root@node70 boot]# ls /var/lib/dkms/nvidia/520.61.05/source
/var/lib/dkms/nvidia/520.61.05/source
[root@node70 boot]# ls -l /var/lib/dkms/nvidia/520.61.05/source/dkms.conf
ls: cannot access ‘/var/lib/dkms/nvidia/520.61.05/source/dkms.conf’: No such file or directory
[root@node70 boot]#
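
A note on the listing above: DKMS normally keeps `source` as a symlink into /usr/src, and `ls` printing the path itself rather than a file list is consistent with that link pointing at something that does not exist. A rough sketch of a check, assuming that layout (the function name `check_dkms_source` is hypothetical):

```shell
# Hypothetical helper: report whether a DKMS source tree is usable.
# $1 is the per-version directory, e.g. /var/lib/dkms/nvidia/520.61.05
check_dkms_source() {
    src="$1/source"
    if [ -L "$src" ] && [ ! -e "$src" ]; then
        # The link exists but its target does not.
        echo "dangling symlink: $src -> $(readlink "$src")"
    elif [ ! -f "$src/dkms.conf" ]; then
        echo "missing dkms.conf under $src"
    else
        echo "ok: $src/dkms.conf"
    fi
}

# Usage:
#   check_dkms_source /var/lib/dkms/nvidia/520.61.05
```

If it reports a dangling symlink, the package that owns the link target is the thing to reinstall.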

Please install the package nvidia-dkms-520.61.05.

That package does not seem to exist? Can you please tell me where this can be found?

For rhel, the name seems to be
kmod-nvidia-latest-dkms-520.61.05-1

We have made some progress. Thanks for the help.

[root@node70 ~]# nvidia-smi
Tue Nov 1 14:06:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 32%   66C    P0    53W / 215W |      0MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 44%   72C    P0    54W / 215W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@node70 ~]# ls /dev/nv*
/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvram
/dev/nvidia1 /dev/nvidia-modeset /dev/nvidia-uvm-tools
/dev/nvidia-caps:
nvidia-cap1 nvidia-cap2

lsmod | grep nvid
nvidia_drm 61440 0
nvidia_modeset 1138688 1 nvidia_drm
nvidia_uvm 1236992 0
nvidia 54571008 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 266240 5 drm_vram_helper,ast,nvidia_drm
drm 585728 8 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm

[root@node70 ~]# find /lib/modules | grep nvid
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia/nvidia-peermem.ko
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-372.26.1.el8_6.x86_64/extra/drivers/video/nvidia/nvidia.ko

[root@node70 ~]# uname -r
4.18.0-372.26.1.el8_6.x86_64

[root@node70 ~]# !!
egrep -i 'dkms|nvid|cuda' /var/log/messages
Nov 1 14:04:12 node70 kernel: nvidia: loading out-of-tree module taints kernel.
Nov 1 14:04:12 node70 kernel: nvidia: module license 'NVIDIA' taints kernel.
Nov 1 14:04:12 node70 kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Nov 1 14:04:12 node70 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
Nov 1 14:04:12 node70 kernel: nvidia 0000:03:00.0: enabling device (0100 -> 0103)
Nov 1 14:04:12 node70 kernel: nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Nov 1 14:04:12 node70 systemd-udevd[1769]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Nov 1 14:04:12 node70 systemd-udevd[1769]: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input9
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input10
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input11
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input16
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input12
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input13
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input17
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input18
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input14
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:02.2/0000:03:00.1/sound/card1/input15
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input19
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input20
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input21
Nov 1 14:04:12 node70 kernel: input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.0/0000:04:00.1/sound/card2/input22
Nov 1 14:04:13 node70 kernel: nvidia 0000:04:00.0: enabling device (0100 -> 0103)
Nov 1 14:04:13 node70 kernel: nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Nov 1 14:04:13 node70 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 520.61.05 Thu Sep 29 05:30:25 UTC 2022
Nov 1 14:04:13 node70 kernel: nvidia-uvm: Loaded the UVM driver, major device number 236.
Nov 1 14:04:13 node70 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 520.61.05 Thu Sep 29 05:29:37 UTC 2022
Nov 1 14:04:13 node70 kernel: [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
Nov 1 14:04:13 node70 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
Nov 1 14:04:13 node70 kernel: [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
Nov 1 14:04:13 node70 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 2
[root@node70 ~]#

I meant to include the installed rpms also…

[root@node70 ~]# rpm -qa | egrep -i 'dkms|nvid|cuda'
nvidia-kmod-common-520.61.05-1.el8.noarch
nvidia-driver-NVML-520.61.05-1.el8.x86_64
cuda-cccl-11-8-11.8.89-1.x86_64
cuda-nsight-systems-11-8-11.8.0-1.x86_64
cuda-gdb-11-8-11.8.86-1.x86_64
cuda-compiler-11-8-11.8.0-1.x86_64
cuda-sanitizer-11-8-11.8.86-1.x86_64
kmod-nvidia-520.61.05-4.18.0-372.26.1-520.61.05-3.el8_6.x86_64
cuda-toolkit-config-common-11.8.89-1.noarch
cuda-nvrtc-11-8-11.8.89-1.x86_64
nvidia-driver-devel-520.61.05-1.el8.x86_64
cuda-profiler-api-11-8-11.8.86-1.x86_64
cuda-documentation-11-8-11.8.86-1.x86_64
cuda-toolkit-11-8-11.8.0-1.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-cuda-libs-520.61.05-1.el8.x86_64
cuda-cudart-11-8-11.8.89-1.x86_64
cuda-nvml-devel-11-8-11.8.86-1.x86_64
cuda-nvrtc-devel-11-8-11.8.89-1.x86_64
nvidia-modprobe-520.61.05-1.el8.x86_64
cuda-nsight-compute-11-8-11.8.0-1.x86_64
cuda-nvcc-11-8-11.8.89-1.x86_64
cuda-libraries-devel-11-8-11.8.0-1.x86_64
cuda-cupti-11-8-11.8.87-1.x86_64
cuda-drivers-520.61.05-1.x86_64
cuda-nvvp-11-8-11.8.87-1.x86_64
cuda-tools-11-8-11.8.0-1.x86_64
nvidia-driver-520.61.05-1.el8.x86_64
cuda-toolkit-11-8-config-common-11.8.89-1.noarch
nvidia-driver-NvFBCOpenGL-520.61.05-1.el8.x86_64
cuda-nvprof-11-8-11.8.87-1.x86_64
cuda-cudart-devel-11-8-11.8.89-1.x86_64
nvidia-xconfig-520.61.05-1.el8.x86_64
cuda-nvprune-11-8-11.8.86-1.x86_64
cuda-driver-devel-11-8-11.8.89-1.x86_64
cuda-cuxxfilt-11-8-11.8.86-1.x86_64
nvidia-settings-520.61.05-1.el8.x86_64
nvidia-driver-cuda-520.61.05-1.el8.x86_64
cuda-nsight-11-8-11.8.86-1.x86_64
cuda-command-line-tools-11-8-11.8.0-1.x86_64
cuda-11.8.0-1.x86_64
cuda-toolkit-11-config-common-11.8.89-1.noarch
cuda-libraries-11-8-11.8.0-1.x86_64
nvidia-libXNVCtrl-devel-520.61.05-1.el8.x86_64
cuda-nvtx-11-8-11.8.86-1.x86_64
cuda-demo-suite-11-8-11.8.86-1.x86_64
nvidia-persistenced-520.61.05-1.el8.x86_64
cuda-11-8-11.8.0-1.x86_64
nvidia-driver-libs-520.61.05-1.el8.x86_64
nvidia-libXNVCtrl-520.61.05-1.el8.x86_64
cuda-nvdisasm-11-8-11.8.86-1.x86_64
cuda-memcheck-11-8-11.8.86-1.x86_64
cuda-cuobjdump-11-8-11.8.86-1.x86_64
cuda-runtime-11-8-11.8.0-1.x86_64
cuda-visual-tools-11-8-11.8.0-1.x86_64
dkms-3.0.7-1.el8.noarch
[root@node70 ~]#

[root@node70 ~]# cat /etc/rocky-release
Rocky Linux release 8.6 (Green Obsidian)

[root@node70 ~]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 520.61.05 Thu Sep 29 05:30:25 UTC 2022
GCC version: gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)

[root@node70 ~]# yum repolist
repo id repo name
OpenHPC OpenHPC-2 - Base
OpenHPC-updates OpenHPC-2 - Updates
appstream Rocky Linux 8 - AppStream
baseos Rocky Linux 8 - BaseOS
cuda-rhel8-x86_64 cuda-rhel8-x86_64
download.rockylinux.org_pub_rocky_8_AppStream_basearch_os created by dnf config-manager from Index of /pub/rocky/8/AppStream/x86_64/os/
download.rockylinux.org_pub_rocky_8_BaseOS_basearch_os created by dnf config-manager from Index of /pub/rocky/8/BaseOS/x86_64/os/
download.rockylinux.org_pub_rocky_8_PowerTools_basearch_os created by dnf config-manager from Index of /pub/rocky/8/PowerTools/x86_64/os/
epel Extra Packages for Enterprise Linux 8 - x86_64
epel-modular Extra Packages for Enterprise Linux Modular 8 - x86_64
extras Rocky Linux 8 - Extras
oneAPI Intel(R) oneAPI repository
[root@node70 ~]#

Do you have any additional issues or why are you posting the installed packages?

No other issues; I was just showing the installed packages for future reference. I would also like to document that part of the issue turned out to be that the running kernel version did not match the kernel under /lib/modules. I essentially had to do the equivalent of mkinitrd; however, the kernel is deployed via a Warewulf cluster, so the bootstrap and kernel were also updated via Warewulf.
Thanks again for the help with the NVIDIA drivers. The Warewulf commands were:
wwbootstrap -c $CHROOTCUDA 4.18.0-372.26.1.el8_6.x86_64
wwsh -y provision set node[70] --bootstrap=4.18.0-372.26.1.el8_6.x86_64 --vnfs=rocky8.5.cuda --files=dynamic_hosts,passwd,group,shadow,slurm.conf,munge.key,network
wwvnfs -y --chroot=$CHROOTCUDA
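
Since the root cause was the running kernel not matching the tree under /lib/modules, here is a minimal sketch of a check that could be run on a node after provisioning. It assumes the driver module is named `nvidia.ko` (possibly compressed), as in the `find` output earlier in the thread; the function name `has_nvidia_module` is made up for illustration.

```shell
# Hypothetical helper: does a modules tree contain nvidia.ko for a given kernel?
# $1 = kernel release (e.g. output of `uname -r`)
# $2 = modules root (defaults to /lib/modules)
has_nvidia_module() {
    kver="$1"
    root="${2:-/lib/modules}"
    if [ ! -d "$root/$kver" ]; then
        echo "no module tree for $kver"
        return 1
    fi
    # Match nvidia.ko as well as compressed variants such as nvidia.ko.xz.
    if find "$root/$kver" -name 'nvidia.ko*' | grep -q .; then
        echo "nvidia module present for $kver"
    else
        echo "module tree exists for $kver but has no nvidia.ko"
        return 1
    fi
}

# Usage:
#   has_nvidia_module "$(uname -r)"
```

On this node, the failing boot ran 4.18.0-372.9.1.el8.x86_64 while the modules were only under 4.18.0-372.26.1.el8_6.x86_64, so a check like this would have failed for the running kernel before the bootstrap was rebuilt.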