CUDA Toolkit on Rocky Linux 9: nvidia-smi Fails

I am trying to install the CUDA toolkit on a fresh install of Rocky Linux 9. It is not working and I am wondering what I can do to fix it. I am also wondering if it has to do with having more than one video device.

The output of
lspci | grep VGA
is
00:02.0 VGA compatible controller: Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01)
01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] (rev a1)

These are the commands that I used to install the CUDA Toolkit on a fresh install of Rocky 9:
dnf config-manager --set-enabled crb
dnf install -y epel-release
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf -y module install nvidia-driver:latest-dkms
dnf -y install cuda

After rebooting, I get the following when I run nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Any help would be very much appreciated.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

I have a similar problem on Ubuntu 22.04 after upgrading from CUDA 11.6 to CUDA 11.8: nvidia-smi now takes a very long time to respond and the graphical interface never comes up. I suspect a problem with nvidia-modeset that hangs any communication with the driver.

Key line when nvidia-smi finally responds (after more than 10 seconds):
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |

lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

Kernel: 5.15.0-48-generic x86_64 bits: 64 compiler: gcc v: 11.2.0
parameters: BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic
root=UUID=XXX ro quiet splash
vt.handoff=7
Console: pty pts/1 DM: GDM3 42.0
Distro: Ubuntu 22.04.1 LTS (Jammy Jellyfish)
Graphics:
Device-1: NVIDIA GA102 [GeForce RTX 3090] vendor: Micro-Star MSI

I’ll also try to attach a bug report
nvidia-bug-report.log.gz (523.6 KB)

Here is the output of nvidia-bug-report.sh.
nvidia-bug-report.log.gz (81.7 KB)

I should clarify that this is from Rocky 9, but with several other programs and packages added after running the commands above. If you like, I can run it again with just a fresh install, but I won’t be able to until tonight.

@Tomas.nordstrom “My car is broken” “Oh, I also have a car that’s broken”
CUDA 11.8 installed driver 520.61.05, which apparently has the same bug as 515.76: it doesn’t work on some hardware. Please attach your nvidia-bug-report.log here: https://forums.developer.nvidia.com/t/515-76-nvidia-drivers/229132/16
Meanwhile, you’ll have to downgrade to CUDA 11.6 / driver 515.65 to get a working system.

@craigchristianson There are no kernel modules compiled. Please post the output of
dkms status

output of dkms status
nvidia/520.61.05: added
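
For anyone following along: "added" means DKMS has registered the module source but has not built or installed it for the running kernel, so there is nothing for nvidia-smi to talk to. A quick way to confirm that the module is missing is to look for it in /proc/modules. The sketch below is only an illustration; the function name nvidia_loaded is made up for this example and is not part of any NVIDIA or DKMS tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the text on stdin, in /proc/modules
# format, contains an entry for the core "nvidia" module.
# The trailing space keeps nvidia_drm, nvidia_uvm, etc. from matching.
nvidia_loaded() {
    grep -q '^nvidia '
}

# Check the running system (read-only, no root needed).
if [ -r /proc/modules ] && nvidia_loaded < /proc/modules; then
    echo "nvidia kernel module is loaded"
else
    echo "nvidia kernel module is not loaded"
fi
```

If this reports that the module is not loaded while dkms status shows "added", the build step most likely never ran.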

Please run
sudo dkms install nvidia/520.61.05
and post any errors that are given and attach the referenced make.log

Output of sudo dkms install nvidia/520.61.05
This system doesn’t support Secure Boot
Sign command: /lib/modules/5.14.0-70.26.1.el9_0.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub
Certificate or key are missing, generating self signed certificate for MOK


Building module:
Cleaning build area

'make' -j2 module SYSSRC=/lib/modules/5.14.0-70.26.1.el9_0.x86_64/build IGNORE_XEN_PRESENCE=1 IGNORE_PREEMPT_RT_PRESENCE=1 IGNORE_CC_MISMATCH=1

Signing module /var/lib/dkms/nvidia/520.61.05/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/520.61.05/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/520.61.05/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/520.61.05/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/520.61.05/build/nvidia-peermem.ko
Cleaning build area


nvidia.ko.xz:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.14.0-70.26.1.el9_0.x86_64/extra/

nvidia-modeset.ko.xz:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.14.0-70.26.1.el9_0.x86_64/extra/

nvidia-drm.ko.xz:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.14.0-70.26.1.el9_0.x86_64/extra/

nvidia-uvm.ko.xz:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.14.0-70.26.1.el9_0.x86_64/extra/

nvidia-peermem.ko.xz:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.14.0-70.26.1.el9_0.x86_64/extra/
      Adding any weak-modules
      depmod


Where can I find the referenced make.log?

Output of nvidia-smi
Wed Oct 5 10:02:06 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce Off | 00000000:01:00.0 Off | N/A |
| N/A 61C P0 28W / N/A | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Output of hello-world (cuda)
cudaMalloc: 0
cudaMalloc: 0
cudaMalloc: 0
cudaMalloc: 0
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
out: 3.000000 a: 1.000000 b: 2.000000
PASSED

Things appear to be working now. Thank you. Should the install script mentioned for the CUDA Toolkit on Rocky 9 be updated to include these commands?

The make.log is only referenced if something fails during compilation. The dkms install should be triggered automatically when the driver package gets installed; I don’t know why this wasn’t the case.

As a workaround until something better can be worked out, do the following if things are not working. It’s a bit hacky, but it should work.

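# Capture the status line, e.g. "nvidia/520.61.05: added"
DKMS_STATUS=$(dkms status)
# Strip the trailing ": added" to leave "nvidia/520.61.05"
DKMS_STATUS=${DKMS_STATUS%": added"}
sudo dkms install "$DKMS_STATUS"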
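
The lines above assume that dkms status prints exactly one line and that it ends in ": added". A slightly more defensive sketch, in case there are several DKMS modules or some are already installed, filters the status output line by line. The function name added_modules and the RUN_DKMS switch are made up for this example; the only thing taken from this thread is the "nvidia/520.61.05: added" line format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `dkms status` text on stdin and prints the
# module/version of every entry that is registered but not yet built.
# Assumes lines of the form "nvidia/520.61.05: added", as posted earlier.
added_modules() {
    sed -n 's/: added[[:space:]]*$//p'
}

# Only touch the system when explicitly asked, e.g.:
#   RUN_DKMS=1 bash fix-dkms.sh
if [ "${RUN_DKMS:-0}" = "1" ]; then
    dkms status | added_modules | while IFS= read -r mod; do
        # </dev/null keeps dkms from consuming the remaining module list
        sudo dkms install "$mod" < /dev/null
    done
fi
```

Entries that are already built or installed are skipped, so running it a second time should be harmless.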