Failed to initialize NVML: Driver/library version mismatch when running nvidia-smi after kernel update and reboot

We recently upgraded to CUDA 11.7.1 using the cuda_11.7.1_515.65.01_linux.run file and then updated the kernel driver using NVIDIA-Linux-x86_64-515.86.01.run to address a security vulnerability.

(base) [root@xxxxx cuda]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

(base) [root@xxxxx cuda]# modinfo nvidia
filename: /lib/modules/3.10.0-1160.88.1.el7.x86_64/kernel/drivers/video/nvidia.ko
firmware: nvidia/515.86.01/gsp.bin
alias: char-major-195-*
version: 515.86.01
supported: external
license: NVIDIA
retpoline: Y
rhelversion: 7.9

(base) [root@xxxx nvidia]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GCC version: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)

But the /usr/local/cuda/version.json file shows:

"nvidia_driver" : {
"name" : "NVIDIA Linux Driver",
"version" : "515.65.01"

When I reboot and run the nvidia-smi client I get:
Failed to initialize NVML: Driver/library version mismatch

After the reboot, cat /proc/driver/nvidia/version also shows an older version, and I have no idea where it comes from. I checked the yum history and did not see any installation of that version.

(base) [root@paidsrfchtc01 cuda]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 450.51.06 Sun Jul 19 20:02:54 UTC 2020
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

Hello @gary.gallion and welcome to the NVIDIA developer forums!

You are very likely facing a mismatch in driver installation processes. Possibly the 450 driver was installed as part of the previous CUDA version installation. Using the standalone .run file for the driver now can very well fail to correctly replace all the symlinks.
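To make the mismatch concrete, here is a minimal sketch that compares the version of the loaded kernel module with the version of a user-space NVML library. The function names are made up for illustration, and the library location mentioned below (/usr/lib64) is an assumption about where the .run installer placed it on a 64-bit RHEL system.

```shell
# Read text in the format of /proc/driver/nvidia/version on stdin and
# print only the kernel module version, e.g. 515.86.01.
kernel_module_version() {
    sed -n 's/^NVRM version:.*Kernel Module *\([0-9][0-9.]*\).*/\1/p'
}

# Print the version suffix of a libnvidia-ml.so.<version> file name.
library_version() {
    name="${1##*/}"                             # drop the directory part
    printf '%s\n' "${name#libnvidia-ml.so.}"    # drop the library prefix
}

# Compare two version strings; the exit status is non-zero on a mismatch.
check_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "MISMATCH: kernel module $1, library $2"
        return 1
    fi
}
```

On the affected machine this could be used as `check_versions "$(kernel_module_version < /proc/driver/nvidia/version)" "$(library_version /usr/lib64/libnvidia-ml.so.515.86.01)"`, with the library path adjusted to whatever is actually installed.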

Check out this comparison matrix on what to do with which installation method.

To be safe I would recommend at this point to uninstall both the cuda toolkit as well as ALL NVIDIA drivers, reboot, make sure everything is purged, and then re-install CUDA fresh.

Alternatively you could try to explicitly uninstall only the older CUDA toolkit and find all v450 driver files and delete them.
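As a rough sketch of the "find all v450 driver files" step: the helper below only lists shared libraries whose file name ends in a given driver version and deletes nothing. The function name and the directories in the usage note are assumptions for illustration, and kernel modules (nvidia.ko) are not covered because their file names do not carry the version.

```shell
# List regular files and symlinks named *.so.<version> under the given
# directories. Usage: find_versioned_driver_files VERSION DIR...
find_versioned_driver_files() {
    version="$1"
    shift
    find "$@" \( -type f -o -type l \) -name "*.so.${version}" 2>/dev/null | sort
}
```

For example, `find_versioned_driver_files 450.51.06 /usr/lib64 /usr/lib` would print any leftover 450.51.06 libraries in those two directories so they can be reviewed before anything is removed.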

I hope that helps!

For further CUDA installation help you can also check our dedicated CUDA forums.

Markus, I have gone thoroughly through the rpm repository and yum history, and they do not show any sign of an installation. The 450.51.06 driver is installed with the 11.0.3 version of the CUDA toolkit, but I have no history of that toolkit or the standalone driver being installed.

I cleaned up the /usr/local/cuda-xx.x files for any previous versions. I then reinstalled CUDA 11.7, which installs the 515.65.01 driver. The files in /proc/driver/ have been updated:

(base) [root@xxxxxxxxx nvidia]# ls -alrt
total 0
drwxrwxr-x 6 root root 0 Jul 13 18:11 ..
dr-xr-xr-x 6 root root 0 Jul 13 18:11 .
dr-xr-xr-x 2 root root 0 Jul 13 18:11 warnings
-r--r--r-- 1 root root 0 Jul 13 18:11 version
-rw-r--r-- 1 root root 0 Jul 13 18:11 suspend_depth
-rw-r--r-- 1 root root 0 Jul 13 18:11 suspend
-rw-r--r-- 1 root root 0 Jul 13 18:11 registry
dr-xr-xr-x 2 root root 0 Jul 13 18:11 patches
-r--r--r-- 1 root root 0 Jul 13 18:11 params
dr-xr-xr-x 4 root root 0 Jul 13 18:11 gpus

and the version file shows:

(base) [root@xxxxxxxxx nvidia]# cat version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GCC version: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)

However, as soon as I reboot, the version file reverts to the older driver. I have no idea what is changing it.
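One possible explanation, not confirmed anywhere in this thread, is that a second copy of the kernel module still exists for the kernel that actually boots, or inside its initramfs. A minimal sketch for listing every nvidia.ko file under a modules tree follows; the function name is hypothetical.

```shell
# List every nvidia.ko (plain or compressed) below a modules root.
# Usage: list_nvidia_modules [ROOT], where ROOT defaults to /lib/modules.
list_nvidia_modules() {
    find "${1:-/lib/modules}" -type f -name 'nvidia.ko*' 2>/dev/null | sort
}
```

Each path it prints can be passed to `modinfo -F version <path>` to see which driver version that copy belongs to. If the dracut tools are installed, `lsinitrd /boot/initramfs-$(uname -r).img | grep nvidia` should also show whether a copy is packed into the initramfs for the running kernel.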

The fix for the issue is here:
How to prevent API mismatch - Graphics / Linux / Linux - NVIDIA Developer Forums
