Failed to initialize NVML: Driver/library version mismatch

Hi all,
In a diskless cluster running CentOS 7 with K80 cards, after upgrading the NVIDIA driver to 375.66 I get this error when trying to run nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch

In the dmesg output I found these errors:

NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 367.48.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

Checking the module installed on disk, I get the expected version from this command:

modinfo nvidia

version:        375.66

but I still see the old version when looking at:

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.48  Sat Sep  3 18:21:08 PDT 2016
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)

If I unload all NVIDIA-related modules with rmmod and load them again with modprobe, everything works fine; but if I reboot a compute node, /proc/driver/nvidia/version again reports the old module version and the problem comes back.
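For reference, the manual unload/reload is roughly the following; the exact module set is an assumption (on a compute node nvidia_drm and nvidia_modeset may not be loaded, so check with lsmod first):

# see which NVIDIA modules are loaded
lsmod | grep nvidia
# unload them, dependents first
rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# reload the freshly installed 375.66 module and re-check
modprobe nvidia
nvidia-smi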

On the whole machine I am not able to find the old kernel module file, so how is it possible that the old module gets loaded? Does anyone have ideas on how I could debug this problem?

Thanks and Best Regards,

Enrico


The driver was not installed correctly. This can happen if the previous driver was installed using the runfile installer and the new driver was installed using the package manager, or vice versa. There are probably other scenarios as well.

Remove all previous package manager installs, and all previous runfile installer installs, then reinstall the driver.

Installation Guide Linux :: CUDA Toolkit Documentation
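On CentOS 7 the clean-up can look roughly like this; the package names are assumptions that depend on which repository was used, and the runfile uninstaller only exists if a runfile install is present:

# remove a previous runfile install, if any
sudo /usr/bin/nvidia-uninstall
# remove previous package-manager installs (exact names vary by repo)
sudo yum remove "*nvidia*" cuda-drivers
# then reinstall the driver using a single method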

Hi,
all the drivers (the old ones and the new one) were installed using runfiles. In any case, I SOLVED the problem.
I report it here just for reference, for other users who may run into the same issue:

The compute nodes of this cluster load an initramfs from the master node via TFTP. To embed all the needed kernel modules, the initramfs is normally built using "dracut --host-only" on one of the compute nodes. At some point the nvidia kernel module was evidently embedded in the initramfs and therefore loaded at boot time, which caused the weird issue reported above. Rebuilding the initramfs using "dracut --host-only --omit-drivers " solved the problem.
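For anyone checking their own image, the diagnosis and rebuild looked roughly like this; the omitted driver name (nvidia) and the image path are assumptions, and on this diskless setup the image actually ends up in the master's TFTP tree:

# see whether an nvidia module is baked into the current initramfs
lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvidia
# rebuild it without the nvidia driver
dracut --force --host-only --omit-drivers nvidia /boot/initramfs-$(uname -r).img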

Thanks,

Enrico


Hi,
I ran into the same problem and found an issue on GitHub: https://github.com/tensorflow/tensorflow/issues/4349.
Following the instructions in that issue, a reboot solved my problem.

Hope it helps.

I hit a similar issue, and the second approach below worked for me:

1. First, try a reboot and check whether nvidia-smi works.
2. If that does not help, it may be a system environment issue. The method below worked on my Fedora 33 system (a command-level sketch follows after the list):

  • uninstall the NVIDIA driver
  • update the initrd with "sudo dracut --force"
  • reboot
  • reinstall the target NVIDIA driver
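A command-level sketch of those steps, assuming the driver came from the .run installer (use dnf to remove it instead if it came from a repository):

# 1. uninstall the current NVIDIA driver (runfile installs ship an uninstaller)
sudo /usr/bin/nvidia-uninstall
# 2. rebuild the initrd for the running kernel
sudo dracut --force
# 3. reboot
sudo reboot
# 4. after the reboot, reinstall the target driver version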

Hello Nvidia,

We have a server with 8x 2090ti GPUs that is throwing a mismatch error:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

As far as I can see there is no mismatch between drivers and libraries:

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 495.29.05 Thu Sep 30 16:00:29 UTC 2021
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
$ rpm -qa | grep nvidia-driver
nvidia-driver-latest-dkms-cuda-libs-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-NvFBCOpenGL-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-cuda-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-devel-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-NVML-495.29.05-1.el7.x86_64
nvidia-driver-latest-dkms-libs-495.29.05-1.el7.x86_64
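
For completeness, the loaded kernel module can also be compared directly against the NVML library that nvidia-smi uses; the library path below is an assumption for this CentOS 7 box:

# version of the kernel module that is loaded right now
cat /sys/module/nvidia/version
# user-space NVML library that nvidia-smi loads
ldconfig -p | grep libnvidia-ml
ls -l /usr/lib64/libnvidia-ml.so*

If the two versions disagree, the stale side is what needs replacing or reloading.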

There are no recent NVIDIA updates via the package manager that could have caused this:

$ rpm -qa --last | grep nvidia
kmod-nvidia-latest-dkms-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:29 CET
nvidia-driver-latest-dkms-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:27 CET
nvidia-xconfig-latest-dkms-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-persistenced-latest-dkms-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-NVML-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-NvFBCOpenGL-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-libs-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-devel-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-cuda-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:26 CET
nvidia-driver-latest-dkms-cuda-libs-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:12 CET
nvidia-modprobe-latest-dkms-495.29.05-1.el7.x86_64 Do 02 Dez 2021 00:24:08 CET
yum-plugin-nvidia-1.0.2-1.el7.elrepo.noarch Di 22 Dez 2020 10:29:18 CET

The server was rebooted/power cycled and the same error is still seen. Could you suggest solutions?

Thanks,
Durai

Thank you benry, your workaround also worked for me.