Failed to initialize NVML: Driver/library version mismatch

Hi all,
in a diskless cluster running CentOS 7 and hosting K80 cards, after upgrading the NVIDIA driver to 375.66 I get this error when trying to run nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch

In dmesg I found these errors:

NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 367.48.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

When I check the module version with this command, I get the expected one:

modinfo nvidia

version:        375.66

but I still see the old one when looking at:

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.48  Sat Sep  3 18:21:08 PDT 2016
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)

If I unload all nvidia-related modules with rmmod and load them again with modprobe, everything works fine, but if I reboot a compute node, /proc/driver/nvidia/version reports the old module version again and the problem comes back.
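For reference, the unload/reload sequence looks roughly like this (the exact module set is an assumption; nvidia_uvm, nvidia_drm and nvidia_modeset are only present on some setups, and rmmod simply errors out for modules that are not loaded):

# unload dependent modules first, then the core nvidia module
rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null
# load the freshly installed module and check which version is now active
modprobe nvidia
cat /proc/driver/nvidia/version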

I cannot find the old kernel module file anywhere on the machine, so how is it possible that the old module gets loaded? Does anyone have ideas on how I could debug this problem?

Thanks and Best Regards,

Enrico

The driver was not installed correctly. This can happen if the previous driver was installed using the runfile installer and the new driver was installed using the package manager, or vice versa. There are probably other scenarios as well.

Remove all previous package-manager installs and all previous runfile installs, then reinstall the driver.
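A rough cleanup sketch for CentOS 7; the package names below are only examples and depend on which repository was used, so treat them as assumptions:

# a runfile install ships its own uninstaller
nvidia-uninstall
# remove any package-manager installs (the wildcard and package names are just examples)
yum remove "*nvidia*" cuda-drivers

After that, reinstall the driver using a single method (runfile or package manager, not both).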

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract

Hi,
all the drivers (the old ones and the new one) were installed using runfiles. Anyhow, I SOLVED the problem.
I'm reporting it here for reference, in case other users run into the same issue:

The compute nodes of this cluster load an initramfs from the master node via TFTP. To embed all the needed kernel modules, the initramfs is normally built with "dracut --host-only" on one of the compute nodes. The nvidia kernel module had probably been embedded in the initramfs at some point and was therefore loaded at boot time, causing the weird issue reported above. Rebuilding the initramfs with "dracut --host-only --omit-drivers ..." (listing the nvidia modules to omit) solved the problem.
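For reference, the rebuild looks roughly like this; the module list and the initramfs path are only examples (in our setup the resulting image then has to be copied back to the tftp area on the master node):

# check whether the nvidia module is baked into the current initramfs
lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvidia
# rebuild it without the nvidia modules
dracut --force --host-only \
    --omit-drivers "nvidia nvidia_uvm nvidia_drm nvidia_modeset" \
    /boot/initramfs-$(uname -r).img $(uname -r)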

Thanks,

Enrico

Hi,
I ran into the same problem and found an issue on GitHub: https://github.com/tensorflow/tensorflow/issues/4349.
Following the instructions in that issue, rebooting solved my problem.
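A quick check after the reboot (treat the exact nvidia-smi invocation as an assumption, though the query flags should be available on recent drivers): both of these should now report the same driver version.

cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader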

Hope it helps.