How to prevent API mismatch

Once per month I face this issue:

$> nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

$> dmesg
[2381101.873914] NVRM: API mismatch: the client has the version 495.46, but
NVRM: this kernel module has the version 495.29.05. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

I know that the solution for this situation is to update the drivers and reload them, or alternatively to restart the system. The problem is that this happens periodically and I cannot prevent it. I do perform system updates regularly, but I see no connection between the updates and this error: one day it suddenly reports the API mismatch. This is quite inconvenient, as we provide two 4xA100 GPU servers and have to ask all users to stop computing again and again. Is there any way to prevent this?

Which distribution are you using?

Linux Ubuntu 18.04 and 20.04

If you’re doing a full system update, it will also update the kernel and the NVIDIA driver (if a newer version is available), so without a reboot (or a driver reload) the graphics/CUDA stack will inadvertently get out of sync. This only affects new contexts started after the driver upgrade, so running tasks are not affected.
One way around this would be to use “apt-mark hold” to exclude the driver from system updates, and to unhold it when updating just before a planned reboot.
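A minimal sketch of that hold/unhold cycle on Ubuntu. It assumes the driver is installed from packages whose names start with nvidia- or libnvidia-; that name pattern and the helper function names are assumptions for illustration, so adjust them to whatever is actually installed:

```shell
#!/bin/sh
# Filter a list of package names (stdin) down to NVIDIA driver packages.
# ASSUMPTION: driver packages are named nvidia-* or libnvidia-*.
nvidia_packages() {
  grep -E '^(nvidia|libnvidia)-' || true
}

# Put every NVIDIA package known to dpkg on hold so routine upgrades skip it.
hold_nvidia() {
  dpkg-query -W -f='${Package}\n' | nvidia_packages | xargs -r sudo apt-mark hold
}

# Release the hold right before a planned upgrade and reboot.
unhold_nvidia() {
  apt-mark showhold | nvidia_packages | xargs -r sudo apt-mark unhold
}
```

Run hold_nvidia once after installing the driver. Before a maintenance window, run unhold_nvidia, upgrade, and reboot, then hold again. "apt-mark showhold" lists what is currently held.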

Thanks! I will try to keep in mind that each manually initiated update also requires a reload of the drivers. In the past, I tried to use “apt-mark hold” but I ran into problems with versions. Your solution with holding and unholding makes sense.

Depending on your general setup, this might also require sticking to a specific cuda-toolkit version.

I ran into this problem, but it had nothing to do with CUDA (which wasn’t installed on some of the systems). On my system the kernel modules were being embedded inside the compressed kernel image (the initramfs), then loaded early in the boot process. These embedded but outdated modules would then prevent the correct, newly installed/compiled standalone module files from being loaded. You can confirm this issue easily. Check the following:

cat /proc/driver/nvidia/version
cat /sys/module/nvidia/version

If the versions of the loaded modules don’t match the installed driver version, you could also be facing this problem. You can confirm that the correct kernel modules are available by running (assuming your distro uses DKMS):

dkms status
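The comparison can be scripted along these lines. This is only a sketch: it compares the version of the loaded module with the version of the module file that modprobe would pick up for the running kernel (via "modinfo -F version nvidia"), and the function names are made up for illustration:

```shell
#!/bin/sh
# Compare two version strings and report whether they agree.
compare_versions() {
  loaded=$1
  ondisk=$2
  if [ "$loaded" = "$ondisk" ]; then
    echo "OK: loaded module and on-disk module are both $loaded"
  else
    echo "MISMATCH: loaded $loaded, on disk $ondisk"
    return 1
  fi
}

check_nvidia_module() {
  # Version of the module currently loaded in the kernel.
  loaded=$(cat /sys/module/nvidia/version 2>/dev/null) || loaded=""
  # Version of the module file modprobe would load for the running kernel.
  ondisk=$(modinfo -F version nvidia 2>/dev/null) || ondisk=""
  compare_versions "${loaded:-not-loaded}" "${ondisk:-not-installed}"
}
```

Calling check_nvidia_module on an affected machine should report a mismatch while the stale module is still loaded, and the non-zero exit status makes it usable from a cron job or monitoring check.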

For me the fix simply involved regenerating my kernel images. On Red Hat and its derivatives (Fedora, CentOS, Alma, Rocky, Oracle, etc.) you can run:

(rpm -q --qf="%{VERSION}-%{RELEASE}.%{ARCH}\n" --whatprovides kernel ; uname -r) | \
sort | uniq | while read KERNEL ; do 
  dracut -f "/boot/initramfs-${KERNEL}.img" "${KERNEL}" || exit 1
done

This will regenerate the image for every installed kernel. For the equivalent logic on Debian and its derivatives (including Ubuntu), you can run:

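# Rebuild the initramfs for every kernel that has a config file in /boot
for kernel in /boot/config-*; do
  [ -f "$kernel" ] || continue
  KERNEL=${kernel#*-}   # strip the "/boot/config-" prefix, leaving the kernel version
  # Debian/Ubuntu name the image /boot/initrd.img-<version>, with no extra ".img" suffix
  mkinitramfs -o "/boot/initrd.img-${KERNEL}" "${KERNEL}" || exit 1
done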

Then reboot. You can also fix the problem temporarily by manually unloading the NVIDIA modules using rmmod or modprobe -r, then reloading them. When you do, modprobe will use the standalone kernel module, which should match your installed driver version.
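A rough sketch of that manual reload, assuming the usual module set (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia); the function name and the DRY_RUN switch are made up for illustration. Dependent modules have to be unloaded before the core nvidia module:

```shell
#!/bin/sh
# Unload dependents first and the core module last.
NVIDIA_MODULES="nvidia_uvm nvidia_drm nvidia_modeset nvidia"

# Set DRY_RUN=1 to print the commands instead of running them.
reload_nvidia() {
  run=${DRY_RUN:+echo}
  for m in $NVIDIA_MODULES; do
    # Skip modules that are not loaded (always shown in dry-run mode).
    if [ -n "${DRY_RUN:-}" ] || [ -d "/sys/module/$m" ]; then
      $run sudo rmmod "$m" || return 1
    fi
  done
  # Load the standalone module that matches the installed driver.
  $run sudo modprobe nvidia
}
```

Try "DRY_RUN=1 reload_nvidia" first to see what would be executed. The rmmod calls fail while anything still holds the GPU (running CUDA jobs, an X server, a persistence daemon), so those have to be stopped beforehand.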

P.S. I hit this issue when I upgraded from the 470.x driver to the 510.x driver, which recently became the recommended stable version. I never ran into this problem while using the 460.x and 470.x driver releases.

Or simply sudo update-initramfs -u -k all

Thanks Mart, that is a much better method. In retrospect using:

dracut --regenerate-all --force

would probably be easier, and work just fine for most people on Red Hat systems (and their various offspring). That’s what I get for copy/pasting from my bash scripts without thinking.