Ubuntu "unattended-upgrades" leads to "Failed to initialize NVML: Driver/library version mismatch"

We have long-running deployments that run 24/7 for multiple months without downtime. For several years now we have had an issue with Ubuntu's “unattended-upgrades” service, which occasionally upgrades the GPU driver automatically. As a result, some Docker containers can no longer start, and nvidia-smi reports:

“Failed to initialize NVML: Driver/library version mismatch”

Rebooting the server solves the issue, but it requires manual maintenance, and by that point our service has already suffered downtime.
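This error typically appears when the user-space NVIDIA libraries on disk have been upgraded while the older kernel module is still loaded. A minimal sketch for detecting that state before containers start failing is shown here. The helper name versions_match and the version strings in the comments are illustrative assumptions, and the package name nvidia-utils-525 should be replaced with whatever is actually installed.

```shell
# Succeeds when the installed package version (e.g. 525.147.05-0ubuntu0.22.04.1,
# illustrative) equals or starts with the loaded kernel module version
# (e.g. 525.147.05, illustrative) followed by a dash.
versions_match() {
    case "$2" in
        "$1"|"$1"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a GPU host (commented out, paths and package name are assumptions):
# loaded=$(cat /sys/module/nvidia/version)
# installed=$(dpkg-query -W -f='${Version}' nvidia-utils-525)
# versions_match "$loaded" "$installed" || echo "driver upgraded on disk, module reload or reboot needed"
```

Running a check like this from monitoring would at least turn the silent breakage into an alert that can be handled in a planned maintenance window.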

What is the recommended way to avoid downtime of our services without fully disabling the critical security updates delivered through unattended-upgrades, which are mandatory for most enterprise deployments?


Please try using apt-mark hold to stick to a fixed version.

Thanks for the quick response. Is there documentation available for this?
Which packages exactly have to be pinned?

man apt-mark
What to hold depends on what and how you installed it.
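One way to find out what is installed is to list every package whose name mentions NVIDIA and review that list before holding anything. The sketch below assumes a Debian/Ubuntu system with dpkg-query. The function name filter_nvidia_packages is hypothetical, and the broad match may also catch packages you do not want to hold (for example container tooling), so treat the output as a starting point.

```shell
# Keep only package names that mention "nvidia" (case-insensitive).
# Reads one package name per line from stdin; prints nothing if none match.
filter_nvidia_packages() {
    grep -i 'nvidia' || true
}

# Example use (commented out because the last line changes system state):
# dpkg-query -W -f='${Package}\n' | filter_nvidia_packages            # review this list first
# dpkg-query -W -f='${Package}\n' | filter_nvidia_packages | xargs -r sudo apt-mark hold
```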

Ubuntu 22 Desktop / Server using either of the following commands:

sudo apt install nvidia-driver-525
or
sudo apt install nvidia-headless-525 nvidia-utils-525 libnvidia-decode-525 libnvidia-encode-525

I realize this is a more “generic” Linux issue than it is a specific driver issue, but nonetheless it currently affects our uptime. So some official, reliable documentation for it would be very much appreciated.

Are we missing anything with:

sudo apt-mark hold nvidia-driver-525
sudo apt-mark hold nvidia-headless-525 nvidia-utils-525 libnvidia-decode-525 libnvidia-encode-525
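To check that holds like the ones above actually took effect, the output of apt-mark showhold can be compared against the packages you expect to be held. This is a small sketch, not an official procedure, and the helper name missing_holds is hypothetical.

```shell
# Print every package given as an argument that is NOT present in the
# hold list read from stdin (one package name per line).
missing_holds() {
    held=$(cat)
    for pkg in "$@"; do
        printf '%s\n' "$held" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Example use on the server (commented out, package names taken from this thread):
# apt-mark showhold | missing_holds nvidia-headless-525 nvidia-utils-525 \
#     libnvidia-decode-525 libnvidia-encode-525
```

Empty output means every listed package is held. Any name that is printed still needs an apt-mark hold.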

Alternatively, would blacklisting them in the unattended-upgrades configuration be more reasonable?

Similar to: