Ubuntu "unattended-upgrades" leads to "Failed to initialize NVML: Driver/library version mismatch"

philipp.schmidt · April 19, 2023, 11:36am

We have long-running deployments that run 24/7 for multiple months without downtime. For several years now we have had an issue with Ubuntu's “unattended-upgrades” service, which occasionally upgrades the GPU driver automatically. After that, some Docker containers fail to start, and nvidia-smi reports

“Failed to initialize NVML: Driver/library version mismatch”

Rebooting the server solves the issue, but it requires manual maintenance, and by that point our service has already suffered downtime.
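A minimal sketch for confirming the mismatch before rebooting. It assumes the kernel module is named `nvidia`, that it exposes its version under `/sys/module/nvidia/version`, and that the on-disk module was upgraded together with the user-space libraries, which is typical but not guaranteed. The helper name `nvidia_versions_match` is hypothetical.

```shell
#!/bin/sh
# Compare the version of the loaded NVIDIA kernel module with the version
# of the module currently installed on disk. A difference suggests the
# driver packages were upgraded while the old module is still loaded.

# Hypothetical helper: succeeds (exit 0) only when both version strings
# are non-empty and equal.
nvidia_versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Assumption: the module is loaded; the file does not exist otherwise.
loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
# modinfo reads the on-disk module for the running kernel.
installed=$(modinfo -F version nvidia 2>/dev/null)

if nvidia_versions_match "$loaded" "$installed"; then
    echo "match: $loaded"
else
    echo "mismatch: loaded='$loaded' installed='$installed'"
fi
```

A check like this could run from a monitoring job, so the mismatch is noticed before a container restart fails.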

What is the recommended way to avoid downtime of our services without fully disabling the critical security updates delivered by unattended-upgrades, which are obligatory for most enterprise deployments?

generix · April 19, 2023, 1:03pm

Please try using apt-mark hold to stick to a fixed version.

philipp.schmidt · April 19, 2023, 1:31pm

Thanks for the quick response. Is there documentation available for this?
Which packages exactly have to be pinned?

generix · April 19, 2023, 1:48pm

man apt-mark
What to hold depends on what and how you installed it.
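One way to see what is actually installed, as a rough sketch: list every fully installed package whose name contains `nvidia`. The helper name `filter_nvidia_packages` is hypothetical, and the name match is only a heuristic, so the output should be reviewed before anything is held.

```shell
#!/bin/sh
# List installed packages whose names suggest they belong to the NVIDIA
# driver stack, as candidates for `apt-mark hold`.

# Hypothetical helper: reads "status<TAB>package" lines on stdin and prints
# the names of fully installed packages containing "nvidia".
filter_nvidia_packages() {
    awk -F'\t' '$1 == "installed" && $2 ~ /nvidia/ { print $2 }'
}

# ${db:Status-Status} is "installed" for fully installed packages.
dpkg-query -W -f='${db:Status-Status}\t${Package}\n' 2>/dev/null \
    | filter_nvidia_packages
```

The list may include packages that are unrelated to the driver version (container tooling, for example), which is why it is treated here as a starting point rather than a definitive hold list.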

philipp.schmidt · April 24, 2023, 9:44am

Ubuntu 22 Desktop / Server, installed using either of the following commands:

sudo apt install nvidia-driver-525
or
sudo apt install nvidia-headless-525 nvidia-utils-525 libnvidia-decode-525 libnvidia-encode-525

I realize this is more of a “generic” Linux issue than a specific driver issue, but it nonetheless affects our uptime. Some official, reliable documentation for it would be very much appreciated.

philipp.schmidt · April 24, 2023, 9:47am

Are we missing anything with:

sudo apt-mark hold nvidia-driver-525
sudo apt-mark hold nvidia-headless-525 nvidia-utils-525 libnvidia-decode-525 libnvidia-encode-525
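A small sketch for checking that the holds took effect, assuming `apt-mark showhold` prints one plain package name per line. The helper name `missing_holds` is hypothetical; the package names are the ones from the headless install above.

```shell
#!/bin/sh
# Report which of the expected packages are NOT currently on hold.

# Hypothetical helper: stdin is the output of `apt-mark showhold` (one
# package per line); arguments are the packages expected to be held.
# Prints each expected package that is absent from the held list.
missing_holds() {
    held=$(cat)
    for pkg in "$@"; do
        printf '%s\n' "$held" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

apt-mark showhold 2>/dev/null | missing_holds \
    nvidia-headless-525 nvidia-utils-525 \
    libnvidia-decode-525 libnvidia-encode-525
```

Empty output means every listed package is held. The two hold commands above do not name any kernel-module packages (DKMS or prebuilt module packages, if present on the system), so it may be worth checking for those separately.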

philipp.schmidt · April 24, 2023, 9:49am

Also, wouldn't blacklisting them in the unattended-upgrades configuration be more reasonable?

Similar to: