We have long-running deployments that run 24/7 for months at a time without downtime. For several years now we have had an issue with Ubuntu's “unattended-upgrades” service, which occasionally upgrades the GPU driver automatically. After such an upgrade, some Docker containers can no longer start, and nvidia-smi reports:
“Failed to initialize NVML: Driver/library version mismatch”
Rebooting the server solves the issue, but it requires manual maintenance, and by that point our service has already been down.
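To make the failure mode concrete: as far as I understand it, the error appears when the user-space NVIDIA libraries on disk have been upgraded while the older kernel module is still loaded. Below is a minimal sketch of how that state could be detected before containers start failing. It assumes the loaded module version can be read from /proc/driver/nvidia/version and that the installed library shows up as libnvidia-ml.so.<version> under /usr/lib/x86_64-linux-gnu; both paths, and all function names, are assumptions for illustration and may differ on other setups.

```python
import re
from pathlib import Path

# Assumed locations on a typical Ubuntu x86_64 install; adjust as needed.
PROC_VERSION = Path("/proc/driver/nvidia/version")
LIB_DIR = Path("/usr/lib/x86_64-linux-gnu")

# Matches driver versions such as 535.154.05 or 470.82
VERSION_RE = re.compile(r"(\d+\.\d+(?:\.\d+)?)")


def kernel_module_version(proc_text):
    """Return the version of the loaded kernel module, or None if not found."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def library_versions(filenames):
    """Return the versions found in libnvidia-ml.so.<version> file names."""
    prefix = "libnvidia-ml.so."
    versions = set()
    for name in filenames:
        if name.startswith(prefix):
            # Skips the libnvidia-ml.so.1 symlink, which carries no full version
            match = VERSION_RE.fullmatch(name[len(prefix):])
            if match:
                versions.add(match.group(1))
    return versions


def driver_mismatch(proc_text, filenames):
    """True when a module is loaded but no installed library matches it."""
    loaded = kernel_module_version(proc_text)
    installed = library_versions(filenames)
    return loaded is not None and bool(installed) and loaded not in installed


if __name__ == "__main__":
    if PROC_VERSION.exists() and LIB_DIR.is_dir():
        names = [p.name for p in LIB_DIR.iterdir()]
        if driver_mismatch(PROC_VERSION.read_text(), names):
            print("NVIDIA kernel module and library versions differ; reboot needed")
```

A check like this only reports the problem (for example from a cron job or monitoring probe); it does not prevent the upgrade or avoid the reboot.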
What is the recommended way to avoid this downtime without fully disabling the critical security updates delivered through unattended-upgrades, which are mandatory for most enterprise deployments?
I realize this is more of a “generic” Linux issue than a driver-specific one, but it currently affects our uptime, so official, reliable documentation on it would be very much appreciated.