Summary of Issue
Containerized GPU workloads may suddenly lose access to their GPUs. This occurs when systemd manages the container's cgroups and is triggered to reload any unit files that reference NVIDIA GPUs.
Affected environments are those using runc and enabling systemd cgroup management at the high-level container runtime. If the system is NOT using systemd to manage cgroups, then it is NOT subject to this issue.
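As one way to check the second condition, the sketch below assumes Docker is the high-level runtime and that its daemon.json sets the cgroup driver through the "exec-opts" key. Neither detail comes from this article, and the function name is hypothetical. Other runtimes keep this setting elsewhere, and a runtime may use systemd by default without declaring it, so a False result does not prove the system is unaffected.

```python
import json


def explicitly_enables_systemd_cgroups(daemon_json_text: str) -> bool:
    """Return True if a Docker daemon.json document explicitly selects
    the systemd cgroup driver.

    Assumption: the driver is chosen with the exec-opts entry
    "native.cgroupdriver=systemd". A False result only means the
    setting is not spelled out in this file.
    """
    config = json.loads(daemon_json_text)
    exec_opts = config.get("exec-opts", [])
    return "native.cgroupdriver=systemd" in exec_opts
```

To use it, pass in the contents of the daemon configuration file (commonly /etc/docker/daemon.json, which is also an assumption here).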
Solution
The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
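The missing symlinks can be spotted with a short script. The following is a minimal sketch, assuming the NVIDIA device nodes match /dev/nvidia* and that entries under /dev/char are named MAJOR:MINOR. Both function names are hypothetical, and the script only reports what is missing; it does not create anything.

```python
import glob
import os
import stat


def char_devices(pattern="/dev/nvidia*"):
    """Map each character device matching pattern to its (major, minor) numbers."""
    found = {}
    for path in glob.glob(pattern):
        st = os.stat(path)
        # Skip directories and regular files; only device nodes matter here.
        if stat.S_ISCHR(st.st_mode):
            found[path] = (os.major(st.st_rdev), os.minor(st.st_rdev))
    return found


def missing_char_symlinks(devices, char_dir="/dev/char"):
    """Return the device paths that have no MAJOR:MINOR entry under char_dir."""
    missing = []
    for path, (major, minor) in sorted(devices.items()):
        link = os.path.join(char_dir, f"{major}:{minor}")
        # lexists() also counts a dangling symlink as present.
        if not os.path.lexists(link):
            missing.append(path)
    return missing


if __name__ == "__main__":
    for device in missing_char_symlinks(char_devices()):
        print(f"no /dev/char symlink for {device}")
```

Run on the host, it prints one line per NVIDIA device node that lacks a /dev/char entry and prints nothing when every node has one.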
A fix will be present in the next patch release of all supported NVIDIA GPU drivers (as of 02/08/23).
More details can be found in the GitHub issue report.