Summary of Issue
Containerized GPU workloads may suddenly lose access to their GPUs. This occurs when systemd is used to manage the container's cgroups and is triggered to reload any unit files that reference NVIDIA GPUs.
Affected environments are those using runc with systemd cgroup management enabled in the high-level container runtime. If the system is NOT using systemd to manage cgroups, it is NOT subject to this issue.
Once the issue occurs, the container needs to be deleted. When it is restarted (manually, or automatically by a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char for any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
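The missing-symlink condition can be checked directly: for a character device with major number M and minor number m, runc expects a /dev/char/M:m symlink resolving to that node. The sketch below is illustrative only; the function names are hypothetical, and it assumes a Linux host where the NVIDIA device nodes appear as /dev/nvidia*.

```python
import os
import stat

def expected_char_symlink(dev_path: str) -> str:
    """Return the /dev/char/<major>:<minor> path that runc expects
    to resolve to the given character device node."""
    st = os.stat(dev_path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{dev_path} is not a character device")
    return f"/dev/char/{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"

def missing_nvidia_symlinks() -> list:
    """List (device, expected symlink) pairs for /dev/nvidia* nodes
    whose /dev/char symlink is absent."""
    missing = []
    for name in sorted(os.listdir("/dev")):
        if not name.startswith("nvidia"):
            continue
        path = os.path.join("/dev", name)
        st = os.stat(path)
        if not stat.S_ISCHR(st.st_mode):
            continue
        link = expected_char_symlink(path)
        if not os.path.islink(link):
            missing.append((path, link))
    return missing
```

Running missing_nvidia_symlinks() on an affected host would report each NVIDIA device node lacking its /dev/char entry; creating those symlinks (as root) is the shape of the workaround, though the upcoming driver patch release is the supported fix.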
A fix will be present in the next patch release of all supported NVIDIA GPU drivers (as of 02/08/23).
More details can be found here: GitHub Issue Report