Summary of Issue
Containerized GPU workloads may suddenly lose access to their GPUs. This occurs when systemd is used to manage the container's cgroups and is triggered to reload any unit files that reference NVIDIA GPUs.
Affected environments are those using runc with systemd cgroup management enabled in the high-level container runtime. If the system is NOT using systemd to manage cgroups, it is NOT subject to this issue.
Once the issue occurs, the container needs to be deleted. When it is restarted (manually, or automatically by a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char for any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
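The missing-symlink condition can be checked directly: for a character device with major number M and minor number m, runc expects a /dev/char/M:m symlink resolving to that node. The sketch below is illustrative only; the function names are hypothetical, and it assumes a Linux host where the NVIDIA device nodes appear as /dev/nvidia*.

```python
import os
import stat

def expected_char_symlink(dev_path: str) -> str:
    """Return the /dev/char/<major>:<minor> path that runc expects
    to resolve to the given character device node."""
    st = os.stat(dev_path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{dev_path} is not a character device")
    return f"/dev/char/{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"

def missing_nvidia_symlinks() -> list:
    """List (device, expected symlink) pairs for /dev/nvidia* nodes
    whose /dev/char symlink is absent."""
    missing = []
    for name in sorted(os.listdir("/dev")):
        if not name.startswith("nvidia"):
            continue
        path = os.path.join("/dev", name)
        st = os.stat(path)
        if not stat.S_ISCHR(st.st_mode):
            continue
        link = expected_char_symlink(path)
        if not os.path.islink(link):
            missing.append((path, link))
    return missing
```

Running missing_nvidia_symlinks() on an affected host would report each NVIDIA device node lacking its /dev/char entry; creating those symlinks (as root) is the shape of the workaround, though the upcoming driver patch release is the supported fix.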
A fix will be present in the next patch release of all supported NVIDIA GPU drivers (as of 02/08/23).
More details can be found here: GitHub Issue Report