nvidia-docker inside Kubernetes - Failed to initialize NVML: Unknown Error

p.oliveira.castro · March 6, 2020, 2:38pm

Hello everyone!

I’m trying to make an NVIDIA GPU available in my Kubernetes cluster. To do that, I followed the guide at https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-nvidia-gpu-device-plugin. First, I made sure the drivers and CUDA were installed on a node of the cluster, so that the nvidia-smi command runs successfully there. Then I installed nvidia-docker2 on the node and the NVIDIA device plugin in the Kubernetes cluster, so that the node can run containers using this runtime (I also had to change the default runtime, as mentioned in the guide).
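For anyone following the same steps, here is a minimal sketch of how the two prerequisites could be checked from a shell. It assumes Docker is the container runtime on the node and that kubectl is configured for the cluster. The function names are hypothetical helpers, not part of any NVIDIA or Kubernetes tool; nvidia.com/gpu is the resource name used in the Kubernetes guide linked above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative only).

# Succeeds only if Docker reports "nvidia" as its default runtime.
check_default_runtime() {
    local runtime
    runtime="$(docker info --format '{{.DefaultRuntime}}')" || return 2
    echo "default runtime: ${runtime}"
    [ "${runtime}" = "nvidia" ]
}

# Prints how many nvidia.com/gpu resources the given node advertises
# as allocatable (prints nothing if the device plugin has not registered any).
allocatable_gpus() {
    local node="$1"
    kubectl get node "${node}" \
        -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

On the GPU node you would run `check_default_runtime`, and from a machine with cluster access `allocatable_gpus <node-name>`, replacing `<node-name>` with the name of your GPU node.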

After that, I ran the nvidia-smi command in the recommended nvidia/cuda:10.0-base image. Success! Everything worked as expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So far, no problem. However, with a real application the GPU started to fail, and I couldn’t train my TensorFlow model inside the container. Investigating further, we found that the GPU starts failing some time after the container starts. To reproduce the issue, we ran the same image as above with the following command:

bash -c 'for i in {0..360}; do echo $i; nvidia-smi; sleep 1; done'

And the nvidia-smi command starts failing after 3 to 5 tries:

0
Thu Mar  5 22:37:19 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
1
Thu Mar  5 22:37:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2
Thu Mar  5 22:37:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
3
Thu Mar  5 22:37:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
4
Thu Mar  5 22:37:23 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
5
Failed to initialize NVML: Unknown Error
6
Failed to initialize NVML: Unknown Error
7
Failed to initialize NVML: Unknown Error
8
Failed to initialize NVML: Unknown Error

Notice that after a few iterations we start seeing the following error:

Failed to initialize NVML: Unknown Error
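A variant of the reproduction loop that stops at the first failure makes it easier to see how long the GPU stays usable. This is only a sketch: first_failure is a hypothetical helper name, and it treats an attempt as failed when the command exits with a non-zero status or prints the "Failed to initialize NVML" message seen above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly until it fails.
# Usage: first_failure MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 1 and reports the attempt index on the first failure,
# returns 0 if every attempt succeeds.
first_failure() {
    local max_tries="$1" interval="$2" i=0 output
    shift 2
    while [ "$i" -lt "$max_tries" ]; do
        # An attempt counts as OK only if the command exits cleanly
        # and its output does not contain the NVML error message.
        if output="$("$@" 2>&1)" &&
           ! printf '%s\n' "$output" | grep -q 'Failed to initialize NVML'; then
            i=$((i + 1))
            sleep "$interval"
        else
            echo "attempt $i failed: $output"
            return 1
        fi
    done
    echo "no failure in $max_tries attempts"
}
```

Inside the container, `first_failure 360 1 nvidia-smi` would correspond to the loop above, except that it stops as soon as the error appears.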

We checked the driver's persistence mode by running nvidia-smi -q, which reports:

...
Persistence Mode                : Enabled
...

This problem only happens inside the container; the node seems fine (Kubernetes doesn’t report any errors, and the command still runs fine on the node even after it starts failing inside the container).

Do you have any idea what is happening?

BenjiBe · June 1, 2021, 9:17am

Hello,
I am experiencing the same issue. Did you manage to solve it?

Thanks for any help!

tobias.guenther · June 1, 2021, 9:50am

@p.oliveira.castro Maybe this can help: "GPU becomes unavailable after some time in Docker container" (issue #1469 in NVIDIA/nvidia-docker on GitHub).

ben1 · January 9, 2022, 2:21am

I had this same issue. Here is how I fixed it:

In a terminal, run sudo nano /etc/nvidia-container-runtime/config.toml and edit the file so that it reads:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = false
no-cgroups = false
user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
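Since most lines in that file are commented out, it can help to list only the settings that are actually in effect before and after editing. A small sketch, where active_settings is a hypothetical helper name and the path is the one given above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the active lines of a config file,
# i.e. skip blank lines and lines whose first non-space character is '#'.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}
```

For example, `active_settings /etc/nvidia-container-runtime/config.toml` would show just the uncommented keys and section headers from the file above, such as load-kmods, no-cgroups, and user.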
