"Failed to initialize NVML: Unknown Error" running nvidia-smi in a docker container only after some hours/days

I created my own dashboard that shows some metrics and Docker containers on my DGX Spark. The full source is on GitHub (DanTup/dgx_dashboard: A simple dashboard for the DGX Spark), but it basically runs:

nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,power.draw --format=csv,noheader,nounits -l=5

to stream metrics every 5s. It starts the process only when a browser is connected to the backend, and terminates the process after a timeout once there are no connections.
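For reference, each line of that stream is plain CSV with the three queried fields in order. A minimal sketch of splitting one line (the sample values and variable names here are made up for illustration, not the dashboard's actual parsing code):

```shell
# Hypothetical sample line in the format the query above emits
# (utilization.gpu, temperature.gpu, power.draw; noheader/nounits):
line="42, 61, 38.12"

# Split on comma plus optional whitespace.
util=$(echo "$line" | awk -F', *' '{print $1}')
temp=$(echo "$line" | awk -F', *' '{print $2}')
power=$(echo "$line" | awk -F', *' '{print $3}')

echo "GPU ${util}% ${temp}C ${power}W"
# → GPU 42% 61C 38.12W
```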

I ran the container with:

docker run -d --gpus all \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -p 8080:8080 \
    --pull=always \
    --restart=unless-stopped \
    --name dashboard \
    ghcr.io/dantup/dgx_dashboard:latest

Everything generally works fine, but every now and then, after some hours or days, the nvidia-smi command starts failing with:

Failed to initialize NVML: Unknown Error

When it fails, I try restarting it a few times after a delay, but once it has gotten into this state it keeps failing indefinitely until I stop and start the Docker container again. The command continues to work fine on the host.

08:43:48: nvidia-smi exited with code 0
10:21:10: Resumed metrics stream
10:21:32: Paused metrics stream and cleared data buffer
10:21:33: nvidia-smi exited with code 0
13:45:46: Resumed metrics stream
13:46:05: Paused metrics stream and cleared data buffer
13:46:05: nvidia-smi exited with code 0
21:35:32: Resumed metrics stream
21:35:32: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
21:35:32: nvidia-smi process terminated unexpectedly - 4 restarts remaining
21:35:32: nvidia-smi exited with code 0
21:35:37: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
21:35:37: nvidia-smi process terminated unexpectedly - 3 restarts remaining
21:35:37: nvidia-smi exited with code 0

Does anyone have any ideas what might cause this, or what I could do to troubleshoot?

ChatGPT told me to run some commands in the container to get more info. It failed to come up with any ideas from the output, but I'm including it here in case it's useful to someone who understands this better.

# nvidia-smi
Failed to initialize NVML: Unknown Error

# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 500,   0 Nov 19 16:23 /dev/nvidia-uvm
crw-rw-rw- 1 root root 500,   1 Nov 19 16:23 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Nov 19 16:23 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 19 16:23 /dev/nvidiactl

# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  580.95.05  Release Build  (dvs-builder@U22-I3-AF08-06-3)  Tue Sep 23 09:46:53 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04) 

And on the host:

# docker inspect "$CID" | jq '.[0].HostConfig.Runtime'
"runc"

# docker inspect "$CID" | jq '.[0].HostConfig.Devices'
[]

It looks like the container cannot properly access the NVIDIA driver or see the GPUs. When it is in this state, can you run nvidia-smi outside of the container and get normal output?

Yep, it continues to work fine outside of the container, and it works again in the container if I docker restart it.

It’s not just a temporary issue though: once it has started happening it stays broken (until I docker restart); it never starts working again on its own.

Are there any logs or any additional debugging I could do when it next happens? (It tends to happen every day or two, but I haven’t noticed any particular pattern.)

Thanks!

I hadn’t seen this for a while, but it happened again this morning.

19:06:39: Paused metrics stream and cleared data buffer
19:06:39: nvidia-smi exited with code 0
08:15:12: Resumed metrics stream
08:15:12: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:12: nvidia-smi process terminated unexpectedly - 4 restarts remaining
08:15:12: nvidia-smi exited with code 0
08:15:17: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:17: nvidia-smi process terminated unexpectedly - 3 restarts remaining
08:15:17: nvidia-smi exited with code 0
08:15:22: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:22: nvidia-smi process terminated unexpectedly - 2 restarts remaining
08:15:22: nvidia-smi exited with code 0
08:15:27: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:27: nvidia-smi process terminated unexpectedly - 1 restarts remaining
08:15:27: nvidia-smi exited with code 0
08:15:32: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:32: nvidia-smi process terminated unexpectedly - 0 restarts remaining
08:15:32: nvidia-smi exited with code 0
08:15:37: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:37: nvidia-smi process terminated unexpectedly - will not restart
08:15:37: nvidia-smi exited with code 0

Everything works fine on the host:

danny@toad:~$ nvidia-smi
Thu Dec 11 08:16:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   33C    P8              4W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But not in the container:

danny@toad:~$ docker exec -it dashboard nvidia-smi
Failed to initialize NVML: Unknown Error

I asked Gemini for any commands that might help debug, and here are some outputs:

danny@toad:~$ ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Dec  9 22:56 /dev/nvidia-modeset
crw-rw-rw- 1 root root 501,   0 Dec  9 22:56 /dev/nvidia-uvm
crw-rw-rw- 1 root root 501,   1 Dec  9 22:56 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Dec  9 22:56 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  9 22:56 /dev/nvidiactl

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root     80 Dec  9 22:56 .
drwxr-xr-x 19 root root   4480 Dec  9 22:56 ..
cr--------  1 root root 504, 1 Dec  9 22:56 nvidia-cap1
cr--r--r--  1 root root 504, 2 Dec  9 22:56 nvidia-cap2


danny@toad:~$ docker exec -it dashboard ls -la /dev/nvidia*
ls: cannot access '/dev/nvidia-caps': No such file or directory
ls: cannot access '/dev/nvidia-modeset': No such file or directory
crw-rw-rw- 1 root root 501,   0 Dec  9 22:56 /dev/nvidia-uvm
crw-rw-rw- 1 root root 501,   1 Dec  9 22:56 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Dec  9 22:56 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  9 22:56 /dev/nvidiactl
danny@toad:~$ sudo journalctl -xe | grep -i 'nvidia\|docker'
[sudo] password for danny: 
Dec 11 01:46:26 toad systemd[1]: /etc/systemd/system/nvidia-cdi-refresh.service:26: Ignoring unknown escape sequences: "/(nvidia|nvidia-current)\.ko[:]"
Dec 11 01:46:28 toad systemd[1]: /etc/systemd/system/nvidia-cdi-refresh.service:26: Ignoring unknown escape sequences: "/(nvidia|nvidia-current)\.ko[:]"
Dec 11 08:19:42 toad dockerd[2126]: time="2025-12-11T08:19:42.811830017Z" level=error msg="Handler for POST /v1.51/exec/e49820b05a91e7dfdc50d1ecc7b34b72859f3af29eeef7fde3d6db606cccdd3b/resize returned error: cannot resize a stopped container: unknown"
danny@toad:~$ sudo journalctl -u docker.service | tail
Dec 09 22:56:26 toad dockerd[2126]: time="2025-12-09T22:56:26.384102912Z" level=info msg="Docker daemon" commit=f8215cc containerd-snapshotter=false storage-driver=overlay2 version=28.5.1
Dec 09 22:56:26 toad dockerd[2126]: time="2025-12-09T22:56:26.384310048Z" level=info msg="Initializing buildkit"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.205070480Z" level=info msg="Completed buildkit initialization"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.208644944Z" level=info msg="Daemon has completed initialization"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.208688816Z" level=info msg="API listen on /run/docker.sock"
Dec 09 22:56:27 toad systemd[1]: Started docker.service - Docker Application Container Engine.
Dec 10 10:11:35 toad dockerd[2126]: time="2025-12-10T10:11:35.602080683Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0
Dec 10 10:11:36 toad dockerd[2126]: time="2025-12-10T10:11:36.901551490Z" level=info msg="ignoring event" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 10 10:11:36 toad dockerd[2126]: time="2025-12-10T10:11:36.930806655Z" level=warning msg="ShouldRestart failed, container will not be restarted" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0 daemonShuttingDown=false error="restart canceled" execDuration=1h53m42.063553887s exitStatus="{137 2025-12-10 10:11:36.886208775 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Dec 11 08:19:42 toad dockerd[2126]: time="2025-12-11T08:19:42.811830017Z" level=error msg="Handler for POST /v1.51/exec/e49820b05a91e7dfdc50d1ecc7b34b72859f3af29eeef7fde3d6db606cccdd3b/resize returned error: cannot resize a stopped container: unknown"

For this one, Gemini claims I should see “nvidia”, but I only see runc:

danny@toad:~$ docker info | grep -i runtime
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc

However it also told me:

This is a well-known, intermittent, and annoying issue in the NVIDIA/Docker ecosystem, typically related to how Linux handles container groups (cgroups).

This problem almost always happens when your Docker daemon is configured to use the systemd cgroup driver (which is the default on many modern Linux distributions like Ubuntu 20.04+ or Fedora).

When a host command like sudo systemctl daemon-reload is executed on the host (often triggered by installing, upgrading, or configuring any system service, not just Docker), it causes systemd to reload unit files, including the unit files managing your running container’s resources (cgroups).

It claims the fix is to:

The official, recommended mitigation for this specific intermittent issue is to configure the Docker daemon to use the cgroupfs cgroup driver instead of systemd. This driver is much less susceptible to resource management changes on the host.

I’m not very familiar with this, so I will wait for some feedback before trying this - please let me know if this seems plausible and I should try it.

Seems like the advice it gave is actually documented on the nvidia site here:

So I will give this a go this evening. If it’s a known issue and the fix is clear, I’m not sure why it’s not just changed in the default OS setup though?

In order to access the GPUs inside the container, you will need to use the --gpus=all flag

Yep, I am using this. nvidia-smi works fine for a number of days, and then suddenly stops working in the container (unless I restart the container), which seems to be what is described in the nvidia article I linked above.

I was going to try the workaround noted above, but I also just noticed that on that page it says:

Newer runc versions do not show this behavior and newer NVIDIA driver versions ensure that the required symlinks are present, reducing the likelihood of the specific issue occurring for affected runc versions.

So now I’m not so sure. However, I figure I have nothing to lose by trying it.

OK, so I just confirmed that running systemctl daemon-reload on the host definitely causes this issue. I’m not sure whether that’s what normally triggers it for me (and if so, what is running systemctl daemon-reload), but it definitely does break things.

So I applied the workaround ({ "exec-opts": ["native.cgroupdriver=cgroupfs"] } in /etc/docker/daemon.json), and systemctl daemon-reload no longer causes this problem.
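For anyone else applying this, the workaround is just that one JSON fragment plus a daemon restart. A sketch that stages the config in a temp file so it can be sanity-checked before touching the live daemon config (the temp-file step is my own precaution, not part of the documented workaround):

```shell
# The cgroupfs workaround: this JSON goes in /etc/docker/daemon.json
# (merge it with any existing keys there), then restart Docker with:
#   sudo systemctl restart docker
cfg='{ "exec-opts": ["native.cgroupdriver=cgroupfs"] }'
tmp=$(mktemp)
printf '%s\n' "$cfg" > "$tmp"

# Validate the JSON before copying it over the live daemon config.
python3 -m json.tool "$tmp"
```

After restarting Docker, `docker info | grep -i cgroup` should report `Cgroup Driver: cgroupfs`.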

So, I’m not certain (because I don’t know for sure that this is what was happening for real and there isn’t another cause), but it seems likely this is the problem.

My suspicion is that, unless the newer runc versions without this problem simply haven’t been rolled out yet, the claim on that page that this issue is fixed might not be accurate.

Unfortunately I can’t reproduce this issue based on the GitHub discussion linked in the article, but I see that it does reproduce for you. Can you share your nvidia-ctk --version and docker info?

This is what repro’d for me:

  • Start bash in a container with --gpus=all and confirm nvidia-smi works
  • In the host, run systemctl daemon-reload
  • Run nvidia-smi again inside the container
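Concretely, the steps above look something like this (the container name `gputest` is arbitrary; this assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# 1. Start a GPU container and confirm nvidia-smi works inside it.
docker run -d --gpus=all --name gputest ubuntu sleep infinity
docker exec gputest nvidia-smi        # prints the normal table

# 2. On the host, reload systemd units.
sudo systemctl daemon-reload

# 3. Run nvidia-smi inside the container again.
docker exec gputest nvidia-smi        # now: Failed to initialize NVML: Unknown Error

# Cleanup.
docker rm -f gputest
```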

Do you mean on the host or in the container? (and if in the container, do you mean when it’s in the broken state or not? I can revert the workaround to test if required).

Here’s the values from the host.

danny@toad:~$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.18.1
commit: efe99418ef87500dbe059cadc9ab418b2815b9d5
danny@toad:~$ docker info
Client: Docker Engine - Community
 Version:    28.5.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.29.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.40.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 1
  Paused: 0
  Stopped: 3
 Images: 28
 Server Version: 28.5.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Discovered Devices:
  cdi: nvidia.com/gpu=0
  cdi: nvidia.com/gpu=GPU-da687fe0-8408-993e-f760-e9e4fe24f190
  cdi: nvidia.com/gpu=all
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b98a3aace656320842a23f4a392a33f46af97866
 runc version: v1.3.0-0-g4ca628d1
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.14.0-1013-nvidia
 Operating System: Ubuntu 24.04.3 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 20
 Total Memory: 119.7GiB
 Name: toad
 ID: dcb42044-2fa8-49e2-ab9c-8791940e609b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Have you tried using some cuda container as your base image?

I have not (I’m actually trying to trim my base image down to the absolute minimum I can). But it’s also not clear to me why this should be necessary - surely the behaviour described above is just a bug and the GPU should work with any base container?

Hi @DannyTup, I’ve heard back from engineering, they recommend to use CDI (Container Device Interface) directly instead.
Please start a container with the following command:
sudo docker run -it --entrypoint /bin/bash --device nvidia.com/gpu=all ubuntu
You should no longer see the Failed to initialize NVML: Unknown Error message after running systemctl daemon-reload
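For reference, you can inspect the CDI setup with nvidia-ctk before switching (the spec path below is the toolkit's default; it may differ on your system):

```shell
# Generate (or refresh) the CDI specification for the installed driver.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec makes available.
nvidia-ctk cdi list
# e.g. nvidia.com/gpu=0, nvidia.com/gpu=all

# Then pass one of those names via --device instead of --gpus:
sudo docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
```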

Thanks - I will do some testing of this. It does give me a few questions though:

  • is this a better resolution than the cgroup workaround I’d already applied (which seemed to work)?
  • should we generally use this instead of --gpus=all for all containers in all cases?
  • if we should always use this, is there a reason that --gpus=all couldn’t do the same thing?

(apologies if these are silly questions - I don’t yet entirely understand the difference between the two)

Thanks!

As it says on the Troubleshooting site, “Use the Container Device Interface (CDI) to inject devices into a container. When CDI is used to inject devices into a container, the required device nodes are included in the modifications made to the container config. This means that even if the container is updated it will still have access to the required devices.”

It’s not clear to me if this is better than the other workaround listed (using cgroupfs as the cgroup driver for containers) which also worked. I don’t understand either of them well enough to know what the trade-off is. If both work, which is preferred?

They should both work; however, CDI is the preferred mode of GPU injection in containers.

Thanks! I tried this with my current container base, but it fails with:

17:06:26: unexpected nvidia-smi output: NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
17:06:26: unexpected nvidia-smi output: Please also try adding directory that contains libnvidia-ml.so to your system PATH.
17:06:26: nvidia-smi process terminated unexpectedly - 4 restarts remaining

So I presume this method has some requirements on the base image being used that the previous method does not. Do you know what these requirements are (or what the smallest image that satisfies them is)? I wanted to keep my dashboard as small as possible.