I hadn’t seen this for a while, but it happened again this morning.
19:06:39: Paused metrics stream and cleared data buffer
19:06:39: nvidia-smi exited with code 0
08:15:12: Resumed metrics stream
08:15:12: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:12: nvidia-smi process terminated unexpectedly - 4 restarts remaining
08:15:12: nvidia-smi exited with code 0
08:15:17: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:17: nvidia-smi process terminated unexpectedly - 3 restarts remaining
08:15:17: nvidia-smi exited with code 0
08:15:22: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:22: nvidia-smi process terminated unexpectedly - 2 restarts remaining
08:15:22: nvidia-smi exited with code 0
08:15:27: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:27: nvidia-smi process terminated unexpectedly - 1 restarts remaining
08:15:27: nvidia-smi exited with code 0
08:15:32: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:32: nvidia-smi process terminated unexpectedly - 0 restarts remaining
08:15:32: nvidia-smi exited with code 0
08:15:37: unexpected nvidia-smi output: Failed to initialize NVML: Unknown Error
08:15:37: nvidia-smi process terminated unexpectedly - will not restart
08:15:37: nvidia-smi exited with code 0
Everything works fine on the host:
danny@toad:~$ nvidia-smi
Thu Dec 11 08:16:46 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 33C P8 4W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But not in the container:
danny@toad:~$ docker exec -it dashboard nvidia-smi
Failed to initialize NVML: Unknown Error
I asked Gemini for commands that might help with debugging; here are some of the outputs:
danny@toad:~$ ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Dec 9 22:56 /dev/nvidia-modeset
crw-rw-rw- 1 root root 501, 0 Dec 9 22:56 /dev/nvidia-uvm
crw-rw-rw- 1 root root 501, 1 Dec 9 22:56 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Dec 9 22:56 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec 9 22:56 /dev/nvidiactl
/dev/nvidia-caps:
total 0
drwxr-xr-x 2 root root 80 Dec 9 22:56 .
drwxr-xr-x 19 root root 4480 Dec 9 22:56 ..
cr-------- 1 root root 504, 1 Dec 9 22:56 nvidia-cap1
cr--r--r-- 1 root root 504, 2 Dec 9 22:56 nvidia-cap2
danny@toad:~$ docker exec -it dashboard ls -la /dev/nvidia*
ls: cannot access '/dev/nvidia-caps': No such file or directory
ls: cannot access '/dev/nvidia-modeset': No such file or directory
crw-rw-rw- 1 root root 501, 0 Dec 9 22:56 /dev/nvidia-uvm
crw-rw-rw- 1 root root 501, 1 Dec 9 22:56 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Dec 9 22:56 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec 9 22:56 /dev/nvidiactl
danny@toad:~$ sudo journalctl -xe | grep -i 'nvidia\|docker'
[sudo] password for danny:
Dec 11 01:46:26 toad systemd[1]: /etc/systemd/system/nvidia-cdi-refresh.service:26: Ignoring unknown escape sequences: "/(nvidia|nvidia-current)\.ko[:]"
Dec 11 01:46:28 toad systemd[1]: /etc/systemd/system/nvidia-cdi-refresh.service:26: Ignoring unknown escape sequences: "/(nvidia|nvidia-current)\.ko[:]"
Dec 11 08:19:42 toad dockerd[2126]: time="2025-12-11T08:19:42.811830017Z" level=error msg="Handler for POST /v1.51/exec/e49820b05a91e7dfdc50d1ecc7b34b72859f3af29eeef7fde3d6db606cccdd3b/resize returned error: cannot resize a stopped container: unknown"
danny@toad:~$ sudo journalctl -u docker.service | tail
Dec 09 22:56:26 toad dockerd[2126]: time="2025-12-09T22:56:26.384102912Z" level=info msg="Docker daemon" commit=f8215cc containerd-snapshotter=false storage-driver=overlay2 version=28.5.1
Dec 09 22:56:26 toad dockerd[2126]: time="2025-12-09T22:56:26.384310048Z" level=info msg="Initializing buildkit"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.205070480Z" level=info msg="Completed buildkit initialization"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.208644944Z" level=info msg="Daemon has completed initialization"
Dec 09 22:56:27 toad dockerd[2126]: time="2025-12-09T22:56:27.208688816Z" level=info msg="API listen on /run/docker.sock"
Dec 09 22:56:27 toad systemd[1]: Started docker.service - Docker Application Container Engine.
Dec 10 10:11:35 toad dockerd[2126]: time="2025-12-10T10:11:35.602080683Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0
Dec 10 10:11:36 toad dockerd[2126]: time="2025-12-10T10:11:36.901551490Z" level=info msg="ignoring event" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 10 10:11:36 toad dockerd[2126]: time="2025-12-10T10:11:36.930806655Z" level=warning msg="ShouldRestart failed, container will not be restarted" container=dd260ae3f954ff7c07f4a93705c9a063c250b2a12562d459566ed5e1031808a0 daemonShuttingDown=false error="restart canceled" execDuration=1h53m42.063553887s exitStatus="{137 2025-12-10 10:11:36.886208775 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Dec 11 08:19:42 toad dockerd[2126]: time="2025-12-11T08:19:42.811830017Z" level=error msg="Handler for POST /v1.51/exec/e49820b05a91e7dfdc50d1ecc7b34b72859f3af29eeef7fde3d6db606cccdd3b/resize returned error: cannot resize a stopped container: unknown"
For the next command, Gemini claims I should see an “nvidia” runtime listed, but I only see runc:
danny@toad:~$ docker info | grep -i runtime
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
However, it also told me:
This is a well-known, intermittent, and annoying issue in the NVIDIA/Docker ecosystem, typically related to how Linux handles container groups (cgroups).
This problem almost always happens when your Docker daemon is configured to use the systemd cgroup driver (which is the default on many modern Linux distributions like Ubuntu 20.04+ or Fedora).
When a host command like sudo systemctl daemon-reload is executed on the host (often triggered by installing, upgrading, or configuring any system service, not just Docker), it causes systemd to reload unit files, including the unit files managing your running container’s resources (cgroups).
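If that explanation is right, it seems like it should be easy to check which cgroup driver my daemon is actually using, and (assuming I’m willing to break the container again) to reproduce the failure on demand. I haven’t run this yet - it’s just what I understand the check would look like:

```shell
# Check which cgroup driver dockerd is using.
# Gemini's hypothesis predicts this prints "systemd".
docker info --format '{{.CgroupDriver}}'

# If the hypothesis holds, this sequence should reproduce the failure
# on demand (after restarting the container so the GPU works again):
docker exec dashboard nvidia-smi   # expected to work after a fresh start
sudo systemctl daemon-reload       # reload unit files on the host
docker exec dashboard nvidia-smi   # hypothesis: NVML now fails to initialize
```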
It claims the fix is the following:
The official, recommended mitigation for this specific intermittent issue is to configure the Docker daemon to use the cgroupfs cgroup driver instead of systemd. This driver is much less susceptible to resource management changes on the host.
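For my own reference, my understanding is the change would amount to something like this - untested on my machine, and the daemon.json file may not exist yet:

```shell
# /etc/docker/daemon.json -- sketch of the suggested mitigation, not yet applied.
# Merge this key into the existing file if one is already present:
#
#   {
#     "exec-opts": ["native.cgroupdriver=cgroupfs"]
#   }
#
# Then restart the daemon (note: this stops running containers):
sudo systemctl restart docker
```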
I’m not very familiar with cgroup drivers, so I’ll hold off before trying this - please let me know whether the explanation seems plausible and whether switching to cgroupfs is worth attempting.