DGX Spark: GPU usage drops to 0 after 24 hours with Open-WebUI

I have been running the DGX Spark platform for a little over a month, primarily using Open‑WebUI. While GPU performance is excellent initially, I’ve noticed that after running for more than 24 hours the GPU is no longer utilized and the workload reverts to the CPU. A simple Docker restart or a full system reboot temporarily restores GPU usage. Unfortunately, I have not been able to pinpoint the root cause or a permanent fix.

Has anyone else observed a similar pattern, and is there a known resolution? I would appreciate any guidance.

Hi, this is very strange behavior. Can you please share the exact workload/container and the runtime steps? Engineering would love to triage this.

You bet! Currently I run a Docker container:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3773c077b435 ghcr.io/open-webui/open-webui:ollama "bash start.sh" 7 days ago Up 2 hours (healthy) 0.0.0.0:12000->8080/tcp, [::]:12000->8080/tcp open-webui

I also have a virtual Python environment for ComfyUI.

My running processes look like this:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                         GPU Memory |
|        ID   ID                                                                Usage      |
|=========================================================================================|
|    0   N/A  N/A            2829      G   /usr/lib/xorg/Xorg                        73MiB |
|    0   N/A  N/A            2998      G   /usr/bin/gnome-shell                      18MiB |
|    0   N/A  N/A            3522      C   python                                 19841MiB |
+-----------------------------------------------------------------------------------------+
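For reference, a quick way to compare what the host and the container each see when the problem reappears (container name taken from the docker ps output above; the NVIDIA container runtime normally mounts nvidia-smi into the container):

# On the host
nvidia-smi

# Inside the Open WebUI container
docker exec -it open-webui nvidia-smi

If the host command is fine but the in-container one fails (for example with an NVML initialization error), that points at the container losing access to the GPU rather than a driver-level fault.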

Can you share a sudo nvidia-bug-report.sh log from when the container is using the GPU, and another sudo nvidia-bug-report.sh from when the same container is no longer using the GPU?
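Something like the following should give two clearly labelled archives, assuming your copy of the script supports the -o/--output-file option (otherwise just rename the default nvidia-bug-report.log.gz between runs):

# While the container is still using the GPU
sudo nvidia-bug-report.sh --output-file nvidia-bug-report-gpu-working.log.gz

# After the workload has fallen back to the CPU
sudo nvidia-bug-report.sh --output-file nvidia-bug-report-gpu-lost.log.gz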

Absolutely. I should have that report first thing in the morning when the GPU fails.

May I assume that the tools are missing for this report:

I may sound like a broken record, but I suggest dropping Ollama and switching to llama.cpp. You can use it with Open WebUI, or it has its own web UI too.
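Roughly what that looks like, as a sketch (binary name and flags from recent llama.cpp builds; the model path is just a placeholder):

# Start llama.cpp's OpenAI-compatible server with full GPU offload
./llama-server -m ./models/your-model.gguf -ngl 99 --host 0.0.0.0 --port 8081

# Then add it in Open WebUI as an OpenAI-compatible connection, e.g. http://<spark-ip>:8081/v1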

I wouldn't rely on the NVIDIA-provided playbooks; as many of us have encountered, they are often incomplete, underperforming, or outright broken, with a few exceptions (like the ComfyUI one).


I wonder if this could be the same issue I reported here:

https://forums.developer.nvidia.com/t/failed-to-initialize-nvml-unknown-error-running-nvidia-smi-in-a-docker-container-only-after-some-hours-days

After some period, my dashboard container starts getting errors from nvidia-smi; I wonder if the GPU is being lost inside the container in the same way for both of us.

(Although, FWIW, I haven't seen it for several days... possibly since a recent update, but it could also just be a fluke.)


Interesting. I haven’t seen that with my cluster, but it wasn’t sitting idle for a long period of time. I’ll try to leave it running over a weekend and see what happens. I usually shut it down at night, because of the bug in Ray that keeps at least one core pegged at 100% in a busy loop, even when idle.


I hit mine again this morning. I posted some output from the commands that Gemini suggested might be helpful here:

Interestingly, it also claimed this is a well-known issue caused by using systemd instead of cgroupfs as the Docker daemon cgroup driver. I posted details in the thread above, but I'd like a second opinion before I try it, because it didn't give any sources 😄

Actually, I managed to get it to give some sources, and it seems that this is documented on the NVIDIA site here:


I confirmed that systemctl daemon-reload does lead to this issue for me, and that the workaround described does prevent it.

I added some more notes here:

I would suggest trying this too; there's a chance it will also resolve your issue.
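For anyone finding this later, the change being discussed is roughly the following; treat it as a sketch and follow the NVIDIA documentation linked above for the exact steps (the "runtimes" entry shown is only an example of a key that may already exist and should be merged with, not added blindly):

# /etc/docker/daemon.json - switch the Docker cgroup driver from systemd to cgroupfs,
# keeping any existing keys such as the NVIDIA runtime entry:
#
#   {
#     "runtimes": { "nvidia": { "path": "nvidia-container-runtime" } },
#     "exec-opts": ["native.cgroupdriver=cgroupfs"]
#   }

sudo systemctl restart docker

# Re-test by running the GPU container, triggering sudo systemctl daemon-reload,
# and checking that docker exec <container> nvidia-smi still works.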
