I’m seeing repeated hard power-offs / resets on an ASUS Ascent GX10 (host gx10-4323, GB10 / Blackwell) while running heavy inference in a Docker vLLM container. This does not look like GPU thermal shutdown (I often see ~85–90W and ~60–70°C right before the disconnect), but rather an abrupt power cut / firmware reset (“unclean shutdown”).
System info:
- System: ASUS GX10 / DGX Spark (arm64)
- Hostname: gx10-4323
- DGX release package: dgx-release 7.4.0
- /etc/dgx-release: SWBUILD 7.2.3; OTA entries include 7.3.1 and 7.4.0 (2026-02-05 20:12:54 +05)
- OS: Ubuntu 24.04.3 LTS
- Kernel (runtime): 6.14.0-1015-nvidia
- Firmware: GX10DGX.0102.2025.1111.1531 (2025-11-11)
- NVIDIA driver (nvidia-smi): 580.126.09 (driver-reported CUDA Version: 13.0)
- NVRM (/proc/driver/nvidia/version): 580.126.09
- modinfo nvidia: version 580.126.09; vermagic 6.14.0-1015-nvidia
- CUDA toolkit:
  - cuda-toolkit-13-0 13.0.2-1
  - cuda-nvcc-13-0 13.0.88-1
  - nvcc --version: Cuda compilation tools, release 13.0, V13.0.88
- NVIDIA Container Toolkit: nvidia-container-toolkit 1.18.2-1 (libnvidia-container1 1.18.2-1)
- Docker Engine: docker-ce 29.1.3 (docker compose plugin reports v5.0.1)
How I run the workload (this is what triggers it):
docker stop vllm-gptoss120b-mxfp4 || true
docker rm vllm-gptoss120b-mxfp4 || true
docker run -d \
--name vllm-gptoss120b-mxfp4 \
--restart on-failure:3 \
--gpus all \
--network host \
--ipc=host \
--memory=110g --memory-swap=110g \
-v $HOME/models/GPT-OSS-120B:/model:ro \
vllm-node-mxfp4:latest \
vllm serve /model \
--host 0.0.0.0 \
--port 8888 \
--served-model-name gpt-oss-120b \
--enable-auto-tool-choice \
--tool-call-parser openai \
--reasoning-parser openai_gptoss \
--load-format fastsafetensors \
--quantization mxfp4 \
--mxfp4-backend CUTLASS \
--mxfp4-layers moe,qkv,o,lm_head \
--attention-backend FLASHINFER \
--kv-cache-dtype fp8 \
--enforce-eager \
--gpu-memory-utilization 0.72 \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--max-num-seqs 1 \
--swap-space 1 \
--max-model-len 131072
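Because the power/temperature readings I quoted above are only visible until the cut, I also run a small sampler alongside the workload that fsyncs each line, so the last reading before the power loss survives on disk. This is just a sketch (the log path is my own choice; the --query-gpu field names are standard nvidia-smi query fields):

```shell
# Sketch: log GPU telemetry once per second and flush each sample to disk,
# so the final reading before a hard power cut is not lost in page cache.
# Assumes nvidia-smi is on PATH; log path is arbitrary.
LOG="$HOME/gpu-telemetry.csv"
nvidia-smi --query-gpu=timestamp,power.draw,temperature.gpu,memory.used \
           --format=csv,noheader -l 1 2>/dev/null \
| while IFS= read -r line; do
    printf '%s\n' "$line" >> "$LOG"
    sync "$LOG"   # force this sample to disk immediately (coreutils >= 8.24)
  done
```

After a reset, `tail "$HOME/gpu-telemetry.csv"` then shows the last state the GPU reported before the cut.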
What happens:
- The SSH session (and any monitoring) abruptly disconnects: “Software caused connection abort.”
- The machine becomes unreachable on the network until I manually power it back on.
- After reboot, journald reports an unclean shutdown, so it was not a normal OS shutdown.
Evidence for the most recent reset (local time +05, boot -1 → boot 0):
- /var/log/syslog shows the last line before the reset, then immediately the next kernel boot:
  - 37510:2026-02-05T19:15:22.919929+05:00 gx10-4323 systemd[1]: session-14.scope: Deactivated successfully.
  - 37511:2026-02-05T19:38:34.828912+05:00 gx10-4323 kernel: Booting Linux on physical CPU 0x0000000000 [0x410fd871]
- journalctl -b 0 (current boot) shows the “unclean shutdown” marker:
  - Feb 05 19:38:33 gx10-4323 systemd-journald[634]: File …/system.journal corrupted or uncleanly shut down, renaming and replacing.
- The same “unclean shutdown” line is also present in /var/log/syslog:
  - 39248:2026-02-05T19:38:34.830923+05:00 … systemd-journald[634]: File …system.journal corrupted or uncleanly shut down, renaming and replacing.
Important: for this specific reset window, I could not find any NVRM/Xid/OOM lines right before the cut:
- journalctl -b -1 -k | egrep 'nvCheckOkFailed|NV_ERR_NO_MEMORY|Out of memory|Xid' → no matches.
However, on other earlier runs (same host), I do have NVIDIA kernel driver OOM evidence:
- 2026-02-05T13:14:03.945045+05:00 … NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] …
- (also previously) 2026-02-02T19:49:22.168466+05:00 … NVRM: Xid … 31 … MMU Fault …
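To make sure I wasn’t missing evidence from still-earlier boots, I swept every boot journald has recorded with the same pattern. Sketch below; it assumes a persistent journal (otherwise --list-boots only shows the current boot):

```shell
# Sketch: sweep every boot recorded by journald for NVIDIA driver errors,
# so earlier crashes with NVRM/Xid/OOM lines are not missed.
journalctl --list-boots --no-pager 2>/dev/null \
| awk '/^[[:space:]]*-?[0-9]+[[:space:]]/ { print $1 }' \
| while read -r boot; do
    hits=$(journalctl -b "$boot" -k --no-pager 2>/dev/null \
           | grep -cE 'NVRM|Xid|NV_ERR_NO_MEMORY|Out of memory')
    if [ "${hits:-0}" -gt 0 ]; then
      echo "boot $boot: $hits NVRM/Xid/OOM kernel lines"
    fi
  done
```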
Also, on each boot, journalctl -b 0 -k shows:
- mlx5_core … Detected insufficient power on the PCIe slot (27W). (multiple lines)
Questions:
- Is a hard power-off/reset under vLLM long-context load a known issue on GB10 / DGX OS (e.g., a driver OOM/hang that doesn’t always flush logs)?
- Does the NVRM NV_ERR_NO_MEMORY / _memdescAllocInternal pattern match a known bug, and is there a fix or workaround?
- What’s the best way to capture useful diagnostics for NVIDIA when the box resets abruptly (e.g., recommended logging, pstore, nvidia-bug-report.sh options) so the moment of failure is not lost?
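In case it helps, here is what I’m setting up in the meantime to catch the next reset. This is a sketch, not a verified DGX OS procedure: the journald options are standard systemd settings, and I’m assuming nvidia-bug-report.sh (shipped with the driver) and its --output-file option; corrections welcome if there is a better-supported flow for GB10.

```shell
# Sketch: make the next abrupt reset leave evidence behind.
# Assumptions: systemd distro; nvidia-bug-report.sh is installed with the
# driver and accepts --output-file.

# 1. Persistent journal with frequent sync, so boot -1 survives on disk.
sudo mkdir -p /var/log/journal /etc/systemd/journald.conf.d
printf '[Journal]\nStorage=persistent\nSyncIntervalSec=5s\n' \
  | sudo tee /etc/systemd/journald.conf.d/99-crash-debug.conf >/dev/null
sudo systemctl restart systemd-journald

# 2. After the reset, check EFI-backed pstore for last-gasp kernel messages.
ls -l /sys/fs/pstore/ 2>/dev/null || echo "pstore empty or unsupported"

# 3. Snapshot driver state periodically while the workload runs, so the
#    newest archive predates the crash by minutes at most.
command -v nvidia-bug-report.sh >/dev/null \
  && sudo nvidia-bug-report.sh --output-file "/var/log/nvbug-$(date +%s)" \
  || echo "nvidia-bug-report.sh not found"
```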