DGX Spark Shutdown around 95°C during nanoChat Pretraining (20-30 min)

parallelArchitect · March 21, 2026, 6:55am

Your analysis here is solid, and the evidence you posted points to a memory-pressure failure path rather than a thermal-triggered shutdown.

The key signal is the _memdescAllocInternal ... NV_ERR_NO_MEMORY entry in the kernel log, combined with high system memory pressure and swap activity. That’s not a typical user-space CUDA OOM. It indicates an internal driver allocation failure under heavy memory pressure, followed by an unclean reboot.
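If it helps, one quick way to check whether a given boot hit this signature is to filter the kernel log for the two strings above. This is only a sketch: nvrm_alloc_failures is a hypothetical helper name, and reading the previous boot with journalctl -b -1 assumes persistent journaling is enabled on your system.

```shell
# Hypothetical helper: keep only kernel-log lines that mention the
# driver-internal allocation failure discussed above.
# Reads from stdin, or from any files passed as arguments.
nvrm_alloc_failures() {
    grep -E '_memdescAllocInternal|NV_ERR_NO_MEMORY' "$@"
}

# Example (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | nvrm_alloc_failures
```

If the previous boot's log shows nothing, that doesn't rule this failure out, since the last messages before a hard reset may never have reached disk.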

On Spark / GB10 systems, GPU and system memory operate within a unified memory model, so GPU allocations can compete with system memory usage (including page cache and kernel allocations) under load. Under sufficient pressure, the driver can hit allocation failures before a normal CUDA OOM is surfaced, which matches the hang → hard reset pattern you’re seeing.

I went through a full bug report showing the same failure pattern in detail here:

The same internal allocation failure also shows up in the llama4 crash thread:

For reference on the platform:

Your diagnostic approach (journalctl -k, sar, vmstat, nvidia-smi dmon) is appropriate. For additional visibility into the lead-up to failure, you can monitor system memory state in real time:

watch -n1 'grep -E "MemFree|MemAvailable|Cached|SwapFree" /proc/meminfo'

If MemAvailable drops while Cached remains elevated and swap activity increases, the system is approaching a state where unified-memory pressure can destabilize driver allocation paths.
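Since watch only prints to the terminal and the machine hard-resets, a small logger that appends the same fields to a file and flushes each sample may be more useful for capturing the lead-up. A minimal sketch, assuming a POSIX shell with awk available; LOG, INTERVAL, snapshot and trace_meminfo are hypothetical names, not part of any NVIDIA tooling.

```shell
# Sketch only: the log path and interval are assumptions, adjust as needed.
LOG="${LOG:-$HOME/meminfo-trace.log}"
INTERVAL="${INTERVAL:-1}"

# Print one line: a timestamp followed by Field=kB pairs taken from a
# meminfo-style file (defaults to /proc/meminfo).
snapshot() {
    src="${1:-/proc/meminfo}"
    printf '%s ' "$(date '+%Y-%m-%dT%H:%M:%S')"
    awk '/^(MemFree|MemAvailable|Cached|SwapFree):/ {
             sub(/:/, "", $1); printf "%s=%s ", $1, $2
         } END { print "" }' "$src"
}

# Append a snapshot every INTERVAL seconds and flush after each one,
# so the last samples have a chance of surviving a hard reset.
trace_meminfo() {
    while true; do
        snapshot >> "$LOG"
        sync
        sleep "$INTERVAL"
    done
}
```

Source the file and run trace_meminfo in a second terminal before starting the training job, then check the tail of the log after the reboot. The sync on every sample adds some I/O of its own, so a longer INTERVAL may be the better trade-off if the disk is already busy with swap.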

Two quick questions that would help narrow this down:

1. Was this workload stable on your GX10 prior to 580.126.09, or has it always behaved this way?
2. Were you able to capture /proc/meminfo in the minutes leading up to the crash, or only the sar data after the fact?
