DGX Spark Shutdown around 95°C during nanoChat Pretraining (20-30 min)

Your analysis here is solid, and the evidence you posted points to a memory-pressure failure path rather than a thermal-triggered shutdown.

The key signal is the _memdescAllocInternal ... NV_ERR_NO_MEMORY message in the kernel log, combined with high system memory pressure and swap activity. That is not a typical user-space CUDA OOM. It indicates an internal driver allocation failure under heavy memory pressure, followed by an unclean reboot.

On Spark / GB10 systems, GPU and system memory operate within a unified memory model, so GPU allocations can compete with system memory usage (including page cache and kernel allocations) under load. Under sufficient pressure, the driver can hit allocation failures before a normal CUDA OOM is surfaced, which matches the hang → hard reset pattern you’re seeing.

I went through a full bug report showing the same failure pattern in detail here:

The same internal allocation failure also shows up in the llama4 crash thread:

For reference on the platform:

Your diagnostic approach (journalctl -k, sar, vmstat, nvidia-smi dmon) is appropriate. For additional visibility into the lead-up to failure, you can monitor system memory state in real time:

watch -n1 'grep -E "MemFree|MemAvailable|Cached|SwapFree" /proc/meminfo'

If MemAvailable drops while Cached remains elevated and swap activity increases, the system is approaching a state where unified-memory pressure can destabilize driver allocation paths.
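
Since watch only prints to the terminal, its output is gone after a hard reset. If it helps, here is a minimal sketch of a logger that appends the same fields to a file and flushes after each sample, so the last few readings before the crash should still be on disk. The function names, the LOG path, and the INTERVAL default are all placeholders I made up for illustration, not anything from your setup:

```shell
#!/bin/sh
# Hypothetical helper: append timestamped /proc/meminfo fields to a log file.
# LOG and INTERVAL are illustrative defaults; adjust them to your system.
LOG="${LOG:-$HOME/meminfo.log}"
INTERVAL="${INTERVAL:-1}"

snapshot() {
    # One line per sample: timestamp followed by "Field:valuekB" entries.
    printf '%s' "$(date +%Y-%m-%dT%H:%M:%S)"
    # The ^ anchor keeps "Cached:" from also matching "SwapCached:".
    awk '/^(MemFree|MemAvailable|Cached|SwapFree):/ { printf " %s%s%s", $1, $2, $3 }' /proc/meminfo
    printf '\n'
}

log_meminfo() {
    # $1 = number of samples to take; 0 (the default) runs until interrupted.
    count="${1:-0}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        snapshot >> "$LOG"
        sync    # flush to disk so recent samples survive a hard reset
        i=$((i + 1))
        sleep "$INTERVAL"
    done
}
```

To use it, save the block to a file, source it in a second terminal before starting the training run, and call log_meminfo with no arguments. After the reboot, the tail of the log shows how MemAvailable and SwapFree moved in the final minutes. Note that a plain sync flushes all filesystems, so under heavy I/O you may prefer a longer interval.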

Two quick questions that would help narrow this down:

  1. Was this workload stable on your GX10 prior to 580.126.09, or has it always behaved this way?
  2. Were you able to capture /proc/meminfo in the minutes leading up to the crash, or only the sar data after the fact?