Your analysis here is solid, and the evidence you posted points to a memory-pressure failure path rather than a thermal-triggered shutdown.
The key signal is the _memdescAllocInternal ... NV_ERR_NO_MEMORY in the kernel log combined with high system memory pressure and swap activity. That’s not a typical user-space CUDA OOM — it indicates an internal driver allocation failure under heavy memory pressure, followed by an unclean reboot.
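If you want to check earlier unclean reboots for the same signature, a small filter over the previous boot's kernel log can help. This is only a sketch: nv_alloc_failures is a name I made up, it matches nothing beyond the two strings quoted above, and reading the previous boot with -b -1 assumes journald is configured with persistent storage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only kernel-log lines that contain either of the
# two markers quoted in this thread. Reads log text from stdin.
nv_alloc_failures() {
    grep -E '_memdescAllocInternal|NV_ERR_NO_MEMORY'
}

# Example invocation against the previous boot (needs persistent journald):
#   journalctl -k -b -1 --no-pager | nv_alloc_failures
```

No output from the previous boot does not rule the failure out, since the last kernel messages before a hard reset may never reach disk.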
On Spark / GB10 systems, GPU and system memory operate within a unified memory model, so GPU allocations can compete with system memory usage (including page cache and kernel allocations) under load. Under sufficient pressure, the driver can hit allocation failures before a normal CUDA OOM is surfaced, which matches the hang → hard reset pattern you’re seeing.
I went through a full bug report showing the same failure pattern in detail here:
The same internal allocation failure also shows up in the llama4 crash thread:
For reference on the platform:
Your diagnostic approach (journalctl -k, sar, vmstat, nvidia-smi dmon) is appropriate. For additional visibility into the lead-up to failure, you can monitor system memory state in real time:
watch -n1 'grep -E "MemFree|MemAvailable|Cached|SwapFree" /proc/meminfo'
If MemAvailable drops while Cached remains elevated and swap activity increases, the system is approaching a state where unified-memory pressure can destabilize driver allocation paths.
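Since watch output is lost on a hard reset, it may help to append the same fields to a file instead. Here is a minimal sketch, assuming bash, awk and GNU coreutils (date -Is, sync with a file argument). meminfo_snapshot and meminfo_log are hypothetical names, and the log path is whatever suits your system.

```shell
#!/usr/bin/env bash
# Sketch: append timestamped /proc/meminfo snapshots to a log file so the
# lead-up to a hang is still on disk after a hard reset.
# meminfo_snapshot and meminfo_log are hypothetical helper names.

# Print one line: ISO 8601 timestamp followed by selected fields (values in kB).
# $1 = meminfo source file (defaults to /proc/meminfo)
meminfo_snapshot() {
    local src="${1:-/proc/meminfo}"
    awk -v ts="$(date -Is)" '
        /^(MemFree|MemAvailable|Cached|SwapFree):/ {
            sub(":", "", $1)            # strip the trailing colon from the field name
            out = out " " $1 "=" $2
        }
        END { print ts out }
    ' "$src"
}

# Append snapshots to a log file at a fixed interval.
# $1 = log file, $2 = interval in seconds (default 1),
# $3 = number of samples (0 = run until interrupted), $4 = meminfo source
meminfo_log() {
    local log="$1" interval="${2:-1}" count="${3:-0}" src="${4:-/proc/meminfo}"
    local i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        meminfo_snapshot "$src" >> "$log"
        sync "$log"                     # ask the kernel to flush this file to disk
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running something like meminfo_log /var/tmp/meminfo.log 1 in a separate terminal or tmux session before starting the workload would leave one line per second on disk. The sync call makes it more likely that the last samples survive an unclean reboot, though that is not guaranteed.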
Two quick questions that would help narrow this down:
- Was this workload stable on your GX10 prior to 580.126.09, or has it always behaved this way?
- Were you able to capture /proc/meminfo in the minutes leading up to the crash, or only the sar data after the fact?