My DGX system shuts itself down while running my LLM fine-tuning project. RAM usage reaches 100 percent, and GPU usage reaches 100 percent as well.

There’s an existing forum thread where people report the DGX Spark crashing or hard-locking when memory is exhausted during LLM fine-tuning. Instead of the offending process being killed with an OOM error, the whole OS dies: SSH drops, HDMI goes black, the keyboard and mouse lights go off, and they have to hard power-cycle the machine. The thread is:

DGX Spark becomes unresponsive (“zombie”) instead of throwing CUDA OOM

Disabling swap often makes things better: the training process crashes, but the system stays alive. Users explicitly report that Spark behaves badly once it starts using swap.

sudo swapoff -a
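
To confirm that swap is really off after running the command, something like the following sketch could help. It assumes a Linux system with /proc/meminfo and awk available; the function names swap_total_kb and swap_status are made up for this example and are not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names invented for this sketch).

# Print the SwapTotal value in kB from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument is only there
# so the function can be tried against a sample file.
swap_total_kb() {
    awk '/^SwapTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Report whether any swap is currently configured.
# SwapTotal drops to 0 kB once all swap devices are disabled.
swap_status() {
    if [ "$(swap_total_kb "$@")" = "0" ]; then
        echo "swap is off"
    else
        echo "swap is on"
    fi
}
```

Calling swap_status after sudo swapoff -a should print "swap is off". Keep in mind that swapoff -a only lasts until the next reboot: any swap configured in /etc/fstab comes back on boot, so the command has to be rerun (or the swap entry removed) to keep it disabled.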

The insufficient power alert is nothing to worry about. There’s an advisory explaining that messages like “insufficient power on PCIe slot (27W)” can appear in the logs even though the ConnectX-7 adapter is actually OK, simply because of how the motherboard’s PCIe power limits are encoded.