I would like to provide feedback regarding a crash issue on the DGX Spark.
Currently, I am using the system for LLM RL fine-tuning (TRL+GRPO+vLLM). I have noticed that if the process consumes all available memory, the system does not kill the process but instead crashes the entire OS.
For now you could use docker run --memory="80G" nvcr.io/nvidia/vllm:25.09-py3 with whatever memory limit you expect. If the container’s processes attempt to exceed this limit, the Linux kernel’s OOM killer will terminate the container.
Try launching your container with the --oom-score-adj argument and set a high score so the kernel will nuke your container first instead of vital services, which is what causes the system to lock up.
docker run --oom-score-adj 1000 might be too aggressive. Try a score of 500 to increase the likelihood that your container will be terminated when the OOM killer kicks in.
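The two suggestions above can be combined in one invocation. The sketch below is only an illustration: spark_docker_flags is a hypothetical helper name, and the --memory-swap flag is an addition not mentioned in the thread (setting it equal to --memory gives the container no swap on top of its RAM limit). Whether a cgroup limit like this also covers GPU allocations on the Spark's unified memory is not established here.

```shell
# Hypothetical helper (not from the thread): print docker resource flags.
#   $1 = memory limit in GiB
#   $2 = OOM score adjustment, -1000..1000 (defaults to 500 as suggested above)
spark_docker_flags() {
    mem_gib="$1"
    oom_adj="${2:-500}"
    # --memory-swap equal to --memory: no extra swap for the container (assumption, see lead-in)
    printf '%s\n' "--memory=${mem_gib}g --memory-swap=${mem_gib}g --oom-score-adj=${oom_adj}"
}

# Example, left commented out so that sourcing this file starts nothing.
# The substitution is intentionally unquoted so the flags split into separate words:
# docker run --rm -it --gpus all $(spark_docker_flags 80 500) nvcr.io/nvidia/vllm:25.09-py3
```

After a container exits, docker inspect --format '{{.State.OOMKilled}}' <container> reports whether the kernel OOM killer ended it, which helps tell a cgroup kill apart from a crash inside the job.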
There were a few reports here (and I’ve experienced this as well) that when swap is being used, the system becomes unresponsive.
However, I was in a swap usage situation recently, and it didn’t crash on me. I don’t know whether that’s down to kernel 6.14 or to setting this parameter: sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb" (I have it set on boot).
I’ve had this issue plenty of times as well. If a process suddenly eats all the VRAM and spills into swap, the DGX Spark usually locks up. The easiest way to trigger this is to kick off a training process with settings that immediately push the Spark out of memory. My solution has been to disable the swap file. Then the process just crashes, but at least it leaves the system in a running state. I’ve also had it reboot once, but that seems less common. Disabling the swap file seems to be the way to go from what I’ve seen.
I suffer from that exact problem as well, using the same container. Just ask for too much RAM by setting the batch size too large, and the entire system freezes and you need to power cycle it.
As mentioned above, you can disable your swap file. The system locks up as soon as it goes into swap, so the way I see it, the swap file is almost completely useless on the Spark. With swap disabled, the process will just crap out and exit but leave you with a running system, which is better than the alternative, where you have to physically power cycle it.
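For anyone wanting to try the swap-off approach, here is a minimal sketch. It assumes swap is configured through /etc/fstab, which nobody in this thread has confirmed for the Spark, and fstab_without_swap is a hypothetical helper name. The helper only prints a modified copy; it does not touch the real file.

```shell
# Hypothetical helper: print an fstab with active swap entries commented out.
#   $1 = path to an fstab-style file (defaults to /etc/fstab); the file is not modified.
fstab_without_swap() {
    # Field 3 of an fstab entry is the filesystem type; skip lines already commented out.
    awk '$1 !~ /^#/ && $3 == "swap" { print "# " $0; next } { print }' "${1:-/etc/fstab}"
}

# Turn swap off for the current boot and confirm (commented out; needs root):
# sudo swapoff -a
# swapon --show        # prints nothing when no swap is active
#
# Review what would change before editing /etc/fstab by hand:
# fstab_without_swap /etc/fstab | diff /etc/fstab -
```

swapoff -a only lasts until the next reboot, so the fstab entry (or whatever mechanism creates the swap file on your install) still needs to be dealt with to make the change stick.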
Hi @yuxizhe2008, we are still trying to reproduce this issue. Can you run the container again with a lower memory limit, under 100G this time, and see if you still get the same result? docker run --memory="90G" nvcr.io/nvidia/vllm:25.09-py3
Yes, I have tried this parameter, but even when set to 30G, it still crashes. I think this parameter can only limit the memory (RAM), not the GPU’s VRAM. In my case, the issue is VRAM overflow. It seems to be related to the shared memory.
@yuxizhe2008 how do you start your container? If you suspect shared memory is the culprit, try increasing it with the --shm-size argument, e.g. docker run --shm-size=30G
Docker aside, I hope you don’t mean you’re still trying to reproduce the crash itself. I’d hope NVIDIA is already looking into how to resolve this, as it’s a big problem: it locks up the entire system and requires you to physically power cycle the device.
The issue is super easy to run into, and it has left me with a completely non-responsive Spark many times. Just start any kind of training job or anything else that consumes all free GPU memory. As an example, I recently re-enabled my swap file to help with stability when running LLMs that consume almost the entire memory of the Spark, and I accidentally started a second llama.cpp instance while GPT-OSS:120b was already running. The system almost instantly locked up on me.
I think what’s happening is that any memory not consumed by the CPU can be allocated by the GPU. The problem is that it can allocate 100% of the available system RAM, leaving nothing for the rest of the system to work with. The only way to avoid this seems to be to completely disable swap, but that leaves you with a different problem: if you’re running things that consume almost all the RAM on the Spark, those processes can be unstable and fall over without warning. On the other hand, the swap file is practically useless, as the entire system locks up the second you accidentally allocate all available RAM to the GPU.
I think the system needs to reserve a small amount of RAM which the GPU can’t allocate. If you have 115GB of free memory, maybe you shouldn’t be able to allocate more than 114GB, so there’s always that gigabyte available for the system to work with even if you’ve attempted to allocate it all to the GPU. That way the system can at least swap data into RAM and still function. From what I can tell there are no safeguards on the Spark: you can just allocate as much as you want and easily lock up the system in the process.
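Until something like that exists at the system level, the same arithmetic can be done by hand before launching a job. The sketch below is a user-side workaround under assumptions, not a fix: gib_cap_with_reserve is a hypothetical helper name, it reads MemAvailable from /proc/meminfo (reported in kB), and it assumes that figure is a reasonable stand-in for what the GPU could grab on the Spark.

```shell
# Hypothetical helper: largest whole-GiB allocation that still leaves a reserve.
#   $1 = GiB to keep free for the OS (default 1)
#   $2 = meminfo-style file (default /proc/meminfo)
gib_cap_with_reserve() {
    awk -v reserve="${1:-1}" '
        /^MemAvailable:/ {
            cap = int($2 / 1048576) - reserve   # $2 is in kB; 1048576 kB = 1 GiB
            if (cap < 0) cap = 0                # never report a negative cap
            print cap
        }' "${2:-/proc/meminfo}"
}

# Example (commented out): with 115 GiB available and a 1 GiB reserve this prints 114.
# gib_cap_with_reserve 1
```

How the cap is then enforced depends on the framework. vLLM, for instance, takes a fraction through --gpu-memory-utilization rather than an absolute size, so the GiB value would need converting first.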