System crashes when memory is full

yuxizhe2008 · November 24, 2025, 2:50am

Hi NVIDIA Team,

I would like to provide feedback regarding a crash issue on the DGX Spark.

Currently, I am using the system for LLM RL fine-tuning (TRL+GRPO+vLLM). I have noticed that if the process consumes all available memory, the system does not kill the process but instead crashes the entire OS.

ENV:
docker: nvcr.io/nvidia/vllm:25.09-py3

When this happens:

SSH becomes inaccessible.
The HDMI monitor goes black.
Mouse and keyboard lights go out.
I am forced to physically restart the machine to recover.

Could you please look into this OOM handling behavior? It is causing significant disruption to our workflows.

Thanks

raphael.amorim · November 24, 2025, 5:26am

Hey @yuxizhe2008,

For now, you could use docker run --memory="80G" nvcr.io/nvidia/vllm:25.09-py3 with whatever memory limit you expect. If the container’s processes attempt to exceed this limit, the Linux kernel’s OOM killer will terminate the container.

yuxizhe2008 · November 24, 2025, 10:54am

I set it to 110G. It still crashes.

elsaco · November 24, 2025, 4:58pm

Try launching your container with the --oom-score-adj argument and set a high score, so the kernel will nuke your container first instead of the vital services whose loss causes the system to lock up.

docker run --oom-score-adj 1000 might be too aggressive. Try a score of 500 to increase the likelihood that your container will be terminated when the OOM killer kicks in.
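The two Docker suggestions so far (a hard memory cap and a raised OOM score) can be combined in one invocation. The following is only a sketch: `run_capped` is a hypothetical wrapper name, and the `DOCKER` override exists purely so the command can be printed instead of executed. Setting `--memory-swap` to the same value as `--memory` is an addition not mentioned in the thread; per the Docker documentation it prevents the container from using swap.

```shell
# Hypothetical wrapper: run a container with a memory cap, no swap, and a
# raised OOM score so the kernel prefers to kill it first.
# usage: run_capped <memory-limit> <oom-score-adj> <docker run args...>
run_capped() {
  local mem="$1" score="$2"
  shift 2
  # DOCKER can be overridden (e.g. DOCKER=echo) to preview the command line.
  ${DOCKER:-docker} run --rm -it \
    --memory="$mem" \
    --memory-swap="$mem" \
    --oom-score-adj "$score" \
    "$@"
}

# Example (GPU access via --gpus all is an assumption about the setup):
#   run_capped 80g 500 --gpus all nvcr.io/nvidia/vllm:25.09-py3
```

These flags act on memory accounted to the container's cgroup, so they may not constrain every kind of allocation the workload makes.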

aniculescu · November 24, 2025, 5:04pm

Hi yuxizhe2008, after the system crashes can you boot the machine and immediately collect any logs found in /var/crash/* and send them to me?

Also, I noticed you set raphael.amorim’s response as the solution, but you said it still crashes. Do you see any difference in behavior?

raphael.amorim · November 24, 2025, 5:15pm

That automatically stops the container if you cross the limit @yuxizhe2008. Did you clean your mem cache on the spark before executing the container?

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
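To see whether dropping caches actually frees anything, one option is to compare `MemAvailable` before and after. A minimal sketch, assuming a Linux host with `/proc/meminfo`; `mem_available_gib` is a hypothetical helper name, not an NVIDIA or Docker tool.

```shell
# Print MemAvailable in whole GiB (rounded down) from a meminfo-style file.
# The path argument is optional and defaults to /proc/meminfo.
mem_available_gib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# Example:
#   before=$(mem_available_gib)
#   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
#   after=$(mem_available_gib)
#   echo "MemAvailable: ${before} GiB -> ${after} GiB"
```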

eugr · November 24, 2025, 11:04pm

There were a few reports here (and I’ve experienced this as well) that when swap is being used, the system becomes unresponsive.

However, I was in a swap usage situation recently, and it didn’t crash on me. I don’t know whether that’s due to kernel 6.14 or to setting this parameter: sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb" (I have it set on boot).

elsaco · November 25, 2025, 1:25am

@yuxizhe2008 if --memory=80G isn’t working, try starting vLLM with --gpu-memory-utilization 0.8 instead, so that it uses only 80% of GPU memory.

yuxizhe2008 · November 25, 2025, 6:10am

In /var/crash/ there is a new kdump_lock file, but the file is empty.

aniculescu · November 26, 2025, 4:07pm

@yuxizhe2008
To better help us debug your issue, can you share with me some more details? You can reply to this post or send this info to me directly.

Docker run command
Training script/command (TRL+GRPO setup)
Model name and size
Any custom kernel/OOM settings

yuxizhe2008 · November 27, 2025, 6:08am

I got a crash log

-rw-r----- 1 root root 2.0G Nov 27 11:54 _usr_bin_python3.12.0.crash

ProblemType: Crash
Architecture: arm64
Date: Thu Nov 27 11:52:43 2025
Dependencies:
apt 2.8.3
apt-utils 2.8.3
base-passwd 3.6.3build1
ca-certificates 20240203
debconf 1.5.86ubuntu1
debconf-i18n 1.5.86ubuntu1
dpkg 1.22.6ubuntu6.5
gcc-14-base 14.2.0-4ubuntu2~24.04
gpgv 2.4.4-2ubuntu17.3
libacl1 2.3.2-1build1.1
libapt-pkg6.0t64 2.8.3
libassuan0 2.5.6-1build1
libbz2-1.0 1.0.8-5.1build0.1
libc6 2.39-0ubuntu8.6
libcap2 1:2.66-5ubuntu2.2
libcrypt1 1:4.4.36-4build1
libdb5.3t64 5.3.28+dfsg2-7
libdebconfclient0 0.271ubuntu3
libexpat1 2.6.1-2ubuntu0.3
libffi8 3.4.6-1build1
libgcc-s1 14.2.0-4ubuntu2~24.04
libgcrypt20 1.10.3-2build1
libgmp10 2:6.3.0+dfsg-2ubuntu6.1
libgnutls30t64 3.8.3-1.1ubuntu3.4
libgpg-error-l10n 1.47-3build2.1
libgpg-error0 1.47-3build2.1
libgpm2 1.20.7-11
libhogweed6t64 3.9.1-2.2build1.1
libidn2-0 2.3.7-2build1.1
liblocale-gettext-perl 1.07-6ubuntu5
liblz4-1 1.9.4-1build1.1
liblzma5 5.6.1+really5.4.5-1ubuntu0.2
libmd0 1.1.0-2build1.1
libncursesw6 6.4+20240113-1ubuntu2
libnettle8t64 3.9.1-2.2build1.1
libnpth0t64 1.6-3.1build1
libp11-kit0 0.25.3-4ubuntu2.1
libpcre2-8-0 10.42-4ubuntu2.1
libpython3.12-minimal 3.12.3-1ubuntu0.8
libpython3.12-stdlib 3.12.3-1ubuntu0.8
libreadline8t64 8.2-4build1
libseccomp2 2.5.5-1ubuntu3.1
libselinux1 3.5-2ubuntu2.1
libsqlite3-0 3.45.1-1ubuntu2.5
libssl3t64 3.0.13-0ubuntu3.6
libstdc++6 14.2.0-4ubuntu2~24.04
libsystemd0 255.4-1ubuntu8.11
libtasn1-6 4.19.0-3ubuntu0.24.04.1
libtext-charwidth-perl 0.04-11build3
libtext-iconv-perl 1.7-8build3
libtext-wrapi18n-perl 0.06-10
libtinfo6 6.4+20240113-1ubuntu2
libudev1 255.4-1ubuntu8.11
libunistring5 1.1-2build1.1
libxxhash0 0.8.2-2build1
libzstd1 1.5.5+dfsg2-2build1.1
media-types 10.1.0
netbase 6.4
openssl 3.0.13-0ubuntu3.6
perl-base 5.38.2-3.2ubuntu0.2
python3.12 3.12.3-1ubuntu0.8
python3.12-minimal 3.12.3-1ubuntu0.8
readline-common 8.2-4build1
tar 1.35+dfsg-3build1
tzdata 2025b-0ubuntu0.24.04.1
ubuntu-keyring 2023.11.28.1
zlib1g 1:1.3.dfsg-3.1ubuntu2.1
DistroRelease: Ubuntu 24.04
ExecutablePath: /usr/bin/python3.12
ExecutableTimestamp: 1755193641
Package: python3.12-minimal 3.12.3-1ubuntu0.8
PackageArchitecture: arm64
ProcAttrCurrent: docker-default (enforce)
ProcCmdline: /usr/bin/python qerl.py --model-name /home/QeRL/llm-compressor/qmodel/Qwen2.5-7B-Instruct-NVFP4A16-GPTQ --output-dir ./ckpt/qwen2.5-7B-single-gpus-quantized_1e-5_0.2_xueqiu_b4_g8_r32_True --use-vllm True --learning-rate 1e-5 --adam-beta1 0.9 --adam-beta2 0.99 --weight-decay 0.1 --warmup-ratio 0.1 --lr-scheduler-type cosine --optim adamw_8bit --logging-steps 1 --per-device-train-batch-size 4 --gradient-accumulation-steps 8 --num-generations 8 --max-prompt-length 15360 --max-completion-length 4096 --num-train-epochs 1 --save-steps 1 --save-strategy steps --save-total-limit 20 --max-grad-norm 0.2 --max-seq-length 20480 --lora-rank 32 --lora-alpha 64 --fast-inference True --vllm-gpu-memory-utilization 0.5 --random-state 2025 --loss-type grpo --beta 0.04 --epsilon-high 0.2 --num-iterations 5 --mask-truncated-completions True --run-name qwen2.5-7B-single-gpus-quantized_1e-5_0.2_b4_g8_r32_True --ln True --sigma-start 1e-2 --sigma-end 1e-4 --num-stages 10
ProcCwd: /home/QeRL
ProcEnviron:
LD_LIBRARY_PATH=
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm

ProcStatus:
Name: python
Umask: 0022
State: S (sleeping)
Tgid: 37040
Ngid: 0
Pid: 37040
PPid: 7259
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 128
Groups: 0
NStgid: 37040 2125
NSpid: 37040 2125
NSpgid: 36968 2069
NSsid: 7259 1
Kthread: 0
VmPeak: 262619340 kB
VmSize: 259303816 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 15546792 kB
VmRSS: 15504060 kB
RssAnon: 2750676 kB
RssFile: 612684 kB
RssShmem: 12140700 kB
VmData: 4741400 kB
VmStk: 208 kB
VmExe: 6196 kB
VmLib: 4752148 kB
VmPTE: 32744 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 1
THP_enabled: 1
untag_mask: 0xffffffffffffff
Threads: 25
SigQ: 1/511872
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001001000
SigCgt: 0000000100000000
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 2
Seccomp_filters: 1
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: unknown
Cpus_allowed: fffff
Cpus_allowed_list: 0-19
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 292086
nonvoluntary_ctxt_switches: 1346
Signal: 11
SignalName: SIGSEGV
SourcePackage: python3.12
Uname: Linux 6.14.0-1013-nvidia aarch64
UserGroups: N/A
_HooksRun: no
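The memory fields in the ProcStatus section above are in kB, which is hard to read at this scale. A small sketch for pulling them out of a crash report in GiB; `crash_mem_summary` is a hypothetical helper, and it assumes the fields appear as `Name: <value> kB` as in the paste above. The real file is root-readable only and about 2 GB, so expect to need sudo and some patience.

```shell
# Print selected ProcStatus memory fields from a crash report, converted to GiB.
# usage: crash_mem_summary <path-to-crash-file>
crash_mem_summary() {
  awk '$1 ~ /^(VmPeak|VmRSS|RssAnon|RssFile|RssShmem|VmSwap):$/ && $3 == "kB" {
         # 1 GiB = 1048576 kB
         printf "%-10s %8.2f GiB\n", $1, $2 / 1048576
       }' "$1"
}

# Example:
#   sudo cat /var/crash/_usr_bin_python3.12.0.crash > /tmp/report.crash
#   crash_mem_summary /tmp/report.crash
```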

RazielAU · November 29, 2025, 4:25pm

I’ve had this issue plenty of times as well. If a process suddenly eats all the VRAM and goes into swap, the DGX Spark usually locks up. The easiest way to reproduce this is to kick off a training process with settings that immediately push the Spark out of memory. My solution has been to disable the swap file; then the process just crashes, but at least it leaves the system in a running state. I’ve also had it reboot once, but that seems less common. Disabling the swap file seems to be the way to go from what I’ve seen.
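For anyone wanting to try the swap workaround described here, a rough sketch follows. `swap_enabled` is a hypothetical helper that only reads `SwapTotal`; the `swapoff` and `swapon` commands are standard util-linux tools, shown as comments because they need root.

```shell
# Succeed (exit 0) if any swap is configured, based on SwapTotal in a
# meminfo-style file. The path argument defaults to /proc/meminfo.
swap_enabled() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { found = ($2 > 0) } END { exit (found ? 0 : 1) }' "$meminfo"
}

# Turn off all swap until the next reboot (requires root):
#   sudo swapoff -a
# List active swap devices/files (prints nothing when swap is off):
#   swapon --show
# Confirm with the helper:
#   if swap_enabled; then echo "swap still active"; else echo "swap is off"; fi
```

This lasts only until reboot. If the swap file is listed in /etc/fstab, that entry would presumably also need to be commented out to make the change persistent.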

nathan.hodas · December 4, 2025, 6:00pm

I suffer from that exact problem as well. I used the same container. Just ask for too much RAM by setting the batch size too large, and the entire system freezes and you need to power cycle it.

RazielAU · December 5, 2025, 9:20am

As mentioned above, you can disable your swap file since as soon as it goes into swap it locks up anyway, so the way I see it, the swap file is almost completely useless on the Spark since it doesn’t work anyway. By disabling it, the process will just crap out and exit, but leaves you with a running system, which is better than the alternative, where you have to physically power cycle it.

eugr · December 5, 2025, 5:48pm

I have a suspicion that this weird swap behavior is directly related to poor mmap performance on Spark. Hope it gets fixed in the next kernel release.

RazielAU · December 6, 2025, 2:43am

It would be good if they can fix it, the swap file seems to do more harm than good at the moment.

aniculescu · December 8, 2025, 5:44pm

Hi @yuxizhe2008, we are still trying to reproduce this issue. Can you run the container again with a lower memory limit, under 100G this time, and see if you still get the same result?
docker run --memory="90G" nvcr.io/nvidia/vllm:25.09-py3

yuxizhe2008 · December 9, 2025, 2:14am

Yes, I have tried this parameter, but even when set to 30G, it still crashes. I think this parameter can only limit the memory (RAM), not the GPU’s VRAM. In my case, the issue is VRAM overflow. It seems to be related to the shared memory.

elsaco · December 9, 2025, 6:13am

@yuxizhe2008 how do you start your container? If you suspect shared memory is the culprit, try increasing it with the --shm-size argument, i.e. docker run --shm-size=30G

RazielAU · December 9, 2025, 8:59am

Docker aside, I hope you don’t mean you’re still trying to reproduce the crash itself as I’d be hoping NVIDIA is already looking into how to resolve this issue as it’s a big problem with it locking up the entire system, requiring you to physically power cycle the device.

The issue is super easy to run into; it has left me with a Spark that is completely non-responsive many times. Just start any kind of training job or anything else that consumes all free GPU memory. As an example, I recently reenabled my swap file to help with stability when running LLMs that consume almost the entire memory of the Spark, and I accidentally started a second llama.cpp instance while GPT-OSS:120b was already running. The system almost instantly locked up on me.

I think what’s happening is that any memory not consumed by the CPU can be allocated by the GPU, the problem is, it can allocate 100% of the available system RAM, leaving nothing for the rest of the system to work with. The only way to avoid this seems to be to completely disable swap, but that leaves you with a different problem, namely, if you’re running things that consume almost all the RAM on the spark, those processes can be unstable and fall over without warning. On the other hand, the swap file is practically useless, as the entire system locks up the second you accidentally allocate all available RAM to the GPU.

I think the system needs to reserve a small amount of RAM which the GPU can’t allocate, so if you have 115GB of free memory, maybe you shouldn’t be able to allocate more than 114GB, so there’s always that gigabyte available for the system to work with even if you’ve attempted to allocate it all to the GPU. That way the system can at least swap data into RAM and still function. From what I can tell there’s no safeguards on the spark, you can just allocate as much as you want and easily lock up the system in the process.
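Until something like this exists at the system level, the headroom idea can be roughly approximated in user space with a polling watchdog. The sketch below is not an NVIDIA mechanism: `mem_watchdog` and `mem_available_kb` are hypothetical names, the 1 GiB default floor mirrors the figure suggested above, and it assumes that `MemAvailable` in `/proc/meminfo` reflects GPU allocations on the Spark's unified memory. A sudden allocation can still outrun the polling interval.

```shell
# Read MemAvailable (in kB) from a meminfo-style file.
mem_available_kb() {
  awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Kill <pid> if MemAvailable drops below <min_kb>; return 1 if it was killed,
# 0 if the process went away on its own.
# usage: mem_watchdog <pid> [min_kb] [interval_s] [meminfo_path]
mem_watchdog() {
  local pid="$1" min_kb="${2:-1048576}" interval="${3:-1}"
  local meminfo="${4:-/proc/meminfo}" avail
  # kill -0 only checks that the PID exists and that we may signal it.
  while kill -0 "$pid" 2>/dev/null; do
    avail="$(mem_available_kb "$meminfo")"
    if [ -n "$avail" ] && [ "$avail" -lt "$min_kb" ]; then
      echo "MemAvailable ${avail} kB < ${min_kb} kB, killing PID ${pid}" >&2
      kill -KILL "$pid"
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example: guard a training job, keeping roughly 4 GiB free.
#   python qerl.py ... &
#   mem_watchdog $! 4194304 1
```

The watchdog needs permission to signal the target, so a process started as root inside a container would have to be watched from a root shell.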
