System crashes when memory is full

Hi NVIDIA Team,

I would like to provide feedback regarding a crash issue on the DGX Spark.

Currently, I am using the system for LLM RL fine-tuning (TRL+GRPO+vLLM). I have noticed that if the process consumes all available memory, the system does not kill the process but instead crashes the entire OS.

ENV:
docker: nvcr.io/nvidia/vllm:25.09-py3

When this happens:

  1. SSH becomes inaccessible.

  2. The HDMI monitor goes black.

  3. Mouse and keyboard lights go out.

  4. I am forced to physically restart the machine to recover.

Could you please look into this OOM handling behavior? It is causing significant disruption to our workflows.

Thanks

Hey @yuxizhe2008,

For now, you could use docker run --memory="80G" nvcr.io/nvidia/vllm:25.09-py3 with whatever memory limit you expect. If the container’s processes attempt to exceed this limit, the Linux kernel’s OOM killer will terminate the container.

I set it to 110G. It still crashes.

Try launching your container with the --oom-score-adj argument and set a high score, so the kernel will nuke your container first instead of the vital services whose loss causes the system to lock up.

docker run --oom-score-adj 1000 might be too aggressive. Try a score of 500 to increase the likelihood that your container will be terminated when the OOM killer kicks in.
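
The two suggestions above (a memory limit plus an OOM score adjustment) can be combined into one command. Below is a minimal sketch that only assembles the argument list; the helper name build_docker_run and its defaults are hypothetical, and the flags are the ones already mentioned in this thread. It makes no claim that the resulting command avoids the lockup, since later replies report crashes even with a limit set.

```python
import shlex


def build_docker_run(image, memory="80G", oom_score_adj=500, extra_args=()):
    """Return a `docker run` argument list with a memory cap and an OOM score.

    Hypothetical helper for illustration; it does not start a container.
    """
    # The kernel accepts OOM score adjustments from -1000 to 1000.
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    return [
        "docker", "run",
        "--memory", memory,                     # cgroup memory limit for the container
        "--oom-score-adj", str(oom_score_adj),  # higher = killed earlier by the OOM killer
        *extra_args,
        image,
    ]


cmd = build_docker_run("nvcr.io/nvidia/vllm:25.09-py3")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to subprocess.run without shell quoting issues.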

Hi yuxizhe2008, after the system crashes can you boot the machine and immediately collect any logs found in /var/crash/* and send them to me?

Also, I noticed you set raphael.amorim’s response as the solution, but you said it still crashes. Do you see any difference in behavior?

That automatically stops the container if you cross the limit, @yuxizhe2008. Did you clear your memory cache on the Spark before starting the container?

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

There were a few reports here (and I’ve experienced this as well) that the system becomes unresponsive when swap is being used.

However, I was in a swap usage situation recently, and it didn’t crash on me. I don’t know whether that is due to kernel 6.14 or to setting this parameter: sudo bash -c "echo 8192 > /sys/block/nvme0n1/queue/read_ahead_kb" (I have it set on boot).

@yuxizhe2008, if --memory="80G" isn’t working, start with --gpu-memory-utilization 0.8 (a vLLM option) instead, so that only 80% of GPU memory is used.

In /var/crash/ there is a new kdump_lock file, but the file is empty.

@yuxizhe2008
To better help us debug your issue, can you share with me some more details? You can reply to this post or send this info to me directly.

  1. Docker run command
  2. Training script/command (TRL+GRPO setup)
  3. Model name and size
  4. Any custom kernel/OOM settings

I got a crash log

-rw-r----- 1 root root 2.0G Nov 27 11:54 _usr_bin_python3.12.0.crash

ProblemType: Crash
Architecture: arm64
Date: Thu Nov 27 11:52:43 2025
Dependencies:
apt 2.8.3
apt-utils 2.8.3
base-passwd 3.6.3build1
ca-certificates 20240203
debconf 1.5.86ubuntu1
debconf-i18n 1.5.86ubuntu1
dpkg 1.22.6ubuntu6.5
gcc-14-base 14.2.0-4ubuntu2~24.04
gpgv 2.4.4-2ubuntu17.3
libacl1 2.3.2-1build1.1
libapt-pkg6.0t64 2.8.3
libassuan0 2.5.6-1build1
libbz2-1.0 1.0.8-5.1build0.1
libc6 2.39-0ubuntu8.6
libcap2 1:2.66-5ubuntu2.2
libcrypt1 1:4.4.36-4build1
libdb5.3t64 5.3.28+dfsg2-7
libdebconfclient0 0.271ubuntu3
libexpat1 2.6.1-2ubuntu0.3
libffi8 3.4.6-1build1
libgcc-s1 14.2.0-4ubuntu2~24.04
libgcrypt20 1.10.3-2build1
libgmp10 2:6.3.0+dfsg-2ubuntu6.1
libgnutls30t64 3.8.3-1.1ubuntu3.4
libgpg-error-l10n 1.47-3build2.1
libgpg-error0 1.47-3build2.1
libgpm2 1.20.7-11
libhogweed6t64 3.9.1-2.2build1.1
libidn2-0 2.3.7-2build1.1
liblocale-gettext-perl 1.07-6ubuntu5
liblz4-1 1.9.4-1build1.1
liblzma5 5.6.1+really5.4.5-1ubuntu0.2
libmd0 1.1.0-2build1.1
libncursesw6 6.4+20240113-1ubuntu2
libnettle8t64 3.9.1-2.2build1.1
libnpth0t64 1.6-3.1build1
libp11-kit0 0.25.3-4ubuntu2.1
libpcre2-8-0 10.42-4ubuntu2.1
libpython3.12-minimal 3.12.3-1ubuntu0.8
libpython3.12-stdlib 3.12.3-1ubuntu0.8
libreadline8t64 8.2-4build1
libseccomp2 2.5.5-1ubuntu3.1
libselinux1 3.5-2ubuntu2.1
libsqlite3-0 3.45.1-1ubuntu2.5
libssl3t64 3.0.13-0ubuntu3.6
libstdc++6 14.2.0-4ubuntu2~24.04
libsystemd0 255.4-1ubuntu8.11
libtasn1-6 4.19.0-3ubuntu0.24.04.1
libtext-charwidth-perl 0.04-11build3
libtext-iconv-perl 1.7-8build3
libtext-wrapi18n-perl 0.06-10
libtinfo6 6.4+20240113-1ubuntu2
libudev1 255.4-1ubuntu8.11
libunistring5 1.1-2build1.1
libxxhash0 0.8.2-2build1
libzstd1 1.5.5+dfsg2-2build1.1
media-types 10.1.0
netbase 6.4
openssl 3.0.13-0ubuntu3.6
perl-base 5.38.2-3.2ubuntu0.2
python3.12 3.12.3-1ubuntu0.8
python3.12-minimal 3.12.3-1ubuntu0.8
readline-common 8.2-4build1
tar 1.35+dfsg-3build1
tzdata 2025b-0ubuntu0.24.04.1
ubuntu-keyring 2023.11.28.1
zlib1g 1:1.3.dfsg-3.1ubuntu2.1
DistroRelease: Ubuntu 24.04
ExecutablePath: /usr/bin/python3.12
ExecutableTimestamp: 1755193641
Package: python3.12-minimal 3.12.3-1ubuntu0.8
PackageArchitecture: arm64
ProcAttrCurrent: docker-default (enforce)
ProcCmdline: /usr/bin/python qerl.py --model-name /home/QeRL/llm-compressor/qmodel/Qwen2.5-7B-Instruct-NVFP4A16-GPTQ --output-dir ./ckpt/qwen2.5-7B-single-gpus-quantized_1e-5_0.2_xueqiu_b4_g8_r32_True --use-vllm True --learning-rate 1e-5 --adam-beta1 0.9 --adam-beta2 0.99 --weight-decay 0.1 --warmup-ratio 0.1 --lr-scheduler-type cosine --optim adamw_8bit --logging-steps 1 --per-device-train-batch-size 4 --gradient-accumulation-steps 8 --num-generations 8 --max-prompt-length 15360 --max-completion-length 4096 --num-train-epochs 1 --save-steps 1 --save-strategy steps --save-total-limit 20 --max-grad-norm 0.2 --max-seq-length 20480 --lora-rank 32 --lora-alpha 64 --fast-inference True --vllm-gpu-memory-utilization 0.5 --random-state 2025 --loss-type grpo --beta 0.04 --epsilon-high 0.2 --num-iterations 5 --mask-truncated-completions True --run-name qwen2.5-7B-single-gpus-quantized_1e-5_0.2_b4_g8_r32_True --ln True --sigma-start 1e-2 --sigma-end 1e-4 --num-stages 10
ProcCwd: /home/QeRL
ProcEnviron:
LD_LIBRARY_PATH=
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm

ProcStatus:
Name: python
Umask: 0022
State: S (sleeping)
Tgid: 37040
Ngid: 0
Pid: 37040
PPid: 7259
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 128
Groups: 0
NStgid: 37040 2125
NSpid: 37040 2125
NSpgid: 36968 2069
NSsid: 7259 1
Kthread: 0
VmPeak: 262619340 kB
VmSize: 259303816 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 15546792 kB
VmRSS: 15504060 kB
RssAnon: 2750676 kB
RssFile: 612684 kB
RssShmem: 12140700 kB
VmData: 4741400 kB
VmStk: 208 kB
VmExe: 6196 kB
VmLib: 4752148 kB
VmPTE: 32744 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 1
THP_enabled: 1
untag_mask: 0xffffffffffffff
Threads: 25
SigQ: 1/511872
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001001000
SigCgt: 0000000100000000
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 2
Seccomp_filters: 1
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: unknown
Cpus_allowed: fffff
Cpus_allowed_list: 0-19
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 292086
nonvoluntary_ctxt_switches: 1346
Signal: 11
SignalName: SIGSEGV
SourcePackage: python3.12
Uname: Linux 6.14.0-1013-nvidia aarch64
UserGroups: N/A
_HooksRun: no
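
For anyone else collecting these reports, the memory fields are easier to read once converted from kB. The following is a rough sketch, assuming the report is plain "Key: value" text as pasted above; parse_report and kb_to_gib are hypothetical helper names, and the sample values are copied from this report.

```python
def parse_report(text):
    """Collect 'Key: value' lines into a dict.

    Lines without a colon, or whose key contains a space (for example
    dependency entries such as 'libcap2 1:2.66-5ubuntu2.2'), are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and " " not in key:
            fields[key] = value.strip()
    return fields


def kb_to_gib(value):
    """Convert a /proc status value such as '15504060 kB' to GiB."""
    number, unit = value.split()
    if unit != "kB":
        raise ValueError(f"unexpected unit: {unit}")
    return int(number) / (1024 ** 2)  # these 'kB' values are units of 1024 bytes


# Values copied from the crash report above.
sample = """\
SignalName: SIGSEGV
VmRSS: 15504060 kB
RssShmem: 12140700 kB
VmSwap: 0 kB
"""

report = parse_report(sample)
print("Signal:", report["SignalName"])
for key in ("VmRSS", "RssShmem", "VmSwap"):
    print(f"{key}: {kb_to_gib(report[key]):.2f} GiB")
```

To run it on a real report, pass the decoded text of the .crash file to parse_report instead of the sample string.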

I’ve had this issue plenty of times as well. If a process suddenly eats all the VRAM and goes into swap, the DGX Spark usually locks up. The easiest way to trigger this is to kick off a training process with settings that immediately push the Spark out of memory. My solution has been to disable the swap file; then the process just crashes, but at least it leaves the system in a running state. I’ve also had it reboot once, but that seems less common. Disabling the swap file seems to be the way to go from what I’ve seen.

I suffer from that exact problem as well. I used the same container. Just ask for too much RAM by setting the batch size too large, and the entire system freezes and you need to power cycle it.

As mentioned above, you can disable your swap file. As soon as the system goes into swap it locks up anyway, so the way I see it, the swap file is almost completely useless on the Spark. With swap disabled, the process will just crap out and exit, but it leaves you with a running system, which is better than the alternative of having to physically power cycle it.
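
The thread does not say which command was used to disable swap; sudo swapoff -a is the usual way to turn it off until the next reboot. Before starting a large job, it may help to confirm whether swap is actually off. Here is a small sketch that reads the SwapTotal and SwapFree fields in the /proc/meminfo format; the function names are hypothetical and the sample text is made up for illustration.

```python
def swap_kib(meminfo_text):
    """Return (SwapTotal, SwapFree) in KiB from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if key in ("SwapTotal", "SwapFree") and parts:
            values[key] = int(parts[0])
    return values.get("SwapTotal", 0), values.get("SwapFree", 0)


def swap_status(meminfo_text):
    """Summarise whether swap is configured and how much of it is in use."""
    total, free = swap_kib(meminfo_text)
    if total == 0:
        return "swap disabled"
    return f"swap enabled, {total - free} KiB in use"


# Illustrative sample; on a real system pass open("/proc/meminfo").read().
sample = """\
MemTotal:       125000000 kB
SwapTotal:       16777212 kB
SwapFree:        16000000 kB
"""
print(swap_status(sample))
```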

I have a suspicion that this weird swap behavior is directly related to poor mmap performance on Spark. Hope it gets fixed in the next kernel release.

It would be good if they could fix it; the swap file seems to do more harm than good at the moment.

Hi @yuxizhe2008, we are still trying to reproduce this issue. Can you run the container again with a lower memory limit, under 100G this time, and see if you still get the same result?
docker run --memory="90G" nvcr.io/nvidia/vllm:25.09-py3

Yes, I have tried this parameter, but even when set to 30G, it still crashes. I think this parameter can only limit the memory (RAM), not the GPU’s VRAM. In my case, the issue is VRAM overflow. It seems to be related to the shared memory.

@yuxizhe2008, how do you start your container? If you suspect shared memory is the culprit, try increasing it with the --shm-size argument, e.g. docker run --shm-size=30G

Docker aside, I hope you don’t mean you’re still trying to reproduce the crash itself. I’d hope NVIDIA is already looking into how to resolve this, as it’s a big problem: it locks up the entire system and requires you to physically power cycle the device.

The issue is super easy to run into; it has left me with a completely non-responsive Spark many times. Just start any kind of training job or anything else that consumes all free GPU memory. As an example, I recently re-enabled my swap file to help with stability when running LLMs that consume almost the entire memory of the Spark, and I accidentally started a second llama.cpp instance while GPT-OSS:120b was already running. The system almost instantly locked up on me.

I think what’s happening is that any memory not consumed by the CPU can be allocated by the GPU. The problem is that the GPU can allocate 100% of the available system RAM, leaving nothing for the rest of the system to work with. The only way to avoid this seems to be to disable swap completely, but that leaves you with a different problem: if you’re running things that consume almost all the RAM on the Spark, those processes can be unstable and fall over without warning. On the other hand, the swap file is practically useless, as the entire system locks up the second you accidentally allocate all available RAM to the GPU.

I think the system needs to reserve a small amount of RAM that the GPU can’t allocate. If you have 115GB of free memory, maybe you shouldn’t be able to allocate more than 114GB, so there’s always a gigabyte available for the system to work with even if you’ve attempted to allocate everything to the GPU. That way the system can at least swap data into RAM and still function. From what I can tell, there are no safeguards on the Spark: you can allocate as much as you want and easily lock up the system in the process.
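
The reserve idea above comes down to simple arithmetic. As a minimal sketch, a job launcher could do the same calculation in user space before starting; the function name gpu_budget_bytes and the 1 GiB default reserve are assumptions taken from the example in the post, and this is not the driver-level safeguard being asked for.

```python
GIB = 1024 ** 3


def gpu_budget_bytes(available_bytes, reserve_bytes=1 * GIB):
    """Largest allocation that still leaves `reserve_bytes` for the OS.

    Hypothetical helper: a launcher-side check, not a kernel or driver limit.
    """
    if available_bytes < 0 or reserve_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    # Never return a negative budget when less than the reserve is available.
    return max(0, available_bytes - reserve_bytes)


# The example from the post: 115 GB free, keep 1 GB back for the system.
budget = gpu_budget_bytes(115 * GIB)
print(budget // GIB)  # 114
```

A launcher could refuse to start, or reduce the batch size, when the job's estimated footprint exceeds this budget.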