DGX Spark becomes unresponsive (“zombie”) instead of throwing CUDA OOM

During training on a DGX Spark, whenever the model encounters a sequence that exceeds available GPU memory, the process does not raise a normal CUDA out-of-memory error and crash. Instead, the entire machine becomes unresponsive.

SSH hangs, the node stops reacting, but external monitors (e.g., W&B) still show the process “alive” with no progress. Only a physical reboot recovers it.

Expected behavior

GPU OOM → process throws RuntimeError: CUDA out of memory → training crashes cleanly → machine stays accessible.
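
For reference, a minimal sketch of the failure mode I expected (assuming a recent PyTorch, where the allocator raises torch.cuda.OutOfMemoryError, a RuntimeError subclass):

# Keep allocating ~1 GiB blocks until the allocator refuses, then catch the
# exception instead of taking down the node.
import torch

blocks = []
try:
    while True:
        blocks.append(torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda"))
except torch.cuda.OutOfMemoryError as err:
    print(f"caught CUDA OOM after ~{len(blocks)} GiB: {err}")

# On the Spark this point is never reached; the node freezes first.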

Actual behavior

GPU OOM → machine locks up (“zombie”) → SSH dead, no logs flushed, node requires hard reboot.

Steps taken so far

Adjusted systemd-oomd to avoid killing ssh and to enforce early memory pressure handling:

# Limit memory pressure (systemd-oomd reads the [OOM] section)
sudo mkdir -p /etc/systemd/oomd.conf.d
printf "[OOM]\nDefaultMemoryPressureLimit=60%%\n" | sudo tee /etc/systemd/oomd.conf.d/override.conf

# Protect ssh
sudo mkdir -p /etc/systemd/system/ssh.service.d
printf "[Service]\nOOMScoreAdjust=-1000\n" | sudo tee /etc/systemd/system/ssh.service.d/override.conf

sudo systemctl daemon-reload
sudo systemctl restart ssh.service

After applying this, another training run again put the machine into a zombie state.

Reproduction

Happy to share the code if necessary.

What I’m looking for

  • How to ensure GPU OOM throws a regular CUDA OOM exception
  • How to prevent the entire node from locking up
  • Any DGX-specific kernel/driver/sysctl/systemd settings that prevent GPU OOM from cascading into system-level stalls

There is another post discussing this problem: System crashes when memory is full - #15 by RazielAU

Try to disable swap (via sudo swapoff -a for example) and see if the problem persists.

Thanks! It might be the same issue. I turned swap off and verified it is off, but a few hours later I managed to kill the device again (this time with lots of CPU tasks - ffmpeg, not GPU, not even hardware acceleration).

I checked both machines that crashed; after a restart there are no logs in /var/crash/ (only an empty lock file).

Sounds more like a case for the NVIDIA SWAT team then. They will ask for the output of their bug reporting tool.

See *** If you have a problem PLEASE read this first ***

Could you share the steps for reproduction?

Attached the nvidia debug log file.
nvidia-bug-report.log.gz (601.9 KB)

Full reproduction (this crashed the machine twice and I am afraid to run it again):

  # Clone the repository
  git clone https://github.com/sign/word-sense-disambiguation.git
  cd word-sense-disambiguation

  # Install dependencies
  pip install ".[dev]"

  # Extract the training data (should create training/data/generated directory)
  tar -xJf training/data/generated.tar.xz -C training/data/

  # Train the model (batch size 128 crashes after maybe 15 minutes)
  python -m training.train \
    --model "answerdotai/ModernBERT-Large-Instruct" \
    --data-dir training/data/generated \
    --output-dir training/output2 \
    --batch-size 128 \
    --learning-rate 3e-5 \
    --num-epochs 1 \
    --seed 42

Hi @amit59, I am not able to access the repo link you posted here for the word-sense-disambiguation project. It seems to be a private repo.
Could you please give me read access to this repo? I am working on reproducing this issue.

My github username: knirmal-nvidia

Thanks @knirmal. I updated the instructions to point to a public repo. Please try again.

Hi @amit59, I am able to access the repo now, Thank you.
However, the training data (generated.tar.xz) is missing from the repo. Can you please share that too?

My apologies, I was on a branch 🤦🏻‍♂️
The file and all the other files are now on main: word-sense-disambiguation/training/data/generated.tar.xz at main · sign/word-sense-disambiguation · GitHub

Hi @amit59, when I run the model following the exact steps mentioned in the above comment, the execution falls back to CPU-only because the installed PyTorch build is CPU-only. The device becomes sluggish and doesn’t respond for 2–3 seconds because OOM is killing processes, but it does not crash.

I also see the following warning:

Compiling the model with torch.compile and using a torch.cpu device is not supported. Falling back to non-compiled mode

Do you see the same warning on your setup?

Could you please confirm:

  1. Which CUDA toolkit version and PyTorch version you are using?

  2. Are you using the NGC PyTorch Container (Docker-based approach)? If yes, can you share the exact steps you are following?

  3. Are there any additional steps or modifications you are using apart from those mentioned in the forum post?

Thanks for the response. The machine is currently not responding (waiting for someone to go physically restart it).

I did not encounter that error - I was using PyTorch with GPU support. The install shouldn’t have overridden the base torch in your environment, but if it did, you can run:

pip uninstall -y torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

I probably ran it myself. The model trains very quickly - about 50 minutes for the entire run.

The CUDA version is 13.0 for sure, but the driver version is likely 580.95.05 (we purchased two devices; this is on the other one).

I would imagine the only additional step would be to install the latest torch using the command above, and it would then train on the GPU.

I also managed to kill the machine reliably with Docker, even when specifying a memory limit of 100 GB and with nothing else running on the machine.

I am training an NVIDIA model, here are the instructions: reproduction/repositories/nvidia-cosmos/cosmos-predict1 at main · sign-language-processing/reproduction · GitHub

To crash the machine, just change batch-size to >6 in the Dockerfile.

From the bug report data this appears to be a driver allocation failure occurring under unified memory pressure, which would explain why the system becomes unresponsive instead of returning a normal CUDA out-of-memory error.

The kernel log repeatedly shows failures in the driver’s internal allocation path:

NV_ERR_NO_MEMORY
_memdescAllocInternal @ mem_desc.c:1359

This function allocates internal driver memory descriptors used for GPU objects. Because this allocation occurs below the CUDA runtime layer, the application may not receive a normal cudaErrorMemoryAllocation when this path fails.

Once those descriptor allocations fail, the log shows the GPU context allocation path failing as well — kgrctxAllocMainCtxBuffer at kernel_graphics_context.c:1387, cascading to kgrctxAllocCtxBuffers at kernel_graphics_object.c:214. After that point the nvidia-modeset kernel thread enters uninterruptible sleep (D-state) for more than 122 seconds waiting for a resource that cannot be satisfied. That blocked display thread matches the freeze described in the thread.

In simplified form:

system memory pressure
→ driver descriptor allocation fails (NV_ERR_NO_MEMORY)
→ GPU context creation fails
→ nvidia-modeset blocks in D-state (122+ seconds)
→ system becomes unresponsive

This pattern appears multiple times across separate boot cycles in the report.


Memory state

The memory snapshot in the report shows an interesting pattern:

MemTotal:       ~125 GB
MemFree:        ~1 GB
MemAvailable:   ~103 GB
Cached:         ~98 GB (102,386,816 kB)
Inactive(file): ~94 GB
Slab:           ~6.4 GB

Only about 1 GB is truly free, while nearly 100 GB is held in file cache.

Linux counts the page cache as reclaimable and therefore reports large MemAvailable, but reclaim still needs to occur before new allocations succeed. If reclaim latency becomes high during a burst of allocations, a driver allocation path may fail even though MemAvailable appears large.
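
A quick way to see that gap is to read the relevant /proc/meminfo fields directly (a small sketch; the kernel reports these values in kB):

# Compare truly free memory with memory that is only reclaimable page cache.
fields = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        fields[key] = int(value.strip().split()[0])  # kB

for key in ("MemTotal", "MemFree", "MemAvailable", "Cached", "Slab"):
    print(f"{key:>12}: {fields[key] / 1024 / 1024:6.1f} GB")

# A large MemAvailable next to a tiny MemFree means new allocations depend on
# reclaim keeping up; if reclaim stalls, an allocation can still fail.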


Unified memory architecture on GB10

One important difference on DGX Spark is that the GPU does not have a dedicated framebuffer. GPU allocations come from the same system memory pool used by the CPU.

On traditional discrete GPU systems memory allocation typically looks like this:

CPU processes → system RAM
GPU kernels   → VRAM
page cache    → system RAM

GPU memory pressure and Linux system memory pressure are largely independent.

NVIDIA describes this model in the CUDA Unified Memory documentation: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-introduction

On DGX Spark the Grace CPU and Blackwell GPU access a shared system memory pool through the NVLink-C2C coherent interconnect. The following consumers all draw from the same pool:

CPU processes
filesystem page cache
kernel slab allocations
GPU allocations (driver descriptors, contexts, user buffers)

Under heavy workloads these components can compete for the same physical memory resources.
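
One way to observe this directly is to allocate on the GPU and watch the host-side counter move (a rough sketch, assuming PyTorch with CUDA available; exact numbers will vary):

import torch

def mem_available_gb():
    # MemAvailable from /proc/meminfo, converted from kB to GB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024 / 1024
    return float("nan")

before = mem_available_gb()
x = torch.empty(1024 * 1024 * 1024, dtype=torch.float32, device="cuda")  # ~4 GiB
torch.cuda.synchronize()
after = mem_available_gb()
print(f"MemAvailable: {before:.1f} GB -> {after:.1f} GB")

# On a discrete GPU the host number barely moves; on GB10 the allocation is
# served from the shared pool, so MemAvailable should drop by roughly the same amount.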

NVIDIA documents the DGX Spark hardware architecture here: https://docs.nvidia.com/dgx/dgx-spark/hardware.html


Monitoring limitations

Traditional GPU monitoring tools do not expose unified memory usage on this platform. The report shows:

TotalDedicatedGPUMemory → Operation not supported
UsedDedicatedGPUMemory  → Operation not supported
FB Memory Total/Used/Free → N/A

This is expected on UMA platforms where the GPU does not expose a discrete framebuffer.

The PCIe link information reported by nvidia-smi (Gen1 x1) is also expected on GB10 systems, since the GPU communicates with the Grace CPU through the NVLink-C2C interconnect rather than a conventional PCIe link.


System configuration factors

A few aspects of the environment may be relevant when investigating memory pressure on this platform.

Swap

The report indicates swap was enabled earlier and later disabled, so that step appears to have been taken. On systems where the GPU shares system memory, many users prefer to keep swap disabled to simplify reclaim behavior. The current state can be verified with:

cat /proc/swaps
grep swap /etc/fstab

Docker container limits

The thread also mentions that the crash occurs even when the workload runs inside a Docker container with a memory limit such as --memory=100g. Docker memory limits apply to container processes through Linux cgroups: https://docs.docker.com/engine/containers/resource_constraints/

Since the failure still occurs under those conditions, the memory pressure involved here may not be fully contained by the container limit.
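
One way to check what limit actually applies to the training process is to read the cgroup v2 counters from inside the container (a small sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup):

# Show which cgroup this process belongs to and what that cgroup reports.
from pathlib import Path

cg_path = Path("/proc/self/cgroup").read_text().splitlines()[0].split("::", 1)[1]
base = Path("/sys/fs/cgroup") / cg_path.lstrip("/")

print("cgroup:         ", cg_path)
print("memory.max:     ", (base / "memory.max").read_text().strip())
print("memory.current: ", (base / "memory.current").read_text().strip())

# "max" means no limit is enforced for this cgroup, and on unified memory the
# GPU allocations may barely register in memory.current even when a limit exists.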

Page cache reclaim

Some users report that freeing page cache before launching very large workloads can reduce memory pressure:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

This simply forces cached pages back into the free pool and should be considered a temporary workaround, not a long-term solution.

Related discussion: https://forums.developer.nvidia.com/t/how-to-automatically-free-shared-system-memory/363178


Observability

Diagnosing unified-memory pressure can be difficult because standard tools cannot directly show:

  • GPU residency in shared system memory

  • unified memory migration pressure

  • driver allocation pressure

To explore those aspects I have been experimenting with unified-memory diagnostics aimed at making that behavior more visible:

https://github.com/parallelArchitect/cuda-unified-memory-analyzer

The goal is simply to provide additional visibility into unified memory pressure so issues like this can be investigated earlier.


Summary

The failure pattern in the report does not look like a typical CUDA runtime OOM. Instead it appears that a driver-level allocation path fails while the system is under unified-memory pressure, after which the display stack becomes blocked waiting on that driver operation.

Given the architecture involved (shared system memory, NVLink-C2C interconnect, ATS addressing), further investigation may be needed to determine how memory reclaim interacts with the driver allocation path on this platform.

If additional diagnostics or traces would help narrow this down further, I would be interested to see them.


Update: systemd-run memory jail does NOT protect Docker containers on unified memory

We’ve been investigating this further on our DGX Spark (128GB unified memory, NVIDIA GB10). We tried the systemd-run cgroup approach recommended in various guides:

sudo systemd-run --scope -p MemoryMax=100G -p MemorySwapMax=0 docker run --gpus all ...

This does not work. We confirmed by inspecting cgroups from inside the container:

  • systemd-run creates a scope with memory.max = 107374182400 (100GB) ✓
  • But Docker creates its own cgroup for the container (0::/), with memory.max = max (unlimited)
  • The training process runs in Docker’s cgroup, not the systemd scope
  • Even within any cgroup, GPU allocations on unified memory are not tracked — we allocated 10.7GB on GPU and memory.current only increased by 377MB

So cgroups are doubly broken for this use case:

  1. Docker escapes the systemd scope
  2. GPU memory bypasses cgroup accounting entirely on unified memory

Memory profiling during training

(60M param model, SEQ=65536, MBS=1)

We monitored system memory every 5 seconds during a 10-step run:

Phase                        mem_used    buf/cache   free      What happened
Baseline (idle)              4.9 GB      6.4 GB      112 GB    OS + desktop
Docker + Python loading      9.2 GB      6.5 GB      108 GB    Importing torch/megatron
GPU allocated                78.0 GB     6.7 GB      39 GB     Model + optimizer on GPU (~71 GB)
Step 1 (forward/backward)    94.6 GB     23.1 GB     22 GB     Page cache jumps +16 GB from mmap data reads
Steady state (steps 2–10)    103.7 GB    23.2 GB     13 GB     Stable, only 13 GB free
Checkpoint save (peak)       106.1 GB    24.5 GB     11 GB     Serialization buffers + cache

The system is at 106GB / 122GB after just 10 steps with only 11GB headroom.

Over longer runs (~39 steps), the page cache keeps growing as new regions of the training data are read sequentially, eventually exhausting the remaining memory and crashing the machine.

The core issue matches @parallelArchitect’s analysis:
the kernel doesn’t know that ~71GB of “used” memory is GPU allocations, so it doesn’t feel memory pressure and keeps caching aggressively.

  • nvidia-smi reports [N/A] for memory on unified memory
  • No visibility from userspace either

Your 10.7GB vs 377MB measurement confirms the accounting gap directly — GPU allocations on UMA are not reflected in cgroup memory.current.

Worth instrumenting /proc/pressure/memory PSI stall metrics in your next run. The some and full stall percentages may show reclaim pressure rising before free memory drops — providing an earlier signal than MemAvailable, especially in high-cache conditions where MemAvailable still looks healthy despite pressure building.
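
A minimal sketch of that kind of polling (the /proc/pressure/memory format is the kernel's PSI interface; the interval and threshold here are arbitrary):

import time

def psi_memory():
    # Return the avg10 stall percentages from the "some" and "full" lines.
    vals = {}
    with open("/proc/pressure/memory") as f:
        for line in f:
            kind, rest = line.split(" ", 1)
            fields = dict(item.split("=") for item in rest.split())
            vals[kind] = float(fields["avg10"])
    return vals

while True:
    v = psi_memory()
    print(f"some avg10={v['some']:.2f}%  full avg10={v['full']:.2f}%")
    if v["full"] > 10.0:  # sustained full stalls mean reclaim is struggling
        print("memory pressure is high well before MemFree hits zero")
    time.sleep(5)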

One thing worth checking before the next run — if the Spark has been powered off but left plugged in for an extended period it may be in a degraded PD clock state. hoesing documented this pattern here: GPU PD Throttle Check Tool

At 513 MHz instead of 2100 MHz the same workload takes 3x longer, so the page cache has more time to grow before the run completes, leaving less headroom before the OOM cliff. Confirming the power state before memory diagnostics means the measurements reflect actual memory pressure rather than a throttled baseline: https://github.com/parallelArchitect/spark-gpu-throttle-check

If you want a bandwidth baseline before the run, this PTX-based probe targets unified memory platforms including GB10. Pascal is validated; GB10 data is pending. It builds on aarch64 with no CUPTI dependency: https://github.com/parallelArchitect/nvidia-uma-fault-probe

Running uma_bw prior to training and sharing uma_bw_results.json would help establish GB10 as a validated data point for the community.