DGX Spark OS crash on llama4 launch

villano · March 6, 2026, 1:51pm

Tried to run llama4 under Ollama (installed, not containerized). The whole system hangs and needs a reset. 128 GB should be sufficient, anyway… But the OS crash is terrible.

Deterministic behavior, tried several times. Any idea about the problem?

aniculescu · March 9, 2026, 9:52pm

After the system crashes and you reboot, you can run journalctl -k -b -1 -e to view the kernel log from the previous boot and see what crashed the system.

villano · March 10, 2026, 11:33am

Done. As was apparent, it is an out of memory error. To be corrected ASAP. An OS that freezes reminds me of Windows 95, and the workstation becomes a toy. The log is attached. I am available for further tests/investigations.

Cheers

log.txt (98.8 KB)

josephbreda · March 10, 2026, 1:25pm

No argument with your general observation, but some tips:

You should have no problem running Llama 4 Scout with quantization. You obviously can’t run Llama 4 Maverick.

Suggest either llama.cpp or vllm over ollama. You also probably need to specify a context size since Llama 4 Scout has a maximum 10M token context length, which is definitely not going to fit.

parallelArchitect · March 14, 2026, 3:33am

The log indicates the failure occurs inside the NVIDIA kernel driver during memory allocation.

From journalctl -k -b -1:

NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

This message comes from the NVIDIA kernel module while allocating an internal memory descriptor. When this fails, the error occurs below the CUDA runtime, so there is no Python-level exception path. In that situation the system can become unresponsive rather than returning a normal CUDA OOM.
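To check a saved kernel log for this kind of failure, something like the following could work. It is only a sketch: the pattern is taken from the single NVRM line quoted above, other driver versions may word the message differently, and the helper names (find_nvrm_oom, previous_boot_kernel_log) are made up for illustration.

```python
import re
import subprocess

# Matches the NVRM out-of-memory line quoted above.
# Assumption: other driver versions may phrase this message differently.
NVRM_OOM = re.compile(r"NVRM:.*NV_ERR_NO_MEMORY")


def find_nvrm_oom(log_text: str) -> list[str]:
    """Return the log lines that report an NVRM out-of-memory failure."""
    return [line.strip() for line in log_text.splitlines() if NVRM_OOM.search(line)]


def previous_boot_kernel_log() -> str:
    """Fetch kernel messages from the previous boot (same journalctl flags as above)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

After a reboot, find_nvrm_oom(previous_boot_kernel_log()) returns the matching lines, or an empty list if the previous boot logged none.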

Why large-context models can trigger this on GB10 systems

Llama-4 Scout supports extremely large context lengths (up to ~10M tokens). If the runtime attempts to allocate KV cache based on the full context size, the allocation request can become very large.
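A back-of-the-envelope estimate shows how quickly this grows. The sketch below uses the usual KV cache size formula (keys plus values, per layer, per KV head). The layer count, KV head count and head dimension are assumed, illustrative values, not numbers confirmed in this thread, so check the model config before relying on them. Real runtimes may also quantize the cache or use attention variants that change the total.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# ASSUMED illustrative model shape (not taken from the thread).
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128

if __name__ == "__main__":
    for ctx in (8192, 131072, 10_000_000):
        gib = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30
        print(f"{ctx:>10} tokens -> {gib:,.1f} GiB of KV cache (fp16)")
```

Under these assumptions an 8192-token context needs about 1.5 GiB of cache, while a 10M-token context would need well over a terabyte, far beyond a 128 GB unified pool.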

On unified-memory systems such as GB10 / DGX Spark, GPU memory and system memory are part of a shared pool. Large allocations therefore compete with normal OS usage, and the driver can run out of allocatable memory before user-space tools report exhaustion.

One detail about unified-memory accounting

On unified-memory platforms, cudaMemGetInfo() does not reflect all memory currently used by the Linux kernel (for example buffer cache). That means the memory available to the driver may be lower than what application-level memory queries report.

Reclaiming page cache before launching very large workloads can sometimes increase available headroom:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Additional diagnostic step

If you want to confirm whether the failure is caused by unified memory exhaustion rather than a model bug, monitor system memory during launch:

watch -n1 free -h

On unified-memory systems like GB10, GPU allocations draw from the same physical memory pool as the OS. If available system memory rapidly approaches zero during model initialization, it indicates the driver is exhausting the unified pool rather than encountering a normal CUDA runtime OOM.
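If you prefer a scriptable check over watching free by eye, the same information can be read from /proc/meminfo. This is a minimal sketch. The function names and the 8 GiB warning floor are arbitrary choices for illustration, not a recommended threshold.

```python
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: value}; sizes are in KiB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text: str) -> float:
    """MemAvailable from /proc/meminfo, converted from KiB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / 2**20


def watch_available(interval_s: float = 1.0, floor_gib: float = 8.0) -> None:
    """Print available memory every interval and flag when it drops below the floor."""
    while True:
        with open("/proc/meminfo") as f:
            gib = available_gib(f.read())
        flag = "  <-- below floor" if gib < floor_gib else ""
        print(f"MemAvailable: {gib:6.1f} GiB{flag}")
        time.sleep(interval_s)
```

Run watch_available() in a second terminal (or over SSH) before starting the model. The last values printed before the hang show how close the pool was to exhaustion.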

Practical mitigation

Limit the model context size explicitly. For example:

--ctx-size 8192

or another value appropriate for the workload and system memory capacity.
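One rough way to pick that value is to decide how much memory the KV cache may use and work backwards. The sketch below is only an illustration. The per-token cost (196608 bytes) is an assumed figure for an fp16 cache and should be replaced with a number derived from the actual model config. Rounding down to a power of two is a convenience, not something any runtime requires.

```python
def max_context_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """Largest power-of-two context whose KV cache fits in budget_bytes."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    tokens = budget_bytes // bytes_per_token
    if tokens < 1:
        return 0
    # Round down to the nearest power of two (e.g. 10922 -> 8192).
    return 1 << (tokens.bit_length() - 1)


# ASSUMED per-token KV cache cost in bytes; derive the real value from the model config.
ASSUMED_BYTES_PER_TOKEN = 196608

if __name__ == "__main__":
    budget = 2 * 2**30  # hypothetical 2 GiB set aside for the KV cache
    print(max_context_tokens(budget, ASSUMED_BYTES_PER_TOKEN))
```

The budget should leave room for the model weights, the desktop session and the page cache, since all of them come out of the same unified pool.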

eugr · March 14, 2026, 6:39am

Ollama is not a good choice for DGX Spark. Use either llama.cpp or vLLM.

For vLLM you can use our community Docker setup, eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks).

Make sure not to exceed the available RAM. Since the Spark has unified RAM that is shared with the GPU, you need to manage your memory resources carefully if you are using your Spark as a workstation.

And finally, unless you have a specific reason to use Llama 4 (I guess it’s the Scout one?), there are newer and better models available. Qwen3.5 and Nemotron Super are the latest ones and they run well on Spark. Recipes for both Qwen3.5 and Nemotron Super are included in the community Docker.

There is also https://sparkrun.dev/ that unifies different runtimes (llama.cpp, our community vLLM docker, SGLang and soon TRT-LLM) and provides easy access to the recipes.

You can also check out https://spark-arena.com/ for the latest model benchmarks and recipes.

villano · March 14, 2026, 7:02am

Thank you everyone for the suggestions. But I have no absolute necessity to run llama4 on the Spark. I was just trying out a new workstation. The issue is a different one. It is absurd that in 2026 a user-mode request (the launch of llama4, or whatever else) crashes the whole operating system. The driver managing GPU memory overwrites OS data (buffer cache or whatever else)? Well, it is buggy and MUST be fixed soon. There can be multiple ways to mitigate the problem, but they are not a solution from the OS perspective. The memory is unified? Well, such problems do not arise in OS-X on unified-memory Macs.

Once again, thank you all for your help.
