Tried to run llama4 under Ollama (installed, not containerized). The whole system hangs and needs a reset. 128 GB should be sufficient, anyway… But the OS crash is terrible.
Deterministic behavior, tried several times. Any idea about the problem?
After your system crashes, you can run journalctl -k -b -1 -e to view the kernel log from the previous boot and see what caused the crash.
Done. As was apparent, it is an out of memory error. To be corrected ASAP. An OS that freezes reminds me of Windows 95, and the workstation becomes a toy. The log is attached. I am available for further tests/investigations.
Cheers
log.txt (98.8 KB)
No argument with your general observation, but some tips:
You should have no problem running Llama 4 Scout with quantization. You obviously can’t run Llama 4 Maverick.
I suggest either llama.cpp or vLLM over Ollama. You also probably need to specify a context size, since Llama 4 Scout has a maximum context length of 10M tokens, which is definitely not going to fit.
The log indicates the failure occurs inside the NVIDIA kernel driver during memory allocation.
From journalctl -k -b -1:
NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
This message comes from the NVIDIA kernel module while allocating an internal memory descriptor. When this fails, the error occurs below the CUDA runtime, so there is no Python-level exception path. In that situation the system can become unresponsive rather than returning a normal CUDA OOM.
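To pull just the driver's out-of-memory lines out of a long kernel log such as the attached one, a small filter can help. This is only a minimal sketch: the function name find_nvrm_oom is made up for illustration, and it assumes the log was saved to a text file and that the relevant lines look like the ones quoted above.

```python
import sys


def find_nvrm_oom(lines):
    """Return the log lines that mention an NVRM out-of-memory failure."""
    hits = []
    for line in lines:
        # Match on the two markers seen in the quoted log line.
        if "NVRM" in line and "NV_ERR_NO_MEMORY" in line:
            hits.append(line.rstrip("\n"))
    return hits


# Optional command-line use, e.g. with a saved kernel log as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        for hit in find_nvrm_oom(fh):
            print(hit)
```

If this prints nothing for a crashed boot, the hang probably has a different cause than the allocation failure described here.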
Llama-4 Scout supports extremely large context lengths (up to ~10M tokens). If the runtime attempts to allocate KV cache based on the full context size, the allocation request can become very large.
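To get a feel for how large that request can become, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions, not what any particular runtime actually allocates: the layer count, KV head count and head dimension below are assumed values for a Scout-like model (check the model's own config), the cache is assumed to be 16-bit, and the formula ignores savings from chunked or local attention layers and from KV cache quantization, so treat it as an upper bound.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Upper-bound KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# Assumed Llama 4 Scout-like shape; verify against the model config before relying on it.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (8192, 131072, 10_000_000):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"ctx={ctx:>10,}  KV cache ~ {gib:,.1f} GiB")
```

Under these assumptions the cache is about 1.5 GiB at 8192 tokens but on the order of 1.8 TiB at the full 10M tokens, which is far beyond 128 GB before the model weights are even counted.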
On unified-memory systems such as GB10 / DGX Spark, GPU memory and system memory are part of a shared pool. Large allocations therefore compete with normal OS usage, and the driver can run out of allocatable memory before user-space tools report exhaustion.
On unified-memory platforms, cudaMemGetInfo() does not reflect all memory currently used by the Linux kernel (for example buffer cache). That means the memory available to the driver may be lower than what application-level memory queries report.
Reclaiming page cache before launching very large workloads can sometimes increase available headroom:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
If you want to confirm whether the failure is caused by unified memory exhaustion rather than a model bug, monitor system memory during launch:
watch -n1 free -h
On unified-memory systems like GB10, GPU allocations draw from the same physical memory pool as the OS. If available system memory rapidly approaches zero during model initialization, it indicates the driver is exhausting the unified pool rather than encountering a normal CUDA runtime OOM.
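If free -h is too coarse, the same check can be scripted by reading /proc/meminfo directly. The following is a minimal sketch: the function names and the 8 GiB warning threshold are arbitrary choices for illustration, and MemAvailable is the kernel's own estimate, so it is a hint rather than an exact figure for what the driver can still allocate.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(text):
    """MemAvailable is reported in kB (really KiB); convert to GiB."""
    return parse_meminfo(text)["MemAvailable"] / 2**20


def watch(interval_s=1.0, warn_below_gib=8.0):
    """Print MemAvailable once per interval until interrupted with Ctrl-C."""
    while True:
        with open("/proc/meminfo") as fh:
            gib = available_gib(fh.read())
        flag = "  <-- low" if gib < warn_below_gib else ""
        print(f"MemAvailable: {gib:6.1f} GiB{flag}", flush=True)
        time.sleep(interval_s)


# Call watch() in a second terminal while the model is loading.
```

A steady drop toward zero during model initialization, followed by the hang, points at exhaustion of the shared pool.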
Limit the model context size explicitly. For example:
--ctx-size 8192
or another value appropriate for the workload and system memory capacity.
Ollama is not a good choice for DGX Spark. Use either llama.cpp or vLLM.
For vLLM you can use our community Docker setup, eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks).
Make sure not to exceed the available RAM size. Since Spark has unified RAM, it is shared with the GPU, so if you are using your Spark as a workstation, you need to manage your memory resources carefully.
And finally, unless you have a specific reason to use Llama 4 (I guess it’s the Scout one?), there are newer and better models available. Qwen3.5 and Nemotron Super are the latest ones and they run well on Spark. Recipes for both Qwen3.5 and Nemotron Super are included in the community Docker.
There is also https://sparkrun.dev/ that unifies different runtimes (llama.cpp, our community vLLM Docker, SGLang and soon TRT-LLM) and provides easy access to the recipes.
You can also check out https://spark-arena.com/ for the latest model benchmarks and recipes.
Thank you everyone for the suggestions. But I have no absolute necessity to run llama4 on the Spark. I was just trying out a new workstation. The issue is a different one. It is absurd that in 2026 a user-mode request (the launch of llama4, or whatever else) crashes the whole operating system. Does the driver managing GPU memory overwrite OS data (buffer cache or whatever else)? Well, then it is buggy and MUST be fixed soon. There can be multiple ways to mitigate the problem, but they are not a solution from the OS perspective. The memory is unified? Well, such problems do not arise in OS-X on unified-memory Macs.
Once again, thank you all for your help.