I don’t have any formal training in AI, and many technical discussions I see online are way over my head, but I bought a 16 GB GPU for my computer and have been tinkering with LLMs for a long time now. A couple of days ago I got my DGX Spark, and I am trying to move from Ollama (which I have used for a while and am familiar with) to llama.cpp.
I’m hoping some folks can suggest improvements to either the compilation or the command-line options for llama-server.
I am having problems with crashes due to memory errors, or poor performance, when I attempt to run a larger model (test results below).
Maybe the model I chose to test is unsuited for the DGX Spark (I am just learning about dense vs. MoE models).
Thanks!!
llama.cpp info
The repo I cloned has tag b7653 (dated Jan 7, 2026). Commands used to compile:
I followed the Nvidia Spark playbook (with some changes to the compilation options) and can run Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf with good performance (48 t/s).
Trying to run the larger Devstral model, however, either crashes llama-server when it tries to load it into memory, or the performance is painfully slow.
Test results
I ran tests with the Devstral model 3 times.
| Test # | Performance | Notes |
|---|---|---|
| 1 | CRASH | System memory topped out at ~110 GB before crashing, then dropped to ~3 GB |
| 2 | 1.2 t/s | htop and the Dashboard both show ~53–54 GB of memory in use. On the Dashboard, GPU utilization swings between 0 and 100% every few seconds |
| 3 | 1.97 t/s | After loading the model but before prompting, RAM usage increased to ~107 GB. After prompting, RAM remained at ~107 GB and GPU utilization was stable throughout at 96% |
Command line arguments used:
Tests 1 and 2
# TEST #1 (CRASH):
-fa on \
--ubatch-size 1024 \
--threads-batch 20 \
--ctx-size 16384 \
--parallel 1 \
--cont-batching \
--cache-ram 8192
# --n-gpu-layers -1   (not set explicitly; the documentation says -1 is the default)
# TEST #2
# Same as above, but I changed:
--n-gpu-layers 40
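For reference, the Test #1 flags above assembled into a complete invocation might look like the following sketch (the binary path and model file are my placeholders, not from the original post):

```shell
# Hypothetical full command line for Test #1; adjust paths to your setup.
./build/bin/llama-server \
  -m /path/to/Devstral-2-123B-Instruct-Q6_K.gguf \
  -fa on \
  --ubatch-size 1024 \
  --threads-batch 20 \
  --ctx-size 16384 \
  --parallel 1 \
  --cont-batching \
  --cache-ram 8192
```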
I spent some time chatting with another LLM to analyze my logs from Test #1. This is a summary of what it said:
Summary
Based on your log and system specifications, the error you’re encountering is a CUDA memory copy failure during model loading. Let’s analyze what’s happening and provide targeted solutions.
Root Cause Analysis
Your log shows:
- Model: 123B-parameter model (125.03B) in Q6_K quantization (95.52 GB)
- Memory allocation:
  - Model buffer: 96,555.80 MiB (~96.5 GB) on CUDA0
  - KV cache: 5,632.00 MiB (~5.6 GB)
  - Compute buffer: 720.02 MiB (~0.7 GB)
- Total GPU memory needed: ~102.8 GB
- Available memory: ~118 GB free initially
- Error: `CUDA error: unspecified launch failure` during `cudaMemcpyAsync` in `ggml_backend_cuda_buffer_set_tensor`
Despite sufficient total memory (128 GB), the “unspecified launch failure” during cudaMemcpyAsync suggests:
1. Memory fragmentation: even with 25+ GB free, the GPU memory might be fragmented, preventing allocation of the large contiguous blocks needed for tensor transfers.
2. Q6_K on Blackwell: while Q6_K should work, Blackwell’s hardware-accelerated FP4 formats (NVFP4, MXFP4) are optimal; Q6_K may have suboptimal memory access patterns.
96 GB of weights won’t leave a lot of space for the KV cache (context); generally you don’t want to go beyond 115 GB of VRAM usage on Spark. You may want to use a lower quant.
The only parameters you really need are -fa on, --no-mmap, and -ngl 999, plus context size. Drop the others, especially --cache-ram, which doesn’t make sense on Spark since RAM is unified.
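Assembled into a full command, that minimal-flags advice might look like the following sketch (the binary path, model file, and context size are placeholders for your setup):

```shell
# Minimal llama-server invocation per the advice above; paths are placeholders.
./build/bin/llama-server \
  -fa on \
  --no-mmap \
  -ngl 999 \
  --ctx-size 16384 \
  -m /path/to/model.gguf
```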
Devstral is a dense model. Spark’s memory bandwidth is below 273 GB/s, so the theoretical maximum you can get from even a 4-bit quant is ~4 t/s at zero context. In reality it will be even slower.
I also suggest -fit off, since the recent autofitting logic is not reliable in my experience, and can cause issues. --jinja usually seems recommended too.
In short, it worked for me. I no longer had to pass all those parameters like I did a month ago. The only thing I did differently from you was to build from the latest source (b7681 as of today) and NOT pass -DGGML_NATIVE=OFF.
# You should see the following different from
# when you pass -DGGML_NATIVE=OFF.
# -- Using CMAKE_CUDA_ARCHITECTURES=121a-real CMAKE_CUDA_ARCHITECTURES_NATIVE=121a-real
cmake .. -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
# Whether I pass all these other parameters or not made no difference, as far as I can see.
# ./llama-cli -fa on --no-mmap -ngl 999 -hf bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF
# The plain, simple one worked with no error.
./build/bin/llama-cli -hf bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF
However, the generated text for this model became incoherent, as you can see. But this might be a model issue.
Qwen3-4B-GGUF and Qwen3-32B-GGUF worked perfectly, using <16 GB and <32 GB of memory and yielding 66 t/s and 8 t/s respectively. This is definitely improved from the last time I ran it.
./llama-cli -hf Qwen/Qwen3-4B-GGUF
build : b7681-8ece3836b
model : Qwen/Qwen3-4B-GGUF
modalities : text
...
> Generate a poem about Dog, Volcano, and motorcycle.
[Start thinking]
Okay, the user wants a poem about Dog, Volcano, and motorcycle. Let me start by thinking about how these three
... (CUT)
me try to refine each stanza step by step.
[End thinking]
**"Ash and Steel"**
Beneath the red and smoldering sky,
... (CUT)
Where the wild and the loyal found their way.
[ Prompt: 1091.2 t/s | Generation: 66.2 t/s ]
Thanks to all for the very helpful replies. Figuring out the basic configurations like this is pretty much over my head and this info really helps me move past that so I can work on actually creating and experimenting in domains where I’m more familiar. Cheers!
Can you clarify what you mean by the above? Are you saying that when you compiled the b7681 release you used only -DGGML_CUDA=ON as a build flag, and that still gave you results as good as adding in a bunch of other flags?
And below, are you saying that with your simplified compilation flags you were able to run the Devstral-2-123B model (which quant size?) with only the -fa on flag, and it did not crash your llama.cpp?
Thanks for the link to the blog post. Good stuff in there; I would have missed it otherwise!
Yes, and yes, with fine print: I didn’t even have to use “-fa on”. Those three lines of commands are the exact commands I used (except the “git pull” to sync to the master branch), and it ran without the errors you were getting and generated text. But the bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF model’s text generation became incoherent after a few words, as my screenshot shows, while the Qwen3-* models generated a full poem. So I wouldn’t consider this a 100% success.