I don’t have any formal training in AI, and many technical discussions I see online are way over my head, but I bought a 16 GB GPU for my computer and have been tinkering with LLMs for a long time now. A couple of days ago I got my DGX Spark, and I am trying to move from Ollama (which I have used for a while and am familiar with) to llama.cpp.
I’m hoping some folks can suggest improvements to either the compilation or the command-line options for llama-server.
I am having problems with crashes due to memory errors, or poor performance, when I attempt to run a larger model (test results below).
Maybe the model I chose to test is unsuited for the DGX Spark (I am just learning about dense vs. MoE models).
Thanks!!
llama.cpp info
The repo I cloned has tag b7653 (dated Jan 7, 2026). Commands used to compile:
I followed the Nvidia Spark playbook (with some changes to the compilation options) and can run Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf with good performance (48 t/s).
Trying to run the larger Devstral model, however, either crashes llama-server when it tries to load it into memory, or the performance is painfully slow.
Test results
I ran tests with the Devstral model 3 times.
| Test # | Performance | Notes |
|---|---|---|
| 1 | CRASH | System memory topped out at ~110 GB before crashing, then dropped to ~3 GB |
| 2 | 1.2 t/s | htop and the Dashboard both show ~53–54 GB of memory in use. On the Dashboard, GPU utilization swings between 0 and 100% every few seconds |
| 3 | 1.97 t/s | After loading the model but before prompting, RAM usage increased to ~107 GB. After prompting, RAM remained at ~107 GB and GPU utilization was stable throughout at 96% |
Command line arguments used:
Tests 1 and 2
# TEST #1 (CRASH):
-fa on \
--ubatch-size 1024 \
--threads-batch 20 \
--ctx-size 16384 \
--parallel 1 \
--cont-batching \
--cache-ram 8192
# --n-gpu-layers -1   (not set explicitly; the documentation says -1 is the default)
# TEST #2
# Same as above, but I changed:
--n-gpu-layers 40
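For reference, the Test #1 flags above assembled into a complete invocation might look like the following sketch (the binary path and model file are my placeholders, not from the original post):

```shell
# Hypothetical full command line for Test #1; adjust paths to your setup.
./build/bin/llama-server \
  -m /path/to/Devstral-2-123B-Instruct-Q6_K.gguf \
  -fa on \
  --ubatch-size 1024 \
  --threads-batch 20 \
  --ctx-size 16384 \
  --parallel 1 \
  --cont-batching \
  --cache-ram 8192
```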
I spent some time chatting with another LLM to analyze my logs from Test #1. This is a summary of what it said:
Summary
Based on your log and system specifications, the error you’re encountering is a CUDA memory copy failure during model loading. Let’s analyze what’s happening and provide targeted solutions.
Root Cause Analysis
Your log shows:
- Model: 123B-parameter model (125.03B) in Q6_K quantization (95.52 GB)
- Memory allocation:
  - Model buffer: 96,555.80 MiB (~96.5 GB) on CUDA0
  - KV cache: 5,632.00 MiB (~5.6 GB)
  - Compute buffer: 720.02 MiB (~0.7 GB)
- Total GPU memory needed: ~102.8 GB
- Available memory: ~118 GB free initially
- Error: `CUDA error: unspecified launch failure` during `cudaMemcpyAsync` in `ggml_backend_cuda_buffer_set_tensor`
Despite sufficient total memory (128 GB), the “unspecified launch failure” during cudaMemcpyAsync suggests:
1. Memory fragmentation: even with 25+ GB free, the GPU memory might be fragmented, preventing allocation of the large contiguous blocks needed for tensor transfers.
2. Q6_K on Blackwell: while Q6_K should work, Blackwell’s hardware-accelerated FP4 formats (NVFP4, MXFP4) are optimal; Q6_K may have suboptimal memory access patterns.
96 GB of weights won’t leave a lot of space for the KV cache (context); generally you don’t want to go beyond 115 GB of VRAM usage on Spark. You may want to use a lower quant.
The only parameters you really need are -fa on, --no-mmap, and -ngl 999, plus context size. Drop the others, especially --cache-ram, which doesn’t make sense on Spark since RAM is unified.
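Assembled into a full command, that minimal-flags advice might look like the following sketch (the binary path, model file, and context size are placeholders for your setup):

```shell
# Minimal llama-server invocation per the advice above; paths are placeholders.
./build/bin/llama-server \
  -fa on \
  --no-mmap \
  -ngl 999 \
  --ctx-size 16384 \
  -m /path/to/model.gguf
```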
Devstral is a dense model. Spark’s memory bandwidth is below 273 GB/s, so the theoretical maximum you can get from even a 4-bit quant is ~4 t/s at zero context. In reality it will be even slower.
I also suggest -fit off, since the recent autofitting logic is not reliable in my experience, and can cause issues. --jinja usually seems recommended too.
In short, it worked for me. I no longer had to pass all those parameters like I did a month ago. The only thing I did differently from you was to build from the latest source (b7681 as of today) and NOT pass -DGGML_NATIVE=OFF.
# You should see the following different from
# when you pass -DGGML_NATIVE=OFF.
# -- Using CMAKE_CUDA_ARCHITECTURES=121a-real CMAKE_CUDA_ARCHITECTURES_NATIVE=121a-real
cmake .. -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
# Whether I pass all these other parameters or not made no difference, as far as I can see.
# ./llama-cli -fa on --no-mmap -ngl 999 -hf bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF
# The plain, simple one worked with no error.
./build/bin/llama-cli -hf bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF
However, the generated text for this model became incoherent, as you can see. But this might be a model issue.
Qwen3-4B-GGUF and Qwen3-32B-GGUF worked perfectly, using <16 GB and <32 GB of memory and yielding 66 t/s and 8 t/s respectively. This is definitely improved from the last time I ran it.
./llama-cli -hf Qwen/Qwen3-4B-GGUF
build : b7681-8ece3836b
model : Qwen/Qwen3-4B-GGUF
modalities : text
...
> Generate a poem about Dog, Volcano, and motorcycle.
[Start thinking]
Okay, the user wants a poem about Dog, Volcano, and motorcycle. Let me start by thinking about how these three
... (CUT)
me try to refine each stanza step by step.
[End thinking]
**"Ash and Steel"**
Beneath the red and smoldering sky,
... (CUT)
Where the wild and the loyal found their way.
[ Prompt: 1091.2 t/s | Generation: 66.2 t/s ]
Thanks to all for the very helpful replies. Figuring out the basic configurations like this is pretty much over my head and this info really helps me move past that so I can work on actually creating and experimenting in domains where I’m more familiar. Cheers!
Can you clarify what you mean by the above? Are you saying that when you compiled the b7681 release you used only -DGGML_CUDA=ON as a build flag, and that still gave you results as good as adding in a bunch of other flags?
And below, are you saying that with your simplified compilation flags you were able to run the Devstral-2-123B model (which quant size?) with only the -fa on flag, and it did not crash your llama.cpp?
Thanks for the link to the blog post. Good stuff in there; I would have missed it otherwise!
Yes, and yes, with fine print: I didn’t even have to use “-fa on”. Those three lines of commands are the exact commands I used (except the “git pull” to sync to the master branch), and it ran without the errors you were getting and generated text. But the bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF model’s text generation became incoherent after a few words, as my screenshot shows, while the Qwen3-* models generated a full poem. So I wouldn’t consider this a 100% success.